AN AUTOMATED NEW APPROACH IN FAST TEXT CLASSIFICATION: A CASE STUDY FOR KURDISH TEXT
DOI:
https://doi.org/10.25271/sjuoz.2024.12.3.1296
Keywords:
FastText, Deep Learning, Kurdish Text Classification, Machine Learning, Natural Language Processing
Abstract
With the rapid development of internet technology, text classification has become a vital part of retrieving information quickly and accurately. Traditional machine learning methods often suffer from high-dimensional feature spaces, which reduce their accuracy. In this paper, the FastText model is proposed as a classifier for Kurdish text for the first time, and its results are compared with those of traditional machine learning methods to show its effect on Kurdish text. The model is evaluated on four datasets: Kurdish News Dataset Headlines (KNDH), Medical Kurdish Dataset (MKD), Kurdish-Emotional-Dataset (KMD-77000), and KurdiSent. The results are compared with those of the traditional machine learning algorithms Random Forest (RF), k-Nearest Neighbor (k-NN), Logistic Regression (LR), Multinomial Naïve Bayes (MNB), Support Vector Machine (SVM), Decision Tree (DT), and Stochastic Gradient Descent (SGD), as well as with the deep learning model Bidirectional Encoder Representations from Transformers (BERT). On the KNDH dataset, the FastText model achieved the highest performance, with 89% for each of precision, recall, and F1-score, and 89.10% accuracy. On the KMD dataset, the FastText model outperformed all other models by approximately 2%. On KurdiSent, the comparative analysis showed that FastText is superior, with precision, recall, F1-score, and accuracy of 81.32%, 81.83%, 81.57%, and 81.4%, respectively. On MKD, the FastText model again obtained the highest performance, with a precision of 93.32%, recall of 93.36%, F1-score of 93.34%, and accuracy of 93.1%.
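The four metrics reported above can be illustrated with a minimal sketch. The abstract does not state which averaging scheme was used, so macro-averaging over classes is an assumption here, and the function names `macro_scores` and `harmonic_mean` and the toy labels are hypothetical, chosen only for illustration.

```python
from collections import Counter


def macro_scores(y_true, y_pred):
    """Return (accuracy, macro precision, macro recall, macro F1) as fractions.

    Macro-averaging is an assumption; the abstract does not name the scheme.
    """
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and equal in length")
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # class p was predicted, but the true class was t
            fn[t] += 1  # an example of class t was missed
    precisions, recalls, f1s = [], [], []
    for c in labels:
        prec = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        rec = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(harmonic_mean(prec, rec))
    n = len(labels)
    accuracy = sum(tp.values()) / len(y_true)
    return accuracy, sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


def harmonic_mean(p, r):
    """F1-score of a single precision/recall pair."""
    return 2 * p * r / (p + r) if p + r else 0.0


# Toy labels (hypothetical, not taken from any of the four datasets).
y_true = ["sport", "sport", "health", "health", "economy", "economy"]
y_pred = ["sport", "health", "health", "health", "economy", "sport"]
acc, prec, rec, f1 = macro_scores(y_true, y_pred)
print(f"accuracy={acc:.4f} precision={prec:.4f} recall={rec:.4f} f1={f1:.4f}")
```

As a consistency check, the harmonic mean of the reported precision and recall reproduces the reported F1-score for KurdiSent (81.32 and 81.83 give 81.57) and for MKD (93.32 and 93.36 give 93.34).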
License
Copyright (c) 2024 Ari M. Saeed
This work is licensed under a Creative Commons Attribution 4.0 International License.
Authors who publish with this journal agree to the following terms:
- Authors retain copyright and grant the journal right of first publication with the work simultaneously licensed under a Creative Commons Attribution License [CC BY-NC-SA 4.0] that allows others to share the work with an acknowledgment of the work's authorship and initial publication in this journal.
- Authors are able to enter into separate, additional contractual arrangements for the non-exclusive distribution of the journal's published version of the work, with an acknowledgment of its initial publication in this journal.
- Authors are permitted and encouraged to post their work online.