AN AUTOMATED NEW APPROACH IN FAST TEXT CLASSIFICATION: A CASE STUDY FOR KURDISH TEXT

Authors

  • Ari M. Saeed, Department of Computer Science, College of Science, University of Halabja, Halabja, Kurdistan Region, Iraq

DOI:

https://doi.org/10.25271/sjuoz.2024.12.3.1296

Keywords:

FastText, Deep Learning, Kurdish Text Classification, Machine Learning, Natural Language Processing

Abstract

With the rapid development of internet technology, text classification has become vital for obtaining accurate information quickly. Traditional machine learning methods often perform poorly and suffer from high-dimensional feature spaces, which reduces their accuracy. In this paper, the FastText model is proposed as a classifier for Kurdish text for the first time, and its results are compared with those of traditional machine learning methods to show its effect on Kurdish text. The model is evaluated on four datasets: Kurdish News Dataset Headlines (KNDH), Medical Kurdish Dataset (MKD), Kurdish-Emotional-Dataset (KMD-77000), and KurdiSent. Its results are compared with those of traditional machine learning algorithms, namely Random Forest (RF), k-Nearest Neighbor (k-NN), Logistic Regression (LR), Multinomial Naïve Bayes (MNB), Support Vector Machine (SVM), Decision Tree (DT), and Stochastic Gradient Descent (SGD), as well as the deep learning model Bidirectional Encoder Representations from Transformers (BERT). The results indicate that the FastText model achieved the highest performance on the KNDH dataset, with 89% for each of precision, recall, and F1-score, and 89.10% accuracy. On the KMD dataset, the FastText model outperforms all other models by approximately 2%. On KurdiSent, FastText is also superior, with precision, recall, F1-score, and accuracy of 81.32%, 81.83%, 81.57%, and 81.4%, respectively. On MKD, the FastText model obtained the highest performance, with a precision of 93.32%, recall of 93.36%, F1-score of 93.34%, and accuracy of 93.1%.
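The four metrics reported in the abstract can be illustrated with a minimal Python sketch. It is not the author's evaluation code: the abstract does not state the averaging scheme, so macro averaging is an assumption here, and the function names `to_fasttext_line` and `macro_scores` and the toy labels are hypothetical. The only FastText-specific detail is the `__label__` prefix, which is the default label marker in the fastText library's supervised training files.

```python
from collections import Counter


def to_fasttext_line(label, text):
    """Format one example as a fastText supervised training line.

    Assumption: the library's default "__label__" prefix is used; spaces in
    the label are replaced so the label stays a single token.
    """
    return f"__label__{label.replace(' ', '_')} {text.strip()}"


def macro_scores(y_true, y_pred):
    """Return (precision, recall, f1, accuracy), macro-averaged over labels.

    Assumption: every class counts equally (macro average); the paper may
    have used a different averaging scheme.
    """
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative

    precisions, recalls, f1s = [], [], []
    for label in labels:
        p_den = tp[label] + fp[label]
        r_den = tp[label] + fn[label]
        p = tp[label] / p_den if p_den else 0.0
        r = tp[label] / r_den if r_den else 0.0
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(f)

    n = len(labels)
    accuracy = sum(tp.values()) / len(y_true)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n, accuracy


if __name__ == "__main__":
    # Toy labels for illustration only; these are not results from the paper.
    y_true = ["sport", "sport", "health", "health"]
    y_pred = ["sport", "health", "health", "health"]
    print(to_fasttext_line("sport", "example headline"))
    print(macro_scores(y_true, y_pred))
```

Applying the same scoring function to every classifier's predictions on a held-out test split keeps a comparison like the one in the abstract on equal footing.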

Published

2024-07-30

How to Cite

Saeed, A. M. (2024). AN AUTOMATED NEW APPROACH IN FAST TEXT CLASSIFICATION: A CASE STUDY FOR KURDISH TEXT. Science Journal of University of Zakho, 12(3), 329–335. https://doi.org/10.25271/sjuoz.2024.12.3.1296

Section

Science Journal of University of Zakho