AN AUTOMATED NEW APPROACH IN FAST TEXT CLASSIFICATION: A CASE STUDY FOR KURDISH TEXT

Ari M. Saeed*

 Department of Computer Science, College of Science, University of Halabja, Halabja, Kurdistan Region, Iraq

ari.said@uoh.edu.iq

 

Received: 29 Mar., 2024 / Accepted: 25 June., 2024 / Published: 30 July., 2024.           https://doi.org/10.25271/sjuoz.2024.12.3.1296

ABSTRACT:  

With the rapid development of internet technology, text classification has become a vital part of obtaining quick and accurate data. Traditional machine learning methods often suffer from poor performance and high-dimensional feature spaces, which reduce their accuracy. In this paper, the FastText model is proposed as the first-ever classifier applied to Kurdish text, and its results are compared with traditional machine learning methods to show its effect on Kurdish text. Four datasets are used for evaluation: the Kurdish News Dataset Headlines (KNDH), the Medical Kurdish Dataset (MKD), the Kurdish Emotional Dataset (KMD-77000), and KurdiSent. The results are compared with traditional machine learning algorithms, namely Random Forest (RF), k-Nearest Neighbor (k-NN), Logistic Regression (LR), Multinomial Naïve Bayes (MNB), Support Vector Machine (SVM), Decision Tree (DT), and Stochastic Gradient Descent (SGD), as well as the deep learning model Bidirectional Encoder Representations from Transformers (BERT). The outcomes indicate that the FastText model achieved the highest performance on the KNDH dataset, with 89% precision, recall, and F1-score and 89.10% accuracy. Moreover, on the KMD dataset the FastText model outperforms all others by approximately 2%. The comparative analysis also showed that FastText is superior on KurdiSent, with precision, recall, F1-score, and accuracy of 81.32%, 81.83%, 81.57%, and 81.4%, respectively. Finally, on MKD, the FastText model obtained the highest performance, with a precision of 93.32%, recall of 93.36%, F1-score of 93.34%, and accuracy of 93.1%.

KEYWORDS: FastText, Deep Learning, Kurdish Text Classification, Machine Learning, Natural Language Processing


1. INTRODUCTION

        With the rapid development of technology, the internet has become a necessary part of human life. Various data types are used on the internet, with text being one of the most common forms. Text Classification (TC) is a crucial task in the fields of Natural Language Processing (NLP), Machine Learning (ML), and data mining. Moreover, TC has emerged as a method for analyzing, extracting, detecting, and retrieving user knowledge from large volumes of text. TC can be categorized into two groups: multi-label and multiclass classification.

       In multi-label classification, a text is assigned multiple target labels (such as sports and social), whereas in multiclass classification, a text is assigned exactly one target label. Applications of TC include sentiment analysis, news filtering, email sorting, product review categorization, and text message analysis (Umer et al., 2023). Various text classification algorithms, including Support Vector Machine (SVM), Naive Bayes (NB), k-Nearest Neighbor (KNN), Decision Tree, Neural Networks, and FastText, are used to classify text. In machine learning algorithms, the Vector Space Model (VSM) is employed to convert text inputs (sentences) into vectors of features for classification; in deep learning models, by contrast, stacked layers learn and weight these features (Zulqarnain et al., 2020). The Bag-of-Words (BoW) model and Term Frequency-Inverse Document Frequency (TF-IDF) are used to represent the occurrence of words across documents in a corpus. The BoW model transforms documents into vectors, while TF-IDF assigns weights to terms: TF scores each feature by its frequency within a document, whereas IDF discounts features that appear across many documents in the dataset. In addition, GloVe, BERT, and FastText are word embedding methods used to reduce high-dimensional feature spaces by capturing the similarity between words (Naeem et al., 2022; Singh et al., 2022). GloVe is a pre-trained word embedding model trained on the Wikipedia 2014 and Gigaword 5 corpora, and it uses a technique that assesses the similarity between word vectors. BERT, introduced in 2018, is a pre-trained model available for various languages; unlike models that capture context in one direction only, BERT processes text bidirectionally (Alammary, 2022).

       FastText, developed by Facebook, serves as a library offering pre-trained models for 157 distinct languages, catering to supervised classification and unsupervised text representation tasks (Badri et al., 2022). FastText utilizes character n-grams instead of word n-grams, effectively addressing out-of-vocabulary issues, particularly in languages with complex morphology (Yao et al., 2020). One of the richest languages in terms of morphology is the Kurdish language (Saeed et al., 2023). The significance of this study for the Kurdish language lies in its potential to enhance the accuracy of Kurdish text classification, given the language's intricate morphology. Furthermore, using character n-gram features can improve the efficiency of classifying Kurdish text by offering alternatives for matching out-of-vocabulary words, especially in the presence of typographical errors.

2. RELATED WORK

        The proliferation of internet applications has led to a massive surge in the volume of text available online, and automated text mining, extracting valuable information from large volumes of content, has become a significant challenge in recent years. Automated text classification has driven the creation and enhancement of numerous algorithms for organizing large collections of documents. In a study by Hassan et al. (2022), five machine learning algorithms were applied and compared on two different datasets. The performance of Random Forest (RF), k-Nearest Neighbor (k-NN), Logistic Regression (LR), Multinomial Naïve Bayes (MNB), and Support Vector Machine (SVM) was evaluated using accuracy, precision, recall, and F1-score. The results indicated that SVM and LR outperformed the other algorithms on the IMDB English dataset, while k-NN exhibited the best performance on the SPAM dataset. Madhfar and Al-Hagery (2019) used an Arabic corpus to categorize text into predefined categories such as news, economy, culture, diversity, and sports. In this experiment, six machine learning models, namely Random Forest, Logistic Regression, Decision Tree (DT), Stochastic Gradient Descent (SGD), Naïve Bayes (NB), and Support Vector Machine (SVM), were employed. According to the results, Logistic Regression achieved the highest F1-score among the models.

        In a recent study, researchers applied machine learning and deep learning algorithms to a medical Kurdish dataset. Before analysis, the text underwent preprocessing steps, such as the removal of irrelevant words and stop words, using the Kurdish Language Processing Toolkit (KLPT) Python library. The study compared multilingual BERT with traditional machine learning algorithms including NB, SGD, DT, Random Forest, SVM, KNN, and LR. The findings indicated that BERT achieved an accuracy of 92%, surpassing the traditional machine learning algorithms by two percentage points (Badawi, 2023). In another study, researchers addressed the high dimensionality of the feature space and the limitations of conventional machine learning algorithms such as SVM, NB, and KNN by introducing FastText as a novel classification model; the FastText model attained an F1-score of 0.9286, outperforming the other models (Yao et al., 2020). Kuyumcu et al. (2019) used the FastText classifier to analyze the TTC-3600 Turkish dataset. Remarkably, no preprocessing steps such as tokenization, stemming, lemmatization, stop word removal, lowercase conversion, or dimensionality reduction were applied. The performance of FastText was then compared with k-NN, the J48 decision tree, and Multinomial Naïve Bayes (MNB). The results demonstrated that FastText surpassed the other models, achieving an accuracy of 93.52% despite the absence of preprocessing (Kuyumcu et al., 2019). Additionally, Amalia et al. (2020) compared the FastText model with TF-IDF, a BoW-based representation, on 500 news articles in low-resource Bahasa Indonesia. The study revealed that TF-IDF requires more preprocessing steps and is more time-consuming at prediction time, and that FastText classification exhibited superior performance with a 0.97 F1-score compared to TF-IDF (Amalia et al., 2020).

3. FASTTEXT ARCHITECTURE

        In natural language processing, reduced performance in text classification poses a challenge for neural network approaches. To address this issue, the Facebook research team developed FastText, a library designed for both supervised and unsupervised learning: supervised learning is applied to tasks such as text classification, while unsupervised learning is used to learn word embeddings from a training corpus. Two primary data sources, Wikipedia and the Common Crawl corpus, contribute the training data for FastText. It is worth noting that FastText offers pre-trained word vectors for 157 different languages, comprising up to 2 million word vectors trained on roughly 600 billion tokens, each represented in 300 dimensions (Umer et al., 2023). The structure of FastText closely resembles that of the Continuous Bag of Words (CBOW) model; the primary distinction is that CBOW predicts an intermediate (middle) word, whereas FastText predicts a label, as shown in Figure 1.


 

Figure 1: The Basic Framework of the FastText Model


        As shown in Figure 1, the FastText architecture consists of three layers: the input layer, the hidden layer, and the output layer (Dharma et al., 2022).

The input layer of the model processes documents one by one, converting each document into the FastText format used within the layer, as illustrated in Table 1:

 


Table 1: Preparing the Documents for Classification with FastText

Document | Label | FastText Format
دکتۆرە گویز هندی بۆ دە مو چاو باشە | medical | __label__medical دکتۆرە گویز هندی بۆ دە مو چاو باشە
ئەم پیاوە دائیم سەرشۆڕە | notmedical | __label__notmedical ئەم پیاوە دائیم سەرشۆڕە


        As shown in Table 1, the documents are now ready for the input layer. After converting the document format, the next step is representing the words in the documents. FastText's word representation differs from models such as word2vec: word2vec represents each word as a single atomic vector, while FastText represents each word as a bag of character n-grams, which allows it to generate vectors for unknown words and improves generalization. For example, with character n-grams the word (دکتۆر) is represented as:

 <دک, دکت, کتۆ, تۆر, ۆر> when (n-gram=3)

Character n-grams represent a significant improvement over word n-grams and help address "out of vocabulary" errors, especially in high-dimensional feature spaces (Khomsah et al., 2022).
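To make the sub-word mechanism concrete, the following minimal Python sketch (an illustration, not the library's internal code) reproduces the decomposition above; FastText itself additionally keeps the full word as one extra sequence and extracts n-grams over a range of lengths (typically 3 to 6) rather than a single n.

```python
def char_ngrams(word: str, n: int = 3) -> list[str]:
    """Character n-grams of a word, padded with the boundary
    symbols '<' and '>' as in the FastText sub-word model."""
    padded = f"<{word}>"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

# The Kurdish example above, with n-gram = 3:
print(char_ngrams("دکتۆر"))
# ['<دک', 'دکت', 'کتۆ', 'تۆر', 'ۆر>']
```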

       The hidden layer averages the feature vectors of a document. For the output, a Huffman tree is constructed over the class labels and used in a hierarchical softmax: the probability of a label is computed along its Huffman coding path, which significantly reduces the computational load relative to scoring every class. The softmax function is employed in FastText to estimate the likelihood distribution over classes. For a dataset containing multiple documents, the objective of the model is defined by formula (1):

$$ -\frac{1}{N}\sum_{n=1}^{N} y_n \log\big(f(BAx_n)\big) \qquad (1) $$

       Based on Equation (1), $N$ represents the number of documents in the dataset, $y_n$ denotes the class label of the $n$-th document, $f$ is the softmax function used to compute the loss, $B$ is the weight matrix from the hidden layer to the output layer, $A$ is the weight matrix for word embedding (the embedding layer), and $x_n$ is the normalized bag of features of the $n$-th document. The model employs a linearly decaying learning rate and stochastic gradient descent for training.

       Two important factors have made FastText a robust model: the first is the utilization of the Huffman coding tree-based hierarchical Softmax method, and the second is the utilization of the sub-word n-gram method (Amalia et al., 2020).
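To illustrate how Equation (1) and the three layers of Figure 1 fit together, the sketch below implements the forward pass in plain NumPy: the rows of the embedding matrix A for a document's features are averaged (the hidden layer), multiplied by the output matrix B, and passed through a softmax. The sizes and random weights are illustrative assumptions, and a flat softmax is used here in place of the hierarchical variant described above.

```python
import numpy as np

vocab_size, dim, n_classes = 10_000, 100, 5   # illustrative sizes, not library defaults
rng = np.random.default_rng(0)
A = rng.normal(scale=0.1, size=(vocab_size, dim))  # embedding matrix (input -> hidden)
B = rng.normal(scale=0.1, size=(dim, n_classes))   # hidden -> output weight matrix

def softmax(z):
    z = z - z.max()          # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def predict(feature_ids):
    """feature_ids: indices of the word/character n-gram features of one document."""
    hidden = A[feature_ids].mean(axis=0)  # hidden layer: average of feature vectors
    return softmax(hidden @ B)            # output layer: class probability distribution

print(predict([12, 407, 3391]))           # a toy document with three feature ids
```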

 

 

4. DATASET COLLECTION AND DESCRIPTION

        The Kurdish language is a member of the Indo-European language family and is spoken by 40 million people. Kurdish dialects can be broadly categorized into two groups: Sorani and Kurmanji. The Kurdish homeland, known as Kurdistan, spans across four countries: Iraq, Turkey, Iran, and Syria. Both Sorani and Kurmanji are spoken in Iran and Iraq, whereas only Kurmanji is used in Turkey and Syria. Furthermore, Kurds in Iraq and Iran use the Arabic alphabet, while those in Turkey and Syria use the Latin alphabet (Saeed et al., 2018). In this study, four distinct Kurdish datasets were analyzed. These datasets were gathered from various online sources within the Kurdistan region of Iraq (Ahmadi, 2020; Saeed, Ismael, et al., 2022).

       The first dataset is the Kurdish News Dataset Headlines (KNDH), compiled from 34 distinct Kurdish channels such as Kurdsat, PayamTV, Rudaw, K24, and others, with the proportion of headlines varying by channel. KNDH contains 50,000 headlines categorized into five distinct classes: Health, Science, Economy, Sport, and Social, with each class containing 10,000 headlines. The headlines were gathered using BeautifulSoup and the ParseHub tool, and the texts were labeled automatically (S. Badawi et al., 2023).

        The second dataset is the Medical Kurdish Dataset (MKD), which includes 6,756 comments from Facebook. These comments were gathered from various posts related to Education, Sport, Medicine, News, and Economy. After collection, three annotators manually labeled the comments as either medical or non-medical based on their understanding. The dataset was compiled using the Facepager tool (Saeed, Hussein, et al., 2022).

The third dataset is the Kurdish-Emotional Dataset (KMD-77000), comprising 77,000 texts collected via the Twitter API. Three annotators proficient in the Kurdish language labeled the texts based on their understanding. The texts were categorized into four classes: joy, sadness, fear, and surprise (S. Badawi, 2023).

       The fourth dataset is KurdiSent, comprising 12,000 instances collected from Twitter. After gathering the tweets, three annotators manually labeled them as positive, negative, or neutral. The open-source text annotation tool Doccano was used to facilitate the annotation process (S. Badawi et al., 2024).

 

5. IMPLEMENTATION AND EXPERIMENTS

       The FastText classification model categorizes Kurdish text documents based on their content. To classify these documents with FastText, several essential steps need to be performed for analyzing and predicting labels, as illustrated in Figure 2:


Figure 2: Text Classification Experiment

 


        As demonstrated in Figure 2, when initiating the classification process using FastText, the initial step involves converting a raw dataset into a suitable format for training input. To achieve this, the text documents need to be prepared in a FastText format. This entails prefixing the text with the keyword "__label__", followed by the corresponding class name, such as "medical", and then appending the text document. For instance: "__label__medical this is a good doctor."
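A small conversion script of this kind might look as follows; the CSV file name and column names are hypothetical placeholders for whichever raw format a dataset ships in.

```python
import csv

def to_fasttext_format(in_csv: str, out_txt: str,
                       text_col: str = "text", label_col: str = "label") -> None:
    """Rewrite a CSV of (text, label) rows into FastText's
    '__label__<class> <text>' one-document-per-line format."""
    with open(in_csv, encoding="utf-8") as src, \
         open(out_txt, "w", encoding="utf-8") as dst:
        for row in csv.DictReader(src):
            text = " ".join(row[text_col].split())  # collapse internal newlines/whitespace
            dst.write(f"__label__{row[label_col]} {text}\n")

to_fasttext_format("mkd_raw.csv", "mkd.train.txt")  # hypothetical file names
```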

        The next stage involves preprocessing the text document, a vital step focused on cleaning and readying unstructured textual data for analysis. In this experiment, the Kurdish Language Processing Toolkit (KLPT) is utilized as an open-source Python toolkit for preprocessing, tokenization, stemming, and transliteration (Ahmadi, 2020).

        Preprocessing involves a set of techniques applied to tokens to ensure they are clean and consistent. Kurdish has numerous customized keyboard layouts, each assigning different character encodings to visually similar graphemes. Furthermore, keyboards vary in how they are used to type Kurdish alongside languages like Persian, Arabic, and Turkish. The Sorani dialect employs various non-uniform graphemes. For example, the grapheme "ی" can be encoded with five different Unicode characters within the same script: "ي" (U+064A), "ى" (U+0649), "ﻲ" (U+FEF2), "ﻱ" (U+FEF1), and "ی" (U+06CC), with the last one intended for Sorani. To tackle these challenges, two functions, Normalization() and Standardization(), are utilized.

        In the Normalization() function, abnormal forms in the text are replaced with standardized forms of the letter (grapheme). The Standardization() function addresses orthographic aspects of the text. Additionally, the unify_numerals() function converts digits from the Persian (۰, ۱, ۲, ۳, ۴, ۵, ۶, ۷, ۸, ۹) and Latin (0, 1, 2, 3, 4, 5, 6, 7, 8, 9) forms to the Arabic forms (٠, ١, ٢, ٣, ٤, ٥, ٦, ٧, ٨, ٩).
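A usage sketch of these KLPT preprocessing functions is shown below, following the toolkit's documented interface; exact argument names and options may differ across KLPT versions.

```python
from klpt.preprocess import Preprocess

# Preprocessor for the Sorani dialect written in Arabic script; the numeral
# option selects the target digit system (per the KLPT documentation).
preprocessor = Preprocess("Sorani", "Arabic", numeral="Arabic")

text = "دکتۆرە گویز هندی بۆ دە مو چاو باشە"
text = preprocessor.normalize(text)       # replace variant graphemes with standard forms
text = preprocessor.standardize(text)     # fix orthographic inconsistencies
text = preprocessor.unify_numerals(text)  # map Persian/Latin digits to one numeral system
```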

Tokenization involves separating each word in a sentence. While spaces separate words in Arabic script, the process is harder in the Kurdish Sorani dialect because of its complex morphology: not every word corresponds directly to a single token. For example, the word "لەخوێندنگاکانماندا" consists of six tokens: "لە", "خوێندن", "گا", "کان", "مان", "دا". KLPT addresses this complexity by using an annotated lexicon and a morphological analyzer to tokenize Sorani words. To handle various compound forms, KLPT provides two functions for word tokenization (mwe_tokenize() and word_tokenize()) and uses sent_tokenize() to tokenize sentences based on punctuation, as sketched below.
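The corresponding tokenization calls, again following KLPT's documented interface (method names as listed above; exact behavior may vary by toolkit version), would look roughly like this:

```python
from klpt.tokenize import Tokenize

tokenizer = Tokenize("Sorani", "Arabic")  # dialect and script

print(tokenizer.word_tokenize("لەخوێندنگاکانماندا"))  # morpheme-aware word tokenization
print(tokenizer.mwe_tokenize("دکتۆرە گویز هندی بۆ دە مو چاو باشە"))  # multi-word expressions
print(tokenizer.sent_tokenize("ئەم پیاوە دائیم سەرشۆڕە."))  # punctuation-based sentence split
```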

       Transliteration involves converting text from one alphabet to another while retaining the original pronunciation. In the KLPT system, the issue of transliterating characters with dual uses (such as "ى" and "و") in the Sorani dialect has been addressed.

Stemming involves reducing a word to its base form by removing prefixes and suffixes. This process is implemented in two classes, Stem and Spellcheck, whose functions are listed below and illustrated in the usage sketch that follows.

The Stem class has four functions:

a) stem(): retrieves the root of a word, e.g., "بردن" becomes "بر".

b) lemmatize(): performs lemmatization, e.g., "بردوویانە" becomes "بردن".

c) analyze(): analyzes the morphology of a word, returning a dictionary with parts of speech.

d) suffix_suggest(): returns all possible suffixes that can appear with a given lexeme.

The Spellcheck class includes two functions:

a) check_spelling(): returns True or False depending on the correctness of the spelling.

b) correct_spelling(): corrects misspelled words and provides suggestions, e.g., suggesting "بردن" for the misspelled word "بردب".
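A brief usage sketch of the Stem and Spellcheck functionality, based on KLPT's documented interface (outputs and exact signatures may vary by version):

```python
from klpt.stem import Stem

stemmer = Stem("Sorani", "Arabic")  # dialect and script, as for the other KLPT modules

print(stemmer.check_spelling("بردب"))    # False: the word is misspelled
print(stemmer.correct_spelling("بردب"))  # suggestions such as "بردن"
print(stemmer.analyze("بردن"))           # morphological analysis as a dictionary
```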

        Following preprocessing, each dataset is partitioned into training and testing sets using the holdout method, given the relatively small datasets in this study. The training ratio is set at 80 percent, leaving the remaining 20 percent for testing. Subsequently, the FastText classifier is trained and saved as a data model for the subsequent steps. The saved model is then applied to the testing data, and the final step is assessing the classification performance.
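The training and evaluation loop just described can be sketched with the fasttext Python library as follows; the file names and hyperparameter values (epochs, learning rate, n-gram ranges) are illustrative rather than the settings used in this study.

```python
import fasttext

# Train on the 80% split; lines are in '__label__<class> <text>' format.
model = fasttext.train_supervised(
    input="kurdish.train.txt",
    epoch=25, lr=0.5, wordNgrams=2, minn=2, maxn=5,  # illustrative hyperparameters
)
model.save_model("kurdish_fasttext.bin")  # the saved data model for later steps

# Evaluate on the held-out 20% split.
n, precision, recall = model.test("kurdish.test.txt")
print(f"samples={n}  P@1={precision:.4f}  R@1={recall:.4f}")

# Predict the label of a single document.
print(model.predict("دکتۆرە گویز هندی بۆ دە مو چاو باشە"))
```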

6. EXPERIMENTAL RESULTS

        This study evaluates the effectiveness of the FastText algorithm on Kurdish text using four different Kurdish datasets. Its performance is compared against eight other machine learning and deep learning algorithms. The procedure is structured into several key steps. Initially, the raw datasets are converted into a format compatible with FastText. Subsequently, the data undergoes preprocessing, which includes tokenization, stemming, and the removal of stop words. Afterward, the datasets are split into training and testing sets, with 80% designated for training and 20% for testing. The next phase involves training the model using one of the selected machine learning or deep learning algorithms on the training set. Finally, the trained model is evaluated by applying it to the testing set to generate predictions.

For evaluating the performance of each classifier, a confusion matrix is used as shown in Table 2:


Table 2: Confusion Matrix / Contingency Table

                | Predicted Positive | Predicted Negative
Actual Positive | TP | FN
Actual Negative | FP | TN


       As illustrated in Table 2, the columns represent the predicted labels, categorized as positive and negative, while the rows denote the actual labels, also categorized as positive and negative. This results in four possible outcomes: True Positive (TP), False Negative (FN), False Positive (FP), and True Negative (TN).

From Table 2, the following performance metrics are employed to evaluate the results:

Precision: the proportion of retrieved (predicted positive) items that are actually relevant.

$$\text{Precision} = \frac{TP}{TP + FP} \qquad (2)$$

Recall: the proportion of all relevant items that are retrieved.

$$\text{Recall} = \frac{TP}{TP + FN} \qquad (3)$$

F-measure: the harmonic mean of precision and recall.

$$F_1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \qquad (4)$$

Accuracy: the ratio of correctly predicted labels to the total number of predictions.

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \qquad (5)$$
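For reference, Equations (2) through (5) translate directly into the following small helper; the counts used in the example call are toy values for illustration only.

```python
def metrics(tp: int, fn: int, fp: int, tn: int):
    """Precision, recall, F1, and accuracy from confusion
    matrix counts (Equations (2)-(5))."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return precision, recall, f1, accuracy

print(metrics(tp=90, fn=10, fp=12, tn=88))  # toy counts, illustration only
```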

Table 3 reports the precision, recall, F1-score, and accuracy of the nine algorithms on the KNDH dataset:


 

Table 3: Precision, Recall, F1-score, and Accuracy of the Classifiers on the KNDH Dataset

Classifier | Precision | Recall | F1 | Accuracy
FastText | 89.00 | 89.00 | 89.00 | 89.10
NB | 87.30 | 87.40 | 87.35 | 87.25
SVM | 88.01 | 88.91 | 88.46 | 88.53
DT | 80.23 | 79.90 | 80.06 | 80.91
KNN | 62.80 | 62.30 | 62.55 | 62.36
LR | 88.00 | 88.00 | 88.00 | 88.00
RF | 85.63 | 85.34 | 85.48 | 85.24
SGD | 76.43 | 76.54 | 76.48 | 76.49
BERT | 88.26 | 88.29 | 88.27 | 88.12


       As shown in Table 3, the classifiers were compared using precision, recall, F1-score, and accuracy. FastText exhibited the highest values across all metrics, with precision, recall, and F1-score each at 89.00 and accuracy slightly higher at 89.10. SVM, LR, and BERT cluster closely together at around 88 on all metrics. Naive Bayes (NB) achieved a precision of 87.30, recall of 87.40, F1-score of 87.35, and accuracy of 87.25, slightly below SVM, LR, and BERT. Random Forest (RF) displayed moderate performance, with a precision of 85.63, recall of 85.34, F1-score of 85.48, and accuracy of 85.24. DT had lower values, with a precision of 80.23, recall of 79.90, F1-score of 80.06, and accuracy of 80.91. SGD and KNN were the least effective: SGD recorded a precision of 76.43, recall of 76.54, F1-score of 76.48, and accuracy of 76.49, while KNN showed the lowest performance with a precision of 62.80, recall of 62.30, F1-score of 62.55, and accuracy of 62.36. This comparison highlights FastText as the most effective algorithm among those evaluated, indicating its robustness and reliability in text classification tasks, with SVM, LR, and BERT also demonstrating strong performance.

Table 4 reports the precision, recall, F1-score, and accuracy of the nine algorithms on the KMD dataset:


 

Table 4: Precision, Recall, F1-score, and Accuracy of the Classifiers on the KMD Dataset

Classifier | Precision | Recall | F1 | Accuracy
FastText | 70.31 | 70.39 | 70.35 | 70.5
NB | 63.23 | 63 | 63.11 | 63.32
SVM | 65.08 | 65.01 | 65.04 | 65.03
DT | 63.54 | 63.5 | 63.52 | 63.32
KNN | 54.79 | 54.7 | 54.74 | 54.61
LR | 65.7 | 65.8 | 65.75 | 65.73
RF | 68.44 | 68.48 | 68.46 | 68.47
SGD | 44.32 | 44.5 | 44.41 | 44.03
BERT | 66.64 | 66.61 | 66.62 | 66.32


        As shown in Table 4, various classifiers were applied to the KMD dataset. FastText obtained the highest values across all metrics, with a precision of 70.31, recall of 70.39, F1-score of 70.35, and accuracy of 70.5. RF followed closely, achieving a precision of 68.44, recall of 68.48, F1-score of 68.46, and accuracy of 68.47. BERT also performed well, with values of around 66.6 across all metrics. Likewise, the results of LR and SVM are close to each other, at roughly 65 for precision, recall, F1-score, and accuracy. In addition, the values of NB are similar to those of DT across all four metrics, indicating slightly lower performance compared to LR and SVM. KNN had lower values, with a precision of 54.79, recall of 54.7, F1-score of 54.74, and accuracy of 54.61, and SGD was the least effective, with a precision of 44.32, recall of 44.5, F1-score of 44.41, and accuracy of 44.03.

Table 5 reports the precision, recall, F1-score, and accuracy of the nine algorithms on the KurdiSent dataset:


Table 5: Precision, Recall, F1-score, and Accuracy of the Classifiers on the KurdiSent Dataset

Classifier | Precision | Recall | F1 | Accuracy
FastText | 81.32 | 81.83 | 81.57 | 81.4
NB | 75.92 | 75.78 | 75.85 | 75.9
SVM | 78.14 | 78.18 | 78.16 | 78.2
DT | 73.46 | 73.9 | 73.68 | 73.7
KNN | 57.49 | 57.43 | 57.46 | 57.6
LR | 79.3 | 79.23 | 79.26 | 79.3
RF | 78.09 | 78.12 | 78.10 | 78.12
SGD | 71.1 | 71.02 | 71.06 | 71.3
BERT | 80.5 | 80.45 | 80.47 | 80.67


        As shown in Table 5, the results differ for each classification algorithm. FastText obtained the highest values across all metrics, with a precision of 81.32, recall of 81.83, F1-score of 81.57, and accuracy of 81.4, indicating its robustness and reliability in text classification tasks. BERT followed closely, achieving a precision of 80.5, recall of 80.45, F1-score of 80.47, and accuracy of 80.67. Moreover, LR performed strongly, at around 79 on all metrics. SVM and RF also performed well, with consistent values of about 78 across all metrics. In addition, NB showed comparable results, with a precision of 75.92, recall of 75.78, F1-score of 75.85, and accuracy of 75.9, making it another strong contender. DT achieved a precision of 73.46, recall of 73.9, F1-score of 73.68, and accuracy of 73.7, indicating slightly lower performance compared to SVM, RF, and NB. SGD displayed moderate performance, with a precision of 71.1, recall of 71.02, F1-score of 71.06, and accuracy of 71.3. KNN showed the lowest performance, with a precision of 57.49, recall of 57.43, F1-score of 57.46, and accuracy of 57.6.

        Table 6 reports the precision, recall, F1-score, and accuracy of the nine algorithms on the MKD dataset:


Table 6: Precision, Recall, F1-score, and Accuracy of the Classifiers on the MKD Dataset

Classifier | Precision | Recall | F1 | Accuracy
FastText | 93.32 | 93.36 | 93.34 | 93.1
NB | 92.9 | 92.93 | 92.91 | 92.8
SVM | 93.1 | 93.3 | 93.20 | 93.3
DT | 87.68 | 87.64 | 87.66 | 87.66
KNN | 62.34 | 62.1 | 62.22 | 62.31
LR | 90.5 | 90.51 | 90.50 | 90.49
RF | 91.31 | 91.2 | 91.25 | 91.3
SGD | 63.31 | 63.33 | 63.32 | 63.45
BERT | 92.1 | 92 | 92.05 | 92.01


        As shown in Table 6, the performance of the classifiers on the MKD dataset was compared using precision, recall, F1-score, and accuracy. FastText exhibited the highest values, with a precision of 93.32, recall of 93.36, F1-score of 93.34, and accuracy of 93.1, indicating its robustness and reliability in text classification tasks. SVM followed closely, achieving a precision of 93.1, recall of 93.3, F1-score of 93.2, and accuracy of 93.3. NB also performed well, with values of around 92.9 across all metrics. BERT showed comparable results, with a precision of 92.1, recall of 92, F1-score of 92.05, and accuracy of 92.01, making it another strong contender. In addition, RF achieved a precision of 91.31, recall of 91.2, F1-score of 91.25, and accuracy of 91.3, slightly below BERT and SVM. LR displayed moderate performance, with a precision of 90.5, recall of 90.51, F1-score of 90.5, and accuracy of 90.49. DT had lower values, with a precision of 87.68, recall of 87.64, F1-score of 87.66, and accuracy of 87.66. SGD and KNN were the least effective: SGD recorded a precision of 63.31, recall of 63.33, F1-score of 63.32, and accuracy of 63.45, and KNN showed the lowest performance, with a precision of 62.34, recall of 62.1, F1-score of 62.22, and accuracy of 62.31.

Another important aspect of evaluating the FastText classifier is its space and time complexity on each dataset, as shown in Table 7:


Table 7: Space and Time Complexity of the FastText Model

Dataset | Training samples | Vocabulary size | Training time (s) | Inference time (s)
Medical | 4,729 | 20,168 | 0.307 | 0.078
KurdiSent | 8,614 | 22,079 | 0.375 | 0.113
KNDH | 35,000 | 43,427 | 0.964 | 0.594
KMD | 54,089 | 53,400 | 1.646 | 0.900


        As shown in Table 7, the FastText model was applied to four distinct datasets (Medical, KurdiSent, KNDH, and KMD), revealing noteworthy differences in computational cost.


The Medical dataset is the smallest, with 4,729 training samples and a vocabulary size of 20,168, and it is the cheapest to process, with a training time of 0.307 seconds and an inference time of 0.078 seconds. KurdiSent, with 8,614 training samples and a vocabulary size of 22,079, remains efficient, with a training time of 0.375 seconds and an inference time of 0.113 seconds. KNDH, considerably larger with 35,000 samples and a vocabulary size of 43,427, demonstrates increased computational demand, requiring 0.964 seconds for training and 0.594 seconds for inference. The largest dataset, KMD, with 54,089 samples and a vocabulary size of 53,400, has the highest training and inference times, at 1.646 and 0.900 seconds respectively. These results indicate a clear trend: as dataset size and vocabulary grow, both training and inference times rise significantly, highlighting the scalability challenges of the FastText model.

It can also be concluded that the FastText classifier outperforms all the other classifiers in classifying Kurdish text. This effectiveness is due to two main factors: first, FastText employs a character n-gram approach that mitigates out-of-vocabulary errors; second, it uses the hierarchical softmax method based on the Huffman coding tree as the final layer of the network.

7. CONCLUSIONS

        The primary objective of this study is to introduce a FastText-based classifier for classifying Kurdish language text. The proposed method was compared with eight traditional machine learning and deep learning algorithms. To assess the performance of each classifier, four Kurdish datasets were used. After preprocessing steps such as tokenization, stemming, lemmatization, and stop word removal, the datasets were divided into training and testing sets. The study demonstrated that FastText achieved the highest precision, recall, F1-score, and accuracy compared to all other algorithms across all datasets. Based on these findings, it can be concluded that FastText is the most effective classifier for Kurdish language text classification. Future research can build on this method to develop a hybrid model based on FastText for more efficient text classification in the Kurdish language.

 


REFERENCES


Ahmadi, S. (2020). KLPT–Kurdish language processing toolkit. Proceedings of Second Workshop for NLP Open Source Software (NLP-OSS), 72–84.

Alammary, A. S. (2022). BERT models for Arabic text classification: A systematic review. Applied Sciences, 12(11), 5720. https://doi.org/10.3390/APP12115720

Amalia, A., Sitompul, O. S., Nababan, E. B., & Mantoro, T. (2020). An Efficient Text Classification Using fastText for Bahasa Indonesia Documents Classification. 2020 International Conference on Data Science, Artificial Intelligence, and Business Analytics, DATABIA 2020 - Proceedings, 69–75. https://doi.org/10.1109/DATABIA50434.2020.9190447

Badawi, S. (2023). KMD: A New Kurdish Multilabel Emotional Dataset For the Kurdish Sorani Dialect. In M. Abbas & A. A. Freihat (Eds.), Proceedings of the 6th International Conference on Natural Language and Speech Processing (ICNLSP 2023) (pp. 308–315). Association for Computational Linguistics. https://aclanthology.org/2023.icnlsp-1.33

Badawi, S., Kazemi, A., & Rezaie, V. (2024). KurdiSent: A corpus for Kurdish sentiment analysis. Language Resources and Evaluation, 1–20. https://doi.org/10.1007/S10579-023-09716-6/METRICS

Badawi, S. S. (2023). Using multilingual bidirectional encoder representations from transformers on medical corpus for Kurdish text classification. ARO-The Scientific Journal of Koya University, 11(1), 10–15. https://doi.org/10.14500/aro.11088

Badawi, S., Saeed, A. M., Ahmed, S. A., Abdalla, P. A., & Hassan, D. A. (2023). Kurdish News Dataset Headlines (KNDH) through multiclass classification. Data in Brief, 48, 109120. https://doi.org/10.1016/j.dib.2023.109120

Dharma, E. M., Gaol, F. L., Leslie, H., Warnars, H. S., & Soewito, B. (2022). The accuracy comparison among Word2Vec, GloVe, and fastText towards convolution neural network (CNN) text classification. Journal of Theoretical and Applied Information Technology, 31(2). www.jatit.org

Khomsah, S., Ramadhani, R. D., & Wijaya, S. (2022). The Accuracy Comparison Between Word2Vec and FastText On Sentiment Analysis of Hotel Reviews. Jurnal RESTI (Rekayasa Sistem Dan Teknologi Informasi), 6(3), 352–358. https://doi.org/10.29207/resti.v6i3.3711

Kuyumcu, B., Aksakalli, C., & Delil, S. (2019). An automated new approach in fast text classification (fastText): A case study for Turkish text classification without pre-processing. ACM International Conference Proceeding Series, 1–4. https://doi.org/10.1145/3342827.3342828

Naeem, M. Z., Rustam, F., Mehmood, A., Mui-zzud-din, Ashraf, I., & Choi, G. S. (2022). Classification of movie reviews using term frequency-inverse document frequency and optimized machine learning algorithms. PeerJ Computer Science, 8, e914. https://doi.org/10.7717/PEERJ-CS.914/SUPP-4

Saeed, A. M., Badawi, S., Ahmed, S. A., & Hassan, D. A. (2023). Comparison of feature selection methods in Kurdish text classification. Iran Journal of Computer Science, 1–10.

Saeed, A. M., Hussein, S. R., Ali, C. M., & Rashid, T. A. (2022). Medical dataset classification for Kurdish short text over social media. Data in Brief, 42, 108089. https://doi.org/10.1016/J.DIB.2022.108089

Saeed, A. M., Ismael, A. N., Rasul, D. L., Majeed, R. S., & Rashid, T. A. (2022). Hate Speech Detection in Social Media for the Kurdish Language. 253–260. https://doi.org/10.1007/978-3-031-14054-9_24

Saeed, A. M., Rashid, T. A., Mustafa, A. M., Agha, R. A. A.-R., Shamsaldin, A. S., & Al-Salihi, N. K. (2018). An evaluation of Reber stemmer with longest match stemmer technique in Kurdish Sorani text classification. Iran Journal of Computer Science, 1(2), 99–107. https://doi.org/10.1007/s42044-018-0007-4

Singh, K. N., Devi, S. D., Devi, H. M., & Mahanta, A. K. (2022). A novel approach for dimension reduction using word embedding: An enhanced text classification approach. International Journal of Information Management Data Insights, 2(1). https://doi.org/10.1016/j.jjimei.2022.100061

Umer, M., Imtiaz, Z., Ahmad, M., Nappi, M., Medaglia, C., Choi, G. S., & Mehmood, A. (2023). Impact of convolutional neural network and FastText embedding on text classification. Multimedia Tools and Applications, 82(4), 5569–5585. https://doi.org/10.1007/s11042-022-13459-x

Yao, T., Zhai, Z., & Gao, B. (2020). Text classification model based on fastText. Proceedings of 2020 IEEE International Conference on Artificial Intelligence and Information Systems, ICAIIS 2020, 154–157. https://doi.org/10.1109/ICAIIS49377.2020.9194939

Zulqarnain, M., Ghazali, R., Mazwin, Y., Hassim, M., & Rehan, M. (2020). A comparative review on deep learning models for text classification. Indonesian Journal of Electrical Engineering and Computer Science, 19(1), 325–335. https://doi.org/10.11591/ijeecs.v19.i1.pp325-335