The Kurdish Language Corpus: State of the Art

Authors

  • Media Azzat, Department of Computer Science, Faculty of Science, University of Zakho, Kurdistan Region - Iraq
  • Karwan Jacksi, Department of Computer Science, Faculty of Science, University of Zakho, Kurdistan Region - Iraq
  • Ismael Ali, Department of Computer Science, Faculty of Science, University of Zakho, Kurdistan Region - Iraq

DOI:

https://doi.org/10.25271/sjuoz.2023.11.1.1123

Keywords:

Kurdish language, Text Corpus, Text Mining, Natural Language Processing

Abstract

The notable growth of digital communities and online news streams has increased the availability of online natural language content. However, not all natural languages have received enough attention to be made readable and comprehensible to machines. Kurdish is among these less-resourced, under-studied languages. Creating machine-readable text is the first step toward applications of text mining and the semantic web, such as translation, information retrieval, and recommendation systems. Because of long-standing challenges, such as the scarcity of linguistic resources and the absence of unified orthography rules, Kurdish lacks language processing tools. To overcome these challenges and enable intelligent applications, well-organized and annotated Kurdish text corpora are needed. This review paper investigates the available textual corpora for the Kurdish language and its dialects, discusses the challenges identified, lists open problems, and suggests future directions.

Author Biographies

Media Azzat, Department of Computer Science, Faculty of Science, University of Zakho, Kurdistan Region - Iraq

(media.azzat@uoz.edu.krd).

Karwan Jacksi, Department of Computer Science, Faculty of Science, University of Zakho, Kurdistan Region - Iraq

(Karwan.jacksi@uoz.edu.krd).

Ismael Ali, Department of Computer Science, Faculty of Science, University of Zakho, Kurdistan Region - Iraq

(ismael.Ali@uoz.edu.krd).

Published

2023-02-20

How to Cite

Azzat, M., Jacksi, K., & Ali, I. (2023). The Kurdish Language Corpus: State of the Art. Science Journal of University of Zakho, 11(1), 127–133. https://doi.org/10.25271/sjuoz.2023.11.1.1123

Section

Science Journal of University of Zakho