EVALUATING OF EFFICACY SEMANTIC SIMILARITY METHODS FOR COMPARISON OF ACADEMIC THESIS AND DISSERTATION TEXTS

Authors

  • Ramadan T. Hassan, Information Technology, Technical College of Informatics, Duhok Polytechnic University, Kurdistan Region - Iraq
  • Nawzat S. Ahmed, Information Technology Management, Technical College of Administration, Duhok Polytechnic University, Kurdistan Region - Iraq

DOI:

https://doi.org/10.25271/sjuoz.2023.11.3.1120

Keywords:

TF-IDF, BERT, SBERT, Doc2Vec, Semantic Similarity, Cosine Similarity, NLP

Abstract

Detecting semantic similarity between documents is vital in natural language processing (NLP) applications. A widely used approach to measuring the semantic similarity of text documents is embedding, which converts texts into numerical vectors using various NLP methods. This paper presents a comparative analysis of four embedding methods for detecting semantic similarity in theses and dissertations: Term Frequency–Inverse Document Frequency (TF-IDF), Document to Vector (Doc2Vec), Sentence Bidirectional Encoder Representations from Transformers (SBERT), and Bidirectional Encoder Representations from Transformers (BERT), used with cosine similarity. The study used two datasets: 27 documents from Duhok Polytechnic University and 100 documents from ProQuest.com. The texts of these documents were pre-processed to make them suitable for semantic similarity analysis. The methods were evaluated on several metrics, including accuracy, precision, recall, F1 score, and processing time. The results showed that the traditional method, TF-IDF, outperformed the modern methods in embedding and detecting actual semantic similarity between documents, with processing time not exceeding a few seconds.
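The pipeline summarized in the abstract (embed each document as a vector, compare vectors with cosine similarity, then score the detected pairs) can be illustrated with a minimal, standard-library-only Python sketch. It is not the authors' implementation: the whitespace tokenizer, the smoothed IDF formula, the toy documents, and every function name below are assumptions made for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a stand-in for real pre-processing)."""
    return text.lower().split()


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict: term -> weight) per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        vectors.append({
            # Assumed smoothed IDF; keeps weights positive for very common terms.
            term: (count / total) * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in counts.items()
        })
    return vectors


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors; 0.0 if either is empty."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def precision_recall_f1(predicted, actual):
    """Score a set of predicted similar pairs against a set of true pairs."""
    tp = len(predicted & actual)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(actual) if actual else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Toy documents (hypothetical, not taken from the study's datasets).
docs = [
    "semantic similarity of thesis documents using text embedding",
    "text embedding for semantic similarity of dissertation documents",
    "soil erosion in mountain forests",
]
vectors = tfidf_vectors(docs)
sim_related = cosine_similarity(vectors[0], vectors[1])
sim_unrelated = cosine_similarity(vectors[0], vectors[2])
```

Here `sim_related` comes out higher than `sim_unrelated`, because the first two toy documents share most of their terms while the third shares none. Swapping `tfidf_vectors` for a Doc2Vec, BERT, or SBERT encoder would leave the cosine comparison and the scoring step unchanged, which is what makes a side-by-side comparison of embedding methods straightforward.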

Published

2023-08-14

How to Cite

Hassan, R. T., & Ahmed, N. S. (2023). EVALUATING OF EFFICACY SEMANTIC SIMILARITY METHODS FOR COMPARISON OF ACADEMIC THESIS AND DISSERTATION TEXTS. Science Journal of University of Zakho, 11(3), 396–. https://doi.org/10.25271/sjuoz.2023.11.3.1120

Section

Science Journal of University of Zakho