EVALUATING OF EFFICACY SEMANTIC SIMILARITY METHODS FOR COMPARISON OF ACADEMIC THESIS AND DISSERTATION TEXTS

Ramadan T. Hassan; Nawzat S. Ahmed

doi:10.25271/sjuoz.2023.11.3.1120

Ramadan T. Hassan ⁽¹⁾ , Nawzat S. Ahmed ⁽²⁾

(1) Information Technology, Technical College of Informatics, Duhok Polytechnic University, Kurdistan Region, Iraq

(2) Information Technology Management, Technical College of Administration, Duhok Polytechnic University, Kurdistan Region, Iraq

Issue
Vol. 11 No. 3 (2023): July-September issue

Submitted
February 13, 2023

Accepted
April 5, 2023

Published
August 14, 2023

Keywords:

TF-IDF, BERT, SBERT, Doc2Vec, Semantic Similarity, Cosine Similarity, NLP

Abstract

Detecting semantic similarity between documents is vital in many natural language processing (NLP) applications. One widely used approach to measuring the semantic similarity of text documents is embedding, which converts texts into numerical vectors that can then be compared. This paper presents a comparative analysis of four embedding methods for detecting semantic similarity in theses and dissertations: Term Frequency–Inverse Document Frequency (TF-IDF), Document to Vector (Doc2Vec), Sentence Bidirectional Encoder Representations from Transformers (SBERT), and Bidirectional Encoder Representations from Transformers (BERT), all used with cosine similarity. The study used two datasets: 27 documents from Duhok Polytechnic University and 100 documents from ProQuest.com. The texts from these documents were pre-processed to make them suitable for semantic similarity analysis. The methods were evaluated on accuracy, precision, recall, F1 score, and processing time. The results showed that the traditional method, TF-IDF, outperformed the modern methods in embedding and detecting actual semantic similarity between documents, with a processing time of no more than a few seconds.
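
As an illustration of the general idea in the abstract (embed each document as a vector, then compare vectors with cosine similarity), the following is a minimal sketch in plain Python. It is not the authors' implementation: the whitespace tokenizer, the smoothed IDF variant, the function names, and the three toy documents are all assumptions made for this example.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a stand-in for real pre-processing)."""
    return text.lower().split()


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict: term -> weight) per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: how many documents contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        vectors.append({
            # Assumed smoothed IDF variant: ln((1 + N) / (1 + df)) + 1
            term: (count / total) * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in counts.items()
        })
    return vectors


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors; 0.0 if either is empty."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical toy documents, not taken from the study's datasets.
docs = [
    "semantic similarity of thesis documents using text embedding",
    "text embedding methods for semantic similarity of thesis documents",
    "forest restoration policy in tropical landscapes",
]
vectors = tfidf_vectors(docs)
sim_related = cosine_similarity(vectors[0], vectors[1])
sim_unrelated = cosine_similarity(vectors[0], vectors[2])
```

In this sketch the first two documents share most of their vocabulary and score well above the third, which shares no terms with the first and therefore scores zero. A pure term-overlap score like this cannot see synonyms, which is the gap that learned embeddings such as Doc2Vec, BERT, and SBERT are intended to address.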

Authors

Ramadan T. Hassan

ramadan.hassan@dpu.edu.krd (Primary Contact)

Nawzat S. Ahmed

Hassan, R. T., & Ahmed, N. S. (2023). EVALUATING OF EFFICACY SEMANTIC SIMILARITY METHODS FOR COMPARISON OF ACADEMIC THESIS AND DISSERTATION TEXTS. Science Journal of University of Zakho, 11(3), 396–402. https://doi.org/10.25271/sjuoz.2023.11.3.1120

This work is licensed under a Creative Commons Attribution 4.0 International License.

Authors who publish with this journal agree to the following terms:

Authors retain copyright and grant the journal right of first publication with the work simultaneously licensed under a Creative Commons Attribution License [CC BY-NC-SA 4.0] that allows others to share the work with an acknowledgment of the work's authorship and initial publication in this journal.
Authors are able to enter into separate, additional contractual arrangements for the non-exclusive distribution of the journal's published version of the work, with an acknowledgment of its initial publication in this journal.
Authors are permitted and encouraged to post their work online.
