EVALUATING OF EFFICACY SEMANTIC SIMILARITY METHODS FOR COMPARISON OF ACADEMIC THESIS AND DISSERTATION TEXTS
DOI:
https://doi.org/10.25271/sjuoz.2023.11.3.1120
Keywords:
TF-IDF, BERT, SBERT, Doc2Vec, Semantic Similarity, Cosine Similarity, NLP
Abstract
Detecting semantic similarity between documents is vital in natural language processing (NLP) applications. One widely used approach to measuring the semantic similarity of text documents is embedding, which converts texts into numerical vectors using various NLP methods. This paper presents a comparative analysis of four embedding methods for detecting semantic similarity in theses and dissertations: Term Frequency–Inverse Document Frequency (TF-IDF), Document to Vector (Doc2Vec), Sentence Bidirectional Encoder Representations from Transformers (SBERT), and Bidirectional Encoder Representations from Transformers (BERT), each combined with cosine similarity. The study used two datasets, one consisting of 27 documents from Duhok Polytechnic University and the other of 100 documents from ProQuest.com. The texts of these documents were pre-processed to make them suitable for semantic similarity analysis. The methods were evaluated on several metrics: accuracy, precision, recall, F1 score, and processing time. The results showed that the traditional method, TF-IDF, outperformed the modern methods in embedding and in detecting actual semantic similarity between documents, with a processing time not exceeding a few seconds.
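As an illustration of the kind of pipeline the abstract describes, the following is a minimal sketch of TF-IDF embedding followed by cosine similarity, plus the precision, recall, and F1 calculations used for evaluation. It is not the authors' implementation: the whitespace tokenizer, the smoothed IDF formula, the helper names (tfidf_vectors, cosine_similarity, precision_recall_f1), and the three sample sentences are all assumptions made for this example, and the paper's actual pre-processing and weighting are not stated on this page.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict: term -> weight) per document."""
    # Assumption: lowercase + whitespace split stands in for real pre-processing.
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        vectors.append({
            # Assumption: smoothed IDF, log((1 + N) / (1 + df)) + 1.
            term: (count / total) * (math.log((1 + n_docs) / (1 + df[term])) + 1.0)
            for term, count in counts.items()
        })
    return vectors


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors; 0.0 if either is empty."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def precision_recall_f1(predicted, actual):
    """Compute precision, recall, and F1 from parallel lists of booleans."""
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)
    fn = sum(1 for p, a in zip(predicted, actual) if not p and a)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Hypothetical sample texts, not taken from the study's datasets.
docs = [
    "semantic similarity of thesis texts using embeddings",
    "measuring semantic similarity between thesis texts",
    "forest restoration policy in tropical landscapes",
]
vectors = tfidf_vectors(docs)
sim_related = cosine_similarity(vectors[0], vectors[1])
sim_unrelated = cosine_similarity(vectors[0], vectors[2])
print(f"related pair:   {sim_related:.3f}")
print(f"unrelated pair: {sim_unrelated:.3f}")
```

In this sketch the first two texts share several terms and therefore score higher than the first and third, which share none. A document pair would be labelled "similar" when its score exceeds a chosen threshold, and those labels can then be compared against ground truth with precision_recall_f1.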
License
Copyright (c) 2023 Ramadan T. Hassan, Nawzat S. Ahmed
This work is licensed under a Creative Commons Attribution 4.0 International License.
Authors who publish with this journal agree to the following terms:
- Authors retain copyright and grant the journal the right of first publication, with the work simultaneously licensed under a Creative Commons Attribution License [CC BY-NC-SA 4.0] that allows others to share the work with an acknowledgment of the work's authorship and initial publication in this journal.
- Authors are able to enter into separate, additional contractual arrangements for the non-exclusive distribution of the journal's published version of the work, with an acknowledgment of its initial publication in this journal.
- Authors are permitted and encouraged to post their work online.