Text similarity detection in agglutinative languages: a case study of Kazakh using hybrid n-gram and semantic models

Abstract

This study presents a hybrid approach for detecting near-duplicate texts in the Kazakh language, addressing the specific challenges posed by its agglutinative morphology. The proposed method combines statistical and semantic techniques: N-gram analysis, TF-IDF, locality-sensitive hashing (LSH), latent semantic analysis (LSA), and latent Dirichlet allocation (LDA). It is benchmarked against the bert-base-multilingual-cased model. Experiments were conducted on the purpose-built Arailym-aitu/KazakhTextDuplicates corpus, which contains over 25,000 text fragments manually modified with typical techniques such as paraphrasing, word order changes, synonym substitution, and morphological transformations. The hybrid model achieves a precision of 1.00, a recall of 0.73, and an F1-score of 0.84, significantly outperforming traditional N-gram and TF-IDF approaches and reaching accuracy comparable to that of the BERT model while requiring substantially fewer computational resources. The hybrid model proved effective in detecting various types of near-duplicate texts, including paraphrased and structurally modified content, which makes it suitable for practical applications in academic integrity verification, plagiarism detection, and intelligent text analysis. The study also highlights the potential of lightweight hybrid architectures as a practical alternative to large transformer-based models, particularly for languages with limited annotated corpora and linguistic resources. It lays the foundation for future research in cross-lingual duplicate detection and deep model adaptation for the Kazakh language.
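To illustrate why character-level N-grams are a natural fit for an agglutinative language, the following minimal sketch compares word-level and character-trigram Jaccard similarity on a Kazakh noun and its plural form. It is not the authors' pipeline: the function names (`char_ngrams`, `word_set`, `jaccard`), the trigram size, and the example words are assumptions chosen for illustration only.

```python
# Illustrative sketch only: not the method from the paper.
# Assumption: character trigrams and plain Jaccard overlap stand in for
# the N-gram component mentioned in the abstract.


def char_ngrams(text: str, n: int = 3) -> set[str]:
    """Return the set of character n-grams of a lowercased, whitespace-normalized string."""
    normalized = " ".join(text.lower().split())
    if len(normalized) < n:
        # Strings shorter than n contribute themselves as a single gram.
        return {normalized} if normalized else set()
    return {normalized[i:i + n] for i in range(len(normalized) - n + 1)}


def word_set(text: str) -> set[str]:
    """Return the set of lowercased whitespace-separated tokens."""
    return set(text.lower().split())


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity |a & b| / |a | b|; defined here as 0.0 when both sets are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    base = "кітап"         # "book"
    inflected = "кітаптар"  # "books" (stem plus plural suffix)

    # Whole-word matching treats the two forms as unrelated tokens.
    print(jaccard(word_set(base), word_set(inflected)))        # 0.0

    # Character trigrams still share the stem: 3 common grams out of 6.
    print(jaccard(char_ngrams(base), char_ngrams(inflected)))  # 0.5
```

Because suffixation leaves the stem intact, the trigram sets overlap even though the surface tokens differ. A full system along the lines described in the abstract would presumably combine such a signal with TF-IDF weighting and the semantic models rather than rely on it alone.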

Description

Publication type

Text

Text publication type

Article

Keywords

anti-plagiarism, Kazakh language, combined models, text data analysis, near duplicates, semantic analysis, academic integrity, intelligent analysis system

Bibliographic description

Biloshchytska S., Tleubayeva A., Kuchanskyi O., Biloshchytskyi A., Andrashko Y., Toxanov S., Mukhatayev A., Sharipova S. Text similarity detection in agglutinative languages: a case study of Kazakh using hybrid n-gram and semantic models. Applied Sciences. Vol. 15, Issue 12. 2025. Pub. 6707. DOI: https://doi.org/10.3390/app15126707
