Text similarity detection in agglutinative languages: a case study of Kazakh using hybrid n-gram and semantic models
Date
Journal title
ISSN number
Volume title
Publisher
Applied Sciences
Abstract
This study presents a hybrid approach for detecting near-duplicate texts in the Kazakh language, addressing the specific challenges posed by its agglutinative morphology. The proposed method combines statistical and semantic techniques, including N-gram analysis, TF-IDF, locality-sensitive hashing (LSH), latent semantic analysis (LSA), and latent Dirichlet allocation (LDA), and is benchmarked against the bert-base-multilingual-cased model. Experiments were conducted on the purpose-built Arailym-aitu/KazakhTextDuplicates corpus, which contains over 25,000 text fragments manually modified using typical techniques such as paraphrasing, word order changes, synonym substitution, and morphological transformations. The results show that the hybrid model achieves a precision of 1.00, a recall of 0.73, and an F1-score of 0.84, significantly outperforming traditional N-gram and TF-IDF approaches and achieving accuracy comparable to the BERT model while requiring substantially lower computational resources. The hybrid model proved highly effective in detecting various types of near-duplicate texts, including paraphrased and structurally modified content, making it suitable for practical applications in academic integrity verification, plagiarism detection, and intelligent text analysis. Moreover, this study highlights the potential of lightweight hybrid architectures as a practical alternative to large transformer-based models, particularly for languages with limited annotated corpora and linguistic resources. It lays the foundation for future research in cross-lingual duplicate detection and deep model adaptation for the Kazakh language.
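As a rough illustration of how a lexical signal and a term-weighting signal might be blended, the sketch below combines character n-gram Jaccard overlap with TF-IDF cosine similarity. It is not the authors' pipeline: the function names, the trigram size, the smoothed IDF formula, and the equal weighting are all assumptions made for the example, and LSH, LSA, and LDA are left out. The small `f1_score` helper only checks that the reported precision and recall are consistent with the reported F1-score.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Set of character n-grams of a lowercased, whitespace-normalized text."""
    normalized = " ".join(text.lower().split())
    return {normalized[i:i + n] for i in range(len(normalized) - n + 1)}


def jaccard(a, b):
    """Jaccard overlap of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def tfidf_vectors(docs):
    """Sparse TF-IDF vectors (dicts) using a smoothed IDF (an assumed variant)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens))
            * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(weight * v[term] for term, weight in u.items() if term in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def hybrid_score(text_a, text_b, background=(), weight=0.5):
    """Weighted blend of n-gram Jaccard and TF-IDF cosine (hypothetical weighting)."""
    vectors = tfidf_vectors(list(background) + [text_a, text_b])
    lexical = jaccard(char_ngrams(text_a), char_ngrams(text_b))
    term_level = cosine(vectors[-2], vectors[-1])
    return weight * lexical + (1 - weight) * term_level


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# A suffixed word form ("кітап" vs "кітапты") still shares character trigrams,
# which is why character n-grams are a common choice for agglutinative languages.
score = hybrid_score("мен кітап оқыдым", "мен кітапты оқыдым")
reported_f1 = round(f1_score(1.00, 0.73), 2)  # 0.84, matching the abstract
```

Character n-grams tolerate suffix changes that defeat exact word matching, while the TF-IDF term down-weights words that are common across the corpus; a threshold on `score` would then decide whether a pair is flagged as a near-duplicate.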
Description
Publication type
Text
Text publication type
Article
ISSN
Keywords
anti-plagiarism, Kazakh language, combined models, text data analysis, near duplicates, semantic analysis, academic integrity, intelligent analysis system
Bibliographic description
Biloshchytska S., Tleubayeva A., Kuchanskyi O., Biloshchytskyi A., Andrashko Y., Toxanov S., Mukhatayev A., Sharipova S. Text similarity detection in agglutinative languages: a case study of Kazakh using hybrid n-gram and semantic models. Applied Sciences. Vol. 15, Issue 12. 2025. Pub. 6707. DOI: https://doi.org/10.3390/app15126707