Development of the combined method of identification of near duplicates in electronic scientific works

Лізунов, Петро Петрович; Білощицький, Андрій Олександрович; Кучанський, Олександр Юрійович; Андрашко, Юрій Васильович; Білощицька, Світлана; Сербін, Олег

Please use this identifier to cite or link to this item: https://dspace.uzhnu.edu.ua/jspui/handle/lib/42600

Title:	Development of the combined method of identification of near duplicates in electronic scientific works
Authors:	Лізунов, Петро Петрович; Білощицький, Андрій Олександрович; Кучанський, Олександр Юрійович; Андрашко, Юрій Васильович; Білощицька, Світлана; Сербін, Олег
Keywords:	near-duplicate, electronic scientific paper, antiplagiarism system, locally sensitive hashing
Issue Date:	2021
Publisher:	Eastern-European Journal of Enterprise Technologies
Citation:	Lizunov P., Biloshchytskyi A., Kuchansky A., Andrashko Y., Biloshchytska S., Serbin O. Development of the combined method of identification of near duplicates in electronic scientific works. Eastern-European Journal of Enterprise Technologies. 2021. Vol. 4/4 (112). P. 57–63. DOI: https://doi.org/10.15587/1729-4061.2021.238318
Abstract:	The methods for identification of near-duplicates in electronic scientific papers, which include the content of the same type, for example, text data, mathematical formulas, numerical data, etc., were described. For text data, the method of locally sensitive hashing with the finding of Hamming distance between the elements of indices of electronic scientific papers was formalized. If Hamming distance exceeds a fixed numerical threshold, a scientific paper contains a near-duplicate. For numerical data, sub-sequences for each scientific work are formed and the proximity between the papers is determined as the Euclidean distance between the vectors consisting of the numbers of these sub-sequences. To compare mathematical formulas, the method for comparing the sample of formulas is used and the names of variables are compared. To identify near-duplicates in graphic information, two approaches are distinguished: finding key points in the image and applying locally sensitive hashing to individual pixels of the image. Since scientific papers often include such objects as schemes and diagrams, their captions are examined separately using the methods for comparing text information. The combined method for identification of near-duplicates in electronic scientific papers, which combines the methods for identification of near-duplicates of various types of data, was proposed. To implement the combined method for the identification of near-duplicates in electronic scientific papers, an information-analytical system that processes scientific materials depending on the content type was devised. This makes it possible to identify near-duplicates reliably and to detect as wide a range as possible of abuses and plagiarism in electronic scientific papers: scientific articles, dissertations, monographs, conference materials, etc.
Type:	Text
Publication type:	Article
URI:	https://dspace.uzhnu.edu.ua/jspui/handle/lib/42600
ISSN:	1729-3774
Appears in Collections:	Scientific publications of the Department of Systems Analysis and Optimization Theory
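
The abstract names two distance-based comparisons: a hash fingerprint compared by Hamming distance for text, and a Euclidean distance between numeric vectors. The Python sketch below illustrates both ideas in a minimal form. It is not the authors' implementation: SimHash over word shingles is swapped in as one common locality-sensitive scheme, and the function names, the shingle size `k`, the 64-bit fingerprint width, and the `max_distance` threshold are all hypothetical choices. It also assumes the common convention that a smaller Hamming distance means closer texts, so the direction of the threshold test is an assumption rather than the paper's formulation.

```python
import hashlib
import math
import re


def shingles(text, k=3):
    """Return the set of k-word shingles of a lowercased text."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def simhash(text, bits=64, k=3):
    """Compute a SimHash fingerprint (an assumed stand-in for the paper's hashing)."""
    counts = [0] * bits
    for sh in shingles(text, k):
        digest = hashlib.md5(sh.encode("utf-8")).digest()
        h = int.from_bytes(digest[:bits // 8], "big")
        # Each shingle votes +1 or -1 on every bit position.
        for i in range(bits):
            counts[i] += 1 if (h >> i) & 1 else -1
    # A bit is set in the fingerprint when its votes are positive.
    return sum(1 << i for i in range(bits) if counts[i] > 0)


def hamming(a, b):
    """Number of bit positions in which two integer fingerprints differ."""
    return bin(a ^ b).count("1")


def is_near_duplicate(text_a, text_b, max_distance=8):
    """Flag a pair as near-duplicate under a hypothetical distance threshold."""
    return hamming(simhash(text_a), simhash(text_b)) <= max_distance


def euclidean(u, v):
    """Euclidean distance between two equal-length numeric vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))
```

In practice the threshold would need to be tuned on labelled pairs, and numeric sub-sequences of different lengths would have to be aligned or padded before `euclidean` is applied; the abstract does not specify either step.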

Files in This Item:

File	Description	Size	Format
238318-Article Text-549440-2-10-20210901.pdf		184.97 kB	Adobe PDF
