A Gene Ontology-Based Pipeline for Selecting Significant Gene Subsets in Biomedical Applications

Babichev, Sergii; Yarema, Oleg; Liakh, Igor; Shumylo, Nataliia

Please use this identifier to cite or link to this item: https://dspace.uzhnu.edu.ua/jspui/handle/lib/74558

Title:	A Gene Ontology-Based Pipeline for Selecting Significant Gene Subsets in Biomedical Applications
Authors:	Babichev, Sergii; Yarema, Oleg; Liakh, Igor; Shumylo, Nataliia
Keywords:	Gene Ontology (GO); differential gene expression; GO enrichment analysis; machine learning; random forest; Bayesian optimization; precision medicine; feature selection
Publication date:	18-Apr-2025
Publisher:	Advances in Bioinformatics and Biomedical Engineering
Series/number:	15(8);4471
Abstract:	The growing volume and complexity of gene expression data necessitate biologically meaningful and statistically robust methods for feature selection to enhance the effectiveness of disease diagnosis systems. The present study addresses this challenge by proposing a pipeline that integrates RNA-seq data preprocessing, differential gene expression analysis, Gene Ontology (GO) enrichment, and ensemble-based machine learning. The pipeline employs the non-parametric Kruskal–Wallis test to identify differentially expressed genes, followed by dual enrichment analysis using both Fisher’s exact test and the Kolmogorov–Smirnov test across three GO categories: Biological Process (BP), Molecular Function (MF), and Cellular Component (CC). Genes associated with GO terms found significant by both tests were used to construct multiple gene subsets, including subsets based on individual categories, their union, and their intersection. Classification experiments using a random forest model, validated via 5-fold cross-validation, demonstrated that gene subsets derived from the CC category and the union of all categories achieved the highest accuracy and weighted F1-scores, exceeding 0.97 across 14 cancer types. In contrast, subsets derived from BP, MF, and especially their intersection exhibited lower performance. These results confirm the discriminative power of spatially localized gene annotations and underscore the value of integrating statistical and functional information into gene selection. The proposed approach improves the reliability of biomarker discovery and supports downstream analyses such as clustering and biclustering, providing a strong foundation for developing precise diagnostic tools in personalized medicine.
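The subset-construction step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the significance threshold `alpha=0.05`, and the toy identifiers are all assumptions made for illustration. The sketch computes a one-sided Fisher's exact (hypergeometric upper-tail) p-value for a single GO term, keeps only genes whose terms pass both tests (the Kolmogorov–Smirnov p-values are taken as precomputed inputs here), and then forms the per-category, union, and intersection subsets.

```python
from math import comb


def fisher_enrichment_p(k, n, K, N):
    """One-sided Fisher's exact p-value for over-representation.

    N: genes in the background, K: background genes annotated to the term,
    n: differentially expressed genes, k: of those, genes annotated to the term.
    Returns P(X >= k) under the hypergeometric null.
    """
    upper = min(n, K)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / comb(N, n)


def genes_for_significant_terms(term_to_genes, fisher_p, ks_p, alpha=0.05):
    """Collect genes annotated to terms significant under BOTH tests.

    alpha is an assumed threshold; the abstract does not state one.
    """
    selected = set()
    for term, genes in term_to_genes.items():
        if fisher_p[term] < alpha and ks_p[term] < alpha:
            selected |= set(genes)
    return selected


def build_subsets(per_category):
    """Add 'union' and 'intersection' subsets to the per-category gene sets.

    per_category maps a category name (e.g. 'BP', 'MF', 'CC') to a set of
    genes and must be non-empty.
    """
    sets = list(per_category.values())
    subsets = dict(per_category)
    subsets["union"] = set().union(*sets)
    subsets["intersection"] = set.intersection(*sets)
    return subsets


# Toy usage with hypothetical gene and term names.
toy = build_subsets({"BP": {"g1", "g2"}, "MF": {"g2", "g3"}, "CC": {"g2", "g4"}})
```

Each resulting subset would then serve as the feature set for a classifier such as the random forest mentioned in the abstract. The intersection is by construction the smallest subset.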
Type:	Text
Publication type:	Article
URI (Uniform Resource Identifier):	https://dspace.uzhnu.edu.ua/jspui/handle/lib/74558
ISSN:	2076-3417
Appears in collections:	Scientific publications of the Department of Informatics and Physical and Mathematical Disciplines

Files in this item:

File	Description	Size	Format
applsci-15-04471.pdf	Article	3.63 MB	Adobe PDF

