Please use this identifier to cite or link to this item: https://dspace.uzhnu.edu.ua/jspui/handle/lib/68102
Full metadata record
DC Field | Value | Language
dc.contributor.author | Михалко, Ярослав Омелянович | -
dc.contributor.author | Філак, Ярослав Феліксович | -
dc.contributor.author | Дуткевич-Іванська, Юлія Василівна | -
dc.contributor.author | Сабадош, Мар’яна Володимирівна | -
dc.contributor.author | Рубцова, Єлізавета Іллівна | -
dc.date.accessioned | 2024-11-30T09:34:42Z | -
dc.date.available | 2024-11-30T09:34:42Z | -
dc.date.issued | 2024-10 | -
dc.identifier.citation | From open-ended to multiple-choice: evaluating diagnostic performance and consistency of ChatGPT, Google Gemini and Claude AI / Y. O. Mykhalko, Y. F. Filak, Y. V. Dutkevych-Ivanska, M. V. Sabadosh, Y. I. Rubtsova // Wiadomości Lekarskie Medical Advances. – 2024. – Vol. 77(10). – P. 1852-1856. | uk
dc.identifier.issn | 0043-5147 | -
dc.identifier.uri | https://dspace.uzhnu.edu.ua/jspui/handle/lib/68102 | -
dc.description.abstract | Aim: To determine the performance and response repeatability of freely available LLMs in diagnosing diseases based on clinical case descriptions. Materials and Methods: 100 detailed clinical case descriptions were used to evaluate the diagnostic performance of ChatGPT 3.5, ChatGPT 4o, Google Gemini, and Claude AI 3.5 Sonnet large language models (LLMs). The analysis was conducted in two phases: Phase 1 with only case descriptions, and Phase 2 with descriptions and answer variants. Each phase used specific prompts and was repeated twice to assess agreement. Response consistency was determined using agreement percentage and Cohen's Kappa (k). 95% confidence intervals for proportions were calculated using Wilson's method. Statistical significance was set at p<0.05 using Fisher's exact test. Results: In Phase 1 of the study, ChatGPT 3.5, ChatGPT 4o, Google Gemini, and Claude AI 3.5 Sonnet's efficacy was 69.00%, 64.00%, 44.00%, and 72.00% respectively. All models showed high consistency, as agreement percentages ranged from 93.00% to 97.00%, and k ranged from 0.86 to 0.94. In Phase 2 all models' performance increased significantly (90.00%, 95.00%, 65.00%, and 89.00% for ChatGPT 3.5, ChatGPT 4o, Google Gemini, and Claude AI 3.5 Sonnet respectively). The agreement percentages ranged from 97.00% to 99.00%, while k values were between 0.85 and 0.93. Conclusion: Claude AI 3.5 Sonnet and both ChatGPT models can be used effectively for the differential diagnosis process, while using these models for diagnosing from scratch should be done with caution. As Google Gemini's efficacy was low, its feasibility in real clinical practice is currently questionable. | uk
dc.language.iso | en | uk
dc.publisher | ALUNA Publishing | uk
dc.subject | artificial intelligence | uk
dc.subject | large language model | uk
dc.subject | diagnosis | uk
dc.subject | performance | uk
dc.title | From Open-Ended to Multiple-Choice: Evaluating Diagnostic Performance and Consistency of ChatGPT, Google Gemini and Claude AI | uk
dc.title.alternative | From Open-Ended to Multiple-Choice: Evaluating Diagnostic Performance and Consistency of ChatGPT, Google Gemini and Claude AI | uk
dc.type | Text | uk
dc.pubType | Article | uk
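The abstract reports three statistics: run-to-run agreement percentage, Cohen's Kappa for response consistency, and Wilson-method 95% confidence intervals for the efficacy proportions. The following Python sketch illustrates how such figures could be computed; the `wilson_ci` and `agreement_and_kappa` helpers and the example inputs are illustrative assumptions, not the study's actual data or code.

```python
import math

def agreement_and_kappa(run1, run2):
    """Raw agreement percentage and Cohen's kappa between two response runs.

    run1, run2: equal-length lists of labels (e.g. diagnoses per case).
    """
    assert len(run1) == len(run2)
    n = len(run1)
    # Observed agreement: fraction of cases where both runs gave the same answer
    p_o = sum(a == b for a, b in zip(run1, run2)) / n
    # Expected chance agreement from each run's marginal label frequencies
    labels = set(run1) | set(run2)
    p_e = sum((run1.count(l) / n) * (run2.count(l) / n) for l in labels)
    kappa = (p_o - p_e) / (1 - p_e)
    return p_o, kappa

def wilson_ci(successes, n, z=1.96):
    """Wilson score confidence interval for a proportion (z=1.96 -> 95%)."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

# Hypothetical example: 95% CI for 72 correct diagnoses out of 100 cases
low, high = wilson_ci(72, 100)
print(f"72/100 -> Wilson 95% CI: {low:.3f}-{high:.3f}")
```

Note that the Wilson interval, unlike the simpler normal approximation, stays inside [0, 1] and behaves sensibly for proportions near the extremes, which matters for the high agreement percentages (93–99%) reported here.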
Appears in Collections: Scientific publications of the Department of Therapy and Family Medicine

Files in This Item:
File | Description | Size | Format
article-wiadomosci-2024.pdf | | 3.46 MB | Adobe PDF | View/Open


All items in the digital repository are protected by copyright, with all rights reserved.