Diagnostic performance of generative artificial intelligences for a series of complex case reports
Summary
This summary is machine-generated. Generative artificial intelligences (AIs) differ in diagnostic performance. On complex medical cases, ChatGPT-4 produced more accurate differential diagnoses than Google Gemini and LLaMA2 chatbot.
Area Of Science
- Medical Artificial Intelligence
- Clinical Diagnostics
- Large Language Models
Background
- The diagnostic capabilities of generative artificial intelligences (AIs) utilizing large language models (LLMs) across diverse medical specialties remain largely uncharacterized.
- Evaluating the diagnostic performance of these AIs is crucial for understanding their potential role in clinical decision support.
Purpose Of The Study
- To assess and compare the diagnostic performance of leading generative AIs in generating differential diagnoses for complex medical cases.
- To identify variations in accuracy among different LLM-based AI platforms.
Main Methods
- Analysis of 392 published case reports from the American Journal of Case Reports (Jan 2022-Mar 2023), excluding pediatric and management-focused cases.
- Three generative AIs (ChatGPT-4, Google Gemini, LLaMA2 chatbot) each generated a top-10 differential diagnosis list from every case description.
- Two physicians independently verified the inclusion of the final diagnosis in the AI-generated lists.
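The scoring step described above can be illustrated with a minimal sketch. It assumes each physician judgment has already been recorded as the 1-based position of the final diagnosis in an AI's list, or as None when the diagnosis was judged absent. The function name `inclusion_rates` and the toy data are hypothetical and are not taken from the study.

```python
from typing import Optional, Sequence, Tuple


def inclusion_rates(ranks: Sequence[Optional[int]], k: int = 10) -> Tuple[float, float]:
    """Return (top-k inclusion rate, top-1 rate) from judged ranks.

    Each entry is the 1-based position of the final diagnosis in the
    AI-generated list, or None when reviewers judged it absent.
    """
    if not ranks:
        raise ValueError("no cases to score")
    n = len(ranks)
    # A case counts toward top-k when the final diagnosis appears at rank <= k.
    top_k = sum(1 for r in ranks if r is not None and r <= k)
    # A case counts toward top-1 only when the final diagnosis is listed first.
    top_1 = sum(1 for r in ranks if r == 1)
    return top_k / n, top_1 / n


# Toy, made-up judgments for five cases (NOT study data).
toy_ranks = [1, 3, None, 1, 10]
top10_rate, top1_rate = inclusion_rates(toy_ranks)
print(f"top-10 inclusion: {top10_rate:.1%}, top diagnosis: {top1_rate:.1%}")
```

Storing a rank rather than a yes/no flag lets both reported metrics, top-10 inclusion and top-diagnosis match, come from the same judgment.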
Main Results
- ChatGPT-4 achieved an 86.7% inclusion rate for the final diagnosis in its top 10 differential diagnosis (DDx) lists, significantly outperforming Google Gemini (68.6%) and LLaMA2 chatbot (54.6%).
- ChatGPT-4 also showed the highest rate of matching the final diagnosis as the top listed diagnosis (54.6%), followed by Google Gemini (31.4%) and LLaMA2 chatbot (23.0%).
- Statistical analysis confirmed ChatGPT-4's superior diagnostic accuracy over both Google Gemini and LLaMA2 chatbot (P < 0.001), and Google Gemini's superiority over LLaMA2 chatbot (P < 0.001 for top 10 DDx, P = 0.010 for top diagnosis).
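As a rough check on the figures above, the reported percentages can be converted back to approximate case counts out of 392 and compared pairwise. The summary does not say which statistical test the authors used, so the sketch below assumes a pooled two-proportion z-test that treats the groups as independent. This is a simplification, since all three AIs saw the same cases and a paired test would need per-case data that the summary does not provide. The helper names are hypothetical, and the resulting p-values need not reproduce the reported ones.

```python
import math

N_CASES = 392  # case reports analyzed, per the summary


def implied_count(rate_percent: float, n: int = N_CASES) -> int:
    """Nearest whole number of cases consistent with a reported percentage."""
    return round(rate_percent / 100 * n)


def two_proportion_p(x1: int, x2: int, n: int = N_CASES) -> float:
    """Two-sided p-value of a pooled two-proportion z-test.

    Assumes two independent groups of equal size n (a simplification here).
    """
    pooled = (x1 + x2) / (2 * n)
    se = math.sqrt(pooled * (1 - pooled) * 2 / n)
    z = (x1 / n - x2 / n) / se
    # erfc(|z| / sqrt(2)) equals 2 * (1 - Phi(|z|)) for the standard normal.
    return math.erfc(abs(z) / math.sqrt(2))


# Percentages as reported in the summary.
top10_pct = {"ChatGPT-4": 86.7, "Google Gemini": 68.6, "LLaMA2 chatbot": 54.6}
top1_pct = {"ChatGPT-4": 54.6, "Google Gemini": 31.4, "LLaMA2 chatbot": 23.0}

top10 = {name: implied_count(p) for name, p in top10_pct.items()}
top1 = {name: implied_count(p) for name, p in top1_pct.items()}

p_top10_gpt_vs_gemini = two_proportion_p(top10["ChatGPT-4"], top10["Google Gemini"])
p_top1_gemini_vs_llama = two_proportion_p(top1["Google Gemini"], top1["LLaMA2 chatbot"])

print(top10, top1)
print(p_top10_gpt_vs_gemini, p_top1_gemini_vs_llama)
```

The implied counts are back-calculated from rounded percentages, so they are estimates rather than figures stated in the source.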
Conclusions
- Generative AIs exhibit distinct levels of diagnostic performance in complex medical case series.
- ChatGPT-4 demonstrates higher diagnostic accuracy than Google Gemini and LLaMA2 chatbot for differential diagnosis.
- Understanding these performance differences is vital for the effective integration of generative AIs in clinical practice, particularly in general medicine.

