Evaluating the diagnostic reasoning of large language models in complex neuro-ophthalmological cases: a comparative analysis of GPT-o1 Pro, GPT-4o, Gemini, Grok 2 and DeepSeek
Summary
This summary is machine-generated. ChatGPT-o1 Pro demonstrates superior diagnostic reasoning in complex neuro-ophthalmology cases compared with other large language models (LLMs), showing significant potential for improving diagnostic accuracy in this specialized medical field.
Area Of Science
- Artificial Intelligence in Medicine
- Medical Diagnostics
- Ophthalmology and Neurology
Background
- Large language models (LLMs) are increasingly explored for clinical applications.
- Evaluating the diagnostic reasoning of LLMs in specialized medical fields is crucial.
- Neuro-ophthalmology presents complex diagnostic challenges requiring sophisticated reasoning.
Purpose Of The Study
- To compare the diagnostic reasoning capabilities of five leading LLMs.
- To assess LLM performance in complex neuro-ophthalmological case scenarios.
- To identify the most effective LLM for neuro-ophthalmology diagnostics.
Main Methods
- 18 clinical scenarios derived from six complex neuro-ophthalmological cases were used.
- Five LLMs (GPT-o1 Pro, GPT-4o, Google Gemini, Grok 2, DeepSeek) were evaluated.
- Responses were scored with the Revised-IDEA (R-IDEA) assessment tool and compared by word count.
Main Results
- GPT-o1 Pro significantly outperformed other LLMs in R-IDEA scores (8.80 vs. 6.80-6.94).
- GPT-o1 Pro achieved 100% high-quality responses and 88.9% 'Excellent' responses.
- GPT-o1 Pro provided the most concise responses, using significantly fewer words than GPT-4o and Gemini.
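The reported percentages can be related back to the 18 clinical scenarios. As a minimal sketch (assuming each of the 18 scenarios received one rated response, and that "Excellent" is a subset of "high-quality"), the figures above correspond to 18 of 18 high-quality and 16 of 18 "Excellent" responses:

```python
# Hypothetical reconstruction of the reported proportions,
# assuming one rated response per scenario (18 total).
TOTAL_SCENARIOS = 18
high_quality = 18   # 100% high-quality responses
excellent = 16      # 16/18 rounds to the reported 88.9%

high_quality_pct = round(high_quality / TOTAL_SCENARIOS * 100, 1)
excellent_pct = round(excellent / TOTAL_SCENARIOS * 100, 1)

print(high_quality_pct)  # 100.0
print(excellent_pct)     # 88.9
```

The counts are inferred from the percentages, not stated in the summary itself.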
Conclusions
- ChatGPT-o1 Pro exhibits superior clinical reasoning in neuro-ophthalmology.
- LLMs show promise in enhancing diagnostic processes in complex medical fields.
- Further research into AI-driven diagnostics in specialized medicine is warranted.

