Comparative performance of large language models for patient-initiated ophthalmology consultations
Summary
This summary is machine-generated. Large language models (LLMs) offer medical advice, but their performance varies. ChatGPT-4o and DeepSeek-V3 excel in accuracy and consistency for general users seeking ophthalmology information.
Area Of Science
- Artificial Intelligence in Healthcare
- Medical Informatics
- Ophthalmology
Background
- Large language models (LLMs) are increasingly used by the public for medical information.
- Evaluating the reliability of LLM responses for medical advice is crucial.
Purpose Of The Study
- To comprehensively evaluate the performance of five LLMs in generating responses to ophthalmology-related patient questions.
- To compare LLM performance across accuracy, logical consistency, coherence, safety, and accessibility.
Main Methods
- Thirty-one common ophthalmology patient questions were used to query five LLMs: ChatGPT-4o, DeepSeek-V3, Doubao, Wenxin Yiyan 4.0 Turbo, and Qwen.
- Responses were assessed using a five-point Likert scale across five domains.
- Quantitative analysis of textual characteristics (character, word, sentence counts) was performed.
Main Results
- ChatGPT-4o and DeepSeek-V3 showed the highest overall performance, with statistically superior accuracy and logical consistency.
- Doubao and Wenxin Yiyan 4.0 Turbo exhibited significant safety deficiencies.
- Qwen produced significantly longer responses than the other models.
Conclusions
- ChatGPT-4o and DeepSeek-V3 are recommended for laypersons seeking ophthalmic information.
- Doubao and Qwen may be more suitable for users with medical training due to richer clinical terminology.
- Wenxin Yiyan 4.0 Turbo is effective for patient understanding of diagnostic procedures.
- Further research, including randomized controlled trials, is needed to assess LLM integration in patient triage.

