Comparing the performance of ChatGPT GPT-4, Bard, and Llama-2 in the Taiwan Psychiatric Licensing Examination and in differential diagnosis with multi-center psychiatrists
Summary
This summary is machine-generated. GPT-4 demonstrated superior performance in a psychiatric licensing exam and differential diagnosis tasks compared to other large language models (LLMs). This suggests GPT-4 shows promise as a valuable tool in psychiatric practice.
Area Of Science
- Psychiatry and Artificial Intelligence
- Natural Language Processing in Medicine
- Clinical Decision Support Systems
Background
- Large language models (LLMs) are increasingly explored for medical applications.
- The utility of LLMs in the specialized field of psychiatry remains under-investigated.
- Evaluating LLM performance in psychiatric assessments is crucial for understanding their potential.
Purpose Of The Study
- To assess and compare the performance of leading LLMs (GPT-4, Bard, Llama-2) in psychiatric evaluations.
- To evaluate LLM capabilities in a standardized psychiatric licensing examination.
- To compare LLM performance in complex clinical differential diagnosis scenarios against experienced psychiatrists.
Main Methods
- Comparative analysis of GPT-4, Bard, and Llama-2 performance on the 2022 Taiwan Psychiatric Licensing Examination.
- Evaluation of LLM responses to advanced clinical scenario questions for psychiatric differential diagnosis.
- Benchmarking LLM scores against those of 24 experienced psychiatrists on the same diagnostic questions.
Main Results
- GPT-4 was the only LLM to pass the 2022 Taiwan Psychiatric Licensing Examination (score 69; passing threshold 60).
- GPT-4 significantly outperformed Bard (score 36) and Llama-2 (score 25), notably in exam sections such as 'Pathophysiology & Epidemiology' and 'Psychopharmacology & Other therapies'.
- In differential diagnosis, GPT-4 (score 5) approached the performance of experienced psychiatrists (mean 6.1), outperforming Bard (3) and Llama-2 (1).
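
The comparisons in the results above can be tabulated in a few lines. This is only an illustrative sketch: the numbers are the ones reported in this summary, while the variable and function names (`passed`, `gap_to_psychiatrists`) are hypothetical and not part of the study.

```python
# Scores as reported in this summary; all names here are illustrative assumptions.
PASS_MARK = 60            # passing threshold for the 2022 examination
PSYCHIATRIST_MEAN = 6.1   # mean differential-diagnosis score of the 24 psychiatrists

exam_scores = {"GPT-4": 69, "Bard": 36, "Llama-2": 25}
diagnosis_scores = {"GPT-4": 5, "Bard": 3, "Llama-2": 1}


def passed(scores, pass_mark=PASS_MARK):
    """Return the models whose exam score meets or exceeds the pass mark."""
    return [name for name, score in scores.items() if score >= pass_mark]


def gap_to_psychiatrists(scores, reference=PSYCHIATRIST_MEAN):
    """Return each model's shortfall relative to the psychiatrist mean."""
    return {name: round(reference - score, 1) for name, score in scores.items()}


print(passed(exam_scores))
print(gap_to_psychiatrists(diagnosis_scores))
```

The sketch only restates the reported figures; it does not reproduce the statistical tests behind the word "significantly".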
Conclusions
- GPT-4 exhibits superior capabilities in psychiatric symptom identification and clinical judgment compared to Bard and Llama-2.
- GPT-4's differential diagnosis performance closely mirrors that of seasoned psychiatrists.
- GPT-4 presents significant potential as an assistive tool in psychiatric practice and education.

