Performance comparison of large language models in treatment planning for the restoration of endodontically treated teeth over time
Summary
This summary is machine-generated. Gemini excelled in restorative treatment planning for endodontically treated teeth, outperforming other large language models (LLMs). However, no LLM achieved perfect consistency, so clinical decision-making still requires human oversight.
Area Of Science
- Artificial Intelligence in Dentistry
- Clinical Decision Support Systems
- Endodontic Treatment Planning
Background
- Large language models (LLMs) are increasingly explored for clinical applications.
- Evaluating LLM performance in specialized fields like endodontic restorative treatment planning is crucial.
- Understanding LLM response variability over time and with exposure to expert data is essential.
Purpose Of The Study
- To compare the performance of five leading LLMs in restorative treatment planning for endodontically treated teeth.
- To assess LLM accuracy and completeness in restorative treatment planning scenarios.
- To evaluate the impact of repeated exposure and expert examples on LLM responses.
Main Methods
- Five LLMs (ChatGPT 4.5, DeepSeek R1, Gemini 2.5 Pro, Claude 3.7 Sonnet, Microsoft Copilot) were tested.
- Twenty-five case scenarios with a 39-item checklist were used for evaluation.
- LLMs were assessed over three weeks, with and without expert response examples, by blinded evaluators.
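The evaluation design above (case scenarios scored against a 39-item checklist) can be illustrated with a minimal sketch. The summary does not state how accuracy and completeness were computed, so the definitions below are assumptions: completeness is taken as the share of checklist items a response addresses, and accuracy as the share of addressed items judged correct. The `Rating` record, its field names, and the model labels are hypothetical, and the ratings are made-up placeholders rather than study data.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean

CHECKLIST_ITEMS = 39  # checklist length reported in the study


@dataclass(frozen=True)
class Rating:
    """One evaluator rating of one model response (hypothetical record layout)."""
    model: str
    scenario: int         # 1..25 in the study design
    week: int             # 1..3 in the study design
    items_addressed: int  # checklist items the response mentions
    items_correct: int    # addressed items judged correct


def completeness(r: Rating) -> float:
    """Assumed definition: share of the checklist the response covers."""
    return r.items_addressed / CHECKLIST_ITEMS


def accuracy(r: Rating) -> float:
    """Assumed definition: share of addressed items judged correct."""
    return r.items_correct / r.items_addressed if r.items_addressed else 0.0


def mean_by_model(ratings, metric):
    """Average a per-rating metric for each model."""
    grouped = defaultdict(list)
    for r in ratings:
        grouped[r.model].append(metric(r))
    return {model: mean(values) for model, values in grouped.items()}


# Made-up placeholder ratings, NOT data from the study.
ratings = [
    Rating("model_a", scenario=1, week=1, items_addressed=39, items_correct=39),
    Rating("model_a", scenario=2, week=1, items_addressed=26, items_correct=13),
    Rating("model_b", scenario=1, week=1, items_addressed=13, items_correct=13),
]
print(mean_by_model(ratings, completeness))
print(mean_by_model(ratings, accuracy))
```

Keeping the two metrics as separate functions makes it easy to swap in whatever scoring rule the evaluators actually used.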
Main Results
- Gemini 2.5 Pro demonstrated superior performance in accuracy and completeness compared to DeepSeek and Microsoft Copilot.
- Claude 3.7 Sonnet showed significant accuracy improvement over time.
- Gemini and Claude showed increased completeness after exposure to expert examples, but no LLM achieved perfect repeatability or consistently complete/accurate responses.
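One way to make "repeatability" concrete is sketched below. The summary does not define the measure, so this is an assumption: a model's repeatability is taken as the fraction of scenarios whose score was identical in every weekly session. The tuple layout, the model label, and the scores are hypothetical placeholders, not study data.

```python
from collections import defaultdict


def repeatability(scores):
    """Per model, the fraction of scenarios scored identically in every week.

    `scores` is an iterable of (model, scenario, week, score) tuples
    (an assumed layout for illustration).
    """
    by_case = defaultdict(dict)
    for model, scenario, week, score in scores:
        by_case[(model, scenario)][week] = score

    stable = defaultdict(int)
    total = defaultdict(int)
    for (model, _scenario), weekly in by_case.items():
        total[model] += 1
        if len(set(weekly.values())) == 1:  # same score in all recorded weeks
            stable[model] += 1
    return {model: stable[model] / total[model] for model in total}


# Made-up placeholder scores, NOT data from the study.
example = [
    ("model_a", 1, 1, 30), ("model_a", 1, 2, 30), ("model_a", 1, 3, 30),
    ("model_a", 2, 1, 30), ("model_a", 2, 2, 28), ("model_a", 2, 3, 30),
]
print(repeatability(example))
```

Under this definition, "perfect repeatability" corresponds to a value of 1.0, which the study reports no model reached.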
Conclusions
- Gemini 2.5 Pro is the most effective LLM for endodontic restorative treatment planning among those tested.
- DeepSeek R1 exhibited the lowest performance.
- Current LLMs should be used as adjunctive tools, requiring professional human supervision due to limitations in consistency and completeness.

