Confidence-Accuracy Alignment in Cardiology Knowledge: Comparing Medical-Specific and General-Purpose Large Language Models Using ACCSAP | JoVE Visualize

Area of Science:

Artificial Intelligence in Medicine
Clinical Decision Support Systems
Medical Education Technology

Background:

Large language models (LLMs) are increasingly used in healthcare, but their clinical reliability hinges on accuracy and confidence calibration.
General-purpose LLMs show promise on medical tasks, and medical-specific LLMs are built for domain alignment, yet how the two compare in clinical reliability remains unclear.
Cardiology, with its intricate case-based reasoning, presents a high-stakes environment to evaluate LLM performance.

Purpose of the Study:

To compare the diagnostic accuracy, confidence calibration, uncertainty, and fidelity of general-purpose and medical-specific LLMs on a cardiology knowledge benchmark.
To assess the impact of domain specialization versus broad training on LLM performance in a complex medical field.

Main Methods:

Evaluated the models on 365 text-based cardiology questions from the ACCSAP; image-dependent items were excluded.
Compared ChatGPT-4o and Gemini 2.5 Pro (general-purpose) against MedGemma 27B (medical-specific LLM).
Used standardized prompts to elicit stepwise reasoning, an answer selection, and self-reported confidence, uncertainty, and fidelity; the responses were then analyzed statistically.
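
The scoring step described above can be sketched roughly as follows. This is a minimal illustration, not the study's pipeline: the `Response` record, its field names, the `image_dependent` flag, the 0-100 confidence scale, and the toy data are all assumptions, since the summary does not give the prompt format or response schema.

```python
from dataclasses import dataclass


@dataclass
class Response:
    """One parsed model answer (hypothetical schema)."""
    question_id: str
    image_dependent: bool   # assumed flag marking items that need an image
    selected: str           # answer option chosen by the model
    correct: str            # answer key
    confidence: float       # assumed self-reported confidence, 0-100


def text_only(responses):
    """Drop image-dependent items, mirroring the exclusion described above."""
    return [r for r in responses if not r.image_dependent]


def accuracy(responses):
    """Fraction of responses whose selected option matches the key."""
    if not responses:
        raise ValueError("no responses to score")
    return sum(r.selected == r.correct for r in responses) / len(responses)


# Toy data for illustration only; these are not study results.
toy = [
    Response("q1", False, "A", "A", 90.0),
    Response("q2", False, "B", "C", 85.0),
    Response("q3", True, "D", "D", 95.0),
    Response("q4", False, "C", "C", 80.0),
]
scored = text_only(toy)
toy_accuracy = accuracy(scored)
```

Filtering before scoring keeps the denominator limited to items the models could answer from text alone.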

Main Results:

General-purpose LLMs demonstrated higher accuracy: Gemini (87%), ChatGPT (85%), versus MedGemma (67%).
All models reported high confidence, but confidence-accuracy calibration was modest: reported confidence differed only slightly between correct and incorrect answers.
ChatGPT showed the strongest confidence-accuracy correlation (r=0.80), while MedGemma exhibited higher uncertainty and lower fidelity.
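
To make the calibration findings above concrete, here is a small sketch of three common ways to relate self-reported confidence to correctness: the mean confidence gap between correct and incorrect answers, a Pearson correlation against binary correctness, and a binned expected calibration error. The summary does not state which statistics the study used, so these function names, the 0-1 confidence scale, the bin count, and the toy values are assumptions for illustration only.

```python
from statistics import mean


def confidence_gap(confidences, correct):
    """Mean confidence on correct answers minus mean on incorrect answers."""
    right = [c for c, ok in zip(confidences, correct) if ok]
    wrong = [c for c, ok in zip(confidences, correct) if not ok]
    if not right or not wrong:
        raise ValueError("need at least one correct and one incorrect answer")
    return mean(right) - mean(wrong)


def pearson_r(xs, ys):
    """Pearson correlation; with binary ys this is the point-biserial r."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (vx * vy) ** 0.5


def expected_calibration_error(confidences, correct, n_bins=5):
    """Weighted mean |confidence - accuracy| over equal-width bins on [0, 1]."""
    bins = [[] for _ in range(n_bins)]
    for c, ok in zip(confidences, correct):
        idx = min(int(c * n_bins), n_bins - 1)  # c == 1.0 goes in the top bin
        bins[idx].append((c, ok))
    total = len(confidences)
    ece = 0.0
    for b in bins:
        if b:
            avg_conf = mean(c for c, _ in b)
            acc = mean(1.0 if ok else 0.0 for _, ok in b)
            ece += len(b) / total * abs(avg_conf - acc)
    return ece


# Toy values for illustration only; these are not study results.
conf = [0.90, 0.85, 0.80, 0.95]
ok = [True, False, True, True]
gap = confidence_gap(conf, ok)
r = pearson_r(conf, [1.0 if o else 0.0 for o in ok])
ece = expected_calibration_error(conf, ok)
```

A small gap alongside uniformly high confidence is the pattern the results describe: the model sounds about as sure when it is wrong as when it is right.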

Conclusions:

General-purpose LLMs may offer advantages in complex clinical reasoning tasks within cardiology compared to specialized models.
Confidence calibration remains a significant challenge for all evaluated LLMs, rendering self-reported certainty an unreliable indicator of correctness.
Current LLM applications in cardiology should be supportive and clinician-supervised until uncertainty estimation and calibration improve.