Improving Radiology Report Error Detection Using a Multipass Large Language Model: Framework Development and Validation

Area of Science:

Artificial Intelligence in Medical Imaging
Radiology Quality Assurance
Natural Language Processing in Healthcare

Background:

Large language models (LLMs) used to proofread radiology reports often produce many false positives (FPs): because true errors are rare in clinical reports, even a small false-alarm rate can yield more false flags than real errors.
This limitation hinders the practical application of LLMs for automated quality assurance in radiology.
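
The base-rate effect described above can be illustrated with Bayes' rule. The sketch below is only an illustration: the 90% sensitivity and specificity and the three prevalence values are hypothetical and are not taken from the study.

```python
def ppv(sensitivity: float, specificity: float, prevalence: float) -> float:
    """Positive predictive value from Bayes' rule."""
    true_pos = sensitivity * prevalence                    # flagged reports that contain an error
    false_pos = (1.0 - specificity) * (1.0 - prevalence)   # flagged reports that are error-free
    return true_pos / (true_pos + false_pos)


# Hypothetical detector (assumed values, not from the study):
# the same detector looks far less precise as errors become rarer.
for prevalence in (0.50, 0.10, 0.01):
    print(f"prevalence={prevalence:.2f}  PPV={ppv(0.90, 0.90, prevalence):.3f}")
```

With these assumed inputs, precision drops sharply as prevalence falls even though the detector itself is unchanged, which is the situation a low-error clinical dataset creates.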

Purpose of the Study:

To evaluate an optimized LLM framework designed to enhance precision and cost-efficiency in detecting errors within radiology reports.
To determine if the proposed framework could maintain or improve error detection capabilities while reducing false positives.

Main Methods:

A retrospective analysis of 1000 radiology reports across various modalities (radiography, ultrasonography, CT, MRI) from the MIMIC-III database.
Evaluation of three LLM frameworks: (1) a single-prompt detector, (2) a report extractor plus single-prompt detector, and (3) a multipass framework with an FP verifier.
Assessment of precision using positive predictive value (PPV) and error detection rates, alongside estimation of model inference and reviewer labor costs.
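
The multipass idea can be sketched as a chain of stages in which a report reaches a human reviewer only if it survives every pass. This is a minimal sketch under stated assumptions: the summary does not give the exact stage order or the prompts, so the extractor, detector, and verifier below are hypothetical toy functions standing in for LLM calls, and `Report` and `multipass_screen` are names invented for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Report:
    report_id: str
    text: str


def multipass_screen(
    reports: Iterable[Report],
    extract: Callable[[str], str],
    detect: Callable[[str], bool],
    verify: Callable[[str], bool],
) -> List[str]:
    """Return IDs of reports that should go to a human reviewer."""
    flagged = []
    for report in reports:
        content = extract(report.text)   # pass 1: keep only the relevant report content
        if not detect(content):          # pass 2: does the detector suspect an error?
            continue
        if verify(content):              # pass 3: the FP verifier must confirm the flag
            flagged.append(report.report_id)
    return flagged


# Toy stand-ins (assumptions): real stages would be LLM prompts.
def toy_extract(text: str) -> str:
    return text.strip().lower()

def toy_detect(text: str) -> bool:
    return "left" in text and "right" in text   # crude laterality-mismatch check

def toy_verify(text: str) -> bool:
    return "bilateral" not in text              # discard flags explained by bilateral findings


reports = [
    Report("r1", "Left pleural effusion. Impression: right pleural effusion."),
    Report("r2", "Bilateral effusions, left greater than right."),
    Report("r3", "No acute findings."),
]
flagged = multipass_screen(reports, toy_extract, toy_detect, toy_verify)
print(flagged)
```

In this toy run the detector flags two reports and the verifier discards one of them, so fewer reports reach the reviewer than with the detector alone.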

Main Results:

The multipass LLM framework (framework 3) achieved a significantly higher PPV (0.159) than the single-prompt frameworks 1 and 2 (0.063-0.079).
Human review burden fell by about 54% (from 192 to 88 reports per 1000), and model inference costs decreased by up to 42.6%.
Remaining FPs were primarily associated with complex clinical context, indicating a shift from structural errors to nuanced discrepancies.
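
The headline figures can be checked with simple arithmetic. The sketch below uses only numbers quoted in the results above; the helper name `relative_reduction` is introduced here for illustration.

```python
def relative_reduction(before: float, after: float) -> float:
    """Fractional decrease from `before` to `after`."""
    return (before - after) / before


# Reports needing human review per 1000, as reported above.
review_cut = relative_reduction(192, 88)

# PPV of framework 3 relative to the best and worst single-prompt PPVs.
ppv_gain_vs_best = 0.159 / 0.079
ppv_gain_vs_worst = 0.159 / 0.063

print(f"review burden reduction: {review_cut:.1%}")
print(f"PPV gain: {ppv_gain_vs_best:.2f}x to {ppv_gain_vs_worst:.2f}x")
```

On these figures the review burden falls by a little over half, and the multipass PPV is roughly two to two and a half times that of the single-prompt frameworks.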

Conclusions:

The optimized multipass LLM framework effectively improves precision and cost-efficiency for radiology report error detection in low-prevalence settings.
This approach facilitates a synergistic AI-radiologist collaboration, offering a scalable and cost-effective solution for AI-assisted quality assurance in radiology.
The framework enables a targeted human-in-the-loop workflow by filtering out simple errors, allowing human reviewers to focus on complex cases.