Evaluating Large Language Models for Automated Reporting and Data Systems Categorization: Cross-Sectional Study
Summary
This summary is machine-generated. Large language models show promise for radiology, but their performance on Reporting and Data Systems (RADS) categorization varies. Claude-2, with structured prompts and guideline PDFs, achieved higher accuracy, especially with LI-RADS 2018.
Area Of Science
- Artificial Intelligence in Medical Imaging
- Natural Language Processing in Healthcare
- Radiology Workflow Optimization
Background
- Large language models (LLMs) offer potential for enhancing radiology workflows.
- Performance of LLMs on structured radiological tasks like Reporting and Data Systems (RADS) categorization is largely unexplored.
- This study investigates LLM capabilities in standardized radiological reporting.
Purpose Of The Study
- To evaluate three LLM chatbots: Claude-2, GPT-3.5, and GPT-4.
- To assess their accuracy in assigning RADS categories to radiology reports.
- To determine the impact of different prompting strategies on LLM performance.
Main Methods
- A cross-sectional study compared three chatbots using 30 radiology reports across LI-RADS, Lung-RADS, and O-RADS.
- A three-level prompting strategy was employed: zero-shot, few-shot, and guideline PDF-informed prompts.
- Radiology reports were prepared by board-certified radiologists, and chatbot responses were assessed by blinded reviewers.
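The three prompt levels listed above can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the function name `build_prompt`, the instruction wording, the level labels, and the choice to make the guideline level cumulative (guideline text plus few-shot examples). The summary does not give the study's actual prompt text, and the guideline PDFs are represented here by a plain string.

```python
from typing import Sequence, Tuple

# Hypothetical labels for the three prompt levels described in the summary.
LEVELS = ("zero-shot", "few-shot", "guideline")


def build_prompt(
    report: str,
    level: str,
    examples: Sequence[Tuple[str, str]] = (),
    guideline_text: str = "",
) -> str:
    """Assemble a prompt for one radiology report at the given level.

    Assumption: the guideline level also includes the few-shot examples.
    """
    if level not in LEVELS:
        raise ValueError(f"unknown prompt level: {level!r}")

    # Placeholder instruction; not the wording used in the study.
    parts = ["Assign a RADS category to the following radiology report."]

    if level == "guideline":
        parts.append("Guideline excerpt:\n" + guideline_text)

    if level in ("few-shot", "guideline"):
        for example_report, example_category in examples:
            parts.append(
                f"Example report:\n{example_report}\nCategory: {example_category}"
            )

    parts.append(f"Report:\n{report}\nCategory:")
    return "\n\n".join(parts)
```

Calling `build_prompt(report, "zero-shot")` yields only the instruction and the report, while the other two levels add worked examples and, for the guideline level, the guideline text.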
Main Results
- Claude-2 demonstrated the highest accuracy (57% average) with few-shot prompts and guideline PDFs, particularly for LI-RADS 2018 (75% accuracy).
- Prompt engineering significantly improved accuracy for all chatbots; Claude-2 improved further with the more specific prompts, whereas GPT-4 did not.
- Chatbots performed better on LI-RADS 2018 than on Lung-RADS 2022 and O-RADS.
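Accuracy figures like those above are proportions of correctly categorized responses within a group (for example, one chatbot under one prompt level, or one RADS system). A minimal sketch of that tally follows. The function name `accuracy_by_group` and the input layout are assumptions for illustration, not the study's analysis code.

```python
from collections import defaultdict
from typing import Dict, Hashable, Iterable, Tuple


def accuracy_by_group(
    graded: Iterable[Tuple[Hashable, bool]]
) -> Dict[Hashable, float]:
    """Return the fraction of correct responses per group.

    `graded` holds (group, is_correct) pairs, where `group` could be a
    (chatbot, prompt level) tuple and `is_correct` is the reviewer's verdict.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for group, is_correct in graded:
        totals[group] += 1
        correct[group] += int(bool(is_correct))
    return {group: correct[group] / totals[group] for group in totals}
```

Grouping by a tuple such as `(chatbot, prompt_level, rads_system)` would give the per-system breakdown with the same function.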
Conclusions
- Claude-2 shows potential for RADS categorization when provided with structured prompts and guideline PDFs, especially for LI-RADS 2018.
- The current generation of LLMs struggles to categorize cases accurately under more recent RADS criteria.
- Further development is needed to improve LLM accuracy and reliability in diverse radiological reporting scenarios.

