Beyond Diagnosis: Evaluating Multimodal LLMs for Pathology Localization in Chest Radiographs | JoVE Visualize

Area of Science:

Artificial Intelligence in Medicine
Medical Imaging Analysis
Natural Language Processing

Background:

Large language models (LLMs) and multimodal LLMs (MLLMs) demonstrate promise in medical diagnosis.
Beyond producing a diagnosis, accurately localizing pathological findings is crucial for interpreting medical images.
Evaluating localization abilities offers insights into models' spatial understanding of anatomy and disease.

Purpose of the Study:

To systematically assess the pathology localization capabilities of general-purpose MLLMs (GPT-4, GPT-5) and a domain-specific model (MedGemma) on chest radiographs.
To compare MLLM performance against a task-specific Convolutional Neural Network (CNN) baseline and a human radiologist benchmark.
To analyze model errors and identify areas for improvement in spatial understanding and localization accuracy.

Main Methods:

A prompting pipeline was developed, overlaying a spatial grid on chest radiographs to elicit coordinate-based pathology predictions.
Two general-purpose MLLMs (GPT-4, GPT-5) and one domain-specific MLLM (MedGemma) were evaluated.
Performance was assessed on the CheXlocalize dataset across nine distinct pathologies, comparing results against CNN and radiologist benchmarks.
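
The grid-based prompting step can be illustrated with a minimal sketch. The summary does not specify the grid size, the cell labeling scheme, or the prompt wording, so the 10x10 grid, the "B3"-style labels, and the names `grid_lines`, `cell_label`, `label_to_pixel`, and `build_prompt` below are all hypothetical choices made for illustration, not the authors' implementation.

```python
from typing import List, Tuple


def grid_lines(size: int, cells: int) -> List[int]:
    """Pixel offsets of the interior grid lines along one image axis."""
    return [round(i * size / cells) for i in range(1, cells)]


def cell_label(row: int, col: int) -> str:
    """Label a cell like 'B3': a letter for the row, a 1-based number for the column.

    This labeling scheme is an assumption; it supports at most 26 rows.
    """
    return f"{chr(ord('A') + row)}{col + 1}"


def label_to_pixel(label: str, width: int, height: int, cells: int) -> Tuple[float, float]:
    """Map a label such as 'B3' back to the (x, y) pixel center of that cell."""
    row = ord(label[0].upper()) - ord("A")
    col = int(label[1:]) - 1
    if not (0 <= row < cells and 0 <= col < cells):
        raise ValueError(f"label {label!r} is outside a {cells}x{cells} grid")
    return ((col + 0.5) * width / cells, (row + 0.5) * height / cells)


def build_prompt(pathology: str, cells: int) -> str:
    """Hypothetical prompt text asking the model for a single grid cell."""
    last = cell_label(cells - 1, cells - 1)
    return (
        f"The chest radiograph is overlaid with a {cells}x{cells} grid, "
        f"with cells labeled A1 to {last}. "
        f"Name the one cell that best localizes: {pathology}."
    )


if __name__ == "__main__":
    print(grid_lines(1000, 10))
    print(build_prompt("Cardiomegaly", 10))
    print(label_to_pixel("B3", 1000, 1000, 10))
```

Returning the cell center is one simple way to turn a coarse grid answer into a single coordinate that can be compared against a ground-truth annotation; a finer grid trades easier-to-read labels for higher spatial resolution.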

Main Results:

GPT-5 achieved 49.7% localization accuracy, GPT-4 achieved 39.1%, and MedGemma achieved 17.7%, all below the CNN baseline (59.9%) and radiologist benchmark (80.1%).
GPT-5's errors were often anatomically plausible but imprecise, whereas GPT-4 struggled with pathologies whose location varies and produced more anatomically implausible predictions.
MedGemma showed the lowest performance but improved with few-shot prompting, indicating potential for domain-specific fine-tuning.
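
To make the reported percentages concrete, the sketch below shows one way a localization accuracy could be computed. It assumes a "pointing" style criterion, in which a prediction counts as correct when the predicted point falls inside the ground-truth segmentation mask; the summary does not state the exact scoring rule used in the study, and the names `is_hit` and `localization_accuracy` are hypothetical.

```python
from typing import Optional, Sequence, Tuple

Mask = Sequence[Sequence[int]]     # mask[y][x] is 1 inside the annotated pathology
Point = Optional[Tuple[int, int]]  # (x, y), or None when no usable answer was given


def is_hit(point: Point, mask: Mask) -> bool:
    """True if the predicted point lies inside the ground-truth mask (assumed criterion)."""
    if point is None:
        return False
    x, y = point
    if not (0 <= y < len(mask) and 0 <= x < len(mask[0])):
        return False  # out-of-image predictions count as misses
    return bool(mask[y][x])


def localization_accuracy(points: Sequence[Point], masks: Sequence[Mask]) -> float:
    """Percentage of cases whose predicted point hits the ground-truth region."""
    if len(points) != len(masks):
        raise ValueError("points and masks must have the same length")
    if not masks:
        raise ValueError("at least one case is required")
    hits = sum(is_hit(p, m) for p, m in zip(points, masks))
    return 100.0 * hits / len(masks)


if __name__ == "__main__":
    # Toy 4x4 mask with the pathology in the top-left 2x2 corner.
    toy_mask = [
        [1, 1, 0, 0],
        [1, 1, 0, 0],
        [0, 0, 0, 0],
        [0, 0, 0, 0],
    ]
    predictions = [(0, 0), (3, 3), None, (1, 1)]
    print(localization_accuracy(predictions, [toy_mask] * 4))
```

Under this kind of metric, the reported figures place GPT-5 (49.7%) 10.2 percentage points below the CNN baseline (59.9%) and 30.4 points below the radiologist benchmark (80.1%).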

Conclusions:

Current general-purpose MLLMs remain limited in precisely localizing pathology on chest radiographs, even when their predictions are anatomically plausible.
The performance gaps relative to the CNN baseline and the radiologist benchmark highlight the need for task-specific tools and for further development before MLLMs can be integrated reliably into clinical workflows.
Future research should focus on enhancing spatial reasoning and localization accuracy in MLLMs for medical imaging applications.