Prompt engineering and diagnostic accuracy of multimodal large language models in thyroid fine-needle aspiration cytology
Summary
This summary is machine-generated. Large language models (LLMs) show limited accuracy in thyroid fine-needle aspiration cytology (FNAC) image analysis. Structured prompts improved some metrics but failed to achieve reliable diagnostic performance for clinical use.
Area Of Science
- Cytopathology
- Artificial Intelligence
- Medical Imaging
Background
- Fine-needle aspiration cytology (FNAC) is crucial for thyroid nodule evaluation.
- The role of large language models (LLMs) in analyzing FNAC images is currently unclear.
- Accurate cytological classification impacts patient management and treatment decisions.
Purpose Of The Study
- To evaluate the performance of two leading LLMs (GPT-4o and Claude 3.5 Sonnet) in thyroid FNAC image analysis.
- To compare the effectiveness of generic versus structured prompts for LLM-driven FNAC interpretation.
- To assess the diagnostic accuracy and concordance of LLM outputs with expert cytopathological diagnoses.
Main Methods
- Sixty-three thyroid FNAC cases were analyzed, each with eight microscopic images (Pap and MGG stains, 10x/40x magnification).
- Two LLMs, GPT-4o and Claude 3.5 Sonnet, were tested using both generic and structured prompts.
- Performance was evaluated based on Bethesda System for Reporting Thyroid Cytopathology (TBSRTC) concordance, sensitivity, specificity, and inter-rater agreement.
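As a rough illustration of the metrics named above, the sketch below computes sensitivity, specificity, and Cohen's kappa from paired labels. It is not the study's analysis code. The function names, the binary malignant/benign encoding, and the sample labels are all assumptions made for this example.

```python
from collections import Counter


def sensitivity_specificity(truth, pred):
    """Return (sensitivity, specificity) for paired boolean labels.

    Assumption for this sketch: True means the case is malignant.
    """
    pairs = list(zip(truth, pred))
    tp = sum(1 for t, p in pairs if t and p)
    fn = sum(1 for t, p in pairs if t and not p)
    tn = sum(1 for t, p in pairs if not t and not p)
    fp = sum(1 for t, p in pairs if not t and p)
    sens = tp / (tp + fn) if (tp + fn) else float("nan")
    spec = tn / (tn + fp) if (tn + fp) else float("nan")
    return sens, spec


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(rater_a)
    observed = sum(1 for a, b in zip(rater_a, rater_b) if a == b) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Made-up labels for four cases (not data from the study).
expert = [True, True, False, False]
model = [True, False, False, False]

sens, spec = sensitivity_specificity(expert, model)
kappa = cohen_kappa(expert, model)
print(sens, spec, kappa)
```

A kappa near 0 means agreement is close to what chance alone would produce, which is how the very low values reported in this summary should be read.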
Main Results
- Structured prompts enhanced Bethesda concordance and near-match rates compared to generic prompts.
- However, inter-rater agreement between the LLMs and human experts remained very low (κ ≤ 0.09).
- Specificity reached 100% with structured prompts, but sensitivity dropped to 11.8% or lower, indicating persistent misclassification.
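The concordance and near-match rates mentioned above could be computed along the following lines. This is a minimal sketch, not the study's method: the summary does not define "near-match", so the code assumes it means a model category exactly one Bethesda category away from the expert's. The function name and the sample categories are hypothetical.

```python
def bethesda_concordance(expert, model):
    """Return (exact_rate, near_match_rate) for paired Bethesda categories.

    Categories are integers 1-6 (Bethesda I-VI). "Near match" is ASSUMED
    here to mean a difference of exactly one category; the source summary
    does not state its definition.
    """
    n = len(expert)
    exact = sum(1 for e, m in zip(expert, model) if e == m)
    near = sum(1 for e, m in zip(expert, model) if abs(e - m) == 1)
    return exact / n, near / n


# Made-up categories for six cases (not data from the study).
expert_cats = [2, 2, 4, 6, 5, 3]
model_cats = [2, 3, 2, 6, 6, 3]

exact_rate, near_rate = bethesda_concordance(expert_cats, model_cats)
print(exact_rate, near_rate)
```

Reporting the two rates separately keeps exact agreement distinct from off-by-one calls, which can carry different management implications.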
Conclusions
- Current LLMs demonstrate potential but lack the necessary accuracy and reliability for independent clinical application in thyroid FNAC analysis.
- Domain-specific training and rigorous validation are essential before integrating LLMs into cytopathological workflows.
- Further research is needed to refine LLM capabilities for complex diagnostic tasks in surgical pathology.

