Large-Scale Validation of the Feasibility of GPT-4 as a Proofreading Tool for Head CT Reports

  • From the Departments of Biomedical Systems Informatics (S.K., Jaewoong Kim, C.H., D.Y.) and Neurology (Joonho Kim, J.Y.), Yonsei University College of Medicine, 50-1 Yonsei-ro, Seodaemun-gu, Seoul 03722, Republic of Korea; Department of Radiology, Central Draft Physical Examination Office of Military Manpower Administration, Daegu, Republic of Korea (D.K.); Department of Radiology, Research Institute of Radiological Science and Center for Clinical Imaging Data Science (H.J.S., Y.K., S.J.), and Center for Digital Health (H.J.S., D.Y.), Yongin Severance Hospital, Yonsei University College of Medicine, Yongin, Republic of Korea; Department of Radiology, Gangnam Severance Hospital, Yonsei University College of Medicine, Seoul, Republic of Korea (S.H.L.); Departments of Radiology (M.H.) and Neurology (S.J.L.), Ajou University Hospital, Ajou University School of Medicine, Suwon, Republic of Korea; and Institute for Innovation in Digital Healthcare, Severance Hospital, Seoul, Republic of Korea (D.Y.).

Summary

This summary is machine-generated.

Large language models such as GPT-4 show promise for improving radiology report accuracy by detecting and revising errors. GPT-4 is effective at identifying factual errors but needs further development in prioritizing findings by clinical significance.

Area Of Science

  • Artificial Intelligence in Medical Imaging
  • Radiology Report Analysis
  • Natural Language Processing in Healthcare

Background

  • Radiologist workload contributes to burnout and potential errors in radiology reports.
  • Large language models (LLMs) offer a potential solution for automated error detection and revision in medical documentation.

Purpose Of The Study

  • To assess the feasibility of using GPT-4 for error detection, reasoning, and revision in head CT radiology reports.
  • To compare GPT-4's clinical utility against human readers for identifying and correcting report errors.

Main Methods

  • Retrospective analysis of 10,300 head CT reports from the MIMIC-III dataset.
  • Experiment 1: Evaluated GPT-4's error detection, reasoning, and revision on 400 reports (300 original, 300 with deliberately inserted errors), after initial prompt optimization on a separate 200 reports.
  • Experiment 2: Validated GPT-4's detection performance on 10,000 reports presumed to be error-free, to assess false-positive behavior at scale.

Main Results

  • GPT-4 achieved high sensitivity in detecting interpretive (84%) and factual (89%) errors.
  • Human readers showed lower sensitivity for factual errors (33%-69%) and required substantially more review time per report (82-121 seconds) than GPT-4 (16 seconds).
  • GPT-4 flagged 96 errors across the 10,000 reports, yielding a low positive predictive value (0.05); however, 14% of its false positives were judged potentially beneficial.
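
The detection metrics above follow from standard confusion-matrix definitions. A minimal sketch, using hypothetical counts for illustration (the summary reports only the derived rates, not the raw counts), is:

```python
# Standard confusion-matrix metrics as used in the results above.
# The counts passed in below are illustrative placeholders, not study data.

def sensitivity(tp: int, fn: int) -> float:
    """Fraction of true errors that the tool detects (recall)."""
    return tp / (tp + fn)

def ppv(tp: int, fp: int) -> float:
    """Positive predictive value: fraction of flagged errors that are real."""
    return tp / (tp + fp)

# e.g. 89 of 100 seeded factual errors detected -> sensitivity 0.89
print(sensitivity(tp=89, fn=11))

# e.g. 5 real errors among 96 flags -> PPV ~ 0.05
print(round(ppv(tp=5, fp=91), 2))
```

A low PPV at scale is expected when the screened population is nearly error-free: even a small false-positive rate over 10,000 reports produces many more false flags than true detections.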

Conclusions

  • GPT-4 effectively detects, reasons through, and revises errors in radiology reports, performing especially well at identifying factual inaccuracies.
  • The model's limited ability to prioritize clinically significant findings remains a weakness.
  • GPT-4 is feasible as a proofreading tool for improving radiology report quality, within its current strengths and limitations.