Using consensus-based reasoning and large language models to extract structured data from surgical pathology reports | JoVE Visualize

Area of Science:

Medical Informatics
Computational Pathology
Artificial Intelligence in Medicine

Background:

Surgical pathology reports contain critical cancer diagnostic information but vary widely in format and style.
The unstructured nature of these reports hinders automated data extraction for large-scale analysis.
Variability across tumor types and institutions presents significant challenges for consistent data retrieval.

Purpose of the Study:

To develop a consensus-driven, reasoning-based framework for extracting standard diagnostic variables and biomarkers from pathology reports.
To adapt locally deployed large language models (LLMs) for accurate and reliable data extraction.
To evaluate the framework's performance across diverse organ systems and cancer types.

Main Methods:

Used multiple locally deployed LLMs to extract diagnostic variables (site, histology, stage, grade, behavior) and biomarkers.
Used three separate reasoning models to evaluate the accuracy and coherence of the LLM-generated outputs.
Aggregated outputs to determine final consensus values and conducted expert validation by board-certified pathologists.
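
The extract, evaluate, and aggregate steps above can be illustrated with a minimal Python sketch. The summary does not say how the outputs were aggregated, so the rule used here (prefer the value most models agree on, then the higher mean evaluator score) is an assumption, and every name and number in the example (`consensus_value`, `llm_a`, `evaluator_1`, the scores) is hypothetical rather than taken from the study.

```python
from collections import defaultdict
from statistics import mean


def normalize(value: str) -> str:
    """Lowercase and collapse whitespace so trivially different strings match."""
    return " ".join(value.lower().split())


def consensus_value(candidates, evaluator_scores):
    """Pick one consensus value for a single variable of a single report.

    candidates: dict mapping extractor name -> extracted value (str).
    evaluator_scores: dict mapping evaluator name -> dict mapping
        extractor name -> score in [0, 1].
    Assumed rule (not from the source): rank each distinct value by the
    number of extractors that produced it, then by mean evaluator score.
    """
    scores_by_value = defaultdict(list)
    for model_name, value in candidates.items():
        # Average the evaluators' scores for this extractor's output.
        model_scores = [scores[model_name] for scores in evaluator_scores.values()]
        scores_by_value[normalize(value)].append(mean(model_scores))

    value, scores = max(
        scores_by_value.items(),
        key=lambda item: (len(item[1]), mean(item[1])),
    )
    return value, len(scores), mean(scores)


# Hypothetical outputs from three extractors for the "histology" variable.
candidates = {
    "llm_a": "Adenocarcinoma",
    "llm_b": "adenocarcinoma ",
    "llm_c": "Squamous cell carcinoma",
}
# Hypothetical scores from three evaluators, one score per extractor output.
evaluator_scores = {
    "evaluator_1": {"llm_a": 0.9, "llm_b": 0.8, "llm_c": 0.4},
    "evaluator_2": {"llm_a": 0.8, "llm_b": 0.9, "llm_c": 0.5},
    "evaluator_3": {"llm_a": 1.0, "llm_b": 0.7, "llm_c": 0.3},
}

value, votes, score = consensus_value(candidates, evaluator_scores)
print(value, votes, round(score, 2))
```

Returning the vote count and mean score alongside the chosen value keeps each decision traceable, which fits the auditability the study emphasizes.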

Main Results:

The framework achieved high accuracy in extracting standard variables from more than 6,100 reports from The Cancer Genome Atlas (TCGA) (mean 84.9%±7.3%) and 510 reports from Moffitt Cancer Center (mean 88.2%±7.2%).
Histology, site, and behavior showed the highest extraction accuracy, with expert review confirming strong agreement across key variables.
Biomarker extraction achieved 70.6%±7.9% overall accuracy, with specific biomarkers showing high performance in relevant tumor types.
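
As a rough illustration of how a "mean ± spread" accuracy figure like those above might be computed, the sketch below scores extractions per variable and then summarizes across variables. It assumes that the ± term is a sample standard deviation over per-variable accuracies and that correctness is an exact match against a reference value. The summary states neither, and the records shown are toy data, not study results.

```python
from statistics import mean, stdev


def accuracy_by_variable(records):
    """Return percent accuracy per variable.

    records: iterable of (variable, extracted_value, reference_value).
    An extraction counts as correct only on exact match (an assumption).
    """
    totals, correct = {}, {}
    for variable, extracted, reference in records:
        totals[variable] = totals.get(variable, 0) + 1
        if extracted == reference:
            correct[variable] = correct.get(variable, 0) + 1
    return {v: 100.0 * correct.get(v, 0) / n for v, n in totals.items()}


def summarize(per_variable):
    """Mean and sample standard deviation across variables (needs >= 2)."""
    values = list(per_variable.values())
    return mean(values), stdev(values)


# Toy records: (variable, extracted value, reference value).
records = [
    ("site", "lung", "lung"),
    ("site", "breast", "breast"),
    ("grade", "2", "2"),
    ("grade", "3", "2"),
]

per_variable = accuracy_by_variable(records)
avg, sd = summarize(per_variable)
print(per_variable, round(avg, 1), round(sd, 1))
```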

Conclusions:

Locally deployed LLMs, within a consensus-based framework, offer a transparent, accurate, and auditable solution for pathology data extraction.
The framework demonstrates potential for integration into real-world workflows like synoptic reporting and cancer registry abstraction.
Stratified, multi-organ evaluation frameworks with multi-evaluator consensus are crucial for benchmarking LLMs in clinical applications.