Natural Language Processing Methods Automate Molecular Marker Extraction From Glioma Pathology Reports

Area of Science:

Computational pathology
Bioinformatics
Natural Language Processing (NLP)

Background:

Accurate molecular marker status, specifically for Isocitrate Dehydrogenase (IDH) and Alpha-thalassemia/mental retardation syndrome X-linked (ATRX), is critical for glioma classification and treatment.
Manual extraction of these markers from pathology reports presents a significant bottleneck for research.
Evaluating the performance of NLP approaches with varying computational complexity is essential for optimizing research workflows.

Purpose of the Study:

To compare the effectiveness of three Natural Language Processing (NLP) approaches, namely Regular Expressions (RegEx), Term Frequency-Inverse Document Frequency (TF-IDF), and Bidirectional Encoder Representations from Transformers (BERT), for extracting IDH and ATRX molecular markers from glioma pathology reports.
To determine if more computationally intensive NLP methods offer significant performance advantages over simpler methods in computational pathology research.
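
As a rough illustration of the simplest of the three approaches, the sketch below assigns an IDH or ATRX status using regular expressions. The patterns, the status labels, and the function name `extract_marker` are hypothetical and chosen for illustration only; the study's actual expressions are not given in this summary, and real reports would need institution-specific patterns.

```python
import re

# Hypothetical patterns for locating a sentence that mentions each marker.
MARKER_RE = {
    "IDH": re.compile(r"\bIDH[\s-]?[12]?\b", re.I),
    "ATRX": re.compile(r"\bATRX\b", re.I),
}

# Hypothetical status keywords, checked in order (first match wins).
# Negated phrasings come first so "mutation not detected" is not read as mutant
# and "no loss" is not read as lost.
STATUS_RE = {
    "IDH": [
        ("wildtype", re.compile(r"wild[\s-]?type|negative|not detected|no mutation", re.I)),
        ("mutant", re.compile(r"mutant|mutated|mutation|positive", re.I)),
    ],
    "ATRX": [
        ("retained", re.compile(r"\b(?:retained|preserved|intact|no loss)\b", re.I)),
        ("lost", re.compile(r"\b(?:loss|lost|absent)\b", re.I)),
    ],
}


def extract_marker(report: str, marker: str) -> str:
    """Return the status found in the first sentence mentioning the marker."""
    # Split on sentence-ending punctuation followed by whitespace, or on newlines.
    for sentence in re.split(r"(?<=[.;])\s+|\n+", report):
        if MARKER_RE[marker].search(sentence):
            for status, pattern in STATUS_RE[marker]:
                if pattern.search(sentence):
                    return status
    return "unknown"


report = "IDH1 R132H immunostain is positive. ATRX nuclear expression is lost in tumor cells."
idh_status = extract_marker(report, "IDH")
atrx_status = extract_marker(report, "ATRX")
```

A rule set like this is easy to inspect and cheap to run, but it only recognizes the phrasings it has been given, so a production pattern list would need to be built from the reports themselves.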

Main Methods:

Analysis of pathology reports from 404 patients at Institution A (internal dataset) and 197 patients at Institution B (external validation dataset).
Application of identical preprocessing steps, including text normalization and terminology standardization, to all evaluated NLP approaches.
Performance evaluation using standard classification metrics (accuracy, AUC) and memory usage benchmarks on both internal and external datasets.
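
The shared preprocessing and the two reported metrics can be sketched as follows. This is a minimal example under stated assumptions: the substitution rules in `TERMINOLOGY` and the function names are hypothetical, since the study's actual normalization rules are not listed here, and AUC is computed with the standard rank-based (Mann-Whitney) definition rather than with any particular library.

```python
import re

# Hypothetical terminology rules: (pattern, canonical replacement).
TERMINOLOGY = [
    (r"\bidh[\s-]?([12])\b", r"idh\1"),      # "IDH-1", "IDH 1" -> "idh1"
    (r"\bwild[\s-]?type\b", "wildtype"),     # "wild type", "wild-type" -> "wildtype"
]


def normalize_report(text: str) -> str:
    """Lowercase, standardize marker terminology, and collapse whitespace."""
    text = text.lower()
    for pattern, canonical in TERMINOLOGY:
        text = re.sub(pattern, canonical, text)
    return re.sub(r"\s+", " ", text).strip()


def accuracy(y_true, y_pred):
    """Fraction of predictions that equal the reference label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def auc(y_true, scores):
    """Probability that a random positive outscores a random negative.

    Ties count as half a win; labels are 1 (positive) and 0 (negative).
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


clean = normalize_report("IDH-1  Wild Type\n(R132H negative)")
acc = accuracy([1, 0, 1, 1], [1, 0, 0, 1])
area = auc([1, 0, 1, 0], [0.9, 0.2, 0.6, 0.6])
```

Running every method on the same normalized text, as the study describes, keeps differences in accuracy or AUC attributable to the method rather than to the input.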

Main Results:

Simpler NLP approaches, RegEx and TF-IDF, outperformed complex BERT-based models in accuracy and AUC for both IDH and ATRX marker extraction on external validation data.
RegEx achieved near-perfect accuracy (99-100%) and TF-IDF maintained high accuracy (94.2-98.0%) for both markers.
BERT-based approaches required substantially more memory (1825-1953 MB) than RegEx (0.82-5.52 MB) or TF-IDF (17.27-34.89 MB).
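
One way to obtain a memory figure for a lightweight method is Python's standard `tracemalloc` module, sketched below. This is an assumption about tooling, not the study's documented procedure: the helper `peak_memory_mb` and the toy workload `count_idh_mentions` are hypothetical, and `tracemalloc` only tracks allocations made through Python's allocator, so it would likely undercount native memory used by deep learning frameworks.

```python
import re
import tracemalloc


def peak_memory_mb(func, *args, **kwargs):
    """Run func and return (result, peak traced allocation in MiB)."""
    tracemalloc.start()
    try:
        result = func(*args, **kwargs)
        # get_traced_memory() returns (current, peak) in bytes.
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak / (1024 * 1024)


def count_idh_mentions(reports):
    """Toy workload: count IDH mentions across a list of report strings."""
    pattern = re.compile(r"\bidh[\s-]?[12]?\b", re.I)
    return sum(len(pattern.findall(report)) for report in reports)


# Synthetic reports for illustration only; not study data.
reports = ["IDH1 R132H positive. ATRX retained."] * 1000
mentions, peak_mb = peak_memory_mb(count_idh_mentions, reports)
print(f"{mentions} mentions, peak {peak_mb:.2f} MiB")
```

Measuring the peak rather than the final allocation matters here because intermediate structures, such as per-report match lists, are freed before the function returns.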

Conclusions:

Simple NLP approaches, particularly RegEx, provide a highly accurate and computationally efficient solution for automating molecular marker extraction from pathology reports.
The findings suggest that simpler NLP methods are sufficient for many computational pathology research tasks, enabling larger sample sizes and multi-institutional analyses.
Future research should focus on validating these findings across larger datasets and integrating NLP tools for broader application in biomarker research.