A Pipeline for the Automatic Identification of Randomized Controlled Oncology Trials and Assignment of Tumor Entities Using Natural Language Processing

  • Department of Radiation Oncology, Cantonal Hospital Winterthur, Winterthur, Switzerland.

Summary

This summary is machine-generated.

This study shows that medical publications can be classified automatically along two axes: randomized controlled trial (RCT) versus non-RCT, and oncology versus non-oncology. Isolating oncology RCTs in this way facilitates downstream processing with rule-based systems and dedicated models.

Area Of Science

  • Biomedical Informatics
  • Natural Language Processing
  • Clinical Trials

Background

  • General information extraction tools lack domain specificity.
  • Domain-specific retrieval of trials can improve data processing.

Purpose Of The Study

  • To classify medical publications into randomized controlled trials (RCTs) vs. non-RCTs and oncology vs. non-oncology topics.
  • To evaluate the performance of a small transformer model and large language models (GPT-4o, GPT-4o mini) for this classification task.
  • To develop a rule-based system for extracting tumor entities from oncology RCTs.

Main Methods

  • Trained a small transformer model for binary classification of RCT status and oncology topic.
  • Utilized GPT-4o and GPT-4o mini for the same classification tasks.
  • Developed a rule-based system to extract tumor entities from classified oncology RCTs.
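
A rule-based assignment of tumor entities can be pictured as a keyword lookup over the trial's title or abstract. The sketch below is only an illustration: the entity names, the regular expressions, the `assign_tumor_entity` helper, and the "most matches wins" tie-breaking are all assumptions, since the summary does not describe the study's actual rules.

```python
import re

# Hypothetical keyword rules for illustration only; the study's actual
# rule set is not described in this summary.
TUMOR_ENTITY_RULES = {
    "breast cancer": [r"\bbreast\b"],
    "lung cancer": [r"\blung\b", r"\bnsclc\b", r"\bsclc\b"],
    "prostate cancer": [r"\bprostat\w*"],
    "colorectal cancer": [r"\bcolorectal\b", r"\bcolon\b", r"\brectal\b"],
}


def assign_tumor_entity(text, rules=TUMOR_ENTITY_RULES, default="unspecified"):
    """Return the entity whose patterns match most often in `text`.

    Falls back to `default` when no pattern matches. Ties are resolved
    in favor of the entity listed first in `rules`.
    """
    lowered = text.lower()
    counts = {
        entity: sum(len(re.findall(pattern, lowered)) for pattern in patterns)
        for entity, patterns in rules.items()
    }
    best = max(counts, key=counts.get)
    return best if counts[best] > 0 else default


if __name__ == "__main__":
    title = "Adjuvant radiotherapy in early breast cancer: a randomized trial"
    print(assign_tumor_entity(title))
```

Running such rules only on publications already classified as oncology RCTs keeps the keyword list short, because the rules never have to separate oncology from non-oncology text.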

Main Results

  • The small transformer model achieved F1 scores of 0.96 for RCT classification and 0.84 for oncology classification.
  • GPT-4o achieved F1 scores of 0.94 for RCT classification and 0.91 for oncology classification.
  • The rule-based system accurately assigned all oncology RCTs to a tumor entity.
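
For readers less familiar with the metric, the F1 score is the harmonic mean of precision and recall, which can be written as 2·TP / (2·TP + FP + FN). The sketch below computes it for a binary task; the `f1_score` function name and the toy labels are illustrative assumptions and are not data from the study.

```python
def f1_score(y_true, y_pred, positive=1):
    """Binary F1 for the `positive` class: 2*TP / (2*TP + FP + FN)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    denominator = 2 * tp + fp + fn
    # By convention here, return 0.0 when there are no positives at all.
    return 2 * tp / denominator if denominator else 0.0


if __name__ == "__main__":
    # Toy labels (1 = RCT, 0 = non-RCT); not taken from the study.
    gold = [1, 1, 1, 0, 0]
    predicted = [1, 1, 0, 1, 0]
    print(round(f1_score(gold, predicted), 3))
```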

Conclusions

  • Classifying publications as randomized controlled oncology trials is feasible.
  • This specialized classification facilitates downstream processing with rule-based systems and dedicated models.
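
One way to picture how the two binary decisions feed downstream processing is as a filter: only publications labeled both RCT and oncology are passed on. The sketch below assumes the classifiers are supplied as plain callables; `select_oncology_rcts` and the keyword-based stand-ins are hypothetical placeholders, not the transformer or GPT-4o classifiers evaluated in the study.

```python
from typing import Callable, Iterable, List


def select_oncology_rcts(
    abstracts: Iterable[str],
    is_rct: Callable[[str], bool],
    is_oncology: Callable[[str], bool],
) -> List[str]:
    """Keep only abstracts that both classifiers label as positive."""
    return [text for text in abstracts if is_rct(text) and is_oncology(text)]


if __name__ == "__main__":
    # Trivial keyword stand-ins so the sketch runs; a real pipeline would
    # plug in trained classifiers here.
    def toy_is_rct(text: str) -> bool:
        return "randomized" in text.lower()

    def toy_is_oncology(text: str) -> bool:
        return "cancer" in text.lower()

    corpus = [
        "A randomized trial of radiotherapy in prostate cancer",
        "A retrospective cohort of lung cancer patients",
        "A randomized trial of statins in heart failure",
    ]
    for text in select_oncology_rcts(corpus, toy_is_rct, toy_is_oncology):
        print(text)
```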