Large Language Model-Assisted Systematic Review: Validation Based on Cochrane Review Data
Summary
This summary is machine-generated. Large Language Models (LLMs) show promise for automating systematic reviews, with GPT-4o performing best at abstract screening. However, accuracy in risk of bias assessment varies across domains, indicating current limitations.
Area Of Science
- Medical Informatics
- Artificial Intelligence in Medicine
- Evidence-Based Medicine
Background
- Systematic reviews are crucial for evidence-based medicine but are time-consuming.
- Large Language Models (LLMs) present an opportunity to automate parts of this process.
Purpose Of The Study
- To evaluate the performance of advanced LLMs (GPT-4o, GPT-4o-mini, Llama 3.1:8B) in automating systematic review tasks.
- To assess LLMs' utility in abstract screening and risk of bias assessment.
Main Methods
- LLMs were tested on abstract screening and risk of bias assessment using 12 Cochrane drug intervention reviews.
- A novel one-shot inclusivity adjustment method was proposed for modulating the screening threshold, i.e., how inclusive the model is when deciding which abstracts to keep.
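The authors' one-shot inclusivity adjustment is not described in enough detail here to reproduce, so the sketch below only illustrates the general trade-off that threshold modulation controls in abstract screening. It assumes, hypothetically, that each abstract receives a numeric inclusion score and that abstracts at or above a threshold are included; the scores, labels, and the `precision_recall` helper are all illustrative, not taken from the study.

```python
from typing import List, Tuple


def precision_recall(
    scores: List[float], labels: List[bool], threshold: float
) -> Tuple[float, float]:
    """Include every abstract scoring >= threshold, then compare with gold labels.

    labels[i] is True when the reference reviewers included abstract i.
    """
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and not y)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical inclusion scores and reviewer decisions for eight abstracts.
scores = [0.95, 0.80, 0.70, 0.60, 0.40, 0.30, 0.20, 0.10]
labels = [True, True, False, True, False, True, False, False]

# Lowering the threshold makes screening more inclusive.
for t in (0.75, 0.50, 0.25):
    p, r = precision_recall(scores, labels, t)
    print(f"threshold={t:.2f}  precision={p:.3f}  recall={r:.3f}")
# threshold=0.75  precision=1.000  recall=0.500
# threshold=0.50  precision=0.750  recall=0.750
# threshold=0.25  precision=0.667  recall=1.000
```

In screening, a missed relevant study (false negative) is usually costlier than an extra abstract passed on to human reviewers, which is why a more inclusive setting that trades precision for recall can be preferable.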
Main Results
- GPT-4o demonstrated the highest screening performance (recall 0.894, precision 0.492).
- Risk of bias assessment accuracy was domain-dependent, with highest accuracy in random sequence generation (0.873) and lowest in selective reporting (0.418).
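Domain-dependent accuracy can be read as the share of trials in each risk of bias domain where the LLM's judgment matches the Cochrane reviewers' judgment. The sketch below computes that per-domain agreement under two assumptions: judgments use the low/high/unclear levels of the original Cochrane risk of bias tool, and accuracy means exact agreement. The records and the `per_domain_accuracy` helper are hypothetical and do not reproduce the study's data.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Each record: (domain, llm_judgment, cochrane_judgment).
# Judgments are "low", "high", or "unclear"; all records are hypothetical.
records: List[Tuple[str, str, str]] = [
    ("random sequence generation", "low", "low"),
    ("random sequence generation", "low", "low"),
    ("random sequence generation", "unclear", "low"),
    ("selective reporting", "low", "unclear"),
    ("selective reporting", "high", "high"),
    ("selective reporting", "low", "high"),
]


def per_domain_accuracy(rows: List[Tuple[str, str, str]]) -> Dict[str, float]:
    """Fraction of judgments per domain where the LLM matches the reference."""
    hits: Dict[str, int] = defaultdict(int)
    totals: Dict[str, int] = defaultdict(int)
    for domain, predicted, reference in rows:
        totals[domain] += 1
        hits[domain] += int(predicted == reference)
    return {domain: hits[domain] / totals[domain] for domain in totals}


for domain, acc in per_domain_accuracy(records).items():
    print(f"{domain}: {acc:.3f}")
# random sequence generation: 0.667
# selective reporting: 0.333
```

Reporting accuracy per domain rather than pooled across domains keeps a weak domain such as selective reporting from being masked by stronger ones.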
Conclusions
- LLMs show practical utility for automating systematic reviews, particularly in abstract screening.
- Current LLM applications in systematic reviews have limitations, especially in risk of bias domains that require nuanced judgment, such as selective reporting.

