
Critical Assessment of Large Language Models' (ChatGPT) Performance in Data Extraction for Systematic Reviews:

Hesam Mahmoudi¹, Doris Chang¹, Hannah Lee¹

¹MGH Institute for Technology Assessment, Harvard Medical School, 125 Nashua St, Boston, MA, 02114, United States, 1 6177243738.

September 11, 2025

Summary


This summary is machine-generated.

Large language models (LLMs) show promise for automating data extraction in systematic literature reviews (SLRs). LLMs extract explicit data effectively but require human oversight for nuanced information, and refined prompts improve their accuracy.

Area of Science:

Health and biomedical sciences
Computational linguistics
Research methodology

Background:

Systematic literature reviews (SLRs) are crucial for evidence synthesis in health sciences but are labor-intensive due to manual data extraction.
Large language models (LLMs) offer potential for automating research tasks, including data extraction from academic papers.
Understanding LLM capabilities in extracting explicit data is vital for advancing SLR methodologies.

Purpose of the Study:

To evaluate the effectiveness of ChatGPT (GPT-4) in extracting both explicit study characteristics and complex, contextual information from academic literature.
To assess the impact of prompt refinement on LLM accuracy in data extraction for SLRs.

Keywords:

evidence synthesis, generative artificial intelligence, human-AI collaboration, large language models, systematic reviews

Main Methods:

Full-text screening of COVID-19 modeling studies.
Extraction of explicit study settings (analysis location, modeling approach, interventions) and complex behavioral components (mobility, risk perception, compliance).
Comparison of manual data extraction by two researchers with ChatGPT responses across 7 prompt iterations.
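
The comparison step above can be illustrated with a minimal Python sketch. It is not the authors' procedure: the summary does not say how a response was judged correct, so exact matching after light normalization is an assumption here, and the field names, the `score_extraction` helper, and the toy data are all hypothetical.

```python
from collections import defaultdict

# Hypothetical field names, grouped into the two categories named in the methods.
FIELD_CATEGORY = {
    "location": "explicit",
    "modeling_approach": "explicit",
    "interventions": "explicit",
    "mobility": "behavioral",
    "risk_perception": "behavioral",
    "compliance": "behavioral",
}


def normalize(value):
    """Lowercase and collapse whitespace so formatting differences are not scored as errors."""
    return " ".join(str(value).lower().split()) if value is not None else None


def score_extraction(reference, extracted):
    """Compare LLM answers with the manual reference.

    Both arguments map (paper_id, field) -> value. Returns the overall
    accuracy and a dict of per-category accuracies, as fractions in [0, 1].
    A field missing from `extracted` counts as incorrect.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for (paper_id, field), ref_value in reference.items():
        category = FIELD_CATEGORY[field]
        total[category] += 1
        if normalize(extracted.get((paper_id, field))) == normalize(ref_value):
            correct[category] += 1
    per_category = {c: correct[c] / total[c] for c in total}
    overall = sum(correct.values()) / sum(total.values())
    return overall, per_category


# Toy data, invented for illustration only (assumption, not study data).
reference = {
    ("paper_1", "location"): "USA",
    ("paper_1", "modeling_approach"): "Compartmental",
    ("paper_1", "mobility"): "yes",
    ("paper_1", "compliance"): "no",
}
llm_answers = {
    ("paper_1", "location"): "usa",
    ("paper_1", "modeling_approach"): "compartmental",
    ("paper_1", "mobility"): "yes",
    ("paper_1", "compliance"): "yes",
}

overall, per_category = score_extraction(reference, llm_answers)
print(f"overall: {overall:.1%}")
for category, accuracy in per_category.items():
    print(f"{category}: {accuracy:.1%}")
```

Running the same scoring once per prompt iteration would give one accuracy figure per iteration, which is the kind of trajectory the comparison describes. Subjective fields such as risk perception may need human adjudication rather than string matching.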

Main Results:

ChatGPT's accuracy improved significantly with prompt refinement, rising from 43.3% to 71.7% of data elements extracted correctly.
Higher accuracy was achieved in extracting explicit study settings (93.3%) compared to subjective behavioral components (50%).
Performance varied across measures, indicating limitations in handling nuanced data.
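
As a rough consistency check on the figures above, suppose that the two category accuracies refer to the final prompt iteration and that each category contributes the same number of data elements. Both are assumptions; the summary gives no counts. Under them, the overall accuracy would be the unweighted mean of the two category accuracies, which agrees with the reported overall value after rounding:

```latex
\[
\frac{93.3\% + 50\%}{2} = 71.65\% \approx 71.7\%,
\qquad
71.7\% - 43.3\% = 28.4 \text{ percentage points.}
\]
```

The second expression is the absolute gain in overall accuracy attributed to prompt refinement.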

Conclusions:

LLMs can enhance SLRs by efficiently extracting basic, explicit data when provided with effective prompts.
Significant limitations exist in LLM performance for nuanced and subjective data, necessitating human supervision.
Optimizing prompts is key to maximizing LLM utility in systematic literature reviews.