Critical Assessment of Large Language Models' (ChatGPT) Performance in Data Extraction for Systematic Reviews:

Hesam Mahmoudi¹, Doris Chang¹, Hannah Lee¹

  • ¹ MGH Institute for Technology Assessment, Harvard Medical School, 125 Nashua St, Boston, MA, 02114, United States, 1 6177243738.

JMIR AI | September 11, 2025
Summary

This summary is machine-generated.

Large language models (LLMs) show promise for automating data extraction in systematic literature reviews (SLRs). They extract explicit data effectively but need human oversight for nuanced information, and their accuracy improves as prompts are refined.

Area of Science:

  • Health and biomedical sciences
  • Computational linguistics
  • Research methodology

Background:

  • Systematic literature reviews (SLRs) are crucial for evidence synthesis in health sciences but are labor-intensive due to manual data extraction.
  • Large language models (LLMs) offer potential for automating research tasks, including data extraction from academic papers.
  • Understanding LLM capabilities in extracting explicit data is vital for advancing SLR methodologies.

Purpose of the Study:

  • To evaluate the effectiveness of ChatGPT (GPT-4) in extracting both explicit study characteristics and complex, contextual information from academic literature.
  • To assess the impact of prompt refinement on LLM accuracy in data extraction for SLRs.

Main Methods:

  • Full-text screening of COVID-19 modeling studies.
  • Extraction of explicit study settings (analysis location, modeling approach, interventions) and complex behavioral components (mobility, risk perception, compliance).
  • Comparison of manual data extraction by two researchers with ChatGPT responses across 7 prompt iterations.

Main Results:

  • ChatGPT's accuracy improved significantly with prompt refinement, increasing from 43.3% to 71.7% correct data elements.
  • Higher accuracy was achieved in extracting explicit study settings (93.3%) compared to subjective behavioral components (50%).
  • Performance varied across measures, indicating limitations in handling nuanced data.

Conclusions:

  • LLMs can enhance SLRs by efficiently extracting basic, explicit data when provided with effective prompts.
  • Significant limitations exist in LLM performance for nuanced and subjective data, necessitating human supervision.
  • Optimizing prompts is key to maximizing LLM utility in systematic literature reviews.

Keywords:

  • evidence synthesis
  • generative artificial intelligence
  • human-AI collaboration
  • large language models
  • systematic reviews
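
As an illustration of how accuracy figures like those under Main Results could be computed, the sketch below scores model output against a manually extracted reference, separately for explicit study settings and behavioral components. It is a minimal sketch, not the authors' procedure. The field names are adapted from the categories listed under Main Methods, the exact-match rule after light normalization is an assumption (the summary does not say how correctness was judged), and the two toy papers are hypothetical data invented for the example.

```python
from collections import defaultdict

# Field groups adapted from the summary's Main Methods (names are illustrative).
EXPLICIT = {"analysis_location", "modeling_approach", "interventions"}
BEHAVIORAL = {"mobility", "risk_perception", "compliance"}


def category_of(field):
    """Map a data element to the category it is scored under."""
    if field in EXPLICIT:
        return "explicit"
    if field in BEHAVIORAL:
        return "behavioral"
    raise ValueError(f"unknown field: {field}")


def normalize(value):
    """Light normalization so case and stray spaces do not count as errors."""
    return "" if value is None else str(value).strip().lower()


def accuracy_by_category(reference, extracted):
    """Return the fraction of correct data elements per category and overall.

    Both arguments map paper id -> {field: value}. A missing model answer
    counts as incorrect. Exact match is an assumed scoring rule.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for paper_id, ref_fields in reference.items():
        model_fields = extracted.get(paper_id, {})
        for field, ref_value in ref_fields.items():
            is_correct = normalize(model_fields.get(field)) == normalize(ref_value)
            for key in (category_of(field), "overall"):
                total[key] += 1
                correct[key] += int(is_correct)
    return {key: correct[key] / total[key] for key in total}


# Hypothetical toy data: manual reference vs. model output for two papers.
reference = {
    "paper_a": {"analysis_location": "USA", "modeling_approach": "SEIR",
                "mobility": "yes", "compliance": "no"},
    "paper_b": {"analysis_location": "Italy", "interventions": "lockdown",
                "risk_perception": "yes"},
}
extracted = {
    "paper_a": {"analysis_location": "usa ", "modeling_approach": "SEIR",
                "mobility": "no", "compliance": "no"},
    "paper_b": {"analysis_location": "Italy", "interventions": "school closure",
                "risk_perception": "yes"},
}

scores = accuracy_by_category(reference, extracted)
for category, value in sorted(scores.items()):
    print(f"{category}: {value:.1%}")
```

The same function could be run once per prompt iteration to track how accuracy changes with refinement. For scale, the reported rise from 43.3% to 71.7% is a gain of 28.4 percentage points.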