Updated: Apr 16, 2026

Batch Size Effects on Mid-2025 State-of-the-Art Large Language Model Performance in Automated Title and Abstract

Petter Fagerberg¹, Oscar Sallander¹, Kim Vikhe Patil¹

  • ¹ The National Board of Health and Welfare, Stockholm Region, Stockholm, Sweden.

Cochrane Evidence Synthesis and Methods | April 15, 2026
Summary

This summary is machine-generated.

Large language models (LLMs) can screen multiple abstracts simultaneously, improving evidence synthesis efficiency. Gemini 2.5 Pro demonstrated superior performance across large batch sizes, though model choice impacts sensitivity and specificity.

Area of Science:

  • Artificial Intelligence
  • Biomedical Informatics
  • Systematic Reviews

Background:

  • Manual abstract screening is a significant bottleneck in evidence synthesis.
  • Large language models (LLMs) show promise for automating abstract screening.
  • The performance of LLMs processing multiple references in batches is not well understood.

Purpose of the Study:

  • To evaluate the performance of four state-of-the-art LLMs (Gemini 2.5 Pro, Gemini 2.5 Flash, GPT-5, GPT-5 mini) in predicting reference eligibility.
  • To assess LLM performance across a range of batch sizes for systematic reviews.
  • To compare sensitivity and specificity of different LLMs when processing batched references.

Main Methods:

  • A gold-standard dataset of 790 references from a Cochrane Review was used.
  • References were grouped into batches of 1 to 790 references per prompt and processed through the public APIs of the four LLMs.
  • Performance was measured by sensitivity and specificity, with internal validation through 10 repeated runs.

Main Results:

  • Gemini 2.5 Pro was the only model to process the full 790-reference batch successfully, making it the most robust of the four.
  • GPT-5 failed at batch sizes of 400 or more; GPT-5 mini and Gemini 2.5 Flash failed at the 790-reference batch.
  • At a batch size of 100, Gemini 2.5 Pro achieved the highest sensitivity (1.00), while GPT-5 had the highest specificity (0.98).

Conclusions:

  • State-of-the-art LLMs can screen multiple abstracts per prompt, which makes screening more efficient.
  • Performance varies by model, with trade-offs between sensitivity and specificity.
  • Successful use in evidence synthesis depends on choosing an appropriate batch size and an appropriate model.

Keywords: AI, ChatGPT, Gemini, artificial intelligence, diagnostic test accuracy, large language models, literature screening, meta-analysis, systematic review, validation
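
The batching and scoring steps described under Main Methods can be illustrated with a minimal Python sketch. It is not the authors' code: the function names, the toy abstracts, and the keyword-based `keyword_screener` are all hypothetical, and `screen_batch` merely stands in for a single LLM API request. The sketch only shows how references might be split into batches and how sensitivity and specificity are computed against gold-standard labels.

```python
from typing import Callable, List, Sequence, Tuple


def make_batches(items: Sequence[str], batch_size: int) -> List[List[str]]:
    """Split items into consecutive batches of at most batch_size."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [list(items[i:i + batch_size]) for i in range(0, len(items), batch_size)]


def screen_all(
    abstracts: Sequence[str],
    batch_size: int,
    screen_batch: Callable[[List[str]], List[bool]],
) -> List[bool]:
    """Screen every abstract, passing batch_size abstracts per call.

    screen_batch is a hypothetical stand-in for one LLM request: it takes
    a batch of abstracts and returns one include (True) or exclude (False)
    decision per abstract, in the same order.
    """
    decisions: List[bool] = []
    for batch in make_batches(abstracts, batch_size):
        result = screen_batch(batch)
        if len(result) != len(batch):
            # Assumption: a missing or extra answer is treated as a failed batch.
            raise RuntimeError("screen_batch returned the wrong number of decisions")
        decisions.extend(result)
    return decisions


def sensitivity_specificity(
    gold: Sequence[bool], predicted: Sequence[bool]
) -> Tuple[float, float]:
    """Return (sensitivity, specificity) of predicted against gold labels."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    tp = sum(g and p for g, p in zip(gold, predicted))
    fn = sum(g and not p for g, p in zip(gold, predicted))
    tn = sum(not g and not p for g, p in zip(gold, predicted))
    fp = sum(not g and p for g, p in zip(gold, predicted))
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("gold labels need at least one include and one exclude")
    return tp / (tp + fn), tn / (tn + fp)


def keyword_screener(batch: List[str]) -> List[bool]:
    """Toy screener (hypothetical): include abstracts mentioning 'randomised'."""
    return ["randomised" in text.lower() for text in batch]


# Toy data, invented for illustration only.
abstracts = [
    "A randomised trial of drug A",
    "Randomised comparison of B and C",
    "A cohort study of D",
    "Narrative review of E",
    "Protocol for a randomised trial of F",
    "Editorial on G",
]
gold = [True, True, True, False, False, False]

predicted = screen_all(abstracts, batch_size=4, screen_batch=keyword_screener)
sens, spec = sensitivity_specificity(gold, predicted)
print(f"sensitivity={sens:.2f} specificity={spec:.2f}")
```

Because the toy screener is deterministic, its decisions do not depend on batch size. With a real model they may, which is why a study of this kind would repeat `screen_all` at each batch size and compare the resulting sensitivity and specificity across runs.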