Updated: Apr 16, 2026
Batch Size Effects on Mid-2025 State-of-the-Art Large Language Model Performance in Automated Title and Abstract

Petter Fagerberg¹, Oscar Sallander¹, Kim Vikhe Patil¹

¹The National Board of Health and Welfare, Stockholm Region, Stockholm, Sweden.

Cochrane Evidence Synthesis and Methods

April 15, 2026

Summary

This summary is machine-generated.

Large language models (LLMs) can screen multiple abstracts in a single prompt, which can make evidence synthesis more efficient. Gemini 2.5 Pro was the most robust at large batch sizes, and the choice of model affects the balance between sensitivity and specificity.

Area of Science:

Artificial Intelligence
Biomedical Informatics
Systematic Reviews

Background:

Manual abstract screening is a significant bottleneck in evidence synthesis.
Large language models (LLMs) show promise for automating abstract screening.
The performance of LLMs processing multiple references in batches is not well understood.

Purpose of the Study:

To evaluate the performance of four state-of-the-art LLMs (Gemini 2.5 Pro, Gemini 2.5 Flash, GPT-5, GPT-5 mini) in predicting reference eligibility.
To assess LLM performance across a range of batch sizes for systematic reviews.
To compare sensitivity and specificity of different LLMs when processing batched references.

Keywords:

AI; ChatGPT; Gemini; artificial intelligence; diagnostic test accuracy; large language models; literature screening; meta‐analysis; systematic review; validation

Main Methods:

A gold-standard dataset of 790 references from a Cochrane Review was used.
References were grouped into batches of 1 to 790 references per prompt and processed through the public APIs of four LLMs.
Performance was measured by sensitivity and specificity, with internal validation via 10 repeated runs.
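
The batching and scoring steps described under Main Methods can be illustrated with a minimal Python sketch. It is not the authors' code: the function names `make_batches` and `sensitivity_specificity` are hypothetical, the labels in the demo are made up, and the call to an LLM API is left out entirely. It only shows how a reference list might be split into fixed-size batches and how sensitivity and specificity follow from gold-standard and predicted include/exclude decisions.

```python
from typing import Iterator, List, Sequence, Tuple


def make_batches(references: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive batches of at most `batch_size` references.

    Hypothetical helper: the source does not say how batches were formed.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(references), batch_size):
        yield list(references[start:start + batch_size])


def sensitivity_specificity(
    gold: Sequence[bool], predicted: Sequence[bool]
) -> Tuple[float, float]:
    """Return (sensitivity, specificity) for include/exclude decisions.

    True means "eligible" (include); False means "not eligible" (exclude).
    A rate is NaN when its denominator is zero.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g and p)          # eligible, predicted eligible
    fn = sum(1 for g, p in pairs if g and not p)      # eligible, missed
    tn = sum(1 for g, p in pairs if not g and not p)  # ineligible, correctly excluded
    fp = sum(1 for g, p in pairs if not g and p)      # ineligible, wrongly included
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


if __name__ == "__main__":
    # Placeholder identifiers standing in for the 790 references.
    refs = [f"ref{i}" for i in range(790)]
    batches = list(make_batches(refs, 100))
    print(len(batches), len(batches[-1]))

    # Made-up labels for illustration only; not results from the study.
    gold = [True, True, False, False, False]
    predicted = [True, False, False, False, True]
    print(sensitivity_specificity(gold, predicted))
```

With a batch size of 100, the 790 references split into seven full batches and one final batch of 90. In a repeated-runs design such as the 10 runs mentioned above, the scoring function would presumably be applied once per run and the results summarized across runs; how the authors aggregated them is not stated here.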

Main Results:

Gemini 2.5 Pro successfully processed the full 790-reference batch, showing the most robustness.
GPT-5 failed at batches of 400 or more; GPT-5 mini and Gemini 2.5 Flash failed at the 790-reference batch.
At a batch size of 100, Gemini 2.5 Pro achieved the highest sensitivity (1.00), while GPT-5 had the highest specificity (0.98).

Conclusions:

State-of-the-art LLMs can effectively screen multiple abstracts per prompt, enhancing efficiency.
LLM performance varies by model, with trade-offs between sensitivity and specificity.
Optimizing batch size and selecting the appropriate LLM are crucial for successful implementation in evidence synthesis.