Updated: Jun 13, 2026

Evaluating Large Language Models for Automated Evidence Synthesis in Neuroimaging AI: A Multi-Model Benchmark.

Umid Sulaimanov¹, Nafiye Sanlier¹, Ariorad Moniri²

¹Department of Neurological Surgery, School of Medicine and Public Health, University of Wisconsin-Madison, Madison, WI 53792, USA.

Journal of Clinical Medicine | June 12, 2026

Summary


This summary is machine-generated.

Large language models (LLMs) show promise for automating data extraction in systematic reviews but struggle with the complex variables found in neuroimaging AI literature. Gemini 3 Pro Preview achieved the highest overall exact-match accuracy (56.4%), yet human oversight remains essential for context-dependent data.

Area of Science:

Artificial Intelligence
Neuroimaging
Systematic Reviews

Background:

Data extraction for systematic reviews is time-consuming and resource-intensive.
Evaluating the utility of advanced AI in automating evidence synthesis is critical.
Specialized neuroimaging artificial intelligence (AI) literature presents unique challenges for data extraction.

Purpose of the Study:

To assess the performance of four leading large language models (LLMs) in extracting structured metadata from neuroimaging AI literature.
To compare the accuracy of Google Gemini 3 Pro Preview, Anthropic Claude Opus 4.5, Perplexity Sonar Pro, and OpenAI GPT 5.2 for complex data extraction tasks.
To determine the impact of variable complexity on LLM performance in automated evidence synthesis.

Keywords:

artificial intelligence; benchmarking; evidence synthesis; information extraction; large language models; neuroimaging

Main Methods:

A standardized prompt was used to extract 22 variables from 91 neuroimaging AI articles.
Variables were categorized into low, medium, and high complexity tiers.
Performance was evaluated using exact-match accuracy against expert-validated ground truth.
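
As an illustration of the evaluation described in the methods, the following minimal Python sketch computes exact-match accuracy per complexity tier and overall. It is not the authors' code: the variable names, the tier assignments, the toy records, and the whitespace and case normalization step are all assumptions made for the example, since the summary does not state how values were normalized before comparison.

```python
from collections import defaultdict


def normalize(value):
    """Collapse whitespace and lowercase (an assumed normalization step)."""
    return " ".join(str(value).split()).lower()


def exact_match_by_tier(predictions, ground_truth, tiers):
    """Return {tier: accuracy} plus an "overall" entry.

    predictions, ground_truth: {article_id: {variable: value}}
    tiers: {variable: "low" | "medium" | "high"}
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for article_id, truth in ground_truth.items():
        predicted = predictions.get(article_id, {})
        for variable, expected in truth.items():
            # A missing prediction counts as a miss.
            hit = (
                variable in predicted
                and normalize(predicted[variable]) == normalize(expected)
            )
            for key in (tiers[variable], "overall"):
                total[key] += 1
                correct[key] += int(hit)
    return {key: correct[key] / total[key] for key in total}


# Hypothetical variables and records, for illustration only.
tiers = {
    "modality": "low",
    "sample_size": "medium",
    "validation_strategy": "high",
}
truth = {
    "a1": {"modality": "MRI", "sample_size": "120",
           "validation_strategy": "5-fold cross-validation"},
    "a2": {"modality": "CT", "sample_size": "45",
           "validation_strategy": "external test set"},
}
preds = {
    "a1": {"modality": "mri", "sample_size": "120",
           "validation_strategy": "cross-validation"},
    "a2": {"modality": "CT", "sample_size": "54",
           "validation_strategy": "external test set"},
}

scores = exact_match_by_tier(preds, truth, tiers)
print(scores)
```

Exact matching is strict by design: in the toy data, "cross-validation" does not match "5-fold cross-validation", so a partially correct answer scores zero. This is one reason free-text, high-complexity fields can score far lower than categorical ones under this metric.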

Main Results:

Gemini 3 Pro Preview achieved the highest overall exact-match accuracy (56.4%), outperforming the other three models.
Model performance decreased significantly with increasing variable complexity.
Accuracy for low-complexity fields was high (88.9-92.9%), while high-complexity variables yielded very low accuracy (2.7-15.5%).

Conclusions:

Frontier LLMs can automate the extraction of simple, categorical data effectively.
Complex methodological variables requiring clinical judgment or multi-section synthesis remain challenging for current LLMs.
Human review is indispensable for ensuring accuracy in extracting context-dependent variables from specialized literature.
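
One way to act on these conclusions is to accept LLM output automatically only for low-complexity fields and to queue everything else for a human reviewer. The sketch below is a hypothetical triage rule, not a workflow described by the authors; the function name, the field names, and the choice to auto-accept only the low tier are assumptions.

```python
def split_for_review(extracted, tiers, auto_accept=("low",)):
    """Split one article's extracted fields into accepted and review queues.

    extracted: {variable: value} produced by the model
    tiers: {variable: "low" | "medium" | "high"}
    auto_accept: tiers trusted without review (an assumed policy)
    """
    accepted, needs_review = {}, {}
    for variable, value in extracted.items():
        # Variables with no known tier are sent to review as well.
        if tiers.get(variable) in auto_accept:
            accepted[variable] = value
        else:
            needs_review[variable] = value
    return accepted, needs_review


# Hypothetical fields, for illustration only.
tiers = {"modality": "low", "sample_size": "medium", "validation_strategy": "high"}
extracted = {
    "modality": "MRI",
    "sample_size": "120",
    "validation_strategy": "cross-validation",
    "funding_source": "not reported",
}

accepted, needs_review = split_for_review(extracted, tiers)
print(accepted)
print(sorted(needs_review))
```

Because the reported low-complexity accuracy (88.9-92.9%) is still below 100%, a real pipeline would likely add spot checks on the accepted fields as well.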