A Multiagent Summarization and Auto-Evaluation Framework for Medical Text: Development and Evaluation Study.

Yuhao Chen¹, Bo Wen², Farhana Zulkernine¹

¹School of Computing, Queen's University, 557 Goodwin Hall, Kingston, ON, K7L 2N8, Canada, 1 6138930999.

December 16, 2025

Summary

This summary is machine-generated.

Large language models (LLMs) can reliably summarize and evaluate medical text, reducing reliance on human experts. This AI system demonstrates scalability for clinical use, addressing challenges like hallucination and bias.

Keywords:

LLM; LLM as a judge; large language model evaluation; multi-agent network; summarization evaluation; unstructured medical data summarization

Area of Science:

Artificial Intelligence
Medical Informatics
Natural Language Processing

Background:

Large language models (LLMs) show promise in processing medical text but are prone to inaccuracies (hallucinations).
Human expert review of LLM outputs is time-consuming and costly, hindering clinical deployment.
Ensuring accuracy and reliability is critical for LLMs in healthcare.

Purpose of the Study:

To develop an AI system for extracting structured information from unstructured medical data.
To incorporate self-verification mechanisms for assessing LLM output accuracy and reliability.
To enhance robustness and trustworthiness of AI-driven medical summarization and evaluation.

Main Methods:

A two-layer framework: summarization (Llama2-70B, Mistral-7B) and evaluation (GPT-4-turbo as judge).
Summaries were evaluated by pairwise comparison under different prompt strategies, on four criteria: coherence, consistency, fluency, and relevance.
LLM judgments were compared against medical expert evaluations, with analysis of inter-expert disagreement.
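
The pairwise evaluation step can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `build_prompt`, `pick_winner`, the prompt wording, and the majority-of-criteria aggregation are all hypothetical, and the judge is a plain callable standing in for a call to a model such as GPT-4-turbo. Only the four criteria are taken from the summary.

```python
from collections import Counter
from typing import Callable, Dict

# The four criteria named in the study summary.
CRITERIA = ("coherence", "consistency", "fluency", "relevance")

# Hypothetical judge interface: takes a prompt, returns "A" or "B".
Judge = Callable[[str], str]


def build_prompt(source: str, summary_a: str, summary_b: str, criterion: str) -> str:
    """Assemble a pairwise-comparison prompt for one criterion (wording is assumed)."""
    return (
        f"Source text:\n{source}\n\n"
        f"Summary A:\n{summary_a}\n\n"
        f"Summary B:\n{summary_b}\n\n"
        f"Which summary is better on {criterion}? Answer with A or B."
    )


def pick_winner(source: str, summary_a: str, summary_b: str, judge: Judge) -> Dict:
    """Ask the judge once per criterion, then take the majority (assumed aggregation)."""
    votes = {
        criterion: judge(build_prompt(source, summary_a, summary_b, criterion))
        for criterion in CRITERIA
    }
    counts = Counter(votes.values())
    if counts["A"] > counts["B"]:
        overall = "A"
    elif counts["B"] > counts["A"]:
        overall = "B"
    else:
        overall = "tie"
    return {"votes": votes, "overall": overall}


# Toy judge for demonstration only: always prefers summary A.
result = pick_winner("source note", "first summary", "second summary", lambda prompt: "A")
print(result["overall"])
```

In practice the toy lambda would be replaced by a function that sends the prompt to the judging model and parses its answer; how the study combined per-criterion judgments is not stated in this summary.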

Main Results:

GPT-4-turbo, acting as judge, aligned strongly with expert judgments (83.06% agreement with at least one expert).
Prompts with enhanced guidance improved GPT-4-turbo's alignment over baseline prompts.
Variability in expert consensus was observed (19.2% overall, 54% among 3 experts).
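
Agreement figures of this kind could be computed along the following lines. The summary does not define the metrics precisely, so the two definitions here (LLM label matches at least one expert; all experts give the same label) are assumptions, the function names are hypothetical, and the data is made up purely to exercise the code.

```python
from typing import Sequence


def agreement_with_any_expert(
    llm_labels: Sequence[str], expert_labels: Sequence[Sequence[str]]
) -> float:
    """Fraction of items where the LLM label matches at least one expert label."""
    if not llm_labels or len(llm_labels) != len(expert_labels):
        raise ValueError("need equal-length, non-empty label lists")
    hits = sum(1 for llm, experts in zip(llm_labels, expert_labels) if llm in experts)
    return hits / len(llm_labels)


def unanimous_rate(expert_labels: Sequence[Sequence[str]]) -> float:
    """Fraction of items on which every expert gave the same label."""
    if not expert_labels:
        raise ValueError("need at least one item")
    unanimous = sum(1 for experts in expert_labels if len(set(experts)) == 1)
    return unanimous / len(expert_labels)


# Toy data (not from the study): four comparisons, three experts each.
llm = ["A", "B", "A", "B"]
experts = [("A", "A", "B"), ("A", "A", "A"), ("A", "A", "A"), ("B", "A", "B")]

print(agreement_with_any_expert(llm, experts))  # 0.75
print(unanimous_rate(experts))                  # 0.5
```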

Conclusions:

LLMs can serve as reliable tools for medical data summarization and evaluation, reducing human dependency.
The proposed multiagent summarization and auto-evaluation framework is scalable and adaptable for clinical applications.
The framework addresses key challenges such as hallucination and position bias in LLM outputs.
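
The summary does not say how position bias is handled. One common mitigation, shown here only as an assumed sketch, is to run each comparison in both orders and accept a verdict only when the judge prefers the same summary both times. The name `judge_both_orders` and the two-argument judge interface are hypothetical.

```python
from typing import Callable, Optional

# Hypothetical judge interface: sees (summary in slot A, summary in slot B),
# returns "A" or "B".
PairJudge = Callable[[str, str], str]


def judge_both_orders(summary_1: str, summary_2: str, judge: PairJudge) -> Optional[int]:
    """Return 1 or 2 if the judge picks the same summary in both orders, else None."""
    first = judge(summary_1, summary_2)   # summary_1 shown in slot A
    second = judge(summary_2, summary_1)  # summary_1 shown in slot B
    winner_first = 1 if first == "A" else 2
    winner_second = 2 if second == "A" else 1
    return winner_first if winner_first == winner_second else None


# A position-biased toy judge always picks slot A, so its verdict flips
# when the order is swapped and the comparison is rejected.
print(judge_both_orders("short", "a longer summary", lambda a, b: "A"))  # None

# A content-based toy judge (prefers the longer text) is stable under swapping.
print(judge_both_orders("short", "a longer summary",
                        lambda a, b: "A" if len(a) >= len(b) else "B"))  # 2
```

Comparisons that return `None` could be re-run or flagged for human review; which policy fits depends on the deployment.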