



A Multiagent Summarization and Auto-Evaluation Framework for Medical Text: Development and Evaluation Study.

Yuhao Chen1, Bo Wen2, Farhana Zulkernine1

  • 1School of Computing, Queen's University, 557 Goodwin Hall, Kingston, ON, K7L 2N8, Canada, 1 6138930999.

JMIR AI | December 16, 2025
Summary
This summary is machine-generated.

Large language models (LLMs) can reliably summarize and evaluate medical text, reducing reliance on human experts. This AI system demonstrates scalability for clinical use, addressing challenges like hallucination and bias.

Keywords:
LLM; LLM as a judge; large language model evaluation; multi-agent network; summarization evaluation; unstructured medical data summarization


Area of Science:

  • Artificial Intelligence
  • Medical Informatics
  • Natural Language Processing

Background:

  • Large language models (LLMs) show promise in processing medical text but are prone to inaccuracies (hallucinations).
  • Human expert review of LLM outputs is time-consuming and costly, hindering clinical deployment.
  • Ensuring accuracy and reliability is critical for LLMs in healthcare.

Purpose of the Study:

  • To develop an AI system for extracting structured information from unstructured medical data.
  • To incorporate self-verification mechanisms for assessing LLM output accuracy and reliability.
  • To enhance robustness and trustworthiness of AI-driven medical summarization and evaluation.

Main Methods:

  • A two-layer framework: summarization (Llama2-70B, Mistral-7B) and evaluation (GPT-4-turbo as judge).
  • Summaries were evaluated on coherence, consistency, fluency, and relevance using pairwise comparison and several prompt strategies.
  • LLM judgments were compared against medical expert evaluations, with analysis of inter-expert disagreement.
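
The pairwise evaluation step can be illustrated with a minimal Python sketch. It assumes a hypothetical `judge` callable standing in for the LLM judge, and it queries each pair in both presentation orders, keeping a verdict only when the two orders agree. Order swapping is a common way to reduce position bias; it is an assumption here, not a confirmed detail of the authors' procedure. The names `pairwise_verdict`, `evaluate`, and the two toy judges are illustrative only.

```python
from typing import Callable, Dict

# The four criteria named in the study summary.
CRITERIA = ("coherence", "consistency", "fluency", "relevance")

# Hypothetical judge interface: given the source text, two summaries in
# presentation order, and a criterion, return "first" or "second".
# In practice this would wrap an LLM call; here it is a plain function.
Judge = Callable[[str, str, str, str], str]


def pairwise_verdict(judge: Judge, source: str, summary_a: str,
                     summary_b: str, criterion: str) -> str:
    """Return "A", "B", or "tie" after asking in both presentation orders."""
    forward = judge(source, summary_a, summary_b, criterion)   # A shown first
    backward = judge(source, summary_b, summary_a, criterion)  # B shown first
    forward_pick = "A" if forward == "first" else "B"
    backward_pick = "B" if backward == "first" else "A"
    # Keep the verdict only if it survives the order swap.
    return forward_pick if forward_pick == backward_pick else "tie"


def evaluate(judge: Judge, source: str, summary_a: str,
             summary_b: str) -> Dict[str, str]:
    """Run the order-swapped comparison once per criterion."""
    return {c: pairwise_verdict(judge, source, summary_a, summary_b, c)
            for c in CRITERIA}


def length_judge(source: str, first: str, second: str, criterion: str) -> str:
    """Toy judge: prefers the shorter summary, regardless of position."""
    return "first" if len(first) <= len(second) else "second"


def first_position_judge(source: str, first: str, second: str,
                         criterion: str) -> str:
    """Toy judge that is fully position-biased: always picks the first slot."""
    return "first"


if __name__ == "__main__":
    src = "Patient note text (placeholder)."
    print(evaluate(length_judge, src, "short", "a much longer summary"))
    print(evaluate(first_position_judge, src, "short", "a much longer summary"))
```

With the position-biased toy judge, every criterion comes back as a tie, which is the point of the swap: a preference that depends only on slot order is never recorded as a win.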

Main Results:

  • The GPT-4-turbo judge showed strong alignment with expert judgments (83.06% agreement with at least one expert).
  • Prompt-enhanced guidance improved GPT-4's alignment compared to baseline prompts.
  • Variability in expert consensus was observed (19.2% overall, 54% among 3 experts).
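
As a rough illustration of the "agreement with at least one expert" metric, the following Python sketch counts an item as a match when the judge's label equals any expert's label for that item. The exact definitions behind the reported percentages are not given in this summary, so both functions and the toy labels below are assumptions for illustration, not the study's data or method.

```python
from typing import Sequence


def agreement_with_any_expert(judge_labels: Sequence[str],
                              expert_labels: Sequence[Sequence[str]]) -> float:
    """Fraction of items where the judge matches at least one expert."""
    if len(judge_labels) != len(expert_labels):
        raise ValueError("judge_labels and expert_labels must align item by item")
    if not judge_labels:
        raise ValueError("at least one item is required")
    hits = sum(1 for label, experts in zip(judge_labels, expert_labels)
               if label in experts)
    return hits / len(judge_labels)


def unanimous_rate(expert_labels: Sequence[Sequence[str]]) -> float:
    """Fraction of items on which all experts gave the same label."""
    if not expert_labels:
        raise ValueError("at least one item is required")
    unanimous = sum(1 for experts in expert_labels if len(set(experts)) == 1)
    return unanimous / len(expert_labels)


if __name__ == "__main__":
    # Toy labels only: which of two summaries ("A" or "B") each rater preferred.
    judge = ["A", "B", "A", "B"]
    experts = [("A", "A", "B"), ("A", "A", "A"), ("A", "A", "A"), ("B", "A", "B")]
    print(agreement_with_any_expert(judge, experts))  # 0.75
    print(unanimous_rate(experts))                    # 0.5
```

Because a match with any single expert counts, this metric gets easier to satisfy as experts disagree more, so it is best read alongside a measure of inter-expert consensus.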

Conclusions:

  • LLMs can serve as reliable tools for medical data summarization and evaluation, reducing human dependency.
  • The proposed multiagent summarization and auto-evaluation framework is scalable and adaptable for clinical applications.
  • The framework addresses key challenges such as hallucination and position bias in LLM outputs.