
Updated: Apr 29, 2026


AgentClinic: a multimodal benchmark for tool-using clinical AI agents.

Samuel Schmidgall (1), Rojin Ziaei (2), Carl Harris (3)

  • (1) Department of Electrical and Computer Engineering, Johns Hopkins University, Baltimore, MD, USA. sschmi46@jhu.edu

NPJ Digital Medicine | April 27, 2026

Related Articles

Articles linked to this work by shared authors, journal, and citation graph.

Same author

Precision RNAi for Fibrodysplasia Ossificans Progressiva: a combinatorial, unimolecular, allele selective approach.

Research square·2026
Same author

Neonate with 'rocker-bottom' feet: what to do when it is not Edwards syndrome.

Archives of disease in childhood. Education and practice edition·2026
Same author

Octopus bimaculoides can learn to utilize a mirror to localize a reward outside the line of sight.

Current biology : CB·2026
Same author

Current validation practice undermines surgical AI development.

ArXiv·2026
Same author

Imitation learning for supervised autonomous tumor resection in central airway obstruction.

International journal of computer assisted radiology and surgery·2026
Same author

Longitudinal TCR repertoires in ulcerative colitis patients show features distinguishing disease states.

Inflammatory bowel diseases·2026
Same journal

A multimodal instruction dataset and benchmark for ultrasound understanding.

NPJ digital medicine·2026
Same journal

Evaluating the shift in psychiatric care: Associations between remote consultation use and clinical outcomes in a large longitudinal cohort.

NPJ digital medicine·2026
Same journal

Identifying suicide-related language in smartphone keyboard entries among high-risk adolescents.

NPJ digital medicine·2026
Same journal

Impact of a virtual nurse-led Early paLlIative Care IntervenTion (ELICIT) randomized controlled trial.

NPJ digital medicine·2026
Same journal

Towards trustworthy AI-driven cuffless blood pressure monitoring.

NPJ digital medicine·2026
Same journal

Spatially identifying regions of tumor recurrence in patients with suspected recurrent glioma using physiologic MRI and machine learning.

NPJ digital medicine·2026
This summary is machine-generated.

AgentClinic, a new benchmark for evaluating large language models (LLMs) in clinical settings, reveals significant challenges in sequential decision-making. Claude-3.5 agents generally outperform others, but tool utilization varies greatly among LLMs.

Area of Science:

  • Artificial Intelligence in Medicine
  • Clinical Decision Support Systems
  • Natural Language Processing

Background:

  • Current benchmarks for large language models (LLMs) in healthcare often use static question-answering formats.
  • These static formats fail to capture the dynamic, sequential nature of clinical decision-making.
  • There is a need for more realistic evaluations of LLM clinical utility.

Purpose of the Study:

  • To introduce AgentClinic, a novel multimodal agent benchmark for assessing LLMs in simulated clinical environments.
  • To evaluate LLM performance in complex clinical scenarios involving patient interaction and tool usage.
  • To compare the capabilities of different LLM backbones in a clinical context.

Main Methods:

  • Development of AgentClinic, a benchmark featuring simulated patient interactions, multimodal data, and tool integration.

  • Evaluation of LLMs across nine medical specialties and seven languages.
  • Assessment of diagnostic accuracy in sequential decision-making tasks.
  • Analysis of LLM tool utilization, including note-taking and retrieval.
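
The sequential evaluation described in the methods above can be illustrated with a small sketch. This is not AgentClinic's implementation: the `Case` structure, the "ASK:" / "DIAGNOSE:" action format, the scripted doctor, and the toy cases are all hypothetical, and a scripted function stands in for the LLM so that the example runs on its own. It shows only the general shape of the task: the agent must ask for information turn by turn, commit to a diagnosis, and may write to a notebook that persists across cases.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Case:
    """Hypothetical simulated patient: findings are revealed only when asked."""
    findings: Dict[str, str]  # question -> patient's answer
    diagnosis: str            # ground-truth label


# A doctor policy maps (dialogue so far, notebook) to the next action.
# Assumed action format: "ASK: <question>" or "DIAGNOSE: <label>".
DoctorPolicy = Callable[[List[str], List[str]], str]


def run_case(case: Case, doctor: DoctorPolicy,
             notebook: List[str], max_turns: int = 5) -> bool:
    """Run one sequential encounter; return True if the diagnosis is correct."""
    dialogue: List[str] = []
    for _ in range(max_turns):
        action = doctor(dialogue, notebook)
        if action.startswith("DIAGNOSE: "):
            guess = action[len("DIAGNOSE: "):].strip()
            correct = guess.lower() == case.diagnosis.lower()
            # The notebook outlives this case, so later cases can read it.
            notebook.append(f"guessed {guess}; correct={correct}")
            return correct
        question = action[len("ASK: "):] if action.startswith("ASK: ") else action
        answer = case.findings.get(question, "I'm not sure.")
        dialogue.append(f"Q: {question} A: {answer}")
    return False  # ran out of turns without committing to a diagnosis


def accuracy(cases: List[Case], doctor: DoctorPolicy, max_turns: int = 5) -> float:
    """Fraction of cases diagnosed correctly, sharing one notebook across cases."""
    notebook: List[str] = []
    results = [run_case(c, doctor, notebook, max_turns) for c in cases]
    return sum(results) / len(results)


def scripted_doctor(dialogue: List[str], notebook: List[str]) -> str:
    """Toy stand-in for an LLM agent: ask one question, then diagnose."""
    if not dialogue:
        return "ASK: chest pain"
    return "DIAGNOSE: angina" if "yes" in dialogue[-1] else "DIAGNOSE: unknown"


# Toy cases for illustration only; not taken from the benchmark.
cases = [
    Case({"chest pain": "yes"}, "Angina"),
    Case({"chest pain": "no"}, "Migraine"),
]
score = accuracy(cases, scripted_doctor)
print(f"diagnostic accuracy: {score:.2f}")
```

In a static question-answering benchmark the model would receive all findings up front; here it sees only what it asks for, which is one way to see why the sequential format is harder.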

Main Results:

  • Solving clinical problems in AgentClinic's sequential format significantly reduces diagnostic accuracy compared to static benchmarks.
  • Claude-3.5 agents demonstrate superior performance across most evaluated settings.
  • LLMs exhibit substantial variability in their ability to effectively utilize tools like experiential learning and reflection cycles.
  • Llama-3 showed notable improvement (up to 92%) with a persistent notebook tool.
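
The summary does not say whether the "up to 92%" figure is a relative or an absolute gain. The sketch below assumes it is relative and shows how the two readings differ. The accuracy values are hypothetical, chosen only to make the arithmetic concrete; they are not results from the paper.

```python
def relative_improvement(baseline: float, with_tool: float) -> float:
    """Gain expressed as a fraction of the baseline score."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (with_tool - baseline) / baseline


def absolute_improvement(baseline: float, with_tool: float) -> float:
    """Gain expressed in raw accuracy (multiply by 100 for percentage points)."""
    return with_tool - baseline


# Hypothetical accuracies, NOT taken from the paper.
baseline_acc = 0.25
notebook_acc = 0.48

rel = relative_improvement(baseline_acc, notebook_acc)
abs_gain = absolute_improvement(baseline_acc, notebook_acc)
print(f"relative: {rel:.0%}, absolute: {abs_gain * 100:.0f} percentage points")
```

Under these made-up numbers, a 92% relative improvement corresponds to a gain of only 23 percentage points, so the distinction matters when comparing tool effects across models with different baselines.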

Conclusions:

  • AgentClinic provides a more challenging and realistic evaluation of LLMs for clinical applications.
  • LLM performance in clinical settings is highly dependent on the benchmark design and the model's ability to integrate tools.
  • Further research is needed to optimize LLMs for complex clinical workflows and patient-centric outcomes.