AgentClinic: a multimodal benchmark for tool-using clinical AI agents.

Samuel Schmidgall¹, Rojin Ziaei², Carl Harris³

¹Department of Electrical and Computer Engineering, Johns Hopkins University, Baltimore, MD, USA. sschmi46@jhu.edu.

NPJ Digital Medicine

April 27, 2026

Summary

This summary is machine-generated.

AgentClinic, a new benchmark for evaluating large language models (LLMs) in clinical settings, reveals significant challenges in sequential decision-making. Claude-3.5 agents generally outperform others, but tool utilization varies greatly among LLMs.

Area of Science:

Artificial Intelligence in Medicine
Clinical Decision Support Systems
Natural Language Processing

Background:

Current benchmarks for large language models (LLMs) in healthcare often use static question-answering formats.
These static formats fail to capture the dynamic, sequential nature of clinical decision-making.
There is a need for more realistic evaluations of LLM clinical utility.

Purpose of the Study:

To introduce AgentClinic, a novel multimodal agent benchmark for assessing LLMs in simulated clinical environments.
To evaluate LLM performance in complex clinical scenarios involving patient interaction and tool usage.
To compare the capabilities of different LLM backbones in a clinical context.

Main Methods:

Development of AgentClinic, a benchmark featuring simulated patient interactions, multimodal data, and tool integration.
Evaluation of LLMs across nine medical specialties and seven languages.
Assessment of diagnostic accuracy in sequential decision-making tasks.
Analysis of LLM tool utilization, including note-taking and retrieval.
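
The difference between a static question and a sequential evaluation can be illustrated with a small sketch. The code below is not AgentClinic's implementation: the `Case`, `RuleAgent`, `run_case`, and `accuracy` names, the `ASK`/`DIAGNOSE` action format, and the toy cases are all hypothetical, and a rule-based stub stands in for an LLM. It only shows the general idea that the agent must request findings turn by turn and is scored on its final diagnosis.

```python
from dataclasses import dataclass, field


@dataclass
class Case:
    """One simulated case: findings the agent must ask for, plus the answer."""
    findings: dict[str, str]
    diagnosis: str


@dataclass
class RuleAgent:
    """Toy stand-in for an LLM agent (hypothetical); asks fixed questions in order."""
    questions: list[str]
    rules: dict[str, str]  # observed finding -> diagnosis
    observations: list[str] = field(default_factory=list)

    def act(self, turn: int) -> str:
        # Ask the next question while any remain, otherwise commit to a diagnosis.
        if turn < len(self.questions):
            return "ASK " + self.questions[turn]
        for obs in self.observations:
            if obs in self.rules:
                return "DIAGNOSE " + self.rules[obs]
        return "DIAGNOSE unknown"

    def observe(self, answer: str) -> None:
        self.observations.append(answer)


def run_case(agent: RuleAgent, case: Case, max_turns: int = 5) -> bool:
    """Run one episode; return True if the final diagnosis is correct."""
    agent.observations.clear()  # no memory carried between cases in this sketch
    for turn in range(max_turns):
        kind, _, arg = agent.act(turn).partition(" ")
        if kind == "DIAGNOSE":
            return arg == case.diagnosis
        agent.observe(case.findings.get(arg, "not available"))
    return False  # turn budget exhausted without a diagnosis


def accuracy(agent: RuleAgent, cases: list[Case], max_turns: int = 5) -> float:
    """Fraction of cases diagnosed correctly within the turn budget."""
    results = [run_case(agent, c, max_turns) for c in cases]
    return sum(results) / len(results)


# Hypothetical toy data, for illustration only.
cases = [
    Case({"symptom": "crushing chest pain"}, "myocardial infarction"),
    Case({"symptom": "polyuria"}, "diabetes mellitus"),
    Case({"symptom": "fever"}, "influenza"),
]
agent = RuleAgent(
    questions=["symptom"],
    rules={
        "crushing chest pain": "myocardial infarction",
        "polyuria": "diabetes mellitus",
    },
)

print(f"accuracy with 5 turns: {accuracy(agent, cases):.2f}")
print(f"accuracy with 1 turn:  {accuracy(agent, cases, max_turns=1):.2f}")
```

In this toy setup the agent never sees the findings up front, so a tight turn budget leaves it without a diagnosis even on cases it could otherwise solve. That is one way a sequential format can score lower than a static one on the same underlying problems.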

Main Results:

Diagnostic accuracy drops significantly when clinical problems are posed in AgentClinic's sequential format rather than as static questions.
Claude-3.5 agents demonstrate superior performance across most evaluated settings.
LLMs exhibit substantial variability in their ability to effectively utilize tools like experiential learning and reflection cycles.
Llama-3 showed a notable relative improvement (up to 92%) with a persistent notebook tool.
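
To make the size of such a gain concrete, the sketch below computes a relative improvement, assuming the 92% figure is measured relative to the model's accuracy without the tool. The `relative_improvement` function and both accuracy values are hypothetical and chosen only to illustrate the arithmetic; they are not results from the paper.

```python
def relative_improvement(baseline: float, with_tool: float) -> float:
    """Return the gain as a fraction of the baseline accuracy."""
    if baseline <= 0:
        raise ValueError("baseline accuracy must be positive")
    return (with_tool - baseline) / baseline


# Hypothetical accuracies, not taken from the paper.
baseline_acc = 0.25
notebook_acc = 0.48

gain = relative_improvement(baseline_acc, notebook_acc)
print(f"absolute gain: {notebook_acc - baseline_acc:.2f}")
print(f"relative gain: {gain:.0%}")
```

Under these made-up numbers, a 92% relative improvement corresponds to an absolute gain of 23 percentage points, so a large relative figure does not by itself indicate a high final accuracy.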

Conclusions:

AgentClinic provides a more challenging and realistic evaluation of LLMs for clinical applications.
LLM performance in clinical settings is highly dependent on the benchmark design and the model's ability to integrate tools.
Further research is needed to optimize LLMs for complex clinical workflows and patient-centric outcomes.