


Updated: Jan 16, 2026


MedCalc-Bench: Evaluating Large Language Models for Medical Calculations.

Nikhil Khandekar¹, Qiao Jin¹, Guangzhi Xiong²

¹National Library of Medicine, National Institutes of Health.

October 1, 2025

Summary

This summary is machine-generated.

This study introduces MedCalc-Bench, a new dataset for evaluating large language models (LLMs) on medical calculations. Current LLMs struggle with quantitative reasoning, a gap that must be closed before they can be relied on in clinical applications.


Area of Science:

Artificial Intelligence in Medicine
Natural Language Processing
Clinical Decision Support

Background:

Current benchmarks for evaluating large language models (LLMs) in medicine primarily assess domain knowledge and descriptive reasoning, not quantitative skills.
Physicians frequently rely on clinical calculators employing quantitative equations and rule-based reasoning for evidence-based decision support.
There is a need to evaluate the computational and logic-based reasoning capabilities of LLMs in medical contexts.
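
To make the distinction between equation-based and rule-based calculators concrete, here is a minimal Python sketch of one of each. The two calculators (Cockcroft-Gault creatinine clearance and the CHA2DS2-VASc score) are well-known examples chosen for illustration; the text above does not say which calculators the dataset covers, and the function and parameter names are hypothetical.

```python
def cockcroft_gault(age_years: float, weight_kg: float,
                    serum_creatinine_mg_dl: float, is_female: bool) -> float:
    """Equation-based: estimated creatinine clearance in mL/min."""
    crcl = (140 - age_years) * weight_kg / (72 * serum_creatinine_mg_dl)
    # The standard formula multiplies by 0.85 for female patients.
    return crcl * 0.85 if is_female else crcl


def cha2ds2_vasc(age_years: int, is_female: bool, chf: bool,
                 hypertension: bool, diabetes: bool,
                 stroke_or_tia: bool, vascular_disease: bool) -> int:
    """Rule-based: additive stroke-risk score built from yes/no criteria."""
    score = 0
    score += 1 if chf else 0
    score += 1 if hypertension else 0
    score += 1 if diabetes else 0
    score += 2 if stroke_or_tia else 0
    score += 1 if vascular_disease else 0
    score += 1 if is_female else 0
    # Age contributes 2 points at 75 or older, 1 point from 65 to 74.
    if age_years >= 75:
        score += 2
    elif age_years >= 65:
        score += 1
    return score
```

The first function needs correct numeric inputs and arithmetic; the second needs each criterion to be read correctly from the patient record and the right points to be applied. Both kinds of reasoning are what a benchmark of medical calculations would need to probe.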

Purpose of the Study:

To introduce MedCalc-Bench, a novel dataset designed to evaluate the medical calculation capabilities of LLMs.
To assess the performance of current LLMs on quantitative medical reasoning tasks.
To identify specific weaknesses in LLMs related to clinical calculations.

Main Methods:

Development of MedCalc-Bench, a dataset comprising over 1,000 manually reviewed instances from 55 distinct medical calculation tasks.
Each instance includes a patient note, a question requiring a specific medical value computation, a ground truth answer, and a step-by-step explanation.
Evaluation of existing LLMs using the MedCalc-Bench dataset.
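
The instance structure and evaluation step described above could be sketched as follows. This is an illustrative sketch, not the authors' code: the field names, the `answer_fn` callback standing in for an LLM, and the 5% relative tolerance used to judge numeric answers are all assumptions that the text above does not confirm.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class CalcInstance:
    """One benchmark item, mirroring the four parts listed above."""
    patient_note: str
    question: str
    ground_truth: float
    explanation: str  # step-by-step derivation of the ground truth


def is_correct(predicted: float, truth: float, rel_tol: float = 0.05) -> bool:
    # Assumed scoring rule: accept answers within rel_tol of the ground truth.
    return abs(predicted - truth) <= rel_tol * abs(truth)


def accuracy(instances: Iterable[CalcInstance],
             answer_fn: Callable[[str, str], float],
             rel_tol: float = 0.05) -> float:
    """Fraction of instances where answer_fn(note, question) is accepted."""
    items = list(instances)
    if not items:
        return 0.0
    hits = sum(
        is_correct(answer_fn(item.patient_note, item.question),
                   item.ground_truth, rel_tol)
        for item in items
    )
    return hits / len(items)
```

In practice `answer_fn` would wrap a model call and parse a number out of its response. Keeping it as a plain callable lets the scoring logic be tested without any model.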

Main Results:

LLMs demonstrate potential in medical calculations but are not yet clinically viable.
Common errors include incorrect entity extraction, misuse of equations or rules, and arithmetic inaccuracies.
Significant gaps exist in LLMs' quantitative knowledge and reasoning abilities for clinical settings.
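
The three error types above correspond to three stages of a calculation: reading values from the note, choosing the equation or rule, and carrying out the arithmetic. The sketch below shows one hypothetical way to attribute a wrong answer to the first stage that fails. It is not the study's analysis procedure; the function names, the use of BMI as the example calculator, and the 1% tolerance are assumptions made for illustration.

```python
from typing import Callable, Dict


def bmi(weight_kg: float, height_m: float) -> float:
    """Body mass index in kg/m^2, used here only as a simple example."""
    return weight_kg / height_m ** 2


def diagnose(ref_inputs: Dict[str, float], model_inputs: Dict[str, float],
             ref_formula: str, model_formula: str,
             compute: Callable[..., float], model_answer: float,
             rel_tol: float = 0.01) -> str:
    """Return the first stage at which the model departs from the reference."""
    if model_inputs != ref_inputs:
        return "entity extraction"
    if model_formula != ref_formula:
        return "equation or rule misuse"
    expected = compute(**ref_inputs)
    if abs(model_answer - expected) > rel_tol * abs(expected):
        return "arithmetic"
    return "correct"


reference = {"weight_kg": 70.0, "height_m": 1.75}

# Right formula, but the height was misread from the note.
print(diagnose(reference, {"weight_kg": 70.0, "height_m": 1.57},
               "bmi", "bmi", bmi, 28.4))
# Right inputs and formula, but the final number is wrong.
print(diagnose(reference, dict(reference), "bmi", "bmi", bmi, 40.0))
```

Checking the stages in order matters: a misread input usually makes the final number wrong too, so the error is attributed to extraction and not to arithmetic.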

Conclusions:

MedCalc-Bench serves as a crucial resource for benchmarking LLM performance in medical calculations.
Current LLMs require substantial improvement to reliably perform clinical calculations.
Future research should focus on enhancing LLMs' quantitative reasoning for diverse clinical applications.