Updated: Jan 16, 2026


MedCalc-Bench: Evaluating Large Language Models for Medical Calculations.

Nikhil Khandekar¹, Qiao Jin¹, Guangzhi Xiong²

  • ¹ National Library of Medicine, National Institutes of Health.

ArXiv | October 1, 2025 | PubMed
Summary
This summary is machine-generated.

This study introduces MedCalc-Bench, a new dataset for evaluating how well large language models (LLMs) perform medical calculations. Current LLMs struggle with quantitative reasoning, a gap that limits their use in clinical applications.

Area of Science:

  • Artificial Intelligence in Medicine
  • Natural Language Processing
  • Clinical Decision Support

Background:

  • Current benchmarks for evaluating large language models (LLMs) in medicine primarily assess domain knowledge and descriptive reasoning, not quantitative skills.
  • Physicians frequently rely on clinical calculators employing quantitative equations and rule-based reasoning for evidence-based decision support.
  • There is a need to evaluate the computational and logic-based reasoning capabilities of LLMs in medical contexts.
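
To make the distinction between quantitative equations and rule-based reasoning concrete, here is a minimal Python sketch of one calculator of each kind. Body mass index and the CHA2DS2-VASc score are well-known clinical calculators chosen purely for illustration; this summary does not say whether either is among the tasks in MedCalc-Bench, and the function and parameter names are assumptions.

```python
def body_mass_index(weight_kg: float, height_m: float) -> float:
    """Equation-based calculator: BMI = weight (kg) / height (m) squared."""
    if weight_kg <= 0 or height_m <= 0:
        raise ValueError("weight and height must be positive")
    return weight_kg / height_m ** 2


def cha2ds2_vasc(
    age: int,
    female: bool,
    heart_failure: bool,
    hypertension: bool,
    diabetes: bool,
    prior_stroke_or_tia: bool,
    vascular_disease: bool,
) -> int:
    """Rule-based calculator: add points for each risk factor that is present."""
    score = 0
    score += 1 if heart_failure else 0
    score += 1 if hypertension else 0
    score += 1 if diabetes else 0
    score += 2 if prior_stroke_or_tia else 0
    score += 1 if vascular_disease else 0
    score += 1 if female else 0
    # Age contributes 2 points at 75 or older, 1 point from 65 to 74.
    if age >= 75:
        score += 2
    elif age >= 65:
        score += 1
    return score


if __name__ == "__main__":
    print(round(body_mass_index(70.0, 1.75), 1))  # 22.9
    print(cha2ds2_vasc(72, True, False, True, False, False, False))  # 3
```

Both kinds of task require the same first step from a model: pulling the right values out of a free-text patient note before any arithmetic or rule lookup can happen.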

Purpose of the Study:

  • To introduce MedCalc-Bench, a novel dataset designed to evaluate the medical calculation capabilities of LLMs.
  • To assess the performance of current LLMs on quantitative medical reasoning tasks.
  • To identify specific weaknesses in LLMs related to clinical calculations.

Main Methods:

  • Development of MedCalc-Bench, a dataset comprising over 1,000 manually reviewed instances drawn from 55 distinct medical calculation tasks.
  • Each instance includes a patient note, a question requiring a specific medical value computation, a ground truth answer, and a step-by-step explanation.
  • Evaluation of existing LLMs using the MedCalc-Bench dataset.
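
The instance layout described above can be sketched as a small Python data structure with a scoring loop. This is an illustration under stated assumptions, not the dataset's actual schema or the authors' evaluation code: the field names, the `answer_fn` callable standing in for an LLM, and the 5% relative tolerance are all hypothetical, since this summary does not specify how answers are scored.

```python
import math
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class CalcInstance:
    """One benchmark item; field names are assumed, not taken from the dataset."""
    patient_note: str
    question: str
    ground_truth: float
    explanation: str  # step-by-step derivation of the ground truth


def accuracy(
    instances: Iterable[CalcInstance],
    answer_fn: Callable[[str, str], float],
    rel_tol: float = 0.05,  # assumed tolerance; the real scoring rule may differ
) -> float:
    """Fraction of instances whose predicted value is close to the ground truth."""
    items = list(instances)
    if not items:
        raise ValueError("no instances to evaluate")
    correct = 0
    for item in items:
        predicted = answer_fn(item.patient_note, item.question)
        if math.isclose(predicted, item.ground_truth, rel_tol=rel_tol):
            correct += 1
    return correct / len(items)


if __name__ == "__main__":
    demo = [
        CalcInstance("Weight 70 kg, height 1.75 m.", "What is the BMI?", 22.86,
                     "70 / 1.75^2 = 22.86"),
        CalcInstance("72-year-old woman with hypertension.",
                     "What is the CHA2DS2-VASc score?", 3.0,
                     "age 65-74 (1) + female (1) + hypertension (1) = 3"),
    ]

    # Stand-in for a model: right on the first question, wrong on the second.
    def stub_model(note: str, question: str) -> float:
        return 22.9 if "BMI" in question else 5.0

    print(accuracy(demo, stub_model))  # 0.5
```

Keeping the explanation alongside each answer means a wrong prediction can be compared against the reference derivation to see where it went wrong, rather than only being marked incorrect.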

Main Results:

  • LLMs demonstrate potential in medical calculations but are not yet clinically viable.
  • Common errors include incorrect entity extraction, misuse of equations or rules, and arithmetic inaccuracies.
  • Significant gaps exist in LLMs' quantitative knowledge and reasoning abilities for clinical settings.
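
As a rough illustration of how the first two error types change a result, the Python sketch below applies the Cockcroft-Gault creatinine clearance equation to one hypothetical patient three ways. The patient values are invented for the example, the equation was picked only because it is widely used, and nothing here is taken from the paper's own error analysis.

```python
def cockcroft_gault(
    age_years: float,
    weight_kg: float,
    serum_creatinine_mg_dl: float,
    female: bool,
) -> float:
    """Estimated creatinine clearance in mL/min (Cockcroft-Gault equation)."""
    if serum_creatinine_mg_dl <= 0:
        raise ValueError("serum creatinine must be positive")
    clearance = (140 - age_years) * weight_kg / (72 * serum_creatinine_mg_dl)
    # The equation multiplies by 0.85 for female patients.
    return clearance * 0.85 if female else clearance


if __name__ == "__main__":
    # Hypothetical note: 60-year-old woman, weight 72 kg, BMI 27, creatinine 1.0 mg/dL.
    correct = cockcroft_gault(60, 72, 1.0, female=True)

    # Entity extraction error: the BMI (27) is picked up in place of the weight.
    wrong_entity = cockcroft_gault(60, 27, 1.0, female=True)

    # Equation misuse: the sex adjustment is left out.
    wrong_rule = cockcroft_gault(60, 72, 1.0, female=False)

    for label, value in [("correct", correct),
                         ("wrong entity", wrong_entity),
                         ("wrong rule", wrong_rule)]:
        print(f"{label}: {value:.1f} mL/min")
```

The third error type, arithmetic inaccuracy, is the case where the extracted values and the chosen equation are both right but the numeric evaluation is not.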

Conclusions:

  • MedCalc-Bench serves as a crucial resource for benchmarking LLM performance in medical calculations.
  • Current LLMs require substantial improvement to reliably perform clinical calculations.
  • Future research should focus on enhancing LLMs' quantitative reasoning for diverse clinical applications.