Lightweight Language Models are Prone to Reasoning Errors for Complex Computational Phenotyping Tasks.

Sarah Pungitore¹, Shashank Yadav¹, David Maughan¹

¹College of Engineering, The University of Arizona, Tucson, AZ.

August 6, 2025

Summary

This summary is machine-generated.

Lightweight large language models (LLMs) make reasoning errors on complex computational phenotyping tasks. Extending LLM evaluation frameworks such as PHEONA to cover faulty reasoning helps identify and address these errors.

Keywords:

Computational Phenotyping; Computer Reasoning; Electronic Phenotyping; Generative Artificial Intelligence; Large Language Models

Area of Science:

Biomedical Informatics
Artificial Intelligence

Background:

Computational phenotyping is essential for cohort identification but is time-intensive because it relies on manual data review.
Previous studies showed that LLMs perform poorly on complex phenotyping tasks, particularly those involving multiple therapies.

Purpose of the Study:

To evaluate the reasoning capabilities of lightweight LLMs in computational phenotyping.
To enhance the PHEONA framework for assessing faulty reasoning in LLMs.

Main Methods:

Assessed three lightweight LLMs (DeepSeek-R1, Mistral Small, Phi-4) for phenotyping accuracy.
Used prompt modifications to identify two kinds of reasoning error: explanation correctness errors and unfaithfulness errors.
Expanded the PHEONA framework to include an evaluation of faulty reasoning.

Main Results:

Reasoning errors, both explanation correctness errors and unfaithfulness errors, were prevalent across all tested LLMs.
DeepSeek showed the smallest change in accuracy after prompt modifications, compared with Mistral and Phi.
The enhanced PHEONA framework successfully identified pervasive reasoning errors.
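One way to read "accuracy impact after prompt modifications" is as the difference in classification accuracy between a model's answers to the original prompt and its answers to the modified prompt. The sketch below illustrates only that arithmetic. The function names, the label strings, and the toy predictions are all hypothetical placeholders; they are not taken from the study, and the study's actual metric may be defined differently.

```python
from typing import Sequence


def accuracy(predicted: Sequence[str], truth: Sequence[str]) -> float:
    """Fraction of predictions that match the ground-truth labels."""
    if len(predicted) != len(truth):
        raise ValueError("predicted and truth must have the same length")
    if not truth:
        raise ValueError("at least one labeled example is required")
    return sum(p == t for p, t in zip(predicted, truth)) / len(truth)


def accuracy_impact(
    baseline_preds: Sequence[str],
    modified_preds: Sequence[str],
    truth: Sequence[str],
) -> float:
    """Accuracy under the modified prompt minus accuracy under the original.

    A value near zero means the prompt modification barely changed accuracy;
    a negative value means accuracy dropped.
    """
    return accuracy(modified_preds, truth) - accuracy(baseline_preds, truth)


# Toy, made-up labels purely for illustration (not study data).
truth = ["class_a", "class_b", "class_a", "class_c"]
baseline = ["class_a", "class_b", "class_a", "class_a"]  # 3 of 4 correct
modified = ["class_a", "class_b", "class_c", "class_a"]  # 2 of 4 correct

impact = accuracy_impact(baseline, modified, truth)
print(f"accuracy impact: {impact:+.2f}")
```

Under this reading, the model with the smallest absolute impact is the one whose accuracy moved least when the prompt was modified.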

Conclusions:

Reasoning errors are ubiquitous in LLM responses to complex tasks such as computational phenotyping.
The enhanced PHEONA framework supports LLM evaluation by surfacing these errors, and the results highlight the need for improved interpretability methods.