Reproducibility and Robustness of Large Language Models for Mobility Functional Status Extraction.

Xingyi Liu¹, Muskan Garg¹, Eunji Jeon¹

¹Department of AI and Informatics, Mayo Clinic, Rochester, USA.

medRxiv: the preprint server for health sciences | April 10, 2026

Summary

This summary is machine-generated.

Large language models (LLMs) need to be evaluated for stability before they are used for clinical information extraction (IE). This study found that prompt paraphrasing and model choice significantly affect LLM stability, and that self-consistency can improve reliability.

Keywords:

Clinical NLP; Information Extraction; Large Language Models; Mobility Functional Status; Reliability; Trustworthiness

Area of Science:

Natural Language Processing
Medical Informatics
Artificial Intelligence in Healthcare

Background:

Clinical narrative text is rich in patient data but challenging for information extraction (IE) due to variability.
Large language models (LLMs) show promise for clinical IE, but their reproducibility and robustness are critical for deployment.
Quantifying LLM stability is essential for reliable clinical applications.

Purpose of the Study:

To evaluate the reproducibility and robustness of three open-weight large language models (LLMs) for clinical information extraction.
To assess the impact of prompt variations and model architecture on LLM performance and stability.
To provide recommendations for improving LLM reliability in clinical settings.

Main Methods:

Evaluated three LLMs (Llama 3.3, Llama 4, MedGemma) on binary clinical IE tasks based on the mobility domain of the International Classification of Functioning, Disability and Health (ICF).
Quantified intra-prompt reproducibility (repeated sampling) and inter-prompt robustness (paraphrased prompts).
Measured predictive performance (F1-score) and stability (Fleiss' Kappa), analyzing factor effects with ANOVA.
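
As an illustration of the stability metric named above, the following is a minimal sketch of Fleiss' Kappa in plain Python. It treats each repeated LLM run as one "rater" and each clinical note as one item. The function name, the input layout, and the choice to return NaN when every label is identical are assumptions made for this sketch, not details taken from the study.

```python
from collections import Counter


def fleiss_kappa(ratings, categories=(0, 1)):
    """Fleiss' Kappa for a fixed number of raters per item.

    ratings: list of per-item lists; each inner list holds one label per
             repeated run (hypothetical layout chosen for this sketch).
    categories: the label set; binary (0, 1) by default.
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(r) != n_raters for r in ratings):
        raise ValueError("every item needs the same number (>= 2) of ratings")

    # counts[i][j] = number of runs that assigned category j to item i
    counts = [[Counter(r)[c] for c in categories] for r in ratings]

    # Observed agreement for each item, then averaged over items
    p_i = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_i) / n_items

    # Chance agreement from the overall category proportions
    total = n_items * n_raters
    p_j = [sum(row[j] for row in counts) / total for j in range(len(categories))]
    p_e = sum(p * p for p in p_j)

    if p_e == 1.0:
        # Every run gave the same label to every item: Kappa is undefined.
        # Returning NaN is an assumption of this sketch.
        return float("nan")
    return (p_bar - p_e) / (1 - p_e)


# Three repeated runs on two notes, with full agreement on each note
print(fleiss_kappa([[1, 1, 1], [0, 0, 0]]))  # 1.0
```

In this framing, intra-prompt reproducibility would use repeated samples from one prompt as the raters, while inter-prompt robustness would use outputs from paraphrased prompts as the raters.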

Main Results:

Increasing the sampling temperature generally decreased LLM agreement, with model-dependent effects.
Prompt paraphrasing significantly reduced LLM stability, especially for Mixture-of-Experts (MoE) models.
Self-consistency via majority voting improved stability (Kappa) and often maintained or improved performance (F1-score).
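
The self-consistency step described above can be sketched as a majority vote over repeated samples, followed by an F1-score against gold labels. This is a minimal sketch for binary labels only; the function names, the tie-breaking rule (ties resolve to 0), and the toy data are assumptions for illustration and do not come from the study.

```python
def majority_vote(samples):
    """Collapse repeated binary predictions (0/1) for one note into one label.

    A strict majority of 1s yields 1; ties resolve to 0 (an assumption of
    this sketch; using an odd number of samples avoids ties entirely).
    """
    return 1 if 2 * sum(samples) > len(samples) else 0


def f1_score(gold, pred):
    """F1 for the positive class (label 1) of a binary task."""
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Toy data: four notes, three sampled predictions per note (hypothetical)
gold = [1, 0, 1, 1]
sampled = [[1, 1, 0], [0, 0, 1], [1, 0, 0], [1, 1, 1]]

voted = [majority_vote(s) for s in sampled]
print(voted)                  # [1, 0, 0, 1]
print(f1_score(gold, voted))  # 0.8
```

Voting replaces several noisy outputs per note with a single label, which is why it can raise agreement between repeated evaluations while leaving F1 unchanged or higher.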

Conclusions:

LLM reliability in clinical IE is sensitive to prompt design and model architecture.
Self-consistency offers a practical method to enhance the stability and performance of LLMs for clinical tasks.
A reproducible framework is presented for evaluating and improving LLM reliability in healthcare.