



ChatGPT and reference intervals: a comparative analysis of repeatability in GPT-3.5 Turbo, GPT-4, and GPT-4o.

Annika Meyer1,2, Edgar Schömig3, Thomas Streichert2

  • 1Department of Anesthesiology and Operative Intensive Care, Faculty of Medicine and University Hospital, University Hospital Cologne, Cologne, Germany.

Frontiers in Artificial Intelligence | December 29, 2025


Summary
This summary is machine-generated.

Large language models such as ChatGPT show promise in laboratory medicine but struggle to return consistent reference intervals. Newer versions are more repeatable, yet variability persists, especially for unstandardized tests.

Keywords:
ChatGPT, chatbot, consistency, large language model, reference interval, repeatability


Area of Science:

  • Artificial Intelligence in Healthcare
  • Laboratory Medicine and Diagnostics
  • Clinical Pathology and Informatics

Background:

  • Large language models (LLMs) offer potential for rapid clinical consultation in laboratory medicine.
  • Uncertainty exists regarding the consistency and clinical reliability of reference intervals generated by LLMs, especially without clinical context.

Purpose of the Study:

  • To evaluate the repeatability of reference interval outputs from three ChatGPT versions (GPT-3.5-Turbo, GPT-4, GPT-4o).
  • To assess model consistency by using reference interval variability as a stress test when prompts omit interval information.

Main Methods:

  • A cross-sectional study involving 726,000 chatbot requests with standardized prompts.
  • Analysis of 246,842 reference intervals across 47 laboratory parameters for consistency.
  • Statistical analysis using coefficient of variation (CV) and regression models to assess variability.
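
The coefficient of variation named in the methods can be illustrated with a minimal Python sketch. It assumes the CV is the sample standard deviation divided by the absolute mean, expressed in percent, and that the lower and upper limits are treated separately; the summary does not say which SD estimator the study used. The function name `coefficient_of_variation` and the `responses` values are hypothetical and are not data from the study.

```python
from statistics import mean, stdev


def coefficient_of_variation(values):
    """Return the CV in percent: 100 * sample SD / |mean|.

    Assumption: sample standard deviation (n - 1 denominator).
    """
    if len(values) < 2:
        raise ValueError("need at least two values to estimate a CV")
    m = mean(values)
    if m == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return 100.0 * stdev(values) / abs(m)


# Hypothetical repeated answers to one identical prompt, already parsed into
# (lower limit, upper limit) pairs and expressed in the same unit.
responses = [(3.5, 5.0), (3.6, 5.1), (3.5, 5.2), (3.4, 5.0)]

# Compute one CV per limit, mirroring the separate lower/upper figures reported.
lower_cv = coefficient_of_variation([lo for lo, _ in responses])
upper_cv = coefficient_of_variation([hi for _, hi in responses])

print(f"lower-limit CV: {lower_cv:.2f}%")
print(f"upper-limit CV: {upper_cv:.2f}%")
```

Values reported in different units would need to be converted to a common unit first; otherwise the unit mix alone inflates the CV.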

Main Results:

  • Average CVs for reference intervals were 26.50% (lower limit) and 15.82% (upper limit).
  • GPT-4 and GPT-4o demonstrated significantly lower CVs than GPT-3.5-Turbo.
  • Inconsistent outputs were notable for poorly standardized parameters and varied unit expressions.

Conclusions:

  • While newer ChatGPT versions show improved repeatability, diagnostically unacceptable variability remains, particularly for unstandardized analytes.
  • Thoughtful prompt design, global standardization of lab practices, model refinement, and regulatory oversight are crucial.
  • Current AI chatbots should be limited to professional use and trained to decline interpretation without provided reference intervals.