ChatGPT and reference intervals: a comparative analysis of repeatability in GPT-3.5 Turbo, GPT-4, and GPT-4o.

Annika Meyer¹,², Edgar Schömig³, Thomas Streichert²

¹Department of Anesthesiology and Operative Intensive Care, Faculty of Medicine and University Hospital, University Hospital Cologne, Cologne, Germany.

Frontiers in Artificial Intelligence

December 29, 2025

Summary

This summary is machine-generated.

Large language models such as ChatGPT show promise in laboratory medicine but struggle to produce consistent reference intervals. Newer versions are more repeatable, yet variability persists, especially for poorly standardized tests.

Keywords:

ChatGPT, chatbot, consistency, large language model, reference interval, repeatability

Area of Science:

Artificial Intelligence in Healthcare
Laboratory Medicine and Diagnostics
Clinical Pathology and Informatics

Background:

Large language models (LLMs) offer potential for rapid clinical consultation in laboratory medicine.
It remains uncertain how consistent and clinically reliable LLM-generated reference intervals are, especially when no clinical context is given.

Purpose of the Study:

To evaluate the repeatability of reference interval outputs from three ChatGPT versions (GPT-3.5-Turbo, GPT-4, GPT-4o).
To assess model consistency by using reference interval variability as a stress test when prompts omit interval information.

Main Methods:

A cross-sectional study involving 726,000 chatbot requests with standardized prompts.
Analysis of 246,842 reference intervals across 47 laboratory parameters for consistency.
Statistical analysis using coefficient of variation (CV) and regression models to assess variability.
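
As a rough illustration of the CV metric named above, the sketch below computes the coefficient of variation (standard deviation divided by the mean, as a percentage) separately for the lower and upper limits of repeated responses. The function name `coefficient_of_variation`, the use of the sample standard deviation, and all numbers are assumptions made for illustration. They are not taken from the study, whose exact computation is not described here.

```python
from statistics import mean, stdev


def coefficient_of_variation(values):
    """Return the CV in percent: 100 * sample standard deviation / |mean|.

    Assumption: sample (n - 1) standard deviation; the study may have
    used a different estimator.
    """
    if len(values) < 2:
        raise ValueError("need at least two values to estimate variability")
    m = mean(values)
    if m == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return 100.0 * stdev(values) / abs(m)


# Hypothetical repeated chatbot answers for ONE laboratory parameter.
# These numbers are invented for illustration and are not study data.
lower_limits = [3.5, 3.6, 3.5, 4.0, 3.4]
upper_limits = [5.0, 5.1, 5.0, 5.5, 5.0]

cv_lower = coefficient_of_variation(lower_limits)
cv_upper = coefficient_of_variation(upper_limits)

print(f"CV lower limit: {cv_lower:.2f}%")
print(f"CV upper limit: {cv_upper:.2f}%")
```

A lower CV means the repeated answers agree more closely. Because the CV divides by the mean, the same absolute spread yields a larger CV when the limit itself is small, which is one possible reason lower limits can show higher CVs than upper limits.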

Main Results:

Average CVs for reference intervals were 26.50% (lower limit) and 15.82% (upper limit).
GPT-4 and GPT-4o demonstrated significantly lower CVs than GPT-3.5-Turbo.
Inconsistent outputs were notable for poorly standardized parameters and varied unit expressions.

Conclusions:

While newer ChatGPT versions show improved repeatability, diagnostically unacceptable variability remains, particularly for unstandardized analytes.
Thoughtful prompt design, global standardization of lab practices, model refinement, and regulatory oversight are crucial.
Current AI chatbots should be limited to professional use and trained to decline interpretation when no reference intervals are provided.