Distributional bias compromises leave-one-out cross-validation.

George I Austin (1,2), Itsik Pe'er (2,3), Tal Korem (2,4)

  • (1) Department of Biomedical Informatics, Columbia University Irving Medical Center, New York, NY, USA.

Science Advances | November 28, 2025
Summary
This summary is machine-generated.

Leave-one-out cross-validation can introduce "distributional bias," which distorts machine learning model evaluation when predictions are aggregated across folds. A new rebalanced cross-validation method corrects this bias, improving performance assessment across a range of machine learning tasks.


Area of Science:

  • Machine Learning
  • Statistical Modeling
  • Data Science

Background:

  • Cross-validation is a standard technique for assessing machine learning model generalization.
  • Leave-one-out cross-validation (LOOCV) is frequently employed in low-data scenarios.
  • Aggregating predictions across LOOCV folds is common practice for performance metrics.
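
To make "aggregating predictions across LOOCV folds" concrete, here is a minimal sketch in plain Python. Each sample is held out once, its prediction is pooled with the others, and a single metric is computed over the pooled list. The helper names, the 1-nearest-neighbor model, and the toy data are assumptions made for illustration; none of them come from the study.

```python
from typing import Callable, List, Sequence

Vector = Sequence[float]


def nearest_neighbor_predict(train_x: List[Vector], train_y: List[int], query: Vector) -> int:
    """Return the label of the training point closest to `query` (1-NN)."""
    def sq_dist(a: Vector, b: Vector) -> float:
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    best = min(range(len(train_x)), key=lambda j: sq_dist(train_x[j], query))
    return train_y[best]


def loocv_pooled_predictions(
    x: List[Vector],
    y: List[int],
    fit_predict: Callable[[List[Vector], List[int], Vector], int],
) -> List[int]:
    """Hold out each sample once and pool the held-out predictions."""
    pooled = []
    for i in range(len(x)):
        train_x = [row for j, row in enumerate(x) if j != i]
        train_y = [label for j, label in enumerate(y) if j != i]
        pooled.append(fit_predict(train_x, train_y, x[i]))
    return pooled


# Toy data (made up for illustration): two well-separated clusters.
toy_x = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.3), (2.0, 2.1), (2.2, 1.9), (1.9, 2.3)]
toy_y = [0, 0, 0, 1, 1, 1]

pooled = loocv_pooled_predictions(toy_x, toy_y, nearest_neighbor_predict)
# One metric is computed over all pooled predictions, not averaged per fold.
accuracy = sum(p == t for p, t in zip(pooled, toy_y)) / len(toy_y)
print(pooled, accuracy)
```

Because every fold contains a single test sample, per-fold metrics such as auROC are undefined, which is why pooling is the usual choice in this setting.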

Purpose of the Study:

  • To identify and mathematically prove the existence of "distributional bias" in aggregated cross-validation.
  • To demonstrate the negative impact of distributional bias on model evaluation and hyperparameter tuning.
  • To develop and validate a novel cross-validation approach robust to distributional bias.

Main Methods:

  • Theoretical proof establishing a negative correlation between the mean label of each training fold and the label of its held-out test instance.
  • Empirical validation across diverse machine learning tasks, models, and evaluation metrics.
  • Development and simulation of a rebalanced cross-validation technique for bias mitigation.
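
The negative correlation can be seen directly in the leave-one-out case. The derivation below is a short illustration, not the authors' proof; the symbols n (number of samples), y_i (label of the held-out sample), \bar{y} (mean of all labels), and \bar{y}_{-i} (mean of the training fold that excludes sample i) are notation introduced here.

```latex
\begin{align}
\bar{y} &= \frac{1}{n}\sum_{j=1}^{n} y_j, \\
\bar{y}_{-i} &= \frac{1}{n-1}\sum_{j \neq i} y_j
             = \frac{n\bar{y} - y_i}{n-1}.
\end{align}
```

Since n\bar{y} is the same for every fold, \bar{y}_{-i} is a decreasing linear function of y_i with slope -1/(n-1). Across folds, the training-fold mean and the held-out label therefore have a Pearson correlation of exactly -1 whenever the labels are not all identical.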

Main Results:

  • Distributional bias was proven to be an inherent artifact of aggregated LOOCV, negatively affecting measured performance.
  • This bias was observed across various machine learning applications and can unfairly penalize strong regularization.
  • The proposed rebalanced cross-validation method demonstrated improved accuracy and robustness in simulations and benchmarks.
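
A small sketch can show how this bias surfaces in an aggregated metric. It assumes an extreme, hypothetical "model" that ignores the features and predicts the mean label of its training fold, which is roughly what a very strongly regularized model regresses toward. The function names and labels are made up for illustration and are not taken from the study.

```python
from typing import List


def training_mean_predictions(labels: List[int]) -> List[float]:
    """For each held-out sample, predict the mean label of the remaining samples."""
    total = sum(labels)
    n = len(labels)
    return [(total - y) / (n - 1) for y in labels]


def pairwise_auroc(scores: List[float], labels: List[int]) -> float:
    """auROC as the fraction of (positive, negative) pairs ranked correctly.

    Tied pairs count as half a correct ranking.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical binary labels: four positives and four negatives.
labels = [1, 0, 1, 1, 0, 0, 1, 0]
scores = training_mean_predictions(labels)
aggregated_auroc = pairwise_auroc(scores, labels)
print(aggregated_auroc)
```

Every positive sample is scored 3/7 and every negative sample 4/7, so the pooled auROC comes out at 0 instead of the 0.5 expected from an uninformative predictor. A model that leans on the training mean is thus scored worse than chance, which is one way strong regularization can end up penalized.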

Conclusions:

  • Aggregated leave-one-out cross-validation introduces a systematic distributional bias, compromising evaluation reliability.
  • A new rebalanced cross-validation strategy effectively mitigates this bias in both classification and regression.
  • This method offers a more accurate and reliable approach to machine learning model assessment, particularly in data-scarce settings.
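
The summary does not spell out how the rebalancing works, so the sketch below shows only one plausible rule for binary classification, stated as an assumption: for each held-out sample, also drop one randomly chosen training sample of the opposite class, so that every training fold has the same class balance. The function names and labels are hypothetical, and the authors' actual procedure may differ.

```python
import random
from typing import List


def plain_loo_training_means(labels: List[int]) -> List[float]:
    """Mean training label for each standard leave-one-out fold."""
    n, total = len(labels), sum(labels)
    return [(total - y) / (n - 1) for y in labels]


def rebalanced_loo_training_means(labels: List[int], seed: int = 0) -> List[float]:
    """Assumed rebalancing rule: also drop one random opposite-class training sample.

    Requires binary labels with at least one sample in each class.
    """
    rng = random.Random(seed)
    means = []
    for i, held_out in enumerate(labels):
        opposite = [j for j, y in enumerate(labels) if j != i and y != held_out]
        dropped = rng.choice(opposite)
        train = [y for j, y in enumerate(labels) if j not in (i, dropped)]
        means.append(sum(train) / len(train))
    return means


# Hypothetical binary labels: six positives and four negatives.
labels = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]
plain = plain_loo_training_means(labels)
rebalanced = rebalanced_loo_training_means(labels)
print(sorted(set(plain)), sorted(set(rebalanced)))
```

Under standard leave-one-out the training mean takes two values that depend on the held-out label, while under this rule it is the same for every fold, so the training mean no longer carries information about the test label. The cost is one fewer training sample per fold.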