Data Representation Bias and Conditional Distribution Shift Drive Predictive Performance Disparities in

Sandeep Kumar^1,2, Yan Cui^1,2,3

¹Department of Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, TN 38163, USA.

bioRxiv: the Preprint Server for Biology

June 5, 2026

Summary

This summary is machine-generated.

Machine learning models often underperform on population-stratified data because of data representation bias and distribution shifts. Conditional distribution shifts, combined with representation bias, significantly degrade model performance across population groups and influence how well transfer learning mitigates the resulting disparities.

Area of Science:

Genetics and Bioinformatics
Machine Learning and Artificial Intelligence
Population Health

Background:

Machine learning models often face performance and generalizability issues with population-stratified datasets.
Data representation bias and distribution shifts are key challenges impacting model fairness across diverse ancestry groups.
Understanding these challenges is crucial for developing equitable AI in various scientific domains.

Purpose of the Study:

To systematically investigate the influence of data representation bias and distribution shifts on multi-population machine learning.
To evaluate the effectiveness of mixture learning, independent learning, and transfer learning in mitigating disparities.
To provide insights for building robust and equitable machine learning models for diverse populations.
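
The three learning schemes named above can be illustrated with a minimal sketch. This is not the authors' pipeline: it assumes the common reading of the terms (mixture learning pools all populations, independent learning trains on the target population alone, transfer learning pre-trains on the majority population and fine-tunes on the minority). The two toy populations, the logistic model, and every name and parameter value here (`fit_logistic`, `make_population`, `w_major`, `w_minor`) are hypothetical.

```python
import numpy as np

def fit_logistic(X, y, w0=None, lr=0.1, steps=500):
    """Gradient-descent logistic regression; w0 allows a warm start."""
    w = np.zeros(X.shape[1]) if w0 is None else w0.copy()
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-X @ w))
        w -= lr * X.T @ (p - y) / len(y)
    return w

def accuracy(w, X, y):
    """Fraction of samples whose predicted class matches the label."""
    return float(np.mean((X @ w > 0) == (y == 1)))

def make_population(rng, n, w_true):
    """Toy population: standard-normal features, logistic outcome."""
    X = rng.normal(size=(n, w_true.size))
    y = (rng.random(n) < 1.0 / (1.0 + np.exp(-X @ w_true))).astype(float)
    return X, y

rng = np.random.default_rng(0)
# Assumption: the populations differ in P(Y|X) (one effect flips sign).
w_major = np.array([2.0, -1.0, 0.5, 0.0, 0.0])
w_minor = np.array([2.0, 1.0, 0.5, 0.0, 0.0])

X_maj, y_maj = make_population(rng, 2000, w_major)  # well represented
X_min, y_min = make_population(rng, 100, w_minor)   # under-represented
X_test, y_test = make_population(rng, 2000, w_minor)

# Mixture learning: one model on the pooled data.
w_mix = fit_logistic(np.vstack([X_maj, X_min]),
                     np.concatenate([y_maj, y_min]))
# Independent learning: minority data only.
w_ind = fit_logistic(X_min, y_min)
# Transfer learning: pre-train on the majority, fine-tune on the minority.
w_pre = fit_logistic(X_maj, y_maj)
w_tl = fit_logistic(X_min, y_min, w0=w_pre, steps=100)

results = {
    "mixture": accuracy(w_mix, X_test, y_test),
    "independent": accuracy(w_ind, X_test, y_test),
    "transfer": accuracy(w_tl, X_test, y_test),
}
print(results)
```

All three models are scored on held-out minority-population data, since that is where a performance disparity would show up. Changing the minority sample size or the difference between `w_major` and `w_minor` varies the representation bias and the shift in this toy setting.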

Main Methods:

Utilized synthetic genotype-phenotype datasets representing five continental populations.
Evaluated three distinct machine learning approaches: mixture learning, independent learning, and transfer learning.
Analyzed the impact of conditional and marginal distribution shifts alongside data representation bias.
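
For reference, the two kinds of distribution shift can be written down using the standard factorization of a joint distribution. The notation is an assumption introduced here, not taken from the paper: $X$ denotes the features (for example, genotypes), $Y$ the outcome (phenotype), $a$ and $b$ two populations, and $n_a$, $n_b$ their sample sizes.

```latex
% Joint distribution in population k, split into conditional and marginal parts
\[
  P_k(X, Y) = P_k(Y \mid X)\, P_k(X), \qquad k \in \{a, b\}
\]

% Marginal shift: feature distributions differ, the feature-outcome relationship does not
\[
  P_a(X) \neq P_b(X), \qquad P_a(Y \mid X) = P_b(Y \mid X)
\]

% Conditional shift: the feature-outcome relationship itself differs
\[
  P_a(Y \mid X) \neq P_b(Y \mid X)
\]

% Data representation bias: one population dominates the training data
\[
  n_a \gg n_b
\]
```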

Main Results:

Conditional distribution shifts, coupled with data representation bias, significantly degrade machine learning performance across diverse populations.
The effectiveness of transfer learning as a disparity mitigation strategy is notably influenced by these factors.
Marginal distribution shifts demonstrated a limited impact compared to conditional shifts.
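
A toy experiment can show why the two kinds of shift might behave so differently. The sketch below is not the study's simulation: it assumes a logistic outcome model, Gaussian features, and hypothetical parameter values (`w_src`, `w_flip`, `shifted_mean`). A model is fit on a source population and then scored on a target that differs only in P(X) and on a target that differs only in P(Y|X).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sample(rng, n, mean, w):
    """Draw X ~ N(mean, I) and Y ~ Bernoulli(sigmoid(X @ w))."""
    X = rng.normal(loc=mean, size=(n, w.size))
    y = (rng.random(n) < sigmoid(X @ w)).astype(float)
    return X, y

def fit(X, y, lr=0.1, steps=500):
    """Logistic regression by plain gradient descent."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        w -= lr * X.T @ (sigmoid(X @ w) - y) / len(y)
    return w

def accuracy(w, X, y):
    return float(np.mean((X @ w > 0) == (y == 1)))

rng = np.random.default_rng(1)
w_src = np.array([1.5, -1.0, 0.5])        # source P(Y|X), hypothetical
w_flip = np.array([1.5, 1.0, 0.5])        # target P(Y|X) with one effect flipped
zero = np.zeros(3)
shifted_mean = np.array([0.5, 0.5, 0.0])  # target P(X) with shifted means

# Train on the source population only.
X_tr, y_tr = sample(rng, 5000, zero, w_src)
w_hat = fit(X_tr, y_tr)

report = {
    # Same P(X) and P(Y|X) as the training data.
    "source": accuracy(w_hat, *sample(rng, 5000, zero, w_src)),
    # Marginal shift: P(X) changes, P(Y|X) is unchanged.
    "marginal_shift": accuracy(w_hat, *sample(rng, 5000, shifted_mean, w_src)),
    # Conditional shift: P(X) is unchanged, P(Y|X) changes.
    "conditional_shift": accuracy(w_hat, *sample(rng, 5000, zero, w_flip)),
}
print(report)
```

Because the fitted model is correctly specified for the source P(Y|X), changing only P(X) leaves its decision rule valid, whereas changing P(Y|X) makes the learned rule wrong for part of the feature space. This holds only for this simplified setting and says nothing about the magnitudes reported in the study.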

Conclusions:

The interplay between data representation bias and distribution shifts critically affects multi-population machine learning outcomes.
Conditional distribution shifts are a primary driver of performance disparities in population-stratified machine learning.
Findings offer critical insights for developing equitable and high-performing machine learning models for diverse datasets.