


Sample size calculation for training ensemble machine learning models on health data.

Nicholas Mitsakakis¹, Dan Liu¹,², Thomas Walters³

¹CHEO Research Institute, Ottawa, ON, Canada.

Patterns (New York, N.Y.)

June 22, 2026

Summary

This summary is machine-generated.

Researchers created a sample size calculator for machine learning (ML) models in health research. This tool helps determine adequate sample sizes for ensemble ML models, improving study design and data analysis.

Keywords:

sample size calculation


Area of Science:

Machine Learning
Health Research
Statistical Modeling

Background:

Health research studies frequently face limitations due to small sample sizes.
Training machine learning (ML) models requires substantial datasets, yet guidance on determining an adequate sample size is scarce.
Existing literature lacks comprehensive methods for calculating sample sizes for ML model development.

Purpose of the Study:

To develop an empirically derived sample size calculator for ensemble ML models.
To predict the necessary sample size for achieving a specific level of prognostic performance with a defined probability.
To compare the accuracy of the developed calculator against common heuristics and statistical approaches.

Main Methods:

Developed an empirically derived sample size calculator for ensemble ML models, including random forests, light gradient boosting machine (LGBM), and extreme gradient boosting (XGBoost).
Defined prognostic performance as the sample area under the ROC curve (ROC-AUC) relative to the optimal model trained on the full dataset.
Compared the calculator's accuracy against three common heuristics and one statistical approach for sample size calculation.
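
The performance criterion described above can be illustrated with a minimal sketch. This is not the authors' calculator: it only shows how a sample ROC-AUC can be computed, how a subsample's AUC might be expressed relative to the full-data model, and how one could pick the smallest candidate size that reaches a target with a given probability. The function names, the use of a simple ratio for "relative" performance, and the repeated-subsample input format are all assumptions made for illustration.

```python
from typing import Dict, List, Optional


def roc_auc(labels: List[int], scores: List[float]) -> float:
    """Sample ROC-AUC: share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def relative_auc(auc_subsample: float, auc_full: float) -> float:
    """Assumption: 'relative' performance is the ratio to the full-data AUC."""
    return auc_subsample / auc_full


def smallest_adequate_size(
    relative_auc_by_size: Dict[int, List[float]],
    target: float = 0.85,
    certainty: float = 0.90,
) -> Optional[int]:
    """Return the smallest sample size whose repeated runs meet the target.

    relative_auc_by_size maps a candidate training-set size to the relative
    AUC values observed over repeated random subsamples (hypothetical input).
    A size is adequate when the fraction of runs with relative AUC >= target
    is at least `certainty`. Returns None if no candidate size qualifies.
    """
    for n in sorted(relative_auc_by_size):
        runs = relative_auc_by_size[n]
        if not runs:
            continue  # no evidence for this size, skip it
        hit_rate = sum(r >= target for r in runs) / len(runs)
        if hit_rate >= certainty:
            return n
    return None
```

In practice, the values passed to `smallest_adequate_size` would come from training the chosen ensemble model on repeated subsamples of each size and scoring each on held-out data; the defaults of 0.85 and 0.90 mirror the 85% of optimal performance and 90% certainty quoted in the results.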

Main Results:

The developed calculator demonstrated significantly better accuracy for tree-based ensemble ML models compared to other methods.
For instance, for LGBM, the median relative error in the predicted sample size was 25% when targeting 85% of optimal performance with 90% certainty.
The calculator effectively predicts sample sizes needed for desired prognostic performance levels.
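
As a rough illustration of the accuracy metric quoted above, the sketch below computes a median relative error between predicted and reference sample sizes. The summary does not state the exact formula, so the definition used here, |predicted − actual| / actual, is an assumption, and the function name is hypothetical.

```python
from statistics import median
from typing import Sequence


def median_relative_error(
    predicted: Sequence[float], actual: Sequence[float]
) -> float:
    """Median of |predicted - actual| / actual over paired sample sizes.

    Assumption: relative error is taken with respect to the reference
    ("actual") sample size, which must be positive.
    """
    if len(predicted) != len(actual) or len(predicted) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    if any(a <= 0 for a in actual):
        raise ValueError("reference sample sizes must be positive")
    return median(abs(p - a) / a for p, a in zip(predicted, actual))
```

Under this definition, a value of 0.25 means that half of the predictions fall within 25% of the reference sample size.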

Conclusions:

The new sample size calculator provides a more accurate method for determining adequate sample sizes in health research utilizing ML.
This tool addresses the critical need for sample size guidance in ML model development, particularly for tree-based ensemble methods.
Improved sample size estimation can enhance the reliability and generalizability of ML models in health research.