


Updated: Oct 12, 2025


Improving random forest predictions in small datasets from two-phase sampling designs.

Sunwoo Han¹, Brian D Williamson¹, Youyi Fong²

¹Vaccine and Infectious Disease Division, Fred Hutchinson Cancer Research Center, Seattle, USA.

BMC Medical Informatics and Decision Making

November 23, 2021

Summary

This summary is machine-generated.

Optimizing random forests for rare outcomes in biomedical studies requires careful variable screening and inverse sampling probability weighting. Stacking random forests with generalized linear models further enhances prediction performance in small, two-phase sampled datasets.

Keywords:

Case–control design; Class imbalance; HIV vaccine; Variable screening


Area of Science:

Machine Learning
Biostatistics
Epidemiology

Background:

Random forests are powerful machine learning tools but require optimization for specific data structures.
Biomedical studies often involve two-phase sampling with rare outcomes and resource-intensive covariate measurements.
Optimizing random forest performance is crucial for these challenging datasets.
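
As a rough illustration of the two-phase setting, the sketch below computes inverse sampling probability weights, assuming every case and a simple random subsample of controls are selected for phase-two measurement. The counts and the function name `inverse_sampling_weights` are hypothetical and are not taken from the study.

```python
def inverse_sampling_weights(phase1_counts, phase2_counts):
    """Return one weight per stratum: phase-one count / phase-two count,
    i.e. the inverse of the phase-two sampling probability."""
    weights = {}
    for stratum, n_phase1 in phase1_counts.items():
        n_phase2 = phase2_counts[stratum]
        if n_phase2 <= 0 or n_phase2 > n_phase1:
            raise ValueError(f"invalid phase-two count for stratum {stratum!r}")
        weights[stratum] = n_phase1 / n_phase2
    return weights


# Hypothetical counts: all cases are measured, controls are subsampled.
phase1 = {"case": 40, "control": 2000}
phase2 = {"case": 40, "control": 200}
weights = inverse_sampling_weights(phase1, phase2)

# Weighted phase-two counts recover the phase-one totals.
recovered = {stratum: phase2[stratum] * w for stratum, w in weights.items()}
```

Under these assumed counts each measured control stands in for ten phase-one controls, which is why ignoring the weights would overstate how common the outcome is.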

Purpose of the Study:

To optimize random forest prediction performance for small datasets from two-phase sampling designs.
To evaluate the impact of variable screening, class balancing, weighting, and hyperparameter tuning.
To compare random forests with generalized linear models and explore ensemble methods.

Main Methods:

Utilized an immunologic marker dataset from an HIV vaccine efficacy trial.
Applied combinations of variable screening, class balancing, weighting, and hyperparameter tuning.
Employed stacking of random forests and generalized linear models.
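
Two of the steps listed above, variable screening and class balancing, can be sketched in a few lines. This is only a minimal illustration on synthetic data: the screening statistic (a standardized mean difference), the choice to downsample controls, the order of the two steps, and the names `screen_variables` and `downsample_controls` are all assumptions, not the procedures used in the study.

```python
import random
import statistics


def screen_variables(X, y, top_k):
    """Keep the top_k columns ranked by the absolute difference in means
    between cases (y == 1) and controls (y == 0), scaled by the column SD."""
    scores = []
    for j in range(len(X[0])):
        cases = [row[j] for row, label in zip(X, y) if label == 1]
        controls = [row[j] for row, label in zip(X, y) if label == 0]
        spread = statistics.pstdev([row[j] for row in X]) or 1.0  # guard constant columns
        score = abs(statistics.mean(cases) - statistics.mean(controls)) / spread
        scores.append((score, j))
    return sorted(j for _, j in sorted(scores, reverse=True)[:top_k])


def downsample_controls(X, y, rng):
    """Balance the classes by randomly dropping controls until they match the cases."""
    case_idx = [i for i, label in enumerate(y) if label == 1]
    control_idx = [i for i, label in enumerate(y) if label == 0]
    kept = rng.sample(control_idx, k=min(len(case_idx), len(control_idx)))
    idx = sorted(case_idx + kept)
    return [X[i] for i in idx], [y[i] for i in idx]


# Synthetic data: 5 cases, 45 controls; only column 0 carries signal.
rng = random.Random(0)
y = [1] * 5 + [0] * 45
X = [[rng.gauss(5.0 * label, 1.0), rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)]
     for label in y]

selected = screen_variables(X, y, top_k=1)
X_screened = [[row[j] for j in selected] for row in X]
X_bal, y_bal = downsample_controls(X_screened, y, rng)
```

Keeping the two steps as separate functions makes it easy to switch either one on or off, which is the kind of combination the study evaluated.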

Main Results:

Class balancing improved performance when no variable screening was applied but harmed it when screening was applied.
The effect of weighting depended on whether variable screening was applied.
Hyperparameter tuning was ineffective for small sample sizes.
Random forests underperformed generalized linear models on some marker subsets.
Stacking improved prediction performance, with the gain depending on how dissimilar the base learners' predictions were.
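
The role of prediction dissimilarity in stacking can be illustrated with a deliberately simplified meta-learner: a convex combination of two learners' predicted probabilities, with the mixing weight chosen by grid search on log loss. The prediction vectors below are made-up numbers standing in for out-of-fold predictions from a random forest and a generalized linear model; the study's actual stacking procedure may differ.

```python
import math


def log_loss(y, p, eps=1e-12):
    """Mean negative Bernoulli log-likelihood, with probabilities clipped away from 0 and 1."""
    total = 0.0
    for yi, pi in zip(y, p):
        pi = min(max(pi, eps), 1 - eps)
        total -= yi * math.log(pi) + (1 - yi) * math.log(1 - pi)
    return total / len(y)


def fit_convex_stack(y, p_a, p_b, grid_size=101):
    """Pick alpha in [0, 1] minimizing the log loss of alpha * p_a + (1 - alpha) * p_b."""
    best_alpha, best_loss = 0.0, float("inf")
    for step in range(grid_size):
        alpha = step / (grid_size - 1)
        blended = [alpha * a + (1 - alpha) * b for a, b in zip(p_a, p_b)]
        loss = log_loss(y, blended)
        if loss < best_loss:
            best_alpha, best_loss = alpha, loss
    return best_alpha, best_loss


# Hypothetical out-of-fold predictions; the two learners err on different observations.
y = [1, 1, 0, 0, 0, 0]
p_rf = [0.9, 0.2, 0.1, 0.6, 0.1, 0.2]
p_glm = [0.3, 0.8, 0.5, 0.1, 0.2, 0.1]

alpha, stacked_loss = fit_convex_stack(y, p_rf, p_glm)
single_best = min(log_loss(y, p_rf), log_loss(y, p_glm))
```

Because the grid includes alpha = 0 and alpha = 1, the blend can never do worse than the better single learner on the data used to fit alpha. If the two prediction vectors were identical, every alpha would give the same loss and stacking would add nothing.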

Conclusions:

Variable screening and inverse sampling probability weighting are key to random forest performance in small datasets from two-phase sampling designs.
Stacking random forests with linear models offers performance improvements.
Careful method selection is vital for rare outcome prediction in biomedical research.
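
One way inverse sampling probability weights might enter a random forest is through the bootstrap step, by drawing observations with probability proportional to their weights. The sketch below shows only that resampling step, under assumed weights of 1 for cases and 10 for controls; it is not the study's implementation, and `weighted_bootstrap` is a hypothetical helper.

```python
import random


def weighted_bootstrap(indices, weights, rng):
    """Draw len(indices) indices with replacement, with probability proportional to weights."""
    return rng.choices(indices, weights=weights, k=len(indices))


# Hypothetical phase-two sample: 5 cases (weight 1) and 20 controls (weight 10).
labels = [1] * 5 + [0] * 20
weights = [1.0 if label == 1 else 10.0 for label in labels]

rng = random.Random(42)
draw = weighted_bootstrap(list(range(len(labels))), weights, rng)

# Expected share of cases in a weighted draw versus the raw phase-two sample.
expected_case_share = sum(w for w, label in zip(weights, labels) if label == 1) / sum(weights)
raw_case_share = sum(labels) / len(labels)
```

With these numbers the weighted draw targets a case share of about 2.4%, compared with 20% in the unweighted phase-two sample. Class balancing pushes in the opposite direction, which makes the two adjustments worth considering together.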