



Improving random forest predictions in small datasets from two-phase sampling designs.

Sunwoo Han (1), Brian D Williamson (1), Youyi Fong (2)

  • (1) Vaccine and Infectious Disease Division, Fred Hutchinson Cancer Research Center, Seattle, USA.

BMC Medical Informatics and Decision Making | November 23, 2021
Summary
This summary is machine-generated.

Optimizing random forests for rare outcomes in biomedical studies requires careful variable screening and inverse sampling probability weighting. Stacking random forests with generalized linear models further enhances prediction performance in small, two-phase sampled datasets.

Keywords:
Case–control design; Class imbalance; HIV vaccine; Variable screening


Area of Science:

  • Machine Learning
  • Biostatistics
  • Epidemiology

Background:

  • Random forests are powerful machine learning tools but require optimization for specific data structures.
  • Biomedical studies often involve two-phase sampling with rare outcomes and resource-intensive covariate measurements.
  • Optimizing random forest performance is crucial for these challenging datasets.
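
To make the two-phase setting concrete, here is a minimal sketch of a case-control style phase-two draw and the inverse sampling probability weights that go with it. Everything in it is an illustrative assumption rather than the study's actual design: the cohort size, the 5% outcome rate, the 10% control sampling fraction, and the helper name `two_phase_sample` are all hypothetical.

```python
import random

def two_phase_sample(outcomes, control_fraction, seed=0):
    """Return (indices, weights) for a hypothetical phase-two sample.

    Every case (outcome == 1) is kept; each control is kept with
    probability `control_fraction`. Each kept row gets a weight equal
    to the inverse of its sampling probability.
    """
    rng = random.Random(seed)
    indices, weights = [], []
    for i, y in enumerate(outcomes):
        prob = 1.0 if y == 1 else control_fraction
        if rng.random() < prob:
            indices.append(i)
            weights.append(1.0 / prob)
    return indices, weights

# Hypothetical phase-one cohort: 2000 participants, roughly 5% with the outcome.
cohort_rng = random.Random(1)
cohort = [1 if cohort_rng.random() < 0.05 else 0 for _ in range(2000)]

idx, w = two_phase_sample(cohort, control_fraction=0.1)

# Unweighted, the phase-two sample overstates how common the outcome is;
# weighting each row by its inverse sampling probability corrects for that.
true_prev = sum(cohort) / len(cohort)
naive_prev = sum(cohort[i] for i in idx) / len(idx)
weighted_prev = sum(wi * cohort[i] for i, wi in zip(idx, w)) / sum(w)
```

In this sketch the unweighted prevalence in the phase-two sample is far above the cohort prevalence, while the weighted estimate lands close to it. That gap is the reason weighting matters when a model is trained on phase-two data.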

Purpose of the Study:

  • To optimize random forest prediction performance for small datasets from two-phase sampling designs.
  • To evaluate the impact of variable screening, class balancing, weighting, and hyperparameter tuning.
  • To compare random forests with generalized linear models and explore ensemble methods.

Main Methods:

  • Utilized an immunologic marker dataset from an HIV vaccine efficacy trial.
  • Applied combinations of variable screening, class balancing, weighting, and hyperparameter tuning.
  • Employed stacking of random forests and generalized linear models.
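
The summary does not say which screening rule or balancing scheme was used, so the following is only a sketch under stated assumptions: screening ranks each marker by the absolute standardized mean difference between cases and controls, and balancing down-samples controls to match the number of cases. The function names `screen_variables` and `downsample_controls` and the simulated data are hypothetical.

```python
import random
import statistics

def screen_variables(X, y, keep):
    """Return the indices of the `keep` columns with the largest absolute
    standardized mean difference between cases and controls."""
    scores = []
    for j in range(len(X[0])):
        col = [row[j] for row in X]
        cases = [v for v, label in zip(col, y) if label == 1]
        controls = [v for v, label in zip(col, y) if label == 0]
        spread = statistics.pstdev(col) or 1.0  # guard against constant columns
        diff = abs(statistics.mean(cases) - statistics.mean(controls))
        scores.append((diff / spread, j))
    scores.sort(reverse=True)
    return sorted(j for _, j in scores[:keep])

def downsample_controls(y, seed=0):
    """Return row indices holding every case plus an equal number of
    randomly chosen controls."""
    rng = random.Random(seed)
    cases = [i for i, label in enumerate(y) if label == 1]
    controls = [i for i, label in enumerate(y) if label == 0]
    picked = rng.sample(controls, min(len(cases), len(controls)))
    return sorted(cases + picked)

# Simulated data (assumption): 20 cases, 180 controls, 10 markers,
# of which only markers 0 and 3 differ between cases and controls.
data_rng = random.Random(42)
y = [1] * 20 + [0] * 180
X = []
for label in y:
    row = [data_rng.gauss(0.0, 1.0) for _ in range(10)]
    if label == 1:
        row[0] += 2.5
        row[3] += 2.5
    X.append(row)

selected = screen_variables(X, y, keep=2)
balanced_rows = downsample_controls(y)
```

Screening shrinks the number of candidate markers before the forest is fit, and balancing changes the case-to-control ratio the forest sees. The two steps are independent here, which makes it easy to try them separately or together.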

Main Results:

  • Class balancing improved performance when no variable screening was applied but harmed performance when combined with screening.
  • The effect of weighting depended on whether variable screening was applied.
  • Hyperparameter tuning was ineffective for small sample sizes.
  • Random forests underperformed generalized linear models on some marker subsets.
  • Stacking improved prediction performance, with the gain depending on how dissimilar the base learners' predictions were.
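
The link between stacking gains and learner dissimilarity can be illustrated with a deliberately simplified stand-in for stacking: a convex blend of two learners' predicted probabilities, with the blend weight chosen by grid search on the Brier score. This is an assumption-laden sketch, not the authors' procedure. A full stacking workflow would typically use cross-validated predictions and a fitted meta-learner, and the simulated predictions and the names `brier` and `stack_weight` are hypothetical.

```python
import random

def brier(pred, y):
    """Mean squared difference between predicted probabilities and outcomes."""
    return sum((p - t) ** 2 for p, t in zip(pred, y)) / len(y)

def stack_weight(pred_a, pred_b, y, steps=20):
    """Grid-search alpha in [0, 1] to minimize the Brier score of
    alpha * pred_a + (1 - alpha) * pred_b."""
    best_alpha, best_score = 0.0, float("inf")
    for k in range(steps + 1):
        alpha = k / steps
        blend = [alpha * a + (1 - alpha) * b for a, b in zip(pred_a, pred_b)]
        score = brier(blend, y)
        if score < best_score:
            best_alpha, best_score = alpha, score
    return best_alpha, best_score

def noisy_learner(y, rng):
    """Simulated learner (assumption): a shared signal plus its own noise."""
    preds = []
    for t in y:
        p = (0.6 if t == 1 else 0.1) + rng.gauss(0.0, 0.15)
        preds.append(min(1.0, max(0.0, p)))
    return preds

rng = random.Random(7)
y = [1 if rng.random() < 0.2 else 0 for _ in range(300)]
pred_a = noisy_learner(y, rng)
pred_b = noisy_learner(y, rng)  # independent noise, so it errs differently

# Two dissimilar learners: blending can average away part of their noise.
alpha, blended_score = stack_weight(pred_a, pred_b, y)

# Two identical learners: blending has nothing to average away.
_, same_score = stack_weight(pred_a, pred_a, y)
```

When the two learners make different errors, the blended Brier score falls below either learner's own score. When the second learner is a copy of the first, the blend cannot improve on it.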

Conclusions:

  • Variable screening and inverse sampling probability weighting are key for random forest performance in small, two-phase sampled datasets.
  • Stacking random forests with generalized linear models offers performance improvements.
  • Careful method selection is vital for rare outcome prediction in biomedical research.
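
One generic way to let a tree ensemble respect inverse sampling probability weights is to draw each tree's bootstrap sample with probability proportional to those weights. The sketch below illustrates only that idea. The summary does not state how the authors applied the weights, and the sample sizes, sampling probabilities, and the helper name `weighted_bootstrap` are hypothetical.

```python
import random

def weighted_bootstrap(weights, seed=0):
    """Draw len(weights) row indices with replacement, each row chosen
    with probability proportional to its weight."""
    rng = random.Random(seed)
    n = len(weights)
    return rng.choices(range(n), weights=weights, k=n)

# Hypothetical phase-two sample: 100 cases (sampling probability 1.0)
# followed by 100 controls (sampling probability 0.1).
labels = [1] * 100 + [0] * 100
ipw = [1.0 / 1.0] * 100 + [1.0 / 0.1] * 100

rows = weighted_bootstrap(ipw)
case_share = sum(labels[i] for i in rows) / len(rows)
```

Although cases make up half of the phase-two rows, they form a much smaller share of the weighted bootstrap sample, closer to their share in the underlying cohort. A tree grown on `rows` would therefore see roughly the cohort's class mix rather than the enriched phase-two mix.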