Variable selection - A review and recommendations for the practicing statistician

Area of Science:

Biostatistics
Medical Informatics
Statistical Modeling

Background:

Statistical models are crucial for medical research, aiding in outcome prediction and risk factor analysis.
Established theory applies when the set of independent variables is fixed and small, ensuring unbiased estimates and valid confidence intervals.
Routine medical research often involves a large number of candidate variables (commonly 10-30), which is frequently too many to include all of them in a single statistical model.

Purpose of the Study:

To provide an overview of various variable selection methods used in statistical modeling.
To address the challenges posed by a large number of candidate variables in routine medical research.
To offer pragmatic recommendations for statisticians on applying variable selection methods and ensuring model stability and valid inference.

Main Methods:

Review of variable selection approaches based on significance testing, information criteria, penalized likelihood, the change-in-estimate criterion, and background knowledge.
Discussion of the transferability of methods from linear regression to generalized linear models and survival data.
Proposal of resampling-based quantities for routine reporting in automated variable selection algorithms.
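
For orientation, the selection principles named above can be written in their usual textbook forms. This is a sketch under assumed notation, not the notation of the reviewed paper: L is the likelihood, k the number of estimated parameters, n the sample size, p the number of candidate variables, lambda a tuning parameter, and the subscript E marks the coefficient of the exposure of interest. The lasso is shown as one example of a penalized likelihood method.

```latex
\begin{align*}
\mathrm{AIC} &= -2 \log L(\hat\beta) + 2k \\
\mathrm{BIC} &= -2 \log L(\hat\beta) + k \log n \\
\hat\beta_{\text{lasso}} &= \arg\max_{\beta}
  \Bigl\{ \log L(\beta) - \lambda \sum_{j=1}^{p} |\beta_j| \Bigr\} \\
\delta_j &= \frac{\hat\beta_E^{(-j)} - \hat\beta_E}{\hat\beta_E}
\end{align*}
```

Here the last line is the relative change in the exposure coefficient when covariate j is removed from the model. Under the change-in-estimate criterion, covariate j is kept if the absolute value of this change exceeds a chosen threshold.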

Main Results:

Variable selection methods, while useful, can compromise model stability, unbiasedness of regression coefficients, and the validity of p-values and confidence intervals.
The study categorizes methods based on underlying principles like significance, information criteria, or penalized likelihood.
Resampling-based approaches are suggested to enhance the reliability of automated variable selection processes.
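
One resampling-based quantity that is commonly used to assess selection stability is the bootstrap inclusion frequency: the share of bootstrap resamples in which each candidate variable is selected. The sketch below illustrates the idea for a linear model with backward elimination by AIC. It is an assumption that this matches the quantities proposed in the review. The function names, the simulated data, and the pure standard-library least-squares solver are all hypothetical and chosen only to keep the example self-contained. In practice a statistics package would do the model fitting.

```python
import math
import random


def ols_rss(X, y):
    """Residual sum of squares of a least-squares fit with intercept."""
    rows = [[1.0] + list(r) for r in X]
    p = len(rows[0])
    # Augmented normal equations [X'X | X'y].
    A = [[sum(r[i] * r[j] for r in rows) for j in range(p)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))] for i in range(p)]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(p):
        piv = max(range(c, p), key=lambda k: abs(A[k][c]))
        A[c], A[piv] = A[piv], A[c]
        for r in range(p):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    beta = [A[i][p] / A[i][i] for i in range(p)]
    return sum((yi - sum(b * v for b, v in zip(beta, r))) ** 2
               for r, yi in zip(rows, y))


def aic(X, y, cols):
    """Gaussian AIC (up to an additive constant) for the given columns."""
    n = len(y)
    sub = [[row[j] for j in cols] for row in X]
    return n * math.log(ols_rss(sub, y) / n) + 2 * (len(cols) + 1)


def backward_aic(X, y):
    """Drop one variable at a time for as long as doing so lowers the AIC."""
    cols = list(range(len(X[0])))
    best = aic(X, y, cols)
    while cols:
        cand, drop = min((aic(X, y, [c for c in cols if c != d]), d)
                         for d in cols)
        if cand >= best:
            break
        best = cand
        cols.remove(drop)
    return cols


def inclusion_frequencies(X, y, n_boot=100, seed=0):
    """Share of bootstrap resamples in which each variable is selected."""
    rng = random.Random(seed)
    n, k = len(y), len(X[0])
    counts = [0] * k
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # resample rows with replacement
        Xb = [X[i] for i in idx]
        yb = [y[i] for i in idx]
        for c in backward_aic(Xb, yb):
            counts[c] += 1
    return [c / n_boot for c in counts]


# Simulated data (hypothetical): variable 0 has a strong effect,
# variable 1 a weak one, variables 2 and 3 are pure noise.
data_rng = random.Random(1)
X = [[data_rng.gauss(0, 1) for _ in range(4)] for _ in range(100)]
y = [2.0 * r[0] + 0.3 * r[1] + data_rng.gauss(0, 1) for r in X]

freqs = inclusion_frequencies(X, y)
for j, f in enumerate(freqs):
    print(f"variable {j}: selected in {f:.0%} of resamples")
```

A variable with a strong effect should be selected in nearly every resample, while weak or noise variables are selected in only some of them. Reporting these frequencies next to the final model shows how much the selected set depends on the particular sample.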

Conclusions:

Variable selection in statistical modeling requires careful consideration due to potential impacts on model reliability and inference validity.
The review gives practicing statisticians pragmatic recommendations for applying variable selection in low-dimensional modeling problems, with emphasis on stability investigations and valid inference.
The integration of resampling-based diagnostics into software is proposed to improve the routine application of automated variable selection.