Comparing penalization methods for linear models on large observational health data

Area of Science:

Machine Learning in Healthcare
Statistical Modeling
Predictive Analytics

Background:

Logistic regression is widely used for healthcare predictions.
Regularization techniques are crucial for optimizing model performance and preventing overfitting.
Evaluating various regularization methods is essential for selecting the most effective approach.
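
For reference, the penalties that the compared variants build on can be written as below. This is one common parametrization, not necessarily the one used in the study; the symbols $\lambda$ (penalty strength), $\alpha$ (L1/L2 mixing weight), and $P(\beta)$ are notation assumed here.

```latex
\begin{aligned}
\hat{\beta} &= \arg\min_{\beta}\; -\frac{1}{n}\sum_{i=1}^{n}
  \Bigl[\, y_i\, x_i^{\top}\beta - \log\bigl(1 + e^{x_i^{\top}\beta}\bigr) \Bigr] + P(\beta) \\
\text{L1:}\quad P(\beta) &= \lambda \lVert \beta \rVert_1 \\
\text{L2:}\quad P(\beta) &= \lambda \lVert \beta \rVert_2^2 \\
\text{ElasticNet:}\quad P(\beta) &= \lambda \bigl( \alpha \lVert \beta \rVert_1 + (1-\alpha) \lVert \beta \rVert_2^2 \bigr) \\
\text{L0:}\quad P(\beta) &= \lambda \lVert \beta \rVert_0 = \lambda \cdot \#\{ j : \beta_j \neq 0 \}
\end{aligned}
```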

Purpose of the Study:

To compare the discrimination and calibration performance of different logistic regression regularization variants.
To assess how these methods perform under internal and external validation of healthcare prediction models.
To guide the selection of regularization techniques for improved predictive accuracy and interpretability.

Main Methods:

Used data from 5 US claims and electronic health record databases, restricted to a population of patients with major depressive disorder.
Developed and externally validated logistic regression models using L1, L2, ElasticNet, Adaptive L1, Adaptive ElasticNet, Broken Adaptive Ridge (BAR), and Iterative Hard Thresholding (IHT).
Employed a 75%/25% train-test split and evaluated performance using discrimination (AUC) and calibration metrics, with statistical analysis via Friedman's test and critical difference diagrams.
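
The evaluation steps above can be sketched in plain Python: a rank-based AUC for discrimination, and Friedman's statistic over a table of per-database scores, whose mean ranks are what a critical difference diagram plots. This is a minimal illustration, not the study's code; the function names, the method labels, and the toy numbers are all assumptions made up for the example, and the statistic omits the tie correction.

```python
def auc(labels, scores):
    """Rank-based AUC: chance that a random positive outscores a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    # Ties between a positive and a negative count as half a win.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_ranks(values):
    """Rank values within one row (1 = highest); tied values share the mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for pos_in_order in range(i, j + 1):
            ranks[order[pos_in_order]] = shared
        i = j + 1
    return ranks


def friedman_statistic(table):
    """table[b][m] is the score of method m on block b (e.g. one database).

    Returns (chi2, mean_ranks). No tie correction is applied.
    """
    n, k = len(table), len(table[0])
    rank_rows = [average_ranks(row) for row in table]
    mean_ranks = [sum(row[m] for row in rank_rows) / n for m in range(k)]
    chi2 = 12 * n / (k * (k + 1)) * sum(r * r for r in mean_ranks) - 3 * n * (k + 1)
    return chi2, mean_ranks


# Toy numbers for illustration only; these are NOT results from the study.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_auc_value = auc(toy_labels, toy_scores)

methods = ["method_a", "method_b", "method_c"]  # hypothetical labels
toy_table = [  # rows: databases, columns: methods
    [0.80, 0.78, 0.75],
    [0.82, 0.79, 0.76],
    [0.81, 0.80, 0.74],
    [0.79, 0.77, 0.73],
]
chi2, mean_ranks = friedman_statistic(toy_table)
print(toy_auc_value, dict(zip(methods, mean_ranks)), chi2)
```

The statistic is usually compared against a chi-square distribution with k - 1 degrees of freedom, where k is the number of methods; with few blocks, exact tables are more appropriate.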

Main Results:

L1 and ElasticNet regularization demonstrated superior internal and external discrimination performance.
BAR and IHT methods exhibited the best internal calibration, though no single method led in external calibration.
While IHT and BAR were slightly less discriminative, they significantly reduced model complexity and feature count compared to L1 and ElasticNet.
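
The difference in feature count can be illustrated with the two thresholding operators that underlie these families: soft thresholding is the proximal step for an L1 penalty, while hard thresholding, the projection used in Iterative Hard Thresholding, keeps only a fixed number of coefficients. The sketch below is a generic illustration, not the study's implementation; the coefficient vector, `lam`, and `k` are made-up values.

```python
import math


def soft_threshold(w, lam):
    """L1 proximal step: shrink every coefficient toward zero by lam."""
    out = []
    for x in w:
        mag = abs(x) - lam
        # Coefficients with magnitude at most lam become exactly zero.
        out.append(math.copysign(mag, x) if mag > 0 else 0.0)
    return out


def hard_threshold(w, k):
    """Keep the k largest-magnitude coefficients unchanged; zero out the rest."""
    keep = set(sorted(range(len(w)), key=lambda i: abs(w[i]), reverse=True)[:k])
    return [x if i in keep else 0.0 for i, x in enumerate(w)]


# Hypothetical coefficient vector, for illustration only.
w = [2.0, -0.3, 0.05, -1.2, 0.4]
soft = soft_threshold(w, lam=0.35)
hard = hard_threshold(w, k=2)

n_soft = sum(1 for x in soft if x != 0.0)
n_hard = sum(1 for x in hard if x != 0.0)
print(soft, n_soft)
print(hard, n_hard)
```

In this toy case, soft thresholding keeps three coefficients and shrinks each survivor, whereas hard thresholding keeps exactly `k` coefficients at their original values.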

Conclusions:

L1 and ElasticNet provided the best discriminative performance for logistic regression in this healthcare setting, in both internal and external validation.
L0-based methods (IHT, BAR) are advantageous for creating simpler, more interpretable models with fewer features and better internal calibration.
The findings assist in choosing appropriate regularization techniques for healthcare prediction models, balancing predictive performance, model complexity, and interpretability.