Recursive Random Forests Enable Better Predictive Performance and Model Interpretation than Variable Selection by LASSO

Area of Science:

Quantitative Structure-Activity Relationship (QSAR) modeling
Cheminformatics
Machine Learning

Background:

Variable selection is critical for enhancing predictive accuracy and reducing noise in QSAR models.
Choosing the appropriate variables is more challenging than developing the predictive models themselves.
Existing methods need careful evaluation of how effectively they select descriptors.

Purpose of the Study:

To explore the applicability of two distinct variable selection methods: Random Forests (RF) and Least Absolute Shrinkage and Selection Operator (LASSO).
To compare the performance of RF and LASSO in selecting optimal variables for QSAR modeling across diverse datasets.
To propose and utilize novel metrics for a comprehensive evaluation of variable selection strategies.

Main Methods:

Applied recursive RF to iteratively remove less important descriptors.
Employed LASSO with 10-fold inner cross-validation to determine the optimal penalty parameter (λ).
Utilized highest pairwise correlation rate, average Pearson's correlation coefficient, and Tanimoto coefficient for evaluation.
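The recursive elimination step can be illustrated with a minimal sketch. This is not the study's implementation: the callbacks `rank_fn` and `score_fn`, the `drop_fraction` value, and the toy descriptor names are all assumptions made for illustration. In practice `rank_fn` would return importances from a fitted random forest and `score_fn` would return a cross-validated performance estimate.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical callback types (not taken from the study):
#   rank_fn(descriptors)  -> {descriptor: importance}, e.g. from a fitted random forest
#   score_fn(descriptors) -> validation score of a model built on them (higher is better)
ImportanceFn = Callable[[Sequence[str]], Dict[str, float]]
ScoreFn = Callable[[Sequence[str]], float]


def recursive_elimination(
    descriptors: Sequence[str],
    rank_fn: ImportanceFn,
    score_fn: ScoreFn,
    drop_fraction: float = 0.2,  # assumed value; the summary does not state one
    min_descriptors: int = 1,
) -> Tuple[List[str], float]:
    """Repeatedly drop the least important descriptors; return the best-scoring subset."""
    current = list(descriptors)
    best_subset, best_score = list(current), score_fn(current)
    while len(current) > min_descriptors:
        importances = rank_fn(current)
        # Rank from most to least important, then cut the tail.
        ranked = sorted(current, key=lambda d: importances[d], reverse=True)
        n_drop = max(1, int(len(ranked) * drop_fraction))
        n_keep = max(min_descriptors, len(ranked) - n_drop)
        current = ranked[:n_keep]
        score = score_fn(current)
        if score >= best_score:  # ties favor the smaller subset
            best_subset, best_score = list(current), score
    return best_subset, best_score


# Toy stand-ins so the sketch runs without a modeling library.
TOY_IMPORTANCE = {"logP": 0.5, "tpsa": 0.3, "mw": 0.15, "noise1": 0.03, "noise2": 0.02}
INFORMATIVE = {"logP", "tpsa", "mw"}


def toy_rank(descs: Sequence[str]) -> Dict[str, float]:
    return {d: TOY_IMPORTANCE[d] for d in descs}


def toy_score(descs: Sequence[str]) -> float:
    # Reward informative descriptors, penalize noisy ones.
    n_good = sum(d in INFORMATIVE for d in descs)
    n_noise = len(descs) - n_good
    return n_good - 0.1 * n_noise


best_subset, best_score = recursive_elimination(
    ["logP", "mw", "noise1", "noise2", "tpsa"], toy_rank, toy_score
)
```

With the toy scorers, the loop discards the two noise descriptors and keeps the three informative ones. Re-ranking inside the loop matters because importances can shift once weaker descriptors are removed.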

Main Results:

Variable selection significantly reduced the number of noisy descriptors (by up to 96% with RF) and improved predictive performance.
RF selected important predictors without restricting pairwise correlations, whereas LASSO excludes highly correlated variables.
LASSO's tendency to exclude correlated variables can lead to the omission of important predictors, potentially undermining model performance.
Optimal variable sets from RF and LASSO showed low similarity (Tanimoto coefficients below 0.20 in 7 of 8 datasets).
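To make the similarity figure concrete, here is a minimal sketch of a Tanimoto coefficient between two selected descriptor sets, assuming the set-based (Jaccard) form: shared descriptors divided by all descriptors chosen by either method. The descriptor names and the value returned for two empty sets are assumptions for illustration, not data from the study.

```python
from typing import Iterable


def tanimoto(selected_a: Iterable[str], selected_b: Iterable[str]) -> float:
    """Set-based Tanimoto (Jaccard) similarity between two descriptor selections."""
    a, b = set(selected_a), set(selected_b)
    union = a | b
    if not union:
        return 0.0  # assumed convention when both selections are empty
    return len(a & b) / len(union)


# Hypothetical selections: only "logP" is shared.
rf_selected = ["logP", "tpsa", "mw", "n_rings", "hbd"]
lasso_selected = ["logP", "hba", "n_rot", "charge"]

similarity = tanimoto(rf_selected, lasso_selected)  # 1 shared / 8 total = 0.125
```

A value this low means the two methods agree on very few descriptors, even when both yield usable models.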

Conclusions:

The strategy for variable selection, rather than the learning algorithm itself, primarily drives differences in predictive performance.
Effective variable selection is more critical than the choice of learning algorithm for successful QSAR modeling.
The study advocates developing a standardized procedure, based on the proposed metrics, to identify truly important variables for model interpretation and for applications in drug discovery and environmental toxicity assessment.