STatistical Inference Relief (STIR) feature selection | JoVE Visualize

Area of Science:

Machine Learning
Bioinformatics
Statistical Genetics

Background:

Relief algorithms are effective for feature selection in high-dimensional data because they can identify features associated with an outcome even when the association arises from epistasis or other interactions.
However, Relief estimators are non-parametric and lack a formal statistical inference framework for deciding whether an attribute's importance estimate is significant.
A method is therefore needed that replaces arbitrary score thresholds with rigorous selection of important features, especially in complex biological datasets.

Purpose of the Study:

To reconceptualize Relief-based feature selection by developing a new family of STatistical Inference Relief (STIR) estimators.
To incorporate sample variance of nearest neighbor distances into attribute importance estimation to enable statistical significance calculation.
To provide a statistical inferential formalism for Relief-based scores, including adjustment for multiple testing and application to case-control data.

Main Methods:

Developed STatistical Inference Relief (STIR) estimators, a novel family of algorithms building upon Relief.
Incorporated sample variance of nearest neighbor distances into the attribute importance estimation process.
Developed a pseudo t-test version of Relief-based algorithms for case-control data analysis.
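
The methods above can be illustrated with a minimal sketch. It assumes a two-sample, pooled-variance t-statistic that compares each attribute's differences between nearest "miss" pairs (opposite class) and nearest "hit" pairs (same class). The function names `k_nearest_pairs` and `stir_t`, the Manhattan distance, the fixed k, and the simulated data are all illustrative assumptions; this is not the authors' implementation. Converting the statistic to a p-value against a t distribution is omitted to keep the sketch within the standard library.

```python
import math
import random
from statistics import mean, variance


def k_nearest_pairs(X, y, k):
    """Collect (i, j) pairs for the k nearest hits and k nearest misses of each sample.

    Manhattan distance over all attributes is an assumption of this sketch.
    """
    hits, misses = [], []
    n = len(X)
    for i in range(n):
        ranked = sorted(
            (sum(abs(a - b) for a, b in zip(X[i], X[j])), j)
            for j in range(n) if j != i
        )
        same = [j for _, j in ranked if y[j] == y[i]][:k]
        other = [j for _, j in ranked if y[j] != y[i]][:k]
        hits.extend((i, j) for j in same)
        misses.extend((i, j) for j in other)
    return hits, misses


def stir_t(X, hits, misses, a):
    """Pseudo t-statistic for attribute a: mean miss difference minus mean hit
    difference, scaled by a pooled standard error built from the sample
    variances of the neighbor differences."""
    d_hit = [abs(X[i][a] - X[j][a]) for i, j in hits]
    d_miss = [abs(X[i][a] - X[j][a]) for i, j in misses]
    n_h, n_m = len(d_hit), len(d_miss)
    pooled_var = ((n_m - 1) * variance(d_miss) + (n_h - 1) * variance(d_hit)) / (n_m + n_h - 2)
    std_err = math.sqrt(pooled_var * (1 / n_m + 1 / n_h))
    return (mean(d_miss) - mean(d_hit)) / std_err


# Hypothetical case-control data: attribute 0 carries a class effect,
# attributes 1-4 are pure noise.
random.seed(0)
n_per_class, n_noise = 30, 4
X, y = [], []
for label in (0, 1):
    for _ in range(n_per_class):
        signal = random.gauss(4.0 * label, 1.0)
        X.append([signal] + [random.gauss(0.0, 1.0) for _ in range(n_noise)])
        y.append(label)

hits, misses = k_nearest_pairs(X, y, k=5)
scores = [stir_t(X, hits, misses, a) for a in range(1 + n_noise)]
for a, s in enumerate(scores):
    print(f"attribute {a}: pseudo t = {s:.2f}")
```

Because the statistic is a ratio of a mean difference to its standard error, it does not change if an attribute is rescaled, which is why the sketch skips the range normalization that Relief-style scores often apply.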

Main Results:

Demonstrated the statistical power and type I error control of STIR on simulated data mimicking gene expression patterns, including main and network interaction effects.
Compared the performance of STIR using adaptive radius versus fixed-k nearest neighbor constructors.
Applied STIR to real RNA-Seq data from a major depressive disorder study, showing its utility in analyzing complex biological data.
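
To make the neighbor-constructor comparison concrete, the sketch below contrasts a fixed-k rule with one possible adaptive-radius rule. The summary does not specify the radius definition; the per-sample threshold used here (mean distance to all other samples minus half the standard deviation of those distances, in the style of MultiSURF) is an assumption, as are the function names and the toy data.

```python
from statistics import mean, pstdev


def manhattan(u, v):
    """Manhattan distance between two attribute vectors."""
    return sum(abs(a - b) for a, b in zip(u, v))


def fixed_k_neighbors(X, i, k):
    """Indices of the k samples closest to sample i, nearest first."""
    ranked = sorted((manhattan(X[i], X[j]), j) for j in range(len(X)) if j != i)
    return [j for _, j in ranked[:k]]


def adaptive_radius_neighbors(X, i):
    """Indices of samples inside a radius computed separately for sample i.

    Assumed radius: mean distance from i to all others minus half the
    standard deviation of those distances.
    """
    dists = [(j, manhattan(X[i], X[j])) for j in range(len(X)) if j != i]
    values = [d for _, d in dists]
    radius = mean(values) - pstdev(values) / 2
    return [j for j, d in dists if d < radius]


# Toy one-attribute data with a tight cluster, a second cluster, and an outlier.
X = [[0.0], [0.1], [0.2], [5.0], [5.1], [9.0]]

print(fixed_k_neighbors(X, 0, 3))       # fixed k=3 reaches into the second cluster
print(adaptive_radius_neighbors(X, 0))  # radius keeps only the tight cluster
print(adaptive_radius_neighbors(X, 5))  # the outlier gets a wider radius
```

With a fixed k, every sample contributes the same number of neighbors regardless of local density. With an adaptive radius, the neighborhood size varies by sample, so the hit and miss counts that enter any downstream statistic vary as well.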

Conclusions:

STIR provides a statistically rigorous framework for feature selection using Relief-based methods, retaining the ability to identify interactions.
The method allows for the calculation of statistical significance and adjustment for multiple testing, overcoming limitations of traditional Relief algorithms.
STIR shows promise for applications in genetic association studies and analysis of complex diseases like major depressive disorder.
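
Once each attribute has a p-value, multiple-testing adjustment can be applied across attributes. The summary does not name the procedure, so the sketch below uses the Benjamini-Hochberg false discovery rate adjustment as one common choice; the function name and the example p-values are hypothetical.

```python
def benjamini_hochberg(pvals):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone adjusted values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical per-attribute p-values.
pvals = [0.01, 0.04, 0.03, 0.20]
adjusted = benjamini_hochberg(pvals)
selected = [i for i, q in enumerate(adjusted) if q < 0.05]
print(adjusted)
print(selected)
```

In this example only the first attribute remains below 0.05 after adjustment, whereas three attributes would pass an unadjusted 0.05 threshold.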