Wrapper feature selection for small sample size data driven by complete error estimates | JoVE Visualize

Area of Science:

Machine Learning
Bioinformatics
Computational Biology

Background:

Wrapper-based feature selection is crucial for high-dimensional data, particularly in biomedical applications with limited sample sizes.
1-nearest-neighbor (1NN) classifiers are sensitive to irrelevant features, so they depend on effective feature selection.
Standard error estimates such as k-fold cross-validation and the bootstrap average over only a few random data partitions, so they can have high variance when samples are scarce.
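
The variance issue noted above can be illustrated with a small sketch. It uses synthetic data and NumPy only; the helper names (one_nn_error, k_fold_error) and the data-generating choices are assumptions made for illustration and are not taken from the study. The idea is simply that a standard 2-fold estimate of the 1NN error changes from one random partition to the next.

```python
import numpy as np


def one_nn_error(X_train, y_train, X_test, y_test):
    """Fraction of test points misclassified by a 1-nearest-neighbor rule."""
    # Squared Euclidean distance from every test point to every training point.
    d = ((X_test[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2)
    nearest = d.argmin(axis=1)
    return float(np.mean(y_train[nearest] != y_test))


def k_fold_error(X, y, k, rng):
    """Pooled k-fold CV error of 1NN for ONE random partition of the samples."""
    folds = np.array_split(rng.permutation(len(y)), k)
    wrong = 0.0
    for i, test_idx in enumerate(folds):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        err = one_nn_error(X[train_idx], y[train_idx], X[test_idx], y[test_idx])
        wrong += err * len(test_idx)  # convert the fold error back to a count
    return wrong / len(y)


# Hypothetical small-sample data: 20 samples, 5 features, one informative.
rng = np.random.default_rng(0)
n = 20
y = np.repeat([0, 1], n // 2)
X = rng.normal(size=(n, 5))
X[:, 0] += 1.5 * y  # only feature 0 carries class information

# Repeat 2-fold CV with different random partitions and look at the spread.
estimates = np.array([k_fold_error(X, y, k=2, rng=rng) for _ in range(200)])
print("2-fold estimates: min=%.2f max=%.2f std=%.3f"
      % (estimates.min(), estimates.max(), estimates.std()))
```

A wrapper that compares feature subsets using a single such estimate is partly comparing partition noise, which is the motivation for criteria that average over every partition.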

Purpose of the Study:

To propose and evaluate a complete bootstrap technique for feature selection in 1NN classifiers.
To assess the efficacy of complete bootstrap and complete cross-validation error estimates as selection criteria.
To compare these novel criteria against standard methods using various optimization algorithms.
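
As a rough illustration of what a "complete" bootstrap estimate could look like, the sketch below enumerates every possible bootstrap sample of a tiny dataset and pools the out-of-bag 1NN errors. This is an assumption-laden reading, not the study's definition: the pooled out-of-bag estimator, the brute-force enumeration, and all function names are hypothetical, and enumeration is only feasible for very small n (there are n^n samples). A Monte Carlo bootstrap with a fixed number of trials is included for contrast.

```python
from itertools import product

import numpy as np


def _oob_counts(d, y, sample):
    """Return (misclassified, tested) out-of-bag counts for one bootstrap sample."""
    # For 1NN, duplicated training points do not change the nearest neighbor,
    # so only the set of distinct drawn points matters.
    train = sorted({int(i) for i in sample})
    in_bag = set(train)
    oob = [i for i in range(len(y)) if i not in in_bag]
    if not oob:
        return 0, 0  # every point was drawn, nothing left to test on
    nearest = np.asarray(train)[d[np.ix_(oob, train)].argmin(axis=1)]
    return int(np.sum(y[nearest] != y[oob])), len(oob)


def _pairwise_sq_dist(X):
    X = np.asarray(X, dtype=float)
    return ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)


def complete_bootstrap_error(X, y):
    """ASSUMED estimator: pooled out-of-bag 1NN error over ALL n**n samples."""
    y = np.asarray(y)
    n = len(y)
    if n < 2:
        raise ValueError("need at least two samples")
    d = _pairwise_sq_dist(X)
    wrong = tested = 0
    for sample in product(range(n), repeat=n):
        w, t = _oob_counts(d, y, sample)
        wrong += w
        tested += t
    return wrong / tested


def monte_carlo_bootstrap_error(X, y, trials, rng):
    """Same pooled estimator, but over a fixed number of random samples."""
    y = np.asarray(y)
    n = len(y)
    d = _pairwise_sq_dist(X)
    wrong = tested = 0
    for _ in range(trials):
        w, t = _oob_counts(d, y, rng.integers(0, n, size=n))
        wrong += w
        tested += t
    if tested == 0:
        raise ValueError("no out-of-bag points in any trial")
    return wrong / tested


# Hypothetical toy data: 5 samples, 2 features (5**5 = 3125 bootstrap samples).
rng = np.random.default_rng(1)
y_demo = np.array([0, 0, 0, 1, 1])
X_demo = rng.normal(size=(5, 2)) + 2.0 * y_demo[:, None]
print("complete:", complete_bootstrap_error(X_demo, y_demo))
print("50 trials:", monte_carlo_bootstrap_error(X_demo, y_demo, 50, rng))
```

The enumerated value is deterministic for a given dataset, whereas the 50-trial value depends on the random draws.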

Main Methods:

Developed a complete bootstrap method for 1NN classifiers, averaging over all data partitions.
Utilized complete bootstrap and complete cross-validation error estimates as novel feature selection criteria.
Compared performance against standard 2-fold and 10-fold cross-validation and the bootstrap (50 trials), using Sequential Forward Selection (SFS), Binary Particle Swarm Optimization (BPSO), and Simplified Social Impact Theory based Optimization (SSITO) as search strategies.
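
The combination of a complete error estimate with a wrapper search can be sketched as follows. This is a minimal illustration, not the study's implementation: complete cross-validation is computed here by brute-force enumeration of every test subset of a given size, which is only practical for very small datasets, and the function names, the test-set size, and the synthetic data are all assumptions. SFS then greedily adds the feature that most reduces this criterion.

```python
from itertools import combinations

import numpy as np


def complete_cv_error(X, y, test_size):
    """Average 1NN test error over EVERY test subset of the given size."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    n = len(y)
    # Pairwise squared Euclidean distances, computed once and reused.
    d = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    total = 0.0
    count = 0
    for test_idx in combinations(range(n), test_size):
        test = list(test_idx)
        held_out = set(test)
        train = [i for i in range(n) if i not in held_out]
        nearest = np.asarray(train)[d[np.ix_(test, train)].argmin(axis=1)]
        total += float(np.mean(y[nearest] != y[test]))
        count += 1
    return total / count


def sequential_forward_selection(X, y, n_select, criterion):
    """Greedy SFS: repeatedly add the feature that minimizes the criterion."""
    X = np.asarray(X, dtype=float)
    selected = []
    remaining = list(range(X.shape[1]))
    while len(selected) < n_select and remaining:
        scores = {f: criterion(X[:, selected + [f]], y) for f in remaining}
        best = min(scores, key=scores.get)
        selected.append(best)
        remaining.remove(best)
    return selected


# Hypothetical toy data: 10 samples, 4 features, only feature 2 informative.
rng = np.random.default_rng(0)
y_demo = np.repeat([0, 1], 5)
X_demo = rng.normal(size=(10, 4))
X_demo[:, 2] += 3.0 * y_demo

# Complete 2-fold-style criterion: every split into 5 test / 5 training points.
criterion = lambda A, b: complete_cv_error(A, b, test_size=5)
print("selected features:", sequential_forward_selection(X_demo, y_demo, 2, criterion))
```

Because the criterion averages over all partitions, it returns the same value every time for a given feature subset, so the search is not misled by the choice of a particular random split. BPSO or SSITO could be substituted for the greedy loop while keeping the same criterion function.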

Main Results:

Complete criteria significantly outperformed standard cross-validation and bootstrap methods across all tested search strategies (SFS, BPSO, SSITO).
1NN wrappers employing complete criteria with SFS demonstrated superior performance compared to FILTER and SIMBA.
The proposed methods showed benefits in a real-world application for automatic subthalamic nucleus detection.

Conclusions:

The complete bootstrap and complete cross-validation error estimates offer lower variance and superior performance for feature selection in 1NN classifiers, especially with small sample sizes.
1NN wrappers based on the complete criteria, particularly when combined with SFS, are highly effective and recommended for biomedical data analysis.
The developed techniques were validated by applying them to automatic detection of the subthalamic nucleus.