Improving Machine Learning Classification Predictions through SHAP and Features Analysis Interpretation | JoVE Visualize

Area of Science:

Computational chemistry and cheminformatics
Machine learning in drug discovery
Cancer research

Background:

Tree-based machine learning (ML) algorithms like Extra Trees (ET), Random Forest (RF), Gradient Boosting Machine (GBM), and XGBoost (XGB) are vital in early drug discovery.
These models are prone to misclassification and offer limited interpretability, which hinders their practical application.
SHapley Additive Explanations (SHAP) offers a way to understand feature importance and potentially improve model predictions.
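
For orientation, the additive property that gives SHAP its name can be written as follows. The notation is an assumption introduced here, not taken from the study: \(f(x)\) is the model output for compound \(x\), \(M\) is the number of features, \(\phi_0\) is the base value (the expected model output over a background dataset), and \(\phi_i(x)\) is the contribution attributed to feature \(i\).

```latex
f(x) = \phi_0 + \sum_{i=1}^{M} \phi_i(x)
```

A positive \(\phi_i(x)\) pushes the prediction above the base value and a negative one pushes it below, which is why per-feature SHAP values can be inspected to see what drove an individual prediction.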

Purpose of the Study:

To develop and validate a novel approach integrating SHAP values and feature analysis to reduce misclassification errors in ML models.
To benchmark the performance of ET, RF, GBM, and XGB algorithms using prostate cancer cell line data.
To create a misclassification-detection framework to improve the reliability of virtual screening predictions.

Main Methods:

Benchmarking of ET, RF, GBM, and XGB classifiers using RDKit and ECFP4 molecular descriptors.
Application of SHAP value analysis to understand prediction drivers and identify misclassified compounds.
Development and testing of four misclassification-detection filtering rules: RAW, SHAP, RAW OR SHAP, and RAW AND SHAP.
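
The four rule names suggest two per-compound flags combined with logical OR and AND. The sketch below illustrates only that combination step. It assumes that a RAW check and a SHAP check have each already produced one boolean flag per compound; how the study derives those flags is not described here, and the function name `combine_flags` and the example values are hypothetical.

```python
from typing import Dict, List


def combine_flags(raw_flags: List[bool], shap_flags: List[bool]) -> Dict[str, List[bool]]:
    """Combine two per-compound flags into the four named filtering rules.

    Assumption: True means "this prediction is suspected to be a misclassification".
    """
    if len(raw_flags) != len(shap_flags):
        raise ValueError("raw_flags and shap_flags must have the same length")
    return {
        "RAW": list(raw_flags),
        "SHAP": list(shap_flags),
        # Flag a compound if either check fires (most permissive rule).
        "RAW OR SHAP": [r or s for r, s in zip(raw_flags, shap_flags)],
        # Flag a compound only if both checks fire (most conservative rule).
        "RAW AND SHAP": [r and s for r, s in zip(raw_flags, shap_flags)],
    }


# Hypothetical flags for four compounds.
raw_flags = [True, False, True, False]
shap_flags = [True, True, False, False]
rules = combine_flags(raw_flags, shap_flags)

for name, flags in rules.items():
    print(f"{name}: {sum(flags)} of {len(flags)} compounds flagged")
```

By construction, 'RAW OR SHAP' flags at least as many compounds as either single rule, and 'RAW AND SHAP' flags at most as many.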

Main Results:

GBM and XGB models achieved high performance (Matthews correlation coefficient, MCC > 0.58; F1-score > 0.8) on antiproliferative activity data for the PC3, LNCaP, and DU-145 cell lines.
SHAP analysis revealed that misclassified compounds often had feature values typical of the opposite class.
The 'RAW OR SHAP' rule flagged a substantial share of the misclassified compounds (up to 63% in LNCaP).
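
For readers less familiar with the two metrics quoted above, the following minimal sketch computes MCC and F1-score from binary labels using their standard definitions. The labels are made-up illustration data, not results from the study, and the helper names are hypothetical.

```python
import math
from typing import Sequence, Tuple


def confusion_counts(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[int, int, int, int]:
    """Return (tp, tn, fp, fn) for binary labels, where 1 is the active class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient; returns 0.0 when the denominator is zero."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1-score: harmonic mean of precision and recall."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Made-up labels for eight compounds (1 = active, 0 = inactive).
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
mcc_value = mcc(tp, tn, fp, fn)
f1_value = f1_score(tp, fp, fn)
print(f"MCC = {mcc_value:.2f}, F1 = {f1_value:.2f}")
```

MCC uses all four cells of the confusion matrix, so it stays informative when active and inactive compounds are imbalanced, whereas F1 ignores true negatives.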

Conclusions:

The proposed integration of SHAP and feature analysis provides an effective strategy to detect and mitigate misclassifications in ML models.
The developed filtering rules enhance classifier performance by enabling the exclusion of likely erroneous predictions.
This approach offers a valuable tool for improving the accuracy and reliability of virtual screening in drug discovery.
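
As a rough illustration of how excluding flagged predictions could be evaluated, the sketch below computes the share of misclassified compounds that a rule flags and the accuracy on the predictions that remain. This is a simplified stand-in under stated assumptions, not the study's evaluation protocol; the labels, the flags, and the function names are all hypothetical.

```python
from typing import Sequence


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def detection_rate(y_true: Sequence[int], y_pred: Sequence[int], flags: Sequence[bool]) -> float:
    """Fraction of misclassified compounds that the rule flagged."""
    flags_of_wrong = [f for t, p, f in zip(y_true, y_pred, flags) if t != p]
    return sum(flags_of_wrong) / len(flags_of_wrong) if flags_of_wrong else 0.0


def retained_accuracy(y_true: Sequence[int], y_pred: Sequence[int], flags: Sequence[bool]) -> float:
    """Accuracy computed only on the predictions that were not flagged."""
    kept = [(t, p) for t, p, f in zip(y_true, y_pred, flags) if not f]
    return sum(t == p for t, p in kept) / len(kept) if kept else 0.0


# Made-up data: two of eight predictions are wrong (indices 2 and 5).
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
# Hypothetical rule output: flags one wrong prediction (index 2) and one correct one (index 4).
flags = [False, False, True, False, True, False, False, False]

baseline = accuracy(y_true, y_pred)
detected = detection_rate(y_true, y_pred, flags)
retained = retained_accuracy(y_true, y_pred, flags)
print(f"baseline accuracy = {baseline:.2f}")
print(f"detection rate    = {detected:.2f}")
print(f"retained accuracy = {retained:.2f}")
```

The example also shows the trade-off of any such filter: a rule can discard correct predictions along with wrong ones, so the detection rate is best read together with how many compounds are removed.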