Virtual screening of bioassay data | JoVE Visualize

Area of Science:

Computational chemistry
Cheminformatics
Machine learning in drug discovery

Background:

Challenges in virtual screening of bioassay data include limited access to curated datasets, a high rate of false positives in primary screening, and significant class imbalance (few active compounds vs. many inactive ones).
Access to pharmaceutical screening data is restricted, and the publicly available PubChem data are not curated and lack cross-references between primary and confirmatory assays.
Poor cross-referencing hampers the analysis of false positives in primary screening, though the cases that could be identified show an average of 64% false positives in high-throughput primary screening.
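
The dependence of the false-positive figure on cross-referencing can be illustrated with a minimal sketch. It assumes that a false positive is a primary-screen hit that was retested and not confirmed, and that each assay is reduced to a set of compound identifiers. The identifiers and the helper false_positive_fraction are hypothetical and are not taken from the study.

```python
def false_positive_fraction(primary_actives, confirmed_actives, retested):
    """Fraction of retested primary hits that were not confirmed.

    primary_actives   : compound IDs flagged active in the primary screen
    confirmed_actives : compound IDs active in the confirmatory assay
    retested          : compound IDs actually tested in the confirmatory assay
    """
    # Only primary hits that were retested can be judged either way,
    # which is why the two assays must be cross-referenced first.
    judged = set(primary_actives) & set(retested)
    if not judged:
        raise ValueError("no primary hits were retested; fraction is undefined")
    false_positives = judged - set(confirmed_actives)
    return len(false_positives) / len(judged)


# Hypothetical compound IDs, for illustration only.
primary = {"c1", "c2", "c3", "c4", "c5"}
retested = {"c1", "c2", "c3", "c4"}
confirmed = {"c1"}

fraction = false_positive_fraction(primary, confirmed, retested)
print(f"{fraction:.0%} of retested primary hits were false positives")
```

Compound c5 is excluded from the calculation because it was never retested; without the link between the two assays, its status cannot be determined.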

Purpose of the Study:

To discuss the key challenges in virtual screening of bioassay data.
To evaluate the performance of various Weka cost-sensitive classifiers (Naive Bayes, SVM, C4.5, Random Forest) on diverse bioassay datasets.
To investigate the impact of classifier choice and cost matrix settings on classification performance.

Main Methods:

Application of cost-sensitive classification algorithms available in the Weka software suite.
Use of Naive Bayes, Support Vector Machine (SVM), C4.5 decision tree, and Random Forest as classifiers.
Testing classifiers on multiple bioassay datasets with varying characteristics.
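
Cost-sensitive classification can be illustrated with the minimum-expected-cost decision rule, one common way of applying a cost matrix to a classifier's predicted class probabilities. The sketch below is plain Python rather than Weka code. The cost values, the probabilities, the class encoding, and the function name min_expected_cost_class are assumptions chosen for illustration; the study's actual settings are not reproduced here.

```python
def min_expected_cost_class(probabilities, cost_matrix):
    """Return the index of the prediction with the lowest expected cost.

    probabilities[i]  : estimated probability that the true class is i
    cost_matrix[i][j] : cost of predicting class j when the true class is i
    """
    n_classes = len(probabilities)
    expected = [
        sum(probabilities[i] * cost_matrix[i][j] for i in range(n_classes))
        for j in range(n_classes)
    ]
    return min(range(n_classes), key=expected.__getitem__)


# Hypothetical encoding: class 0 = inactive, class 1 = active.
equal_costs = [
    [0, 1],   # true inactive: correct, false positive
    [1, 0],   # true active: false negative, correct
]
# Assumption: missing an active costs 20 times as much as a false positive.
skewed_costs = [
    [0, 1],
    [20, 0],
]

probs = [0.9, 0.1]  # the classifier considers the compound probably inactive
print(min_expected_cost_class(probs, equal_costs))
print(min_expected_cost_class(probs, skewed_costs))
```

With equal costs the compound is labelled inactive (class 0). With the skewed matrix the same probabilities yield active (class 1), because the expected cost of a missed active (0.1 × 20) exceeds that of a false positive (0.9 × 1).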

Main Results:

Weka's Support Vector Machine (SVM) and C4.5 decision tree implementations demonstrated relatively strong performance in cost-sensitive classification tasks.
The optimal configuration of the Weka cost matrix depends on the specific base classifier being used, rather than solely on the ratio of class imbalance.
The high percentage of false positives in primary screening raises questions about the suitability of such data for virtual screening.

Conclusions:

Enhanced accessibility of curated pharmaceutical screening data is crucial for both industry and academia.
Virtual screening can benefit drug discovery by reducing the search space and, through analysis of false positives, improving primary screening processes.
Care must be taken when applying Weka's cost-sensitive classifiers; using generic misclassification costs based on class ratios is not recommended for comparing different classifiers on the same dataset.
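
One possible way to see why a single ratio-based cost may not transfer between classifiers is sketched below. This is an illustration under stated assumptions, not the study's analysis: it assumes the cost matrix is applied as a threshold on the predicted probability of the active class, with zero cost for correct predictions. The dataset size, the scores, and both function names are hypothetical.

```python
def ratio_based_fn_cost(n_inactive, n_active):
    """Generic false-negative cost set equal to the class ratio."""
    return n_inactive / n_active


def active_threshold(fp_cost, fn_cost):
    """Probability of 'active' above which predicting active is cheaper.

    With zero cost for correct predictions, predicting active is cheaper
    when p * fn_cost > (1 - p) * fp_cost, i.e. p > fp / (fp + fn).
    """
    return fp_cost / (fp_cost + fn_cost)


# Hypothetical dataset: 9,900 inactives and 100 actives.
fn_cost = ratio_based_fn_cost(9900, 100)
threshold = active_threshold(1.0, fn_cost)

# Hypothetical 'active' probabilities that two classifiers assign to the
# same five compounds; classifier B produces more extreme scores.
scores_a = [0.02, 0.05, 0.20, 0.40, 0.60]
scores_b = [0.001, 0.002, 0.005, 0.30, 0.95]

flagged = {}
for name, scores in (("A", scores_a), ("B", scores_b)):
    flagged[name] = sum(score > threshold for score in scores)
    print(f"classifier {name}: {flagged[name]} of {len(scores)} flagged active")
```

Under these assumptions the same cost matrix makes classifier A flag every compound while classifier B flags only two, so a cost derived from the class ratio alone does not place the two classifiers at comparable operating points.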