Binary Classification with Imbalanced Data | JoVE Visualize

Area of Science:

Statistics
Machine Learning
Computational Statistics

Background:

Imbalanced data, characterized here by an excess of zeros in the binary response variable, pose significant challenges for binary classification.
Existing methods struggle with accurate parameter estimation and prediction when dealing with zero-inflated and imbalanced datasets.

Purpose of the Study:

To propose an expectation-maximization (EM) algorithm that simplifies the computation of maximum likelihood estimators (MLEs) of zero-inflated Bernoulli (ZIBer) model parameters for imbalanced data.
To compare the predictive performance of the ZIBer model against popular machine learning algorithms like LightGBM and artificial neural networks (ANNs) using Monte Carlo simulations.
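
This summary does not state the paper's exact parameterization, so the following is only one common way to write a zero-inflated Bernoulli model. It assumes a latent "at-risk" indicator Z_i with constant probability π (both symbols are illustrative, not taken from the source) and a logistic link for the Bernoulli probability p_i:

```latex
\begin{aligned}
Z_i &\sim \mathrm{Bernoulli}(\pi), \\
Y_i \mid Z_i = 1,\, x_i &\sim \mathrm{Bernoulli}(p_i), \qquad p_i = \frac{1}{1 + e^{-x_i^\top \beta}}, \\
Y_i \mid Z_i = 0 &= 0, \\
P(Y_i = 1 \mid x_i) &= \pi p_i, \qquad P(Y_i = 0 \mid x_i) = 1 - \pi p_i, \\
\ell(\beta, \pi) &= \sum_{i=1}^{n} \left[ y_i \log(\pi p_i) + (1 - y_i) \log(1 - \pi p_i) \right].
\end{aligned}
```

Under this assumed form, a zero can arise either because the unit was never at risk (Z_i = 0) or because an at-risk unit produced a zero. The term log(1 - π p_i) mixes the two sources, which is what makes direct maximization awkward and motivates treating Z_i as missing data in an EM scheme.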

Main Methods:

Development of an expectation-maximization (EM) algorithm to efficiently derive MLEs for ZIBer model parameters.
Implementation of a logistic regression model to link Bernoulli probabilities with covariates within the ZIBer framework.
Comparative analysis using Monte Carlo simulations to evaluate prediction performance across ZIBer, LightGBM, and ANN models.
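
The EM idea named above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes the parameterization P(Y=1|x) = pi * sigmoid(x·beta) with a constant mixing probability `pi`, and the function names (`fit_ziber_em`, `ziber_loglik`, `simulate_ziber`) are hypothetical. The E-step computes the posterior probability that each observed zero came from an at-risk unit; the M-step updates `pi` by averaging those probabilities and updates `beta` by a weighted logistic regression solved with Newton steps.

```python
import numpy as np

def sigmoid(t):
    # Clip the linear predictor so exp() cannot overflow.
    return 1.0 / (1.0 + np.exp(-np.clip(t, -30.0, 30.0)))

def ziber_loglik(X, y, beta, pi):
    """Observed-data log-likelihood for the assumed P(Y=1|x) = pi * sigmoid(x @ beta)."""
    q = np.clip(pi * sigmoid(X @ beta), 1e-12, 1 - 1e-12)
    return float(np.sum(y * np.log(q) + (1 - y) * np.log(1 - q)))

def fit_ziber_em(X, y, n_iter=200, tol=1e-8):
    """EM sketch for the assumed ZIBer form; returns (beta, pi)."""
    n, d = X.shape
    beta, pi = np.zeros(d), 0.5
    for _ in range(n_iter):
        # E-step: w_i = P(Z_i = 1 | y_i, x_i). A one can only come from an
        # at-risk unit, so w_i = 1 there; a zero gets a fractional weight.
        p = sigmoid(X @ beta)
        w = np.where(y == 1, 1.0, pi * (1 - p) / (1 - pi * p))

        # M-step for pi: closed form, the mean posterior at-risk probability.
        pi_new = float(w.mean())

        # M-step for beta: weighted logistic regression via Newton steps,
        # warm-started from the current beta.
        beta_new = beta.copy()
        for _ in range(25):
            p_new = sigmoid(X @ beta_new)
            grad = X.T @ (w * (y - p_new))
            # Small ridge term keeps the linear solve numerically stable.
            hess = (X * (w * p_new * (1 - p_new))[:, None]).T @ X + 1e-8 * np.eye(d)
            step = np.linalg.solve(hess, grad)
            beta_new = beta_new + step
            if np.max(np.abs(step)) < 1e-10:
                break

        converged = max(abs(pi_new - pi), np.max(np.abs(beta_new - beta))) < tol
        beta, pi = beta_new, pi_new
        if converged:
            break
    return beta, pi

def simulate_ziber(n, beta, pi, rng):
    """Draw (X, y) from the assumed model; X has an intercept column."""
    X = np.column_stack([np.ones(n), rng.normal(size=(n, len(beta) - 1))])
    at_risk = rng.random(n) < pi
    y = (at_risk & (rng.random(n) < sigmoid(X @ beta))).astype(float)
    return X, y
```

A typical call would be `X, y = simulate_ziber(5000, np.array([0.5, 1.0]), 0.3, np.random.default_rng(0))` followed by `beta_hat, pi_hat = fit_ziber_em(X, y)`. Because `pi` and the logistic intercept both scale the overall rate of ones, estimates can converge slowly and be imprecise in small samples, so checking that `ziber_loglik` does not decrease across runs is a useful sanity test.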

Main Results:

No single method demonstrated consistent dominance across all scenarios for predictive performance on imbalanced data.
The zero-inflated Bernoulli (ZIBer) model and LightGBM were more competitive in predictive performance than the artificial neural network (ANN) model.
The proposed EM algorithm effectively simplifies parameter estimation for ZIBer models with imbalanced data.

Conclusions:

For zero-inflated imbalanced datasets, the ZIBer model and LightGBM offer robust predictive performance, outperforming ANNs in certain contexts.
The choice of model for imbalanced binary classification should consider the specific characteristics of the data, as no universal best method exists.
The proposed EM algorithm provides an efficient computational approach to parameter estimation in ZIBer models, which is particularly beneficial for imbalanced data.