Missing data imputation using utility-based regression and sampling approaches | JoVE Visualize

Area of Science:

Statistics
Machine Learning
Clinical Trials

Background:

Missing data, particularly data missing not at random (MNAR), where the chance that a value is missing depends on the unobserved value itself, pose a significant challenge in scientific experiments and clinical trials.
Standard regression error measures weight all cases equally, which makes them inadequate for the imbalanced learning problems common with MNAR data, especially in clinical settings where rare extreme values matter most.
Existing methods such as random forests and multiple imputation can introduce systematic bias, underestimating key statistical measures when data are MNAR.
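
The systematic bias described above can be illustrated with a small simulation. The following is a minimal sketch under stated assumptions, not the study's setup: the data-generating model, the logistic missingness mechanism, and the function name simulate_mnar_bias are all hypothetical, and ordinary least-squares regression imputation stands in for the random forest and multiple imputation methods named in the summary. High outcome values are made more likely to be missing, so a model fitted on observed cases alone tends to impute values that are too low.

```python
import numpy as np


def simulate_mnar_bias(n=20000, seed=0):
    """Compare true and regression-imputed means when missingness depends on y."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n)
    y = 2.0 * x + rng.normal(size=n)

    # MNAR mechanism (assumed): the probability of being missing
    # increases with the value of y that would have been observed.
    p_missing = 1.0 / (1.0 + np.exp(-(y - 1.0)))
    missing = rng.random(n) < p_missing
    observed = ~missing

    # Fit ordinary least squares on the observed cases only.
    design = np.column_stack([np.ones(observed.sum()), x[observed]])
    beta, *_ = np.linalg.lstsq(design, y[observed], rcond=None)

    # Impute the missing outcomes from the fitted line.
    y_imputed = y.copy()
    y_imputed[missing] = beta[0] + beta[1] * x[missing]

    return {
        "true_mean": y.mean(),
        "imputed_mean": y_imputed.mean(),
        "true_missing_mean": y[missing].mean(),
        "imputed_missing_mean": y_imputed[missing].mean(),
    }


result = simulate_mnar_bias()
for name, value in result.items():
    print(f"{name}: {value:.3f}")
```

In this toy setting the imputed mean falls below the true mean because the observed cases are a downward-shifted sample of the outcome. The size of the gap depends entirely on the assumed mechanism and says nothing about the magnitudes reported in the study.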

Purpose of the Study:

To develop and evaluate a hybrid imbalanced learning approach for handling MNAR data in cross-sectional clinical trial settings.
To address the limitations of standard predictive error measures in regression for imbalanced datasets.
To mitigate the systematic bias observed in conventional methods when dealing with MNAR data.

Main Methods:

Investigated a hybrid imbalanced learning approach combining utility-based regression (UBR) with the synthetic minority oversampling technique for regression (SMOTER).
UBR was employed to optimize the product of conditional probability density (estimated via quantile regression forests) and a utility function.
SMOTER was utilized to oversample relevant rare cases, enhancing the model's ability to handle imbalanced data.
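
The two components above can be sketched in a few lines. This is an illustrative sketch only, not the authors' implementation: the function names smoter_oversample and utility_optimal_prediction, their parameters, and the rule for flagging rare cases are hypothetical. The first function interpolates between rare cases and their nearest rare neighbours, in the spirit of SMOTER. The second assumes that "optimizing the product" of density and utility means choosing the prediction with the highest utility averaged over draws from an estimated conditional distribution; here those draws are simply passed in, whereas the summary states they were estimated via quantile regression forests.

```python
import numpy as np


def smoter_oversample(X, y, rare_mask, n_new_per_case=2, k=3, seed=0):
    """Append synthetic rare cases built by interpolating between rare neighbours."""
    rng = np.random.default_rng(seed)
    X_rare, y_rare = X[rare_mask], y[rare_mask]
    k = min(k, len(X_rare) - 1)
    if k < 1:
        # Fewer than two rare cases: nothing to interpolate between.
        return X.copy(), y.copy()

    # Pairwise Euclidean distances among rare cases; exclude self-matches.
    dist = np.linalg.norm(X_rare[:, None, :] - X_rare[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)
    neighbours = np.argsort(dist, axis=1)[:, :k]

    X_new, y_new = [], []
    for i in range(len(X_rare)):
        for _ in range(n_new_per_case):
            j = rng.choice(neighbours[i])
            # Synthetic features lie on the segment between the two seeds.
            x_syn = X_rare[i] + rng.random() * (X_rare[j] - X_rare[i])
            d_i = np.linalg.norm(x_syn - X_rare[i])
            d_j = np.linalg.norm(x_syn - X_rare[j])
            total = d_i + d_j
            # The closer seed gets the larger weight in the synthetic target.
            w_i = 0.5 if total == 0 else d_j / total
            X_new.append(x_syn)
            y_new.append(w_i * y_rare[i] + (1.0 - w_i) * y_rare[j])

    return np.vstack([X, np.array(X_new)]), np.concatenate([y, np.array(y_new)])


def utility_optimal_prediction(cond_samples, candidates, utility):
    """Return the candidate with the highest average utility over cond_samples.

    cond_samples: draws assumed to come from the estimated conditional
                  distribution of the outcome for one case.
    utility:      function utility(prediction, true_value) -> float.
    """
    expected = [
        np.mean([utility(c, s) for s in cond_samples]) for c in candidates
    ]
    return float(candidates[int(np.argmax(expected))])
```

A utility function that penalizes errors on extreme outcomes more heavily pulls the chosen prediction toward those outcomes, which a symmetric error measure would not do. In practice the rare cases passed to smoter_oversample would be selected with a relevance function over the target variable rather than a hand-set mask.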

Main Results:

Simulations demonstrated that the proposed hybrid method yields plausible predictions and significantly reduces bias in realistic MNAR data scenarios.
Compared to standard approaches (random forests, multiple imputation), the proposed method showed superior performance in mitigating systematic bias.
Application to an antidepressant clinical trial dataset confirmed the systematic bias in conventional methods and the effectiveness of the proposed approach.

Conclusions:

The proposed hybrid imbalanced learning strategy effectively handles missing not at random data in clinical trials.
Utility-based learning offers a promising avenue for improving the analysis of clinical trial data with missing values.
Integration of utility-based learning strategies is encouraged for more accurate and less biased analyses in clinical research.