Sample size and predictive performance of machine learning methods with survival data: A simulation study

Area of Science:

Biostatistics
Machine Learning in Healthcare
Survival Analysis

Background:

Machine learning (ML) methods are increasingly popular for diagnostic and prognostic prediction models and are often reported to outperform traditional regression techniques.
While the Cox proportional hazards model is standard for survival outcomes, ML offers potential for improved performance by capturing complex data patterns.
Unlike for traditional statistical models, determining an adequate sample size for developing ML-based survival prediction models remains a challenge.

Purpose of the Study:

To develop a time-to-event simulation framework for evaluating the performance of survival prediction models.
To compare the performance of Cox regression against various machine learning techniques, including random survival forest, gradient boosting, and neural networks.
To investigate the impact of varying sample sizes on the performance of these prediction models.

Main Methods:

A time-to-event simulation framework was developed by replicating subjects from publicly available databases.
Event times were simulated based on a Cox model incorporating nonlinearities, covariate interactions, and time-varying effects.
The performance of Cox regression was evaluated against tuned random survival forest, gradient boosting, and neural networks across different sample sizes.
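
As a rough illustration of the simulation step described above, the following minimal sketch draws right-censored event times from a Cox model whose linear predictor contains a nonlinear term and a covariate interaction. It is not the study's framework: the Weibull baseline hazard, the two synthetic covariates, all coefficient values, the exponential censoring mechanism, and the function name simulate_survival are assumptions made for this example, and time-varying effects are left out for brevity.

```python
import math
import random


def simulate_survival(n, seed=0, lam=0.1, nu=1.5, censor_rate=0.05):
    """Simulate n subjects with right-censored event times.

    Assumed model (illustrative only): Cox model with Weibull baseline,
    S(t | x) = exp(-lam * t**nu * exp(eta)).
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x1 = rng.gauss(0.0, 1.0)   # hypothetical continuous covariate
        x2 = rng.randint(0, 1)     # hypothetical binary covariate
        # Linear predictor with a nonlinear term (x1**2) and an interaction (x1*x2)
        eta = 0.5 * x1 + 0.3 * x1 ** 2 - 0.4 * x2 + 0.6 * x1 * x2
        # Inverse-transform sampling: solve S(t | x) = u for t
        u = 1.0 - rng.random()     # uniform on (0, 1], avoids log(0)
        event_time = (-math.log(u) / (lam * math.exp(eta))) ** (1.0 / nu)
        # Independent exponential censoring time
        censor_time = rng.expovariate(censor_rate)
        rows.append({
            "x1": x1,
            "x2": x2,
            "time": min(event_time, censor_time),      # observed follow-up time
            "event": int(event_time <= censor_time),   # 1 = event, 0 = censored
        })
    return rows


if __name__ == "__main__":
    # Generate development sets of increasing size, as in a sample size study
    for n in (100, 500, 2000):
        data = simulate_survival(n, seed=42)
        events = sum(row["event"] for row in data)
        print(f"n={n}: {events} events, {n - events} censored")
```

Calling the function with increasing values of n gives development sets of different sizes drawn from the same data-generating mechanism, which is the basic ingredient for studying how performance changes with sample size.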

Main Results:

The simulation framework allowed for direct comparison of model performances under various conditions.
Performance differences between Cox regression and ML techniques were observed at different sample sizes.
The study provides insights into the sample size requirements for developing robust survival prediction models using ML.
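
This summary does not state which performance measure the study used. As an assumption for illustration only, one common measure of discrimination for survival models is Harrell's concordance index, sketched below in plain Python. The function name concordance_index and the simplified handling of tied event times (such pairs are skipped) are choices made for this example.

```python
def concordance_index(times, events, risk_scores):
    """Harrell's C: share of comparable pairs in which the subject with the
    shorter observed event time has the higher predicted risk."""
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a pair is anchored on a subject with an observed event
        for j in range(n):
            if times[i] < times[j]:  # subject i had the event while j was still at risk
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5  # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


if __name__ == "__main__":
    times = [2, 4, 6, 8]
    events = [1, 1, 0, 1]          # third subject is censored
    risk = [0.9, 0.7, 0.5, 0.1]    # higher score = higher predicted risk
    print(concordance_index(times, events, risk))
```

A value of 0.5 corresponds to predictions no better than chance and 1.0 to perfect ranking, so computing this index on a large independent test set for models fitted at each development sample size gives one way to compare methods.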

Conclusions:

The developed framework is valuable for understanding sample size requirements in survival prediction modeling.
Machine learning techniques show promise but require careful attention to sample size to achieve good predictive performance.
Further research is needed to establish specific sample size guidelines for various ML methods in survival analysis.