A real data-driven simulation strategy to select an imputation method for mixed-type trait data | JoVE Visualize

Area of Science:

Evolutionary Biology
Bioinformatics
Comparative Genomics

Background:

Missing observations in biological trait datasets hinder analyses across various disciplines.
Existing imputation methods perform inconsistently across datasets, so a framework is needed for selecting a suitable technique for a given real-world dataset.
Trait datasets often contain mixed data types (categorical, count, continuous), complicating imputation strategies.

Purpose of the Study:

To develop and validate a real data-driven simulation strategy for selecting the optimal imputation method for mixed-type trait datasets.
To evaluate the performance of candidate imputation methods, including mean/mode, k-nearest neighbour, random forests, and multivariate imputation by chained equations (MICE), with and without phylogenetic information.
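
As an illustration of the simplest candidate, mean/mode imputation can be sketched as below. This is a minimal sketch, not the study's implementation: the function name `impute_mean_mode` and the toy trait values are assumptions made for illustration, and missing entries are assumed to be encoded as `None`.

```python
from statistics import mean, mode

def impute_mean_mode(column):
    """Fill None entries with the mean (numeric trait) or the mode (categorical trait)."""
    observed = [v for v in column if v is not None]
    if not observed:
        raise ValueError("column has no observed values to impute from")
    # Treat the trait as numeric only if every observed value is an int or float.
    numeric = all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in observed
    )
    fill = mean(observed) if numeric else mode(observed)
    return [fill if v is None else v for v in column]

# Hypothetical toy traits (not taken from the squamate dataset).
body_length = [10.0, None, 14.0, 12.0]          # continuous trait
diet = ["insects", "plants", None, "insects"]   # categorical trait

filled_length = impute_mean_mode(body_length)
filled_diet = impute_mean_mode(diet)
```

Note that for count traits the mean is generally not an integer, so a real pipeline would need a rounding rule or a different summary statistic; that choice is left open here.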

Main Methods:

A squamate trait dataset was used as a target, with missing data simulated under missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR) mechanisms.
Imputation was performed using candidate methods, incorporating phylogenetic information from nuclear, mitochondrial, or multigene trees.
Performance was assessed using mean squared error (MSE) for numerical traits and the proportion of falsely classified entries (PFC) for categorical traits.
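
The three missingness mechanisms and the two error measures can be sketched as follows. This is an illustrative sketch under stated assumptions, not the procedure used in the study: the names `simulate_missing`, `mse`, and `pfc` are hypothetical, MAR is approximated by making larger values of an observed companion trait more likely to hide the target trait, and MNAR by making larger values of the target trait itself more likely to be hidden.

```python
import random

def _ranks(xs):
    """Rank of each entry, 1 = smallest (ties broken by position)."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0] * len(xs)
    for rank, i in enumerate(order, start=1):
        ranks[i] = rank
    return ranks

def simulate_missing(target, companion, mechanism, rate, rng):
    """Hide round(rate * n) entries of `target`; return (masked copy, hidden indices)."""
    n = len(target)
    k = round(rate * n)
    if mechanism == "MCAR":      # every entry equally likely to be hidden
        weights = [1.0] * n
    elif mechanism == "MAR":     # depends on an observed companion trait (assumption)
        weights = _ranks(companion)
    elif mechanism == "MNAR":    # depends on the hidden value itself (assumption)
        weights = _ranks(target)
    else:
        raise ValueError(f"unknown mechanism: {mechanism}")
    # Weighted sampling without replacement (Efraimidis-Spirakis keys).
    keys = sorted(
        ((rng.random() ** (1.0 / w), i) for i, w in enumerate(weights)),
        reverse=True,
    )
    hidden = {i for _, i in keys[:k]}
    masked = [None if i in hidden else v for i, v in enumerate(target)]
    return masked, hidden

def mse(true, imputed, hidden):
    """Mean squared error over the hidden entries of a numerical trait."""
    return sum((true[i] - imputed[i]) ** 2 for i in hidden) / len(hidden)

def pfc(true, imputed, hidden):
    """Proportion of hidden entries of a categorical trait imputed incorrectly."""
    return sum(true[i] != imputed[i] for i in hidden) / len(hidden)

# Hypothetical toy traits (not taken from the squamate dataset).
rng = random.Random(42)
mass = [float(x) for x in range(1, 21)]
length = [2.0 * x for x in mass]
masked, hidden = simulate_missing(mass, length, "MNAR", 0.3, rng)
```

Because the true values of the hidden entries are known, any imputed version of `masked` can be scored directly with `mse` or `pfc` over `hidden`.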

Main Results:

The random forest method, enhanced with a nuclear-derived phylogeny, demonstrated the lowest error rates across most traits.
Imputed datasets reflected the characteristics and distributions of the original data more accurately than complete-case datasets, in which records with missing values are discarded.
Phylogenetic information did not consistently improve performance for all traits or scenarios, highlighting the need for careful method selection.

Conclusions:

A real data-driven simulation strategy is effective for selecting suitable imputation methods for mixed-type trait datasets.
Random forests combined with appropriate phylogenetic data offer a robust approach for trait data imputation in evolutionary biology.
Caution is advised, as the utility of phylogenetic information in imputation varies by trait and missingness mechanism.
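
The overall selection strategy (repeatedly hide known values, impute them with each candidate, and keep the candidate with the lowest average error) can be sketched as below. This is a deliberately reduced sketch under stated assumptions: only MCAR masking of a single numerical trait is shown, the names `mask_mcar`, `fill_with`, and `select_method` are hypothetical, and mean and median imputation are stand-ins for the study's candidates (k-nearest neighbour, random forests, MICE), which would require additional libraries.

```python
import random
from statistics import mean, median

def mask_mcar(values, rate, rng):
    """Hide a random subset of entries (MCAR); return (masked copy, hidden indices)."""
    k = max(1, round(rate * len(values)))
    hidden = set(rng.sample(range(len(values)), k))
    return [None if i in hidden else v for i, v in enumerate(values)], hidden

def fill_with(stat):
    """Build an imputer that fills None with a summary statistic of the observed values."""
    def impute(column):
        fill = stat([v for v in column if v is not None])
        return [fill if v is None else v for v in column]
    return impute

def select_method(values, candidates, rate=0.2, repeats=50, seed=0):
    """Return (best candidate name, average MSE per candidate) over repeated simulations."""
    rng = random.Random(seed)
    errors = {name: [] for name in candidates}
    for _ in range(repeats):
        masked, hidden = mask_mcar(values, rate, rng)
        for name, impute in candidates.items():
            filled = impute(masked)
            # Score only the entries that were hidden, whose true values are known.
            errors[name].append(mean([(values[i] - filled[i]) ** 2 for i in hidden]))
    avg = {name: mean(errs) for name, errs in errors.items()}
    return min(avg, key=avg.get), avg

# Hypothetical skewed trait (not taken from the squamate dataset).
trait = [1.0, 1.0, 2.0, 2.0, 2.0, 3.0, 3.0, 4.0, 5.0, 40.0]
candidates = {"mean": fill_with(mean), "median": fill_with(median)}
best, avg_error = select_method(trait, candidates)
```

In keeping with the caution above, the winner depends on the trait and on the missingness mechanism simulated, so the loop would need to be rerun per trait and per mechanism rather than trusted as a single global ranking.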