A machine learning pipeline for quantitative phenotype prediction from genotype data | JoVE Visualize

Area of Science:

Systems biology and biomedicine
Genetics and genomics
Computational biology

Background:

Quantitative phenotypes are crucial in systems biology and biomedicine, especially for complex diseases with high individual variability.
Machine learning complements Genome-Wide Association Studies (GWAS) by treating genetic data as multivariate and by emphasizing predictive accuracy and feature selection.
Reproducible and stringent Data Analysis Protocols (DAP) are essential for controlling variability and ensuring reliable results in genotype-phenotype mapping.

Purpose of the Study:

To present a genome-to-phenotype machine learning pipeline for quantitative trait prediction.
To apply the pipeline for fitting complex phenotypic traits in heterogeneous stock mice using single nucleotide polymorphisms (SNPs).
To evaluate the pipeline's effectiveness in marker selection and prediction accuracy compared to existing methods.

Main Methods:

A machine learning pipeline centered on the L1L2 regularization method (naïve elastic net) for regression and dimensionality reduction.
SNP marker selection using a DAP originally developed for microarray data in the MAQC-II initiative and adapted here to SNP data.
Comparison of the L1L2 approach with Support Vector Regression (SVR) and Markov chain Monte Carlo (MCMC), using algebraic indicators for model selection and a 'saturation' procedure to refine the marker panel.
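
The L1L2 (naïve elastic net) step named in the methods above combines an L1 penalty, which drives most marker weights to exactly zero, with an L2 penalty, which spreads weight across correlated markers. The sketch below is only an illustration under stated assumptions, not the study's implementation: the coordinate-descent solver, the penalty scaling, the function names (soft_threshold, fit_naive_elastic_net, simulate_genotypes) and the simulated genotypes are all hypothetical choices made for this example. Marker 1 is deliberately an exact copy of marker 0 to mimic two highly correlated SNPs.

```python
import random


def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold (proximal step for the L1 term)."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def fit_naive_elastic_net(X, y, l1, l2, n_sweeps=200):
    """Minimise (1/2n)*||y - X b||^2 + l1*||b||_1 + (l2/2)*||b||_2^2
    by cyclic coordinate descent. X is a list of rows. No intercept is
    fitted, so the columns of X and y are assumed to be centred.
    This penalty scaling is an assumption of the sketch."""
    n, p = len(X), len(X[0])
    cols = [[row[j] for row in X] for j in range(p)]
    col_sq = [sum(v * v for v in col) / n for col in cols]
    beta = [0.0] * p
    residual = list(y)  # equals y - X b while b is all zeros
    for _ in range(n_sweeps):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue  # monomorphic marker carries no information
            col = cols[j]
            # Covariance of marker j with the residual that excludes marker j
            rho = sum(c * r for c, r in zip(col, residual)) / n + col_sq[j] * beta[j]
            new_bj = soft_threshold(rho, l1) / (col_sq[j] + l2)
            delta = new_bj - beta[j]
            if delta != 0.0:
                residual = [r - delta * c for r, c in zip(residual, col)]
                beta[j] = new_bj
    return beta


def simulate_genotypes(n=120, p=30, seed=0):
    """Synthetic data: genotypes coded 0/1/2, two causal markers (0 and 5),
    and marker 1 forced to be an exact copy of marker 0."""
    rng = random.Random(seed)
    G = [[rng.choice([0, 1, 2]) for _ in range(p)] for _ in range(n)]
    for row in G:
        row[1] = row[0]
    y = [2.0 * row[0] - 1.5 * row[5] + rng.gauss(0.0, 0.5) for row in G]
    # Centre every marker and the phenotype so no intercept is needed
    means = [sum(row[j] for row in G) / n for j in range(p)]
    X = [[row[j] - means[j] for j in range(p)] for row in G]
    y_mean = sum(y) / n
    return X, [v - y_mean for v in y]


X, y = simulate_genotypes()
beta_lasso = fit_naive_elastic_net(X, y, l1=0.1, l2=0.0)  # L1 penalty only
beta_enet = fit_naive_elastic_net(X, y, l1=0.1, l2=0.5)   # L1 and L2 together

for name, beta in (("L1 only", beta_lasso), ("L1L2", beta_enet)):
    selected = [(j, round(b, 3)) for j, b in enumerate(beta) if abs(b) > 1e-6]
    print(name, "selected markers:", selected)
```

With the L1 penalty alone, the solver gives essentially all of the shared effect to whichever copy it visits first and leaves the duplicate at zero. With the L2 term switched on, the two copies receive equal weight. For real SNP panels a tested library implementation would be a more sensible choice than this teaching-sized solver.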

Main Results:

The L1L2 pipeline achieved prediction accuracies comparable to MCMC and SVR methods.
SNPs selected by the L1L2 algorithm showed good agreement with candidate loci identified through standard GWAS.
Combining L1L2 feature selection with the saturation procedure addressed a common weakness of feature selection algorithms: neglecting features that are highly correlated with one another.

Conclusions:

The L1L2 pipeline demonstrates efficacy in both genetic marker selection and prediction accuracy for quantitative phenotypes.
Machine learning techniques, when supported by an adequate DAP, can substantially aid quantitative phenotype prediction.
This approach is valuable for functional studies utilizing whole-genome information and for understanding complex genetic traits.