Accelerating Maximum Likelihood Phylogenetic Inference via Early Stopping to Evade (Over-)optimization | JoVE Visualize

Area of Science:

Computational Biology
Phylogenetics
Bioinformatics

Background:

Maximum likelihood-based phylogenetic inference is an optimization problem prone to overoptimization and overfitting due to noisy sequence data.
Existing methods may optimize the likelihood further than noisy data can support, which wastes computation and can yield less accurate evolutionary models.
There is a need for reliable early stopping criteria to balance optimization thoroughness with computational cost and data noise.

Purpose of the Study:

To integrate the Kishino-Hasegawa (KH) test as an early stopping criterion into RAxML-NG to prevent overoptimization.
To develop a simplified heuristic tree search strategy (sRAxML-NG) as a foundation for the early stopping method.
To propose an extension of the KH test that corrects for multiple testing, with the aim of improving speed and accuracy.

Main Methods:

Implemented a simplified heuristic tree search strategy (sRAxML-NG) within RAxML-NG.
Integrated the Kishino-Hasegawa (KH) test to statistically assess improvements between intermediate phylogenetic trees.
Developed and applied a multiple testing correction extension to the KH test for enhanced performance.
Benchmarked performance using 300 empirical DNA and amino acid (AA) datasets from TreeBASE.
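
To make the stopping rule concrete, the following is a minimal sketch of a KH-style check between two intermediate trees. It uses the textbook normal-approximation form of the KH test on per-site log-likelihoods and is not taken from the RAxML-NG source. The function names `kh_test` and `should_stop`, the `alpha` default, and the Bonferroni-style division by `n_tests` are all illustrative assumptions; the summary does not specify which multiple testing correction the authors use.

```python
import math


def kh_test(site_lnl_a, site_lnl_b):
    """Normal-approximation KH test on per-site log-likelihoods of two trees.

    Returns (delta, z, p), where delta is the total log-likelihood
    difference (A minus B) and p is the two-sided p-value.
    """
    if len(site_lnl_a) != len(site_lnl_b):
        raise ValueError("both trees must be scored on the same sites")
    n = len(site_lnl_a)
    if n < 2:
        raise ValueError("at least two sites are needed to estimate variance")

    # Per-site differences and their sum (the test statistic).
    diffs = [a - b for a, b in zip(site_lnl_a, site_lnl_b)]
    delta = sum(diffs)
    mean = delta / n

    # Variance of the summed difference, estimated from the per-site spread.
    var_delta = n / (n - 1) * sum((d - mean) ** 2 for d in diffs)
    if var_delta == 0.0:
        if delta == 0.0:
            return delta, 0.0, 1.0
        return delta, math.copysign(math.inf, delta), 0.0

    z = delta / math.sqrt(var_delta)
    # Two-sided p-value under a standard normal: 2 * (1 - Phi(|z|)).
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return delta, z, p


def should_stop(old_site_lnl, new_site_lnl, alpha=0.05, n_tests=1):
    """Return True when the new tree is not a significant improvement.

    ASSUMPTION: dividing alpha by n_tests (Bonferroni) stands in for the
    unspecified multiple testing correction described in the source.
    """
    delta, _, p = kh_test(new_site_lnl, old_site_lnl)
    significant_gain = delta > 0 and p < alpha / n_tests
    return not significant_gain


# Toy per-site log-likelihoods (made up for illustration only).
old_tree = [-1.1, -2.0, -3.2, -4.1]
new_tree = [-1.0, -2.0, -3.0, -4.0]
print(should_stop(old_tree, new_tree))             # single test
print(should_stop(old_tree, new_tree, n_tests=5))  # stricter threshold
```

In a tree search, a check of this kind would run after each optimization round, and the search would end once the latest tree no longer improves significantly on the previous one. Raising `n_tests` tightens the significance threshold, so the search stops earlier.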

Main Results:

The early stopping methods using the KH test and sRAxML-NG produced trees statistically equivalent to those of RAxML-NG v1.2 for 98% of DNA datasets.
For AA datasets, the sRAxML-NG, KH, and KH multiple-testing versions yielded statistically equivalent trees in 96%, 95%, and 92% of cases, respectively.
The KH multiple-testing version with sRAxML-NG provided average speedups of 5× for DNA and 3.9× for protein (AA) datasets compared to RAxML-NG v1.2.

Conclusions:

The implemented early stopping criteria, particularly the KH test with multiple testing correction, effectively prevent overoptimization in phylogenetic inference.
These methods offer significant computational speedups without compromising the statistical accuracy of inferred phylogenetic trees.
The early stopping criteria are now integrated into RAxML-NG, providing a more efficient tool for phylogenetic analysis.