EvANI benchmarking workflow for evolutionary distance estimation | JoVE Visualize

Area of Science:

Genomics and Evolutionary Biology
Bioinformatics and Computational Biology

Background:

Advances in long-read sequencing yield high-quality genome assemblies, enabling comparative genomics across the Tree of Life.
Average nucleotide identity (ANI) quantifies genetic similarity between genomes, aiding species delineation and phylogenetic analysis.
Traditional ANI calculation via genome alignment is computationally intensive, prompting the development of faster, sketch-based methods.

Purpose of the Study:

To introduce EvANI, a framework for evaluating the accuracy and efficiency of ANI estimation algorithms.
To analyze the impact of assumptions and heuristics in sketch-based ANI methods on distance estimations.
To guide the selection of appropriate ANI estimation strategies for diverse genomic datasets and evolutionary studies.

Main Methods:

Development of a benchmark dataset comprising simulated and real genomic data for rigorous evaluation.
Implementation of a rank-correlation-based metric to quantify the accuracy of ANI estimates against evolutionary distances.
Comparative analysis of various ANI estimation algorithms, including alignment-based (ANIb) and k-mer-based approaches.
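
The rank-correlation idea above can be illustrated with a minimal sketch. It assumes Spearman's coefficient as the rank correlation (the summary does not name a specific statistic) and assumes that ANI is converted to a distance as 1 - ANI before being compared with tree distances. The function names and the toy values are hypothetical and are not taken from EvANI.

```python
from math import sqrt
from typing import Sequence


def average_ranks(values: Sequence[float]) -> list[float]:
    """Return 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the value at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for pos in range(i, j + 1):
            ranks[order[pos]] = avg
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the two rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        raise ValueError("rank correlation is undefined for constant input")
    return cov / sqrt(vx * vy)


# Toy example (made-up numbers): one entry per genome pair.
tree_distances = [0.02, 0.10, 0.35, 0.60]   # e.g. path lengths in a reference tree
ani_estimates = [0.99, 0.95, 0.80, 0.70]    # ANI reported by some tool
ani_distances = [1.0 - a for a in ani_estimates]

print(spearman(tree_distances, ani_distances))
```

A rank-based score of this kind rewards a tool for ordering genome pairs the same way the tree does, without requiring ANI and tree distance to share a scale or a linear relationship.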

Main Results:

ANIb provides the most accurate tree distance estimation but is computationally inefficient.
K-mer-based methods demonstrate high efficiency and consistent accuracy across various datasets.
The best k-mer length varies by clade, and distances in some clades are best estimated by combining multiple values of k (e.g., k=10 and k=19 for Chlamydiales).
Maximal exact match approaches offer a balance between computational efficiency and accuracy.
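
To show why the choice of k matters for k-mer-based estimates, here is a minimal sketch that computes a Jaccard index over exact k-mer sets and converts it to a distance with the Mash-style formula D = -(1/k) ln(2J / (1 + J)). It is an illustration under stated assumptions, not the procedure used in the study: it uses full k-mer sets rather than sketches, ignores reverse complements, and returns a distance of 1.0 when no k-mers are shared. The function names and toy sequences are hypothetical.

```python
import math


def kmer_set(seq: str, k: int) -> set[str]:
    """All distinct substrings of length k (no reverse-complement handling)."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard index |A & B| / |A | B|; defined as 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def mash_distance(seq_a: str, seq_b: str, k: int) -> float:
    """Mash-style distance from the k-mer Jaccard index."""
    j = jaccard(kmer_set(seq_a, k), kmer_set(seq_b, k))
    if j == 0.0:
        return 1.0  # assumed cap when the sequences share no k-mers
    # max() guards against a negative zero when the sequences are identical.
    return max(0.0, -math.log(2 * j / (1 + j)) / k)


def estimated_identity(seq_a: str, seq_b: str, k: int) -> float:
    """Identity estimate taken as 1 - distance."""
    return 1.0 - mash_distance(seq_a, seq_b, k)


# Toy sequences differing by one substitution; the estimate depends on k.
a = "ACGTTGCAACGTAGCTAGGCTA"
b = "ACGTTGCAACGAAGCTAGGCTA"
for k in (3, 5, 7):
    print(k, round(estimated_identity(a, b, k), 4))
```

Running the loop with several k values on the same pair makes the dependence on k visible, which is the motivation for evaluating more than one k-mer length per clade.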

Conclusions:

EvANI serves as a robust framework for evaluating genome comparison tools.
K-mer-based strategies are highly effective for rapid and accurate ANI estimation in large-scale genomic comparisons.
Choosing the k-mer length carefully, and considering alternatives such as maximal exact matches, is important for optimizing phylogenetic analyses.