Scaling up genome annotation using MAKER and work queue | JoVE Visualize

Area of Science:

Bioinformatics
Computational Biology
Genomics

Background:

Next-generation sequencing (NGS) generates vast amounts of data, increasing demand for efficient bioinformatics analyses.
Many bioinformatics applications, including genome annotation, require significant computational resources and can benefit from parallel processing on clusters, clouds, or grids.
Existing tools often rely on a shared file system, which limits their use in distributed environments where worker nodes do not share storage.

Purpose of the Study:

To develop and evaluate a modified annotation framework for parallel execution of bioinformatics tools on distributed computing resources.
To enhance the scalability and efficiency of genome annotation pipelines, specifically addressing limitations of shared file system dependencies.
To enable seamless execution of sequence analysis tools across diverse computing infrastructures like clusters, clouds, and grids.

Main Methods:

Use of MAKER, a genome annotation tool parallelized as a Message Passing Interface (MPI) application, as the underlying tool.
Modification of MAKER to run on the Work Queue framework without MPI, facilitating broader compatibility with distributed resources.
Implementation of explicit data transfer mechanisms to overcome shared file system limitations.
Evaluation of the framework's performance using a Caenorhabditis japonica test case on a cluster and within the Amazon EC2 cloud environment.
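
The idea of explicit data transfer can be sketched in a few lines of Python. This is an illustrative sketch only, not the study's implementation: it uses the standard library's thread pool in place of Work Queue, and `split_fasta`, `annotate`, and `run` are hypothetical names, with `annotate` standing in for a real annotation step. Each task is handed its input record as an argument and returns its result to the caller, so no task reads from or writes to a shared path.

```python
from concurrent.futures import ThreadPoolExecutor


def split_fasta(text):
    """Split FASTA text into one record string per sequence."""
    records, current = [], []
    for line in text.splitlines():
        # A new header line closes the record collected so far.
        if line.startswith(">") and current:
            records.append("\n".join(current))
            current = []
        if line.strip():
            current.append(line)
    if current:
        records.append("\n".join(current))
    return records


def annotate(record):
    """Hypothetical stand-in for one annotation task.

    Returns the sequence ID and its length instead of real annotations.
    """
    header, *seq_lines = record.splitlines()
    seq_id = header[1:].split()[0]
    return seq_id, sum(len(s) for s in seq_lines)


def run(fasta_text, workers=4):
    # Each task receives its record as an argument (explicit transfer)
    # rather than opening a file on a shared file system.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(annotate, split_fasta(fasta_text)))


demo = ">contig1 test\nACGT\nAC\n>contig2\nGGG\n"
result = run(demo)
print(result)
```

In a real deployment the per-record payload and its results would be shipped over the network to remote workers; the sketch only shows the shape of the task decomposition.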

Main Results:

Achieved a 45x speed-up in genome annotation using 50 workers on the Caenorhabditis japonica test case.
Demonstrated the framework's ability to run efficiently on distributed computing resources, including cloud environments (Amazon EC2).
Enabled parallel execution of the annotation tool without MPI, so it can run on resources where MPI is unavailable or impractical.
Replaced reliance on a shared file system with explicit transfer of data to and from workers.
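
To put the reported speed-up in context, the short calculation below derives two standard scaling metrics from the figures above (45x on 50 workers). The derived values are illustrative arithmetic, not numbers reported by the study, and the function names are hypothetical. Parallel efficiency is speed-up divided by worker count; the Karp-Flatt metric estimates the serial fraction implied by a measured speed-up.

```python
def parallel_efficiency(speedup, workers):
    """Fraction of ideal linear speed-up actually achieved."""
    return speedup / workers


def karp_flatt(speedup, workers):
    """Experimentally determined serial fraction (Karp-Flatt metric)."""
    return (1 / speedup - 1 / workers) / (1 - 1 / workers)


# Figures taken from the results above.
efficiency = parallel_efficiency(45, 50)
serial_fraction = karp_flatt(45, 50)

print(f"efficiency: {efficiency:.2f}")
print(f"implied serial fraction: {serial_fraction:.4f}")
```

An efficiency of 0.9 means each worker contributed about 90% of what perfect linear scaling would give.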

Conclusions:

The modified annotation framework significantly enhances the speed and scalability of bioinformatics analyses, particularly genome annotation.
The framework effectively utilizes distributed computing resources (clusters, clouds, grids) by removing MPI dependency and enabling explicit data transfer.
This approach provides a flexible and efficient way to run sequence analysis tools, including tools still in early stages of development, on modern computational infrastructures.