Scalable and Maintainable Distributed Sequence Alignment Using Spark

Summary

This summary is machine-generated.

Rapidly growing genomic datasets strain traditional sequence alignment tools such as NCBI BLAST. SparkLeBLAST is a scalable, maintainable parallel BLAST tool built on Spark that improves search performance and makes large-scale genomic analysis more accessible to researchers.

Area Of Science

  • Bioinformatics
  • Computational Biology
  • Genomics

Background

  • Genomic data is growing exponentially, challenging traditional bioinformatics tools like NCBI BLAST.
  • Existing parallel BLAST tools (mpiBLAST, SparkBLAST) have limitations in scalability, maintainability, or performance with large datasets.

Purpose Of The Study

  • To develop a parallel BLAST tool that combines the performance and scalability of mpiBLAST with the simplicity and maintainability of SparkBLAST.
  • To address the need for a tool that democratizes scalable genomic analysis for scientists without extensive distributed computing experience.

Main Methods

  • Introduced SparkLeBLAST, a parallel BLAST tool utilizing the Spark framework and efficient data partitioning.
  • Implemented a novel approach to data partitioning that overcomes SparkBLAST's limitations with large databases.
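The summary does not describe the partitioning scheme itself, so the following is only an illustrative sketch of one common way to split a sequence database into balanced partitions before a distributed search. It assumes the partitioning is applied to database sequences and that balance is measured by total residue count; the function names (parse_fasta, partition_by_length) and the greedy longest-first strategy are hypothetical choices for illustration, not SparkLeBLAST's actual method. Plain Python is used so the sketch runs without a Spark cluster.

```python
import heapq
from typing import List, Tuple

Record = Tuple[str, str]  # (header, sequence)


def parse_fasta(text: str) -> List[Record]:
    """Parse FASTA-formatted text into (header, sequence) records."""
    records: List[Record] = []
    header = None
    chunks: List[str] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def partition_by_length(records: List[Record], num_partitions: int) -> List[List[Record]]:
    """Greedy longest-first assignment (hypothetical scheme, not the paper's).

    Each sequence goes to the partition with the smallest total residue
    count so far, which keeps partition sizes roughly balanced.
    """
    if num_partitions < 1:
        raise ValueError("num_partitions must be at least 1")
    # Min-heap of (current load, partition index).
    heap = [(0, i) for i in range(num_partitions)]
    heapq.heapify(heap)
    partitions: List[List[Record]] = [[] for _ in range(num_partitions)]
    for header, seq in sorted(records, key=lambda r: len(r[1]), reverse=True):
        load, idx = heapq.heappop(heap)
        partitions[idx].append((header, seq))
        heapq.heappush(heap, (load + len(seq), idx))
    return partitions


if __name__ == "__main__":
    demo = ">a\nAAAAAAAA\n>b\nCCCCCC\n>c\nGGGG\n>d\nTT\n"
    for i, part in enumerate(partition_by_length(parse_fasta(demo), 2)):
        print(i, [h for h, _ in part], sum(len(s) for _, s in part))
```

In a Spark setting, each resulting partition could then be searched independently and the per-partition hits merged afterwards; how SparkLeBLAST actually does this is not stated in the summary.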

Main Results

  • SparkLeBLAST runs up to 6.68× faster than SparkBLAST.
  • Achieved an 88.6× speedup in the BLAST search component for COVID-19 genomic diversity analysis, accelerating the overall taxonomic assignment by 20.9× using 128 compute nodes.
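As a rough reading aid, the reported figures can be related with two standard formulas. This is a sketch under assumptions the summary does not confirm: that both speedups are measured against a single-node baseline, and that only the BLAST search component was accelerated (the setting of Amdahl's law). Under those assumptions, 88.6× on 128 nodes corresponds to a parallel efficiency of about 69%, and the two speedups together imply that the BLAST search accounted for roughly 96% of the original pipeline runtime. The function names are illustrative.

```python
def parallel_efficiency(speedup: float, nodes: int) -> float:
    """Speedup divided by node count (assumes a single-node baseline)."""
    return speedup / nodes


def amdahl_overall(p: float, s: float) -> float:
    """Amdahl's law: overall speedup when a fraction p of runtime is sped up by s."""
    return 1.0 / ((1.0 - p) + p / s)


def implied_accelerated_fraction(overall: float, component: float) -> float:
    """Invert Amdahl's law for p: 1/S = (1 - p) + p/s  =>  p = (1 - 1/S) / (1 - 1/s)."""
    if component <= 1.0:
        raise ValueError("component speedup must be greater than 1")
    return (1.0 - 1.0 / overall) / (1.0 - 1.0 / component)


if __name__ == "__main__":
    # Figures quoted in the summary; the interpretation is an assumption.
    eff = parallel_efficiency(88.6, 128)
    p = implied_accelerated_fraction(20.9, 88.6)
    print(f"parallel efficiency: {eff:.3f}")
    print(f"implied BLAST share of original runtime: {p:.3f}")
```

If the remaining pipeline stages were also partly accelerated, the implied share would differ, so these numbers are an interpretation of the summary, not a result reported by the study.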

Conclusions

  • SparkLeBLAST provides a high-performance, scalable, and maintainable solution for parallel BLAST searches.
  • This tool enhances accessibility to large-scale genomic analysis for a broader range of scientific researchers.
