polars-bio: fast, scalable, and out-of-core operations on large genomic interval datasets

  • Institute of Computer Science, Warsaw University of Technology, 00-665 Warsaw, Poland.

Summary

This summary is machine-generated.

polars-bio is a Python library that accelerates genomic interval analysis, running faster and using less memory than existing tools such as Bioframe. It targets computational genomics workflows that process large datasets.

Area Of Science

  • Computational Genomics
  • Bioinformatics
  • Data Science

Background

  • Genomic studies require computationally intensive analyses of genomic intervals.
  • Python, with libraries like Pandas, is widely used but faces scalability challenges.
  • Polars, a modern alternative, offers improved performance with a Rust backend.

Purpose Of The Study

  • Introduce polars-bio, a Python library for efficient genomic interval data processing.
  • Enable fast, parallel, and out-of-core operations on large genomic datasets.
  • Provide a compatible alternative to existing libraries like Bioframe.
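
To make "genomic interval operations" concrete, the sketch below shows an overlap join, one of the typical operations such libraries provide. It is plain Python written for illustration and is not the polars-bio or Bioframe API; the `Interval` type, the `overlap_pairs` function, and the half-open coordinate convention (0-based start, exclusive end) are assumptions made for this example.

```python
from typing import NamedTuple


class Interval(NamedTuple):
    """Hypothetical interval record: 0-based start, exclusive end."""
    chrom: str
    start: int
    end: int


def overlap_pairs(left, right):
    """Return every (left, right) pair on the same chromosome that shares a base."""
    # Group the right-hand intervals by chromosome and sort each group by start.
    by_chrom = {}
    for iv in right:
        by_chrom.setdefault(iv.chrom, []).append(iv)
    for group in by_chrom.values():
        group.sort(key=lambda iv: iv.start)

    pairs = []
    for a in left:
        for b in by_chrom.get(a.chrom, []):
            if b.start >= a.end:
                # Sorted by start, so no later interval can overlap `a`.
                break
            if b.end > a.start:
                pairs.append((a, b))
    return pairs


peaks = [Interval("chr1", 100, 200), Interval("chr1", 500, 600), Interval("chr2", 50, 80)]
genes = [Interval("chr1", 150, 300), Interval("chr1", 200, 250),
         Interval("chr2", 10, 60), Interval("chr3", 0, 1000)]

for peak, gene in overlap_pairs(peaks, genes):
    print(peak, "overlaps", gene)
```

This version scans each chromosome group linearly, which is adequate for a demonstration only; production libraries typically rely on interval trees or similar index structures and on columnar, parallel execution.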

Main Methods

  • Implemented in Rust using Apache DataFusion and Apache Arrow.
  • Compatible with Polars and Pandas DataFrame formats.
  • Benchmarked against Bioframe on real-world and synthetic datasets.
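
The out-of-core goal stated under Purpose Of The Study can be illustrated with a minimal streaming sketch: the input is consumed in fixed-size batches, so only one batch is held in memory at a time. This is plain Python for illustration, not the Apache DataFusion execution engine that polars-bio builds on; `read_batches`, `count_overlapping`, and the three-column BED-like layout are assumptions made for this example.

```python
import io
from itertools import islice


def read_batches(handle, batch_size):
    """Yield lists of (chrom, start, end) tuples, at most batch_size lines per batch."""
    while True:
        lines = list(islice(handle, batch_size))
        if not lines:
            return
        batch = []
        for line in lines:
            if not line.strip():
                continue  # skip blank lines
            chrom, start, end = line.split()[:3]
            batch.append((chrom, int(start), int(end)))
        yield batch


def count_overlapping(handle, query, batch_size=2):
    """Count streamed intervals that overlap the half-open query interval."""
    chrom, qstart, qend = query
    total = 0
    for batch in read_batches(handle, batch_size):
        # Only the current batch is in memory; earlier batches are discarded.
        total += sum(1 for c, s, e in batch if c == chrom and s < qend and e > qstart)
    return total


# An in-memory stand-in for a large file on disk.
bed_text = "chr1\t100\t200\nchr1\t150\t300\nchr2\t50\t80\nchr1\t400\t500\nchr1\t250\t260\n"
n_hits = count_overlapping(io.StringIO(bed_text), ("chr1", 180, 255))
print(n_hits)
```

With a real file, `open(path)` would replace the `io.StringIO` object, and the batch size would be chosen to fit the available memory.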

Main Results

  • Achieved significant speedups: 6.5x-38x faster than Bioframe on real-world data.
  • Demonstrated substantial memory reductions: up to 90x less memory usage.
  • Showcased good scalability characteristics in multi-threaded benchmarks.
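
Speedup and memory-reduction figures of this kind are ratios of a baseline measurement to a candidate measurement. The sketch below shows one way such ratios could be computed in plain Python; it is not the authors' benchmark harness, and `measure`, `ratio`, and the toy workload are hypothetical names introduced for this example. Note that `tracemalloc` tracks only allocations made through Python's allocator, so it would not capture memory used inside a Rust extension; a process-level measure would be needed for that.

```python
import time
import tracemalloc


def measure(func, *args):
    """Run func(*args) and return (result, elapsed_seconds, peak_python_bytes)."""
    tracemalloc.start()
    t0 = time.perf_counter()
    result = func(*args)
    elapsed = time.perf_counter() - t0
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return result, elapsed, peak


def ratio(baseline, candidate):
    """How many times larger the baseline value is than the candidate value."""
    return baseline / candidate


def toy_workload(n):
    # Stand-in for an interval operation: build and sum a list of n integers.
    return sum(list(range(n)))


result, seconds, peak_bytes = measure(toy_workload, 100_000)
print(f"result={result}, time={seconds:.4f}s, peak={peak_bytes} bytes")

# A baseline taking 38 s against a candidate taking 1 s is a 38x speedup.
print(ratio(38.0, 1.0))
```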

Conclusions

  • According to the authors, polars-bio is the most efficient single-node Python library for genomic interval DataFrames.
  • Offers substantial performance improvements for genomic data analysis.
  • Facilitates more efficient handling of large-scale genomic interval datasets.
