Sequence analysis and decoding with extra low-quality reads for DNA data storage

  • 1 Department of Intelligent Electronics and Computer Engineering, Chonnam National University, Gwangju, 61186, South Korea.
  • 2 Department of Chemical Engineering, POSTECH, Pohang, 37673, South Korea.

Abstract

MOTIVATION

Error detection/correction codes play an important role in reducing writing and/or reading costs in DNA data storage. Sequence analysis algorithms also have a crucial effect on error correction, but they have been executed independently of the decoding of error correction codes. In conventional sequence analysis, low-quality reads are usually discarded. For DNA data storage, low-quality reads can be used constructively in sequence analysis with the assistance of error detection/correction codes.

RESULTS

We obtained the low-quality reads that failed to pass the chastity filter in Illumina next-generation sequencing (NGS). We confirmed the effectiveness of these extra low-quality reads by reporting their error statistics and by performing decoding with them. To exploit the extra reads efficiently, we proposed a sequence clustering algorithm for reads of varying length and a consensus algorithm based on probabilistic majority and error detection. The proposed methods reduced the reading cost by 6.83% on average and by up to 19.67% while maintaining the writing cost.
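
The abstract does not spell out the consensus algorithm, so the following is only a minimal sketch of one common way to take a "probabilistic majority": a per-position maximum-likelihood vote weighted by Phred quality scores. It is not the authors' implementation. It assumes the reads of one cluster are already aligned to equal length, that base errors are independent and uniformly distributed over the three other bases, and it omits indel handling and the error detection step. The names `phred_to_error` and `quality_weighted_consensus` are hypothetical.

```python
import math

BASES = "ACGT"


def phred_to_error(q):
    """Convert a Phred score to an error probability, clamped to avoid log(0).

    The upper clamp of 0.75 corresponds to a base that carries no information
    (all four bases equally likely).
    """
    p = 10 ** (-q / 10)
    return min(max(p, 1e-6), 0.75)


def quality_weighted_consensus(reads):
    """Return a consensus sequence from (sequence, qualities) pairs.

    Assumption: all reads belong to one cluster and have the same length.
    For each position, every candidate base is scored by the summed
    log-likelihood of the observed bases, and the best candidate wins.
    """
    length = len(reads[0][0])
    consensus = []
    for i in range(length):
        scores = {}
        for candidate in BASES:
            total = 0.0
            for seq, quals in reads:
                p = phred_to_error(quals[i])
                if seq[i] == candidate:
                    total += math.log(1 - p)   # observed base is correct
                else:
                    total += math.log(p / 3)   # observed base is a substitution
            scores[candidate] = total
        consensus.append(max(BASES, key=lambda b: scores[b]))
    return "".join(consensus)


# Two low-quality reads say "A" at the only position; one high-quality read says "C".
cluster = [("A", [2]), ("A", [2]), ("C", [30])]
print(quality_weighted_consensus(cluster))
```

In this toy cluster a plain majority vote would pick "A", whereas the quality-weighted vote follows the single high-confidence read. Low-quality reads still contribute evidence, only with less weight, which is the general idea behind using them rather than discarding them.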

AVAILABILITY AND IMPLEMENTATION

https://github.com/PParkJy/SAD-DNAstorage (DOI: 10.5281/zenodo.15571858).
