SeArcH schemes for Approximate stRing mAtching
View abstract on PubMed
Summary
This summary is machine-generated.This study unifies approximate search strategies using search schemes, developing a new heuristic that improves performance for multiple errors. It also introduces a weighted node count metric for more accurate performance evaluation.
Area Of Science
- Computational Biology
- Bioinformatics
- Stringology
Background
- Approximate string matching with full-text indices is crucial for bioinformatics.
- Existing search schemes offer speed-ups but finding optimal schemes for multiple errors is challenging.
Purpose Of The Study
- To unify and extend existing approximate search strategies within the search scheme framework.
- To develop an improved heuristic for constructing search schemes, especially for higher error counts.
- To propose a more accurate performance metric for search algorithms.
Main Methods
- Modeling existing approximate search strategies (suffix filters, 01*0-seeds, pigeonhole principle) as search schemes.
- Developing a novel heuristic for constructing search schemes applicable to any number of errors.
- Introducing and evaluating the 'weighted node count' as a performance metric.
Main Results
- Unified diverse search strategies under the search scheme framework, enabling new schemes for any error count.
- Developed a heuristic search scheme construction that matches optimal schemes and improves node count for >= 4 errors.
- Demonstrated the limitations of node count and validated the accuracy of the weighted node count metric.
Conclusions
- Search schemes provide a unified framework for approximate string matching.
- The new heuristic offers efficient search scheme construction for complex error scenarios.
- The weighted node count metric offers a more realistic performance evaluation for search algorithms.
Related Concept Videos
Overview
Organisms are capable of detecting and fixing nucleotide mismatches that occur during DNA replication. This sophisticated process requires identifying the new strand and replacing the erroneous bases with correct nucleotides. Mismatch repair is coordinated by many proteins in both prokaryotes and eukaryotes.
The Mutator Protein Family Plays a Key Role in DNA Mismatch Repair
The human genome has more than 3 billion base pairs of DNA per cell. Prior to cell division, that vast amount...
Erwin Chargaff’s rules on DNA equivalence paved the way for the discovery of base pairing in DNA. Chargaff’s rules state that in a double-stranded DNA molecule,
the amount of adenine (A) is equal to the amount of thymine (T);
the amount of guanine (G) is equal to the amount of cytosine (C); and
the sum of purines, A and G, is equal to the sum of pyrimidines, C and T (i.e., A+G = C+T).
Later work by Watson and Crick revealed that in double-stranded DNA, A always forms two...
Protons in identical electronic environments within a molecule are chemically equivalent and have the same chemical shift. The replacement test is a useful tool to identify chemical equivalence and predict NMR spectra. A substituent replaces each of the protons being examined and the resulting molecules are compared. If the same molecule is obtained, the protons are equivalent or homotopic. Replacement of any hydrogens in ethane by chlorine yields chloroethane because all six protons are...
The Pople nomenclature system classifies spin systems based on the difference between their chemical shifts. Coupled spins are denoted by capital letters with subscripts indicating the number of equivalent nuclei. When the coupled nuclei have well-separated chemical shifts, they are assigned letters that are far apart in the alphabet, such as A and X. When the difference in chemical shifts is small, coupled nuclei are named using adjacent letters of the alphabet (AB, MN, or XY).
A proton...
In the same year as the discovery of the Sanger sequencing method, another group of scientists, Allan Maxam and Walter Gilbert, demonstrated their chemical-cleavage method for DNA sequencing. The Maxam-Gilbert method relies on using different chemicals that can cleave the DNA sequence at specific sites, the separation of resulting DNA fragments of variable size using electrophoresis, and deciphering the DNA sequence from the resulting gel bands.
Challenges of the Maxam-Gilbert Method
The...

