Statistics of large-scale sequence searching

R Spang¹, M Vingron

¹Deutsches Krebsforschungszentrum (DKFZ), Theoretische Bioinformatik, Im Neuenheimer Feld 280, D-69120 Heidelberg, Germany.

Bioinformatics (Oxford, England)

June 6, 1998

Summary

This summary is machine-generated.

Estimates of the statistical significance of database search scores become more accurate when database properties are taken into account. A new semi-random model with an "effective database size" parameter corrects discrepancies in p-value computation for sequence similarity searches.

Area of Science:

Bioinformatics
Computational Biology
Genomics

Background:

Standard sequence alignment tools like BLAST and FASTA rely on the statistical significance of similarity scores.
Accurate p-value computation is challenging in database searches due to multiple comparisons and database characteristics.
Existing models often assume purely random data, failing to capture real-world database complexities.
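
The multiple-comparison effect noted above can be made concrete under a simplifying independence assumption. This is an illustration in notation introduced here, not a formula quoted from the paper: if a single query-to-sequence comparison reaches score s with probability p(s), and the query is compared against N independent database sequences, the best score in the search reaches s with probability

```latex
P\left(\max_{1 \le i \le N} S_i \ge s\right)
  = 1 - \bigl(1 - p(s)\bigr)^{N}
  \approx 1 - e^{-N\,p(s)} \qquad \text{for small } p(s).
```

A score that looks rare in one comparison can therefore be unremarkable across a whole database, and the result depends directly on what is taken as N.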

Purpose of the Study:

To address the limitations of current statistical models for sequence database searches.
To improve the accuracy of p-value calculations for similarity scores in large biological databases.
To introduce a more realistic statistical framework for evaluating search results.

Main Methods:

Extensive simulations of database searches were performed on the SWISS-PROT protein database (Release 31.0).
A novel semi-random statistical model was developed to better represent real databases.
The model incorporates an "effective database size" parameter to account for database-specific statistical properties.
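
To show how an "effective database size" might enter a p-value, here is a minimal Python sketch. It assumes the standard Karlin-Altschul form, E = K·m·n·exp(-λS) with P = 1 - exp(-E), and simply substitutes a hypothetical effective size for the nominal database length n. The summary does not give the paper's actual formula, so the function name, the parameter values, and both database sizes are illustrative assumptions.

```python
import math


def search_p_value(score, query_len, db_len, lam, k):
    """Probability that a search yields at least one hit scoring >= score.

    Karlin-Altschul form: E = k * query_len * db_len * exp(-lam * score),
    then P = 1 - exp(-E). db_len may be a nominal or an effective size.
    """
    expected_hits = k * query_len * db_len * math.exp(-lam * score)
    # -expm1(-E) equals 1 - exp(-E) but stays accurate when E is tiny.
    return -math.expm1(-expected_hits)


# Placeholder values chosen only for illustration (not taken from the paper).
LAM, K = 0.27, 0.04
QUERY_LEN = 300
NOMINAL_DB_LEN = 15_000_000    # total residues in a hypothetical database
EFFECTIVE_DB_LEN = 10_000_000  # hypothetical fitted effective size

p_nominal = search_p_value(100, QUERY_LEN, NOMINAL_DB_LEN, LAM, K)
p_effective = search_p_value(100, QUERY_LEN, EFFECTIVE_DB_LEN, LAM, K)
print(f"nominal: {p_nominal:.3e}, effective: {p_effective:.3e}")
```

In this sketch the same score receives a different p-value depending on which size is used, which is the kind of correction the parameter is meant to provide.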

Main Results:

A discrepancy was observed between theoretical predictions and empirical distributions of similarity scores.
The proposed semi-random model demonstrated improved accuracy in p-value computation compared to purely random models.
The "effective database size" parameter captures database-specific statistical properties.
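
One conceivable way to calibrate such a parameter is to invert a theoretical p-value formula against an empirically observed one. The Python sketch below assumes the Karlin-Altschul form P = 1 - exp(-K·m·n·exp(-λS)) and solves it for n. The summary does not describe how the authors estimate the parameter, so the function names and all numeric values here are hypothetical.

```python
import math


def theoretical_p(score, query_len, db_len, lam, k):
    """P = 1 - exp(-E) with E = k * query_len * db_len * exp(-lam * score)."""
    return -math.expm1(-k * query_len * db_len * math.exp(-lam * score))


def effective_db_len(empirical_p, score, query_len, lam, k):
    """Database length n at which the theoretical p-value equals empirical_p."""
    if not 0.0 < empirical_p < 1.0:
        raise ValueError("empirical_p must lie strictly between 0 and 1")
    # Expected hits contributed per unit of database length at this score.
    rate_per_residue = k * query_len * math.exp(-lam * score)
    # Invert P = 1 - exp(-rate * n)  =>  n = -ln(1 - P) / rate.
    return -math.log1p(-empirical_p) / rate_per_residue


# Round trip with placeholder values: generate a p-value from a known size,
# then recover that size from the p-value.
true_len = 10_000_000
p = theoretical_p(100, 300, true_len, lam=0.27, k=0.04)
recovered_len = effective_db_len(p, 100, 300, lam=0.27, k=0.04)
print(f"recovered effective length: {recovered_len:.0f}")
```

In practice the empirical p-value would come from simulated searches, and a fit over many score thresholds would be more robust than inverting at a single point.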

Conclusions:

The developed semi-random model provides a more accurate assessment of statistical significance for database search results.
Accounting for database properties like sequence length distribution and repeated patterns is crucial for reliable p-value estimation.
This approach enhances the credibility of findings from large-scale sequence similarity searches.