Constructing benchmark test sets for biological sequence analysis using independent set algorithms.

Samantha Petti¹, Sean R Eddy²

¹NSF-Simons Center for the Mathematical and Statistical Analysis of Biology, Harvard University, Cambridge, Massachusetts, United States of America.

PLOS Computational Biology

March 7, 2022

Summary

This summary is machine-generated.

Developing robust biological sequence analysis methods requires careful data splitting. New algorithms ensure training and test sets are evolutionarily dissimilar, improving benchmark dataset diversity and method evaluation.

Area of Science:

Bioinformatics
Computational Biology
Genomics

Background:

Biological sequence families share evolutionary relationships, complicating data splitting for method benchmarking.
Random data splits can lead to highly similar or identical sequences in training and test sets, compromising evaluation.
Existing methods for splitting biological sequence data may not adequately address evolutionary relatedness.

Purpose of the Study:

To develop novel algorithms for splitting biological sequence data into dissimilar training and test sets.
To ensure that every test sequence stays below a chosen identity threshold (p%) to every training sequence within a family.
To enhance the construction of diverse and reliable benchmark datasets for sequence analysis.

Main Methods:

Adapted concepts from independent set graph algorithms.
Developed two new algorithms for partitioning sequence data.
Implemented a threshold (p%) for sequence identity between training and test sets.
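
The graph view behind these methods can be illustrated with a minimal sketch. It is not either of the authors' two algorithms; it is a simplified randomized greedy heuristic in the spirit of independent set methods. It assumes the sequences are already aligned to equal length, and the names `percent_identity`, `build_similarity_graph`, and `greedy_split` are hypothetical helpers introduced only for this example.

```python
import random
from itertools import combinations


def percent_identity(a, b):
    """Percent identity of two pre-aligned, equal-length sequences (assumption)."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    # Ignore alignment columns where both sequences have a gap.
    cols = [(x, y) for x, y in zip(a, b) if not (x == "-" and y == "-")]
    if not cols:
        return 0.0
    matches = sum(1 for x, y in cols if x == y)
    return 100.0 * matches / len(cols)


def build_similarity_graph(seqs, p):
    """Connect two sequences by an edge when they are at least p% identical."""
    graph = {i: set() for i in range(len(seqs))}
    for i, j in combinations(range(len(seqs)), 2):
        if percent_identity(seqs[i], seqs[j]) >= p:
            graph[i].add(j)
            graph[j].add(i)
    return graph


def greedy_split(seqs, p, seed=0):
    """Return (train, test) index lists with no edge between the two sets.

    Simplified heuristic, not the published algorithms: visit sequences in
    random order, try a randomly preferred side first, fall back to the
    other side, and discard the sequence if both sides conflict.
    """
    rng = random.Random(seed)
    graph = build_similarity_graph(seqs, p)
    order = list(graph)
    rng.shuffle(order)
    train, test = set(), set()
    for v in order:
        first, second = (train, test) if rng.random() < 0.5 else (test, train)
        if not (graph[v] & second):
            # No neighbor on the opposite side, so v can join its preferred side.
            first.add(v)
        elif not (graph[v] & first):
            second.add(v)
        # Otherwise v has neighbors on both sides and is left out of the split.
    return sorted(train), sorted(test)
```

Calling `greedy_split(seqs, p=50)` on a list of aligned sequences returns two disjoint index lists in which no test sequence is 50% or more identical to any training sequence. Discarding conflicting sequences trades dataset size for a guaranteed separation, which is the same trade-off any threshold-based split has to make.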

Main Results:

The new algorithms successfully split a greater number of sequence families compared to previous approaches.
Achieved splits in which every test sequence is less than p% identical to every training sequence.
Enabled the creation of more diverse benchmark datasets.
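
The success criterion stated in these results can be checked mechanically. The sketch below is one way to do that under stated assumptions: sequences are pre-aligned to equal length, and `is_valid_split` with its `min_train` and `min_test` size requirements is a hypothetical helper, not something the summary specifies.

```python
def percent_identity(a, b):
    """Percent identity of two pre-aligned, equal-length sequences (assumption)."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    # Ignore alignment columns where both sequences have a gap.
    cols = [(x, y) for x, y in zip(a, b) if not (x == "-" and y == "-")]
    if not cols:
        return 0.0
    return 100.0 * sum(1 for x, y in cols if x == y) / len(cols)


def max_cross_identity(train, test):
    """Highest percent identity between any training and any test sequence."""
    return max(
        (percent_identity(s, t) for s in train for t in test),
        default=0.0,
    )


def is_valid_split(train, test, p, min_train=1, min_test=1):
    """True if both sets are large enough and every cross pair is below p%.

    min_train and min_test are illustrative size requirements only.
    """
    if len(train) < min_train or len(test) < min_test:
        return False
    return max_cross_identity(train, test) < p
```

A family counts as successfully split here only when `is_valid_split` returns `True`, so counting the families for which it holds gives a simple way to compare splitting strategies at a given p.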

Conclusions:

The proposed methods provide a more effective strategy for splitting biological sequence data.
These algorithms facilitate the development of more rigorous benchmarking for sequence analysis tools.
Improved data splitting enhances the reliability and generalizability of findings in bioinformatics research.