



Constructing benchmark test sets for biological sequence analysis using independent set algorithms.

Samantha Petti¹, Sean R Eddy²

  • ¹ NSF-Simons Center for the Mathematical and Statistical Analysis of Biology, Harvard University, Cambridge, Massachusetts, United States of America.

PLoS Computational Biology | March 7, 2022
Summary
This summary is machine-generated.

Developing robust biological sequence analysis methods requires careful data splitting. New algorithms ensure training and test sets are evolutionarily dissimilar, improving benchmark dataset diversity and method evaluation.


Area of Science:

  • Bioinformatics
  • Computational Biology
  • Genomics

Background:

  • Biological sequence families share evolutionary relationships, complicating data splitting for method benchmarking.
  • Random data splits can lead to highly similar or identical sequences in training and test sets, compromising evaluation.
  • Existing methods for splitting biological sequence data may not adequately address evolutionary relatedness.

Purpose of the Study:

  • To develop novel algorithms for splitting biological sequence data into dissimilar training and test sets.
  • To ensure that each test sequence stays below a chosen identity threshold to every training sequence in the same family.
  • To enhance the construction of diverse and reliable benchmark datasets for sequence analysis.

Main Methods:

  • Adapted concepts from independent set graph algorithms.
  • Developed two new algorithms for partitioning sequence data.
  • Applied a percent identity threshold (p%) that no pair of test and training sequences may reach.
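
The idea behind these methods can be illustrated with a small sketch. It is not the authors' algorithm: it is a simplified randomized greedy heuristic, written under the assumption that the sequences of a family are rows of one multiple alignment, and that percent identity is measured over columns where neither sequence has a gap. The function names, the `train_fraction` parameter, and the identity definition are all hypothetical choices made for illustration. Sequences are vertices, an edge joins any two sequences that are at least p% identical, and the split is built so that no edge joins a training sequence to a test sequence.

```python
import random
from itertools import combinations

def percent_identity(a, b):
    """Percent identity over aligned columns where neither row has a gap.
    Assumption: a and b are equal-length rows of the same alignment."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return 100.0 * sum(x == y for x, y in pairs) / len(pairs)

def similarity_graph(seqs, p):
    """Map each name to the set of names that are at least p% identical to it."""
    adj = {name: set() for name in seqs}
    for u, v in combinations(seqs, 2):
        if percent_identity(seqs[u], seqs[v]) >= p:
            adj[u].add(v)
            adj[v].add(u)
    return adj

def greedy_split(adj, train_fraction=0.7, seed=0):
    """Visit sequences in random order and assign each one so that no
    training sequence is ever adjacent to a test sequence.
    Illustrative heuristic only, not the method from the paper."""
    rng = random.Random(seed)
    order = list(adj)
    rng.shuffle(order)
    train, test, discarded = set(), set(), set()
    for v in order:
        near_train = bool(adj[v] & train)
        near_test = bool(adj[v] & test)
        if near_train and near_test:
            discarded.add(v)   # would link the two sides, so drop it
        elif near_train:
            train.add(v)       # similar only to training sequences
        elif near_test:
            test.add(v)        # similar only to test sequences
        elif rng.random() < train_fraction:
            train.add(v)       # unconstrained: choose a side at random
        else:
            test.add(v)
    return train, test, discarded

# Toy aligned family (made-up sequences), split at p = 60.
seqs = {
    "a": "ACGTACGT",
    "b": "ACGTACGA",
    "c": "TTTTGGGG",
    "d": "TTTTGGGC",
    "e": "ACGTGGGG",
}
adj = similarity_graph(seqs, p=60.0)
train, test, discarded = greedy_split(adj, seed=1)
print(sorted(train), sorted(test), sorted(discarded))
```

Sequences that are similar to both sides are discarded rather than assigned, which is why a heuristic like this can fail to produce a usable split for densely connected families.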

Main Results:

  • The new algorithms successfully split more sequence families than previous approaches.
  • Produced splits in which each test sequence is less than p% identical to any training sequence.
  • Enabled the creation of more diverse benchmark datasets.
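
The stated guarantee, that every test sequence is less than p% identical to any training sequence, can be checked directly for a candidate split. The sketch below is a minimal check under assumptions that are not from the paper: sequences are equal-length aligned strings, identity is counted over columns where neither sequence has a gap, and the helper names are hypothetical.

```python
def percent_identity(a, b):
    """Percent identity over aligned columns where neither row has a gap.
    Assumption: a and b are equal-length rows of the same alignment."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return 100.0 * sum(x == y for x, y in pairs) / len(pairs)

def max_cross_identity(train, test):
    """Highest percent identity between any test and any training sequence."""
    return max(
        (percent_identity(t, s) for t in test for s in train),
        default=0.0,
    )

def split_is_valid(train, test, p):
    """True when every test sequence is less than p% identical to every
    training sequence."""
    return max_cross_identity(train, test) < p

# Made-up aligned sequences for illustration.
train_seqs = ["ACGTACGT", "ACGTACGA"]
good_test = ["TTTTGGGG"]
leaky_test = ["ACGTACGA"]   # identical to a training sequence

print(split_is_valid(train_seqs, good_test, p=50.0))
print(split_is_valid(train_seqs, leaky_test, p=50.0))
```

A random split of a real family would typically fail this check, because closely related or identical sequences land on both sides.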

Conclusions:

  • The proposed methods provide a more effective strategy for splitting biological sequence data.
  • These algorithms facilitate the development of more rigorous benchmarking for sequence analysis tools.
  • Improved data splitting enhances the reliability and generalizability of findings in bioinformatics research.