



Supervised Learning for Detection of Duplicates in Genomic Sequence Databases.

Qingyu Chen¹, Justin Zobel¹, Xiuzhen Zhang²

  • ¹ Department of Computing and Information Systems, The University of Melbourne, Melbourne, Australia.

PLOS ONE | August 5, 2016
Summary
This summary is machine-generated.

Supervised machine learning effectively detects duplicate genomic sequences, improving biological database accuracy. This approach learns from expert curation to precisely identify redundant data, enhancing data consistency and reliability.


Area of Science:

  • Bioinformatics
  • Genomics
  • Computational Biology

Background:

  • Duplication in biological databases, identified as an issue in 1996, leads to data redundancy and inconsistency.
  • Manual de-duplication is impractical for large datasets, and existing automated systems lack expert-level precision.
  • Supervised learning offers a promising avenue for developing precise and efficient automated duplicate detection systems.

Purpose of the Study:

  • To develop and evaluate a supervised machine learning method for detecting duplicate records in genomic sequence databases.
  • To assess the performance of binary and multi-class models trained on expert-curated data.
  • To identify key features influencing duplicate detection accuracy.

Main Methods:

  • Developed a supervised duplicate detection method using an expert-curated dataset of over one million sequence pairs across five organisms.
  • Selected 22 features representing database record attributes, including metadata, sequence identity, and alignment quality.
  • Implemented and cross-validated both binary and multi-class classification models.
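
The general workflow described above (turn each record pair into features, train a classifier on labelled pairs, cross-validate) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the study's implementation: the record fields, the three features, the nearest-centroid classifier, and the toy data are all hypothetical stand-ins for the 22 features and the models used in the paper.

```python
from difflib import SequenceMatcher


def pair_features(rec_a, rec_b):
    """Turn a pair of records into a small numeric feature vector.

    The fields ('description', 'sequence') and the three features are
    hypothetical; they only mimic metadata and sequence similarity.
    """
    desc_sim = SequenceMatcher(
        None, rec_a["description"], rec_b["description"], autojunk=False
    ).ratio()
    seq_sim = SequenceMatcher(
        None, rec_a["sequence"], rec_b["sequence"], autojunk=False
    ).ratio()
    len_a, len_b = len(rec_a["sequence"]), len(rec_b["sequence"])
    len_ratio = min(len_a, len_b) / max(len_a, len_b)
    return [desc_sim, seq_sim, len_ratio]


def train_centroids(X, y):
    """Nearest-centroid training: average the feature vectors per class."""
    sums, counts = {}, {}
    for x, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(x))
        for j, value in enumerate(x):
            acc[j] += value
        counts[label] = counts.get(label, 0) + 1
    return {lab: [v / counts[lab] for v in acc] for lab, acc in sums.items()}


def predict(centroids, x):
    """Return the label whose centroid is closest (squared Euclidean)."""
    def dist(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(centroids, key=lambda lab: dist(centroids[lab]))


def cross_validate(X, y, k=3):
    """k-fold cross-validation; pair i goes to fold i % k. Returns accuracy."""
    correct = 0
    for fold in range(k):
        train = [i for i in range(len(X)) if i % k != fold]
        test = [i for i in range(len(X)) if i % k == fold]
        model = train_centroids([X[i] for i in train], [y[i] for i in train])
        correct += sum(predict(model, X[i]) == y[i] for i in test)
    return correct / len(X)


# Invented feature vectors for six pairs; 1 = duplicate, 0 = distinct.
X_toy = [
    [0.90, 0.95, 1.0], [0.2, 0.3, 0.5],
    [0.80, 0.90, 0.9], [0.1, 0.4, 0.6],
    [0.95, 0.99, 1.0], [0.3, 0.2, 0.4],
]
y_toy = [1, 0, 1, 0, 1, 0]
accuracy = cross_validate(X_toy, y_toy, k=3)
```

Because `train_centroids` keys its centroids by label, the same sketch accepts more than two label values, which is one simple way to move from a binary to a multi-class setting.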

Main Results:

  • The binary model achieved over 90% accuracy across five organisms.
  • The multi-class model demonstrated high accuracy and improved generalization capabilities.
  • An ablation study revealed that metadata, sequence identity, and alignment quality features most strongly impact performance.
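
An ablation study of this kind removes one group of features at a time and measures how much accuracy falls. The sketch below shows the idea on invented data, assuming a leave-one-out nearest-centroid evaluation; the group names, column indices, and values are hypothetical and do not reproduce the paper's features or results.

```python
def loo_accuracy(X, y):
    """Leave-one-out accuracy of a nearest-centroid classifier."""
    correct = 0
    for i in range(len(X)):
        train = [(x, lab) for j, (x, lab) in enumerate(zip(X, y)) if j != i]
        centroids = {}
        for lab in sorted(set(l for _, l in train)):
            rows = [x for x, l in train if l == lab]
            centroids[lab] = [sum(col) / len(rows) for col in zip(*rows)]
        pred = min(
            centroids,
            key=lambda lab: sum((a - b) ** 2 for a, b in zip(X[i], centroids[lab])),
        )
        correct += pred == y[i]
    return correct / len(X)


def ablate(X, y, groups):
    """Drop each feature group in turn and record the loss in accuracy.

    `groups` maps a group name to the column indices it covers.
    """
    baseline = loo_accuracy(X, y)
    drops = {}
    for name, cols in groups.items():
        keep = [j for j in range(len(X[0])) if j not in cols]
        reduced = [[x[j] for j in keep] for x in X]
        drops[name] = baseline - loo_accuracy(reduced, y)
    return baseline, drops


# Invented data: column 0 separates the classes, column 1 is constant.
X_toy = [[0.9, 0.5], [0.8, 0.5], [0.95, 0.5],
         [0.2, 0.5], [0.3, 0.5], [0.1, 0.5]]
y_toy = [1, 1, 1, 0, 0, 0]
groups = {"sequence_identity": [0], "uninformative": [1]}

baseline, drops = ablate(X_toy, y_toy, groups)
```

A large drop for a group suggests the classifier depends on it; a drop near zero suggests the group adds little on this data. With correlated feature groups the drops can understate each group's importance, so results of this kind are best read comparatively.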

Conclusions:

  • Machine learning, specifically supervised learning, is an effective tool for de-duplicating genomic sequence databases.
  • The developed models provide a precise and efficient method for identifying duplicate biological data.
  • The findings highlight the potential of integrating machine learning into biological database management workflows.