Supervised Learning for Detection of Duplicates in Genomic Sequence Databases.

Qingyu Chen¹, Justin Zobel¹, Xiuzhen Zhang²

¹Department of Computing and Information Systems, The University of Melbourne, Melbourne, Australia.

August 5, 2016

Summary

This summary is machine-generated.

Supervised machine learning effectively detects duplicate genomic sequences, improving biological database accuracy. This approach learns from expert curation to precisely identify redundant data, enhancing data consistency and reliability.

Area of Science:

Bioinformatics
Genomics
Computational Biology

Background:

Duplication in biological databases causes data redundancy and inconsistency, a problem identified in 1996.
Manual de-duplication is impractical for large datasets, and existing automated systems lack expert-level precision.
Supervised learning offers a promising avenue for developing precise and efficient automated duplicate detection systems.

Purpose of the Study:

To develop and evaluate a supervised machine learning method for detecting duplicate records in genomic sequence databases.
To assess the performance of binary and multi-class models trained on expert-curated data.
To identify key features influencing duplicate detection accuracy.

Main Methods:

Developed a supervised duplicate detection method using an expert-curated dataset of over one million sequence pairs across five organisms.
Selected 22 features representing database record attributes, including metadata, sequence identity, and alignment quality.
Implemented and cross-validated both binary and multi-class classification models.
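
As a rough illustration of how a pair of database records might be turned into a feature vector, the sketch below computes four pairwise features that loosely correspond to the metadata and sequence-identity categories named above. It is not the study's feature set: the record fields (`description`, `organism`, `sequence`), the function names, and the use of `difflib.SequenceMatcher` as a stand-in for alignment-based identity are all assumptions made for the example.

```python
from difflib import SequenceMatcher


def token_jaccard(a: str, b: str) -> float:
    """Jaccard overlap of lowercase whitespace tokens (hypothetical metadata feature)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def pair_features(rec_a: dict, rec_b: dict) -> dict:
    """Build an illustrative feature dictionary for one candidate record pair.

    Assumes each record is a dict with 'description', 'organism' and
    'sequence' keys; these field names are hypothetical.
    """
    seq_a, seq_b = rec_a["sequence"], rec_b["sequence"]
    longer = max(len(seq_a), len(seq_b))
    return {
        # Metadata features
        "description_similarity": token_jaccard(rec_a["description"], rec_b["description"]),
        "same_organism": float(rec_a["organism"] == rec_b["organism"]),
        # Sequence features; SequenceMatcher is only a cheap proxy for a real alignment
        "length_ratio": min(len(seq_a), len(seq_b)) / longer if longer else 1.0,
        "sequence_similarity": SequenceMatcher(None, seq_a, seq_b, autojunk=False).ratio(),
    }
```

In a pipeline of this kind, each labelled pair would pass through a function like `pair_features`, and the resulting vectors would be used to train the binary or multi-class classifier.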

Main Results:

The binary model achieved over 90% accuracy across five organisms.
The multi-class model demonstrated high accuracy and improved generalization capabilities.
An ablation study revealed that metadata, sequence identity, and alignment quality features most strongly impact performance.
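
The idea behind an ablation study can be sketched as follows: measure cross-validated accuracy with all features, then remove one feature group at a time and record how much accuracy falls. The code is a minimal, standard-library illustration under stated assumptions. The nearest-centroid classifier, the interleaved fold assignment, the group names, and the synthetic data are all placeholders and do not reproduce the models or data used in the study.

```python
from statistics import mean


def centroid_fit(X, y):
    """Fit a nearest-centroid classifier: one mean vector per class label."""
    centroids = {}
    for label in set(y):
        rows = [x for x, t in zip(X, y) if t == label]
        centroids[label] = [mean(col) for col in zip(*rows)]
    return centroids


def centroid_predict(centroids, x):
    """Predict the label whose centroid is closest in squared Euclidean distance."""
    def dist(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    # Sorting the labels makes tie-breaking deterministic.
    return min(sorted(centroids), key=lambda label: dist(centroids[label]))


def cv_accuracy(X, y, k=5):
    """k-fold cross-validated accuracy; sample i is assigned to fold i % k."""
    correct = 0
    for fold in range(k):
        train = [i for i in range(len(X)) if i % k != fold]
        test = [i for i in range(len(X)) if i % k == fold]
        model = centroid_fit([X[i] for i in train], [y[i] for i in train])
        correct += sum(centroid_predict(model, X[i]) == y[i] for i in test)
    return correct / len(X)


def ablation(X, y, groups, k=5):
    """Return baseline accuracy and the accuracy drop when each feature group is removed."""
    baseline = cv_accuracy(X, y, k)
    drops = {}
    for name, cols in groups.items():
        keep = [j for j in range(len(X[0])) if j not in cols]
        reduced = [[row[j] for j in keep] for row in X]
        drops[name] = baseline - cv_accuracy(reduced, y, k)
    return baseline, drops


# Synthetic toy data: column 0 separates the classes, column 1 is constant.
y = [i % 2 for i in range(20)]  # 1 = duplicate, 0 = not a duplicate
X = [[0.9 if label else 0.1, 0.5] for label in y]
groups = {"sequence_identity": {0}, "uninformative": {1}}  # hypothetical group names
baseline, drops = ablation(X, y, groups)
```

A large entry in `drops` indicates that the classifier depends heavily on that feature group, which is the kind of evidence an ablation study uses to rank feature importance.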

Conclusions:

Machine learning, specifically supervised learning, is an effective tool for de-duplicating genomic sequence databases.
The developed models provide a precise and efficient method for identifying duplicate biological data.
The findings highlight the potential of integrating machine learning into biological database management workflows.