MLProbs: A Data-Centric Pipeline for Better Multiple Sequence Alignment.

Mengmeng Kuang, Yong Zhang, Tak-Wah Lam

IEEE/ACM Transactions on Computational Biology and Bioinformatics

February 4, 2022

Summary

This summary is machine-generated.

This study introduces MLProbs, a novel data-centric pipeline for Multiple Sequence Alignment (MSA). MLProbs utilizes machine learning to outperform existing tools, especially for low-similarity protein families.

Area of Science:

Bioinformatics
Computational Biology
Machine Learning

Background:

Traditional Multiple Sequence Alignment (MSA) construction relies on algorithm-centric approaches, often reducing the problem to complex combinatorial optimization.
These methods may not optimally handle the inherent variability and complexity of biological sequence data.
A data-centric approach, leveraging machine learning on benchmark datasets, offers a promising alternative.

Purpose of the Study:

To develop and evaluate a novel data-centric pipeline for Multiple Sequence Alignment (MSA) construction.
To demonstrate the efficacy of shallow machine learning models in guiding MSA tool selection and realignment decisions.
To improve MSA accuracy, particularly for challenging datasets like low-similarity protein families.

Main Methods:

Developed MLProbs, a new MSA pipeline based on a data-centric approach.
Trained shallow machine learning classification models on benchmark data to guide alignment tool choice and realignment.
Evaluated MLProbs against 10 popular MSA tools using four benchmark databases (BAliBASE, OXBench, OXBench-X, SABMark).
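
The idea of letting a shallow classifier pick an alignment tool can be pictured with a toy example. The sketch below is not the MLProbs implementation. It assumes a single hypothetical feature (the mean pairwise percent identity of a protein family), two hypothetical tool labels ("tool_A" and "tool_B"), and invented training rows, and it fits a one-split decision stump as a stand-in for a shallow classification model.

```python
from typing import List, Tuple

# Hypothetical training rows: (mean pairwise identity in %, best tool label).
# Values and labels are invented for illustration; they are not benchmark data.
TRAINING: List[Tuple[float, str]] = [
    (18.0, "tool_A"), (25.0, "tool_A"), (33.0, "tool_A"), (41.0, "tool_A"),
    (55.0, "tool_B"), (62.0, "tool_B"), (74.0, "tool_B"), (88.0, "tool_B"),
]

Stump = Tuple[float, str, str]  # (threshold, label at or below, label above)


def fit_stump(rows: List[Tuple[float, str]]) -> Stump:
    """Fit a one-split decision stump that minimizes training errors."""
    values = sorted(v for v, _ in rows)
    # Candidate thresholds are midpoints between consecutive feature values.
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]
    labels = sorted({label for _, label in rows})
    best = None
    for threshold in candidates:
        for low in labels:
            for high in labels:
                errors = sum(
                    (low if v <= threshold else high) != label
                    for v, label in rows
                )
                if best is None or errors < best[0]:
                    best = (errors, threshold, low, high)
    if best is None:
        raise ValueError("need at least two training rows")
    _, threshold, low, high = best
    return threshold, low, high


def predict(stump: Stump, identity: float) -> str:
    """Return the tool label the stump assigns to a family's identity value."""
    threshold, low, high = stump
    return low if identity <= threshold else high


stump = fit_stump(TRAINING)
choice_low = predict(stump, 30.0)   # a low-similarity family
choice_high = predict(stump, 70.0)  # a high-similarity family
```

A real pipeline would presumably use richer features and a stronger model; the stump only illustrates how a label learned from benchmark outcomes can route a new family to one tool or another.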

Main Results:

MLProbs consistently achieved the highest total-column (TC) score across the benchmark databases.
Demonstrated significant improvement for protein families with low similarity (≤ 50%), outperforming top competitors by over 1.8%.
MLProbs exhibited superior performance in real-life applications, including phylogenetic tree construction and protein secondary structure prediction.
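
For readers unfamiliar with the TC metric reported above, the following is a minimal sketch of a total-column score: the fraction of reference alignment columns that a test alignment reproduces exactly. It assumes that every reference column containing at least one residue is scored and that "-" marks a gap. Benchmark suites differ in which columns they score (for example, only annotated core regions), so this is an illustration rather than the scoring code used in the study. The function names and the example sequences are hypothetical.

```python
from typing import List, Optional, Tuple

Column = Tuple[Optional[int], ...]


def columns(alignment: List[str], gap: str = "-") -> List[Column]:
    """Convert aligned rows into columns of residue indices (None marks a gap)."""
    if len({len(row) for row in alignment}) != 1:
        raise ValueError("all rows of an alignment must have the same length")
    counters = [0] * len(alignment)  # next residue index for each sequence
    cols: List[Column] = []
    for j in range(len(alignment[0])):
        col: List[Optional[int]] = []
        for i, row in enumerate(alignment):
            if row[j] == gap:
                col.append(None)
            else:
                col.append(counters[i])
                counters[i] += 1
        cols.append(tuple(col))
    return cols


def tc_score(reference: List[str], test: List[str]) -> float:
    """Fraction of non-empty reference columns found identically in the test alignment."""
    ref_cols = [c for c in columns(reference) if any(x is not None for x in c)]
    if not ref_cols:
        return 0.0
    test_cols = set(columns(test))
    matched = sum(c in test_cols for c in ref_cols)
    return matched / len(ref_cols)


# Toy alignments of the same three sequences (rows in the same order).
reference = ["AC-GT", "ACAGT", "A-AGT"]
test = ["ACG-T", "ACAGT", "A-AGT"]
score = tc_score(reference, test)
```

Because a column counts only when every sequence is placed identically, shifting a single residue in one row invalidates each column it touches, which is why TC is a stricter measure than pairwise agreement.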

Conclusions:

The data-centric approach, powered by shallow machine learning, offers a robust and effective strategy for Multiple Sequence Alignment.
MLProbs provides a significant advancement in MSA accuracy, especially for evolutionarily distant sequences.
Future research could explore deep learning methods to further enhance MSA construction capabilities.