


Mining GO annotations for improving annotation consistency.

Daniel Faria¹, Andreas Schlicker, Catia Pesquita

  • ¹ Department of Informatics, Faculty of Sciences, University of Lisbon, Lisbon, Portugal. dfaria@xldb.di.fc.ul.pt

PLoS One | August 1, 2012 | PubMed
Summary
This summary is machine-generated.

This study examines the accuracy of protein functional labels in biological databases. Researchers identified widespread inconsistencies in how proteins are described and developed a new computational tool to help experts fix these errors. By uncovering hidden patterns in functional data, this approach significantly improves the reliability of automated protein classification.

Keywords:
UniProtKB, protein annotation, computational biology, association rule learning


Area of Science:

  • Bioinformatics and computational biology, specifically quality control of Gene Ontology annotations
  • Data mining and machine learning applications in molecular biology

Background:

Maintaining high-quality functional descriptions for proteins in large biological databases remains an unresolved challenge. The Gene Ontology (GO) provides a structured framework, yet automatically assigned labels are still prone to significant errors, and electronically inferred annotations are frequently conflicting or incomplete. Because the volume of entries makes manual verification of every annotation infeasible, better validation strategies for electronically inferred data are needed. This gap motivated an investigation into the patterns of inconsistency affecting molecular function labels, since understanding these flaws is necessary to make protein databases more useful to the wider scientific community. The study addresses these limitations by analyzing the current state of annotation consistency across the UniProtKB repository.

Purpose Of The Study:

The aim of this study is to improve the quality and consistency of protein functional annotations within the Gene Ontology framework. Researchers seek to address the high error rates associated with electronically inferred data. The project investigates the prevalence of incomplete and conflicting labels across the UniProtKB protein database. By quantifying these inconsistencies, the authors highlight the urgent need for better validation tools. The study also introduces a novel data mining algorithm to assist curators in identifying and correcting errors. This work aims to provide a more reliable method for managing complex biological information. The motivation stems from the fact that manual curation is currently impossible for the entire dataset. Ultimately, the researchers intend to demonstrate that computational approaches can effectively support the maintenance of accurate functional ontologies.

Main Methods:

The study comprises a comprehensive analysis of the molecular function labels assigned to proteins in the UniProtKB database. The researchers implement a data mining algorithm, based on association rule learning, that detects implicit associations between functional terms. These associations point to patterns that indicate potential errors or missing information. The team compares the custom algorithm against the basic association rule learning methodology and evaluates both by estimating the precision of the predicted relationships. The focus is on implicit links that help curators maintain the integrity of the ontology. By systematically scanning the database, the authors isolate inconsistencies that affect a large percentage of protein entries. The resulting computational framework is intended as a scalable way to manage the complexity of biological annotation tasks.
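To make the idea of association rule learning over annotations concrete, the following is a minimal sketch of the basic technique only, not the authors' custom algorithm, whose details are not given in this summary. It assumes annotations are available as a mapping from protein identifier to a set of term labels; the function name `mine_rules`, the thresholds, and the toy data are all hypothetical.

```python
from collections import Counter
from itertools import permutations


def mine_rules(annotations, min_support=2, min_confidence=0.8):
    """Return rules (a, b, support, confidence): proteins with term a also carry term b.

    support    = number of proteins annotated with both a and b
    confidence = support / number of proteins annotated with a
    """
    term_counts = Counter()
    pair_counts = Counter()
    for terms in annotations.values():
        unique = set(terms)
        term_counts.update(unique)
        # Ordered pairs, because the rule a -> b differs from b -> a.
        pair_counts.update(permutations(sorted(unique), 2))

    rules = []
    for (a, b), support in pair_counts.items():
        confidence = support / term_counts[a]
        if support >= min_support and confidence >= min_confidence:
            rules.append((a, b, support, confidence))
    return sorted(rules)


# Hypothetical toy data: made-up protein IDs with illustrative term labels.
toy_annotations = {
    "P1": {"kinase activity", "ATP binding"},
    "P2": {"kinase activity", "ATP binding"},
    "P3": {"kinase activity", "ATP binding", "metal ion binding"},
    "P4": {"ATP binding"},
    "P5": {"metal ion binding"},
}

rules = mine_rules(toy_annotations)
for a, b, support, confidence in rules:
    print(f"{a} -> {b} (support={support}, confidence={confidence:.2f})")
```

In this toy set the only rule that passes both thresholds is "kinase activity -> ATP binding". A protein carrying the first term but not the second would then be a candidate for curator review. Applied without further filtering, this basic scheme tends to produce many rules, which is consistent with the large number of low-precision predictions the summary reports for the basic methodology.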

Main Results:

The analysis estimates that 64% of proteins in UniProtKB are incompletely annotated. Inconsistent annotations affect 83% of molecular function terms and at least 23% of proteins. The authors' algorithm predicted 501 relationships between terms with an estimated precision of 94%. In contrast, the basic association rule learning methodology predicted 12,352 relationships with a precision below 9%. These results indicate that the targeted approach yields far more reliable predicted relationships than the standard technique.
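The precision figures can be turned into approximate counts of correct predictions, which makes the comparison easier to read. The sketch below uses only the numbers quoted above and the definition precision = correct predictions / total predictions. Because the precisions are estimates, and the basic method's value is reported only as "below 9%", the results are rough: an approximate count for the custom algorithm and bounds for the basic one. The variable names are illustrative.

```python
import math

# Figures quoted in the summary above.
CUSTOM_PREDICTED, CUSTOM_PRECISION = 501, 0.94
BASIC_PREDICTED, BASIC_PRECISION_BOUND = 12_352, 0.09  # reported only as "below 9%"

# precision = correct / predicted, so correct = predicted * precision.
custom_correct = round(CUSTOM_PREDICTED * CUSTOM_PRECISION)

# For the basic method only an upper bound on correct predictions can be derived.
basic_correct_max = math.floor(BASIC_PREDICTED * BASIC_PRECISION_BOUND)
basic_incorrect_min = BASIC_PREDICTED - basic_correct_max

print(f"custom algorithm: about {custom_correct} correct of {CUSTOM_PREDICTED}")
print(f"basic method: at most {basic_correct_max} correct, "
      f"at least {basic_incorrect_min} incorrect of {BASIC_PREDICTED}")
```

Under these figures, a curator reviewing the basic method's output would have to discard more than eleven thousand incorrect relationships, whereas roughly 30 of the custom algorithm's 501 predictions would be expected to be wrong.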

Conclusions:

The authors demonstrate that a large portion of protein entries currently suffer from incomplete or conflicting functional descriptions. These findings suggest that automated annotation systems require more robust validation to ensure biological accuracy. The researchers propose their association rule learning approach as a more precise alternative to the basic methodology for identifying functional relationships. The tool gives curators a practical mechanism for detecting and resolving errors in existing datasets, and by focusing on implicit connections between terms it helps prevent the propagation of incorrect labels. The study emphasizes that improving data quality is a continuous process requiring both computational support and expert oversight. Overall, the work highlights the potential of data mining to make large-scale biological resources more reliable and outlines a route for refining functional annotations.

The researchers propose an association rule learning algorithm that identifies implicit relationships between molecular function terms. This method achieves a 94% precision rate, significantly outperforming the basic association rule learning methodology, which yields a precision below 9%.

The study utilizes the UniProtKB repository, specifically focusing on the full molecular function annotation set. This database serves as the primary source for evaluating the prevalence of incomplete and inconsistent protein labels.

The researchers state that manual curation is unfeasible due to the overwhelming volume of protein entries. Consequently, they argue that computational tools are necessary to assist curators in updating and correcting the vast amount of electronically inferred data.

The authors employ a data mining approach based on association rule learning to uncover hidden patterns. This technique allows for the identification of implicit relationships between terms, which helps curators detect inconsistencies that might otherwise remain hidden.

The analysis reveals that 64% of proteins are incompletely annotated. Furthermore, inconsistent labels impact 83% of functional terms and at least 23% of the proteins themselves, highlighting the scale of the current data quality problem.

The authors suggest that their algorithm assists curators in updating the ontology and preventing future errors. They imply that this approach is a viable solution for maintaining the integrity of large-scale functional databases.