How (Not) to Generate a Highly Predictive Biomarker Panel Using Machine Learning.

Heather Desaire¹

¹Department of Chemistry, University of Kansas, Lawrence, Kansas 66045, United States.

Journal of Proteome Research

August 25, 2022

Summary

This summary is machine-generated.

Researchers can avoid inflated machine learning results in proteomics by preventing data leakage. This cautionary review shows how flawed feature selection leads to unreliable biomarker discovery, and it emphasizes correct cross-validation practices.

Keywords:

AUC; biomarker; classification; feature selection; machine learning; overfitting; proteomics; validation; xgboost

Area of Science:

Proteomics
Bioinformatics
Machine Learning

Background:

Biomarker discovery in proteomics often employs machine learning.
Feature selection is a critical step in building predictive models.
Inappropriate feature selection can lead to overestimated model performance.

Purpose of the Study:

To demonstrate a common data processing error in proteomics biomarker studies.
To illustrate how biased feature selection inflates machine learning model accuracy.
To provide guidance on applying machine learning to proteomics data correctly.

Main Methods:

Demonstration of a flawed feature selection strategy.
Building a classification model using biased feature selection.
Simulating a dataset to highlight the impact of data leakage.

Main Results:

An artificially high classification accuracy of 92% and an AUC of 0.98 were achieved.
The inflated performance was demonstrated on a simulated dataset built from random numbers, so no genuine predictive signal was present.
The study identified test data leakage into the feature selection step as the core issue.

Conclusions:

Biomarker panels generated by selecting features across all data before cross-validation are unreliable.
Test data leakage during feature selection is a common pitfall in machine learning for proteomics.
Correct application of machine learning requires careful separation of feature selection and model validation to prevent inflated accuracies.
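
The leakage described above can be illustrated with a minimal sketch. This is not the author's code, and it will not reproduce the reported 92% accuracy or 0.98 AUC. It assumes a purely random dataset (40 samples, 2000 features, alternating class labels), ranks features by the absolute difference of class means, and swaps in a simple nearest-centroid classifier so that it needs only the Python standard library. All function names, sizes, and the seed are hypothetical choices made for illustration. The `leaky=True` path selects features once on all samples before cross-validation; the `leaky=False` path repeats the selection inside each fold using training samples only.

```python
import random

def top_k_features(X, y, rows, k):
    """Rank features by |difference of class means| using only the given rows."""
    pos = [i for i in rows if y[i] == 1]
    neg = [i for i in rows if y[i] == 0]
    scores = []
    for j in range(len(X[0])):
        m1 = sum(X[i][j] for i in pos) / len(pos)
        m0 = sum(X[i][j] for i in neg) / len(neg)
        scores.append((abs(m1 - m0), j))
    scores.sort(reverse=True)
    return [j for _, j in scores[:k]]

def centroid(X, rows, feats):
    """Mean of each selected feature over the given rows."""
    return [sum(X[i][j] for i in rows) / len(rows) for j in feats]

def predict(x, feats, c0, c1):
    """Assign the class whose centroid is closer (squared Euclidean distance)."""
    d0 = sum((x[j] - c) ** 2 for j, c in zip(feats, c0))
    d1 = sum((x[j] - c) ** 2 for j, c in zip(feats, c1))
    return 1 if d1 < d0 else 0

def cv_accuracy(X, y, k, n_folds, leaky):
    """k-fold cross-validated accuracy, with or without test-data leakage."""
    all_rows = list(range(len(X)))
    # Leaky variant: features are chosen once, using every sample (test folds included).
    global_feats = top_k_features(X, y, all_rows, k) if leaky else None
    correct = 0
    for fold in range(n_folds):
        test = [i for i in all_rows if i % n_folds == fold]
        train = [i for i in all_rows if i % n_folds != fold]
        # Correct variant: features are re-selected from the training fold only.
        feats = global_feats if leaky else top_k_features(X, y, train, k)
        c0 = centroid(X, [i for i in train if y[i] == 0], feats)
        c1 = centroid(X, [i for i in train if y[i] == 1], feats)
        correct += sum(predict(X[i], feats, c0, c1) == y[i] for i in test)
    return correct / len(X)

# Assumed toy data: pure Gaussian noise, so the labels carry no real signal.
rng = random.Random(0)
n_samples, n_features = 40, 2000
X = [[rng.gauss(0, 1) for _ in range(n_features)] for _ in range(n_samples)]
y = [i % 2 for i in range(n_samples)]

leaky_acc = cv_accuracy(X, y, k=10, n_folds=5, leaky=True)
honest_acc = cv_accuracy(X, y, k=10, n_folds=5, leaky=False)
print(f"leaky: {leaky_acc:.2f}  honest: {honest_acc:.2f}")
```

Because the data are random, the accuracy from the fold-internal selection should hover around chance, while the leaky variant typically scores far higher. The only difference between the two runs is where feature selection happens relative to the train/test split.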