



Data splitting to avoid information leakage with DataSAIL.

Roman Joeres (1,2,3,4), David B Blumenthal (5), Olga V Kalinina (6,7,8)

  • (1) Helmholtz Institute for Pharmaceutical Research Saarland (HIPS), Helmholtz Centre for Infection Research (HZI), Saarbrücken, Germany. roman.joeres@helmholtz-hips.de

Nature Communications | April 8, 2025
Summary
This summary is machine-generated.

Information leakage in machine learning can inflate performance metrics. DataSAIL is a new Python package that reduces data leakage for more accurate evaluations of biomedical AI models.


Area of Science:

  • Machine Learning
  • Bioinformatics
  • Computational Biology

Background:

  • Information leakage during training can cause machine learning models to memorize data, leading to overestimated performance.
  • Accurate evaluation of machine learning models is crucial for reliable deployment in biomedical applications, especially in out-of-distribution scenarios.
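
As a rough illustration of what leakage between splits looks like, the sketch below counts how many test items share a similarity cluster with at least one training item. The item names, the cluster labels, and the `leaked_fraction` helper are hypothetical and chosen only for this example; this is not DataSAIL's own leakage measure.

```python
def leaked_fraction(train, test, cluster_of):
    """Fraction of test items whose similarity cluster also occurs in train."""
    train_clusters = {cluster_of[item] for item in train}
    if not test:
        return 0.0
    hits = sum(1 for item in test if cluster_of[item] in train_clusters)
    return hits / len(test)


# Hypothetical toy data: six items grouped into three similarity clusters.
cluster_of = {"a1": "A", "a2": "A", "a3": "A", "b1": "B", "b2": "B", "c1": "C"}

# A naive split scatters members of clusters A and B across both sets.
naive_train, naive_test = ["a1", "a2", "b1", "c1"], ["a3", "b2"]

# A cluster-aware split keeps every cluster on one side only.
grouped_train, grouped_test = ["a1", "a2", "a3", "c1"], ["b1", "b2"]

naive_leak = leaked_fraction(naive_train, naive_test, cluster_of)
grouped_leak = leaked_fraction(grouped_train, grouped_test, cluster_of)
print(naive_leak)    # 1.0
print(grouped_leak)  # 0.0
```

In the naive split every test item has a close relative in the training set, so a model could score well on it by memorization alone; the cluster-aware split is closer to an out-of-distribution test.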

Purpose of the Study:

  • To introduce DataSAIL, a Python package designed to mitigate information leakage in data splitting for machine learning models.
  • To enable more realistic performance evaluations of biomedical machine learning models by reducing data leakage.

Main Methods:

  • Formulating the problem of leakage-reduced data splitting as a combinatorial optimization problem.
  • Proving that the problem is NP-hard and developing a scalable heuristic that combines clustering with integer linear programming.
  • Implementing the heuristic within the DataSAIL Python package.
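
To make the combinatorial view concrete, here is a minimal sketch that assigns whole clusters to splits so that the similarity crossing split boundaries is as small as possible while split sizes stay near their targets. It is an assumption-laden toy: it swaps the integer linear programming step for exhaustive search (feasible only for a handful of clusters), and the function name `split_clusters`, the similarity matrix, and the size tolerance are invented for illustration rather than taken from the DataSAIL package.

```python
from itertools import product


def split_clusters(sizes, sim, targets, tol):
    """Exhaustively assign clusters to splits.

    sizes[i]   -- number of data points in cluster i
    sim[i][j]  -- similarity between clusters i and j (symmetric)
    targets[s] -- desired number of data points in split s
    tol        -- allowed absolute deviation from each target size
    Returns (assignment, cost), or (None, None) if nothing is feasible.
    """
    n = len(sizes)
    best_assign, best_cost = None, None
    for assign in product(range(len(targets)), repeat=n):
        # Size constraint: each split must stay within tol of its target.
        split_sizes = [0] * len(targets)
        for i, s in enumerate(assign):
            split_sizes[s] += sizes[i]
        if any(abs(sz - t) > tol for sz, t in zip(split_sizes, targets)):
            continue
        # Objective: total similarity between clusters in different splits.
        cost = sum(
            sim[i][j]
            for i in range(n)
            for j in range(i + 1, n)
            if assign[i] != assign[j]
        )
        if best_cost is None or cost < best_cost:
            best_assign, best_cost = assign, cost
    return best_assign, best_cost


# Hypothetical instance: five clusters, 100 data points, 80/20 split.
sizes = [30, 30, 20, 10, 10]
sim = [
    [0, 2, 7, 1, 1],
    [2, 0, 1, 1, 1],
    [7, 1, 0, 1, 1],
    [1, 1, 1, 0, 8],
    [1, 1, 1, 8, 0],
]
assignment, cost = split_clusters(sizes, sim, targets=[80, 20], tol=5)
print(assignment, cost)  # (0, 0, 0, 1, 1) 6
```

Two assignments satisfy the size constraint here. Putting cluster 2 alone in the test split would cut its strong link to cluster 0 (cost 10), so the search instead places clusters 3 and 4 together in the test split (cost 6). The search space grows exponentially with the number of clusters, which is why a heuristic is needed at realistic scale.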

Main Results:

  • DataSAIL provides a method for generating data splits that minimize information leakage.
  • The package facilitates more reliable assessment of machine learning model generalizability in biomedical contexts.
  • Empirical results demonstrate the positive impact of DataSAIL on evaluating machine learning models for biological data.

Conclusions:

  • DataSAIL offers a practical solution for addressing information leakage in machine learning for biomedical applications.
  • The package promotes robust model evaluation, crucial for trustworthy AI in healthcare and life sciences.
  • By enabling leakage-reduced data splitting, DataSAIL supports the development of more generalizable and reliable biomedical AI tools.