Jove
Visualize
Contact Us
JoVE
x logofacebook logolinkedin logoyoutube logo
ABOUT JoVE
OverviewLeadershipBlogJoVE Help Center
AUTHORS
Publishing ProcessEditorial BoardScope & PoliciesPeer ReviewFAQSubmit
LIBRARIANS
TestimonialsSubscriptionsAccessResourcesLibrary Advisory BoardFAQ
RESEARCH
JoVE JournalMethods CollectionsJoVE Encyclopedia of ExperimentsArchive
EDUCATION
JoVE CoreJoVE BusinessJoVE Science EducationJoVE Lab ManualFaculty Resource CenterFaculty Site
Terms & Conditions of Use
Privacy Policy
Policies

Related Concept Videos

Types of Errors: Detection and Minimization01:12

Types of Errors: Detection and Minimization

1.4K
Error is the deviation of the obtained result from the true, expected value or the estimated central value. Errors are expressed in absolute or relative terms.
Absolute error in a measurement is the numerical difference from the true or central value. Relative error is the ratio between absolute error and the true or central value, expressed as a percentage.
Errors can be classified by source, magnitude, and sign. There are three types of errors: systematic, random, and gross.
Systematic or...
1.4K
Improving Translational Accuracy02:07

Improving Translational Accuracy

8.5K
Base complementarity between the three base pairs of mRNA codon and the tRNA anticodon is not a failsafe mechanism. Inaccuracies can range from a single mismatch to no correct base pairing at all. The free energy difference between the correct and nearly correct base pairs can be as small as 3 kcal/ mol. With complementarity being the only proofreading step, the estimated error frequency would be one wrong amino acid in every 100 amino acids incorporated. However, error frequencies observed in...
8.5K
Observational Learning01:12

Observational Learning

111
Albert Bandura's observational learning, also known as imitation or modeling, occurs when a person observes and imitates another's behavior. It is a quicker process than operant conditioning. A well-known example is the Bobo doll study, where children who saw an adult acting aggressively towards the doll were more likely to act aggressively when left alone, compared to those who observed a nonaggressive adult. Many psychologists view observational learning as a form of latent learning...
111
Data Validation01:03

Data Validation

4.8K
Data validation is an essential part of a comprehensive assessment. Validation is confirming or verifying and opening the door to gathering more assessment data as it clarifies vague or unclear data. The process of checking and verifying the collected information is called data validation. The primary purpose of data validation is to ensure data is as free from error, bias, and misinterpretation as possible.
Nursing assessment guides are generally based on holistic models rather than medical...
4.8K
Censoring Survival Data01:09

Censoring Survival Data

54
Survival analysis is a statistical method used to analyze time-to-event data, often employed in fields such as medicine, engineering, and social sciences. One of the key challenges in survival analysis is dealing with incomplete data, a phenomenon known as "censoring." Censoring occurs when the event of interest (such as death, relapse, or system failure) has not occurred for some individuals by the end of the study period or is otherwise unobservable, and it might have many different...
54
Detection of Gross Error: The Q Test01:00

Detection of Gross Error: The Q Test

5.0K
When one or more data points appear far from the rest of the data, there is a need to determine whether they are outliers and whether they should be eliminated from the data set to ensure an accurate representation of the measured value. In many cases, outliers arise from gross errors (or human errors) and do not accurately reflect the underlying phenomenon. In some cases, however, these apparent outliers reflect true phenomenological differences. In these cases, we can use statistical methods...
5.0K

You might also read

Related Articles

Articles linked to this work by shared authors, journal, and citation graph.

Sort by
Same journal

DARUMA: a gateway to fast and easy prediction of intrinsically disordered regions.

PeerJ. Computer science·2026
Same journal

Alzheimer's disease detection using a quantum deep neural network with Haralick feature extraction and simulated annealing optimization.

PeerJ. Computer science·2026
Same journal

Network anomaly detection using Deep Autoencoder and parallel Artificial Bee Colony algorithm-trained neural network.

PeerJ. Computer science·2026
Same journal

An anomaly detection model for multivariate time series with anomaly perception.

PeerJ. Computer science·2026
Same journal

Retraction: A wormhole attack detection method for tactical wireless sensor networks.

PeerJ. Computer science·2026
Same journal

Evaluation of mental disorder with prioritization of its type by utilizing the bipolar complex fuzzy decision-making approach based on Schweizer-Sklar prioritized aggregation operators.

PeerJ. Computer science·2026
See all related articles

Related Experiment Video

Updated: May 20, 2025

Augmenting Large Language Models via Vector Embeddings to Improve Domain-Specific Responsiveness
03:14

Augmenting Large Language Models via Vector Embeddings to Improve Domain-Specific Responsiveness

Published on: December 6, 2024

474

Data leakage detection in machine learning code: transfer learning, active learning, or low-shot prompting?

Nouf Alturayeif1,2, Jameleddine Hassine1,3

  • 1Information and Computer Science Department, King Fahd University of Petroleum and Minerals, Dhahran, Saudi Arabia.

Peerj. Computer Science
|March 26, 2025
PubMed
Summary
This summary is machine-generated.

This study introduces machine learning (ML)-based methods to detect code-level data leakage in ML code. Active learning proved most effective, significantly reducing the need for annotated data while improving detection accuracy.

Keywords:
Active learningCode qualityData leakageLow-shot promptingTransfer learning

More Related Videos

Measuring Statistical Learning Across Modalities and Domains in School-Aged Children Via an Online Platform and Neuroimaging Techniques
08:05

Measuring Statistical Learning Across Modalities and Domains in School-Aged Children Via an Online Platform and Neuroimaging Techniques

Published on: June 30, 2020

7.4K
Author Spotlight: Addressing Technical and Subjective Challenges in Measuring Classroom Attention
06:37

Author Spotlight: Addressing Technical and Subjective Challenges in Measuring Classroom Attention

Published on: December 15, 2023

2.5K

Related Experiment Videos

Last Updated: May 20, 2025

Augmenting Large Language Models via Vector Embeddings to Improve Domain-Specific Responsiveness
03:14

Augmenting Large Language Models via Vector Embeddings to Improve Domain-Specific Responsiveness

Published on: December 6, 2024

474
Measuring Statistical Learning Across Modalities and Domains in School-Aged Children Via an Online Platform and Neuroimaging Techniques
08:05

Measuring Statistical Learning Across Modalities and Domains in School-Aged Children Via an Online Platform and Neuroimaging Techniques

Published on: June 30, 2020

7.4K
Author Spotlight: Addressing Technical and Subjective Challenges in Measuring Classroom Attention
06:37

Author Spotlight: Addressing Technical and Subjective Challenges in Measuring Classroom Attention

Published on: December 15, 2023

2.5K

Area of Science:

  • Computer Science
  • Artificial Intelligence
  • Software Engineering

Background:

  • Machine learning (ML) code quality is crucial for reliable model performance, with data leakage being a significant issue.
  • Data leakage, where test set information influences training, inflates metrics and harms generalization.
  • Existing detection methods for code-level data leakage are often manual or lack advanced automation.

Purpose of the Study:

  • To explore and propose ML-based approaches for detecting code-level data leakage, particularly with limited annotated datasets.
  • To address challenges in identifying quality issues within large and complex ML codebases efficiently.
  • To develop automated methods for handling imbalanced code data relevant to leakage detection.

Main Methods:

  • Investigated three ML-based approaches: transfer learning, active learning, and low-shot prompting.
  • Developed an automated method to manage data imbalance issues specific to code data.
  • Evaluated the effectiveness of these approaches in detecting code-level data leakage.

Main Results:

  • Active learning demonstrated superior performance, achieving an F-2 score of 0.72.
  • Active learning reduced the required number of annotated samples from 1,523 to 698.
  • The proposed ML-based methods effectively addressed challenges posed by limited data availability.

Conclusions:

  • ML-based approaches, especially active learning, are effective for detecting code-level data leakage in ML code.
  • These methods significantly improve efficiency by reducing the need for extensive manual annotation.
  • Automated detection of data leakage enhances ML code quality and model reliability.