X-Factor: Quality Is a Dataset-Intrinsic Property.

Josiah Couch¹, Miao Li¹, Rima Arnaout²,³,⁴

¹Department of Pathology, BIDMC.

June 10, 2025

Summary

This summary is machine-generated.

Dataset quality, independent of size and architecture, significantly impacts machine-learning classifier performance. This intrinsic property, stemming from class quality, offers a new optimization target for better model performance.

Area of Science:

Machine Learning
Computer Science
Data Science

Background:

Model architecture, dataset size, and class balance are known factors influencing machine-learning classifier performance.
An additional factor, dataset quality, was previously suggested but its intrinsic nature was unclear.

Purpose of the Study:

To determine whether dataset quality is an intrinsic property, independent of dataset size, class balance, and model architecture.
To investigate the relationship between dataset quality and classifier performance across diverse model architectures.

Main Methods:

Thousands of datasets were created, controlling for size and class balance.
Classifiers with various architectures (random forests, SVMs, deep networks) were trained on these datasets.
Classifier performance was analyzed across different datasets and architectures.
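
The dataset-construction step can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function name `sample_balanced_dataset`, the toy `pool`, and the idea of varying which classes are included are all assumptions made for the example. It draws the same number of examples from each chosen class, so every dataset built this way has the same size and the same class balance.

```python
import random
from collections import Counter, defaultdict


def sample_balanced_dataset(pool, classes, per_class, seed=0):
    """Draw a dataset with exactly `per_class` examples of each chosen class.

    `pool` is a list of (features, label) pairs. Fixing `per_class` and the
    number of classes holds dataset size and class balance constant, so
    datasets built this way differ only in which classes they contain.
    """
    rng = random.Random(seed)

    # Group the pool by label so each class can be sampled separately.
    by_label = defaultdict(list)
    for features, label in pool:
        by_label[label].append((features, label))

    dataset = []
    for label in classes:
        candidates = by_label[label]
        if len(candidates) < per_class:
            raise ValueError(
                f"class {label!r} has only {len(candidates)} examples"
            )
        # Sample without replacement within the class.
        dataset.extend(rng.sample(candidates, per_class))

    rng.shuffle(dataset)
    return dataset


# Toy pool (hypothetical): 4 classes with 50 one-dimensional examples each.
pool = [([float(i)], label) for label in "ABCD" for i in range(50)]
dataset = sample_balanced_dataset(pool, classes=["A", "C"], per_class=20, seed=1)
label_counts = Counter(label for _, label in dataset)
```

Repeating the call with different class subsets and seeds would give many datasets of identical size and balance, each of which could then be passed to several classifier types for comparison.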

Main Results:

Classifier performance was strongly correlated across different architectures (R² = 0.79).
This indicates that dataset quality is an intrinsic property, independent of dataset size, class balance, and model architecture.
Dataset quality was found to be an emergent property of the quality of constituent classes.
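
A cross-architecture correlation of this kind can be computed as sketched below. The sketch assumes that R² here means the squared Pearson correlation between the per-dataset performance of two classifier types; the summary does not say how the reported value was calculated. The function name `r_squared` is hypothetical, and the accuracy lists are made-up numbers for illustration, not results from the study.

```python
def r_squared(xs, ys):
    """Squared Pearson correlation between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")

    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)

    # Sums of centered products and squares (unnormalized covariance/variance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)

    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov * cov / (var_x * var_y)


# Made-up accuracies for five datasets, NOT values from the study.
# Index i in both lists refers to the same dataset.
acc_random_forest = [0.62, 0.71, 0.80, 0.55, 0.90]
acc_svm = [0.60, 0.75, 0.78, 0.58, 0.88]
score = r_squared(acc_random_forest, acc_svm)
```

A value near 1 means that datasets that are easy for one architecture tend to be easy for the other, which is the pattern the summary interprets as quality being a property of the dataset.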

Conclusions:

Dataset quality is an independent correlate of machine-learning classifier performance.
Quality joins dataset size, class balance, and model architecture as a key optimization target.
Focusing on intrinsic dataset and class quality can improve machine-learning model optimization.