Related Experiment Video

Updated: Jul 11, 2025

Selecting Multiple Biomarker Subsets with Similarly Effective Binary Classification Performances

Published on: October 11, 2018

Exploiting redundancy in large materials datasets for efficient machine learning with less data.

Kangming Li¹, Daniel Persaud¹, Kamal Choudhary²

¹Department of Materials Science and Engineering, University of Toronto, 27 King's College Cir, Toronto, ON, Canada.

Nature Communications | November 10, 2023

Summary

This summary is machine-generated.

Redundant data, which can account for up to 95% of a large materials dataset, can be removed from training without harming in-distribution machine learning predictions. Prioritizing data informativeness over volume improves model performance and training efficiency.

More Related Videos

Databases to Efficiently Manage Medium Sized, Low Velocity, Multidimensional Data in Tissue Engineering

Published on: November 22, 2019

A Psychophysics Paradigm for the Collection and Analysis of Similarity Judgments

Published on: March 1, 2022

Area of Science:

Materials Science
Data Science
Machine Learning

Background:

Large-scale materials data collection often ignores data redundancy.
Existing datasets may contain a significant proportion of non-informative or repetitive data points.

Purpose of the Study:

To quantify data redundancy in materials datasets.
To investigate the impact of data redundancy on machine learning model performance.
To explore alternative data acquisition strategies for efficient machine learning training.

Main Methods:

Analysis of multiple large materials datasets for various properties.
Evaluation of machine learning model performance with varying data subsets.
Application of uncertainty-based active learning algorithms for dataset construction.
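
The uncertainty-based active learning listed above can be illustrated with a minimal sketch. This is not the authors' implementation: the ridge model, the bootstrap ensemble used as an uncertainty proxy, the synthetic data, and all names (`fit_ridge`, `ensemble_std`, `active_learning`) are assumptions chosen for illustration only.

```python
import numpy as np


def fit_ridge(X, y, alpha=1e-3):
    """Closed-form ridge regression; returns the weight vector."""
    A = X.T @ X + alpha * np.eye(X.shape[1])
    return np.linalg.solve(A, X.T @ y)


def ensemble_std(X_train, y_train, X_pool, n_models=20, rng=None):
    """Uncertainty proxy: spread of predictions across bootstrap-trained models."""
    rng = np.random.default_rng() if rng is None else rng
    preds = []
    for _ in range(n_models):
        # Resample the labeled set with replacement and refit.
        idx = rng.integers(0, len(X_train), size=len(X_train))
        w = fit_ridge(X_train[idx], y_train[idx])
        preds.append(X_pool @ w)
    return np.std(preds, axis=0)


def active_learning(X, y, n_init=10, batch=10, n_rounds=5, seed=0):
    """Grow a training set by repeatedly adding the most uncertain pool points."""
    rng = np.random.default_rng(seed)
    selected = rng.choice(len(X), size=n_init, replace=False).tolist()
    for _ in range(n_rounds):
        pool = np.setdiff1d(np.arange(len(X)), selected)
        std = ensemble_std(X[selected], y[selected], X[pool], rng=rng)
        # Take the `batch` pool points with the largest predictive spread.
        top = pool[np.argsort(std)[-batch:]]
        selected.extend(top.tolist())
    return np.array(selected)


# Hypothetical toy data: cubic polynomial features of a 1-D input.
data_rng = np.random.default_rng(42)
x = data_rng.uniform(-1.0, 1.0, size=200)
X = np.column_stack([np.ones_like(x), x, x**2, x**3])
y = np.sin(3.0 * x) + data_rng.normal(0.0, 0.05, size=x.size)

chosen = active_learning(X, y)
print(f"selected {chosen.size} of {len(X)} points")
```

In this sketch the labeled set grows from 10 to 60 points while the remaining pool is never used for training; a real study would replace the ridge ensemble with whatever model and uncertainty estimate suit the data.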

Main Results:

Up to 95% of data can be removed from training datasets with minimal impact on in-distribution performance.
Redundant data primarily consists of over-represented material types.
Redundant data does not improve out-of-distribution prediction performance.
Uncertainty-based active learning can create smaller, equally informative datasets.
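
The link between over-representation and redundancy can be illustrated with a small toy experiment. The sketch below is an assumption-laden illustration, not the pruning procedure or the data used in the study: the synthetic linear data, the random subsampling of the over-represented region, and the names `make_data`, `fit_line`, and `rmse` are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)


def make_data(n_common, n_rare):
    """Toy data: a densely sampled region [0, 0.1) and a sparse region [0.1, 1)."""
    x = np.concatenate([
        rng.uniform(0.0, 0.1, n_common),  # over-represented inputs
        rng.uniform(0.1, 1.0, n_rare),    # under-represented inputs
    ])
    y = 2.0 * x + 1.0 + rng.normal(0.0, 0.1, x.size)
    return x, y


def fit_line(x, y):
    """Least-squares fit of y = a*x + b; returns (a, b)."""
    A = np.column_stack([x, np.ones_like(x)])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef


def rmse(coef, x, y):
    """Root-mean-square error of the fitted line on (x, y)."""
    pred = coef[0] * x + coef[1]
    return float(np.sqrt(np.mean((pred - y) ** 2)))


x_train, y_train = make_data(950, 50)
x_test, y_test = make_data(950, 50)  # in-distribution test set

# Keep 50 of the 950 over-represented points plus all 50 rare points,
# which removes 90% of the training data.
keep = np.concatenate([
    rng.choice(950, size=50, replace=False),
    np.arange(950, 1000),
])

rmse_full = rmse(fit_line(x_train, y_train), x_test, y_test)
rmse_pruned = rmse(fit_line(x_train[keep], y_train[keep]), x_test, y_test)
print(f"full: {rmse_full:.4f}  pruned: {rmse_pruned:.4f}")
```

Because the discarded points all come from a region the model already fits well, the in-distribution test error of the pruned model is expected to stay close to that of the full model in this toy setting.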

Conclusions:

The "bigger is better" approach to materials data is inefficient.
Prioritizing data informativeness over sheer volume is crucial for effective machine learning.
Optimized data acquisition and training strategies enhance prediction performance and robustness.