



Exploiting redundancy in large materials datasets for efficient machine learning with less data.

Kangming Li¹, Daniel Persaud¹, Kamal Choudhary²

¹Department of Materials Science and Engineering, University of Toronto, 27 King's College Cir, Toronto, ON, Canada.

Nature communications

November 10, 2023

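Summary

This summary is machine-generated.

Redundant materials data, often as much as 95% of a dataset, can be removed without harming machine learning predictions. Focusing on information richness rather than sheer quantity can improve model performance and training efficiency.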


Scientific fields:

Materials science
Data science
Machine learning

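Background:

Large-scale collection of materials data often overlooks data redundancy.
Existing datasets may contain a large number of uninformative or repetitive data points.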

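Purpose of the study:

Quantify data redundancy in materials datasets.
Investigate how data redundancy affects machine learning model performance.
Explore alternative data acquisition strategies for efficient machine learning training.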

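Main methods:

Analysis of multiple large materials datasets covering a variety of properties.
Evaluation of machine learning model performance on different data subsets.
Application of uncertainty-based active learning algorithms to construct datasets.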
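
The subset evaluation listed above can be illustrated with a minimal sketch. It is not the authors' pipeline: the synthetic one-dimensional data, the toy `target` property, the k-nearest-neighbor model and every function name are assumptions chosen only to show the idea of training on progressively smaller fractions of the training set and scoring each model on the same held-out test set.

```python
import random


def knn_predict(train, x, k=3):
    """Predict y at x as the mean target of the k nearest training points."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return sum(y for _, y in nearest) / len(nearest)


def mean_abs_error(train, test, k=3):
    """Mean absolute error of the k-NN model on a held-out test set."""
    return sum(abs(knn_predict(train, x, k) - y) for x, y in test) / len(test)


def learning_curve(train, test, fractions, seed=0):
    """Score models trained on nested random subsets of the training data."""
    rng = random.Random(seed)
    shuffled = train[:]
    rng.shuffle(shuffled)
    curve = {}
    for fraction in fractions:
        n_keep = max(1, int(round(fraction * len(shuffled))))
        # Each subset is a prefix of the same shuffle, so subsets are nested.
        curve[fraction] = mean_abs_error(shuffled[:n_keep], test)
    return curve


def target(x):
    """Hypothetical stand-in for a computed material property."""
    return x * x


rng = random.Random(42)
train = [(x, target(x)) for x in (rng.uniform(0, 1) for _ in range(2000))]
test = [(x, target(x)) for x in (rng.uniform(0, 1) for _ in range(200))]

curve = learning_curve(train, test, [0.05, 0.25, 1.0])
for fraction, error in curve.items():
    print(f"{fraction:.0%} of training data -> MAE {error:.4f}")
```

Comparing the printed errors across fractions shows how much accuracy is lost when data is removed. This sketch prunes at random; an informed pruning scheme would instead choose which points to keep.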

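Main results:

Up to 95% of the data can be removed from the training datasets with minimal impact on in-distribution performance.
Redundant data consists mainly of overrepresented material types.
Redundant data does not improve out-of-distribution prediction performance.
Uncertainty-based active learning can create smaller, equally informative datasets.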
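
As a rough illustration of uncertainty-based active learning, the sketch below grows a small labeled set from a deliberately redundant pool. It is an assumption-laden toy, not the algorithm used in the paper: uncertainty is estimated as the spread of predictions from a committee of nearest-neighbor models fit on bootstrap resamples, and `oracle`, `pool` and all function names are hypothetical.

```python
import random
import statistics


def nearest_target(labeled, x):
    """1-nearest-neighbor prediction: target of the closest labeled point."""
    return min(labeled, key=lambda p: abs(p[0] - x))[1]


def committee_uncertainty(labeled, x, rng, n_members=10):
    """Spread of predictions across models fit on bootstrap resamples."""
    predictions = []
    for _ in range(n_members):
        resample = [rng.choice(labeled) for _ in labeled]
        predictions.append(nearest_target(resample, x))
    return statistics.pstdev(predictions)


def active_learning(pool, oracle, n_initial=5, n_queries=20, seed=0):
    """Repeatedly label the pool point the committee disagrees on most."""
    rng = random.Random(seed)
    candidates = pool[:]
    rng.shuffle(candidates)
    labeled = [(x, oracle(x)) for x in candidates[:n_initial]]
    unlabeled = candidates[n_initial:]
    for _ in range(n_queries):
        if not unlabeled:
            break
        scores = [committee_uncertainty(labeled, x, rng) for x in unlabeled]
        best = max(range(len(unlabeled)), key=scores.__getitem__)
        x = unlabeled.pop(best)
        labeled.append((x, oracle(x)))  # query the (hypothetical) oracle
    return labeled


def oracle(x):
    """Hypothetical stand-in for an expensive property calculation."""
    return x * x


# Redundant pool: most points crowd into a narrow region of input space.
rng = random.Random(1)
pool = [rng.uniform(0.0, 0.1) for _ in range(270)]
pool += [rng.uniform(0.1, 1.0) for _ in range(30)]

selected = active_learning(pool, oracle)
print(f"labeled {len(selected)} of {len(pool)} pool points")
```

Only the queried points are ever labeled, so the resulting dataset is much smaller than the pool. Whether it is as informative as the full pool would need to be checked against a held-out test set.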

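Conclusions:

A "bigger is better" approach to materials data is inefficient.
Prioritizing data informativeness over sheer data volume is essential for effective machine learning.
Optimized data acquisition and training strategies improve prediction performance and stability.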