Dataset meta-level and statistical features affect machine learning performance.

Shahadat Uddin (1), Haohui Lu (2)

  • (1) School of Project Management, Faculty of Engineering, The University of Sydney, Forest Lodge, NSW, 2037, Australia. shahadat.uddin@sydney.edu.au

Scientific Reports | January 18, 2024
Summary
This summary is machine-generated.

Dataset features significantly affect machine learning (ML) performance. Kurtosis reduces the accuracy of non-tree-based algorithms such as Support Vector Machines (SVM), Logistic Regression (LR), and K-Nearest Neighbors (KNN). When imbalanced datasets are excluded, meta-level and statistical features also influence the accuracy of these algorithms.

Area of Science:

  • Computer Science
  • Machine Learning
  • Data Science

Background:

  • The influence of dataset characteristics on machine learning (ML) algorithm performance remains largely unexplored in existing literature.
  • Understanding these relationships is crucial for selecting optimal ML models and improving predictive accuracy.

Purpose of the Study:

  • To investigate the impact of tabular dataset meta-level and statistical features on the performance of various ML algorithms.
  • To identify which dataset characteristics significantly affect ML model accuracy across different algorithms and implementations.

Main Methods:

  • Analyzed 200 open-access tabular datasets from Kaggle and UCI Machine Learning Repository.
  • Examined meta-level features (dataset size, number of attributes, class ratio) and statistical features (mean, standard deviation, skewness, kurtosis).
  • Developed ML classification models (Support Vector Machine, Logistic Regression, K-Nearest Neighbors, Decision Tree, Random Forest) using both classical and hyperparameter-tuned implementations.
  • Utilized multiple regression models to assess the impact of dataset features on ML performance.
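
The feature-extraction step listed above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function name `dataset_features` is hypothetical, the class ratio is assumed to be minority count over majority count, each statistical feature is assumed to be averaged across attributes to give one value per dataset, and kurtosis is reported as excess kurtosis using population moments. The summary does not state which of these conventions the study used.

```python
import numpy as np

def dataset_features(X, y):
    """Return meta-level and statistical features for one tabular dataset.

    X: 2-D array of numeric attributes (rows are instances).
    y: 1-D array of class labels.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    n_rows, n_cols = X.shape

    # Meta-level features: size, number of attributes, class ratio.
    _, counts = np.unique(y, return_counts=True)
    class_ratio = counts.min() / counts.max()  # assumption: minority / majority

    # Per-attribute moments (population form, ddof=0).
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    safe_std = np.where(std > 0, std, 1.0)  # constant columns yield z = 0
    z = (X - mean) / safe_std
    skewness = (z ** 3).mean(axis=0)
    kurtosis = (z ** 4).mean(axis=0) - 3.0  # excess kurtosis

    # Assumption: one value per dataset, averaged over attributes.
    return {
        "size": n_rows,
        "attributes": n_cols,
        "class_ratio": float(class_ratio),
        "mean": float(mean.mean()),
        "std": float(std.mean()),
        "skewness": float(skewness.mean()),
        "kurtosis": float(kurtosis.mean()),
    }

# Toy input constructed for this example (not data from the study).
X_toy = np.array([[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]])
y_toy = np.array([0, 0, 0, 1])
feats = dataset_features(X_toy, y_toy)
```

Each dataset then contributes one row of features, which can be paired with the accuracy a given classifier achieved on that dataset.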

Main Results:

  • Kurtosis exhibited a significant negative effect on the accuracy of non-tree-based algorithms (SVM, LR, KNN) in their classical implementations.
  • Meta-level and statistical features showed minimal impact on tree-based algorithms (Decision Tree, Random Forest), except in specific hyperparameter-tuned scenarios.
  • When imbalanced datasets were excluded, the meta-level class ratio feature and the statistical mean and standard deviation features significantly affected SVM, LR, and KNN accuracy.
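
Results of this kind come from regressing per-dataset accuracy on the dataset features. The sketch below shows the mechanics with ordinary least squares in NumPy. It is an assumption-laden illustration: `fit_feature_regression` is a hypothetical helper, the numbers are toy values built so that accuracy depends only on kurtosis, and no significance test is performed, whereas the study reports which effects were statistically significant.

```python
import numpy as np

def fit_feature_regression(features, accuracy):
    """Ordinary least squares of accuracy on dataset features.

    features: 2-D array, one row per dataset, one column per feature.
    accuracy: 1-D array, one accuracy score per dataset.
    Returns (intercept, coefficients).
    """
    features = np.asarray(features, dtype=float)
    accuracy = np.asarray(accuracy, dtype=float)
    # Prepend a column of ones so the first coefficient is the intercept.
    design = np.column_stack([np.ones(len(features)), features])
    beta, *_ = np.linalg.lstsq(design, accuracy, rcond=None)
    return beta[0], beta[1:]

# Toy values constructed for this example (not results from the study).
kurtosis = np.array([-1.0, 0.0, 1.0, 2.0, 3.0])
class_ratio = np.array([0.5, 0.9, 0.6, 1.0, 0.7])
accuracy = 0.9 - 0.05 * kurtosis  # exact linear relation by construction

intercept, coefs = fit_feature_regression(
    np.column_stack([kurtosis, class_ratio]), accuracy
)
# coefs[0] is the kurtosis coefficient, coefs[1] the class-ratio coefficient.
```

A negative kurtosis coefficient, as in this constructed example, is the pattern the study reports for the classical SVM, LR, and KNN implementations. Judging whether such a coefficient is significant requires standard errors and p-values, which a statistics package would supply.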

Conclusions:

  • Dataset characteristics, particularly kurtosis and class imbalance, play a critical role in ML algorithm performance.
  • Findings suggest that non-tree-based algorithms are more sensitive to specific statistical properties of datasets.
  • This research opens new avenues for understanding dataset-algorithm interactions, aiding in the selection of appropriate ML models for optimal outcomes.