Dataset meta-level and statistical features affect machine learning performance | JoVE Visualize

Area of Science:

Computer Science
Machine Learning
Data Science

Background:

The influence of dataset characteristics on machine learning (ML) algorithm performance remains largely unexplored in the existing literature.
Understanding these relationships is crucial for selecting optimal ML models and improving predictive accuracy.

Purpose of the Study:

To investigate the impact of tabular dataset meta-level and statistical features on the performance of various ML algorithms.
To identify which dataset characteristics significantly affect ML model accuracy across different algorithms and implementations.

Main Methods:

Analyzed 200 open-access tabular datasets from Kaggle and the UCI Machine Learning Repository.
Examined meta-level features (dataset size, number of attributes, class ratio) and statistical features (mean, standard deviation, skewness, kurtosis).
Developed ML classification models (Support Vector Machine, Logistic Regression, K-Nearest Neighbors, Decision Tree, Random Forest) using both classical and hyperparameter-tuned implementations.
Utilized multiple regression models to assess the impact of dataset features on ML performance.
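
The dataset features listed above can be computed in a few lines. The following is a minimal sketch, not the study's actual pipeline: the function names `column_stats` and `dataset_features` are hypothetical, and several choices are assumptions made for illustration (population moments, excess kurtosis, class ratio taken as minority count over majority count, and per-attribute statistics averaged across attributes to get one value per dataset).

```python
import math
from collections import Counter


def column_stats(values):
    """Return (mean, std, skewness, excess kurtosis) of one numeric column.

    Uses population (divide-by-n) central moments; this is an assumption,
    the summary does not say which estimator the study used.
    """
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    if m2 == 0:
        # Constant column: skewness and kurtosis are undefined, report 0.
        return mean, 0.0, 0.0, 0.0
    return mean, math.sqrt(m2), m3 / m2 ** 1.5, m4 / m2 ** 2 - 3.0


def dataset_features(rows, labels):
    """Summarise a tabular dataset as meta-level and statistical features.

    rows   : list of equal-length lists of numbers (one list per instance)
    labels : list of class labels, one per instance
    """
    n_attrs = len(rows[0])
    counts = Counter(labels)
    per_col = [column_stats([r[j] for r in rows]) for j in range(n_attrs)]
    # Assumption: average each statistic over all attributes.
    avg = [sum(s[k] for s in per_col) / n_attrs for k in range(4)]
    return {
        "size": len(rows),
        "n_attributes": n_attrs,
        # Assumption: ratio of the rarest class to the most common class.
        "class_ratio": min(counts.values()) / max(counts.values()),
        "mean": avg[0],
        "std": avg[1],
        "skewness": avg[2],
        "kurtosis": avg[3],
    }


# Toy data invented for illustration only.
toy_rows = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
toy_labels = [0, 0, 0, 1]
print(dataset_features(toy_rows, toy_labels))
```

Each dataset then becomes one row of features, which is the form needed to relate dataset characteristics to model accuracy in a regression.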

Main Results:

Kurtosis exhibited a significant negative effect on the accuracy of non-tree-based algorithms (SVM, LR, KNN) in their classical implementations.
Meta-level and statistical features showed minimal impact on tree-based algorithms (Decision Tree, Random Forest), except in specific hyperparameter-tuned scenarios.
When imbalanced datasets were excluded, the meta-level ratio and the statistical mean and standard deviation features significantly impacted SVM, LR, and KNN accuracy.
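
A statement such as "kurtosis has a negative effect on accuracy" corresponds to a negative coefficient on kurtosis in a multiple regression of accuracy on dataset features. The sketch below fits ordinary least squares with the standard library only. It is an illustration under stated assumptions: `fit_ols` is a hypothetical helper, the numbers are invented toy values rather than the study's data, and the significance testing (p-values) that the study's conclusions rest on is not shown.

```python
def fit_ols(X, y):
    """Fit y = b0 + b1*x1 + ... + bk*xk by ordinary least squares.

    Solves the normal equations (A^T A) b = A^T y with Gauss-Jordan
    elimination and partial pivoting. Returns [b0, b1, ..., bk].
    """
    A = [[1.0] + [float(v) for v in row] for row in X]  # prepend intercept
    k = len(A[0])
    # Augmented matrix [A^T A | A^T y].
    M = [
        [sum(a[i] * a[j] for a in A) for j in range(k)]
        + [sum(a[i] * yi for a, yi in zip(A, y))]
        for i in range(k)
    ]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("predictors are collinear; cannot fit")
        M[col], M[pivot] = M[pivot], M[col]
        p = M[col][col]
        M[col] = [v / p for v in M[col]]
        for r in range(k):
            if r != col:
                factor = M[r][col]
                M[r] = [vr - factor * vc for vr, vc in zip(M[r], M[col])]
    return [M[i][k] for i in range(k)]


# Invented toy values: one row per dataset, columns = (kurtosis, n_attributes).
toy_features = [[0, 1], [1, 2], [2, 1], [3, 3], [4, 2]]
# Toy accuracies built so that higher kurtosis lowers accuracy.
toy_accuracy = [0.9 - 0.05 * kurt + 0.01 * n for kurt, n in toy_features]

intercept, b_kurtosis, b_attributes = fit_ols(toy_features, toy_accuracy)
print("kurtosis coefficient is negative:", b_kurtosis < 0)
```

In practice one such regression would be fitted per algorithm and implementation, with each dataset contributing one row of features and one accuracy value.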

Conclusions:

Dataset characteristics, particularly kurtosis and class imbalance, play a critical role in ML algorithm performance.
Findings suggest that non-tree-based algorithms are more sensitive to specific statistical properties of datasets.
This research opens new avenues for understanding dataset-algorithm interactions, aiding in the selection of appropriate ML models for optimal outcomes.