A survey on missing data in machine learning.

Tlamelo Emmanuel¹, Thabiso Maupong¹, Dimane Mpoeleng¹

  • ¹Department of Computer Science and Information Systems, Botswana International University of Science and Technology, Palapye, Botswana.

Journal of Big Data | November 1, 2021
PubMed
Summary
This summary is machine-generated.

Handling missing data is crucial for accurate machine learning analysis. This study evaluates k-nearest neighbor and missForest imputation methods, showing they effectively manage missing values in datasets.

Keywords:
Imputation, Machine learning, Missing data

Area of Science:

  • Data Science
  • Machine Learning
  • Statistics

Background:

  • Missing values are a common challenge in data analysis, arising from various factors and potentially biasing results.
  • Ignoring missing data can lead to inaccurate conclusions in machine learning models.
  • Effective imputation techniques are essential for robust data pre-processing.

Purpose of the Study:

  • To review and aggregate literature on machine learning techniques for handling missing data.
  • To provide insights into the performance, limitations, and suitability of different imputation methods.
  • To propose and evaluate two specific imputation methods: k-nearest neighbor and missForest.

Main Methods:

  • Literature review focusing on machine learning-based missing data imputation.
  • Implementation and evaluation of the k-nearest neighbor imputation algorithm.
  • Implementation and evaluation of the missForest algorithm, an iterative imputation method based on random forests.
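
The k-nearest neighbor idea named above can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function name `knn_impute`, the use of `None` for a missing cell, the choice of `k`, and the distance rescaling for partially observed rows are all assumptions made for illustration. Each missing cell is filled with the mean of that feature over the k closest rows that observe it.

```python
import math


def knn_impute(rows, k=3):
    """Fill None cells with the mean of the feature over the k nearest donor rows.

    Hypothetical helper for illustration; assumes numeric features and
    rows of equal length.
    """

    def distance(a, b):
        # Euclidean distance over features observed in both rows, rescaled
        # by the share of features that could be compared (an assumption
        # of this sketch, so sparse rows are not unfairly favoured).
        shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
        if not shared:
            return math.inf
        squared = sum((x - y) ** 2 for x, y in shared)
        return math.sqrt(squared * len(a) / len(shared))

    imputed = [list(row) for row in rows]  # leave the input untouched
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Donors are other rows in which feature j is observed.
            donors = [
                (distance(row, other), other[j])
                for m, other in enumerate(rows)
                if m != i and other[j] is not None
            ]
            donors = [d for d in donors if d[0] != math.inf]
            donors.sort(key=lambda d: d[0])
            nearest = donors[:k]
            if nearest:  # if no comparable donor exists, the cell stays None
                imputed[i][j] = sum(v for _, v in nearest) / len(nearest)
    return imputed


if __name__ == "__main__":
    data = [[1.0, 2.0], [1.1, None], [5.0, 6.0], [0.9, 1.8]]
    print(knn_impute(data, k=2))
```

Distances here are computed on raw values, so features on very different scales would normally be standardized first. missForest differs in that it repeatedly fits a random forest per feature and re-predicts the missing cells until the estimates stabilize; that is not shown here.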

Main Results:

  • Both k-nearest neighbor and missForest demonstrated successful handling of missing values in the evaluated datasets.
  • Performance was assessed on the Iris dataset and a novel power plant fan dataset, each with missing values induced at rates of 5–20%.
  • The study highlights the practical applicability of these machine learning imputation techniques.
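
An evaluation with induced missingness can be sketched roughly as follows. This is an illustrative outline under stated assumptions, not the study's protocol: cells are masked completely at random, the error metric is RMSE over the masked cells, the data are synthetic rather than the Iris or power plant fan data, and a column-mean baseline stands in for the imputer. The helper names `induce_missingness`, `rmse_on_masked`, and `mean_impute` are hypothetical.

```python
import math
import random


def induce_missingness(rows, rate, seed=0):
    """Return a copy of rows with a fraction `rate` of cells set to None,
    plus the list of masked (row, column) positions.

    Assumption: cells are masked completely at random.
    """
    rng = random.Random(seed)
    cells = [(i, j) for i, row in enumerate(rows) for j in range(len(row))]
    masked = rng.sample(cells, round(rate * len(cells)))
    damaged = [list(row) for row in rows]
    for i, j in masked:
        damaged[i][j] = None
    return damaged, masked


def rmse_on_masked(truth, imputed, masked):
    """Root mean squared error over the cells that were masked."""
    if not masked:
        return 0.0
    squared = sum((truth[i][j] - imputed[i][j]) ** 2 for i, j in masked)
    return math.sqrt(squared / len(masked))


def mean_impute(rows):
    """Baseline imputer: replace None with the column mean.

    Falls back to 0.0 for a column with no observed values (sketch-only choice).
    """
    means = []
    for j in range(len(rows[0])):
        observed = [r[j] for r in rows if r[j] is not None]
        means.append(sum(observed) / len(observed) if observed else 0.0)
    return [[means[j] if v is None else v for j, v in enumerate(r)] for r in rows]


if __name__ == "__main__":
    rng = random.Random(1)
    # Synthetic stand-in data: 50 rows, 4 numeric features.
    truth = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(50)]
    for rate in (0.05, 0.10, 0.20):
        damaged, masked = induce_missingness(truth, rate, seed=42)
        score = rmse_on_masked(truth, mean_impute(damaged), masked)
        print(f"rate={rate:.2f} masked={len(masked)} rmse={score:.3f}")
```

Swapping `mean_impute` for any other imputer with the same input and output shape lets methods be compared on identical masks at each missingness rate.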

Conclusions:

  • Machine learning imputation methods, specifically k-nearest neighbor and missForest, are effective in addressing missing data challenges.
  • These methods offer viable solutions to prevent biased analysis caused by missing values.
  • Further research directions in missing data imputation using machine learning are suggested.