A survey on missing data in machine learning.

Tlamelo Emmanuel¹, Thabiso Maupong¹, Dimane Mpoeleng¹

¹Department of Computer Science and Information Systems, Botswana International University of Science and Technology, Palapye, Botswana.

Journal of Big Data

November 1, 2021

Summary

This summary is machine-generated.

Handling missing data is crucial for accurate machine learning analysis. This study reviews machine learning techniques for imputing missing values and evaluates two of them, k-nearest neighbor and missForest, on datasets with induced missingness.

Keywords:

Imputation, Machine learning, Missing data

Area of Science:

Data Science
Machine Learning
Statistics

Background:

Missing values are a common challenge in data analysis, arising from various factors and potentially biasing results.
Ignoring missing data can lead to inaccurate conclusions in machine learning models.
Effective imputation techniques are essential for robust data pre-processing.

Purpose of the Study:

To review and aggregate literature on machine learning techniques for handling missing data.
To provide insights into the performance, limitations, and suitability of different imputation methods.
To propose and evaluate two specific imputation methods: k-nearest neighbor and missForest.

Main Methods:

Literature review focusing on machine learning-based missing data imputation.
Implementation and evaluation of the k-nearest neighbor imputation algorithm.
Implementation and evaluation of the missForest algorithm, an iterative imputation method based on random forests.
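
As a rough illustration of the k-nearest neighbor idea named above, the following minimal sketch fills each missing cell with the mean of that feature over the k closest rows. It is not the study's implementation: the function name `knn_impute`, the rescaled Euclidean distance over shared features, and the toy data are all assumptions made for this example. missForest, which relies on random forests, is not covered by this sketch.

```python
import math

def knn_impute(rows, k=3):
    """Fill None entries with the mean of the k nearest rows (hypothetical helper).

    Distance is Euclidean over the features both rows have observed,
    rescaled by the share of features compared (an assumption of this sketch).
    """
    n_features = len(rows[0])
    imputed = [list(r) for r in rows]  # work on a copy; the input is not modified
    for i, row in enumerate(rows):
        for j in range(n_features):
            if row[j] is not None:
                continue
            candidates = []
            for m, other in enumerate(rows):
                # A neighbour is only useful if it has feature j observed.
                if m == i or other[j] is None:
                    continue
                shared = [(a, b) for a, b in zip(row, other)
                          if a is not None and b is not None]
                if not shared:
                    continue
                sq = sum((a - b) ** 2 for a, b in shared)
                dist = math.sqrt(sq * n_features / len(shared))
                candidates.append((dist, other[j]))
            if not candidates:
                continue  # no usable neighbour; leave the gap in place
            candidates.sort(key=lambda t: t[0])
            nearest = candidates[:k]
            imputed[i][j] = sum(v for _, v in nearest) / len(nearest)
    return imputed

# Toy data (made up for illustration); None marks a missing value.
data = [
    [5.1, 3.5, 1.4],
    [4.9, 3.0, 1.4],
    [5.0, None, 1.5],
    [6.7, 3.1, 4.7],
    [6.3, 3.3, None],
    [6.5, 3.0, 5.2],
]
filled = knn_impute(data, k=2)
```

Distances are computed from the original rows rather than from already-imputed values, so the order in which cells are filled does not affect the result.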

Main Results:

Both k-nearest neighbor and missForest demonstrated successful handling of missing values in the evaluated datasets.
Performance was assessed on the Iris dataset and a novel power plant fan dataset with induced missingness (5-20%).
The study highlights the practical applicability of these machine learning imputation techniques.
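
An evaluation with induced missingness can be sketched as follows: hide a fraction of the known cells, impute them, and compare the imputed values with the hidden truth. This summary does not state the masking mechanism or the error metric used, so the sketch assumes cells are removed completely at random and scores with root-mean-square error. The column-mean imputer is only a stand-in baseline, the data are synthetic, and all function names are hypothetical.

```python
import math
import random

def mask_completely_at_random(rows, rate, rng):
    """Return a copy of rows with about `rate` of the cells set to None."""
    cells = [(i, j) for i in range(len(rows)) for j in range(len(rows[0]))]
    hidden = set(rng.sample(cells, round(rate * len(cells))))
    masked = [[None if (i, j) in hidden else v for j, v in enumerate(r)]
              for i, r in enumerate(rows)]
    return masked, hidden

def mean_impute(rows):
    """Baseline imputer: replace None with the observed column mean."""
    means = []
    for col in zip(*rows):
        seen = [v for v in col if v is not None]
        means.append(sum(seen) / len(seen) if seen else 0.0)
    return [[means[j] if v is None else v for j, v in enumerate(r)]
            for r in rows]

def rmse_on_hidden(truth, imputed, hidden):
    """Root-mean-square error over the cells that were masked."""
    if not hidden:
        return 0.0
    sq = sum((truth[i][j] - imputed[i][j]) ** 2 for i, j in hidden)
    return math.sqrt(sq / len(hidden))

# Synthetic complete data: 50 rows, 4 features (assumed, not from the study).
rng = random.Random(0)
complete = [[rng.gauss(10.0 * (j + 1), 1.0) for j in range(4)]
            for _ in range(50)]

results = {}
for rate in (0.05, 0.10, 0.15, 0.20):
    masked, hidden = mask_completely_at_random(complete, rate, rng)
    results[rate] = rmse_on_hidden(complete, mean_impute(masked), hidden)
```

Swapping `mean_impute` for another imputer with the same input and output shape lets different methods be compared at each missingness rate on identical masks.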

Conclusions:

Machine learning imputation methods, specifically k-nearest neighbor and missForest, are effective in addressing missing data challenges.
These methods offer viable solutions to prevent biased analysis caused by missing values.
Further research directions in missing data imputation using machine learning are suggested.