Benchmarking missing-values approaches for predictive models on health databases.

Alexandre Perez-Lebel¹,²,³, Gaël Varoquaux¹,²,³, Marine Le Morvan²

¹McConnell Brain Imaging Centre, The Neuro (Montreal Neurological Institute-Hospital), Faculty of Medicine, McGill University, 3801 University Street, Montreal, QC H3A 2B4, Canada.

April 15, 2022

Summary

This summary is machine-generated.

Machine learning models can effectively handle missing values in large health datasets. Native support for missing values in models offers robust, fast, and accurate predictions, outperforming imputation methods.

Keywords:

bagging; benchmark; imputation; machine learning; missing values; multiple imputation; supervised learning

Area of Science:

Machine Learning
Data Science
Health Informatics

Background:

Large databases, common in health informatics, often contain missing values, complicating data management and analysis.
Existing research on handling missing values primarily focuses on inferential statistics, not predictive modeling.
Machine learning models, particularly discriminative approaches, offer new strategies for addressing missing data in large datasets.

Purpose of the Study:

To systematically benchmark missing-value strategies for predictive modeling using large health databases.
To compare the performance of native handling of missing values versus imputation methods in machine learning.
To evaluate prediction accuracy and computational efficiency of different missing-value strategies.

Main Methods:

Conducted a benchmark study on six large health datasets (electronic health records, brain imaging, surveys).
Utilized gradient-boosted trees to compare native missing-value handling against simple and advanced imputation techniques.
Assessed prediction accuracy and computational time for each strategy.
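
The accuracy-and-time assessment described above can be sketched as a small harness. This is a minimal illustration, not the study's benchmark code: the `benchmark` function, the `majority_class` placeholder strategy, and the toy data are all hypothetical names and values introduced here, and a real comparison would plug in gradient-boosted trees and imputers in place of the placeholder.

```python
import time


def benchmark(strategies, X_train, y_train, X_test, y_test):
    """Return {name: (accuracy, seconds)} for each strategy.

    Each strategy is a callable (X_train, y_train, X_test) -> list of labels.
    The timer covers fitting and prediction together.
    """
    results = {}
    for name, fit_predict in strategies.items():
        start = time.perf_counter()
        predictions = fit_predict(X_train, y_train, X_test)
        elapsed = time.perf_counter() - start
        correct = sum(p == t for p, t in zip(predictions, y_test))
        results[name] = (correct / len(y_test), elapsed)
    return results


def majority_class(X_train, y_train, X_test):
    """Placeholder strategy (hypothetical): predict the most frequent label."""
    most_common = max(set(y_train), key=y_train.count)
    return [most_common] * len(X_test)


# Toy data (made up for illustration); None marks a missing value.
results = benchmark(
    {"majority_class": majority_class},
    X_train=[[1.0], [None], [3.0]],
    y_train=[1, 1, 0],
    X_test=[[2.0], [None]],
    y_test=[1, 0],
)
```

Each entry of `results` pairs an accuracy with a wall-clock time, which is the kind of two-axis comparison the summary reports for native handling versus imputation.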

Main Results:

Native handling of missing values within gradient-boosted trees demonstrated robust, fast, and accurate predictive performance.
Imputation methods, while potentially improving prediction, incurred significantly longer computational times on large datasets.
Adding indicator columns that flag imputed values was crucial, suggesting that the data are missing not at random (MNAR): the missingness itself carries predictive information.
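
To make the indicator-column idea concrete, the following is a minimal sketch of mean imputation that also appends a 0/1 flag for every column containing missing values. The function name `impute_mean_with_indicator` and the toy rows are assumptions made for illustration; the study's own imputation pipelines are not reproduced here.

```python
import math


def impute_mean_with_indicator(rows):
    """Mean-impute each column and append one 0/1 flag per column with gaps.

    `rows` is a list of equal-length lists; missing entries are None or NaN.
    """
    def is_missing(value):
        return value is None or (isinstance(value, float) and math.isnan(value))

    n_cols = len(rows[0])
    means, has_missing = [], []
    for j in range(n_cols):
        observed = [r[j] for r in rows if not is_missing(r[j])]
        # Fully missing column: fall back to 0.0 (an arbitrary sketch choice).
        means.append(sum(observed) / len(observed) if observed else 0.0)
        has_missing.append(len(observed) < len(rows))

    augmented = []
    for r in rows:
        filled = [means[j] if is_missing(r[j]) else r[j] for j in range(n_cols)]
        # The flags keep the "was this value imputed?" signal for the learner.
        flags = [1.0 if is_missing(r[j]) else 0.0
                 for j in range(n_cols) if has_missing[j]]
        augmented.append(filled + flags)
    return augmented


# Toy data (made up for illustration).
rows = [[1.0, 10.0], [None, 20.0], [3.0, float("nan")]]
augmented = impute_mean_with_indicator(rows)
# augmented == [[1.0, 10.0, 0.0, 0.0], [2.0, 20.0, 1.0, 0.0], [3.0, 15.0, 0.0, 1.0]]
```

Without the flag columns, the second row would be indistinguishable from a patient whose first measurement really was 2.0, so any information carried by the missingness would be lost to the model.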

Conclusions:

Supervised machine learning models with native support for missing values provide superior prediction accuracy with lower computational cost compared to imputation.
When imputation is employed, adding indicator columns to denote imputed data is essential for optimal performance.
Learning algorithms that incorporate missing values directly, as in the missing-incorporated-in-attribute (MIA) strategy, are efficient and effective for large-scale health data.
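
The idea of handling missing values inside the learner can be illustrated with a single tree split. The sketch below, under stated assumptions, searches one feature for the best threshold while trying both routings of the missing values (all to the left child, or all to the right). The names `gini` and `best_mia_split` and the toy data are hypothetical; the full strategy also considers a split of missing versus observed values, which is omitted here for brevity, and production gradient-boosting libraries implement this far more efficiently.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)


def best_mia_split(x, y):
    """Find the best split on one feature, trying both routings of missing values.

    x: list of floats with None for missing; y: list of 0/1 labels.
    Returns (weighted_impurity, threshold, missing_goes_left), or None when
    fewer than two distinct observed values exist.
    """
    observed = sorted({v for v in x if v is not None})
    # Candidate thresholds: midpoints between consecutive observed values.
    thresholds = [(a + b) / 2.0 for a, b in zip(observed, observed[1:])]
    best = None
    for t in thresholds:
        for missing_left in (True, False):
            left, right = [], []
            for value, label in zip(x, y):
                go_left = missing_left if value is None else value <= t
                (left if go_left else right).append(label)
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if best is None or score < best[0]:
                best = (score, t, missing_left)
    return best


# Toy data (made up): the missing entries share the label of the high values.
x = [1.0, 2.0, None, 8.0, 9.0, None]
y = [0, 0, 1, 1, 1, 1]
score, threshold, missing_left = best_mia_split(x, y)
# score == 0.0, threshold == 5.0, missing_left is False
```

In this toy case the split sends missing values to the right child along with the high values and reaches zero impurity, with no imputation step and no extra pass over the data.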