Self-Training With Quantile Errors for Multivariate Missing Data Imputation for Regression Problems in Electronic

Hansle Gwon¹,², Imjin Ahn¹,², Yunha Kim¹,²

¹Department of Medical Science, Asan Medical Institute of Convergence Science and Technology, Seoul, Republic of Korea.

JMIR Public Health and Surveillance

October 13, 2021

Summary

This summary is machine-generated.

This study introduces a self-training method to address missing data in machine learning, particularly for scarce medical datasets. The novel approach significantly improved imputation accuracy compared to traditional methods.

Keywords:

artificial intelligence, electronic medical records, imputation, self-training

Area of Science:

Machine Learning
Data Science
Medical Informatics

Background:

Missing data is a prevalent challenge in real-world machine learning applications.
Existing imputation methods include statistical approaches (mean, expectation-maximization, MICE) and machine learning techniques (MLP, k-NN, decision trees).

Purpose of the Study:

To impute numeric medical data, including physical and laboratory values.
To develop an effective data imputation strategy using self-training for scarce medical data environments.

Main Methods:

Proposed a progressive self-training method to gradually increase available data for model training.
Employed pseudolabeling: models trained on complete data predict missing values, and valid predictions are incorporated back into the complete dataset.
Iteratively repeated the prediction and incorporation process until a stopping condition was met, evaluating pseudolabel accuracy by its impact on model performance.
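
The loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: a small k-nearest-neighbor regressor stands in for the Random Forest used in the study, the acceptance rule (keep a batch of pseudolabels only if validation mean squared error does not increase) is just one plausible reading of "impact on model performance", and the function names, batch size, and toy data are all hypothetical.

```python
import math

def knn_predict(train, x, k=3):
    """Mean target of the k nearest rows; train is a list of (features, target)."""
    nearest = sorted(train, key=lambda row: math.dist(row[0], x))[:k]
    return sum(target for _, target in nearest) / len(nearest)

def validation_mse(train, val, k=3):
    """Mean squared error of the k-NN model on held-out (features, target) rows."""
    return sum((knn_predict(train, x, k) - y) ** 2 for x, y in val) / len(val)

def self_train_impute(complete, missing, val, k=3, batch=1, max_rounds=100):
    """Impute targets for `missing` feature tuples by progressive pseudolabeling.

    Assumed acceptance rule (hypothetical): a batch of pseudolabels joins the
    training set only if validation MSE does not get worse; the first rejected
    batch is treated as the stopping condition.
    """
    train = list(complete)
    pending = list(missing)
    imputed = {}
    best = validation_mse(train, val, k)
    for _ in range(max_rounds):
        if not pending:
            break
        # Handle the rows closest to the current training data first.
        pending.sort(key=lambda x: min(math.dist(x, f) for f, _ in train))
        batch_x, rest = pending[:batch], pending[batch:]
        candidates = [(x, knn_predict(train, x, k)) for x in batch_x]
        err = validation_mse(train + candidates, val, k)
        if err > best:
            break  # pseudolabels hurt validation error: stop self-training
        train.extend(candidates)   # pseudolabeled rows now count as complete data
        imputed.update(candidates)
        pending, best = rest, err
    # Rows never accepted are filled by the final model without feeding back.
    for x in pending:
        imputed[x] = knn_predict(train, x, k)
    return imputed

# Toy data (made up for illustration): the target is roughly 2 * feature.
complete = [((0.0,), 0.0), ((1.0,), 2.0), ((2.0,), 4.0), ((3.0,), 6.0), ((6.0,), 12.0)]
missing = [(4.0,), (5.0,)]
val = [((1.5,), 3.0), ((4.5,), 9.0)]
imputed = self_train_impute(complete, missing, val)
```

A nonparametric model is used here because accepted pseudolabels actually change later predictions; with an ordinary least-squares fit, points that lie exactly on the fitted line would leave the model unchanged.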

Main Results:

Self-training with Random Forest (RF) demonstrated up to 12% lower mean squared error and 0.1% higher Pearson correlation coefficient compared to pure RF.
Statistical tests (Friedman, Wilcoxon signed-rank) confirmed the significant improvement of self-training over Multiple Imputation by Chained Equations (MICE) and mean imputation (p < .05 and p = 3.05e-5, respectively).
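
For readers unfamiliar with the Wilcoxon signed-rank test, the sketch below computes its exact two-sided p-value for paired error scores. It is an illustration only: the function name and the 16 paired MSE values are invented, and the code assumes no zero differences and no tied absolute differences. As an aside, 2/2^16 ≈ 3.05e-5 is the smallest two-sided exact p-value this test can produce with 16 untied pairs; the summary does not report the number of paired comparisons, so using 16 pairs here is purely an assumption for the example.

```python
def wilcoxon_exact(a, b):
    """Exact two-sided Wilcoxon signed-rank test for paired samples.

    Returns (W+, p). Assumes no zero differences and no tied |differences|.
    """
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    if any(d == 0 for d in diffs) or len({abs(d) for d in diffs}) < n:
        raise ValueError("this sketch does not handle zeros or ties")
    # Rank the absolute differences from 1 (smallest) to n (largest).
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0] * n
    for rank, i in enumerate(order, start=1):
        ranks[i] = rank
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    total = n * (n + 1) // 2
    w = min(w_plus, total - w_plus)
    # counts[s] = number of subsets of ranks {1..n} whose sum is s, i.e. the
    # null distribution of W+ when every sign is equally likely.
    counts = [1] + [0] * total
    for r in range(1, n + 1):
        for s in range(total, r - 1, -1):
            counts[s] += counts[s - r]
    p = min(1.0, 2 * sum(counts[: w + 1]) / 2 ** n)
    return w_plus, p

# Hypothetical per-dataset MSE values: the second method is better on all 16.
mse_baseline = [0.20 + 0.01 * i for i in range(16)]
mse_self_training = [m - 0.001 * (i + 1) for i, m in enumerate(mse_baseline)]
w_plus, p = wilcoxon_exact(mse_self_training, mse_baseline)
```

Because every difference in the toy data has the same sign, W+ is zero and only one of the 2^16 equally likely sign patterns is as extreme in each tail.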

Conclusions:

Self-training shows statistically significant improvements in imputing missing values, particularly for medical datasets.
Further validation in real-world machine learning systems and refinement of pseudolabel evaluation methods are warranted for future research.