Clustering-Informed Shared-Structure Variational Autoencoder for Missing Data Imputation in Large-Scale Healthcare Data
Summary
This summary is machine-generated. We introduce a new method, the clustering-informed shared-structure variational autoencoder (CISS-VAE), to accurately impute missing data in electronic health records (EHR). The technique improves healthcare analytics by handling complex data relationships and various missing data mechanisms, including missing not at random (MNAR).
Area Of Science
- Health Informatics
- Machine Learning
- Biostatistics
Background
- Missing data in electronic health records (EHR) and patient-reported outcomes hinders healthcare analytics.
- Conventional imputation methods fail to capture complex nonlinear relationships and various missing data mechanisms, including missing not at random (MNAR).
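The missing data mechanisms mentioned above are usually divided into missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR). The following is a minimal simulation sketch of the three, not code from the study: the function names, the logistic form of the missingness probability, and the slope of 2.0 are all illustrative assumptions.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def mask_values(values, covariate, mechanism, rate=0.3, seed=0):
    """Return a copy of `values` with some entries replaced by None.

    MCAR: every entry is dropped with the same probability `rate`.
    MAR:  the drop probability depends only on the always-observed covariate.
    MNAR: the drop probability depends on the value that goes missing.
    """
    rng = random.Random(seed)
    masked = []
    for v, c in zip(values, covariate):
        if mechanism == "MCAR":
            p = rate
        elif mechanism == "MAR":
            p = sigmoid(2.0 * c)  # assumed logistic dependence on the covariate
        elif mechanism == "MNAR":
            p = sigmoid(2.0 * v)  # assumed logistic dependence on the value itself
        else:
            raise ValueError(f"unknown mechanism: {mechanism}")
        masked.append(None if rng.random() < p else v)
    return masked


def observed_mean(masked):
    kept = [v for v in masked if v is not None]
    return sum(kept) / len(kept)


# Toy data: the value is drawn independently of the covariate.
rng = random.Random(42)
covariate = [rng.gauss(0, 1) for _ in range(2000)]
values = [rng.gauss(0, 1) for _ in range(2000)]
true_mean = sum(values) / len(values)

# Bias of the observed-case mean relative to the full-data mean.
bias = {
    mech: observed_mean(mask_values(values, covariate, mech)) - true_mean
    for mech in ("MCAR", "MAR", "MNAR")
}
for mech, b in bias.items():
    print(f"{mech}: bias of observed mean = {b:+.3f}")
```

In this toy setup the MNAR mask preferentially hides large values, so the mean of the observed entries falls well below the full-data mean, while the MCAR and MAR masks leave it roughly unchanged. The MAR case is unbiased here only because the value was generated independently of the covariate; with correlated data, a MAR mask would also bias the naive mean, but the bias could be corrected using the observed covariate.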
Purpose Of The Study
- To develop an advanced imputation method that effectively addresses the challenges of missing data in healthcare analytics.
- To improve the accuracy and usability of EHR and patient-reported outcome data for health monitoring and analysis.
Main Methods
- Proposed the clustering-informed shared-structure variational autoencoder (CISS-VAE), a Bayesian neural network model.
- Developed iterative learning algorithms to enhance imputation accuracy and prevent overfitting.
- Validated the model through comprehensive simulations and application to real-world EHR data.
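A common way to validate an imputation method in simulation is to hide a fraction of known entries, impute them, and score the imputed values against the hidden truth. The sketch below illustrates that protocol with two simple stand-in imputers (global column mean and per-group column mean); it is not the CISS-VAE model, and this summary does not describe how CISS-VAE forms or uses its clusters. The group labels, the toy data, and all function names are hypothetical.

```python
import math
import random


def hold_out(rows, frac=0.2, seed=0):
    """Hide a random fraction of entries; return (masked rows, hidden cells)."""
    rng = random.Random(seed)
    masked = [list(r) for r in rows]
    hidden = {}
    for i, row in enumerate(rows):
        for j, v in enumerate(row):
            if rng.random() < frac:
                hidden[(i, j)] = v
                masked[i][j] = None
    return masked, hidden


def column_means(rows, n_cols):
    """Mean of the observed entries in each column (0.0 if none observed)."""
    means = []
    for j in range(n_cols):
        seen = [r[j] for r in rows if r[j] is not None]
        means.append(sum(seen) / len(seen) if seen else 0.0)
    return means


def impute_global_mean(masked):
    means = column_means(masked, len(masked[0]))
    return [[means[j] if v is None else v for j, v in enumerate(r)]
            for r in masked]


def impute_group_mean(masked, groups):
    """Fill each gap with the column mean of the row's own group."""
    n_cols = len(masked[0])
    by_group = {}
    for r, g in zip(masked, groups):
        by_group.setdefault(g, []).append(r)
    means = {g: column_means(rs, n_cols) for g, rs in by_group.items()}
    return [[means[g][j] if v is None else v for j, v in enumerate(r)]
            for r, g in zip(masked, groups)]


def rmse(imputed, hidden):
    """Root-mean-square error on the held-out cells only."""
    sq = [(imputed[i][j] - v) ** 2 for (i, j), v in hidden.items()]
    return math.sqrt(sum(sq) / len(sq))


# Toy data: two hypothetical patient groups with different feature means.
rng = random.Random(1)
groups = [i % 2 for i in range(400)]
rows = [[rng.gauss(5.0 * g, 1.0) for _ in range(3)] for g in groups]

masked, hidden = hold_out(rows)
rmse_global = rmse(impute_global_mean(masked), hidden)
rmse_group = rmse(impute_group_mean(masked, groups), hidden)
print(f"global mean RMSE: {rmse_global:.2f}")
print(f"group mean RMSE:  {rmse_group:.2f}")
```

Because the two toy groups have different means, the group-aware imputer scores a lower held-out error than the global one. The same harness could score any imputer that returns a filled-in copy of the masked rows.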
Main Results
- The CISS-VAE model demonstrated superior accuracy compared to traditional and contemporary imputation methods in simulations.
- The model effectively captures complex associations and accommodates various missing data mechanisms, including MNAR.
- The model was successfully applied to EHR data from early-stage breast cancer patients.
Conclusions
- The CISS-VAE model offers a powerful solution for mitigating the impact of missing data in healthcare analytics.
- This approach enhances the reliability of health monitoring and analyses using EHR data.
- The proposed method advances the field of health informatics by improving data imputation techniques.

