Preserving Missing Data Distribution in Synthetic Data.

Xinyue Wang¹, Hafiz Asif¹, Jaideep Vaidya¹

¹Rutgers University, Newark, USA.

Proceedings of the ... International World-Wide Web Conference. International WWW Conference

January 28, 2025

Summary

This summary is machine-generated.

This study introduces novel methods for generating synthetic data that retain the informational value of missing data points. This approach enhances privacy-preserving data analysis by preserving crucial missing data distributions.

Keywords:

GAN, Missing Data, Privacy, Synthetic Data Generation

Area of Science:

Computer Science
Data Science
Statistics

Background:

Web data is often sensitive and requires privacy-preserving methods for analysis.
Synthetic data generation is a key technique for protecting sensitive information.
Missing data in web artifacts contains valuable information often lost during traditional data preprocessing.

Purpose of the Study:

To develop and evaluate methods for generating synthetic data that preserve both observable and missing data distributions.
To address the loss of information inherent in imputation or deletion of missing data before synthetic data generation.
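
To make the information loss concrete, the following is a minimal sketch on invented toy data, not the authors' method or data. The records, the field names (income, defaulted), and the helper `default_rate` are all hypothetical. It shows a case where the fact that a value is missing is itself related to the outcome, and how mean imputation or listwise deletion discards that signal before any synthetic data would be generated.

```python
from statistics import mean

# Hypothetical toy records: (income, defaulted). None marks a missing income.
records = [
    (52.0, 0), (48.0, 0), (61.0, 0), (55.0, 0),
    (None, 1), (None, 1), (None, 0), (47.0, 1),
]

def default_rate(rows):
    """Share of rows whose label is 1."""
    return sum(label for _, label in rows) / len(rows)

missing = [r for r in records if r[0] is None]
observed = [r for r in records if r[0] is not None]

# In this toy data, missingness is informative about the label.
rate_when_missing = default_rate(missing)
rate_when_observed = default_rate(observed)

# Mean imputation: every gap is filled, so the missingness indicator is gone.
fill = mean(v for v, _ in observed)
imputed = [(fill if v is None else v, label) for v, label in records]
missing_after_imputation = sum(v is None for v, _ in imputed)

# Listwise deletion: the rows that carried the signal are dropped entirely.
deleted = observed

print(rate_when_missing, rate_when_observed)
print(missing_after_imputation, len(deleted))
```

A generator trained on either `imputed` or `deleted` never sees which incomes were missing, so it has no way to reproduce that pattern in its output.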

Main Methods:

Proposed novel methods for synthetic data generation.
Focused on preserving the distribution of both observed and missing data.
Conducted extensive empirical evaluations on fabricated and real-world datasets.

Main Results:

Demonstrated the effectiveness of the proposed methods in preserving missing data distributions.
Showcased the ability of synthetic data to retain informational content from missingness.
Empirical evaluations confirmed the utility of the approach across various datasets.
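
One simple way to check whether a synthetic table keeps the missingness structure of the original is to compare how often each missingness pattern occurs in the two tables. The sketch below is an assumption for illustration only: the summary does not say which metrics the study used, and the function names, the toy tables, and the choice of total variation distance are all hypothetical.

```python
from collections import Counter

def missing_patterns(rows):
    """Relative frequency of each missingness pattern.

    A pattern is a tuple of 0/1 flags per column (1 = value is missing).
    """
    counts = Counter(tuple(int(v is None) for v in row) for row in rows)
    total = len(rows)
    return {pattern: c / total for pattern, c in counts.items()}

def total_variation(p, q):
    """Total variation distance between two discrete distributions."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)

# Hypothetical toy tables; None marks a missing cell.
real = [(1.0, None), (2.0, 3.0), (None, 1.0), (4.0, None)]
synthetic_with_gaps = [(0.9, None), (2.2, 2.8), (None, 1.1), (3.7, None)]
synthetic_after_imputation = [(0.9, 1.5), (2.2, 2.8), (2.0, 1.1), (3.7, 1.5)]

real_patterns = missing_patterns(real)
tv_with_gaps = total_variation(real_patterns, missing_patterns(synthetic_with_gaps))
tv_imputed = total_variation(real_patterns, missing_patterns(synthetic_after_imputation))

print(tv_with_gaps)  # 0.0: same pattern frequencies as the real table
print(tv_imputed)    # 0.75: every row collapsed to the fully observed pattern
```

A distance near zero means the synthetic table reproduces the real pattern frequencies. This check looks only at where values are missing, so it would need to be paired with a comparison of the observed values themselves.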

Conclusions:

The proposed methods offer a significant advancement in privacy-preserving synthetic data generation.
Preserving missing data distributions is crucial for maintaining data utility in sensitive web data analysis.
This approach enables more robust and informative data analysis from web artifacts.