Bayesian clustering with uncertain data | JoVE Visualize

Area of Science:

Bioinformatics and Computational Biology
Statistical Learning and Data Mining
Genomics and Systems Biology

Background:

Clustering is widely used in bioinformatics and many other fields, with applications ranging from exploratory analysis to prediction.
Many types of data carry uncertainty or measurement error, but existing clustering methods rarely use this information to inform the clustering.
Immune-mediated diseases (IMD) represent a complex group of disorders requiring sophisticated analytical approaches for subtyping.

Purpose of the Study:

To introduce Dirichlet Process Mixtures with Uncertainty (DPMUnc), a novel Bayesian nonparametric clustering algorithm designed to leverage data uncertainty.
To compare DPMUnc with existing clustering methods on simulated data and to apply it to real-world biological data.
To develop and validate a new procedure for applying gene signatures to datasets other than those in which they were discovered.

Main Methods:

Developed DPMUnc, an extension of Bayesian nonparametric clustering that explicitly incorporates data uncertainty.
Applied DPMUnc to cluster IMD using genome-wide association study (GWAS) summary statistics, accounting for the uncertainty in those statistics, which is linked to the sample size of each study.
Introduced a novel procedure for summarizing gene expression data with gene signatures in datasets other than the discovery dataset; the summary incorporates a measure of the variability in expression across signature genes.
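
The role of per-observation uncertainty in the first two steps can be illustrated with a minimal sketch. It is not the DPMUnc sampler: it assumes one-dimensional Gaussian clusters with fixed, known parameters and a known measurement variance for each observation, and the function names and numeric values are hypothetical. Under these assumptions a latent value z drawn from cluster k and observed with noise gives a marginal variance equal to the cluster variance plus the measurement variance, so noisier points are assigned to clusters less decisively.

```python
import math


def normal_logpdf(x, mean, var):
    """Log density of a univariate normal with the given mean and variance."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)


def cluster_responsibilities(x, obs_var, weights, means, cluster_vars):
    """Posterior probability of each cluster for one noisy observation.

    Assumed model (illustrative only):
        z ~ N(means[k], cluster_vars[k])   latent true value
        x ~ N(z, obs_var)                  observed value with known noise
    so that, marginally, x ~ N(means[k], cluster_vars[k] + obs_var).
    """
    log_terms = [
        math.log(w) + normal_logpdf(x, m, v + obs_var)
        for w, m, v in zip(weights, means, cluster_vars)
    ]
    # Subtract the maximum before exponentiating for numerical stability.
    top = max(log_terms)
    unnorm = [math.exp(t - top) for t in log_terms]
    total = sum(unnorm)
    return [u / total for u in unnorm]


# Hypothetical two-cluster setup with equal weights.
weights = [0.5, 0.5]
means = [0.0, 4.0]
cluster_vars = [1.0, 1.0]

# The same observed value, measured precisely and measured with large noise.
precise = cluster_responsibilities(1.0, 0.01, weights, means, cluster_vars)
noisy = cluster_responsibilities(1.0, 25.0, weights, means, cluster_vars)
```

In this toy setup the precisely measured point is assigned to the first cluster with high probability, while the noisy point with the same observed value is split almost evenly between the two clusters.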

Main Results:

DPMUnc outperformed existing clustering methods on simulated data.
Clustering of IMD using GWAS data with DPMUnc separated autoimmune from autoinflammatory diseases and isolated further subgroups such as adult-onset arthritis.
Clustering of gene expression datasets from IMD patients, summarized using gene signatures, yielded clusters associated with disease, with clustering structure replicated across datasets.
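
The signature summaries used as clustering input can be sketched as follows. The summary states that the procedure includes gene expression variability but does not give the formula, so the choices here are assumptions: the score is the mean of standardized expression over the signature genes, and its uncertainty is the squared standard error of that mean. The function name and gene names are hypothetical.

```python
from statistics import mean, variance


def summarise_signature(expression, signature_genes):
    """Return (score, score_variance) for one individual.

    expression: dict mapping gene name -> standardized expression value.
    signature_genes: genes in the signature; genes missing from the
        dataset are skipped, since the signature was discovered elsewhere.
    """
    values = [expression[g] for g in signature_genes if g in expression]
    if len(values) < 2:
        raise ValueError("need at least two signature genes present")
    score = mean(values)
    # Assumed uncertainty measure: sample variance across signature genes
    # divided by the number of genes (squared standard error of the mean).
    score_variance = variance(values) / len(values)
    return score, score_variance


# Hypothetical individual: G4 is not in the signature, GX is not measured.
expression = {"G1": 1.0, "G2": 2.0, "G3": 3.0, "G4": 10.0}
signature = ["G1", "G2", "G3", "GX"]
score, score_variance = summarise_signature(expression, signature)
```

Each individual is then represented by a score together with a variance, which is the form of input an uncertainty-aware clustering method can use.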

Conclusions:

Data uncertainty should be actively incorporated into clustering algorithms, and DPMUnc provides an effective method for this purpose.
The novel gene signature summarization procedure enables robust analysis of gene expression data across different datasets and disease contexts.
DPMUnc and the gene signature application method offer valuable tools for advancing the understanding and classification of complex diseases such as IMD.