Statistical significance of clustering for count data

  • Department of Biostatistics, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA.

Summary

This summary is machine-generated.

We introduce SigClust-DEV, a novel method for assessing the statistical significance of clusters in count data. In simulations it outperforms existing SigClust approaches, and by accounting for statistical uncertainty it improves subgroup identification in genomics and healthcare data.

Area Of Science

  • Bioinformatics
  • Computational Biology
  • Statistical Genetics

Background

  • Clustering is vital for identifying subgroups in biomedical research, but existing methods often overlook statistical uncertainty, leading to spurious clusters.
  • The Statistical Significance of Clustering (SigClust) method assesses cluster significance in high-dimensional data but is limited to continuous data and can lack power for non-Gaussian distributions.

Purpose Of The Study

  • To develop a novel method, SigClust-DEV, for evaluating the statistical significance of clusters specifically in discrete count data.
  • To address the limitations of existing SigClust methods in handling count data and non-Gaussian distributions.

Main Methods

  • SigClust-DEV was developed to assess cluster significance in high-dimensional count data.
  • Extensive simulations were conducted to compare SigClust-DEV against existing SigClust variations across diverse count distributions.
  • The method was applied to single-cell RNA sequencing (scRNA) data and electronic health records (EHRs) for real-world validation.
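
This summary does not describe SigClust-DEV's test statistic or null model, so the Python sketch below illustrates only the general Monte Carlo idea behind SigClust-style tests, not the authors' method. It assumes the two-means cluster index (within-cluster over total sum of squares) as the statistic and an independent per-feature Poisson null fitted to the pooled data. The function names (`two_means_cluster_index`, `poisson_draw`, `count_cluster_pvalue`) and all parameter defaults are hypothetical.

```python
import math
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_vec(rows):
    """Column-wise mean of a list of equal-length rows."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def two_means_cluster_index(data, rng, n_starts=3, n_iter=20):
    """Smallest within-cluster / total sum of squares over random 2-means starts."""
    center = mean_vec(data)
    total_ss = sum(sq_dist(row, center) for row in data)
    if total_ss == 0:
        return 1.0  # all rows identical: no cluster structure at all
    best = total_ss
    for _ in range(n_starts):
        c0, c1 = rng.sample(data, 2)
        labels = None
        for _ in range(n_iter):
            new_labels = [sq_dist(r, c0) > sq_dist(r, c1) for r in data]
            # Stop on convergence or if one cluster would be empty.
            if new_labels == labels or all(new_labels) or not any(new_labels):
                break
            labels = new_labels
            c0 = mean_vec([r for r, lab in zip(data, labels) if not lab])
            c1 = mean_vec([r for r, lab in zip(data, labels) if lab])
        within = sum(min(sq_dist(r, c0), sq_dist(r, c1)) for r in data)
        best = min(best, within)
    return best / total_ss


def poisson_draw(lam, rng):
    """Knuth's multiplication method; adequate for small means only."""
    if lam <= 0:
        return 0
    limit = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k


def count_cluster_pvalue(data, n_sim=100, seed=0):
    """Monte Carlo p-value under an ASSUMED independent Poisson null."""
    rng = random.Random(seed)
    observed = two_means_cluster_index(data, rng)
    rates = mean_vec(data)  # one pooled Poisson rate per feature
    hits = 0
    for _ in range(n_sim):
        null = [[poisson_draw(lam, rng) for lam in rates] for _ in data]
        # A smaller cluster index means a stronger two-cluster split.
        if two_means_cluster_index(null, rng) <= observed:
            hits += 1
    return observed, (hits + 1) / (n_sim + 1)


if __name__ == "__main__":
    rng = random.Random(1)
    # Two hypothetical groups of 15 samples with clearly different rates.
    low = [[poisson_draw(1.0, rng) for _ in range(6)] for _ in range(15)]
    high = [[poisson_draw(12.0, rng) for _ in range(6)] for _ in range(15)]
    ci, p = count_cluster_pvalue(low + high)
    print(f"cluster index = {ci:.3f}, Monte Carlo p-value = {p:.3f}")
```

A small p-value here says only that the observed split is tighter than splits found in data simulated from a single homogeneous count distribution. A real analysis would need a null model suited to the data, for example one that handles overdispersion.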

Main Results

  • SigClust-DEV demonstrated superior performance compared to existing SigClust approaches in simulations across various count distributions.
  • The method successfully identified meaningful latent cell types in Hydra scRNA data.
  • SigClust-DEV effectively identified significant patient subgroups within cancer EHR data.

Conclusions

  • SigClust-DEV is a powerful and statistically robust method for assessing cluster significance in count data.
  • The method enhances subgroup discovery in complex biomedical datasets like scRNA and EHR data.
  • SigClust-DEV overcomes limitations of previous methods, offering improved statistical power and accuracy for discrete data analysis.

Related Concept Videos

Statistical Significance

Once data is collected from both the experimental and the control groups, a statistical analysis is conducted to find out if there are meaningful differences between the two groups. A statistical analysis determines how likely any difference found is due to chance (and thus not meaningful). In psychology, group differences are considered meaningful, or significant, if the odds that these differences occurred by chance alone are 5 percent or less. Stated another way, if we repeated this...

Cluster Sampling Method

Appropriate sampling methods ensure that samples are drawn without bias and accurately represent the population. Because measuring the entire population in a study is not practical, researchers use samples to represent the population of interest.
To choose a cluster sample, divide the population into clusters (groups) and then randomly select some of the clusters. All the members from these clusters are in the cluster sample. For example, if you randomly sample four departments from your...
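
The two steps described above (randomly select whole clusters, then keep every member of the selected clusters) can be sketched in Python as follows. The function name `cluster_sample` and the department data are hypothetical and serve only as illustration.

```python
import random


def cluster_sample(clusters, n_clusters, seed=None):
    """Pick n_clusters cluster names at random and return all their members."""
    rng = random.Random(seed)
    # Sort the names so that a fixed seed gives a reproducible selection.
    chosen = rng.sample(sorted(clusters), n_clusters)
    members = [m for name in chosen for m in clusters[name]]
    return chosen, members


if __name__ == "__main__":
    # Hypothetical population grouped by department.
    departments = {
        "Biology": ["Ana", "Ben"],
        "Chemistry": ["Cy", "Dee", "Eli"],
        "History": ["Fay"],
        "Math": ["Gus", "Hal"],
        "Physics": ["Ivy", "Jo"],
    }
    chosen, members = cluster_sample(departments, 2, seed=42)
    print(chosen, members)
```

Randomness enters only at the cluster level. Once a cluster is chosen, every one of its members is included in the sample.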

Introduction to Test of Independence

In statistics, the term independence means that one can directly obtain the probability of any event involving both variables by multiplying their individual probabilities. Tests of independence are chi-square tests involving the use of a contingency table of observed (data) values.
The test statistic for a test of independence is similar to that of a goodness-of-fit test:

\[ \chi^2 = \sum \frac{(O - E)^2}{E} \]

where:

O = observed values
E = expected values (which should be at least 5)

A test of independence determines whether...
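
As a minimal sketch of this calculation, the Python function below computes the chi-square statistic, the sum of (O - E)^2 / E over all cells, for a contingency table of observed counts. Each expected count is taken as row total times column total divided by the grand total. The function name `chi_square_independence` and the example table are hypothetical.

```python
def chi_square_independence(observed):
    """Return (statistic, degrees of freedom, expected table) for a count table."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    # Expected count under independence: row total * column total / grand total.
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    stat = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, dof, expected


if __name__ == "__main__":
    # Hypothetical 2 x 2 table of observed counts.
    table = [[20, 30], [30, 20]]
    stat, dof, expected = chi_square_independence(table)
    print(stat, dof, expected)
```

The statistic is then compared with a chi-square distribution with (rows - 1) × (columns - 1) degrees of freedom. The approximation is usually considered reliable only when every expected count is at least 5.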

Test for Homogeneity

The goodness-of-fit test can be used to decide whether a population fits a given distribution, but it will not suffice to decide whether two populations follow the same unknown distribution. A different test, called the test for homogeneity, can be used to conclude whether two populations have the same distribution. To calculate the test statistic for a test for homogeneity, follow the same procedure as with the test of independence. The hypotheses for the test for homogeneity can...

Significance Testing: Overview

Significance testing is a set of statistical methods used to test whether a claim about a parameter is valid. In analytical chemistry, significance testing is used primarily to determine whether the difference between two values comes from determinate or random errors. The effect of a particular change in the measurement protocol, analyst, or sample itself can cause a deviation from the expected result. In the case of a suspected deviation/outlier, we need to be able to confirm mathematically...

Determination of Expected Frequency

Suppose one wants to test independence between the two variables of a contingency table. The values in the table constitute the observed frequencies of the dataset. But how does one determine the expected frequency of the dataset? One of the important assumptions is that the two variables are independent, which means the variables do not influence each other. For independent variables, the statistical probability of any event involving both variables is calculated by multiplying the individual...
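
As a worked illustration of the multiplication rule described above, take a hypothetical table with n = 100 observations in which one row has a total of 40 and one column has a total of 30. The symbols R_i, C_j and E_ij are notation introduced here for the row total, column total and expected cell frequency.

```latex
\[
E_{ij} \;=\; n \cdot \frac{R_i}{n} \cdot \frac{C_j}{n} \;=\; \frac{R_i \, C_j}{n},
\qquad
E_{ij} \;=\; \frac{40 \times 30}{100} \;=\; 12 .
\]
```

Under independence, the probability of falling in that cell is 0.4 × 0.3 = 0.12, so about 12 of the 100 observations are expected there.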