OASIS: An interpretable, finite-sample valid alternative to Pearson's X² for scientific discovery

  • Eric and Wendy Schmidt Center, Broad Institute, Cambridge, MA 02142.

Summary

This summary is machine-generated.

A new statistical test, Optimized Adaptive Statistic for Inferring Structure (OASIS), offers efficient and valid analysis for contingency tables. OASIS excels in genomic data analysis, enabling novel strain detection and outperforming existing methods in simulations.

Area Of Science

  • Statistics
  • Genomics
  • Bioinformatics

Background

  • Contingency tables (matrices of counts) are central to quantitative research, but existing statistical tests are not both computationally efficient and statistically valid for finite samples.
  • Reference-free genomic inference presents challenges requiring robust statistical methods for analyzing count data.

Purpose Of The Study

  • To develop a novel family of statistical tests, Optimized Adaptive Statistic for Inferring Structure (OASIS), for contingency tables.
  • To provide computationally efficient and statistically valid tests suitable for finite observations, particularly in genomic applications.

Main Methods

  • OASIS constructs a test statistic linear in the normalized data matrix, utilizing concentration inequalities for closed-form P-value bounds.
  • The asymptotic distribution of the OASIS test statistic is derived and used to check the finite-sample bounds.
  • OASIS is applied to genomic sequencing data for strain detection and to the analysis of overdispersed data.
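
A minimal sketch of the idea in the bullets above, not the authors' implementation. For a counts table X with column totals n_j and grand total M, a statistic that is linear in the centered, column-normalized table can be written as S = sum_j c_j * sqrt(n_j) * (m_j - m), where f assigns each row a value in [0, 1], c assigns each column a value in [-1, 1], m_j is the mean of f over the observations in column j, and m is the pooled mean. Assuming f and c are fixed before the data are seen, and that under the null the observations in every column are drawn independently from one shared row distribution, Hoeffding's inequality gives the closed-form bound P <= 2 * exp(-2 * S^2 / (||c||^2 - (sum_j c_j * sqrt(n_j))^2 / M)). The function name and the toy tables are illustrative assumptions; the paper's exact statistic and its procedure for choosing f and c may differ.

```python
import math

def linear_stat_pvalue_bound(X, f, c):
    """Return (S, bound) for a counts table X given as a list of rows.

    f: one value in [0, 1] per row; c: one value in [-1, 1] per column.
    Assumes every column total is positive and that f and c were chosen
    independently of X (hypothetical helper, for illustration only).
    """
    n_rows, n_cols = len(X), len(X[0])
    col_totals = [sum(X[i][j] for i in range(n_rows)) for j in range(n_cols)]
    total = sum(col_totals)

    # Mean of f over the observations in each column, and over all columns.
    col_means = [
        sum(f[i] * X[i][j] for i in range(n_rows)) / col_totals[j]
        for j in range(n_cols)
    ]
    pooled_mean = sum(col_means[j] * col_totals[j] for j in range(n_cols)) / total

    # Statistic: linear in the centered, column-normalized table.
    S = sum(
        c[j] * math.sqrt(col_totals[j]) * (col_means[j] - pooled_mean)
        for j in range(n_cols)
    )

    # Hoeffding denominator: ||c||^2 - (sum_j c_j sqrt(n_j))^2 / M.
    weighted = sum(c[j] * math.sqrt(col_totals[j]) for j in range(n_cols))
    denom = sum(cj * cj for cj in c) - weighted ** 2 / total
    if denom <= 1e-12:
        # c is proportional to sqrt(n); S is then identically zero.
        return S, 1.0
    return S, min(1.0, 2.0 * math.exp(-2.0 * S * S / denom))


# Two columns with clearly different row distributions.
S, bound = linear_stat_pvalue_bound([[30, 5], [5, 30]], f=[1, 0], c=[1, -1])
print(S, bound)
```

For the toy table above the bound falls well below 1e-6, whereas a table with identical columns gives S = 0 and a bound of 1. Because f splits the rows and c splits the columns, the pair also indicates which rows and columns drive a rejection.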

Main Results

  • OASIS provides an interpretable decomposition of contingency tables, which helps explain why the null hypothesis was rejected.
  • Experiments demonstrate OASIS's power and interpretability in genomic data, enabling de novo detection of SARS-CoV-2 and Mycobacterium tuberculosis strains.
  • Simulations show OASIS is robust to overdispersion, controls false discovery rate effectively, and outperforms Pearson's chi-squared test in certain scenarios.

Conclusions

  • OASIS offers closed-form, finite-sample valid P-value bounds for contingency tables while remaining computationally efficient.
  • The method facilitates novel applications in genomics, such as identifying microbial strains, which were previously unachievable.
  • In simulations, OASIS is more robust to overdispersion than Pearson's chi-squared test and outperforms it in certain scenarios, which is relevant for complex biological data.

Related Concept Videos

Goodness-of-Fit Test 01:16

The goodness-of-fit test is a type of hypothesis test which determines whether the data "fits" a particular distribution. For example, one may suspect that some anonymous data may fit a binomial distribution. A chi-square test (meaning the distribution for the hypothesis test is chi-square) can be used to determine if there is a fit. The null and alternative hypotheses may be written in sentences or stated as equations or inequalities. The test statistic for a goodness-of-fit test is given as...

Finding Critical Values for Chi-Square 01:18

Consider a curve representing sample data drawn randomly from a normally distributed population. One must construct confidence intervals to estimate or to test a claim regarding the population standard deviation. For example, a 95% confidence interval covers 95% of the area under the curve, and the remaining 5% is split equally between the two tails. To construct such confidence intervals, one must determine the critical values. The critical values are simply the values separating the...

Fisher's Exact Test 01:08

Fisher's exact test is a statistical significance test widely used to analyze 2x2 contingency tables, particularly in situations where sample sizes are small. Unlike the chi-squared test, which approximates P-values and assumes minimum expected frequencies of at least five in each cell, Fisher's exact test calculates the exact probability (P-value) of observing the data or more extreme results under the null hypothesis. This feature makes it especially valuable when the assumptions of...

Chi-square Analysis 02:46

The chi-square test is a statistical hypothesis test. It is used to check whether there is a significant difference between an expected value and an observed value. In the context of genetics, it enables us to either reject or fail to reject a hypothesis, based on how much the observed values deviate from the expected values.
The chi-square test was developed by Pearson in 1900.
The first step of performing a Chi-square analysis is to establish a null hypothesis, which assumes that there is no real...

Test for Homogeneity 01:23

The goodness-of-fit test can be used to decide whether a population fits a given distribution, but it will not suffice to decide whether two populations follow the same unknown distribution. A different test, called the test for homogeneity, can be used to conclude whether two populations have the same distribution. To calculate the test statistic for a test for homogeneity, follow the same procedure as with the test of independence. The hypotheses for the test for homogeneity can...

Expected Frequencies in Goodness-of-Fit Tests 01:19

A goodness-of-fit test is conducted to determine whether the observed frequencies are statistically consistent with the frequencies expected for the dataset. Suppose the expected frequencies for a dataset are all equal, as when predicting how often each number appears when casting a die. In that case, the expected frequency is the ratio of the total number of observations (n) to the number of categories (k).

Hence, the expected frequency of any number appearing when casting a die...
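
As a worked illustration of the equal-expected-frequency case described above, the sketch below computes E = n / k for a die and the standard Pearson goodness-of-fit statistic, the sum over categories of (O - E)^2 / E, with k - 1 degrees of freedom. The function name and the roll counts are made up for illustration.

```python
def equal_frequency_goodness_of_fit(observed):
    """Return (expected, chi2, dof) assuming all k categories are equally likely."""
    n = sum(observed)   # total number of observations
    k = len(observed)   # number of categories
    expected = n / k    # same expected count for every category
    chi2 = sum((o - expected) ** 2 / expected for o in observed)
    return expected, chi2, k - 1


# Hypothetical counts for faces 1..6 over 60 rolls of a die.
expected, chi2, dof = equal_frequency_goodness_of_fit([8, 12, 9, 11, 10, 10])
print(expected, chi2, dof)
```

Here each face has an expected count of 10 and the statistic is 1.0 on 5 degrees of freedom, far below the 0.05 critical value of about 11.07, so these counts give no reason to reject a fair die.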