What can go wrong when observations are not independently and identically distributed: A cautionary note on calculating correlations on combined data sets from different experiments or conditions
Summary
This summary is machine-generated. Merging samples from different experiments violates the assumptions of correlation coefficients, leading to biased results. This technical note reviews Pearson
Area Of Science
- Biostatistics
- Data Analysis
- Scientific Research Methodology
Background
- Data analyses often merge samples from different experiments, conditions, or time series to increase the sample size available for calculating a correlation coefficient.
- This common practice violates two fundamental assumptions of Pearson's correlation coefficient: that all observations are sampled from a single population, and that the observations are independent.
- Violating these assumptions can lead to unreliable and biased scientific findings, particularly when inferring associations between biological entities.
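One way to see why pooling can bias the coefficient is the law of total covariance. The notation below is an assumption introduced here for illustration, not taken from the paper: $X$ and $Y$ are the two measured variables and $G$ labels the experiment or condition each observation came from.

```latex
% Pearson's correlation coefficient for variables X and Y
r_{XY} = \frac{\operatorname{Cov}(X, Y)}{\sqrt{\operatorname{Var}(X)\,\operatorname{Var}(Y)}}

% Law of total covariance, conditioning on the group label G
\operatorname{Cov}(X, Y)
  = \underbrace{\mathbb{E}\bigl[\operatorname{Cov}(X, Y \mid G)\bigr]}_{\text{within-group association}}
  + \underbrace{\operatorname{Cov}\bigl(\mathbb{E}[X \mid G],\, \mathbb{E}[Y \mid G]\bigr)}_{\text{between-group mean differences}}
```

When the groups differ in their means, the second term is nonzero even if $X$ and $Y$ are unrelated within every group, so the pooled coefficient mixes the association of interest with differences between experiments.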
Purpose Of The Study
- To review the fundamental properties of Pearson's correlation coefficient.
- To illustrate the detrimental effects of violating its underlying assumptions using simulated and experimental data.
- To provide a clear, didactic explanation with graphical examples to enhance understanding of correlation analysis pitfalls.
Main Methods
- Review of the theoretical properties of Pearson's correlation coefficient.
- Generation of simulated data to demonstrate assumption violations.
- Analysis of experimental data to show real-world implications.
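The kind of simulation described above can be sketched as follows. This is a minimal illustration under assumed settings, not the authors' actual simulation: the function name `simulate_pooled_correlation`, the group size, and the mean shift of 5 are all hypothetical choices. Two conditions are generated in which the variables are uncorrelated by construction, and Pearson's coefficient is then computed per condition and on the merged data.

```python
import numpy as np


def simulate_pooled_correlation(n_per_group=200, shift=5.0, seed=0):
    """Return (r_a, r_b, r_pooled) for two conditions with no within-condition correlation.

    All parameter values are illustrative assumptions, not values from the paper.
    """
    rng = np.random.default_rng(seed)

    # Condition A: x and y are drawn independently, so their true correlation is 0.
    xa = rng.normal(0.0, 1.0, n_per_group)
    ya = rng.normal(0.0, 1.0, n_per_group)

    # Condition B: also independent, but both means are shifted
    # (a stand-in for a different experiment or treatment).
    xb = rng.normal(shift, 1.0, n_per_group)
    yb = rng.normal(shift, 1.0, n_per_group)

    r_a = np.corrcoef(xa, ya)[0, 1]
    r_b = np.corrcoef(xb, yb)[0, 1]

    # Merging the two conditions before computing the coefficient.
    x_all = np.concatenate([xa, xb])
    y_all = np.concatenate([ya, yb])
    r_pooled = np.corrcoef(x_all, y_all)[0, 1]
    return r_a, r_b, r_pooled


r_a, r_b, r_pooled = simulate_pooled_correlation()
print(f"condition A: r = {r_a:.2f}")
print(f"condition B: r = {r_b:.2f}")
print(f"pooled:      r = {r_pooled:.2f}")
```

Under these assumptions the per-condition coefficients should stay near zero, while the pooled coefficient is large and positive purely because the two conditions have different means.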
Main Results
- Merging non-independent samples significantly biases correlation coefficients.
- Violation of the independence assumption leads to inaccurate estimations of biological associations.
- Graphical examples clearly depict the distortion of results caused by improper data aggregation.
Conclusions
- The merging of samples from different experiments or conditions before calculating correlation coefficients is statistically invalid.
- Adherence to the assumptions of Pearson's correlation coefficient is crucial for reliable scientific results.
- Researchers must exercise caution and employ appropriate statistical methods to avoid biased interpretations of biological associations.
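The summary does not say which alternative methods the authors recommend. As one possible illustration, and only as an assumption on our part, the sketch below removes each group's mean before pooling, so that between-group mean differences cannot contribute to the coefficient. The helper `within_group_correlation` and the synthetic data are hypothetical.

```python
import numpy as np


def within_group_correlation(x, y, groups):
    """Pearson r computed after subtracting each group's mean from x and y.

    Hypothetical helper: it removes between-group mean differences only.
    It does not correct for unequal variances or for dependence among
    observations within a group.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    groups = np.asarray(groups)

    x_centered = np.empty_like(x)
    y_centered = np.empty_like(y)
    for g in np.unique(groups):
        mask = groups == g
        x_centered[mask] = x[mask] - x[mask].mean()
        y_centered[mask] = y[mask] - y[mask].mean()
    return np.corrcoef(x_centered, y_centered)[0, 1]


# Synthetic example: two conditions, no within-condition association,
# but condition "B" has both means shifted by 5 (an assumed value).
rng = np.random.default_rng(1)
n = 200
groups = np.repeat(["A", "B"], n)
offsets = np.where(groups == "A", 0.0, 5.0)
x = rng.normal(0.0, 1.0, 2 * n) + offsets
y = rng.normal(0.0, 1.0, 2 * n) + offsets

naive_r = np.corrcoef(x, y)[0, 1]
centered_r = within_group_correlation(x, y, groups)
print(f"naive pooled r:   {naive_r:.2f}")
print(f"group-centered r: {centered_r:.2f}")
```

Reporting the coefficient separately for each experiment is a simpler alternative when the groups are large enough. Any significance test on the group-centered coefficient would also need to account for the degrees of freedom used in estimating the group means.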

