Application of weighted low rank approximations: outlier detection in a data matrix

  • Departamento de Matemáticas, Pontificia Universidad Javeriana, Bogotá, Colombia. marisolgarcia@javeriana.edu.co.

Summary

This summary is machine-generated.

This study introduces weighted low-rank matrix approximations for effective outlier detection in rectangular datasets. These methods outperform traditional bias-adjusted boxplots at identifying anomalies in numerical data.

Area Of Science

  • Data Science
  • Statistical Analysis
  • Bioinformatics

Background

  • Outlier identification is crucial for exploratory data analysis.
  • The presence of outliers influences subsequent modeling choices.
  • Existing methods may not be optimal for all data structures.

Purpose Of The Study

  • To present novel strategies for outlier identification using weighted matrix approximations.
  • To evaluate the effectiveness of these strategies on diverse real-world datasets.
  • To propose a statistic for evaluating outlier detection performance.

Main Methods

  • Utilized weighted approximations of matrices to identify outliers.
  • Evaluated six detection criteria, including residual-based and jackknife methods.
  • Compared performance against a bias-adjusted boxplot gold standard.
  • Tested on sixteen real datasets with artificial contamination.

Main Results

  • Weighted approximation methods demonstrated superior effectiveness in detecting random outliers compared to bias-adjusted boxplots.
  • All proposed methods are applicable to any numerical dataset in matrix form.
  • The proposed evaluation statistic effectively distinguishes good detection from false positives/negatives.

Conclusions

  • Weighted matrix approximations offer a more effective approach to outlier detection in numerical datasets.
  • These methods are versatile and applicable to complex data, including genotype-by-environment interactions.
  • The study provides a robust framework for evaluating outlier detection techniques.

Related Concept Videos

Quantifying and Rejecting Outliers: The Grubbs Test 01:02

Sometimes, a data set can have a recorded numerical observation that greatly deviates from the rest of the data. Assuming that the data is normally distributed, a statistical method called the Grubbs test can be used to determine whether the observation is truly an outlier. To perform a two-tailed Grubbs test, first, calculate the absolute difference between the outlier and the mean. Then, calculate the ratio between this difference and the standard deviation of the sample. This...
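
The two-step calculation can be sketched in Python; the measurements and the 5% significance level below are illustrative, and SciPy's t distribution supplies the critical value.

```python
import numpy as np
from scipy import stats

def grubbs_statistic(x):
    """Grubbs statistic: largest absolute deviation from the mean,
    divided by the sample standard deviation."""
    x = np.asarray(x, float)
    return np.max(np.abs(x - x.mean())) / x.std(ddof=1)

def grubbs_critical(n, alpha=0.05):
    """Two-tailed critical value derived from the t distribution."""
    t = stats.t.ppf(1 - alpha / (2 * n), n - 2)
    return (n - 1) / np.sqrt(n) * np.sqrt(t**2 / (n - 2 + t**2))

data = [9.8, 10.1, 10.0, 10.2, 9.9, 14.7]  # 14.7 looks suspicious
G = grubbs_statistic(data)
print(G > grubbs_critical(len(data)))       # exceeds the critical value
```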

Outliers and Influential Points 01:08

An outlier is an observation that does not fit the rest of the data; it is sometimes called an extreme value. When you graph an outlier, it will appear not to fit the pattern of the graph. Some outliers are due to mistakes (for example, writing down 50 instead of 500), while others may indicate that something unusual is happening. Outliers lie far from the least squares line in the vertical direction. They have large "errors," where the "error" or residual is the...
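
A minimal NumPy sketch of this idea, with made-up data in which one point has a large vertical residual from the fitted line:

```python
import numpy as np

# Fit a least squares line and inspect the vertical residuals.
x = np.arange(10.0)
noise = np.array([0.1, -0.2, 0.0, 0.2, -0.1, 9.0, 0.1, -0.1, 0.2, 0.0])
y = 3.0 * x + 2.0 + noise          # the point at index 5 is shifted far upward
slope, intercept = np.polyfit(x, y, 1)
residuals = y - (slope * x + intercept)
suspect = int(np.argmax(np.abs(residuals)))
print(suspect)  # -> 5, the point farthest from the line vertically
```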

What Are Outliers? 01:12

Outliers are observed data points that are far from the least squares line. They have unusual values and need to be examined carefully. Though an outlier may result from erroneous data, at other times, it may hold valuable information about the population under study and should be included in the data. Hence, it is crucial to examine what causes a data point to be an outlier.
The z score is used to find outliers or unusual values. It should be noted that any values beyond -2 and +2 are...
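
A short sketch of the z-score rule with illustrative data; the |z| > 2 cutoff follows the description above.

```python
import numpy as np

# Flag observations whose z score lies beyond -2 and +2.
data = np.array([52.0, 48.0, 50.0, 51.0, 49.0, 50.0, 47.0, 90.0])
z = (data - data.mean()) / data.std(ddof=1)
flagged = data[np.abs(z) > 2]
print(flagged)  # -> [90.]
```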

Detection of Gross Error: The Q Test 01:00

When one or more data points appear far from the rest of the data, there is a need to determine whether they are outliers and whether they should be eliminated from the data set to ensure an accurate representation of the measured value. In many cases, outliers arise from gross errors (or human errors) and do not accurately reflect the underlying phenomenon. In some cases, however, these apparent outliers reflect true phenomenological differences. In these cases, we can use statistical methods...
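
A minimal sketch of Dixon's Q test: the gap between the suspect value and its nearest neighbour is divided by the full range of the data. The replicate measurements are a made-up example, and the 95% critical values are commonly tabulated ones; consult a full Dixon table for other sample sizes or confidence levels.

```python
import numpy as np

# Commonly tabulated 95% critical values for Dixon's Q test, n = 3..10.
Q_CRIT_95 = {3: 0.970, 4: 0.829, 5: 0.710, 6: 0.625, 7: 0.568,
             8: 0.526, 9: 0.512, 10: 0.466}

def q_test(data, crit_table=Q_CRIT_95):
    """Return the Q statistic for the more suspect end point and
    whether it exceeds the 95% critical value."""
    x = np.sort(np.asarray(data, float))
    gap = max(x[1] - x[0], x[-1] - x[-2])  # suspect can sit at either end
    Q = gap / (x[-1] - x[0])
    return Q, Q > crit_table[len(x)]

Q, reject = q_test([0.189, 0.167, 0.187, 0.183, 0.186])
print(round(Q, 3), reject)  # 0.167 is rejected at the 95% level
```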

Weighted Mean 00:57

When taking the arithmetic, geometric, or harmonic mean of a sample data set, equal importance is assigned to every data point. In some data sets, however, not all values are equally important, and it may be appropriate to assign greater weight to some values than to others.
For example, consider the number of goals scored in the matches of a tournament. While computing the average number of goals scored in the tournament, it may be more important to...
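
The weighted mean itself is a one-liner with NumPy; the scores and weights below are illustrative.

```python
import numpy as np

# Weighted mean: each value contributes in proportion to its weight.
scores  = np.array([80.0, 90.0, 70.0])  # homework, midterm, final
weights = np.array([0.2, 0.3, 0.5])     # the final counts the most
wm = np.average(scores, weights=weights)
print(wm)  # -> 78.0
```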

Wilcoxon Rank-Sum Test 01:21

The Wilcoxon rank-sum test, also known as the Mann-Whitney U test, is a nonparametric test used to determine if there is a significant difference between the distributions of two independent samples. This test is designed specifically for two independent populations and has the following key requirements:

The samples must be randomly drawn.
The data should be ordinal or capable of being converted to an ordinal scale, allowing the values to be ordered and ranked.

The null hypothesis is that...
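
In practice the test is available as SciPy's mannwhitneyu function; the two samples below are made up for illustration.

```python
from scipy import stats

# Wilcoxon rank-sum (Mann-Whitney U) test on two independent samples.
a = [12.1, 14.3, 11.8, 13.5, 12.9, 15.0]
b = [16.2, 17.1, 15.8, 16.9, 18.0, 15.5]
u, p = stats.mannwhitneyu(a, b, alternative='two-sided')
print(p < 0.05)  # a small p-value suggests the distributions differ
```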