Penalized unsupervised learning with outliers.

Daniela M. Witten

  • Box 357232, Department of Biostatistics, University of Washington, Seattle, WA 98195-7232.

Statistics and Its Interface | July 23, 2013
Summary
This summary is machine-generated.

This study introduces a method for unsupervised learning in the presence of outliers. Each observation is given an "error" term, and a group lasso penalty encourages most of these terms to be exactly zero. The approach yields robust versions of K-means clustering and principal components analysis that detect outliers accurately and perform better when outliers are present.

Keywords:
M-estimation, group lasso, k-means clustering, outliers, principal components analysis, robust, unsupervised learning

Area of Science:

  • Statistics
  • Machine Learning
  • Data Mining

Background:

  • Standard unsupervised learning methods struggle with outliers, leading to distorted results.
  • Outliers, that is, observations drawn from a different distribution than the rest of the data, can significantly degrade model performance.
  • Existing techniques often fail to effectively identify or handle these anomalous data points.

Purpose of the Study:

  • To develop robust unsupervised learning techniques capable of handling data with outliers.
  • To extend existing methods such as K-means clustering and principal components analysis so that they accommodate outliers.
  • To introduce a novel approach utilizing group lasso penalties for outlier detection and data cleaning.

Main Methods:

  • A new approach is proposed, extending outlier detection methods from regression to unsupervised learning.
  • Each observation is assigned an "error" term, and a group lasso penalty on these terms encourages sparsity, so that the errors of most observations are exactly zero.
  • This framework is applied to develop robust versions of K-means clustering and principal components analysis.
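
The idea in the list above can be illustrated with a minimal sketch of a robust K-means variant. This is not the author's implementation. It assumes an objective of the form sum_i ||x_i - e_i - mu_c(i)||^2 + lam * sum_i ||e_i||_2, minimized by alternating ordinary K-means updates on the "cleaned" data X - E with a row-wise group soft-thresholding update of the error matrix E. The function names, the `lam` parameter, its scaling, and the random initialization are all assumptions made for illustration.

```python
import numpy as np


def group_soft_threshold(residuals, threshold):
    """Shrink each row toward zero; rows with norm <= threshold become exactly zero."""
    norms = np.linalg.norm(residuals, axis=1, keepdims=True)
    # Guard against division by zero for rows that are already zero.
    scale = np.maximum(0.0, 1.0 - threshold / np.maximum(norms, 1e-12))
    return residuals * scale


def robust_kmeans(X, k, lam, n_iter=50, seed=0):
    """Illustrative robust K-means with one penalized error vector per observation.

    Hypothetical sketch: alternates K-means steps on X - E with a
    group-lasso (row-wise soft-thresholding) update of E.
    """
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    # Assumed initialization: k distinct observations chosen at random.
    centers = X[rng.choice(n, size=k, replace=False)].copy()
    E = np.zeros_like(X)
    labels = np.zeros(n, dtype=int)

    for _ in range(n_iter):
        cleaned = X - E

        # Assignment step: nearest center for each cleaned observation.
        sq_dist = ((cleaned[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = sq_dist.argmin(axis=1)

        # Centroid step: mean of the cleaned observations in each cluster.
        for j in range(k):
            members = labels == j
            if members.any():
                centers[j] = cleaned[members].mean(axis=0)

        # Error step: minimizing ||r - e||^2 + lam * ||e||_2 over e gives
        # group soft-thresholding of the residual r at lam / 2.
        E = group_soft_threshold(X - centers[labels], lam / 2.0)

    # Observations whose error vector is nonzero are flagged as outliers.
    outliers = np.linalg.norm(E, axis=1) > 0
    return labels, centers, E, outliers
```

In this sketch, a larger `lam` flags fewer observations, and a sufficiently large value leaves E at zero so the procedure reduces to ordinary K-means. Because the penalty acts on whole rows of E, an observation is either left untouched or flagged as a unit, which is what makes the group lasso a natural fit for outlier detection here.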

Main Results:

  • The proposed methods demonstrate accurate outlier detection capabilities.
  • Significant improvements in unsupervised learning performance are achieved in the presence of outliers.
  • The effectiveness is validated through simulation studies and application to gene expression datasets.

Conclusions:

  • The group lasso penalized error approach offers a powerful solution for unsupervised learning with outliers.
  • This method provides accurate outlier identification and enhances the performance of standard algorithms.
  • The approach shows promise for real-world applications, including biological data analysis.