Updated: Aug 20, 2025

Outlier Detection Using t-test in Rasch IRT Equating under NEAT Design.

Chunyan Liu¹, Daniel Jurich¹

  • ¹National Board of Medical Examiners, Philadelphia, PA, USA.

Applied Psychological Measurement | November 25, 2022
PubMed
Summary
This summary is machine-generated.

The t-test method effectively detects outliers in anchor items, improving test equating accuracy and score validity. This study confirms its superior performance over other outlier detection methods in various simulation conditions.

Keywords:
Outliers; Rasch model; t-test; translation constant

Area of Science:

  • Psychometrics
  • Educational Measurement
  • Statistical Analysis

Background:

  • Outliers in anchor items can compromise test equating accuracy and score validity.
  • Evaluating anchor item performance stability is crucial before equating.
  • Existing outlier detection methods may not be sufficiently robust.

Purpose of the Study:

  • To investigate the effectiveness of the t-test method for detecting outliers in anchor items.
  • To compare the t-test method with other outlier detection techniques.
  • To evaluate the impact of sample size, outlier proportion, item difficulty drift, and group differences on outlier detection.

Main Methods:

  • Simulation study design.
  • Comparison of the t-test method against logit difference and robust z statistic outlier detection methods.
  • Analysis of outlier detection performance across varied simulated conditions.
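
To make the three kinds of flagging rule concrete, here is a minimal Python sketch under stated assumptions. The summary does not give the formulas used in the study, so everything below is illustrative: the logit difference rule and the robust z statistic follow their commonly cited forms, while `flag_standardized_difference` is only one plausible reading of a t-type statistic (difficulty difference divided by its standard error) and is not taken from the paper. All function names, cutoffs, difficulty values, and standard errors are hypothetical.

```python
import math
import statistics


def displacement(b_old, b_new):
    """Per-item Rasch difficulty difference (new form minus old form), in logits."""
    return [new - old for old, new in zip(b_old, b_new)]


def flag_logit_difference(b_old, b_new, cutoff=0.5):
    """Flag items whose difference departs from the mean shift by more than `cutoff` logits.

    The cutoff of 0.5 is an assumed, illustrative value.
    """
    d = displacement(b_old, b_new)
    shift = statistics.mean(d)
    return [abs(x - shift) > cutoff for x in d]


def flag_robust_z(b_old, b_new, cutoff=2.7):
    """Flag items with |robust z| above `cutoff`.

    Robust z = (d - median(d)) / (0.74 * IQR(d)); the cutoff is an assumed value.
    """
    d = displacement(b_old, b_new)
    q1, _, q3 = statistics.quantiles(d, n=4, method="inclusive")
    scale = 0.74 * (q3 - q1)
    if scale == 0:
        raise ValueError("IQR is zero; robust z is undefined")
    center = statistics.median(d)
    return [abs(x - center) / scale > cutoff for x in d]


def flag_standardized_difference(b_old, b_new, se_old, se_new, cutoff=1.96):
    """Hypothetical t-type rule: centered difference over its standard error.

    Assumes independent difficulty estimates, so the variances add.
    """
    d = displacement(b_old, b_new)
    shift = statistics.mean(d)
    flags = []
    for x, s_old, s_new in zip(d, se_old, se_new):
        se_diff = math.sqrt(s_old ** 2 + s_new ** 2)
        flags.append(abs(x - shift) / se_diff > cutoff)
    return flags


# Made-up anchor set: ten items, a common shift near 0.2 logits,
# and one item (index 4) that has drifted by about one extra logit.
b_old = [-1.5, -1.0, -0.5, -0.2, 0.0, 0.3, 0.6, 1.0, 1.4, 1.8]
b_new = [-1.25, -0.85, -0.30, -0.10, 1.20, 0.60, 0.80, 1.15, 1.65, 2.00]
se_old = [0.10] * len(b_old)
se_new = [0.10] * len(b_new)

for name, flags in [
    ("logit difference", flag_logit_difference(b_old, b_new)),
    ("robust z", flag_robust_z(b_old, b_new)),
    ("standardized difference", flag_standardized_difference(b_old, b_new, se_old, se_new)),
]:
    print(name, [i for i, flagged in enumerate(flags) if flagged])
```

On this toy data all three rules flag only the drifted item. The point of the sketch is the structural difference between the rules: the first uses a fixed logit threshold, the second scales by the spread of the anchor differences, and the third scales by each item's own estimation error, which depends on sample size.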

Main Results:

  • The t-test method demonstrated superior sensitivity in identifying true outliers.
  • The t-test method showed reduced bias in estimating the translation constant.
  • The t-test method resulted in lower root mean square error for examinee ability estimates.
  • The t-test method consistently outperformed other methods across all investigated factors.
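
The two accuracy criteria named above, bias in the translation constant and root mean square error of ability estimates, can be illustrated with a small Python sketch. It assumes mean-mean equating under the Rasch model, where the translation constant is the average anchor difficulty difference; the summary does not state which equating procedure the study used, so this is an assumption. The difficulty values, the "true" constant, the abilities, and the helper names are all hypothetical.

```python
import math
import statistics


def translation_constant(b_old, b_new, keep=None):
    """Mean-mean translation constant: mean of (old - new) over retained anchors.

    Adding the constant to a new-form value places it on the old-form scale.
    """
    if keep is None:
        keep = [True] * len(b_old)
    diffs = [old - new for old, new, k in zip(b_old, b_new, keep) if k]
    return statistics.mean(diffs)


def rmse(estimates, truths):
    """Root mean square error between estimated and true values."""
    return math.sqrt(statistics.mean([(e - t) ** 2 for e, t in zip(estimates, truths)]))


# Made-up anchor set: the stable items average a 0.2-logit shift; index 4 has drifted.
b_old = [-1.5, -1.0, -0.5, -0.2, 0.0, 0.3, 0.6, 1.0, 1.4, 1.8]
b_new = [-1.25, -0.85, -0.30, -0.10, 1.20, 0.60, 0.80, 1.15, 1.65, 2.00]
true_constant = -0.20  # hypothetical value implied by the stable items

keep = [i != 4 for i in range(len(b_old))]  # drop the drifted anchor

c_all = translation_constant(b_old, b_new)          # outlier retained
c_clean = translation_constant(b_old, b_new, keep)  # outlier removed

# Hypothetical abilities on the new-form scale, moved onto the old-form scale.
theta_new = [-1.0, 0.0, 1.0]
theta_true_old = [t + true_constant for t in theta_new]
theta_all = [t + c_all for t in theta_new]
theta_clean = [t + c_clean for t in theta_new]

print("bias with outlier:", abs(c_all - true_constant))
print("bias without outlier:", abs(c_clean - true_constant))
print("ability RMSE with outlier:", rmse(theta_all, theta_true_old))
print("ability RMSE without outlier:", rmse(theta_clean, theta_true_old))
```

In this simplified setting, keeping the drifted anchor moves the constant by about 0.1 logits, and because the constant is added to every examinee's ability, the ability RMSE rises by the same amount. A real simulation would also include estimation error in the abilities themselves.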

Conclusions:

  • The t-test method is a highly effective tool for detecting outliers in anchor items during test equating.
  • Utilizing the t-test method enhances equating accuracy and strengthens the validity of test scores.
  • The findings support the t-test method as a preferred approach for ensuring anchor item stability in psychometric practice.