Outlier Detection Using t-test in Rasch IRT Equating under NEAT Design.

Chunyan Liu¹, Daniel Jurich¹

¹National Board of Medical Examiners, Philadelphia, PA, USA.

Applied Psychological Measurement

November 25, 2022

Summary

This summary is machine-generated.

The t-test method effectively detects outliers in anchor items, improving test equating accuracy and score validity. This study confirms its superior performance over other outlier detection methods in various simulation conditions.

Keywords:

Outliers; Rasch model; t-test; translation constant

Area of Science:

Psychometrics
Educational Measurement
Statistical Analysis

Background:

Outliers in anchor items can compromise test equating accuracy and score validity.
Evaluating anchor item performance stability is crucial before equating.
Existing outlier detection methods may not be sufficiently robust.

Purpose of the Study:

To investigate the effectiveness of the t-test method for detecting outliers in anchor items.
To compare the t-test method with other outlier detection techniques.
To evaluate the impact of sample size, outlier proportion, item difficulty drift, and group differences on outlier detection.

Main Methods:

Simulation study design.
Comparison of the t-test method against logit difference and robust z statistic outlier detection methods.
Analysis of outlier detection performance across varied simulated conditions.
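
The three flagging rules compared above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the summary does not give the formulas, so the function names, the mean-mean translation constant, the 0.3-logit cutoff, the 0.74 × IQR scaling with a 1.645 critical value for the robust z, and the form of the t statistic (shifted difficulty difference divided by its pooled standard error) are all assumptions drawn from common practice in Rasch equating.

```python
import math
import statistics

def translation_constant(b_old, b_new):
    # Assumed mean-mean constant: mean old-form minus mean new-form anchor difficulty.
    return statistics.fmean(b_old) - statistics.fmean(b_new)

def logit_difference_flags(b_old, b_new, cutoff=0.3):
    # Flag items whose difficulty still differs by more than `cutoff` logits
    # after the new-form estimates are shifted by the translation constant.
    c = translation_constant(b_old, b_new)
    return [abs(bo - (bn + c)) > cutoff for bo, bn in zip(b_old, b_new)]

def robust_z_flags(b_old, b_new, critical=1.645):
    # Robust z: centre the differences on their median and scale by 0.74 * IQR.
    d = [bo - bn for bo, bn in zip(b_old, b_new)]
    med = statistics.median(d)
    q1, _, q3 = statistics.quantiles(d, n=4, method="inclusive")
    scale = 0.74 * (q3 - q1)
    if scale == 0:
        raise ValueError("IQR of the difficulty differences is zero")
    return [abs(di - med) / scale > critical for di in d]

def t_test_flags(b_old, b_new, se_old, se_new, critical=1.96):
    # Assumed t statistic: shifted difference over the pooled standard error
    # of the two difficulty estimates (hypothetical form, see lead-in).
    c = translation_constant(b_old, b_new)
    flags = []
    for bo, bn, so, sn in zip(b_old, b_new, se_old, se_new):
        t = (bo - bn - c) / math.sqrt(so ** 2 + sn ** 2)
        flags.append(abs(t) > critical)
    return flags

# Illustrative (made-up) anchor difficulties; the last item has drifted.
b_new = [-1.0, -0.5, 0.0, 0.5, 1.0, 0.4]
b_old = [-0.95, -0.38, 0.08, 0.65, 1.10, 1.50]
se = [0.10] * 6

flags_logit = logit_difference_flags(b_old, b_new)
flags_z = robust_z_flags(b_old, b_new)
flags_t = t_test_flags(b_old, b_new, se, se)
```

With these made-up values, all three rules flag only the last anchor item. Note that the drifted item is itself included in the translation constant here, which is one reason operational procedures often re-estimate the constant after removing flagged items.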

Main Results:

The t-test method demonstrated superior sensitivity in identifying true outliers.
The t-test method showed reduced bias in estimating the translation constant.
The t-test method resulted in lower root mean square error for examinee ability estimates.
The t-test method consistently outperformed other methods across all investigated factors.
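
The evaluation criteria named in these results (sensitivity, bias, and root mean square error) can be computed along the following lines. This is a hedged sketch using the standard textbook definitions; the function names and the example numbers are hypothetical and are not taken from the study.

```python
import math

def sensitivity(flags, is_true_outlier):
    # Proportion of true outliers that were flagged: TP / (TP + FN).
    tp = sum(1 for f, t in zip(flags, is_true_outlier) if f and t)
    fn = sum(1 for f, t in zip(flags, is_true_outlier) if not f and t)
    return tp / (tp + fn)

def bias(estimates, true_value):
    # Mean signed error of repeated estimates of one quantity,
    # e.g. the translation constant across simulation replications.
    return sum(e - true_value for e in estimates) / len(estimates)

def rmse(estimates, true_values):
    # Root mean square error between paired estimates and true values,
    # e.g. estimated and generating examinee abilities.
    squared = [(e - t) ** 2 for e, t in zip(estimates, true_values)]
    return math.sqrt(sum(squared) / len(squared))

# Made-up numbers purely for illustration.
sens = sensitivity([True, False, True, False], [True, True, False, False])
const_bias = bias([0.5, 0.7, 0.6], 0.5)
ability_rmse = rmse([1.0, 2.0], [0.0, 2.0])
```

In a simulation study of this kind, these quantities would typically be averaged over replications within each condition before methods are compared.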

Conclusions:

The t-test method is a highly effective tool for detecting outliers in anchor items during test equating.
Utilizing the t-test method enhances equating accuracy and strengthens the validity of test scores.
The findings support the t-test method as a preferred approach for ensuring anchor item stability in psychometric practice.