
Related Concept Videos

Quantifying and Rejecting Outliers: The Grubbs Test

Sometimes, a data set can have a recorded numerical observation that greatly deviates from the rest of the data. Assuming that the data is normally distributed, a statistical method called the Grubbs test can be used to determine whether the observation is truly an outlier. To perform a two-tailed Grubbs test, first, calculate the absolute difference between the outlier and the mean. Then, calculate the ratio between this difference and the standard deviation of the sample. This...
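The two steps described above can be sketched in a few lines. This is a minimal illustration, not the video's own material: the function name `grubbs_statistic` and the measurements are made up, and the sample standard deviation (n - 1 denominator) is assumed. The resulting ratio would still need to be compared against a tabulated Grubbs critical value for the chosen significance level and sample size, which this sketch does not compute.

```python
import statistics


def grubbs_statistic(values):
    """Return (G, suspect): the Grubbs ratio and the value farthest from the mean.

    G = |suspect - mean| / s, where s is the sample standard deviation.
    Assumes the data are approximately normal and contain at least 3 values.
    """
    if len(values) < 3:
        raise ValueError("need at least 3 observations")
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)  # sample standard deviation (n - 1)
    if sd == 0:
        raise ValueError("all observations are identical")
    # Two-tailed version: the suspect is the point with the largest absolute deviation.
    suspect = max(values, key=lambda x: abs(x - mean))
    return abs(suspect - mean) / sd, suspect


# Hypothetical measurements with one value that looks too large.
data = [9.8, 10.1, 10.0, 9.9, 10.2, 12.5]
g, suspect = grubbs_statistic(data)
print(f"suspect = {suspect}, G = {g:.3f}")
```

If G exceeds the critical value from a Grubbs table, the suspect observation is treated as an outlier; otherwise it is retained.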

Detection of Gross Error: The Q Test

When one or more data points appear far from the rest of the data, there is a need to determine whether they are outliers and whether they should be eliminated from the data set to ensure an accurate representation of the measured value. In many cases, outliers arise from gross errors (or human errors) and do not accurately reflect the underlying phenomenon. In some cases, however, these apparent outliers reflect true phenomenological differences. In these cases, we can use statistical methods...

Reliability and Validity

Reliability and validity are two important considerations that must be made with any type of data collection. Reliability refers to the ability to consistently produce a given result. In the context of psychological research, this would mean that any instruments or tools used to collect data do so in consistent, reproducible ways.

Goodness-of-Fit Test

The goodness-of-fit test is a type of hypothesis test which determines whether the data "fits" a particular distribution. For example, one may suspect that some anonymous data may fit a binomial distribution. A chi-square test (meaning the distribution for the hypothesis test is chi-square) can be used to determine if there is a fit. The null and alternative hypotheses may be written in sentences or stated as equations or inequalities. The test statistic for a goodness-of-fit test is given as...

Expected Frequencies in Goodness-of-Fit Tests

A goodness-of-fit test is conducted to determine whether the observed frequencies are statistically consistent with the frequencies expected for the dataset. Suppose the expected frequencies for a dataset are all equal, as when predicting how often each face appears when rolling a die. In that case, the expected frequency for each category is the total number of observations (n) divided by the number of categories (k).
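As a rough illustration of the equal-frequency case, the sketch below computes n / k for a set of die rolls and then the usual Pearson chi-square statistic, the sum of (observed - expected)² / expected over categories. The roll counts and function names are hypothetical, and the statistic would still have to be compared with a chi-square critical value on k - 1 degrees of freedom.

```python
def expected_equal_frequency(observed):
    """Expected count per category when all categories are equally likely: n / k."""
    n = sum(observed)   # total number of observations
    k = len(observed)   # number of categories
    return n / k


def chi_square_statistic(observed, expected):
    """Pearson chi-square statistic: sum of (O - E)^2 / E over categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Hypothetical counts for faces 1 to 6 over 60 rolls of a die.
rolls = [8, 12, 9, 11, 10, 10]
e = expected_equal_frequency(rolls)
chi2 = chi_square_statistic(rolls, [e] * len(rolls))
print(f"expected per face = {e}, chi-square = {chi2:.2f}")
```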

Multiple Comparison Tests

A multiple comparison test (MCT) is a post hoc analysis generally performed after comparing multiple samples with one or more tests. An MCT helps identify a significantly different sample among multiple samples, or a significant factor among multiple factors.
Comparing two samples at a significance level (alpha) of 0.05 is straightforward because there is only one sample pair to compare. However, it would be difficult to identify a significantly different sample if the number...

You might also read

Related Articles

Articles linked to this work by shared authors, journal, and citation graph.

Same author

Reconceptualizing Scoring Reliability Through Linguistic Similarity.

Educational and psychological measurement·2025

Same author

Modeling Item Revisit Behavior: The Hierarchical Speed-Accuracy-Revisits Model.

Educational and psychological measurement·2023

Same author

Scoring Graphical Responses in TIMSS 2019 Using Artificial Neural Networks.

Educational and psychological measurement·2023

Same author

Erratum to: A Response-Time-Based Latent Response Mixture Model for Identifying and Modeling Careless and Insufficient Effort Responding in Survey Data.

Psychometrika·2022

Same author

A Response-Time-Based Latent Response Mixture Model for Identifying and Modeling Careless and Insufficient Effort Responding in Survey Data.

Psychometrika·2021

Same author

Erratum: Electronic cigarette use and its association with asthma, chronic obstructive pulmonary disease (COPD) and asthma-COPD overlap syndrome among never cigarette smokers.

Tobacco induced diseases·2021

Same journal

A Simple Approach for Differential Test Functioning Based on Sum Scores.

Educational and psychological measurement·2026

Same journal

Evaluating Factor Retention in Large Factor Analysis Models: A Simulation Study Comparing 15 Methods.

Educational and psychological measurement·2026

Same journal

Agreement and Alignment in Binary Rating Tasks: Strategic Convergence as an Equilibrium Outcome.

Educational and psychological measurement·2026

Same journal

Interactions Between Termination Criteria and Ability Estimators in Computerized Adaptive Testing.

Educational and psychological measurement·2026

Same journal

Identification and Diagnosis of Misreporting in Surveys.

Educational and psychological measurement·2026

Same journal

The Aggregated Latent Profile Index: Measuring Person Profile Differentiation Within a Bootstrap-Validated Latent Profile Space.

Educational and psychological measurement·2026

Related Experiment Video

Updated: Jul 24, 2025

Author Spotlight: Validation of SICOLE-R for Assessing Cognitive and Reading Skills in Spanish-Speaking Children and Its Role in Personalized Education

Published on: August 16, 2024

A Robust Method for Detecting Item Misfit in Large-Scale Assessments.

Matthias von Davier¹, Ummugul Bezirhan¹

¹Boston College, Chestnut Hill, MA, USA.

Educational and Psychological Measurement

July 3, 2023

Summary

This summary is machine-generated.

This study introduces a new method for detecting Differential Item Functioning (DIF) without assuming perfect model-data fit. It uses robust outlier detection to identify items with inadequate fit, improving measurement accuracy.

Keywords:

DIF; Tukey’s contaminated distributions; item fit; mixture distribution model; outlier detection; robust statistics

More Related Videos

Computerized Adaptive Testing System of Functional Assessment of Stroke

Published on: January 7, 2019

Assessment of Child Anthropometry in a Large Epidemiologic Study

Published on: February 2, 2017

Area of Science:

Psychometrics
Statistical modeling
Educational measurement

Background:

Accurate scale construction requires identifying item misfit and Differential Item Functioning (DIF).
Existing methods often assume perfect model-data fit, which can be unrealistic.
Classical test theory and item response theory rely on explicit assumptions about item function.

Purpose of the Study:

To develop a robust approach for detecting DIF that does not require perfect model-data fit.
To provide a more reliable method for assessing item fit in scale construction.

Main Methods:

Utilized Tukey's concept of contaminated distributions.
Employed robust outlier detection techniques.
Flagged items with inadequate model-data fit.
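To make the idea of flagging items by robust outlier detection concrete, here is a generic sketch. It is not the authors' procedure: this summary does not say which robust estimator or cutoff the paper uses, so the median/MAD rule, the 1.4826 normal-consistency constant, the cutoff of 3, the one-sided flagging, and the item fit values below are all assumptions chosen for illustration.

```python
import statistics


def flag_misfit(fit_stats, cutoff=3.0):
    """Flag items whose fit statistic is a robust upper outlier.

    fit_stats maps item id -> misfit statistic (larger means worse fit).
    Center and scale come from the median and the median absolute deviation
    (MAD), so a few badly fitting items cannot distort the reference
    distribution of the well-fitting majority.
    """
    values = list(fit_stats.values())
    center = statistics.median(values)
    mad = statistics.median(abs(v - center) for v in values)
    scale = 1.4826 * mad  # makes MAD comparable to a normal standard deviation
    if scale == 0:
        return []  # no spread to judge against
    # One-sided rule (an assumption): only unusually large misfit is flagged.
    return [item for item, v in fit_stats.items() if (v - center) / scale > cutoff]


# Hypothetical item fit statistics; item05 fits much worse than the rest.
fit = {
    "item01": 0.021, "item02": 0.018, "item03": 0.025, "item04": 0.019,
    "item05": 0.090, "item06": 0.023, "item07": 0.020,
}
flagged = flag_misfit(fit)
print(flagged)
```

Using the median and MAD rather than the mean and standard deviation reflects the contaminated-distribution view: most items are assumed to follow one distribution of fit values, with a minority drawn from a different, worse-fitting one.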

Main Results:

Successfully identified items with inadequate model-data fit.
Demonstrated a robust approach to DIF detection.
Provided a method less reliant on idealized statistical assumptions.

Conclusions:

The proposed robust method enhances the accuracy of DIF detection.
This approach offers a more practical solution for scale construction and measurement.
It addresses limitations of traditional methods by not assuming perfect model fit.