A Robust Method for Detecting Item Misfit in Large-Scale Assessments.

Matthias von Davier¹, Ummugul Bezirhan¹

  • ¹Boston College, Chestnut Hill, MA, USA.

Educational and Psychological Measurement | July 3, 2023
Summary
This summary is machine-generated.

This study introduces a new method for detecting Differential Item Functioning (DIF) without assuming perfect model-data fit. It uses robust outlier detection to identify items with inadequate fit, improving measurement accuracy.

Keywords:
DIF; Tukey’s contaminated distributions; item fit; mixture distribution model; outlier detection; robust statistics

Area of Science:

  • Psychometrics
  • Statistical modeling
  • Educational measurement

Background:

  • Accurate scale construction requires identifying item misfit and Differential Item Functioning (DIF).
  • Existing methods often assume perfect model-data fit, which can be unrealistic.
  • Classical test theory and item response theory rely on explicit assumptions about item function.

Purpose of the Study:

  • To develop a robust approach for detecting DIF that does not require perfect model-data fit.
  • To provide a more reliable method for assessing item fit in scale construction.

Main Methods:

  • Utilized Tukey's concept of contaminated distributions.
  • Employed robust outlier detection techniques.
  • Flagged items with inadequate model-data fit.
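
The general idea behind these steps can be illustrated with a minimal sketch. It is not the authors' procedure, which this summary does not spell out. It assumes that each item has a single numeric fit statistic, that most of those statistics come from a standard normal distribution while a small fraction `eps` come from a wider one (a contaminated normal in Tukey's sense), and that outliers are flagged with a median/MAD robust z-score. The function names, the cutoff of 3.0, and the simulation settings are all hypothetical.

```python
import random
import statistics


def simulate_item_fit(n_items=50, eps=0.1, scale=5.0, seed=0):
    """Draw hypothetical item-fit statistics from a contaminated normal.

    Each value comes from N(0, 1) with probability 1 - eps and from
    N(0, scale^2) with probability eps (assumed settings, for illustration).
    """
    rng = random.Random(seed)
    return [
        rng.gauss(0.0, scale if rng.random() < eps else 1.0)
        for _ in range(n_items)
    ]


def flag_misfit(values, cutoff=3.0):
    """Return indices of values whose robust z-score exceeds the cutoff.

    Location is the median and spread is 1.4826 * MAD, so a few extreme
    items cannot inflate the yardstick used to judge them.
    """
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    if mad == 0:
        return []  # no spread to scale by, so nothing can be flagged
    robust_sd = 1.4826 * mad  # consistency factor under normality
    return [
        i for i, v in enumerate(values)
        if abs(v - med) / robust_sd > cutoff
    ]


if __name__ == "__main__":
    fit_stats = simulate_item_fit()
    print("Flagged item indices:", flag_misfit(fit_stats))
```

The median and MAD are used in place of the mean and standard deviation because the latter are themselves distorted by the contaminating items, which can mask the very outliers being sought.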

Main Results:

  • Successfully identified items with inadequate model-data fit.
  • Demonstrated a robust approach to DIF detection.
  • Provided a method less reliant on idealized statistical assumptions.

Conclusions:

  • The proposed robust method enhances the accuracy of DIF detection.
  • This approach offers a more practical solution for scale construction and measurement.
  • It addresses limitations of traditional methods by not assuming perfect model fit.