Computerized adaptive testing for testlet-based innovative items.

Hyeon-Ah Kang¹, Suhwa Han¹, Joe Betts²

  • ¹University of Texas at Austin, Texas, USA.

The British Journal of Mathematical and Statistical Psychology | August 31, 2021
Summary
This summary is machine-generated.

This study compares four statistical models for scoring tests that use groups of related questions, known as testlets. Three of the models perform similarly, but the Testlet-as-a-polytomous-item Model underestimates population variance and overstates measurement precision.

Keywords:
adaptive testing, polytomous items, technology-enhanced innovative items, testlet, item response theory, latent trait estimation, assessment methodology, statistical bias

Area of Science:

  • Psychometrics and educational measurement research within computerized adaptive testing
  • Statistical modeling of latent traits in assessment science

Background:

Innovative item formats often create complex dependencies among responses, and grouped items (testlets) can violate the independence assumptions that standard scoring models rely on. Ignoring these dependencies may bias results in high-stakes testing environments, yet researchers frequently apply standard models anyway. No prior work had resolved how different scoring frameworks handle random dependencies within grouped assessment items. This gap motivated an evaluation of multiple scoring approaches under controlled conditions. The investigation examines the performance of four models when random testlet effects are present and clarifies which approaches remain robust for these item structures.

Purpose Of The Study:

The aim of this study is to examine the performance of several scoring models when polytomous items exhibit random testlet effects. The motivation stems from the increasing use of these complex, innovative items in operational assessments. The researchers sought to determine whether standard models remain effective in the presence of random testlet effects, and how much bias arises when traditional scoring methods are applied to testlet-based data. By comparing four approaches, the study identifies which methods offer the most accurate trait inference. The authors intended to give practitioners clear guidance for selecting models in high-stakes testing environments and to improve the reliability of scoring in assessments built on grouped item structures.

Main Methods:

The study is a simulation-based comparison of four statistical scoring frameworks. Investigators simulated two adaptive testing scenarios to test model performance. Each scenario incorporated nonzero random testlet effects to mimic real-world item dependencies. The team evaluated the Partial Credit Model, Testlet-as-a-polytomous-item Model, Random-effect Testlet Model, and Fixed-effect Testlet Model. Researchers focused on how well these models recovered latent traits and classified examinees. The design ensured that each model faced identical conditions regarding item structure and random effects, which allowed a direct comparison of accuracy across different mathematical assumptions. The approach prioritized identifying which models remain reliable when dealing with complex, grouped item formats.
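As a rough illustration of what such scoring models compute, the sketch below evaluates category probabilities under the Partial Credit Model in its standard textbook form. This summary does not give the authors' parameterization, so the function name `pcm_probs`, the step-difficulty values, and the optional `testlet_effect` argument (an additive person-by-testlet shift, in the spirit of a random-effect testlet model) are assumptions made for illustration only.

```python
import math

def pcm_probs(theta, steps, testlet_effect=0.0):
    """Category probabilities for one polytomous item (scores 0..len(steps)).

    theta: latent trait value.
    steps: step difficulties (hypothetical values, not from the study).
    testlet_effect: assumed additive shift; 0.0 gives the plain PCM.
    """
    # Cumulative sums of (theta + effect - step); score 0 has an empty sum of 0.
    logits = [0.0]
    for d in steps:
        logits.append(logits[-1] + (theta + testlet_effect - d))
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(logits)
    weights = [math.exp(v - m) for v in logits]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical four-category item with three step difficulties.
probs = pcm_probs(theta=0.5, steps=[-1.0, 0.0, 1.0])
```

With `testlet_effect=0.0` the function reduces to the ordinary Partial Credit Model; a nonzero value shifts all items in the same testlet together, which is one simple way to picture the dependency the study simulates.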

Main Results:

The simulation results indicate that the Partial Credit Model, Fixed-effect Testlet Model, and Random-effect Testlet Model perform comparably in trait recovery. These three models also show similar results for examinee classification. In overall accuracy of trait inference, the Partial Credit Model and Fixed-effect Testlet Model match the Random-effect Testlet Model. The Testlet-as-a-polytomous-item Model consistently underestimates population variance in the simulated data. It also leads to a significant overestimation of measurement precision. The researchers therefore observed that the Testlet-as-a-polytomous-item Model has limited utility for operational applications. These results suggest that manifest random testlet effects do not necessarily degrade the performance of the other three models. The evidence highlights a clear performance gap between the Testlet-as-a-polytomous-item Model and the other evaluated approaches.

Conclusions:

The authors suggest that the Partial Credit Model, Fixed-effect Testlet Model, and Random-effect Testlet Model yield similar accuracy for trait recovery. These three approaches demonstrate comparable effectiveness for classifying examinees in adaptive testing scenarios. The researchers propose that practitioners can choose among these options without significant loss of precision. Conversely, the Testlet-as-a-polytomous-item Model consistently produces biased estimates of population variance. This specific model also leads to an overstatement of measurement reliability. The study implies that the Testlet-as-a-polytomous-item Model lacks the necessary utility for operational assessment tasks. These findings offer clear guidance for selecting appropriate scoring frameworks in modern testing programs. The evidence supports using models that account for testlet effects without overcomplicating the statistical process.

The researchers propose that the Partial Credit Model, Fixed-effect Testlet Model, and Random-effect Testlet Model provide similar accuracy. In contrast, the Testlet-as-a-polytomous-item Model underestimates population variance and overstates precision, making it less suitable for operational use.

The study evaluates the Partial Credit Model, Testlet-as-a-polytomous-item Model, Random-effect Testlet Model, and Fixed-effect Testlet Model. These frameworks represent different statistical approaches to handling dependencies within grouped items.

The authors emphasize that testlets often exhibit nonzero random effects. These effects are necessary to include in simulations to accurately reflect how innovative items behave in real-world assessment settings.

The researchers utilize simulated adaptive testing data where testlets contain random effects. This data type allows for a controlled comparison of how each model handles the specific statistical challenges posed by grouped item structures.

The study measures trait recovery, examinee classification, population variance, and measurement precision. These metrics determine the overall utility of each model for operational assessment purposes.
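A minimal sketch of how trait recovery and classification might be summarized in a simulation like this one is shown below. The specific metrics (bias, root mean squared error, and agreement at a single cut score), the function name `recovery_metrics`, and the example values are assumptions; this summary does not state which indices the authors used.

```python
import math

def recovery_metrics(true_thetas, est_thetas, cut=0.0):
    """Bias, RMSE, and classification agreement at an assumed cut score."""
    n = len(true_thetas)
    errors = [e - t for t, e in zip(true_thetas, est_thetas)]
    bias = sum(errors) / n
    rmse = math.sqrt(sum(err ** 2 for err in errors) / n)
    # An examinee counts as correctly classified when the true and estimated
    # traits fall on the same side of the cut score.
    agree = sum((t >= cut) == (e >= cut)
                for t, e in zip(true_thetas, est_thetas))
    return {"bias": bias, "rmse": rmse, "accuracy": agree / n}

# Hypothetical true and estimated trait values for three examinees.
metrics = recovery_metrics([-1.0, 0.2, 1.5], [-0.8, -0.1, 1.4])
```

Computing the same summary for each scoring model on identical simulated data is what makes the comparison between models direct.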

The authors conclude that the Testlet-as-a-polytomous-item Model shows limited utility for operational use. They suggest that practitioners should favor the other three models to ensure accurate trait inference.