


Building an Evaluation Scale using Item Response Theory.

John P Lalor¹, Hao Wu², Hong Yu³

¹University of Massachusetts, MA, USA.

Proceedings of the Conference on Empirical Methods in Natural Language Processing

December 23, 2016

Summary

This summary is machine-generated.

Item Response Theory (IRT), a method from psychometrics, offers an alternative way to evaluate Natural Language Processing (NLP) systems. Because it accounts for the difficulty and discriminating power of each test item, it can reveal more about system performance than standard metrics alone.


Area of Science:

Natural Language Processing (NLP)
Psychometrics
Computational Linguistics

Background:

Standard evaluation of NLP methods relies on gold-standard test sets and metrics like accuracy, precision, recall, and F1.
These evaluations implicitly treat every test item as equally difficult and equally discriminating, an assumption that does not hold in general.
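
As a point of reference, the standard metrics named above can be computed from simple counts. The sketch below is illustrative only: the function name `binary_metrics` and the toy labels are assumptions, not taken from the paper. Note that each test item contributes exactly one count, regardless of how hard or informative it is.

```python
from typing import Dict, Sequence


def binary_metrics(gold: Sequence[int], pred: Sequence[int]) -> Dict[str, float]:
    """Accuracy, precision, recall and F1 for binary labels (1 = positive)."""
    pairs = list(zip(gold, pred))
    tp = sum(1 for g, p in pairs if g == 1 and p == 1)  # true positives
    fp = sum(1 for g, p in pairs if g == 0 and p == 1)  # false positives
    fn = sum(1 for g, p in pairs if g == 1 and p == 0)  # false negatives
    tn = sum(1 for g, p in pairs if g == 0 and p == 0)  # true negatives

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical gold labels and system predictions for six test items.
gold = [1, 1, 0, 0, 1, 0]
pred = [1, 0, 0, 1, 1, 0]
metrics = binary_metrics(gold, pred)
print(metrics)
```

Swapping which items the system gets wrong leaves accuracy unchanged as long as the number of errors is the same, which is the limitation IRT is meant to address.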

Purpose of the Study:

To propose and demonstrate Item Response Theory (IRT) as an alternative for gold-standard test-set generation and NLP system evaluation.
To leverage IRT's ability to characterize individual item difficulty and discriminating power for more nuanced NLP assessment.
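
For orientation, one widely used IRT model, the three-parameter logistic (3PL) model, makes the roles of difficulty and discriminating power explicit. This summary does not state which IRT model the authors fitted, so the form below is a standard textbook formulation given as an assumption, with symbols chosen here for illustration.

```latex
% Probability that respondent j answers item i correctly (3PL model).
%   \theta_j : latent ability of respondent j
%   a_i      : discrimination of item i (slope of the curve)
%   b_i      : difficulty of item i (location of the curve)
%   c_i      : guessing parameter of item i (lower asymptote)
P(x_{ij} = 1 \mid \theta_j)
  = c_i + \frac{1 - c_i}{1 + e^{-a_i(\theta_j - b_i)}}
```

Setting c_i = 0 gives the two-parameter (2PL) model; additionally fixing a_i = 1 for all items gives the one-parameter (Rasch) model.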

Main Methods:

Applied Item Response Theory (IRT) from psychometrics to NLP.
Generated a gold-standard test set for the Recognizing Textual Entailment task.
Collected a large dataset of human responses to the test set.
Fitted an IRT model to the human response data.
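
The fitting step can be sketched on a toy scale. The example below is a simplified assumption, not the authors' procedure: it fits the one-parameter (Rasch) model, which estimates item difficulty only and omits discrimination, using plain gradient ascent with a Gaussian prior on all parameters to keep the estimates finite. The function name `fit_rasch` and the response matrix are hypothetical.

```python
import math
from typing import List, Tuple


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def fit_rasch(responses: List[List[int]], steps: int = 2000,
              lr: float = 0.05, prior_var: float = 1.0
              ) -> Tuple[List[float], List[float]]:
    """MAP estimates of abilities (theta) and item difficulties (b).

    responses[j][i] is 1 if respondent j answered item i correctly, else 0.
    """
    n_people, n_items = len(responses), len(responses[0])
    theta = [0.0] * n_people
    b = [0.0] * n_items
    for _ in range(steps):
        # Residual: observed response minus predicted probability of success.
        resid = [[responses[j][i] - sigmoid(theta[j] - b[i])
                  for i in range(n_items)] for j in range(n_people)]
        # Gradients of the log-posterior (log-likelihood plus Gaussian prior).
        grad_theta = [sum(resid[j]) - theta[j] / prior_var
                      for j in range(n_people)]
        grad_b = [-sum(resid[j][i] for j in range(n_people)) - b[i] / prior_var
                  for i in range(n_items)]
        theta = [t + lr * g for t, g in zip(theta, grad_theta)]
        b = [d + lr * g for d, g in zip(b, grad_b)]
    return theta, b


# Hypothetical responses: 6 respondents (rows) x 4 items (columns).
responses = [
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [1, 1, 1, 1],
    [1, 1, 0, 0],
    [0, 0, 0, 0],
]
theta_hat, b_hat = fit_rasch(responses)
print("difficulties:", [round(d, 2) for d in b_hat])
print("abilities:   ", [round(t, 2) for t in theta_hat])
```

In this toy data, items answered correctly by fewer respondents receive higher difficulty estimates. A model with discrimination parameters would typically be fitted with a dedicated IRT package rather than hand-written gradient steps.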

Main Results:

The IRT model enables a comparison of NLP systems against human performance benchmarks.
IRT provides deeper insights into NLP system performance beyond traditional metrics.
Demonstrated that high accuracy does not always correlate with high IRT scores, highlighting the influence of item characteristics.
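
The gap between accuracy and an IRT score can be illustrated with a small sketch. It assumes a two-parameter logistic (2PL) model with already-fitted item parameters; the parameter values, the two response patterns, and the function names are all hypothetical and are not results from the paper. Ability is estimated by maximum likelihood, using bisection on the derivative of the log-likelihood.

```python
import math
from typing import Sequence


def p_correct(theta: float, a: float, b: float) -> float:
    """2PL probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def estimate_ability(responses: Sequence[int], a_params: Sequence[float],
                     b_params: Sequence[float], lo: float = -6.0,
                     hi: float = 6.0, iters: int = 60) -> float:
    """Maximum-likelihood ability for one response pattern (mixed 0/1 only)."""
    def score(theta: float) -> float:
        # Derivative of the 2PL log-likelihood with respect to theta.
        return sum(a * (x - p_correct(theta, a, b))
                   for x, a, b in zip(responses, a_params, b_params))

    # The score is decreasing in theta, so bisection finds its root.
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if score(mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


# Hypothetical item parameters: discrimination (a) and difficulty (b).
a_params = [0.5, 0.8, 1.0, 1.5, 2.0]
b_params = [-1.0, -0.5, 0.0, 0.5, 1.0]

# Two hypothetical systems, each correct on 3 of 5 items.
system_a = [0, 0, 1, 1, 1]  # correct on the harder, more discriminating items
system_b = [1, 1, 1, 0, 0]  # correct on the easier, less discriminating items

acc_a = sum(system_a) / len(system_a)
acc_b = sum(system_b) / len(system_b)
theta_a = estimate_ability(system_a, a_params, b_params)
theta_b = estimate_ability(system_b, a_params, b_params)
print(f"accuracy: A={acc_a:.2f}, B={acc_b:.2f}")
print(f"ability:  A={theta_a:.2f}, B={theta_b:.2f}")
```

Both systems have the same accuracy, yet system A receives the higher ability estimate because, under the 2PL model, correct answers are weighted by item discrimination.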

Conclusions:

Item Response Theory (IRT) offers a more sophisticated framework for evaluating NLP systems and test sets.
IRT accounts for item difficulty and discrimination, leading to more accurate assessments of system capabilities.
This psychometric approach enhances the reliability and interpretability of NLP evaluation results.