Building an Evaluation Scale using Item Response Theory.

John P Lalor¹, Hao Wu², Hong Yu³

  • ¹University of Massachusetts, MA, USA.

Proceedings of the Conference on Empirical Methods in Natural Language Processing | December 23, 2016
Summary
This summary is machine-generated.

Item Response Theory (IRT) offers a novel approach to evaluating Natural Language Processing (NLP) systems. This psychometric method provides a more insightful evaluation than standard metrics by considering item difficulty and discrimination power.

Area of Science:

  • Natural Language Processing (NLP)
  • Psychometrics
  • Computational Linguistics

Background:

  • Standard evaluation of NLP methods relies on gold-standard test sets and metrics like accuracy, precision, recall, and F1.
  • These evaluations implicitly treat every test item as equally difficult and equally discriminating, an assumption that does not hold in general.
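
For reference, the standard metrics named above can be computed from binary predictions roughly as follows. This is a minimal sketch: the function name `binary_metrics` and the toy labels are illustrative assumptions, not taken from the paper. Note that every item contributes equally to each count, whatever its difficulty.

```python
def binary_metrics(gold, pred):
    """Return accuracy, precision, recall and F1 for binary labels (1 = positive)."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and of equal length")

    # Confusion-matrix counts; each test item carries the same weight.
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    tn = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 0)

    accuracy = (tp + tn) / len(gold)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Toy example (hypothetical labels).
toy_gold = [1, 1, 0, 0]
toy_pred = [1, 0, 0, 1]
toy_scores = binary_metrics(toy_gold, toy_pred)
print(toy_scores)
```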

Purpose of the Study:

  • To propose and demonstrate Item Response Theory (IRT) as an alternative for gold-standard test-set generation and NLP system evaluation.
  • To leverage IRT's ability to characterize individual item difficulty and discriminating power for more nuanced NLP assessment.

Main Methods:

  • Applied Item Response Theory (IRT) from psychometrics to NLP.
  • Generated a gold-standard test set for the Recognizing Textual Entailment task.
  • Collected a large dataset of human responses to the test set.
  • Fitted an IRT model to the human response data.
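
To make the roles of difficulty and discrimination concrete, one common IRT formulation is the two-parameter logistic (2PL) model. The summary does not state which IRT model the authors fitted, so the form below is an assumption for illustration only. The symbols are standard IRT notation rather than notation from this page: theta_j is the ability of respondent j, b_i the difficulty of item i, a_i its discrimination, and y_ij is 1 for a correct response and 0 otherwise.

```latex
% Probability that respondent j answers item i correctly (2PL, assumed form)
P_i(\theta_j) = P(y_{ij} = 1 \mid \theta_j)
             = \frac{1}{1 + \exp\bigl(-a_i(\theta_j - b_i)\bigr)}

% Likelihood of respondent j's responses over n items
L(\theta_j) = \prod_{i=1}^{n} P_i(\theta_j)^{\,y_{ij}}
              \bigl(1 - P_i(\theta_j)\bigr)^{\,1 - y_{ij}}
```

Fitting the model to the human response data means estimating a_i and b_i for every item; a system can then be scored by finding the theta that maximizes L given its own responses.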

Main Results:

  • The IRT model enables a comparison of NLP systems against human performance benchmarks.
  • IRT provides deeper insights into NLP system performance beyond traditional metrics.
  • High accuracy does not always correspond to a high IRT score, which highlights the influence of item characteristics.
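
The gap between accuracy and IRT score can be illustrated with a small sketch. It assumes a two-parameter logistic (2PL) model with item parameters that are already fitted; the item values, the two response patterns, and all function names are hypothetical and do not come from the paper. Ability is estimated by a simple grid search over the likelihood, chosen here for readability rather than efficiency.

```python
import math

# Hypothetical fitted item parameters: (discrimination a, difficulty b).
ITEMS = [(0.5, -1.0), (0.5, -0.5), (1.0, 0.0), (2.0, 0.5), (2.0, 1.0)]


def p_correct(theta, a, b):
    """2PL probability of a correct response at ability theta."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def log_likelihood(theta, responses, items=ITEMS):
    """Log-likelihood of a 0/1 response pattern at ability theta."""
    total = 0.0
    for y, (a, b) in zip(responses, items):
        p = p_correct(theta, a, b)
        total += math.log(p) if y == 1 else math.log(1.0 - p)
    return total


def estimate_ability(responses, items=ITEMS):
    """Maximum-likelihood ability on a grid from -4.00 to 4.00 (step 0.01)."""
    grid = [i / 100 for i in range(-400, 401)]
    return max(grid, key=lambda t: log_likelihood(t, responses, items))


def accuracy(responses):
    """Fraction of items answered correctly."""
    return sum(responses) / len(responses)


# Two hypothetical systems with identical accuracy (3 of 5 correct).
system_a = [1, 1, 1, 0, 0]  # correct on the easier, less discriminating items
system_b = [0, 0, 1, 1, 1]  # correct on the harder, more discriminating items

theta_a = estimate_ability(system_a)
theta_b = estimate_ability(system_b)
print(f"A: accuracy={accuracy(system_a):.2f}, theta={theta_a:.2f}")
print(f"B: accuracy={accuracy(system_b):.2f}, theta={theta_b:.2f}")
```

Under this assumed model the ability estimate is driven by the discrimination-weighted pattern of correct answers, so system B receives the higher estimate even though both systems have the same accuracy.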

Conclusions:

  • Item Response Theory (IRT) offers a more sophisticated framework for evaluating NLP systems and test sets.
  • IRT accounts for item difficulty and discrimination, leading to more accurate assessments of system capabilities.
  • This psychometric approach enhances the reliability and interpretability of NLP evaluation results.