The answer may vary: large language model response patterns challenge their use in test item analysis
Summary
This summary is machine-generated. Large language models (LLMs) show limited ability to predict multiple-choice question (MCQ) performance metrics such as difficulty and point-biserial indices. For assessment development, the consistency of LLM responses matters more than their ability to predict item characteristics.
Area of Science
- Medical Education
- Artificial Intelligence in Assessment
Background
- Validating multiple-choice questions (MCQs) requires extensive testing.
- Large language models (LLMs) offer potential for streamlining assessment development.
- Predicting psychometric properties of MCQs is a key challenge.
Purpose of the Study
- To investigate LLMs' ability to predict MCQ difficulty and point-biserial indices.
- To assess whether LLMs can reduce the need for preliminary analysis in a test population.
- To compare LLM performance with human expert assessment.
Main Methods
- Sixty anesthesiology MCQs were administered to five LLMs and clinical fellows.
- LLM response patterns, difficulty indices, and point-biserial indices were analyzed.
- Spearman correlation coefficients compared LLM and fellow performance metrics (an illustrative sketch of these item statistics follows this list).
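For readers unfamiliar with these item statistics, the sketch below shows how they are conventionally computed in classical test theory: the difficulty index as the proportion of examinees answering an item correctly, the point-biserial index as the correlation between item correctness and total score, and a Spearman coefficient comparing two sets of difficulty indices. The data and variable names are hypothetical and are not taken from the study.

```python
# Illustrative sketch of classical item analysis (hypothetical data, not the study's dataset).
import numpy as np
from scipy.stats import pointbiserialr, spearmanr

# Rows = examinees, columns = MCQ items; 1 = correct, 0 = incorrect (hypothetical responses).
responses = np.array([
    [1, 0, 1, 1],
    [1, 1, 1, 0],
    [0, 0, 1, 1],
    [1, 1, 0, 1],
    [0, 1, 1, 1],
])

total_scores = responses.sum(axis=1)

# Difficulty index: proportion of examinees answering each item correctly.
difficulty = responses.mean(axis=0)

# Point-biserial index: correlation between correctness on an item and total score
# (uncorrected, i.e. the item's own contribution to the total is not removed).
point_biserial = np.array([
    pointbiserialr(responses[:, i], total_scores)[0]
    for i in range(responses.shape[1])
])

# Spearman rank correlation between two sets of difficulty indices,
# e.g. LLM-derived vs. fellow-derived values (the second set here is made up).
fellow_difficulty = np.array([0.6, 0.8, 0.8, 0.8])
rho, p_value = spearmanr(difficulty, fellow_difficulty)

print("Difficulty indices:", difficulty)
print("Point-biserial indices:", point_biserial)
print("Spearman rho:", rho, "p =", p_value)
```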
Main Results
- LLM response consistency varied; Claude 3.5 Sonnet and Llama 3.2 were the most consistent.
- LLMs generally scored higher than fellows (58-85% vs. 57%).
- LLMs showed weak or no correlation with fellows' difficulty indices and failed to predict point-biserial indices.
Conclusions
- LLMs have limited utility in predicting specific MCQ psychometric properties.
- Higher-performing LLMs correlated less with human performance, suggesting a potential inverse relationship.
- Future research should focus on LLMs for broader assessment optimization, not item-level prediction.

