Item response theory (IRT) is used to model student answers to assessment questions. A typical IRT model might look like this:
\operatorname{logit}(\Pr(X_{s,i} = 1)) = \alpha_i (s_s - d_i)
where X_{s,i} is 1 if student s answers question i correctly. The model has a student ability parameter s_s, a question difficulty parameter d_i, and an item discrimination parameter \alpha_i that sets the slope of the item characteristic curve. Different IRT models may have more or fewer parameters; this one is the two-parameter logistic (2PL) model.
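As a concrete illustration, here is a minimal Python sketch of this 2PL response function with simulated students and questions; all parameter values and names below are made up for illustration, not taken from any real test.

```python
import numpy as np

def p_correct(ability, difficulty, discrimination):
    """2PL item response function: P(X_{s,i} = 1) = logistic(alpha_i * (s_s - d_i))."""
    return 1.0 / (1.0 + np.exp(-discrimination * (ability - difficulty)))

rng = np.random.default_rng(0)
abilities = rng.normal(0.0, 1.0, size=200)   # s_s for 200 simulated students
difficulties = np.array([-1.0, 0.0, 1.5])    # d_i for three questions
discriminations = np.array([0.8, 1.2, 2.0])  # alpha_i for the same questions

# Students-by-questions probability matrix, then simulated responses X_{s,i}.
P = p_correct(abilities[:, None], difficulties[None, :], discriminations[None, :])
X = rng.binomial(1, P)
```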
See also Assessment, Cognitive task analysis.
Sijtsma, K., & Junker, B. W. (2006). Item response theory: Past performance, present developments, and future expectations. Behaviormetrika, 33(1), 75–102. doi:10.2333/bhmk.33.75
A high-level historical overview of item response theory and its many variants.
Ding, L., & Beichner, R. (2009). Approaches to data analysis of multiple-choice questions. Physical Review Special Topics - Physics Education Research, 5(2), 020103. doi:10.1103/physrevstper.5.020103
A high-level overview of “classical test theory, factor analysis, cluster analysis, item response theory, and model analysis.” Model analysis was invented in physics education research to see which models students use to answer questions (like work-energy theorem vs. impulse-momentum) and estimate which models students are most likely to use for questions.
Mazza, A., Punzo, A., & McGuire, B. (2014). KernSmoothIRT: An R package for kernel smoothing in item response theory. Journal of Statistical Software, 58(6). doi:10.18637/jss.v058.i06
Discussion of nonparametric (kernel-based) estimation of the “option characteristic curve” (OCC), the probability of selecting a specific answer choice on a question as a function of a unidimensional proficiency/ability parameter. This avoids the parametric assumptions of typical IRT. The R package implements simplex plots that show the probability of each answer choice as a function of proficiency. Looks like a useful package for nonparametric exploration of test data.
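The package itself is R; below is a rough Python sketch of the underlying idea as I understand it (kernel regression of option choice on a rank-based proficiency proxy). The Gaussian kernel, the bandwidth, and the proxy construction are my assumptions, not the package’s defaults.

```python
import numpy as np
from scipy.stats import norm

def proficiency_proxy(total_scores):
    """Map each student's total-score rank to a standard-normal quantile."""
    ranks = np.argsort(np.argsort(total_scores)) + 1
    return norm.ppf(ranks / (len(total_scores) + 1))

def occ_estimate(theta, chose_option, grid, bandwidth=0.3):
    """Nadaraya-Watson estimate of an option characteristic curve:
    P(student selects this option | proficiency), evaluated on `grid`.

    chose_option: 0/1 indicator, per student, of picking one particular
    answer choice on one particular question; repeat over choices to get
    all OCCs for that question.
    """
    # Gaussian kernel weights: rows = grid points, columns = students.
    w = np.exp(-0.5 * ((grid[:, None] - theta[None, :]) / bandwidth) ** 2)
    return (w * chose_option[None, :]).sum(axis=1) / w.sum(axis=1)
```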
Morris, G. A., Branum-Martin, L., Harshman, N., Baker, S. D., Mazur, E., Dutta, S., Mzoughi, T., & McCauley, V. (2006). Testing the test: Item response curves and test quality. American Journal of Physics, 74(5), 449–453. doi:10.1119/1.2174053
An interesting alternative to item response theory. Suppose you have some estimate of each student’s ability (here, raw score). Plot the percentage of students who choose each answer choice against ability. If the distractors represent specific real misconceptions, you will ideally see specific distractors become predominant in specific ability ranges; they show examples where low-ability students pick one distractor, moderate-ability students pick another, and high-ability students pick the right answer. This is useful for qualitative analysis of each question and its answer choices.
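A sketch of the bookkeeping this implies (function and variable names are mine, not the paper’s): group students by raw score, compute the share choosing each option within each group, and plot those shares against score to get the item response curves.

```python
import numpy as np

def item_response_curves(raw_scores, answers, n_options):
    """For one question: proportion of students selecting each answer choice,
    within each raw-score group (raw score as the ability proxy).

    answers: integer array of chosen option indices (0 .. n_options-1),
    one entry per student, for this question.
    """
    curves = {}
    for score in np.unique(raw_scores):
        in_group = raw_scores == score
        counts = np.bincount(answers[in_group], minlength=n_options)
        curves[score] = counts / in_group.sum()
    return curves  # {raw score: length-n_options array of proportions}
```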
Smith, T. I., Louis, K. J., Ricci, B. J., & Bendjilali, N. (2020). Quantitatively ranking incorrect responses to multiple-choice questions using item response theory. Physical Review Physics Education Research, 16(1), 010107. doi:10.1103/physrevphyseducres.16.010107
A more rigorous version of the above, using IRT modeling of the probability of picking each specific incorrect answer, given the student’s latent ability and response-specific parameters. Results in a table showing the ranking of the answer choices, and you can then plot the probability of choosing each incorrect answer versus ability. Useful for studying why students pick specific incorrect answers.
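The paper’s exact parameterization isn’t reproduced here; as a hedged sketch, the classic nominal response model captures the same idea of response-specific parameters: each answer choice gets its own slope and intercept, and the choice probabilities are a softmax over ability. The parameter values below are illustrative only.

```python
import numpy as np

def choice_probs(theta, slopes, intercepts):
    """Nominal-response-style model: probability of each answer choice
    at ability theta, with one (slope, intercept) pair per choice."""
    z = slopes * theta + intercepts   # one score per answer choice
    z -= z.max()                      # numerical stabilization of the softmax
    expz = np.exp(z)
    return expz / expz.sum()

# Illustrative parameters for a 4-option question (index 0 = correct answer).
slopes = np.array([1.5, -0.2, -0.6, -0.7])
intercepts = np.array([0.5, 0.8, 0.3, -1.6])
print(choice_probs(-1.0, slopes, intercepts))  # low ability: distractors dominate
print(choice_probs(2.0, slopes, intercepts))   # high ability: correct answer dominates
```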
Lee, Y.-J., Palazzo, D. J., Warnakulasooriya, R., & Pritchard, D. E. (2008). Measuring student learning with item response theory. Physical Review Special Topics - Physics Education Research, 4(1), 010102. doi:10.1103/physrevstper.4.010102
An attempt to make a pre/post version of IRT. In the pre-test, estimate the student ability and question difficulty parameters. Re-fit on the post-test with student ability s_s + \delta s_i, allowing ability to increase or decrease for each question while holding the question difficulties constant (sketched below). (This is a bit odd because \delta s_i depends on the question i, not the student s.) They used this to analyze how student responses to online homework questions changed when the system gave them multiple attempts and hints.
I suppose you could, alternatively, do the same with s_s + \delta s_s and see how individual student abilities change, though you would lose the per-question resolution. But then the number of parameters grows with the number of students, rather than remaining fixed as more students take the test.
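A sketch of what estimating the per-question shift might look like under my reading of this setup (not code from the paper): hold the pre-test ability and item estimates fixed and maximize the post-test likelihood over \delta s_i, one question at a time. The per-student variant from the previous paragraph would instead index the shift by student and sum over that student’s questions.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def neg_log_lik_gain(delta, x_post, abilities, difficulty, discrimination):
    """Negative post-test log-likelihood for one question's shift delta,
    with pre-test abilities and item parameters held fixed."""
    p = 1.0 / (1.0 + np.exp(-discrimination * (abilities + delta - difficulty)))
    return -np.sum(x_post * np.log(p) + (1 - x_post) * np.log(1 - p))

# For one question, given its post-test responses x_post and pre-test estimates:
# delta_hat = minimize_scalar(neg_log_lik_gain,
#                             args=(x_post, abilities, difficulty, discrimination)).x
```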