Item Response Theory (IRT)

Item response theory (IRT) is a collection of statistical models and methods used for two broad purposes in the measurement of health outcomes: item analysis and scale scoring.
 
The family of IRT models describes, in probabilistic terms, the relationship between a person's response to a survey question and his or her standing on the construct (e.g., emotional distress) being measured by the scale. Specifically, IRT models predict the probability of choosing each response category as a function of an underlying, unobserved trait and item parameters.
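To make this concrete, the sketch below shows the simplest common case: the two-parameter logistic (2PL) model for a binary item, in which the probability of endorsing the item is a logistic function of the trait level theta, the item's discrimination (slope) a, and its difficulty (threshold) b. The parameter values are hypothetical and chosen only for illustration.

```python
import math

def prob_endorse_2pl(theta: float, a: float, b: float) -> float:
    """Two-parameter logistic (2PL) IRT model: probability that a person
    at trait level theta endorses a binary item with discrimination a
    and difficulty (threshold) b."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# Hypothetical item with discrimination a = 1.5 and difficulty b = 0.0.
# The endorsement probability rises as the trait level theta increases.
for theta in (-2.0, 0.0, 2.0):
    print(theta, round(prob_endorse_2pl(theta, a=1.5, b=0.0), 3))
```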
 
For item analysis, the IRT model characterizes each scale item with a set of properties that describe its ability to discriminate among individuals at different levels along a trait continuum. For scale scoring, IRT uses the full information in a person's responses to estimate his or her standing on the measured construct: it estimates a score along the continuum of that construct for any particular sequence of item responses. A person's score estimate usually includes a measure of central tendency and a description of variability reported as a standard error of measurement. Because the IRT scale score can be computed from only the item parameters and a single individual's responses to any arbitrarily selected set of items, it forms the basis for computer adaptive testing. IRT models come in many varieties (more than 100) and can handle unidimensional as well as multidimensional data, binary and polytomous response data, and ordered as well as unordered response data.
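The following is a minimal sketch of expected a posteriori (EAP) scoring, one common way to obtain both the score estimate and its standard error from nothing more than the item parameters and one person's responses. It assumes binary items under the 2PL model, a standard normal prior for the trait, and hypothetical item parameters and response pattern.

```python
import math

def likelihood_2pl(theta, responses, items):
    """Likelihood of a binary response pattern under the 2PL model.
    `items` is a list of (a, b) pairs; `responses` contains 0/1 values."""
    lik = 1.0
    for x, (a, b) in zip(responses, items):
        p = 1.0 / (1.0 + math.exp(-a * (theta - b)))
        lik *= p if x == 1 else (1.0 - p)
    return lik

def eap_score(responses, items, n_points=81):
    """Expected a posteriori (EAP) trait estimate and its standard error,
    computed on a grid over theta with a standard normal prior."""
    grid = [-4.0 + 8.0 * i / (n_points - 1) for i in range(n_points)]
    prior = [math.exp(-0.5 * t * t) for t in grid]          # N(0, 1) prior, unnormalized
    post = [likelihood_2pl(t, responses, items) * w for t, w in zip(grid, prior)]
    total = sum(post)
    mean = sum(t * p for t, p in zip(grid, post)) / total    # score estimate
    var = sum((t - mean) ** 2 * p for t, p in zip(grid, post)) / total
    return mean, math.sqrt(var)                               # estimate, standard error

# Hypothetical item bank (a, b) and response pattern (1 = symptom endorsed).
items = [(1.2, -1.0), (1.8, 0.0), (0.9, 0.5), (1.5, 1.2)]
theta_hat, se = eap_score([1, 1, 0, 0], items)
print(round(theta_hat, 2), round(se, 2))
```

Because the estimate depends only on the responses actually administered, the same routine works whether a person answers the full item bank or an adaptively selected subset.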
 
The most commonly applied IRT models in health outcomes measurement are the unidimensional parametric family of polytomous-response models, which include the Rating Scale Model, the Partial Credit Model, the Generalized Partial Credit Model, the Graded Response Model, and the Nominal Model. Each differs in the number of item parameters estimated for each scale item and in the constraints placed on the model or data. The item parameters define how well an item performs in measuring different levels of the measured construct or trait, such as fatigue. The threshold (or difficulty) parameter describes where along the trait continuum an item is most informative for differentiating between lower and higher levels of functioning. The slope parameter describes the strength of an item for discriminating among different levels of the underlying construct. Discrimination is related to precision: the better an item discriminates among individuals at different levels of the construct, the more precision it adds for measuring a person's trait level.
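As an illustration of how the threshold and slope parameters work together for a polytomous item, the sketch below computes category probabilities under the Graded Response Model: each threshold defines a cumulative logistic curve with a common slope, and adjacent cumulative probabilities are differenced to give the probability of each ordered response category. The four-category fatigue item and its parameter values are hypothetical.

```python
import math

def grm_category_probs(theta, a, thresholds):
    """Graded Response Model: probability of each ordered response category
    for a person at trait level theta, given the item's slope a and ordered
    thresholds b_1 < b_2 < ... (one fewer than the number of categories)."""
    # Cumulative probabilities P(X >= k): 1 for the lowest category,
    # a logistic curve at each threshold, and 0 beyond the highest category.
    cum = [1.0] + [1.0 / (1.0 + math.exp(-a * (theta - b))) for b in thresholds] + [0.0]
    # Category probabilities are differences of adjacent cumulative curves.
    return [cum[k] - cum[k + 1] for k in range(len(thresholds) + 1)]

# Hypothetical 4-category fatigue item: slope 1.6, thresholds -1.0, 0.2, 1.4.
print([round(p, 3) for p in grm_category_probs(0.0, 1.6, [-1.0, 0.2, 1.4])])
```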