
Chapter 36

An Introduction to Item Response Theory Models and Their Application in the Assessment of Noncognitive Traits
Steven P. Reise and Tyler M. Moore

Item response theory (IRT; Embretson & Reise, 2000; Hambleton, Swaminathan, & Rogers, 1991; de Ayala, 2009) refers to a class of mathematical models relating individual differences on one or more latent variables to the probability of responding to a scale item in a specific way. A response of "3" on a 5-point personality item, a correct answer on a college entrance exam item, and a clinician's rating of an adolescent's anxiety are all item responses that can potentially be related (probabilistically) to a latent psychological variable. IRT models, which focus on characterizing how individual differences on a latent variable interact with item properties to produce a response, contrast sharply with classical test theory (Lord & Novick, 1968) procedures, which focus on understanding the statistical properties of a composite scale score (e.g., estimating reliability).

Development of IRT models and associated methods was originally motivated by applied problems in large-scale multiple-choice aptitude testing (e.g., how to efficiently administer different test items to individuals but still compare them on the same scale, how to link different sets of items measuring the same construct onto the same scale). Lately, however, applications of IRT models to personality, psychopathology, and patient-reported outcomes measurement have been increasing (see Reise & Waller, 2009, for a review). Given the authors' interests and expertise, this chapter describes IRT models mostly through the lens of these latter, noncognitive measurement contexts.

Regardless of context, researchers from a variety of fields have been keenly interested in the potential of IRT modeling as an alternative to traditional psychometric approaches to scale construction, item analysis, scale administration, and scoring individual differences. As reviewed in Reise, Ainsworth, and Haviland (2005), IRT models potentially offer many attractive features. For example, through inspection of item and scale information functions, a researcher can gain a better understanding of how well an item, or scale, functions (e.g., measurement precision) across different ranges of a latent variable. This inspection is often done graphically. Moreover, because of the IRT item and person parameter invariance properties, IRT models can be used to either place items from different instruments onto a common scale, or place individuals who responded to different items onto a common scale. In turn, this facilitates the analysis of differential item functioning across demographic groups (i.e., do the items measure the same latent variable in the same way across different groups?) as well as the creation of item banks that can be administered efficiently via computerized adaptive testing (Wainer, 2000).

This work was supported by the Consortium for Neuropsychiatric Phenomics and National Institutes of Health (NIH) Roadmap for Medical Research
Grants UL1-DE019580 (Robert Bilder, Principal Investigator [PI]) and RL1DA024853 (Edythe London, PI). Additional research support was obtained
through NIH Roadmap for Medical Research Grant AR052177 (David Cella, PI), National Cancer Institute (NCI) Grant 4R44CA137841-03 (Patrick
Mair, PI) for item response theory software development for health outcomes and behavioral cancer research, and through Institute of Educational
Sciences Grant 00362556 (Noreen Webb, Program Director). The content is solely the responsibility of the authors and does not necessarily represent
the official views of the NCI or the NIH.

DOI: 10.1037/13619-037
APA Handbook of Research Methods in Psychology: Vol. 1. Foundations, Planning, Measures, and Psychometrics, H. Cooper (Editor-in-Chief)

Nevertheless, this chapter neither focuses on the virtues of IRT nor compares IRT with traditional psychometric procedures. Such articles are plentiful (see, e.g., Embretson, 1996; Reise & Henson, 2003). Rather, this chapter is divided into two sections. In the first, we describe commonly applied unidimensional IRT models that are appropriate for dichotomous or polytomous item response data; space considerations prevent us from extending these models to the multidimensional case. Our primary goal in this section is to inform readers of the most popular IRT models, their origin, and the interpretation of parameters. In short, this first section is oriented toward researchers who are relatively unfamiliar with modern measurement theory and who desire a basic understanding of IRT modeling options.

In the second section, we discuss conceptual and technical issues that arise in the application of IRT models to psychological data. Among the topics we consider are (a) the applicability of IRT models across various types of constructs and domains, (b) sample size and model selection issues, and (c) some lessons learned thus far from the research literature on the application of IRT to noncognitive measures. This second section is oriented toward both novice researchers who are considering applying IRT to their data and more experienced investigators who may not have considered some of the issues raised herein.

Many key topics are not addressed. For example, technical topics (such as item and person parameter estimation) and applications of IRT models are not covered, including using IRT (a) to score individuals on a latent variable, (b) to evaluate measures for differential item functioning, or (c) as a basis for computerized adaptive testing. Interested readers are referred to Embretson and Reise (2000); Hambleton, Swaminathan, and Rogers (1991); and de Ayala (2009) for treatment of these topics. Finally, given space constraints, we cannot do justice to the seemingly endless variety of alternative IRT models, such as nonparametric (Sijtsma & Molenaar, 2002), unfolding (Roberts, Donoghue, & Laughlin, 2000), explanatory (De Boeck & Wilson, 2004), hierarchical (Fox & Glas, 2001), or sequential-steps models (Tutz, 1990).

Unidimensional Dichotomous IRT Models

IRT modeling begins with a Person (i = 1 . . . I) × Item (j = 1 . . . J) matrix of item responses (X). When items are dichotomously scored, such as correct versus incorrect or endorsed versus not endorsed, the item response matrix consists entirely of zeros and ones. Given this matrix, the chief objective of IRT modeling is to fit a mathematical function that characterizes the relation between individual differences on an assumed latent variable (labeled θ) and the probability of endorsing an item. Herein, for dichotomous items, this function will be termed an item response curve (IRC). The goal of fitting IRT models is to find a model such that the estimated IRC best represents or fits the observed item response data. In this section, we provide a detailed review of the most commonly observed unidimensional IRT models for describing dichotomous item response data. We proceed slowly in developing these models for the sake of audiences with little prior exposure to IRT.

As noted, the basic goal of IRT modeling is to find a function that relates an individual's standing on a latent variable with the probability of endorsing an item. One such IRC must be found for each scale item. In considering an appropriate model, note that as individual trait levels increase, the probability of item endorsement should increase monotonically. Stated differently, groups of individuals who are higher on the latent variable measured by an item should have higher item endorsement rates relative to groups of individuals who are lower on the latent variable.

At first blush, the above observation may suggest that a straight-line function could be used to describe the relation between the latent variable and item endorsement probability. However, a straight-line function will not suffice because probabilities are bounded between zero and one, and any line will eventually predict values above one as the latent variable increases and predict values below zero as the latent variable decreases. Alternatively, a function that increases monotonically and is bounded between zero and one is the cumulative function of the normal distribution:


$$P(x = 1 \mid \theta) = \Phi(\theta) = \int_{Z=-(\theta-\mu)/\sigma}^{\infty} \frac{1}{\sqrt{2\pi}}\, e^{-Z^{2}/2}\, dz. \tag{1}$$

Equation 1 states that to find the probability of endorsement, conditional on a latent variable, we need to integrate (find the area under the curve) between −Z and infinity. As −Z becomes more negative, proportions increase; as −Z becomes more positive, proportions decrease. To clarify, in Equation 1, θ represents individual differences on a latent variable. (For now, think of θ as a standardized raw scale score.) The μ and σ parameters in Equation 1 are the population mean and standard deviation of a normal distribution. The term (θ − μ)/σ thus indicates how many standard deviation units an individual's level on the latent variable is relative to the mean. When different values of μ or σ are plugged into Equation 1, the resulting cumulative distribution takes different forms. By decreasing the standard deviation, we can produce a steeper curve, or by increasing the standard deviation, we can decrease the acceleration of the ogive. In turn, the mean always corresponds to the point on the latent continuum at which the proportion is .50.

To illustrate, we examined a large set of item response data drawn from a 24-item measure of adolescent Social Discomfort (Williams, Butcher, Ben-Porath, & Graham, 1992) administered to a large sample. Parts of this data set were previously analyzed in Reise and Waller (2003). For each item, we began our analysis by simply summing raw item responses (not including the item under study) and then standardizing those raw scale scores. We then plotted item endorsement proportions (i.e., means) for groups of individuals with similar z scores. These empirical IRCs are shown for Items 16, 22, and 18 in the three panels of Figure 36.1, top, middle, and bottom, respectively. We then estimated best-fitting two-parameter cumulative normal ogives (2PNO), and these estimated IRCs are shown as solid lines in the figures. Clearly, these lines appear to describe the data quite nicely. Specifically, the three curves are cumulative ogives from three different normal distributions with means and standard deviations of .17, .66 (top panel); .58, 2.17 (middle panel); and 2.19, 1.78 (bottom panel).

This example illustrates simple use of the cumulative normal distribution to model the proportions of individuals endorsing an item as a function of the latent variable. By using different normal distributions and their associated cumulative ogives, we can accommodate items regardless of difficulty or strength of relationship to the latent variable. Yet, Equation 1 is not the actual normal-ogive model referred to in the IRT literature. For that, we need to rearrange and redefine some terms. We start by relabeling the mean (μ) as an item location parameter (b) and defining a slope parameter (a; sometimes called a discrimination parameter) as the reciprocal of the standard deviation: a = 1/σ. Replacing and rearranging terms leaves us with Equation 2:

$$P(x_{ij} = 1 \mid \theta) = \Phi(\theta) = \int_{Z=-a(\theta-b)}^{\infty} \frac{1}{\sqrt{2\pi}}\, e^{-Z^{2}/2}\, dz. \tag{2}$$

Equation 2 makes clear that the probability of endorsing an item is now a function of an individual's standing on a latent variable and two item properties, location and slope (to be defined more sharply in the following paragraphs). Individuals with trait level values above the item location will have high probabilities of endorsing the item, and conversely, individuals with trait levels below the item location will have low probabilities of item endorsement.

Finally, note that Equation 2 is in slope and location form, meaning that to find the probability of item endorsement given θ, we must know this particular item's slope (a) and location (b). We can rewrite Equation 2 to be in slope and intercept form by defining an intercept d as −ab:

$$P(x_{ij} = 1 \mid \theta) = \Phi(\theta) = \int_{Z=-(a\theta+d)}^{\infty} \frac{1}{\sqrt{2\pi}}\, e^{-Z^{2}/2}\, dz. \tag{3}$$

Equations 2 and 3 express exactly the same thing; both are two-parameter normal-ogive IRT (2PNO) models. Equation 2 is the parameterization most frequently reported and discussed, but Equation 3 is critically important to understand because (a) it is the model that is most easily transformed into a factor analytic model (i.e., slopes transformed into factor loadings and intercepts transformed into factor thresholds; McLeod, Swygert, & Thissen, 2001, p. 199), and thus (b) it is the model that generalizes most easily to multidimensional models (i.e., models with more than one latent variable), and (c) it is the parameterization of the model that is estimated in most popular IRT software programs (Baker & Kim, 2004).
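To make the two parameterizations concrete, the following sketch (ours, not part of the original chapter; it assumes NumPy and SciPy are available) evaluates Equations 2 and 3 for Item 1 of Table 36.1 and confirms that the slope-and-location and slope-and-intercept forms trace the same IRC.

```python
# A sketch of the 2PNO model (Equations 2 and 3). The standard-normal CDF
# replaces the integral; item values are Item 1's normal-ogive estimates
# from Table 36.1.
import numpy as np
from scipy.stats import norm

def p_2pno_slope_location(theta, a, b):
    """Equation 2: P(x = 1 | theta) in slope-and-location form."""
    return norm.cdf(a * (theta - b))

def p_2pno_slope_intercept(theta, a, d):
    """Equation 3: P(x = 1 | theta) in slope-and-intercept form, d = -a*b."""
    return norm.cdf(a * theta + d)

theta = np.linspace(-3, 3, 7)
a, b = 0.74, 0.72            # Item 1 of Table 36.1
d = -a * b                   # implied intercept, approximately -.53

# Both parameterizations produce identical endorsement probabilities.
assert np.allclose(p_2pno_slope_location(theta, a, b),
                   p_2pno_slope_intercept(theta, a, d))
print(p_2pno_slope_location(theta, a, b).round(3))
```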


Figure 36.1.  Empirical (dots) and estimated (solid lines) item response curves for three social discomfort items.



Thus far, we have been thinking of θ as a standardized scale score. Of course, in IRT applications, θ is a latent variable with an arbitrary scale, and an individual's position on that latent variable needs to be estimated. In most IRT applications, the scale of the latent variable is identified by specifying that the mean in the population is zero and the standard deviation is one. As a consequence, a person's standing on the latent variable is like a z score (if we assume normality of the latent variable), and importantly, the item parameters (location and slope) are estimated in reference to this latent trait scale. If we identified the latent variable in a different way, say using a mean of 500 and standard deviation of 100, the scale for the item parameters would change accordingly.

To illustrate and better define the 2PNO, Table 36.1 shows item parameter estimates for the 24 Adolescent Social Discomfort items noted earlier. Item parameters were estimated using marginal maximum likelihood as implemented by the BILOG-MG software program (Zimowski, Muraki, Mislevy, & Bock, 2003). The first two columns of Table 36.1 report the item–test correlations and the item means (proportions endorsed). Clearly, items do not vary greatly in their response rates—there are no highly endorsed items—but they do vary in item–test correlation. We note that the raw score distribution was highly skewed positively; most people score low on the measure.

The third column of Table 36.1 reports the estimated item intercepts (−ab). Generally speaking, intercepts tend to range between −3 and 3, with high values representing easy (high proportion endorsed) items and negative values reflecting difficult (low proportion endorsed) items. Technically, the intercept corresponds to the z score on a normal distribution that cuts off the predicted proportion of endorsements for individuals at θ = 0. If the latent variable is perfectly normally distributed, the intercept is nothing more than the z-score cutpoint corresponding to the proportion who endorsed the item in the calibration sample.

Table 36.1

Descriptive Statistics and Two-Parameter IRT Parameter Estimates of the 24-Item Social Discomfort Scale

Columns: Item; item–test correlation; proportion endorsed; normal ogive (IRT) Int., Location, Slope; logistic (IRT) Int., Location, Slope.
1 .44 .31 −.53 .72 .74 −.90 .72 1.26
2 .61 .35 −.51 .36 1.41 −.87 .36 2.39
3 .36 .51 .08 −.13 .59 .13 −.13 1.00
4 .35 .56 .22 −.40 .54 .37 −.40 .92
5 .56 .41 −.23 .20 1.13 −.39 .20 1.92
6 .45 .41 −.21 .27 .77 −.36 .27 1.32
7 .42 .29 −.62 .88 .70 −1.05 .88 1.19
8 .52 .37 −.38 .38 .98 −.64 .38 1.67
9 .39 .54 .18 −.27 .65 .30 −.27 1.10
10 .53 .26 −.84 .79 1.06 −1.43 .79 1.80
11 .43 .11 −1.67 1.64 1.02 −2.84 1.64 1.74
12 .29 .20 −.90 1.86 .48 −1.53 1.86 .82
13 .60 .29 −.82 .59 1.40 −1.40 .59 2.39
14 .43 .38 −.31 .43 .71 −.52 .43 1.21
15 .57 .33 −.55 .46 1.21 −.94 .46 2.06
16 .62 .41 −.26 .17 1.51 −.44 .17 2.57
17 .35 .50 .04 −.07 .56 .07 −.07 .95
18 .29 .14 −1.22 2.19 .56 −2.08 2.19 .95
19 .54 .33 −.56 .52 1.09 −.96 .52 1.86
20 .50 .39 −.29 .31 .93 −.49 .31 1.58
21 .52 .23 −1.00 .94 1.07 −1.70 .94 1.81
22 .31 .39 −.27 .58 .46 −.45 .58 .79
23 .36 .44 −.12 .22 .56 −.21 .22 .96
24 .45 .40 −.24 .31 .76 −.41 .31 1.30

Note. Int. = intercept; IRT = item response theory.


Consider, for example, that the intercepts in Table 36.1 are very close to the z-score cutpoints that correspond to the proportions endorsed. For Item 1, the proportion endorsed is 31%. The z score that cuts off 31% from 69% is −0.49, very close to the estimated intercept of −.53. For Item 4, the endorsement rate is 56%. The z score that cuts off 56% from 44% is .16, which is again close to the estimated intercept of .22.

The fourth column of Table 36.1 shows the estimated item location parameters. These parameters are expressed on the same scale as the latent variable and typically range between −3 and 3. In this two-parameter model, the location parameter reflects the point on the latent variable continuum at which the endorsement rate is .50. Finally, the fifth column shows the estimated slope or discrimination parameters. Slope parameters indicate the steepness of the IRC in the area around the item location; higher slopes indicate more discriminating items. More technically, Equation 2 reveals that the slope in the normal-ogive metric, which typically ranges between .5 and 1.2, determines how fast −Z changes as the latent variable changes. Finally, observe that the slope is nearly perfectly correlated with the item–test correlation.

Before moving on, a few comments are in order regarding the intercept and location parameters. First, as noted, the intercept is essentially a simple transformation of the observed proportion endorsed into a z-score metric. In turn, because b = −(d/a), items can have equal intercepts but different locations if the items have different slopes. This observation calls into question the use of the common label difficulty for the location parameter, b; it appears more reasonable to consider the intercepts as the difficulty of an item. Moreover, even if item parameters from multiple groups are linked on a common metric, a researcher cannot simply compare estimated location parameters across two or more groups without first establishing that the slopes are invariant. This is much like multiple-group confirmatory factor analysis, which requires first establishing loading invariance before investigating threshold invariance.

The normal-ogive model has the advantages of being familiar and easily related to the factor analysis of ordinal items, and it can be extended to multiple dimensions. However, there are some well-known mathematical difficulties in estimating model parameters. For this reason, most IRT applications estimate the IRC using a logistic-ogive model in place of the normal ogive. In fact, some statistical estimation software (MULTILOG; Thissen, 2003) only includes logistic models. Equations 4 and 5 show the two-parameter logistic model (2PLM) in slope-and-location and slope-and-intercept forms, respectively.

$$P(x = 1 \mid \theta) = \Psi(\theta) = \frac{\exp[1.7a(\theta-b)]}{1+\exp[1.7a(\theta-b)]}. \tag{4}$$

$$P(x = 1 \mid \theta) = \Psi(\theta) = \frac{\exp(1.7a\theta+d)}{1+\exp(1.7a\theta+d)}. \tag{5}$$

As written in Equations 4 and 5, because of the inclusion of the constant 1.7 in the model, these are actually two-parameter logistic approximations of the normal-ogive model. The purpose of including this scaling factor is so that the slope parameter estimated in the logistic model is the same as its value estimated in the normal-ogive model. If the 1.7 scaling factor were not included in the model, the slope parameter in the pure logistic model would be 1.7 times higher than its value in the normal ogive. To demonstrate the equivalence of the logistic and normal-ogive models, in Table 36.1 we also display the estimated parameters from the logistic model using BILOG-MG (Zimowski et al., 2003).

Interpretation of the item parameters under the 2PLM remains essentially the same as under the 2PNO model. As usual, the scale for the latent variable is arbitrary and must be identified by setting the mean and standard deviation; values of zero and one are typically chosen. The location parameter remains the point on the latent continuum at which the proportion endorsed is .50; these values are the same in the 2PNO and 2PLM. The slope indicates how rapidly response rates change as a function of the latent variable in the area of the item location. The slopes in a pure logistic model (no D = 1.7) are 1.7 times higher than in the normal ogive.
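Both of these claims are easy to check numerically. The sketch below (ours; NumPy and SciPy assumed) verifies, for Item 1 of Table 36.1, that the 1.7-scaled logistic IRC stays within .01 of the normal ogive and that the estimated intercept approximates the z-score cutpoint for the proportion endorsed.

```python
# A sketch verifying two claims from the text with Item 1's Table 36.1 values.
import numpy as np
from scipy.stats import norm

def p_2plm(theta, a, b, D=1.7):
    """Equation 4: logistic approximation with scaling constant D = 1.7."""
    z = D * a * (theta - b)
    return np.exp(z) / (1 + np.exp(z))

theta = np.linspace(-4, 4, 161)
a, b = 0.74, 0.72                    # Item 1, normal-ogive metric

# 1. With D = 1.7, the logistic and normal-ogive IRCs differ by less than .01.
gap = np.abs(p_2plm(theta, a, b) - norm.cdf(a * (theta - b)))
print(gap.max())                     # stays below .01 at every theta

# 2. The intercept approximates the z score cutting off the proportion
#    endorsed: 31% endorsed Item 1, and norm.ppf(.31) is about -0.50,
#    close to the estimated intercept of -.53.
print(norm.ppf(0.31))
```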


Reduced and Expanded Dichotomous Models

The 2PLM and 2PNO allow IRCs to vary in two ways: location and slope. Other IRT models for dichotomous responses can be viewed either as expansions of these models or as nested models derived by placing restrictions on the item slopes. For example, if we retain the identification constraint that the mean of the latent variable is zero with standard deviation of one in the population, we can impose the constraint that all the items have the same slope parameter. In Table 36.2, we show the Social Discomfort items under a logistic model with the equal slope constraint. This is called the one-parameter logistic model (1PLM). In this model, the IRCs will not intersect because they are all constrained to have the same slope. Moreover, in the 1PLM, the probability of endorsement is solely a function of the difference between an individual's trait standing and the item's location, weighted by a constant (the item slope).

The 1PLM, although similar, is not to be confused with the Rasch model (Bond & Fox, 2007). In a Rasch model, it is customary to impose the constraint that all the slopes are equal to one, and thus the slope parameter disappears from the model. In turn, to accommodate this constraint, one needs to free up the variance of the latent variable by estimating it rather than fixing its value (a variety of other identification constraints are possible). Over the past 30 years, many researchers have been championing the potential virtues of Rasch models (e.g., sufficient statistics for estimating item parameters, nonintersecting IRCs, specific objectivity), and an equal number have been questioning their utility in real-world psychological data. Such debates are beyond the present treatment (see Borsboom, 2005, for extended discussion), but two features of Rasch modeling are worth noting.

Table 36.2

1PLM and 3PLM Parameter Estimates for the Social Discomfort Scale

Columns: Item; 1PLM Intercept, Slope, Location; 3PLM Intercept, Slope, Location, c-P.
1 −0.92 1.34 0.69 −0.63 0.85 0.74 0.00
2 −0.69 1.34 0.51 −0.99 1.96 0.51 0.03
3 0.18 1.34 −0.14 −1.63 2.47 0.66 0.34
4 0.46 1.34 −0.34 −0.34 0.90 0.38 0.27
5 −0.35 1.34 0.26 −0.40 1.24 0.32 0.00
6 −0.35 1.34 0.26 −0.30 0.86 0.35 0.00
7 −1.09 1.34 0.82 −0.71 0.80 0.88 0.00
8 −0.59 1.34 0.44 −0.52 1.10 0.47 0.00
9 0.35 1.34 −0.26 0.10 0.66 −0.16 0.01
10 −1.28 1.34 0.96 −1.04 1.28 0.81 0.00
11 −2.58 1.34 1.93 −1.98 1.44 1.37 0.00
12 −1.75 1.34 1.31 −0.96 0.58 1.66 0.00
13 −1.09 1.34 0.82 −1.95 2.98 0.65 0.04
14 −0.53 1.34 0.40 −0.48 0.86 0.55 0.03
15 −0.80 1.34 0.60 −1.22 2.05 0.59 0.05
16 −0.35 1.34 0.27 −0.77 2.08 0.37 0.04
17 0.11 1.34 −0.08 −1.07 1.60 0.67 0.30
18 −2.31 1.34 1.73 −1.31 0.70 1.87 0.00
19 −0.85 1.34 0.64 −1.82 2.74 0.67 0.09
20 −0.47 1.34 0.35 −1.39 2.29 0.61 0.15
21 −1.52 1.34 1.14 −1.28 1.40 0.92 0.01
22 −0.48 1.34 0.36 −1.58 1.61 0.98 0.25
23 −0.20 1.34 0.15 −0.36 0.74 0.48 0.09
24 −0.40 1.34 0.30 −0.53 1.04 0.51 0.07

Note. 1PLM = one-parameter logistic model; 3PLM = three-parameter logistic model; c-P = c parameter.
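Returning to the 1PLM, a small sketch (ours; NumPy assumed) illustrates the noncrossing property using the common slope of 1.34 and two locations from the 1PLM columns of Table 36.2.

```python
# A sketch of the 1PLM: every item shares the slope 1.34 from Table 36.2, so
# IRCs differ only in location and never intersect.
import numpy as np

def p_1plm(theta, b, a=1.34):
    z = a * (theta - b)
    return np.exp(z) / (1 + np.exp(z))

theta = np.linspace(-4, 4, 81)
easy = p_1plm(theta, b=0.69)    # Item 1
hard = p_1plm(theta, b=1.93)    # Item 11

# The higher-location item is endorsed less at every theta: no intersection.
print(bool(np.all(hard < easy)))   # True
```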


First, Rasch modeling, by its nature, emphasizes the meaningfulness and interpretability of the latent variable metric and items arrayed along that dimension. This is a laudable goal and contrasts sharply with most psychometric work in its consideration of the unit of measurement. Second, as a general philosophy, following the factor analytic tradition, IRT modeling typically focuses on finding a model that best fits the data. On the other hand, a Rasch model is usually considered the only real measurement model, and the goal of research is to find a set of items that provide responses fitting that model. In short, the philosophies underlying application of Rasch and non-Rasch models can be very different (Wilson, 2005).

Moving beyond restricted models, we now consider models that add parameters to the 2PLM (or 2PNO). Consider that on multiple-choice tests, it is arguable that the 2PLM is inadequate to describe the item response process because individuals may produce a correct response by chance, regardless of their levels on the latent variable. Notice that in a two-parameter model, the IRC has a lower asymptote of zero (very low-scoring individuals have zero chance of answering the item correctly) and an upper asymptote of one (very high-scoring individuals always get the item right). Thus, the model may be unrealistic in describing certain types of multiple-choice tests. Moreover, some personality researchers have argued that a two-parameter model is inadequate because individuals at low levels of the latent variable may have a nonzero probability of endorsing the item (Rouse, Finger, & Butcher, 1999). To accommodate this fact in either cognitive or noncognitive measurement, the canonical 2PLM can be expanded to include a lower-asymptote or pseudoguessing parameter, as in Equation 6:

$$P(x_{ij} = 1 \mid \theta) = c + (1-c)\,\frac{\exp[a(\theta-b)]}{1+\exp[a(\theta-b)]}. \tag{6}$$

In this three-parameter logistic model (3PLM), the lower asymptote parameter (c) places a lower boundary on the IRC. For example, if c is estimated to be .20, then regardless of how low an individual's standing is on the latent variable, the IRC will go no lower than .20. The location (b) in Equation 6 is no longer the point on the latent trait continuum at which the probability of endorsing is .50. Rather, the probability of endorsing at θ = b is (1 + c)/2 (Hambleton & Swaminathan, 1985).

To illustrate, the last set of columns in Table 36.2 shows the 3PLM parameters estimated for the Adolescent Social Discomfort data (logistic model). For most items, there is no c parameter, indicating that as levels on the latent variable decrease, response probabilities go to zero. However, there are several items for which there is evidence for a c parameter (i.e., Items 3, 4, 17—three items with relatively high proportions endorsed—and 22). For example, Item 22 is interesting in that it has somewhat extreme content, abbreviated as "spend most of spare time alone" (Ben-Porath, Tellegen, & Kaemmer, 2005), which is keyed positively for Social Discomfort. Thus, the 3PLM tells us that even individuals low on Social Discomfort will say "yes" to this item approximately 25% of the time.

Moving beyond the 3PLM, Reise and Waller (2003) and Waller and Reise (2010) considered a 4PLM that includes an upper asymptote parameter (γ) as well as a lower asymptote for psychopathology items:

$$P(x = 1 \mid \theta) = c + (\gamma-c)\,\frac{\exp[a(\theta-b)]}{1+\exp[a(\theta-b)]}. \tag{7}$$

This model was motivated by inspection of empirical IRCs and by theoretical considerations. For example, even in clinical populations (such as depressed populations), it would be unrealistic to expect that 100% of patients display any one symptom (e.g., suicide ideation, hopelessness). Hence, the upper asymptote for an item may not be 100%, as in the 2PLM or 3PLM, but rather some smaller value. Relative to the 2PLM, the interpretation of the item parameters in this 4PLM changes slightly. Specifically, the location (b) is now the point on the latent scale at which the response proportion is (c + γ)/2. We will not attempt to estimate and display parameters for the 4PLM herein. Instead, we refer readers to the original articles for discussion of the possible causes and consequences of upper and lower asymptote parameters in personality and psychopathology data.
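A short sketch (ours; NumPy assumed) evaluates Equations 6 and 7. The 3PLM values are Item 22's estimates from Table 36.2; the 4PLM upper asymptote γ = .95 is a hypothetical value chosen for illustration.

```python
# A sketch of the 3PLM (Equation 6) and 4PLM (Equation 7).
import numpy as np

def logistic(theta, a, b):
    return 1 / (1 + np.exp(-a * (theta - b)))

def p_3plm(theta, a, b, c):
    """Equation 6: lower asymptote c."""
    return c + (1 - c) * logistic(theta, a, b)

def p_4plm(theta, a, b, c, gamma):
    """Equation 7: lower asymptote c, upper asymptote gamma."""
    return c + (gamma - c) * logistic(theta, a, b)

theta = np.array([-3.0, 0.98, 3.0])                 # low, at b, high
print(p_3plm(theta, a=1.61, b=0.98, c=0.25))        # floor .25; (1+c)/2 at b
print(p_4plm(theta, a=1.61, b=0.98, c=0.25, gamma=0.95))  # (c+gamma)/2 at b
```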


Unidimensional Polytomous Item Response Models

Although dichotomously scored multiple-choice tests continue to dominate cognitive assessment, the fields of personality, psychopathology, and patient-reported outcomes assessment rely heavily on measures with ordered multicategory response options. Over the past 10 years, the application of polytomous IRT models has increased, especially in the field of patient-reported outcomes assessment (see Cella et al., 2007, 2010; Reise & Waller, 2009). The goals of this section are to introduce the logic of polytomous IRT models and to highlight important differences. There are many important polytomous models, but here we focus discussion on only a small subset. In addition, we limit the description to only the logistic versions of each model. Readers interested in fuller treatment of the diverse family of polytomous IRT models should consult Nering and Ostini (2010) and Ostini and Nering (2006).

Just as the basic goal of IRT modeling for dichotomous items is to estimate an IRC that best represents the data, the chief objective of polytomous IRT models is to estimate a set of best-fitting category response curves (CRCs). These CRCs model the relation between level on a latent variable and the probability of responding in a particular response category for an item. A central distinction among polytomous IRT models is the distinction between difference models and divide-by-total models (Thissen & Steinberg, 1986). As shown in the next section, difference models require a two-step computation procedure to derive CRCs, whereas in divide-by-total models, only a single equation is needed to derive the CRCs.

Graded-Response Model

We begin by describing the iconic graded-response model (GRM; Samejima, 1969), which appears to be the most frequently applied polytomous IRT model in the noncognitive assessment domain (Reise & Waller, 2009). One way to view the GRM is to think of it as an extension of the 2PLM to the polytomous response case. Consider that, with a dichotomous item, there are only two response options, and thus there is only a single boundary or threshold between a response of zero and one. As a consequence, only a single IRC is needed to describe how increases on the latent variable increase the chances of an individual endorsing the item (i.e., responding one instead of zero).

Now consider an item with four ordered response options (0, 1, 2, 3). This item can be thought of as containing three (number of categories minus one) dichotomies: 0 versus 1, 2, 3; 0, 1 versus 2, 3; and 0, 1, 2 versus 3. When represented in this way, it is easy to understand that the first step in estimating a GRM is to estimate the number of categories minus one threshold response curves (TRCs), one for each of the possible dichotomizations. These TRCs, shown in Equation 8, are simply 2PLM IRCs with equal slopes within an item (but not necessarily between items):

$$P^{*}_{x}(\theta) = \frac{\exp[\alpha(\theta-b_{j})]}{1+\exp[\alpha(\theta-b_{j})]}, \tag{8}$$

where j = 1 . . . number of response categories minus one.

Given that they are 2PLM functions, the TRCs indicate how the probability of responding in or above a given category changes as a function of the latent variable. In other words, for a four-category item in the GRM, each item is described by one item slope parameter (α) and the number of categories minus one location parameters (b_j)—one for each threshold between the response categories. The TRCs are important, but they do not directly yield the desired CRCs. Rather, once the parameters of the TRCs are estimated, computing the conditional category response probabilities for x = 0 . . . 3 requires a second step that is done by subtraction:

$$P_{x}(\theta) = P^{*}_{(x)}(\theta) - P^{*}_{(x+1)}(\theta). \tag{9}$$

By definition, the probability of responding in or above the lowest response category is P*_{(x=0)}(θ) = 1.0, and the probability of responding above the highest response category is P*_{(x=4)}(θ) = 0.0. The CRCs derived from Equation 9 represent the probability of an individual responding in a particular category conditional on the latent variable.

The item parameters in the GRM dictate the shape and location of the TRCs (and thus the CRCs). The higher the slope parameter (α), the steeper the TRCs, and the more narrow and peaked the CRCs, indicating that the response categories differentiate well among individuals at different levels of the latent variable.
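To illustrate the two-step computation, the sketch below (ours; NumPy assumed) builds the TRCs and CRCs of Equations 8 and 9 for a four-category item, using Item 21 of Table 36.3 (introduced below).

```python
# A sketch of the GRM two-step computation (Equations 8 and 9) for a
# four-category item, using Item 21's estimates from Table 36.3.
import numpy as np

def trc(theta, alpha, b):
    """Equation 8: probability of responding in or above a boundary."""
    return 1 / (1 + np.exp(-alpha * (theta - b)))

def grm_crcs(theta, alpha, bs):
    """Equation 9: CRCs as differences between adjacent TRCs."""
    p_star = [np.ones_like(theta)]                  # P* for x >= 0 is 1.0
    p_star += [trc(theta, alpha, b) for b in bs]
    p_star.append(np.zeros_like(theta))             # P* above top category is 0.0
    return [p_star[k] - p_star[k + 1] for k in range(len(bs) + 1)]

theta = np.linspace(-3, 3, 7)
crcs = grm_crcs(theta, alpha=3.43, bs=[0.73, 1.52, 2.31])   # Item 21
print(np.allclose(np.sum(crcs, axis=0), 1.0))               # CRCs sum to 1.0
```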


The location parameters (b_j) determine the location of the TRCs along the latent variable continuum and the point at which each of the CRCs for the middle response options peaks. Specifically, the CRCs peak in the middle of two adjacent location parameters. To illustrate the model, Table 36.3 displays the estimated item parameters for a set of Depression items scored with four response options. These items were drawn from the National Institutes of Health (NIH) PROMIS project (Cella et al., 2007), and item parameters are based on a small subsample drawn from a much larger project (Pilkonis et al., 2011). The darker lines in Figure 36.2 show the TRCs and the CRCs for two items, namely, the two items with the highest (Item 21, "I felt hopeless") and lowest (Item 27, "I felt guilty") estimated slopes.

Nominal Response Model

We now turn our attention to a set of nested models that belong to a distinct class of divide-by-total polytomous IRT models. We begin by introducing the most general direct model, namely, Bock's (1972) nominal response model (NRM). It is called a "nominal" response model because it does not assume that category responses are ordered within an item. Rather, the model treats category ordering as a property to be discovered, as we will see. In contrast, the GRM assumed that response categories are strictly ordered.

Figure 36.2.  Threshold response curves (top) for the most and least discriminating depression items under the
GRM, and CRCs (bottom) for these same items under the GRM (darker lines) and GPCM (lighter lines).


Table 36.3

Graded-Response Model and Generalized Partial Credit Model Parameter Estimates for an Example Depression Scale

Columns: Item; graded-response model α, b1, b2, b3; generalized partial credit model α, δ1, δ2, δ3.
1 3.13 0.81 1.55 2.35 2.99 0.97 1.45 2.12
2 2.64 0.65 1.35 2.28 2.49 0.89 1.19 2.11
3 2.80 0.70 1.47 2.45 3.08 0.90 1.29 2.28
4 1.98 0.29 1.19 2.24 1.58 0.67 1.02 1.93
5 2.73 0.67 1.48 2.58 2.42 0.84 1.31 2.42
6 2.11 0.43 1.30 2.40 1.93 0.77 1.14 2.13
7 2.50 −0.37 0.72 1.94 2.11 −0.27 0.73 1.79
8 2.67 0.97 1.67 2.60 2.46 1.22 1.43 2.39


9 2.16 0.21 1.20 2.42 2.05 0.43 1.09 2.24
10 2.91 0.49 1.17 2.35 2.74 0.70 1.01 2.23
11 2.07 0.08 0.95 2.07 1.96 0.38 0.79 1.86
12 2.40 −0.21 0.74 1.94 2.20 −0.03 0.70 1.80
13 2.34 0.52 1.35 2.32 1.93 0.78 1.21 2.09
14 2.03 0.12 1.00 2.11 1.43 0.45 0.86 1.87
15 2.89 0.09 1.02 1.95 2.80 0.21 1.00 1.76
16 1.88 0.03 1.24 2.49 1.89 0.24 1.16 2.25
17 2.27 −0.10 0.72 1.99 1.98 0.16 0.57 1.88
18 2.30 0.13 1.01 2.20 2.02 0.38 0.87 2.07
19 3.08 −0.27 0.77 1.88 3.23 −0.20 0.76 1.75
20 2.37 1.46 2.20 2.89 2.13 1.77 1.99 2.49
21 3.43 0.73 1.52 2.31 3.16 0.83 1.42 2.12
22 1.86 0.15 1.20 2.44 1.69 0.46 1.08 2.19
23 2.02 0.56 1.48 2.75 2.12 0.88 1.26 2.57
24 2.08 0.32 1.41 2.81 2.18 0.51 1.26 2.65
25 2.04 −0.14 0.86 2.06 1.61 0.10 0.77 1.87
26 2.47 0.60 1.20 2.33 3.11 0.98 0.88 2.20
27 1.79 0.33 1.38 2.58 1.52 0.70 1.19 2.31
28 2.33 −0.01 0.90 1.94 2.03 0.21 0.83 1.74

Note. α denotes an item discrimination parameter, b denotes a threshold, and δ denotes an intersection.

In the NRM, the probability of an individual responding in category x (x = 0 . . . number of categories minus one) on an item, conditional on the latent variable (θ), is

$$P_{x}(\theta) = \frac{\exp(a_{x}\theta + c_{x})}{\sum_{x=0}^{NCAT-1}\exp(a_{x}\theta + c_{x})}. \tag{10}$$

To identify the model, some constraint must be set: Σa_x = Σc_x = 0. This constraint forces one response option (the one with the highest a) to have a monotonically increasing CRC and one response option (the one with the lowest a) to have a monotonically decreasing CRC. In Equation 10, c_x is an intercept parameter for category x, and a_x is the slope of the linear regression of the log-odds of responding in a particular category on the latent variable. The c_x parameters reflect the popularity of a particular response category, where larger values reflect more popular options (Thissen, Steinberg, & Fitzpatrick, 1989).

Simply stated, Equation 10 is called a divide-by-total model because the numerator contains a single conditional probability and the denominator contains a scaling factor, which is the sum of conditional probabilities for each response distinction. In short, the denominator ensures that the sum of the probabilities of each category response conditional on the latent trait equals one.
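The following sketch (ours; NumPy assumed) evaluates Equation 10 for one four-category item. The category slopes are Item 1's NRM estimates from Table 36.4 (presented below); the intercepts are hypothetical values that satisfy the sum-to-zero identification constraint.

```python
# A sketch of Equation 10. Slopes are Item 1's NRM estimates from Table 36.4;
# the intercepts are hypothetical but respect the sum-to-zero identification.
import numpy as np

def nrm_probs(theta, a, c):
    """Divide-by-total category probabilities (Equation 10)."""
    z = np.outer(theta, a) + c              # one column per response category
    ez = np.exp(z)
    return ez / ez.sum(axis=1, keepdims=True)

theta = np.linspace(-3, 3, 7)
a = np.array([-4.54, -1.62, 1.53, 4.64])    # category slopes, Item 1
c = np.array([2.0, 1.0, -0.5, -2.5])        # hypothetical intercepts, sum = 0
p = nrm_probs(theta, a, c)
print(np.allclose(p.sum(axis=1), 1.0))      # each row of probabilities sums to 1
```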


In other divide-by-total models to follow, the denominator will become more complicated but will always serve the exact same purpose.

This description of the NRM does not readily reveal some of its most intriguing properties. However, Thissen et al. (1989) and Thissen, Cai, and Bock (2010) have shown how the NRM can be thought of in terms of the choice between two adjacent response categories. Specifically, if one thinks of an item response as a choice between two options x and x − 1 (3 vs. 2), then the NRM for this choice can be written in the form of the 2PLM:

$$P(x \mid x \text{ or } x-1,\ \theta) = \frac{1}{1+\exp[-a^{*}(\theta-\delta)]}, \tag{11}$$

where a* = a_x − a_{x−1} and δ = (c_{x−1} − c_x)/(a_x − a_{x−1}). That is, the difference between the category slope parameters for adjacent categories functions like a slope parameter in a two-parameter model (Equation 8), indicating how discriminating the choice between two adjacent categories is. Clearly, if a_x is larger than a_{x−1}, then a* is positive, and the response categories are ordered along the latent variable; higher levels of the latent variable mean one is more likely to select x than x − 1.

Thus, inspection of the a* parameters in the NRM allows an empirical test of the ordering of response categories. Preston, Reise, Cai, and Hays (2011) called the a* parameters category boundary discriminations (CBDs). They argued that inspection of these values is important not only to test for the ordering of categories but also to determine whether the item contains too many response categories (e.g., when an a* is near zero). It can also be used to evaluate whether the response categories are equally differentiating. For example, they cited examples of multipoint items for which the CBD between the first and second category is very high but the remaining CBDs are very low, suggesting that only the first response distinction is meaningful. Finally, they argued that a lack of consistency between the CBDs within an item may call into question the application of models that specify only a single item slope parameter (e.g., the generalized partial credit model to be described shortly, or the GRM). For example, Table 36.4 displays the category slopes (a) and the CBDs (a*) for the example Depression items estimated using MULTILOG (Thissen, 2003). In the present data, it appears that all categories (1 = never, 2 = rarely, 3 = sometimes, 4 = often) provide some useful discrimination. However, for some items, such as 3, 11, and 16, there appears to be a trend for the first CBD to be significantly higher than the second or third. If statistical tests (e.g., a Wald test) indicated that the a* within an item varied significantly, one would have to question whether any model that proposed a single item slope is appropriate.

Finally, it can be shown that the category intercepts in the NRM can be transformed into useful and interpretable information by taking c* = (c_{x−1} − c_x)/a* to obtain category intersection parameters. These intersection parameters, of which there are the number of categories minus one, indicate the point on the latent variable scale at which two CRCs intersect. These parameters contrast sharply with the location parameters in the GRM. Specifically, locations in the GRM indicate where the probability of responding in and above a category is .50, whereas intersections indicate where the selection of one response category becomes more likely than the previous category. Moreover, category intersections are not necessarily ordered, but locations in the GRM must be. Unordered category intersections do not indicate that the response options are unordered but rather merely that one response category is never the most likely response for any value of the latent trait.

Generalized Partial Credit Model

It may seem odd to describe Muraki's (1992) generalization of the partial credit model (GPCM) before describing the partial credit model (PCM; Masters, 1982). Our decision to include this model here is based on the fact that the NRM highlighted in the previous section is the most general direct polytomous IRT model. In turn, our goal is to emphasize that Muraki's model is simply a restricted version of the NRM, where item responses are assumed ordered and CBD parameters (a*) are set equal within an item. Thus, although the NRM allows within-item response category distinctions to vary in slope, the GPCM assumes that CBDs are equally differentiating within an item but that items can vary in their overall discrimination.
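Before writing out the GPCM, a brief sketch (ours; NumPy assumed) computes the CBDs and intersections just described from Item 1's category slopes in Table 36.4, paired with the hypothetical intercepts from the earlier NRM sketch.

```python
# A sketch of the a* (CBD) and c* (intersection) transformations.
import numpy as np

a = np.array([-4.54, -1.62, 1.53, 4.64])   # Item 1 category slopes, Table 36.4
c = np.array([2.0, 1.0, -0.5, -2.5])       # hypothetical intercepts (sum to 0)

a_star = np.diff(a)                        # a*_x = a_x - a_(x-1)
c_star = -np.diff(c) / a_star              # c*_x = (c_(x-1) - c_x) / a*_x

print(a_star.round(2))   # 2.92, 3.15, 3.11: matches Table 36.4's a* columns
print(c_star.round(2))   # all a* positive, so the categories are ordered
```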


Table 36.4

Nominal Response Model (NRM) and Generalized Partial Credit Model (GPCM) Slope/Discrimination Parameter Estimates for Depression Items

Columns: Item; NRM category slopes a1, a2, a3, a4; CBDs a*1, a*2, a*3; GPCM slope.
1 −4.54 −1.62 1.53 4.64 2.92 3.15 3.11 2.99
2 −3.54 −0.77 0.69 3.62 2.77 1.46 2.93 2.49
3 −3.82 −0.63 1.26 3.19 3.19 1.89 1.93 3.08
4 −2.48 −0.58 0.93 2.13 1.90 1.51 1.20 1.58
5 −3.86 −0.86 1.24 3.49 3.00 2.10 2.25 2.42
6 −2.73 −0.91 0.92 2.72 1.82 1.83 1.80 1.93
7 −3.65 −1.66 1.14 4.16 1.99 2.80 3.02 2.11
8 −3.43 −0.74 0.95 3.21 2.69 1.69 2.26 2.46


9 −2.94 −0.70 0.76 2.88 2.24 1.46 2.12 2.05
10 −4.15 −1.24 1.15 4.24 2.91 2.39 3.09 2.74
11 −2.75 −0.50 0.83 2.42 2.25 1.33 1.59 1.96
12 −3.17 −1.22 0.94 3.45 1.95 2.16 2.51 2.20
13 −3.18 −0.76 0.93 3.00 2.42 1.69 2.07 1.93
14 −2.55 −0.68 0.85 2.37 1.87 1.53 1.52 1.43
15 −4.24 −1.33 1.65 3.92 2.91 2.98 2.27 2.80
16 −2.66 −0.47 0.82 2.31 2.19 1.29 1.49 1.89
17 −2.97 −0.80 0.74 3.03 2.17 1.54 2.29 1.98
18 −3.13 −0.65 0.94 2.85 2.48 1.59 1.91 2.02
19 −4.89 −1.79 1.69 4.99 3.10 3.48 3.30 3.23
20 −2.83 −0.28 0.41 2.70 2.55 0.69 2.29 2.13
21 −5.19 −1.09 1.68 4.59 4.10 2.77 2.91 3.16
22 −2.37 −0.76 0.79 2.34 1.61 1.55 1.55 1.69
23 −2.60 −0.61 0.93 2.28 1.99 1.54 1.35 2.12
24 −2.92 −0.62 0.72 2.82 2.30 1.34 2.10 2.18
25 −2.74 −0.74 0.80 2.67 2.00 1.54 1.87 1.61
26 −3.32 −0.75 1.05 3.02 2.57 1.80 1.97 3.11
27 −2.21 −0.41 0.78 1.84 1.80 1.19 1.06 1.52
28 −3.19 −0.79 1.14 2.84 2.40 1.93 1.70 2.03

Note. a denotes category slopes and a* denotes category boundary discriminations. CBD = category boundary
discrimination.

The GPCM can be written as

$$P_{x}(\theta) = \frac{\exp\!\left[\sum_{j=0}^{x}\alpha(\theta-\delta_{j})\right]}{\sum_{x=0}^{NCAT-1}\exp\!\left[\sum_{j=0}^{x}\alpha(\theta-\delta_{j})\right]}, \tag{12}$$

where $\sum_{j=0}^{0}\alpha(\theta-\delta_{j}) \equiv 0$.

For the GPCM (see Table 36.3), one unique slope parameter (α) and the number of categories minus one intersection parameters (δ) are estimated for each item. The category intersection parameters (δ) in this model are interpreted in the same way as the c* parameters in the NRM—as the intersection point of two adjacent CRCs. They are the points on the latent variable scale at which one category response becomes relatively more likely than the preceding response. The slope (α_i) parameters have the usual interpretation as a kind of discrimination parameter; that is, they "indicate the degree to which categorical responses vary among items as θ level changes" (Muraki, 1992, p. 162).

The GPCM and the GRM are very similar models but are derived from two different families (and thus are not nested models or comparable using a likelihood ratio test).


Models like the GPCM are built by considering the choice between two adjacent response categories x and x − 1—ignoring all other categories—and then deriving how the latent variable interacts with item slope and intersection parameters to determine the response choice. This is certainly different than the GRM, for which all response options are considered in deriving each within-item TRC. For comparison's sake, the estimated parameters from the GPCM are shown in Table 36.3 alongside the GRM. Although the values look different, the two bottom graphs in Figure 36.2 show examples in which the CRCs estimated under the GRM (darker lines) and GPCM (lighter lines) are almost visually indistinguishable.

A Polytomous Rasch Model

The PCM (Masters, 1982) is the prototypical Rasch model suitable for analyzing ordered polytomous items. Recall that the GRM is built around dichotomizing items at the boundaries between the response categories (e.g., 1, 2 vs. 3, 4) and then fitting 2PLMs to each of the number of categories minus one boundaries between categories. The PCM is also built around dichotomizing items at the boundaries between categories, but this process differs from the GRM in important ways. Specifically, the PCM is based on fitting Rasch's 1PLM to the number of categories minus one ordered dichotomies but considering only two response categories at a time. Thus, with a four-category item, the PCM fits a Rasch model considering the dichotomies 0 versus 1, 1 versus 2, and 2 versus 3, as opposed to a GRM, where the dichotomies are 0 versus 1, 2, 3; 0, 1 versus 2, 3; and 0, 1, 2 versus 3.

Interestingly, it is easy to show that the PCM is simply Equation 12 without the slope parameter. Thus, in the PCM, the probability of response is determined solely by the difference between an individual's level on the latent variable and the category intersections. An interesting feature of the PCM is that the category intersections are not necessarily ordered. Recall that in the GRM, locations (b) must be ordered. When the intersection parameters are not ordered, this phenomenon is known as a "reversal" (Dodd & Koch, 1987). As a rule, if the intersection parameters are ordered within an item, then there is at least one location on the latent variable where every response option is most likely (Andrich, 1988).

Thus, the divide-by-total or direct polytomous IRT models are a nested family going from the least restricted NRM, to allowing items to vary in slope (GPCM), to allowing items to vary only in category intersection parameters (PCM). Of course, only the PCM has features associated with Rasch models (specific objectivity). We note that the model hierarchy considered here did not include the well-known rating scale version of the PCM (Andrich, 1978). We believe this model to be more of theoretical interest, and thus it was not detailed here.

Practical Applications and Considerations

The first section was dedicated to introducing key IRT models, their properties, and their interpretation. Understanding the mechanics of these models is crucial because all applications of IRT (e.g., computerized adaptive testing, analysis of item and scale information functions, linking distinct measures onto a common scale, analysis of differential item functioning) rest squarely on the validity of these representations. By validity of the model, we mean that term in the broadest sense, to include the viability of IRT model assumptions, the accuracy of item parameter estimates, and the proper representation of the construct (i.e., is there really a latent variable that accounts for the relations among the items?). With this in mind, the following section is geared toward applied research. Specifically, we consider (a) the applicability of IRT models across various types of constructs and domains, (b) sample size and model selection issues, and (c) some lessons learned thus far from the research literature on the application of IRT to noncognitive measures.

The Applicability of IRT Models

As noted by many authors, IRT models developed in the context of large-scale cognitive testing; applications outside this domain emerged slowly and are only now starting to fully blossom. In considering the application of IRT models across a broad range of constructs, we consider the following three issues: (a) latent versus emergent variables, (b) cognitive versus noncognitive applications, and (c) the measurement of narrow versus broadband constructs.


First, we point out that IRT models can be viewed as latent factor models (not component models) that are applied to item-level data. As such, IRT models are clearly latent variable models for which the items are effect indicators in the Bollen and Lennox (1991) sense. In short, it makes little sense to apply IRT models unless one has reason to believe that a single common latent variable is causing variation in item responses and thus causing items to be intercorrelated.

Second, although it is challenging to cite any specific author, we have observed that some researchers are hesitant to embrace the application of IRT models to measures of noncognitive constructs (i.e., constructs other than achievement, ability, or competence). In this regard, we agree with the following: (a) noncognitive constructs, such as personality traits, psychopathology, attitudes, values, health outcomes, and so on, are conceptually different than such constructs as nursing competency, eighth-grade spelling knowledge, or verbal ability; in turn, (b) the psychometric properties of noncognitive measures are often very different than what is typically seen in the measurement of cognitive constructs (see the concluding section for examples); and, finally, (c) the measurement context (i.e., the goals of measurement) between noncognitive and cognitive assessment is often dramatically different (see Reise, 2009, for further commentary).

Yet, despite these differences, we see little if any reason to question the application of IRT models to measures of well-examined personality, psychopathology, or health outcomes constructs. Latent variable models are justified when one can propose that the empirical relations among a set of content-diverse items are explained by individual differences on a common psychological process (a latent trait). In this regard, we see no empirical evidence that data derived from noncognitive measures are more multidimensional than cognitive measures or that a dominance response process (higher trait levels lead to higher item scores) does not apply. In fact, the history of factor analysis, both exploratory and confirmatory, leads us to believe that many noncognitive measures should be entirely consistent with the assumptions of IRT (i.e., unidimensionality, monotonicity of response).

Finally, we note that researchers in both cognitive and noncognitive domains have long recognized the hierarchical nature of psychological constructs. Accordingly, it has long been recognized that constructs and their associated measures vary in conceptual bandwidth, ranging from narrow (physical attractiveness self-esteem, tooth-brushing self-efficacy, test anxiety) to broad (general self-esteem, neuroticism). Commonly applied IRT models are unidimensional—they assume the existence of a single common variable affecting item responses. As pointed out by multiple authors over the years (see Humphreys, 1970; Gustafsson & Aberg-Bengtsson, 2010), attempting to create pure "unidimensional" scales can create quite a quandary and actually lead to poor measurement.

The way to achieve unidimensionality, and thus be consistent with IRT modeling, is to focus on a narrow construct and write items that are essentially replicates (this can increase coefficient alpha as well). Yet, it is questionable whether we really need the power of IRT to measure such conceptually narrow constructs in cases in which the correlation between the items can be explained merely by the semantic similarity of the items, rather than by the need to postulate an underlying psychological trait. On the other hand, the measurement of broadband constructs demands the inclusion of content-diverse indicators. In turn, this almost guarantees violations of the IRT unidimensionality assumption. Yet, it appears to us that for these types of substantively complex constructs, this is exactly where the power of IRT and other latent variable methods is most needed and most interestingly applied.

We raise this issue because it is important for future directions of IRT modeling. Specifically, we see the future as rejecting the unidimensional model as an appropriate foundation for IRT modeling and for defining good measurement in general. Nevertheless, we do not necessarily see the future as moving toward multidimensional correlated-traits IRT models either (see Reckase, 2010) but rather as recognizing a broader definition of what the common latent variable is in IRT modeling.


For example, the bifactor model described by the relative order of trait level estimates (e.g.,
Gibbons and Hedeker (1992) has seen increased use Wainer & Thissen, 1987). On the other hand, in
as a tool for exploring the applicability of IRT mod- some contexts, such as the evaluation of different
els (e.g., Reise, Moore, & Haviland, 2010; Reise, CRCs or IRCs across different groups of individuals
Morizot, & Hays, 2007). This model allows each (e.g., men versus woman), high-stakes testing in
item to reflect a single common latent variable and health care settings, or large-scale computerized adap-
one additional latent variable caused by clusters of tive testing, item parameter accuracy is paramount.
items that share item content (i.e., sometimes To clarify the number of subjects issue, some
referred to as content parcels). Thus, when a mea- researchers have conducted Monte Carlo simulation
sure consists of multiple content parcels necessary studies. For example, Reise and Yu (1990) showed
to validly measure complex constructs, the bifactor that the GRM can be estimated with MULTILOG
model can easily accommodate such a structure. In (Thissen, 2003) with as few as 250 examinees, but
other words, a bifactor framework affords the ability they recommend around 500 (the number of items
Copyright American Psychological Association. Not for further distribution.

to use multidimensional item response data to is not critical with marginal maximum likelihood
achieve unidimensional measurement. Thus, we estimation). Yet simulation studies are useful only if
view the bifactor model not only as a tool for evalu- the real data match the simulated data, and they
ating the distorting effects of forcing multidimen- should not be relied on to judge the viability of a
sional data into a unidimensional model (Reise particular application. Rather than suggesting some
et al., 2007) but also for expanding the range of citable magic N or subjects-to-parameters ratio, we
constructs that can be accommodated by IRT mod- recommend the following.
els (see also Chapter 35 of this volume). First, thoroughly explore the data using nonpara-
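To make the structure of this model concrete, consider the following minimal sketch (our illustration, not Gibbons and Hedeker's code; all parameter values are hypothetical). It computes the endorsement probability for a dichotomous item under a bifactor extension of the two-parameter logistic model, in which each item has one slope on the general factor and one slope on exactly one group factor.

```python
import numpy as np

def bifactor_2pl_prob(theta_general, theta_group, a_general, a_group, c):
    """P(endorse) for a dichotomous item in a bifactor 2PL:
    one general slope, one group-specific slope, and an intercept."""
    logit = a_general * theta_general + a_group * theta_group + c
    return 1.0 / (1.0 + np.exp(-logit))

# Hypothetical item loading on a general distress factor and on a
# "somatic" group factor defined by a content parcel.
print(bifactor_2pl_prob(theta_general=1.0, theta_group=0.5,
                        a_general=1.8, a_group=0.9, c=-1.0))
```

Because the group factors absorb the parcel-specific covariance, the general slope can be interpreted as the item's relation to the single common latent variable, which is what permits unidimensional measurement from multidimensional data.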
Sample Size and Model Choice

A common question in seminars and workshops is “how many subjects do I need to estimate IRT model parameters?” This is a fundamental question, but unfortunately there is no simple rule-of-thumb answer. The needed sample size for reasonably accurate item parameter estimates depends on the interplay among a variety of intertwined factors, including (a) the model selected (constrained models are easier to estimate); (b) how large the item slope parameters are expected to be (higher is better); (c) how well the data fit the model (if the data do not fit the model, then parameter accuracy is meaningless); (d) the underlying latent trait distribution (normal vs. skewed); (e) sample heterogeneity (more is better); and, importantly, (f) what the ultimate research goal is (high-stakes testing vs. a research-only scale).

There may be applications where accuracy of item parameter estimation is not critically important (e.g., in exploring basic item functioning). Moreover, given reasonable scale length (e.g., 20 items), minor biases in item parameter estimates, or even estimating the wrong model, may not greatly affect model outcomes, such as the scale response curve or the relative order of trait-level estimates (e.g., Wainer & Thissen, 1987). On the other hand, in some contexts, such as the evaluation of different CRCs or IRCs across different groups of individuals (e.g., men versus women), high-stakes testing in health care settings, or large-scale computerized adaptive testing, item parameter accuracy is paramount.

To clarify the number-of-subjects issue, some researchers have conducted Monte Carlo simulation studies. For example, Reise and Yu (1990) showed that the GRM can be estimated with MULTILOG (Thissen, 2003) with as few as 250 examinees, but they recommend around 500 (the number of items is not critical with marginal maximum likelihood estimation). Yet simulation studies are useful only if the real data match the simulated data, and they should not be relied on to judge the viability of a particular application. Rather than suggesting some citable magic N or subjects-to-parameters ratio, we recommend the following.
First, thoroughly explore the data using nonparametric methods (Sijtsma & Molenaar, 2002). If the data considered at the nonparametric level do not reasonably conform to a parametric model, a large sample size is pointless—there is no sense precisely estimating parameters that do not exist. Second, evaluate model assumptions (dimensionality, monotonicity) carefully, even if the sample size is small. Again, item parameter estimation accuracy is moot if the data are not appropriate for the model in the first place. Third, for models that include item slopes, consider their magnitude as the key influence on parameter estimation. Research (MacCallum, Widaman, Zhang, & Hong, 1999) has shown that the size of a factor loading is a key factor in how many subjects are needed to conduct a factor analysis (i.e., high loadings can be estimated with fewer subjects). The IRT slope parameter is analogous to a factor loading, and the same principle should apply.

Finally, it is important to recognize that the location parameters in the GRM are, generally speaking, easier to estimate than the intersection parameters in divide-by-total models. Assuming that the items have a reasonable degree of slope, and that there are sufficient responses in each category, location parameters are relatively easy to estimate because all of the data are used. An intersection parameter, on the other hand, is relatively harder to estimate than a location because the intersection considers only the responses in categories x and x − 1. Thus, it is particularly important to have sufficient responses in every category for models like the NRM, GPCM, and PCM.
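The difference in how much data each kind of parameter draws on can be seen simply by counting responses. A GRM location for boundary x dichotomizes the entire sample at that boundary, whereas an adjacent-category intersection (as in the GPCM or PCM) contrasts only the respondents who chose categories x − 1 and x. A minimal illustration with made-up category counts:

```python
from collections import Counter

# Hypothetical response counts for one 5-point item (categories 1-5).
counts = Counter({1: 12, 2: 45, 3: 210, 4: 388, 5: 145})
n_total = sum(counts.values())

boundary = 4  # boundary between categories 3 and 4 (x = 1,2,3 vs. 4,5)
n_grm = n_total                                  # GRM: all responses used
n_adjacent = counts[boundary - 1] + counts[boundary]  # GPCM/PCM: two categories

print(f"GRM boundary dichotomy uses all {n_grm} responses")        # 800
print(f"Adjacent-category intersection uses {n_adjacent} responses")  # 598
# With a sparse category (only 12 responses in category 1), the
# intersection between categories 1 and 2 would rest on just 57 cases.
```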
Finally, having decided to pursue IRT modeling, a critical question is which model should be estimated. Although the choice of model is often dictated by what software an individual has access to, for the sake of argument, we assume that any model described herein can be estimated. With this in mind, we note that from a statistical perspective, the 4-, 3-, 2-, and 1-parameter logistic models for dichotomous item responses are nested, and thus a likelihood-ratio test may be useful in model comparisons. Moreover, the same likelihood-ratio test also applies to divide-by-total models, where the NRM, GPCM, and PCM also form a nested series. Unfortunately, comparison of the GRM with the GPCM cannot be made on the basis of a likelihood-ratio test; rather, it must be made on the basis of other statistical criteria (e.g., the Akaike information criterion) as well as substantive considerations.
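As an illustration of these comparisons (the log-likelihood values below are invented for the example; in practice they would come from whatever estimation program is used), the likelihood-ratio statistic is twice the difference in log-likelihoods, referred to a chi-square distribution with degrees of freedom equal to the difference in the number of estimated parameters:

```python
from scipy.stats import chi2

def lr_test(loglik_constrained, loglik_general, df_diff):
    """Likelihood-ratio test for nested IRT models (e.g., PCM vs. GPCM)."""
    lr = 2.0 * (loglik_general - loglik_constrained)
    return lr, chi2.sf(lr, df_diff)

# Hypothetical fits to 20 polytomous items: the PCM is nested within the
# GPCM, which adds one slope parameter per item (df_diff = 20).
lr, p = lr_test(loglik_constrained=-10450.3, loglik_general=-10401.8,
                df_diff=20)
print(f"LR = {lr:.1f}, p = {p:.4f}")

# GRM vs. GPCM are not nested, so compare AIC = 2k - 2*loglik instead
# (k = number of parameters; again, hypothetical values).
aic_grm = 2 * 100 - 2 * (-10398.2)
aic_gpcm = 2 * 100 - 2 * (-10401.8)
print(f"AIC(GRM) = {aic_grm:.1f}, AIC(GPCM) = {aic_gpcm:.1f}")
```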
In practice, however, researchers seldom are concerned with overall model fit but rather with a comparison between the observed and estimated IRCs or CRCs on an item-by-item basis. In considering the various modeling options in IRT, we feel that researchers spend too much effort worrying about relative model fit and item parameter accuracy rather than about model outcomes. In any particular data set, it may be the case that one model can be shown to provide a better relative fit than another, but in terms of outcomes, model choice is inconsequential. Ostini and Nering (2006) pointed out that although the polytomous IRT models may differ in a variety of ways, few researchers have attempted to compare the models in terms of outcomes as opposed to fit. It is arguable that in terms of ultimate outcomes, such as scale-response curves or individuals' trait-level estimates, the impact of model choice (or even poor parameter estimation) is minimal. Similar arguments were made by Embretson and Reise (2000), who fitted a personality scale with a number of different polytomous models and found that trait-level estimates correlated above .90 regardless of model. This demonstration should not be taken to mean that the choice of model never makes a difference. In fact, recent work has argued that model accuracy is important in scaling individuals correctly in certain trait ranges (Waller & Reise, 2010).
Lessons Learned From the Application of IRT to Noncognitive Measures

IRT modeling and its by-products (e.g., computerized adaptive testing) can be considered standard practice in education measurement, large-scale abilities testing, and licensure testing. Although this trend is by no means true in noncognitive assessment, even in those domains IRT modeling has moved well beyond the stage of didactic articles and proof-of-concept demonstrations. In fact, there is now quite a substantial body of research that has considered IRT modeling outside the cognitive domain (see Cella et al., 2007, 2010; Reise & Waller, 2009). In this section, we consider some observations we have made regarding this research. In turn, these observations may suggest future research issues and directions.

We have observed that several interesting phenomena occur relatively frequently when polytomous IRT models are applied to personality, health outcomes, and psychopathology measures, namely the following: (a) extreme item locations, (b) extremely high item slopes, (c) bunched item locations, and (d) an unusually large range of item slopes. In this section, we discuss each of these phenomena in turn and propose reasons why they occur. For simplicity and continuity, we discuss these issues in the context of the slope and location parameters resulting from the application of the GRM; these same issues arise in the application of any model.

We begin with the phenomenon of extremely high item slope parameters. To be conservative, we define high slopes as any value larger than 3 (logistic metric), but note that even a slope of 2 is relatively large compared with what is typically found in dichotomous IRT models. It is easy to demonstrate that as an item slope parameter moves beyond 3, the response categories are providing a high degree of discrimination among individuals. In other words, with a slope of 3 in the GRM, if a researcher knew an individual's trait level, he or she could predict the individual's likely response category with high accuracy. Another way of thinking about this issue is as follows. As item slopes increase beyond 3, the TRCs begin to look more like Guttman step functions than monotonically increasing ogives. Do such high slopes mean good measurement, or is there something wrong?
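A small numeric sketch of this point under the GRM (all parameter values here are hypothetical): with a slope of 3, the category response probabilities become nearly deterministic, so the most likely category can be read almost directly off the trait level.

```python
import numpy as np

def grm_category_probs(theta, a, b):
    """Category probabilities for one GRM item, built from the cumulative
    boundary curves P(X >= x) = logistic(a * (theta - b_x))."""
    p_star = 1.0 / (1.0 + np.exp(-a * (theta - np.asarray(b))))
    cum = np.concatenate(([1.0], p_star, [0.0]))
    return cum[:-1] - cum[1:]  # P(X = x) for each of the 5 categories

b = [-2.0, -1.0, 1.0, 2.0]  # hypothetical boundary locations
for a in (1.0, 3.0):
    print(f"a = {a}:", np.round(grm_category_probs(0.0, a, b), 2))
# With a = 1, a respondent at theta = 0 lands in the middle category with
# probability about .46; with a = 3, that probability rises to about .91,
# a near-Guttman pattern.
```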
We argue that the answer to this question depends on understanding the causes of the high slope parameters. To begin, it is well known that one possible source of a high item slope is a problem known as local dependency. Local dependency means that two (or more) items share common variance above and beyond that which is due to the latent trait being measured. Local dependence can be caused by item content redundancy (i.e., asking essentially the same question twice) and results in inflated item slope parameters (Steinberg & Thissen, 1996). To the extent that slopes are inflated by local dependencies, the slope parameter estimates provide a misleading gauge of measurement precision. However, local dependence is not an explanation for high slopes when all or most of the items on a scale display exceptionally high slopes. In this situation, we need to turn to other explanations.
A second possible explanation for high slopes is that the measured construct has a narrow conceptual bandwidth (e.g., “knee joint pain”). Sparing the reader the technical details, in IRT, the magnitude of an item's average correlation with the other items determines the item's slope (or factor loading in factor analysis). In measures of narrowband constructs, homogeneous item content is expected. In addition, because of a lack of diverse trait manifestations, in measures of narrowband constructs the conceptual distance between the item and the construct is often very small (e.g., “I experience pain when bending my knee”). These factors result in narrow measures containing items with very high intercorrelations, which in turn result in high IRT slope parameters. This is especially true if a 5-, 7-, or 9-point rating scale is used for each item because the more response options, the more room is left for individual differences in response style to operate, further inflating correlations.

When high item slopes are caused by the narrowband nature of the measured construct, their values are perfectly valid indicators of measurement precision. However, there is one big caveat: the measurement precision should not be viewed as resulting from quality measurement, but rather as resulting from the narrowness of the construct being measured. It is unsurprising to observe high slopes on narrowband measures because individual differences are much easier to discriminate in that context. Consider this example: The items 3 + 4 = ?, 2 + 6 = ?, and 1 + 3 = ? would form a highly discriminating set of indicators of the construct of “adding two single digits.” In short, the fewer and more homogeneous the trait (or ability) manifestations, the easier it is to discriminate, with high precision, between those who are high and those who are low on the trait.

A narrow construct and the resulting item content homogeneity can explain some cases of extremely high slopes, but this explanation is irrelevant when high slopes are observed on measures of complex and multifaceted broadband constructs that contain diverse item content. In this situation, we propose two additional sources as possible explanations, namely, (a) a mixture of extreme groups (e.g., clinical and nonclinical) included in the calibration sample, and (b) a skewed or quasi-trait. These latter two concepts are not synonyms, but discussion of them is joined because, in practice, we argue that it is very difficult to differentiate between a highly skewed latent distribution and a quasi-dimensional trait.

In the measurement of psychopathology or health outcomes, psychometricians have long warned against the use of combined clinical and nonclinical samples in judging the psychometric properties of instruments. The obvious danger is that if a nonclinical group with a floor effect is combined with a clinical group with a ceiling effect, an instrument can look deceptively good in terms of item intercorrelations, item–test correlations, factor loadings, and coefficient alpha. In short, in mixed clinical and nonclinical samples, strong psychometrics merely reflect the fact that the items differentiate between extreme groups rather than indicating a good measure of a dimensional construct across its entire range. Because IRT slope parameters are complex transformations of item intercorrelations, a mixed clinical and nonclinical sample can easily result in very high slope parameter estimates.
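This inflation is easy to reproduce by simulation (a sketch with arbitrary values, not a reanalysis of any real instrument): two items that correlate only modestly within either group alone can correlate far more strongly once a low-scoring and a high-scoring group are pooled.

```python
import numpy as np

rng = np.random.default_rng(1)

def item_pair(mean, n, noise=1.0):
    """Two items driven by the same group-centered trait plus independent noise."""
    trait = rng.normal(mean, 0.5, n)
    return trait + rng.normal(0, noise, n), trait + rng.normal(0, noise, n)

x_lo, y_lo = item_pair(mean=-2.0, n=500)  # nonclinical group near the floor
x_hi, y_hi = item_pair(mean=+2.0, n=500)  # clinical group near the ceiling

within = np.corrcoef(x_lo, y_lo)[0, 1]
pooled = np.corrcoef(np.concatenate([x_lo, x_hi]),
                     np.concatenate([y_lo, y_hi]))[0, 1]
print(f"within-group r = {within:.2f}, pooled r = {pooled:.2f}")
# The pooled correlation (roughly .8 here, vs. roughly .2 within groups)
# is dominated by the group separation; IRT slopes, as transformations of
# such correlations, inflate accordingly.
```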
Finally, we propose that a second possible cause of unusually high slopes on broadband measures is a highly skewed trait or a quasi-trait. Although these concepts are difficult to differentiate in practice, we consider a skewed trait to be a fully continuous trait, conceptually definable at both ends, but with a strong floor or ceiling effect (e.g., an optimism–pessimism scale applied in a culture in which almost everyone is an optimist). A quasi-trait is a dimensional construct that is defined only on one end of the scale (i.e., meaningful individual differences exist only on one end of the scale). Many constructs in psychology, although assumed dimensional (e.g., aggression, self-esteem, spirituality), are possibly only quasi-traits.

For example, in our work on psychopathology (Reise & Waller, 2003; Waller & Reise, 2010), we noted that the low end of our measures, say depression, is not happiness or well-being but rather is lack of depression (i.e., no symptoms). In working with either highly skewed traits or quasi-traits, caution must be used in interpreting high slope parameters. For the same reasons as articulated for mixed samples, item slopes may be artificially inflated because of either the non-normality or the quasi-trait status of the construct. In such cases, a high slope may indicate that the item differentiates between not-traited (not depressed) and traited (depressed) individuals, but it may not necessarily provide a precise differentiation among people along a meaningful continuous latent dimension.
The discussion of skewed traits or quasi-traits leads directly to the next topic, namely, the issue of bunched item location parameters. By bunched item location parameters, we mean location parameters (e.g., in the GRM) that are clumped closely together along the latent variable (usually at the trait extremes) rather than spread out over the trait continuum. When location parameters are bunched together, this implies that the instrument affords measurement precision in only a narrow trait range. This is an interesting occurrence given that the only legitimate purpose of having a multipoint response format is to allow people to validly make discriminations across the continuum. If a researcher uses a multipoint response scale in a diverse sample, and location parameters still clump together, this is evidence that either respondents cannot differentiate among the response options or, more likely, that what is being measured is a highly skewed trait or quasi-trait that is not a fully dimensional construct.

It appears that many researchers are operating under the assumption that all constructs are fully continuous, defined at both ends of the construct, and that items can be found that measure (have location parameters) across an entire trait range. For example, in a different context, Andrich (1995) conveyed the following sentiments:

    Although measuring instruments have operating ranges, a measurement is not taken to be a function of the operating range of any instrument—instead, if a measurement is contaminated by the operating range of the instrument (e.g., floor or ceiling effects), another instrument with a range more compatible with the location of the entity or object is sought. (p. 101)

Such a statement assumes that, in theory, a researcher could find an alternative instrument that would not produce a floor or ceiling effect. For another example, Fraley, Waller, and Brennan (2000), on observing that in attachment measures item locations are highly bunched together on one end of the trait, suggested that new items be written to better spread out the measurement precision over the complete range of the latent variable. But what if such a search for new items or measures is fruitless?
With a highly skewed construct or a quasi-trait, it may not be possible to find items with location parameters spread out across the range. Consider a research study by Gray-Little, Williams, and Hancock (1997), who applied the GRM to a 5-point version of the Rosenberg Self-Esteem Scale (Rosenberg, 1989). This study is fascinating because although the items contain five response options, the four location parameters per item are closely bunched at the low end of the latent trait (i.e., low self-esteem). For example, even the third location (x = 1, 2, 3 vs. 4, 5) was in the negative range for 8 out of the 10 items. To make this more concrete, the third location parameter for Item 1 (“On the whole, I am satisfied with myself”) was b = −1.48, implying that even individuals who are a standard deviation and a half below the mean on the trait are most likely to respond in the fourth or fifth category. In other words, even people far below the mean on self-esteem rate themselves as highly self-satisfied. Although reasonable minds can disagree, one explanation for this effect is that it is not due to poor items or a poor choice of response options. Rather, it is due to the nature of the self-esteem construct; items differentiate only people with low self-esteem because that is the only end of the construct that is meaningful.
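In GRM terms, a boundary location is the trait level at which a respondent is equally likely to fall above or below that boundary. A quick check of the Item 1 figure reported by Gray-Little et al. (the slope of 2.5 is assumed here purely for the arithmetic; only b = −1.48 is taken from their report):

```python
import math

def grm_boundary_prob(theta, a, b):
    """P(responding above boundary b) in the logistic GRM."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

a, b = 2.5, -1.48  # slope assumed for illustration; location as reported
for theta in (-1.48, -1.0, 0.0):
    print(f"theta = {theta:+.2f}: P(category 4 or 5) = "
          f"{grm_boundary_prob(theta, a, b):.2f}")
# At theta = b the probability is exactly .50; at the trait mean
# (theta = 0) a respondent is almost certain to use the top two categories.
```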
Related to the topic of bunched item location parameters is the topic of extreme item locations. By extreme we mean location parameters that do not fall within a reasonable range of values, say −2 to 2 (assuming that the latent variable has a mean of zero and a standard deviation of one). As mentioned, one reason why an extreme location may occur is that an item has a relatively low slope parameter (recall that the location is a function of the intercept divided by the slope). A second, obvious reason is that either the lowest or the highest response category is too extreme given the content of the item. A third, more interesting reason is that the latent variable is either highly skewed or is a quasi-trait.
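In slope–intercept form, the item logit can be written as a·θ + c, so the location is b = −c/a. A small numeric illustration (the values are invented) of how a weak slope pushes the location toward the extremes even when the intercept is modest:

```python
# Location as a function of intercept and slope: b = -c / a.
c = 1.2  # hypothetical intercept, held constant across items
for a in (2.0, 1.0, 0.4):
    print(f"a = {a}: b = {-c / a:+.2f}")
# a = 2.0 yields b = -0.60, but a = 0.4 yields b = -3.00: the same
# intercept produces an extreme location once the slope is small.
```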
We again use the results of the Gray-Little et al. (1997) study to illustrate these phenomena. In their calibration of the Rosenberg Self-Esteem Scale, the category locations not only were bunched in the low end of the latent variable but also were very extreme. For all 10 items, the first category location (x = 1 vs. 2, 3, 4, 5) was an absurdly low value. For example, for Item 1 (“On the whole, I am satisfied with myself”), the estimated location was b = −3.45, suggesting that even individuals who are nearly three and a half standard deviations below the mean will be likely to respond in at least category 2. Moreover, for many items, even the fourth location (x = 1, 2, 3, 4 vs. 5) is barely above the trait mean. For example, for Item 3 (“I feel that I have a number of good qualities”) the fourth location is only b = 0.22. Thus, even people barely above the mean on self-esteem are most likely to respond in the highest category on this item. The only item that displayed a more reasonable fourth location was Item 9 (“All in all, I am inclined to feel that I am a failure”—reversed), which had a fourth threshold of 2.13. This item has a very low slope (a = 1.16), however, relative to the high-slope items, which have slopes ranging from 2.0 to 2.5. It is thus plausible that this item is not measuring the same common trait as the high-slope items.

Our final practical issue in applying IRT models to noncognitive data is the observation that on some measures we have noticed a very wide spread of item slopes. For example, Van Der Ark (2001) reported slopes of 0.8, 3.6, 0.4, 6.4, and 0.2 for a measure of coping with industrial odor annoyance. We call this a steep descent pattern, where one or two items have a relatively high slope and then the value of the slope parameter decreases rapidly for the remainder of the items. Although it is reasonable that different trait indicators can be related to the latent trait to different degrees, a wide variance in slopes may also be a sign of problems. For example, in Hall, Reise, and Haviland (2007), a nine-item spiritual instability measure had GRM slopes ranging from 2.63 (“When I sin, I am afraid of what God will do to me”) to 0.74 (“When I sin, I tend to withdraw from God”). The remaining items had slopes of 1.9, 1.9, 1.5, 1.5, 1.4, 1.0, and 0.9, respectively.

Such a variable pattern of slopes is troubling in a number of respects. First, one could argue that the construct is so narrow that one item essentially defines the latent variable and the other items are only tangentially related. If that were true, one could argue that the remaining items are unnecessary. More technically, a second potential problem is that if IRT methods were used to estimate standing on the construct, then in this example the best item would have roughly 3.6 times the influence of the worst item because the sufficient statistic for the latent trait estimate is the raw item response weighted by the item slope. Moreover, because an item contributes to measurement precision by the square of the slope parameter, such results argue that the best item contributes about 12.6 times more error reduction than the worst item; that is, it takes almost 13 items like the worst item to equal one item like the best.
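The arithmetic behind these ratios is simply the ratio of slopes (for the weighting of responses in scoring) and the ratio of squared slopes (for information, i.e., error reduction); a quick check using the Hall, Reise, and Haviland (2007) slopes:

```python
a_best, a_worst = 2.63, 0.74  # GRM slopes reported by Hall et al. (2007)

weight_ratio = a_best / a_worst        # relative weight in trait estimation
info_ratio = (a_best / a_worst) ** 2   # relative contribution to information

print(f"scoring weight ratio: {weight_ratio:.1f}")  # about 3.6
print(f"information ratio:    {info_ratio:.1f}")    # about 12.6
```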
This wide-ranging slope phenomenon is not unique to spirituality constructs and their associated measures. In fact, this phenomenon is rather easy to observe in both the IRT literature and the non-IRT literature (in the form of variable item–test correlations or factor loadings). Ignoring multidimensionality as a possible cause, we propose that this phenomenon is due to a combination of narrowband constructs (where one or two good items essentially define the construct) and limited item pools. Many constructs in personality, health, and psychopathology have an extremely limited indicator pool (e.g., how many ways are there to react to industrial odor?). Even a relatively complex and multifaceted construct such as depression has a finite set of indicators (e.g., sad mood, social isolation, suicide cognitions, feelings of hopelessness, and somatic disturbances). Importantly, when only a few items have high slopes and the remaining items have much lower slopes, a researcher must be cautious in interpreting the latent variable. It could well be that the latent variable does not reflect variance on a common dimension shared by all the items but rather merely reflects individual differences on the item with the highest slope.
In sum, this section is not meant to disparage or discourage IRT applications to typical performance constructs. Our goal is merely to draw attention to some interesting challenges that researchers may face in applying polytomous models to noncognitive constructs. As we stated, IRT emerged in the context of large-scale cognitive ability testing. In that context, it is relatively easy to conceive of normally distributed continuous traits and unlimited item pools. For example, it is easy to envision an endless number of spelling, algebra, analogy, constitutional law, or nursing skills questions. In turn, it is relatively easy to envision these items as varying in their location across the trait range. Moreover, large-scale cognitive ability researchers are typically working with broadband constructs (verbal ability, algebra, knowledge of the law, nursing expertise) for which domains of item content (learning or skill domains) are well articulated.

In contrast, as IRT technology is transported from the cognitive abilities realm into the broader world of typical performance assessment, special issues and problems are bound to emerge (Reise, 2009). This is especially true given that noncognitive researchers have to work with constructs of varying conceptual breadth, for which only a limited set of indicators exists, and for which the underlying distribution cannot possibly be standard normal. To the degree that IRT applications in noncognitive settings raise issues that have not caught the attention of previous researchers (e.g., the Gray-Little et al., 1997, study), or call into question the quality of legacy measures developed under traditional coefficient alpha–centric scale construction practices, this is a positive development for the field of psychometrics. Indeed, part of the excitement of current IRT research lies in identifying new problems and working toward their solutions.
References

Andrich, D. (1978). A rating formulation for ordered response categories. Psychometrika, 43, 561–573. doi:10.1007/BF02293814

Andrich, D. (1988). A general form of Rasch's extended logistic model for partial credit scoring. Applied Measurement in Education, 1, 363–378. doi:10.1207/s15324818ame0104_7

Andrich, D. (1995). Distinctive and incompatible properties of two common classes of IRT models for graded responses. Applied Psychological Measurement, 19, 101–119. doi:10.1177/014662169501900111

Baker, F. B., & Kim, S.-H. (2004). Item response theory: Parameter estimation techniques (2nd ed.). New York, NY: Marcel Dekker.

Ben-Porath, Y. S., Tellegen, A. T., & Kaemmer, B. (2005). MMPI–A: Minnesota Multiphasic Personality Inventory—Adolescent: Booklet of abbreviated items. Minneapolis: University of Minnesota Press.

Bock, R. D. (1972). Estimating item parameters and latent ability when responses are scored in two or more latent categories. Psychometrika, 37, 29–51. doi:10.1007/BF02291411

Bollen, K., & Lennox, R. (1991). Conventional wisdom on measurement: A structural equation perspective. Psychological Bulletin, 110, 305–314. doi:10.1037/0033-2909.110.2.305

Bond, T. G., & Fox, C. M. (2007). Applying the Rasch model: Fundamental measurement in the human sciences (2nd ed.). Mahwah, NJ: Erlbaum.

Borsboom, D. (2005). Measuring the mind: Conceptual issues in contemporary psychometrics. Cambridge, England: Cambridge University Press. doi:10.1017/CBO9780511490026

Cella, D., Riley, W., Stone, A., Rothrock, N., Reeve, B., Yount, S., . . . Hays, R. D. (2010). Initial item banks and first wave testing of the Patient-Reported Outcomes Measurement Information System (PROMIS) network: 2005–2008. Journal of Clinical Epidemiology, 63, 1179–1194.

Cella, D., Yount, S., Rothrock, N., Gershon, R., Cook, K., Reeve, B., & Rose, M. (2007). The Patient-Reported Outcomes Measurement Information System (PROMIS): Progress of an NIH Roadmap Cooperative Group during its first two years. Medical Care, 45(5, Suppl. 1), S3–S11. doi:10.1097/01.mlr.0000258615.42478.55

de Ayala, R. J. (2009). The theory and practice of item response theory. New York, NY: Guilford Press.

De Boeck, P., & Wilson, M. (Eds.). (2004). Explanatory item response models: A generalized linear and nonlinear approach. New York, NY: Springer.

Dodd, B. G., & Koch, W. R. (1987). Effects of variations in item step values on item and test information in the partial credit model. Applied Psychological Measurement, 11, 371–384. doi:10.1177/014662168701100403

Embretson, S. E. (1996). The new rules of measurement. Psychological Assessment, 8, 341–349. doi:10.1037/1040-3590.8.4.341

Embretson, S. E., & Reise, S. P. (2000). Psychometric methods: Item response theory for psychologists. Mahwah, NJ: Erlbaum.

Fox, J.-P., & Glas, C. A. W. (2001). Bayesian estimation of a multilevel IRT model using Gibbs sampling. Psychometrika, 66, 271–288. doi:10.1007/BF02294839

Fraley, R. C., Waller, N. G., & Brennan, K. A. (2000). An item response theory analysis of self-report measures of adult attachment. Journal of Personality and Social Psychology, 78, 350–365. doi:10.1037/0022-3514.78.2.350

Gibbons, R. D., & Hedeker, D. (1992). Full-information item bi-factor analysis. Psychometrika, 57, 423–436. doi:10.1007/BF02295430

Gray-Little, B., Williams, V. S. L., & Hancock, T. D. (1997). An item response theory analysis of the Rosenberg Self-Esteem Scale. Personality and Social Psychology Bulletin, 23, 443–451. doi:10.1177/0146167297235001

Gustafsson, J. E., & Aberg-Bengtsson, L. (2010). Unidimensionality and the interpretability of psychological instruments. In S. E. Embretson (Ed.), Measuring psychological constructs (pp. 97–121). Washington, DC: American Psychological Association. doi:10.1037/12074-005

Hall, T. W., Reise, S. P., & Haviland, M. G. (2007). An item response theory analysis of the Spiritual Assessment Inventory. International Journal for the Psychology of Religion, 17, 157–178.

Hambleton, R. K., & Swaminathan, H. (1985). Item response theory. Boston, MA: Kluwer-Nijhoff.

Hambleton, R. K., Swaminathan, H., & Rogers, H. J. (1991). Fundamentals of item response theory. Newbury Park, CA: Sage.

Humphreys, L. G. (1970). A skeptical look at the factor pure test. In C. E. Lunneborg (Ed.), Current problems and techniques in multivariate psychology: Proceedings of a conference honoring Professor Paul Horst (pp. 23–32). Seattle: University of Washington.

Lord, F. M., & Novick, M. R. (1968). Statistical theories of mental test scores. Reading, MA: Addison-Wesley.

MacCallum, R. C., Widaman, K. F., Zhang, S., & Hong, S. (1999). Sample size in factor analysis. Psychological Methods, 4, 84–99. doi:10.1037/1082-989X.4.1.84

Masters, G. N. (1982). A Rasch model for partial credit scoring. Psychometrika, 47, 149–174. doi:10.1007/BF02296272

McLeod, L. D., Swygert, K. A., & Thissen, D. (2001). Factor analysis for items scored in two categories. In D. Thissen & H. Wainer (Eds.), Test scoring (pp. 189–216). Mahwah, NJ: Erlbaum.

Muraki, E. (1992). A generalized partial credit model: Application of an EM algorithm. Applied Psychological Measurement, 16, 159–176. doi:10.1177/014662169201600206

Nering, M. L., & Ostini, R. (Eds.). (2010). Handbook of polytomous item response theory models. New York, NY: Taylor & Francis.

Ostini, R., & Nering, M. L. (2006). Polytomous item response theory models. Thousand Oaks, CA: Sage.

Pilkonis, P. A., Choi, S. W., Reise, S. P., Stover, A. M., Riley, W. T., & Cella, D. (2011). Item banks for measuring emotional distress from the Patient-Reported Outcomes Measurement Information System (PROMIS): Depression, anxiety, and anger. Assessment, 18, 263–283.

Preston, K. S. J., Reise, S. P., Cai, L., & Hays, R. D. (2011). Using the nominal response model to evaluate response category discrimination in the PROMIS emotional distress item pools. Educational and Psychological Measurement, 71, 523–550.

Reckase, M. D. (2010). Multidimensional item response theory. New York, NY: Springer.

Reise, S. P. (2009). The emergence of item response theory (IRT) models and the Patient-Reported Outcomes Measurement Information System (PROMIS). Austrian Journal of Statistics, 38, 211–220.

Reise, S. P., Ainsworth, A. T., & Haviland, M. G. (2005). Item response theory: Fundamentals, applications, and promise in psychological research. Current Directions in Psychological Science, 14, 95–101. doi:10.1111/j.0963-7214.2005.00342.x

Reise, S. P., & Henson, J. M. (2003). A discussion of modern versus traditional psychometrics as applied to personality assessment scales. Journal of Personality Assessment, 81, 93–103. doi:10.1207/S15327752JPA8102_01

Reise, S. P., Moore, T. M., & Haviland, M. G. (2010). Bifactor models and rotations: Exploring the extent to which multidimensional data yield univocal scale scores. Journal of Personality Assessment, 92, 544–559.

Reise, S. P., Morizot, J., & Hays, R. D. (2007). The role of the bifactor model in resolving dimensionality issues in health outcomes measures. Quality of Life Research, 16, 19–31.

Reise, S. P., & Waller, N. G. (2003). How many IRT parameters does it take to model psychopathology items? Psychological Methods, 8, 164–184. doi:10.1037/1082-989X.8.2.164

Reise, S. P., & Waller, N. G. (2009). Item response theory and clinical measurement. Annual Review of Clinical Psychology, 5, 27–48. doi:10.1146/annurev.clinpsy.032408.153553

Reise, S. P., & Yu, J. (1990). Parameter recovery in the graded response model using MULTILOG. Journal of Educational Measurement, 27, 133–144. doi:10.1111/j.1745-3984.1990.tb00738.x

Roberts, J. S., Donoghue, J. R., & Laughlin, J. E. (2000). A general item response theory model for unfolding unidimensional polytomous responses. Applied Psychological Measurement, 24, 3–32. doi:10.1177/01466216000241001

Rosenberg, M. (1989). Society and the adolescent self-image (Rev. ed.). Middletown, CT: Wesleyan University Press.

Rouse, S. V., Finger, M. S., & Butcher, J. N. (1999). Advances in clinical personality measurement: An item response theory analysis of the MMPI–2 PSY-5 scales. Journal of Personality Assessment, 72, 282–307. doi:10.1207/S15327752JP720212

Samejima, F. (1969). Estimation of latent ability using a response pattern of graded scores. Psychometrika Monograph Supplement, 17.

Sijtsma, K., & Molenaar, I. W. (2002). Introduction to nonparametric item response theory. Thousand Oaks, CA: Sage.

Steinberg, L., & Thissen, D. (1996). Uses of item response theory and the testlet concept in the measurement of psychopathology. Psychological Methods, 1, 81–97. doi:10.1037/1082-989X.1.1.81

Thissen, D. (2003). MULTILOG 7: Multiple categorical item analysis and test scoring using item response theory [Computer software]. Chicago, IL: SSI.

Thissen, D., Cai, L., & Bock, R. D. (2010). The nominal categories item response model. In M. Nering & R. Ostini (Eds.), Handbook of polytomous item response theory models (pp. 43–75). Philadelphia, PA: Taylor & Francis.

Thissen, D., & Steinberg, L. (1986). A taxonomy of item response models. Psychometrika, 51, 567–577. doi:10.1007/BF02295596

Thissen, D., Steinberg, L., & Fitzpatrick, A. R. (1989). Multiple-choice models: The distractors are also part of the item. Journal of Educational Measurement, 26, 161–176. doi:10.1111/j.1745-3984.1989.tb00326.x

Tutz, G. (1990). Sequential item response models with an ordered response. British Journal of Mathematical and Statistical Psychology, 43, 39–55.

Van Der Ark, L. A. (2001). Relationships and properties of polytomous item response theory models. Applied Psychological Measurement, 25, 273–282. doi:10.1177/01466210122032073

Wainer, H. (2000). Computerized adaptive testing: A primer (2nd ed.). Mahwah, NJ: Erlbaum.

Wainer, H., & Thissen, D. (1987). Estimating ability with the wrong model. Journal of Educational Statistics, 12, 339–368. doi:10.2307/1165054

Waller, N. G., & Reise, S. P. (2010). Measuring psychopathology with nonstandard item response theory models: Fitting the four-parameter model to the Minnesota Multiphasic Personality Inventory. In S. Embretson (Ed.), Measuring psychological constructs: Advances in model-based approaches (pp. 147–173). Washington, DC: American Psychological Association. doi:10.1037/12074-007

Williams, C. L., Butcher, J. N., Ben-Porath, Y. S., & Graham, J. R. (1992). MMPI–A content scales: Assessing psychopathology in adolescents. Minneapolis: University of Minnesota Press.

Wilson, M. (2005). Constructing measures: An item response modeling approach. Mahwah, NJ: Erlbaum.

Zimowski, M., Muraki, E., Mislevy, R., & Bock, R. D. (2003). BILOG-MG (Version 3) [Computer software]. Lincolnwood, IL: SSI.