
CURRENT DIRECTIONS IN PSYCHOLOGICAL SCIENCE

Item Response Theory


Fundamentals, Applications, and Promise in Psychological Research
Steven P. Reise,¹ Andrew T. Ainsworth,¹ and Mark G. Haviland²
¹Department of Psychology, University of California, Los Angeles; ²Department of Psychiatry, Loma Linda University School of Medicine

Address correspondence to Steven P. Reise, Department of Psychology, Franz Hall, UCLA, Los Angeles, CA 90095; e-mail: reise@psych.ucla.edu.

ABSTRACT: Item response theory (IRT) is an increasingly popular approach to the development, evaluation, and administration of psychological measures. We introduce, first, three IRT fundamentals: (a) item response functions, (b) information functions, and (c) invariance. We next illustrate how IRT modeling can improve the quality of psychological measurement. Available evidence suggests that the differences between IRT and traditional psychometric methods are not trivial; IRT applications can improve the precision and validity of psychological research across a wide range of subjects.

KEYWORDS: item response theory; classical test theory; psychometrics; scaling

Since the beginnings of psychological measurement, classical test theory (CTT) has been the dominant approach to the construction, analysis, and scoring of psychological scales. Although CTT methods dominate to this day, a second approach, item response theory (IRT; Embretson & Reise, 2000), is becoming more popular and better appreciated. Here we review three IRT fundamentals and illustrate how the use of IRT can improve the quality of substantive research in psychology.

IRT: A PRIMER

IRT is a collection of mathematical models and statistical methods that are used to (a) analyze items and scales, (b) create and administer psychological measures, and (c) measure individuals on psychological constructs (e.g., depression). Three fundamentals of IRT are (a) item response functions (IRFs), (b) information functions, and (c) invariance.

IRFs
The basic unit of IRT is the IRF. For items on a rating scale, an IRF is a mathematical function describing the relation between where an individual falls on the continuum of a given construct, such as depression, and the probability that he or she will give a particular response to a scale item designed to measure that construct. In IRT, a construct is called a latent trait, because that trait is assumed to underlie and directly influence responses to items on a scale designed to measure it. The basic goal of IRT modeling is to determine an IRF for each item on a measure. In turn, IRFs are used to evaluate item quality and serve as building blocks to derive other important psychometric properties, as we will illustrate.

Figure 1 displays the IRFs for three dichotomously scored (e.g., true/false) items. In IRT terminology, the location along the latent-trait axis where the IRF curve changes direction (its inflection point) gives the item's difficulty; an individual will need to have a higher level of the trait to endorse a more difficult item. (Item C in Fig. 1 is more difficult than Items A or B.) The item difficulty parameter in IRT is analogous to the item mean in CTT. The steepness of the IRF at the curve's inflection point (i.e., at the item's difficulty level) is called item discrimination. More discriminating items are better able to differentiate among individuals in the trait range around an item's difficulty. With its steeper curve, Item B in Figure 1 is thus considered more discriminating than Items A or C. The item discrimination parameter in IRT models is analogous to the item-test correlation in CTT or to a factor loading (item-factor correlation) in factor analysis.
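The article does not name a specific model, but IRFs characterized by a difficulty and a discrimination parameter, like those in Figure 1, correspond to the widely used two-parameter logistic (2PL) model. As a sketch under that assumption, the probability that a person at trait level $\theta$ endorses item $i$ is

$$P_i(\theta) = \frac{1}{1 + \exp[-a_i(\theta - b_i)]},$$

where $b_i$ is the item's difficulty (the location of the inflection point) and $a_i$ is its discrimination (the slope at that point).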

Fig. 1. Examples of item response functions (IRFs) for three dichotomously scored (e.g., true/false) items. In item response theory (IRT), an IRF expresses the probability of endorsing a particular item (the vertical axis) as a function of a person's level of a particular trait (the horizontal axis). The difficulty of a particular item (i.e., how probable it is to be endorsed) is indicated by where its IRF curve changes direction; an individual will need to have a higher level of the latent trait to endorse a more difficult item. Here, Item C is more difficult than Items A or B.

Information Functions
To judge the quality of an item, one can transform the item's IRF into an item information function, which shows how much psychometric information (a number that represents an item's ability to differentiate among people) the item provides at each trait level. Different items can provide different amounts of information in different ranges of a given latent trait. Relatively easy items are best for differentiating among individuals low on the trait, whereas relatively difficult items are best for differentiating among individuals high on the trait. Figure 2 shows the item information functions corresponding to the three items in Figure 1. Clearly, items of different difficulty provide information in different latent-trait ranges, and more-discriminating items (e.g., Item B) provide more information than less-discriminating items (e.g., Items A and C).

Item information functions from different items can be added together to form a scale information function. Because information is directly related to measurement precision (more information equals more precise measurement), the scale information function estimates how well a measure functions as a whole in different trait ranges. The fact that item information functions can be added together is the foundation for scale construction with IRT. Figure 3 shows the scale information function for the three example items. This three-item measure appears to differentiate among individuals best in the middle of the trait range and not to differentiate among individuals at either of the extremes. Item and scale information are analogous to CTT's item and test reliability. An important difference, however, is that under an IRT framework, information (precision) can vary depending on where an individual falls along the trait range, whereas in CTT, the scale reliability (precision) is the same for all individuals, regardless of their raw-score levels.

Fig. 2. Item information functions for the same three items shown in Figure 1. An item information function (based on the item's response function) shows how much psychometric information (ability to differentiate among people; vertical axis) an item provides at each trait level (horizontal axis). Relatively easy items like A are best for differentiating among individuals low on the trait, whereas relatively difficult items like C are best for differentiating among individuals high on the trait.

Fig. 3. The scale information function is created by adding together the three item information functions in Figure 2. The scale information function estimates how much information (vertical axis) a set of items provides in different trait ranges. The graph shows that the measure consisting of Items A, B, and C differentiates among individuals better in the middle of the trait range than elsewhere and does not differentiate among individuals at either of the extremes.
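Under the same assumed 2PL model, these quantities have simple closed forms: item information is

$$I_i(\theta) = a_i^2\,P_i(\theta)\,[1 - P_i(\theta)],$$

scale information is the sum $I(\theta) = \sum_i I_i(\theta)$, and the standard error of a maximum-likelihood trait estimate at level $\theta$ is $\mathrm{SE}(\theta) = 1/\sqrt{I(\theta)}$, which is how "more information" translates into "more precise measurement."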

Invariance
Invariance in IRT models essentially means two things. First, an individual's position on a latent-trait continuum can be estimated from his or her responses to any set of items with known IRFs, even items that come from different measures. In contrast, in CTT, item responses are aggregated to estimate a true score that is specific to that measure alone. Second, item properties, as represented by the IRF, do not depend on the characteristics of a particular population.¹ Also, the scale of the trait does not depend on any particular item set, but exists independently. In CTT, item means and item-test correlations, as well as reliability coefficients and standard errors, are dependent on the characteristics of particular populations. In CTT, the raw (and true) score scale is defined by a particular set of items on a single measure.

The advantages of item-parameter invariance are clear. For example, in large-scale educational assessment, item-parameter invariance facilitates the linking of scales from different measures (i.e., placing scores on a single, common scale), across students in different grade levels (e.g., third through sixth grade in the same school) and within a grade level (e.g., fourth graders in different schools). Similarly, using IRT methods to compare individuals who have responded to different measures is relevant to cross-cultural and developmental researchers, as we show in the next section.

1. IRT item properties are invariant only within a linear transformation. Item-parameter invariance does not mean that item-parameter estimates will be the same regardless of sample characteristics.
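To make the first invariance point concrete, here is a minimal sketch (ours, not the authors'), assuming the 2PL model above, expected a posteriori (EAP) scoring with a standard-normal prior, and made-up item parameters. Two people who answered entirely different items with known IRFs still receive trait estimates on the same scale:

```python
import numpy as np

def eap_theta(responses, a, b, grid=np.linspace(-4, 4, 81)):
    """Expected a posteriori (EAP) trait estimate under a 2PL model.

    responses : 0/1 item responses
    a, b      : known discrimination and difficulty parameters
    Assumes a standard-normal prior on the latent trait.
    """
    responses, a, b = map(np.asarray, (responses, a, b))
    # Endorsement probability for every item at every grid point
    p = 1.0 / (1.0 + np.exp(-a * (grid[:, None] - b)))
    # Likelihood of the observed response pattern at each grid point
    like = np.prod(np.where(responses == 1, p, 1.0 - p), axis=1)
    prior = np.exp(-grid**2 / 2)       # N(0, 1) density, up to a constant
    post = like * prior
    post /= post.sum()
    return float(np.sum(grid * post))  # posterior mean of theta

# Two respondents answering *different* item sets (hypothetical parameters)
theta1 = eap_theta([1, 0, 1], a=[1.2, 0.8, 1.5], b=[-0.5, 0.0, 1.0])
theta2 = eap_theta([1, 1],    a=[1.0, 1.3],      b=[-1.0, 0.5])
print(theta1, theta2)
```

Because the item parameters, rather than any particular item set, define the scale, theta1 and theta2 are directly comparable even though the two response patterns share no items.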
THE PROMISE OF IRT

IRT methods represent an improvement over CTT in many ways. Given space limits, however, we do not list and comment on all of the potential advantages; rather, we give three examples: (a) the handling of qualitative variation, (b) the scaling of individual differences, and (c) psychometric analyses.

Handling Qualitative Differences
Psychological researchers increasingly are confronted with diversity challenges. Specifically, how can individuals from different age groups, genders, cultures, ethnic groups, and socioeconomic backgrounds be compared meaningfully on psychological variables (e.g., depression)? A critical challenge researchers face is that a psychological construct may be manifested in different ways by different types of people (i.e., qualitative variation); scale items may have different relations with the underlying construct for people from different groups. When this is true, it is virtually impossible to administer a common measure to different groups, compute raw scores, and make meaningful comparisons.

IRT does not solve all cross-group assessment problems, but it offers (a) an excellent framework for exploring qualitative variation (and, thus, measurement equivalence) across groups and (b) a method of scaling individuals from qualitatively diverse groups onto a common scale. Because of item-parameter invariance, it is possible to estimate an IRF for an item separately in two or more groups. If certain conditions are met (i.e., a subset of invariant items exists), it is possible to then place these estimated IRFs onto the same scale and meaningfully compare the items. This is true even if the groups differ on the trait mean and variance. For example, consider the two depression items whose IRFs are shown in Figure 4 ("I haven't lived life right") and Figure 5 ("I don't care about life").² If these items function equivalently across groups (i.e., there is no qualitative variation), then the IRFs should be completely overlapping after placing them onto a common scale. This, clearly, is not the case: Both items function differently in different groups. In Figure 4, the IRFs show that women must be more depressed than men to be as likely as men to agree with the statement "I haven't lived life right." That is, at any point on the latent-trait scale (depression), women are less likely than men to endorse this item. Figure 5 shows that the item "I don't care about life" works similarly between genders: The IRFs for men and women are highly similar, as are those for adolescent boys and girls. In this case, however, there is a large difference due to age: The IRFs for adults are shifted to the right (i.e., the item is more difficult, or requires higher depression levels, for adults than for adolescents). The IRFs also show that this item is more discriminating for adults than for adolescents; the curves for adults are steeper.

2. These are items from the Minnesota Multiphasic Personality Inventory-2 item pool (Butcher et al., 2001). We have changed the wording of the items to avoid copyright infringement. The IRFs are based on analyses of large clinical samples of adults and adolescents (see Reise & Waller, 2003).

Fig. 4. Item response functions showing gender differences for a depression-scale item ("I haven't lived life right"). Item response theory allows different IRFs to be created for different groups and then placed onto a common scale, in this case for men and women. The IRFs show that women must be more depressed than men to have an equal chance of endorsing the item; at any point on the latent-trait scale (depression), women are less likely than men to endorse this item.

Fig. 5. Item response functions showing age differences for the depression item "I don't care about life." In this case, the IRFs are similar between genders, but there is a large difference due to age: The IRFs for adults are shifted to the right, meaning that the item is more difficult for adults than for adolescents. The steeper curves for men and women, relative to boys and girls, show that this item is more discriminating for adults than for adolescents.

IRT is not only a fancy statistical method to identify how traits manifest themselves differently in different groups or to evaluate the applicability of a psychological measure across populations; it can also be used to adjust for those differences. Under a traditional CTT framework, it would be challenging to use the raw scores from items such as those shown in Figures 4 and 5 to compare adult and adolescent men and women on the same depression continuum. In IRT, however, a researcher can simply use different IRFs for different groups and still maintain latent-trait estimates (i.e., scores) that are on the same scale, and hence comparable.
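To make the Figure 4 style of comparison concrete, here is a sketch of how one might quantify such group differences once the group-specific IRFs are on a common scale. The parameter values are hypothetical, and the between-curve area is just one simple index among several used in differential-item-functioning work:

```python
import numpy as np

def irf(theta, a, b):
    """Two-parameter logistic item response function."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

# Hypothetical parameters for one item, estimated separately in two
# groups and already linked to a common latent-trait scale
theta = np.linspace(-4.0, 4.0, 801)
p_men = irf(theta, a=1.4, b=0.2)
p_women = irf(theta, a=1.4, b=0.9)  # larger b: women must be more
                                    # depressed to endorse the item

# Unsigned area between the two IRFs: 0 means the item functions
# identically across groups; larger values mean larger differences
step = theta[1] - theta[0]
area = np.sum(np.abs(p_men - p_women)) * step
print(f"Area between IRFs: {area:.2f}")
```

Published DIF analyses use more refined statistics, but the logic is the same: estimate IRFs per group, link them, and quantify the discrepancy.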
Scaling Individual Differences
In psychology, individual differences are empirically represented by some measurement scale, typically the raw total score on a scale that is specific to a particular measure. In turn, all substantive findings of change, continuity, and group differences are dependent on this ordinal/quasi-interval metric used to represent the variables of interest. Trait-level estimates in IRT are superior to raw total scores because (a) they are optimal scalings of individual differences (i.e., no scaling can be more precise or reliable) and (b) latent-trait scales have relatively better (i.e., closer to interval) scaling properties.³ Does this technical superiority make any difference in terms of substantive research in psychology, however? Research suggests that it does. Embretson (1996), for example, working with a 2 × 2 ANOVA model, and Kang and Waller (2005), working with a regression model, demonstrated that using raw scores (as opposed to latent-trait metrics) can result in the identification of spurious interactions. In both studies, the authors noted that this phenomenon is most likely to occur if there is a mismatch between item difficulties and individual trait levels (e.g., when a depression scale is administered to normal undergraduates). The Kang and Waller study demonstrated that the use of IRT trait scores greatly reduces these spurious findings.

3. For discussion of the interpretation of IRT latent-trait scales under different IRT models, see Perline, Wright, and Wainer (1979) and Embretson and Reise (2000).

The superior properties of IRT scores are also important in analyses of growth and clinical change. For example, we developed and analyzed a measure of cognitive problems and demonstrated that, on our measure, equal changes on the IRT latent-trait continuum do not correspond to equal changes in raw scores (Reise & Haviland, in press). Specifically, equal changes on the latent cognitive-problems trait produce unequal changes on the raw-score scale, depending on where on the latent-trait scale the changes occur. Thus, two individuals may change the same amount on the latent trait measured by an instrument, but the changes in their raw scores will not reflect this. Such findings greatly complicate the use of raw scores in studying clinical change at the individual level. We note, also, that many longitudinal data-analytic techniques, such as growth-curve modeling, assume that the dependent variable has at least interval-measurement properties (Khoo, West, Wu, & Kwok, in press). Thus, IRT scalings potentially offer the opportunity to apply these models beyond, say, physical and biological dependent measures.

Fraley, Waller, and Brennan (2000) provided a further example of the potential advantages of IRT scaling. These authors considered the use of IRT methods in examining hypotheses about change and continuity in adult attachment. Their analyses of the scale information functions for popular adult attachment measures revealed that these functions tend to be relatively peaked; that is, they provide good measurement precision on one end of the continuum only. In turn, the authors demonstrated (using a data-simulation procedure) that this property can adversely affect the evaluation of substantive hypotheses about continuity in secure attachment or differential patterns of attachment stability.

In fact, Fraley et al. (2000) demonstrated that analyses of change at the raw-score level and analyses of change using the latent-trait metric may lead to opposite conclusions. In one example, they displayed results showing that highly anxious individuals are relatively less stable over time when considered at the raw-score level, but more stable over time when considered at the latent-trait level. Although Fraley et al. were not the first to point out problems with attachment measures, their demonstration that failing to understand the scaling properties of an instrument can lead to grossly inaccurate conclusions is quite sobering.
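The raw-score/latent-trait nonlinearity behind these results can be seen in the test characteristic curve, the sum of a scale's IRFs, which maps each trait level to an expected raw score. A minimal sketch, using the same hypothetical 2PL items as in the scoring example above:

```python
import numpy as np

a = np.array([1.2, 0.8, 1.5])   # hypothetical discriminations
b = np.array([-0.5, 0.0, 1.0])  # hypothetical difficulties

def expected_raw_score(theta):
    """Test characteristic curve: expected raw score at trait level theta."""
    return float(np.sum(1.0 / (1.0 + np.exp(-a * (theta - b)))))

# Equal one-unit changes on the latent trait produce unequal raw-score
# changes, depending on where on the trait continuum they occur
print(expected_raw_score(0.0) - expected_raw_score(-1.0))  # mid-range change
print(expected_raw_score(3.0) - expected_raw_score(2.0))   # change at the high end
```

Near a scale's floor or ceiling the curve flattens, so raw-score change understates latent change there, which is exactly the complication for studying clinical change described above.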

Psychometric Analysis
Fundamentally, IRT is a psychometric tool. In several studies, IRF analyses have been used to shed light on the way people respond to items on measures of psychopathology. Reise and Waller (2003), for example, challenged the notion that the direction of item keying (e.g., "extraverted vs. introverted" as opposed to "introverted vs. extraverted") on a personality or psychopathology measure makes no difference, and Santor and Ramsay (1998) called into question the assumed ordering of response options (i.e., that a response of 3 is higher than 2, a response of 2 is higher than 1, and so forth) and the gender equivalence of a popular depression measure. The specific results of these studies are not the issue here; our point is that these analyses are excellent examples of how IRT enabled these authors to identify important scale properties and problems that are missed by traditional analyses.

In addition to being useful in scale analysis, IRT is changing the way scales are developed and administered. Historically, measurement has been based on competing fixed-length, paper-and-pencil measures. IRT methods, in contrast, naturally lead researchers away from these competing measures and toward the creation of a common item pool to measure a given construct. (An item pool is a set of items with known IRFs.) An excellent example of how an item pool can be created can be found in McHorney and Cohen (2000). The authors created a pool of physical-functioning items from different measures and used IRT to equate them (i.e., place them on a common scale). Previously, it had been virtually impossible to compare data across the many (more than 75) competing measures of functional status.

Several researchers have used IRT to develop item pools, which are used as a basis for computerized adaptive testing (CAT), a method in which items are administered via a computer to optimally match an individual's trait level. Decades of research have demonstrated that IRT-based CAT is at least twice as efficient as paper-and-pencil testing and is no less precise (Weiss, 2004). Creating item pools and using CAT appears to be increasingly attractive in health-outcomes research, where, as in other applied areas, there often is an urgent need to measure people quickly (so that more constructs can be measured) and accurately (so that results can be trusted and clients' change can be monitored). For example, Ware et al. (2003) demonstrated that an adaptive version of the Headache Impact Survey outperformed the traditional paper-and-pencil form in reducing burden on patients, in tracking change over time, and in test reliability and test validity.
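A sketch of the core step of such a CAT (our illustration, with hypothetical 2PL item parameters; operational systems add stopping rules and exposure controls): at each point, administer the not-yet-used item that is most informative at the respondent's current trait estimate.

```python
import numpy as np

def item_information(theta, a, b):
    """Fisher information of a 2PL item at trait level theta."""
    p = 1.0 / (1.0 + np.exp(-a * (theta - b)))
    return a**2 * p * (1.0 - p)

def next_item(theta_hat, a, b, administered):
    """Maximum-information item selection, the core step of a CAT."""
    info = item_information(theta_hat, a, b)
    info[list(administered)] = -np.inf  # never re-administer an item
    return int(np.argmax(info))

# Hypothetical pool of four 2PL items; pick the best next item for a
# respondent whose current trait estimate is 0.8 and who has already
# been given item 1
a = np.array([1.0, 1.4, 0.7, 1.2])
b = np.array([-1.0, 0.5, 0.0, 1.5])
print(next_item(0.8, a, b, administered={1}))
```

This targeting of items to the person is why CAT can match paper-and-pencil precision with roughly half as many items.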
CONCLUSION

IRT methods are used because researchers want to (a) more rigorously study how items function differently in different groups; (b) place individuals from different groups onto a common scale, even if they have responded to different items; (c) use individual scores that have good psychometric properties, so that statistical techniques (such as a growth model) can be applied with greater accuracy and spurious results or invalid findings can be avoided; (d) thoroughly understand the psychometric properties of their instruments; (e) create more order in their fields by having a common metric for a construct, rather than many competing fixed-length instruments; and (f) develop CAT systems for more efficient assessment of individual differences.

IRT methods are most appropriate whenever measurement of psychological characteristics via multiple indicators (items) is desirable. They are most applicable in fields in which there is a need for independent or dependent variables with strong scaling properties, when it is important to maintain a consistent measurement scale across time, and in contexts in which decisions of great consequence are made on the basis of individuals' scores. We expect that increased IRT applications will lead to further clear illustrations of its importance.

Recommended Reading
Embretson, S.E., & Reise, S.P. (2000). (See References)
Thissen, D., & Wainer, H. (2001). Test scoring. Mahwah, NJ: Erlbaum.
Wainer, H. (2000). Computerized adaptive testing: A primer (2nd ed.). Mahwah, NJ: Erlbaum.

REFERENCES

Butcher, J.N., Graham, J.R., Ben-Porath, Y.S., Tellegen, A., Dahlstrom, W.G., & Kaemmer, B. (2001). Minnesota Multiphasic Personality Inventory-2 (MMPI-2): Manual for administration, scoring, and interpretation (2nd ed.). Minneapolis: University of Minnesota Press.

Embretson, S.E. (1996). Item response theory models and spurious interaction effects in factorial ANOVA designs. Applied Psychological Measurement, 20, 201–212.

Embretson, S.E., & Reise, S.P. (2000). Item response theory for psychologists. Mahwah, NJ: Erlbaum.

Fraley, R.C., Waller, N.G., & Brennan, K.A. (2000). An item response theory analysis of self-report measures of adult attachment. Journal of Personality and Social Psychology, 78, 350–365.

Kang, S., & Waller, N.G. (2005). Moderated multiple regression, spurious interaction effects, and IRT. Applied Psychological Measurement, 29, 87–105.

Khoo, S.T., West, S.G., Wu, W., & Kwok, O. (in press). Longitudinal methods. In M. Eid & E. Diener (Eds.), Handbook of psychological measurement: A multimethod perspective. Washington, DC: American Psychological Association.

McHorney, C.A., & Cohen, A.S. (2000). Equating health status measures with item response theory: Illustrations with functional status items. Medical Care, 38(Suppl. 9), II43–II59.

Perline, R., Wright, B., & Wainer, H. (1979). The Rasch model as additive conjoint measurement. Applied Psychological Measurement, 3, 237–255.

Reise, S.P., & Haviland, M.G. (in press). Item response theory and the measurement of clinical change. Journal of Personality Assessment.

Reise, S.P., & Waller, N.G. (2003). How many IRT parameters does it take to model psychopathology items? Psychological Methods, 8, 164–184.

Santor, D.A., & Ramsay, J.O. (1998). Progress in the technology of measurement: Applications of item response models. Psychological Assessment, 10, 345–359.

Ware, J.E., Jr., Kosinski, M., Bjorner, J.B., Bayliss, M.S., Batenhorst, A., Dahlof, C.G.H., Tepper, S., & Dowson, A. (2003). Applications of computerized adaptive testing (CAT) to the assessment of headache impact. Quality of Life Research, 12, 935–952.

Weiss, D.J. (2004). Computerized adaptive testing for effective and efficient measurement in counseling and education. Measurement and Evaluation in Counseling and Development, 37, 70–84.
