Rehabilitation Counseling Bulletin, 50(3), 177–188 (2007)

Modern Psychometric Methodology: Applications of Item Response Theory

Christine A. Reid, Stephanie A. Kolakowsky-Hayner, Allen N. Lewis, and Amy J. Armstrong
Virginia Commonwealth University

Item response theory (IRT) methodology is introduced as a tool for improving assessment instruments used with people who have disabilities. Need for this approach in rehabilitation is emphasized; differences between IRT and classical test theory are clarified. Concepts essential to understanding IRT are defined, necessary data assumptions are identified, and specific data analysis techniques, using software such as TESTFACT and BILOG, are described. An example of IRT analysis applied to a subtest of the General Aptitude Test Battery (GATB) administered to people who have diverse disabilities is provided. Future potential uses of this approach in rehabilitation are outlined.

The purpose of this article is to introduce item response theory (IRT) and the associated latent trait and item response modeling techniques in an effort to improve psychometric methodology focused on the assessment of people who have disabilities. The need for such methodology in rehabilitation applications is emphasized, and examples of applications in related areas are provided. Differences between IRT and "classical" test theory are clarified, and implications of those differences are discussed. Key constructs necessary to understand IRT methodology are defined. Statistical software packages necessary for IRT analyses are described, and resources for developing expertise in using them effectively are identified. A practical example of IRT analysis of one of the subtests of the General Aptitude Test Battery (GATB; U.S. Department of Labor, 1970) administered to a rehabilitation client population is given, accompanied by explanations about how to select the appropriate analysis options and how to interpret the resulting statistical output for each phase of the analysis. Implications of the findings from this example application are then discussed.

IMPORTANCE OF IRT

IRT techniques have tremendous potential for solving measurement problems in rehabilitation. A test developer using IRT generates a mathematical function to describe the relationship between test performance and ability or trait level. This function can be used to estimate a wide range of ability levels, such as those common among samples of rehabilitation clients, regardless of the range of ability within the original normative sample (Hambleton & Swaminathan, 1985). Because this methodology focuses on the relationship between each individual item and the underlying (latent) trait or ability assessed by the instrument, each of those items can provide meaningful information about an examinee. In contrast, traditional testing requires administration of the entire test to each examinee to meaningfully interpret the test results. If a test has been developed using IRT, the level of the ability or trait measured can be estimated for an individual completing any subset of the test items; administration of the entire test is no longer necessary. Different subsets of items can be given to different examinees or to the same examinees at different points in time; interpretation of the results does not depend on consistent administration of one standardized set of items. Estimates of an examinee's individual ability or trait levels are generated through formulas computing the levels along that latent continuum with the greatest probability, given that individual's pattern of responses to the particular subset of items administered. In contrast, for assessment instruments developed without IRT techniques, meaningful
interpretation of test results requires administration of the entire test and comparison of the obtained total score with the results obtained from an appropriate normative group. Applying item response models to tests used with rehabilitation clients can provide more information about the relationship between performance on test items and the underlying ability or trait of interest than can traditional item analysis techniques.

IRT techniques are potent tools for studying the psychometric characteristics of tests administered to rehabilitation populations. The long list of potentially useful applications includes (a) study of potential bias in assessment instruments, (b) generation of models to extrapolate estimation of ability for individuals outside the range studied in the standardization sample, (c) enhanced equating of the results of newly developed instruments with those of old standards, (d) assessment of how accurately an instrument measures across the continuum of ability, (e) development of instruments to measure accurately at a selected target level of ability, and even (f) development of computerized adaptive tests. In computerized adaptive testing, the instrument "tailors itself" precisely to the individual ability or trait level of each examinee, measuring ability with a fraction of the number of items required for conventional testing.

Traditional scale development statistics such as item difficulties (p values) and overall item–test score correlations are no longer the only criteria available to test developers for selecting the best items for a test. Using IRT, researchers can specify how well individual items distinguish between individuals who do or do not possess a particular target level of the trait or ability of interest. The degree to which guessing negatively affects measurement accuracy can also be assessed on an item-by-item basis. Such detailed specification makes it possible to examine the expected degree of measurement error at particular levels of the ability or trait on an item-by-item basis, rather than relying on one global standard error of measurement (SEM) conventionally considered the counterpart to overall reliability of the test as a whole. Test items can be selected to minimize measurement error and maximize information gained at a particular targeted level of ability.

ITEM RESPONSE THEORY PRINCIPLES

IRT is a "modern" test theory utilizing a set of propositions or mathematical models related to individuals' responses to items, providing a probabilistic way of linking observable data to theoretical constructs, with the ability to statistically adjust scores for properties of test items such as difficulty, discriminating power, and liability to guessing (Embretson & Reise, 2000; Lord, 1980; Van der Linden & Hambleton, 1997). In its simplest form for dichotomized (right or wrong) answers, IRT specifies that the probability of answering an item correctly should increase in a predictable manner as the level of the examinee's ability increases. When one is measuring traits (such as level of depression), the dichotomy is between endorsed and not endorsed, rather than right and wrong, and the probability of endorsing an item should increase in a predictable manner as the level of the examinee's trait in question increases. A plot of the probability of answering an item correctly (or endorsing the item) at every level of examinee ability (or level of the underlying trait) results in a curve that increases in slope until it reaches a point of inflection, at which point the rate of increase begins to decrease, so that the curve has a stretched-out S shape. Simple linear regression does not capture the complexity of this nonlinear relationship; a more sophisticated model is needed. An item characteristic curve can be fitted to the data, with characteristics of that curve determined by its corresponding item response model.

The simplest model (the one-parameter logistic, or Rasch model) requires an assumption that items differ from each other only in their degree of difficulty. Item characteristic curves for a test developed using this model would look remarkably similar to each other, differing only in their relative location on the ability or trait continuum, parallel to each other in every other way.

In contrast, item characteristic curves fitted to the data using a two-parameter model would differ not only in relative placement along the ability continuum but also in the degree to which probability of answering an item correctly (or endorsing it) increases steeply with an increase in level of underlying ability (or level of trait). Some curves will have rather steep slopes before they hit the point of inflection; others will be relatively flat. Curves with a steep slope show evidence of more power to discriminate between people who are above or below that particular level of ability or trait.

Item characteristic curves fitted using the three-parameter model differ not only in their level of difficulty and their power to discriminate between people who are or are not at a particular level of ability (or trait); they also differ in the degree to which examinees with a low level of ability (or trait) will answer that item correctly (or endorse it). The lower asymptote of the curve (bottom left corner of the S curve) will be higher for an item that is easy to guess correctly and lower for one that is not so easy to guess.

For items with multiple response options (such as a rating scale) instead of one dichotomous answer, more complex IRT models can be used. In essence, these models generate separate item characteristic curves for each response option of each item, and they base ability or trait estimates on the pattern of those responses. Santor, Ramsay, and Zuroff (1994) used this technique to examine the Beck Depression Inventory and to evaluate it for potential gender bias.

An outstanding text for developing an intuitive (vs. a mathematically based) understanding of IRT is Baker's (1985) classic, The Basics of Item Response Theory, which includes a companion tutorial computer program. Although the text in its original format is out of print, it has been revised (Baker, 2001) and is now available online at no charge from the ERIC Clearinghouse on Assessment and Evaluation at http://edres.org/irt/baker/. The downloadable companion interactive tutorial (also provided free of charge) includes specific exercises to parallel each of Baker's chapters, which help users build a conceptual understanding of how IRT techniques work and why the results have such useful characteristics.

"MODERN" VERSUS "CLASSICAL" TEST THEORY

Latent trait modeling, item response modeling, and IRT methodology are roughly equivalent terms used to describe techniques applying "modern psychometric theory," as opposed to "classical test theory." Key differences between these two have been discussed by Baker (1992), Cohen and Swerdlik (2005), Hays, Morales, and Reise (2000), Hulin, Drasgow, and Parsons (1983), and Reid (1995). IRT was developed to remedy at least three problems with psychometric assessment based on traditional (classical) test development. The first problem with traditional methods lies in the fact that norms used for interpretation of the test scores are sample specific. Both the average level of ability and the range of abilities among individuals whose performance was used to norm the test influence subsequent test results. In contrast, IRT techniques can be used to generate a mathematical formula to extrapolate beyond the range of ability of the standardization sample. This is particularly important for some rehabilitation populations, whose level of performance on a given assessment instrument may be outside the range of the original sample used to develop the test.

Another problem with traditional test development methods is the necessity of administering the entire instrument to a given individual to obtain any usable information about his or her performance. A correct or incorrect response to any individual test item is relatively meaningless without information about performance on the rest of the items. Any modification to shorten the instrument would require renorming, as if it were a completely new instrument. Adaptive testing, in which the response to one item determines which item should be administered next, is not possible under these conditions. In contrast, IRT techniques model the relationship between ability (or trait) and performance for each individual item.

One more problem with traditional test development methods is the instability of scores at extreme (either high or low) levels of an ability or trait, even within the normative sample. Although the traditionally reported standard error of measurement is assumed to be constant across all levels of an ability or trait, in practice it is higher at the extreme levels (Hambleton & Swaminathan, 1985). Generally, a test most accurately and efficiently measures ability when the average level of item difficulty is equal to the tested individual's own ability level, the point at which the examinee has a 50% probability of answering each item correctly (Bolton, 2001). Through the use of IRT, tests can be tailored to a desired ability level to obtain the most accurate estimate of ability possible within the shortest possible period of time. Such an increase in measurement efficiency is important for rehabilitation applications, in which comprehensive assessment of knowledge, skills, abilities, interests, values, and so forth may be needed within a relatively short period of time.

KEY TERMS DEFINED

Latent trait: the underlying ability or trait presumed to be measured by the test items. Although models of a latent trait cannot be tested directly (because the latent trait cannot be directly observed), they can be tested indirectly through their implications for the joint probability distribution of the item responses for a set of items (Borsboom, Mellenbergh, & van Heerden, 2003).

Item characteristic curve (ICC): the mathematical curve (logistic function) describing the relationship between the ability (or level of trait) of examinees and the probability of answering a particular item correctly (or endorsing it). The equation for an ICC under the three-parameter logistic model can be expressed as

P(θ) = c + (1 − c) / {1 + exp[−1.7a(θ − b)]},

in which θ is the latent trait (or ability) and a, b, and c are the item parameters. Equivalent yet alternative versions of this formula can be found in Baker (1985) and Lord (1980).
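As a concrete reading of this equation, the short sketch below (Python, not code from the article) evaluates the three-parameter logistic ICC for a single item; the parameter values in the example call are hypothetical and purely illustrative.

```python
import math

def icc_3pl(theta, a, b, c):
    """P(theta) = c + (1 - c) / (1 + exp(-1.7 * a * (theta - b))),
    the probability of a correct (or endorsed) response under the
    three-parameter logistic model."""
    return c + (1.0 - c) / (1.0 + math.exp(-1.7 * a * (theta - b)))

# Hypothetical item: moderate discrimination, average difficulty,
# and a small pseudoguessing floor (values are illustrative only).
for theta in (-2.0, -1.0, 0.0, 1.0, 2.0):
    print(f"theta = {theta:+.1f}  P = {icc_3pl(theta, a=1.0, b=0.0, c=0.2):.3f}")
```

Note that at θ = b the function returns c + (1 − c)/2, which is why, as the next definition points out, the probability of a correct answer at the difficulty parameter exceeds .5 whenever c > 0.
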

Difficulty parameter (b): the examinee ability level at which approximately half of the examinees are likely to answer a particular item correctly or endorse it. (Note that the probability of a correct answer is greater than .5 at b for the three-parameter model.) The b parameter is the inflection point on the item characteristic curve. It corresponds to the ability or trait level best measured by that item.

Discrimination parameter (a): the degree to which the item has power to discriminate between individuals who have or do not have the corresponding b level of a particular ability or trait. (Note that a is proportional to the slope of the ICC at point b.)

Pseudoguessing parameter (c): the probability that a low-ability examinee (or one with a low level of the trait in question) would answer the item correctly (or endorse the item), possibly through guessing. The c value serves as the ICC's intercept on the graph axis representing probability of obtaining a correct (or endorsed) score.

One-parameter model (Rasch model): an IRT logistic model in which only the difficulty parameter (b) is considered relevant. Discrimination (a) and pseudoguessing (c) parameters are considered constant; items that do not satisfy such assumptions are excluded from tests developed using this model.

Two-parameter model: an IRT logistic model in which the item difficulty (b) and item discrimination (a) are considered, but the effect of pseudoguessing (c, the lower asymptote of the ICC) is assumed to be negligible.

Three-parameter model: an IRT logistic model in which the item difficulty (b), item discrimination (a), and pseudoguessing (c) parameters are all considered relevant in estimating the ability of examinees.

Item information function: a description of how much information about examinees' underlying abilities or traits is available at various levels along a particular ability or trait continuum. The greater the item information is for a given ability level, the less error is involved in estimating examinee ability or trait level through use of that item. Item information is inversely related to measurement error associated with that item.

Test information function: the summed combination of all the item information functions for a given test. This function describes the degree of information provided by the test at each level of an ability or underlying trait.
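The last two definitions have standard closed forms for the logistic models. The sketch below uses the textbook Fisher information expression for the three-parameter model (a standard result, not a formula reported in this article), which reduces to the two-parameter case when c = 0; the small item bank is hypothetical.

```python
import math

def icc_3pl(theta, a, b, c=0.0):
    return c + (1.0 - c) / (1.0 + math.exp(-1.7 * a * (theta - b)))

def item_information(theta, a, b, c=0.0):
    """Fisher information for a 3PL item (reduces to the 2PL form when c = 0):
    I(theta) = (1.7*a)^2 * (Q/P) * ((P - c) / (1 - c))^2."""
    p = icc_3pl(theta, a, b, c)
    return (1.7 * a) ** 2 * ((1.0 - p) / p) * ((p - c) / (1.0 - c)) ** 2

def test_information(theta, items):
    """The test information function is the sum of the item information functions."""
    return sum(item_information(theta, *item) for item in items)

# Hypothetical item bank of (a, b, c) triples; not parameters from the article.
items = [(1.2, -0.5, 0.0), (0.8, 0.0, 0.0), (1.5, 0.5, 0.1)]
for theta in (-1.0, 0.0, 1.0):
    info = test_information(theta, items)
    print(f"theta = {theta:+.1f}  test information = {info:.2f}  SE = {1.0 / math.sqrt(info):.2f}")
```

The reciprocal square root of the test information at a given θ serves as a conditional standard error of estimate at that ability level, which is the sense in which information is inversely related to measurement error.
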
APPLICATIONS OF IRT

The value of applying IRT methodology to solve psychometric problems has been well documented. Advantages of applying IRT to a variety of assessment challenges have been identified by Aiken (2003), Cook, Monahan, and McHorney (2003), Hambleton (1989), Reid (1995), Santor and Ramsay (1998), and others. Examples of studies in which IRT techniques were used to shorten instruments used in rehabilitation or health care settings include Bullis, Reiman, Davis, and Reid (1997), Bullis, Reiman, Reid, and Davis (1995), Chang and Gehlert (2003), Jenkinson, Fitzpatrick, Garratt, Peto, and Stewart-Brown (2001), Jenkinson, Norquist, and Fitzpatrick (2003), Kopec et al. (1996), Mungas and Reed (2000), and Saliba, Orlando, Wenger, Hays, and Rubenstein (2000). The usefulness of IRT to develop computerized adaptive testing (CAT) applications for rehabilitation or health care settings was addressed by Cella and Chang (2000), Stoloff and Couch (1992), and Ware, Bjorner, and Kosinski (2000). Examples of a focus on IRT equating or linking items between tests related to rehabilitation or health care include articles by Hays et al. (2000), McHorney (2002), and McHorney and Cohen (2000). A study by Haley, Andres, and colleagues (2004) addressed both test shortening and item equating applications for post–acute care settings using IRT methodology. Use of IRT techniques to examine scale properties for common item equating and CAT applications, again for post–acute care settings, was addressed by Haley, Coster, and colleagues (2004). The use of IRT to identify differential item functioning (DIF) or to distinguish between bias and real differences in an ability or trait among groups was addressed by Fayers and Machin (2000), Gumpel and Wilson (1996), and Hahn and Cella (2003). Further detail about methodology for detecting DIF was provided by Raju (1990).

To illustrate the potential for IRT to solve practical testing problems, two applications were described more fully by Reid (1995). The first is the promise of computerized adaptive testing. Although computerized adaptive tests are not yet commonly available for use with populations of people who have disabilities, development of such tools would make it possible for tests to interactively "tailor themselves" to the ability level of an individual in a minimum amount of time. A computerized adaptive test begins with administration of an item of medium difficulty, and the examinee's response to that item determines which level of item difficulty is administered next; this process is repeated until the test items are primarily those with a level of difficulty that the examinee is likely to answer correctly 50% of the time (when measurement error is minimized). Decreased testing time could result in more accurate assessments by decreasing the effect of examinee fatigue. Given that computerized adaptive tests can efficiently measure abilities without the artificial constraint of time limits for administrative convenience, the influence of speed of response could be removed as a threat to the validity of traditional timed tests.

An example of IRT that has already been applied to a practical problem in rehabilitation was described by Reid (1995), Bullis et al. (1995), and Bullis et al. (1997). The problem was a barrier to the use of the Transition Competence Battery for Adolescents and Young Adults Who Are Deaf (TCB). The amount of time needed to administer the instrument was prohibitive, according to feedback from initial field testing. The TCB was designed to measure the competence of transition-age youth who are deaf in the following areas: Job-Seeking Skills, Work Adjustment Skills, Job-Related Social Skills, Money Management, Health and Home, and Community Awareness. Psychometric characteristics of the TCB were strong, but administration of the entire instrument required more than 6 to 8 hr. IRT techniques were used to develop a "mini" screening version of the TCB. Items were selected that best measured ability at a targeted level of proficiency. Examinees who "pass" the screening version should not have to spend time taking the entire TCB. In contrast, examinees who do not score at sufficient levels on some
sections of the screening version should then take the corresponding subtests of the larger TCB exam to provide detailed information about exactly which of their competence deficits could benefit from further services or education.

Statistical Software for IRT

The "gold standard" for assessing the dimensionality (number of factors accounting for variance) of a data set before conducting IRT analyses is the TESTFACT program (Bock et al., 2003), which provides full-information factor analysis. This kind of analysis examines the factor structure based not only on correlations between items within the test but on patterns (vectors) of individual responses as they relate to the underlying latent trait or ability. Similarly, the BILOG-MG 3 program (Zimowski, Muraki, Mislevy, & Bock, 2003) is the most versatile for IRT analysis of dichotomous items. For multiple-response-option items, programs such as MULTILOG or PARSCALE (Du Toit, 2003) are appropriate. These programs and other software related to IRT are available from Assessment Systems Corporation (http://www.assess.com). A useful tutorial for IRT data analysis, including sample data sets and instructions to format them for later analysis, is provided by Stark, Chernyshenko, Chuah, Lee, and Wadlington (2001) at their IRT Modeling Laboratory Web site (http://work.psych.uiuc.edu/irt/tutorial.asp).

Testing Assumptions

The use of IRT to evaluate an assessment tool requires satisfaction of several assumptions. The primary assumption is that the test is measuring only one trait: It is "sufficiently unidimensional" (Stout, 1987). Factor analysis (preferably using tetrachoric correlations when the latent trait, a continuous variable, has been dichotomized into right/wrong scoring) should reveal one factor that accounts for much of the total variance, with any successive factors accounting for much less variance in the data set (Reckase, 1979). Related to the assumption of unidimensionality is the assumption of local independence: The response to one item is not dependent in any way on the response to any other item. (For example, one test item cannot provide information useful in deducing the answer to another item.)

A second key assumption is that the relationship between ability and performance on an item can be described by a monotonically increasing function (as ability increases, the probability of answering an item correctly increases). Violation of this assumption would result in a significant misfit between the generated item characteristic curve and the observed data; it could also be detected through examination of item scatter plots.

IRT techniques should be used to analyze "power" tests, which are designed to assess ability without the need for a rapid response to each item. In contrast, the scores of "speeded" tests are significantly affected by the speed of examinees' responses. Most individuals should be able to complete all of the items for a power test; very few individuals will ever complete all of the items for a "speeded" test. Hambleton and Swaminathan (1985) provide criteria to assess the degree to which a test can be considered an appropriate power test. Details about testing this assumption (and the other assumptions described here) are provided in the case example section of this article.

Some IRT models are designed to capture the complexity of the relationship between an ability (or a trait) and performance for each item of a test; others are less sophisticated, and they require the data to satisfy additional assumptions. In particular, the one-parameter (Rasch) model assumes that all of the items are equally discriminating at their respective levels of ability. The degree to which data fit a given model can be assessed using fit statistics following item calibration, but, if one is focused on Rasch analysis, there is another way to screen the data for consistency in discrimination power before calibrating the items. Hambleton and Swaminathan (1985) recommend calculating the biserial correlation between each item and the total test score and examining how many of these correlations fall outside of an acceptable range. An application of such screening, as well as details of model–data fit statistics, is provided as part of the following case example.

PRACTICAL EXAMPLE: GENERAL APTITUDE TEST BATTERY

This example is taken from Reid's (1993) study of IRT model–data fit for selected subtests of the General Aptitude Test Battery (GATB; U.S. Department of Labor, 1970) administered to a sample of people who have disabilities. For clarity in demonstrating the IRT analytical techniques used, only one subtest is examined in this article: GATB Subtest 3, Three-Dimensional Space.

Results of the analyses are used to answer the following questions about the psychometric properties of this GATB subtest:

1. Does the subtest measure primarily one construct, or is it multidimensional when administered to people who have disabilities?
2. Does speed of response have a considerable influence on test scores?
3. What is the nature of the relationship between underlying ability and performance for individual items in this subtest?
4. Do the items in this subtest vary in level of difficulty?
5. Do the items in this subtest vary in the degree to which they can discriminate between examinees who do and those who do not have a particular level of ability?
6. Do the items in this subtest vary in the degree to which low-ability examinees correctly answer, possibly through guessing?

Additional questions, relevant to future research and practice using the GATB with a population of people who have disabilities, are also addressed:

1. What are the estimated IRT parameters for each calibrated item?
2. Given the model–data fit found in this study, what sample size requirements should be considered for future study related to item bias, development of new items to replace outdated ones, and related topics?

The actual parameter estimates can be used for a variety of rehabilitation applications, including the development of computerized adaptive tests, creation of a shortened untimed version of the test, or efficient measurement at the ability cutoff levels of specific aptitude requirements for particular occupations or training programs. The selection of which parameter model sufficiently fits the data has implications for the sample sizes necessary for further research in this area; the more complex the parameter model used, the greater the sample size needed to accurately calibrate the item parameters.

Sample Characteristics

Examinees who completed this GATB subtest were adult clients of a vocational evaluation center in a large southwestern state during the late 1980s (identifying information was removed from the data prior to analysis). Response data were available for 406 examinees with a variety of disabilities. The sample included primarily individuals with physical impairments; intervertebral disk disorders (23.5%) were the most commonly reported disability. Other disabilities commonly represented in this sample were orthopedic impairments (16%); hearing impairments (14.5%); mental health disorders such as depression, anxiety disorders, personality disorders, or schizophrenia (14.4%); nervous system disorders such as epilepsy, multiple sclerosis, or neural trauma (7.5%); alcoholism or other substance abuse disorders (6.3%); cardiovascular diseases (3.0%); and diabetes (2.0%). Disabilities reported for fewer than 2% of the examinees in this sample included arthritis, visual impairments, asthma, mental retardation, speech impairments, burns, cancer, digestive disorders, and bone disorders.

Example Instrument

The Three-Dimensional Space subtest is the third subtest in the GATB. This subtest was designed to be a power test (vs. a speeded test), but with time limits. The GATB has been described as "the best researched of the multiple aptitude batteries. Because of its large amount of validity data, it should be useful to vocational counselors" (Weiss, 1972, p. 1061). However, its speededness is a barrier to effective assessment of many individuals who have disabilities that affect their speed of response. Weiss (1972) explained, "Time limits for the ability tests [of the GATB] are purely an administrative convenience (except, of course, when only speed of response is being measured); however, when an administrative convenience serves to complicate the measurement of a variable for some individuals, it is time to reexamine its desirability" (p. 1060). Hartigan and Wigdor (1989) criticized the GATB's "speededness, its consequent susceptibility to practice effects and coaching" (p. 116), but recognized its potential value as a rehabilitation tool: "To ensure that handicapped applicants who can compete with tested applicants are given that opportunity, the GATB should be used when feasible to assess the abilities of handicapped applicants . . . the test should be used to supplement decision making" (p. 14). Identification (through IRT analysis) of the items that function best and equating that shortened subtest of items to the standard GATB subtest could result in a shortened form of the GATB that does not require time limits and would more accurately measure the abilities of people who have disabilities.

Speed Versus Power

IRT analyses are appropriate for power tests, not for tests where speed of response is a significant factor in performance. Hambleton and Swaminathan (1985) recommend examining and reporting the percentage of examinees completing the test, the percentage of examinees completing 75% of the items, and the number of items completed by 80% of the examinees. For this example, the extent to which speed of response affected the number of items completed by each examinee was examined using the traditional item analysis feature of TESTFACT (Bock et al., 2003). The third subtest of the GATB consists of 40 items. Only 3.4% of the sample completed all 40 items; 14.5% completed 75% of the items. Only 18 items were completed by 80% of the examinees. Because time available was insufficient for most examinees to answer all of the items, further steps of the IRT analyses did not include all 40 items; only those items completed by 75% of the examinees (n = 20 total items) were retained. That set of 20 items satisfied Hambleton and Swaminathan's (1985) guidelines for a power test.
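The three screening statistics recommended by Hambleton and Swaminathan (1985) are easy to tabulate directly from a scored response matrix in which unreached items are marked. The article obtained these figures from TESTFACT's traditional item analysis; the sketch below is only an illustrative stand-in, and the simulated data are not the GATB data.

```python
import numpy as np

def speededness_indices(reached):
    """reached: boolean matrix (examinees x items), True where the item was attempted.
    Returns (% completing every item, % completing 75% of the items,
    number of items attempted by at least 80% of examinees)."""
    n_examinees, n_items = reached.shape
    items_reached = reached.sum(axis=1)
    pct_all = float(np.mean(items_reached == n_items) * 100)
    pct_75 = float(np.mean(items_reached >= 0.75 * n_items) * 100)
    items_by_80pct = int(np.sum(reached.mean(axis=0) >= 0.80))
    return pct_all, pct_75, items_by_80pct

# Purely illustrative data: 406 simulated examinees on a 40-item subtest,
# each stopping somewhere between item 15 and item 40.
rng = np.random.default_rng(0)
last_item_reached = rng.integers(15, 41, size=406)
reached = np.arange(40)[None, :] < last_item_reached[:, None]
print(speededness_indices(reached))
```
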

Analysis of Dimensionality

The 20 subtest items completed by 75% of the examinees were then factor analyzed using TESTFACT, which is preferred over other factor analysis programs because it uses not only the information inherent in each individual item response but also the additional ("full") information that results from analysis of the unique pattern (vector) of an individual's collective set of item responses. In essence, the factor analysis takes into account the relationships among items as they relate to the underlying latent trait, not just as they relate to other individual item responses. Such analysis requires computation of tetrachoric correlations, which are not yet provided as options in most standard statistical software packages. Using TESTFACT, a stepwise principal components analysis of the matrix of tetrachoric interitem correlations was performed for this example, with the specification to extract the maximum possible number of factors. Next, the principal factor solution served as the basis for full-information factor analysis, using information from vectors of item responses. The resulting full-information solution could then be rotated, first orthogonally to the varimax criterion and then obliquely by the promax method, to aid interpretation of the resulting factors.

One method of evaluating the degree of dimensionality of this factor solution is to generate a scree plot of percentage of total variance accounted for by each factor extracted. According to Reckase (1979), sufficient unidimensionality for IRT analysis is demonstrated when a single dominant factor accounts for at least 20% of the total subtest variance, with the second and succeeding factors accounting for considerably less variance. Figure 1 shows the percentage of variance accounted for by each extracted factor for the GATB subtest evaluated in this example. This plot shows a clear "elbow" between the first factor and the second; the first factor accounts for 32.13% of the variance, the second for 4.40% of the variance, the third for 3.64% of the variance, the fourth for 3.20% of the variance, and the last for 2.90% of the variance.

[FIGURE 1. GATB Subtest 3, Three-Dimensional Space: Percentage of total subtest variance accounted for by each extracted factor (maximum number of factors extracted by TESTFACT).]

Another method of assessing the "essential unidimensionality" (Stout, 1987) of the subtest is to examine differences in chi-square approximation to the likelihood-ratio statistics for successive stepwise factor models. These statistics are available as part of the TESTFACT output for full-information factor analysis. The differences in chi-square values should approximate a chi-square distribution, with degrees of freedom corresponding to the difference in degrees of freedom associated with the respective factor models (Haberman, 1977). For the example GATB subtest, the difference between chi-square approximation to the likelihood-ratio statistics for the one-factor model and the two-factor model was 13.62, with 19 degrees of freedom; this value is not significant, χ2(19, N = 406) = 13.62, p > .05. Given that there is no significant difference between the fit of the one-factor model and that of the two-factor model, essential unidimensionality is demonstrated for this example.
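TESTFACT's full-information factor analysis of tetrachoric correlations is not reproduced here. As a rough first screen under the same logic, the sketch below computes the proportion of total variance associated with each principal component of an ordinary inter-item correlation matrix (a crude stand-in for tetrachorics) and applies Reckase's (1979) 20% guideline; the simulated response matrix is illustrative only.

```python
import numpy as np

def variance_proportions(scored):
    """Proportion of total variance associated with each principal component of
    the inter-item correlation matrix. The article used tetrachoric correlations
    and full-information factor analysis in TESTFACT; ordinary (phi) correlations
    are used here only as a crude stand-in."""
    corr = np.corrcoef(scored, rowvar=False)
    eigvals = np.sort(np.linalg.eigvalsh(corr))[::-1]
    return eigvals / eigvals.sum()

# Illustrative 0/1 data generated from a one-dimensional logistic model.
rng = np.random.default_rng(1)
theta = rng.normal(size=(500, 1))
b = rng.normal(size=(1, 20))
scored = (rng.random((500, 20)) < 1.0 / (1.0 + np.exp(-1.7 * (theta - b)))).astype(float)

props = variance_proportions(scored)
print(np.round(props[:5], 3))
# Reckase (1979): a dominant first factor of at least 20% of total variance,
# with later factors accounting for much less, suggests sufficient unidimensionality.
# The "much less" condition is operationalized here ad hoc as half the first proportion.
print("dominant first factor:", props[0] >= 0.20 and props[1] < 0.5 * props[0])
```
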
Equal Discrimination of Items

Different IRT models carry different assumptions about the nature of the relationship between item responses and the latent trait or ability measured by them. The Rasch model (one-parameter logistic model), for example, requires that items differ from each other only in level of difficulty; the degree to which items efficiently discriminate between those who have and those who do not have that ability (or trait) does not vary. When one uses Rasch scaling, it is important to discard items that do not fit the requirements of the model. In contrast, a two-parameter or a three-parameter model better characterizes the actual relationship between the underlying trait and item performance for many tests. However, there are some practical advantages to using a one-parameter model, if one can develop items that satisfy assumptions such as equal discrimination power (the a parameter) among items. For example, scoring a test developed using the Rasch model is easy; one just adds the number of correct (or endorsed) responses to obtain a total score. In contrast, scoring a test using a two- or three-parameter model requires a formula that differentially weights item values according to their discrimination power at given levels of that underlying trait.

Hambleton and Swaminathan (1985) recommend screening for equal discrimination of items through calculating the biserial correlation between each item and the total subtest score and examining how many of those correlations fall outside of what they consider an acceptable
range, which consists of the mean of those correlations plus or minus a value of 0.15. (Biserial correlations can be produced using TESTFACT, but they are also available from other widely used statistical software packages.) For the GATB subtest used in this example, the mean biserial correlation was 0.665. Using the range of the mean biserial correlation plus or minus 0.15, an acceptable range is between 0.515 and 0.815. Two items (Items 12 and 13), constituting 10% of the total examined, fell outside of this range. Although not extreme, this degree of diversity calls into question the assumption of equally discriminating items required by the one-parameter model.
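The screening rule just described can be scripted directly. The sketch below is an assumption-laden stand-in for the TESTFACT output used in the article: it converts each item's point-biserial with the rest-of-test score into a biserial via the usual normal-ordinate correction and flags items outside the mean ± 0.15 window.

```python
import numpy as np
from scipy.stats import norm

def biserial_flags(scored, window=0.15):
    """Flag items whose biserial correlation with the total score falls outside
    the mean biserial +/- `window` (Hambleton & Swaminathan, 1985). Each biserial
    is derived from the point-biserial with the rest-of-test score via the usual
    normal-ordinate conversion; the article took biserials directly from TESTFACT."""
    scored = np.asarray(scored, dtype=float)
    total = scored.sum(axis=1)
    biserials = []
    for j in range(scored.shape[1]):
        item = scored[:, j]
        p = item.mean()                      # classical item difficulty (p value)
        r_pb = np.corrcoef(item, total - item)[0, 1]
        biserials.append(r_pb * np.sqrt(p * (1.0 - p)) / norm.pdf(norm.ppf(p)))
    biserials = np.array(biserials)
    lo, hi = biserials.mean() - window, biserials.mean() + window
    flagged = np.where((biserials < lo) | (biserials > hi))[0] + 1  # 1-based item numbers
    return biserials, (lo, hi), flagged

# Usage with a scored 0/1 matrix `scored` (examinees x items):
#   r_bis, (lo, hi), flagged = biserial_flags(scored)
```
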
Generation of IRT Parameter Estimates

Many software programs are available for calibration of items to generate IRT parameter estimates. For dichotomized (yes/no, right/wrong) responses, the BILOG program is the most versatile. The most recent version, BILOG-MG 3 (Zimowski et al., 2003), is designed to handle multiple groups (MG), which aids in tasks such as assessing potential item bias (differential item functioning). For this GATB example, BILOG was used to generate parameter estimates for each of the 20 items using three logistic models: a one-parameter, a two-parameter, and a three-parameter model. For the three-parameter logistic model, item difficulty (b), discrimination (a), and pseudoguessing (c) parameters were all estimated. Under the two-parameter model, only item difficulty (b) and discrimination (a) parameters were estimated. Under the one-parameter model, only item difficulty (b) was estimated for each item.

Item parameter estimation used the marginal maximum a posteriori (MMAP) method, which incorporates moderately strong Bayesian priors on the item slope and asymptote estimates. The relatively small sample size (for IRT analysis) was considered insufficient to permit joint estimation of both the prior distributions and the item parameters, so default normal prior distributions were selected.

Model–Data Fit: Item Level

One BILOG option for assessing the degree to which the selected one-, two-, or three-parameter model fits the actual data uses generation of likelihood-ratio chi-square values to identify items for which the pattern of observed responses differs significantly from the pattern of responses predicted using the given model and associated estimated parameters. However, these values are not considered dependable for subtests with 20 or fewer items (Du Toit, 2003; Mislevy & Bock, 1990). Although the entire third subtest of the GATB contains 40 items, few of the examinees in the sample for this example responded to all 40 of the items. Instead, the items to which at least 75% of the examinees responded, a total of 20 items, were used in item calibrations.

Two other indexes of goodness of parameter model fit to the data are available through BILOG: standardized posterior residuals and the population root mean square of the posterior deviates (Du Toit, 2003; Mislevy & Bock, 1990). To generate these estimates, BILOG divides the latent trait (or ability) continuum into multiple discrete chunks. Within each chunk, a comparison is made between the expected values given that particular model and the observed values falling within that chunk. Standardized posterior residuals are standardized differences between the actual posterior probability of a correct (or endorsing) response at various ability (or trait) levels and the probability of those levels hypothesized according to the fitted item characteristic curve. Mislevy and Bock (1990) indicated that a standardized residual greater than 2.0 may indicate failure of an item response model to fit the data at that point. However, Mislevy and Bock advise taking into consideration the posterior weight, or proportion of the examinee population at that level of ability. Model misfit in the tail ends of the ability (or trait) continuum, where the probability of examinee performance at that level is extremely low, is expected to have little effect. Therefore, only standardized posterior residuals with posterior weights of 0.05 or greater are considered in this example analysis. Using this criterion, the GATB example subtest revealed 12 items (60%) evidencing some degree of misfit under the one-parameter logistic model, 3 items (15%) evidencing some degree of misfit under the two-parameter logistic model, and 1 item (5%) evidencing some degree of misfit under the three-parameter logistic model.

The population root mean square of the posterior deviates is an overall index of fit for each item, given the fitted item characteristic curve. This index is the square root of a weighted average of the squares of standardized posterior residuals at selected points along the ability continuum. Again, a value greater than 2.0 suggests misfit of the model for the particular item. GATB example results are presented in Table 1, which identifies six items (Items 1, 4, 7, 11, 12, and 13) as misfitting under the one-parameter model and no items misfitting under either the two-parameter or the three-parameter model.

TABLE 1. GATB Subtest 3, Three-Dimensional Space: Root Mean Squares of the Posterior Deviates

Item no.   1-parameter logistic model   2-parameter logistic model   3-parameter logistic model
 1         2.7801a                      0.4092                       1.0061
 2         0.6966                       0.5314                       1.2313
 3         1.9290                       1.7944                       0.6879
 4         2.6807a                      0.3926                       0.8977
 5         1.3964                       0.7827                       1.4937
 6         1.6054                       1.3759                       1.2417
 7         3.5519a                      0.8459                       1.5410
 8         1.3395                       1.2185                       1.6657
 9         1.5238                       0.6304                       0.5820
10         0.7983                       0.6174                       0.6182
11         2.6233a                      0.6495                       0.5410
12         4.3005a                      0.9943                       0.8216
13         2.7847a                      0.9587                       1.2211
14         1.6560                       1.8820                       1.5337
15         1.4977                       1.0919                       0.9577
16         0.9389                       0.7085                       0.3831
17         1.1285                       1.2217                       0.2167
18         1.5090                       0.5457                       0.3773
19         1.6392                       0.5778                       0.7996
20         1.3896                       1.2340                       1.3679

a Root mean square of posterior deviates is greater than 2.0.

Overall (Subtest-Wide) Comparison of IRT Model–Data Fit

Within BILOG, overall comparisons of the fit of different response models can be made through examination of the differences between the −2 log likelihood values associated with each model at the end of the last calibration cycle for each subtest. The difference between values should be distributed approximately as chi-square with degrees of freedom equal to the difference in the number of
parameters estimated (across items) in the test or subtest of interest. Degrees of freedom for the three-parameter model, for example, would be three times the number of items (because a, b, and c parameters are estimated for each); degrees of freedom for the two-parameter model would be two times the number of items; and degrees of freedom for the one-parameter model would equal the number of items in the test or subtest itself. Therefore, a difference between the chi-square values associated with the one-parameter model and those associated with the two-parameter model would have degrees of freedom equal to the number of items in the subtest; the same rule would apply for the difference between the two-parameter model and the three-parameter model. A significant difference in the −2 log likelihood values, given the appropriate degrees of freedom, would suggest that the model with the lesser −2 log likelihood value better fits the data. For the GATB example subtest, there was a significant difference in overall fit between the one-parameter model and the two-parameter model, χ2(17, N = 406) = 72.31, p < .001. However, there was not a significant difference in overall fit between the two-parameter model and the three-parameter model, χ2(17, N = 406) = 20.05, p > .05.
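The likelihood-ratio comparison described above amounts to referring the difference in −2 log likelihood values to a chi-square distribution. The sketch below shows that arithmetic with placeholder values standing in for calibration output; they are not the BILOG results reported here.

```python
from scipy.stats import chi2

def compare_nested_models(neg2ll_simple, neg2ll_complex, df_diff):
    """Likelihood-ratio comparison of nested IRT models: the difference in
    -2 log likelihood is referred to a chi-square distribution with degrees
    of freedom equal to the difference in the number of estimated parameters."""
    delta = neg2ll_simple - neg2ll_complex
    return delta, chi2.sf(delta, df_diff)

# Placeholder -2 log likelihood values standing in for calibration output;
# with one extra parameter per item, a 20-item 1PL-vs-2PL comparison adds 20 df.
delta, p = compare_nested_models(neg2ll_simple=9100.0, neg2ll_complex=9027.7, df_diff=20)
print(f"chi-square difference = {delta:.2f}, p = {p:.4f}")
```
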
For this example, results of the three methods for assessing model–data fit are all consistent: There is a significant difference in fit between the one-parameter and two-parameter models, but not between the two-parameter and three-parameter models. The data do not fit the Rasch model; if Rasch techniques are to be used with this subtest, the misfitting items must be discarded. However, a two-parameter model, estimating both the item difficulty (b) and item discrimination (a) parameters, fits the data nicely. It is not necessary to add the third (c, pseudoguessing) parameter to fit the data. This means there are no significant differences among items in this subtest in the degree to which low-level examinees correctly answer the items, possibly through guessing.

The actual item parameters (under the two-parameter logistic model) for calibrated items in the third subtest (Three-Dimensional Space) of the GATB are presented in Table 2. Inspection of these parameters shows that Item 1 is the easiest, with a b parameter of −1.852 corresponding to a lower point on the ability continuum. The most difficult item in this set is Item 9, with a b parameter of −0.079. This item measures best near the average level of the ability continuum (near the zero point). Interestingly, none of the items had difficulty or trait levels above zero, suggesting that this set is composed of relatively easy items. Perhaps the second half of this GATB test, the part that was not reached by at least 75% of the examinees, includes more difficult items. One way to assess that would be to administer the second half of the items to a similar group of examinees and include several items from the previously calibrated set to serve as "linking" items to calibrate the next set on the same scale.

TABLE 2. GATB Subtest 3, Item Parameter Estimates for the Two-Parameter Logistic Model

Item no.   Difficulty parameter (b)   Discrimination parameter (a)
 1         −1.852                     1.316
 2         −1.218                     0.823
 3a        −0.855                     0.895
 4         −1.564                     1.184
 5         −1.042                     0.939
 6         −1.436                     0.663
 7         −1.047                     1.258
 8a        −0.401                     0.808
 9         −0.079                     0.638
10         −0.989                     0.731
11         −1.113                     1.027
12         −0.136                     1.334
13         −0.835                     0.521
14a        −0.513                     0.771
15         −0.558                     0.648
16         −0.285                     0.811
17         −0.249                     0.722
18         −0.437                     0.630
19         −0.821                     0.599
20         −0.328                     0.657

a Model–data misfit: standardized residuals less than −2.0 or greater than +2.0.

The discrimination (a) parameters for the items in Table 2 reflect the power of the item to discriminate between people who are or are not at the corresponding b level of the trait. So, Item 12 discriminates relatively well (a = 1.334) at the level of b = −0.136, but it would not discriminate as well at a much lower ability; Item 1 (b = −1.852, a = 1.316) would do a better job of discriminating between people who are or are not at that lower end of ability. The shape of the item characteristic curve for Item 12 would be a stretched-out S shape, with the lower asymptote representing the probability of people with a low level of that ability correctly answering the item (perhaps through guessing); the curve would rise at an increasing rate until it reached the −0.136 level on the ability or trait continuum; at that inflection point, it would continue rising but at a lower rate, until it reached the upper asymptote representing the probability of people with a high ability level correctly answering or endorsing that response. In contrast, an item with a lower discrimination parameter, such as Item 13 (a = 0.521), would have a flatter item characteristic curve, with a less dramatic change in the probability of answering an item correctly between an examinee whose ability is below the b parameter (−0.835) and one whose level of ability is above it.

The relatively good model–data fit for all but the one-parameter model suggests that the items on this GATB example subtest are measuring the latent trait, the underlying ability measured by this subtest, in a predictable, understandable manner, even among a population of examinees as diverse as the one sampled in this study. As examinee ability increases, the probability of answering a given test item correctly increases in a predictable fashion. For each item that fits its model, the relationship between ability and performance on that item is specified. It is no longer necessary to look at the results of this entire subtest to gain information about examinee ability levels, because an estimate of ability can be generated from any subset of these calibrated items.

Implications of This Example

Measurement accuracy for people who have disabilities could be improved by removing the time limits on this third subtest of the GATB, which would allow examinees to complete each item. This modification would remove the unfair penalty that time limits impose on individuals whose disabilities limit their speed of response to the GATB in its current format. In addition, it could improve measurement accuracy through reducing the benefits of employing a strategy of guessing or just filling in random answers to all the items before time runs out.

The item parameter estimates generated can lead to a variety of applications for the development of rehabilitation assessment tools. Development of a shortened, nontimed version of the GATB is an obvious application. Other possible applications include developing tests for specific levels of aptitude (computerized or paper-and-pencil tests), as well as equating other test instruments to the GATB through use of linking items selected from this calibrated set.

Computerized adaptive testing could be developed to tailor this GATB subtest to the particular level of ability of each examinee. To do this, one would need to calibrate a bank of all 40 GATB items for this subtest, expanding the list of items resulting from this study, to ensure that enough items would be available to accurately measure at each level of ability. If not enough items appropriate for a particular level of ability are currently part of this GATB subtest, additional items should be created or imported from another, similar test. Computerized adaptive testing would make possible the most efficient measurement of ability for each examinee. Measurement accuracy would be maximized and time required to take the test would be minimized for each examinee.
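To make the adaptive-testing idea concrete, the sketch below implements maximum-information item selection from a hypothetical two-parameter item bank with a deliberately crude provisional-ability update. Operational CATs re-estimate θ by maximum likelihood or Bayesian (e.g., EAP) scoring after each response, which is not shown here, and the bank and examinee are simulated rather than drawn from the GATB.

```python
import math
import random

def p_2pl(theta, a, b):
    return 1.0 / (1.0 + math.exp(-1.7 * a * (theta - b)))

def info_2pl(theta, a, b):
    p = p_2pl(theta, a, b)
    return (1.7 * a) ** 2 * p * (1.0 - p)

def next_item(theta, bank, used):
    """Select the unused item with maximum information at the current theta estimate."""
    candidates = [i for i in range(len(bank)) if i not in used]
    return max(candidates, key=lambda i: info_2pl(theta, *bank[i]))

def run_cat(bank, answer_fn, n_items=10, theta=0.0, step=0.5):
    """Crude adaptive loop: step the provisional theta up after a correct response
    and down after an incorrect one, shrinking the step each time.  Operational
    CATs would re-estimate theta by maximum likelihood or EAP instead."""
    used = []
    for _ in range(n_items):
        i = next_item(theta, bank, used)
        used.append(i)
        theta += step if answer_fn(i) else -step
        step = max(step * 0.8, 0.1)
    return theta, used

# Hypothetical 40-item 2PL bank and a simulated examinee with true theta = 0.7.
random.seed(3)
bank = [(random.uniform(0.6, 1.6), random.uniform(-2.0, 2.0)) for _ in range(40)]
true_theta = 0.7
est, administered = run_cat(bank, lambda i: random.random() < p_2pl(true_theta, *bank[i]))
print(f"estimated theta = {est:.2f} after {len(administered)} items")
```
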
Creating adaptive tests is not the only way to tailor tests for particular uses. Using generated item parameters, one could select items that best measure at the cutoff level of each aptitude required for a specific occupation or training program, for example. Specific aptitude batteries based on empirically derived ability levels required for specific occupations could be developed as efficient screening tools for rehabilitation clients considering those occupations.

Equating other aptitude tests to this subtest of the GATB is possible using item parameter estimates generated in this study. Given that the GATB is often held as a standard with which to compare other aptitude assessment instruments used in rehabilitation settings, equating beyond simply correlating total test scores is quite desirable. Probably the easiest way to equate the items on this GATB subtest to items in another test measuring the same construct is to administer some linking GATB subtest items along with the other test. The whole set of items, including the linking GATB items, can then be calibrated together. To ensure that the new parameters are on the same scale as the previously calibrated GATB items, the item parameter estimates of the linking items can be constrained to their originally calibrated values. This method can also be used to test out new items for the GATB subtest. Additional experimental items can be added to the subtest and examined to see how their item characteristic curves function. If an item is to be deleted from the GATB subtest because it is outdated or biased against people who have disabilities, a new item with a similar item characteristic curve could be substituted.

The two-parameter logistic model seems to be necessary and sufficient to describe the relationship between underlying ability and performance on the third subtest of the GATB. IRT modeling of this subtest with the
two-parameter model instead of a more complex model has two advantageous implications for rehabilitation research. First, the requirement for extremely large sample sizes to ensure stability of item parameter estimates is relaxed. Stable estimates of both the difficulty (b) and discrimination (a) parameters are feasible with relatively small sample sizes (hundreds, rather than thousands, of examinees). A second benefit of less complex parameter modeling is the feasibility of examining potential item bias or differential item functioning through comparison of the item characteristic curves for people with disabilities versus their nondisabled counterparts. Raju's (1990) formulas for calculating the exact area between item characteristic curves, and criteria for determining the significance of the differences between those curves, can be applied.
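Raju's closed-form area formulas and significance criteria are not reproduced here. The sketch below merely approximates the unsigned area between two groups' item characteristic curves by numerical integration over a θ grid, which is the quantity those formulas evaluate exactly for logistic models; the two sets of item parameters are hypothetical.

```python
import numpy as np

def icc_2pl(theta, a, b):
    return 1.0 / (1.0 + np.exp(-1.7 * a * (theta - b)))

def unsigned_area(item_ref, item_focal, lo=-4.0, hi=4.0, n=2001):
    """Numerical approximation to the unsigned area between the reference-group
    and focal-group item characteristic curves, a common DIF index; Raju's
    formulas give this area (and significance tests) in closed form."""
    theta = np.linspace(lo, hi, n)
    gap = np.abs(icc_2pl(theta, *item_ref) - icc_2pl(theta, *item_focal))
    return np.trapz(gap, theta)

# Hypothetical calibrations of the same item in two groups, as (a, b) pairs.
print(round(unsigned_area((1.1, -0.3), (0.9, 0.2)), 3))
```
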
In conclusion, appropriate application of IRT techniques could greatly improve psychometric methodology in rehabilitation. This article has provided an overview of such techniques, as well as guidelines for their use. Applications such as improving the efficiency and accuracy of assessment tools, developing computerized adaptive instruments, identifying and rectifying potential bias in tests, and equating new and improved instruments with the old tools they are designed to replace could revolutionize assessment of people who have disabilities.
ABOUT THE AUTHORS

Christine A. Reid, PhD, is a professor in the Rehabilitation Counseling Department at Virginia Commonwealth University. Her current research interests include assessment, ethics, deaf-blindness, and life-care planning. Stephanie A. Kolakowsky-Hayner is currently completing her dissertation in health-related sciences with a concentration in rehabilitation leadership at Virginia Commonwealth University. Allen N. Lewis, Jr., PhD, is an assistant professor and vice-chair of the Department of Rehabilitation Counseling at Virginia Commonwealth University. He is also a senior research associate of the Community Health Research Initiative of the L. Douglas Wilder School of Government and Community Affairs at VCU. His current research interests include exploring the disability experience in culturally diverse populations, understanding health and disability disparities, leadership in the public rehabilitation system, and the application of program evaluation research technology to rehabilitation services. Amy J. Armstrong, PhD, is an assistant professor in the Department of Rehabilitation Counseling at Virginia Commonwealth University. Her interest areas include the employment of individuals with significant disabilities, advocacy, welfare and poverty issues, disability policy and systems issues, and distance education. Address: Christine Reid, Dept. of Rehabilitation Counseling, Virginia Commonwealth University, PO Box 980330, 1112 E. East Clay St., Richmond, VA 23298-0330; e-mail: creid@vcu.edu
REFERENCES

Aiken, L. R. (2003). Psychological testing and assessment (11th ed.). Boston: Allen & Bacon.
Baker, F. B. (1985). The basics of item response theory. Portsmouth, NH: Heinemann.
Baker, F. B. (1992). Item response theory: Parameter estimation techniques. New York: Marcel Dekker.
Baker, F. B. (2001). The basics of item response theory (ERIC Clearinghouse on Assessment and Evaluation, University of Maryland, College Park, MD). Retrieved October 30, 2004, from http://edres.org/irt/baker/
Bock, R. D., Gibbons, R., Schilling, S. G., Muraki, E., Wilson, D. T., & Wood, R. (2003). Testfact 4. Lincolnwood, IL: Scientific Software International.
Bolton, B. (2001). Handbook of measurement and evaluation in rehabilitation (3rd ed.). Gaithersburg, MD: Aspen.
Borsboom, D., Mellenbergh, G. J., & van Heerden, J. (2003). The theoretical status of latent variables. Psychological Review, 110(2), 203–219.
Bullis, M., Reiman, J. W., Davis, C., & Reid, C. (1997). National field-testing of the “mini” version of the Transition Competence Battery for Adolescents and Young Adults Who Are Deaf. The Journal of Special Education, 31(3), 347–361.
Bullis, M., Reiman, J., Reid, C., & Davis, C. (1995). Preliminary investigation of the “mini” version of the Transition Competence Battery for Deaf Adolescents and Young Adults. Assessment in Rehabilitation and Exceptionality, 2(3), 179–196.
Cella, D., & Chang, C. H. (2000). A discussion of item response theory and its applications in health status assessment. Medical Care, 38(9, Suppl. 2), II66–II72.
Chang, C. H., & Gehlert, S. (2003). The Washington Psychosocial Seizure Inventory (WPSI): Psychometric evaluation and future applications. Seizure, 12(5), 261–267.
Cohen, R. J., & Swerdlik, M. E. (2005). Psychological testing and assessment: An introduction to tests and measurement. Boston: McGraw-Hill.
Cook, K. F., Monahan, P. O., & McHorney, C. A. (2003). Delicate balance between theory and practice: Health status assessment and item response theory. Medical Care, 41(5), 571–574.
Du Toit, M. (Ed.). (2003). IRT from SSI: BILOG MG, MULTILOG, PARSCALE, TESTFACT. Lincolnwood, IL: Scientific Software International.
Embretson, S. E., & Reise, S. P. (2000). Item response theory for psychologists. Mahwah, NJ: Erlbaum.
Fayers, P. M., & Machin, D. (2000). Quality of life: Assessment, analysis, and interpretation. West Sussex, England: Wiley.
Gumpel, T., & Wilson, M. (1996). Application of a Rasch analysis to the examination of the perception of facial affect among persons with mental retardation. Research in Developmental Disabilities, 17(2), 161–171.
Haberman, S. J. (1977). Log-linear models and frequency tables with small expected cell counts. Annals of Statistics, 5, 1148–1169.
Hahn, E. A., & Cella, D. (2003). Health outcomes assessment in vulnerable populations: Measurement challenges and recommendations. Archives of Physical Medicine and Rehabilitation, 84(Suppl. 2), S35–S42.
Haley, S. M., Andres, P. L., Coster, W. J., Kosinski, M., Ni, P., & Jette, A. M. (2004). Short-form activity measure for post-acute care. Archives of Physical Medicine and Rehabilitation, 85(4), 649–660.
Haley, S. M., Coster, W. J., Andres, P. L., Ludlow, L. H., Ni, P., Bond, T. L., et al. (2004). Activity outcome measurement for postacute care. Medical Care, 42(1, Suppl. 1), I49–I61.
Hambleton, R. K. (1989). Principles and selected applications of item response theory. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 147–200). New York: American Council on Education and Macmillan.
Hambleton, R. K., & Swaminathan, H. (1985). Item response theory: Principles and applications. Boston: Kluwer Nijhoff.
Hartigan, J. A., & Wigdor, A. K. (Eds.). (1989). Fairness in employment testing: Validity generalization, minority issues, and the General Aptitude Test Battery (Final report of the Committee on the General Aptitude Test Battery, Commission on Behavioral and Social Sciences and Education, National Research Council). Washington, DC: National Academy Press.
Hays, R. D., Morales, L. S., & Reise, S. P. (2000). Item response theory and health outcomes measurement in the 21st century. Medical Care, 38(9, Suppl. 2), II28–II42.
Hulin, C. L., Drasgow, F., & Parsons, C. K. (1983). Item response theory: Application to psychological measurement. Homewood, IL: Dow Jones-Irwin.
Jenkinson, C., Fitzpatrick, R., Garratt, A., Peto, V., & Stewart-Brown, S. (2001). Can item response theory reduce patient burden when measuring health status in neurological disorders? Results from Rasch analysis of the SF-36 physical functioning scale (PF-10). Journal of Neurology, Neurosurgery, and Psychiatry, 71(2), 220–224.
Jenkinson, C., Norquist, J. M., & Fitzpatrick, R. (2003). Deriving summary indices of health status from the Amyotrophic Lateral Sclerosis Assessment Questionnaires (ALSAQ-40 and ALSAQ-5). Journal of Neurology, Neurosurgery, and Psychiatry, 74(2), 242–245.
Kopec, J. A., Esdaile, J. M., Abrahamowicz, M., Abenhaim, L., Wood-Dauphinee, S., Lamping, D. L., et al. (1996). The Quebec Back Pain Disability Scale: Conceptualization and development. Journal of Clinical Epidemiology, 49(2), 151–161.
Lord, F. M. (1980). Applications of item response theory to practical testing problems. Hillsdale, NJ: Erlbaum.
McHorney, C. A. (2002). Use of item response theory to link 3 modules of functional status items from the Asset and Health Dynamics Among the Oldest Old study. Archives of Physical Medicine and Rehabilitation, 83(3), 383–394.
McHorney, C. A., & Cohen, A. S. (2000). Equating health status measures with item response theory: Illustrations with functional status items. Medical Care, 38(9, Suppl. 2), II43–II59.
Mislevy, R. J., & Bock, R. D. (1990). BILOG 3: Item analysis and test scoring with binary logistic models (2nd ed.). Chicago: Scientific Software.
Mungas, D., & Reed, B. R. (2000). Application of item response theory for development of a global functioning measure of dementia with linear measurement properties. Statistics in Medicine, 19(11–12), 1631–1644.
Raju, N. S. (1990). Determining the significance of estimated signed and unsigned areas between two item response functions. Applied Psychological Measurement, 14, 197–207.
Reckase, M. D. (1979). Unifactor latent trait models applied to multifactor tests: Results and implications. Journal of Educational Statistics, 4, 207–230.
Reid, C. (1993). Latent trait modeling of the General Aptitude Test Battery used with a rehabilitation client population: An investigation of model–data fit. Unpublished doctoral dissertation, Illinois Institute of Technology, Chicago.
Reid, C. (1995). Application of item response theory to practical problems in assessment with people who have disabilities. Assessment in Rehabilitation and Exceptionality, 2(2), 89–96.
Saliba, D., Orlando, M., Wenger, N. S., Hays, R. D., & Rubenstein, L. Z. (2000). Identifying a short functional disability screen for older persons. The Journals of Gerontology. Series A, Biological Sciences and Medical Sciences, 55(12), M750–M756.
Santor, D. A., & Ramsay, J. O. (1998). Progress in the technology of measurement: Applications of item response models. Psychological Assessment, 10(4), 345–359.
Santor, D. A., Ramsay, J. O., & Zuroff, F. C. (1994). Nonparametric item analyses of the Beck Depression Inventory: Evaluating gender item bias and response option weights. Psychological Assessment, 6, 255–270.
Stark, S., Chernyshenko, S., Chuah, D., Lee, W., & Wadlington, P. (2001). IRT Modeling Lab tutorial on item response theory. Retrieved October 30, 2004, from http://work.psych.uiuc.edu/irt/tutorial.asp
Stoloff, M. L., & Couch, J. V. (Eds.). (1992). Computer use in psychology: A directory of software (3rd ed.). Washington, DC: American Psychological Association.
Stout, W. (1987). A nonparametric approach for assessing latent trait unidimensionality. Psychometrika, 52, 589–617.
U.S. Department of Labor. (1970). Manual for the General Aptitude Test Battery. Washington, DC: U.S. Government Printing Office.
Van der Linden, W. J., & Hambleton, R. K. (1997). Handbook of modern item response theory. New York: Springer-Verlag.
Ware, J. E., Jr., Bjorner, J. B., & Kosinski, M. (2000). Practical implications of item response theory and computerized adaptive testing: A brief summary of ongoing studies of widely used headache impact scales. Medical Care, 38(9, Suppl. 2), II73–II82.
Weiss, D. J. (1972). Review of the General Aptitude Test Battery. In O. K. Buros (Ed.), Seventh mental measurements yearbook (Vol. 2, pp. 1058–1061). Highland Park, NJ: Gryphon Press.
Zimowski, M. F., Muraki, E., Mislevy, R. J., & Bock, R. D. (2003). BILOG-MG 3. Lincolnwood, IL: Scientific Software International.