The Rasch Testlet Model
Applied Psychological Measurement, Vol. 29 No. 2, March 2005, 126–149
DOI: 10.1177/0146621604271053
© 2005 Sage Publications
Wen-Chung Wang, National Chung Cheng University, Chia-Yi, Taiwan
Mark Wilson, University of California at Berkeley
The Rasch testlet model for both dichotomous and polytomous items in testlet-based tests is
proposed. It can be viewed as a special case of the multidimensional random coefficients multinomial
logit model (MRCMLM). Therefore, the estimation procedures for the MRCMLM can be directly
applied. Simulations were conducted to examine parameter recovery under the dichotomous
Rasch testlet model and the partial-credit testlet model. Results indicated that the item and person
parameters as well as the random testlet effects could be recovered very accurately under all the
simulated conditions. As sample sizes were increased, the root mean square errors of the
estimates decreased to an acceptable level. An empirical example of an English test with 11
testlets was given. Index terms: multidimensional item response model, item bundle, marginal
maximum likelihood estimation, parameter recovery.
A testlet is a bundle of items that share a common stimulus (e.g., a reading comprehension passage
or a figure) (Wainer & Kiely, 1987). Another name for a testlet is an item bundle (Rosenbaum, 1988).
The design of testlets or item bundles has been adopted in educational and psychological tests.
Fitting standard item response models to testlet responses ignores the possible dependence between
the items within a testlet. Such an item response analysis tends to overestimate the precision of
measures obtained from testlets and yields biased estimation for item difficulty and discrimination
parameters. Overstatement of precision and biased estimation lead to inaccurate inferences about
the parameters (Sireci, Thissen, & Wainer, 1991; Wainer, 1995; Wainer & Lukhele, 1997; Wainer
& Thissen, 1996; Wainer & Wang, 2000; Yen, 1993). Another method for analyzing a testlet is to
treat the testlet as a single super-item, score it polytomously, and apply polytomous item response
models such as the nominal response model (Bock, 1972), the partial-credit model (Masters, 1982),
the graded response models (Samejima, 1969), or the generalized partial-credit model (Muraki,
1992). This approach might be appropriate when the local dependence between items within a
testlet is moderate and the test contains a large proportion of independent items (Wainer, 1995).
Using this approach, as long as the total scores in a testlet are identical, they will be assigned to
the same category. In this case, the information in the exact response patterns within a testlet is
missing. To extract this information, more complex item response models are required.
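To make this loss of information concrete, the following minimal Python sketch scores a testlet as a single super-item. The five-item testlet is hypothetical and is not taken from any data set in this article; the sketch only counts how many distinct response patterns collapse into each total-score category.

```python
from itertools import product
from collections import defaultdict

# All possible 0/1 response patterns for a hypothetical 5-item testlet.
patterns = list(product([0, 1], repeat=5))

# Super-item scoring: the polytomous category is the testlet total score,
# so every pattern with the same total lands in the same category.
by_score = defaultdict(list)
for pattern in patterns:
    by_score[sum(pattern)].append(pattern)

for score in sorted(by_score):
    print(score, len(by_score[score]))
```

Thirty-two distinct patterns are reduced to six categories, which is the information that an item-level testlet model is meant to retain.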
Bradlow, Wainer, and Wang (1999) extended the two-parameter logistic model (Birnbaum, 1968)
to include an additional random effect for the dependence between items within the same testlet.
The variances of the random testlet effects were assumed to be constant across testlets. Wainer,
Bradlow, and Du (2000) further extended this two-parameter testlet model into a three-parameter
testlet model that includes the guessing parameter and allows variation in the random effects over
testlets. Under this three-parameter testlet model, the probability of a correct answer (scoring 1) to
item i within testlet d(i) for a person with latent trait level θn is
$$p_{ni1} = c_i + (1 - c_i)\,\frac{\exp[a_i(\theta_n - b_i + \gamma_{nd(i)})]}{1 + \exp[a_i(\theta_n - b_i + \gamma_{nd(i)})]}, \tag{1}$$
where pni1 is the probability; ai , bi , and ci are the discrimination, difficulty, and guessing para-
meters, respectively; and γnd(i) is the random effect for person n on testlet d(i), which describes
the interaction between persons and items (local item dependence) within the testlet. To facilitate
parameter estimation, Wainer et al. (2000) embedded the three-parameter testlet model in a larger
hierarchical Bayesian framework where
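$$\theta_n \sim N(0, 1), \tag{2}$$
$$\gamma_{nd(i)} \sim N(0, \sigma^2_{\gamma_{d(i)}}), \tag{3}$$
$$a_i \sim N(\mu_a, \sigma^2_a), \tag{4}$$
$$b_i \sim N(\mu_b, \sigma^2_b), \tag{5}$$
$$\log[c_i/(1 - c_i)] \sim N(\mu_c, \sigma^2_c). \tag{6}$$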
For the distributional means, they set $\mu_a \sim N(0, V_a)$, $\mu_b \sim N(0, V_b)$, and $\mu_c \sim N(0, V_c)$.
In addition, $V_a^{-1} = V_b^{-1} = V_c^{-1} = 0$ reflects lack of information about these parameters, and
$\sigma_z^2 \sim \chi^{-2}_{g_z}$ is for all prior variances, where $\chi^{-2}_{g_z}$ is an inverse chi-square random variable with $g_z$
degrees of freedom. They set $g_z = 0.5$ for all distributions to reflect a small amount of information.
Note that $\sigma^2_{\gamma_{d(i)}}$ indicates the amount of the testlet effect for testlet d(i). The larger $\sigma^2_{\gamma_{d(i)}}$ is, the
greater the proportion of total variance in test score that is attributable to the testlet. Because equation
(1) and the normal prior distributions are nonconjugate, inference from the model parameters is
drawn by choosing samples from their marginal posterior distribution using a form of Markov chain
Monte Carlo (MCMC) simulation, the Gibbs sampler (Gelfand & Smith, 1990). Glas, Wainer, and
Bradlow (2000) derived estimates for the parameters in the three-parameter testlet model using
marginal maximum likelihood (MML) and expected a posteriori (EAP) estimates. They also show
how the testlet model can be used within the framework of computerized adaptive testing.
If $c_i = 0$ and $\sigma^2_{\gamma_{d(i)}} = \sigma^2_{\gamma}$ (the same variance for all testlets), equation (1) reduces to the
two-parameter testlet model proposed by Bradlow et al. (1999). If $c_i = 0$ and $a_i = 1$, then equation (1)
reduces to the one-parameter Rasch testlet model:
$$p_{ni1} = \frac{\exp(\theta_n - b_i + \gamma_{nd(i)})}{1 + \exp(\theta_n - b_i + \gamma_{nd(i)})}. \tag{7}$$
Equation (7) can be expressed as
log(pni1 /pni0 ) = θn − bi + γnd(i) , (8)
where pni1 and pni0 are the probabilities of scoring 1 and 0 in item i for person n, respectively. If
γnd(i) = 0 (no testlet effects), equation (8) reduces to the dichotomous Rasch model (Rasch, 1960).
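The relationship between equations (1), (7), and (8) can be checked numerically. The sketch below is illustrative only: the function names `p_correct` and `logit` and all parameter values are assumptions introduced here, not part of the original article or of any testlet software.

```python
import math

def p_correct(theta, b, gamma=0.0, a=1.0, c=0.0):
    """Equation (1): probability of a correct response under the
    three-parameter testlet model. With a = 1 and c = 0 it is the
    Rasch testlet model of equation (7)."""
    z = a * (theta - b + gamma)
    # exp(z) / (1 + exp(z)) written in the equivalent form 1 / (1 + exp(-z)).
    return c + (1.0 - c) / (1.0 + math.exp(-z))

def logit(p):
    """Log-odds of scoring 1 versus 0, the left-hand side of equation (8)."""
    return math.log(p / (1.0 - p))

# Hypothetical person and item values.
p = p_correct(theta=0.5, b=-0.2, gamma=0.3)
print(p, logit(p))
```

Under the Rasch special case the log-odds equal θn − bi + γnd(i), which is 0.5 + 0.2 + 0.3 = 1.0 for the hypothetical values used here.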
Compared to the polytomous model approach to the dichotomous items in a testlet, the testlet
model approach has three major advantages. First, the units of analysis are items rather than testlets,
so that the information in the response patterns is not lost. Second, standard item scoring rubrics
(e.g., 1 for a correct response and 0 for an incorrect response) remain unchanged. Third, the usual
concepts of item parameters such as item discrimination, difficulty, and guessing parameters are
still applicable. Although the polytomous model approach seems to be conceivable, its complexity
is likely to doom it from the start (Wainer & Wang, 2000).
The above testlet models, although having these advantages, are limited to dichotomous items.
In practice, testlets may contain both dichotomous and polytomous items. For example, a math-
ematics or science test may contain constructed-response items in a testlet format that are scored
polytomously. An inventory may contain several scenarios, each followed by a set of Likert-type
or rating scale items. It is thus desirable to develop a testlet model that is suitable not only for
dichotomous but also for polytomous items.
Wang, Bradlow, and Wainer (2002) extended their earlier works to include the situation in which
a test is composed, partially or completely, of polytomous items in testlets. The item response
models are the three-parameter logistic model (Birnbaum, 1968) and the graded response model
(Samejima, 1969) for dichotomous and polytomous items, respectively, plus a random effect for
each testlet. As in Wainer et al. (2000), the model is also embedded in a Bayesian hierarchical
framework. Inferences under the model are obtained using MCMC techniques.
The Bayesian approach has some significant advantages over classical statistical analysis. It
allows meaningful assessments in confidence regions, incorporates prior knowledge into the analy-
sis, yields more precise estimators (provided the prior knowledge is accurate), and follows the like-
lihood and sufficiency principles. However, it has been argued that whenever the Bayesian approach
is used, considerable care should be taken to document fully the basis for the various prior distribu-
tions. Care should be taken when selecting the functional form for a prior because poor choices can
lead to incorrect inferences. In addition, there is a tendency to underestimate uncertainty and hence
to specify unrealistically informative priors. This tendency should be explicitly acknowledged and
avoided (Congdon, 2003; Gelman, Carlin, Stern, & Rubin, 2003; Lee, 1989; Punt & Hilborn, 1997).
In the above-cited testlet models, ai , bi , and log[ci /(1 − ci )] are all assumed to follow normal
distributions. These prior distributions may not always be realistic. In most standard item response
models, the items are usually treated as “fixed” effects. That is, all the items in a test constitute the
item population. They are not randomly selected from an item bank. Even when items are sometimes
randomly selected from an item bank (such as automatic item generation; Irvine & Kyllonen, 2002),
the distribution of the item parameters may not be normal. Therefore, assuming all three kinds of
item parameters to be normally distributed is not always appropriate.
In the present study, the testlet model approach is applied to the family of Rasch models for
both dichotomous and polytomous items. The Rasch models have several desirable measurement
and psychometric properties, such as observable sufficient statistics for the model parameters and
a relatively small sample size requirement for parameter estimation. The items are treated as fixed
effects so that no distributional assumption on the item parameters is necessary. Only the latent trait
θ and the random testlet effect variables γ are assumed to be independently normally distributed,
as they are in the above multiparameter testlet models (equations (2) and (3)). However, no distributional
assumptions on $\sigma^2_{\gamma_{d(i)}}$ are made. It will be shown that the resulting Rasch testlet model is a
special case of a multidimensional Rasch model; therefore, the computer software for the multidimensional
Rasch model can be directly adopted to estimate parameters. Simulations were conducted
to evaluate parameter recovery of the Rasch testlet model, and the results are summarized.
Modeling Testlet Effects
Equation (8) is for dichotomous items. For polytomous items, it can be extended to
log(pnij /pni(j −1) ) = θn − bij + γnd(i) , (9)

where pnij and pni(j−1) are the probabilities of scoring j and j − 1 on item i for person n, respectively,
and bij is the jth step difficulty of item i. If γnd(i) = 0, then equation (9) reduces to the partial-credit
model (Masters, 1982). Let
bij = bi + (bij − bi ) ≡ bi + τij , (10)
where bi is called the overall difficulty of item i, and τij is called the jth threshold parameter of
item i. Equation (9) can be expressed as
log(pnij /pni(j −1) ) = θn − (bi + τij ) + γnd(i) . (11)
When polytomous items are scored with a common scoring rubric, such as Likert-type or rating
scale items, the threshold parameters across items may be constrained as
τij = τj , (12)
so that equation (11) reduces to
log(pnij /pni(j −1) ) = θn − (bi + τj ) + γnd(i) . (13)
If γnd(i) = 0, then equation (13) reduces to the rating scale model (Andrich, 1978). Adding random
testlet effects onto other Rasch models is straightforward, such as the linear logistic test model
(Fischer, 1973), the partial-order model (Wilson, 1992), the facet model (Linacre, 1989), the linear rating
scale model (Fischer & Parzer, 1991), and the linear partial-credit model (Fischer & Ponocny, 1994).
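Equation (9) specifies adjacent-category log-odds; the category probabilities themselves follow by accumulating those log-odds and normalizing. The sketch below shows that step for one polytomous item. The function name `pc_testlet_probs` and the numerical values are hypothetical and are used only for illustration.

```python
import math

def pc_testlet_probs(theta, gamma, step_difficulties):
    """Category probabilities implied by equation (9).

    step_difficulties holds [b_i1, ..., b_i(K-1)]; category 0 is the
    reference category with log-numerator 0."""
    log_num = [0.0]
    for b in step_difficulties:
        # Each step adds the adjacent-category logit theta - b_ij + gamma.
        log_num.append(log_num[-1] + theta - b + gamma)
    largest = max(log_num)  # subtract the maximum for numerical stability
    num = [math.exp(v - largest) for v in log_num]
    total = sum(num)
    return [v / total for v in num]

# Hypothetical three-category item inside a testlet.
probs = pc_testlet_probs(theta=0.4, gamma=-0.1, step_difficulties=[-0.5, 0.8])
print(probs)
```

Setting `gamma` to zero gives the ordinary partial-credit model, and constraining the thresholds as in equation (12) gives the rating scale version.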
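Because θ and γ are assumed to be independently and normally distributed under the testlet
model, $\boldsymbol{\theta} = [\theta, \gamma_1, \ldots, \gamma_d, \ldots, \gamma_D]'$ has a multivariate normal distribution $N(\boldsymbol{\mu}, \boldsymbol{\Sigma})$, where, for
model identification, µ is set at zero, and Σ is constrained to be a diagonal matrix,
$$\boldsymbol{\Sigma} = \begin{bmatrix} \sigma^2_{\theta} & 0 & \cdots & 0 \\ 0 & \sigma^2_{\gamma_1} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma^2_{\gamma_D} \end{bmatrix}. \tag{14}$$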
Under the two- and three-parameter testlet models, σθ2 has to be set at unity for model identification.
Viewing the latent trait θ and the random testlet effects γd this way, one finds that the Rasch testlet
model is a multidimensional item response model in which the multiple dimensions are constrained
to be independent. It will be shown that the Rasch testlet model (equations (8), (9), and (13)) is
a special case of the multidimensional random coefficients multinomial logit model (MRCMLM;
Adams, Wilson, & Wang, 1997). Therefore, the parameters of the Rasch testlet model can be
estimated using the computer program ConQuest (Wu, Adams, & Wilson, 1998). No extra effort is
needed to derive parameter estimation procedures or to develop computer software for the Rasch
testlet model.
The MRCMLM
Under the MRCMLM, which is a multidimensional extension of the random coefficients multinomial
logit model (Adams & Wilson, 1996), person n’s levels on the L latent traits are denoted
as $\boldsymbol{\theta} = (\theta_1, \ldots, \theta_L)'$, which are considered to represent a random sample from a population with
a multivariate density function g(θ; α ), where α indicates a vector of parameters that characterize
the distribution. In this study, g is constrained to be normal so that α ≡ (µ, Σ). The probability of
a response in category j of item i for person n is
$$f(X_{nij} = 1; \boldsymbol{\xi} \mid \boldsymbol{\theta}_n) = \frac{\exp(\mathbf{b}_{ij}'\boldsymbol{\theta}_n + \mathbf{a}_{ij}'\boldsymbol{\xi})}{\sum_{u=1}^{K_i} \exp(\mathbf{b}_{iu}'\boldsymbol{\theta}_n + \mathbf{a}_{iu}'\boldsymbol{\xi})}, \tag{15}$$
where Xnij = 1 if the response to item i is in category j for person n and is 0 otherwise; Ki is the
number of categories in item i; ξ is a vector of difficulty parameters that describe the items; bij is
a score vector given to category j of item i across the L latent traits, which can be collected across
items into a scoring matrix B; and aij is a design vector given to category j of item i that describes
the linear relationship among the elements of ξ, which can be collected across items into a design
matrix A. Equation (15) can be expressed as
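$$\log\left[\frac{f(X_{nij} = 1; \boldsymbol{\xi} \mid \boldsymbol{\theta}_n)}{f(X_{ni(j-1)} = 1; \boldsymbol{\xi} \mid \boldsymbol{\theta}_n)}\right] = (\mathbf{b}_{ij} - \mathbf{b}_{i(j-1)})'\boldsymbol{\theta}_n + (\mathbf{a}_{ij} - \mathbf{a}_{i(j-1)})'\boldsymbol{\xi} \equiv \mathbf{b}_{ij}^{*\prime}\boldsymbol{\theta}_n + \mathbf{a}_{ij}^{*\prime}\boldsymbol{\xi}, \tag{16}$$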
which is more consistent with the standard expression of the family of Rasch models and the above
Rasch testlet model. Notice that aij and bij are not parameters; rather, they are specified by test
analysts to form customized item response models.
Using aij and bij (or, equivalently, a∗ij and b∗ij ) to define the relationship between items and
persons allows a general model to be written that includes most of the existing unidimensional Rasch
models, such as the dichotomous Rasch model, the linear logistic test model, the rating scale model,
the partial-credit model, the partial-order model, the facet model, the linear rating scale model,
and the linear partial-credit model. To see specification of these unidimensional models within the
MRCMLM framework, the reader is referred to Adams and Wilson (1996), Adams et al. (1997),
and Wu et al. (1998). The definitions also allow the specification of a range of multidimensional
models by imposing linear constraints on the item parameters, such as multidimensional forms of
the dichotomous Rasch model, the rating scale model, the partial-credit model, the linear partial-
credit model, and, more important, the Rasch testlet model to be used in this study.
The computer program ConQuest for the MRCMLM is implemented with MML estimation and
Bock and Aitkin’s (1981) formulation of the EM (expectation-maximization) algorithm (Dempster,
Laird, & Rubin, 1977). Based on the assumption of conditional independence among items and
persons, the probability of a response vector x conditioned on the random quantities θ is
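$$f(\mathbf{X} = \mathbf{x}; \boldsymbol{\xi} \mid \boldsymbol{\theta}) = \frac{\exp[\mathbf{x}'(\mathbf{B}\boldsymbol{\theta} + \mathbf{A}\boldsymbol{\xi})]}{\Psi(\boldsymbol{\theta}, \boldsymbol{\xi})}, \tag{17}$$
with
$$\Psi(\boldsymbol{\theta}, \boldsymbol{\xi}) = \sum_{\mathbf{z} \in \Omega} \exp[\mathbf{z}'(\mathbf{B}\boldsymbol{\theta} + \mathbf{A}\boldsymbol{\xi})], \tag{18}$$
where Ω is the set of all possible response vectors. The marginal density of the response x is
$$f(\mathbf{X} = \mathbf{x}) = \int_{\boldsymbol{\theta}} \frac{\exp[\mathbf{x}'(\mathbf{B}\boldsymbol{\theta} + \mathbf{A}\boldsymbol{\xi})]}{\Psi(\boldsymbol{\theta}, \boldsymbol{\xi})}\, dG(\boldsymbol{\theta}, \boldsymbol{\alpha}), \tag{19}$$
where G is the cumulative distribution of g. The likelihood for a set of N response vectors is
$$\Lambda(\boldsymbol{\xi}, \boldsymbol{\alpha} \mid \mathbf{X}) = \prod_{n=1}^{N} \int_{\boldsymbol{\theta}} \frac{\exp[\mathbf{x}_n'(\mathbf{B}\boldsymbol{\theta} + \mathbf{A}\boldsymbol{\xi})]}{\Psi(\boldsymbol{\theta}, \boldsymbol{\xi})}\, dG(\boldsymbol{\theta}, \boldsymbol{\alpha}). \tag{20}$$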
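The marginal density in equation (19) can be approximated numerically for a small case. The sketch below assumes a single testlet of dichotomous Rasch items and integrates over θ and γ with a simple equally spaced grid; this is an illustration of the integral only, not the quadrature scheme implemented in ConQuest, and the function names and item values are hypothetical.

```python
import math
from itertools import product

def normal_pdf(x, var):
    return math.exp(-0.5 * x * x / var) / math.sqrt(2.0 * math.pi * var)

def grid(var, n=61, width=6.0):
    """Equally spaced nodes on [-width*sd, width*sd] paired with the step size."""
    sd = math.sqrt(var)
    step = 2.0 * width * sd / (n - 1)
    return [(-width * sd + k * step, step) for k in range(n)]

def marginal_prob(x, b, var_theta, var_gamma):
    """Approximate f(X = x) for one testlet under the Rasch testlet model."""
    total = 0.0
    for theta, w_t in grid(var_theta):
        for gamma, w_g in grid(var_gamma):
            like = 1.0
            for x_i, b_i in zip(x, b):
                p = 1.0 / (1.0 + math.exp(-(theta - b_i + gamma)))
                like *= p if x_i == 1 else 1.0 - p
            prior = normal_pdf(theta, var_theta) * normal_pdf(gamma, var_gamma)
            total += like * prior * w_t * w_g
    return total

# Hypothetical three-item testlet; the marginal probabilities of all
# 2^3 response patterns should sum to approximately 1.
b = [-0.5, 0.0, 0.5]
total = sum(marginal_prob(x, b, 1.0, 0.5) for x in product([0, 1], repeat=3))
print(total)
```

The product of such marginal probabilities over persons is the likelihood in equation (20); with D testlets the integral has D + 1 dimensions, which is why the choice of integration method matters in practice.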
The likelihood equations for the item parameters are

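$$\frac{\partial \log \Lambda(\boldsymbol{\xi}, \boldsymbol{\alpha} \mid \mathbf{X})}{\partial \boldsymbol{\xi}} = \sum_{n=1}^{N} \int_{\boldsymbol{\theta}} \frac{\partial \log f(\mathbf{x}_n; \boldsymbol{\xi}, \boldsymbol{\alpha})}{\partial \boldsymbol{\xi}}\, dH(\boldsymbol{\theta}; \boldsymbol{\xi}, \boldsymbol{\alpha} \mid \mathbf{x}_n) = 0, \tag{21}$$
where H (θ; ξ, α |xn ) is the cumulative posterior marginal distribution of θ given xn , with a density
function
$$h(\boldsymbol{\theta}; \boldsymbol{\xi}, \boldsymbol{\alpha} \mid \mathbf{x}_n) = \frac{f(\mathbf{x}_n; \boldsymbol{\xi} \mid \boldsymbol{\theta})\, g(\boldsymbol{\theta}; \boldsymbol{\alpha})}{f(\mathbf{x}_n; \boldsymbol{\xi})}. \tag{22}$$
Assuming the distribution of the latent traits is multivariate normal so that α ≡ (µ, Σ), the likelihood
equations for the mean and variance-covariance matrix are
$$\frac{\partial \log \Lambda(\boldsymbol{\xi}, \boldsymbol{\mu}, \boldsymbol{\Sigma} \mid \mathbf{X})}{\partial \boldsymbol{\mu}} = \sum_{n=1}^{N} \int_{\boldsymbol{\theta}} \frac{\partial \log g(\boldsymbol{\theta}; \boldsymbol{\mu}, \boldsymbol{\Sigma})}{\partial \boldsymbol{\mu}}\, dH(\boldsymbol{\theta}; \boldsymbol{\xi}, \boldsymbol{\mu}, \boldsymbol{\Sigma} \mid \mathbf{x}_n) = 0, \tag{23}$$
and
$$\frac{\partial \log \Lambda(\boldsymbol{\xi}, \boldsymbol{\mu}, \boldsymbol{\Sigma} \mid \mathbf{X})}{\partial \boldsymbol{\Sigma}} = \sum_{n=1}^{N} \int_{\boldsymbol{\theta}} \frac{\partial \log g(\boldsymbol{\theta}; \boldsymbol{\mu}, \boldsymbol{\Sigma})}{\partial \boldsymbol{\Sigma}}\, dH(\boldsymbol{\theta}; \boldsymbol{\xi}, \boldsymbol{\mu}, \boldsymbol{\Sigma} \mid \mathbf{x}_n) = 0. \tag{24}$$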
Bock and Aitkin’s (1981) formulation of the EM algorithm is used to estimate all the parameters
ξ, µ, and Σ simultaneously.
Note that only the form of the multivariate normal distribution is assumed, and the corresponding
mean vector and variance-covariance matrix are empirically estimated, which is referred to as
the empirical Bayes method (Lee, 1989). Necessary and sufficient conditions for identification of
the parameters in the MRCMLM have been derived (Volodin & Adams, 1995; Wu et al., 1998).
In the present study, µ is set at zero for model identification, and Σ is a diagonal matrix as in
equation (14).
In addition to the usual technique that uses fixed quadrature points for integration, ConQuest
also provides a Monte Carlo method in which the quadrature points are readjusted according to
recent estimates for better integration. After the model parameters are calibrated, point estimates for
individual persons can be obtained from either the mean vector of the marginal posterior distribution
(equation (22)), called the expected a posteriori (EAP) estimates (Bock & Mislevy, 1982), or the
maximum point of conditional likelihood, called the maximum likelihood estimates (MLE).
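An EAP estimate is the mean of the posterior distribution in equation (22). The sketch below approximates it on a fixed rectangular grid for a person who answered one testlet of dichotomous Rasch items, with item and variance parameters treated as known. It is a simplified illustration under those assumptions, not the ConQuest implementation; the function name `eap` and all values are hypothetical.

```python
import math

def eap(x, b, var_theta, var_gamma, n=41, width=5.0):
    """Posterior means of (theta, gamma) for one testlet of Rasch items."""
    def nodes(var):
        sd = math.sqrt(var)
        return [(-width + 2.0 * width * k / (n - 1)) * sd for k in range(n)]

    num_theta = num_gamma = den = 0.0
    for theta in nodes(var_theta):
        for gamma in nodes(var_gamma):
            # Prior density up to a constant; constants and the equal grid
            # spacing cancel in the ratio below.
            w = math.exp(-0.5 * (theta ** 2 / var_theta + gamma ** 2 / var_gamma))
            for x_i, b_i in zip(x, b):
                p = 1.0 / (1.0 + math.exp(-(theta - b_i + gamma)))
                w *= p if x_i == 1 else 1.0 - p
            num_theta += theta * w
            num_gamma += gamma * w
            den += w
    return num_theta / den, num_gamma / den

# Hypothetical four-item testlet answered entirely correctly.
theta_hat, gamma_hat = eap([1, 1, 1, 1], [0.0, 0.0, 0.0, 0.0], 1.0, 0.25)
print(theta_hat, gamma_hat)
```

Because the testlet effect has the smaller prior variance in this example, its EAP estimate is shrunk toward zero more strongly than the estimate of θ.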
The MRCMLM, being a member of the exponential family of distributions, can be viewed
as a generalized linear mixed model (De Boeck & Wilson, 2004; McCullagh & Nelder, 1989;
McCulloch & Searle, 2001; Nelder & Wedderburn, 1972; Rijmen, Tuerlinckx, De Boeck, & Kup-
pens, 2003). In addition to ConQuest, the SAS NLMIXED procedure (SAS Institute, 1999) is
an alternative for fitting many common nonlinear and generalized linear mixed models, includ-
ing the MRCMLM. The reader is referred to Wolfinger and SAS Institute (n.d.) for details
of the NLMIXED procedure. According to the authors’ experiences in the multidimensional
approach, the NLMIXED procedure may take several hours to converge (or sometimes even fail
to converge), whereas ConQuest takes only a few minutes. Hence, ConQuest was used for all
analyses.
To show how the MRCMLM encompasses the partial-credit testlet model (equation (9)) as a
special case, let uid (d = 1, . . . , D) denote an indicator variable where uid is 1 if item i is within
testlet d and is zero otherwise. Let vij (i = 1, . . . , I ; j = 1, . . . , Ki −1) denote an indicator variable
where vi1 = vi2 = · · · = vij = −1 for step j of item i and is zero otherwise. D and I are the
numbers of testlets and items in the test, respectively, and Ki is the number of response categories
in item i. Let
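$$\boldsymbol{\theta}_n' = [\theta_n, \gamma_{n1}, \ldots, \gamma_{nd}, \ldots, \gamma_{nD}], \tag{25}$$
$$\boldsymbol{\xi}' = [\xi_{11}, \xi_{12}, \ldots, \xi_{1(K_1-1)}, \ldots, \xi_{ij}, \ldots, \xi_{I(K_I-1)}], \tag{26}$$
$$\mathbf{b}_{ij}' = [j, j u_{i1}, \ldots, j u_{id}, \ldots, j u_{iD}], \tag{27}$$
$$\mathbf{a}_{ij}' = [v_{11}, \ldots, v_{ij}, \ldots, v_{I(K_I-1)}], \tag{28}$$
then
$$\log\left[\frac{f(X_{nij} = 1)}{f(X_{ni(j-1)} = 1)}\right] = (\mathbf{b}_{ij} - \mathbf{b}_{i(j-1)})'\boldsymbol{\theta}_n + (\mathbf{a}_{ij} - \mathbf{a}_{i(j-1)})'\boldsymbol{\xi} = \theta_n + \gamma_{nd(i)} - \xi_{ij}, \tag{29}$$
which is equivalent to equation (9). When Ki = 2 for every item (i.e., all dichotomous items),
$$\log\left[\frac{f(X_{ni1} = 1)}{f(X_{ni0} = 1)}\right] = (\mathbf{b}_{i1} - \mathbf{b}_{i0})'\boldsymbol{\theta}_n + (\mathbf{a}_{i1} - \mathbf{a}_{i0})'\boldsymbol{\xi} = \theta_n + \gamma_{nd(i)} - \xi_{i1}, \tag{30}$$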
which is equivalent to equation (8). By formulating aij and bij , equation (13) can also be formed.
Moreover, the testlet model approach can be applied to the linear logistic test model, the partial-
order model, the facet model, the linear partial-credit model, and many other customized models
via formulating aij and bij .
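The score and design vectors in equations (27) and (28) can be built explicitly for a toy test to confirm that equation (29) holds. The sketch below uses a hypothetical two-item, two-testlet layout with made-up parameter values; the helper names are introduced here for illustration and do not correspond to ConQuest syntax.

```python
def score_vector(j, item_testlet, n_testlets):
    """Equation (27): b_ij = [j, j*u_i1, ..., j*u_iD]."""
    return [j] + [j if d == item_testlet else 0 for d in range(n_testlets)]

def design_vector(i, j, n_steps):
    """Equation (28): entries for steps 1..j of item i are -1, all others 0."""
    vec = []
    for item, steps in enumerate(n_steps):
        for step in range(1, steps + 1):
            vec.append(-1 if (item == i and step <= j) else 0)
    return vec

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Hypothetical layout: item 0 is dichotomous in the first testlet,
# item 1 has three categories and sits in the second testlet.
testlet_of = [0, 1]
n_steps = [1, 2]                 # K_i - 1 for each item
theta_vec = [0.3, -0.2, 0.5]     # [theta_n, gamma_n1, gamma_n2]
xi = [0.1, -0.4, 0.6]            # [xi_11, xi_21, xi_22]

def adjacent_logit(i, j):
    """Left-hand side of equation (29) computed from the vectors."""
    def linear_predictor(cat):
        return (dot(score_vector(cat, testlet_of[i], 2), theta_vec)
                + dot(design_vector(i, cat, n_steps), xi))
    return linear_predictor(j) - linear_predictor(j - 1)

print(adjacent_logit(0, 1), adjacent_logit(1, 1), adjacent_logit(1, 2))
```

Each printed value equals θn + γnd(i) − ξij for the corresponding item and step, which is the right-hand side of equation (29).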
Although the MRCMLM and ConQuest have been applied to many testing situations (e.g., Adams
et al., 1997; Hoijtink, Rooks, & Wilmink, 1999; Hoskens & De Boeck, 1997, 2001; Wang, 1999;
Wang & Chen, 2004; Wang, Chen, & Cheng, 2004; Wang, Wilson, & Adams, 1997, 2000; Wang
& Wu, 2004), they have not been applied to analyze testlets. Whether ConQuest provides accurate
parameter recovery under the Rasch testlet model is unknown. Simulations were thus conducted to
assess parameter recovery. The design, item generation procedure, and analysis are described, and
the results are summarized below.
The Simulation
Design
Four independent variables were manipulated: (a) item type: all dichotomous items, all polytomous
items, and dichotomous plus polytomous items; (b) testlet number: 40 dichotomous items
in four or eight testlets (each testlet has 10 or 5 dichotomous items), 24 three-point polytomous
items in four or eight testlets (each testlet has 6 or 3 items), and 20 dichotomous items in two or
four testlets (each testlet has 10 or 5 dichotomous items) plus 12 three-point polytomous items in
two or four testlets (each testlet has 6 or 3 polytomous items); (c) sample size: 200 and 500; and
(d) testlet effect: the variances of the random testlet variables were 0.25, 0.50, 0.75, and 1.00, representing
small to large effects. From the following empirical example and other empirical examples
in the literature (e.g., Wainer & Wang, 2000), it appears that the variances of testlets in real tests may
be very diverse (ranging from as small as almost zero to as large as the variance of the latent trait).
In the simulations, the variance of the latent trait was set at 1.00 and so was the largest variance of
the testlet.
The dependent variables were (a) the difference between the mean estimates and the generating
value and (b) the root mean square error of the estimates. The item difficulties for the 40 dichotomous
items ranged uniformly from –2.00 to 2.00. The 48 step difficulties for the 24 three-point poly-
tomous items ranged from −2.00 to 2.42. For the tests that contained 20 dichotomous and 12 poly-
tomous items, the item difficulties for the 20 dichotomous items ranged uniformly from –2.00 to
2.00. The 24 step difficulties for the 12 polytomous items ranged from –2.00 to 2.08. One hundred
replications were made under each condition.
Data Generation
A FORTRAN 90 computer program was written by the authors to generate item responses.
The generating procedure contained the following steps: (a) latent trait parameters were randomly
generated from the multivariate normal distribution N (0,Σ) (Σ is a diagonal matrix), (b) these
latent trait parameters (θ and γ ) and the defined item parameters were used to compute the
corresponding category probability and the cumulative probabilities using equation (8) or (9),
and (c) these cumulative probability values were compared to a random number from the uniform
[0, 1] distribution. The simulated item response was defined as the lowest score category whose
cumulative probability exceeded the random number. More
specifically, the response category of person n on item i was determined to be h if and only if
$$\sum_{j=0}^{h-1} p_{nij} \le r_{ni}, \tag{31}$$
and
$$\sum_{j=0}^{h} p_{nij} > r_{ni}, \tag{32}$$
where pnij is the probability of scoring j on item i for person n, and rni is one random number from
the uniform [0, 1] distribution for item i of person n.
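The three generating steps can be sketched in Python as follows. This is not the authors' FORTRAN 90 program; it is a minimal reimplementation under the stated model, and the function names, the test layout, and the seed are assumptions made for illustration.

```python
import math
import random

def category_probs(theta, gamma, steps):
    """Category probabilities from equation (9); steps = [b_i1, ..., b_i(K-1)]."""
    log_num = [0.0]
    for b in steps:
        log_num.append(log_num[-1] + theta - b + gamma)
    num = [math.exp(v - max(log_num)) for v in log_num]
    total = sum(num)
    return [v / total for v in num]

def draw_response(probs, r):
    """Equations (31)-(32): the category h whose cumulative probability
    first exceeds the uniform random number r."""
    cum = 0.0
    for h, p in enumerate(probs):
        cum += p
        if r < cum:
            return h
    return len(probs) - 1  # guard against rounding when r is very close to 1

def simulate(n_persons, item_steps, testlet_of, var_theta, testlet_vars, seed=0):
    rng = random.Random(seed)
    data = []
    for _ in range(n_persons):
        # Step (a): independent normal draws, i.e., a diagonal covariance matrix.
        theta = rng.gauss(0.0, math.sqrt(var_theta))
        gammas = [rng.gauss(0.0, math.sqrt(v)) for v in testlet_vars]
        # Steps (b) and (c): category probabilities, then comparison with r.
        row = [draw_response(category_probs(theta, gammas[testlet_of[i]], steps),
                             rng.random())
               for i, steps in enumerate(item_steps)]
        data.append(row)
    return data

# Hypothetical small test: two dichotomous items in testlet 0 and
# two three-point items in testlet 1.
data = simulate(200, [[-1.0], [1.0], [-0.5, 0.5], [0.0, 1.0]],
                [0, 0, 1, 1], 1.00, [0.25, 1.00])
print(data[:3])
```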
Analysis
Each generated data set was calibrated using ConQuest. The bias and root mean square error
across the 100 replications for the estimator ς̂ were computed as
$$\mathrm{Bias}(\hat{\varsigma}) = \sum_{r=1}^{100} (\hat{\varsigma}_r - \varsigma)/100, \tag{33}$$
$$\mathrm{RMSE}(\hat{\varsigma}) = \sqrt{\sum_{r=1}^{100} (\hat{\varsigma}_r - \varsigma)^2/100}, \tag{34}$$
where ς was the generating value of an element of ξ and Σ, and ς̂r was its estimate in replication r.
Hotelling’s T² test was used to test
the overall null hypothesis of unbiased estimation across all the parameters; that is, E(ξ̂) = ξ
and E(Σ̂) = Σ. Note that the distributional parameters of latent traits (Σ in this case) rather than
the individual person parameters θn and γn were assessed because the latter were assumed to be
randomly sampled from the multivariate normal distribution and were not modeled. However, the
accuracy of the individual person parameter estimates can be assessed using test reliability (Mislevy,
Beaton, Kaplan, & Sheehan, 1992), as shown in the following empirical example.

Table 1
Hotelling’s T² Tests for the Overall Null Hypothesis of Unbiased Estimation

Condition                              F      df1   df2   p
Dichotomous items
  Four testlets
    N = 200                            1.50   45    55    .08
    N = 500                            1.16   45    55    .30
  Eight testlets
    N = 200                            1.03   49    51    .46
    N = 500                            1.83   49    51    .02
Polytomous items
  Four testlets
    N = 200                            0.99   53    47    .51
    N = 500                            1.19   53    47    .28
  Eight testlets
    N = 200                            1.31   57    43    .18
    N = 500                            1.14   57    43    .32
Dichotomous plus polytomous items
  Four testlets
    N = 200                            1.58   49    51    .05
    N = 500                            1.24   49    51    .23
  Eight testlets
    N = 200                            1.25   53    47    .22
    N = 500                            1.68   53    47    .04
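The bias and RMSE summaries of equations (33) and (34) amount to a few lines of code. In the sketch below the estimates are made-up numbers for illustration, not simulation output from this study, and the function names are introduced here.

```python
import math

def bias(estimates, true_value):
    """Equation (33): mean deviation of the estimates from the generating value."""
    return sum(e - true_value for e in estimates) / len(estimates)

def rmse(estimates, true_value):
    """Equation (34): root of the mean squared deviation from the generating value."""
    return math.sqrt(sum((e - true_value) ** 2 for e in estimates) / len(estimates))

# Hypothetical estimates of a parameter whose generating value is 1.00.
estimates = [0.9, 1.1, 1.2, 0.8]
print(bias(estimates, 1.0), rmse(estimates, 1.0))
```

An estimator can have bias near zero and still have a sizable RMSE, as in this toy example, which is why both quantities are reported in the recovery tables.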
Results
Dichotomous items. The upper part of Table 1 shows the transformed F statistics, degrees of
freedom, and p values of Hotelling’s T 2 tests. All the p values were larger than .01, indicating that
the estimation of ConQuest was unbiased. Table 2 lists the generating value, bias, and RMSE. The
magnitudes of bias (between –0.063 and 0.050) were rather small, as compared to the range of the
generating values (between –2.00 and 2.00). As the sample sizes were increased from 200 to 500,
the magnitudes of RMSE decreased.
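The degrees of freedom in Table 1 are consistent with the usual F transformation of a one-sample Hotelling’s T² statistic. The article does not print the formula, so the block below is a sketch under the assumption that the standard form was used. The symbols R (number of replications), p (number of estimated parameters), ω (the stacked generating values), the barred ω̂ (the mean of the estimate vectors), and S (the sample covariance matrix of the estimates) are introduced here for illustration.

```latex
T^2 = R\,(\bar{\hat{\omega}} - \omega)'\, S^{-1}\, (\bar{\hat{\omega}} - \omega),
\qquad
F = \frac{R - p}{p\,(R - 1)}\, T^2 \sim F_{p,\; R - p}.
% Dichotomous items, four testlets: p = 40 item difficulties + 5 variances = 45,
% so with R = 100 replications, (df_1, df_2) = (45, 55), as in Table 1.
% Dichotomous items, eight testlets: p = 40 + 9 = 49, giving (49, 51).
```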
Polytomous items. The transformed F statistics, degrees of freedom, and p values of Hotelling’s
T 2 tests are listed in the middle part of Table 1. As for the dichotomous items, all the p values
were larger than .01. The generating value, bias, and RMSE are shown in Table 3. Generally,
the magnitudes of bias (between –0.067 and 0.099) were small, as compared to the range of the
Table 2
Parameter Recovery in Dichotomous Items
           Four Testlets                            Eight Testlets
                  N = 200          N = 500                 N = 200          N = 500
Parameter  Gen    Bias    RMSE     Bias    RMSE     Gen    Bias    RMSE     Bias    RMSE
σ²θ        1.00   −0.007  0.152    0.005   0.094    1.00   −0.007  0.135    −0.017  0.079
σ²γ1       0.25   −0.009  0.076    0.013   0.054    0.25   0.014   0.110    −0.004  0.036
σ²γ2       1.00   0.024   0.244    −0.004  0.149    1.00   −0.004  0.316    −0.051  0.195
σ²γ3       0.75   0.015   0.192    −0.005  0.130    0.75   −0.039  0.224    −0.058  0.152
σ²γ4       0.50   0.045   0.173    0.044   0.117    0.50   −0.008  0.177    −0.031  0.099
σ²γ5                                                0.50   −0.014  0.194    −0.020  0.080
σ²γ6                                                0.75   0.014   0.220    −0.024  0.149
σ²γ7                                                1.00   −0.024  0.300    −0.012  0.184
σ²γ8                                                0.25   0.029   0.145    −0.007  0.049
ξ1 −2.00 −0.006 0.228 −0.013 0.133 −2.00 −0.020 0.2043 −0.009 0.139
ξ2 −1.90 −0.043 0.206 −0.028 0.149 −1.90 −0.001 0.1971 −0.010 0.139
ξ3 −1.79 −0.045 0.223 −0.014 0.145 −1.79 −0.020 0.2239 −0.012 0.135
ξ4 −1.69 −0.054 0.243 −0.028 0.145 −1.69 −0.063 0.2275 −0.016 0.132
ξ5 −1.59 −0.039 0.222 −0.018 0.130 −1.59 −0.036 0.2364 −0.010 0.121
ξ6 −1.49 −0.010 0.198 −0.020 0.142 −1.49 0.008 0.2347 −0.008 0.153
ξ7 −1.38 −0.036 0.208 −0.024 0.134 −1.38 −0.015 0.2081 −0.016 0.158
ξ8 −1.28 −0.034 0.228 −0.017 0.129 −1.28 −0.018 0.2381 0.000 0.134
ξ9 −1.18 −0.050 0.206 −0.021 0.149 −1.18 −0.017 0.1947 −0.020 0.143
ξ10 −1.08 −0.022 0.208 −0.014 0.131 −1.08 0.009 0.1985 −0.001 0.137
ξ11 −0.97 0.027 0.189 0.016 0.133 −0.97 0.014 0.2082 0.020 0.124
ξ12 −0.87 0.015 0.206 −0.008 0.129 −0.87 −0.024 0.1762 0.000 0.126
ξ13 −0.77 0.025 0.202 −0.002 0.133 −0.77 −0.005 0.2016 −0.004 0.128
ξ14 −0.67 −0.005 0.184 −0.013 0.131 −0.67 0.007 0.2013 −0.008 0.126
ξ15 −0.56 0.015 0.195 −0.005 0.132 −0.56 0.009 0.2173 0.010 0.123
ξ16 −0.46 0.014 0.197 −0.016 0.133 −0.46 −0.027 0.1898 −0.023 0.131
ξ17 −0.36 0.028 0.183 −0.014 0.118 −0.36 −0.005 0.1786 −0.018 0.117
ξ18 −0.26 0.016 0.189 0.006 0.124 −0.26 0.023 0.1800 0.001 0.118
ξ19 −0.15 −0.001 0.184 −0.005 0.111 −0.15 −0.032 0.1907 −0.015 0.106
ξ20 −0.05 0.050 0.206 0.011 0.133 −0.05 0.014 0.2003 0.005 0.120
ξ21 0.05 0.027 0.190 0.008 0.112 0.05 0.033 0.1857 0.001 0.115
ξ22 0.15 0.026 0.198 0.015 0.124 0.15 0.022 0.1575 −0.004 0.111
ξ23 0.26 0.021 0.193 0.007 0.126 0.26 0.048 0.2086 0.008 0.121
ξ24 0.36 0.037 0.181 0.002 0.124 0.36 0.036 0.1875 0.003 0.106
ξ25 0.46 0.012 0.193 −0.005 0.117 0.46 0.010 0.1673 −0.012 0.114
ξ26 0.56 −0.007 0.217 0.008 0.108 0.56 0.003 0.1967 −0.021 0.126
ξ27 0.67 0.035 0.180 0.005 0.127 0.67 0.002 0.1891 −0.003 0.123
ξ28 0.77 0.034 0.197 −0.003 0.119 0.77 −0.001 0.2034 −0.010 0.116
ξ29 0.87 0.012 0.204 0.013 0.134 0.87 0.001 0.1906 −0.006 0.120
ξ30 0.97 −0.003 0.192 0.013 0.121 0.97 0.013 0.2035 −0.023 0.130
ξ31 1.08 0.002 0.213 −0.010 0.117 1.08 −0.010 0.2147 −0.009 0.122
ξ32 1.18 −0.005 0.186 0.016 0.133 1.18 −0.026 0.2065 −0.006 0.122
ξ33 1.28 0.007 0.191 −0.005 0.135 1.28 0.003 0.2094 −0.018 0.128
ξ34 1.38 0.010 0.204 0.000 0.150 1.38 0.014 0.2086 −0.023 0.153
ξ35 1.49 0.048 0.214 0.031 0.133 1.49 −0.006 0.2115 0.003 0.126
ξ36 1.59 −0.027 0.197 −0.002 0.123 1.59 −0.009 0.1958 −0.001 0.125
ξ37 1.69 0.024 0.211 0.010 0.137 1.69 0.029 0.2082 −0.010 0.143
ξ38 1.79 0.029 0.220 0.017 0.124 1.79 0.008 0.2507 −0.002 0.127
ξ39 1.90 0.029 0.199 0.029 0.144 1.90 0.012 0.2415 −0.021 0.134
ξ40 2.00 0.002 0.220 0.023 0.150 2.00 −0.005 0.2107 0.004 0.135
Note. Gen = generating value; RMSE = root mean square error. ξ1 through ξ40 are the 40 item parameters.
Table 3
Parameter Recovery in Polytomous Items
           Four Testlets                            Eight Testlets
                  N = 200          N = 500                 N = 200          N = 500
Parameter  Gen    Bias    RMSE     Bias    RMSE     Gen    Bias    RMSE     Bias    RMSE
σ²θ        1.00   0.036   0.145    0.016   0.081    1.00   −0.001  0.131    −0.022  0.088
σ²γ1       0.25   0.000   0.087    −0.010  0.043    0.25   0.006   0.109    −0.009  0.049
σ²γ2       1.00   −0.007  0.230    −0.007  0.133    1.00   −0.043  0.334    −0.035  0.186
σ²γ3       0.75   0.039   0.175    0.009   0.111    0.75   −0.049  0.219    −0.027  0.144
σ²γ4       0.50   0.050   0.193    0.052   0.129    0.50   −0.051  0.147    0.021   0.094
σ²γ5                                                0.50   −0.033  0.160    −0.014  0.097
σ²γ6                                                0.75   0.002   0.239    −0.035  0.129
σ²γ7                                                1.00   −0.034  0.298    0.043   0.186
σ²γ8                                                0.25   0.023   0.133    −0.001  0.062
ξ1 −2.00 −0.067 0.322 −0.002 0.174 −2.00 −0.022 0.304 −0.005 0.176
ξ2 −0.74 −0.005 0.200 −0.007 0.099 −0.74 0.003 0.187 −0.007 0.107
ξ3 −1.87 −0.057 0.316 0.002 0.203 −1.87 −0.018 0.389 −0.006 0.218
ξ4 −1.27 −0.034 0.193 −0.018 0.131 −1.27 −0.016 0.194 −0.010 0.130
ξ5 −1.74 −0.004 0.247 0.013 0.163 −1.74 −0.059 0.299 0.008 0.172
ξ6 −0.56 0.014 0.179 −0.004 0.122 −0.56 0.058 0.187 −0.013 0.119
ξ7 −1.61 0.027 0.237 0.003 0.181 −1.61 0.014 0.339 0.030 0.169
ξ8 −0.94 −0.012 0.199 −0.003 0.127 −0.94 −0.003 0.201 −0.012 0.135
ξ9 −1.48 −0.016 0.245 0.026 0.151 −1.48 0.006 0.301 0.035 0.168
ξ10 −0.61 −0.020 0.166 −0.010 0.120 −0.61 −0.017 0.216 −0.010 0.123
ξ11 −1.35 −0.011 0.262 −0.024 0.177 −1.35 0.020 0.321 0.027 0.179
ξ12 −0.79 −0.025 0.179 −0.003 0.116 −0.79 −0.021 0.212 −0.003 0.143
ξ13 −1.22 −0.013 0.214 −0.022 0.175 −1.22 0.040 0.248 −0.007 0.161
ξ14 −0.44 −0.021 0.218 −0.013 0.120 −0.44 −0.002 0.221 −0.004 0.110
ξ15 −1.09 −0.001 0.234 −0.006 0.154 −1.09 0.045 0.233 0.014 0.140
ξ16 0.38 −0.037 0.212 0.003 0.127 0.38 −0.049 0.221 −0.018 0.118
ξ17 −0.96 0.003 0.224 −0.031 0.147 −0.96 0.014 0.244 0.012 0.139
ξ18 0.53 0.008 0.217 −0.002 0.142 0.53 0.020 0.231 −0.034 0.146
ξ19 −0.83 −0.014 0.220 −0.016 0.151 −0.83 0.061 0.214 0.009 0.130
ξ20 0.57 −0.019 0.228 −0.020 0.124 0.57 −0.038 0.252 −0.032 0.126
ξ21 −0.70 −0.014 0.258 −0.025 0.158 −0.70 0.010 0.228 0.012 0.139
ξ22 0.15 0.003 0.224 0.017 0.154 0.15 0.024 0.195 −0.004 0.120
ξ23 −0.57 0.006 0.210 −0.009 0.130 −0.57 0.034 0.226 −0.007 0.135
ξ24 0.17 −0.001 0.236 −0.010 0.127 0.17 −0.010 0.218 −0.003 0.121
ξ25 −0.43 0.017 0.203 −0.009 0.141 −0.43 0.028 0.220 0.025 0.145
ξ26 0.16 0.037 0.200 0.014 0.133 0.16 0.006 0.231 −0.030 0.127
ξ27 −0.30 −0.019 0.202 −0.015 0.127 −0.30 −0.010 0.199 −0.011 0.135
ξ28 0.80 0.048 0.251 0.004 0.152 0.80 0.018 0.231 −0.033 0.139
ξ29 −0.17 0.008 0.227 −0.003 0.126 −0.17 −0.001 0.218 −0.010 0.133
ξ30 1.26 0.041 0.244 −0.003 0.156 1.26 0.007 0.242 −0.030 0.145
ξ31 −0.04 0.007 0.201 −0.033 0.139 −0.04 −0.023 0.206 −0.019 0.125
ξ32 1.42 0.023 0.263 0.014 0.140 1.42 −0.004 0.242 −0.060 0.166
ξ33 0.09 −0.030 0.181 −0.003 0.115 0.09 −0.005 0.214 −0.005 0.119
ξ34 1.28 0.099 0.272 0.000 0.143 1.28 0.010 0.235 −0.015 0.157
ξ35 0.22 0.006 0.189 −0.007 0.120 0.22 0.018 0.239 −0.007 0.118
ξ36 0.83 0.077 0.265 −0.018 0.155 0.83 0.000 0.240 −0.061 0.166
ξ37 0.35 −0.026 0.165 −0.005 0.119 0.35 −0.007 0.197 −0.007 0.128
ξ38 1.37 0.019 0.242 0.015 0.160 1.37 −0.032 0.259 −0.030 0.178
ξ39 0.48 0.009 0.196 −0.020 0.112 0.48 0.044 0.211 −0.010 0.129
ξ40 1.61 0.043 0.298 0.051 0.204 1.61 −0.066 0.253 −0.025 0.199
ξ41 0.61 0.017 0.183 −0.003 0.128 0.61 0.002 0.197 −0.008 0.133
ξ42 1.73 0.013 0.273 0.055 0.174 1.73 −0.028 0.290 −0.026 0.191
ξ43 0.74 0.002 0.196 0.027 0.125 0.74 0.055 0.212 0.001 0.105
ξ44 1.34 0.061 0.266 0.010 0.177 1.34 0.028 0.247 0.004 0.172
ξ45 0.87 −0.005 0.180 0.013 0.125 0.87 0.037 0.200 −0.008 0.104
ξ46 2.08 0.075 0.325 0.016 0.200 2.08 0.047 0.302 0.006 0.199
ξ47 1.00 −0.002 0.198 −0.007 0.127 1.00 0.001 0.190 −0.008 0.106
ξ48 2.42 0.046 0.343 0.060 0.229 2.42 0.035 0.369 −0.025 0.219
generating values (between –2.00 and 2.42). As found for the dichotomous items, a larger sample
size led to smaller RMSE.
Dichotomous plus polytomous items. The lower part of Table 1 lists the transformed F statistics,
degrees of freedom, and p values of Hotelling’s $T^{2}$ tests. All the p values were larger than .01.
The generating value, bias, and RMSE are presented in Table 4. The magnitudes of bias (between
–0.064 and 0.083) were also very small, as compared to the range of the generating values (between
–2.00 and 2.08). As found under the above conditions, a larger sample size led to smaller RMSE.
Table 4
Parameter Recovery in Dichotomous Plus Polytomous Items
Parameter | Four Testlets: Gen, Bias (N = 200), RMSE (N = 200), Bias (N = 500), RMSE (N = 500) | Eight Testlets: Gen, Bias (N = 200), RMSE (N = 200), Bias (N = 500), RMSE (N = 500)
σ²_θ 1.00 −0.025 0.122 0.021 0.091 1.00 0.011 0.155 −0.013 0.075
σ²_γ1 0.25 0.000 0.083 −0.004 0.042 0.25 0.005 0.107 −0.014 0.039
σ²_γ2 1.00 −0.002 0.265 0.018 0.155 1.00 −0.017 0.346 −0.064 0.184
σ²_γ3 0.75 −0.035 0.203 −0.022 0.130 0.75 −0.017 0.255 −0.049 0.140
σ²_γ4 0.50 0.033 0.157 0.036 0.100 0.50 0.015 0.148 −0.017 0.106
σ²_γ5 n/a n/a n/a n/a n/a 0.50 0.016 0.181 −0.007 0.095
σ²_γ6 n/a n/a n/a n/a n/a 0.75 −0.047 0.203 −0.018 0.140
σ²_γ7 n/a n/a n/a n/a n/a 1.00 0.009 0.284 −0.045 0.189
σ²_γ8 n/a n/a n/a n/a n/a 0.25 0.021 0.107 −0.009 0.064
ξ1 −2.00 −0.036 0.229 −0.009 0.139 −2.00 −0.008 0.235 −0.021 0.135
ξ2 −1.79 0.008 0.204 −0.001 0.148 −1.79 −0.037 0.210 0.006 0.125
ξ3 −1.58 0.019 0.187 0.000 0.121 −1.58 0.005 0.180 0.014 0.127
ξ4 −1.37 0.026 0.203 −0.016 0.131 −1.37 −0.003 0.194 −0.012 0.117
ξ5 −1.16 0.028 0.189 −0.001 0.129 −1.16 −0.017 0.194 0.005 0.107
ξ6 −0.95 0.023 0.165 −0.008 0.122 −0.95 −0.002 0.189 −0.021 0.125
ξ7 −0.74 0.025 0.186 0.002 0.116 −0.74 −0.002 0.211 −0.011 0.129
ξ8 −0.53 −0.014 0.167 −0.009 0.123 −0.53 −0.036 0.186 −0.004 0.123
ξ9 −0.32 0.028 0.186 −0.003 0.101 −0.32 0.012 0.191 0.006 0.126
ξ10 −0.11 0.008 0.164 −0.028 0.107 −0.11 −0.024 0.204 −0.026 0.127
ξ11 0.11 0.015 0.180 0.013 0.121 0.11 −0.006 0.179 0.011 0.102
ξ12 0.32 0.035 0.182 −0.004 0.123 0.32 0.014 0.201 −0.004 0.130
ξ13 0.53 −0.020 0.190 −0.005 0.119 0.53 −0.018 0.195 −0.009 0.121
ξ14 0.74 −0.016 0.189 0.009 0.126 0.74 −0.001 0.192 −0.001 0.124
ξ15 0.95 0.017 0.184 −0.005 0.140 0.95 −0.020 0.198 −0.015 0.123
ξ16 1.16 0.020 0.191 0.005 0.130 1.16 0.010 0.209 0.019 0.126
ξ17 1.37 0.002 0.211 −0.001 0.140 1.37 0.026 0.206 0.012 0.135
ξ18 1.58 0.034 0.233 −0.010 0.136 1.58 0.027 0.216 −0.004 0.137
ξ19 1.79 0.010 0.224 0.011 0.154 1.79 0.016 0.204 0.000 0.114
ξ20 2.00 −0.004 0.258 −0.012 0.144 2.00 0.001 0.234 −0.004 0.134
ξ21 −2.00 0.026 0.313 0.001 0.196 −2.00 −0.013 0.323 0.007 0.164
ξ22 −0.74 0.035 0.209 0.006 0.128 −0.74 −0.019 0.179 0.003 0.121
ξ23 −1.74 0.060 0.264 0.007 0.162 −1.74 −0.028 0.299 0.004 0.184
ξ24 −0.56 0.032 0.216 0.019 0.117 −0.56 −0.004 0.192 −0.009 0.118
ξ25 −1.48 0.012 0.251 −0.019 0.165 −1.48 −0.026 0.250 0.017 0.162
ξ26 −0.61 0.014 0.200 0.017 0.128 −0.61 −0.011 0.196 −0.017 0.100
ξ27 −1.22 0.027 0.257 −0.014 0.158 −1.22 −0.005 0.239 0.017 0.155
ξ28 −0.44 −0.025 0.203 0.007 0.141 −0.44 −0.060 0.226 −0.027 0.120
ξ29 −0.96 0.023 0.234 0.018 0.145 −0.96 0.000 0.239 0.002 0.124
ξ30 0.53 0.058 0.199 0.018 0.134 0.53 −0.019 0.196 0.007 0.142
ξ31 −0.70 0.050 0.227 −0.006 0.158 −0.70 −0.016 0.197 0.034 0.125
ξ32 0.15 0.011 0.246 0.020 0.129 0.15 −0.015 0.201 −0.037 0.135
ξ33 −0.43 0.023 0.203 −0.021 0.132 −0.43 0.002 0.203 0.026 0.146
ξ34 0.16 0.054 0.217 0.012 0.131 0.16 −0.014 0.220 −0.003 0.147
ξ35 −0.17 0.023 0.212 0.001 0.125 −0.17 −0.018 0.239 0.007 0.128
ξ36 1.26 0.057 0.233 −0.006 0.145 1.26 0.011 0.263 −0.007 0.143
ξ37 0.09 0.032 0.192 0.004 0.119 0.09 0.027 0.212 0.017 0.127
ξ38 1.28 0.054 0.238 0.027 0.156 1.28 −0.014 0.235 −0.013 0.161
ξ39 0.35 0.030 0.187 −0.002 0.115 0.35 0.003 0.183 0.011 0.122
ξ40 1.37 0.043 0.306 0.024 0.164 1.37 0.020 0.248 −0.013 0.161
ξ41 0.61 0.041 0.210 −0.004 0.117 0.61 0.020 0.176 −0.002 0.124
ξ42 1.73 0.041 0.317 0.024 0.171 1.73 −0.017 0.268 −0.011 0.172
ξ43 0.87 0.065 0.218 0.007 0.115 0.87 −0.011 0.190 0.012 0.118
ξ44 2.08 0.083 0.339 0.030 0.199 2.08 0.057 0.316 −0.011 0.187
In summary, ConQuest has done an excellent job of recovering the generating values of the
Rasch testlet model. The magnitudes of bias are negligible, and the magnitudes of RMSE are
satisfactory.
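The bias and RMSE reported in the recovery tables summarize how the estimates scatter around each generating value across replications. As a minimal sketch (the function name and the toy estimates below are hypothetical and are not taken from the simulation study), the two indices can be computed as follows:

```python
import math

def bias_and_rmse(estimates, generating_value):
    """Return (bias, RMSE) of replicated estimates around a generating value."""
    n = len(estimates)
    errors = [est - generating_value for est in estimates]
    bias = sum(errors) / n                            # mean signed error
    rmse = math.sqrt(sum(e * e for e in errors) / n)  # root mean square error
    return bias, rmse

# Hypothetical estimates of one parameter over five replications
# (illustrative numbers only; the generating value is 1.00).
toy_estimates = [0.9, 1.1, 1.2, 0.8, 1.0]
bias, rmse = bias_and_rmse(toy_estimates, 1.00)
print(f"bias = {bias:.3f}, RMSE = {rmse:.3f}")
```

Because the RMSE combines squared bias and sampling variance, it can shrink with sample size even when the bias is already close to zero, which matches the pattern described above.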
An Empirical Example
The 2001 English test of the Basic Competence Tests for Junior High School Students, which
served as the entrance examination of senior high schools in Taiwan, was analyzed. The test
contained 44 multiple-choice items, including 17 independent items and 27 items in 11 testlets.
A total of 5,000 examinees were randomly selected from a population of more than 300,000
examinees. The data set was analyzed using the standard Rasch model (no testlet effects) and the
Rasch testlet model, respectively. The likelihood deviances (–2 log likelihood) of these two models
were 211,756.25 and 211,256.22, respectively. The difference between them was 500.03, with 11
degrees of freedom, which was statistically significant at the .001 level. Therefore, the testlet model
fitted the data statistically better than the standard model, indicating that local dependence existed
between items within testlets.
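The deviance comparison above is a likelihood ratio test. The following sketch reproduces the arithmetic from the two reported deviances, using a small pure-Python chi-square tail function (a helper written for this illustration, based on the standard finite-series forms for integer degrees of freedom). It follows the article in referring the statistic to a chi-square distribution with 11 degrees of freedom, one per testlet variance.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variate with integer df >= 1."""
    chi = math.sqrt(x)
    if df % 2 == 1:
        # Odd df: 2*Q(chi) + 2*Z(chi) * sum_r chi^(2r-1) / (1*3*...*(2r-1))
        z = math.exp(-x / 2.0) / math.sqrt(2.0 * math.pi)  # normal density at chi
        tail = math.erfc(chi / math.sqrt(2.0))             # equals 2*Q(chi)
        term, series = chi, 0.0
        for r in range(1, (df - 1) // 2 + 1):
            series += term
            term *= x / (2 * r + 1)
        return tail + 2.0 * z * series
    # Even df: exp(-x/2) * sum_{r=0}^{df/2-1} (x/2)^r / r!
    term, series = 1.0, 1.0
    for r in range(1, (df - 2) // 2 + 1):
        term *= x / (2 * r)
        series += term
    return math.exp(-x / 2.0) * series

deviance_standard = 211756.25   # -2 log likelihood, standard Rasch model
deviance_testlet = 211256.22    # -2 log likelihood, Rasch testlet model
n_testlets = 11                 # one extra variance parameter per testlet

lr_statistic = deviance_standard - deviance_testlet
p_value = chi2_sf(lr_statistic, n_testlets)
print(f"LR statistic = {lr_statistic:.2f}, df = {n_testlets}, p = {p_value:.3g}")
```

The resulting p value is far below .001, in line with the conclusion stated above.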
Table 5 lists the difficulty estimates and weighted mean square errors (WMSE) for the 44 items.
The WMSEs ranged from 0.77 to 1.35 (M = 1.01, SD = 0.17). When items fit the model’s
expectation, the WMSE has an expected value of unity. However, because the sample size was very
large, even a trivial misfit would be detected as statistically significant. Figure 1a,b shows the empirical
response curves versus expected response curves for the two items with the most extreme WMSE:
21 (WMSE = 0.77) and 34 (WMSE = 1.35), respectively. Only minor fluctuations
can be found between the empirical and expected response curves. Therefore, the data appeared to
fit the expectation of the Rasch testlet model fairly well.
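To make the WMSE index concrete, the sketch below computes the conventional weighted (information-weighted) mean square for a single dichotomous item from observed scores and model-expected probabilities. This is an assumption-laden illustration: the function name and toy data are hypothetical, and the statistic ConQuest reports under MML estimation may differ in computational detail.

```python
def weighted_mean_square(responses, probabilities):
    """Weighted mean square for one dichotomous item.

    responses:     observed 0/1 scores, one per examinee
    probabilities: model-expected probabilities of a correct response
    """
    squared_residuals = sum((x - p) ** 2 for x, p in zip(responses, probabilities))
    variances = sum(p * (1.0 - p) for p in probabilities)
    return squared_residuals / variances

# Toy data for four hypothetical examinees (illustrative only).
toy_responses = [0, 1, 1, 0]
toy_probabilities = [0.2, 0.5, 0.8, 0.5]
wmse = weighted_mean_square(toy_responses, toy_probabilities)
print(f"WMSE = {wmse:.3f}")
```

Values below 1 indicate responses that are more predictable than the model expects, and values above 1 indicate more noise than expected.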
Table 6 lists the variance estimates for the testlet and standard models. The variance estimates
of the latent trait obtained from the testlet and standard models were 3.60 and 3.26, respectively.
Figure 2 shows the relationship between the difficulty estimates obtained from the testlet and
Table 5
Difficulty Estimates and Weighted Mean Square
Errors (WMSE) for the 44 Items
Item Difficulty WMSE
Independent
1 −1.99 0.81
2 −1.34 0.85
3 −1.05 0.84
4 −1.21 0.87
5 −0.86 1.01
6 −0.63 1.16
7 −1.04 1.01
8 −0.46 1.23
9 −1.13 0.97
10 −1.16 0.94
11 −0.77 1.27
12 −1.33 1.00
13 −1.72 1.08
14 −1.12 0.84
15 −0.15 1.34
16 −0.72 1.17
17 −0.22 1.29
Testlet 1
18 −1.26 0.86
19 −0.63 0.96
Testlet 2
20 −1.40 0.82
21 −1.55 0.77
22 −0.85 0.85
Testlet 3
23 −1.05 0.94
24 −1.67 0.83
Testlet 4
25 −1.56 0.79
26 −2.31 0.90
Testlet 5
27 −1.75 0.83
28 −0.72 1.14
Testlet 6
29 −2.57 0.91
30 −1.61 0.97
31 −0.86 0.98
Testlet 7
32 −1.24 0.90
33 −0.29 1.05
Testlet 8
34 0.44 1.35
35 0.63 1.14
Testlet 9
36 0.44 1.18
37 −0.80 1.06
Testlet 10
38 −1.03 0.85
39 −1.06 0.99
40 −0.26 1.24
Testlet 11
41 −0.86 1.18
42 −0.28 0.91
43 0.83 1.33
44 −0.84 0.89
Figure 1
Empirical Curves Versus Expected Curves for the Two Items With the Most Extreme Weighted
Mean Square Errors (WMSE): (a) Item 21 (WMSE = 0.77) and (b) Item 34 (WMSE = 1.35)
Table 6
Variance Estimates of the Theta and Random Testlet
Variables Under the Testlet Model and the Standard Model
Model Testlet Standard
Theta 3.60 3.26

Testlet 1 0.21
Testlet 2 0.14
Testlet 3 0.07
Testlet 4 0.57
Testlet 5 0.21
Testlet 6 0.51
Testlet 7 0.25
Testlet 8 2.09
Testlet 9 1.20
Testlet 10 0.55
Testlet 11 0.51
Note. In the standard model, no testlet effects are modeled.

Figure 2
Relationship Between the Difficulty Estimates Obtained From
the Testlet and Standard Models
[Scatterplot: Testlet estimates on the horizontal axis, Standard estimates on the vertical axis]
Figure 3
Testlet-Based Items With the Smallest Testlet Effect
Figure 4
Testlet-Based Items With the Largest Testlet Effect
standard models. Obviously, these two models did not yield identical difficulty estimates, but they
were very similar (r = .99). The difficulty estimates obtained from the standard model, ranging
from –2.37 to 0.79 (M = −0.88), tended to shrink toward the mean, as compared to those from the
testlet model, ranging from –2.59 to 0.83 (M = −0.93).
The variance estimates for the 11 testlets ranged from 0.07 to 2.09 (M = 0.57). Compared to the
variance of the latent trait (3.60), the random testlet effects ranged from trivial to large. Figures 3
and 4 show the testlets with the smallest and largest variances, respectively, in which the examinees
read a common passage and then responded to two multiple-choice items. It would be instructive to explain
why some testlets have much larger random effects than others. An examination of the content and
physical characteristics of the testlets is required to answer this question. Sheehan, Ginther, and Schedl (1999)
called attention to some plausible candidate characteristics: (a) degree of redundancy in the passage,
(b) existence of overt markers in the passage to guide examinees, and (c) extent of metaphorical
language. In addition, an investigation of examinees’ cognitive processes in responding to test items
may be helpful.
As ConQuest uses MML estimation, the test reliability can be calculated as
$$\text{Reliability} = \mathrm{Var}(\theta_{\mathrm{EAP}}) / \sigma^{2}, \tag{35}$$
where $\sigma^{2}$ is the variance of the latent trait and $\mathrm{Var}(\theta_{\mathrm{EAP}})$ is the variance of the EAP estimator
(Mislevy et al., 1992). The estimates of test reliability for the latent trait θ were .931 under the
standard model and .920 under the testlet model. The test reliability was overestimated when the
local item dependence within testlets was ignored under the standard model, which is consistent
with findings in the literature (Sireci et al., 1991; Thissen, Steinberg, & Mooney, 1989; Wainer,
1995; Wainer & Lukhele, 1997; Wainer & Thissen, 1996; Wainer & Wang, 2000; Wilson & Adams,
1995; Yen, 1993). According to the Spearman-Brown prophecy formula, the test would have to
be lengthened by approximately 17.4% to raise its reliability from .920 to .931 when the
local item dependence is taken into account appropriately under the testlet model. This 17.4% increase
in test length summarizes the overall impact of the testlet effects on measurement precision.
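The Spearman-Brown calculation can be checked directly. The sketch below solves the prophecy formula for the lengthening factor; the function name is hypothetical. With the rounded reliabilities quoted in the text it gives a factor of about 1.17, which agrees with the reported figure up to rounding of the inputs.

```python
def spearman_brown_length_factor(current_reliability, target_reliability):
    """Factor by which test length must be multiplied to reach the target reliability."""
    return (target_reliability * (1.0 - current_reliability)) / (
        current_reliability * (1.0 - target_reliability)
    )

# Reliabilities of the latent trait as reported in the text.
factor = spearman_brown_length_factor(0.920, 0.931)
print(f"length factor = {factor:.3f}, increase = {100 * (factor - 1):.1f}%")
```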
The estimates of test reliability for the 11 random testlet variables were between .02 and .33
(M = .13). They were very low because each testlet consisted of only two to four dichotomous
items. Fortunately, test users are much more concerned about the reliability of the latent trait θ than
that of the random testlet variables.
Conclusions
The Rasch testlet model for dichotomous and polytomous items in testlet-based tests is proposed.
It is a special case of the MRCMLM so that the computer program ConQuest can be directly
applied to calibrate the parameters in the Rasch testlet model. Results of the simulations show that
ConQuest yields unbiased estimates. As sample size is increased, the root mean square error of the
estimates decreases to an acceptable level. The simulations were conducted on personal computers
with a 2.0-GHz Intel Pentium IV. It took approximately 5 to 10 minutes for a single calibration. The
computation speed is feasible for most real data analyses. As the number of dimensions is one plus
the number of testlets, it is interesting to know whether ConQuest can recover generating values
very well in reasonable time when the number of testlets is very large (say, 20). More studies are
needed for this. It is also desirable to develop stand-alone computer programs specific to the Rasch
testlet model.
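For readers who want to experiment with the model outside ConQuest, the following is a minimal simulation sketch of the dichotomous Rasch testlet model, assuming the logit form θ − b + γ given for the model earlier in the article. All function and variable names are hypothetical. Independent items are handled by assigning them no testlet, so their testlet effect is 0, and each person carries one latent trait plus one random effect per testlet, which is why the number of dimensions is one plus the number of testlets.

```python
import math
import random

def p_correct(theta, difficulty, testlet_effect=0.0):
    """Probability of a correct response: logit = theta - b + gamma."""
    logit = theta - difficulty + testlet_effect
    return 1.0 / (1.0 + math.exp(-logit))

def simulate_person(difficulties, testlet_of_item, testlet_variances, rng,
                    theta_variance=1.0):
    """Simulate one person's 0/1 responses.

    testlet_of_item[i] is the testlet index of item i, or None if independent.
    """
    theta = rng.gauss(0.0, math.sqrt(theta_variance))
    # One random effect per testlet, independent of theta and of each other.
    gammas = [rng.gauss(0.0, math.sqrt(v)) for v in testlet_variances]
    responses = []
    for b, d in zip(difficulties, testlet_of_item):
        gamma = 0.0 if d is None else gammas[d]
        responses.append(1 if rng.random() < p_correct(theta, b, gamma) else 0)
    return responses

# Toy test: two independent items and one two-item testlet (illustrative values).
rng = random.Random(0)
toy_difficulties = [-1.0, 0.0, 0.5, 1.0]
toy_testlet_of_item = [None, None, 0, 0]
toy_responses = simulate_person(toy_difficulties, toy_testlet_of_item, [0.5], rng)
print(toy_responses)
```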
The English test of an entrance examination of senior high school students in Taiwan was
analyzed. The Rasch testlet model fits the data statistically better than the standard Rasch model.
The variances of the 11 testlets are very different, ranging from trivial to large, which means that
ignoring local dependence between items within testlets is inappropriate and can cause problems.
The difficulty estimates obtained from the standard Rasch model shrink slightly toward the mean,
compared with those from the Rasch testlet model. The test reliability is also overestimated under
the standard Rasch model.
The testlet model examined in the simulations focuses on the dichotomous Rasch and partial-
credit models. The incorporation of other Rasch family models—such as the linear logistic test
model, the rating scale model, the partial-order model, the facet model, the linear rating scale
model, the linear partial-credit model, and many other customized models—into the testlet models
is straightforward within the MRCMLM framework (e.g., Rijmen & De Boeck, 2002). Although in
the simulations, every item belongs to a certain testlet, the MRCMLM does not have this constraint,
as demonstrated in the empirical example. It can be applied to tests containing independent items
and testlet-based items. Being a multidimensional model, the MRCMLM can also be easily applied
to multiple tests (e.g., language and mathematics), each with its own set of testlets. Simultaneous
calibration of multiple tests (with or without testlets) has the advantages of direct estimation of the
correlations between latent traits and more precise measures for individual persons than several
separate calibrations of latent traits, one latent trait at a time (Wang et al., 2004).
Local item dependence within a testlet can be modeled with a random effect, as was done in
the present study and by Bradlow et al. (1999), Wainer et al. (2000), and Wang et al. (2002).
On the other hand, it can be modeled with a set of fixed-effects parameters. The basic idea is to
model the response patterns of items within a testlet rather than individual responses of items.
For example, if a testlet contains five dichotomous items, then there will be up to 32 (= $2^{5}$)
response patterns in the testlet. The testlet can be treated as a super-item containing 32 categories.
Under the family of Rasch models, at most 31 item parameters can be modeled to describe
the super-item, although not every possible item parameter has to be included in practice. By
adding so many item parameters, this fixed-effects approach has the advantage of describing the
local item dependence within a testlet thoroughly (Hoskens & De Boeck, 1997; Tuerlinckx &
De Boeck, 2001; Wang, Cheng, & Wilson, in press; Wilson & Adams, 1995). The fixed-effects
approach can be easily implemented within the MRCMLM framework as well so that ConQuest
can be used to calibrate test data. Note that as the number of items within a testlet
increases linearly, the number of possible response patterns increases exponentially. For a testlet
with 10 dichotomous items, there will be up to 1,024 (= $2^{10}$) response patterns. Therefore,
the fixed-effects approach seems to be more applicable when the number of items within a testlet
is relatively small. In contrast, the random-effect approach has the strength of dealing with
large numbers of items within a testlet because only a single random variable is added for each
testlet. Further studies may be conducted to compare the random-effect and fixed-effects
approaches in terms of theoretical implication, parameter recovery, model-data fit, and computational
efficiency.
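The growth in response patterns under the fixed-effects approach is easy to verify by enumeration. The short sketch below (with a hypothetical helper name) lists every 0/1 pattern for a testlet of a given size and counts the categories of the corresponding super-item, together with the maximum number of item parameters, which is one fewer than the number of categories.

```python
from itertools import product

def response_patterns(n_items):
    """All possible 0/1 response patterns for a testlet of n_items dichotomous items."""
    return list(product((0, 1), repeat=n_items))

for n_items in (5, 10):
    n_categories = len(response_patterns(n_items))   # 2 ** n_items
    max_item_parameters = n_categories - 1           # one category serves as reference
    print(n_items, n_categories, max_item_parameters)
```

By contrast, the random-effect approach adds a single variance parameter per testlet regardless of how many items the testlet contains.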
In the present study, the parameters under the Rasch testlet model are calibrated using MML
estimation. MCMC techniques have been used to calibrate parameters of the generalized MRCMLM
and found to be efficient (Hung, 2002). Conditional maximum likelihood (CML) estimation, being
widely used for the family of Rasch models, is another alternative for the Rasch testlet model.
Relative efficiency of MCMC, MML, and CML estimation procedures under the Rasch testlet model
needs further investigation.
References
Adams, R. J., & Wilson, M. R. (1996). Formulating the Rasch model as a mixed coefficients multinomial logit. In G. Engelhard & M. Wilson (Eds.), Objective measurement: Theory into practice (Vol. 3, pp. 143-166). Norwood, NJ: Ablex.
Adams, R. J., Wilson, M. R., & Wang, W.-C. (1997). The multidimensional random coefficients multinomial logit model. Applied Psychological Measurement, 21, 1-23.
Andrich, D. (1978). A rating formulation for ordered response categories. Psychometrika, 43, 561-573.
Birnbaum, A. (1968). Some latent trait models and their use in inferring an examinee’s ability. In F. M. Lord & M. R. Novick (Eds.), Statistical theories of mental test scores (pp. 397-479). Reading, MA: Addison-Wesley.
Bock, R. D. (1972). Estimating item parameters and latent ability when responses are scored in two or more nominal categories. Psychometrika, 37, 29-51.
Bock, R. D., & Aitkin, M. (1981). Marginal maximum likelihood estimation of item parameters: Application of an EM algorithm. Psychometrika, 46, 443-459.
Bock, R. D., & Mislevy, R. J. (1982). Adaptive EAP estimation of ability in a microcomputer environment. Applied Psychological Measurement, 6, 431-444.
Bradlow, E. T., Wainer, H., & Wang, X. (1999). A Bayesian random effects model for testlets. Psychometrika, 64, 153-168.
Congdon, P. (2003). Applied Bayesian modelling. New York: John Wiley.
De Boeck, P., & Wilson, M. R. (Eds.). (2004). Explanatory item response models: A generalized linear and nonlinear approach. New York: Springer-Verlag.
Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society (Series B), 39, 1-38.
Fischer, G. H. (1973). The linear logistic test model as instrument in educational research. Acta Psychologica, 37, 359-374.
Fischer, G. H., & Parzer, P. (1991). An extension of the rating scale model with an application to the measurement of treatment effects. Psychometrika, 56, 637-651.
Fischer, G. H., & Ponocny, I. (1994). An extension of the partial credit model with an application to the measurement of change. Psychometrika, 59, 177-192.
Gelfand, A. E., & Smith, A. F. M. (1990). Sampling-based approaches to calculating marginal densities. Journal of the American Statistical Association, 85, 398-409.
Gelman, A., Carlin, J. B., Stern, H. S., & Rubin, D. B. (2003). Bayesian data analysis (2nd ed.). New York: Chapman & Hall/CRC.
Glas, C. A. W., Wainer, H., & Bradlow, E. T. (2000). MML and EAP estimation in testlet-based adaptive testing. In W. van der Linden & C. A. W. Glas (Eds.), Computerized adaptive testing: Theory and practice (pp. 271-287). London: Kluwer.
Hoijtink, H., Rooks, G., & Wilmink, F. W. (1999). Confirmatory factor analysis of items with a dichotomous response format using the multidimensional Rasch model. Psychological Methods, 4, 300-314.
Hoskens, M., & De Boeck, P. (1997). A parametric model for local dependence among test items. Psychological Methods, 2, 261-277.
Hoskens, M., & De Boeck, P. (2001). Multidimensional componential item response theory models for polytomous items. Applied Psychological Measurement, 25, 19-37.
Hung, L.-F. (2002). The generalized multidimensional multilevel multinomial logit model. Unpublished doctoral dissertation, National Chung Cheng University, Taiwan.
Irvine, S. H., & Kyllonen, P. C. (Eds.). (2002). Item generation for test development. Hillsdale, NJ: Lawrence Erlbaum.
Lee, P. M. (1989). Bayesian statistics: An introduction. New York: Oxford University Press.
Linacre, J. M. (1989). Many-facet Rasch measurement. Chicago: Measurement, Evaluation, Statistics, and Assessment Press.
Masters, G. N. (1982). A Rasch model for partial credit scoring. Psychometrika, 47, 149-174.
McCullagh, P., & Nelder, J. A. (1989). Generalized linear models (2nd ed.). London: Chapman & Hall.
McCulloch, C. E., & Searle, S. R. (2001). Generalized, linear, and mixed models. New York: John Wiley.
Mislevy, R. J., Beaton, A. E., Kaplan, B., & Sheehan, K. M. (1992). Estimating population characteristics from sparse matrix samples of item responses. Journal of Educational Measurement, 29, 133-161.
Muraki, E. (1992). A generalized partial credit model: Application of an EM algorithm. Applied Psychological Measurement, 16, 159-176.
Nelder, J. A., & Wedderburn, R. W. M. (1972). Generalized linear models. Journal of the Royal Statistical Society A, 135, 370-384.
Punt, A. E., & Hilborn, R. (1997). Fisheries stock assessment and decision analysis: The Bayesian approach. Reviews in Fish Biology and Fisheries, 7, 35-63.
Rasch, G. (1980). Probabilistic models for some intelligence and attainment tests (Expanded ed.). Chicago: University of Chicago Press. (Original work published 1960)
Rijmen, F., & De Boeck, P. (2002). The random weights linear logistic test model. Applied Psychological Measurement, 26, 271-285.
Rijmen, F., Tuerlinckx, F., De Boeck, P., & Kuppens, P. (2003). A nonlinear mixed model framework for item response theory. Psychological Methods, 8, 185-205.
Rosenbaum, P. R. (1988). Item bundles. Psychometrika, 53, 349-359.
Samejima, F. (1969). Estimation of latent ability using a response pattern of graded scores. Psychometrika Monograph Supplement, 17, 1-100.
SAS Institute. (1999). The NLMIXED procedure [Computer software]. Cary, NC: Author.
Sheehan, K. M., Ginther, A., & Schedl, M. (1999, March). Understanding performance on the TOEFL reading comprehension section: A tree-based regression approach. Paper presented at the annual conference of the American Association of Applied Linguistics, Stamford, CT.
Sireci, S. G., Thissen, D., & Wainer, H. (1991). On the reliability of testlet-based tests. Journal of Educational Measurement, 28, 237-247.
Thissen, D., Steinberg, L., & Mooney, J. A. (1989). Trace lines for testlets: A use of multiple-categorical-response models. Journal of Educational Measurement, 26, 247-260.
Tuerlinckx, F., & De Boeck, P. (2001). The effect of ignoring item interactions on the estimated discrimination parameters in item response theory. Psychological Methods, 6, 181-195.
Volodin, N., & Adams, R. J. (1995, April). Identifying and estimating a D-dimensional Rasch model. Paper presented at the International Objective Measurement Workshop, University of California at Berkeley.
Wainer, H. (1995). Precision and differential item functioning on a testlet-based test: The 1991 Law School Admissions Test as an example. Applied Measurement in Education, 8, 157-186.
Wainer, H., Bradlow, E. T., & Du, Z. (2000). Testlet response theory: An analog for the 3PL model useful in testlet-based adaptive testing. In W. van der Linden & C. A. W. Glas (Eds.), Computerized adaptive testing: Theory and practice (pp. 245-269). London: Kluwer.
Wainer, H., & Kiely, G. (1987). Item clusters and computerized adaptive testing: A case for testlets. Journal of Educational Measurement, 24, 185-202.
Wainer, H., & Lukhele, R. (1997). How reliable are TOEFL scores? Educational and Psychological Measurement, 57, 749-766.
Wainer, H., & Thissen, D. (1996). How is reliability related to the quality of test scores? What is the effect of local dependence on reliability? Educational Measurement: Issues and Practice, 15(1), 22-29.
Wainer, H., & Wang, X. (2000). Using a new statistical model for testlets to score TOEFL. Journal of Educational Measurement, 37, 203-220.
Wang, W.-C. (1999). Direct estimation of correlations among latent traits within IRT framework. Methods of Psychological Research Online, 4, 63-82.
Wang, W.-C., & Chen, H.-C. (2004). The standardized mean difference within the framework of item response theory. Educational and Psychological Measurement, 64, 201-223.
Wang, W.-C., Chen, P.-H., & Cheng, Y.-Y. (2004). Improving measurement precision of test batteries using multidimensional item response models. Psychological Methods, 9, 116-136.
Wang, W.-C., Cheng, Y.-Y., & Wilson, M. R. (in press). Local item dependence for items across tests connected by common stimuli. Educational and Psychological Measurement.
Wang, W.-C., Wilson, M. R., & Adams, R. J. (1997). Rasch models for multidimensionality between items and within items. In M. Wilson, G. Engelhard, & K. Draney (Eds.), Objective measurement: Theory into practice (Vol. 4, pp. 139-155). Norwood, NJ: Ablex.
Wang, W.-C., Wilson, M. R., & Adams, R. J. (2000). Interpreting the parameters of a multidimensional Rasch model. In M. Wilson & G. Engelhard (Eds.), Objective measurement: Theory into practice (Vol. 5, pp. 219-242). Norwood, NJ: Ablex.
Wang, W.-C., & Wu, C.-I. (2004). Gain score in item response theory as an effect size measure. Educational and Psychological Measurement, 64, 758-780.
Wang, X., Bradlow, E. T., & Wainer, H. (2002). A general Bayesian model for testlets: Theory and applications. Applied Psychological Measurement, 26, 109-128.
Wilson, M. R. (1992). The partial order model: An extension of the partial credit model. Applied Psychological Measurement, 16, 309-325.
Wilson, M. R., & Adams, R. J. (1995). Rasch models for item bundles. Psychometrika, 60, 181-198.
Wolfinger, R. D., & SAS Institute. (n.d.). Fitting nonlinear mixed models with the new NLMIXED procedure. Retrieved August 17, 2003, from http://support.ssas.com/rnd/app/papers/nlmixed-sugi.pdf
Wu, M. L., Adams, R. J., & Wilson, M. R. (1998). ConQuest: Generalized item response modeling software [Computer software and manual]. Camberwell, Victoria: Australian Council for Educational Research.
Yen, W. (1993). Scaling performance assessment: Strategies for managing local item dependence. Journal of Educational Measurement, 30, 187-213.

Author’s Address
Address correspondence to Wen-Chung Wang, Department of Psychology, National Chung Cheng University, Chia-Yi, Taiwan; e-mail: psywcw@ccu.edu.tw.

Applied Psychological Measurement, Vol. 29 No. 2, March 2005, 126–149
