
Computerized Adaptive Testing With Item Cloning

Cees A. W. Glas and Wim J. van der Linden, University of Twente, the Netherlands

To increase the number of items available for adaptive testing and reduce the cost of item writing, the use of techniques of item cloning has been proposed. An important consequence of item cloning is possible variability between the item parameters. To deal with this variability, a multilevel item response theory (IRT) model is presented which allows for differences between the distributions of item parameters of families of item clones. A marginal maximum likelihood and a Bayesian procedure for estimating the hyperparameters are presented. In addition, an item-selection procedure for computerized adaptive testing with item cloning is presented which has the following two stages: First, a family of item clones is selected to be optimal at the estimate of the person parameter. Second, an item is randomly selected from the family for administration. Results from simulation studies based on an item pool from the Law School Admission Test (LSAT) illustrate the accuracy of these item pool calibration and adaptive testing procedures. Index terms: computerized adaptive testing, item cloning, multilevel item response theory, marginal maximum likelihood, Bayesian item selection.

Introduction

A major impediment to cost-effective implementation of computerized adaptive testing (CAT)


is the amount of resources needed for item pool development. One of the solutions to this problem
currently pursued is the application of techniques of item cloning to generate item pools. Early
pioneers of this idea were Bormuth (1970), Hively, Patterson and Page (1968) and Osburn (1968).
Common to their approaches is a formal description of a set of “parent items” along with algorithms
to derive families of clones from them. These parents are also known as “item forms,” “item
templates,” or “item shells.”
One type of parent item consists of a syntactic description of a test item with one or more open places for which substitution sets are specified. For this type, item cloning becomes a “replacement set procedure” (Millman & Westman, 1989), which can easily be implemented using a computer algorithm. Examples of replacement set procedures are algorithms that pick distractors randomly from a list of possible wrong answers or substitute random elements in open places in the stem
of the item and adjust the alternatives accordingly. Other types of parent items consist of intact
items which are cloned using transformation rules. Examples of such rules are linguistic rules that
transform one verbal item into others, geometric rules that present objects from a different angle
in spatial ability tests, chemical rules that derive molecular structure from a given structure in tests
of organic chemistry, or rules from proposition logic that transform items in analytic reasoning
tests into a set of new items. Comprehensive reviews of item-cloning techniques are given in Bejar
(1993) and Roid and Haladyna (1982).

Applied Psychological Measurement, Vol. 27, No. 4, July 2003, 247–261
DOI: 10.1177/0146621603254291
© 2003 Sage Publications


An important question is whether items in a family cloned from the same parent have comparable
statistical characteristics. Empirical studies addressing this question are reported, for example,
in Enright, Morley, and Sheehan (2002), Hively, Patterson and Page (1968), Macready (1983),
Macready and Merwin (1973) and Meisner, Luecht and Reckase (1993). The general impression
from these studies is that the variability within families of items cloned from the same parent is
much smaller than between families but not small enough to justify the assumption of identical
parameter values for items in the same family.
The current article is based on the expectation that though item cloning techniques are still
improving, some degree of within-family variability between item parameters always will remain.
The best way to deal with this phenomenon is therefore not to ignore it but to model the differences
between item parameters within families and allow for those differences in item-selection procedures
for adaptive testing.
An item-selection procedure for adaptive testing that fits in with this approach is a stratified or two-stage procedure in which each item is selected in the following two steps: (1) a family
of items is selected from the pool that is optimal at the current estimate of the person parameter;
(2) an item is randomly sampled from the family and administered. This procedure still capitalizes
on the statistical efficiency involved in adapting the test to the person parameter. In addition, it
allows us to model differences between item parameters within families as random.
In the first stage of the procedure, when a family is selected to be optimal at the estimate of
the person parameter, we have to deal with a distribution rather than individual values for the item
parameters. An obvious solution is to base the selection on a Bayesian criterion, for example,
one that maximizes the expected reduction in the posterior variance of the person parameter or a
posterior weighted version of Fisher’s information in the items, where the expectation is taken not
only over the posterior distribution of the person parameter, but also over the distribution of the
item parameters in the family.
The proposed item-selection procedure leads naturally to a two-level item response theory (IRT)
model, with a lower level at which items in families are represented by a three-parameter logistic
(3PL) model and a higher level at which the parameters of items in the same family have a (joint)
distribution that represents within-family variability.
The result of using such a model with random item parameters and an adaptive test with an
additional random component in item selection is an expected reduction in the accuracy of the
estimation of item and person parameters. This reduction should be evaluated against two alternative
cases. One case is to maintain item cloning but calibrate the item pool and administer the adaptive
tests under the regular 3PL model in (1). In this approach, the family structure in the item pool is
ignored and the application of the 3PL model, though convenient, is incorrect due to dependencies
between the items in the pool. The other case is to give up item cloning and calibrate the individual
items in the pool and administer the adaptive tests under the regular 3PL model. In this approach
the advantages of item cloning are no longer present. A simulation study in this article evaluates
the reduction in estimation accuracy against these two cases.

The Model

Consider an item pool with families of items generated from parent p = 1, ..., P . Items within
family p will be labeled ip = 1, ..., Ip .


The first-level model is the 3PL model, which describes the probability of success on item ip as
a function of the latent trait parameter θ as
$$p_{i_p}(\theta) = \Pr\{X_{i_p} = 1; \theta\} = c_{i_p} + (1 - c_{i_p})\, \frac{\exp[a_{i_p}(\theta - b_{i_p})]}{1 + \exp[a_{i_p}(\theta - b_{i_p})]}, \qquad (1)$$

where Xip is the response variable for item ip , with Xip = 1 for a correct and Xip = 0 for an
incorrect response. The values of the item parameters (aip , bip , cip ) are considered as realizations
of a random vector. The second-level model describes the distribution of this vector. The item
parameter vector is transformed as

$$\xi_{i_p} = (\log a_{i_p}, b_{i_p}, \operatorname{logit} c_{i_p}). \qquad (2)$$

It is assumed that ξip has a multivariate normal distribution

$$\xi_{i_p} \sim N(\mu_p, \Sigma_p), \qquad (3)$$

where hyperparameters µp and Σp are the vector with the mean values of the item parameters in
family p and their covariance matrix, respectively. The transformation in (2) removes the restriction
of range for the ai and ci parameters in the usual metric so that the assumption of multivariate
normality in (3) can hold.
In the calibration and item-selection procedures below, we will assume that θ has a standard
normal prior distribution, that is,

$$\theta \sim N(0, 1). \qquad (4)$$

This assumption holds if θ is from a population of exchangeable persons with a normal distribution
of θ .
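To make the two-level structure in (1)-(3) concrete, the following sketch draws the parameters of one item clone on the transformed scale of (2), maps them back to the usual metric, and evaluates the success probability in (1). The hyperparameter values are illustrative and not taken from the article.

```python
# Minimal sketch of the two-level model in (1)-(3): clone parameters are drawn
# on the transformed scale of (2) and mapped back before the 3PL probability
# in (1) is evaluated. The hyperparameter values below are hypothetical.
import numpy as np

rng = np.random.default_rng(1)

mu_p = np.array([-0.3, 0.2, -1.7])       # family mean of (log a, b, logit c)
Sigma_p = np.diag([0.05, 0.20, 0.10])    # within-family covariance matrix

def sample_clone(mu, Sigma, rng):
    """Draw one item clone: xi = (log a, b, logit c) ~ N(mu, Sigma), as in (3)."""
    log_a, b, logit_c = rng.multivariate_normal(mu, Sigma)
    return np.exp(log_a), b, 1.0 / (1.0 + np.exp(-logit_c))

def p_correct(theta, a, b, c):
    """3PL success probability in (1)."""
    return c + (1.0 - c) / (1.0 + np.exp(-a * (theta - b)))

a, b, c = sample_clone(mu_p, Sigma_p, rng)
print(p_correct(0.0, a, b, c))
```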

Discussion
The model presented in (1)-(4) has several relatives. The multilevel IRT models for testlets in
Bradlow, Wainer, and Wang (1999) and Wainer, Bradlow, and Du (2000) differ from the present model in that they have a random component for the difficulty parameter b but fixed parameters a and c. The random component was introduced to allow for dependences between responses to
fixed items in the same testlet. In our approach, the items are randomly sampled from families, so
all item parameters need to be random and the dependence between responses to items from the
same family is captured by the covariance matrix in (3). The present model also differs from the
one in Albers, Does, Imbos, and Janssen (1989), which is based on item sampling as well. Their
model is a version of the one-parameter normal-ogive model with a random difficulty parameter,
and the sampling procedure is simple random sampling from a pool of items and not the two-stage
procedure from a pool with families of items in this article. Janssen, Tuerlinckx, Meulders, and de
Boeck (2000) present a multilevel version of the two-parameter normal-ogive model in which no
item sampling is assumed but the second level is introduced to describe dependencies within fixed
sets of items used in a standard setting procedure.
Though in each of these models, a (multivariate) normal distribution for the parameters is
assumed, we should not claim universal validity for the normal distribution as second-level model.
The multivariate model in (3) is expected to capture differences in location and spread between
families in an adequate way and to be robust against small deviations from normality. However,
ultimately the choice of a model for the distribution of the item parameters within families is an
empirical issue. The use of item-cloning techniques is a new development, and practical experience still has to be accumulated. For example, it is not yet known what the impact of a possible review of families of cloned items by content specialists will be. In principle, strong preferences for certain item
attributes by these specialists may change the initial results from purely algorithmic item cloning.
The model in (1)-(3) has some flexibility to deal with empirical distributions of item parameters
that deviate from normality. For example, if families of items appear to have distributions with too
much skew, (2) could be replaced by a transformation that normalizes the distribution. The current
transformation was only introduced to remove the restrictions on the range of the parameters, but in
fact a large set of alternative (monotone) transformations is possible. Also, it is not necessary to use
an identical transformation for all families of items. If for some reason item parameter distributions
appear to be bimodal, mixtures of two multivariate normal distributions could be adopted instead
of (3). The technical complexities involved in this change do not seem to be too large (though
the conditions under which the model remains identifiable deserve care). These and other options
have not been explored yet. The current model is only a first attempt to deal with the statistical
consequences of item cloning.

Item Pool Calibration

For the adaptive testing procedure proposed in this article, item pool calibration amounts to
estimation of the values of the hyperparameters µp and Σp in the distribution in (3) for each family
in the pool. These values can be estimated by marginal maximum likelihood (MML), Bayes modal
(MAP) or fully Bayesian methods. The first two methods are discussed here; for a fully Bayesian
estimation procedure, see Glas and van der Linden (2001).
It is assumed that the hyperparameters for all families are stacked in a vector $\eta$, so this vector contains the elements of the mean vectors $\mu_p$ and the diagonal and lower off-diagonal elements of $\Sigma_p$, for $p = 1, ..., P$. In addition, the response vector of person $j$ is denoted as $\mathbf{x}_j = (x_{i_p j}) = (x_{i_1 j}, ..., x_{i_P j})$, where $i_p$ represents an item clone randomly drawn from family $p$. For each person $j$, the vector $\mathbf{x}_j$ contains the responses to one item sampled from each family. The set of response vectors across all persons constitutes the data matrix used in item pool calibration. Because for each person the responses to the other items from a family are missing at random, they can be ignored. To avoid unnecessary complexity, our notation will not make this incompleteness in the data set explicit.
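The following sketch illustrates this data layout: every simulated person answers one randomly drawn clone per family, which yields the incomplete but ignorable data set described above. The pool format, the function name, and the use of known clone parameters are illustrative assumptions.

```python
# Illustrative sketch of the calibration data layout: each person answers one
# randomly sampled clone from every family, so the data matrix holds one
# response per family together with the index of the clone administered.
import numpy as np

def simulate_calibration_data(pool, thetas, rng):
    """pool: list of (mu_p, clones), with clones an (I_p, 3) array of
    (log a, b, logit c) values on the scale of (2); thetas: array of abilities."""
    n_persons, n_families = len(thetas), len(pool)
    responses = np.zeros((n_persons, n_families), dtype=int)
    administered = np.zeros((n_persons, n_families), dtype=int)
    for j, theta in enumerate(thetas):
        for p, (_, clones) in enumerate(pool):
            i = rng.integers(len(clones))              # sample one clone from family p
            log_a, b, logit_c = clones[i]
            a, c = np.exp(log_a), 1.0 / (1.0 + np.exp(-logit_c))
            prob = c + (1.0 - c) / (1.0 + np.exp(-a * (theta - b)))
            responses[j, p] = rng.random() < prob      # 3PL response, as in (1)
            administered[j, p] = i
    return responses, administered
```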

MML Calibration
In MML estimation, a distinction is made between structural and nuisance parameters.
The structural parameters are estimated from a log-likelihood marginalized with respect to the
nuisance parameters. In the present case, the structural parameters are in the vector η, whereas
the nuisance parameters are the ability parameters θj and the random item parameters ξip . These
nuisance parameters are supposed to be stacked in vectors θ and ξ, respectively.
The marginal probability of observing response pattern $\mathbf{x}_j$ is given by
$$
\begin{aligned}
p(\mathbf{x}_j; \eta) &= \int \!\cdots\! \int p(\mathbf{x}_j \mid \theta, \xi)\, p(\xi, \theta \mid \eta)\, d\xi\, d\theta \\
&= \int \!\cdots\! \int \prod_p p(x_{i_p j} \mid \theta, \xi_{i_p})\, p(\xi_{i_p} \mid \mu_p, \Sigma_p)\, \phi(\theta)\, d\xi_{i_p}\, d\theta \\
&= \int \prod_p \left[ \int p(x_{i_p j} \mid \theta, \xi_{i_p})\, p(\xi_{i_p} \mid \mu_p, \Sigma_p)\, d\xi_{i_p} \right] \phi(\theta)\, d\theta, \qquad (5)
\end{aligned}
$$


where $p(\mathbf{x}_j \mid \theta, \xi)$ is the probability of the response pattern, which factors into the probabilities of the item responses as $\prod_p p(x_{i_p j} \mid \theta, \xi_{i_p})$, and $p(\xi_{i_p} \mid \mu_p, \Sigma_p)$ and $\phi(\theta)$ are the normal densities of the item parameters and $\theta$, respectively. Note that $\xi_{i_p}$ is a random effect nested within persons; alternative approaches to this assumption are discussed in Glas and van der Linden (2001). The marginal log-likelihood of $\eta$ is given by

$$\log L(\eta; \mathbf{x}) = \sum_j \log p(\mathbf{x}_j; \eta). \qquad (6)$$

The marginal likelihood equations for η can easily be derived using Fisher’s identity (Efron, 1977; Louis, 1982), which equates the first-order derivative of the marginal likelihood in (6) with respect to η to the expected first-order derivative of a so-called “complete data” log-likelihood. That is, the
likelihood equations are given by
$$\frac{\partial}{\partial \eta} \log L(\eta; \mathbf{x}) = \sum_j E\left[ \frac{\partial}{\partial \eta} \log p_j(\xi, \theta, \mathbf{x}_j \mid \eta) \,\Big|\, \mathbf{x}_j, \eta \right] = 0. \qquad (7)$$

In (7), $\log p_j(\xi, \theta, \mathbf{x}_j \mid \eta)$ is the complete data log-likelihood for person $j$, which is equal to
$$\log p_j(\xi_{i_p}, \theta, \mathbf{x}_j \mid \eta) = \sum_p \log p(x_{i_p j} \mid \theta, \xi_{i_p}) + \sum_p \log p(\xi_{i_p} \mid \eta) + \log \phi(\theta),$$
and the expectation is with respect to the conditional posterior density for the nuisance parameters
$$p(\xi_{i_p}, \theta \mid \mathbf{x}_j, \eta) \propto \prod_p p(x_{i_p j} \mid \theta, \xi_{i_p})\, p(\xi_{i_p} \mid \mu_p, \Sigma_p)\, \phi(\theta).$$

It follows that the likelihood equations are given by
$$\mu_{pu} = \frac{1}{n_p} \sum_j E(\xi_{pu} \mid \mathbf{x}_j, \eta), \qquad (8)$$
$$\sigma^2_{pu} = \frac{1}{n_p} \sum_j E(\xi^2_{pu} \mid \mathbf{x}_j, \eta) - \mu^2_{pu}, \qquad (9)$$
and
$$\sigma_{puv} = \frac{1}{n_p} \sum_j E(\xi_{pu}\, \xi_{pv} \mid \mathbf{x}_j, \eta) - \mu_{pu}\, \mu_{pv}, \qquad (10)$$
where the indices $u$ and $v \neq u$ denote the $u$th and $v$th elements of the parameter vectors and $n_p$ is the number of responses to family $p$. These equations can be solved using an EM or Newton-Raphson algorithm.
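As an illustration of the structure of these equations, the sketch below carries out the M-step updates implied by (8)-(10) for one family. The posterior expectations from the E-step, which require numerical integration over θ and the item parameters, are assumed to have been computed elsewhere; the function name and array layout are illustrative.

```python
# Sketch of the M-step implied by the likelihood equations (8)-(10): given the
# per-person posterior expectations E(xi_p | x_j, eta) and E(xi_p xi_p^t | x_j, eta)
# for family p (the E-step, obtained elsewhere), the updated hyperparameters are
# the averaged first and second posterior moments.
import numpy as np

def m_step(post_means, post_second_moments):
    """post_means: (n_p, 3) array of E(xi_p | x_j, eta), one row per respondent.
    post_second_moments: (n_p, 3, 3) array of E(xi_p xi_p^t | x_j, eta)."""
    mu_p = post_means.mean(axis=0)                    # equation (8)
    second = post_second_moments.mean(axis=0)
    Sigma_p = second - np.outer(mu_p, mu_p)           # equations (9)-(10)
    return mu_p, Sigma_p
```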
Computation of the standard errors of the parameter estimates is a straightforward generalization of the method for the 3PL model presented in Glas (2000). These estimates are found upon inverting the approximate information matrix
$$\sum_j E\left[ \frac{\partial}{\partial \eta} \log p_j(\xi, \theta, \mathbf{x}_j \mid \eta) \,\Big|\, \mathbf{x}_j, \eta \right] E\left[ \frac{\partial}{\partial \eta} \log p_j(\xi, \theta, \mathbf{x}_j \mid \eta) \,\Big|\, \mathbf{x}_j, \eta \right]^t.$$
Note that the information matrix is a sum over persons of outer products of a vector of first-order derivatives and its transpose.


Bayes Modal Calibration


For the regular 3PL model, the use of Bayes modal estimation can be motivated by the fact
that the item parameters in the model are sometimes hard to estimate because the model is poorly
identified. In such instances, the values of the θ-parameters are predominantly in a region of the
θ -scale for which the response functions are equally well approximated by different combinations
of item parameter values. As a result, the estimates are highly correlated. In such cases, adding
a covariance matrix for every family of items may further deteriorate the identifiability of the
model.
To obtain improved estimates for the 3PL model, Mislevy (1986) considered a number of
Bayesian approaches, each of which entails a prior distribution for the item parameters. In one
approach, the prior distribution is assumed to be postulated by the item calibrator and its parameters
are thus known. In another, often labeled empirical Bayes, the parameters of the prior distribution
are estimated along with the other parameters, for example, as the modes of their posterior
distributions.
The problem of estimating the hyperparameters in the model in (1)-(4) is formally identical to
the one of estimating the parameters in the prior distribution of the item parameters in an empirical
Bayes approach to the regular 3PL model. The only difference is that these parameters now have
to be estimated for multiple families of items simultaneously.
In Bayes modal (or maximum a posteriori; MAP) estimation of the 3PL model, the estimates are computed by maximizing the log-posterior density of $\eta$, which is proportional to
$$\log L(\eta; \mathbf{x}) + \log p(\eta; \zeta) + \log p(\zeta), \qquad (11)$$
where $p(\eta; \zeta)$ is the prior density of $\eta$ with parameters $\zeta$, which follow a density $p(\zeta)$. The approach involves a replacement of the likelihood equation in (7) by
$$\partial \log L(\eta; \mathbf{x}) / \partial \eta + \partial \log p(\eta; \zeta) / \partial \eta = 0. \qquad (12)$$
If the prior distribution $p(\eta; \zeta)$ is not postulated, that is, in an empirical Bayes approach, the equation
$$\partial \log p(\eta; \zeta) / \partial \zeta + \partial \log p(\zeta) / \partial \zeta = 0 \qquad (13)$$
must be solved simultaneously.
For the multilevel model in (1)-(3), Bayes modal estimation entails the introduction of a prior distribution for the hyperparameters $\mu_p$ and $\Sigma_p$ in (1)-(4). Let $p(\mu_p, \Sigma_p \mid \omega)$ denote the (common) prior density for these parameters, which itself has a parameter vector $\omega$. The marginal probability of response vector $\mathbf{x}_j$ now becomes
$$p(\mathbf{x}_j; \eta) = \int \!\cdots\! \int \prod_p p(x_{i_p j} \mid \theta, \xi_{i_p})\, p(\xi_{i_p} \mid \mu_p, \Sigma_p)\, p(\mu_p, \Sigma_p \mid \omega)\, \phi(\theta)\, d\xi_{i_p}\, d\theta. \qquad (14)$$

The complete data specification has factors
$$p(\xi_{i_p} \mid \mu_p, \Sigma_p)\, p(\mu_p, \Sigma_p \mid \omega),$$
which suggest a normal model with a normal-inverse-Wishart prior with parameter $\omega = (\mu_0, \Sigma_0)$. A prior from the normal-inverse-Wishart family is attractive because it is the conjugate prior for the multivariate normal distribution (see, for instance, Box & Tiao, 1973).
Let $v_0$ be the degrees of freedom for the inverse-Wishart prior of $\Sigma_0$, $\kappa_0$ the number of observations to which the normal prior for $\mu_p$ can be equated, and let $n_p$ be the number of items administered from family $p$. The posterior distribution of $\eta$ is also normal-inverse-Wishart distributed, with parameters
$$
\begin{aligned}
v &= v_0 + n_p, \\
\kappa &= \kappa_0 + n_p, \\
\mu_p &= \frac{n_p}{\kappa_0 + n_p}\, \bar{\xi}_p + \frac{\kappa_0}{\kappa_0 + n_p}\, \mu_0, \\
v\Sigma_p &= (n_p - 1) S_p + \frac{\kappa_0 n_p}{\kappa_0 + n_p}\, (\bar{\xi}_p - \mu_0)(\bar{\xi}_p - \mu_0)^t + v_0 \Sigma_0,
\end{aligned}
$$
where $\bar{\xi}_p = (1/n_p) \sum_i \xi_{i_p}$ and $S_p = 1/(n_p - 1) \sum_i (\xi_{i_p} - \bar{\xi}_p)(\xi_{i_p} - \bar{\xi}_p)^t$, for $p = 1, ..., P$.
As can be verified from (7), the likelihood equations are the posterior expectations of the first-order derivatives of the complete data likelihood. Analogous to (8)-(10), we now have
$$\mu_p = \frac{n_p}{\kappa_0 + n_p}\, \hat{\xi}_p + \frac{\kappa_0}{\kappa_0 + n_p}\, \mu_0, \qquad (15)$$
and
$$v\Sigma_p = \sum_j E\left[ (\xi_p - \hat{\xi}_p)(\xi_p - \hat{\xi}_p)^t \mid \mathbf{x}_j, \eta \right] + \frac{\kappa_0 n_p}{\kappa_0 + n_p}\, (\hat{\xi}_p - \mu_0)(\hat{\xi}_p - \mu_0)^t + v_0 \Sigma_0, \qquad (16)$$
with
$$\hat{\xi}_p = \frac{1}{n_p} \sum_j E(\xi_p \mid \mathbf{x}_j, \eta).$$
Comparing the MML estimates given by (8) with the Bayes modal estimates given by (15), it becomes clear that (15) is a so-called shrinkage estimator: it is a weighted average of the mean as obtained from the likelihood of the relevant observations and the mean imposed via the prior.
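To illustrate the shrinkage character of these updates, the sketch below computes (15)-(16) for a single family. The posterior expectations it takes as input are assumed to come from an E-step that is not shown, and the function name is illustrative.

```python
# Sketch of the Bayes modal (shrinkage) updates in (15)-(16). The prior
# parameters (mu_0, Sigma_0, kappa_0, v_0) are those of the normal-inverse-
# Wishart prior; the posterior expectations are assumed to be supplied.
import numpy as np

def bayes_modal_update(xi_hat_p, scatter_p, n_p, mu_0, Sigma_0, kappa_0, v_0):
    """xi_hat_p: average of E(xi_p | x_j, eta) over the n_p respondents.
    scatter_p: sum over respondents of E[(xi_p - xi_hat_p)(xi_p - xi_hat_p)^t | x_j, eta]."""
    # Equation (15): weighted average of the data mean and the prior mean.
    mu_p = (n_p * xi_hat_p + kappa_0 * mu_0) / (kappa_0 + n_p)
    # Equation (16): posterior scale divided by the degrees of freedom v = v_0 + n_p.
    diff = xi_hat_p - mu_0
    v = v_0 + n_p
    Sigma_p = (scatter_p
               + (kappa_0 * n_p / (kappa_0 + n_p)) * np.outer(diff, diff)
               + v_0 * Sigma_0) / v
    return mu_p, Sigma_p
```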

Discussion
The assumption of all respondents randomly drawn from one population in (4) can be replaced
by the assumption of sampling from multiple populations of respondents each with a normal ability
distribution indexed by a unique mean and variance parameter. Bock and Zimowski (1997) point
out that this generalization, together with the possibility of analyzing incomplete item-calibration
designs, provides a unified approach to such problems as differential item functioning, item parameter drift, non-equivalent groups equating, vertical equating, and matrix-sampled educational
assessment. Though not illustrated here, calibration under the multilevel model for multiple item
families in this article can be extended to fit this framework.

Adaptive Selection of Families of Items

Our initial estimate of the ability of examinee j is the prior distribution in (4), which has a
density denoted as φ(θj ). Suppose the kth family is selected to deliver the next item for person j .
The responses of $j$ to the items from the $k - 1$ previously selected families are denoted by a vector $\mathbf{x}_j^{(k-1)} = (x_{j1}, ..., x_{j(k-1)})$. The update of the posterior distribution of $\theta_j$ after these $k - 1$ items is given by
$$p(\theta_j \mid \mathbf{x}_j^{(k-1)}) \propto \phi(\theta_j) \prod_{p=1}^{k-1} \int p(x_{jp} \mid \theta_j, \xi_p)\, p(\xi_p \mid \mu_p, \Sigma_p)\, d\xi_p. \qquad (17)$$
The kth family is selected to be optimal at this posterior distribution.
Several Bayesian criteria of optimality for adaptive testing have been proposed; for studies of
several old and new criteria, see van der Linden (1998) and van der Linden and Pashley (2000). The
one used in the simulation study below is a version of the criterion of minimum expected posterior
variance adapted for use with the two-stage item-selection procedure. The criterion requires the
family selected to have minimum expected posterior variance, where the expectation is taken over
the posterior predictive distribution of the responses to a random item from the family.
Suppose we select family $p$ as the $k$th family in the test. The posterior predictive distribution of the responses of examinee $j$ to a random item from this family has the following probability function
$$p(x_{jp_k} \mid \mathbf{x}_j^{(k-1)}) = \int \left[ \int p(x_{jp_k} \mid \theta_j, \xi_{p_k})\, p(\xi_{p_k} \mid \mu_{p_k}, \Sigma_{p_k})\, d\xi_{p_k} \right] p(\theta_j \mid \mathbf{x}_j^{(k-1)})\, d\theta_j. \qquad (18)$$
Note that we first average the response probability over the distribution of the item parameters for family $p_k$ and then over the posterior distribution of the ability of the examinee.
The two possible responses for which this function provides the predictive probabilities are $X_{jp_k} = 0$ and $X_{jp_k} = 1$. Either response would lead to an update of the posterior variance of $\theta_j$, which we denote as $\mathrm{Var}(\theta_j \mid \mathbf{x}_j^{(k-1)}, X_{jp_k} = 0)$ and $\mathrm{Var}(\theta_j \mid \mathbf{x}_j^{(k-1)}, X_{jp_k} = 1)$, respectively. The first proposed criterion for the selection of the $k$th parent is the expected value of this update. That is,
$$p_k = \arg\min_r \Big\{ \mathrm{Var}(\theta_j \mid \mathbf{x}_j^{(k-1)}, X_{jr_k} = 0)\, p(0 \mid \mathbf{x}_j^{(k-1)}) + \mathrm{Var}(\theta_j \mid \mathbf{x}_j^{(k-1)}, X_{jr_k} = 1)\, p(1 \mid \mathbf{x}_j^{(k-1)}); \; r \in R_k \Big\}, \qquad (19)$$
where $R_k$ is the set of families in the pool from which the $k$th family is chosen.
If the interest is in a criterion based on Fisher’s information measure, an alternative to (19) can be derived from the posterior weighted information criterion (van der Linden & Pashley, 2000, Eq. 25). Fisher’s information on $\theta_j$ in a random item from family $p$ is defined as
$$I_p(\theta_j) = -\int \frac{\partial^2}{\partial \theta_j^2} \ln L(\theta_j; x_{jp}, \xi_p)\, p(\xi_p \mid \mu_p, \Sigma_p)\, d\xi_p, \qquad (20)$$
and the posterior weighted information criterion selects as the $k$th family
$$p_k = \arg\max_r \left\{ \int I_r(\theta_j)\, p(\theta_j \mid \mathbf{x}_j^{(k-1)})\, d\theta_j; \; r \in R_k \right\}. \qquad (21)$$
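As an illustration of the two-stage selection step, the sketch below carries the posterior in (17) on a grid of θ-values, approximates the integrals over the item parameters in (18) by Monte Carlo draws from (3), and returns the family that minimizes the criterion in (19); the criterion in (21) could be implemented along the same lines. The grid, the number of draws, and all names are illustrative choices rather than specifications from the article.

```python
# Sketch of family selection by minimum expected posterior variance, (17)-(19).
import numpy as np

GRID = np.linspace(-4, 4, 81)                     # quadrature grid for theta

def family_prob(mu_p, Sigma_p, rng, n_draws=200):
    """P(X = 1 | theta) on GRID, averaged over the clone distribution in (3)."""
    xi = rng.multivariate_normal(mu_p, Sigma_p, size=n_draws)
    a, b, c = np.exp(xi[:, 0]), xi[:, 1], 1.0 / (1.0 + np.exp(-xi[:, 2]))
    return (c + (1.0 - c) / (1.0 + np.exp(-a * (GRID[:, None] - b)))).mean(axis=1)

def posterior_variance(post):
    """Variance of theta under a normalized posterior on GRID."""
    mean = np.sum(GRID * post)
    return np.sum((GRID - mean) ** 2 * post)

def select_family(post, families, rng):
    """Return the index of the family minimizing the criterion in (19).
    families: list of (mu_p, Sigma_p); post: current posterior (17) on GRID."""
    best, best_value = None, np.inf
    for r, (mu_p, Sigma_p) in enumerate(families):
        p = family_prob(mu_p, Sigma_p, rng)       # marginal success probability
        p1 = np.sum(p * post)                     # posterior predictive of X = 1, as in (18)
        p0 = 1.0 - p1
        post1 = p * post / p1                     # updated posterior after X = 1
        post0 = (1.0 - p) * post / p0             # updated posterior after X = 0
        value = p0 * posterior_variance(post0) + p1 * posterior_variance(post1)
        if value < best_value:
            best, best_value = r, value
    return best

# Example: standard normal prior (4) and five hypothetical families differing in difficulty.
rng = np.random.default_rng(0)
prior = np.exp(-GRID ** 2 / 2)
prior /= prior.sum()
families = [(np.array([-0.3, b, -1.7]), np.diag([0.05, 0.2, 0.1]))
            for b in np.linspace(-2.0, 2.0, 5)]
print(select_family(prior, families, rng))
```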

Simulation Studies

Three different cases were studied, namely,


(1) families of cloned items calibrated and administered under the multilevel IRT model in
(1)-(3);
(2) families of cloned items calibrated and administered under the regular 3PL model in (1); and
(3) individual items calibrated and administered under the regular 3PL model in (1).


Two different studies were conducted. The first study was to assess the accuracy of item pool calibration in these three cases; the second study, to assess the accuracy of adaptive testing in these cases. The comparison between Cases 1 and 2 is made to identify the consequences of ignoring the dependencies due to the family structure in the item pool; the regular 3PL model does not allow for such dependences, whereas the multilevel IRT model does. The comparison between Case 1 and Case 3 is made to identify the potential loss in statistical accuracy due to the random nature of the item parameters in the multilevel IRT model and item sampling in the second stage of item selection.

Item Pools
The family structures of the item pools used in these studies were all derived from a pool from the Law School Admission Test (LSAT). The pool had 753 items that fitted the model in (1). The parameters of these items were transformed using (2), and these transformed parameters had mean vector
$$\mu = (-.309, .189, -1.723) \qquad (22)$$
and covariance matrix
$$\Sigma = \begin{pmatrix} .102 & .054 & .065 \\ .054 & 1.318 & .180 \\ .065 & .180 & .499 \end{pmatrix}. \qquad (23)$$
Item parameters for four different pools were derived from these LSAT data. All parent item parameter values $\mu_p$ were sampled from the multivariate normal distribution with the mean and covariance structure in (22)-(23). Then, given the sampled means $\mu_p$, parameters for the clones within family $p$ were sampled from a multivariate normal distribution with mean $\mu_p$ and a covariance matrix $\Sigma_p$ with entries proportional to the entries of $\Sigma$ in (23). That is, all true values for the item clones were sampled subject to a fixed ratio of within-family and between-family covariances. For Pool 1, the within-family and between-family covariances were of the same size. For Pools 2, 3, and 4, the sizes of the within-covariances were 50%, 25%, and 12.5% of the between-covariances, respectively.
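A sketch of this generation scheme is given below. The number of families, the number of clones per family, and the random seed are illustrative, since they vary across the conditions of the studies.

```python
# Sketch of the item pool generation described above: parent means are drawn
# from the LSAT distribution in (22)-(23), and clone parameters are then drawn
# around each parent with a within-family covariance equal to a fixed fraction
# of the between-family covariance (1.0, 0.5, 0.25, and 0.125 for Pools 1-4).
import numpy as np

mu = np.array([-0.309, 0.189, -1.723])                     # equation (22)
Sigma = np.array([[0.102, 0.054, 0.065],
                  [0.054, 1.318, 0.180],
                  [0.065, 0.180, 0.499]])                  # equation (23)

def generate_pool(n_families, clones_per_family, ratio, rng):
    """Return a list of (parent mean, clone parameter array) pairs on the scale of (2)."""
    pool = []
    for _ in range(n_families):
        mu_p = rng.multivariate_normal(mu, Sigma)          # parent parameters
        clones = rng.multivariate_normal(mu_p, ratio * Sigma, size=clones_per_family)
        pool.append((mu_p, clones))
    return pool

rng = np.random.default_rng(42)
pool_2 = generate_pool(n_families=400, clones_per_family=20, ratio=0.5, rng=rng)
```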

Item Pool Calibration


In this study, the following additional variables were manipulated:
(1) test length: n=20 and 40 items; and
(2) sample size: N =100, 400 and 1,000 examinees.

For each condition, 100 replications were made. For every replication, N examinees were
simulated with θ values randomly drawn from the standard normal distribution, as in (4). For each
examinee, one item per family was sampled and responses to the items were generated. These
response data were then used to calibrate the item pool. The parameter estimation method was the
empirical Bayes modal method. For Case 1, the prior distribution for the hyperparameters was a
(low-informative) normal-inverse-Wishart distribution with υ0 = 5 and κ0 = 5. For Case 2 and 3,
the prior distributions of the (transformed) item parameters in (2) were taken to be normal with the
means and variances in (22)-(23), but the covariances were set equal to zero because in these two
cases the regular 3PL model was used.
The mean absolute errors in the parameter estimates for Case 1 (data of cloned items analyzed in the multilevel model) and Case 2 (data of cloned items analyzed with the regular 3PL model)


Table 1
Effects on Calibration of Ignoring the Family Structure
Mean Absolute Error in Item Parameter Estimates

Case 2 Case 1 2–1

Fact n N a b c a b c a b c

1.000 20 100 0.268 0.451 0.088 0.243 0.450 0.083 0.024 0.000 0.005
400 0.209 0.396 0.092 0.204 0.389 0.087 0.005 0.007 0.005
1,000 0.179 0.374 0.093 0.177 0.356 0.081 0.002 0.019 0.012

40 100 0.245 0.452 0.086 0.230 0.446 0.083 0.015 0.005 0.003
400 0.193 0.394 0.090 0.184 0.384 0.087 0.009 0.010 0.003
1,000 0.169 0.366 0.089 0.160 0.327 0.078 0.009 0.040 0.011

0.500 20 100 0.264 0.455 0.087 0.248 0.457 0.086 0.016 –0.002 0.002
400 0.204 0.382 0.086 0.199 0.381 0.082 0.006 0.001 0.004
1,000 0.174 0.358 0.087 0.169 0.354 0.081 0.004 0.004 0.005

40 100 0.248 0.445 0.081 0.230 0.440 0.085 0.018 0.005 –0.004
400 0.193 0.373 0.086 0.182 0.371 0.080 0.010 0.002 0.006
1,000 0.156 0.344 0.084 0.154 0.334 0.079 0.001 0.010 0.005

0.250 20 100 0.265 0.456 0.084 0.255 0.454 0.084 0.010 0.002 0.000
400 0.207 0.385 0.084 0.198 0.378 0.081 0.009 0.007 0.002
1,000 0.173 0.353 0.083 0.170 0.342 0.080 0.003 0.010 0.002

40 100 0.246 0.448 0.083 0.234 0.443 0.082 0.012 0.005 0.001
400 0.188 0.365 0.084 0.186 0.370 0.083 0.003 –0.005 0.002
1,000 0.156 0.334 0.081 0.154 0.329 0.078 0.002 0.005 0.003

0.125 20 100 0.268 0.454 0.085 0.260 0.452 0.083 0.008 0.002 0.002
400 0.212 0.384 0.081 0.208 0.381 0.081 0.004 0.003 –0.001
1,000 0.171 0.350 0.081 0.168 0.348 0.079 0.004 0.001 0.002

40 100 0.247 0.449 0.083 0.243 0.448 0.082 0.004 0.002 0.001
400 0.191 0.372 0.082 0.189 0.370 0.082 0.002 0.002 0.000
1,000 0.156 0.331 0.079 0.154 0.327 0.077 0.002 0.004 0.002

are compared in Table 1. As expected, the table shows a decrease in the errors with an increasing
sample size (N ) and test length (n). The decrease is strongest for the a parameter and negligible
for the c parameter. Also, the decrease with the sample size appears to be stronger for the pools
with the smallest ratio of within-family to between-family variability. The differences between the
mean errors for Case 2 and 1, though nearly all in favor of Case 1, are negligibly small. For the
current item pools, misspecifying the model in the sense of ignoring dependences due to a family structure of the items hardly had any consequences for item pool calibration. We will return
to this conclusion below.
In Table 2, the same comparison is made for the mean absolute error in the estimates for Case
1 (data of cloned items analyzed in the multilevel model) and Case 3 (no family structure and


Table 2
Effects of Item Cloning on Calibration
Mean Absolute Error in Item Parameter Estimates

Case 1 Case 3 1–3

Fact n N a b c a b c a b c

1.000 20 100 0.243 0.450 0.083 0.239 0.450 0.081 0.005 0.000 0.002
400 0.204 0.389 0.087 0.196 0.388 0.082 0.007 0.001 0.005
1,000 0.177 0.356 0.081 0.174 0.345 0.080 0.003 0.011 0.001

40 100 0.230 0.446 0.083 0.216 0.445 0.082 0.014 0.002 0.001
400 0.184 0.384 0.087 0.183 0.382 0.080 0.001 0.002 0.006
1,000 0.160 0.327 0.078 0.156 0.326 0.076 0.004 0.000 0.002

0.500 20 100 0.248 0.457 0.086 0.241 0.457 0.083 0.007 –0.001 0.002
400 0.199 0.381 0.082 0.196 0.379 0.081 0.003 0.002 0.001
1,000 0.169 0.354 0.081 0.169 0.352 0.080 0.000 0.002 0.001

40 100 0.230 0.440 0.085 0.225 0.437 0.086 0.005 0.002 –0.001
400 0.182 0.371 0.080 0.179 0.371 0.079 0.004 0.000 0.000
1,000 0.154 0.334 0.079 0.154 0.325 0.076 0.000 0.009 0.003

0.250 20 100 0.255 0.454 0.084 0.252 0.453 0.084 0.003 0.001 0.000
400 0.198 0.378 0.081 0.196 0.377 0.081 0.003 0.001 0.001
1,000 0.170 0.342 0.080 0.166 0.342 0.079 0.004 0.001 0.001

40 100 0.234 0.443 0.082 0.232 0.442 0.082 0.002 0.001 0.000
400 0.186 0.370 0.083 0.184 0.373 0.081 0.001 –0.003 0.002
1,000 0.154 0.329 0.078 0.154 0.328 0.075 0.000 0.001 0.003

0.125 20 100 0.260 0.452 0.083 0.255 0.452 0.081 0.005 0.000 0.002
400 0.208 0.381 0.081 0.204 0.378 0.082 0.004 0.003 0.000
1,000 0.168 0.348 0.079 0.167 0.348 0.078 0.001 0.000 0.001

40 100 0.243 0.448 0.082 0.241 0.445 0.082 0.002 0.002 0.000
400 0.189 0.370 0.082 0.189 0.369 0.082 0.000 0.001 0.000
1,000 0.154 0.327 0.077 0.154 0.326 0.076 0.001 0.002 0.000

item parameters in the regular 3PL model). This table shows the same trends of decreasing errors with increasing sample sizes and test lengths for Case 3. In addition, the differences in mean errors between Cases 3 and 1 are negligibly small again; in fact, they were even smaller than in the previous case.
The mean absolute errors in the estimates of the (co)variances of the item parameters for the families of items for Case 1 are given in Table 3. The general impression from this table is that these hyperparameters can be estimated reasonably well. The errors in the estimates of log a and logit c are relatively large for the pools with the larger variability ratios. This effect is due to the large variances for these parameters in the original LSAT pool; see the covariance matrix in (23). As a consequence, a larger ratio implied larger within-family variability.


Table 3
Mean Absolute Error for Estimates of Item Covariance Matrix

Fact n N σ_log a σ_b σ_logit c σ_log a,b σ_log a,logit c σ_b,logit c

1.000 20 100 0.044 0.337 0.531 0.025 0.023 0.172


400 0.023 0.235 0.314 0.029 0.027 0.084
1,000 0.020 0.142 0.216 0.020 0.016 0.078

40 100 0.020 0.375 0.430 0.023 0.028 0.157


400 0.019 0.210 0.335 0.028 0.028 0.106
1,000 0.024 0.134 0.212 0.015 0.019 0.068

0.500 20 100 0.010 0.288 0.502 0.024 0.027 0.133


400 0.010 0.156 0.376 0.020 0.029 0.120
1,000 0.024 0.070 0.196 0.020 0.016 0.051

40 100 0.019 0.286 0.499 0.014 0.020 0.116


400 0.011 0.121 0.304 0.008 0.013 0.065
1,000 0.010 0.092 0.141 0.009 0.010 0.042

0.250 20 100 0.015 0.205 0.409 0.009 0.015 0.095


400 0.006 0.114 0.035 0.006 0.004 0.020
1,000 0.006 0.053 0.039 0.004 0.005 0.015

40 100 0.014 0.220 0.401 0.006 0.013 0.084


400 0.005 0.064 0.219 0.004 0.006 0.045
1,000 0.005 0.054 0.129 0.004 0.006 0.031

0.125 20 100 0.023 0.046 0.404 0.003 0.022 0.067


400 0.013 0.035 0.132 0.003 0.005 0.023
1,000 0.002 0.028 0.045 0.002 0.002 0.012

40 100 0.023 0.062 0.364 0.003 0.014 0.067


400 0.013 0.061 0.089 0.003 0.004 0.025
1,000 0.003 0.039 0.078 0.003 0.004 0.025

Adaptive Testing
In this study, comparisons were made between the errors in the estimators of θ in the adaptive
tests for the same three cases. The simulations were repeated for test lengths n = 20 and 40
items. For each θ value equal to −2.0, −1.0, 0.0, 1.0, and 2.0, 1,000 examinees were simulated.
In addition, to get estimates of the errors for a typical population of examinees, 1,000 examinees
were randomly sampled from N (0, 1).
The item pools for Case 1 and 2 consisted of 400 families of items. If a family was selected in
the adaptive test, only one item was randomly sampled from it. The item pool for Case 3 consisted
of 400 individual items.
In Case 1, the items were selected according to the criterion of minimum expected posterior
variance adapted for use with two-stage item-selection in (19). The final ability estimate was the
expected value of the posterior distribution (EAP estimate) in the multilevel IRT model in (17).


Table 4
Mean Absolute Error in Ability Estimates

Fact n Case of CAT –2.0 –1.0 0.0 1.0 2.0 Standard Normal

1.000 20 3 0.499 0.330 0.299 0.311 0.413 0.336


2 0.658 0.393 0.332 0.362 0.519 0.396
1 0.618 0.393 0.331 0.364 0.522 0.388

40 3 0.363 0.278 0.259 0.238 0.318 0.261


2 0.541 0.336 0.292 0.269 0.385 0.311
1 0.496 0.309 0.282 0.267 0.408 0.310

0.500 20 3 0.518 0.321 0.296 0.323 0.422 0.335


2 0.561 0.364 0.331 0.349 0.455 0.366
1 0.566 0.360 0.323 0.348 0.469 0.363

40 3 0.357 0.270 0.255 0.243 0.315 0.274


2 0.460 0.304 0.275 0.261 0.358 0.305
1 0.408 0.303 0.281 0.296 0.339 0.305

0.250 20 3 0.497 0.336 0.293 0.311 0.441 0.346


2 0.556 0.356 0.322 0.330 0.448 0.347
1 0.538 0.330 0.329 0.327 0.434 0.350

40 3 0.358 0.255 0.248 0.232 0.310 0.268


2 0.414 0.289 0.267 0.252 0.342 0.271
1 0.404 0.285 0.264 0.257 0.324 0.281

0.125 20 3 0.482 0.328 0.316 0.316 0.423 0.338


2 0.496 0.338 0.303 0.335 0.432 0.338
1 0.492 0.336 0.311 0.331 0.426 0.335

40 3 0.373 0.255 0.263 0.242 0.320 0.264


2 0.391 0.275 0.259 0.237 0.313 0.268
1 0.373 0.276 0.260 0.249 0.325 0.269

Note. CAT = computerized adaptive testing.

In Cases 2 and 3, the items were selected according to the original version of the criterion of minimum
expected posterior variance, whereas the final ability estimator was the EAP estimate under the
regular 3PL model. In all three cases, ability estimation started with the normal prior in (4).
The mean absolute errors in the ability estimates are shown in Table 4. Note that for all conditions
in the setup, the parameters are estimated more poorly for lower values of θ. That is, there is a clear
lack of symmetry in the mean absolute errors of the low values (θ = −2.0 and θ = −1.0) and
the high values (θ = 1.0 and θ = 2.0). The explanation is the loss of information at low θ-levels
due to guessing. That is, a guessed response contains little information about the latent trait level.
As expected, the errors showed a tendency to decrease with the length of the test. Also, adaptive
testing in Case 3 was superior to Cases 1 and 2 in all conditions.
The differences were larger both for the extreme θ-values in the study and for the larger variability ratios. Improved estimation of extreme values of θ is a result typical of CAT, whereas a smaller variability ratio implies a smaller effect of second-stage sampling of items from the families selected
in the test. Also, note that the results for Case 2 were generally worst. Ignoring the family structure
in the model did result in loss of accuracy in ability estimation.

Discussion

The results of the study of item pool calibration accuracy confirmed that it pays to model the
family structure in data from cloned items by a two-level IRT model with different parameter
distributions for each family. Essentially the same results were obtained in Bradlow, Wainer, and
Wang (1999, Table 1) for their Bayesian treatment of the 2PL model with a random component for
the b parameter. It is a statistical fact that ignoring the family structure of the items in the pool is a
case of model misspecification, which generally leads to bias in parameter estimation and hence to
an increase in the mean absolute estimation error. In the simulation studies, calibration that ignored the family structure did suffer from this type of bias, but the effects were very small. On the other hand, the addition of
a random component to the test as such did reduce the statistical accuracy of item calibration even
less. The effects of item cloning on CAT were more pronounced. So the precision of estimation on
the individual level is clearly affected by item cloning and ignoring item cloning in the statistical
model makes the effect even worse.
It is instructive to see how, for extreme relations between within-family and between-family variability in the item parameters, the two-stage item selection introduced in this article leads to the cases of (1) regular CAT from a pool of individual items and (2) classical domain-referenced testing (Lord & Novick, 1968, chap. 23). The procedure shares its first-stage selection of a family of items optimal at the θ-estimates with the former, but its second-stage random selection of an item from the family with the latter. If all variability in the pool is within the families, the procedure is
domain-referenced testing, whereas if all variability is between families, it is CAT from a pool of
individually calibrated items. The more efficient the item-cloning techniques are, the smaller the
amount of within-family item variability is and the better the test adapts to the examinee’s ability
level.

References

Albers, W., Does, R. J. M. M., Imbos, Tj., & Janssen, M. P. E. (1989). A stochastic growth model applied to repeated tests of academic knowledge. Psychometrika, 54, 451-466.
Bejar, I. I. (1993). A generative approach to psychological and educational measurement. In N. Frederiksen, R. J. Mislevy, & I. I. Bejar (Eds.), Test theory for a new generation of tests (pp. 323-357). Hillsdale, NJ: Lawrence Erlbaum.
Bock, R. D., & Zimowski, M. F. (1997). Multiple group IRT. In W. J. van der Linden & R. K. Hambleton (Eds.), Handbook of modern item response theory (pp. 433-448). New York: Springer-Verlag.
Bormuth, J. R. (1970). On the theory of achievement test items. Chicago, IL: University of Chicago Press.
Box, G., & Tiao, G. (1973). Bayesian inference in statistical analysis. Reading, MA: Addison-Wesley.
Bradlow, E. T., Wainer, H., & Wang, X. (1999). A Bayesian random effects model for testlets. Psychometrika, 64, 153-168.
Efron, B. (1977). Discussion on maximum likelihood from incomplete data via the EM algorithm (by A. P. Dempster, N. M. Laird and D. B. Rubin). Journal of the Royal Statistical Society (Series B), 39, 1-38.
Enright, M. K., Morley, M., & Sheehan, K. M. (2002). Items by design: The impact of systematic feature variation on item statistical characteristics. Applied Measurement in Education, 15, 49-74.
Glas, C. A. W. (2000). Item calibration and parameter drift. In W. J. van der Linden & C. A. W. Glas (Eds.), Computerized adaptive testing: Theory and practice (pp. 183-199). Norwell, MA: Kluwer Academic.
Glas, C. A. W., & van der Linden, W. J. (2001). Modeling variability in item parameters in item response models (Research Report 01-11). Enschede: University of Twente.
Hively, W., Patterson, H. L., & Page, S. H. (1968). A “universe-defined” system of arithmetic achievement items. Journal of Educational Measurement, 5, 275-290.
Janssen, R., Tuerlinckx, F., Meulders, M., & de Boeck, P. (2000). A hierarchical IRT model for criterion-referenced measurement. Journal of Educational and Behavioral Statistics, 25, 285-306.
Lord, F. M., & Novick, M. R. (1968). Statistical theories of mental test scores. Reading, MA: Addison-Wesley.
Louis, T. A. (1982). Finding the observed information matrix when using the EM algorithm. Journal of the Royal Statistical Society, Series B, 44, 226-233.
Macready, G. B. (1983). The use of generalizability theory for assessing relations among items within domains in diagnostic testing. Applied Psychological Measurement, 7, 149-157.
Macready, G. B., & Merwin, J. C. (1973). Homogeneity within item forms in domain referenced testing. Educational and Psychological Measurement, 33, 351-360.
Meisner, R., Luecht, R. M., & Reckase, M. D. (1993). The comparability of the statistical characteristics of test items generated by computer algorithms (ACT Research Report Series No. 93-9). Iowa City, IA: ACT, Inc.
Millman, J., & Westman, R. S. (1989). Computer-assisted writing of achievement test items: Toward a future technology. Journal of Educational Measurement, 26, 177-190.
Mislevy, R. J. (1986). Bayes modal estimation in item response models. Psychometrika, 51, 177-195.
Osburn, H. G. (1968). Item sampling for achievement testing. Educational and Psychological Measurement, 28, 95-104.
Roid, G., & Haladyna, T. (1982). A technology for test-item writing. New York: Academic Press.
van der Linden, W. J. (1998). Bayesian item-selection criteria for adaptive testing. Psychometrika, 63, 201-216.
van der Linden, W. J., & Pashley, P. J. (2000). Item selection and ability estimation in adaptive testing. In W. J. van der Linden & C. A. W. Glas (Eds.), Computerized adaptive testing: Theory and practice (pp. 1-25). Norwell, MA: Kluwer Academic.
Wainer, H., Bradlow, E. T., & Du, Z. (2000). Testlet response theory: An analog for the 3PL model useful in testlet-based adaptive testing. In W. J. van der Linden & C. A. W. Glas (Eds.), Computerized adaptive testing: Theory and practice (pp. 245-269). Norwell, MA: Kluwer Academic.

Acknowledgments

This study received funding from the Law School Admission Council (LSAC). The opinions and conclusions contained in this article are those of the authors and do not necessarily reflect the position or policy of the LSAC.

Author's Address

Send requests for information to: Cees A. W. Glas, Department of Research Methodology, Measurement and Data Analysis, University of Twente, P.O. Box 217, 7500 AE Enschede, The Netherlands; e-mail: C.A.W.Glas@edte.utwente.nl.

