

Graphical Models and
Computerized Adaptive Testing
Russell G. Almond and Robert J. Mislevy
Educational Testing Service

Applied Psychological Measurement, Vol. 23 No. 3, September 1999, 223–237
©1999 Sage Publications, Inc.

Computerized adaptive testing (CAT) based on item response theory (IRT) is viewed from the perspective of graphical modeling (GM). GM provides methods for making inferences about multifaceted skills and knowledge, and for extracting data from complex performances. However, simply incorporating variables for all sources of variation is rarely successful. Thus, researchers must closely analyze the substance and structure of the problem to create more effective models. Researchers regularly employ sophisticated strategies to handle many sources of variability outside the IRT model. Relevant variables can play many roles without appearing in the operational IRT model per se, e.g., in validity studies, assembling tests, and constructing and modeling tasks. Some of these techniques are described from a GM perspective, as well as how to extend them to more complex assessment situations. Issues are illustrated in the context of language testing. Index terms: adaptive testing, Bayesian nets, computerized adaptive testing, graphical models, item response theory.

Computerized adaptive testing (CAT) has significantly advanced psychological and educational
testing (Wainer et al., 1990). By using examinees’ response patterns to adaptively select items, CAT
can improve motivation, reduce testing time, and administer fewer items per examinee, all without
sacrificing measurement accuracy. The theoretical basis for most CAT is item response theory (IRT;
Hambleton, 1989).
Despite its usefulness, it is difficult to extend IRT-CAT to more applications because of two
constraints: the limited scope of tasks that can be used without seriously violating IRT conditional
independence assumptions and the limited capability for dealing with multiple aspects of knowledge
or skill. Graphical modeling (GM; Almond, 1995; Lauritzen, 1996; Pearl, 1988) provides methods
for working with such complex multivariate dependencies. When used in a predictive framework,
a GM is referred to as a Bayesian inference network (BIN).
IRT and CAT
IRT models express an examinee’s trait in terms of an unobservable variable, θ . The examinee’s
responses are assumed to be independent, conditional on both θ and item characteristics, e.g.,
difficulty and discrimination. The Rasch model for n dichotomous test items is a simple example:
$$P(x_1, x_2, \ldots, x_n \mid \theta, \beta_1, \beta_2, \ldots, \beta_n) = \prod_{j=1}^{n} P(x_j \mid \theta, \beta_j) , \qquad (1)$$

where
xj is the response to item j (1 for correct, 0 for incorrect),
βj is the difficulty parameter for item j, and

$$P(x_j \mid \theta, \beta_j) = \frac{\exp\left[x_j(\theta - \beta_j)\right]}{1 + \exp(\theta - \beta_j)} . \qquad (2)$$
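
To make Equations 1 and 2 concrete, the following minimal Python sketch (our own notation, not from the paper; numpy assumed) computes the Rasch response probability and the joint likelihood of a response vector:

```python
import numpy as np

def rasch_prob(theta, beta, x=1):
    """P(x | theta, beta) under the Rasch model (Equation 2)."""
    p_correct = np.exp(theta - beta) / (1.0 + np.exp(theta - beta))
    return p_correct if x == 1 else 1.0 - p_correct

def rasch_likelihood(theta, betas, responses):
    """Equation 1: the response vector's probability is the product of the
    item probabilities, conditional on theta and the item difficulties."""
    return np.prod([rasch_prob(theta, b, x) for b, x in zip(betas, responses)])
```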


For selecting items and scoring examinees, estimates of the item parameters (β1 , β2 , . . . , βn ; or B)
are typically obtained from large samples of examinee responses and treated as known.
Given an observed response vector x = (x1, x2, . . . , xn), Equation 1 becomes a likelihood function for θ, L(θ|x, B). Bayesian inference is based on the posterior distribution p(θ|x, B) ∝ L(θ|x, B)p(θ), which can be summarized by the posterior mean θ̄ and the posterior variance var(θ|x, B). Fixed test forms have different precision for different θs, with greater precision when θ is near item difficulties. A CAT can modify its difficulty for each examinee. Testing proceeds
sequentially, with each successive item (k + 1) selected to be informative about the examinee’s θ
based on the responses to the first k items, or x(k) (Wainer et al., 1990, Chap. 5). A pure Bayesian
approach begins with a prior distribution for θ and determines each next item as the item that
minimizes the expected posterior variance (Owen, 1975), i.e.,
$$E_{x_j}\left[ \operatorname{Var}\left(\theta \mid \mathbf{x}^{(k)}, x_j, B^{(k)}, \beta_j\right) \,\middle|\, \mathbf{x}^{(k)}, B^{(k)} \right] . \qquad (3)$$

Additional constraints that can be placed on item selection, such as item content and format, are
discussed below. Testing ends when a desired measurement precision has been attained or when a
specified number of items has been presented.
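
A grid-based sketch of this selection criterion follows. It approximates Equation 3 by discretizing θ; it reuses rasch_prob from the earlier sketch, and an operational CAT would use a calibrated item bank rather than these illustrative structures:

```python
def posterior_on_grid(grid, prior, betas, responses):
    """Posterior p(theta | x, B) on a discrete theta grid."""
    post = prior.copy()
    for b, x in zip(betas, responses):
        post *= rasch_prob(grid, b, x)
    return post / post.sum()

def expected_posterior_variance(grid, post, beta_j):
    """Equation 3: posterior variance after item j, averaged over the two
    possible responses, each weighted by its predictive probability."""
    epv = 0.0
    for x in (0, 1):
        weights = post * rasch_prob(grid, beta_j, x)
        p_x = weights.sum()                  # predictive probability of x
        new_post = weights / p_x             # posterior if x were observed
        mean = (grid * new_post).sum()
        epv += p_x * ((grid - mean) ** 2 * new_post).sum()
    return epv

# Owen-style selection: administer the candidate that minimizes this quantity, e.g.,
# next_j = min(candidates, key=lambda j: expected_posterior_variance(grid, post, betas[j]))
```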
IRT-CAT as GM
Probability-based inference in complex networks of interdependent variables is an active topic
in statistical research (Spiegelhalter, Dawid, Lauritzen, & Cowell, 1993). The structure of relation-
ships among variables can be depicted in an acyclic directed graph (DAG), in which nodes represent
variables and edges (the arrows between the nodes) represent conditional dependence relationships.
For any DAG, the joint distribution of the variables of interest (generically denoted {Z1 , Z2 , . . . ,
Zm }) can be recursively represented:

$$p(Z_1, Z_2, \ldots, Z_m) = \prod_{j=1}^{m} p\left(Z_j \mid \text{``parents'' of } Z_j\right) , \qquad (4)$$

where {"parents" of Zj} is the subset of {Zj−1, . . . , Z1} on which Zj is directly dependent. A
recursive representation can be written for any ordering of the variables, but a modeler builds
around substantively motivated conditional independence relationships. For example, measurement
models posit unobservable trait variables as the parents of observable variables that characterize
test behavior.
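
As an illustration of Equation 4, the following sketch (plain Python dictionaries; the data structures are ours, not a standard package) evaluates a joint probability by multiplying each node's conditional probability given its parents:

```python
def joint_prob(dag, cpds, values):
    """Evaluate Equation 4: p(Z1, ..., Zm) as the product over nodes of
    p(Zj | parents of Zj).

    dag    -- dict: node -> list of parent nodes (must be acyclic)
    cpds   -- dict: node -> function(value, {parent: value}) -> probability
    values -- dict: node -> observed or assumed value
    """
    p = 1.0
    for node, parents in dag.items():
        p *= cpds[node](values[node], {pa: values[pa] for pa in parents})
    return p

# For the IRT DAG of Figure 1: dag = {"theta": [], "x1": ["theta"], ..., "xn": ["theta"]},
# with each cpds["xj"] given by the Rasch probability of Equation 2.
```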
DAGs are similar to diagrams of structural equation models (Jöreskog & Sörbom, 1979), and many
of the same modeling techniques apply. The additional requirement of recursive representation
prevents cycles in the resulting expression of the relationships among variables, and it ensures that
certain Markov assumptions hold. In a GM, separations in a graph formally imply the statistical
independence of the variables (Pearl, 1988). Thus, basic probability calculations can be described
in terms of either passing messages in the graph (Pearl, 1988) or as graph operations, e.g., node
removal and edge reversal (Shachter, 1986). These mathematical formalisms allow researchers to
use graphs to describe complex operations.
Figure 1 shows a DAG for IRT as applied to an inference about a single examinee, given item
parameters and a prior θ distribution. The parameters that determine the conditional distributions
of Xs (given θ ) are implicit in Figure 1a and explicit in Figure 1b. The generic Z variables specialize
to θ and the item responses {x1 , x2 , . . . , xn }. When B is assumed known and θ and B are assumed
independent, Equation 4 becomes


Figure 1
DAGs for an IRT Model: (a) Item Parameters Implicit; (b) Item Parameters Explicit

$$p(x_1, x_2, \ldots, x_n, \theta \mid \beta_1, \beta_2, \ldots, \beta_n) = P(x_1, \ldots, x_n \mid \theta, \beta_1, \ldots, \beta_n)\, p(\theta) = \prod_{j=1}^{n} P(x_j \mid \theta, \beta_j)\, p(\theta) . \qquad (5)$$

Figure 1a emphasizes the conditional independence of the item responses given θ , suppressing
the dependence on item parameters. Figure 1b labels the edges from θ to the Xs with their item
parameters to facilitate the following discussion of CAT issues. DAGs convey the qualitative structure
of the probabilistic relationships but not their form or quantitative character.
IRT-CAT can be represented as a DAG with θ as the single parent of all items in the item bank
(Figure 1). At the beginning of a test, the full joint distribution [p(x1 , x2 , . . . , xn , θ); assuming
B known] characterizes the examinee’s θ and future item responses. The DAG shows that this
distribution can be obtained as the product of the initial distribution of θ [p(θ )] times the conditional
distributions of each item response Xj [P (Xj |θ )], because each Xj has θ as its only parent. The
prior θ distribution p(θ ) represents the examiner’s belief about an examinee’s θ . The conditional
distributions P (Xj |θ ) are given by the IRT model.
Evidence absorption is the process of updating belief as information is acquired about the
variables in the network. Once an examinee’s response to the first item has been observed, the
value of the observable variable X1 can be fixed. The information is first propagated to update belief
about θ (note that the information flows in the opposite direction from the arrow, essentially through
Bayes’ theorem). Then it is propagated outward from θ to update the predictive distributions
for X2 , X3 , . . . , Xn (essentially through IRT conditional distributions). An inference about the
examinee can be drawn from this new posterior distribution for θ , in this case p(θ |x(1) ). Inferences
about X2 , X3 , . . . , Xn can be drawn from their updated predictive distributions, namely
$$p(X_j = x_j \mid X_1 = x_1) = \int p(X_j = x_j \mid \theta)\, p(\theta \mid X_1 = x_1)\, d\theta . \qquad (6)$$

Evidence from subsequent items is handled similarly. Each item is evaluated to identify the item
that minimizes expected posterior variance, and then this item is administered. The absorption
process is repeated after the response, starting from p(θ |x(1) ). The process continues with each
successive p(θ|x(k) ) until testing is terminated. At each step, the marginal distribution of the
administered item response is fixed at its observed value, the distribution of θ is updated accordingly,
and the predictive distributions for unadministered items are revised.
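
The following sketch traces this absorption cycle on a discrete θ grid, a numerical stand-in for the integral in Equation 6. It reuses rasch_prob from the earlier sketch; the difficulty values are illustrative, not calibrated parameters:

```python
import numpy as np

grid = np.linspace(-4.0, 4.0, 81)
prior = np.exp(-0.5 * grid ** 2)          # standard normal prior (unnormalized)
prior /= prior.sum()
beta_1, beta_j = 0.0, 0.5                  # assumed-known item difficulties

# Absorb X1 = 1: information flows against the arrow, via Bayes' theorem.
post = prior * rasch_prob(grid, beta_1, 1)
post /= post.sum()

# Propagate outward from theta: updated predictive distribution for item j.
pred_j = np.sum(post * rasch_prob(grid, beta_j, 1))   # p(X_j = 1 | X_1 = 1)
```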
A second and statistically equivalent way to describe IRT-CAT highlights the modular reasoning
of IRT, which GM can use very effectively. Figure 2a depicts a testing situation in terms of GM
fragments, a student model (SM) with a θ node and a test item fragment library with Xj and


Figure 2
CAT as Knowledge-Based Model Construction: (a) θ Node and Task-Node Library; (b) A Dyadic DAG

item parameter nodes. Grouping item parameters with Xj s represents storing information about
conditional probabilities within the response nodes. Any of the test item fragments can be “docked”
with the θ fragment to produce a dyadic DAG, as shown in Figure 2b. This small DAG (and the
corresponding BIN) is temporarily assembled to absorb evidence about θ from the response to item
j . It is disassembled after the response has been observed and after the distribution of θ has been
updated accordingly. The new knowledge about θ either guides a search of the item bank for the
next item to administer or provides the grounds to terminate testing.
CAT is an example of what Breese, Goldman, and Wellman (1994) call “knowledge-based model
construction,” creating collections of model fragments that can be assembled to meet inferential
demands as they arise rather than attempting to build a single all-encompassing model. In this case,
the dynamically constructed models are statistical models for the joint distribution of θ and one
X, which is sufficient for absorbing evidence about θ item by item, by virtue of IRT’s conditional
independence structure.
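
A minimal sketch of this dock-absorb-detach cycle follows. In the IRT case a "fragment" reduces to the item's parameters, so the fragment library is just a dictionary; the class and names are ours, for illustration (grid, prior, and rasch_prob are reused from the sketches above):

```python
class StudentModel:
    """Holds the current belief about theta on a discrete grid."""
    def __init__(self, grid, prior):
        self.grid = grid
        self.belief = prior / prior.sum()

    def absorb(self, fragment, response):
        """Dock an item fragment (a dyadic DAG), absorb the response, detach."""
        self.belief = self.belief * rasch_prob(self.grid, fragment["beta"], response)
        self.belief /= self.belief.sum()

# The fragment library stores each item's conditional-probability information.
library = {1: {"beta": -0.3}, 2: {"beta": 0.8}, 3: {"beta": 0.1}}

sm = StudentModel(grid, prior)            # grid and prior from the sketch above
sm.absorb(library[2], response=1)         # temporary dyadic DAG, then detached
```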
GMs and Bayesian Estimation with Item Response Models
This framework allows inference about item and examinee population parameters. The statistical
model is based on Mislevy (1986) and Markov chain Monte Carlo (MCMC) estimation for IRT
models (Albert, 1992; Spiegelhalter, Thomas, Best, & Gilks, 1996a). The extension to collateral
information about tasks is discussed further in Mislevy, Sheehan, & Wingersky (1993) and Mislevy
(1988).
The full probability model for inference about Xs, θ s, β s, and higher-level parameters τ and η is
$$\prod_{i=1}^{N} \prod_{j=1}^{n} p(X_{ij} \mid \theta_i, \beta_j)\, p(\theta_i \mid \tau)\, p(\beta_j \mid \eta)\, p(\tau)\, p(\eta) , \qquad (7)$$

where
θi is the parameter for examinee i (i = 1, 2, . . . , N),
p(θ|τ ) is the θ distribution (assumed exchangeable given the higher-level parameter τ ),
p(τ ) is the prior distribution of τ ,
βj is the parameter for item j (j = 1, 2, . . . , n),
p(β|η) is the β distribution (assumed exchangeable given the higher-level parameter η),
p(η) is the prior distribution of η,
Xij is the response of examinee i to item j , and
p(Xij |θi , βj ) is the IRT model.


Bayesian inference proceeds by analyzing the conditional distribution given observed values of
Xij . Figure 3 depicts the DAG corresponding to Equation 7 for n = 3 and N = 3.

Figure 3
An IRT DAG Including Item and Examinee Population Parameters

All parameters of each type have the same structural relationship to the other parameters in the
model; thus, they do not need to be fully depicted. Spiegelhalter, Thomas, Best, & Gilks (1996b)
introduced the convention of drawing a box around variables that are to be replicated in such a
manner (Figure 4).

Figure 4
An IRT DAG Using the Replication Convention

Although simultaneously estimating all of the unknowns is difficult, the MCMC approach allows
them to be estimated with only the complexity of estimating them one at a time. The MCMC algorithm
repeatedly samples from the distribution of one unknown given all of the others. Thus, when the
distribution of βj is sampled, the current values of θi and η are available. The MCMC algorithm
uses the conditional independence assumptions in the GM to derive efficient computation strategies.
Under the proper conditions, the algorithm converges to the true joint posterior distribution of the
unknowns.
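
A minimal Metropolis-within-Gibbs sketch for the Rasch case of Equation 7 appears below. The normal population models with unit variances, the random-walk proposals, and the tuning constants are simplifying assumptions of ours, for illustration only; this is not the sampler used by the cited software:

```python
import numpy as np

rng = np.random.default_rng(0)

def log_joint(X, thetas, betas, tau, eta):
    """Log of Equation 7 for the Rasch case: the IRT likelihood plus normal
    population models p(theta_i | tau) and p(beta_j | eta) with unit variances
    (an assumption made here for brevity), and standard normal priors on the
    population means tau and eta."""
    logit = thetas[:, None] - betas[None, :]
    ll = np.sum(X * logit - np.log1p(np.exp(logit)))
    ll += np.sum(-0.5 * (thetas - tau) ** 2)
    ll += np.sum(-0.5 * (betas - eta) ** 2)
    ll += -0.5 * tau ** 2 - 0.5 * eta ** 2
    return ll

def gibbs_metropolis(X, n_iter=2000, step=0.3):
    """Repeatedly sample each block of unknowns given the current values of
    the others (random-walk Metropolis within a Gibbs-style scan)."""
    N, n = X.shape
    state = {"thetas": np.zeros(N), "betas": np.zeros(n), "tau": 0.0, "eta": 0.0}
    draws = []
    for _ in range(n_iter):
        for name in ("thetas", "betas", "tau", "eta"):
            prop = {k: (v.copy() if isinstance(v, np.ndarray) else v)
                    for k, v in state.items()}
            prop[name] = prop[name] + step * rng.normal(size=np.shape(prop[name]))
            log_ratio = (log_joint(X, prop["thetas"], prop["betas"], prop["tau"], prop["eta"])
                         - log_joint(X, state["thetas"], state["betas"], state["tau"], state["eta"]))
            if np.log(rng.uniform()) < log_ratio:   # accept or keep current values
                state = prop
        draws.append({k: (v.copy() if isinstance(v, np.ndarray) else v)
                      for k, v in state.items()})
    return draws
```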
If collateral information about items is available in the form of known item covariates Yj ,
Equation 7 can be revised by replacing p(βj |η) with p(βj |Yj , η) and interpreting η as the parameter(s)
of conditional distributions of β given Y . Bayesian inference about θ s, β s, τ , and η would then be
conditional on Yj s as well as Xij s (Figure 5).
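
A sketch of this replacement under a normal regression assumption follows (the feature coding and residual standard deviation are illustrative choices of ours, not from the paper):

```python
import numpy as np

def log_p_beta_given_Y(beta_j, Y_j, eta, sigma=0.5):
    """Log of p(beta_j | Y_j, eta): item difficulty as a normal regression
    on coded item features, with eta as the regression coefficients."""
    mean = np.dot(Y_j, eta)               # predicted difficulty from features
    return (-0.5 * ((beta_j - mean) / sigma) ** 2
            - np.log(sigma * np.sqrt(2.0 * np.pi)))
```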
Roles of Variables in IRT-CAT

Variables can play different roles in different tests, depending on the purposes and operational
definitions of those tests. Only θ and Xi appear explicitly in the IRT-CAT measurement model,
whereas other variables might be hidden. θ can be thought of as a summary of evidence about a
construct arrived at through choices about, and manipulation of, other hidden variables. The test
assembly specifications are a particularly important example. These specifications include the item


Figure 5
An IRT DAG Using Collateral Item Information

selection algorithm and item content specifications, and they play a large role in determining the
operational meaning of the SM variables.

Characterizing Aspects of θ (SMs)


SM variables are unobservable variables that characterize aspects of examinees’ knowledge,
skills, and abilities (referred to below as “traits” for brevity). These variables are used to accumulate
and integrate information across distinct pieces of data and to support inference about examinees’
traits. However, it is neither possible nor desirable to include in the model variables for all
conceivable aspects of knowledge and skill. The choice of SM variables is determined partly by
the purpose of the test, e.g., reporting or decision-making, and partly by theoretical considerations,
e.g., the degree to which empirical patterns in item covariances agree with those predicted by the
statistical model.
The current Test of English as a Foreign Language (TOEFL), for example, has three SM variables
(ETS, Inc., 1999). Each corresponds to one of three subtests—Listening, Reading, and Grammatical
Structure. Each subtest contains only discrete tasks associated with that skill, so there are disjoint
item domains and associated trait variables θL , θR , and θS . These variables are appropriate for
the primary use of the TOEFL, making decisions about admitting non-native English speakers into
undergraduate and graduate academic programs.
Standard IRT-CAT is based on univariate SMs. Multivariate SMs become important when obser-
vations contain information about more than one aspect of a trait with respect to which evidence
must be accumulated. See Segall (1996) and van der Linden (1997) for examples of multivariate
models.
Describing Observations (Evidence Models)
The second class of variables that appear explicitly in the measurement model are those that
describe or summarize an examinee’s response. These variables are called observables. In the
simplest case, Xi is binary, indicating a correct or incorrect response. The GM fragment for a
given item links the observable variable to the appropriate SM variables. In IRT, this is the logistic
regression likelihood (with one, two, or three parameters). These fragments are shown in the task
library in Figure 2a.
More complex response variables can be devised for more complex items, e.g., constructed
response items. In this case, the response variable could represent a grade for the particular
response. The task fragment then represents a polytomous IRT likelihood.


Limiting the Assessment Scope


Two kinds of studies usually thought of in terms of validity analyses help ensure that the simple
structure of IRT is adequate. In both cases, the focus is on those variables that might generate
interactions among item responses beyond those accounted for by an overall trait variable. When
such variables are identified, the test assembly specifications are modified so that these variables
need not be included in the measurement model. Constraining the test domain and methods by
fixing the values of variables that characterize features of tasks will operationally define θ to be
conditional on specified values of these variables. Also, test assembly specifications can eliminate
items that engender strong interactions with unmodeled examinee characteristics so that those
characteristics can be ignored.
Delimiting the domain and the testing methods. Measurement specialists must consider which
aspects of an examinee’s trait should be measured, based on the goal of the test. In a test of
academic language proficiency, for example, should scenarios be limited to academic and classroom
interactions? Should listening skills be assessed with closed-form items based on taped segments,
or with tasks that combine listening with speaking in a conversation with a human examiner?
The way performance is elicited in language tests has a significant effect on performance. As
a result, responses to some tasks might be more strongly related to one another than only a single
θ would predict, which would invalidate the structure of the DAG in Figure 1. Figure 6 shows one
possible remedy. Figure 6a shows a hypothetical DAG that analysis might reveal to be necessary
for approximating covariances among three items each of two types, which were originally intended to
be six measures of a single construct. Conditional dependencies among items of the same type are
required, even after accounting for a measure of overall proficiency. Figure 6b shows a likely result
of breaking the test into two separate measures, one for each item type, with an SM variable for
each type. The edge between θm and θe (which has an arbitrary direction) captures the covariance
between the two θ s.
Figure 6
DAGs Illustrating Shared Variance Due to Measurement Methods: (a) A Single Overall θ With Conditional Dependencies; (b) Conditional Independence With Correlated θs Specific to Item Type

Differential item functioning. Differential item functioning (DIF) occurs when, for reasons
unrelated to the trait of interest, certain item features tend to introduce more difficulty for members
of different groups. Figure 7a depicts a situation in which the response probabilities of Items 2
through n are conditionally independent of gender, given θ , but Item 1 is not. Some causes of DIF
can be avoided by defining variables that identify problematic features of items and removing any


Figure 7
DAGs Illustrating DIF: (a) Response Probabilities Dependent on Gender and θ for Item 1; (b) Items 2 Through n After Removal of Item 1

items that have these features. Items that exhibit substantial DIF (e.g., X1 in Figure 7a) are often
removed from the item bank so that the same structure and conditional probabilities in Figure 1
can be applied to all examinees. Figure 7b illustrates the result of this remedy.

Describing Task Features


Individual tasks in a test can be described in terms of many variables. They concern such elements
as format, content, modality, situation, and purpose. Some appear formally in test specifications.
However, when test developers create tasks, they draw on their expertise to account for far more
task features without encoding them as variables. Studies have shown that these features can be
strong predictors of item difficulty (e.g., Chalifour & Powers, 1989; Freedle & Kostin, 1993).
One way to use this collateral information about tasks is to supplement or supplant data from
pretest examinee samples as the source of information about the IRT item parameters, B (Mislevy,
Sheehan, & Wingersky, 1993). In effect, second-order DAGs are created for modeling item pa-
rameters. Figure 8 shows a DAG that posits a model for β2 , which in turn gives the conditional
probabilities of the response to Item 2, given θ (Y21–Y23 are coded features of Item 2).

Figure 8
Portions of a Two-Level DAG That Posits a Model for β2

A second way to use collateral information is to develop item schemas or item shells, which are
blueprints for families of tasks with similar, well-understood characteristics. Features of schemas
and features of the elements that fill them in could then be used to model IRT parameters. Bejar
(1990), Embretson (1993), and Collis, Tapsfield, Irvine, Dann, & Wright (1995) discussed this
approach in different measurement applications.
Finally, collateral information can be used to link values of SM variables to expected observable
behaviors. With the Rasch model, for example, knowing βj allows the probability of a correct
response from an examinee with any given θ to be calculated. Conversely, meaning can be given to
a value of θ by describing the kinds of items an examinee at that level is likely to answer correctly
and incorrectly. Thus, to the extent that item features account for β s, the examinee’s θ can be


described in terms of task characteristics and/or cognitively relevant variables (e.g., Sheehan &
Mislevy, 1990).
Controlling Test Assembly
Test assembly specifications constrain item selection for a given examinee’s CAT. Optimal test
assembly under multiple constraints has been a topic of much interest in the psychometric literature,
both for fixed tests and CAT (see van der Linden, 1998, for an overview of recent developments
in optimizing item selection and test construction). These constraints are generally imposed with
respect to the statistical characteristics of the measurements (e.g., reducing posterior variance) and
to factors such as content, format, and timing. Thus, items are selected with respect to constraints
expressed in terms of many non-measurement-model variables. Blocking and overlap constraints
are of particular importance.
Blocking constraints ensure that even though different examinees are administered different
items, all examinees receive tests with similar non-measurement-model characteristics. Stocking
and Swanson (1993) listed 41 constraints used in a prototype for the GRE CAT. Because satisfying
all constraints simultaneously is difficult, Stocking and Swanson employed integer programming
methods to optimize item selection using item-variable blocking constraints, as well as IRT-based
information-maximizing constraints.
Overlap constraints address the miscellaneous item features that cannot be completely accounted
for. These constraints identify items that must not appear together in the same test. These items
share incidental features that might allow them to give away each other’s answers or to be redundant.
Each item is acceptable by itself, but when it appears with certain others a conditional dependence
is introduced. Thus, evidence is “double counted” when the conditional independence model is
used (Schum, 1994, p. 129). Statistical models, which can be expressed by GM, allow researchers
to explicate how much information is lost as well as how it is lost (e.g., Bradlow, Wainer, & Wang,
1998).
Figure 9 illustrates the effect of test assembly constraints. In this example, the item bank has
four items. Items 1 and 2 both use the unfamiliar word “ubiquitous,” and Items 3 and 4 both concern
right triangles. Overlap constraints would prevent Item 1 from appearing in the same test as Item
2 and Item 3 from appearing with Item 4. Blocking constraints would force one item from each
pair to appear in each examinee’s test.
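
The following sketch encodes this four-item example directly; the set representation and selection logic are ours, and an operational system would handle far larger banks and constraint systems:

```python
# The four-item bank of Figure 9: two overlap sets ("ubiquitous" items,
# right-triangle items). Membership and the checks below are illustrative.
overlap_sets = [{1, 2}, {3, 4}]
administered = {1}

def eligible(item, administered, overlap_sets):
    """Overlap: an item is barred if an administered item shares its set."""
    return not any(item in s and (administered & s) for s in overlap_sets)

def blocking_satisfied(administered, overlap_sets):
    """Blocking: exactly one item from each pair must appear in the test."""
    return all(len(administered & s) == 1 for s in overlap_sets)

print([j for j in (2, 3, 4) if eligible(j, administered, overlap_sets)])  # -> [3, 4]
```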
Figures 9a and 9b are alternative DAGs for the complete item bank. Figure 9a shows conditional
dependencies among overlap sets and Figure 9b introduces additional SM variables. Figure 9c is the
standard IRT-CAT DAG with overlap and blocking constraints in place. Its simplicity is appropriate

Figure 9
Three DAGs Related to Overlap and Blocking Constraints: (a) Conditional Dependencies Within Item Sets; (b) Conditional Independence Resulting From the Addition of SM Variables; (c) Conditional Independence Using a Constrained IRT Model


only because incoming evidence has been restricted; i.e., the test assembly constraints do not
allow both X1 and X2 to be observed for the same examinee. Because of this restriction, the
simpler model shown in Figure 9c can be used. Thus, constraints within the test assembly
specifications allow the IRT conditional independence assumption to remain valid. Other variables
that characterize test items according to features not controlled by test assembly constraints are
controlled by randomization. The particular values they take in any given examinee’s test are a
random sample from the item bank, subject to blocking, overlap, and measurement constraints.

GM-Based CAT

Complex assessment models can be built around three central ideas of IRT-CAT: (1) defining
unobservable variables to explain patterns of observable responses, (2) assembling tasks so that
some sources of variation accumulate and others do not, and (3) using probability-based inference
to manage accumulating information about θ as assessment proceeds. Considerable work in this
direction has begun to appear in the psychometric literature over the past decade (e.g., Adams &
Wilson, 1997; Embretson, 1993; Kelderman & Rijkes, 1994; Tatsuoka, 1990).
Figure 10 illustrates aspects of a plausible implementation of a GM-based CAT (GM-CAT). θR ,
θW , θS , and θL are associated with reading, writing, speaking, and listening, respectively, and they
are operationally defined in terms of performance on tasks in a specified bank. These tasks are
administered according to given selection constraints. Thus, the meaning of the tasks is partly
controlled by variables that do not appear in the figure, variables that control the scope of the test
and the selection of tasks.
Figure 10
A Task-Oriented DAG: (a) SM; (b) Task Model Fragment Library

In the GM-CAT framework, the GM is distributed across two locations. Figure 10a is the SM.
Its structure is constant over all examinees and over all tasks. Figure 10b is a collection of DAG
fragments corresponding to a pool of tasks. Items 1 and 2 are conditionally independent, whereas
Item 3 is a simple integrated task. Items 4–6 are multiple aspects of response to a single complex
task. Each has multiple skills as parents, and conditional dependencies among items are further
indicated to deal with context effects. An examinee would see a subset of these tasks according to a
selection algorithm that balances maximized information with content and overlap constraints, just
as in IRT-CAT. When an examinee is assigned a task, the model fragment associated with that task is
attached to the SM according to the pattern of SM variable stubs in the task’s model fragment. The


evidence from the examinee’s response is then absorbed into the SM and then the model fragment
for that task is detached, leaving the updated SM ready for the next task.
The nodes in the SM are unobservable variables related to a multivariate generalization of IRT θ .
These variables represent aspects of skill and knowledge, and they are included in the model because
they will be used to either report examinee performance, accumulate supplementary patterns across
task situations for diagnostic feedback, or account for incidental dependencies across tasks. The
number and character of these variables depends on how performance is understood within the
domain and the requirements for reporting or diagnosis in the examination. Thus, a pass/fail
licensure test will use a much coarser SM than an intelligent tutoring system.
The nodes in the task-model fragments are observable variables that correspond to salient aspects
of examinees’ behaviors in specified task situations. This is a generalization of IRT item responses.
Generally, these nodes will correspond to features of a task response. They could be as simple as a
correct/incorrect response to a multiple-choice question or as complex as ratings of many aspects of
a solution trace. The SM variable stubs that appear in these fragments indicate which SM variables
are parents of those response variables. The form and parameters of the corresponding conditional
probability distributions are also stored within the task GM fragments.
There are three kinds of associations among the SM and observable nodes. First, SM variables
are parents of observables. Associations of this type are conditional probabilities of values of the
observable variables given the values of SM variables, a generalization of IRT conditional probabil-
ity. Through these associations, skills and knowledge explain patterns of observable behavior, and
once responses are observed, probability distributions for SM variables are updated. The condi-
tional probability distribution of the observation, given its parents, can model different patterns of
dependence. Researchers design the structure of these associations (the distributions corresponding
to edges from parent stubs to observable variables as in Figure 10) and provide initial estimates of
the conditional probabilities at various values of the SM variables. These conditional probabilities
can be further modeled as functions of task-characteristic variables, a generalization of the IRT
technique depicted in Figure 8.
Second, observable variables can be associated with one another. These associations occur when
multiple aspects of a performance in the same task situation are captured as observables. Including
them in the DAG is a way to model the effects of shared contexts, similarities in response methods,
or incidental connections that overlap constraints would disallow in IRT-CAT. A GM fragment for
a complex task would consist of multiple observables, perhaps with associations resulting from
commonalities induced by shared context, and possibly with different SM parents according to
their particular demands. In Figure 10b, these kinds of associations are represented by the edges
connecting X4 , X5 , and X6 .
Third, SM variables may be associated with one another, i.e., some may appear as parents of
other SM variables. These associations express the relationships among elements of competency
and knowledge. In Figure 10a, these associations are represented by the edges connecting SM
variables to one another. Thus, direct evidence about one SM variable can provide indirect evidence
about another, which increases a test’s accuracy by taking into account associations among skills
and knowledge.
A GM-CAT would input the current state of the SM into the task selection algorithm. Just as in IRT-
CAT, the GM-CAT would select tasks from a bank to maximize information in some metric, subject to
constraints expressed in terms of variables outside the measurement model. Two possible selection
metrics, with respect to targeted variables, include value of information (Heckerman, Horvitz, &
Middleton, 1993) and weight of evidence (Madigan & Almond, 1996). The GM-CAT attaches the
GM task fragment to the SM and absorbs the evidence provided by the examinee’s responses. The


algorithm can then discard the task or maintain it in the model for addressing dependencies (e.g.,
overlap effects). The selection algorithm still requires balanced task context, content, and type
within examinees; these specifications are analogous to those that define θ in IRT.
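
As a sketch of the second metric, the function below computes an expected weight of evidence for a binary observable and a binary target hypothesis, in the spirit of Madigan and Almond (1996); the exact operational definition may differ, and the input probabilities are illustrative:

```python
import math

def expected_weight_of_evidence(p_e_given_H, p_e_given_notH):
    """Expected weight of evidence a binary observable provides for
    hypothesis H (e.g., "the targeted SM variable is at its higher state").
    For each evidence value e, WOE(H : e) = log[P(e | H) / P(e | not-H)];
    the expectation is taken under H. Inputs are probabilities of e = 1."""
    ewoe = 0.0
    for e in (1, 0):
        pe_H = p_e_given_H if e else 1.0 - p_e_given_H
        pe_nH = p_e_given_notH if e else 1.0 - p_e_given_notH
        ewoe += pe_H * math.log(pe_H / pe_nH)
    return ewoe

# Rank candidate tasks by how much evidence each is expected to provide, e.g.:
# best = max(candidates, key=lambda t: expected_weight_of_evidence(*t.probs))
```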
The status of the SM is also used for reporting or providing feedback. If a single-number summary
of performance is desired, the current state of the SM can be projected onto a particular dimension,
e.g., expected performance on some typical tasks.
Examples from Language Proficiency Assessment
Background
The TOEFL 2000 project was initiated with the goal of measuring communicative competence in
English in a way that targets language use in the academic environment. It is anticipated that the
resulting assessment will include more performance-based tasks as well as tasks that are integrated
across modalities, e.g., writing based on listening to a conversation or speaking in response to a
reading passage. These kinds of tasks require that complex relationships among SM variables and
observable task performance variables be taken into account. Further, it is necessary to specify
skills in terms of the variables that accumulate evidence and to describe the aspects of performance
that these variables represent.
Task-Centered Examinee Modeling
The task-centered model can be viewed as an extension of the inferential approach employed in
the current TOEFL (ETS, Inc., 1999). This model includes variables for four skills (reading, writing,
speaking, and listening) and uses them in a way that reveals the relationships among the variables
and the observed (and expected) behaviors.
Important steps in this direction have been taken with tasks that focus on a single modality.
The following are the most important features of these extensions. First, the SM contains the four
reporting variables θR , θW , θS , and θL . The relationships among them are correlations in the target
population. (No causal meaning is given to the edges in the SM of Figure 10; their direction is
therefore arbitrary. All skills are expected to be associated in the population
of examinees.) Second, the observables associated with tasks indicate their parents with stubs that
represent where SM and task fragments must be connected when the task is administered.
Third, some conditionally independent tasks that address a single modality are included to
ground the definition of θR , θW , θS , and θL (e.g., X1 and X2 are associated with Tasks 1 and 2
and both depend on θR only). The conditional probabilities of response to these items, given their
single θ parent, can be modeled in terms of features that influence difficulty (Figure 8). Other task
features can be used to control task selection and to balance content, situation, and context across
examinees. Fourth, some observables have multiple θ s as parents. For example, one prototype
TOEFL task requires an examinee to read a passage about one theory for the extinction of dinosaurs
and then write a response with several features. Both θR and θW are parents of such an item. Their
relationship could be modeled as conjunctive, and values of conditional probabilities would depend
on both the reading demand and the writing demand of the task.
Finally, some tasks generate multiple observable variables (e.g., observables X4–X6, all asso-
ciated with Task 4). The dinosaur task requires several responses, each with a different mixture of
parent θ s and different values of variables that drive conditional probabilities. All share the subject
matter of dinosaurs and might need to be modeled as conditionally dependent beyond associations
induced by the θ s.
With only four variables included in the SM, many aspects of examinee skills and knowledge
are confounded and others are neglected. Some of these aspects will influence performance in all


tasks, so they account for part of the associations among θ s. Others are confounded with levels of
performance, and still others will contribute to uncertainty about θ s.
Competence-Centered Examinee Modeling
The competence-centered model departs more radically from current procedures, incorporating
SM variables motivated by models from the communicative competence literature (e.g., Bachman,
1990). This approach, depicted in Figure 11, could use many of the same task variables and test
assembly rules of task-centered modeling. However, it would accumulate evidence in terms of
performance on variables suggested by Bachman’s (1990) model of communicative competence.
The SM variables are different in two ways.
First, new variables appear for grammatical (θGC ), sociolinguistic (θSC ), conversational (θCV ),
and correspondence competence (θCR ), the last two of which are aspects of discourse competence
that distinguish between speaking/listening and reading/writing (Bachman & Palmer, 1996, p. 128,
attribute these terms to Widdowson, 1978). These variables can serve as parents for observable
variables that tap different modalities. θGC and θSC are used for observables associated with any of
the four traditional skills. θCV is used for observables involving speaking and/or listening, and θCR
is used for those involving reading and/or writing. Again, conditional probabilities for observable
variables with these parents can be modeled in terms of a task’s demand on each competence.
Second, the SM variables for reading, writing, speaking, and listening have radically different
operational definitions than those in the preceding example (indicated by an * in Figure 11).
These variables now serve as selector variables that indicate which modalities are involved in an
observable. They account for the different strengths that examinees have in different modalities
more so than the cross-modality competencies discussed above. An observable has a θ ∗ as a parent
if it involves that modality. Examinees’ θ∗ values indicate the degree to which their cross-modality
competences are either enabled or prohibited when carrying out tasks requiring that modality.
Thus, for any observable in which they are parents, the relationship among these variables and the
competencies is conjunctive.
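
A minimal sketch of such a conjunctive conditional probability for a binary observable with two binary parents follows (the slip and guess values are illustrative assumptions, not estimates from any assessment):

```python
def p_correct_conjunctive(theta_star_high, competence_high, slip=0.1, guess=0.2):
    """Conjunctive relationship: a high probability of success only when BOTH
    the modality selector (theta_star) and the competence are adequate."""
    return 1.0 - slip if (theta_star_high and competence_high) else guess
```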
In this approach, an examinee’s performance across task types would be summarized in terms of
cross-modality competencies. To evidence these competencies, a profile of strengths and limitations
associated with the modalities is used. For reporting purposes, projections could be made from
these multivariate profiles to scores on sets of tasks organized by type.

Figure 11
A Competence-Oriented DAG
Future Research
GMs are not a solution for all testing problems. However, they do provide a language for
describing the kinds of complex multivariate models that are a natural extension of IRT. GMs help
measurement experts convey technical issues to domain experts, which facilitates the construction
of balanced models.
There are a large number of problems that must be addressed in order to develop a methodology
of GM-based assessment. Probability-based inference using GM offers a framework for expressing
and then confronting such problems. The following technical challenges have begun to be addressed
but still require further research: generalized item selection algorithms for GM-CAT, task-induced
dependencies (i.e., common descendants of two or more conditionally independent SM variables),
continuous variables in SMs, and assessing the fit of complex models.
References

Adams, R., & Wilson, M. R. (1997). The multidimensional random coefficients multinomial logit model. Applied Psychological Measurement, 21, 1–23.

Albert, J. H. (1992). Bayesian estimation of normal ogive item response curves using Gibbs sampling. Journal of Educational Statistics, 17, 251–269.

Almond, R. G. (1995). Graphical belief modeling. London: Chapman and Hall.

Bachman, L. (1990). Fundamental considerations in language testing. Oxford: Oxford University Press.

Bachman, L. F., & Palmer, A. S. (1996). Language testing in practice. Oxford: Oxford University Press.

Bejar, I. I. (1990). A generative analysis of a three-dimensional spatial task. Applied Psychological Measurement, 14, 237–245.

Berger, M. P. F., & Veerkamp, W. J. J. (1996). A review of selection methods for optimal test design. In G. Engelhard & M. Wilson (Eds.), Objective measurement: Theory into practice (Vol. 3). Norwood NJ: Ablex.

Bradlow, E. T., Wainer, H., & Wang, X. (1998). A Bayesian random effects model for testlets (Research Report RR-98-3). Princeton NJ: Educational Testing Service.

Breese, J. S., Goldman, R. P., & Wellman, M. P. (1994). Introduction to the special section on knowledge-based construction of probabilistic and decision models. IEEE Transactions on Systems, Man, and Cybernetics, 24, 1577–1579.

Chalifour, C., & Powers, D. E. (1989). The relationship of content characteristics of GRE analytical reasoning items to their difficulties and discriminations. Journal of Educational Measurement, 26, 120–132.

Collis, J. M., Tapsfield, P. G. C., Irvine, S. H., Dann, P. L., & Wright, D. (1995). The British Army Recruit Battery goes operational: From theory to practice in computer-based testing using item-generation techniques. International Journal of Selection and Assessment, 3, 96–104.

Embretson, S. E. (1993). Psychometric models for learning and cognitive processes. In N. Frederiksen, R. J. Mislevy, & I. I. Bejar (Eds.), Test theory for a new generation of tests (pp. 125–150). Hillsdale NJ: Erlbaum.

ETS, Inc. (1999). TOEFL 1999–2000: Information bulletin for TOEFL, TWE, and TSE. Princeton NJ: Author.

Freedle, R., & Kostin, I. (1993). The prediction of TOEFL reading item difficulty: Implications for construct validity. Language Testing, 10, 133–170.

Hambleton, R. K. (1989). Principles and selected applications of item response theory. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 147–200). New York: American Council on Education/Macmillan.

Heckerman, D., Horvitz, E., & Middleton, B. (1993). An approximate nonmyopic computation for value of information. IEEE Transactions on Pattern Analysis and Machine Intelligence, 15, 292–298.

Jöreskog, K. G., & Sörbom, D. (1979). Advances in factor analysis and structural equation models. Cambridge MA: Abt Books.

Kelderman, H., & Rijkes, C. P. M. (1994). Loglinear multidimensional IRT models for polytomously scored items. Psychometrika, 59, 149–176.

Lauritzen, S. L. (1996). Graphical models. Oxford: Clarendon Press.

Madigan, D., & Almond, R. G. (1996). Test selection strategies for belief networks. In D. Fisher & H.-J. Lenz (Eds.), Learning from data: AI and statistics IV (pp. 89–98). New York: Springer.

Mislevy, R. J. (1986). Bayes modal estimation in item response models. Psychometrika, 51, 177–196.

Mislevy, R. J. (1988). Exploiting auxiliary information about items in the estimation of Rasch item difficulty parameters. Applied Psychological Measurement, 12, 281–296.

Mislevy, R. J., Sheehan, K. M., & Wingersky, M. S. (1993). How to equate tests with little or no data. Journal of Educational Measurement, 30, 55–78.

Owen, R. J. (1975). A Bayesian sequential procedure for quantal response in the context of adaptive mental testing. Journal of the American Statistical Association, 70, 351–356.

Pearl, J. (1988). Probabilistic reasoning in intelligent systems: Networks of plausible inference. San Mateo CA: Kaufmann.

Schum, D. A. (1994). The evidential foundations of probabilistic reasoning. New York: Wiley.

Segall, D. (1996). Multidimensional adaptive testing. Psychometrika, 61, 331–354.

Shachter, R. D. (1986). Evaluating influence diagrams. Operations Research, 34, 871–882.

Sheehan, K. M., & Mislevy, R. J. (1990). Integrating cognitive and psychometric models in a measure of document literacy. Journal of Educational Measurement, 27, 255–272.

Spiegelhalter, D. J., Dawid, A. P., Lauritzen, S. L., & Cowell, R. G. (1993). Bayesian analysis in expert systems. Statistical Science, 8, 219–283.

Spiegelhalter, D. J., Thomas, A., Best, N., & Gilks, W. (1996a). BUGS 0.5 examples (Vols. 1 & 2). Cambridge, England: MRC Biostatistics Unit, Institute of Public Health.

Spiegelhalter, D. J., Thomas, A., Best, N., & Gilks, W. (1996b). BUGS 0.5: Bayesian inference using Gibbs sampling (Version 2.0) [Software manual]. Cambridge, England: MRC Biostatistics Unit, Institute of Public Health.

Stocking, M. L., & Swanson, L. (1993). A method for severely constrained item selection in adaptive testing. Applied Psychological Measurement, 17, 277–292.

Tatsuoka, K. K. (1990). Toward an integration of item response theory and cognitive error diagnosis. In N. Frederiksen, R. Glaser, A. Lesgold, & M. G. Shafto (Eds.), Diagnostic monitoring of skill and knowledge acquisition (pp. 453–488). Hillsdale NJ: Erlbaum.

van der Linden, W. J. (1997). Multidimensional adaptive testing with a minimum error-variance criterion (Research Report 97-03). Enschede, The Netherlands: University of Twente, Department of Educational Measurement and Data Analysis.

van der Linden, W. J. (1998). Optimal test assembly [Special issue]. Applied Psychological Measurement, 22.

Wainer, H., Dorans, N. J., Flaugher, R., Green, B. F., Mislevy, R. J., Steinberg, L., & Thissen, D. (1990). Computerized adaptive testing: A primer. Hillsdale NJ: Erlbaum.

Widdowson, H. G. (1978). Teaching language as communication. Oxford: Oxford University Press.

Acknowledgments

This work was supported by the Test of English as a Foreign Language (TOEFL) and the National Center for Research on Evaluation, Standards, and Student Testing (CRESST), Educational Research and Development Program, cooperative agreement number R117G10027 and CFDA catalog number 84.117G, as administered by the Office of Educational Research and Improvement, U.S. Department of Education. The authors thank Isaac Bejar, Dan Eignor, Drew Gitomer, Ed Herskovitz, David Madigan, Martha Stocking, and Linda Steinberg for comments and discussions on various topics presented here, and the TOEFL Research Committee for its encouragement and support. Several experts in communicative competence and language proficiency assessment contributed to this project by generously sharing their advice, wisdom, and reference materials. They are Lyle Bachman, Gary Buck, Francis Butler, Carol Chapelle, Dan Douglas, Joan Jamieson, Irwin Kirsch, Carol Taylor, and Mary Schedl.

Author’s Address

Send requests for reprints or further information to Robert J. Mislevy, Educational Testing Service, Princeton NJ 08541, U.S.A. Email: rmislevy@ets.org