
Evaluation in Education, 1982, Vol. 5, pp. 97-118
Printed in Great Britain. All rights reserved

CHAPTER 1

CRITERION-REFERENCED MEASUREMENT: ITS
MAIN APPLICATIONS, PROBLEMS AND FINDINGS

Wim J. van der Linden

Technische Hogeschool Twente, Postbus 217, 7500 AE Enschede, The Netherlands

ABSTRACT

The need for criterion-referenced measurements has mainly arisen from the introduction of instructional programs organized according to modern principles from educational technology. Some of these programs are discussed, and it is indicated for what purposes criterion-referenced measurements are used. Three main problems of criterion-referenced measurement are distinguished: the problem of criterion-referenced scoring and score interpretation, the problem of criterion-referenced item and test analysis, and the problem of mastery testing. For each of these problems a variety of solutions and approaches have been suggested. It is the purpose of the paper to provide an overview of these and to introduce the reader to the original literature.

The recent rise of a number of new learning strategies has basically changed the meaning of measurement in education and made new demands on the construction, scoring, and analysis of educational tests. Educational measurements satisfying these demands are usually called criterion-referenced, while traditional measurements are often known as norm-referenced.

The common feature of these learning strategies is their objective-based character. All lead to instructional programs being set up and executed according to well-defined, clear-cut learning objectives. The organizational measures taken to realize these objectives differ, however. For example, in learning for mastery, one of the most popular new developments (Block, 1971b; Bloom, 1968; Bloom, Hastings, & Madaus, 1971, chap. 3), the following is done: First, the learning objectives are kept fixed during the implementation of the program. Second, the program is divided into a sequence of small learning units, and students are allowed to proceed to the next unit only after they have mastered the preceding one. Third, end-of-unit tests are administered providing students and instructors with quick feedback. One of the principal uses of these tests is to separate students who master the unit from those who do not (mastery testing). Fourth, when a student does not master the unit, he is given corrective learning materials or remedial teaching. Fifth, as extra learning time is needed for going through these materials and instruction, mastery learning allows some differentiation in tempo. There is no differentiation in level, however, since the learning objectives are kept fixed and equal for all students. (For these five properties of mastery learning and a further explanation thereof, see Warries, 1977a.)

The context in which mastery learning seeks to realize its goal is regular group instruction. In each unit all students receive the same instruction until the end-of-unit test. Although they have much in common with mastery learning, individualized instructional systems like the Pittsburgh Individually Prescribed Instruction (IPI) (Glaser, 1968) and Flanagan's Program for Learning According to Needs (PLAN) (Flanagan, 1967) differ in this respect. These systems allow students to reach the learning objectives in different ways. Each student is provided with his own route through the instructional units and with learning materials matching his entering behavior and aptitudes. Individualized instruction has mainly been motivated by the viewpoint underlying Aptitude Treatment Interaction (ATI) research (Cronbach & Snow, 1977), namely that subjects can react differently to treatments and that treatments which are best on the average may therefore be worst in individual cases. The methodology needed to test whether ATIs are present has been reviewed by Cronbach and Snow (1977, chaps. 1-4) and Plomp (1977). Decision rules for assigning students to treatments are given in van der Linden (1980d, 1981b) and Mijn (1980).

The most far-reaching mode of individualization takes place in Computer-Aided Instruction (CAI) (Atkinson, 1968; Suppes, 1966; Suppes, Smith, & Beard, 1975). In CAI the instruction is interactive and each next step depends on the student's response to the preceding one, thus creating a system in which the student is, within given boundaries, choosing his own individual instruction.

Compared with traditional educational approaches, one of the most conspicuous properties of the testing programs inherent in these modern learning strategies is their frequency of testing. At several points of time, tests are involved for several purposes. Beginning-of-unit tests describe the entering behavior and skills of students who are about to start the unit. When the unit itself covers a sequence of objectives, scores on these tests can be used for deciding where to place students in this sequence. When individualization is based on ATI research, aptitude tests are usually administered to assign students to those treatments that promise the best results.

Once the student has taken up the unit, intermediate tests can be administered for monitoring his learning. The principal use of these tests is to describe relevant aspects of the learning process, enabling students and instructor to re-adjust it when necessary. This intermediate use of tests is known as formative evaluation.

Tests are also used as end-of-unit tests for describing the level of mastery the student has attained when he has completed the unit. If the information from these tests serves as a basis for deciding whether the student has mastered the unit and can be moved up to the next unit, the testing procedure is known as mastery testing. End-of-unit tests may be used for diagnostic purposes as well; then they indicate the objectives for which the student's performances are poor, and which part of the unit he has to cover again.

In all these modes of testing, scores are used only for instructional purposes. In particular, they are not used for grading students (summative evaluation). For that purpose a separate test is ordinarily administered at the end of the program, covering the content of the units but ignoring the unit test scores given earlier.

For a further review of testing procedures in modern learning strategies, we refer to Glaser and Nitko (1971), Hambleton (1974), Nitko and Hsu (1974), and Warries (1979b).

As already indicated, the introduction of such strategies as learning for mastery and individualized instruction has led to a change in the use and the interpretation of test scores. The modes of testing outlined above are all concerned with a behavioral description of students. This makes it possible to control the learning process and to make optimal decisions, for example, on the placement of students in instructional units and their end-of-unit mastery level. Traditional testing procedures are, however, better suited for differentiating between subjects, and mostly serve as an instrument for (fixed-quota) selection. The psychometric analysis of these tests is generally adapted to this use.

The attempts to remove traditional assumptions from educational testing and to replace them by assumptions better adjusted to the use of tests in modern learning strategies arose in the early sixties. As a result the term "criterion-referenced measurement" was coined. Elsewhere (van der Linden, 1979) we have given a review in which three main problems of criterion-referenced measurement are distinguished, each first signaled by a different "historical" paper. These problems are: the problem of criterion-referenced scoring and score interpretation, the problem of criterion-referenced item and test analysis, and the problem of mastery decisions. Following are short introductions to these three problems.

CRITERION-REFERENCED SCORING AND SCORE INTERPRETATION

Glaser (1963), in his paper on instructional technology and the measurement of learning outcomes, confronted two possible uses of educational tests and their areas of application. The first is that tests can supply norm-referenced measurements. In norm-referenced measurement the performances of subjects are scored and interpreted with respect to each other. As the name indicates, there is always a norm group, and the interest is in the relative standing of the subjects to be tested in this group. This finds expression in scoring methods such as percentile scores, normalized scores, and age equivalents. Tests are constructed such that the relative positions of subjects come out as reliably as possible. An outstanding example of an area where norm-referenced measurements are needed is testing for selection (e.g., of applicants for a job). In such applications the test must be maximally differentiating in order to enable the employer to select the best applicants.

The second use is that tests can supply criterion-referenced measurements. In criterion-referenced measurement the interest is not in using test scores for ranking subjects on the continuum measured by the test, but in carefully specifying the behavioral referents (the "criterion") pertaining to scores or points along this continuum. Measurements are norm-referenced when they indicate how much better or worse the performances of individual subjects are compared with those of other subjects in the norm group; they are criterion-referenced when they indicate what performances a subject with a given score is able to do, and what his behavioral repertory is, without any reference to scores of other subjects. This descriptive use of test scores is needed in the testing programs of the learning strategies mentioned earlier. For a further clarification of the distinction between norm-referenced and criterion-referenced test score interpretations, we refer to Block (1971a), Carver (1974), Ebel (1962, 1971), Flanagan (1950), Glaser and Klaus (1962), Glaser and Nitko (1971), Glass (1978), Hambleton, Swaminathan, Algina and Coulson (1978), Linn (1980), Nitko (1980), and Popham (1978).

How to establish a relation between test scores and behavioral referents is what can be called the problem of criterion-referenced score interpretation. We have also called this the problem of local validity, since Glaser (1963) goes beyond the classical validity problem and does not ask for the validity of a test, which is the classical question, but for the validity of the interpretation of test scores. The validity of the test must, as it were, be locally specifiable for points on the continuum underlying the test (van der Linden, 1979).

Several answers have been proposed to the problem of criterion-referenced score interpretation, some of which will be mentioned here. One is a constructive approach based on the idea of referencing test scores to a domain of tasks (e.g., Hively, 1974; Hively, Maxwell, Rabehl, Sension, & Lundin, 1973; Hively, Patterson, & Page, 1968; Osburn, 1968). In this approach, the test is a random sample from a domain of tasks (or may be conceived of as such) defined by a clear-cut learning objective, and test scores are interpreted with respect to this domain. Domain-referenced testing usually involves the use of binomial test models and has been the most popular way of criterion-referencing so far. Another approach is an empirical method in which the congruence between items and objectives is determined by assessing how sensitive the items are to objective-based instruction. The objectives are then used to interpret the test performances. An example of this approach is Cox and Vargas' (1966) pretest-posttest method, which has led to a multitude of variants and modifications (for a review, see Berk, 1980b; van der Linden, 1981a). A third approach is subjective and uses content-matter specialists to judge the congruence between items and learning objectives (Hambleton, 1980; Rovinelli & Hambleton, 1977). Yet another approach has been followed by Cox and Graham (1966), who conceived the continuum to be referenced as a Guttman scale and used scalogram analysis to scale items pertaining to a sequence of learning objectives. By so doing, they were able to predict for points along the scale which arithmetical skills the students in their example were able to perform. Wright and Stone (1979, chap. 5) used the Rasch model to link points on the continuum underlying the test to behaviors. The crucial difference between scalogram analysis and this approach is that the former assumes Guttman item characteristic curves and allows only deterministic statements about behaviors, whereas use of the Rasch model entails probabilistic interpretations.
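
As a concrete illustration of the domain-sampling idea, the sketch below assumes the simple binomial model: the number-correct score X on a random sample of n items from the domain is binomially distributed with the examinee's domain score (the proportion of domain tasks he can perform) as success parameter. The particular numbers and the exact (Clopper-Pearson) interval are our own illustrative choices, not a procedure prescribed by the papers cited above.

```python
from scipy.stats import beta

def domain_score_statement(x, n, conf=0.95):
    """Criterion-referenced interpretation of a number-correct score x on a
    random sample of n domain items under the binomial model: a point
    estimate of the domain score plus an exact confidence interval."""
    alpha = 1.0 - conf
    lower = beta.ppf(alpha / 2, x, n - x + 1) if x > 0 else 0.0
    upper = beta.ppf(1 - alpha / 2, x + 1, n - x) if x < n else 1.0
    return x / n, (lower, upper)

# e.g., 17 out of 20 sampled items correct (illustrative numbers)
est, (lo, hi) = domain_score_statement(17, 20)
print(f"estimated domain score {est:.2f}, 95% interval ({lo:.2f}, {hi:.2f})")
```

The resulting statement ("this student can perform roughly this proportion of the tasks in the domain") refers only to the domain, not to the scores of other students, which is what makes it criterion-referenced.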

The above five approaches were mentioned because they have been most popular so far or indicate promising developments. For a more exhaustive review, we refer to Nitko (1980).

CRITERION-REFERENCED ITEM AND TEST ANALYSIS

Popham and Husek's (1969) paper is a second milestone in the history of criterion-referenced measurement. It goes beyond Glaser's paper in that it extends the distinction between norm-referenced and criterion-referenced measurement to item and test analysis. According to Popham and Husek, the two types of measurement differ so fundamentally that classical models and procedures for item and test analysis are inadequate for criterion-referenced measurement and their outcomes sometimes even misleading.

The key word in the analysis of items and tests for norm-referenced measurement, with its emphasis on differentiating between subjects, is variance. Classical models and procedures rely on the presence of a large amount of variability of scores. In criterion-referenced measurement, however, this condition will seldom be fulfilled because it is not the variability of scores which is critical but their relation to a criterion. In this connection, Popham and Husek refer to classical reliability analysis. Consistency, both internal and temporal, is a desirable property not only of norm-referenced but of criterion-referenced measurements as well. Nevertheless, classical reliability coefficients are variance-dependent and thus often low for criterion-referenced measurements. For this reason, Popham and Husek plead for new models and procedures for item and test analysis. These must be models and procedures which, more than classical test theory, make allowance for the specific requirements criterion-referenced measurement imposes on item construction, test composition, scoring, and score interpretation.

Popham and Husek's plea has led to a variety of proposals. A proposal to adapt classical test theory for use with criterion-referenced measurement is due to Livingston (1972a). He argued that in criterion-referenced measurement we are no longer interested in estimating the deviations of individual true scores from the mean but in estimates of deviations from the cut-off score. Central to his approach is a criterion-referenced reliability coefficient defined as the ratio of true to observed score variance about the cut-off score. Livingston's proposal stimulated others to contribute (Harris, 1972; Lovett, 1978; Shavelson, Block, & Ravitch, 1972; see also Livingston, 1972b, 1972c, 1973), and has had the merit of making a larger public aware of the issue of criterion-referenced test analysis.
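
In the usual classical-test-theory notation (our reconstruction of the definition just given, with X the observed score, T the true score, \mu their common mean, and c the cut-off score; consult Livingston, 1972a, for the exact formulation), the coefficient reads

$$ k^2(X, T) = \frac{\sigma^2(T) + (\mu - c)^2}{\sigma^2(X) + (\mu - c)^2}. $$

It reduces to the classical reliability coefficient \sigma^2(T)/\sigma^2(X) when the cut-off score coincides with the mean and increases as the cut-off score moves away from it.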

As noted above, random sampling of tests from item domains has been the most popular way of criterion-referencing test scores. However, when tests are sampled and the concern is with domain scores and not with true test scores, classical test theory does not hold (unless all items in the domain have equal difficulty). Brennan and Kane (1977a) proposed to use generalizability theory when domain sampling takes place and transformed Livingston's reliability coefficient into what they called an index of dependability. At the same time they presented a modification of this index which can be interpreted as a signal-noise ratio for domain-referenced measurements, while later two general agreement coefficients were given from which these indices can be derived as special instances (Brennan & Kane, 1977b; Kane & Brennan, 1980). A summary of these developments is given in Brennan (1980).
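
In generalizability-theory notation the index of dependability for a mastery score \lambda is commonly written as (a reconstruction from the standard generalizability literature, not a quotation from Brennan and Kane, 1977a)

$$ \Phi(\lambda) = \frac{\sigma^2(p) + (\mu - \lambda)^2}{\sigma^2(p) + (\mu - \lambda)^2 + \sigma^2(\Delta)}, $$

where \sigma^2(p) is the universe-score (domain-score) variance, \mu the mean domain score, and \sigma^2(\Delta) the absolute error variance. Unlike the classical reliability coefficient, it thus penalizes error in an absolute rather than a relative sense.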

In the foregoing approaches the emphasis is on developing test theory reflecting the reliability or dependability with which deviations of individual true or domain scores from the cut-off score can be estimated. It can be argued, however, that when criterion-referenced tests are used for mastery decisions the concern should not be so much with these deviations as with the "reliability" and "validity" of the decisions. Since approaches along this line belong to the content of the next section, we will postpone a discussion of their results until then.

The first proposal of a criterion-referenced item analysis was made by Cox and Vargas (1966). As already indicated in the preceding section, their pretest-posttest method of item validation is based on the idea that criterion-referenced items ought to be sensitive to objective-based instruction. It measures this sensitivity by the difference in item p-value before and after instruction. The rationale of the method is discussed in Coulson and Hambleton (1974), Cox (1971), Edmonston and Randall (1972), Hambleton and Gorth (1971), Henrysson and Wedman (1973), Millman (1974), Roudabush (1973), Rovinelli and Hambleton (1977), and Wedman (1973, 1974a, 1974b). Slightly different approaches can be found in Brennan and Stolurow (1971), Harris (1976), Herbig (1975, 1976), Kosecoff and Klein (1974), and Roudabush (1973). Popham (1971) offers a chi-squared test for detecting items with atypical differences in p-value. Saupe's (1966) correlation between item and total change score is often considered the counterpart of the norm-referenced item-test correlation.
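
The core computation of the pretest-posttest method is simple enough to sketch directly; the data layout and function name below are ours, and the illustrative data are invented.

```python
import numpy as np

def pretest_posttest_index(pre, post):
    """Cox-Vargas-style item sensitivity: the difference between the item
    p-value (proportion correct) after and before instruction.
    pre, post: 0/1 score matrices of shape (students, items)."""
    pre, post = np.asarray(pre), np.asarray(post)
    return post.mean(axis=0) - pre.mean(axis=0)

# two items, five students (illustrative data): item 1 is clearly
# more sensitive to the instruction than item 2
pre  = [[0, 1], [0, 1], [1, 1], [0, 0], [0, 1]]
post = [[1, 1], [1, 1], [1, 1], [1, 0], [0, 1]]
print(pretest_posttest_index(pre, post))   # -> [0.6 0. ]
```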

A critical review of item analysis based on the pretest-posttest method is given in van der Linden (1981a), who also offers an alternative derived from latent trait theory. (See also van Naerssen, 1977a, 1977b.) Other reviews of criterion-referenced item analysis procedures are given in Berk (1980b) and Hambleton (1980). For reviews of test analysis procedures, we refer to Berk (1980a), Hambleton, Swaminathan, Algina, and Coulson (1978), and Linn (1979).

Popham and Husek's diagnosis of the role of test score variance in the analysis of criterion-referenced measurement has been a hotly-debated issue (Haladyna, 1974a, 1974b; Millman & Popham, 1974; Simon, 1969; Woodson, 1974a, 1974b). Our own opinion is that, although it has given rise to many worthwhile developments and comes close to a fundamental problem in test theory, it is erroneous as far as the classical test model is concerned. Apart from the requirement of finite variance, this model does not contain any assumption with respect to score variance. (A full statement of these assumptions can be found in, e.g., Lord & Novick, 1968, section 3.1.) Low variability of test scores in criterion-referenced measurement can therefore never invalidate the classical test model. Latent trait theory has called attention to the more fundamental problem of population-dependent item and test parameters and offers models in which these are replaced by parameters that are not only variance-independent but independent of any distributional characteristic (e.g., Birnbaum, 1968; Lord, 1980; Wright & Stone, 1979; for an informal introduction, see van der Linden, 1978b). In fact, Popham and Husek's paper alludes to this problem but wrongly claims it as an exclusive problem of criterion-referenced measurement.
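
For reference, the simplest of these latent trait models, the Rasch model mentioned above, specifies the probability of a correct response to item i as a function of only the subject's ability \theta and an item difficulty parameter b_i (in the usual parametrization):

$$ P(X_i = 1 \mid \theta) = \frac{\exp(\theta - b_i)}{1 + \exp(\theta - b_i)}. $$

The parameter b_i carries no reference to any population distribution, which is the sense in which latent trait item parameters are population-independent.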

MASTERY DECISIONS

Hambleton and Novick (1973) were the first to introduce a decision-theoretic approach to criterion-referenced measurement. In modern learning strategies, such as the ones briefly reviewed above, the ultimate use of criterion-referenced measurements is mostly not measuring students but making instructional decisions. Hambleton and Novick compare the use of norm-referenced and criterion-referenced measurements with the concepts of fixed-quota and quota-free selection (Cronbach & Gleser, 1965).

Norm-referenced tests are needed to differentiate between subjects in order to select a predetermined number of best performing persons, regardless of their actual level of performance (fixed-quota selection). Criterion-referenced tests are mostly used to select all persons exceeding a performance level fixed in advance, regardless of their actual number (quota-free selection). With criterion-referenced tests, this fixed level of performance is known as the mastery score, and selection with this score as mastery testing.

It is important to note that in mastery testing there is thus only one point on the criterion-referenced continuum that counts: the mastery score. This score divides the continuum into mastery and non-mastery areas.

There is an alternative conception of mastery which is not, like the former, based on a continuum model but on a state model. In this conception mastery and non-mastery are viewed as two latent states, each characterized by a different set of probabilities of a successful response to the test items. There is no state between mastery and non-mastery, and it is not necessary to set a mastery score, as with continuum models. By fitting the model to test data it is left to nature to define who is a master and who is not. Relevant references to the literature on state models are Bergan, Cancelli, and Luiten (1980), Besel (1973), Dayton and Macready (1976, 1980), Emrick (1971), Emrick and Adams (1969), Macready and Dayton (1977, 1980a, 1980b), Reulecke (1977a, 1977b), van der Linden (1980c, 1981c, 1981d, 1981e), Wilcox (1979a), and Wilcox and Harris (1977). For a discussion of the differences between continuum and state models, we refer to Meskauskas (1976) and van der Linden (1978a).
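
To make the state conception concrete, the sketch below writes down the likelihood of a response pattern under a simple two-state latent class model of the kind studied by Macready and Dayton: a proportion pi of the population is in the mastery state with per-item success probabilities alpha, the rest in the non-mastery state with success probabilities beta_. All parameter values are illustrative assumptions; in practice they are estimated from the data.

```python
import numpy as np

def pattern_probability(x, pi, alpha, beta_):
    """P(response pattern x) under a two-state latent class model: a
    mixture of the mastery state (success probabilities alpha) and the
    non-mastery state (success probabilities beta_), mixing weight pi."""
    x, alpha, beta_ = np.asarray(x), np.asarray(alpha), np.asarray(beta_)
    p_master = np.prod(alpha**x * (1 - alpha)**(1 - x))
    p_nonmaster = np.prod(beta_**x * (1 - beta_)**(1 - x))
    return pi * p_master + (1 - pi) * p_nonmaster

# illustrative parameter values for a four-item test
pi, alpha, beta_ = 0.6, [0.9, 0.85, 0.9, 0.8], [0.3, 0.25, 0.2, 0.35]
print(pattern_probability([1, 1, 0, 1], pi, alpha, beta_))
```

Bayes' theorem then gives the posterior probability that a student with pattern x is a master, pi P(x | master) / P(x), on which classification can be based without any mastery score being set on a continuum.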

Though mastery is defined using true scores or states, mastery decisions are made with test scores containing measurement errors. Due to this, decision errors can be made, and a student can be wrongly classified as a master (a false positive error) or as a non-master (a false negative error). An important mastery testing problem is how to choose a cut-off score on the test so that decisions are made as optimally as possible. A second problem is the psychometric analysis of mastery decisions. Classical psychometric analysis views tests as instruments for making measurements, and this point of view has pervaded its models and procedures. When tests are used for making decisions, this point of view is no longer appropriate, however.

Hambleton and Novick (1973) proposed the use of Bayesian decision theory to optimize the cut-off score on the test. Important in applying decision theory is the choice of the loss function representing the "seriousness" of the decision outcomes on a numerical scale. Several have been used so far: threshold loss, linear loss, and normal ogive loss functions (for a comparison, see van der Linden, 1980a). The threshold loss function is used, e.g., in Hambleton and Novick (1973), Huynh (1976), and Mellenbergh, Koppelaar, and van der Linden (1977). Livingston (1975) and van der Linden and Mellenbergh (1977) show how semi-linear and linear loss functions, respectively, can be used to select optimal cut-off scores. The use of normal ogive loss functions is suggested by Novick and Lindley (1978).
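
A minimal numerical sketch of the threshold-loss case may clarify the mechanics. The beta prior for the domain score, the loss values, and the mastery score below are all illustrative assumptions, and the discretized integration is our own shortcut; the sketch shows the general idea of minimizing expected loss over cut-off scores, not any particular author's procedure.

```python
import numpy as np
from scipy.stats import binom, beta

n, pi0 = 20, 0.80          # test length and mastery score on the true score scale (assumed)
a, b = 8.0, 2.0            # assumed beta prior for the domain score distribution
l_fp, l_fn = 1.0, 2.0      # assumed losses for false positive / false negative decisions

grid = np.linspace(0.001, 0.999, 999)   # grid over true domain scores
weight = beta.pdf(grid, a, b)
weight /= weight.sum()                   # discretized prior

def expected_loss(c):
    """Expected threshold loss of the rule 'declare mastery when X >= c'."""
    p_pass = 1.0 - binom.cdf(c - 1, n, grid)   # P(X >= c | domain score)
    loss = np.where(grid < pi0,
                    l_fp * p_pass,             # passing a true non-master
                    l_fn * (1.0 - p_pass))     # failing a true master
    return float(np.sum(weight * loss))

best = min(range(n + 1), key=expected_loss)
print("optimal cut-off score:", best)
```

Raising the false negative loss relative to the false positive loss pushes the optimal cut-off score downward, which is exactly the kind of trade-off the loss function is meant to make explicit.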

All these applications of decision theory to the optimal cut-off score problem can be used with a Bayesian as well as an empirical Bayes interpretation. Fully Bayesian approaches are presented and discussed in Hambleton, Hutten, and Swaminathan (1976), Lewis, Wang, and Novick (1975), and Swaminathan, Hambleton, and Algina (1975).

It is also possible to base the making of mastery decisions on the Neyman-Pearson theory of testing statistical hypotheses. Then no explicit loss functions are chosen, but the consequences of false positive and false negative errors are evaluated by determining the sizes of the type I and type II errors of testing the hypothesis of mastery. Approaches along this line, all based on binomial error models, can be found in Fhaner (1974), Klauer (1972), Kriewall (1972), Millman (1973), van den Brink and Koele (1980), and Wilcox (1976).
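
The computation involved is again easy to sketch under the binomial error model; the test length and the two reference points on the true score scale below are illustrative assumptions, not values taken from the papers just cited.

```python
from scipy.stats import binom

n = 20                 # test length (assumed)
pi0, pi1 = 0.80, 0.60  # mastery score and a clear non-mastery point (assumed)

# For each candidate cut-off c ('pass' means X >= c), evaluate the two
# error rates of testing the hypothesis of mastery under the binomial model.
for c in range(13, 19):
    false_pos = 1 - binom.cdf(c - 1, n, pi1)  # P(pass | true score pi1)
    false_neg = binom.cdf(c - 1, n, pi0)      # P(fail | true score pi0)
    print(f"c = {c}: false positive {false_pos:.3f}, false negative {false_neg:.3f}")
```

One then picks the smallest cut-off score whose false positive rate does not exceed the chosen significance level, and inspects (or controls by lengthening the test) the resulting false negative rate.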

A minimax solution to the optimal cut-off score problem, which resembles the (empirical) Bayes approach in that the losses must be specified explicitly but does not assume the availability of (subjective) information about true scores, is presented in Huynh (1980a).

The problem of the analysis of mastery decisions was also addressed by Hambleton and Novick (1973). They proposed to determine the reliability of mastery decisions by assessing the consistency of decisions from test-retest or parallel test administrations using the coefficient of agreement. Similarly, an independently measured criterion could be used for the determination of the validity of mastery decisions. Swaminathan, Hambleton, and Algina (1974) suggested replacing this coefficient, which was already proposed for this purpose by Carver (1970), with the chance-corrected coefficient kappa of Cohen (1960). These two coefficients presuppose the availability of two test administrations. Both Huynh (1976) and Subkoviak (1976) presented a single administration method of estimating the coefficients, which were derived using the beta-binomial model and an equivalent set of assumptions, respectively. Another single administration method was given by Marshall and Haertel (1976). All these methods have been extensively examined and compared with each other: Algina and Noe (1978), Berk (1980a), Divgi (1980), Huynh (1978, 1979), Huynh and Saunders (1980), Marshall and Serlin (1979), Peng and Subkoviak (1980), Subkoviak (1978, 1980), Traub and Rowley (1980), and Wilcox (1979e).
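
Both two-administration coefficients are straightforward to compute; the sketch below uses invented decision vectors for illustration.

```python
import numpy as np

def decision_consistency(d1, d2):
    """Observed agreement p0 and Cohen's kappa for two sets of binary
    mastery decisions (1 = master, 0 = non-master) on the same students."""
    d1, d2 = np.asarray(d1), np.asarray(d2)
    p0 = np.mean(d1 == d2)                                          # raw agreement
    pe = d1.mean() * d2.mean() + (1 - d1.mean()) * (1 - d2.mean())  # chance agreement
    kappa = (p0 - pe) / (1 - pe)
    return p0, kappa

# decisions from a test-retest administration (illustrative data)
p0, kappa = decision_consistency([1, 1, 0, 1, 0, 0, 1, 1],
                                 [1, 0, 0, 1, 0, 1, 1, 1])
print(f"p0 = {p0:.2f}, kappa = {kappa:.2f}")
```

Kappa corrects p0 for the agreement expected from the two sets of marginal proportions alone, which is why it was proposed as the chance-corrected alternative.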

The idea that the "reliability" of decisions can be determined via their consistency goes back to the assumption that classical test theory holds for decisions just as it does for measurements. Mellenbergh and van der Linden (1979) have shown that the assumption is incorrect and that test-retest or parallel test consistency of decisions does not necessarily reflect their accuracy. They recommend the use of coefficient delta (van der Linden & Mellenbergh, 1978) which indicates how optimal the decisions actually made are with respect to the true mastery and non-mastery states. Comparable approaches have been undertaken by de Gruijter (1978), Livingston and Wingersky (1979), and Wilcox (1978a).

Also relevant to the issue of mastery testing with continuum models are techniques for setting mastery scores. These techniques all somehow translate the learning objectives into a mastery score on the true score continuum underlying the test. Hence, they should be distinguished from the decision-theoretic approaches mentioned above which, once the mastery score on the continuum has been set, indicate how to optimally select the cut-off score on the test. Several techniques of setting mastery scores have been proposed. Reviews of these techniques are given, e.g., in Glass (1978), Hambleton (1980), Jaeger (1979), and Shepard (1979, 1980). An approach accounting for possible uncertainty in setting standards is given in de Gruijter (1980).

CONCLUDING REMARKS

In this paper we have presented an overview of the way criterion-referenced measurement is used in modern instructional programs, what its main problem areas are, and how these have been approached. The purpose of the paper was to summarize developments and results and to provide an introduction to the original literature.

It should be noted, however, that not all aspects of criterion-referenced measurement have been considered. For example, we did not refer to developments in the area of criterion-referenced item writing (Millman, 1980; Popham, 1980; Roid & Haladyna, 1980), developing guidelines for evaluating criterion-referenced tests and their manuals (Hambleton & Eignor, 1978), estimating domain scores and binomial model parameters (Hambleton, Hutten, & Swaminathan, 1976; Jackson, 1972; Novick, Lewis, & Jackson, 1972; Wilcox, 1978b, 1979d), determining the length of a mastery test (Fhaner, 1974; Hsu, 1980; Klauer, 1972; Kriewall, 1972; Millman, 1973; Novick & Lewis, 1974; van den Brink & Koele, 1980; van der Linden, 1980b; Wilcox, 1976, 1979b, 1980a, 1980b), and estimating proportions of misclassification in mastery testing (Huynh, 1980; Koppelaar, van der Linden, & Mellenbergh, 1977; Mellenbergh, Koppelaar, & van der Linden, 1977; Subkoviak & Wilcox, 1978; Wilcox, 1977, 1979c). Several of these topics are also dealt with in the review by Hambleton et al. (1978) referred to earlier and in the recent Applied Psychological Measurement (1980) special issue on criterion-referenced measurement.

This paper has thus not touched on all aspects of criterion-referenced measurement. Yet, we hope it has given an impression of the directions the field of criterion-referenced measurement is taking. Three "historical" papers have guided us in exploring the field. We were able to observe that a great variety of solutions have been formulated in response to these papers. Most of the solutions promise important improvements of educational testing technology and, thereby, of educational practice.

REFERENCES

Algina, J. & Noe, M.J. A study of the accuracy of Subkoviak's single administration estimate of the coefficient of agreement using two true-score estimates. Journal of Educational Measurement, 15, 101-110, 1978.

Applied Psychological Measurement. Methodological contributions to criterion-referenced testing technology, 4(4), 1980.

Atkinson, R.C. Computer based instruction in initial reading. In Proceedings of the 1967 Invitational Conference on Testing Problems. Princeton, N.J.: Educational Testing Service, 1968.

Bergan, J.R., Cancelli, A.A., & Luiten, J.W. Mastery assessment with latent class and quasi-independence models representing homogeneous item domains. Journal of Educational Statistics, 5, 65-82, 1980.

Berk, R.A. A consumers' guide to criterion-referenced test reliability. Journal of Educational Measurement, 17, 323-350, 1980a.

Berk, R.A. Item analysis. In R.A. Berk (Ed.), Criterion-referenced measurement: The state of the art. Baltimore, MD: The Johns Hopkins University Press, 1980b.

Besel, R. Using group performance to interpret individual responses to criterion-referenced tests. Paper presented at the Annual Meeting of the American Educational Research Association, New Orleans, Louisiana, 1973. (EDRS No. ED 076 658)

Birnbaum, A. Some latent trait models and their use in inferring an examinee's ability. In F.M. Lord & M.R. Novick, Statistical theories of mental test scores. Reading, Mass.: Addison-Wesley, 1968.

Block, J.H. Criterion-referenced measurements: Potential. School Review, 79, 289-298, 1971a.

Block, J.H. (Ed.) Mastery learning: Theory and practice. New York: Holt, Rinehart and Winston, 1971b.

Bloom, B.S. Learning for mastery. Evaluation Comment, 1(2), 1968.

Bloom, B.S., Hastings, J.T., & Madaus, G.F. Handbook on formative and summative evaluation of student learning. New York, N.Y.: McGraw-Hill, 1971.

Brennan, R.L. Applications of generalizability theory. In R.A. Berk (Ed.), Criterion-referenced measurement: The state of the art. Baltimore, MD: The Johns Hopkins University Press, 1980.

Brennan, R.L. & Kane, M.T. An index of dependability for mastery tests. Journal of Educational Measurement, 14, 277-289, 1977a.

Brennan, R.L. & Kane, M.T. Signal/noise ratios for domain-referenced tests. Psychometrika, 42, 609-625, 1977b.

Brennan, R.L. & Stolurow, L.M. An empirical decision process for formative evaluation. Research Memorandum No. 4. Harvard CAI Laboratory, Cambridge, Mass., 1971. (EDRS No. ED 048 343)

Carver, R.P. Two dimensions of tests: Psychometric and edumetric. American Psychologist, 29, 512-518, 1974.

Cohen, J. A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20, 37-46, 1960.

Coulson, D.B. & Hambleton, R.K. On the validation of criterion-referenced tests designed to measure individual mastery. Paper presented at the Annual Meeting of the American Psychological Association, New Orleans, 1974.

Cox, R.C. Evaluative aspects of criterion-referenced measures. In W.J. Popham (Ed.), Criterion-referenced measurement: An introduction. Englewood Cliffs, N.J.: Educational Technology Publications, 1971.

Cox, R.C. & Graham, G.T. The development of a sequentially scaled achievement test. Journal of Educational Measurement, 3, 147-150, 1966.

Cox, R.C. & Vargas, J.S. A comparison of item selection techniques for norm-referenced and criterion-referenced tests. Paper presented at the Annual Meeting of the National Council on Measurement in Education, Chicago, Illinois, 1966. (EDRS No. ED 010 517)

Cronbach, L.J. & Gleser, G.C. Psychological tests and personnel decisions (2nd Ed.). Urbana, Ill.: University of Illinois Press, 1965.

Cronbach, L.J. & Snow, R.E. Aptitudes and instructional methods: A handbook for research on interactions. New York: Irvington Publishers, Inc., 1977.

Dayton, C.M. & Macready, G.B. A probabilistic model for validation of behavioral hierarchies. Psychometrika, 41, 189-204, 1976.

Dayton, C.M. & Macready, G.B. A scaling model with response errors and intrinsically unscalable respondents. Psychometrika, 45, 343-356, 1980.

de Gruijter, D.N.M. A criterion-referenced point biserial correlation coefficient. Tijdschrift voor Onderwijsresearch, 3, 257-261, 1978.

de Gruijter, D.N.M. Accounting for uncertainty in performance standards. Paper presented at the Fourth International Symposium on Educational Testing, Antwerp, Belgium, June 24-27, 1980.

Divgi, D.R. Group dependency of some reliability indices for mastery tests. Applied Psychological Measurement, 4, 213-218, 1980.

Ebel, R.L. Content standard test scores. Educational and Psychological Measurement, 22, 15-25, 1962.

Ebel, R.L. Criterion-referenced measurements: Limitations. School Review, 79, 282-288, 1971.

Edmonston, L.P. & Randall, R.L. A model for estimating the reliability and validity of criterion-referenced measures. Paper presented at the Annual Meeting of the American Educational Research Association, Chicago, Illinois, 1972. (EDRS No. ED 065 591)

Emrick, J.A. An evaluation model for mastery testing. Journal of Educational Measurement, 8, 321-326, 1971.

Emrick, J.A. & Adams, F.N. An evaluation model for individualized instruction (Report RC 2674). Yorktown Hts., N.Y.: IBM Thomas J. Watson Research Center, 1969.

Fhaner, S. Item sampling and decision-making in achievement testing. British Journal of Mathematical and Statistical Psychology, 27, 172-175, 1974.

Flanagan, J.C. Units, scores and norms. In E.F. Lindquist (Ed.), Educational measurement. Washington, D.C.: American Council on Education, 1951.

Flanagan, J.C. Functional education for the seventies. Phi Delta Kappan, 49, 27-32, 1967.

Glaser, R. Instructional technology and the measurement of learning outcomes: Some questions. American Psychologist, 18, 519-521, 1963.

Glaser, R. Adapting the elementary school curriculum to individual performance. In Proceedings of the 1967 Invitational Conference on Testing Problems. Princeton, N.J.: Educational Testing Service, 1968.

Glaser, R. & Cox, R.C. Criterion-referenced testing for the measurement of educational outcomes. In R.A. Weisgerber (Ed.), Instructional process and media innovation. Chicago: Rand McNally, 1968.

Glaser, R. & Klaus, D.H. Proficiency measurement: Assessing human performance. In R. Gagne (Ed.), Psychological principles in system development. New York: Holt, Rinehart & Winston, 1962.

Glaser, R. & Nitko, A.J. Measurement in learning and instruction. In R.L. Thorndike (Ed.), Educational measurement. Washington, D.C.: American Council on Education, 1971.

Glass, G.V. Standards and criteria. Journal of Educational Measurement, 15, 237-261, 1978.

Haladyna, T.M. Effects of different examples on item and test characteristics of criterion-referenced tests. Journal of Educational Measurement, 11, 93-99, 1974a.

Haladyna, T.M. An investigation of full- and subscale reliabilities of criterion-referenced tests. Paper presented at the Annual Meeting of the American Educational Research Association, Chicago, Illinois, 1974b. (EDRS No. ED 091 435)

Hambleton, R.K. Testing and decision-making procedures for selected individualized instructional programs. Review of Educational Research, 44, 371-400, 1974.

Hambleton, R.K. Test score validity and standard-setting methods. In R.A. Berk (Ed.), Criterion-referenced measurement: The state of the art. Baltimore, MD: The Johns Hopkins University Press, 1980.

Hambleton, R.K. & Eignor, D.R. Guidelines for evaluating criterion-referenced tests and test manuals. Journal of Educational Measurement, 15, 321-327, 1978.

Hambleton, R.K. & Gorth, W.P. Criterion-referenced testing: Issues and applications (Report No. TR-13). Amherst: University of Massachusetts, School of Education, 1971. (EDRS No. ED 060 025)

Hambleton, R.K., Hutten, L.R., & Swaminathan, H. A comparison of several methods for assessing student mastery in objectives-based instructional programs. Journal of Experimental Education, 45, 57-64, 1976.

Hambleton, R.K. & Novick, M.R. Toward an integration of theory and method for criterion-referenced tests. Journal of Educational Measurement, 10, 159-170, 1973.

Hambleton, R.K., Swaminathan, H., Algina, J., & Coulson, D.B. Criterion-referenced testing and measurement: A review of technical issues and developments. Review of Educational Research, 48, 1-47, 1978.

Harris, C.W. An interpretation of Livingston's reliability coefficient for criterion-referenced tests. Journal of Educational Measurement, 9, 27-29, 1972.

Harris, N.D.C. An index of effectiveness for criterion-referenced items used in pre-tests and post-tests. Programmed Learning and Educational Technology, 11, 125-132, 1974.

Henrysson, S. & Wedman, I. Some problems in construction and evaluation of criterion-referenced tests. Scandinavian Journal of Educational Research, 18, 1-12, 1973.

Herbig, M. Zur Vortest-Nachtest-Validierung lehrzielorientierter Tests. Zeitschrift für Erziehungswissenschaftliche Forschung, 9, 112-126, 1975.

Herbig, M. Item analysis by use in pretests and posttests: A comparison of different coefficients. Programmed Learning and Educational Technology, 13, 49-54, 1976.

Hively, W. Introduction to domain-referenced testing. Educational Technology, 14, 5-10, 1974.

Hively, W., Maxwell, G., Rabehl, G., Sension, D., & Lundin, S. Domain-referenced curriculum evaluation: A technical handbook and a case study from the Minnemast Project (CSE Monograph Series in Evaluation No. 1). Los Angeles: Center for the Study of Evaluation, 1973.

Hively, W., Patterson, H.L., & Page, S.H. A universe-defined system of arithmetic achievement tests. Journal of Educational Measurement, 5, 275-290, 1968.

Hsu, L.M. Determination of the number of items and passing score in a mastery test. Educational and Psychological Measurement, 40, 709-714, 1980.
Huynh, H. On the reliability of decisions in domain-referenced testing. Journal of Educational Measurement, 13, 253-264, 1976a.

Huynh, H. Statistical considerations of mastery scores. Psychometrika, 41, 65-79, 1976b.

Huynh, H. Reliability of multiple classifications. Psychometrika, 43, 317-326, 1978.

Huynh, H. Statistical inference for two reliability indices in mastery testing based on the beta-binomial model. Journal of Educational Statistics, 4, 231-246, 1979.

Huynh, H. A nonrandomized minimax solution for passing scores in the binomial error model. Psychometrika, 45, 167-182, 1980a.

Huynh, H. Statistical inference for false positive and false negative error rates in mastery testing. Psychometrika, 45, 107-120, 1980b.

Huynh, H. & Saunders, J.C. Accuracy of two procedures for estimating reliability of mastery tests. Journal of Educational Measurement, 17, 351-358, 1980.

Jackson, P.H. Simple approximations in the estimation of many parameters. British Journal of Mathematical and Statistical Psychology, 25, 213-229, 1972.

Jaeger, R.M. Measurement consequences of selected standard-setting models. In M.A. Bunda and J.R. Sanders (Eds.), Practices and problems in competency-based measurement. Washington, D.C.: National Council on Measurement in Education, 1979.

Kane, M.T. & Brennan, R.L. Agreement coefficients as indices of dependability for domain-referenced tests. Applied Psychological Measurement, 4, 105-126, 1980.

Klauer, K.J. Zur Theorie und Praxis des binomialen Modells lehrzielorientierter Tests. In K.J. Klauer, R. Fricke, M. Herbig, H. Rupprecht, & F. Schott, Lehrzielorientierte Tests. Düsseldorf: Schwann, 1972.

Koppelaar, H., van der Linden, W.J., & Mellenbergh, G.J. A computer program for classification proportions in dichotomous decisions based on dichotomously scored items. Tijdschrift voor Onderwijsresearch, 2, 32-37, 1977.

Kosecoff, J.B. & Klein, S.P. Instructional sensitivity statistics appropriate for objective-based test items (CSE Report No. 91). Los Angeles, Cal.: University of California, Center for the Study of Evaluation, 1974.
Kriewall, T.E. Aspects and applications of criterion-referenced tests. Illinois School Research, 9, 5-18, 1972.

Lewis, C., Wang, M., & Novick, M.R. Marginal distributions for the estimation of proportions in m groups. Psychometrika, 40, 63-75, 1975.

Linn, R.L. Issues of reliability in measurement for competency-based programs. In M.A. Bunda and J.R. Sanders (Eds.), Practices and problems in competency-based measurement. Washington, D.C.: National Council on Measurement in Education, 1979.

Linn, R.L. Issues of validity for criterion-referenced measures. Applied Psychological Measurement, 4, 547-561, 1980.

Livingston, S.A. Criterion-referenced applications of classical test theory. Journal of Educational Measurement, 9, 13-26, 1972a.

Livingston, S.A. A reply to Harris' "An interpretation of Livingston's reliability coefficient for criterion-referenced tests". Journal of Educational Measurement, 9, 31, 1972b.

Livingston, S.A. Reply to Shavelson, Block, and Ravitch's "Criterion-referenced testing: Comments on reliability". Journal of Educational Measurement, 9, 139-140, 1972c.

Livingston, S.A. A note on the interpretation of the criterion-referenced reliability coefficient. Journal of Educational Measurement, 10, 311, 1973.

Livingston, S.A. A utility-based approach to the evaluation of pass/fail testing decision procedures (COPA-75-01). Princeton, N.J.: Center for Occupational and Professional Assessment, Educational Testing Service, 1975.

Livingston, S.A. & Wingersky, M. Assessing the reliability of tests used to make pass/fail decisions. Journal of Educational Measurement, 16, 247-260, 1979.

Lord, F.M. Applications of item response theory to practical testing problems. Hillsdale, New Jersey: Lawrence Erlbaum Associates, Publ., 1980.

Lord, F.M. & Novick, M.R. Statistical theories of mental test scores. Reading, Mass.: Addison-Wesley, 1968.

Lovett, H.T. Criterion-referenced reliability estimated by ANOVA. Educational and Psychological Measurement, 37, 21-29, 1977.

Macready, G.B. & Dayton, C.M. The use of probabilistic models in the assessment of mastery. Journal of Educational Statistics, 2, 99-120, 1977.
Macready, G.B. & Dayton, C.M. The nature and use of state mastery models. Applied Psychological Measurement, 4, 493-516, 1980a.

Macready, G.B. & Dayton, C.M. A two-stage conditional estimation procedure for unrestricted latent class models. Journal of Educational Statistics, 5, 129-156, 1980b.

Marshall, J.L. & Haertel, E.H. A single-administration reliability index for criterion-referenced tests: The mean split-half coefficient of agreement. Paper presented at the Annual Meeting of the American Educational Research Association, Washington, D.C., 1975.

Marshall, J.L. & Serlin, R.C. Characteristics of four mastery test reliability indices: Influence of distribution shape and cutting score. Paper presented at the Annual Meeting of the American Educational Research Association, San Francisco, California, April 8-12, 1979.

Mellenbergh, G.J., Koppelaar, H., & van der Linden, W.J. Dichotomous decisions based on dichotomously scored items: A case study. Statistica Neerlandica, 31, 161-169, 1977.

Mellenbergh, G.J. & van der Linden, W.J. The internal and external optimality of decisions based on tests. Applied Psychological Measurement, 3, 257-273, 1979.

Meskauskas, J.A. Evaluation models for criterion-referenced testing: Views regarding mastery and standard-setting. Review of Educational Research, 46, 133-158, 1976.

Millman, J. Passing scores and test lengths for domain-referenced measures. Review of Educational Research, 43, 205-216, 1973.

Millman, J. Criterion-referenced measurement. In W.J. Popham (Ed.), Evaluation in education. Berkeley, California: McCutchan, 1974.

Millman, J. Computer-based item generation. In R.A. Berk (Ed.), Criterion-referenced measurement: The state of the art. Baltimore, MD: The Johns Hopkins University Press, 1980.

Millman, J. & Popham, W.J. The issue of item and test variance for criterion-referenced tests: A clarification. Journal of Educational Measurement, 11, 137-138, 1974.

Nitko, A.J. Distinguishing the many varieties of criterion-referenced tests. Review of Educational Research, 50, 461-485, 1980.

Nitko, A.J. & Hsu, T.L. Using domain-referenced tests for student placement, diagnosis and attainment in a system of adaptive, individualized instruction. Educational Technology, 14, 48-54, 1974.

Novick, M.R. & Lewis, C. Prescribing test length for criterion-referenced measurement. In C.W. Harris, M.C. Alkin, & W.J. Popham, Problems in criterion-referenced measurement (CSE Monograph Series in Evaluation, No. 3). Los Angeles: Center for the Study of Evaluation, University of California, 1974.

Novick, M.R., Lewis, C., & Jackson, P.H. The estimation of proportions in m groups. Psychometrika, 38, 19-46, 1972.

Novick, M.R. & Lindley, D.V. The use of more realistic utility functions in educational applications. Journal of Educational Measurement, 15, 181-191, 1978.

Osburn, H.G. Item sampling for achievement testing. Educational and Psychological Measurement, 28, 95-104, 1968.

Peng, C . - Y . J . & S u b k o v i a k , M.J. A note on H u y n h ' s normal approximation


for" estimating c r i t e r i o n - r e f e r e n c e d reliability. Journal of Educational
Measurement, 17, 359-368, 1980.

Plomp, Tj. Enkele methodologische en statistische aspecten van A T I -


onderzoek. In A p t i t u d e Treatment I n t e r a c t ion (VOR Publikatie nr. 5).
Amsterdam: V e r e n i g i n g voor O n d e r w i j s r e s e a r c h , 1977.

Popham, W.J. Indices of adequacy for c r i t e r i o n - r e f e r e n c e d test items. In


W.J. Popham f E d . ) , C r i t e r i o n - r e f e r e n c e d measurement: An i n t r o d u c t i o n .
Englewood C l i f f s , N . J . : Educational T e c h n o l o g y Publications, 1971.

Popham, W.J. C r i t e r i o n - r e f e r e n c e d measurement. Englewood Cliffs, N.J.:


Prentice-Hall, 1978.

Popham, W.J. Domain specification strate9ies. In R . A . Berk (Ed.),


C r i t e r i o n - r e f e r e n c e d measurement: The state of the art. Baltimore, MD:
The Johns Hopkins U n i v e r s i t y Press, 1980.

Popham, W.J. & Husek, T.R. Implications of criterion-referenced


measurement. Journal of Educational Measurement, 6, 1-9, 1969.

Reulecke, W.A. Ein Modell f u r die k r i t e r i u m o r i e n t i e r t e Testauswertung.


Z e i t s c h r i f t f u r Empirische Padagogik, 1, 49-72, 1977a.

Reulecke, W.A. A statistical analysis of deterministic theories. In H. Spada


& W.F. Kempf (Eds.), Structural models of thinking and learning. Bern,
Switzerland: Hans Huber Publishers, 1977b.

Roid, G. & Haladyna, T. The emergence of an item-writing technology.


Review of Educational Research, 50, 293-315, 1980.

Roudabush, G.E. Item selection for criterion-referenced tests. Paper


presented at the Annual Meeting of the American Educational Research
Association, N e w Orleans, Louisiana, 1973, ( E D R S No. ED 074 147).
Criterion - Referenced Measurement 115
Rovinelli, R.J. 8 Hambleton, R.K. On the use of content specialists in the
assessment of criterion-referenced test item v a l i d i t y . T i j d s c h r i f t voor
Onderwijsresearch, 2, 49-60, 1977.

Saupe, J.L. Selecting items to measure change. Journal of Educational


Measurement, 3, 223-228, 1966.

Shavelson, R., Block, J . , & Ravitch, M. Criterion-referenced testing:


Comments on reliability. Journal of Educational Mesurement, 9, 133-138,
1972.

Shepard, L.A. Setting standards. In M.A. Bunda and J.R. Sanders


(Eds.), Practices and problems in competency-based measurement.
Washington, D.C.: National Council on Measurement in Education, 1979.

Shepard, L. Standard setting issues and methods. Applied Psychological


Measurement, zl, 447-467, 1980.

Simon, G.B. Comments on "Implications of criterion-referenced measurement".


Journal of Educational Measurement, 6, 259-260, 1969.

Subkoviak, M.J. Estimating reliability from a single administration of a


criterion - referenced test. Journal of Educational Measurement, 13,
265-276, 1976.

Subkoviak, M.J. Empirical investigation of procedures for estimating reliability of mastery tests. Journal of Educational Measurement, 15, 111-116, 1978.

Subkoviak, M.J. Decision-consistency approaches. In R.A. Berk (Ed.), Criterion-referenced measurement: The state of the art. Baltimore, MD: The Johns Hopkins University Press, 1980.

Subkoviak, M.J. & Wilcox, R. Estimating the probability of correct classification in mastery testing. Paper presented at the Annual Meeting of the American Educational Research Association, Toronto, 1978.

Suppes, P. The use of computers in education. Scientific American, 215, 206-221, 1966.

Suppes, P., Smith, R., & Beard, M. University-level computer-assisted instruction at Stanford: 1975 (Technical Report No. 265). Stanford, Cal.: Institute for Mathematical Studies in the Social Sciences, Stanford University, 1975.

Swaminathan, H., Hambleton, R.K., & Algina, J. Reliability of criterion-referenced tests: A decision-theoretic formulation. Journal of Educational Measurement, 11, 263-267, 1974.

Swaminathan, H., Hambleton, R.K., & Algina, J. A Bayesian decision-theoretic procedure for use with criterion-referenced tests. Journal of Educational Measurement, 12, 87-98, 1975.
Traub, R.E. & Rowley, G.L. Reliability of test scores and decisions. Applied Psychological Measurement, 4, 517-545, 1980.

van den Brink, W.P. & Koele, P. Item sampling, guessing, and decision-making in achievement testing. British Journal of Mathematical and Statistical Psychology, 33, 104-108, 1980.

van der Linden, W.J. Forgetting, guessing, and mastery: The Macready and Dayton models revisited and compared with a latent trait approach. Journal of Educational Statistics, 3, 305-318, 1978a.

van der Linden, W.J. Het klassieke testmodel, latente trek modellen en evaluatie-onderzoek [The classical test model, latent trait models, and evaluation research] (VOR Publikatie nr. 7). Amsterdam, The Netherlands: Vereniging voor Onderwijsresearch, 1978b.

van der Linden, W.J. Criteriumgeoriënteerd toetsen [Criterion-referenced testing]. In E. Warries (Ed.), Beheersingsleren een leerstrategie [Mastery learning, a learning strategy]. Groningen, The Netherlands: Wolters-Noordhoff, 1979.

van der Linden, W.J. Decision models for use with criterion-referenced tests. Applied Psychological Measurement, 4, 469-492, 1980a.

van der Linden, W.J. Passing score and length of a mastery test. Paper presented at the 11th European Mathematical Psychology Group Meeting, Liège, Belgium, September 3-4, 1980b.

van der Linden, W.J. Simple estimates for the simple latent class mastery testing model. Paper presented at the Fourth International Symposium on Educational Testing, Antwerp, Belgium, June 24-27, 1980c.

van der Linden, W.J. Statistical aspects of optimal treatment assignment. Paper presented at the 1980 European Meeting of the Psychometric Society, Groningen, The Netherlands, June 19-21, 1980d. (EDRS No. ED 194 565)

van der Linden, W.J. A latent trait look at pretest-posttest validation of criterion-referenced test items. Review of Educational Research, 51, 379-402, 1981a.

van der Linden, W.J. Using aptitude measurements for the optimal assignment of subjects to treatments with and without mastery scores. Psychometrika, 46, in press, 1981b.

van der Linden, W.J. Estimating the parameters of Emrick's mastery testing model. Applied Psychological Measurement, 5, 517-530, 1981c.

van der Linden, W.J. On the estimation of the proportion of masters in criterion-referenced measurement. Submitted for publication, 1981d.

van der Linden, W.J. The use of moment estimators for mixtures of two binomials with one known success parameter. Submitted for publication, 1981e.
van der Linden, W.J. & Mellenbergh, G.J. Optimal cutting scores using a linear loss function. Applied Psychological Measurement, 1, 593-599, 1977.

van der Linden, W.J. & Mellenbergh, G.J. Coefficients for tests from a decision theoretic point of view. Applied Psychological Measurement, 2, 119-134, 1978.

van Naerssen, R.F. Lokale betrouwbaarheid: begrip en operationalisatie [Local reliability: Concept and operationalization]. Tijdschrift voor Onderwijsresearch, 2, 111-119, 1977a.

van Naerssen, R.F. Grafieken voor de schatting van de helling van itemkarakteristieken [Graphs for estimating the slope of item characteristics]. Tijdschrift voor Onderwijsresearch, 2, 193-201, 1977b.

Vijn, P. Prior information in linear models (Thesis). Groningen, The Netherlands: Rijksuniversiteit Groningen, 1980.

Warries, E. Beheersingsleren als een nieuwe leerstrategie: Inleiding [Mastery learning as a new learning strategy: Introduction]. In E. Warries (Ed.), Beheersingsleren een leerstrategie [Mastery learning, a learning strategy]. Groningen, The Netherlands: Wolters-Noordhoff, 1979a.

Warries, E. Het meten van de graad van beheersing [Measuring the degree of mastery]. In E. Warries (Ed.), Beheersingsleren een leerstrategie [Mastery learning, a learning strategy]. Groningen, The Netherlands: Wolters-Noordhoff, 1979b.

Wedman, I. Theoretical problems in construction of criterion-referenced tests (Educational Reports Umeå No. 3). Umeå, Sweden: Umeå University and Umeå School of Education, 1973.

Wedman, I. On the evaluation of criterion-referenced tests. In H.F. Crombag & D.N.M. de Gruijter (Eds.), Contemporary issues in educational testing. The Hague, The Netherlands: Mouton, 1974a.

Wedman, I. Reliability, validity, and discrimination measures for criterion-referenced tests (Educational Reports Umeå No. 4). Umeå, Sweden: Umeå University and Umeå School of Education, 1974b.

Wilcox, R.R. A note on the length and passing score of a mastery test. Journal of Educational Statistics, 1, 359-364, 1976.

Wilcox, R.R. Estimating the likelihood of false-positive and false-negative decisions in mastery testing: An empirical Bayes approach. Journal of Educational Statistics, 2, 289-307, 1977.

Wilcox, R.R. A note on decision theoretic coefficients for tests. Applied Psychological Measurement, 2, 609-613, 1978a.

Wilcox, R.R. Estimating the true score in the compound binomial error model. Psychometrika, 43, 245-258, 1978b.
Wilcox, R.R. Achievement tests and latent structure models. British Journal of Mathematical and Statistical Psychology, 32, 61-71, 1979a.

Wilcox, R.R. Applying ranking and selection techniques to determine the length of a mastery test. Educational and Psychological Measurement, 39, 12-22, 1979b.

Wilcox, R.R. Comparing examinees to a control. Psychometrika, 44, 55-68, 1979c.

Wilcox, R.R. Estimating the parameters of the beta-binomial distribution. Educational and Psychological Measurement, 39, 527-536, 1979d.

Wilcox, R.R. Prediction analysis and the reliability of a mastery test. Educational and Psychological Measurement, 39, 825-839, 1979e.

Wilcox, R.R. An approach to measuring the achievement or proficiency of an examinee. Applied Psychological Measurement, 4, 241-251, 1980a.

Wilcox, R.R. Determining the length of a criterion-referenced test. Applied Psychological Measurement, 4, 425-446, 1980b.

Wilcox, R.R. & Harris, C.W. On Emrick's "An evaluation model for mastery testing". Journal of Educational Measurement, 14, 215-218, 1977.

Woodson, M.I.C.E. The issue of item and test variance for criterion-referenced tests. Journal of Educational Measurement, 11, 63-64, 1974a.

Woodson, M.I.C.E. The issue of item and test variance for criterion-referenced tests. Journal of Educational Measurement, 11, 139-140, 1974b.

Wright, B.D. & Stone, M.H. Best test design: A handbook for Rasch measurement. Chicago, Ill.: MESA Press, 1979.
