

Reliability of criterion-referenced test scores


All of the approaches to reliability discussed thus far have been developed within frameworks that operationally define the level of ability (true score, universe score, IRT ability parameter) as an average of an indefinitely large number of measures. That is, an individual’s observed score on a given test is interpreted in relation to an estimate of what his average score would be if we were able to obtain a large number of measures of that ability.
In the criterion-referenced (CR) interpretation of test scores, on the other hand, an
individual’s ability is defined not in terms of the average performance of a group of individuals,
but in terms of his successful completion of tasks from a set or domain of criterion tasks, or his
performance with reference to a criterion level that defines the ability in question. In CR test
interpretation, test scores provide information not about an individual’s relative standing in a
given group, but about his relative ‘mastery’ of an ability domain.
In language testing, the most common CR test situations are those that occur in
educational programs and language classrooms, in which decisions about matters such as progress or the assigning of grades must be made. In situations such as these, achievement tests are most commonly used.
Aspects of reliability in CR tests
Although NR reliability estimates are inappropriate for CR test scores, it is not the
case that reliability is of no concern in such tests. On the contrary, consistency, stability,
and equivalence are equally important for CR tests.
Dependability of domain score estimates
One approach to CR test development is to specify a well-defined set of tasks or
items that constitute a domain, with any given test viewed as a sample of items or tasks
from that particular domain.
In order for an individual’s observed score on a given test to be interpreted as a
dependable indicator of his domain score, we must be sure that the given sample of items
or tasks is representative of that domain. Assume, for example, that we have defined a
number of abilities that constitute the domain of reading comprehension, and that we
have generated a large number of items that we believe measure these abilities.
Domain score dependability index
One approach to estimating the dependability of scores on domain-referenced tests
involves a direct application of G-theory to estimate the proportion of observed score
variance that is domain score variance (Kane and Brennan 1980; Brennan 1984).
Brown’s formula thus provides a practical means for estimating the dependability of an
obtained score on a domain-referenced test as an indicator of an individual’s domain
score.
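The general approach can be illustrated with a small sketch. The following is a minimal illustration only, not a reproduction of Kane and Brennan’s or Brown’s own procedures: it assumes a persons-by-items matrix of item scores, estimates variance components from a standard one-facet (persons crossed with items) decomposition, and forms the dependability coefficient as the ratio of person (domain score) variance to person variance plus absolute error variance. The function name and example data are invented for illustration.

```python
import numpy as np

def phi_dependability(scores):
    """Dependability (phi) coefficient for a persons x items score matrix,
    estimated from a one-facet (persons crossed with items) G-study."""
    x = np.asarray(scores, dtype=float)
    n_p, n_i = x.shape
    grand = x.mean()
    person_means = x.mean(axis=1)
    item_means = x.mean(axis=0)

    # Mean squares from the two-way persons x items decomposition
    ms_p = n_i * np.sum((person_means - grand) ** 2) / (n_p - 1)
    ms_i = n_p * np.sum((item_means - grand) ** 2) / (n_i - 1)
    resid = x - person_means[:, None] - item_means[None, :] + grand
    ms_res = np.sum(resid ** 2) / ((n_p - 1) * (n_i - 1))

    # Estimated variance components (negative estimates are set to zero)
    var_pi_e = ms_res                            # interaction + error
    var_p = max((ms_p - ms_res) / n_i, 0.0)      # persons (domain scores)
    var_i = max((ms_i - ms_res) / n_p, 0.0)      # items

    # For domain-referenced (absolute) decisions, item variance is also error
    abs_error = (var_i + var_pi_e) / n_i
    return var_p / (var_p + abs_error)

# Illustrative data: 5 test takers x 4 dichotomously scored items
data = [[1, 1, 1, 0],
        [1, 0, 1, 1],
        [0, 0, 1, 0],
        [1, 1, 1, 1],
        [0, 1, 0, 0]]
print(round(phi_dependability(data), 3))
```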
Criterion-referenced standard error of measurement
An individual-specific estimate of the criterion-referenced SEM is
SEmeas(x_j), which is computed as follows (Berk 1984):

$$SE_{meas}(x_j) = \sqrt{\frac{x_j\,(n - x_j)}{n - 1}}$$

where x_j is the observed test score for a given individual and n is equal to the
number of items on the test. The disadvantages are that individual estimates will be lower
for extremely high or low scores, and higher for scores around the mean, and that
calculations are tedious since a separate estimate must be calculated for each observed
test score. Nevertheless, the SEmeas(xj) is the appropriate estimate for computing a band
score for the cut-off score (Berk 1984b).
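As a concrete illustration of the formula above, the short sketch below computes SEmeas(x_j) for several raw scores on a hypothetical 40-item test and forms a band of plus or minus one SEM around an observed score; the function names and figures are illustrative only.

```python
import math

def cr_sem(x_j, n):
    """Individual-specific criterion-referenced SEM for raw score x_j
    on an n-item test: sqrt(x_j * (n - x_j) / (n - 1))."""
    return math.sqrt(x_j * (n - x_j) / (n - 1))

def band_score(x_j, n, z=1.0):
    """Band of plus or minus z SEMs around the observed score."""
    se = cr_sem(x_j, n)
    return (x_j - z * se, x_j + z * se)

# The estimate is largest near the middle of the score range and
# smallest at the extremes, as noted above (hypothetical 40-item test):
for score in (5, 20, 35, 40):
    print(score, round(cr_sem(score, 40), 2))
```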
Dependability of mastery/nonmastery classifications
In such contexts, ‘mastery/nonmastery’ classification decisions about individuals are often made in terms of a predetermined level of ‘mastery’, or ‘minimum competence’, which may correspond to an observed ‘cut-off’ score on a CR test. For example, the decision to
advance an individual to the next unit of instruction, or to permit him to graduate from a
language program, might be contingent upon his scoring at least 80 per cent on an achievement
test.
As with any decision, there is a chance that mastery/nonmastery classification decisions
will be in error, and such errors will result in certain costs, in terms of lost time, misallocated
resources, and so forth. Because of the costs associated with errors of classification, it is
extremely important that we be able to estimate the probability that the mastery/nonmastery
classifications we make will be correct.
Cut-off score
There are many different approaches to setting cut-off scores, and a complete discussion of these is beyond the scope of this chapter. However, an appreciation of the difference between mastery levels and cut-off scores, as well as familiarity with the types of mastery/nonmastery
classification errors and their relative seriousness, is relevant to both determining the most
appropriate cut-off score and estimating the dependability of classifications based on a given
cutoff score, and a brief overview of these will be presented.
Mastery level and cut-off score
Mastery level for a given language ability can be understood as the domain score that is considered to be indicative of minimal competence (for a given purpose) in that ability. The cut-off score, by contrast, is the observed score on a particular test that is taken to correspond to this mastery level. This distinction between the domain score that corresponds to minimum competency, or mastery, and
its corresponding observed, cut-off score is important, since the cut-off score, as with any
observed score, will be subject to measurement error as an indicator of the domain score at
mastery level, and this measurement error will be associated with errors in mastery/nonmastery
classifications that are made on the basis of the cut-off score.
Classification errors
Whenever we make a mastery/nonmastery classification decision, there are two possible
types of errors that can occur. A ‘false positive’ classification error occurs when we classify the
test taker as a master when his domain score is in fact below the cut-off score. If, on the other
hand, we incorrectly classify him as a nonmaster when his domain score is above the cut-off, we
speak of a ‘false negative’ classification error.
Estimating the dependability of mastery/nonmastery classifications
The various approaches to the dependability of mastery/nonmastery classifications all
assume that individuals are categorized as masters or nonmasters on the basis of a predetermined
cut-off score. These approaches differ in terms of how they treat classification errors. Suppose, for example, that the cut-off score on a CR test were 80 per cent, and two individuals, whose domain scores are actually 75 per cent and 55 per cent, were misclassified as masters on the basis of their scores on this test. Threshold loss approaches treat these two misclassifications as equally serious, whereas squared-error loss approaches treat the error made about the individual whose domain score is 55 per cent, far below the cut-off, as the more serious of the two.
Threshold loss agreement indices
Several agreement indices that treat all misclassifications as equally serious have
been developed. The most easily understood approach is simply to give the test twice,
and then compute the proportion of individuals who are consistently classified as masters
and nonmasters on both tests (Hambleton and Novick 1973). This coefficient,
symbolized by Po, is computed as follows:

$$P_o = \frac{n_1 + n_2}{N}$$

where n_1 is the number of individuals classified as masters on both test administrations, n_2 is the number of individuals classified as nonmasters on both, and N is the total number of individuals who took the test twice.
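A minimal sketch of this computation is given below; the cut-off, the scores, and the function name are hypothetical, and an individual is classified as a master when a score equals or exceeds the cut-off.

```python
def p_o(scores_1, scores_2, cutoff):
    """Proportion of test takers classified the same way (master or
    nonmaster) on two administrations of the same test."""
    consistent = sum(
        (x1 >= cutoff) == (x2 >= cutoff)
        for x1, x2 in zip(scores_1, scores_2)
    )
    return consistent / len(scores_1)

# Hypothetical example: 8 test takers, cut-off of 8 on a 10-item test
first  = [9, 7, 8, 10, 5, 8, 6, 9]
second = [8, 8, 9,  9, 4, 7, 6, 10]
print(p_o(first, second, cutoff=8))   # 6 of 8 consistently classified -> 0.75
```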
Squared-error loss agreement indices
In contrast to these agreement indices, which treat misclassification errors as all
of equal seriousness, a number of agreement indices that treat misclassification errors near
the cut-off score as less serious than those far from the cut-off score have been
developed.

One such index, Brennan’s Φ(λ), can be computed as follows:

$$\Phi(\lambda) = 1 - \frac{1}{k - 1}\cdot\frac{\bar{x}_p\,(1 - \bar{x}_p) - s_p^2}{(\bar{x}_p - \lambda)^2 + s_p^2}$$

where x̄_p is the mean of the proportion scores, s²_p is the variance of the proportion scores, k is the number of items in the test, and λ is the cut-off score expressed as a proportion. A second agreement index, K²(x, Tx) (Livingston 1972), is derived from classical true score theory, and can be computed as follows:

$$K^2(x, T_x) = \frac{r_{xx'}\,s_x^2 + (\bar{x} - c)^2}{s_x^2 + (\bar{x} - c)^2}$$

where r_xx′ is the classical internal consistency reliability coefficient (for example, Guttman split-half, KR-20, or KR-21), s²_x is the test score variance, x̄ is the mean test score, and c is the cut-off score.
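Both indices can be computed directly from observed scores and a reliability estimate. The sketch below follows the formulas as written above; the function names and figures are illustrative, and whether the proportion-score variance is taken with the population or the sample denominator is a detail on which treatments differ, so the population form is used here purely for illustration.

```python
import statistics

def phi_lambda(prop_scores, k, lam):
    """Squared-error loss agreement index phi(lambda).

    prop_scores: observed scores expressed as proportions correct
    k:           number of items on the test
    lam:         cut-off score expressed as a proportion
    """
    mean_p = statistics.mean(prop_scores)
    var_p = statistics.pvariance(prop_scores)
    return 1 - (1 / (k - 1)) * (
        (mean_p * (1 - mean_p) - var_p) / ((mean_p - lam) ** 2 + var_p)
    )

def k_squared(raw_scores, reliability, cutoff):
    """Livingston's K2(x, Tx), from a classical reliability estimate
    (e.g. KR-20), the raw-score mean and variance, and the cut-off score."""
    mean_x = statistics.mean(raw_scores)
    var_x = statistics.pvariance(raw_scores)
    d2 = (mean_x - cutoff) ** 2
    return (reliability * var_x + d2) / (var_x + d2)

# Hypothetical 40-item test with a cut-off of 32 (80 per cent)
raw = [36, 30, 33, 28, 38, 25, 31, 35]
print(round(phi_lambda([x / 40 for x in raw], k=40, lam=0.80), 3))
print(round(k_squared(raw, reliability=0.85, cutoff=32), 3))
```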
Both types of agreement indices - threshold loss and squared-error loss - provide estimates of the probability of making correct mastery/nonmastery classifications.
Standard error of measurement in CR tests
In CR testing situations in which mastery/nonmastery classifications are to be made, the SEM can be used to determine the amount of measurement error at different cut-off scores. The SEM for different cut-off scores can be estimated by computing SEmeas(x_j) at each cut-off score. These values can then be used to compute band scores, or band interpretations, for these scores. The particular way we estimate this will depend on whether we consider all classification errors as equally serious or whether we consider incorrect decisions to differ in seriousness.
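One simple way to use such a band, sketched below under the same binomial-model form of SEmeas used earlier, is to compute the SEM at the cut-off score and flag any observed score that falls within the resulting band, where a mastery/nonmastery decision is least dependable; the function names and figures are illustrative only.

```python
import math

def sem_at(score, n_items):
    """Criterion-referenced SEM evaluated at a given (cut-off) score."""
    return math.sqrt(score * (n_items - score) / (n_items - 1))

def classify_with_band(observed, cutoff, n_items, z=1.0):
    """Classify a test taker and flag scores within z SEMs of the cut-off,
    where the classification decision is most likely to be in error."""
    se = sem_at(cutoff, n_items)
    within_band = (cutoff - z * se) <= observed <= (cutoff + z * se)
    decision = "master" if observed >= cutoff else "nonmaster"
    return decision, within_band

# Hypothetical 40-item test, cut-off of 32 (80 per cent)
print(classify_with_band(33, cutoff=32, n_items=40))   # ('master', True)
print(classify_with_band(22, cutoff=32, n_items=40))   # ('nonmaster', False)
```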
Factors that affect reliability estimates
Thus far, different sources of error have been discussed as the primary factors that
affect the reliability of tests. In addition to these sources of error, there are general
characteristics of tests and test scores that influence the size of our estimates of
reliability. An understanding of these factors will help us to better determine which
reliability estimates are appropriate to a given set of test scores, and interpret their
meaning.
Length of test
For both NR and CR tests, long tests are generally more reliable than short ones,
and all of the estimates discussed in this chapter - reliability and generalizability
coefficients, test information function, coefficients of dependability, and agreement
indices - reflect this. If we can assume that all the items or tasks included are
representative indicators of the ability being measured, the more we include, the more
adequate our sample of that ability.
Difficulty of test and test score variance
If the test is too easy or too difficult for a particular group, this will generally result in a restricted range of scores, or very little variance. Since NR reliability estimates depend on variance among individuals' scores, restricted score variance will lower these estimates. CR reliability coefficients, on the other hand, are relatively unaffected by restrictions in score variance. Consider the extreme case in which all individuals obtain scores of 100 per cent, where the mean score will be 100 per cent, and the variance will equal zero.
Cut-off score
From the above discussion it is obvious that CR agreement indices are sensitive to
differences in cut-off scores. That is, these coefficients will have different values for
different cut-off scores. In general, the greater the differences between the cut-off score
and the mean score, the greater will be the CR reliability. This is because differences
between individuals are likely to be minimal around the mean score, even for a CR test,
so that decisions made on the basis of such minimal differences are more likely to be in
error.
Systematic measurement error
One of the weaknesses of classical true score theory is that it considers all error to be
random and consequently fails to distinguish between random and systematic error. Within the
context of G-theory, systematic error can be defined as the variance component associated with a
facet whose conditions are fixed.
The effects of systematic measurement error
Two different effects are associated with systematic error: a general effect and a
specific effect (Kane 1982). The general effect of systematic error is constant for all
observations; it affects the scores of all individuals who take the test. The specific effect
varies across individuals; it affects different individuals differentially.
The effects of test method
In obtaining a measure of language ability we observe a sample of an individual’s
performance under certain conditions, which can be characterized by the test method
facets described in Chapter 4. If we can assume that test scores are affected by method
facets, this means these facets are potential sources of error for every measurement we
obtain. This presents two problems: a dilemma in choosing the type of error we want to
minimize, and ambiguity in the inferences we can make from test scores.
The presence of two sources of systematic error presents a problem in that we are
usually interested in interpreting a language test score as an indicator of an individual’s language ability, rather than as an indicator of her ability to take tests of a particular type.
If the effect of test method is sizeable, this clearly limits the validity of the test score as
an indicator of the individual’s language ability. In this case, the test developer may
choose to eliminate this test and begin developing a new test in which method facets are
less controlled.
