where xi is the observed test score for a given individual and n is equal to the
number of items on the test. The disadvantages are that individual estimates will be lower
for extremely high or low scores, and higher for scores around the mean, and that
calculations are tedious since a separate estimate must be calculated for each observed
test score. Nevertheless, the SEmeas(xi) is the appropriate estimate for computing a band
score around the cut-off score (Berk 1984b).
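To sketch how such individual estimates behave, one common model (an assumption here, since the passage does not name one) is the binomial error model, under which SEmeas(xi) = sqrt(xi(n − xi)/(n − 1)). Computed across the raw score range of a hypothetical 20-item test, it shows exactly the pattern just described:

```python
from math import sqrt

def se_meas(x: int, n: int) -> float:
    """Binomial-model standard error of measurement for an
    observed raw score x on an n-item test."""
    return sqrt(x * (n - x) / (n - 1))

n = 20  # hypothetical test length
for x in (0, 5, 10, 15, 20):
    print(x, round(se_meas(x, n), 2))
```

The estimate is zero at the extreme scores and largest at the midpoint, which is also why a separate estimate must be computed for each observed score.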
Dependability of mastery/nonmastery classifications
In such contexts, ‘mastery/nonmastery’ classification decisions about individuals are
often made in terms of a predetermined level of ‘mastery’, or ‘minimum competence’, which
may correspond to an observed ‘cut-off’ score on a CR test. For example, the decision to
advance an individual to the next unit of instruction, or to permit him to graduate from a
language program, might be contingent upon his scoring at least 80 per cent on an achievement
test.
As with any decision, there is a chance that mastery/nonmastery classification decisions
will be in error, and such errors will result in certain costs, in terms of lost time, misallocated
resources, and so forth. Because of the costs associated with errors of classification, it is
extremely important that we be able to estimate the probability that the mastery/nonmastery
classifications we make will be correct.
Cut-off score
There are many different approaches to setting cut-off scores, and a complete discussion
of these is beyond the scope of this chapter. However, an appreciation of the difference between
mastery levels and cut-off scores, as well as familiarity with the types of mastery/nonmastery
classification errors and their relative seriousness, is relevant both to determining the most
appropriate cut-off score and to estimating the dependability of classifications based on a given
cut-off score, and a brief overview of these will be presented.
Mastery level and cut-off score
Mastery level for a given language ability can be understood as the domain score that is
considered to be indicative of minimal competence (for a given purpose) in that ability. This
distinction between the domain score that corresponds to minimum competency, or mastery, and
its corresponding observed, cut-off score is important, since the cut-off score, as with any
observed score, will be subject to measurement error as an indicator of the domain score at
mastery level, and this measurement error will be associated with errors in mastery/nonmastery
classifications that are made on the basis of the cut-off score.
Classification errors
Whenever we make a mastery/nonmastery classification decision, there are two possible
types of errors that can occur. A ‘false positive’ classification error occurs when we classify the
test taker as a master when his domain score is in fact below the cut-off score. If, on the other
hand, we incorrectly classify him as a nonmaster when his domain score is above the cut-off, we
speak of a ‘false negative’ classification error.
Estimating the dependability of mastery/nonmastery classifications
The various approaches to the dependability of mastery/nonmastery classifications all
assume that individuals are categorized as masters or nonmasters on the basis of a predetermined
cut-off score. These approaches differ in terms of how they treat classification errors. Suppose,
for example, that the cut-off score on a CR test were 80 per cent, and two individuals, whose
domain scores are actually 75 per cent and 55 per cent, were misclassified as masters on the basis
of their scores on this test. A threshold loss approach would treat these two false positive errors
as equally serious, while a squared error loss approach would treat the misclassification of the
individual whose domain score is 55 per cent as the more serious error, since that score is further
from the cut-off.
Threshold loss agreement indices
Several agreement indices that treat all misclassifications as equally serious have
been developed. The most easily understood approach is simply to give the test twice,
and then compute the proportion of individuals who are consistently classified as masters
and nonmasters on both tests (Hambleton and Novick 1973). This coefficient,
symbolized by Po, is computed as follows:

    Po = (n_mm + n_nn) / N

where n_mm is the number of individuals classified as masters on both administrations,
n_nn is the number classified as nonmasters on both, and N is the total number of test
takers.
Squared error loss agreement indices
Indices of this type treat classification errors as differing in seriousness according to
their distance from the cut-off score. One such index, Φ(λ) (Brennan and Kane 1977), can
be computed from a single administration as follows:

    Φ(λ) = 1 − [1/(k − 1)] [x̄p(1 − x̄p) − s²p] / [(x̄p − λ)² + s²p]

where x̄p is the mean of the proportion scores, s²p is the variance of the proportion
scores, k is the number of items in the test, and λ is the cut-off score expressed as a
proportion. A second squared error loss agreement index, k²(x, Tx) (Livingston 1972), is
derived from classical true score theory, and can be computed as follows:

    k²(x, Tx) = [rxx′ s²x + (x̄ − c)²] / [s²x + (x̄ − c)²]

where rxx′ is the classical internal consistency reliability coefficient (for example,
Guttman split-half, KR-20, or KR-21), s²x is the variance of the observed scores, x̄ is the
mean of the observed scores, and c is the cut-off score.
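As an illustration of how the two kinds of index might be computed, consider the following sketch; the scores, the reliability estimate, and the cut-off are all invented for the example:

```python
import statistics

def p_o(scores1, scores2, cut):
    """Threshold loss agreement: proportion of test takers given the
    same mastery/nonmastery classification on two administrations."""
    agree = sum((a >= cut) == (b >= cut) for a, b in zip(scores1, scores2))
    return agree / len(scores1)

def livingston_k2(scores, reliability, cut):
    """Livingston's squared error loss index k2(x, Tx), computed from a
    classical reliability estimate, the score mean and variance, and a cut-off."""
    mean = statistics.mean(scores)
    var = statistics.pvariance(scores)
    return (reliability * var + (mean - cut) ** 2) / (var + (mean - cut) ** 2)

test1 = [85, 70, 90, 60, 82]  # hypothetical percentage scores, first administration
test2 = [88, 65, 76, 72, 80]  # the same individuals, second administration
print(p_o(test1, test2, cut=80))           # 4 of 5 classified consistently
print(livingston_k2(test1, 0.85, cut=80))  # exceeds the reliability estimate of .85
```

Note that k²(x, Tx) can never fall below the reliability coefficient it is computed from; it equals that coefficient only when the cut-off equals the mean score.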
Both types of agreement indices - threshold loss and squared error loss - provide
estimates of the probability of making correct mastery/nonmastery classifications.
Standard error of measurement in CR tests
In CR testing situations in which mastery/nonmastery classifications are to be
made, the SEM can be used to determine the amount of measurement error at different
cut-off scores. The SEM for different cut-off scores can be estimated by computing the
SEmeas(xi) for each cut-off score. These values can then be used to compute band scores,
or band interpretations, for these scores. The particular way we estimate this will depend
on whether we consider all classification errors as equally serious or whether we
consider incorrect decisions to differ in seriousness.
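A band interpretation at the cut-off might be sketched as follows, again assuming the binomial error model for SEmeas (an assumption, since other estimates could equally be used):

```python
from math import sqrt

def cutoff_band(cut: int, k: int) -> tuple[float, float]:
    """One-SEM band (cut - SEM, cut + SEM) around a raw cut-off score
    on a k-item test, using the binomial-model SEmeas at the cut-off."""
    sem = sqrt(cut * (k - cut) / (k - 1))
    return cut - sem, cut + sem

low, high = cutoff_band(16, 20)  # an 80 per cent cut-off on a 20-item test
print(low, high)
```

Observed scores falling inside the band are the ones most at risk of misclassification, and might be flagged for additional information before a final decision is made.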
Factors that affect reliability estimates
Thus far, different sources of error have been discussed as the primary factors that
affect the reliability of tests. In addition to these sources of error, there are general
characteristics of tests and test scores that influence the size of our estimates of
reliability. An understanding of these factors will help us to better determine which
reliability estimates are appropriate to a given set of test scores, and interpret their
meaning.
Length of test
For both NR and CR tests, long tests are generally more reliable than short ones,
and all of the estimates discussed in this chapter - reliability and generalizability
coefficients, test information function, coefficients of dependability, and agreement
indices - reflect this. If we can assume that all the items or tasks included are
representative indicators of the ability being measured, the more we include, the more
adequate our sample of that ability.
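For NR estimates this relationship has a classical closed form, the Spearman-Brown prophecy formula (not derived in the passage above), which predicts the reliability of a test lengthened by a factor m:

```python
def spearman_brown(r: float, m: float) -> float:
    """Predicted reliability of a test with current reliability r
    after lengthening it by a factor of m (m < 1 means shortening)."""
    return m * r / (1 + (m - 1) * r)

print(spearman_brown(0.60, 2))  # doubling a .60 test predicts .75
```

The prediction holds only under the assumption stated in the passage: the added items must be representative of the same ability as the original ones.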
Difficulty of test and test score variance
If the test is too easy or too difficult for a particular group, this will generally
result in a restricted range of scores, or very little variance. Since NR reliability
coefficients are a function of score variance, restricted variance will lower these
estimates. CR reliability coefficients, on the other hand, are relatively unaffected by
restrictions in score variance. Consider the extreme case in which all individuals obtain
scores of 100 per cent, where the mean score will be 100 per cent and the variance will
equal zero. In this case an NR reliability coefficient cannot even be computed, and yet if
the cut-off score were anything below 100 per cent, the mastery classifications based on
these scores would be perfectly consistent.
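The contrast can be sketched numerically with KR-21 as the NR estimate; the two score sets are invented, and the 20-item length is arbitrary:

```python
import statistics

def kr21(scores, k):
    """KR-21 internal consistency estimate for raw scores on a k-item test.
    Undefined (division by zero) when the score variance is zero."""
    mean = statistics.mean(scores)
    var = statistics.pvariance(scores)
    return (k / (k - 1)) * (1 - mean * (k - mean) / (k * var))

wide = [8, 11, 13, 15, 17, 19]     # well-spread scores
narrow = [14, 15, 15, 16, 16, 17]  # restricted range, same test length

# KR-21 collapses as variance shrinks, yet a cut-off of 14 still
# classifies every individual in `narrow` consistently as a master.
print(kr21(wide, 20), kr21(narrow, 20))
```

With the restricted scores the estimate drops sharply (here it even falls below zero), although the classification decisions themselves remain perfectly consistent.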
Cut-off score
From the above discussion it is obvious that CR agreement indices are sensitive to
differences in cut-off scores. That is, these coefficients will have different values for
different cut-off scores. In general, the greater the differences between the cut-off score
and the mean score, the greater will be the CR reliability. This is because differences
between individuals are likely to be minimal around the mean score, even for a CR test,
so that decisions made on the basis of such minimal differences are more likely to be in
error.
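Livingston's k², described earlier, makes this sensitivity explicit: when the cut-off equals the mean score, k² reduces to the classical reliability coefficient, and it grows toward 1 as the cut-off moves away from the mean. A sketch with hypothetical summary statistics:

```python
def k2(reliability, variance, mean, cut):
    """Livingston's k2(x, Tx) computed from summary statistics."""
    d2 = (mean - cut) ** 2
    return (reliability * variance + d2) / (variance + d2)

# Hypothetical test: reliability .80, score variance 100, mean 70.
for cut in (70, 75, 80, 90):
    print(cut, round(k2(0.80, 100.0, 70.0, cut), 3))
```

At the mean the index equals the reliability coefficient itself, reflecting the point above: classifications made near the mean rest on the smallest, least dependable score differences.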
Systematic measurement error
One of the weaknesses of classical true score theory is that it considers all error to be
random and consequently fails to distinguish between random and systematic error. Within the
context of G-theory, systematic error can be defined as the variance component associated with a
facet whose conditions are fixed.
The effects of systematic measurement error
Two different effects are associated with systematic error: a general effect and a
specific effect (Kane 1982). The general effect of systematic error is constant for all
observations; it affects the scores of all individuals who take the test. The specific effect
varies across individuals; it affects different individuals differentially.
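The two effects can be separated arithmetically in a small, invented table of scores for the same persons under two fixed test-method conditions (a sketch, not any particular study's data):

```python
# Rows are persons, columns are two fixed test-method conditions.
scores = [
    [70, 78],
    [60, 64],
    [80, 92],
]

n = len(scores)
grand = sum(sum(row) for row in scores) / (2 * n)
method_means = [sum(row[j] for row in scores) / n for j in range(2)]
person_means = [sum(row) / 2 for row in scores]

# General effect: a constant shift applied to every score under a given method.
general = [m - grand for m in method_means]

# Specific effect: the person-by-method residual, which differs by person.
specific = [
    [scores[i][j] - person_means[i] - method_means[j] + grand for j in range(2)]
    for i in range(n)
]
print(general)   # the same shift for everyone under each method
print(specific)  # varies across individuals
```

Here the second method raises every score by the same constant (the general effect), while the residuals show that it helps some individuals and hurts others (the specific effect).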
The effects of test method
In obtaining a measure of language ability we observe a sample of an individual’s
performance under certain conditions, which can be characterized by the test method
facets described in Chapter 4. If we can assume that test scores are affected by method
facets, this means these facets are potential sources of error for every measurement we
obtain. This presents two problems: a dilemma in choosing the type of error we want to
minimize, and ambiguity in the inferences we can make from test scores.
The presence of two sources of systematic error presents a problem in that we are
usually interested in interpreting a language test score as an indicator of an individual’s
language ability, rather than as an indicator of her ability to take tests of a particular type.
If the effect of test method is sizeable, this clearly limits the validity of the test score as
an indicator of the individual’s language ability. In this case, the test developer may
choose to eliminate this test and begin developing a new test in which method facets have
a smaller effect.