

1 2. ERRORS OF MEASUREMENT AND
2 RELIABILITY/PRECISION

3 BACKGROUND
4 A test, broadly defined, is a set of tasks designed to elicit responses that provide a sample
5 of an examinee's behavior or performance in a specified domain, or a system for
6 collecting samples of an individual's work in a particular area. Coupled with the test is a
7 scoring procedure that enables the scorer to evaluate the behavior or work samples and
8 generate a score. In interpreting and using such test scores, it is important to have
9 some indication of the precision or dependability of the scores. In educational and
10 psychological testing, the general notion of precision, consistency, or dependability
11 has traditionally been discussed under the heading of 'reliability', where the
12 reliability of test scores can be defined in terms of the consistency in the scores for
13 individual examinees (or groups) that would be observed if the testing procedure
14 were replicated, under the condition that the first administration had no impact on the
15 outcome of the second.

16 The term reliability has been used in two ways in the measurement literature.
17 First, the term has been used to refer to the reliability coefficients of classical test theory
18 (defined as ratios of true-score variances to observed-score variances). Second, the term
19 has been used in a more general sense, to refer to the consistency or precision of scores
20 across replications of a testing procedure, regardless of how this consistency is estimated
21 or reported (e.g., in terms of standard errors, reliability coefficients per se,
22 generalizability coefficients, error/tolerance ratios, information functions, or various
23 indices of categorical consistency). To maintain a link to the traditional notions of
24 reliability, while avoiding the ambiguity inherent in using a single, familiar term to refer
25 to a wide range of concepts and indices of precision, the term reliability/precision will
26 be used to denote the more general notion of consistency of the scores across replications,
27 and the term reliability coefficient will be used to refer to the reliability coefficients of
28 classical test theory.

29 The usefulness of behavioral measurements depends on the assumption that
30 individuals and groups exhibit some degree of consistency or stability in their
31 performance/behavior. However, successive samples of performance from the same
32 person are rarely identical. An individual's performances, products, and responses to
33 sets of test questions vary in their quality or character from one occasion to another,
34 even under strictly controlled conditions. This variation is reflected in the examinee's
35 scores, which also vary across instances of a measurement procedure. An examinee
36 may try harder, make luckier guesses, be more alert, feel less anxious, or enjoy
37 better health on one occasion than another. An examinee may have knowledge,
38 experience, or understanding that is more relevant to some tasks than to others in the
39 domain sampled by the test. Different raters of the samples of performance may be
40 more lenient or less lenient, and a given rater may be more or less lenient on different
41 occasions, or for different kinds of responses.

42 Some test takers may exhibit less variation in their scores than others, but all test
43 takers exhibit some variation in performance. Because of this variation, an individual's
44 observed scores (and the average scores of groups) will vary across replications of the
45 testing procedure. This unexplained and apparently random variability across replications
46 is referred to as measurement error, or random error.

47 The oldest and most basic way to evaluate the consistency of scores involves an
48 analysis of the variation in each individual's scores across replications of the
49 measurement procedure. We administer the test, and then, after a brief period, over which
50 the examinee's standing on the variable being measured would not be expected to
51 change, we administer the test (or a parallel test) a second time; it is assumed that the first
52 administration has no substantial influence on the second administration. Given that the
53 attribute being measured is assumed to remain the same for each examinee over the two
54 administrations and that the test administrations are independent of each other, more
55 variation across replications indicates more error, and therefore, less precision and lower
56 reliability.

57 The impact of such measurement errors can be summarized in a number of ways,
58 but typically, in educational and psychological measurement, it is conceptualized in terms
59 of the standard deviation in the scores for a person over replications of the testing
60 procedure. The standard deviation in a person's observed scores over replications is
61 referred to as the standard error of measurement for the person.

62 In most testing contexts, it is not possible to replicate the testing procedure


63 repeatedly, and therefore it is not possible to estimate the conditional standard error for
64 each person‘s score directly. Instead, the average error of measurement is estimated over
65 some population, and this average is referred to as the standard error of measurement
66 (SEM). The SEM is an indicator of a lack of precision in the scores generated by the
67 testing procedure for some population. A relatively large SEM indicates a relatively low
68 reliability/precision. The conditional standard error of measurement for a score level is
69 the standard error of measurement at that score level.
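The following sketch is a minimal numerical illustration of these definitions, not part of the Standards text; it assumes normally distributed true scores and errors, and all of the numbers are invented. Each person's standard error is the standard deviation of that person's scores over simulated replications, and the SEM summarizes these errors over the population.

```python
import numpy as np

# Illustrative simulation: observed score = true score + random error.
rng = np.random.default_rng(0)
n_persons, n_replications = 1000, 50
true_scores = rng.normal(500, 100, size=n_persons)            # hypothetical error-free values
errors = rng.normal(0, 25, size=(n_persons, n_replications))  # random measurement error
observed = true_scores[:, None] + errors                       # scores over replications

person_se = observed.std(axis=1, ddof=1)   # standard error for each person
sem = np.sqrt(np.mean(person_se ** 2))     # population summary: the SEM (close to 25 here)
print(round(float(sem), 1))
```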

70 To say that a score includes error implies that there is a hypothetical error-free
71 value that characterizes an examinee's score at the time of testing. In classical test
72 theory this error-free value is referred to as the person's true score for the test or
73 measurement procedure. It is conceptualized as the hypothetical average score over
74 an infinite series of replications of the testing procedure. In statistical terms, a
75 person's true score is an unknown parameter, and the observed score for the person is a
76 random variable that fluctuates around the true score for the person.

77 Generalizability theory provides a different framework for estimating
78 reliability/precision. While classical test theory assumes a single distribution for the errors
79 in a test taker's scores, generalizability theory allows for multiple sources of error, and
80 seeks to evaluate the contributions of these different sources of error (e.g., items, occasions,
81 raters) to the overall error. The universe score for a person is the expected value over a
82 universe of possible observations for the person, all of which are consistent with the
83 definition of the measurement procedure.

84 Item response theory addresses the basic issue of reliability/precision in terms of


85 the precision with which observed scores estimate values of a latent trait and employs
86 information functions to describe the relationship between observed scores and the latent
87 trait. Indices analogous to traditional reliability coefficients can be estimated from the
88 information functions and distributions of the latent trait.

89 In practice, the reliability/precision of the scores is typically evaluated in terms of


90 various coefficients, including reliability coefficients, generalizability coefficients,
91 error/tolerance ratios, and information functions, depending on the focus of the analysis
92 and the measurement model being used. The coefficients tend to have high values when
93 the error variance is small compared to the observed variation in the scores (or score
94 differences) to be estimated, and therefore higher values for the coefficients correspond
95 to greater precision.

96 Replications and Errors of Measurement


97 As indicated earlier, the general notion of reliability/precision is defined in terms of
98 consistency over replications of the testing procedure. The precision is high if the scores
99 for each person are consistent over replications and is low if the scores are not consistent
100 over replications. In evaluating reliability/precision, therefore, it is important to be clear
101 about what constitutes a replication of the testing procedure.

102 A replication of the testing procedure involves an independent administration of


103 the procedure to an examinee, under conditions where the attribute being measured
104 would not be expected to change. Since the attribute is not expected to change, and the
105 two administrations are assumed to be independent of each other, any difference in scores
106 across replications can be attributed to errors of measurement. For example, in assessing
107 an attribute that is not expected to change over an extended period of time (e.g., in
108 measuring a trait), scores generated on two successive days (using different test forms if
109 appropriate) would be considered replications. For a state variable (e.g., mood or
110 hunger), where fairly rapid changes are common, scores generated on two successive
111 days would not be considered replications; the scores obtained on each occasion would
112 be interpreted in terms of the value of the state variable on that occasion. For many tests
113 of knowledge or skill, the administration of different forms of a test with different
114 samples of items would be considered replications of the test; for survey instruments and
115 some personality measures, it is expected that the same questions will be used every time
116 the test is administered, and any change in wording would constitute a different "test".

117 The cardinal features of standardized tests include consistency of the test
118 materials from test taker to test taker, close adherence to stipulated procedures for test
119 administration, and use of prescribed scoring rules that can be applied with a high degree
120 of consistency. Administering the same test to all test takers under the same conditions
121 promotes fairness and facilitates comparisons of scores across individuals. Conditions of
122 observation that are fixed or standardized for the testing procedure remain the same
123 across replications. However, some aspects of any standardized testing procedure may be
124 allowed to vary. The time and place of testing, as well as the persons administering the
125 test, are generally allowed to vary. The particular questions or tasks included in the test

126 may be allowed to vary, and the persons who score the results can vary over some set of
127 qualified scorers.

128 The reliability/precision of the scores depends on how much the scores vary
129 across replications of the testing procedure, and the evaluative evidence for
130 reliability/precision should be consistent with the kinds of variability allowed in the
131 testing procedure and materials (e.g., over tasks, contexts, raters), and with the
132 assumptions built into the proposed interpretation and use of the test scores. For example,
133 if the interpretation of the scores assumes that they are at least approximately invariant
134 over occasions, then variability over occasions is a potential source of error. If the test
135 tasks vary over different forms of the test, and the observed performances are treated as a
136 sample from a domain of similar tasks, the variability in scores from one form to another
137 would be considered error. If raters are used to assign scores to responses, and all raters
138 meeting certain qualifications are considered qualified, the variability in scores over
139 qualified raters is a source of error.

140 In some cases, it may be possible to evaluate the magnitude of all major sources
141 of error in one analysis (e.g., by comparing scores on different test forms, administered
142 on different occasions and in different settings, and scored by different raters). In other
143 cases it may be more convenient or useful to accumulate evidence separately for different
144 potential sources of error (e.g., by having special studies in which students take different
145 forms of the test on different days).

146 In some cases, it may be feasible to estimate the variability over replications
147 directly (e.g., by having a number of qualified raters evaluate a sample of test
148 performances). In other cases, it may be necessary to use less direct estimates (e.g., using
149 internal consistency, the extent of agreement between different parts of one test) to
150 estimate the random error associated with form-to-form variability. In some cases, it may
151 be reasonable to assume that a potential source of variability is likely to be negligible
152 (e.g., variability across well calibrated scoring machines, variability over some changes
153 in the formatting of test forms). In other cases, it may be necessary to examine the
154 variability over actual replications directly. For example, when a test is designed
155 to reflect rate of work, reliability should be estimated by the alternate-form or test-retest
156 approach, using separately timed administrations. Split-half coefficients based on separate
157 scores from the odd-numbered and even-numbered items are known to yield inflated
158 estimates of reliability for highly speeded tests.
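As a concrete illustration of the split-half procedure mentioned above, the sketch below assumes a persons-by-items matrix of scored responses and applies the Spearman-Brown correction to the half-test correlation; as noted in the text, such an estimate would be inflated for a highly speeded test.

```python
import numpy as np

def split_half_reliability(item_scores: np.ndarray) -> float:
    """Split-half estimate for a persons x items matrix of scored item responses."""
    odd = item_scores[:, 0::2].sum(axis=1)    # scores on the odd-numbered items
    even = item_scores[:, 1::2].sum(axis=1)   # scores on the even-numbered items
    r_half = float(np.corrcoef(odd, even)[0, 1])
    return 2 * r_half / (1 + r_half)          # Spearman-Brown step-up to full test length
```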

159 In some cases, it may be possible to infer adequate reliability from other types of
160 evidence. For example, if test scores are used mainly to predict some criterion scores,
161 and the test does an adequate job in predicting the criterion, it can be inferred that the test
162 scores are reliable/precise enough for their intended use.
163
164 The definition of what constitutes a standardized test or measurement procedure
165 has broadened significantly over the last few decades. Various kinds of performance
166 assessments, simulations, and portfolio-based assessments have been developed to
167 provide measures of constructs that might otherwise be difficult to assess. Performance
168 assessments raise complex issues regarding the performance domain represented by

169 the test, what constitutes a replication, and the reliability/precision of
170 individual and group scores. Each step toward greater flexibility in the assessment
171 procedures enlarges the scope of the variations allowed in replications of the testing
172 procedure, and therefore tends to increase the measurement error. However, some of
173 these sacrifices in reliability/precision may reduce construct irrelevance or construct
174 underrepresentation and thereby improve the validity of the intended interpretations of
175 the scores for the proposed use or uses of the test.
176
177 Random and Systematic Errors
178 Random errors of measurement are generally viewed as unpredictable fluctuations in
179 scores. They are conceptually distinguished from systematic errors, which may also
180 affect the performances of individuals or groups, but in a consistent rather than a
181 random manner. For example, a systematic group error would occur as a result of
182 differences in the difficulty of test forms that have not been adequately equated. In
183 this instance, examinees who take one form may earn higher scores on average than
184 they would have gotten if they had taken the other form. Such systematic errors would
185 not generally be included in the standard error of measurement, and they are not
186 generally regarded as contributing to a lack of reliability/precision. Rather,
187 systematic errors constitute construct-irrelevant factors that reduce validity, but not
188 reliability/precision.

189 Replications of a testing procedure can typically vary along a number of


190 dimensions (occasions, settings, raters, items, and perhaps item types), and the variability
191 associated with the replications of the testing procedure is interpreted as random error.

192 Important sources of random error may be broadly categorized as those rooted
193 within the test takers and those external to them. Fluctuations in the level of an
194 examinee's motivation, interest, or attention and the inconsistent application of skills
195 are clearly internal factors that may lead to score inconsistencies. Differences among
196 testing sites in their freedom from distractions, the random effects of scorer subjectivity,
197 and variation in scorer standards are examples of external factors. The importance
198 of any particular source of variation depends on the specific conditions under which
199 the measures are taken, how performances are scored, and the interpretations made from
200 the scores. A particular factor, such as the subjectivity in scoring, may be a significant
201 source of measurement error in some assessments and a minor consideration in others.

202 Some changes in scores from one occasion to another are not regarded as error
203 (random or systematic), because they result, in part, from changes in the construct being
204 measured (e.g., due to learning or maturation that has occurred between the initial and
205 final measures). In such cases, the change in performance constitutes the phenomenon of
206 interest, and the changes would not be considered to be due to errors of
207 measurement.

208 Measurement error reduces the usefulness of test scores. It limits the extent to
209 which test results can be generalized beyond the particulars of a specific application of
210 the testing procedure. Therefore, it reduces the confidence that can be placed in the
211 results from any single measurement. Because random measurement errors are

212 inconsistent and unpredictable, they cannot be removed from observed scores.
213 However, their aggregate magnitude can be summarized in several ways, as discussed
214 below, and they can be controlled to some extent (e.g., by averaging over multiple
215 scores).

216 The standard error of measurement, as such, provides an indication of the


217 expected level of error over some population. In many cases, it is useful to have
218 estimates of the standard errors for individual examinees (or for examinees with
219 scores in certain score ranges). These conditional standard errors are difficult, if
220 not impossible, to estimate directly, because we cannot administer the testing
221 procedure to examinees repeatedly without changing their performance in
222 substantial ways, but a number of models have been proposed to estimate
223 conditional standard errors indirectly. For example, the test information functions
224 based on IRT models can be used to estimate standard errors for different values
225 of a latent ability parameter and/or for different observed scores. In using any of
226 these model-based estimates of conditional standard errors, it is important to
227 employ models that are consistent with the data.
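Many such models exist. As one simple illustration (an assumption added here, not a model endorsed by the text), a binomial-error model for number-correct scores leads to Lord's formula for the conditional standard error at each raw score:

```python
import math

def conditional_sem_binomial(raw_score: int, n_items: int) -> float:
    # Lord's binomial-error estimate: sqrt(x(n - x)/(n - 1)) for raw score x on n items.
    return math.sqrt(raw_score * (n_items - raw_score) / (n_items - 1))

print(round(conditional_sem_binomial(25, 50), 2))  # about 3.57 near the middle of the scale
print(round(conditional_sem_binomial(45, 50), 2))  # about 2.14 near the top of the scale
```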

228 Information about measurement error is essential to the proper evaluation


229 and use of the scores generated by a testing procedure. This is true whether the
230 measure is based on the responses to a specific set of questions, a portfolio of work
231 samples, the performance on a task, or the creation of an original product.

232 Evaluating Reliability/Precision


233 The ideal approach to the evaluation of reliability, or precision, would require
234 independent replication of the entire measurement process. However, only a rough or
235 partial approximation of such replication is possible in most testing situations, and
236 investigation of measurement error may require special studies that depart from routine
237 testing procedures. Nevertheless, it should be the goal of test developers to investigate
238 test reliability/precision as fully as practical considerations permit. No test developer is
239 exempt from this responsibility.

240 For most testing programs, scores are expected to generalize over different forms
241 of the test and over occasions (over some period), over testing contexts, and over raters
242 (if judgment is required in scoring). To the extent that the variability associated with any
243 of these dimensions or facets is likely to be substantial, the variability should be
244 estimated in some way. In evaluating the reliability/precision of the scores, it is
245 important to identify the major sources of error, and to estimate the magnitudes of
246 these errors, and thereby, the degree of generalizability of scores across alternate forms,
247 scorers, administrations, or other relevant dimensions.

248 The interpretation of reliability/precision analyses depends on the population


249 being tested, as the data may accurately reflect what is true of one population but
250 misrepresent what is true of another. For example, a given reliability or generalizability
251 coefficient or estimated standard error derived from scores of a nationally representative
252 sample may differ significantly from that obtained for a more homogeneous sample
253 drawn from one gender, one ethnic group, or one community. Therefore, to the extent

254 possible (i.e., if sample sizes are large enough), reliability/precision should be estimated
255 separately for all major subgroups (e.g., defined in terms of race, gender, language
256 proficiency) in the population [see the Fairness chapter].

257 Reliability/Generalizability Coefficients


258 Although it is not generally possible to obtain multiple independent scores on individuals,
259 it is often possible to get two more-or-less independent scores (e.g., scores on two
260 different forms of the test, or on different parts of a single test) on a sample of
261 individuals, and using the observed relationship between the two sets of scores (e.g., the
262 correlation coefficient) and some assumptions about the errors, it is possible to estimate
263 the SEM for the group indirectly.
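A minimal sketch of this indirect estimate, under the classical assumptions described above (the data and variable names are illustrative): the correlation between two parallel forms serves as the reliability estimate, and the SEM follows from the observed-score standard deviation.

```python
import numpy as np

def estimate_sem(form_a: np.ndarray, form_b: np.ndarray) -> float:
    """Indirect SEM estimate from scores on two parallel forms for the same sample."""
    r = float(np.corrcoef(form_a, form_b)[0, 1])   # alternate-form reliability estimate
    sd = float(form_a.std(ddof=1))                 # observed-score SD on the reporting scale
    return sd * float(np.sqrt(1.0 - r))            # SEM = SD * sqrt(1 - reliability)
```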

264 Traditionally, the consistency of test scores was evaluated mainly in terms of
265 reliability coefficients, defined as the ratio of true-score variance to observed-
266 score variance, and was estimated by computing the correlation between scores derived
267 from two replications of the testing procedure. Three broad categories of reliability
268 coefficients were recognized: (a) coefficients derived from the administration of parallel
269 forms in independent testing sessions (alternate-form coefficients); (b) coefficients
270 obtained by administration of the same instrument on separate occasions (test-
271 retest coefficients); and (c) coefficients based on the relationships/interactions
272 among scores derived from individual items or subsets of the items within a test, all
273 data accruing from a single administration (internal consistency coefficients). In
274 addition, where test scoring involves a high level of judgment, indexes of scorer
275 consistency are commonly obtained.

276 In generalizability theory, these three categories are treated as special cases
277 of a more general framework for estimating error variance in terms of the variance
278 associated with different sources of error. A generalizability coefficient is defined as
279 the ratio of universe score variance to observed score variance. Unlike traditional
280 approaches to the study of reliability, generalizability theory encourages the researcher to
281 specify and estimate components of true score variance, error variance, and observed
282 score variance and coefficients based on these estimates. Estimation is typically
283 accomplished by the application of the techniques of analysis of variance. Of special
284 interest are the separate numerical estimates of the components of overall error variance
285 (e.g., variance components for items, occasions, raters, and variance components for the
286 interactions among these potential sources of error). Such estimates permit examination
287 of the contribution of each source of error to the overall measurement error and can be
288 very helpful in identifying an effective strategy for controlling overall error
289 variance. The generalizability approach also makes possible the estimation of
290 coefficients that apply to a wide variety of potential measurement designs.
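For illustration, the following sketch performs a simple D-study calculation under an assumed crossed persons x items x raters design; the variance-component values are invented, and real applications would estimate them from data (e.g., via analysis of variance, as described above).

```python
def g_coefficient(var_p, var_pi, var_pr, var_pir, n_items, n_raters):
    """Generalizability coefficient: universe-score variance over itself plus relative error."""
    relative_error = var_pi / n_items + var_pr / n_raters + var_pir / (n_items * n_raters)
    return var_p / (var_p + relative_error)

# Hypothetical variance components from a persons x items x raters study:
print(round(g_coefficient(var_p=0.50, var_pi=0.20, var_pr=0.05, var_pir=0.25,
                          n_items=10, n_raters=2), 2))   # about 0.90
```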

291 The test information function, an important result of IRT, summarizes how
292 well the test discriminates among individuals at various levels of the ability or trait
293 being assessed. Under the IRT conceptualization for dichotomously scored items, a
294 mathematical function called the item characteristic curve or item response
295 function is used as a model to represent the increasing proportion of correct
296 responses to an item for groups at progressively higher levels of the ability or trait being

297 measured. Given an adequate database, the parameters of the characteristic curve for
298 each item in a test can be estimated. The test information function can then be
299 estimated from the parameters for the set of items in the test and can be used to
300 derive coefficients with interpretations similar to reliability coefficients.
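The sketch below illustrates this calculation under an assumed two-parameter logistic model with invented item parameters: item information a^2 P(1 - P) is summed into the test information function, and its inverse square root gives the conditional standard error of the trait estimate.

```python
import numpy as np

def test_information(theta: float, a: np.ndarray, b: np.ndarray) -> float:
    p = 1.0 / (1.0 + np.exp(-a * (theta - b)))       # item response functions (2PL)
    return float(np.sum(a ** 2 * p * (1.0 - p)))     # sum of item information

a = np.array([1.2, 0.8, 1.5, 1.0])    # hypothetical discrimination parameters
b = np.array([-0.5, 0.0, 0.5, 1.0])   # hypothetical difficulty parameters
for theta in (-1.0, 0.0, 1.0):
    info = test_information(theta, a, b)
    print(theta, round(info, 2), round(1.0 / np.sqrt(info), 2))   # conditional SE at theta
```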

301 The information function may be viewed as a mathematical statement of the


302 precision of measurement at each level of the given trait. Note that, if the IRT
303 information function is based on the results obtained using a specific set of items, on a
304 specific occasion, or in a specific context, it does not provide an indication of
305 generalizability over these facets, and like any measure of internal consistency, reflects a
306 relatively narrow conception of reliability/precision.

307 Scale-Invariant Coefficients


308 In general, reliability and generalizability coefficients are useful in comparing tests or
309 measurement procedures, particularly those that yield scores in different units or metrics.
310 However, such comparisons are rarely straightforward. Allowance must be made for
311 differences in the variability of the groups on which the coefficients are based, the
312 techniques used to obtain the coefficients, the sources of error reflected in the
313 coefficients, and the lengths of the instruments being compared.

314 Generalizability coefficients and the many coefficients included under the
315 traditional categories of reliability may appear to be interchangeable, but the different
316 coefficients may convey quite different information. A coefficient in any given category
317 may encompass errors of measurement from a highly restricted perspective, a very
318 broad perspective, or some point between these extremes. For example, a coefficient
319 may reflect error due to scorer inconsistencies but not reflect the variation over an
320 examinee's performances or products. A coefficient may reflect only the internal
321 consistency of item responses within an instrument and fail to reflect measurement
322 error associated with day-to-day changes in examinee performance.

323 It should not be inferred, however, that alternate-form or test-retest coefficients


324 based on test administrations several days or weeks apart are always preferable to
325 internal consistency coefficients. In cases where we can assume that scores are not likely
326 to change, based on past experience and/or theoretical considerations, it may be
327 reasonable to assume invariance over occasions (without conducting a test-retest study).
328 In cases where only one form of a test exists, retesting could result in an inflated
329 correlation between the first and second scores due to idiosyncratic features of the test
330 or to examinee recall of initial responses.

331 An individual's status on some attributes, such as mood or emotional state,


332 may change significantly in a short period of time. In the assessment of such
333 constructs, reliability/precision estimates should be obtained within the short period in
334 which the attribute remains stable. Therefore, for characteristics of this kind an internal
335 consistency coefficient may be preferred.

336 Coefficients (e.g., reliability, generalizability, and IRT-based coefficients) have


337 two major advantages as indices of precision. First, as indicated above, they can be used

338 to estimate standard errors (overall and/or conditional) in cases where it would not be
339 possible to do so directly. Second, coefficients (e.g., reliability and generalizability
340 coefficients), which are defined in terms of ratios of variances for scores on the same
341 scale, are invariant over linear transformations of the score scale and can be useful in
342 comparing different testing procedures based on different scales.

343 Factors Affecting Reliability/Precision


344 There are a number of factors that can have strong effects on reliability/precision, and in
345 some cases, these factors can lead to misinterpretations of the results, if not taken into
346 account.

347 First, any evaluation of reliability/precision applies to a particular assessment


348 procedure, and is likely to change if the procedure is changed in any substantial way. In
349 general, if the assessment is shortened (e.g., by decreasing the number of items or tasks),
350 the reliability is likely to decrease, and if the assessment is lengthened, the reliability is
351 likely to increase. In fact, lengthening the assessment, and thereby increasing the size of
352 the sample of tasks being employed, is an effective and commonly used method for
353 improving reliability/precision, in cases where this is called for.
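The expected effect of lengthening (or shortening) an assessment is commonly projected with the Spearman-Brown formula; the sketch below assumes that the added tasks are parallel to the existing ones, an assumption that should be checked in practice.

```python
def spearman_brown(reliability: float, length_factor: float) -> float:
    """Projected reliability when the test is lengthened by the given factor."""
    return length_factor * reliability / (1 + (length_factor - 1) * reliability)

print(round(spearman_brown(0.70, 2.0), 2))   # doubling a 0.70 test: about 0.82
print(round(spearman_brown(0.70, 0.5), 2))   # halving it: about 0.54
```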

354 Second, if the variability associated with raters is estimated for a select group of
355 raters who have been especially well trained (and were perhaps involved in the
356 development of the procedures), but raters are not so well trained in some operational
357 contexts, the error associated with rater variability in these operational settings may be
358 much higher than is indicated by the reported inter-rater reliability coefficients.
359 Similarly, if raters are still refining their performance in the early days of an extended
360 scoring window, the error associated with rater variability may be greater for examinees
361 testing early in the window than for later test-takers.

362 The reliability/precision also depends on the population for which the procedure
363 is being used. In particular, if the variability in the construct of interest in the population
364 for which scores are being generated is substantially different from what it is in the
365 population for which reliability/precision was evaluated, the reliability/precision can be
366 quite different in the two populations; as the variability in the construct being measured
367 decreases, reliability and generalizability coefficients tend to decrease, and as the
368 variability in the construct being measured increases, the coefficients tend to increase.

369 In addition, the reliability/precision can vary from one population to another, even
370 if the variability in the construct of interest in the two populations is the same. If the
371 populations have different average levels of achievement, and the test is particularly easy
372 or difficult for one population, the reliability/precision is likely to be depressed in this
373 population. The reliability can also vary from one population to another because
374 particular sources of error (rater effects, familiarity with formats and instructions, etc.)
375 have more impact in one population than they do in the other. In general, if any aspects
376 of the assessment procedures or the population being assessed are changed in an
377 operational setting, the reliability/precision should be reevaluated for that particular
378 setting.

379 Errors of Measurement


380 The standard error of measurement can be used to generate confidence intervals around
381 reported scores, and therefore, is generally more informative than a reliability or
382 generalizability coefficient once a measurement procedure has been adopted and the
383 interpretation of scores has become the user's primary concern.
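As a minimal sketch of this use (the score, SEM, and normal-error assumption are illustrative), an approximate confidence interval is formed by adding and subtracting a multiple of the SEM:

```python
def score_confidence_interval(observed_score: float, sem: float, z: float = 1.96):
    """Approximate confidence interval for a reported score, assuming normal errors."""
    return observed_score - z * sem, observed_score + z * sem

print(score_confidence_interval(510.0, 12.0))   # roughly (486.5, 533.5)
```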

384 Information about the precision of measurement at each of several widely spaced
385 score levels—that is, conditional standard errors—is usually a valuable supplement to the
386 single statistic for all score levels combined. Conditional standard errors of
387 measurement are generally more informative than a single average standard error
388 for a population. If decisions are based on test scores, and these decisions are
389 concentrated in one area or a few areas of the score scale, then the conditional
390 errors in those areas of the scale need to be examined.

391 Like reliability and generalizability coefficients, standard errors may reflect
392 variation from many sources of error or only a few. A more comprehensive standard
393 error (i.e., one that includes the most relevant sources of error, given the
394 proposed interpretation) is more informative than a less comprehensive value.
395 However, practical constraints often preclude conducting the kinds of studies that
396 would yield information on all potential sources of error, but in such cases, it is always
397 important to examine those sources of error that are likely to be most serious.
398 Measurements derived from observations of behavior or evaluations of products by raters
399 are especially sensitive to a variety of error factors. These include scorer biases and
400 idiosyncrasies, scoring subjectivity, and intra-examinee factors that cause variation from
401 one performance or product to another. The methods of generalizability theory are well
402 suited to the investigation of the reliability/precision of the scores on such measures. In
403 general, the impact or seriousness of errors of measurement depends on the context in
404 which the scores are used and the purposes for which they are used, and therefore, the
405 errors need to be evaluated in terms of their intended uses.

406 The interpretations of test scores may be broadly categorized as relative or


407 absolute. Relative interpretations convey the standing of an individual or group within a
408 reference population. Absolute interpretations relate the status of an individual or group
409 to defined performance standards. These standards may originate in empirical data for
410 one or more populations or be based on authoritative judgment. Different values of the
411 standard error apply to the two types of interpretations. An error that is the same for all
412 individuals does not contribute to the relative error but may contribute to the absolute
413 error.
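In generalizability-theory terms, this distinction can be sketched as follows (the variance components and design are invented for illustration): a facet main effect, such as a harder-than-average form, shifts all examinees equally and therefore enters the absolute error variance but not the relative error variance.

```python
def error_variances(var_item: float, var_person_item: float, n_items: int):
    relative = var_person_item / n_items                       # rank-order (relative) error
    absolute = var_item / n_items + var_person_item / n_items  # adds the item main effect
    return relative, absolute

print(error_variances(var_item=0.15, var_person_item=0.30, n_items=20))   # (0.015, 0.0225)
```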

414 Traditional norm-referenced reliability coefficients were developed to evaluate


415 the precision with which test scores estimated the relative standing of examinees on some
416 scale, and they evaluate reliability/precision in terms of the ratio of true-score variance to
417 observed-score variance. These norm-referenced reliability coefficients basically examine
418 the degree to which observed scores consistently rank examinees in the same way.

419 As the range of uses of test scores has expanded and the contexts of use have been
420 extended (e.g., diagnostic categorization, the evaluation of educational programs), the

421 range of indices that are used to evaluate reliability/precision has also grown to include
422 indices for various kinds of change scores and difference scores, indices of decision
423 consistency, and indices appropriate for evaluating the precision of group means.

424 Some indices of precision, especially standard errors and conditional standard
425 errors, also depend on the scale in which they are reported. An index stated in terms
426 of raw scores or the trait level estimates of IRT may convey very different perception
427 of the error if restated in terms of scale scores. For example, for the raw-score scale, the
428 standard error may appear to be high at one score level and low at another, but when the
429 conditional standard errors are restated in units of scale scores, quite different trends in
430 comparative precision may emerge.

431 Precision and consistency in measurement are always desirable. However, the
432 need for precision increases as the consequences of decisions and interpretations grow
433 in importance. If a decision can and will be corroborated by information from other
434 sources or if an erroneous initial decision can be quickly corrected, scores with modest
435 reliability/precision may suffice. But if a test score leads to a decision that is not easily
436 reversed, such as rejection or admission of a candidate to a professional school or a
437 jury's decision, based on test results, that a serious cognitive injury was sustained, the
438 need for a high degree of precision is much greater.

439 Decision Consistency


440 Where the purpose of measurement is classification, some measurement errors are
441 more serious than others. An individual who is far above or far below the value
442 established for pass/fail or for eligibility for a special program can be measured with
443 considerable error without affecting the classification decision. Errors of measurement
444 for examinees whose true scores are close to the cut score are more likely to lead to
445 classification errors. The techniques used to quantify reliability/precision should
446 recognize these circumstances. This can be done by reporting the conditional standard
447 error in the vicinity of the cut score, or decision-consistency/accuracy indices (e.g.,
448 percentage of correct decisions, Cohen's kappa), which vary as functions of both
449 score reliability/precision and the location of the cut score.

450 Decision consistency refers to the extent to which the observed classifications of
451 examinees would be the same if they were to take two non-overlapping, equally difficult
452 forms of a test. Decision accuracy refers to the extent to which observed classifications of
453 examinees based on the results of a single test form would agree with their true
454 classification status. Statistical methods are available to calculate indices for both
455 decision consistency and decision accuracy. Adoption of these terms focuses attention on
456 the consistency or accuracy of classifications, rather than the consistency in scores per se,
457 and consistency/accuracy of classification is the main concern in many decision
458 contexts. However, it should be recognized that the degree of consistency or
459 agreement in examinee classification is specific to the cut score employed and its
460 location within the score distribution.
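The sketch below illustrates two of the indices named above, raw agreement and Cohen's kappa, for pass/fail classifications based on two parallel forms; the data arrays and cut score are assumed for illustration.

```python
import numpy as np

def decision_consistency(form_a: np.ndarray, form_b: np.ndarray, cut: float):
    pass_a, pass_b = form_a >= cut, form_b >= cut
    p_agree = float(np.mean(pass_a == pass_b))                 # proportion classified consistently
    base_a, base_b = float(pass_a.mean()), float(pass_b.mean())
    p_chance = base_a * base_b + (1 - base_a) * (1 - base_b)   # agreement expected by chance
    kappa = (p_agree - p_chance) / (1 - p_chance)              # Cohen's kappa
    return p_agree, kappa
```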

461 Reliability/Precision of Group Means


462 Estimates of mean (or average) scores of groups (or proportions in certain categories)
463 involve sources of error that are different from those that operate at the individual level.
464 Such estimates are often used as measures of program effectiveness (and under some
465 educational accountability systems, may be used to evaluate the effectiveness of
466 schools and teachers). In particular, in evaluating group performance, by estimating the
467 mean performance or mean improvement in performance for samples from the group, the
468 variation due to the sampling of persons is typically a major source of error (e.g., in
469 drawing conclusions about teaching performance from student outcomes), especially if
470 the sample sizes are small. For large samples, the variability in the sampling of persons
471 may average out almost completely in the estimates of the group means. However, in
472 cases where the samples of persons are not very large (e.g., in evaluating the mean
473 achievement of school classes from year to year or the average expressed satisfaction of
474 successive samples of clients in a clinical program) the error associated with the
475 sampling of persons may be a major source of error. It can be a significant source of
476 error in inferences about programs even if there is a high degree of precision in individual
477 test scores.

478 Standard errors for individual scores are not appropriate measures of the
479 precision of group averages; a more appropriate statistic is the standard error of the
480 estimates of the group means. Generalizability theory can provide more refined
481 indices when the sources of measurement error are numerous and complex.
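A minimal sketch of the point above, treating the examinees as a random sample from the group and ignoring other error facets (which a fuller generalizability analysis would include): the precision of a group mean depends mainly on the observed-score spread and the number of persons sampled.

```python
import math

def se_of_group_mean(observed_sd: float, n_persons: int) -> float:
    return observed_sd / math.sqrt(n_persons)

print(round(se_of_group_mean(observed_sd=15.0, n_persons=25), 2))    # small class: 3.0
print(round(se_of_group_mean(observed_sd=15.0, n_persons=400), 2))   # large sample: 0.75
```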

482 Documenting Reliability/Precision


483 Typically, developers and distributors of tests have primary responsibility for
484 obtaining and reporting evidence for precision (e.g., appropriate standard errors,
485 reliability or generalizability coefficients, or test information functions). The user
486 must have such data to make an informed choice among alternative measurement
487 approaches and will generally be unable to conduct reliability/precision studies prior to
488 operational use of an instrument.

489 In some instances, however, local users of a test or assessment procedure must
490 accept at least partial responsibility for documenting the precision of measurement.
491 This obligation holds when one of the primary purposes of measurement is to
492 classify students using locally developed performance standards, or to rank
493 examinees within the local population. It also holds when users must rely on local
494 scorers who are trained to use the scoring rubrics provided by the test developer. In
495 such settings, local factors may materially affect the magnitude of error variance and
496 observed score variance. Therefore, the reliability/precision of scores may differ
497 appreciably from that reported by the developer.

498 Reported evaluations of reliability/precision should identify the potential sources


499 of error for the testing program, given the proposed uses of the scores. These potential
500 sources of error can then be evaluated in terms of previously reported research, new
501 empirical studies, or analyses of the reasons for assuming that a potential source of error
502 is likely to be negligible and therefore can be ignored.

503 The reporting of indices of precision alone, with little detail regarding the
504 methods used to estimate the indices reported, the nature of the group from which the
505 data were derived, and the conditions under which the data were obtained constitutes
506 inadequate documentation. General statements to the effect that a test is "reliable" or that
507 it is "sufficiently reliable to permit interpretations of individual scores" are rarely, if ever,
508 acceptable. It is the user who must take responsibility for determining whether or not
509 scores are sufficiently trustworthy to justify anticipated uses and interpretations. Of
510 course, test constructors and publishers are obligated to provide sufficient data to make
511 informed judgments possible.

512 As the foregoing comments emphasize, there is no single, preferred approach to


513 quantification of reliability/precision. No single index adequately conveys all of the
514 relevant facts. No one method of investigation is optimal in all situations, nor is the
515 test developer limited to a single approach for any instrument. The choice of estimation
516 techniques and the minimum acceptable level for any index remain a matter of
517 professional judgment.

518 Implications for Validity


519 Although reliability/precision is discussed here as an independent characteristic of test
520 scores, it should be recognized that the level of reliability/precision of scores has
521 implications for the validity of score interpretations for particular uses.
522 Reliability/precision data ultimately bear on the generalizability, or dependability, of the
523 scores. The data also bear on the consistency of classifications of individuals derived
524 from the scores. To the extent that scores are not consistent across replications of the
525 testing procedure (i.e., to the extent that they reflect random errors of measurement),
526 their potential for accurate prediction of criteria, for beneficial examinee diagnosis, and
527 for wise decision making is limited.

528 STANDARDS FOR ERRORS OF MEASUREMENT AND


529 RELIABILITY/PRECISION
530 The standards in this chapter begin with an overarching standard (numbered 2.0). This
531 overarching standard is designed to convey the central intent or primary focus of the
532 chapter. The overarching standard may also be viewed as the guiding principle of the
533 chapter, and is applicable to all tests and test users.
534

535 The standards that follow the overarching standard have been separated into 8 clusters.
536 The clusters for this chapter have been labeled and ordered as follows:

537 1. Replications – Specification of the Testing Procedure

538 2. Evaluating Reliability/Precision

539 3. Reliability/Generalizability Coefficients

540 4. Factors Affecting Reliability/Precision



541 5. Errors of Measurement

542 6. Decision Consistency

543 7. Reliability/Precision of Group Means

544 8. Documenting Reliability/Precision

545 Standard 2.0


546 Regardless of the purpose or type of assessment, appropriate evidence of
547 reliability/precision should be provided for each intended score interpretation and
548 use.

549 Comment: The form of the index (reliability or generalizability coefficient, information
550 function, conditional standard error, index of decision consistency) should be appropriate
551 for the intended uses of the scores, the population involved, and the psychometric models
552 used to derive the scores.

553 Cluster 1: Replications – Specification of the Testing Procedure

554 Standard 2.1


555 The domain of replications over which consistency, or precision, is being evaluated
556 should be clearly stated, along with a rationale for the choice of this definition, given
557 the testing situation.

558 Comment: For any testing program, some aspects of the testing procedure (e.g., time
559 limits, availability of resources like books, calculators, and computers) are likely to be
560 fixed, and some aspects will be allowed to vary from one administration to another (e.g.,
561 specific tasks, testing contexts, raters, and possibly, occasions). Any test administration
562 that maintains the standardized, fixed conditions and involves acceptable samples of the
563 conditions that are allowed to vary (e.g., tasks, raters) would be considered a legitimate
564 replication of the testing procedure. As a first step in evaluating the reliability/precision
565 of a testing procedure, it is important to identify the range of conditions of various kinds
566 that are allowed to vary, and over which scores are expected to be invariant.

567 Standard 2.2


568 The evidence provided for the reliability/precision of the scores should be consistent
569 with the domain of replications associated with the testing procedures and
570 materials, and with the intended interpretations and uses of the test scores.

571 Comment: The evaluative evidence for reliability/precision should be consistent with the
572 variability allowed in testing procedures and materials, and with the assumptions built
573 into the proposed interpretation and use of the test scores. For example, if the test can be
574 taken on any of a range of occasions, and the interpretation presumes that the scores are
575 at least approximately invariant over these occasions, then any variability in scores over
576 these occasions is a potential source of error. If the test tasks are allowed to vary over
577 alternate forms of the test, and the observed performances are treated as a sample from a

578 domain of similar tasks, the variability in scores from one form to another would be
579 considered error. If raters are used to assign scores to responses, and all raters meeting
580 certain qualifications are considered qualified, the variability in scores over qualified
581 raters is a source of error. If context is likely to make a difference in performance, the
582 person-context interaction should be evaluated. It is not appropriate to employ a
583 reliability/precision index that does not evaluate the main sources of error in an
584 assessment program in evaluating the general question of reliability/precision.

585 Cluster 2: Evaluating Reliability/Precision

586 Standard 2.3


587 For each total score, subscore, or combination of scores that is to be interpreted,
588 estimates of relevant indices of reliability/precision should be reported.

589 Comment: It is not sufficient to report estimates of reliabilities and standard errors of
590 measurement only for total scores when subscores are also interpreted. The form-to-
591 form and day-to-day consistency of total scores on a test may be acceptably high, yet
592 subscores may have unacceptably low reliability. Users should be supplied with
593 reliability data for all scores to be interpreted in enough detail to judge whether scores
594 are precise enough for the users' intended interpretations and uses. Composites formed
595 from selected subtests within a test battery are frequently proposed for predictive
596 and diagnostic purposes. Users need information about the reliability of such
597 composites.

598 Standard 2.4


599 When a test interpretation emphasizes differences between two observed scores of
600 an individual or two averages of a group, reliability/precision data, including
601 standard errors, should be provided for such differences.

602 Comment: Observed score differences are used for a variety of purposes. Achievement
603 gains are frequently the subject of inferences for groups as well as individuals. At least
604 in some cases, the reliability/precision of change scores can be much lower than the
605 reliabilities of the separate scores involved. Differences between verbal and performance
606 scores of intelligence and scholastic ability tests are often employed in the diagnosis
607 of cognitive impairment and learning problems. Psycho-diagnostic inferences are
608 frequently drawn from the differences between subtest scores. Aptitude and achievement
609 batteries, interest inventories, and personality assessments are commonly used to identify
610 and quantify the relative strengths and weaknesses or the pattern of trait levels of an
611 examinee. When the interpretation of test scores centers on the peaks and valleys in
612 the examinee's test score profile, the reliability of score differences between the peaks
613 and valleys is critical.

614 Standard 2.5


615 Reliability estimation procedures should be consistent with the structure of the test.

616 Comment: The total score on a test that is substantially multifactor should be treated
617 as a composite score. If an internal consistency estimate of total score reliability is

618 obtained by the split-halves procedure, the halves should be parallel in content and
619 statistical characteristics.

620 Cluster 3: Reliability/Generalizability Coefficients

621 Standard 2.6


622 A reliability/precision index (e.g., a coefficient or standard error) that addresses one
623 kind of variability should not be interpreted as interchangeable with indices that
624 address different kinds of variability unless their implicit definitions of
625 measurement error are equivalent.

626 Comment: Internal consistency, alternate-form, and test-retest reliability (or
627 generalizability) coefficients should not be considered equivalent, as each may
628 incorporate a unique definition of measurement error. Error variances derived via item
629 response theory are generally not equivalent to error variances estimated via other
630 approaches. Test developers should state the sources of error that are reflected in, and
631 are ignored by, the reported reliability indices.

632 Standard 2.7


633 When subjective judgment enters into test scoring, evidence should be provided on
634 both inter-rater consistency in scoring and within-examinee consistency over
635 repeated measurements. A clear distinction should be made among reliability data
636 based on (a) independent panels of raters scoring the same performances or
637 products, (b) a single panel scoring successive performances or new products, and
638 (c) independent panels scoring successive performances or new products.

639 Comment: Task-to-task variations in the quality of an examinee's performance and rater-
640 to-rater inconsistencies in scoring represent independent sources of measurement error.
641 Reports of reliability studies should make clear which of these sources are reflected in the
642 data. Where feasible, the error variances arising from each source should be estimated.
643 Generalizability studies and variance component analyses are especially helpful in this
644 regard. These analyses can provide separate error variance estimates for tasks, for
645 judges, and for occasions within the time period of trait stability. Information should
646 be provided on the qualifications of the judges used in reliability studies.

647 Inter-rater or inter-observer agreement may be particularly important for ratings


648 and observational data that involve subtle discriminations. It should be noted,
649 however, that when raters evaluate positively correlated characteristics, a favorable or
650 unfavorable assessment of one trait may color their opinions of other traits. Moreover,
651 high inter-rater consistency does not imply high examinee consistency from task to
652 task. Therefore, internal consistency within raters and inter-rater agreement do not
653 guarantee high reliability of examinee scores.

654 Cluster 4: Factors Affecting Reliability/Precision

655 Standard 2.8


656 If local scorers are employed to apply general scoring rules and principles specified
657 by the test developer, local reliability/precision data should be gathered and reported
658 by local authorities when samples of adequate size are available.

659 Comment: For example, many statewide testing programs depend on local scoring
660 of essays, constructed-response exercises, and performance tests. Reliability/precision
661 analyses bear on the possibility that additional training of scorers is needed and,
662 hence, should be an integral part of program monitoring.

663 Standard 2.9


664 When a test is available in both long and short versions, evidence for
665 reliability/precision should be reported for scores on each version, preferably based
666 on an independent administration(s) of each.

667 Comment: The reliability/precision of scores on each version is best evaluated through
668 an independent administration of each, using the designated time limits. Psychometric
669 models can be used to estimate the reliability/precision of a shorter (or longer) version of
670 an existing test, based on data from an administration of the existing test. However,
671 these models generally make assumptions (e.g., that the items in the existing test and the
672 items to be added or dropped are all randomly sampled from a single domain) that may
673 not be satisfied. Context effects are commonplace in tests of maximum performance,
674 and the short version of a standardized test often comprises a nonrandom sample of
675 items from the full-length version. Therefore, the predicted reliability/precision may
676 not provide a very good estimate of the actual reliability/precision, and therefore, where
677 feasible, the reliability/precision should be evaluated directly.

678 Standard 2.10


679 When significant variations are permitted in tests or test administration procedures,
680 separate reliability/precision analyses should be provided for scores produced under
681 each major variation if adequate sample sizes are available.

682 Comment: In order to make a test accessible to all examinees, test publishers might
683 authorize accommodations or modifications in the procedures and time limits that are
684 specified for the administration of a test. For example, audio or large print versions may
685 be used for test takers who are visually impaired. Any alteration in standard testing
686 materials or procedures may have an impact on the reliability/precision of the
687 resulting scores, and therefore, to the extent feasible, the reliability/precision should
688 be examined for all versions of the test and testing procedures.

689 Standard 2.11


690 Because reliability coefficients can differ substantially when subgroups vary
691 in heterogeneity or have other differences in their distributions of the
692 characteristic assessed, publishers should provide reliability estimates as soon
693 as feasible for each major subgroup for which the test is recommended.

694 Comment: Reliability is subgroup dependent and so it should be examined in major


695 subgroups. Test users who work with a specific linguistic and cultural group or with
696 individuals who have a particular disability would benefit from an estimate of the
697 standard error for such a subpopulation. Some groups of test takers—pre-school children,
698 for example—tend to respond to test stimuli in a less consistent fashion than do older
699 children.

700 Standard 2.12


701 If a test is proposed for use in several grades or over a range of chronological age
702 groups and if separate norms are provided for each grade or each age group,
703 reliability/precision data should be provided for each age or grade population, not
704 solely for all grades or ages combined.

705 Comment: A reliability coefficient based on a sample of examinees spanning several
706 grades or a broad range of ages in which average scores are steadily increasing will
707 generally give a spuriously inflated impression of reliability/precision. When a test is
708 intended to discriminate within age or grade populations, reliability coefficients and
709 standard errors should be reported separately for each population.

710 Cluster 5: Errors of Measurement


711 Standard 2.13
712 The standard error of measurement, both overall and conditional (if reported),
713 should be provided in units of each reported score.

714 Comment: The standard error of measurement (overall or conditional) that is reported
715 should be consistent with the scales that are used in reporting scores. Standard errors in
716 scale-score units for the scales used to report scores and/or to make decisions are
717 particularly helpful to the typical test user. The data on examinee performance should be
718 consistent with the assumptions built into any statistical models used to generate scale
719 scores and to estimate the standard errors for these scores.
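As an illustrative aside (not part of the standard), the classical relation SEM = SD x sqrt(1 - reliability) makes clear why the standard error must be expressed in the units of the reported scale: the standard deviation entering the formula is that of the reported scale scores. A minimal Python sketch with hypothetical values:

```python
def sem(scale_sd: float, reliability: float) -> float:
    """Classical overall standard error of measurement, expressed in the same
    units as the reported scale scores."""
    return scale_sd * (1.0 - reliability) ** 0.5

# Hypothetical reported scale with SD = 15 and estimated reliability .91.
print(round(sem(15.0, 0.91), 1))  # 4.5 scale-score points
```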

720 Standard 2.14


721 Conditional standard errors of measurement should be reported at several score
722 levels if constancy cannot be assumed. Where cut scores are specified for selection or
723 classification, the standard errors of measurement should be reported in the vicinity
724 of each cut score.

725 Comment: Estimation of conditional standard errors is usually feasible with the sample
726 sizes that are used for analyses of reliability/precision. If it is assumed that the
727 standard error is constant over a broad range of score levels, the rationale for this
728 assumption should be presented.
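One familiar illustration (a sketch of one possible approach, not a required method) is Lord's binomial-error estimate of the conditional standard error for number-correct scores, which varies with score level and is largest for mid-range scores:

```python
def conditional_sem_binomial(num_items: int, number_correct: int) -> float:
    """Lord's binomial-error estimate of the conditional standard error of
    measurement for a number-correct score on a test of num_items items."""
    n, x = num_items, number_correct
    return (x * (n - x) / (n - 1)) ** 0.5

# Hypothetical 50-item test: the conditional SEM shrinks toward the ends of
# the score scale, so a single overall value can misstate precision near a
# cut score.
for x in (10, 25, 40, 48):
    print(x, round(conditional_sem_binomial(50, x), 2))
```

Reporting values of this kind in the vicinity of each cut score, or defending the assumption of a constant standard error, follows the intent of the standard.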

729 Standard 2.15


730 When there is credible evidence for expecting that conditional standard errors
731 of measurement or test information functions will differ substantially for
732 various subgroups, investigation of the extent and impact of such differences
733 should be undertaken and reported as soon as is feasible.

734 Comment: If substantial differences do exist, the test content and scoring models should
735 be examined to see if there are acceptable alternatives that do not result in such
736 differences.

737 Cluster 6: Decision Consistency

738 Standard 2.16


739 When a test or combination of measures is used to make categorical decisions,
740 estimates should be provided of the percentage of examinees who would be
741 classified in the same way on two applications of the procedure, using the same
742 form or alternate forms of the instrument.

743 Comment: When a test or composite is used to make categorical decisions, such as
744 pass/fail, the standard error of measurement at or near the cut score has important
745 implications for the trustworthiness of these decisions. However, the standard error
746 cannot be translated into the expected percentage of consistent or accurate
747 decisions unless assumptions are made about the form of the distributions of
748 measurement errors and true scores. Although decision consistency is typically
749 estimated from the administration of a single form, it can and should be estimated
750 directly through the use of a repeated-measurements approach if consistent with the
751 requirements of test security and if adequate samples are available.
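As a simplified sketch (assuming, as the standard describes, that the same or alternate forms have been administered twice), the most direct summary of decision consistency is the proportion of examinees classified the same way on both occasions; a chance-corrected index such as Cohen's kappa is often reported alongside it:

```python
def decision_consistency(pass_first, pass_second):
    """Proportion of examinees receiving the same pass/fail classification on
    two applications of the procedure (same form or alternate forms).
    The two arguments are parallel sequences of booleans, one per examinee."""
    agree = sum(a == b for a, b in zip(pass_first, pass_second))
    return agree / len(pass_first)

# Hypothetical classifications of five examinees on two occasions.
first = [True, True, False, False, True]
second = [True, False, False, False, True]
print(decision_consistency(first, second))  # 0.8
```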

752 Cluster 7: Reliability/Precision of Group Means


753 Standard 2.17
754 When average test scores for groups are the focus of the proposed interpretation of the
755 test results, the groups tested should generally be regarded as a sample from a
756 larger population, even if all examinees available at the time of measurement are
757 tested. In such cases the standard error of the group mean should be reported,
758 because it reflects variability due to sampling of examinees as well as variability due
759 to measurement error.

760 Comment: The students in a particular class or school, the current clients of a social
761 service agency, and analogous groups exposed to a program of interest typically
762 constitute a sample in a longitudinal sense. Presumably, comparable groups from the
763 same population will recur in future years, given static conditions. The factors
764 leading to uncertainty in conclusions about program effectiveness arise from the
765 sampling of persons as well as measurement error. Therefore, the standard error of the
766 mean observed score, reflecting variation in both true scores and measurement
767 errors, represents a more realistic standard error in this setting. Even this value may
768 underestimate the variability of group means over time. In many settings, the stable
769 conditions assumed under random sampling of persons do not prevail.
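As a brief illustration (not part of the standard), the standard error of a group mean computed from the observed-score standard deviation already reflects both sources of uncertainty named in the comment, because observed-score variance is the sum of true-score variance and error variance. A minimal Python sketch with hypothetical scores:

```python
import statistics

def se_of_group_mean(observed_scores):
    """Standard error of the mean observed score for a group regarded as a
    sample from a larger population. Because observed-score variance combines
    true-score and measurement-error variance, this value reflects both the
    sampling of examinees and measurement error."""
    n = len(observed_scores)
    return statistics.stdev(observed_scores) / n ** 0.5

# Hypothetical scores for one small class.
print(round(se_of_group_mean([12, 15, 14, 10, 13, 16, 11, 14]), 2))
```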

770

771 Standard 2.18


772 When the purpose of testing is to measure the performance of groups rather than
773 individuals, a procedure frequently used is to assign small randomly chosen subsets of
774 items to different subsamples of examinees. Data are aggregated across sub-samples
775 and item subsets to obtain a measure of group performance. When such procedures
776 are used for program evaluation or population descriptions, reliability/precision
777 analyses must take the sampling scheme into account.

778 Comment: This type of measurement program is termed matrix sampling. It is designed
779 to reduce the time demanded of individual examinees and to increase the total number of
780 items on which data are obtained. This testing approach provides the same type of
781 information about group performances that would accrue if all examinees could respond
782 to all exercises in the item pool. Reliability/precision statistics must be appropriate to
783 the sampling plan used with respect to examinees and items.
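To make the sampling scheme concrete (a simplified sketch with hypothetical data, not a prescribed analysis), each subsample of examinees answers only its assigned subset of items, and the group result is obtained by pooling across subsamples and item subsets; reliability/precision analyses must then treat both examinees and items as sampled:

```python
# Hypothetical matrix-sampling data: scores[s][e][i] = 1 if examinee e in
# subsample s answered item i of that subsample's item subset correctly.
scores = [
    [[1, 0, 1], [1, 1, 1]],  # subsample 1, item subset A
    [[0, 1, 1], [1, 0, 0]],  # subsample 2, item subset B
]

# Group-level estimate: proportion correct pooled over all examinees and all
# item subsets, approximating what would be obtained if every examinee could
# respond to every item in the pool.
responses = [r for subsample in scores for examinee in subsample for r in examinee]
print(sum(responses) / len(responses))
```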

784 Cluster 8: Documenting Reliability/Precision

785 Standard 2.19


786 Each method of quantifying the precision or consistency of scores should be
787 described clearly and expressed in terms of statistics appropriate to the method. The
788 sampling procedures used to select examinees for reliability/precision analyses and
789 descriptive statistics on these samples should be reported.

790 Comment: Information on the method of data collection, sample sizes, means,
791 standard deviations, and demographic characteristics of the groups helps users judge
792 the extent to which reported data apply to their own examinee populations. If the test-
793 retest or alternate-form approach is used, the interval between administrations should be
794 indicated.

795 Because there are many ways of estimating reliability/precision, each influenced
796 by different sources of measurement error, it is unacceptable to say simply, "The
797 reliability/precision of test X is .90." A better statement would be, "The reliability
798 coefficient of .90 reported for scores on test X was obtained by correlating scores from
799 forms A and B administered on successive days. The data were based on a sample of
800 400 10th-grade students from five middle-class suburban schools in New York State.
801 The demographic breakdown of this group was as follows: ...."

802 Standard 2.20


803 If reliability coefficients are adjusted for restriction of range or variability, the
804 adjustment procedure and both the adjusted and unadjusted coefficients should be
805 reported. The standard deviations of the group actually tested and of the target
806 population, as well as the rationale for the adjustment, should be presented.

807 Comment: Application of a correction for restriction in variability presumes that the
808 available sample is not representative of the test-taker population to which users might be
809 expected to generalize. The rationale for the correction should consider the
810 appropriateness of such a generalization. Adjustment formulas that presume constancy
811 in the standard error across score levels should not be used unless constancy can be
812 defended.
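For illustration (a sketch of one common adjustment, not an endorsement of a particular formula), the classical correction of a reliability coefficient for restriction of range assumes a constant standard error of measurement across score levels, which is exactly the assumption the comment says must be defended:

```python
def adjust_reliability_for_range(r_restricted: float,
                                 sd_restricted: float,
                                 sd_target: float) -> float:
    """Classical adjustment of a reliability coefficient from a restricted
    sample to a target population with greater variability, assuming the
    error variance sd**2 * (1 - r) is the same in both groups."""
    error_variance = sd_restricted ** 2 * (1.0 - r_restricted)
    return 1.0 - error_variance / sd_target ** 2

# Hypothetical values: r = .80 in a restricted sample (SD = 8), generalized
# to a target population with SD = 12.
print(round(adjust_reliability_for_range(0.80, 8.0, 12.0), 2))  # 0.91
```

Both the adjusted and unadjusted coefficients, the two standard deviations, and the rationale for generalizing to the target population would then be reported, as the standard requires.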
