The term reliability has been used in two ways in the measurement literature. First, the term has been used to refer to the reliability coefficients of classical test theory (defined as ratios of true-score variance to observed-score variance). Second, the term has been used in a more general sense, to refer to the consistency or precision of scores across replications of a testing procedure, regardless of how this consistency is estimated or reported (e.g., in terms of standard errors, reliability coefficients per se, generalizability coefficients, error/tolerance ratios, information functions, or various indices of categorical consistency). To maintain a link to the traditional notions of reliability, while avoiding the ambiguity inherent in using a single, familiar term to refer to a wide range of concepts and indices of precision, the term reliability/precision will be used to denote the more general notion of consistency of scores across replications, and the term reliability coefficient will be used to refer to the reliability coefficients of classical test theory.
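In the notation of classical test theory, this first usage can be stated compactly; a minimal sketch, using the conventional symbols rather than any notation from this chapter:

```latex
% Reliability coefficient: the proportion of observed-score variance
% that is attributable to true-score variance.
\rho_{XX'} = \frac{\sigma_T^2}{\sigma_X^2}
```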
Some test takers may exhibit less variation in their scores than others, but all test takers exhibit some variation in performance. Because of this variation, an individual's observed scores (and the average scores of groups) will vary across replications of the testing procedure. This unexplained and apparently random variability across replications is referred to as measurement error, or random error.
The oldest and most basic way to evaluate the consistency of scores involves an analysis of the variation in each individual's scores across replications of the measurement procedure. We administer the test and then, after a brief period over which the examinee's standing on the variable being measured would not be expected to change, we administer the test (or a parallel test) a second time; it is assumed that the first administration has no substantial influence on the second. Given that the attribute being measured is assumed to remain the same for each examinee over the two administrations, and that the test administrations are independent of each other, more variation across replications indicates more error and, therefore, less precision and lower reliability.
To say that a score includes error implies that there is a hypothetical error-free value that characterizes an examinee's score at the time of testing. In classical test theory this error-free value is referred to as the person's true score for the test or measurement procedure. It is conceptualized as the hypothetical average score over an infinite series of replications of the testing procedure. In statistical terms, a person's true score is an unknown parameter, and the observed score for the person is a random variable that fluctuates around that true score.
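This conceptualization is commonly summarized as follows; a minimal sketch in standard classical-test-theory notation (not taken verbatim from this chapter):

```latex
% Observed score = true score + random error, where T is the expected
% observed score over replications and E is uncorrelated with T.
X = T + E, \qquad T = \mathbb{E}(X), \qquad \sigma_X^2 = \sigma_T^2 + \sigma_E^2
% The standard error of measurement follows from the reliability coefficient:
\sigma_E = \sigma_X \sqrt{1 - \rho_{XX'}}
```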
The cardinal features of standardized tests include consistency of the test materials from test taker to test taker, close adherence to stipulated procedures for test administration, and use of prescribed scoring rules that can be applied with a high degree of consistency. Administering the same test to all test takers under the same conditions promotes fairness and facilitates comparisons of scores across individuals. Conditions of observation that are fixed or standardized for the testing procedure remain the same across replications. However, some aspects of any standardized testing procedure may be allowed to vary. The time and place of testing, as well as the persons administering the test, are generally allowed to vary. The particular questions or tasks included in the test may be allowed to vary, and the persons who score the results can vary over some set of qualified scorers.
The reliability/precision of the scores depends on how much the scores vary across replications of the testing procedure, and the evaluative evidence for reliability/precision should be consistent with the kinds of variability allowed in the testing procedure and materials (e.g., over tasks, contexts, raters), and with the assumptions built into the proposed interpretation and use of the test scores. For example, if the interpretation of the scores assumes that they are at least approximately invariant over occasions, then variability over occasions is a potential source of error. If the test tasks vary over different forms of the test, and the observed performances are treated as a sample from a domain of similar tasks, the variability in scores from one form to another would be considered error. If raters are used to assign scores to responses, and all raters meeting certain qualifications are considered qualified, the variability in scores over qualified raters is a source of error.
In some cases, it may be possible to evaluate the magnitude of all major sources of error in one analysis (e.g., by comparing scores on different test forms, administered on different occasions and in different settings, and scored by different raters). In other cases, it may be more convenient or useful to accumulate evidence separately for different potential sources of error (e.g., through special studies in which students take different forms of the test on different days).
In some cases, it may be feasible to estimate the variability over replications directly (e.g., by having a number of qualified raters evaluate a sample of test performances). In other cases, it may be necessary to use less direct estimates (e.g., using internal consistency, the extent of agreement among different parts of one test, to estimate the random error associated with form-to-form variability). In some cases, it may be reasonable to assume that a potential source of variability is likely to be negligible (e.g., variability across well-calibrated scoring machines, or variability over some changes in the formatting of test forms). In other cases, it may be necessary to examine the variability over actual replications directly. For example, when a test is designed to reflect rate of work, reliability should be estimated by the alternate-form or test-retest approach, using separately timed administrations. Split-half coefficients based on separate scores from the odd-numbered and even-numbered items are known to yield inflated estimates of reliability for highly speeded tests.
In some cases, it may be possible to infer adequate reliability from other types of evidence. For example, if test scores are used mainly to predict some criterion, and the test does an adequate job of predicting that criterion, it can be inferred that the test scores are reliable/precise enough for their intended use.
The definition of what constitutes a standardized test or measurement procedure has broadened significantly over the last few decades. Various kinds of performance assessments, simulations, and portfolio-based assessments have been developed to provide measures of constructs that might otherwise be difficult to assess. Performance assessments raise complex issues regarding the performance domain represented by the test, what constitutes a replication, and the reliability/precision of individual and group scores. Each step toward greater flexibility in the assessment procedures enlarges the scope of the variations allowed in replications of the testing procedure, and therefore tends to increase the measurement error. However, some of these sacrifices in reliability/precision may reduce construct irrelevance or construct underrepresentation and thereby improve the validity of the intended interpretations of the scores for the proposed use or uses of the test.
Random and Systematic Errors
Random errors of measurement are generally viewed as unpredictable fluctuations in scores. They are conceptually distinguished from systematic errors, which may also affect the performances of individuals or groups, but in a consistent rather than a random manner. For example, a systematic group error would occur as a result of differences in the difficulty of test forms that have not been adequately equated. In this instance, examinees who take one form may earn higher scores, on average, than they would have earned had they taken the other form. Such systematic errors would not generally be included in the standard error of measurement, and they are not generally regarded as contributing to a lack of reliability/precision. Rather, systematic errors constitute construct-irrelevant factors that reduce validity, but not reliability/precision.
Important sources of random error may be broadly categorized as those rooted within the test takers and those external to them. Fluctuations in the level of an examinee's motivation, interest, or attention and the inconsistent application of skills are clearly internal factors that may lead to score inconsistencies. Differences among testing sites in their freedom from distractions, the random effects of scorer subjectivity, and variation in scorer standards are examples of external factors. The importance of any particular source of variation depends on the specific conditions under which the measures are taken, how performances are scored, and the interpretations made from the scores. A particular factor, such as subjectivity in scoring, may be a significant source of measurement error in some assessments and a minor consideration in others.
Some changes in scores from one occasion to another are not regarded as error (random or systematic), because they result, in part, from changes in the construct being measured (e.g., due to learning or maturation that has occurred between the initial and final measures). In such cases, the change in performance constitutes the phenomenon of interest, and the changes would not be considered to be due to errors of measurement.
Measurement error reduces the usefulness of test scores. It limits the extent to which test results can be generalized beyond the particulars of a specific application of the testing procedure. Therefore, it reduces the confidence that can be placed in the results from any single measurement. Because random measurement errors are inconsistent and unpredictable, they cannot be removed from observed scores. However, their aggregate magnitude can be summarized in several ways, as discussed below, and they can be controlled to some extent (e.g., by averaging over multiple scores).
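For example, if a reported score is the mean of n replicated scores whose errors are independent and share a common variance, averaging shrinks the error variance; a minimal sketch under that independence assumption:

```latex
% Error variance of the mean of n independently replicated scores:
\sigma_{\bar{E}}^2 = \frac{\sigma_E^2}{n}
```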
For most testing programs, scores are expected to generalize over different forms of the test, over occasions (within some period), over testing contexts, and over raters (if judgment is required in scoring). To the extent that the variability associated with any of these dimensions or facets is likely to be substantial, that variability should be estimated in some way. In evaluating the reliability/precision of the scores, it is important to identify the major sources of error and to estimate the magnitudes of these errors, and thereby the degree of generalizability of scores across alternate forms, scorers, administrations, or other relevant dimensions.
Where possible (i.e., where sample sizes are large enough), reliability/precision should be estimated separately for all major subgroups (e.g., defined in terms of race, gender, or language proficiency) in the population [see the Fairness chapter].
Traditionally, the consistency of test scores was evaluated mainly in terms of reliability coefficients, defined as the ratio of true-score variance to observed-score variance and estimated by computing the correlation between scores derived from two replications of the testing procedure. Three broad categories of reliability coefficients were recognized: (a) coefficients derived from the administration of parallel forms in independent testing sessions (alternate-form coefficients); (b) coefficients obtained by administration of the same instrument on separate occasions (test-retest coefficients); and (c) coefficients based on the relationships/interactions among scores derived from individual items or subsets of items within a test, with all data accruing from a single administration (internal consistency coefficients). In addition, where test scoring involves a high level of judgment, indices of scorer consistency are commonly obtained.
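As a concrete illustration of coefficients obtainable from a single administration, an internal consistency estimate can be computed directly from an item-score matrix. A minimal sketch in Python; the function names and data layout are illustrative assumptions, not part of this text:

```python
import numpy as np

def cronbach_alpha(item_scores):
    """Coefficient alpha from a persons-by-items score matrix."""
    X = np.asarray(item_scores, dtype=float)
    k = X.shape[1]
    item_variances = X.var(axis=0, ddof=1)      # variance of each item
    total_variance = X.sum(axis=1).var(ddof=1)  # variance of total scores
    return (k / (k - 1)) * (1.0 - item_variances.sum() / total_variance)

def split_half_reliability(item_scores):
    """Split-half coefficient (odd- vs. even-numbered items),
    stepped up to full test length via Spearman-Brown."""
    X = np.asarray(item_scores, dtype=float)
    odd = X[:, 0::2].sum(axis=1)   # odd-numbered items (0-based columns 0, 2, ...)
    even = X[:, 1::2].sum(axis=1)  # even-numbered items
    r_halves = np.corrcoef(odd, even)[0, 1]
    return 2.0 * r_halves / (1.0 + r_halves)
```

As noted earlier, such single-administration estimates do not reflect day-to-day or form-to-form variation, and split-half coefficients are inappropriate for highly speeded tests.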
In generalizability theory, these three categories are treated as special cases of a more general framework for estimating error variance in terms of the variance associated with different sources of error. A generalizability coefficient is defined as the ratio of universe-score variance to observed-score variance. Unlike traditional approaches to the study of reliability, generalizability theory encourages the researcher to specify and estimate components of true-score variance, error variance, and observed-score variance, and to compute coefficients based on these estimates. Estimation is typically accomplished by applying the techniques of analysis of variance. Of special interest are the separate numerical estimates of the components of overall error variance (e.g., variance components for items, occasions, and raters, and variance components for the interactions among these potential sources of error). Such estimates permit examination of the contribution of each source of error to the overall measurement error and can be very helpful in identifying an effective strategy for controlling overall error variance. The generalizability approach also makes possible the estimation of coefficients that apply to a wide variety of potential measurement designs.
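As a hedged illustration of this approach, variance components for a fully crossed persons-by-raters design can be estimated from the ANOVA mean squares and recombined into a generalizability coefficient. A minimal sketch in Python; the design, data layout, and function names are assumptions for illustration:

```python
import numpy as np

def variance_components(scores):
    """ANOVA estimates for a fully crossed persons x raters design
    with one observation per cell."""
    X = np.asarray(scores, dtype=float)
    n_p, n_r = X.shape
    grand = X.mean()
    ss_p = n_r * ((X.mean(axis=1) - grand) ** 2).sum()   # persons
    ss_r = n_p * ((X.mean(axis=0) - grand) ** 2).sum()   # raters
    ss_res = ((X - grand) ** 2).sum() - ss_p - ss_r      # interaction + error
    ms_p = ss_p / (n_p - 1)
    ms_r = ss_r / (n_r - 1)
    ms_res = ss_res / ((n_p - 1) * (n_r - 1))
    var_res = ms_res                             # confounded pr interaction/error
    var_p = max((ms_p - ms_res) / n_r, 0.0)      # universe-score variance
    var_r = max((ms_r - ms_res) / n_p, 0.0)
    return var_p, var_r, var_res

def g_coefficient(var_p, var_res, n_raters):
    """Generalizability coefficient for relative decisions based on
    the mean rating over n_raters raters."""
    return var_p / (var_p + var_res / n_raters)
```

The second function illustrates the point made above: once estimated, the components can be recombined to project coefficients for alternative measurement designs (here, different numbers of raters).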
The test information function, an important result of item response theory (IRT), summarizes how well the test discriminates among individuals at various levels of the ability or trait being assessed. Under the IRT conceptualization for dichotomously scored items, a mathematical function called the item characteristic curve, or item response function, is used as a model to represent the increasing proportion of correct responses to an item for groups at progressively higher levels of the ability or trait being measured. Given an adequate database, the parameters of the characteristic curve for each item in a test can be estimated. The test information function can then be estimated from the parameters of the set of items in the test and can be used to derive coefficients with interpretations similar to reliability coefficients.
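For instance, under the two-parameter logistic model (one common choice, used here purely for illustration), the item response function, the test information function, and the conditional standard error of the ability estimate take the following forms:

```latex
% 2PL item response function (a_i = discrimination, b_i = difficulty):
P_i(\theta) = \frac{1}{1 + e^{-a_i(\theta - b_i)}}
% Test information and the conditional standard error of the ability estimate:
I(\theta) = \sum_i a_i^2 \, P_i(\theta)\,\bigl[1 - P_i(\theta)\bigr], \qquad
SE(\hat{\theta}) = \frac{1}{\sqrt{I(\theta)}}
```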
Generalizability coefficients and the many coefficients included under the traditional categories of reliability may appear to be interchangeable, but the different coefficients may convey quite different information. A coefficient in any given category may encompass errors of measurement from a highly restricted perspective, from a very broad perspective, or from some point between these extremes. For example, a coefficient may reflect error due to scorer inconsistencies but not reflect the variation over an examinee's performances or products. A coefficient may reflect only the internal consistency of item responses within an instrument and fail to reflect measurement error associated with day-to-day changes in examinee performance.
First, such coefficients can be used to estimate standard errors (overall and/or conditional) in cases where it would not be possible to do so directly. Second, coefficients (e.g., reliability and generalizability coefficients), which are defined in terms of ratios of variances for scores on the same scale, are invariant over linear transformations of the score scale and can be useful in comparing different testing procedures based on different scales.
Second, if the variability associated with raters is estimated for a select group of raters who have been especially well trained (and were perhaps involved in the development of the procedures), but raters are not so well trained in some operational contexts, the error associated with rater variability in those operational settings may be much higher than is indicated by the reported inter-rater reliability coefficients. Similarly, if raters are still refining their performance in the early days of an extended scoring window, the error associated with rater variability may be greater for examinees testing early in the window than for later test takers.
The reliability/precision also depends on the population in which the procedure is being used. In particular, if the variability in the construct of interest in the population for which scores are being generated is substantially different from the variability in the population for which reliability/precision was evaluated, the reliability/precision can be quite different in the two populations: as the variability in the construct being measured decreases, reliability and generalizability coefficients tend to decrease, and as that variability increases, the coefficients tend to increase.
In addition, the reliability/precision can vary from one population to another even if the variability in the construct of interest is the same in the two populations. If the populations have different average levels of achievement, and the test is particularly easy or difficult for one population, the reliability/precision is likely to be depressed in that population. The reliability can also vary from one population to another because particular sources of error (rater effects, familiarity with formats and instructions, etc.) have more impact in one population than in the other. In general, if any aspects of the assessment procedures or the population being assessed are changed in an operational setting, the reliability/precision should be reevaluated for that particular setting.
Information about the precision of measurement at each of several widely spaced score levels—that is, conditional standard errors—is usually a valuable supplement to the single statistic for all score levels combined. Conditional standard errors of measurement are generally more informative than a single average standard error for a population. If decisions are based on test scores, and these decisions are concentrated in one area or a few areas of the score scale, then the conditional errors in those areas of the scale need to be examined.
Like reliability and generalizability coefficients, standard errors may reflect variation from many sources of error or from only a few. A more comprehensive standard error (i.e., one that includes the most relevant sources of error, given the proposed interpretation) is more informative than a less comprehensive value. Practical constraints often preclude the kinds of studies that would yield information on all potential sources of error; in such cases, it remains important to examine those sources of error that are likely to be most serious. Measurements derived from observations of behavior or from evaluations of products by raters are especially sensitive to a variety of error factors, including scorer biases and idiosyncrasies, scoring subjectivity, and intra-examinee factors that cause variation from one performance or product to another. The methods of generalizability theory are well suited to investigating the reliability/precision of scores on such measures. In general, the impact or seriousness of errors of measurement depends on the context in which the scores are used and the purposes for which they are used, and the errors therefore need to be evaluated in light of those intended uses.
As the range of uses of test scores has expanded and the contexts of use have been extended (e.g., diagnostic categorization, the evaluation of educational programs), the range of indices used to evaluate reliability/precision has also grown to include indices for various kinds of change scores and difference scores, indices of decision consistency, and indices appropriate for evaluating the precision of group means.
Some indices of precision, especially standard errors and conditional standard errors, also depend on the scale in which they are reported. An index stated in terms of raw scores or the trait-level estimates of IRT may convey a very different perception of the error if restated in terms of scale scores. For example, on the raw-score scale, the standard error may appear to be high at one score level and low at another, but when the conditional standard errors are restated in units of scale scores, quite different trends in comparative precision may emerge.
Precision and consistency in measurement are always desirable. However, the need for precision increases as the consequences of decisions and interpretations grow in importance. If a decision can and will be corroborated by information from other sources, or if an erroneous initial decision can be quickly corrected, scores with modest reliability/precision may suffice. But if a test score leads to a decision that is not easily reversed, such as rejection or admission of a candidate to a professional school, or a jury's decision, based on test results, that a serious cognitive injury was sustained, the need for a high degree of precision is much greater.
Decision consistency refers to the extent to which the observed classifications of examinees would be the same if they were to take two non-overlapping, equally difficult forms of a test. Decision accuracy refers to the extent to which observed classifications of examinees based on the results of a single test form would agree with their true classification status. Statistical methods are available to calculate indices of both decision consistency and decision accuracy. Adoption of these terms focuses attention on the consistency or accuracy of classifications rather than on the consistency of scores per se, and the consistency/accuracy of classification is the main concern in many decision contexts. However, it should be recognized that the degree of consistency or agreement in examinee classification is specific to the cut score employed and to its location within the score distribution.
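As a hedged illustration, both a raw agreement index and a chance-corrected index (Cohen's kappa) can be computed when the same examinees are classified on two forms. A minimal sketch in Python; the pass/fail design and the names used are illustrative assumptions:

```python
import numpy as np

def decision_consistency(form_a, form_b, cut):
    """Raw agreement and Cohen's kappa for pass/fail classifications
    of the same examinees on two equally difficult forms."""
    a = np.asarray(form_a) >= cut
    b = np.asarray(form_b) >= cut
    p_observed = np.mean(a == b)              # proportion classified consistently
    pa, pb = a.mean(), b.mean()
    p_chance = pa * pb + (1 - pa) * (1 - pb)  # agreement expected by chance
    kappa = (p_observed - p_chance) / (1 - p_chance)
    return p_observed, kappa
```

Consistent with the caveat above, both indices change if the cut score is moved, particularly toward or away from the center of the score distribution.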
Standard errors for individual scores are not appropriate measures of the precision of group averages; a more appropriate statistic is the standard error of the estimates of the group means. Generalizability theory can provide more refined indices when the sources of measurement error are numerous and complex.
In some instances, however, local users of a test or assessment procedure must accept at least partial responsibility for documenting the precision of measurement. This obligation holds when one of the primary purposes of measurement is to classify students using locally developed performance standards, or to rank examinees within the local population. It also holds when users must rely on local scorers who are trained to use the scoring rubrics provided by the test developer. In such settings, local factors may materially affect the magnitude of error variance and observed-score variance. Therefore, the reliability/precision of scores may differ appreciably from that reported by the developer.
The reporting of indices of precision alone, with little detail regarding the methods used to estimate the indices, the nature of the group from which the data were derived, and the conditions under which the data were obtained, constitutes inadequate documentation. General statements to the effect that a test is "reliable" or that it is "sufficiently reliable to permit interpretations of individual scores" are rarely, if ever, acceptable. It is the user who must take responsibility for determining whether scores are sufficiently trustworthy to justify anticipated uses and interpretations. Of course, test constructors and publishers are obligated to provide sufficient data to make informed judgments possible.
The standards that follow the overarching standard have been separated into eight clusters. The clusters for this chapter have been labeled and ordered as follows:
Comment: The form of the index (reliability or generalizability coefficient, information function, conditional standard error, index of decision consistency) should be appropriate for the intended uses of the scores, the population involved, and the psychometric models used to derive the scores.
Comment: For any testing program, some aspects of the testing procedure (e.g., time limits, availability of resources such as books, calculators, and computers) are likely to be fixed, and some aspects will be allowed to vary from one administration to another (e.g., specific tasks, testing contexts, raters, and possibly occasions). Any test administration that maintains the standardized, fixed conditions and involves acceptable samples of the conditions that are allowed to vary (e.g., tasks, raters) would be considered a legitimate replication of the testing procedure. As a first step in evaluating the reliability/precision of a testing procedure, it is important to identify the range of conditions of various kinds that are allowed to vary and over which scores are expected to be invariant.
Comment: The evaluative evidence for reliability/precision should be consistent with the variability allowed in testing procedures and materials, and with the assumptions built into the proposed interpretation and use of the test scores. For example, if the test can be taken on any of a range of occasions, and the interpretation presumes that the scores are at least approximately invariant over these occasions, then any variability in scores over these occasions is a potential source of error. If the test tasks are allowed to vary over alternate forms of the test, and the observed performances are treated as a sample from a domain of similar tasks, the variability in scores from one form to another would be considered error. If raters are used to assign scores to responses, and all raters meeting certain qualifications are considered qualified, the variability in scores over qualified raters is a source of error. If context is likely to make a difference in performance, the person-context interaction should be evaluated. In evaluating the general question of reliability/precision, it is not appropriate to employ an index that fails to reflect the main sources of error in the assessment program.
Comment: It is not sufficient to report estimates of reliabilities and standard errors of measurement only for total scores when subscores are also interpreted. The form-to-form and day-to-day consistency of total scores on a test may be acceptably high, yet subscores may have unacceptably low reliability. Users should be supplied with reliability data for all scores to be interpreted, in enough detail to judge whether the scores are precise enough for the users' intended interpretations and uses. Composites formed from selected subtests within a test battery are frequently proposed for predictive and diagnostic purposes; users need information about the reliability of such composites as well.
Comment: Observed score differences are used for a variety of purposes. Achievement gains are frequently the subject of inferences for groups as well as individuals. At least in some cases, the reliability/precision of change scores can be much lower than the reliabilities of the separate scores involved. Differences between verbal and performance scores on intelligence and scholastic ability tests are often employed in the diagnosis of cognitive impairment and learning problems. Psychodiagnostic inferences are frequently drawn from the differences between subtest scores. Aptitude and achievement batteries, interest inventories, and personality assessments are commonly used to identify and quantify the relative strengths and weaknesses, or the pattern of trait levels, of an examinee. When the interpretation of test scores centers on the peaks and valleys in the examinee's test score profile, the reliability of score differences between the peaks and valleys is critical.
Comment: The total score on a test that is substantially multifactor should be treated as a composite score. If an internal consistency estimate of total score reliability is obtained by the split-halves procedure, the halves should be parallel in content and statistical characteristics.
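When the two halves can be regarded as parallel, the half-test correlation is stepped up to full test length by the familiar Spearman-Brown relation, shown here for reference rather than as a quotation from the text:

```latex
% Full-test reliability from the correlation r_h between parallel halves:
\rho_{full} = \frac{2\,r_h}{1 + r_h}
```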
Comment: Task-to-task variations in the quality of an examinee's performance and rater-to-rater inconsistencies in scoring represent independent sources of measurement error. Reports of reliability studies should make clear which of these sources are reflected in the data. Where feasible, the error variances arising from each source should be estimated. Generalizability studies and variance component analyses are especially helpful in this regard. These analyses can provide separate error variance estimates for tasks, for judges, and for occasions within the time period of trait stability. Information should be provided on the qualifications of the judges used in reliability studies.
Comment: For example, many statewide testing programs depend on local scoring of essays, constructed-response exercises, and performance tests. Reliability/precision analyses bear on the possibility that additional training of scorers is needed and, hence, should be an integral part of program monitoring.
Comment: The reliability/precision of scores on each version is best evaluated through an independent administration of each, using the designated time limits. Psychometric models can be used to estimate the reliability/precision of a shorter (or longer) version of an existing test, based on data from an administration of the existing test. However, these models generally make assumptions (e.g., that the items in the existing test and the items to be added or dropped are all randomly sampled from a single domain) that may not be satisfied. Context effects are commonplace in tests of maximum performance, and the short version of a standardized test often comprises a nonrandom sample of items from the full-length version. The predicted reliability/precision may therefore not provide a very good estimate of the actual reliability/precision, and, where feasible, the reliability/precision should be evaluated directly.
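One widely used model of this kind is the general Spearman-Brown projection, which presumes the items added or dropped are exchangeable with the existing items; a minimal statement (the split-half step-up shown earlier is the k = 2 special case):

```latex
% Projected reliability when test length changes by a factor k
% (k = new length / old length), given current reliability rho:
\rho_k = \frac{k\,\rho}{1 + (k - 1)\,\rho}
```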
Comment: In order to make a test accessible to all examinees, test publishers might authorize accommodations or modifications of the procedures and time limits that are specified for the administration of a test. For example, audio or large-print versions may be used for test takers who are visually impaired. Any alteration of standard testing materials or procedures may have an impact on the reliability/precision of the resulting scores; therefore, to the extent feasible, the reliability/precision should be examined for all versions of the test and testing procedures.
Comment: The standard error of measurement (overall or conditional) that is reported should be consistent with the scales used in reporting scores. Standard errors in scale-score units for the scales used to report scores and/or to make decisions are particularly helpful to the typical test user. The data on examinee performance should be consistent with the assumptions built into any statistical models used to generate scale scores and to estimate the standard errors for those scores.
Comment: Estimation of conditional standard errors is usually feasible with the sample sizes that are used for analyses of reliability/precision. If it is assumed that the standard error is constant over a broad range of score levels, the rationale for this assumption should be presented.
If there are generally accepted theoretical or empirical reasons for expecting that reliability/precision will differ substantially for various subgroups, investigation of the extent and impact of such differences should be undertaken and reported as soon as is feasible.
Comment: If substantial differences do exist, the test content and scoring models should be examined to see if there are acceptable alternatives that do not result in such differences.
Comment: When a test or composite is used to make categorical decisions, such as pass/fail, the standard error of measurement at or near the cut score has important implications for the trustworthiness of these decisions. However, the standard error cannot be translated into the expected percentage of consistent or accurate decisions unless assumptions are made about the form of the distributions of measurement errors and true scores. Although decision consistency is typically estimated from the administration of a single form, it can and should be estimated directly through the use of a repeated-measurements approach if consistent with the requirements of test security and if adequate samples are available.
Comment: The students in a particular class or school, the current clients of a social service agency, and analogous groups exposed to a program of interest typically constitute a sample in a longitudinal sense. Presumably, comparable groups from the same population will recur in future years, given static conditions. The factors leading to uncertainty in conclusions about program effectiveness arise from the sampling of persons as well as from measurement error. Therefore, the standard error of the mean observed score, reflecting variation in both true scores and measurement errors, represents a more realistic standard error in this setting. Even this value may underestimate the variability of group means over time, because in many settings the stable conditions assumed under random sampling of persons do not prevail.
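Under the simple random-sampling view described above, the standard error of the group mean combines true-score variation among persons with measurement error; a minimal sketch under that assumption:

```latex
% Standard error of the mean observed score for a sample of n persons,
% using sigma_X^2 = sigma_T^2 + sigma_E^2:
SE(\bar{X}) = \sqrt{\frac{\sigma_T^2 + \sigma_E^2}{n}} = \frac{\sigma_X}{\sqrt{n}}
```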
Comment: This type of measurement program is termed matrix sampling. It is designed to reduce the time demanded of individual examinees and to increase the total number of items on which data are obtained. This testing approach provides the same type of information about group performances that would accrue if all examinees could respond to all exercises in the item pool. Reliability/precision statistics must be appropriate to the sampling plan used with respect to examinees and items.
Comment: Information on the method of data collection, sample sizes, means, standard deviations, and demographic characteristics of the groups helps users judge the extent to which reported data apply to their own examinee populations. If the test-retest or alternate-form approach is used, the interval between administrations should be indicated.
Because there are many ways of estimating reliability/precision, each influenced by different sources of measurement error, it is unacceptable to say simply, "The reliability/precision of test X is .90." A better statement would be, "The reliability coefficient of .90 reported for scores on test X was obtained by correlating scores from forms A and B, administered on successive days. The data were based on a sample of 400 10th-grade students from five middle-class suburban schools in New York State. The demographic breakdown of this group was as follows: ...."
Comment: Application of a correction for restriction in variability presumes that the available sample is not representative of the test-taker population to which users might be expected to generalize. The rationale for the correction should consider the appropriateness of such a generalization. Adjustment formulas that presume constancy in the standard error across score levels should not be used unless that constancy can be defended.
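One standard adjustment of this kind projects the reliability in the broader population from the restricted sample by assuming a constant standard error of measurement across the two groups; a minimal statement, with conventional symbols rather than notation from this text:

```latex
% rho_r, sigma_r^2: reliability and score variance in the restricted sample;
% sigma_u^2: score variance in the unrestricted population; constant SEM assumed.
\rho_u = 1 - \frac{\sigma_r^2\,(1 - \rho_r)}{\sigma_u^2}
```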