
AMERICAN EDUCATIONAL RESEARCH ASSOCIATION

AMERICAN PSYCHOLOGICAL ASSOCIATION


NATIONAL COUNCIL ON MEASUREMENT IN EDUCATION
2. RELIABILITY/PRECISION AND
ERRORS OF MEASUREMENT
BACKGROUND
A test, broadly defined, is a set of tasks or stimuli designed to elicit responses that provide a sample of an examinee's behavior or performance in a specified domain. Coupled with the test is a scoring procedure that enables the scorer to evaluate the behavior or work samples and generate a score. In interpreting and using test scores, it is important to have some indication of their reliability.

The term reliability has been used in two ways in the measurement literature. First, the term has been used to refer to the reliability coefficients of classical test theory, defined as the correlation between scores on two equivalent forms of the test, presuming that taking one form has no effect on performance on the second form. Second, the term has been used in a more general sense, to refer to the consistency of scores across replications of a testing procedure, regardless of how this consistency is estimated or reported (e.g., in terms of standard errors, reliability coefficients per se, generalizability coefficients, error/tolerance ratios, item response theory (IRT) information functions, or various indices of classification consistency). To maintain a link to the traditional notions of reliability while avoiding the ambiguity inherent in using a single, familiar term to refer to a wide range of concepts and indices, we use the term reliability/precision to denote the more general notion of consistency of the scores across instances of the testing procedure, and the term reliability coefficient to refer to the reliability coefficients of classical test theory.

The reliability/precision of measurement is always important. However, the need for precision increases as the consequences of decisions and interpretations grow in importance. If a test score leads to a decision that is not easily reversed, such as rejection or admission of a candidate to a professional school, or a score-based clinical judgment (e.g., in a legal context) that a serious cognitive injury was sustained, a higher degree of reliability/precision is warranted. If a decision can and will be corroborated by information from other sources, or if an erroneous initial decision can be easily corrected, scores with more modest reliability/precision may suffice.

Interpretations of test scores generally depend on assumptions that individuals and groups exhibit some degree of consistency in their scores across independent administrations of the testing procedure. However, different samples of performance from the same person are rarely identical. An individual's performances, products, and responses to sets of tasks or test questions vary in quality or character from one sample of tasks to another and from one occasion to another, even under strictly controlled conditions. Different raters may award different scores to a specific performance. All of these sources of variation are reflected in the examinees' scores, which will vary across instances of a measurement procedure.

The reliability/precision of the scores depends on how much the scores vary across replications of the testing procedure, and analyses of reliability/precision depend on the kinds of variability allowed in the testing procedure (e.g., over tasks, contexts, raters) and the proposed interpretation of the test scores. For example, if the interpretation of the scores assumes that the construct being assessed does not vary over occasions, the variability over occasions is a potential source of measurement error. If the test tasks vary over alternate forms of the test, and the observed performances are treated as a sample from a domain of similar tasks, the random variability in scores from one form to another would be considered error. If raters are used to assign scores to responses, the variability in scores over qualified raters is a source of error. Variations in a test taker's scores that are not consistent with the definition of the construct being assessed are attributed to errors of measurement.


A very basic way to evaluate the consistency of scores involves an analysis of the variation in each test taker's scores across replications of the testing procedure. The test is administered and then, after a brief period during which the examinee's standing on the variable being measured would not be expected to change, the test (or a distinct but equivalent form of the test) is administered a second time; it is assumed that the first administration has no influence on the second administration. Given that the attribute being measured is assumed to remain the same for each test taker over the two administrations and that the test administrations are independent of each other, more variation across the two administrations indicates more error in the test scores and therefore lower reliability/precision.

The impact of such measurement errors can be summarized in a number of ways, but typically, in educational and psychological measurement, it is conceptualized in terms of the standard deviation in the scores for a person over replications of the testing procedure. In most testing contexts, it is not possible to replicate the testing procedure repeatedly, and therefore it is not possible to estimate the standard error for each person's score via repeated measurement. Instead, using model-based assumptions, the average error of measurement is estimated over some population, and this average is referred to as the standard error of measurement (SEM). The SEM is an indicator of a lack of consistency in the scores generated by the testing procedure for some population. A relatively large SEM indicates relatively low reliability/precision. The conditional standard error of measurement for a score level is the standard error of measurement at that score level.

To say that a score includes error implies that there is a hypothetical error-free value that characterizes the variable being assessed. In classical test theory this error-free value is referred to as the person's true score for the test procedure. It is conceptualized as the hypothetical average score over an infinite set of replications of the testing procedure. In statistical terms, a person's true score is an unknown parameter, or constant, and the observed score for the person is a random variable that fluctuates around the true score for the person.

Generalizability theory provides a different framework for estimating reliability/precision. While classical test theory assumes a single distribution for the errors in a test taker's scores, generalizability theory seeks to evaluate the contributions of different sources of error (e.g., items, occasions, raters) to the overall error. The universe score for a person is defined as the expected value over a universe of all possible replications of the testing procedure for the test taker. The universe score of generalizability theory plays a role similar to that of true scores in classical test theory.

Item response theory (IRT) addresses the basic issue of reliability/precision using information functions, which indicate the precision with which observed task/item performances can be used to estimate the value of a latent trait for each test taker. Using IRT, indices analogous to traditional reliability coefficients can be estimated from the item information functions and distributions of the latent trait in some population.

In practice, the reliability/precision of the test scores is typically evaluated in terms of various coefficients, including reliability coefficients, generalizability coefficients, and IRT information functions, depending on the focus of the analysis and the measurement model being used. The coefficients tend to have high values when the variability associated with the error is small compared with the observed variation in the scores (or score differences) to be estimated.

Implications for Validity

Although reliability/precision is discussed here as an independent characteristic of test scores, it should be recognized that the level of reliability/precision of scores has implications for validity. Reliability/precision of data ultimately bears on the generalizability or dependability of the scores and/or the consistency of classifications of individuals derived from the scores. To the extent that scores are not consistent across replications of the testing procedure (i.e., to the extent that they reflect random errors of measurement), their potential for accurate prediction of criteria, for beneficial examinee diagnosis, and for wise decision making is limited.

Specifications for Replications of the Testing Procedure

As indicated earlier, the general notion of reliability/precision is defined in terms of consistency over replications of the testing procedure. Reliability/precision is high if the scores for each person are consistent over replications of the testing procedure and is low if the scores are not consistent over replications. Therefore, in evaluating reliability/precision, it is important to be clear about what constitutes a replication of the testing procedure.

Replications involve independent administrations of the testing procedure, such that the attribute being measured would not be expected to change. For example, in assessing an attribute that is not expected to change over an extended period of time (e.g., in measuring a trait), scores generated on two successive days (using different forms if appropriate) would be considered replications. For a state variable (e.g., mood or hunger), where fairly rapid changes are common, scores generated on two successive days would not be considered replications; the scores obtained on each occasion would be interpreted in terms of the value of the state variable on that occasion. For many tests of knowledge or skill, the administration of alternate forms of a test with different samples of items would be considered replications of the test; for survey instruments and some personality measures, it is expected that the same questions will be used every time the test is administered, and any substantial change in wording would constitute a different test form.

Standardized tests present the same or very similar test materials to all test takers, maintain close adherence to stipulated procedures for test administration, and employ prescribed scoring rules that can be applied with a high degree of consistency. Administering the same questions or commonly scaled questions to all test takers under the same conditions promotes fairness and facilitates comparisons of scores across individuals. Conditions of observation that are fixed or standardized for the testing procedure remain the same across replications. However, some aspects of any standardized testing procedure will be allowed to vary. The time and place of testing, as well as the persons administering the test, are generally allowed to vary to some extent. The particular tasks included in the test may be allowed to vary (as samples from a common content domain), and the persons who score the results can vary over some set of qualified scorers.

Alternate forms (or parallel forms) of a standardized test are designed to have the same general distribution of content and item formats (as described, for example, in detailed test specifications), the same administrative procedures, and at least approximately the same score means and standard deviations in some specified population or populations. Alternate forms of a test are considered interchangeable, in the sense that they are built to the same specifications, and are interpreted as measures of the same construct.

In classical test theory, strictly parallel tests are assumed to measure the same construct and to yield scores that have the same means and standard deviations in the populations of interest and the same correlations with all other variables. A classical reliability coefficient is defined in terms of the correlation between scores from strictly parallel forms of the test, but it is estimated in terms of the correlation between alternate forms of the test that may not be quite strictly parallel.

Different approaches to the estimation of reliability/precision can be implemented to fit different data-collection designs and different interpretations and uses of scores. In some cases, it may be feasible to estimate the variability over replications directly (e.g., by having a number of qualified raters evaluate a sample of test performances for each test taker). In other cases, it may be necessary to use less direct estimates of the reliability coefficient. For example, internal-consistency estimates of reliability (e.g., the split-halves coefficient, KR-20, or coefficient alpha) use the observed extent of agreement between different parts of one test to estimate the reliability associated with form-to-form variability.
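To make the internal-consistency estimates just mentioned concrete, the following sketch computes a split-half coefficient with the Spearman-Brown adjustment and coefficient alpha from a matrix of item scores (rows are examinees, columns are items); for dichotomously scored items, alpha reduces to KR-20. This is an illustration in Python, not a procedure prescribed by the Standards; the data and function names are invented.

    import numpy as np

    def split_half_reliability(item_scores):
        # Correlate odd- and even-numbered item totals, then apply the
        # Spearman-Brown adjustment to estimate full-length reliability.
        odd = item_scores[:, 0::2].sum(axis=1)
        even = item_scores[:, 1::2].sum(axis=1)
        r_half = np.corrcoef(odd, even)[0, 1]
        return 2.0 * r_half / (1.0 + r_half)

    def coefficient_alpha(item_scores):
        # Cronbach's alpha: (k / (k - 1)) * (1 - sum of item variances
        # divided by the variance of the total scores).
        k = item_scores.shape[1]
        item_vars = item_scores.var(axis=0, ddof=1)
        total_var = item_scores.sum(axis=1).var(ddof=1)
        return (k / (k - 1.0)) * (1.0 - item_vars.sum() / total_var)

    # Illustrative data: 0/1 responses for 6 examinees on 4 items.
    x = np.array([[1, 1, 1, 0],
                  [1, 0, 1, 1],
                  [0, 0, 1, 0],
                  [1, 1, 1, 1],
                  [0, 0, 0, 0],
                  [1, 0, 0, 0]])
    print(split_half_reliability(x), coefficient_alpha(x))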

For the split-halves method, scores on two more-or-less parallel halves of the test (e.g., odd-numbered items and even-numbered items) are correlated, and the resulting half-test reliability coefficient is statistically adjusted to estimate reliability for the full-length test. However, when a test is designed to reflect rate of work, internal-consistency estimates of reliability (particularly by the odd-even method) are likely to yield inflated estimates of reliability for highly speeded tests.

In some cases, it may be reasonable to assume that a potential source of variability is likely to be negligible, or that the user will be able to infer adequate reliability from other types of evidence. For example, if test scores are used mainly to predict some criterion and the test does an acceptable job of predicting the criterion, it can be inferred that the test scores are reliable/precise enough for their intended use.

The definition of what constitutes a standardized test or measurement procedure has broadened significantly over the last few decades. Various kinds of performance assessments, simulations, and portfolio-based assessments have been developed to provide measures of constructs that might otherwise be difficult to assess. Each step toward greater flexibility in the assessment procedures enlarges the scope of the variations allowed in replications of the testing procedure, and therefore tends to increase the measurement error. However, some of these sacrifices in reliability/precision may reduce construct irrelevance or construct underrepresentation and thereby improve the validity of the intended interpretations of the scores. For example, performance assessments that depend on ratings of extended responses tend to have lower reliability than more structured assessments (e.g., multiple-choice or short-answer tests), but they can sometimes provide more direct measures of the attribute of interest.

Random errors of measurement are viewed as unpredictable fluctuations in scores. They are conceptually distinguished from systematic errors, which may also affect the performances of individuals or groups, but in a consistent rather than a random manner. For example, an incorrect answer key would contribute systematic error, as would differences in the difficulty of test forms that have not been adequately equated or linked; examinees who take one form may receive higher scores on average than if they had taken the other form. Such systematic errors would not generally be included in the standard error of measurement, and they are not regarded as contributing to a lack of reliability/precision. Rather, systematic errors constitute construct-irrelevant factors that reduce validity but not reliability/precision.

Important sources of random error may be grouped in two broad categories: those rooted within the test takers and those external to them. Fluctuations in the level of an examinee's motivation, interest, or attention and the inconsistent application of skills are clearly internal sources that may lead to random error. Variations in testing conditions (e.g., time of day, level of distractions) and variations in scoring due to scorer subjectivity are examples of external sources that may lead to random error. The importance of any particular source of variation depends on the specific conditions under which the measures are taken, how performances are scored, and the interpretations derived from the scores.

Some changes in scores from one occasion to another are not regarded as error (random or systematic), because they result, in part, from changes in the construct being measured (e.g., due to learning or maturation that has occurred between the initial and final measures). In such cases, the changes in performance would constitute the phenomenon of interest and would not be considered errors of measurement.

Measurement error reduces the usefulness of test scores. It limits the extent to which test results can be generalized beyond the particulars of a given replication of the testing procedure. It reduces the confidence that can be placed in the results from any single measurement and therefore the reliability/precision of the scores. Because random measurement errors are unpredictable, they cannot be removed from observed scores. However, their aggregate magnitude can be summarized in several ways, as discussed below, and they can be controlled to some extent (e.g., by standardization or by averaging over multiple scores).

The standard error of measurement, as such, provides an indication of the expected level of random error over score points and replications for a specific population. In many cases, it is useful to have estimates of the standard errors for individual examinees (or for examinees with scores in certain score ranges). These conditional standard errors are difficult to estimate directly, but they can be estimated indirectly. For example, test information functions based on IRT models can be used to estimate standard errors for different values of a latent ability parameter and/or for different observed scores. In using any of these model-based estimates of conditional standard errors, it is important that the model assumptions be consistent with the data.

Evaluating Reliability/Precision

The ideal approach to the evaluation of reliability/precision would require many independent replications of the testing procedure on a large sample of test takers. The range of differences allowed in replications of the testing procedure and the proposed interpretation of the scores provide a framework for investigating reliability/precision.

For most testing programs, scores are expected to generalize over alternate forms of the test, occasions (within some period), testing contexts, and raters (if judgment is required in scoring). To the extent that the impact of any of these sources of variability is expected to be substantial, the variability should be estimated in some way. It is not necessary that the different sources of variance be estimated separately. The overall reliability/precision, given error variance due to the sampling of forms, occasions, and raters, can be estimated through a test-retest study involving different forms administered on different occasions and scored by different raters.

The interpretation of reliability/precision analyses depends on the population being tested. For example, reliability or generalizability coefficients derived from the scores of a nationally representative sample may differ significantly from those obtained from a more homogeneous sample drawn from one gender, one ethnic group, or one community. Therefore, to the extent feasible (i.e., if sample sizes are large enough), reliability/precision should be estimated separately for all relevant subgroups (e.g., defined in terms of race/ethnicity, gender, or language proficiency) in the population. (Also see chap. 3, "Fairness in Testing.")

Reliability/Generalizability Coefficients

In classical test theory, the consistency of test scores is evaluated mainly in terms of reliability coefficients, defined in terms of the correlation between scores derived from replications of the testing procedure on a sample of test takers. Three broad categories of reliability coefficients are recognized: (a) coefficients derived from the administration of alternate forms in independent testing sessions (alternate-form coefficients); (b) coefficients obtained by administration of the same form on separate occasions (test-retest coefficients); and (c) coefficients based on the relationships/interactions among scores derived from individual items or subsets of the items within a test, all data accruing from a single administration (internal-consistency coefficients). In addition, where test scoring involves a high level of judgment, indices of scorer consistency are commonly obtained.

In formal treatments of classical test theory, reliability can be defined as the ratio of true-score variance to observed-score variance, but it is estimated in terms of reliability coefficients of the kinds mentioned above.

In generalizability theory, these different reliability analyses are treated as special cases of a more general framework for estimating error variance in terms of the variance components associated with different sources of error. A generalizability coefficient is defined as the ratio of universe-score variance to observed-score variance. Unlike traditional approaches to the study of reliability, generalizability theory encourages the researcher to specify and estimate components of true-score variance, error-score variance, and observed-score variance, and to calculate coefficients based on these estimates. Estimation is typically accomplished by the application of analysis-of-variance techniques. The separate numerical estimates of the components of variance (e.g., variance components for items, occasions, and raters, and for the interactions among these potential sources of error) can be used to evaluate the contribution of each source of error to the overall measurement error; the variance-component estimates can be helpful in identifying an effective strategy for controlling overall error variance.
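The analysis-of-variance estimation just described can be sketched for the simplest crossed design, persons by raters, with one score per cell. The code below is a hedged illustration rather than a canonical procedure; it estimates variance components from expected mean squares and forms a generalizability coefficient for relative decisions. All names and data are invented.

    import numpy as np

    def variance_components(x):
        # x: persons-by-raters matrix, one score per cell.
        n_p, n_r = x.shape
        grand = x.mean()
        p_means = x.mean(axis=1)
        r_means = x.mean(axis=0)
        ms_p = n_r * ((p_means - grand) ** 2).sum() / (n_p - 1)
        ms_r = n_p * ((r_means - grand) ** 2).sum() / (n_r - 1)
        resid = x - p_means[:, None] - r_means[None, :] + grand
        ms_pr = (resid ** 2).sum() / ((n_p - 1) * (n_r - 1))
        var_pr_e = ms_pr                # interaction confounded with error
        var_p = (ms_p - ms_pr) / n_r    # universe-score (person) variance
        var_r = (ms_r - ms_pr) / n_p    # rater main effect (severity)
        # Negative estimates are usually truncated at zero in practice.
        return var_p, var_r, var_pr_e

    def g_coefficient(var_p, var_pr_e, n_raters):
        # Universe-score variance over itself plus relative error variance.
        return var_p / (var_p + var_pr_e / n_raters)

    # Example: five essays, each scored by the same three raters.
    x = np.array([[4, 5, 4],
                  [2, 3, 2],
                  [5, 5, 4],
                  [3, 3, 3],
                  [1, 2, 2]], dtype=float)
    vp, vr, vpr = variance_components(x)
    print(g_coefficient(vp, vpr, n_raters=3))

For absolute interpretations, the rater main effect would also enter the error term; this corresponds to the relative/absolute distinction drawn later in this chapter.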


Different reliability (and generalizability) coefficients may appear to be interchangeable, but the different coefficients convey different information. A coefficient may encompass one or more sources of error. For example, a coefficient may reflect error due to scorer inconsistencies but not reflect the variation over an examinee's performances or products. A coefficient may reflect only the internal consistency of item responses within an instrument and fail to reflect measurement error associated with day-to-day changes in examinee performance.

It should not be inferred, however, that alternate-form or test-retest coefficients based on test administrations several days or weeks apart are always preferable to internal-consistency coefficients. In cases where we can assume that scores are not likely to change, based on past experience and/or theoretical considerations, it may be reasonable to assume invariance over occasions (without conducting a test-retest study). Another limitation of test-retest coefficients is that, when the same form of the test is used, the correlation between the first and second scores could be inflated by the test taker's recall of initial responses.

The test information function, an important result of IRT, summarizes how well the test discriminates among individuals at various levels of ability on the trait being assessed. Under the IRT conceptualization for dichotomously scored items, the item characteristic curve, or item response function, is used as a model to represent the increasing proportion of correct responses to an item at increasing levels of the ability or trait being measured. Given appropriate data, the parameters of the characteristic curve for each item in a test can be estimated. The test information function can then be calculated from the parameter estimates for the set of items in the test and can be used to derive coefficients with interpretations similar to reliability coefficients. The information function may be viewed as a mathematical statement of the precision of measurement at each level of the given trait. The IRT information function is based on the results obtained on a specific occasion or in a specific context, and therefore it does not provide an indication of generalizability over occasions or contexts.

Coefficients (e.g., reliability, generalizability, and IRT-based coefficients) have two major advantages over standard errors. First, as indicated above, they can be used to estimate standard errors (overall and/or conditional) in cases where it would not be possible to do so directly. Second, coefficients (e.g., reliability and generalizability coefficients), which are defined in terms of ratios of variances for scores on the same scale, are invariant over linear transformations of the score scale and can be useful in comparing different testing procedures based on different scales. However, such comparisons are rarely straightforward, because they can depend on the variability of the groups on which the coefficients are based, the techniques used to obtain the coefficients, the sources of error reflected in the coefficients, and the lengths and contents of the instruments being compared.

Factors Affecting Reliability/Precision

A number of factors can have significant effects on reliability/precision, and in some cases these factors can lead to misinterpretations of the results if not taken into account.

First, any evaluation of reliability/precision applies to a particular assessment procedure and is likely to change if the procedure is changed in any substantial way. In general, if the assessment is shortened (e.g., by decreasing the number of items or tasks), the reliability is likely to decrease; and if the assessment is lengthened with comparable tasks or items, the reliability is likely to increase. In fact, lengthening the assessment, and thereby increasing the size of the sample of tasks/items (or raters or occasions) being employed, is an effective and commonly used method for improving reliability/precision.

Second, if the variability associated with raters is estimated for a select group of raters who have been especially well trained (and were perhaps involved in the development of the procedures), but raters are not as well trained in some operational contexts, the error associated with rater variability in these operational settings may be much higher than is indicated by the reported interrater reliability coefficients. Similarly, if raters are still refining their performance in the early days of an extended scoring window, the error associated with rater variability may be greater for examinees testing early in the window than for examinees who test later.

Reliability/precision can also depend on the population for which the procedure is being used. In particular, if variability in the construct of interest in the population for which scores are being generated is substantially different from what it is in the population for which reliability/precision was evaluated, the reliability/precision can be quite different in the two populations. When the variability in the construct being measured is low, reliability and generalizability coefficients tend to be small, and when the variability in the construct being measured is higher, the coefficients tend to be larger. Standard errors of measurement are less dependent than reliability and generalizability coefficients on the variability in the sample of test takers.

In addition, reliability/precision can vary from one population to another, even if the variability in the construct of interest in the two populations is the same. The reliability can vary from one population to another because particular sources of error (rater effects, familiarity with formats and instructions, etc.) have more impact in one population than they do in the other. In general, if any aspects of the assessment procedures or the population being assessed are changed in an operational setting, the reliability/precision may change.

Standard Errors of Measurement

The standard error of measurement can be used to generate confidence intervals around reported scores. It is therefore generally more informative than a reliability or generalizability coefficient once a measurement procedure has been adopted and the interpretation of scores has become the user's primary concern.

Estimates of the standard errors at different score levels (that is, conditional standard errors) are usually a valuable supplement to the single statistic for all score levels combined. Conditional standard errors of measurement can be much more informative than a single average standard error for a population. If decisions are based on test scores and these decisions are concentrated in one area or a few areas of the score scale, then the conditional errors in those areas are of special interest.

Like reliability and generalizability coefficients, standard errors may reflect variation from many sources of error or only a few. A more comprehensive standard error (i.e., one that includes the most relevant sources of error, given the definition of the testing procedure and the proposed interpretation) tends to be more informative than a less comprehensive standard error. However, practical constraints often preclude the kinds of studies that would yield information on all potential sources of error, and in such cases it is most informative to evaluate the sources of error that are likely to have the greatest impact.

Interpretations of test scores may be broadly categorized as relative or absolute. Relative interpretations convey the standing of an individual or group within a reference population. Absolute interpretations relate the status of an individual or group to defined performance standards. The standard error is not the same for the two types of interpretations. Any source of error that is the same for all individuals does not contribute to the relative error but may contribute to the absolute error.

Traditional norm-referenced reliability coefficients were developed to evaluate the precision with which test scores estimate the relative standing of examinees on some scale, and they evaluate reliability/precision in terms of the ratio of true-score variance to observed-score variance. As the range of uses of test scores has expanded and the contexts of use have been extended (e.g., diagnostic categorization, the evaluation of educational programs), the range of indices that are used to evaluate reliability/precision has also grown to include indices for various kinds of change scores and difference scores, indices of decision consistency, and indices appropriate for evaluating the precision of group means.
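The confidence-interval use of the SEM and the test information function described in this section can both be sketched briefly. The two-parameter logistic (2PL) model below is one common choice, used here only for illustration; under IRT, the conditional standard error of the trait estimate is the reciprocal square root of the test information. Function names and parameter values are invented.

    import math
    import numpy as np

    def sem(sd_observed, reliability):
        # Classical standard error of measurement: SD * sqrt(1 - reliability).
        return sd_observed * math.sqrt(1.0 - reliability)

    def score_band(score, sd_observed, reliability, z=1.96):
        # Approximate 95% confidence band around a reported score.
        e = sem(sd_observed, reliability)
        return score - z * e, score + z * e

    def test_information(theta, a, b):
        # 2PL test information: sum over items of a^2 * P * (1 - P),
        # where P is the item response function evaluated at theta.
        a, b = np.asarray(a), np.asarray(b)
        p = 1.0 / (1.0 + np.exp(-a * (theta - b)))
        return float((a ** 2 * p * (1.0 - p)).sum())

    def conditional_sem(theta, a, b):
        # Conditional standard error of the trait estimate at theta.
        return 1.0 / math.sqrt(test_information(theta, a, b))

    # Example: SEM = 4.5 when SD = 15 and reliability = .91, and a short
    # 2PL test that is most precise near the middle of the trait range.
    print(score_band(100.0, 15.0, 0.91))   # about (91.2, 108.8)
    a, b = [1.2, 0.8, 1.5, 1.0], [-1.0, 0.0, 0.5, 1.5]
    for theta in (-2.0, 0.0, 2.0):
        print(theta, round(conditional_sem(theta, a, b), 2))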

Some indices of precision, especially standard errors and conditional standard errors, also depend on the scale in which they are reported. An index stated in terms of raw scores or the trait-level estimates of IRT may convey a very different perception of the error if restated in terms of scale scores. For example, for the raw-score scale, the conditional standard error may appear to be high at one score level and low at another, but when the conditional standard errors are restated in units of scale scores, quite different trends in comparative precision may emerge.

Decision Consistency

Where the purpose of measurement is classification, some measurement errors are more serious than others. Test takers who are far above or far below the cut score established for pass/fail or for eligibility for a special program can have considerable error in their observed scores without any effect on their classification decisions. Errors of measurement for examinees whose true scores are close to the cut score are more likely to lead to classification errors. The choice of techniques used to quantify reliability/precision should take these circumstances into account. This can be done by reporting the conditional standard error in the vicinity of the cut score or the decision-consistency/accuracy indices (e.g., percentage of correct decisions, Cohen's kappa), which vary as functions of both score reliability/precision and the location of the cut score.

Decision consistency refers to the extent to which the observed classifications of examinees would be the same across replications of the testing procedure. Decision accuracy refers to the extent to which observed classifications of examinees based on the results of a single replication would agree with their true classification status. Statistical methods are available to calculate indices for both decision consistency and decision accuracy. These methods evaluate the consistency or accuracy of classifications rather than the consistency in scores per se. Note that the degree of consistency or agreement in examinee classification is specific to the cut score employed and its location within the score distribution.

Reliability/Precision of Group Means

Estimates of mean (or average) scores of groups (or proportions in certain categories) involve sources of error that are different from those that operate at the individual level. Such estimates are often used as measures of program effectiveness (and, under some educational accountability systems, may be used to evaluate the effectiveness of schools and teachers).

In evaluating group performance by estimating the mean performance or mean improvement in performance for samples from the group, the variation due to the sampling of persons can be a major source of error, especially if the sample sizes are small. To the extent that different samples from the group of interest (e.g., all students who use certain educational materials) yield different results, conclusions about the expected outcome over all students in the group (including those who might join the group in the future) are uncertain. For large samples, the variability due to the sampling of persons in the estimates of the group means may be quite small. However, in cases where the samples of persons are not very large (e.g., in evaluating the mean achievement of students in a single classroom or the average expressed satisfaction of samples of clients in a clinical program), the error associated with the sampling of persons may be a major component of overall error. It can be a significant source of error in inferences about programs even if there is a high degree of precision in individual test scores.

Standard errors for individual scores are not appropriate measures of the precision of group averages. A more appropriate statistic is the standard error for the estimates of the group means.

Documenting Reliability/Precision

Typically, developers and distributors of tests have primary responsibility for obtaining and reporting evidence for reliability/precision (e.g., appropriate standard errors, reliability or generalizability coefficients, or test information functions). The test user must have such data to make an informed choice among alternative measurement approaches and will generally be unable to conduct adequate reliability/precision studies prior to operational use of an instrument.

In some instances, however, local users of a test or assessment procedure must accept at least partial responsibility for documenting the precision of measurement. This obligation holds when one of the primary purposes of measurement is to classify students using locally developed performance standards, or to rank examinees within the local population. It also holds when users must rely on local scorers who are trained to use the scoring rubrics provided by the test developer. In such settings, local factors may materially affect the magnitude of error variance and observed-score variance. Therefore, the reliability/precision of scores may differ appreciably from that reported by the developer.

Reported evaluations of reliability/precision should identify the potential sources of error for the testing program, given the proposed uses of the scores. These potential sources of error can then be evaluated in terms of previously reported research, new empirical studies, or analyses of the reasons for assuming that a potential source of error is likely to be negligible and therefore can be ignored.

The reporting of indices of reliability/precision alone, with little detail regarding the methods used to estimate the indices reported, the nature of the group from which the data were derived, and the conditions under which the data were obtained, constitutes inadequate documentation. General statements to the effect that a test is "reliable" or that it is "sufficiently reliable to permit interpretations of individual scores" are rarely, if ever, acceptable. It is the user who must take responsibility for determining whether the scores are sufficiently trustworthy to justify the anticipated uses and interpretations. Nevertheless, test constructors and publishers are obligated to provide sufficient data to make informed judgments possible.

If scores are to be used for classification, indices of decision consistency are useful in addition to estimates of the reliability/precision of the scores. If group means are likely to play a substantial role in the use of the scores, the reliability/precision of these mean scores should be reported.

As the foregoing comments emphasize, there is no single, preferred approach to quantification of reliability/precision. No single index adequately conveys all of the relevant information. No one method of investigation is optimal in all situations, nor is the test developer limited to a single approach for any instrument. The choice of estimation techniques and the minimum acceptable level for any index remain a matter of professional judgment.
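Returning to the decision-consistency indices described above (percentage of consistent classifications and Cohen's kappa), both can be computed directly when classifications from two replications are available. A minimal sketch for a single pass/fail cut score; the names and data are illustrative.

    import numpy as np

    def decision_consistency(scores_a, scores_b, cut):
        # Classify each examinee on two replications, then compute the
        # proportion classified the same way and Cohen's kappa, which
        # corrects that proportion for chance agreement.
        a = np.asarray(scores_a) >= cut
        b = np.asarray(scores_b) >= cut
        p_observed = float((a == b).mean())
        p_chance = a.mean() * b.mean() + (1 - a.mean()) * (1 - b.mean())
        kappa = (p_observed - p_chance) / (1.0 - p_chance)
        return p_observed, kappa

    # Example: pass/fail at a cut score of 60 on two alternate forms.
    form_a = [55, 72, 61, 48, 90, 66, 59, 74]
    form_b = [58, 70, 57, 50, 88, 68, 62, 71]
    print(decision_consistency(form_a, form_b, cut=60))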


STANDARDS FOR RELIABILITY/PRECISION

The standards in this chapter begin with an overarching standard (numbered 2.0), which is designed to convey the central intent or primary focus of the chapter. The overarching standard may also be viewed as the guiding principle of the chapter, and is applicable to all tests and test users. All subsequent standards have been separated into eight thematic clusters labeled as follows:

1. Specifications for Replications of the Testing Procedure
2. Evaluating Reliability/Precision
3. Reliability/Generalizability Coefficients
4. Factors Affecting Reliability/Precision
5. Standard Errors of Measurement
6. Decision Consistency
7. Reliability/Precision of Group Means
8. Documenting Reliability/Precision

Standard 2.0

Appropriate evidence of reliability/precision should be provided for the interpretation for each intended score use.

Comment: The form of the evidence (reliability or generalizability coefficient, information function, conditional standard error, index of decision consistency) for reliability/precision should be appropriate for the intended uses of the scores, the population involved, and the psychometric models used to derive the scores. A higher degree of reliability/precision is required for score uses that have more significant consequences for test takers. Conversely, a lower degree may be acceptable where a decision based on the test score is reversible or dependent on corroboration from other sources of information.

Cluster 1. Specifications for Replications of the Testing Procedure

Standard 2.1

The range of replications over which reliability/precision is being evaluated should be clearly stated, along with a rationale for the choice of this definition, given the testing situation.

Comment: For any testing program, some aspects of the testing procedure (e.g., time limits and availability of resources such as books, calculators, and computers) are likely to be fixed, and some aspects will be allowed to vary from one administration to another (e.g., specific tasks or stimuli, testing contexts, raters, and, possibly, occasions). Any test administration that maintains fixed conditions and involves acceptable samples of the conditions that are allowed to vary would be considered a legitimate replication of the testing procedure. As a first step in evaluating the reliability/precision of the scores obtained with a testing procedure, it is important to identify the range of conditions of various kinds that are allowed to vary, and over which scores are to be generalized.

Standard 2.2

The evidence provided for the reliability/precision of the scores should be consistent with the domain of replications associated with the testing procedures, and with the intended interpretations for use of the test scores.

Comment: The evidence for reliability/precision should be consistent with the design of the testing procedures and with the proposed interpretations for use of the test scores. For example, if the test can be taken on any of a range of occasions, and the interpretation presumes that the scores are invariant over these occasions, then any variability in scores over these occasions is a potential source of error. If the tasks or stimuli are allowed to vary over alternate forms of the test, and the observed performances are treated as a sample from a domain of similar tasks, the variability in scores from one form to another would be considered error. If raters are used to assign scores to responses, the variability in scores over qualified raters is a source of error. Different sources of error can be evaluated in a single coefficient or standard error, or they can be evaluated separately, but they should all be addressed in some way. Reports of reliability/precision should specify the potential sources of error included in the analyses.

Cluster 2. Evaluating Reliability/Precision

Standard 2.3

For each total score, subscore, or combination of scores that is to be interpreted, estimates of relevant indices of reliability/precision should be reported.

Comment: It is not sufficient to report estimates of reliabilities and standard errors of measurement only for total scores when subscores are also interpreted. The form-to-form and day-to-day consistency of total scores on a test may be acceptably high, yet subscores may have unacceptably low reliability, depending on how they are defined and used. Users should be supplied with reliability data for all scores to be interpreted, and these data should be detailed enough to enable the users to judge whether the scores are precise enough for the intended interpretations for use. Composites formed from selected subtests within a test battery are frequently proposed for predictive and diagnostic purposes. Users need information about the reliability of such composites.

Standard 2.4

When a test score interpretation emphasizes differences between two observed scores of an individual or two averages of a group, reliability/precision data, including standard errors, should be provided for such differences.

Comment: Observed score differences are used for a variety of purposes. Achievement gains are frequently of interest for groups as well as individuals. In some cases, the reliability/precision of change scores can be much lower than the reliabilities of the separate scores involved (see the sketch following Standard 2.5). Differences between verbal and performance scores on tests of intelligence and scholastic ability are often employed in the diagnosis of cognitive impairment and learning problems. Psychodiagnostic inferences are frequently drawn from the differences between subtest scores. Aptitude and achievement batteries, interest inventories, and personality assessments are commonly used to identify and quantify the relative strengths and weaknesses, or the pattern of trait levels, of a test taker. When the interpretation of test scores centers on the peaks and valleys in the examinee's test score profile, the reliability of score differences is critical.

Standard 2.5

Reliability estimation procedures should be consistent with the structure of the test.

Comment: A single total score can be computed on tests that are multidimensional. The total score on a test that is substantially multidimensional should be treated as a composite score. If an internal-consistency estimate of total score reliability is obtained by the split-halves procedure, the halves should be comparable in content and statistical characteristics.

In adaptive testing procedures, the set of tasks included in the test and the sequencing of tasks are tailored to the test taker, using model-based algorithms. In this context, reliability/precision can be estimated using simulations based on the model. For adaptive testing, model-based conditional standard errors may be particularly useful and appropriate in evaluating the technical adequacy of the procedure.
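As a concrete illustration of the point in the comment to Standard 2.4, the classical formula for the reliability of a difference score can be sketched as follows. The values are invented; this is an illustration, not a formula the Standards mandates.

    def difference_score_reliability(r11, r22, r12, sd1=1.0, sd2=1.0):
        # Classical reliability of the difference between two scores:
        # a high correlation between the two scores drives the
        # reliability of their difference well below that of either
        # score taken alone.
        num = sd1**2 * r11 + sd2**2 * r22 - 2 * sd1 * sd2 * r12
        den = sd1**2 + sd2**2 - 2 * sd1 * sd2 * r12
        return num / den

    # Two scores each with reliability .80, correlated .60:
    print(difference_score_reliability(0.80, 0.80, 0.60))  # 0.50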


Cluster 3. Reliability/Generalizability Coefficients

Standard 2.6

A reliability or generalizability coefficient (or standard error) that addresses one kind of variability should not be interpreted as interchangeable with indices that address other kinds of variability, unless their definitions of measurement error can be considered equivalent.

Comment: Internal-consistency, alternate-form, and test-retest coefficients should not be considered equivalent, as each incorporates a unique definition of measurement error. Error variances derived via item response theory are generally not equivalent to error variances estimated via other approaches. Test developers should state the sources of error that are reflected in, and those that are ignored by, the reported reliability or generalizability coefficients.

Standard 2.7

When subjective judgment enters into test scoring, evidence should be provided on both interrater consistency in scoring and within-examinee consistency over repeated measurements. A clear distinction should be made among reliability data based on (a) independent panels of raters scoring the same performances or products, (b) a single panel scoring successive performances or new products, and (c) independent panels scoring successive performances or new products.

Comment: Task-to-task variations in the quality of an examinee's performance and rater-to-rater inconsistencies in scoring represent independent sources of measurement error. Reports of reliability/precision studies should make clear which of these sources are reflected in the data. Generalizability studies and variance-component analyses can be helpful in estimating the error variances arising from each source of error. These analyses can provide separate error-variance estimates for tasks, for judges, and for occasions within the time period of trait stability. Information should be provided on the qualifications and training of the judges used in reliability studies. Interrater or interobserver agreement may be particularly important for ratings and observational data that involve subtle discriminations. It should be noted, however, that when raters evaluate positively correlated characteristics, a favorable or unfavorable assessment of one trait may color their opinions of other traits. Moreover, high interrater consistency does not imply high examinee consistency from task to task. Therefore, interrater agreement does not guarantee high reliability of examinee scores.

Cluster 4. Factors Affecting Reliability/Precision

Standard 2.8

When constructed-response tests are scored locally, reliability/precision data should be gathered and reported for the local scoring when adequate-size samples are available.

Comment: For example, many statewide testing programs depend on local scoring of essays, constructed-response exercises, and performance tasks. Reliability/precision analyses can indicate that additional training of scorers is needed and, hence, should be an integral part of program monitoring. Reliability/precision data should be released only when sufficient to yield statistically sound results and consistent with applicable privacy obligations.

Standard 2.9

When a test is available in both long and short versions, evidence for reliability/precision should be reported for scores on each version, preferably based on independent administration(s) of each version with independent samples of test takers.

Comment: The reliability/precision of scores on each version is best evaluated through an independent administration of each, using the designated time limits. Psychometric models can be used to estimate the reliability/precision of a shorter (or longer) version of an existing test, based on data from an administration of the existing test. However, these models generally make assumptions that may not be met (e.g., that the items in the existing test and the items to be added or dropped are all randomly sampled from a single domain). Context effects are commonplace in tests of maximum performance, and the short version of a standardized test often comprises a nonrandom sample of items from the full-length version. As a result, the predicted value of the reliability/precision may not provide a very good estimate of the actual value, and therefore, where feasible, the reliability/precision of both forms should be evaluated directly and independently.

Standard 2.10

When significant variations are permitted in tests or test administration procedures, separate reliability/precision analyses should be provided for scores produced under each major variation if adequate sample sizes are available.

Comment: To make a test accessible to all examinees, test publishers or users might authorize, or might be legally required to authorize, accommodations or modifications in the procedures that are specified for the administration of a test. For example, audio or large-print versions may be used for test takers who are visually impaired. Any alteration in standard testing materials or procedures may have an impact on the reliability/precision of the resulting scores, and therefore, to the extent feasible, the reliability/precision should be examined for all versions of the test and testing procedures.

Standard 2.11

Test publishers should provide estimates of reliability/precision as soon as feasible for each relevant subgroup for which the test is recommended.

Comment: Reporting estimates of reliability/precision for relevant subgroups is useful in many contexts, but it is especially important if the interpretation of scores involves within-group inferences (e.g., in terms of subgroup norms). For example, test users who work with a specific linguistic and cultural subgroup or with individuals who have a particular disability would benefit from an estimate of the standard error for the subgroup. Likewise, evidence that preschool children tend to respond to test stimuli in a less consistent fashion than do older children would be helpful to test users interpreting scores across age groups.

When considering the reliability/precision of test scores for relevant subgroups, it is useful to evaluate and report the standard error of measurement as well as any coefficients that are estimated. Reliability and generalizability coefficients can differ substantially when subgroups have different variances on the construct being assessed. Differences in within-group variability tend to have less impact on the standard error of measurement.

Standard 2.12

If a test is proposed for use in several grades or over a range of ages, and if separate norms are provided for each grade or each age range, reliability/precision data should be provided for each age or grade-level subgroup, not just for all grades or ages combined.

Comment: A reliability or generalizability coefficient based on a sample of examinees spanning several grades or a broad range of ages in which average scores are steadily increasing will generally give a spuriously inflated impression of reliability/precision. When a test is intended to discriminate within age or grade populations, reliability or generalizability coefficients and standard errors should be reported separately for each subgroup.
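The point in the comments to Standards 2.11 and 2.12, that coefficients depend on group variability while the standard error of measurement is comparatively stable, follows from the classical identity reliability = 1 - SEM^2 / SD^2. A brief illustration in Python, with invented values:

    def reliability_given_sem(sem, sd_group):
        # Classical identity: reliability = 1 - SEM^2 / SD^2.
        return 1.0 - (sem / sd_group) ** 2

    # Holding the SEM fixed at 4.5: a group with SD = 15 yields r = .91,
    # while a more homogeneous subgroup with SD = 9 yields r = .75.
    for sd in (15.0, 9.0):
        print(sd, round(reliability_given_sem(4.5, sd), 2))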


Cluster 5. Standard Errors of Measurement

Standard 2.13

The standard error of measurement, both overall and conditional (if reported), should be provided in units of each reported score.

Comment: The standard error of measurement (overall or conditional) that is reported should be consistent with the scales that are used in reporting scores. Standard errors in scale-score units for the scales used to report scores and/or to make decisions are particularly helpful to the typical test user. The data on examinee performance should be consistent with the assumptions built into any statistical models used to generate scale scores and to estimate the standard errors for these scores.

Standard 2.14

When possible and appropriate, conditional standard errors of measurement should be reported at several score levels unless there is evidence that the standard error is constant across score levels. Where cut scores are specified for selection or classification, the standard errors of measurement should be reported in the vicinity of each cut score.

Comment: Estimation of conditional standard errors is usually feasible with the sample sizes that are used for analyses of reliability/precision. If it is assumed that the standard error is constant over a broad range of score levels, the rationale for this assumption should be presented. The model on which the computation of the conditional standard errors is based should be specified.

Standard 2.15

When there is credible evidence for expecting that conditional standard errors of measurement or test information functions will differ substantially for various subgroups, investigation of the extent and impact of such differences should be undertaken and reported as soon as is feasible.

Comment: If differences are found, they should be clearly indicated in the appropriate documentation. In addition, if substantial differences do exist, the test content and scoring models should be examined to see if there are legally acceptable alternatives that do not result in such differences.

Cluster 6. Decision Consistency

Standard 2.16

When a test or combination of measures is used to make classification decisions, estimates should be provided of the percentage of test takers who would be classified in the same way on two replications of the procedure.

Comment: When a test score or composite score is used to make classification decisions (e.g., pass/fail, achievement levels), the standard error of measurement at or near the cut scores has important implications for the trustworthiness of these decisions. However, the standard error cannot be translated into the expected percentage of consistent or accurate decisions without strong assumptions about the distributions of measurement errors and true scores. Although decision consistency is typically estimated from the administration of a single form, it can and should be estimated directly through the use of a test-retest approach, if consistent with the requirements of test security, and if the assumption of no change in the construct is met and adequate samples are available.

Cluster 7. Reliability/Precision of Group Means

Standard 2.17

When average test scores for groups are the focus of the proposed interpretation of the test results, the groups tested should generally be regarded as a sample from a larger population, even if all examinees available at the time of measurement are tested. In such cases the standard error of the group mean should be reported, because it reflects variability due to the sampling of examinees as well as variability due to individual measurement error.

Comment: The overall levels of performance in various groups tend to be the focus in program evaluation and in accountability systems, and the groups that are of interest include all students/clients who could participate in the program over some period. Therefore, the students in a particular class or school at the current time, the current clients of a social service agency, and analogous groups exposed to a program of interest typically constitute a sample in a longitudinal sense. Presumably, comparable groups from the same population will recur in future years, given static conditions. The factors leading to uncertainty in conclusions about program effectiveness arise from the sampling of persons as well as from individual measurement error.

Standard 2.18

When the purpose of testing is to measure the performance of groups rather than individuals, subsets of items can be assigned randomly to different subsamples of examinees. Data are aggregated across subsamples and item subsets to obtain a measure of group performance. When such procedures are used for program evaluation or population descriptions, reliability/precision analyses must take the sampling scheme into account.

Comment: This type of measurement program is termed matrix sampling. It is designed to reduce the time demanded of individual examinees and yet to increase the total number of items on which data can be obtained. This testing approach provides the same type of information about group performances that would be obtained if all examinees had taken all of the items. Reliability/precision statistics should reflect the sampling plan used with respect to examinees and items.

Cluster 8. Documenting Reliability/Precision

Standard 2.19

Each method of quantifying the reliability/precision of scores should be described clearly and expressed in terms of statistics appropriate to the method. The sampling procedures used to select test takers for reliability/precision analyses and the descriptive statistics on these samples, subject to privacy obligations where applicable, should be reported.

Comment: Information on the method of data collection, sample sizes, means, standard deviations, and demographic characteristics of the groups tested helps users judge the extent to which reported data apply to their own examinee populations. If the test-retest or alternate-form approach is used, the interval between administrations should be indicated.

Because there are many ways of estimating reliability/precision, and each is influenced by different sources of measurement error, it is unacceptable to say simply, "The reliability/precision of scores on test X is .90." A better statement would be, "The reliability coefficient of .90 reported for scores on test X was obtained by correlating scores from forms A and B, administered on successive days. The data were based on a sample of 400 10th-grade students from five middle-class suburban schools in New York State. The demographic breakdown of this group was as follows: ..." In some cases, for example, when small sample sizes or particularly sensitive data are involved, applicable legal restrictions governing privacy may limit the level of information that should be disclosed.

Standard 2.20

If reliability coefficients are adjusted for restriction of range or variability, the adjustment procedure and both the adjusted and unadjusted coefficients should be reported. The standard deviations of the group actually tested and of the target population, as well as the rationale for the adjustment, should be presented.

Comment: Application of a correction for restriction in variability presumes that the available sample is not representative (in terms of variability) of the test-taker population to which users might be expected to generalize. The rationale for the correction should consider the appropriateness of such a generalization. Adjustment formulas that presume constancy in the standard error across score levels should not be used unless constancy can be defended.
