This study examined the frequency of use of various types of reliability coefficients for a systematically drawn sample of 696 tests appearing in the APA-published Directory of Unpublished Experimental Mental Measures. Almost all articles included some type of reliability report for at least one test administration. Coefficient alpha was the overwhelming favorite among types of coefficients. Several measures treated almost universally in psychological-testing textbooks were rarely or never used. Problems encountered in the study included ambiguous designations of types of coefficients, reporting reliability based on a study other than the one cited, inadequate information about subscales, and simply incorrect recording of the information given in an original source.
The APA Task Force on Statistical Inference was appointed by the APA Board of Scientific Affairs in 1996. The recommendations of the Task Force were published in a recent issue of the American Psychologist (Wilkinson & The APA Task Force on Statistical Inference, 1999). Among its various recommendations, the task force emphasized, “Always provide some effect-size estimate when reporting a p value” (p. 599, italics added). Later, the task force also wrote,

Always present effect sizes for primary outcomes. . . . It helps to add brief comments that place these effect sizes in a practical and theoretical context. . . . We must stress again that reporting and interpreting effect sizes in the context of previously reported effects is essential to good research. (p. 599, italics added)
However, the task force also emphasized that “Interpreting the size of observed effects requires an assessment of the reliability of the scores” (p. 596) being analyzed, because score unreliability attenuates detected study effects.
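This attenuation follows from classical test theory, in which an observed correlation equals the true-score correlation multiplied by the square root of the product of the two measures' score reliabilities. The following is a minimal sketch of that relation; the function name and the reliability values are hypothetical, chosen only for illustration.

```python
import math

def attenuated_r(r_true: float, rel_x: float, rel_y: float) -> float:
    """Observed correlation implied by classical test theory:
    r_observed = r_true * sqrt(rel_x * rel_y)."""
    return r_true * math.sqrt(rel_x * rel_y)

# Hypothetical example: a true correlation of .50 measured with
# scores whose reliabilities are .80 and .70.
r_obs = attenuated_r(0.50, 0.80, 0.70)
print(round(r_obs, 3))  # 0.374: the detected effect shrinks from .50
```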
Method
The Directory of Unpublished Experimental Mental Measures, Volume 7 (Goldman, Mitchell, & Egelson, 1997), a publication of the American Psychological Association, was used to define the population of tests examined in the present study. The most recent directory provided information for 2,078 tests appearing in 37 professional journals from 1991 to 1995. The journals are primarily from the fields of education, psychology, and sociology. The directory provides summary information for the following three items: test name, purpose, and source; and for at least four of the following
Table 1
Number of Types of Reliability Coefficients Reported
Number Reported        n      %
None                  43      6
One                  523     75
Two                  117     17
Three                 12      2
Four                   1      0
Results
Table 1 summarizes the number of types of reliability reported for each test entry. Reliability information was not reported for 6% of the entries. For the majority of entries (75%), only one type of reliability was reported, whereas 17% reported two types of reliability, and a few cases reported three or four types of reliability. Table 2 summarizes the types of reliability coefficients reported; entries are arranged in order of frequency of occurrence. There are a total of 801 entries for the 653 tests for which some type of reliability was reported, because 19% of the tests had more than one type of reliability reported. The “unspecified” category includes entries that reported a reliability coefficient with no indication of its type. Tables 1 and 2 provide the basic information for answering our question about the frequency of use for various types of reliability methods.

Table 2
Types of Coefficients (type, n, %)
The number of items per test varied considerably, ranging from 2 to 400, but the great majority of tests were in a fairly narrow range centered around the median value of 20. There were a few cases in which we could not determine how many items the test contained. It was also often difficult to determine the number of subscales on a test. If only one reliability coefficient was given, which was the modal case, we assumed there was only one scale. Multiple scales were recorded when there was specific reference to separate scales (e.g., by reference to subscale names and/or number of items) or by provision of separate reliability data by subscale. An “indeterminate” category included cases in which there was explicit reference to subscales, but we could not determine from the directory how many such scales there were. We found that 533 (77%) of the measures consisted of a single scale.
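Restated schematically, the coding rule was: one reported coefficient and no subscale references implied a single scale; explicit subscale references implied multiple scales, or an indeterminate count when the number could not be established. The sketch below is only an illustration of that decision logic; the function and argument names are hypothetical, not part of the original study.

```python
def classify_scales(mentions_subscales: bool,
                    subscale_count_known: bool) -> str:
    """Illustrative restatement of the subscale coding rule
    (hypothetical helper; names are ours, not the study's)."""
    if mentions_subscales:
        # Explicit subscale references with no determinable count
        # were coded as "indeterminate."
        return "multiple" if subscale_count_known else "indeterminate"
    # Modal case: a single reported coefficient and no subscale
    # references were coded as a single-scale instrument.
    return "single"

print(classify_scales(False, False))  # single
print(classify_scales(True, True))    # multiple
print(classify_scales(True, False))   # indeterminate
```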
Table 3 summarizes the actual reliability coefficients recorded. Most coefficients were in the range of .75 to .95. It is difficult to generalize from the data in Table 3 because, as noted below, some of the reliability coefficients reported in the directory were not based on the articles covered in the directory but on other sources cited in those articles. The total number of coefficients reported in Table 3 is 799, whereas 801 cases are categorized in Table 2. We identified 2 cases in which a type of reliability was reported but no reliability coefficient was presented. In one of these cases, the entry stated that “Test-retest was over 80%” and in the other case that “Interrater agreement was 88.7%.”
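These two entries also illustrate why such reports are hard to tabulate: percent agreement is a proportion of identical ratings, not a correlation-type reliability coefficient, so the two kinds of values are not directly comparable. A minimal sketch of the computation, using made-up ratings:

```python
# Hypothetical ratings of ten cases by two raters.
rater_a = [1, 2, 2, 3, 1, 2, 3, 3, 1, 2]
rater_b = [1, 2, 3, 3, 1, 2, 3, 2, 1, 2]

# Interrater percent agreement: share of cases rated identically.
agreement = sum(a == b for a, b in zip(rater_a, rater_b)) / len(rater_a)
print(f"{agreement:.0%}")  # 80% -- a proportion, not a correlation
```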
Table 3
Distribution of Coefficient Values
Coefficient Range      n
.95 to .99            32
.90 to .94           110
.85 to .89           153
.80 to .84           160
.75 to .79           134
.70 to .74            88
.65 to .69            54
.60 to .64            29
.55 to .59            22
.50 to .54             4
.45 to .49             2
.40 to .44             4
.35 to .39             1
.30 to .34             3
.25 to .29             1
.20 to .24             2
Discussion
Coefficient Selection
The most evident outcome from the study is the overwhelming reliance on coefficient alpha for reporting reliability. It was used for at least two thirds of the tests recorded in the directory. Although we could not feasibly verify this for all of the articles represented in this study, we did retrace some ambiguous cases back to the original articles and determined that coefficient alpha was used in other cases in which it was not exactly clear what reliability coefficient was reported in the directory. For example, some coefficients reported in the directory as “internal consistency” were alpha coefficients. It is likely that some of the cases reported as “interitem” or “unspecified” were also alpha coefficients. Thus, if all were known, coefficient alpha was probably used in nearly three quarters of the cases. In his original presentation of alpha, Cronbach (1951) referred to alpha as “a tool that we expect to become increasingly prominent in the test literature” (p. 299). His view seems to have been correct.
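For readers who want the computation, coefficient alpha is a function of the individual item variances and the variance of total scores: alpha = k/(k - 1) x (1 - sum of item variances / total-score variance), for k items. A minimal sketch with hypothetical responses (not data from this study):

```python
def cronbach_alpha(items):
    """Coefficient alpha for a persons-by-items score matrix:
    alpha = k/(k-1) * (1 - sum(item variances) / var(total score))."""
    k = len(items[0])

    def var(xs):  # sample variance, n - 1 in the denominator
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

    item_vars = [var([row[j] for row in items]) for j in range(k)]
    total_var = var([sum(row) for row in items])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)

# Hypothetical responses: 4 persons x 3 items.
data = [[2, 3, 3], [4, 4, 5], [1, 2, 2], [3, 3, 4]]
print(round(cronbach_alpha(data), 2))  # 0.97
```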
In the typical textbook on psychological testing, coefficient alpha generally receives no more treatment, and often less treatment, than reliability methods such as split-half or KR-20. Of course, alpha is a generalization of the KR-20 formula first presented by the founding editor of EPM. Based on our analysis, we recommend that coverage of coefficient alpha be increased substantially.
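The relation to KR-20 can be made concrete: for dichotomous (0/1) items, each item's variance reduces to p(1 - p), where p is the proportion answering correctly, and alpha specializes to the KR-20 formula. A minimal sketch with hypothetical 0/1 responses (not data from this study):

```python
def kr20(items):
    """KR-20 for a persons-by-items matrix of 0/1 scores:
    kr20 = k/(k-1) * (1 - sum(p*q) / var(total score)),
    using population (n-denominator) variances throughout."""
    k = len(items[0])
    n = len(items)

    def pvar(xs):  # population variance, n in the denominator
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / len(xs)

    pq = 0.0
    for j in range(k):
        p = sum(row[j] for row in items) / n  # proportion correct, item j
        pq += p * (1 - p)                     # Bernoulli item variance
    total_var = pvar([sum(row) for row in items])
    return k / (k - 1) * (1 - pq / total_var)

# Hypothetical responses: 5 persons x 4 dichotomous items.
data = [[1, 1, 0, 1], [0, 1, 0, 0], [1, 1, 1, 1], [0, 0, 0, 1], [1, 0, 1, 1]]
print(round(kr20(data), 2))  # 0.52
```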
It was curious to find some duplicate entries in the directory. That is, the
same test and same article were recorded in different places, sometimes
within the same chapter. There was no evident rationale for such duplication.
Finally, we found some instances in which the reliability information recorded in the directory was simply incorrect. We did not systematically check all entries but discovered these errors when retracing other matters, such as whether a report of internal consistency could be more precisely identified. This point, of course, reemphasizes the need for the user to consult original sources before making final judgments about an instrument identified in the directory. The directory is an outstanding source of information for individuals needing to find an instrument for a particular purpose. However, due diligence must be exercised in its use.
Of course, the present study had two evident limitations. First, the generalizations suggested above are confined to the articles covered in the directory and perhaps somewhat more widely to “unpublished” tests appearing in the broad range of journals in the social and behavioral sciences. If one surveyed “published” tests, a different set of generalizations might emerge. Second, we did not attempt a complete verification of the directory; that is, we did not retrace all entries. Thus, we cannot offer definitive statements about the relative frequency with which certain problems occur in the directory, for example, incorrect information being recorded or specific reliability methods being recorded with a more generic label. We determined that such problems do exist, but our retracing was not systematic enough to say how common they are.
References
Aiken, L. R. (1997). Psychological testing and assessment (9th ed.). Boston: Allyn & Bacon.
Anastasi, A., & Urbina, S. (1997). Psychological testing (7th ed.). Upper Saddle River, NJ:
Prentice Hall.
Cohen, R. J., & Swerdlik, M. E. (1999). Psychological testing and assessment: An introduction
to tests and measurement (4th ed.). Mountain View, CA: Mayfield.
Crocker, L., & Algina, J. (1986). Introduction to classical and modern test theory. New York:
Holt, Rinehart & Winston.
Cronbach, L. J. (1951). Coefficient alpha and the internal structure of tests. Psychometrika, 16,
297-334.
Eason, S. (1991). Why generalizability theory yields better results than classical test theory: A
primer with concrete examples. In B. Thompson (Ed.), Advances in educational research:
Substantive findings, methodological developments (Vol. 1, pp. 83-98). Greenwich, CT: JAI.
Educational Testing Service. (1995). The ETS test collection catalog: Vol. 2. Vocational tests and
measurement devices (2nd ed.). Phoenix, AZ: Oryx.
Fan, X. (1998). Item response theory and classical test theory: An empirical comparison of their
item/person statistics. Educational and Psychological Measurement, 58, 357-381.
Feldt, L. S., & Brennan, R. L. (1989). Reliability. In R. L. Linn (Ed.), Educational measurement
(3rd ed., pp. 105-146). Washington, DC: American Council on Education/Oryx.
Goldman, B. A., Mitchell, D. F., & Egelson, P. E. (1997). Directory of unpublished experimental
mental measures (Vol. 7). Washington, DC: American Psychological Association.
Gregory, R. J. (1996). Psychological testing: History, principles, and applications (2nd ed.).
Boston: Allyn & Bacon.
Jaeger, R. (1991). Foreword. In R. J. Shavelson & N. M. Webb (Eds.), Generalizability theory: A
primer (pp. ix-xi). Newbury Park, CA: Sage.
Joint Committee on Standards for Educational Evaluation. (1994). The program evaluation
standards: How to assess evaluations of educational programs (2nd ed.). Thousand Oaks,
CA: Sage.
Kaplan, R. M., & Saccuzzo, D. P. (1997). Psychological testing: Principles, applications, and issues (4th ed.). Pacific Grove, CA: Brooks/Cole.
Maddox, T. (1997). Tests: A comprehensive reference for assessments in psychology, education,
and business (4th ed.). Austin, TX: Pro-Ed.
Murphy, K. R., & Davidshofer, C. O. (1998). Psychological testing: Principles and applications
(4th ed.). Upper Saddle River, NJ: Prentice Hall.
Murphy, L. L., Impara, J. C., & Plake, B. S. (1999). Tests in print V. Lincoln: University of Nebraska Press.
Nunnally, J. C., & Bernstein, I. H. (1994). Psychometric theory (3rd ed.). New York:
McGraw-Hill.
Reinhardt, B. (1996). Factors affecting coefficient alpha: A mini Monte Carlo study. In B. Thompson
(Ed.), Advances in social science methodology (Vol. 4, pp. 3-20). Greenwich, CT: JAI.
Thompson, B. (1991). Review of Generalizability theory: A primer by R. J. Shavelson & N. M. Webb. Educational and Psychological Measurement, 51, 1069-1075.
Thompson, B., & Snyder, P. A. (1998). Statistical significance and reliability analyses in recent
JCD research articles. Journal of Counseling and Development, 76, 436-441.
Thompson, B., & Vacha-Haase, T. (2000). Psychometrics is datametrics: The test is not reliable.
Educational and Psychological Measurement, 60, 174-195.
Vacha-Haase, T., Ness, C., Nilsson, J., & Reetz, D. (1999). Practices regarding reporting of reliability coefficients: A review of three journals. Journal of Experimental Education, 67, 335-341.
Wilkinson, L., & The APA Task Force on Statistical Inference. (1999). Statistical methods in
psychology journals: Guidelines and explanations. American Psychologist, 54, 594-604.
(Reprint available through the APA home page: http://www.apa.org/journals/amp/
amp548594.html)
Willson, V. L. (1980). Research techniques in AERJ articles: 1969 to 1978. Educational Researcher, 9(6), 5-10.