
EDUCATIONAL AND PSYCHOLOGICAL MEASUREMENT

VOL. XXV, No. 2, 1965

ALPHA COEFFICIENTS FOR STRATIFIED-PARALLEL TESTS

LEE J. CRONBACH, PETER SCHÖNEMANN, AND DOUGLAS McKIE¹
University of Illinois

IN an internal-consistency study investigators almost always rely
on the KR20 formula or its generalized version α, a coefficient
most appropriate for tests formed by random sampling of items
(Cronbach, Rajaratnam, and Gleser, 1963). Several sources (Cronbach,
1951; Technical Recommendations, 1954; Lord, 1956; Tryon,
1957) have questioned the suitability of internal-consistency analysis
that does not take stratification into account and have recommended
instead a modification of the Jackson-Ferguson (1941)
"battery reliability" coefficient. But the usefulness of this coefficient,
which we shall call α_s, has not been widely appreciated. Rajaratnam,
Cronbach, and Gleser (1964) have recently placed the
theory of stratified-parallel tests on a more substantial basis, discussing
them in the context of a "theory of generalizability." These
authors rederive the formula for α_s, present a formula γ_s analogous
to KR21, and advocate the application of these formulas to any test
constructed by a stratified plan. It is uncertain, however, how much

¹ A study conducted under grant M-1839 from the National Institute of
Mental Health. The assistance of Hiroshi Azuma and Kern Dickman is gratefully
acknowledged. The current address of Cronbach is School of Education,
Stanford University, and that of McKie is School of Education, University
of British Columbia.
This appears to be an appropriate place to voice the gratitude of the entire
profession for the ILLIAC computing facility during the past ten years. The
hospitality of the Digital Computer Laboratory of the University of Illinois
to behavioral scientists has had far-reaching benefits. The rush of technological
improvement having left her behind, ILLIAC I was retired from service on
January 1, 1963; this study is one of the very last of her contributions to
psychology.

Downloaded from epm.sagepub.com at WESTERN OREGON UNIVERSITY on May 31, 2015



difference there is between the random-model coefficient and that
which takes stratification into account. Nearly all previous empirical
studies of internal-consistency formulas have been restricted to
randomly parallel tests with uniform content. One limited study
(Cronbach, 1951) compared stratified split-half coefficients with
coefficients from random splits and found little difference between
the two sets of results.
The present study follows in the main the method of the Cronbach-Azuma
(1962) study of random-parallel, single-factor tests,
and in part replicates it. We specify the statistical properties of a
universe of items divisible into subsets representing content categories,
direct a computer to construct a series of tests by applying
a certain sampling plan, and compute the test reliability by several
formulas. We are chiefly concerned with the question: how well
does each formula estimate the average squared correlation of the
observed test score with the average score on the infinite family of
stratified-parallel tests? Our computations range over universes
with various statistical structures and over various sampling plans.
Though our report deals with only a limited number of rather
special cases, our general theory gives us a basis for confidence in
generalizing to cases not studied. Our comparisons therefore give
a reasonably good answer to all of the questions of practical importance
in test analysis.



Some Theory Regarding Stratified-Parallel Tests
We extract here the theory developed by Rajaratnam, Cronbach,
and Gleser (1964). They assume that there is a universe of items
classified into strata. A sampling plan dictates that a test is to be
formed by drawing, at random, k_h items from each stratum h; k_h
may in principle change from stratum to stratum. The sampling
forms a test t of k items (k = Σk_h) on which p has a total score X_pt
whose variance in the population of persons is V_t. All tests that
could be formed by applying a particular plan to a particular universe
constitute a family. Each test t is in effect a random sample
drawn from the family. The expectation of the X_pt for person p
over all tests in the family is denoted by M_p; this we shall call p's
"family score." Ordinarily there is a different family score for
every combination of item universe and sampling plan.
For any specific test t, there is a coefficient of generalizability


ρ²_Mt, the squared product-moment correlation between test scores
X_pt and family scores M_p over the population of persons. It reports
what proportion of the variance in family ("true") scores is linearly
predictable from observed scores. Since tests do not necessarily have
equal variances and intercorrelations, this coefficient varies somewhat
from test to test; we therefore distinguish between a specific
ρ²_Mt and the expectation E(ρ²_Mt) over all tests in the family, again
for the population of persons. As an index of generalizability or
reliability for the family we adopt E(ρ²_Mt), which tells how accurately
we can expect to generalize from one test to the universe
of similar tests. This reflects the degree of equivalence among tests;
it of course does not give information about generalization over
other facets such as occasions. (Some investigators might choose as
an index the correlation between two of the tests, or the expected
value over all pairs of tests; for the tests we deal with, E(ρ_tt′) is
certainly very close to E(ρ²_Mt), because the variances of stratified
tests of even modest length are close to uniform.) For actual tests
and persons, family scores cannot be observed and ρ²_Mt remains
unknown. With hypothetical data, however, we can determine ρ²_Mt
for any specific test, and by averaging coefficients for many tests can
determine E(ρ²_Mt) with any desired precision.
The intraclass correlation α_s between tests in the family is an
approximation to E(ρ²_Mt); and the estimate of α_s from a single test
is close to an unbiased estimate of α_s for the family.
We shall write formulas here in a form corresponding to our computer
operations; computational formulas suitable in treating actual
data are given by Rajaratnam et al. Our formulas and the usual
formulas give identical results, save that we assume at all points
that we have data from the entire population of persons. We restrict
ourselves to sampling plans where k_h is uniform for all m strata,
which allows us to simplify certain formulas. Let W_h be a covariance
between two particular items in a given stratum, let ΣW_h be
the sum of the k_h(k_h − 1) covariances for pairs of items within a
subtest drawn from stratum h, and let ΣΣW_h be the sum of ΣW_h over
all strata. Let U be a covariance between two (unlike) items in
different strata, and ΣΣU the sum over all such pairs of items in
the test.
Various internal-consistency analyses are possible. Consider a
universe with two types of content, a and b, and items at two levels


Figure 1. Schema representing covariances within an 8-item test. a and b are
content categories; A and B are difficulty levels. Numerals represent covariances
of different types; type 3 covariances, for example, involve items of
unlike content but similar difficulty.

of difficulty A and B. Crossing these categories produces four strata,
as indicated in Figure 1. Here, an illustrative test has been formed
by selecting 2 items at random within each stratum. Thus items
1 and 2 have the same general content a and the same difficulty A.
While this test was (hypothetically) constructed by drawing from
four strata, it might have arisen from any one of four different
sampling plans.
Random sampling. Equivalent to considering all categories together
as a single stratum, k_h = 8.
Stratification on content alone. Two strata, a and b, 4 items each.
Stratification on difficulty alone. Two strata, A and B, 4 items each.
Stratification on content and difficulty. Four strata, 2 items each.
Each of these modes of test construction would generate a different
family, any of which the present test might belong to. For each
family there is a different E(ρ²_Mt), with the finer stratifications
producing the higher values. If the given test is analyzed by formula
(1), different values of α_s will be obtained when different stratifications
are made. Analysis may be based on any of the plans stated


above; each leads to a coefficient which the theory says is an
estimate of E(ρ²_Mt) for the family defined by that plan. We
distinguish these four values by designating them α, α_C, α_D, and
α_CD, respectively.
If the analysis treats all items as belonging to a single stratum,
all covariances represented in Figure 1 are "within-stratum," and
(1) gives a result identical to that from the α (KR20) formula.
This analysis has the effect of averaging all covariances of types
1, 2, 3, and 4, entering this average in the 8 diagonal cells, and
dividing the total of the 64 entries by the test variance. According
to theory, α approximates E(ρ²_Mt) for the family of randomly
parallel tests. Three empirical questions arise. The first (how good
is the approximation for randomly parallel tests formed from items
having a single common factor?) was answered in the Cronbach-Azuma
study. Only in certain unlikely situations is the correspondence
unsatisfactory. The second (how good is the approximation
for randomly sampled tests where items cluster around various
common factors?) has not been studied. The third (how good is
the approximation when the family score is that for a family of
stratified-parallel tests?) is examined in this paper. This amounts
to asking how much is lost when one obtains α instead of the α_s that
is theoretically appropriate for the stratified family.



If the analysis takes both content and difficulty into account,
covariances of type 1 are W's and all others are U's. Formula (1) in
effect averages the type-1 covariances within each stratum and
enters that average in the diagonal. Summing the matrix and dividing
by V_t gives α_CD. If content strata only are to be taken into
account, covariances of types 1 and 2 are W's; the resulting coefficient
is α_C. In α_D the diagonal entries are determined by averaging
the type-1 and type-3 covariances. This analysis recognizes difficulty
strata but not content strata.
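The diagonal-replacement description above can be sketched in code. This is a minimal illustration under stated assumptions, not the authors' program: the function name `stratified_alpha`, the assignment of the eight items to strata (aA, aA, aB, aB, bA, bA, bB, bB), and every covariance value are hypothetical, chosen only to show how the four analyses differ on one test.

```python
def stratified_alpha(cov, strata):
    """Alpha for one test under a given stratification.

    cov: square item covariance matrix (list of lists), variances on the
    diagonal. strata: one label per item; each stratum needs >= 2 items.
    Each diagonal entry is replaced by the mean covariance among items of
    the same stratum; the matrix total is then divided by the test variance.
    """
    k = len(cov)
    test_variance = sum(sum(row) for row in cov)
    off_diagonal = test_variance - sum(cov[i][i] for i in range(k))
    replaced_diagonal = 0.0
    for label in set(strata):
        members = [i for i in range(k) if strata[i] == label]
        if len(members) < 2:
            raise ValueError("each stratum needs at least two items")
        within = [cov[i][j] for i in members for j in members if i != j]
        replaced_diagonal += len(members) * sum(within) / len(within)
    return (off_diagonal + replaced_diagonal) / test_variance


# Hypothetical 8-item test: two content categories crossed with two levels.
content = ["a", "a", "a", "a", "b", "b", "b", "b"]
difficulty = ["A", "A", "B", "B", "A", "A", "B", "B"]
# Hypothetical covariances keyed by (same content?, same difficulty?),
# i.e. covariance types 1, 2, 3, and 4 in that order.
value = {(True, True): 0.06, (True, False): 0.05,
         (False, True): 0.02, (False, False): 0.015}
cov = [[0.20 if i == j else
        value[(content[i] == content[j], difficulty[i] == difficulty[j])]
        for j in range(8)] for i in range(8)]

alpha = stratified_alpha(cov, ["all"] * 8)   # single stratum
alpha_c = stratified_alpha(cov, content)     # content strata only
alpha_d = stratified_alpha(cov, difficulty)  # difficulty strata only
alpha_cd = stratified_alpha(cov, [c + d for c, d in zip(content, difficulty)])
print(alpha, alpha_d, alpha_c, alpha_cd)
```

With a single stratum the function reduces to the familiar k/(k − 1) form of α, which gives a convenient check on the sketch.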
There is a series of "γ" formulas that provide lower bounds to the
several α's, just as KR21 provides a lower bound to the KR20
coefficient (Rajaratnam et al., 1964). We have calculated γ values
for selected tests treated below, but we find that no simple generalization
can be made. To discuss the γ values adequately would
overburden the present paper. The stratified γ underestimates the
corresponding stratified α to such an extent that it cannot be used
under many circumstances. But as k_h increases or as the within-


stratum range of P_i decreases, γ becomes a good estimator of the
corresponding α. We hope to present more definite results on γ in a
subsequent paper.
The dichotomous items (scored 1 or 0) that many tests employ
have properties that have caused some concern to test theorists. In
the Kuder-Richardson development of the α formula it was assumed
that the item intercorrelation matrix had unit rank. Since a matrix
of phi coefficients for items differing in difficulty cannot have unit
rank, this implied that all items were of uniform difficulty. There
have been various ways of circumventing this problem. The derivation
by Rajaratnam et al. simply argues that the expected α_s is a
lower bound, and an approximation, to E(ρ²_Mt). If items can be
classified into strictly homogeneous content strata, then as stratification
on difficulty becomes increasingly fine the value of α_s approaches
the value of E(ρ²_Mt) for the family so defined. But fine
stratification is rare in practice. A significant question, then, is how
adequately a coarse stratification on difficulty reduces the discrepancy
between α_s and E(ρ²_Mt).
One important theoretical point remains to be made. In a certain
sense, α_s obeys the Spearman-Brown rule. To be precise, consider
two families of tests, of different lengths but based on proportional
sampling plans (plans such that the k_h of one test are proportional
to the respective k_h of the other). Then the expected value of α_s over
the family of longer tests is related to the expected value for the
family of shorter tests according to the Spearman-Brown formula,
provided that the stratification in analysis coincides with that used
in selecting items. As a consequence, we can restrict our empirical
study to tests of a single length, and yet draw conclusions regarding
the behavior of these coefficients for longer and shorter tests.
We would like to draw conclusions about the discrepancy between
α_s and E(ρ²_Mt) for any length of test. Though we have no
theoretical basis for extrapolating E(ρ²_Mt) to other test lengths, there
is empirical evidence that E(ρ²_Mt) for randomly parallel tests comes
very close to following the Spearman-Brown formula. Table 1 shows
values of E(ρ²_Mt) for single-factor randomly parallel tests with 9, 18,
and 54 items having various levels of interitem tetrachoric correlations
r_w. (The data for the 18-item tests were collected during the
present study; the remainder are taken from Cronbach and Azuma.)
Here and elsewhere, we shall find it illuminating to transform correlational
indices such as α into "signal-noise ratios" of the form
α/(1 − α); for a discussion of this index, see Cronbach and Gleser
(1964). Under the hypothesis that E(ρ²_Mt) obeys the Spearman-Brown
formula, the signal-noise equivalents for the three lengths
of test should exhibit the ratio 1:2:6; for each r_w, the observed
ratio is close to the expectation. Because the variances of stratified
tests are more nearly uniform than those of randomly parallel tests,
we expect E(ρ²) for stratified tests to conform even more closely
to the Spearman-Brown prediction. No data for heterogeneous tests
are available, however.
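The 1:2:6 expectation follows from the fact that the Spearman-Brown formula multiplies the signal-noise ratio by the lengthening factor. The short sketch below illustrates this; the function names and the starting coefficient of .50 for a 9-item test are hypothetical and are not taken from Table 1.

```python
def signal_noise(coefficient):
    """Signal-noise ratio alpha / (1 - alpha) of a reliability-type coefficient."""
    return coefficient / (1.0 - coefficient)


def from_signal_noise(ratio):
    """Inverse transformation: coefficient corresponding to a signal-noise ratio."""
    return ratio / (1.0 + ratio)


def spearman_brown(coefficient, length_factor):
    """Coefficient predicted when the test is lengthened by length_factor."""
    return (length_factor * coefficient
            / (1.0 + (length_factor - 1.0) * coefficient))


nine_item = 0.50  # hypothetical coefficient for a 9-item test
# Lengths 9, 18, and 54 correspond to lengthening factors 1, 2, and 6.
ratios = [signal_noise(spearman_brown(nine_item, n)) for n in (1, 2, 6)]
print(ratios)
```

Because the transformation turns the Spearman-Brown rule into simple proportionality, departures from the rule show up directly as departures from the 1:2:6 pattern.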

Although, for a test stratified in a given manner, the average α_s
obeys the Spearman-Brown rule if the analysis uses the same stratification
as did the test construction, this is not the case if some
other stratification is used in analysis. Particularly, if tests are
constructed on the basis of content-and-difficulty stratification,
the Spearman-Brown rule does not hold for α, α_C, or α_D; only
α_CD for those strata obeys the rule. A formula describing the change
in α with change in length of a stratified test is given by Rajaratnam
et al. The striking consequence of this formula is that where two
coefficients are computed, one for the sampling plan used in test
construction and one for some coarser stratification, the ratio of
the corresponding S/N values approaches a limit as k increases.
The limit is expressible in terms of two average covariances (W,
the average covariance for items within the same stratum of both
plans, and W′, the average for items within the same stratum of
the coarser plan only) and the item variances. We shall give the
formula only for the case where one plan uses three strata (k_h uniform)
that are further divided into thirds in the finer plan. Write
v for the average item variance. Then

Procedures
A hypothetical universe of items is specified; this is divided into
three content categories (except in one substudy where six are used).
Items within a category have a uniform tetrachoric correlation r_w,
and a specified correlation r_b with each item in another category.
While it is not necessary to make r_w the same for all categories or

r_b the same for all pairs of categories, in our computations these are
kept uniform. While there are many ways of specifying universe
characteristics other than those we employ (for example, uniform
phi correlations or uniform covariances might easily have been
introduced), our specifications appear to cover much of the range
of test types.

Figure 2. Specimen tetrachoric correlations among items representing three
content categories.

The representative values of r_b and r_w in Figure 2 indicate the
flexibility of our model. In each matrix, the diagonal entries are r_w
and the off-diagonal entries are r_b. The first set of correlations
implies that there is just one common content factor among the
test items. In matrix (2) a fairly strong general factor links items
of the three types, and slightly less influential factors link items
within categories. Matrix (3) exhibits three orthogonal content
factors.
All tests under study consist of dichotomously scored items. Item
difficulties in the universe are specified by defining a range of P_i
and a rectangular distribution within that range. Two ranges are
used: the "limited range" .60 ≤ P_i ≤ .99 (comparable to many
ability tests), and the "wide range" .01 ≤ P_i ≤ .99. The latter is
unrealistically wide, but it thereby provides a severe test of the
susceptibility of our formulas to differences in item difficulty. For
stratification on difficulty, the range is divided into segments:
.60-.79 and .80-.99; or .01-.30, .31-.69, and .70-.99.
The sampling plans are arranged so that both content and difficulty
or either one alone can be taken into account. The computer
program allows the use of from one to nine strata. To each stratum
is assigned one type of content (i.e., the stratum is identified with
one column of an r_b-r_w matrix), a value of k_h, and a "difficulty segment."
We restrict our studies to sampling plans where difficulty
segments and content categories are completely crossed; that is, we
do not draw difficult items from one content category and easier
items from another.
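A completely crossed sampling plan of this kind can be sketched as follows. This is only an illustration, not the ILLIAC program: the function name `draw_test`, the tuple representation of items, and the seed are hypothetical, while the wide-range segments and the rectangular (uniform) distribution of P_i come from the text.

```python
import random


def draw_test(content_categories, difficulty_segments, k_h, rng):
    """Draw one test under a completely crossed plan.

    Each stratum is a (content, segment) pair; k_h difficulties P_i are
    drawn from a rectangular distribution over the stratum's segment.
    Returns a list of (content, segment, P_i) tuples, one per item.
    """
    items = []
    for content in content_categories:
        for low, high in difficulty_segments:
            for _ in range(k_h):
                items.append((content, (low, high), rng.uniform(low, high)))
    return items


rng = random.Random(1965)  # arbitrary seed, for reproducibility only
wide_segments = [(0.01, 0.30), (0.31, 0.69), (0.70, 0.99)]
test_items = draw_test(["a", "b", "c"], wide_segments, k_h=2, rng=rng)
print(len(test_items))  # 3 content categories x 3 segments x 2 items = 18
```

Repeated calls with the same plan yield different members of the same family of tests.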

The universe specification and the sampling plan together define
a family of tests. Drawing a particular set of P_i from a stratum by
means of random numbers to conform to the sampling plan defines
a subtest. The several subtests form a test. Once the P_i for the test
are selected, the computer considers in turn each pair of items, enters
their parameters (P_i, P_j, and r_w or r_b) in the usual series approximation
to calculate their product-moment covariance C_ij and stores
it according to its type, as defined in Figure 1. The V_i are also calculated.
Then, by appropriately cumulating covariances and variances,
the computer determines V_t and α, α_C, α_D, and α_CD. This calculation
is repeated with the same P_i and a new intercorrelation
matrix, to generate a test from another family. After this step has
been repeated for each correlation matrix under consideration, a
new set of P_i is drawn according to the same sampling plan and
coefficients for the new test are calculated, using each correlation
matrix in turn. To conserve computer time, only a few tests of each
family were constructed and analyzed; the number of tests sampled
for any family was selected so as to reduce the standard error of the
mean coefficient to an acceptable level.
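The covariance step can be illustrated with a small sketch. It does not reproduce the series approximation used in the study; instead it swaps in a different technique, numerical integration of the standard bivariate normal density from 0 to r at the two item thresholds, which yields the same quantity (the covariance of two dichotomized normal variables). The function names and the Simpson step count are hypothetical, and the sketch is restricted to |r| < 1.

```python
import math
from statistics import NormalDist


def item_covariance(p_i, p_j, r, steps=400):
    """Product-moment covariance of two 0/1 items.

    p_i, p_j: proportions passing; r: tetrachoric correlation, |r| < 1.
    Uses cov = integral from 0 to r of the standard bivariate normal
    density at the two thresholds (composite Simpson's rule).
    """
    if not -1.0 < r < 1.0:
        raise ValueError("this sketch requires |r| < 1")
    if r == 0.0:
        return 0.0
    std = NormalDist()
    h = std.inv_cdf(1.0 - p_i)  # threshold exceeded with probability p_i
    k = std.inv_cdf(1.0 - p_j)

    def density(t):
        one_minus = 1.0 - t * t
        expo = -(h * h - 2.0 * t * h * k + k * k) / (2.0 * one_minus)
        return math.exp(expo) / (2.0 * math.pi * math.sqrt(one_minus))

    width = r / steps  # steps must be even for Simpson's rule
    total = density(0.0) + density(r)
    for n in range(1, steps):
        total += (4 if n % 2 else 2) * density(n * width)
    return total * width / 3.0


def item_variance(p_i):
    """Variance of a dichotomous item with proportion passing p_i."""
    return p_i * (1.0 - p_i)


print(item_covariance(0.5, 0.5, 0.5))  # close to 1/12 for this special case
```

For two items of difficulty .50 the integral has a closed form, arcsin(r)/(2π), which is what the printed check relies on.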

It remains to describe the determination of ρ²_Mt, whose expected
value serves as the "ideal" coefficient in our theory. It can be shown
that when all k_h are equal and rectangular distributions of P_i are
assumed,

where M_h (or more properly M_ph) is the expected value of a person's
scores on subtests drawn from stratum h. Here, h may be the same as
h′. To evaluate this, it is necessary to compute* the C_{i_h M_h′} and C_{M_h M_h′}.
The computer generates a complete matrix of covariances for each
tetrachoric correlation under study and all pairs P_i, P_j in the range
.01-.99. For any item i_h, C_{i_h M_h′} is determined by identifying the
tetrachoric correlation between items in stratum h′ and the stratum
to which i_h belongs (which may or may not be h′), identifying the
difficulty segment assigned to stratum h′ and averaging the C_ij for
the specified r and P_i over all P_j within that segment. To determine
* Actually, the procedure described here gives C_{i_h M_h′}/k_h and
C_{M_h M_h′}/k_h²; but the k_h cancel out and can be ignored.


C_{M_h M_h′} it is necessary to average C_ij over the segment of P_i assigned
to h and the segment of P_j assigned to h′. There is a different ρ²_Mt
for each sampling plan and set of r_b and r_w, but the amount of calculation
is greatly reduced by recognizing that many steps of the computation
are the same for different sampling plans. For any one
sampling plan, r_b, and r_w, several tests were formed and the ρ²_Mt for these
tests were averaged to obtain an estimate of E(ρ²_Mt) for the family.
A by-product of our work was a set of tables reporting C̄_{i_h i_h′} for
any two strata. There is one table for each r_w, giving a mean covariance
for any pair of difficulty segments of width .10. From these
tables one can calculate E(C_tt′) and E(V_t) for tests drawn according
to virtually any sampling plan. The ratio of these two is very close to
the expected value of unstratified α, which is E(C_tt′/V_t). By appropriate
separation of within- and between-stratum covariances one
can also calculate α_s for any stratification. We have used this method
in certain subordinate analyses and checks. We believe that questions
will arise in the future where our tables will serve other investigators.
Moreover, they may be helpful in instruction, since students can
design tests with different hypothetical specifications and compute
various internal-consistency coefficients and so learn more about the
properties of dichotomous items. Copies of the tables together with
directions for computing the several α and γ coefficients will be supplied
upon request to the Bureau of Educational Research, University
of Illinois, Urbana, Illinois.

Results for Single-Factor Tests


Unstratified Tests. We consider first a set of 18-item single-factor
tests like the 9- and 54-item tests examined by Cronbach and
Azuma. Though three content-strata are formally represented,
setting r_b equal to r_w produces items of uniform content so that the
stratification "on content" has no meaning. For each of the pseudo-strata,
six items were drawn from the difficulty range .01-.99.
Table 2 presents the average values of ρ²_Mt and α for several tests
of each family. The mean ρ²_Mt is our best estimate of E(ρ²_Mt); for
convenience this mean is referred to throughout as E(ρ²). α_D, being
identical to α for these tests, is not reported. α_C also is omitted since,
for a single-factor test, α_C and α differ only by chance; of the two,
α is the more dependable because the mean covariance "entered in
the diagonal" is based on a greater number of sampled covariances.


TABLE 2
Data for Single-Factor Randomly Parallel Tests with Wide P Range

Each entry for E(ρ²) and α gives the mean, s.e._M, and range for ten tests. Italicized figures are
S/N equivalents of the corresponding means.

Although it is customary to compare correlational indices directly,
we have found this somewhat misleading because numerically small
differences between very high coefficients may have much practical
significance. Transforming coefficients into "signal-noise ratios"
often gives a better basis for interpretation (Cronbach and Gleser,
1964). For example, the averages for ρ² and α where r_w = 1.00, .961
and .948, seem close together, but the corresponding S/N ratios are
25 and 18. A test with S/N of 18 must be lengthened about 40 percent
to attain an S/N of 25. A coefficient of .961 therefore implies that
the test is 140 percent as efficient as the coefficient of .948 implies.
Only where the S/N equivalents are close together are the implications
of the two coefficients the same. To assist in interpreting
results, all average coefficients have been converted to S/N; these
conversions appear in italics in our tables. As a rule of thumb, we
accept one coefficient as an adequate estimator of another if their
S/N equivalents have the ratio .83:1.00 or 1.00:1.20.
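The rule of thumb can be written out as a small check. The sketch below assumes that the rule means the ratio of the estimator's S/N to the target's S/N should lie between .83 and 1.20; that reading, and the function names, are assumptions rather than the authors' stated procedure. The coefficients .961 and .948 are the ones quoted above.

```python
def signal_noise(coefficient):
    """Signal-noise ratio alpha / (1 - alpha)."""
    return coefficient / (1.0 - coefficient)


def is_adequate_estimator(estimate, target, low=0.83, high=1.20):
    """Rule of thumb: S/N(estimate) / S/N(target) lies between low and high."""
    ratio = signal_noise(estimate) / signal_noise(target)
    return low <= ratio <= high


sn_target = signal_noise(0.961)    # about 25
sn_estimate = signal_noise(0.948)  # about 18
# Lengthening factor needed to raise the lower S/N to the higher one:
# about 1.35 from the unrounded values, 1.39 from the rounded 25 and 18.
lengthening = sn_target / sn_estimate
print(round(sn_target, 1), round(sn_estimate, 1), round(lengthening, 2))
print(is_adequate_estimator(0.948, 0.961))
```

Working in S/N units makes the criterion independent of where on the 0 to 1 scale the two coefficients fall.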
For homogeneous tests, E(α) is close to E(ρ²) even when the
tetrachoric correlation is as high as .70. The reader is reminded that
interitem tetrachoric correlations for actual tests are in the neighborhood
of .05-.40, rarely higher. This confirms the Cronbach-Azuma
conclusion that for tests constructed by random sampling from
homogeneous items, α provides, on the average, a very good estimate
of E(ρ²). But the range of individual α's is noteworthy; for r_w =


.30, the S/N range is 2.7 to 3.5 (corresponding to α's of .73 and .78).
In longer tests, the range of α would be less.
Cureton (1958) criticized derivations of internal-consistency
formulas. These formulas invariably involve the ratio of estimates
of E(C_tt′) and E(V_t), even though the theory calls for obtaining the
expected value E(C_tt′/V_t). Working from our tables we calculated
E(C_tt′)/E(V_t); this value is entered in Table 2 under the heading
E(C)/E(V). Comparing this to our sample mean of α, which is an
unbiased estimate of E(C/V), we see that for each r_w the two values
agree within .002. Discrepancies between E(C)/E(V) and α_s were
no larger than .01 for the stratified-parallel families studied. Our
results therefore seem to dispose of Cureton's concern. At least for
tests of 18 or more dichotomous items, the ratio of expectations of
C_tt′ and V_t agrees excellently with the expectation of the ratio and
the usual substitution of the former for the latter is acceptable.
Homogeneous Tests Stratified on Difficulty. Specifying a certain
number of items per difficulty stratum constrains the distributions of
P-values in the test. We expected E(ρ²) for families of tests stratified
on difficulty to be somewhat higher than for random-parallel tests
made up from the same universe of items, because the former tests
are more perfectly equivalent.
We constructed two series of tests. In the wide-range series, six
items were selected from each of the three difficulty segments:
.01-.30, .31-.69, and .70-.99; in the "limited range" series, nine items
were drawn from the segments .60-.79 and .80-.99. Again, three
content strata were formally represented, but r_w = r_b so that content
is homogeneous. Table 3 shows the average coefficients for both series
together with their ranges, standard errors, and S/N equivalents.
Again, α_C differs from α only by chance, and α_CD from α_D.
The tests of the wide-range series are just like those of Table 2
save for stratification on difficulty. E(ρ²) shows some increase from
difficulty stratification, not very important unless r_w > .70 (cf. S/N
values). E(α) has changed negligibly, except for a decrease at r_w
= 1.00. The ranges of all coefficients are reduced by difficulty stratification.
When we compare each estimator to E(ρ²) for either wide- or
narrow-range tests, we find that α_D is, as expected, a good estimator
except for the wide-range test with r_w = 1.00. In general, α is only
slightly less satisfactory as an estimator, and since it is easier to


compute it may be advantageous unless a test has remarkably high
item intercorrelations.

Results for Heterogeneous Tests


Tests Stratified on Content and Difficulty. The next set of tests
to be considered is formed from a universe with three orthogonal
content categories (r_b = 0). With a wide range of difficulty there
are three content categories crossed with three difficulty segments;
in the limited-range universe we cross the content categories with
two difficulty segments. These tests have much lower coefficients than
those of Tables 2 and 3; using three orthogonal content categories
we get coefficients similar to those for a six-item single-factor test
having the same r_w.


While it is possible to analyze these tests by formulas that ignore
content stratification, this is distinctly inadvisable. This is shown by
results for limited-range tests where r_w = .50; the S/N equivalents
are 2.1 for E(ρ²), 1.4 for α_D, and 1.4 for α. The results are only a
trifle better when r_w = .30, and much worse for higher r_w. Analyzing
a short heterogeneous test by the simple α formula gives misleading
results.
The formulas that take content stratification into account give
the results in Table 4. α_CD is an excellent estimator of E(ρ²). While
α_C gives somewhat lower estimates, it is close enough to E(ρ²) to be
practically useful when r_w is in the normal range.
Tests Stratified on Content Only. The final major series of tests

TABLE 5
Three-Factor Wide-Range Tests Stratified on Content Only:
Mean, S.E._M, and Range for Each Coefficient

was constructed by stratifying on content only, using a wide range
of item difficulties. Table 5 shows that α_C is, on the average, extremely
close to ρ²_Mt, performing considerably better than α. As in
the single-factor case of Table 2, coefficients from individual tests
within a family vary greatly. Variation among coefficients will decrease
rapidly as the number of items increases or the range of P_i
is reduced.
Trend with Change in r_b. We have so far compared the limiting
cases where r_b = 0 and r_b = r_w. In actual tests, r_b is likely to fall
between these extremes. We have made only a limited study of
Figure 3. Adequacy of α and α_C as a function of between-stratum item
correlations (r_w = .50, limited range).


intermediate cases, but these are in sufficiently close agreement that
only one set of illustrative results need be reported. For
limited-range tests with r_w = .50 and r_b = .00, .30, or .50, the tables of
covariances were used to estimate E(α_CD), E(α_C), and E(α). Since
we have previously established that α_CD is a fairly good approximation
to E(ρ²), we use it as the standard of comparison to avoid
computational labor. Figure 3 shows that the relation of α_C to α_CD
is very much the same for all r_b, but that α becomes a considerably
poorer estimator as r_b decreases. Similar results are found with
wide-range tests and with r_w > .50. The goodness of any estimator is to
be judged from our earlier tables when r_b = .00 or r_b = r_w. For
intermediate r_b, linear interpolation between these values will
indicate the approximate size of the estimator.
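
The interpolation rule just described can be sketched in a few lines of Python. This is an illustration only, not the authors' procedure: the function name `interpolate_coefficient` is hypothetical, and the endpoint values in the usage lines are invented for the example rather than taken from the tables.

```python
def interpolate_coefficient(r_b, r_w, value_at_zero, value_at_rw):
    """Linearly interpolate an estimator's value for an intermediate r_b.

    value_at_zero: tabled value of the estimator when r_b = 0
    value_at_rw:   tabled value of the estimator when r_b = r_w
    """
    if r_w <= 0.0 or not 0.0 <= r_b <= r_w:
        raise ValueError("need 0 <= r_b <= r_w with r_w > 0")
    # Fraction of the way from the r_b = 0 case to the r_b = r_w case.
    weight = r_b / r_w
    return value_at_zero + weight * (value_at_rw - value_at_zero)


# Hypothetical endpoint values (NOT from the paper's tables).
approx = interpolate_coefficient(r_b=0.30, r_w=0.50,
                                 value_at_zero=0.60, value_at_rw=0.80)
print(round(approx, 3))
```

With r_b = .30 and r_w = .50 the weight is .60, so the interpolated value lies three-fifths of the way from the r_b = 0 entry to the r_b = r_w entry.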
Effect of Increasing k_h. We expect all coefficients to approach 1.00
as tests are lengthened by increasing k_h, proportionately for all
strata. There are grounds for expecting the S/N values for E(ρ²)
and α_s to maintain the same ratio as k_h changes, provided that α_s
is based on the plan used in test construction. We wish to know
also how the relations among α, α_C, and α_CD change as a test is
lengthened.
From our tables of covariances and formula (2) we calculate the
limit of the ratio of S/N values for wide-range tests, presented in
Table 6. An additional ratio is available for r_w = .50, r_b = .30,
namely .869 : .941 : 1.00. For long tests, it appears that α_C is a satisfactory

TABLE 6
Limiting Values of Ratios S/N(α) : S/N(α_C) : S/N(α_CD) Corresponding to
Stratified Alpha with Coarse and Fine Stratification (Wide-Range Tests)


When α is compared with α_C or α_CD for very long tests where r_b
= 0, we find α acceptable so long as r_w is less than about .40. But
α is quite unsatisfactory for an 18-item test. The minimum length
where α may be used in place of α_CD (by the .83 criterion) is 54 items
when r_w = .30 and r_b = 0. The discrepancy between α and α_s
decreases as r_b increases, the trend in the S/N ratio being nearly linear.
The discrepancy is slightly lessened when the difficulty range is
narrowed.

Summary and Recommendations


To explore the properties of various internal-consistency formulas,
we have analyzed hypothetical stratified-parallel tests constructed
by sampling items from universes with specified characteristics.
Formulas generally thought of as approximations or lower bounds
to the squared correlation of test score with true test score
(E(ρ_Mt²)) were compared to a direct numerical estimate of the latter
value, so that the goodness of approximation can be judged. We
compare several "α" coefficients, versions of the stratified intraclass
correlation whose degenerate case with a single stratum is the α or
KR20 formula.
While we have dealt with only a few of the possible universe
specifications and sampling plans, the reader can extend the findings
by interpolation and extrapolation. All tests in this study are 18
items long. It is known that as test length changes the stratified α
coefficients calculated on the basis of the stratification used in test
construction obey the Spearman-Brown rule. There is some evidence
that E(ρ²) follows that rule fairly closely. Homogeneous tests draw
items from a single content stratum whose degree of homogeneity is
defined by the tetrachoric interitem correlation r_w. Our tests draw
items from three content strata. The interitem tetrachoric
correlations are labelled r_w for items within a stratum and r_b for items in
different strata. In most of our analyses, r_b = r_w (single-factor test)
or r_b = 0.
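
As a reminder of what obeying the Spearman-Brown rule implies when a test is lengthened, the following minimal Python sketch applies the standard Spearman-Brown formula. The function name `spearman_brown` and the starting coefficient of .50 are illustrative assumptions, not values from this study.

```python
def spearman_brown(coefficient, length_factor):
    """Projected coefficient when a test is lengthened by `length_factor`.

    Standard Spearman-Brown formula: n*r / (1 + (n - 1)*r),
    where n is the ratio of new length to old length.
    """
    if length_factor <= 0:
        raise ValueError("length_factor must be positive")
    return (length_factor * coefficient
            / (1.0 + (length_factor - 1.0) * coefficient))


# Going from 18 to 54 items is a length factor of 3.
# The starting coefficient .50 is a hypothetical value.
print(spearman_brown(0.50, 54 / 18))
```

A coefficient of .50 on an 18-item test would thus be projected to .75 for a 54-item test built on the same plan.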
The findings are too complex to be compactly summarized, but a
single table of particularly representative values will present the
heart of the results. Computations were made assuming r_w of .30,
.50, .70, and 1.00. While the anomalous behavior of product-moment
correlations at high tetrachoric levels makes the higher r_w interesting,
it is the lower levels that are encountered in actual tests. We
reproduce in Table 7 results only for within-stratum correlations of
.50 (adding, for completeness, a few values not presented earlier).

TABLE 7
Summary of Results for 18-item Tests with r_w = .50

*Acceptable as estimate of E(ρ²).


Values in boldface are results for the coefficient theoretically appropriate to tests stratified as
indicated.

All coefficients have been converted to equivalent signal-noise
ratios, which are directly related to an investigator’s decisions about
whether a certain test length is appropriate for the degree of
generalizability he requires. As a rule of thumb, we suggest that an
estimate is seriously misleading if S/N for the estimate is less than
.83 times the S/N corresponding to the parameter value. Any two
S/N values will retain roughly the same ratio to each other as tests
are lengthened, other things being equal. For realistic tests, where
r_w is likely to fall below .50, the discrepancies among E(ρ²) and
the coefficients will ordinarily be less than those in this table. The
reader should bear in mind that we have investigated extremely
heterogeneous tests; the typical test has correlated content strata,
whose behavior will fall between that of the one-factor and three-
factor tests of this table.
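
The rule of thumb can be made concrete with a short Python sketch. It assumes that the signal-noise ratio of a coefficient is coefficient / (1 − coefficient), the definition we take from Cronbach and Gleser (1964); the function names and the example coefficients are hypothetical.

```python
def signal_noise(coefficient):
    """Signal-noise ratio, assumed here to be coefficient / (1 - coefficient)."""
    if not 0.0 <= coefficient < 1.0:
        raise ValueError("coefficient must satisfy 0 <= coefficient < 1")
    return coefficient / (1.0 - coefficient)


def is_seriously_misleading(estimate, parameter, threshold=0.83):
    """Apply the .83 rule of thumb to an estimate and the parameter value."""
    return signal_noise(estimate) < threshold * signal_noise(parameter)


# Hypothetical values: parameter .80 (S/N = 4.0), estimates .75 and .78.
print(is_seriously_misleading(0.75, 0.80))  # S/N ratio is 3.0 / 4.0 = .75
print(is_seriously_misleading(0.78, 0.80))  # S/N ratio is about .89
```

Under this definition a seemingly small drop from .80 to .75 fails the criterion, because the corresponding S/N falls from 4.0 to 3.0, while an estimate of .78 passes.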
The results are clear. When a test is constructed by stratifying on
content and difficulty, one may properly estimate its coefficient of
generalizability by α_CD or α_C. The latter, though less precise, is
generally easier to compute. For a test stratified on content, α_C
should be used. With r_w below .20, one may use the simpler α formula.
For a single-factor test stratified on difficulty only, the best
estimator is α_D, but the more easily computed α gives acceptable results


unless strata are extremely narrow and r_w is high. For unstratified
(random-parallel) tests, α gives acceptable results.
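
For readers who wish to compute a stratified coefficient from item scores, the following Python sketch uses the familiar form of stratified alpha, α_s = 1 − Σ_h σ²(X_h)(1 − α_h) / σ²(X), where X_h is the score on stratum h and α_h is the ordinary α within that stratum. Formula (2) is not reproduced in this section, so treating this expression as equivalent to it is an assumption; the function names and the toy data are hypothetical.

```python
from statistics import variance


def cronbach_alpha(items):
    """Ordinary alpha; `items` is a list of item columns (person scores)."""
    k = len(items)
    if k < 2:
        raise ValueError("alpha needs at least two items")
    totals = [sum(person) for person in zip(*items)]
    item_variance_sum = sum(variance(column) for column in items)
    return k / (k - 1) * (1.0 - item_variance_sum / variance(totals))


def stratified_alpha(strata):
    """Stratified alpha; `strata` is a list of strata, each a list of item columns."""
    all_items = [column for stratum in strata for column in stratum]
    total_variance = variance([sum(person) for person in zip(*all_items)])
    unexplained = 0.0
    for stratum in strata:
        stratum_totals = [sum(person) for person in zip(*stratum)]
        unexplained += variance(stratum_totals) * (1.0 - cronbach_alpha(stratum))
    return 1.0 - unexplained / total_variance


# Toy data (hypothetical): 6 persons, 4 dichotomous items in 2 content strata.
item1 = [1, 1, 1, 0, 0, 0]
item2 = [1, 1, 0, 1, 0, 0]
item3 = [1, 0, 1, 0, 1, 0]
item4 = [1, 1, 0, 0, 1, 0]
strata = [[item1, item2], [item3, item4]]
all_items = [item1, item2, item3, item4]

print(round(cronbach_alpha(all_items), 3))  # unstratified alpha
print(round(stratified_alpha(strata), 3))   # alpha stratified on content
```

For these toy data the unstratified α is about .53 and the stratified coefficient is .60. With a single stratum the function reduces to ordinary α, the degenerate case noted in the summary.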

Stratifying on content is clearly more important than stratification
on difficulty, both in test construction and test analysis. The so-called
difficulty factors that have received so much attention from some
test theorists prove to have very little influence on α coefficients
unless r_w is unrealistically high.
This paper has examined tests for which an explicit sampling plan
or table of specifications was laid down during test construction.
Nothing in our procedure, however, makes the results inapplicable
to a posteriori stratification, where the investigator sorts items from
an existing test into strata of his own defining, and estimates the
coefficient of generalizability over the family defined by this
postulated sampling plan.
In sum, our results strongly support the Technical Recommendations:
"If a test can be divided into sets of items of different content,
internal consistency should be determined by procedures designed
for such tests", always assuming that the intention is to generalize
over other tests covering these same content categories.

REFERENCES

Cronbach, L. J. "Coefficient Alpha and the Internal Structure of
Tests." Psychometrika, XVI (1951), 297-334.
Cronbach, L. J. and Azuma, Hiroshi. "Internal-consistency Reli-
ability Formulas Applied to Randomly-sampled Single-factor
Tests: An Empirical Comparison." EDUCATIONAL AND PSYCHO-
LOGICAL MEASUREMENT, XXII (1962), 645-665.
Cronbach, L. J. and Gleser, Goldine C. "The Signal-Noise Ratio in
the Comparison of Reliability Coefficients." EDUCATIONAL AND
PSYCHOLOGICAL MEASUREMENT, XXIV (1964), 467-480.
Cronbach, L. J., Rajaratnam, Nageswari, and Gleser, Goldine C.
"Theory of Generalizability: A Liberalization of Reliability
Theory." British Journal of Statistical Psychology, XVI (1963),
137-163.
Cureton, E.E. "The Definition and Estimation of Test Reliability."
EDUCATIONAL AND PSYCHOLOGICAL MEASUREMENT, XVIII (1958),
715-738.
Jackson, R. W. B. and Ferguson, G. A. Studies on the Reliability of
Tests. Bulletin No. 12. Department of Educational Research,
University of Toronto, 1941.
Lord, F. M. "Sampling Error Due to Choice of Split in Split-Half
Reliability Coefficients." Journal of Experimental Education,
XXIV (1956), 245-249.
Rajaratnam, Nageswari, Cronbach, L. J., and Gleser, Goldine C.
"Generalizability of Stratified-Parallel Tests." Psychometrika,
1965, in press.
"Technical Recommendations for Psychological Tests and Diag-
nostic Techniques." Washington, D. C.: American Psychological
Association, 1954. (Psychological Bulletin, L (1954), Supp.)
Tryon, R. C. "Reliability and Behavior Domain Validity: Reform-
ulation and Historical Critique." Psychological Bulletin, LIV
(1957), 229-249.
