drawn from the family. The expectation of the X_pt for person p
over all tests in the family is denoted by M_p; this we shall call p’s
value over all pairs of tests; for the tests we deal with, E(ρ_tt') is
certainly very close to E(ρ²_Mt), because the variances of stratified
tests of even modest length are close to uniform.) For actual tests
and persons, family scores cannot be observed and ρ²_Mt remains
unknown. With hypothetical data, however, we can determine ρ²_Mt
for any specific test, and by averaging coefficients for many tests can
determine E(ρ²_Mt) with any desired precision.
The intraclass correlation α_s between tests in the family is an
approximation to E(ρ²_Mt); and the estimate of α_s from a single test
is close to an unbiased estimate of α_s for the family.
We shall write formulas here in a form corresponding to our com-
puter operations; computational formulas suitable in treating actual
data are given by Rajaratnam et al. Our formulas and the usual
formulas give identical results, save that we assume at all points
that we have data from the entire population of persons. We restrict
ourselves to sampling plans where k_h is uniform for all m strata,
which allows us to simplify certain formulas. Let W_h be a covariance
between two particular items in a given stratum, let ΣW_h be
the sum of the k_h(k_h - 1) covariances for pairs of items within a
Procedures
A hypothetical universe of items is specified; this is divided into
three content categories (except in one substudy where six are used).
Items within a category have a uniform tetrachoric correlation r_w,
and a specified correlation r_b with each item in another category.
While it is not necessary to make r_w the same for all categories or
r_b the same for all pairs of categories, in our computations these are
kept uniform. While there are many ways of specifying universe
characteristics other than those we employ (for example, uniform
phi correlations or uniform covariances might easily have been
introduced), our specifications appear to cover much of the range
of test types.
a subtest. The several subtests form a test. Once the p_i for the test
are selected, the computer considers in turn each pair of items, enters
their parameters (p_i, p_j, and r_w or r_b) in the usual series approximation
to calculate their product-moment covariance C_ij, and stores
it according to its type, as defined in Figure 1. The V_i are also calculated.
Then, by appropriately cumulating covariances and variances,
the computer determines V_t and α, α_C, α_D, and α_CD. This calculation
is repeated with the same p_i and a new intercorrelation
matrix, to generate a test from another family. After this step has
been repeated for each correlation matrix under consideration, a
new set of p_i is drawn according to the same sampling plan and
coefficients for the new test are calculated, using each correlation
matrix in turn. To conserve computer time, only a few tests of each
family were constructed and analyzed; the number of tests sampled
for any family was selected so as to reduce the standard error of the
mean coefficient to an acceptable level.
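The calculation described above can be sketched in a few lines. This is only an illustrative sketch under stated assumptions, not the program used in the study: the joint pass rate of two items is obtained here by numerical integration of the bivariate normal (Simpson's rule) rather than by the series approximation; the stratified coefficient uses the standard form 1 - Σ V_h(1 - α_h)/V_t, which is assumed to correspond to the content-stratified coefficient; and all function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

STD = NormalDist()  # standard normal, used for thresholds and densities


def item_covariance(p_i, p_j, r, steps=400):
    """Covariance of two pass/fail items with pass rates p_i, p_j whose
    underlying normal traits correlate r (assumed 0 <= r <= 1)."""
    if r >= 1.0:
        # Perfectly correlated traits: whoever passes the harder item
        # also passes the easier one.
        return min(p_i, p_j) - p_i * p_j
    a = STD.inv_cdf(1.0 - p_i)  # threshold for item i
    b = STD.inv_cdf(1.0 - p_j)  # threshold for item j
    s = sqrt(1.0 - r * r)

    def f(x):
        # density at x times P(Y > b | X = x)
        return STD.pdf(x) * STD.cdf((r * x - b) / s)

    upper = 8.0  # the normal tail beyond 8 is negligible
    h = (upper - a) / steps
    total = f(a) + f(upper)
    for n in range(1, steps):
        total += (4 if n % 2 else 2) * f(a + n * h)
    joint = total * h / 3.0  # Simpson's rule for P(X > a, Y > b)
    return joint - p_i * p_j


def covariance_matrix(p, stratum, r_w, r_b):
    """Item variance-covariance matrix for pass rates p and stratum labels."""
    k = len(p)
    cov = [[0.0] * k for _ in range(k)]
    for i in range(k):
        cov[i][i] = p[i] * (1.0 - p[i])  # item variance V_i
        for j in range(i + 1, k):
            r = r_w if stratum[i] == stratum[j] else r_b
            cov[i][j] = cov[j][i] = item_covariance(p[i], p[j], r)
    return cov


def alpha(cov, items):
    """Simple coefficient alpha for the listed item indices."""
    k = len(items)
    v_total = sum(cov[i][j] for i in items for j in items)
    v_items = sum(cov[i][i] for i in items)
    return k / (k - 1) * (1.0 - v_items / v_total)


def stratified_alpha(cov, stratum):
    """Stratified alpha: 1 - sum_h V_h (1 - alpha_h) / V_t."""
    everyone = range(len(stratum))
    v_total = sum(cov[i][j] for i in everyone for j in everyone)
    error = 0.0
    for h in set(stratum):
        items = [i for i in everyone if stratum[i] == h]
        v_h = sum(cov[i][j] for i in items for j in items)
        error += v_h * (1.0 - alpha(cov, items))
    return 1.0 - error / v_total


# Hypothetical 18-item test: three strata of six items, all with p = .50.
p = [0.5] * 18
stratum = [h for h in range(3) for _ in range(6)]
single = covariance_matrix(p, stratum, r_w=0.5, r_b=0.5)  # single factor
three = covariance_matrix(p, stratum, r_w=0.5, r_b=0.0)   # independent strata
```

With every p_i equal to .50 the covariances have a closed form, so the sketch is easy to check by hand: in the single-factor case both coefficients come to .90, while with r_b = 0 the simple coefficient falls well below the stratified one (.75).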
TABLE 2
Data for Single-Factor Randomly Parallel Tests with Wide P Range
and .948, seem close together, but the corresponding S/N ratios are
25 and 18. A test with S/N of 18 must be lengthened about 40 percent
to attain an S/N of 25. A coefficient of .961 therefore implies that
the test is 140 percent as efficient as the coefficient of .948 implies.
Only where the S/N equivalents are close together are the implications
of the two coefficients the same. To assist in interpreting
results, all average coefficients have been converted to S/N; these
conversions appear in italics in our tables. As a rule of thumb, we
accept one coefficient as an adequate estimator of another if their
S/N equivalents have the ratio .83:1.00 or 1.00:1.20.
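The conversion behind these figures can be reproduced with a short sketch. It assumes the signal-to-noise equivalent of a coefficient is α/(1 - α), which matches the values quoted above (.961 gives about 25, .948 about 18), and it reads the rule of thumb as "the ratio of the two equivalents lies between .83 and 1.20". The function names are hypothetical.

```python
def signal_to_noise(alpha):
    """Signal-to-noise equivalent of a reliability coefficient."""
    return alpha / (1.0 - alpha)


def lengthening_factor(alpha_now, alpha_target):
    """Factor by which a test must be lengthened to move from one
    coefficient to another, assuming S/N grows in proportion to length."""
    return signal_to_noise(alpha_target) / signal_to_noise(alpha_now)


def adequate_estimator(alpha_a, alpha_b, low=0.83, high=1.20):
    """Rule of thumb: the ratio of S/N equivalents lies in [low, high]."""
    ratio = signal_to_noise(alpha_a) / signal_to_noise(alpha_b)
    return low <= ratio <= high


sn_high = signal_to_noise(0.961)            # about 25
sn_low = signal_to_noise(0.948)             # about 18
factor = lengthening_factor(0.948, 0.961)   # roughly 1.35 to 1.4
```

Under this reading, .961 and .948 would not count as adequate estimators of one another, even though the coefficients differ by only .013.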
For homogeneous tests, E(α) is close to E(ρ²) even when the
tetrachoric correlation is as high as .70. The reader is reminded that
interitem tetrachoric correlations for actual tests are in the neighborhood
of .05-.40, rarely higher. This confirms the Cronbach-Azuma
conclusion that for tests constructed by random sampling from
homogeneous items, α provides, on the average, a very good estimate
.30, the S/N range is 2.7 to 3.5 (corresponding to α's of .73 and .78).
In longer tests, the range of α would be less.
Cureton (1958) criticized derivations of internal-consistency
formulas. These formulas invariably involve the ratio of estimates
of E(C_tt') and E(V_t), even though the theory calls for obtaining the
expected value E(C_tt'/V_t). Working from our tables we calculated
E(C_tt')/E(V_t); this value is entered in Table 2 under the heading
E(C)/E(V). Comparing this to our sample mean of α, which is an
unbiased estimate of E(C/V), we see that for each r_w the two values
agree within .002. Discrepancies between E(C)/E(V) and α_s were
no larger than .01 for the stratified-parallel families studied. Our
made up from the same universe of items, because the former tests
are more perfectly equivalent.
We constructed two series of tests. In the wide-range series, six
items were selected from each of the three difficulty segments:
.01-.30, .31-.69, and .70-.99; in the "limited range" series, nine items
were drawn from the segments .60-.79 and .80-.99. Again, three
difficulty stratification, not very important unless r_w > .70 (cf. S/N
values). E(α) has changed negligibly, except for a decrease at r_w
= 1.00. The ranges of all coefficients are reduced by difficulty
stratification.
When we compare each estimator to E(ρ²) for either wide- or
narrow-range tests, we find that α_D is, as expected, a good estimator
except for the wide-range test with r_w = 1.00. In general, α is only
slightly less satisfactory as an estimator, and since it is easier to
trifle better when r_w = .30, and much worse for higher r_w. Analyzing
a short heterogeneous test by the simple α formula gives misleading
results.
The formulas that take content stratification into account give
the results in Table 4. α_CD is an excellent estimator of E(ρ²). While
α_C gives somewhat lower estimates, it is close enough to E(ρ²) to be
TABLE 5
Three-Factor Wide-Range Tests Stratified on Content Only:
Mean, S.E._M, and Range for Each Coefficient
strata. There are grounds for expecting the S/N values for E(ρ²)
and α_s to maintain the same ratio as k_h changes, provided that α_s
is based on the plan used in test construction. We wish to know
also how the relations among α, α_C, and α_CD change as a test is
lengthened.
From our tables of covariances and formula (2) we calculate the
limit of the ratio of S/N values for wide-range tests, presented in
Table 6. An additional ratio is available for r_w = .50, r_b = .30:
.869 : .941 : 1.00. For long tests, it appears that α_C is a satisfactory
TABLE 6
Limiting Values of Ratios S/N(α) : S/N(α_C) : S/N(α_CD) Corresponding to
Stratified Alpha with Coarse and Fine Stratification (Wide-Range Tests)
KR20 formula.
While we have dealt with only a few of the possible universe
specifications and sampling plans, the reader can extend the findings
by interpolation and extrapolation. All tests in this study are 18
items long. It is known that as test length changes the stratified α
coefficients calculated on the basis of the stratification used in test
construction obey the Spearman-Brown rule. There is some evidence
that E(ρ²) follows that rule fairly closely. Homogeneous tests draw
items from a single content stratum whose degree of homogeneity is
defined by the tetrachoric interitem correlation r_w. Our tests draw
items from three content strata. The interitem tetrachoric correlations
are labelled r_w for items within a stratum and r_b for items in
different strata. In most of our analyses, r_b = r_w (single-factor test)
or r_b = 0.
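Extrapolating an 18-item result to another test length by the Spearman-Brown rule takes a single formula. The sketch below is illustrative only; the function name and the example coefficients are hypothetical, chosen to be easy to verify by hand.

```python
def spearman_brown(alpha, n):
    """Projected coefficient when a test is lengthened by the factor n
    (n = 2 doubles the test; n < 1 shortens it)."""
    return n * alpha / (1.0 + (n - 1.0) * alpha)


doubled = spearman_brown(0.50, 2)        # 18 -> 36 items: 2/3
lengthened = spearman_brown(0.948, 1.4)  # 40 percent longer: about .962
```

The rule is equivalent to saying that α/(1 - α) grows in direct proportion to test length, which is why ratios of that quantity can be read as ratios of test length.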
The findings are too complex to be compactly summarized, but a
single table of particularly representative values will present the
heart of the results. Computations were made assuming r_w of .30,
.50, .70, and 1.00. While the anomalous behavior of product-moment
correlations at high tetrachoric levels makes the higher r_w interesting,
it is the lower levels that are encountered in actual tests. We re-
TABLE 7
Summary of Results for 18-item Tests with r_w = .50
REFERENCES