Encyclopedia of Statistics in Behavioral Science, Volume 1
ISBN-13: 978-0-470-86080-9
ISBN-10: 0-470-86080-4

A Priori v Post Hoc Testing
answer the questions "Where does alpha come from?" and "How much alpha should be applied?", but in trying to answer these questions, one may suggest that the process of generating alpha requires a prespecified hypothesis [5]. Yet, this is not very satisfying, because sometimes unexpected findings need to be explored. In fact, discarding these findings may be quite problematic itself [1]. For example, a confounder may present itself only after the data are in, or a key assumption underlying the validity of the planned analysis may be found to be violated. In theory, it would always be better to test the hypothesis on new data, rather than on the same data that suggested the hypothesis, but this is not always feasible, or always possible [1]. Fortunately, there are a variety of approaches to controlling the overall Type I error rate while allowing for flexibility in testing hypotheses that were suggested by the data. Two such approaches have already been mentioned, specifically the Pocock sequential boundaries and the O'Brien-Fleming sequential boundaries, which allow one to avoid having to select just one analysis time [9].

In the context of the analysis of variance, Fisher's least significant difference (LSD) can be used to control the overall Type I error rate when arbitrary pairwise comparisons are desired (see Multiple Comparison Procedures). The approach is based on operating in protected mode, so that these pairwise comparisons occur only if an overall equality null hypothesis is first rejected (see Multiple Testing). Of course, the overall Type I error rate that is being protected is the one that applies to the global null hypothesis that all means are the same. This may offer little consolation if one mean is very large, another is very small, and, because of these two, all other means can be compared without adjustment (see Multiple Testing). The Scheffé method offers simultaneous inference, as any linear combination of means can be tested. Clearly, this generalizes the pairwise contrasts that correspond to pairwise comparisons of means.

Another area in which post hoc issues arise is the selection of the primary outcome measure. Sometimes, there are various outcome measures, or end points, to be considered. For example, an intervention may be used in hopes of reducing childhood smoking, as well as drug use and crime. It may not be clear at the beginning of the study which of these outcome measures will give the best chance to demonstrate statistical significance. In such a case, it can be difficult to select one outcome measure to serve as the primary outcome measure. Sometimes, however, the outcome measures are fusible [4], and, in this case, this decision becomes much easier. To clarify, suppose that there are two candidate outcome measures, say response and complete response (however these are defined in the context in question). Furthermore, suppose that a complete response also implies a response, so that each subject can be classified as a nonresponder, a partial responder, or a complete responder.

In this case, the two outcome measures are fusible, and actually represent different cut points of the same underlying ordinal outcome measure [4]. By specifying neither component outcome measure, but rather the information-preserving composite endpoint (IPCE), as the primary outcome measure, one avoids having to select one or the other, and can find legitimate significance if either outcome measure shows significance. The IPCE is simply the underlying ordinal outcome measure that contains each component outcome measure as a binary sub-endpoint. Clearly, using the IPCE can be cast as a method for allowing post hoc testing, because it obviates the need to prospectively select one outcome measure or the other as the primary one. Suppose, for example, that two key outcome measures are response (defined as a certain magnitude of benefit) and complete response (defined as a somewhat higher magnitude of benefit, but on the same scale). If one outcome measure needs to be selected as the primary one, then it may be unclear which one to select. Yet, because both outcome measures are measured on the same scale, this decision need not be addressed, because one could fuse the two outcome measures together into a single trichotomous outcome measure, as in Table 1.

Table 1  Hypothetical data set #1

           No response   Partial response   Complete response
Control        10              10                  10
Active         10               0                  20

Even when one recognizes that an outcome measure is ordinal, and not binary, there may still be a desire to analyze this outcome measure as if it were binary by dichotomizing it. Of course, there is a different binary sub-endpoint for each cut point of the original ordinal outcome measure.
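The fusing of the two binary sub-endpoints into one trichotomous measure is easy to check numerically. Below is a minimal sketch in plain Python (the function and variable names are mine, not from the article) that reproduces the response rates implied by Table 1:

```python
# Table 1 (hypothetical data): one underlying ordinal outcome,
# no response < partial response < complete response.
counts = {
    "Control": {"none": 10, "partial": 10, "complete": 10},
    "Active":  {"none": 10, "partial": 0,  "complete": 20},
}

def responders(group, cut):
    """Binary sub-endpoint at a cut point: 'partial' counts partial-or-better
    (overall response); 'complete' counts complete responders only."""
    g = counts[group]
    k = g["partial"] + g["complete"] if cut == "partial" else g["complete"]
    return k, sum(g.values())

# Overall response does not separate the groups:
print(responders("Control", "partial"), responders("Active", "partial"))
# -> (20, 30) (20, 30)
# ...but complete response does:
print(responders("Control", "complete"), responders("Active", "complete"))
# -> (10, 30) (20, 30)
```

Each cut point of the trichotomous measure yields one of the two binary sub-endpoints, which is exactly what makes the two measures fusible.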
In the previous paragraph, for example, one could analyze the binary response outcome measure (20/30 in the control group vs 20/30 in the active group in the fictitious data in Table 1), or one could analyze the binary complete response outcome measure (10/30 in the control group vs 20/30 in the active group in the fictitious data in Table 1). With k ordered categories, there are k − 1 binary sub-endpoints, together comprising the Lancaster decomposition [12].

In Table 1, the overall response rate would not differentiate the two treatment groups, whereas the complete response rate would. If one knew this ahead of time, then one might select the complete response rate. But the data could also turn out as in Table 2. Now the situation is reversed, and it is the overall response rate that distinguishes the two treatment groups (30/30 or 100% in the active group vs 20/30 or 67% in the control group), whereas the complete response rate does not (10/30 or 33% in the active group vs 10/30 or 33% in the control group). If either pattern is possible, then it might not be clear, prior to collecting the data, which of the two outcome measures, complete response or overall response, would be preferred. The Smirnov test (see Kolmogorov-Smirnov Tests) can help, as it allows one to avoid having to prespecify the particular sub-endpoint to analyze. That is, it allows for the simultaneous testing of both outcome measures in the cases presented above, or of all k − 1 outcome measures more generally, while still preserving the overall Type I error rate. This is achieved by letting the data dictate the outcome measure (i.e., selecting the outcome measure that maximizes the test statistic), and then comparing the resulting test statistic not to its own null sampling distribution, but rather to the null sampling distribution of the maximally chosen test statistic.

Table 2  Hypothetical data set #2

           No response   Partial response   Complete response
Control        10              10                  10
Active          0              20                  10

Adaptive tests are more general than the Smirnov test, as they allow for an optimally chosen set of scores for use with a linear rank test, with the scores essentially being selected by the data [7]. That is, the Smirnov test allows for a data-dependent choice of the cut point for a subsequent application of an analogue of Fisher's exact test (see Exact Methods for Categorical Data), whereas adaptive tests allow the data to determine the numerical scores to be assigned to the columns for a subsequent linear rank test. Only if those scores are zero to the left of a given column and one to the right of it will the linear rank test reduce to Fisher's exact test. For the fictitious data in Tables 1 and 2, for example, the Smirnov test would allow for the data-dependent selection of the analysis of either the overall response rate or the complete response rate, but the Smirnov test would not allow for an analysis that exploits reinforcing effects. To see why this can be a problem, consider Table 3.

Table 3  Hypothetical data set #3

           No response   Partial response   Complete response
Control        10              10                  10
Active          5              10                  15

Now both of the aforementioned measures can distinguish the two treatment groups, and in the same direction, as the complete response rates are 50% and 33%, whereas the overall response rates are 83% and 67%. The problem is that neither of these effects by itself is as large as the effect seen in Table 1 or Table 2. Yet, overall, the effect in Table 3 is as large as that seen in the previous two tables, but only if the reinforcing effects of both measures are considered. After seeing the data, one might wish to use a linear rank test by which numerical scores are assigned to the three columns and then the mean scores across treatment groups are compared. One might wish to use equally spaced scores, such as 1, 2, and 3, for the three columns. Adaptive tests would allow for this choice of scores to be used for Table 3 while preserving the Type I error rate by making the appropriate adjustment for the inherent multiplicity.

The basic idea behind adaptive tests is to subject the data to every conceivable set of scores for use with a linear rank test, and then compute the minimum of all the resulting P values. This minimum P value is artificially small because the data were allowed to select the test statistic (that is, the scores for use with the linear rank test). However, this minimum P value can be used not as a (valid) P value, but rather as a test statistic to be compared to the null sampling distribution of the minimal P value so computed.
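This min-P calibration can be sketched with a Monte Carlo permutation approximation. (The actual adaptive tests of [2, 7] use the exact, design-based permutation distribution; the score sets, names, and the 400-permutation budget below are my illustrative choices.) Using the Table 3 data:

```python
import random

random.seed(1)

# Table 3: Control 10/10/10, Active 5/10/15 over ordered categories
# no < partial < complete response, coded 0 < 1 < 2.
outcomes = [0]*10 + [1]*10 + [2]*10 + [0]*5 + [1]*10 + [2]*15
labels   = ["C"]*30 + ["A"]*30

# Candidate score sets: overall response, complete response, equally spaced.
SCORE_SETS = [(0, 1, 1), (0, 0, 1), (1, 2, 3)]

def mean_active_score(perm_labels, scores):
    vals = [scores[o] for lab, o in zip(perm_labels, outcomes) if lab == "A"]
    return sum(vals) / len(vals)

N = 400
perms = [random.sample(labels, len(labels)) for _ in range(N)]
obs = [mean_active_score(labels, s) for s in SCORE_SETS]
S = [[mean_active_score(p, s) for s in SCORE_SETS] for p in perms]

def pvalue(value, j):
    """One-sided permutation P value for score set j (large stat = benefit)."""
    return sum(row[j] >= value for row in S) / N

raw = [pvalue(obs[j], j) for j in range(len(SCORE_SETS))]
min_p_obs = min(raw)  # anti-conservative if reported as a P value directly

# Valid version: refer the minimum P value to the permutation
# distribution of the minimum P value itself.
min_p_perm = [min(pvalue(S[i][j], j) for j in range(len(SCORE_SETS)))
              for i in range(N)]
adjusted = sum(mp <= min_p_obs for mp in min_p_perm) / N
```

The adjusted value is never smaller than the naive minimum, which is exactly the multiplicity adjustment the article describes.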
As a result, the sample space can be partitioned into regions on which a common test statistic is used, and it is in this sense that the adaptive test allows the data to determine the test statistic, in a post hoc fashion. Yet, because of the manner in which the reference distribution is computed (on the basis of the exact design-based permutation null distribution of the test statistic [8], factoring in how it was selected on the basis of the data), the resulting test is exact. This adaptive testing approach was first proposed by Berger [2], but later generalized by Berger and Ivanova [7] to accommodate preferred alternative hypotheses and to allow for greater or lesser belief in these preferred alternatives.

Post hoc comparisons can and should be explored, but with some caveats. First, the criteria for selecting such comparisons to be made should be specified prospectively [1], when this is possible. Of course, it may not always be possible. Second, plausibility and subject area knowledge should be considered (as opposed to basing the comparisons exclusively on statistical considerations) [1]. Third, if at all possible, these comparisons should be considered as hypothesis-generating, and should lead to additional studies to produce new data to test these hypotheses, which would have been post hoc for the initial experiments, but are now prespecified for the additional ones.

References

[1] Adams, K.F. (1998). Post hoc subgroup analysis and the truth of a clinical trial, American Heart Journal 136, 753–758.
[2] Berger, V.W. (1998). Admissibility of exact conditional tests of stochastic order, Journal of Statistical Planning and Inference 66, 39–50.
[3] Berger, V.W. (2001). The p-value interval as an inferential tool, The Statistician 50(1), 79–85.
[4] Berger, V.W. (2002). Improving the information content of categorical clinical trial endpoints, Controlled Clinical Trials 23, 502–514.
[5] Berger, V.W. (2004). On the generation and ownership of alpha in medical studies, Controlled Clinical Trials 25, 613–619.
[6] Berger, V.W. & Bears, J. (2003). When can a clinical trial be called randomized? Vaccine 21, 468–472.
[7] Berger, V.W. & Ivanova, A. (2002). Adaptive tests for ordered categorical data, Journal of Modern Applied Statistical Methods 1, 269–280.
[8] Berger, V.W., Lunneborg, C., Ernst, M.D. & Levine, J.G. (2002). Parametric analyses in randomized clinical trials, Journal of Modern Applied Statistical Methods 1(1), 74–82.
[9] Demets, D.L. & Lan, K.K.G. (1994). Interim analysis: the alpha spending function approach, Statistics in Medicine 13, 1341–1352.
[10] Hacking, I. (1965). The Logic of Statistical Inference, Cambridge University Press, Cambridge.
[11] Macdonald, R.R. (2002). The incompleteness of probability models and the resultant implications for theories of statistical inference, Understanding Statistics 1(3), 167–189.
[12] Permutt, T. & Berger, V.W. (2000). A new look at rank tests in ordered 2 x k contingency tables, Communications in Statistics - Theory and Methods 29, 989–1003.
[13] Senn, S. (1997). Statistical Issues in Drug Development, Wiley, Chichester.

VANCE W. BERGER
ACE Model
HERMINE H. MAES
Volume 1, pp. 5–10
of these four sources of variation. Typically, these designs include individuals with different degrees of genetic relatedness and environmental similarity. One such design is the family study (see Family History Versus Family Study Methods in Genetics), which studies the correlations between parents and offspring, and/or siblings (in a nuclear family). While this design is very useful to test for familial resemblance, it does not allow us to separate additive genetic from shared environmental factors. The most popular design that does allow the separation of genetic and environmental (shared and unshared) factors is the classical twin study.

The Classical Twin Study

The classical twin study consists of a design in which data are collected from identical or monozygotic (MZ) and fraternal or dizygotic (DZ) twins reared together in the same home. MZ twins have identical genotypes, and thus share all their genes. DZ twins, on the other hand, share on average half their genes, as do regular siblings. Comparing the degree of similarity in a trait (or the twin correlation) provides an indication of the importance of genetic factors to the trait variability. Greater similarity for MZ versus DZ twins suggests that genes account for at least part of the variation in the trait. The recognition of this fact led to the development of heritability indices, based on the MZ and DZ correlations. Although these indices may provide a quick indication of the heritability, they may result in nonsensical estimates. Furthermore, in addition to genes, environmental factors that are shared by family members (or twins in this case) also contribute to familial similarity. Thus, if environmental factors contribute to a trait and they are shared by twins, they will increase correlations equally between MZ and DZ twins. The relative magnitude of the MZ and DZ correlations thus tells us about the contribution of additive genetic (a²) and shared environmental (c²) factors. Given that MZ twins share their genotype and shared environmental factors (if reared together), the degree to which they differ informs us of the importance of specific environmental (e²) factors.

If the twin similarity is expressed as correlations, one minus the MZ correlation is the proportion due to specific environment (Figure 1). Using the raw scale of measurement, this proportion can be estimated from the difference between the MZ covariance and the variance of the trait. With the trait variance and the MZ and DZ covariances as unique observed statistics, we can estimate the contributions of additive genes (A), shared (C), and specific (E) environmental factors, according to the genetic model. A useful tool to generate the expectations for the variances and covariances under a model is path analysis [11].

[Figure 1  Bar chart of twin correlations for unrelated pairs (UN), DZ twins, and MZ twins (example bars: rDZ = 0.6, rMZ = 0.8). Expectations: a² + c² + e² = 1; rMZ = a² + c²; rDZ = 0.5a² + c². Example: e² = 1 − rMZ = 0.2; rMZ − rDZ = 0.5a² = 0.2, so a² = 0.4 and c² = rMZ − a² = 0.4.]

Path Analysis

A path diagram is a graphical representation of the model, and is mathematically complete. Such a path diagram for a genetic model, by convention, consists of boxes for the observed variables (the traits under study) and circles for the latent variables (the genetic and environmental factors that are not measured, but are inferred from data on relatives, and are standardized). The contribution of the latent variables to the variances of the observed variables is specified in the path coefficients, which are regression coefficients (represented by single-headed arrows from the latent to the observed variables).
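The expectations and worked example summarized in Figure 1 reduce to a few lines of arithmetic. A minimal sketch in Python (the function name is mine; this is the quick heritability-index style estimate, not model fitting):

```python
# Point estimates from twin correlations, using the expectations
# rMZ = a^2 + c^2, rDZ = 0.5*a^2 + c^2, and a^2 + c^2 + e^2 = 1.
def ace_from_correlations(r_mz, r_dz):
    a2 = 2 * (r_mz - r_dz)  # doubling the MZ-DZ gap isolates a^2
    c2 = r_mz - a2          # the rest of the MZ correlation is c^2
    e2 = 1 - r_mz           # MZ twins can differ only through E
    return a2, c2, e2

a2, c2, e2 = ace_from_correlations(r_mz=0.8, r_dz=0.6)  # Figure 1 example
```

This reproduces the example values a² = 0.4, c² = 0.4, e² = 0.2, and it also shows why such quick indices can be nonsensical: a DZ correlation below half the MZ correlation yields a negative c².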
We further add two kinds of double-headed arrows to the path coefficients model. First, each of the latent variables has a double-headed arrow pointing to itself, which is fixed to 1.0. Note that we can either estimate the contribution of the latent variables through the path coefficients and standardize the latent variables, or we can estimate the variances of the latent variables directly while fixing the paths to the observed variables. We prefer the path coefficients approach to the variance components model, as it generalizes much more easily to advanced models. Second, on the basis of quantitative genetic theory, we model the covariance between twins by adding double-headed arrows between the additive genetic and shared environmental latent variables. The correlation between the additive genetic latent variables is fixed to 1.0 for MZ twins, because they share all their genes. The corresponding value for DZ twins is 0.5, derived from biometrical principles [7]. The correlation between shared environmental latent variables is fixed to 1.0 for MZ and DZ twins, reflecting the equal environments assumption. Specific environmental factors do not contribute to covariance between twins, which is implied by omitting a double-headed arrow. The full path diagrams for MZ and DZ twins are presented in Figure 2.

[Figure 2  Path diagram for the ACE model applied to data from MZ and DZ twins: the latent factors a, c, e point to the phenotypes P1 and P2 of twin 1 and twin 2, in separate MZ and DZ diagrams.]

The expected covariance between two variables in a path diagram may be derived by tracing all connecting routes (or chains) between the variables while following the rules of path analysis, which are: (a) trace backward along an arrow, change direction in a double-headed arrow, and then trace forward, or simply forward from one variable to the other; this implies tracing through at most one two-way arrow in each chain of paths; (b) pass through each variable only once in each chain of paths. The expected covariance between two variables, or the expected variance of a variable, is computed by multiplying together all the coefficients in a chain, and then summing over all legitimate chains. Using these rules, the expected covariance between the phenotypes of twin 1 and twin 2 for MZ twins and DZ twins can be shown to be:

    MZ cov = | a² + c² + e²    a² + c²       |
             | a² + c²         a² + c² + e²  |

    DZ cov = | a² + c² + e²    0.5a² + c²    |
             | 0.5a² + c²      a² + c² + e²  |

This translation of the ideas of the theory into mathematical form comprises the stage of model building. Then, it is necessary to choose the appropriate study design, in this case the classical twin study, to generate critical data to test the model.

Model Fitting

The stage of model fitting allows us to compare the predictions with actual observations in order to evaluate how well the model fits the data using goodness-of-fit statistics. Depending on whether the model fits the data or not, it is accepted or rejected, in which case an alternative model may be chosen. In addition to the goodness-of-fit of the model, estimates for the genetic and environmental parameters are obtained. If a model fits the data, we can further test the significance of these parameters of the model by adding or dropping parameters and evaluate the improvement or decrease in model fit using likelihood-ratio tests. This is equivalent to estimating confidence intervals. For example, if the ACE model fits the data, we may drop the additive genetic (a) parameter and refit the model (now a CE model). The difference in the goodness-of-fit statistics for the two models, the ACE and the CE models, provides a likelihood-ratio test with one degree of freedom for the significance of a. If this test is significant, additive genetic factors contribute significantly to the variation in the trait.
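The expected MZ and DZ covariance matrices given above can be generated mechanically from candidate path coefficients. A small sketch (the coefficient values are illustrative, not estimates from the text):

```python
# Expected 2x2 twin covariance matrix implied by the ACE path model.
# r_a is the correlation between the twins' additive genetic factors:
# 1.0 for MZ pairs, 0.5 for DZ pairs.
def expected_twin_cov(a, c, e, r_a):
    var = a * a + c * c + e * e    # diagonal: phenotypic variance
    cov = r_a * a * a + c * c      # off-diagonal: cross-twin covariance
    return [[var, cov], [cov, var]]

a, c, e = 0.6, 0.5, 0.4            # illustrative path coefficients
mz_cov = expected_twin_cov(a, c, e, r_a=1.0)
dz_cov = expected_twin_cov(a, c, e, r_a=0.5)
```

The MZ off-diagonal entry is a² + c² and the DZ entry is 0.5a² + c², matching the matrices above; only the cross-twin term changes with zygosity.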
If it is not, a could be dropped from the model, according to the principle of parsimony. Alternatively, we could calculate the confidence intervals around the parameters. If these include zero for a particular parameter, it indicates that the parameter is not significantly different from zero and could be dropped from the model. Given that significance of parameters is related to power of the study, confidence intervals provide useful information about the precision with which the point estimates are known. The main advantages of the model fitting approach are thus (a) assessing the overall model fit, (b) incorporating sample size and precision, and (c) providing sensible heritability estimates. Other advantages include that it (d) generalizes to the multivariate case and to extended pedigrees, (e) allows the addition of covariates, (f) makes use of all the available data, and (g) is suitable for selected samples. If we are interested in testing the ACE model and quantifying the degree to which genetic and environmental factors contribute to the variability of a trait, data need to be collected on relatively large samples of genetically informative relatives, for example, MZ and DZ twins. The ACE model can then be fitted either directly to the raw data or to summary statistics (covariance matrices), and decisions made about the model on the basis of the goodness-of-fit. There are several statistical modeling packages capable of fitting the model, for example, EQS, SAS, Lisrel, and Mx (see Structural Equation Modeling: Software). The last program was designed specifically with genetic epidemiologic models in mind, and provides great flexibility in specifying both basic and advanced models [10]. Mx models are specified in terms of matrices, and matrix algebra is used to generate the expected covariance matrices or other statistics of the model to be fitted.

Example

We illustrate the ACE model with data collected in the Virginia Twin Study of Adolescent Behavioral Development (VTSABD) [2]. One focus of the study is conduct disorder, which is characterized by a set of disruptive and destructive behaviors. Here we use a summed symptom score, normalized and standardized within age and sex, and limit the example to the data on 8-16-year-old boys, rated by their mothers. Using the sum score data on 295 MZ and 176 DZ pairs of twins, we first estimated the means, variances, and covariances by maximum likelihood in Mx [10], separately for the two twins and the two zygosity groups (MZ and DZ; see Table 1). This model provides the overall likelihood of the data and serves as the saturated model against which other models may be compared. It has 10 estimated parameters and yields a −2 log-likelihood of 2418.575 for 930 degrees of freedom, calculated as the number of observed statistics (940 nonmissing data points) minus the number of estimated parameters. First, we tested the equality of means and variances by twin order and zygosity by imposing equality constraints on the respective parameters. Neither means nor variances were significantly different for the two members of a twin pair, nor did they differ across zygosity (χ²₆ = 5.368, p = .498).

Then we fitted the ACE model, thus partitioning the variance into additive genetic, shared, and specific environmental factors. We estimated the means freely, as our primary interest is in the causes of individual differences. The likelihood-ratio test obtained by subtracting the −2 log-likelihood of the saturated model from that of the ACE model (2421.478), for the difference in degrees of freedom of the two models (933 − 930), indicates that the ACE model gives an adequate fit to the data (χ²₃ = 2.903, p = .407). We can evaluate the significance of each of the parameters by estimating confidence intervals, or by fitting submodels in which we fix one or more parameters to zero. The series of models typically tested includes the ACE, AE, CE, E, and ADE models. Alternative models can be compared by several fit indices, for example, Akaike's Information Criterion (AIC; [1]), which takes into account both goodness-of-fit and parsimony and favors the model with the lowest value for AIC. Results from fitting these models are presented in Table 2.
Table 1  Means and variances estimated from the raw data on conduct disorder in VTSABD twins

                                  Monozygotic male twins (MZM)     Dizygotic male twins (DZM)
                                      T1         T2                    T1         T2
Expected means                      0.0173     0.0228                0.0590     0.0688
Expected covariance matrix    T1    0.9342                     T1    1.0908
                              T2    0.5930     0.8877          T2    0.3898     0.9030
Table 2  Goodness-of-fit statistics and parameter estimates for conduct disorder in VTSABD twins

Model     χ²        df   p      AIC        Δχ²       Δdf   a²              c²              e²
ACE       2.903     3    .407   −3.097                     .57 (.33–.72)   .09 (.00–.31)   .34 (.28–.40)
AE        3.455     4    .485   −4.545     0.552     1     .66 (.60–.72)                   .34 (.28–.40)
CE        26.377    4    .000   18.377     23.475    1                     .55 (.48–.61)   .45 (.39–.52)
E         194.534   5    .000   184.534    191.63    2                                     1.00
ADE       3.455     3    .327   −2.545                     .66 (.30–.72)   .00 (.00–.36)   .34 (.28–.40)

AIC: Akaike's information criterion; a²: additive genetic variance component; c²: shared environmental variance component; e²: specific environmental variance component. For the ADE model, the c² column reports the dominance variance component d².
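The bookkeeping behind Table 2 can be verified directly: here AIC = χ² − 2·df, and each submodel's Δχ² against the ACE model is the difference of the χ² values. A sketch (values transcribed from Table 2; the computed Δχ² for the CE model is 23.474, which the source table rounds to 23.475):

```python
# chi^2 and df for each model, transcribed from Table 2.
models = {
    "ACE": (2.903, 3),
    "AE":  (3.455, 4),
    "CE":  (26.377, 4),
    "E":   (194.534, 5),
    "ADE": (3.455, 3),
}

def aic(chi2, df):
    """Akaike's information criterion as used in Table 2: chi^2 - 2*df."""
    return chi2 - 2 * df

for name, (chi2, df) in models.items():
    line = f"{name}: AIC = {aic(chi2, df):.3f}"
    if name in ("AE", "CE", "E"):      # submodels nested within ACE
        d_chi2 = chi2 - models["ACE"][0]
        d_df = df - models["ACE"][1]
        line += f", LRT vs ACE: {d_chi2:.3f} on {d_df} df"
    print(line)
```

The AE model has the lowest AIC (−4.545), consistent with the text's conclusion that it is the best fitting and most parsimonious model.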
Dropping the shared environmental parameter c did not deteriorate the fit of the model. However, dropping the a path resulted in a significant decrease in model fit, suggesting that additive genetic factors account for part of the variation observed in conduct disorder symptoms, in addition to specific environmental factors. The latter are always included in the models, for two main reasons. First, almost all variables are subject to error. Second, the likelihood is generally not defined when twins are predicted to correlate perfectly. The same conclusions would be obtained from judging the confidence intervals around the parameters a² (which do not include zero) and c² (which do include zero). Not surprisingly, the E model fits very badly, indicating highly significant family resemblance.

Typically, the ADE model (with dominance instead of common environmental influences) is also fitted, predicting a DZ correlation less than half the MZ correlation. This is the opposite expectation of the ACE model, which predicts a DZ correlation greater than half the MZ correlation. Given that dominance (d) and shared environment (c) are confounded in the classical twin design, and that the ACE and ADE models are not nested, both are fitted and preference is given to the one with the best absolute goodness-of-fit, in this case the ACE model. Alternative designs, for example, twins reared apart, provide additional unique information to identify and simultaneously estimate c and d separately. In this example, we conclude that the AE model is the best fitting and most parsimonious model to explain variability in conduct disorder symptoms in adolescent boys rated by their mothers in the VTSABD. Additive genetic factors account for two-thirds of the variation, with the remaining one-third explained by specific environmental factors. A more detailed description of these methods may be found in [8].

Limitations and Assumptions

Although the classical twin study is a powerful design to infer the causes of variation in a trait of interest, it is important to reflect on the limitations when interpreting results from fitting the ACE model to twin data. The power of the study depends on a number of factors, including among others the study design, the sample size, the effect sizes of the components of variance, and the significance level [9]. Further, several assumptions are made when fitting the ACE model. First, it is assumed that the effects of A, C, and E are linear and additive (i.e., no genotype by environment interaction) and mutually independent (i.e., no genotype-environment covariance). Second, the effects are assumed to be equal across twin order and zygosity. Third, we assume that the contribution of environmental factors to twins' similarity for a trait is equal for MZ and DZ twins (the equal environments assumption). Fourth, no direct influence exists from a twin on his/her co-twin (no reciprocal sibling environmental effect). Finally, the parental phenotypes are assumed to be independent (random mating). Some of these assumptions may be tested by extending the twin design.

Extensions

Although it is important to answer the basic questions about the importance of genetic and environmental factors to variation in a trait, the information obtained remains descriptive. However, it forms the basis for
more advanced questions that may inform us about the nature and kind of the genetic and environmental factors. Some examples of these questions include: Is the contribution of genetic and/or environmental factors the same in males and females? Is the heritability equal in children, adolescents, and adults? Do the same genes account for variation in more than one phenotype, and thus explain some or all of the covariation between the phenotypes? Does the impact of genes and environment change over time? How much parent-child similarity is due to shared genes versus shared environmental factors?

This basic model can be extended in a variety of ways to account for sex limitation, genotype by environment interaction, and sibling interaction, and to deal with multiple variables measured simultaneously (multivariate genetic analysis) or longitudinally (developmental genetic analysis). Other relatives can also be included, such as siblings, parents, spouses, and children of twins, which may allow better separation of genetic and cultural transmission and estimation of assortative mating and twin and sibling environment. The addition of measured genes (genotypic data) or measured environments may further refine the partitioning of the variation, if these measured variables are linked or associated with the phenotype of interest. The ACE model is thus the cornerstone of modeling the causes of variation.

References

[2] Eaves, L.J., Silberg, J.L., Meyer, J.M., Maes, H.H., Simonoff, E., Pickles, A., Rutter, M., Neale, M.C., Reynolds, C.A., Erickson, M., Heath, A.C., Loeber, R., Truett, K.R. & Hewitt, J.K. (1997). Genetics and developmental psychopathology: 2. The main effects of genes and environment on behavioral problems in the Virginia Twin Study of Adolescent Behavioral Development, Journal of Child Psychology and Psychiatry 38, 965–980.
[3] Falconer, D.S. (1989). Introduction to Quantitative Genetics, Longman Scientific & Technical, New York.
[4] Fisher, R.A. (1918). The correlations between relatives on the supposition of Mendelian inheritance, Transactions of the Royal Society of Edinburgh 52, 399–433.
[5] Galton, F. (1865). Hereditary talent and character, MacMillan's Magazine 12, 157–166.
[6] Kendler, K.S. & Eaves, L.J. (2004). Advances in Psychiatric Genetics, American Psychiatric Association Press.
[7] Mather, K. & Jinks, J.L. (1971). Biometrical Genetics, Chapman and Hall, London.
[8] Neale, M.C. & Cardon, L.R. (1992). Methodology for Genetic Studies of Twins and Families, Kluwer Academic Publishers BV, Dordrecht.
[9] Neale, M.C., Eaves, L.J. & Kendler, K.S. (1994). The power of the classical twin study to resolve variation in threshold traits, Behavior Genetics 24, 239–225.
[10] Neale, M.C., Boker, S.M., Xie, G. & Maes, H.H. (2003). Mx: Statistical Modeling, 6th Edition, VCU Box 900126, Department of Psychiatry, Richmond, 23298.
[11] Wright, S. (1934). The method of path coefficients, Annals of Mathematical Statistics 5, 161–215.

HERMINE H. MAES
place this subject in Treatment Group A, and compute the marginal male imbalance to be (4 + 4 + 1 − 5 − 2) = 2, the marginal smoker imbalance to be (4 + 5 + 1 − 5 − 6) = −1, and the joint male smoker imbalance to be (4 + 1 − 5) = 0. Now provisionally place this subject in Treatment Group B and compute the marginal male imbalance to be (4 + 4 − 5 − 2 − 1) = 0, the marginal smoker imbalance to be (4 + 5 − 5 − 6 − 1) = −3, and the joint male smoker imbalance to be (4 − 5 − 1) = −2. Using the joint balancing, Treatment Group A would be preferred. The actual allocation may be deterministic, as in simply assigning the subject to the group that leads to better balance, A in this case, or it may be stochastic, as in making this assignment with high probability. For example, one might add one to the absolute value of each imbalance, and then use the ratios as probabilities. So here the probability of assignment to A would be (2 + 1)/[(0 + 1) + (2 + 1)] = 3/4 and the probability of assignment to B would be (0 + 1)/[(2 + 1) + (0 + 1)] = 1/4. If we were using the marginal balancing technique, then a weight function could be used to weigh either gender or smoking status more heavily than the other, or they could each have the same weight. Either way, the decision would again be based, either deterministically or stochastically, on which treatment group minimizes the imbalance, and possibly by how much.

Response-adaptive Randomization Procedures

In response-adaptive randomization, the treatment allocations depend on the previous subject outcomes, so that the subjects are more likely to be assigned to the superior treatment, or at least to the one that is found to be superior so far. This is a good way to address the objective of minimizing exposure to an inferior treatment, and possibly the only way to address both objectives discussed above [5]. Response-adaptive randomization procedures may determine the allocation ratios so as to optimize certain criteria, including minimizing the expected number of treatment failures, minimizing the expected number of patients assigned to the inferior treatment, minimizing the total sample size, or minimizing the total cost. They may also follow intuition, often as urn models. A typical urn model starts with k balls of each color, with each color representing a distinct treatment group (that is, there is a one-to-one correspondence between the colors of the balls in the urn and the treatment groups to which a subject could be assigned). A ball is drawn at random from the urn to determine the treatment assignment. Then the ball is replaced, possibly along with other balls of the same color or another color, depending on the response of the subject to the initial treatment [10].

With this design, the allocation probabilities depend not only on the previous treatment assignments but also on the responses to those treatment assignments; this is the basis for calling such designs response adaptive, so as to distinguish them from covariate-adaptive designs. Perhaps the most well-known actual trial that used a response-adaptive randomization procedure was the Extra Corporeal Membrane Oxygenation (ECMO) Trial [1]. ECMO is a surgical procedure that had been used for infants with respiratory failure who were dying and were unresponsive to conventional treatment of ventilation and drugs. Data existed to suggest that the ECMO treatment was safe and effective, but no randomized controlled trials had confirmed this. Owing to prior data and beliefs, the ECMO investigators were reluctant to use equal allocation. In this case, response-adaptive randomization is a practical procedure, and so it was used.

The investigators chose the randomized play-the-winner RPW(1,1) rule for the trial. This means that after a ball is chosen from the urn and replaced, one additional ball is added to the urn. This additional ball is of the same color as the previously chosen ball if the outcome is a response (survival, in this case). Otherwise, it is of the opposite color. As it turns out, the first patient was randomized to the ECMO treatment and survived, so now ECMO had two balls to only one conventional ball. The second patient was randomized to conventional therapy, and he died. The urn composition then had three ECMO balls and one control ball. The remaining 10 patients were all randomized to ECMO, and all survived. The trial then stopped with 12 total patients, in accordance with a prespecified stopping rule.

At this point, there was quite a bit of controversy regarding the validity of the trial, and whether it was truly a controlled trial (since only one patient received conventional therapy). Comparisons between the two treatments were questioned because they were based on a sample of size 12, again, with only one subject in one of the treatment groups. In fact, depending on how the data were analyzed, the P value could range from 0.001 (an analysis that assumes complete randomization and ignores the response-adaptive randomization; [9]) to 0.620 (a permutation test that conditions on the observed sequences of responses; [2]) (see Permutation Based Inference).

Two important lessons can be learned from the ECMO Trial. First, it is important to start with more than one ball corresponding to each treatment in the urn. It can be shown that starting out with only one ball of each treatment in the urn leads to instability with the randomized play-the-winner rule. Second, a minimum sample size should be specified to avoid the small sample size found in ECMO. It is also possible to build in this requirement by starting the trial as a nonadaptively randomized trial, until a minimum number of patients are recruited to each treatment group. The results of an interim analysis at this point can determine the initial constitution of the urn, which can be used for subsequent allocations, and updated accordingly. The allocation probability will then eventually favor the treatment with fewer failures or more successes, and the proportion of allocations to the better arm will converge to one.

References

[1] Bartlett, R.H., Roloff, D.W., Cornell, R.G., Andrews, A.F., Dillon, P.W. & Zwischenberger, J.B. (1985). Extracorporeal circulation in neonatal respiratory failure: a prospective randomized study, Pediatrics 76, 479–487.
[2] Begg, C.B. (1990). On inferences from Wei's biased coin design for clinical trials, Biometrika 77, 467–484.
[3] Berger, V.W. & Christophi, C.A. (2003). Randomization technique, allocation concealment, masking, and susceptibility of trials to selection bias, Journal of Modern Applied Statistical Methods 2(1), 80–86.
[4] Berger, V.W., Ivanova, A. & Deloria-Knoll, M. (2003). Enhancing allocation concealment through less restrictive randomization procedures, Statistics in Medicine 22(19), 3017–3028.
[5] Berry, D.A. & Fristedt, B. (1985). Bandit Problems: Sequential Allocation of Experiments, Chapman & Hall, London.
[6] Rosenberger, W.F. & Lachin, J.M. (2002). Randomization in Clinical Trials, John Wiley & Sons, New York.
[7] Taves, D.R. (1974). Minimization: a new method of assigning patients to treatment and control groups, Clinical Pharmacology & Therapeutics 15, 443–453.
[8] Therneau, T.M. (1993). How many stratification factors are too many to use in a randomization plan?, Controlled Clinical Trials 14(2), 98–108.
[9] Wei, L.J. (1988). Exact two-sample permutation tests based on the randomized play-the-winner rule, Biometrika 75, 603–606.
[10] Wei, L.J. & Durham, S.D. (1978). The randomized play-the-winner rule in medical trials, Journal of the American Statistical Association 73, 840–843.
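Both allocation rules described above are easy to simulate. The sketch below is an illustration, not code from the article: the function names, the arm labels, and the success probabilities are mine. The first function implements the stochastic minimization rule (add one to the absolute value of each provisional imbalance and use the ratios as probabilities); the second simulates an RPW(1,1)-style urn in which each success adds a same-color ball and each failure adds an opposite-color ball.

```python
import random

def prob_assign_a(imbalance_if_a, imbalance_if_b):
    """Stochastic minimization: probability of assigning the subject to A.

    One is added to the absolute value of each provisional imbalance, and
    the ratios are used as probabilities, so the group whose choice leads
    to the smaller imbalance is favored.
    """
    weight_a = abs(imbalance_if_b) + 1  # A is weighted by B's imbalance
    weight_b = abs(imbalance_if_a) + 1
    return weight_a / (weight_a + weight_b)

def rpw_trial(n_patients, p_success, start=1, seed=2024):
    """Simulate an RPW(1,1)-style trial.

    The urn starts with `start` balls per arm; after each draw the ball is
    replaced and one ball is added: same color on a response (success),
    opposite color on a failure.
    """
    rng = random.Random(seed)
    urn = ["ECMO"] * start + ["CONV"] * start
    assignments = []
    for _ in range(n_patients):
        arm = rng.choice(urn)
        assignments.append(arm)
        success = rng.random() < p_success[arm]  # hypothetical response rates
        other = "CONV" if arm == "ECMO" else "ECMO"
        urn.append(arm if success else other)
    return assignments, urn

# Worked example above: joint imbalances 0 (for A) and -2 (for B).
print(prob_assign_a(0, -2))  # 0.75
```

With a large success gap between the two arms, most simulated assignments drift toward the better arm, mirroring the ECMO experience.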
Figure 1 Adaptive cluster sampling and its result (From Thompson, S.K. (1990). Adaptive cluster sampling, Journal of the American Statistical Association 85, 1050–1059 [6])

Figure 1 [6]. There are 400 square units. The following steps are carried out in the sampling procedure.

1. An initial random sample of 10 units is shown in Figure 1(a).
2. In adaptive sampling, we need to define a neighborhood for a sampling unit. A neighborhood can be decided by a prespecified and nonadaptive rule. In this case, the neighborhood of a unit is its set of adjacent units (left, right, top, and bottom).
3. We need to specify a criterion for searching a neighbor. In this case, once one or more objects are observed in a selected unit, its neighborhood is added to the sample.
4. Repeat step 3 for each neighbor unit until no object is observed. In this case, the sample consists of 45 units. See Figure 1(b).

Stratified adaptive cluster sampling (see Stratification) is an extension of the adaptive cluster approach. On the basis of prior information about the population or simple proximity of the units, units that are thought to be similar to each other are grouped into strata. Following an initial stratified sample, additional units are added to the sample from the neighborhood of any selected unit when it satisfies the criterion. If additional units are added to the sample where the high positive identifications are observed, then the sample mean will overestimate the population mean. Unbiased estimators can be obtained by making use of new observations in addition to the observations initially selected. Thompson [8] proposed several types of estimators that are unbiased for the population mean or total. Some examples are estimators based on expected numbers of initial intersections, estimators based on initial intersection probabilities, and modified estimators based on the Rao–Blackwell method.

Another type of adaptive sampling is the design with primary and secondary units. Systematic adaptive cluster sampling and strip adaptive cluster sampling belong to this type. For both sampling schemes, the initial design could be systematic sampling or strip sampling. That is, the initial design is selected in terms of primary units, while subsequent sampling is in terms of secondary units. Conventional estimators of the population mean or total are biased with such a procedure, so Thompson [7] developed unbiased estimators, such as estimators based on partial selection probabilities and estimators based on partial inclusion probabilities. Thompson [7] has shown that by using a point pattern representing locations of individuals or objects in a spatially aggregated population, the adaptive design can be substantially more efficient than its conventional counterparts.

Commonly, the criterion for additional sampling is a fixed and prespecified rule. In some surveys, however, it is difficult to decide on the fixed criterion ahead of time. In such cases, the criterion could be based on the observed sample values. Adaptive cluster sampling based on order statistics is particularly appropriate for some situations in which the investigator wishes to search for high values of the variable of interest in addition to estimating the overall mean or total. For example, the investigator may want to find the pollution hot spots. Adaptive cluster sampling based on order statistics is apt to increase the probability of observing units with high values, while at the same time allowing for unbiased estimation of the population mean or total. Thompson has shown that these estimators can be improved by using the Rao–Blackwell method [9].

Thompson and Seber [11] proposed the idea of detectability in adaptive sampling. Imperfect detectability is a source of nonsampling error in natural surveys and human population surveys. This is because even if a unit is included in the survey, it is possible that not all of the objects can be observed. Examples are a vessel survey of whales and a survey of homeless people. To estimate the population total in a survey with imperfect detectability, both the sampling design and the detection probabilities must be taken into account. If imperfect detectability is not taken into account, then it will lead to underestimates of the population total. In the most general case, the values of the variable of interest are divided by the detection probability for the observed object, and then estimation methods without detectability problems are used.

Finally, regardless of the design on which the sampling is obtained, optimal sampling strategies should be considered. Bias and mean-square errors are usually measured, which lead to reliable results.

References

[2] Félix-Medina, M.H. & Thompson, S.K. (1999). Adaptive cluster double sampling, in Proceedings of the Survey Research Section, American Statistical Association, Alexandria, VA.
[3] Gasaway, W.C., DuBois, S.D., Reed, D.J. & Harbo, S.J. (1986). Estimating moose population parameters from aerial surveys, Biological Papers of the University of Alaska (Institute of Arctic Biology) Number 22, University of Alaska, Fairbanks.
[4] Roesch Jr, F.A. (1993). Adaptive cluster sampling for forest inventories, Forest Science 39, 655–669.
[5] Salehi, M.M. & Seber, G.A.F. (1997). Two-stage adaptive cluster sampling, Biometrics 53(3), 959–970.
[6] Thompson, S.K. (1990). Adaptive cluster sampling, Journal of the American Statistical Association 85, 1050–1059.
[7] Thompson, S.K. (1991a). Adaptive cluster sampling: designs with primary and secondary units, Biometrics 47(3), 1103–1115.
[8] Thompson, S.K. (1991b). Stratified adaptive cluster sampling, Biometrika 78(2), 389–397.
[9] Thompson, S.K. (1996). Adaptive cluster sampling based on order statistics, Environmetrics 7, 123–133.
[10] Thompson, S.K. (2002). Sampling, 2nd Edition, John Wiley & Sons, New York.
[11] Thompson, S.K. & Seber, G.A.F. (1994). Detectability in conventional and adaptive sampling, Biometrics 50(3), 712–724.
[12] Thompson, S.K. & Seber, G.A.F. (2002). Adaptive Sampling, Wiley, New York.

(See also Survey Sampling Procedures)
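The neighborhood-growing procedure in steps 1 to 4 above can be sketched as a small grid search. This is my own illustration, not code from the article: the grid, the threshold, and the function name are assumptions made for the example.

```python
def adaptive_cluster_sample(grid, initial_units, threshold=1):
    """Adaptive cluster sampling on a rectangular grid.

    grid[r][c] holds the number of objects in unit (r, c).  Starting from
    the initial units, whenever a sampled unit contains at least
    `threshold` objects, its neighborhood (left, right, top, and bottom
    neighbors) is added to the sample, and the same rule is then applied
    to the newly added units.
    """
    n_rows, n_cols = len(grid), len(grid[0])
    sample = set()
    frontier = list(initial_units)
    while frontier:
        r, c = frontier.pop()
        if (r, c) in sample or not (0 <= r < n_rows and 0 <= c < n_cols):
            continue  # already sampled, or off the grid
        sample.add((r, c))
        if grid[r][c] >= threshold:  # criterion met: search the neighborhood
            frontier.extend([(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)])
    return sample

# A small cluster at units (1, 1) and (1, 2) is grown from one initial unit;
# empty neighbors are included in the sample but are not expanded further.
grid = [[0, 0, 0, 0],
        [0, 2, 3, 0],
        [0, 0, 0, 0]]
print(sorted(adaptive_cluster_sample(grid, [(1, 1)])))
```

The final sample consists of the two occupied units plus their ring of empty neighbors, which is exactly the behavior that makes conventional estimators biased and motivates the unbiased estimators discussed above.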
can be embedded in (n − 2)-dimensional Euclidean space, where n is the number of objects. Cailliez [1] derived a formula for the smallest c for which this embedding is possible.

In the case of fallible data, a different formulation is required. Torgerson argued:

This means that with fallible data the condition that B* be positive semidefinite as a criterion for the points' existence in real space is not to be taken too seriously. What we would like to obtain is a B*-matrix whose latent roots consist of
1. A few large positive values (the true dimensions of the system), and
2. The remaining values small and distributed about zero (the error dimensions).
It may be that for fallible data we are asking the wrong question. Consider the question, For what value of c will the points be most nearly (in a least-squares sense) in a space of a given dimensionality?

Torgerson's [9] question was posed by de Leeuw and Heiser [3] as the problem of finding the symmetric positive semidefinite matrix of rank p that best approximates κ(Δ(c) ∗ Δ(c)) in a least-squares sense, where κ(·) denotes the double-centering transformation that converts squared dissimilarities to scalar products and ∗ denotes the elementwise product. This problem is equivalent to minimizing

σ(c) = Σ_{i=1}^{p} [max(λ_i(c), 0) − λ_i(c)]² + Σ_{i=p+1}^{n} λ_i(c)²,   (1)

where λ_1(c) ≥ ··· ≥ λ_n(c) are the eigenvalues of κ(Δ(c) ∗ Δ(c)). The objective function σ may have nonglobal minimizers. However, unless n is very large, modern computers can quickly graph σ(·), so that the basin containing the global minimizer can be identified by visual inspection. The global minimizer can then be found by a unidimensional search algorithm.

Other Formulations

In a widely cited article, Saito [8] proposed choosing c to maximize a normalized index of fit,

P(c) = Σ_{i=1}^{p} λ_i²(c) / Σ_{i=1}^{n} λ_i²(c).   (2)

Saito assumed that λ_p(c) > 0, which implies that [max(λ_i(c), 0) − λ_i(c)]² = 0 for i = 1, . . . , p. One can then write

P(c) = 1 − Σ_{i=p+1}^{n} λ_i²(c) / Σ_{i=1}^{n} λ_i²(c) = 1 − σ(c)/τ(c),   (3)

where τ(c) = Σ_{i=1}^{n} λ_i²(c). Hence, Saito's formulation is equivalent to minimizing σ(c)/τ(c), and it is evident that his formulation encourages choices of c for which τ(c) is large. Why one should prefer such choices is not so clear. Trosset, Baggerly, and Pearl [10] concluded that Saito's criterion typically results in a larger additive constant than would be obtained using the classical formulation of Torgerson [9] and de Leeuw and Heiser [3].

A comprehensive formulation of the additive constant problem is obtained by introducing a loss function, σ, that measures the discrepancy between a set of p-dimensional Euclidean distances and a set of dissimilarities. One then determines both the additive constant and the graphical representation of the data by finding a pair (c, D) that minimizes σ(D, Δ(c)). The classical formulation's loss function is the squared error that results from approximating κ(Δ(c) ∗ Δ(c)) with κ(D ∗ D). This loss function is sometimes called the strain criterion. In contrast, Cooper's [2] loss function was Kruskal's [5] raw stress criterion, the squared error that results from approximating Δ(c) with D. Although the raw stress criterion is arguably more intuitive than the strain criterion, Cooper's formulation cannot be reduced to a unidimensional optimization problem.

References

[1] Cailliez, F. (1983). The analytical solution of the additive constant problem, Psychometrika 48, 305–308.
[2] Cooper, L.G. (1972). A new solution to the additive constant problem in metric multidimensional scaling, Psychometrika 37, 311–322.
[3] de Leeuw, J. & Heiser, W. (1982). Theory of multidimensional scaling, in Handbook of Statistics, Vol. 2, P.R. Krishnaiah & L.N. Kanal, eds, North Holland, Amsterdam, pp. 285–316, Chapter 13.
[4] Klingberg, F.L. (1941). Studies in measurement of the relations among sovereign states, Psychometrika 6, 335–352.
[5] Kruskal, J.B. (1964). Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothesis, Psychometrika 29, 1–27.
[6] Messick, S.J. & Abelson, R.P. (1956). The additive constant problem in multidimensional scaling, Psychometrika 21, 1–15.
[7] Richardson, M.W. (1938). Multidimensional psychophysics, Psychological Bulletin 35, 659–660; abstract of presentation at the forty-sixth annual meeting of the American Psychological Association, Washington, D.C., September 7–10, 1938.
[8] Saito, T. (1978). The problem of the additive constant and eigenvalues in metric multidimensional scaling, Psychometrika 43, 193–201.
[9] Torgerson, W.S. (1952). Multidimensional scaling: I. Theory and method, Psychometrika 17, 401–419.
[10] Trosset, M.W., Baggerly, K.A. & Pearl, K. (1996). Another look at the additive constant problem in multidimensional scaling, Technical Report 96-7, Department of Statistics–MS 138, Rice University, Houston.

(See also Bradley–Terry Model; Multidimensional Unfolding)

MICHAEL W. TROSSET
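The graph-then-search strategy for minimizing σ(·) can be illustrated numerically. The sketch below is my own illustration: the double-centering transformation is written out explicitly, and the three-point example is invented to show that σ(c) vanishes at a constant for which the shifted dissimilarities embed exactly in the target dimension.

```python
import numpy as np

def sigma(c, Delta, p):
    """Badness of fit sigma(c) of the additive constant c, as in (1).

    Adds c to every off-diagonal dissimilarity, double-centers the matrix
    of squared dissimilarities, and penalizes negative values among the
    first p eigenvalues together with all eigenvalues beyond the p-th.
    """
    n = Delta.shape[0]
    D = Delta + c * (1.0 - np.eye(n))           # shift off-diagonal entries
    J = np.eye(n) - np.ones((n, n)) / n         # centering matrix
    B = -0.5 * J @ (D * D) @ J                  # double-centered squared dissimilarities
    lam = np.sort(np.linalg.eigvalsh(B))[::-1]  # eigenvalues, descending
    return float(np.sum((np.maximum(lam[:p], 0.0) - lam[:p]) ** 2)
                 + np.sum(lam[p:] ** 2))

# Three collinear points (coordinates 0, 1, 3): for p = 1 the fit is
# already perfect at c = 0, and adding a positive constant degrades it.
Delta = np.array([[0.0, 1.0, 3.0],
                  [1.0, 0.0, 2.0],
                  [3.0, 2.0, 0.0]])
values = {c: sigma(c, Delta, p=1) for c in (0.0, 0.5, 1.0)}
```

Graphing `sigma` over a grid of candidate constants identifies the basin containing the global minimizer, which can then be refined with a one-dimensional search, exactly as described above.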
Additive Genetic Variance
DANIELLE POSTHUMA
Volume 1, pp. 18–22
Figure 1 Worked example of genotypic effects, average effects, breeding values, and genetic variation. Assume body height is determined by a single gene with two alleles A1 and A2, and frequencies p = 0.6, q = 0.4. Body height differs per genotype: A2A2 carriers are 167 cm tall, A1A2 carriers are 175 cm tall, and A1A1 carriers are 191 cm tall. Half the difference between the heights of the two homozygotes is a, which is 12 cm. The midpoint of the two homozygotes is 179 cm, which is also the intercept of body height within the population; that is, subtracting 179 from the three genotypic means scales the midpoint to zero. The deviation of the heterozygote from the midpoint is d = 175 − 179 = −4 cm. The mean effect of this gene on the population mean is thus 12(0.6 − 0.4) + 2 × 0.6 × 0.4 × (−4) = 0.48 cm. To calculate the average effect of allele A1 (α1), we sum the product of the conditional frequencies and genotypic values of the two possible genotypes including the A1 allele. The two genotypes are A1A1 and A1A2, with genotypic values 12 and −4. Given one A1 allele, the frequency of A1A1 is 0.6 and of A1A2 is 0.4. Thus, 12 × 0.6 + (−4) × 0.4 = 5.6. We need to subtract the mean effect of this gene (0.48) from 5.6 to get the average effect of the A1 allele (α1): 5.6 − 0.48 = 5.12. Similarly, the average effect of the A2 allele (α2) can be shown to equal −7.68. The breeding value of A1A1 carriers is the sum of the average effects of the two A1 alleles, which is 5.12 + 5.12 = 10.24. Similarly, for A1A2 carriers this is 5.12 − 7.68 = −2.56 and for A2A2 carriers this is −7.68 − 7.68 = −15.36. The genetic variance (VG) related to this gene is 82.33, where VA is 78.64 and VD is 3.69
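The arithmetic of the worked example can be verified in a few lines. This is an illustrative check, not code from the chapter; the variable names are mine, and d = −4 follows the heterozygote deviation computed above.

```python
p, q, a, d = 0.6, 0.4, 12.0, -4.0       # allele frequencies and genotypic values

mu = a * (p - q) + 2 * p * q * d         # mean effect of the gene: 0.48 cm
alpha = a + d * (q - p)                  # average effect of gene substitution
alpha1, alpha2 = q * alpha, -p * alpha   # average effects of alleles A1 and A2
breeding = {"A1A1": 2 * alpha1,          # breeding values: sums of average effects
            "A1A2": alpha1 + alpha2,
            "A2A2": 2 * alpha2}
VA = 2 * p * q * alpha ** 2              # additive genetic variance
VD = (2 * p * q * d) ** 2                # dominance variance
print(mu, alpha1, alpha2, VA + VD)
```

The printed values reproduce the figure: a mean effect of 0.48, average effects 5.12 and −7.68, and a total genetic variance of about 82.33 split into 78.64 (VA) and 3.69 (VD).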
[Upper panels: frequency by trait value for "One diallelic gene" and "Two diallelic genes"; lower panel: number of cases by height (142.5–205.0 cm)]

Figure 2 The combined discrete effects of many single genes result in continuous variation in the population. Based on 8087 adult subjects from the Dutch Twin Registry (http://www.tweelingenregister.org)
parent and a blue-eyed parent is of course a consequence of the fact that parents transmit alleles to their offspring and not their genotypes. Therefore, parents cannot directly transmit their genotypic values a, d, and −a to their offspring. To quantify the transmission of genetic effects from parents to offspring, and ultimately to decompose the observed variance in the offspring generation into genetic and environmental components, the concepts average effect and breeding value have been introduced [3].

Average effects are a function of genotypic values and allele frequencies within a population. The average effect of an allele is defined as ".. the mean deviation from the population mean of individuals which received that allele from one parent, the allele received from the other parent having come at random from the population" [3]. To calculate the average effects, denoted by α1 and α2, of alleles A1 and A2 respectively, we need to determine the frequency of the A1 (or A2) alleles in the genotypes of the offspring coming from a single parent. Again, we assume a single locus system with two alleles. If there is random mating between gametes carrying the A1 allele and gametes from the population, the frequency with which the A1 gamete unites with another gamete containing A1 (producing an A1A1 genotype in the offspring) equals p, and the frequency with which the gamete containing the A1 allele unites with a gamete carrying A2 (producing an A1A2 genotype in the offspring) is q. The genotypic value of the genotype A1A1 in the offspring is a and the genotypic value of A1A2 in the offspring is d, as defined earlier. The mean value of the genotypes that can be produced by a gamete carrying the A1 allele equals the sum of the products of the frequency and the genotypic value. Or, in other terms, it is pa + qd. The average genetic effect of allele A1 (α1) equals the deviation of the mean value of all possible genotypes that can be produced by gametes carrying the A1 allele from the population mean. The population mean has been derived earlier as a(p − q) + 2pqd (1). The average effect of allele A1 is thus: α1 = pa + qd − [a(p − q) + 2pqd] = q[a + d(q − p)]. Similarly, the average effect of the A2 allele is α2 = pd − qa − [a(p − q) + 2pqd] = −p[a + d(q − p)]. α1 − α2 is known as α, or the average effect of gene substitution. If there is no dominance, α1 = qa and α2 = −pa, and the average effect of gene substitution thus equals the genotypic value a (α = α1 − α2 = qa + pa = (q + p)a = a).

The breeding value of an individual equals the sum of the average effects of gene substitution of an individual's alleles, and is therefore directly related to the mean genetic value of its offspring. Thus, the breeding value for an individual with genotype A1A1 is 2α1 (or 2qα), for individuals with genotype A1A2 it is α1 + α2 (or (q − p)α), and for individuals with genotype A2A2 it is 2α2 (or −2pα).

The breeding value is usually referred to as the additive effect of an allele (note that it includes both the values a and d), and differences between the genotypic effects (in terms of a, d, and −a, for genotypes A1A1, A1A2, A2A2 respectively) and the breeding values (2qα, (q − p)α, −2pα, for genotypes A1A1, A1A2, A2A2 respectively) reflect the presence of dominance. Obviously, breeding values are of utmost importance to animal and crop breeders in determining which crossing will produce offspring with the highest milk yield, the fastest race horse, or the largest tomatoes.

Genetic Variance

Although until now we have ignored environmental effects, quantitative geneticists assume that populationwise the phenotype (P) is a function of both genetic (G) and environmental effects (E): P = G + E, where E refers to the environmental deviations, which have an expected average value of zero. By excluding the term G×E, we assume no interaction between the genetic effects and the environmental effects (see Gene-Environment Interaction). If we also assume there is no covariance between G and E, the variance of the phenotype is given by VP = VG + VE, where VG represents the variance of the genotypic values of all contributing loci including both additive and nonadditive components, and VE represents the variance of the environmental deviations. Statistically, the total genetic variance (VG) can be obtained by applying the standard formula for the variance: σ² = Σ fi(xi − μ)², where fi denotes the frequency of genotype i, xi denotes the corresponding genotypic mean of that genotype, and μ denotes the population mean, as calculated in (1). Thus, VG = p²[a − (a(p − q) + 2pqd)]² + 2pq[d − (a(p − q) + 2pqd)]² + q²[−a − (a(p − q) + 2pqd)]². This can be simplified to VG = p²[2q(a − dp)]² + 2pq[a(q − p) + d(1 − 2pq)]² + q²[2p(a + dq)]², and further simplified to VG = 2pq[a + d(q − p)]² + (2pqd)² = VA + VD [3].
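The algebraic simplification just given can be checked numerically. The following sketch is my own illustration, not code from the chapter: it evaluates VG directly from the three genotypic deviations about the population mean and confirms that it equals the closed-form VA + VD decomposition.

```python
def genetic_variances(p, q, a, d):
    """Return (VG, VA, VD) for one diallelic locus.

    VG is computed directly from the three genotypic deviations about the
    population mean mu = a(p - q) + 2pqd; VA and VD use the closed forms
    VA = 2pq[a + d(q - p)]^2 and VD = (2pqd)^2.
    """
    mu = a * (p - q) + 2 * p * q * d
    VG = (p ** 2 * (a - mu) ** 2          # A1A1, frequency p^2
          + 2 * p * q * (d - mu) ** 2     # A1A2, frequency 2pq
          + q ** 2 * (-a - mu) ** 2)      # A2A2, frequency q^2
    VA = 2 * p * q * (a + d * (q - p)) ** 2
    VD = (2 * p * q * d) ** 2
    return VG, VA, VD

# Values from the worked example in Figure 1 (p = 0.6, q = 0.4, a = 12, d = -4).
VG, VA, VD = genetic_variances(0.6, 0.4, 12.0, -4.0)
```

For the worked example this returns VG of about 82.33 with VA of 78.64 and VD of 3.69, and the identity VG = VA + VD holds for any choice of p + q = 1, a, and d.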
If the phenotypic value of the heterozygous genotype lies midway between A1A1 and A2A2, the total genetic variance simplifies to 2pqa². If d is not equal to zero, the additive genetic variance component contains the effect of d. Even if a = 0, VA is usually greater than zero (except when p = q). Thus, although VA represents the variance due to the additive influences, it is not only a function of p, q, and a but also of d. Formally, VA represents the variance of the breeding values, when these are expressed in terms of deviations from the population mean. The consequences are that, except in the rare situation in which all contributing loci are diallelic with p = q and a = 0, VA is usually greater than zero. Models that decompose the phenotypic variance into components of VD, without including VA, are therefore biologically implausible. When more than one locus is involved and it is assumed that the effects of these loci are uncorrelated and there is no interaction (i.e., no epistasis), the VG's of each individual locus may be summed to obtain the total genetic variances of all loci that influence a trait [4, 5].

In most human quantitative genetic models, the observed variance of a trait is not modeled directly as a function of p, q, a, d, and environmental deviations (as all of these are usually unknown), but instead is modeled by comparing the observed resemblance between pairs of differential, known genetic relatedness, such as monozygotic and dizygotic twin pairs (see ACE Model). Ultimately, p, q, a, d, and environmental deviations are the parameters that quantitative geneticists hope to quantify.

Acknowledgments

The author wishes to thank Eco de Geus and Dorret Boomsma for reading draft versions of this chapter.

References

[1] Eiberg, H. & Mohr, J. (1987). Major genes of eye color and hair color linked to LU and SE, Clinical Genetics 31(3), 186–191.
[2] Eiberg, H. & Mohr, J. (1996). Assignment of genes coding for brown eye colour (BEY2) and brown hair colour (HCL3) on chromosome 15q, European Journal of Human Genetics 4(4), 237–241.
[3] Falconer, D.S. & Mackay, T.F.C. (1996). Introduction to Quantitative Genetics, 4th Edition, Longman Group Ltd.
[4] Fisher, R.A. (1918). The correlation between relatives on the supposition of Mendelian inheritance, Transactions of the Royal Society of Edinburgh 52, 399–433.
[5] Mather, K. (1949). Biometrical Genetics, Methuen, London.
[6] Mather, K. & Jinks, J.L. (1982). Biometrical Genetics, Chapman & Hall, New York.

DANIELLE POSTHUMA
Additive Models
ROBERT J. VANDENBERG
Volume 1, pp. 22–24
Figure 1 Additive versus interactive effects in regression contexts: (a) regression surface Ŷ = 0.2X + 0.6Z + 2, with parallel simple regression lines of Y on X at Zlow = 2, Zmean = 5, and Zhigh = 8 (B1 = 0.2); (b) regression surface Ŷ = 0.2X + 0.6Z + 0.4XZ + 2, plotted at the same three values of Z. Used with permission: Figure 7.1.1, p. 259 of Cohen, J., Cohen, P., West, S.G. & Aiken, L.S. (2003). Applied Multiple Regression/Correlation Analysis for the Behavioral Sciences, 3rd Edition, Lawrence Erlbaum, Mahwah
10) along each of three values of Z will produce the darkened lines. These lines are parallel, meaning that the regression of Y on X is constant over the values of Z. One may demonstrate this as well by holding values of X to two, five, and eight, and substituting all of the values of Z into (2). The only aspect of Figure 1(a) that varies is the height of the regression lines. There is a general upward displacement of the lines as Z increases.

Figure 1(b) is offered as a contrast. In this case, X and Z are presumed to have an interaction or joint effect that is above any additive effect of the variables. This is represented generally by

Ŷ = b1X + b2Z + b3XZ + b0   (3)

and specifically for purposes of the illustration by

Ŷ = 0.2X + 0.6Z + 0.4XZ + 2.   (4)

Applying the same exercise used in (2) above would result in Figure 1(b). The point is that the regression of Y on X is not constant over the values of Z (and neither would the regression of Y on Z be at values of X), but depends very much on the value of Z at which the regression of Y on X is calculated. This conditional effect is illustrated in Figure 1(b) by the angle of the plane representing the predicted values of Y at the joint X and Z values.

As noted above, additive models are also considered in the context of experimental designs, but much less frequently. The issue is exactly the same as in multiple regression, and is illustrated nicely by Charles Schmidt's graph, which is reproduced in Figure 2. The major point of Figure 2 is that when there is no interaction between the independent variables (A and B in the figure), the main effects (additive effects) of each independent variable may be
Additivity assumption: Rij = wa ai + wb bj (example for a 2 × 2 design with levels A1, A2 and B1, B2). Nonadditivity assumption: Rij = wa ai + wb bj + f(ai, bj). Under nonadditivity, each cell mean carries the extra term f(ai, bj), so row and column contrasts such as wa(a1 − a2) + [f(a1, bj) − f(a2, bj)] and wb(b1 − b2) + [f(ai, b1) − f(ai, b2)] no longer reflect the main effects alone.

Figure 2 Additive versus interactive effects in experimental designs. Used with permission: Professor Charles F. Schmidt, Rutgers University, http://www.rci.rutgers.edu/~cfs/305_html/MentalChron/MChronAdd.html
independently determined (shown in the top half of Figure 2). If, however, there is an interaction between the independent variables, then this joint effect needs to be accounted for in the analysis (illustrated by the gray components in the bottom half of Figure 2).

Reference

Cohen, J., Cohen, P., West, S.G. & Aiken, L.S. (2003). Applied Multiple Regression/Correlation Analysis for the Behavioral Sciences, 3rd Edition, Lawrence Erlbaum, Mahwah.
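The contrast between the additive surface in (2) and the interactive surface in (4) can be made concrete with a few lines of code. This is an illustrative sketch (the helper names are mine): the simple slope of Y on X is constant across Z in the additive model, but grows with Z once the XZ term is present.

```python
def yhat_additive(x, z):
    return 0.2 * x + 0.6 * z + 2                 # equation (2): no interaction

def yhat_interactive(x, z):
    return 0.2 * x + 0.6 * z + 0.4 * x * z + 2   # equation (4): XZ term added

def slope_on_x(f, z):
    """Simple slope of Y on X at a fixed value of Z (rise over a unit run)."""
    return f(1, z) - f(0, z)

# Evaluate at the three Z values used in Figure 1: Zlow = 2, Zmean = 5, Zhigh = 8.
additive = [slope_on_x(yhat_additive, z) for z in (2, 5, 8)]      # parallel lines
interactive = [slope_on_x(yhat_interactive, z) for z in (2, 5, 8)] # fanning lines
```

The additive slopes are all 0.2, reproducing the parallel lines of Figure 1(a), whereas the interactive slopes are 0.2 + 0.4Z (1.0, 2.2, and 3.4), reproducing the conditional effect in Figure 1(b).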
Figure 1 An additive tree representing the percentage of shared cognates between each pair of languages, for sample data on seven Indo-European languages (leaf labels include Icelandic, Danish, and Greek)

Figure 2 Two additive trees on four objects, displayed in unrooted form (panels (a) and (b))
include those described in [2, 6] and [7], the last method using a maximum-likelihood approach. A public-domain software program for fitting additive trees, GTREE [3], may be obtained at several sites, including http://www.netlib.org/mds/ or http://www.columbia.edu/jec34/. Routines implementing the algorithm of Hubert and Arabie [5] are also available (see http://ljhoff.psych.uiuc.edu/cda toolbox/cda toolbox manual.pdf).

References

[1] Atkinson, Q.D. & Gray, R.D. (2004). Are accurate dates an intractable problem for historical linguistics? in Mapping our Ancestry: Phylogenetic Methods in Anthropology and Prehistory, C. Lipo, M. O'Brien, S. Shennan & M. Collard, eds, Aldine de Gruyter, New York.
[2] Corter, J.E. (1982). ADDTREE/P: a PASCAL program for fitting additive trees based on Sattath and Tversky's ADDTREE algorithm, Behavior Research Methods & Instrumentation 14, 353–354.
[3] Corter, J.E. (1998). An efficient metric combinatorial algorithm for fitting additive trees, Multivariate Behavioral Research 33, 249–271.
[4] De Soete, G. (1983). A least squares algorithm for fitting additive trees to proximity data, Psychometrika 48, 621–626.
[5] Hubert, L. & Arabie, P. (1995). Iterative projection strategies for the least squares fitting of tree structures to proximity data, British Journal of Mathematical & Statistical Psychology 48, 281–317.
[6] Sattath, S. & Tversky, A. (1977). Additive similarity trees, Psychometrika 42, 319–345.
[7] Wedel, M. & Bijmolt, T.H.A. (2000). Mixed tree and spatial representations of dissimilarity judgments, Journal of Classification 17, 243–271.

(See also Hierarchical Clustering; Multidimensional Scaling)

JAMES E. CORTER
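Distances fitted by an additive tree are characterized by the four-point condition; a minimal check on distances from a small hypothetical tree (not the data discussed in this entry):

```python
from itertools import combinations

# Four-point condition: for every quadruple, the two largest of the three
# pairwise sums d(i,j)+d(k,l), d(i,k)+d(j,l), d(i,l)+d(j,k) must be equal.
def four_point_ok(d, objs, tol=1e-9):
    ok = True
    for i, j, k, l in combinations(objs, 4):
        sums = sorted([d[i, j] + d[k, l],
                       d[i, k] + d[j, l],
                       d[i, l] + d[j, k]])
        ok = ok and abs(sums[2] - sums[1]) <= tol
    return ok

# Distances from a hypothetical tree: leaves a, b attach to one internal
# node, c, d to another, with an internal edge of length 3 between them.
pairs = {("a", "b"): 3, ("a", "c"): 5, ("a", "d"): 6,
         ("b", "c"): 6, ("b", "d"): 7, ("c", "d"): 3}
d = {}
for (i, j), v in pairs.items():
    d[i, j] = d[j, i] = v

print(four_point_ok(d, "abcd"))  # → True
```

Fitting algorithms such as ADDTREE search for the tree metric satisfying this condition that is closest to the observed proximities.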
Additivity Tests

GEORGE KARABATSOS

Volume 1, pp. 25–29
are m exchangeable sequences of Nk observations of a dependent variable, where Y is either a real-valued scalar or vector, and each sequence arises from some experimental condition k ∈ {1, . . . , m}. For example, m = IJ conditions may be considered in a two-factor experimental design. According to de Finetti's representation theorem (e.g., [1]), a Bayesian model describes the joint probability of the m exchangeable sequences.

after combining the effect of level a ∈ A1 = {a, b, c, . . .} from one independent variable, and the effects of level x ∈ A2 = {x, y, z, . . .} from another independent variable. According to the theory of additive conjoint measurement [20], the effects of two independent variables are additive if and only if

ax ≽ by implies f1(a) + f2(x) ≥ f1(b) + f2(y).   (3)
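A necessary consequence of the additive representation in (3) is independence: the order of two levels of one factor cannot reverse across levels of the other. A minimal check on a hypothetical response table:

```python
from itertools import product

# R holds hypothetical cell responses R[(a, x)] for a 2 x 2 design.
R = {("a", "x"): 1.0, ("a", "y"): 2.0,
     ("b", "x"): 1.5, ("b", "y"): 2.5}

def independent(R, A1, A2):
    # For each pair of levels of the first factor, the direction of the
    # comparison must be the same at every level of the second factor.
    for a, b in product(A1, A1):
        signs = {R[(a, x)] >= R[(b, x)] for x in A2}
        if len(signs) > 1:  # the order of a vs b flips across levels of A2
            return False
    return True

print(independent(R, ["a", "b"], ["x", "y"]))  # → True for this table
```

Independence is only one of the cancellation axioms; the tests discussed in this entry evaluate the full hierarchy.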
The relation ≽ is a weak order defined over all elements of A1 × A2. In this context, the relevant constraint is the sum-constraint Σ_{rk ∈ Vk} p(rk) ≥ C for each experimental condition k, with some chosen threshold C ∈ [1/2, 1], where Vk is the set of response patterns that do not violate a given cancellation axiom.

Karabatsos [13] proposed a slightly different multinomial model, as a basis for a Bayesian bootstrap [32] approach to isotonic (inequality-constrained) regression (see Bootstrap Inference). This procedure can be used to estimate the nonparametric posterior distribution of a discrete- or continuous-valued dependent variable Y, subject to the order-constraints of the set of all possible linear orders (for example, Y1 ≤ Y2 ≤ · · · ≤ Yk ≤ · · · ≤ Ym) that satisfy the entire hierarchy of cancellation axioms. Here, a test of additivity is achieved by evaluating the fit of the observed data {Y1k, . . . , Ynk, . . . , YNk; k = 1, . . . , m} to the corresponding order-constrained posterior distribution of Y.

Earlier, as a non-Bayesian approach to additivity testing, Macdonald [21] proposed isotonic regression to determine the least-squares maximum-likelihood estimate (MLE) of the dependent variable {Yk; k = 1, . . . , m}, subject to a linear order-constraint (e.g., Y1 ≤ Y2 ≤ · · · ≤ Yk ≤ · · · ≤ Ym) that satisfies a given cancellation axiom (see Least Squares Estimation). He advocated testing each cancellation axiom separately, by evaluating the fit of the observed data {Y1k, . . . , Ynk, . . . , YNk; k = 1, . . . , m} to the MLE {Yk; k = 1, . . . , m} under the corresponding axiom.

Acknowledgments

Karabatsos's research is supported by National Science Foundation grant SES-0242030, program of Methodology, Measurement, and Statistics. Also, this work is supported in part by Spencer Foundation grant SG2001000020.

References

[1] Bernardo, J.M. (2002). Bayesian Theory (second reprint), Wiley, New York.
[2] Boik, R.J. (1986). Testing the rank of a matrix with applications to the analysis of interactions in ANOVA, Journal of the American Statistical Association 81, 243–248.
[3] Boik, R.J. (1989). Reduced-rank models for interaction in unequally-replicated two-way classifications, Journal of Multivariate Analysis 28, 69–87.
[4] Boik, R.J. (1990). Inference on covariance matrices under rank restrictions, Journal of Multivariate Analysis 33, 230–246.
[5] Boik, R.J. (1993a). Testing additivity in two-way classifications with no replications: the locally best invariant test, Journal of Applied Statistics 20, 41–55.
[6] Boik, R.J. (1993b). A comparison of three invariant tests of additivity in two-way classifications with no replications, Computational Statistics and Data Analysis 15, 411–424.
[7] Boik, R.J. & Marasinghe, M.G. (1989). Analysis of non-additive multiway classifications, Journal of the American Statistical Association 84, 1059–1064.
[8] Coombs, C.H. & Huang, L.C. (1970). Polynomial psychophysics of risk, Journal of Mathematical Psychology 7, 317–338.
[9] Falmagne, J.-C. (1976). Random conjoint measurement and loudness summation, Psychological Review 83, 65–79.
[10] Harter, H.L. & Lum, M.D. (1962). An interpretation and extension of Tukey's one-degree of freedom for non-additivity, Aeronautical Research Laboratory Technical Report, ARL, pp. 62–313.
[11] Johnson, D.E. & Graybill, F.A. (1972). An analysis of a two-way model with interaction and no replication, Journal of the American Statistical Association 67, 862–868.
[12] Karabatsos, G. (2001). The Rasch model, additive conjoint measurement, and new models of probabilistic measurement theory, Journal of Applied Measurement 2, 389–423.
[13] Karabatsos, G. (2004a). A Bayesian bootstrap approach to testing the axioms of additive conjoint measurement. Manuscript under review.
[14] Karabatsos, G. (2004b). The exchangeable multinomial model as an approach to testing deterministic axioms of choice and measurement. To appear, Journal of Mathematical Psychology.
[15] Karabatsos, G. & Sheu, C.-F. (2004). Order-constrained Bayes inference for dichotomous models of non-parametric item-response theory, Applied Psychological Measurement 2, 110–125.
[16] Karabatsos, G. & Ullrich, J.R. (2002). Enumerating and testing conjoint measurement models, Mathematical Social Sciences 43, 485–504.
[17] Krantz, D.H., Luce, R.D., Suppes, P. & Tversky, A. (1971). Foundations of Measurement: Additive and Polynomial Representations, Academic Press, New York.
[18] Krishnaiah, P.R. & Yochmowitz, M.G. (1980). Inference on the structure of interaction in the two-way classification model, in Handbook of Statistics, Vol. 1, P.R. Krishnaiah, ed., North Holland, Amsterdam, pp. 973–984.
[19] Levelt, W.J.M., Riemersma, J.B. & Bunt, A.A. (1972). Binaural additivity of loudness, British Journal of Mathematical and Statistical Psychology 25, 51–68.
[20] Luce, R.D. & Tukey, J.W. (1964). Additive conjoint measurement: a new type of fundamental measurement, Journal of Mathematical Psychology 1, 1–27.
[21] Macdonald, R.R. (1984). Isotonic regression analysis and additivity, in Trends in Mathematical Psychology, E. Degreef & J. Buggenhaut, eds, Elsevier Science Publishers, North Holland, pp. 239–255.
[22] Mandel, J. (1961). Non-additivity in two-way analysis of variance, Journal of the American Statistical Association 56, 878–888.
[23] Mandel, J. (1969). The partitioning of interaction in analysis of variance, Journal of Research, National Bureau of Standards B 73B, 309–328.
[24] Mandel, J. (1971). A new analysis of variance model for non-additive data, Technometrics 13, 1–18.
[25] Marasinghe, M.G. & Boik, R.J. (1993). A three-degree of freedom test of additivity in three-way classifications, Computational Statistics and Data Analysis 16, 47–61.
[26] Michell, J. (1990). An Introduction to the Logic of Psychological Measurement, Lawrence Erlbaum Associates, Hillsdale.
[27] Milliken, G.A. & Graybill, F.A. (1970). Extensions of the general linear hypothesis model, Journal of the American Statistical Association 65, 797–807.
[28] Milliken, G.A. & Johnson, D.E. (1989). Analysis of Messy Data, Vol. 2, Van Nostrand Reinhold, New York.
[29] Nygren, T.E. (1985). An examination of conditional violations of axioms for additive conjoint measurement, Applied Psychological Measurement 9, 249–264.
[30] Nygren, T.E. (1986). A two-stage algorithm for assessing violations of additivity via axiomatic and numerical conjoint analysis, Psychometrika 51, 483–491.
[31] Perline, R., Wright, B.D. & Wainer, H. (1979). The Rasch model as additive conjoint measurement, Applied Psychological Measurement 3, 237–255.
[32] Rubin, D.B. (1981). The Bayesian bootstrap, Annals of Statistics 9, 130–134.
[33] Scheffé, H. (1959). The Analysis of Variance (Sixth Printing), Wiley, New York.
[34] Searle, S.R. (1971). Linear Models, John Wiley & Sons, New York.
[35] Tukey, J.W. (1949). One degree of freedom for non-additivity, Biometrics 5, 232–242.
[36] Tukey, J.W. (1962). The future of data analysis, Annals of Mathematical Statistics 33, 1–67.
[37] Tusell, F. (1990). Testing for interaction in two-way ANOVA tables with no replication, Computational Statistics and Data Analysis 10, 29–45.
[38] Tversky, A. (1967). Additivity, utility, and subjective probability, Journal of Mathematical Psychology 4, 175–201.

GEORGE KARABATSOS
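The isotonic (order-constrained) least-squares step used in these additivity tests can be sketched with the pool-adjacent-violators algorithm; this minimal version handles a single linear order with uniform weights:

```python
# Pool-adjacent-violators algorithm (PAVA): returns the nondecreasing
# sequence closest to y in squared error.
def pava(y):
    blocks = []  # each block is [sum, count]; its fitted value is sum/count
    for v in y:
        blocks.append([v, 1])
        # merge adjacent blocks while their means violate monotonicity
        while len(blocks) > 1 and \
                blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]:
            s, n = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += n
    out = []
    for s, n in blocks:
        out.extend([s / n] * n)
    return out

print(pava([1.0, 3.0, 2.0, 4.0]))  # → [1.0, 2.5, 2.5, 4.0]
```

Testing an axiom then amounts to comparing the observed condition means with their order-constrained fit.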
Adoption Studies

MICHAEL C. NEALE

Volume 1, pp. 29–33
interest, then additional groups of relatives, such as unrelated individuals reared together, are needed to estimate it. Similar arguments may be made about across-generational sources of resemblance. Heath and Eaves [11] compared the power to detect genetic and environmental transmission across several twin-family (twins and their parents or twins and their children) adoption designs.

Methods of Analysis

Most modern adoption study data are analyzed with Structural Equation Models (SEM) [2, 17]. SEM is an extension of multiple linear regression analysis that involves two types of variable: observed variables that have been measured, and latent variables that have not. Two variables may be specified as causally related or simply correlated from unspecified effects. It is common practice to represent the variables and their relationships in a path diagram (see Path Analysis and Path Diagrams), where single-headed arrows indicate causal relationships, and double-headed arrows represent correlations. By convention, observed variables are shown as squares and latent variables are shown as circles.

Figure 1 shows the genetic and environmental transmission from biological and adoptive parents to three children. Two of the children are offspring of the biological parents (siblings reared together), while the third is adopted. This diagram may also be considered as multivariate, allowing for the joint analysis of multiple traits. Each box and circle then represents a vector of observed variables. Multivariate analyses (see Multivariate Analysis: Overview) are particularly important when studying the relationship between parental attributes and outcomes in their offspring. For example, harsh parenting may lead to psychiatric disorders. Both variables should be studied in a multivariate genetically informative design such as an adoption or twin study to distinguish between the possible direct and indirect genetic and environmental pathways.

From the rules of path analysis [22, 23] we can derive predicted covariances among the relatives in terms of the parameters of the model in Figure 1. These expectations may, in turn, be used in a structural equation modeling program such as Mx [16] (see Software for Behavioral Genetics) to estimate the parameters using maximum likelihood or some other goodness-of-fit function. Often, simpler models than the one shown will be adequate to account for a particular set of data.

A special feature of the diagram in Figure 1 is the dotted lines representing delta-paths [21].
Figure 1 Path diagram showing sources of variation and covariation between: adoptive mother, AM; adoptive father, AF; their own biological children, BC1 and BC2; a child adopted into their family, AC1; and the adopted child's biological parents, BF and BM
These represent the effects of two possible types of selection: assortative mating, in which husband and wife correlate; and selective placement, in which the adoptive and biological parents are not paired at random. The effects of these processes may be deduced from the Pearson–Aitken selection formulas [1]. These formulas are derived from linear regression under the assumptions of multivariate linearity and homoscedasticity. If we partition the variables into selected variables, XS, and unselected variables, XN, then it can be shown that changes in the covariance of XS lead to changes in covariances among XN and the cross-covariances (XS with XN). Let the original (preselection) covariance matrix of XS be A, the original covariance matrix of XN be C, and the covariance between XN and XS be B. The preselection matrix may be written

( A   B′ )
( B   C  )

If selection transforms A to D, then the new covariance matrix is given by

( D        DA⁻¹B′                  )
( BA⁻¹D    C − B(A⁻¹ − A⁻¹DA⁻¹)B′ )

Similarly, if the original means are (xS : xn) and selection modifies xS to x̃S, then the vector of means after selection is given by

[x̃S : xn + BA⁻¹(x̃S − xS)].

These formulas can be applied to the covariance structure of all the variables in Figure 1. First, the formulas are applied to derive the effects of assortative mating, and, secondly, they are applied to derive the effects of selective placement. In both cases, only the covariances are affected, not the means.
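A sketch of the Pearson–Aitken update with small hypothetical matrices (the values below are illustrative, not from the entry):

```python
import numpy as np

# Selection replaces the covariance A of the selected variables XS by D;
# covariances and means involving the unselected variables XN update as
# in the formulas above.
A = np.array([[1.0, 0.3], [0.3, 1.0]])  # preselection cov of XS
C = np.array([[1.0]])                    # preselection cov of XN
B = np.array([[0.5, 0.2]])               # cov(XN, XS)
D = 0.5 * A                              # cov of XS after selection

Ainv = np.linalg.inv(A)
cov_XN_new = C - B @ (Ainv - Ainv @ D @ Ainv) @ B.T
cross_new = B @ Ainv @ D                 # cov(XN, XS) after selection

xs = np.zeros(2)
xs_new = np.array([1.0, 0.5])            # selection shifts the XS means
xn = np.zeros(1)
xn_new = xn + B @ Ainv @ (xs_new - xs)
print(float(cov_XN_new[0, 0]) < float(C[0, 0]))  # selection shrinks var(XN)
```

Here D = 0.5A mimics variance-reducing selection, so the variance of the unselected variable is reduced as well.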
An interesting third possibility would be to control for the effects of nonrandom selection of the biological and adoptive relatives, which may well change both the means and the covariances.

Selected Samples

A common approach in adoption studies is to identify members of adoptive families who have a particular disorder, and then examine the rates of this disorder in their relatives. These rates are compared with those from control samples. Two common starting points for this type of study are (a) the adoptees (the adoptees' families method), and (b) the biological parents (the adoptees' study method). For rare disorders, this use of selected samples may be the only practical way to assess the impact of genetic and environmental factors.

One limitation of this type of method is that it focuses on one disorder, and is of limited use for examining comorbidity between disorders. This limitation is in contrast to the population-based sampling approach, where many characteristics and their covariances or comorbidity can be explored simultaneously.

A second methodological difficulty is that ascertained samples of the disordered adoptees or parents may not be representative of the population. For example, those attending a clinic may be more severe or have different risk factors than those in the general population who also meet criteria for diagnosis, but do not attend the clinic.

Genotype–Environment Interaction

The natural experiment of an adoption study provides a straightforward way to test for gene–environment interaction. In the case of a continuous phenotype, interaction may be detected with linear regression on

1. the mean of the biological parents' phenotypes (which directly estimates heritability)
2. the mean of the adoptive parents' phenotypes
3. the product of points 1 and 2.

Significance of the third term would indicate significant G × E interaction. With binary data such as psychiatric diagnoses, the rate in adoptees may be compared between subjects with biological or adoptive parents affected, versus both affected. G × E interaction has been found for alcoholism [7] and substance abuse [6].

Logistic regression is a popular method to test for genetic and environmental effects and their interaction on binary outcomes such as psychiatric diagnoses. These analyses lack the precision that structural equation modeling can bring to testing and quantifying specific hypotheses, but offer a practical method of analysis for binary data. Analysis of binary data can be difficult within the framework of SEM, requiring either very large sample sizes for asymptotic weighted least squares [4], or integration of the multivariate normal distribution (see Catalogue of Probability Density Functions) over as many dimensions as there are relatives in the pedigree, which is numerically intensive.

References

[1] Aitken, A.C. (1934). Note on selection from a multivariate normal population, Proceedings of the Edinburgh Mathematical Society, Series B 4, 106–110.
[2] Bollen, K.A. (1989). Structural Equations with Latent Variables, Wiley, New York.
[3] Bouchard Jr, T.J. & McGue, M. (1981). Familial studies of intelligence: a review, Science 212, 1055–1059.
[4] Browne, M.W. (1984). Asymptotically distribution-free methods for the analysis of covariance structures, British Journal of Mathematical and Statistical Psychology 37, 62–83.
[5] Cadoret, R.J. (1978). Psychopathology in adopted-away offspring of biologic parents with antisocial behavior, Archives of General Psychiatry 35, 176–184.
[6] Cadoret, R.J., Troughton, E., O'Gorman, T.W. & Heywood, E. (1986). An adoption study of genetic and environmental factors in drug abuse, Archives of General Psychiatry 43, 1131–1136.
[7] Cloninger, C.R., Bohman, M. & Sigvardsson, S. (1981). Inheritance of alcohol abuse: cross-fostering analysis of adopted men, Archives of General Psychiatry 38, 861–868.
[8] Cloninger, C.R., Bohman, M., Sigvardsson, S. & von Knorring, A.L. (1985). Psychopathology in adopted-out children of alcoholics: the Stockholm adoption study, in Recent Developments in Alcoholism, Vol. 3, M. Galanter, ed., Plenum Press, New York, pp. 37–51.
[9] DeFries, J.C. & Plomin, R. (1978). Behavioral genetics, Annual Review of Psychology 29, 473–515.
[10] Fuller, J.L. & Thompson, W.R. (1978). Foundations of Behavior Genetics, Mosby, St. Louis.
[11] Heath, A.C. & Eaves, L.J. (1985). Resolving the effects of phenotype and social background on mate selection, Behavior Genetics 15, 15–30.
[12] Heston, L.L. (1966). Psychiatric disorders in foster home reared children of schizophrenic mothers, British Journal of Psychiatry 112, 819–825.
[13] Kaprio, J., Koskenvuo, M. & Langinvainio, H. (1984). Finnish twins reared apart: smoking and drinking habits. Preliminary analysis of the effect of heredity and environment, Acta Geneticae Medicae et Gemellologiae 33, 425–433.
[14] Kety, S.S. (1987). The significance of genetic factors in the etiology of schizophrenia: results from the national study of adoptees in Denmark, Journal of Psychiatric Research 21, 423–429.
[15] Mendlewicz, J. & Rainer, J.D. (1977). Adoption study supporting genetic transmission in manic-depressive illness, Nature 268, 327–329.
[16] Neale, M.C. (1995). Mx: Statistical Modeling, 3rd Edition, Box 980126 MCV, Richmond, VA 23298.
[17] Neale, M.C. & Cardon, L.R. (1992). Methodology for Genetic Studies of Twins and Families, Kluwer Academic Publishers, Boston.
[18] Phillips, K. & Fulker, D.W. (1989). Quantitative genetic analysis of longitudinal trends in adoption designs with application to IQ in the Colorado adoption project, Behavior Genetics 19, 621–658.
[19] Plomin, R. & DeFries, J.C. (1990). Behavioral Genetics: A Primer, 2nd Edition, Freeman, New York.
[20] Sorensen, T.I. (1995). The genetics of obesity, Metabolism 44, 4–6.
[21] Van Eerdewegh, P. (1982). Statistical selection in multivariate systems with applications in quantitative genetics, Ph.D. thesis, Washington University.
[22] Vogler, G.P. (1985). Multivariate path analysis of familial resemblance, Genetic Epidemiology 2, 35–53.
[23] Wright, S. (1921). Correlation and causation, Journal of Agricultural Research 20, 557–585.

MICHAEL C. NEALE
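The three-term regression test for G × E interaction described in this entry can be sketched on simulated data (all coefficients below are made up; a nonzero product coefficient plays the role of the interaction):

```python
import numpy as np

# Regress adoptee outcome on (1) biological midparent score,
# (2) adoptive midparent score, and (3) their product; a significant
# product coefficient indicates G x E interaction.
rng = np.random.default_rng(1)
n = 500
bio = rng.normal(size=n)     # mean of biological parents' phenotypes
adopt = rng.normal(size=n)   # mean of adoptive parents' phenotypes
y = 0.5 * bio + 0.3 * adopt + 0.2 * bio * adopt \
    + rng.normal(scale=0.1, size=n)

X = np.column_stack([bio, adopt, bio * adopt, np.ones(n)])
coef = np.linalg.lstsq(X, y, rcond=None)[0]
print(coef[2])  # estimate of the interaction term, near 0.2
```

In practice one would also compute a standard error or p value for the product coefficient, as the entry describes.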
Age–Period–Cohort Analysis

THEODORE HOLFORD

Volume 1, pp. 33–38
Figure 1 Period plot: a semilog plot of birth rates among US black women by age and period
of regression parameters associated with each temporal element. Suppose that the expected value of the outcome, Y (the log birth rate in our example), is linearly related to the temporal factors,

E[Y] = β0 + aβa + pβp + cβc.   (1)

Using the linear relationship between the temporal factors gives rise to

E[Y] = β0 + aβa + pβp + (p − a)βc
     = β0 + a(βa − βc) + p(βp + βc),   (2)

which has only two identifiable parameters besides the intercept, instead of the expected three. Another way of visualizing this phenomenon is that all combinations of age, period and cohort may be displayed in the Lexis diagram shown in Figure 3, which is obviously a representation of a two-dimensional plane instead of the three dimensions expected for three separate factors.

In general, these analyses are not limited to linear effects applied to continuous measures of time, but instead they are applied to temporal intervals, such as mortality rates observed for five- or ten-year intervals of age and period. When the widths of these intervals are equal, the model may be expressed as

E[Yijk] = μ + αi + πj + γk,   (3)

where μ is the intercept, αi the effect of age for the ith (i = 1, . . . , I) interval, πj the effect of period for the jth (j = 1, . . . , J) interval, and γk the effect of the kth cohort (k = j − i + I = 1, . . . , K = I + J − 1). The usual constraints in this model imply that Σαi = Σπj = Σγk = 0. The identifiability problem manifests itself through a single unidentifiable parameter [3], which can be more easily seen if we partition each temporal effect into components of overall linear trend, and curvature or departure from linear trend. For example, age can be given by αi = i′βα + α̃i, where i′ = i − 0.5(I + 1), βα is the overall slope and α̃i the curvature. The overall model can be expressed as

E[Yijk] = μ + (i′βα + α̃i) + (j′βπ + π̃j) + (k′βγ + γ̃k)
        = μ + i′(βα − βγ) + j′(βπ + βγ) + α̃i + π̃j + γ̃k,   (4)

because k′ = j′ − i′.
Figure 2 Cohort plot: a semilog plot of birth rates among US black women by age and cohort
Figure 3 Lexis diagram showing the relationship between age, period, and cohort. The diagonal line traces age-period
lifetime for an individual born in 1947
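The identifiability problem in equations (1) and (2) is ordinary rank deficiency, which can be verified numerically on any age-by-period grid (the grid sizes below are arbitrary):

```python
import numpy as np

# Because cohort = period - age, a design matrix with an intercept and all
# three linear terms has rank 3, not 4: one slope is not identifiable.
age = np.repeat(np.arange(5.0), 5)
period = np.tile(np.arange(5.0), 5)
cohort = period - age
X = np.column_stack([np.ones_like(age), age, period, cohort])
rank = np.linalg.matrix_rank(X)
print(rank)  # → 3
```

Dropping any one of the three linear columns restores full rank, which is exactly the arbitrary constraint the text discusses.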
Thus, each of the curvatures can be uniquely determined, but the overall slopes are hopelessly entangled so that only certain combinations can be uniquely estimated [4].

The implication of the identifiability problem is that the overall direction of the effect for any of the three temporal components cannot be determined from a regression analysis (see Multiple Linear Regression). Thus, we cannot even determine whether the trends are increasing or decreasing with cohort, for instance. Figure 4 displays several combinations of age, period and cohort parameters, each set of which provides an identical set of fitted rates. However, even though the specific trends cannot be uniquely estimated, certain combinations of the overall trend can be uniquely determined, such as βπ + βγ, which is called the net drift [1, 2]. Alternative drift estimates covering shorter time spans can also be determined, and these have practical significance in that they describe the experience of following a particular age group in time, because both period and cohort will advance together. Curvatures, on the other hand, are completely determined, including polynomial parameters for the square and higher powers, changes in slopes, and second differences. The significance test for any one of the temporal effects in the presence of the other two will generally be a test of the corresponding curvature, and not the slope. Holford provides further detail on how software can be set up for fitting these models [5].

To illustrate the implications of the identifiability problem, and the type of valid interpretations that one can make by fitting an age-period-cohort model, we return to the data on birth rates among US black women. There are seven five-year age groups and five periods of identical width, thus yielding 7 + 5 − 1 = 11 cohorts. A general linear model (see Generalized Linear Models (GLM)) will be fitted to the log rates, introducing main effects for age, period and cohort. In situations in which the numerator and denominator for the rate are available, it is common to fit a log-linear model using Poisson regression (see Generalized Linear Models (GLM)), but the resulting interpretation issue will be identical for either model. An F test for the effect of cohort in the presence of age and period yields a value of 26.60 with df1 = 9 and df2 = 15. The numerator degrees of freedom, df1, are not 10 because the model with age and period effects implicitly includes their linear contribution, and thus the linear contribution for cohort. Therefore, this test can only evaluate the curvature for cohort. Similarly, the tests for age
Figure 4 Age, period, and cohort effects for a log-linear model for birth rates in US black women, 1980–2001, by alternatively constrained period slopes
(F5,15 = 3397.47, p < 0.0001) and period (F3,15 = 4.50, p = 0.0192) in the presence of the other two temporal factors are tests of curvature.

Figure 4 shows five sets of age, period and cohort parameters that may be obtained using least squares estimation. Each set of parameters provides an identical fit to the data, but there is obviously not a unique solution here but rather an infinite number of solutions. At the same time, once one of the slopes has been fixed (in this example the period slopes have been fixed), the other slopes can be identified. Notice that when the period slope is arbitrarily decreased, the underlying period trend is effectively rotated in a clockwise direction. Observing the corresponding cohort parameters, we can see that when the period trends are decreased, the trends for cohort are increased, that is, the corresponding estimates for the cohort parameters are rotated counterclockwise. Likewise, the age parameters experience a counterclockwise rotation, although in this example it is not easy to see because of the steep age trends. Sometimes there may be external information indicating a particular constraint for one of the temporal parameters, and once this has been applied, then the other effects are also identified. However, such information must come from external sources, because within the dataset itself it is impossible to disentangle the interrelationship so as to obtain a unique set of parameters.

In the absence of the detail required to make a constraint on one of the temporal parameters, it is safer to make inferences using estimable functions of the parameters, that is, functions that do not depend on an arbitrary constraint. Curvature, which would include both the overall departure from linear trend, as well as local changes of direction, is estimable [1, 2, 4]. The latter would include polynomial terms of power greater than one (see Polynomial Model), change of slope, or second differences, which would compare the parameter at one point to the average of the parameters just before and just after that point. In addition, the sum of the period and cohort slope, or drift, is also estimable, thus providing a net indication of the trend.

In our example, we can see from the solid lines in Figure 4 that the first three periods show a gradually increasing trend, as do the first four cohorts. If we were to add these slopes, we would have an estimate of the early drift, which would be positive because both of the slope components are positive. Similarly, the last three periods and the last three cohorts are negative, implying that the recent drift would also be negative. While the individual interpretation of the other lines shown in Figure 4 would be slightly different, the sum would be the same, thus indicating increasing early drift and decreasing late drift.

We can estimate the drift by taking the sum of the contrasts for linear trend in the first three periods and the first four cohorts, that is, (−1, 0, 1, 0, 0)/2 for period and (−3, −1, 1, 3, 0, 0, 0, 0, 0, 0, 0)/10 for cohort. This yields the result 0.1036 (t15 = 6.33, p < 0.0001), which indicates that the positive early drift is statistically significant. Similarly, the late drift, which uses the contrasts (0, 0, −1, 0, 1)/2 for period and (0, 0, 0, 0, 0, 0, 0, −3, −1, 1, 3)/10 for cohort, yields −0.1965 (t15 = −12.02, p < 0.0001), which is highly significant and negative.

In this discussion, we have concentrated on the analysis of data that have equally spaced intervals for age and period. The unequal case introduces further identifiability problems, which involve not only the overall linear trend, but certain short-term patterns as well (see Identification). The latter can sometimes appear to be cyclical trends; therefore, considerable care is needed in order to be certain that these are not just an artifact of the identifiability issues that arise for the unequal interval case.

References

[1] Clayton, D. & Schifflers, E. (1987). Models for temporal variation in cancer rates. I: age-period and age-cohort models, Statistics in Medicine 6, 449–467.
[2] Clayton, D. & Schifflers, E. (1987). Models for temporal variation in cancer rates. II: age-period-cohort models, Statistics in Medicine 6, 469–481.
[3] Fienberg, S.E. & Mason, W.M. (1978). Identification and estimation of age-period-cohort models in the analysis of discrete archival data, in Sociological Methodology 1979, K.F. Schuessler, ed., Jossey-Bass, San Francisco, pp. 1–67.
[4] Holford, T.R. (1983). The estimation of age, period and cohort effects for vital rates, Biometrics 39, 311–324.
[5] Holford, T.R. (2004). Temporal factors in public health surveillance: sorting out age, period and cohort effects, in Monitoring the Health of Populations, R. Brookmeyer & D.F. Stroup, eds, Oxford University Press, Oxford, pp. 99–126.
[6] Tango, T. & Kurashina, S. (1987). Age, period and cohort analysis of trends in mortality from major diseases in Japan, 1955 to 1979: peculiarity of the cohort born in the early Showa Era, Statistics in Medicine 6, 709–726.

THEODORE HOLFORD
Akaike's Criterion

CHARLES E. LANCE

Volume 1, pp. 38–39
χ² + 2q (1)

where χ² is the maximum likelihood chi-squared statistic and q refers to the number of free parameters being estimated in the estimated model. According to Bozdogan [3], the first term ". . . is a measure of inaccuracy, badness of fit, or bias when the maximum likelihood estimators of the models are used", while the ". . . second term . . . is a measure of complexity, of the penalty due to the increased unreliability, or compensation for the bias in the first term which depends upon the number of parameters used to fit the data" (p. 356). Thus, when several models' parameters are estimated using maximum likelihood, the models' AICs can be compared to find a model with a minimum value of AIC. This procedure is called the minimum AIC procedure, and the model with the minimum AIC is called the minimum AIC estimate (MAICE) and "is chosen to be the best model" ([3], p. 356). Cudeck and Browne [6] and Browne and Cudeck [5] considered AIC and a closely related index proposed by Schwartz [11] as measures of cross-validation of SEMs; Cudeck and Browne proposed a rescaled version of AIC, ostensibly "to eliminate the effect of sample size" (p. 154).

References

[1] Akaike, H. (1981). Likelihood of a model and information criteria, Journal of Econometrics 16, 3–14.
[2] Akaike, H. (1987). Factor analysis and AIC, Psychometrika 52, 317–332.
[3] Bozdogan, H. (1987). Model selection and Akaike's information criterion (AIC): the general theory and its analytical extensions, Psychometrika 52, 345–370.
[4] Bozdogan, H. (2000). Akaike's information criterion and recent developments in information complexity, Journal of Mathematical Psychology 44, 62–91.
[5] Browne, M.W. & Cudeck, R. (1989). Single sample cross-validation indices for covariance structures, Multivariate Behavioral Research 24, 445–455.
[6] Cudeck, R. & Browne, M.W. (1983). Cross-validation of covariance structures, Multivariate Behavioral Research 18, 147–167.
[7] Hu, L. & Bentler, P.M. (1998). Fit indices in covariance structure modeling: sensitivity to underparameterized model misspecification, Psychological Methods 4, 424–453.
[8] MacCallum, R.C. (2003). Working with imperfect models, Multivariate Behavioral Research 38, 113–139.
[9] Marsh, H.W., Balla, J.R. & McDonald, R.P. (1988). Goodness-of-fit indexes in confirmatory factor analysis: the effect of sample size, Psychological Bulletin 103, 391–410.
[10] McDonald, R.P. & Marsh, H.W. (1990). Choosing a multivariate model: noncentrality and goodness of fit, Psychological Bulletin 107, 247–255.
[11] Schwartz, G. (1978). Estimating the dimension of a model, Annals of Statistics 6, 461–464.

CHARLES E. LANCE
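As a supplement to the minimum AIC procedure described above, here is a minimal sketch of MAICE selection. The model names and their chi-squared statistics and parameter counts are invented for illustration; only the rule AIC = χ² + 2q and the pick-the-minimum step follow the entry.

```python
# Minimum AIC (MAICE) selection, a minimal sketch.
# AIC = chi2 + 2q, with chi2 the maximum likelihood chi-squared statistic
# and q the number of free parameters estimated in the model.
def aic(chi2, q):
    return chi2 + 2 * q

# Hypothetical fit results for three competing models: (chi2, q).
models = {
    "one-factor": (85.4, 12),
    "two-factor": (41.7, 17),
    "three-factor": (39.9, 23),
}
aics = {name: aic(chi2, q) for name, (chi2, q) in models.items()}
maice = min(aics, key=aics.get)  # the minimum AIC estimate (MAICE)
```

Note that the penalty term 2q is what keeps the most heavily parameterized model from winning automatically: here the three-factor model fits best by χ² alone but is not the MAICE.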
Allelic Association
LON R. CARDON
Volume 1, pp. 40–43
[Figure 1 appears here; its panel and axis labels include: Phenotype, Sampled chromosomes, Frequency, Trait value.]

Figure 1 Example of indirect and direct association. The disease allele is shown as a filled star. Alleles at an anonymous marker are shown as circles. The two markers are correlated (in linkage disequilibrium) with one another. The frequency of the disease allele is greater in the individuals with high trait scores (7/10 = 0.70) versus those with low trait scores (0.50). Similarly, the correlated allele has higher frequency in high-trait individuals (0.50) versus low-trait individuals (0.30). Although the markers are correlated, they are not perfectly correlated, so it is difficult to distinguish the directly associated disease allele from the indirectly associated marker allele
cloning, candidate region, or whole-genome association designs) are indirect association strategies that rely on linkage disequilibrium between measured genetic markers and unmeasured causal loci [15].

From a sampling perspective, most association studies can be classified into two general groupings: case/control and family based. Historically, case-control studies have been used most widely, involving collections of a sample of individuals who have a disease or trait of interest plus a sample of control individuals who do not have the trait (or who are randomly ascertained in some designs). Tests of allelic association involve comparisons of allele or genotype frequencies between the two groups. Matching between the case and control samples is a critical feature of this design, since differences between the groups can be incorrectly ascribed to allelic association when in fact they reflect some unmeasured variable(s). Such spurious association outcomes are described as resulting from population stratification, or classical confounding in epidemiological terms.

Spurious association due to population stratification has worried geneticists considerably over the past two decades because human populations are known to have widely varying allele frequencies simply because of their different population histories [4]. Thus, one might expect many allele frequency differences between groups by chance alone. To address this concern, a number of family-based designs have been developed, popularized most widely in the Transmission Disequilibrium Test [17]. In this design, samples of affected offspring and their two parents are collected (the disease status of the parents is usually irrelevant), and the frequencies of the alleles that are transmitted from parent to offspring form the case group, while those that are present in the parents but not transmitted to the offspring form the control group. The general idea is that drawing cases and controls from the same families renders the confounding allele frequency differences irrelevant. Similar strategies have been developed for continuous traits [9, 11, 12].

With the advent of family-based studies and with rapidly advancing studies of linkage disequilibrium across the human genome, the related concepts of linkage and association are becoming increasingly confused. The main distinction is that genetic linkage exists within families, while association extends to populations. More specifically, linkage refers to cosegregation of marker alleles with trait alleles within a family. Using the T/G example, a family would show evidence for linkage if members having high trait scores shared the T allele more often than expected by chance. However, another family would also show linkage if its members shared the G allele more often than expected by chance. In each case, an allele occurs in excess of expectations under random segregation, so each family is linked. In contrast, allelic association requires allelic overrepresentation across families. Thus, only if both families shared the same T (or G) allele would they offer joint evidence for association. In a simple sense, genetic linkage is allelic association within each family. The linkage-association distinction is important because linkage is useful for identifying large chromosome regions that harbor trait loci, but poor at identifying specific genetic variants, while association is more powerful in very tight regions but weak in identifying broad chromosomal locations. In addition, association analysis is more powerful than linkage for detecting alleles that are common in the population [16], whereas family-based linkage approaches offer the best available genetic approach to detect effects of rare alleles.

Association studies have not yielded many successes in detecting novel genes in the past, despite tens of thousands of attempts [18]. There are many postulated reasons for this lack of success, of which some of the most prominent are small sample sizes and poorly matched cases and controls [3]. In addition, there have been few large-effect complex trait loci identified by any strategy, lending at least partial support to the traditional poly-/oligo-genic model, which posits that common trait variation results from many genes having individually small effects. To the extent that this model applies for a specific trait, so that each associated variant has an individually small effect on the trait, the sample size issue becomes even more problematic. Future studies are being designed to detect smaller effect sizes, which should help reveal the ultimate utility of the association approach for common traits.

References

[1] Bulmer, M.G. (1980). The Mathematical Theory of Quantitative Genetics, Clarendon Press, Oxford.
[2] Cardon, L.R. & Bell, J.I. (2001). Association study designs for complex diseases, Nature Reviews Genetics 2, 91–99.
[3] Cardon, L.R. & Palmer, L.J. (2003). Population stratification and spurious allelic association, Lancet 361, 598–604.
[4] Cavalli-Sforza, L.L., Menozzi, P. & Piazza, A. (1994). History and Geography of Human Genes, Princeton University Press, Princeton.
[5] Collins, F.S., Guyer, M.S. & Charkravarti, A. (1997). Variations on a theme: cataloging human DNA sequence variation, Science 278, 1580–1581.
[6] Elston, R.C., Palmer, L.J. & Olson, J.E., eds (2002). Biostatistical Genetics and Genetic Epidemiology, John Wiley & Sons, Chichester.
[7] Falconer, D.S. & Mackay, T.F.C. (1996). Quantitative Genetics, Longman, Harlow.
[8] Fisher, R.A. (1918). The correlations between relatives on the supposition of Mendelian inheritance, Transactions of the Royal Society of Edinburgh 52, 399–433.
[9] Fulker, D.W., Cherny, S.S., Sham, P.C. & Hewitt, J.K. (1999). Combined linkage and association sib-pair analysis for quantitative traits, American Journal of Human Genetics 64, 259–267.
[10] Hugot, J.P., Chamaillard, M., Zouali, H., Lesage, S., Cezard, J.P., Belaiche, J., Almer, S., Tysk, C., O'Morain, C.A., Gassull, M., Binder, V., Finkel, Y., Cortot, A., Modigliani, R., Laurent-Puig, P., Gower-Rousseau, C., Macry, J., Colombel, J.F., Sahbatou, M. & Thomas, G. (2001). Association of NOD2 leucine-rich repeat variants with susceptibility to Crohn's disease, Nature 411, 599–603.
[11] Lange, C., DeMeo, D.L. & Laird, N.M. (2002). Power and design considerations for a general class of family-based association tests: quantitative traits, American Journal of Human Genetics 71, 1330–1341.
[12] Martin, E.R., Monks, S.A., Warren, L.L. & Kaplan, N.L. (2000). A test for linkage and association in general pedigrees: the pedigree disequilibrium test, American Journal of Human Genetics 67, 146–154.
[13] Mather, K. & Jinks, J.L. (1982). Biometrical Genetics, Chapman & Hall, London.
[14] Ogura, Y., Bonen, D.K., Inohara, N., Nicolae, D.L., Chen, F.F., Ramos, R., Britton, H., Moran, T., Karaliuskas, R., Duerr, R.H., Achkar, J.P., Brant, S.R., Bayless, T.M., Kirschner, B.S., Hanauer, S.B., Nunez, G. & Cho, J.H. (2001). A frameshift mutation in NOD2 associated with susceptibility to Crohn's disease, Nature 411, 603–606.
[15] Risch, N.J. (2000). Searching for genetic determinants in the new millennium, Nature 405, 847–856.
[16] Risch, N. & Merikangas, K. (1996). The future of genetic studies of complex human diseases, Science 273, 1516–1517.
[17] Spielman, R., McGinnis, R. & Ewens, W. (1993). Transmission test for linkage disequilibrium: the insulin gene region and insulin-dependent diabetes mellitus (IDDM), American Journal of Human Genetics 52, 506–516.
[18] Weiss, K.M. & Terwilliger, J.D. (2000). How many diseases does it take to map a gene with SNPs? Nature Genetics 26, 151–157.

LON R. CARDON
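To make the case-control comparison described in this entry concrete, here is a small sketch of an unadjusted test of allelic association: allele counts are compared between cases and controls with a 2 × 2 Pearson chi-square. The counts are hypothetical, chosen only to mirror the 0.70 versus 0.50 allele frequencies shown in Figure 1.

```python
# Unadjusted allelic association test: 2x2 chi-square on allele counts.
# All counts below are hypothetical illustrations, not data from the entry.
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (1 df) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

cases = (140, 60)      # disease allele vs. other allele: frequency 0.70 in 200 chromosomes
controls = (100, 100)  # frequency 0.50 in 200 chromosomes
chi2 = chi_square_2x2(cases[0], cases[1], controls[0], controls[1])
significant = chi2 > 3.84  # 5% critical value for 1 df
```

With these counts the statistic is about 16.7, far beyond the 5% critical value; as the entry stresses, such a result is only trustworthy when cases and controls are well matched, since stratification alone can produce large allele-frequency differences.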
All-X Models
RONALD S. LANDIS
Volume 1, pp. 43–44
[Figure 1, a path diagram, appears here: indicators x1–x6 (measurement errors δ1–δ6) load on the exogenous factors ξ1 and ξ2 through λ loadings; indicators y1–y6 (errors ε1–ε6) load on the endogenous factors η1 and η2; the diagram also shows the structural paths γ11 and β12, the factor covariance φ12, the disturbances ζ1 and ζ2, and ψ12.]

Figure 1 Model in which two endogenous variables (η1 and η2), each measured with three items, are predicted from two exogenous variables (ξ1 and ξ2), each also measured with three items
(b) measurement errors associated with the observed variables, and (c) relationships, if any, between the exogenous constructs.

Testing the all-X model requires the use of only three of the previously described eight matrices. The factor loadings would be contained in the lambda-x (Λx) matrix, the measurement errors would be contained in the theta-delta (Θδ) matrix, and the relationship between the factors would be captured in the phi (Φ) matrix.

(See also Structural Equation Modeling: Overview)

RONALD S. LANDIS
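The three matrices just named can be written out numerically. In the sketch below, every numerical value (loadings, factor covariance, error variances) is invented; the only point is how Λx, Φ, and Θδ combine into the model-implied covariance matrix, Σ = Λx Φ Λx′ + Θδ, for the two-factor, six-indicator model of Figure 1.

```python
import numpy as np

# All-X measurement model: Sigma = Lambda_x @ Phi @ Lambda_x.T + Theta_delta.
# Numerical values are illustrative only.
lambda_x = np.array([[0.8, 0.0],   # x1-x3 load only on ksi_1
                     [0.7, 0.0],
                     [0.6, 0.0],
                     [0.0, 0.9],   # x4-x6 load only on ksi_2
                     [0.0, 0.8],
                     [0.0, 0.5]])
phi = np.array([[1.0, 0.3],        # factor (co)variances; phi_12 = 0.3
                [0.3, 1.0]])
theta_delta = np.diag([0.36, 0.51, 0.64, 0.19, 0.36, 0.75])  # error variances

sigma = lambda_x @ phi @ lambda_x.T + theta_delta  # model-implied covariance matrix
```

Cross-factor covariances such as cov(x1, x4) = 0.8 × 0.3 × 0.9 = 0.216 arise only through the Φ matrix, which is exactly the "relationship between the factors" role described above.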
All-Y Models
RONALD S. LANDIS
Volume 1, p. 44
Alternating Treatment Designs
GINA COFFEE HERRERA AND THOMAS R. KRATOCHWILL
Overview

The alternating treatments design (ATD) is a type of single-participant design (see Single-Case Designs) characterized by rapid and random/semirandom shifts between two or more conditions [1]. Essentially, conditions are alternated as often as necessary to capture meaningful measurement of the behavior of interest. For example, if daily measurement is the most telling way to measure the behavior of interest, conditions are alternated daily. Similarly, depending on the behavior of interest, conditions could be alternated bi-daily, weekly, by session, or by any other schedule that is appropriate for the behavior. In addition to the frequency with which conditions are changed, the order by which conditions are changed is a significant component of the ATD. Usually, conditions in an ATD are shifted semirandomly. The alternations are semirandom, rather than random, because there are restrictions on the number of times conditions can be sequentially implemented. Overall, the ATD is most commonly used to compare two or more treatments through the examination of treatment divergence and overlap [3]. Other uses include comparing a treatment to no treatment or inspecting treatment components.

To illustrate the use of the ATD in a classroom setting, consider a recent study by Skinner, Hurst, Teeple, and Meadows [5]. In this study, the researchers examined the effects of different mathematics assignments (control and experimental) on the on-task behavior and the rate of problem completion in students with emotional disturbance. The experimental assignment was similar to the control assignment with the addition of brief mathematics problems interspersed after every third problem. In a self-contained classroom, the researchers observed four students across 14 days (one 15-minute session per day) as the students completed the mathematics assignments. The assignment (control or experimental) was randomly selected on days 1, 5, 9, and 13. Then, the assignments were alternated daily so that if the students completed the experimental assignment on day 1, they would complete the control assignment on day 2.

Considerations

This research example illustrates certain considerations of using an ATD [3]. In particular, when using an ATD, questions regarding the number of conditions, baseline data, alternations, and analyses emerge. First, with regard to the number of conditions, it is important to understand that as the number of conditions increases, the complexity of the ATD increases in terms of drawing comparisons among conditions of the design. Consequently, it is generally recommended that the number of conditions not exceed three, and that each condition have at least two data points (although four or more are preferable). In the example, the researchers used two conditions, each with seven data points.

In terms of baseline data, a unique feature of the ATD, in comparison to other single-participant designs, is that baseline data are not required when one wants to examine the relative effectiveness of treatments that are already known to be effective (baseline data are usually required when a treatment's effectiveness has not been demonstrated). For example, because in the mathematics assignments Skinner et al. [5] used treatments that were already known to be effective, baseline data were not required. However, regardless of the known effectiveness, including baseline data before the ATD and as a condition within the ATD can provide useful information about individual change and the effectiveness of the treatments, while ruling out extraneous variables as the cause of change.

In addition to the number of conditions and baseline data, alternations are also aspects of the ATD that must be considered. First, as the number of alternations increases, the number of opportunities for divergence and overlap also increases. Because comparisons are made after examining divergence and overlap of conditions, increasing the number of alternations can yield more accurate results in the analyses. In general, although the number of alternations cannot be less than two, the maximum number of alternations varies depending on the study because the unit of measurement (e.g., day, week, session) and the duration of the treatment effect influence the number of alternations. The variables in the example allowed for seven alternations.

Related to the number of alternations, but with more serious ramifications, is the way that conditions are alternated (e.g., order and schedule). The manner in which the conditions are alternated is important because it can threaten the validity of the ATD. In particular, the validity of the ATD can be threatened by sequential confounding, carryover effects, and alternation effects, a type of carryover effect [1]. Fortunately, although these concerns have the potential of threatening the validity of the ATD, they can usually be addressed by random or semirandom alternations and by monitoring for carryover effects. In addition to implementing a randomization strategy, the researchers in the example reported that naturally occurring absences also contributed to controlling for carryover effects.

A final consideration of using an ATD is the data analysis. Analysis of data in an ATD is important so that one can understand the effects of the treatment(s). As with other single-participant designs, data points in an ATD can be analyzed by visual inspection, that is, by assessing the level, trend, and variability within each condition [3]. This is the methodology used in the example [5] and is the most common way of analyzing data in single-participant designs. In addition to visual inspection, however, data from ATDs can also be analyzed with inferential statistics such as randomization tests [4]. Using this type of inferential analysis is unique to the ATD and can be accomplished by using available software packages [4] or doing hand calculations. It should be noted, however, that randomization tests are appropriate when alternations are truly random. Additional nonparametric tests that can be used to analyze data from an ATD include Wilcoxon's matched-pairs, signed-ranks test (see Paired Observations, Distribution Free Methods), sign tests, and Friedman's analysis of variance (see Friedman's Test) [2].

Overall, depending on one's needs, the ATD is a useful design and presents certain advantages over other single-participant designs [3]. One advantage is that the ATD does not require the withdrawal of a treatment. This aspect of the ATD can be useful in avoiding or minimizing the ethical and practical issues that withdrawal of treatment can present. Another advantage of the ATD is that comparisons between treatments can be made more quickly than in other single-participant designs, sometimes in as little time as one session. A final advantage of the ATD, as discussed previously, is that the ATD does not require baseline data.

References

[1] Barlow, D.H. & Hayes, S.C. (1979). Alternating treatments design: one strategy for comparing the effects of two treatments in a single subject, Journal of Applied Behavior Analysis 12, 199–210.
[2] Edgington, E.S. (1992). Nonparametric tests for single-case experiments, in Single-case Research Design and Analysis: New Directions for Psychology and Education, T.R. Kratochwill & J.R. Levin, eds, Lawrence Erlbaum Associates, Hillsdale, pp. 133–157.
[3] Hayes, S.C., Barlow, D.H. & Nelson-Gray, R.O. (1999). The Scientist Practitioner: Research and Accountability in the Age of Managed Care, 2nd Edition, Allyn & Bacon, Boston.
[4] Onghena, P. & Edgington, E.S. (1994). Randomization tests for restricted alternating treatments designs, Behaviour Research and Therapy 32, 783–786.
[5] Skinner, C.H., Hurst, K.L., Teeple, D.F. & Meadows, S.O. (2002). Increasing on-task behavior during mathematics independent seat-work in students with emotional disturbance by interspersing additional brief problems, Psychology in the Schools 39, 647–659.

GINA COFFEE HERRERA AND THOMAS R. KRATOCHWILL
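The randomization test mentioned in this entry can be sketched for the classroom example. The schedule described (a random condition on days 1, 5, 9, and 13, then daily alternation) admits exactly 2⁴ = 16 assignment sequences, so an exact test simply recomputes the statistic under all of them. The 14 daily scores below are invented for illustration; only the design structure follows the entry.

```python
from itertools import product

# Exact randomization test for the ATD example: 4 block-level coin flips,
# daily alternation within each block, 2**4 = 16 admissible sequences.
# The daily scores are hypothetical.
scores = [62, 55, 60, 51, 65, 54, 63, 50, 66, 57, 64, 53, 61, 52]

def sequence(flips):
    """Expand 4 block-level flips (+1/-1) into a 14-day condition sequence."""
    seq = []
    for flip, start in zip(flips, (0, 4, 8, 12)):
        length = 4 if start < 12 else 2   # the final block covers days 13-14 only
        seq += [flip * (-1) ** i for i in range(length)]
    return seq

def mean_diff(seq):
    """Absolute difference between the two condition means for one sequence."""
    a = [s for s, c in zip(scores, seq) if c == 1]
    b = [s for s, c in zip(scores, seq) if c == -1]
    return abs(sum(a) / len(a) - sum(b) / len(b))

observed = mean_diff(sequence((1, 1, 1, 1)))  # the assumed observed schedule
null = [mean_diff(sequence(f)) for f in product((1, -1), repeat=4)]
p_value = sum(d >= observed for d in null) / len(null)
```

With so few admissible sequences the smallest attainable P value is 1/16, which illustrates why the number of (truly random) alternations limits the sensitivity of this analysis.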
Analysis of Covariance
BRADLEY E. HUITEMA
Volume 1, pp. 46–49
Table 1 Four analyses applied to a pretest–posttest randomized-groups experiment (taken from [4])

Treatment   Pretest (X)   Posttest (Y)   Difference (Y − X)
1           2             4              2
1           4             9              5
1           3             5              2
2           3             7              4
2           4             8              4
2           4             8              4
3           3             9              6
3           3             8              5
3           2             6              4

Group   Mean X   Mean Y   Mean (Y − X)   Adjusted mean Y
1       3.00     6.00     3.00           6.24
2       3.67     7.67     4.00           6.44
3       2.67     7.67     5.00           8.64

ANOVA on posttest scores

Source    SS      df   MS     F      p
Between   5.56    2    2.78   0.86   .47
Within    19.33   6    3.22
Total     24.89   8

ANOVA on difference scores

Source    SS      df   MS     F      p
Between   6.00    2    3.00   2.25   .19
Within    8.00    6    1.33
Total     14.00   8

Split-plot ANOVA

Source             SS       df   MS      F        p
Between subjects   22.78    8
  Groups           4.11     2    2.06    0.66     .55
  Error a          18.67    6    3.11
Within subjects    79.00    9
  Times            72.00    1    72.00   108.00   .00
  Times × gps.     3.00     2    1.50    2.25     .19
  Error b          4.00     6    0.67
Total              101.78   17

ANCOVA on posttest scores (using pretest as covariate)

Source                 SS      df   MS     F      p
Adjusted treatment     8.96    2    4.48   7.00   .04
Residual within gps.   3.20    5    0.64
Residual total         12.16   7

useful information contained in the pretest. It is included here to demonstrate the relative advantages of the other methods, all of which use the pretest information in some way.

The second approach is a one-factor ANOVA on the differences between the pretest and the posttest scores, often referred to as an analysis of change scores. The third approach is to treat the data as a two-factor split-plot factorial design (see Analysis of Variance: Classification) in which the three groups constitute the levels of the between-subjects factor and the two testing times (i.e., pre and post) constitute the levels of the repeated measurement factor. The last approach is a one-factor ANCOVA in which the pretest is used as the covariate and the posttest is used as the dependent variable.

The results of all four analytic approaches are summarized in Table 1. The group means based on variables X, Y, and Y − X differences are shown, along with the adjusted Y means. Below the means are the summary tables for the four inferential analyses used to test for treatment effects. Before inspecting the P values associated with these analyses, notice the means on X. Even though random assignment was employed in forming the three small groups, it can be seen that there are annoying differences among these means. These group differences on the covariate may seem to cloud the interpretation of the differences among the means on Y. It is natural to wonder if the observed differences on the outcome are simply reflections of chance differences that were present among these groups at pretesting. This issue seems especially salient when the rank order of the Y means is the same as the rank order of the X means. Consequently, we are likely to lament the fact that random assignment has produced groups with different covariate means and to ponder the following question: If the pretest (covariate) means had been exactly equivalent for all three groups, what are the predicted values of the posttest means? ANCOVA provides an answer to this question in the form of adjusted means. The direction of the adjustment follows the logic that a group starting with an advantage (i.e., a high X) should have a downward adjustment to Y, whereas a group starting with a disadvantage (i.e., a low X) should have an upward adjustment to Y.

Now consider the inferential results presented below the means in the table; it can be seen that the conclusions of the different analyses vary greatly.
The P values for the ANOVA on the posttest scores and the ANOVA on the difference scores are .47 and .19, respectively. The split-plot analysis provides three tests: a main-effects test for each factor and a test on the interaction. Only the interaction test is directly relevant to the question of whether there are differential effects of the treatments. The other tests can be ignored. The interaction test turns out to be just another, more cumbersome, way of evaluating whether we have sufficient evidence to claim that the difference-score means are the same for the three treatment groups. Hence, the null hypothesis of no interaction in the split-plot ANOVA is equivalent to the null hypothesis of equality of means in the one-factor ANOVA on difference scores.

Although these approaches are generally far more satisfactory than is a one-factor ANOVA on the posttest, the most satisfactory method is usually a one-factor ANCOVA on the posttest using the pretest as the covariate. Notice that the P value for ANCOVA (p = .04) is much smaller than those found using the other methods; it is the only one that leads to the conclusion that there is sufficient information to claim a treatment effect.

The main reason ANOVA on difference scores is usually less satisfactory than ANCOVA is that the latter typically has a smaller error mean square. This occurs because ANOVA on difference scores implicitly assumes that the value of the population within-group regression slope is 1.0 (whether it actually is or not), whereas in the case of ANCOVA, the within-group slope is estimated from the data. This difference is important because the error variation in both analyses refers to deviations from the within-group slope. If the actual slope is far from 1.0, the ANCOVA error mean square will be much smaller than the error mean square associated with the ANOVA on Y − X differences. The example data illustrate this point. The estimated within-group slope is 2.2 and the associated ANCOVA error mean square is less than one-half the size of the ANOVA error mean square.

In summary, this example shows that information on the pretest can be used either as a covariate or to form pretest–posttest differences, but it is more effective to use it as a covariate. Although there are conditions in which this will not be true, ANCOVA is usually the preferred analysis of the randomized-groups pretest–posttest design. By the way, no additional advantage is obtained by combining both approaches in a single analysis. If we carry out ANCOVA using the pretest as the covariate and the difference scores rather than the posttest scores as the dependent variable, the error mean square and the P value from this analysis will be identical to those shown in Table 1.

Assumptions

Several assumptions in addition to those normally associated with ANOVA (viz., homogeneity of population variances, normality of population error distributions, and independence of errors) are associated with the ANCOVA model. Among the most important are the assumptions that the relationship between the covariate and the dependent variable is linear, that the covariate is measured without error, that the within-group regression slopes are homogeneous, and that the treatments do not affect the covariate.

Alternatives and Extensions

A simple alternative to a one-factor ANCOVA is to use a two-factor (treatments by blocks) ANOVA in which block levels are formed using scores on X (see Randomized Block Designs). Although this approach has the advantage of not requiring a linear relationship between X and Y, it also has several disadvantages including the reduction of error degrees of freedom and the censoring of information on X. Comparisons of the two approaches usually reveal higher power for ANCOVA, especially if the treatment groups can be formed using restricted randomization rather than simple random assignment.

Alternatives to conventional ANCOVA are now available to accommodate violations of any of the assumptions listed above [4]. Some of these alternatives require minor modifications of conventional ANCOVA computational procedures. Others, such as those designed for dichotomously scaled dependent variables, robust analysis [3], complex matching [7], random treatment effects [6], and intragroup dependency of errors [6], require specialized software. Straightforward extensions of covariance analysis are available for experimental designs having more than one factor (multiple-factor ANCOVA), more than one dependent variable (multivariate ANCOVA), and more than one covariate (multiple ANCOVA). Most of these extensions are described in standard references on experimental design [1, 2, 5].
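The ANCOVA computations discussed in this entry can be reproduced directly from the Table 1 data. The sketch below uses the standard pooled within-group sums-of-squares formulas (a common textbook route, not necessarily the entry's own computational layout); it recovers the within-group slope of 2.2, the residual SS values of 3.20 and 12.16, F ≈ 7.00, and the adjusted means.

```python
import numpy as np

# One-factor ANCOVA for the Table 1 data: pretest X as covariate, posttest Y
# as dependent variable, three randomized groups of three subjects each.
x = np.array([2, 4, 3, 3, 4, 4, 3, 3, 2], dtype=float)   # pretest (covariate)
y = np.array([4, 9, 5, 7, 8, 8, 9, 8, 6], dtype=float)   # posttest
g = np.array([1, 1, 1, 2, 2, 2, 3, 3, 3])                # treatment group

def ss(u, v):
    """Sum of cross-products of deviations about the means."""
    return float(np.sum((u - u.mean()) * (v - v.mean())))

# Pooled within-group sums of squares and cross-products
sxx_w = sum(ss(x[g == k], x[g == k]) for k in (1, 2, 3))
sxy_w = sum(ss(x[g == k], y[g == k]) for k in (1, 2, 3))
syy_w = sum(ss(y[g == k], y[g == k]) for k in (1, 2, 3))

b_w = sxy_w / sxx_w                                  # pooled within-group slope: 2.2
ss_res_within = syy_w - b_w * sxy_w                  # 3.20 on 5 df
ss_res_total = ss(y, y) - ss(x, y) ** 2 / ss(x, x)   # 12.16 on 7 df
ss_adj_treat = ss_res_total - ss_res_within          # 8.96 on 2 df
F = (ss_adj_treat / 2) / (ss_res_within / 5)         # about 7.00 (p = .04)

# Adjusted means: each group's Y mean corrected for its covariate-mean offset
adj_means = [y[g == k].mean() - b_w * (x[g == k].mean() - x.mean()) for k in (1, 2, 3)]
```

The last line implements the adjustment logic stated in the entry: group 3, which started with the lowest pretest mean, is adjusted upward (7.67 to 8.64), while groups 1 and 2 move toward or below their observed means.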
We will consider only categorical covariates, but dis- weighted average of age-specific differences in pro-
tinguish nominal (including binary) covariates from portions is 0.3254. That is, the differences in pro-
ordinal covariates. portions are computed within each age group (young
Koch et al. [11] present a data set with 59 and old), and these differences are weighted by the
patients, two treatment groups (active and placebo), relative frequencies of each age group, to obtain
five response status levels (excellent, good, moder- 0.3254. Monte Carlo (see Monte Carlo Simulation)
ate, fair, poor), and age as a continuous covariate. P values based on resampling from the permuta-
In their Table 5 (p. 577), the response variable is tion distribution with 25 000 permutations yields a
dichotomized into good and excellent versus mod- P value of 0.0017 unadjusted. This is the proportion
erate, fair, or poor. We then dichotomize the age of permutations with unadjusted test statistics at least
into 54 or less (younger) versus 55 or over (older). as large as 0.4051. The proportion with adjusted test
Dichotomizing an ordinal response variable can result statistics at least as large as 0.3254 is 0.0036, so this
in a loss of power [3, 14, 16], and dichotomizing an is the first adjusted P value.
ordinal covariate can result in a reversal of the direc- Next, consider only the restricted set of permuta-
tion of the effect [9], but we do so for the sake of sim- tions that retain the age imbalance across treatment
plicity. The data structure is then a 2 × 2 × 2 table. Now if the randomization was unrestricted other than the restriction that 32 patients were to receive placebo and 27 were to receive the active treatment, there would be 59!/[32!27!] ways to select 32 patients out of 59 to constitute the placebo group. An unadjusted permutation test would compare the test statistic of the observed table to the reference distribution consisting of the test statistics computed under the null hypothesis [2, 7] of all permuted tables (possible realizations of the randomization). To adjust for age, one can use an adjusted test statistic that combines age-specific measures of the treatment effect. This could be done with an essentially unrestricted (other than the requirement that the row margins be fixed) permutation test.

It is also possible to adjust for age by using an unadjusted test statistic (a simple difference across treatment groups in response rates) with a restricted permutation sample space (only those tables that retain the age*treatment distribution). One could also use both the adjusted test statistic and the adjusted permutation sample space. We find that for the data set in Table 1, the unadjusted difference in proportions (active minus placebo) is 0.4051, whereas the … groups, that is, permutations in which 17 younger and 15 older cases are assigned to the placebo and 7 younger and 20 older cases are assigned to the active treatment. The number of permissible permutations can be expressed as the product of the number of ways of assigning the younger cases, 24!/[17!7!], and the number of ways of assigning the older cases, 35!/[15!20!]. The proportion of Monte Carlo permutations with unadjusted test statistics at least as large as 0.4051 is 0.0086, so this is the second adjusted P value. Considering only the restricted set of permutations that retain the age imbalance across treatment groups, the proportion of permutations with adjusted test statistics at least as large as 0.3254 is 0.0082, so this is the third adjusted P value, or the doubly adjusted P value [4]. Of course, the set of numerical values of the various P values for a given set of data does not serve as a basis for selecting one adjustment technique or another. Rather, this decision should be based on the relative importance of testing and estimation, because obtaining a valid P value by comparing a distorted estimate to other equally distorted estimates does not help with valid estimation. The double adjustment technique might be ideal for ensuring both valid testing and valid estimation [4].
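The restricted permutation sample space described above can be sketched in code. The sketch below is illustrative only: the data are invented placeholders, not the trial summarized in Table 1. The function permutes treatment labels within age strata, so every re-randomization preserves the age*treatment margins, and it estimates a Monte Carlo P value for the unadjusted difference in response rates.

```python
import random

def restricted_perm_p(y, stratum, treat, n_sim=5000, seed=7):
    """Monte Carlo permutation P value for the difference in response
    rates (active minus placebo), with the permutation sample space
    restricted to re-randomizations that preserve each stratum's
    treatment split (here, the age*treatment margins)."""
    rng = random.Random(seed)

    def rate_diff(t):
        active = [yi for yi, ti in zip(y, t) if ti == 1]
        placebo = [yi for yi, ti in zip(y, t) if ti == 0]
        return sum(active) / len(active) - sum(placebo) / len(placebo)

    observed = rate_diff(treat)
    by_stratum = {}
    for i, s in enumerate(stratum):
        by_stratum.setdefault(s, []).append(i)

    hits = 0
    for _ in range(n_sim):
        t = list(treat)
        for idx in by_stratum.values():
            labels = [treat[i] for i in idx]
            rng.shuffle(labels)          # permute labels only within the stratum
            for i, lab in zip(idx, labels):
                t[i] = lab
        if rate_diff(t) >= observed - 1e-12:
            hits += 1
    return hits / n_sim

# Toy illustration: 12 patients with binary response, age stratum, treatment.
y = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0]
age = ["y", "y", "y", "y", "y", "y", "o", "o", "o", "o", "o", "o"]
treat = [1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0]
p_value = restricted_perm_p(y, age, treat)
```

With the 59 patients of the example, the restricted space has 24!/[17!7!] × 35!/[15!20!] members, which is why Monte Carlo sampling rather than full enumeration is used.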
Another exact permutation adjustment technique applies to ordered categorical covariates measured on the same scale as the ordered categorical response variable [8]. The idea here is to consider the information-preserving composite end point [3], which consists of the combination of the baseline value (the covariate) and the subsequent outcome measure. Instead of assigning arbitrary numerical scores and then considering a difference from baseline (as is frequently done in practice), this approach is based on a partial ordering on the set of possible values for the pair (baseline, final outcome), and then a U test. Regardless of the scale on which the covariate is measured, it needs to be a true covariate, meaning that it is not influenced by the treatments, because adjustment for variables measured subsequent to randomization is known to lead to unreliable results [17, 18]. Covariates measured after randomization have been called pseudocovariates [15], and the subgroups defined by them have been called improper subgroups [20].

References

[1] Akritas, M.G., Arnold, S.F. & Brunner, E. (1997). Nonparametric hypotheses and rank statistics for unbalanced factorial designs, Journal of the American Statistical Association 92, 258-265.
[2] Berger, V.W. (2000). Pros and cons of permutation tests in clinical trials, Statistics in Medicine 19, 1319-1328.
[3] Berger, V.W. (2002). Improving the information content of categorical clinical trial endpoints, Controlled Clinical Trials 23, 502-514.
[4] Berger, V.W. (2005). Nonparametric adjustment techniques for binary covariates, Biometrical Journal. In press.
[5] Berger, V.W. & Christophi, C.A. (2003). Randomization technique, allocation concealment, masking, and susceptibility of trials to selection bias, Journal of Modern Applied Statistical Methods 2, 80-86.
[6] Berger, V.W. & Exner, D.V. (1999). Detecting selection bias in randomized clinical trials, Controlled Clinical Trials 20, 319-327.
[7] Berger, V.W., Lunneborg, C., Ernst, M.D. & Levine, J.G. (2002). Parametric analyses in randomized clinical trials, Journal of Modern Applied Statistical Methods 1, 74-82.
[8] Berger, V.W., Zhou, Y.Y., Ivanova, A. & Tremmel, L. (2004). Adjusting for ordinal covariates by inducing a partial ordering, Biometrical Journal 46(1), 48-55.
[9] Brenner, H. (1998). A potential pitfall in control of covariates in epidemiologic studies, Epidemiology 9(1), 68-71.
[10] Greenland, S., Robins, J.M. & Pearl, J. (1999). Confounding and collapsibility in causal inference, Statistical Science 14, 29-46.
[11] Koch, G.G., Amara, I.A., Davis, G.W. & Gillings, D.B. (1982). A review of some statistical methods for covariance analysis of categorical data, Biometrics 38, 563-595.
[12] Koch, G.G., Tangen, C.M., Jung, J.W. & Amara, I.A. (1998). Issues for covariance analysis of dichotomous and ordered categorical data from randomized clinical trials and non-parametric strategies for addressing them, Statistics in Medicine 17, 1863-1892.
[13] Lachenbruch, P.A. & Clements, P.J. (1991). ANOVA, Kruskal-Wallis, normal scores, and unequal variance, Communications in Statistics - Theory and Methods 20(1), 107-126.
[14] Moses, L.E., Emerson, J.D. & Hosseini, H. (1984). Analyzing data from ordered categories, New England Journal of Medicine 311, 442-448.
[15] Prorok, P.C., Hankey, B.F. & Bundy, B.N. (1981). Concepts and problems in the evaluation of screening programs, Journal of Chronic Diseases 34, 159-171.
[16] Rahlfs, V.W. & Zimmermann, H. (1993). Scores: ordinal data with few categories - how should they be analyzed? Drug Information Journal 27, 1227-1240.
[17] Robins, J.M. & Greenland, S. (1994). Adjusting for differential rates of prophylaxis therapy for PCP in high- vs. low-dose AZT treatment arms in an AIDS randomized trial, Journal of the American Statistical Association 89, 737-749.
[18] Rosenbaum, P.R. (1984). The consequences of adjusting for a concomitant variable that has been affected by the treatment, Journal of the Royal Statistical Society, Series A 147, 656-666.
[19] Tangen, C.M. & Koch, G.G. (2000). Non-parametric covariance methods for incidence density analyses of time-to-event data from a randomized clinical trial and their complementary roles to proportional hazards regression, Statistics in Medicine 19, 1039-1058.
[20] Yusuf, S., Wittes, J., Probstfield, J. & Tyroler, H.A. (1991). Analysis and interpretation of treatment effects in subgroups of patients in randomized clinical trials, Journal of the American Medical Association 266(1), 93-98.

(See also Stratification)

VANCE W. BERGER
Analysis of Variance
RONALD C. SERLIN
Volume 1, pp. 52-56
null hypothesis, the K samples are effectively drawn from the same population. Pizzetti's result tells us that

Σ_{k=1}^{K} (Ȳ_k − Ȳ)²/(σ²/N_k) = Σ_{k=1}^{K} N_k(Ȳ_k − Ȳ)²/σ² = SSB/σ²

follows a χ²_{K−1} distribution, and so MSB/σ² has a distribution that is 1/(K − 1) times a χ²_{K−1} distributed variable. Similarly, we find that MSW/σ² has a distribution that is 1/(N − K) times a χ²_{N−K} distributed variable. Finally, then, we see that the ratio

[MSB/σ²]/[MSW/σ²] = MSB/MSW

is distributed as the ratio of two independent chi-square distributed variables, each divided by its degrees of freedom. It is the distribution of this ratio that Fisher derived. Snedecor named the ratio F in Fisher's honor, reputedly [11] for which "officiousness" Fisher never seems to have forgiven him.

The distribution of the F-ratio is actually a family of distributions. The particular distribution appropriate to the problem at hand is determined by two parameters, the number of degrees of freedom associated with the numerator and denominator estimates of the variance. As desired, the ratio of the two mean squares reflects the relative amounts of variability attributed to the explanatory factor of interest and to chance. The F distribution allows us to specify a cutoff, called a critical value (see Critical Region). F-ratios larger than the critical value lead us to conclude that [5] "either there is something in the [mean differences], or a coincidence has occurred . . ." A small percentage of the time, we can obtain a large F-ratio even when the null hypothesis is true, on the basis of which we would conclude incorrectly that H0 is false. Fisher often set the rate at which we would commit this error, known as a Type I error, at 0.05 or 0.01; the corresponding critical values would be the 95th or 99th cumulative percentiles of the F distribution.

Neyman and Pearson [10] introduced the concept of an alternative hypothesis, denoted H1, which reflected the conclusion to be drawn regarding the population parameters if the null hypothesis were rejected. They also pointed to a second kind of error that could occur, the failure to reject a false null hypothesis, called a Type II error, with an associated Type II error rate. The rate of correct rejection of H0 is known as the power of the test. Fisher rarely acknowledged the contributions of the Neyman and Pearson method of hypothesis testing. He did, nevertheless, derive the non-null distribution of the F-ratio, which allowed the power of tests to be calculated.

As an example of an analysis of variance, consider the data reported in a study [8] of the effects of four treatments on posttraumatic stress disorder (PSD), summarized in Table 1. The dependent variable is a posttest measure of the severity of PSD. On the basis of pooling the sample variances, the MSW is found to equal 55.71. From the sample means and sample sizes, the combined group mean is calculated to be 15.62, and using these values and K = 4, the MSB is determined to equal 169.32. Finally, the F-ratio is found to be 3.04. These results are usually summarized in an analysis of variance table, shown in Table 2. From a table of the F distribution with numerator degrees of freedom equal to K − 1 = 3, denominator degrees of freedom equal to N − K = 41, and Type I error rate set to 0.05, we find the critical value is equal to 2.83. Because the observed F-ratio exceeds the critical value, we conclude that the test is significant and that the null hypothesis is false.

Table 1 Summary data from study of traumatic stress disorder

            SIT      PE       SC      WL
  Nk        14       10       11      10
  Ȳk        11.07    15.40    18.09   19.50
  Sk²       15.76    122.99   50.84   51.55

Note: SIT = stress inoculation therapy, PE = prolonged exposure, SC = supportive counseling, WL = wait-list control.

Table 2 Analysis of variance table for data in Table 1

  Source    df    SS        MS       F
  Between   3     507.97    169.32   3.04
  Within    41    2284.13   55.71
  Total     44    2792.10

This example points to a difficulty associated with the analysis of variance test, namely, that although
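The Table 1 and Table 2 computations can be reproduced from the summary statistics alone; the sketch below assumes only the group sizes, means, and variances reported above.

```python
# Summary statistics from Table 1 (SIT, PE, SC, WL).
ns = [14, 10, 11, 10]
means = [11.07, 15.40, 18.09, 19.50]
variances = [15.76, 122.99, 50.84, 51.55]   # sample variances S_k^2

N, K = sum(ns), len(ns)
grand_mean = sum(n * m for n, m in zip(ns, means)) / N

# Between- and within-groups sums of squares, as in Table 2.
ssb = sum(n * (m - grand_mean) ** 2 for n, m in zip(ns, means))
ssw = sum((n - 1) * v for n, v in zip(ns, variances))   # pooled variances

msb = ssb / (K - 1)      # approx. 169.32
msw = ssw / (N - K)      # approx. 55.71
f_ratio = msb / msw      # approx. 3.04
```

The observed F of about 3.04 exceeds the tabled critical value 2.83 cited in the text, matching the article's conclusion.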
we have concluded that the population means differ, we still do not know in what ways they differ. It seems likely that this more focused information would be particularly useful. One possible solution would involve testing the means for equality in a pairwise fashion, but this approach would engender its own problems. Most importantly, if each pairwise test were conducted with a Type I error rate of 0.05, then the rate at which we would falsely conclude that the means are not all equal could greatly exceed 0.05. Fisher introduced a method, known as a multiple comparison procedure, for performing the desired pairwise comparisons, but it failed to hold the Type I error rate at the desired level in all circumstances. Many other multiple comparison procedures have since been developed that either bypass the F test or take advantage of its properties and successfully control the overall Type I error rate.

References

[1] Fisher, R.A. (1920). A mathematical examination of the methods of determining the accuracy of an observation by the mean error, and by the mean square error, Monthly Notices of the Royal Astronomical Society 80, 758-770.
[2] Fisher, R.A. (1922a). On the mathematical foundations of theoretical statistics, Philosophical Transactions of the Royal Society, A 222, 309-368.
[3] Fisher, R.A. (1922b). The goodness of fit of regression formulae and the distribution of regression coefficients, Journal of the Royal Statistical Society 85, 597-612.
[4] Fisher, R.A. (1924). On a distribution yielding the error functions of several well known statistics, Proceedings of the International Congress of Mathematics 2, 805-813.
[5] Fisher, R.A. (1926). The arrangement of field experiments, Journal of the Ministry of Agriculture, Great Britain 33, 503-513.
[6] Fisher, R.A. (1990). Statistical Inference and Analysis: Selected Correspondence of R.A. Fisher, J.H. Bennett, ed., Oxford University Press, London.
[7] Fisher, R.A. & Mackenzie, W.A. (1923). Studies in crop variation II: the manurial response of different potato varieties, Journal of Agricultural Science 13, 311-320.
[8] Foa, E.B., Rothbaum, B.O., Riggs, D.S. & Murdock, T.B. (1991). Treatment of posttraumatic stress disorder in rape victims: a comparison between cognitive-behavioral procedures and counseling, Journal of Consulting and Clinical Psychology 59, 715-723.
[9] Hald, A. (2000). Studies in the history of probability and statistics XLVII. Pizzetti's contributions to the statistical analysis of normally distributed observations, 1891, Biometrika 87, 213-217.
[10] Neyman, J. & Pearson, E. (1933). The testing of statistical hypotheses in relation to probabilities a priori, Proceedings of the Cambridge Philosophical Society 29, 492-510.
[11] Savage, L.J. (1976). On rereading Fisher, The Annals of Statistics 4, 441-500.

(See also Generalized Linear Models (GLM); History of Multivariate Analysis of Variance; Repeated Measures Analysis of Variance)

RONALD C. SERLIN
Analysis of Variance: Cell Means Approach
ROGER E. KIRK AND B. NEBIYOU BEKELE
Volume 1, pp. 56-66
for computing these sums of squares using vectors, matrices, and a scalar are

SSTOTAL = y'y − y'JyN⁻¹
SSBG = (C'μ̂ − 0)'[C'(X'X)⁻¹C]⁻¹(C'μ̂ − 0)
SSWG = y'y − μ̂'X'y,  (5)

where y is an N × 1 vector of observations and X is an N × p structural matrix that indicates the treatment level in which an observation appears. The X matrix contains ones and zeros such that each row has only one one and each column has as many ones as there are observations from the corresponding population. For the data in Table 1, y and X are

y' = [2 1 6 3 | 3 5 4 4 | 5 7 6 10]  (treatment levels a1, a2, a3)

X = [1 0 0
     1 0 0
     1 0 0
     1 0 0
     0 1 0
     0 1 0
     0 1 0
     0 1 0
     0 0 1
     0 0 1
     0 0 1
     0 0 1]

J is an N × N matrix of ones and is obtained from the product of an N × 1 vector of ones, 1 (column sum vector), and a 1 × N vector of ones, 1' (row sum vector). μ̂ is a p × 1 vector of sample means and is given by

μ̂ = (X'X)⁻¹X'y = diag(1/4, 1/4, 1/4)[12 16 28]' = [3 4 7]'.  (6)

(X'X)⁻¹ is a p × p diagonal matrix whose elements are the inverses of the sample n's for each treatment level.

An advantage of the cell means model is that a representation, C'μ̂ − 0, of the null hypothesis, C'μ = 0, always appears in the formula for SSBG. Hence there is never any ambiguity about the hypothesis that this sum of squares is used to test. Because 0 in C'μ̂ − 0 is a vector of zeros, the formula for SSBG simplifies to

SSBG = (C'μ̂)'[C'(X'X)⁻¹C]⁻¹(C'μ̂).  (7)

The between groups, within groups, and total sums of squares for the data in Table 1 are, respectively,

SSBG = (C'μ̂)'[C'(X'X)⁻¹C]⁻¹(C'μ̂) = 34.6667
SSWG = y'y − μ̂'X'y = 326.0000 − 296.0000 = 30.0000
SSTOTAL = y'y − y'JyN⁻¹ = 326.0000 − 261.3333 = 64.6667.  (8)

The between and within groups mean squares are given by

MSBG = SSBG/(p − 1) = 34.6667/2 = 17.3333
MSWG = SSWG/[p(n − 1)] = 30.0000/[3(4 − 1)] = 3.3333.  (9)

The F statistic and P value are

F = MSBG/MSWG = 17.3333/3.3333 = 5.20,  P = .04.  (10)

The computation of the three sums of squares is easily performed with any computer package that performs matrix operations.

Restricted Cell Means Model

A second form of the cell means model enables a researcher to test a null hypothesis subject to one or more restrictions. This cell means model is called a restricted model. The restrictions represent assumptions about the means of the populations that are sampled. Consider a randomized block ANOVA design where it is assumed that the treatment and blocks do not interact. The restricted cell means model for this design is

Y_ij = μ_ij + ε_i(j)  (i = 1, . . . , n; j = 1, . . . , p),  (11)
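The matrix formulas in (5)-(8) can be verified for the Table 1 data without a matrix package; the sketch below evaluates the SSBG quadratic form with a hand-coded 2 × 2 inverse, using the data vector and contrast matrix given above.

```python
# Table 1 data: p = 3 treatment levels, n = 4 observations per level.
y = [2, 1, 6, 3,   3, 5, 4, 4,   5, 7, 6, 10]
nj = [4, 4, 4]
C = [[1, -1, 0],
     [0, 1, -1]]                    # C' for H0: mu1 = mu2 = mu3
N = len(y)

groups = [y[0:4], y[4:8], y[8:12]]
mu_hat = [sum(g) / len(g) for g in groups]          # (X'X)^-1 X'y = [3, 4, 7]

sstotal = sum(v ** 2 for v in y) - sum(y) ** 2 / N  # y'y - y'Jy/N
sswg = sum(v ** 2 for v in y) - sum(m * sum(g) for m, g in zip(mu_hat, groups))

# SSBG = (C'mu)'[C'(X'X)^-1 C]^-1 (C'mu), with (X'X)^-1 = diag(1/n_j).
v = [sum(c * m for c, m in zip(row, mu_hat)) for row in C]
M = [[sum(a * b / n for a, b, n in zip(r1, r2, nj)) for r2 in C] for r1 in C]
det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
Minv = [[M[1][1] / det, -M[0][1] / det],
        [-M[1][0] / det, M[0][0] / det]]
ssbg = sum(v[i] * Minv[i][j] * v[j] for i in range(2) for j in range(2))
```

This reproduces SSBG = 34.6667, SSWG = 30, and SSTOTAL = 64.6667, and hence F = (34.6667/2)/(30/9) = 5.20, matching (8)-(10).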
The restrictions on μ_ij state that all block-treatment interaction effects equal zero. These restrictions, which are a part of the model, are imposed when the cell n_ij's are equal to one, as in a randomized block design, and it is not possible to estimate error effects separately from interaction effects.

Consider a randomized block design with p = 3 treatment levels and n = 4 blocks. Only four blocks are used because of space limitations; ordinarily a researcher would use many more blocks. The null hypotheses for treatment A and blocks are, respectively,

H0: μ̄·1 − μ̄·2 = 0        H0: μ̄1· − μ̄2· = 0
    μ̄·2 − μ̄·3 = 0            μ̄2· − μ̄3· = 0   (13)
                              μ̄3· − μ̄4· = 0.

In matrix notation, the hypotheses can be written as a hypothesis matrix times a mean vector set equal to a null vector; for treatment A,

H'_A = [1 −1 0; 0 1 −1],  μ̄_A = [μ̄·1 μ̄·2 μ̄·3]',  0_A = [0 0]'.

The randomized block design has np = h sample means. The 1 × h vector of means is μ̂' = [μ̂11 μ̂21 μ̂31 μ̂41 μ̂12 μ̂22 . . . μ̂43]. The coefficient matrices for computing sums of squares for treatment A, blocks, and the block-treatment interaction are denoted by C'_A, C'_BL, and R', respectively. These coefficient matrices are easily obtained from Kronecker products, ⊗, as follows:

C'_A = H'_A ⊗ (1/n)1'_BL = [1 −1 0; 0 1 −1] ⊗ (1/n)[1 1 1 1]
     = (1/n)[1 1 1 1  −1 −1 −1 −1   0  0  0  0;
             0 0 0 0   1  1  1  1  −1 −1 −1 −1]

C'_BL = (1/p)1'_A ⊗ H'_BL = (1/p)[1 1 1] ⊗ [1 −1 0 0; 0 1 −1 0; 0 0 1 −1]
      = (1/p)[1 −1 0 0  1 −1 0 0  1 −1 0 0;
              0 1 −1 0  0 1 −1 0  0 1 −1 0;
              0 0 1 −1  0 0 1 −1  0 0 1 −1]   (15)

R' = H'_A ⊗ H'_BL
   = [1 −1 0 0  −1  1 0 0   0  0 0 0;
      0 1 −1 0   0 −1 1 0   0  0 0 0;
      0 0 1 −1   0 0 −1 1   0  0 0 0;
      0 0 0 0    1 −1 0 0  −1  1 0 0;
      0 0 0 0    0 1 −1 0   0 −1 1 0;
      0 0 0 0    0 0 1 −1   0 0 −1 1],

where 1'_BL is a 1 × n vector of ones and 1'_A is a 1 × p vector of ones. The null hypothesis for treatment A can be expressed as (1/n)C'_A μ = 0_A, where μ is an h × 1 vector of population means and 0_A is a (p − 1) × 1 vector of zeros. The restrictions on the model, μ_ij − μ_i'j − μ_ij' + μ_i'j' = 0 for all i, i', j, and j', can be expressed as R'μ = θ_A, where θ_A is an s = h − p − n + 1 vector of zeros. Without any loss of generality, the fractions in the (1/n)C'_A and (1/p)C'_BL matrices can be eliminated by replacing the fractions with 1 and −1. To test C'_A μ = 0_A subject to the restrictions that R'μ = θ_A, we can form an augmented treatment matrix and an augmented vector of zeros as follows:

Q'_A = [R'; C'_A],  δ_A = [θ_A; 0_A],  (16)
where Q'_A consists of the s rows of the R' matrix and the p − 1 rows of the C'_A matrix that are not identical to the rows of R', inconsistent with them, or linearly dependent on them, and δ_A is an s + p − 1 vector of zeros. The joint null hypothesis Q'_A μ = δ_A combines the restrictions that all interactions are equal to zero with the hypothesis that differences among the treatment means are equal to zero. The sum of squares that is used to test this joint null hypothesis is

SSA = (Q'_A μ̂)'(Q'_A Q_A)⁻¹(Q'_A μ̂) − SSRES,  (17)

where SSRES = (R'μ̂)'(R'R)⁻¹(R'μ̂). To test hypotheses about contrasts among treatment means, restricted cell means rather than unrestricted cell means should be used. The vector of restricted cell means, μ̂_R, is given by

μ̂_R = μ̂ − R(R'R)⁻¹R'μ̂.  (18)

The formula for computing the sum of squares for blocks follows the pattern described earlier for treatment A. We want to test the null hypothesis for blocks, C'_BL μ = 0_BL, subject to the restrictions that R'μ = θ_BL. We can form an augmented matrix for blocks and an augmented vector of zeros as follows:

Q'_BL = [R'; C'_BL],  δ_BL = [θ_BL; 0_BL],  (19)

where Q'_BL consists of the s rows of the R' matrix and the n − 1 rows of the C'_BL matrix that are not identical to the rows of R', inconsistent with them, or linearly dependent on them, and δ_BL is an s + n − 1 vector of zeros. The joint null hypothesis Q'_BL μ = δ_BL combines the restrictions that all interactions are equal to zero with the hypothesis that differences among the block means are equal to zero. The sum of squares that is used to test this joint null hypothesis is

SSBL = (Q'_BL μ̂)'(Q'_BL Q_BL)⁻¹(Q'_BL μ̂) − SSRES.  (20)

The total sum of squares is given by

SSTOTAL = y'y − y'Jyh⁻¹.  (21)

The treatment and block mean squares are given by, respectively, MSA = SSA/(p − 1) and MSBL = SSBL/(n − 1), and the residual mean square is MSRES = SSRES/[(n − 1)(p − 1)]. The F statistics are F = MSA/MSRES and F = MSBL/MSRES.

The previous paragraphs have provided an overview of the restricted cell means model. For many restricted models, the formulas for computing SSA and SSBL can be simplified. The simplified formulas are described by Kirk [2, pp. 290-297]. An important advantage of the cell means model is that it can be used when observations are missing; the procedures for a randomized block design are described by Kirk [2, pp. 297-301]. Another advantage of the cell means model, which we will illustrate in the following section, is that it can be used when there are empty cells in a multitreatment design.

Unrestricted Cell Means Model for a Completely Randomized Factorial Design

The expectation of the classical sum of squares model equation for a two-treatment, completely randomized factorial design is

E(Y_ijk) = μ + α_j + β_k + (αβ)_jk  (i = 1, . . . , n; j = 1, . . . , p; k = 1, . . . , q),  (22)

where Y_ijk is the observation for participant i in treatment combination a_j b_k, μ is the grand mean, α_j is the treatment effect for population j, β_k is the treatment effect for population k, and (αβ)_jk is the interaction of treatment levels j and k. If treatments A and B each have three levels, the classical model contains 16 parameters: μ, α1, α2, α3, β1, β2, β3, (αβ)11, (αβ)12, . . . , (αβ)33. However, only nine cell means are available to estimate these parameters. Thus, the model is overparameterized: it contains more parameters than there are means from which to estimate them. Statisticians have developed a number of ways to get around this problem [4, 5, 8]. Unfortunately, the solutions do not work well when there are missing observations or empty cells.

The cell means model equation for a two-treatment, completely randomized factorial design is

Y_ijk = μ_jk + ε_i(jk)  (i = 1, . . . , n; j = 1, . . . , p; k = 1, . . . , q),  (23)

where μ_jk is the population mean for treatment combination a_j b_k and ε_i(jk) is the error effect that is i.i.d. N(0, σ²). This model has none of the problems associated with overparameterization. A population mean can be estimated for each cell that contains one or more observations.
C'_A = H'_A ⊗ (1/q)1'_B = [1 −1 0; 0 1 −1] ⊗ (1/q)[1 1 1]
     = (1/q)[1 1 1  −1 −1 −1   0  0  0;
             0 0 0   1  1  1  −1 −1 −1]

C'_B = (1/p)1'_A ⊗ H'_B = (1/p)[1 1 1] ⊗ [1 −1 0; 0 1 −1]
     = (1/p)[1 −1 0  1 −1 0  1 −1 0;
             0 1 −1  0 1 −1  0 1 −1]   (26)

C'_AB = H'_A ⊗ H'_B
      = [1 −1 0  −1  1 0   0  0 0;
         0 1 −1   0 −1 1   0  0 0;
         0 0 0    1 −1 0  −1  1 0;
         0 0 0    0  1 −1  0 −1 1]
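The Kronecker-product construction of the coefficient matrices in (26) is mechanical; a minimal sketch follows, where kron is a generic list-of-lists Kronecker helper and the 1/q and 1/p scalar factors are omitted (the article notes they can be dropped without loss of generality).

```python
def kron(A, B):
    """Kronecker product of two matrices stored as lists of lists."""
    return [[a * b for a in ra for b in rb] for ra in A for rb in B]

H_A = [[1, -1, 0], [0, 1, -1]]      # contrasts among the p = 3 levels of A
H_B = [[1, -1, 0], [0, 1, -1]]      # contrasts among the q = 3 levels of B
ones_A = [[1, 1, 1]]                # 1'_A
ones_B = [[1, 1, 1]]                # 1'_B

C_A = kron(H_A, ones_B)    # treatment A coefficients (1/q factor omitted)
C_B = kron(ones_A, H_B)    # treatment B coefficients (1/p factor omitted)
C_AB = kron(H_A, H_B)      # A x B interaction coefficients
```

Each row of C_AB is an interaction contrast of the form mu_jk - mu_j'k - mu_jk' + mu_j'k', applied to the h = 9 cell means.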
…]/MSWCELL = 285.833/62.500 = 4.57, p > .05. Because the test is not significant, the null hypothesis that … for all j remains tenable.

Cell Means Model with Missing Observations and Empty Cells

The cell means model is especially useful for analyzing data when the cell n_jk's are unequal and one or more cells are empty. A researcher can test hypotheses about any linear combination of population means that can be estimated from the data. The challenge facing the researcher is to formulate interesting and interpretable null hypotheses using the means that are available.

Suppose that for reasons unrelated to the treatments, observation Y_511 in Table 2 is missing. When the cell n_jk's are unequal, the researcher has a choice between computing unweighted means or weighted means (see Analysis of Variance: Multiple Regression Approaches). Unweighted means are simple averages of cell means. Hypotheses for these means were described in the previous section. For treatments A and B, the means are given by, respectively,

μ̄_j = Σ_{k=1}^{q} μ_jk/q  and  μ̄_k = Σ_{j=1}^{p} μ_jk/p.  (33)

In general, the use of weighted means is not recommended unless the sample n_jk's are proportional to the population n_jk's.

For the case in which observation Y_511 in Table 2 is missing, null hypotheses for treatment A using unweighted and weighted means are, respectively,

H0: (μ11 + μ12 + μ13)/3 − (μ21 + μ22 + μ23)/3 = 0
    (μ21 + μ22 + μ23)/3 − (μ31 + μ32 + μ33)/3 = 0   (35)

and

H0: (4μ11 + 5μ12 + 5μ13)/14 − (5μ21 + 5μ22 + 5μ23)/15 = 0
    (5μ21 + 5μ22 + 5μ23)/15 − (5μ31 + 5μ32 + 5μ33)/15 = 0.  (36)

The coefficients of the unweighted means are 1/q and 0; the coefficients of the weighted means are n_jk/n_j and 0. The unweighted and weighted coefficient matrices and sums of squares are, respectively,

C'_1(A) = [1/3 1/3 1/3  −1/3 −1/3 −1/3   0 0 0;
           0 0 0   1/3 1/3 1/3  −1/3 −1/3 −1/3]

SSA = (C'_1(A) μ̂)'[C'_1(A)(X'X)⁻¹C_1(A)]⁻¹(C'_1(A) μ̂) = 188.09   (37)

C'_2(A) = [4/14 5/14 5/14  −5/15 −5/15 −5/15   0 0 0;
           0 0 0   5/15 5/15 5/15  −5/15 −5/15 −5/15]

SSA = (C'_2(A) μ̂)'[C'_2(A)(X'X)⁻¹C_2(A)]⁻¹(C'_2(A) μ̂) = 187.51
with 2 degrees of freedom, the number of rows in C'_1(B) and C'_2(B).

When one or more cells are empty, the analysis of the data is more challenging. Consider the police attitude data in Table 3, where observation Y_511 is missing and cells a1b3 and a2b2 are empty.

Table 3 Police recruit attitude data, observation Y511 is missing and cells a1b3 and a2b2 are empty

         a1b1   a1b2   a1b3   a2b1   a2b2   a2b3   a3b1   a3b2   a3b3
         24     44     -      30     -      26     21     41     42
         33     36     -      21     -      27     18     39     52
         37     25     -      39     -      36     10     50     53
         29     27     -      26     -      46     31     36     49
         -      43     -      34     -      45     20     34     64
Mean     30.75  35     -      30     -      36     20     40     52

The hypothesis … is testable because data are available to estimate each of the population means. However, the hypothesis is uninterpretable because different levels of treatment B appear in each row: (b1 and b2) versus (b1 and b3) in the first row and (b1 and b3) versus (b1, b2, and b3) in the second row. The following hypothesis is both testable and interpretable:

H0: (μ11 + μ12)/2 − (μ31 + μ32)/2 = 0
    (μ21 + μ23)/2 − (μ31 + μ33)/2 = 0.  (43)

For a hypothesis to be interpretable, the estimators of population means for each contrast in the hypothesis should share the same levels of the other treatment(s). For example, to estimate μ̄1 = 1/2(μ11 + μ12) and μ̄3 = 1/2(μ31 + μ32), it is necessary to average over b1 and b2 and ignore b3.
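Using the cell means and cell sizes recoverable from Table 3, the sum of squares for the interpretable treatment-A hypothesis (43) can be checked numerically. A sketch with a hand-coded 2 × 2 inverse follows; the cell ordering is the seven nonempty cells a1b1, a1b2, a2b1, a2b3, a3b1, a3b2, a3b3.

```python
# Cell means and sizes for the seven nonempty cells of Table 3.
mu_hat = [30.75, 35.0, 30.0, 36.0, 20.0, 40.0, 52.0]
n_cell = [4, 5, 5, 5, 5, 5, 5]          # Y511 is missing from cell a1b1

# Contrast rows implementing hypothesis (43).
CA = [[0.5, 0.5, 0.0, 0.0, -0.5, -0.5, 0.0],
      [0.0, 0.0, 0.5, 0.5, -0.5, 0.0, -0.5]]

v = [sum(c * m for c, m in zip(row, mu_hat)) for row in CA]   # C'_A mu_hat
M = [[sum(a * b / n for a, b, n in zip(r1, r2, n_cell)) for r2 in CA]
     for r1 in CA]                                            # C'_A (X'X)^-1 C_A
det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
Minv = [[M[1][1] / det, -M[0][1] / det],
        [-M[1][0] / det, M[0][0] / det]]
ssa = sum(v[i] * Minv[i][j] * v[j] for i in range(2) for j in range(2))
```

This reproduces the SSA value of 110.70 with 2 degrees of freedom reported in the article.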
The null hypothesis for treatment A can be expressed in matrix notation as C'_A μ = 0, where

C'_A = [1/2 1/2  0   0  −1/2 −1/2   0;
         0   0  1/2 1/2 −1/2   0  −1/2]

μ' = [μ11 μ12 μ21 μ23 μ31 μ32 μ33].  (44)

The fractions in C'_A can be eliminated by replacing the fractions with 1 and −1. For the data in Table 3, where Y_511 is missing and two of the cells are empty, the sum of squares for testing (43) is

SSA = (C'_A μ̂)'[C'_A(X'X)⁻¹C_A]⁻¹(C'_A μ̂) = 110.70  (45)

with 2 degrees of freedom, the number of rows in C'_A.

Testable and interpretable hypotheses for treatment B and the A × B interaction are, respectively,

H0: 1/2(μ11 + μ31) − 1/2(μ12 + μ32) = 0   (treatment B)
    1/2(μ21 + μ31) − 1/2(μ23 + μ33) = 0   (46)

and

H0: μ11 − μ31 − μ12 + μ32 = 0   (A × B interaction)
    μ21 − μ31 − μ23 + μ33 = 0.  (47)

If there were no empty cells, the null hypothesis for the A × B interaction would have h − p − q + 1 = 9 − 3 − 3 + 1 = 4 interaction terms. However, because of the empty cells, only two of the interaction terms can be tested. If the null hypothesis for the A × B interaction is rejected, we can conclude that at least one function of the form μ_jk − μ_j'k − μ_jk' + μ_j'k' does not equal zero. However, failure to reject the null hypothesis does not imply that all functions of the form μ_jk − μ_j'k − μ_jk' + μ_j'k' equal zero, because we are unable to test two of the interaction terms.

When cells are empty, it is apparent that to test hypotheses, the researcher must be able to state the hypotheses in terms of linearly independent contrasts in C'μ. Thus, the researcher is forced to consider what hypotheses are both interesting and interpretable. This is not the case when the classical ANOVA model is used. When this model is used, the hypotheses that are tested are typically left to a computer package, and the researcher is seldom aware of exactly what hypotheses are being tested.

Unrestricted Cell Means Model for ANCOVA

The cell means model can be used to perform an analysis of covariance (ANCOVA). This application of the model is described using a completely randomized ANCOVA design with N observations, p treatment levels, and one covariate. The adjusted between-groups sum of squares, A_adj, and the adjusted within-groups sum of squares, E_adj, for a completely randomized ANCOVA design are given by

A_adj = (A_yy + E_yy) − (A_zy + E_zy)²/(A_zz + E_zz) − E_adj  and
E_adj = E_yy − (E_zy)²/E_zz.  (48)

The sums of squares in the formula, A_yy, E_yy, A_zy, and so on, can be expressed in matrix notation using the cell means model by defining

A_yy = (C'_A μ̂_y)'(C'_A(X'X)⁻¹C_A)⁻¹(C'_A μ̂_y)
A_zy = (C'_A μ̂_z)'(C'_A(X'X)⁻¹C_A)⁻¹(C'_A μ̂_y)
A_zz = (C'_A μ̂_z)'(C'_A(X'X)⁻¹C_A)⁻¹(C'_A μ̂_z)
μ̂_y = (X'X)⁻¹X'y
μ̂_z = (X'X)⁻¹X'z
E_yy = y'y − μ̂'_y X'y
E_zy = z'y − μ̂'_z X'y
E_zz = z'z − μ̂'_z X'z,  (49)

where μ̂_y is a p × 1 vector of dependent variable cell means, X is an N × p structural matrix, μ̂_z is a p × 1 vector of covariate cell means, y is an N × 1 vector of dependent variable observations, and z is an N × 1 vector of covariates. The adjusted between- and within-groups mean squares are given by, respectively,

MSA_adj = A_adj/(p − 1)  and  MSE_adj = E_adj/(N − p − 1).  (50)
The F statistic is F = MSA_adj/MSE_adj with p − 1 and N − p − 1 degrees of freedom.

The cell means model can be extended to other ANCOVA designs and to designs with multiple covariates (see Analysis of Covariance). Lack of space prevents a description of the computational procedures.

Some Advantages of the Cell Means Model

The simplicity of the cell means model is readily apparent. There are only two models for all ANOVA designs: an unrestricted model and a restricted model. Furthermore, only three kinds of computational formulas are required to compute all sums of squares: treatment and interaction sums of squares have the general form (C'μ̂)'[C'(X'X)⁻¹C]⁻¹(C'μ̂), within-groups and within-cells sums of squares have the form y'y − μ̂'X'y, and the total sum of squares has the form y'y − y'JyN⁻¹.

The cell means model has an important advantage relative to the classical overparameterized model: the ease with which experiments with missing observations and empty cells can be analyzed. In the classical model, questions arise regarding which functions are estimable and which hypotheses are testable. However, these are nonissues with the cell means model. There is never any confusion about what functions of the means are estimable and what their best linear unbiased estimators are. And it is easy to discern which hypotheses are testable. Furthermore, the cell means model is always of full rank, with the X'X matrix being a diagonal matrix whose elements are the cell sample sizes. The number of parameters in the model exactly equals the number of cells that contain one or more observations.

There is never any confusion about what null hypotheses are tested by treatment and interaction mean squares. The researcher specifies the hypothesis of interest when the contrasts in C'μ are specified. Hence, a sample representation of the null hypothesis always appears in formulas for treatment and interaction mean squares. Finally, the cell means model gives the researcher great flexibility in analyzing data because hypotheses about any linear combination of available cell means can be tested (see Multiple Comparison Procedures).

References

[1] Hocking, R.R. & Speed, F.M. (1975). A full rank analysis of some linear model problems, Journal of the American Statistical Association 70, 706-712.
[2] Kirk, R.E. (1995). Experimental Design: Procedures for the Behavioral Sciences, 3rd Edition, Brooks/Cole, Pacific Grove.
[3] Milliken, G.A. & Johnson, D.E. (1984). Analysis of Messy Data, Vol. 1: Designed Experiments, Wadsworth, Belmont.
[4] Searle, S.R. (1971). Linear Models, Wiley, New York.
[5] Searle, S.R. (1987). Linear Models for Unbalanced Data, Wiley, New York.
[6] Speed, F.M. (June 1969). A new approach to the analysis of linear models, NASA Technical Memorandum, NASA TM X-58030.
[7] Speed, F.M., Hocking, R.R. & Hackney, O.P. (1978). Methods of analysis of linear models with unbalanced data, Journal of the American Statistical Association 73, 105-112.
[8] Timm, N.H. & Carlson, J.E. (1975). Analysis of variance through full rank models, Multivariate Behavioral Research Monographs, No. 75-1.
[9] Urquhart, N.S., Weeks, D.L. & Henderson, C.R. (1973). Estimation associated with linear models: a revisitation, Communications in Statistics 1, 303-330.
[10] Woodward, J.A., Bonett, D.G. & Brecht, M. (1990). Introduction to Linear Models and Experimental Design, Harcourt Brace Jovanovich, San Diego.

(See also Regression Model Coding for the Analysis of Variance)

ROGER E. KIRK AND B. NEBIYOU BEKELE
Analysis of Variance: Classification
ROGER E. KIRK
Volume 1, pp. 66-83
[Figure: design layout showing Groups 1-9 of five participants each, with each group assigned to a treatment combination a_j b_k c_l (a1b1c1, a1b2c3, a1b3c2, a2b1c2, . . . , a3b3c1) and a cell mean Y·jkl]

Figure 5 Generalized randomized block design (GRB-3 design) with p = 3 treatment levels and w = 5 groups of np = (2)(3) = 6 homogeneous participants. The six participants in each group were randomly assigned to the three treatment levels with the restriction that two participants were assigned to each level

The advantage of the Latin square design is the ability to isolate two nuisance variables to obtain greater power to reject a false null hypothesis. The disadvantages are (a) the number of treatment levels, rows, and columns must be equal, a balance that may be difficult to achieve; (b) if there are any interactions among the treatment levels, rows, and columns, the test of the treatment is positively biased; and (c) the randomization is relatively complex.

Three building block designs have been described that provide the organizational framework for the classification scheme and nomenclature in this article. The following ANOVA designs are extensions of or variations of one of the building block designs or a combination of two or more building block designs.
Hence, the design has p × q treatment combinations, a1b1, a1b2, . . . , apbq. The layout for a CRF-23 design with p = 2 levels of treatment A and q = 3 levels of treatment B is shown in Figure 11. In this example, 30 participants are randomly assigned to the 2 × 3 = 6 treatment combinations with the restriction that n = 5 participants are assigned to each combination.

The total sum of squares and total degrees of freedom for the design are partitioned as follows:

SSTOTAL = SSA + SSB + SSA×B + SSWCELL

. . . (treatment B population means are equal)

H0: μjk − μj'k − μjk' + μj'k' = 0 for all j, j', k, and k'
(treatments A and B do not interact)   (17)

The F statistics are

F = [SSA/(p − 1)] / {SSWCELL/[pq(n − 1)]} = MSA/MSWCELL

F = [SSB/(q − 1)] / {SSWCELL/[pq(n − 1)]} = MSB/MSWCELL
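As a quick numerical check of this partition, here is a minimal NumPy sketch (the helper name and the data are ours, purely illustrative, not from the article) that computes the CRF-pq sums of squares and F ratios:

```python
import numpy as np

def crf_anova(y):
    """Partition and F statistics for a balanced CRF-pq design.
    `y` has shape (p, q, n): p levels of A, q levels of B, n scores per cell."""
    p, q, n = y.shape
    grand = y.mean()
    a_means = y.mean(axis=(1, 2))
    b_means = y.mean(axis=(0, 2))
    cell_means = y.mean(axis=2)
    ss_a = q * n * ((a_means - grand) ** 2).sum()
    ss_b = p * n * ((b_means - grand) ** 2).sum()
    ss_ab = n * ((cell_means - a_means[:, None] - b_means[None, :] + grand) ** 2).sum()
    ss_wcell = ((y - cell_means[:, :, None]) ** 2).sum()
    ms_wcell = ss_wcell / (p * q * (n - 1))
    return {"SSA": ss_a, "SSB": ss_b, "SSAxB": ss_ab, "SSWCELL": ss_wcell,
            "FA": (ss_a / (p - 1)) / ms_wcell,
            "FB": (ss_b / (q - 1)) / ms_wcell,
            "FAxB": (ss_ab / ((p - 1) * (q - 1))) / ms_wcell}

# CRF-23 shape from the text: p = 2, q = 3, n = 5 participants per combination
rng = np.random.default_rng(42)
y = rng.normal(size=(2, 3, 5))
res = crf_anova(y)
```

For the CRF-23 example in the text, `y` would hold the 2 × 3 × 5 = 30 observed scores.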
Figure 12 Layout for a two-treatment, randomized block factorial design (RBF-32 design)
have smoked. The two participants who have smoked for the shortest time are assigned to one block, the next two smokers to another block, and so on. This procedure produces 30 blocks in which the two smokers in a block are similar in terms of the length of time they have smoked. In the first stage of randomization, the 30 blocks are randomly assigned to the three levels of treatment A with the restriction that 10 blocks are assigned to each level of treatment A. In the second stage of randomization, the two smokers in each block are randomly assigned to the two levels of treatment B with the restriction that b1 and b2 appear equally often in each level of treatment A. An exception to this randomization procedure must be made when treatment B is a temporal variable, such as successive learning trials or periods of time. Trial two, for example, cannot occur before trial one.

The layout for this split-plot factorial design with three levels of treatment A and two levels of treatment B is shown in Figure 13. The total sum of squares and total degrees of freedom are partitioned as follows:

SSTOTAL = SSA + SSBL(A) + SSB + SSA×B + SSRESIDUAL

npq − 1 = (p − 1) + p(n − 1) + (q − 1) + (p − 1)(q − 1) + p(n − 1)(q − 1),   (22)

where SSBL(A) denotes the sum of squares of blocks that are nested in the p levels of treatment A. Three null hypotheses can be tested.

H0: μ.1. = μ.2. = μ.3.   (treatment A population means are equal)

H0: μ..1 = μ..2   (treatment B population means are equal)

H0: μ.jk − μ.j'k − μ.jk' + μ.j'k' = 0 for all j, j', k, and k'
(treatments A and B do not interact)   (23)

The F statistics are

F = [SSA/(p − 1)] / {SSBL(A)/[p(n − 1)]} = MSA/MSBL(A)

F = [SSB/(q − 1)] / {SSRESIDUAL/[p(n − 1)(q − 1)]} = MSB/MSRESIDUAL

F = [SSA×B/((p − 1)(q − 1))] / {SSRESIDUAL/[p(n − 1)(q − 1)]} = MSA×B/MSRESIDUAL.   (24)

Treatment A is called a between-blocks effect. The error term for testing between-blocks effects is MSBL(A). Treatment B and the A×B interaction are within-blocks effects. The error term for testing the within-blocks effects is MSRESIDUAL. The designation for a two-treatment, split-plot factorial design is SPF-p.q. The p preceding the dot denotes the number of levels of the between-blocks treatment; the q after the dot denotes the number of levels of the within-blocks treatment. Hence, the design in Figure 13 is an SPF-3.2 design. A careful examination of the randomization and layout of the between-blocks effects reveals that they resemble those for a CR-3 design. The randomization and layout of the within-blocks effects at each level of treatment A resemble those for an RB-2 design.

                     Treat. comb.
                     b1      b2
a1  Group1  Block1      a1b1    a1b2
            ...
            Blockn      a1b1    a1b2    Y.1.
a2  Group2  Blockn+1    a2b1    a2b2
            ...
            Block2n     a2b1    a2b2    Y.2.
a3  Group3  Block2n+1   a3b1    a3b2
            ...
            Block3n     a3b1    a3b2    Y.3.
                        Y..1    Y..2

Figure 13 Layout for a two-treatment, split-plot factorial design (SPF-3.2 design). Treatment A is confounded with groups

The block size of the SPF-3.2 design in Figure 13 is two. The RBF-32 design in Figure 12 contains the same 3 × 2 = 6 treatment combinations, but the block size is six. The advantage of the split-plot factorial, the smaller block size, is achieved by confounding groups of blocks with treatment A. Consider the sample means Y.1., Y.2., and Y.3. in Figure 13. The differences among the means reflect the differences among the three groups of smokers as well as the differences among the three levels of treatment A. To put it another way, we cannot tell how much of the differences among the three sample means is attributable to the differences among Group1, Group2, and Group3, and how much is attributable to the differences among treatment levels a1, a2, and a3. For this reason, the groups and treatment A are said to be completely confounded (see Confounding Variable).

The use of confounding to reduce the block size in an SPF-p.q design involves a trade-off that needs to be made explicit. The RBF-32 design uses the same error term, MSRESIDUAL, to test hypotheses for treatments A and B and the A×B interaction. The two-treatment, split-plot factorial design, however, uses two error terms: MSBL(A) is used to test treatment A; a different and usually much smaller error term, MSRESIDUAL, is used to test treatment B and the A×B interaction. As a result, the power of the tests of treatment B and the A×B interaction is greater than that for treatment A. Hence, a split-plot factorial design is a good design choice if a researcher is more interested in treatment B and the A×B interaction than in treatment A. When both treatments and the A×B interaction are of equal interest, a randomized block factorial design is a better choice if the larger block size is acceptable. If a large block size is not acceptable and the researcher is primarily interested in treatments A and B, an alternative design choice is the confounded factorial design. This design, which is described next, achieves a reduction in block size by confounding groups of blocks with the A×B interaction. As a result, tests of treatments A and B are more powerful than the test of the A×B interaction.

Confounded Factorial Designs

Confounded factorial designs are constructed from either randomized block designs or Latin square designs. A simple confounded factorial design is denoted by RBCF-p^k. The RB in the designation indicates the building block design, C indicates that an interaction is completely confounded with groups, k indicates the number of treatments, and p indicates the number of levels of each treatment.

                      Treat. comb.
(AB)jk   Group1  Block1      a1b1    a2b2
                 ...
                 Blockn      a1b1    a2b2    Y.111 + Y.221
(AB)jk'  Group2  Blockn+1    a1b2    a2b1
                 ...
                 Block2n     a1b2    a2b1    Y.122 + Y.212

Figure 14 Layout for a two-treatment, randomized block confounded factorial design (RBCF-2^2 design). The A×B interaction is confounded with groups

The layout for an RBCF-2^2 design is shown in Figure 14. The total sum of squares and total degrees of freedom are partitioned as follows:

SSTOTAL = SSA×B + SSBL(G) + SSA + SSB + SSRESIDUAL

nvw − 1 = (w − 1) + w(n − 1) + (p − 1) + (q − 1) + w(n − 1)(v − 1),   (25)

where SSBL(G) denotes the sum of squares of blocks that are nested in the w groups and v denotes the number of combinations of treatments A and B in each block. Three null hypotheses can be tested.

H0: μ.jk. − μ.j'k. − μ.jk'. + μ.j'k'. = 0 for all j, j', k, and k'
(treatments A and B do not interact)

H0: μ.1.. = μ.2..   (treatment A population means are equal)

H0: μ..1. = μ..2.   (treatment B population means are equal)   (26)

The F statistics are

F = [SSA×B/(w − 1)] / {SSBL(G)/[w(n − 1)]} = MSA×B/MSBL(G)

F = [SSA/(p − 1)] / {SSRESIDUAL/[w(n − 1)(v − 1)]} = MSA/MSRESIDUAL

F = [SSB/(q − 1)] / {SSRESIDUAL/[w(n − 1)(v − 1)]}
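The split-plot partition (22) and F ratios (24) can be checked the same way. A sketch under the same caveats (synthetic data, illustrative helper names):

```python
import numpy as np

def spf_anova(y):
    """Partition (22) and F ratios (24) for an SPF-p.q split-plot design.
    `y` has shape (p, n, q): n blocks nested in each level of A, q levels of B."""
    p, n, q = y.shape
    grand = y.mean()
    a_means = y.mean(axis=(1, 2))
    b_means = y.mean(axis=(0, 1))
    block_means = y.mean(axis=2)           # one mean per block
    ab_means = y.mean(axis=1)              # (p, q) cell means
    ss_a = n * q * ((a_means - grand) ** 2).sum()
    ss_bl_a = q * ((block_means - a_means[:, None]) ** 2).sum()
    ss_b = p * n * ((b_means - grand) ** 2).sum()
    ss_ab = n * ((ab_means - a_means[:, None] - b_means[None, :] + grand) ** 2).sum()
    ss_res = ((y - grand) ** 2).sum() - ss_a - ss_bl_a - ss_b - ss_ab
    ms_res = ss_res / (p * (n - 1) * (q - 1))
    return {"FA": (ss_a / (p - 1)) / (ss_bl_a / (p * (n - 1))),  # between-blocks test
            "FB": (ss_b / (q - 1)) / ms_res,                     # within-blocks tests
            "FAxB": (ss_ab / ((p - 1) * (q - 1))) / ms_res,
            "parts": (ss_a, ss_bl_a, ss_b, ss_ab, ss_res)}

# SPF-3.2 shape as in the smoking example: 3 levels of A, 10 blocks each, 2 levels of B
rng = np.random.default_rng(7)
y = rng.normal(size=(3, 10, 2)) + rng.normal(size=(3, 10, 1))  # shared block effects
res = spf_anova(y)
```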
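A companion sketch for the RBCF-2^2 partition (25), again with synthetic data, and with the group-to-combination assignment chosen so that the A×B interaction is confounded with groups:

```python
import numpy as np

rng = np.random.default_rng(3)
w, n, v = 2, 10, 2   # w groups of n blocks; v = 2 treatment combinations per block
# Group 1 blocks receive (a1, b1) and (a2, b2); group 2 blocks (a1, b2) and (a2, b1)
combos = {0: [(0, 0), (1, 1)], 1: [(0, 1), (1, 0)]}
recs = []
for g in range(w):
    for k in range(n):
        blk = rng.normal()                       # random block effect
        for a, b in combos[g]:
            recs.append((g, k, a, b, 10 + 0.5 * a - 0.3 * b + blk + rng.normal()))
g_, k_, a_, b_, y_ = (np.array(c, dtype=float) for c in zip(*recs))

grand = y_.mean()
ss_ab = n * v * sum((y_[g_ == g].mean() - grand) ** 2 for g in range(w))
ss_bl_g = v * sum((y_[(g_ == g) & (k_ == k)].mean() - y_[g_ == g].mean()) ** 2
                  for g in range(w) for k in range(n))
ss_a = w * n * sum((y_[a_ == j].mean() - grand) ** 2 for j in range(2))
ss_b = w * n * sum((y_[b_ == j].mean() - grand) ** 2 for j in range(2))
ss_res = ((y_ - grand) ** 2).sum() - ss_ab - ss_bl_g - ss_a - ss_b
f_ab = (ss_ab / (w - 1)) / (ss_bl_g / (w * (n - 1)))   # tested against MSBL(G)
ms_res = ss_res / (w * (n - 1) * (v - 1))
f_a, f_b = ss_a / ms_res, ss_b / ms_res                # p - 1 = q - 1 = 1
```

Because each block contains each level of A and of B exactly once, the main effects stay orthogonal to blocks; only the A×B contrast is sacrificed to the between-groups stratum.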
(treatment A population means are equal or the B×C interaction is zero)

H0: μ.1. = μ.2.
(treatment B population means are equal or the A×C interaction is zero)

H0: μ..1 = μ..2
(treatment C population means are equal or the A×B interaction is zero)   (29)

In summary, the main advantage of a fractional factorial design is that it enables a researcher to efficiently investigate a large number of treatments in an initial experiment, with subsequent experiments designed to focus on the most promising lines of investigation or to clarify the interpretation of the original analysis. Many researchers would consider ambiguity in interpreting the outcome of the initial experiment a small price to pay for the reduction in experimental effort.

A wide array of designs is available to researchers. Hence, it is important to clearly identify designs in research reports. One often sees statements such as "a two-treatment, factorial design was used." It should be evident that a more precise description is required. This description could refer to 10 of the 11 factorial designs in Table 1.

References

[1] Anderson, N.H. (2001). Empirical Direction in Design and Analysis, Lawrence Erlbaum, Mahwah.
[2] Bogartz, R.S. (1994). An Introduction to the Analysis of Variance, Praeger, Westport.
[3] Cochran, W.G. & Cox, G.M. (1957). Experimental Designs, 2nd Edition, John Wiley, New York.
[4] Cobb, G.W. (1998). Introduction to Design and Analysis of Experiments, Springer-Verlag, New York.
[5] Federer, W.T. (1955). Experimental Design: Theory and Application, Macmillan, New York.
[6] Harris, R.J. (1994). ANOVA: An Analysis of Variance Primer, F. E. Peacock, Itasca.
[7] Hicks, C.R. & Turner Jr, K.V. (1999). Fundamental Concepts in the Design of Experiments, Oxford University Press, New York.
[8] Jones, B. & Kenward, M.G. (2003). Design and Analysis of Cross-over Trials, 2nd Edition, Chapman & Hall, London.
[9] Keppel, G. (1991). Design and Analysis: A Researcher's Handbook, 3rd Edition, Prentice-Hall, Englewood Cliffs.
[10] Kirk, R.E. (1995). Experimental Design: Procedures for the Behavioral Sciences, 3rd Edition, Brooks/Cole, Pacific Grove.
[11] Maxwell, S.E. & Delaney, H.D. (2004). Designing Experiments and Analyzing Data, 2nd Edition, Lawrence Erlbaum, Mahwah.
[12] Winer, B.J., Brown, D.R. & Michels, K.M. (1991). Statistical Principles in Experimental Design, 3rd Edition, McGraw-Hill, New York.

(See also Generalized Linear Mixed Models; Linear Multilevel Models)

ROGER E. KIRK
Analysis of Variance: Multiple Regression Approaches
RICHARD J. HARRIS
Volume 1, pp. 83–93
Table 1 Amount Allocated to 10-Interchange Partner as f(Allocation Task)

X = Allocn   Y = P10Outc      X = Allocn   Y = P10Outc
1            $1.70            2            $1.50*
1            1.00             2            0.50
1            0.07             2            1.50
1            0.00             2            1.50
1            1.30             2            1.00
1            3.00             2            2.10
1            0.50             2            2.00
1            2.00             2            0.50
1            1.70             2            1.50
1            0.50             2            0.00

Notes: X = 1 indicates that the group was asked to allocate the final shares of a prize directly. X = 2 indicates that they were asked how much of a room fee should be subtracted from each partner's individual contribution to determine his or her final share of the prize.
Source: Raw data supplied by first author of [3].

with 18 df, p < 0.01. Thus, we can be quite confident that having a higher score on X (i.e., being one of the groups in the expenses-allocation condition) is, in the population, associated with having a lower score on Y (i.e., with recommending a lower final outcome for the person in your group who worked on the most difficult problems).

Had we simply omitted the two columns of scores on X, the data would look just like the usual setup for a test of the difference between two independent means, with the left-hand column of Y scores providing group 1's recommendations and the right-hand column of Y scores giving group 2's scores on that dependent variable. Applying the usual formula for an independent-means t Test gives

t = (Y̅1 − Y̅2) / sqrt{ [Σ(Y1 − Y̅1)² + Σ(Y2 − Y̅2)²]/(n1 + n2 − 2) × (1/n1 + 1/n2) }
  = [1.177 − (−0.290)] / sqrt{ [(8.1217 + 18.0690)/18] × (1/10 + 1/10) }
  = 1.467/0.53945 = 2.719   (2)

with n1 + n2 − 2 = 18 df, p < 0.01. Thus, we can be quite confident that, in the population, groups who determine final outcomes only indirectly, by allocation of shares of the room fee, give lower outcomes to the partner who deals with the most difficult anagrams than do groups who determine final outcomes directly. That is, having a high score on X is associated with having a lower Y score, which is what we concluded on the basis of the test of our correlation coefficient.

The formula for computing the independent-means t from the correlation coefficient is just the formula given earlier for testing the statistical significance of a correlation coefficient. Inverting that relationship gives us

rXY = t / sqrt(t² + df),   (3)

where df = the degrees of freedom for the t Test, namely, n1 + n2 − 2. Thus, in the present example,

rXY = 2.719 / sqrt(7.3930 + 18) = 2.719/5.03915 = 0.5396.   (4)

An Application of the Above Relationship

An important application of the above relationship is that it reminds us of the distinction between statistical significance (confidence that we've got the sign of an effect right) and substantive significance (the estimated magnitude of an effect) (see Effect Size Measures). For example, ASRT's Environmental Scan of the Radiation Therapists' Workplace found that the mean preference for a great work environment over a great salary was significantly greater among the staff-therapist-sample respondents who still held a staff or senior-staff therapist title than among those who had, between the time of their most recent certification renewal and the arrival of the questionnaire, moved on to another position, primarily medical dosimetrist or a managerial position within the therapy suite. The t for the difference between the means was 3.391 with 1908 df, p < 0.001. We can be quite confident that the difference between the corresponding population means is in the same direction. However, using the above formula tells us that the correlation between the still-staff-therapist versus moved-on distinction is r = 3.391/sqrt(1919.5) = 0.0774. Hence, the distinction accounts for (0.0774)² = 0.6%, which is less than one percent of the variation among the respondents in their work environment versus salary preferences.

As this example shows, it is instructive to convert experimental statistics into correlational statistics, that is, to convert t's into the corresponding r²'s. The resulting number can come as a surprise; many of the r²'s for statistically significant differences in means will be humblingly low.

Equivalence Between Analysis of Variance and Multiple Regression with Level-membership or Contrast Predictors

Consider the hypothetical, but not atypical, data shown in Table 2. Note that within each college, the female faculty members' mean salary exceeds the male faculty members' mean salary by $5000 to $10 000. On the other hand, the female faculty is concentrated in the low-paying College of Education, while a slight majority of the male faculty is in the high-paying College of Medicine. As a result, whether on average female faculty are paid more or less than male faculty depends on what sort of mean we use to define "on average." An examination of the unweighted mean salaries (cf. the Unweighted mean row toward the bottom of the table) of the males and females in the three colleges (essentially a per-college mean) indicates that female faculty make, on average, $6667 more per year than do male faculty. If instead we compute for each gender the weighted mean of the three college means, weighting by the number of faculty members of that gender in each college, we find that the female faculty members make, on average, $6890 less than do male faculty members (see Type I, Type II and Type III Sums of Squares). These results are the same as those we would have obtained had we simply computed the mean salary on a per-individual basis, ignoring the college in which the faculty member taught.

For the present purposes, this reversal paradox (see Odds and Odds Ratios) [6] helps sharpen the contrast among alternative ways of using multiple linear regression analysis (MRA) to analyze data that would normally be analyzed using ANOVA. This, in turn, sheds considerable light on the choice we must make among alternative models when carrying out a factorial ANOVA for unbalanced designs. In an unbalanced design, the percentage of the observations for a given level of factor A differs across the various levels of factor B. The choice we must make is usually thought of as bearing primarily on whether a given effect in the ANOVA design is statistically significant. However, the core message of this presentation is that it is also, and in my opinion more importantly, a choice of what kinds of means are being compared in determining the significance of that effect. Specifically, a completely uncorrected model involves comparisons among the weighted means and is thus, for main effects, equivalent to carrying out a one-way ANOVA for a single factor, ignoring all other factors. Furthermore, the analysis makes no attempt to correct for confounds with those other factors. A completely uncorrected model is equivalent to testing a regression equation that includes only the contrast variables for the particular effect being tested.

Table 2

                    Males                       Females
College        Mean^a   Std. Dev.   n      Mean^a   Std. Dev.   n      Unweighted mean   Weighted mean
Engineering    30       1.491       55     35       1.414       5      32.5              30.416
Medicine       50       1.423       80     60       1.451       20     55.0              52
Education      20       1.451       20     25       1.423       80     22.5              24
Unweighted mean   33.333                   40                          36.667
Weighted mean     39.032                   32.142                      36.026

^a Salaries expressed in thousands of dollars.
Source: Adapted from [1] (Example 4.5) by permission of author/copyright holder.
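The reversal is easy to reproduce from the cell means and ns of Table 2; a minimal sketch (pure Python; the helper name is ours):

```python
# Cell means (in $1000s) and cell sizes from Table 2
cells = {"Engineering": {"M": (30, 55), "F": (35, 5)},
         "Medicine":    {"M": (50, 80), "F": (60, 20)},
         "Education":   {"M": (20, 20), "F": (25, 80)}}

def gender_means(sex):
    pairs = [cells[college][sex] for college in cells]
    unweighted = sum(m for m, _ in pairs) / len(pairs)                   # per-college mean
    weighted = sum(m * n for m, n in pairs) / sum(n for _, n in pairs)   # per-person mean
    return unweighted, weighted

m_unw, m_wtd = gender_means("M")
f_unw, f_wtd = gender_means("F")
# Unweighted: women earn about $6667 more; weighted: women earn about $6890 less.
```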
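Equations (3) and (4), and the ASRT example, can likewise be verified in a few lines (a sketch using only the numbers quoted in the text):

```python
import math

def r_from_t(t, df):
    """Equation (3): correlation equivalent of an independent-means t."""
    return t / math.sqrt(t * t + df)

r_room = r_from_t(2.719, 18)      # the room-fee experiment: a sizable correlation
r_asrt = r_from_t(3.391, 1908)    # the ASRT survey: highly significant, tiny effect
share = r_asrt ** 2               # proportion of variance accounted for
```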
A fully corrected model involves comparisons among the unweighted means. A fully corrected model is equivalent to testing each effect on the basis of the increment to R² that results from adding the contrast variables representing that effect last, after the contrast variables for all other effects have been entered. And in-between models, where any given effect is corrected for confounds with from zero to all other effects, involve contrasts that are unlikely to be interesting or to correspond to questions that the researcher wants answered. I will note one exception to this general condemnation of in-between models, though, only as an intermediate step along the way to a model that corrects only for interactions of a given order or lower.

You can replicate the analyses I am about to use to illustrate the above points by entering the following variable names and values into an SPSS data editor (aka .sav file). One advantage of using hypothetical data is that we can use lots of identical scores and thus condense the size of our data file by employing the Weight by function in SPSS (Table 3).

Table 3 Raw Data, Hypothetical U Faculty Salaries

Group     College   Gender   Sal      nij
Engnr-M   1         1        30.000   25
          1         1        28.000   15
          1         1        32.000   15
Engnr-F   1         2        33.000   1
          1         2        35.000   3
          1         2        37.000   1
Medcn-M   2         1        48.000   20
          2         1        50.000   40
          2         1        52.000   20
Medcn-F   2         2        58.000   5
          2         2        60.000   10
          2         2        62.000   5
Educn-M   3         1        18.000   5
          3         1        20.000   10
          3         1        22.000   5
Educn-F   3         2        23.000   20
          3         2        25.000   40
          3         2        27.000   20

One-way, Independent-means ANOVA

First, I will conduct a one-way ANOVA of the effects of College, ignoring for now information about the gender of each faculty member. Submitting the following SPSS commands

Title Faculty Salary example .
Weight by nij .
Subtitle Oneway for college effect .
Manova sal by college (1,3) /
  Print = cellinfo (means) signif (univ) design (solution) /
  Design /
  Contrast (college) = special (1 1 1, -1 2 -1, 1 0 -1) /
  Design = college (1), college (2).
Weight off .

yields (in part) the output as shown in Tables 4 and 5. Notice that the means are identical to those listed in the Weighted mean column of the data table shown previously.

Table 5 Tests of significance for SAL using unique sums of squares; analysis of variance design 1

Source of Variation   SS         DF    MS         F         Sig of F
WITHIN CELLS          2642.58    257   10.28
COLLEGE               41854.17   2     20927.08   2035.23   0.000
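The Table 5 output can be reproduced without SPSS. A NumPy sketch that expands the Table 3 frequencies into the 260 individual cases and runs the same one-way ANOVA:

```python
import numpy as np

# (college, gender, salary in $1000s, nij): the rows of Table 3
cells = [(1, 1, 30, 25), (1, 1, 28, 15), (1, 1, 32, 15),
         (1, 2, 33, 1),  (1, 2, 35, 3),  (1, 2, 37, 1),
         (2, 1, 48, 20), (2, 1, 50, 40), (2, 1, 52, 20),
         (2, 2, 58, 5),  (2, 2, 60, 10), (2, 2, 62, 5),
         (3, 1, 18, 5),  (3, 1, 20, 10), (3, 1, 22, 5),
         (3, 2, 23, 20), (3, 2, 25, 40), (3, 2, 27, 20)]
reps = [nij for *_, nij in cells]
college = np.repeat([c for c, *_ in cells], reps)
sal = np.repeat([float(s) for _, _, s, _ in cells], reps)

grand = sal.mean()
groups = [sal[college == j] for j in (1, 2, 3)]
ss_college = sum(len(g) * (g.mean() - grand) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
df_w = sal.size - 3
F = (ss_college / 2) / (ss_within / df_w)   # matches Table 5: F = 2035.23
```

The same partition underlies the SPSS MANOVA output; only the bookkeeping differs.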
To accomplish the same overall significance test using MRA, we need a set of two predictor variables (more generally, k − 1, where k = number of independent groups) that together completely determine the college in which a given faculty member works. There are many alternative sets of predictor variables (see Regression Model Coding for the Analysis of Variance and Type I, Type II and Type III Sums of Squares), but it is generally most useful to construct contrast variables. This is accomplished by choosing a set of k − 1 interesting or relevant contrasts among our k means. We then set each case's score on a given contrast equal to the contrast coefficient that has been assigned to the group within which that case falls. Let us use the contrast between the high-paying College of Medicine and the other two colleges as our first contrast (labeled medvoth in our SPSS data file) and the contrast between the Colleges of Engineering and Education (labeled engveduc) as our second contrast. This yields the expanded .sav file as shown in Table 6.

Recall that contrast coefficients are defined only up to a multiplicative constant. For example, Med − 0.5 Engn − 0.5 Educ = 0 if and only if 2 Med − Engn − Educ = 0. The extra three columns in the expanded .sav file above give contrasts for the Gender main effect and the interaction between Gender and each of the two College contrasts; more about these later. We then run an MRA of Salary predicted from MedvOth and EngvEduc, for example, by submitting the following SPSS commands.

Weight off .
Subtitle One-way for College and College-First Sequential .
Weight by nij .
Regression variables = sal mfcontr medvoth engveduc gbcoll1 gbcoll2 /
  Statistics = defaults cha /
  Dep = sal / enter medvoth engveduc /
  enter gbcoll1 gbcoll2 /
  enter mfcontr / .
Weight off .

Only the first step of this stepwise regression analysis, that in which only the two College contrasts have been entered, is relevant to our one-way ANOVA. I will discuss the remaining two steps shortly. The resulting SPSS run yields, in part, the output as shown in Tables 7 and 8.

Table 7 Model Summary

Model   R        R²      Adjusted R²
1       0.970a   0.941   0.940

a Predictors: (Constant), ENGVEDUC, MEDVOTH.

Table 8 ANOVAa

Model          Sum of Squares   df    Mean Square   F          Sig.
1 Regression   41854.167        2     20927.083     2035.228   0.000b
  Residual     2642.583         257   10.282
  Total        44496.750        259

a Dependent Variable: Salary (thousands of dollars). b Predictors: (Constant), ENGVEDUC, MEDVOTH.

The test for the statistical significance of R² consists of comparing

F = [R²/(k − 1)] / [(1 − R²)/(N − k)]

to the critical value for an F with k − 1 and N − k df, yielding the F of 2035.228 reported in the above table, which matches the overall F for College from our earlier ANOVA table.

But, what has been gained for the effort involved in defining contrast variables? The MRA output continued with the listing of regression coefficients (Table 9) and tests of the statistical significance thereof.

Table 9 Coefficientsa

               Unstandardized Coefficients   Standardized Coefficients
Model          B        Std. Error           Beta     t         Sig.
1 (Constant)   35.472   0.205                         173.311   0.000
  MEDVOTH      8.264    0.138                0.922    59.887    0.000
  ENGVEDUC     3.208    0.262                0.189    12.254    0.000

a Dependent Variable: Salary (thousands of dollars).

We want to compare these MRA-derived t Tests for the two contrasts to the corresponding ANOVA output (Table 10) generated by our Contrast subcommand in conjunction with the expanded design statement naming the two contrasts.

Table 10 Tests of Significance for SAL using UNIQUE sums of squares; analysis of variance design 2

Source of Variation   SS         DF    MS         F         Sig of F
WITHIN + RESIDUAL     2642.58    257   10.28
COLLEGE(1)            36877.60   1     36877.60   3586.47   0.000
COLLEGE(2)            1544.01    1     1544.01    150.16    0.000

Recall that the square of a t is an F with 1 df in the numerator. We see that the t of 59.887 for the significance of the difference in mean salary between the College of Medicine and the average of the other two colleges corresponds to an F of 3586.45. This value is equal, within round-off error, to the ANOVA-derived value. Also, the square of the t for the Engineering versus Education contrast, (12.254)² = 150.16, matches equally closely the ANOVA-derived F. Notice, too, that the unstandardized regression coefficients for the two contrasts are directly proportional to the signs and magnitudes of the corresponding contrasts: 3.208 (the regression coefficient for EngvEduc) is exactly half the difference between those two means, and 8.264 (the coefficient for MedvOth) exactly equals one-sixth of 2(mean for Medicine) − (mean for Engineering) − (mean for Education). The divisor in each case is the sum of the squared contrast coefficients.

Factorial ANOVA via MRA

We will add to the data file three more contrast variables: MFContr to represent the single-df Gender effect and GbColl1 and GbColl2 to represent the interaction between Gender and each of the previously selected College contrasts. We need these variables to be able to run the MRA-based analysis. More importantly, they are important in interpreting the particular patterns of differences among means that are responsible for statistically significant effects. See the article in this encyclopedia on regression model coding in ANOVA for details of how to compute coefficients for an interaction contrast by cross-multiplying the coefficients for one contrast for each of the two factors involved in the interaction. We can then test each of the two main effects and the interaction effect by testing the statistical significance of the increment to R² that results from adding the contrast variables representing that effect to the regression equation. This can be done by computing

Fincr = [(Increase in R²)/(number of predictors added)] / [(1 − R² for full model)/(N − total # of predictors − 1)]   (5)

or by adding Cha to the statistics requested in the SPSS Regression command.

For equal-n designs,

1. all sets of mutually orthogonal contrasts are also uncorrelated;
2. the increment to R² from adding any such set of contrasts is the same, no matter at what point they are added to the regression equation; and
3. the F's for all effects can be computed via simple algebraic formulae; see [1], any other ANOVA text, or the article on factorial ANOVA in this encyclopedia.

However, when n's are unequal (unless row- and column-proportional),

1. orthogonal contrasts (sum of cross-products of coefficients equal to zero) are, in general, not uncorrelated;
2. the increment to R² from adding any such set of contrasts depends on what other contrast variables are already in the regression equation; and
3. computing the F for any multiple-df effect requires the use of matrix algebra.

To illustrate this context dependence, consider the following additional output (Table 11) from our earlier stepwise-MRA run. There, we entered the College contrasts first, followed by the interaction contrasts and then by the Gender main-effect contrast, together with a second stepwise-MRA run in which the order of entry of these effects is reversed (Tables 11, 12).
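The coefficient relationships just described, and the R²-increment F of Equation (5), can be verified with ordinary least squares on the same 260 cases (a sketch; contrast codings as in the text):

```python
import numpy as np

cells = [(1, 1, 30, 25), (1, 1, 28, 15), (1, 1, 32, 15),
         (1, 2, 33, 1),  (1, 2, 35, 3),  (1, 2, 37, 1),
         (2, 1, 48, 20), (2, 1, 50, 40), (2, 1, 52, 20),
         (2, 2, 58, 5),  (2, 2, 60, 10), (2, 2, 62, 5),
         (3, 1, 18, 5),  (3, 1, 20, 10), (3, 1, 22, 5),
         (3, 2, 23, 20), (3, 2, 25, 40), (3, 2, 27, 20)]
reps = [nij for *_, nij in cells]
college = np.repeat([c for c, *_ in cells], reps)
sal = np.repeat([float(s) for _, _, s, _ in cells], reps)

# Contrast variables: MedvOth = (-1, 2, -1), EngvEduc = (1, 0, -1)
medvoth = np.where(college == 2, 2.0, -1.0)
engveduc = np.where(college == 1, 1.0, np.where(college == 3, -1.0, 0.0))
X = np.column_stack([np.ones(sal.size), medvoth, engveduc])
b = np.linalg.lstsq(X, sal, rcond=None)[0]   # [35.472, 8.264, 3.208], as in Table 9

r2 = 1 - ((sal - X @ b) ** 2).sum() / ((sal - sal.mean()) ** 2).sum()
f_incr = (r2 / 2) / ((1 - r2) / (sal.size - 2 - 1))   # Eq. (5): the overall F again
```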
Table 13 Coefficientsa

                        Unstandardized Coefficients   Standardized Coefficients
Model                   B           Std. Error        Beta     t         Sig.
1 (Gender only)
  (Constant)            35.588      0.802                      44.387    0.000
  MFCONTR               3.445       0.802             0.258    4.296     0.000
2 (Gender and G x C)
  (Constant)            37.967      0.966                      39.305    0.000
  MFCONTR               1.648       0.793             0.124    2.079     0.039
  GBCOLL1               1.786       0.526             0.188    3.393     0.001
  GBCOLL2               6.918       1.179             0.349    5.869     0.000
3 (Full model)
  (Constant)            36.667      0.141                      260.472   0.000
  MFCONTR               -3.333      0.141             -0.250   -23.679   0.000
  GBCOLL1               0.833       0.088             0.088    9.521     0.000
  GBCOLL2               3.654E-15   0.191             0.000    0.000     1.000
  MEDVOTH               9.167       0.088             1.023    104.731   0.000
  ENGVEDUC              5.000       0.191             0.294    26.183    0.000

a Dependent Variable: Salary (thousands of dollars).
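The sign flip for MFCONTR in Table 13 (positive when Gender is entered first, negative in the full model) can be reproduced with ordinary least squares; a sketch assuming +1/−1 Gender coding and the Table 3 frequencies:

```python
import numpy as np

cells = [(1, 1, 30, 25), (1, 1, 28, 15), (1, 1, 32, 15),
         (1, 2, 33, 1),  (1, 2, 35, 3),  (1, 2, 37, 1),
         (2, 1, 48, 20), (2, 1, 50, 40), (2, 1, 52, 20),
         (2, 2, 58, 5),  (2, 2, 60, 10), (2, 2, 62, 5),
         (3, 1, 18, 5),  (3, 1, 20, 10), (3, 1, 22, 5),
         (3, 2, 23, 20), (3, 2, 25, 40), (3, 2, 27, 20)]
reps = [nij for *_, nij in cells]
college = np.repeat([c for c, *_ in cells], reps)
gender = np.repeat([g for _, g, *_ in cells], reps)
sal = np.repeat([float(s) for _, _, s, _ in cells], reps)

mf = np.where(gender == 1, 1.0, -1.0)   # +1 = male, -1 = female (assumed coding)
medvoth = np.where(college == 2, 2.0, -1.0)
engveduc = np.where(college == 1, 1.0, np.where(college == 3, -1.0, 0.0))

def fit(*preds):
    X = np.column_stack([np.ones(sal.size), *preds])
    return np.linalg.lstsq(X, sal, rcond=None)[0]

b_first = fit(mf)[1]   # Gender alone: half the weighted-means difference (+3.445)
b_last = fit(mf, medvoth, engveduc, mf * medvoth, mf * engveduc)[1]
# Full model: half the unweighted-means difference (-3.333)
```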
Order of entry of an effect can affect not only the magnitude of the F for statistical significance but even our estimate of the direction of that effect, as shown in the regression coefficients for Gender in the various stages of the Gender, C x G, College MRA (Table 13).

Notice that the Gender contrast is positive, indicative of higher salaries for males, when Gender is the first effect entered into the equation, but negative when it is entered last. Notice, also, that the test for significance of a regression coefficient in the full model is logically and arithmetically identical to the test of the increment to R² when that contrast variable is the last one added to the regression equation. This is of course due to the fact that different contrasts are being tested in those two cases: When Gender is first in, the contrast being tested is the difference between the weighted means, and the B coefficient for MFCONTR equals half the difference ($6890) between the mean of the 155 males' salaries and the mean of the 105 females' salaries. When Gender is last in, the contrast being tested is the difference between the unweighted means, and the B coefficient for MFCONTR equals half the difference ($6667) between the mean of the three college means for males and the mean of the three college means for females. Each of these comparisons is the right answer to a different question.

The difference between weighted means, what we get when we test an effect when entered first, is more likely to be relevant when the unequal n's are a reflection of preexisting differences in representation of the various levels of our factor in the population. Though even in such cases, including the present example, we will probably also want to know what the average effect is within, that is, controlling for, the levels of the other factor. Where the factors are manipulated variables, we are much more likely to be interested in the differences among unweighted means, because these comparisons remove any confounds among the various factors.

But, what is being tested when an effect is neither first nor last into the equation? A little known, or seldom remarked upon, aspect of MRA is that one can, with the help of a bit of matrix algebra, express each sample regression coefficient as a linear combination of the various subjects' scores on the dependent variable. See section 2.2.4 of Harris [2] for the details. But when we are using MRA to analyze data from an independent-means design, factorial or otherwise, every subject in a particular group who receives a particular combination of one level of each of the factors in the design has exactly the same set of scores on the predictor variables. Hence, all of those subjects' scores on Y must be given the same weight in estimating our regression coefficient. Thus, the linear combination of the individual Y scores that is used to estimate the regression coefficient must perforce also be a linear combination of contrasts among the means and therefore also a contrast among
Table 15 Analysis of variance design 1: combined observed means

Combined Observed Means for GENDER
Variable .. SAL
  GENDER
    Males     WGT.   39.03226
              UNWGT. 33.33333
    Females   WGT.   32.14286
              UNWGT. 40.00000

Combined Observed Means for COLLEGE
Variable .. SAL
  COLLEGE
    Engineer  WGT.   30.41667
              UNWGT. 32.50000
    Medicine  WGT.   52.00000
              UNWGT. 55.00000
    Educatio  WGT.   24.00000
              UNWGT. 22.50000

For whichever effect is tested first in your sequen-

been determined, each retained effect should then be retested, correcting each for all other retained effects.

Hand Computation of Full-model Contrasts

Our exploration of unequal-n factorial designs has relied on the use of computer programs such as SPSS. However, another little-remarked-upon aspect of such designs is that the full-model (fully corrected) test of any single-df contrast can be conducted via the straightforward (if sometimes tedious) formula,

Fcontr = SScontr / MSw,   where   SScontr = (Σj cj Ȳj)² / Σj (cj² / nj),

the sums running over the j = 1, ..., ngroups groups.
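As a sketch of the hand computation just described, using the standard single-df contrast sum of squares (the cell means, cell sizes, contrast weights, and MSw below are illustrative values, not the article's data):

```python
# Hedged sketch: full-model test of a single-df contrast,
# F_contr = SS_contr / MS_w, with
# SS_contr = (sum_j c_j * mean_j)**2 / sum_j (c_j**2 / n_j).

def ss_contrast(means, ns, c):
    # Numerator: squared value of the contrast applied to the cell means.
    num = sum(cj * m for cj, m in zip(c, means)) ** 2
    # Denominator: sum of squared weights divided by the cell sizes.
    den = sum(cj ** 2 / nj for cj, nj in zip(c, ns))
    return num / den

means = [30.0, 52.0, 24.0]   # hypothetical cell means
ns = [12, 5, 8]              # hypothetical (unequal) cell sizes
c = [1.0, -0.5, -0.5]        # contrast: group 1 vs. average of groups 2 and 3

ms_w = 40.0                  # hypothetical within-cells mean square
ss = ss_contrast(means, ns, c)
f_contr = ss / ms_w
```

Dividing the resulting SScontr by the within-cells mean square gives the F for that contrast, corrected for all other effects in the full model.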
[5] Maxwell, S.E. & Delaney, H.D. (2003). Designing Experiments and Analyzing Data: A Model Comparison Perspective, Lawrence Erlbaum Associates, Mahwah.
[6] Messick, D.M. & Van de Geer, J.P. (1981). A reversal paradox, Psychological Bulletin 90, 582–593.

(See also Analysis of Variance: Cell Means Approach)

RICHARD J. HARRIS
Ansari–Bradley Test
CLIFFORD E. LUNNEBORG
Volume 1, pp. 93–94
Ascertainment Corrections 3

both the ascertainment variable and the variable of interest. If the ascertainment variables x1 and x2 are both independent of both y1 and y2, it would not be necessary to correct for ascertainment, and uncorrected univariate analysis of y1 and y2 would yield the same results.

which is equal to the sum of the probabilities of observing the three remaining outcomes for a relative pair:

∫_t^∞ ∫_{−∞}^t φ(x1, x2) dx2 dx1 + ∫_{−∞}^t ∫_t^∞ φ(x1, x2) dx2 dx1 + ∫_t^∞ ∫_t^∞ φ(x1, x2) dx2 dx1.   (10)

Acknowledgment

Michael C. Neale is primarily supported by PHS grants MH-65322 and DA018673.
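A numerical sketch of the sum in (10), assuming the integrand φ is a standard bivariate normal density with correlation r (an illustrative choice; the extract leaves the density implicit). The three integrals together equal 1 minus the probability that both relatives fall below the threshold t:

```python
# Hedged sketch: P(at least one of a relative pair exceeds threshold t)
# under a standard bivariate normal with correlation r, i.e. the sum of
# the three remaining-outcome integrals in (10).
from statistics import NormalDist

def ascertainment_prob(t, r, n=400):
    # Integrate the bivariate density over x1 <= t, x2 <= t using the
    # conditional decomposition x2 | x1 ~ N(r * x1, 1 - r**2), then
    # subtract from 1 to get the sum of the three remaining outcomes.
    lo, hi = -8.0, t
    h = (hi - lo) / n
    nd = NormalDist()
    total = 0.0
    for i in range(n):
        x1 = lo + (i + 0.5) * h  # midpoint rule on x1
        cond = NormalDist(mu=r * x1, sigma=(1 - r ** 2) ** 0.5)
        total += nd.pdf(x1) * cond.cdf(t) * h
    return 1.0 - total

# Independent traits (r = 0) with threshold at the mean (t = 0):
# each exceedance has probability 1/2, so the sum is 3/4.
p = ascertainment_prob(0.0, 0.0)
```

Positive trait correlation concentrates mass in the "both below" region, so the ascertainment probability falls below 3/4.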
= ab(1 + m).   (1)

Thus, the genetic correlation between parent and offspring increases whenever m, the genetic correlation between mates, is nonzero. Further details can be found in [1].

On the other hand, if assortative mating occurs only through environmental similarity, such as social class, m = 0 and there is no effect on the characteristic's heritability. However, in modern times when marriages are less likely to be arranged, it is extremely unlikely that selection is based solely on characteristics of the individual's environment, without consideration given to phenotypic characteristics. In addition, heritability will still increase if differences on the environmental characteristic are in part due to genetic differences among individuals. For example,

References

[1] Cavalli-Sforza, L.L. & Bodmer, W.F. (1999). The Genetics of Human Populations, Dover, Mineola.
[2] Galton, F. (1886). Regression towards mediocrity in hereditary stature, Journal of the Anthropological Institute 15, 246–263.
[3] Mascie-Taylor, C.G.N. (1995). Human assortative mating: evidence and genetic implications, in Human Populations. Diversity and Adaptations, A.J. Boyce & V. Reynolds, eds, Oxford University Press, Oxford, pp. 86–105.
[4] Pearson, K. (1903). Mathematical contributions to the theory of evolution. XI. On the influence of natural selection on the variability and correlation of organs, Philosophical Transactions of the Royal Society A 200, 1–66.
[5] Pearson, K. & Lee, A. (1903). On the laws of inheritance in man. I. Inheritance of physical characters, Biometrika 2, 357–462.

SCOTT L. HERSHBERGER
Asymptotic Relative Efficiency
CLIFFORD E. LUNNEBORG
Volume 1, pp. 102–102
Stage One: Locating Items

Thurstone's method begins by developing a large number of items, or statements, that represent particular attitudes toward the topic of study (e.g., politics; education). Once the set of items is constructed, the items are presented to a group of judges who sort the items. After the items have been scaled from the sorting data, they can be administered to new respondents in a second, separate data collection stage.

(exp(·)/[1 + exp(·)]). This model for paired comparisons is called the Bradley–Terry model. A review of the extensions of paired comparisons models is presented in [8].

The drawback of the paired comparisons item scaling method is that it requires a huge number of comparisons. For example, locating the last forty presidents on the liberalism–conservatism scale would require each judge to make (40 choose 2) = 780 paired comparisons.
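The quadratic growth in the number of comparisons, and the Bradley–Terry form of the preference probability, can be sketched briefly (the merit parameters lam_i are illustrative, not values from the text):

```python
# Hedged sketch: paired-comparison counts and a minimal Bradley-Terry
# win probability. The lambda (merit) values are illustrative.
import math

def bt_prob(lam_i, lam_j):
    # P(item i preferred to item j) under the Bradley-Terry model.
    return math.exp(lam_i - lam_j) / (1.0 + math.exp(lam_i - lam_j))

n_comparisons = math.comb(40, 2)  # the 780 comparisons cited in the text
print(n_comparisons, round(bt_prob(1.0, 1.0), 2))  # 780 0.5
```

Items of equal merit are preferred with probability one half, and every added item adds J further comparisons to each judge's task.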
scaling method assumes that respondents endorse only those items that are located near the respondent; this assumption implies a unimodal response function. The method then estimates the location of the respondent with the Thurstone estimator, which is the average or median location of the endorsed items.

To scale respondents on a social liberalism–conservatism attitude scale, a political scientist might ask the respondents: "For each of the following Presidents please mark whether you agreed with the President's social policy (1 = Agree, 0 = Disagree)." A respondent might disagree because they feel the President's social policy was too liberal, or because the policy was too conservative. The scale location of a respondent who agreed with the policy of Clinton and Carter, but disagreed with the remaining Presidents, would be estimated by θ1 = (72.0 + 67.0)/2 = 69.5. Similarly, the scale position of a second respondent who agreed with the policies of Carter, Ford, and Bush would be estimated as θ2 = (67.0 + 39.3 + 32.8)/3 = 46.4. The first respondent is more liberal than the second respondent.

Unfolding Response Models

Coombs [11] describes a procedure called unfolding which simultaneously scales items and subjects on the same linear scale using only Thurstone's second stage of data collection. The method assumes there exists some range of attitudes around each item such that all respondents within that range necessarily endorse that item; outside of that range the respondents necessarily disagree with that item (see Multidimensional Unfolding).

Coombs's model is an example of an unfolding response model. Unfolding response models assume that the item response function, the probability a respondent located at θi endorses an item located at δj, is a unimodal function which reaches a maximum at δj. Coombs's model assumes a deterministic response function:

Pj(θi) ≡ P{Xij = 1 | θi} = 1 if θi ∈ (δj − γj, δj + γj), and 0 if θi ∉ (δj − γj, δj + γj).   (1)

The parameter γj is called the latitude of acceptance for item j.

By being able to locate items and respondents simultaneously, it became unnecessary to premeasure the opinions as required by Thurstone's scaling method. However, Coombs's deterministic model is quite restrictive because it does not allow any response pattern to contain the triplet {1, 0, 1}. For example, if former US Presidents Clinton, Carter, Ford, Bush, and Reagan are ordered from most liberal to most conservative, then a person who endorses the social policy of Presidents Ford and Reagan, but not Bush, violates the model.

Since Coombs's deterministic unfolding response model, a number of probabilistic models have been developed [e.g., 3, 12, 20, 30]. One of the earliest probabilistic unfolding response models used for scaling in attitude studies is the squared logistic model [3]. The model assumes the logit of the item response function is quadratic in the respondent's location on the attitude scale. Specifically, the model assumes logit Pj(θ) = −(θ − δj)², which reaches a maximum value of 1/2 when θ = δj.

The squared logistic model is too restrictive for many attitudinal surveys because the maximal endorsement probability is fixed at 1/2. The PARELLA [20] and the hyperbolic cosine models [4, 51] overcome this limitation by adding another parameter similar to the latitude of acceptance in Coombs's model [30]. Unfolding response models for polytomous response scales have also been utilized in attitudinal studies [41]. Noel [35], for example, utilizes a polytomous unfolding response model to scale the attitudes of smokers as they approach cessation.

The locations of items and respondents are estimated using one of a number of estimation algorithms for unfolding response models. These include joint maximum likelihood [3, 4], marginal maximum likelihood [20, 51], and Markov chain Monte Carlo techniques [22, 24]. Post [37] develops a nonparametric definition of the unfolding response model which allows for the consistent estimation of the rank order of item locations along the attitude scale [22]. Assuming Post's nonparametric unfolding model and that the rank order of item locations is known, [22] shows that the Thurstone estimator (i.e., the average location of the endorsed items) does in fact consistently order respondents by their attitudes.
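The Thurstone estimator just illustrated can be sketched directly from the quoted item locations:

```python
# Sketch of the Thurstone estimator: a respondent's attitude estimate is
# the average location of the items he or she endorses. The item
# locations are the Presidents' scale values quoted in the text.
locations = {"Clinton": 72.0, "Carter": 67.0, "Ford": 39.3, "Bush": 32.8}

def thurstone_estimate(endorsed):
    vals = [locations[item] for item in endorsed]
    return sum(vals) / len(vals)

theta_1 = thurstone_estimate(["Clinton", "Carter"])       # 69.5
theta_2 = thurstone_estimate(["Carter", "Ford", "Bush"])  # about 46.4
```

The median of the endorsed locations is the usual robust alternative when a respondent endorses many items.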
4 Attitude Scaling
Guttman [18] suggests another method for the scaling of respondents. The main difference between Thurstone's scaling method and Guttman's is in the type of questions used to scale the respondents' attitudes. Guttman's key assumption is that individuals necessarily agree with all items located below their own position on the attitude scale and necessarily disagree with all items above.

Unlike Thurstone's scaling assumption, which implies a unimodal response function, Guttman's assumption implies a monotone response function. In fact, the response function can be parameterized as the step function:

Pj(θ) = 1 if θ > δj, and 0 if θ ≤ δj.   (2)

So questions that are valid for Thurstone scaling are not for Guttman scaling. For example, a question valid for Thurstone scaling, such as, "Did you agree with President Reagan's social policy?" might be altered to ask, "Do you feel that President Reagan's social policy was too conservative?" to use in Guttman scaling.

Once a large number of items have been developed, judges are utilized to sort the items. The researcher then performs a scalogram analysis to select the set of items that are most likely to conform to Guttman's deterministic assumption.

Guttman's deterministic assumption restricts the number of possible response patterns. If all J items in an attitudinal survey are Guttman items, then at most J + 1 response patterns should be observed. These J + 1 response patterns rank order survey respondents along the attitude scale. Respondents answering "No" to all items (00...0) are positioned below respondents who endorse the lowest item (10...0), who lie below respondents who endorse the two lowest items (110...0), and so forth.

Goodman's Partially Ordered Items

Goodman [17] calls a set of strictly ordered items, as assumed in Guttman scaling, a uniform scale because there is only one order of items. However, in some cases it is not plausible to assume that items are strictly ordered. Assume that two orderings of past US Presidents from most conservative to most liberal are plausible, for example, Reagan < Ford < Bush < Carter and Reagan < Bush < Ford < Carter. Because there are two plausible orderings of the items, Goodman calls the resulting scale a biform scale. Wiley and Martin [53] represent this partially ordered set of beliefs, or belief poset, as in Figure 1.

[Figure 1 An example of a biform scale or belief poset: Reagan below both Ford and Bush, with Carter above both]

Items are only partially ordered (Reagan is less liberal than both Ford and Bush, and Ford and Bush are less liberal than Carter); hence, unlike Guttman scaling, there is no longer a strict ordering of subjects by the response patterns. In particular, subjects with belief state 1100 (i.e., Reagan and Ford are too conservative) and 1010 (i.e., Reagan and Bush are too conservative) cannot be ordered with respect to one another.

In general, if J items make up a biform scale, then J + 2 response patterns are possible, as compared to J + 1 response patterns under Guttman scaling. Although the biform scale can be easily generalized to multiform scales (e.g., triform scales), the limitations placed on the response patterns often prove too restrictive for many applications.

Monotone Item Response Models

Probabilistic item response theory models, a class of generalized mixed effect models, surmount the restrictions of Guttman's and Goodman's deterministic models by adding a random component to the model. The Rasch model [39], one example of such a model, assumes the logit of the item response function is equal to the difference between the respondent's location and the item's location (i.e., log{Pj(θ)/(1 − Pj(θ))} = θ − δj). The normal-ogive model assumes Pj(θ) = Φ(θ − δj), where Φ(·) is the normal cumulative distribution function [28]. The Rasch and normal-ogive models have also been generalized to applications where the latent attitude is assumed to be multidimensional (θ ∈ R^k) [6, 32, 40].

Respondents and items are located on the attitude scale using some estimation procedure for item response models. These include joint maximum
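The contrast between the deterministic Guttman step function (2) and the probabilistic Rasch item response function can be sketched minimally (θ and δ values below are illustrative):

```python
# Hedged sketch: Guttman's deterministic step function versus the
# Rasch model's smooth logistic item response function.
import math

def guttman(theta, delta):
    # Deterministic step: endorse iff theta exceeds the item location.
    return 1.0 if theta > delta else 0.0

def rasch(theta, delta):
    # log{P/(1-P)} = theta - delta  =>  P = 1 / (1 + exp(-(theta - delta))).
    return 1.0 / (1.0 + math.exp(-(theta - delta)))

print(guttman(1.0, 0.0), rasch(0.0, 0.0))  # 1.0 0.5
```

At θ = δj the Rasch probability is exactly one half, whereas the Guttman model forces a hard 0/1 jump at that point; the random component is what lets IRT models tolerate "imperfect" response patterns.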
[Figure 2 The relationship between items and a subject's locations and the observed ranking of the items in Coombs' (1950) unfolding model]

serious limitation, and it is not surprising that few real data sets can be analyzed using the method Coombs suggests.

A method that is closely related to Coombs's ranking method is the pairwise preference method. The method, like the paired comparison item scaling method, pairs each of the J survey items with the other J − 1 items, but unlike the paired comparison method the respondents' personal attitudes affect their responses. For each item pair the respondents are asked to select which of the pair of items they prefer. For example, "Whose social policy did you prefer, Bush's or Reagan's?"

Assuming pairwise preference is error-free, a respondent whose attitude is located at θi on the latent scale prefers item j to item k whenever he or she is located below the midpoint of the two items (assuming δj < δk). Coombs [11] describes a method to construct a complete rank ordering of all J items from the set of (J choose 2) pairwise preference comparisons and estimates the rank order of respondents.

Bechtel [5] introduces a stochastic model for the analysis of pairwise preference data that assumes the locations of respondents' attitudes are normally distributed along the attitude scale. Sixtl [44] generalizes this model to a model which assumes a general distribution F(·) for the locations of respondents on the attitude scale. The probability that a randomly selected respondent prefers item j to item k equals P(j preferred to k) = ∫ I{t < (δj + δk)/2} dF(t) = F((δj + δk)/2). Let P(j to k)

Summary

The early scaling methods of Thurstone, Guttman, Likert, and Coombs give researchers in the behavioral sciences a way to quantify, or measure, the seemingly unmeasurable construct we call attitude. This measure of attitude allowed researchers to examine how behavior varied according to differences in attitudes. However, these techniques are often too restrictive in their applicability. Modern attitude scaling techniques based on item response theory models overcome many of the limitations of the early scaling methods, but that is not to say they cannot be improved upon.

The increasing use of computer-based attitudinal surveys offers a number of ways to improve on the current attitude scaling methodologies. Typically attitudinal surveys have all respondents answering all survey items, but an adaptive survey may prove more powerful. Computerized adaptive assessments, which select items that provide the most information about an individual's attitude (or ability), have been used extensively in educational testing [e.g., 52] and will likely receive more attention in attitudinal surveys. Roberts, Lin, and Laughlin, for example, introduce an adaptive procedure for unfolding response models [42].

The attitude scaling methods described here use discrete responses to measure respondents on the attitude scale. Another direction in which attitude scaling may be improved is with the use of continuous response scales. Continuous response scales typically ask respondents to mark their response on a line segment that runs from "Complete Disagreement" to "Complete Agreement". Their response is then recorded as the proportion of the line that lies below the mark. Because a respondent can mark any point on the line, each response will likely contain more information about the attitude being studied than do discrete responses [2].

Continuous response formats are not a new development. In fact, Freyd [16] discussed their use before Thurstone's Law of Comparative Judgment, but continuous response scales were once difficult to implement because each response had to be measured
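Sixtl's preference probability F((δj + δk)/2) can be sketched by choosing a concrete attitude distribution F; a standard normal (Bechtel's special case) is assumed here for illustration:

```python
# Hedged sketch: a respondent prefers item j to item k when located
# below the items' midpoint, so P(j preferred to k) = F((dj + dk) / 2).
# F is taken to be standard normal purely for illustration.
from statistics import NormalDist

def pref_prob(delta_j, delta_k, attitude_dist=NormalDist(0.0, 1.0)):
    midpoint = (delta_j + delta_k) / 2.0
    return attitude_dist.cdf(midpoint)

# Items placed symmetrically about the attitude mean: midpoint 0 -> 0.5.
print(round(pref_prob(-1.0, 1.0), 2))  # 0.5
```

Shifting both item locations upward moves the midpoint into the upper tail of F, so the preference probability for the lower item rises above one half.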
by hand. Modern computer programming languages make continuous response scales more tractable. There are several options available for the analysis of multivariate continuous responses, including factor analysis, multivariate regression models (see Multivariate Multiple Regression), and generalizations of item response models to continuous response formats [43].

References

[1] Allport, G.W. (1935). Attitudes, in Handbook of Social Psychology, C. Murchinson, ed., Clark University Press, Worcester, pp. 798–844.
[2] Alwin, D.F. (1997). Feeling thermometers versus 7-point scales: which are better? Sociological Methods & Research 25, 318–340.
[3] Andrich, D. (1988). The application of an unfolding model of the PIRT type for the measurement of attitude, Applied Psychological Measurement 12, 33–51.
[4] Andrich, D. & Luo, G. (1993). A hyperbolic cosine latent trait model for unfolding dichotomous single-stimulus responses, Applied Psychological Measurement 17, 253–276.
[5] Bechtel, G.G. (1968). Folded and unfolded scaling from preferential paired comparisons, Journal of Mathematical Psychology 5, 333–357.
[6] Béguin, A.A. & Glas, C.A.W. (2001). MCMC estimation and some fit analysis of multidimensional IRT models, Psychometrika 66, 471–488.
[7] Bock, R.D. & Aitkin, M. (1981). Marginal maximum likelihood estimation of item parameters: an application of an EM algorithm, Psychometrika 46, 443–459.
[8] Bradley, R.A. (1976). Science, statistics, and paired comparisons, Biometrika 32, 213–232.
[9] Bradley, R.A. & Terry, M.E. (1952). Rank analysis of incomplete block designs: I. The method of paired comparisons, Biometrika 39, 324–345.
[10] Coombs, C.H. (1950). Psychological scaling without a unit of measurement, Psychological Review 57, 145–158.
[11] Coombs, C.H. (1964). A Theory of Data, Wiley, New York.
[12] Davison, M. (1977). On a metric, unidimensional unfolding model for attitudinal and development data, Psychometrika 42, 523–548.
[13] Douglas, J. (1997). Joint consistency of nonparametric item characteristic curve and ability estimates, Psychometrika 47, 7–28.
[14] Eagly, A.H. & Chaiken, S. (1993). The Psychology of Attitudes, Harcourt Brace Jovanovich.
[15] Fishbein, M. & Ajzen, I. (1975). Belief, Attitude, Intention, and Behavior: An Introduction to Theory and Research, Addison-Wesley.
[16] Freyd, M. (1923). The graphic rating scale, Journal of Educational Psychology 14, 83–102.
[17] Goodman, L.A. (1975). A new model for scaling response patterns, Journal of the American Statistical Society 70; reprinted in Analyzing Qualitative/Categorical Data, J. Magidson, ed., Abt Books, Cambridge, 1978.
[18] Guttman, L. (1950). The basis for scalogram analysis, in Measurement and Prediction, Studies in Social Psychology in World War II, Vol. IV, University Press, Princeton, pp. 60–90.
[19] Hemker, B.T., Sijtsma, K., Molenaar, I.W. & Junker, B.W. (1997). Stochastic ordering using the latent trait and the sum score in polytomous IRT models, Psychometrika 62, 331–347.
[20] Hoijtink, H. (1990). A latent trait model for dichotomous choice data, Psychometrika 55, 641–656.
[21] Horowitz, L.M., Rosenberg, S.E., Baer, B.A., Ureno, G. & Villasenor, V.S. (1988). Inventory of interpersonal problems: psychometric properties and clinical applications, Journal of Consulting and Clinical Psychology 56, 885–892.
[22] Johnson, M.S. (2001). Parametric and non-parametric extensions to unfolding response models, PhD thesis, Carnegie Mellon University, Pittsburgh.
[23] Johnson, M.S., Cohen, W.M. & Junker, B.W. (1999). Measuring appropriability in research and development with item response models, Technical report No. 690, Carnegie Mellon Department of Statistics.
[24] Johnson, M.S. & Junker, B.W. (2003). Using data augmentation and Markov chain Monte Carlo for the estimation of unfolding response models, Journal of Educational and Behavioral Statistics 28(3), 195–230.
[25] Junker, B.W. (1991). Essential independence and likelihood-based ability estimation for polytomous items, Psychometrika 56, 255–278.
[26] Kim, Y. & Pilkonis, P.A. (1999). Selecting the most informative items in the IIP scales for personality disorders: an application of item response theory, Journal of Personality Disorders 13, 157–174.
[27] Likert, R.A. (1932). A technique for the measurement of attitudes, Archives of Psychology 140, 5–53.
[28] Lord, F.M. & Novick, M.R. (1968). Statistical Theories of Mental Test Scores, Addison-Wesley, Reading.
[29] Luce, R.D. (1959). Individual Choice Behavior, Wiley, New York.
[30] Luo, G. (1998). A general formulation for unidimensional latent trait unfolding models: making explicit the latitude of acceptance, Journal of Mathematical Psychology 42, 400–417.
[31] Masters, G.N. (1982). A Rasch model for partial credit scoring, Psychometrika 47, 149–174.
[32] McDonald, R.P. (1997). Normal-ogive multidimensional model, in Handbook of Modern Item Response Theory, Chap. 15, W.J. van der Linden & R.K. Hambleton, eds, Springer-Verlag, New York, pp. 257–269.
[33] Mosteller, F. (1951). Remarks on the method of paired comparisons: III. A test of significance for paired comparisons when equal standard deviations and equal correlations are assumed, Psychometrika 16, 207–218.
[34] Muhlberger, P. (1999). A general unfolding, non-folding scaling model and algorithm, presented at the 1999 American Political Science Association Annual Meeting, Atlanta.
[35] Noel, Y. (1999). Recovering unimodal latent patterns of change by unfolding analysis: application to smoking cessation, Psychological Methods 4(2), 173–191.
[36] Patz, R. & Junker, B. (1999). Applications and extensions of MCMC in IRT: multiple item types, missing data, and rated responses, Journal of Educational and Behavioral Statistics 24, 342–366.
[37] Post, W.J. (1992). Nonparametric Unfolding Models: A Latent Structure Approach, M&T Series, DSWO Press, Leiden.
[38] Ramsay, J.O. (1991). Kernel smoothing approaches to nonparametric item characteristic curve estimation, Psychometrika 56, 611–630.
[39] Rasch, G. (1961). On general laws and the meaning of measurement in psychology, in Fourth Berkeley Symposium on Mathematical Statistics and Probability, Statistical Laboratory, University of California, June 20–July 30, 1960, University of California Press.
[40] Reckase, M.D. (1997). Loglinear multidimensional model for dichotomous item response data, in Handbook of Modern Item Response Theory, Chap. 16, W.J. van der Linden & R.K. Hambleton, eds, Springer-Verlag, New York, pp. 271–286.
[41] Roberts, J.S., Donoghue, J.R. & Laughlin, J.E. (2000). A general model for unfolding unidimensional polytomous responses using item response theory, Applied Psychological Measurement 24(1), 3–32.
[42] Roberts, J.S., Lin, Y. & Laughlin, J. (2001). Computerized adaptive testing with the generalized graded unfolding model, Applied Psychological Measurement 25, 177–196.
[43] Samejima, F. (1973). Homogeneous case of the continuous response model, Psychometrika 38(2), 203–219.
[44] Sixtl, F. (1973). Probabilistic unfolding, Psychometrika 38(2), 235–248.
[45] Stout, W.F. (1990). A new item response theory modeling approach with applications to unidimensionality assessment and ability estimation, Psychometrika 55, 293–325.
[46] Thissen, D. (1982). Marginal maximum likelihood estimation for the one-parameter logistic model, Psychometrika 47, 175–186.
[47] Thurstone, L.L. (1927). A law of comparative judgment, Psychological Review 34, 278–286.
[48] Thurstone, L.L. (1928). Attitudes can be measured, American Journal of Sociology 33, 529–554.
[49] Thurstone, L.L. & Chave, E.J. (1929). The Measurement of Attitude, University of Chicago Press, Chicago.
[50] Verhelst, N.D. & Glas, C.A.W. (1995). The one parameter logistic model, in Rasch Models: Foundations, Recent Developments, and Applications, Chap. 12, Springer-Verlag, New York.
[51] Verhelst, N.D. & Verstralen, H.H.F.M. (1993). A stochastic unfolding model derived from the partial credit model, Kwantitatieve Methoden 42, 73–92.
[52] Wainer, H., ed. (2000). Computerized Adaptive Testing: A Primer, 2nd Edition, Lawrence Erlbaum.
[53] Wiley, J.A. & Martin, J.L. (1999). Algebraic representations of beliefs and attitudes: partial order models for item responses, Sociological Methodology 29, 113–146.

(See also Multidimensional Scaling; Unidimensional Scaling)

MATTHEW S. JOHNSON AND BRIAN W. JUNKER
Attrition
WILLIAM R. SHADISH AND JASON K. LUELLEN
Volume 1, pp. 110–111
Average Deviation
DAVID C. HOWELL
Volume 1, pp. 111–112
[Figure 1 A scatter plot of the scores for each of 24 meat samples on the first two principal components (axes PC1 and PC2; points labelled by sample type 1–4)]

to the true 31-dimensional configuration of points. This plot is shown in Figure 1 below. The 24 meat samples were of four types: reformed meats (type 1), sausages (type 2), whole meats (type 3), and beef burgers (type 4). The points in the diagram are labelled by type, and it is immediately evident that whole meats are recognizably different from the other types. While the other three types do show some evidence of systematic differences, there are, nevertheless, considerable overlaps among them. This simple graphical presentation has thus provided some valuable insights into the data.

In recent years, a variety of techniques producing subspaces into which to project the data in order to optimize criteria other than variance has been developed under the general heading of projection pursuit, but many of the variants obtain the data projection in a subspace by direct computational means, without the intermediate step of obtaining linear combinations to act as axes. In these cases, therefore, any substantive interpretation has to be based on the plot alone. Other popular techniques such as canonical variate analysis (see Canonical Correlation Analysis) produce linear combinations, but nonorthogonal ones. These are therefore oblique axes in the original space, and if used as orthogonal axes against which to plot scores they produce a deformation of the original space. In the case of canonical variables, such a deformation is justified because it converts Mahalanobis distance in the original space into Euclidean distance in the subspace [2], and the latter is more readily interpretable.

In some techniques, such as factor analysis, the linear combinations are derived implicitly from a statistical model. They can still be viewed as defining axes and subspaces of the original space, but direct projection of points into these subspaces may not necessarily coincide with derived scores (that are often estimated in some way from the model). In any of these cases, projecting the original axes into the subspace produced by the technique will show the inclination of the subspace to the original axes and will help in the interpretation of the data. Such projection of axes into subspaces underlies the ideas of biplots.

The above techniques have required quantitative data. Data sets containing qualitative, nominal, or ordinal variables will not permit direct representation as points in space with coordinates given by variable values. It is, nevertheless, possible to construct a representation using techniques such as multidimensional scaling or correspondence analysis, and then to derive approximating subspaces for this representation. However, such representations no longer associate variables with coordinate axes, so there are no underlying linear combinations of variables to link to the axes in the approximating subspaces.

References

[1] Hotelling, H. (1933). Analysis of a complex of statistical variables into principal components, Journal of Educational Psychology 24, 417–441.
[2] Krzanowski, W.J. (2000). Principles of Multivariate Analysis: A User's Perspective, University Press, Oxford.
[3] Pearson, K. (1901). On lines and planes of closest fit to systems of points in space, The London, Edinburgh and Dublin Philosophical Magazine and Journal of Science, Sixth Series 2, 559–572.

WOJTEK J. KRZANOWSKI
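The construction behind a principal-component score plot like Figure 1 can be sketched minimally: center the data and project each observation onto the leading eigenvector of the covariance matrix (two variables here for brevity; the meat data are illustrative stand-ins, not the article's):

```python
# Hedged sketch: first-principal-component scores for two variables,
# via the closed-form eigendecomposition of the 2 x 2 covariance matrix.
import math

def pca_2d(points):
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    # Eigenvalues of [[sxx, sxy], [sxy, syy]].
    tr, det = sxx + syy, sxx * syy - sxy ** 2
    gap = math.sqrt(tr ** 2 / 4 - det)
    lam1, lam2 = tr / 2 + gap, tr / 2 - gap
    # Leading eigenvector: (lam1 - syy, sxy), handling the diagonal case.
    if abs(sxy) < 1e-12:
        v1 = (1.0, 0.0) if sxx >= syy else (0.0, 1.0)
    else:
        norm = math.hypot(lam1 - syy, sxy)
        v1 = ((lam1 - syy) / norm, sxy / norm)
    scores = [(p[0] - mx) * v1[0] + (p[1] - my) * v1[1] for p in points]
    return lam1, lam2, scores

data = [(1, 1), (2, 2.1), (3, 2.9), (4, 4.2)]  # illustrative observations
lam1, lam2, pc1_scores = pca_2d(data)
```

By construction the PC1 scores are centered and their sample variance equals the leading eigenvalue, which is the variance-maximizing property the article describes.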
Bagging
ADELE CUTLER, CHRIS CORCORAN AND LESLIE TOONE
Volume 1, pp. 115–117
[Figure: four panels — (a) Data and underlying function; (b) Single regression tree; (c) 10 Bootstrap samples; (d) 100 Bagged regression trees]
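The recipe the panels illustrate can be sketched minimally: draw bootstrap resamples, fit a weak regression learner to each, and average the predictions. A one-split stump stands in for a full regression tree here to keep the sketch self-contained, and the data are illustrative:

```python
# Hedged sketch of bagging for regression: bootstrap resampling plus
# averaging of weak learners (one-split stumps instead of full trees).
import random

def fit_stump(data):
    # Choose the split minimizing squared error; predict the mean per side.
    best = None
    xs = sorted(x for x, _ in data)
    for i in range(1, len(xs)):
        split = (xs[i - 1] + xs[i]) / 2
        left = [y for x, y in data if x <= split]
        right = [y for x, y in data if x > split]
        if not left or not right:
            continue
        ml, mr = sum(left) / len(left), sum(right) / len(right)
        err = sum((y - ml) ** 2 for x, y in data if x <= split) + \
              sum((y - mr) ** 2 for x, y in data if x > split)
        if best is None or err < best[0]:
            best = (err, split, ml, mr)
    if best is None:  # degenerate sample: all x values equal
        m = sum(y for _, y in data) / len(data)
        return lambda x: m
    _, split, ml, mr = best
    return lambda x: ml if x <= split else mr

def bagged_predictor(data, n_boot=100, rng=random.Random(0)):
    stumps = []
    for _ in range(n_boot):
        sample = [rng.choice(data) for _ in data]  # bootstrap resample
        stumps.append(fit_stump(sample))
    return lambda x: sum(s(x) for s in stumps) / len(stumps)

data = [(x / 4, (x / 4) ** 2) for x in range(-12, 13)]  # y = x^2 on [-3, 3]
f_hat = bagged_predictor(data)
```

Averaging over resamples smooths the single stump's hard step, which is the visual difference between panels (b) and (d).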
Table 1 LDA, error rate 4.7%

                             Predicted class
                             Non-impaired    Alzheimer's
True class   Non-impaired        90               3
             Alzheimer's          1              35

Table 3 Classification tree, error rate 10.1%

                             Predicted class
                             Non-impaired    Alzheimer's
True class   Non-impaired        90               3
             Alzheimer's          2              34
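The procedure the figure and tables summarize, bootstrap aggregation, can be sketched with a deliberately simple base learner; the regression stump and every name below are illustrative assumptions, not the authors' implementation:

```python
import random

def fit_stump(xs, ys):
    """Fit a one-split regression stump: pick the cut point that minimizes
    the summed squared error of the per-side means."""
    pts = sorted(set(xs))
    if len(pts) < 2:                      # degenerate bootstrap sample:
        mean = sum(ys) / len(ys)          # fall back to a constant predictor
        return lambda x: mean
    best = None
    for cut in pts[1:]:
        left = [y for x, y in zip(xs, ys) if x < cut]
        right = [y for x, y in zip(xs, ys) if x >= cut]
        ml, mr = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((y - ml) ** 2 for y in left)
               + sum((y - mr) ** 2 for y in right))
        if best is None or sse < best[0]:
            best = (sse, cut, ml, mr)
    _, cut, ml, mr = best
    return lambda x: ml if x < cut else mr

def bag(xs, ys, n_boot=100, seed=0):
    """Bagging: fit one stump per bootstrap resample of the training data,
    then predict with the average of all fitted stumps."""
    rng = random.Random(seed)
    n = len(xs)
    stumps = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        stumps.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return lambda x: sum(s(x) for s in stumps) / len(stumps)
```

Averaging over bootstrap replicates smooths the single tree's hard steps, which is the contrast panels (b) and (d) of the figure illustrate.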
of two or more different treatments. First, consider the problem of comparing responses to Treatments A and B in Table 1. The most common analysis of variance method assesses the effects of two design features: blocks and treatments. Block 1 has a score of 19 under A and a score of 17 under B. The difference, +2, can be called an intrablock comparison. Because the [6] data set has $\lambda = 1$, there is only one A–B comparison in our research example. Otherwise we could average all such A–B differences in making an intrablock comparison of these two treatments. A comparison of the performance in different blocks allows the comparison of raw averages for the different blocks, regardless of the treatments involved. However, an overall test of intrablock effects in an analysis of variance must extract block effects first and then extract treatment effects adjusted for block effects. This is called an intrablock analysis. Equations (3) through (10) below are consistent with [6, p. 52]. Let

$$C = \frac{\Bigl(\sum_i \sum_j Y_{ij}\Bigr)^2}{N}, \qquad T_i = \sum_j Y_{ij}, \qquad B_j = \sum_i Y_{ij}, \quad (3)$$

and

$$Q_i = kT_i - \sum_j n_{ij} B_j. \quad (4)$$

Because $n_{ij} = 1$ rather than 0 only if Treatment $i$ is present in Block $j$, $Q_i$ adjusts $kT_i$ by subtracting the total of all block totals for all blocks containing Treatment $i$. The estimated Treatment $i$ effect is

$$\hat{\tau}_i = \frac{Q_i}{\lambda t}, \quad (5)$$

and the adjusted (adj) average for that treatment is

$$\bar{Y}_{i(\mathrm{adj})} = \bar{Y} + \hat{\tau}_i. \quad (6)$$

The sum of squares breakdown for an intrablock analysis is as follows:

$$SS_{\mathrm{Total}} = \sum Y_{ij}^2 - C, \quad (7)$$

$$SS_{\mathrm{Blocks}} = \frac{\sum_j B_j^2}{k} - C, \quad (8)$$

$$SS_{\mathrm{Treatments(adj)}} = \frac{\sum_i Q_i^2}{\lambda k t}, \quad (9)$$

and

$$SS_{\mathrm{Residual}} = SS_{\mathrm{Total}} - SS_{\mathrm{Blocks}} - SS_{\mathrm{Treatments(adj)}}. \quad (10)$$

The degree of freedom values are $df_{\mathrm{Total}} = N - 1$, $df_{\mathrm{Blocks}} = b - 1$, $df_{\mathrm{Treatments}} = t - 1$, and $df_{\mathrm{Residual}} = N - t - b + 1$, respectively.

Alternatively, for an interblock analysis, we can find all treatment averages regardless of the blocks in which they are located. Table 1(b) shows these means for our example data. It is intuitive that we need to compute differences such as $\bar{Y}_A - \bar{Y}_B$ and $\bar{Y}_A - \bar{Y}_C$ on the basis of several observations rather than on one for each treatment, thus computing an interblock comparison for any two treatments. In practice, we can either proceed with a general method [7, p. 225] of finding a total sum of squares due to regression
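The intrablock computations of (3) through (10) can be sketched in a few lines; the function name and the tiny three-treatment design used below are illustrative assumptions, not the [6] data:

```python
def intrablock_anova(blocks, k, lam, t):
    """Intrablock BIBD analysis, eqs. (3)-(10): blocks is a list of dicts,
    each mapping its k treatments to their scores."""
    N = sum(len(bl) for bl in blocks)
    grand = sum(y for bl in blocks for y in bl.values())
    C = grand ** 2 / N                                         # eq. (3)
    B = [sum(bl.values()) for bl in blocks]                    # block totals
    treatments = sorted({trt for bl in blocks for trt in bl})
    T = {trt: sum(bl[trt] for bl in blocks if trt in bl) for trt in treatments}
    Q = {trt: k * T[trt] - sum(Bj for bl, Bj in zip(blocks, B) if trt in bl)
         for trt in treatments}                                # eq. (4)
    tau = {trt: Q[trt] / (lam * t) for trt in treatments}      # eq. (5)
    ss_total = sum(y ** 2 for bl in blocks for y in bl.values()) - C   # (7)
    ss_blocks = sum(bj ** 2 for bj in B) / k - C                       # (8)
    ss_treat_adj = sum(q ** 2 for q in Q.values()) / (lam * k * t)     # (9)
    ss_resid = ss_total - ss_blocks - ss_treat_adj                     # (10)
    return {"Q": Q, "tau": tau, "ss_total": ss_total,
            "ss_blocks": ss_blocks, "ss_treat_adj": ss_treat_adj,
            "ss_resid": ss_resid}
```

A useful sanity check is that the $Q_i$ always sum to zero, since each block total is subtracted exactly $k$ times across treatments.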
Balanced Incomplete Block Designs 3
and extracting new sums of squares from it as needed, or by computing the values of $SS_{\mathrm{Total}}$ and $SS_{\mathrm{Residual}}$ from the intrablock analysis and then finding the unadjusted value of the sum of squares for treatments with the obvious formula:

$$SS_{\mathrm{Treatments}} = \frac{\sum_i T_i^2}{r} - C, \quad (11)$$

and finding the adjusted sum of squares for blocks by subtraction:

$$SS_{\mathrm{Blocks(adj)}} = SS_{\mathrm{Total}} - SS_{\mathrm{Treatments}} - SS_{\mathrm{Residual}}. \quad (12)$$

Table 2 presents three versions of the analysis of variance for the [6] BIBD example, going beyond calculation of mean squares to include F tests not reported earlier. The first two analyses correspond to the two just described. The so-called Type I analyses extract one effect first and then adjust the sum of squares for the next variable by excluding effects of the first. The first example of a Type I analysis is consistent with a previously published [6, p. 54] intrablock analysis, in which block effects are extracted first, followed by the extraction of treatment effects adjusted for block effects. Here treatment effects and block effects both have very small P values (p < 0.0001).

The second Type I analysis reverses the order of extracting the sums of squares. Treatment effects are computed first and then block effects adjusted for treatment effects are measured. Therefore, this is an interblock analysis. For this analysis, the P value for treatment effects is very small (p < 0.0001), but the adjusted block effects do not even come close to the 0.05 level.

The third analysis in Table 2 is called a Type II analysis. Here an adjusted sum of squares is used from each effect that has been assessed. Accordingly, for treatments and blocks that are the only factors of interest, this analysis is implied by its predecessors, using $SS_{\mathrm{Treatments(adj)}}$ from the intrablock analysis and $SS_{\mathrm{Blocks(adj)}}$ from the interblock analysis. All Table 2 entries not previously computed [6] could have been obtained with a hand calculator using the formulas above or simpler ones such as $F = MS_{\mathrm{Effect}}/MS_{\mathrm{Error}}$. Actually, they were calculated using a SAS Proc GLM Type I or II analysis.

The model of (1) assumes that both treatment and block effects are fixed (see Fixed and Random Effects). An alternate mixed effects model used in the SAS program just mentioned is

$$Y_{ij} = \mu + \tau_i + b_j + e_{ij}, \quad (13)$$

where the Roman symbol $b_j$ implies that the effect for the $j$th block is random rather than fixed, as with the Greek $\beta_j$ of (1). This new model assumes that the blocks have been selected at random from a population of blocks. Most authors performing interblock analyses employ the model in (13).

All sums of squares, mean squares, and F statistics obtained with these two models are identical, but expected mean squares for them differ. One reason to consider using an intrablock analysis is that the treatment mean square in a BIBD is not contaminated by block effects. With an intrablock analysis of the current data using (13), this expected mean square is $\sigma^2$ plus a function of treatment effects, $\tau_i^2$. In contrast, with an interblock analysis, the expected mean square is $\sigma^2 + 0.75\sigma_b^2$ plus a function of treatment effects. With the latter expected mean square, a significantly large F for treatments might be due to block effects, $\sigma_b^2$, rather than treatment effects. A comparable problem arises in testing block effects in an intrablock analysis, where the expected mean square for blocks is $\sigma^2 + 3\sigma_b^2$ plus a function of treatment effects. If a
test for block effects is not of interest to the experimenter, an intrablock analysis is fine; one merely fails to report a test for block effects. However, a Type II analysis usually has the advantage of yielding uncontaminated expected mean squares (and therefore uncontaminated F values) for both treatment and block effects.

Evaluating Differences Between Two Treatment Effects and Other Contrasts of Effects

An overall assessment of treatment effects in the dishwashing experiment [6] can be supplemented by a comparison of the adjusted average numbers of dishes washed with one detergent and some other detergent. Alternatively, one can make more complicated analyses such as comparing the adjusted average of A and B with the adjusted average for C (see Multiple Comparison Procedures). Consider a comparison between A and B. From (6), we know how to compute $\bar{Y}_{A(\mathrm{adj})}$ and $\bar{Y}_{B(\mathrm{adj})}$. We need to know the variance (V) and standard error (s.e.) of each adjusted treatment mean and of the difference between two independent adjusted means. Knowing from [1, p. 275] that

$$V(\bar{Y}_{i(\mathrm{adj})}) = \frac{k\sigma^2}{\lambda t}, \quad (14)$$

for each treatment is helpful, but we must estimate $\sigma^2$ with $MS_{\mathrm{Error}}$ from Table 2 and then use the relation

$$V(\bar{Y}_{A(\mathrm{adj})} - \bar{Y}_{B(\mathrm{adj})}) = \frac{2k\sigma^2}{\lambda t}, \quad (15)$$

leading to

$$\mathrm{s.e.}(\bar{Y}_{A(\mathrm{adj})} - \bar{Y}_{B(\mathrm{adj})}) = \sqrt{\frac{2k\sigma^2}{\lambda t}}. \quad (16)$$

Now a standard t Test with $df = N - t - b + 1$ is

$$t = \frac{\bar{Y}_{A(\mathrm{adj})} - \bar{Y}_{B(\mathrm{adj})}}{\mathrm{s.e.}(\bar{Y}_{A(\mathrm{adj})} - \bar{Y}_{B(\mathrm{adj})})}. \quad (17)$$

For the dishwashing experiment, (4)–(6) and (14)–(17) yield

$$T_A = 79, \quad T_B = 67, \quad B_A = 234, \quad B_B = 221,$$
$$Q_A = 3, \quad Q_B = -20, \quad \hat{\tau}_A = 0.333, \quad \hat{\tau}_B = -2.222,$$
$$\bar{Y}_{A(\mathrm{adj})} = 19.75, \quad \bar{Y}_{B(\mathrm{adj})} = 17.19,$$
$$V(\bar{Y}_{A(\mathrm{adj})}) = 0.2733 = V(\bar{Y}_{B(\mathrm{adj})}),$$
$$V(\bar{Y}_{A(\mathrm{adj})} - \bar{Y}_{B(\mathrm{adj})}) = 0.5467,$$
$$\bar{Y}_{A(\mathrm{adj})} - \bar{Y}_{B(\mathrm{adj})} = 2.556, \quad \mathrm{s.e.}(\bar{Y}_{A(\mathrm{adj})} - \bar{Y}_{B(\mathrm{adj})}) = 0.74,$$

and

$$t = 3.46, \quad (18)$$

with $df = 36 - 9 - 12 + 1 = 16$, so that a two-tailed test has p < 0.01. So performance under Detergent A is significantly better than under Detergent B. More complicated contrast analyses use standard methods, essentially the same as those with means from a one-way analysis of variance. The principal difference in the BIBD case is that means, variances, and standard errors reflect the adjustments given above.

Evaluation of Possible Polynomial Trends in Treatment Effects for BIBD Data

A further analysis [6] of the dishwashing data moved from the usual study of qualitative treatment variations to quantitative treatment variations. The nine detergents studied included a standard detergent (Control), four with a first base detergent with 0 to 3 amounts of an additive, and four with a second base detergent with 0 to 3 amounts of an additive. Eight contrasts among the nine treatments were evaluated: linear, quadratic, and cubic components for Base 1; linear, quadratic, and cubic components for Base 2; Base 1 versus Base 2; and Control versus Bases 1 and 2 combined. The resulting fixed effects ANOVA found significant linear and quadratic effects of additive amounts for Base 1, significant linear effects of additive amounts for Base 2, significant superiority of Base 2 over Base 1, and significant superiority of Control over the averages of Bases 1 and 2. As expected, the linear effects were increases in the number of plates washed with increasing amounts of additive. Also see [6] for formulas for this contrast analysis. See also [4, Ch. 5] and various sources such
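The contrast arithmetic of (14) through (18) is easy to reproduce; the function name below is an assumption, and MS_Error = 0.82 is backed out of the reported V value of 0.2733 rather than read from Table 2, which is not reproduced here:

```python
from math import sqrt

def adjusted_mean_comparison(diff, mse, k, lam, t, N, b):
    """Variance, standard error, t, and df for the difference between two
    adjusted treatment means in a BIBD, following eqs. (14)-(17)."""
    var_mean = k * mse / (lam * t)   # (14): V of one adjusted mean
    var_diff = 2 * var_mean          # (15): V of the difference
    se_diff = sqrt(var_diff)         # (16)
    t_stat = diff / se_diff          # (17)
    df = N - t - b + 1               # residual degrees of freedom
    return var_mean, var_diff, se_diff, t_stat, df

# Dishwashing experiment values: diff = 2.556, k = 3, lambda = 1, t = 9
stats = adjusted_mean_comparison(2.556, 0.82, 3, 1, 9, 36, 12)
```

With these inputs the function reproduces V = 0.2733, s.e. = 0.74, t = 3.46, and df = 16, matching (18).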
as manuals for statistical computing packages for a general treatment of polynomial fitting and significance testing.

Bayesian Analysis of BIBD Data

Bayesian analysis adds to conventional statistics a set of assumptions about probable outcomes, thus combining observed data and the investigator's expectations about results (see Bayesian Methods for Categorical Data). These expectations are summarized in a so-called prior distribution with specific or even very vague indications of a central tendency measure and variability measure related to those beliefs. The prior distribution plus the observed data and classical assumptions about the data are combined to yield a posterior distribution assigning probabilities to parameters of the model. Box and Tiao [2, pp. 396–416] present a Bayesian method of analysis for BIBD data sets originally assumed to satisfy a mixed model for analysis of variance. Beginning with an assumption of a noninformative prior distribution with equal probabilities of all possible $\sigma_b^2$ (block effect variance) and $\sigma_e^2$ (error variance) values, they prove an equation defining the posterior distribution of the parameters $\tau_i$. This posterior distribution is a product of three factors: (a) a multivariate t distribution centered at the mean of intrablock treatment effects, (b) a multivariate t distribution centered at the mean of interblock treatment effects, and (c) an incomplete beta integral with an upper limit related to the treatment vector of parameters $\tau_i$. Combining the first two factors permits giving a combined estimate of treatment effects simultaneously reflecting intrablock and interblock effects. Applications of approximation procedures are shown in their [2, pp. 415–417]. Their Table 7.4.5 includes numerical results of this initially daunting analysis method as applied to a set of simulated data for a three-treatment, fifteen-block BIBD experiment. Also see [7, pp. 235–238] for a non-Bayesian combination of intrablock and interblock treatment effects.

Youden Squares as a Device for Increasing the Number of Effects Studied in a BIBD Model

A Youden square is a BIBD with the same number of treatments as blocks. One advantage of using such a design is that it permits (but does not require) the inclusion and assessment of an additional experimental variable with as many values as the number of plots per block. Normally each value of the added (auxiliary) variable occurs exactly once with each value of the main treatment variable. This is an orthogonal relationship between two variables. Box, Hunter, and Hunter [1, p. 260, pp. 276–279] describe the use of a Youden square design in a so-called wear testing experiment, in which a machine simultaneously measures the amount of wear in k = 4 different pieces of cloth after an emery paper has been rubbed against each for 1000 revolutions of the machine. The observed wear is the weight loss (number of 1-milligram units) in a given piece of cloth. Their example presumes t = 7 kinds of cloth (treatments) and b = 7 testing runs (blocks). An added variable, position of the emery paper rubbing a cloth, had four options, each appearing with one of the four cloths of a block. In discussing this experiment, the authors expand (1) above to include an $\alpha_l$ effect with this general purpose:

$$Y_{ij} = \mu + \alpha_l + \tau_i + \beta_j + e_{ij}. \quad (19)$$

In their case, this extra effect is called an l = blocks (positions) effect because it is generated by the emery paper positions within blocks. Table 3 combines information from [1, p. 260, p. 277] to display wear, treatments (from A to G), and blocks (positions) in each plot of each block in the experiment. In order to facilitate a later analysis in Table 5, I have reordered cells within the different blocks of [1, p. 277] in a nonunique way ensuring that each treatment appears exactly once in each column of Table 3.

Table 3 Design and data from a Youden square experiment with an extra independent variable [1, p. 260, p. 277]

BLOCK   PLOT 1   PLOT 2   PLOT 3   PLOT 4   Total
1       B 627    D 248    F 563    G 252    1690
2       A 344    C 233    G 226    F 442    1245
3       C 251    G 297    D 211    E 160     919
4       G 300    E 195    B 537    A 337    1369
5       F 595    B 520    E 199    C 278    1592
6       E 185    F 606    A 369    D 196    1356
7       D 273    A 396    C 240    B 602    1511

A Type I analysis of variance [1, p. 279] assesses
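Assuming the treatment letters of Table 3 are transcribed correctly, the design's BIBD and Youden-square properties can be checked mechanically:

```python
from collections import Counter
from itertools import combinations

# Treatment letters of the Table 3 wear-testing design: one row per
# block (testing run), one entry per plot (column).
design = [list("BDFG"), list("ACGF"), list("CGDE"), list("GEBA"),
          list("FBEC"), list("EFAD"), list("DACB")]

t, b, k = 7, 7, 4
reps = Counter(trt for block in design for trt in block)       # replications
pairs = Counter(frozenset(p) for block in design
                for p in combinations(block, 2))               # concurrences
r = reps["A"]
lam = r * (k - 1) // (t - 1)   # lambda = r(k-1)/(t-1), which is 2 here
```

The checks confirm every treatment occurs r = 4 times, every pair of treatments meets in exactly λ = 2 blocks, and (after the reordering described above) every treatment appears exactly once in each column.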
these effects in the following order: blocks, blocks (positions), and treatments. Compared to a comparable analysis [1, p. 278] without blocks (positions), the authors find identical sums of squares for treatments and blocks as before. This is a consequence of the orthogonality between treatments and blocks (positions) in the wear testing experiment. Because the sum of squares for the residual error is reduced by the amount of the sum of squares for blocks (positions), the F for treatments is increased in the expanded analysis.

Modern Mixed Model Analysis of BIBD Data

In the early analysis of variance work, equations like (1) and (13) above were universally employed, using what is called the generalized linear model (GLM), regardless of whether random effects other than error were assumed. More modern work such as [9, p. 139] replaces (13), for example, with

$$Y = XB + Zu + e, \quad (20)$$

where $X$ is a design matrix for fixed effects, $B$ is a vector of fixed parameters, $Z$ is the design matrix for random effects, $u$ is a vector of random elements, and $e$ is the vector of error elements. The new kind of analysis estimates the size of random effects but does not test them for significance. Thus, random block effects become a covariate used in assessing treatment effects rather than effects to be tested for significance themselves. We believe, like others such as Lunneborg [10], that more attention needs to be given to the question of whether so-called random effects have indeed been randomly drawn from a specific population. Strict justification of using a mixed model analysis or even a random or mixed model GLM analysis seems to require an affirmative answer to that question or some indication of the test's robustness to its failure.

Table 4 summarizes a reanalysis of Table 1 dishwashing data using SAS PROC MIXED. Because blocks are now treated as a random covariate, the only significance test is for treatments, yielding an F = 220.57, almost identical to the 225.94 for the comparable interblock analysis test in Table 2.

Table 4 A SAS PROC MIXED analysis of Table 1 BIBD data, extracting only treatment effects (blocks are random and serve as a control factor)

Source        df(numerator)   df(denominator)   SS(numerator)   F        p
Treatments    8               16                Not shown       220.57   <0.001

Using Youden Squares to Examine Period Effects in a BIBD Experiment Using Mixed Models

Psychologists may desire to use a Youden square or other modified BIBD design permitting an assessment of the effects of additional features, such as occasions (periods, stages, time, or trial number). Accordingly, Table 5 presents a PROC MIXED analysis of the Table 3 wear testing experiment data with Plots 1 through 4 now being interpreted as stages, and blocks (positions) ignored. This new analysis shows significant (p < 0.0001) effects of treatments but not of stages, even at the 0.05 level. The program employed uses the Kenward–Roger degrees of freedom method of PROC MIXED, as described in [8]. Behavioral scientists using repeated measures in Youden squares and other BIBDs also may want to sacrifice the advantages of orthogonality, such as higher efficiency, in order to analyze both period effects and carryover effects (residual effects of prior treatments on current behavior) as is done with other designs [8]. This also permits the selection from a larger number of BIBDs of a given size, which can facilitate use of randomization tests.

Table 5 A SAS PROC MIXED reanalysis of the Table 3 Youden square example data (assume that plots (columns) are stages whose effects are to be assessed)

Effect       df(numerator)   df(denominator)   F       p
Stage        3               12                2.45    0.1138
Treatment    6               13.5              74.61   <0.0001

Covariance parameter   Estimate
Block                  367.98
Residual               1140.68
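The matrices of (20) can be written out for a tiny BIBD; the helper name and the three-block toy design below are illustrative assumptions, not the dishwashing layout:

```python
def design_matrices(blocks, treatments):
    """Build X (intercept plus one dummy column per treatment) and Z (one
    dummy column per block) for the mixed model Y = XB + Zu + e, with one
    row per observation."""
    X, Z = [], []
    for j, block in enumerate(blocks):
        for trt in block:
            X.append([1] + [1 if trt == s else 0 for s in treatments])
            Z.append([1 if j == jj else 0 for jj in range(len(blocks))])
    return X, Z

# Toy BIBD with t = 3 treatments, b = 3 blocks, k = 2 plots per block
X, Z = design_matrices([("A", "B"), ("B", "C"), ("C", "A")], ["A", "B", "C"])
```

Each row of X carries the intercept plus exactly one treatment dummy, and each row of Z assigns the observation to exactly one block, which is all that separates the fixed-effect and random-effect parts of (20).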
Assessing Treatment Effects in the Light of Observed Covariate Scores

Littell, Milliken, Stroup, and Wolfinger [9, pp. 187–201] provide extended examples of SAS PROC MIXED analysis of BIBD data for which measures of a covariate also are available. Educators also will be interested in their [9, pp. 201–218] PROC MIXED analyses of covariance for data from two split-plot experiments on the effectiveness of different teaching methods, with years of teachers' experience as a covariate in one study and pupil IQ as a covariate in the other.

Construction of Balanced Incomplete Block Designs

We delayed this topic until now in order to take into account the variations of BIBD discussed above. Finding all possible designs for a given set of b, k, r, t, and $\lambda$ is a problem in algebraic combinatorics. Cox and Reid [3, pp. 72–73] provide a recent summary of some possible designs in the range k = 2 to 4, t = 3 to 16, b = 3 to 35, and r = 2 to 10. See also [7, pp. 221–223, pp. 268–287] for theory and an introduction to early and relevant journal articles, as well as [1, pp. 269–275; 5, p. 74, Table XVII].

Randomization of blocks or possibly also of plots in a BIBD is in principle good experimental design practice [3, pp. 79–80, pp. 252–253]. We now consider the number of possible BIBD designs in the case of very small experiments with t = 3 treatments, b = 3 blocks, and k = 2 plots per block, controlling the set from which a random experiment must be selected.

Case 1. Here is a tiny BIBD with three two-treatment blocks containing Treatments A B, B C, and C A, respectively. Clearly there are 3! = 6 possible permutations of the three blocks independently of position ordering in the blocks themselves. So a reasonable selection of a BIBD of this structure would choose randomly from 6 options.

Case 2. The examples in Case 1 above are also Youden squares because t = b. Suppose we have an auxiliary variable for plots in a block, like the blocks (position) of Table 3 or the stage number. Let its values be $\alpha$ and $\beta$. Given the same structure of t = b = 3, k = 2, r = 2, and $\lambda = 1$ as before, there are three pairs of ordinary letters that must partially define our three blocks: A B, A C, and B C. With each pair there are two orderings of the Greek letters. So we can have ($\alpha$A $\beta$B) or ($\beta$A $\alpha$B) for an A B block. If we do not require orthogonality of treatments and the auxiliary variable, there are $2^3 = 8$ orderings of Greek pairs for the three blocks. But the three blocks may be permuted in six ways, yielding a total of 48 possible designs for this specific Youden square example. With this many options for a simple design or, even better, with larger designs with many options, a random selection of a specific set of t blocks has the further possible advantage of permitting use of randomization theory in significance testing.

Case 3. Let us modify Case 2 to require orthogonality of treatments and the auxiliary variable. With that constraint, there are two canonical Youden squares for the t = b = 3, k = 2, r = 2, and $\lambda = 1$ case: ($\alpha$A $\beta$B, $\alpha$B $\beta$C, and $\alpha$C $\beta$A) or ($\alpha$B $\beta$A, $\alpha$C $\beta$B, and $\alpha$A $\beta$C). Each canonical set has 6 permutations of its 3 blocks, yielding a total of 12 possible Youden square designs from which to sample.

Miscellaneous Topics: Efficiency and Resolvable Designs

Efficiency relates most clearly to the variance of a contrast such as between two treatment effects, $\hat{\tau}_A - \hat{\tau}_B$. The reference variance is the variance between such effects in a complete design such as a two independent groups t Test or a two-group comparison in a standard randomized block design with every treatment present in every block. Cox and Reid [3, pp. 78–79] define the efficiency factor of a BIBD as

$$E = \frac{t(k-1)}{(t-1)k}, \quad (21)$$

being less than 1 for all incomplete designs with at least t = 2 treatments and $t > k$ = the number of plots (units) in a block. But $E$ is not enough to define efficiency. Quite possibly, the error variance of scores for the different units of a block of t units, $\sigma_t^2$, is different from the error variance of scores from a block with k units, $\sigma_k^2$. Therefore, the efficiency of a BIBD compared to an ordinary randomized block design takes into account these variances as well as $E$:

$$\mathrm{Efficiency} = E\,\frac{\sigma_t^2}{\sigma_k^2}. \quad (22)$$
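Equations (21) and (22) translate directly into code (the function names are assumptions):

```python
def efficiency_factor(t, k):
    """Efficiency factor E of a BIBD relative to a complete block
    design, eq. (21)."""
    return t * (k - 1) / ((t - 1) * k)

def efficiency(t, k, var_t, var_k):
    """Overall efficiency, eq. (22): E scaled by the ratio of error
    variances for blocks of t units versus k units."""
    return efficiency_factor(t, k) * var_t / var_k
```

For the dishwashing design (t = 9, k = 3) the efficiency factor is 0.75, and the factor stays below 1 for every incomplete design with t > k, as the text notes.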
A resolvable BIBD is one in which separate complete analyses of each replication of the data are possible, permitting comparison of the r replicates. In each replication of the experiment, each treatment appears exactly once. An analysis of variance for a resolvable design may split its sum of squares for blocks into a sum of squares for replicates and a sum of squares for blocks within replicates [3, pp. 73–74; 7, pp. 226–227].

References

[1] Box, G.E.P., Hunter, W.G. & Hunter, J.S. (1978). Statistics for Experimenters: An Introduction to Design, Data Analysis, and Model Building, Wiley, New York.
[2] Box, G.E.P. & Tiao, G.C. (1973). Bayesian Inference in Statistical Analysis, Addison-Wesley, Reading.
[3] Cox, D.R. & Reid, N. (2000). The Theory of the Design of Experiments, Chapman & Hall/CRC, Boca Raton.
[4] Draper, N.R. & Smith, H. (1981). Applied Regression Analysis, 2nd Edition, Wiley, New York.
[5] Fisher, R.A. & Yates, F. (1953). Statistical Tables for Biological, Agricultural, and Medical Research, 4th Edition, Hafner, New York.
[6] John, P.W.M. (1961). An application of a balanced incomplete block design, Technometrics 3, 51–54.
[7] John, P.W.M. (1971). Statistical Design and Analysis of Experiments, Macmillan, New York.
[8] Jones, B. & Kenward, M.K. (2003). Design and Analysis of Cross-over Trials, 2nd Edition, Chapman & Hall/CRC, London.
[9] Littell, R.C., Milliken, G.A., Stroup, W.W. & Wolfinger, R.D. (1996). SAS System for Mixed Models, SAS Institute Inc., Cary.
[10] Lunneborg, C.E. (2000). Data Analysis by Resampling: Concepts and Applications, Duxbury, Pacific Grove.

JOHN W. COTTON
Bar Chart
BRIAN S. EVERITT
Volume 1, pp. 125–126
A bar chart is a graphical display of data that have been classified into a number of categories. Equal-width rectangular bars are used to represent each category, with the heights of the bars being proportional to the observed frequency in the corresponding category. An example is shown in Figure 1 for the age at marriage of a sample of women in Guatemala.

An extension of the simple bar chart is the component bar chart, in which particular lengths of each bar are differentiated to represent a number of frequencies associated with each category forming the chart. Shading or color can be used to enhance the display. An example is given in Figure 2; here the numbers of patients in the four categories of a response variable for two treatments (BP and CP) are displayed.

Data represented by a bar chart could also be shown as a dot chart or a pie chart. A bar chart is the categorical data counterpart to the histogram.

[Figure 2 Component bar chart for response to treatment; vertical axis: number of patients (0 to 150); horizontal axis: treatments BP and CP.]

BRIAN S. EVERITT
[Figure 1 Bar chart of age at marriage for a sample of Guatemalan women; vertical axis: frequency (0 to 25); horizontal axis: age bands 9–10, 11–12, 13–14, 15–16, 17–18, 19–20, 21–22, 23–24, 25–26, 27–28, 29–30, 31–32, 33–34.]
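The proportional-heights construction described above can be mimicked with a toy text rendering (all names here are illustrative, not from the article):

```python
from collections import Counter

def text_bar_chart(categories, width=40):
    """Tally category frequencies and render one bar per category whose
    length is proportional to its count, the defining property of a bar
    chart."""
    counts = Counter(categories)
    peak = max(counts.values())
    lines = []
    for cat, n in counts.most_common():
        bar = "#" * round(width * n / peak)
        lines.append(f"{cat:<12}{bar} {n}")
    return "\n".join(lines)
```

Feeding in raw category labels (for example, one "BP" or "CP" per patient) yields bars whose relative lengths match the observed frequencies.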
the remaining m − 3 components, and so on, until the variable accounting for the largest proportion of the (m − 1)th residual variance (variable m) is found. Variables 1 through m are then the variables which reproduce, as much as possible, the variance retained by the m principal components, and so also the salient variance contained in the original n variables. In the vocabulary of principal components analysis, variable 1 is the first transformed component, variable 2 is the second, and so on. To determine how much of the original variance of all n variables is explained by the m transformed components, we simply compute the sum of squares of all the loadings in the final n × m Gram–Schmidt rotated matrix (this should be close to the sum of squares of the elements of the n × m initial component matrix A). The following example will illustrate the use of the Gram–Schmidt process in battery reduction.

In the Framingham Heart Study, a 10-question depression scale was administered (so n = 10), where the responses were No or Yes to the following (the corresponding name by which each question will hereafter be referred to is enclosed in parentheses):

1. I felt everything I did was an effort (EFFORT).
2. My sleep was restless (RESTLESS).
3. I felt depressed (DEPRESS).
4. I was happy (HAPPY).
5. I felt lonely (LONELY).
6. People were unfriendly (UNFRIEND).
7. I enjoyed life (ENJOYLIF).
8. I felt sad (FELTSAD).
9. I felt that people disliked me (DISLIKED).
10. I could not get going (GETGOING).

A Yes was scored as 1 and No as 0, except for questions 4 and 7, where this scoring was reversed so that a score of 1 would indicate depression for all questions.

After performing a principal components analysis on these data, there were three components with variances greater than unity. The variances of these three components were 3.357, 1.290, and 1.022, for a percentage variance explained equal to 100 × (3.357 + 1.290 + 1.022)/10 = 56.69%. Thus, using the Kaiser rule for selecting the number of retained components [1], we set m equal to 3 for this example. The 10 × 3 initial component matrix A is in Table 1.

Table 1 Initial component matrix A for Framingham Heart Study depression questionnaire

            a1      a2      a3      h²
EFFORT      0.60    0.15    0.41    0.55
RESTLESS    0.39    0.07    0.55    0.46
DEPRESS     0.77    0.13    0.10    0.62
HAPPY       0.70    0.23    0.06    0.55
LONELY      0.64    0.23    0.21    0.51
UNFRIEND    0.35    0.68   −0.33    0.69
ENJOYLIF    0.52    0.27    0.27    0.42
FELTSAD     0.71    0.22    0.20    0.59
DISLIKED    0.34    0.72    0.22    0.68
GETGOING    0.58    0.20    0.47    0.60

Note: h² = a1² + a2² + a3² is the communality.

Now, to use Gram–Schmidt transformations to determine the three variables which explain the largest portion of the salient variance from the original 10 variables, we do the following:

1. Find, from A in Table 1, the variable which explains the largest proportion of salient variance from the original 10 variables. This is the variable UNFRIEND, with a sum of squares of loadings (communality) across the three components equal to 0.35² + 0.68² + (−0.33)² = 0.69.

2. Take the loadings of UNFRIEND from Table 1 (0.35, 0.68, −0.33) and normalize them (i.e., divide each element by the square root of the sum of the squares of all three elements). This yields the normalized loadings 0.42, 0.82, −0.40.

3. Create a 3 × 3 (m × m) matrix $Y_1$, which, in the Gram–Schmidt process, is given by

$$Y_1 = \begin{pmatrix} a & b & c \\ -k_2 & ab/k_2 & ac/k_2 \\ 0 & -c/k_2 & b/k_2 \end{pmatrix}, \quad (2)$$

where a = 0.42, b = 0.82, c = −0.40 (the normalized row of UNFRIEND from A), and $k_2 = (1 - a^2)^{1/2}$. Thus,

$$Y_1 = \begin{pmatrix} 0.42 & 0.82 & -0.40 \\ -0.91 & 0.38 & -0.18 \\ 0 & 0.44 & 0.90 \end{pmatrix}. \quad (3)$$

4. Calculate $AY_1$, which is shown in Table 2. Note that, for UNFRIEND, the only nonzero loading
Battery Reduction 3
Table 2 The matrix AY₁

            b1      b2      b3      res. h²
EFFORT      0.21    0.56    0.44    0.51
RESTLESS    0.00    0.43    0.53    0.47
DEPRESS     0.26    0.73    0.15    0.56
HAPPY       0.13    0.71    0.15    0.52
LONELY      0.16    0.63    0.29    0.48
UNFRIEND    0.84    0.00    0.00    0.00
ENJOYLIF    0.11    0.53    0.36    0.41
FELTSAD     0.20    0.69    0.28    0.55
DISLIKED    0.82    0.00    0.12    0.01
GETGOING    0.22    0.54    0.51    0.55

Note: res. h² = residual communality = b2² + b3².

Table 3 The rotated reduced component matrix C

            c1      c2      c3      h²
EFFORT      0.21    0.46    0.54    0.55
RESTLESS    0.00    0.31    0.61    0.46
DEPRESS     0.26    0.75    0.00    0.63
HAPPY       0.13    0.73    0.00    0.55
LONELY      0.16    0.68    0.16    0.51
UNFRIEND    0.84    0.00    0.00    0.70
ENJOYLIF    0.11    0.59    0.25    0.42
FELTSAD     0.20    0.73    0.14    0.59
DISLIKED    0.82    0.02    0.12    0.67
GETGOING    0.22    0.43    0.61    0.60

Note: h² = c1² + c2² + c3² is the final communality.

is on the first component (or first column). This loading is equal to the square root of the sum of squares of the original loadings of UNFRIEND in matrix A (thus, no information explained by UNFRIEND is lost during the rotation process). For each of the remaining variables in Table 2, we have the following: (i) the squares of the elements in the first column are the portions of the variances of these variables which are accounted for by UNFRIEND; and (ii) the sum of the squares of the elements in the second and third columns is the residual variance (i.e., the variance of the variables not accounted for by UNFRIEND).

5. Find the variable which explains the largest proportion of residual variance (i.e., has the largest residual communality). This is the variable DEPRESS, with a sum of squares of loadings across the last two columns of Table 2 equal to 0.73² + 0.15² = 0.56.

6. Take the loadings of DEPRESS from Table 2 (0.73, 0.15) and normalize them. This yields the normalized loadings 0.98, 0.20.

7. Create a 2 × 2 matrix $Y_2$, which, in the Gram–Schmidt process, is given by

$$Y_2 = \begin{pmatrix} b & c \\ -c & b \end{pmatrix}, \quad (4)$$

where b = 0.98, c = 0.20 (the normalized row of DEPRESS from the last two columns of Table 2). Thus,

$$Y_2 = \begin{pmatrix} 0.98 & 0.20 \\ -0.20 & 0.98 \end{pmatrix}. \quad (5)$$

8. Postmultiply the last two columns of $AY_1$ by $Y_2$; the result is shown in the last two columns of Table 3. The first column of Table 3 is the first column of $AY_1$. Together, the three columns are called the rotated reduced component matrix (matrix C of Table 3). Note that, for DEPRESS, the loading on the last component (or last column) is zero. The sum of squares of the loadings (the final communality) of DEPRESS in Table 3 is, within rounding error, equal to the sum of squares of the loadings of DEPRESS in the initial component matrix A (0.63 vs. 0.62; thus, no information explained by DEPRESS is lost during the rotation process). For the remaining variables in the second column of Table 3, the squares of the elements are the portions of the variances of these variables which are accounted for by DEPRESS.

9. The last of the three variables which explains the largest portion of variance in the original 10 variables is GETGOING, since its loading is largest in the last column of Table 3.

10. The sum of squares of all the loadings in Table 3 is approximately equal, within rounding error, to the sum of squares of loadings in A.

Thus the three variables UNFRIEND, DEPRESS, and GETGOING alone retain approximately the same variance that was retained by the first three principal components (which involved all 10 original variables). We have reduced the original battery of 10 questions to three.

The above is presented only as an illustration. It is unlikely that a researcher would need to perform
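The selection-and-deflation logic of steps 1 through 9 can be sketched generically; because the printed tables lose the loadings' signs, the demonstration below uses a synthetic loading matrix rather than Table 1, and the function name is an assumption:

```python
def battery_reduction(A, m):
    """Gram-Schmidt battery reduction: repeatedly take the variable (row)
    with the largest residual communality, then project its direction out
    of every row, so each later pick explains only leftover variance."""
    rows = [list(r) for r in A]
    chosen = []
    for _ in range(m):
        resid = [sum(x * x for x in r) for r in rows]   # residual communalities
        best = max((c, i) for i, c in enumerate(resid) if i not in chosen)[1]
        chosen.append(best)
        norm = sum(x * x for x in rows[best]) ** 0.5
        u = [x / norm for x in rows[best]]              # normalized loadings
        for i, r in enumerate(rows):                    # deflate every row
            dot = sum(p * q for p, q in zip(r, u))
            rows[i] = [p - dot * q for p, q in zip(r, u)]
    return chosen

# Synthetic 4-variable, 3-component loading matrix: variables 0, 2, and 3
# carry nearly all the distinct variance, variable 1 echoes variable 0.
A = [[1.0, 0.0, 0.0],
     [0.95, 0.2, 0.0],
     [0.0, 0.0, 0.9],
     [0.1, 0.85, 0.0]]
```

On this matrix the loop first picks variable 0, deflation leaves variable 1 with almost no residual communality, and variables 2 and 3 are picked next, mirroring how UNFRIEND's selection shrank DISLIKED's residual variance in Table 2.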
Figure 4 Calculations for the probability of being grouchy for the weather-on-mood BBN [event tree omitted: p(Clear) = 0.7, with p(Pleasant | Clear) = 0.8 and p(Grouchy | Clear) = 0.2; p(Rainy) = 0.3, with p(Pleasant | Rainy) = 0.1 and p(Grouchy | Rainy) = 0.9; so p(Grouchy) = 0.7 × 0.2 + 0.3 × 0.9 = 0.14 + 0.27 = 0.41]
Bayesian Belief Networks

Figure 5 The expanded BBN to take account of the weather forecast, with its less-than-perfect reliability

Figure 6 The effect on uncertainty about weather and mood of a clear forecast

Let H, a hypothesis, stand for tomorrow's weather, and D, data, today's forecast. Bayes's theorem provides the basis for calculating p(H|D) from the inverse probability, which we know, p(D|H), representing the forecaster's reliability. Bayes's theorem is a simple consequence of the above probability laws, with the added recognition that p(H and D) must equal p(D and H); the order in which H and D are written down makes no difference to their joint probability. Since p(H and D) = p(H|D) × p(D), and p(D and H) = p(D|H) × p(H), equating the right-hand sides of the equations and rearranging terms gives Bayes's theorem:

p(H|D) = p(H) × p(D|H) / p(D),   (1)

that is, posterior probability = prior probability × likelihood / probability of the data (see Bayesian Statistics).

This result can be shown by flipping the original event tree, but Bayes's theorem is easier to apply in tabular form; see Table 1. Recall that the datum, D, is the forecaster's prediction today, "clear", and H is the weather that will be realized tomorrow.

Table 1 Application of Bayes's theorem after receipt of the forecast "clear"

Weather H   Prior p(H)   Likelihood p(D|H)   p(H) × p(D|H)   Posterior p(H|D)
Clear       0.70         0.85                0.595           0.888
Rainy       0.30         0.25                0.075           0.112
                                             Sum = p(D) = 0.67

Note that D, the forecaster's "clear", stays the same in the table, while the hypothesis H, the next day's weather, changes: Clear in the first row and Rainy in the second.

The unreliability of the forecast has increased your original assessment of a 20% chance of grouchiness, if the actual weather is clear, to a 27.8% chance if the forecast is for clear. And by changing the rain forecast probability to 100% in the BBN (Figure 7), your chance of being grouchy becomes 67.7%, rather less than your original 90%, largely because the forecasts are less reliable for rain than they are for clear.

So applying the laws of probability, with the multiplication and addition laws operating in one direction and Bayes's theorem applied in the other direction, allows information to be propagated throughout the network. Suppose, for example, that several days later you recall being in a pleasant mood on the day in question, but can't remember what the weather was like. Changing the probability of your mood to 100% for pleasant in the BBN gives the result shown in Figure 8.

The chance of good weather, while in reality now either zero or 100%, is, for you at this moment, nearly 95%, and the chance of a clear forecast the day before, about 82%. In summary, propagating your information in the direction of an arrow requires application of the multiplication and addition laws of probability, whereas propagating information against an arrow's direction invokes Bayes's theorem.
Figure 7 The effect on uncertainty about weather and mood of a forecast of rain
Figure 8 Inferences about the weather forecast and subsequently realized weather if a pleasant mood is all that can be
recollected for the day of the original inference
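The tabular application of Bayes's theorem in Table 1, and the onward propagation to mood, can be reproduced in a few lines (only the values quoted in the text and table are used):

```python
# Table 1 reproduced: Bayes's theorem as posterior = prior * likelihood / p(D).
prior = {"Clear": 0.70, "Rainy": 0.30}
likelihood = {"Clear": 0.85, "Rainy": 0.25}   # p(forecast "clear" | weather H)

product = {h: prior[h] * likelihood[h] for h in prior}
p_data = sum(product.values())                       # p(D) = 0.595 + 0.075 = 0.67
posterior = {h: product[h] / p_data for h in prior}  # 0.888 and 0.112

# Propagating on to mood gives the 27.8% chance of grouchiness quoted above.
p_grouchy_given_weather = {"Clear": 0.2, "Rainy": 0.9}
p_grouchy = sum(posterior[h] * p_grouchy_given_weather[h] for h in prior)
```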
The directed graph indicates conditional dependence between events, with missing links in the graph showing independence. Thus, the lack of an arc between forecast and mood shows that the forecast has no impact on your uncertainty about tomorrow's mood. For this simple problem, the lack of arcs is trivial, but for complex BBNs consisting of tens or hundreds of nodes, the presence or absence of arcs provides a compact display that allows a user to grasp quickly the structure of the representation. A corresponding event tree could only be displayed on a computer in small sections or on a very large printed surface, and even then the structure would not be easily grasped, even with the probabilities displayed on the corresponding branches.

BBNs break the problem down into many relatively simple probability statements, and from these new insights can emerge. It is this property that was recognised early by psychologists, who first developed the fundamental idea of using human expertise to provide the probabilistic inputs [1, 3, 4]. Their studies initially assumed that data were reliable, though not definitive, in pointing to the correct hypothesis, and their systems assumed a single level of inference, from reliable datum to the hypothesis of interest. Studies comparing actual human inferences to the properties of Bayes's theorem led to the surprising conclusion that in general people do not revise their uncertainty as much as is prescribed by Bayesian calculations, a replicable phenomenon called conservatism by the psychologists who discovered it [2, 14].

But real-world data are often unreliable, ambiguous, redundant, or contradictory, so many investigators developed cascaded inference models to accommodate data unreliability and intermediate levels of uncertainty. Examples abound in medical diagnosis: unreliable data (reports of symptoms from patients) may point to physical conditions (signs only observable from tests) that in turn bear on hypotheses of interest (possible disease states). Comparing actual unaided inferences with these cascaded inference models, as reported in a special issue of Organizational Behavior and Human Performance [13], showed occasions when people became less certain than Bayesian performance, but other occasions when they were overconfident. Sometimes they assumed unreliable data were reliable, and sometimes they ignored intermediate levels of inference. Although another psychologist, David Schum, picked up these ideas in the 1960s and studied the parallels between legal reasoning and Bayesian inference [16], the agenda for studying human judgement in the face of uncertainty took a new turn with studies of heuristics and biases (see Heuristics: Fast and Frugal; Subjective Probability and Human Judgement).

The increasing availability of convenient and substantial computer power saw the growth of BBNs from the mid-1980s to the present day. This growth was fuelled by developments in decision analysis and in artificial intelligence [8, 11]. It became feasible to apply the technology to very complex networks [5], aided by computer programs that facilitate structuring and entry of data, with the computational complexity left to the computer (the mood model, above, was constructed using Netica [10]). For complex models, special computational algorithms are used, variously developed by Schachter [15], Lauritzen and Spiegelhalter [7], Pearl [12] and Spiegelhalter and Lauritzen [17]. Textbooks by Jensen [6] and Neapolitan [9] provide guidance on how to construct the models.

BBNs are now in widespread use in applications that require consistent reasoning and inference in situations of uncertainty. They are often invisible to a user, as in Microsoft's help and troubleshooting facilities, whose behind-the-scenes BBNs calculate which questions would be most likely to reduce uncertainty about a problem. At other times, as in medical diagnostic systems, the probabilistic inferences are displayed. A web search on BBNs already displays tens of thousands of items; these are bound to increase as this form of rational reasoning becomes recognised for its power to capture the knowledge of experienced experts along with hard data, and make this available in a form that aids decision making.

References

[1] Edwards, W. (1962). Dynamic decision theory and probabilistic information processing, Human Factors 4, 59-73.
[2] Edwards, W. (1968). Conservatism in human information processing, in Formal Representations of Human Judgment, B. Kleinmuntz, ed., Wiley, New York, pp. 17-52.
[3] Edwards, W. (1998). Hailfinder: tools for and experiences with Bayesian normative modeling, American Psychologist 53, 416-428.
[4] Edwards, W., Phillips, L.D., Hays, W.L. & Goodman, B. (1968). Probabilistic information processing systems: design and evaluation, IEEE Transactions on Systems Science and Cybernetics SSC-4, 248-265.
[5] Heckerman, D., Mamdani, A. & Wellman, M.P. (1995). Special issue: real-world applications of Bayesian networks, Communications of the ACM 38, 24-57.
[6] Jensen, F.V. (2001). Bayesian Networks and Decision Graphs, Springer-Verlag.
[7] Lauritzen, S.L. & Spiegelhalter, D.J. (1988). Local computations with probabilities on graphical structures and their application to expert systems (with discussion), Journal of the Royal Statistical Society, Series B 50, 157-224.
[8] Matzkevich, I. & Abramson, B. (1995). Decision analytic networks in artificial intelligence, Management Science 41(1), 1-22.
[9] Neapolitan, R.E. (2003). Learning Bayesian Networks, Prentice Hall.
[10] Netica Application APL, DLL and Users Guide (1995-2004). Norsys Software Corporation, Vancouver. Download from www.norsys.com.
[11] Oliver, R.M. & Smith, J.Q., eds (1990). Influence Diagrams, Belief Nets and Decision Analysis, John Wiley & Sons, New York.
[12] Pearl, J. (1988). Probabilistic Reasoning in Expert Systems, Morgan Kaufmann, San Mateo.
[13] Peterson, C.R., ed. (1973). Special issue: cascaded inference, Organizational Behavior and Human Performance 10, 315-432.
[14] Phillips, L.D., Hays, W.L. & Edwards, W. (1966). Conservatism in complex probabilistic inference, IEEE Transactions on Human Factors in Electronics HFE-7, 7-18.
[15] Schachter, R.D. (1986). Evaluating influence diagrams, Operations Research 34(6), 871-882.
[16] Schum, D.A. (1994). The Evidential Foundations of Probabilistic Reasoning, John Wiley & Sons, New York.
[17] Spiegelhalter, D. & Lauritzen, S.L. (1990). Sequential updating of conditional probabilities on directed graphical structures, Networks 20, 579-605.

(See also Markov Chain Monte Carlo and Bayesian Statistics)

LAWRENCE D. PHILLIPS
Bayesian Item Response Theory Estimation
HARIHARAN SWAMINATHAN
Volume 1, pp. 134-139
Item Response Theory (IRT) Models for Polytomous Response Data). The examinees are usually characterized by a single ability parameter, θ. The joint posterior density of the item and ability parameters for any one examinee is thus

π(ξ, θ | u) = L(u | ξ, θ) π(ξ, θ),   (5)

where ξ is the vector of item parameters, θ is the ability parameter for the examinee, and u = [u1 u2 . . . un] is the vector of responses to n items. The posterior density is determined up to a constant once the likelihood function, L(u | ξ, θ), is determined and the prior, π(ξ, θ), is specified.

The Likelihood Function

The assumption of local independence in Item Response Theory (IRT) implies that

P(u | ξ, θ) = P(u1 | ξ, θ) P(u2 | ξ, θ) . . . P(un | ξ, θ),   (6)

where P(uj | ξ, θ) is specified by the item response model that is deemed appropriate. In the general case where an item is scored polytomously, with response categories r1j, r2j, . . . , rsj for item j, if we denote the probability of responding in category rk as P(uj = rkj | ξ, θ) ≡ Pjk, with rkj = 1 or 0, Σk rkj = 1, and Σk Pjk = 1, then the probability of a response to the item can be expressed as

P(uj | ξ, θ) = Pj1^r1j Pj2^r2j . . . Pjs^rsj = ∏_{k=1}^{s} Pjk^rkj.   (7)

The joint probability of the response vector u is the product of these probabilities and, once the responses are observed, becomes the likelihood

L(u | ξ, θ) = ∏_{j=1}^{n} ∏_{k=1}^{s} Pjk^rkj.   (8)

With N examinees, the likelihood function is given by

L(U | ξ, θ) = ∏_{i=1}^{N} ∏_{j=1}^{n} ∏_{k=1}^{s} Pjk^rkj,   (9)

where U is the response vector of the N examinees on n items, and θ is the vector of ability parameters for the N examinees. In the dichotomous case with response categories r1 and r2, r2 = 1 − r1, and P(uj | ξ, θ) = Pj^r1j Qj^(1−r1j), where Qj = 1 − Pj. Thus,

L(U | ξ, θ) = ∏_{i=1}^{N} ∏_{j=1}^{n} L(uj | ξ, θi) = ∏_{i=1}^{N} ∏_{j=1}^{n} Pj^r1j Qj^(1−r1j).   (10)

Prior Specification, Posterior Densities, and Estimation

While the evaluation of the likelihood function is straightforward, the specification of the prior is somewhat complex. In IRT, the prior density, π(ξ, θ), is a statement about the prior belief or information the researcher has about the item and ability parameters. It is assumed a priori that the item and ability parameters are independent, that is, π(ξ, θ) = π(ξ)π(θ). Specification of priors for the ability and item parameters may be carried out in a single stage, or a hierarchical procedure may be employed. In the single-stage procedure, a distributional form is assumed for ξ and the parameters of the distribution are specified. For example, it may be assumed that the item parameters have a multivariate normal distribution, that is, ξ | μ, Σ ∼ N(μ, Σ), and the parameters (μ, Σ) are specified. In the hierarchical procedure, distributional forms are assumed for the parameters (μ, Σ) and the hyperparameters that determine the distribution of (μ, Σ) are specified. In contrast to the single-stage approach, the hierarchical approach allows for a degree of uncertainty in specifying priors by expressing prior beliefs in terms of a family of prior distributions. Swaminathan and Gifford [14-16] proposed a hierarchical Bayes procedure for the joint estimation of item and ability parameters following the framework provided by Lindley and Smith [8]; Mislevy [9], using the same framework, provided a hierarchical procedure for the marginal estimation of item parameters. In the following discussion, only the three-parameter dichotomous item response model is considered, since the one- and two-parameter models are obtained as special cases. The procedures described are easily extended to the polytomous case [11, 12].
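The dichotomous likelihood in (10) can be sketched for a handful of hypothetical items. The two-parameter logistic response function below is assumed purely for illustration; (10) itself takes whatever P_j the chosen item response model supplies.

```python
import math

def p_correct(theta, a, b):
    """P_j(theta): probability of a correct response under an assumed 2PL model."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def likelihood(u, items, theta):
    """L(u | xi, theta) = prod_j P_j^{r_1j} Q_j^{1 - r_1j}, with Q_j = 1 - P_j."""
    L = 1.0
    for u_j, (a, b) in zip(u, items):
        P = p_correct(theta, a, b)
        L *= P if u_j == 1 else 1.0 - P
    return L

# Hypothetical three-item test, one examinee at theta = 0.
items = [(1.0, -1.0), (1.0, 0.0), (1.5, 1.0)]    # (a_j, b_j) pairs
L = likelihood([1, 1, 0], items, theta=0.0)
```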
consists of 3n parameters: n difficulty parameters, bj, n discrimination parameters, aj, and n pseudo chance-level parameters, cj. While in theory it is possible to assume a multivariate distribution for the item parameters, specification of the parameters poses some difficulty. To simplify the specification of priors, Swaminathan and Gifford [16] assumed that the sets of item parameters b, a, c are independent. They further assumed that the difficulty parameters bj are exchangeable and that, in the first stage, bj ∼ N(μb, σb²). In the second stage, they assumed a noninformative prior for μb and an inverse chi-square prior with parameters νb, λb for σb². For the discrimination parameters, they assumed a chi distribution with parameters νaj and ωaj. Finally, for the c-parameter, they assumed a beta distribution with parameters sj and tj (see Catalogue of Probability Density Functions). The ability parameters are assumed to be exchangeable, and independently and identically distributed normally with mean μ and variance σ². By setting μ and σ² as zero and one, respectively, the conditions required for identifying the model may be imposed. With these assumptions, the joint posterior density of item and ability parameters, after integrating the nuisance parameters μb, σb², μ, σ², is

π(a, b, c, θ | νa, ωa, s, t, U) ∝ ∫∫∫∫ L(U | a, b, c, θ) ∏_{j=1}^{n} π(aj | νaj, ωaj) π(bj | μb, σb²) π(cj | sj, tj) ∏_{i=1}^{N} π(θi | μ, σ²) dμb dσb² dμ dσ².   (11)

Swaminathan and Gifford [16] provided procedures for specifying the parameters for the prior distributions of the discrimination and the pseudo chance-level parameters. Once the parameters of the priors are specified, the posterior density is completely specified up to a constant. These authors then obtained the joint posterior modes of the posterior density as the joint Bayes estimates of the item and ability parameters. Through an empirical study, Gifford and Swaminathan [4] demonstrated that the joint Bayesian procedure offered considerable improvement over the joint maximum likelihood procedure.

In contrast to the approach of Swaminathan and Gifford [16], who assumed that the examinees were fixed, Mislevy [9], following the approach taken by Bock and Lieberman [1], assumed that the examinees were sampled at random from a population. With this assumption, the marginal joint posterior density of the item parameters is obtained as

π(ξ | U, τ) ∝ ∫ L(U | ξ, θ) π(θ) π(ξ | τ) dθ.   (12)

With the assumption that θ ∼ N(0, 1), the integration is carried out using Gaussian quadrature. The advantage of this marginal Bayesian procedure over the joint Bayesian procedure is that the marginal modes are closer to the marginal means than are the joint modes. It also avoids the problem of improper estimates of structural (that is, item) parameters in the presence of an infinite number of nuisance or incidental ability parameters.

Mislevy [9], rather than specifying priors on the item parameters directly, specified priors on transformed discrimination and pseudo chance-level parameters, that is, on αj = log(aj) and γj = log[cj/(1 − cj)]. With βj = bj, the vector of parameters ηj = [αj βj γj] was assumed to have a multivariate normal distribution with mean vector μj and variance-covariance matrix Σj (see Catalogue of Probability Density Functions). At the second stage, it is assumed that μj is distributed multivariate normally and that Σj has the inverted Wishart distribution, a multivariate form of the inverse chi-square distribution. Although in principle it is possible to specify the parameters of these hyperprior distributions, they present problems in practice, since most applied researchers and measurement specialists lack sufficient experience with these distributions. Simplified versions of these prior distributions are obtained by assuming the item parameters are independent. In this case, it may be assumed that αj is normally distributed, or equivalently that aj has a lognormal distribution. The parameters of this distribution are more tractable and readily specified. With respect to the pseudo chance-level parameter, computer programs such as BILOG [10] and PARSCALE [12] use the beta prior for the pseudo chance-level parameter, as recommended by Swaminathan and Gifford [16]. A detailed study comparing the joint and the marginal estimation procedures in the case of the two-parameter item response model is provided by Kim et al. [7].
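The simplified independent priors described above (a lognormal prior for discrimination, a normal prior for difficulty, and a beta prior for the pseudo chance-level) can be sketched by direct simulation; all hyperparameter values below are illustrative assumptions, not values recommended in the literature.

```python
import math
import random

random.seed(7)

def draw_item_parameters():
    # alpha_j = log(a_j) normal, so a_j is lognormal and positive.
    a = math.exp(random.gauss(0.0, 0.5))   # discrimination, a > 0
    b = random.gauss(0.0, 1.0)             # normal difficulty
    c = random.betavariate(5.0, 17.0)      # beta pseudo chance-level, 0 < c < 1
    return a, b, c

items = [draw_item_parameters() for _ in range(20)]
```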
Rigdon and Tsutakawa [13], Tsutakawa [17, 18], and Tsutakawa and Lin [19] have provided alternative marginalized Bayes modal estimation procedures. The procedures suggested by Tsutakawa [18] and Tsutakawa and Lin [19] for specifying priors are basically different from those suggested by Swaminathan and Gifford and by Mislevy; Tsutakawa and Lin [19] suggested an ordered bivariate beta distribution for the item response function at two ability levels, while Tsutakawa [18] suggested the ordered Dirichlet prior on the entire item response function. These approaches are promising, but no extensive research has been done to date comparing them with other Bayesian approaches.

More recently, the joint estimation procedure outlined above has received considerable attention in terms of Markov Chain Monte Carlo (MCMC) procedures. In this approach, observations are sampled from the posterior density, and with these the characteristics of the posterior density, such as the mean, variance, and so on, are approximated. This powerful technique has been widely applied in Bayesian estimation and inference and is receiving considerable attention for parameter estimation in item response models (see Markov Chain Monte Carlo Item Response Theory Estimation).

Estimation of Ability Parameters with Known Item Parameters

As mentioned previously, one of the primary purposes of testing is to determine the ability or proficiency level of examinees. The estimation procedure for jointly estimating item and ability parameters may be employed in this case. However, in situations such as computer-adaptive tests, joint estimation may not be possible. The alternative is to employ a two-stage procedure, where in the first stage item parameters are estimated using the marginal Bayesian or maximum likelihood procedures, and in the second stage, assuming that the item parameters are known, the ability parameters are estimated.

The estimation of ability parameters when item parameters are known is far less complex than the procedure for estimating item parameters or jointly estimating item and ability parameters. Since the examinees are independent, it is possible to estimate each examinee's ability separately. In this case, if the prior density of θ is taken as normal, then the posterior density of θ is

π(θ | u, ξ, μ, σ²) ∝ L(u | θ, ξ) π(θ | μ, σ²),   (13)

where μ and σ² are the mean and variance of the prior distribution of θ. The mode of the posterior density, known as the maximum a posteriori (MAP) estimate, may be taken as the point estimate of ability. Alternatively, the mean of the posterior density, the expected a posteriori (EAP) estimate [2], defined as

θ̂ = ∫ θ π(θ | u, μ, σ²) dθ,   (14)

may be taken as the point estimate of θ. The integral given above is readily evaluated using numerical procedures. The variance of the estimate can also be obtained as

Var(θ̂) = ∫ [θ − θ̂]² π(θ | u, μ, σ²) dθ.   (15)

A problem that is noted with the Bayesian estimate of ability is that, unless reasonably good prior information is available, the estimates tend to be biased.

In the case of conventional testing, where many examinees respond to the same items, a hierarchical Bayesian procedure may prove to be useful. Swaminathan and Gifford [14] applied a two-stage procedure similar to that described earlier to obtain the joint posterior density of the abilities of N examinees. They demonstrated that the hierarchical Bayes procedure, by incorporating collateral information available from the group of examinees, produced more accurate estimates of the ability parameters than maximum likelihood estimates or a single-stage Bayes procedure.

References

[1] Bock, R.D. & Lieberman, M. (1970). Fitting a response model for n dichotomously scored items, Psychometrika 35, 179-197.
[2] Bock, R.D. & Mislevy, R.J. (1982). Adaptive EAP estimation of ability in a microcomputer environment, Applied Psychological Measurement 6, 431-444.
[3] Gelman, A., Carlin, J.B., Stern, H.S. & Rubin, D.B. (2004). Bayesian Data Analysis, Chapman & Hall/CRC, Boca Raton.
[4] Gifford, J.A. & Swaminathan, H. (1990). Bias and the effect of priors in Bayesian estimation of parameters in item response models, Applied Psychological Measurement 14, 33-43.
[5] Hambleton, R.K. & Swaminathan, H. (1985). Item Response Theory: Principles and Applications, Kluwer-Nijhoff, Boston.
[6] Hambleton, R.K., Swaminathan, H. & Rogers, H.J. (1991). Fundamentals of Item Response Theory, Sage, Newbury Park.
[7] Kim, S.H., Cohen, A.S., Baker, F.B., Subkoviak, M.J. & Leonard, T. (1994). An investigation of hierarchical Bayes procedures in item response theory, Psychometrika 59, 405-421.
[8] Lindley, D.V. & Smith, A.F.M. (1972). Bayes estimates for the linear model (with discussion), Journal of the Royal Statistical Society, Series B 34, 1-41.
[9] Mislevy, R.J. (1986). Bayes modal estimation in item response models, Psychometrika 51, 177-195.
[10] Mislevy, R.J. & Bock, R.D. (1990). BILOG 3: Item Analysis and Test Scoring with Binary Logistic Models, Scientific Software, Chicago.
[11] Muraki, E. (1992). A generalized partial credit model: application of an EM algorithm, Applied Psychological Measurement 16, 159-176.
[12] Muraki, E. & Bock, R.D. (1996). PARSCALE: IRT Based Test Scoring and Item Analysis for Graded Open-Ended Exercises and Performance Tests, Scientific Software, Chicago.
[13] Rigdon, S.E. & Tsutakawa, R.K. (1983). Parameter estimation in latent trait models, Psychometrika 48, 567-574.
[14] Swaminathan, H. & Gifford, J.A. (1982). Bayesian estimation in the Rasch model, Journal of Educational Statistics 7, 175-191.
[15] Swaminathan, H. & Gifford, J.A. (1985). Bayesian estimation in the two-parameter logistic model, Psychometrika 50, 175-191.
[16] Swaminathan, H. & Gifford, J.A. (1986). Bayesian estimation in the three-parameter logistic model, Psychometrika 51, 581-601.
[17] Tsutakawa, R.K. (1984). Estimation of two-parameter logistic item response curves, Journal of Educational Statistics 9, 263-276.
[18] Tsutakawa, R.K. (1992). Prior distributions for item response curves, British Journal of Mathematical and Statistical Psychology 45, 51-74.
[19] Tsutakawa, R.K. & Lin, R. (1986). Bayesian estimation of item response curves, Psychometrika 51, 251-267.

HARIHARAN SWAMINATHAN
Bayesian Methods for Categorical Data
EDUARDO GUTIÉRREZ-PEÑA
Thus, we will denote by θl and nl, respectively, the probability and observed count for cell l (l = 1, . . . , m), and θ will denote both (θ1, . . . , θm)ᵀ and (θ11, . . . , θrc)ᵀ, with the entries of the latter arranged in lexicographical order. Similarly, n will denote both (n1, . . . , nm)ᵀ and (n11, . . . , nrc)ᵀ.

Under multinomial sampling, the vector of counts, n, is regarded as an observation from a (m − 1)-dimensional multinomial distribution with index N = Σl nl and unknown parameter vector θ:

f(n | θ, N) = (N! / ∏l nl!) ∏l θl^nl,   (1)

where θl > 0 and Σl θl = 1.

Some History

Early accounts of Bayesian analyses for categorical data include [20], in particular Section 5.11, and [16], [17], [18], and [27].

Suppose that the cell counts n have a multinomial distribution with density function (1) and that the prior density of θ is proportional to ∏l θl^(−1) over the region θl > 0, Σl θl = 1 (this is a limiting case of the Dirichlet distribution and is meant to describe vague prior information; see the section titled Bayesian Inference for Multinomial Data). Write y = (log n1, . . . , log nm)ᵀ and λ = (log θ1, . . . , log θm)ᵀ, and let C be a k × m matrix of rank k < m with rows summing to zero. Then Lindley ([27], Theorem 1) showed that, provided the cell counts are not too small, the posterior distribution of the contrasts φ = Cλ is given approximately by

φ ∼ MVN(Cy, C N⁻¹ Cᵀ),   (2)

where N is a diagonal matrix with entries (n1, . . . , nm).

This result provides approximate estimates of the log-odds φ, or linear functions thereof, but not of the cell probabilities θ. Nevertheless, when testing common hypotheses in two-way contingency tables (such as independence or homogeneity of populations), Lindley found analogies with the classical analysis of variance which greatly simplify the analysis. He proposed a Bayesian significance test based on highest posterior density credible intervals (see also [18]). Spiegelhalter and Smith [32] discuss an alternative testing procedure based on Bayes factors (see also [20]).

For three-way contingency tables, the analogy with the analysis of variance is no longer useful, but the analysis can still be carried out at the cost of additional computations.

Good [19] developed a Bayesian approach to testing independence in multiway contingency tables. Unlike Lindley's, this approach has the advantage that it does not depend on the availability of large samples and so is applicable even when many expected cell frequencies are small. Moreover, this approach allows one to estimate the cell probabilities θl (see also [16] and [17]). To test for independence, Good proposed the use of Bayes factors where the priors assumed for the nonnull model are mixtures of symmetric Dirichlet distributions (see also [2]).

Bishop et al. [9] consider pseudo-Bayes estimators arising from the use of a two-stage prior distribution, following a suggestion by Good [18]. Such estimators are essentially empirical Bayes estimators (see also [22]). Bishop et al. also provide various asymptotic results concerning the risk of their estimators.

Leonard [23] uses exchangeable normal priors on the components of a set of multivariate logits. He then derives estimators of the cell frequencies from the resulting posterior distributions. In a subsequent paper [24], he also develops estimators of the cell frequencies from several multinomial distributions via two-stage priors (see also [25] and [26]). Albert and Gupta [6] also consider estimation in contingency tables, but use mixtures of Dirichlet distributions as priors. In [7], they discuss certain tailored priors that allow them to incorporate (i) separate prior knowledge about the marginal probabilities and an interaction parameter in 2 × 2 tables; and (ii) prior beliefs about the similarity of a set of cell probabilities in r × 2 tables with fixed row totals (see also [4]).

Bayesian Inference for Multinomial Data

Conjugate Analysis

The standard conjugate prior for the multinomial parameter θ in (1) is the Dirichlet distribution, with density function

p(θ | α) = (Γ(α) / ∏l Γ(αl)) ∏l θl^(αl − 1),   (3)
for l > 0 and = l l , where () is the gamma are often used as smoothed expected cell probabilities
function (see [1]). (see also [9], [13], [6] and [23]).
This distribution is characterized by a parameter
vector = (1 , . . . , m )T such that E(l ) = l / .
The value of is interpreted as a hypothetical Testing for Independence
prior sample size, and determines the strength of When testing hypothesis concerning the cells proba-
the information contained in the prior: a small bilities or frequencies in a contingency table, the null
implies vague prior information whereas a large hypothesis imposes constraints on the space of possi-
suggests strong prior beliefs about . Owing to the ble values of . In other words, under the null hypoth-
conjugacy property, the corresponding posterior dis- esis, the cell probabilities are given by l0 = hl ( )
tribution of is also Dirichlet with parameter n = for some functions hl (), l = 1, . . . , m. As a simple
(n 1 + 1 , . . . , n m + m )T . This distribution contains example, consider a r c contingency table and a
all the available information about the cell probabil- null model which states that the two variables are
ities , conditional on the observed counts n. independent. In this case, it is convenient to use the
In the absence of prior information, we would typically use a rather vague prior. One of the most widely used such priors for the multinomial parameter is precisely the Dirichlet distribution with parameter α = (1/2, . . . , 1/2) (see [20]). In practical terms, however, one could argue that the strength of the prior should be measured in relation to the actual observed sample. Keeping in mind the interpretation of α as a prior sample size, the quantity I = α/(N + α) can be regarded as the proportion of the total information that is contributed by the prior. Thus, a value of α yielding I = 0.01 would produce a fairly vague prior contributing about 1% of the total information, whereas I ≈ 1 would imply that the data are completely dominated by the prior. Since E(θl) = αl/α, the individual values of the αl should be chosen according to the prior beliefs concerning E(θl). These may be based on substantive knowledge about the population probabilities or on data from previous studies. In the case of vague prior information (I ≤ 0.05, say), αl = α/m for all l = 1, . . . , m is a sensible default choice. When α = 1, this corresponds to the prior proposed by Perks [31] and can be interpreted as a single prior observation divided evenly among all the cells in the table.

When analyzing contingency tables, we often wish to provide a table of expected cell probabilities or frequencies that can be used for other purposes such as computing standardized rates. The raw observed counts are usually not satisfactory for this purpose, for example, when the table has many cells and/or when few observations are available. In such cases, Bayesian estimators based on posterior expectations

E(θl | n) = (nl + αl)/(N + α),   (4)

double-index notation to refer to the individual cell probabilities or counts. Then

θ⁰ij = hij(θ) = θi+ θ+j,   (5)

where θi+ = Σj θij and θ+j = Σi θij (i = 1, . . . , r; j = 1, . . . , c).

Since we have the posterior distribution of θ, we can, in principle, calculate the posterior probability of any event involving the cell probabilities θ. In particular, the posterior distribution of θ induces a posterior distribution on the vector θ⁰ = (θ⁰11, . . . , θ⁰rc)ᵀ of cell probabilities constrained by the null hypothesis.

The null model of independence can be tested on the basis of the posterior distribution of

δ = δ(θ) = Σl θl (log θl − log θ⁰l).   (6)

This quantity can be regarded as a Bayesian version of the deviance. It is always nonnegative and is zero if and only if the null model and the saturated model are the same, i.e., if and only if θ⁰l = θl for all l.

The marginal posterior distribution of δ is not available in closed form, but it can easily be obtained from that of θ using Monte Carlo techniques. In this case, we can generate a sample {θ(1), . . . , θ(M)} of size M from the posterior (Dirichlet) distribution of θ. Next, we compute δ(k) = δ(θ(k)) for each k = 1, . . . , M. The resulting values {δ(1), . . . , δ(M)} then constitute a sample from the marginal posterior distribution of δ. The accuracy of the Monte Carlo approximation increases with the value of M.
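As a small illustration, the smoothed estimates of eq. (4) and the prior-information proportion I = α/(N + α) can be computed directly. This is a sketch in Python, not code from the entry; the 2 × 2 counts below are hypothetical, and a Perks-style vague prior with α = 1 is assumed:

```python
def posterior_means(counts, alphas):
    """Posterior expectations E(theta_l | n) = (n_l + alpha_l) / (N + alpha), eq. (4)."""
    N = sum(counts)
    alpha = sum(alphas)
    return [(n + a) / (N + alpha) for n, a in zip(counts, alphas)]

# Hypothetical 2x2 table flattened to m = 4 cells, with a Perks-style
# vague prior alpha_l = alpha/m and alpha = 1 (one prior observation).
counts = [12, 0, 7, 5]
alphas = [0.25, 0.25, 0.25, 0.25]

means = posterior_means(counts, alphas)
I = sum(alphas) / (sum(counts) + sum(alphas))  # proportion of prior information

print(I)          # 1/25 = 0.04, i.e. the prior carries about 4% of the information
print(means[1])   # the empty cell gets a small positive probability, 0.25/25 = 0.01
```

Note how the empty cell, for which the raw observed count would give an unsatisfactory estimate of zero, receives a small positive posterior probability.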
4 Bayesian Methods for Categorical Data
Posterior distributions of δ concentrated around zero support the null model, whereas posterior distributions located away from zero lead to rejection of the null model. Following a suggestion by Lindley [27], we can test the null hypothesis of independence by means of a Bayesian significance test: reject the null hypothesis if the 95% (say) highest posterior density credible interval for δ does not contain the value zero.

Numerical example. The 3 × 2 × 4 cross-classification in Table 1 shows data previously analyzed in [21]. The data concern a small study of alcohol intake, hypertension, and obesity.

Table 1  Alcohol, hypertension, and obesity data

                        Alcohol intake (drinks/day)
Obesity    High BP      0     1-2    3-5    6+
Low        Yes          5      9      8     10
           No          40     36     33     24
Average    Yes          6      9     11     14
           No          33     23     35     30
High       Yes          9     12     19     19
           No          24     25     28     29

[Figure 1: histograms of the posterior distribution of δ (horizontal axis d, vertical axis Density) for the four models considered in the text: upper-left, mutual independence; upper-right, model (9); lower-left, model (10); lower-right, model (11).]

A sample of size M = 10 000 was simulated from the posterior distribution of δ for the null model of independence of the three variables. Figure 1 (upper-left panel) shows the corresponding histogram. In this case, the posterior distribution of δ is located away
from zero, indicating that the model of independence should be rejected.

A similar procedure can be used to perform other analyses of contingency tables, such as tests of homogeneity of populations or even some interaction tests based on simple log-linear models (see section titled Log-linear and Generalized Linear Models). All we require is (a) the ability to generate large samples from Dirichlet distributions; and (b) standard software capable of producing the estimated expected cell probabilities or cell frequencies under the relevant null model, thus implicitly providing the corresponding functions hl(θ) discussed above.

Other Priors

The Dirichlet distribution is suitable for inputting prior information about the cell probabilities, but it does not allow sufficient structure to be imposed on such probabilities. Alternative classes of prior distribution were mentioned in the section titled Some History, and, in the next section, we describe yet another alternative which is particularly suitable for log-linear models.

It is often convenient to model multinomial data as observations of independent Poisson variables. This approach leads to valid Bayesian inferences provided that the prior for the Poisson means factors in a particular way (see [27]). This result can be generalized to product-multinomial settings.

Log-linear and Generalized Linear Models

Log-linear models provide a general and useful framework for analyzing multidimensional contingency tables. Consider a two-way contingency table with r rows and c columns. A log-linear model for the cell probabilities has the form

log θij = u0 + u1(i) + u2(j) + u12(ij),   i = 1, . . . , r; j = 1, . . . , c;   (7)

where u0 is the overall mean, u1(i) and u2(j) represent the main effects of variables 1 and 2 respectively, and u12(ij) represents the interaction between variables 1 and 2 (see, for example, [9]). The number of independent parameters must be equal to the total number of elementary cells in the table, so it is necessary to impose constraints to reduce the number of independent parameters represented by each u-term. Usually, such constraints take the form Σi u1(i) = Σj u2(j) = Σi u12(ij) = Σj u12(ij) = 0.

If we consider instead a table of expected counts {μij} that sum to the grand total N = Σi Σj nij, then we have μij = N θij and hence,

log μij = u + u1(i) + u2(j) + u12(ij),   i = 1, . . . , r; j = 1, . . . , c;   (8)

where u = u0 + log N.

As mentioned in the section titled Testing for Independence, simple models of this type can also be analyzed using the procedure outlined there.

Numerical example (ctd). For the data of Table 1, we now consider the following model with no second-order interaction:

log μijk = u + u1(i) + u2(j) + u3(k) + u12(ij) + u13(ik) + u23(jk),   (9)

where i denotes obesity level, j denotes blood pressure level, and k denotes alcohol intake level. We simulated a sample of size M = 10 000 from the posterior distribution of δ with the aid of the function loglin of the R language and environment for statistical computing (http://www.R-project.org). Figure 1 (upper-right panel) shows the corresponding histogram. In this case, the posterior distribution of δ is concentrated around zero, indicating that the model provides a good fit. However, it is possible that a more parsimonious model also fits the data.

Consider, for example, the models

log μijk = u + u1(i) + u2(j) + u3(k) + u12(ij) + u23(jk),   (10)

and

log μijk = u + u1(i) + u2(j) + u3(k) + u12(ij).   (11)

The posterior distribution of δ for each of these models is shown in Figure 1 (lower-left and lower-right panels, respectively). For model (10), the 95% highest posterior density credible interval for δ contains the value zero, whereas, for model (11), this is not the case.
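The simulation behind the mutual-independence test (upper-left panel of Figure 1) can be sketched in a few lines of Python. This is a sketch, not the authors' R code: the Perks-style prior αl = 1/24 and the reduced number of draws are our assumptions, and here the fitted probabilities hl(θ) under the null model are simply the products of the three one-way margins:

```python
import math
import random

# Table 1 counts, indexed as [obesity][blood pressure][alcohol]:
# obesity in (low, average, high), BP in (yes, no),
# alcohol intake in (0, 1-2, 3-5, 6+) drinks/day.
counts = [
    [[5, 9, 8, 10], [40, 36, 33, 24]],
    [[6, 9, 11, 14], [33, 23, 35, 30]],
    [[9, 12, 19, 19], [24, 25, 28, 29]],
]
flat = [n for layer in counts for row in layer for n in row]
m = len(flat)                      # 24 cells
alpha = [1.0 / m] * m              # Perks-style vague prior (alpha = 1); an assumption
posterior = [n + a for n, a in zip(flat, alpha)]

def sample_dirichlet(params, rng):
    """Draw from a Dirichlet distribution via normalized Gamma variates."""
    g = [rng.gammavariate(p, 1.0) for p in params]
    s = sum(g)
    return [x / s for x in g]

def delta(theta):
    """Bayesian deviance (eq. 6) against the mutual-independence fit (cf. eq. 5):
    theta0_ijk is the product of the three one-way margins of theta."""
    t = [[[theta[(i * 2 + j) * 4 + k] for k in range(4)]
          for j in range(2)] for i in range(3)]
    p1 = [sum(map(sum, t[i])) for i in range(3)]
    p2 = [sum(sum(t[i][j]) for i in range(3)) for j in range(2)]
    p3 = [sum(t[i][j][k] for i in range(3) for j in range(2)) for k in range(4)]
    return sum(t[i][j][k] * (math.log(t[i][j][k]) - math.log(p1[i] * p2[j] * p3[k]))
               for i in range(3) for j in range(2) for k in range(4))

rng = random.Random(0)
M = 2000                           # smaller than the article's 10 000, for speed
deltas = sorted(delta(sample_dirichlet(posterior, rng)) for _ in range(M))
lower = deltas[int(0.025 * M)]     # lower end of a central 95% interval
```

With these data the sampled δ values sit clearly above zero, matching the conclusion in the text that mutual independence should be rejected; replacing the independence fit by the fitted probabilities of any other null model (the functions hl(θ) of the text) tests that model instead.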
Thus, we reject model (11) and retain model (10), which suggests that alcohol intake and obesity are independently associated with hypertension.

The saturated log-linear model allows λ = (log μ11, . . . , log μrc)ᵀ to take any value on R^rc. A nonsaturated model constrains λ to lie in some vector subspace of R^rc, in which case, we can write

λ = Xu,   (12)

where X is a design matrix with columns containing the values of explanatory variables or the values of dummy variables for main effects and interaction terms, and u is the corresponding vector of unknown regression coefficients or effects.

Knuiman and Speed [21] discuss a general procedure to incorporate prior information directly into the analysis of log-linear models. In order to incorporate constraints on main effects and interaction parameters, they use a structured multivariate normal prior for all parameters taken collectively, rather than specify univariate normal priors for individual parameters, as in [23] and [22]. A useful feature of this general prior is that it allows separate specification of prior information for different interaction terms. They go on to propose an approximate Bayesian analysis, where the mode and curvature of the posterior density at the mode are used as summary statistics.

Dellaportas and Smith [11] show how a specific Markov chain Monte Carlo algorithm known as the Gibbs sampler may be implemented to produce exact, fully Bayesian analyses for a large class of generalized linear models, of which the log-linear model with a multivariate normal prior is a special case. Dellaportas and Forster [10] use reversible jump Markov chain Monte Carlo methods to develop strategies for calculating posterior probabilities of hierarchical log-linear models for high-dimensional contingency tables. The best models are those with highest posterior probability. This approach to model selection is closely related to the use of Bayes factors, but it also takes into account the prior probabilities of all of the models under consideration (see also [3]).

Specialized Models

In this section, we present a selective review of some specialized problems for which modern Bayesian techniques are particularly well suited.

Missing Data: Nonresponse

Park and Brown [28] and Forster and Smith [15] develop Bayesian approaches to modeling nonresponse in categorical data problems. Specifically, the framework they consider concerns contingency tables containing both completely and partially cross-classified data, where one of the variables (Y, say) is a response variable subject to nonignorable nonresponse and the other variables (here collectively denoted by X) are regarded as covariates and are always observed. They then introduce an indicator variable R to represent a dichotomous response mechanism (R = 1 and R = 0 indicating response and nonresponse respectively). A nonresponse model is defined as a log-linear model for the full array of Y, X, and R. A nonignorable nonresponse model is one that contains a Y-R interaction term.

Park and Brown [28] show that a small shift of the nonrespondents can result in large changes in the maximum likelihood estimates of the expected cell frequencies. Maximum likelihood estimation is problematic here because boundary solutions can occur, in which case the estimates of the model parameters cannot be uniquely determined. Park and Brown [28] propose a Bayesian method that uses data-dependent priors to provide some information about the extent of nonignorability. The net effect of such priors is the introduction of smoothing constants, which avoid boundary solutions.

Nonidentifiability

Censoring. Standard models for censored categorical data (see Censored Observations) are usually nonidentifiable. In order to overcome this problem, the censoring mechanism is typically assumed to be ignorable (noninformative) in that the unknown parameter of the distribution describing the censoring mechanism is unrelated to the parameter of interest (see [12] and the references therein). Paulino and Pereira [29] discuss Bayesian conjugate methods for categorical data under general, informative censoring. In particular, they are concerned with Bayesian estimation of the cell frequencies through posterior expectations. Walker [33] considers maximum a posteriori estimates, obtained via an EM algorithm, for a more general class of priors.

Misclassification. Paulino et al. [30] present a fully Bayesian analysis of binomial regression data with a
possibly misclassified response. Their approach can be extended to multinomial settings. They use an informative misclassification model whose parameters turn out to be nonidentifiable. As in the case of censoring, from a Bayesian point of view this is not a serious problem since a suitable proper prior will typically make the parameters identifiable. However, care must be taken since posterior inferences on nonidentifiable parameters may be strongly influenced by the prior even for large sample sizes.

Latent Class Analysis. A latent class model usually involves a set of observed variables called manifest variables and a set of unobservable or unobserved variables called latent variables. The most commonly used models of this type are the latent conditional independence models, which state that all the manifest variables are conditionally independent given the latent variables.

Latent class analysis in two-way contingency tables usually suffers from unidentifiability problems. These can be overcome by using Bayesian techniques in which prior distributions are assumed on the latent parameters.

Evans et al. [14] discuss an adaptive importance sampling approach to the computation of posterior expectations, which are then used as point estimates of the model parameters.

Ordered Categories

Albert and Chib [5] develop exact Bayesian methods for modeling categorical response data using the idea of data augmentation combined with Markov chain Monte Carlo techniques. For example, the probit regression model (see Probits) for binary outcomes is assumed to have an underlying normal regression structure on latent continuous data. They generalize this idea to multinomial response models, including the case where the multinomial categories are ordered. In this latter case, the models link the cumulative response probabilities with the linear regression structure.

This approach has a number of advantages, especially in the multinomial setup, where it can be difficult to evaluate the likelihood function. For small samples, this Bayesian approach will usually perform better than traditional maximum likelihood methods, which rely on asymptotic results. Moreover, one can elaborate the probit model by using suitable mixtures of normal distributions to model the latent data.

References

[1] Abramowitz, M. & Stegun, I.A. (1965). Handbook of Mathematical Functions, Dover Publications, New York.
[2] Albert, J.H. (1990). A Bayesian test for a two-way contingency table using independence priors, Canadian Journal of Statistics 18, 347-363.
[3] Albert, J.H. (1996). Bayesian selection of log-linear models, Canadian Journal of Statistics 24, 327-347.
[4] Albert, J.H. (1997). Bayesian testing and estimation of association in a two-way contingency table, Journal of the American Statistical Association 92, 685-693.
[5] Albert, J.H. & Chib, S. (1993). Bayesian analysis of binary and polychotomous response data, Journal of the American Statistical Association 88, 669-679.
[6] Albert, J.H. & Gupta, A.K. (1982). Mixtures of Dirichlet distributions and estimation in contingency tables, The Annals of Statistics 10, 1261-1268.
[7] Albert, J.H. & Gupta, A.K. (1983). Estimation in contingency tables using prior information, Journal of the Royal Statistical Society B 45, 60-69.
[8] Altham, P.M.E. (1969). Exact Bayesian analysis of a 2 × 2 contingency table, and Fisher's exact significance test, Journal of the Royal Statistical Society B 31, 261-269.
[9] Bishop, Y.M.M., Fienberg, S.E. & Holland, P.W. (1975). Discrete Multivariate Analysis: Theory and Practice, MIT Press, Cambridge.
[10] Dellaportas, P. & Forster, J.J. (1999). Markov chain Monte Carlo model determination for hierarchical and graphical log-linear models, Biometrika 86, 615-633.
[11] Dellaportas, P. & Smith, A.F.M. (1993). Bayesian inference for generalised linear and proportional hazards models via Gibbs sampling, Applied Statistics 42, 443-459.
[12] Dickey, J.M., Jiang, J.-M. & Kadane, J.B. (1987). Bayesian methods for censored categorical data, Journal of the American Statistical Association 82, 773-781.
[13] Epstein, L.D. & Fienberg, S.E. (1992). Bayesian estimation in multidimensional contingency tables, in Bayesian Analysis in Statistics and Economics, P.K. Goel & N.S. Iyengar, eds, Springer-Verlag, New York, pp. 37-47.
[14] Evans, M.J., Gilula, Z. & Guttman, I. (1989). Latent class analysis of two-way contingency tables by Bayesian methods, Biometrika 76, 557-563.
[15] Forster, J.J. & Smith, P.W.F. (1998). Model-based inference for categorical survey data subject to non-ignorable non-response, Journal of the Royal Statistical Society B 60, 57-70.
[16] Good, I.J. (1956). On the estimation of small frequencies in contingency tables, Journal of the Royal Statistical Society B 18, 113-124.
[17] Good, I.J. (1965). The Estimation of Probabilities, Research Monograph No. 30, MIT Press, Cambridge.
[18] Good, I.J. (1967). A Bayesian significance test for multinomial distributions (with discussion), Journal of the Royal Statistical Society B 29, 399-431.
[19] Good, I.J. (1976). On the application of symmetric Dirichlet distributions and their mixtures to contingency tables, The Annals of Statistics 4, 1159-1189.
[20] Jeffreys, H. (1961). Theory of Probability, 3rd Edition, Clarendon Press, Oxford.
[21] Knuiman, M.W. & Speed, T.P. (1988). Incorporating prior information into the analysis of contingency tables, Biometrics 44, 1061-1071.
[22] Laird, N.M. (1978). Empirical Bayes for two-way contingency tables, Biometrika 65, 581-590.
[23] Leonard, T. (1975). Bayesian estimation methods for two-way contingency tables, Journal of the Royal Statistical Society B 37, 23-37.
[24] Leonard, T. (1977). Bayesian simultaneous estimation for several multinomial distributions, Communications in Statistics - Theory and Methods 6, 619-630.
[25] Leonard, T. (1993). The Bayesian analysis of categorical data - a selective review, in Aspects of Uncertainty: A Tribute to D.V. Lindley, P.R. Freeman & A.F.M. Smith, eds, Wiley, New York, pp. 283-310.
[26] Leonard, T. (2000). A Course in Categorical Data Analysis, Chapman & Hall, London.
[27] Lindley, D.V. (1964). The Bayesian analysis of contingency tables, The Annals of Mathematical Statistics 35, 1622-1643.
[28] Park, T. & Brown, M.B. (1994). Models for categorical data with nonignorable nonresponse, Journal of the American Statistical Association 89, 44-52.
[29] Paulino, C.D. & Pereira, C.A. (1995). Bayesian methods for categorical data under informative general censoring, Biometrika 82, 439-446.
[30] Paulino, C.D., Soares, P. & Neuhaus, J. (2003). Binomial regression with misclassification, Biometrics 59, 670-675.
[31] Perks, F.J.A. (1947). Some observations on inverse probability including a new indifference rule (with discussion), Journal of the Institute of Actuaries 73, 285-334.
[32] Spiegelhalter, D.J. & Smith, A.F.M. (1982). Bayes factors for linear and log-linear models with vague prior information, Journal of the Royal Statistical Society B 44, 377-387.
[33] Walker, S. (1996). A Bayesian maximum a posteriori algorithm for categorical data under informative general censoring, The Statistician 45, 293-298.

EDUARDO GUTIÉRREZ-PEÑA
Bayesian Statistics
LAWRENCE D. PHILLIPS
Volume 1, pp. 146-150
[Figure 1  Prior and posterior distributions for the freesias example (proportion on the horizontal axis, 0.00-1.00; density on the vertical axis, 0-6).]

formal analysis. In some practical applications, prior probabilities are largely based on hard data, while likelihoods are judged by specialists and experts in the topic at hand. Whatever the source of priors and likelihoods, methods for assessing them are now well developed and routinely applied in many fields [12].

As for the scientist's critique, the thorough examination of the foundations of all statistical inference approaches given by the philosophers Howson and Urbach [6] shows that subjective judgments attend all approaches, Bayesian and classical. For example, the choice of a significance level, the power of a test, and Type 1 and Type 2 errors are all judgments in classical methods, though it often appears not to be the case when, for example, social science journals require 0.05 or 0.01 levels of significance for results to be published, thereby relieving the scientist of having to make the judgment.

Bayesian statistics was first introduced to psychologists by Edwards, Lindman, and Savage [4] in their landmark paper that set out two important principles: stable estimation and the likelihood principle. Stable estimation is particularly important for scientific research, for it enables certain properties of prior opinion to justify use of a noninformative prior, that is, a prior that has little control over the posterior distribution, such as a uniform prior. In the freesias example, a uniform prior would be a horizontal line intersecting the y-axis at 1.0. If that were the prior, and the data showed that one person can smell the freesias and the other cannot, then the posterior would be the gentle curve shown as the prior in Figure 1. Stable estimation states that the actual

example. The proportion of people in the sample who can smell freesias is 45/60 = 0.75. In the vicinity of 0.75, the actual prior does not change very much (it is mostly about 1.1), and the prior does not show a very much larger amount elsewhere, such as a substantial amount on π = 1.0, which might be thought appropriate by a flower seller whose clients frequently comment on the strong, sweet smell. Thus, to quote Edwards, Lindman, and Savage, ". . . far from ignoring prior opinion, stable estimation exploits certain well-defined features of prior opinion and is acceptable only insofar as those features are really present." For much scientific work, stable estimation justifies use of a uniform prior.

The likelihood principle states that all the information relevant to a statistical inference is contained in the likelihood. For the freesias example, the relevant data are only the number of people who can smell the freesias and the number who cannot. The order in which those data were obtained is not relevant, nor is the rule employed to determine when to stop collecting data. In this case, it was when you became tired of collecting data, a stopping rule that would confound the classical statistician whose significance test requires knowing whether you decided to stop at 60 people, or when 45 smellers or 15 nonsmellers were obtained.

In most textbooks, much is made of the theory of conjugate distributions, for that greatly simplifies calculations. Again, for the freesias example, the sampling process is judged to be Bernoulli (see Catalogue of Probability Density Functions): each smeller is a success, each nonsmeller a failure; the data then consist of s successes and f failures. By the theory of conjugate distributions, if the prior is in the two-parameter Beta family, then with a Bernoulli process generating the data, the posterior is also in the Beta family (see Catalogue of Probability Density Functions). The parameters of the posterior Beta are simply the parameters of the prior plus s and f, respectively. For the above example, the prior parameters are 2 and 2, the data are s = 45 and f = 15, so the posterior parameters are 47 and 17. The entire distribution can be constructed knowing
availability of computers, simulation software, and Bayesian statistical programs (see Markov Chain Monte Carlo and Bayesian Statistics).

So how does Bayesian inference compare to classical methods (see Classical Statistical Inference: Practice versus Presentation)? The most obvious difference is in the definition of probability. While both approaches agree about the laws of probability, classical methods assume a relative frequency interpretation of probability (see Probability: An Introduction). As a consequence, posterior probability distributions play no part in classical methods. The true proportion of people who can smell freesias is a particular, albeit unknown, value, X. There can be no probability about it; either it is X or it is not. Instead, sampling distributions are constructed: if the freesias experiment were repeated over and over, each with, say, 60 different people, then the proportion of smellers would vary somewhat, and it is this hypothetical distribution of results that informs the inferences made in classical methods. Sampling distributions enable the construction of confidence intervals, which express the probability that the interval covers the true value of π. The Bayesian also calculates an interval, but as it is based on the posterior distribution, it is called a credible interval, and it gives the probability that π lies within the interval. For the freesias example, there is a 99% chance that X lies between 0.59 and 0.86. The confidence interval is a probability statement about the interval, while the credible interval is a statement about the uncertain quantity, π, a subtle distinction that often leads the unwary to interpret confidence intervals as if they were credible intervals.

As social scientists know, there are two stages of inference in any empirical investigation: statistical inference, concerning the relationship between the data and the statistical hypotheses, and scientific inference, which takes the inference a step beyond the statistical hypotheses to draw conclusions about the scientific hypothesis. A significance level interpreted as if it were a Bayesian inference usually makes little difference to the scientific inferences, which is possibly one reason why social scientists have been slow to take up Bayesian methods.

Hypothesis testing throws up another difference between the approaches. Many significant results in the social science literature establish that a result is not just a chance finding, that the difference on some measure between a treatment and a control group is real. The Bayesian approach finds the posterior distribution of the difference between the measures, and determines the probability of a positive difference, which is the area of the posterior density function to the right of zero. That probability turns out to be similar to the classical one-tailed significance level, provided that the Bayesian's prior is noninformative. The Bayesian would report the probability that the difference is positive; if it is greater than 0.95, that would correspond to significance of p < 0.05. But the significance level should be interpreted as meaning that there is less than a 5% chance that this result or one more extreme would be obtained if the null hypothesis of no difference were true. Therefore, since this probability is so small, the null hypothesis can be rejected. The Bayesian, on the other hand, asserts that there is better than a 95% chance, based only on the data actually observed, that there is a real difference between treatment and control groups. Thus, the significance level is a probability statement about data, while the Bayesian posterior probability is about the uncertain quantity of interest.

For the freesias example, 60% of the probability density function lies to the left of π = 0.75, so there is a 60% chance that the proportion π is equal to or less than 0.75. If a single estimate of π were required, the mean of the posterior distribution, π = 0.73, would be an appropriate figure, slightly different from the sample mean of 0.75 because of the additional prior information.

In comparing Bayesian and classical methods, Pitz [14] showed graphically cases in which data that led to a classical rejection of a null hypothesis actually provided evidence in favor of the null hypothesis in a Bayesian analysis of the same data, examples of Lindley's paradox [10]. Lindley proved that as the sample size increases, it is always possible to obtain a significant rejection of a point hypothesis whether it is true or false. This applies for any significance level at all, but only for classical two-tailed tests, which have no interpretation in Bayesian theory.

From the perspective of making decisions, significance levels play no part, which leaves the step between classical statistical inference and decision making bridgeable only by the exercise of unaided judgment (see entries on utility and on strategies of decision making).
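The conjugate updating and the probability-of-a-positive-difference calculation discussed above can be sketched in Python. The freesias numbers (a Beta(2, 2) prior with s = 45 and f = 15) come from the text; the treatment and control counts in the second part are hypothetical, and uniform Beta(1, 1) priors are our assumption:

```python
import random

def beta_posterior(a, b, successes, failures):
    """Conjugate update: a Beta(a, b) prior with Bernoulli data
    yields a Beta(a + s, b + f) posterior."""
    return a + successes, b + failures

# Freesias example from the text: Beta(2, 2) prior, s = 45, f = 15.
a, b = beta_posterior(2, 2, 45, 15)        # -> (47, 17)
posterior_mean = a / (a + b)               # about 0.73, vs the sample mean of 0.75

def prob_positive_difference(a1, b1, a2, b2, draws=20000, seed=0):
    """Monte Carlo estimate of P(theta1 - theta2 > 0): the area of the
    posterior density of the difference to the right of zero."""
    rng = random.Random(seed)
    hits = sum(rng.betavariate(a1, b1) > rng.betavariate(a2, b2)
               for _ in range(draws))
    return hits / draws

# Hypothetical treatment (30/40 successes) vs control (20/40) groups,
# each with a uniform Beta(1, 1) prior.
p = prob_positive_difference(1 + 30, 1 + 10, 1 + 20, 1 + 20)
```

For these hypothetical counts, p comes out well above 0.95, the Bayesian counterpart of a significant one-tailed classical result in the sense described above.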
On the other hand, Bayesian posterior probabilities or predictive probabilities about uncertain quantities or events are easily accommodated in decision trees, making possible a direct link between inference and decision. While this link may be of no interest to the academic researcher, it is vital in many business applications and for regulatory authorities, where important decisions based on fallible data are made. Indeed, the design of experiments can be very different, as Kadane [9] has demonstrated for the design of clinical trials in pharmaceutical research. This usefulness of Bayesian methods has led to their increasing acceptance over the past few decades, and the early controversies have now largely disappeared.

References

[1] Bayes, T. (1763). An essay towards solving a problem in the doctrine of chances, Philosophical Transactions of the Royal Society 53, 370-418. Reprinted in Barnard, 1958.
[2] de Finetti, B. (1937). La prévision: ses lois logiques, ses sources subjectives, Annales de l'Institut Henri Poincaré 7. Translated by H.E. Kyburg ("Foresight: its logical laws, its subjective sources") in Kyburg and Smokler, 1964.
[3] de Finetti, B. (1974). Theory of Probability, Vol. 1, translated by A. Machí & A. Smith, Wiley, London.
[4] Edwards, W., Lindman, H. & Savage, L.J. (1963). Bayesian statistical inference for psychological research, Psychological Review 70(3), 193-242.
[5] Good, I.J. (1950). Probability and the Weighing of Evidence, Griffin Publishing, London.
[6] Howson, C. & Urbach, P. (1993). Scientific Reasoning: The Bayesian Approach, 2nd Edition, Open Court, Chicago.
[7] Jeffreys, H. (1931). Scientific Inference, 3rd Edition, 1957, Cambridge University Press, Cambridge.
[8] Jeffreys, H. (1939). Theory of Probability, 3rd Edition, 1961, Clarendon Press, Oxford.
[9] Kadane, J., ed. (1996). Bayesian Methods and Ethics in a Clinical Trial Design, John Wiley & Sons, New York.
[10] Lindley, D. (1957). A statistical paradox, Biometrika 44, 187-192.
[11] Lindley, D. (1965). Introduction to Probability and Statistics from a Bayesian Viewpoint, Vols. 1, 2, Cambridge University Press, Cambridge.
[12] Morgan, M.G. & Henrion, M. (1990). Uncertainty: A Guide to Dealing with Uncertainty in Quantitative Risk and Policy Analysis, Cambridge University Press, Cambridge.
[13] Phillips, L.D. (1973). Bayesian Statistics for Social Scientists, Thomas Nelson, London; Thomas Crowell, New York, 1974.
[14] Pitz, G.F. (1978). Hypothesis testing and the comparison of imprecise hypotheses, Psychological Bulletin 85(4), 794-809.
[15] Raiffa, H. & Schlaifer, R. (1961). Applied Statistical Decision Theory, Harvard University Press, Cambridge.
[16] Ramsey, F.P. (1926). Truth and probability, in R.B. Braithwaite, ed., The Foundations of Mathematics and Other Logical Essays, Keegan Paul (1931), London, pp. 158-198.
[17] Savage, L.J. (1954). The Foundations of Statistics, 2nd Edition, 1972, Dover Publications, New York.
[18] Schlaifer, R. (1959). Probability and Statistics for Business Decisions, McGraw-Hill, New York.
[19] Schlaifer, R. (1961). Introduction to Statistics for Business Decisions, McGraw-Hill, New York.
[20] von Neumann, J. & Morgenstern, O. (1947). Theory of Games and Economic Behavior, 2nd Edition, Princeton University Press, Princeton.
[21] Winkler, R.L. (2003). An Introduction to Bayesian Inference and Decision, 2nd Edition, Probabilistic Publishing, Gainesville.

LAWRENCE D. PHILLIPS
Bernoulli Family
STEPHEN SENN
Volume 1, pp. 150-153
[Family tree (fragment): Nicholas Senior, 1623-1708; Christopher, 1782-1863; John-Gustave, 1811-1863.]
and the Netherlands, returning via France. These visits enabled him to establish excellent contacts with many of the leading mathematicians of the day and are the origin of his important correspondence with Montmort through which, together with DeMoivre, he contributed to the solution of a problem posed by William Waldegrave. This involves a circular tournament of n players P1 to Pn of equal skill. P1 plays P2 and the winner plays P3, the winner playing P4 and so on. The game stops once a player has beaten every other player in a row. If necessary, P1 reenters the game once Pn has played and so on. Montmort and DeMoivre had solutions for n = 3 and n = 4 but Nicholas was able to provide the general solution.

Nicholas also worked on John Arbuthnot's famous significance test. Arbuthnot had data on christenings by sex in London from 1629 to 1710. Male christenings exceeded female ones in every one of the 82 years, and he used this fact to calculate the probability of this occurring by chance as (1/2)^82. This is equivalent to, but must not necessarily be interpreted as, a one-sided P value. He then argued that this probability was so small that it could not be interpreted as a chance occurrence and, since it was desirable for the regulation of human affairs that there should be an excess of males at birth, was evidence of divine providence. The official publication date for Arbuthnot's paper was 1710 and Nicholas Bernoulli discussed it with fellows of the Royal Society during his stay in London in 1712. In a letter to Burnet and 'sGravesande, he uses an improved form of his Uncle James's approximation to the tail area of a binomial distribution (see Catalogue of Probability Density Functions) to show that Arbuthnot's data are unsurprising if the probability of a male birth is taken to be 18/35.

Daniel I

Born: February 8, 1700, in Groningen, The Netherlands.
Died: March 17, 1782, in Basle, Switzerland.

Daniel Bernoulli was, in his day, one of the most famous scientists in Europe. His early career in mathematics was characterized by bitter disputes with his father, John I, also a brilliant mathematician. Daniel was born in Groningen in 1700, but the family soon returned to Basle. In the same way that John's father, Nicholas Senior, had tried to dissuade his son from studying mathematics, John in turn tried to push Daniel into business [1]. However, when only ten years old, Daniel started to receive lessons in
mathematics from his older brother, Nicholas III. For a while, after a change of heart, he also studied with his father. Eventually, however, he chose medicine as a career instead and graduated in that discipline from Heidelberg in 1721 [5]. A subsequent falling-out with his father caused Daniel to be banished from the family home.

Daniel is important for his contribution to at least four fields of interest to statisticians: stochastic processes, tests of significance, likelihood, and utility. As regards the first of these, his attempts to calculate the advantages of vaccination against smallpox are frequently claimed to be the earliest example of an epidemic model, although, as Dietz and Heesterbeek have pointed out in their recent detailed examination [4], the model in question is static, not dynamic. However, the example is equally interesting as a contribution to the literature on competing risk.

In an essay of 1734, one of several of Daniel's that won the prize of the Parisian Academy of Sciences, he calculates, amongst other matters, the probability that the coplanarity of the planetary orbits could have arisen by chance. Since the orbits are not perfectly coplanar, this involves his calculating the probability, under a null of perfect random distribution, of a result as extreme as or more extreme than that observed. This example, rather than Arbuthnot's, is thus perhaps more properly regarded as a forerunner of the modern significance test [10].

More controversial is whether Daniel can be regarded as having provided the first example of the use of the concept of maximizing likelihood to obtain an estimate (see Maximum Likelihood Estimation). A careful discussion of Bernoulli's work of 1769 and 1778 on this subject, and of his friend and fellow Basler Euler's commentary of 1778, has been provided by Stigler [12].

Finally, Daniel Bernoulli's work on the famous St. Petersburg paradox should be noted. This problem was communicated to Daniel by his cousin Nicholas II and might equally well have been discussed in the section on Nicholas II. The problem was originally proposed by Nicholas to Montmort and concerns a game of chance in which B rolls a die successively and gets a reward from A that is dependent on the number of throws x needed to obtain the first six. In the first variant, the reward is x crowns. This has expectation six crowns, which is then regarded as a fair price to play the game. In the second variant, however, the reward is 2^(x−1), and this does not have a finite expectation, thus implying that one ought to be prepared to pay any sum at all to play the game [7]. Daniel's solution, published in the journal of the St. Petersburg Academy (hence the St. Petersburg paradox), was to replace money value with utility. If this rises less rapidly than the monetary reward, a finite expectation may ensue. Daniel's resolution of his cousin's paradox is not entirely satisfactory, and the problem continues to attract attention. For example, a recent paper by Pawitan includes a discussion [8].

References

[1] Bell, E.T. (1953). Men of Mathematics, Vol. 1, Penguin Books, Harmondsworth.
[2] Boyer, C.B. (1991). A History of Mathematics (revised by U.C. Merzbach), 2nd Edition, Wiley, New York.
[3] Csörgő, S. (2001). Nicolaus Bernoulli, in Statisticians of the Centuries, C.C. Heyde & E. Seneta, eds, Springer, New York.
[4] Dietz, K. & Heesterbeek, J.A.P. (2002). Daniel Bernoulli's epidemiological model revisited, Mathematical Biosciences 180, 1.
[5] Gani, J. (2001). Daniel Bernoulli, in Statisticians of the Centuries, C.C. Heyde & E. Seneta, eds, Springer, New York, p. 64.
[6] Hald, A. (1990). A History of Probability and Statistics and their Applications before 1750, Wiley, New York.
[7] Hald, A. (1998). A History of Mathematical Statistics from 1750 to 1930, 1st Edition, John Wiley & Sons, New York.
[8] Pawitan, Y. (2004). Likelihood perspectives in the consensus and controversies of statistical modelling and inference, in Methods and Models in Statistics, N.M. Adams, M.J. Crowder, D.J. Hand & D.A. Stephens, eds, Imperial College Press, London, p. 23.
[9] Schneider, I. (2001). Jakob Bernoulli, in Statisticians of the Centuries, C.C. Heyde & E. Seneta, eds, Springer, New York, p. 33.
[10] Senn, S.J. (2003). Dicing with Death, Cambridge University Press, Cambridge.
[11] Shafer, G. (1997). Bernoullis, The, in Leading Personalities in Statistical Sciences, N.L. Johnson & S. Kotz, eds, Wiley, New York, p. 15.
[12] Stigler, S.M. (1999). Statistics on the Table, Harvard University Press, Cambridge.

STEPHEN SENN
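The two variants of the St. Petersburg game can be illustrated numerically. The sketch below (not part of the original entry; the variable names are ours) checks that the first variant has expectation six crowns, that the partial sums of the second variant grow without bound, and that a log utility, in the spirit of Daniel's resolution, restores a finite expectation.

```python
from math import log

p = 1 / 6  # probability of rolling a six on any given throw

# Variant 1: reward is x crowns, where x is the throw on which the
# first six appears (a geometric distribution with mean 1/p = 6).
expected_reward = sum(x * (1 - p) ** (x - 1) * p for x in range(1, 2000))

# Variant 2: reward is 2**(x - 1) crowns. Each term of the expectation
# is (1 - p)**(x - 1) * p * 2**(x - 1); since 2 * (1 - p) > 1,
# the partial sums diverge.
partial = [sum((1 - p) ** (x - 1) * p * 2 ** (x - 1) for x in range(1, n))
           for n in (10, 20, 40)]

# Replacing money with a slowly rising (here logarithmic) utility
# gives a finite expected utility: 5 * ln 2.
expected_utility = sum((1 - p) ** (x - 1) * p * (x - 1) * log(2)
                       for x in range(1, 2000))
```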
Binomial Confidence Interval
CLIFFORD E. LUNNEBORG
Volume 1, pp. 153–154
r = 0, 1, . . . , n. (1)

the success probability once this treatment is used. If so, then the treatment has not had its intended effect, but if not, then perhaps it has (depending on the direction of the shift). This analysis could be set up as a two-sided test:

H0: p = 0.5 versus HA: p ≠ 0.5 (2)

or as a one-sided test in either of two directions:

H0: p = 0.5 versus HA: p > 0.5 (3)

or

H0: p = 0.5 versus HA: p < 0.5 (4)

To test any of these hypotheses, one would use the binomial test. The binomial test is based on the binomial distribution with the null value for p (in this case, the null value is p = 0.5) and whatever n happens to be appropriate. The binomial test is generally conducted by specifying a given significance level, α, although one could also provide a P value, thereby obviating the need to specify a given significance level. If α is specified and we are conducting a one-sided test, say with HA: p > 0.5, then the rejection region will consist of the most extreme observations in the direction of the hypothesized effect. That is, it will take a large number of successes, r, to reject H0 and conclude that p > 0.5.

There is some integer, k, which is termed the critical value. Then H0 is rejected if, and only if, the number of successes, r, is at least as large as k. What the value of k is depends on α, as well as on n and p. The set of values of the random variable {X ≥ k} makes up the rejection region. The probability (computed under the null hypothesis) that the binomial random variable takes a value in the rejection region cannot exceed α and should be as close to α as possible. As a result, the critical value k is the smallest integer for which P0{X ≥ k} ≤ α. This condition ensures that the test is exact [1, 4].

Because the distribution of the number of successes is discrete, it will generally turn out that, for the critical k, P0{X ≥ k} will be strictly less than α. That is, we will be unable to find a value of k for which the null probability of a success count in the rejection region will be exactly equal to α. When P0{X ≥ k} < α, the test is said to be conservative. That is, the probability of a Type I error will be less than an intended or nominal significance level α. The actual level of significance, α*, is computed as

P0{X ≥ k} = Σ_{r=k}^{n} [n!/(r!(n − r)!)] p0^r (1 − p0)^(n−r) = α* (5)

and should be reported. The value of α* depends only on n, p0, and α (these three determine k) and can be computed as soon as these parameters are established. It is a good idea to also report P values, and the P value can be found with the above formula, except replacing k with the observed number of successes. The discreteness of the binary random variable and consequent conservatism of the hypothesis test can be managed by reporting a P value interval [2].

This discussion was based on testing H0 against HA: p > p0, but with an obvious modification it can be used also for testing H0 against HA: p < p0. In this case, the rejection region would be on the opposite side of the distribution, as we would reject H0 for small values of the binomial X. The modification for the two-sided test, HA: p ≠ p0, is not quite as straightforward, as it requires rejecting H0 for either small or large values of X. Finally, we mention that with increasing frequency one encounters tests designed to establish not superiority but rather equivalence. In such a case, H0 would specify that p is outside a given equivalence interval, and the rejection region would consist of intermediate values of X.

As mentioned, the binomial test is an exact test [1, 4], but when np ≥ 5 and n(1 − p) ≥ 5, it is common to use the normal distribution to approximate the binomial distribution. In this situation, the z-test may be used as an approximation of the binomial test.

References

[1] Berger, V.W. (2000). Pros and cons of permutation tests in clinical trials, Statistics in Medicine 19, 1319–1328.
[2] Berger, V.W. (2001). The p-value interval as an inferential tool, Journal of the Royal Statistical Society D (The Statistician) 50(1), 79–85.
[3] Berger, V.W. (2002). Improving the information content of categorical clinical trial endpoints, Controlled Clinical Trials 23, 502–514.
[4] Berger, V.W., Lunneborg, C., Ernst, M.D. & Levine, J.G. (2002). Parametric analyses in randomized clinical trials, Journal of Modern Applied Statistical Methods 1(1), 74–82.
[5] Berger, V.W., Rezvani, A. & Makarewicz, V. (2003). Direct effect on validity of response run-in selection in clinical trials, Controlled Clinical Trials 24(2), 156–166.
[6] Brenner, H. (1998). A potential pitfall in control of covariates in epidemiologic studies, Epidemiology 9(1), 68–71.
[7] Moses, L.E., Emerson, J.D. & Hosseini, H. (1984). Analyzing data from ordered categories, New England Journal of Medicine 311, 442–448.
[8] Rahlfs, V.W. & Zimmermann, H. (1993). Scores: ordinal data with few categories – how should they be analyzed? Drug Information Journal 27, 1227–1240.

VANCE W. BERGER AND YANYAN ZHOU
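The critical value k, the actual level α* of equation (5), and the exact P value can be computed directly from the binomial distribution. The following sketch (not part of the original entry; the function names are ours) does this for the one-sided test with HA: p > p0.

```python
from math import comb

def upper_tail(n, p0, k):
    """P0{X >= k} for X ~ Binomial(n, p0)."""
    return sum(comb(n, r) * p0**r * (1 - p0)**(n - r) for r in range(k, n + 1))

def critical_value(n, p0, alpha):
    """Smallest integer k with P0{X >= k} <= alpha (one-sided, HA: p > p0)."""
    for k in range(n + 2):
        if upper_tail(n, p0, k) <= alpha:
            return k

n, p0, alpha = 20, 0.5, 0.05
k = critical_value(n, p0, alpha)    # critical value for this design
alpha_star = upper_tail(n, p0, k)   # actual level alpha*, to be reported
p_value = upper_tail(n, p0, 15)     # exact P value if 15 successes are observed
```

For n = 20 and p0 = 0.5, the critical value is k = 15 and α* is about 0.021, illustrating the conservatism discussed above: the actual level falls strictly below the nominal 0.05.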
Binomial Effect Size Display
ROBERT ROSENTHAL
Volume 1, pp. 157–158
The binomial effect size display (BESD) was introduced in 1982 as an intuitively appealing general purpose display of the magnitude of experimental effect (see Effect Size Measures) [3]. Although there had been a growing awareness of the importance of estimating sizes of effects along with estimating the more conventional levels of significance, there was still a problem in interpreting various effect size estimators such as the Pearson correlation r. For example, experienced behavioral researchers and experienced statisticians were quite surprised when they were shown that a Pearson r of 0.32, associated with a coefficient of determination (r^2) of only 0.10, was the correlational equivalent of increasing a success rate from 34 to 66% by means of an experimental treatment procedure; for example, these values could mean that a death rate under the control condition is 66% but is only 34% under the experimental condition. There appeared to be a widespread tendency to underestimate the importance of the effects of behavioral (and biomedical) interventions simply because they are often associated with what are thought to be low values of r^2 [2, 3]. The interpretation of the BESD is quite transparent, and it is useful because it is (a) easily understood by researchers, students, and lay persons; (b) applicable in a wide variety of contexts; and (c) easy to compute.

The question addressed by the BESD is: What is the effect on the success rate (survival rate, cure rate, improvement rate, selection rate, etc.) of instituting a certain treatment procedure? It displays the change in success rate (survival rate, cure rate, improvement rate, selection rate, etc.) attributable to a certain treatment procedure. An example shows the appeal of the procedure.

In their meta-analysis of psychotherapy outcome studies, Smith and Glass [5] summarized the results of some 400 studies. An eminent critic stated that the results of their analysis sounded the death knell for psychotherapy because of the modest size of the effect. This modest effect size was calculated to be equivalent to an r of 0.32, accounting for only 10% of the variance.

Table 1 is the BESD corresponding to an r of 0.32 or an r^2 of 0.10. The table shows clearly that it would be misleading to label as modest an effect size equivalent to increasing the success rate from 34 to 66% (e.g., reducing a death rate from 66 to 34%).

Table 1 (column labels reconstructed from the text)

              Success   Failure   Total
Treatment        66        34      100
Control          34        66      100
Total           100       100      200

Table 2 systematically shows the increase in success rates associated with various values of r^2 and r. Even so small an r as 0.20, accounting for only 4% of the variance, is associated with an increase in success rate from 40 to 60%, such as a reduction in death rate from 60 to 40%. The last column of Table 2 shows that the difference in success rates is identical to r. Consequently, the experimental success rate in the BESD is computed as 0.50 + r/2, whereas the control group success rate is computed as 0.50 − r/2. When researchers examine the reports of others and no effect size estimates are given, there are many equations available that permit the computation of effect sizes from the sample sizes and the significance tests that have been reported [1, 4, 6].

Table 2 Binomial effect size displays corresponding to various values of r^2 and r

                Success rate increased
 r^2      r      From     To      Difference in success rates
0.01    0.10    0.45    0.55      0.10
0.04    0.20    0.40    0.60      0.20
0.09    0.30    0.35    0.65      0.30
0.16    0.40    0.30    0.70      0.40
0.25    0.50    0.25    0.75      0.50
0.36    0.60    0.20    0.80      0.60
0.49    0.70    0.15    0.85      0.70
0.64    0.80    0.10    0.90      0.80
0.81    0.90    0.05    0.95      0.90
1.00    1.00    0.00    1.00      1.00

References

[1] Cohen, J. (1965). Some statistical issues in psychological research, in Handbook of Clinical Psychology, B.B. Wolman, ed., McGraw-Hill, New York, pp. 95–121.
[2] Rosenthal, R. & Rubin, D.B. (1979). A note on percent variance explained as a measure of the importance of effects, Journal of Applied Social Psychology 9, 395–396.
[3] Rosenthal, R. & Rubin, D.B. (1982). A simple, general purpose display of magnitude of experimental effect, Journal of Educational Psychology 74, 166–169.
[4] Rosenthal, R. & Rubin, D.B. (2003). r_equivalent: a simple effect size indicator, Psychological Methods 8, 492–496.
[5] Smith, M.L. & Glass, G.V. (1977). Meta-analysis of psychotherapy outcome studies, American Psychologist 32, 752–760.
[6] Wilkinson, L., Task Force on Statistical Inference. (1999). Statistical methods in psychology journals, American Psychologist 54, 594–604.

ROBERT ROSENTHAL
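The BESD construction described in this entry (success rates of 0.50 + r/2 and 0.50 − r/2) can be sketched in a few lines of code; this illustration is not part of the original entry, and the function name is ours.

```python
def besd(r):
    """Binomial effect size display for a correlation r:
    (success, failure) counts per 100 subjects in each group."""
    treat_success = round(100 * (0.50 + r / 2))
    ctrl_success = round(100 * (0.50 - r / 2))
    return {
        "treatment": (treat_success, 100 - treat_success),
        "control": (ctrl_success, 100 - ctrl_success),
    }

table = besd(0.32)
# For r = 0.32 this reproduces Table 1: treatment 66/34, control 34/66,
# and the difference in success rates (66% - 34% = 32%) equals r itself.
```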
Binomial Test
MICHAELA M. WAGNER-MENGHIN
Volume 1, pp. 158–163
the probability function (1), as done here as an example for the value x = 7.

P(x = 7) = (10 choose 7) 0.5^7 (0.5)^3
         = [10!/(7!(10 − 7)!)] (0.5)^10
         = [(10·9·8·7·6·5·4·3·2·1)/((7·6·5·4·3·2·1)(3·2·1))] (0.5)^10
         = 0.117. (2)

The probability of observing seven heads in 10 trials is 0.117, which is a rather small probability, and might leave us with some doubt whether our coin is really a fair coin or whether the experimenter really did the tosses independently of each other. Figure 1(a) gives us the probability distribution for the binomial variable x. The textbook of Cohen [1, p. 612] gives a simple explanation of what makes the binomial distribution a distribution: "The reason you have a distribution at all is that whenever there are n trials, some of them will fall into one category and some will fall into the other, and this division into categories can change for each new set of n trials."

The resulting probability distribution for the coin toss experiment is symmetric; the value p = 0.5 indicates this already. The distribution in Figure 1(b) illustrates what happens when the value for p increases, for example, when we have to assume that the coin is not fair but biased, showing head in 80% of all tosses.

There are some other details worth knowing about the binomial distribution, which are usually described at length in statistical textbooks [1, 4]. The binomial mean, or the expected count of successes in n trials, is E(X) = np. The standard deviation is sqrt(npq), where q = 1 − p. The standard deviation is a measure of spread; it increases with n and decreases as p approaches 0 or 1. For any given n, the standard deviation is maximized when p = 0.5.

With increasing n, the binomial distribution can be approximated by the normal distribution (with or without continuity correction) when p indicates a symmetric rather than an asymmetric distribution. There is no exact rule for when the sample is large enough to justify the normal approximation of the binomial distribution; however, a rule of thumb published in most statistical textbooks is that when p is not near 0.5, npq should be at least 9. However, according to a study by Osterkorn [3], this approximation is already possible when np is at least 10.

The Binomial Test

However, the possibility of deriving the probability of observing a special value x of a count variable by means of the binomial distribution might be interesting from a descriptive point of view. The
Figure 1 Probability function for the binomial distribution with n = 10 and p = 0.5 (a) coin toss example and n = 10
and p = 0.8 (b)
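Equation (2) and the two distributions of Figure 1 can be reproduced directly from the probability function; the following sketch is our own illustration and not part of the original entry.

```python
from math import comb

def binom_pmf(x, n, p):
    """Probability of exactly x successes in n independent trials."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

# Equation (2): seven heads in ten tosses of a fair coin
p7 = binom_pmf(7, 10, 0.5)  # 120/1024 = 0.1171875, i.e. 0.117

# The two distributions of Figure 1
fair = [binom_pmf(x, 10, 0.5) for x in range(11)]    # symmetric around 5
biased = [binom_pmf(x, 10, 0.8) for x in range(11)]  # mass shifted toward 8
```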
Table 1 Data of the coin toss experiment

Outcome              Observed ratio          Expected ratio
Outcome 1: head      k = 7       p1 = 0.7    p = 0.5
Outcome 2: tail      n − k = 3   p2 = 0.3    1 − p = 0.5
                     n = 10

smaller than or equal to the expected ratio:

P = Σ_{i=0}^{k} (n choose i) p^i (1 − p)^(n−i). (3)
Table 2 Data of the representative sample example. Statistical hypothesis: The observed ratio of males and females is equal to the expected ratio. H0: p = 0.5; HA: p ≠ 0.5; α = 0.05

Sex                  Observed ratio          Expected ratio
Group 1: male        k = 16      p1 = 0.17   p = 0.5
Group 2: female      n − k = 76  p2 = 0.83   1 − p = 0.5
                     n = 92

Note: Asymptotic significance, two-tailed: P = 0.000
P = 0.000 < α = 0.05, reject H0.

Table 3 Data of the aspiration level example. Statistical hypothesis: The observed ratio of optimistic performance prediction is equal to the expected level of performance prediction. H0: p = 0.5; HA: p ≠ 0.5; α = 0.05

Group                    Observed ratio          Expected ratio
Group 1: optimistic      k = 65      p1 = 0.71   p = 0.5
Group 2: pessimistic     n − k = 27  p2 = 0.29   1 − p = 0.5
                         n = 92

Note: Asymptotic significance, two-tailed: P = 0.000
P = 0.000 < α = 0.05, reject H0.
Some Social Science Examples for Using the Binomial Test Using SPSS

Example 1: A natural grouping variable consisting of two mutually exclusive groups. 92 subjects volunteered in a research project. 16 of them are males, 76 are females. The question arises whether this distribution of males and females is representative or whether the proportion of males in this sample is too small (Table 2).

Using the SPSS syntax below, we obtain a P value for an asymptotic significance (two-tailed), which is based on the z-approximation of the binomial distribution: P = 0.000.

NPAR TEST
  /BINOMIAL (0.50) = sex.

The decision to calculate the two-tailed significance is made automatically by the software whenever the expected ratio is 0.5. Still, we can interpret a one-sided hypothesis. As the distribution is symmetric when p = 0.5, all we have to do is divide the obtained P value by two. In our example, the P value is very small, indicating that the statistical hypothesis of equal proportions is extremely unlikely (P = 0.000 < α = 0.05, reject H0). We therefore reject the H0 and assume that the current sample is not representative for males and females. Males are underrepresented.

Example 2: Establishing two groups by definition. In Example 1, we used a natural grouping variable consisting of two mutually exclusive groups. In Example 2, we establish the two groups on the basis of empirical information.

For the assessment of aspiration level, 92 subjects volunteered to work for 50 sec on a speeded symbol-coding task (Table 3). Afterward, they were informed about their performance (number correct), and they provided a performance prediction for the next trial. We now use the binomial test to test whether the ratio of subjects with optimistic performance prediction (predicted increase of performance = group 1) and pessimistic performance prediction (predicted decrease of performance = group 2) is equal.

The significant result indicates that we can reject the null hypothesis of equal ratio between optimistic and pessimistic performance prediction. The majority of the subjects expect their performance to increase in the next trial.

Example 3: Testing a hypothesis with more than one independent binomial test. As we know, our sample of n = 92 is not representative regarding males and females (see Example 1). Thus, we might be interested in testing whether the tendency to optimistic performance prediction is the same in both groups. Using the binomial test, we perform two statistical tests to test the same hypothesis (Table 4). To avoid accumulation of statistical error, we use the Bonferroni adjustment (see Multiple Comparison Procedures), α* = α/m (with m = number of statistical tests performed to test one statistical hypothesis), and perform the binomial tests using the protected α* level (see [2] for more information on the Bonferroni adjustment).

The significant result for the females indicates that we can reject the overall null hypothesis of equal ratio between optimistic and pessimistic performance prediction for males and for females. Although not further discussed here, on the basis of the current
data, we may assume an equal proportion of optimistic and pessimistic performance predictions for males, but not for females. The current female sample tends to make more optimistic performance predictions than the male sample.

Example 4: Obtaining p from another sample. The previous examples obtained p for the binomial test from theoretical assumptions (e.g., equal distribution of males and females; equal distribution of optimistic and pessimistic performance). But there are other sources for obtaining p that might be more interesting in the social sciences. One source for p might be a result found with an independent sample or published in the literature.

Continuing our aspiration level studies, we are interested in situational influences and are now collecting data in real-life achievement situations rather than in the research-lab situation. 24 females applying for training as air traffic control personnel did the coding task as part of their application procedure. 87% of them gave an optimistic prediction.

Now we remember the result of Example 3. In the research lab, 70% of the females had made an optimistic performance prediction, and we ask whether these ratios of optimistic performance prediction are the same (Table 5).

On the basis of the binomial test result, we reject the null hypothesis of equal ratios of optimistic performance prediction between the volunteers and the applicants. Female applicants are more likely to give optimistic than pessimistic performance predictions.

Table 4 Data of the aspiration level example split by gender. Statistical hypothesis: The observed ratio of optimistic performance prediction is equal to the expected level of performance prediction. Test 1: male; Test 2: female. H0: p = 0.5 for males and for females; HA: p ≠ 0.5 for males and/or females; α = 0.05

Male                     Observed ratio          Expected ratio
Group 1: optimistic      k = 12      p1 = 0.75   p = 0.5
Group 2: pessimistic     n − k = 4   p2 = 0.25   1 − p = 0.5
                         n = 16

Exact significance, two-tailed: P = 0.077

Female                   Observed ratio          Expected ratio
Group 1: optimistic      k = 53      p1 = 0.70   p = 0.5
Group 2: pessimistic     n − k = 23  p2 = 0.30   1 − p = 0.5
                         n = 76

Asymptotic significance, two-tailed: P = 0.001

Note: Adjusted α: α* = α/m; α = 0.05, m = 2, α* = 0.025;
P(male) = 0.077 > α* = 0.025, retain H0 for males;
P(female) = 0.001 < α* = 0.025, reject H0 for females.

Table 5 Data of the comparing volunteers and applicants example. Statistical hypothesis: Applicants' (observed) ratio of optimistic performance prediction is equal to the expected level of performance prediction. H0: p = 0.7; HA: p > 0.7; α = 0.05

Group                    Observed ratio          Expected ratio
Group 1: optimistic      k = 21      p1 = 0.87   p = 0.7
Group 2: pessimistic     n − k = 3   p2 = 0.13   1 − p = 0.3
                         n = 24

Note: Exact significance, one-tailed: P = 0.042
P = 0.042 < α = 0.05, reject H0.

References

[1] Cohen, B.H. (2001). Explaining Psychological Statistics, 2nd Edition, Wiley, New York.
[2] Haslam, S.A. & McGarty, C. (2003). Research Methods and Statistics in Psychology, Sage Publications, London.
[3] Osterkorn, K. (1975). Wann kann die Binomial- und Poissonverteilung hinreichend genau durch die Normalverteilung ersetzt werden? [When to approximate the binomial and the Poisson distribution with the normal distribution?], Biometrische Zeitschrift 17, 33–34.
[4] Tamhane, A.C. & Dunlop, D.D. (2000). Statistics and Data Analysis: From Elementary to Intermediate, Prentice Hall, Upper Saddle River.

MICHAELA M. WAGNER-MENGHIN
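The exact binomial tests of Examples 1 and 4 can also be computed without SPSS; the sketch below is our own illustration (not part of the original entry) using the cumulative binomial probability of equation (3).

```python
from math import comb

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p), as in equation (3)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

# Example 1: 16 males out of 92, H0: p = 0.5 (two-sided).
# With p = 0.5 the distribution is symmetric, so the two-sided
# P value is twice the smaller tail.
p_two_sided = 2 * binom_cdf(16, 92, 0.5)

# Example 4 (Table 5): 21 optimistic out of 24, H0: p = 0.7,
# HA: p > 0.7, so the exact one-tailed P value is P(X >= 21).
p_one_sided = 1 - binom_cdf(20, 24, 0.7)  # about 0.042, as in Table 5
```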
Biplot
JOHN GOWER
Volume 1, pp. 163–164
any block occurs by the random allocation rule. The block size may be fixed or random. If the block size is fixed at 4, for example, then there are six admissible sequences per block, specifically AABB, ABAB, ABBA, BAAB, BABA, and BBAA, so each would be selected (independently) with probability 1/6 for each block. There are two admissible sequences per block of size 2 (AB and BA), and 20 admissible sequences within each block of size 6. Of course, these numbers would change if the allocation proportions were not 1:1 to the two groups, or if there were more than two groups. Certainly, randomized blocks can handle these situations. Following (Table 1) is an example of blocked randomization with 16 subjects and a treatment assignment ratio of 1:1. Three randomization schedules are presented, corresponding to fixed block sizes of 2, 4, and 6. One could vary the block sizes by sampling from each of the columns in an overall randomization scheme. For example, the first six subjects could constitute one block of size 6, the next two could be a block of size 2, and the last eight could be two blocks of size 4 each. Note that simply varying the block sizes does not constitute random block sizes. As the name would suggest, random block sizes means that the block sizes not only vary but are also selected by a random mechanism [2].

The patterns within each block are clear. For example, when the block size is 2, the second assignment depends entirely on the first assignment within each block; when the block size is 4, the fourth assignment is always predetermined, and the third may be as well. The first two are not predetermined, but the second will tend to differ from the first (they will agree in only two of the six possible sequences).

Smaller block sizes are ideal for controlling chronological bias, because they never allow the treatment group sizes to differ appreciably. In particular, the largest imbalance that can occur in any stratum when randomized blocks are used sequentially within this stratum (that is, within the stratum one block starts when the previous one ends) is half the largest block size, or half the common block size if there is a common block size. If chronological bias were the only consideration, then any randomized block study of two treatments with 1:1 allocation would be expected to use a common block size of 2, so that at no point in time could the treatment group sizes ever differ by more than one. Of course, chronological bias is not the only consideration, and blocks of size two are far from ideal. As mentioned, as the block size increases, the ability to predict upcoming treatment assignments on the basis of knowledge of the previous ones decreases.

This is important, because prediction leads to a type of selection bias that can interfere with internal validity, even in randomized trials. Note that a parallel is often drawn between randomized trials (that use random allocation) and random samples, as ideally each treatment group in a randomized trial constitutes a random sample from the entire sample. This analogy requires the sample to be formed first, and then randomized, to be valid, and so it breaks down in the more common situation in which the recruitment is sequential over time. The problem is that if a future allocation can be predicted and the subject to be so assigned has yet to be selected, then the foreknowledge of the upcoming allocation can influence the decision of which subject to select. Better potential responders can be selected when one treatment is to be assigned next, and worse potential responders can be selected when another treatment is to be assigned next [3]. This selection bias can render analyses misleading and estimates unreliable.

The connection between blocked designs and selection bias stems from the patterns inherent in the blocks, which allow for prediction of future allocations from past ones. Clearly, the larger the block size, the less prediction is possible, and so if selection bias were the only concern, then the ideal design would be the random allocation rule (which maximizes the block size and minimizes the number of blocks), or preferably even unrestricted randomization. But there is now a trade-off to consider between chronological bias, which is controlled with small blocks, and selection bias, which is controlled with large blocks. Often this trade-off is addressed by varying the block sizes. While this is not a bad idea, the basis for doing so is often the mistaken belief that varying the block sizes eliminates all prediction of future allocations and, hence, eliminates all selection bias. Yet varying the block sizes can, in some cases, actually lead to more prediction of future allocations than fixed block sizes would [4, 6].

A more recent method to address the trade-off between chronological bias and selection bias is the maximal procedure [4], which is an alternative to randomized blocks. The idea is to allow any allocation sequence that never allows the group sizes to differ beyond an acceptable limit. Beyond this, no additional restrictions, in the way of forced returns to perfect balance, are imposed. In many ways, the maximal procedure compares favorably to randomized blocks of fixed or varied size [4].

References

[1] Berger, V. (1999). FDA product approval information - licensing action: statistical review, http://www.fda.gov/cber/products/etanimm052799.htm, accessed 3/7/02.
[2] Berger, V.W. & Bears, J.D. (2003). When can a clinical trial be called randomized? Vaccine 21, 468–472.
[3] Berger, V.W. & Exner, D.V. (1999). Detecting selection bias in randomized clinical trials, Controlled Clinical Trials 20, 319–327.
[4] Berger, V.W., Ivanova, A. & Deloria-Knoll, M. (2003). Enhancing allocation concealment through less restrictive randomization procedures, Statistics in Medicine 22(19), 3017–3028.
[5] Lovell, D.J., Giannini, E.H., Reiff, A., Cawkwell, G.D., Silverman, E.D., Nocton, J.J., Stein, L.D., Gedalia, A., Ilowite, N.T., Wallace, C.A., Whitmore, J. & Finck, B.K. (2000). Etanercept in children with polyarticular juvenile rheumatoid arthritis, The New England Journal of Medicine 342, 763–769.
[6] Rosenberger, W. & Lachin, J.M. (2001). Randomization in Clinical Trials: Theory and Practice, John Wiley & Sons.
[7] Rosenberger, W.F. & Rukhin, A.L. (2003). Bias properties and nonparametric inference for truncated binomial randomization, Nonparametric Statistics 15(4-5), 455–465.

VANCE W. BERGER
Boosting
ADELE CUTLER
Volume 1, pp. 168-169
in
ISBN-13: 978-0-470-86080-9
ISBN-10: 0-470-86080-4
Editors
Figure 1 (a) Data and underlying function; (b) 10 boosted stumps; (c) 100 boosted stumps; (d) 400 boosted stumps
Table 1 Data from a study of handedness; hand is an integer measure of handedness and dnan a genetic measure. Data due to Dr Gordon Claridge, University of Oxford

    #  dnan hand     #  dnan hand     #  dnan hand     #  dnan hand
    1   13    1     11   28    1     21   29    2     31   31    1
    2   18    1     12   28    2     22   29    1     32   31    2
    3   20    3     13   28    1     23   29    1     33   33    6
    4   21    1     14   28    4     24   30    1     34   33    1
    5   21    1     15   28    1     25   30    1     35   34    1
    6   24    1     16   28    1     26   30    2     36   41    4
    7   24    1     17   29    1     27   30    1     37   44    8
    8   27    1     18   29    1     28   31    1
    9   28    1     19   29    1     29   31    1
   10   28    2     20   29    2     30   31    1
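As a check on the transcription above, the Pearson correlation between dnan and hand quoted later in this article (0.509) can be reproduced from the table; a minimal Python sketch with no dependencies:

```python
import math

# (dnan, hand) pairs transcribed from Table 1
data = [
    (13, 1), (18, 1), (20, 3), (21, 1), (21, 1), (24, 1), (24, 1), (27, 1),
    (28, 1), (28, 2), (28, 1), (28, 2), (28, 1), (28, 4), (28, 1), (28, 1),
    (29, 1), (29, 1), (29, 1), (29, 2), (29, 2), (29, 1), (29, 1),
    (30, 1), (30, 1), (30, 2), (30, 1),
    (31, 1), (31, 1), (31, 1), (31, 1), (31, 2),
    (33, 6), (33, 1), (34, 1), (41, 4), (44, 8),
]

def pearson(pairs):
    """Plain-Python Pearson product-moment correlation."""
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    sxx = sum((x - mx) ** 2 for x, _ in pairs)
    syy = sum((y - my) ** 2 for _, y in pairs)
    return sxy / math.sqrt(sxx * syy)

r = pearson(data)
print(round(r, 3))  # 0.509, matching the value quoted in the text
```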
a statistic of a complexity limited only by the data analyst's imagination and the computing power available; see, for example, the article on finite mixture distributions. The bootstrap is not a universal panacea, however, and many of the procedures described below apply only when the statistic $\hat{\theta}$ is sufficiently smooth as a function of the data values. The resampling scheme may need careful modification for reliable results.

The discussion above has shown how resampling may be used to mimic sampling variability, but not how the resamples can be used to provide inferences on the underlying population quantities. We discuss this below.

Figure 1 Scatter plot of handedness data. The numbers show the multiplicities of the observations
which may be estimated by

$$ b = \bar{\theta}^* - \hat{\theta}, \qquad v = \frac{1}{R-1} \sum_{r=1}^{R} \left( \hat{\theta}^*_r - \bar{\theta}^* \right)^2, \qquad (1) $$

where $\bar{\theta}^* = R^{-1} \sum_{r=1}^{R} \hat{\theta}^*_r$ is the average of the simulated $\hat{\theta}^*_r$'s. For the handedness data, we obtain b = -0.046 and v = 0.043 using the 10 000 simulations shown in (b) of Figure 2.

The simulation variability of quantities such as b and v, which vary depending on the random resamples, is reduced by increasing R. As a general rule, R ≥ 100 is needed for bias and variance estimators to

Confidence Intervals

Many estimators are normally distributed, at least in large samples. If so, then an approximate equitailed 100(1 − 2α)% confidence interval for the estimand θ is $\hat{\theta} - b \pm z_{1-\alpha}\sqrt{v}$, where b and v are given at (1) and $z_{1-\alpha}$ is the 1 − α quantile of the standard normal, N(0, 1), distribution. For the handedness data this gives a 95% interval of (0.147, 0.963) for the correlation θ. The quality of the normal approximation, and hence the reliability of the confidence interval, can be assessed by graphical comparison of $\hat{\theta}^*_1, \ldots, \hat{\theta}^*_R$ with a normal
Bootstrap Inference
Figure 2 Histograms of simulated correlation coefficients for handedness data. (a): Simulation from fitted bivariate-normal
distribution. (b): Simulation from the data by bootstrap resampling. The line in each figure shows the theoretical
probability-density function of the correlation coefficient under sampling from the fitted bivariate-normal distribution
Figure 3 Bootstrap values of the transformed correlation coefficient. (a): histogram, with vertical dashed line showing the original value. (b): normal probability plot, with dashed line indicating exact normality
for which the normal distribution is a better, though not perfect, fit. The 95% confidence interval for the transformed correlation computed using values of b and v obtained from the transformed replicates is (0.074, 1.110), and transformation of this back to the original scale gives a 95% confidence interval for θ of (0.074, 0.804), shorter than and shifted to the left relative to the interval obtained by treating the $\hat{\theta}^*_r$'s themselves as normally distributed.

Although simple, normal confidence intervals often require a transformation to be determined by the data analyst, and hence, more readily automated approaches have been extensively studied. The most natural way to use the bootstrap replicates $\hat{\theta}^*_1, \ldots, \hat{\theta}^*_R$ of $\hat{\theta}$ to obtain a confidence interval for θ is to use their quantiles directly. Let $\hat{\theta}^*_{(1)} \le \cdots \le \hat{\theta}^*_{(R)}$ be the ordered bootstrap replicates. Then, one simple approach to constructing an equitailed 100(1 − 2α)% confidence interval is to take the α and (1 − α) quantiles of the $\hat{\theta}^*_r$'s, that is, $\hat{\theta}^*_{(R\alpha)}$ and $\hat{\theta}^*_{(R(1-\alpha))}$, where, if necessary, the numbers Rα and R(1 − α) are rounded to the nearest integers. Thus, for example, if a 95% confidence interval is required, we set α = 0.025 and, with R = 10 000, would take its limits to be $\hat{\theta}^*_{(250)}$ and $\hat{\theta}^*_{(9750)}$. In general, the corresponding interval may be expressed as $(\hat{\theta}^*_{(R\alpha)}, \hat{\theta}^*_{(R(1-\alpha))})$, which is known as the bootstrap percentile confidence interval.

This has the useful property of being invariant to the scale on which it is calculated, meaning that the same interval is produced using the $\hat{\theta}^*_r$'s directly as would be obtained by transforming them, computing an interval for the transformed parameter, and then back-transforming this to the θ scale. Its simplicity and transformation-invariance have led to widespread use of the percentile interval, but it has drawbacks. Typically such intervals are too short, so the probability that they contain the parameter is lower than the nominal value: an interval with nominal level 95% may in fact contain the parameter with probability only .9 or lower. Moreover, such intervals tend to be centered incorrectly: even for equitailed intervals, the probabilities that θ falls below the lower end point and above the upper end point are unequal, and neither is equal to α. For the handedness data, this method gives a 95% confidence interval of (0.047, 0.758). This seems to be shifted too far to the left relative to the transformed normal interval.

These deficiencies have led to intensive efforts to develop more reliable bootstrap confidence intervals. One variant of the percentile interval, known as the bias-corrected and accelerated (BCa) or adjusted percentile interval, may be written as $(\hat{\theta}^*_{(R\alpha')}, \hat{\theta}^*_{(R(1-\alpha''))})$, where α′ and α″ are estimated from the $\hat{\theta}^*_1, \ldots, \hat{\theta}^*_R$ in such a way that the resulting intervals are closer to equitailed with the required coverage probability 1 − 2α. Formulae for α′ and α″ are given in Section 5.3 of [2], but often they are built into software libraries for bootstrapping. For the handedness data, we find that α′ = 0.0485 and α″ = 0.0085, resulting in the 95% interval (0.053, 0.792). This method seems to have corrected for the shift to the left we saw in the percentile interval.

Other methods for the calculation of confidence intervals rely on an analogy with the Student t statistic used with normally distributed samples. Suppose that a standard error s for $\hat{\theta}$ is available; then $s^2$ is an estimate of the variance of $\hat{\theta}$. Then, the basis of more general confidence-interval procedures is the use of bootstrap simulation to estimate the distribution of $z = (\hat{\theta} - \theta)/s$. Studentized bootstrap confidence intervals are constructed by using bootstrap simulation to generate R bootstrap replicates $z^* = (\hat{\theta}^* - \hat{\theta})/s^*$, where $s^*$ is the standard error computed using the bootstrap sample that gave $\hat{\theta}^*$. The resulting $z^*_1, \ldots, z^*_R$ are then ordered, their α and (1 − α) quantiles $z^*_{(R\alpha)}$ and $z^*_{(R(1-\alpha))}$ obtained, and the resulting (1 − 2α) × 100% confidence interval has limits $(\hat{\theta} - s z^*_{(R(1-\alpha))}, \hat{\theta} - s z^*_{(R\alpha)})$. These intervals often behave well in practice but require a standard error s, which must be calculated for the original sample and each bootstrap sample. If a standard error is unavailable, then the Studentized interval may be simplified to the bootstrap basic confidence interval $(2\hat{\theta} - \hat{\theta}^*_{(R(1-\alpha))}, 2\hat{\theta} - \hat{\theta}^*_{(R\alpha)})$. Either of these intervals can also be used with transformation, but unlike the percentile intervals, they are not transformation-invariant. It is generally advisable to use a transformation that maps the parameter range to the whole real line to avoid getting values that lie outside the allowable range. For the handedness data, the 95% Studentized and basic intervals using the same transformation as before are (0.277, 0.868) and (0.131, 0.824) respectively. The Studentized interval seems too wide in this case and the basic interval too short. Without transformation, the upper end points of both intervals were greater than 1.
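Assuming a vector of bootstrap replicates is already available, the percentile and basic intervals described above reduce to a few lines of sorting and indexing. A sketch on toy data (the dataset and statistic here are made up for illustration):

```python
import random

def percentile_interval(reps, alpha=0.025):
    """Bootstrap percentile interval: the alpha and 1-alpha quantiles of the replicates."""
    s = sorted(reps)
    R = len(s)
    lo = s[int(round(R * alpha)) - 1]
    hi = s[int(round(R * (1 - alpha))) - 1]
    return lo, hi

def basic_interval(theta_hat, reps, alpha=0.025):
    """Bootstrap basic interval: the percentile limits reflected about theta_hat."""
    lo, hi = percentile_interval(reps, alpha)
    return 2 * theta_hat - hi, 2 * theta_hat - lo

# toy illustration: replicates of a sample mean
random.seed(7)
data = [random.gauss(10.0, 2.0) for _ in range(50)]
theta_hat = sum(data) / len(data)
reps = []
for _ in range(999):
    rs = [random.choice(data) for _ in data]
    reps.append(sum(rs) / len(rs))

print(percentile_interval(reps))
print(basic_interval(theta_hat, reps))
```

The basic interval here is the (2θ̂ − upper, 2θ̂ − lower) reflection given in the text; for a symmetric replicate distribution the two intervals nearly coincide.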
Standard error formulae can be found for many everyday statistics, including the correlation coefficient. If a formula is unavailable, the bootstrap itself can sometimes be used to find a standard error, using two nested levels of bootstrapping. The bootstrap is applied Q times to each first-level bootstrap sample $y^*_1, \ldots, y^*_n$, yielding second-level samples and corresponding replicates $\hat{\theta}^{**}_1, \ldots, \hat{\theta}^{**}_Q$. Then, $s^* = \sqrt{v^*}$, where $v^*$ is obtained by applying the variance formula at (1) to $\hat{\theta}^{**}_1, \ldots, \hat{\theta}^{**}_Q$. The standard error of the original $\hat{\theta}$ is computed as $s = \sqrt{v}$, with v computed by applying (1) to the first-level replicates $\hat{\theta}^*_1, \ldots, \hat{\theta}^*_R$. Although the number R of first-level replicates should be at least 1000, it will often be adequate to take the number Q of second-order replicates of order 100: thus, around 100 000 bootstrap samples are needed in all. With today's fast computing, this can be quite feasible.

Chapter 5 of [2] gives fuller details of these bootstrap confidence-interval procedures, and describes other approaches.

Hypothesis Tests

Often a sample is used to test a null hypothesis about the population from which the sample was drawn; for example, we may want to test whether a correlation is zero, or if some mean response differs between groups of subjects. A standard approach is then to choose a test statistic T in such a way that large values of T give evidence against the null hypothesis, and to compute its value $t_{\mathrm{obs}}$ for the observed data. The strength of evidence against the hypothesis is given by the significance probability or P value $p_{\mathrm{obs}}$, the probability of observing a value of T as large as or larger than $t_{\mathrm{obs}}$ if the hypothesis is true. That is, $p_{\mathrm{obs}} = P_0(T \ge t_{\mathrm{obs}})$, where $P_0$ represents a probability computed as if the null hypothesis were true. Small values of $p_{\mathrm{obs}}$ are regarded as evidence against the null hypothesis.

A significance probability involves the computation of the distribution of the test statistic under the null hypothesis. If this distribution cannot be obtained theoretically, then the bootstrap may be useful. A key step is to obtain an estimator $\hat{F}_0$ of the population distribution under the null hypothesis, $F_0$. The bootstrap samples are then generated by sampling from $\hat{F}_0$, yielding R bootstrap replicates $T^*_1, \ldots, T^*_R$ of T. The significance probability is estimated by

$$ \hat{p}_{\mathrm{obs}} = \frac{\{\text{number of } T^* \ge t_{\mathrm{obs}}\} + 1}{R + 1}, $$

where the +1s appear because $t_{\mathrm{obs}}$ is also a replicate under the null hypothesis, and $t_{\mathrm{obs}} \ge t_{\mathrm{obs}}$.

For bootstrap hypothesis testing to work, it is essential that the fitted distribution $\hat{F}_0$ satisfy the null hypothesis. This is rarely true of the empirical distribution function $\hat{F}$, which therefore cannot be used in the usual way. The construction of an appropriate $\hat{F}_0$ can entail restriction of $\hat{F}$ or the testing of a slightly different hypothesis. For example, the hypothesis of no correlation between hand and dnan could be tested by taking as test statistic $T = \hat{\theta}$, the sample correlation coefficient, but imposing this hypothesis would involve reweighting the points in Figure 1 in such a way that the correlation coefficient computed using the reweighted data would equal zero, followed by simulation of samples $y^*_1, \ldots, y^*_n$ from the reweighted distribution. This would be complicated, and it is easier to test the stronger hypothesis of independence, under which any association between hand and dnan arises purely by chance. If so, then samples may be generated under the null hypothesis by independent bootstrap resampling separately from the univariate empirical distributions for hand and for dnan; see (a) of Figure 4, which shows that pairs (hand*, dnan*) that were not observed in the original data may arise when sampling under the null hypothesis. Comparison of (b) of Figure 2 and (b) of Figure 4 shows that the distributions of correlation coefficients generated using the usual bootstrap and under the independence hypothesis are quite different.

The handedness data have correlation coefficient $t_{\mathrm{obs}} = 0.509$, and 18 out of 9999 bootstrap samples generated under the hypothesis of independence gave values of $T^*$ exceeding $t_{\mathrm{obs}}$. Thus $\hat{p}_{\mathrm{obs}} = 0.0019$, strong evidence of a positive relationship. A two-sided test could be performed by taking $T = |\hat{\theta}|$, yielding a significance probability of about 0.004 for the null hypothesis that there is neither positive nor negative association between hand and dnan.

A parametric test that assumes that the underlying distribution is bivariate normal gives an appreciably smaller one-sided significance probability of .0007, but the inadequacy of the normality assumption implies that this test is less reliable than the bootstrap test.
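The independence test described above is easy to sketch: resample each variable separately so the null hypothesis holds by construction, and apply the estimator for the significance probability. This illustrative Python version uses R = 999 resamples rather than the 9999 in the text, so the exact P value differs from run to run:

```python
import math
import random

dnan = [13, 18, 20, 21, 21, 24, 24, 27, 28, 28, 28, 28, 28, 28, 28, 28,
        29, 29, 29, 29, 29, 29, 29, 30, 30, 30, 30, 31, 31, 31, 31, 31,
        33, 33, 34, 41, 44]
hand = [1, 1, 3, 1, 1, 1, 1, 1, 1, 2, 1, 2, 1, 4, 1, 1,
        1, 1, 1, 2, 2, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 2,
        6, 1, 1, 4, 8]

def corr(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:   # degenerate resample: treat as uncorrelated
        return 0.0
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / math.sqrt(sxx * syy)

random.seed(42)
t_obs = corr(dnan, hand)   # about 0.509
R = 999
count = 0
for _ in range(R):
    # resample each variable separately: independence holds by construction
    x_star = [random.choice(dnan) for _ in dnan]
    y_star = [random.choice(hand) for _ in hand]
    if corr(x_star, y_star) >= t_obs:
        count += 1
p_hat = (count + 1) / (R + 1)
print(p_hat)
```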
Figure 4 Bootstrap hypothesis test of independence of hand and dnan. (a) Intersections of the grey lines indicate possible pairs (hand*, dnan*) when resampling values of hand and dnan independently with replacement. The numbers show the multiplicities of the sampled pairs for a particular bootstrap resample. (b) Histogram of correlation coefficients generated under the null hypothesis of independence. The vertical line shows the value of the correlation coefficient for the dataset, and the shaded part of the histogram corresponds to the significance probability
Another simulation-based procedure often used to test independence is the permutation method (see Permutation Based Inference), under which the values of one of the variables are randomly permuted. The resulting significance probability is typically very close to that from the bootstrap test, because the only difference between the resampling schemes is that permutation samples without replacement and the bootstrap samples with replacement. For the handedness data, the permutation test yields one-sided and two-sided significance probabilities of .002 and .003 respectively.

Owing to the difficulties involved in constructing a resampling distribution that satisfies the null hypothesis, it is common in practice to use confidence intervals to test hypotheses. A two-sided test of the null hypothesis that $\theta = \theta_0$ may be obtained by bootstrap resampling from the usual empirical distribution function $\hat{F}$ to compute a 100(1 − α)% confidence interval. If this does not contain $\theta_0$, we conclude that the significance probability is less than α. For one-sided tests, we use one-sided confidence intervals. This approach is most reliable when used with an approximately pivotal statistic, but this can be hard to verify in practice. For the handedness data, the value $\theta_0 = 0$ lies outside the 95% BCa confidence interval, but within the 99% confidence interval, so we would conclude that the significance probability for a two-sided test is between .01 and .05. This is appreciably larger than found above, but still gives evidence of a relationship between the two variables. Using the transformed Studentized interval, however, we get a P value of greater than .10, which contradicts this conclusion. In general, inverting a confidence interval in this way seems to be unreliable and is not advised.

Chapter 4 of [2] contains a more complete discussion of bootstrap hypothesis testing.
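The permutation analogue replaces resampling with replacement by a random shuffle of one variable; a sketch under the same setup as before:

```python
import math
import random

dnan = [13, 18, 20, 21, 21, 24, 24, 27, 28, 28, 28, 28, 28, 28, 28, 28,
        29, 29, 29, 29, 29, 29, 29, 30, 30, 30, 30, 31, 31, 31, 31, 31,
        33, 33, 34, 41, 44]
hand = [1, 1, 3, 1, 1, 1, 1, 1, 1, 2, 1, 2, 1, 4, 1, 1,
        1, 1, 1, 2, 2, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 2,
        6, 1, 1, 4, 8]

def corr(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / math.sqrt(sxx * syy)

random.seed(0)
t_obs = corr(dnan, hand)
R = 999
count = 0
perm = hand[:]
for _ in range(R):
    random.shuffle(perm)   # permutation: sampling without replacement
    if corr(dnan, perm) >= t_obs:
        count += 1
p_hat = (count + 1) / (R + 1)
print(p_hat)
```

Note that, unlike the independence bootstrap, a permutation keeps every original value, so the marginal variances never degenerate.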
More Complex Models

The discussion above considers only the simple situation of random sampling, but many applications involve more complex statistical models, such as the linear regression model (see Multiple Linear Regression). Standard assumptions for this are that the mean response variable is a linear function of explanatory variables, and that deviations from this linear function have a normal distribution. Here the bootstrap can be applied to overcome potential nonnormality of the deviations, by using the data to estimate their distribution. The deviations are unobserved because the true line is unknown, but they can be estimated using residuals, which can then be resampled. If $\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_k$ are estimates of the linear model coefficients and $e^*_1, \ldots, e^*_n$ is a bootstrap sample from the residuals $e_1, \ldots, e_n$, then bootstrap responses $y^*_i$ can be constructed as

$$ y^*_i = \hat{\beta}_0 + \hat{\beta}_1 x_{1i} + \cdots + \hat{\beta}_k x_{ki} + e^*_i, \qquad i = 1, \ldots, n. $$

The values of the explanatory variables in the bootstrap sample are the same as for the original sample, but the response variables vary. Since the matrix of explanatory variables remains the same, this method is particularly appropriate in designed experiments where the explanatory variables are set by the experimenter. It does however presuppose the validity of the linear model from which the coefficients and residuals are estimated.

An alternative procedure corresponding to our treatment of the data in Table 1 is to resample the vector observations $(y_i, x_{1i}, \ldots, x_{ki})$. This may be appropriate when the explanatory variables are not fixed by the experimenter but may be treated as random. One potential drawback is that the design matrix of the resamples may be singular, or nearly so, and if so there will be difficulties in fitting the linear model to the bootstrap sample. These procedures generalize to the analysis of variance, the generalized linear model, and other regression models.

Extensions of the bootstrap to survival analysis and time series (see Time Series Analysis) have also been studied in the literature. The major difficulty in complex models lies in finding a resampling scheme that appropriately mimics how the data arose. A detailed discussion can be found in Chapters 3 and 6-8 of [2].

Computer Resources

Although the bootstrap is a general computer-intensive method for nonparametric inference, it does not appear in all statistical software packages. Code for bootstrapping can be written in most packages, but this does require programming skills. The most comprehensive suite of code for bootstrapping is the boot library written by A. J. Canty for S-Plus, which can be downloaded from http://statwww.epfl.ch/davison/BMA/library.html. This code has also been made available as a package for R by B. D. Ripley and can be downloaded free from http://cran.r-project.org as part of the binary releases of the R package. Another free package that can handle a limited range of statistics is David Howell's Resampling Statistics, available from http://www.uvm.edu/dhowell/StatPages/Resampling/Resampling.html. S-Plus also has many features for the bootstrap and related methods. Some are part of the base software but most require the use of the S+Resample library, which can be downloaded from http://www.insightful.com/downloads/libraries/. The commercial package Resampling Stats is available as a stand-alone program or as an add-in for Excel or Matlab from http://www.resample.com/. Statistical analysis packages that include some bootstrap functionality are Systat, Stata, and SimStat. There are generally limits to the types of statistics that can be resampled in these packages but they may be useful for many common statistics (see Software for Statistical Analyses).

Literature

Thousands of papers and several books about the bootstrap have been published since it was formulated by Efron [3]. Useful books for the practitioner include [7], [1], and [6]. References [2] and [4] describe the ideas underlying the methods, with many further examples, while [5] and [8] contain more theoretical discussion. Any of these contains many further references, though [1] has a particularly full bibliography. The May 2003 issue of Statistical Science contains recent surveys of aspects of the research literature.

Acknowledgment

The work was supported by the Swiss National Science Foundation and by the Natural Sciences and Engineering Research Council of Canada.

References

[1] Chernick, M.R. (1999). Bootstrap Methods: A Practitioner's Guide, Wiley, New York.
[2] Davison, A.C. & Hinkley, D.V. (1997). Bootstrap Methods and Their Application, Cambridge University Press, Cambridge.
[3] Efron, B. (1979). Bootstrap methods: another look at the jackknife, Annals of Statistics 7, 1-26.
[4] Efron, B. & Tibshirani, R.J. (1993). An Introduction to the Bootstrap, Chapman & Hall, New York.
[5] Hall, P. (1992). The Bootstrap and Edgeworth Expansion, Springer, New York.
[6] Lunneborg, C.E. (2000). Data Analysis by Resampling: Concepts and Applications, Duxbury Press, Pacific Grove.
[7] Manly, B.F.J. (1997). Randomization, Bootstrap and Monte Carlo Methodology in Biology, 2nd Edition, Chapman & Hall, London.
[8] Shao, J. & Tu, D. (1995). The Jackknife and Bootstrap, Springer, New York.

A.J. CANTY AND ANTHONY C. DAVISON
Box Plots
SANDY LOVIE
Volume 1, pp. 176-178
Figure 1 One sample box plot with the major parts labeled (median, midspread, whiskers, adjacent values, and outlier)

Figure 2 Two sample box plots for comparing pulse after exercise for males and females
boxes do not overlap contain medians that are also different. A more precise version of the latter test is available with notched box plots, where the probability of rejecting the null hypothesis of true equal medians is 0.05 when the 95% notches just clear each other, provided that it can be assumed that the samples forming the comparison are roughly normally distributed, with approximately equal spreads (see [1] for more details).

References

[2] Hoaglin, D.C., Mosteller, F. & Tukey, J.W., eds (1983). Understanding Robust and Exploratory Data Analysis, John Wiley & Sons, New York.
[3] Tukey, J.W. (1977). Exploratory Data Analysis, Addison-Wesley, Reading.
[4] Velleman, P.F. & Hoaglin, D.C. (1981). Applications, Basics and Computing of Exploratory Data Analysis, Duxbury Press, Boston.

SANDY LOVIE
$$ p_{ij} = P[\text{Object } i \text{ is preferred to Object } j] = \frac{\lambda_i}{\lambda_i + \lambda_j}. \qquad (1) $$

The model is thought to be first proposed by Zermelo [16], but Bradley and Terry [3] brought it into wide popularity among statistical practitioners. See [7] for an extensive bibliography, and the book [5], which covers other models as well.

It is often more convenient to work with odds than with probabilities (see Odds and Odds Ratios). If p

With $p_A = 0.80$ and $p_B = 0.65$, we have P[A beats B] = 0.28/(0.28 + 0.13) = 0.6829. That is, we guess that the chance A beats B is about 68%, which is at least plausible. The relationship between P[A beats B] and $p_A$, $p_B$ in (4) is somewhat complicated. The formula can be simplified by looking at odds instead of probability. From (4),

$$ \mathrm{Odds}_{AB} = \mathrm{Odds}[A \text{ beats } B] = \frac{p_A(1-p_B)/(p_A(1-p_B)+(1-p_A)p_B)}{1 - p_A(1-p_B)/(p_A(1-p_B)+(1-p_A)p_B)} $$
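The numerical claim above is easy to verify; a small sketch of the quoted formula:

```python
def p_beats(p_a, p_b):
    """P[A beats B] as a function of pA and pB (formula quoted from (4) above)."""
    num = p_a * (1 - p_b)
    return num / (num + (1 - p_a) * p_b)

p = p_beats(0.80, 0.65)
print(round(p, 4))  # 0.6829, i.e. 0.28 / (0.28 + 0.13)

# the same quantity as odds, which takes the simpler form pA(1-pB) / ((1-pA) pB)
odds = p / (1 - p)
print(round(odds, 4))
```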
Bradley-Terry Model
$$ \prod_{i \ne j} \left( \frac{\lambda_i}{\lambda_i + \lambda_j} \right)^{n_{ij}}. \qquad (6) $$

In the example comparing pork roasts in [3], let Object 1 be the feed with just corn, Object 2 the feed with corn and some peanut, and Object 3 the feed with corn and much peanut. Then the results from Judge I, who made five comparisons of each pair, are (from Table 4 in their paper)

In fact, $(n_1, \ldots, n_L)$ is a sufficient statistic for this model. The parameters can be estimated using maximum likelihood (see section titled Calculating the Estimates for some computational approaches), but notice that these probabilities do not change if every $\lambda_i$ is multiplied by the same positive constant c, which means that the $\lambda_i$'s are not uniquely determined by the $p_{ij}$'s. One usually places a constraint on the $\lambda_i$'s, for example, that $\lambda_L = 1$, or that $\lambda_1 + \cdots + \lambda_L = 1$.
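Although the section on computing the estimates is not reproduced here, a standard way to maximize likelihoods of this form is Zermelo's fixed-point (minorization-maximization) iteration. This is an illustrative sketch on hypothetical win counts, not the pork-roast data, with the scale fixed by setting the last parameter to 1:

```python
def fit_bradley_terry(n, iters=500):
    """n[i][j] = number of times object i was preferred to object j.
    Zermelo/MM fixed-point iteration for the Bradley-Terry parameters."""
    L = len(n)
    lam = [1.0] * L
    for _ in range(iters):
        new = []
        for i in range(L):
            wins = sum(n[i][j] for j in range(L) if j != i)
            denom = sum((n[i][j] + n[j][i]) / (lam[i] + lam[j])
                        for j in range(L) if j != i)
            new.append(wins / denom)
        # the scale is arbitrary; fix it here by normalizing so lam[-1] = 1
        scale = new[-1]
        lam = [x / scale for x in new]
    return lam

# hypothetical counts: object 0 dominates, object 2 is weakest
n = [[0, 8, 9],
     [2, 0, 7],
     [1, 3, 0]]
lam = fit_bradley_terry(n)
print([round(x, 3) for x in lam])
```

The iteration requires every object to have at least one win and one loss; otherwise the maximum likelihood estimate is on the boundary.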
I prefer scaling the $\lambda_i$'s so that

$$ \frac{1}{L} \sum_{i=1}^{L} \frac{\lambda_i}{1 + \lambda_i} = \frac{1}{2}, \qquad (11) $$

which means that $\lambda_i$ is the odds that Object i is preferred to a typical ideal object, where the typical object has on an average a 50% chance of being preferred over the others.

Maximizing (8) subject to the constraint in (11) yields the estimates

$$ \hat{\lambda}_1 = 0.2166, \qquad \hat{\lambda}_2 = \hat{\lambda}_3 = 1.9496. \qquad (12) $$

This judge likes some or much peanut equally, and any peanut substantially better than no peanut.

Bradley and Terry also present the results from another judge, Judge II, who again made five comparisons of each pair. Let $m_{ij}$ be the number of times Judge II preferred Object i to Object j, and $\eta_1, \eta_2, \eta_3$ be the corresponding Bradley-Terry parameters. One can imagine at least two models for the combined data of the two judges: (1) The judges have the same preferences, so that $\eta_1 = \lambda_1$, $\eta_2 = \lambda_2$, $\eta_3 = \lambda_3$; (2) The judges may have different preferences, so that $\eta_i$ does not necessarily equal $\lambda_i$. The likelihoods for the two models are

$$ L_{\mathrm{Same}} = \prod_{i \ne j} \left( \frac{\lambda_i}{\lambda_i + \lambda_j} \right)^{n_{ij} + m_{ij}} \qquad (13) $$

and

$$ L_{\mathrm{Different}} = \prod_{i \ne j} \left( \frac{\lambda_i}{\lambda_i + \lambda_j} \right)^{n_{ij}} \left( \frac{\eta_i}{\eta_i + \eta_j} \right)^{m_{ij}}. \qquad (14) $$

The question arises whether the apparent difference between the judges is statistically significant, which we address by testing the hypotheses

$$ H_0: (\eta_1, \eta_2, \eta_3) = (\lambda_1, \lambda_2, \lambda_3) \quad \text{versus} \quad H_A: (\eta_1, \eta_2, \eta_3) \ne (\lambda_1, \lambda_2, \lambda_3). \qquad (17) $$

The likelihood ratio test (see Maximum Likelihood Estimation) uses the statistic

$$ W = 2[\log(L_{\mathrm{Different}}) - \log(L_{\mathrm{Same}})], \qquad (18) $$

where in $L_{\mathrm{Different}}$ we replace the parameters with their estimates from (12) and (16), and in $L_{\mathrm{Same}}$ with the maximum likelihood estimates from (13), which are

$$ (\hat{\lambda}_1, \hat{\lambda}_2, \hat{\lambda}_3) = (\hat{\eta}_1, \hat{\eta}_2, \hat{\eta}_3) = (0.7622, 1.3120, 1.0000). \qquad (19) $$

Under the null hypothesis that the two judges have the same preference structure, W will be approximately $\chi^2_2$. The degrees of freedom here are the difference between the number of free parameters in the alternative hypothesis, which is 4 because there are six parameters but two constraints, and the null hypothesis, which is 2. In general, if there are L objects and M judges, then testing whether all M judges have the same preferences uses (L − 1)(M − 1) degrees of freedom.

For these data, W = 8.5002, which yields an approximate P value of 0.014. Thus, it appears that indeed the two judges have different preferences.
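The quoted P value can be checked directly, since for 2 degrees of freedom the chi-squared survival function has the closed form P(chi-squared_2 > w) = exp(-w/2):

```python
import math

W = 8.5002                     # likelihood ratio statistic from the text
p_value = math.exp(-W / 2.0)   # survival function of chi-squared with 2 df
print(round(p_value, 3))  # 0.014
```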
usual. In this case, not every pair is observed, that is, Milwaukee cannot be the home team and away team in the same game. That fact does not present a problem in fitting the model, though. The second and third columns in Table 1 give the estimated home and away odds for each team in this model. The Effect column is the ratio of the home odds to the away odds, and estimates the effect of being at home versus away for each team. That is, it is the odds the home team would win in the imaginary case that the same team was the home and away team. For most teams, the odds of winning are better at home than away, especially for Boston, although for Toronto and Baltimore the reverse is true.

To test whether there is a home/away effect, one can perform the likelihood ratio test between the two models. The W = 10.086 on 7 degrees of freedom (7 because the simple model has 7 − 1 = 6 free parameters, and the more complicated model has 14 − 1 = 13). The approximate P value is 0.18, so it appears that the simpler model is not rejected.

Agresti considers a model between the two, where it is assumed that the home/away effect is the same for each team. The parameter $\lambda_i$ can be thought of as the odds of team i winning at a neutral site, and a new parameter θ is introduced that is related to the home/away effect, where

$$ \lambda_i^{\mathrm{Home}} = \theta \lambda_i \quad \text{and} \quad \lambda_i^{\mathrm{Away}} = \frac{\lambda_i}{\theta}. \qquad (20) $$

Then the odds that team i beats team j when the game is played at team i's home is

$$ \mathrm{Odds}(i \text{ at home beats } j) = \theta^2 \frac{\lambda_i}{\lambda_j}, \qquad (21) $$

and the home/away effect is the same for each team, that is,

$$ \mathrm{Effect}(\text{team } i) = \frac{\lambda_i^{\mathrm{Home}}}{\lambda_i^{\mathrm{Away}}} = \frac{\theta \lambda_i}{\lambda_i / \theta} = \theta^2. \qquad (22) $$

Table 1 contains the estimated $\lambda_i$'s in the Neutral column. Notice that these values are very close to the estimates in the simple model. The estimate of θ is 1.1631, so that the effect for each team is $(1.1631)^2 = 1.3529$. The last two columns in the table give the estimated $\lambda_i^{\mathrm{Home}}$'s and $\lambda_i^{\mathrm{Away}}$'s. This model is a smoothed version of the second model.

To test this model versus the simple model, we have W = 5.41. There is 1 degree of freedom here, because the only additional parameter is θ. The P value is then 0.02, which is reasonably statistically significant, suggesting that there is indeed a home/away effect. One can also test the last two models, which yields W = 4.676 on 13 − 7 = 6 degrees of freedom, which is clearly not statistically significant. Thus, there is no evidence that the home/away effect differs among the teams.
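The algebra in (20)-(22) can be checked numerically; the team strengths below are hypothetical, while θ = 1.1631 is the estimate quoted above:

```python
theta = 1.1631            # estimated home/away parameter from the text
lam_i, lam_j = 1.4, 0.9   # hypothetical neutral-site odds for two teams

home_i = theta * lam_i    # eq. (20): home odds
away_j = lam_j / theta    # eq. (20): away odds

odds_direct = home_i / away_j           # odds that i at home beats j
odds_eq21 = theta ** 2 * lam_i / lam_j  # eq. (21)
assert abs(odds_direct - odds_eq21) < 1e-12

# eq. (22): the home/away effect is theta^2, the same for every team
effect = (theta * lam_i) / (lam_i / theta)
print(round(effect, 4))  # 1.3528 (the text reports 1.3529, presumably from an unrounded estimate of theta)
```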
The special case of a paired comparison has T matter which, if any, other objects are available. In
consisting of just two elements, for example, the soft drink example, this independence implies, in
particular, that
P{i,j } (i) = P [i is preferred to j ]. (24)
Luces Axiom uses the notation that for S T , P{Coke,7up} (Coke) P{Coke,7up,Sprite} (Coke)
=
P{Coke,7up} (7-up) P{Coke,7up,Sprite} (7-up)
PT (S)
PO (Coke)
= P [The most preferred object among T is in S] = . (30)
PO (7-up)
= PT (i). (25)
iS
The main implication is given in the next Theo-
rem, which is Theorem 3 in [10].
The axiom follows.
Luces Choice Axiom. For any T O, Luces Theorem. Assume that the Choice Axiom
holds, and that P{a,b} (a) = 0 for all a, b O. Then,
(i) If P{a,b} (a) = 0 for all a, b T , then for i there exists a positive finite number for each object,
S T, say i for Object i, such that for any i S O,
PT (i) = PS (i)PT (S); (26)
(ii) If P{a,b} (a) = 0 for some a, b T , then if i i
PS (i) = . (31)
T , i = a, j
j S
PT (i) = PT {a} (i). (27)
The essence of the Axiom can be best understood The interest in this paper is primarily paired com-
when P{a,b} (a) = 0 for all a, b O, that is, when parison, and we can see that for paired comparisons,
in any paired comparison, there is a positive chance the Luce Choice Axiom with S = {i, j } leads to the
that either object is preferred. Then equation (26) is BradleyTerry model (1), as long as the i s are
operational for all i S T . strictly between 0 and .
For an example, suppose O = {Coke, Pepsi, 7-up,
Sprite}. The axiom applied to T = O, S = {Coke,
Pepsi} (the colas) and i = Coke is Thurstones Scaling
P (Coke is the favorite among all four) Scaling models in preference data is addressed fully
in [2], but here we will briefly connect the two. Thur-
= P (Coke is preferred to Pepsi)
stone [14] models general preference experiments by
P (A cola is chosen as the favorite assuming that a given judge has a one-dimensional
response to each object. For example, in comparing
among all four). (28)
pork roasts, the response may be based on tender-
Thus, the choosing of the favorite can be decomposed ness, or in the soft drinks, the response may be based
into a two-stage process, where the choosing of Coke on sweetness, or caffeine stimulation. Letting Za be
as the favorite starts by choosing colas over noncolas, the response to Object a, the probability of prefer-
then Coke as the favorite cola. ring i among those in subset T (which contains i) is
There are many implications of the Axiom. One given by
is a precise formulation of the independence from
irrelevant alternatives:
PT (i) = P [i is most preferred among T ]
P{i,j } (i) PS (i)
= (29) = P [Zi > Zj , j T {i}]. (32)
P{i,j } (j ) PS (j )
for any subset S that contains i and j . That is, the That is, the preferred object is that which engenders
relative preference of i and j remains the same no the largest response.
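A small numerical sketch may help here. With made-up λ values for the soft drinks, the representation (31), the IIA ratios (29)-(30), and the decomposition (26)/(28) can all be checked directly:

```python
# Luce's representation (31) with made-up lambda values; item names echo
# the soft-drink example in the text.
lambdas = {"Coke": 4.0, "Pepsi": 3.0, "7-up": 2.0, "Sprite": 1.0}

def p_choose(item, subset):
    """Equation (31): P_S(i) = lambda_i / sum of lambda_j over j in S."""
    return lambdas[item] / sum(lambdas[j] for j in subset)

# For a two-element subset this is the Bradley-Terry form (1).
p_pair = p_choose("Coke", ["Coke", "7-up"])   # 4 / (4 + 2)

# IIA, equations (29)-(30): the Coke/7-up odds are the same in every
# subset that contains both drinks.
for S in (["Coke", "7-up"], ["Coke", "7-up", "Sprite"], list(lambdas)):
    odds = p_choose("Coke", S) / p_choose("7-up", S)
    assert abs(odds - lambdas["Coke"] / lambdas["7-up"]) < 1e-12

# Decomposition (26)/(28): P_O(Coke) = P_colas(Coke) * P_O(colas).
colas = ["Coke", "Pepsi"]
p_o_colas = sum(p_choose(c, list(lambdas)) for c in colas)
assert abs(p_choose("Coke", list(lambdas)) -
           p_choose("Coke", colas) * p_o_colas) < 1e-12
```

With these λ values, P_O(Coke) = 0.4 = (4/7) × (7/10), the two-stage decomposition described above.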
Thurstone gives several models for the responses (Z_1, . . . , Z_L) based on the Normal distribution. Daniels [4] looks at cases in which the Z's are independent and from a location-family model with possibly different location parameters; [8] and [13] used gamma distributions. A question is whether any such Thurstonian model would satisfy Luce's Choice Axiom. The answer, given by Luce and Suppes in [11], who attribute the result to Holman and Marley, is the Gumbel distribution. That is, the Z_i's are independent with density

    f_i(z_i) = exp(−(z_i − μ_i)) exp(−exp(−(z_i − μ_i))),  −∞ < z_i < ∞,   (33)

where the μ_i measures the typical strength of the response for Object i. Then the Thurstonian choice probabilities from (32) coincide with the model (31) with λ_i = exp(μ_i). Yellott [15] in fact shows that the Gumbel is the only distribution that will satisfy the Axiom when there are three or more objects.

Ties

It is possible that paired comparisons result in no preference; for example, in soccer, it is not unusual for a game to end in a tie, or people may not be able to express a preference between two colas. Particular parameterizations of the paired comparisons when extending the Bradley–Terry parameters to the case of ties are proposed in [12] and [6]. Rao and Kupper [12] add a parameter θ ≥ 1, and set

    P[Object i is preferred to Object j] = λ_i/(λ_i + θλ_j),   (34)

so that

    P[Object i and Object j are tied] = (θ² − 1)λ_iλ_j / [(λ_i + θλ_j)(λ_j + θλ_i)].   (35)

Davidson [6] instead writes the preference probability as λ_i/(λ_i + λ_j + c_ij). He suggests taking c_ij = c√(λ_iλ_j) for some c > 0, so that

    P[Object i is preferred to Object j] = λ_i/(λ_i + λ_j + c√(λ_iλ_j))   (36)

and

    P[Object i and Object j are tied] = c√(λ_iλ_j)/(λ_i + λ_j + c√(λ_iλ_j)).   (37)

Davidson's suggestion may be slightly more pleasing than (34) because the λ_i's have the same meaning as before conditional on there being a preference, that is,

    P[Object i is preferred to Object j | Object i and Object j are not tied] = λ_i/(λ_i + λ_j).   (38)

Calculating the Estimates

In the Bradley–Terry model with likelihood as in (6), the expected number of times Object i is preferred equals

    E[n_i] = E[Σ_{j≠i} n_ij] = Σ_{j≠i} N_ij λ_i/(λ_i + λ_j).   (39)
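Equating each observed win count n_i to its expectation (39) yields the iterative scheme described next. The sketch below uses the classical Zermelo/Ford fixed-point update, which is an assumption here since the displayed update (41) is not reproduced in this excerpt, and the win counts are invented:

```python
# Iterative fitting of the Bradley-Terry model from paired-comparison
# counts.  The update form is the standard Zermelo/Ford fixed point
# (assumed; equation (41) is not reproduced in this excerpt).
objects = ["A", "B", "C"]
N = {("A", "B"): 10, ("A", "C"): 10, ("B", "C"): 10}   # games per pair
wins = {("A", "B"): 7, ("B", "A"): 3, ("A", "C"): 8,   # wins[i, j]: i over j
        ("C", "A"): 2, ("B", "C"): 6, ("C", "B"): 4}

def games(i, j):
    return N.get((i, j)) or N.get((j, i)) or 0

# n_i: total number of times Object i was preferred.
n = {i: sum(wins.get((i, j), 0) for j in objects if j != i) for i in objects}

lam = {i: 1.0 for i in objects}                        # starting guess
for _ in range(500):
    lam = {i: n[i] / sum(games(i, j) / (lam[i] + lam[j])
                         for j in objects if j != i)
           for i in objects}
    scale = lam[objects[-1]]                           # renormalize: last = 1
    lam = {i: v / scale for i, v in lam.items()}

# At convergence, the expected win counts (39) reproduce the observed n_i,
# which is exactly the likelihood equation for the MLE.
for i in objects:
    expected = sum(games(i, j) * lam[i] / (lam[i] + lam[j])
                   for j in objects if j != i)
    assert abs(expected - n[i]) < 1e-6
```

Dividing each iterate by the last estimate reproduces the normalization λ_L = 1; as noted below, the iteration fails if some object is always preferred, because the MLE then does not exist.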
This equation is the basis of an iterative method for finding the estimates. Starting with a guess (λ̂_1^(0), . . . , λ̂_L^(0)) of the estimates, a sequence (λ̂_1^(k), . . . , λ̂_L^(k)), k = 0, 1, . . . , is produced, where the (k + 1)st vector is obtained from the kth vector via (41). After finding the new estimates, renormalize them, for example, divide them by the last one, so that the last is then 1. Zermelo [16] first proposed this procedure, and a number of authors have considered it and variations since. Rao and Kupper [12] and Davidson [6] give modifications for the models with ties presented in the section titled Ties. Under certain conditions, this sequence of estimates will converge to the maximum likelihood estimates. In particular, if one object is always preferred, the maximum likelihood estimate does not exist, and the algorithm will fail. See [9] for a thorough and systematic presentation of these methods and their properties.

An alternative approach is given in [1], pages 436–438, that exhibits the Bradley–Terry model, including the one in (20), as a logistic regression model. Thus widely available software can be used to fit the model. The idea is to note that the data can be thought of as the (L choose 2) independent binomial random variables,

    n_ij ~ Binomial(N_ij, p_ij), for i < j.   (42)

Then under (1) and (3),

    log(Odds_ij) = log(λ_i) − log(λ_j) = β_i − β_j,   (43)

that is, the log(Odds) is a linear function of the parameters β_i (= log(λ_i)). The constraint that λ_L = 1 means β_L = 0. See [1] for further details.

References

[1] Agresti, A. (2002). Categorical Data Analysis, 2nd Edition, Wiley, New York.
[2] Bockenholt, U. (2005). Scaling of preferential choice, Encyclopedia of Behavioral Statistics, Wiley, New York.
[3] Bradley, R.A. & Terry, M.A. (1952). Rank analysis of incomplete block designs. I, Biometrika 39, 324–345.
[4] Daniels, H.E. (1950). Rank correlation and population models, Journal of the Royal Statistical Society B 12, 171–181.
[5] David, H.A. (1988). The Method of Paired Comparisons, 2nd Edition, Charles Griffin & Company, London.
[6] Davidson, R.R. (1970). On extending the Bradley-Terry model to accommodate ties in paired comparison experiments, Journal of the American Statistical Association 65, 317–328.
[7] Davidson, R.R. & Farquhar, P.H. (1976). A bibliography on the method of paired comparisons, Biometrics 32, 241–252.
[8] Henery, R.J. (1983). Permutation probabilities for gamma random variables, Journal of Applied Probability 20, 822–834.
[9] Hunter, D.R. (2004). MM algorithms for generalized Bradley-Terry models, Annals of Statistics 32, 386–408.
[10] Luce, R.D. (1959). Individual Choice Behavior, Wiley, New York.
[11] Luce, R.D. & Suppes, P. (1965). Preference, utility, and subjective probability, Handbook of Mathematical Psychology Volume III, Wiley, New York, pp. 249–410.
[12] Rao, P.V. & Kupper, L.L. (1967). Ties in paired-comparison experiments: a generalization of the Bradley-Terry model (Corr: V63 p1550-51), Journal of the American Statistical Association 62, 194–204.
[13] Stern, H. (1990). Models for distributions on permutations, Journal of the American Statistical Association 85, 558–564.
[14] Thurstone, L.L. (1927). A law of comparative judgment, Psychological Review 34, 273–286.
[15] Yellott, J.I. (1977). The relationship between Luce's choice axiom, Thurstone's theory of comparative judgment, and the double exponential distribution, Journal of Mathematical Psychology 15, 109–144.
[16] Zermelo, E. (1929). Die Berechnung der Turnier-Ergebnisse als ein Maximumproblem der Wahrscheinlichkeitsrechnung, Mathematische Zeitschrift 29, 436–460.

(See also Attitude Scaling)

JOHN I. MARDEN
Breslow–Day Statistic
MOLIN WANG AND VANCE W. BERGER
Volume 1, pp. 184–186
in
ISBN-13: 978-0-470-86080-9
ISBN-10: 0-470-86080-4
Editors
Brown, William
PAT LOVIE
Volume 1, pp. 186–187
Bubble Plot
BRIAN S. EVERITT
Volume 1, p. 187
Calculating Covariance
GILAD CHEN
Volume 1, p. 191
function, the weights, called standardized canonical function coefficients, optimize R_C² just as regression beta weights optimize R².

The canonical functions are uncorrelated, or orthogonal. In fact, the functions are bi-orthogonal. For example, if there are two functions in a given analysis, the canonical scores on Function I for the criterion variable set are (a) perfectly uncorrelated with the canonical scores on Function II for the criterion variable set and (b) perfectly uncorrelated with the canonical scores on Function II for the predictor variable set. Additionally, the canonical scores on Function I for the predictor variable set are (a) perfectly uncorrelated with the canonical scores on Function II for the predictor variable set and (b) perfectly uncorrelated with the canonical scores on Function II for the criterion variable set. Each function theoretically can yield squared canonical correlations that are 1.0. In this case, because the functions are perfectly uncorrelated, each function perfectly explains relationships of the variables in the two variable sets, but does so in a unique way.

As is the case throughout the general linear model, interpretation of CCA addresses two questions [10]:

1. Do I have anything?, and, if so,
2. Where does it (my effect size) come from?

The first question is addressed by consulting some combination of evidence for (a) statistical significance, (b) effect sizes (e.g., R_C² or adjusted R_C² [8]), and (c) result replicability (see Cross-validation; Bootstrap Inference). It is important to remember that in multivariate statistics one can only test the statistical significance of a function as a single function (i.e., the last function), unless one uses a structural equation modeling approach to the analysis [2].

If the researcher decides that the results reflect nothing, the second question is rendered irrelevant, because the sensible researcher will not ask, "From where does my nothing originate?" If the researcher decides that the results reflect more than nothing, then both the standardized function coefficients and the structure coefficients must be consulted, as is the case throughout the general linear model [1, 3]. Only variables that have weight and structure coefficients of zero on all functions contribute nothing to the analysis.

A heuristic example may be useful in illustrating the application. The example is modeled on real results presented by Pitts and Thompson [6]. The heuristic presumes that participants obtain scores on two reading tests: one measuring reading comprehension when readers have background knowledge related to the reading topic (SPSS variable name read_yes) and one when they do not (read_no). These two reading abilities are predicted by scores on vocabulary (vocabulr), spelling (spelling), and self-concept (self_con) tests. The Table 1 data cannot be analyzed in SPSS by point-and-click, but canonical results can be obtained by executing the syntax commands:

    MANOVA read_yes read_no WITH vocabulr spelling self_con/
      PRINT=SIGNIF(MULTIV EIGEN DIMENR)/

Canonical Correlation Analysis

Note: The z-score equivalents of the five measured variables are presented in parentheses. The scores on the canonical composite variables (e.g., CRIT1 and PRED1) are also in z-score form. [The table this note accompanies is not recoverable from this copy.]

Note: The canonical adequacy coefficient equals the average squared structure coefficient for the variables on a given function [7, 10]. The canonical redundancy coefficient equals the canonical adequacy coefficient times the R_C² [7, 9]. [The table this note accompanies is not recoverable from this copy.]
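What the canonical weights optimize can be made concrete with a toy computation. The sketch below uses invented data, and a coarse grid search stands in for the eigen-solution a real CCA routine would use: it finds the largest correlation attainable between one linear composite of two criterion variables and one composite of three predictors, which approximates the first canonical correlation R_C up to grid resolution:

```python
# Brute-force illustration of the first canonical correlation: maximize
# the correlation between a weighted composite of the criteria (Y) and a
# weighted composite of the predictors (X).  All data are made up.
import math

Y = [(5, 7), (3, 4), (6, 6), (2, 3), (7, 8), (4, 4), (5, 5), (3, 5)]
X = [(6, 5, 2), (3, 4, 5), (6, 6, 3), (2, 2, 6), (8, 7, 1),
     (4, 3, 4), (5, 6, 2), (3, 4, 4)]

def corr(u, v):
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    su = math.sqrt(sum((a - mu) ** 2 for a in u))
    sv = math.sqrt(sum((b - mv) ** 2 for b in v))
    if su == 0 or sv == 0:
        return 0.0
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / (su * sv)

best = 0.0
steps = 36
for k in range(steps):                       # weight direction for 2 criteria
    a = (math.cos(math.pi * k / steps), math.sin(math.pi * k / steps))
    yc = [a[0] * y1 + a[1] * y2 for y1, y2 in Y]
    for t in range(steps // 2 + 1):          # directions on 3-predictor sphere
        for p in range(steps):
            th = math.pi * t / (steps // 2)
            ph = 2 * math.pi * p / steps
            b = (math.sin(th) * math.cos(ph),
                 math.sin(th) * math.sin(ph), math.cos(th))
            xc = [b[0] * x1 + b[1] * x2 + b[2] * x3 for x1, x2, x3 in X]
            best = max(best, abs(corr(yc, xc)))

# R_C is at least as large as any single bivariate correlation (the
# axis-aligned weight vectors are on the grid), and it is a correlation.
singles = max(abs(corr([y[i] for y in Y], [x[j] for x in X]))
              for i in range(2) for j in range(3))
assert singles - 1e-9 <= best <= 1 + 1e-9
```

Axis-aligned weight vectors are included in the grid, so the searched maximum can never fall below the best single bivariate correlation, mirroring the point that the weights optimize R_C² the way beta weights optimize R².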
[5] Knapp, T.R. (1978). Canonical correlation analysis: a general parametric significance testing system, Psychological Bulletin 85, 410–416.
[6] Pitts, M.C. & Thompson, B. (1984). Cognitive styles as mediating variables in inferential comprehension, Reading Research Quarterly 19, 426–435.
[7] Thompson, B. (1984). Canonical Correlation Analysis: Uses and Interpretation, Sage, Newbury Park.
[8] Thompson, B. (1990). Finding a correction for the sampling error in multivariate measures of relationship: a Monte Carlo study, Educational and Psychological Measurement 50, 15–31.
[9] Thompson, B. (1991). A primer on the logic and use of canonical correlation analysis, Measurement and Evaluation in Counseling and Development 24, 80–95.
[10] Thompson, B. (2000). Canonical correlation analysis, in Reading and Understanding More Multivariate Statistics, L. Grimm & P. Yarnold, eds, American Psychological Association, Washington, pp. 285–316.
[11] Thompson, B. (2004). Exploratory and Confirmatory Factor Analysis: Understanding Concepts and Applications, American Psychological Association, Washington.

BRUCE THOMPSON
Carroll–Arabie Taxonomy
JAN DE LEEUW
Volume 1, pp. 196–197
(usually found by some previous analysis, or defined by theoretical considerations). We only fit the coordinates of the points in other spaces; for instance, we have a two-dimensional space of objects and we fit individual preferences as either points or lines in that space.

In summary, we can say that the Carroll–Arabie taxonomy can be used to describe and classify a large number of scaling methods, especially scaling methods developed at Bell Telephone Laboratories and its immediate vicinity between 1960 and 1980. Since 1980, the field of scaling has moved away to some extent from the geometrical methods and the heavy emphasis on solving very complicated optimization problems. Item response theory and choice modeling have become more prominent, and they are somewhat at the boundaries of the taxonomy. New types of discrete representations have been discovered. The fact that the taxonomy is still very useful and comprehensive attests to the importance of the frameworks developed during 1960–1980, and to some extent also to the unfortunate fact that there no longer is a center in psychometrics and scaling with the power and creativity of Bell Labs in that area.

References

[1] Carroll, J.D. & Arabie, P. (1980). Multidimensional scaling, Annual Review of Psychology 31, 607–649.
[2] Coombs, C.H. (1964). A Theory of Data, Wiley.
[3] De Leeuw, J. & Heiser, W.J. (1980). Theory of multidimensional scaling, in Handbook of Statistics, Vol. II, P.R. Krishnaiah, ed., North Holland Publishing Company, Amsterdam.
[4] Guttman, L. (1968). A general nonmetric technique for finding the smallest coordinate space for a configuration of points, Psychometrika 33, 469–506.
[5] Kruskal, J.B. (1964a). Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothesis, Psychometrika 29, 1–27.
[6] Kruskal, J.B. (1964b). Nonmetric multidimensional scaling: a numerical method, Psychometrika 29, 115–129.
[7] Roskam, E.E.C.H.I. (1968). Metric analysis of ordinal data in psychology, Ph.D. thesis, University of Leiden.
[8] Shepard, R.N. (1962a). The analysis of proximities: multidimensional scaling with an unknown distance function (Part I), Psychometrika 27, 125–140.
[9] Shepard, R.N. (1962b). The analysis of proximities: multidimensional scaling with an unknown distance function (Part II), Psychometrika 27, 219–246.
[10] Shepard, R.N. (1972). A taxonomy of some principal types of data and of the multidimensional methods for their analysis, in Multidimensional Scaling, Volume I, Theory, R.N. Shepard, A.K. Romney & S.B. Nerlove, eds, Seminar Press, pp. 23–47.

(See also Proximity Measures; Two-mode Clustering)

JAN DE LEEUW
Carryover and Sequence Effects
MARY E. PUTT
Volume 1, pp. 197–201
Table 3: Mean weekly IDS score by period, and mean difference between periods, for individual patients on the TP and PT sequences [12]. Reprinted with permission from the Journal of the American Statistical Association. Copyright 2002 by the American Statistical Association. [Table entries are not recoverable from this copy.]

Estimation and Testing. Table 4 gives the expectation of the outcome for each sequence/period combination, as well as for Ȳ_ij, the mean difference of the periods for each subject. [Table 4, giving these expectations for the two sequences in terms of the treatment, period, and carryover parameters, is not recoverable from this copy.]

In most crossover studies, interest lies in estimating the treatment effect and testing for nonzero treatment effects. If the sample size is large, or the data are normally distributed, a hypothesis test [...] interpretable in the context of the study. However, in contrast to the crossover study, the mean of the repeated measurements for each subject has variance σ²(1 + ρ)/2, and this yields a larger variance of the [...] less efficient than the estimate of the treatment effect, and the power to detect carryover is typically very small [2]. In contrast, the treatment effect, while typically not of great interest in the EE : SS design, is estimated efficiently from a within-subject comparison. Grizzle's popular two-stage method used a test for the presence of a carryover effect to determine whether to use both periods or only the first period of data to estimate the treatment effect in the crossover design [8, 9]. This analysis is fundamentally flawed, with Type I error rates in excess of the nominal Type I error rates in the absence of carryover [7].

Lastly, analysis of variance or mixed-effects models (see Generalized Linear Mixed Models; Linear Multilevel Models) extend the analyses we have described here, and provide a unified framework for analyzing repeated measures studies [5, 10, 11, 16].

References

[1] Balaam, L.N. (1968). A two-period design with t² experimental units, Biometrics 24, 61–67.
[2] Brown, B.W. (1980). The cross-over experiment for clinical trials, Biometrics 36, 69–79.
[3] Carriere, K.C. & Reinsel, G.C. (1992). Investigation of dual-balanced crossover designs for two treatments, Biometrics 48, 1157–1164.
[4] Craig, T.J., Teets, S., Lehman, E.G., Chinchilli, V.M. & Zwillich, C. (1998). Nasal congestion secondary to allergic rhinitis as a cause of sleep disturbance and daytime fatigue and the response to topical nasal corticosteroids, Journal of Allergy and Clinical Immunology 101, 633–637.
[5] Diggle, P.J., Liang, K. & Zeger, S.L. (1994). Analysis of Longitudinal Data, Oxford Science Publications, New York.
[6] Fleiss, J.L. (1986). The Design and Analysis of Clinical Experiments, Wiley, New York.
[7] Freeman, P. (1989). The performance of the two-stage analysis of two-treatment, two-period cross-over trials, Statistics in Medicine 8, 1421–1432.
[8] Grizzle, J.E. (1965). The two-period change over design and its use in clinical trials, Biometrics 21, 467–480.
[9] Grizzle, J.E. (1974). Correction to Grizzle (1965), Biometrics 30, 727.
[10] Jones, B. & Kenward, M. (2003). Design and Analysis of Crossover Trials, Chapman & Hall, New York.
[11] Kirk, R.E. (1995). Experimental Design: Procedures for the Behavioral Sciences, 3rd Edition, Brooks-Cole, Pacific Grove.
[12] Putt, M.E. & Chinchilli, V.M. (2000). A robust analysis of crossover designs using multisample generalized L-statistics, Journal of the American Statistical Association 95, 1256–1262.
[13] Putt, M.E. & Ravina, B. (2002). Randomized placebo-controlled, parallel group versus crossover study designs for the study of dementia in Parkinson's disease, Controlled Clinical Trials 23, 111–126.
[14] Senn, S. (2002). Cross-over Trials in Clinical Research, Wiley, New York.
[15] Tourangeau, R., Rasinski, K., Bradburn, N. & D'Andarade, R. (1989). Carryover effects in attitude surveys, Public Opinion Quarterly 53, 495–524.
[16] Vonesh, E.F. & Chinchilli, V.M. (1997). Linear and Non-linear Models for the Analysis of Repeated Measurements, Marcel Dekker, New York.

Further Reading

Frison, L. & Pocock, S.J. (1992). Repeated measures in clinical trials: analysis using mean summary statistics and its implications for designs, Statistics in Medicine 11, 1685–1704.

MARY E. PUTT
Case Studies
PATRICK ONGHENA
Volume 1, pp. 201–204
than nouns, and this double dissociation of grammatical category by modality within a single individual has been presented as a serious challenge to current neurolinguistic theories.

Cases in All Shapes and Sizes

When we look at the diversity of the case study literature, we will notice that there are various types of case studies, and that there are several possible dimensions to express this multitude. A first distinction might refer to the kind of paradigm the researcher is working in. On the one hand, there is the more quantitative and analytical perspective of, for example, Yin [19, 20]. On the other hand, there is also the more qualitative and ethnographic approach of, for example, Stake [15, 16]. It is important to give both quantitative and qualitative case studies a place in behavioral science methodology. As Campbell [3] remarked:

"It is tragic that major movements in the social sciences are using the term hermeneutics to connote giving up on the goal of validity and abandoning disputation as to who has got it right. Thus, in addition to the quantitative and quasi-experimental case study approach that Yin teaches, our social science methodological armamentarium also needs a humanistic validity-seeking case study methodology that, while making no use of quantification or tests of significance, would still work on the same questions and share the same goals of knowledge." (italics in original, pp. ix–x)

A second distinction might refer to the kind of research problems and questions that are addressed in the case study. In a descriptive case study, the focus is on portraying the phenomenon, providing a chronological narrative of events, citing numbers and facts, highlighting specific or unusual events and characteristics, or using thick description of lived experiences and situational complexity [7]. An exploratory case study may be used as a tryout or act as a pilot to generate hypotheses and propositions that are tested in larger scale surveys or experiments. An explanatory case study tackles how and why questions and can be used in its own right to test causal hypotheses and theories [20].

Yin [20] uses a third distinction referring to the study design, which is based on the number of cases and the number of units of analysis within each case. In single-case holistic designs, there is only one case and a single unit of analysis. In single-case embedded designs, there is also only one case, but, in addition, there are multiple subunits of analysis, creating opportunities for more extensive analysis (e.g., a case study of school climate may involve teachers and pupils as subunits of study). Multiple-case holistic designs and multiple-case embedded designs are the corresponding designs when the same study contains more than one case (e.g., a case study of school climate that uses a multiple-case design implies involving several schools).

Just in Case

As Yin [20] observed: "Case study research is remarkably hard, even though case studies have traditionally been considered to be soft research. Paradoxically, the softer a research strategy, the harder it is to do" (p. 16). Here are some common pitfalls in case study research (based on the recommendations in [12]):

Bad journalism. Selecting a case out of several available cases because it fits the researcher's theory, or distorting the complete picture by picking out the most sensational features of the case.

Anecdotal style. Reporting an endless series of low-level banal and tedious nonevents that take over from in-depth rigorous analysis.

Pomposity. Deriving or generating profound theories from low-level data, or wrapping up accounts in high-sounding verbiage.

Blandness. Unquestioningly accepting the respondents' views, or only including safe uncontroversial issues in the case study, avoiding areas on which people might disagree.

Cases in Point

Instructive applications of case study research and additional references can be found in [1, 8, 15, 19]. Many interesting case studies from clinical psychology and family therapy can be found in Clinical Case Studies, a journal devoted entirely to case studies.

References

[1] Bassey, M. (1999). Case Study Research in Educational Settings, Open University Press, Buckingham.
[2] Campbell, D.T. (1975). Degrees of freedom and the case study, Comparative Political Studies 8, 178–193.
[3] Campbell, D.T. (2003). Foreword, in Case Study Research: Design and Methods, 3rd Edition, R.K. Yin, ed., Sage Publications, London, pp. ix–xi.
[4] Campbell, D.T. & Stanley, J.C. (1963). Experimental and Quasi-Experimental Designs for Research, Rand McNally, Chicago.
[5] Caramazza, A. (1986). On drawing inferences about the structure of normal cognitive systems from the analysis of patterns of impaired performance: the case for single-patient studies, Brain and Cognition 5, 41–66.
[6] Cook, T.D. & Campbell, D.T. (1979). Quasi-Experimentation: Design and Analysis Issues for Field Settings, Rand McNally, Chicago.
[7] Geertz, C. (1973). Thick description: towards an interpretative theory of culture, in The Interpretation of Cultures, C. Geertz, ed., Basic Books, New York, pp. 3–30.
[8] Hitchcock, G. & Hughes, D. (1995). Research and the Teacher: A Qualitative Introduction to School-Based Research, 2nd Edition, Routledge, London.
[9] Kazdin, A.E. (1992). Drawing valid inferences from case studies, in Methodological Issues and Strategies in Clinical Research, A.E. Kazdin, ed., American Psychological Association, Washington, pp. 475–490.
[10] Kratochwill, T.R., Mott, S.E. & Dodson, C.L. (1984). Case study and single-case research in clinical and applied psychology, in Research Methods in Clinical Psychology, A.S. Bellack & M. Hersen, eds, Pergamon, New York, pp. 55–99.
[11] Masters, W.H. & Johnson, V.E. (1970). Human Sexual Inadequacy, Little, Brown, Boston.
[12] Nisbet, J. & Watt, J. (1984). Case study, in Conducting Small-Scale Investigations in Educational Management, J. Bell, T. Bush, A. Fox, J. Goodey & S. Goulding, eds, Harper & Row, London, pp. 79–92.
[13] Rapp, B. & Caramazza, A. (2002). Selective difficulties with spoken nouns and written verbs: a single case study, Journal of Neurolinguistics 15, 373–402.
[14] Shallice, T. (1979). Case study approach in neuropsychological research, Journal of Clinical Neuropsychology 1, 183–211.
[15] Stake, R.E. (1995). The Art of Case Study Research, Sage Publications, Thousand Oaks.
[16] Stake, R.E. (2000). Case studies, in Handbook of Qualitative Research, 2nd Edition, N.K. Denzin & Y.S. Lincoln, eds, Sage Publications, Thousand Oaks, pp. 435–454.
[17] Sturman, A. (1997). Case study methods, in Educational Research, Methodology and Measurement: An International Handbook, 2nd Edition, J.P. Keeves, ed., Pergamon, Oxford, pp. 61–66.
[18] Wolpe, J. (1958). Psychotherapy by Reciprocal Inhibition, Stanford University Press, Stanford.
[19] Yin, R.K. (2003a). Applications of Case Study Design, 2nd Edition, Sage Publications, London.
[20] Yin, R.K. (2003b). Case Study Research: Design and Methods, 3rd Edition, Sage Publications, London.

(See also Single-Case Designs)

PATRICK ONGHENA
Case–Cohort Studies
BRYAN LANGHOLZ
Volume 1, pp. 204–206
One-group t Test. This test is used to compare a mean from a sample with that of a population, or with a hypothesized value for the population, when the standard deviation of the population is not known. The null hypothesis is that the sample is from the (hypothesized) population (that is, μ = μ_h, where μ is the mean of the population from which the sample came and μ_h is the mean of the population with which it is being compared).

The equation for this version of the t Test is

    t = (m − μ_h) / (s/√n),   (3)

where
m is the mean of the sample,
μ_h is the mean or assumed mean for the population,
s is the standard deviation of the sample,
n is the sample size.

[...] where m_1 and m_2 are the sample means of groups 1 and 2 respectively, n_1 and n_2 are the sample sizes of the two groups, and pv is the pooled variance

    pv = [(n_1 − 1)s_1² + (n_2 − 1)s_2²] / (n_1 + n_2 − 2),   (5)

where s_1² and s_2² are the variances of groups 1 and 2. When the two group sizes are the same, this simplifies to

    t = (m_1 − m_2) / √[(s_1² + s_2²)/n],   (6)

where n is the sample size of each group.

Heterogeneous variances

    t = (m_1 − m_2) / √(s_1²/n_1 + s_2²/n_2).   (7)
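These t statistics translate directly into code. The sketch below follows equations (3) and (5)-(7) with invented samples; the pooled two-sample form, labelled (4) in the original but falling outside this excerpt, is assumed to be the standard t = (m_1 − m_2)/√(pv/n_1 + pv/n_2):

```python
# Transcriptions of equations (3) and (5)-(7); the samples are invented.
import math

def one_group_t(sample, mu_h):
    """Equation (3): t = (m - mu_h) / (s / sqrt(n)), s the sample SD."""
    n = len(sample)
    m = sum(sample) / n
    s = math.sqrt(sum((x - m) ** 2 for x in sample) / (n - 1))
    return (m - mu_h) / (s / math.sqrt(n))

def pooled_t(g1, g2):
    """Two-group t with pooled variance (5); the (4) form is assumed."""
    n1, n2 = len(g1), len(g2)
    m1, m2 = sum(g1) / n1, sum(g2) / n2
    v1 = sum((x - m1) ** 2 for x in g1) / (n1 - 1)
    v2 = sum((x - m2) ** 2 for x in g2) / (n2 - 1)
    pv = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)   # equation (5)
    return (m1 - m2) / math.sqrt(pv / n1 + pv / n2)

def welch_t(g1, g2):
    """Equation (7), for heterogeneous variances."""
    n1, n2 = len(g1), len(g2)
    m1, m2 = sum(g1) / n1, sum(g2) / n2
    v1 = sum((x - m1) ** 2 for x in g1) / (n1 - 1)
    v2 = sum((x - m2) ** 2 for x in g2) / (n2 - 1)
    return (m1 - m2) / math.sqrt(v1 / n1 + v2 / n2)

g1, g2 = [12.0, 11.0, 14.0, 13.0], [10.0, 9.0, 12.0, 9.0]
# With equal group sizes, (6)/(7) coincide with the pooled form.
assert abs(pooled_t(g1, g2) - welch_t(g1, g2)) < 1e-12
```

The final assertion illustrates the remark in the text: when n_1 = n_2, the pooled and heterogeneous-variance denominators are algebraically identical.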
The within-subjects sum of squares (SS_ws) is found from

    SS_ws = Σ_i Σ_p (x_ip − m_p)²,   (21)

where
x_ip is the value provided by participant p in condition i,
m_p is the mean of participant p across all the conditions.

In words, for each participant, find the deviation between that person's score in each condition and that person's mean score. Square the deviations and sum them for that person. Then find the sum of those sums.

The between-conditions sum of squares (SS_bc) is calculated the same way as the between-groups sum of squares in the between-subjects design, except that, because the sample size in each condition will be the same, the multiplication by the sample size can take place after the summation:

    SS_bc = n Σ_i (m_i − m)².   (22)

The residual sum of squares (SS_res) can be found by subtracting SS_bc from SS_ws:

    SS_res = SS_ws − SS_bc.   (23)

[...]

    df_res = df_Total − (df_bc + df_S).   (27)

Mean squares
The mean square for between-conditions (MS_bc) is found from

    MS_bc = SS_bc / df_bc.   (28)

The mean square for the residual (MS_res) is found from

    MS_res = SS_res / df_res.   (29)

F-ratio
The F-ratio is formed from

    F = MS_bc / MS_res.   (30)

Example
Five participants provide scores on four different conditions (Table 2). The overall mean is 10.2.

Sums of squares
SS_Total = 47.2
SS_S = 3.2
SS_ws = 44
SS_bc = 5.2
SS_res = 44 − 5.2 = 38.8
6 Catalogue of Parametric Tests
Table 2 The scores of the five participants in four conditions of a within-subjects design, with means for conditions and participants

                   Condition
Participant     1     2     3     4   Mean
1              11     9    12    10   10.5
2              10    10     9     9    9.5
3              10    13    11     8   10.5
4               8    10    13     9   10
5              13     9     9    11   10.5
Mean         10.4  10.2  10.8   9.4

Degrees of freedom

dfTotal = (5 × 4) − 1 = 19
dfS = 5 − 1 = 4
dfws = 19 − 4 = 15
dfbc = 4 − 1 = 3
dfres = 19 − (3 + 4) = 12

Mean squares

MSbc = 5.2/3 = 1.733,  MSres = 38.8/12 = 3.233

F-ratio

F(3,12) = 1.733/3.233 = 0.536

The critical value of F for α = 0.05 with df of 3 and 12 is 3.49.
Decision: fail to reject H0.

Multiway ANOVA

When there is more than one independent variable (IV), the way in which these variables work together can be investigated to see whether some act as moderators for others. An example of a design with two independent variables would be one in which researchers wanted to test whether the effects of different types of music (jazz, classical, or pop) on blood pressure vary depending on the age of the listeners. A moderating effect of age on the effects of music would be indicated by an interaction between age and music type; in other words, the pattern of the link between blood pressure and music type would differ between the two age groups. An ANOVA with two independent variables therefore has three F-ratios, each of which tests a different null hypothesis. The first ignores the presence of the second independent variable and tests the main effect of the first independent variable, so that if there were two conditions in the first IV, the null hypothesis would be μ1 = μ2, where μ1 and μ2 are the means in the two populations for the first IV. The second F-ratio tests the second null hypothesis, which refers to the main effect of the second IV, with the existence of the first IV ignored. Thus, if there were two conditions in the second IV, the second H0 would be μa = μb, where μa and μb are the means in the population for the second IV. The third F-ratio addresses the third H0, which relates to the interaction between the two IVs. When each IV has two levels, this null hypothesis would be μa1 − μa2 = μb1 − μb2, where μa1 denotes the mean for the combination of the first condition of the first IV and the first condition of the second IV.
Examples are given here only for ANOVAs with two IVs. For more complex designs, there will be higher-order interactions as well: when there are k IVs, there will be 2-, 3-, 4-, . . ., k-way interactions, which can be tested. For details of such designs, see [4].

Multiway Between-subjects ANOVA. This version of ANOVA partitions the overall variance into four parts: the main effect of the first IV, the main effect of the second IV, the interaction between the two IVs, and the error term, which is used in all three F-ratios.

Sums of squares
The total sum of squares (SSTotal) is calculated from

SSTotal = Σ(xijp − m)²,   (31)

where xijp is the score provided by participant p in condition i of IV1 and condition j of IV2. A simpler description is that it is the sum of the squared deviations of each participant's score from the overall mean.
The sum of squares for the first IV (SSA) is calculated from

SSA = Σ[ni × (mi − m)²],   (32)

where
ni is the sample size in condition i of the first IV
mi is the mean in condition i of the first IV
m is the overall mean.
The sum of squares for the second IV (SSB) is calculated from

SSB = Σ[nj × (mj − m)²].   (33)

The degrees of freedom for the residual (dfres) are calculated from

dfres = dfTotal − (dfA + dfB + dfAB).   (40)

(SSB and its error term SSBS are calculated in an analogous fashion.)
The sum of squares for cells (SSb.cells) is found from

SSb.cells = n × Σ(mij − m)²,   (49)

where
n is the sample size
mij is the mean for the combination of condition j of IV1 and condition i of IV2
m is the overall mean.
The sum of squares for the interaction between IV1 and IV2 (SSAB) is calculated from

SSAB = SSb.cells − (SSA + SSB).   (50)

The sum of squares for the error term for the interaction IV1 by IV2 by subjects (SSABS) is found from

SSABS = SSTotal − (SSA + SSB + SSAB + SSAS + SSBS + SSS).   (51)

Degrees of freedom
The total degrees of freedom are found from

dfTotal = (n × k1 × k2) − 1,   (52)

where n is the sample size and k1 and k2 are the number of conditions in IV1 and IV2, respectively.

dfS = n − 1.   (53)
dfA = k1 − 1.   (54)
dfB = k2 − 1.   (55)
dfAB = dfA × dfB.   (56)
dfAS = dfA × dfS.   (57)
dfBS = dfB × dfS.   (58)
dfABS = dfAB × dfS.   (59)

Mean squares

MSA = SSA/dfA,  MSB = SSB/dfB,  MSAB = SSAB/dfAB,
MSAS = SSAS/dfAS,  MSBS = SSBS/dfBS,   (60)
MSABS = SSABS/dfABS.   (61)

F-ratios

FA = MSA/MSAS,  FB = MSB/MSBS,  FAB = MSAB/MSABS.   (62)

Example
Five participants provide scores for two different IVs, each of which has two conditions (Table 4).

Table 4 The scores and group means of participants in a 2-way, within-subjects design

IV1 (A)         1           2
IV2 (B)       1     2     1     2
1            12    15    15    16
2            10    12    17    14
3            15    12    12    19
4            12    14    14    19
5            12    11    13    14
Mean       12.2  12.8  14.2  16.4

Sums of squares

SSTotal = 115.8, SSS = 15.3, SSA = 39.2, SSAS = 5.3, SSB = 9.8, SSBS = 10.7, SSAB = 3.2, SSABS = 32.3

Degrees of freedom

dfS = 5 − 1 = 4, dfA = 2 − 1 = 1, dfAS = 1 × 4 = 4, dfB = 2 − 1 = 1, dfBS = 1 × 4 = 4, dfAB = 1 × 1 = 1, dfABS = 1 × 4 = 4

Mean squares

MSA = 39.2/1 = 39.2, MSAS = 5.3/4 = 1.325, MSB = 9.8/1 = 9.8, MSBS = 10.7/4 = 2.675, MSAB = 3.2/1 = 3.2, MSABS = 32.3/4 = 8.075

F-ratios

FA(1,4) = 39.2/1.325 = 29.58.
The critical value of F at α = 0.05 with df of 1 and 4 is 7.71.
Decision: Reject H0.

FB(1,4) = 9.8/2.675 = 3.66.
Decision: Fail to reject H0.

FAB(1,4) = 3.2/8.075 = 0.40.
Decision: Fail to reject H0.
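The full two-way within-subjects partition can be computed directly from the raw scores. The following is a minimal Python sketch using the example data of Table 4 (five participants; IV1 = A and IV2 = B, each with two conditions); the variable names are illustrative assumptions.

```python
# Sketch: 2-way within-subjects ANOVA sums of squares for the Table 4 data.
# scores[p][i][j] = score of participant p in condition i of A, condition j of B.
scores = [
    [[12, 15], [15, 16]],
    [[10, 12], [17, 14]],
    [[15, 12], [12, 19]],
    [[12, 14], [14, 19]],
    [[12, 11], [13, 14]],
]
n, ka, kb = len(scores), 2, 2
flat = [x for p in scores for row in p for x in row]
m = sum(flat) / len(flat)                                # overall mean

a_means = [sum(p[i][j] for p in scores for j in range(kb)) / (n * kb)
           for i in range(ka)]
b_means = [sum(p[i][j] for p in scores for i in range(ka)) / (n * ka)
           for j in range(kb)]
cell_means = [[sum(p[i][j] for p in scores) / n for j in range(kb)]
              for i in range(ka)]
subj_means = [sum(x for row in p for x in row) / (ka * kb) for p in scores]
as_means = [[sum(p[i]) / kb for i in range(ka)] for p in scores]  # A x subject
bs_means = [[sum(p[i][j] for i in range(ka)) / ka for j in range(kb)]
            for p in scores]                                      # B x subject

ss_total = sum((x - m) ** 2 for x in flat)
ss_a = n * kb * sum((v - m) ** 2 for v in a_means)
ss_b = n * ka * sum((v - m) ** 2 for v in b_means)
ss_cells = n * sum((v - m) ** 2 for row in cell_means for v in row)
ss_ab = ss_cells - ss_a - ss_b                           # cf. equation (50)
ss_s = ka * kb * sum((v - m) ** 2 for v in subj_means)
ss_as = kb * sum((v - m) ** 2 for p in as_means for v in p) - ss_a - ss_s
ss_bs = ka * sum((v - m) ** 2 for p in bs_means for v in p) - ss_b - ss_s
ss_abs = ss_total - (ss_a + ss_b + ss_ab + ss_as + ss_bs + ss_s)  # (51)

f_a = (ss_a / (ka - 1)) / (ss_as / ((ka - 1) * (n - 1)))  # FA(1,4)
print(round(ss_abs, 1), round(f_a, 2))
```

Running this reproduces the printed sums of squares (e.g., SSABS = 32.3) and FA(1,4) = 29.58.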
Multiway Mixed ANOVA. Mixed has a number of meanings within statistics. Here it is being used to
mean designs that contain both within- and between-subjects independent variables. For the description of the analysis and the example data, the first independent variable will be between-subjects and the second within-subjects.
The overall variance can be partitioned into that which is between subjects and that which is within subjects. The first part is further subdivided into the variance for IV1 and its error term (within groups). The second partition is subdivided into the variance for IV2, the variance for the interaction between IV1 and IV2, and the error term for both (IV1 by IV2 by subjects).

Sums of squares
The total sum of squares is given by

SSTotal = Σ(xjp − m)²,   (63)

where xjp is the score provided by participant p in condition j of IV2. A simpler description is that it is the sum of the squared deviations of each participant's score in each condition of IV2 from the overall mean.
The between-subjects sum of squares (SSS) is calculated from

SSS = k2 × Σ(mp − m)²,   (64)

where
k2 is the number of conditions in IV2
mp is the mean for participant p across all the conditions in IV2
m is the overall mean.
The within-subjects sum of squares (SSws) is calculated from

SSws = Σ(xjp − mp)²,   (65)

where
xjp is the value provided by participant p in condition j of IV2
mp is the mean of participant p across all the conditions.
The sum of squares for IV1 (SSA) is given by

SSA = k2 × Σ[ni × (mi − m)²],   (66)

where
k2 is the number of conditions in IV2
ni is the size of the sample in condition i of IV1
mi is the mean for all the scores in condition i of IV1
m is the overall mean.
The sum of squares for within groups (SSwg) is given by

SSwg = SSS − SSA.   (67)

The sum of squares for between cells (SSbc) is given by

SSbc = Σ[ni × (mij − m)²],   (68)

where
ni is the sample size in condition i of IV1
mij is the mean of the combination of condition i of IV1 and condition j of IV2
m is the overall mean.
The sum of squares for the interaction between IV1 and IV2 (SSAB) is given by

SSAB = SSbc − (SSA + SSB).   (69)

The sum of squares for IV2 by subjects within groups (SSB.s(gps)) is given by

SSB.s(gps) = SSws − (SSB + SSAB).   (70)

Degrees of freedom

dfTotal = (N × k2) − 1, where N is the sample size,
dfA = k1 − 1, dfB = k2 − 1, dfAB = dfA × dfB,
dfwg = Σ(ni − 1), where ni is the sample size in condition i of IV1,
dfB.s(gps) = dfB × dfwg.   (71)

Mean squares

MSA = SSA/dfA,  MSwg = SSwg/dfwg,  MSB = SSB/dfB,
MSAB = SSAB/dfAB,  MSB.s(gps) = SSB.s(gps)/dfB.s(gps).   (72)

F-ratios

FA = MSA/MSwg,  FB = MSB/MSB.s(gps),  FAB = MSAB/MSB.s(gps).   (73)
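The mixed-design partition above can also be sketched directly in Python. The following applies equations (63) to (70) to the example data of Table 5, which follows (two between-subjects groups of five participants, two within-subjects conditions); the variable names are illustrative.

```python
# Sketch: mixed-design sums of squares for the Table 5 data.
# groups[g][p] = (score in B1, score in B2) for participant p of group g.
groups = [
    [(11, 9), (13, 8), (12, 11), (10, 11), (10, 10)],
    [(10, 12), (11, 12), (13, 10), (9, 13), (10, 11)],
]
k2 = 2                                   # conditions of IV2
N = sum(len(g) for g in groups)          # total sample size
all_scores = [x for g in groups for p in g for x in p]
m = sum(all_scores) / len(all_scores)    # overall mean

subj_means = [sum(p) / k2 for g in groups for p in g]
ss_s = k2 * sum((v - m) ** 2 for v in subj_means)                     # (64)
ss_ws = sum((x - sum(p) / k2) ** 2
            for g in groups for p in g for x in p)                    # (65)
group_means = [sum(x for p in g for x in p) / (len(g) * k2) for g in groups]
ss_a = k2 * sum(len(g) * (gm - m) ** 2
                for g, gm in zip(groups, group_means))                # (66)
ss_wg = ss_s - ss_a                                                   # (67)
cell_means = [[sum(p[j] for p in g) / len(g) for j in range(k2)]
              for g in groups]
ss_bc = sum(len(g) * (cm - m) ** 2
            for g, row in zip(groups, cell_means) for cm in row)      # (68)
b_means = [sum(p[j] for g in groups for p in g) / N for j in range(k2)]
ss_b = N * sum((v - m) ** 2 for v in b_means)
ss_ab = ss_bc - (ss_a + ss_b)                                         # (69)
ss_bsgps = ss_ws - (ss_b + ss_ab)                                     # (70)

df_a, df_wg, df_b, df_ab = 1, 8, 1, 1
df_bsgps = df_b * df_wg
f_a = (ss_a / df_a) / (ss_wg / df_wg)
f_b = (ss_b / df_b) / (ss_bsgps / df_bsgps)
f_ab = (ss_ab / df_ab) / (ss_bsgps / df_bsgps)
print(round(f_a, 2), round(f_b, 2), round(f_ab, 2))
```

The example's printed results (FA = 3.27, FB = 0.07, FAB = 2.44) are recovered exactly.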
Example
Five participants provided data for condition 1 of IV1 and five provided data for condition 2 of IV1. All participants provided data for both conditions of IV2 (Table 5).

Table 5 The scores of participants in a 2-way, mixed design

                                   Condition of IV2 (B)
Participant   Condition of IV1 (A)      1      2
 1                     1               11      9
 2                     1               13      8
 3                     1               12     11
 4                     1               10     11
 5                     1               10     10
 6                     2               10     12
 7                     2               11     12
 8                     2               13     10
 9                     2                9     13
10                     2               10     11

Sums of squares

SSS = 6.2, SSws = 31, SSA = 1.8, SSwg = 4.4, SSb.cell = 9.2, SSB = 0.2, SSAB = 9.2 − (1.8 + 0.2) = 7.2, SSB.s(gps) = 31 − (0.2 + 7.2) = 23.6

Degrees of freedom

dfA = 2 − 1 = 1, dfwg = (5 − 1) + (5 − 1) = 8, dfB = 2 − 1 = 1, dfAB = 1 × 1 = 1, dfB.s(gps) = 1 × 8 = 8

Mean squares

MSA = 1.8, MSwg = 0.55, MSB = 0.2, MSAB = 7.2, MSB.s(gps) = 2.95

F-ratios

FA(1,8) = 1.8/0.55 = 3.27
The critical value for F at α = 0.05 with df of 1 and 8 is 5.32.
Decision: Fail to reject H0.

FB(1,8) = 0.2/2.95 = 0.07.
Decision: Fail to reject H0.

FAB(1,8) = 7.2/2.95 = 2.44.
Decision: Fail to reject H0.

Extensions of ANOVA. The analysis of variance can be extended in a number of ways. Effects with several degrees of freedom can be decomposed into component effects using comparisons of treatment means, including trend tests. Analyses of covariance can control statistically for the effects of potentially confounding variables. Multivariate analyses of variance can simultaneously test effects of the same factors on different dependent variables.

Comparing Variances

F test for Difference Between Variances

Two Independent Variances. This test compares two variances from different samples to see whether they are significantly different. An example of its use could be where we want to see whether a sample of people's scores on one test were more variable than a sample of people's scores on another test.
The equation for the F test is

F = s1²/s2²,   (74)

where the variance in one sample is divided by the variance in the other sample.
If the research hypothesis is that one particular group will have the larger variance, then that group should be treated as group 1 in this equation. As usual, an F-ratio close to 1 would suggest no difference in the variances of the two groups. A large F-ratio would suggest that group 1 has a larger variance than group 2, but it is worth noting that a particularly small F-ratio, and therefore a probability close to 1, would suggest that group 2 has the larger variance.

Degrees of freedom
The degrees of freedom for each variance are 1 fewer than the sample size in that group.

Example
Group 1
Variance: 16
Sample size: 100
Group 2
Variance: 11
Sample size: 150
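A minimal sketch completing the arithmetic of this example (the F value itself is not preserved in the text above; it follows directly from equation (74)):

```python
# Sketch: F test for two independent variances, using the example above
# (group 1: variance 16, n = 100; group 2: variance 11, n = 150).
# The group predicted to have the larger variance goes on top.
var1, n1 = 16.0, 100
var2, n2 = 11.0, 150

f_ratio = var1 / var2          # equation (74)
df1, df2 = n1 - 1, n2 - 1      # df: one fewer than each sample size
print(round(f_ratio, 3), df1, df2)
```

This gives F(99, 149) = 1.455, to be compared against tabled critical values for those degrees of freedom.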
Degrees of freedom
The test has df 1 and n − 2.

Example
Sample size: 50
Variance in variable 1: 50
Variance in variable 2: 35
Correlation between the two variables: 0.7
Error df = 50 − 2 = 48

F(1,48) = [(50 − 35)² × (50 − 2)] / [4 × 50 × 35 × (1 − 0.7²)] = 3.025

The critical value for F with df = 1 and 48 for α = 0.05 is 4.043.
Decision: Fail to reject H0.

where log means take the logarithm to the base 10.

Degrees of freedom

df = k − 1, where k is the number of groups.   (79)

Kanji [5] cautions against using the chi-square distribution when the sample sizes are smaller than 6 and provides a table of critical values for a statistic derived from B when this is the case.

Example
We wish to compare the variances of three groups: 2.62, 3.66, and 2.49, with each group having the same sample size: 10.
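The two-correlated-variances example above can be checked numerically. The statistic is reconstructed here from the worked numbers in that example (the defining formula itself is not preserved in the text), so treat the expression as an inference from the arithmetic shown:

```python
# Sketch: F test for two correlated variances, reconstructed from the
# worked example above (n = 50, variances 50 and 35, correlation 0.7):
# F(1, n-2) = (s1^2 - s2^2)^2 * (n - 2) / [4 * s1^2 * s2^2 * (1 - r^2)]
n = 50
s1_sq, s2_sq, r = 50.0, 35.0, 0.7

f_stat = ((s1_sq - s2_sq) ** 2 * (n - 2)) / (4 * s1_sq * s2_sq * (1 - r ** 2))
df1, df2 = 1, n - 2
print(round(f_stat, 3), df1, df2)
```

The output reproduces the printed F(1,48) = 3.025.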
Table 6 The correlation matrix for three variables showing the symbol for each correlation

             Variable
Variable    1     2     3
1           1
2         r21     1
3         r31   r32     1

             Variable
Variable      1       2       3
1             1
2         0.113       1
3         0.385   0.139       1

Table 8 The correlation matrix for four variables showing the symbol for each correlation

             Variable
Variable    1     2     3     4
1           1
2         r21     1
3         r31   r32     1
4         r41   r42   r43     1

Example
Twenty people each provide scores on four variables. The correlations between the variables are shown in Table 9:
Comparing r21 with r43
One assumption of this test concerns the size of the expected frequencies. When the degrees of freedom are 1, the assumption is that all the expected frequencies will be at least 5. When the df are greater than 1, the assumption is that at least 80% of the expected frequencies will be 5 or more.
Yates [9] devised a correction for chi-square when the degrees of freedom are 1 to allow for the fact that the chi-squared distribution is continuous and yet, when df = 1, there are so few categories that the chi-squared values from such a test will be far from continuous; hence, the Yates test is referred to as a correction for continuity. However, it is considered that this variant on the chi-squared test is only appropriate when the marginal totals are fixed, that is, when they have been chosen in advance [6]. In most uses of chi-squared, this would not be true. If we were looking at gender and smoking status, it would make little sense to set, in advance, how many males and females you were going to sample, as well as how many smokers and nonsmokers.
Chi-square corrected for continuity is found from

χ²(1) = Σ (|fo − fe| − 0.5)² / fe,   (96)

where |fo − fe| means take the absolute value; in other words, if the result is negative, treat it as positive.

One-group Chi-square/Goodness of Fit

Equal Proportions. In this version of the test, the observed frequencies that occur in each category of a single variable are compared with the expected frequencies, which are that the same proportion will occur in each category.

Degrees of freedom
The df are based on the number of categories (k): df = k − 1.

Example
A sample of 45 people are placed into three categories, with 25 in category A, 15 in B, and 5 in C. The expected frequencies are calculated by dividing the total sample by the number of categories. Therefore, each category would be expected to have 15 people in it.

χ² = 13.33
df = 2

The critical value of the chi-squared distribution with df = 2 and α = 0.05 is 5.99.
Decision: reject H0.

Test of Distribution. This test is another use of the previous test, but the example will show how it is possible to test whether a set of data is distributed according to a particular pattern. The distribution could be uniform, as in the previous example, or nonuniform.

Example
One hundred scores have been obtained with a mean of 1.67 and a standard deviation of 0.51. In order to test whether the distribution of the scores deviates from being normally distributed, the scores can be converted into z-scores by subtracting the mean from each and dividing the result by the standard deviation. The z-scores can be put into ranges. Given the sample size and the need to maintain at least 80% of the expected frequencies at 5 or more, the width of the ranges can be approximately half a standard deviation, except for the two outer ranges, where the expected frequencies get smaller the further they go from the mean. At the bottom of the range, as the lowest possible score is 0, the equivalent z-score will be −3.27. At the top end of the range, there is no limit set on the scale.
By referring to standard tables of probabilities for a normal distribution, we can find out what the expected frequency would be within a given range of z-scores. The following table (Table 10) shows the expected and observed frequencies in each range.

Table 10 The expected (under the assumption of a normal distribution) and observed frequencies of a sample of 100 values

From z     to z        fe     fo
−3.275    −1.500    6.620      9
−1.499    −1.000    9.172      7
−0.999    −0.500   14.964     15
−0.499     0.000   19.111     22
 0.001     0.500   19.106     14
 0.501     1.000   14.953     15
 1.001     1.500    9.161     11
 1.501     5.000    6.661      7

χ² = 3.56
df = 8 − 1 = 7
The critical value for the chi-squared distribution with df = 7 at α = 0.05 is 14.07.
Decision: Fail to reject H0.
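Both one-group chi-square examples above can be reproduced with a few lines of Python; the helper function here is an illustrative sketch, not from the article:

```python
# Sketch: one-group chi-square for (a) the equal-proportions example
# (25/15/5 over three categories) and (b) the test of distribution
# against the normal-curve expected frequencies of Table 10.

def chi_square(observed, expected):
    """Sum of (fo - fe)^2 / fe over the categories."""
    return sum((fo - fe) ** 2 / fe for fo, fe in zip(observed, expected))

# (a) Equal proportions: 45 people over 3 categories -> fe = 15 each
chi_equal = chi_square([25, 15, 5], [15, 15, 15])   # df = 3 - 1 = 2

# (b) Test of distribution: observed vs expected frequencies (Table 10)
fe = [6.620, 9.172, 14.964, 19.111, 19.106, 14.953, 9.161, 6.661]
fo = [9, 7, 15, 22, 14, 15, 11, 7]
chi_dist = chi_square(fo, fe)                       # df = 8 - 1 = 7

print(round(chi_equal, 2), round(chi_dist, 2))
```

The printed values χ² = 13.33 (df = 2) and χ² = 3.56 (df = 7) are recovered.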
Chi-square Contingency Test

This version of the chi-square test investigates the way in which the frequencies in the levels of one variable differ across the other variable. Once again, it is for categorical data. An example could be a sample of blind people and a sample of sighted people, where both groups are aged over 80 years. Each person is asked whether they go out of their house on a normal day. Therefore, we have the variable visual condition, with the levels blind and sighted, and another variable, whether the person goes out, with the levels yes and no. The null hypothesis of this test would be that the proportions of sighted and blind people going out are the same (which is the same as saying that the proportions staying in are the same in each group). This can be rephrased to say that the two variables are independent of each other: the likelihood of a person going out is not linked to that person's visual condition.
The expected frequencies are based on the marginal probabilities. In this example, that would be the number of blind people, the number of sighted people, the number of the whole sample who go out, and the number of the whole sample who do not go out. Thus, if 25% of the entire sample went out, then the expected frequencies would be based on 25% of each group going out and, therefore, 75% of each not going out.
The degrees of freedom for this version of the test are calculated from the number of rows and columns in the contingency table: df = (r − 1) × (c − 1), where r is the number of rows and c the number of columns in the table.

Example
Two variables, A and B, each have two levels. Twenty-seven people are in level 1 of variable A and 13 are in level 2 of variable A. Twenty-one people are in level 1 of variable B and 19 are in level 2 (Table 11).

Table 11 A contingency table showing the cell and marginal frequencies of 40 participants

                  Variable A
                 1     2   Total
Variable B   1  12     9      21
             2  15     4      19
Total           27    13      40

If variables A and B are independent, then the expected frequency for the number who are in the first level of both variables will be based on the fact that 27 out of 40 (or 0.675) were in level 1 of variable A and 21 out of 40 (or 0.525) were in level 1 of variable B. Therefore, the proportion who would be expected to be in level 1 of both variables would be 0.675 × 0.525 = 0.354375, and the expected frequency would be 0.354375 × 40 = 14.175.

χ² = 2.16
df = (2 − 1) × (2 − 1) = 1

The critical value of chi-square with df = 1 and α = 0.05 is 3.84.
Decision: Fail to reject H0.

z-test for Proportions

Comparison between a Sample and a Population Proportion. This test compares the proportion in a sample with that in a population (or that which might be assumed to exist in a population).
The standard error for this test is √(π(1 − π)/n), where
π is the proportion in the population
n is the sample size.
The equation for the z-test is

z = (p − π) / √(π(1 − π)/n),   (97)

where p is the proportion in the sample.

Example
In a sample of 25, the proportion to be tested is 0.7.
Under the null hypothesis, the proportion in the population is 0.5.

z = (0.7 − 0.5) / √(0.5 × (1 − 0.5)/25) = 2
The critical value for z with α = 0.05 for a two-tailed test is 1.96.
Decision: Reject H0.

Comparison of Two Independent Samples. Under the null hypothesis that the proportions in the populations from which two samples have been taken are the same, the standard error for this test is

√(π1(1 − π1)/n1 + π2(1 − π2)/n2),

where
π1 and π2 are the proportions in each population
n1 and n2 are the sizes of the two samples.
When the population proportion is not known, it is estimated from the sample proportions.
The equation for the z-test is

z = (p1 − p2) / √(p1(1 − p1)/n1 + p2(1 − p2)/n2),   (98)

where p1 and p2 are the proportions in the two samples.

Example
Sample 1: n = 30, p = 0.7
Sample 2: n = 25, p = 0.6

z = (0.7 − 0.6) / √(0.7 × (1 − 0.7)/30 + 0.6 × (1 − 0.6)/25) = 0.776

The critical value for z at α = 0.05 with a two-tailed test is 1.96.
Decision: Fail to reject H0.

Comparison of Two Correlated Samples. The main use for this test is to judge whether there has been change across two occasions when a measure was taken from a sample. For example, researchers might be interested in whether people's attitudes to banning smoking in public places had changed after seeing a video on the dangers of passive smoking, compared with attitudes held before seeing the video.
A complication with this version of the test concerns the estimate of the standard error. A number of versions exist, which produce slightly different results. However, one version will be presented here, from Agresti [1]. This will be followed by a simplification, which is found in a commonly used test.
Given a table (Table 12):

Table 12 The frequencies of 84 participants placed in two categories and noted at two different times

                   After
               A     B   Total
Before    A   15    25      40
          B   35     9      44
Total         50    34      84

This shows that originally 40 people were in category A and 44 in category B, and on a second occasion this had changed to 50 being in category A and 34 in category B. This test is only interested in the cells where change has occurred: the 25 who were in A originally but changed to B, and the 35 who were in B originally but changed to A. By converting each of these to proportions of the entire sample, 25/84 = 0.297619 and 35/84 = 0.416667, we have the two proportions we wish to compare. The standard error for the test is

√([(p1 + p2) − (p1 − p2)²] / n),

where n is the total sample size.
The equation for the z-test is

z = (p1 − p2) / √([(p1 + p2) − (p1 − p2)²] / n).   (99)

Example
Using the data in the table above,
p1 = 0.297619
p2 = 0.416667
n = 84

z = (0.297619 − 0.416667) / √([(0.297619 + 0.416667) − (0.297619 − 0.416667)²] / 84) = −1.304

The critical z with α = 0.05 for a two-tailed test is 1.96.
Decision: Fail to reject H0.
When the z from this version of the test is squared, this produces the Wald test statistic.
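The z-tests for proportions described above can be collected into one short Python sketch, applied to the worked examples:

```python
# Sketch: z-tests for proportions, using the worked examples above.
from math import sqrt

# One sample against a population proportion (equation 97)
z_one = (0.7 - 0.5) / sqrt(0.5 * (1 - 0.5) / 25)

# Two independent samples (equation 98)
p1, n1, p2, n2 = 0.7, 30, 0.6, 25
z_ind = (p1 - p2) / sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)

# Two correlated samples with Agresti's standard error (equation 99):
# 25 of 84 changed from A to B, 35 of 84 changed from B to A
pa, pb, n = 25 / 84, 35 / 84, 84
z_corr = (pa - pb) / sqrt(((pa + pb) - (pa - pb) ** 2) / n)

# Simplified standard error (equation 100); squaring this z gives
# McNemar's test of change
z_mcnemar = (pa - pb) / sqrt((pa + pb) / n)

print(round(z_one, 2), round(z_ind, 3), round(z_corr, 3), round(z_mcnemar, 3))
```

This reproduces the printed values z = 2, z = 0.776, z = −1.304, and z = −1.291.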
A simplified version of the standard error allows another test to be derived from the resulting z-test: McNemar's test of change.
In this version, the equation for the z-test is

z = (p1 − p2) / √((p1 + p2) / n).   (100)

Example
Once again using the same data as that in the previous example,

z = −1.291.

If this z-value is squared, then the statistic is McNemar's test of change, which is often presented in a further simplified version of the calculations, which produces the same result [3].

References

[1] Agresti, A. (2002). Categorical Data Analysis, 2nd Edition, John Wiley & Sons, Hoboken.
[2] Bartlett, M.S. (1937). Some examples of statistical methods of research in agriculture and applied biology, Supplement to the Journal of the Royal Statistical Society 4, 137–170.
[3] Clark-Carter, D. (2004). Quantitative Psychological Research: A Student's Handbook, Psychology Press, Hove.
[4] Howell, D.C. (2002). Statistical Methods for Psychology, 5th Edition, Duxbury Press, Pacific Grove.
[5] Kanji, G.K. (1993). 100 Statistical Tests, Sage Publications, London.
[6] Neave, H.R. & Worthington, P.L. (1988). Distribution-Free Tests, Routledge, London.
[7] Steiger, J.H. (1980). Tests for comparing elements of a correlation matrix, Psychological Bulletin 87, 245–251.
[8] Winer, B.J., Brown, D.R. & Michels, K.M. (1991). Statistical Principles in Experimental Design, 3rd Edition, McGraw-Hill, New York.
[9] Yates, F. (1934). Contingency tables involving small numbers and the χ² test, Supplement to the Journal of the Royal Statistical Society 1, 217–235.

DAVID CLARK-CARTER
Catalogue of Probability Density Functions
BRIAN S. EVERITT
Volume 1, pp. 228–234
[Figure: binomial probability functions for n = 10 with p = .1, .3, .5, and .8; probability plotted against the number of successes.]
[Figure: probability functions for p = .3, .5, .7, and .9; probability plotted against the number of failures before the first success.]
[Figure 3: probability functions for k = 5 with p = .2, .4, .6, and .8; probability plotted against the number of failures before k successes.]
binomial distribution are given in Figure 3. The density is important in discussions of overdispersion (see Generalized Linear Models (GLM)).

Hypergeometric

A PDF associated with sampling without replacement from a population of finite size. If the population consists of r elements of one kind and N − r of another, then the hypergeometric is the PDF of the random variable X defined as the number of elements of the first kind when a random sample of n is drawn. The density function is given by

P(X = x) = [r!(N − r)! / (x!(r − x)!(n − x)!(N − r − n + x)!)] / [N! / (n!(N − n)!)].   (5)

The mean of X is nr/N and its variance is (nr/N)(1 − r/N)[(N − n)/(N − 1)]. The hypergeometric density function is the basis of Fisher's exact test used in the analysis of sparse contingency tables (see Exact Methods for Categorical Data).

Poisson

A PDF that arises naturally in many instances, particularly as a probability model for the occurrence of rare events, for example, the emission of radioactive particles. In addition, the Poisson density function is the limiting distribution of the binomial when p is small and n is large. The Poisson density function for a random variable X taking integer values from zero to infinity is defined as

P(X = x) = e^(−λ) λ^x / x!,  x = 0, 1, 2, . . . .   (6)

The single parameter of the Poisson density function, λ, is both the expected value and variance; that is, the mean and variance of a Poisson random variable are equal. Some examples of Poisson density functions are given in Figure 4.
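The two densities above translate directly into code. The following sketch writes equations (5) and (6) in Python (using binomial coefficients for the factorial ratio in (5)) and checks the stated moments numerically; the population and parameter values are illustrative choices, not from the article:

```python
# Sketch: hypergeometric density (equation 5) and Poisson density
# (equation 6), with numerical checks of their stated moments.
from math import comb, exp, factorial

def hypergeom_pmf(x, N, r, n):
    """P(X = x): x items of the first kind in a sample of n drawn
    without replacement from N items, r of which are of that kind."""
    return comb(r, x) * comb(N - r, n - x) / comb(N, n)

def poisson_pmf(x, lam):
    """P(X = x) = exp(-lam) * lam**x / x!"""
    return exp(-lam) * lam ** x / factorial(x)

# Hypergeometric mean check: sum of x * P(X = x) equals n*r/N
N, r, n = 20, 8, 5
mean_h = sum(x * hypergeom_pmf(x, N, r, n) for x in range(n + 1))
print(round(mean_h, 6), n * r / N)

# Poisson: mean and variance are both lambda (tail truncated at 60)
lam = 3.0
mean_p = sum(x * poisson_pmf(x, lam) for x in range(60))
var_p = sum((x - lam) ** 2 * poisson_pmf(x, lam) for x in range(60))
print(round(mean_p, 6), round(var_p, 6))
```

Both checks confirm the text: the hypergeometric mean equals nr/N, and the Poisson mean and variance both equal λ.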
4 Catalogue of Probability Density Functions
[Figure 4: Poisson probability functions for several parameter values (including 3 and 4); probability plotted against the number of events.]
defined as follows:

f(x) = (1/(σ√(2π))) exp(−(x − μ)²/(2σ²)),  −∞ < x < ∞,   (7)

where μ and σ² are, respectively, the mean and variance of X. When the mean is zero and the variance one, the resulting density is labelled the standard normal. The normal density function is bell-shaped, as is seen in Figure 5, where a number of normal densities are shown.

[Figure 5: Normal density functions.]

In the case of continuous random variables, the probability that the random variable takes a particular value is strictly zero; nonzero probabilities can only be assigned to some range of values of the variable. So, for example, we say that f(x) dx gives the probability of X falling in the very small interval, dx, centered on x, and the probability that X falls in some interval, [A, B] say, is given by integrating f(x) dx from A to B.
The normal density function is ubiquitous in statistics. The vast majority of statistical methods are based on the assumption of a normal density for the observed data or for the error terms in models for the data. In part, this can be justified by an appeal to the central limit theorem. The density function first
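The point about integrating f(x) over an interval can be illustrated numerically. The following sketch codes equation (7) and approximates the probability within one standard deviation of the mean with a simple midpoint rule (the integrator is an illustrative helper, not from the article):

```python
# Sketch: the normal density of equation (7), with a numerical check
# that integrating f(x) over [A, B] yields a probability; about 68.3%
# of the mass lies within one standard deviation of the mean.
from math import exp, pi, sqrt

def normal_pdf(x, mu=0.0, sigma=1.0):
    return exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * sqrt(2 * pi))

def integrate(f, a, b, steps=10000):
    """Simple midpoint rule for the integral of f from a to b."""
    h = (b - a) / steps
    return sum(f(a + (i + 0.5) * h) for i in range(steps)) * h

p_within_1sd = integrate(normal_pdf, -1.0, 1.0)
print(round(p_within_1sd, 4))
```

The result, approximately 0.6827, is the familiar one-standard-deviation probability of the standard normal.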
[Figure: example density functions f(X), including a uniform density and a Gaussian-Laplace density (parameter = 2.0).]

Figure 7 Four bivariate normal density functions with means 5 and standard deviations 1 for variables 1 and 2, and with correlations equal to 0.0, 0.3, 0.6, and 0.9
Introduction

Catastrophe theory describes how small, continuous changes in control parameters (i.e., independent variables that influence the state of a system) can have sudden, discontinuous effects on dependent variables. Such discontinuous, jumplike changes are called phase-transitions or catastrophes. Examples include the sudden collapse of a bridge under slowly mounting pressure, and the freezing of water when temperature is gradually decreased. Catastrophe theory was developed and popularized in the early 1970s [27, 35]. After a period of criticism [34], catastrophe theory is now well established and widely applied, for instance, in the fields of physics (e.g., [1, 26]), chemistry (e.g., [32]), biology (e.g., [28, 31]), and the social sciences (e.g., [14]).
In psychology, catastrophe theory has been applied to multistable perception [24], transitions between Piagetian stages of child development [30], the perception of apparent motion [21], sudden transitions in attitudes [29], and motor learning [19, 33], to name just a few. Before proceeding to describe the statistical method required to fit the most popular catastrophe model, the cusp model, we first outline the core principles of catastrophe theory (for details, see [9, 22]).

Catastrophe Theory

A key idea in catastrophe theory is that the system under study is driven toward an equilibrium state. This is best illustrated by imagining the movement of a ball on a curved one-dimensional surface, as in Figure 1. The ball represents the state of the system, whereas gravity represents the driving force.

[Figure 1: Smooth changes in a potential function may lead to a sudden jump. V(x; c) is the potential function, and c denotes the set of control variables; the panels show the minima shifting as c changes, with the ball making a sudden jump.]

Figure 1, middle panel, displays three possible equilibria. Two of these states are stable states (i.e., the valleys or minima): when perturbed, the behavior of the system will remain relatively unaffected. One state is unstable (i.e., a hill or maximum): only a small perturbation is needed to drive the system toward a different state.
Systems that are driven toward equilibrium values, such as the little ball in Figure 1, may be classified according to their configuration of critical points, that is, points at which the first or possibly second derivative equals zero. When the configuration of critical points changes, so does the qualitative behavior of the system. For instance, Figure 1 demonstrates how the local minimum (i.e., a critical point) that contains the little ball suddenly disappears as a result of a gradual change in the surface. As a result of this gradual change, the ball will suddenly move from its old position to a new minimum. These ideas may be quantified by postulating that the state of the system, x, will change over time t according to

dx/dt = −dV(x; c)/dx,   (1)

where V(x; c) is the potential function that incorporates the control variables c that affect the state of the system. V(x; c) yields a scalar for each state x and vector of control variables c. The concept of a potential function is very general; for instance, a potential function that is quadratic in x will yield the ubiquitous normal distribution. A system whose dynamics obey (1) is said to be a gradient dynamical system. When the right-hand side of (1) equals zero, the system is in equilibrium.
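Gradient dynamics of this kind can be simulated with a few lines of code. The following sketch assumes the cusp potential V(x; a, b) = x⁴/4 + b·x²/2 + a·x used in this entry, so that dx/dt = −(x³ + bx + a); the sweep values (b = −3, a from −3 to 3) are illustrative choices, not from the article. Relaxing the system while slowly sweeping a up and then down shows hysteresis: at the same a, the state ends on different sheets depending on the sweep direction.

```python
# Sketch: hysteresis in a gradient dynamical system with a cusp
# potential, assuming V(x; a, b) = x**4/4 + b*x**2/2 + a*x, so that
# dx/dt = -dV/dx = -(x**3 + b*x + a). Bistability needs b < 0 here.

def relax(x, a, b, dt=0.01, steps=20000):
    """Follow dx/dt = -(x^3 + b*x + a) to a nearby stable equilibrium."""
    for _ in range(steps):
        x -= dt * (x ** 3 + b * x + a)
    return x

b = -3.0                                        # two stable sheets
a_values = [i * 0.25 for i in range(-12, 13)]   # a from -3 to 3

up = {}
x = relax(2.0, a_values[0], b)
for a in a_values:              # sweep a upward, tracking the state
    x = relax(x, a, b)
    up[a] = x

down = {}
x = relax(-2.0, a_values[-1], b)
for a in reversed(a_values):    # sweep a downward
    x = relax(x, a, b)
    down[a] = x

# Hysteresis: at a = 0 the two sweeps sit on different equilibrium sheets
state_up, state_down = up[0.0], down[0.0]
print(round(state_up, 3), round(state_down, 3))
```

At a = 0 the equilibria are x = ±√3 (and an unstable point at 0); the upward sweep stays near +√3 while the downward sweep stays near −√3, so the system's state depends on its history, which is exactly the hysteresis flag discussed below.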
2 Catastrophe Theory
of influence on a real-life system; however, very few of these are likely to qualitatively affect transitional behavior. As will be apparent soon, two control variables already allow for the prediction of quite intricate transitional behavior. The potential function that goes with the cusp model is V(x; c) = (1/4)x^4 + (1/2)bx^2 + ax, where a and b are the control variables. Figure 2 summarizes the behavior of the cusp model by showing, for all values of the control variables, those values of the behavioral variable for which the system is at equilibrium (note that the Figure 2 variable names refer to the data example that will be discussed later). That is, Figure 2 shows the states for which the derivative of the potential function is zero (i.e., V'(x; c) = x^3 + bx + a = 0). Note that one entire panel from Figure 1 is associated with only one (i.e., a minimum), or three (i.e., two minima and one maximum) points on the cusp surface in Figure 2.

We now discuss some of the defining characteristics of the cusp model in terms of a model for attitudinal change [7, 29, 35]. More specifically, we will measure attitude as regards political preference, ranging from left-wing to right-wing. Two control variables that are important for attitudinal change are involvement and information. The most distinguishing behavior of the cusp model takes place in the foreground of Figure 2, for the highest levels of involvement. Assume that the lower sheet of the cusp surface corresponds to equilibrium states of being left-wing. As information (e.g., experience or environmental effects) more and more favors a right-wing view, not much change will be apparent at first, but at some level of information, a sudden jump to the upper, right-wing sheet occurs. When subsequent information becomes available that favors the left-wing view, the system eventually jumps back from the upper sheet onto the lower sheet, but note that this jump does not occur at the same position! The system needs additional impetus to change from one state to the other, and this phenomenon is called hysteresis.

Figure 2 also shows that a gradual change of political attitude is possible, but only for low levels of involvement (i.e., in the background of the cusp surface). Now assume one's political attitude starts out at the neutral point in the middle of the cusp surface, and involvement is increased. According to
the cusp model, an increase in involvement will lead to polarization, as one has to move either to the upper sheet or to the lower sheet (i.e., divergence), because for high levels of involvement, the intermediate position is inaccessible. Hysteresis, divergence, and inaccessibility are three of eight catastrophe flags [9], that is, qualitative properties of catastrophe models. Consequently, one method of investigation is to look for the catastrophe flags (i.e., catastrophe detection).

A major challenge in the search for an adequate cusp model is the definition of the control variables. In the cusp model, the variable that causes divergence is called the splitting variable (i.e., involvement), and the variable that causes hysteresis is called the normal variable (i.e., information). When the normal and splitting variables are correctly identified, and the underlying system dynamics are given by catastrophe theory, this often provides surprisingly elegant insights that cannot be obtained from simple linear models. In the following, we will ignore both the creative aspects of defining appropriate control variables and the qualitative testing of the cusp model using catastrophe flags [30]. Instead, we will focus on the problem of statistically fitting a catastrophe model to empirical data.

[Figure 2 omitted: cusp surface with axes X = attitude, a = information, b = involvement; marked are the neutral point (0,0,0), the sudden jump, the inaccessible region, and the bifurcation lines A and B.]

Figure 2 The cusp catastrophe model for attitude change. Of the two control variables, information is the normal variable, and involvement is the splitting variable. The behavioral variable is attitude. The lower panel is a projection of the bifurcation area onto the control parameter plane. The bifurcation set consists of those values for information and involvement combined that allow for more than one attitude. See text for details. Adapted from van der Maas et al. [29]
Fitting the Cusp Catastrophe Model to Data

Several cusp fitting procedures have been proposed, but none is completely satisfactory (for an overview see [12, 29]). The most important obstacle is that the cusp equilibrium surface is cubic in the dependent variable. This means that for control variables located in the bifurcation area (cf. Figure 2, bottom panel), two values of the dependent variable are plausible (i.e., left-wing/lower sheet and right-wing/upper sheet), whereas one value, corresponding to the unstable intermediate state, is definitely not plausible. Thus, it is important to distinguish between minima of the potential function (i.e., stable states) and maxima of the potential function (i.e., unstable states).

Two methods for fitting the cusp catastrophe model, namely GEMCAT I and II [17, 20] and Guastello's polynomial regression technique (see Polynomial Model) [10, 11], both suffer from the fact that they consider as the starting point for statistical fitting only those values for the derivative of the potential function that equal zero. The equation dx/dt = -dV(x; c)/dx = -(x^3 + bx + a) = 0 is, however, valid both for minima and maxima; hence, neither GEMCAT nor the polynomial regression technique is able to distinguish between stable equilibria (i.e., minima) and unstable equilibria (i.e., maxima). Obviously, the distinction between stable and unstable states is very important when fitting the cusp model, and neglecting this distinction renders the above methods suspect (for a more detailed critique of the GEMCAT and polynomial regression techniques see [2, 29]).

The most principled method for fitting catastrophe models, and the one under discussion here, is the maximum likelihood method developed by Cobb and coworkers [3, 4, 6]. First, Cobb proposed to make catastrophe theory stochastic by adding a Gaussian white noise driving term dW(t) with standard deviation D(x) to the potential function, leading to

dx = -(dV(x; c)/dx) dt + D(x) dW(t).   (2)

Equation (2) is a stochastic differential equation (SDE), in which the deterministic term on the right-hand side, -dV(x; c)/dx, is called the (instantaneous) drift function, while D^2(x) is called the (instantaneous) diffusion function, and W(t) is a Wiener process (i.e., idealized Brownian motion). The function D^2(x) is the infinitesimal variance function and determines the relative influence of the noise process (for details on SDEs see [8, 15]).

Under the assumption of additive noise (i.e., D(x) is a constant and does not depend on x), it can be shown that the modes (i.e., local maxima) of the empirical probability density function (pdf) correspond to stable equilibria, whereas the antimodes of the pdf (i.e., local minima) correspond to unstable equilibria (see e.g., [15], p. 273). More generally, there is a simple one-to-one correspondence between an additive noise SDE and its stationary pdf. Hence, instead of fitting the drift function of the cusp model directly, it can also be determined by fitting the pdf:

p(y | α, β) = N exp[-((1/4)y^4 + (1/2)βy^2 + αy)],   (3)

where N is a normalizing constant. In (3), the observed dependent variable x has been rescaled by y = (x - λ)/σ, and α and β are linear functions of the two control variables a and b as follows: α = k0 + k1 a + k2 b and β = l0 + l1 a + l2 b. The parameters λ, σ, k0, k1, k2, l0, l1, and l2 can be estimated using maximum likelihood procedures (see Maximum Likelihood Estimation) [5].

Although the maximum likelihood method of Cobb is the most elegant and statistically satisfactory method for fitting the cusp catastrophe model to date, it is not used often. One reason may be that Cobb's computer program for fitting the cusp model can sometimes behave erratically. This problem was addressed in Hartelman [12, 13], who outlined a more robust and more flexible version of Cobb's original program. The improved program, Cuspfit, uses a more reliable optimization routine, allows the user to constrain parameter values and to employ different sets of starting values, and is able to fit competing models such as the logistic model. Cuspfit is available at http://users.fmg.uva.nl/hvandermaas.

We now illustrate Cobb's maximum likelihood procedure with a practical example on sudden transitions in attitudes [29]. The data set used here is taken from Stouffer et al. [25], and has been discussed in relation to the cusp model in Latané and Nowak [18]. US soldiers were asked their opinion about three issues (i.e., postwar conscription, demobilization, and the Women's Army Corps). An individual attitude score was obtained by combining responses on different questions relating to the
same issue, resulting in an attitude score that could vary between 0 (unfavorable) to 6 (favorable). In addition, respondents indicated the intensity of their opinion on a six-point scale. Thus, this data set consists of one behavioral variable (i.e., attitude) and only one control variable (i.e., the splitting variable intensity).

Figure 3 displays the histograms of attitude scores for each level of intensity separately. The data show that as intensity increases, the attitudes become polarized (i.e., divergence), resulting in a bimodal histogram for the highest intensities. The dotted line shows the fit of the cusp model. The maximum likelihood method as implemented in Cuspfit allows for easy model comparison. For instance, one popular model selection method is the Bayesian information criterion (BIC; e.g., [23]), defined as BIC = -2 log L + k log n, where L is the maximum likelihood, k is the number of free parameters, and n is the number of observations. The BIC implements Occam's razor by quantifying the trade-off between goodness-of-fit and parsimony, models with lower BIC values being preferable.

The cusp model, whose fit is shown in Figure 3, has a BIC value of 1787. The Cuspfit program is also able to fit competing models to the data. An example of these is the logistic model, which allows for rapid changes in the dependent variable but cannot handle divergence. The BIC for the logistic model was 1970. To get a feeling for how big this difference really is, one may approximate P(logistic | data), the probability that the logistic model is true and the cusp model is not, given the data, by exp{-(1/2)BIC_logistic}/[exp{-(1/2)BIC_logistic} + exp{-(1/2)BIC_cusp}] (e.g., [23]). This approximation estimates P(logistic | data) to be about zero; consequently, the complement P(cusp | data) equals about one.

One problem of the Cobb method remaining to be solved is that the convenient relation between the pdf and the SDE (i.e., modes corresponding to stable states, antimodes corresponding to unstable states) breaks down when the noise is multiplicative, that is, when D(x) in (2) depends on x. Multiplicative noise is often believed to be present in economic and financial systems (e.g., time series of short-term interest rates, [16]). In general, multiplicative noise arises under nonlinear transformations of the dependent variable x. In contrast, deterministic catastrophe theory is invariant under any smooth and revertible transformation of the dependent variable. Thus, Cobb's stochastic catastrophe theory loses some of the generality of its deterministic counterpart (see [12], for an in-depth discussion of this point).

Summary and Recommendation

Catastrophe theory is a theory of great generality that can provide useful insights as to how behavior may radically change as a result of smoothly varying control variables. We discussed three statistical procedures for fitting one of the most popular catastrophe models, i.e., the cusp model. Two of these procedures, Guastello's polynomial regression technique and GEMCAT, are suspect because these methods are unable to distinguish between stable and unstable equilibria. The maximum likelihood method developed by Cobb does not have this problem. The one

[Figure 3 omitted: histograms of attitude scores (0 to 6) for each level of intensity, with the cusp model fit shown as a dotted line; y-axis labeled Attitude.]
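Cobb's approach can be sketched in a few lines of code. The example below is a minimal stand-in, not the Cuspfit program: it assumes a single homogeneous sample, so α and β are fitted directly as constants rather than as linear functions of control variables, with the density taken as p(y) proportional to exp(-(y^4/4 + βy^2/2 + αy)) as in (3); the data are simulated, and the BIC is computed from the maximized likelihood.

```python
import numpy as np
from scipy.integrate import quad
from scipy.optimize import minimize

def neg_log_lik(params, y):
    """Negative log-likelihood of the density N*exp(-(y^4/4 + b*y^2/2 + a*y))."""
    a, b = params
    # 1/N: integral of the unnormalized density, computed numerically
    Z, _ = quad(lambda u: np.exp(-(u**4 / 4 + b * u**2 / 2 + a * u)),
                -np.inf, np.inf)
    V = y**4 / 4 + b * y**2 / 2 + a * y
    return float(np.sum(V) + len(y) * np.log(Z))

def sample_cusp(a, b, n, rng):
    """Approximate draws from the cusp density via a fine grid."""
    u = np.linspace(-4.0, 4.0, 2001)
    w = np.exp(-(u**4 / 4 + b * u**2 / 2 + a * u))
    return rng.choice(u, size=n, p=w / w.sum())

rng = np.random.default_rng(7)
y = sample_cusp(0.0, -2.0, 400, rng)             # true beta < 0: bimodal data
res = minimize(neg_log_lik, x0=[0.0, 0.0], args=(y,), method="Nelder-Mead")
a_hat, b_hat = res.x                             # b_hat should be clearly negative
bic = 2 * res.fun + len(res.x) * np.log(len(y))  # BIC = -2 log L + k log n
```

Fitting a competing model to the same data and comparing BIC values, with exp(-BIC/2) weights giving approximate posterior model probabilities, reproduces the model comparison logic of the entry.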
References

[2] Alexander, R.A., Herbert, G.R., DeShon, R.P. & Hanges, P.J. (1992). An examination of least squares regression modeling of catastrophe theory, Psychological Bulletin 111, 366–374.
[3] Cobb, L. (1978). Stochastic catastrophe models and multimodal distributions, Behavioral Science 23, 360–374.
[4] Cobb, L. (1981). Parameter estimation for the cusp catastrophe model, Behavioral Science 26, 75–78.
[5] Cobb, L. & Watson, B. (1980). Statistical catastrophe theory: an overview, Mathematical Modelling 1, 311–317.
[6] Cobb, L. & Zacks, S. (1985). Applications of catastrophe theory for statistical modeling in the biosciences, Journal of the American Statistical Association 80, 793–802.
[7] Flay, B.R. (1978). Catastrophe theory in social psychology: some applications to attitudes and social behavior, Behavioral Science 23, 335–350.
[8] Gardiner, C.W. (1983). Handbook of Stochastic Methods, Springer-Verlag, Berlin.
[9] Gilmore, R. (1981). Catastrophe Theory for Scientists and Engineers, Dover, New York.
[10] Guastello, S.J. (1988). Catastrophe modeling of the accident process: organizational subunit size, Psychological Bulletin 103, 246–255.
[11] Guastello, S.J. (1992). Clash of the paradigms: a critique of an examination of the polynomial regression technique for evaluating catastrophe theory hypotheses, Psychological Bulletin 111, 375–379.
[12] Hartelman, P. (1997). Stochastic catastrophe theory, unpublished doctoral dissertation, University of Amsterdam.
[13] Hartelman, P.A.I., van der Maas, H.L.J. & Molenaar, P.C.M. (1998). Detecting and modeling developmental transitions, British Journal of Developmental Psychology 16, 97–122.
[14] Hołyst, J.A., Kacperski, K. & Schweitzer, F. (2000). Phase transitions in social impact models of opinion formation, Physica A 285, 199–210.
[15] Honerkamp, J. (1994). Stochastic Dynamical Systems, VCH Publishers, New York.
[16] Jiang, G.J. & Knight, J.L. (1997). A nonparametric approach to the estimation of diffusion processes, with an application to a short-term interest rate model, Econometric Theory 13, 615–645.
[17] Lange, R., Oliva, T.A. & McDade, S.R. (2000). An algorithm for estimating multivariate catastrophe models: GEMCAT II, Studies in Nonlinear Dynamics and Econometrics 4, 137–168.
[18] Latané, B. & Nowak, A. (1994). Attitudes as catastrophes: from dimensions to categories with increasing involvement, in Dynamical Systems in Social Psychology, R.R. Vallacher & A. Nowak, eds, Academic Press, San Diego, pp. 219–249.
[19] Newell, K.M., Liu, Y.-T. & Mayer-Kress, G. (2001). Time scales in motor learning and development, Psychological Review 108, 57–82.
[20] Oliva, T., DeSarbo, W., Day, D. & Jedidi, K. (1987). GEMCAT: a general multivariate methodology for estimating catastrophe models, Behavioral Science 32, 121–137.
[21] Ploeger, A., van der Maas, H.L.J. & Hartelman, P.A.I. (2002). Stochastic catastrophe analysis of switches in the perception of apparent motion, Psychonomic Bulletin & Review 9, 26–42.
[22] Poston, T. & Stewart, I. (1978). Catastrophe Theory and its Applications, Dover, New York.
[23] Raftery, A.E. (1995). Bayesian model selection in social research, Sociological Methodology 25, 111–163.
[24] Stewart, I.N. & Peregoy, P.L. (1983). Catastrophe theory modeling in psychology, Psychological Bulletin 94, 336–362.
[25] Stouffer, S.A., Guttman, L., Suchman, E.A., Lazarsfeld, P.F., Star, S.A. & Clausen, J.A. (1950). Measurement and Prediction, Princeton University Press, Princeton.
[26] Tamaki, T., Torii, T. & Maeda, K. (2003). Stability analysis of black holes via a catastrophe theory and black hole thermodynamics in generalized theories of gravity, Physical Review D 68, 024028.
[27] Thom, R. (1975). Structural Stability and Morphogenesis, Benjamin-Addison Wesley, New York.
[28] Torres, J.-L. (2001). Biological power laws and Darwin's principle, Journal of Theoretical Biology 209, 223–232.
[29] van der Maas, H.L.J., Kolstein, R. & van der Pligt, J. (2003). Sudden transitions in attitudes, Sociological Methods and Research 32, 125–152.
[30] van der Maas, H.L.J. & Molenaar, P.C.M. (1992). Stagewise cognitive development: an application of catastrophe theory, Psychological Review 99, 395–417.
[31] van Harten, D. (2000). Variable noding in Cyprideis torosa (Ostracoda, Crustacea): an overview, experimental results and a model from catastrophe theory, Hydrobiologia 419, 131–139.
[32] Wales, D.J. (2001). A microscopic basis for the global appearance of energy landscapes, Science 293, 2067–2070.
[33] Wimmers, R.H., Savelsbergh, G.J.P., van der Kamp, J. & Hartelman, P. (1998). A developmental transition in prehension modeled as a cusp catastrophe, Developmental Psychobiology 32, 23–35.
[34] Zahler, R.S. & Sussmann, H.J. (1977). Claims and accomplishments of applied catastrophe theory, Nature 269, 759–763.
[35] Zeeman, E.C. (1977). Catastrophe Theory: Selected Papers (1972–1977), Addison-Wesley, New York.

(See also Optimization Methods)

E.-J. WAGENMAKERS, H.L.J. VAN DER MAAS AND P.C.M. MOLENAAR
Categorizing Data
VALERIE DURKALSKI AND VANCE W. BERGER
Volume 1, pp. 239–242
a description of the P value adjustment formulae, while Altman [1] discusses creating groups in a regression setting.

Standard cutpoints, which are not influenced by the present data, are useful particularly when comparing results across studies because if the cutpoint changes over time, then it is difficult, if not impossible, to use past information for comparing studies or combining data. In fact, definitions of severity of cancer, or any other disease, that vary over time can lead to the illusion that progress has been made in controlling this disease when in fact it may not have; this is known as stage migration [7].

Implications of Chosen Cutpoints

It is illustrated in the literature that categorizing continuous outcome variables can lead to a loss of information, increased chance of misclassification error, and a loss in statistical power [1, 8, 12–17]. Categorizing a continuous predictor variable can lead to a reversal in the direction of its effect on an outcome variable [5]. Ragland [13] illustrates with a blood pressure example that the choice of cutpoint affects the estimated measures of association (i.e., proportion differences, prevalence ratios, and odds ratios) and that as the difference between the two distribution means increases, the variations in association measurements and power increase. In addition, Connor [6] and Suissa [18] show that tests based on frequencies of a dichotomous endpoint are, in general, less efficient than tests using mean-value statistics when the underlying distribution is normal (efficiency is defined as the ratio of the expected variances under each model; the model with the lower variance ratio is regarded as more efficient).

Even if the categorization is increased to three or four groups, the relative efficiency (see Asymptotic Relative Efficiency) is still less than 90%. This implies that if there really is a difference to detect, then a study using categorized endpoints would require a larger sample size to be able to detect it than would a study based on a continuous endpoint. Because of this, it has recently been suggested that binary variables be reverse dichotomized and put together to form the so-called information-preserving composite endpoints, which are more informative than any of the binary endpoints used in their creation (see Analysis of Covariance: Nonparametric) [2]. These more informative endpoints are then amenable to a wider range of more powerful analyses that make use of all the categories, as opposed to just Fisher's exact test on two collapsed categories. More powerful analyses include the Wilcoxon–Mann–Whitney test and the Smirnov test (see Kolmogorov–Smirnov Tests), as well as more recently developed tests such as the convex hull test [4] and adaptive tests [3].

So why would one consider categorizing a continuous variable if it increases the chance of missing real differences, increases the chance of misclassification, and increases the sample size required to detect differences? The best argument thus far seems to be that categorization simplifies the analyses and offers a better approach to understanding and interpreting meaningful results (proportions versus mean values). Yet one could argue that when planning a research study, one should select the primary response variable that will give the best precision of an estimate or the highest statistical power. In the same respect, prognostic variables (i.e., age, tumor size) should be categorized only according to appropriate methodology to avoid misclassification. Because it is common to group populations into risk groups for analysis and for the purposes of fulfilling eligibility criteria for stratified randomization schemes, categorization methods need to be available.

The best approach may be to collect data on a continuum and then categorize when necessary (i.e., for descriptive purposes). This approach offers the flexibility to conduct the prespecified analyses whether they are based on categorical or continuous data, but it also allows for secondary exploratory and sensitivity analyses to be conducted.

References

[1] Altman, D.G. (1998). Categorizing continuous variables, in Encyclopedia of Biostatistics, 1st Edition, P. Armitage and T. Colton, eds, Wiley, Chichester, pp. 563–567.
[2] Berger, V.W. (2002). Improving the information content of categorical clinical trial endpoints, Controlled Clinical Trials 23, 502–514.
[3] Berger, V.W. & Ivanova, A. (2002). Adaptive tests for ordered categorical data, Journal of Modern Applied Statistical Methods 1, 269–280.
[4] Berger, V.W., Permutt, T. & Ivanova, A. (1998). The convex hull test for ordered categorical data, Biometrics 54, 1541–1550.
[5] Brenner, H. (1998). A potential pitfall in control of covariates in epidemiologic studies, Epidemiology 9(1), 68–71.
[6] Connor, R.J. (1972). Grouping for testing trends in categorical data, Journal of the American Statistical Association 67, 601–604.
[7] Feinstein, A., Sosin, D. & Wells, C. (1985). The Will Rogers phenomenon: stage migration and new diagnostic techniques as a source of misleading statistics for survival in cancer, New England Journal of Medicine 312, 1604–1608.
[8] MacCallum, R.C., Zhang, S., Preacher, K.J. & Rucker, D.D. (2002). On the practice of dichotomization of quantitative variables, Psychological Methods 7, 19–40.
[9] Maxwell, S.E. & Delaney, H.D. (1993). Bivariate median splits and spurious statistical significance, Psychological Bulletin 113, 181–190.
[10] Mazumdar, M. & Glassman, J.R. (2000). Tutorial in biostatistics: categorizing a prognostic variable: review of methods, code for easy implementation and applications to decision-making about cancer treatments, Statistics in Medicine 19, 113–132.
[11] Miller, R. & Siegmund, D. (1982). Maximally selected chi-square statistics, Biometrics 38, 1011–1016.
[12] Moses, L.E., Emerson, J.D. & Hosseini, H. (1984). Analyzing data from ordered categories, New England Journal of Medicine 311, 442–448.
[13] Ragland, D.R. (1992). Dichotomizing continuous outcome variables: dependence of the magnitude of association and statistical power on the cutpoint, Epidemiology 3, 434–440.
[14] Rahlfs, V.W. & Zimmermann, H. (1993). Scores: ordinal data with few categories – how should they be analyzed? Drug Information Journal 27, 1227–1240.
[15] Sankey, S.S. & Weissfeld, L.A. (1998). A study of the effect of dichotomizing ordinal data upon modeling, Communications in Statistics – Simulation 27(4), 871–887.
[16] Senn, S. (2003). Disappointing dichotomies, Pharmaceutical Statistics 2, 239–240.
[17] Streiner, D.L. (2002). Breaking up is hard to do: the heartbreak of dichotomizing continuous data, Canadian Journal of Psychiatry 47(3), 262–266.
[18] Suissa, S. (1991). Binary methods for continuous outcomes: a parametric alternative, Journal of Clinical Epidemiology 44(3), 241–248.

(See also Optimal Design for Categorical Variables)

VALERIE DURKALSKI AND VANCE W. BERGER
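The efficiency loss from dichotomizing a continuous outcome, discussed in this entry via Connor [6] and Suissa [18], can be illustrated by simulation. The sketch below uses arbitrary illustrative settings (50 subjects per arm, a true mean difference of half a standard deviation) and compares a two-sample t test on the continuous data with a chi-squared test after a median split of the same data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

def one_trial(n=50, delta=0.5):
    """Return (p_continuous, p_dichotomized) for one simulated two-arm study."""
    x = rng.normal(0.0, 1.0, n)
    y = rng.normal(delta, 1.0, n)
    p_t = stats.ttest_ind(x, y).pvalue
    cut = np.median(np.concatenate([x, y]))          # median split of pooled data
    table = [[np.sum(x > cut), np.sum(x <= cut)],
             [np.sum(y > cut), np.sum(y <= cut)]]
    chi2, p_chi, dof, expected = stats.chi2_contingency(table)
    return p_t, p_chi

results = np.array([one_trial() for _ in range(2000)])
power_continuous = np.mean(results[:, 0] < 0.05)
power_dichotomized = np.mean(results[:, 1] < 0.05)
# dichotomizing the outcome costs power: power_dichotomized < power_continuous
```

With these settings the t test rejects in roughly 70% of trials, while the median-split test rejects noticeably less often, in line with the asymptotic relative efficiency of about 2/π for a median split of normal data.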
Cattell, Raymond Bernard
PAUL BARRETT
Volume 1, pp. 242–243
[2] Cattell, R.B., ed. (1966). Handbook of Multivariate Experimental Psychology, Rand McNally, Chicago.
[3] Cattell, R.B. (1971). Abilities: Their Structure, Growth, and Action, Houghton-Mifflin, Boston.
[4] Cattell, R.B. (1978). The Scientific Use of Factor Analysis in Behavioral and Life Sciences, Plenum Press, New York.

PAUL BARRETT
Censored Observations
VANCE W. BERGER
Volume 1, pp. 243–244
Census
carried out on a regular basis is the US census, which was first held in 1790 and henceforth repeated every ten years. The United Kingdom census also has a long tradition, dating back to 1801 and being held every 10 years from then on (except in 1941). By 1983, virtually every country in the world had taken a census of its population [4].

Some countries, like the United States and the United Kingdom, have a tradition of conducting a census every 5 or 10 years. A lot of countries, however, only hold a census when it is necessary and not at fixed dates.

Methodological Issues

The census data are usually gathered in one of two ways: either by self-enumeration or by direct enumeration. In the former case, the questionnaire is given to the individual from whom one wants to collect information and he/she fills out the questionnaire him/herself. In the case of direct enumeration, the questionnaire is filled out via a face-to-face interview with the individual.

Given the fact that the census data are collected via questionnaires, all methodological issues that arise with respect to the construction and the use of questionnaires also apply in the census context (see Survey Questionnaire Design). First of all, it is important that the questions are worded in such a way that they are easily understandable for everyone. For this reason, a pilot study is often administered to a limited sample of individuals to test the questionnaire and detect and solve difficulties with the questions as well as with the questionnaire in general (e.g., too long, inadequate layout, etc.). Secondly, individuals might not always provide correct and/or truthful answers to the questions asked. In the context of a census, this problem is, however, somewhat different than in the context of other questionnaire-based research that is not sponsored by the government. On the one hand, the government usually issues legislation that requires people to provide correct and truthful answers, which might lead some people to do so more than in any other context. On the other hand, because the information is used for government purposes, some people might be hesitant to convey certain information out of fear that it in some way will be used against them.

Thirdly, in the case of direct enumeration, interviewers are used to gather the census information. It is well known that interviewers often influence respondents' answers to questions. To minimize this type of interviewer effect, it is important to provide these interviewers with the necessary training with respect to good interview conduct. Furthermore, interviewers also need to be supervised in order to avoid fraud.

Another issue that needs to be considered is nonresponse (see Missing Data). Although governments issue legislation that enforces participation in the census, there are always individuals who cannot or will not cooperate. Usually, these nonresponders do not represent a random sample from the population, but are systematically different from the responders. Hence, there will be some bias in the statistics that are calculated on the basis of the census. However, given the fact that information about some background characteristics of these nonresponders is usually available from other administrative sources, it can be assessed to what extent they are different from the responders and (to some extent) a correction of the statistics is possible.

Over the course of the years, censuses have become more than population enumerations. The US Census, for example, has come to collect much more information, such as data on manufacturing, agriculture, housing, religious bodies, employment, internal migration, and so on [7]. The contemporary South Africa [2] census (see above) also illustrates the wide scope of modern censuses. Expanding the scope of a census could not have happened without evolutions in other domains. The use of computers and optical sensing devices for data input has greatly increased the speed with which returned census forms can be processed and analyzed (see Computer-Adaptive Testing). Also, censuses have come to use sampling techniques (see Survey Sampling Procedures). First of all, certain questions are sometimes only administered to a random sample of the population to avoid high respondent burden due to long questionnaires. This implies that not all questions are administered to all individuals. Questions could for instance be asked on a cyclical basis: one set of questions for individual 1, another set for individual 2, and so on, until individual 6, who again receives the set of questions from the first cycle [7]. As another example, the 2000 US census used a short and a long form, the latter being administered to a random sample of the population.
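The cyclical assignment of question sets described above amounts to systematic sampling and is simple to express in code; a toy sketch with hypothetical set numbering:

```python
def question_set(individual, n_sets=6):
    """Cyclical assignment: individuals 1..6 receive sets 1..6, individual 7
    starts the cycle again, so each set reaches a 1-in-6 systematic sample."""
    return (individual - 1) % n_sets + 1

assignments = [question_set(i) for i in range(1, 13)]
# -> [1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, 6]
```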
Secondly, some countries have recently abandoned the idea of gathering information from the entire population, because a lot of administrative information about a country's population has become available via other databases that have been set up over the years. Belgium, for example, from 2001 onward has replaced the former decennial population censuses by a so-called General Socio-economic Survey that collects data on a large sample (sample fraction: 20–25%) from the population.

References

[1] Benjamin, B. (1989). Census, in Encyclopedia of Statistical Sciences, S. Kotz & N.L. Johnson, eds, John Wiley & Sons, New York, pp. 397–402.
[2] Kinsella, K. & Ferreira, M. (1997). Aging trends: South Africa. Retrieved March 2, 2004 from http://www.census.gov/ipc/prod/ib-9702.pdf
[3] Taeuber, C. (1978). Census, in International Encyclopedia of Statistics, W.H. Kruskal & J.M. Tanur, eds, Free Press, New York, pp. 41–46.
[4] Taeuber, C. (1996). Census of population, in The Social Science Encyclopedia, A. Kuper & J. Kuper, eds, Routledge, London, pp. 77–79.
[5] United Nations, Statistical Office. (1958). Handbook of Population Census Methods, Vol. 1: General Aspects of a Population Census, United Nations, New York.
[6] United Nations. (1980). Principles and Recommendations for Population and Housing Censuses, United Nations, New York.
[7] US Bureau of the Census (South Dakota). Historical background of the United States census. Retrieved March 2, 2004 from http://www.census.gov/mso/www/centennial/bkgrnd.htm

JERRY WELKENHUYSEN-GYBELS AND DIRK HEERWEGH
Centering in Linear Multilevel Models
JAN DE LEEUW
Volume 1, pp. 247–249
is a useful type of invariance. But it is important to observe that if we restrict our untranslated model, for instance, by requiring one or more of the γ_r0 to be zero, then those same γ_r0 will no longer be zero in the corresponding translated model. We have invariance of the expected values under translation if the regression coefficients of the group-level predictors are nonzero.

In the same way, writing x̃_ijs = x_ijs − a_s for the translated predictors, we can see that

\[
\omega_{00} + \sum_{s=1}^{p}(x_{ijs}+x_{kjs})\,\omega_{0s} + \sum_{s=1}^{p}\sum_{t=1}^{p} x_{ijs}x_{kjt}\,\omega_{st}
= \tilde\omega_{00} + \sum_{s=1}^{p}(\tilde x_{ijs}+\tilde x_{kjs})\,\tilde\omega_{0s} + \sum_{s=1}^{p}\sum_{t=1}^{p}\tilde x_{ijs}\tilde x_{kjt}\,\tilde\omega_{st} \tag{7}
\]

if

\[
\tilde\omega_{00} = \omega_{00} + 2\sum_{s=1}^{p}\omega_{0s}a_{s} + \sum_{s=1}^{p}\sum_{t=1}^{p}\omega_{st}a_{s}a_{t},
\qquad
\tilde\omega_{0s} = \omega_{0s} + \sum_{t=1}^{p}\omega_{st}a_{t}. \tag{8}
\]

Thus, we have invariance under translation of the variance and covariance components as well, but, again, only if we do not require the ω_0s, that is, the covariances of the slopes and the intercepts, to be zero. If we center by using the grand mean of the predictors, we still fit the same model, at least in the case in which we do not restrict the γ_r0 or the ω_s0 to be zero.

If we translate by x̃_ijs = x_ijs − a_js, and thus subtract a different constant for each group, the situation becomes more complicated. If the a_js are the group means of the predictors, this is within-group centering. The relevant formulas are derived in [1], and we will not repeat them here. The conclusion is that separate translations for each group cannot be compensated for by adjusting the regression coefficients and the variance components. In this case, there is no invariance, and we are fitting a truly different model. In other words, choosing between a translated and a nontranslated model becomes a matter of either theoretical or statistical (goodness-of-fit) considerations.

From the theoretical point of view, consider the difference in meaning of a grand mean centered and a within-group mean centered version of a predictor such as grade point average. If two students have the same grade point average (GPA), they will also have the same grand mean centered GPA. But GPA in deviations from the school mean defines a different variable, in which students with high GPAs in good schools have the same corrected GPAs as students with low GPAs in bad schools. In the first case, the variable measures GPA; in the second case, it measures how good the student is in comparison to all students in his or her school. The two GPA variables are certainly not monotonic with each other, and if the within-school variation is small, they will be almost uncorrelated.

References

[1] Kreft, G.G., de Leeuw, J. & Aiken, L.S. (1995). The effects of different forms of centering in hierarchical linear models, Multivariate Behavioral Research 30, 1–21.
[2] Raudenbush, S.W. & Bryk, A.S. (2002). Hierarchical Linear Models. Applications and Data Analysis Methods, 2nd Edition, Sage Publications, Newbury Park.

JAN DE LEEUW
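The near-zero correlation between the two centered GPA variables discussed in the entry above is easy to check numerically. The following Python sketch is illustrative only (the school means, group sizes, and noise scale are invented for the example): it generates GPAs with large between-school and small within-school variation, then correlates the grand-mean-centered and school-mean-centered versions of the variable.

```python
import random
import statistics

def pearson_r(x, y):
    """Plain Pearson correlation coefficient."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

random.seed(1)
# 20 schools with widely spread mean GPAs, tiny within-school spread
school_means = [random.uniform(2.0, 3.8) for _ in range(20)]
gpa, school = [], []
for j, m in enumerate(school_means):
    for _ in range(30):
        gpa.append(m + random.gauss(0, 0.02))  # small within-school variation
        school.append(j)

grand = statistics.fmean(gpa)
grand_centered = [g - grand for g in gpa]
per_school = {j: statistics.fmean([g for g, s in zip(gpa, school) if s == j])
              for j in set(school)}
group_centered = [g - per_school[s] for g, s in zip(gpa, school)]

r = pearson_r(grand_centered, group_centered)
print(round(r, 3))  # small in absolute value: the two variables are nearly uncorrelated
```

With the within-school spread made small relative to the spread of school means, the two centered variables carry almost disjoint information, which is the point the entry makes.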
Central Limit Theory
JEREMY MILES
Volume 1, pp. 249–255
[Figure: probability distribution of the score on a single die (Value 1–6 versus Probability).]
[Figure: sampling distribution of the mean score of two dice (Value 1–6 in steps of 0.5 versus Probability).]
Figure 3 Distribution of mean score of 7 dice rolls (bars) and normal distribution with same mean and SD (line)
Imagine that we have a die, which has the original numbers removed, and the numbers 1, 1, 1, 2, 2, 3 added to it. The distribution of this measure in the sample is going to be markedly (positively) skewed, as is shown in Figure 4.

We might ask whether, in these circumstances, the central limit theorem still holds; we can see if the sampling distribution of the mean is normal. For one die, the distribution is markedly skewed, as we have seen. We can calculate the probability of different
[Figure 4: probability distribution of the relabelled die (values 1, 2, 3).]
[Figure 5: sampling distribution of the mean score of two relabelled dice (Mean score versus Probability).]
values occurring in larger samples. When N = 2, a mean score of 1 can be achieved by rolling a 1 on both dice. The probability of this event occurring is 0.5 × 0.5 = 0.25. We could continue to calculate the probability of each possible value occurring; these are shown in graphical form in Figure 5. Although we would not describe this distribution as normal, it is closer to a normal distribution than that shown in Figure 4.

Again, if we increase the sample size to 7 (still a very small sample), the distribution becomes a much better approximation to a normal distribution. This is shown in Figure 6.

When the Central Limit Theorem Goes Bad

As long as the sample is sufficiently large, it seems that we can rely on the central limit theorem to ensure that the sampling distribution of the mean is normal. The usually stated definition of sufficient is 30 (see, e.g., Howell [1]). However, this is dependent upon the shape of the distributions. Wilcox [4, 5] discusses a number of situations where, even with relatively large sample sizes, the central limit theorem fails to apply. In particular, the theorem is prone to letting us down when the distributions have heavy tails. This is likely to be the case when the data are derived from a mixed-normal, or contaminated, distribution.

A mixed-normal, or contaminated, distribution occurs when the population comprises two or more groups, each of which has a different distribution (see Finite Mixture Distributions). For example, it may be the case that a measure has a different variance for males and for females, or for alcoholics and nonalcoholics. Wilcox [4] gives an example of a contaminated distribution with two groups. For both subgroups, the mean was equal to zero; for the larger group, which comprised 90% of the population,
Figure 6 Sampling distribution of the mean, when N = 7, and dice are labelled 1, 1, 1, 2, 2, 3. Bars show sampling distribution, line shows normal distribution with same mean and standard deviation
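The exact sampling distributions plotted in Figures 5 and 6 can be enumerated directly. The short sketch below is illustrative rather than part of the original entry: it lists every equally likely sequence of rolls of the relabelled die and tabulates the distribution of the mean, reproducing the 0.25 probability of a mean score of 1 when N = 2.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

DIE = [1, 1, 1, 2, 2, 3]  # the relabelled die from the text

def mean_distribution(n):
    """Exact sampling distribution of the mean of n rolls."""
    dist = Counter()
    p = Fraction(1, len(DIE) ** n)  # each ordered sequence is equally likely
    for rolls in product(DIE, repeat=n):
        dist[Fraction(sum(rolls), n)] += p
    return dict(dist)

d2 = mean_distribution(2)
print(d2[Fraction(1)])  # probability of a mean score of 1 when N = 2 → 1/4
```

Calling `mean_distribution(7)` gives the exact version of the distribution drawn in Figure 6.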
Figure 7 A normal and a mixed-normal (contaminated) distribution. The normal distribution (solid line) has mean = 0, SD = 1; the mixed normal has mean = 0, SD = 1 for 90% of the population, and mean = 0, SD = 10, for 10% of the population
the standard deviation was equal to 1, while for the smaller subgroup, which comprised 10% of the population, the standard deviation was equal to 10. The population distribution is shown in Figure 7, with a normal distribution for comparison. The shape of the distributions is similar, and probably would not give us cause for concern; however, we should note that the tails of the mixed distribution are heavier.

I generated 100 000 samples, of size 40, from this population. I calculated the mean for each sample, in order to estimate the sampling distribution of the mean. The distribution of the sample means is shown in the left-hand side of Figure 8. Examining that graph by eye, one would probably say that it fairly closely approximates a normal distribution. On the right-hand side of Figure 8, the same distribution is redrawn, with a normal distribution curve with the same mean and standard deviation. We can see that these distributions are different shapes, but is this enough to cause us problems?

The standard deviation of the sampling distribution of these means is equal to 0.48. The mean standard error that was calculated in each sample was 0.44. These values seem close to each other.
[Figure 8: the sampling distribution of the mean drawn alone (left) and overlaid with a normal curve having the same mean and standard deviation (right).]
I also examined how often the 95% confidence limits in each sample excluded 0. According to the central limit theorem, asymptotically we would expect that 95% of the samples would have confidence limits that included zero. In this analysis, the figure was 97%: the overestimation of the standard error has caused our Type I error rate to drop to 3%. This drop in Type I error rate may not seem such a bad thing (we would all like fewer Type I errors). However, it must be remembered that along with a drop in Type I errors, we must have an increase, of unknown proportion, in Type II errors, and hence a decrease in power (see Power). (For further discussion of these issues, and possible solutions, see the entry on robust statistics.)

Appendix

The following is an R program for carrying out a small Monte Carlo simulation to examine the effect of contaminated distributions on the sampling distribution of the mean. (R is available for most computer platforms, and can be freely downloaded from www.r-project.org.)

Note that any text following a # symbol is a comment and will be ignored by the program; the <- symbol is the assignment operator: to make the variable x equal to 3, I would use x <- 3.

The program will produce the graph shown in Figure 8, as well as the proportion of the 95% CIs which include the population value (of zero).

Four values are changeable. The first two lines give the SDs for the two groups; in the example that was used in the text, the values were 1 and 10. The third line is used to give the proportion of people in the population in the group with the higher standard deviation; in the example we used 0.1 (10%). Finally, in lines 4 and 5, the sample size and the number of samples to be drawn from the population are given. Again, these are the same as in the example.

To run the program, paste the text into R. As well as the graph, the output will show a small table:

FALSE  TRUE
 3140 96860

The figures in the small table refer to the number of samples in which the 95% CIs included the population value of zero. In this run, 96 860 samples (from 100 000, 96.9%) included zero, and 3140 (from 100 000, 3.1%) did not. In a normal distribution, and an extremely large sample, these values would be 95% and 5%. It may be a worthwhile exercise to set the SDs to be equal, and check to see if this is the case.

References

[1] Howell, D.C. (2002). Statistical Methods for Psychology, 5th Edition, Duxbury Press, Belmont.
[2] Mood, A., Graybill, F.A. & Boes, D.C. (1974). Introduction to the Theory of Statistics, 3rd Edition, McGraw-Hill, New York.
[3] Roberts, M.J. & Russo, R. (1999). A Student's Guide to Analysis of Variance, Routledge, London.
[4] Wilcox, R.R. (1996). Statistics for the Social Sciences, Academic Press, London.
[5] Wilcox, R.R. (1997). Introduction to Robust Estimation and Hypothesis Testing, Academic Press, London.

JEREMY MILES
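The R listing itself did not survive in this extraction. As a rough stand-in (in Python rather than R, with my own variable names), the sketch below runs the Monte Carlo the Appendix describes: it draws samples from the contaminated distribution of the text (SD 1 for 90% of the population, SD 10 for the rest), builds a 95% CI for each sample mean, and reports how often the CI contains the true mean of zero. The number of samples is reduced from 100 000 to 10 000 to keep the run quick.

```python
import math
import random

random.seed(2024)

SD_MAIN, SD_CONTAM = 1.0, 10.0   # SDs of the two subgroups
P_CONTAM = 0.1                   # proportion in the high-SD subgroup
N = 40                           # size of each sample
N_SAMPLES = 10_000               # number of samples (the text used 100 000)
T_CRIT = 2.023                   # approximate t critical value, df = 39

def draw():
    """One observation from the contaminated (mixed-normal) population."""
    sd = SD_CONTAM if random.random() < P_CONTAM else SD_MAIN
    return random.gauss(0.0, sd)

covered = 0
for _ in range(N_SAMPLES):
    sample = [draw() for _ in range(N)]
    mean = sum(sample) / N
    var = sum((x - mean) ** 2 for x in sample) / (N - 1)
    se = math.sqrt(var / N)
    if mean - T_CRIT * se <= 0.0 <= mean + T_CRIT * se:
        covered += 1

coverage = covered / N_SAMPLES
print(f"{coverage:.3f}")  # noticeably above the nominal 0.95, as in the text
```

Setting the two SDs equal should bring the coverage back close to the nominal 95%, as the Appendix suggests trying.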
Children of Twins Design
BRIAN M. D'ONOFRIO
Volume 1, pp. 256–258
the experience of having a schizophrenic parent has a direct environmental impact on one's own risk for schizophrenia.

If the rates of the disorder in the offspring of the affected and unaffected identical cotwins are equal to each other, the direct causal role of the parental psychopathology would be undermined. However, such results do not elucidate whether shared genetic or environmental processes are responsible for the intergenerational transmission. A comparison of the rates of psychopathology in the children of the unaffected identical and fraternal cotwins highlights the nature of the selection factors. Children of the unaffected identical twins only vary with respect to the environmental risk associated with schizophrenia, whereas offspring of the unaffected fraternal twins differ with respect to both the environmental and the (lower) genetic risk. Therefore, higher rates of schizophrenia in the children of the unaffected identical cotwins than in children of the unaffected fraternal cotwins suggest that genetic factors account for some of the intergenerational covariation. If the rates are similar for the children in unaffected identical and fraternal families, shared environmental factors would be of most import, because differences in the level of genetic risk would not influence the rate of schizophrenia.

The most well-known application of the design explored the intergenerational association of schizophrenia using discordant twins [6]. Offspring of schizophrenic identical cotwins had a morbid risk of being diagnosed with schizophrenia of 16.8, whereas offspring of the unaffected identical cotwins had a morbid risk of 17.4. Although the offspring in this latter group did not have a parent with schizophrenia, they had the same risk as offspring with a schizophrenic parent. The results effectively discount the direct causal environmental theory of schizophrenia transmission. The risk in the offspring of the unaffected identical twins was 17.4, but the risk was much lower (2.1) in the offspring of the unaffected fraternal cotwins. This latter comparison suggests that genetic factors account for the association between parental and offspring schizophrenia. Similar findings were reported for the transmission of bipolar depression [1]. In contrast, the use of the CoT to explore transmission of alcohol abuse and dependence from parents to their offspring highlighted the role of the family environment [8].

One of the main strengths of the design is its ability to study different phenotypes in the parent and child generations. For example, a study of divorce using the CoT Design reported results consistent with a direct environmental causal connection between parental marital instability and young-adult behavior and substance abuse problems [4]. Similar conclusions were found with CoT Design studies of the association between harsh parenting and child behavior problems [9], and between smoking during pregnancy and child birth weight [3, 11]. However, a CoT analysis found that selection factors accounted for the lower age of menarche in girls growing up in households with a stepfather, results that suggest the statistical association is not a causal relation [12]. These findings suggest that the underlying processes in intergenerational associations are dependent on the family risk factors and outcomes in the offspring.

In summary, selection factors hinder all family studies that explore the association between risk factors and child outcomes. Without the ability to experimentally assign children to different conditions, researchers are unable to determine whether differences among groups (e.g., children from intact versus divorced families) are due to the measured risk factor or to unmeasured differences between families. Because selection factors may be environmental or genetic in origin, researchers need to use quasi-experimental designs that pull apart the co-occurring genetic and environmental risk processes [17]. The CoT Design is a behavior genetic approach that can explore intergenerational associations with limited methodological assumptions compared to other designs [3]. However, caution must be used when interpreting the results of studies using the CoT Design. Similar to all nonexperimental studies, the design cannot definitively prove causation. The results can only be consistent with a causal hypothesis, because environmental processes that are correlated with the risk factor and only influence one twin and their offspring may actually be responsible for the associations.

The CoT Design can also be expanded in a number of ways. The design can include continuously distributed risk factors [3, 7] and measures of family-level environments. Associations between parental characteristics and child outcomes may also be due to reverse causation, but, given certain assumptions, the CoT Design can delineate between parent-to-child and child-to-parent processes [19]. Because the design is also a quasi-adoption study, the differences in genetic and environmental risk in the approach
provides the opportunity to study gene-environment interaction [8]. When the spouses of the adult twins are included in the design, the role of assortative mating and the influence of both spouses can be considered, an important consideration for accurately describing the processes involved in the intergenerational associations [7]. Finally, the CoT Design can be combined with other behavior genetic designs to test more complex models of parent-child relations [2, 10, 20]. Overall, the CoT Design is an important genetically informed methodology that will continue to highlight the mechanisms through which environmental and genetic factors act and interact.

References

[1] Bertelsen, A. & Gottesman, I.I. (1986). Offspring of twin pairs discordant for psychiatric illness, Acta Geneticae Medicae et Gemellologiae 35, 1–10.
[2] D'Onofrio, B.M., Eaves, L.J., Murrelle, L., Maes, H.H. & Spilka, B. (1999). Understanding biological and social influences on religious affiliation, attitudes and behaviors: a behavior-genetic perspective, Journal of Personality 67, 953–984.
[3] D'Onofrio, B.M., Turkheimer, E., Eaves, L.J., Corey, L.A., Berg, K., Solaas, M.H. & Emery, R.E. (2003). The role of the children of twins design in elucidating causal relations between parent characteristics and child outcomes, Journal of Child Psychology and Psychiatry 44, 1130–1144.
[4] D'Onofrio, B.M., Turkheimer, E., Emery, R.E., Slutske, W., Heath, A., Madden, P. & Martin, N. (submitted). A genetically informed study of marital instability and offspring psychopathology, Journal of Abnormal Psychology.
[5] Eaves, L.J., Last, L.A., Young, P.A. & Martin, N.B. (1978). Model-fitting approaches to the analysis of human behavior, Heredity 41, 249–320.
[6] Gottesman, I.I. & Bertelsen, A. (1989). Confirming unexpressed genotypes for schizophrenia, Archives of General Psychiatry 46, 867–872.
[7] Heath, A.C., Kendler, K.S., Eaves, L.J. & Markell, D. (1985). The resolution of cultural and biological inheritance: informativeness of different relationships, Behavior Genetics 15, 439–465.
[8] Jacob, T., Waterman, B., Heath, A., True, W., Bucholz, K.K., Haber, R., Scherrer, J. & Quiang, F. (2003). Genetic and environmental effects on offspring alcoholism: new insights using an offspring-of-twins design, Archives of General Psychiatry 60, 1265–1272.
[9] Lynch, S.K., Turkheimer, E., Emery, R.E., D'Onofrio, B.M., Mendle, J., Slutske, W. & Martin, N.G. (submitted). A genetically informed study of the association between harsh punishment and offspring behavioral problems.
[10] Maes, H.M., Neale, M.C. & Eaves, L.J. (1997). Genetic and environmental factors in relative body weight and human adiposity, Behavior Genetics 27, 325–351.
[11] Magnus, P., Berg, K., Bjerkedal, T. & Nance, W.E. (1985). The heritability of smoking behaviour in pregnancy, and the birth weights of offspring of smoking-discordant twins, Scandinavian Journal of Social Medicine 13, 29–34.
[12] Mendle, J., Turkheimer, E., D'Onofrio, B.M., Lynch, S.K. & Emery, R.E. (submitted). Stepfather presence and age at menarche: a children of twins approach.
[13] Nance, W.E. & Corey, L.A. (1976). Genetic models for the analysis of data from the families of identical twins, Genetics 83, 811–826.
[14] Plomin, R., DeFries, J.C. & Loehlin, J.C. (1977). Genotype-environment interaction and correlation in the analysis of human behavior, Psychological Bulletin 84, 309–322.
[15] Rutter, M. (2000). Psychosocial influences: critiques, findings, and research needs, Development and Psychopathology 12, 375–405.
[16] Rutter, M., Dunn, J., Plomin, R., Simonoff, E., Pickles, A., Maughan, B., Ormel, J., Meyer, J. & Eaves, L. (1997). Integrating nature and nurture: implications of person-environment correlations and interactions for developmental psychopathology, Development and Psychopathology 9, 335–364.
[17] Rutter, M., Pickles, A., Murray, R. & Eaves, L.J. (2001). Testing hypotheses on specific environmental causal effects on behavior, Psychological Bulletin 127, 291–324.
[18] Scarr, S. & McCartney, K. (1983). How people make their own environments: a theory of genotype-environment effects, Child Development 54, 424–435.
[19] Silberg, J.L. & Eaves, L.J. (2004). Analyzing the contribution of genes and parent-child interaction to childhood behavioural and emotional problems: a model for the children of twins, Psychological Medicine 34, 347–356.
[20] Truett, K.R., Eaves, L.J., Walters, E.E., Heath, A.C., Hewitt, J.K., Meyer, J.M., Silberg, J., Neale, M.C., Martin, N.G. & Kendler, K.S. (1994). A model system for analysis of family resemblance in extended kinships of twins, Behavior Genetics 24, 35–49.

BRIAN M. D'ONOFRIO
Chi-Square Decomposition
DAVID RINDSKOPF
Volume 1, pp. 258–262
Table 3 Race by party identification (independent versus major party)

                 Party identification
Race     Democrat + Republican    Independent
White    746                      105
Black    114                      15

for models that fit well); with 1 df, it is obvious that there is no evidence for a relationship, so we can say that the first hypothesis is confirmed.

We now use the data only on those who are members of a major party to answer the second question. The frequencies are reproduced in Table 4. Here we find that G² = 90.278, and X² = 78.908, each with 1 df. Independence is rejected, and we conclude that among those registered to a major party, there is a large difference between Blacks and Whites in party registration.

Table 4 Race by party identification (Major party members only)

         Party identification
Race     Democrat    Republican
White    341         405
Black    103         11

Why is this called a partition of chi-square? If we add the likelihood ratio statistics for each of the one-df hypotheses, we get 0.053 + 90.278 = 90.331; this is equal to the likelihood ratio statistic for the full 3 × 2 table. (The same does not happen for the Pearson statistics; they are still valid tests of these hypotheses, but the results aren't as pretty because they do not produce an exact partition.) We can split the overall fit for the model of independence into parts, testing hypotheses within segments of the overall table.

To get a partition, we must be careful about how we select these hypotheses. Detailed explanations can be found in [3], but a simple general rule is that they correspond to selecting orthogonal contrasts in an ANOVA. For example, the contrast coefficients for testing the first hypothesis (independents versus major party) would be (1, −2, 1), and the coefficients for comparing Democrats to Republicans would be (1, 0, −1). These two sets of coefficients are orthogonal.

Example 2: Relationships Among Three Variables. Although partitioning is usually applied to two-way tables, it can also be applied to tables of higher dimension. To illustrate a more complex partitioning, I will use some of the data on depression in adolescents from Table 3 of the entry on contingency tables. In that example, there were two groups; I will use only the data on children classified as SED (seriously emotionally disturbed). The remaining variables (age, sex, and depression) give a 3 × 2 × 2 table; my conceptualization will treat age and sex as predictors, and depression as an outcome variable. The data are reproduced in Table 5.

Table 5 Depression in adolescents, by age and sex (Seriously emotionally disturbed only)

                      Depression
Age      Sex       Low    High    P(High)
12–14    Male      14     5       .26
         Female    5      8       .62
15–16    Male      32     3       .09
         Female    15     7       .32
17–18    Male      36     5       .12
         Female    12     2       .14

If we consider the two predictors, there are six Age × Sex groups that form the rows of the table. Therefore, we can think of this as a 6 × 2 table instead of as a 3 × 2 × 2 table. If we test for independence between the (six) rows and the (two) columns, the likelihood ratio statistic is 18.272 with 5 df. So there is some evidence of a relationship between group and depression. But what is the nature of this relationship? We will start by breaking the overall table into three parts, testing the following three hypotheses:

(1) the depression rate for males is constant; that is, it does not change with age;
(2) the depression rate for females is constant;
(3) the rate of depression in males is the same as that for females.

To test hypothesis (1), we consider Table 6, which contains only the males. For this table, a test of independence results in a value G² = 3.065, with 2 df. This is consistent with the hypothesis that the proportion depressed among males does not change with age.
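The additivity of the partition claimed above (0.053 + 90.278 = 90.331) is easy to verify numerically. The sketch below is illustrative, not from the entry: it computes the likelihood-ratio statistic G² = 2 Σ O ln(O/E) for the full 3 × 2 race-by-party table and for the two 1-df component tables, and checks that the components sum exactly to the full-table statistic.

```python
import math

def g2(table):
    """Likelihood-ratio chi-square, G2 = 2 * sum(O * ln(O / E)),
    for a two-way table given as a list of rows."""
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    n = sum(rows)
    return 2 * sum(o * math.log(o / (rows[i] * cols[j] / n))
                   for i, r in enumerate(table)
                   for j, o in enumerate(r) if o > 0)

# Full 3 x 2 table: party (Democrat, Independent, Republican) by race (White, Black)
full = [[341, 103], [105, 15], [405, 11]]
# Component 1: independent versus major party (Table 3)
major_vs_ind = [[341 + 405, 103 + 11], [105, 15]]
# Component 2: Democrat versus Republican, major-party members only (Table 4)
dem_vs_rep = [[341, 103], [405, 11]]

g_full = g2(full)
g_1 = g2(major_vs_ind)
g_2 = g2(dem_vs_rep)
print(round(g_1, 3), round(g_2, 3), round(g_full, 3))
# g_1 + g_2 equals g_full; compare the 0.053 and 90.278 reported in the text
```

The same check applied to the Pearson X² statistic would show only approximate, not exact, additivity, which is the point made in parentheses above.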
Table 6 Depression in adolescents (Males only)

               Depression
Age      Low    High    P(High)
12–14    14     5       .26
15–16    32     3       .09
17–18    36     5       .12

Table 8 Depression in adolescents (Collapsed over age)

               Depression
Sex      Low    High    P(High)
Male     82     13      .137
Female   32     17      .347

many substantive research hypotheses that cannot be tested using traditional methods. For this reason, it deserves to be in the arsenal of all researchers analyzing categorical data.

(See also Log-linear Models)

References

[1] Fisher, R.A. (1930). Statistical Methods for Research Workers, 3rd Edition, Oliver and Boyd, Edinburgh.
[2] Rindskopf, D. (1990). Nonstandard loglinear models, Psychological Bulletin 108, 150–162.
[3] Rindskopf, D. (1996). Partitioning chi-square, in Categorical Variables in Developmental Research, C.C. Clogg & A. von Eye, eds, Academic Press, San Diego.

DAVID RINDSKOPF
Cholesky Decomposition
STACEY S. CHERNY
Volume 1, pp. 262–263
and environmental covariances. As shown, G can be decomposed into the genetic correlation matrix, R_G, pre- and postmultiplied by the square roots of the heritabilities in a diagonal matrix, h. This can similarly be done for the shared and nonshared environmental components.

STACEY S. CHERNY
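The decomposition described in this fragment (G = h R_G h, and similarly for the shared and nonshared environmental components) can be illustrated numerically. In the sketch below, the 2 × 2 genetic covariance matrix G is invented purely for the example; the correlation matrix and the diagonal matrix of square-rooted variances are recovered from it, and G is then reassembled.

```python
import math

# A hypothetical 2 x 2 genetic covariance matrix G for two traits
G = [[0.50, 0.21],
     [0.21, 0.30]]

# h: diagonal matrix holding the square roots of the diagonal of G
h = [[math.sqrt(G[0][0]), 0.0],
     [0.0, math.sqrt(G[1][1])]]

# R_G: genetic correlation matrix, r_ij = g_ij / (h_ii * h_jj)
R_G = [[G[i][j] / (h[i][i] * h[j][j]) for j in range(2)] for i in range(2)]

def matmul(a, b):
    """2 x 2 matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

# Reassemble: G = h * R_G * h
G_back = matmul(matmul(h, R_G), h)
print(round(R_G[0][1], 3))  # the genetic correlation between the two traits
```

The off-diagonal element of R_G is the genetic correlation; the diagonal of h would hold the square-rooted heritabilities if G were expressed as a standardized (proportion-of-variance) matrix.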
Classical Statistical Inference Extended: Split-Tailed
Tests
RICHARD J. HARRIS
Volume 1, pp. 263–268
As pointed out in the entry on classical statistical inference: practice versus presentation, classical statistical inference (CSI) as practiced by most researchers is a very useful tool in determining whether sufficient evidence has been marshaled to establish the direction of a population effect. However, classic two-tailed tests, which divide alpha evenly between the rejection regions corresponding to positively and negatively signed values of the population effect, do not take into account the possibility that logic, expert opinion, existing theories, and/or previous empirical studies of this same population effect might strongly (or weakly, for that matter) favor the hypothesis that θ (the population parameter under investigation) is greater than θ_0 (the highly salient dividing line, often zero, between positive and negative population effects) over the alternative hypothesis that θ < θ_0, or vice versa. The split-tailed test, introduced (but not so labeled) by Kaiser [4] and introduced formally by Braver [1], provides a way to incorporate prior evidence into one's significance test so as to increase the likelihood of finding statistical significance in the correct direction (provided that the researcher's assessment of the evidence has indeed pointed her in the correct direction).

The remainder of this article describes the decision rules (DRs) employed in conducting split-tailed significance tests and constructing corresponding confidence intervals. It also points out that the classic one-tailed test (100% of α devoted to the predicted tail of the sampling distribution) is simply an infinitely biased split-tailed test.

Decision Rule(s) for Split-tailed Tests

We test H0: ρ_XY = 0 against H1: ρ_XY > 0 and H2: ρ_XY < 0, computing

p_> = Pr(r*_XY ≥ r_XY if H0 were true)

and

p_< = Pr(r*_XY ≤ r_XY if H0 were true) = 1 − p_>,

where r_XY is the observed value of the correlation coefficient calculated for your sample;
r*_XY is a random variable representing the values of the sample correlation coefficient obtained from an infinitely large number of independent random samples from a population in which the true population correlation coefficient is precisely zero;
p_> is, as the formula indicates, the percentage of the correlation coefficients computed from independent random samples drawn from a population in which ρ_XY = 0 that are as large as or larger than the one you got for your single sample from that population;
p_< is defined analogously as Pr(r*_XY < r_XY if H0 were true);
α_> is the criterion you have set (before examining your data) as to how low p_> has to be to convince you that the population correlation is, indeed, positive (high values of Y tending to go along with high values of X and low, with low); and
α_< is the criterion you have set as to how low p_< has to be to convince you that the population correlation is, indeed, negative.

These two comparisons (p_< to α_< and p_> to α_>) are then used to decide between H1 and H2, as follows:

Decision rule: If p_> < α_>, conclude that (or, at least for the duration of this study, act as if) H1 is true (i.e., accept the hypothesis that ρ_XY > 0).
If p_< < α_<, conclude that (or act as if) H2 is true (i.e., accept H2 that ρ_XY < 0).
If neither of the above is true (i.e., if p_> > α_> and p_< > α_<, which is equivalent to the condition that p_> be greater than α_> but less than 1 − α_<), conclude that we do not have enough evidence to decide between H1 and H2 (i.e., fail to reject H0).
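The decision rule can be expressed compactly in code. The sketch below is illustrative rather than part of the entry (the function names, the conversion of r to a t statistic, and the numerical integration of the t density are mine): it computes p_> and p_< for a sample correlation and applies the split-tailed rule, here with an illustrative split of α_> = 0.01 and α_< = 0.04.

```python
import math

def t_cdf(t, df, steps=20000):
    """P(T <= t) for Student's t with df degrees of freedom,
    via Simpson integration of the density from 0 to |t|."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    f = lambda x: c * (1.0 + x * x / df) ** (-(df + 1) / 2)
    b = abs(t)
    h = b / steps
    s = f(0.0) + f(b)
    for i in range(1, steps):
        s += f(i * h) * (4 if i % 2 else 2)
    area = s * h / 3.0                      # integral of the density from 0 to |t|
    return 0.5 - area if t < 0 else 0.5 + area

def split_tailed_decision(r, n, alpha_gt, alpha_lt):
    """Split-tailed test of H0: rho = 0 for a sample correlation r."""
    df = n - 2
    t = r * math.sqrt(df / (1.0 - r * r))   # t statistic for a correlation
    p_lt = t_cdf(t, df)                     # Pr(r* <= r | H0)
    p_gt = 1.0 - p_lt                       # Pr(r* >= r | H0)
    if p_gt < alpha_gt:
        return "H1", p_gt, p_lt             # conclude rho > 0
    if p_lt < alpha_lt:
        return "H2", p_gt, p_lt             # conclude rho < 0
    return "H0", p_gt, p_lt                 # not enough evidence either way

decision, p_gt, p_lt = split_tailed_decision(-0.31, 43, alpha_gt=0.01, alpha_lt=0.04)
print(decision, round(p_lt, 5))  # p_lt lands near the 0.02153 reported in the text
```

Setting α_> = α_< recovers the classic two-tailed test, and sending one of the two to zero recovers the one-tailed test, the infinitely biased special case noted above.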
As is true for classic (equal-tailed) significance tests, the same basic logic holds for split-tailed tests of any population effect: the difference between two population means or between a single population mean and its hypothesized value, the difference between a population correlation coefficient and zero, or between two (independent or dependent) correlation coefficients, and so on.

Alternative, but Equivalent Decision Rules

The split-tailed versions of DRs 2 through 3 presented in classical statistical inference: practice versus presentation should be clear. In particular, the rejection-region-based DR 3 can be illustrated as follows for the case of a large-sample t Test of the difference between two sample means (i.e., an attempt to establish the direction of the difference between the two corresponding population means), where the researcher requires considerably stronger evidence to be convinced that μ1 < μ2 than to be convinced that μ1 > μ2 (Figure 1).

Let us modify the example used in the companion entry [3] in which we tested whether interest in Web surfing (Y) increases or decreases with age (X); that is, whether the correlation between Web surfing and age (r_XY) is positive (>0) or negative (<0). Before collecting any data, we decide that, on the basis of prior data and logic, it will be easier to convince us that the overall trend is negative (i.e., that the overall tendency is for Web-surfing interest to decline with age) than the reverse, so we set α_< to 0.04 and α_> to 0.01. We draw a random sample of 43 US residents five years of age or older, determine each sampled individual's age and interest in Web surfing (measured on a 10-point quasi-interval scale on which high scores represent high interest in Web surfing), and compute the sample correlation between our two measures, which turns out to be −0.31.

Using Victor Bissonnette's statistical applet for computation of the P values associated with sample correlation coefficients (http://fsweb.berry.edu/academic/education/vbissonnette/applets/sigr.html), we find that p_< (the probability that an r*_XY computed on a random sample of size 43 and, thus, df = 43 − 2 = 41, drawn from a population in which ρ_XY = 0, would be less than or equal to −0.31) equals 0.02153. Since this is smaller than 0.04, we reject H0 in favor of H2: ρ_XY < 0.

Had our sample yielded an r_XY of +0.31, we would have found that p_> = 0.022, which is greater than 0.01 (the value to which we set α_>), so we would not reject H0 (though we would certainly not accept it), and would conclude that we had insufficient evidence to determine whether ρ_XY is positive or negative. Had we simply conducted a symmetric (aka two-tailed) test of H0, that is, had we set α_> = α_< = 0.025, we would have found in this case that p_> < α_> and we would, therefore, have concluded that ρ_XY > 0. This is, of course, the price one pays in employing unequal values of α_> and α_<: If our pretest assessment of the relative plausibility of H2 versus H1 is correct, we will have higher power to detect differences in the predicted direction, but lower power to detect true population differences opposite in sign to our expectation, than had we set identical decision criteria for positive and negative sample correlations.

A Recommended Supplementary Calculation: The Confidence Interval

The confidence interval for the case where we have set α_< to 0.04 and α_> to 0.01, but obtained an r_XY of +0.31, is instructive on two grounds: First, the
[Figure 1: Rejection regions for a split-tailed large-sample t Test with α_< = 0.001 and α_> = 0.049. Sample t-ratios at or below −3.090 lead to the conclusion μ1 < μ2; t-ratios at or above +1.655 lead to the conclusion μ1 > μ2; in between, we cannot be sure whether μ1 > μ2 or vice versa.]
bias toward negative values of ρ_XY built into the difference between α_< and α_> leads to a confidence interval that is shifted toward the negative end of the −1 to +1 continuum, namely, −0.047 < ρ_XY < 0.535. Second, it also yields a confidence interval that is wider (range of 0.582) than in the symmetric-alpha case (range of 0.547). Indeed, a common justification for preferring α_> = α_< is that splitting total alpha equally leads to a narrower confidence interval than any other distribution of alpha between α_> and α_<. However, Harris and Vigil ([3], briefly described in Chapter 1 of [2]) have found that the PICI (Prior Information Confidence Interval) around the mean yields asymmetric-case confidence intervals that are, over a wide range of true values of μ, narrower than the corresponding symmetric-case intervals. (The PICI is defined as the set of possible values of the population mean that would not be rejected by a split-tailed test, where the α_> to α_< ratio employed decreases as an exponential function of the particular value of μ being tested, asymptotically approaching unity and zero for infinitely small and infinitely large values of μ, and equals 0.5 for the value of μ that represents the investigator's a priori estimate of μ.) Since the CI for a correlation is symmetric around the sample correlation when expressed in Fisher-z-

test has much higher power than the one-tailed test when the researcher is mistaken about the sign of the population effect.

To demonstrate the above points, consider the case where IQ scores are, for the population under consideration, distributed normally with a population mean of 105 and a population standard deviation of 15. Assume further that we are interested primarily in establishing whether this population's mean IQ is above or below 100, and that we propose to test this by drawing a random sample of size 36 from this population. If we conduct a two-tailed test of this effect, our power (the probability of rejecting H0) will be 0.516, and the width of the 95% CI around our sample mean IQ will be 9.8 IQ points. If, on the other hand, we conduct a one-tailed test of Hr that μ > 100, our power will be 0.639, a substantial increase. If, however, our assessment of prior evidence, expert opinion, and so on, has led us astray and we, instead, hypothesize that μ < 100, our power will be a minuscule 0.00013, and every bit of that power will come from cases where we have concluded, incorrectly, that μ < 100. Further, since no sample value on the nonpredicted side of 100 can be rejected by a one-tailed test, our confidence interval will be infinitely wide. Had we instead
transformed units, it seems likely that applying PICIs
conducted a split-tailed test with only a 49-to-1 bias
around correlation coefficients would also overcome
in favor of our Hr , our power would have been 0.635
this disadvantage of splitting total alpha unequally.
(0.004 lower than the one-tailed tests power) if our
prediction was correct and 0.1380 if our prediction
Power of Split-tailed versus One- and Two was incorrect (over a thousand times higher than
the one-tailed tests power in that situation); only
(Equal)-Tailed Tests
0.0001 of that power would have been attributable
A common justification for using one-tailed tests is to Type III error; and our confidence intervals would
that they have greater power (because they employ each have the decidedly noninfinite width of 11.9 IQ
lower critical values) than do two-tailed tests. How- points.
ever, that is true only if the researchers hypothesis In my opinion, the very small gain in power from
about the direction of the population effect is indeed conducting a one-tailed test of a correct Hr , rather
correct; if the population effect differs in sign from than a split-tailed test, is hardly worth the risk of
that hypothesized, the power of his one-tailed test near-zero power (all of it actually Type III error),
of his hypothesis cannot exceed /2 and all of the certainty of an infinitely wide confidence inter-
even that low power is actually Type III error, that val, and the knowledge that one has violated the
is, represents cases where the researcher comes to principle that scientific hypotheses must be discon-
the incorrect conclusion as to the direction of the firmable that come along with the use of a one-tailed
population effect. Further, the power advantage of test. Small surprise, then, that the one-tailed test was
a one-tailed test (i.e., a split-tailed test with infinite not included in the companion entrys description of
bias in favor of ones research hypothesis) over a Classical Statistical Inference in scientifically sound
split-tailed test with, say, a 50-to-1 bias is miniscule practice (see Classical Statistical Inference: Prac-
when the research hypothesis is correct, and the latter tice versus Presentation).
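The power figures quoted in this section can be checked numerically. The sketch below is my illustration, not part of the original entry; it assumes Python with scipy available and uses the entry's own setup (population N(105, 15), n = 36, H0: μ = 100, so the sample mean has standard error 2.5), with scipy used only to evaluate normal tail areas.

```python
# Numeric check of the power comparison: two-tailed, one-tailed (right and
# wrong direction), and split-tailed tests of H0: mu = 100 when the true
# population is N(105, 15) and n = 36. Alpha splits are from the entry.
from scipy.stats import norm

MU_TRUE, SE = 105.0, 15.0 / 36 ** 0.5   # true mean; standard error = 2.5

def power(alpha_lo, alpha_hi, mu0=100.0):
    """Probability of landing in either rejection region when mu = MU_TRUE."""
    p = 0.0
    if alpha_lo > 0:                     # reject when xbar < lower critical value
        crit_lo = mu0 + norm.ppf(alpha_lo) * SE
        p += norm.cdf(crit_lo, loc=MU_TRUE, scale=SE)
    if alpha_hi > 0:                     # reject when xbar > upper critical value
        crit_hi = mu0 + norm.isf(alpha_hi) * SE
        p += norm.sf(crit_hi, loc=MU_TRUE, scale=SE)
    return p

print(round(power(0.025, 0.025), 3))    # two-tailed:                    0.516
print(round(power(0.0, 0.05), 3))       # one-tailed, correct direction: 0.639
print(round(power(0.001, 0.049), 3))    # 49-to-1 split, correct:        0.635
print(round(power(0.05, 0.0), 5))       # one-tailed, wrong direction:   0.00013
print(round(power(0.049, 0.001), 3))    # 49-to-1 split, wrong way:      0.138

# Split-tailed CI around a sample mean of 103: combine the lower bound of a
# (1 - 2*alpha_>)-level symmetric CI with the upper bound of a
# (1 - 2*alpha_<)-level one, giving the 11.9-point-wide interval in the text.
lo = 103 - norm.isf(0.049) * SE
hi = 103 + norm.isf(0.001) * SE
print(round(lo, 2), round(hi, 2), round(hi - lo, 1))   # 98.86 110.73 11.9
```

Reproducing the entry's 0.516/0.639/0.635/0.00013/0.1380 figures this way also makes the trade-off concrete: the split-tailed test gives up 0.004 of power relative to the correct-direction one-tailed test in exchange for a three-orders-of-magnitude power gain when the prediction is wrong.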
Classical Statistical Inference Extended: Split-Tailed Tests
Practical Computation of Split-tailed Significance Tests and Confidence Intervals

Tables of critical values and some computer programs are set up for one-tailed and two-tailed tests at conventional (usually 0.05 and 0.01 and, sometimes, 0.001) alpha levels. This makes sense; it would, after all, be impossible to provide a column for every possible numerical value that might appear in the numerator or denominator of the α>/α< ratio in a split-tailed test. If the computer program you use to compute your test statistic does not provide the P value associated with that test statistic (which would, if provided, permit applying DR 1), or if you find yourself relying on a table or program with only 0.05 and 0.01 levels, you can use the 0.05 one-tailed critical value (aka the 0.10-level two-tailed critical value) in the predicted direction and the 0.01-level one-tailed critical value in the nonpredicted direction for a 5-to-1 bias in favor of your research hypothesis, and a total alpha of 0.06, well within the range of uncertainty about the true alpha of a nominal 0.05-level test, given all the assumptions that are never perfectly satisfied. Or, you could use the 0.05-level and 0.001-level critical values for a 50-to-1 bias and a total alpha of 0.051.

Almost all statistical packages and computer subroutines report only symmetric confidence intervals. However, one can construct the CI corresponding to a split-tailed test by combining the lower bound of a (1 − 2α>)-level symmetric CI with the upper bound of a (1 − 2α<)-level symmetric CI. This also serves as a rubric for hand computations of CIs to accompany split-tailed tests. For instance, for the example used to demonstrate relative power (H0: μY = 100, Y distributed normally with a known population standard deviation of 15 IQ points, α> = 0.049 and α< = 0.001, sample size = 36 observations), if our sample mean equals 103, the corresponding CI will extend from 103 − 1.655(2.5) = 98.8625 (the lower bound of a 0.902-level symmetric CI around 103) to 103 + 3.090(2.5) = 110.725 (the upper bound of the 0.998-level symmetric CI). For comparison, the CI corresponding to a 0.05-level one-tailed test with Hr: μY > 100 (i.e., an infinitely biased split-tailed test with α> = 0.05 and α< = 0) would extend from 103 − 1.645(2.5) = 98.89 to 103 + ∞(2.5) = +∞ (or, perhaps, from 100 to +∞, since the primary presumption of the one-tailed test is that we do not accept the possibility that μY could be < 100).

Choosing an α>/α< Ratio

The choice of how strongly to bias your test of the sign of the population effect being estimated in favor of the direction you believe to be implied by logic, theoretical analysis, or evidence from previous empirical studies involving this same parameter is ultimately a subjective one, as is the choice as to what overall alpha (α> + α<) to employ. One suggestion is to consider how strong the evidence for an effect opposite in direction to your prediction must be before you would feel compelled to admit that, under the conditions of the study under consideration, the population effect is indeed opposite to what you had expected it to be. Would a sample result that would have less than a one percent chance of occurring do it? Would a p< of 0.001 or less be enough? Whatever the breaking point for your prediction is (i.e., however low the probability of a result as far from the value specified in H0, in the direction opposite to prediction, as your obtained result has to be to get you to conclude that you got it wrong), make that the portion of alpha you assign to the nonpredicted tail.

Alternatively, and as suggested earlier, you could choose a ratio that is easy to implement using standard tables, such as 0.05/0.01 or 0.05/0.001, though that requires accepting a slightly higher overall alpha (0.06 or 0.051, respectively, in these two cases) than the traditional 0.05.

Finally, as pointed out by section editor Ranald Macdonald (personal communication), you could leave to the reader the choice of the α>/α< ratio by reporting, for example, that the effect being tested would be statistically significant at the 0.05 level for all α>/α< ratios greater than 4.3. Such a statement can be made for some finite ratio if and only if p> < α (more generally, if and only if the obtained test statistic exceeds the one-tailed critical value for the obtained direction). For instance, if a t Test of μ1 − μ2 yielded a positive sample difference (mean for first group greater than mean for second group) with an associated p> of 0.049, this would be considered statistically significant evidence that μ1 > μ2 by any split-tailed test with an overall alpha of 0.05 and an α>/α< ratio of 0.049/0.001 = 49 or greater. If the obtained p> were 0.003, then the difference would be declared statistically significant evidence that μ1 > μ2 by any split-tailed test with an overall alpha of 0.05 and an α>/α< ratio of 0.003/0.997 = 0.00301 or more, that is, even by readers whose bias in favor of the hypothesis that μ1 < μ2 led them to employ an α</α> ratio of 0.997/0.003 = 332.3 or less. But if the obtained p> were 0.052, no split-tailed test with an overall alpha of 0.05 would yield statistical significance for this effect, no matter how high the preferred α>/α< ratio. The one-tailed critical value, thus, can play a role as the basis for a preliminary test of whether the difference might be statistically significant by any split-tailed test, provided (in my opinion) that the researcher does not buy into the associated logic of a one-tailed test by employing an infinitely large ratio of predicted to nonpredicted rejection-region area.

Wrapping Up

Finally, this entry should not be taken as an endorsement of split-tailed tests in preference to two-tailed tests. Indeed, my personal preference and habit is to rely almost exclusively on two-tailed tests and, thereby, let the data from the study in hand completely determine the decision as to which of that study's results to consider statistically significant. On the other hand, if you find yourself tempted to carry out a one-tailed test because of strong prior evidence as to the sign of a population parameter, a split-tailed test is, in my opinion, a far sounder approach to giving in to that temptation than would be a one-tailed test.

References

[1] Braver, S.L. (1975). On splitting the tails unequally: a new perspective on one- versus two-tailed tests, Educational & Psychological Measurement 35, 283–301.
[2] Harris, R.J. (2001). A Primer of Multivariate Statistics, 3rd Edition, Lawrence Erlbaum Associates, Mahwah.
[3] Harris, R.J. & Vigil, B.V. (1998, October). Peekies: prior information confidence intervals. Presented at the meetings of the Society for Multivariate Experimental Psychology, Woodside Lake.
[4] Kaiser, H.F. (1960). Directional statistical decisions, Psychological Review 67, 160–167.

RICHARD J. HARRIS
Classical Statistical Inference: Practice versus Presentation
RICHARD J. HARRIS
Volume 1, pp. 268–278
in Encyclopedia of Statistics in Behavioral Science
ISBN-13: 978-0-470-86080-9
ISBN-10: 0-470-86080-4
aspects of the data collected in the present study (see Multiple Comparison Procedures).

Decision Rule(s), Compactly Presented

The last few paragraphs can be summarized succinctly in the following more formal presentation of the case in which our interest is in whether the correlation between two variables, X and Y, is positive or negative.

Scientifically Sound Classical Statistical Inference (Correlation Coefficient Example). We test H0: ρXY = 0 against H1: ρXY > 0 and H2: ρXY < 0 by comparing p> = P(rXY > r*XY if H0 were true) to α/2 and p< = P(rXY < r*XY if H0 were true) = 1 − p> to α/2, where

r*XY is the observed value of the correlation coefficient calculated for your sample;
rXY is a random variable representing the values of the sample correlation coefficient obtained from an infinitely large number of independent random samples from a population in which the true population correlation coefficient is precisely zero;
p> is, as the formula indicates, the percentage of the correlation coefficients computed from independent random samples drawn from a population in which ρXY = 0 that are as large as or larger than the one you got for your single sample from that population;
p< is defined analogously as P(rXY < r*XY if H0 were true); and
α is the criterion you have set (before examining your data) as to how low p> or p< has to be to convince you that the population correlation is, indeed, positive (high values of Y tending to go along with high values of X and low, with low).

These two comparisons (p< to α< and p> to α>) are then used to decide between H1 and H2, as follows:

Decision rule: If p> < α/2, conclude that (or, at least for the duration of this study, act as if) H1 is true (i.e., accept the hypothesis that ρXY > 0).
If p< < α/2, conclude that (or act as if) H2 is true (i.e., accept H2 that ρXY < 0).
If neither of the above is true (i.e., if p> and p< are both ≥ α/2, which is equivalent to the condition that p> be greater than α/2 but less than 1 − α/2), conclude that we don't have enough evidence to decide between H1 and H2 (i.e., fail to reject H0).

Scientifically Sound Classical Statistical Inference About Other Population Parameters. To test whether any other population parameter (e.g., the population mean, the difference between two population means, the difference between two population correlations) is greater than or less than some especially salient value θ0 that represents the dividing line between a positive and a negative effect (e.g., 100 for mean IQ, zero for the difference between two means or the difference between two correlations), simply substitute θ (the parameter of interest) for ρXY, θ̂ (your sample estimate of the parameter of interest, that is, the observed value in your sample of the statistic that corresponds to the population parameter) for rXY, and θ0 for zero (0) in the above.

The above description is silent on the issue of how one goes about computing p> and/or p<. This can be as simple as conducting a single-sample z test or as complicated as the lengthy algebraic formulae for the test of the significance of the difference between two correlation coefficients computed on the same sample [17]. See your local friendly statistics textbook or journal or the entries on various significance tests in this encyclopedia for details (see Catalogue of Parametric Tests).

An Example: Establishing the Sign of a (Population) Correlation Coefficient

For example, let's say that we wish to know whether interest in Web surfing (Y) increases or decreases with age (X), that is, whether the correlation between Web surfing and age (ρXY) is positive (>0) or negative (<0). Before collecting any data we decide to set α to 0.05. We draw a random sample of 43 US residents 5 years of age or older, determine each sampled individual's age and interest in Web surfing (measured on a 10-point quasi-interval scale on which high scores represent high interest in Web surfing), and compute the sample correlation between our two measures, which turns out to be −0.31. (Of course, this relationship is highly likely to be curvilinear, with few 5-year-olds expressing much interest in Web surfing; we're testing only the overall linear trend of the relationship between age and surfing interest. Even with that qualification, before proceeding with a formal significance test we should examine the scatterplot of Y versus X for the presence of outliers that might be having a drastic impact on the slope of the best-fitting straight line and for departures from normality in the distributions of X and Y so extreme that transformation and/or nonparametric alternatives should be considered.)

Using Victor Bissonnette's statistical applet for computation of the P values associated with sample correlation coefficients (http://fsweb.berry.edu/academic/education/vbissonnette/applets/sigr.html), we find that p< (the probability that an rXY computed on a random sample of size 43, and thus df = 43 − 2 = 41, drawn from a population in which ρXY = 0, would be less than or equal to −0.31) equals 0.02153. Since this is smaller than 0.05/2 = 0.025, we reject H0 in favor of H2: ρXY < 0.

Had our sample yielded an rXY of +0.28, we would have found that p< = 0.965 and p> = 0.035, so we would not reject H0 (though we would certainly not accept it) and would conclude that we had insufficient evidence to determine whether ρXY is positive or negative. (It passeth all plausibility that it could be precisely 0.000 . . . to even a few hundred decimal places.)

A Recommended Supplementary Calculation: The Confidence Interval

It is almost always a good idea to supplement any test of statistical significance with the confidence interval around the observed sample difference or correlation (Harlow, Significance testing introduction and overview, in [9]). The details of how and why to do this are covered in the confidence interval entry in this encyclopedia (see Confidence Intervals). To reinforce its importance, however, we'll display the confidence intervals for the two subcases mentioned above.

First, for a sample rXY of +0.31 with α = 0.05, the 95% confidence interval (CI) around (well, attempting to capture) ρXY is 0.011 < ρXY < 0.558. (It can be obtained via a plug-in program on Richard Lowry's VassarStats website, http://faculty.vassar.edu/lowry/rho.html.) The lower bound of this interval provides a useful caution against confusing statistical significance with magnitude or importance, in that it tells us that, while we can be quite confident that the population correlation is positive, it could plausibly be as low as 0.011. (For example, our data are insufficient to reject the null hypothesis that ρXY = 0.02, that is, that the two variables share only 0.04% of their variance.)

The case where we obtained an rXY of +0.28 yields a 95% CI of −0.022 < ρXY < 0.535. No value contained within the 95% CI can be rejected as a plausible value by a 0.05-level significance test, so it is indeed true that we cannot rule out zero as a possible value of ρXY, which is consistent with our significance test of rXY. On the other hand, we also can't rule out values of −0.020, +0.200, or even +0.50 (a population correlation accounting for 25% of the variation in Y on the basis of its linear relationship to X). The CI thus makes it abundantly clear how foolish it would be to accept (rather than fail to reject) the null hypothesis of a population correlation of precisely zero on the basis of statistical nonsignificance.

Alternative, but Equivalent Decision Rules

DR (Decision Rule) 2

If either p> or p< is < α/2 and the sample estimate of the population effect is positive (θ̂ > θ0; for example, rXY positive or sample mean IQ > 100), reject H0 and accept H1 that θ > θ0 (e.g., that ρXY > 0).
If either p> or p< is < α/2 and the sample estimate of the population effect is negative (θ̂ < θ0; for example, rXY < 0 or sample mean IQ < 100), reject H0 and accept H2 that θ < θ0 (e.g., that the population correlation is negative or that the population mean IQ is below 100).

DR 3. First, select a test statistic T that, for fixed α and sample size, is monotonically related to the discrepancy between θ̂ and θ0. (Common examples would be the z or t ratio for the difference between two independent or correlated means and the chi-square test for the difference between two independent proportions.) Compute T* (the observed value of T for your sample). By looking it up in a table or using a computer program (e.g., any of the widely available online statistical applets, such as those on Victor Bissonnette's site, cited earlier), determine either the two-tailed P value associated with T* or Tcrit, the value of T that would yield a two-tailed P value of exactly α. (The two-tailed P value equals twice the smaller of p> or p<; that is, it is the probability of observing a value of T as large as or larger than T* in absolute value in repeated random samplings from a population in which H0 is true.) Then, if p < α, or if |T*| (the absolute value of T*, that is, its numerical value, ignoring sign) is greater than Tcrit, conclude that the population effect has the same sign as the sample effect; that is, accept whichever of H1 or H2 is consistent with the data. If instead |T*| < Tcrit, conclude that we don't have enough evidence to decide whether θ > θ0 or θ < θ0 (e.g., whether ρXY is positive or negative).

This decision rule is illustrated below in Figure 1 for the case of a large-sample t Test of the difference between two sample means (i.e., an attempt to establish the direction of the difference between the two corresponding population means) where the researcher has set α to 0.05.

[Figure 1 (An example: Establishing the direction of the difference between two population means): the t ratio is referred to the critical values tcrit = −1.96 and tcrit = +1.96.]

DR 4. Construct the (1 − α)-level confidence interval (CI) for θ corresponding to your choice of α and to θ̂, the value of θ̂ you obtained for your sample of observations. (See the entry in this encyclopedia on confidence intervals for details of how to do this.) Then

If the CI includes only values that are > θ0, conclude that θ > θ0.
If the CI includes only values that are < θ0, conclude that θ < θ0.
Otherwise (i.e., if the CI includes some values that are > θ0 and some that are < θ0), conclude that we don't have enough evidence to decide whether θ > θ0 or θ < θ0.

As applied to testing a single, one-degree-of-freedom hypothesis, the above six decision rules are logically and algebraically equivalent and therefore lead to identical decisions when applied to any given sample of data. However, Decision Rule DR 4 (based on examination of the confidence interval around the sample estimate) actually encompasses an infinity of significance tests, since it neatly partitions the real line into values of θ that our data disconfirm (to within the reasonable doubt quantified by α) and those that are not inconsistent with our data. This efficiency adds to the argument that confidence intervals could readily replace the use of significance tests as represented by DRs 1–3. However, this efficiency comparison is reversed when we consider multiple-degree-of-freedom (aka overall) significance tests, since an appropriate overall test tells us whether any of the infinite number of single-df contrasts is (or would be, if tested) statistically significant.

Criticisms of and Alternatives to CSI

The late 1990s saw a renewal of a recurring cycle of criticism of classical statistical inference, including a call by Hunter [10] for a ban on the reporting of null-hypothesis significance tests (NHSTs) in APA journals. The history of this particular cycle is recounted by Fidler [6]; a compilation of arguments for and against NHST is provided by [10].
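The correlation example worked in this entry can be reproduced numerically. The sketch below is my illustration, not part of the original entry; it assumes Python with scipy available, converts r to a t statistic with df = n − 2 for the p-values, and uses the standard Fisher-z interval for the CI (which matches the bounds quoted in the text, although the entry itself obtains them from an online applet).

```python
# Numeric check of the correlation examples: n = 43 (df = 41), alpha = 0.05.
from math import atanh, sqrt, tanh
from scipy.stats import norm, t

N = 43

def one_tailed_p(r, n=N):
    """P(sample r >= r | rho = 0), via T = r*sqrt(n-2)/sqrt(1-r^2) ~ t(n-2)."""
    T = r * sqrt(n - 2) / sqrt(1 - r * r)
    return t.sf(T, n - 2)

def fisher_ci(r, n=N, alpha=0.05):
    """(1-alpha) CI for rho, symmetric around r in Fisher-z units."""
    z, half = atanh(r), norm.isf(alpha / 2) / sqrt(n - 3)
    return tanh(z - half), tanh(z + half)

# DR 1 for r = -0.31: by symmetry p_less = P(r <= -0.31 | H0) equals the
# upper-tail p for +0.31, close to the 0.02153 reported in the text, and is
# below alpha/2 = 0.025, so H0 is rejected in favor of H2: rho < 0.
print(one_tailed_p(0.31))
# r = +0.28: p_greater is about 0.035 > 0.025, so H0 is not rejected
# (though, as the entry stresses, it is certainly not accepted either).
print(one_tailed_p(0.28))
# The two 95% CIs quoted in the text:
print([round(b, 3) for b in fisher_ci(0.31)])   # [0.011, 0.558]
print([round(b, 3) for b in fisher_ci(0.28)])   # [-0.022, 0.535]
```

The second CI illustrates DR 4 directly: it straddles zero, so the data cannot establish the sign of ρXY, in agreement with the nonsignificant test.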
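The claimed equivalence of the decision rules for a single-df hypothesis can also be demonstrated in code. This is my sketch, not part of the original entry; it assumes Python with scipy, uses made-up sample summaries, and casts DR 1 (tail p-values versus α/2), DR 3 (|T| versus the two-tailed critical value), and DR 4 (does the CI exclude θ0?) for the large-sample two-mean test of Figure 1, where α = 0.05 gives critical values of ±1.96.

```python
# Three of the decision rules, applied to a large-sample z test of mu1 - mu2
# with alpha = 0.05; all three return the same three-way verdict.
from scipy.stats import norm

ALPHA = 0.05

def dr1(T):
    p_greater = norm.sf(T)                 # P(T_sample > T | H0)
    if p_greater < ALPHA / 2: return "mu1 > mu2"
    if 1 - p_greater < ALPHA / 2: return "mu1 < mu2"
    return "cannot be sure which"

def dr3(T):
    t_crit = norm.isf(ALPHA / 2)           # 1.96 for alpha = 0.05
    if abs(T) <= t_crit: return "cannot be sure which"
    return "mu1 > mu2" if T > 0 else "mu1 < mu2"

def dr4(diff, se):
    half = norm.isf(ALPHA / 2) * se        # CI = diff +/- 1.96 * se
    lo, hi = diff - half, diff + half
    if lo > 0: return "mu1 > mu2"
    if hi < 0: return "mu1 < mu2"
    return "cannot be sure which"

# Hypothetical sample mean differences and standard errors:
for diff, se in [(5.0, 2.0), (-5.0, 2.0), (2.0, 2.0)]:
    T = diff / se
    assert dr1(T) == dr3(T) == dr4(diff, se)
    print(dr1(T))
# prints: mu1 > mu2, then mu1 < mu2, then cannot be sure which
```

Note that all three rules allow three conclusions, not two; the "cannot be sure which" outcome is the fail-to-reject verdict that the entry insists must never be collapsed into "accept H0".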
Briefly, the principal arguments offered against more importantly, against the hypothesis that
NHST (aka CSI) are that the population effect is opposite in sign to
its sample estimate). In practice (as Estes [5]
(a) It wastes the researchers time testing an hypoth- explicitly states for himself but also opines is
esis (the null hypothesis) that is never true for true for most journal editors), a result with
real variables. an associated P value of 0.051 is unlikely
(b) It is misinterpreted and misused by many to lead to an automatic decision not to pub-
researchers, the most common abuse being treat- lish the report unless the editor feels strongly
ing a nonsignificant result (failure to reject H0 ) that the most appropriate alpha level for trans-
as equivalent to accepting the null hypothesis. lating the continuous confidence function into
(c) It is too rigid, treating a result whose P value is a discrete, take it seriously versus require
just barely below (e.g., p = .0498) as much more data and/or replication decision is 0.01
more important than one whose P value is just or 0.001. The adoption as a new standard
barely above (e.g., p = .0501). of an overall alpha (> + < ) greater than
(d) It provides the probability of the data, given that 0.05 (say, 0.055) is unlikely, however, for at
H0 is true, when what the researcher really wants least two reasons: First, such a move would
to know is the probability that H0 is true, given simply transfer the frustration of just miss-
the observed data. ing statistical significance and the number of
(e) There are alternatives to NHST that have complaints about the rigidity of NHST from
much more desirable properties, for example researchers whose P values have come out to
(i) Bayesian inference (see Bayesian Statis- 0.51 or 0.52 to those with P values of 0.56
tics), (ii) effect sizes (see Effect Size Mea- or 0.57. Second, as a number of authors (e.g.,
sures), (iii) confidence intervals, and (iv) meta- Wainer & Robinson [18]) have documented,
analysis. 0.05 is already a more liberal alpha level than
Almost equally briefly, the counterarguments to the founders of CSI had envisioned and than
the above arguments are that is necessary to give reasonable confidence in
the replicability of a finding (e.g., Greenwald,
(a) As pointed out earlier, we test H0 , not because et al. [8], who found that a P value of 0.005
anyone thinks it might really be true, but note the extra zero provides about an 80%
because, if we cant rule out 0 (the least chance that the finding will be statistically sig-
interesting possible value of our population nificant at the 0.05 level in a subsequent exact
parameter, ), we also cant rule out values of replication).
that are both greater than and smaller than (d) This one is just flat wrong. Researchers are not
0 that is, we wont be able to establish the interested in (or at least shouldnt be interested
sign (direction) of our effect in the population, in) the probability that H0 is true, since we
which is what significance testing (correctly know a priori that it is almost certainly not
interpreted) is all about. true. As suggested before, most of the ills
(b) That NHST is often misused is an argument attributed to CSI are due to the misconception
for better education, rather than for abandon- that we can ever collect enough evidence to
ing CSI. We certainly do need to do a better demonstrate that H0 is true to umpteen gazillion
job of presenting CSI to research neophytes decimal places.
in particular, we need to expunge from our text- (e)
books the traditional, antiscientific presentation (i) Bayesian statistical inference (BSI) is in
of NHST as involving a choice between two, many ways a better representation than CSI
rather than three, conclusions. of the way researchers integrate data from
(c) Not much of a counterargument to this one. It successive studies into the belief systems
seems compelling that confidence in the direc- they have built up from a combination
tion of an effect should be a relatively smooth, of logic, previous empirical findings, and
continuous function of the strength of the evi- perhaps unconscious personal biases. BSI
dence against the null hypothesis (and thus, requires that the researcher make her belief
6 Classical Statistical Inference: Practice versus Presentation
system explicit by spelling out the prior to the likelihood that an exact replication
probability she attaches to every possible would yield statistical significance in the
value of the population parameter being same direction [8]. Neither of these pieces
estimated and then, once the data have of information is easily gleaned from a
been collected, apply Bayes Theorem to CI except by rearranging the components
modify that distribution of prior probabili- from which the CI was constructed so as
ties in accordance with the data, weighted to reconstruct the significance test. Further,
by their strength relative to the prior the P value enables the reader to determine
beliefs. However, most researchers (though whether to consider an effect statistically
the size of this majority has probably significant, regardless of the alpha level he
decreased in recent years) feel uncomfort- or she prefers, while significance can be
able with the overt subjectivity involved determined from a confidence interval only
in the specification of prior probabilities. for the alpha level chosen by the author. In
CSI has the advantage of limiting subjec- short, we need both CIs and significance
tivity to the decision as to how to distribute tests, rather than either by itself.
total alpha between > + < . Indeed, split- (iii) In addition to the issue of the sign or
tailed tests (cf. Classical Statistical Infer- direction of a given population effect, we
ence Extended: Split-Tailed Tests) can will almost always be interested in how
be seen as a back door approach to large an effect is, and thus how important
Bayesian inference. it is on theoretical and/or practical/clinical
(ii) Confidence intervals are best seen as com- grounds. That is, we will want to report
plementing, rather than replacing signifi- a measure of effect size for our finding.
cance tests. Even though the conclusion This will often be provided implicitly as the
reached by a significance test of a par- midpoint of the range of values included in
ticular 0 is identical to that reached by the confidence interval which, for sym-
checking whether 0 is or is not included metrically distributed , will also be our
in the corresponding CI, there are nonethe- point estimate of that parameter. However,
less aspects of our evaluation of the data if the units in which the population param-
that are much more easily gleaned from one eter being tested and its sample estimate
or the other of these two procedures. For are expressed are arbitrary (e.g., number of
instance, the CI provides a quick, easily items endorsed on an attitude inventory) a
understood assessment of whether a non- standardized measure of effect size, such
significant result is a result of a population as Cohens d, may be more informative.
effect size that is very close to zero (e.g., (Cohens d is, for a single- or two-sample
a CI around mean IQ of your universitys t Test, the observed difference divided by
students that extends from 99.5 to 100.3) or the best available estimate of the standard
is instead due to a sloppy research design deviation of ) However, worrying about
(high variability and/or low sample size) size of effect (e.g., how much a treatment
that has not narrowed down the possible helps) is usually secondary to establishing
values of 0 very much (e.g., a CI that direction of effect (e.g., whether the treat-
states that population mean IQ is some- ment helps or harms).
where between 23.1 and 142.9). On the (iv) The wait for the Psych Bull article
other hand, the P value from a significance paradigm was old hat when I entered the
test provides a measure of the confidence field lo those many decades ago. This
you should feel that youve got the sign of paradigm acknowledges that any one study
the population effect right. Specifically, the will have many unique features that ren-
probability that you have declared 0 der generalization of its findings hazardous,
to have the wrong sign is at most half of which is the basis for the statement earlier
the P value for a two-tailed significance in this entry that a conclusion based on a
test. The P value is also directly related significance test is limited in scope to the
Classical Statistical Inference: Practice versus Presentation 7
present study. For the duration of the report of the particular study in which a given significance test occurs we agree to treat statistically significant effects as if they indeed matched the corresponding effect in sign or direction, even though we realize that we may have happened upon the one set of unique conditions (e.g., the particular on/off schedule that yielded ulcers in the classic executive monkey study [2]) that yields this direction of effect. We gain much more confidence in the robustness of a finding if it holds up under replication in different laboratories, with different sources of respondents, different researcher biases, different operational definitions, and so on. Review articles (for many years but no longer a near-monopoly of Psychological Bulletin) provide a summary of how well a given finding or set of findings holds up under such scrutiny. In recent decades the traditional head count (tabular review) of what proportion of studies of a given effect yielded statistical significance under various conditions has been greatly improved by the tools of meta-analysis, which emphasizes the recording and analysis of an effect-size measure extracted from each reviewed study, rather than simply the dichotomous measure of statistical significance or not. Thus, for instance, one study employing a sample of size 50 may find a statistically nonsignificant correlation of 0.14, while another study finds a statistically significant correlation of 0.12, based on a sample size of 500. Tabular review would count these studies as evidence that about half of attempts to test this relationship yield statistical significance, while meta-analysis would treat them as evidence that the effect size is in the vicinity of 0.10 to 0.15, with considerable consistency (low variability in obtained effect size) from study to study. Some authors (e.g., [16]) have extolled meta-analysis as an alternative to significance testing. However, meta-analytic results, while based on larger sample sizes than any of the single studies being integrated, are not immune to sampling variability. One would certainly wish to ask whether effect size, averaged across studies, is statistically significantly higher or lower than zero. In other words, conducting a meta-analysis does not relieve one of the responsibility of testing the strength of the evidence that the population effect has a particular sign or direction.

CSI As Presented (Potentially Disastrously) in Most Textbooks

Here's where I attempt to document the dark side of classical statistical inference, namely, the overwhelming tendency of textbooks to present its logic in a way that forces the researcher who takes it seriously to choose between vacuous versus scientifically unsound conclusions. Almost all such presentations consider two DRs: the two-tailed test (related to the two-tailed P value defined earlier, as well as to DRs 1-3) and the one-tailed test, in which the sign of the population effect is considered to be known a priori.

Two-tailed Significance Test (Traditional Textbook Presentation)

DR 2T. Compute (or, more likely, look up) the two-tailed critical value of symmetrically distributed test statistic T, Tcrit. This is the 100(1 − α/2)th percentile of the sampling distribution of T, that is, the value of T such that 100(α/2)% of the samples drawn from a population in which H0 is true would yield a value of T that is as far from 0 as (or even farther from 0 than) that critical value, in either direction. Then:

If |T| (the absolute value of the observed value of T in your sample) is > Tcrit, accept H1 that θ ≠ θ0.
If |T| is < Tcrit (that is, if −Tcrit < T < Tcrit), do not reject H0 that θ = θ0.

An even more egregious but unfortunately common variation of DR 2T replaces the "do not reject H0" decision with the injunction to "accept H0".
This decision rule can be illustrated for the same case (difference between two means) that was used to illustrate DR 3 as follows (Figure 2):
[Figure 2: Sampling distribution of the t ratio for the two-tailed test at α = 0.05: in the central 1 − α = 0.95 of the distribution, between −1.96 and +1.96, one "cannot be sure whether μ1 = μ2 or not," while the two rejection regions (α/2 = 0.025 each) beyond ±1.96 yield the conclusion μ1 ≠ μ2.]

[Figure 3: Sampling distribution of the t ratio for the one-tailed test at α = 0.05: the rejection region (α = 0.05) lies entirely beyond +1.645, with 1 − α = 0.95 of the distribution below that critical value.]
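The decision rule DR 2T and the ±1.96 cutoffs shown in Figure 2 can be sketched numerically. This is an illustrative sketch only (the function name and test values are hypothetical, not from the entry), using the large-sample normal approximation to the t ratio via Python's standard library:

```python
from statistics import NormalDist

def two_tailed_decision(t_obs, alpha=0.05):
    """DR 2T with the large-sample (normal) critical value:
    reject H0 only if |t| exceeds the 100(1 - alpha/2)th percentile."""
    t_crit = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    if abs(t_obs) > t_crit:
        return "accept H1: theta != theta0"
    return "do not reject H0: theta = theta0"

print(round(NormalDist().inv_cdf(0.975), 2))  # 1.96
print(two_tailed_decision(2.30))   # accept H1: theta != theta0
print(two_tailed_decision(-1.20))  # do not reject H0: theta = theta0
```

Note that, exactly as the entry complains, neither returned conclusion says anything about the *direction* of the effect; the sound-researcher version of the two-tailed test adds the sign of the observed statistic to the significant outcome.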
One-tailed Significance Test (Traditional Textbook Presentation)

DR 1T: Compute (or, more likely, look up) the one-tailed critical value of symmetrically distributed test statistic T, Tcv. This is either the negative of the 100(1 − α)th percentile or the 100(1 − α)th percentile itself of the sampling distribution of T, that is, the value of T such that 100(α)% of the samples drawn from a population in which H0 is true would yield a value of T that is as far from 0 as (or even farther from 0 than) that critical value, in the hypothesized direction. Then, if the researcher has (before looking at the data) hypothesized that θ > θ0 (a mirror-image decision rule applies if the a priori hypothesis is that θ < θ0):

If |T| is > Tcv and θ̂ (the observed sample estimate of θ) is > θ0 (i.e., the sample statistic came out on the predicted side of θ0), accept H1 that θ > θ0.
If |T| is < Tcv or θ̂ < θ0 (i.e., either the sample estimate was on the nonpredicted side of θ0 or the sample result, while in the hypothesized direction, wasn't discrepant enough from θ0 to yield statistical significance), fail to reject H0 that θ ≤ θ0 (i.e., conclude that the data provide insufficient evidence to prove that the researcher's hypothesis is correct).

This decision rule can be illustrated for the same case (difference between two means) that was used to illustrate DR 3 as follows (Figure 3):

One-tailed or Unidirectional? The labeling of the two kinds of tests described above as one-tailed and two-tailed is a bit of a misnomer, in that the crucial logical characteristics of these tests are not a function of which tail(s) of the sampling distribution constitute the rejection region, but of the nature of the alternative hypotheses that can be accepted as a result of the tests. For instance, the difference between two means can be tested (via two-tailed logic) at the 0.05 level by determining whether the square of t for the difference falls in the right-hand tail (i.e., beyond the 95th percentile) of the F distribution with one df for numerator and the same
df for denominator as the t Test's df. Squaring t has essentially folded both tails of the t distribution into a single tail of the F distribution. It's what one does after comparing your test statistic to its critical value, not how many tails of the sampling distribution of your chosen test statistic are involved in that determination, that determines whether you are conducting a two-tailed significance test as presented in textbooks or a two-tailed significance test as employed by sound researchers or a one-tailed test. Perhaps it would be better to label the above two kinds of tests as bidirectional versus unidirectional tests.

The Unpalatable Choice Presented by Classical Significance Tests as Classically Presented. Now, look again at the conclusions that can be reached under the above two decision rules. Nary a hint of direction of effect appears in either of the two conclusions (θ could = θ0 or ≠ θ0 in the general case, μ1 could = μ2 or μ1 ≠ μ2 in the example) that could result from a two-tailed test.

Further, as hinted earlier, there are no true null hypotheses except by construction. No two populations have precisely identical means on any real variable; no treatment to which we can expose the members of any real population leaves them utterly unaffected; and so on. Thus, even a statistically significant two-tailed test provides no new information. Yes, we can be 95% confident that the true value of the population parameter doesn't precisely equal θ0 to 45 or 50 decimal places but, then, we were 100% confident of that before we ever looked at a single datum!

For the researcher who wishes to be able to come to a conclusion (at least for purposes of the discussion of this study's results) about the sign (direction) of the difference between θ and θ0, textbook-presented significance testing leaves only the choice of conducting a one-tailed test. Doing so, however, requires not only that she make an a priori prediction as to the direction of the effect being tested (i.e., as to the sign of θ − θ0), but that she declare that hypothesis to be impervious to any empirical evidence to the contrary. (If we conduct a one-tailed test of the hypothesis that μ1 > μ2, we can never come to the conclusion that μ1 < μ2, no matter how much larger Ȳ2 is than Ȳ1 and no matter how close to negative infinity our t ratio for the difference gets.)

In my opinion, this may be a satisfying way to run a market test of your product (We tested every competitor's product against ours, and not a one performed statistically significantly better than our product because we did one-tailed tests, and we sure weren't going to predict better performance for the competitor's product), but it's a terrible way to run a science.

In short, and to reiterate the second sentence of this entry, classical statistical inference as described in almost all textbooks forces the researcher who takes that description seriously to choose among affirming a truism, accepting a falsehood on scant evidence, or violating one of the most fundamental tenets of scientific method by declaring one's research hypothesis impervious to disconfirmation.

Fortunately, most researchers don't take the textbook description seriously. Rather, they conduct two-tailed tests, but with the three possible outcomes spelled out in DR 1 through DR 4 above. Or they pretend to conduct a one-tailed test but abandon that logic if the evidence is overwhelmingly in favor of an effect opposite in direction to their research hypothesis, thus effectively conducting a split-tailed test such as those described in the entry Classical Statistical Inference Extended: Split-tailed Tests [9], but with a somewhat unconventional alpha level. (E.g., if you begin by planning a one-tailed test with α = 0.05 but revert to a two-tailed test if t comes out large enough to be significant in the direction opposite to prediction by a 0.05-level two-tailed test, you are effectively conducting a split-tailed test with an alpha of 0.05 in the predicted direction and 0.025 in the nonpredicted direction, for a total alpha of 0.075. See [1] for an example of a research report in which exactly that procedure was followed.)

However, one does still find authors who explicitly state that you must conduct a one-tailed test if you have any hint about the direction of your effect (e.g., [13], p. 136); or an academic department that insists that intro sections of dissertations should state all hypotheses in null form, rather than indicating the direction in which you predict your treatment conditions will differ (see [15] and http://www.blackwell-synergy.com/links/doi/10.1111/j.1365-2648.2004.03074.x/abs/;jsessionid=l1twdxnTU-ze for examples of this practice and http://etleads.csuhayward.edu/6900.html and http://www.edb.utexas.edu/coe/depts/sped/syllabi/Spring%2003/Parker sed387 2nd.htm for examples of dissertation guides that enshrine it); or an
article in which the statistical significance of an effect is reported, with no mention of the direction of that effect (see http://www.gerardkeegan.co.uk/glossary/gloss repwrit.htm and http://web.hku.hk/rytyeung/nurs2509b.ppt for examples in which this practice is held up as a model for students to follow); or a researcher who treats a huge difference opposite to prediction as a nonsignificant effect just as textbook-presented logic dictates. (Lurking somewhere in, but as yet unrecovered from, my 30+ years of notes is a reference to a specific study that committed that last-mentioned sin.)

There are even researchers who continue to champion one-tailed tests. As pointed out earlier, many of these (fortunately) do not really follow the logic of one-tailed tests. For instance, after expressing concern about and disagreement with this entry's condemnation of one-tailed tests, section editor Ranald Macdonald (email note to me) mentioned that, of course, should a study for which a one-tailed test had been planned yield a large difference opposite to prediction, he would consider the assumptions of the test violated and acknowledge the reversal of the predicted effect; that is, the decision rule he applies is equivalent to a split-tailed test with a somewhat vague ratio of predicted to nonpredicted alpha. Others, though (e.g., Cohen, in the entry on Directed Alternatives in Testing and many of the references on ordinal alternatives cited in that entry), explicitly endorse the logic of one-tailed tests.

Two especially interesting examples are provided by Burke [4] and Lohnes & Cooley [13]. Burke, in an early salvo of the 1950s debate with Jones on one-tailed versus two-tailed tests (which Leventhal [12] reports was begun because of Burke's concern that some researchers were coming to directional conclusions on the basis of two-tailed tests), concedes that a two-tailed rejection region could be used to support the directional hypothesis that the experimental-condition μ is greater than the control-condition μ, but then goes on to say that if a researcher did so his position would be unenviable: For following the rules of (such a) test, he would have to reject the (null hypothesis) in favor of the alternative (μE > μC), even though an observed difference (Ȳ1 − Ȳ2) was a substantial negative value. Such are the consequences of the assumption that any significance test can have only two outcomes, rather than three: If your alternative hypothesis is that μE > μC, that leaves only H0: μE ≤ μC as the other possible conclusion.

Lohnes and Cooley [13] follow their strong endorsement of one-tailed tests by an even stronger endorsement of traditional, symmetric confidence intervals: The great value of (a CI) is that it dramatizes that all values of μ within these limits are tenable in the light of the available evidence. However, those nonrejected values include every value that would be rejected by a one-tailed test but not by a two-tailed test. More generally, it is the set of values that would not be rejected by a two-tailed test that match up perfectly with the set of values that lie within the symmetric confidence interval. Lohnes and Cooley thus manage to strongly denigrate two-tailed tests and to strongly endorse the logically equivalent symmetric confidence-interval procedure within a three-page interval of their text.

Few researchers would disagree that it is eminently reasonable to temper the conclusions one reaches on the basis of a single study with the evidence available from earlier studies and/or from logical analysis. However, as I explain in the companion entry, one can use split-tailed tests (Braver [3], Kaiser [11]) to take prior evidence into account and thereby increase the power of your significance tests without rendering your directional hypothesis undisconfirmable, and while preserving, as the one-tailed test does not, some possibility of reaching the correct conclusion about the sign of the population parameter when you have picked the wrong directional hypothesis.

References

[1] Biller, H.R. (1968). A multiaspect investigation of masculine development in kindergarten age boys, Genetic Psychology Monographs 78, 89-138.
[2] Brady, J.V., Porter, R.W., Conrad, D.G. & Mason, J.W. (1958). Avoidance behavior and the development of gastroduodenal ulcers, Journal of the Experimental Analysis of Behavior 1, 69-73.
[3] Braver, S.L. (1975). On splitting the tails unequally: a new perspective on one- versus two-tailed tests, Educational & Psychological Measurement 35, 283-301.
[4] Burke, C.J. (1954). A rejoinder on one-tailed tests, Psychological Bulletin 51, 585-586.
[5] Estes, W.K. (1997). Significance testing in psychological research: some persisting issues, Psychological Science 8, 18-20.
[6] Fidler, F. (2002). The fifth edition of the APA publication manual: why its statistics recommendations are so controversial, Educational & Psychological Measurement 62, 749-770.
[7] Gigerenzer, G. & Murray, D.J. (1987). Cognition as Intuitive Statistics, Lawrence Erlbaum Associates, Hillsdale.
[8] Greenwald, A.G., Gonzalez, R. & Harris, R.J. (1996). Effect sizes and p values: what should be reported and what should be replicated? Psychophysiology 33, 175-183.
[9] Harlow, L.L., Mulaik, S.A. & Steiger, J.H. (1997). What if there were no Significance Tests? Lawrence Erlbaum Associates, Mahwah, pp. 446.
[10] Hunter, J.E. (1997). Needed: a ban on the significance test, Psychological Science 8, 3-7.
[11] Kaiser, H.F. (1960). Directional statistical decisions, Psychological Review 67, 160-167.
[12] Leventhal, L. (1999). Updating the debate on one- versus two-tailed tests with the directional two-tailed test, Psychological Reports 84, 707-718.
[13] Lohnes, P.R. & Cooley, W.W. (1968). Introduction to Statistical Procedures, Wiley, New York.
[14] Macdonald, R.R. (1997). On statistical testing in psychology, British Journal of Psychology 88, 333-347.
[15] Ryan, M.G. (1994). The effects of computer-assisted instruction on at-risk technical college students, Doctoral dissertation, University of South Carolina, ProQuest order #ABA95-17308.
[16] Schmidt, F.L. (1996). Statistical significance testing and cumulative knowledge in psychology: implications for training of researchers, Psychological Methods 1, 115-129.
[17] Steiger, J.H. (1980). Tests for comparing elements of a correlation matrix, Psychological Bulletin 87, 245-251.
[18] Wainer, H. & Robinson, D.H. (2003). Shaping up the practice of null hypothesis significance testing, Educational Researcher 32, 22-30.

RICHARD J. HARRIS
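The split-tailed arithmetic discussed in this entry (a planned one-tailed test at α = 0.05 in the predicted direction combined with a 0.05-level two-tailed reversal check, for a total alpha of 0.075) can be verified numerically. A minimal sketch using the large-sample normal approximation; the variable names are illustrative:

```python
from statistics import NormalDist

z = NormalDist()
# Critical values actually applied by the researcher described in the entry:
cv_predicted = z.inv_cdf(1 - 0.05)     # about +1.645: one-tailed alpha = 0.05
cv_reversal = z.inv_cdf(1 - 0.05 / 2)  # about +1.960: 0.05-level two-tailed cutoff

# Type I error rate under H0: tail mass beyond each cutoff, in its own direction
alpha_predicted = 1 - z.cdf(cv_predicted)    # 0.050 in the predicted direction
alpha_nonpredicted = 1 - z.cdf(cv_reversal)  # 0.025 in the nonpredicted direction
total_alpha = alpha_predicted + alpha_nonpredicted

print(round(cv_predicted, 3), round(cv_reversal, 3))  # 1.645 1.96
print(round(total_alpha, 3))  # 0.075
```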
Classical Test Models
JOSE MUNIZ
random halves of the test corrected using the Spearman-Brown formula, and (c) the correlation between two applications of the same test to a sample of persons. Each one of these procedures has its pros and cons, and suits some situations better than others. In all cases, the value obtained (the reliability coefficient) is a numerical value between 0 and 1, indicating, as it approaches 1, that the test is measuring consistently. In the psychometric literature, there are numerous classic formulas for obtaining the empirical value of the reliability coefficient, some of the most important being those of Rulon [28], Guttman [16], Flanagan [12], the KR20 and KR21 [19], or the popular coefficient alpha [9], which express the reliability of the test according to its internal consistency.

Regardless of the formula used for calculating the reliability coefficient, what is most important is that all measurements have an associated degree of accuracy that is empirically calculable. The commonest sources of error in psychological measurement have been widely studied by specialists, who have made detailed classifications of them [32]. In general, it can be said that the three most important sources of error in psychological measurement are: (a) the assessed persons themselves, who come to the test in a certain mood, with certain attitudes and fears, and levels of anxiety in relation to the test, or affected by any kind of event prior to the assessment, all of which can influence the measurement errors; (b) the measurement instrument used, whose specific characteristics can differentially affect those assessed; and (c) the application, scoring, and interpretation by the professionals involved [24].

Validity

From persons' scores, a variety of inferences can be drawn, and validating the test consists in checking empirically that the inferences made based on the test are correct (see Validity Theory and Applications). It could therefore be said that, strictly speaking, it is not the test that is validated, but rather the inferences made on the basis of the test. The procedure followed for validating these inferences is the one commonly used by scientists, that is, defining working hypotheses and testing them empirically. Thus, from a methodological point of view, the validation process for a test does not differ in essence from customary scientific methodology. Nevertheless, in this specific context of test validation, there have been highly effective and fruitful forms of collecting empirical evidence for validating tests, which, classically, are referred to as content validity, predictive validity, and construct validity. These are not three forms of validity (there is only one), but rather three common forms of obtaining data in the validation process. Content validity refers to the need for the content of the test to adequately represent the construct assessed. Predictive validity indicates the extent to which the scores in the test predict a criterion of interest; it is operationalized by means of the correlation between the test and the criterion, which is called the validity coefficient (ρxy). Construct validity [11] refers to the need to ensure that the assessed construct has entity and consistency, and is not merely spurious. There are diverse strategies for evaluating construct validity. Thus, for example, when we use the technique of factor analysis (or, more generally, structural equation modeling), we refer to factorial validity. If, on the other hand, we use the data of a multitrait-multimethod matrix (see Multitrait-Multimethod Analyses), we talk of convergent-discriminant validity. Currently, the concept of validity has become more comprehensive and unitary, with some authors even proposing that the consequences of test use be included in the validation process [2, 22].

Extensions of Classical Test Theory

As pointed out above, the classical linear model permits estimation of the measurement errors, but not their source; this is presumed unknown, and the errors randomly distributed. Some models within the classical framework have undertaken to break down the error and thus offer not only the global reliability but also its quantity as a function of the sources of error. The most well known model is that of generalizability theory, proposed by Cronbach and his collaborators [10]. This model allows us to make estimations about the size of the different error sources. The reliability coefficient obtained is referred to as the generalizability coefficient, and indicates the extent to which a measurement is generalizable to the population of measurements involved in the measurement (see Generalizability Theory: Basics). A detailed explanation of generalizability theory can be found in [5].
The tests mentioned up to now are those most commonly used in the field of psychology for assessing constructs such as intelligence, extraversion, or neuroticism. They are generally referred to as normative tests, since the scores of the persons assessed are expressed according to the norms developed in a normative group. A person's score is expressed according to the position he or she occupies in the group, for example, by means of centiles or standard scores. However, in educational and professional contexts, it is often of more interest to know the degree to which people have mastery in a particular field than their relative position in a group of examinees. In this case, we talk about criterion-referenced tests [13, 17], referring to tests whose central objective is to assess a person's ability in a field, domain, or criterion (see Criterion-Referenced Assessment). In these circumstances, the score is expressed not according to the group, but rather as an indicator of the extent of the person's ability in the area of interest. However, the classical reliability coefficients of normative tests are not particularly appropriate for this type of test, for which we need to estimate other indicators based on the reliability of classifications [4]. Another specific technical problem with criterion-referenced tests is that of setting cut-off points for discriminating between those with mastery in the field and those without. A good description of the techniques available for setting cut-off points can be found in [7].

Limitations of the Classical Test Theory Approach

The classical approach is still today commonly used in constructing and analyzing psychological and educational tests [27]. The reasons for this widespread use are basically its relative simplicity, which makes it easy to understand for the majority of users, and the fact that it works well and can be adapted to the majority of everyday situations faced by professionals and researchers. These are precisely its strong points. Nevertheless, in certain assessment situations, the new psychometric models derived from item response theory have many advantages over the classical approach [18]. More about the limitations of CTT can be found elsewhere (see Item Response Theory (IRT) Models for Dichotomous Data). It is fair to point out that workers using the classical approach have developed diverse statistical strategies for the appropriate solution of many of the problems that surface in practice, but the more elegant and technically satisfactory solutions are provided by item response models.

References

[1] Allen, M.J. & Yen, W.M. (1979). Introduction to Measurement Theory, Brooks/Cole, Monterrey.
[2] American Educational Research Association, American Psychological Association, National Council on Measurement in Education (1999). Standards for Educational and Psychological Testing, American Educational Research Association, Washington.
[3] Anastasi, A. & Urbina, S. (1997). Psychological Testing, 7th Edition, Prentice Hall, Upper Saddle River.
[4] Berk, R.A., ed. (1984). A Guide to Criterion-Referenced Test Construction, The Johns Hopkins University Press, Baltimore.
[5] Brennan, R.L. (2001). Generalizability Theory, Springer-Verlag, New York.
[6] Campbell, D.T. & Fiske, D.W. (1959). Convergent and discriminant validation by the multitrait-multimethod matrix, Psychological Bulletin 56, 81-105.
[7] Cizek, G.J., ed. (2001). Setting Performance Standards. Concepts, Methods, and Perspectives, LEA, London.
[8] Crocker, L. & Algina, J. (1986). Introduction to Classical and Modern Test Theory, Holt, Rinehart and Winston, New York.
[9] Cronbach, L.J. (1951). Coefficient alpha and the internal structure of tests, Psychometrika 16, 297-334.
[10] Cronbach, L.J., Glesser, G.C., Nanda, H. & Rajaratnam, N. (1972). The Dependability of Behavioral Measurement: Theory of Generalizability for Scores and Profiles, Wiley, New York.
[11] Cronbach, L.J. & Meehl, P.E. (1955). Construct validity in psychological tests, Psychological Bulletin 52, 281-302.
[12] Flanagan, J.L. (1937). A note on calculating the standard error of measurement and reliability coefficients with the test score machine, Journal of Applied Psychology 23, 529.
[13] Glaser, R. (1963). Instructional technology and the measurement of learning outcomes: some questions, American Psychologist 18, 519-521.
[14] Guilford, J.P. (1936, 1954). Psychometric Methods, McGraw-Hill, New York.
[15] Gulliksen, H. (1950). Theory of Mental Tests, Wiley, New York.
[16] Guttman, L. (1945). A basis for analyzing test-retest reliability, Psychometrika 10, 255-282.
[17] Hambleton, R.K. (1994). The rise and fall of criterion-referenced measurement? Educational Measurement: Issues and Practice 13(4), 21-26.
[18] Hambleton, R.K. & Swaminathan, H. (1985). Item Response Theory: Principles and Applications, Kluwer, Boston.
[19] Kuder, G.F. & Richardson, M.W. (1937). The theory of the estimation of test reliability, Psychometrika 2, 151-160.
[20] Lord, F.M. & Novick, M.R. (1968). Statistical Theories of Mental Test Scores, Addison-Wesley, Reading.
[21] Magnuson, D. (1967). Test Theory, Addison-Wesley, Reading.
[22] Messick, S. (1989). Validity, in Educational Measurement, R. Linn, ed., American Council on Education, Washington, pp. 13-103.
[23] Muñiz, J., ed. (1996). Psicometría [Psychometrics], Universitas, Madrid.
[24] Muñiz, J. (1998). La medición de lo psicológico [Psychological measurement], Psicothema 10, 1-21.
[25] Muñiz, J. (2003a). Teoría Clásica de los Tests [Classical Test Theory], Pirámide, Madrid.
[26] Muñiz, J. (2003b). Classical test theory, in Encyclopedia of Psychological Assessment, R. Fernández-Ballesteros, ed., Sage Publications, London, pp. 192-198.
[27] Muñiz, J., Bartram, D., Evers, A., Boben, D., Matesic, K., Glabeke, K., Fernández-Hermida, J.R. & Zaal, J. (2001). Testing practices in European countries, European Journal of Psychological Assessment 17(3), 201-211.
[28] Rulon, P.J. (1939). A simplified procedure for determining the reliability of a test by split-halves, Harvard Educational Review 9, 99-103.
[29] Spearman, C. (1904). The proof and measurement of association between two things, American Journal of Psychology 15, 72-101.
[30] Spearman, C. (1907). Demonstration of formulae for true measurement of correlation, American Journal of Psychology 18, 161-169.
[31] Spearman, C. (1913). Correlations of sums and differences, British Journal of Psychology 5, 417-426.
[32] Stanley, J.C. (1971). Reliability, in Educational Measurement, R.L. Thorndike, ed., American Council on Education, Washington, pp. 356-442.
[33] Stevens, S.S. (1946). On the theory of scales of measurement, Science 103, 677-680.
[34] Thorndike, E.L. (1904). An Introduction to the Theory of Mental and Social Measurements, Science Press, New York.
[35] Thorndike, R.L. (1982). Applied Psychometrics, Houghton-Mifflin, Boston.
[36] Thurstone, L.L. (1931). The Reliability and Validity of Tests, Edward Brothers, Ann Arbor.
[37] Torgerson, W.S. (1958). Theory and Methods of Scaling, Wiley, New York.
[38] Van der Linden, W.J. & Hambleton, R.K., eds (1997). Handbook of Modern Item Response Theory, Springer-Verlag, New York.

JOSE MUNIZ
Classical Test Score Equating
MICHAEL J. KOLEN AND YE TONG
Volume 1, pp. 282-287
Only one test form needs to be administered on each test date. For the random groups design, multiple forms must be administered on a test date. For the single group design, each examinee must take two forms, which typically cannot be done in a regular test administration.

The common-item nonequivalent groups design tends to lead to greater test security than the other designs, because only one form needs to be administered at a given test date. With the random groups and single group designs, multiple forms are administered at a particular test date to conduct equating. However, security issues can be of concern with the common-item nonequivalent groups design, because the common items must be repeatedly administered.

The common-item nonequivalent groups design requires the strongest statistical assumptions. The random groups design requires only weak assumptions, mainly that the random assignment process was successful. The single group design requires stronger assumptions than the random groups design, in that it assumes no differential order effects.

The random groups design requires the largest sample sizes of the three designs. Assuming no differential order effects, the single group design has the smallest sample size requirements of the three designs because, effectively, each examinee serves as his or her own control.

As is evident from the preceding discussion, each of the designs has strengths and weaknesses. The choice of design depends on weighing the strengths and weaknesses with regard to the testing program under consideration. Each of these designs has been used to conduct equating in a variety of testing programs.

Statistical Methods

Equating requires that a relationship between alternate forms be estimated. Equating methods result in a transformation of scores on the alternate forms so that the scores possess specified properties. For traditional equating methods, transformations of scores are found such that for the alternate forms, after equating, the distributions, or central moments of the distributions, are the same in a population of examinees for the forms to be equated.

Traditional observed score equating methods define score correspondence on alternate forms by setting certain characteristics of score distributions equal for a specified population of examinees. In traditional equipercentile equating, a transformation is found such that, after equating, scores on alternate forms have the same distribution in a specified population of examinees. Assume that scores on Form X are to be equated to the raw score scale of Form Y. Define X as the random variable score on Form X, Y as the random variable score on Form Y, F as the cumulative distribution function of X in the population, and G as the cumulative distribution function of Y in the population. Let eY be a function that is used to transform scores on Form X to the Form Y raw score scale, and let G* be the cumulative distribution function of eY(X) in the same population. The function eY is defined to be the equipercentile equating function in the population if

G* = G.   (1)

Scores on Form X can be transformed to the Form Y scale using equipercentile equating by taking

eY(x) = G⁻¹[F(x)],   (2)

where x is a particular value of X, and G⁻¹ is the inverse of the cumulative distribution function G.

Finding equipercentile equivalents would be straightforward if the distributions of scores were continuous. However, test scores typically are discrete (e.g., number of items correctly answered). To conduct equipercentile equating with discrete scores, the percentile rank of a score on Form X is found for a population of examinees. The equipercentile equivalent of this score is defined as the score on Form Y that has the same percentile rank in the population. Owing to the discreteness of scores, the resulting equated score distributions are only approximately equal.

Because many parameters need to be estimated in equipercentile equating (percentile ranks at each Form X and Form Y score), equipercentile equating is subject to much sampling error. For this reason, smoothing methods are often used to reduce sampling error. In presmoothing methods, the score distributions are smoothed. In postsmoothing methods, the equipercentile function is smoothed. Kolen and Brennan [5] discussed a variety of smoothing methods. von Davier, Holland, and Thayer [8] presented a comprehensive set of procedures, referred to as kernel smoothing, that incorporates procedures for presmoothing score distributions, handling the discreteness of test score distributions, and estimating standard errors of equating.

Other traditional methods are sometimes used that can be viewed as special cases of the equipercentile method. In linear equating, a transformation is found that results in scores on Form X having the same mean and standard deviation as scores on Form Y. Defining μ(X) as the mean score on Form X, σ(X) as the standard deviation of Form X scores, μ(Y) as the mean score on Form Y, σ(Y) as the standard deviation of Form Y scores, and lY as the linear equating function,

lY(x) = μ(Y) + σ(Y)[x − μ(X)]/σ(X).   (3)

Unless the shapes of the score distributions for Form X and Form Y are identical, linear and equipercentile methods produce different results. However, even when the shapes of the distributions differ, equipercentile and linear methods produce similar results near the mean. When interest is in scores near the mean, linear equating often is sufficient. However, when interest is in scores all along the score scale and the sample size is large, then equipercentile equating is often preferable to linear equating.

For the random groups and single group designs, the sample data typically are viewed as representative of the population of interest, and the estimation of the traditional equating functions proceeds without needing to make strong statistical assumptions. However, estimation in the common-item nonequivalent groups design requires strong statistical assumptions. First, a population must be specified in order to define the equipercentile or linear equating relationship. Since Form X is administered to examinees from a different population than is Form Y, the population used to define the equating relationship typically is viewed as a combination of these two populations. The combined population is referred to as the synthetic population. Three common ways to define the synthetic population are to equally weight the populations from which the examinees are sampled to take Form X and Form Y, to weight the two populations by their respective sample sizes, or to define the synthetic population as the population from which examinees are sampled to take Form X. It turns out that the definition of the synthetic population typically has little effect on the final equating results. Still, it is necessary to define a synthetic population in order to proceed with traditional equating with this design.

Kolen and Brennan [5] described a few different equating methods for the common-item nonequivalent groups design. The methods differ in terms of their statistical assumptions. Define V as score on the common items. In the Tucker linear method, the linear regression of X on V is assumed to be the same for examinees taking Form X and the examinees taking Form Y. A similar assumption is made about the linear regression of Y on V. In the Levine linear observed score method, similar assumptions are made about true scores, rather than observed scores. No method exists to directly test all of the assumptions that are made using data that are collected for equating. Methods also exist for equipercentile equating under this design that make somewhat different regression assumptions.

Equating Error

Minimizing equating error is a major goal when developing tests that are to be equated, designing equating studies, and conducting equating. Random equating error is present whenever samples from populations of examinees are used to estimate equating relationships. Random error depends on the design used for data collection, the score point of interest, the method used to estimate equivalents, and the sample size. Standard errors of equating are used to index random error. Standard error equations have been developed to estimate standard errors for most designs and methods, and resampling methods like the bootstrap can also be used. In general, standard errors diminish as sample size increases. Standard errors of equating can be used to estimate required sample sizes for equating, for comparing the precision of various designs and methods, and for documenting the amount of random error in equating.

Systematic equating error results from violations of assumptions of the particular equating method used. For example, in the common-item nonequivalent groups design, systematic error will result if the Tucker method is applied and the regression-based assumptions that are made are not satisfied. Systematic error typically cannot be quantified in operational equating situations.

Equating error of both types needs to be controlled because it can propagate over equatings and result in scores on later test forms not being comparable to scores on earlier forms. Choosing a large enough sample size given the design is the best way to control random error. To control systematic error, the test must be constructed and the equating implemented so as to minimize systematic error. For example, the assumptions for any of the methods for the common-item nonequivalent groups design tend to hold better when the groups being administered the old and the new form do not differ too much from each other. The assumptions also tend to hold better when the forms to be equated are very similar to one another, and when the content and statistical characteristics of the common items closely represent the content and statistical characteristics of the total test forms. One other way to help control error is to use what is often referred to as double-linking. In double-linking, a new form is equated to two previously equated forms. The results for the two equatings often are averaged to produce a more stable equating than if only one previously equated form had been used. Double-linking also provides for a built-in check on the adequacy of the equating.

Selected Practical Issues

Owing to practical constraints, equating cannot be used in some situations where its use might be desirable. Use of any of the equating methods requires test security. In the single group and random groups designs, two or more test forms must be administered in a single test administration. If these forms become known to future examinees, then the equating and the entire testing program could be jeopardized. With the common-item nonequivalent groups design, the common items are administered on multiple test dates. If the common items become known to examinees, the equating also is jeopardized. In addition, equating requires that detailed content and statistical test specifications be used to develop the alternate forms. Such specifications are a prerequisite to conducting adequate equating.

Although the focus of this entry has been on equating multiple-choice tests that are scored number-correct, equating often can be used with tests that are scored in other ways, such as essay tests scored by human raters. The major problem with equating such tests is that, frequently, very few essay questions can be administered in a reasonable time frame, which can lead to concern about the comparability of the content from one test form to another. It also might be difficult, or impossible, when the common-item nonequivalent groups design is used, to construct common-item sections that represent the content of the complete tests.

Concluding Comments

Test form equating has as its goal to use scores from alternate test forms interchangeably. Test development procedures that have detailed content and statistical specifications allow for the development of alternate test forms that are similar to one another. These test specifications are a necessary prerequisite to the application of equating methods.

References

[1] American Educational Research Association, American Psychological Association, National Council on Measurement in Education, & Joint Committee on Standards for Educational and Psychological Testing (U.S.). (1999). Standards for Educational and Psychological Testing, American Educational Research Association, Washington.
[2] Angoff, W.H. (1971). Scales, norms, and equivalent scores, in Educational Measurement, 2nd Edition, R.L. Thorndike, ed., American Council on Education, Washington, pp. 508–600.
[3] Flanagan, J.C. (1951). Units, scores, and norms, in Educational Measurement, E.F. Lindquist, ed., American Council on Education, Washington, pp. 695–793.
[4] Holland, P.W. & Rubin, D.B. (1982). Test Equating, Academic Press, New York.
[5] Kolen, M.J. & Brennan, R.L. (1995). Test Equating: Methods and Practices, Springer-Verlag, New York.
[6] Kolen, M.J. & Brennan, R.L. (2004). Test Equating, Scaling and Linking: Methods and Practices, 2nd Edition, Springer-Verlag, New York.
[7] Petersen, N.S., Kolen, M.J. & Hoover, H.D. (1989). Scaling, norming, and equating, in Educational Measurement, 3rd Edition, R.L. Linn, ed., Macmillan Publishers, New York, pp. 221–262.
[8] von Davier, A.A., Holland, P.W. & Thayer, D.T. (2004). The Kernel Method of Test Equating, Springer-Verlag, New York.

MICHAEL J. KOLEN AND YE TONG
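The equipercentile function in (2) and the linear function in (3) can be made concrete with a short sketch. The following Python code is illustrative only and is not part of the entry: the score distributions are hypothetical, and the half-point interpolation convention for percentile ranks of discrete scores is one common choice among several.

```python
# Illustrative sketch of equipercentile and linear equating for discrete
# number-correct scores. Distributions map score -> probability.

def percentile_rank(dist, x):
    """Percentile rank of integer score x: all probability below x plus
    half the probability at x, times 100 (a common convention)."""
    below = sum(p for s, p in dist.items() if s < x)
    at = dist.get(x, 0.0)
    return 100.0 * (below + 0.5 * at)

def equipercentile(x, form_x_dist, form_y_dist):
    """Discrete analogue of e_Y(x) = G^{-1}[F(x)]: the (interpolated)
    Form Y score whose percentile rank matches that of x on Form X."""
    target = percentile_rank(x=x, dist=form_x_dist) / 100.0
    cum = 0.0
    # Walk up the Form Y scale; within the score point whose percentile
    # band contains the target, interpolate linearly over [s-0.5, s+0.5].
    for s in sorted(form_y_dist):
        p = form_y_dist[s]
        if cum + p >= target and p > 0:
            return (s - 0.5) + (target - cum) / p
        cum += p
    return max(form_y_dist)

def linear_equating(x, mu_x, sd_x, mu_y, sd_y):
    """Equation (3): l_Y(x) = mu(Y) + sd(Y) * (x - mu(X)) / sd(X)."""
    return mu_y + sd_y * (x - mu_x) / sd_x
```

When the two forms have identical score distributions, both functions reduce to the identity, as the definitions require.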
Classification and Regression Trees
BRIAN S. EVERITT
Volume 1, pp. 287–290
variable close to one or zero in each node. Most commonly used is a log-likelihood function (see Maximum Likelihood Estimation) defined for node g as

LL(g) = −2 Σ_{i∈g} Σ_{k=1}^{K} y_{ik} log(p_{gk})   (3)

where K is the number of categories of the response variable, y_{ik} is an indicator variable taking the value 1 if individual i is in category k of the response and zero otherwise, and p_{gk} is the probability of being in the kth category of the response in node g, estimated as n_{gk}/N_g, where n_{gk} is the number of individuals in category k in node g. The corresponding split function Φ(s, g) is then simply

Φ(s, g) = LL(g) − LL(g_L) − LL(g_R)   (4)

and again the chosen split is that maximizing Φ(s, g). (The split function Φ(s, g) is often referred to simply as deviance.)

Trees are grown by recursively splitting nodes to maximize Φ, leading to smaller and smaller nodes of progressively increased homogeneity. A critical question is when should tree construction end and terminal nodes be declared? Two simple stopping rules are as follows:

- Node size: stop when the node size drops below a threshold value, for example, when N_g < 10.
- Node homogeneity: stop when a node is homogeneous enough, for example, when its deviance is less than 1% of the deviance of the root node.

Neither of these is particularly attractive because they have to be judged relative to preset thresholds, misspecification of which can result in overfitting or underfitting. An alternative, more complex, approach is to use what is known as a pruning algorithm. This involves growing a very large initial tree to capture all potentially important splits, and then collapsing this back up using what is known as cost-complexity pruning to create a nested sequence of trees.

Cost-complexity pruning is a procedure that snips off the least important splits in the initial tree, where importance is judged by a measure of within-node homogeneity or cost. For a continuous variable, for example, cost would simply be the sum-of-squares term SS(g) defined earlier. The cost of the entire tree, G, is then defined as

Cost(G) = Σ_{g∈G̃} SS(g)   (5)

where G̃ is the collection of terminal nodes of G. Next we define the complexity of G as the number of its terminal nodes, say N_G̃, and finally, we can define the cost-complexity of G as

CC(G) = Cost(G) + α N_G̃   (6)

where α ≥ 0 is called the complexity parameter. The aim is to minimize simultaneously both cost and complexity; large trees will have small cost but high complexity, with the reverse being the case for small trees. Solely minimizing cost will err on the side of overfitting; for example, with SS(g) we can achieve zero cost by splitting to a point where each terminal node contains only a single observation. In practice, we use (6) by considering a range of values of α and for each find the subtree G(α) of our initial tree that minimizes CC(G). If α is small, G(α) will be large, and as α increases, N_G̃ decreases. For a sufficiently large α, N_G̃ = 1.

In this way, we are led to a sequence of possible trees and need to consider how to select the best. There are two possibilities:

- If a separate validation sample is available, we can predict on that set of observations and calculate the deviance versus α for the pruned trees. This will often have a minimum, and so the smallest tree whose sum of squares is close to the minimum can be chosen.
- If no validation set is available, one can be constructed from the observations used in constructing the tree, by splitting the observations into a number of (roughly) equally sized subsets. If n subsets are formed this way, n − 1 can be used to grow the tree and it can be tested on the remaining subset. This can be done n ways, and the results averaged.

Full details are available in Breiman et al. [1].

An Example of the Application of the CART Procedure

As an example of the application of tree-based models in a particular area, we shall use data on the birthweight of babies given in Hosmer [2]. Birthweight of babies is often a useful indicator of how they will thrive in the first few months of their life. Low birthweight, say below 2.5 kg, is often a cause of concern for their welfare. The part of the data with which we will be concerned is that involving the actual birthweight and two explanatory variables: race (white/black/other) and smoke, a binary variable indicating whether or not the mother was a smoker during pregnancy.

[Figure 1. Regression tree to predict birthweight from race and smoking status. The root split is on race (black/other versus white), each branch is then split on smoking status, and a further race split (black versus other) appears on the left-hand side.]

The regression tree for the data can be constructed using suitable software (for example, the tree function in S-PLUS (see Software for Statistical Analyses)), and the tree is displayed graphically in Figure 1. Here, the first split is on race into white and black/other. Each of the new nodes is then further split on the smoke variable into smokers and nonsmokers, and then, in the left-hand side of the tree, further nodes are introduced by splitting race into black and other. The six terminal nodes and their average birthweights are as follows:

1. black, smokers: 2504, n = 10;
2. other, smokers: 2757, n = 12;
3. other, nonsmokers: 2816, n = 55;
4. black, nonsmokers: 2854, n = 16;
5. white, smokers: 2827, n = 52;
6. white, nonsmokers: 3429, n = 44.

Here, there is evidence of a race × smoke interaction, at least for black and other women. Among smokers, black women produce babies with lower average birthweight than do other women. But for nonsmokers the reverse is the case.

References

[1] Breiman, L., Friedman, J.H., Olshen, R.A. & Stone, C.J. (1984). Classification and Regression Trees, Chapman and Hall/CRC, New York.
[2] Hosmer, D.W. & Lemeshow, S. (1989). Applied Logistic Regression, Wiley, New York.

BRIAN S. EVERITT
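The deviance criterion in (3) and (4) and the cost-complexity trade-off in (6) can be sketched in a few lines. The following Python code is illustrative only and is not part of the entry: the category counts are hypothetical, and the deviance is computed from counts n_{gk} rather than individual indicators y_{ik}, which is algebraically equivalent.

```python
# Illustrative sketch of the deviance split criterion (eqs. 3-4) and the
# cost-complexity measure (eq. 6) for a categorical response.
import math

def deviance(counts):
    """LL(g) = -2 * sum_k n_gk * log(p_gk), with p_gk = n_gk / N_g.
    Summing y_ik * log(p_gk) over the individuals i in node g collapses
    to the per-category counts, so only counts are needed."""
    n = sum(counts)
    return -2.0 * sum(c * math.log(c / n) for c in counts if c > 0)

def split_value(parent, left, right):
    """Phi(s, g) = LL(g) - LL(g_L) - LL(g_R): the reduction in deviance
    achieved by split s; the chosen split maximizes this quantity."""
    return deviance(parent) - deviance(left) - deviance(right)

def cost_complexity(terminal_costs, alpha):
    """CC(G) = Cost(G) + alpha * N, where Cost(G) is the sum of the
    per-node costs over the N terminal nodes of the tree."""
    return sum(terminal_costs) + alpha * len(terminal_costs)
```

A split that separates the categories perfectly drives the child deviances to zero, so its value equals the parent deviance; a split that reproduces the parent proportions in both children has value zero.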
Clinical Psychology
TERESA A. TREAT AND V. ROBIN WEERSING
Volume 1, pp. 290–301
smaller subset of theoretically interpretable con- of data). Thus, when your theoretical expectations are
structs, which commonly are referred to as factors sufficiently strong to place a priori constraints on the
or latent variables. For example, Walden, Harris, analysis, it typically is preferable to use the confirma-
and Catron [53] used factor analysis when develop- tory approach to evaluate the fit of your theoretical
ing How I Feel, a measure on which children report model to the data. Walden et al. [53] followed up the
the frequency and intensity of five emotions (happy, exploratory factor analysis described above by using
sad, mad, excited, and scared), as well as how well confirmatory factor analysis to demonstrate the valid-
they can control these emotions. The authors gen- ity and temporal stability of the factor structure for
erated 30 relevant items (e.g., the extent to which How I Feel.
children were scared almost all the time during the Clinical researchers also use item response
past three months) and then asked a large number theory, often in conjunction with factor-analytic app-
of children to respond to them. Exploratory factor roaches, to assist in the definition and measurement
analyses of the data indicated that three underly- of constructs [17]. A detailed description of this
ing factors, or constructs, accounted for much of the approach is beyond the scope of this article, but it
variability in childrens responses: Positive Emotion, is helpful to note that this technique highlights the
Negative Emotion, and Control. For example, the importance of inspecting item-specific measurement
unobserved Negative Emotion factor accounted par- properties, such as their difficulty level and their
ticularly well for variability in childrens responses to differential functioning as indicators of the construct
the sample item above (i.e., this item showed a large of interest. For clinical examples of the application
factor loading on the Negative Emotion factor, and of this technique, see [27] and [30].
small factor loadings on the remaining two factors). Cluster analysis is an approach to construct defi-
One particularly useful upshot of conducting a fac- nition and measurement that is closely allied to factor
tor analysis is that it produces factor scores, which analysis but exhibits one key difference. Whereas
index a participants score on each of the underlying factor analysis uncovers unobserved factors on the
latent variables (e.g., a child who experiences chronic basis of the similarity of variables, cluster analy-
sadness over which she feels little control presum- sis uncovers unobserved typologies on the basis
ably would obtain a high score on the Negative of the similarity of people. Cluster analysis entails
Emotion factor and a lot score on the Control fac- (a) selecting a set of variables that are assumed to
tor). Quantifying factor scores remains a controversial be relevant for distinguishing members of the dif-
enterprise, however, and researchers who use this ferent typologies; (b) obtaining many participants
technique should understand the relevant issues [20]. responses to these variables; and (c) using cluster-
Both Reise, Waller, and Comrey [44] and Fabri- analytic techniques to reduce the complexity among
gar, Wegener, MacCallum, and Strahan [19] provide the numerous participants to a much smaller sub-
excellent overviews of the major decisions that clin- set of theoretically interpretable typologies, which
ical researchers must make when using exploratory commonly are referred to as clusters. Represen-
factor-analytic techniques. tative recent examples of the use of this technique
Increasingly, clinical researchers are making use can be found in [21] and [24]. Increasingly, clinical
of confirmatory factor-analytic techniques when researchers also are using latent class analysis and
taxometric approaches to define typologies of clini-
defining and measuring constructs. Confirmatory app-
cal interest, because these methods are less descrip-
roaches require researchers to specify both the num-
tive and more model-based than most cluster-analytic
ber of factors and which items load on which fac-
techniques. See [40] and [6], respectively, for appli-
tors prior to inspection and analysis of the data.
cation of these techniques to defining and measuring
Exploratory factor-analytic techniques, on the other
clinical typologies.
hand, allow researchers to base these decisions in
large part on what the data indicate are the best Evaluating Differences between Either
answers. Although it may seem preferable to let the Experimentally Created or Naturally Occurring
data speak for themselves, the exploratory approach
Groups
capitalizes on sampling variability in the data, and
the resulting factor structures may be less likely to After establishing a valid measurement model for the
cross-validate (i.e., to hold up well in new samples particular theoretical constructs of interest, clinical
Clinical Psychology 3
researchers frequently evaluate hypothesized group What sets apart this class of questions about the
differences in dependent variables (DVs) using one of influence of an IV or QIV on a DV is the discrete-
many analytical models. For this class of questions, ness of the predictor; the DVs can be practically
group serves as a discrete independent or quasi- any statistic, whether means, proportions, frequen-
independent variable (IV or QIV). In experimen- cies, slopes, correlations, time until a particular event
tal research, group status serves as an IV, because occurs, and so on. Thus, many statistical techniques
participants are assigned randomly to groups, as in aim to address the same meta-level research ques-
randomized controlled trials. In quasi-experimental tion about group differences but they make differ-
research, in contrast, group status serves as a QIV, ent assumptions about the nature of the DV. For
because group differences are naturally occurring, example, clinical researchers commonly use ANOVA
as in psychopathology research, which examines the techniques to examine group differences in means
effect of diagnostic membership on various measures. (perhaps to answer question 1 above); chi-square or
Thus, when conducting quasi-experimental research, log-linear approaches to evaluate group differences
it often is unclear whether the QIV (a) causes any in frequencies (question 2; see [52]); growth-curve
of the observed group differences; (b) results from or multilevel modeling (MLM) (see Hierarchical
the observed group differences; or (c) has an illu- Models) techniques to assess group differences in
sory relationship with the DV (e.g., a third variable the intercept, slope, or acceleration parameters of a
has produced the correlation between the QIV and regression line (question 3; see [48] for an example);
the DV). Campbell and Stanley [9] provide an excel- survival analysis to investigate group differences in
lent overview of the theoretical and methodological the time to event occurrence, or survival time (ques-
issues surrounding the distinction between quasi- tion 4; see [7] and [8]); and interrupted time-series
analysis to evaluate the effect of an intervention on
experimental and experimental research and describe
the level or slope of a single participants behav-
the limits of causality inferences imposed by the use
ior within a multiple-baseline design (question 5;
of quasi-experimental research designs.
see [42] for an excellent example of the application of
In contrast to the IV or QIV, the DVs can be
this approach). Thus, these five very different analyti-
continuous or discrete and are presumed to reflect
cal models all aim to evaluate very similar theoretical
the influence of the IV or QIV. Thus, we might
models about group differences. A common exten-
be interested in (a) evaluating differences in per-
sion of these analytical models provides simultaneous
fectionism (the DV) for patients who are diag- analysis of two or more DVs (e.g., Multivariate
nosed with anorexia versus bulimia (a QIV, because Analysis of Variance (MANOVA) evaluates mean
patients are not assigned randomly to disorder type); group differences in two or more DVs).
(b) examining whether the frequency of rehospital- Many analyses of group differences necessitate
ization (never, once, two or more times) over a inclusion of one or more covariates, or variables
two-year period (the DV) varies for patients whose other than the IV or QIV that also are assumed to
psychosis was or was not treated with effective influence the DV and may correlate with the predic-
antipsychotic medication during the initial hospital- tor. For example, a researcher might be interested
ization (an IV, if drug assignment is random); (c) in evaluating the influence of medication compli-
investigating whether the rate of reduction in hyper- ance (a QIV) on symptoms (the DV), apart from the
activity (the DV) over the course of psychophar- influence of social support (the covariate). In this cir-
macological treatment with stimulants is greater for cumstance, researchers commonly use Analysis of
children whose parents are assigned randomly to Covariance (ANCOVA) to control for the influ-
implement behavioral-modification programs in their ence of the covariate on the DV. If participants are
homes (an IV); (d) assessing whether the time to a assigned randomly to levels of the IV, then ANCOVA
second suicide attempt (the DV) is shorter for patients can be useful for increasing the power of the eval-
who exhibit marked, rather than minimal, impul- uation of the effect of the IV on the DV (i.e., a
sivity (a QIV); or (e) evaluating whether a 10-day true effect is more likely to be detected). If, how-
behavioral intervention versus no intervention (an IV) ever, participants are not assigned randomly to IV
reduces the overall level of a single childs disruptive levels and the groups differ on the covariate a com-
behavior (the DV). mon circumstance in clinical research and a likely
4 Clinical Psychology
characteristic of the example above then ANCOVA variables, as well as on each of the 10 variables in iso-
rarely is appropriate (i.e., this analytical model likely lation, the authors then used DFA to predict whether
provides an invalid assessment of the researchers each girl did or did not have ADHD. DFA estimated
theoretical model). This is an underappreciated mat- a score for each girl on the weighted linear com-
ter of serious concern in psychopathology research, bination (or discriminant function) of the predictor
and readers are urged to consult [39] for an excellent variables, and the girls predicted classification was
overview of the relevant substantive issues. based on whether her score cleared a particular cut-
off value that also was estimated in the analysis. The
resulting discriminant function, or prediction equa-
Predicting Group Membership tion, then could be used in other samples or studies to
predict the diagnosis of girls for whom ADHD status
Clinical researchers are interested not only in exam- was unknown. DFA produces a two-by-two classi-
ining the effect of group differences on variables of fication table, in which the two dimensions of the
interest (as detailed in the previous section) but also table are true and predicted states (e.g., the pres-
in predicting group differences. In this third class of
ence or absence of ADHD). Clinical researchers use
research questions, group differences become the DV,
the information in this table to summarize the predic-
rather than the IV or QIV. We might be interested
tive power of the collection of variables, commonly
in predicting membership in diagnostic categories
using a percent-correct index, a combination of sen-
(e.g., schizophrenic or not) or in predicting impor-
sitivity and specificity indices, or a combination of
tant discrete clinical outcomes (e.g., whether a person
positive and negative predictive power indices. The
commits suicide, drops out of treatment, exhibits
values of these indices frequently vary as a function
partner violence, reoffends sexually after mandated
of the relative frequency of the two states of inter-
treatment, or holds down a job while receiving intensive case-management services). In both cases, the predictors might be continuous, discrete, or a mix of both. Discriminant function analysis (DFA) and logistic regression techniques commonly are used to answer these kinds of questions. Note that researchers use these methods for a purpose different than that of researchers who use the typology-definition methods discussed in the first section (e.g., cluster analysis, latent class analysis); the focus in this section is on the prediction of group membership (which already is known before the analysis), rather than the discovery of group membership (which is unknown at the beginning of the analysis).

DFA uses one or more weighted linear combinations of the predictor variables to predict group membership. For example, Hinshaw, Carte, Sami, Treuting, and Zupan [22] used DFA to evaluate how well a class of 10 neuropsychiatric variables could predict the presence or absence of attention-deficit/hyperactivity disorder (ADHD) among adolescent girls. Prior to conducting the DFA, Hinshaw and colleagues took the common first step of using MANOVA to examine whether the groups differed on a linear combination of the class of 10 variables (i.e., they first asked the group-differences question that was addressed in the previous section). Having determined that the groups differed on the class of […]est, as well as the cutoff value used for classification purposes, however. Thus, researchers increasingly are turning to alternative indices without these limitations, such as those drawn from signal-detection theory [37].

Logistic regression also examines the prediction of group membership from a class of predictor variables but relaxes a number of the restrictive assumptions that are necessary for the valid use of DFA (e.g., multivariate normality, linearity of relationships between predictors and DV, and homogeneity of variances within each group). Whereas DFA estimates a score for each case on a weighted linear combination of the predictors, logistic regression estimates the probability of one of the outcomes for each case on the basis of a nonlinear (logistic) transformation of a weighted linear combination of the predictors. The predicted classification for a case is based on whether the estimated probability clears an estimated cutoff. Danielson, Youngstrom, Findling, and Calabrese [16] used logistic regression in conjunction with signal-detection theory techniques to quantify how well a behavior inventory discriminated between various diagnostic groups. At this time, logistic regression techniques are preferred over DFA methods, given their less-restrictive assumptions.
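The logistic-regression strategy described above can be sketched in a few lines. The data and the single predictor below are hypothetical, and the bare-bones gradient-ascent fit merely stands in for the maximum-likelihood routines that statistical packages use:

```python
import math

def sigmoid(u):
    return 1.0 / (1.0 + math.exp(-u))

def fit_logistic(xs, ys, lr=0.1, steps=2000):
    """Fit P(y = 1 | x) = sigmoid(b0 + b1*x) by plain gradient ascent on the
    log-likelihood -- a toy stand-in for a package's estimation routine."""
    b0 = b1 = 0.0
    n = len(xs)
    for _ in range(steps):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            resid = y - sigmoid(b0 + b1 * x)  # observed minus predicted probability
            g0 += resid
            g1 += resid * x
        b0 += lr * g0 / n
        b1 += lr * g1 / n
    return b0, b1

# hypothetical standardized screening scores for two known diagnostic groups
xs = [-4, -3, -3, -2, -2, -1, 1, 2, 2, 3, 3, 4]
ys = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]

b0, b1 = fit_logistic(xs, ys)
# predicted classification: does the estimated probability clear a 0.5 cutoff?
preds = [1 if sigmoid(b0 + b1 * x) >= 0.5 else 0 for x in xs]
```

With several predictors, the term `b0 + b1*x` simply becomes the weighted linear combination described above; the nonlinear (logistic) transformation and the cutoff-based classification are unchanged.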
Clinical Psychology 5
Evaluating Theoretical Models That Specify a Network of Interrelated Constructs

As researchers' theoretical models for a particular clinical phenomenon become increasingly sophisticated and complex, the corresponding analytical models also increase in complexity (e.g., evaluating a researcher's theoretical models might require the simultaneous estimation of multiple equations that specify the relationships between a network of variables). At this point, researchers often turn to either multiple-regression models (MRM) (see Multiple Linear Regression) or SEM to formalize their analytical models. In these models, constructs with a single measured indicator are referred to as measured (or manifest) variables; this representation of a construct makes the strong assumption that the measured variable is a perfect, error-free indicator of the underlying construct. In contrast, constructs with multiple measured indicators are referred to as latent variables; the assumption in this case is that each measured variable is an imperfect indicator of the underlying construct and the inclusion of multiple indicators helps to reduce error.

MRM is a special case of SEM in which all constructs are treated as measured variables and includes single-equation multiple-regression approaches, path-analytic methods, and linear multilevel models techniques. Suppose, for example, that you wanted to test the hypothesis that the frequency of negative life events influences the severity of depression, which in turn influences physical health status. MRM would be sufficient to evaluate this theoretical model if the measurement model for each of these three constructs included only a single variable. SEM likely would become necessary if your measurement model for even one of the three constructs included more than one measured variable (e.g., if you chose to measure physical health status with scores on a self-report scale as well as by medical record review, because you thought that neither measure in isolation reliably and validly captured the theoretical construct of interest). Estimating SEMs requires the use of specialized software, such as LISREL, AMOS, M-PLUS, Mx, or EQS (see Structural Equation Modeling: Software).

Two types of multivariate models that are particularly central to the evaluation and advancement of theory in clinical science are those that specify either mediation or moderation relationships between three or more variables [3]. Mediation hypotheses specify a mechanism (B) through which one variable (A) influences another (C). Thus, the example in the previous paragraph proposes that severity of depression (B) mediates the relationship between the frequency of negative life events (A) and physical health (C); in other words, the magnitude of the association between negative life events and physical health should be greatly reduced once depression enters the mix. The strong version of the mediation model states that the A-B-C path is causal and complete (in our example, that negative life events cause depression, which in turn causes a deterioration in physical health) and that the relationship between A and C is completely accounted for by the action of the mediator. Complete mediation is rare in social science research, however. Instead, the weaker version of the mediation model is typically more plausible, in which the association between A and C is reduced significantly (but not eliminated) once the mediator is introduced to the model.

In contrast, moderation hypotheses propose that the magnitude of the influence of one variable (A) on another variable (C) depends on the value of a third variable (B) (i.e., moderation hypotheses specify an interaction between A and B on C). For example, we might investigate whether socioeconomic status (SES) (B) moderates the relationship between negative life events (A) and physical health (C). Conceptually, finding a significant moderating relationship indicates that the A-C relationship holds only for certain subgroups in the population, at least when the moderator is discrete. Such subgroup findings are useful in defining the boundaries of theoretical models and guiding the search for alternative theoretical models in different segments of the population.

Although clinical researchers commonly specify mediation and moderation theoretical models, they rarely design their studies in such a way as to be able to draw strong inferences about the hypothesized theoretical models (e.g., many purported mediation models are evaluated for data collected in cross-sectional designs [54], which raises serious concerns from both a logical and data-analytic perspective [14]). Moreover, researchers rarely take all the steps necessary to evaluate the corresponding analytical models. Greater attention to the relevant literature on appropriate statistical evaluation of mediation and moderation hypotheses should enhance the validity of our inferences about the corresponding theoretical models [3, 23, 28, 29].
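The weaker mediation model above can be sketched with two regressions in the spirit of Baron and Kenny [3]: the coefficient for A when predicting C alone (the total effect) should shrink once the mediator B enters the model. Everything below is illustrative; the data are simulated so that A influences C only through B, and the tiny normal-equations solver stands in for any regression routine:

```python
import random

def ols(y, X):
    """Ordinary least squares via the normal equations (X'X)b = X'y, solved by
    Gaussian elimination with partial pivoting; X is a list of rows, each
    beginning with a 1.0 for the intercept."""
    k = len(X[0])
    M = [[sum(r[i] * r[j] for r in X) for j in range(k)] +
         [sum(r[i] * yi for r, yi in zip(X, y))] for i in range(k)]
    for c in range(k):
        piv = max(range(c, k), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, k):
            f = M[r][c] / M[c][c]
            for j in range(c, k + 1):
                M[r][j] -= f * M[c][j]
    b = [0.0] * k
    for r in range(k - 1, -1, -1):
        b[r] = (M[r][k] - sum(M[r][j] * b[j] for j in range(r + 1, k))) / M[r][r]
    return b

random.seed(1)
n = 500
A_events = [random.gauss(0, 1) for _ in range(n)]            # A: negative life events
B_depress = [a + random.gauss(0, 0.5) for a in A_events]     # B: depression, driven by A
C_health = [-b + random.gauss(0, 0.5) for b in B_depress]    # C: health, driven only by B

total_effect = ols(C_health, [[1.0, a] for a in A_events])[1]                         # C ~ A
direct_effect = ols(C_health, [[1.0, a, b] for a, b in zip(A_events, B_depress)])[1]  # C ~ A + B
```

Here the direct effect of A is near zero once B is controlled, the pattern expected under complete mediation; under the weaker model it would be reduced but nonzero, and a formal test of the indirect effect would follow.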
In addition to specifying mediating or moderating relationships, clinical researchers are interested in networks of variables that are organized in a nested or hierarchical fashion. Two of the most common hierarchical, or multilevel, data structures are (a) nesting of individuals within social groups or organizations (e.g., youths nested within classrooms) or (b) nesting of observations within individuals (e.g., multiple symptom scores over time nested within patients). Prior to the 1990s, options for analyzing these nested data structures were limited. Clinical researchers frequently collapsed multilevel data into a flat structure (e.g., by disaggregating classroom data to the level of the child or by using difference scores to measure change within individuals). This strategy resulted in the loss of valuable information contained within the nested data structure and, in some cases, violated assumptions of the analytic methods (e.g., if multiple youths are drawn from the same classroom, their scores will likely be correlated and violate independence assumptions). In the 1990s, however, advances in statistical theory and computer power led to the development of MLM techniques. Conceptually, MLM can be thought of as hierarchical multiple regression, in which regression equations are estimated for the smallest (or most nested) unit of analysis and then the parameters of these regression equations are used in second-order analyses. For example, a researcher might be interested in both individual-specific and peer-group influences on youth aggression. In an MLM analysis, two levels of regression equations would be specified: (a) a first-level equation would specify the relationship of individual-level variables to youth aggression (e.g., gender, attention problems, prior history of aggression in a different setting, etc.); and (b) a second-level equation would predict variation in these individual regression parameters as a function of peer-group variables (e.g., the effect of average peer socioeconomic status (SES) on the relationship between gender and aggression). In practice, these two levels are estimated simultaneously. However, given the complexity of the models that can be evaluated using MLM techniques, it is frequently useful to map out each level of the MLM model separately. For a thorough overview of MLM techniques and available statistical packages, see the recent text by Raudenbush and Bryk [43], and for recent applications of MLM techniques in the clinical literature, see [41] and [18].

Researchers should be forewarned that numerous theoretical, methodological, and statistical complexities arise when specifying, estimating, and evaluating an analytical model to evaluate a hypothesized network of interrelated constructs, particularly when using SEM methods. Space constraints preclude description of these topics, but researchers who wish to test more complex theoretical models are urged to familiarize themselves with the following particularly important issues: (a) evaluation and treatment of missing-data patterns; (b) assessment of power for both the overall model and for individual parameters of particular interest; (c) the role of capitalization on chance and the value of cross-validation when respecifying poorly fitting models; (d) the importance of considering different models for the network of variables that make predictions identical to those of the proposed theoretical model; (e) the selection and interpretation of appropriate fit indices; and (f) model-comparison and model-selection procedures (e.g., [2, 14, 25, 32, 33, 34, 51]). Finally, researchers are urged to keep in mind the basic maxim that the strength of causal inferences is affected strongly by research design, and the experimental method applied well is our best strategy for drawing such inferences. MRM and SEM analytical techniques often are referred to as causal models, but we deliberately avoid that language here. These techniques may be used to analyze data from a variety of experimental or quasi-experimental research designs, which may or may not allow you to draw strong causal inferences.

Synthesizing and Evaluating Findings Across Studies or Data Sets

The final class of research questions that we consider is research synthesis or meta-analysis. In meta-analyses, researchers describe and analyze empirical findings across studies or data sets. As in any other research enterprise, conducting a meta-analysis (a) begins with a research question and statement of hypotheses; (b) proceeds to data collection, coding, and transformation; and (c) concludes with analysis and interpretation of findings. Meta-analytic investigations differ from other studies in that the unit of data collection is the study rather than the participant.
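Steps (b) and (c) can be sketched numerically. Assuming each study's finding has already been transformed to a common effect-size metric (the d values and sampling variances below are hypothetical), the simplest fixed-effect estimate pools studies with inverse-variance weights:

```python
import math

def fixed_effect_pool(effects, variances):
    """Inverse-variance weighted mean effect size and its standard error --
    the simplest (fixed-effect) meta-analytic estimate of a population effect."""
    weights = [1.0 / v for v in variances]
    pooled = sum(w * d for w, d in zip(weights, effects)) / sum(weights)
    se = math.sqrt(1.0 / sum(weights))
    return pooled, se

# hypothetical standardized mean differences from five treatment studies
effects = [0.30, 0.55, 0.42, 0.60, 0.25]
variances = [0.02, 0.05, 0.03, 0.04, 0.02]

pooled, se = fixed_effect_pool(effects, variances)
ci = (pooled - 1.96 * se, pooled + 1.96 * se)   # approximate 95% CI
```

When studies are heterogeneous, a random-effects model, which adds a between-study variance component, is usually more defensible than this fixed-effect sketch.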
Accordingly, data collection in meta-analysis is typically an exhaustive, well-documented literature search, with predetermined criteria for study inclusion and exclusion (e.g., requiring a minimum sample size or the use of random assignment). Following initial data collection, researchers develop a coding scheme to capture the critical substantive and methodological characteristics of each study, establish the reliability of the system, and code the findings from each investigation. The empirical results of each investigation are transformed into a common metric of effect sizes (see [5] for issues about such transformations). Effect sizes then form the unit of analysis for subsequent statistical tests. These statistical analyses may range from a simple estimate of a population effect size in a set of homogeneous studies to a complex multivariate model designed to explain variability in effect sizes across a large, diverse literature.

Meta-analytic inquiry has become a substantial research enterprise within clinical psychology, and results of meta-analyses have fueled some of the most active debates in the field. For example, in the 1980s and 1990s, Weisz and colleagues conducted several reviews of the youth therapy treatment literature, estimated population effect sizes for the efficacy of treatment versus control conditions, and sought to explain variability in these effect sizes in this large and diverse treatment literature (e.g., [56]). Studies included in the meta-analyses were coded for theoretically meaningful variables such as treatment type, target problem, and youth characteristics. In addition, studies were classified comprehensively in terms of their methodological characteristics, from the level of the study (e.g., sample size, type of control group) down to the level of each individual outcome measure, within each treatment group, within each study (e.g., whether a measure was an unnecessarily reactive index of the target problem). This comprehensive coding system allowed the investigators to test the effects of the theoretical variables of primary interest as well as to examine the influence of methodological quality on their findings. Results of these meta-analyses indicated that (a) structured, behavioral treatments outperformed unstructured, nonbehavioral therapies across the child therapy literature; and (b) psychotherapy in everyday community clinic settings was more likely to entail use of nonbehavioral treatments and to have lower effect sizes than those seen in research studies of behavioral therapies (e.g., [55]). The debate provoked by these meta-analytic findings continues, and the results have spurred research on the moderators of therapy effects and the dissemination of evidence-based therapy protocols to community settings.

As our example demonstrates, meta-analysis can be a powerful technique to describe and explain variability in findings across an entire field of inquiry. However, meta-analysis is subject to the same limitations as other analytic techniques. For example, the effects of a meta-analysis can be skewed by biased sampling (e.g., an inadequate literature review), use of a poor measurement model (e.g., an unreliable scheme for coding study characteristics), low power (e.g., an insufficiently large literature to support testing cross-study hypotheses), and data-quality problems (e.g., a substantial portion of the original studies omit data necessary to evaluate meta-analytic hypotheses, such as a description of the ethnicity of the study sample). Furthermore, most published meta-analyses do not explicitly model the nested nature of their data (e.g., effect sizes on multiple symptom measures are nested within treatment groups, which are nested within studies). Readers are referred to the excellent handbook by Cooper and Hedges [15] for a discussion of these and other key issues involved in conducting a meta-analysis and interpreting meta-analytic data.

Overarching Principles That Underlie the Use of Statistics in Clinical Psychology

Having provided an overview of the major research questions and associated analytical techniques in clinical psychology, we turn to a brief explication of four principles and associated corollaries that characterize the responsible use of statistics in clinical psychology. The intellectual history of these principles draws heavily from the work and insight of such luminaries as Jacob Cohen, Alan Kazdin, Robert MacCallum, and Paul Meehl. Throughout this section, we refer readers to more lengthy articles and texts that expound on these principles.

Principle 1: The specification and evaluation of theoretical models is critical to the rapid advancement of clinical research.
Corollary 1: Take specification of theoretical, measurement, and analytical models seriously. As theoretical models specify unobserved constructs and their interrelationships (see earlier section on defining and measuring constructs), clinical researchers must draw inferences about the validity of their theoretical models from the fit of their analytical models. Thus, the strength of researchers' theoretical inferences depends critically on the consistency of the measurement and analytical models with the theoretical models [38]. Tightening the fit between these three models may preclude the use of off-the-shelf measures or analyses, when existing methods do not adequately capture the constructs or their hypothesized interrelationships. For example, although more than 25 years of research document the outstanding psychometric properties of the BDI, the BDI emphasizes the cognitive and affective aspects of the construct of depression more than the vegetative and behavioral aspects. This measurement model may be more than sufficient for many investigations, but it would not work well for others (e.g., a study targeting sleep disturbance). Neither measurement nor analytical models are assumption-free, so we must attend to the psychometrics of measures (e.g., their reliability and validity), as well as to the assumptions of analytical models. Additionally, we must be careful to maintain the distinctions among the three models. For example, clinical researchers tend to collapse the theoretical and measurement models as work progresses in a particular area (e.g., we reify the construct of depression as the score on the BDI). McFall and Townsend [36] provide an eloquent statement of this and related issues.

Corollary 2: Pursue theory-driven, deductive approaches to addressing research questions whenever possible, and know the limitations of relying on more inductive strategies. Ad hoc storytelling about the results of innumerable exploratory data analyses is a rampant research strategy in clinical psychology. Exploratory research and data analysis often facilitate the generation of novel theoretical perspectives, but it is critical to replicate the findings and examine the validity of a new theoretical model further before taking it too seriously.

Principle 2: The heart of the clinical research enterprise lies in model (re-)specification, evaluation, and comparison.

Corollary 1: Identify the best model from a set of plausible alternatives, rather than evaluating the adequacy of a single model. Clinical researchers often evaluate a hypothesized model only by comparing it to models of little intrinsic interest, such as a null model that assumes that there is no relationship between the variables or a saturated model that accounts perfectly for the observed data. Serious concerns still may arise in regard to a model that fits significantly better than the null model and nonsignificantly worse than the saturated model, however (see [51] for an excellent overview of the issues that this model-fitting strategy raises). For example, a number of equivalent models may exist that make predictions identical to those of the model of interest [34]. Alternatively, nonequivalent alternative models may account as well or better for the observed data. Thus, methodologists now routinely recommend that researchers specify and contrast competing theoretical models (both equivalent and nonequivalent) because this forces the researcher to specify and evaluate a variety of theoretically based explanations for the anticipated findings [34, 51].

Corollary 2: Model modifications may increase the validity of researchers' theoretical inferences, but they also may capitalize on sampling variability. When the fit of a model is less than ideal, clinical researchers often make post hoc modifications to the model that improve its fit to the observed data set. For example, clinical researchers who use SEM techniques often delete predictor variables, modify the links between variables, or alter the pattern of relationships between error terms. Other analytic techniques also frequently suffer from similar overfitting problems (e.g., stepwise regression (see Regression Models), DFA). These data-driven modifications improve the fit of the model significantly and frequently can be cast as theoretically motivated. However, these changes may do little more than capitalize on systematic but idiosyncratic aspects of the sample data, in which case the new model may not generalize well to the population as a whole [33, 51]. Thus, it is critical to cross-validate respecified models by evaluating their adequacy with data from a new sample; alternatively, researchers might develop a model on a randomly selected subset of the sample and then cross-validate the resulting model on the remaining participants. Moreover, to be more certain that the theoretical assumptions about the need for the modifications are on target, it is important to evaluate the novel theoretical implications of the modified model with additional data sets.
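The split-sample strategy just described can be sketched as follows; the data are simulated, and a one-predictor regression stands in for whatever model is being respecified. The model is developed on a random half of the sample and its fit is then checked on the remaining participants:

```python
import random

def fit_line(pairs):
    """Least-squares intercept and slope for y regressed on x."""
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    sxx = sum((x - mx) ** 2 for x, _ in pairs)
    b1 = sxy / sxx
    return my - b1 * mx, b1

def mse(pairs, b0, b1):
    """Mean squared error of the fitted line on a data set."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in pairs) / len(pairs)

random.seed(7)
data = [(i / 10, 2.0 * (i / 10) + random.gauss(0, 1)) for i in range(100)]
random.shuffle(data)
develop, holdout = data[:50], data[50:]   # develop on one random half ...

b0, b1 = fit_line(develop)
fit_develop = mse(develop, b0, b1)
fit_holdout = mse(holdout, b0, b1)        # ... cross-validate on the other
```

A respecified model that fit the development half largely by capitalizing on chance would show a markedly worse fit in the holdout half.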
Principle 3: Mastery of research design and the mechanics of statistical techniques is critical to the validity of researchers' statistical inferences.

Corollary 1: Know your data. Screening data is a critical first step in the evaluation of any analytical model. Inspect and address patterns of missing data (e.g., pair-wise deletion, list-wise deletion, estimation of missing data). Evaluate the assumptions of statistical techniques (e.g., normality of distributions of errors, absence of outliers, linearity, homogeneity of variances) and resolve any problems (e.g., make appropriate data transformations, select alternative statistical approaches). Tabachnick and Fidell [50] provide an outstanding overview of the screening process in the fourth chapter of their multivariate text.

Corollary 2: Know the power of your tests. Jacob Cohen [10] demonstrated more than four decades ago that the power to detect hypothesized effects was dangerously low in clinical research, and more recent evaluations have come to shockingly similar conclusions [47, 49]. Every clinical researcher should understand how sample size, effect size, and alpha affect power; how low power increases the likelihood of erroneously rejecting our theoretical models; and how exceedingly high power may lead us to retain uninteresting theoretical models. Cohen's [12] power primer is an excellent starting place for the faint of heart.

Corollary 3: Statistics can never take you beyond your methods. First, remember GIGO (garbage in, garbage out): Running statistical analyses on garbage measures invariably produces garbage results. Know and care deeply about the psychometric properties of your measures (e.g., various forms of reliability, validity, and generalizability; see [26] for a comprehensive overview). Second, note that statistical techniques rarely can eliminate confounds in your research design (e.g., it is extremely difficult to draw compelling causal inferences from quasi-experimental research designs). If your research questions demand quasi-experimental methods, familiarize yourself with designs that minimize threats to the internal and external validity of your conclusions [9, 26].

Principle 4: Know the limitations of Null-Hypothesis Statistical Testing (NHST).

Corollary 1: The alternative or research hypotheses tested within the NHST framework are very imprecise and almost always true at a population level. With enough power, almost any two means will differ significantly, and almost any two variables will show a statistically significant correlation. This weak approach to the specification and evaluation of theoretical models makes it very difficult to reject or falsify a theoretical model, or to distinguish between two theoretical explanations for the same phenomena. Thus, clinical researchers should strive to develop and evaluate more precise and risky predictions about clinical phenomena than those traditionally examined with the NHST framework [11, 13, 31, 38]. When the theoretical models in a particular research area are not advanced enough to allow more precise predictions, researchers are encouraged to supplement NHST results by presenting confidence intervals around sample statistics [31, 35].

Corollary 2: P values do not tell you the likelihood that either the null or alternative hypothesis is true. P values specify the likelihood of observing your findings if the null hypothesis is true, not the likelihood that the null hypothesis is true, given your findings. Similarly, (1.0 − p) is not equivalent to the likelihood that the alternative hypothesis is true, and larger values of (1.0 − p) do not mean that the alternative hypothesis is more likely to be true [11, 13]. Thus, as Abelson [1] says, "Statistical techniques are aids to (hopefully wise) judgment, not two-valued logical declarations of truth or falsity" (pp. 9–10).

Corollary 3: Evaluate practical significance as well as statistical significance. The number of tabular asterisks in your output (i.e., the level of significance of your findings) is influenced strongly by your sample size and indicates more about reliability than about the practical importance of your findings [11, 13, 38]. Thus, clinical researchers should report information on the practical significance, or magnitude, of their effects, typically by presenting effect-size indices and the confidence intervals around them [13, 45, 46].
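As an illustration of reporting magnitude alongside significance, the sketch below computes Cohen's d for a hypothetical treatment-control comparison and an approximate 95% confidence interval around it, using the common large-sample standard error for d (all numbers are invented):

```python
import math

def cohens_d(mean1, mean2, sd_pooled):
    """Standardized mean difference (Cohen's d)."""
    return (mean1 - mean2) / sd_pooled

def d_confidence_interval(d, n1, n2, z=1.96):
    """Approximate 95% CI for d, using the large-sample standard error
    sqrt((n1 + n2) / (n1 * n2) + d^2 / (2 * (n1 + n2)))."""
    se = math.sqrt((n1 + n2) / (n1 * n2) + d ** 2 / (2 * (n1 + n2)))
    return d - z * se, d + z * se

# hypothetical treatment-vs-control comparison: means 24 and 20, pooled SD 8
d = cohens_d(24.0, 20.0, 8.0)            # a "medium" effect of 0.5
lo, hi = d_confidence_interval(d, 50, 50)
```

A significant test statistic paired with an interval this wide indicates an effect that is reliably nonzero but imprecisely estimated, which is the distinction Corollary 3 draws between reliability and practical importance.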
Researchers also should evaluate the adequacy of an effect's magnitude by considering the domain of application (e.g., a small but reliable effect size on mortality indices is nothing to scoff at!).

Conclusions

Rapid advancement in the understanding of complex clinical phenomena places heavy demands on clinical researchers for thoughtful articulation of theoretical models, methodological expertise, and statistical rigor. Thus, the next generation of clinical psychologists likely will be recognizable in part by their quantitative sophistication. In this article, we have provided an overview of the use of statistics in clinical psychology that we hope will be particularly helpful for students and early career researchers engaged in advanced statistical and methodological training. To facilitate use for teaching and training purposes, we organized the descriptive portion of the article around core research questions addressed in clinical psychology, rather than adopting alternate organizational schemes (e.g., grouping statistical techniques on the basis of mathematical similarity). In the second portion of the article, we synthesized the collective wisdom of statisticians and methodologists who have been critical in shaping our own use of statistics in clinical psychological research. Readers are urged to consult the source papers of this section for thoughtful commentary relevant to all of the issues raised in this article.

References

[1] Abelson, R.P. (1995). Statistics as Principled Argument, Lawrence Erlbaum Associates, Hillsdale.
[2] Allison, P.D. (2003). Missing data techniques for structural equation modeling, Journal of Abnormal Psychology 112, 545–557.
[3] Baron, R.M. & Kenny, D.A. (1986). The moderator-mediator variable distinction in social psychological research: conceptual, strategic, and statistical considerations, Journal of Personality and Social Psychology 51, 1173–1182.
[4] Beck, A.T., Steer, R.A. & Brown, G.K. (1996). Manual for the Beck Depression Inventory, 2nd Edition, The Psychological Corporation, San Antonio.
[5] Becker, B.J., ed. (2003). Special section: metric in meta-analysis, Psychological Methods 8, 403–467.
[6] Blanchard, J.J., Gangestad, S.W., Brown, S.A. & Horan, W.P. (2000). Hedonic capacity and schizotypy revisited: a taxometric analysis of social anhedonia, Journal of Abnormal Psychology 109, 87–95.
[7] Brent, D.A., Holder, D., Kolko, D., Birmaher, B., Baugher, M., Roth, C., Iyengar, S. & Johnson, B.A. (1997). A clinical psychotherapy trial for adolescent depression comparing cognitive, family, and supportive therapy, Archives of General Psychiatry 54, 877–885.
[8] Brown, G.K., Beck, A.T., Steer, R.A. & Grisham, J.R. (2000). Risk factors for suicide in psychiatric outpatients: a 20-year prospective study, Journal of Consulting and Clinical Psychology 68, 371–377.
[9] Campbell, D.T. & Stanley, J.C. (1966). Experimental and Quasi-experimental Designs for Research, Rand McNally, Chicago.
[10] Cohen, J. (1962). The statistical power of abnormal-social psychological research: a review, Journal of Abnormal and Social Psychology 65, 145–153.
[11] Cohen, J. (1990). Things I have learned (so far), American Psychologist 45, 1304–1312.
[12] Cohen, J. (1992). A power primer, Psychological Bulletin 112, 155–159.
[13] Cohen, J. (1994). The earth is round (p < .05), American Psychologist 49, 997–1003.
[14] Cole, D.A. & Maxwell, S.E. (2003). Testing mediational models with longitudinal data: questions and tips in the use of structural equation modeling, Journal of Abnormal Psychology 112, 558–577.
[15] Cooper, H. & Hedges, L.V., eds (1994). The Handbook of Research Synthesis, Sage, New York.
[16] Danielson, C.K., Youngstrom, E.A., Findling, R.L. & Calabrese, J.R. (2003). Discriminative validity of the general behavior inventory using youth report, Journal of Abnormal Child Psychology 31, 29–39.
[17] Embretson, S.E. & Reise, S.P. (2000). Item Response Theory for Psychologists, Lawrence Erlbaum Associates, Hillsdale.
[18] Espelage, D.L., Holt, M.K. & Henkel, R.R. (2003). Examination of peer-group contextual effects on aggression during early adolescence, Child Development 74, 205–220.
[19] Fabrigar, L.R., Wegener, D.T., MacCallum, R.C. & Strahan, E.J. (1999). Evaluating the use of exploratory factor analysis in psychological research, Psychological Methods 4, 272–299.
[20] Grice, J.W. (2001). Computing and evaluating factor scores, Psychological Methods 6, 430–450.
[21] Grilo, C.M., Masheb, R.M. & Wilson, G.T. (2001). Subtyping binge eating disorder, Journal of Consulting and Clinical Psychology 69, 1066–1072.
[22] Hinshaw, S.P., Carte, E.T., Sami, N., Treuting, J.J. & Zupan, B.A. (2002). Preadolescent girls with attention-deficit/hyperactivity disorder: II. Neuropsychological performance in relation to subtypes and individual classification, Journal of Consulting and Clinical Psychology 70, 1099–1111.
[23] Holmbeck, G.N. (1997). Toward terminological, conceptual, and statistical clarity in the study of mediators and
moderators: examples from the child-clinical and pediatric psychology literature, Journal of Consulting and Clinical Psychology 65, 599–610.
[24] Holtzworth-Munroe, A., Meehan, J.C., Herron, K., Rehman, U. & Stuart, G.L. (2000). Testing the Holtzworth-Munroe and Stuart (1994) Batterer Typology, Journal of Consulting and Clinical Psychology 68, 1000–1019.
[25] Hu, L. & Bentler, P.M. (1998). Fit indices in covariance structure modeling: sensitivity to underparameterized model misspecification, Psychological Methods 3, 424–452.
[26] Kazdin, A.E. (2003). Research Design in Clinical Psychology, Allyn and Bacon, Boston.
[27] Kim, Y., Pilkonis, P.A., Frank, E., Thase, M.E. & Reynolds, C.F. (2002). Differential functioning of the Beck Depression Inventory in late-life patients: use of item response theory, Psychology & Aging 17, 379–391.
[28] Kraemer, H.C., Stice, E., Kazdin, A., Offord, D. & Kupfer, D. (2001). How do risk factors work together? Mediators, moderators, and independent, overlapping, and proxy risk factors, American Journal of Psychiatry 158, 848–856.
[29] Kraemer, H.C., Wilson, T., Fairburn, C.G. & Agras, W.S. (2002). Mediators and moderators of treatment effects in randomized clinical trials, Archives of General Psychiatry 59, 877–883.
[30] Lambert, M.C., Schmitt, N., Samms-Vaughan, M.E., An, J.S., Fairclough, M. & Nutter, C.A. (2003). Is it prudent to administer all items for each child behavior checklist cross-informant syndrome? Evaluating the psychometric properties of the youth self-report dimensions with confirmatory factor analysis and item response theory, Psychological Assessment 15, 550–568.
[31] Loftus, G.R. (1996). Psychology will be a much better science when we change the way we analyze data, Current Directions in Psychological Science 5, 161–171.
[32] MacCallum, R.C. & Austin, J.T. (2000). Applications of structural equation modeling in psychological research, Annual Review of Psychology 51, 201–226.
[33] MacCallum, R.C., Roznowski, M. & Necowitz, L.B. (1992). Model modifications in covariance structure
[38] Meehl, P.E. (1978). Theoretical risks and tabular asterisks: Sir Karl, Sir Ronald, and the slow progress of soft psychology, Journal of Consulting and Clinical Psychology 46, 806–834.
[39] Miller, G.A. & Chapman, J.P. (2001). Misunderstanding analysis of covariance, Journal of Abnormal Psychology 110, 40–48.
[40] Nelson, C.B., Heath, A.C. & Kessler, R.C. (1998). Temporal progression of alcohol dependence symptoms in the U.S. household population: results from the national comorbidity survey, Journal of Consulting and Clinical Psychology 66, 474–483.
[41] Peeters, F., Nicolson, N.A., Berkhof, J., Delespaul, P. & deVries, M. (2003). Effects of daily events on mood states in major depressive disorder, Journal of Abnormal Psychology 112, 203–211.
[42] Quesnel, C., Savard, J., Simard, S., Ivers, H. & Morin, C.M. (2003). Efficacy of cognitive-behavioral therapy for insomnia in women treated for nonmetastatic breast cancer, Journal of Consulting and Clinical Psychology 71, 189–200.
[43] Raudenbush, S.W. & Bryk, A.S. (2002). Hierarchical Linear Models: Applications and Data Analysis Methods, 2nd Edition, Sage, Thousand Oaks.
[44] Reise, S.P., Waller, N.G. & Comrey, A.L. (2000). Factor analysis and scale revision, Psychological Assessment 12, 287–297.
[45] Rosenthal, R., Rosnow, R.L. & Rubin, D.B. (2000). Contrasts and Effect Sizes in Behavioral Research, Cambridge University Press, Cambridge.
[46] Rosnow, R.L. & Rosenthal, R. (2003). Effect sizes for experimenting psychologists, Canadian Journal of Experimental Psychology 57, 221–237.
[47] Rossi, J.S. (1990). Statistical power of psychological research: what have we gained in 20 years? Journal of Consulting and Clinical Psychology 58, 646–656.
[48] Scott, K.L. & Wolfe, D.A. (2003). Readiness to change as a predictor of outcome in batterer treatment, Journal of Consulting and Clinical Psychology 71, 879–889.
[49] Sedlmeier, P. & Gigerenzer, G. (1989). Do studies of
analysis: the problem of capitalization on chance, Psy- statistical power have an effect on the power of studies?
chological Bulletin 111, 490504. Psychological Bulletin 105, 309316.
[34] MacCallum, R.C., Wegener, D.T., Uchino, B.N. & Fab- [50] Tabachnick, B.G. & Fidell, L.S. (2001). Using Multi-
rigar, L.R. (1993). The problem of equivalent models in variate Statistics, Allyn and Bacon, Boston.
applications of covariance structure analysis, Psycholog- [51] Tomarken, A.J. & Waller, N.G. (2003). Potential prob-
ical Bulletin 114, 185199. lems with well fitting models, Journal of Abnormal
[35] Masson, M.E.J. & Loftus, G.R. (2003). Using confidence Psychology 112, 578598.
intervals for graphically based data interpretation, Cana- [52] von Eye, A. & Schuster, C. (2002). Log-linear models
dian Journal of Experimental Psychology 57, 203220. for change in manifest categorical variables, Applied
[36] McFall, R.M. & Townsend, J.T. (1998). Foundations Developmental Science 6, 1223.
of psychological assessment: Implications for cognitive [53] Walden, T.A., Harris, V.S. & Catron, T.F. (2003). How
assessment in clinical science, Psychological Assessment I feel: a self-report measure of emotional arousal and
10, 316330. regulation for children, Psychological Assessment 15,
[37] McFall, R.M. & Treat, T.A. (1999). Quantifying the 399412.
information value of clinical assessments with signal [54] Weersing, V. & Weisz, J.R. (2002). Mechanisms of
detection theory, Annual Review of Psychology 50, action in youth psychotherapy, Journal of Child Psy-
215241. chology & Psychiatry & Allied Disciplines 43, 329.
12 Clinical Psychology
[55] Weisz, J.R., Donenberg, G.R., Han, S.S. & Weiss, B. and adolescents revisited: a meta-analysis of treatment
(1995). Bridging the gap between laboratory and clinic outcome studies, Psychological Bulletin 117, 450468.
in child and adolescent psychotherapy, Journal of Con-
sulting and Clinical Psychology 63, 688701. TERESA A. TREAT AND V. ROBIN WEERSING
[56] Weisz, J.R., Weiss, B., Han, S.S., Granger, D.A. & Mor-
ton, T. (1995). Effects of psychotherapy with children
Clinical Trials and Intervention Studies
EMMANUEL LESAFFRE AND GEERT VERBEKE
Volume 1, pp. 301–305
case (s)he will be in the control group. But other randomization schemes exist, like block and stratified randomization (see Block Random Assignment; Stratification). It is important to realize that randomization can only guarantee balance for large studies and that random imbalance can often occur in small studies.

For several types of intervention studies, balance at baseline is a sufficient condition for an interpretable result at the end. However, in a clinical trial we need to be more careful. Indeed, while most interventions aim to achieve a change in attitude (a psychological effect), medical treatments need to show their effectiveness apart from their psychological impact, which is also called the placebo effect. The placebo effect is the pure psychological effect that a medical treatment can have on a patient. This effect can be measured by administering placebo (inactive medication with the same taste, texture, etc. as the active medication) to patients who are blinded to the fact that they haven't received active treatment. Placebo-controlled trials, that is, trials with a placebo group as control, are quite common. When only the patient is unaware of the administered treatment, the study is called single-blinded. Sometimes the treating physician also needs to be blinded, if possible, in order to avoid bias in scoring the effect and safety of the medication. When patients as well as physician(s) are blinded, we call it a double-blinded clinical trial. Such a trial allows distinguishing the biological effect of a drug from its psychological effect.

The advantage of randomization (plus blinding in a clinical trial) is that the analysis of the results can often be done with simple statistical techniques such as an unpaired t test for continuous measurements or a chi-squared test for categorical variables. This is in contrast to the analysis of controlled observational cohort studies, where regression models are needed to take care of the imbalance at baseline since subjects are often self-selected into the two groups.

To evaluate the effect of the intervention, a specific outcome needs to be chosen. In the context of clinical trials, this outcome is called the endpoint. It is advisable to choose one endpoint, the primary endpoint, to avoid multiple-testing issues. If this is not possible, then a correction for multiple testing such as a Bonferroni adjustment (see Multiple Comparison Procedures) is needed. The choice of the primary endpoint has a large impact on the design of the study, as will be exemplified in the section Typical Aspects of Clinical Trials. Further, it is important that the intervention study is able to detect the anticipated effect of the intervention with a high probability. To this end, the necessary sample size needs to be determined such that the power is high enough (in clinical trials, the minimal value nowadays equals 0.80).

Although not a statistical issue, it is clear that any intervention study should be ethically sound. For instance, an intervention study is being set up in South Africa where, on the one hand, adolescents are given guidelines on how to avoid HIV transmission and, on the other hand, for ethical reasons, adolescents are given general guidelines to live a healthier life (like no smoking, etc.). In clinical trials, ethical considerations are even more of an issue. Therefore, patients are supposed to sign an informed consent document.

Typical Aspects of Clinical Trials

The majority of clinical trials are drug trials. It is important to realize that it takes many years of clinical research and often billions of dollars to develop and register a new drug. In this context, clinical trials are essential, partly because regulatory bodies like the Food and Drug Administration (FDA) in the United States and the European Medicines Agency (EMEA) in Europe have imposed stringent criteria on the pharmaceutical industry before a new drug can be registered. Further, the development of a new drug involves different steps, such that drug trials are typically subdivided into phases. Four phases are often distinguished. Phase I trials are small, often involve volunteers, and are designed to learn about the drug, like establishing a safe dose, establishing the schedule of administration, and so on. Phase II trials build on the results of phase I trials and study the characteristics of the medication with the purpose of examining whether the treatment should be used in large-scale randomized studies. Phase II designs usually involve patients, are sometimes double-blind and randomized, but most often not placebo-controlled. When a drug shows a reasonable effect, it is time to compare it to a placebo or standard treatment; this is done in a phase III trial. This phase is the most rigorous and extensive part of the investigation of the drug. Most often, phase III studies are double-blind, controlled, randomized, and involve many centers (often hospitals); it is the typical controlled clinical trial as
introduced above. The size of a phase III trial will depend on the anticipated effect of the drug. Such studies are the basis for registration of the medication. After approval of the drug, large-scale studies are needed to monitor for (rare) adverse effects; these belong to the phase IV development stage.

The typical clinical trial design varies with the phase of the drug development. For instance, in phase I studies, an analysis of variance design comparing the different doses is often encountered. In phase II studies, crossover designs, whereby patients are randomly assigned to treatment sequences, are common. In phase III studies, the most common design is the simple parallel-group design, where two groups of patients are studied over time after drug administration. Occasionally, three or more groups are compared; when two (or more) types of treatments are combined, a factorial design is popular, allowing the estimation of the effects of each type of treatment.

Many phase III trials need a lot of patients and take a long time to give a definite answer about the efficacy of the new drug. For economic as well as ethical reasons, one might be interested in having an idea of the effect of the new drug before the planned number of patients is recruited and/or studied over time. For this reason, one might want to have interim looks at the data, called interim analyses. A clinical trial with planned interim analyses has a so-called group-sequential design, indicating that specific statistical (correction for multiple testing) and practical (interim meetings and reports) actions are planned. Usually, this is taken care of by an independent committee, called the Data and Safety Monitoring Board (DSMB). The DSMB consists of clinicians and statisticians overseeing the efficacy but especially the safety of the new drug.

Most clinical trials are superiority trials, with the aim to show a better performance of the new drug compared to the control drug. When the control drug is not placebo but a standard active drug, and it is conceived to be difficult to improve upon the efficacy of that standard drug, one might consider showing that the new drug has comparable efficacy. When the new drug is believed to have comparable efficacy and has other advantages, for example, a much cheaper cost, a noninferiority trial is an option. For a noninferiority trial, the aim is to show that the new medication is not (much) worse than the standard treatment (see Equivalence Trials). Currently, noninferiority trials are becoming quite frequent due to the difficulty of improving upon existing therapies.

The choice of the primary endpoint can have a large impact on the design of the study. For instance, changing from a binary outcome evaluating short-term survival (say, at 30 days) to survival time as endpoint not only changes the statistical test from a chi-square test to, say, a logrank test but can also have a major practical impact on the trial. For instance, with long-term survival as endpoint, a group-sequential design might become a necessity.

Despite the fact that most clinical trials are carefully planned, many problems can occur during the conduct of the study. Some examples are as follows: (a) patients who do not satisfy the inclusion and/or exclusion criteria are included in the trial; (b) a patient is randomized to treatment A but has been treated with B; (c) some patients drop out from the study; (d) some patients are not compliant, that is, do not take their medication as instructed, and so on. Because of these problems, one might be tempted to restrict the comparison of the treatments to the ideal patients, that is, those who adhered perfectly to the clinical trial instructions as stipulated in the protocol. This population is classically called the per-protocol population and the analysis is called the per-protocol analysis. A per-protocol analysis envisages determining the biological effect of the new drug. However, by restricting the analysis to a selected patient population, it does not show the practical value of the new drug. Therefore, regulatory bodies push the intention-to-treat (ITT) analysis forward. In the ITT population, none of the patients is excluded and patients are analyzed according to the randomization scheme. Although medical investigators often have difficulties in accepting the ITT analysis, it is the pivotal analysis for the FDA and EMEA.

Although the statistical techniques employed in clinical trials are often quite simple, recent statistical research has tackled specific and difficult clinical trial issues, like dropouts, compliance, noninferiority studies, and so on. Probably the most important problem is the occurrence of dropout in a clinical trial. For instance, when patients drop out before a response can be obtained, they cannot be included in the analysis, not even in an ITT analysis. When patients are examined on a regular basis, a series of measurements is obtained. In that case, the measurements obtained before the patient dropped out can be used
to establish the unknown measurement at the end of the study. The FDA has for a long time recommended the Last-Observation-Carried-Forward (LOCF) method. Recent research shows that this method gives a biased estimate of the treatment effect and underestimates the variability of the estimated result [6]. More sophisticated methods are reviewed in, for example, [7] (see Missing Data; Linear Multilevel Models; Generalized Linear Models (GLM)).

Further Reading

An excellent source for clinical trial methodology is [5]. Intervention studies operating at the group level have gained in importance over the last decade; for an overview of these designs, we refer to [3].
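The power requirement discussed above (at least 0.80 in clinical trials) drives the sample-size calculation mentioned in this article. As a sketch only, the standard normal-approximation formula for comparing two means, n = 2(z(1−α/2) + z(power))²σ²/Δ² per group, can be coded as follows; the function name and the two-sample setting are illustrative assumptions, not taken from the article:

```python
from math import ceil
from statistics import NormalDist

def n_per_group(delta, sigma, alpha=0.05, power=0.80):
    """Approximate patients per arm for a two-sided two-sample z-test
    detecting a mean difference `delta` when the outcome SD is `sigma`."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_beta = NormalDist().inv_cdf(power)           # quantile matching the power
    return ceil(2 * ((z_alpha + z_beta) * sigma / delta) ** 2)

# A standardized effect of 0.5 at 80% power and alpha = 0.05:
print(n_per_group(delta=0.5, sigma=1.0))  # → 63
```

Doubling the anticipated effect roughly quarters the required sample size, which is one reason the choice of primary endpoint weighs so heavily on trial design.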
There are many ways to define d(xj, xk) (see Proximity Measures). New ones can be readily constructed. The suitability of any given measure is largely in the hands of the researcher, and is determined by the data, objectives of the analysis, … the Hamming distance is used:

    d(xj, xk) = Σ(i = 1, …, n) (1 − δ(xki, xji)),
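Assuming δ here is the Kronecker delta (equal to 1 exactly when the two codes agree), the Hamming distance simply counts mismatching coordinates. A minimal sketch, with illustrative names:

```python
def hamming(xj, xk):
    """Hamming distance: the number of coordinates in which two
    categorical vectors disagree, i.e. the sum of (1 - delta(xki, xji))."""
    if len(xj) != len(xk):
        raise ValueError("vectors must have the same length")
    return sum(1 for a, b in zip(xj, xk) if a != b)

# Two subjects answering three categorical questions:
print(hamming(["yes", "no", "no"], ["yes", "yes", "no"]))  # → 1
```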
Again, if a statistical package does not support a desired distance measure, it will usually accept an external procedure to compute D(c1 ∪ c2, c3) from D(c1, c3) and D(c2, c3).

The choice of distance measure has a significant impact on the result of a clustering procedure. This choice is usually dictated by the subject domain, and all reasonable possibilities have to be carefully investigated. One important case, which leads to a unique (and, in some sense, ideal) clusterization, is the ultrametricity of the distance (see Ultrametric Inequality). The distance d(x, y) is called ultrametric if it satisfies the requirement d(x, z) ≤ max(d(x, y), d(y, z)). This requirement is stronger than the triangle inequality d(x, z) ≤ d(x, y) + d(y, z), and implies a number of good properties. Namely, the clusters constructed by the hierarchical clustering algorithm (described below) have the properties: (a) the distance between two members of two clusters does not depend on the choice of these members, that is, if x and x′ are vectors corresponding to members of a cluster c1, and y and y′ correspond to members of a cluster c2, then d(x, y) = d(x′, y′); moreover, all the distances between clusters defined above coincide, and are equal to the distance between any pair of their members, Ds(c1, c2) = Dc(c1, c2) = Da(c1, c2) = d(x, y); (b) the distance between any two members of one cluster is smaller than the distance between any member of this cluster and any member of another cluster, d(x, x′) ≤ d(x, y).

This is a symmetric matrix with the diagonal elements equal to 0; thus, it can be stored as an upper triangle matrix. A step consists of the following:

1. Find the smallest element of this matrix; let it be in row j and column j′. As the matrix is upper triangular, we always have j < j′. The clusters j and j′ are to be combined on this step. Combining may be considered as removal of the cluster j′ and replacement of cluster j by the union of clusters j and j′.
2. Remove row j′ and column j′ from the matrix and recalculate the values in row j and column j using (10).

Here, one can see the importance of the property of distance between clusters given by (10): the distances in the matrix at the beginning of a step are sufficient to calculate the new distances; one does not need to know the distances between all members of the clusters.

The results of a hierarchical cluster analysis are presented as a dendrogram. One axis of a dendrogram is the intercluster distance; the identity of objects is displayed on the other.

Horizontal lines in Figure 1 connect a cluster with its parent cluster; the lengths of the lines indicate distances between the clusters. Where does one slice the dendrogram? It can be at a prespecified measure of dissimilarity, or at a point that yields a certain number of clusters. Wallace [24] suggests stopping at a point on the dendrogram where limbs are long and there are not many branches (see Number of Clusters).
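The merge step described above can be sketched directly on a distance matrix. Since the update formula (10) is not reproduced in this excerpt, single linkage (taking the minimum) stands in for it here; the function and variable names are illustrative:

```python
def agglomerate(dist, link=min):
    """Naive agglomerative clustering on a symmetric distance matrix.

    Each step finds the smallest off-diagonal entry, merges the two
    clusters j and j2, and recomputes the merged cluster's distances
    to the rest. Single linkage (min) stands in for the article's
    update rule (10), which is not shown in this excerpt.
    """
    clusters = [[i] for i in range(len(dist))]
    d = [row[:] for row in dist]  # working copy of the distance matrix
    merges = []
    while len(clusters) > 1:
        # find the smallest element of the (upper-triangular) matrix
        j, j2 = min(
            ((a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters))),
            key=lambda p: d[p[0]][p[1]],
        )
        merges.append((clusters[j], clusters[j2], d[j][j2]))
        clusters[j] = clusters[j] + clusters[j2]  # replace j by the union
        del clusters[j2]                          # remove cluster j2
        for m in range(len(d)):                   # update row and column j
            d[j][m] = d[m][j] = link(d[j][m], d[j2][m])
        del d[j2]
        for row in d:
            del row[j2]
    return merges

# Toy example: three points on a line at 0, 1, and 5.
dist = [[0, 1, 5], [1, 0, 4], [5, 4, 0]]
print(agglomerate(dist))  # → [([0], [1], 1), ([0, 1], [2], 4)]
```

Each merge records the two clusters combined and the height at which they fuse, which is exactly the information a dendrogram displays.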
Cluster Analysis: Overview
Figure 1 Dendrogram (six cases, Case 1–Case 6)
Depending on the choice of the distance measure between clusters, the resulting clusterization possesses different properties. Single-linkage clustering tends to produce an elongated chain of clusters. Because of chaining, single-linkage clustering has fallen out of favor [16], though it has attractive theoretical properties. In one dimension, this distance metric seems the obvious choice. It is also related to the minimum spanning tree (MST) [7]. The MST is the graph of minimum length connecting all data points. Single-linkage clusters can be arrived at by successively deleting links in the MST [10]. Single linkage is consistent in one dimension [10]. Complete-linkage and average-linkage algorithms work best when the data has a strong clustering tendency.

k-means Clustering

In hierarchical clustering, the number of clusters is not known a priori. In k-means clustering, suitable only for quantitative data, the number of clusters, k, is assumed known (see k-means Analysis). Every cluster ck is characterized by its center, which is the point vk that minimizes the sum

    Σ(xj ∈ ck) d(xj, vk).   (12)

Again, it is possible to use different notions of distance; however, k-means clustering is significantly more sensitive to the choice of distance, as the minimization of (12) may be a very difficult problem. Usually, the squared Euclidean distance is used; in this case, the center of a cluster is just its center of gravity, which is easy to calculate.

Objects are allocated to clusters so that the sum of distances from objects to the centers of the clusters to which they belong, taken over all clusters, is minimal. Mathematically, this means minimization of

    Σ(k = 1, …, K) Σ(xj ∈ ck) d(xj, vk),   (13)

which depends on the continuous vector parameters vk and discrete parameters representing membership in clusters. This is a nonlinear constrained optimization problem, and has no obvious analytical solution. Therefore, heuristic methods are adopted; the one most widely used is described below.

First, the centers of clusters vk are randomly chosen (or, equivalently, objects are randomly assigned to clusters and the centers of the clusters are calculated). Second, each object xj is assigned to the cluster ck whose center is nearest to the object. Third, the centers of the clusters are recalculated (based on the new membership) and the second step is repeated. The algorithm terminates when the next iteration does not change membership.

Unfortunately, the result to which the above algorithm converges depends on the first random choice of clusters. To obtain a better result, it is recommended to perform several runs of the algorithm and then select the best result.
Figure 2 Fifty cases from the NLTCS data set clustered by three linkage algorithms. Three tree diagrams for 50 cases (C_1–C_50): unweighted pair-group average, single linkage, and complete linkage; dissimilarity measure = percent disagreement; horizontal axis = linkage distance
do a cluster analysis of variables. Since all questions have categorical responses, we use the percent difference measure of dissimilarity as the distance function (6).

Figure 2 presents the dendrograms from the three hierarchical algorithms run on the first 50 NLTCS cases. Clusters for the single-linkage algorithm tend to form a chain-like structure. The two other algorithms give more compact results. The dendrograms have the 50 cases identified on the vertical axis. The horizontal axis scale is the distance measure. To read a dendrogram, select a distance on the horizontal axis and draw a vertical line at that point. Every line in the dendrogram that is cut defines a cluster at that distance. The length of the horizontal lines represents distances between clusters. While the three dendrograms appear to have the same distance metric, distances are calculated differently.

Since the distance measures are defined differently, we cannot cut each dendrogram at a fixed distance, say 0.5, and expect to get comparable results.
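The counting rule implied by such cuts can be made explicit: each merge completed below the cutting distance fuses two clusters into one, so cutting at distance t leaves n minus the number of merges at height ≤ t. A sketch with hypothetical merge heights (not taken from the NLTCS analysis):

```python
def clusters_at_cut(n_leaves, merge_heights, t):
    """Number of clusters when a dendrogram over n_leaves objects is cut
    at distance t: every merge at height <= t fuses two clusters into one."""
    return n_leaves - sum(1 for h in merge_heights if h <= t)

# Hypothetical dendrogram over 6 cases with merges at these heights:
heights = [0.1, 0.15, 0.3, 0.45, 0.6]
print(clusters_at_cut(6, heights, 0.5))  # → 2
```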
Instead, we start at the top of the tree, where every object is a member of a single cluster, and cut immediately at level two. In the UPGA dendrogram, that cut will be at approximately 0.42; in the single-linkage dendrogram, at 0.45; and in the complete-linkage dendrogram, at about 0.8. These cuts define 4, 7, and 7 clusters, respectively. The UPGA cluster sizes were {2, 32, 8, 8}; the single-linkage cluster sizes were {1, 1, 1, 1, 2, 44}; and the complete-linkage cluster sizes were {3, 1, 5, 6, 33, 6, 7}. In practice, one would examine the cases that make up each cluster.

These algorithms can be used to cluster variables by reversing the roles of variables and cases in the distance calculations. Variables in the data set were split into two sets according to outcome sets. Complete-linkage clustering was used.

The results for the first 19 variables, which had binary response sets, are displayed in Figure 3. At a distance of 0.3, the variables divide into two clusters. Moving about inside or outside, bathing, grocery shopping, traveling, and heavy work make up one cluster. The remaining 13 variables make up the second cluster. The second cluster could be seen as being two subclusters: (1) in/out-bed, dressing, toileting, light work, cooking, and laundry; and (2) activities that indicate a greater degree of disability (help eating, being bedfast, wheelchairfast, etc.).

The remaining eight variables, each with four responses, were also analyzed. The results are in Figure 4. At a distance of 0.5, the variables fit into three sets. Climbing a single flight of stairs (one-flight-stairs) comprises its own cluster; socks (bending for) and holding a 10-pound package comprise another; and combing hair, washing hair, grasping an object, and reading a newspaper constitute a third.

Software Metrics Example

The second data set consists of 26 software metrics measured on the source code of 180 individual software modules that were part of an international customer service system for a large lodging business. The modules were all designed and developed according to Michael Jackson methods [13, 14]. The measurements included Halstead measures [8] of vocabulary, volume, and length, McCabe complexity [5], as well as other measures such as comments and number of processes. The original analysis determined whether there were modules that should be rewritten. This second set will be analyzed by k-means clustering. The original analysis was done using Principal Component Analysis, and that analysis is presented here for comparison.

In the original Principal Component Analysis, since the variables were on such different scales, and to keep one variable from dominating the analysis, variables were scaled to have equal variances. The resulting covariance matrix (26 variables) was
Figure 3 Tree diagram for 19 disability measures; complete-linkage, dissimilarity measure = percent disagreement. Variables, top to bottom: eating, bedfast, no-inside-act, wheelchairfast, telephone, manage-money, taking-medicine, in/out-bed, dressing, toileting, light-work, cooking, laundry, about-inside, about-outside, bathing, groc-shopping, traveling, heavy-work; horizontal axis = linkage distance
Figure 4 Tree diagram for 8 disability measures; dissimilarity measure = complete-linkage percent disagreement. Variables: 1-flight-stairs, socks, 10-lb-pkg, reach-overhead, comb-hair, wash-hair, grasp-object, read-newspaper
Figure 5 Software modules plotted on the two largest eigenvalues (horizontal axis: largest eigenvalue; vertical axis: second largest eigenvalue); circles mark the five clusters identified by eye
singular, so analyses used the generalized inverse. The two largest eigenvalues accounted for a little over 76% of the variance. The data was transformed by the eigenvectors associated with the two largest eigenvalues and plotted in Figure 5. The circles indicate the five clusters identified by eye.

The analysis identified clusters, but does not tell us which programs are bad. Without other information, the analysis is silent. One could argue that we know we have a good process and, therefore, any deviant results must be bad. In this case, three of the programs in the smaller clusters had repeatedly missed deadlines and, therefore, could be considered bad. Other clusters were similar. Two clusters differed from the main body of programs on one eigenvector and not on the other. The other two small clusters differed from the main body on both eigenvectors.

We analyzed the same data by k-means clustering with k = 4. The four clusters contained 5, 9, 29, and 145 objects. Cluster 1 consisted of the five objects in the three most remote clusters in Figure 5. Cluster 3 was made up of the programs in Figure 5 whose names are far enough away from the main body to have visible names. Clusters 2 and 4 were the largest and break up the main body of Figure 5.

Conclusions

The concept of cluster analysis first appeared in the literature in the 1950s and was popularized by [22, 21, 1, 9], and others. It has recently enjoyed increased interest as a data-mining tool for understanding large volumes of data such as gene expression experiments and transaction data such as online book sales or occupancy records in the lodging industry.

Cluster analysis methods are often simple to use in practice. Procedures are available in commonly used statistical packages (e.g., SAS, STATA, STATISTICA, and SPSS) as well as in programs devoted exclusively to cluster analysis (e.g., CLUSTAN). Algorithms tend to be computationally intensive. The analyses presented here were done using STATISTICA 6.1 software (see Software for Statistical Analyses).

Cluster analyses provide insights into a wide variety of applications without many statistical assumptions. Further examples of cluster analysis applications are (a) identifying market segments from transaction data; (b) estimating the readiness of accession countries to join the European Monetary Union [2]; (c) understanding benign prostatic hypertrophy (BPH) [6]; (d) studying antisocial behavior [15]; (e) studying southern senatorial voting [19]; (f) studying multiple planetary flow regimes [20]; and (g) reconstructing fossil organisms [23]. The insights they provide, because of the lack of a formal statistical theory of inference, require validation in other venues where formal hypothesis tests can be used.

References

[1] Anderberg, M.R. (1973). Cluster Analysis for Applications, Academic Press, New York.
[2] Boreiko, D. & Oesterreichische Nationalbank. (2002). EMU and Accession Countries: Fuzzy Cluster Analysis of Membership, Oesterreichische Nationalbank, Wien.
[3] Eisen, M.B., Spellman, P.T., Brown, P.O. & Botstein, D. (1999). Cluster analysis and display of genome-wide expression patterns, Proceedings of the National Academy of Sciences 95(25), 14863–14868.
[4] Everitt, B.S. & Hand, D.J. (1981). Finite Mixture Distributions, Chapman & Hall.
[5] Gilb, T. (1977). Software Metrics, Winthrop Publishers, Cambridge.
[6] Girman, C.J. (1994). Cluster analysis and classification tree methodology as an aid to improve understanding of benign prostatic hyperplasia, Dissertation, Institute of Statistics, the University of North Carolina, Chapel Hill.
[7] Gower, J.C. & Ross, G.J.S. (1969). Minimum spanning trees and single-linkage cluster analysis, Applied Statistics 18(1), 54–64.
[8] Halstead, M.H. (1977). Elements of Software Science, Elsevier Science, New York.
[9] Hartigan, J. (1975). Clustering Algorithms, Wiley, New York.
[10] Hartigan, J.A. (1981). Consistency of single linkage for high-density clusters, Journal of the American Statistical Association 76(374), 388–394.
[11] Heinen, T. (1996). Latent Class and Discrete Latent Trait Models, SAGE, Thousand Oaks.
[12] Hoppner, F. (1999). Fuzzy Cluster Analysis: Methods for Classification, Data Analysis, and Image Recognition, John Wiley, Chichester; New York.
[13] Jackson, M.A. (1975). Principles of Program Design, Academic Press, New York.
[14] Jackson, M.A. (1983). System Development, Addison-Wesley, New York.
[15] Jordan, B.K. (1986). A fuzzy cluster analysis of antisocial behaviour: implications for deviance theory, Dissertation, Duke University, Durham.
[16] Lance, G.N. & Williams, W.T. (1967). A general theory of classificatory sorting strategies: 1. hierarchical systems, Computer Journal 9, 373–380.
[17] Langeheine, R. & Rost, J. (1988). Latent Trait and Latent [26] Wong, M.A. & Lanr, T. (1983). A kth nearest neighbor
Class Models, Plenum Press, New York. clustering procedure, Journal of the Royal Statistical
[18] Lazarsfeld, P.F. & Henry, N.W. (1968). Latent Structure Society. Series B (Methodological) 45(3), 362368.
Analysis, Houghton Mifflin, Boston.
[19] Kammer, W.N. (1965). A Cluster-blod analysis of south-
ern senatorial voting behavior, 19471963, Dissertation, Further Reading
Duke University, Durham.
[20] Mo, K. & Ghil, M. (1987). Cluster analysis of mul- Avetisov, V.A., Bikulov, A.H., Kozyrev, S.V. & Osipov, V.A.
tiple planetary flow regimes, NASA contractor report, (2002). p-adic models of ultrametric diffusion constrained
National Aeronautics and Space Administration. by hierarchical energy landscapes, Journal of Physics A-
[21] Sneath, P.H.A. & Sokal, R.R. (1973). Numerical Taxon- Mathematical ad General 35, 177189.
omy, Freeman, San Francisco. Ling, R.F. (1973). A probability theory of cluster analysis,
[22] Sokal, R.R. & Sneath, P.H.A. (1963). Principles of Journal of the American Statistical Association 68(341),
Numerical Taxonomy, W.H. Freeman, San Francisco. 159164.
[23] Von Bitter, P.H., Merrill G.K. (1990). The Reconstruc- Manton, K., Woodbury, M. & Tolley, H. (1994). Statistical
tion of Fossil Organisms Using Cluster Analysis: A Case Applications Using Fuzzy Sets, Wiley Interscience, New
Study from Late Palaeozoic Conodonts, Toronto, Royal York.
Ontario Museum. Woodbury, M.A. & Clive, J. (1974). Clinical pure types as a
[24] Wallace, C. (1978). Notes on the distribution of sex fuzzy partition, Journal of Cybernetics 4, 111121.
and shell characters in some Australian populations of
Potamopyrgus (Gastropoda: Hydrobiidae), Journal of
the Mallacological Society of Australia 4, 7176. (See also Overlapping Clusters)
[25] Wong, A.M. (1982). A hybrid clustering method for
identifying high density clusters, Journal of the Amer- KENNETH G. MANTON, GENE LOWRIMORE,
ican Statistical Association 77(380), 841847. ANATOLI YASHIN AND MIKHAIL KOVTUN
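The overview's point that basic clustering procedures are simple to use can be illustrated outside a statistical package. Below is a minimal, illustrative sketch of single-linkage agglomerative clustering (the method studied in [10]) in plain Python; the function name and toy data are mine, not from the entry.

```python
from itertools import combinations

def single_linkage(points, k):
    """Agglomerative clustering: repeatedly merge the two clusters whose
    closest members are nearest, until k clusters remain (single linkage)."""
    clusters = [[p] for p in points]

    def dist(a, b):
        # squared Euclidean distance between two points
        return sum((x - y) ** 2 for x, y in zip(a, b))

    while len(clusters) > k:
        # find the pair of clusters with the smallest inter-cluster gap
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: min(dist(a, b)
                                      for a in clusters[ij[0]]
                                      for b in clusters[ij[1]]))
        clusters[i] += clusters.pop(j)
    return clusters

data = [(0.0, 0.0), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9)]
print(single_linkage(data, 2))
# → [[(0.0, 0.0), (0.2, 0.1)], [(5.0, 5.0), (5.1, 4.9)]]
```

The quadratic search over pairs is what makes naive implementations computationally intensive, as the overview notes; production packages use much faster algorithms.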
Clustered Data
GARRETT M. FITZMAURICE
Volume 1, pp. 315–315
in
ISBN-13: 978-0-470-86080-9
ISBN-10: 0-470-86080-4
Editors
References

[1] Cochran, W.G. (1941). The distribution of the largest of a set of estimated variances as a fraction of their total, Annals of Eugenics 11, 47–52.
[2] Hartley, H.O. (1940). Testing the homogeneity of a set of variances, Biometrika 31, 249–255.
[3] Kirk, R.E. (1995). Experimental Design, 3rd Edition, Brooks/Cole, Pacific Grove.
[4] Neter, J., Kutner, M.H., Nachtsheim, C.J. & Wasserman, W. (1996). Applied Linear Statistical Models, Irwin, Chicago.
[5] Sachs, L. (2002). Angewandte Statistik [Applied Statistics], 10th Edition, Springer-Verlag, Berlin.

PATRICK MAIR AND ALEXANDER VON EYE
Coefficient of Variation
PAT LOVIE
Volume 1, pp. 317–318
Table 1  Accelerated longitudinal design with four cohorts and three annual assessments with an age range from 4 to 9 years of age

                        Period
            1999 (Years)  2000 (Years)  2001 (Years)
Cohort 1         4             5             6
Cohort 2         5             6             7
Cohort 3         6             7             8
Cohort 4         7             8             9

concerning convergence across separate groups and the feasibility of specifying a common growth trajectory over the 6 years represented by the latent variable cohort-sequential design (see below).

Advantages and Disadvantages

A noticeable advantage of the cohort-sequential design over the single-cohort longitudinal design is the possibility of studying age effects independent of period and cohort effects, but only if different cohorts are followed up between the same ages in different periods. Another advantage is the shorter follow-up period. This reduces the problems of cumulative testing effects and attrition, and produces quicker results. Finally, tracking several cohorts, rather than one, allows the researcher to determine whether the trends observed in the repeated observations are corroborated within short time periods for each age cohort. Two basic principles should be considered when designing cohort-sequential studies: efficiency of data collection and sufficiency of overlap. According to Anderson [1, p. 147], a proper balance between these two principles can be achieved by ensuring that (a) at least three data points overlap between adjacent groups, and (b) the youngest age group is followed until they reach the age of the oldest group at the first measurement.

The main disadvantage of the cohort-sequential design in comparison with the single-cohort longitudinal design is that within-individual developmental sequences are tracked over shorter periods. As a result, some researchers have questioned the efficacy of the cohort-sequential approach in adequately recovering information concerning the full longitudinal curve from different cohort segments when the criterion of convergence is not met. The cohort-sequential design is not the most appropriate design for the investigation of long-term causal effects that occur without intermediate effects or sequences (e.g., between child abuse and adult violence). In addition, questions remain concerning the ability of the cohort-sequential approach to assess the impact of important events and intervening variables on the course of development [19].

Statistical Analysis

Several data-analytical strategies have been developed to analyze data from a cohort-sequential design. The most well known are matching cross-cohorts based on statistical tests of significance, the use of structural equation modeling, and linear multilevel models. Bell [4] linked cohorts by matching characteristics of the subjects using a method he described as ad hoc. Traditional analysis of variance and regression methods (see Multiple Linear Regression) have been employed for cohort linkage and were criticized by Nesselroade and Baltes [16]. More recently, two statistical approaches have been proposed to depict change or growth adequately: the hierarchical linear model [6, 18] and latent curve analysis (see Structural Equation Modeling: Latent Growth Curve Analysis) [11, 14, 15]. Both approaches have in common that growth profiles are represented by the parameters of initial status and the rate of change (see Growth Curve Modeling). The hierarchical linear model is easier for model specification, is computationally more efficient in yielding results, and provides a flexible approach that allows for missing data, unequal spacing of time points, and the inclusion of time-varying and between-subject covariates measured either continuously or discretely. Latent curve analysis has the advantage of providing model evaluation, that is, an overall test of goodness of fit, and is more flexible in modeling and hypothesis testing. The separate cohorts' developmental paths are said to converge to a single developmental path if a model that assumes unequal paths produces results that are not statistically distinguishable from results produced by a simpler model that specifies a single path. Chou, Bentler, and Pentz [9] and Wendorf [23] compared both techniques and concluded that both approaches yielded very compatible results. In fact, both approaches might have more in common than once thought [2].
Cohort Sequential Design 3
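The accelerated longitudinal layout shown in Table 1 can be generated programmatically. A small sketch follows; the function name is mine, while the starting ages and assessment years are those of the table.

```python
def accelerated_design(start_ages, first_year, n_waves):
    """Build the cohort-by-period age table for an accelerated
    longitudinal (cohort-sequential) design: each cohort starts at a
    different age and every cohort ages one year per annual wave."""
    periods = [first_year + w for w in range(n_waves)]
    table = {}
    for c, age in enumerate(start_ages, start=1):
        table[f"Cohort {c}"] = {year: age + w for w, year in enumerate(periods)}
    return table

# Table 1: four cohorts starting at ages 4-7, assessed 1999-2001
design = accelerated_design([4, 5, 6, 7], 1999, 3)
print(design["Cohort 4"][2001])  # → 9
```

Printing the full dictionary reproduces the overlap structure that Anderson's two design principles (at least three overlapping data points between adjacent groups) are meant to guarantee.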
u can be defined as time from selection into the cohort, and T denotes vector transpose. Let Z1(t) = {z1(u), u < t} denote the history of such characteristics at times less than t. Note that the baseline exposure data Z1(0) may include information that pertains to time periods prior to selection into the cohort. Denote by λ{t; Z1(t)} the population incidence (hazard) rate at time t for a disease of interest, as a function of an individual's preceding covariate history. A typical cohort study goal is the elucidation of the relationship between aspects of Z1(t) and the corresponding disease rate λ{t; Z1(t)}. As mentioned above, a single cohort study may be used to examine many such covariate-disease associations.

The interpretation of the relationship between λ{t; Z1(t)} and Z1(t) may well depend on other factors. Let Z2(t) denote the history up to time t of a set of additional characteristics. If the variates Z1(t) and Z2(t) are related among population members at risk for disease at time t, and if the disease rate λ{t; Z1(t), Z2(t)} depends on Z2(t), then an observed relationship between λ{t; Z1(t)} and Z1(t) may be attributable, in whole or in part, to Z2(t). Hence, toward an interpretation of causality, one can focus instead on the relationship between Z1(t) and the disease rate function λ{t; Z1(t), Z2(t)}, thereby controlling for the confounding influences of Z2. In principle, a cohort study needs to control for all pertinent confounding factors in order to interpret a relationship between Z1 and disease risk as causal. It follows that a good deal must be known about the disease process and disease risk factors before an argument of causality can be made reliably. This feature places a special emphasis on the replication of results in various populations, with the idea that unrecognized or unmeasured confounding factors may differ among populations. As noted above, the principal advantage of a randomized disease prevention trial, as compared to a purely observational study, is that the randomization indicator variable Z1 = Z1(0), where here t = 0 denotes the time of randomization, is unrelated to the histories Z2(0) of all confounding factors, whether or not such are recognized or measured.

The choice as to which factors to include in Z2(t), for values of t in the cohort follow-up period, can be far from straightforward. For example, factors on a causal pathway between Z1(t) and disease risk may give rise to overadjustment if included in Z2(t), since one of the mechanisms whereby the history Z1(t) alters disease risk has been conditioned upon. On the other hand, omission of such factors may leave a confounded association, since the relationship between Z2 and disease risk may not be wholly attributable to the effects of Z1 on Z2.

Cohort Selection and Follow-up

Upon identifying the study diseases of interest and the covariate histories Z(t) = {Z1(t), Z2(t)} to be ascertained and studied in relation to disease risk, one can turn to the estimation of λ{t; Z(t)} based on a cohort of individuals selected from the study population. The basic cohort selection and follow-up requirement for valid estimation of λ{t; Z(t)} is that, at any {t, Z(t)}, a sample that is representative of the population in terms of disease rate be available and under active follow-up for disease occurrence. Hence, conceptually, cohort selection and censoring rates (e.g., loss to follow-up rates) could depend arbitrarily on {t, Z(t)}, but selection and follow-up procedures cannot be affected in any manner by knowledge about, or perception of, disease risk at specified {t, Z(t)}.

Covariate History Ascertainment

Valid estimation of λ{t; Z(t)} requires ascertainment of the individual study subject histories, Z, during cohort follow-up. Characteristics or exposures prior to cohort study enrollment are often of considerable interest, but typically need to be ascertained retrospectively, perhaps using specialized questionnaires, using analysis of biological specimens collected at cohort study entry, or by extracting information from existing records (e.g., employer records of occupational exposures). Postenrollment exposure data may also need to be collected periodically over the cohort study follow-up period to construct the histories of interest. In general, the utility of cohort study analyses depends directly on the extent of variability in the covariate histories Z, and on the ability to document that such histories have been ascertained in a valid and reliable fashion. It often happens that aspects of the covariate data of interest are ascertained with some measurement error, in which case substudies that allow the measured quantities to be related to the underlying variables of interest (e.g., validation
Cohort Studies 3
or reliability substudies) may constitute a key aspect of cohort study conduct.

Disease Event Ascertainment

A cohort study needs to include a regular updating of the occurrence times for the disease events of interest. For example, this may involve asking study subjects to report a given set of diagnoses or health-related events (e.g., hospitalization) that initiate a process for collecting hospital and laboratory records to determine whether or not a disease event has occurred. Diagnoses that require considerable judgment may be further adjudicated by a panel of diagnostic experts. While the completeness of outcome ascertainment is a key feature of cohort study quality, the most critical outcome-related cohort study feature concerns whether or not there is differential outcome ascertainment, either in the recognition or the timely ascertainment of disease events of interest, across the exposures or characteristics under study. Differential ascertainment can often be avoided by arranging for outcome ascertainment procedures and personnel to be independent of exposure histories, through document masking and other means.

Data Analysis

Typically, a test of association between a certain characteristic or exposure and disease risk can be formulated in the context of a descriptive statistical model. With occurrence-time data, the Cox regression model [4], which specifies

    λ{t; Z(t)} = λ0(t) exp{X(t)^T β},    (1)

is very flexible and useful for this purpose. In this model, λ0 is a baseline disease rate model that need not be specified, X(t)^T = {X1(t), . . . , Xp(t)} is a modeled regression vector formed from Z(t), and β^T = (β1, . . . , βp) is a corresponding hazard ratio (relative risk) parameter to be estimated. Testing and estimation on β is readily carried out using a so-called partial likelihood function [5, 7]. For example, if X1 defines an exposure variable (or characteristic) of interest, a test of β1 = 0 provides a test of the hypothesis of no association between such exposure and disease risk over the cohort follow-up period, which controls for the potential confounding factors coded by X2, . . . , Xp. Virtually all statistical software packages include tests and confidence intervals on β under this model, as well as estimators of the cumulative baseline hazard rate.

The data-analytic methods for accommodating measurement error in covariate histories are less standardized. Various methods are available (e.g., [2]) under a classical measurement model (see Measurement: Overview), wherein a measured regression variable is assumed to be the sum of the target variable plus measurement error that is independent, not only of the targeted value but also of other disease risk factors. With such difficult-to-measure exposures as those related to the environment, occupation, physical activity, or diet, a major effort may need to be expended to develop a suitable measurement model and to estimate measurement model parameters in the context of estimating the association between disease rate and the exposure history of interest.

An Example

While there have been many important past and continuing cohort studies over the past several decades, a particular cohort study in which the author is engaged is the Observational Study component of the Women's Health Initiative (WHI) [10, 18]. This study is conducted at 40 Clinical Centers in the United States, and includes 93 676 postmenopausal women in the age range 50–79 at the time of enrollment during 1993–1998. Cohort enrollment took place in conjunction with a companion multifaceted Clinical Trial among 68 133 postmenopausal women in the same age range, and for some purposes the combined Clinical Trial and Observational Study can be viewed as a cohort study of 161 809 women. Most recruitment took place using population-based lists of age- and gender-eligible women living in proximity to a participating Clinical Center.

Postmenopausal hormone use and nutrition are major foci of the WHI, in relation to disease morbidity and mortality. Baseline collection included a personal hormone history interview, a food frequency questionnaire, a blood specimen (for separation and storage), and various risk factor and health-behavior questionnaires. Outcome ascertainment includes periodic structured self-report of a broad range of health outcomes, document collection by a trained outcome specialist, physician adjudication at each Clinical Center, and subsequent centralized adjudication
4 Cohort Studies
for selected outcome categories. Exposure data are updated on a regular basis either through questionnaire or clinic visit. To date, the clinical trial has yielded influential, and some surprising, results on the benefits and risks of postmenopausal hormone therapy [17, 19]. The common context and data collection in the Observational Study and Clinical Trial provide a valuable opportunity to compare results on hormone therapy between the two study designs. A major study is currently being implemented using various objective markers of nutrient consumption toward building a suitable measurement model for calibrating the food frequency nutrient assessments and thereby providing reliable information on nutrient-disease associations. A subset of about 1000 Observational Study participants provided replicate data on various exposures at baseline and at 3 years from enrollment toward allowing for measurement error accommodation in exposure and confounding variables.

Concluding Comment

This entry builds substantially on prior cohort study reviews by the author [13, 14], which provide more detail on study design and analysis choices. There are a number of books and review articles devoted to cohort study methods [1, 2, 3, 6, 8, 9, 11, 12, 15, 16].

Acknowledgment

This work was supported by grant CA-53996 from the US National Cancer Institute.

References

[1] Breslow, N.E. & Day, N.E. (1987). Statistical Methods in Cancer Research, Vol. 2: The Design and Analysis of Cohort Studies, IARC Scientific Publications No. 82, International Agency for Research on Cancer, Lyon.
[2] Carroll, R.J., Ruppert, D. & Stefanski, L.A. (1995). Measurement Error in Nonlinear Models, Chapman & Hall, New York.
[3] Checkoway, H., Pearce, N. & Crawford-Brown, D.J. (1989). Research Methods in Occupational Epidemiology, Oxford University Press, New York.
[4] Cox, D.R. (1972). Regression models and life tables (with discussion), Journal of the Royal Statistical Society, Series B 34, 187–220.
[5] Cox, D.R. (1975). Partial likelihood, Biometrika 62, 269–276.
[6] Kahn, H.A. & Sempos, C.T. (1989). Statistical Methods in Epidemiology, Oxford University Press, New York.
[7] Kalbfleisch, J.D. & Prentice, R.L. (2002). The Statistical Analysis of Failure Time Data, 2nd Edition, John Wiley & Sons.
[8] Kelsey, J.L., Thompson, W.D. & Evans, A.S. (1986). Methods in Observational Epidemiology, Oxford University Press, New York.
[9] Kleinbaum, D.G., Kupper, L.L. & Morganstern, H. (1982). Epidemiologic Research: Principles and Quantitative Methods, Lifetime Learning Publications, Belmont.
[10] Langer, R.D., White, E., Lewis, C.E., Kotchen, J.M., Hendrix, S.L. & Trevisan, M. (2003). The Women's Health Initiative Observational Study: baseline characteristics of participants and reliability of baseline measures, Annals of Epidemiology 13(95), 107–121.
[11] Miettinen, O.S. (1985). Theoretical Epidemiology: Principles of Occurrence Research in Medicine, Wiley, New York.
[12] Morganstern, H. & Thomas, D. (1993). Principles of study design in environmental epidemiology, Environmental Health Perspectives 101, 23–38.
[13] Prentice, R.L. (1995). Design issues in cohort studies, Statistical Methods in Medical Research 4, 273–292.
[14] Prentice, R.L. (1998). Cohort studies, in Encyclopedia of Biostatistics, Vol. 1, P. Armitage & T. Colton, eds, John Wiley & Sons, pp. 770–784.
[15] Rothman, K.J. (1986). Modern Epidemiology, Little, Brown & Co., Boston.
[16] Willett, W.C. (1998). Nutritional Epidemiology, 2nd Edition, Oxford University Press.
[17] Women's Health Initiative Steering Committee. (2004). Effects of conjugated equine estrogens in postmenopausal women with hysterectomy: the Women's Health Initiative randomized controlled trial, Journal of the American Medical Association 291, 1701–1712.
[18] Women's Health Initiative Study Group. (1998). Design of the Women's Health Initiative clinical trial and observational study, Controlled Clinical Trials 19, 61–109.
[19] Writing Group for the Women's Health Initiative Investigators. (2002). Risks and benefits of estrogen plus progestin in healthy postmenopausal women. Principal results from the Women's Health Initiative randomized controlled trial, Journal of the American Medical Association 288, 321–333.

(See also Case–Cohort Studies)

ROSS L. PRENTICE
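For readers who want to see the partial likelihood [5, 7] behind the Cox model in (1) in computable form, here is a minimal single-covariate sketch assuming no tied event times; the function name and toy data are illustrative, not from the entry.

```python
from math import exp, log

def cox_log_partial_likelihood(beta, times, events, x):
    """Log partial likelihood for a one-covariate Cox model,
    lambda{t; x} = lambda0(t) * exp(beta * x), assuming no tied event times.
    'events' is 1 for an observed disease event, 0 for censoring."""
    ll = 0.0
    for i, (t_i, d_i) in enumerate(zip(times, events)):
        if not d_i:
            # censored subjects contribute only through others' risk sets
            continue
        # risk set: subjects still under observation at time t_i
        risk = [j for j, t_j in enumerate(times) if t_j >= t_i]
        ll += beta * x[i] - log(sum(exp(beta * x[j]) for j in risk))
    return ll

# at beta = 0 each event contributes -log(size of its risk set)
print(round(cox_log_partial_likelihood(0.0, [1, 2, 3, 4],
                                       [1, 1, 0, 1],
                                       [1.0, 0.0, 1.0, 0.0]), 6))  # → -2.484907
```

Note how the baseline rate λ0 cancels out of each ratio, which is why it "need not be specified" in the entry's words; maximizing this function over β is what packaged Cox routines do.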
Coincidences
BRIAN S. EVERITT
Volume 1, pp. 326–327
The second counting rule involves the computation of the number of ways that a set of responses can be arranged in order. Suppose that there are six patients requesting an appointment to see their psychiatrist. What is the total number of ways in which the receptionist may schedule them to see the psychiatrist on the same day?

Counting Rule 2. The number of ways in which n responses can be arranged in order is given by n! = n(n − 1)(n − 2) · · · (3)(2)(1), where n! is called n factorial and 0! is defined to be 1.

An application of Rule 2 shows that the receptionist has 6! = (6)(5)(4)(3)(2)(1) = 720 ways to schedule them.

But if the psychiatrist can only see 4 patients on that day, in how many ways can the receptionist schedule them in order?

Counting Rule 3. The number of arrangements for k responses selected from n responses in order is n!/(n − k)!. This is called the rule of permutations.

Using the rule of permutations, the receptionist has

    6!/(6 − 4)! = (6)(5)(4)(3)(2)(1)/[(2)(1)] = 360 ways.    (1)

But what if the receptionist is not interested in the order but only in the number of ways that any 4 of the 6 patients can be scheduled?

Counting Rule 4. The number of ways in which k responses can be selected from n responses is n!/[k!(n − k)!]. This is called the rule of combinations, and the expression is commonly denoted by C(n, k).

The combinations counting rule shows that the receptionist has

    C(6, 4) = 6!/[4!(6 − 4)!] = 6!/(4!2!) = (6)(5)(4)(3)(2)(1)/[(4)(3)(2)(1)(2)(1)] = 15 ways.    (2)

The main difference between a permutation and a combination is that in the former the orders in which the first four patients out of six call in are each considered distinct, while in the latter the order of the first four is not maintained; that is, the four that call in are considered as the set of individuals that get the appointment for the same day, but are not necessarily scheduled in the order in which they called in.

Example. Suppose that a receptionist schedules a total of k patients over n distinct days. What is the probability that t patients are scheduled on a specific day?

First, an application of counting rule 4 to determine the number of ways t patients can be selected out of k gives C(k, t) total ways for choosing t = 0, 1, 2, . . . , k patients scheduled on a specific day. Using counting rule 1, one can compute that the remaining (k − t) patients can be scheduled over the remaining (n − 1) days in a total of (n − 1)^(k − t) possible ways. There are a total of n^k possible ways of randomly scheduling k patients over n days, again using counting rule 1. Hence

    The empirical probability that t patients are scheduled on a specific day
      = (number of ways of scheduling t of the k people on a specific day) /
        (number of ways of scheduling k people on n days)
      = C(k, t)(n − 1)^(k − t) / n^k.    (3)

This last expression can be rewritten as C(k, t)(1/n)^t (1 − 1/n)^(k − t), which is the empirical form of the binomial distribution.

(See also Catalogue of Probability Density Functions)

ALKA INDURKHYA
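Counting rules 2 through 4 and the probability in (3) can be checked numerically. A short sketch using Python's standard-library `math.comb`; the function name is mine.

```python
from math import comb, factorial

# Rule 2: orderings of six patients; Rule 3: permutations; Rule 4: combinations
assert factorial(6) == 720                        # 6! orderings
assert factorial(6) // factorial(6 - 4) == 360    # permutations of 4 out of 6
assert comb(6, 4) == 15                           # combinations of 4 out of 6

def prob_t_on_day(k: int, n: int, t: int) -> float:
    """Equation (3): probability that exactly t of k patients are
    scheduled on one specific day out of n equally likely days."""
    return comb(k, t) * (n - 1) ** (k - t) / n ** k

# the probabilities over t = 0..k sum to 1, as a distribution should
print(round(sum(prob_t_on_day(6, 5, t) for t in range(7)), 10))  # → 1.0
```

The function is exactly the binomial probability mass function with success probability p = 1/n, matching the closing remark of the entry.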
Common Pathway Model
FRUHLING RIJSDIJK
Volume 1, pp. 330–331
Figure 1  Common pathway model: Ac, Cc, and Ec are the common additive genetic, common shared, and common nonshared environmental factors, respectively. Lph is the latent intermediate phenotypic variable, which influences all observed variables. The factors at the bottom estimate the variable-specific A and E influences. For simplicity, the specific C factors were omitted from the diagram.
2 Common Pathway Model
variances and covariances by fitting structural equation models. The common pathway model is more parsimonious than the independent pathway model because it estimates fewer parameters.

So what is the meaning and interpretation of this factor model? The common pathway model is a more stringent model than the independent pathway model. It hypothesizes that covariation between variables arises purely from their phenotypic relation with the latent intermediate variable. This factor is identical to the factor derived from higher-order phenotypic factor analyses, with the additional possibility of estimating the relative importance of genetic and environmental effects on this factor. In contrast, in the independent pathway model, where the common genetic and environmental factors influence the observed variables directly, different clusters of variables for the genetic and environmental factors are possible. This means that some variables could be specified to covary mainly because of shared genetic effects, whereas others covary because of shared environmental effects.

An obvious application of this model is to examine the etiology of comorbidity. In an adolescent twin sample recruited through the Colorado Twin Registry and the Colorado Longitudinal Twin Study, conduct disorder and attention deficit hyperactivity disorder, along with a measure of substance experimentation and novelty seeking, were used as indices of a latent behavioral disinhibition trait. A common pathway model evaluating the genetic and environmental architecture of this latent phenotype suggested that behavioral disinhibition is highly heritable (0.84), and is not influenced significantly by shared environmental factors. These results suggest that a variety of adolescent problem behaviors may share a common underlying genetic risk [3].

Another application of this model is to determine the variation in a behavior that is agreed upon by multiple informants. An example of such an application is illustrated for antisocial behavior in 5-year-old twins as reported by mothers, teachers, examiners, and the children themselves [1]. Problem behavior ascertained by consensus among raters in multiple settings indexes cases of problem behavior that are pervasive. Heritability of this pervasive antisocial behavior was higher than for any of the informants individually (which can be conceptualized as situational antisocial behavior). In addition, significant informant-specific unique environment (including measurement error) was observed.

References

[1] Arseneault, L., Moffitt, T.E., Caspi, A., Taylor, A., Rijsdijk, F.V., Jaffee, S., Ablow, J.C. & Measelle, J.R. (2003). Strong genetic effects on antisocial behaviour among 5-year-old children according to mothers, teachers, examiner-observers, and twins' self-reports, Journal of Child Psychology and Psychiatry 44, 832–848.
[2] McArdle, J.J. & Goldsmith, H.H. (1990). Alternative common-factor models for multivariate biometrical analyses, Behavior Genetics 20, 569–608.
[3] Young, S.E., Stallings, M.C., Corley, R.P., Krauter, K.S. & Hewitt, J.K. (2000). Genetic and environmental influences on behavioral disinhibition, American Journal of Medical Genetics (Neuropsychiatric Genetics) 96, 684–695.

FRUHLING RIJSDIJK
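The algebra behind the common pathway model is compact enough to sketch: every cross-trait covariance is the product of the two traits' loadings on Lph and the latent factor's variance (a² + c² + e²). The function below and its numbers are illustrative only; the 0.84 echoes the heritability reported in the text, while the loadings and residual variances are made up.

```python
def implied_covariance(loadings, a2, c2, e2, specific):
    """Covariance matrix implied by a common pathway model: all
    cross-trait covariance is routed through the latent factor Lph,
    whose variance is a2 + c2 + e2; 'specific' adds trait-specific
    (residual) variance on the diagonal."""
    var_latent = a2 + c2 + e2
    n = len(loadings)
    return [[loadings[i] * loadings[j] * var_latent
             + (specific[i] if i == j else 0.0)
             for j in range(n)]
            for i in range(n)]

# two traits loading 1.0 and 0.5 on the latent factor (made-up numbers)
cov = implied_covariance([1.0, 0.5], a2=0.84, c2=0.0, e2=0.16, specific=[0.2, 0.3])
print(cov[0][1])  # → 0.5
```

This makes the model's parsimony visible: however many traits there are, their covariances are governed by one loading per trait plus the three latent variance components.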
Community Intervention Studies
DAVID M. MURRAY
Volume 1, pp. 331–333
when the design is based on an insufficient number of groups randomized to each condition.

There are several analytic approaches that can provide a valid analysis for GRTs [2, 5]. In most, the intervention effect is defined as a function of a condition-level statistic (e.g., difference in means, rates, or slopes) and assessed against the variation in the corresponding group-level statistic. These approaches include mixed-model ANOVA/ANCOVA for designs having only one or two time intervals (see Linear Multilevel Models), random coefficient models for designs having three or more time intervals, and randomization tests as an alternative to the model-based methods. Other approaches are generally regarded as invalid for GRTs because they ignore or misrepresent a source of random variation. These include analyses that assess condition variation against individual variation and ignore the group, analyses that assess condition variation against individual variation and include the group as a fixed effect, analyses that assess condition variation against subgroup variation, and analyses that assess condition variation against the wrong type of group variation. Still other strategies may have limited application for GRTs. For example, the application of generalized estimating equations (GEE) and the sandwich method for standard errors requires a total of 40 or more groups in the study, or a correction for the downward bias in the sandwich estimator for standard errors when there are fewer than 40 groups [7].

To avoid low power, investigators should plan a large enough study to ensure sufficient replication, employ more and smaller groups instead of a few large groups, employ strong interventions with good reach, and maintain the reliability of intervention implementation. In the analysis, investigators should consider regression adjustment for covariates, model time if possible, and consider post hoc stratification.

Excellent treatments on power for GRTs exist, and the interested reader is referred to those sources for additional information. Chapter 9 in the Murray text provides perhaps the most comprehensive treatment of detectable difference, sample size, and power for GRTs [5]. Even so, a few points are repeated here. First, the increase in between-group variance due to the ICC in the simplest analysis is calculated as 1 + (m − 1)ICC, where m is the number of members per group; as such, ignoring even a small ICC can underestimate standard errors if m is large. Second, more power is available given more groups per condition with fewer members measured per group than given just a few groups per condition with many members measured per group, no matter the size of the ICC. Third, the two factors that largely determine power in any GRT are the ICC and the number of groups per condition. For these reasons, there is no substitute for a good estimate of the ICC for the primary endpoint, the target population, and the primary analysis planned for the trial, and it is unusual for a GRT to have adequate power with fewer than 8–10 groups per condition. Finally, the formula for the standard error for the intervention effect depends on the primary analysis planned for the trial, and investigators should take care to calculate that standard error, and power, based on that analysis.

Acknowledgment

The material presented here draws heavily on work published previously by David M. Murray [5–7]. Readers are referred to those sources for additional information.

References

[1] Cornfield, J. (1978). Randomization by group: a formal analysis, American Journal of Epidemiology 108(2), 100–102.
[2] Donner, A. & Klar, N. (2000). Design and Analysis of Cluster Randomization Trials in Health Research, Arnold, London.
[3] Forster, J.L., Murray, D.M., Wolfson, M., Blaine, T.M., Wagenaar, A.C. & Hennrikus, D.J. (1998). The effects of community policies to reduce youth access to tobacco, American Journal of Public Health 88(8), 1193–1198.
[4] Kish, L. (1965). Survey Sampling, John Wiley & Sons, New York.
[5] Murray, D.M. (1998). Design and Analysis of Group-Randomized Trials, Oxford University Press, New York.
[6] Murray, D.M. (2000). Efficacy and effectiveness trials in health promotion and disease prevention: design and analysis of group-randomized trials, in Integrating Behavioral and Social Sciences with Public Health, N. Schneiderman, J.H. Gentry, J.M. Silva, M.A. Speers & H. Tomes, eds, American Psychological Association, Washington, pp. 305–320.
[7] Murray, D.M., Varnell, S.P. & Blitstein, J.L. (2004). Design and analysis of group-randomized trials: a review of recent methodological developments, American Journal of Public Health 94(3), 423–432.

DAVID M. MURRAY
Comorbidity
S.H. RHEE, JOHN K. HEWITT, R.P. CORLEY AND M.C. STALLINGS
Volume 1, pp. 333–337
researchers have made predictions regarding the random multiformity of A or random multiformity of B models (i.e., an individual who has one disorder is at an increased risk for having the second disorder, although he or she may not have an elevated liability for the second disorder).

in data simulated for other comorbidity models (i.e., the particular analysis should discriminate a particular comorbidity model from alternative hypotheses).
Description of the Results
S.H. RHEE, JOHN K. HEWITT, R.P. CORLEY AND M.C. STALLINGS
Compensatory Equalization
PATRICK ONGHENA
Volume 1, pp. 337–338
(See also Adaptive Random Assignment)

PATRICK ONGHENA
Compensatory Rivalry
KAREN M. CONRAD AND KENDON J. CONRAD
Volume 1, pp. 338–339
KAREN M. CONRAD AND KENDON J. CONRAD
Completely Randomized Design
SCOTT E. MAXWELL
Volume 1, pp. 340–341
An increasingly common computational model is a hybrid between statistical and substantive questions. This type of model is frequently used when we want to understand the characteristics of a statistic to answer important applied questions. For example, we may want to know the consequences of error variance heterogeneity on tests of differences in slopes between demographic subgroups [1]. LeBreton, Ployhart, and Ladd [8] examined which type of predictor relative importance estimate is most effective in determining which predictors to keep in a regression model. Sackett and Roth [11] demonstrated the effects on adverse impact when combining predictors with differing degrees of intercorrelations and subgroup differences. Murphy [9] documented the negative effect on test utility when the top applicant rejects a job offer. In each of these examples, a merging of substantive and statistical questions has led to a simulation methodology that informs research and practice. The power of computational modeling in such circumstances helps test theories and develop applied solutions without the difficulty, expense, and frequent impossibility of collecting real-world data.
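The Alexander and DeShon [1] question — what does error variance heterogeneity do to tests of slope differences — illustrates what such hybrid simulations look like in practice. The sketch below is a minimal illustration, not the cited study's code; the group sizes, error standard deviations, and the simple pooled-variance test are hypothetical choices made only to show the approach.

```python
import random
import statistics

def fit_slope(x, y):
    """OLS slope, residual sum of squares, and Sxx for one subgroup."""
    xbar, ybar = statistics.mean(x), statistics.mean(y)
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    rss = sum((yi - ybar - b * (xi - xbar)) ** 2 for xi, yi in zip(x, y))
    return b, rss, sxx

def pooled_test_rejection_rate(n1=20, n2=100, sd1=3.0, sd2=1.0,
                               n_reps=1000, crit=1.96, seed=11):
    """Monte Carlo Type I error rate of a naive pooled-variance test for
    equal slopes when the true slopes are equal but error SDs differ
    across subgroups (all parameter values are hypothetical)."""
    rng = random.Random(seed)
    x1 = [i / (n1 - 1) for i in range(n1)]   # fixed designs on [0, 1]
    x2 = [i / (n2 - 1) for i in range(n2)]
    rejections = 0
    for _ in range(n_reps):
        # both subgroups share the same true slope (2.0): H0 is true
        y1 = [2.0 * xi + rng.gauss(0.0, sd1) for xi in x1]
        y2 = [2.0 * xi + rng.gauss(0.0, sd2) for xi in x2]
        b1, rss1, sxx1 = fit_slope(x1, y1)
        b2, rss2, sxx2 = fit_slope(x2, y2)
        mse = (rss1 + rss2) / (n1 + n2 - 4)           # pooled error variance
        se = (mse * (1.0 / sxx1 + 1.0 / sxx2)) ** 0.5
        if abs(b1 - b2) / se > crit:
            rejections += 1
    return rejections / n_reps
```

With the smaller subgroup carrying the larger error SD, the pooled test tends to reject well above the nominal 5% rate — the kind of substantive-plus-statistical finding such simulations are built to expose.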
References

[1] Alexander, R.A. & DeShon, R.P. (1994). The effect of error variance heterogeneity on the power of tests for regression slope differences, Psychological Bulletin 115, 308–314.
[2] Anderson, J.R. (1993). Problem solving and learning, American Psychologist 48, 35–44.
[3] Anderson, J.R. (1996). ACT: a simple theory of complex cognition, American Psychologist 51, 355–365.
[4] Hanisch, K.A., Hulin, C.L. & Seitz, S.T. (1996). Mathematical/computational modeling of organizational withdrawal processes: benefits, methods, and results, in Research in Personnel and Human Resources Management, Vol. 14, G. Ferris, ed., JAI Press, Greenwich, pp. 91–142.
[5] Hu, L.T. & Bentler, P.M. (1998). Fit indices in covariance structure modeling: sensitivity to underparameterized model misspecification, Psychological Methods 3, 424–453.
[6] Hulin, C.L., Miner, A.G. & Seitz, S.T. (2002). Computational modeling in organizational sciences: contribution of a third research discipline, in Measuring and Analyzing Behavior in Organizations: Advancements in Measurement and Data Analysis, F. Drasgow & N. Schmitt, eds, Jossey-Bass, San Francisco, pp. 498–533.
[7] Ilgen, D.R. & Hulin, C.L. (2000). Computational Modeling of Behavior in Organizations: The Third Scientific Discipline, American Psychological Association, Washington.
[8] LeBreton, J.M., Ployhart, R.E. & Ladd, R.T. (2004). Use of dominance analysis to assess relative importance: a Monte Carlo comparison with alternative methods, Organizational Research Methods 7, 258–282.
[9] Murphy, K.R. (1986). When your top choice turns you down: effect of rejected offers on the utility of selection tests, Psychological Bulletin 99, 133–138.
[10] Ployhart, R.E. & Ehrhart, M.G. (2002). Modeling the practical effects of applicant reactions: subgroup differences in test-taking motivation, test performance, and selection rates, International Journal of Selection and Assessment 10, 258–270.
[11] Sackett, P.R. & Roth, L. (1996). Multi-stage selection strategies: a Monte Carlo investigation of effects on performance and minority hiring, Personnel Psychology 49, 1–18.
[12] Schmitt, N. & Ployhart, R.E. (1999). Estimates of cross-validity for stepwise regression and with predictor selection, Journal of Applied Psychology 84, 50–57.
[13] Whicker, M.L. & Sigelman, L. (1991). Computer Simulation Applications: An Introduction, Sage Publications, Newbury Park.
[14] Zickar, M.J. & Robie, C. (1999). Modeling faking at the item-level, Journal of Applied Psychology 84, 95–108.
[15] Zickar, M.J. & Slaughter, J.E. (2002). Computational modeling, in Handbook of Research Methods in Industrial and Organizational Psychology, S.G. Rogelberg, ed., Blackwell Publishers, Malden, pp. 184–197.

ROBERT E. PLOYHART AND CRYSTAL M. HAROLD
Computer-Adaptive Testing
RICHARD M. LUECHT
Volume 1, pp. 343–350
[Figure 1: Schematic of the CAT delivery cycle. Items travel from the server's item database to the examinee's workstation; the scored response vector U_j = (10100101101) and the item statistics feed the proficiency score estimation algorithm (e.g., the MAP estimate θ̂ maximizing g(θ | u_i1, …, u_ik−1) over θ ∈ (−∞, ∞)); the CAT item selection algorithm then picks the next item by maximum information, i_k = max_j {I_j(θ̂_{u_i1, …, u_ik−1}) : j ∈ R_k}.]
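To make the cycle in Figure 1 concrete, here is a deliberately simplified sketch: a 2PL response model, a tiny hypothetical item pool, grid-search maximum-likelihood scoring, and a fixed-length stopping rule — all illustrative choices, not the operational algorithms of any real testing program.

```python
import math
import random

def p_correct(theta, a, b):
    """2PL probability of a correct response (D = 1.7)."""
    return 1.0 / (1.0 + math.exp(-1.7 * a * (theta - b)))

def item_info(theta, a, b):
    """2PL item information: I(theta) = (D*a)^2 * P * Q."""
    p = p_correct(theta, a, b)
    return (1.7 * a) ** 2 * p * (1.0 - p)

def ml_theta(responses, grid=None):
    """Grid-search ML proficiency estimate from ((a, b), score) pairs."""
    grid = grid or [g / 10.0 for g in range(-40, 41)]  # -4.0 .. +4.0
    def loglik(t):
        ll = 0.0
        for (a, b), u in responses:
            p = min(max(p_correct(t, a, b), 1e-6), 1 - 1e-6)
            ll += math.log(p) if u else math.log(1.0 - p)
        return ll
    return max(grid, key=loglik)

def run_cat(pool, true_theta, max_items=20, rng=None):
    """Select-administer-score-update loop with a fixed-length stop rule."""
    rng = rng or random.Random(0)
    theta, answered, remaining = 0.0, [], list(pool)
    while remaining and len(answered) < max_items:
        # select the remaining item with maximum information at theta
        item = max(remaining, key=lambda ab: item_info(theta, *ab))
        remaining.remove(item)
        u = rng.random() < p_correct(true_theta, *item)  # simulated response
        answered.append((item, u))
        theta = ml_theta(answered)                        # provisional score
    return theta

pool = [(1.0, b / 4.0) for b in range(-8, 9)]  # hypothetical (a, b) pairs
```

A precision-based stopping rule would simply replace the `max_items` check with a test on the provisional standard error.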
satisfied. Two standard stopping rules for adaptive tests are: (a) a fixed test length has been met, or (b) a minimum level of score precision has been satisfied¹. This iterative process of selecting and administering items, scoring, and then selecting more items is depicted in Figure 1.

In Figure 1, the initial items are usually transmitted through a computer network and rendered at the examinee's workstation. The responses are captured and scored by the test delivery software. The scored response vector and the item parameters are then used to update the provisional estimate of θ. That provisional score is then used by the maximum information algorithm, i_k = max_j {I_j(θ̂_{u_i1, …, u_ik−1}) : j ∈ R_k}, to select the next item. The test terminates when either a fixed number of items have been administered or when a particular statistical criterion has been attained.

It is important to realize that each item incrementally improves our statistical confidence about an examinee's unknown proficiency, θ. For example, Figure 2 shows the degree of certainty about an examinee's score after 3 items and again after 50 items. For the sake of this example, assume that we know this examinee's true proficiency score to be θ = 1.75. After administering only 3 items, our certainty, represented by the dotted curve, is relatively flat, indicating a lack of confidence about the exact location of the provisional estimate. However, after administering 50 items, we find that: (a) the provisional score estimate is quite close to the true proficiency; and (b) our certainty is very high, as indicated by the tall, narrow curve.

Figure 2 Certainty about proficiency after three and after fifty items

IRT Information and Efficiency in CAT

To better understand how the adaptive algorithm actually works, we need to focus on the IRT item and test information functions. Birnbaum [1] introduced the concept of the test information function as a psychometric analysis mechanism for designing and comparing the measurement precision of tests in the context of item response theory (IRT). Under IRT, the conditional measurement error variance, var(E|θ), is inversely proportional to the test information function, I(θ). That is,

var(E|θ) = [I(θ)]⁻¹ = 1 / Σ_{i=1}^{n} I_i(θ),   (3)

where I_i(θ) is the item information function at some proficiency score of interest, denoted as θ. The exact mathematical form of the information function varies by IRT model. Lord [9] and Hambleton and Swaminathan [7] provide convenient computational formulas for the one-, two-, and three-parameter IRT model information functions.

Equation (3) suggests two important aspects about measurement precision. First, each item contributes some amount of measurement information to the reliability or score precision of the total test. That is, the total test information function is the sum of the item information functions. Second, by increasing the test information function, we correspondingly reduce the measurement error variance of the estimated score. Simply put, when test information is maximized, measurement errors are minimized.

Figure 3 Proficiency scores and standard errors for a 50-item CAT for two hypothetical examinees

Figure 3 shows what happens to the provisional proficiency scores and associated standard errors (the square root of the error variance from (3)) for two hypothetical examinees taking a 50-item CAT. The proficiency scale is shown on the vertical axis (−1.5 to +1.5). The sequence of 50 adaptively administered items is shown on the horizontal scale. Although not shown in the picture, initially, both examinees start with proficiency estimates near zero.
After the first item is given, the estimated proficiency
scores immediately begin to separate (plotted with different symbols for Examinee A and Examinee B). Over the course of 50 items, the individual proficiency scores for these two examinees systematically diverge to their approximate true values of +1.0 for Examinee A and −1.0 for Examinee B. The difficulties of the 50 items selected for each examinee's CAT would track in a pattern similar to the symbols plotted for the provisional proficiency scores. The plot also indicates the estimation errors present throughout the CAT. The size of each error band about the proficiency score denotes the relative amount of error associated with the scores. Larger bands indicate more error than narrower bands. Near the left side of the plot, the error bands are quite large, indicating fairly imprecise scores. During the first half of the CAT, the error bands rapidly shrink in size. After 20 items or so, the error bands tend to stabilize (i.e., still shrink, but more slowly). This example demonstrates how the CAT quickly reduces error variance and improves the efficiency of a test.

In practice, we can achieve maximum test information in two ways. We can choose highly discriminating items that provide maximum item information within particular regions of the proficiency scale or at specific proficiency scores; that is, we sequentially select items to satisfy (2). Or, we can merely continue adding items to increment the amount of information until a desired level of precision is achieved. Maximizing the test information at each examinee's score is tantamount to choosing a customized, optimally reliable test for each examinee.

A CAT achieves either improvements in relative efficiency or a reduction in test length. Relative efficiency refers to a proportional improvement in test information and can be computed as the ratio of test information functions or reciprocal error variances for two tests (see (3); also see [9]). This relative efficiency metric can be applied to improvements in the accuracy of proficiency scores or to decision accuracy in the context of mastery tests or certification/licensure tests. For example, if the average test information function for a fixed-item test is 10.0 and the average test information function for an adaptive test is 15.0, the adaptive test is said to be 150% as efficient as the fixed-item test. Measurement efficiency is also associated with reductions in test length. For example, if a 20-item adaptive test can provide the same precision as a 40-item nonadaptive test, there is an obvious reduction in the amount of test materials needed and less testing time needed (assuming, of course, that a shorter test ought to take substantially less time than a longer test). Much of the early adaptive testing research reported that typical fixed-length academic achievement tests could be reduced by half by moving to a computerized adaptive test² [25]. However, that early research ignored the perception by some test users, especially in high-stakes testing circles, that short adaptive tests containing only 10 or 20 items could not adequately cover enough content to make valid decisions or uses of scores. Today, CAT designs typically avoid such criticism by using either fixed lengths or at least some minimum test length to ensure basic content coverage.

Nonetheless, CAT does offer improved testing efficiency, which means we can obtain more confident estimates of examinees' performance using fewer items than are typically required on nonadaptive tests. Figure 4 shows an example of the efficiency gains for a hypothetical CAT, compared to a test for which the items were randomly selected. The item characteristics used to generate the test results for Figure 4 are rather typical of most professionally developed achievement tests. The plot shows the average standard errors (the square root of the error variance from (3)) over the sequence of 50 items (horizontal axis). The standard errors are averaged for a sizable sample of examinees having different proficiency scores.

In Figure 4, we can more specifically see how the errors decrease over the course of the two tests. It is important to realize that the errors decrease for a randomly selected set of items, too. However, CAT clearly does a better job of more rapidly reducing the errors. For example, at 20 items, the CAT achieves nearly the same efficiency as the 50-item random test; at 50 items, the average standard error for the CAT is approximately half as large as for the random test.

Security Risks in CAT

The risks to the security of computer-based tests are somewhat analogous to the cheating threats faced by gambling casinos or lotteries. Given any type of high stakes (e.g., entrance into graduate school, scholarships, a coveted course placement, a job, a license, a professional certificate), there will be some group of cheaters intent on beating the odds (of random chance or luck) by employing well
Figure 4 Average standard errors for a 50-item CAT versus 50 randomly selected items
thought out strategies, which provide them with any possible advantage, however slight that may be. One of the most common security risks in high-stakes CAT involves groups of examinees collaborating to memorize and share items, especially when the same item database is active over a long period of time and testing is nearly continuous during that time period.

Unfortunately, the CAT algorithm actually exacerbates the security risks associated with cheating through systematic memorization of an item database. That is, because the CAT algorithm chooses the items to be maximally informative for each examinee, the most discriminating items are chosen far more often than the less discriminating items. This means that the effective item pool will typically be quite small, since only a subset of the entire item database is being used. Beyond the bad economic policy of underutilizing an expensive commodity such as a large portion of an item database, cheaters gain the advantage of only needing to memorize and share the most highly exposed items.

Three of the methods for dealing with overexposure risks in high-stakes CAT are: (a) increasing the size of the active item database; (b) rotating item databases over time (intact or partially); and (c) specifically controlling item exposures as part of the computerized test assembly process. The latter approach involves a modification to the CAT item selection algorithm. Traditional exposure control modifications cited in the psychometric literature include maximum information item selection with the Sympson–Hetter unconditional item exposure control procedure (see references [8] and [20]), maximum information and the Stocking and Lewis (conditional) item exposure control procedure (see [17, 18] and [19]), and maximum information and stochastic (conditional) exposure control procedures (see [14], [15]). An extensive discussion of exposure controls is beyond the scope of this entry.

CAT Variations

In recent years, CAT research has moved beyond the basic algorithm presented earlier in an attempt to generate better strategies for controlling test form quality and simultaneously reducing exposure risks. Some testing programs are even moving away from the idea of an item as the optimal unit for CAT. Four promising CAT variations are: (a) constrained CAT using shadow tests (CCAT-UST); (b) a-stratified computerized adaptive testing (AS-CAT); (c) testlet-based CAT (TB-CAT); and (d) computer-adaptive multistage testing (CA-MST). These four approaches are summarized briefly below.
Van der Linden and Reese [24] introduced the concept of a shadow test as a method of achieving an optimal CAT in the face of numerous content and other test assembly constraints (also see van der Linden [22]). Under CCAT-UST, a complete test is reassembled following each item administration. This test, called the shadow test, incorporates all of the required content constraints, item exposure rules, and other constraints (e.g., cognitive levels, total word counts, test timing requirements, clueing across items), and uses maximization of test information at the examinee's current proficiency estimate as its objective function. The shadow test model is an efficient means for balancing the goals of meeting content constraints and maximizing test information. A shadow test actually is a special case of content-constrained CAT that explicitly uses automated test assembly (ATA) algorithms for each adaptive item selection. In that regard, this model blends the efficiency of CAT with the sophistication of using powerful linear programming techniques (or other ATA heuristics) to ensure a psychometrically optimal test that simultaneously meets any number of test-level specifications and item attribute constraints. Shadow testing can further incorporate exposure control mechanisms as a security measure to combat some types of cheating [22].

a-Stratified computerized adaptive testing (AS-CAT; [4]) is an interesting modification on the adaptive theme. AS-CAT adapts the test to the examinee's proficiency like a traditional CAT. However, the AS-CAT model eliminates the need for formal exposure controls and makes use of a greater proportion of the test bank than traditional CAT. As noted earlier, the issue of test bank use is extremely important from an economic perspective (see the section Security Risks in CAT). a-Stratified CAT partitions the test bank into ordered layers, based on statistical characteristics of the items (see [4], [3]). First, the items are sorted according to their estimated IRT item discrimination parameters³. Second, the sorted list is partitioned into layers (the strata) of a fixed size. Third, one or more items are selected within each stratum by the usual CAT maximum information algorithm. AS-CAT then proceeds sequentially through the strata, from the least to the most discriminating. The item selections may or may not be subject to also meeting applicable content specifications or constraints. Chang and Ying reasoned that, during the initial portion of an adaptive test, less discriminating items could be used, since the proficiency estimates have not yet stabilized. This stratification strategy effectively ensures that the most discriminating items are saved until later in the test, when they can be more accurately targeted to the provisional proficiency scores. In short, the AS-CAT approach avoids wasting the high-demand items too early in the test and makes effective use of the low-demand items that, ordinarily, are seldom if ever selected in CAT. Chang, Qian, and Ying [2] went a step further to also block the items based on the IRT difficulty parameters. This modification is intended to deal more effectively with exposure risks when the IRT discrimination and difficulty parameters are correlated with each other within a particular item pool.

One of the principal complaints from examinees about CAT is their inability to skip items, or to review and change their answers to previously seen items. That is, because the particular sequence of item selections in CAT is dependent on the provisional scores, item review is usually prohibited. To address this shortcoming in CAT, Wainer and Kiely [26] introduced the concept of a testlet to describe a subset of items or a mini-test that could be used in an adaptive testing environment. A testlet-based CAT (TB-CAT) involves the adaptive administration of preassembled sets of items to an examinee, rather than single items. Examples of testlets include sets of items that are associated with a common reading passage or visual stimulus, or a carefully constructed subset of items that mirrors the overall content specifications for a test. After completing the testlet, the computer scores the items within it and then chooses the next testlet to be administered. Thus, this type of test is adaptive at the testlet level rather than at the item level. This approach allows examinees to skip, review, and change answers within a block of test items. It also allows for content and measurement review of these sets of items prior to operational administration.

It should be clear that testlet-based CATs are only partially adaptive, since items within a testlet are administered in a linear fashion. However, TB-CAT offers a compromise between the traditional, nonadaptive format and the purely adaptive model. Advantages of TB-CAT include increased testing efficiency relative to nonadaptive tests; the ability of content experts and sensitivity reviewers to review individual, preconstructed testlets and subtests to
evaluate content quality; and the ability of examinees to skip, review, and change answers to questions within a testlet.

Similar in concept to TB-CAT is computer-adaptive multistage testing (CA-MST). Luecht and Nungester [11] introduced CA-MST under the heading of computer-adaptive sequential testing as a framework for managing real-life test construction requirements for large-scale CBT applications (also see [10]). Functionally, CA-MST is a preconstructed, self-administering, multistage adaptive test model that employs testlets as the unit of selection. The primary difference between TB-CAT and CA-MST is that the latter prepackages the testlets, scoring tables, and routing rules for the test delivery software. It is even possible to use number-correct scoring during the real-time administration, eliminating the need for the test delivery software to compute IRT-based scores or to select testlets based on a maximum information criterion.

Like TB-CAT, CA-MST uses preconstructed testlets as the fundamental building blocks for test construction and test delivery. Testlets may range in size from several items to well over 100 items. The testlets are usually targeted to have specific statistical properties (e.g., a particular average item difficulty or to match a prescribed IRT information function), and all content balancing is built into the construction of the testlets. As part of the ATA process, the preconstructed testlets will be further prepackaged in small collections called panels. Each panel contains four to seven (or more) testlets, depending on the panel design chosen (an issue addressed below). Each testlet is explicitly assigned to a particular stage and to a specific route within the panel (easier, moderate, or harder) based upon the average difficulty of the testlet. Multiple panels can be prepared with item overlap precisely controlled across different panels. CA-MST is adaptive in nature and is therefore more efficient than using fixed test forms. Yet, CA-MST provides explicit control over content validity, test form quality, and the exposure of test materials.

Notes

1. For pass/fail mastery tests that are typically used in certification and licensure testing, a different stopping rule can be implemented, related to the desired statistical confidence in the accuracy of the classification decision(s).
2. Although adaptation is clearly important as a psychometric criterion, it is sometimes easy to overstate the real cost-reduction benefits that can be specifically attributed to gains in measurement efficiency. For example, measurement efficiency gains from adaptive testing are often equated with reduced testing time. However, any potential savings in testing time may prove to be unimportant if a computer-based examination is administered at commercial CBT centers. That is, commercial CBT centers typically charge fixed hourly rates per examinee and require a guaranteed [minimum] testing time. Therefore, if the CBT test center vendor negotiates with the test developer for a four-hour test, the same fee may be charged whether the examinee is at the center for two, three, or four hours.
3. See [7] or [9] for a more detailed description of IRT item parameters for multiple-choice questions and related objective response items.

References

[1] Birnbaum, A. (1968). Estimation of an ability, in Statistical Theories of Mental Test Scores, F.M. Lord & M.R. Novick, eds, Addison-Wesley, Reading, pp. 423–479.
[2] Chang, H.H., Qian, J. & Ying, Z. (2001). a-stratified multistage computerized adaptive testing with b-blocking, Applied Psychological Measurement 25, 333–342.
[3] Chang, H.H. & van der Linden, W.J. (2000). A zero-one programming model for optimal stratification of item pools in a-stratified computerized adaptive testing, Paper presented at the Annual Meeting of the National Council on Measurement in Education, New Orleans.
[4] Chang, H.H. & Ying, Z. (1999). a-stratified multistage computerized adaptive testing, Applied Psychological Measurement 23, 211–222.
[5] College Board (1993). Accuplacer: Computerized Placement Tests: Technical Data Supplement, Author, New York.
[6] Eignor, D.R., Stocking, M.L., Way, W.D. & Steffen, M. (1993). Case Studies in Computer Adaptive Test Design Through Simulation (RR-93-56), Educational Testing Service, Princeton.
[7] Hambleton, R.K. & Swaminathan, H.R. (1985). Item Response Theory: Principles and Applications, Kluwer Academic Publishers, Boston.
[8] Hetter, R.D. & Sympson, J.B. (1997). Item exposure control in CAT-ASVAB, in Computerized Adaptive Testing: From Inquiry to Operation, W.A. Sands, B.K. Waters & J.R. McBride, eds, American Psychological Association, Washington, pp. 141–144.
[9] Lord, F.M. (1980). Applications of Item Response Theory to Practical Testing Problems, Lawrence Erlbaum, Hillsdale.
[10] Luecht, R.M. (2000). Implementing the computer-adaptive sequential testing (CAST) framework to mass produce high quality computer-adaptive and mastery tests, Paper presented at the Meeting of the National Council on Measurement in Education, New Orleans.
[11] Luecht, R.M. & Nungester, R.J. (1998). Some practical examples of computer-adaptive sequential testing, Journal of Educational Measurement 35, 229–249.
[12] Mislevy, R.J. (1986). Bayesian modal estimation in item response models, Psychometrika 51, 177–195.
[13] Parshall, C.G., Spray, J.A., Kalohn, J.C. & Davey, T. (2002). Practical Considerations in Computer-based Testing, Springer, New York.
[14] Revuelta, J. & Ponsoda, V. (1998). A comparison of item exposure control methods in computerized adaptive testing, Journal of Educational Measurement 35, 311–327.
[15] Robin, F. (2001). Development and evaluation of test assembly procedures for computerized adaptive testing, Unpublished doctoral dissertation, University of Massachusetts.
[20] Sympson, J.B. & Hetter, R.D. (1985). Controlling item-exposure rates in computerized adaptive tests, Paper presented at the Annual Conference of the Military Testing Association, Military Testing Association, San Diego.
[21] Thissen, D. & Orlando, M. (2002). Item response theory for items scored in two categories, in Test Scoring, D. Thissen & H. Wainer, eds, Lawrence Erlbaum, Mahwah, pp. 73–140.
[22] van der Linden, W.J. (2000). Constrained adaptive testing with shadow tests, in Computer-adaptive Testing: Theory and Practice, W.J. van der Linden & C.A.W. Glas, eds, Kluwer Academic Publishers, Boston, pp. 27–52.
[23] van der Linden, W.J. & Glas, C.A.W., eds (2000). Computer-adaptive Testing: Theory and Practice, Kluwer Academic Publishers, Boston.
[24] van der Linden, W.J. & Reese, L.M. (1998). A model for optimal constrained adaptive testing, Applied Psychological Measurement
sachusetts, Amherst. chological Measurement 22, 259270.
[16] Sands, W.A., Waters, B.K. & McBride, J.R., eds. (1997). [25] Wainer, H. (1993). Some practical considerations when
Computerized Adaptive Testing: From Inquiry to Opera- converting a linearly administered test to an adaptive
tion, American Psychological Association, Washington. format, Educational Measurement: Issues and Practice
[17] Stocking, M.L. & Lewis, C. (1995). A new method 12, 1520.
for controlling item exposure in computerized adaptive [26] Wainer, H. & Kiely, G.L. (1987). Item clusters and com-
testing, Research Report No. 95-25, Educational Testing puterized adaptive testing: A case for testlets, Journal of
Service, Princeton. Educational Measurement 24, 185201.
[18] Stocking, M.L. & Lewis, C. (1998). Controlling item [27] Zara, A.R. (1994). An overview of the NCLEX/CAT
exposure conditional on ability in computerized adaptive beta test, in Paper Presented at the Meeting of the Amer-
testing, Journal of Educational and Behavioral Statistics ican Educational Research Association, New Orleans.
23, 5775.
[19] Stocking, M.L. & Lewis, C. (2000). Methods of con-
trolling the exposure of items in CAT, in Computerized (See also Structural Equation Modeling: Mixture
Adaptive Testing: Theory and Practice, W.J. van der Lin- Models)
den & C.A.W. Glas, eds, Kluwer Academic Publishers,
Boston, pp. 163182. RICHARD M. LUECHT
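The easier/moderate/harder routing of preassembled testlets described above can be sketched in a few lines. The panel layout, testlet IDs, and number-correct cut points below are hypothetical illustrations, not values from the article; operational CA-MST programs route on IRT-based statistics rather than raw proportion correct.

```python
def choose_route(proportion_correct):
    """Map performance on the previous stage to a route; the cut points
    0.4 and 0.7 are hypothetical, not values from the article."""
    if proportion_correct < 0.4:
        return "easier"
    if proportion_correct < 0.7:
        return "moderate"
    return "harder"

# A panel: stage -> route -> preassembled testlet ID (IDs are made up).
panel = {
    1: {"start": "T1"},
    2: {"easier": "T2", "moderate": "T3", "harder": "T4"},
    3: {"easier": "T5", "moderate": "T6", "harder": "T7"},
}

def administer(panel, results_by_stage):
    """Trace one examinee's path through the panel; results_by_stage holds
    (number correct, number of items) for each completed stage."""
    path = [panel[1]["start"]]
    for stage in range(2, len(panel) + 1):
        correct, n = results_by_stage[stage - 2]
        path.append(panel[stage][choose_route(correct / n)])
    return path

print(administer(panel, [(9, 10), (4, 10)]))  # ['T1', 'T4', 'T6']
```

Because every testlet is preconstructed, every possible path (here, T1 followed by one of T2–T4 and one of T5–T7) can be reviewed for content balance before administration, which is the control over content validity the article emphasizes.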
Computer-based Test Designs
APRIL L. ZENISKY
Volume 1, pp. 350–354
in
ISBN-13: 978-0-470-86080-9
ISBN-10: 0-470-86080-4
Editors
forms may be constructed well in advance of actual test administration or assembled as the examinee is taking the test. This latter circumstance, commonly referred to as linear-on-the-fly testing, or LOFT, is a special case of CFT that uses item selection algorithms which do not base item selection on estimated examinee ability; rather, selection of items proceeds relative to other predefined content and other statistical targets [2]. Each examinee receives a unique test form under the LOFT design, but this provides benefits in terms of item security rather than psychometric efficiency [4]. Making parallel forms or introducing some randomization of items across forms are additional methods by which test developers address item exposure and test security concerns in CFT.

Patelis [17] identified some other advantages associated with CFT, including (a) the opportunity for examinees to review, revise, and omit items, and (b) the perception that such tests are easier to explain to examinees. At the same time, there are some disadvantages to linear test forms, and these are similar to those arising with paper-based tests. With static forms, each form may be constructed to reflect a range of item difficulty in order to accurately assess examinees of different abilities. Consequently, the scores for some examinees (and especially those at the higher and lower ability levels) may not be as precise as they would be in a targeted test.

The linear test designs possess many benefits for measurement, and depending on the purpose of testing and the degree of measurement precision needed, they may be wholly appropriate for many large-scale testing organizations. However, other agencies may be more interested in other test designs that afford them different advantages, such as the use of shorter tests and the capacity to obtain more precise measurement all along the ability distribution, and particularly near the cut-score where pass-fail decisions are made in order to classify examinees as masters or nonmasters. The remaining two families of test designs are considered to be adaptive in nature, though they do differ somewhat with respect to structure and format.

Multistage Tests

The second family of test designs, multistage testing (MST), is often viewed as an intermediary step between a linear test and a computer-adaptive test (CAT). As a middle ground, MST combines the adaptive features of CAT with the opportunity to preassemble portions of tests prior to administration, as is done with linear testing [6]. MST designs are generally defined by using multiple sets of items that vary on the basis of difficulty and routing examinees through a sequence of such sets on the basis of the performance on previous sets. With sets varying by difficulty, the particular sequence of item sets that any one examinee is presented with as the test is administered is chosen based on an examinee's estimated ability, and so the test form is likely to differ for examinees of different ability levels. After an examinee finishes each item set, that ability estimate is updated to reflect the new measurement information obtained about that examinee's ability through administration of the item set. In MST terminology, these sets of items have come to be described as modules [13] or testlets [21], and can be characterized as short versions of linear test forms, where some specified number of individual items are administered together to meet particular test specifications and provide a certain proportion of the total test information. The individual items in a module may all be related to one or more common stems (such as passages or graphics) or be more generally discrete from one another, per the content specifications of the testing program for the test in question. These self-contained, carefully constructed, fixed sets of items are the same for every examinee to whom each set is administered, but any two examinees may or may not be presented with the same sequence of modules. Most of the common MST designs use two or three stages. However, the actual number of stages that could be implemented could be set higher (or lower) given the needs of different testing programs.

As a test design, MST possesses a number of desirable characteristics. Examinees may change answers or skip test items and return to them, prior to actually finishing a module and moving on to another. After completing a stage in MST, however, the items within that stage are usually scored using an appropriate IRT model and the next stage is selected adaptively, so no return to previous stages can be allowed (though, again, item review within a module at each stage is permissible). Measurement precision may be gained over CFT or LOFT designs without an increase in test length by adapting the exam administration to the performance levels of the examinees [11, 18]. If optimal precision of individual
[12] Lord, F.M. & Novick, M.R. (1968). Statistical Theories of Mental Test Scores, Addison-Wesley, Reading.
[13] Luecht, R.M. & Nungester, R.J. (1998). Some practical examples of computer-adaptive sequential testing, Journal of Educational Measurement 35(3), 229–249.
[14] Mills, C.N., Potenza, M.T., Fremer, J.J. & Ward, W.C., eds (2002). Computer-based Testing: Building the Foundation for Future Assessments, Lawrence Erlbaum Associates, Mahwah.
[15] Mills, C.N. & Stocking, M.L. (1996). Practical issues in large-scale computerized adaptive testing, Applied Measurement in Education 9(4), 287–304.
[16] Parshall, C.G., Spray, J.A., Kalohn, J.C. & Davey, T. (2002). Practical Considerations in Computer-based Testing, Springer, New York.
[17] Patelis, T. (2000, April). An Overview of Computer-based Testing (Office of Research and Development Research Notes, RN-09), College Board, New York.
[18] Patsula, L.N. & Hambleton, R.K. (1999, April). A comparative study of ability estimates obtained from computer-adaptive and multi-stage testing, Paper Presented at the Meeting of the National Council on Measurement in Education, Montreal.
[19] Stone, G.E. & Lunz, M.E. (1994). The effect of review on the psychometric characteristics of computerized adaptive tests, Applied Measurement in Education 7, 211–222.
[20] van der Linden, W.J. & Glas, C.A.W., eds (2000). Computerized Adaptive Testing: Theory and Practice, Kluwer, Boston.
[21] Wainer, H. & Kiely, G.L. (1987). Item clusters and computerized adaptive testing: A case for testlets, Journal of Educational Measurement 24(3), 185–201.

Further Reading

Wise, S.L. (1996, April). A critical analysis of the arguments for and against item review in computerized adaptive testing, Paper Presented at the Meeting of the National Council on Measurement in Education, New York.

APRIL L. ZENISKY
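The stage-by-stage ability updating described for MST designs above can be sketched numerically. The Rasch model, the grid bounds, and the item difficulties below are illustrative assumptions, not values from the article; an operational program would use calibrated item parameters and a proper optimizer.

```python
import math

def rasch_p(theta, b):
    """Rasch-model probability of a correct answer to an item of difficulty b."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def mle_theta(responses, difficulties):
    """Crude grid-search maximum-likelihood ability estimate, recomputed after
    each completed module; difficulties here are hypothetical."""
    def loglik(theta):
        return sum(
            math.log(p if x else 1.0 - p)
            for x, b in zip(responses, difficulties)
            for p in [rasch_p(theta, b)]
        )
    grid = [i / 100.0 for i in range(-400, 401)]  # theta restricted to [-4, 4]
    return max(grid, key=loglik)

# Update after a five-item routing module (4 of 5 correct); the new estimate
# would then drive selection of the next easier/moderate/harder module.
theta_hat = mle_theta([1, 1, 1, 0, 1], [-1.0, -0.5, 0.0, 0.5, 1.0])
print(theta_hat)
```

The key point the sketch illustrates is that scoring happens only at module boundaries: within a module the examinee may review and change answers, but once the module is scored and the next one selected, there is no return.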
Computer-based Testing
TIM DAVEY
Volume 1, pp. 354–359
goals, which usually conflict with one another [11]. The first is to maximize test efficiency by measuring examinees to appropriate levels of precision with as few items as possible. The competing adaptive testing strategies have evolved largely because different definitions can be attached to terms like efficiency, appropriate, and precision. In any case, achieving this goal can allow an adaptive test to match or better the precision of a conventional test that is several times longer.

The second goal is that each examinee's test be properly balanced in terms of item substance or content. This is important to ensure that tests are content valid and meet both examinees' and score users' subjective expectations of what a proper test should measure. The intent is to force adaptive tests to meet proper test-construction standards despite their being assembled on-the-fly as the test proceeds [15].

A third goal is to control or balance the rates at which various items in the pool are administered [16]. The concern is that without such control, a small number of items might be administered very frequently while others rarely or never appear.

The potential conflicts between these goals are many. For example, imposing strict content standards is likely to lower test precision by forcing the selection of items with less optimal measurement properties. Protecting the administration rates of items with exceptional measurement properties will have a similar effect on precision. Every adaptive test must therefore strike a balance between these goals. The three basic testing strategies that will be described do so in fundamentally different ways.

CATs and MSTs

The first two types of adaptive tests to be described share a common definition of test precision. Both the computerized adaptive test (CAT) and the multistage test (MST) attempt to accurately and efficiently estimate each examinee's location on a continuous performance or score scale. This goal will be distinguished below from that of computerized classification tests, which attempt to accurately assign each examinee to one of a small number of performance strata. Where CATs and MSTs differ is in the way this goal of maximum precision is balanced with the competing interests of controlling test content and the rates at which various items are administered. In essence, the two strategies differ in the way and the extent to which the testing process is permitted to adapt.

The CAT selects items from the pool individually or in small sets. A wide range of item selection criteria have been proposed. Some of these operate from different definitions of precision; others try to recognize the difficulty inherent in making optimal decisions when information about examinee performance is incomplete and possibly misleading. Still others more explicitly subordinate measurement precision to the goals of balancing content and item exposure.

Test scoring methods also vary widely, although nearly all are based on item response theory (IRT) [8]. These procedures assume that all items in the pool are properly characterized and lie along a single IRT proficiency scale. A test score is then an estimate of the examinee's standing on this same scale. Maximum likelihood and Bayesian estimation methods (see Bayesian Item Response Theory Estimation) are most commonly used. However, a number of more exotic estimation procedures have been proposed, largely with the goal of increased statistical robustness [17].

Because of the flexibility inherent in the process, the way in which a CAT unfolds is very difficult to predict. If the item pool is reasonably large (and most researchers recommend the pool contain at least 8–10 times more items than the test length), the particular combination of items administered to any examinee is virtually unique [9]. This variation has at least three sources. First, different items are most appropriate for different examinees along the proficiency scale. In general, easy items are most appropriate for low-scoring examinees while harder items are reserved for more proficient examinees. Second, each response an examinee makes can cause the test to move in a new direction. Correct answers generally lead to harder questions being subsequently selected while wrong answers lead to easier questions in the future. Finally, most item selection procedures incorporate a random element of some sort. This means that even examinees of similar proficiency who respond in similar ways are likely to see very different tests.

Although the changing and unpredictable nature of the CAT is the very essence of test adaptation, it can also be problematic. Some item selection procedures can paint themselves into a corner and have no choice but to administer a test that fails to conform to all test-construction rules. Measurement precision and overall test quality can also differ widely across examinees. Because tests are assembled in real time and uniquely for each examinee, there is obviously no opportunity for forms to be reviewed prior to administration. All of these concerns contributed to the development of multistage testing.

The MST is a very constrained version of CAT, with these constraints being imposed to make the testing process more systematic and predictable [7]. Development of an MST begins by assembling all of the available pool items into a relative handful of short tests, often called testlets, some of which target specific proficiency levels or ranges [20]. Content and item exposure rate considerations can be taken into account when assembling each testlet. A common practice is to assemble each testlet as a miniature version of an entire form.

Test administration usually begins by presenting each examinee with a testlet that measures across a wide proficiency range. The testlet is presented intact, with no further selection decision made until the examinee has completed all of its items. Once the initial testlet is completed, performance is evaluated and a selection decision is made. Examinees who performed well are assigned a second testlet that has been assembled to best measure higher proficiency ranges. Examinees who struggled are assigned a testlet largely comprising easier items. The logic inherent in these decisions is the same as that employed by the CAT, but selection decisions are made less frequently and the range of options for each decision is sharply reduced (since there are usually far fewer testlets available than there are items in a CAT pool). Scoring and routing decisions can be made based either on IRT methods similar to those used in CAT, or on conventional number-right scores. The former offers theoretical and psychometric advantages; the latter is far simpler operationally.

MSTs differ in the number of levels (or choice points) that each examinee is routed through. The number of proficiency-specific testlets available for selection at each level also differs. In simpler, more restrictive cases (those involving fewer levels and fewer testlets per level), it is quite possible to construct and review all of the test forms that could possibly be administered as combinations of the available elements. In all cases, the particular form administered via an MST is far more predictable than the outcome of a CAT. The price paid for increased predictability is a loss of flexibility and a decrease in test efficiency or precision. However, this decrease can be relatively minor in some cases.

Computerized Classification Tests

Also called a computerized mastery test, the computerized classification test (CCT) is based on a very different premise [6, 13]. Rather than trying to position each examinee accurately on a proficiency scale, the CCT instead tries to accurately sort examinees into broad categories. The simplest example is a test that assigns each examinee to either of two classes. These classes may be labeled master versus nonmaster, pass versus fail, or certified versus not certified. Classification is based around one or more decision thresholds positioned along the proficiency scale.

The CCT is an attractive alternative for the many testing applications that require only a broad grouping of examinees. Because it is far easier to determine whether an examinee is above or below a threshold than it is to position that examinee precisely along the continuous scale, a CCT can be even shorter and more efficient than a CAT. CCTs also lend themselves naturally to being of variable length across examinees. Examinees whose proficiency lies well above or below a decision threshold can be reliably classified with far fewer items than required by examinees who lie near that threshold.

CCTs are best conducted using any of three item selection and examinee classification methods. The first makes use of latent class IRT models, which assume a categorical rather than continuous underlying proficiency scale [4]. These models naturally score examinees through assignments to one of the latent classes or categories.

A second approach uses the sequential probability ratio test (SPRT), which conducts a series of likelihood ratio tests that lead ultimately to a classification decision [13]. The SPRT is ideally suited to tests that vary in length across examinees. Each time an item is administered and responded to, the procedure conducts a statistical test that can have either of two outcomes. The first is to conclude that an examinee can be classified with a stated level of confidence given the data collected so far. The second possible outcome is that classification cannot yet be confidently made and that testing will need to continue. The test ends either when a classification is made with confidence or some maximum number of items have been administered. In the latter case, the decision made at the end of the test may not reach the desired level of confidence. Although IRT is not necessary to administering a test under the SPRT, it can greatly increase test precision or efficiency.

The third strategy for test administration and scoring uses Bayesian decision theory to classify examinees [6]. Like the SPRT, Bayes methods offer control over classification precision in a variable length test. Testing can therefore continue until a desired level of confidence is reached. Bayes methods have an advantage over the SPRT in being more easily generalized to classification into more than two categories. They are also well supported by a rich framework of statistical theory.

Item selection under classification testing can be very different from that under CAT. It is best to select items that measure best at the classification thresholds rather than target examinee proficiency. Naturally, it is much easier to target a stationary threshold than it is to hit a constantly changing proficiency estimate. This is one factor in CCT's improved efficiency over CAT. However, the primary factor is that the CCT does not distinguish between examinees who are assigned the same classification. They are instead considered as having performed equally. In contrast, the CAT is burdened with making such distinctions, however small. The purpose of the test must, therefore, be considered when deciding whether the CAT or the CCT is the most appropriate strategy.

Operational Convenience

The third benefit of computerized testing is operational convenience for both examinees and test sponsors. These conveniences include:

Self-proctoring

Standardized paper-and-pencil tests often require a human proctor to distribute test booklets and answer sheets, keep track of time limits, and collect materials after the test ends. Administering a CBT can be as simple as parking an examinee in front of a computer. The computer can collect demographic data, orient the examinee to the testing process, administer and time the test, and produce a score report at the conclusion. Different examinees can sit side by side taking different tests with different time limits for different purposes. With conventional administration, these two examinees would likely need to be tested at different times or in different places.

Mass Customization

A CBT can flexibly customize itself uniquely to each examinee. This can go well beyond the sort of adaptivity discussed above. For example, a CBT can choose and administer each examinee only the appropriate components of a large battery of tests. Another example would be a CBT that extended the testing of failing examinees in order to provide detailed diagnostic feedback useful for improving subsequent performance. A CBT also could select or adjust items based on examinee characteristics. For example, spelling and weight and measurement conventions can be easily matched to the location of the examinee.

Reach and Speed

Although a CBT can be administered in a site dedicated to test administration, it can also be delivered anywhere and anytime a computer is available. Examinees can test individually at home over the Internet or in large groups at a centralized testing site.

It is also possible to develop and distribute a CBT much faster than a paper test can be formatted, printed, boxed, and shipped. This can allow tests to change rapidly in order to keep up with fast-changing curricula or subject matter.

Flexible Scheduling

Many CBT testing programs allow examinees to test when they choose rather than requiring them to select one of several periodic mass administrations. Allowing examinees to schedule their own test date can be more than just a convenience or an invitation to procrastinate. It can also provide real benefits and efficiencies. For example, examinees in a training program can move directly to certification and employment without waiting for some distant test administration date to arrive.

Immediate Scoring

Many CBTs are able to provide examinees with a score report immediately upon conclusion of the test.
This is particularly important when coupled with flexible scheduling. This can allow examinees to meet tight application deadlines, move directly to employment, or simply decide that their performance was substandard and register to retest.

Summary

Any description of computer-based testing is almost certain to be out-of-date before it appears in print. Things are changing quickly and will continue to do so. Over the last two decades, CBT has evolved from a largely experimental procedure under investigation to an operational procedure employed by hundreds of testing programs serving millions of examinees each year [2]. Even faster growth can be expected in the future as technology becomes a more permanent fixture of everyone's lives. Testing on computer may eventually become an even more natural behavior than testing on paper has ever been.

References

[1] Bennett, R.E. (1998). Reinventing Assessment, Educational Testing Service, Princeton.
[2] Bennett, R.E. (2002). Inexorable and inevitable: the continuing story of technology and assessment, Journal of Technology, Learning, and Assessment 1(1).
[3] Clauser, B.E., Subhiyah, R.G., Nungester, R.J., Ripkey, D.R., Clyman, S.G. & McKinley, D. (1995). Scoring a performance-based assessment by modeling the judgments of experts, Journal of Educational Measurement 32, 397–415.
[4] Dayton, C.M. (1999). Latent Class Scaling Analysis, Sage, Newbury Park.
[5] Gardner, H. (1991). Assessment in context: the alternative to standardized testing, in Cognitive Approaches to Assessment, B. Gifford & M.C. O'Connor, eds, Kluwer Academic Publishers, Boston.
[6] Lewis, C. & Sheehan, K. (1990). Using Bayesian decision theory to design a computerized mastery test, Applied Psychological Measurement 14, 367–386.
[7] Lord, F.M. (1971). A theoretical study of two-stage testing, Psychometrika 36, 227–242.
[8] Lord, F.M. (1980). Applications of Item Response Theory to Practical Testing Problems, Lawrence Erlbaum, Hillsdale.
[9] Mills, C. & Stocking, M. (1996). Practical issues in large-scale computerized adaptive testing, Applied Measurement in Education 9(4), 287–304.
[10] Parshall, C.G., Davey, T. & Pashley, P.J. (2000). Innovative item types for computerized testing, in Computerized Adaptive Testing: Theory and Practice, W.J. van der Linden & C.A.W. Glas, eds, Kluwer Academic Publishers, Norwell, pp. 129–148.
[11] Parshall, C.G., Spray, J.A., Kalohn, J.C. & Davey, T. (2002). Practical Considerations in Computer-based Testing, Springer, New York.
[12] Ramos, R.A., Heil, M.C. & Manning, C.A. (2001). Documentation of Validity for the AT-SAT Computerized Test Battery (DOT/FAA/AM-01/5), US Department of Transportation, Federal Aviation Administration, Washington.
[13] Reckase, M.D. (1983). A procedure for decision making using tailored testing, in New Horizons in Testing: Latent Trait Test Theory and Computerized Adaptive Testing, D.J. Weiss, ed., Academic Press, New York, pp. 237–255.
[14] Rosenfeld, M., Leung, S. & Oltman, P.K. (2001). The reading, writing, speaking and listening tasks important for success at the undergraduate and graduate levels, TOEFL Monograph Series Report No. 21, Educational Testing Service, Princeton.
[15] Stocking, M.L. & Swanson, L. (1993). A method for severely constrained item selection in adaptive testing, Applied Psychological Measurement 17, 277–292.
[16] Sympson, J.B. & Hetter, R.D. (1985). Controlling item exposure rates in computerized adaptive testing, Proceedings of the 27th Annual Meeting of the Military Testing Association, Naval Personnel Research and Development Center, San Diego, pp. 973–977.
[17] Thissen, D. & Wainer, H., eds (2001). Test Scoring, Lawrence Erlbaum, Hillsdale.
[18] van der Linden, W.J. & Glas, C.A.W., eds (2000). Computerized Adaptive Testing: Theory and Practice, Kluwer Academic Publishers, Boston.
[19] Wainer, H., ed. (1990). Computerized Adaptive Testing: A Primer, Lawrence Erlbaum, Hillsdale.
[20] Wainer, H., Bradlow, E.T. & Du, Z. (2000). Testlet response theory: an analog for the 3PL model useful in testlet-based adaptive testing, in Computerized Adaptive Testing: Theory and Practice, W.J. van der Linden & C.A.W. Glas, eds, Kluwer Academic Publishers, Boston, pp. 245–269.

TIM DAVEY
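The SPRT stopping rule described above for computerized classification tests can be sketched with Wald's bounds. The success probabilities p0 and p1 and the error rates alpha and beta are hypothetical placeholders for values a testing program would set, not figures from the article:

```python
import math

def sprt_mastery(responses, p0=0.5, p1=0.7, alpha=0.05, beta=0.05):
    """Wald's sequential probability ratio test for a pass/fail decision.

    Assumes (hypothetically) that a master answers each item correctly with
    probability p1 and a nonmaster with probability p0. Returns 'pass',
    'fail', or 'continue' given the responses (1/0) seen so far."""
    upper = math.log((1.0 - beta) / alpha)   # crossing it -> classify as master
    lower = math.log(beta / (1.0 - alpha))   # crossing it -> classify as nonmaster
    llr = 0.0                                # running log-likelihood ratio
    for x in responses:
        llr += math.log(p1 / p0) if x else math.log((1.0 - p1) / (1.0 - p0))
        if llr >= upper:
            return "pass"
        if llr <= lower:
            return "fail"
    return "continue"  # no confident call yet; testing would continue

print(sprt_mastery([1] * 12))    # a long run of correct answers -> 'pass'
print(sprt_mastery([1, 0] * 6))  # mixed evidence -> 'continue'
```

The sketch makes the article's point about variable length concrete: an examinee far from the threshold drives the log-likelihood ratio across a bound quickly, while one near the threshold keeps it between the bounds and must keep testing up to the maximum length.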
Concordance Rates
EDWIN J.C.G. VAN DEN OORD
Volume 1, p. 359
In Markov processes, a sequence of variables indexed by discrete or continuous time t, Xt, has the property that variables Xt1 and Xt3 are conditionally independent given an intermediate variable Xt2, t1 < t2 < t3. The concept of conditional independence is essential in the definitions of other time series models, such as autoregressive and moving-average and their combinations, because they involve (residual) contributions independent of the past.

Search for conditional independence is an important preoccupation in many applications, because it enables a simpler description or explanation of, and better insight into, the studied processes. In graphical models [5], sets of random vectors are represented by graphs in which vertices (variables) V are connected (associated) by edges E. A graph is defined as G = (V, E). An important convention supporting the interpretation of such graphs is that two sets of variables, A and B, are conditionally independent given C, if and only if there is no connection between the vertices corresponding to A and B after all the edges that involve a vertex from C are erased. A complex graph is much easier to study through sets of such conditionally independent and conditioning variables. The edges may be associated with arrows that represent the direction of causality.

A simple example is drawn in Figure 1. It represents the graph

({A, B, C, D, E}, {AC, AD, BC, CD, CE}). (5)

[Figure 1: a graph on the vertices A, B, C, D, and E, with edges AC, AD, BC, CD, and CE]

The circles A–E represent random variables, and lines are drawn between variables that are not conditionally independent. By removing all the lines that connect C with the other variables, B and E are not connected with A or D; the random vectors and variables {A, D}, {B}, and {E} are mutually conditionally independent given C.

References

NICHOLAS T. LONGFORD
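The separation rule for reading such graphs can be checked mechanically. A minimal sketch using the example graph of equation (5), with plain breadth-first search and no graph library assumed:

```python
from collections import deque

# The example graph of equation (5): vertices A-E, edges listed as pairs.
edges = [("A", "C"), ("A", "D"), ("B", "C"), ("C", "D"), ("C", "E")]

def reachable(start, edges, removed=frozenset()):
    """All vertices reachable from `start` once the `removed` vertices
    (the conditioning set) and their incident edges are erased."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    seen, queue = {start}, deque([start])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, ()):
            if v not in seen and v not in removed:
                seen.add(v)
                queue.append(v)
    return seen

# Erasing C leaves {A, D}, {B}, and {E} mutually separated:
print("B" in reachable("A", edges, removed={"C"}))  # False
```

Running the search from each vertex with C erased recovers exactly the components {A, D}, {B}, and {E} named in the text, which is the graphical statement that these sets are mutually conditionally independent given C.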
when X̄1 is the mean of X1 and X̄2 the mean of X2. The mean of this quantity estimates the variance of the half-test difference, which is an estimate for the total test error variance for the group. In the regression, Y is considered the dependent variable being predicted by X1 + X2 using polynomial regression (see Polynomial Model). One potential complication with this method is the choice of the degree of the polynomial. The lowest degree that fits the data is recommended in practice [6].

Lord's Binomial Error Model

Perhaps the best-known approach to calculating conditional SEMs was proposed by Lord [13, 14] based on the binomial error model. Under the binomial error model, each test form is regarded as a random set of n independent and dichotomously scored items. Each examinee is assumed to have a true proportion score (φ_p); the error for an individual p can therefore be defined as X_p − nφ_p. Under such a conceptualization, the error variance conditional on a person over the population of test forms is

$$\sigma^2_{E \cdot X_p} = n\varphi_p(1 - \varphi_p). \qquad (4)$$

Using the estimate of φ_p obtained through observed scores, φ̂_p = X_p/n, the estimate of the error variance for a person can be calculated:

$$\hat{\sigma}^2_{E \cdot X_p} = \frac{(n - X_p)X_p}{n - 1}. \qquad (5)$$

The square root of this quantity yields the estimated conditional SEM for the person p.

By definition, persons with the same observed score on a given test have the same error variance under this model. One potential problem with this error model is that it fails to address the fact that test developers construct forms to be more similar to one another than would be expected if items were randomly sampled from a large pool of items. Therefore, this estimator typically produces overestimates of the conditional SEMs. Keats [8] proposed a correction factor for this binomial error model that proved to be quite effective.

… random sample of items where the strata are typically based on content classifications. Let n_1, n_2, . . . be the number of items drawn from each stratum and X_{p1}, X_{p2}, . . . be the observed score of person p on strata 1, 2, . . . . In this case, the estimated total test error variance for person p is

$$\hat{\sigma}^2_{E \cdot X_p} = \sum_{h=1}^{m} \frac{(n_h - X_{ph})X_{ph}}{n_h - 1}. \qquad (6)$$

Errors are assumed independent across strata, and m is the total number of strata on the test.

When the number of items associated with any particular stratum is small, it can lead to instability in estimating the conditional SEMs using this method. Furthermore, for examinees with the same observed scores, the estimates of their conditional SEMs may be different, which might pose practical difficulties in score reporting.

Strong True Score Theory

Strong true score models can also be considered extensions of classical test theory. In addition to assuming a binomial-related model for the error score, typically a distributional form is assumed for true proportion-correct scores (φ). The most frequently used distributional form for true proportion scores is the beta distribution (see Catalogue of Probability Density Functions), whose random variable ranges from 0 to 1. The conditional error distribution Pr(X = i|φ) is typically assumed to be binomial or compound binomial. Under these assumptions, the observed score then will follow a beta-binomial or compound beta-binomial distribution ([9], [15], and [16]).

Using Lord's two-term approximation of the compound binomial distribution, a general form of the error variance [15] can be estimated using

$$\hat{\sigma}^2_{E|x_p} = \frac{x_p(n - x_p)}{n - 1}\left[1 - \frac{n(n-1)S^2_{\bar{X}_i}}{\bar{X}(n - \bar{X}) - S^2_{X_p} + nS^2_{\bar{X}_i}}\right]. \qquad (7)$$
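Equations (5) and (6) are simple enough to compute directly. A minimal Python sketch follows; the function names and the example scores are hypothetical, chosen only to show the arithmetic.

```python
import math

def binomial_cond_sem(x, n):
    """Estimated conditional SEM under Lord's binomial error model, eq. (5):
    square root of (n - X_p) * X_p / (n - 1)."""
    return math.sqrt((n - x) * x / (n - 1))

def stratified_cond_sem(strata_scores):
    """Stratified version, eq. (6): sum the per-stratum error variances
    (n_h - X_ph) * X_ph / (n_h - 1) over strata, then take the square root.
    `strata_scores` is a list of (X_ph, n_h) pairs."""
    var = sum((n_h - x_h) * x_h / (n_h - 1) for x_h, n_h in strata_scores)
    return math.sqrt(var)

# Hypothetical examinee: 30 items correct on a 40-item test ...
print(binomial_cond_sem(30, 40))
# ... and the same 30/40 split across two 20-item content strata.
print(stratified_cond_sem([(14, 20), (16, 20)]))
```

Note that the stratified estimate generally differs slightly from the single-stratum one even for the same total score, which is the score-reporting complication mentioned above.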
score and universe deviation score. Using results from Jarjoura [7], Brennan [3] showed that when the sample size for persons is large, an approximate estimator of the conditional relative SEM is

$$\hat{\sigma}(\delta_p) = \sqrt{\hat{\sigma}^2(\Delta_p) + \frac{\hat{\sigma}^2(i)}{n} - \frac{2\,\mathrm{cov}(X_{pi}, \bar{X}_{Pi} \mid p)}{n}}, \qquad (10)$$

where cov(X_{pi}, X̄_{Pi}|p) is the observed covariance over items between examinee p's item scores and the item mean scores. The observed covariance is not necessarily 0.

For multifacet designs, estimators exist for estimating the conditional relative SEMs, but they are rather complicated ([4], pp. 164–165). For practical use, the following formula often provides an adequate estimate:

$$\hat{\sigma}(\delta_p) = \sqrt{\hat{\sigma}^2(\Delta_p) - [\hat{\sigma}^2(\Delta) - \hat{\sigma}^2(\delta)]}. \qquad (11)$$

… the test score is reported. To convert raw scores to scale scores, rounding and truncation are often involved.

On the raw score scale, the conditional SEM tends to be large for middle scores and small for extreme scores. If the raw-to-scale score transformation is linear, then the scale score reliability will remain the same, and the conditional SEM of the scale scores is a multiple of the conditional SEM of raw scores, so the relative magnitude stays the same. However, a nonlinear transformation can change the relative magnitude of conditional SEMs. For example, a transformation that stretches the two ends and compresses the middle of the score distribution can make the conditional SEM fairly consistent across the score scale. With an even more extreme transformation, the conditional SEMs can be made relatively large at the two extremes and small in the middle [10].
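Equation (11) combines three estimated variance components in one line of arithmetic. A minimal Python sketch follows; the variance-component values are hypothetical, chosen only to illustrate the computation.

```python
import math

def cond_relative_sem(var_abs_p, var_abs, var_rel):
    """Approximate conditional relative SEM, eq. (11):
    sigma(delta_p) ~ sqrt(sigma^2(Delta_p) - [sigma^2(Delta) - sigma^2(delta)]).

    var_abs_p : conditional absolute error variance for person p
    var_abs   : overall absolute error variance
    var_rel   : overall relative error variance
    """
    return math.sqrt(var_abs_p - (var_abs - var_rel))

# Hypothetical variance-component estimates (not from the source):
print(cond_relative_sem(var_abs_p=6.0, var_abs=5.0, var_rel=4.2))
```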
Summary and Conclusion

Conditional SEMs provide important information on the amount of error in observed test scores. Many approaches have been proposed in the literature to estimate conditional SEMs. These approaches are based on different test theory models and assumptions. However, these methods typically produce fairly consistent results when applied to standardized achievement tests. There are no existing rules stating which method to use in the estimation of conditional SEMs. Practitioners can choose the method that aligns best with the assumptions made and the characteristics of their tests. On the raw score scale, conditional SEMs are relatively large in magnitude in the middle and small at the two ends. However, conditional SEMs for scale scores can take on a variety of forms depending on the function of raw-to-scale score transformations that are used.

References

[1] American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (1999). Standards for Educational and Psychological Testing, American Educational Research Association, Washington.
[2] Blixt, S.L. & Shama, D.B. (1986). An empirical investigation of the standard error of measurement at different ability levels, Educational and Psychological Measurement 46, 545–550.
[3] Brennan, R.L. (1998). Raw-score conditional standard errors of measurement in generalizability theory, Applied Psychological Measurement 22, 307–331.
[4] Brennan, R.L. (2001). Generalizability Theory, Springer-Verlag, New York.
[5] Feldt, L.S. (1984). Some relationships between the binomial error model and classical test theory, Educational and Psychological Measurement 44, 883–891.
[6] Feldt, L.S. & Brennan, R.L. (1989). Reliability, in Educational Measurement, R.L. Linn, ed., Macmillan, New York.
[7] Jarjoura, D. (1986). An estimator of examinee-level measurement error variance that considers test form difficulty adjustments, Applied Psychological Measurement 10, 175–186.
[8] Keats, J.A. (1957). Estimation of error variances of test scores, Psychometrika 22, 29–41.
[9] Keats, J.A. & Lord, F.M. (1962). A theoretical distribution for mental test scores, Psychometrika 27, 215–231.
[10] Kolen, M.J., Hanson, B.A. & Brennan, R.L. (1992). Conditional standard errors of measurement for scale scores, Journal of Educational Measurement 29, 285–307.
[11] Kolen, M.J. & Hanson, B.A. (1989). Scaling the ACT assessment, in Methodology Used in Scaling the ACT Assessment and P-ACT+, R.L. Brennan, ed., ACT, Iowa City, pp. 35–55.
[12] Lee, W., Brennan, R.L. & Kolen, M.J. (2000). Estimators of conditional scale-score standard errors of measurement: a simulation study, Journal of Educational Measurement 37, 1–20.
[13] Lord, F.M. (1955). Estimating test reliability, Educational and Psychological Measurement 15, 325–336.
[14] Lord, F.M. (1957). Do tests of the same length have the same standard error of measurement? Educational and Psychological Measurement 17, 510–521.
[15] Lord, F.M. (1965). A strong true score theory with applications, Psychometrika 30, 239–270.
[16] Lord, F.M. (1969). Estimating true-score distributions in psychological testing (an empirical Bayes estimation problem), Psychometrika 34, 259–299.
[17] Lord, F.M. (1984). Standard errors of measurement at different score levels, Journal of Educational Measurement 21, 239–243.
[18] Mollenkopf, W.G. (1949). Variation of the standard error of measurement, Psychometrika 14, 189–229.
[19] Thorndike, R.L. (1950). Reliability, in Educational Measurement, E.F. Lindquist, ed., American Council on Education, Washington, pp. 560–620.

YE TONG AND MICHAEL J. KOLEN
Confidence Intervals

CHRIS DRACUP
Volume 1, pp. 366–375
and the confidence interval gets narrower, reflecting the greater accuracy of estimation resulting from a larger sample.

The Coverage Interpretation of Confidence Intervals

Ninety-five percent of intervals calculated in the above way will contain the appropriate population mean, and 5% will not. This property is referred to as coverage, and it is a property of the procedure rather than of a particular interval. When it is claimed that an interval contains the estimated population mean with 95% confidence, what should be understood is that the procedure used to produce the interval will give 95% coverage in the long run. However, any particular interval that has been calculated will be either one of the 95% that contain the population mean or one of the 5% that do not. There is no way of knowing which of these two is true. According to the frequentist view of probability (see Probability: Foundations of) from which the method derives [24], for any particular confidence interval the population mean is either included or not included in it. The probability of inclusion is therefore either 1 or 0. The confidence interval approach does not give the probability that the true population mean will be in the particular interval constructed, and the term 'confidence' rather than 'probability' is employed to reinforce the distinction. Despite the claims of the authors of many elementary textbooks [8], significance tests do not result in statements about the probability of the truth of a hypothesis, and confidence intervals, which derive from the same statistical theory, cannot do so either. The relationship between confidence intervals and significance testing is examined in the next section.

The Significance Test Interpretation of Confidence Intervals

The lower limit of the 95% confidence interval calculated above is 259.02. If the sample data were used to conduct a two-tailed one-sample t Test at the 0.05 significance level to test the null hypothesis that the true population mean was equal to 259.02, it would yield a t value that was just nonsignificant. However, any null hypothesis that specified that the population mean was a value less than 259.02 would be rejected by a test carried out on the sample data. The situation is similar with the upper limit of the calculated interval, 307.16. The term 'plausible' has been applied to values of a population parameter that are included in a confidence interval [10, 29]. According to this usage, the 95% confidence interval calculated above showed any value between 259.02 and 307.16 to be a plausible value for the mean of the population, as such values would not be rejected by a two-tailed test at the 0.05 level carried out on the obtained sample data.

One-sided Confidence Intervals

In some circumstances, interest is focused on estimating only the highest (or lowest) plausible value for a population parameter. The distinction between two-sided intervals and one-sided intervals mirrors that between two-tailed and one-tailed significance tests. The formula for one of the one-sided 95% confidence intervals for the population mean, derived from the one-tailed single-sample t Test, is:

$$\mu \le \bar{X} + t_{0.950,\,n-1}\,\frac{s}{\sqrt{n}} \qquad (2)$$

Applying (2) to the example data above yields:

$$\mu \le 302.97 \qquad (3)$$

In the long run, intervals calculated in this way will contain the true population value for 95% of appropriately collected samples, but they are not popular in the behavioral sciences for the same reasons that one-tailed tests are unpopular (see Classical Statistical Inference: Practice versus Presentation).

Central and Noncentral Confidence Intervals

The two-sided and one-sided 95% confidence intervals calculated above represent the extremes of a continuum. With the two-sided interval, the 5% rejection region of the significance test was split equally between the two tails, giving limits of 259.02 and 307.16. In the long run, such intervals are exactly as likely to miss the true population mean by having a lower bound that is too high as by having an upper bound that is too low. Intervals with this property are known as central. However, with the
one-sided interval, which ranged from −∞ to 302.97, it is impossible for the interval to lie above the population mean, but in the long run 5% of such intervals will fall below the population mean. This is the most extreme case of a noncentral interval.

It is possible to divide the 5% rejection region in an infinite number of ways between the two tails. All that is required is that the tails sum to 0.05. If the lower rejection region was made to be 1% and the upper region made to be 4% by employing t_{0.010,n−1} and t_{0.960,n−1} as the critical values, then 95% of intervals calculated in this way would include the true population mean and 5% would not. Such intervals would be four times more likely to miss the population mean by being too low than by being too high.

Whilst arguments for employing noncentral confidence intervals have been put forward in other disciplines [15], central confidence intervals are usually employed in the behavioral sciences. In a number of important applications, such as the construction of confidence intervals for population means (and their differences), central confidence intervals have the advantage that they are shorter than noncentral intervals and therefore give the required level of confidence for the shortest range of plausible values.

The Confidence Interval for the Difference Between Two Population Means

If the intention is to compare means, then the appropriate approach is to construct a confidence interval for the difference in population means using a rearrangement of the formula for the two independent samples t Test (see Catalogue of Parametric Tests):

$$(\bar{X}_1 - \bar{X}_2) + t_{0.025,\,n_1+n_2-2}\sqrt{\frac{(n_1-1)s_1^2 + (n_2-1)s_2^2}{n_1+n_2-2}\left(\frac{1}{n_1}+\frac{1}{n_2}\right)} \;\le\; \mu_1 - \mu_2 \;\le\; (\bar{X}_1 - \bar{X}_2) + t_{0.975,\,n_1+n_2-2}\sqrt{\frac{(n_1-1)s_1^2 + (n_2-1)s_2^2}{n_1+n_2-2}\left(\frac{1}{n_1}+\frac{1}{n_2}\right)} \qquad (4)$$

where X̄1 and X̄2 are the two sample means, s1 and s2 are the unbiased sample standard deviations, and n1 and n2 are the two sample sizes. This is the formula for the 95% confidence interval for the difference between the means of two normal populations, which are assumed to have the same (unknown) variance. When applied to data from two appropriately collected samples, over the course of repeated application, the upper and lower limits of the interval will contain the true difference in population means for 95% of those applications. The formula is for a central confidence interval that is equally likely to underestimate as overestimate the true difference in population means. Since the sampling distribution involved is symmetrical, this is the shortest interval providing 95% confidence.

If (4) is applied to the sample data that were introduced previously (X̄1 = 283.09, s1 = 51.42, and n1 = 20) and to a second sample (X̄2 = 328.40, s2 = 51.90, and n2 = 21), the 95% confidence interval for the difference in the two population means ranges from −77.96 to −12.66 (which is symmetrical about the observed difference between the sample means of −45.31). Since this interval does not contain zero, the difference in population means specified by the nil null hypothesis, it is readily apparent that the nil null hypothesis can be rejected at the 0.05 significance level. Thus, the confidence interval provides all the information provided by a two-tailed test of the nil null hypothesis at the corresponding conventional significance level.

In addition, the confidence interval implies that a two-tailed independent samples t Test carried out on the observed data would not lead to the rejection of any null hypothesis that specified the difference in population means was equal to a value in the calculated interval, at the 0.05 significance level. This information is not provided by the usual test of the nil null hypothesis, but is a unique contribution of the confidence interval approach. However, the test of the nil null hypothesis of no population mean difference yields an exact P value for the above samples of 0.008. So the nil null hypothesis can be rejected at a stricter significance level than the 0.05 implied by the 95% confidence interval. So whilst the confidence interval approach provides a test of all possible null hypotheses at a single conventional significance level, the significance testing approach provides a more detailed evaluation of the extent to which the data cast doubt on just one of these null hypotheses.

If, as in this example, all the plausible values for the difference in the means of two populations are of the same sign, then Tukey [32] suggests that
it is appropriate to talk of a confident direction for the effect. Here, the 95% confidence interval shows that the confident direction of the effect is that the first population mean is less than the second.

The Confidence Interval for the Difference Between Two Population Means and Confidence Intervals for the Two Individual Population Means

The 95% confidence interval for the mean of the population from which the first sample was drawn has been previously shown to range from 259.02 to 307.16. The second sample yields a 95% confidence interval ranging from 304.77 to 352.02. Values between 304.77 and 307.16 are common to both intervals and therefore plausible values for the means of both populations. The overlap of the two intervals would be visually apparent if the intervals were presented graphically, and the viewer might be tempted to interpret the overlap as demonstrating the absence of a significant difference between the means of the two samples. However, the confidence interval for the difference between the population means did not contain zero, so the two sample means do differ significantly at the 0.05 level. This example serves to illustrate that caution is required in the comparison and interpretation of two or more confidence intervals. Confidence intervals and significance tests derived from the data from single samples are often based on different assumptions from those derived from the data from two (or more) samples. The former, then, cannot be used to determine the results of the latter directly. This problem is particularly acute in repeated measures designs [11, 19] (see Repeated Measures Analysis of Variance).

Confidence Intervals and Posterior Power Analysis

The entry on power in this volume demonstrates convincingly that confidence intervals cannot by themselves provide a substitute for prospective power analysis. However, they can provide useful insights into why a result was not significant, traditionally the role of posterior power analysis. The fact that a test result is not significant at the 0.05 level means that the nil null effect size lies within the 95% confidence interval. Is this because there is no real effect, or is it because there is a real effect but the study's sample size was inadequate to detect it? If the confidence interval is narrow, the study's sample size has been shown to be sufficient to provide a relatively accurate estimate of the true effect size, and, whilst it would not be sensible to claim that no real effect was present, it might confidently be claimed that no large effect was present. If, however, the confidence interval is wide, then this means that the study's sample size was not large enough to provide an accurate estimate of the true effect size. If, further, the interval ranges from a large negative limit to a large positive limit, then it would be unwise to claim anything except that more information is needed.

Confidence Intervals Based on Approximations to a Normal Distribution

As sample sizes increase, the sampling distributions of many statistics tend toward a normal distribution with a standard error that is estimated with increasing precision. This is true, for example, of the mean of a random sample from some unknown nonnormal population, where an interval extending from 1.96 estimated standard errors below the observed sample mean to the same distance above the sample mean gives coverage closer and closer to 95% as sample size increases. A rough and ready 95% confidence interval in such cases would range from two estimated standard errors below the sample statistic to two above. Intervals ranging from one estimated standard error below the sample statistic to one above would give approximately 68% coverage, and this reasoning lies behind the promotion of mean and error graphs by the editors of some journals [18]. Confidence intervals for population correlation coefficients are often calculated in this way, as Fisher's z transformation of the sample Pearson correlation (r) tends toward a normal distribution of known variance as sample size increases (see Catalogue of Parametric Tests). The procedure leads to intervals that are symmetrical about the transformed sample correlation (z_r), but the symmetry is lost when the limits of the interval are transformed back to r.

These are large sample approximations, however, and some caution is required in their interpretation. The use of the normal approximation to the binomial distribution in the construction of a confidence
interval for a population proportion will be used to illustrate some of the issues.

Constructing Confidence Intervals for the Population Proportion

Under repeated random sampling from a population where the population proportion is π, the sample proportion, p, follows a discrete, binomial distribution (see Catalogue of Probability Density Functions), with mean π and standard deviation √(π(1 − π)/n). As n increases, provided π is not too close to zero or one, the distribution of p tends toward a normal distribution. In these circumstances, √(p(1 − p)/n) can provide a reasonably good estimate of √(π(1 − π)/n), and an approximate 95% confidence interval for π is provided by

$$p - 1.96\sqrt{\frac{p(1-p)}{n}} \;\le\; \pi \;\le\; p + 1.96\sqrt{\frac{p(1-p)}{n}} \qquad (5)$$

In many circumstances, this may provide a serviceable confidence interval. However, the approach uses a continuous distribution to approximate a discrete one and makes no allowance for the fact that √(p(1 − p)/n) cannot be expected to be exactly equal to √(π(1 − π)/n). Particular applications of this approximation can result in intervals with an impossible limit (less than zero or greater than one) or of zero width (when p is zero or one).

A more satisfactory approach to the question of constructing a confidence interval for a population proportion, particularly when the sample size is small, is to return to the relationship between the plausible parameters included in a confidence interval and statistical significance. According to this, the lower limit of the 95% confidence interval should be that value of the population proportion, π_L, which if treated as the null hypothesis, would just fail to be rejected in the upper tail of a two-tailed test at the 0.05 level on the observed data. Similarly, the upper limit of the 95% confidence interval should be that value of the population proportion, π_U, which if treated as the null hypothesis, would just fail to be rejected in the lower tail of a two-tailed test at the 0.05 level on the observed data. However, there are particular problems in conducting two-tailed tests on a discrete variable like the sample proportion [20]. As a result, the approach taken is to work in terms of one-tailed tests and to find the value of the population proportion π_L, which if treated as the null hypothesis, would just fail to be rejected by a one-tailed test (in the upper tail) at the 0.025 level on the observed data, and the value π_U, which if treated as the null hypothesis, would just fail to be rejected by a one-tailed test (in the lower tail) at the 0.025 level on the observed data [3].

The approach will be illustrated with a small sample example that leads to obvious inconsistencies when the normal approximation is applied. A simple random sample of Chartered Psychologists is drawn from the British Psychological Society's Register. Of the 20 psychologists in the sample, 18 are female. The task is to construct a 95% confidence interval for the proportion of Chartered Psychologists who are female.

Using the normal approximation method, (5) would yield:

$$0.9 - 1.96\sqrt{\frac{0.9(1-0.9)}{20}} \;\le\; \pi \;\le\; 0.9 + 1.96\sqrt{\frac{0.9(1-0.9)}{20}}$$
$$0.77 \;\le\; \pi \;\le\; 1.03 \qquad (6)$$

The upper limit of this symmetrical interval is outside the range of the parameter. Even if it was treated as 1, it is clear that if this were made the null hypothesis of a one-tailed test at the 0.025 level, then it must be rejected on the basis of the observed data, since if the population proportion of females really was 1, the sample could not contain any males at all, but it did contain two.

The exact approach proceeds as follows. That value of the population proportion, π_L, is found such that under this null hypothesis the sum of the probabilities of 18, 19, and 20 females in the sample of 20 would equal .025. Reference to an appropriate table [12] or statistical package (e.g., Minitab) yields a value of 0.683017. Then, that value of the population proportion, π_U, is found such that under this null hypothesis the sum of the probabilities of 0, 1, 2, . . . , 18 females in a sample of 20 would be .025. The desired value is 0.987651.

So the exact 95% confidence interval for the population proportion of females is:

$$0.683017 \;\le\; \pi \;\le\; 0.987651 \qquad (7)$$

If a one-tailed test at the 0.025 level was conducted on any null hypothesis that specified that the population proportion was less than 0.683017, it would
be rejected on the basis of the observed data. So also would any null hypothesis that specified that the population proportion was greater than 0.987651.

In contrast to the limits indicated by the approximate method, those produced by the exact method will always lie within the range of the parameter and will not be symmetrical around the sample proportion (except when this is exactly 0.5). The exact method yields intervals that are conservative in coverage [23, 33], and alternatives have been suggested. One approach questions whether it is appropriate to include the full probability of the observed outcome in the tail when computing π_L and π_U. Berry and Armitage [3] favor Mid-P confidence intervals where only half the probability of the observed outcome is included. The Mid-P intervals will be rather narrower than the exact intervals, and will result in a rather less conservative coverage. However, in some circumstances the Mid-P coverage can fall below the intended level of confidence, which cannot occur with the exact method [33].

Confidence Intervals for Standardized Effect Size Measures

Effect size can be expressed in a number of ways. In a simple two-condition experiment, the population effect size can be expressed simply as the difference between the two population means, μ1 − μ2, and a confidence interval can be constructed as illustrated above. The importance of a particular difference between population means is difficult to judge except by those who are fully conversant with the measurement scale that is being employed. Differences in means are also difficult to compare across studies that do not employ the same measurement scale, and the results from such studies prove difficult to combine in meta-analyses. These considerations led to the development of standardized effect size measures that do not depend on the particular units of a measurement scale. Probably the best known of these measures is Cohen's d. This expresses the difference between the means of two populations in terms of the populations' (assumed shared) standard deviation:

$$d = \frac{\mu_1 - \mu_2}{\sigma} \qquad (8)$$

Standardized effect size measures are unit-less, and therefore, it is argued, are comparable across studies employing different measurement scales. A d value of 1.00 simply means that the population means of the two conditions differ by one standard deviation. Cohen [5] identified three values of d (0.2, 0.5, and 0.8) to represent small, medium, and large effect sizes in his attempts to get psychologists to take power considerations more seriously, though the appropriateness of canned effect sizes has been questioned by some [14, 16].

Steiger and Fouladi [31] have provided an introduction to the construction of confidence intervals for standardized effect size measures like d from their sample estimators. The approach, based on noncentral probability distributions, proceeds in much the same way as that for the construction of an exact confidence interval for the population proportion discussed above. For a 95% confidence interval, those values of the noncentrality parameter are found (by numerical means) for which the observed effect size is in the bottom 0.025 and top 0.025 of the sampling distributions. A simple transformation converts the obtained limiting values of the noncentrality parameter into standardized effect sizes. The calculation and reporting of such confidence intervals may serve to remind readers that observed standardized effect sizes are random variables, and are subject to sampling variability like any other sample statistic.

Confidence Intervals and the Bootstrap

Whilst the central limit theorem might provide support for the use of the normal distribution in constructing approximate confidence intervals in some situations, there are other situations where sample sizes are too small to justify the process, or sampling distributions are suspected or known to depart from normality. The Bootstrap [7, 10] is a numerical method designed to derive some idea of the sampling distribution of a statistic without recourse to assumptions of unknown or doubtful validity. The approach is to treat the collected sample of data as if it were the population of interest. A large number of random samples are drawn with replacement from this population, each sample being of the same size as the original sample. The statistic of interest is calculated for each of these resamples and an empirical sampling distribution is produced. If a large enough number of resamplings is performed, then a 95% confidence interval can be produced directly from these
by identifying those values of the statistic that correspond to the 2.5th and 97.5th percentiles. On other occasions, the standard deviation of the statistic over the resamples is calculated and this is used as an estimate of the true standard error of the statistic in one of the standard formulae. Various ways of improving the process have been developed and subjected to some empirical testing. With small samples, or with samples that just by chance are not very representative of their parent population, the method may provide rather unreliable information. Nonetheless, Efron and Tibshirani [10] provide evidence that it works well in many situations, and the methodology is becoming increasingly popular.

Bayesian Certainty Intervals

Confidence intervals derive from the frequentist approach to statistical inference that defines probabilities as the limits in the long run of relative frequencies (see Probability: Foundations of). According to this view, the confidence interval is the random variable, not the population parameter [15, 24]. In consequence, when a 95% confidence interval is constructed, the frequentist position does not allow statements of the kind 'The value of the population parameter lies in this interval with probability 0.95.' In contrast, the Bayesian approach to statistical inference (see Bayesian Statistics) defines probability as a subjective or psychological variable [25, 26, 27]. According to this view, it is not only acceptable to claim that particular probabilities are associated with particular values of a parameter; such claims are the very basis of the inference process. Bayesians use Bayes's Theorem (see Bayesian Belief Networks) to update their prior beliefs about the probability distribution of a parameter on the basis of new information. Sometimes, this updating process uses elements of the same statistical theory that is employed by a frequentist, and the final result may appear to be identical to a confidence interval. However, the …

… what value seemed most probable, whether the distribution was symmetrical about this value, how spread out the distribution was, perhaps trying to specify the parameter values that contained the middle 50% of the distribution, the middle 90%, and so on. This process would result in a subjective prior distribution for the population mean. The Bayesian would then collect sample data, in very much the same way as a Frequentist, and use Bayes's Theorem to combine the information from this sample with the information contained in the prior distribution to compute a posterior probability distribution. If the 2.5th and 97.5th percentiles of this posterior probability distribution are located, they form the lower and upper bounds on the Bayesian's 95% certainty interval [26].

Certain parallels and distinctions can be drawn between the confidence interval of the Frequentist and the certainty interval of the Bayesian. The center of the Frequentist's confidence interval would be the sample mean, but the center of the Bayesian's certainty interval would be somewhere between the sample mean and the mean of the prior distribution. The smaller the variance of the prior distribution, the nearer the center of the certainty interval will be to that of the prior distribution. The width of the Frequentist's confidence interval would depend only on the sample standard deviation and the sample size, but the width of the Bayesian's certainty interval would be narrower, as the prior distribution contributes a virtual sample size that can be combined with the actual sample size to yield a smaller posterior estimate of the standard error, as well as contribute to the degrees of freedom of any required critical values. So, to the extent that the Bayesian had any views about the likely values of the population mean before data were collected, these views will influence the location and width of the 95% certainty interval. Any interval so constructed is to be interpreted as a probability distribution for the population mean, permitting the Bayesian to make statements like, 'The probability that μ lies between L and U is X.' Of course, none of this is legitimate to the Frequentist, who distrusts the subjectivity of the Bayesian's prior
interpretation placed on the resulting interval could distribution, its influence on the resulting interval,
hardly be more different. and the attribution of a probability distribution to a
Suppose that a Bayesian was interested in estimat- population parameter that can only have one value.
ing the mean score of some population on a partic- If the Bayesian admits ignorance of the parameter
ular measurement scale. The first part of the process being estimated prior to the collection of the sample,
would involve attempting to sketch a prior proba- then a uniform prior might be specified which says
bility distribution for the population mean, showing that any and all values of the population parameter are
equally likely. In these circumstances, the certainty interval of the Bayesian and the confidence interval of the Frequentist will be identical in numerical terms. However, an unbridgeable gulf will still exist between the interpretations that would be made of the interval.

    Interval estimates should be given for any effect sizes involving principal outcomes. Provide intervals for correlations and other coefficients of association or variation whenever possible. [p. 599]

References
[19] Loftus, G.R. & Masson, M.E.J. (1994). Using confidence intervals in within-subjects designs, Psychonomic Bulletin & Review 1, 476–480.
[20] Macdonald, R.R. (1998). Conditional and unconditional tests of association in 2 × 2 tables, British Journal of Mathematical and Statistical Psychology 51, 191–204.
[21] Meehl, P.E. (1967). Theory testing in psychology and in physics: a methodological paradox, Philosophy of Science 34, 103–115.
[22] Meehl, P.E. (1997). The problem is epistemology, not statistics: replace significance tests by confidence intervals and quantify accuracy of risky numerical predictions, in What if there were no Significance Tests? L.L. Harlow, S.A. Mulaik & J.H. Steiger, eds, Lawrence Erlbaum Associates, Mahwah.
[23] Neyman, J. (1935). On the problem of confidence intervals, Annals of Mathematical Statistics 6, 111–116.
[24] Neyman, J. (1937). Outline of a theory of statistical estimation based on the classical theory of probability, Philosophical Transactions of the Royal Society, A 236, 333–380.
[25] Novick, M.R. & Jackson, P.H. (1974). Statistical Methods for Educational and Psychological Research, McGraw-Hill, New York.
[26] Phillips, L.D. (1993). Bayesian Statistics for Social Scientists, Nelson, London.
[27] Pruzek, R.M. (1997). An introduction to Bayesian inference and its applications, in What if there were no Significance Tests? L.L. Harlow, S.A. Mulaik & J.H. Steiger, eds, Lawrence Erlbaum Associates, Mahwah.
[28] Schmidt, F.L. (1996). Statistical significance and cumulative knowledge in psychology: implications for the training of researchers, Psychological Methods 1, 115–129.
[29] Smithson, M. (2000). Statistics with Confidence, Sage, London.
[30] Smithson, M. (2003). Confidence Intervals, Sage, Thousand Oaks.
[31] Steiger, J.H. & Fouladi, R.T. (1997). Noncentrality interval estimation and the evaluation of statistical models, in What if there were no Significance Tests? L.L. Harlow, S.A. Mulaik & J.H. Steiger, eds, Lawrence Erlbaum Associates, Mahwah.
[32] Tukey, J. (1991). The philosophy of multiple comparisons, Statistical Science 6, 100–116.
[33] Vollset, S.E. (1993). Confidence intervals for a binomial proportion, Statistics in Medicine 12, 809–824.
[34] Wilkinson, L. and The Task Force on Statistical Inference. (1999). Statistical methods in psychology journals: guidelines and explanations, American Psychologist 54, 594–604.

CHRIS DRACUP

Confidence Intervals: Nonparametric

PAUL H. GARTHWAITE

Volume 1, pp. 375–381
Table 1 Average LSAT and GPA for students entering 15 law schools

Law school    Average LSAT    Average GPA
 1            576             3.39
 2            635             3.30
 3            558             2.81
 4            578             3.03
 5            666             3.44
 6            580             3.07
 7            555             3.00
 8            661             3.43
 9            651             3.36
10            605             3.13
11            653             3.12
12            575             2.74
13            545             2.76
14            572             2.88
15            594             2.96

each yi has a probability of 1/15 of occurring whenever a law school is picked at random.

2. As the original sample size was 15, a bootstrap sample is chosen by selecting 15 observations from F̂.

3. ρ is the true population correlation between LSAT and GPA across all law schools. In the original data, the sample correlation between LSAT and GPA is ρ̂ = 0.776; ρ̂* is their correlation in a bootstrap sample.

4. Steps 2 and 3 are repeated many times to obtain an estimate of the bootstrap distribution of ρ̂*.

For the law school data, we generated 10 000 bootstrap samples of 15 observations and determined the sample correlation for each. A histogram of the 10 000 correlations that were obtained is given in Figure 1.

[Figure 1 near here: histogram of the 10 000 bootstrap correlations; horizontal axis, correlation (0.00 to 0.96); vertical axis, frequency (0 to 700).]

The histogram shows that the distribution of bootstrap correlations has marked skewness. The sample size is small and a 95% confidence interval for ρ would be very wide, so suppose a central 80% confidence interval is required. One thousand of the 10 thousand values of ρ̂* were less than 0.595 and 1000 were above 0.927, so ρ̂*(0.1) = 0.595 and ρ̂*(0.9) = 0.927. The percentile method thus yields (0.595, 0.927) as an 80% confidence interval for ρ, while the basic bootstrap method gives an interval of (0.776 − (0.927 − 0.776), 0.776 + (0.776 − 0.595)) = (0.625, 0.957). The difference between the two confidence intervals is substantial because of the skewness of the bootstrap distribution.

Whether the percentile method is to be preferred to the basic bootstrap method, or vice-versa, depends upon characteristics of the actual distribution of ρ̂. If there is some transformation of ρ̂ that gives a symmetric distribution, then the percentile method is optimal. Surprisingly, this is the case even if we do not know what the transformation is. As an example, for the sample correlation coefficient there is a transformation that gives an approximately symmetric distribution (Fisher's tanh⁻¹ transformation). Hence, in the above example the percentile confidence interval is to be preferred to the basic bootstrap interval, even though no transformations were used. Some situations where the basic bootstrap method is optimal are described elsewhere in this encyclopedia (see Bootstrap Inference).

A variety of extensions have been proposed to improve the methods, notably the bias-corrected percentile method [4] and the bootstrap-t method [8], which are modifications of the percentile method and basic bootstrap method, respectively. It should be mentioned that the quantity that should be resampled for bootstrapping is not always obvious. For instance, in regression problems the original data
points are sometimes resampled (analogous to the above example), while sometimes a regression model is fitted and the residuals from the model are resampled.

Confidence Intervals from Permutation Tests

Permutation tests (see Permutation Based Inference) and randomization tests (see Randomization Based Tests) are an appealing approach to hypothesis testing. They typically make fewer distributional assumptions than parametric tests, and usually they are just slightly less powerful (see Power) than their parametric alternatives. Some permutation tests are used to test hypotheses that do not involve parameters, such as whether observations occur at random or whether two variables are independent. This may seem natural, since without parametric assumptions there are few parameters to test. However, quantities such as population means are well-defined without making distributional assumptions, and permutation tests may be used to test hypotheses about their value, or the difference between two population means, for example. If a permutation test or randomization test can be used to test whether some specified value is taken by a parameter, then the test also enables a confidence interval for the parameter to be constructed.

In a permutation test, the first step is to choose a plausible test statistic for the hypothesis under consideration and to determine the value it takes for the observed data. Then, we find permutations of the data such that the probability of each permutation can be determined under the null hypothesis, H0; usually the permutations are chosen so that each of them is equally probable under H0. The value of the test statistic is then calculated for each permutation, and a P value is evaluated by comparing the value of the test statistic for the actual data with the probability distribution of the statistic over all the permutations. H0 is rejected if the observed value is far into a tail of the probability distribution. In practice, the number of permutations may be so large that the test statistic is only evaluated for a randomly chosen subset of them, in which case the permutation test is often referred to as a randomization test.

Suppose interest centers on a scalar parameter θ and that, for any value θ0 we specify, a permutation or randomization test can be used to test the hypothesis that θ takes the value θ0. Also, assume one-sided tests can be performed, and let α1 be the P value from the test of H0: θ = θL against H1: θ > θL and let α2 be the P value from the test of H0: θ = θU against H1: θ < θU. Then, from the relationship between hypothesis tests and confidence intervals, (θL, θU) is a 100(1 − α1 − α2)% confidence interval for θ. Usually we want to specify α1 and α2, typically choosing them to each equal 0.025 so as to obtain a central 95% confidence interval. Then, the task is to find values θL and θU that give these P values.

In practice, the values of θL and θU are often found by using a simple trial-and-error search based on common sense. An initial guess is made of the value of θL. Denoting this first guess by θL1, a permutation test is conducted to test H0: θ = θL1 against H1: θ > θL1. Usually between 1000 and 5000 permutations would be used for the test. Taking account of the P value from this first test, a second guess of the value of θL is made and another permutation test conducted. This sequence of trial-and-error continues until the value of θL is found to an adequate level of accuracy (when the P value of the test will be close to α1). A separate search is conducted for θU.

As an example, we consider data from a study into the causes of schizophrenia [12]. Twenty-five hospitalized schizophrenic patients were treated with antipsychotic medication and, some time later, hospital staff judged ten of the patients to be psychotic and fifteen patients to be nonpsychotic. Samples of cerebrospinal fluid were taken from each patient and assayed for dopamine b-hydroxylase activity (nmol/(ml)(hr)/mg). Results are given in Table 2.
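The mechanics just described (subtract θ0 from one sample, compute a test statistic, and compare it with its value under random relabelings) can be sketched in a few lines of Python. This is a minimal illustration: the data below are made up, not the dopamine measurements of Table 2, and the function name is our own.

```python
import random
from statistics import mean

def randomization_pvalue(x, y, theta0=0.0, n_perm=4999, seed=1):
    """One-sided P value for H0: theta = theta0 vs H1: theta > theta0,
    where theta is the difference in population means (x minus y).
    Follows the scheme in the text: subtract theta0 from the x-sample,
    then repeatedly relabel the pooled values at random."""
    rng = random.Random(seed)
    x_shift = [xi - theta0 for xi in x]      # x' = x - theta0
    observed = mean(x_shift) - mean(y)       # test statistic for the real data
    pooled = x_shift + list(y)
    count = 1                                # unpermuted data count as one permutation
    for _ in range(n_perm):
        rng.shuffle(pooled)
        d = mean(pooled[:len(x)]) - mean(pooled[len(x):])
        if d >= observed:
            count += 1
    return count / (n_perm + 1)

# Hypothetical data (NOT the dopamine measurements of Table 2):
x = [0.021, 0.025, 0.028, 0.030, 0.024]
y = [0.015, 0.017, 0.014, 0.019, 0.016, 0.018]
p = randomization_pvalue(x, y, theta0=0.0, n_perm=999)
```

In a trial-and-error interval search, this routine would simply be called repeatedly with different values of theta0 until the P value is close to the desired α.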
The sample totals are Σxi = 0.2426 and Σyi = 0.2464, giving sample means x̄ = 0.02426 and ȳ = 0.01643. We assume the distributions for the two types of patients are identical in shape, differing only in their locations, and we suppose that a 95% confidence interval is required for θ = μx − μy, the difference between the population means. The point estimate of θ is θ̂ = 0.02426 − 0.01643 = 0.00783.

Let θL = 0.006 be the first guess at the lower confidence limit. Then, a permutation test of H0: θ = 0.006 against H1: θ > 0.006 must be conducted. To perform the test, 0.006 is subtracted from each measurement for the patients judged psychotic. Denote the modified values by x1', ..., x10' and their mean by x̄'. Under H0, the distributions of X' and Y are identical, so a natural test statistic is x̄' − ȳ, which should have a value close to 0 if H0 is true. Without permutation, the value of x̄' − ȳ equals 0.00783 − 0.006 = 0.00183. A permutation of the data simply involves relabeling 10 of the values x1', ..., x10', y1, ..., y15 as X-values and relabeling the remaining 15 values as Y-values. The difference between the mean of the relabeled X-values and the mean of the relabeled Y-values is the value of the test statistic for this permutation. Denote this difference by d.

The number of possible permutations is the number of different ways of choosing 10 values from 25 and exceeds 3 million. This number is too large for it to be practical to evaluate the test statistics for all permutations, so instead a random selection of 4999 permutations was used, giving 5000 values of the test statistic when the value for the unpermuted data (0.00183) is included. Nine hundred and twelve of these 5000 values were equal to or exceeded 0.00183, so H0 is rejected at the 18.24% level of significance. The procedure was repeated using further guesses/estimates of the lower confidence limit. The values examined and the significance levels from the hypothesis tests are shown in Table 3.

The value of 0.0038 was taken as the lower limit of the confidence interval, and a corresponding search for the upper limit gave 0.0120 as its estimate. Hence the permutation method gave (0.0038, 0.0120) as the 95% confidence interval for the mean difference in dopamine level between the two types of patient.

A trial-and-error search for confidence limits requires a substantial number of permutations to be performed and is clearly inefficient in terms of computer time. (The above search for just the lower confidence limit required 25 000 permutations.) Human input to give the next estimate is also required several times during the search. A much more automated and efficient search procedure is based on the Robbins–Monro process. This search procedure has broad application and is described below under Distribution-free confidence intervals. Also, for certain simple problems, there are methods of determining permutation intervals that do not rely on search procedures. In particular, there are methods of constructing permutation intervals for the mean of a symmetric distribution [9], the difference between the locations of two populations that differ only in their location [9], and the regression coefficient in a simple regression model [5].

Distribution-free Confidence Intervals

The most common distribution-free tests are rank-based tests, and methods of forming confidence intervals from some specific rank-based tests are described elsewhere in this encyclopedia. Here we simply give the general method and illustrate it for the Mann–Whitney test, before considering a distribution-free method that does not involve ranks.

The Mann–Whitney test compares independent samples from two populations. It is assumed that the populations have distributions of similar shape but whose means differ by θ. For definiteness, let μx and μy denote the two means and let θ = μx − μy. The Mann–Whitney test examines the null hypothesis that θ equals a specified value, say θ0, testing it against a one- or two-sided alternative hypothesis. (If the null hypothesis is that the population distributions are identical, then θ0 = 0.) The mechanics of the test are to subtract θ0 from each of the sample values from the population with mean μx. Let x1', ..., xm' denote these revised values and let y1, ..., yn denote the values from the other sample. The combined set of values x1', ..., xm', y1, ..., yn are then ranked from smallest to largest, and the sum of the ranks of the x'-values is determined. This sum (or some value derived from its value and the sample sizes) is used as the test statistic. For small sample sizes, the test statistic is compared with tabulated critical values and, for larger sample sizes, an asymptotic approximation is used.

The correspondence between confidence intervals and hypothesis tests can be used to form confidence intervals. For an equal-tailed 100(1 − 2α)% confidence interval, values θU and θL are required such that H0: θ = θL is rejected in favour of H1: θ > θL at a P value of α and H0: θ = θU is rejected in favour of H1: θ < θU at a P value, again, of α. With rank-based tests, there are only a finite number of values that the test statistic can take (rather than a continuous range of values), so it is only possible to find P values that are close to the desired value α. An advantage of rank-based tests, though, is that often there are quick computational methods for finding values θL and θU that give the P values closest to α. The broad strategy is as follows.

1. For a one-sample test, order the data values from smallest to largest and, for tests involving more than one sample, order each sample separately.

2. Use statistical tables to find critical values of the test statistic that correspond to the P values closest to α.

3. The extreme data values are the ones that increase the P value. For each confidence limit, separately determine the set of data values or combinations of data values that together give a test statistic that equals the critical value or is just in the critical region. The combination of data values that were last to be included in this set determine the confidence interval.

The description of Step 3 is necessarily vague, as it varies from test to test. To illustrate the method, we consider the data on schizophrenic patients given in Table 2, and use the Mann–Whitney test to derive a 95% confidence interval for θ, the difference in dopamine b-hydroxylase activity between patients judged psychotic and those judged nonpsychotic. The data for each group of patients have already been ordered according to size, so Step 1 is not needed. The two sample sizes are ten and fifteen, and as a test statistic we will use the sum of the ranks of the psychotic patients. From statistical tables (e.g., Table A.7 in [1]), 95 is the critical value at significance level α = 0.025 for a one-tailed test of H0: θ = θU against H1: θ < θU. Let X' = X − θU. If the X'-values were all smaller than all the Y-values, the sum of their ranks would be 10(10 + 1)/2 = 55. As long as forty or fewer of the X − Y differences are positive, H0 is rejected, as 40 + 55 equals the critical value. At the borderline between rejecting and not rejecting H0, the 40th largest X − Y difference is 0. From this it follows that the upper confidence limit is equal to the 40th largest value of X − Y. The ordered data values in Table 2 simplify the task of finding the 40th largest difference. For example, the X − Y differences greater than 0.017 are the following 12 combinations: X = 0.0320 in conjunction with Y = 0.0104, 0.0105, 0.0112, 0.0116, 0.0130 or 0.0145; X = 0.0306 in conjunction with Y = 0.0104, 0.0105, 0.0112, 0.0116 or 0.0130; X = 0.0275 in conjunction with Y = 0.0104. A similar count shows that the 40th largest X − Y difference is 0.0120, so this is the upper limit of the confidence interval. Equivalent reasoning shows that the lower limit is the 40th smallest X − Y difference, which here takes the value 0.0035. Hence, the Mann–Whitney test yields (0.0035, 0.0120) as the central 95% confidence interval for θ.

A versatile method of forming confidence intervals is based on the Robbins–Monro process [11]. The method can be used to construct confidence intervals in one-parameter problems, provided the mechanism that gave the real data could be simulated if the parameter's value were known. We first consider this type of application before describing other situations where the method is useful.

Let θ denote the unknown scalar parameter and suppose a 100(1 − 2α)% equal-tailed confidence interval for θ is required. The method conducts a separate sequential search for each endpoint of the confidence interval. Suppose, first, that a search for the upper limit, θU, is being conducted, and let Ui be the current estimate of the limit after i steps of the search. The method sets θ equal to Ui and then generates a set of data using a mechanism similar to that which gave the real data. From the simulated data, an estimate of θ is determined, θi* say. Let θ̂ denote the estimate of θ given by the actual sample data. The updated estimate of θU, Ui+1, is given by

    Ui+1 = Ui − cα/i,          if θi* > θ̂
    Ui+1 = Ui + c(1 − α)/i,    if θi* ≤ θ̂        (1)

where c is a positive constant that is termed the step-length constant.
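The update rule of equation (1) can be turned into a short search routine. The sketch below is only an illustration under stated assumptions: the data are taken to be Normal(θ, 1), so that data sets can be simulated for any trial value of θ; α = 0.1 gives an 80% interval; and the starting value and step-length constant c are crude choices of our own, not the ones recommended by Garthwaite and Buckland.

```python
import random
from statistics import mean, stdev

def robbins_monro_upper(data, simulate, alpha=0.1, steps=4000, seed=7):
    """Search for the upper limit of a 100(1 - 2*alpha)% interval using
    equation (1): at step i, simulate data with the parameter set to the
    current estimate U_i, estimate theta from it (theta_i_star), then step
    down by c*alpha/i if theta_i_star exceeds the real-data estimate, and
    up by c*(1 - alpha)/i otherwise."""
    rng = random.Random(seed)
    theta_hat = mean(data)            # estimate from the actual sample
    c = 2 * stdev(data)               # crude step-length constant (an assumption)
    u = theta_hat + stdev(data)       # crude starting value above theta_hat
    for i in range(1, steps + 1):
        theta_star = mean(simulate(u, len(data), rng))
        if theta_star > theta_hat:
            u -= c * alpha / i
        else:
            u += c * (1 - alpha) / i
    return u

# Hypothetical mechanism: observations are Normal(theta, 1).
def simulate(theta, n, rng):
    return [rng.gauss(theta, 1.0) for _ in range(n)]

data = [1.2, -0.3, 0.8, 0.1, 1.5, 0.6, -0.2, 0.9, 0.4, 1.1]
upper = robbins_monro_upper(data, simulate)
```

A search for the lower limit would use the same loop with the signs of equation (2). Note that each step needs only one simulated data set, which is what makes the method so much cheaper than the trial-and-error search described earlier.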
The method may be thought of as stepping from one estimate of θU to another, and c governs the magnitude of the steps. If Ui is equal to the 100(1 − αi)% point, the expected distance stepped is

    −(1 − αi)(cα/i) + αi c(1 − α)/i = c(αi − α)/i,

which shows that each step reduces the expected distance from θU. A predetermined number of steps are taken and the last Ui is adopted as the estimate of θU. An independent search is carried out for the lower limit, θL. If Li is the estimate after i steps of the search, then Li+1 is found as

    Li+1 = Li + cα/i,          if θi* < θ̂
    Li+1 = Li − c(1 − α)/i,    if θi* ≥ θ̂        (2)

This method of forming confidence intervals was developed by Garthwaite and Buckland [7]. They suggest methods of choosing starting values for a search and of choosing the value of the step-length constant. They typically used 5000 steps in the search for an endpoint.

As a practical example where the above method is useful, consider the data in Table 4. These are from a multiple-sample mark-recapture experiment to study a population of frogs [10]. Over a one-month period, frogs were caught in six random samples and, after a sample had been completed, the frogs that had been caught were marked and released. Table 4 gives the numbers of frogs caught in each sample, the number of these captures that were recaptures, and the number of frogs that had been marked before the sample. One purpose of a mark-recapture experiment is to estimate the population size, θ say. A point estimate of θ can be obtained by maximum likelihood, but most methods of forming confidence intervals require asymptotic approximations that may be inexact. The method based on the Robbins–Monro process can be applied, however, and will give exact intervals. At the ith step in the search for an endpoint, a population size is specified (Ui or Li) and equated to θ. Then it is straightforward to simulate six samples of the sizes given in Table 4 and to record the number of recaptures. The estimate of θ based on this resample is θi*, and its value determines the next estimate of the endpoint. Garthwaite and Buckland [7] applied the method to these data and give (932, 1205) as a 95% confidence interval for the population size.

The procedure developed by Garthwaite and Buckland can also be used to form confidence intervals from randomization tests [6]. It is assumed, of course, that the randomization test examines whether a scalar parameter θ takes a specified value θ0 and that, for any θ0, the hypothesis H0: θ = θ0 could be tested against one-sided alternative hypotheses. In the search for the upper limit, suppose Ui is the estimate of the limit after i steps of the Robbins–Monro search. Then the mechanics for a randomization test of the hypothesis H0: θ = Ui against H1: θ < Ui are followed, except that only a single permutation of the data is taken. Let Ti denote the value of the test statistic from this permutation and let T̂i denote its value for the actual data (before permutation). The next estimate of the limit is given by

    Ui+1 = Ui − cα/i,          if Ti > T̂i
    Ui+1 = Ui + c(1 − α)/i,    if Ti ≤ T̂i        (3)

where c is the step-length constant defined earlier and a 100(1 − 2α)% confidence interval is required. A predetermined number of steps are taken and the last Ui is adopted as the estimate of the upper limit. An equivalent search is conducted for the lower limit. An important feature of the search process is that only one permutation is taken at each hypothesized value, rather than the thousands that are taken in unsophisticated search procedures. Typically, 5000 permutations are adequate for estimating each confidence limit.

Most hypothesis tests based on ranks may be viewed as permutation tests in which the actual data values are replaced by ranks. Hence, Robbins–Monro searches may be used to derive confidence intervals from such tests. This approach can be useful if the ranks contain a large number of ties: rank tests typically assume that there are no ties in the ranks, and the coverage of the confidence intervals they yield may be uncertain when this assumption is badly violated.

References

[1] Conover, W.J. (1980). Practical Nonparametric Statistics, John Wiley, New York.
[2] Davison, A.C. & Hinkley, D.V. (1997). Bootstrap Methods and their Application, Cambridge University Press, Cambridge.
[3] Efron, B. & Gong, G. (1983). A leisurely look at the bootstrap, the jackknife and cross-validation, American Statistician 37, 36–48.
[4] Efron, B. & Tibshirani, R.J. (1993). An Introduction to the Bootstrap, Chapman and Hall, London.
[5] Gabriel, K.R. & Hall, W.J. (1983). Rerandomization inference on regression and shift effects: computationally feasible methods, Journal of the American Statistical Association 78, 827–836.
[6] Garthwaite, P.H. (1996). Confidence intervals from randomization tests, Biometrics 52, 1387–1393.
[7] Garthwaite, P.H. & Buckland, S.T. (1992). Generating Monte Carlo confidence intervals by the Robbins–Monro process, Applied Statistics 41, 159–171.
[8] Hall, P. (1988). Theoretical comparison of bootstrap confidence intervals, Annals of Statistics 16, 927–963.
[9] Maritz, J.S. (1995). Distribution-Free Statistical Methods, 2nd Edition, Chapman & Hall, London.
[10] Pyburn, W.F. (1958). Size and movement of a local population of cricket frogs, Texas Journal of Science 10, 325–342.
[11] Robbins, H. & Monro, S. (1951). A stochastic approximation method, Annals of Mathematical Statistics 22, 400–407.
[12] Sternberg, D.E., Van Kammen, D.P. & Bunney, W.E. (1982). Schizophrenia: dopamine b-hydroxylase activity and treatment response, Science 216, 1423–1425.

PAUL H. GARTHWAITE
Configural Frequency Analysis

ALEXANDER VON EYE

Volume 1, pp. 381–388
models (see Log-linear Models), log E = X, where A third group of base models determines expected
E is the vector of model frequencies, X is the cell frequencies based on distributional assumptions.
indicator matrix that contains the effects that define For example, von Eye and Bogat [20] proposed
the base model, and is a parameter vector. estimating the expected cell probabilities based on the
To give an example of a base model, consider assumption that the categorized variables that span a
the data in Table 1. In tables of this sort, researchers cross-classification follow a multinormal distribution.
rarely need to ask the question whether an association CFA tests can then be used to identify those sectors
between the earlier and the later observations exists. that deviate from multinormality most strongly.
The X 2 -analysis confirms what everybody either The first group of base models is log-linear. The
knows or assumes: there is a strong association. It latter two are not log-linear, thus illustrating the
is, therefore, the goal of CFA to explore the cross- flexibility of CFA as a method of analysis of cross-
classification, and to identify those cells that deviate classifications.
from the assumption of independence. These cells
not only carry the association, they also define the
Sampling Schemes and Their Relation to CFA
developmental process that the researchers attempt to
Base Models. When selecting a base model for
capture in this study. Later in this entry, we present a
CFA, first, the variable relationships the researchers
complete CFA of this table. In brief the main effects
are (not) interested in must be considered. This
log-linear model of variable independence can be a
issue was discussed above. Second, the sampling
CFA base model. scheme must be taken into account. The sampling
Another example of a log-linear CFA base model scheme determines whether a base model is admis-
is that of Prediction CFA (P-CFA). This variant sible (see Sampling Issues in Categorical Data).
of CFA is used to identify patterns of predictor In the simplest case, sampling is multinomial (see
categories that go hand-in-hand with patterns of Catalogue of Probability Density Functions), that
criteria categories. For example, one can ask whether particular patterns of categories that describe how students do (or do not do) their homework allow one to predict particular patterns of categories that describe the students' success in academic subjects. In this situation, the researchers are not interested in the associations among the predictors, and they are not interested in the associations among the criterion variables. Therefore, all these associations are part of the base model. It is saturated in the predictors, and it is saturated in the criteria. However, the researchers are interested in predictor-criterion relationships. Therefore, the base model proposes that the predictors are unrelated to the criteria. P-CFA types and antitypes indicate where this assumption is violated. P-CFA types describe where criterion patterns can be predicted from predictor patterns. P-CFA antitypes describe which criterion patterns do not follow particular predictor patterns.

A second group of base models uses prior information to determine the expected cell frequencies. Examples of prior information include population parameters, theoretical probabilities that are known, for instance, in coin toss or roulette experiments, or probabilities of transition patterns (see [16], Chapter 8).

That is, cases are randomly assigned to all cells of the cross-classification. Multinomial sampling is typical in observational studies. There are no constraints concerning the univariate or multivariate marginal frequencies.

However, researchers often determine marginal frequencies before data collection. For example, in a comparison of smokers with nonsmokers, a study may sample 100 smokers and 100 nonsmokers. Thus, there are constraints on the marginals. If 100 smokers are in the sample, smokers are no longer recruited, and the sample of nonsmokers is completed. In this design, cases are no longer randomly assigned to the cells of the entire table. Rather, the smokers are randomly assigned to the cells for smokers, and the nonsmokers are assigned to the cells for nonsmokers. This sampling scheme is called product-multinomial. In univariate, product-multinomial sampling, the constraints are on the marginals of one variable. In multivariate, product-multinomial sampling, the constraints are on the marginals of multiple variables, and also on the cross-classifications of these variables. For example, the researchers of the smoker study may wish to include in their sample 50 female smokers from the age bracket between 20 and 30, 50 male smokers from the same age bracket, and so forth.
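The contrast between the two sampling schemes can be illustrated with a short simulation. This is a minimal sketch assuming NumPy; the cell probabilities are hypothetical and not taken from any study discussed here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Multinomial sampling: all N cases fall at random into every cell of the
# cross-classification; only the total sample size N = 200 is fixed.
p_cells = np.array([0.3, 0.2, 0.3, 0.2])  # hypothetical cell probabilities
multinomial_table = rng.multinomial(200, p_cells)

# Product-multinomial sampling: the margin of one variable is fixed by
# design (100 smokers, 100 nonsmokers); each group is then distributed
# at random over its own cells only.
p_smokers = np.array([0.6, 0.4])        # hypothetical within-group probabilities
p_nonsmokers = np.array([0.25, 0.75])
smokers = rng.multinomial(100, p_smokers)
nonsmokers = rng.multinomial(100, p_nonsmokers)

print(multinomial_table.sum())            # total N fixed at 200
print(smokers.sum(), nonsmokers.sum())    # each group margin fixed at 100
```

Whatever the draw, the group margins in the product-multinomial scheme are constants, while under multinomial sampling every marginal frequency is random.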
where p_i is defined as for the binomial test, and

p̃_i = ∏_{j=1}^{d} (N_j − 1) / (N − 1)^d,   (4)

where d is the number of variables that span the cross-classification, and j indexes these variables. Using the exact variance, the Lehmacher cell-specific test statistic can be defined as

z_{iL} = (m_i − e_i) / σ_i.   (5)

Lehmacher's z is, for large samples, standard normally distributed. A continuity correction has been proposed that prevents Lehmacher's test from being nonconservative in small samples. The correction requires subtracting 0.5 from the numerator if m > e, and adding 0.5 to the numerator if m < e.

These three test statistics have the following characteristics:

(1) The Pearson X² component test is an approximative test with average power. It is more sensitive to types when samples are small, and more sensitive to antitypes when samples are large. It can be applied under any sampling scheme, and is mostly used for inferential purposes.
(2) The binomial test is exact, and can be used under any sampling scheme.
(3) Lehmacher's test is approximative. It has the most power of all tests that have been proposed for use in CFA. It is more sensitive to types when samples are small, and more sensitive to antitypes when samples are large. The only exceptions are 2 × 2 tables, where Lehmacher's test always identifies exactly the same number of types and antitypes. The test requires that the sampling scheme be product-multinomial, and can be used only when the CFA base model is the log-linear main effect model.

Protection of the Significance Threshold

Typically, CFA applications are exploratory and examine all cells of a cross-classification. This strategy implies that the number of tests performed on the same data set can be large. If the number of tests is large, the α-level needs to be protected for two reasons (see Multiple Comparison Procedures). First, there is capitalizing on chance. If the significance threshold is α, then researchers take the risk of committing an α-error at each occasion a test is performed. When multiple tests are performed, the risk of α comes with each test. In addition, there is a risk that the α-error is committed twice, three times, or, in the extreme case, each time a test is performed. The second reason is that the tests in a CFA are dependent. Von Weber, Lautsch, and von Eye [23] showed that the results of three of the four tests that CFA can possibly perform in a 2 × 2 table are completely dependent on the results of the first test. In larger tables, the results of each cell-specific test are also dependent on the results of the tests performed before, but to a lesser extent. In either case, the α-level needs to be protected for the CFA tests to be valid.

A number of methods for α-protection has been proposed. The most popular and simplest method, termed Bonferroni adjustment (see Multiple Comparison Procedures), yields an adjusted significance level, α*, that (a) is the same for each test, and (b) takes the total number of tests into account. The protected α-level is α* = α/c, where c is the total number of tests performed. For example, let a cross-classification have 2 × 3 × 3 × 2 = 36 cells. For this table, one obtains the Bonferroni-protected α* = 0.05/36 = 0.0014. Obviously, this new, protected threshold is extreme, and it will be hard to find types and antitypes.

Therefore, less radical procedures for α-protection have been devised. An example of these is Holm's [9] procedure. This approach takes into account (1) the maximum number of tests to be performed, and (2) the number of tests already performed before test i. In contrast to the Bonferroni procedure, the protected significance threshold α*_i is not constant. Specifically, the Holm procedure yields the protected significance level

α*_i = α / (c − i + 1),   (6)

where i indexes the ith test that is performed. Before applying Holm's procedure, the test statistics have to be rank-ordered such that the most extreme statistic is examined first. Consider the first test; for this test, the Holm-protected α is α*_1 = α/(c − 1 + 1) = α/c. This threshold is identical to the one used for the first test under the Bonferroni procedure. However, for the second test under Holm, we obtain α*_2 = α/(c − 1), a
threshold that is less extreme and prohibitive than the one used under Bonferroni. For the last cell in the rank order, we obtain α*_c = α/(c − c + 1) = α. Testing under the Holm procedure concludes after the first null hypothesis prevails. More advanced methods of α-protection have been proposed, for instance, by Keselman, Cribbie, and Holland [10].

Performing CFA tests usually leads to the identification of a number of configurations (cells) as constituting types or antitypes. The interpretation of these types and antitypes uses two types of information. First, types and antitypes are interpreted based on the meaning of the configuration itself. Consider, for example, the data in Table 1. Suppose Configuration 2-3 constitutes a type (below, we will perform a CFA on these data to determine whether this configuration indeed constitutes a type). This configuration describes those adolescents who develop from a Tanner stage 2 to a Tanner stage 3. That is, these adolescents make progress in their physical pubertal development. They are neither prepubertal nor physically mature, but they develop in the direction of becoming mature. Descriptions of this kind indicate the meaning of a type or antitype.

The second source of information is included in the decisions as they were processed in the five steps of CFA. Most important is the definition of the base model. If the base model suggests variable independence, as it does in the log-linear main effect base model, types and antitypes suggest local associations. If the base model is that of prediction CFA, types and antitypes indicate pattern-specific relationships between predictors and criteria. If the base model is that of two-group comparison, types indicate the patterns in which the two groups differ significantly.

The characteristics of the measure used for the detection of types and antitypes must also be considered. Measures that are marginal-free can lead to different harvests of types and antitypes than measures that are marginal-dependent.

Finally, the external validity of types and antitypes needs to be established. Researchers ask whether the types and antitypes that stand out in the space of the variables that span the cross-classification under study do also stand out, or can be discriminated, in the space of variables not used in the CFA that yielded them. For example, Gortelmeyer [7] used CFA to identify six types of sleep problems. To establish external validity, the author used ANOVA methods to test hypotheses about mean differences between these types in the space of personality variables.

This section presents two data examples.

Data example 1: Physical pubertal development. The first example is a CFA of the data in Table 1. This table shows the cross-tabulation of two ratings of physical pubertal development in a sample of 64 adolescents. The second ratings were obtained two years after the first. We analyze these data using first order CFA. The base model of first order CFA states that the two sets of ratings are independent of each other. If types and antitypes emerge, we can interpret them as indicators of local associations or, from a substantive perspective, as transition patterns that occur more often or less often than expected based on chance.

To search for the types, we use Lehmacher's test with continuity correction and Holm's procedure of α-protection. Table 2 presents the results of CFA. The overall Pearson X² for this table is 24.10 (df = 4; p < 0.01), indicating that there is an association between the two consecutive observations of physical pubertal development. Using CFA, we now ask whether there are particular transition patterns that stand out. Table 2 shows that three types and two antitypes emerged. Reading from the top of the table, the first type is constituted by Configuration 2-3. This configuration describes adolescents who develop from an early adolescent pubertal stage to a late adolescent pubertal stage. Slightly over nine cases had been expected to show this pattern, but 17 were found, a significant difference. The first antitype is constituted by Configuration 2-4. Fifteen adolescents developed so rapidly that they leaped one stage in the one-year interval between the observations, and showed the physical development of a mature person. However, over 24 had been expected to show this development. From this first type and this first antitype, we conclude that the development by one stage is normative, and the development by more
Table 2 First order CFA of the pubertal physical development data in Table 1
than one stage can be observed, but less often than chance.

If development by one stage is normative, the second transition from a lower to a higher stage in Table 2, that is, the transition described by Configuration 3-4, may also constitute a type. The table shows that this configuration also contains significantly more cases than expected.

Table 2 contains one more antitype. It is constituted by Configuration 3-3. This antitype suggests that it is very unlikely that an adolescent who has reached the third stage of physical development will still be at this stage two years later. This lack of stability, however, applies only to the stages of development that adolescents go through before they reach the mature physical stage. Once they reach this stage, development is completed, and stability is observed. Accordingly, Configuration 4-4 constitutes a type.

Data example 2. The second data example presents a reanalysis of a data set published by [4]. A total of 181 high school students processed the 24 items of a cube comparison task. The items assess the students' spatial abilities. After the cube task, the students answered questions concerning the strategies they had used to solve the task. Three strategies were used in particular: mental rotation (R), pattern comparison (P), and change of viewpoint (V). Each strategy was scored as either 1 = not used, or 2 = used. In the following sample analysis, we ask whether there are gender differences in strategy use. Gender was scaled as 1 = females and 2 = males. The analyses are performed at the level of individual responses.

The base model for the following analyses (1) is saturated in the three strategy variables that are used to distinguish the two gender groups, and (2) assumes that there is no relationship between gender and strategies used. If discrimination types emerge, they indicate the strategy patterns in which the two gender groups differ significantly. For the analyses, we use Pearson's X²-test and the Bonferroni-adjusted α* = .05/8 = 0.00625. Please note that the denominator for the calculation of the adjusted α is 8 instead of 16, because, to compare the gender groups, one test is performed for each strategy pattern (instead of two, as would be done in a one-sample CFA). Table 3 displays the results of 2-group CFA.

The results in Table 3 suggest strong gender differences. The first discrimination type is constituted by Configuration 111. Twenty-five females and five males used none of the three strategies. This difference is significant. The second discrimination type is constituted by Configuration 122. Thirteen females and 64 males used the pattern comparison and the viewpoint change strategies. Discrimination type 221 suggests that the rotational and the pattern comparison strategies were used by females in 590 instances, and by males in 872 instances. The last discrimination type emerged for Configuration 222. Females used all three strategies in 39 instances; males used all three strategies in 199 instances.

We conclude that female high school students differ from male high school students in that they either use no strategy at all, or use two. If they use two strategies, the pattern comparison strategy is one of them. In contrast, male students use all
Table 3 Two-group CFA of the cross-classification of rotation strategy (R), pattern comparison strategy (P), viewpoint
strategy (V), and gender (G)
Configuration m e statistic p Type?
1111 25 11.18
1112 5 18.82 27.466 .000000 Discrimination Type
1121 17 21.99
1122 42 37.01 1.834 .175690
1211 98 113.29
1212 206 190.71 3.599 .057806
1221 13 28.69
1222 64 48.31 13.989 .000184 Discrimination Type
2111 486 452.78
2112 729 762.22 5.927 .014911
2121 46 52.55
2122 95 88.45 1.354 .244633
2211 590 544.83
2212 872 917.17 10.198 .001406 Discrimination Type
2221 39 88.69
2222 199 149.31 47.594 .000000 Discrimination Type
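The cell-specific tests in Table 3 can be reproduced by treating each two-group comparison as a Pearson X² test on the 2 × 2 table that crosses pattern membership (this strategy pattern versus any other) with gender. The sketch below does this for Configuration 222, using the frequencies from Table 3 and group totals obtained by summing its columns; SciPy is assumed to be available, and this is an illustration of the computation rather than the authors' own software.

```python
from scipy.stats import chi2_contingency

# Column totals summed from Table 3: 1314 female and 2212 male responses.
females_total, males_total = 1314, 2212

# Configuration 222 (all three strategies used): 39 female, 199 male responses.
f_pattern, m_pattern = 39, 199

# 2x2 table: (shows pattern 222 vs. any other pattern) x (female vs. male),
# tested with Pearson's X^2 without continuity correction.
table = [[f_pattern, females_total - f_pattern],
         [m_pattern, males_total - m_pattern]]
chi2, p, dof, expected = chi2_contingency(table, correction=False)

print(round(chi2, 3))  # ~47.594, matching the last row of Table 3
print(p < 0.05 / 8)    # significant at the Bonferroni-adjusted 0.00625
```

The expected count for cell 2221 produced along the way, 1314 × 238 / 3526 ≈ 88.69, also agrees with the e column of Table 3.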
three strategies significantly more often than female students.

Summary and Conclusions

CFA has found applications in many areas of the social sciences, recently, for instance, in research on child development [1, 14], and educational psychology [15]. Earlier applications include analyses of the control beliefs of alcoholics [11]. CFA is the method of choice if researchers test hypotheses concerning local associations [8], that is, associations that hold in particular sectors of a cross-classification only. These hypotheses are hard or impossible to test using other methods of categorical data analysis.

Software for CFA can be obtained free from voneye@msu.edu. CFA is also a module of the software package SLEIPNER, which can be downloaded, also free, from http://www.psychology.su.se/sleipner/.

References

[1] Aksan, N., Goldsmith, H.H., Smider, N.A., Essex, M.J., Clark, R., Hyde, J.S., Klein, M.H. & Vandell, D.L. (1999). Derivation and prediction of temperamental types among preschoolers, Developmental Psychology 35, 958-971.
[2] Bergman, L.R. & Magnusson, D. (1997). A person-oriented approach in research on developmental psychopathology, Development and Psychopathology 9, 291-319.
[3] Finkelstein, J.W., von Eye, A. & Preece, M.A. (1994). The relationship between aggressive behavior and puberty in normal adolescents: a longitudinal study, Journal of Adolescent Health 15, 319-326.
[4] Glück, J. & von Eye, A. (2000). Including covariates in configural frequency analysis, Psychologische Beiträge 42, 405-417.
[5] Goodman, L.A. (1984). The Analysis of Cross-Classified Data Having Ordered Categories, Harvard University Press, Cambridge.
[6] Goodman, L.A. (1991). Measures, models, and graphical displays in the analysis of cross-classified data, Journal of the American Statistical Association 86, 1085-1111.
[7] Görtelmeyer, R. (1988). Typologie des Schlafverhaltens [A Typology of Sleeping Behavior], S. Roderer Verlag, Regensburg.
[8] Havránek, T. & Lienert, G.A. (1984). Local and regional versus global contingency testing, Biometrical Journal 26, 483-494.
[9] Holm, S. (1979). A simple sequentially rejective multiple test procedure, Scandinavian Journal of Statistics 6, 65-70.
[10] Keselman, H.J., Cribbie, R. & Holland, B. (1999). The pairwise multiple comparison multiplicity problem: an alternative approach to familywise and comparisonwise Type I error control, Psychological Methods 4, 58-69.
[11] Krampen, G., von Eye, A. & Brandtstädter, J. (1987). Konfigurationstypen generalisierter Kontrollüberzeugungen, Zeitschrift für Differentielle und Diagnostische Psychologie 8, 111-119.
[12] Lehmacher, W. (1981). A more powerful simultaneous test procedure in configural frequency analysis, Biometrical Journal 23, 429-436.
[13] Lienert, G.A. & Krauth, J. (1975). Configural frequency analysis as a statistical tool for defining types, Educational and Psychological Measurement 35, 231-238.
[14] Mahoney, J.L. (2000). School extracurricular activity participation as a moderator in the development of antisocial patterns, Child Development 71, 502-516.
[15] Spiel, C. & von Eye, A. (2000). Application of configural frequency analysis in educational research, Psychologische Beiträge 42, 515-525.
[16] von Eye, A. (2002a). Configural Frequency Analysis: Methods, Models, and Applications, Lawrence Erlbaum, Mahwah.
[17] von Eye, A. (2002b). The odds favor antitypes: a comparison of tests for the identification of configural types and antitypes, Methods of Psychological Research Online 7, 1-29.
[18] von Eye, A. (2003). A comparison of tests used in 2 × 2 tables and in two-sample CFA, Psychology Science 45, 369-388.
[19] von Eye, A. & Bergman, L.R. (2003). Research strategies in developmental psychopathology: dimensional identity and the person-oriented approach, Development and Psychopathology 15, 553-580.
[20] von Eye, A. & Bogat, G.A. (2004). Deviations from multivariate normality, Psychology Science 46, 243-258.
[21] von Eye, A. & Mun, E.-Y. (2003). Characteristics of measures for 2 × 2 tables, Understanding Statistics 2, 243-266.
[22] von Eye, A., Spiel, C. & Rovine, M.J. (1995). Concepts of nonindependence in configural frequency analysis, Journal of Mathematical Sociology 20, 41-54.
[23] von Weber, S., von Eye, A. & Lautsch, E. (2004). The Type II error of measures for the analysis of 2 × 2 tables, Understanding Statistics 3(4), 259232.

ALEXANDER VON EYE
Confounding in the Analysis of Variance
RICHARD S. BOGARTZ
Volume 1, pp. 389-391
experimenter, and so on, which in turn may affect performance.

Another form of confounding comes not from a progressive buildup over trials but simply from the carryover of the effect of one trial to the response on the next. We can imagine an experiment on taste sensitivity in which the subject repeatedly tastes various substances. Obviously, it is important to remove traces of the previous taste experience before presenting the next one.

Mixed Designs

Mixed designs combine the features of between-subject designs with those of repeated measures designs (see Fixed and Random Effects). Different sets of subjects are given the repeated measures treatments with one and only one value of a between-subjects variable or one combination of between-subject variables. In this type of design, both types of confounding described above can occur. The between-subject variables can be confounded with groups of subjects and the within-subject variables can be confounded with trial-to-trial effects.

Intentional Planned Confounding or Aliasing

If, when variables are confounded, we cannot separate their effects, it may be surprising that we would ever intentionally confound two or more variables. And yet, there are circumstances when such confounding can be advantageous. In general, we do so when we judge the gains to exceed the losses. Research has costs. The costs involve resources such as time, money, lab space, the other experiments that might have been done instead, and so on. The gains from an experiment are in terms of information about the way the world is and consequent indications of fruitful directions for more research. Every experiment is a trade-off of resources against information.

Intentional confounding saves resources at the cost of information. When the information about some variable or interaction of variables is judged to be not worth the resources it takes to gain the information, confounding becomes a reasonable choice. In some cases, the loss of information is free of cost if we have no interest in that information. We have already seen an example of this in the case of between-subject designs where subjects are confounded with treatment conditions. We have elected to ignore information about individual subjects in favor of focusing on the treatment effects.

In a design, there are groups of subjects, there are treatments, and there are interactions. Any two of these can be confounded with each other. So, for example, we have the split-plot design in which groups are confounded with treatments, the confounded factorial designs in which groups are confounded with interactions, and the fractional replication designs in which treatments are confounded with interactions (see Balanced Incomplete Block Designs). In the case of the latter two designs, there may be interactions that are unimportant or in which we have no interest. We can sometimes cut the cost of running subjects in half by confounding such an interaction with groups or with treatments. Kirk [3] provides an extensive treatment of designs with confounding. The classic text by Cochran and Cox [2] is also still an excellent resource.

An Example of Intentional Confounding

An example of confounding groups of subjects with a treatment effect can be seen in the following experimental design for studying an infant looking in response to possible and impossible arithmetic manipulations of objects [1]. Two groups of infants are given four trials in the following situation: An infant is seated in front of a stage on which there are two lowered screens, one on the left, the other on the right. When the screen is lowered, the infant can see what is behind it, and when the screen is raised the infant cannot see. An object is placed behind the lowered left screen, the screen is raised, either one or two additional objects are placed behind the raised left screen, an object is placed behind the lowered right screen, the screen is raised, either one or two additional objects are placed behind the raised right screen. Thus, each screen conceals either two or three objects. One of the two screens is then lowered and either two or three objects are revealed. The duration of the infant looking at the revealed objects is the basic measure. The trials are categorized as possible or impossible, revealing two or three objects, and involving a primacy or recency effect. On possible trials, the same number hidden behind the screen is revealed; on impossible, a different
number is revealed. Primacy refers to revealing what was behind the left screen since it was hidden first; recency refers to revealing the more recently hidden objects behind the right screen. Number revealed, of course, refers to revealing two or revealing three objects. Thus, with two objects hidden behind the left screen and three objects finally revealed behind the left screen, we have a combination of the three, impossible, and primacy effects. A complete design would require eight trials for each infant: a 2 × 2 × 2 factorial design of number × possibility × recency. Suppose previous experience indicates that the full eight trials would probably result in declining performance with 10-month-old infants. So the design in Table 1 is used:

Table 1 Confounding of the primacy effect with the possibility × number revealed × group interaction

                 Possible                  Impossible
          2 revealed  3 revealed    2 revealed  3 revealed
Group 1   Primacy     Recency       Recency     Primacy
Group 2   Recency     Primacy       Primacy     Recency

Inspection of the table shows that the primacy effect is perfectly confounded with the possibility × number revealed × group interaction. In the present context, a primacy effect seems reasonable in that the infants might have difficulty remembering what was placed behind the first screen after being exposed to the activity at the second screen. On the other hand, it seems unlikely that a possibility × number revealed × group interaction would exist, especially since infants are assigned to groups at random and the order of running infants from the different groups is randomized. In this context, a significant group × possibility × number interaction is reasonably interpreted as a recency effect.

The design described above contains a second confound. Primacy-recency is confounded with spatial location of the screen (left or right). The first concealment was always behind the left screen; the second, always behind the right. This means that preferring to look longer to the right than to the left would result in an apparent recency preference that was in fact a position preference. If such a position preference were of concern, the design would have to be elaborated to include primacy on the left for some infants and on the right for others.

References

[1] Cannon, E.N. & Bogartz, R.S. (2003, April). Infant number knowledge: a test of three theories, in Poster Session Presented at the Biennial Meeting of the Society for Research in Child Development, Tampa.
[2] Cochran, W.G. & Cox, G.M. (1957). Experimental Designs, 2nd Edition, Wiley, New York.
[3] Kirk, R.E. (1995). Experimental Design, 3rd Edition, Brooks/Cole, Pacific Grove.
RICHARD S. BOGARTZ
Confounding Variable
PATRICK ONGHENA AND WIM VAN DEN NOORTGATE
Volume 1, pp. 391-392
[6] Van Damme, J., De Fraine, B., Van Landeghem, G., Opdenakker, M.-C. & Onghena, P. (2002). A new study on educational effectiveness in secondary schools in Flanders: an introduction, School Effectiveness and School Improvement 13, 383-397.

(See also Confounding in the Analysis of Variance)

PATRICK ONGHENA AND WIM VAN DEN NOORTGATE
Contingency Tables
BRIAN S. EVERITT
Volume 1, pp. 393-397
(E_ij) using the familiar test statistic,

X² = Σ_{i=1}^{r} Σ_{j=1}^{c} (O_ij − E_ij)² / E_ij,   (4)

where r is the number of rows and c the number of columns in the table. Under H0, the distribution of X² is, for large N, approximately a chi-squared distribution (see Catalogue of Probability Density Functions) with (r − 1)(c − 1) degrees of freedom. Hence, for a significance test of H0 with approximate significance level α, we reject H0 if

X² ≥ χ²(α)_{(r−1)(c−1)},   (5)

where χ²(α)_{(r−1)(c−1)} is the upper α point of a chi-squared distribution with (r − 1)(c − 1) degrees of freedom.

In the case of a two-by-two contingency table with cell frequencies a, b, c, and d (see Table 1), the X² statistic can be written more simply as

X² = N(ad − bc)² / [(a + b)(c + d)(a + c)(b + d)].   (6)

And in such tables, the independence hypothesis is equivalent to that of the equality of two probabilities, for example, in Table 1, that the probability of crowd baiting in June-September is the same as the probability of crowd baiting in October-May.

Applying the chi-squared test of independence to Table 2 results in a test statistic of 79.43 with two degrees of freedom. The associated P value is very small, and we can conclude with some confidence that party identification and race are not independent. We shall return for a more detailed look at this result later.

For the data in Table 1, the chi-square test gives X² = 4.07 with a single degree of freedom and a P value of 0.04. This suggests some weak evidence for a difference in the probability of baiting in the different times of the year. But the frequencies in Table 1 are small, and we need to consider how this might affect an asymptotic test statistic such as X².

Small Expected Frequencies

The derivation of the chi-square distribution as an approximation for the distribution of the X² statistic is made under the rather vague assumption that the expected values are not too small. This has, for many years, been interpreted as implying that all the expected values in the table should be greater than five for the chi-square test to be strictly valid. Since in Table 1 the four expected values are 5.7, 6.2, 4.3, and 4.7, this would appear to shed some doubt on the validity of the chi-squared test for the data and on the conclusion from this test. But as long ago as 1954, Cochran [3] pointed out that such a rule is too stringent and suggested that if relatively few expected values are less than five (say one cell in five), a minimum value of one is allowable.

Nevertheless, for small, sparse data sets, the asymptotic inference from the chi-squared test may not be valid, although it is usually difficult to identify a priori whether a given data set is likely to give misleading results. This has led to suggestions for alternative test statistics that attempt to make the asymptotic P value more accurate. The best known of these is Yates's correction. But, nowadays, such procedures are largely redundant since exact P values can be calculated to assess the hypothesis of independence; for details, see the entry on exact methods for categorical data.

The availability of such exact methods also makes the pooling of categories in contingency tables to increase the frequency in particular cells unnecessary. The procedure has been used almost routinely in the past, but can be criticized on a number of grounds:

- A considerable amount of information may be lost by the combination of categories, and this may detract greatly from the interest and usefulness of the study.
- The randomness of the sample may be affected; the whole rationale of the chi-squared test rests on the randomness of the sample and on the categories into which the observations may fall being chosen in advance.
- Pooling categories after the data are seen may affect the random nature of the sample with unknown consequences.
- The manner in which categories are pooled can have an effect on the resulting inferences.

As an example, consider the data in Table 4 from [2]. When this table is tested for independence using the chi-squared test, the calculated significance level is 0.086, which agrees with the exact probability
Table 4 Hypothetical data from Baglivo [2] than one. Consequently, the use of standardized
Table 4

            Column
         1    2    3    4    5
Row 1    2    3    4    8    9
Row 2    0    0   11   10   11

to two significant figures, although a standard statistical package issues a warning of the form "some of the expected values are less than two, the test may not be appropriate". If the first two columns of Table 4 are ignored, the P value becomes 0.48, and if the first two columns are combined with the third, the P value becomes one. The practice of combining categories to increase cell size should be avoided and is nowadays unnecessary.

Residuals

In trying to identify which cells of a contingency table are primarily responsible for a significant overall chi-squared value, it is often useful to look at the differences between the observed values and those values expected under the hypothesis of independence, or some function of these differences. In fact, looking at residuals defined simply as observed value minus expected value would be very unsatisfactory, since a difference of fixed size is clearly more important for smaller samples. A more appropriate residual is rij, given by:

    rij = (nij − Eij) / √Eij    (7)

These terms are usually known as standardized residuals; their use in the detailed examination of a contingency table may often give a conservative indication of a cell's lack of fit to the independence hypothesis. An improvement over standardized residuals is provided by the adjusted residuals (dij) suggested in [4], and defined as:

    dij = rij / √[(1 − ni./N)(1 − n.j/N)]    (9)

When the variables forming the contingency table are independent, the adjusted residuals are approximately normally distributed with mean zero and standard deviation one.

Returning to the data in Table 2, we can now calculate the expected values and then both the standardized and adjusted residuals (see Table 5). Clearly, the lack of independence of race and party identification arises from the excess of blacks who identify with being Democrat and the excess of whites who identify with being Republican.

Table 5 Expected values, standardized residuals, and adjusted residuals for data in Table 2

(1) Expected values

                    Party identification
Race        Democrat    Independent    Republican
White         385.56         104.20        361.24
Black          58.44          15.80         54.76

(2) Standardized residuals
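Equations (7) and (9) are straightforward to compute; a minimal sketch follows. The 2 × 3 table passed in at the end is hypothetical, not the encyclopedia's Table 2 data.

```python
import math

def residuals(table):
    """Standardized (eq. 7) and adjusted (eq. 9) residuals for a
    two-way contingency table given as a list of rows of counts."""
    row_tot = [sum(r) for r in table]
    col_tot = [sum(c) for c in zip(*table)]
    N = sum(row_tot)
    std, adj = [], []
    for i, row in enumerate(table):
        s_row, a_row = [], []
        for j, n_ij in enumerate(row):
            E = row_tot[i] * col_tot[j] / N          # expected count under independence
            r = (n_ij - E) / math.sqrt(E)            # standardized residual, eq. (7)
            d = r / math.sqrt((1 - row_tot[i] / N) * (1 - col_tot[j] / N))  # eq. (9)
            s_row.append(r)
            a_row.append(d)
        std.append(s_row)
        adj.append(a_row)
    return std, adj

# Hypothetical 2 x 3 table of counts
std, adj = residuals([[495, 88, 367], [48, 32, 35]])
```

Cells with adjusted residuals well beyond ±2 are the ones contributing most to a significant overall chi-squared value.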
In many cases, an informative way of inspecting residuals is to display them graphically using correspondence analysis (see Configural Frequency Analysis).

Higher-Dimensional Contingency Tables

Three- and higher-dimensional contingency tables arise when a sample of individuals is cross-classified with respect to three (or more) qualitative variables. A four-dimensional example appears in Table 3. The analysis of such tables presents problems not encountered with two-dimensional tables, where a single question is of interest, namely, that of the independence or otherwise of the two variables involved. In the case of higher-dimensional tables, the investigator may wish to test that some variables are independent of some others, that a particular variable is independent of the remainder, or some more complex hypothesis. Again, however, the chi-squared statistic is used to compare observed frequencies with estimates of those to be expected under a particular hypothesis.

The simplest question of interest in a three-dimensional table, for example, is that of the mutual independence of the three variables; this is directly analogous to the hypothesis of independence in a two-way table, and is tested in an essentially equivalent fashion. Other hypotheses that might be of interest are those of the partial independence of a pair of variables, and the conditional independence of two variables for a given level of the third. A more involved hypothesis is that the association between two of the variables is identical in all levels of the third.

For some hypotheses, expected values can be obtained directly from simple calculations on particular marginal totals. But this is not always the case, and for some hypotheses, the corresponding expected values have to be estimated using some form of iterative procedure; for details, see [1]. Three- and higher-dimensional contingency tables are best analyzed using log-linear models.

References

[1] Agresti, A. (1996). An Introduction to Categorical Data Analysis, Wiley, New York.
[2] Baglivo, J., Oliver, D. & Pagano, M. (1988). Methods for the analysis of contingency tables with large and small cell counts, Journal of the American Statistical Association 83, 1006–1013.
[3] Cochran, W.G. (1954). Some methods for strengthening the common chi-squared tests, Biometrics 10, 417–477.
[4] Haberman, S.J. (1973). The analysis of residuals in cross-classified tables, Biometrics 29, 205–220.
[5] Maag, J.W. & Behrens, J.T. (1989). Epidemiological data on seriously emotionally disturbed and learning disabled adolescents: reporting extreme depressive symptomatology, Behavioural Disorders 15, 21–27.
[6] Mann, L. (1981). The baiting crowd in episodes of threatened suicide, Journal of Personality and Social Psychology 41, 703–709.

(See also Marginal Independence; Odds and Odds Ratios)

BRIAN S. EVERITT
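For the mutual-independence hypothesis discussed in this article, the expected values come directly from the one-way marginal totals: Eijk = ni.. n.j. n..k / n². A minimal sketch, using hypothetical counts rather than the article's own data:

```python
from itertools import product

def mutual_independence_chi2(table):
    """Chi-squared test of mutual independence for an I x J x K table.

    Under mutual independence the expected count is
    E_ijk = n_i.. * n_.j. * n_..k / n^2, obtained directly from the
    one-way marginal totals; df = IJK - I - J - K + 2.
    """
    I, J, K = len(table), len(table[0]), len(table[0][0])
    n = sum(table[i][j][k] for i, j, k in product(range(I), range(J), range(K)))
    ni = [sum(table[i][j][k] for j in range(J) for k in range(K)) for i in range(I)]
    nj = [sum(table[i][j][k] for i in range(I) for k in range(K)) for j in range(J)]
    nk = [sum(table[i][j][k] for i in range(I) for j in range(J)) for k in range(K)]
    chi2 = 0.0
    for i, j, k in product(range(I), range(J), range(K)):
        E = ni[i] * nj[j] * nk[k] / n ** 2   # expected count from marginal totals
        chi2 += (table[i][j][k] - E) ** 2 / E
    df = I * J * K - I - J - K + 2
    return chi2, df
```

For hypotheses whose expected values cannot be written in closed form from the marginals, an iterative fitting procedure (as in log-linear modeling) is needed instead.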
Coombs, Clyde Hamilton
JOEL MICHELL
Volume 1, pp. 397–398
[3] Coombs, C.H. (1983). Psychology and Mathematics: An Essay on Theory, The University of Michigan Press, Ann Arbor.

(See also Psychophysical Scaling)

JOEL MICHELL
Correlation
DIANA KORNBROT
Volume 1, pp. 398–400
[Figure: six scatter plots, Panels A–F, each plotting Y against X]
outlier at (29,29) is included. All other panels show some strong relationship.

Positive or negative? Positive correlations arise when both variables increase together, like age and experience, as in Panels A and B. Negative correlations occur when one variable goes down as the other goes up, like age and strength in adults, as in Panel C.

Measures of correlation range from +1, indicating perfect positive agreement, to −1, indicating perfect negative agreement. Pearson's product moment correlation r can be used when both variables are metric (interval or ratio) (see Scales of Measurement) and normally distributed. Rank-based measures such as Spearman's rho, rs (Pearson's correlation of ranks), or Kendall's tau, τ (a measure of how many transpositions are needed to get both variables in the same order), are also useful (sometimes known as nonparametric measures). They are applicable when either variable is ordinal, for example, ranking of candidates, or when metric data has outliers or is not bivariate normal. Experts [2] recommend τ over rs, but rs is widely used and easier to calculate. The point biserial coefficient is applicable if one variable is dichotomous and the other metric or ordinal.

Individual correlations are useful in their own right. In addition, correlation matrices, giving all the pairwise correlations of N variables, are useful as input to other procedures such as Factor Analysis and Multidimensional Scaling.

Thanks to Rachel Msetfi and Elena Kulinskaya who made helpful comments on drafts.

References

[1] Howell, D.C. (2004). Fundamental Statistics for the Behavioral Sciences, 5th Edition, Duxbury Press, Pacific Grove.
[2] Kendall, M.G. & Gibbons, J.D. (1990). Rank Correlation Methods, 5th Edition, Edward Arnold, London & Oxford.
[3] Sheskin, D.J. (2000). Handbook of Parametric and Nonparametric Statistical Procedures, 2nd Edition, Chapman & Hall, London.

DIANA KORNBROT
Correlation and Covariance Matrices
RANALD R. MACDONALD
Volume 1, pp. 400–402
[Figure: scatter plots of y against X with regression lines; axes run from −2.00 to 2.00]
Figure 1 Scatter plots with regression lines illustrating Pearson correlations of 1.0, 0.8, 0.5, 0.2, 0, and 0.5. X and Y
are approximately normally distributed variables with means of 0 and standard deviations of 1
variances, because the covariance of a variable with itself is its variance. When correlations are used, they are presented in the form of a square correlation matrix:

        | 1     ρ12   ...   ρ1n |
        | ρ21   1     ...   ρ2n |    (5)
        | ...   ...         ... |
        | ρn1   ρn2   ...   1   |

where ρij is the correlation between Zi and Zj. The diagonal elements of the matrix are unity, since by definition a variable can be fully predicted by itself. Also note that, in the special case where variables are standardized to unit standard deviation, the covariance matrix becomes the correlation matrix.

The variance–covariance matrix is important in multivariate modeling because relationships between variables are often taken to be linear, and the multivariate central limit theorem suggests that a multivariate normal distribution (see Catalogue of Probability Density Functions), which can be fully characterized by its means and covariance matrix, is a suitable assumption for many multivariate models (see Multivariate Analysis: Overview). For example, the commonly employed multivariate techniques principal component analysis and factor analysis operate on the basis of a covariance or correlation matrix, and the decision whether to start with the covariance or the correlation matrix is essentially one of choosing appropriate scales of measurement.

References

[1] Howell, D.C. (2002). Statistical Methods in Psychology, Duxbury Press, Belmont.
[2] Owens, T.J. & King, A.B. (2001). Measuring self-esteem: race, ethnicity & gender considered, in Extending Self-esteem Theory, T.J. Owens, S. Stryker & N. Goodman, eds, Cambridge University Press, Cambridge.
[3] Pearson, K. (1896). Mathematical contributions to the theory of evolution, III: regression, heredity and panmixia, Philosophical Transactions of the Royal Society of London, Series A 187, 253–318.
[4] Stigler, S.M. (1986). The History of Statistics, Belknap Press, Cambridge.

(See also Covariance/variance/correlation; Kendall's Tau; Partial Least Squares; Tetrachoric Correlation)

RANALD R. MACDONALD
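The relation between the covariance and correlation matrices (divide each covariance by the two standard deviations) can be sketched as follows; the function names are illustrative:

```python
import math

def cov_matrix(data):
    """Sample covariance matrix of variables stored as columns of `data`."""
    n, p = len(data), len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(p)]
    return [[sum((row[i] - means[i]) * (row[j] - means[j]) for row in data) / (n - 1)
             for j in range(p)] for i in range(p)]

def corr_from_cov(C):
    """Correlation matrix: rho_ij = C_ij / sqrt(C_ii * C_jj).

    Standardizing every variable to unit standard deviation makes the
    covariance matrix equal to the correlation matrix."""
    s = [math.sqrt(C[i][i]) for i in range(len(C))]
    return [[C[i][j] / (s[i] * s[j]) for j in range(len(C))] for i in range(len(C))]
```

Note that the diagonal of the result is identically 1, matching the unit diagonal of the correlation matrix in (5).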
Correlation Issues in Genetics Research
STACEY S. CHERNY
Volume 1, pp. 402–403
Table 1 Three forms of the same data: (a) responses to the questions, coded according to the response categories; (b) coding of the same data as dummy variables, zero–one data in an indicator matrix Z with 24 + 5 = 29 columns; (c) 24 × 5 contingency table N cross-tabulating the two variables (N has a grand total equal to 33 590, the number of cases)
(a) Calculate the matrix S = [sij], where

    sij = (nij − ni+ n+j / n) / √(ni+ n+j)    (1)

(b) Calculate the SVD of S:

    S = UΓVᵀ    (2)

where the left and right singular vectors in the columns of U and V respectively satisfy UᵀU = VᵀV = I, and Γ is the diagonal matrix of positive singular values in descending order down the diagonal: γ1 ≥ γ2 ≥ ... ≥ 0.

(c) Calculate the two sets of optimal scale values from the first singular vectors as follows:

    ai = √(n/ni+) ui1,  i = 1, ..., I
    bj = √(n/n+j) vj1,  j = 1, ..., J    (3)

The following results can easily be verified and are standard results in CA theory:

1. The maximum correlation achieved by the solution is equal to γ1, the largest singular value of S.
2. From the form (1) of the elements of S, the total sum of squares of the matrix S is equal to χ²/n, the Pearson chi-square statistic χ² for the contingency table divided by the sample size n. This quantity, also known as Pearson's mean-square contingency coefficient and denoted by φ², is called the (total) inertia in CA. Notice that Cramér's V = √(φ²/d), where d is equal to the smaller of I − 1 or J − 1.
3. If we continue our search for scale values, giving scores uncorrelated with the optimal ones found above but again with maximum correlation, the solution is exactly as in (3) but for the second left and right singular vectors of S, with maximum correlation equal to γ2, and so on for successive optimal solutions. There are exactly d solutions (d = 4 in our example).

Simple Correspondence Analysis: Geometric Definition

A geometric interpretation can be given to the above results: in fact, it is the visualization aspects of CA that have made it popular as a method of
[Figure: asymmetric CA map with response-category points (always wrong, almost always wrong, sometimes wrong, never wrong, missing) and country points (A, AUS, BG, CDN, CZ, D-E, D-W, E, GB, H, I, IL, IRL, N, NIR, NL, NZ, PL, RP, RUS, S, SLO, USA)]
Figure 1 Asymmetric CA map of countries by response categories, showing rows in principal coordinates and columns
in standard coordinates. Inertia on first axis: 0.1516 (71.7%), on second axis: 0.0428 (20.2%)
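Steps (1)–(3) can be sketched with an SVD; a minimal illustration, not the encyclopedia's own code, and assuming NumPy is available:

```python
import numpy as np

def simple_ca(N):
    """Simple correspondence analysis of a two-way table of counts N,
    following steps (1)-(3): form S, take its SVD, derive scale values."""
    N = np.asarray(N, dtype=float)
    n = N.sum()
    r = N.sum(axis=1)                          # row totals n_i+
    c = N.sum(axis=0)                          # column totals n_+j
    E = np.outer(r, c) / n                     # expected counts under independence
    S = (N - E) / np.sqrt(np.outer(r, c))      # eq. (1)
    U, gamma, Vt = np.linalg.svd(S, full_matrices=False)   # eq. (2)
    a = np.sqrt(n / r) * U[:, 0]               # row scale values, eq. (3)
    b = np.sqrt(n / c) * Vt[0, :]              # column scale values, eq. (3)
    inertia = (S ** 2).sum()                   # total inertia = chi-square / n
    return a, b, gamma, inertia
```

The returned total inertia equals χ²/n (result 2), and the leading squared singular value γ1² is the first principal inertia.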
data analysis. Consider the same contingency table N in Table 1 and calculate, for example, the table of row proportions for each country across the response categories (Table 2). Each of these five-component vectors, called profiles, defines a point in multidimensional space for the corresponding country. In fact, the dimensionality d of these row profiles is equal to 4, because each profile's set of five components sums to a constant 1 (d is the same quantity calculated in the previous section). Now assign a weight to each of the row profiles equal to its relative frequency in the survey, called the mass, given in the last column of Table 2; thus the weighting of a profile is proportional to its sample size. Next, define distances between the row profile vectors by the chi-square distance:
Table 2 Row (country) profiles based on the contingency table in Table 1(c), showing the row masses and the average row profile used in the chi-square metric

Country   Abbr.   Always wrong   Almost always wrong   Sometimes wrong   Never wrong   Missing   Mass
normalize each column of the profile matrix by dividing it by the square root of the marginal profile element (e.g., divide the first column by the square root of 0.145, and so on), and then use Euclidean distances between the transformed row profiles. Finally, look for a low-dimensional subspace that approximates the row profiles optimally in a weighted least-squares sense; for example, find the two-dimensional plane in the four-dimensional space of the row profiles that is closest to the profile points, where closeness is measured by (mass-)weighted sum-of-squared (chi-square) distances from the points to the plane. The profile points are projected onto this best-fitting plane in order to interpret the relative positions of the countries (Figure 1).

Up to now, the geometric description of CA is practically identical to classical metric multidimensional scaling applied to the chi-square distances between the countries, with the additional feature of weighting each point proportional to its frequency. But CA also displays the categories of responses (columns) in the map. The simplest way to incorporate the columns is to define fictitious unit row profiles, called vertices, as the most extreme rows possible; for example, the vertex profile [1 0 0 0 0] is totally concentrated into the response "always wrong", as if there were a country that was unanimously against premarital sex. This vertex is projected onto the optimal plane, as well as the vertex points representing the other response categories, including "missing". These vertex points are used as reference points for the interpretation of the country profiles.

It can be shown that this geometric version of the problem has exactly the same mathematical solution as the correlational one described previously. In fact, the following results are standard in the geometric interpretation of simple CA:

1. The column points, that is, the projected vertices in Figure 1, have coordinates equal to the scale values obtained in the previous section; that is,
we use the bj's calculated in (3) for the first (horizontal) dimension, and the similar quantities calculated from the second singular vector for the second (vertical) dimension. The bj's are called the standard coordinates (of the columns in this case).

2. The positions of the row points in Figure 1 are obtained by multiplying the optimal scale values for the rows by the corresponding correlation; that is, we use the γ1 ai's for the first dimension, where the ai's are calculated as in (3) from the first singular vector and value, and the corresponding values for the second dimension calculated in the same way from the second singular value and vector. These coordinates of the profiles are called principal coordinates (of the rows in this case).

3. In the full space of the row profiles (a four-dimensional space in our example), as well as in the (two-dimensional) reduced space of Figure 1, the row profiles lie at weighted averages of the column points. For example, Sweden (S) is on the left of the display because its profile is equal to [0.039 0.015 0.046 0.854 0.046], highly concentrated into the fourth category "never wrong" (compare this with the average profile at the center, which is [0.145 0.068 0.139 0.571 0.078]). If the response points are assigned weights equal to 0.039, 0.015, and so on (with high weight of 0.854 on "never wrong"), the weighted average position is exactly where the point S is lying. This is called the barycentric principle in CA and is an alternative way of thinking about the joint mapping of the points.

4. The joint display can also be interpreted as a biplot of the matrix of row profiles; that is, one can imagine oblique axes drawn through the origin of Figure 1 through the category points, and then project the countries onto these axes to obtain approximate profile values. On this biplot axis, the origin will correspond exactly to the average profile element (given in the last row of Table 2), and it is possible to calibrate the axis in profile units, that is, in units of proportions or percentages.

5. The inertia φ² is a measure of the total variation of the profile points in multidimensional space, and the parts of inertia λ1 = γ1², λ2 = γ2², ..., called the principal inertias, are those parts of variation displayed with respect to each dimension, or principal axis, usually expressed as percentages. The λk's are the eigenvalues of the matrix SᵀS or SSᵀ. This is analogous to the decomposition of total variance in principal component analysis.

6. The map in Figure 1 is called the asymmetric map of the rows, or the row-principal map. An alternative asymmetric map displays column profiles in principal coordinates and row vertices in standard coordinates, called the column-principal map. However, it is common practice in CA to plot the row and column points jointly as profiles, that is, using their principal coordinates in both cases, giving what is called a symmetric map (Figure 2). All points in both asymmetric and symmetric maps have the same relative positions along individual principal axes. The symmetric map has the advantage of spreading out the two sets of profile points by the same amounts in the horizontal and vertical directions; the principal inertias, which are the weighted sum-of-squared distances along the principal axes, are identical for the set of row profiles and the set of column profiles.

Notice in Figures 1 and 2 the curve, or horseshoe, traced out by the ordinal scale from "never wrong" on the left, up to "sometimes wrong" and "almost always wrong", and then down to "always wrong" on the right (see Horseshoe Pattern). This is a typical result of CA for ordinally scaled variables, showing the ordinal scale on one dimension and the contrast between extreme points and intermediate points on the other. Thus, countries have positions according to two features: first, their overall strength of attitude on the issue, with more liberal countries (with respect to the issue of premarital sex) on the left and more conservative countries on the right; and second, their polarization of attitude, with countries giving higher-than-average in-between responses higher up, and countries giving higher-than-average extreme responses lower down. For example, whereas Spain (E) and Russia (RUS) have the same average attitude on the issue, slightly to the conservative side of average (horizontal dimension), the Russian responses contain relatively more intermediate responses compared to the Spanish responses, which are more extreme in both directions (see Table 2 to corroborate this finding).
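As a small numerical sketch of this geometry: the chi-square distance divides each squared profile difference by the corresponding average-profile element, and principal coordinates are standard coordinates shrunk by the singular value of each axis (result 2). The Sweden and average profiles below are the ones quoted in result 3; the function names are illustrative.

```python
import math

def chi_square_distance(p, q, avg):
    """Chi-square distance between two profiles, weighting each
    category by the inverse of the average-profile element."""
    return math.sqrt(sum((a - b) ** 2 / c for a, b, c in zip(p, q, avg)))

def principal_from_standard(standard, singular_values):
    """Principal coordinates = standard coordinates scaled by the
    singular value of the corresponding principal axis."""
    return [x * g for x, g in zip(standard, singular_values)]

sweden  = [0.039, 0.015, 0.046, 0.854, 0.046]   # profile of Sweden (result 3)
average = [0.145, 0.068, 0.139, 0.571, 0.078]   # average row profile
d = chi_square_distance(sweden, average, average)
```

The shrinking by singular values is exactly why the column points are "pulled in" when plotted in principal rather than standard coordinates.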
[Figure: symmetric CA map with response-category points and country points, including J (Japan)]
Figure 2 Symmetric CA map of countries by response categories, showing rows and columns in principal coordinates.
Inertias and percentages as in Figure 1. Notice how the column points have been pulled in compared to Figure 1
Table 3 Three forms of the same multivariate data: (a) responses to the four questions, coded according to the response categories; (b) coding of the same data as dummy variables, zero–one data in an indicator matrix Z with 4 × 5 = 20 columns; (c) 20 × 20 Burt matrix B of all 5 × 5 contingency tables Nqq' cross-tabulating pairs of variables, including each variable with itself on the block diagonal
for each case are in Z1 a1, Z2 a2, ..., ZQ aQ, and the averages of these scores in (1/Q)Za. It is equivalent to maximize the sum of squared correlations between the Q scores, or to maximize the sum (or average) of squared correlations between the Q scores and the average:

    maximize over a:  (1/Q) Σ(q=1..Q) cor²[Zq aq, (1/Q)Za]    (4)

As before, we need an identification condition on a; conventionally, the average score (1/Q)Za is standardized to have mean 0 and variance 1. The solution is provided by the same CA algorithm described in formulae (1)–(3), applied either to Z or to B. The standard coordinates of the columns of Z, corresponding to the maximum singular value γ1, provide the optimal solution, and λ1 = γ1² is the attained maximum of (4). Subsequent solutions are provided by the following singular values and vectors as before. If B is analyzed rather than Z, the standard coordinates of the columns of B (or rows, since B is symmetric) are identical to those of the columns of Z, but because B = ZᵀZ, the singular values and eigenvalues of B are the squares of those of Z; that is, the eigenvalues of B are γ1⁴, γ2⁴, ..., and so on.

In homogeneity analysis, an approach equivalent to MCA, the squared correlations cor²[Zq aq, (1/Q)Za] are called discrimination measures and are analogous to squared (and thus unsigned) factor loadings (see Factor Analysis: Exploratory). Moreover, in homogeneity analysis the objective of the analysis is defined in a different but equivalent form, namely, as the minimization of the loss function:

    minimize over a:  (1/(nQ)) Σ(q=1..Q) ‖Zq aq − (1/Q)Za‖²    (5)

where ‖·‖² denotes the sum-of-squares of the elements of the vector argument. The solution is the same as for (4), and the minima are 1 minus the eigenvalues maximized previously. The eigenvalues are interpreted individually and not as parts of variation: if the correlation amongst the variables is high, then the loss is low and there is high homogeneity, or internal consistency, amongst the variables; that is, the optimal scale successfully summarizes the association amongst the variables.

Using the optimal scaling option in the SPSS module Categories, Table 4 gives the eigenvalues and discrimination measures for the four variables in the first five dimensions of the solution (we shall interpret the solutions in the next section) (see Software for Statistical Analyses).
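The indicator matrix Z and the Burt matrix B = ZᵀZ described above can be constructed directly; a small sketch with hypothetical 0-based category codes:

```python
def indicator_and_burt(cases, n_cats):
    """Build the indicator matrix Z (cases x total categories) for Q
    categorical variables, and the Burt matrix B = Z^T Z of all pairwise
    cross-tabulations (each variable crossed with itself on the block
    diagonal)."""
    offsets = [sum(n_cats[:q]) for q in range(len(n_cats))]
    J = sum(n_cats)
    Z = []
    for case in cases:                  # case = tuple of category codes, 0-based
        row = [0] * J
        for q, cat in enumerate(case):
            row[offsets[q] + cat] = 1   # one dummy per variable is switched on
        Z.append(row)
    B = [[sum(Z[i][a] * Z[i][b] for i in range(len(Z))) for b in range(J)]
         for a in range(J)]
    return Z, B
```

Each row of Z sums to Q (one response per variable), which is why a case profile places the constant mass 1/Q on its Q chosen categories.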
Table 4 Eigenvalues and discrimination measures (squared correlations) for first five dimensions in the homogeneity
analysis (MCA) of the four questions in Table 3
Discrimination measures
Dimension Eigenvalue Variable A Variable B Variable C Variable D
1 0.5177 0.530 0.564 0.463 0.514
2 0.4409 0.405 0.486 0.492 0.380
3 0.3535 0.307 0.412 0.351 0.344
4 0.2881 0.166 0.370 0.344 0.256
5 0.2608 0.392 0.201 0.171 0.280
Multiple Correspondence Analysis: Geometric Definition

The geometric paradigm described in the section "Simple Correspondence Analysis: Geometric Definition" for simple CA, where rows and columns are projected from the full space to the reduced space, is now applied to the indicator matrix Z or the Burt matrix B. Some problems occur when attempting to justify the chi-square distances between profiles and the notion of total and explained inertia. There are two geometric interpretations that make more sense: one originates in the so-called Gifi system of homogeneity analysis, the other from a different generalization of CA called joint correspondence analysis (JCA).

Geometry of Joint Display of Cases and Categories

As mentioned in property 3 of the section "Simple Correspondence Analysis: Geometric Definition", in the asymmetric CA map, each profile (in principal coordinates) is at the weighted average, or centroid, of the set of vertices (in standard coordinates). In the MCA of the indicator matrix with the categories (columns), say, in standard coordinates, and the cases (rows) in principal coordinates, each case lies at the ordinary average of its corresponding category points, since the profile of a case is simply the constant value 1/Q for each category of response, and zero otherwise. From (5), the optimal map is the one that minimizes the sum-of-squared distances between the cases and their response categories. It is equivalent to think of the rows in standard coordinates and the columns in principal coordinates, so that each response category is at the average of the case points that have given that response. Again, the optimal display is the one that minimizes the sum-of-squared case-to-category distances. The loss, equal to 1 minus the eigenvalue, is equal to the minimum sum-of-squared distances with respect to individual dimensions in either asymmetric map. The losses can thus be added for the first two dimensions, for example, to give the minimum for the planar display.

It is clearly not useful to think of the joint display of the cases and categories as a biplot, as we are not trying to reconstruct the zeros and ones of the indicator matrix. Nor is it appropriate to think of the CA map of the Burt matrix as a biplot, as explained in the following section.

Geometry of all Bivariate Contingency Tables (Joint CA)

In applying CA to the Burt matrix, the diagonal submatrices on the diagonal of the block matrix B inflate both the chi-square distances between profiles and the total inertia by artificial amounts. In an attempt to generalize simple CA more naturally to more than two categorical variables, JCA accounts for the variation in the off-diagonal tables of B only, ignoring the matrices on the block diagonal. Hence, in the two-variable case (Q = 2), when there is only one off-diagonal table, JCA is identical to simple CA. The weighted least-squares solution can no longer be obtained by a single application of the SVD, and various algorithms have been proposed. Most of the properties of simple CA carry over to JCA, most importantly the reconstruction of profiles with respect to biplot axes, which is not possible in regular MCA of B. The percentages of inertia are now correctly measured, quantifying the success of approximating the off-diagonal matrices relative to the total inertia of these matrices only.
Adjustment of Eigenvalues in MCA

It is possible to remedy partially the percentage-of-inertia problem in a regular MCA by a compromise between the MCA solution and the JCA objective. The total inertia is measured (as in JCA) by the average inertia of all off-diagonal subtables of B, calculated either directly from the tables themselves or by reducing the total inertia of B as follows:

    average off-diagonal inertia = [Q/(Q − 1)] × [inertia(B) − (J − Q)/Q²]    (6)

Parts of inertia are then calculated from the eigenvalues λs² of B (or λs of Z) as follows: for each λs ≥ 1/Q, calculate the adjusted inertias

    [Q/(Q − 1)]² (λs − 1/Q)²

and express these as percentages of (6). Although these percentages underestimate those of JCA, they dramatically improve the results of MCA and are recommended in all applications of MCA.

In our example, the total inertia of B is equal to 1.1659 and the first five principal inertias are such that λs ≥ 1/Q, that is, λs² ≥ 1/Q² = 1/16. The different possibilities for inertias and percentages of inertia are given in Table 5. Thus, what appears to be a percentage explained in two dimensions of 29.9% (= 12.9 + 11.0) in the analysis of the indicator matrix Z, or 39.7% (= 23.0 + 16.7) in the analysis of the Burt matrix B, is shown to be at least 86.9% (= 57.6 + 29.3) when the solution is rescaled. The JCA solution gives only a slight extra benefit in this case, with an optimal percentage explained of 87.4%, so that the adjusted solution is practically optimal.

Figure 3 shows the adjusted MCA solution, that is, in adjusted principal coordinates. The first dimension lines up the four ordinal categories of each question in their expected order (accounting for 57.6% of the inertia), whereas the second dimension opposes all the missing categories against the rest (29.3% of inertia). The positions of the missing values on the
Table 5 Eigenvalues (principal inertias) of the indicator matrix and the Burt matrix, their percentages of inertia, the adjusted inertias, and their lower-bound estimates of the percentages of inertia of the off-diagonal tables of the Burt matrix. The average off-diagonal inertia on which these last percentages are based is equal to (4/3)(1.1659 − 16/16) = 0.2212

Dimension   Eigenvalue of Z   % explained   Eigenvalue of B   % explained   Adjusted eigenvalue   % explained
1 0.5177 12.9 0.2680 23.0 0.1274 57.6
2 0.4409 11.0 0.1944 16.7 0.0648 29.3
3 0.3535 8.8 0.1249 10.7 0.0190 8.6
4 0.2881 7.2 0.0830 7.1 0.0026 1.2
5 0.2608 6.5 0.0681 5.8 0.0002 0.0
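The adjustment in (6) and the adjusted inertias can be reproduced from the Table 5 inputs (Q = 4 questions, J = 20 categories, total inertia of B = 1.1659); the function name is illustrative:

```python
def adjusted_inertias(lambdas_z, Q, J, total_inertia_B):
    """Eigenvalue adjustment for MCA.

    lambdas_z: principal inertias (eigenvalues) of the indicator matrix Z.
    Computes the average off-diagonal inertia (eq. 6) and the adjusted
    inertias [Q/(Q-1)]^2 (lambda_s - 1/Q)^2 for each lambda_s >= 1/Q."""
    avg_off = (Q / (Q - 1)) * (total_inertia_B - (J - Q) / Q ** 2)
    adj = [(Q / (Q - 1)) ** 2 * (lam - 1 / Q) ** 2
           for lam in lambdas_z if lam >= 1 / Q]
    pct = [100 * a / avg_off for a in adj]      # percentages of eq. (6)
    return avg_off, adj, pct

# Eigenvalues of Z from Table 5
avg, adj, pct = adjusted_inertias([0.5177, 0.4409, 0.3535, 0.2881, 0.2608],
                                  4, 20, 1.1659)
```

This reproduces the adjusted eigenvalues 0.1274 and 0.0648 and the percentages 57.6% and 29.3% reported in Table 5.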
Table 6 Category scale values (standard coordinates) on the first dimension (see Figure 3), and their linearly transformed values, scaled to have the "not wrong at all" categories equal to 0 and the highest possible score (the sum of the underlined scale values) equal to 100

                                   Always    Almost always    Only sometimes    Not wrong
                                   wrong     wrong            wrong             at all      Missing
(a) Sex before marriage            2.032     1.167            0.194             0.748       0.333
                                   27.9      19.2             9.5               0.0         10.9
(b) Sex teens under 16             0.976     0.813            1.463             1.554       0.529
                                   25.4      7.4              0.9               0.0         10.3
(c) Sex other than spouse          0.747     1.060            1.414             1.591       0.895
                                   23.5      5.3              1.8               0.0         7.0
(d) Sex two people of same sex     1.001     0.649            0.993             1.310       0.471
                                   23.2      6.6              3.2               0.0         8.4
first axis will thus provide estimated scale values for missing data that can be used in establishing a general scale of attitudes to sex. The scale values on the first dimension can be transformed to more interpretable values; for example, each "not wrong at all" category can be set to 0 and the upper limit of the case score equalized to 100 (Table 6). The process of redefining the scale in this way is invariant
[Figure: MCA map of category points a1–a4, b1–b4, c1–c4, d1–d4 and missing categories aM, bM, cM, dM]

Figure 3 MCA map of response categories (analysis of the Burt matrix): first (horizontal) and second (vertical) dimensions, using adjusted principal inertias. Inertia on first axis: 0.1274 (57.6%), on second axis: 0.0648 (29.3%). Labels refer to variables A to D, with response categories 1 to 4 and missing (M)
[Figure: map of supplementary country–gender points, e.g., D-Ef, D-Em, USAf, USAm]

Figure 4 Positions of supplementary country–gender points in the map of Figure 3. Country abbreviations are followed by m for male or f for female
with respect to the particular scaling used: standard, principal, or adjusted principal.

Having identified the optimal scales, subgroups of points may be plotted in the two-dimensional map; for example, a point for each country or a point for a subgroup within a country, such as Italian females or Canadian males. This is achieved using the barycentric principle, namely, that the profile position is at the weighted average of the vertex points, using the profile elements as weights. This is identical to declaring this subgroup a supplementary point, that is, a profile that is projected onto the solution subspace. In Figure 4 the positions of males and females in each country are shown. Apart from the general spread of the countries, with conservative countries (e.g., Philippines) more to the right and liberal countries more to the left (e.g., Germany), it can be seen that the female groups are consistently to the conservative side of their male counterparts and also almost always higher up on the map; that is, there are also more nonresponses amongst the females.

Further Reading

Benzécri [1], in French, represents the original material on the subject. Greenacre [4] and Lebart, Morineau and Warwick [7] both give introductory and advanced treatments of the French approach, the former being a more encyclopedic treatment of CA and the latter including other methods of analyzing large sets of categorical data. Gifi [3] includes CA and MCA in the framework of nonlinear multivariate analysis, a predominantly Dutch approach to data analysis. Greenacre [5] is a practical user-oriented introduction. Nishisato [8] provides another view of the same methodology from the viewpoint of quantification of categorical data. The edited volumes by Greenacre and Blasius [6] and Blasius and Greenacre [2] are self-contained state-of-the-art books on the subject, the latter volume including related methods for visualizing categorical data.

References

[1] Benzécri, J.-P. (1973). Analyse des Données. Tome 2: Analyse des Correspondances, Dunod, Paris.
[2] Blasius, J. & Greenacre, M.J., (eds) (1998). Visualization of Categorical Data, Academic Press, San Diego.
[3] Gifi, A. (1990). Nonlinear Multivariate Analysis, Wiley, Chichester.
[4] Greenacre, M.J. (1984). Theory and Applications of Correspondence Analysis, Academic Press, London.
[5] Greenacre, M.J. (1993). Correspondence Analysis in Practice, Academic Press, London.
[6] Greenacre, M.J. & Blasius, J., (eds) (1994). Correspondence Analysis in the Social Sciences, Academic Press, London.
[7] Lebart, L., Morineau, A. & Warwick, K. (1984). Multivariate Descriptive Statistical Analysis, Wiley, Chichester.
[8] Nishisato, S. (1994). Elements of Dual Scaling: An Introduction to Practical Data Analysis, Lawrence Erlbaum, Hillsdale.

MICHAEL GREENACRE
Co-twin Control Methods
JACK GOLDBERG AND MARY FISCHER
Volume 1, pp. 415–418
in
ISBN-13: 978-0-470-86080-9
ISBN-10: 0-470-86080-4
Editors
difference can be formally tested for heterogeneity using a two-sample t Test.

Dichotomous Environmental Risk Factor and a Dichotomous Outcome Variable

genetic factors. Testing the difference between the stratum-specific matched pair odds ratios involves using the risk factor–outcome discordant pairs in a Pearson's X² test with 1 degree of freedom (see Contingency Tables).
Table 1  Co-twin control analysis of Vietnam service discordant pairs and mean levels of PTSD symptoms

                                 Mean level of PTSD symptoms
                                 for Vietnam service            Within-pair mean                Matched pair t Test   Heterogeneity
           Number of             Vietnam     No service         difference in
Sample     discordant pairs      service     in Vietnam         PTSD symptoms     95% CI        t      P value        t      P value
All pairs  1679                  29.6        24.6               5.0               4.4, 5.6      16.6   <0.001
MZ         846                   29.7        24.5               5.2               4.4, 6.0      13.0   <0.001
                                                                                                                      0.67   0.50
DZ         833                   29.5        24.7               4.8               3.9, 5.7      10.6   <0.001

Table 2  Co-twin control analysis of Vietnam service discordant pairs and the presence of high levels of PTSD symptoms

                              PTSD prevalence^a             Pair configuration^b        McNemar's test            Heterogeneity
           Number of          Vietnam      No service
Sample     discordant pairs   service (%)  in Vietnam (%)   n11   n10   n01   n00   OR    95% CI     X²      P value   X²     P value
All pairs  1679               34.8         15.5             148   437   113   981   3.9   3.1, 4.8   191.0   <0.001
MZ         846                34.0         14.4              69   219    53   505   4.1   3.1, 5.6   101.4   <0.001
                                                                                                                       0.37   0.54
DZ         833                35.7         16.7              79   218    60   476   3.6   2.7, 4.8    89.9   <0.001

^a The prevalence of PTSD was defined as a PTSD score greater than 32, which represents the upper quartile of the PTSD scale distribution.
^b n11 = number of Vietnam discordant pairs in which both twins have PTSD; n10 = number of Vietnam discordant pairs in which the twin with service in Vietnam has PTSD and the twin without Vietnam service does not have PTSD; n01 = number of Vietnam discordant pairs in which the twin with service in Vietnam does not have PTSD and the twin without Vietnam service has PTSD; n00 = number of Vietnam discordant pairs in which neither twin has PTSD.
and DZ pairs. The comparison of heterogeneity of the MZ and DZ within-pair mean differences was not significant (p = .50), providing little evidence of genetic confounding.

Table 2 presents the matched pairs analysis of service in Vietnam and the dichotomous indicator of PTSD symptoms. Overall, 35% of those who served in Vietnam were in the upper quartile of the PTSD symptom scale, while only 16% of those who did not serve in Vietnam had similarly high levels of PTSD symptoms. The matched pair odds ratio indicates that twins who served in Vietnam were nearly 4 times more likely to have high levels of PTSD symptoms compared to their twin who did not serve. The within-pair difference in PTSD was highly significant based on the McNemar test (X² = 191.0; p < .001). Stratification by zygosity demonstrated that the effects were similar in both MZ and DZ pairs; the difference in the matched pair odds ratios in MZ and DZ pairs was not statistically significant (X² = 0.37, p = .54).

More Advanced Methods

For co-twin control designs using the experimental or cohort approach with risk factor discordant twin pairs, there is now a wide range of methods for clustered data that can be used to analyze twins [7, 11]. Duffy [3] has described how co-twin control studies can be analyzed using structural equation models. Methods such as random effects regression models (see Linear Multilevel Models; Generalized Linear Mixed Models) and generalized estimating equations are readily adapted to the analysis of twins [14]. These methods are extremely flexible and allow additional covariates to be included in the regression model as both main effects and interaction terms.
Options are available to examine outcome variables that can take on virtually any structure, including continuous, dichotomous, ordinal, and censored. Further extensions allow the simultaneous analysis of multiple outcomes as well as longitudinal analysis of repeated measures over time (see Longitudinal Data Analysis; Repeated Measures Analysis of Variance). These more complex applications can also go beyond the discordant co-twin control method to incorporate all types of exposure patterns within twin pairs [8]. However, application of these more sophisticated analyses should be done with great care to make sure that the appropriate statistical model is used [13].

References

[1] Battie, M.C., Videman, T., Gibbons, L.E., Manninen, H., Gill, K., Pope, M. & Kaprio, J. (2002). Occupational driving and lumbar disc degeneration: a case-control study, Lancet 360, 1369–1374.
[2] Bouchard, C. & Tremblay, A. (1997). Genetic influences on the response of body fat and fat distribution to positive and negative energy balances in human identical twins, The Journal of Nutrition 127(Suppl. 5), 943S–947S.
[3] Duffy, D.L. (1994). Biometrical genetic analysis of the cotwin control design, Behavior Genetics 24, 341–344.
[4] Duffy, D.L., Mitchell, C.A. & Martin, N.G. (1998). Genetic and environmental risk factors for asthma: a cotwin-control study, American Journal of Respiratory and Critical Care Medicine 157, 840–845.
[5] Eisen, S.A., Neuman, R., Goldberg, J., Rice, J. & True, W.R. (1989). Determining zygosity in the Vietnam era twin (VET) registry: an approach using questionnaires, Clinical Genetics 35, 423–432.
[6] Goldberg, J., Curran, B., Vitek, M.E., Henderson, W.G. & Boyko, E.J. (2002). The Vietnam Era Twin (VET) registry, Twin Research 5, 476–481.
[7] Goldstein, H. (2003). Multilevel Statistical Models, Hodder Arnold, London.
[8] Hu, F.B., Goldberg, J., Hedeker, D. & Henderson, W.G. (1999). Modeling ordinal responses from co-twin control studies, Statistics in Medicine 17, 957–970.
[9] Kendler, K.S., Neale, M.C., MacLean, C.J., Heath, A.C., Eaves, L.J. & Kessler, R.C. (1993). Smoking and major depression: a causal analysis, Archives of General Psychiatry 50, 36–43.
[10] Lewis, D.H., Mayberg, H.S., Fischer, M.E., Goldberg, J., Ashton, S., Graham, M.M. & Buchwald, D. (2001). Monozygotic twins discordant for chronic fatigue syndrome: regional cerebral blood flow SPECT, Radiology 219, 766–773.
[11] Liang, K.Y. & Zeger, S.L. (1993). Regression analysis for correlated data, Annual Review of Public Health 14, 43–68.
[12] MacGregor, A.J., Snieder, H., Schork, N.J. & Spector, T.D. (2000). Twins: novel uses to study complex traits and genetic diseases, Trends in Genetics 16, 131–134.
[13] Neuhaus, J.M. & Kalbfleisch, J.D. (1998). Between- and within-cluster covariate effects in the analysis of clustered data, Biometrics 54, 638–645.
[14] Quirk, J.T., Berg, S., Chinchilli, V.M., Johansson, B., McClearn, G.E. & Vogler, G.P. (2001). Modelling blood pressure as a continuous outcome variable in a co-twin control study, Journal of Epidemiology and Community Health 55, 746–747.

JACK GOLDBERG AND MARY FISCHER
Counterbalancing
VENITA DEPUY AND VANCE W. BERGER
Volume 1, pp. 418–420
While every treatment appears in every time point exactly once, it should be noted that treatment order does not vary in a simple Latin square. Every treatment C is followed by treatment D, and so on. While this method can account for both the order effect and the learning effect, it does not counteract the carryover effect. A better method is a balanced Latin square. In this type of Latin square, each treatment immediately follows and immediately precedes each other treatment exactly once, as shown here:

Subject 1:  A B D C
Subject 2:  B C A D
Subject 3:  C D B A
Subject 4:  D A C B

It should be noted that balanced squares can be constructed only when the number of treatments is even. For odd numbers of treatments, a mirror image of the square must be constructed. Further details are given in [3]. Latin squares are used when the number of subjects, or blocks of subjects, equals the number of treatments to be administered to each. When the number of sampling units exceeds the number of treatments, multiple squares or rectangular arrays should be considered, as discussed in [8]. In the event that participants receive each treatment more than once, reverse counterbalancing may be used. In this method, treatments are given in a certain order and then in the reverse order. For example, a subject would receive treatments ABCCBA or CBAABC. A variation on this theme, in which subjects were randomized to be alternated between treatments A and B in the order ABABAB… or BABABA…, was recently used to evaluate itraconazole to prevent fungal infections [5] and critiqued [1]. It is also possible to randomize treatments to measurement times, for a single individual or for a group [4].

References

[1] Berger, V.W. (2003). Preventing fungal infections in chronic granulomatous disease, New England Journal of Medicine 349(12), 1190.
[2] Bradley, J.V. (1958). Complete counterbalancing of immediate sequential effects in a Latin square design (Corr: V53 p1030-31), Journal of the American Statistical Association 53, 525–528.
[3] Cochran, W.G. & Cox, G.M. (1957). Experimental Designs, 2nd Edition, Wiley, New York. (First corrected printing, 1968).
[4] Ferron, J. & Onghena, P. (1996). The power of randomization tests for single-case phase designs, Journal of Experimental Education 64(3), 231–239.
[5] Gallin, J.I., Alling, D.W., Malech, H.L., Wesley, R., Koziol, D., Marciano, B., Eisenstein, E.M., Turner, M.L., DeCarlo, E.S., Starling, J.M. & Holland, S.M. (2003). Itraconazole to prevent fungal infections in chronic granulomatous disease, New England Journal of Medicine 348, 2416–2422.
[6] Maxwell, S.E. & Delaney, H.D. (1990). Designing Experiments and Analyzing Data: A Model Comparison Perspective, Brooks/Cole, Pacific Grove.
[7] Poulton, E.C. & Freeman, P.R. (1966). Unwanted asymmetrical transfer effects with balanced experimental designs, Psychological Bulletin 66, 1–8.
[8] Rosenthal, R. & Rosnow, R.L. (1991). Essentials of Behavioral Research: Methods and Data Analysis, 2nd Edition, McGraw-Hill, Boston.

VENITA DEPUY AND VANCE W. BERGER
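The two defining properties of the balanced square shown above, that each treatment occupies each time point once and that each treatment immediately precedes every other treatment exactly once, can be verified mechanically. A small sketch:

```python
# The 4 x 4 balanced Latin square from the text, one row per subject
square = ["ABDC", "BCAD", "CDBA", "DACB"]

# Property 1: each treatment appears once in every column (time point)
columns_ok = all(len(set(col)) == 4 for col in zip(*square))

# Property 2: the 12 ordered pairs of adjacent treatments are all distinct,
# so each treatment immediately precedes each other treatment exactly once
pairs = [row[i:i + 2] for row in square for i in range(3)]
pairs_ok = len(set(pairs)) == 12

print(columns_ok, pairs_ok)
```

Both checks hold for the square above, whereas a cyclic square such as ABCD, BCDA, CDAB, DABC fails the second one (the pair AB occurs three times).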
Counterfactual Reasoning
PAUL W. HOLLAND
Volume 1, pp. 420–422
by Rubin [13, 14] (for a wider variety of causal studies). This formal model is described in [3] and compared to the more usual statistical models that are appropriate for descriptive rather than causal inference. This approach to defining and estimating causal effects has been applied to a variety of research designs by many authors, including [4, 5, 12] and the references therein.

References

[1] Cook, T.D. & Campbell, D.T. (1979). Quasi-experimentation: Design and Analysis Issues for Field Settings, Houghton Mifflin, Boston.
[2] Dawid, A.P. (2000). Causal inference without counterfactuals (with discussion), Journal of the American Statistical Association 95, 407–448.
[3] Holland, P.W. (1986). Statistics and causal inference, Journal of the American Statistical Association 81, 945–970.
[4] Holland, P.W. (1988). Causal inference, path analysis and recursive structural equations models, in Sociological Methodology, C. Clogg, ed., American Sociological Association, Washington, pp. 449–484.
[5] Holland, P.W. & Rubin, D.B. (1988). Causal inference in retrospective studies, Evaluation Review 12, 203–231.
[6] Lewis, D. (1973a). Causation, Journal of Philosophy 70, 556–567.
[7] Lewis, D. (1973b). Counterfactuals, Harvard University Press, Cambridge.
[8] Neyman, J. (1923). Sur les applications de la théorie des probabilités aux expériences agricoles: Essai des principes, Roczniki Nauk Rolniczych 10, 1–51; in Polish: English translation by Dabrowska, D. & Speed, T. (1991). Statistical Science 5, 463–480.
[9] Neyman, J. (1935). Statistical problems in agricultural experimentation, Supplement of the Journal of the Royal Statistical Society 2, 107–180.
[10] Robins, J.M. (1985). A new theory of causality in observational survival studies - application of the healthy worker effect, Biometrics 41, 311.
[11] Robins, J.M. (1986). A new approach to causal inference in mortality studies with a sustained exposure period - application to control of the healthy worker survivor effect, Mathematical Modelling 7, 1393–1512.
[12] Robins, J.M. (1997). Causal inference from complex longitudinal data, in Latent Variable Modeling with Applications to Causality, M. Berkane, ed., Springer-Verlag, New York, pp. 69–117.
[13] Rubin, D.B. (1974). Estimating causal effects of treatments in randomized and nonrandomized studies, Journal of Educational Psychology 66, 688–701.
[14] Rubin, D.B. (1978). Bayesian inference for causal effects: the role of randomization, Annals of Statistics 6, 34–58.
[15] Shafer, G. (1996). The Art of Causal Conjecture, MIT Press, Cambridge.

PAUL W. HOLLAND
Counternull Value of an Effect Size
ROBERT ROSENTHAL
Volume 1, pp. 422–423
sometimes when sample sizes are very large, as in some clinical trials, highly significant results may be associated with very small effect sizes. In such situations, when even the counternull value of the obtained effect size is seen to be of no practical import, clinicians may decide there is insufficient benefit to warrant introducing a new and possibly very expensive intervention. Finally, it should be noted that the counternull values of effect sizes can be useful in multivariate cases, as well as in contrast analyses [4] and [5].

References

[1] Cohen, J. (1988). Statistical Power Analysis for the Behavioral Sciences, 2nd Edition, Erlbaum, Hillsdale.
[2] Hedges, L.V. (1981). Distribution theory for Glass's estimator of effect size and related estimators, Journal of Educational Statistics 6, 107–128.
[3] Rosenthal, R. (1994). Parametric measures of effect size, in Handbook of Research Synthesis, H. Cooper & L.V. Hedges, eds, Russell Sage Foundation, New York, pp. 231–244.
[4] Rosenthal, R., Rosnow, R.L. & Rubin, D.B. (2000). Contrasts and Effect Sizes in Behavioral Research: A Correlational Approach, Cambridge University Press, New York.
[5] Rosenthal, R. & Rubin, D.B. (1994). The counternull value of an effect size: a new statistic, Psychological Science 5, 329–334.

ROBERT ROSENTHAL
Covariance
DIANA KORNBROT
Volume 1, pp. 423–424
[Figure 1: two scatterplots, (a) high jump (m) against 1500 m time (min:sec); (b) 200 m time (sec) against 1500 m time (min:sec)]

Figure 1 UK women's athletic performance from 1968 to 2003 to demonstrate covariance. (a) shows the negative relation between 1500 m and high-jump performance; (b) shows the positive relation between 1500 m and 200 m performance
Acknowledgment
Thanks to Martin Rix, who maintains the pages for United Kingdom Track and Field at gbrathletics.com.
Covariance Matrices: Testing Equality of
WOJTEK J. KRZANOWSKI
Volume 1, pp. 424–426
of heterogeneity of dispersion matrices. The bottom line is that all significance tests should be interpreted critically and with caution.

References

[1] Box, G.E.P. (1949). A general distribution theory for a class of likelihood criteria, Biometrika 36, 317–346.
[2] Krzanowski, W.J. & Marriott, F.H.C. (1994). Multivariate Analysis Part 1: Distributions, Ordination and Inference, Edward Arnold, London.
[3] Morrison, D.F. (1990). Multivariate Statistical Methods, 3rd Edition, McGraw-Hill, New York.
[4] Wilks, S.S. (1932). Certain generalisations in the analysis of variance, Biometrika 24, 471–494.

(See also Multivariate Analysis: Overview)

WOJTEK J. KRZANOWSKI
Covariance Structure Models
Y.M. THUM
Volume 1, pp. 426–430
provided a maximum likelihood solution based on the log-likelihood

    ln L = −(Np/2) ln(2π) − (N/2) ln|Σ| − (N/2) tr(Σ⁻¹S).        (5)

It is clear from our discussion above that a covariance structure model belongs to the broader class of structural equation models (SEM). Consequently, widely available programs for SEM analyses, such as AMOS, EQS, LISREL, MPLUS, and SAS PROC CALIS (see Structural Equation Modeling: Software), will routinely provide the necessary estimates, standard errors, and a host of fit statistics for fitting covariance structure models.

A Study of Teaching Practices

Wiley, Schmidt, and Bramble [22] examined data from [17], consisting of responses from 51 students to a test with items sharing combinations of three factors thought to influence classroom learning situations and teaching practices (see Table 1). The study sought to compare conditions in first and sixth grade classrooms, teaching styles that were deemed teacher-centered as opposed to pupil-centered, and teaching methods that were focused on drill as opposed to promoting discovery. The eight subtests comprised a 2³ factorial design.

Following Wiley, Schmidt, and Bramble [22], we parameterize an overall latent component, a contrast between grade levels, a contrast between teaching styles, and a contrast between teaching methods, represented seriatim by the four columns in design matrix

         1   1   1   1
         1   1   1  −1
         1   1  −1   1
    A =  1   1  −1  −1        (6)
         1  −1   1   1
         1  −1   1  −1
         1  −1  −1   1
         1  −1  −1  −1

Results from SAS PROC CALIS support the conclusion of [22] that a model with correlated latent components estimated by

          9.14 (1.92)
    Φ̂ =   0.73 (0.48)   0.68 (0.23)
          0.63 (0.42)  −0.06 (0.15)   0.43 (0.19)
         −0.61 (1.05)  −0.51 (0.37)   1.13 (0.35)   5.25 (1.14)        (7)

(lower triangle shown, standard errors in parentheses) and heterogeneous error variances estimated by Θ̂ = diag[1.63 (0.90), 5.10 (1.40), 8.17 (1.90), 5.50 (1.56), 1.93 (0.91), 2.33 (0.84), 5.79 (1.44), 2.55 (0.93)] indicated an acceptable model fit (χ²(18) = 25.24, p = 0.12; CFI = 0.98; RMSEA = 0.09), superior to several combinations of alternative forms for Φ and Θ.

In comparing the relative impact of the various factors in teachers' judgments on conditions that facilitated student learning, Wiley et al. [22, p. 322], however, concluded erroneously that teaching style did not affect teacher evaluations due, most likely, to a clerical error in reporting 0.91 as the standard error estimate for φ̂₃₂ = 0.43 instead of 0.19. The results therefore suggested that, on the contrary, both teaching practices influenced the performance of subjects. Of particular interest to research on teaching, the high …
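The estimates above can be combined into the structured covariance matrix they imply. The sketch below assumes the structuring Σ = AΦA′ + Θ, with A the 8 × 4 design matrix, Φ the latent covariance, and Θ the diagonal error matrix; the entry's model (1) is not reproduced in this excerpt, so this form, the ±1 sign ordering in A, and the minus-sign placements read into (7) are all assumptions of the sketch:

```python
# 2^3 factorial design matrix: an overall column plus three +/-1 contrasts
# (grade, teaching style, teaching method); the sign ordering is assumed.
A = [[1, g, s, m] for g in (1, -1) for s in (1, -1) for m in (1, -1)]

# Latent covariance as read from (7), lower triangle mirrored;
# the recovered minus signs are an assumption, not taken verbatim.
Phi = [[ 9.14,  0.73,  0.63, -0.61],
       [ 0.73,  0.68, -0.06, -0.51],
       [ 0.63, -0.06,  0.43,  1.13],
       [-0.61, -0.51,  1.13,  5.25]]
theta = [1.63, 5.10, 8.17, 5.50, 1.93, 2.33, 5.79, 2.55]  # diagonal of Theta

def sigma(i, j):
    """Element (i, j) of Sigma = A Phi A' + Theta, with Theta diagonal."""
    s = sum(A[i][r] * Phi[r][c] * A[j][c] for r in range(4) for c in range(4))
    return s + (theta[i] if i == j else 0.0)

# The implied 8 x 8 structured covariance matrix of the subtests
Sigma = [[sigma(i, j) for j in range(8)] for i in range(8)]
print(len(Sigma), len(Sigma[0]))
```

With these values the implied Σ is symmetric with a positive diagonal, as a covariance matrix must be; comparing such a structured Σ to the observed S is what the fit statistics quoted above summarize.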
single observation vector. However, when replications within subjects are unbalanced, that is, for each subject i we observed an ni × p matrix Yi = [yi1, yi2, …, yir, …, yini], the covariance structure model (1) takes the extended form of a multivariate mixed-effects model,

    Yi = Ξi A + Ei,        (10)

treated, for example, in Thum [20, 21], in Littell, Milliken, Stroup, and Wolfinger [12], and in more recent revisions of SEM and multilevel frameworks (see Generalized Linear Mixed Models).

References

[1] Bock, R.D. (1960). Components of variance analysis as a structural and discriminal analysis for psychological tests, British Journal of Statistical Psychology 13, 151–163.
[2] Bock, R.D. & Bargmann, R.E. (1966). Analysis of covariance structures, Psychometrika 31, 507–533.
[3] Brennan, R.L. (2002). Generalizability Theory, Springer-Verlag, New York.
[4] Burt, C. (1947). Factor analysis and analysis of variance, British Journal of Psychology Statistical Section 1, 3–26.
[5] Campbell, D.T. & Fiske, D.W. (1959). Convergent and discriminant validation by the multitrait-multimethod matrix, Psychological Bulletin 56, 81–105.
[6] Goldstein, H. & McDonald, R.P. (1988). A general model for the analysis of multilevel data, Psychometrika 53, 455–467.
[7] Guilford, J.P. (1956). The structure of intellect, Psychological Bulletin 53, 276–293.
[8] Guttman, L.A. (1954). A new approach to factor analysis: the radex, in Mathematical Thinking in the Social Sciences, P.F. Lazarsfeld, ed., Columbia University Press, New York, pp. 258–348.
[9] Joreskog, K.G. (1979). Analyzing psychological data by structural analysis of covariance matrices, in Advances in Factor Analysis and Structural Equation Models, J. Madgison, ed., University Press of America, Lanham.
[10] Kirk, R.E. (1995). Experimental Design, Brooks/Cole, Pacific Grove.
[11] Linn, R.L. & Wert, C.E. (1979). Covariance structures and their analysis, in New Directions for Testing and Measurement: Methodological Developments, Vol. 4, R.E. Traub, ed., Jossey-Bass, San Francisco, pp. 53–73.
[12] Littell, R.C., Milliken, G.A., Stroup, W.W. & Wolfinger, R.D. (1996). SAS System for Mixed Models, SAS Institute, Cary.
[13] Longford, N. & Muthen, B.O. (1992). Factor analysis for clustered observations, Psychometrika 57, 581–597.
[14] Marcoulides, G.A. (1996). Estimating variance components in generalizability theory: the covariance structure approach, Structural Equation Modeling 3(3), 290–299.
[15] McArdle, J.J. & Epstein, D.B. (1987). Latent growth curves within developmental structural equation models, Child Development 58(1), 110–133.
[16] Meredith, W. & Tisak, J. (1990). Latent curve analysis, Psychometrika 55, 107–122.
[17] Miller, D.M. & Lutz, M.V. (1966). Item design for an inventory of teaching practices and learning situations, Journal of Educational Measurement 3, 53–61.
[18] Muthen, B. & Satorra, A. (1989). Multilevel aspects of varying parameters in structural models, in Multilevel Analysis of Educational Data, D.R. Bock, ed., Academic Press, San Diego, pp. 87–99.
[19] Shavelson, R.J. & Webb, N.M. (1991). Generalizability Theory: A Primer, Sage Publications, Newbury Park.
[20] Thum, Y.M. (1994). Analysis of individual variation: a multivariate hierarchical linear model for behavioral data, Doctoral dissertation, University of Chicago.
[21] Thum, Y.M. (1997). Hierarchical linear models for multivariate outcomes, Journal of Educational and Behavioral Statistics 22, 77–108.
[22] Wiley, D.E., Schmidt, W.H. & Bramble, W.J. (1973). Studies of a class of covariance structure models, Journal of the American Statistical Association 68, 317–323.
[23] Willett, J.B. & Sayer, A.G. (1994). Using covariance structure analysis to detect correlates and predictors of change, Psychological Bulletin 116, 363–381.
[24] Wothke, W. (1996). Models for multitrait-multimethod matrix analysis, in Advanced Structural Equation Modelling, G.A. Marcoulides & R.E. Schumacher, eds, Lawrence Erlbaum, Mahwah.

(See also Linear Statistical Models for Causation: A Critical Review; Structural Equation Modeling: Nontraditional Alternatives)

Y.M. THUM
Covariance/variance/correlation
JOSEPH LEE RODGERS
Volume 1, pp. 431–432
often leads applied statisticians to use more robust measures of variability, such as the median absolute deviation (MAD) statistic. Whereas the mean minimizes the sum of squared deviations compared to any other constant, the median has the same optimality property in the context of absolute deviations.

The variance, like the correlation, is a special case of the covariance: it is the covariance of a variable with itself. The variance is zero if and only if all the scores are the same; in this case, the variable can be viewed as independent of itself, in a sense. Although covariances may be positive or negative, depending on the relation between the two variables, a variable can only have a positive relationship with itself, implying that negative variances do not exist, at least computationally. Occasionally, estimation routines can estimate negative variances (in factor analysis, these have a special name: Heywood cases). These can result from missing data patterns or from lack of fit of the model to the data.

The formula given in Hays to compute the covariance is cov(X, Y) = E(XY) − E(X)E(Y), which shows that the covariance measures the departure from independence of X and Y. A straightforward computational formula for the covariance shows the relationships between the three measures, the covariance, the correlation, and the variance:

    cov(X, Y) = r(X, Y) s(X) s(Y),

which involves all three of these statistics. Dividing this formula through by the product of the standard deviations shows how the correlation can be obtained by standardizing the covariance (i.e., by dividing the covariance by the product of the standard deviations).

In structural equations (SEM) models (and factor analysis to a lesser extent), appreciating all three of these important statistical indices is prerequisite to understanding the theory and to being able to apply the method. Most software packages that estimate SEM models can fit a model to either covariances or to correlations; that is, these packages define predicted covariance or correlation matrices between all pairs of variables, and compare them to the observed values from real data (see Structural Equation Modeling: Software). If the model is a good one, one or more of the several popular fit statistics will indicate a good match between the observed and predicted values. With a structural equation model, observed or latent variables can be linked to another observed or latent variable either through a correlation or covariance, and can also be linked back to itself through a variance. These variances may be constrained to be equal to one, implying standardized variables, or unconstrained, implying unstandardized variables. The covariances/correlations can be estimated, constrained, or fixed.
from single administrations are also possible and will be discussed later in this entry. Validity is typically assessed by how well the test items measure the content domain to which the test scores are referenced. Validity also depends on the performance standards that are set for sorting candidates into performance categories. If they are set improperly (perhaps set too high or too low because of a political agenda of those panelists who set them), then examinees will be misclassified (relative to how they would be classified if true scores were available and a valid set of performance standards were in place), and the validity of the resulting performance classifications is reduced. What is unique about CRTs is the central focus on the content measured by the test, and subsequently, on how the performance standards are set, and on the levels of decision consistency and accuracy of the resulting examinee classifications. These technical problems will be addressed next.

Setting Performance Standards

Setting performance standards on CRTs has always been problematic (see [1]) because substantial judgment is involved in preparing a process for setting them, and no agreement exists in the field about the best choice of methods (see Setting Performance Standards: Issues, Methods). One instructor may be acceptable for setting performance standards on a classroom test (consequences are usually low for students, and the instructor is normally the most qualified person to set the performance standards), but when the stakes for the testing get higher (e.g., deciding who will receive a high school diploma, or a certificate to practice in a profession), multiple judges or panelists will be needed to defend the resulting performance standards. Of course, with multiple panelists, each with their own opinion, the challenge is to put them through a process that will converge on a defensible set of performance standards. In some cases, even two or more randomly equivalent panels are set up so that the replicability of the performance standards can be checked. Even multiple panels may not appease the critics: the composition of the panel or panels, and the number of panel members, can become a basis for criticism.

Setting valid performance standards involves many steps (see [4]): choosing the composition of the panel or panels and selecting a representative sample of panel members; preparing clear descriptions of the performance levels; developing clear and straightforward materials for panels to use in the process; choosing a standard-setting method that is appropriate for the characteristics of the test itself and the panel itself (for example, some methods can only be used with multiple-choice test items, and other methods require item statistics); insuring effective training (normally, this is best accomplished with field testing in advance of the actual standard-setting process); allowing sufficient time for panels to complete their ratings, participate in discussions, and revise their ratings (this activity is not always part of a standard-setting process); compiling the panelists' ratings and deriving the performance standards; collecting validity data from the panelists; analyzing the available data; and documenting the process itself.

Counting variations, there are probably over 100 methods for setting performance standards [1]. Most of the methods involve panelists making judgments about the items in the test. For example, with the Angoff method, panelists predict the expected performance of borderline candidates, at the Basic cut score, the Proficient cut score, and the Advanced cut score, on all of the items on the test. These expected item scores at a cut score are summed to arrive at a panelist's cut score, and then averaged across panelists to arrive at an initial cut score for the panel. This process is repeated to arrive at each of the cut scores. Normally, discussion follows, then panelists have an opportunity to revise their ratings, and then the cut scores are recalculated. Sometime during the process panelists may be given some item statistics, or the consequences of particular cut scores that they have set (e.g., with a particular cut score, 20% of the candidates will fail). This is known as the Angoff method.

In another approach to setting performance standards, persons who know the candidates (called reviewers) and who know the purpose of the test might be asked to sort candidates into four performance categories: Failing, Basic, Proficient, and Advanced. A cut score to distinguish Failing from Basic on the test is determined by looking at the actual test score distributions of candidates who were assigned to either the Failing or Basic categories by reviewers. A cut score is chosen to maximize the consistency of the classifications between candidates based on the test and the reviewers. The process is then repeated for the other cut scores. This is known
Criterion-Referenced Assessment 3
as the contrasting groups method. Sometimes, other criteria for placing cut scores might be used, such as doubling the importance of minimizing one type of classification error (e.g., false positive errors) over another (e.g., false negative errors).

Many more methods exist in the measurement literature: Angoff, Ebel, Nedelsky, contrasting groups, borderline group, bookmark, booklet classification, and so on. See [1] and [5] for complete descriptions of many of the current methods.

Assessing Decision Consistency and Accuracy

Reliability of test scores refers to the consistency of test scores over time, over parallel forms, or over items within the test. It follows naturally from this definition that calculation of reliability indices would require a single group of examinees taking two forms of a test, or even a single test a second time, but this is often not realistic in practice. Thus, it is routine to report single-administration reliability estimates such as corrected split-half reliability estimates and/or coefficient alpha. Accuracy of test scores is another important concern that is often checked by comparing test scores against a criterion score, and this constitutes a main aspect of validity [8].

With CRTs, examinee performance is typically reported in performance categories, and so the reliability and the validity of the examinee classifications are of greater importance than the reliability and validity associated with test scores. That is, the consistency and accuracy of the decisions based on the test scores outweigh the consistency and the accuracy of the test scores themselves.

As noted by Hambleton and Slater [7], before 1973 it was common to report a KR-20 or a corrected split-half reliability estimate to support the use of a credentialing examination. Since these two indices only provide estimates of the internal consistency of examination scores, Hambleton and Novick [6] introduced the concept of the consistency of decisions based on test scores, and suggested that the reliability of classification decisions should be defined in terms of the consistency of examinee decisions resulting from two administrations of the same test or parallel forms of the test, that is, an index of reliability which reflects the consistency of classifications across repeated testing. As compared with the definition of decision consistency (DC) given by Hambleton and Novick [6], decision accuracy (DA) is the extent to which the actual classifications of the test takers agree with those that would be made on the basis of their true scores, if their true scores could somehow be known [12].

Methods of Estimating DC and DA

The introduction of the definition of DC by Hambleton and Novick [6] pointed to a new direction for evaluating the reliability of CRT scores. The focus was to be on the reliability of the classifications or decisions rather than on the scores themselves. Swaminathan, Hambleton, and Algina [19] extended the Hambleton–Novick concept of decision consistency to the case where there were not just two performance categories:

p_0 = Σ_{i=1}^{k} p_ii   (1)

where p_ii is the proportion of examinees consistently assigned to the i-th performance category across two administrations, and k is the number of performance categories. In order to correct for chance agreement, based on the kappa coefficient (see Rater Agreement Kappa) by Cohen [2], which is a generalized proportion agreement index frequently used to estimate inter-judge agreement, Swaminathan, Hambleton, and Algina [20] put forward the kappa statistic, which is defined by:

κ = (p − p_c) / (1 − p_c)   (2)

where p is the proportion of examinees classified in the same categories across administrations, and p_c is the agreement expected by chance factors alone.

The concepts of decision consistency and kappa were quickly accepted by the measurement field for use with CRTs, but the restriction of a double administration was impractical. A number of researchers introduced single-administration estimates of decision consistency and kappa, analogous to the corrected split-half reliability that was often the choice of researchers working with NRTs. Huynh [11] put forward his two-parameter bivariate beta-binomial model. His model relies on the assumption that a group of examinees' ability scores follow the beta
distribution with parameters α and β, and the frequency of the observed test scores x follows the beta-binomial (or negative hypergeometric) distribution with parameters α and β. The model is defined by the following:

f(x) = [n! / (x! (n − x)!)] · B(α + x, β + n − x) / B(α, β)   (3)

where n is the total number of items in the test, and B is the beta function with parameters α and β, which can be estimated either with the moment method, making use of the first two moments of the observed test scores, or with the maximum likelihood (ML) method described in his paper. The probability that an examinee has been consistently classified into a particular category can then be calculated by using the beta-binomial density function. Hanson and Brennan [10] extended Huynh's approach by using the four-parameter beta distribution for true scores.

Subkoviak's method [18] is based on the assumptions that observed scores are independent and distributed binomially, with two parameters: the number of items and the examinee's proportion-correct true score. His procedure estimates the true score for each individual examinee without making any distributional assumptions for true scores. When combined with the binomial or compound binomial error model, the estimated true score will provide a consistency index for each examinee, and averaging this index over all examinees gives the DC index.

Since the previous methods all deal with binary data, Livingston and Lewis [12] came up with a method that can be used with data including dichotomous items, polytomous items, or a combination of the two. It involves estimating the distribution of proportional true scores Tp using strong true score theory [13]. This theory assumes that the proportional true score distribution has the form of a four-parameter beta distribution with density

g(Tp | α, β, a, b) = (Tp − a)^α (b − Tp)^β / [Beta(α + 1, β + 1) (b − a)^(α+β+1)]   (4)

where Beta is the beta function, and the four parameters of the function can be estimated by using the first four moments of the observed scores for the group of examinees. Then the conditional distribution of scores on an alternate form (given true score) is estimated using a binomial distribution.

All of the previously described methods operate in the framework of classical test theory (CTT). With the popularization of item response theory (IRT), the evaluation of decision consistency and accuracy under IRT has attracted the interest of researchers. For example, Rudner ([16], [17]) introduced his method for evaluating decision accuracy in the framework of IRT.

Rudner [16] proposed a procedure for computing expected classification accuracy for tests consisting of dichotomous items and later extended the method to tests including polytomous items [17]. It should be noted that Rudner referred to θ and θ̂ as true score and observed score, respectively, in his papers. He pointed out that because, for any given true score θ, the corresponding observed score is expected to be normally distributed with a mean θ and a standard deviation of se(θ), the probability of an examinee with a given true score of having an observed score in the interval [a, b] on the theta scale is then given by

p(a < θ̂ < b | θ) = Φ((b − θ)/se(θ)) − Φ((a − θ)/se(θ)),   (5)

where Φ(Z) is the cumulative normal distribution function. He noted further that multiplying (5) by the expected proportion of examinees whose true score is θ yields the expected proportion of examinees whose true score is θ and whose observed score is expected to be in the interval [a, b], and summing or integrating over all examinees in the interval [c, d] gives us the expected proportion of all examinees that have a true score in [c, d] and an observed score in [a, b]. If we are willing to make the assumption that the examinees' true scores θ are normally distributed, the expected proportions of all examinees that have a true score in the interval [c, d] and an observed score in the interval [a, b] are given by

Σ_{θ=c}^{d} P(a < θ̂ < b | θ) f(θ) = Σ_{θ=c}^{d} [Φ((b − θ)/se(θ)) − Φ((a − θ)/se(θ))] f(θ),   (6)
where se(θ) is the reciprocal of the square root of the test information function at θ, which is the sum of the item information functions in the test, and f(θ) is the standard normal density function φ(Z) [16]. The problem with this method, of course, is that the normality assumption is usually problematic.

Reporting of DC and DA

Table 1 represents a typical example of how the DC of performance classifications is reported. Each of the diagonal elements represents the proportion of examinees in the total sample who were consistently classified into a certain category on both administrations (with the second one being hypothetical), and summing up all the diagonal elements yields the total DC index.

Table 1 Grade 4 English language arts decision consistency results

                      Status on parallel form
Status on
form taken    Failing   Basic   Proficient   Advanced   Total
Failing        0.083    0.030     0.000       0.000     0.113
Basic          0.030    0.262     0.077       0.001     0.369
Proficient     0.000    0.077     0.339       0.042     0.458
Advanced       0.000    0.001     0.042       0.018     0.060
Total          0.113    0.369     0.458       0.060     1.00

Note: From Massachusetts Department of Education [14].

It is a common practice now to report kappa in test manuals to provide information on the degree of agreement in performance classifications after correcting for the agreement due to chance. Also reported is the conditional error, which is the measurement error associated with test scores at each of the performance standards. It is helpful because it indicates the size of the measurement error for examinees close to each performance standard.

The values of DA are usually reported in the same way as in Table 1, only that the cross-tabulation is between true score status and test score status. Of course, it is highly desirable that test manuals also report other evidence to support the score inferences from a CRT, for example, evidence of content, criterion-related, construct, and consequential validity.

Appropriate Levels of DC and DA

A complete set of approaches for estimating decision consistency and accuracy is contained in Table 2. Note that the value of DA is higher than that of DC because the calculation of DA involves one set of observed scores and one set of true scores, which are supposed to be without any measurement error due to improper sampling of test questions, flawed test items, problems with the test administration, and so on, while the calculation of DC involves two sets of observed scores.

The levels of DC and DA required in practice will depend on the intended uses of the CRT and the number of performance categories. There have not been any established rules to help determine the levels of decision consistency and accuracy needed for different kinds of educational and psychological assessments. In general, the more important the educational decision to be made, the higher the consistency and accuracy need to be.
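Equations (1) and (2) can be applied directly to the Table 1 cross-tabulation. The following is a minimal sketch (not from the source); because the printed cells are rounded, the computed marginals differ slightly from the printed totals:

```python
# Decision consistency p0 (equation 1) and kappa (equation 2) for the
# Table 1 cross-tabulation of performance classifications.
table = [
    [0.083, 0.030, 0.000, 0.000],
    [0.030, 0.262, 0.077, 0.001],
    [0.000, 0.077, 0.339, 0.042],
    [0.000, 0.001, 0.042, 0.018],
]
k = len(table)
p0 = sum(table[i][i] for i in range(k))            # sum of diagonal proportions
row = [sum(r) for r in table]                      # marginals on first form
col = [sum(table[i][j] for i in range(k)) for j in range(k)]  # parallel form
pc = sum(row[i] * col[i] for i in range(k))        # agreement expected by chance
kappa = (p0 - pc) / (1 - pc)
print(round(p0, 3), round(kappa, 2))
```

With these rounded cell values, p0 comes out at 0.702 and kappa at roughly 0.53, i.e., a little over half of the agreement beyond chance is retained.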
[Diagram residue: unit and classification diagrams showing hospitals h1–h4 cross-classified with areas a1–a3]
The model is written as:

y_i = β_0 + u^(2)_elem(i) + u^(3)_high(i) + e_i

so that, for example,

y_1 = β_0 + u^(2)_1 + u^(3)_1 + e_1,   (3)

and for patient 5 would be

y_5 = β_0 + u^(2)_3 + u^(3)_2 + e_5,   (4)

with

u^(2)_elem(i) ~ N(0, σ²_u(2))
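As a hedged illustration of the cross-classified structure above (not code from the source), one can simulate responses from the model with invented school counts, variances, and membership pattern:

```python
# Sketch: simulate y_i = b0 + u(2)_elem(i) + u(3)_high(i) + e_i with
# independent normal random effects. All numeric values are illustrative.
import random

random.seed(1)
n_elem, n_high, n_units = 4, 3, 12
u_elem = [random.gauss(0, 0.5) for _ in range(n_elem)]  # elementary-school effects
u_high = [random.gauss(0, 0.3) for _ in range(n_high)]  # high-school effects
b0 = 10.0

# Arbitrary cross-classified membership: unit i attends elementary school
# i mod n_elem and high school i mod n_high.
y = [b0 + u_elem[i % n_elem] + u_high[i % n_high] + random.gauss(0, 1.0)
     for i in range(n_units)]
print(len(y))  # one response per lowest-level unit
```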
behavior is directed. For a family with two parents and two children, we have 12 directed scores (ds):

c1→c2, c1→m, c1→f, c2→c1, c2→m, c2→f, m→c1, m→c2, m→f, f→c1, f→c2, f→m

where c1, c2, f, m denote child 1, child 2, father, and mother. These directed scores can be classified by actor and by partner. They can also be classified into six dyad groupings:

(c1→c2, c2→c1), (c1→m, m→c1), (c1→f, f→c1), (c2→m, m→c2), (c2→f, f→c2), (m→f, f→m)

Schematically, the structure is as shown in Figure 5.

[Figure 5 residue: unit diagram for family f1 showing the 12 directed scores classified by dyad (d1–d6), actor (c1, c2, m, f), and partner (c1, c2, m, f)]

Note that the notion of a classification is different from the set of units contained in a classification. For example, the actor and partner classifications are made up of the same set of units (family members). What distinguishes the actors and partners as different classifications is that they have a different set of connections (or a different mapping) to the level 1 units (directed scores). See [2] for a mathematical definition of classifications as mappings between sets of units.

In this family network data the directed scores are contained within a cross-classification of actors, partners, and dyads, and this crossed structure is nested within families. The classification diagram for this structure is shown in Figure 6. The model can be written as:

y_i = (Xβ)_i + u^(2)_actor(i) + u^(3)_partner(i) + u^(4)_dyad(i) + u^(5)_family(i) + e_i

u^(2)_actor(i) ~ N(0, σ²_u(2)),  u^(3)_partner(i) ~ N(0, σ²_u(3)),  u^(4)_dyad(i) ~ N(0, σ²_u(4)),
u^(5)_family(i) ~ N(0, σ²_u(5)),  e_i ~ N(0, σ²_e).   (6)

These models have not been widely used, but they offer great potential for decomposing between- and within-family dynamics. They can address questions such as:
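The enumeration above can be checked mechanically; this short sketch derives the 12 directed scores and the 6 dyad groupings from the member list:

```python
# Enumerate directed scores (ordered actor-partner pairs) and dyads
# (unordered pairings) for family members c1, c2, m, f.
from itertools import permutations

members = ["c1", "c2", "m", "f"]
directed = list(permutations(members, 2))       # 12 ordered (actor, partner) pairs
dyads = {frozenset(pair) for pair in directed}  # 6 unordered dyad groupings
print(len(directed), len(dyads))  # 12 6
```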
[Figure residue: unit diagram for the artificial insemination example, with donations (d1, d2, ...) nested within cycles (c1–c4) within women (w1–w3), and donors (m1–m3) contributing across women]

Figure 8 Classification diagram for the artificial insemination example

We can write the model as

y_i ~ Binomial(1, π_i)
logit(π_i) = (Xβ)_i + u^(2)_woman(i) + u^(3)_donation(i) + u^(4)_donor(i)
u^(2)_woman(i) ~ N(0, σ²_u(2)),  u^(3)_donation(i) ~ N(0, σ²_u(3)),  u^(4)_donor(i) ~ N(0, σ²_u(4)).   (7)

Multiple Membership Models

In the models we have fitted so far, we have assumed that lower-level units are members of a single unit from each higher-level classification. For example, students are members of a single high school and a single elementary school. Where lower-level units are influenced by more than one higher-level unit from the same classification, we have a multiple membership model. For example, if patients are treated by several nurses, then patients are multiple members of nurses. Each of the nurses treating a patient contributes to the patient's treatment outcome. In this case the treatment outcome for patient i is modeled by a fixed predictor, a weighted sum of the random effects for the nurses that treat patient i, and a patient-level residual. This model can be written as

y_i = (Xβ)_i + Σ_{j ∈ nurse(i)} w_{i,j} u^(2)_j + e_i

[Figure 9 residue: unit diagram for the patient–nurse classification structure, with patients as multiple members of nurses n1–n3]

The classification diagram for this structure is shown in Figure 11.
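The weighted-sum term described above can be sketched as follows; the nurse effects and weights are invented for illustration, with each patient's weights summing to one:

```python
# Multiple membership term: the contribution to patient i's outcome is a
# weighted sum of the random effects of the nurses treating that patient.
nurse_effect = {"n1": 0.4, "n2": -0.2, "n3": 0.1}  # illustrative u(2)_j values
weights = {
    "p1": {"n1": 0.5, "n2": 0.5},  # treated equally by two nurses
    "p2": {"n3": 1.0},             # single membership as a special case
}

def mm_term(patient):
    # sum over j in nurse(i) of w_ij * u(2)_j
    return sum(w * nurse_effect[j] for j, w in weights[patient].items())

print(round(mm_term("p1"), 3))  # 0.5*0.4 + 0.5*(-0.2) = 0.1
```

When a patient sees only one nurse, the term reduces to the ordinary single-membership random effect, which is why multiple membership models contain the usual multilevel models as a special case.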
[Figure 12 residue: unit diagram with houses h1, h2 nested within farms]

farm come from more than one parent flock. This means the multiple membership breeding hierarchy is cross-classified with the production hierarchy. A unit diagram for the structure is shown in Figure 12 and a classification diagram in Figure 13.

We can write the model as

y_i ~ Binomial(1, π_i)
logit(π_i) = (XB)_i + u^(2)_house(i) + u^(3)_farm(i) + Σ_{j ∈ parent(i)} w_{i,j} u^(4)_j
u^(2)_house(i) ~ N(0, σ²_u(2)),  u^(3)_farm(i) ~ N(0, σ²_u(3)),  u^(4)_j ~ N(0, σ²_u(4)).   (10)

Table 5 Results for Danish poultry data

Parameter   Description             Estimate (se)
β0          Intercept               1.86 (0.187)
β1          1996                    1.04 (0.131)
β2          1997                    0.89 (0.151)
β3          Hatchery 2              1.47 (0.22)
β4          Hatchery 3              0.17 (0.21)
β5          Hatchery 4              0.92 (0.29)
σ²_u(2)     House variance          0.19 (0.09)
σ²_u(3)     Farm variance           0.59 (0.11)
σ²_u(4)     Parent flock variance   1.02 (0.22)
number of periods will equal the number of treatments. Examples of such designs that are in perfect balance are so-called balanced Latin square designs, in which each treatment occurs equally often in each period and each sequence [2, Section 4.2]. These are the most efficient designs possible, provided no adjustment need be made for other effects such as carryover. Additional forms of balance can be imposed to maintain reasonable efficiency when adjusting for simple forms of carryover effect; the so-called Williams squares are the commonest example of these.

The analysis of continuous data from crossover trials usually follows conventional factorial ANOVA type procedures (see Factorial Designs), incorporating fixed subject effects (see Fixed and Random Effects) [1; 2, Chapter 5]. This maintains simplicity but does confine the analysis to within-subject information only. In efficient designs, most or all information on treatment effects is within-subject, so it is rarely sensible to deviate from this approach. However, it is sometimes necessary to use designs that are not efficient, for example, when the number of treatments exceeds the number of periods that can be used, and for these it is worth considering the recovery of between-subject or inter-block information. This is accomplished by using an analysis with random as opposed to fixed subject effects (see Fixed and Random Effects).

In psychological research, it is not unusual to have crossover designs with very many periods in which treatments are repeated. In the analysis of data from such trials, there are potentially many period parameters. For the purposes of efficiency, it may be worth considering replacing the categorical period component by a smooth curve to represent changes over time; a polynomial or nonparametric smooth function might be used for this [2, Section 5.8]. For analyzing nonnormal, particularly categorical, data, appropriate methods for nonnormal clustered or repeated measurements data (see Generalized Linear Mixed Models) can be adapted to the crossover setting [2, Chapter 6].

References

[1] Cotton, J.W. (1998). Analyzing Within-subject Experiments, Lawrence Erlbaum Associates, New Jersey.
[2] Jones, B. & Kenward, M.G. (2003). The Design and Analysis of Cross-over Trials, 2nd Edition, Chapman & Hall/CRC, London.

MICHAEL G. KENWARD
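The Williams squares mentioned above can be generated cyclically. The sketch below uses one standard construction (first row 0, t−1, 1, t−2, ..., remaining rows cyclic shifts), which for an even number of treatments yields a Latin square in which every treatment is preceded by every other treatment exactly once; it is an illustrative construction, not necessarily the one used in [2]:

```python
# Build a Williams square (Latin square balanced for first-order carryover)
# for t = 4 treatments.
def williams_square(t):
    first, lo, hi = [], 0, t - 1
    for k in range(t):
        if k % 2 == 0:
            first.append(lo)   # take the next-lowest treatment label
            lo += 1
        else:
            first.append(hi)   # alternate with the next-highest label
            hi -= 1
    # Remaining sequences are cyclic shifts of the first row.
    return [[(x + shift) % t for x in first] for shift in range(t)]

square = williams_square(4)
pairs = {(row[j], row[j + 1]) for row in square for j in range(len(row) - 1)}
print(len(pairs))  # 12 distinct ordered (previous, current) treatment pairs
```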
Cross-sectional Design
PATRICIA L. BUSK
Volume 1, pp. 453–454
would include seniors, individuals with 2-, 4-, and 6-years graduation in the same geographic regions, same gender composition, and same socioeconomic status.

Additional information on repeated cross-sectional studies can be found under cohort sequential designs under accelerated longitudinal designs.

PATRICIA L. BUSK
Y ≈ b0 + b1 X1 + ⋯ + bm Xm,   (1)

we would base such statements on the estimated regression coefficients b1, ..., bm. If the predictor variables are correlated (the usual situation for observational data) such interpretations are not at all innocuous; Chapters 12 and 13 of Mosteller and Tukey [2] contain an excellent discussion of the problems and pitfalls.

The second goal, and the one we will focus on, is prediction: generate a prediction rule φ(x; T) that predicts the value of the response Y from the values x = (x1, ..., xm) of the predictor variables. In the case of a linear model, an obvious choice is

φ(x; T) = b0 + b1 x1 + ⋯ + bm xm,   (2)

where the regression coefficients are estimated from the training sample T.

In the social and behavioral sciences, the dominant goal has traditionally been to understand how the response depends on the predictors. Even if understanding is the primary goal, it might still be prudent, however, to evaluate the predictive performance of a model. Low predictive power can indicate a lack of

Example

We now describe a simple example which we will use to illustrate risk estimation. There are m = 50 predictor variables that are independent and uniformly distributed on the unit interval [0, 1]. The response Y is a linear function of the predictors, plus Gaussian noise ε with mean 0 and variance σ²:

Y = b1 X1 + ⋯ + bm Xm + ε.   (4)

Each of the true regression coefficients b1, ..., bm is zero with probability 0.8, and an observation of a standard Gaussian with probability 0.2. Therefore, only about 10 of the true regression coefficients will be nonvanishing. The noise variance σ² is chosen to be the same as the signal variance:

σ² = V(b1 X1 + ⋯ + bm Xm).   (5)

The Resubstitution Estimate of Risk

The simplest and most obvious approach to risk estimation is to see how well the prediction rule does

2 Cross-validation

for the observations in the training sample. This leads to the resubstitution estimate of risk

R̂_resub(φ) = (1/n) Σ_{i=1}^{n} (y_i − φ(x_i; T))²,   (6)

which is simply the average squared residual for the training observations. The problem with the resubstitution estimate is that it tends to underestimate the risk, often by a substantial margin. Intuitively, the reason for this optimism is that, after all, the model was chosen to fit the training data well.

To illustrate this effect, we generated a training sample T of size n = 100 from the model described in the previous section, estimated regression coefficients by least squares, and constructed the prediction rule (2). The resubstitution estimate of risk is R̂_resub(φ) = 0.64. Because we know the population distribution of (X, Y), we can compute the true risk of the rule: we generate a very large (N = 10,000) test set of new observations from the model and evaluate the average loss incurred when predicting those 10,000 responses from the corresponding predictor vectors. The true risk turns out to be R(φ) = 3.22; the resubstitution estimate underestimates the risk by a factor of 5!

Of course, this result might be a statistical fluke: maybe we just got a bad training sample? To answer this question, we randomly generated 50 training samples of size n = 100, computed the true risk and the resubstitution estimate for each of them, and averaged over training samples. The average resubstitution estimate was 0.84, while the average true risk was 3.48; the result was not a fluke.

The Test Set Estimate of Risk

If we had a large data set at our disposal (a situation not uncommon in this age of automatic, computerized data collection) we could choose not to use all the data for making up our prediction rule. Instead, we could use half the data as the training set T and compute the average loss when making predictions for the test set. The average loss for the test set is an unbiased estimate for the risk; it is not systematically high or systematically low.

Often, however, we do not have an abundance of data, and using some of them just for estimating the risk of the prediction rule seems wasteful, given that we could have obtained a better rule by using all the data for training. This is where cross-validation comes in.

The Cross-validation Estimate of Risk

The basic idea of cross-validation, first suggested by Stone [3], is to use each observation in both roles, as a training observation and as a test observation. Cross-validation is best described in algorithmic form:

Randomly divide the training sample T into k subsets T1, ..., Tk of roughly equal size (the choice of k is discussed below). Let T^(−i) be the training set with the i-th subset removed.

For i = 1 ... k {
    Generate a prediction rule φ(x; T^(−i)) from the training observations not in the i-th subset.
    Compute the total loss L_i when using this rule on the i-th subset:
        L_i = Σ_{j ∈ T_i} (y_j − φ(x_j; T^(−i)))²   (7)
}

Compute the cross-validation estimate of risk

R̂_cv(φ) = (1/n) Σ_{i=1}^{k} L_i.   (8)

In our example, the cross-validation estimate of risk is R̂_cv = 2.87, compared to the true risk R = 3.22 and the resubstitution estimate R̂_resub = 0.64. So the cross-validation estimate is much closer to the true risk than the resubstitution estimate. It still underestimates the risk, but this is a statistical fluke: if we average over 50 training samples, the average cross-validation estimate is 3.98. The cross-validation estimate of risk is not optimistic, because the observations that are predicted are not used in generating the prediction rule. In fact, cross-validation tends to be somewhat pessimistic, partly because it estimates the performance of a prediction rule generated from a training sample of size roughly n(1 − 1/k).

A question which we have not yet addressed is the choice of k. In some situations, such as linear least squares, leave-one-out cross-validation, corresponding to k = n, has been popular, because it can be done in a computationally efficient way. In general, though, the work load increases with k because we have to generate k prediction rules instead of one. Theoretical analysis of cross-validation has proven to be surprisingly difficult, but a general consensus, based mostly
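The k-fold procedure described above can be sketched in a few lines; this is a minimal illustration using a trivial stand-in prediction rule (the training-sample mean) rather than the least-squares rule of the example:

```python
# k-fold cross-validation estimate of squared-error risk (equations 7-8)
# on synthetic data; the mean predictor stands in for phi(x; T).
import random

random.seed(0)
data = [(x, 2.0 * x + random.gauss(0, 1)) for x in (i / 100 for i in range(100))]

def fit_mean(train):
    # A deliberately simple "prediction rule": predict the training mean.
    m = sum(y for _, y in train) / len(train)
    return lambda x: m

def cv_risk(data, k):
    random.shuffle(data)
    folds = [data[i::k] for i in range(k)]        # k roughly equal subsets
    total = 0.0
    for i in range(k):
        train = [p for j, f in enumerate(folds) if j != i for p in f]
        rule = fit_mean(train)                    # rule built from T^(-i)
        total += sum((y - rule(x)) ** 2 for x, y in folds[i])  # loss L_i
    return total / len(data)                      # (1/n) * sum of L_i

r = cv_risk(data, 10)
print(r > 0)
```

Swapping `fit_mean` for an actual regression fit recovers the procedure in the text; the wrapper around the rule is unchanged, which is the practical appeal of cross-validation.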
Using Cross-validation for Model Selection

In a situation like the one in our example, where we have many predictor variables and a small training sample, we can often decrease the prediction error by reducing model complexity. A well-known approach to reducing model complexity in the context of linear least squares (see Least Squares Estimation) is stepwise regression: find the predictor variable that, by itself, best explains Y; find the predictor variable that best explains Y when used together with the variable found in step 1; find the variable that best explains Y when used together with the variables found in steps 1 and 2; and so on. The critical question is when to stop adding more variables.

The dotted curve in Figure 1 shows the resubstitution estimate of risk (the average squared residual) plotted against the number of predictor variables in the model. It is not helpful in choosing a good model size.

[Figure 1: True risk (solid), resubstitution estimate (dotted), and cross-validation estimate (dashed) for a single training sample as a function of the number of predictor variables]

The dashed curve shows the cross-validation estimate of risk. It is minimized for a rule using six predictor variables, suggesting that we should end the process of adding variables after six steps.

The solid curve shows the true risk. It exhibits the same pattern as the cross-validation estimate and is also minimized for a model with six variables. This is not surprising, given that only about 10 of the true regression coefficients are nonvanishing, and some of the remaining ones are small. Adding more predictor variables basically just models noise in the training sample; complex models typically do not generalize well.

Figure 2 shows the corresponding plot, but the curves are obtained by averaging over 50 training samples. Note that the cross-validation estimate of risk tracks the true risk well, especially for the lower ranges of model complexity, which are the practically important ones.

[Figure 2: Average of true risk (solid), resubstitution estimate (dotted), and cross-validation estimate (dashed) over 50 training samples as a function of the number of predictor variables]

Alternatives to Cross-validation

Many alternatives to cross-validation have been suggested. There is Akaike's Information Criterion (AIC), the Bayesian Information Criterion (BIC), Minimum Description Length (MDL), and so on; see [1, Chapter 7] for a survey and references. These criteria all consist of two components: a measure of predictive performance for the training data, and a penalty for model complexity. In principle, this makes sense: more complex models are more prone to modeling the noise in the training data, which makes the resubstitution estimate of risk more optimistic. However, it is often not obvious how to
than those children who do not demonstrate such talent. Active genotype–environment correlation [6] can also arise from oblique cultural transmission: individuals select certain cultural artifacts on the basis of genetically influenced proclivities. The child with unusual talent for the violin will choose to master that musical instrument over the piano.

References

[1] Boyd, R. & Richerson, P.J. (1985). Culture and the Evolutionary Process, University of Chicago Press, Chicago.
[2] Cavalli-Sforza, L.L. & Feldman, M.W. (1981). Cultural Transmission and Evolution: A Quantitative Approach, Princeton University Press, Princeton.
[3] Eaves, L.J. (1976a). A model for sibling effects in man, Heredity 36, 205–214.
[4] Eaves, L.J. (1976b). The effect of cultural transmission on continuous variation, Heredity 37, 41–57.
[5] Eaves, L.J., Eysenck, H.J. & Martin, N.G. (1989). Genes, Culture, and Personality, Academic Press, London.
[6] Hershberger, S.L. (2003). Latent variable models of genotype-environment correlation, Structural Equation Modeling 10(3), 423–434.
[7] Plomin, R., DeFries, J.C. & Loehlin, J.C. (1977). Genotype-environment interaction and correlation in the analysis of human behavior, Psychological Bulletin 84, 309–322.
[8] Scarr, S. & McCartney, K. (1983). How people make their own environments: a theory of genotype → environment effects, Child Development 54, 424–435.

SCOTT L. HERSHBERGER
Data Mining
DAVID J. HAND
Volume 1, pp. 461–465
one evaluates the quality of a model by looking at genomics, espionage, and text processing. The lat-
the probability of obtaining the observed data (or ter is especially important for the web, where search
more extreme data) from the putative model. Unfor- engines are based on these ideas.
tunately, when very large data sets are involved, this In supervised pattern detection, we are told the
probability will almost always be vanishingly small: values of some variables, y, and the aim is to find
even a slight structural difference will translate into the values of other variables, x, which are likely
many data points, and hence be associated with a to describe data points with the specified y values.
very small probability and so high significance. For For example, we might want to find early childhood
this reason, measures of model quality other than the characteristics, x, which are more likely to have a
conventional statistical ones are widely used. Choice high value of a variable y, measuring predisposition
between models is then typically based on the relative to depression.
size of these measures. In pattern discovery, the aim is both to character-
Data Mining 3

Apart from analytic issues such as the above, there are also more mundane difficulties associated with fitting models to large data sets. With a billion data points, even a simple scatterplot can easily reduce to a solid black rectangle: a contour plot is a more useful summary. For such reasons, and also because many of the data sets encountered in data mining involve large numbers of variables, dynamic interactive graphical tools are quite important in certain data mining applications.

Pattern Detection

Whereas models are generally large-scale decompositions of the data, splitting the data into parts, patterns are typically small-scale aspects: in pattern detection, we are interested only in particular small localities of the data, and the rest of the data are irrelevant. Just as there are several aims in model building, so there are several aims in pattern detection.

In pattern matching, we are told the structure of the pattern we are seeking, and the aim is to find occurrences of it in the data. For example, we may look for occurrences of particular behavioral sequences when studying group dynamics, or purchasing patterns when studying shopping activities. A classic example of the latter has been given the name bread dominoes, though it describes a much more general phenomenon. The name derives from shoppers who normally purchase a particular kind of bread. If their preferred kind is not present, they tend to purchase the most similar kind, and, if that is not present, the next most similar kind, and so on, in a sequence similar to a row of dominoes falling one after the other. Pattern matching has been the focus of considerable research in several areas, especially
ize and to locate unusual features in the data. We have already mentioned outliers as an example of this: we need to define what we mean by an outlier, as well as test each point to identify those which are outliers. More generally, we will want to identify those regions of the data space which are associated with an anomalously high local density of data points. For example, in an EEG trace, we may notice repeated episodes, separated by substantial time intervals, in which 'a brief interval of low voltage fast or desynchronized activity [is] followed by a rhythmic (8–12 Hz) synchronized high amplitude discharge . . . The frequency then begins to slow and spikes to clump in clusters, sometimes separated by slow waves . . . Finally, the record is dominated by low amplitude delta activity.' Any signal of a similar length can be encoded so that it corresponds to some point in the data space, and similar signals will correspond to similar points. In particular, this means that whenever a signal similar to that above occurs, it will produce a data point in a specific region of the data space: we will have an anomalously high local density of data points corresponding to such patterns in the trace. In fact, the above quotation is from Toone [8]; it describes the pattern of electrical activity recorded in an EEG trace during a grand mal epileptic seizure.

Pattern discovery is generally a more demanding problem than pattern matching or supervised pattern detection. All three of these have to contend with a potentially massive search space, but, in pattern discovery, one may also be considering a very large number of possible patterns. For this reason, most of the research in the area has focused on developing effective search algorithms. Typically, these use some measure of interestingness of the potential patterns and seek local structures which maximize this. An important example arises in association analysis. This describes the search, in a multivariate categorical data
space, for anomalously high cell frequencies. Often the results are couched in the form of rules: if A and B occur, then C is likely to occur. A classic special case of this is market basket analysis, the analysis of supermarket purchasing patterns. Thus, one might discover that people who buy sun-dried tomatoes also have a higher than usual chance of buying balsamic vinegar. Note that, latent behind this procedure, is the idea that, when one discovers such a pattern, one can use it to manipulate the future purchases of customers. This, of course, does not necessarily follow: merely because purchases of sun-dried tomatoes and balsamic vinegar are correlated does not mean that increasing the purchases of one will increase the purchases of the other. On the other hand, sometimes such patterns can be taken advantage of. Hand and Blunt [3] describe how the discovery of patterns revealing surprising local structures in petrol purchases led to the use of free offers to induce people to spend more.

Exhaustive search is completely infeasible, so various forms of constrained search have been developed. A fundamental example is the a priori algorithm. This is based on the observation that if the pattern AB occurs too few times to be of interest, then there is no point in counting occurrences of patterns which include AB: they must necessarily occur even less often. In fact, this conceals subtleties: it may be that the frequency which is sufficient to be of interest should vary according to the length of the proposed pattern.

Search for data configurations is one aspect of pattern discovery. The other is inference: is the pattern real or could it have occurred by chance? Since there will be many potential patterns thrown up, the analyst faces a particularly vicious version of the multiplicity problem: if each pattern is tested at the 5% level, then a great many false patterns (i.e., not reflecting real underlying structure in the distribution) will be flagged; if one controls the overall level of flagging any false pattern as real at the 5% level, then the test for each pattern will be very weak. Various strategies have been suggested for tackling this, including the use of the false discovery rate now being promoted in the medical statistics literature, and the use of likelihood as a measure of evidence favoring each pattern, rather than formal testing. Empirical Bayesian ideas (see Bayesian Statistics) are also used to borrow strength from the large number of similar potential patterns. The technology of scan statistics has much to offer in this area, although most of its applications to date have been to relatively simple (e.g., mainly one-dimensional) situations.

In general, because the search space and the space of potential patterns are so vast, there will be a tendency to throw up many possible patterns, most of which will be already known or simply uninteresting. One particular study [1] found 20 000 rules and concluded that 'the rules that came out at the top were things that were obvious'.

Data Quality

The issue of data quality is important for all data analytic technologies, but perhaps it is especially so for data mining. One reason is that data mining is typically secondary data analysis. That is, the data will normally have been collected for some purpose other than data mining, and it may not be ideal for mining. For example, details of credit card purchases are collected so that people can be properly billed, and not so that, later, an analyst can pore through these records seeking patterns (or, more immediately, so that an automatic system can examine them for signs of fraud).

All data are potentially subject to distortion, errors, and missing values, and this is probably especially true for large data sets. Various kinds of errors can occur, and a complete taxonomy is probably impossible, though it has been attempted [7]. Important types include the following:

Missing data. Entire records may be missing; for example, if people have a different propensity to enter into a study, so that the study sample is not representative of the overall population. Or individual fields may be missing, so distorting models; for example, in studies of depression, people may be less likely to attend interview sessions when they are having a severe episode (see Missing Data; Dropouts in Longitudinal Studies: Methods of Analysis).

Measurement error. This includes ceiling and floor effects, where the ranges of possible scores are artificially truncated.

Deliberate distortion. This, of course, can be a particular problem in the behavioral sciences, perhaps especially when studying sensitive topics such as sexual practices or alcohol
consumption. Sometimes, in such situations, sophisticated data capture methods, such as randomized response [9], can be used to tackle it.

Clearly, distorted or corrupted data can lead to difficulties when fitting models. In such cases, the optimal approach is to include a model for the data distortion process. For example, Heckman [6] models both the sample selection process and the target regression relationship. This is difficult enough, but, for pattern discovery, the situation is even more difficult. The objective of pattern discovery is the detection of anomalous structures in the data, and data errors and distortion are likely to introduce anomalies. Indeed, experience suggests that, in addition to patterns being false, and, in addition to them being uninteresting, obvious, or well known, most of the remainder are due to data distortion of some kind. Hand et al. [4] give several examples.

Traditional approaches to handling data distortion in large data sets derive from work in survey analysis, and include tools such as automatic edit and imputation. These methods essentially try to eliminate the anomalies before fitting a model, and it is not obvious that they are appropriate or relevant when the aim is pattern detection in data mining. In such cases, they are likely to smooth away the very features for which one is searching. Intensive human involvement seems inescapable.

The Process of Data Mining

Data mining, like statistical data analysis, is a cyclical process. For model fitting, one successively fits a model and examines the quality of the fit using various diagnostics, then refines the model in the light of the fit, and so on. For pattern detection, one typically locates possible patterns, and then searches for others in the light of those that have been found. In both cases, it is not a question of mining the data and being finished. In a sense, one can never finish mining a data set: there is no limit to the possible questions that could be asked.

Massive data sets are now collected almost routinely. In part, this is a consequence of automatic electronic data capture technologies, and, in part, it is a consequence of massive electronic storage facilities. Moreover, the number of such massive data sets is increasing dramatically. It is very clear that there exists the potential for great discoveries in these data sets, but it is equally clear that making those discoveries poses great technical problems. Data mining, as a discipline, may suffer from a backlash, as it becomes apparent that the potential will not be achieved as easily as the media hype accompanying its advent may have led us to believe. However, there is no doubt that such a technology will be needed more and more in our increasingly data-dependent world. Data mining will not go away.

General books on data mining, which describe the tools in computational or mathematical detail, include [2], [5], and [10].

References

[1] Brin, S., Motwani, R., Ullman, J.D. & Tsur, S. (1997). Dynamic itemset counting and implication rules for market basket data, Proceedings of the ACM SIGMOD International Conference on Management of Data, ACM Press, Tucson, pp. 255–264.
[2] Giudici, P. (2003). Applied Data Mining: Statistical Methods for Business and Industry, Wiley, Chichester.
[3] Hand, D.J. & Blunt, G. (2001). Prospecting for gems in credit card data, IMA Journal of Management Mathematics 12, 173–200.
[4] Hand, D.J., Blunt, G., Kelly, M.G. & Adams, N.M. (2000). Data mining for fun and profit, Statistical Science 15, 111–131.
[5] Hand, D.J., Mannila, H. & Smyth, P. (2001). Principles of Data Mining, MIT Press, Cambridge.
[6] Heckman, J. (1976). The common structure of statistical models of truncation, sample selection, and limited dependent variables, and a simple estimator for such models, Annals of Economic and Social Measurement 5, 475–492.
[7] Kim, W., Choi, B.-J., Hong, E.-K., Kim, S.-K. & Lee, D. (2003). A taxonomy of dirty data, Data Mining and Knowledge Discovery 7, 81–99.
[8] Toone, B. (1984). The electroencephalogram, in The Scientific Principles of Psychopathology, P. McGuffin, M.F. Shanks & R.J. Hodgson, eds, Grune and Stratton, London, pp. 36–55.
[9] Warner, S.L. (1965). Randomized response: a survey technique for eliminating evasive answer bias, Journal of the American Statistical Association 60, 63–69.
[10] Witten, I.H. & Frank, E. (2000). Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations, Morgan Kaufmann, San Francisco.

DAVID J. HAND
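The pruning observation behind the a priori algorithm described in this entry (if the pattern AB occurs too few times to be of interest, no pattern containing AB need be counted) lends itself to a short sketch. This is a minimal, illustrative implementation on an invented set of market baskets, not a full association-rule miner:

```python
from itertools import combinations

def frequent_itemsets(transactions, min_count):
    """Level-wise (a priori) search: a (k+1)-item candidate is counted
    only if every one of its k-item subsets was itself frequent."""
    items = sorted({i for t in transactions for i in t})
    frequent = {}
    level = [frozenset([i]) for i in items]
    k = 1
    while level:
        # Count how many transactions contain each candidate itemset.
        counts = {c: sum(c <= t for t in transactions) for c in level}
        survivors = {c: n for c, n in counts.items() if n >= min_count}
        frequent.update(survivors)
        # A priori pruning: build the next level only from candidates
        # all of whose k-item subsets survived at this level.
        keys = list(survivors)
        level = list({a | b for a, b in combinations(keys, 2)
                      if len(a | b) == k + 1
                      and all(frozenset(s) in survivors
                              for s in combinations(a | b, k))})
        k += 1
    return frequent

# Toy market-basket data (invented for illustration only).
baskets = [frozenset(t) for t in (
    {"bread", "milk"},
    {"bread", "vinegar", "tomatoes"},
    {"milk", "vinegar", "tomatoes"},
    {"bread", "milk", "vinegar", "tomatoes"},
    {"bread", "milk", "tomatoes"},
)]
result = frequent_itemsets(baskets, min_count=3)
```

Because {bread, vinegar} occurs in only two of the five baskets, no itemset containing that pair is ever counted, which is exactly the saving the a priori observation buys over exhaustive search.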
de Finetti, Bruno
SANDY LOVIE
Volume 1, pp. 465–466
in
ISBN-13: 978-0-470-86080-9
ISBN-10: 0-470-86080-4
Editors
[Figure 1: decision tree; outcome probabilities 1.0, 0.67, 0.33]
the safer course in terms of future employment? That is the typical risky decision dilemma. Decision making where outcome probabilities are known, such as in a game of roulette, is conventionally described as decision under risk. However, decision making where outcome probabilities are unknown, and the stated probabilities represent subjective estimates, as in our example, is known as decision under uncertainty [6, 20].

The other types of decision making (sequential and dynamic) involve situations where people's decisions are dependent on previous decisions they have made. This entry will only consider strategies for multiattribute and risky decision making, for reasons of space. For the same reason, the entry is specifically concerned with evaluative choice rather than evaluative judgment. There is a subtle difference between the two. Evaluative judgments, as opposed to decisions, are made when a person is asked to evaluate decision alternatives separately rather than choosing one of them. In consumer contexts, for example, people might be asked to say what they might be willing to pay for an item. A substantial body of research has demonstrated that people apply different strategies in evaluative judgment as opposed to evaluative decision tasks [13, 14, 22]. For example, there is evidence that an anchoring-and-adjustment heuristic is often applied to judgments of the prices of consumer items, but not to choices between them [3].

Two Theoretical Frameworks: Structural Models and Process Descriptions

Two major traditions that have shaped contemporary psychological theories of decision making are utility theory, which aims to predict decision behavior, and the information processing approach, which models underlying cognitive processes. It is assumed in the former that the goal of the rational decision maker is to maximize utility or expected utility. Specific variants of utility theory have been proposed as structural models to describe and predict decision behavior (see entry on utility theory). Such models are not strictly decision strategies, since they do not necessarily correspond to the mental processes underlying decisions. Nevertheless, they can be interpreted as strategies, as discussed below.

Cognitive, or information processing, approaches to the study of decision making began to emerge in the early sixties. Herbert Simon [12] argued that human information processing capacities limit the rationality of decision making in important ways. He believed that utility maximization is beyond cognitive limitations in all but the simplest environments. His bounded rationality model assumed that people construct simplified mental representations of decision problems, use simple decision strategies (or heuristics), and often have the goal of making a satisfactory, rather than an optimal, decision. Since Simon's seminal work, much research has adopted his bounded rationality perspective to develop process models describing mental representations, decision strategies, and goals. Payne, Bettman, and Johnson [10, p. 9] defined a decision strategy as 'a sequence of mental and effector (actions on the environment) operations used to transform an initial state of knowledge into a final state of knowledge where the decision maker views the particular decision problem as solved'.

Process models can be categorized as either single or multistage. Single-stage models describe decision strategies in terms of single sequences of elementary information processing operators (eips) that lead to a decision, whereas multistage models describe sequences of eips that form interacting components of more complex strategies [2].

A Taxonomy of Decision Strategies to Resolve Multiattribute Conflicts

Compensatory Strategies: Additive Utility

Decision strategies can be classified as either compensatory or noncompensatory. The former involve trade-offs between the advantages and disadvantages of the various choice alternatives available, whereas noncompensatory strategies do not. Utility models can be recast within an information processing framework as compensatory decision strategies appropriate for certain contexts. Consider the student's dilemma over which course to choose, illustrated in Table 1. The example assumes a short list of three courses: Social Anthropology, Psychology, and Business Studies. It assumes also that the three most important attributes to the student are the interest of the subject, the quality of teaching, and the relevance of the course to her future career. She has been gathering information about these courses and her assessments are shown in the decision matrix on nine-point scales, where 1 = very poor and 9 = excellent.
Decision Making Strategies 3
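The decision matrix just described, and the weighted additive (MAUT) evaluation applied to it in what follows, can be sketched in code. Table 1 itself is not reproduced in this extract, so every rating and weight below is an assumed, illustrative value, chosen only to be consistent with the worked examples in the text:

```python
# Decision matrix in the spirit of Table 1. The table is not reproduced
# in this extract, so all ratings (1-9 scale) and importance weights
# (9 = most important) are assumed values.
ratings = {
    "Social Anthropology": {"interest": 7, "teaching": 6, "career": 3},
    "Psychology":          {"interest": 8, "teaching": 6, "career": 5},
    "Business Studies":    {"interest": 4, "teaching": 5, "career": 9},
}
weights = {"interest": 5, "teaching": 5, "career": 7}

def additive_utility(alternative):
    """MAUT: sum each aspect's rating multiplied by the importance
    weight of its attribute."""
    return sum(ratings[alternative][attr] * w for attr, w in weights.items())

utilities = {alt: additive_utility(alt) for alt in ratings}
# With these assumed numbers, utility(Psychology) =
# (8 * 5) + (6 * 5) + (5 * 7) = 105 units, matching the worked example.
```

Because career-relevance carries the largest weight here, a strong career rating can compensate for weak interest, which is what makes the strategy compensatory.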
In addition, the decision will depend on the relative importance of these attributes to the student: Which is more important, career-relevance or interest? Are they both equally important? The answer will be different for different people.

Utility theorists have proposed a rational strategy to resolve the student's dilemma which takes attribute importance into account. Multiattribute utility theory (MAUT) assumes that each attribute has an importance weight, represented in the bottom row of the decision matrix in the table on a nine-point scale (9 = most important). According to MAUT, the utility of each alternative is calculated by adding the utilities of each aspect multiplied by its importance weight [15]. This weighting mechanism ensures that a less important attribute makes a smaller contribution to the overall utility of the alternative than a more important one. For example, utility(Psychology) = (8 × 5) + (5 × 7) + (6 × 5) = 105 units. Here, the interest value has made a smaller contribution than career relevance, even though the course was rated high in interest value. However, for this student, career relevance is more important, so it contributes more to overall utility. This is a compensatory strategy, since positively evaluated aspects, such as the career-relevance of the Business Studies course, compensate for its less attractive attributes.

Noncompensatory Strategies: Satisficing

Even with our simplified example in Table 1, nine items of information plus three attribute importance weights must be considered. Most real-life decisions involve many more attributes and alternatives and we have limited time and cognitive capacity to cope with all the information we receive. A range of decision strategies we might use in different contexts has been identified. Beach and Mitchell [1] argued that the cognitive effort required to execute a strategy is one of the main determinants of whether it is selected. Building on this, Payne, Bettman, and Johnson [10] developed the more precise effort–accuracy framework, which assumes that people select strategies adaptively, weighing the accuracy of a strategy against the cognitive effort it would involve in a given decision context.

One of the earliest noncompensatory strategies to be proposed was Simon's satisficing principle: choose the first alternative that is at least satisfactory on all important attributes. This saves time and effort because only part of the available information is processed and all that is required is a simple comparison of each aspect with an acceptability criterion. In our example (Table 1), one could work through each alternative, row by row, perhaps starting at the top. Suppose a rating of 5 was considered satisfactory. Working from left to right across the matrix, the first alternative, Social Anthropology, could be rejected on career-relevance and the other two aspects would not need to be examined. The second alternative would be selected because it passes satisficing tests on all three attributes and, consequently, the third alternative would not be considered at all. In this example, satisficing leads to a decision after processing less than half the available information and the conflict across attributes is avoided rather than resolved. Although such a choice process may be good enough in some contexts, satisficing often fails to produce the best decision: the gain in reduced effort is at the expense of a loss of accuracy, as Payne et al. argue.

Direction of Processing: Attribute-based or Alternative-based

Amos Tversky [18] proposed two alternative decision strategies to explain how evaluative decisions can violate one of the basic principles of rationality, transitivity of preference. Intransitive preference occurs when someone prefers A to B, B to C, but C to A. (A preference for a Psychology over a Business Studies course, Business Studies over Anthropology, but Anthropology over Psychology, would clearly need to be sorted out.) Tversky observed similar intransitive cycles of preference with risky decisions, and explained them in terms of attribute-based processing. This is in contrast to alternative-based processing strategies, such as additive utility and satisficing, in which each alternative is processed one at a time. In an attribute-based strategy, alternatives are initially compared on some or all of their important attributes. In Tversky's additive difference strategy, alternatives are compared systematically, two at a time, on each attribute. The differences between them are added together, in a compensatory manner, to arrive at an evaluation as to which is the best. For example, in Table 1, the differences between Business Studies and Psychology on career relevance and interest would cancel
out, and the small difference in quality of teaching would tip the balance in favor of Psychology. Tversky showed that intransitive preference could occur if the evaluations of differences were nonlinear ([18], see also [11]).

A noncompensatory attribute-based strategy could also account for intransitive preferences. In a lexicographic strategy, the decision maker orders the attributes by importance and chooses the best alternative on the most important attribute (for our hypothetical student, Business Studies is chosen because it is best on career-relevance, the most important attribute). If there is a tie on the first attribute, the second attribute is processed in the same way. The process stops as soon as a clear favorite on an attribute is identified. This requires less cognitive effort because usually information on several attributes is not processed (it is noncompensatory since trade-offs are not involved). Tversky argued that if preferences on any attribute form a semi-order, involving intransitive indifference, as opposed to a full rank-order, then intransitivity could result. This is because small differences on a more important attribute may be ignored, with the choice being based on less important attributes, whereas large differences would produce decisions based on more important attributes. He termed this the lexicographic semi-order strategy.

Other important attribute-based strategies include dominance testing [9] and elimination by aspects [19]. With respect to the former, one alternative is said to dominate another if it is at least as good as the other on all attributes and strictly better on at least one of them. In such a case there is no conflict to resolve and the rational decision strategy is obvious: choose the dominant alternative. From a cognitive and a rational perspective it would seem sensible to test for dominance initially, before engaging in deeper processing involving substantially more cognitive effort. There is a lot of empirical evidence that people do this. Turning to the elimination-by-aspects strategy, this is similar to satisficing, except that evaluation is attribute-based. Starting with the most important, alternatives are evaluated on each attribute in turn. Initially, those not passing a satisficing test on the most important attribute are eliminated. This process is repeated with the remaining alternatives on the next most important attribute, and so on. Ideally, the process stops when only one alternative remains. Unfortunately, as is the case for most noncompensatory strategies, the process can fail to resolve a decision problem, either because all alternatives are eliminated or because more than one remains.

Two Stage and Multistage Strategies

It has been found that with complex choices, involving perhaps dozens of alternatives, several decision strategies are often used in combination. For example, all alternatives might initially be screened using the satisficing principle to produce a short-list. Additional screening could apply dominance testing to remove all dominated alternatives, thereby shortening the short-list still further. Following this, a variety of strategies could be employed to evaluate the short-listed alternatives more thoroughly, such as the additive equal weight heuristic. In this strategy, the utilities of attribute values are combined additively in a similar manner to MAUT, but without importance weights, which are assumed to be equal. Job selection procedures based on equal opportunity principles are often explicitly structured in this way, to make the selection process transparent. Applicants can be given a clear explanation as to why they were selected, short-listed, or rejected, and selection panels can justify their decisions to those to whom they are accountable. Beach and his colleagues have developed image theory [1], which assumes a two-stage decision process similar to that described above, that is, a screening stage involving noncompensatory strategies, followed by a more thorough evaluation involving the selection of appropriate compensatory strategies. Other multistage process models describe rather more complex combinations of problem structuring operations and strategies [9, 16, 17].

Strategies for Decisions Involving Risk and Uncertainty

All the strategies discussed so far can be applied to decisions involving risk and uncertainty, if the outcomes and probabilities of the decision tree illustrated in Figure 1 are construed as attributes. However, various specific strategies have been proposed that recognize probability as being fundamentally different to other attributes. The most influential structural models to predict decisions under risk are variants of the subjectively expected utility (SEU) model,
which assume that outcome probabilities are transformed into weights of the subjective values of outcomes; examples are sign- and rank-dependent models such as prospect theory [4] and cumulative prospect theory [21]. In addition to compensatory strategies derived from SEU, various noncompensatory strategies have been suggested for decision under risk. These include the minimax strategy: choose the alternative with the lowest maximum possible loss. In Figure 1 this would lead to the selection of the alternative with the certain outcome, however attractive the possible gain of the other option might be.

Current Issues

It is important to distinguish between strategies for preferential as opposed to judgmental decisions. In the latter, a choice has to be made concerning whether A or B is closer to some criterion state of the world, present or future (e.g., which has the greater population, Paris or London?). Decision strategies for judgmental choice are discussed in the entry on fast and frugal heuristics. Preferential choice, as defined earlier, is fundamentally different since there is no real-world best state of the world against which the accuracy of a decision can be measured. Consequently, decision strategies for preferential choice, though superficially similar, are often profoundly different. In particular, preferential choice often involves some slow and costly, rather than fast and frugal, thinking. Certainly, research within the naturalistic decision making framework [7] has identified several fast and frugal decision heuristics used to make preferential decisions (see Heuristics: Fast and Frugal). For example, the lexicographic strategy described earlier is essentially the same as the fast and frugal 'take the best' heuristic. Similarly, the recognition-primed decisions identified by Klein and his colleagues [5] have the same characteristics, and it has been found that people often switch to strategies that use less information under time pressure [8]. However, evidence related to multistage process theory points to a rather more complex picture. Especially when important decisions are involved, people spend considerable effort seeking and evaluating as much information as possible in order to clearly differentiate the best alternative from the others [16, 17]. One of the main challenges for future research is to model such multistage decision strategies and validate them at the individual level.

References

[1] Beach, L.R. & Mitchell, T.R. (1998). A contingency model for the selection of decision strategies, in Image Theory: Theoretical and Empirical Foundations, L.R. Beach, ed., LEA's Organization and Management Series, Lawrence Erlbaum, Mahwah, pp. 145–158.
[2] Huber, O. (1989). Information-processing operators in decision making, in Process and Structure in Human Decision Making, H. Montgomery & O. Svenson, eds, Wiley, Chichester.
[3] Kahneman, D. & Tversky, A. (1974). Judgment under uncertainty: heuristics and biases, Science 185(4157), 1124–1131.
[4] Kahneman, D. & Tversky, A. (1979). Prospect theory: an analysis of decision under risk, Econometrica 47, 263–291.
[5] Klein, G. (1989). Recognition-primed decisions, Advances in Man-Machine Systems Research 5, 47–92.
[6] Knight, F.H. (1921). Risk, Uncertainty and Profit, Macmillan, London.
[7] Lipshitz, R., Klein, G., Orasanu, J. & Salas, E. (2001). Taking stock of naturalistic decision making, Journal of Behavioral Decision Making 14, 331–352.
[8] Maule, A.J. & Edland, A.C. (1997). The effects of time pressure on human judgement and decision making, in Decision Making: Cognitive Models and Explanations, R. Ranyard, W.R. Crozier & O. Svenson, eds, Routledge, London.
[9] Montgomery, H. (1983). Decision rules and the search for a dominance structure: towards a process model of decision making, in Analyzing and Aiding Decision Processes, P.D. Humphries, O. Svenson & A. Vari, eds, North-Holland, Amsterdam.
[10] Payne, J.W., Bettman, J.R. & Johnson, E.J. (1993). The Adaptive Decision Maker, Cambridge University Press, New York.
[11] Ranyard, R. (1982). Binary choice patterns and reasons given for simple risky choice, Acta Psychologica 52, 125–135.
[12] Simon, H.A. (1957). Models of Man, John Wiley, New York.
[13] Slovic, P. & Lichtenstein, S. (1971). Reversals of preference between bids and choices in gambling decisions, Journal of Experimental Psychology 89(1), 46–55.
[14] Slovic, P. & Lichtenstein, S. (1973). Response-induced reversals of preference in gambling: an extended replication in Las Vegas, Journal of Experimental Psychology 101(1), 16–20.
[15] Srivastava, J., Connolly, T. & Beach, L. (1995). Do ranks suffice? A comparison of alternative weighting approaches in value elicitation, Organizational Behavior and Human Decision Processes 63, 112–116.
[16] Svenson, O. (1992). Differentiation and consolidation theory of human decision making: a frame of reference for the study of pre- and post-decision processes, Acta Psychologica 80, 143–168.
[17] Svenson, O. (1996). Decision making and the search for fundamental psychological realities: what can be learned from a process perspective?, Organizational Behavior and Human Decision Processes 65, 252–267.
[18] Tversky, A. (1969). Intransitivity of preferences, Psychological Review 76, 31–48.
[19] Tversky, A. (1972). Elimination by aspects: a theory of choice, Psychological Review 79(4), 281–299.
[20] Tversky, A. & Fox, C.R. (1995). Weighing risk and uncertainty, Psychological Review 102, 269–283.
[21] Tversky, A. & Kahneman, D. (1992). Advances in prospect theory: cumulative representations of uncertainty, Journal of Risk and Uncertainty 5, 297–323.
[22] Tversky, A., Sattath, S. & Slovic, P. (1988). Contingent weighting in judgment and choice, Psychological Review 95(3), 371–384.

ROB RANYARD
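Two of the noncompensatory strategies discussed in this entry, satisficing and the lexicographic rule, can also be sketched in code. Table 1 is not reproduced in this extract, so the ratings below are assumed, illustrative values; with them, the sketch reproduces the outcomes the text describes (satisficing selects Psychology, the lexicographic rule selects Business Studies):

```python
# Assumed ratings in the spirit of Table 1 (illustrative values only).
ratings = {  # alternative -> attribute -> rating on a 1-9 scale
    "Social Anthropology": {"career": 3, "interest": 7, "teaching": 6},
    "Psychology":          {"career": 5, "interest": 8, "teaching": 6},
    "Business Studies":    {"career": 9, "interest": 4, "teaching": 5},
}

def satisfice(alternatives, attributes, threshold=5):
    """Pick the first alternative that is satisfactory on every attribute;
    processing stops as soon as one aspect fails the acceptability test."""
    for alt in alternatives:
        if all(ratings[alt][a] >= threshold for a in attributes):
            return alt
    return None  # satisficing can fail to choose anything

def lexicographic(alternatives, attributes_by_importance):
    """Choose the best alternative on the most important attribute,
    using less important attributes only to break ties; if ties persist,
    the first remaining alternative is returned."""
    remaining = list(alternatives)
    for attr in attributes_by_importance:
        best = max(ratings[alt][attr] for alt in remaining)
        remaining = [alt for alt in remaining if ratings[alt][attr] == best]
        if len(remaining) == 1:
            break
    return remaining[0]

courses = ["Social Anthropology", "Psychology", "Business Studies"]
satisfice(courses, ["career", "interest", "teaching"])      # -> "Psychology"
lexicographic(courses, ["career", "interest", "teaching"])  # -> "Business Studies"
```

Note how each strategy ignores information: satisficing never looks at the third alternative, and the lexicographic rule never looks past career-relevance, which is exactly the effort saving that motivates noncompensatory processing.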
Deductive Reasoning and Statistical Inference
JAMES CUSSENS
Volume 1, pp. 472–475
Indifference permits us to infer that f(p) = 1, so that the prior is uniform and P(a ≤ p ≤ b) = b − a for any a, b such that 0 ≤ a ≤ b ≤ 1. However, if we are indifferent about p, then surely we should be indifferent about p² also. Applying the Principle of Indifference to p² yields the prior f(p²) = 1, so that P(a ≤ p² ≤ b) = b − a. But the uniform prior on p implies that the density of p² is 1/(2√(p²)) = 1/(2p), so that P(a ≤ p² ≤ b) = √b − √a. We have logically inferred two inconsistent priors. Where the prior is over a single variable, it is possible to avoid this sort of inconsistency by adopting not a uniform distribution but Jeffreys' noninformative distribution [2, p. 53]. However, when extended to multiparameter distributions, Jeffreys' distributions are more problematic.

None of the vast literature attempting to rescue the Principle of Indifference from problems such as these is successful, because the fundamental problem with the Principle is that it fallaciously claims to generate knowledge (in the form of a prior probability distribution) from ignorance. As Keynes memorably noted:

    No other formula in the alchemy of logic has exerted more astonishing powers. For it has established the existence of God from total ignorance, and it has measured with numerical precision the probability that the sun will rise tomorrow. [6, p. 89], quoted in [4, p. 45]

Prior distributions can, in fact, only be deduced from other distributions. If the coin being tossed were selected at random from a set of coins all with known probabilities for heads, then a prior over possible values of p can easily be deduced. In such cases, a Bayesian approach is entirely uncontroversial. Classical statisticians will happily use this prior and any observed coin tosses to compute a posterior, because this prior is objective (although, in truth, it is only as objective as the probabilities from which it was deduced). In any case, such cases are rare and rest on the optimistic assumption that we somehow know certain probabilities ahead of time. In most cases, the formulation of the prior is a mathematical incarnation of exogenously given assumptions about the likely whereabouts of p. If we adopt it as the true prior, this amounts to positing these assumptions to be true: a nondeductive step. If we put it forward as merely an expression of our prior beliefs, this introduces an element of subjectivity, and it is this to which the Classical view most strongly objects.

Consider now the end result of Bayesian inference: the posterior distribution. A pure Bayesian approach views the posterior as the end result of Bayesian inference:

    Finally, never forget that the goal of Bayesian computation is not the posterior mode, not the posterior mean, but a representation of the entire distribution, or summaries of that distribution such as 95% intervals for estimands of interest. [2, p. 301] (italics in the original)

However, the posterior is often used in a semi-Bayesian manner to compute a point estimate of p; the mean of the posterior is a favourite choice. But, as with all point estimates, acting as if some estimated value were the true value is highly nondeductive.

To sum up, the principal objections to the Bayesian approach are that it simply does not tell us (a) what the right prior is and (b) what to do with the posterior. But as Howson and Urbach point out, much the same criticism can be made of deductive inference:

    Deductive logic is the theory (though it might be more accurate to say theories) of deductively valid inferences from premisses whose truth-values are exogenously given. Inductive logic (which is how we regard subjective Bayesian theory) is the theory of inference from some exogenously given data and prior distribution of belief to a posterior distribution. . . . neither logic allows freedom to individual discretion: both are quite impersonal and objective. [4, pp. 289–90]

The Classical approach to statistical inference is closely tied to the doctrine of falsificationism, which asserts that only statements that can be refuted by data are scientific. Note that the refutation of a scientific hypothesis (all swans are white) by a single counterexample (there is a black swan) is entirely deductive. Statistical statements are not falsifiable. For example, a heavy preponderance of observed tails is logically consistent with the hypothesis of a fair coin (p = 0.5). However, such an event is at least improbable if p = 0.5. The basic form of Classical statistical inference is to infer that a statistical hypothesis (the null hypothesis) is refuted if we observe data that is sufficiently improbable (5% is a favourite level) on the assumption that the null hypothesis is true. In this case, the null hypothesis is regarded as practically refuted.

It is important to realize that, in the Classical view, improbable data does not make the null hypothesis
improbable. There is no probability attached to the truth of the hypothesis, since this is only possible in the Bayesian view. (We will assume throughout that we are not considering those rare cases where there exist so-called objective priors.) What then does it mean, on the Classical view, for a null hypothesis to be rejected by improbable data? Popper justified such a rejection on the grounds that it amounted to a methodological decision to regard highly improbable events as ruled out, as prohibited [7, p. 191]. Such a decision amounts to adopting the nondeductive inference rule P(e) < ε ⊢ ¬e, for some sufficiently small ε. In English, this inference rule says: from P(e) < ε, infer that e is not the case. With this nondeductive rule it follows that if h0 is the null hypothesis and h0 ⊢ P(e) < ε, then the observation of e falsifies h0 in the normal way.

A problem with this approach (due to Fisher [1]) concerns the choice of e. For example, suppose the following sequence of 10 coin tosses were observed: e = H, H, T, H, H, H, H, H, H, T. If h0 states that the coin is fair, then h0 ⊢ P(e) = (1/2)¹⁰ < 0.001. It would be absurd to reject h0 on this basis, since any sequence of 10 coin tosses is equally improbable. This shows that care must be taken with the word improbable: with a large enough space of possible outcomes and a distribution that is not too skewed, something improbable is bound to happen. If we could sample a point from a continuous distribution (such as the Normal distribution), then an event of probability zero would be guaranteed to occur!

Since e has been observed, we have also observed the events e′ = (r = 8) and e″ = (r ≥ 8), where r is the number of heads. Events such as e″ are of the sort normally used in statistical testing; they assert that a test statistic (r) has been found to lie in a critical region (r ≥ 8). h0 ⊢ P(e″) = 5.47% (to 3 significant figures), so if we choose to test h0 with e″ as opposed to e, h0 is not rejected at significance level 5%, although it would be, had we chosen 6%. e″ is a more sensible choice than e, but there is no justified way of choosing the right combination of test statistic and critical region.

In the modern (Neyman–Pearson) version of the Classical approach (see Neyman–Pearson Inference) a null hypothesis is compared to competing, alternative hypotheses. For example, suppose there were only one competing hypothesis h1; then h0 would be rejected if P(e|h0)/P(e|h1) ≤ k, where k is determined by the desired significance level. This turns out to solve the problem of choosing e, but at the expense of a further move away from falsificationism. In the standard falsificationist approach, a black swan refutes "all swans are white" irrespective of any other competing hypotheses.

Having dealt with these somewhat technical matters, let us return to the deeper question of how to interpret the rejection of a hypothesis at significance level, say, 5%. Clearly, this is not straight logical refutation, nor (since it is non-Bayesian) does it even say anything about the probability that the hypothesis is false. In the literature, the nearest we get to an explanation is that one can act as if the hypothesis were refuted. This is generally justified on the grounds that such a decision will only rarely be mistaken: if we repeated the experiment many times, producing varying data due to sampling variation, and applied the same significance test, then the hypothesis would not be erroneously rejected too often. But, in fact, it is only possible to infer that erroneous rejection would probably not occur often: it is possible (albeit unlikely) to have erroneous rejection every time. Also note that such a justification appeals to what would (or, more properly, probably would) happen if imaginary experiments were conducted. This is in sharp contrast to standard falsificationism, which, along with the Bayesian view, makes use only of the data we actually have, not any imaginary data. Another, more practical, problem is that rejected hypotheses cannot be resurrected if strongly supportive data is collected later on, or if other additional information is found. This problem is not present in the Bayesian case, since new information can be combined with an old posterior to produce a new posterior, using Bayes' theorem as normal. Finally, notice that confidence is invested not in any particular hypothesis rejection, but in the process that leads to rejection. Separating out confidence in the process of inference from confidence in the results of inference marks out Classical statistical inference from both Bayesian inference and deductive inference.

To finish this section on Classical statistical inference, note that the basic inferential features of hypothesis testing also apply to Classical estimation of parameters. The standard Classical approach is to use an estimator, a function mapping the data to an estimate of the unknown parameter. For example, to estimate a probability p, the proportion of successes r/n is used. If our data were e above, then the estimate for p would be 8/10 = 0.8. Since we cannot ask questions about the likely accuracy of any particular
estimate, the Classical focus is on the distribution of estimates produced by a fixed estimator, determined by the distribution of the data. An analysis of this distribution leads to confidence intervals. For example, we might have a 95% confidence interval of the form (p̂ − δ, p̂ + δ), where p̂ is the estimate. It is important to realize that such a confidence interval does not mean that the true value p lies within the interval (p̂ − δ, p̂ + δ) with probability 95%, since this would amount to treating p as a random variable. See the entry on confidence intervals for further details.

References

[1] Fisher, R.A. (1958). Statistical Methods for Research Workers, 13th Edition, Oliver and Boyd, Edinburgh. First published in 1925.
[2] Gelman, A., Carlin, J.B., Stern, H.S. & Rubin, D.B. (1995). Bayesian Data Analysis, Chapman & Hall, London.
[3] Hacking, I. (1965). Logic of Statistical Inference, Cambridge University Press, Cambridge.
[4] Howson, C. & Urbach, P. (1989). Scientific Reasoning: The Bayesian Approach, Open Court, La Salle, Illinois.
[5] Hume, D. (1777). An Enquiry Concerning Human Understanding, Selby-Bigge, London.
[6] Keynes, J.M. (1921). A Treatise on Probability, Macmillan, London.
[7] Popper, K.R. (1959). The Logic of Scientific Discovery, Hutchinson, London. Translation of Logik der Forschung, 1934.
[8] Popper, K.R. (1983). Realism and the Aim of Science, Hutchinson, London. Written in 1956.

JAMES CUSSENS
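The probabilities in the coin-tossing example above are easy to check numerically. The short sketch below (variable names are ours; Python is used purely for illustration) reproduces the two figures quoted: the exact sequence e has probability (1/2)^10 < 0.001, while the test event r ≥ 8 has probability 5.47%, falling between the 5% and 6% significance levels.

```python
from math import comb

# Probability of the exact sequence e = H,H,T,H,H,H,H,H,H,T under h0 (fair coin).
n = 10
p_seq = 0.5 ** n                               # (1/2)^10 = 0.0009765625 < 0.001

# Probability of the event e'' = (r >= 8), where r is the number of heads in e.
p_tail = sum(comb(n, r) for r in range(8, n + 1)) / 2 ** n

print(p_seq)                                   # 0.0009765625
print(round(100 * p_tail, 2))                  # 5.47 (%): kept at 5%, rejected at 6%
```

Since 0.05 < P(e″) < 0.06, the same data reject h0 at the 6% level but not at the 5% level, which is exactly the sensitivity to the choice of test event that the article describes.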
DeFries–Fulker Analysis
RICHARD RENDE AND CHERYL SLOMKOWSKI
Volume 1, pp. 475–477
with variations in the computation of group-shared environment [4]. This method may be used with both selected and unselected samples and is assumed to be robust to violations of normality in the distribution.

After determining the effect sizes of heritability and shared environment from both the basic and the augmented model, the individual difference and group parameters may be compared by contrasting confidence intervals for each. For example, it has been demonstrated that shared environmental influences are notable for elevated levels of depressive symptoms in adolescence but not for the full range of individual differences (e.g., [4] and [8]).

A model-fitting implementation of the DF method has been introduced, which preserves the function of the regression approach but allows for particular advantages [6]. Fundamental advances in this implementation include the analysis of pairs rather than individuals (eliminating the need for double entry of twin pairs and the requisite correction of standard error terms) and the facility to include opposite-sex pairs in a sex-limited analysis. As described in detail in Purcell and Sham [6], the fundamental analytic strategy remains the same, as each observation (i.e., pair) contains a zygosity coefficient, a continuous trait score for each member of the pair, and proband status for each member of the pair (i.e., a dummy variable indicating whether they have exceeded a particular cutoff). Other details of the analytic procedure can be found in Purcell and Sham [6], including a script for conducting this type of analysis in the statistical program MX.

References

[2] DeFries, J.C. & Fulker, D.W. (1985). Multiple regression analysis of twin data, Behavior Genetics 15, 467–473.
[3] DeFries, J.C. & Fulker, D.W. (1988). Multiple regression analysis of twin data: etiology of deviant scores versus individual differences, Acta Geneticae Medicae et Gemellologiae (Roma) 37, 205–216.
[4] Eley, T.C. (1997). Depressive symptoms in children and adolescents: etiological links between normality and abnormality: a research note, Journal of Child Psychology and Psychiatry 38, 861–865.
[5] Plomin, R. & Rende, R. (1991). Human behavioral genetics, Annual Review of Psychology 42, 161–190.
[6] Purcell, S. & Sham, P.D. (2003). A model-fitting implementation of the DeFries–Fulker model for selected twin data, Behavior Genetics 33, 271–278.
[7] Rende, R. (1999). Adaptive and maladaptive pathways in development: a quantitative genetic perspective, in On the Way to Individuality: Current Methodological Issues in Behavioral Genetics, LaBuda, M., Grigorenko, E., Ravich-Serbo, I. & Scarr, S., eds, Nova Science Publishers, New York.
[8] Rende, R., Plomin, R., Reiss, D. & Hetherington, E.M. (1993). Genetic and environmental influences on depressive symptoms in adolescence: etiology of individual differences and extreme scores, Journal of Child Psychology and Psychiatry 34, 1387–1398.
[9] Rodgers, J.L. & McGue, M. (1994). A simple algebraic demonstration of the validity of DeFries–Fulker analysis in unselected samples with multiple kinship levels, Behavior Genetics 24, 259–262.
[10] Slomkowski, C., Rende, R., Novak, S., Lloyd-Richardson, E. & Niaura, R. (in press). Sibling effects of smoking in adolescence: evidence for social influence in a genetically-informative design, Addiction.

RICHARD RENDE AND CHERYL SLOMKOWSKI
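The regression at the heart of the DF method can be sketched in a few lines. The basic DF model of DeFries and Fulker [2] regresses the co-twin's score C on the proband's score P and the coefficient of relationship R (1.0 for MZ pairs, 0.5 for DZ pairs), C = b0 + b1·P + b2·R; the numbers below are made up, chosen so that ordinary least squares recovers the coefficients exactly.

```python
import numpy as np

# Made-up coefficients and proband scores; R encodes zygosity (1.0 MZ, 0.5 DZ).
b_true = np.array([0.2, 0.5, 0.3])             # b0 (intercept), b1, b2
P = np.array([1.0, 2.0, 3.0, 1.5, 2.5, 3.5])   # proband scores
R = np.array([1.0, 1.0, 1.0, 0.5, 0.5, 0.5])   # coefficient of relationship

X = np.column_stack([np.ones_like(P), P, R])   # design matrix [1, P, R]
C = X @ b_true                                  # noiseless co-twin scores

b_hat, *_ = np.linalg.lstsq(X, C, rcond=None)   # ordinary least squares fit
```

In a real analysis C would be observed co-twin scores with error, and b2, the partial regression on R, carries the genetic interpretation; the selected-sample and pairwise model-fitting refinements discussed above [6] build on this same design matrix.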
Robson [3] proposes to restrict the interaction between the researcher and the respondent as much as possible. In his view, this can be done by making use of taped instruction or of the automated presentation of material.

References

[4] Rosenzweig, S. (1933). The experimental situation as a psychological problem, Psychological Review 40, 337–354.
[5] Rosnow, R.L. (2002). The nature and role of demand characteristics in scientific inquiry, Prevention & Treatment 5, Retrieved July 16, 2003, from http://www.Journals.apa.org/prevention/volume5/pre0050037c.html
worked example from a school-based smoking prevention trial. As part of the Four Group Comparison Study [10], children were randomly assigned by school to either one of three smoking prevention programs or to a control group. Randomization by school was adopted for this trial since allocation at the individual level would have been impractical and could also have increased the possibility that children in an intervention group could influence the behavior of control group children at the same school.

Unfortunately, the selected programs in this trial proved ineffective in reducing tobacco use among adolescents. However, suppose investigators decided to design a new trial focusing specifically on p, the proportion of children using smokeless tobacco. The corresponding design effect is given by

    deff = Var_SR(p̂) / Var_IR(p̂),   (2)

where p̂ denotes the sample estimate of p, Var_SR(p̂) denotes the variance of p̂ assuming random assignment by school, and Var_IR(p̂) denotes the variance of p̂ assuming individual random assignment. One can show [3] that in this case deff = 1 + (m̄ − 1)ρ, where m̄ denotes the average number of students per school and ρ is the intraclass correlation coefficient measuring the similarity in response between any two students in the same school. With the additional assumption that ρ is nonnegative, this parameter may also be interpreted as the proportion of overall variation in response that can be accounted for by between-school variation.

Data are provided in Table 1 for the rates of smokeless tobacco use among the students from the 12 control group schools randomized in the Four Group Comparison Study [3, 10], where the average number of students per school is given by m̄ = 1479/12 = 123.25. Therefore, the sample estimate of ρ may be calculated [3] as

    ρ̂ = 1 − [Σ_{j=1}^{k} m_j p̂_j (1 − p̂_j)] / [k(m̄ − 1) p̂ (1 − p̂)]
       = 1 − ([152 × 0.0132(1 − 0.0132)] + ··· + [125 × 0.1280(1 − 0.1280)]) / [12(123.25 − 1)(0.0615)(1 − 0.0615)]
       = 0.013324,   (3)

where p̂_j = y_j/m_j and p̂ = Y/M = 91/1479 = 0.0615 denote the prevalence of smokeless tobacco use in the jth school and among all 12 schools, respectively.

Table 1   Number and proportion of children who report using smokeless tobacco after 2 years of follow-up in each of 12 schools [3, 10]

    j         y_j      m_j       p̂_j
    1           2      152       0.0132
    2           3      174       0.0172
    3           1       55       0.0182
    4           3       74       0.0405
    5           5      103       0.0485
    6          12      207       0.0580
    7           7      104       0.0673
    8           7      102       0.0686
    9           6       83       0.0723
    10          6       75       0.0800
    11         23      225       0.1022
    k = 12     16      125       0.1280
    Total    Y = 91   M = 1479   p̂ = Y/M = 0.0615

The estimated design effect is then seen to be given approximately by

    deff ≈ 1 + (123.25 − 1)(0.013) = 2.6,   (4)

indicating that random assignment by school would require more than twice the number of students as compared to an individually randomized trial having the same power. Note that the design effect is a function of both the degree of intraclass correlation and the average number of students per school, so that even values of ρ close to zero can dramatically inflate the required sample size when m̄ is large.

Now suppose investigators believe that their new intervention can lower the rate of smokeless tobacco use from six percent to three percent. Then, using the standard sample size formula for an individually randomized trial [3], approximately 746 students would be required in each of the two groups at a 5% two-sided significance level with 80% power. However, this result needs to be multiplied by deff = 2.6, implying that at least 16 schools need to be randomized per intervention group, assuming approximately 123 students per school (746 × 2.6/123.25 = 15.7). In practice, investigators should consider a range of plausible values for the design effect, as it is frequently estimated with low precision.
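The arithmetic in equations (3) and (4) can be reproduced directly from Table 1; the sketch below (variable names are ours) recomputes the intraclass correlation, the design effect, and the implied number of schools per arm.

```python
import math

# (y_j, m_j) pairs from Table 1: smokeless-tobacco users and students per school.
schools = [(2, 152), (3, 174), (1, 55), (3, 74), (5, 103), (12, 207),
           (7, 104), (7, 102), (6, 83), (6, 75), (23, 225), (16, 125)]

k = len(schools)                               # 12 schools
Y = sum(y for y, m in schools)                 # 91 users in total
M = sum(m for y, m in schools)                 # 1479 students in total
m_bar = M / k                                  # 123.25 students per school
p = Y / M                                      # 0.0615 overall prevalence

# Intraclass correlation, equation (3)
num = sum(m * (y / m) * (1 - y / m) for y, m in schools)
rho = 1 - num / (k * (m_bar - 1) * p * (1 - p))    # 0.013324

# Design effect, equation (4)
deff = 1 + (m_bar - 1) * rho                        # about 2.63

# Schools per arm, using the 746 students per arm quoted in the text
schools_per_arm = math.ceil(746 * deff / m_bar)     # 16
```

Carrying the unrounded ρ̂ through gives deff ≈ 2.63 rather than the rounded 2.6 used in equation (4); the required number of schools per arm comes out at 16 either way.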
Design Effects
References

[2] Connelly, L.B. (2003). Balancing the number and size of sites: an economic approach to the optimal design of cluster samples, Controlled Clinical Trials 24, 544–559.
[3] Donner, A. & Klar, N. (2000). Design and Analysis of Cluster Randomization Trials in Health Research, Arnold, London.
[4] Gabler, S., Haeder, S. & Lahiri, P. (1999). A model based justification of Kish's formula for design effects for weighting and clustering, Survey Methodology 25, 105–106.
[5] Hansen, M.H. & Hurwitz, W.N. (1942). Relative efficiencies of various sampling units in population inquiries, Journal of the American Statistical Association 37, 89–94.
[6] Kish, L. (1965). Survey Sampling, John Wiley & Sons, New York.
[7] Kish, L. (1982). Design effect, in Encyclopedia of Statistical Sciences, Vol. 2, John Wiley & Sons, New York, pp. 347–348.
[8] LaVange, L.M., Stearns, S.C., Lafata, J.E., Koch, G.G. & Shah, B.V. (1996). Innovative strategies using SUDAAN for analysis of health surveys with complex samples, Statistical Methods in Medical Research 5, 311–329.
[9] Lohr, S.L. (1999). Sampling: Design and Analysis, Duxbury Press, Pacific Grove.
[10] Murray, D.M., Perry, C.L., Griffin, G., Harty, K.C., Jacobs Jr, D.R., Schmid, L., Daly, K. & Pallonen, U. (1992). Results from a statewide approach to adolescent tobacco use prevention, Preventive Medicine 21, 449–472.
[11] Neuhaus, J.M. & Segal, M.R. (1993). Design effects for binary regression models fit to dependent data, Statistics in Medicine 12, 1259–1268.
[12] Rao, J.N.K. & Scott, A.J. (1992). A simple method for the analysis of clustered binary data, Biometrics 48, 577–585.
[13] Rust, K. & Frankel, M. (2003). Issues in inference from survey data, in Leslie Kish, Selected Papers, S. Heeringa & G. Kalton, eds, John Wiley & Sons, New York, pp. 125–129.
[14] Williams, D.A. (1982). Extra-binomial variation in logistic linear models, Applied Statistics 31, 144–148.
[15] Xie, T. & Waksman, J. (2003). Design and sample size estimation in clinical trials with clustered survival times as the primary endpoint, Statistics in Medicine 22, 2835–2846.

(See also Survey Sampling Procedures)

NEIL KLAR AND ALLAN DONNER
Development of Statistical Theory in the 20th Century
PETER M. LEE
Volume 1, pp. 483–485
testing (see Neyman–Pearson Inference). In the simplest case, we are interested in knowing whether an unknown parameter θ takes the value θ0 or the value θ1. The first possibility is referred to as the null hypothesis H0 and the second as the alternative hypothesis H1. They argue that if we are to collect data whose distribution depends on the value of θ, then we should decide on a rejection region R, which is such that we reject the null hypothesis if and only if the observations fall in the rejection region. Naturally, we want to minimize the probability α = P(R|θ0) of rejecting the null hypothesis when it is true (an error of the first kind), whereas we want to maximize the probability 1 − β = P(R|θ1) of rejecting it when it is false (thus avoiding an error of the second kind). Since increasing R decreases β but increases α, a compromise is necessary. Neyman and Pearson accordingly recommended restricting attention to tests for which α was less than some preassigned value called the size (for example 5%), and then choosing among such regions one with a maximum value of the power 1 − β. Fisher, however, firmly rebutted the view that the purpose of a test of significance was to decide between two or more hypotheses.

Neyman and Pearson went on to develop a theory of confidence intervals. This can be exemplified by the case of a sample of independently normally distributed random variables (as above), when they argued that if the absolute value of a t statistic on n − 1 degrees of freedom is less than t_{n−1,0.95} with 95% probability, then the random interval (m − s·t_{n−1,0.95}, m + s·t_{n−1,0.95}) will contain the true, unknown value of the mean with probability 0.95. Incautious users of the method are inclined to speak as if the true value were random and lay within that interval with 95% probability, but from the Neyman–Pearson standpoint, this is unacceptable, although under certain circumstances, it may be acceptable to Bayesians as discussed below.

Later in the twentieth century, the Bayesian viewpoint gained adherents (see Bayesian Statistics). While there were precursors, the most influential early proponents were Bruno de Finetti (1906–1985) and Leonard Jimmie Savage (1917–1971). Conceptually, Bayesian methodology is simple. It relies on Bayes' theorem, that P(Hi|E) ∝ P(Hi)P(E|Hi), where the Hi constitute a set of possible hypotheses and E a body of evidence. In the case where the prior probabilities P(Hi) are all equal, this is essentially equivalent to the method of inverse probability, which was popular at the start of the twentieth century. Thus, if we assume that x1, . . . , xn are independently normally distributed with unknown mean θ and known variance σ², then, assuming that all values of θ are equally likely a priori, it is easy to deduce that a posteriori the distribution of θ is normal with mean m (the sample mean) and variance σ²/n. In the case where σ² is unknown, another conventional choice of prior beliefs for σ², this time uniform in its logarithm, leads to the t Test, which Student and others had found by classical methods. Nevertheless, particularly in the continuous case, there are considerable difficulties in taking the standard conventional choices of prior, and these difficulties are much worse in several dimensions. Controversy over Bayesian methods has centered mainly on the choice of prior probabilities, and while there have been attempts to find an objective choice for prior probabilities (see, e.g., [5, Section 3.10]), it is now common to accept that the prior probabilities are chosen subjectively. This has led some statisticians and scientists to reject Bayesian methods out of hand. However, with the growth of Markov Chain Monte Carlo Methods, which have made Bayesian methods simple to use and made some previously intractable problems amenable to solution, they are now gaining in popularity.

For some purposes, formal statistical theory has declined in importance with the growth of Exploratory Data Analysis as advocated by John W. Tukey (1915–2000) and of modern graphical methods as developed by workers such as William S. Cleveland (b. 1943), but for many problems, the debate about the foundations of statistical inference remains lively and relevant.

References

[1] Fisher, R.A. (1925). Statistical Methods for Research Workers, Oliver & Boyd, Edinburgh.
[2] Fisher, R.A. (1935). The Design of Experiments, Oliver & Boyd, Edinburgh.
[3] Fisher, R.A. (1935). The fiducial argument in statistics, Annals of Eugenics 6, 391–398.
[4] Fisher, R.A. (1956). Statistical Methods and Scientific Inference, Oliver & Boyd, Edinburgh.
[5] Jeffreys, H. (1939). Theory of Probability, Oxford University Press, Oxford.
Further Reading

Berry, D.A. (1996). Statistics: A Bayesian Perspective, Duxbury, Belmont.
Box, J.F. (1978). R.A. Fisher: The Life of a Scientist, Wiley, New York.
Cleveland, W.S. (1994). The Elements of Graphing Data, Revised Ed., AT&T Bell Laboratories, Murray Hill.
Gilks, W.R., Richardson, S. & Spiegelhalter, D.J. (1996). Markov Chain Monte Carlo in Practice, Chapman & Hall, London.
Pearson, E.S. (1966). The Neyman–Pearson story: 1926–34. Historical sidelights on an episode in Anglo-Polish collaboration, in F.N. David, ed., Festschrift for J. Neyman, Wiley, New York; reprinted on pages 455–477 of E.S. Pearson & M.G. Kendall, eds, (1970). Studies in the History of Statistics and Probability, Griffin Publishing, London.
Porter, T.M. (2004). Karl Pearson: The Scientific Life in a Statistical Age, Princeton University Press, Princeton.
Reid, C. (1982). Neyman from Life, Springer-Verlag, New York.
Salsburg, D. (2001). The Lady Tasting Tea: How Statistics Revolutionized Science in the Twentieth Century, W.H. Freeman, New York.
Savage, L.J. (1962). The Foundations of Statistical Inference; a Discussion Opened by Professor L.J. Savage, Methuen, London.
Tukey, J.W. (1977). Exploratory Data Analysis, Addison-Wesley, Reading.

PETER M. LEE
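The flat-prior normal example in the entry above can be verified numerically: with x1, . . . , xn independently N(θ, σ²), σ known, and a uniform prior on θ, the posterior is N(m, σ²/n). The sketch below checks this with a grid approximation; the data values are invented for illustration.

```python
import numpy as np

x = np.array([4.1, 5.3, 4.8, 5.9, 4.9])     # made-up observations
sigma = 1.0                                  # known standard deviation
n, m = len(x), x.mean()                      # sample size and sample mean (5.0)

theta = np.linspace(m - 3, m + 3, 6001)      # grid of candidate means
log_post = -n * (theta - m) ** 2 / (2 * sigma ** 2)  # flat prior: posterior prop. to likelihood
post = np.exp(log_post)
post /= post.sum()                           # normalize over the grid

post_mean = (theta * post).sum()             # approx. m
post_var = ((theta - post_mean) ** 2 * post).sum()   # approx. sigma**2 / n = 0.2
```

The grid posterior mean reproduces the sample mean and the grid posterior variance reproduces σ²/n, which is the conjugate-analysis result quoted in the entry.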
Differential Item Functioning
H. JANE ROGERS
Volume 1, pp. 485–490
have the same probability of the response, and hence that DIF is present. Thus, IRT methods for detecting DIF involve comparison of IRFs. Approaches to quantifying this comparison are based on comparison of item parameters, comparison of item characteristic curves, or comparison of model fit. Millsap and Everson [18] provide a comprehensive review of IRT DIF methods that encompasses the majority of research done on these methods to date.

Lord [15] proposed a statistic, often referred to as D², to test for equality of the vectors of item parameters for the two groups. The test statistic requires the vector of item parameter estimates and the variance-covariance matrices of the estimates in each group, and has an approximate chi-square distribution with degrees of freedom equal to the number of item parameters compared. Lord [15] notes that the chi-square distribution is asymptotic and strictly holds only when the true theta values are known. A practical problem with the D² statistic is that the variance-covariance matrix of the item parameter estimates is often not well estimated [28].

Differences between item characteristic functions have been quantified by computing the area between the curves. Bounded and unbounded area statistics have been developed. Bounded statistics are necessary when the c-parameters for the two groups differ, as in this case the area between item characteristic curves is infinite. Kim and Cohen [12] developed a formula for bounded area, but no standard error or test statistic has been derived. Raju [21] provided a formula for computing the area between IRFs when the c-parameters are equal. Raju [22] further provided a standard error formula and derived an approximately normal test statistic.

Likelihood ratio statistics can be used to compare the fit of a model based on equality constraints on parameters across all items with that of a model in which the item parameters for the studied item are estimated separately within groups. Thissen et al. [28] described a likelihood ratio test statistic proportional to the difference in the log likelihoods under the constrained and unconstrained models. The likelihood ratio and D² statistics are asymptotically equivalent; Thissen et al. [28] argue that the likelihood ratio test may be more useful in practice because it does not require computation of the variance-covariance matrix of the parameter estimates. However, the likelihood ratio procedure requires fitting a separate model for each item in the test; by comparison, the D² statistic requires only one calibration per group.

Limitations of IRT methods include the necessity for large sample sizes in order to obtain accurate parameter estimates, the requirement of model-data fit prior to any investigation of DIF, and the additional requirement that the parameter estimates for the two groups be on a common scale before comparison. Each of these issues has been subject to a great deal of research in its own right. IRT methods for detecting DIF remain current and are considered the theoretical ideal that other methods approximate; however, for the reasons given above, they are not as widely used as more easily computed methods based on OCI.

A variety of OCI-based methods have been proposed. All use total score in place of the latent variable. Problems with the use of total score as a conditioning variable have been noted by many authors [19]. A fundamental issue is that the total score may be contaminated by the presence of DIF in some of the items. Purification procedures are routinely used to ameliorate this problem; these procedures involve an initial DIF analysis to identify DIF items, removal of these items from the total score, and recomputation of the DIF statistics using the purified score as the conditioning variable [9, 17].

Logistic regression procedures [23, 27] most closely approximate IRT methods by using an observed score in place of the latent variable and, in essence, fitting a two-parameter model in each group. Swaminathan and Rogers [27] reparameterized to produce an overall model, incorporating parameters for score, group, and a score-by-group interaction. An overall test for the presence of DIF is obtained by simultaneously testing the hypotheses that the group and interaction parameters differ from zero, using a chi-square statistic with two degrees of freedom. Separate one-degree-of-freedom tests (or z tests) for nonuniform and uniform DIF are possible by testing hypotheses about the interaction and group parameters, respectively. Zumbo [29] suggested an R-square effect size measure for use with the logistic regression procedure. The logistic regression procedure is a generalization of the procedure based on the loglinear model proposed by Mellenbergh [17]; the latter treats total score as a categorical variable. The advantages of the logistic regression procedure over other OCI methods are
Differential Item Functioning 3
its generality and its flexibility in allowing multiple conditioning variables.

The standardization procedure of Dorans and Kulick [6] also approximates IRT procedures by comparing empirical item characteristic curves, using total score on the test as the proxy for the latent trait. Unlike the logistic regression procedure, the standardization index treats observed score as a categorical variable; differences between the probability of a correct response (in the case of dichotomously scored items) are computed at each score level, weighted by the proportion of focal group members at that score level, and summed to provide an index of uniform DIF known as the standardized P-DIF statistic. Dorans and Holland [5] provide standard errors for the standardization index but no test of significance. Because it requires very large sample sizes to obtain stable estimates of the item characteristic curves, the standardization index is not widely used; it is used largely by Educational Testing Service (ETS) as a descriptive measure for DIF in conjunction with the Mantel-Haenszel (MH) procedure introduced by Holland and Thayer [9].

The MH procedure is the most widely known of the OCI methods. The procedure for dichotomously scored items is based on contingency tables of item response (right/wrong) by group membership (reference/focal) at each observed score level. The null hypothesis tested under the MH procedure is that the ratio of the odds for success in the reference versus focal groups is equal to one at all score levels. The alternative hypothesis is that the odds ratio is a constant, denoted by alpha. This alternative hypothesis represents uniform DIF; the procedure is not designed to detect nonuniform DIF. The test statistic has an approximate chi-square distribution with one degree of freedom. Holland and Thayer [9] note that this test is the uniformly most powerful (see Power) unbiased test of the null hypothesis against the specified alternative. The common odds ratio, alpha, provides a measure of effect size for DIF. Holland and Thayer [9] transformed this parameter to the ETS delta scale so that it can be interpreted as the constant difference in difficulty of the item between reference and focal groups across score levels. Holland and Thayer [9] note that the MH test statistic is very similar to the statistic given by Mellenbergh [17] for testing the hypothesis of uniform DIF using the loglinear model. Swaminathan and Rogers [27] showed equivalence between the MH and logistic regression procedures if the logistic model is reexpressed in logit form, the total score treated as categorical, and the interaction term omitted. In this case, the group parameter is equal to log alpha. The primary advantages of the MH procedure are its ease of calculation and interpretable effect size measure.

Also of current interest is the SIBTEST procedure proposed by Shealy and Stout [25]. Shealy and Stout [25] note that this procedure can be considered an extension of the standardization procedure of Dorans and Kulick [6]. Its primary advantages over the standardization index are that it does not require large samples and that it provides a test of significance. It is conceptually model-based but nonparametric. The procedure begins by identifying a valid subtest on which conditioning is based and a studied subtest, which may be a single item or group of items. The SIBTEST test statistic is based on the weighted sum of differences in the average scores of reference and focal group members with the same valid subtest true score. Shealy and Stout [25] show that if there are group differences in the distribution of the trait, matching on observed scores does not properly match on the trait, so they base matching instead on the predicted valid subtest true score, given observed valid subtest score. Shealy and Stout [25] derive a test statistic that is approximately normally distributed. Like the standardization and MH procedures, the SIBTEST procedure is designed to detect only uniform DIF. Li and Stout [14] modified SIBTEST to produce a test sensitive to nonuniform DIF. Jiang and Stout [11] offered a modification of the SIBTEST procedure to improve Type I error control and reduce estimation bias. The advantages of the SIBTEST procedure are its strong theoretical basis, relative ease of calculation, and effect size measure.

Extensions of all of the DIF procedures described above have been developed for use with polytomous models. Cohen, Kim, and Baker [4] developed an extension of the IRT area method and an accompanying test statistic and provided a generalization of Lord's D². IRT likelihood ratio tests are also readily extended to polytomous item response models [2, 13]. Rogers and Swaminathan [24] described logistic regression models for unordered and ordered polytomous responses and provided chi-square test statistics with degrees of freedom equal to the number of item parameters being compared. Zwick, Donoghue,
and Grima [30] gave an extension of the standardization procedure for ordered polytomous responses based on comparison of expected responses to the item; Zwick and Thayer [31] derived the standard error for this statistic. Zwick et al. [30] also presented generalized MH and Mantel statistics for unordered and ordered polytomous responses, respectively; the test statistic in the unordered case is distributed as a chi-square with degrees of freedom equal to one less than the number of response categories, while the test statistic in the ordered case is a chi-square statistic with one degree of freedom. These authors also provide an effect size measure for the ordered case. Chang, Mazzeo, and Roussos [3] presented an extension of the SIBTEST procedure for polytomous responses based on conditional comparison of expected scores on the studied item or subtest. Potenza and Dorans [20] provided a framework for classifying and evaluating DIF procedures for polytomous responses.

Investigations of DIF remain an important aspect of all measurement applications; however, current research efforts focus more on interpretations of DIF than on development of new indices.

References

[1] Angoff, W.H. (1993). Perspectives on differential item functioning methodology, in Differential Item Functioning, P.W. Holland & H. Wainer, eds, Lawrence Erlbaum Associates, Hillsdale, pp. 3-23.
[2] Bolt, D.M. (2002). A Monte Carlo comparison of parametric and nonparametric polytomous DIF detection methods, Applied Measurement in Education 15, 113-141.
[3] Chang, H.H., Mazzeo, J. & Roussos, L. (1996). Detecting DIF for polytomously scored items: an adaptation of the SIBTEST procedure, Journal of Educational Measurement 33, 333-353.
[4] Cohen, A.S., Kim, S.H. & Baker, F.B. (1993). Detection of differential item functioning in the graded response model, Applied Psychological Measurement 17, 335-350.
[5] Dorans, N.J. & Holland, P.W. (1993). DIF detection and description: Mantel-Haenszel and standardization, in Differential Item Functioning, P.W. Holland & H. Wainer, eds, Lawrence Erlbaum Associates, Hillsdale, pp. 35-66.
[6] Dorans, N.J. & Kulick, E. (1986). Demonstrating the utility of the standardization approach to assessing unexpected differential item performance on the scholastic aptitude test, Journal of Educational Measurement 23, 355-368.
[7] Hambleton, R.K. & Swaminathan, H. (1985). Item Response Theory: Principles and Applications, Kluwer-Nijhoff, Boston.
[8] Hanson, B.A. (1998). Uniform DIF and DIF defined by differences in item response functions, Journal of Educational and Behavioral Statistics 23, 244-253.
[9] Holland, P.W. & Thayer, D.T. (1988). Differential item performance and the Mantel-Haenszel procedure, in Test Validity, H. Wainer & H.I. Braun, eds, Lawrence Erlbaum Associates, Hillsdale, pp. 129-145.
[10] Holland, P.W. & Wainer, H., eds (1993). Differential Item Functioning, Lawrence Erlbaum Associates, Hillsdale.
[11] Jiang, H. & Stout, W. (1998). Improved Type I error control and reduced estimation bias for DIF detection using SIBTEST, Journal of Educational Measurement 23, 291-322.
[12] Kim, S.H. & Cohen, A.S. (1991). A comparison of two area measures for detecting differential item functioning, Applied Psychological Measurement 15, 269-278.
[13] Kim, S.-H. & Cohen, A.S. (1998). Detection of differential item functioning under the graded response model with the likelihood ratio test, Applied Psychological Measurement 22, 345-355.
[14] Li, H. & Stout, W. (1996). A new procedure for detection of crossing DIF, Psychometrika 61, 647-677.
[15] Lord, F.M. (1980). Applications of Item Response Theory to Practical Testing Problems, Erlbaum, Hillsdale.
[16] Mazor, K., Kanjee, A. & Clauser, B.E. (1995). Using logistic regression and the Mantel-Haenszel procedure with multiple ability estimates to detect differential item functioning, Journal of Educational Measurement 32, 131-144.
[17] Mellenbergh, G.J. (1982). Contingency table models for assessing item bias, Journal of Educational Statistics 7, 105-118.
[18] Millsap, R.E. & Everson, H.T. (1993). Methodology review: statistical approaches for assessing measurement bias, Applied Psychological Measurement 17, 297-334.
[19] Millsap, R.E. & Meredith, W. (1992). Inferential conditions in the statistical detection of measurement bias, Applied Psychological Measurement 16, 389-402.
[20] Potenza, M.T. & Dorans, N.J. (1995). DIF assessment for polytomously scored items: a framework for classification and evaluation, Applied Psychological Measurement 19, 23-37.
[21] Raju, N.S. (1988). The area between two item characteristic curves, Psychometrika 53, 495-502.
[22] Raju, N.S. (1990). Determining the significance of estimated signed and unsigned areas between two item response functions, Applied Psychological Measurement 14, 197-207.
[23] Rogers, H.J. & Swaminathan, H. (1993). A comparison of the logistic regression and Mantel-Haenszel procedures for detecting differential item functioning, Applied Psychological Measurement 17, 105-116.
[24] Rogers, H.J. & Swaminathan, H. (1994). Logistic regression procedures for detecting DIF in nondichotomous item responses, in Paper Presented at the Annual AERA/NCME Meeting, New Orleans.
[25] Shealy, R. & Stout, W. (1993). A model-based standardization approach that separates true bias/DIF from group ability differences and detects test bias/DTF as well as item bias/DIF, Psychometrika 58, 159-194.
[26] Shepard, L., Camilli, G. & Williams, D.M. (1984). Accounting for statistical artifacts in item bias research, Journal of Educational Statistics 9, 93-128.
[27] Swaminathan, H. & Rogers, H.J. (1990). Detecting differential item functioning using logistic regression procedures, Journal of Educational Measurement 27, 361-370.
[28] Thissen, D., Steinberg, L. & Wainer, H. (1992). Use of item response theory in the study of group differences in trace lines, in Test Validity, H. Wainer & H.I. Braun, eds, Lawrence Erlbaum Associates, Hillsdale, pp. 147-170.
[29] Zumbo, B.D. (1999). A Handbook on the Theory and Methods of Differential Item Functioning (DIF): Logistic Regression Modeling as a Unitary Framework for Binary and Likert-type (Ordinal) Item Scores, Directorate of Human Resources Research and Evaluation, Department of National Defense, Ottawa.
[30] Zwick, R., Donoghue, J.R. & Grima, A. (1993). Assessment of differential item functioning for performance tasks, Journal of Educational Measurement 30, 233-251.
[31] Zwick, R. & Thayer, D.T. (1996). Evaluating the magnitude of differential item functioning in polytomous items, Journal of Educational and Behavioral Statistics 21, 187-201.

H. JANE ROGERS
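The MH computation described in the article above can be sketched as follows. The counts are hypothetical, and `mantel_haenszel` is an illustrative helper, not a library routine: it pools 2×2 tables (one per observed score level) into the common odds ratio alpha, the continuity-corrected one-degree-of-freedom chi-square, and the ETS delta-scale transform of alpha.

```python
import math

def mantel_haenszel(tables):
    """MH common odds ratio, chi-square (1 df), and ETS delta.

    Each table is ((A, B), (C, D)):
    A, B = reference group right/wrong; C, D = focal group right/wrong.
    """
    num = den = 0.0               # odds-ratio numerator/denominator
    a_obs = a_exp = a_var = 0.0   # chi-square components
    for (a, b), (c, d) in tables:
        t = a + b + c + d
        num += a * d / t
        den += b * c / t
        n_ref, n_foc = a + b, c + d
        m_right, m_wrong = a + c, b + d
        a_obs += a
        a_exp += n_ref * m_right / t
        a_var += n_ref * n_foc * m_right * m_wrong / (t * t * (t - 1))
    alpha = num / den                               # common odds ratio
    chi2 = (abs(a_obs - a_exp) - 0.5) ** 2 / a_var  # continuity-corrected
    delta = -2.35 * math.log(alpha)                 # ETS delta scale
    return alpha, chi2, delta

# hypothetical counts at two score levels
alpha, chi2, delta = mantel_haenszel([((30, 10), (15, 15)),
                                      ((40, 10), (20, 10))])
```

An estimated alpha above one (uniform DIF favoring the reference group) maps to a negative value on the delta scale under the −2.35 ln(alpha) transform.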
Direct and Indirect Effects
DOMINIQUE MULLER AND CHARLES M. JUDD
Volume 1, pp. 490-492
in
ISBN-13: 978-0-470-86080-9
ISBN-10: 0-470-86080-4
Editors
It follows that:

τ − τ′ = αβ.   (7)

Figure 2 Direct and indirect effects

From this last equation, it can be seen that the indirect effect can be estimated either by αβ or by τ − τ′, the change in effect of X on Y when controlling and not controlling for Me. This is of importance because, if mediation is at stake, it should be the case that |τ| > |τ′|. Hence, the magnitude of the X → Y relationship should decrease once the mediator is controlled. The total effect must be of a higher magnitude than the direct effect. It could happen, however, that |τ| < |τ′|. If it is the case, the third variable is not a mediator but a suppressor (e.g., [4]). In this case, this third variable does not reflect a possible mechanism for the relationship between X and Y, but, on the contrary, hides a portion of this relationship. When the third variable is a suppressor, the relationship between X and Y is stronger once this third variable is controlled.

Extension to Models with More Than One Mediator

As suggested above, in some cases, multiple mediator models could be tested. Then, there will be multiple possible indirect effects. For instance, in a model like the one presented in Figure 3, there will be an indirect effect through Me1 (i.e., α1β1) and another one through Me2 (i.e., α2β2). In such a situation, it is still possible to calculate an overall indirect effect taking into account both the indirect effect through Me1 and through Me2. This one will be the sum of α1β1 and α2β2 (or a1b1 and a2b2 in terms of their sample estimates). It also follows that:

τ − τ′ = α1β1 + α2β2.   (8)

Illustration

Two researchers conducted a study with only female participants [6]. They first showed, in line with the so-called stereotype threat literature (e.g., [7]), that these female participants performed less well at a math test in a condition making salient their gender (see [6] for details on the procedure) than in a control condition. The test of this effect of condition on math performance is, thus, the total effect. The linear regression conducted on these data revealed a standardized estimate of c = −0.42 (in this illustration, we used the standardized estimates simply because these are what these authors reported in their article. The same algebra would apply with the unstandardized estimates). Next, these authors wanted to demonstrate that this difference in math performance was due to a decrease in working memory in the condition where female gender was made salient. In other words, they wanted to show that working memory mediated the condition effect. In order to do so, they conducted two additional regression analyses: The first one, regressing a working memory measure on condition, and the second, regressing math performance on both condition and working memory (i.e., the mediator). The first one gives us the path from condition to working memory (a = −0.52); the second one gives us the path from the mediator to the math performance controlling for condition (b = 0.58) and the path from condition to math controlling for working memory (c′ = −0.12). So, in this illustration, the indirect effect is given by a × b = −0.30 and the direct effect is c′ = −0.12. This example also illustrates that, as stated before, c − c′ = ab, so that −0.42 − (−0.12) = −0.52 × 0.58. The total effect, −0.42, is, thus, due to two components, the direct effect (−0.12) and the indirect effect (−0.30).
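The decomposition c − c′ = ab used in this illustration holds exactly for OLS estimates computed on the same sample. A minimal sketch with simulated data (all variable names and effect sizes are illustrative, not the values from the study discussed above):

```python
import numpy as np

def slopes(X, y):
    """OLS coefficients of y on an intercept plus the columns of X."""
    X1 = np.hstack([np.ones((len(y), 1)), X])
    return np.linalg.lstsq(X1, y, rcond=None)[0]

rng = np.random.default_rng(1)
n = 200
x = rng.normal(size=n)                      # X, e.g., condition
me = -0.5 * x + rng.normal(size=n)          # mediator Me
y = 0.6 * me - 0.1 * x + rng.normal(size=n)

c = slopes(x[:, None], y)[1]                          # total effect of X on Y
a = slopes(x[:, None], me)[1]                         # X -> Me path
b, c_prime = slopes(np.column_stack([me, x]), y)[1:]  # Me -> Y path, direct effect

# total = direct + indirect, exactly, for OLS on the same sample
assert abs(c - (c_prime + a * b)) < 1e-10
```

The same script also shows the suppressor logic: flipping the sign of the Me → Y coefficient makes |c| smaller than |c′|.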
[5] MacKinnon, D.P., Warsi, G. & Dwyer, J.H. (1995). A simulation study of mediated effect measures, Multivariate Behavioral Research 30, 41-62.
[6] Schmader, T. & Johns, M. (2002). Converging evidence that stereotype threat reduces working memory capacity, Journal of Personality and Social Psychology 85, 440-452.
[7] Steele, C.M. & Aronson, J. (1995). Stereotype threat and the intellectual test performance of African Americans, Journal of Personality and Social Psychology 69, 797-811.

DOMINIQUE MULLER AND CHARLES M. JUDD
Direct Maximum Likelihood Estimation
CRAIG K. ENDERS
Volume 1, pp. 492-494
mean vector and covariance matrix, μ and Σ, respectively. ML estimation proceeds by trying out values for μ and Σ in an attempt to identify the estimates that maximize the log likelihood. In the case of direct ML, each case's contribution to the log likelihood is computed using the observed data for that case. Assuming multivariate normality, case i's contribution to the sample log likelihood is given by

log L_i = K_i − (1/2) log|Σ_i| − (1/2)(y_i − μ_i)′ Σ_i⁻¹ (y_i − μ_i),   (1)

where y_i is the vector of raw data, μ_i is the vector of population means, Σ_i is the population covariance matrix, and K_i is a constant that depends on the number of observed values for case i. The sample log likelihood is subsequently obtained by summing over the N cases, as shown in (2).

log L = K − (1/2) Σ_{i=1}^{N} log|Σ_i| − (1/2) Σ_{i=1}^{N} (y_i − μ_i)′ Σ_i⁻¹ (y_i − μ_i).   (2)

The case subscript i implies that the size and content of the arrays differ according to the missing data pattern for each case. To illustrate, let us return to the data in Table 1. Notice that there are three missing data patterns: six cases have complete data, a single case is missing the social support score (Y1), and three cases have missing values on the depression variable (Y2). The contribution to the log likelihood for each of the six complete cases is computed as follows:

log L_i = K_i − (1/2) log |[σ11 σ12; σ21 σ22]| − (1/2) [y1 − μ1, y2 − μ2] [σ11 σ12; σ21 σ22]⁻¹ [y1 − μ1, y2 − μ2]′.   (3)

For cases with missing data, the rows and columns of the arrays that correspond to the missing values are removed. For example, the arrays for the three cases with missing depression scores (Y2) would include only y1, μ1, and σ11. Substituting these quantities into (1), the contribution to the log likelihood for each of these three cases is computed as

log L_i = K_i − (1/2) log|σ11| − (1/2)(y_i − μ1)′ [σ11]⁻¹ (y_i − μ1).   (4)

In a similar vein, the log likelihood for the case with a missing social support score (Y1) is computed using y2, μ2, and σ22.

Table 2 Maximum likelihood parameter estimates

                                  Estimate
Variable                Complete   Direct ML   Listwise
Mean
  Support                  13.00       13.32      10.00
  Depression               15.00       15.12      17.33
Variance
  Support                  29.00       30.99      15.00
  Depression               40.80       44.00      43.56
Covariance
  Support/Depression      -19.90      -18.22     -10.00

To further illustrate, direct ML estimates of μ and Σ are given in Table 2. For comparative purposes, μ and Σ were also estimated following a listwise deletion of cases (n = 6). Notice that the direct ML estimates are quite similar to those of the complete data, but the listwise deletion estimates are fairly distorted. These results are consistent with theoretical expectations, given that the data are MAR (for depression scores, missingness is solely due to the level of social support). Moreover, there is a straightforward conceptual explanation for these results. Because the two variables are negatively correlated (r = −0.58), the listwise deletion of cases effectively truncates the marginal distributions of both variables (e.g., low depression scores are systematically removed, as are high support scores). In contrast, direct ML utilizes all observed data during estimation. Although it may not be immediately obvious from the previous equations, cases with incomplete data are, in fact, contributing to the estimation of all parameters. Although depression scores are missing for three cases, the inclusion of their support scores in the log likelihood informs the choice of depression parameters via the linear relationship between social support and depression.

As mentioned previously, ML estimation requires the use of iterative computational algorithms. One such approach, the EM algorithm (see Maximum Likelihood Estimation), was originally proposed as a method for obtaining ML estimates of μ and Σ
with incomplete data [1], but has since been adapted to complete-data estimation problems as well (e.g., hierarchical linear models; [3]). EM is an iterative procedure that repeatedly cycles between two steps. The process begins with initial estimates of μ and Σ, which can be obtained via any number of methods (e.g., listwise deletion). The purpose of the E, or expectation, step is to obtain the sufficient statistics required to compute μ and Σ (i.e., the variable sums and sums of products) in the subsequent M step. The complete cases simply contribute their observed data to these sufficient statistics, but missing Ys are replaced by predicted scores from a regression equation (e.g., a missing Y1 value is replaced with the predicted score from a regression of Y1 on Y2, Y3, and Y4, and a residual variance term is added to restore uncertainty to the imputed value). The M, or maximization, step uses standard formulae to compute the updated covariance matrix and mean vector using the sufficient statistics from the previous E step. This updated covariance matrix is passed to the next E step, and is used to generate new expectations for the missing values. The two EM steps are repeated until the difference between covariance matrices in adjacent M steps becomes trivially small, or falls below some convergence criterion.

The previous examples have illustrated the estimation of μ and Σ using direct ML. It is important to note that a wide variety of linear models (e.g., regression, structural equation models, multilevel models) can be estimated using direct ML, and direct ML estimation has recently been adapted to nonlinear models as well (e.g., logistic regression implemented in the Mplus software package). Finally, when provided with the option, it is important that direct ML standard errors be estimated using the observed, rather than expected, information matrix (a matrix that is a function of the second derivatives of the log likelihood function), as the latter may produce biased standard errors [2].

References

[1] Dempster, A.P., Laird, N.M. & Rubin, D.B. (1977). Maximum likelihood from incomplete data via the EM algorithm, Journal of the Royal Statistical Society, Series B 39, 1-38.
[2] Kenward, M.G. & Molenberghs, G. (1998). Likelihood based frequentist inference when data are missing at random, Statistical Science 13, 236-247.
[3] Raudenbush, S.W. & Bryk, A.S. (2002). Hierarchical Linear Models, 2nd Edition, Sage, Thousand Oaks.
[4] Rubin, D.B. (1976). Inference and missing data, Biometrika 63, 581-592.

CRAIG K. ENDERS
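The observed-data log likelihood of (1)-(2) and the E/M cycle described in the article above can be sketched as follows. The data and starting values are hypothetical, and the E step is specialized to the bivariate case with missingness confined to Y2; this is an illustrative sketch, not the article's worked example.

```python
import numpy as np

def loglik(Y, mu, sigma):
    """Observed-data log likelihood: each case contributes using only its
    observed values (missing entries are np.nan), summed over cases."""
    total = 0.0
    for row in Y:
        o = ~np.isnan(row)
        dev = row[o] - mu[o]
        s = sigma[np.ix_(o, o)]
        total += (-0.5 * o.sum() * np.log(2 * np.pi)   # the constant term
                  - 0.5 * np.log(np.linalg.det(s))
                  - 0.5 * dev @ np.linalg.solve(s, dev))
    return total

def em_step(Y, mu, sigma):
    """One E/M cycle for bivariate normal data where only Y2 can be missing."""
    miss = np.isnan(Y[:, 1])
    beta = sigma[0, 1] / sigma[0, 0]                 # slope of Y2 on Y1
    resvar = sigma[1, 1] - sigma[0, 1] ** 2 / sigma[0, 0]
    Yfill = Y.copy()
    Yfill[miss, 1] = mu[1] + beta * (Y[miss, 0] - mu[0])   # E[Y2 | Y1]
    ss = Yfill.T @ Yfill                             # sums of products
    ss[1, 1] += miss.sum() * resvar                  # restore lost variance
    mu_new = Yfill.mean(axis=0)
    sigma_new = ss / len(Y) - np.outer(mu_new, mu_new)
    return mu_new, sigma_new

# hypothetical support (Y1) / depression (Y2) scores; np.nan = missing
Y = np.array([[14., 13.], [9., 20.], [11., 18.], [17., 9.],
              [13., 15.], [16., 11.], [8., np.nan], [19., np.nan]])
mu = np.array([np.nanmean(Y[:, 0]), np.nanmean(Y[:, 1])])
sigma = np.diag([np.nanvar(Y[:, 0]), np.nanvar(Y[:, 1])])
ll_start = loglik(Y, mu, sigma)
for _ in range(100):                                 # iterate to convergence
    mu, sigma = em_step(Y, mu, sigma)
assert loglik(Y, mu, sigma) >= ll_start              # EM never decreases it
```

With no missing values, a single E/M cycle simply reproduces the sample mean vector and (biased) sample covariance matrix, which is one easy way to sanity-check the step.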
Directed Alternatives in Testing
ARTHUR COHEN
Volume 1, pp. 495-496
The pattern of cross-twin cross-trait correlations can, under certain conditions, falsify strong hypotheses about the direction of causation, provided several

It is apparent that if variables A and B have identical modes of inheritance, then the cross-twin cross-trait correlations will be equivalent for MZ and DZ twin
Figure 1 Unidirectional causation hypotheses between two variables A and B measured on a pair of twins. (a) Trait A
causes Trait B and (b) Trait B causes Trait A. The figure also includes the expected cross-twin cross-trait correlations
for MZ and DZ twins under each unidirectional hypothesis. Example based on simplified model of causes of twin pair
resemblance in Neale and Cardon [5] and is also reproduced from Gillespie and colleagues [2]
2 Direction of Causation Models
pairs alike, regardless of the direction of causation, and the power to detect the direction of causation will vanish.

Neale and colleagues [7] have modeled direction of causation on the basis of the cross-sectional data between symptoms of depression and parenting, as measured by the dimensions of Care and Overprotection from the Parental Bonding Instrument [8]. They found that models that specified parental rearing as the cause of depression (parenting → depression) fitted the data significantly better than did a model that specified depression as causing parental rearing behavior (depression → parenting). Yet, when a term for measurement error (omission of which is known to produce biased estimates of the causal parameters [5]) was included, the fit of the parenting → depression model improved, but no longer explained the data as parsimoniously as a common additive genetic effects model (see Additive Genetic Variance) alone (i.e., implying indirect causation).

Measurement error greatly reduces the statistical power for resolving alternative causal hypotheses [3]. One remedy is to model DOC using multiple
Table 1 Results of fitting direction of causation models to the psychological distress and parenting variables. Reproduced from Gillespie and colleagues [2]

                                    Goodness of fit
Model                       χ²       df    Δχ²      Δdf   p      AIC
Full bivariate              141.65   105                         -68.35
Reciprocal causation        142.12   106   0.47     1     0.49   -69.88
Distress^a → Parenting^b    152.28   107   10.63    2     **     -61.72
Parenting → Distress        143.13   107   1.48     2     0.48   -70.87
No correlation              350.60   108   208.95   3     ***    134.60

Results based on 944 female MZ twin pairs and 595 DZ twin pairs aged 18 to 45.
^a Distress as measured by three indicators: depression, anxiety, and somatic distress.
^b Parenting as measured by three indicators: coldness, overprotection, and autonomy.
*p < .05, **p < .01, ***p < .001.
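The AIC values in Table 1 are consistent with the model-comparison convention AIC = χ² − 2df (with the minus signs, lost in extraction, restored for the four better-fitting models); a quick consistency check, assuming that convention:

```python
# Table 1 rows: (chi-square, df, reported AIC), assuming AIC = chi2 - 2*df
rows = {
    "Full bivariate":        (141.65, 105, -68.35),
    "Reciprocal causation":  (142.12, 106, -69.88),
    "Distress -> Parenting": (152.28, 107, -61.72),
    "Parenting -> Distress": (143.13, 107, -70.87),
    "No correlation":        (350.60, 108, 134.60),
}
for name, (chi2, df, aic) in rows.items():
    assert abs((chi2 - 2 * df) - aic) < 1e-6, name
```

Under this convention, lower AIC indicates a more parsimonious account of the data, which is why the Parenting → Distress model (AIC = −70.87) is described as the most parsimonious fit.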
Figure 2 The best-fitting unidirection of causation model for the psychological distress and PBI parenting dimensions with standardized variance components (double-headed arrows) and standardized path coefficients. Circles represent sources of latent additive genetic (A), shared (C), and nonshared (E) environmental variance. Ellipses represent common pathways for psychological distress and parenting. Reproduced from Gillespie and colleagues [2]
indicators [3-5]. This method assumes that measurement error occurs, not at the latent variable level but at the level of the indicator variables, and is uncorrelated across the indicator variables [5]. Gillespie and colleagues [2] have used this approach to model the direction of causation between multiple indicators of parenting and psychological distress. Model-fitting results are shown in Table 1.

The parenting → distress model, as illustrated in Figure 2, provided the most parsimonious fit to the data. Unfortunately, there was insufficient statistical power to reject a full bivariate model. Therefore, it is possible that the parenting and psychological distress measures were correlated because of shared genetic or environmental effects (bivariate model), or simply arose via a reciprocal interaction between parental recollections and psychological distress. Despite this limitation, the chief advantage of this model-fitting approach is that it provides a clear means of rejecting the distress → parenting and no causation models, because these models deteriorated significantly from the full bivariate model. The correlations between the parenting scores and distress measures could not be explained by the hypothesis that memories of parenting were altered by symptoms of psychological distress.

Despite enthusiasm in the twin and behavior genetic communities, DOC modeling has received little attention in the psychological literature, which is a shame because it can prove exceedingly useful in illuminating the relationship between psychological constructs. However, as stressed by Duffy and Martin [1], these methods are not infallible or invariably informative, and generally require judgment on the part of the user as to their interpretation.

References

[1] Duffy, D.L. & Martin, N.G. (1994). Inferring the direction of causation in cross-sectional twin data: theoretical and empirical considerations [see comments], Genetic Epidemiology 11, 483-502.
[2] Gillespie, N.G., Zhu, G., Neale, M.C., Heath, A.C. & Martin, N.G. (2003). Direction of causation modeling between measures of distress and parental bonding, Behavior Genetics 33, 383-396.
[3] Heath, A.C., Kessler, R.C., Neale, M.C., Hewitt, J.K., Eaves, L.J. & Kendler, K.S. (1993). Testing hypotheses about direction of causation using cross-sectional family data, Behavior Genetics 23, 29-50.
[4] McArdle, J.J. (1994). Appropriate questions about causal inference from Direction of Causation analyses [comment], Genetic Epidemiology 11, 477-482.
[5] Neale, M.C. & Cardon, L.R. (1992). Methodology for genetic studies of twins and families, NATO ASI Series, Kluwer Academic Publishers, Dordrecht.
[6] Neale, M.C., Duffy, D.L. & Martin, N.G. (1994a). Direction of causation: reply to commentaries, Genetic Epidemiology 11, 463-472.
[7] Neale, M.C., Walters, E., Heath, A.C., Kessler, R.C., Perusse, D., Eaves, L.J. & Kendler, K.S. (1994b). Depression and parental bonding: cause, consequence, or genetic covariance? Genetic Epidemiology 11, 503-522.
[8] Parker, G., Tupling, H. & Brown, L.B. (1979). A parental bonding instrument, British Journal of Medical Psychology 52, 1-10.

NATHAN A. GILLESPIE AND NICHOLAS G. MARTIN
Discriminant Analysis
CARL J. HUBERTY
Volume 1, pp. 499-505
is: Different with respect to what? (Or, on what does proportion-of-variance approach. Each derived eigen-
the grouping variable have an effect?) Here is where DDA comes into play.

DDA is used to determine what outcome variable constructs underlie the resultant group differences. The identification of the constructs is based on what are called linear discriminant functions (LDFs). The LDFs are linear combinations (or, composites) of the p outcome variables. (A linear combination/composite is a sum of products of variables and respective weights.) Derivation of the LDF weights is based on a mathematical method called an eigenanalysis (see [1, pp. 207–208]). This analysis yields numbers called eigenvalues. The number of eigenvalues is, in an LDF context, the minimum of p and k − 1, say, m. For Data Set B, m = min(15, 3) = 3. With each eigenvalue is associated an eigenvector, the numerical elements of which are the LDF weights. So, for Data Set B, there are three sets of weights (i.e., three LDFs) for the 15 outcome variables. The first LDF is defined by

Z1 = 0.54Y1 + 0.59Y2 + ··· + 0.49Y15.   (2)

The first set of LDF weights is mathematically derived so as to maximize, for the data on hand, the (canonical) correlation between Z1 and the grouping variable (see Canonical Correlation Analysis). Weights for the two succeeding LDFs are determined so as to maximize, for the data on hand, the correlation between the linear combination/composite and the grouping variable that is, successively, independent of the preceding correlation.

Even though there are, for Data Set B, m = 3 LDFs, not all three need be retained for interpreting the resultant group differences. Determination of the LDF space dimension may be done in two ways (see [1, pp. 211–214]). One way is to conduct three statistical tests, the first being

H01: no separation on any dimension.

Each eigenvalue is a squared (canonical) correlation (between the grouping variable and the linear combination of the outcome variables), and, thus, reflects a proportion of shared variance. There is a shared-variance proportion associated with each LDF. For Data Set B, the proportions are: LDF1, 0.849; LDF2, 0.119; and LDF3, 0.032. From this numerical information it may be concluded, again, that at most two LDFs need be retained.

Once the number of LDFs to be retained for interpretation purposes is determined, it may be helpful to get a picture of the results. This may be accomplished by constructing an LDF plot. This plot is based on outcome variable means for each group on each LDF. For Data Set B and for LDF1, the group 1 mean vector value is determined, from (1), to be

Z̄1 = 0.54Ȳ1 + 0.59Ȳ2 + ··· + 0.49Ȳ15 ≈ 0.94.   (3)

The group 1 mean vector (i.e., centroid) used with LDF2 yields an approximate value of 0.21. The two centroids for group 2, for group 3, and for group 4 are similarly calculated. The proximity of the group centroids is reflected in the LDF plot. (The typical plot used is that with the LDF axes at a right angle.) With the four groups in Data Set B, each of the four plotted points reflects the two LDF means. The two-dimensional plot for Data Set B is given in Figure 1. By projecting the centroid points, for example, (0.94, 0.21), onto the respective axes, one gets a general idea of group separation that may be attributed to each LDF.

[Figure 1: LDF plot of the group centroids for Data Set B, with LDF1 on the horizontal axis and LDF2 on the vertical axis; G1 (.94, .21), G2 (.72, .55), G4 (.81, .02)]
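The eigenanalysis that produces the LDF weights can be sketched directly: form the between-groups (B) and within-groups (W) SSCP matrices and take the eigenvalues and eigenvectors of W⁻¹B. The toy scores below are hypothetical (three groups, p = 2 outcome variables, so m = min(2, 2) = 2), not Data Set B:

```python
import numpy as np

# Hypothetical scores: k = 3 groups, p = 2 outcome variables (not Data Set B).
groups = [np.array([[1.0, 2.0], [2.0, 1.5], [1.5, 2.5]]),
          np.array([[4.0, 4.5], [5.0, 4.0], [4.5, 5.5]]),
          np.array([[7.5, 2.0], [8.0, 2.5], [7.0, 1.5]])]

grand_mean = np.vstack(groups).mean(axis=0)
p = 2
B = np.zeros((p, p))  # between-groups SSCP matrix
W = np.zeros((p, p))  # within-groups (pooled error) SSCP matrix
for g in groups:
    d = (g.mean(axis=0) - grand_mean).reshape(-1, 1)
    B += len(g) * d @ d.T          # group-size-weighted mean deviations
    c = g - g.mean(axis=0)
    W += c.T @ c                   # pooled within-group deviations

# The LDF weights are eigenvectors of W^{-1}B; the number of nonzero
# eigenvalues is at most m = min(p, k - 1).
eigvals, eigvecs = np.linalg.eig(np.linalg.solve(W, B))
order = np.argsort(eigvals.real)[::-1]
eigvals = eigvals.real[order]
ldf_weights = eigvecs.real[:, order]   # column j holds the weights of LDF j+1
```

With p = 2 and k = 3 here, both eigenvalues can be nonzero; with Data Set B (p = 15, k = 4) the same computation would yield m = 3 of them.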
From Figure 1, it appears that LDF1 may account for separation of G4 on one hand versus G1, G2, and G3 on the other. Also, it may be concluded that LDF2 may account for separation of G2 versus G1, G3, and G4. The latter separation may not appear to be as clear-cut as that associated with LDF1.

All this said and done, the researcher proceeds to making a substantive interpretation of the two retained LDFs. To accomplish this, one determines the correlations between each of the two LDFs and each of the 15 outcome variables. Thus, there are two sets of structure rs, 15 in each set (see Canonical Correlation Analysis). The high values are given in Table 1. The structure rs for LDF1 indicate that the difference of university students versus teacher college, vocational school, and business or technical school students may be attributed to the construct of skill capability in mathematics, social science, and literature, along with maturity and interest in physical science. (It is left to the reader to arrive at a more succinct and concise name for the construct reflected by LDF1.) The second construct is a combination of mathematics reasoning and 3D visualization. (It should be recognized that researcher judgment is needed to determine the number of LDFs to retain, as well as to name them.) It is these constructs that describe/explain the group differences found via the MANOVA and illustrated with the LDF plot discussed above. The constructs may, alternatively, be viewed as that on which the grouping variable has an effect.

Table 1 Selected structure rs for Data Set B

Variable    LDF1    LDF2
MINFO       0.68
SINFO       0.50
LINFO       0.46
MATRP       0.41
PSINT       0.39
MRSNG               0.71
VTDIM               0.58

It may be of further interest to the researcher to determine an ordering of the outcome variables. This interest may also be viewed as determining the relative importance of the outcome variables. Now there are two ways of viewing the variable ordering/relative importance issue. One way pertains to an ordering with respect to differences between/among the groups. This ordering may be determined by conducting all-but-one-variable MANOVAs. With the above example, this would result in 15 14-variable analyses. What are examined, then, are the 15 14-variable MANOVA F values, comparing each with the F value yielded by the overall 15-variable MANOVA. The variable not in the 14-variable subset that yields the largest F-value drop, relative to the 15-variable F value, is considered the most important variable with respect to the contribution to group differences. The remaining 14 variables would be similarly ordered using, of course, some research judgment; there usually would be some rank ties. It turns out that an equivalent way to accomplish an outcome variable ordering with respect to group differences is to examine F-to-remove values via the use of the SPSS DISCRIMINANT program (see Software for Statistical Analyses). For Data Set B, the F-to-remove values are given in Table 2; these are the F transformations of Wilks' lambda values obtained in the 15 14-variable analyses.

Table 2 F-to-remove values for Data Set B

Variable   F-to-remove   Rank
TRINT        14.21        1.5
MINFO        13.04        1.5
BMINT         4.58        4
EPROF         4.50        4
LLINT         4.30        4
PSINT         3.28        7
MRSNG         3.05        7
SINFO         2.27        7
CMINT         1.93       11
MATRP         1.74       11
VTDIM         1.30       11
LINFO         1.24       11
CPSPD         1.12       11
IMPLS         0.61       14.5
SOCBL         0.09       14.5

A second way to order the 15 outcome variables is to examine the absolute values of the two sets of 15 structure rs. Such an ordering would reflect the relative contributions of the variables to the definitions of the respective resulting constructs.
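The tied ranks in Table 2 come from grouping predictors with similar F-to-remove values (a research-judgment step) and giving each group member the mean of the rank positions the group spans. A minimal sketch of that bookkeeping, with the group boundaries assumed from the table:

```python
# Average-rank assignment for judgment-chosen groups of F-to-remove values.
f_to_remove = [14.21, 13.04, 4.58, 4.50, 4.30, 3.28, 3.05, 2.27,
               1.93, 1.74, 1.30, 1.24, 1.12, 0.61, 0.09]  # Table 2, descending
group_sizes = [2, 3, 3, 5, 2]  # judgment-based grouping of similar values

ranks = []
position = 1
for size in group_sizes:
    # each member of a group receives the mean of the positions it spans
    avg = sum(range(position, position + size)) / size
    ranks.extend([avg] * size)
    position += size
# ranks now reproduces the Rank column of Table 2:
# 1.5, 1.5, 4, 4, 4, 7, 7, 7, 11, 11, 11, 11, 11, 14.5, 14.5
```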
4 Discriminant Analysis
From Table 1, trusting that all other structure rs are low, the relative importance of the five variables used to name the first construct is obvious; similarly for the second construct. One could sum the squares of the two structure rs for each of the 15 outcome variables and order the 15 variables with respect to the 15 sums. Such an ordering would indicate the collective relative importance to the definition of the pair of constructs. This latter ordering is rather generic and is judged to be of less interpretive value than the two separate orderings.

As illustrated above with Data Set B, what was discussed were the testing, construct identification, and variable ordering for the omnibus effects. That is, what are the effects of the grouping variable on the collection of 15 outcome variables? In some research, more specific questions are of interest. That is, there may be interest in group contrast effects (see Analysis of Variance). With Data Set B, for example, one might want to compare group 4 with any one of the other three groups or with the other three groups combined. With any contrast analysis, accomplished very simply with any computer package MANOVA program, there is only one LDF, which would be examined as above, assuming it is judged that the tested contrast effect is real.

It is important to note that DDA methods are also applicable in the context of a factorial design, say, A-by-B. One may examine the LDF(s) for the interaction effects, for simple A effects, for simple B effects, or (if the interaction effects are not real) for main A effects and main B effects (see Analysis of Variance). Here, too, there may be some interest in contrast effects.

Whether a one-factor or a multiple-factor design is employed, the initial choice of the outcome variable set for a DDA is very important. Unless the variable set is, in some way, a substantive collection of analysis unit attributes, little, if any, meaningful interpretation can be made of the DDA results. If one has just a hodgepodge of p outcome variables, then what is suggested is that p univariate analyses be conducted; see [4].

Finally, with respect to DDA, there may be research situations in which multiple MANOVAs, along with multiple DDAs, may be conducted. Such a situation may exist when the collection of outcome variables constitutes multiple systems of variables. Of course, each system would be comprised of a respectable number of outcome variables so that at least one construct might be meaningfully identified. (Should this have been considered with Data Set B?)

Group Membership Prediction

Suppose an educational researcher is interested in predicting student post-high-school experience using (a hodgepodge of) nine predictor variables. Let there be four criterion groups determined four years after ninth grade enrollment: college, vocational school, full-time job, and other. The nine predictor variable scores would be obtained prior to, or during, the ninth grade: four specific academic achievements, three family characteristics, and two survey-based attitudes. The analysis to be used with this k = 4, p = 9 design is PDA. The basic research question is: How well can the four post-high-school experiences be predicted using the nine predictors?

Another PDA example is that based on Data Set A in [1, p. 227]. The grouping variable is level of college French course: beginning (n1 = 35), intermediate (n2 = 81), and advanced (n3 = 37). The N = 153 students were assessed on the following 13 variables prior to college entry:

Five high school cumulative grade-point averages:
  English (EGPA),
  Mathematics (MGPA),
  Social Science (SGPA),
  Natural Science (NGPA),
  French (FGPA);
The number of semesters of high school French (SHSF);
Four measures of academic aptitude:
  ACT English (ACTE),
  ACT Mathematics (ACTM),
  ACT Social Studies (ACTS),
  ACT Natural Sciences (ACTN);
Two scores on a French test:
  ETS Aural Comprehension (ETSA),
  ETS Grammar (ETSG); and
The number of semesters since the last high school French course (SLHF).

It is assumed that French course enrollment was initially appropriate; that is, the grouping variable is well defined. The purpose of the study, then, is to determine how well membership in k = 3 levels of college French can be predicted using scores on the p = 13 predictors.
(Note that in PDA, the response variables are predictor variables, whereas in DDA the response variables are outcome variables. Also, in PDA, the grouping variable is an outcome variable, whereas in DDA the grouping variable is a predictor variable.)

Assuming approximate multivariate normality (see Multivariate Normality Tests) of predictor variable scores in the k populations (see [1, Chs. IV & X]), there are two types of PDAs, linear and quadratic. A linear prediction/classification rule is appropriately used when the k group covariance matrices are in the same ballpark. If so, a linear composite of the p predictors is determined for each of the k groups. (These linear composites are not the same as the LDFs determined in a DDA; they are different in number and different in derivation.) The linear combination/composite for each group is of the same general form as that in (1). For Data Set A, the first linear classification function (LCF) is

−99.33 + 0.61X1 − 4.47X2 + 12.73X3 + ··· + 2.15X13.   (4)

The three sets of LCF weights, for this data set, are mathematically derived so as to maximize correct group classification for the data on hand. If it is to be concluded that the three population covariance matrices are not equal (see [6]), then a quadratic prediction/classification rule would be used. With this PDA, three quadratic classification functions (QCFs) involving the 13 predictors are derived (with the same mathematical criterion as for the linear prediction rule). The QCFs are rather complicated and lengthy, including weights for Xj, for Xj², and for XjXj′.

Whether one uses a linear rule or a quadratic rule in a group-membership prediction/classification study, there are two bases for group assignment. One basis is the linear/quadratic composite score for each analysis unit for each group; for each unit, there are k composite scores. A unit, then, is assigned to that group with which the larger(est) composite score is associated. The second basis is, for each unit, an estimated probability of group membership, given the unit's vector of predictor scores; such a probability is called a posterior probability (see Bayesian Statistics). A unit, then, is assigned to that group with which the larger (or largest) posterior probability is associated. (For each unit, the sum of these probability estimates is 1.00.) The two bases will yield identical classification results.

Prior to discussing the summary of the group-membership prediction/classification results, there is another probability estimate that is very important in the LCFs/QCFs and posterior probability calculations. This is a prior (or, a priori) probability (see Bayesian Statistics). The k priors reflect the relative sizes of the k populations and must sum to 1.00. The priors are included in both the composite scores and the posterior probabilities, and, thus, have an impact on group assignment. The values of the priors to be used may be based on theory, on established knowledge, or on expert judgment. The priors used for Data Set A are, respectively, 0.25, 0.50, and 0.25. (This implies that the proportion of students who enroll in, for example, the intermediate-level course is approximately 0.50.)

The calculation of LCF/QCF scores, for the data on hand, is based on predictor weights that are determined from the very same data set. (Similarly, calculation of the posterior probability estimates is based on mathematical expressions derived from the data set on hand.) In other words, the prediction rule is derived from the very data on which the rule is applied. Therefore, these group-membership prediction/classification results are (internally) biased; such a rule is considered an internal rule. Using an internal rule is NOT to be recommended in a PDA. Rather, an external rule should be employed. The external rule that I suggest is the leave-one-out (L-O-O) rule. The method used with the L-O-O approach involves N repetitions of the following two steps.

1. Delete one unit and derive the rule of choice (linear or quadratic) on the remaining N − 1 units; and
2. Apply the rule of choice to the deleted unit.

(Note: At the time of this writing, the quadratic external (i.e., L-O-O) results yielded by SPSS are NOT correct. Both linear and quadratic external results are correctly yielded by the SAS package, while the SPSS package only yields correct linear external results.)

A basic summary of the prediction/classification results is in the form of a classification table. For Data Set A, the L-O-O linear classification results are given in Table 3. For this data set, the separate-group hit rates are 29/35 ≈ 0.83 for group 1, 68/81 ≈ 0.84 for group 2, and 30/37 ≈ 0.81 for group 3. The total-group hit rate is (29 + 68 + 30)/153 ≈ 0.83. (All four of these hit rates are inordinately high.) It is advisable, in my opinion, to assess the hit rates relative to chance. That is: Is an observed hit rate better than a hit rate that can be obtained by chance?
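The interplay of composite scores, priors, and posterior probabilities can be illustrated with a deliberately small sketch. The single-predictor, two-group numbers below are hypothetical, and the LCF is written in the standard equal-variance normal-theory form; the entry itself does not spell out the formula:

```python
from math import exp, log

# Hypothetical one-predictor, two-group setup (not Data Set A): a common
# within-group variance, so a linear rule applies.
means = {1: 10.0, 2: 16.0}   # group means on the predictor
var = 4.0                    # common within-group variance
priors = {1: 0.25, 2: 0.75}  # prior probabilities, which must sum to 1.00

def lcf_score(x, g):
    # linear classification function: mu*x/var - mu^2/(2*var) + ln(prior)
    mu = means[g]
    return mu * x / var - mu * mu / (2 * var) + log(priors[g])

def posteriors(x):
    # posterior probabilities are proportional to exp(LCF score)
    w = {g: exp(lcf_score(x, g)) for g in means}
    total = sum(w.values())
    return {g: w[g] / total for g in w}

x = 12.5
scores = {g: lcf_score(x, g) for g in means}
post = posteriors(x)
assigned_by_score = max(scores, key=scores.get)
assigned_by_posterior = max(post, key=post.get)
# The two bases for assignment agree, and the posteriors sum to 1.00.
```

With unequal group variances the quadratic analogue (a QCF) would add an x² term, which is why the QCFs in the 13-predictor case carry weights for the squares and cross-products of the predictors.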
Table 3 Classification table for Data Set A

                    Predicted group
Actual group      1     2     3   Total
     1           29     6     0      35
     2            8    68     5      81
     3            0     7    30      37
Total            37    81    35     153

To address this question, one can use a better-than-chance index:

I = (Ho − He)/(1 − He),   (5)

where Ho is the observed hit rate, and He is the hit rate expected by chance. For the total-group hit rate using Data Set A, Ho ≈ 0.83 and He ≈ (0.25 × 35 + 0.50 × 81 + 0.25 × 37)/153 ≈ 0.38; therefore, I ≈ 0.72. Thus, by using a linear external rule, about 72% fewer classification errors across all three groups would be made than if classification was done by chance. For group 3 alone, I ≈ (0.81 − 0.25)/(1 − 0.25) ≈ 0.75.

A researcher may want to ask two more specific PDA questions:

1. May some predictor(s) be deleted?
2. What are the more important predictors (with respect to some specific group hit rate, or to the total-group hit rate)?

As usual, judgment calls will have to be made about the retention of the final subset. For Data Set A, the best subset of the 13 predictors is comprised of six predictors (EGPA, MGPA, SGPA, NGPA, ETSA, and ETSG) with a total-group L-O-O hit rate (using the priors of 0.25, 0.50, and 0.25) of 0.88 (as compared to the total-group hit rate of 0.83 based on all 13 predictors).

The second question pertains to predictor ordering/relative importance. This may be simply addressed by conducting the p all-but-one-predictor analyses. The predictor, when deleted, that leads to the largest drop in the hit rates of interest may be considered the most important one, and so on. For the p = 13 analyses with Data Set A, it is found that when variable 12 is deleted, the total-group hit rate drops the most, from 0.83 (with all 13 variables) to 0.73. Therefore, variable 12 is considered most important (with respect to the total-group hit rate). There are four variables which, when singly deleted, actually increase the total-group hit rate.

There are some other specific PDA-related aspects in which a researcher might have some interest. Such interest may arise when the developed prediction rule (in the form of a set of k linear or quadratic composites) is to be used with another, comparable sample. Four such aspects are listed here but will not be discussed: outliers, in-doubt units, nonnormal rules, and posterior probability threshold (see [1] for details).
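The index in (5) can be verified directly from the Table 3 counts and the priors 0.25, 0.50, and 0.25:

```python
# Better-than-chance index I = (Ho - He) / (1 - He), computed from Table 3.
n = [35, 81, 37]            # actual group sizes, N = 153
hits = [29, 68, 30]         # diagonal counts of the classification table
priors = [0.25, 0.50, 0.25]

N = sum(n)
Ho = sum(hits) / N                                # observed total-group hit rate
He = sum(p * ng for p, ng in zip(priors, n)) / N  # prior-weighted chance hit rate
I_total = (Ho - He) / (1 - He)                    # about 0.72

# Group 3 alone: observed 30/37 against its prior of 0.25.
I_group3 = (hits[2] / n[2] - priors[2]) / (1 - priors[2])  # about 0.75
```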
The choice between a DDA and a PDA pertains to research context. DDA is applicable to theoretical research questions, while PDA is applicable to applied research questions. This relationship reminds me of an analogy involving multiple correlation and multiple linear regression (see [2]).

References

[1] Huberty, C.J. (1994). Applied Discriminant Analysis, Wiley, New York.
[2] Huberty, C.J. (2003). Multiple correlation versus multiple regression, Educational and Psychological Measurement 63, 271–278.
[3] Huberty, C.J. & Hussein, M.H. (2001). Some problems in reporting use of discriminant analyses, Journal of Experimental Education 71, 177–191.
[4] Huberty, C.J. & Morris, J.D. (1989). Multivariate analysis versus multiple univariate analyses, Psychological Bulletin 105, 302–308.
[5] Huberty, C.J. & Olejnik, S.O. (in preparation). Applied MANOVA and Discriminant Analysis, 2nd Edition, Wiley, New York.
[6] Huberty, C.J. & Petoskey, M.D. (2000). Multivariate analysis of variance and covariance, in Handbook of Applied Multivariate Statistics and Mathematical Modeling, H.E.A. Tinsley & S.D. Brown, eds, Academic Press, New York, pp. 183–208.

(See also Hierarchical Clustering; k-means Analysis)

CARL J. HUBERTY
Distribution-free Inference, an Overview

CLIFFORD E. LUNNEBORG

Encyclopedia of Statistics in Behavioral Science, Volume 1, pp. 506–513
ISBN-13: 978-0-470-86080-9
ISBN-10: 0-470-86080-4
Potential car buyers report that they intend to purchase one of the following: a SUV, a minivan, a sedan, or a sports car. The four possible responses are nonnumeric and have no natural order. They form a set of nominal categories.

Students rate their instructors on a five-point scale, ranging from poor through fair, good, and very good to excellent. The five possibilities form an ordinal scale of instructor performance. We could replace the verbal labels with numbers, but how big is the difference between fair and good compared with the difference between very good and excellent?

The difference in intelligence between two 12-year-old children with WISC IQ scores of 85 and 95 may or may not correspond to the difference in intelligence between another pair of twelve-year-olds.

Basic Distribution-free Techniques

Both [4] and [23] employ a factorial design in presenting what I will call the basic nonparametric techniques, crossing levels of measurement with a common set of research designs. Notably, the fly leaves of the two texts feature the resulting two-way table. I have utilized this structure in Table 1 to provide an overview of the basic distribution-free techniques and to show the extent of overlap between the 1956 and 1999 coverages. It should be noted that each of the two texts does mention techniques other than the ones listed here. Table 1 is limited to those statistics elevated to inclusion in the fly leaf table by one or other of the two authors.
In an important sense, the heart of classical distribution-free statistics is contained in footnote c to Table 1. Measures that are numeric, apparently on an interval scale, can be treated as ordinal data. We simply replace the numeric scores with their ranks. I will describe, however, the whole of Table 1, column by column.

In the first column of the table, data are categorical and the Chi-squared Test, introduced in a first statistics course, plays an important role (see Contingency Tables). In one-sample designs, it provides for testing a set of observed frequencies against a set of theoretical expectations (e.g., the proportions of blood types A, B, AB, and O) and in two-sample designs it provides for testing the equivalence of two distributions over a common set of categories. An accurate probability of incorrectly rejecting the null hypothesis is assured only asymptotically and is compromised where parameters are estimated in the first case.

The Fisher Exact Test (see Exact Methods for Categorical Data) provides, as its name implies, an exact test of the independence of row and column classifications in a 2 × 2 table of frequencies (see Two by Two Contingency Tables). Typically, the rows correspond to two treatments and the columns to two outcomes of treatment, so the test of independence is a test of the equality of the proportion of successes in the two treatment populations. Under the null hypothesis, the cell frequencies are regulated by a hypergeometric distribution (see Catalogue of Probability Density Functions) and that distribution allows the computation of an exact P value for either directional or nondirectional alternatives to independence. The chi-squared test is used frequently in this situation but yields only approximate probabilities and does not provide for directionality in the alternative hypothesis.
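The hypergeometric computation underlying the exact P value is short enough to write out; the 2 × 2 counts below are hypothetical:

```python
from math import comb

def fisher_exact_greater(a, b, c, d):
    """One-sided (directional) Fisher exact P value for the 2x2 table
    [[a, b], [c, d]], with all margins held fixed under the null."""
    row1, row2, col1 = a + b, c + d, a + c
    n = row1 + row2
    # Hypergeometric probability of x successes falling in row 1,
    # summed over tables at least as extreme as the one observed.
    p = 0.0
    for x in range(a, min(row1, col1) + 1):
        if col1 - x <= row2:  # the companion cell must be feasible
            p += comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)
    return p

# Hypothetical treatment-by-outcome counts: 3/4 versus 1/4 successes.
p_value = fisher_exact_greater(3, 1, 1, 3)  # 17/70, about 0.243
```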
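The one-sample chi-squared computation described above reduces to summing (observed − expected)²/expected over categories; the blood-type counts and hypothesized proportions below are invented for illustration:

```python
# One-sample chi-squared goodness-of-fit statistic, sum of (O - E)^2 / E.
observed = {"A": 44, "B": 12, "AB": 4, "O": 40}               # hypothetical counts
hypothesized = {"A": 0.40, "B": 0.11, "AB": 0.04, "O": 0.45}  # assumed proportions

n = sum(observed.values())
expected = {t: p * n for t, p in hypothesized.items()}
chi_sq = sum((observed[t] - expected[t]) ** 2 / expected[t] for t in observed)
# The P value would come from a chi-squared distribution with
# (number of categories - 1) = 3 degrees of freedom, and, as noted in the
# text, that probability is accurate only asymptotically.
```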
The Binomial Test invokes the family of binomial distributions (see Catalogue of Probability Density Functions) to test a hypothesis about the proportion of successes in a distribution of successes and failures. One difference between [4] and [23] is that the former emphasizes hypothesis tests while the latter reflects the more recent interest in the estimation of confidence intervals (CIs) for parameters such as the proportion of successes or, in the two-sample case, the difference in the proportions of success (see Confidence Intervals: Nonparametric).

The Mantel-Haenszel Test extends Fisher's exact test to studies in which the two treatments have been evaluated, independently, in two or more populations. The null hypothesis is that the two treatments are equally successful in all populations. The alternative may be directional (e.g., that treatment A will be superior to treatment B in at least some populations and equivalent in the others) or nondirectional. Cochran's Q is useful as an omnibus test of treatments in randomized complete block designs where the response to treatment is either a success or a failure. As Q is a transformation of the usual Pearson chi-squared statistic, the suggested method for finding a P value is to refer the statistic to a chi-squared distribution. As noted earlier, the result is only approximately correct. Subsequent pairwise comparisons can be made with McNemar's Test for Significance of Change (see Matching). The latter procedure can be used as well in matched pair designs for two treatments with dichotomous outcomes. Although significance for McNemar's test usually is approximated by reference to a chi-squared distribution, Fisher's exact test could be used to good effect.

The Phi Coefficient (see Effect Size Measures) expresses the association between two dichotomous classifications of a set of cases on a scale not unlike the usual correlation coefficient. As it is based on a 2 × 2 table of frequencies, significance can be assessed via Fisher's exact test or approximated by reference to a chi-squared distribution. The several contingency coefficients that have been proposed transform the chi-squared statistic for a two-way table to a scale ranging from 0 for independence to some positive constant, sometimes 1, for perfect dependence. The underlying chi-squared statistic provides a basis for approximating a test of significance of the null hypothesis of independence.

The second column of Table 1 lists techniques requiring observations that can be ordered from smallest to largest, allowing for ties. The one-sample Runs Test in [23] evaluates the randomness of a sequence of occurrences of two equally likely events (such as heads and tails in the flips of a coin). The examples in [23] are based on numeric data, but as these are grouped into two events, the test could as well have appeared in the Nominal column. Exact P values are provided by the appropriate Binomial random variable with p = 0.5. Runs tests are deprecated in [4] as having less power than alternative tests.

The Kolmogorov-Smirnov test compares two cumulative distribution functions. The one-sample version compares an empirical distribution function with a theoretical one. The two-sample version compares two empirical distribution functions. The Cramér-von Mises Test is a variation on the two-sample Kolmogorov-Smirnov, again comparing two empirical cumulative distribution functions. P values for both two-sample tests can be obtained by permuting observations between the two sources.

The Quantile Test uses the properties of a Binomial distribution to test hypotheses about (or find CIs for) quantiles, such as the median or the 75th percentile, of a distribution. The Cox-Stuart Test groups a sequence of scores into pairs and then applies the Sign Test to the signs (positive or negative) of the pairwise differences to detect a trend in the data. The Sign Test itself is noted in [23] as the oldest of all distribution-free tests. The null hypothesis is that the two signs, + and −, have equal probability of occurring, and the binomial random variable with p = 0.5 is used to test for significance. The Daniels Test for Trend is an alternative to the Cox-Stuart test. It uses Spearman's rho, computed between the ranks of a set of observations and the order in which those observations were collected, to assess trend. Below I mention how rho can be tested for significance.
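The Sign Test described above is a direct binomial calculation; the matched-pair differences below are hypothetical:

```python
from math import comb

def sign_test_p(n_plus, n_minus):
    """One-sided exact P value for the Sign Test: the probability, under
    equally likely signs (p = 0.5), of at least n_plus pluses among
    n_plus + n_minus nonzero differences."""
    n = n_plus + n_minus
    return sum(comb(n, x) for x in range(n_plus, n + 1)) / 2 ** n

# Hypothetical matched-pair differences; zero differences would be
# discarded before counting signs.
diffs = [1.2, 0.8, 2.1, 0.4, 1.9, 0.7, 1.3, 0.2, 1.1, -0.6]
plus = sum(d > 0 for d in diffs)
minus = sum(d < 0 for d in diffs)
p_value = sign_test_p(plus, minus)  # 11/1024, about 0.011
```

The Cox-Stuart trend test reuses this machinery: the sequence is split into first-half/second-half pairs and the sign test is applied to the paired differences.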
The Wilcoxon-Mann-Whitney Test (WMW; it has two origins) has become the distribution-free rival to the t Test for comparing the magnitudes of scores in two sampled distributions. The observations in the two samples are pooled and then ranked from smallest to largest. The test statistic is the sum of ranks for one of the samples, and significance is evaluated by comparing this rank sum with those computed from all possible permutations of the ranks between treatments. The rank sums can be used as well to find a CI for the difference in medians [4].

The two-sample Median Test evaluates the same hypothesis as the WMW by applying Fisher's exact test to the fourfold table created by noting the number of observations in each sample that are larger than or are smaller than the median for the combined samples. The Wald-Wolfowitz Runs Test approaches the question of whether the two samples were drawn from identical distributions by ordering the combined samples, smallest to largest, and counting the number of runs of the two sources in the resulting sequence. A Binomial random variable provides the reference distribution for testing significance.

The Moses Test of Extreme Reactions is tailored to a particular alternative hypothesis, that the active treatment will produce extreme reactions, responses that are either very negative (small) or very positive (large). The combined samples are ranked as for the runs test and the test statistic is the span in ranks of the control sample. Exact significance can be assessed by referring this span to the distribution of spans computed over all possible permutations of the ranks between the active and control treatments.

Just as the WMW test is the principal distribution-free alternative to the t Test, Friedman's Test is the nonparametric choice as an omnibus treatment test in the complete randomized block design (see Randomized Block Design: Nonparametric Analyses). Responses to treatment are ranked within each block and these ranks summed over blocks, separately for each treatment. The test statistic is the variance of these sums of treatment ranks. Exact significance is evaluated by comparing this variance with the distribution of variances resulting from all possible combinations of permutations of the ranks within blocks.

Friedman's test is an omnibus one. Where the alternative hypothesis specifies the order of effectiveness of the treatments, the Page Test can be used. The test statistic is the Spearman rank correlation between this order and the rank of treatment sums computed as for Friedman's test. Exact significance is assessed by comparing this correlation with those resulting, again, from all possible combinations of permutations of the ranks within blocks (see Page's Ordered Alternatives Test).

The k-sample Median Test tests the null hypothesis that the k samples are drawn from populations with a common median. The test statistic is the Pearson chi-squared statistic computed from the 2 × k table of frequencies that results from counting the number of observations in each of the samples that are either smaller than or greater than the median of the combined samples. Approximate P values are provided by referring the statistic to the appropriate chi-squared distribution.

The Kruskal-Wallis Test is an extension of the WMW test to k independent samples. All observations are pooled and a rank assigned to each. These ranks are then summed separately for each sample. The test statistic is the variance of these rank sums. Exact significance is assessed by comparing this variance against a reference distribution made up of similar variances computed from all possible permutations of the ranks among the treatments.

The Kruskal-Wallis test is an omnibus test for the equivalence of k treatments. By contrast, the Jonckheere-Terpstra Test has as its alternative hypothesis an ordering of expected treatment effectiveness. The test statistic can be evaluated for significance exactly by referring it to a distribution of similar values computed over all possible permutations of the ranks of the observations among treatments. In practice, the test statistic is computed on the basis of the raw observations, but as the computation is sensitive only to the ordering of these observations, ranks could be used to the same result.

Two distribution-free measures of association between a pair of measured attributes, Spearman's Rho and Kendall's Tau, are well known in the psychological literature. The first is simply the product-moment correlation computed between the two sets of ranks. Tau, however, is based on an assessment of the concordance or not of each of the [n(n − 1)]/2 pairs of bivariate observations. Though computed from the raw observations, the same value of tau would result if ranks were used instead. Significance of either can be assessed by comparing the statistic against those associated with all possible permutations of the Y scores (or their ranks) paired with the X scores (or their ranks).

To assess the degree of agreement among b raters, when assessing (or, ordering) k stimulus objects, Kendall's Coefficient of Concordance (W) has been employed.
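Many of the exact tests above share the same permutation logic; here is a sketch for the WMW rank-sum case, using hypothetical tie-free data:

```python
from itertools import combinations

def wmw_exact_p(sample_a, sample_b):
    """Exact lower-tail P value for the Wilcoxon-Mann-Whitney rank-sum
    statistic of sample_a, obtained from all possible assignments of the
    pooled ranks to the two samples (no ties assumed)."""
    pooled = sorted(sample_a + sample_b)
    rank = {v: i + 1 for i, v in enumerate(pooled)}  # ranks 1..n
    observed = sum(rank[v] for v in sample_a)
    n = len(pooled)
    count = 0
    total = 0
    # enumerate every way the ranks could have fallen to sample A
    for combo in combinations(range(1, n + 1), len(sample_a)):
        total += 1
        if sum(combo) <= observed:  # rank sums as small as the one observed
            count += 1
    return count / total

# Hypothetical two-sample data: the A scores sit below the B scores,
# so the observed rank sum is the smallest of the C(4, 2) = 6 possibilities.
p_value = wmw_exact_p([1.2, 3.4], [5.6, 7.8])  # 1/6
```

The full enumeration grows combinatorially, which is why tabled critical values or large-sample approximations are used once the samples are no longer tiny.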
and k treatments and can be similarly evaluated for significance.

The final column of Table 1 lists techniques that, arguably, require observations on an interval scale of measurement. There is some disagreement on the correct placement among [4, 23], and myself. I'll deal first, and quite briefly, with six techniques that quite clearly require interval measurement.

The Lilliefors and Shapiro–Wilks procedures are used primarily as tests of normality. Given a set of measurements on an interval scale, should we reject the hypothesis that it is a sample from a normal random variable? The Squared Ranks and Klotz tests are distribution-free tests of the equivalence of variances in two sampled distributions. Variance implies measurement on an interval scale. While the Slope Test and CI Estimate employ Spearman's Rho and Kendall's Tau, the existence of a regression slope implies interval measurement. Similarly, Monotonic Regression uses ranks to estimate a regression curve, again defined only for interval measurement. The regression curve, whether linear or not, tracks the dependence of the mean of Y on the value of X. Means require interval measurement.

Usually, Wilcoxon's Signed Ranks Test is presented as an improvement on the Sign test, an Ordinal procedure. While the latter takes only the signs of a set of differences into account, Wilcoxon's procedure attaches those signs to the ranks of the absolute values of the differences. Under the null hypothesis, the difference in the sums of positive and negative ranks ought to be close to zero. An exact test is based on tabulating these sums for all of the 2^n possible assignments of signs to the ranks. Although only ranks are used in the statistic, our ability to rank differences, either differences between paired observations or differences between observations and a hypothesized median, depends upon an interval scale for the original observations. Wilcoxon's procedure can be used to estimate CIs for the median or median difference.

The Walsh Test is similar in purpose to the Signed Rank test but uses signed differences, actually pairwise averages of signed differences, rather than signed ranks of differences. It is a small-sample procedure; [23] tables significant values only for sample sizes no larger than 15. The complete set of [n(n + 1)]/2 pairwise averages, known as Walsh Averages, can be used to estimate the median or median difference and to find a CI for that parameter.

The Quade Test extends Friedman's test by differentially weighting the contribution of each of the blocks. The weights are given by the ranks of the ranges of the raw observations in the block. The use of the range places this test in the Interval, rather than Ordinal, column of my Table 1. Quade's test statistic does not have a tractable exact distribution, so an approximation is used based on the parametric family of F random variables. It appears problematic whether this test does improve on Friedman's approach, which can be used as well with ordinal measures.

Both [4] and [23] list the Permutation Test (see Permutation Based Inference) as a distribution-free procedure for interval observations, though not under that name. It is referred to as the Randomization Test by [23] and as Fisher's Method of Randomization by [4]. I refer to it as the permutation test or, more explicitly, as the Raw Score Permutation Test and reserve the name Randomization Test for a related, but distinct, inferential technique. Incidentally, the listing of the permutation test in the Interval column of Table 1 for two independent samples has, perhaps, more to do with the hypothesis most often associated with the test than with the logic of the test. The null hypothesis is that the two samples are drawn from identical populations. Testing hypotheses about population means implies interval measurement. Testing hypotheses about population medians, on the other hand, may require only ordinal measurement.

We have already encountered important distribution-free procedures, including the Wilcoxon–Mann–Whitney and Kruskal–Wallis tests, for which significance can be assessed exactly by systematically permuting the ranks of observations among treatments. These tests can be thought of as Rank Permutation Tests. Raw score permutation test P values are obtained via the same route; we refer a test statistic computed from raw scores to a reference distribution made up of values of that statistic computed for all possible permutations of the raw scores among treatments.

I have identified a third class of permutation tests in the Interval column of Table 1 as Normal Scores tests (see Normal Scores and Expected Order Statistics). Briefly, these are permutation tests that are carried out after the raw scores have been replaced, not by their ranks, but by scores that
inherit their magnitudes from the standard Normal random variable while preserving the order of the observations. The gaps between these normal scores will vary, unlike the constant unit difference between adjacent ranks. There are several ways of finding such normal scores. The Normal Scores Permutation Test is referred to as the van der Waerden Test by [4]. This name derives from the use as normal scores of quantiles of the Standard Normal random variable (mean of zero, variance of one). In particular, the kth of n ranked scores is transformed to the q(k) = [k/(n + 1)] quantile of the Standard Normal; for example, for k = 3, n = 10, q(k) = 3/11. The corresponding van der Waerden score is that z score below which falls 3/11 of the distribution of the standard normal, that is, z = −0.60.

Normal scores tests have appealing power properties [4], although this can be offset somewhat by a loss in accuracy if normal theory approximations, rather than actual permutation reference distributions, are employed for hypothesis testing.

Had [23] and, for that matter, [4] solely described a set of distribution-free tests, the impact would have been minimal. What made the techniques valuable to researchers was the provision of tables of significant values for the tests. There are no fewer than 21 tables in [23] and 22 in [4]. These tables enable exact inference for smaller samples and facilitate the use of normal theory approximations for larger studies.

In addition to [4], other very good recent guides to these techniques include [13, 14, 19] and [25].

Growth of Distribution-free Inference

The growth of distribution-free inference beyond the techniques already surveyed has been considerable. Most of this growth has been facilitated, if not stimulated, by the almost universal availability of inexpensive, fast computing. These are some highlights.

The analysis of frequencies, tabulated by two or more sets of nominal categories, now extends far beyond the chi-squared test of independence thanks to the development [3] and subsequent popularization [2] of Log Linear Models. The flavor of these analyses is not unlike that of the analysis of factorial designs for measured data; what higher order interactions are needed to account for the data?

A graphical descriptive technique for cross-classified frequencies, as yet more popular with French than with English or US researchers, is Correspondence Analysis [12, 15]. In its simplest form, the correspondence referred to is that between row and column categories. In effect, correspondence analysis decomposes the chi-squared lack of fit of a model to the observed frequencies into a number of, often interpretable, components.

Regression models for binary responses, known as Logistic Regression, pioneered by [5], now see wide usage [2]. As with linear regression, the regressors may be a mix of measured and categorical variables. Unlike linear regression, the estimation of model parameters must be carried out iteratively, as is also true for the fitting of many log linear models. Thus, the adoption of these techniques has required additional computational support. Though originally developed for two-level responses, logistic regression has been extended to cover multicategory responses, either nominal or ordinal [1, 2].

Researchers today have access to measures of association for cross-classified frequencies, such as the Goodman–Kruskal Gamma and Tau coefficients [11], that are much more informative than contingency coefficients.

Earlier, I noted that both [4] and [23] include (raw score) Permutation Tests among their distribution-free techniques. Though they predate most other distribution-free tests [22], their need for considerable computational support retarded their wide acceptance. Now that sufficient computing power is available, there is an awakening of interest in permutation inference [10, 18, 24], and the range of hypotheses that can be tested is expanding [20, 21].

Important to the use of permutation tests has been the realization that it is not necessary to survey all of the possible permutations of scores among treatments. Even with modern computing power, it remains a challenge to enumerate all the possible permutations when, for example, there are 16 observations in each of two samples: 32!/(16! 16!) = 601,080,390. A Monte Carlo test (see Monte Carlo Goodness of Fit Tests; Monte Carlo Simulation) based on a reference distribution made up of the observed test statistic plus those resulting from an additional (R − 1) randomly chosen permutations also provides an exact significance test [8, 18]. The power of this test increases with R, but with modern desktop computing power, an R of 10 000 or even larger is a quite realistic choice.
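Two computational schemes described in this entry, the exhaustive enumeration of all 2^n sign assignments behind Wilcoxon's exact signed-rank test and the Monte Carlo test built from the observed statistic plus R − 1 randomly chosen permutations, can be sketched briefly in code. This is a minimal illustration with hypothetical data; it ignores ties and zero differences, which a careful implementation must handle:

```python
import math
import random
from itertools import product

# The counting example from the text: splitting 32 observations into
# two samples of 16 admits 32!/(16! 16!) distinct assignments.
assert math.comb(32, 16) == 601_080_390

def wilcoxon_exact_p(diffs):
    """Two-sided exact P value for Wilcoxon's signed-rank test, found by
    enumerating all 2^n assignments of signs to the ranks 1..n.
    Sketch only: assumes no zero and no tied absolute differences."""
    n = len(diffs)
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    rank = [0] * n
    for r, i in enumerate(order, start=1):
        rank[i] = r                      # rank of |diffs[i]| among 1..n
    w_plus = sum(rank[i] for i in range(n) if diffs[i] > 0)
    half = n * (n + 1) / 4               # null expectation of W+
    obs_dev = abs(w_plus - half)
    count = 0
    for signs in product((0, 1), repeat=n):   # all 2^n sign patterns
        w = sum(r for r, s in zip(range(1, n + 1), signs) if s)
        if abs(w - half) >= obs_dev:
            count += 1
    return count / 2 ** n

def monte_carlo_permutation_test(x, y, R=9999, seed=7):
    """One-sided Monte Carlo permutation test of mean(x) > mean(y).
    The reference distribution holds the observed statistic plus the
    statistics from R - 1 randomly chosen permutations."""
    rng = random.Random(seed)
    pooled = list(x) + list(y)
    n_x = len(x)

    def stat(scores):
        # difference in means, first n_x scores versus the rest
        return sum(scores[:n_x]) / n_x - sum(scores[n_x:]) / (len(scores) - n_x)

    observed = stat(pooled)
    extreme = 1   # the observed statistic is one member of the reference set
    for _ in range(R - 1):
        rng.shuffle(pooled)
        if stat(pooled) >= observed:
            extreme += 1
    return observed, extreme / R
```

For the hypothetical differences [1.5, 2.1, −0.3, 3.0, 1.1] the enumeration gives P = 4/32 = 0.125. Counting the observed statistic as one member of the reference distribution is what makes the Monte Carlo version an exact test.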
In his seminal series of papers, Pitman [22] noted that permutation tests were valid even where the samples exhausted the population sampled. Though the terminology may seem odd, the situation described is both common and very critical. Consider the following.

A psychologist advertises, among undergraduates, for volunteers to participate in a study of visual perception. The researcher determines that 48 of the volunteers are qualified for the study and randomly divides those students into two treatment groups, a Low Illumination Level group and a High Illumination Level group. Notably, the 48 students are not a random sample from any larger population; they are a set of available cases. However, the two randomly formed treatment groups are random samples from that set and, of course, together they exhaust that set. This is the situation to which Pitman referred. The set of available cases constitutes what I call a local population.

Parametric inference, for example, a t Test, assumes the 48 students to be randomly chosen from an essentially infinitely large population and is clearly inappropriate in this setting [8, 16, 17]. The permutation test mechanics, however, provide a valid test for the local population, and Edgington [8] advocates, as do I, the use of the term Randomization Test in this situation (see Randomization Based Tests). The distinctive term serves to emphasize that the inference (a) is driven by the randomization rather than by random sampling and (b) is limited to the local population rather than some infinitely large one.

Truly random samples remain a rarity in the behavioral sciences. Randomization, however, is a well-established experimental precaution, and randomization tests ought to be more widely used than they have been [8, 16, 17]. In the preface to [4], the author notes that, in 1999, distribution-free methods are essential tools for researchers doing statistical analyses. The authors of [14] go even further, declaring distribution-free tests to be the preferred methodology for data analysts. There is some evidence, however, that psychologists may be reluctant to give up parametric techniques; in 1994, de Leeuw [7] noted that there remained analysis-of-variance-oriented programs in psychology departments.

Arguably, applications of the Bootstrap have had the greatest recent impact on distribution-free inference. The bootstrap provides a basis for estimating standard errors and confidence intervals and for carrying out hypothesis tests on the basis of samples drawn by resampling from an initial random sample. The approach is computer intensive but has wide applications [6, 9, 17, 18, 24].

Fast computing has changed the statistical landscape forever. Parametric methods thrived, in large part, because their mathematics led to easy, albeit approximate and inaccurate, computations. That crutch is no longer needed.

References

[1] Agresti, A. (1984). Analysis of Ordinal Categorical Data, Wiley, New York.
[2] Agresti, A. (1996). An Introduction to Categorical Data Analysis, Wiley, New York.
[3] Bishop, Y.M.M., Fienberg, S.E. & Holland, P.W. (1975). Discrete Multivariate Analysis, MIT Press, Cambridge.
[4] Conover, W.J. (1999). Practical Nonparametric Statistics, 3rd Edition, Wiley, New York.
[5] Cox, D.R. (1970). The Analysis of Binary Data, Chapman & Hall, London.
[6] Davison, A.C. & Hinkley, D.V. (1997). Bootstrap Methods and their Applications, Cambridge University Press, Cambridge.
[7] de Leeuw, J. (1994). Changes in JES, Journal of Educational and Behavioral Statistics 19, 169–170.
[8] Edgington, E.S. (1995). Randomization Tests, 3rd Edition, Marcel Dekker, New York.
[9] Efron, B. & Tibshirani, R.J. (1993). An Introduction to the Bootstrap, Chapman & Hall, New York.
[10] Good, P. (2000). Permutation Tests, 2nd Edition, Springer, New York.
[11] Goodman, L.A. & Kruskal, W.H. (1979). Measures of Association for Cross Classifications, Springer, New York.
[12] Greenacre, M.J. (1984). Theory and Applications of Correspondence Analysis, Academic Press, London.
[13] Hettmansperger, T.P. (1984). Statistical Inference Based on Ranks, Wiley, New York.
[14] Hollander, M. & Wolfe, D.A. (1999). Nonparametric Statistical Methods, 2nd Edition, Wiley, New York.
[15] Lebart, L., Morineau, A. & Warwick, K.M. (1984). Multivariate Descriptive Statistical Analysis, Wiley, New York.
[16] Ludbrook, J. & Dudley, H.A.F. (1998). Why permutation tests are superior to t- and F-tests in biomedical research, The American Statistician 52, 127–132.
[17] Lunneborg, C.E. (2000). Data Analysis by Resampling, Duxbury, Pacific Grove.
[18] Manly, B.F.J. (1997). Randomization, Bootstrap and Monte Carlo Methods in Biology, 2nd Edition, Chapman & Hall, London.
[19] Maritz, J.S. (1984). Distribution-free Statistical Methods, Chapman & Hall, London.
[20] Mielke, P.W. & Berry, K.J. (2001). Permutation Methods: A Distance Function Approach, Springer, New York.
[21] Pesarin, F. (2001). Multivariate Permutation Tests, Wiley, Chichester.
[22] Pitman, E.J.G. (1937). Significance tests which may be applied to samples from any population, Journal of the Royal Statistical Society B 4, 119–130.
[23] Siegel, S. (1956). Nonparametric Statistics for the Behavioral Sciences, McGraw-Hill, New York.
[24] Sprent, P. (1998). Data Driven Statistical Methods, Chapman & Hall, London.
[25] Sprent, P. & Smeeton, N.C. (2001). Applied Nonparametric Statistical Methods, 3rd Edition, Chapman & Hall/CRC, London.

CLIFFORD E. LUNNEBORG
Dominance
DAVID M. EVANS
Volume 1, pp. 513–514
Genotype:             A2A2    A1A2    A1A1
Genotypic value:      −a      d       +a
Genotypic frequency:  q^2     2pq     p^2

Figure 1  A biallelic autosomal locus in Hardy–Weinberg equilibrium. The genotypic values of the homozygotes A1A1 and A2A2 are +a and −a, respectively. The genotypic value of the heterozygote A1A2 is d, which quantifies the degree of dominance at the locus. The gene frequencies of alleles A1 and A2 are p and q, respectively, and the frequencies of the genotypes are as shown
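As a worked check on the quantities in Figure 1, the mean genotypic value of the population under Hardy–Weinberg equilibrium follows directly as the frequency-weighted sum of the genotypic values (a standard result of quantitative genetics; see, e.g., [4]):

    M = p^2(+a) + 2pq(d) + q^2(−a)
      = a(p − q) + 2pqd

Since p + q = 1, the a terms collapse to a(p − q); the dominance deviation d enters the population mean only through the heterozygote term 2pqd, and vanishes from the mean when d = 0.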
that variance component estimates will be biased when dominance genetic and shared environmental components simultaneously contribute to trait variation [6–8].

References

[1] Eaves, L.J. (1988). Dominance alone is not enough, Behavior Genetics 18, 27–33.
[2] Eaves, L.J., Last, K., Martin, N.G. & Jinks, J.L. (1977). A progressive approach to non-additivity and genotype-environmental covariance in the analysis of human differences, The British Journal of Mathematical and Statistical Psychology 30, 1–42.
[3] Evans, D.M., Gillespie, N.G. & Martin, N.G. (2002). Biometrical genetics, Biological Psychology 61, 33–51.
[4] Falconer, D.S. & Mackay, T.F.C. (1996). Introduction to Quantitative Genetics, Longman, Burnt Mill.
[5] Fisher, R.A. (1918). The correlation between relatives on the supposition of Mendelian inheritance, Transactions of the Royal Society of Edinburgh 52, 399–433.
[6] Grayson, D.A. (1989). Twins reared together: minimizing shared environmental effects, Behavior Genetics 19, 593–604.
[7] Hewitt, J.K. (1989). Of biases and more in the study of twins reared together: a reply to Grayson, Behavior Genetics 19, 605–608.
[8] Martin, N.G., Eaves, L.J., Kearsey, M.J. & Davies, P. (1978). The power of the classical twin study, Heredity 40, 97–116.
[9] Mather, K. & Jinks, J.L. (1982). Biometrical Genetics, Chapman & Hall, New York.

DAVID M. EVANS
Dot Chart
BRIAN S. EVERITT
Volume 1, pp. 514–515
Dropouts in Longitudinal Data
EDITH D. DE LEEUW
nonresponse is beyond the scope of this entry; for an introductory overview on prevention and treatment of item nonresponse, see [2].

Starting at the initial recruitment, the researcher has to take steps to reduce future nonresponse. This needs careful planning and a total design approach. As research participants will be contacted over time, it is extremely important that the study has a well-defined image and is easily recognized and remembered at the next wave. A salient title, a recognizable logo, and graphical design are strong tools to create a positive study identity, and should be consistently used on all survey materials. For instance, the same logo and graphical style can be used on questionnaires, interviewer identity cards, information material, newsletters, and thank-you cards. When incentives are used, one should try to tie these in with the study. A good example comes from a large German study on exposure to printed media. The logo and mascot of this study is a little duckling, Paula. In German, the word Ente or duck has the same meaning as the French word canard: a false (newspaper) report. Duckling Paula appears on postcards for the panel members, as a soft toy for the children, as an ornament for the Christmas tree, printed on aprons, t-shirts and so on, and has become a collector's item.

Dropout in longitudinal studies originates from three sources: failure to locate the research unit, failure to contact the potential respondent, and failure to obtain cooperation from the response unit [3]. Thus, the first task is limiting problems in locating research participants. At the recruitment phase or during the base-line study, the sample is fresh and address information is up-to-date. As time goes by, people move, and address, phone, and e-mail information may no longer be valid. It is of the utmost importance that, from the start and at each consecutive time point, special locating information is collected. Besides the full name, also the maiden name should be recorded to facilitate follow-up after divorce. It is advisable to collect full addresses and phone numbers of at least three good friends or relatives as network contacts. Depending on the study, names and addresses of parents, school administration, or employers may be asked too. One should always provide change-of-address cards and, if the budget allows, print on this card a message conveying that if one sends in a change of address, the researchers will send a small 'welcome in your new home' gift (e.g., a flower token, a DIY-shop token, a monetary incentive). It goes without saying that the change-of-address cards are preaddressed to the study administration and that no postage is needed.

When the waves or follow-up times are close together, there is opportunity to keep locating information up-to-date. If this is not the case, for instance in an annual or biannual study, it pays to incorporate between-wave locating efforts: for instance, sending a Christmas card with a spare change-of-address card, birthday cards for panel members, and sending a newsletter with a request for address update. Additional strategies are to keep in touch and follow up at known life events (e.g., pregnancy, illness, completion of education). This is not only motivating for respondents; it also limits loss of contact as change-of-address cards can be attached. Any mailing that is returned as undeliverable should be tracked immediately. Again, the better the contact ties in with the goal and topic of the study, the better it works. Examples are mother's day cards in a longitudinal study of infants, and individual feedback and growth curves in health studies. A total design approach should be adopted with material identifiable by house style, mascot, and logo, so that it is clear that the mail (e.g., a child's birthday card) is coming from the study. Also ask regularly for an update, or additional network addresses. This is extremely important for groups that are mobile, such as young adults.

If the data are collected by means of face-to-face or telephone interviews, the interviewers should be clearly instructed in procedures for locating respondents, both during training and in a special tracking manual. Difficult cases may be allocated to specialized trackers. Maintaining interviewer and tracker morale, through training, feedback, and bonuses, helps to attain a high response. If other data collection procedures are used (e.g., mail or internet survey, experimental, or clinical measurements), staff members should be trained in tracking procedures. Trackers have to be trained in the use of resources (e.g., phone books, telephone information services), and in the approach of listed contacts. These contacts are often the only means to successfully locate the research participant, and establishing rapport and maintaining the conversation with contacts are essential.
Dropouts in Longitudinal Data 3
The second task is limiting the problems in contacting research participants. The first contact in a longitudinal study takes effort to achieve, just like establishing contact in a cross-sectional, one-time survey. Interviewers have to make numerous calls at different times, leave cards after a visit, leave messages on answering machines, or contact neighbors to extract information on the best time to reach the intended household. However, after the initial recruitment or base-line wave, contacting research participants is far less of a problem. Information collected at the initial contact can be fed to interviewers and used to tailor later contact attempts, provided, of course, that good locating information is also available. In health studies and experimental research, participants often have to travel to a special site, such as a hospital, a mobile van, or an office. Contacts to schedule appointments should preferably be made by phone, using trained staff. If contact is being made through the mail, a phone number should always be available to allow research participants to change an inconvenient appointment, and trained staff members should immediately follow up on no-shows.

The third task is limiting dropout through lost willingness to cooperate. There is an extensive literature on increasing cooperation in cross-sectional surveys. Central in this is reducing the cost for the respondent, while increasing the reward, motivating respondents and interviewers, and personalizing and tailoring the approach to the respondent [1, 4, 5]. These principles can be applied both during recruitment and at subsequent time points. When interviewers are used, it is crucial that interviewers are kept motivated and feel valued and committed. This can be done through refresher training, informal interviewer meetings, and interviewer incentives. Interviewers can and should be trained in special techniques to persuade and motivate respondents, and learn to develop a good relationship [1]. It is not strictly necessary to have the same interviewers revisit the same respondents at all time points, but it is necessary to feed interviewers information about previous contacts. Also, personalizing and adapting the wording of the questions by incorporating answers from previous measurements (dependent interviewing) has a positive effect on cooperation.

In general, prior experiences and especially respondent enjoyment are related to cooperation at subsequent waves [3]. A short and well-designed questionnaire helps to reduce response burden. Researchers should realize this and not try to get as much as possible out of the research participants at the first waves. In general, make the experience as nice as possible and provide positive feedback at each contact.

Many survey design features that limit locating problems, such as sending birthday and holiday cards and newsletters, also serve to nurture a good relationship with respondents and keep them motivated. In addition to these intrinsic incentives, explicit incentives also work well in retaining cooperation, and do not appear to have a negative effect on data quality [1]. Again, the better the incentives fit the respondent and the survey, the better the motivational power (e.g., free downloadable software in a student internet panel, air miles in travel studies, cute t-shirts and toys in infant studies). When research participants have to travel to a special site, a strong incentive is a special transportation service, such as a shuttle bus or car. Of course, all real transportation costs of participants should be reimbursed. In general, everything that can be done to make participation in a study as easy and comfortable as possible should be done. For example, provide for child care during an on-site health study of teenage mothers.

Finally, a failure to cooperate at a specific time point does not necessarily imply a complete dropout from the study. A respondent may drop out temporarily because of time pressure or lifetime changes (e.g., change of job, birth of a child, death of a spouse). If a special attempt is made, the respondent may not be lost for the next waves.

In addition to the general measures described above, each longitudinal study can and should use data from earlier time points to design for nonresponse prevention. Analysis of nonrespondents (persons unable to locate again and refusals) provides profiles for groups at risk. Extra effort then may be put into research participants with similar profiles who are still in the study (e.g., offer an extra incentive, try to get additional network information). In addition, these nonresponse analyses provide data for better statistical adjustment.

With special techniques, it is possible to reduce dropout in longitudinal studies considerably, but it can never be prevented completely. Therefore, adjustment procedures will be necessary during analysis.
Knowing why dropout occurs makes it possible to choose the correct statistical adjustment procedure. Research participants may drop out of longitudinal studies for various reasons, but of one thing one may be assured: they do not drop out completely at random. If the reasons for dropout are not related to the topic of the study, responses are missing at random and relatively simple weighting or imputation procedures can be adequately employed. But if the reasons for dropout are related to the topic, responses are not missing at random and a special model for the dropout must be included in the analysis to prevent bias. In longitudinal studies, usually auxiliary data are available from earlier time points, but one can only guess at the reasons why people drop out. It is advisable to ask for these reasons directly in a special short exit interview. The data from this exit interview, together with auxiliary data collected at earlier time points, can then be used to statistically model the dropout and avoid biased results.

(See also Generalized Linear Mixed Models)

References

[2] De Leeuw, E.D., Hox, J. & Huisman, M. (2003). Prevention and treatment of item nonresponse, Journal of Official Statistics 19(2), 153–176.
[3] Lepkowski, J.M. & Couper, M.P. (2002). Nonresponse in the second wave of longitudinal household surveys, in Survey Nonresponse, R.M. Groves, D.A. Dillman, J.L. Eltinge & R.J.A. Little, eds, Wiley, New York.
[4] Dillman, D.A. (2000). Mail and Internet Surveys, Wiley, New York, see also Dillman (1978) Mail and telephone surveys.
[5] Groves, R.M. & Couper, M.P. (1998). Nonresponse in Household Surveys, Wiley, New York.

Further Reading

Kasprzyk, D., Duncan, G.J., Kalton, G. & Singh, M.P. (1989). Panel Surveys, Wiley, New York.
The website of the Journal of Official Statistics (http://www.jos.nu) contains many interesting articles on survey methodology, including longitudinal studies and panel surveys.

EDITH D. DE LEEUW
relationships between imputed and observed variables that are otherwise distorted. For repeated-measures data with dropouts, missing values can be filled in sequentially, with each missing value for each subject imputed by regression on the observed or previously imputed values for that subject.

Imputation methods that impute draws include stochastic regression imputation [10, Example 4.5], where each missing value is replaced by its regression prediction plus a random error with variance equal to the estimated residual variance. A common approach for longitudinal data imputes missing values for a case with the last recorded observation for that case. This method is common, but not recommended since it makes the very strong and often unjustified assumption that the missing values in a case are all identical to the last observed value. Better methods for longitudinal imputation include imputation based on row and column fits [10, Example 4.11].

The imputation methods discussed so far can yield consistent estimates of the parameters under well-specified imputation models, but the analysis of the filled-in data set does not take into account the added uncertainty from the imputations. Thus, statistical inferences are distorted, in the sense that standard errors of parameter estimates computed from the filled-in data will typically be too small, confidence intervals will not have their nominal coverage, and P values will be too small. An important refinement of imputation, multiple imputation, addresses this problem [18]. A predictive distribution of plausible values is generated for each missing value using a statistical model or some other procedure. We then impute, not just one, but a set of M (say M = 10) draws from the predictive distribution of the missing values, yielding M data-sets with different draws plugged in for each of the missing values. For example, the stochastic regression method described above could be repeated M times. We then apply the analysis to each of the M data-sets and combine the results in a simple way. In particular, for a single parameter, the multiple-imputation estimate is the average of the estimates from the M data-sets, and the variance of the estimate is the average of the variances from the M data-sets plus 1 + 1/M times the sample variance of the estimates over the M data-sets (the factor 1 + 1/M is a small-M correction). The last quantity here estimates the contribution to the variance from imputation uncertainty, missed by single imputation methods. Similar formulae apply for more than one parameter, with variances replaced by covariance matrices. For other forms of multiple-imputation inference, see [10, 18, 20]. Often, multiple imputation is not much more difficult than doing single imputation; most of the work is in creating good predictive distributions for the missing values. Software for multiple imputation is becoming more accessible; see PROC MI in [15], [19], [20] and [22].

Maximum Likelihood Methods

Complete-case analysis and imputation achieve a rectangular data set by deleting the incomplete cases or filling in the gaps in the data set. There are other methods of analysis that do not require a rectangular data set, and, hence, can include all the data without deletion or imputation. One such approach is to define a summary measure of the treatment effect for each individual based on the available data, such as change in an outcome between baseline and last recorded measurement, and then carry out an analysis of the summary measure across individuals (see Summary Measure Analysis of Longitudinal Data). For example, treatments might be compared in terms of differences in means of this summary measure. Since the precision of the estimated summary measure varies according to the number of measurements, a proper statistical analysis gives less weight to measures from subjects with shorter intervals of measurement. The appropriate choice of weight depends on the relative size of intraindividual and interindividual variation, leading to complexities that negate the simplicity of the approach [9].

Methods based on generalized estimating equations [7, 12, 17] also do not require rectangular data. The most common form of estimating equation is to generate a likelihood function for the observed data based on a statistical model, and then estimate the parameters to maximize this likelihood [10, chapter 6]. Maximum likelihood methods for multilevel or linear multilevel models form the basis of a number of recent statistical software packages for repeated-measures data with missing values, which provide very flexible tools for statistical modeling of data with dropouts. Examples include SAS PROC MIXED and PROC NLMIXED [19], methods for longitudinal data in S-PLUS functions lme and nlme [13], HLM [16], and the Stata programs
Dropouts in Longitudinal Studies: Methods of Analysis
gllamm in [14] (see Software for Statistical Analyses). Many of these programs are based on linear multilevel models for normal responses [6], but some allow for binary and ordinal outcomes [5, 14, 19] (see Generalized Linear Mixed Models).

These maximum likelihood analyses are based on the ignorable likelihood, which does not include a term for the missing data mechanism. The key assumption is that the data are missing at random, which means that dropout depends only on the observed variables for that case, and not on the missing values or the unobserved random effects (see [10], Chapter 6). In other words, missingness is allowed to depend on values of covariates, or on values of repeated measures recorded prior to dropout, but cannot depend on other quantities. Bayesian methods (see Bayesian Statistics) [3] under noninformative priors are useful for small-sample inferences.

Some new methods allow us to deal with situations where the data are not missing at random, by modeling the joint distribution of the data and the missing data mechanism, formulated by including a variable that indicates the pattern of missing data [Chapter 15 in 10], [1, 4, 8, 14, 23, 24]. However, these nonignorable models are very hard to specify and vulnerable to model misspecification. Rather than attempting simultaneously to estimate the parameters of the dropout mechanism and the parameters of the complete-data model, a more reliable approach is to do a sensitivity analysis to see how much the answers change for various assumptions about the dropout mechanism (see [Examples 15.10 and 15.12 in 10], [21]). For example, in a smoking cessation trial, a common practice is to treat dropouts as treatment failures. An analysis based on this assumption might be compared with an analysis that treats the dropouts as missing at random. If substantive results are similar, the analysis provides some degree of confidence in the robustness of the conclusions.

Conclusion

Complete-case analysis is a limited approach, but it might suffice with small amounts of dropout. Otherwise, two powerful general approaches to statistical analysis are maximum likelihood estimation and multiple imputation. When the imputation model and the analysis model are the same, these methods have similar large-sample properties. One useful feature of multiple imputation is that the imputation model can differ from the analysis model, as when variables not included in the final analysis model are included in the imputation model [10, Section 10.2.4]. Software for both approaches is gradually improving in terms of the range of models accommodated. Deviations from the assumption of missing at random are best handled by a sensitivity analysis, where results are assessed under a variety of plausible alternatives.

Acknowledgment

This research was supported by National Science Foundation Grant DMS 9408837.

References

[1] Diggle, P. & Kenward, M.G. (1994). Informative drop-out in longitudinal data analysis (with discussion), Applied Statistics 43, 49–94.
[2] Frangakis, C.E. & Rubin, D.B. (1999). Addressing complications of intent-to-treat analysis in the combined presence of all-or-none treatment noncompliance and subsequent missing outcomes, Biometrika 86, 365–379.
[3] Gilks, W.R., Wang, C.C., Yvonnet, B. & Coursaget, P. (1993). Random-effects models for longitudinal data using Gibbs sampling, Biometrics 49, 441–453.
[4] Hausman, J.A. & Wise, D.A. (1979). Attrition bias in experimental and panel data: the Gary income maintenance experiment, Econometrica 47, 455–473.
[5] Hedeker, D. (1993). MIXOR: A Fortran Program for Mixed-effects Ordinal Probit and Logistic Regression, Prevention Research Center, University of Illinois at Chicago, Chicago, 60637.
[6] Laird, N.M. & Ware, J.H. (1982). Random-effects models for longitudinal data, Biometrics 38, 963–974.
[7] Liang, K.-Y. & Zeger, S.L. (1986). Longitudinal data analysis using generalized linear models, Biometrika 73, 13–22.
[8] Little, R.J.A. (1995). Modeling the drop-out mechanism in longitudinal studies, Journal of the American Statistical Association 90, 1112–1121.
[9] Little, R.J.A. & Raghunathan, T.E. (1999). On summary-measures analysis of the linear mixed-effects model for repeated measures when data are not missing completely at random, Statistics in Medicine 18, 2465–2478.
[10] Little, R.J.A. & Rubin, D.B. (2002). Statistical Analysis with Missing Data, 2nd Edition, John Wiley, New York.
[11] Meinert, C.L. (1980). Terminology – a plea for standardization, Controlled Clinical Trials 2, 97–99.
[12] Park, T. (1993). A comparison of the generalized estimating equation approach with the maximum likelihood approach for repeated measurements, Statistics in Medicine 12, 1723–1732.
[13] Pinheiro, J.C. & Bates, D.M. (2000). Mixed-effects Models in S and S-PLUS, Springer-Verlag, New York.
[14] Rabe-Hesketh, S., Pickles, A. & Skrondal, A. (2001). GLLAMM Manual, Technical Report 2001/01, Department of Biostatistics and Computing, Institute of Psychiatry, King's College, London. For associated software see http://www.gllamm.org/
[15] Raghunathan, T., Lepkowski, J., VanHoewyk, J. & Solenberger, P. (2001). A multivariate technique for multiply imputing missing values using a sequence of regression models, Survey Methodology 27(1), 85–95. For associated IVEWARE software see http://www.isr.umich.edu/src/smp/ive/
[16] Raudenbush, S.W., Bryk, A.S. & Congdon, R.T. (2003). HLM 5, SSI Software, Lincolnwood.
[17] Robins, J., Rotnitsky, A. & Zhao, L.P. (1995). Analysis of semiparametric regression models for repeated outcomes in the presence of missing data, Journal of the American Statistical Association 90, 106–121.
[18] Rubin, D.B. (1987). Multiple Imputation in Sample Surveys and Censuses, John Wiley, New York.
[19] SAS. (2003). SAS/STAT Software, Version 9, SAS Institute, Inc., Cary.
[20] Schafer, J.L. (1997). Analysis of Incomplete Multivariate Data, CRC Press, New York. For associated multiple imputation software, see http://www.stat.psu.edu/jls/
[21] Scharfstein, D., Rotnitsky, A. & Robins, J. (1999). Adjusting for nonignorable dropout using semiparametric models, Journal of the American Statistical Association 94, 1096–1146 (with discussion).
[22] Van Buuren, S. & Oudshoorn, C.G.M. (1999). Flexible multivariate imputation by MICE, TNO Preventie en Gezondheid, Leiden, TNO/VGZ/PG 99.054. For associated software, see http://www.multiple-imputation.com.
[23] Wu, M.C. & Bailey, K.R. (1989). Estimation and comparison of changes in the presence of informative right censoring: conditional linear model, Biometrics 45, 939–955.
[24] Wu, M.C. & Carroll, R.J. (1988). Estimation and comparison of changes in the presence of informative right censoring by modeling the censoring process, Biometrics 44, 175–188.

(See also Dropouts in Longitudinal Data; Longitudinal Data Analysis)

RODERICK J. LITTLE
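The multiple-imputation combining rules described in this entry (average the M estimates; average the M variances and add 1 + 1/M times the between-imputation variance) can be sketched in a few lines of Python. This is an illustrative sketch only; the function name and all numeric inputs are made up and do not come from the entry:

```python
import statistics

def combine_imputations(estimates, variances):
    """Pool M completed-data results with the combining rules described
    above: the pooled estimate is the mean of the M estimates; the total
    variance is the mean within-imputation variance plus (1 + 1/M) times
    the between-imputation (sample) variance of the estimates."""
    m = len(estimates)
    q_bar = statistics.mean(estimates)       # pooled (multiple-imputation) estimate
    w_bar = statistics.mean(variances)       # average within-imputation variance
    b = statistics.variance(estimates)       # between-imputation variance (denominator M - 1)
    return q_bar, w_bar + (1 + 1 / m) * b    # (1 + 1/M) is the small-M correction

# Hypothetical results from M = 5 imputed data sets
pooled_est, total_var = combine_imputations(
    [2.1, 2.5, 1.9, 2.3, 2.2],
    [0.11, 0.13, 0.10, 0.12, 0.14],
)
```

With these made-up inputs the pooled estimate is 2.2 and the total variance is 0.12 + 1.2 × 0.05 = 0.18; the second term is exactly the imputation-uncertainty contribution that single imputation misses.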
Dummy Variables
JOSE CORTINA
Volume 1, pp. 522–523
VOLUME 2
Kurtosis. 1028–1029
a significant trend across weeks? These two questions pertain to post hoc comparisons and trend analysis (see Multiple Comparison Procedures), respectively. Statistically, both topics are addressed extensively in the literature [15, 21]. However, with repeated measures designs, educational psychologists must be very careful about the interpretations they make. The next two sections describe some methodological considerations.

Post Hoc Comparisons

Readers can browse any textbook containing chapters on ANOVA, and find ample information about post hoc (see A Priori v Post Hoc Testing) comparisons. These mean comparisons are made after a so-called omnibus test is significant [13]. Thus, any time there is a main effect for a factor with more than two levels, post hoc comparisons indicate where the significant difference or differences are. Post hoc comparisons control familywise error rates (see Error Rates) (the probability of a Type I error is α for the set of comparisons). For between-subjects designs, educational psychologists can choose from many post hoc comparisons (e.g., Duncan, Fisher's least significant difference (LSD), Student–Newman–Keuls, Tukey's honestly significant difference (HSD)). For repeated measures analyses, do well-known post hoc comparison procedures work? Unfortunately, the answer is not a simple one, and statisticians vary in their opinions about the best way to approach the question [11, 16]. Reasons for the complexity pertain to the methodological considerations of analyzing repeated measures data. One concern about locating mean differences relates to violations of the assumption of sphericity. Quite often, the degrees of freedom must be adjusted for the omnibus F test because the assumption is violated. While statistical adjustments such as the Greenhouse–Geisser correction assist in ... Bonferroni methods appear particularly useful and result in fewer Type I errors than other tests (e.g., Tukey) when assumptions (such as sphericity) are violated [16].

Trend Analysis

While post hoc comparison procedures are one way to examine the means for repeated measures designs, another approach to studying average performance over time is trend analysis. Tests of linear and nonlinear trends in studies of growth and development in educational psychology appear periodically [4]. Glass and Hopkins [7] indicate that as long as the repeated trials constitute an ordinal or interval scale of measurement, such as the case of weeks, tests for significant trends are appropriate. However, if the repeated factors are actually related measures, such as different subscale averages of a standardized test battery, then trend analyses are not appropriate. Thus, educational psychologists should not use trend analysis to study within-student differences on the graduate record examination (GRE) quantitative, verbal, and analytic subscales (see Growth Curve Modeling).

Figure 1 depicts a trend for hypothetical spelling data collected over a five-week period. Using standard contrasts, software programs such as SPSS [22] readily report whether the data fit linear, quadratic, cubic, or higher-order polynomial models. For the means reported in Table 1, the linear trend is significant, F(1, 9) = 21.16, p = 0.001, MSe = 1.00, partial η² = 0.70. The effect size, 0.70, is large. The data support the idea that students' spelling ability increases in a linear fashion.

Figure 1 Changes in spelling scores (estimated marginal means)
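The standard contrasts mentioned above can be made concrete with a small sketch. The weights below are the usual orthogonal polynomial coefficients for five equally spaced trials, applied to hypothetical weekly means (not the article's data); a perfectly linear series loads only on the linear contrast:

```python
# Orthogonal polynomial contrast weights for five equally spaced trials
LINEAR = [-2, -1, 0, 1, 2]
QUADRATIC = [2, -1, -2, -1, 2]

def contrast_value(means, weights):
    """Weighted sum of trial means; zero when the trend component
    represented by the weights is absent from the means."""
    return sum(w * m for w, m in zip(weights, means))

# Hypothetical, perfectly linear weekly means
means = [1.0, 2.0, 3.0, 4.0, 5.0]
linear_load = contrast_value(means, LINEAR)        # positive: increasing trend
quadratic_load = contrast_value(means, QUADRATIC)  # zero: no curvature
```

For equal group sizes n, the sum of squares for a contrast is n L²/Σw², which is what a program such as SPSS tests against the error mean square to produce F-ratios like the one reported above.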
Table 1 Means and standard deviations (SD) for hypothetical spelling data

Trial    Phonics instruction M (SD)   Control M (SD)   Total M (SD)
Week 1   1.60 (0.70)                  1.20 (0.92)      1.40 (0.82)
Week 2   3.00 (1.15)                  1.20 (0.79)      2.10 (1.33)
Week 3   4.90 (1.66)                  2.00 (1.15)      3.45 (2.04)
Week 4   6.10 (2.08)                  2.40 (1.17)      4.25 (2.51)
Week 5   7.70 (2.26)                  2.90 (1.45)      5.30 (3.08)

Table 2 Analysis of variance for instruction level (I) and trials (T)

Source of variation     SS       df    MS       F
Between subjects        322.20   19
  Instruction (I)       184.96    1    184.96   24.27a
  Students (s : I)      137.24   18      7.62
Within subjects         302.80   80
  Trials (T)            199.50    4     49.87   79.15
  I × T                  57.74    4     14.44   22.92
  T × s : I              45.56   72      0.63
Total                   625.00   99

a p < 0.0001.

The Mixed-effects Model: Adding a Between-subjects Factor

To this point, the features of a repeated measures analysis of one factor (i.e., trials) have been presented. However, educational psychologists rarely test such simple models. Instead, they often study whether instructional interventions differ and whether the differences remain constant across time [5, 6]. Suppose an experimental variable is added to the repeated measures model for the hypothetical spelling data. Assume students are now randomly assigned to ... main effect for trials; and, (c) an interaction between treatment and trials. In this analysis, the I (instruction mode) × T (trials) interaction is significant, F(4, 72) = 22.92, p < 0.0001, MSe = 0.63, partial η² = 0.55 (a large effect), as is the main effect for treatment, F(1, 18) = 24.27, p < 0.0001, MSe = 7.62, partial η² = 0.55, and trials, F(4, 72) = 79.15, p < 0.0001, MSe = 0.63, partial η² = 0.55. Statisticians recommend that researchers describe significant interactions before describing main effects because main effects for the first factor do not generalize over levels of the second factor (see Interaction Effects). Thus, even though the F-ratios are large for both main effects tests (i.e., treatment and trials), differences between treatment groups are not consistently the same for every weekly trial. Similarly, growth patterns across weekly trials are not similar for both treatment conditions.

Figure 2 illustrates the interaction visually. It displays two trend lines (Phonics and Control). Linear trends for both groups are significant. However, the slope for the Phonics Instruction group is steeper than that observed for the Control group, resulting in a significant I × T linear trend interaction, F(1, 18) = 34.99, p < 0.0001, MSe = 1.64, partial η² = 0.56. Descriptively, the results support the idea that increases in spelling ability over time are larger for the Phonics Instruction group than they are for the Control group. In fact, while hypothetical data are summarized here to show that Phonics Instruction has an effect on spelling performance when compared with a control condition, the results reflect those reported by researchers in educational psychology [24].

Figure 2 Changes in spelling scores (estimated marginal means)
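The entries in Table 2 can be cross-checked with a little arithmetic: each F-ratio is an effect mean square divided by its error mean square, and a partial η² can be computed as SS_effect/(SS_effect + SS_error). A sketch using the Table 2 sums of squares (small discrepancies from the printed values reflect rounding in the table):

```python
def partial_eta_squared(ss_effect, ss_error):
    # partial eta squared = SS_effect / (SS_effect + SS_error)
    return ss_effect / (ss_effect + ss_error)

# Sums of squares and degrees of freedom from Table 2
ss_ixt, df_ixt = 57.74, 4        # I x T interaction
ss_err, df_err = 45.56, 72       # T x s : I error term

f_ixt = (ss_ixt / df_ixt) / (ss_err / df_err)     # close to the printed 22.92
peta_ixt = partial_eta_squared(ss_ixt, ss_err)    # close to the printed 0.55
```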
can be examined in waves. That is, construct relationships can be examined for stability over time. McDonald and Ho [20] provide a good resource of recommended practices in testing SEMs. Results in testing SEMs are best when (a) researchers outline a very good theory for how constructs are related and how they will change; (b) several exploratory studies guide the model's structure; and, (c) sample sizes are sufficient to ensure that the estimates of model parameters are stable.

A Summary of Statistical Techniques Used by Educational Psychologists

Unfortunately, educational psychologists have not relied extensively on HLM and SEM analyses with repeated measures of manifest or latent variables to test their hypotheses. A review of 116 studies in Contemporary Educational Psychology and Journal of Educational Psychology between 2001 and 2003 showed that 48 articles included tests of models with within-subjects factors. Of these, only four studies tested HLM models with repeated measures with at least two time periods. Of these four studies, only one incorporated growth curve modeling to examine differences in student performance over multiple time periods. As for SEM, eight studies tested multiple waves of repeated and/or related measures. Perhaps the lack of HLM and SEM models may be due to sample size limitations. HLM and SEM model parameters are estimated using maximum likelihood procedures. Maximum likelihood procedures require large sample sizes for estimation [10]. Alternatively, HLM and SEM research may not be prevalent, since the investigations of educational psychologists are often exploratory in nature. Thus, researchers may not be ready to confirm relationships or effects using HLM or SEM models [19, 20].

Certainly, the exploratory investigations could help the researchers replicate studies that eventually lead to tests of HLM or SEM models where relationships or effects can be confirmed. Additionally, because educational psychologists are interested in individual differences, more studies should examine changes in latent and manifest variables at the student level for multiple time points [2]. Time series analysis is one statistical technique that educational psychologists can use to study developmental changes within individuals. Only 3 of the 168 studies reported use of time series analyses.

Table 3 Within-subjects methods in contemporary educational psychology research (2001–2003)

Technique                          Frequency
Repeated-measures ANOVA            11
Related-measures ANOVA             11
Reporting effect sizes              9
SEM                                 8
Regression                          5
MANOVA for repeated measures        4
Nonparametric repeated measures     4
HLM                                 4
Time series                         3
Testing of assumptions              2

Note: Frequencies represent the number of articles that used a specific technique. Some articles reported more than one technique.

Table 3 lists the methods used in the 48 within-subjects design studies. As can be seen, only two studies addressed the assumptions of repeated measures designs (e.g., normality of distributions, independence of observations, sphericity). Only 9 of the 48 reported effect sizes. In a few instances, investigators used nonparametric procedures when their scales of measurement were nominal or ordinal, or when their data violated normality assumptions.

As long as researchers in educational psychology are interested in change and development, repeated measures analysis will continue to be needed to answer their empirical questions. Methodologists should ensure that important assumptions such as sphericity and independence of observations are met. Finally, there have been many recent developments in repeated measures techniques. HLM and SEM procedures can be used to study complex variables, their relationships, and how these relationships change in time. Additionally, time series analyses are recommended for examination of within-subject changes, especially for variables that theoretically should not remain stable over time (e.g., anxiety, situational interest, selective attention). As with all quantitative research, sound theory, quality measurement, adequate sampling, and careful consideration of experimental design factors help investigators contribute useful and lasting information to their field.

Educational Psychology

References

[1] Bryk, A. & Raudenbush, S.W. (1992). Hierarchical Linear Models for Social and Behavioral Research: Applications and Data Analysis Methods, Sage, Newbury Park.
[2] Boorsboom, D., Mellenbergh, G.J. & van Heerden, J. (2003). The theoretical status of latent variables, Psychological Bulletin 110(2), 203–219.
[3] Cohen, J. (1988). Statistical Power Analysis for the Behavioral Sciences, 3rd Edition, Academic Press, New York.
[4] Compton, D.L. (2003). Modeling the relationship between growth in rapid naming speed and growth in decoding skill in first-grade children, Journal of Educational Psychology 95(2), 225–239.
[5] Desoete, A., Roeyers, H. & DeClercq, A. (2003). Can offline metacognition enhance mathematical problem solving? Journal of Educational Psychology 91(1), 188–200.
[6] Gaskill, P.J. & Murphy, P.K. (2004). Effects of a memory strategy on second-graders' performance and self-efficacy, Contemporary Educational Psychology 29(1), 27–49.
[7] Glass, G.V. & Hopkins, K.D. (1996). Statistical Methods in Education and Psychology, 3rd Edition, Allyn & Bacon, Boston.
[8] Green, L., McCutchen, D., Schwiebert, C., Quinlan, T., Eva-Wood, A. & Juelis, J. (2003). Morphological development in children's writing, Journal of Educational Psychology 95(4), 752–761.
[9] Greenhouse, S.W. & Geisser, S. (1959). On methods in the analysis of profile data, Psychometrika 24, 95–112.
[10] Guay, F., Marsh, H.W. & Boivin, M. (2003). Academic self-concept and academic achievement: developmental perspectives on their causal ordering, Journal of Educational Psychology 95(1), 124–136.
[11] Howell, D.C. (2002). Statistical Methods for Psychology, Duxbury Press, Belmont.
[12] Huynh, H. & Feldt, L. (1970). Conditions under which mean square ratios in repeated measurements designs have exact F distributions, Journal of the American Statistical Association 65, 1582–1589.
[13] Keselman, H.J., Algina, J. & Kowalchuk, R.K. (2002). A comparison of data analysis strategies for testing omnibus effects in higher-order repeated measures designs, Multivariate Behavioral Research 37(3), 331–357.
[14] Keselman, H.J., Keselman, J.C. & Shaffer, J.P. (1991). Multiple pairwise comparisons of repeated measures means under violation of multisample sphericity, Psychological Bulletin 110(1), 162–170.
[15] Kirk, R.E. (1995). Experimental Design: Procedures for the Behavioral Sciences, 3rd Edition, Brooks/Cole Publishing, Monterey.
[16] Kowalchuk, R.K. & Keselman, H.J. (2001). Mixed-model pairwise comparisons of repeated measures means, Psychological Methods 6(3), 282–296.
[17] Lumley, M.A. & Provenzano, K.M. (2003). Stress management through written emotional disclosure improves academic performance among college students with physical symptoms, Journal of Educational Psychology 95(3), 641–649.
[18] Maxwell, S.E. (1980). Pairwise multiple comparisons in repeated measures designs, Journal of Educational Statistics 5, 269–287.
[19] McDonald, R.P. (1999). Test Theory: A Unified Treatment, Lawrence Erlbaum Associates, Mahwah.
[20] McDonald, R.P. & Ho, M.-H.R. (2002). Principles and practice in reporting structural equation analyses, Psychological Methods 7(1), 64–82.
[21] Neter, J., Kutner, M.H., Nachtsheim, C.J. & Wasserman, W. (1996). Applied Linear Statistical Models, 4th Edition, Irwin, Chicago.
[22] SPSS Inc. (2001). SPSS for Windows 11.0.1, SPSS Inc., Chicago.
[23] Troia, G.A. & Whitney, S.D. (2003). A close look at the efficacy of Fast ForWord Language for children with academic weaknesses, Contemporary Educational Psychology 28, 465–494.
[24] Vandervelden, M. & Seigel, L. (1997). Teaching phonological processing skills in early literacy: a developmental approach, Learning Disability Quarterly 20, 63–81.

(See also Multilevel and SEM Approaches to Growth Curve Modeling)

JONNA M. KULIKOWICH
Effect Size Measures
ROGER E. KIRK
Volume 2, pp. 532–542
A P value only slightly larger than the level of significance is treated the same as a much larger P value. The adoption of 0.05 as the dividing point between significance and nonsignificance is quite arbitrary. The comment by Rosnow and Rosenthal [65] is pertinent, "surely, God loves the 0.06 nearly as much as the 0.05".

A fourth criticism of null-hypothesis significance testing is that it does not address the question of whether results are important, valuable, or useful, that is, their practical significance. The fifth edition of the Publication Manual of the American Psychological Association [1, pp. 25–26] explicitly recognizes this limitation of null-hypothesis significance tests:

"Neither of the two types of probability value (significance level and P value) directly reflects the magnitude of an effect or the strength of a relationship. For the reader to fully understand the importance of your findings, it is almost always necessary to include some index of effect size or strength of relationship in your Results section."

Researchers want to answer three basic questions from their research [48]: (a) Is an observed effect real or should it be attributed to chance? (b) If the effect is real, how large is it? and (c) Is the effect large enough to be useful; that is, is it practically significant? As noted earlier, null-hypothesis significance tests only address the first question. Descriptive statistics, measures of effect magnitude, and Confidence Intervals address the second question and provide a basis for answering the third question. Answering the third question, is the effect large enough to be useful or practically significant?, calls for a judgment. The judgment is influenced by a variety of considerations including the researcher's value system, societal concerns, assessment of costs and benefits, and so on. One point is evident: statistical significance and practical significance address different questions. Researchers should follow the advice of the Publication Manual of the American Psychological Association [1] and provide the reader "not only with information about statistical significance but also with enough information to assess the magnitude of the observed effect or relationship" (pp. 25–26). In the following sections, a variety of measures of effect magnitude are described that can help a researcher assess the practical significance of research results.

Effect Size

In 1969, Cohen introduced the first effect size measure that was explicitly labeled as such. His measure ...
Hedges [36] observed that all three estimators of δ (d, g′, and g) are biased. He recommended correcting g for bias as follows,

g_c = J(N − 2)g,   (7)

where J(N − 2) is the bias correction factor described in Hedges and Olkin [37]. The correction factor is approximately

J(N − 2) ≈ 1 − 3/(4N − 9),   (8)

where N = n_E + n_C. Hedges [36] showed that g_c is the unique, uniformly minimum variance-unbiased estimator of δ. He also described an approximate confidence interval for δ:

g_c − z_α/2 σ(g_c) ≤ δ ≤ g_c + z_α/2 σ(g_c),

where z_α/2 denotes the two-tailed critical value that cuts off the upper α/2 region of the standard normal distribution and

σ(g_c) = √[(n_E + n_C)/(n_E n_C) + g_c²/(2(n_E + n_C))].   (9)

Procedures for obtaining exact confidence intervals for δ using noncentral sampling distributions are described by Cumming and Finch [18].

Cohen's δ has been widely embraced by researchers because (a) it is easy to understand and interpret across different research studies, (b) the sampling distributions of estimators of δ are well understood, and (c) estimators of δ can be easily computed from t statistics and F statistics with one degree of freedom that are reported in published articles. The latter feature is particularly attractive to researchers who do meta-analyses.

The correct way to conceptualize and compute the denominator of δ can be problematic when the treatment is a classification or organismic variable [27, 32, 57]. For experiments with a manipulated treatment and random assignment of the treatment levels to participants, the computation of an effect size such as g_c is relatively straightforward. The denominator of g_c is the square root of the within-groups mean square. This mean square provides an estimate of σ that reflects the variability of observations for the full range of the manipulated treatment. However, when the treatment is an organismic variable, such as gender, boys and girls, the square root of the within-groups mean square may not reflect the variability for the full range of the treatment because it is a pooled measure of the variation of boys alone and the variation of girls alone. If there is a gender effect, the within-groups mean square reflects the variation for a partial range of the gender variable. The variation for the full range of the gender variable is given by the total mean square and will be larger than the within-groups mean square. Effect sizes should be comparable across different kinds of treatments and experimental designs. In the gender experiment, use of the square root of the total mean square to estimate σ gives an effect size that is comparable to those for treatments that are manipulated. The problem of estimating σ is exacerbated when the experiment has several treatments, repeated measures, and covariates. Gillett [27] and Olejnik and Algina [57] provide guidelines for computing effect sizes for such designs.

There are other problems with estimators of δ. For example, the three estimators, d, g′, and g, assume normality and a common standard deviation. Unfortunately, the value of the estimators is greatly affected by heavy-tailed distributions and heterogeneous standard deviations [82]. Considerable research has focused on ways to deal with these problems [6, 44, 49, 50, 82, 83]. Some solutions attempt to improve the estimation of δ, other solutions call for radically different ways of conceptualizing effect magnitude. In the next section, measures that represent the proportion of variance in the dependent variable that is explained by the variance in the independent variable are described.

Strength of Association

Another way to supplement null-hypothesis significance tests and help researchers assess the practical significance of research results is to provide a measure of the strength of the association between the independent and dependent variables. A variety of measures of strength of association are described by Carroll and Nordholm [6] and Särndal [70]. Two popular measures are omega squared, denoted by ω², for a fixed-effects (see Fixed and Random Effects) treatment, and the intraclass correlation, denoted by ρ_I, for a random-effects (see Fixed and Random Effects) treatment. A fixed-effects treatment is one in which all treatment levels about which inferences
are to be drawn are included in the experiment. A random-effects treatment is one in which the p treatment levels in the experiment are a random sample from a much larger population of P levels. For a completely randomized analysis of variance design, omega squared and the intraclass correlation are defined as

ω² = σ²_Treat/(σ²_Treat + σ²_Error) and ρ_I = σ²_Treat/(σ²_Treat + σ²_Error),

where σ²_Treat and σ²_Error denote respectively the treatment and error variance. Both omega squared and the intraclass correlation represent the proportion of the population variance in the dependent variable that is accounted for by specifying the treatment-level classification. The parameters σ²_Treat and σ²_Error for a completely randomized design are generally unknown, but they can be estimated from sample data. Estimators of ω² and ρ_I are respectively

ω̂² = [SS_Treat − (p − 1)MS_Error]/(SS_Total + MS_Error)

ρ̂_I = (MS_Treat − MS_Error)/[MS_Treat + (n − 1)MS_Error],   (10)

where SS denotes a sum of squares, MS denotes a mean square, p denotes the number of levels of the treatment, and n denotes the number of observations in each treatment level. Omega squared and the intraclass correlation are biased estimators because they are computed as the ratio of unbiased estimators. The ratio of unbiased estimators is, in general, not an unbiased estimator. Carroll and Nordholm (1975) showed that the degree of bias in ω̂² is slight.

The usefulness of Cohen's δ was enhanced because he suggested guidelines for its interpretation. On the basis of Cohen's [12] classic work, the following guidelines are suggested for interpreting omega squared:

ω² = 0.010 is a small association
ω² = 0.059 is a medium association
ω² = 0.138 or larger is a large association.   (11)

According to Sedlmeier and Gigerenzer [72] and Cooper and Findley [16], the typical strength of ...

... example, O'Grady [56] observed that ω̂² and ρ̂_I may underestimate the true proportion of explained variance. If, as is generally the case, the dependent variable is not perfectly reliable, measurement error will reduce the proportion of variance that can be explained. Years ago, Gulliksen [33] pointed out that the absolute value of the product-moment correlation coefficient, r_XY, cannot exceed (r_XX)^(1/2)(r_YY)^(1/2), where r_XX and r_YY are the reliabilities of X and Y. O'Grady [56] also criticized omega squared and the intraclass correlation on the grounds that their value is affected by the choice and number of treatment levels. As the diversity and number of treatment levels increases, the value of measures of strength of association also tends to increase. Levin [52] criticized ω̂² on the grounds that it is not very informative when an experiment contains more than two treatment levels. A large value of ω̂² simply indicates that the dependent variable for at least one treatment level is substantially different from the other levels. As is true for all omnibus measures, ω̂² and ρ̂_I do not pinpoint which treatment level(s) is responsible for a large value.

The last criticism can be addressed by computing omega squared and the intraclass correlation for two-mean contrasts as is typically done with Hedges's g_c. This solution is in keeping with the preference of many researchers to ask focused one-degree-of-freedom questions of their data [41, 66] and the recommendation of the Publication Manual of the American Psychological Association [1, p. 26]: "As a general rule, multiple degree-of-freedom effect indicators tend to be less useful than effect indicators that decompose multiple degree-of-freedom tests into meaningful one-degree-of-freedom effects, particularly when these are the results that inform the discussion."

The formulas for omega squared and the intraclass correlation can be modified to give the proportion of variance in the dependent variable that is accounted for by the ith contrast, ψ_i. The formulas for a completely randomized design are

ω̂²_Y|ψi = (SS_ψi − MS_Error)/(SS_Total + MS_Error)

(SS_ψi − MS_Error)/ ...
IY |i = , (12)
association in the journals that they examined was SS i + (n 1)MS Error
around 0.06a medium association. p
Omega squared and the intraclass correlation, like where SS i = i2 / j =1 cj2 /nj and the cj s are coef-
the measures of effect size, have their critics. For ficients that define the contrast [45]. These measures
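As a concrete numerical illustration, the omnibus estimators in (10) can be computed directly from ANOVA summary quantities. The sums of squares below are hypothetical values for a design with p = 3 treatment levels and n = 10 observations per level; this is a sketch, not a dataset from this entry.

```python
# Sketch: omega-hat squared and the intraclass correlation, equation (10),
# for a completely randomized design. All sums of squares are hypothetical.
p, n = 3, 10                            # treatment levels, observations per level
ss_treat, ss_total = 164.87, 621.47     # hypothetical sums of squares
ss_error = ss_total - ss_treat

ms_treat = ss_treat / (p - 1)           # treatment mean square
ms_error = ss_error / (p * n - p)       # error mean square, df = N - p

omega_sq = (ss_treat - (p - 1) * ms_error) / (ss_total + ms_error)
rho_i = (ms_treat - ms_error) / (ms_treat + (n - 1) * ms_error)

print(round(omega_sq, 3), round(rho_i, 3))
```

With these hypothetical numbers ω̂² ≈ 0.205, which by the guidelines in (11) would count as a large association; note that it is smaller than the raw ratio SS_Treat/SS_Total, reflecting the bias adjustment in (10).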
These measures answer focused one-degree-of-freedom questions as opposed to omnibus questions about one's data.

To determine the strength of association in experiments with more than one treatment or experiments with a blocking variable, partial omega squared can be computed. A comparison of omega squared and partial omega squared for treatment A for a two-treatment, completely randomized factorial design is

\[ \hat\omega^2_{Y|A} = \frac{\hat\sigma^2_A}{\hat\sigma^2_A + \hat\sigma^2_B + \hat\sigma^2_{AB} + \hat\sigma^2_{\text{Error}}} \]

and

\[ \hat\omega^2_{Y|A \cdot B,\,AB} = \frac{\hat\sigma^2_A}{\hat\sigma^2_A + \hat\sigma^2_{\text{Error}}}, \quad (13) \]

where partial omega squared ignores treatment B and the A × B interaction. If one or more of the variables in a multitreatment experiment is an organismic or blocking variable, Olejnik and Algina [58] show that partial omega squared is not comparable across different experimental designs. Furthermore, Cohen's guidelines for small, medium, and large effects are not applicable. They propose a measure of strength of association called generalized omega squared, denoted by ω²_G, that is appropriate, and they provide extensive formulas for its computation.

Meta-analysts often use the familiar product-moment correlation coefficient, r, to assess strength of association. The square of r, called the coefficient of determination, indicates the sample proportion of variance in the dependent variable that is accounted for by the independent variable. The product-moment correlation and its close relatives can be used with a variety of variables:

Product-moment correlation, r: X and Y are continuous and linearly related
Phi correlation, r_φ: X and Y are dichotomous
Point-biserial correlation, r_pb: X is dichotomous, Y is continuous
Spearman rank correlation, r_s: X and Y are in rank form.

The point-biserial correlation coefficient is particularly useful for answering focused questions. The independent variable is coded 0 and 1 to indicate the treatment level to which each observation belongs.

Two categories of measures of effect magnitude, measures of effect size and strength of association, have been described. Researchers are divided in their preferences for the two kinds of measures. As Table 2 shows, it is a simple matter to convert from one measure to another. Table 2 also gives formulas for converting the t statistic found in research reports into each of the measures of effect magnitude.

Other Measures of Effect Magnitude

Researchers continue to search for ways to supplement the null-hypothesis significance test and obtain a better understanding of their data. Their primary focus has been on measures of effect size and strength of association. But, as Table 1 shows, there are many other ways to measure effect magnitude. Some of the statistics in the Other measures column of Table 1 are radically different from anything described thus
far. One such measure for the two-group case is the probability of superiority, denoted by PS [31]. PS is the probability that a randomly sampled member of a population given one treatment level will have a score, Y₁, that is superior to the score, Y₂, of a randomly sampled member of another population given the other treatment level. The measure is easy to compute: PS = U/(n₁n₂), where U is the Mann-Whitney statistic (see Wilcoxon-Mann-Whitney Test) and n₁ and n₂ are the two sample sizes. The value of U indicates the number of times that the n₁ participants who are given treatment level 1 have scores that outrank those of the n₂ participants who are given treatment level 2, assuming no ties or an equal allocation of ties. An unbiased estimator of the population Pr(Y₁ > Y₂) is obtained by dividing U by n₁n₂, the number of possible comparisons of the two treatment levels. An advantage of PS according to Grissom [31] is that it does not assume equal variances and is robust to nonnormality.

The odds ratio is another example of a different way of assessing effect magnitude. It is applicable to two-group experiments when the dependent variable has only two outcomes, say, success and failure. The term odds is frequently used by those who place bets on the outcomes of sporting events. The odds that an event will occur are given by the ratio of the probability that the event will occur to the probability that the event will not occur. If an event can occur with probability p, the odds in favor of the event are p/(1 − p) to 1. For example, suppose an event occurs with probability 3/4; the odds in favor of the event are (3/4)/(1 − 3/4) = (3/4)/(1/4) = 3 to 1.

The computation of the odds ratio is illustrated using the data in Table 3, where the performance of participants in the experimental and control groups is classified as either a success or a failure. For participants in the experimental group, the odds of success are

\[ \text{Odds(Success|Exp. Grp.)} = \frac{n_{11}/(n_{11} + n_{12})}{n_{12}/(n_{11} + n_{12})} = \frac{n_{11}}{n_{12}} = \frac{43}{7} = 6.1429. \quad (14) \]

For participants in the control group, the odds of success are

\[ \text{Odds(Success|Control Grp.)} = \frac{n_{21}/(n_{21} + n_{22})}{n_{22}/(n_{21} + n_{22})} = \frac{n_{21}}{n_{22}} = \frac{27}{23} = 1.1739. \quad (15) \]

The ratio of the two odds is the odds ratio, θ̂:

\[ \hat\theta = \frac{\text{Odds(Success|Exp. Grp.)}}{\text{Odds(Success|Control Grp.)}} = \frac{n_{11}/n_{12}}{n_{21}/n_{22}} = \frac{n_{11}\,n_{22}}{n_{12}\,n_{21}} = 5.233. \quad (16) \]

In this example, the odds of success for participants in the experiment group are approximately 5 times greater than the odds of success for participants in the control group. When there is no difference between the groups in terms of odds of success, the two rows (or two columns) are proportional to each other and θ = 1. The more the groups differ, the more θ departs from 1. A value of θ less than 1 indicates reduced odds of success for the experimental participants; a value greater than 1 indicates increased odds of success for the experimental participants. The lower bound for θ̂ is 0 and occurs when n₁₁ = 0; the upper bound is arbitrarily large, in effect infinite, and occurs when n₂₁ = 0.

The probability distribution of the odds ratio is positively skewed. In contrast, the probability distribution of the natural log of θ̂, ln θ̂, is more symmetrical. Hence, when calculating a confidence interval for θ, it is customary to use ln θ̂ instead of θ̂. A 100(1 − α)% confidence interval for ln θ is given by

\[ \ln\hat\theta - z_{\alpha/2}\,\hat\sigma_{\ln\hat\theta} < \ln\theta < \ln\hat\theta + z_{\alpha/2}\,\hat\sigma_{\ln\hat\theta}, \]
where z_{α/2} denotes the two-tailed critical value that cuts off the upper α/2 region of the standard normal distribution and σ̂_{ln θ̂} denotes the standard error of ln θ̂ and is given by

\[ \hat\sigma_{\ln\hat\theta} = \sqrt{\frac{1}{n_{11}} + \frac{1}{n_{12}} + \frac{1}{n_{21}} + \frac{1}{n_{22}}}. \quad (17) \]

Once the lower and upper bounds of the confidence interval are found, the values are exponentiated to find the confidence interval for θ. The computation will be illustrated for the data in Table 3, where θ̂ = 5.233. For θ̂ = 5.233, ln θ̂ = 1.6550. A 100(1 − 0.05)% confidence interval for ln θ is

\[ 1.6550 - 1.96(0.4966) < \ln\theta < 1.6550 + 1.96(0.4966) \]
\[ 0.6817 < \ln\theta < 2.6283. \]

The confidence interval for θ is

\[ e^{0.6817} < \theta < e^{2.6283} \]
\[ 2.0 < \theta < 13.9. \]

The researcher can be 95% confident that the odds of success for participants in the experiment group are between 2.0 and 13.9 times greater than the odds of success for participants in the control group. Notice that the interval does not include 1. The odds ratio is widely used in the medical sciences, but less often in the behavioral and social sciences. Table 1 provides references for a variety of other measures of effect magnitude. Space limitations preclude an examination of other potentially useful measures of effect magnitude.

From the foregoing, the reader may have gotten the impression that small effect magnitudes are never or rarely ever important or useful. This is not true. Prentice and Miller [60] and Spencer [74] provide numerous examples in which small effect magnitudes are both theoretically and practically significant. The assessment of practical significance always involves a judgment in which a researcher must calibrate the magnitude of an effect by the benefit possibly accrued from that effect [46].

References

[1] American Psychological Association. (2001). Publication Manual of the American Psychological Association, 5th Edition, American Psychological Association, Washington.
[2] Bakan, D. (1966). The test of significance in psychological research, Psychological Bulletin 66, 423–437.
[3] Berger, J.O. & Berry, D.A. (1988). Statistical analysis and the illusion of objectivity, American Scientist 76, 159–165.
[4] Berkson, J. (1938). Some difficulties of interpretation encountered in the application of the chi-square test, Journal of the American Statistical Association 33, 526–542.
[5] Bond, C.F., Wiitala, W.L. & Richard, F.D. (2003). Meta-analysis of raw mean scores, Psychological Methods 8, 406–418.
[6] Carroll, R.M. & Nordholm, L.A. (1975). Sampling characteristics of Kelley's ε² and Hays' ω², Educational and Psychological Measurement 35, 541–554.
[7] Carver, R.P. (1978). The case against statistical significance testing, Harvard Educational Review 48, 378–399.
[8] Carver, R.P. (1993). The case against statistical significance testing, revisited, Journal of Experimental Education 61, 287–292.
[9] Chambers, R.C. (1982). Correlation coefficients from 2 × 2 tables and from biserial data, British Journal of Mathematical and Statistical Psychology 35, 216–227.
[10] Cliff, N. (1993). Dominance statistics: ordinal analyses to answer ordinal questions, Psychological Bulletin 114, 494–509.
[11] Cohen, J. (1969). Statistical Power Analysis for the Behavioral Sciences, Academic Press, New York.
[12] Cohen, J. (1988). Statistical Power Analysis for the Behavioral Sciences, 2nd Edition, Lawrence Erlbaum, Hillsdale.
[13] Cohen, J. (1990). Things I have learned (so far), American Psychologist 45, 1304–1312.
[14] Cohen, J. (1992). A power primer, Psychological Bulletin 112, 155–159.
[15] Cohen, J. (1994). The earth is round (p < 0.05), American Psychologist 49, 997–1003.
[16] Cooper, H. & Findley, M. (1982). Expected effect sizes: estimates for statistical power analysis in social psychology, Personality and Social Psychology Bulletin 8, 168–173.
[17] Cramér, H. (1946). Mathematical Methods of Statistics, Princeton University Press, Princeton.
[18] Cumming, G. & Finch, S. (2001). A primer on the understanding, use, and calculation of confidence intervals that are based on central and noncentral distributions, Educational and Psychological Measurement 61, 532–574.
[19] Dawes, R.M., Mirels, H.L., Gold, E. & Donahue, E. (1993). Equating inverse probabilities in implicit personality judgments, Psychological Science 4, 396–400.
[20] Doksum, K.A. (1977). Some graphical methods in statistics. A review and some extensions, Statistica Neerlandica 31, 53–68.
[21] Dunlap, W.P. (1994). Generalizing the common language effect size indicator to bivariate normal correlations, Psychological Bulletin 116, 509–511.
[22] Falk, R. (1998). Replication – a step in the right direction, Theory and Psychology 8, 313–321.
[23] Falk, R. & Greenbaum, C.W. (1995). Significance tests die hard: the amazing persistence of a probabilistic misconception, Theory and Psychology 5, 75–98.
[24] Fisher, R.A. (1921). On the probable error of a coefficient of correlation deduced from a small sample, Metron 1, 3–32.
[25] Frick, R.W. (1996). The appropriate use of null hypothesis testing, Psychological Methods 1, 379–390.
[26] Friedman, H. (1968). Magnitude of experimental effect and a table for its rapid estimation, Psychological Bulletin 70, 245–251.
[27] Gillett, R. (2003). The comparability of meta-analytic effect-size estimators from factorial designs, Psychological Methods 8, 419–433.
[28] Glass, G.V. (1976). Primary, secondary, and meta-analysis of research, Educational Researcher 5, 3–8.
[29] Goodman, L.A. & Kruskal, W.H. (1954). Measures of association for cross classification, Journal of the American Statistical Association 49, 732–764.
[30] Grant, D.A. (1962). Testing the null hypothesis and the strategy and tactics of investigating theoretical models, Psychological Review 69, 54–61.
[31] Grissom, R.J. (1994). Probability of the superior outcome of one treatment over another, Journal of Applied Psychology 79, 314–316.
[32] Grissom, R.J. & Kim, J.J. (2001). Review of assumptions and problems in the appropriate conceptualization of effect size, Psychological Methods 6, 135–146.
[33] Gulliksen, H. (1950). Theory of Mental Tests, Wiley, New York.
[34] Harris, R.J. (1994). ANOVA: An Analysis of Variance Primer, F. E. Peacock Publishers, Itasca.
[35] Hays, W.L. (1963). Statistics for Psychologists, Holt Rinehart Winston, New York.
[36] Hedges, L.V. (1981). Distribution theory for Glass's estimator of effect size and related estimators, Journal of Educational Statistics 6, 107–128.
[37] Hedges, L.V. & Olkin, I. (1985). Statistical Methods for Meta-Analysis, Academic Press, Orlando.
[38] Herzberg, P.A. (1969). The parameters of cross-validation, Psychometrika Monograph Supplement 16, 1–67.
[39] Hunter, J.E. (1997). Needed: a ban on the significance test, Psychological Science 8, 3–7.
[40] Jones, L.V. & Tukey, J.W. (2000). A sensible formulation of the significance test, Psychological Methods 5, 411–414.
[41] Judd, C.M., McClelland, G.H. & Culhane, S.E. (1995). Data analysis: continuing issues in the everyday analysis of psychological data, Annual Review of Psychology 46, 433–465.
[42] Kelley, T.L. (1935). An unbiased correlation ratio measure, Proceedings of the National Academy of Sciences 21, 554–559.
[43] Kendall, M.G. (1963). Rank Correlation Methods, 3rd Edition, Griffin Publishing, London.
[44] Kendall, P.C., Marss-Garcia, A., Nath, S.R. & Sheldrick, R.C. (1999). Normative comparisons for the evaluation of clinical significance, Journal of Consulting and Clinical Psychology 67, 285–299.
[45] Kirk, R.E. (1995). Experimental Design: Procedures for the Behavioral Sciences, 3rd Edition, Brooks/Cole Publishing, Monterey.
[46] Kirk, R.E. (1996). Practical significance: a concept whose time has come, Educational and Psychological Measurement 56, 746–759.
[47] Kirk, R.E. (2001). Promoting good statistical practices: some suggestions, Educational and Psychological Measurement 61, 213–218.
[48] Kirk, R.E. (2003). The importance of effect magnitude, in Handbook of Research Methods in Experimental Psychology, S.F. Davis, ed., Blackwell Science, Oxford, pp. 83–105.
[49] Kraemer, H.C. (1983). Theory of estimation and testing of effect sizes: use in meta-analysis, Journal of Educational Statistics 8, 93–101.
[50] Lax, D.A. (1985). Robust estimators of scale: finite sample performance in long-tailed symmetric distributions, Journal of the American Statistical Association 80, 736–741.
[51] Lehmann, E.L. (1993). The Fisher, Neyman-Pearson theories of testing hypotheses: one theory or two? Journal of the American Statistical Association 88, 1242–1248.
[52] Levin, J.R. (1967). Misinterpreting the significance of explained variation, American Psychologist 22, 675–676.
[53] Lord, F.M. (1950). Efficiency of Prediction When a Regression Equation From One Sample is Used in a New Sample, Research Bulletin 50-110, Educational Testing Service, Princeton.
[54] McGraw, K.O. & Wong, S.P. (1992). A common language effect size statistic, Psychological Bulletin 111, 361–365.
[55] Meehl, P.E. (1967). Theory testing in psychology and physics: a methodological paradox, Philosophy of Science 34, 103–115.
[56] O'Grady, K.E. (1982). Measures of explained variation: cautions and limitations, Psychological Bulletin 92, 766–777.
[57] Olejnik, S. & Algina, J. (2000). Measures of effect size for comparative studies: applications, interpretations, and limitations, Contemporary Educational Psychology 25, 241–286.
[58] Olejnik, S. & Algina, J. (2003). Generalized eta and omega squared statistics: measures of effect size for common research designs, Psychological Methods 8, 434–447.
[59] Preece, P.F.W. (1983). A measure of experimental effect size based on success rates, Educational and Psychological Measurement 43, 763–766.
[60] Prentice, D.A. & Miller, D.T. (1992). When small effects are impressive, Psychological Bulletin 112, 160–164.
[61] Rosenthal, R. & Rubin, D.B. (1982). A simple, general purpose display of magnitude of experimental effect, Journal of Educational Psychology 74, 166–169.
[62] Rosenthal, R. & Rubin, D.B. (1989). Effect size estimation for one-sample multiple-choice-type data: design, analysis, and meta-analysis, Psychological Bulletin 106, 332–337.
[63] Rosenthal, R. & Rubin, D.B. (1994). The counternull value of an effect size: a new statistic, Psychological Science 5, 329–334.
[64] Rosenthal, R. & Rubin, D.B. (2003). r_equivalent: a simple effect size indicator, Psychological Methods 8, 492–496.
[65] Rosnow, R.L. & Rosenthal, R. (1989). Statistical procedures and the justification of knowledge in psychological science, American Psychologist 44, 1276–1284.
[66] Rosnow, R.L., Rosenthal, R. & Rubin, D.B. (2000). Contrasts and correlations in effect-size estimation, Psychological Science 11, 446–453.
[67] Rossi, J.S. (1997). A case study in the failure of psychology as cumulative science: the spontaneous recovery of verbal learning, in What if There Were no Significance Tests?, L.L. Harlow, S.A. Mulaik & J.H. Steiger, eds, Lawrence Erlbaum, Hillsdale, pp. 175–197.
[68] Rozeboom, W.W. (1960). The fallacy of the null hypothesis significance test, Psychological Bulletin 57, 416–428.
[69] Sánchez-Meca, J., Marín-Martínez, F. & Chacón-Moscoso, S. (2003). Effect-size indices for dichotomized outcomes in meta-analysis, Psychological Methods 8, 448–467.
[70] Särndal, C.E. (1974). A comparative study of association measures, Psychometrika 39, 165–187.
[71] Schmidt, F.L. (1996). Statistical significance testing and cumulative knowledge in psychology: implications for the training of researchers, Psychological Methods 1, 115–129.
[72] Sedlmeier, P. & Gigerenzer, G. (1989). Do studies of statistical power have an effect on the power of studies? Psychological Bulletin 105, 309–316.
[73] Shaver, J.P. (1993). What statistical significance testing is, and what it is not, Journal of Experimental Education 61, 293–316.
[74] Spencer, B. (1995). Correlations, sample size, and practical significance: a comparison of selected psychological and medical investigations, Journal of Psychology 129, 469–475.
[75] Tang, P.C. (1938). The power function of the analysis of variance tests with tables and illustrations of their use, Statistics Research Memorandum 2, 126–149.
[76] Tatsuoka, M.M. (1973). An Examination of the Statistical Properties of a Multivariate Measure of Strength of Association, Contract No. OEG-5-72-0027, U.S. Office of Education, Bureau of Research, Urbana-Champaign.
[77] Thompson, B. (1998). In praise of brilliance: where that praise really belongs, American Psychologist 53, 799–800.
[78] Thompson, B. (2002). Statistical, practical, and clinical: how many kinds of significance do counselors need to consider? Journal of Counseling and Development 80, 64–71.
[79] Tukey, J.W. (1991). The philosophy of multiple comparisons, Statistical Science 6, 100–116.
[80] Wherry, R.J. (1931). A new formula for predicting the shrinkage of the coefficient of multiple correlation, Annals of Mathematical Statistics 2, 440–451.
[81] Wickens, C.D. (1998). Commonsense statistics, Ergonomics in Design 6(4), 18–22.
[82] Wilcox, R.R. (1996). Statistics for the Social Sciences, Academic Press, San Diego.
[83] Wilcox, R.R. (1997). Introduction to Robust Estimation and Hypothesis Testing, Academic Press, San Diego.
[84] Wilcox, R.R. & Muska, J. (1999). Measuring effect size: a non-parametric analogue of ω², British Journal of Mathematical and Statistical Psychology 52, 93–110.

ROGER E. KIRK
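The odds-ratio calculation in (14)-(17), including the 95% confidence interval, is easy to reproduce. A minimal sketch using the Table 3 counts (n₁₁ = 43, n₁₂ = 7, n₂₁ = 27, n₂₂ = 23):

```python
import math

# Table 3 counts: success/failure for the experimental and control groups
n11, n12, n21, n22 = 43, 7, 27, 23

odds_exp = n11 / n12                  # equation (14): about 6.1429
odds_ctl = n21 / n22                  # equation (15): about 1.1739
theta = (n11 * n22) / (n12 * n21)     # equation (16): odds ratio, about 5.233

# Equation (17): standard error of ln(theta), then a 95% CI on the log scale,
# exponentiated back to the theta scale.
se = math.sqrt(1 / n11 + 1 / n12 + 1 / n21 + 1 / n22)
z = 1.96
lo = math.exp(math.log(theta) - z * se)
hi = math.exp(math.log(theta) + z * se)

print(round(theta, 3), round(lo, 1), round(hi, 1))  # 5.233 2.0 13.9
```

The interval (2.0, 13.9) matches the worked example above and excludes 1, the no-difference value.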
Eigenvalue/Eigenvector
IAN JOLLIFFE
Volume 2, pp. 542–543
in
ISBN-13: 978-0-470-86080-9
ISBN-10: 0-470-86080-4
Editors
Quantile–Quantile Plots

An empirical quantile–quantile (EQQ) plot provides a graphical comparison between measures of location (means and medians, for instance) and spread (standard deviations, variances, ranges, etc.) for two ordered sets of observations, hence the name of the plot, where quantiles are ordered values and empirical refers to the source of the data. What one has with an EQQ plot, therefore, is the graphical equivalent of significance tests of differences in location and spread. The display itself is a form of added value scatterplot, in which the x and y axes represent the ordered values of the two sets of data.

Interpreting the resulting graph is easiest if the axes are identical, since an essential part of the plot is a 45-degree comparison line running from the bottom left-hand corner (the origin) to the upper right-hand corner of the display. Decisions about the data are made with respect to this comparison line; for instance, are the data parallel to it, or coincident with it, or is the bulk of the data above or below it? Indeed, so much information can be extracted from an EQQ plot that it is helpful to provide a summary table of data/comparison line outcomes and their statistical interpretation (see Table 1).

Three example EQQ plots are shown below. The data are taken from Minitab's Fish and Crowd datasets (the latter omitting one problematic pair of observations).

[Figure 1: EQQ plot of average nest-guarding times for male and female fish; x axis: male fish times, roughly 20–80.]

In Figure 1, the location measure (for example, the average) for the male fish guarding time is higher than the equivalent measure for the female fish, since all the data lie below the 45-degree comparison line. However, the spreads seem the same in each sample as the data are clearly parallel with the comparison line.

In Figure 2, the data are almost exactly coincident with the 45-degree comparison line, thus showing graphically that there are no differences in either location or spread of the estimates of the crowdedness of the room by male and female students.
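The construction behind an EQQ plot is simply the pairing of the ordered values of the two samples. A minimal sketch with hypothetical guarding times (stand-ins, not the Minitab data):

```python
# Sketch of empirical quantile-quantile pairing. Samples are hypothetical.
male = [34, 51, 28, 60, 45, 73, 39, 55]
female = [30, 44, 25, 52, 40, 62, 33, 47]

# Each point of the EQQ plot pairs the ith ordered male value with the
# ith ordered female value (equal sample sizes assumed here).
pairs = list(zip(sorted(male), sorted(female)))

# If every point lies below the 45-degree line (y < x), the x-sample is
# located higher, as in Figure 1.
below_line = all(y < x for x, y in pairs)
print(below_line)
```

With unequal sample sizes one would interpolate quantiles of the larger sample, a detail omitted from this sketch.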
DAVID M. EVANS
Equivalence Trials
JOHN P. HATCH
Volume 2, pp. 546–547
hypothesis, rather than conducting separate tests of the five outcome variables, α_EW = α_TW.

Years ago, researchers noticed that α_EW approximately equals k(α_TW) (e.g., 22.6% approximately equals 5(0.05) = 25.0%). Thus was born the Bonferroni correction, which adjusts the original α_TW downward so that, given the new testwise alpha level, the α_EW would be roughly equal to the original α_TW. With the present example, the testwise alpha would be set equal to 0.01, because 0.05/5 is 0.01. However, one problem with using the Bonferroni correction in this manner is that although the procedure controls the experimentwise Type I error rate, the probability of making a Type II error gets correspondingly larger with this method.

One common application of the Bonferroni correction that is more appropriate involves post hoc tests in ANOVA. When we test whether the means of more than two groups are equal, and determine that some differences exist, the question arises as to exactly which groups differ. We address this question by invoking one of the myriad post hoc test procedures (e.g., Tukey, Scheffé, Duncan).

Post hoc tests always compare one mean versus a second mean. Because the differences in two means are being tested, post hoc tests invoke a variation on the t Test invented by a Guinness brewery worker roughly a century ago. Conceptually, ANOVA post hoc tests (e.g., Tukey, Scheffé, Duncan) are t Tests with built-in variations on the Bonferroni correction being invoked so as to keep α_EW from becoming too inflated.

References

[1] Hubbard, R. & Ryan, P.A. (2000). The historical growth of statistical significance testing in psychology – and its future prospects, Educational and Psychological Measurement 60, 661–681.
[2] Huberty, C.J. (1999). On some history regarding statistical testing, in Advances in Social Science Methodology, Vol. 5, B. Thompson, ed., JAI Press, Stamford, pp. 1–23.
[3] Love, G. (November 1988). Understanding experimentwise error probability, Paper presented at the Annual Meeting of the Mid-South Educational Research Association, Louisville (ERIC Document Reproduction Service No. ED 304 451).
[4] Mulaik, S.A., Raju, N.S. & Harshman, R.A. (1997). There is a time and place for significance testing, in What if There Were no Significance Tests?, L.L. Harlow, S.A. Mulaik & J.H. Steiger, eds, Erlbaum, Mahwah, pp. 65–115.

BRUCE THOMPSON
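The arithmetic behind the entry's example, with k = 5 tests each at α_TW = 0.05, can be sketched as follows; the 22.6% figure assumes the five tests are independent.

```python
# Experimentwise Type I error rate for k independent tests at a common
# testwise alpha, plus the Bonferroni-adjusted testwise alpha.
k, alpha_tw = 5, 0.05

alpha_ew = 1 - (1 - alpha_tw) ** k   # about 0.226; roughly k * alpha_tw = 0.25
alpha_bonferroni = alpha_tw / k      # adjusted testwise alpha: 0.01

print(round(alpha_ew, 3), alpha_bonferroni)
```

Running the adjusted level back through the first formula, 1 − (1 − 0.01)⁵ ≈ 0.049, shows why the correction holds the experimentwise rate near the nominal 0.05, at the cost of power noted above.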
Estimation
SARA A. VAN DE GEER
Volume 2, pp. 549–553
small when n is large. It can be shown that S_n² is an unbiased estimator of the variance σ² = var(X). The estimator σ̂_n² is biased: it underestimates the variance. In many models, unbiased estimators do not exist. Moreover, it often heavily depends on the model under consideration whether or not an estimator is unbiased. A weaker concept is asymptotic unbiasedness (see [1]).

The mean square error of T_n as estimator of θ is

\[ \mathrm{MSE}(T_n) = E(T_n - \theta)^2. \quad (14) \]

One may decompose the MSE as

\[ \mathrm{MSE}(T_n) = \mathrm{bias}^2(T_n) + \mathrm{var}(T_n), \quad (15) \]

[Figure 2: Histogram with bandwidth h = 0.5 and true density.]

Here, (x, x + h] is the interval with left endpoint x (not included) and right endpoint x + h (included). Unfortunately, replacing P by P_n here does not work, as for h small enough, P_n(x, x + h] will be equal to zero. Therefore, instead of taking the limit as h → 0, we fix h at a (small) positive value, called the bandwidth. The estimator of f(x) thus becomes

\[ \hat f_n(x) = \frac{P_n(x, x + h]}{h} = \frac{\text{number of } X_i \in (x, x + h]}{nh}. \quad (17) \]

number of characteristics. The sample mean and sample variance are such summarizing statistics, but so is, for example, the sample median, and so on. The question arises, to what extent one can summarize data without throwing away information. For example, suppose you are given the empirical distribution function F_n, and you are asked to reconstruct the original data X_1, ..., X_n. This is not possible, since the ordering of the data is lost. However, the index i of X_i is just a label: it contains no information about the distribution P of X_i (assuming that each observation X_i comes from the same distribution, and the observations are independent). We say that the empirical distribution F_n is sufficient. More generally, a statistic T_n = T_n(X_1, ..., X_n) is called sufficient for P if the distribution of the data given the value of T_n does not depend on P. For example, it can be shown that when P is the exponential distribution with unknown intensity, then the sample mean is sufficient. When P is the normal distribution with unknown mean and variance, then the sample mean and sample variance are sufficient. Cell counts are not sufficient when, for example, P is a continuous distribution. This is because, if one only considers the cell counts, one throws away information on the distribution within a cell. Indeed, when one compares Figures 2 and 3 (recall that in Figure 3 we shifted the sample one unit to the left), one sees that, by using just 10 cells instead of 20, the strong decrease in the second half of the first cell is no longer visible.

Sufficiency depends very heavily on the model for P. Clearly, when one decides to ignore information because of a sufficiency argument, one may be ignoring evidence that the model's assumptions may be wrong. Sufficiency arguments should be treated with caution.

References

[1] Bickel, P.J. & Doksum, K.A. (2001). Mathematical Statistics, 2nd Edition, Holden-Day, San Francisco.
[2] Pareto, V. (1897). Cours d'Économie Politique, Rouge, Lausanne et Paris.

SARA A. VAN DE GEER
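The density estimator in (17) is a one-line computation. A sketch with a tiny hypothetical sample; note the half-open interval (x, x + h], which excludes the left endpoint:

```python
# Sketch of the histogram density estimator in (17):
# f_hat(x) = #{X_i in (x, x + h]} / (n * h), with a fixed bandwidth h.
def f_hat(x, sample, h):
    count = sum(1 for xi in sample if x < xi <= x + h)  # half-open interval
    return count / (len(sample) * h)

sample = [0.2, 0.6, 0.6, 1.4]  # hypothetical observations
print(f_hat(0.5, sample, h=0.5))  # counts the two 0.6's: 2 / (4 * 0.5) = 1.0
```

As the entry warns, the result depends on the bandwidth: with h = 0.1 the same point x = 0.5 already captures both 0.6's and the estimate jumps to 5.0, while many intervals are empty.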
Eta and Eta Squared
ANDY P. FIELD
Volume 2, pp. 553–554
This can also be expressed in terms of the variance of all scores: SS_T = s²_grand (N − 1).

Once a model has been fitted, this total variability can be partitioned into the variability explained by the model, and the error. The variability explained by the model (SS_M) is the sum of squared deviations of the values predicted by the model and the mean of all observations:

SS_M = \sum_{i=1}^{n} (\hat{x}_i - \bar{x}_{\text{grand}})^2.   (2)

Finally, the residual variability (SS_R) can be obtained through subtraction (SS_R = SS_T − SS_M); for a more formal explanation, see [3].

In regression models, these values can be used to calculate the proportion of variance that the model explains (SS_M/SS_T), which is known as the coefficient of determination (R²). Eta squared is the same but calculated for models on the basis of group means. The distinction is blurred because using group means to predict observed values is a special case of a regression model (see [1] and [3], and generalized linear models (GLM)).

As an example, we consider data from Davey et al. [2] who looked at the processes underlying Obsessive Compulsive Disorder by inducing negative, positive, or no mood in people and then asking

Table 1   Number of items to check generated under different moods

        Negative   Positive   None
            7          9        8
            5         12        5
           16          7       11
           13          3        9
           13         10       11
           24          4       10
           20          5       11
           10          4       10
           11          7        7
            7          9        5
Mean    12.60       7.00     8.70
s²      36.27       8.89     5.57
Grand Mean = 9.43   Grand Variance = 21.43

SS_M = \sum_{i=1}^{k} n_i (\bar{x}_i - \bar{x}_{\text{grand}})^2,   (5)

where k is the number of groups. We would get:

SS_M = 10(12.60 − 9.43)² + 10(7.00 − 9.43)² + 10(8.70 − 9.43)² = 164.87.   (6)

Eta squared is simply:

\eta^2 = \frac{SS_M}{SS_T} = \frac{164.87}{621.47} = 0.27.   (7)
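The arithmetic in equations (5) to (7) can be checked directly from the raw Table 1 scores; a minimal sketch in plain Python (the variable names are ours):

```python
# Recompute SS_T, SS_M, and eta squared from the raw Table 1 scores.
negative = [7, 5, 16, 13, 13, 24, 20, 10, 11, 7]
positive = [9, 12, 7, 3, 10, 4, 5, 4, 7, 9]
none_mood = [8, 5, 11, 9, 11, 10, 11, 10, 7, 5]

groups = [negative, positive, none_mood]
scores = [x for g in groups for x in g]
n = len(scores)                      # N = 30
grand_mean = sum(scores) / n         # 9.43

# Total variability: sum of squared deviations from the grand mean,
# equivalently the grand variance times (N - 1).
ss_t = sum((x - grand_mean) ** 2 for x in scores)

# Equation (5): SS_M = sum over groups of n_i * (group mean - grand mean)^2
ss_m = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)

eta_squared = ss_m / ss_t            # equation (7)
print(round(ss_m, 2), round(ss_t, 2), round(eta_squared, 2))
# → 164.87 621.37 0.27
```

Note that SS_T computed from the raw scores is 621.37; the 621.47 used in equation (7) comes from first rounding the grand variance to 21.43 and then multiplying by N − 1 = 29. Either way, η² rounds to 0.27.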
identifies the relevant ethical issues that may be of concern and decides what is at stake for the participant, the researcher, and the institution with which the researcher is affiliated. If there are ethical concerns, the IRB may suggest alternatives to the proposed procedures. Finally, the IRB will provide the researcher with a formal statement of what must be changed in order to receive IRB approval of the research project.

The attempt by IRBs to ensure ethical practices has caused some dissatisfaction among scientists. Since IRBs are not federal agencies but are instead created by local institutions, they have come under criticism for (a) lack of standard procedures and requirements; (b) delays in completing the review process; (c) creating the fear that IRBs will impose institutional sanctions on individual researchers; and (d) applying rules originally designed for medical studies to behavioral science research projects without acknowledging the important differences between the two. To address these concerns, IRBs should require both board members and principal investigators to undergo training in research ethics, adopt more consistent guidelines for evaluating research protocols, place limits on the power given to the IRB, include an evaluation of the technical merit of a proposal as a means of determining risk/benefit ratios, develop a series of case studies to help sensitize members of an IRB to ethical dilemmas within the social sciences and ways they may be resolved, encourage the recruitment of women, minorities, and children as research participants, adopt provisions that ensure students be given alternatives to participation in research when the research is a class requirement, and carefully review cases where a financial conflict of interest may occur [7].

basic nature of the research project and the qualifications that are needed to participate. At this stage, ethical concerns include the use of inducements and coercion, consent and alternatives to consent, institutional approval of access to participants, and rules related to using student subject pools [1]. It is important that researchers avoid hyperclaiming, in which the goals the research is likely to achieve are exaggerated. It is also important that researchers not exploit potential participants, especially vulnerable participants, by offering inducements that are difficult to refuse. At the same time, researchers must weigh the costs to the participant and provide adequate compensation for the time they spend in the research process.

Most psychological research is conducted with students recruited from university subject pools, which raises an ethical concern since the students' grades may be linked with participation. Ethical practice requires that students be given a reasonable alternative to participation in order to obtain the same credit as those who choose to participate in research. The alternatives offered must not be seen by students as either punitive or more stringent than research participation.

In the recruitment process, researchers should attempt to eliminate any potential participants who may be harmed by the research. Research protocols submitted to an IRB typically have a section in which the researcher describes this screening process and the criteria that will be used to include or exclude persons from the study. The screening process is of particular importance when using proxy decisions for incompetent persons and when conducting clinical research. On the other hand, it is important that the sample be representative of the population to which the research findings can be generalized.
and the researcher. An important distinction is made between "at risk" and "minimal risk". Minimal risk refers to a level of harm or discomfort no greater than that which the participant might expect to experience in daily life. Research that poses minimal risk to the participant is allowed greater flexibility with regard to informed consent, the use of deception, and other ethically questionable procedures, although such research should still meet methodological standards to ensure that the participants' time is not wasted.

Informed consent presents difficulties when the potential participants are children, the participants speak a different language than the experimenter, or the research is therapeutic but the participants are unable to provide informed consent. Certain research methodologies make it difficult to obtain informed consent, as when the methodology includes disguised observation or other covert methods. The omission of informed consent in covert studies can be appropriate when there is a need to protect participants from nervousness, apprehension, and in some cases criminal prosecution. Studies that blur the distinction between consent for treatment or therapy and consent for research also pose ethical problems, as can the use of a consent form that does not provide the participant with a true understanding of the research. While most psychological research includes an informed consent process, it should be noted that federal guidelines permit informed consent to be waived if (a) the research involves no more than minimal risk to the participants; (b) the waiver will not adversely affect the rights and welfare of the participants; and (c) the research could not be feasibly conducted if informed consent were required [4].

The Use of Deception in Psychological Research

At one time, deception was routinely practiced in behavioral science research, and by the 1960s research participants, usually college students, expected deception and as a result sometimes produced different results than those obtained from unsuspecting participants. In general, psychologists use deception in order to prevent participants from learning the true purpose of the study, which might in turn affect their behavior. Many forms of deception exist, including the use of an experimental confederate posing as another participant, providing false feedback to participants, presenting two related studies as unrelated, and giving incorrect information regarding stimuli. The acceptability of deception remains controversial although the practice is common. Both participants and researchers tend to conduct a kind of cost–benefit analysis when assessing the ethics of deception. Researchers tend to be more concerned about the dangers of deception than do research participants. Participants' evaluations of studies that use deception are related to the studies' scientific merit, value, methodological alternatives, discomfort experienced by the participants, and the efficacy of the debriefing procedures.

Several alternatives to using deception are available. Role-playing and simulation can be used in lieu of deception. In field research, many researchers have sought to develop reciprocal relationships with their participants in order to promote acceptance of occasional deception. Such reciprocal relationships can provide direct benefits to the participants as a result of the research process. In cases where deception is unavoidable, the method of assumed consent can be used [3]. In this approach, a sample taken from the same pool as the potential participants is given a complete description of the proposed study, including all aspects of the deception, and asked whether they would be willing to participate in the study. A benchmark of 95% agreement allows the researcher to proceed with the deception manipulation.

Avoiding Harm: Pain and Suffering

Participants' consent is typically somewhat uninformed in order to obtain valid information untainted by knowledge of the researcher's hypothesis and expectations. Because of this lack of full disclosure, it is important that the researcher ensures that no harm will come to the participant in the research process. Protection from harm is a foundational issue in research ethics. Types of harm that must be considered by the researcher include physical harm, psychological stress, feelings of having one's dignity, self-esteem, or self-efficacy compromised, or becoming the subject of legal action. Other types of potential harm include economic harm, including the imposition of financial costs to the participants, and social harms that involve negative effects on a person's interactions or relationships with others. In addition to considering the potential harm that may accrue to the research participant, the possibility of harm to the
Ethics in Research
Similarly, qualitative research poses special difficulties for maintaining privacy and confidentiality. Techniques for maintaining confidentiality include the use of pseudonyms or fictitious biographies and the coding of tapes and other data recording methods in which participant identification cannot be disguised. Also, it is the researcher's responsibility to take reasonable precautions to ensure that participants respect the privacy of other participants, particularly in research settings where others are able to observe the behavior of the participant.

Assessing Risks and Benefits

One of the responsibilities of an IRB is to ask the question: will the knowledge gained from this research be worth the inconvenience and potential cost to the participant? Both the magnitude of the benefits to the participant and the potential scientific and social value of the research must be considered [5]. Some of the potential types of benefits of psychological research are (a) an increase in basic knowledge of psychological processes; (b) improved methodological and assessment procedures; (c) practical outcomes and benefits to others; (d) benefits for the researchers, including the educational functions of research in preparing students to think critically and creatively about their field; and (e) direct, sometimes therapeutic, benefits to the participants, for example, in clinical research.

Some of the potential costs to the participant are social and physical discomfort, boredom, anxiety, stress, loss of self-esteem, legal risks, economic risks, social risks, and other aversive consequences. In general, the risks associated with the research should be considered from the perspective of the participant, the researcher, and society as a whole, and should include an awareness that the risks to the participant may come not only from the research process, but also from particular vulnerabilities of the participant or from the failure of the researcher to use appropriate strategies to reduce risk.

The IRB's job of balancing these costs and benefits is difficult since the types of costs and benefits are so varied. The deliberations of the IRB in arriving at a favorable ratio should be formed with respect to the guidelines provided in the Belmont Report, which encourages ethical review committees to examine all aspects of the research carefully and to consider, on behalf of the researcher, alternative procedures to reduce risks to the participants. The careful deliberation of the cost/benefit ratio is of particular importance in research with those unable to provide informed consent, such as the cognitively impaired; research where there is risk without direct benefit to the participant; research with such vulnerable populations as children and adolescents; and therapeutic research in which the participant in need of treatment is likely to overestimate the benefit and underestimate the risk, even when the researcher has provided a full and candid description of the likelihood of success and possible deleterious effects.

Ethical Issues in Conducting Research with Vulnerable Populations

An important ethical concern considered by IRBs is the protection of those who are not able fully to protect themselves. While determining vulnerability can be difficult, several types of people can be considered vulnerable for research purposes, including people who (a) either lack autonomy and resources or have an abundance of resources, (b) are stigmatized, (c) are institutionalized, (d) cannot speak for themselves, (e) engage in illegal activities, and (f) may be damaged by the information revealed about them as a result of the research. One of the principal groups of research participants considered to be vulnerable is children and adolescents. In addition to legal constraints on research with minors adopted by the United States Department of Health and Human Services (DHHS), ethical practices must address issues of risk and maturity, privacy and autonomy, parental permission and the circumstances in which permission can be waived, and the assent of the institution (school, treatment facility) where the research is to be conducted.

Other vulnerable groups addressed in the literature include minorities, prisoners, trauma victims, the homeless, Alzheimer's patients, gays and lesbians, individuals with AIDS and STDs, juvenile offenders, and the elderly, particularly those confined to nursing homes where participants are often submissive to authority.

Research with psychiatric patients poses a challenge to the researcher. A major ethical concern with clinical research is how to form a control group without unethically denying treatment to some participants, for example, those assigned to a placebo
control group. One alternative to placebo-controlled trials is active-controlled trials.

A number of ethical issues arise when studying families at risk and spousal abuse. It is the responsibility of the investigator to report abuse and neglect, and participants must understand that prior to giving consent. Other ethical issues include conflict between research ethics and the investigator's personal ethics, identifying problems that cannot be solved, and balancing the demands made by family members and the benefits available to them.

Alcohol and substance abusers and forensic patients present particular problems for obtaining adequate informed consent. The researcher must take into account the participant's vulnerability to coercion and competence to give consent. The experience of the investigator in dealing with alcoholics and drug abusers can be an important element in maintaining ethical standards related to coercion and competence to give consent.

One final vulnerable population addressed in the literature is the cognitively impaired. Research with these individuals raises issues involving adult guardianship laws and the rules governing proxy decisions. The question is: who speaks for the participant? Research with vulnerable participants requires the researcher to take particular care to avoid several ethical dilemmas including coercive recruiting practices, the lack of confidentiality often experienced by vulnerable participants, and the possibility of a conflict of interest between research ethics and personal ethics.

Ethical Considerations Related to Research Methodology

Ethical Issues in Conducting Field Research

Research conducted in the field confronts an additional ethical dilemma not usually encountered in laboratory studies. Often the participants are unaware that they are being studied, and therefore no contractual understanding can exist. In many field studies, especially those that involve observational techniques, informed consent may be impossible to obtain. This dilemma also exists when the distinction between participant and observer is blurred. Similarly, some laboratory experiments involving deception use procedures similar to field research in introducing the independent variable as unrelated to the experiment. Covert research that involves the observation of people in public places is not generally considered to constitute an invasion of privacy; however, it is sometimes difficult to determine when a reasonable expectation of privacy exists, for example, behavior in a public toilet. Because it is not usually possible to assess whether participants have been harmed in covert studies, opinions regarding the ethicality and legality of such methods vary markedly. Four principles that must be considered in deciding on the ethicality of covert field research are (a) the availability of alternative means for studying the same question, (b) the merit of the research question, (c) the extent to which confidentiality or anonymity can be maintained, and (d) the level of risk to the uninformed participant.

One specific type of field research warrants special ethical consideration: socially sensitive research, which is defined as research where the findings can have practical consequences for the participants. The research question, the research process, and the potential application of the research findings are particularly important in socially sensitive research. IRBs have been found to be very wary of socially sensitive research, more often finding fault with the research and overestimating the extent of risk involved as compared to their reviews of less sensitive research. Despite these difficulties, socially sensitive research has considerable potential for addressing many of society's social issues and should be encouraged.

Ethical Issues in Conducting Archival Research

Archival research can provide methodological advantages to the researcher in that unobtrusive measures are less likely to affect how participants behave. However, research involving archival data poses a problem for obtaining informed consent, since the research question may be very different from the one for which the data was originally collected. In most cases, issues of privacy do not exist since an archive can be altered to remove identifying information. A second ethical concern with archival research has to do with the possibility that those who create the archive may introduce systematic bias into the data set. This is of particular concern when the archive is written primarily from an official point of view that may not accurately represent the participants' attitudes, beliefs, or behavior.
take to avoid this ethical breach, including (a) careful acknowledgement of all sources, including secondary sources of information, (b) use of quotation marks to set off direct quotes and taking care that paraphrasing another author is not simply a minor variation of the author's own words, and (c) maintaining complete records of rough notes, drafts, and other materials used in preparing a report.

Several notorious cases, including that of Cyril Burt, have clearly demonstrated the ethical ban on the falsification and fabrication of data, as well as the misuse of statistics to mislead the reader. In addition to fabrication, it is unethical to publish, as original data, material that has been published before. It is also the ethical responsibility of the investigator to share research data for verification. While these are fairly straightforward ethical considerations, it is important to distinguish between honest errors and misconduct in statistical reporting. Currently, there are no federal guidelines that inform our understanding of the differences between common practices and actual misuse. Therefore, it is important that individual investigators consult with statisticians in order to apply the most appropriate tests to their data.

Authorship credit at the time of publication should only be taken for work actually performed and for substantial contributions to the published report. Simply holding an institutional position is not an ethical reason for being included as an author of a report. Students should be listed as the principal author of any article that is primarily based on that student's work, for example, a dissertation.

Summary and Conclusion

Ethical dilemmas often arise from a conflict of interest between the needs of the researcher and the needs of the participant and/or the public at large. A conflict of interest can occur when the researcher occupies multiple roles, for example, clinician/researcher, or within a single role such as a program evaluation researcher who experiences sponsor pressures for results that may compromise scientific rigor. In resolving ethical dilemmas, psychologists are guided in their research practices by APA guidelines as well as Federal regulations that mandate that research be approved by an Institutional Review Board or Institutional Animal Care and Use Committee. A set of heuristics that can be employed in resolving ethical conflicts includes: (a) using the ethical standards of the profession, (b) applying ethical and moral principles, (c) understanding the legal responsibilities placed upon the researcher, and (d) consulting with professional colleagues [8]. In the final analysis, the researcher's conscience determines whether the research is conducted in an ethical manner. To enhance the researcher's awareness of ethical issues, education and training programs have become increasingly available in university courses, workshops, and on governmental websites. The use of role-playing and context-based exercises, and the supervision of student research have been shown to effectively increase ethical sensitivity.

References

[1] American Psychological Association. (1982). Ethical Principles in the Conduct of Research with Human Participants, American Psychological Association, Washington.
[2] American Psychological Association. (1992). Ethical principles of psychologists & code of conduct, American Psychologist 42, 1597–1611.
[3] Cozby, P.C. (1981). Methods in Behavioral Research, Mayfield, Palo Alto.
[4] Fischman, M.W. (2000). Informed consent, in Ethics in Research with Human Participants, B.D. Sales & S. Folkman, eds, American Psychological Association, Washington, pp. 35–48.
[5] Fisher, C.B. & Fryberg, D. (1994). Participant partners: college students weigh the costs and benefits of deceptive research, American Psychologist 49, 417–427.
[6] Office for Protection From Research Risks, Protection of Human Subjects, National Commission for the Protection of Human Subjects of Biomedical and Behavioral Research. (1979). The Belmont Report: Ethical Principles and Guidelines in the Protection of Human Subjects (GPO 887-809), U.S. Government Printing Office, Washington.
[7] Rosnow, R., Rotheram-Borus, M.J., Ceci, S.J., Blanck, P.D. & Koocher, G.P. (1993). The institutional review board as a mirror of scientific and ethical standards, American Psychologist 48, 821–826.
[8] Sales, B. & Lavin, M. (2000). Identifying conflicts of interest and resolving ethical dilemmas, in Ethics in Research with Human Participants, B.D. Sales & S. Folkman, eds, American Psychological Association, Washington, pp. 109–128.
[9] Tarasoff v. Board of Regents of the University of California (1976). 17 Cal. 3d 425, 551 P.2d 334.

RICHARD MILLER
Evaluation Research
MARK W. LIPSEY AND SIMON T. TIDD
Volume 2, pp. 563–568
the units in the research sample may be any of these entities. It is not unusual for the program to deliver its services to one level with the intent of producing effects on units nested within this level. This situation occurs frequently in educational programs. A mathematics curriculum, for instance, may be implemented school wide and delivered mainly at the classroom level. The desired outcome, however, is improved math achievement for the students in those classes. Students can be sampled only by virtue of being in a classroom that is, or is not, using the curriculum of interest. Thus, the classroom is the primary sampling unit but the students clustered within the classrooms are of focal interest for the evaluation and are the primary analysis unit.

A common error is to analyze the outcome data at the student level, ignoring the clustering of students within classrooms. This error exaggerates the sample size used in the statistical analysis by counting the number of students rather than the number of classrooms that are the actual sampling unit. It also treats the student scores within each classroom as if they were independent data points when, because of the students' common classroom environment and typically nonrandom assignment to classrooms, their scores are likely to be more similar within classrooms than they would be otherwise. This situation requires the use of specialized multilevel statistical analysis models (see Linear Multilevel Models) to properly estimate the standard errors and determine the statistical significance of any effects (for further details, see [13, 19]).

Selection Bias

When an impact evaluation involves an intervention and control group that show preintervention differences on one or more variables related to an outcome of interest, the result is a postintervention difference that mimics a true intervention effect. Initial nonequivalence of this sort biases the estimate of the intervention effects and undermines the validity of the design for determining the actual program effects. This serious and unfortunately common problem is called selection bias because it occurs in situations in which units have been differentially selected into the intervention and control groups.

The best way to achieve equivalence between intervention and control groups is to randomly allocate members of a research sample to the groups (see [2] for a discussion of how to implement randomization) (see Randomization). However, when intervention and control groups cannot be formed through random assignment, evaluators may attempt to construct a matched control group by selecting either individuals or an aggregate group that is similar on a designated set of variables to those receiving the intervention. In individual matching, a partner is selected from a pool of individuals not exposed to the program who matches each individual who does receive the program. For children in a school drug prevention program, for example, the evaluator might deem the relevant matching variables to be age, sex, and family income. In this case, the evaluator might scrutinize the roster of unexposed children at a nearby school for the closest equivalent child to pair with each child participating in the program.

With aggregate matching, individuals are not matched case by case; rather, the overall distributions of values on the matching variables are made comparable for the intervention and control groups. For instance, a control group might be selected that has the same proportion of children by sex and age as the intervention group, but this may involve a 12-year-old girl and an 8-year-old boy in the control group to balance a 9-year-old girl and an 11-year-old boy in the intervention group. For both matching methods, the overall goal is to equally distribute characteristics that may impact the outcome variable. As a further safeguard, additional descriptive variables that have not been used for matching may be measured prior to intervention and incorporated in the analysis as statistical controls (discussed below).

The most common impact evaluation design is one in which the outcomes for an intervention group are compared with those of a control group selected on the basis of relevance and convenience. For a community-wide program for senior citizens, for instance, an evaluator might draw a control group from a similar community that does not have the program and is convenient to access. Because any estimate of program effects based on a simple comparison of outcomes for such groups must be presumed to include selection bias, this is a nonequivalent comparison group design.

Nonequivalent control (comparison) group designs are analyzed using statistical techniques that
attempt to control for the preexisting differences between groups. To apply statistical controls, the control variables must be measured on both the intervention and comparison groups before the intervention is administered. A significant limitation of both matched and nonequivalent comparison designs is that the evaluator generally does not know what differences there are between the groups nor which of those are related to the outcomes of interest. With relevant control variables in hand, the evaluator must conduct a statistical analysis that accounts for their influence in a way that effectively and completely removes selection bias from the estimates of program effects. Typical approaches include analysis of covariance and multiple linear regression analysis. If all the relevant control variables are included in these analyses, the result should be an unbiased estimate of the intervention effect.

An alternate approach to dealing with nonequivalence that is becoming more commonplace is selection modeling. Selection modeling is a two-stage procedure in which the first step uses relevant control variables to construct a statistical model that predicts selection into the intervention or control group. This is typically done with a specialized form of regression analysis for binary dependent variables, for example, probit or logistic regression. The results of this first stage are then used to combine all the control variables into a single composite selection variable, or propensity score (propensity to be selected into one group or the other). The propensity score is optimized to account for the initial differences between the intervention and control groups and can be used as a kind of super control variable in an analysis of covariance or multiple regression analysis. Effective selection modeling depends on the evaluator's diligence in identifying and measuring variables related to the process by which individuals select themselves (e.g., by volunteering) or are selected (e.g., administratively) into the intervention or comparison group. Several variants of selection modeling and two-stage estimation of program effects are available. These include Heckman's econometric approach [6, 7], Rosenbaum and Rubin's propensity scores [14, 15], and instrumental variables [5].

The Magnitude of Program Effects

The ability of an impact evaluation to detect and

the magnitude of those effects. Small effects are more difficult to detect than large ones and their practical significance may also be more difficult to describe. Evaluators often use an effect size statistic to express the magnitude of a program effect in a standardized form that makes it comparable across measures that use different units or different scales. The most common effect size statistic is the standardized mean difference (sometimes symbolized d), which represents a mean outcome difference between an intervention group and a control group in standard deviation units. Describing the size of a program effect in this manner indicates how large it is relative to the range of scores recorded in the study. If the mean reading readiness score for participants in a preschool intervention program is half a standard deviation larger than that of the control group, the standardized mean difference effect size is 0.50. The utility of this value is that it can be easily compared to, say, the standardized mean difference of 0.35 for a test of vocabulary. The comparison indicates that the preschool program was more effective in advancing reading readiness than in enhancing vocabulary.

Some outcomes are binary rather than a matter of degree; that is, for each participant, the outcome occurs or it does not. Examples of binary outcomes include committing a delinquent act, becoming pregnant, or graduating from high school. For binary outcomes, an odds ratio effect size is often used to characterize the magnitude of the program effect. An odds ratio indicates how much smaller or larger the odds of an outcome event are for the intervention group compared to the control group. For example, an odds ratio of 1.0 for high school graduation indicates even odds; that is, participants in the intervention group are no more and no less likely than controls to graduate. Odds ratios greater than 1.0 indicate that intervention group members are more likely to experience the outcome event; for instance, an odds ratio of 2.0 means that the odds of members of the intervention group graduating are twice as great as for members of the control group. Odds ratios smaller than 1.0 mean that they are less likely to graduate.

Effect size statistics are widely used in the meta-analysis of evaluation studies. Additional information can be found in basic meta-analysis texts such as
describe program effects depends in large part on those found in [4, 10, 16].
4 Evaluation Research
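The two effect size statistics just described can be computed directly. The sketch below is illustrative only (the function names and the numbers are ours, not data from any study cited here); it reproduces the half-standard-deviation reading readiness example and the even-odds graduation example:

```python
import math

def standardized_mean_difference(mean_t, mean_c, sd_t, sd_c, n_t, n_c):
    """Standardized mean difference d: the mean outcome difference
    between intervention and control in pooled standard deviation units."""
    pooled_var = ((n_t - 1) * sd_t**2 + (n_c - 1) * sd_c**2) / (n_t + n_c - 2)
    return (mean_t - mean_c) / math.sqrt(pooled_var)

def odds_ratio(events_t, n_t, events_c, n_c):
    """Odds ratio for a binary outcome: odds(intervention) / odds(control)."""
    odds_t = events_t / (n_t - events_t)
    odds_c = events_c / (n_c - events_c)
    return odds_t / odds_c

# Intervention mean half an SD above control gives d = 0.50.
d = standardized_mean_difference(55.0, 50.0, 10.0, 10.0, 100, 100)
print(round(d, 2))  # 0.5

# Identical graduation rates in both groups give even odds (OR = 1.0).
print(odds_ratio(60, 100, 60, 100))  # 1.0
```

An odds ratio of 2.0 arises, for instance, when the intervention group's graduation odds are 2:1 against the control group's 1:1 (e.g., 80 of 120 versus 60 of 120).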
The Practical Significance of Program Effects

Effect size statistics are useful for summarizing and comparing research findings, but they are not necessarily good guides to the practical magnitude of those effects. A small statistical effect may represent a program effect of considerable practical significance; conversely, a large statistical effect may be of little practical significance. For example, a very small reduction in the rate at which people with a particular illness are hospitalized may have important cost implications for health insurers. Statistically larger improvements in the patients' satisfaction with their care, on the other hand, may have negligible practical implications.

To appraise the practical magnitude of program effects, the statistical effect sizes must be translated into terms relevant to the social conditions the program aims to improve. For example, a common outcome measure for juvenile delinquency programs is the rate of rearrest within a given time period. If a program reduces rearrest rates by 24%, this amount can readily be interpreted in terms of the number of juveniles affected and the number of delinquent offenses prevented.

For other program effects, interpretation may not be so simple. Suppose that a math curriculum for low-performing sixth-grade students raised the mean score from 42 to 45 on the mathematics subtest of the Omnibus Test of Basic Skills, a statistical effect size of 0.30 standard deviation units. How much improvement in math skills does this represent in practical terms? Interpretation of statistical effects on outcome measures with values that are not inherently meaningful requires comparison with some external referent that puts the effect size in a practical context. With achievement tests, for instance, we might compare program effects against test norms. If the national norm on the math test is 50, the math curriculum reduced the gap between the students in the program and the norm by about 38% (from 8 points to 5), but still left them short of the average skill level.

Another referent for interpreting the practical magnitude of a program effect is a success threshold on the outcome measure. A comparison of the proportions of individuals in the intervention and control groups who exceed the threshold reveals the practical magnitude of the program effect. For example, a mental health program that treats depression might use the Beck Depression Inventory as an outcome measure. On this instrument, scores in the 17 to 20 range indicate borderline clinical depression, so one informative index of practical significance is the percent of patients with posttest scores less than 17. If 37% of the control group is below the clinical threshold at the end of the treatment period compared to 65% of the treatment group, the practical magnitude of this treatment effect can be more easily appraised than if the same difference is presented in arbitrary scale units.

Another basis of comparison for interpreting the practical significance of program effects is the distribution of effect sizes in evaluations of similar programs. For instance, a review of evaluation research on the effects of marriage counseling, or a meta-analysis of the effects of such programs, might show that the mean effect size for marital satisfaction was around 0.46, with most of the effect sizes ranging between 0.12 and 0.80. With this information, an evaluator who finds an effect size of 0.34 for a particular marriage-counseling program can recognize it as rather middling performance for a program of this type.

Statistical Power

Suppose that an evaluator has some idea of the magnitude of the effect that a program must produce to have a meaningful impact and can express it as an effect size statistic. An impact evaluation of that program should be designed so it can detect that effect size. The minimal standard for identifying an effect in a quantitative analysis is that it attains statistical significance. The probability that an estimate of the program effect based on sample data will be statistically significant when, in fact, it represents a real (population) effect of a given magnitude is called statistical power. Statistical power is a function of the effect size to be detected, the sample size, the type of statistical significance test used, and the alpha level.

Deciding the proper level of statistical power for an impact assessment is a substantive issue. If an evaluator expects that the program's statistical effects will be small and that such small effects are worthwhile, then a design powerful enough to detect them is needed. For example, the effect of an intervention that lowers automobile accident deaths by as little as 1% might be judged worth detecting
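The dependence of power on effect size and sample size can be illustrated with a minimal sketch. It treats the two-group comparison as a two-sided z test (a normal approximation that slightly understates the sample size an exact t test would require; the function names are ours):

```python
import math
from statistics import NormalDist

def power_two_group(d, n_per_group, alpha=0.05):
    """Approximate power of a two-sided z test to detect a standardized
    mean difference d with n_per_group participants in each group.
    The negligible contribution of the opposite tail is ignored."""
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    ncp = abs(d) * math.sqrt(n_per_group / 2.0)  # expected z under the alternative
    return z.cdf(ncp - z_crit)

def n_for_power(d, power=0.80, alpha=0.05):
    """Smallest per-group n whose approximate power reaches the target."""
    n = 2
    while power_two_group(d, n, alpha) < power:
        n += 1
    return n

# A small effect (d = 0.20) needs far larger groups than a medium one (d = 0.50)
# for 80% power at alpha = .05.
print(n_for_power(0.20))  # 393 per group
print(n_for_power(0.50))  # 63 per group
```

This makes concrete why an evaluator expecting small but worthwhile effects must plan for a much larger sample than one expecting medium-sized effects.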
needed for such analysis, and these analysis techniques can be combined with those for analyzing experimental designs [12]. By providing the tools to examine how, when, and where program effects are produced, evaluators avoid black box evaluations that determine only whether effects were produced.

References

[1] Baron, R.M. & Kenny, D.A. (1986). The moderator-mediator distinction in social psychological research: conceptual, strategic and statistical considerations, Journal of Personality and Social Psychology 51, 1173–1182.
[2] Boruch, R.F. (1997). Randomized Experiments for Planning and Evaluation: A Practical Guide, Sage Publications, Thousand Oaks.
[3] Cohen, J. (1988). Statistical Power Analysis for the Behavioral Sciences, 2nd Edition, Lawrence Erlbaum, Hillsdale.
[4] Cooper, H. & Hedges, L.V. (1994). The Handbook of Research Synthesis, Russell Sage Foundation, New York.
[5] Greene, W.H. (1993). Selection-incidental truncation, in Econometric Analysis, W.H. Greene, ed., Macmillan Publishing, New York, pp. 706–715.
[6] Heckman, J.J. & Holtz, V.J. (1989). Choosing among alternative nonexperimental methods for estimating the impact of social programs: the case of manpower training, Journal of the American Statistical Association 84, 862–880 (with discussion).
[7] Heckman, J.J. & Robb, R. (1985). Alternative methods for evaluating the impact of interventions: an overview, Journal of Econometrics 30, 239–267.
[8] Kline, R.B. (1998). Principles and Practice of Structural Equation Modeling, Guilford Press, New York.
[9] Kraemer, H.C. & Thiemann, S. (1987). How Many Subjects? Statistical Power Analysis in Research, Sage Publications, Newbury Park.
[10] Lipsey, M.W. & Wilson, D.B. (2001). Practical Meta-Analysis, Sage Publications, Thousand Oaks.
[11] MacKinnon, D.P. & Dwyer, J.H. (1993). Estimating mediated effects in prevention studies, Evaluation Review 17, 144–158.
[12] Muthén, B.O. & Curran, P.J. (1997). General longitudinal modeling of individual differences in experimental designs: a latent variable framework for analysis and power estimation, Psychological Methods 2(4), 371–402.
[13] Raudenbush, S.W. & Bryk, A.S. (2002). Hierarchical Linear Models: Applications and Data Analysis Methods, 2nd Edition, Sage Publications, Newbury Park.
[14] Rosenbaum, P.R. & Rubin, D.B. (1983). The central role of the propensity score in observational studies for causal effects, Biometrika 70(1), 41–55.
[15] Rosenbaum, P.R. & Rubin, D.B. (1984). Reducing bias in observational studies using subclassification on the propensity score, Journal of the American Statistical Association 79, 516–524.
[16] Rosenthal, R. (1991). Meta-Analytic Procedures for Social Research, revised ed., Sage Publications, Thousand Oaks.
[17] Schumacker, R.E. & Lomax, R.G. (1996). A Beginner's Guide to Structural Equation Modeling, Lawrence Erlbaum, Mahwah.
[18] Shadish, W.R., Cook, T.D. & Campbell, D.T. (2001). Experimental and Quasi-Experimental Designs for Generalized Causal Inference, Houghton-Mifflin, New York.
[19] Snijders, T.A.B. & Bosker, R.J. (1999). Multilevel Analysis: An Introduction to Basic and Advanced Multilevel Modeling, Sage Publications, Newbury Park.

MARK W. LIPSEY AND SIMON T. TIDD
Event History Analysis

JEROEN K. VERMUNT AND GUY MOORS

Volume 2, pp. 568–575

in Encyclopedia of Statistics in Behavioral Science (ISBN-13: 978-0-470-86080-9; ISBN-10: 0-470-86080-4)
This hazard model is not only log-linear but also proportional. In proportional hazard models, the time dependence is multiplicative (additive after taking logs) and independent of an individual's covariate values. The following section shows how to specify nonproportional log-linear hazard models by including time-covariate interactions.

The various types of continuous-time log-linear hazard models are defined by the functional form that is chosen for the time dependence, that is, for the term log h(t). In Cox's semiparametric model [3], the time dependence is left unspecified. Exponential models assume the hazard rate to be constant over time, while piecewise exponential models assume the hazard rate to be a step function of T, that is, constant within time periods. Other examples of parametric log-linear hazard models are Weibull, Gompertz, and polynomial models.

As was demonstrated by several authors (for example, see [6] or [10]), log-linear hazard models can also be defined as log-linear Poisson models, which are also known as log-rate models. Assume that we have, besides the event history information, two categorical covariates denoted by A and B. In addition, assume that the time axis is divided into a limited number of time intervals in which the hazard rate is postulated to be constant. In the first-birth example, this could be one-year intervals. The discretized time variable is denoted by T. Let h_abt denote the constant hazard rate in the t-th time interval for an individual with A = a and B = b. To see the similarity with standard log-linear models, it should be noted that the hazard rate, sometimes referred to as an occurrence-exposure rate, can also be defined as h_abt = m_abt / E_abt. Here, m_abt denotes the expected number of occurrences of the event of interest and E_abt the total exposure time in cell (a, b, t).

Using the notation of hierarchical log-linear models, the saturated model for the hazard rate h_abt can now be written as

log h_abt = u + u_a^A + u_b^B + u_t^T + u_ab^AB + u_at^AT + u_bt^BT + u_abt^ABT,   (6)

in which the u terms are log-linear parameters which are constrained in the usual way, for instance, by means of analysis-of-variance-like restrictions. Note that this is a nonproportional model because of the presence of time-covariate interactions. Restricted variants of the model described in (6) can be obtained by omitting some of the higher-order interaction terms. For example,

log h_abt = u + u_a^A + u_b^B + u_t^T   (7)

yields a model that is similar to the proportional log-linear hazard model described in (5). In addition, different types of hazard models can be obtained by the specification of the time dependence. Setting the u_t^T terms equal to zero yields an exponential model. Unrestricted u_t^T parameters yield a piecewise exponential model. Other parametric models can be approximated by defining the u_t^T terms to be some function of T. And finally, if there are as many time intervals as observed survival times and if the time dependence of the hazard rate is not restricted, one obtains a Cox regression model. Log-rate models can be estimated using standard programs for log-linear analysis or Poisson regression using E_abt as a weight or exposure vector (see [10] and generalized linear models).

Censoring

An issue that always receives a great amount of attention in discussions on event history analysis is censoring. An observation is called censored if it is known that it did not experience the event of interest during some time, but it is not known when it experienced the event. In fact, censoring is a specific type of missing data. In the first-birth example, a censored case could be a woman who is 30 years of age at the time of interview (and has no follow-up interview) and does not have children. For such a woman, it is known that she did not have a child until age 30, but it is not known whether or when she will have her first child. This is, actually, an example of what is called right censoring. Another type of censoring that is more difficult to deal with is left censoring. Left censoring means that we do not have information on the duration of nonoccurrence of the event before the start of the observation period.

As long as it can be assumed that the censoring mechanism is not related to the process under study, dealing with right-censored observations in maximum likelihood estimation of the parameters of hazard models is straightforward. Let δ_i be a censoring indicator taking the value 0 if observation i is censored and 1 if it is not censored. The contribution of case i to the likelihood function that must be
maximized when there are censored observations is

L_i = h(t_i | x_i)^{δ_i} S(t_i | x_i) = h(t_i | x_i)^{δ_i} exp( − ∫_0^{t_i} h(u | x_i) du ).   (8)

As can be seen, the likelihood contribution of a censored case equals its survival probability S(t_i | x_i), and that of a noncensored case the density f(t_i | x_i), which equals h(t_i | x_i) S(t_i | x_i).

Time-varying Covariates

experience different types of events is the use of a multiple-risk or competing-risk model. A multiple-risk variant of the hazard rate model described in (5) is

log h_d(t | x_i) = log h_d(t) + Σ_j β_jd x_ij.   (9)

Here, the index d indicates the destination state or the type of event. As can be seen, the only thing that changes compared to the single type of event situation is that we have a separate set of time and covariate effects for each type of event.
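Equation (8) translates directly into code for the simplest parametric case. The sketch below (our own toy numbers, not from the article) uses a constant hazard h, for which the integral in (8) is just h·t, and confirms that the resulting maximum likelihood estimate is the occurrence-exposure rate, events divided by total exposure:

```python
import math

def censored_loglik(times, deltas, h):
    """Sum over cases of log L_i from (8) with a constant hazard h:
    delta_i * log h - H(t_i), where the cumulative hazard is H(t) = h * t."""
    return sum(d * math.log(h) - h * t for t, d in zip(times, deltas))

def exponential_mle(times, deltas):
    """MLE of the constant hazard: number of events divided by total
    exposure time (the occurrence-exposure rate)."""
    return sum(deltas) / sum(times)

# Three observed events and one right-censored case (delta = 0).
times, deltas = [2.0, 5.0, 1.0, 4.0], [1, 1, 1, 0]
h_hat = exponential_mle(times, deltas)
print(h_hat)  # 0.25, i.e., 3 events over 12 units of exposure

# h_hat maximizes the censored log-likelihood.
assert censored_loglik(times, deltas, h_hat) > censored_loglik(times, deltas, 0.20)
assert censored_loglik(times, deltas, h_hat) > censored_loglik(times, deltas, 0.30)
```

The censored case contributes only its survival term −h·t to the log-likelihood, which is exactly the distinction equation (8) encodes through the indicator δ_i.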
covariates in the hazard model, but also by explicitly modeling their mutual interdependence.

Another application of multivariate hazard models is the analysis of dependent or clustered observations. Observations are clustered, or dependent, when there are observations from individuals belonging to the same group or when there are several similar observations per individual. Examples are the occupational careers of spouses, educational careers of brothers, child mortality of children in the same family, or, in medical experiments, measures of the sense of sight of both eyes or measures of the presence of cancer cells in different parts of the body. In fact, data on repeatable events can also be classified under this type of multivariate event history data, since in that case there is more than one observation of the same type for each observational unit as well.

The hazard rate model can easily be generalized to situations in which there are several origin and destination states and in which there may be more than one event per observational unit. The only thing that changes is that we need indices for the origin state (o), the destination state (d), and the rank number of the event (m). A log-linear hazard rate model for such a situation is

log h_m^od(t | x_i) = log h_m^od(t) + Σ_j β_jm^od x_ij.   (10)

The different types of multivariate event history data have in common that there are dependencies among the observed survival times. These dependencies may take several forms: the occurrence of one event may influence the occurrence of another event; events may be dependent as a result of common antecedents; and survival times may be correlated because they are the result of the same causal process, with the same antecedents and the same parameters determining the occurrence or nonoccurrence of an event. If these common risk factors are not observed, the assumption of statistical independence of observations is violated. Hence, unobserved heterogeneity should be taken into account.

Unobserved Heterogeneity

In the context of the analysis of survival and event history data, the problem of unobserved heterogeneity, or the bias caused by not being able to include particular important explanatory variables in the regression model, has received a great deal of attention. This is not surprising because this phenomenon, which is also referred to as selectivity or frailty, may have a much larger impact in hazard models than in other types of regression models.

We will illustrate the effects of unobserved heterogeneity with a small example. Suppose that the population under study consists of two subgroups formed by the two levels of an observed covariate A, where for an average individual with A = 2 the hazard rate is twice as high as for someone with A = 1. In addition, assume that within each of the levels of A there is (unobserved) heterogeneity in the sense that there are two subgroups within levels of A, denoted by W = 1 and W = 2, where W = 2 has a 5 times higher hazard rate than W = 1. Table 1 shows the assumed hazard rates for each of the possible combinations of A and W at four time points. As can be seen, the true hazard rates are constant over time within levels of A and W. The reported hazard rates in the columns labeled 'observed' show what happens if we cannot observe W. First, it can be seen that, despite the true rates being time constant, both for A = 1 and A = 2 the observed hazard rates decline over time. This is an illustration of the fact that unobserved heterogeneity biases the estimated time dependence in a negative direction. Second, while the ratio between the hazard rates for A = 2 and A = 1 equals the true value 2.00 at t = 0, the observed ratio declines over time (see last column). Thus, when estimating a hazard model with these observed hazard rates, we will find a smaller effect of A than the true value of (log) 2.00. Third, in order to fully describe the pattern of observed rates, we need to include a time-covariate interaction in the hazard model: the covariate effect changes (declines) over time or, equivalently, the (negative) time effect is smaller for A = 1 than for A = 2.

Unobserved heterogeneity may have different types of consequences in hazard modeling. The best-known phenomenon is the downwards bias of the duration dependence. In addition, it may bias covariate effects, time-covariate interactions, and effects of time-varying covariates. Other possible consequences are dependent censoring, dependent competing risks, and dependent observations. The common way to deal with unobserved heterogeneity is to include random effects in the model of interest (for example, see [4] and [9]).

The random-effects approach is based on the introduction of a time-constant latent covariate in the hazard model. The latent variable is assumed to have a multiplicative and proportional effect on the hazard rate, that is,

log h(t | x_i, θ_i) = log h(t) + Σ_j β_j x_ij + log θ_i.   (11)

Here, θ_i denotes the value of the latent variable for subject i. In the parametric random-effects approach, the latent variable is postulated to have a particular distributional form. The amount of unobserved heterogeneity is determined by the size of the standard deviation of this distribution: the larger the standard deviation of θ, the more unobserved heterogeneity there is.

Heckman and Singer [4] showed that the results obtained from a random-effects continuous-time hazard model can be sensitive to the choice of the functional form of the mixture distribution. They, therefore, proposed using a nonparametric characterization of the mixing distribution by means of a finite set of so-called mass points, or latent classes, whose number, locations, and weights are empirically determined (also, see [10]). This approach is implemented in the Latent GOLD software [11] for latent class analysis.

Example: First Interfirm Job Change

To illustrate the use of hazard models, we use a data set from the 1975 Social Stratification and Mobility Survey in Japan reported in Yamaguchi's [12] textbook on event history analysis. The event of interest is the first interfirm job separation experienced by the sample subjects. The time variable is measured in years. In the analysis, the last one-year time intervals are grouped together in the same way as Yamaguchi did, which results in 19 time intervals. It should be noted that, contrary to Yamaguchi, we do not apply a special formula for the computation of the exposure times for the first time interval.

Besides the time variable, denoted by T, there is information on the firm size (F). The first five categories range from small firm (1) to large firm (5). Level 6 indicates government employees. The most general log-rate model that will be used is of the form

log h_ft = u + u_f^F + u_t^T.   (12)

The log-likelihood values, the numbers of parameters, as well as the BIC¹ values for the estimated models are reported in Table 2.

Table 2  Test results for the job change example

Model                     Log-likelihood   # parameters   BIC
1. {}                          −3284             1        6576
2. {F}                         −3205             6        6456
3. {Z, F}                      −3024            24        6249
4. {Z1, Z2, F}                 −3205             8        6471
5. {Z1, Z2, Zlin, F}           −3053             9        6174

Model 1 postulates that the hazard rate depends neither on time nor on firm size, and Model 2 is an exponential survival model with firm size as a nominal predictor. The large difference in the log-likelihood values of these two models shows that the effect of firm size on the rate of job change is significant. A Cox proportional hazard model is obtained by adding an unrestricted time effect (Model 3). This model performs much better than Model 2, which indicates that there is a strong time dependence. Inspection of the estimated time dependence of Model 3 shows that the hazard rate rises in the first time periods and subsequently starts decreasing slowly (see Figure 1). Models 4 and 5 were estimated to test whether it is possible to simplify the time dependence of the hazard rate on the basis of this information. Model 4 contains only time parameters for the first and second time points, which
means that the hazard rate is assumed to be constant from time point 3 to 19. Model 5 is the same as Model 4 except that it contains a linear term to describe the negative time dependence after the second time point. The comparison between Models 4 and 5 shows that this linear time dependence of the log hazard rate is extremely important: the log-likelihood increases 97 points using only one additional parameter. Comparison of Model 5 with the less restricted Model 3 and the more restricted Model 2 shows that Model 5 captures the most important part of the time dependence. Though according to the likelihood-ratio statistic the difference between Models 3 and 5 is significant, Model 5 is the preferred model according to the BIC criterion. Figure 1 shows how Model 5 smooths the time dependence compared to Model 3.

Figure 1  Time dependence according to Model 3 and Model 5

The log-linear hazard parameter estimates for firm size obtained with Model 5 are 0.51, 0.28, 0.03, −0.01, −0.48, and −0.34, respectively.² These show that there is a strong effect of firm size on the rate of first job change: the larger the firm, the less likely an employee is to leave the firm or, in other words, the longer he will stay. Government employees (category 6) have a slightly higher (less low) hazard rate than employees of large firms (category 5).

Notes

1. BIC is defined as minus twice the log-likelihood plus ln(N) times the number of parameters, where N is the sample size (here 1782).
2. Very similar estimates are obtained with Model 3.

References

[1] Allison, P.D. (1984). Event History Analysis: Regression for Longitudinal Event Data, Sage Publications, Beverly Hills.
[2] Blossfeld, H.P. & Rohwer, G. (1995). Techniques of Event History Modeling, Lawrence Erlbaum Associates, Mahwah.
[3] Cox, D.R. (1972). Regression models and life tables, Journal of the Royal Statistical Society, Series B 34, 187–203.
[4] Heckman, J.J. & Singer, B. (1982). Population heterogeneity in demographic models, in Multidimensional Mathematical Demography, K. Land & A. Rogers, eds, Academic Press, New York.
[5] Kalbfleisch, J.D. & Prentice, R.L. (1980). The Statistical Analysis of Failure Time Data, Wiley, New York.
[6] Laird, N. & Oliver, D. (1981). Covariance analysis of censored survival data using log-linear analysis techniques, Journal of the American Statistical Association 76, 231–240.
[7] Lancaster, T. (1990). The Econometric Analysis of Transition Data, Cambridge University Press, Cambridge.
[8] Tuma, N.B. & Hannan, M.T. (1984). Social Dynamics: Models and Methods, Academic Press, New York.
[9] Vaupel, J.W., Manton, K.G. & Stallard, E. (1979). The impact of heterogeneity in individual frailty on the dynamics of mortality, Demography 16, 439–454.
[10] Vermunt, J.K. (1997). Log-Linear Models for Event Histories, Advanced Quantitative Techniques in the Social Sciences Series, Vol. 8, Sage Publications, Thousand Oaks.
[11] Vermunt, J.K. & Magidson, J. (2000). Latent GOLD 2.0 User's Guide, Statistical Innovations Inc., Belmont.
[12] Yamaguchi, K. (1991). Event History Analysis, Applied Social Research Methods, Vol. 28, Sage Publications, Newbury Park.

JEROEN K. VERMUNT AND GUY MOORS
Exact Methods for Categorical Data

SCOTT L. HERSHBERGER

Volume 2, pp. 575–580
nine deviations can take on either one of two signs (+ or −), there are 2⁹ = 512 possible allocations of signs (two to each of nine differences). The idea behind the Pitman test is to allocate signs to the deviations for all possible 512 allocations, and obtain the sum of the deviations for each allocation. The sampling distribution of the sum of the deviations under the null hypothesis is created from the resulting 512 sums. Using this distribution, we can determine the probability of obtaining a sum as great as or greater than the sum observed (i.e., S = 95) in the sample if the null hypothesis is correct.

Our interest in the probability of a sum greater than 95 follows from the teacher's specification of a one-tailed H1: the teacher expects the current class's performance to be significantly better than in previous years or, to put it another way, the new median of 80 should be significantly greater than 65. If this probability is low enough, we reject the null hypothesis that the current class's median is 65.

For example, one of the 512 allocations of signs to the deviations gives each deviation a positive sign; that is, 35, 5, 8, 10, 3, 23, 5, 6, and 11. The sum of these deviations is 101. Conversely, another one of the 512 allocations gives each of the deviations a negative sign, resulting in a sum of −101. All 512 sums contribute to the sampling distribution of S. On the basis of this sampling distribution, the sum of 95 has a probability of .026 or less of occurring if the null hypothesis is correct. Given the low probability that H0 is correct (p < .026), we reject H0 and decide that the current class median of 80 is significantly greater than the past median of 65. Maritz [13] and Sprent [17] provide detailed discussions of the Pitman test.

Fisher's Exact Test

Fisher's exact test provides an exact method for testing the null hypothesis of independence for categorical data in a 2 × 2 contingency table with both sets of marginal frequencies fixed in advance. The exact probability is calculated for a sample showing as much or more evidence for independence than that obtained. As with many exact tests, a number of samples are obtained from the data; in this case, all possible contingency tables having the specified marginal frequencies. An empirical probability distribution is constructed that reflects the probability of observing each of the contingency tables. This test was first proposed in [8], [11], and [20], and is also known as the Fisher–Irwin test and as the Fisher–Yates test. It is discussed in many sources, including [2], [5], [6], [16], and [19].

Consider the following 2 × 2 contingency table.

          B1       B2    | Totals
A1         a        b    | a + b
A2         c        d    | c + d
         a + c    b + d  | N

The null hypothesis is that the categories of A and B are independent. Under this null hypothesis, the probability p of observing any particular table with all marginal frequencies fixed follows the hypergeometric distribution (see Catalogue of Probability Density Functions):

p = [(a + c)! (b + d)! (a + b)! (c + d)!] / [N! a! b! c! d!].   (2)

This equation expresses the distribution of the four cell counts in terms of only one cell (it does not matter which). Since the marginal totals (i.e., a + c, b + d, a + b, c + d) are fixed, once the number of observations in one cell has been specified, the numbers of observations in the other three cells are not free to vary: the count for one cell determines the other three cell counts.

In order to test the null hypothesis of independence, a one-tailed P value can be evaluated as the probability of obtaining a result (a 2 × 2 contingency table with a particular distribution of observations) as extreme as the observed value in the one cell that is free to vary in one direction. That is, the probabilities obtained from the hypergeometric distribution in one tail are summed. Although in the following example the alternative hypothesis is directional, one can also perform a nondirectional Fisher exact test.

To illustrate Fisher's exact test, we analyze the results of an experiment examining the ability of a subject to discriminate correctly between two objects. The subject is told in advance exactly how many times each object will be presented and is required to make that number of identifications. This
is to ensure that the marginal frequencies remain fixed.

The results are given in the following contingency table.

          B1    B2  | Totals
A1         1     6  | 7
A2         6     1  | 7
           7     7  | 14

Factor A represents the presentation of the two objects and factor B is the subject's identification of these. The null hypothesis is that the presentation of an object is independent of its identification; that is, the subject cannot correctly discriminate between the two objects.

Using the equation for the probability from a hypergeometric distribution, we obtain the probability of observing this table with its specific distribution of observations:

p = [7! 7! 7! 7!] / [14! 1! 6! 1! 6!] = 0.014.   (3)

However, in order to evaluate the null hypothesis, in addition to the probability p = 0.014, we must also compute the probabilities for any sets of observed frequencies that are even more extreme than the observed frequencies. The only result that is more extreme is

          B1    B2  | Totals
A1         0     7  | 7
A2         7     0  | 7
           7     7  | 14

The probability of observing this contingency table is

p = [7! 7! 7! 7!] / [14! 0! 7! 0! 7!] = 0.0003.   (4)

When p = 0.014 and p = 0.0003 are added, the resulting probability of 0.0143 is the likelihood that the subject is able to discriminate between the two objects.

Prior to the widespread availability of computers, Fisher's exact test was rarely performed for sample sizes larger than the one in our example. The reason for this neglect is attributable to the intimidating number of contingency tables that could possibly be observed with the marginal totals fixed to specific values, and the necessity of computing a hypergeometric probability for each. If we arbitrarily choose cell a as the one cell of the four free to vary, the number of possible tables is determined by the range m_Low ≤ a ≤ m_High, where m_Low = max(0, a + c + b + d − N) and m_High = min(a + c + 1, b + d + 1). Applied to our example, m_Low = max(0, 0) and m_High = min(8, 8). Given a = 1, there is a range of 0 to 8 possible contingency tables, each with a different distribution of observations. When the most extreme distribution of observations results from an experiment, only one hypergeometric probability must be considered; as the results depart from extreme, additional hypergeometric probabilities must be calculated. All eight tables are given in Tables 1–8:

Table 1

          B1    B2  | Totals
A1         0     7  | 7
A2         7     0  | 7
           7     7  | 14

p = 0.0003

Table 2

          B1    B2  | Totals
A1         1     6  | 7
A2         6     1  | 7
           7     7  | 14
obtaining a set of observed frequencies that is equal p = 0.014
to or is more extreme than the set of observed
frequencies by chance alone. If we use a one-
tailed alpha of 0.05 as the criterion for reject- Table 3
ing the null hypothesis, the probability of 0.0143
suggests that the likelihood that the experimental B1 B2 | T otals
2 5 | 7
results would occur by chance alone is too small. A1
5 2 | 7
We therefore reject the null hypothesis: there is A2 |
a relation between the presentation of the objects 7 7 | 14
and the subjects correct identification of them p = 0.129
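The tail-summing procedure just described is easy to make concrete. A minimal sketch in Python (the function name and the use of `math.comb` are mine, not from the entry):

```python
from math import comb

def fisher_exact_one_tailed(a, b, c, d):
    """One-tailed Fisher exact P value for the 2x2 table [[a, b], [c, d]].

    With all margins fixed, sums the hypergeometric probabilities of the
    observed table and of every table more extreme in the same direction
    (here: a smaller count in cell a).
    """
    r1, r2 = a + b, c + d        # row totals
    c1 = a + c                   # first column total
    n = r1 + r2                  # grand total

    def hypergeom(x):
        # P(cell a = x) under independence, with margins fixed
        return comb(r1, x) * comb(r2, c1 - x) / comb(n, c1)

    return sum(hypergeom(x) for x in range(max(0, c1 - r2), a + 1))

# The discrimination experiment: observed table [[1, 6], [6, 1]]
p = fisher_exact_one_tailed(1, 6, 6, 1)
print(round(p, 4))  # 0.0146 (the entry's 0.0143 adds the rounded values 0.014 and 0.0003)
```

In practice one would typically call a library routine such as `scipy.stats.fisher_exact`, which also handles the nondirectional (two-sided) case.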
[17] Sprent, P. (1998). Data Driven Statistical Methods, Chapman & Hall, London.
[18] Sprent, P. & Smeeton, N.C. (2001). Applied Nonparametric Statistical Methods, 3rd Edition, Chapman & Hall, Boca Raton.
[19] Wickens, T.D. (1989). Multiway Contingency Tables Analysis for the Social Sciences, Lawrence Erlbaum, Hillsdale.
[20] Yates, F. (1934). Contingency tables involving small numbers and the χ2 test, Journal of the Royal Statistical Society Supplement 1, 217-245.

SCOTT L. HERSHBERGER
Expectancy Effect by Experimenters
ROBERT ROSENTHAL
Volume 2, pp. 581-582
in
ISBN-13: 978-0-470-86080-9
ISBN-10: 0-470-86080-4
Editors
fact that the only differences between the allegedly bright and dull rats were in the mind of the experimenter, those who believed their rats were brighter obtained brighter performance from their rats than did the experimenters who believed their rats were duller. Essentially the same results were obtained in a replication of this experiment employing Skinner boxes instead of mazes. Tables 1 and 2 give a brief overview of some of the procedures that have been proposed to help control the methodological problems created by experimenter expectancy effects [4].

The research literature of the experimenter expectancy effect falls at the intersection of two distinct domains of research. One of these domains is the domain of artifacts in behavioral research [4, 6, 8], including such experimenter-based artifacts as observer error, interpreter error, intentional error, effects of biosocial and psychosocial attributes, and modeling effects; and such participant-based artifacts as the perceived demand characteristics of the experimental situation, Hawthorne effects, and volunteer bias.

The other domain into which the experimenter expectancy effect simultaneously falls is the more substantive domain of interpersonal expectation effects. This domain includes the more general social psychology of the interpersonal self-fulfilling prophecy. Examples of major subliteratures of this domain include the work on managerial expectation effects in business and military contexts [2] and the effects of teachers' expectations on the intellectual performance of their students [1, 5].
Expectation
individual expectations if the random variables are independent.

More information on the topic of expectation is given in [2], [3] and [4].

References

[1] Bernoulli, D. (1738). Exposition of a new theory on the measurement of risk, Econometrica 22, 23-36.
[2] Casella, G. & Berger, R.L. (1990). Statistical Inference, Duxbury, California.
[3] DeGroot, M.H. (1986). Probability and Statistics, 2nd Edition, Addison-Wesley, Massachusetts.
[4] Mood, A.M., Graybill, F.A. & Boes, D.C. (1974). Introduction to the Theory of Statistics, McGraw-Hill, Singapore.

REBECCA WALWYN
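The truncated sentence opening this entry evidently refers to the product rule for expectations, E[XY] = E[X]E[Y] when X and Y are independent. Assuming that reading, a minimal exact check with two independent fair dice:

```python
from fractions import Fraction
from itertools import product

# Two independent fair dice: E[XY] should equal E[X] * E[Y] exactly.
faces = range(1, 7)
e_x = Fraction(sum(faces), 6)  # E[X] = 7/2
e_xy = Fraction(sum(x * y for x, y in product(faces, repeat=2)), 36)
print(e_x * e_x == e_xy)  # True
```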
Experimental Design
HAROLD D. DELANEY
Volume 2, pp. 584-586
conditions. For example, each participant can be used as his or her own control by contrasting that participant's performance under one condition with his or her performance under another condition. In many cases in psychology, the various conditions experienced by a given participant will correspond to observations at different points in time. For example, a test of clinical treatments may assess clients at each of several follow-up times. In this case, the same participants are observed multiple times. While there are obvious advantages to this procedure in terms of efficiency, conventional analysis-of-variance approaches to within-subjects designs (see Repeated Measures Analysis of Variance and Educational Psychology: Measuring Change Over Time) require that additional assumptions be made about the data. These assumptions are often violated. Furthermore, within-subjects designs require that participants have no missing data [5]. A variety of methods for dealing with these issues have been developed in recent years, including various imputation methods and hierarchical linear modeling procedures.

References

[1] Cochran, W.G. (1977). Sampling Techniques, 3rd Edition, Wiley, New York.
[2] Edgington, E.S. (1995). Randomization Tests, 3rd Edition, Dekker, New York.
[3] Fisher, R.A. (1971). Design of Experiments, Hafner, New York. (Original work published 1935).
[4] Good, P. (2000). Permutation Tests: A Practical Guide to Resampling Methods for Testing Hypotheses, Springer, New York.
[5] Maxwell, S.E. & Delaney, H.D. (2004). Designing Experiments and Analyzing Data: A Model Comparison Perspective, Erlbaum, Mahwah, NJ.
[6] Shadish, W.R., Cook, T.D. & Campbell, D.T. (2002). Experimental and Quasi-experimental Designs for Generalized Causal Inference, Houghton Mifflin, Boston.

HAROLD D. DELANEY
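The own-control idea described in the entry above can be sketched numerically: with two conditions per participant, the analysis works on within-person difference scores. All numbers below are invented for illustration:

```python
from statistics import mean, stdev
from math import sqrt

# Hypothetical scores for five participants measured under both conditions;
# each participant serves as his or her own control.
cond_a = [12, 15, 11, 14, 13]
cond_b = [10, 13, 10, 11, 12]

# Within-person difference scores remove stable between-person variation.
diffs = [a - b for a, b in zip(cond_a, cond_b)]

# Paired t statistic: mean difference divided by its standard error.
t = mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))
print(round(mean(diffs), 1), round(t, 2))  # 1.8 4.81
```

With repeated measures at several time points, the simple paired contrast gives way to the repeated measures ANOVA or hierarchical linear models mentioned in the entry.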
Exploratory Data Analysis
SANDY LOVIE
Volume 2, pp. 586-588
ubiquitous box and whisker plot is based on both the median and the midspread, while the latter measure helps determine the whisker length, which can then be useful in identifying potential outliers. Similar novelties such as the stem and leaf plot are also designed for the median rather than the mean to be read off easily, as can the upper and lower quartiles, and hence the midspread. Rather more exotic birds such as hanging or suspended rootograms and half normal plots (all used for checking the Gaussian nature of the data) were either Tukey's own creation or that of workers influenced by him, for instance, Daniel and Wood [2], whose classic study of data modeling owed much to EDA. Tukey did not neglect earlier and simpler displays such as scatterplots, although, true to his radical program, these became transformed into windows onto data shape and symmetry, robustness, and even robust differences of location and spread. A subset of these displays, residual and leverage plots, for instance, also played a key role in the revival of interest in regression, particularly in the area termed regression diagnostics (see [6]). In addition, while EDA has concentrated on relatively simple data sets and data structures, there are less well known but still provocative incursions into the partitioning of multiway and multifactor tables [3], including Tukey's rethinking of the analysis of variance (see [4]).

While EDA offered a flexibility and adaptability to knowledge-building missing from the more formal processes of statistical inference, this was achieved with a loss of certainty about the knowledge gained, provided, of course, that one believes that more formal methods do generate truth, or your money back. Tukey was less sure on this latter point in that while formal methods of inference are bolted onto EDA in the form of Confirmatory Data Analysis, or CDA (which favored the Bayesian route; see Mosteller and Tukey's early account of EDA and CDA in [8]), he never expended as much effort on CDA as he did on EDA, although he often argued that both should run in harness, with the one complementing the other. Thus, for Tukey (and EDA), truth gained by such an epistemologically inductive method was always going to be partial, local and relative, and liable to fundamental challenge and change. On the other hand, without taking such risks, nothing new could emerge. What seemed to be on offer, therefore, was an entrepreneurial view of statistics, and science, as permanent revolution.

Although EDA broke cover over twenty-five years ago with the simultaneous publication of the two texts on EDA and regression [10, 9], it is still too early to say what has been the real impact of EDA on data analysis. On the one hand, many of EDA's graphical novelties have been incorporated into most modern texts and computer programs in statistics, as have the raft of robust measures on offer. On the other, the risky approach to statistics adopted by EDA has been less popular, with deduction-based inference taken to be the only way to discover the world. However, while one could point to underlying and often contradictory cultural movements in scientific belief and practice to account for the less than wholehearted embrace of Tukey's ideas, the future of statistics lies with EDA.

References

[1] Andrews, D.F., Bickel, P.J., Hampel, F.R., Huber, P.J., Rogers, W.H. & Tukey, J.W. (1972). Robust Estimates of Location: Survey and Advances, Princeton University Press, Princeton.
[2] Daniel, C. & Wood, F.S. (1980). Fitting Equations to Data, 2nd Edition, Wiley, New York.
[3] Hoaglin, D.C., Mosteller, F. & Tukey, J.W., eds (1985). Exploring Data Tables, Trends and Shapes, John Wiley, New York.
[4] Hoaglin, D.C., Mosteller, F. & Tukey, J.W. (1992). Fundamentals of Exploratory Analysis of Variance, John Wiley, New York.
[5] Lovie, P. (1986). Identifying outliers, in New Developments in Statistics for Psychology and the Social Sciences, A.D. Lovie, ed., BPS Books & Routledge, London.
[6] Lovie, P. (1991). Regression diagnostics: a rough guide to safer regression, in New Developments in Statistics for Psychology and the Social Sciences, Vol. 2, P. Lovie & A.D. Lovie, eds, BPS Books & Routledge, London.
[7] Lovie, A.D. & Lovie, P. (1998). The social construction of outliers, in The Politics of Constructionism, I. Velody & R. Williams, eds, Sage, London.
[8] Mosteller, F. & Tukey, J.W. (1968). Data analysis including statistics, in Handbook of Social Psychology, G. Lindzey & E. Aronson, eds, Addison-Wesley, Reading.
[9] Mosteller, F. & Tukey, J.W. (1977). Data Analysis and Regression: A Second Course in Statistics, Addison-Wesley, Reading.
[10] Tukey, J.W. (1977). Exploratory Data Analysis, Addison-Wesley, Reading.
[11] Velleman, P.F. & Hoaglin, D.C. (1981). Applications, Basics and Computing of Exploratory Data Analysis, Duxbury Press, Boston.

SANDY LOVIE
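The resistant summaries the entry above is built on (median, quartiles, midspread, and the whisker-length rule for flagging potential outliers) are easy to compute directly. A small sketch using the common 1.5 x midspread fences (a standard convention, not a formula quoted in the entry; quartiles here are medians of the lower and upper halves):

```python
from statistics import median

data = sorted([2, 3, 3, 4, 5, 5, 6, 7, 8, 30])

# Quartiles as medians of the lower and upper halves of the data.
mid = len(data) // 2
q1 = median(data[:mid])
q3 = median(data[mid + (len(data) % 2):])
midspread = q3 - q1  # the interquartile range

# Points beyond 1.5 midspreads of the quartiles are flagged as
# potential outliers; this also bounds the whisker length.
lo, hi = q1 - 1.5 * midspread, q3 + 1.5 * midspread
outliers = [x for x in data if x < lo or x > hi]
print(median(data), midspread, outliers)  # 5.0 4 [30]
```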
External Validity
DAVID C. HOWELL
Volume 2, pp. 588-591
attendance, but that does not necessarily mean that they improve the student's behavior in school, whether the student pays attention in class, whether the student masters the material, and so on. When we cannot generalize from one reasonable and desirable outcome variable to another, we compromise the external validity of the study.

Interaction of treatment outcome with treatment variation
Often, the independent variable that we would like to study in an experiment is not easy to clearly define, and we select what we hope is an intelligent operationalization of that variable. (For example, we might wish to show that increasing the attention an adolescent receives from his or her peers will modify an undesirable behavior. However, you just need to think of the numerous ways we can pay attention to someone to understand the problem.) Similarly, an experimental study might use a multifaceted treatment, but when others in the future attempt to apply that in their particular setting, they may find that they only have the resources to implement some of the facets of the treatment. In these cases, the external validity of the study, involving its ability to generalize to other settings, may be compromised.

Context-dependent mediation
Many causal relationships between an independent and a dependent variable are mediated by the presence or absence of another variable. For example, the degree to which your parents allowed you autonomy when you were growing up might affect your level of self-confidence, and that self-confidence might in turn influence the way you bring up your own children. In this situation, self-confidence is a mediating variable. The danger in generalizing from one experimental context to another involves the possibility that the mediating variable does not have the same influence in all contexts. For example, it is easy to believe that the mediating role of self-confidence may be quite different in children brought up under conditions of severe economic deprivation than in children brought up in a middle-class family.

Other writers have proposed additional threats to external validity, and these are listed below. In general, any factors that can compromise our ability to generalize from the results of an experimental manipulation are threats to external validity.

Interaction of history and treatment
Occasionally, the events taking place outside of the experiment influence the results. For example, experiments that happened to be conducted on September 11, 2001 may very well have results that differ from the results expected on any other day. Similarly, a study of the effects of airport noise may be affected by whether that issue has recently been widely discussed in the daily press.

Pretest-treatment interaction
Many experiments are designed with a pretest, an intervention, and a posttest. In some situations, it is reasonable to expect that the pretest will sensitize participants to the experimental treatment or cause them to behave in a particular way (perhaps giving them practice on items to be included in the posttest). In this case, we would have difficulty generalizing to those who received the intervention but had not had a pretest.

Multiple-treatment interference
Some experiments are designed to have participants experience more than one treatment (hopefully in random order). In this situation, the response to one treatment may depend on the other treatments the individual has experienced, thus limiting generalizability.

Specificity of variables
Unless variables are clearly described and operationalized, it may be difficult to replicate the setting and procedures in a subsequent implementation of the intervention. This is one reason why good clinical psychological research often involves a very complete manual on the implementation of the intervention.

Experimenter bias
This threat is a threat to both internal validity and external validity. If even good experimenters have a tendency to see what they
expect to see, the results that they find in one setting, with one set of expectations, may not generalize to other settings.

Reactivity effects
A classic study covered in almost any course on experimental design involves what is known as the Hawthorne effect. This is often taken to refer to the fact that even knowing that you are participating in an experiment may alter your performance. Reactivity effects refer to the fact that participants often react to the very existence of an experiment in ways that they would not otherwise react.

For other threats to invalidity, see the discussion in the entry on internal validity. For approaches to dealing with these threats, see Nonequivalent Group Design, Regression Discontinuity Design, and, particularly, Quasi-experimental Designs.

DAVID C. HOWELL
Face-to-Face Surveys
JAMES K. DOYLE
Volume 2, pp. 593-595
of reflection or a search for personal records are better handled by the self-paced format of a mail survey.

Perhaps the largest cost associated with a face-to-face survey is the increased burden placed on the researcher to ensure that the interviewers who are collecting the data do not introduce interviewer bias, that is, do not, through their words or actions, unintentionally influence respondents to answer in a particular way. While interviewer bias is also a concern in telephone surveys, it poses even more of a problem in face-to-face surveys for two reasons. First, the interviewer is exposed to the potentially biasing effect of the respondent's appearance and environment in addition to their voice. Second, the interviewer may inadvertently give respondents nonverbal as well as verbal cues about how they should respond. Interviewing skills do not come naturally to people because a standardized interview violates some of the normative rules of efficient conversation. For instance, interviewers must read all questions and response options exactly as written rather than paraphrasing them, since even small changes in wording have the potential to influence survey outcomes. Interviewers also have to ask a question even when the respondent has already volunteered the answer. To reduce bias as well as to avoid interviewer effects, that is, the tendency for the data collected by different interviewers to differ due to procedural inconsistency, large investments must typically be made in providing interviewers the necessary training and practice. Data analyses of face-to-face surveys should also examine and report on any significant interviewer effects identified in the data.

In summary, face-to-face surveys offer many advantages over mail and telephone surveys in terms of the complexity and quality of the data collected, but these advantages come with significantly increased logistical costs as well as additional potential sources of response bias. The costs are in fact so prohibitive that face-to-face surveys are typically employed only when telephone surveys are impractical, for example, when the questionnaire is too long or complex to deliver over the phone or when a significant proportion of the population of interest lacks telephone access.

Further Reading

Czaja, R. & Blair, J. (1996). Designing Surveys: A Guide to Decisions and Procedures, Pine Forge Press, Thousand Oaks.
De Leeuw, E.D. & van der Zouwen, J. (1988). Data quality in telephone and face to face surveys: a comparative meta-analysis, in Telephone Survey Methodology, R.M. Groves, P.N. Biemer, L.E. Lyberg, J.T. Massey, W.L. Nichols II & J. Waksberg, eds, Wiley, New York, pp. 283-299.
Dillman, D.A. (2000). Mail and Internet Surveys: The Tailored Design Method, 2nd Edition, Wiley, New York.
Fowler, F.J. (1990). Standardized Survey Interviewing: Minimizing Interviewer-Related Error, Sage, Newbury Park.
Fowler Jr, F.J. (2002). Survey Research Methods, 3rd Edition, Sage, Thousand Oaks.
Groves, R.M. (1989). Survey Errors and Survey Costs, Wiley, New York.
Groves, R.M. & Kahn, R.L. (1979). Surveys by Telephone: A National Comparison with Personal Interviews, Academic Press, Orlando.
Hyman, H., Feldman, J. & Stember, C. (1954). Interviewing in Social Research, University of Chicago Press, Chicago.

JAMES K. DOYLE
Facet Theory
INGWER BORG
Volume 2, pp. 595-599
of items and to systematically construct a sample of items that belong to this universe. Mapping sentences are, however, not a tool that automatically yields meaningful results. One needs solid substantive knowledge and clear semantics to make them work.

Correspondence Hypotheses Relating Design to Data

A common range of items gives rise to certain monotonicity hypotheses. For intelligence items, Guttman [6] predicts that they correlate nonnegatively among each other, which is confirmed for the case shown in Table 1. This first law of intelligence is a well-established empirical regularity for the universe of intelligence items. A similar law holds for attitude items.

A more common-place hypothesis in FT is to check whether the various facets of the study's design show up, in one way or another, in the structure of the data. More specifically, one often uses the discrimination hypothesis that the distinctions a facet makes for the types of observations should be reflected in differences of the data for these types. Probably, the most successful variant of this hypothesis is that the Q-facets should allow one to partition a space that represents the items' empirical similarities into simple regions. Figure 1 shows three prototypical patterns that often result in this context. The space here could be a multidimensional scaling (MDS) representation of the intercorrelations of a battery of items. The plots are facet diagrams, where the points are replaced by the element (struct) that the item represented by a particular point has on a particular facet. Hence, if we look at the configuration in terms of facet 1 (left panel), the points form a pattern that allows us to cut the plane into parallel stripes. If the facet were ordered so that a < b < c, this axial pattern leads to an embryonic dimensional interpretation of the X-axis. The other two panels of Figure 1 show patterns that are also often found in real applications, that is, a modular and a polar regionalization, respectively. Combined, these prototypes give rise to various partitionings such as the radex (a modular facet combined with a polar facet) or the duplex (two axial facets). Each such pattern is found by partitioning the space facet by facet.

A third type of correspondence hypothesis used in FT exists for design (or data) structuples whose facets are all ordered in a common sense. Elizur [3], for example, asked persons to indicate whether they were
[Figure 1: facet diagrams showing three prototypical partitioning patterns of an MDS plane: axial, modular, and polar regionalizations; points are labeled by their facet elements a, b, c]
concerned or not about losing certain features of their job after the introduction of computers to their workplace. The design of this study was: Person (p) is concerned about losing A, B, C, D, where A = interest (yes/no), B = experience (yes/no), C = stability (yes/no), and D = employment (yes/no). Each person generated an answer profile with four elements such as 1101 or 0100, where 1 = yes and 0 = no. Table 2 lists 98% of the observed profiles. We now ask whether these profiles form a Guttman scale [4] or, if not, whether they form a partial order with nontrivial properties. For example, the partial order may be flat in the sense that it can be represented in a plane such that its Hasse diagram has no paths that cross each other outside of common points (diamond). If so, it can be explained by two dimensions [10].

A soft variant of relating structuples to data is the hypothesis that underlies multidimensional scalogram analysis or MSA [9]. It predicts that given design (or data) structuples can be placed into a multidimensional space of given dimensionality so that this space can be partitioned, facet by facet, into simple regions as shown, for example, in Figure 1. (Note though that MSA does not involve first computing any overall proximities. It also places no requirements on the scale level of the facets of the structuples.)

Facet-theoretical Data Analysis

Data that are generated within a facet-theoretical design are often analyzed by special data analytic methods or by traditional statistics used in particular ways. An example for the latter is given in Figure 2.
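The Guttman-scale question raised above can be checked mechanically: a set of yes/no profiles forms a Guttman scale exactly when every pair of profiles is comparable under item-by-item dominance, that is, when the partial order is a chain. A small sketch (the profiles below are invented for illustration, not Elizur's data):

```python
from itertools import combinations

# Hypothetical yes/no answer profiles (1 = yes), one per observed pattern.
profiles = ["1111", "1101", "1100", "0100", "0000"]

def comparable(p, q):
    """True if one profile dominates the other on every item."""
    return (all(a >= b for a, b in zip(p, q))
            or all(a <= b for a, b in zip(p, q)))

# A Guttman scale is a chain: every pair of profiles must be comparable.
is_scale = all(comparable(p, q) for p, q in combinations(profiles, 2))
print(is_scale)  # True; an incomparable pair such as 1010/0101 would break it
```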
Figure 2 MDS representation of correlations of Table 1 (left panel) and partitioning by facets language and operation (right panel)
B are very high on at least one of the base dimensions. C, in contrast, is attenuating so that persons high on C are relatively similar on their X and Y coordinates. Both secondary facets thus generate additional cutting points and thereby more intervals on the base dimensions.

HUDAP also contains programs for doing MSA. MSA is seldom used in practice, because its solutions are rather indeterminate [2]. They can be transformed in many ways that can radically change their appearance, which makes it difficult to interpret and to replicate them. One can, however, make MSA solutions more robust by enforcing additional constraints such as linearity onto the boundary lines or planes [9]. Such constraints are not related to content and are for that reason rejected by many facet theorists, who prefer data analysis that is as intrinsic [7] as possible. On the other hand, if a good MSA solution is found under extrinsic side constraints, it certainly also exists for the softer intrinsic model.

(See also Multidimensional Unfolding)

References

[1] Amar, R. & Toledano, S. (2001). HUDAP Manual, Hebrew University of Jerusalem, Jerusalem.
[2] Borg, I. & Shye, S. (1995). Facet Theory: Form and Content, Sage, Newbury Park.
[3] Elizur, D. (1970). Adapting to Innovation: A Facet Analysis of the Case of the Computer, Jerusalem Academic Press, Jerusalem.
[4] Guttman, L. (1944). A basis for scaling qualitative data, American Sociological Review 9, 139-150.
[5] Guttman, L. (1965). A faceted definition of intelligence, Scripta Hierosolymitana 14, 166-181.
[6] Guttman, L. (1991a). Two structural laws for intelligence tests, Intelligence 15, 79-103.
[7] Guttman, L. (1991b). Louis Guttman: In Memoriam: Chapters from an Unfinished Textbook on Facet Theory, Israel Academy of Sciences and Humanities, Jerusalem.
[8] Levy, S. & Guttman, L. (1985). The partial-order of severity of thyroid cancer with the prognosis of survival, in Ins and Outs of Solving Problems, J.F. Marchotorchino, J.-M. Proth & J. Jansen, eds, Elsevier, Amsterdam, pp. 111-119.
[9] Lingoes, J.C. (1968). The multivariate analysis of qualitative data, Multivariate Behavioral Research 1, 61-94.
[10] Shye, S. (1985). Multiple Scaling, North-Holland, Amsterdam.

INGWER BORG
Factor Analysis: Confirmatory
BARBARA M. BYRNE
Volume 2, pp. 599-606
[Figure 1: hypothesized four-factor CFA model. Ellipses represent the factors Physical SC (Appearance) (F1), Physical SC (Ability) (F2), Social SC (Peers) (F3), and Social SC (Parents) (F4); rectangles represent the SDQ-I items (SDQ1-SDQ66), each with an associated error term (E1-E66); the first loading on each factor is fixed to 1.0]
latent factors, a square (or rectangle) representing Structural Equation Specification of the Model
observed variables, a single-headed arrow (>) rep-
resenting the impact of one variable on another, and a From a review of Figure 1, you will note that each
double-headed arrow (<>) representing covariance observed variable is linked to its related factor by a
between pairs of variables. In building a CFA model, single-headed arrow pointing from the factor to the
researchers use these symbols within the framework observed variable. These arrows represent regression
of three basic configurations, each of which repre- paths and, as such, imply the influence of each factor
sents an important component in the analytic process. in predicting its set of observed variables. Take, for
We turn now to the CFA model presented in Figure 1, example, the arrow pointing from Physical SC (Abil-
which represents the postulated four-factor struc- ity) to SDQ1. This symbol conveys the notion that
ture of nonacademic self-concept (SC) as tapped by responses to Item 1 of the SDQ-I assessment measure
items comprising the Self Description Questionnaire- are caused by the underlying construct of physi-
I (SDQ-I; [15]). As defined by the SDQ-I, nonacademic SC embraces the constructs of physical and social SCs.

On the basis of the geometric configurations noted above, decomposition of this CFA model conveys the following information: (a) there are four factors, as indicated by the four ellipses labeled Physical SC (Appearance; F1), Physical SC (Ability; F2), Social SC (Peers; F3), and Social SC (Parents; F4); (b) the four factors are intercorrelated, as indicated by the six two-headed arrows; (c) there are 32 observed variables, as indicated by the 32 rectangles (SDQ1-SDQ66); each represents one item from the SDQ-I; (d) the observed variables measure the factors in the following pattern: Items 1, 8, 15, 22, 38, 46, 54, and 62 measure Factor 1; Items 3, 10, 24, 32, 40, 48, 56, and 64 measure Factor 2; Items 7, 14, 28, 36, 44, 52, 60, and 69 measure Factor 3; and Items 5, 19, 26, 34, 42, 50, 58, and 66 measure Factor 4; (e) each observed variable measures one and only one factor; and (f) errors of measurement associated with each observed variable (E1-E66) are uncorrelated (i.e., there are no double-headed arrows connecting any two error terms). Although the error variables, technically speaking, are unobserved variables, and should have ellipses around them, common convention in such diagrams omits them in the interest of clarity.

In summary, a more formal description of the CFA model in Figure 1 argues that: (a) responses to the SDQ-I are explained by four factors; (b) each item has a nonzero loading on the nonacademic SC factor it was designed to measure (termed target loadings), and zero loadings on all other factors (termed nontarget loadings); (c) the four factors are correlated; and (d) measurement error terms are uncorrelated.

...physical SC, as it reflects one's perception of his or her physical ability. In CFA, these symbolized regression paths represent factor loadings and, as with all factor analyses, their strength is of primary interest. Thus, specification of a hypothesized model focuses on the formulation of equations that represent these structural regression paths. Of secondary importance are any covariances between the factors and/or between the measurement errors.

The building of these equations, in SEM, embraces two important notions: (a) that any variable in the model having an arrow pointing at it represents a dependent variable, and (b) dependent variables are always explained (i.e., accounted for) by other variables in the model. One relatively simple approach to formulating these structural equations, then, is first to note each dependent variable in the model and then to summarize all influences on these variables. Turning again to Figure 1, we see that there are 32 variables with arrows pointing toward them; all represent observed variables (SDQ1-SDQ66). Accordingly, these regression paths can be summarized in terms of 32 separate equations as follows:

SDQ1 = F1 + E1
SDQ8 = F1 + E8
SDQ15 = F1 + E15
...
SDQ62 = F1 + E62

SDQ3 = F2 + E3
SDQ10 = F2 + E10
...
SDQ64 = F2 + E64

SDQ7 = F3 + E7
SDQ14 = F3 + E14
...
SDQ69 = F3 + E69
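The measurement equations above follow mechanically from the item-factor pattern listed in the decomposition. As an illustration only (not part of the original article), a few lines of Python can generate the 32 equations from that pattern:

```python
# Illustrative sketch: build the CFA measurement equations from the
# item-factor pattern given in the text (one factor per item, plus error).
pattern = {
    "F1": [1, 8, 15, 22, 38, 46, 54, 62],
    "F2": [3, 10, 24, 32, 40, 48, 56, 64],
    "F3": [7, 14, 28, 36, 44, 52, 60, 69],
    "F4": [5, 19, 26, 34, 42, 50, 58, 66],
}

equations = [f"SDQ{i} = {f} + E{i}" for f, items in pattern.items() for i in items]

# Per point (e) in the text, each observed variable measures one and only
# one factor, so the 32 item numbers are all distinct.
assert len(equations) == 32
assert len({i for items in pattern.values() for i in items}) == 32
print(equations[0])  # SDQ1 = F1 + E1
```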
Factor Analysis: Confirmatory

References

[1] Bentler, P.M. (1990). Comparative fit indexes in structural models, Psychological Bulletin 107, 238-246.
[2] Bentler, P.M. (2004). EQS 6.1: Structural Equations Program Manual, Multivariate Software Inc, Encino.
[3] Bollen, K. (1989). Structural Equations with Latent Variables, Wiley, New York.
[4] Browne, M.W. & Cudeck, R. (1993). Alternative ways of assessing model fit, in Testing Structural Equation Models, K.A. Bollen & J.S. Long, eds, Sage, Newbury Park, pp. 136-162.
[5] Bryant, F.B. & Yarnold, P.R. (1995). Principal-components analysis and exploratory and confirmatory factor analysis, in Reading and Understanding Multivariate Statistics, L.G. Grimm & P.R. Yarnold, eds, American Psychological Association, Washington.
[6] Byrne, B.M. (1994). Structural Equation Modeling with EQS and EQS/Windows: Basic Concepts, Applications, and Programming, Sage, Thousand Oaks.
[7] Byrne, B.M. (1998). Structural Equation Modeling with LISREL, PRELIS, and SIMPLIS: Basic Concepts, Applications, and Programming, Erlbaum, Mahwah.
[8] Byrne, B.M. (2001a). Structural Equation Modeling with AMOS: Basic Concepts, Applications, and Programming, Erlbaum, Mahwah.
[9] Byrne, B.M. (2001b). Structural equation modeling with AMOS, EQS, and LISREL: comparative approaches to testing for the factorial validity of a measuring instrument, International Journal of Testing 1, 55-86.
[10] Curran, P.J., West, S.G. & Finch, J.F. (1996). The robustness of test statistics to nonnormality and specification error in confirmatory factor analysis, Psychological Methods 1, 16-29.
[11] Hu, L.-T. & Bentler, P.M. (1999). Cutoff criteria for fit indexes in covariance structure analysis: conventional criteria versus new alternatives, Structural Equation Modeling 6, 1-55.
[12] Hu, L.-T., Bentler, P.M. & Kano, Y. (1992). Can test statistics in covariance structure analysis be trusted? Psychological Bulletin 112, 351-362.
[13] Jöreskog, K.G. & Sörbom, D. (1996). LISREL 8: User's Reference Guide, Scientific Software International, Chicago.
[14] Kline, R.B. (1998). Principles and Practice of Structural Equation Modeling, Guilford Press, New York.
[15] Marsh, H.W. (1992). Self Description Questionnaire (SDQ) I: A Theoretical and Empirical Basis for the Measurement of Multiple Dimensions of Preadolescent Self-concept: A Test Manual and Research Monograph, Faculty of Education, University of Western Sydney, Macarthur, New South Wales.
[16] Maruyama, G.M. (1998). Basics of Structural Equation Modeling, Sage, Thousand Oaks.
[17] Raykov, T. & Marcoulides, G.A. (2000). A First Course in Structural Equation Modeling, Erlbaum, Mahwah.
[18] Steiger, J.H. (1990). Structural model evaluation and modification: an interval estimation approach, Multivariate Behavioral Research 25, 173-180.

(See also History of Path Analysis; Linear Statistical Models for Causation: A Critical Review; Residuals in Structural Equation, Factor Analysis, and Path Analysis Models; Structural Equation Modeling: Checking Substantive Plausibility)

BARBARA M. BYRNE
Factor Analysis: Exploratory

ROBERT PRUZEK

Volume 2, pp. 606-617
ignored except for distinguishing the most distinctive School D (Venetian) from the rest using a binary variable. For more details, see the file painters in the Modern Applied Statistics with S (MASS) library in R or S-Plus software (see Software for Statistical Analyses), and note that the original data and several further analyses can be found in the MASS library [24].

Table 1 exhibits correlations among the painter variables, where upper triangle entries are ignored since the matrix is symmetric. Table 2 exhibits a common factor coefficients matrix (of order 5 × 2) that corresponds to the initial correlations, where entries of highest magnitude are in bold print. The final column of Table 2 is labeled h², the standard notation for variable communalities. Because these factor coefficients correspond to an orthogonal factor solution, that is, uncorrelated common factors, each communality can be reproduced as a (row) sum of squares of the two factor coefficients to its left; for example, (0.76)² + (−0.09)² = 0.59. The columns labeled 1 and 2 are factor loadings, each of which is properly interpreted as a (product-moment) correlation between one of the original manifest variables (rows) and a derived common factor (columns). Post-multiplying the factor coefficient matrix by its transpose yields numbers that approximate the corresponding entries in the correlation matrix. For example, the inner product of the rows for Composition and Drawing is 0.76 × 0.50 + (−0.09) × (−0.56) = 0.43, which is close to 0.42, the observed correlation; so the corresponding residual equals −0.01. Pairwise products for all rows reproduce the observed correlations in Table 1 quite well, as only one residual fit exceeds 0.05 in magnitude, and the mean residual is 0.01.

Table 2  Factor loadings for 2-factor EFA solution, painter data

Variable name      1       2      h²
Composition      0.76   −0.09   0.59
Drawing          0.50   −0.56   0.56
Color           −0.03    0.80   0.64
Expression       0.81   −0.26   0.72
School D        −0.30    0.62   0.47
Avg. Col. SS     0.31    0.28   0.60

The final row of Table 2 contains the average sum of squares for the first two columns; the third entry is the average of the communalities in the final column, as well as the sum of the two average sums of squares to its left: 0.31 + 0.28 ≈ 0.60. These results demonstrate an additive decomposition of common variance in the solution matrix, where 60 percent of the total variance is common among these five variables, and 40 percent is uniqueness variance.

Users of EFA have often confused communality with reliability, but these two concepts are quite distinct. Classical common factor and psychometric test theory entail the notion that the uniqueness is the sum of two (orthogonal) parts, specificity and error. Consequently, uniqueness variance is properly seen as an upper bound for error variance; alternatively, communality is in principle a lower bound for reliability. It might help to understand this by noting that each EFA entails analysis of just a sample of observed variables or measurements in some domain, and that the addition of more variables within the general domain will generally increase shared variance as well as individual communalities. As battery size is increased, individual communalities increase toward upper limits that are in principle close to variable reliabilities. See [15] for a more elaborate discussion.

To visualize results for my example, I plot the common factor coefficients in a plane, after making some modifications in signs for selected rows and the second column. Specifically, I reverse the signs of the 3rd and 5th rows, as well as in the second column, so that all values in the factor coefficients matrix become
positive. Changes of this sort are always permissible, but we need to keep track of the changes, in this case by renaming the third variable to Color[−1] and the final binary variable to School.D[−1]. Plotting the revised coefficients by rows yields the five labeled points of Figure 1.

[Figure 1: the five variables (Composition, Drawing, Color[−1], Expression, School.D[−1]) plotted as labeled points in the plane of the two common factors, with inserted factor vectors; axis ticks run from 0.2 to 1.0.]

In addition to plotting points, I have inserted vectors to correspond to transformed factors; the arrows show an Expression–Composition factor and a second, correlated, Drawing–Color[−1] factor. That the School.D variable also loads highly on this second factor, and is also related to, that is, not orthogonal to, the point for Expression, shows that mean ratings, especially for the Drawing, Expression, and Color variates (the latter in an opposite direction), are notably different between Venetian School artists and painters from the collection of other schools. This can be verified by examination of the correlations (sometimes called point biserials) between the School.D variable and all the ratings variables in Table 1; the skeptical reader can easily acquire these data and study details. In fact, one of the reasons for choosing this example was to show that EFA as an exploratory data analytic method can help in studies of relations among quantitative and categorical variables. Some connections of EFA with other methods will be discussed briefly in the final section.

In modern applications of factor analysis, investigators ordinarily try to name factors in terms of dimensions of individual difference variation, to identify latent variables that in some sense appear to underlie observed variables. In this case, my ignorance of the works of these classical painters, not to mention of the thinking of de Piles as related to his ratings, led to my literal, noninventive factor names.

Before going on, it should be made explicit that insertion of the factor-vectors into this plot, and the attempt to name factors, are best regarded as discretionary parts of the EFA enterprise. The key output of such an analysis is the identification of the subspace defined by the common factors, within which variables can be seen to have certain distinctive structural relationships with one another. In other words, it is the configuration of points in the derived space that provides the key information for interpreting factor results; a relatively low-dimensional subspace provides insights into structure, as well as quantification of how much variance variables have in common. Positioning or naming of factors is generally optional, however common. When the number of derived factors exceeds two or three, factor transformation is an almost indispensable part of an EFA, regardless of whether attempts are made to name factors.

Communalities generally provide information as to how much variance variables have in common or share, and can sometimes be indicative of how highly predictable variables are from one another. In fact, the squared multiple correlation of each variable with all others in the battery is often recommended as an initial estimate of communality for each variable. Communalities can also signal (un)reliability, ...

... its counterparts. Its exploratory nature also means that prior structural information is usually not part of an EFA, although this idea will eventually be qualified in the context of reviewing factor transformations.
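The sign bookkeeping just described is easy to check numerically. The sketch below (illustrative Python with numpy, not part of the original article; loadings transcribed from Table 2, with negative signs as implied by the sign-reversal discussion) verifies that row sums of squares reproduce the communalities and that row inner products approximate the observed correlations:

```python
import numpy as np

# Loadings from Table 2; signs follow the text's reversal discussion
# (an assumption where the printed table dropped minus signs).
loadings = np.array([
    [ 0.76, -0.09],   # Composition
    [ 0.50, -0.56],   # Drawing
    [-0.03,  0.80],   # Color
    [ 0.81, -0.26],   # Expression
    [-0.30,  0.62],   # School D
])

# Communalities are row sums of squares, e.g. 0.76^2 + (-0.09)^2 = 0.59.
h2 = (loadings ** 2).sum(axis=1)
assert abs(h2[0] - 0.59) < 0.005

# Row inner products approximate the observed correlations: the
# Composition-Drawing product is about 0.43, close to the observed 0.42.
fitted = loadings @ loadings.T
assert abs(fitted[0, 1] - 0.43) < 0.005

# The two average column sums of squares add up to the mean communality
# (the additive decomposition of common variance, about 0.60).
avg_col_ss = (loadings ** 2).mean(axis=0)
assert abs(avg_col_ss.sum() - h2.mean()) < 1e-9
```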
... well as common factors to account for all entries in the correlation table.

Some Relationships Between EFA and PCA

As noted earlier, EFA is often confused with PCA. In fact, misunderstanding occurs so often in reports, published articles, and textbooks that it will be useful to describe how these methods compare, at least in a general way. More detailed or more technical discussions concerning such differences are available in [15].

As noted, the key aim of EFA is usually to derive a relatively small number of common factors to explain or account for (off-diagonal) covariances or correlations among a set of observed variables. However, despite being an exploratory method, EFA entails use of a falsifiable model at the level of manifest observations or correlations (covariances). For such a model to make sense, relationships among manifest variables should be approximately linear. When approximate linearity does not characterize relationships among variables, attempts can be made to transform (at least some of) the initial variables to remove bends in their relationships with other variables, or perhaps to remove outliers. Use of square root, logarithmic, reciprocal, and other nonlinear transformations is often effective for such purposes. Some investigators question such steps, but rather than asking why nonlinear transformations should be considered, a better question usually is: why should the analyst believe the metric used at the outset for particular variables should be expected to render relationships linear, without reexpressions or transformations? Given at least approximate linearity among all pairs of variables (an inquiry greatly facilitated by examining pairwise scatterplots), common factor analysis can often facilitate explorations of relationships among variables. The prospects for effective or productive applications of EFA are also dependent on thoughtful efforts at the stage of study design, a matter to be briefly examined below. With reference to our example, the pairwise relationships between the various pairs of de Piles' ratings of painters were found to be approximately linear.

In contrast to EFA, principal components analysis does not engage a model. PCA generally entails an algebraic decomposition of an initial data matrix into mutually orthogonal derived variables called components. Alternatively, PCA can be viewed as a linear transformation of the initial data vectors into uncorrelated variates with certain optimality properties. Data are usually centered at the outset by subtracting means for each variable and then scaled so that all variances are equal, after which the (rectangular) data matrix is resolved using a method called singular value decomposition (SVD). Components from an SVD are usually ordered so that the first component accounts for the largest amount of variance, the second the next largest amount, subject to the constraint that it be uncorrelated with the first, and so forth. The first few components will often summarize the majority of variation in the data; these are principal components. When used in this way, PCA is justifiably called a data reduction method, and it has often been successful in showing that a rather large number of variables can be summarized quite well using a relatively small number of derived components.

Conventional PCA can be completed by simply computing a table of correlations of each of the original variables with the chosen principal components; indeed, doing so yields a PCA counterpart of the EFA coefficients matrix in Table 2 if two components are selected. Furthermore, sums of squares of correlations in this table, across variables, show the total variance each component explains. These component-level variances are the eigenvalues produced when the correlation matrix associated with the data matrix is resolved into eigenvalues and eigenvectors. Alternatively, given the original (centered and scaled) data matrix, and the eigenvalues and eigenvectors of the associated correlation matrix, it is straightforward to compute principal components. As in EFA, derived PCA coefficient matrices can be rotated or transformed, and for purposes of interpretation this has become routine.

Given its algebraic nature, there is no particular reason for transforming variables at the outset so that their pairwise relationships are even approximately linear. This can be done, of course, but absent a model, or any particular justification for concentrating on pairwise linear relationships among variables, principal components analysis of correlation matrices is somewhat arbitrary. Because PCA is just an algebraic decomposition of data, it can be used for any kind of data; no constraints are made about the
dimensionality of the data matrix, no constraints on data values, and no constraints on how many components to use in analyses. These points imply that for PCA, assumptions are also optional regarding statistical distributions, either individually or collectively. Accordingly, PCA is a highly general method, with potential for use for a wide range of data types or forms. Given their basic form, PCA methods provide little guidance for answering model-based questions, such as those central to EFA. For example, PCA generally offers little support for assessing how many components (factors) to generate, or try to interpret; nor is there assistance for choosing samples or extrapolating beyond extant data for purposes of statistical or psychometric generalization. The latter concerns are generally better dealt with using models, and EFA provides what in certain respects is one of the most general classes of models available.

To make certain other central points about PCA more concrete, I return to the correlation matrix for the painter data. I also conducted a PCA with two components (but to save space I do not present the table of loadings). That is, I constructed the first two principal component variables, and found their correlations with the initial variables. A plot (not shown) of the principal component loadings analogous to that of Figure 1 shows the variables to be configured similarly, but all points are further from the origin. The row sums of squares of the component loadings matrix were 0.81, 0.64, 0.86, 0.83, and 0.63, values that correspond to communality estimates in the third column of the common factor matrix in Table 2. Across all five variables, PCA row sums of squares (which should not be called communalities) range from 14 to 37 percent larger than the h² entries in Table 2, an average of 27 percent; this means that component loadings are substantially larger in magnitude than their EFA counterparts, as will be true quite generally. For any data system, given the same number of components as common factors, component solutions yield row sums of squares that tend to be at least somewhat, and often markedly, larger than corresponding communalities.

In fact, these differences between characteristics of the PCA loadings and common factor loadings signify a broad point worthy of discussion. Given that principal components are themselves linear combinations of the original data vectors, each of the data variables tends to be part of the linear combination with which it is correlated. The largest weights for each linear combination correspond to variables that most strongly define the corresponding linear combination, and so the corresponding correlations in the Principal Component (PC) loading matrix tend to be highest, and indeed to have spuriously high magnitudes. In other words, each PC coefficient in the matrix that constitutes the focal point for interpretation of results tends to have a magnitude that is too large because the corresponding variable is correlated partly with itself, the more so for variables that are largest parts of corresponding components. Also, this effect tends to be exacerbated when principal components are rotated. Contrastingly, common factors are latent variables, outside of the space of the data vectors, and common factor loadings are not similarly spurious. For example, EFA loadings in Table 2, being correlations of observed variables with latent variables, do not reflect self-correlations, as do their PCA counterparts.

The Central EFA Questions: How Many Factors? What Communalities?

Each application of EFA requires a decision about how many common factors to select. Since the common factor model is at best an approximation to the real situation, questions such as how many factors, or what communalities, are inevitably answered with some degree of uncertainty. Furthermore, particular features of given data can make formal fitting of an EFA model tenuous. My purpose here is to present EFA as a true exploratory method based on common factor principles, with the understanding that formal fitting of the EFA model is secondary to useful results in applications; moreover, I accept that certain decisions made in contexts of real data analysis are inevitably somewhat arbitrary and that any given analysis will be incomplete. A wider perspective on relevant literature will be provided in the final section.

The history of EFA is replete with studies of how to select the number of factors; hundreds of both theoretical and empirical approaches have been suggested for the number of factors question, as this issue has been seen as basic for much of the past century. I shall summarize some of what I regard as the most enduring principles or methods, while trying to shed light on when particular methods are likely
to work effectively, and how the better methods can be attuned to reveal relevant features of extant data.

Suppose scores have been obtained on some number of correlated variables, say p, for n entities, perhaps persons. To entertain a factor analysis (EFA) for these variables generally means to undertake an exploratory structural analysis of linear relations among the p variables by analyzing a p × p covariance or correlation matrix. Standard outputs of such an analysis are a factor loading matrix for orthogonal or correlated common factors as well as communality estimates, and perhaps factor score estimates. All such results are conditioned on the number, m, of common factors selected for analysis. I shall assume that in deciding to use EFA, there is at least some doubt, a priori, as to how many factors to retain, so extant data will be the key basis for deciding on the number of factors. (I shall also presume that the data have been properly prepared for analysis, appropriate nonlinear transformations made, and so on, with the understanding that even outwardly small changes in the data can affect criteria bearing on the number of factors, and more.)

The reader who is even casually familiar with EFA is likely to have learned that one way to select the number of factors is to see how many eigenvalues (of the correlation matrix; recall PCA) exceed a certain criterion. Indeed, the roots-greater-than-one rule has become a default in many programs. Alas, rules of this sort are generally too rigid to serve reliably for their intended purpose; they can lead either to overestimates or underestimates of the number of common factors. Far better than using any fixed cutoff is to understand certain key principles and then learn some elementary methods and strategies for choosing m. In some cases, however, two or more ... same data.

A second thing even a nonspecialist may have ... communalities is usually to make a rather strong assumption, one quite possibly not supported by data in hand.

A better idea for a scree plot (SP) entails computing the original correlation matrix, R, as well as its inverse R⁻¹. Then, denoting the diagonal of the inverse as D² (entries of which exceed unity), rescale the initial correlation matrix to DRD, and then compute eigenvalues of this rescaled correlation matrix. Since the largest entries in D² correspond to variables that are most predictable from all others, and vice versa, the effect is to weigh variables more if they are more predictable, less if they are less predictable from other variables in the battery. (The complement of the reciprocal of any D² entry is in fact the squared multiple correlation (SMC) of that variable with all others in the set.) An SP based on eigenvalues of DRD allows for variability of communalities, and is usually realistic in assuming that communalities are at least roughly proportional to SMC values.

Figure 2 provides illustrations of two scree plots based on DRD, as applied to two simulated random samples. Although real data were used as the starting point for each simulation, both samples are just simulation sets of (the same size as) the original data set, where four factors had consistently been identified as the best number to interpret. Each of these two samples yields a scree plot, and both are given in Figure 2 to provide some sense of sampling variation inherent in such data; in this case, each plot leads to breaks after four common factors, where the break is found by reading the plot from right to left. But the slope between four and five factors is somewhat greater for one sample than the other, so one sample identifies m as four with slightly more clarity than the other. In fact, for some other samples examined in preparing these scree plots, breaks came after three or five factors, not just four. Note that for smaller samples greater variation can be expected in the eigenvalues, and hence the scree breaks will generally be less reliable indicators of the number of common factors for smaller samples.

[Figure 2: two scree plots for the simulated samples; vertical axis labeled "Eigenvalues for matrix DRD".]

So what is the principle behind the scree method? The answer is that the variance of the p − m smallest eigenvalues is closely related to the variance of residual correlations associated with fitting off-diagonals of the observed correlation matrix in successive choices for m, the number of common factors. When a break occurs in the eigenvalue plot, it signifies a notable drop in the sum of squares of residual correlations after fitting the common factor model to the observed correlation matrix for a particular value of m. I have constructed a horizontal line in Figure 2 to correspond to the mean of the 20 smallest eigenvalues (2.44) of DRD, to help see the variation these so-called rejected eigenvalues have around their mean. In general, it is the variation around such a mean of rejected eigenvalues that one seeks to reduce to a reasonable level when choosing m in the EFA solution, since a good EFA solution accounts well for the off-diagonals of the correlation matrix.

Methods such as bootstrapping, wherein multiple versions of DRD are generated over a series of bootstrap samples of the original data matrix, can be used to get a clearer sense of sampling variation, and probably should become part of standard practice in EFA, both at the level of selecting the number of factors and of assessing variation in various derived EFA results.

When covariances or correlations are well fit by some relatively small number of common factors, then scree plots often provide flexible, informative, and quite possibly persuasive evidence about the number of common factors. However, SPs alone can be misleading, and further examination of data may be helpful. The issue in selecting m vis-à-vis the SP concerns the nature or reliability of the information in eigenvectors associated with corresponding eigenvalues. Suppose some number m is seen as a possible underestimate; then deciding to add one more factor to have m + 1 factors is to decide that the additional eigenvector adds useful or meaningful structural information to the EFA solution. It is possible that m is an underestimate solely because a single correlation coefficient is poorly fit, and that adding a common factor merely reduces a single large residual correlation. But especially if the use of m + 1 factors yields a factor loading matrix that upon rotation (see below) improves interpretability in general, there may be ex post facto evidence that m was indeed an underestimate. Similar reasoning may be applied when moving to m + 2 factors, etc. Note that sampling variation can also result in sample reordering of so-called population eigenvectors too.

An adjunct to an SP that is too rarely used is simply to plot the distribution of the residual correlations, either as a histogram, or in relation to the original correlations, for, say, m, m + 1, and m + 2 factors in the vicinity of the scree break; outliers or other anomalies in such plots can provide evidence that goes usefully beyond the SP when selecting m. Factor transformation(s) (see below) may be essential to one's final decision. Recall that it may be a folly even to think there is a single correct value for m for some data sets.

Were one to use a different selection of variables to compose the data matrix for analysis, or perhaps make changes in the sample (deleting or adding cases), or try various different factoring algorithms, further modifications may be expected about the number of common factors. Finally, there is always the possibility that there are simply too many distinctive dimensions of individual difference variation, that is, common factors, for the EFA method to work effectively in some situations. It is not unusual that more variables, larger samples, or generally more investigative effort are required to resolve some basic questions such as how many factors to use in analysis.

Given some choice for m, the next decision is usually that of deciding what factoring method to use. The foregoing idea of computing DRD, finding its eigenvalues, and producing an SP based on those can be linked directly to an EFA method called image factor analysis (IFA) [13], which has probably been underused, in that several studies have found it to be a generally sound and effective method. IFA is a noniterative method that produces common factor coefficients and communalities directly. IFA is based on the m largest eigenvalues, say, the diagonal entries of Δm, and corresponding eigenvectors, say Qm, of the matrix denoted DRD above. Given a particular factor method, communality estimates follow directly from selection of the number of common factors.
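The DRD rescaling described above is straightforward to carry out. The following is a minimal sketch (illustrative Python with numpy, on made-up data rather than the article's painter or simulation data) of computing D² from the inverse correlation matrix, forming DRD, and extracting the eigenvalues a scree plot of DRD would display:

```python
import numpy as np

# Synthetic stand-in data: 100 cases, 6 variables, with some shared variance.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))
X[:, 1] += X[:, 0]
X[:, 3] += X[:, 2]
R = np.corrcoef(X, rowvar=False)

# Diagonal of the inverse; entries exceed unity, as the text notes.
D2 = np.diag(np.linalg.inv(R))
D = np.diag(np.sqrt(D2))
DRD = D @ R @ D

# Eigenvalues in descending order, as a scree plot of DRD would show them.
eigvals = np.sort(np.linalg.eigvalsh(DRD))[::-1]

# The complement of the reciprocal of each D^2 entry is that variable's
# squared multiple correlation with the others.
smc = 1.0 - 1.0 / D2
assert np.all(D2 >= 1.0)
assert np.all((0.0 <= smc) & (smc < 1.0))
```

More predictable variables receive larger entries in D² and hence more weight in DRD, which is the weighting the text describes.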
The analysis usually commences from a correlation matrix, so communality estimates are simply row sums of squares of the (orthogonal) factor coefficients matrix, which for m common factors is computed as Λm = D⁻¹Qm(Δm − θI)^(1/2), where θ is the average of the p − m smallest eigenvalues. IFA may be especially defensible for EFA when sample size is limited; more details are provided in [17], including a sensible way to modify the diagonal D² when the number of variables is a substantial fraction of sample size.

A more commonly used EFA method is called maximum likelihood factor analysis (MLFA), for which algorithms and software are readily available and generally well understood. The theory for this method has been studied perhaps more than any other, and it tends to work effectively when the EFA problem has been well-defined and the data are well-behaved. Specialists regularly advocate use of the MLFA method [1, 2, 16, 23], and it is often seen as the common factor method of choice when the sample is relatively large. Still, MLFA is an iterative method that can lead to poor solutions, so one must be alert in case it fails in some way. Maximum likelihood EFA methods generally call for large ns, using an assumption that the sample has been drawn randomly from a parent population for which multivariate normality (see Catalogue of Probability Density Functions) holds, at least approximately; when this assumption is violated seriously, or when sample size is not large, MLFA may not serve its exploratory purpose well. Statistical tests may sometimes be helpful, but the sample size issue is vital if EFA is used for testing statistical hypotheses. There can be a mismatch between exploratory use of EFA and statistical testing because small samples may not be sufficiently informative to reject any factor model, while large samples may lead to rejection of every model in some domains of application. Generally, scree methods for choosing the number of factors are superior to statistical testing procedures.

Given a choice of factoring methods, and of course there are many algorithms in addition to ...

... other methodological decisions, is often best made in consultation with an expert.

Factor Transformations to Support EFA Interpretation

Given at least a tentative choice for m, EFA methods such as IFA or MLFA can be used straightforwardly to produce matrices of factor coefficients to account for structural relations among variables. However, attempts to interpret factor coefficient matrices without further efforts to transform factors usually fall short unless m = 1 or 2, as in our illustration. For larger values of m, factor transformation can bring order out of apparent chaos, with the understanding that order can take many forms. Factor transformation algorithms normally take one of three forms: Procrustes fitting to a prespecified target (see Procrustes Analysis), orthogonal simple structure, or oblique simple structure. All modern methods entail use of specialized algorithms. I shall begin with Procrustean methods and review each class of methods briefly.

Procrustean methods owe their name to a figure of ancient Greek mythology, Procrustes, who made a practice of robbing highway travelers, tying them up, and stretching them, or cutting off their feet, to make them fit a rigid iron bed. In the context of EFA, Procrustes methods are more benign; they merely invite the investigator to prespecify his or her beliefs about structural relations among variables in the form of a target matrix, and then transform an initial factor coefficients matrix to put it in relatively close conformance with the target. Prespecification of configurations of points in m-space, preferably on the basis of hypothesized structural relations that are meaningful to the investigator, is a wise step for most EFAs even if Procrustes methods are not to be used explicitly for transformations. This is because explication of beliefs about structures can afford (one or more) reference system(s) for interpretation of empirical data structures, however they were initially derived. It is a long-respected princi-
IFA and MLFA the generation of communality ple that prior information, specified independently of
estimates follows directly from the choice of m, the extant empirical data, generally helps to support sci-
number of common factors. However, some EFA entific interpretations of many kinds, and EFA should
methods or algorithms can yield numerically unstable be no exception. In recent times, however, meth-
results, particularly if m is a substantial fraction of p, ods such as confirmatory factor analysis (CFA), are
the number of variables, or when n is not large in usually seen as making Procrustean EFA methods
relation to p. Choice of factor methods, like many obsolete because CFA methods offer generally more
Factor Analysis: Exploratory 9
sophisticated numerical and statistical machinery to aid analyses. Still, as a matter of principle, it is useful to recognize that the general methodology of EFA has for over sixty years permitted, and in some respects encouraged, incorporation of sharp prior questions in structural analyses.

Orthogonal rotation algorithms provide relatively simple ways for transforming factors, and these have been available for nearly forty years. Most commonly, an orthomax criterion is optimized, using methods that have been dubbed quartimax, varimax, or equamax. Dispensing with quotations, we merely note that, in general, equamax solutions tend to produce simple structure solutions for which different factors account for nearly equal amounts of common variance; quartimax, contrastingly, typically generates one broad or general factor followed by m − 1 smaller ones; varimax produces results intermediate between these extremes. The last, varimax, is the most used of the orthogonal simple structure rotations, but choice of a solution should not be based too strongly on generic popularity, as particular features of a data set can make other methods more effective. Orthogonal solutions offer the appealing feature that squared common factor coefficients show directly how much of each variable's common variance is associated with each factor. This property is lost when factors are transformed obliquely. Also, the factor coefficients matrix alone is sufficient to interpret orthogonal factors; not so when derived factors are mutually correlated. Still, forcing factors to be uncorrelated can be a weakness when the constraint of orthogonality limits factor coefficient configurations unrealistically, and this is a common occurrence when several factors are under study.

Oblique transformation methods allow factors to be mutually correlated. For this reason, they are more complex and have a more complicated history. A problem for many years was that by allowing factors to be correlated, oblique transformation methods often allowed the m-factor space to collapse; successful methods avoided this unsatisfactory situation while tending to work well for wide varieties of data. While no methods are entirely acceptable by these standards, several, notably those of Jennrich and Sampson (direct quartimin) [12], Harris and Kaiser (obliquimax), Rozeboom (Hyball) [18], Yates (geomin) [25], and Hendrickson and White (promax) [9], are especially worthy of consideration for applications. Browne [2], in a recent overview of analytic rotation methods for EFA, stated that Jennrich and Sampson [12] solved the problems of oblique rotation; however, he went on to note that ". . . we are not at a point where we can rely on mechanical exploratory rotation by a computer program if the complexity of most variables is not close to one" [2, p. 145]. Methods such as Hyball [19] facilitate random starting positions in m-space of transformation algorithms to produce multiple solutions that can then be compared for interpretability. The promax method is notable not only because it often works well, but also because it combines elements of Procrustean logic with analytical orthogonal transformations. Yates's geomin [25] is also a particularly attractive method in that the author went back to Thurstone's basic ideas for achieving simple structure and developed ways for them to be played out in modern EFA applications. A special reason to favor simple structure transformations is provided in [10, 11], where the author noted that standard errors of factor loadings will often be substantially smaller when population structures are simple than when they are not; of course, this calls attention to the design of the battery of variables.

Estimation of Factor Scores

It was noted earlier that latent variables, that is, common factors, are basic to any EFA model. A strong distinction is made between observable variates and the underlying latent variables seen in EFA as accounting for manifest correlations or covariances between all pairs of manifest variables. The latent variables are by definition never observed or observable in a real data analysis, and this is not related to the fact that we ordinarily see our data as a sample (of cases, or rows); latent variables are in principle not observable, either for statistically defined samples, or for their population counterparts. Nevertheless, it is not difficult to estimate the postulated latent variables, using linear combinations of the observed data. Indeed, many different kinds of factor score estimates have been devised over the years (see Factor Score Estimation).

Most methods for estimating factor scores are not worth mentioning because of one or another kind of technical weakness. But there are two methods that are worthy of consideration for practical applications in EFA where factor score estimates seem
needed. These are called regression estimates and Bartlett (also, maximum likelihood) estimates of factor scores, and both are easily computed in the context of IFA. Recalling that D² was defined as the diagonal of the inverse of the correlation matrix, now suppose the initial data matrix has been centered and scaled as Z, where Z'Z = R; then, using the notation given earlier in the discussion of IFA, Bartlett estimates of factor scores can be computed as X_m^(Bartlett) = Z D Q_m (Δ_m − θ̄I)^(−1/2). The discerning reader may recognize that these factor score estimates can be further simplified using the singular value decomposition of the matrix Z D; indeed, these score estimates are just rescaled versions of the first m principal components of Z D. Regression estimates, in turn, are further column rescalings of the same m columns in X_m^(Bartlett). MLFA factor score estimates are easily computed, but to discuss them goes beyond our scope; see [15]. Rotated or transformed versions of factor score estimates are also not complicated; the reader can go to factor score estimation (FSE) for details.

EFA in Practice: Some Guidelines and Resources

Software packages such as CEFA [3], which implements MLFA as well as geomin among other methods, and Hyball [18], can be downloaded from the web without cost, and they facilitate use of most of the methods for factor extraction as well as factor transformation. These packages are based on modern methods, they are comprehensive, and they tend to offer advantages that most commercial software for EFA does not. What these methods lack, to some extent, is mechanisms to facilitate modern graphical displays. Splus and R software, the latter of which is also freely available from the web [r-project.org], provide excellent modern graphical methods as well as a number of functions to implement many of the methods available in CEFA, and several in Hyball. A small function for IFA is provided at the end of this article; it works in both R and Splus. In general, however, no one source provides all methods, mechanisms, and management capabilities for a fully operational EFA system, nor should this be expected, since what one specialist means by "fully operational" necessarily differs from that of others.

Nearly all real-life applications of EFA require decisions bearing on how and how many cases are selected, and how variables are to be selected and transformed to help ensure approximate linearity between variates; next, choices about factoring algorithms or methods, the number(s) of common factors, and factor transformation methods must be made. That there be no notably weak links in this chain is important if an EFA project is to be most informative. Virtually all questions are contextually bound, but the literature of EFA can provide guidance at every step.

Major references on EFA application, such as that of Carroll [4], point up many of the possibilities and a perspective on related issues. Carroll suggests that special value can come from side-by-side analyses of the same data using EFA methods and those based on structural equation modeling (SEM). McDonald [15] discusses EFA methods in relation to SEM. Several authors have made connections between EFA and other multivariate methods such as basic regression; see [14, 17] for examples.

An S Function for Image Factor Analysis

    ifa <- function(rr, mm) {
        # routine is based on image factor analysis;
        # it generates an unrotated common factor coefficients
        # matrix & a scree plot; in R, follow w/ promax or
        # varimax; in Splus follow w/ rotate.
        # rr is taken to be a symmetric matrix of correlations
        # or covariances; mm is no. of factors. For additional
        # functions or assistance, contact: rpruzek@uamail.albany.edu
        rinv <- solve(rr)   # takes inverse of R; so R must be nonsingular
        sm2i <- diag(rinv)
        smrt <- sqrt(sm2i)
        dsmrt <- diag(smrt)
        rsr <- dsmrt %*% rr %*% dsmrt
        reig <- eigen(rsr, sym = T)
        vlamd <- reig$va
        vlamdm <- vlamd[1:mm]
        qqm <- as.matrix(reig$ve[, 1:mm])
        theta <- mean(vlamd[(mm + 1):nrow(qqm)])
        dg <- sqrt(vlamdm - theta)
        if(mm == 1) fac <- dg[1] * diag(1/smrt) %*% qqm
        else fac <- diag(1/smrt) %*% qqm %*% diag(dg)
        plot(1:nrow(rr), vlamd, type = "o")
        abline(h = theta, lty = 3)
        title("Scree plot for IFA")
        print("Common factor coefficients matrix is: fac")
        print(fac)
        list(vlamd = vlamd, theta = theta, fac = fac)
    }

References

[1] Browne, M.W. (1968). A comparison of factor analytic techniques, Psychometrika 33, 267–334.
[2] Browne, M.W. (2001). An overview of analytic rotation in exploratory factor analysis, Multivariate Behavioral Research 36, 111–150.
[3] Browne, M.W., Cudeck, R., Tateneni, K. & Mels, G. (1998). CEFA: Comprehensive Exploratory Factor Analysis (computer software and manual). [http://quantrm2.psy.ohio-state.edu/browne/]
[4] Carroll, J.B. (1993). Human Cognitive Abilities: A Survey of Factor Analytic Studies, Cambridge University Press, New York.
[5] Cattell, R.B. (1966). The scree test for the number of factors, Multivariate Behavioral Research 1, 245–276.
[6] Darlington, R. (2000). Factor Analysis (Instructional Essay on Factor Analysis). [http://comp9.psych.cornell.edu/Darlington/factor.htm]
[7] Davenport, M. & Studdert-Kennedy, G. (1972). The statistical analysis of aesthetic judgement: an exploration, Applied Statistics 21, 324–333.
[8] Fabrigar, L.R., Wegener, D.T., MacCallum, R.C. & Strahan, E.J. (1999). Evaluating the use of exploratory factor analysis in psychological research, Psychological Methods 3, 272–299.
[9] Hendrickson, A.E. & White, P.O. (1964). PROMAX: a quick method for transformation to simple structure, Brit. Jour. of Statistical Psychology 17, 65–70.
[10] Jennrich, R.I. (1973). Standard errors for obliquely rotated factor loadings, Psychometrika 38, 593–604.
[11] Jennrich, R.I. (1974). On the stability of rotated factor loadings: the Wexler phenomenon, Brit. J. Math. Stat. Psychology 26, 167–176.
[12] Jennrich, R.I. & Sampson, P.F. (1966). Rotation for simple loadings, Psychometrika 31, 313–323.
[13] Joreskog, K.G. (1969). Efficient estimation in image factor analysis, Psychometrika 34, 51–75.
[14] Lawley, D.N. & Maxwell, A.E. (1973). Regression and factor analysis, Biometrika 60, 331–338.
[15] McDonald, R.P. (1984). Factor Analysis and Related Methods, Lawrence Erlbaum Associates, Hillsdale.
[16] Preacher, K.J. & MacCallum, R.C. (2003). Repairing Tom Swift's electric factor analysis machine, Understanding Statistics 2, 13–43. [http://www.geocities.com/Athens/Acropolis/8950/tomswift.pdf]
[17] Pruzek, R.M. & Lepak, G.M. (1992). Weighted structural regression: a broad class of adaptive methods for improving linear prediction, Multivariate Behavioral Research 27, 95–130.
[18] Rozeboom, W.W. (1991). HYBALL: a method for subspace-constrained oblique factor rotation, Multivariate Behavioral Research 26, 163–177. [http://web.psych.ualberta.ca/rozeboom/]
[19] Rozeboom, W.W. (1992). The glory of suboptimal factor rotation: why local minima in analytic optimization of simple structure are more blessing than curse, Multivariate Behavioral Research 27, 585–599.
[20] Rozeboom, W.W. (1997). Good science is abductive, not hypothetico-deductive, in What if There Were No Significance Tests?, Chapter 13, L.L. Harlow, S.A. Mulaik & J.H. Steiger, eds, Lawrence Erlbaum Associates, Hillsdale, NJ.
[21] Spearman, C. (1904). General intelligence objectively determined and measured, American Jour. of Psychology 15, 201–293.
[22] Thurstone, L.L. (1947). Multiple-factor Analysis: A Development and Expansion of the Vectors of Mind, University of Chicago Press, Chicago.
[23] Tucker, L. & MacCallum, R.C. (1997). Exploratory factor analysis. [unpublished, but available: http://www.unc.edu/rcm/book/factornew.htm]
[24] Venables, W.N. & Ripley, B.D. (2002). Modern Applied Statistics with S, Springer, New York.
[25] Yates, A. (1987). Multivariate Exploratory Data Analysis: A Perspective on Exploratory Factor Analysis, State University of New York Press, Albany.

ROBERT PRUZEK
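As an editorial cross-check of the IFA recipe in the S function above: for the simplest case of p = 2 variables and m = 1 factor, the eigenpairs of D R D are analytic, and the formula Λ_m = D^(-1) Q_m (Δ_m − θ̄I)^(1/2) yields a loading of √r for each variable when the correlation is r. A minimal Python sketch of that check (illustrative only; the function name is ours, not part of the article):

```python
import math

def ifa_loadings_2x2(r):
    # Image factor analysis for R = [[1, r], [r, 1]] with m = 1.
    # D^2 = diag(R^-1): both diagonal entries equal 1 / (1 - r^2).
    d = math.sqrt(1.0 / (1.0 - r * r))
    # Eigenvalues of D R D are 1/(1 - r) and 1/(1 + r); the leading
    # eigenvector is (1, 1)/sqrt(2).
    lam1, lam2 = 1.0 / (1.0 - r), 1.0 / (1.0 + r)
    q1 = 1.0 / math.sqrt(2.0)
    theta = lam2                      # average of the p - m smallest eigenvalues
    scale = math.sqrt(lam1 - theta)   # (Delta_m - theta*I)^(1/2)
    loading = q1 * scale / d          # D^(-1) Q_m (Delta_m - theta*I)^(1/2)
    return [loading, loading]

print(ifa_loadings_2x2(0.49))  # approximately [0.7, 0.7], i.e., sqrt(0.49)
```

Running the `ifa` function on the same 2 × 2 correlation matrix should reproduce these loadings up to sign.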
Factor Analysis: Multiple Groups
TODD D. LITTLE AND DAVID W. SLEGERS
Volume 2, pp. 617–623
intercepts or means of the indicators, and (c) θ, the residual variances of each indicator, which are the aggregate of the unique factor variance and the unreliable variance of an indicator. The other three types of parameter refer to the latent construct level: (d) α, the mean of the latent constructs, (e) ψ_ii, latent variances, and (f) ψ_ij, latent covariances or correlations [9, 12, 14].

Taxonomy of Invariance

A key aspect of multiple-group MACS modeling is the ability to assess the degree of factorial invariance of the constructs across groups. Factorial invariance addresses whether the constructs' measurement properties (i.e., the intercepts and loadings, which reflect the reliable components of the measurement space) are the same in two or more populations. This question is distinct from whether the latent aspects of the constructs are the same (e.g., the constructs' mean levels or covariances). This latter question deals with particular substantive hypotheses about possible group differences on the constructs (i.e., the reliable and true properties of the constructs). The concept of invariance is typically thought of and described as a hierarchical sequence of invariance starting with the weakest form and working up to the strictest form. Although we will often discuss the modeling procedures in terms of two groups, the extension to three or more groups is straightforward (see, e.g., [9]).

Configural Invariance. The most basic form of factorial invariance is ensuring that the groups have the same basic factor structure. The groups should have the same number of latent constructs, the same number of manifest indicators, and the same pattern of fixed and freed (i.e., estimated) parameters. If these conditions are met, the groups are said to have configural invariance. As the weakest form of invariance, configural invariance only requires the same pattern of fixed and freed estimates among the manifest and latent variables, but does not require the coefficients to be equal across groups.

Weak Factorial Invariance. Although termed weak factorial invariance, this level of invariance is more restricted than configural invariance. Specifically, in addition to the requirement of having the same pattern of fixed and freed parameters across groups, the loadings are equated across groups. The manifest means and residual variances are free to vary. This condition is also referred to as pattern invariance [15] or metric invariance [6]. Because the factor variances are free to vary across groups, the factor loadings are, technically speaking, proportionally equivalent (i.e., weighted by the differences in latent variances). If weak factorial invariance is found to be untenable (see testing below), then only configural invariance holds across groups. Under this condition, one has little basis to suppose that the constructs are the same in each group, and systematic comparisons of the constructs would be difficult to justify. If invariance of the loadings holds, then one has a weak empirical basis to consider the constructs to be equivalent, which would allow cross-group comparisons of the latent variances and covariances, but not the latent means.

Strong Factorial Invariance. As Meredith [14] compellingly argued, any test of factorial invariance should include the manifest means; weak factorial invariance is not a complete test of invariance. With strong factorial invariance, the loadings and the intercepts are equated (and, like the variances of the constructs, the latent means are allowed to vary in the second and all subsequent groups). This strong form of factorial invariance, also referred to as scalar invariance [22], is required in order for individuals with the same ability in separate groups to have the same score on the instrument. With any less stringent condition, two individuals with the same true level of ability would not have the same expected value on the measure. This circumstance would be problematic because, for example, when comparing groups based on gender on a measure of mathematical ability, one would want to ensure that a male and a female with the same level of ability would receive the same score.

An important advantage of strong factorial invariance is that it establishes the measurement equivalence (or construct comparability) of the measures. In this case, constructs are defined in precisely the same operational manner in each group; as a result, they can be compared meaningfully and with quantitative precision. Measurement equivalence indicates that (a) the constructs are generalizable entities in each subpopulation, (b) sources of bias and error (e.g., cultural bias, translation errors, varying conditions of administration) are minimal, (c) subgroup differences
have not differentially affected the constructs' underlying measurement characteristics (i.e., constructs are comparable because the indicators' specific variances are independent of cultural influences after conditioning on the construct-defining common variance; [14]), and (d) between-group differences in the constructs' mean, variance, and covariance relations are quantitative in nature (i.e., the nature of group differences can be assessed as mean-level, variance, and covariance or correlational effects) at the construct level. In other words, with strong factorial invariance, the broadest spectrum of hypotheses about the primary construct moments (means, variances, covariances, correlations) can be tested while simultaneously establishing measurement equivalence (i.e., two constructs can demonstrate different latent relations across subgroups, yet still be defined equivalently at the measurement level).

Strict Factorial Invariance. With strict factorial invariance, all conditions are the same as for strong invariance but, in addition, the residual variances are equated across groups. This level of invariance is not required for making veridical cross-group comparisons because the residuals are where the aggregate of the true measurement error variance and the indicator-specific variance is represented. Here, the factors that influence unreliability are not typically expected to operate in an equivalent manner across the subgroups of interest. In addition, the residuals reflect the unique factors of the measured indicators (i.e., variance that is reliable but unique to the particular indicator). If the unique factors differ trivially with regard to subgroup influences, this violation of selection theorem [14] can be effectively tolerated, if sufficiently small, by allowing the residuals to vary across the subgroups. In other words, strong factorial invariance is less biasing than strict factorial invariance because, even though the degree of random error may be quite similar across groups, if it is not exactly equal, the nonequal portions of the random error are forced into other parameters of a given model, thereby introducing potential sources of bias. Moreover, in practical applications of cross-group research such as cross-cultural studies, some systematic bias (e.g., translation bias) may influence the reliable component of a given residual. Assuming these sources of bias and error are negligible (see testing below), they could be represented as unconstrained residual variance terms across groups in order to examine the theoretically meaningful common-variance components as unbiasedly as possible.

Partial Invariance. Widaman and Reise [23] and others have also introduced the concept of partial invariance, which is the condition when a constraint of invariance is not warranted for one or a few of the loading parameters. When invariance is untenable, one may then attempt to determine which indicators contribute significantly to the misfit ([3], [5]). It is likely that only a few of the indicators deviate significantly across groups, giving rise to the condition known as partial invariance. When partial invariance is discovered, there are a variety of ways to proceed. (a) One can leave the estimate in the model, but not constrain it to be invariant across groups, and argue that the invariant indicators are sufficient to establish comparability of the constructs [23]; (b) one can argue that the differences between indicators are small enough that they would not make a substantive difference and proceed with invariance constraints in place [9]; (c) one could decide to reduce the number of indicators by only using indicators that are invariant across groups [16]; (d) one could conclude that, because invariance cannot be attained, the instrument must be measuring different constructs across the multiple groups and, therefore, not use the instrument at all [16]. Millsap and Kwok [16] also describe a method to assess the severity of the violations of invariance by evaluating the sensitivity and specificity at various selection points.

Selection Theorem Basis for Expecting Invariance. The loadings and intercepts of a construct's indicators can be expected to be invariant across groups under a basic tenet of selection theorem, namely, conditional independence ([8, 14]; see also [18]). In particular, if subpopulation influences (i.e., the basis for selecting the groups) and the specific components (unique factors) of the constructs' manifest indicators are independent when conditioned on the common construct components, then an invariant measurement space can be specified even under extreme selection conditions. When conditional independence between the indicators' unique factors and the selection basis holds, the construct information (i.e., common variance) contains, or carries, information about subpopulation influences. This expectation is quite reasonable if one assumes that the subpopulations derive from a common population from which the subpopulations can be described as selected on the basis of one or
more criteria (e.g., experimental treatment, economic affluence, degree of industrialization, degree of individualism, etc.). This expectation is also reasonable if one assumes, on the basis of a specific theoretical view, that the constructs should exist in each assessed subpopulation and that the constructs' indicators reflect generally equivalent domain representations.

Because manifest indicators reflect both common and specific sources of variance, cross-group effects may influence not only the common construct-related variance of a set of indicators but also the specific variance of one or more of them [17]. Measurement equivalence will hold if these effects have influenced only the common-variance components of a set of construct indicators and not their unique-specific components [8, 14, 18]. If cross-group influences differentially and strongly affect the specific components of indicators, nonequivalence would emerge. Although measurement nonequivalence can be a meaningful analytic outcome, it disallows, when sufficiently strong, quantitative construct comparisons.

Identification Constraints

There are three methods of placing constraints on the model parameters in order to identify the constructs and model (see Identification). When a mean structure is used, the location must be identified in addition to the scale of the other estimated parameters.

The first method of identification and scale setting is to fix a parameter in the latent model. For example, to set the scale for the location parameters, one can fix the latent factor mean, α, to zero (or a nonzero value). Similarly, to set the scale for the variance-covariance and loading parameters, one can fix the variances, ψ_ii, to 1.0 (or any other nonzero value). The advantage of this approach is that the estimated latent means in each subsequent group are relative mean differences from the first group. Because this first group is fixed at zero, the significance of the latent mean estimates in the subsequent groups is the significance of the difference from the first group. Fixing the latent variances to 1.0 has the advantage of providing estimates of the associations among the latent constructs in a correlational metric as opposed to an arbitrary covariance metric.

The second common method is known as the marker-variable method. To set the location parameters, one element of τ is set to zero (or a nonzero value) for each construct. To set the scale, one element of Λ is fixed to 1.0 (or any other nonzero value) for each construct. This method of identification is less desirable than the first and third methods because the location and scale of the latent construct are determined arbitrarily on the basis of which indicator is chosen. Reise, Widaman, and Pugh [19] recommend that, if one chooses this approach, the marker variables should be supported by previous research or selected on the basis of strong theory.

A third possible identification method is to constrain the sum of τ for each factor to zero [20]. For the scale identification, the λs for a factor should sum to p, the number of manifest variables. This method forces the mean and variance of the latent construct to be the weighted average of all of its indicators' means and loadings. The method has the advantage of providing a nonarbitrary scale that can legitimately vary across constructs and groups. It would be feasible, in fact, to compare the differences in means of two different constructs if one was theoretically motivated to do so (see [20] for more details of this method).

Testing for Measurement Invariance and Latent Construct Differences

In conducting cross-group tests of equality, either a statistical or a modeling rationale can be used for evaluating the tenability of the cross-group restrictions [9]. With a statistical rationale, an equivalence test is conducted as a nested-model comparison between a model in which specific parameters are constrained to equality across groups and one in which these parameters (and all others) are freely estimated in all groups. The difference in χ² between the two models is a test of the equality restrictions (with degrees of freedom equal to the difference in their degrees of freedom). If the test is nonsignificant, then the statistical evidence indicates no cross-group differences between the equated parameters. If it is significant, then evidence of cross-group inequality exists.

The other rationale is termed a modeling rationale [9]. Here, model constraints are evaluated using practical fit indices to determine the overall adequacy of a fitted model. This rationale is used for large models with numerous constrained parameters because the χ² statistic is an overly sensitive index of model fit, particularly for large numbers of constraints and when estimated on large sample sizes (e.g., [10]).
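The nested-model χ² difference test of the statistical rationale is simple to carry out once the two fitted models' χ² values and degrees of freedom are known. A minimal Python sketch (the fit statistics here are hypothetical, not from any study; the chi-square survival function is computed from the regularized incomplete gamma series so that only the standard library is needed):

```python
import math

def chi2_sf(x, df):
    """P(X > x) for a chi-square variate with df degrees of freedom,
    via the series expansion of the regularized lower incomplete
    gamma function (adequate for moderate x)."""
    s, xx = df / 2.0, x / 2.0
    if xx <= 0:
        return 1.0
    term = 1.0 / s
    total = term
    n = 0
    while term > total * 1e-15:
        n += 1
        term *= xx / (s + n)
        total += term
    return 1.0 - math.exp(-xx + s * math.log(xx) - math.lgamma(s)) * total

# Hypothetical fit results: the freely estimated model versus the model
# with loadings constrained equal across groups (nested in the free one).
chi2_free, df_free = 121.6, 46
chi2_constr, df_constr = 134.2, 52

delta_chi2 = chi2_constr - chi2_free   # 12.6
delta_df = df_constr - df_free         # 6
p = chi2_sf(delta_chi2, delta_df)
print(delta_chi2, delta_df, p)
```

With these made-up numbers, p falls just below .05 (the .05 critical value for χ² with 6 df is 12.59), so by the statistical rationale the equality constraints would be rejected, if only marginally.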
From this viewpoint, if a model with numerous constraints evinces adequate levels of practical fit, then the set of constraints is a reasonable approximation of the data.

Both rationales can be used in testing the measurement-level and the latent-level parameters. Because these two levels represent distinctly and qualitatively different empirical and theoretical goals, however, their corresponding rationales can also be different. Specifically, testing measurement equivalence involves evaluating the general tenability of an imposed indicator-to-construct structure via overall model fit indices. Here, various sources of model misfit (random or systematic) may be deemed substantively trivial if model fit is acceptable (i.e., if the model provides a reasonable approximation of the data; [2, 9]). The conglomerate effects of these sources of misfit, when sufficiently small, can be depicted parsimoniously as residual variances and general lack of fit, with little or no loss to theoretical meaningfulness (i.e., the trade-off between empirical accuracy and theoretical parsimony; [11]). When compared to a noninvariance model, an invariance model differs substantially in interpretability and parsimony (i.e., fewer parameter estimates than a noninvariance model), and it provides the theoretical and mathematical basis for quantitative between-group comparisons.

In contrast to the measurement level, the latent level reflects interpretable, error-free effects among constructs. Here, testing them for evidence of systematic differences (i.e., the hypothesis-testing phase of an analysis) is probably best done using a statistical rationale because it provides a precise criterion for testing the specific theoretically driven questions about the

four sociocultural settings. His analyses demonstrated that the constructs were measurement equivalent (i.e., had strong factorial invariance) across all groups, indicating that the translation process did not unduly influence the measurement properties of the instrument. However, the constructs themselves revealed a number of theoretically meaningful differences, including striking differences in the mean levels and the variances across the groups, but no differences in the strength of association between the two primary constructs examined.

Extensions to Longitudinal MACS Modeling

The issues related to cross-group comparisons with MACS models are directly applicable to longitudinal MACS modeling. That is, establishing the measurement equivalence (strong metric invariance) of a construct's indicators over time is just as important as establishing their equivalence across subgroups. One additional component of longitudinal MACS modeling that needs to be addressed is the fact that the specific variances of the indicators of a construct will have some degree of association across time. Here, independence of the residuals is not assumed; rather, dependence of the unique factors is expected. In this regard, the a priori factor model, when fit across time, would specify and estimate all possible residual correlations of an indicator with itself across each measurement occasion.

Summary

MACS models are a powerful tool for cross-group and longitudinal comparisons. Because the means or intercepts of measured indicators are included
constructs and because such substantive tests are typ- explicitly in MACS models, they provide a very
ically narrower in scope (i.e., fewer parameters are strong test of the validity of construct compar-
involved). However, such tests should carefully con- isons (i.e., measurement equivalence). Moreover,
sider issues such as error rate and effect size. the form of the group- or time-related differences
Numerous examples of the application of MACS can be tested on many aspects of the constructs
modeling can be found in the literature, however, (i.e., means, variances, and covariances or corre-
Little [9] offers a detailed didactic discussion of the lations). As outlined here, the tenability of mea-
issues and steps involved when making cross-group surement equivalence (i.e., construct comparabil-
comparisons (including the LISREL source code used ity) can be tested using model fit indices (i.e.,
to estimate the models and a detailed Figural repre- the modeling rationale), whereas specific hypothe-
sentation). His data came from a cross-cultural study ses about the nature of possible group differences
of personal agency beliefs about school performance on the constructs can be tested using precise sta-
that included 2493 boys and girls from Los Angeles, tistical criteria. A measurement equivalent model is
Moscow, Berlin, and Prague. Little conducted an 8- advantageous for three reasons: (a) it is theoreti-
group MACS comparison of boys and girls across the cally very parsimonious and, thus, a reasonable a
priori hypothesis to entertain, (b) it is empirically very parsimonious, requiring fewer estimates than a non-invariance model, and (c) it provides the mathematical and theoretical basis by which quantitative cross-group or cross-time comparisons can be conducted. In other words, strong factorial invariance indicates that constructs are fundamentally similar in each group or across time (i.e., comparable), and hypotheses about the nature of possible group- or time-related influences can be meaningfully tested on any of the constructs' basic moments across time or across each group, whether the groups are defined on the basis of culture, gender, or any other grouping criteria.

References

[1] Bollen, K.A. (1989). Structural Equations with Latent Variables, Wiley, New York.
[2] Browne, M.W. & Cudeck, R. (1993). Alternative ways of assessing model fit, in Testing Structural Equation Models, K.A. Bollen & J.S. Long, eds, Sage Publications, Newbury Park, pp. 136–162.
[3] Byrne, B.M., Shavelson, R.J. & Muthén, B. (1989). Testing for the equivalence of factor covariance and mean structures: the issue of partial measurement invariance, Psychological Bulletin 105, 456–466.
[4] Cattell, R.B. (1944). Parallel proportional profiles and other principles for determining the choice of factors by rotation, Psychometrika 9, 267–283.
[5] Cheung, G.W. & Rensvold, R.B. (1999). Testing factorial invariance across groups: a reconceptualization and proposed new method, Journal of Management 25, 1–27.
[6] Horn, J.L. & McArdle, J.J. (1992). A practical and theoretical guide to measurement invariance in aging research, Experimental Aging Research 18, 117–144.
[7] Horst, P. & Schaie, K.W. (1956). The multiple group method of factor analysis and rotation to a simple structure hypothesis, Journal of Experimental Education 24, 231–237.
[8] Lawley, D.N. & Maxwell, A.E. (1971). Factor Analysis as a Statistical Method, 2nd Edition, Butterworth, London.
[9] Little, T.D. (1997). Mean and covariance structures (MACS) analyses of cross-cultural data: practical and theoretical issues, Multivariate Behavioral Research 32, 53–76.
[10] Marsh, H.W., Balla, J.R. & McDonald, R.P. (1988). Goodness-of-fit indexes in confirmatory factor analysis: the effect of sample size, Psychological Bulletin 103, 391–410.
[11] McArdle, J.J. (1996). Current directions in structural factor analysis, Current Directions 5, 11–18.
[12] McArdle, J.J. & Cattell, R.B. (1994). Structural equation models of factorial invariance in parallel proportional profiles and oblique confactor problems, Multivariate Behavioral Research 29, 63–113.
[13] Meredith, W. (1964). Rotation to achieve factorial invariance, Psychometrika 29, 187–206.
[14] Meredith, W. (1993). Measurement invariance, factor analysis and factorial invariance, Psychometrika 58, 525–543.
[15] Millsap, R.E. (1997). Invariance in measurement and prediction: their relationship in the single-factor case, Psychological Methods 2, 248–260.
[16] Millsap, R.E. & Kwok, O. (2004). Evaluating the impact of partial factorial invariance on selection in two populations, Psychological Methods 9, 93–115.
[17] Mulaik, S.A. (1972). The Foundations of Factor Analysis, McGraw-Hill, New York.
[18] Muthén, B.O. (1989). Factor structure in groups selected on observed scores, British Journal of Mathematical and Statistical Psychology 42, 81–90.
[19] Reise, S.P., Widaman, K.F. & Pugh, R.H. (1995). Confirmatory factor analysis and item response theory: two approaches for exploring measurement invariance, Psychological Bulletin 114, 552–566.
[20] Slegers, D.W. & Little, T.D. (in press). Evaluating contextual influences using multiple-group, longitudinal mean and covariance structures (MACS) methods, in Modeling Contextual Influences in Longitudinal Data, T.D. Little, J.A. Bovaird & J. Marquis, eds, Lawrence Erlbaum, Mahwah.
[21] Sörbom, D. (1982). Structural equation models with structured means, in Systems Under Direct Observation, K.G. Jöreskog & H. Wold, eds, Praeger, New York, pp. 183–195.
[22] Steenkamp, J.B. & Baumgartner, H. (1998). Assessing measurement invariance in cross-national consumer research, Journal of Consumer Research 25, 78–90.
[23] Widaman, K.F. & Reise, S.P. (1997). Exploring the measurement invariance of psychological instruments: applications in the substance use domain, in The Science of Prevention: Methodological Advances from Alcohol and Substance Abuse Research, K.J. Bryant & M. Windle et al., eds, American Psychological Association, Washington, pp. 281–324.

Further Reading

MacCallum, R.C., Browne, M.W. & Sugawara, H.M. (1996). Power analysis and determination of sample size for covariance structure modeling, Psychological Methods 1, 130–149.
McGaw, B. & Jöreskog, K.G. (1971). Factorial invariance of ability measures in groups differing in intelligence and socioeconomic status, British Journal of Mathematical and Statistical Psychology 24, 154–168.

(See also Structural Equation Modeling: Latent Growth Curve Analysis)

TODD D. LITTLE AND DAVID W. SLEGERS
Factor Analysis: Multitrait–Multimethod
WERNER WOTHKE
Volume 2, pp. 623–628
in Encyclopedia of Statistics in Behavioral Science
ISBN-13: 978-0-470-86080-9
ISBN-10: 0-470-86080-4
Editors
Confirmatory Factor Analysis

Trait-only Model

The trait-only model allows one factor per trait. Trait factors are usually permitted to correlate. For the nine-variable MTMM matrix shown in Table 1, assuming the same variable order, the loading matrix has the following simple structure:

\Lambda = \begin{pmatrix}
\lambda_{1,1} & 0 & 0 \\
0 & \lambda_{2,2} & 0 \\
0 & 0 & \lambda_{3,3} \\
\lambda_{4,1} & 0 & 0 \\
0 & \lambda_{5,2} & 0 \\
0 & 0 & \lambda_{6,3} \\
\lambda_{7,1} & 0 & 0 \\
0 & \lambda_{8,2} & 0 \\
0 & 0 & \lambda_{9,3}
\end{pmatrix}   (2)

and the matrix of factor correlations is

\Phi = \begin{pmatrix}
1 & \phi_{21} & \phi_{31} \\
\phi_{21} & 1 & \phi_{32} \\
\phi_{31} & \phi_{32} & 1
\end{pmatrix}   (3)

All zero entries in Λ and the diagonal entries in Φ are fixed (predetermined) parameters; the p factor loading parameters λ_{i,j}, the t(t − 1)/2 factor correlations, and the p uniqueness coefficients in the diagonal of Θ are estimated from the data. The model is identified when three or more methods are included in the measurement design. For the special case that all intertrait correlations are nonzero, model identification requires only two methods (two-indicator rule [2]).

The worked example uses the MTMM matrix of Table 2 on the basis of data by Flamer [8], also published in [9] and [22]. The traits are Attitude toward Discipline in Children (ADC), Attitude toward Mathematics (AM), and Attitude toward the Law (AL). The methods are all paper-and-pencil, differing by response format: dichotomous Likert (L) scales, Thurstone (Th) scales, and the semantic differential (SD) technique. Distinctly larger entries in the validity diagonals (in bold face) and similar patterns of small off-diagonal correlations in the monomethod triangles and heterotrait–heteromethod
Table 3 Parameter estimates for the trait-only factor model

                                Trait factors
Method           Trait     ADC     AM      AL      Uniqueness estimates
Likert           ADC       0.85    0.0     0.0     0.28
                 AM        0.0     0.77    0.0     0.41
                 AL        0.0     0.0     0.61    0.63
Thurstone        ADC       0.84    0.0     0.0     0.29
                 AM        0.0     0.80    0.0     0.36
                 AL        0.0     0.0     0.62    0.62
Semantic diff.   ADC       0.50    0.0     0.0     0.75
                 AM        0.0     0.95    0.0     0.12
                 AL        0.0     0.0     0.71    0.50

Factor correlations
        ADC     AM      AL
ADC     1.0
AM      0.07    1.0
AL      0.39    0.05    1.0

χ² = 23.28, df = 24, P = 0.503, N = 105
blocks suggest some stability of the traits across the three methods.

The parameter estimates for the trait-only factor model are shown in Table 3. The solution is admissible and its low maximum-likelihood χ² value signals acceptable statistical model fit. No additional model terms are called for. This factor model postulates considerable generality of traits across methods, although the large uniqueness estimates of some of the attitude measures indicate low factorial validity, limiting their practical use.

Performance of the trait-only factor model with other empirical MTMM data is mixed. In Wothke's [21] reanalyses of 23 published MTMM matrices, the model estimates were inadmissible or failed to converge in 10 cases. Statistically acceptable model fit was found with only 2 of the 23 data sets.

have therefore proposed the less restrictive trait-method factor model, permitting systematic variation due to shared methods as well as shared traits. The factor loading matrix of the expanded model simply has several columns of method factor loadings appended to the right, one column for each method:

\Lambda = \begin{pmatrix}
\lambda_{1,1} & 0 & 0 & \lambda_{1,4} & 0 & 0 \\
0 & \lambda_{2,2} & 0 & \lambda_{2,4} & 0 & 0 \\
0 & 0 & \lambda_{3,3} & \lambda_{3,4} & 0 & 0 \\
\lambda_{4,1} & 0 & 0 & 0 & \lambda_{4,5} & 0 \\
0 & \lambda_{5,2} & 0 & 0 & \lambda_{5,5} & 0 \\
0 & 0 & \lambda_{6,3} & 0 & \lambda_{6,5} & 0 \\
\lambda_{7,1} & 0 & 0 & 0 & 0 & \lambda_{7,6} \\
0 & \lambda_{8,2} & 0 & 0 & 0 & \lambda_{8,6} \\
0 & 0 & \lambda_{9,3} & 0 & 0 & \lambda_{9,6}
\end{pmatrix}   (4)
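As a concrete illustration (not part of the original entry), the zero/nonzero pattern of the loading matrices in (2) and (4) can be generated programmatically. The following Python/NumPy sketch marks free loadings with 1 and fixed zeros with 0, assuming the variable ordering used above (the three traits cycling within each method):

```python
import numpy as np

t, m = 3, 3    # three trait factors, three method factors
p = t * m      # nine observed variables, ordered trait-within-method

trait_block = np.zeros((p, t))
method_block = np.zeros((p, m))
for i in range(p):
    trait_block[i, i % t] = 1.0    # each variable loads on exactly one trait factor
    method_block[i, i // t] = 1.0  # ...and on exactly one method factor

# Equation (2) uses only the trait block; equation (4) appends the method columns.
Lambda_pattern = np.hstack([trait_block, method_block])
print(Lambda_pattern.astype(int))
```

Such a pattern mask is what a CFA fitting routine would use to decide which loadings to estimate and which to hold at zero.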
In the structured correlation matrix (5), one submatrix contains the correlations among the traits and the other submatrix contains the correlations among the methods.

While the block-diagonal trait-method model appeared attractive when first proposed, there has been growing evidence that its parameterization is inherently flawed. Inadmissible or unidentified model solutions are nearly universal with both simulated and empirical MTMM data [3, 15, 21]. In addition, identification problems of several aspects of the trait-method factor model have been demonstrated formally [10, 14, 16, 20]. For instance, consider factor loading structures whose nonzero entries are proportional by rows and columns:

\Lambda^{(p)} = \begin{pmatrix}
\gamma_1 & 0 & 0 & \gamma_4 & 0 & 0 \\
0 & \delta_2\gamma_2 & 0 & \delta_2\gamma_4 & 0 & 0 \\
0 & 0 & \delta_3\gamma_3 & \delta_3\gamma_4 & 0 & 0 \\
\delta_4\gamma_1 & 0 & 0 & 0 & \delta_4\gamma_5 & 0 \\
0 & \delta_5\gamma_2 & 0 & 0 & \delta_5\gamma_5 & 0 \\
0 & 0 & \delta_6\gamma_3 & 0 & \delta_6\gamma_5 & 0 \\
\delta_7\gamma_1 & 0 & 0 & 0 & 0 & \delta_7\gamma_6 \\
0 & \delta_8\gamma_2 & 0 & 0 & 0 & \delta_8\gamma_6 \\
0 & 0 & \delta_9\gamma_3 & 0 & 0 & \delta_9\gamma_6
\end{pmatrix}   (6)

where the δ_i are mt − 1 nonzero scale parameters for the rows of Λ^{(p)}, with δ_1 = 1 fixed and all other δ_i estimated, and the γ_k are a set of m + t nonzero scale parameters for the columns of Λ^{(p)}, with all γ_k estimated. Grayson and Marsh [10] proved algebraically that factor models with loading matrix (6) and factor correlation structure (5) are unidentified no matter how many traits and methods are analyzed. Even if Λ^{(p)} is further constrained by setting all (row) scale parameters to unity (δ_i = 1), the factor model will remain unidentified [20].

Currently, identification conditions for the general form of the trait-method model are not completely known. Identification and admissibility problems appear to be the rule with empirical MTMM data, although an identified, admissible, and fitting solution has been reported for one particular dataset [2]. However, in order to be identified, the estimated factor loadings must necessarily be different from the proportional structure in (6), a difference that would complicate the evaluation of trait validity. Estimation itself can also be difficult: the usually iterative estimation process often approaches an intermediate solution of the form (6) and cannot continue because the matrix of second derivatives of the fit function becomes rank deficient at that point. This is a serious practical problem because condition (6) is so general that it slices the identified solution space into many disjoint subregions, so that the model estimates can become extremely sensitive to the choice of start values. Kenny and Kashy [14] noted that ". . . estimation problems increase as the factor loadings become increasingly similar".

There are several alternative modeling approaches that the interested reader may want to consult: (a) CFA with alternative factor correlation structures [19]; (b) CFA with correlated uniqueness coefficients [4, 14, 15]; (c) covariance components analysis [22]; and (d) the direct product model [5]. Practical implementation issues for several of these models are reviewed in [14] and [22].

Conclusion

About thirty years of experience with confirmatory factor analysis of MTMM data have proven less than satisfactory. Trait-only factor analysis suffers from poor fit to most MTMM data, while the block-diagonal trait-method model is usually troubled by identification, convergence, or admissibility problems, or by combinations thereof. In the presence of method effects, there is no generally accepted multivariate model to yield summative measures of convergent and discriminant validity. In the absence of such a model, "(t)here remains the basic eyeball analysis as in the original article [6]. It is not always dependable; but it is cheap" [7].

References

[1] Althauser, R.P. & Heberlein, T.A. (1970). Validity and the multitrait-multimethod matrix, in Sociological
Methodology 1970, E.F. Borgatta, ed., Jossey-Bass, San Francisco.
[2] Bollen, K.A. (1989). Structural Equations with Latent Variables, Wiley, New York.
[3] Brannick, M.T. & Spector, P.E. (1990). Estimation problems in the block-diagonal model of the multitrait-multimethod matrix, Applied Psychological Measurement 14(4), 325–339.
[4] Browne, M.W. (1980). Factor analysis for multiple batteries by maximum likelihood, British Journal of Mathematical and Statistical Psychology 33, 184–199.
[5] Browne, M.W. (1984). The decomposition of multitrait-multimethod matrices, British Journal of Mathematical and Statistical Psychology 37, 1–21.
[6] Campbell, D.T. & Fiske, D.W. (1959). Convergent and discriminant validation by the multitrait-multimethod matrix, Psychological Bulletin 56, 81–105.
[7] Fiske, D.W. (1995). Reprise, new themes and steps forward, in Personality, Research Methods and Theory. A Festschrift Honoring Donald W. Fiske, P.E. Shrout & S.T. Fiske, eds, Lawrence Erlbaum Associates, Hillsdale.
[8] Flamer, S. (1978). The effects of number of scale alternatives and number of items on the multitrait-multimethod matrix validity of Likert scales, Unpublished Dissertation, University of Minnesota.
[9] Flamer, S. (1983). Assessment of the multitrait-multimethod matrix validity of Likert scales via confirmatory factor analysis, Multivariate Behavioral Research 18, 275–308.
[10] Grayson, D. & Marsh, H.W. (1994). Identification with deficient rank loading matrices in confirmatory factor analysis multitrait-multimethod models, Psychometrika 59, 121–134.
[11] Jöreskog, K.G. (1966). Testing a simple structure hypothesis in factor analysis, Psychometrika 31, 165–178.
[12] Jöreskog, K.G. (1971). Statistical analysis of sets of congeneric tests, Psychometrika 36(2), 109–133.
[13] Jöreskog, K.G. (1978). Structural analysis of covariance and correlation matrices, Psychometrika 43(4), 443–477.
[14] Kenny, D.A. & Kashy, D.A. (1992). Analysis of the multitrait-multimethod matrix by confirmatory factor analysis, Psychological Bulletin 112(1), 165–172.
[15] Marsh, H.W. & Bailey, M. (1991). Confirmatory factor analysis of multitrait-multimethod data: a comparison of alternative models, Applied Psychological Measurement 15(1), 47–70.
[16] Millsap, R.E. (1992). Sufficient conditions for rotational uniqueness in the additive MTMM model, British Journal of Mathematical and Statistical Psychology 45, 125–138.
[17] Millsap, R.E. (1995). The statistical analysis of method effects in multitrait-multimethod data: a review, in Personality, Research Methods and Theory. A Festschrift Honoring Donald W. Fiske, P.E. Shrout & S.T. Fiske, eds, Lawrence Erlbaum Associates, Hillsdale.
[18] Schmitt, N. & Stults, D.M. (1986). Methodology review: analysis of multitrait-multimethod matrices, Applied Psychological Measurement 10, 1–22.
[19] Widaman, K.F. (1985). Hierarchically nested covariance structure models for multitrait-multimethod data, Applied Psychological Measurement 9, 1–26.
[20] Wothke, W. (1984). The estimation of trait and method components in multitrait-multimethod measurement, Unpublished doctoral dissertation, University of Chicago.
[21] Wothke, W. (1987). Multivariate linear models of the multitrait-multimethod matrix, in Paper Presented at the Annual Meeting of the American Educational Research Association, Washington (paper available through ERIC).
[22] Wothke, W. (1996). Models for multitrait-multimethod matrix analysis, in Advanced Structural Equation Modeling. Issues and Techniques, G.A. Marcoulides & R.E. Schumacker, eds, Lawrence Erlbaum Associates, Mahwah.

(See also History of Path Analysis; Residuals in Structural Equation, Factor Analysis, and Path Analysis Models; Structural Equation Modeling: Overview)

WERNER WOTHKE
Factor Analysis of Personality Measures
WILLIAM F. CHAPLIN
Volume 2, pp. 628–636
Table 1 Illustration of the major structural models of personality based on factor analysis

Number of factors   Representative labels and structure                                        Associated theorists
2                   Love-Hate; Dominance-Submission (interpersonal circle)                     Leary, Wiggins
2                   Alpha (A, C, N), Beta (E, O) (higher-order factors of the Big Five)        Digman
3                   Extroversion, Neuroticism, Psychoticism                                    Eysenck
5                   E, A, C, N, O (Big Five; Five-Factor Model)                                Digman, Goldberg, Costa & McCrae
7                   E, A, C, N, O + Positive and Negative Evaluation                           Tellegen, Waller, Benet
16                  16 PF; 16 primary factors further grouped into five more global factors^a  Cattell

Note: E = Extroversion, A = Agreeableness, C = Conscientiousness, N = Neuroticism, O = Openness.
^a A complete list of the labels for the 16 PF can be found in [3].
included in the structure of personality. Adding such items to the domain to be factored resulted, not surprisingly, in a model called the Big Seven, as shown in Table 1.

Cattell's decision to exclude intelligence items, or Big Five theorists' decisions to exclude purely evaluative items, represent different views of what is meant by personality. It would be difficult to identify those views as correct or incorrect in any objective sense, but recognizing these different views can help clarify the differences in Table 1 and in other factor analyses of personality domains. The point is that understanding the results of a factor analysis of personality measures must begin with a careful evaluation of the measures that are included (and excluded) and the rationale behind such inclusion or exclusion. Probably the most prominent rationale for selecting variables for a factor analysis in personality has been the lexical hypothesis. This hypothesis roughly states that all of the most important ways that people differ from each other in personality will become encoded in the natural language as single-word person-descriptive terms such as "friendly" or "dependable". On the basis of this hypothesis, one selects words from a list of all possible terms that describe people culled from a dictionary and then uses those words as stimuli for which people are asked to describe themselves or others on those terms. Cattell used such a list that was compiled by Allport and Odbert [1] in his analyses, and more recently, the Big Five was based on a similar and more recent list compiled by Warren Norman [11].

How (and Why) Should Personality Factors be Extracted? The basic data used in a factor analysis of personality items are responses (typically ratings of descriptiveness of the item about one's self or possibly another person) from N subjects to k personality items; for example, "talks to strangers", "is punctual", or "relaxed". These N × k responses are then converted into a k × k correlation (or, less often, a covariance) matrix, and the k × k matrix is then factor analyzed to yield a factor matrix showing the loadings of the k variables on the m factors. Specifically, factor analysis operates on the common (shared) variance of the variables as measured by their intercorrelations. The amount of variance a variable shares with the other variables is called the variable's communality. Factor analysis proceeds by extracting factors iteratively such that the first factor accounts for as much of the total common variance across the items (called the factor's eigenvalue) as possible, the second factor accounts for as much of the remaining common variance as possible, and so on. Figure 1 shows a heuristic factor matrix. The elements of the matrix are the estimated correlations between each variable and each factor. These correlations are called loadings. To the right of the matrix is a column containing the final communality estimates (usually symbolized as h²). These are simply the sum of the squared loadings for each variable across the m factors and thus represent the total common variance in each variable that is accounted for by the factors. At the bottom of the matrix are the eigenvalues of the factors. These are the sum of the squared loadings for each factor across the k variables and thus represent the total amount of variance accounted for by each factor.

The point at which the correlation matrix is converted to a factor matrix represents the next
[Figure: path diagrams (a), (b), and (c) relating items, a common factor, and error terms]
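The factor-matrix bookkeeping described earlier (communalities h² as row sums of squared loadings, factor eigenvalues as column sums) can be sketched in a few lines of Python/NumPy; the loading values below are invented for illustration and are not taken from any published analysis:

```python
import numpy as np

# Hypothetical factor matrix: loadings of k = 4 items on m = 2 factors.
loadings = np.array([
    [0.8, 0.1],
    [0.7, 0.2],
    [0.1, 0.9],
    [0.2, 0.6],
])

# Communality h^2 of each item: sum of squared loadings across the m factors.
communalities = (loadings ** 2).sum(axis=1)

# Eigenvalue of each factor: sum of squared loadings across the k items.
eigenvalues = (loadings ** 2).sum(axis=0)

print(np.round(communalities, 2))  # common variance accounted for, per item
print(np.round(eigenvalues, 2))    # variance accounted for, per factor
```

Note that the two quantities are just row and column marginals of the same squared-loading matrix, so their totals agree.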
results in factors that account for less variance than PC analysis.

There is also a computational consequence of choosing PF over PC analysis. PF analysis is much more difficult from a computational standpoint than PC because one needs to estimate the error or uniqueness of the items before the analysis can proceed. This is typically done by regressing each item on all the others in the set to be factored, and using the resulting R² as the estimate of the item's common variance (communality) and 1 − R² as the item's unique variance. Multiple linear regression requires inverting a correlation matrix, a time-consuming, tedious, and error-prone task. If one were to factor, say, 100 items, one would have to invert 100 matrices. This task would simply be beyond the skills and temperament of most investigators, and as a consequence the vast majority of historical factor analyses used the PC approach, which requires no matrix inversion. Today we have computers, which, among other things, are designed for time-consuming, tedious, and error-prone tasks, so the computational advantage of PC is no longer of much relevance. However, the conservative nature of science, which tends to foster continuity of methods and measures, has resulted in the vast majority of factor analyses of personality items continuing to be based on PC, regardless of the (often unstated) view of the investigator about the nature of factors or the goal of the analysis.

Within the domain of personality it is often the case that similar factor structures emerge from the same data regardless of whether PC or PF is employed, probably because the initial regression-based communality estimates for personality variables in PF tend to approach the 1.0 estimates used by PC analysis. Thus, the decision to use PC or PF on personality data may be of little practical consequence. However, the implied view of factors as descriptive or causal by PC or PF, respectively, still has important implications for the study of personality. The causal view of factors must be a disciplined view to avoid circularity. For example, it is easy to explain that a person has responded in an agreeable manner because they are high on the agreeableness factor. Without further specifying, and independently testing, the source of that factor (e.g., genetic, cognitive, environmental), the causal assertion is circular ("He is agreeable because he is agreeable") and untestable. The PC view avoids this problem by simply using the factor descriptively without implying a cause.

However, the merely descriptive view of factors is scientifically less powerful, and two of the earliest and most influential factor analytic models of personality, those of Cattell [3] and Eysenck [5], both viewed factors as causal. Eysenck based his three factors on a strong biological theory that included the role of individual differences in brain structure and systems of biological activation and inhibition as the basis
of personality, and Eysenck used factor analysis to evaluate his theory by seeing if factors consistent with his theory could be derived from personality ratings. Cattell, on the other hand, did not base his 16 factors on an explicit theory but instead viewed factor analysis as a tool for empirically discovering the important and replicable factors that caused personality. The widely accepted contemporary model of five factors also has both descriptive and causal interpretations. The term "Five-Factor Model", used by Costa and McCrae among others, emphasizes a causal interpretation, whereas the term "Big Five", used by Goldberg among others, emphasizes the descriptive view.

How Many Factors are There? Probably the most difficult issue in factor analysis is deciding on the number of factors. Within the domain of personality, we have seen that the number of factors extracted is influenced crucially by the decision of how many and what type of items to factor. However, another reason that different investigators may report different numbers of factors is that there is no single criterion for deciding how many factors are needed or useful to account for the common variance among a set of items. The problem is that as one extracts more factors, one necessarily accounts for more common variance. Indeed, in PC analysis one can extract as many factors as there are items in the data set, and in doing so one can account for all the variance. Thus, the decision about the number of factors to extract is ultimately based on the balance between the statistical goal of accounting for variance and the substantive goal of simplifying a set of data into a smaller number of meaningful descriptive components or underlying causal factors. The term "meaningful" is the source of the inherent subjectivity in this decision.

The most common objective criterion that has been used to decide on the number of factors is Kaiser's "eigenvalues greater than 1.0" rule. The logic of this rule is that, at a minimum, a factor should account for more common variance than any single item. On the basis of this logic, it is clear that this rule only applies to PC analysis, where the common variance of an item is set at 1.0, and indeed Kaiser proposed this rule for PC analysis. Nonetheless, one often sees this rule misapplied in PF analyses. Although there is a statistical objectivity about this rule, in practice its application often results in factors that are specific to only one or two items and/or factors that are substantively difficult to interpret or name.

One recent development that addresses the number-of-factors problem is the use of factor analyses based on maximum-likelihood criteria. In principle, this provides a statistical test of the significance of the amount of additional variance accounted for by each additional factor. One then keeps extracting factors until the additional variance accounted for by each factor does not significantly increase over the variance accounted for by the previous factor. However, it is still often the case that factors that account for significantly more variance do not include large numbers of items and/or are not particularly meaningful. Thus, the tension between statistical significance and substantive significance remains, and ultimately the number of factors reported reflects a subjective balance between these two criteria.

How Should the Factors be Arranged (Rotated)? Yet another source of subjectivity in factor analysis results because the initial extraction of factors does not provide a statistically unique set of factors. Statistically, factors are extracted to account for as much variance as possible. However, once a set of factors is extracted, it turns out that there are many different combinations of factors and item loadings that will account for exactly the same amount of total variance of each item. From a statistical standpoint, as long as a group of factors accounts for the same amount of total variance, there is no basis for choosing one group over another. Thus, investigators are free to select whatever arrangements of factors and item loadings they wish. The term that is used to describe the rearrangement of factors among a set of personality items is rotation, which comes from the geometric view of factors as vectors moving (rotating) through a space defined by items.

There is a generally accepted criterion, called simple structure, that is used to decide how to rotate factors. An ideal simple structure is one where each item correlates 1.0 with one factor and 0.00 with the other factors. In actual data this ideal will not be realized, but the goal is to come as close to this ideal as possible for as many items as possible. The rationale for simple structure is simplicity, and this rationale holds for both PC and PF analyses. For PC analysis, simple
There is a generally accepted criterion, called simple structure, that is used to decide how to rotate factors. An ideal simple structure is one where each item correlates 1.0 with one factor and 0.00 with the other factors. In actual data this ideal will not be realized, but the goal is to come as close to this ideal as possible for as many items as possible. The rationale for simple structure is simplicity, and this rationale holds for both PC and PF analyses. For PC analysis, simple structure results in a description of the relations among the variables that is easy to interpret because there is little item overlap between factors. For PF analysis the rationale is that scientific explanations should be as simple as possible. However, there are several different statistical strategies that can be used to approximate simple structure, and the decision about which strategy to use is again a subjective one.
The major distinction between strategies to achieve simple structure is oblique versus orthogonal rotation of factors. As the labels imply, oblique rotation allows the factors to be correlated with each other, whereas orthogonal rotation constrains the factors to be uncorrelated. Most factor analyses use an orthogonal rotation based on a specific strategy called Varimax. Although other orthogonal strategies exist (e.g., Equimax, Quartimax), the differences among these in terms of rotational results are usually slight, and one seldom encounters these alternative orthogonal approaches. Orthogonal approaches to the rotation of personality factors probably dominate in the literature because of their computational simplicity relative to oblique rotations. However, the issue of computational simplicity is no longer of much concern with the computer power available today, so the continued preference for orthogonal rotations may, as with the preference for PC over PF, be historically rather than scientifically based.
Oblique rotations of personality factors have some distinct advantages over orthogonal rotations. In general, these advantages result because oblique rotations are less constrained than orthogonal ones. That is, oblique rotations allow the factors to be correlated with each other, whereas orthogonal rotations force the factors to be uncorrelated. Thus, in the pursuit of simple structure, oblique rotations will be more successful than orthogonal ones because oblique rotations have more flexibility. Oblique rotations can, in some sense, transfer the complexity of items that are not simple (load on more than one factor) to the factors by making the relations among the factors more complex. Perhaps the best way to appreciate the advantage of oblique rotations over orthogonal ones is to note that if the simple structure factors are orthogonal or nearly so, oblique rotations will leave the factors essentially uncorrelated and oblique rotations will become identical (or nearly so) to orthogonal ones. A second advantage of oblique rotations of personality factors is that it allows the investigator to explore higher order factor models, that is, factors of factors. Two of the systems shown in Table 1, Digman's Alpha and Beta factors and Cattell's five Global Factors for the 16PF, represent such higher order factor solutions.
Simple structure is generally accepted as a goal of factor rotation and is the basis for all the specific rotational strategies available in standard factor analytic software. However, within the field of personality there has been some theoretical recognition that simple structure may not be the most appropriate way to conceptualize personality. The best historical example of this view is the interpersonal circle of Leary, Wiggins, and others [14]. A circular arrangement of items around two orthogonal axes means that some items must load equally highly on both factors, which is not simple structure. In the interpersonal circle, for example, an item such as trusting has both loving and submissive aspects, and so would load complexly on both the Love-Hate and Dominance-Submission factors. Likewise, cruel has both dominant and hateful aspects. More recently, a complex version of the Big Five called the AB5C structure, which explicitly recognizes that many personality items are blends of more than one factor, was introduced by [8]. In using factor analysis to identify or evaluate circumplex models, or any personality models that explicitly view personality items as blends of factors, simple structure will not be an appropriate criterion for arranging the factors.

What Should the Factors be Called? In previous sections, the importance of the meaningfulness and interpretation of personality factors as a basis for evaluating the acceptability of a factor solution has been emphasized. But, of course, the interpretation and naming of factors is another source of the inherent subjectivity in the process. This subjectivity is no different than the subjectivity of all of science when it comes to interpreting results, but the fact that different, but reasonable, scientists will often disagree about the meaning or implications of the same data certainly applies to the results of a factor analysis.
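The Varimax strategy discussed above picks, among all orthogonal rotations, the one that maximizes the variance of the squared loadings within each factor. A minimal two-factor sketch using a brute-force search over rotation angles; the loading matrix A is invented for illustration, not taken from this article.

```python
import math

# Hedged sketch of orthogonal (Varimax-style) rotation of a two-factor
# loading matrix: search rotation angles for the one maximizing the sum,
# over factors, of the variance of the squared loadings. A is made-up.
A = [[0.7, 0.3], [0.6, 0.4], [0.2, 0.7], [0.3, 0.6]]

def rotate(A, theta):
    # Post-multiply A by the 2x2 orthogonal rotation for angle theta.
    c, s = math.cos(theta), math.sin(theta)
    return [[a1 * c - a2 * s, a1 * s + a2 * c] for a1, a2 in A]

def varimax_criterion(A):
    n = len(A)
    total = 0.0
    for j in range(2):
        sq = [row[j] ** 2 for row in A]
        mean = sum(sq) / n
        total += sum((x - mean) ** 2 for x in sq) / n
    return total

best = max((varimax_criterion(rotate(A, math.radians(d))), d)
           for d in range(0, 90))
print("best rotation angle (degrees):", best[1])
```

Real implementations (e.g., R's varimax()) use an iterative pairwise algorithm rather than a grid search, but the criterion being maximized is the same; note that any orthogonal rotation leaves each item's communality unchanged.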
The interpretation problem in factor analysis is perhaps particularly pronounced because factors, personality or otherwise, have no objective reality. Indeed, factors do not result from a factor analysis; rather, the result is a matrix of factor loadings such as the one shown in Figure 1. On the basis of the content of the items and their loadings in the matrix, the nature of the factor is inferred. That is, we know a factor through the variables with which it is correlated. It is because factors do not exist and are not, therefore, directly observed that we often call them latent factors. Latent factors have the same properties as other latent variables such as depression, intelligence, or time. None of these variables is observed or measured directly, but rather they are measured via observations that are correlated with them, such as loss of appetite, vocabulary knowledge, or the movement of the hand on a watch. A second complication is that in the factor analyses described here there are no statistical tests of whether a particular loading is significant; instead, different crude standards, such as loadings over 0.50 or over 0.30, have been used to decide if an item is on a factor. Different investigators can, of course, decide on different standards, with the result that factors are identified by different items, even in the same analysis.
Thus, it should come as no surprise that different investigators will call the same factor by a different name. Within the domain of the interpersonal circle, for example, the factors have been called Love and Hate, or Affiliation and Dominance. Within the Big Five, various labels have been applied to each factor, as shown in Table 2. Although there is a degree of similarity among the labels in each column, there are clear interpretive differences as well. The implication of this point is that one should not simply look at the name or interpretation an investigator applies to a factor, but also at the factor-loading matrix, so that the basis for the interpretation can be evaluated. It is not uncommon to see the same label applied to somewhat different patterns of loadings, or for different labels to be applied to the same pattern of loadings.
Some investigators, perhaps out of recognition of the difficulty and subjectivity of factor naming, have eschewed applying labels at all and instead refer to factors by number. Thus, in the literature on the Big Five, one may see reference to Factors I, II, III, IV, and V. Of course, those investigators know the names that are typically applied to the numbered factors, and these are shown in Table 2. Another approach has been to name the factors with uncommon labels to try to separate the abstract scientific meaning of a factor from its everyday interpretation. In particular, Cattell used this approach with the 16PF, where he applied labels to his factors such as Parmia, Premsia, Autia, and so on. Of course, translations of these labels into their everyday equivalents soon appeared (Parmia is Social Boldness, Premsia is Sensitivity, and Autia is Abstractedness), but the point can be appreciated, even if not generally followed.

A Note on Confirmatory Factor Analysis. This presentation of factor analysis of personality measures has focused almost exclusively on approaches to factor analysis that are often referred to as exploratory (see Factor Analysis: Exploratory). This label is somewhat misleading, as it implies that investigators use factor analysis just to see what happens. Most investigators are not quite so clueless, and the factor analysis of personality items usually takes place under circumstances where the investigator has some specific ideas about what items should be included in the set to be factored, and hypotheses about how many factors there are, what items will be located on the same factor, and even what the factors will be called. In this sense, the analysis has some confirmatory components.
Table 2  Some of the different names applied to the Big Five personality factors in different systems

Factor I         Factor II      Factor III         Factor IV            Factor V
Extroversion     Agreeableness  Conscientiousness  Emotional Stability  Openness to Experience
Surgency         Femininity     High Ego Control   Neuroticism (r)      Intellect
Power            Love           Prudence           Adjustment           Imagination
Low Ego Control  Likeability    Work Orientation   Anxiety (r)          Rebelliousness
Sociability                     Impulsivity (r)

Note: r = label is reversed relative to the other labels.
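The crude loading standards discussed earlier (loadings over 0.50, or over 0.30) amount to a simple thresholding rule for deciding which items define a factor. A sketch with invented items and loadings (not from this article); note how a complex item's membership changes as the cutoff moves from 0.30 to 0.50.

```python
# Sketch of the crude loading-cutoff convention: an item is assigned to a
# factor when its absolute loading exceeds a chosen standard (0.30 here;
# 0.50 is the stricter alternative). Items and loadings are hypothetical.
loadings = {
    "talkative":   (0.62, 0.10),
    "sympathetic": (0.05, 0.58),
    "trusting":    (0.35, 0.41),   # a complex item: passes 0.30 on both
}

def items_on_factor(loadings, factor, cutoff=0.30):
    return sorted(item for item, row in loadings.items()
                  if abs(row[factor]) > cutoff)

print(items_on_factor(loadings, 0))        # -> ['talkative', 'trusting']
print(items_on_factor(loadings, 0, 0.50))  # -> ['talkative']
```

Two investigators using 0.30 and 0.50 would thus identify the first factor with different item sets, which is exactly the source of subjectivity the text describes.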
In fact, the term exploratory refers to the fact that in these analyses a correlation matrix is submitted for analysis and the analysis generates the optimal factors and loadings empirically for that sample of data, without regard to the investigator's ideas and expectations. Thus the investigator's beliefs do not guide the analyses, and so they are not directly tested. Indeed, there is no hypothesis-testing framework within exploratory factor analysis, and this is why most decisions associated with this approach to factor analysis are subjective.
The term confirmatory factor analysis (CFA) (see Factor Analysis: Confirmatory) is generally reserved for a particular approach that is based on structural equation modeling as represented in programs such as LISREL, EQS, or AMOS (see Structural Equation Modeling: Software). CFA is explicitly guided by the investigator's beliefs and hypotheses. Specifically, the investigator indicates the number of factors, designates the variables that load on each factor, and indicates if the factors are correlated (oblique) or uncorrelated (orthogonal). The analyses then proceed to generate a hypothetical correlation matrix based on the investigator's specifications, and this matrix is compared to the empirical correlation matrix based on the items. Chi-square goodness-of-fit tests, and various modifications of these as fit indices, are available for evaluating how close the hypothesized matrix is to the observed matrix. In addition, the individual components of the model, such as the loadings of individual variables on specific factors and proposed correlations among the factors, can be statistically tested. Finally, the increment in the goodness-of-fit of more complex models relative to simpler ones can be tested to see if the greater complexity is warranted.
Clearly, when investigators have some idea about what type of factor structure should emerge from their analysis, and investigators nearly always have such an idea, CFA would seem to be the method of choice. However, the application of CFA to personality data has been slow to develop and is not widely used. The primary reason for this is that CFA does not often work well with personality data. Specifically, even when item sets that seem to have a well-established structure, such as those contained in the Big Five Inventory (BFI-44 [2]) or the Eysenck Personality Questionnaire (EPQ [5]), are subjected to CFA based on that structure, the fit of the established structure to the observed correlations is generally below the minimum standards of acceptable fit.
The obvious interpretation of this finding is that factor analyses of personality measures do not lead to structures that adequately summarize the complex relations among those measures. This interpretation is undoubtedly correct. What is not correct is the further conclusion that structures such as those represented by five or three or seven factors, or circumplexes, or whatever, are therefore useless or misleading characterizations of personality.
Factor analyses of personality measures are intended to simplify the complex observed relations among personality measures. Thus, it is not surprising that factor analytic solutions do not summarize all the variation and covariation among personality measures. The results of CFA indicate that factor analytic models of personality simply do not capture all the complexity in human personality, but this is not their purpose. To adequately represent this complexity, items would need to load on a number of factors (no simple structure), factors would need to correlate with each other (oblique rotations), and many small factors representing only one or two items might be required. Moreover, such structures might well be specific to a given sample and would not generalize. The cost of correctly modeling personality would be the loss of the simplicity that the factor analysis was initially designed to provide. Certainly the factor analysis of personality measures is an undertaking where Whitehead's dictum, "Seek simplicity but distrust it", applies.
CFA can still be a powerful tool for evaluating the relations among personality measures. The point of this discussion is simply that CFA should not be used to decide if a particular factor analytic model is correct, as the model almost certainly is not correct because it is too simple. Rather, CFA should be used to compare models of personality by asking if adding more factors or correlations among factors significantly improves the fit of a model. That is, when the question is changed from "Is the model correct?" to "Which model is significantly better?", CFA can be a most appropriate tool. Finally, it is important to note that CFA also does not address the decision in factor analysis of personality measures that probably has the most crucial impact on the results. This is the initial decision about what variables are to be included in the analysis.
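The model-comparison use of CFA recommended above can be sketched as a chi-square difference test between nested models: the more complex model (more free parameters, hence fewer degrees of freedom) is preferred only when the drop in chi-square exceeds the critical value for the difference in degrees of freedom. The fit values below are invented; the 0.05 cutoffs for 1 to 3 df are the standard chi-square critical values.

```python
# Hedged sketch of the nested-model comparison: rather than asking whether
# a CFA model is "correct", ask whether a more complex model fits
# significantly better than a simpler one. Fit statistics are invented.
CHISQ_CRIT_05 = {1: 3.84, 2: 5.99, 3: 7.81}   # 0.05 critical values

def prefer_more_complex(chisq_simple, df_simple, chisq_complex, df_complex):
    delta_chisq = chisq_simple - chisq_complex
    delta_df = df_simple - df_complex          # complex model has fewer df
    return delta_chisq > CHISQ_CRIT_05[delta_df]

# e.g., allowing the two factors to correlate frees one parameter:
print(prefer_more_complex(310.2, 90, 295.7, 89))  # -> True  (14.5 > 3.84)
print(prefer_more_complex(310.2, 90, 308.9, 89))  # -> False (1.3 < 3.84)
```

This is the "Which model is significantly better?" question in miniature; in practice the chi-square values would come from an SEM program such as those named above.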
Summary

Factor analysis of personality measures has resulted in a wide variety of possible structures of human personality. This variety results because personality psychologists have different theories about what constitutes the domain of personality, and they have different views about the goals of factor analysis. In addition, different reasonable criteria exist for determining the number of factors and for rotating and naming those factors. Thus, the evaluation of any factor analysis must include not simply the end result, but all the decisions that were made on the way to achieving that result. The existence of many reasonable factor models of human personality suggests that people are diverse not only in their personality, but in how they perceive personality.

References

[1] Allport, G.W. & Odbert, H.S. (1936). Trait names: a psycho-lexical study, Psychological Monographs 47 (Whole No. 211).
[2] Benet-Martinez, V. & John, O.P. (1998). Los Cinco Grandes across cultures and ethnic groups: multitrait multimethod analyses of the Big Five in Spanish and English, Journal of Personality and Social Psychology 75, 729–750.
[3] Cattell, R.B. (1995). Personality structure and the new fifth edition of the 16PF, Educational & Psychological Measurement 55(6), 926–937.
[4] Digman, J.M. (1997). Higher-order factors of the Big Five, Journal of Personality and Social Psychology 73, 1246–1256.
[5] Eysenck, H.J. & Eysenck, S.B.G. (1991). Manual of the Eysenck Personality Scales (EPQ Adults), Hodder and Stoughton, London.
[6] Fabrigar, L.R., Wegener, D.T., MacCallum, R.C. & Strahan, E.J. (1999). Evaluating the use of exploratory factor analysis in psychological research, Psychological Methods 4, 272–299.
[7] Goldberg, L.R. & Velicer, W.F. (in press). Principles of exploratory factor analysis, in Differentiating Normal and Abnormal Personality, 2nd Edition, S. Strack, ed., Springer, New York.
[8] Hofstee, W.K.B., de Raad, B. & Goldberg, L.R. (1992). Integration of the Big Five and circumplex approaches to trait structure, Journal of Personality and Social Psychology 63, 146–163.
[9] John, O.P. (1990). The Big Five factor taxonomy: dimensions of personality in the natural language and in questionnaires, in Handbook of Personality: Theory and Research, L.A. Pervin, ed., Guilford Press, New York, pp. 66–100.
[10] McCrae, R.R. & Costa Jr, P.T. (1996). Toward a new generation of personality theories: theoretical contexts for the five-factor model, in The Five-Factor Model of Personality: Theoretical Perspectives, J.S. Wiggins, ed., Guilford Press, New York, pp. 51–87.
[11] Norman, W. (1963). Toward an adequate taxonomy of personality attributes, Journal of Abnormal and Social Psychology 66(6), 574–583.
[12] Spearman, C. (1904). General intelligence objectively determined and measured, American Journal of Psychology 15, 201–293.
[13] Thurstone, L.L. (1934). The vectors of mind, Psychological Review 41, 1–32.
[14] Wiggins, J.S. (1982). Circumplex models of interpersonal behavior in clinical psychology, in Handbook of Research Methods in Clinical Psychology, P.C. Kendall & J.N. Butcher, eds, Wiley, New York, pp. 183–221.

WILLIAM F. CHAPLIN
Factor Score Estimation
SCOTT L. HERSHBERGER
Volume 2, pp. 636–644
and the eight observed scores for a person are

z = [ .10  .22  .19  .25  .09  .23  .15  .19 ]'.  (5)

The three eigenvalues are, respectively, 2.39, 1.64, and 1.53. The first, second, and third principal component scores are calculated by weighting each observed score by the ratio of its loading to the component's eigenvalue and summing across the eight variables; for this case they are

f1 = .04,  f2 = .05,  f3 = .01.  (6)

Component scores can be computed using either the unrotated pattern matrix or the rotated pattern matrix; both are of equivalent statistical validity. The scores obtained using the rotated matrix are simply rescaled transformations of scores obtained using the unrotated matrix.

Common Factor Scores

Why are Common Factor Scores Indeterminate?

Scores from the common factor model are estimated because it is mathematically impossible to determine a unique set of them; an infinite number of such sets exist. This results from the underidentification of the common factor model (see Factor Analysis: Exploratory; Identification). An underidentified model is a model for which not enough information is present in the data to estimate all of the model's unknown parameters. In the principal components model, identification is achieved by imposing two restrictions: (a) the first component accounts for the maximum amount of variance possible, the second the next, and so on, and (b) the components are uncorrelated with each other. Imposing these two restrictions, the unknown parameters in the principal components model, the n × m factor loadings, can be uniquely estimated. Thus, the principal components model is identified: the n × m factor loadings to be estimated are fewer in number than the n(n + 1)/2 correlations available to estimate them.
In contrast, even with the imposition of the two restrictions, the common factor model remains underidentified for the following reason. The model postulates not only the existence of m common factors underlying n variables, requiring the specification of n × m factor loadings (as in the principal components model); it also postulates the existence of n specific factors, resulting in a model with (n × m) + n parameters to be estimated, greater in number than the n(n + 1)/2 available to estimate them. As a result, the n × m factor loadings have an infinite number of possible values. Logically then, the factor scores would be expected to have an infinite number of possible values.
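The principal component scores in (6) above are weighted sums of the observed scores, each weight being the item's loading divided by the component's eigenvalue. A sketch with hypothetical loadings, eigenvalue, and z scores, not the article's eight-variable data:

```python
# Sketch of a principal component score as in (5)-(6): each standardized
# observed score is weighted by loading/eigenvalue and the products are
# summed. All numbers below are hypothetical.
def pc_score(loadings, eigenvalue, z):
    return sum(a * zi for a, zi in zip(loadings, z)) / eigenvalue

loadings = [0.8, 0.7, 0.6]
z = [0.5, -0.2, 1.0]
print(round(pc_score(loadings, 2.0, z), 3))  # -> 0.43
```

Because these weights follow directly from the loadings and eigenvalues, component scores are fully determinate, which is the contrast being drawn with common factor scores.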
Methods for Estimating Common Factor Scores

Estimation by Regression

Thomson [9] was the first to suggest that ordinary least-squares regression methods (see Least Squares Estimation) can be used to obtain estimates of factor scores. The information required to find the regression weights for the factors on the observed variables, the correlations among the observed variables and the correlations between the factors and the observed variables, is available from the factor analysis. The least-squares criterion is to minimize the sum of the squared differences between predicted and true factor scores, which is analogous to the generic least-squares criterion of minimizing the sum of the squared differences between predicted and observed scores.
We express the linear regression of any factor f on the observed variables z in matrix form for one case as

F(1×m) = Z(1×n) B(n×m),  (7)

where B is a matrix of weights for the regression of the m factors on the n observed variables, and F is a row vector of m estimated factor scores.
When the common factors are orthogonal, we use the following matrix equation to obtain B:

B(n×m) = R^-1(n×n) A(n×m),  (8)

where R is the matrix of correlations between the n observed variables. If the common factors are nonorthogonal, we also require Φ, the correlation matrix among the m factors:

B(n×m) = R^-1(n×n) A(n×m) Φ(m×m).  (9)

To illustrate the regression method, we perform a common factor analysis on a set of six observed variables, retaining two nonorthogonal factors. We use data from three cases and define

Z = [ 0.00  0.57  1.15  0.00  1.12  0.87
      1.00  0.57  0.57  0.00  0.80  0.21
      1.00  1.15  0.57  0.00  0.32  1.09 ],

R = [ 1.00
      0.31  1.00
      0.48  0.54  1.00
      0.69  0.31  0.45  1.00
      0.34  0.30  0.26  0.41  1.00
      0.37  0.41  0.57  0.39  0.38  1.00 ],

A = [ 0.65  0.19
      0.59  0.08
      0.59  0.17
      0.12  0.72
      0.14  0.70
      0.24  0.34 ],

Φ = [ 1.00
      0.45  1.00 ].  (10)

On the basis of (9), the regression weights are B = R^-1 A Φ.  (11)
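Equation (9) can be sketched for a deliberately tiny case, two observed variables and a single factor, where Φ collapses to 1.0 and R^-1 A reduces to solving one 2 x 2 linear system. The correlations and loadings below are invented, not the article's six-variable example.

```python
# Sketch of the regression weights B = R^-1 A Phi of equation (9) for a
# two-variable, one-factor toy case; with one factor, Phi is just 1.0.
# Solving R b = A by Cramer's rule stands in for the matrix inversion.
def solve_2x2(R, rhs):
    det = R[0][0] * R[1][1] - R[0][1] * R[1][0]
    return [(rhs[0] * R[1][1] - rhs[1] * R[0][1]) / det,
            (R[0][0] * rhs[1] - R[1][0] * rhs[0]) / det]

R = [[1.0, 0.3], [0.3, 1.0]]   # correlations among the observed variables
A = [0.7, 0.5]                 # factor loadings
phi = 1.0
B = [w * phi for w in solve_2x2(R, A)]

# Sanity check: R times B should reproduce the loadings A.
print(round(R[0][0] * B[0] + R[0][1] * B[1], 6))  # -> 0.7
```

A factor score estimate for a case is then just the weighted sum of its standardized observed scores, exactly as in (7).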
Then, based on (7), the two factor scores for the three cases are

F = [ 0.69  0.72
      0.44  0.68
      1.14  0.04 ].  (12)

Bartlett's method specifies that for one case

F(1×m) = Z(1×n) U^-2(n×n) A(n×m) (A' U^-2 A)^-1,  (14)

where U^2 is the diagonal matrix of unique variances, so that each variable is weighted by the inverse of its unique variance. For these data, U^2 = diag(0.64, 0.78, 0.66, 0.62, 0.87, 0.73), so U^-2 = diag(1.56, 1.28, 1.51, 1.61, 1.50, 1.37). Applying this to the three cases yields the Bartlett factor scores

[ 0.85  0.49
  0.69  0.45
  1.54  0.04 ],

and the corresponding unique factor scores are

V = [ 0.41  0.31  1.39  0.28  0.89  0.92
      1.33  0.44  1.09  0.15  0.73  0.27
      0.92  0.76  0.28  0.13  0.16  1.21 ].  (20)

Bartlett factor score estimates can always be distinguished from regression factor score estimates by examining the variance of the factor scores. While regression estimates have variances less than 1, Bartlett estimates have variances greater than 1 [7]. This can be explained as follows. The regression estimation procedure divides the factor score f into two uncorrelated parts, the regression part f̂ and the residual part f - f̂. Thus,

f = f̂ + e,  (21)

where

e = f - f̂.  (22)

Since the e are assumed multivariate normally distributed (see Catalogue of Probability Density Functions), the f̂ can further be written as the sum of the factor score f and a residual ε:

f̂ = f + ε,  (23)

where

ε = f̂ - f.  (24)

The result is that the variance of f̂ is the sum of the unit variance of f and the variance of ε, the error about the true value.
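For a single factor, the Bartlett estimator above reduces to a scalar weighted least squares expression in which each variable is weighted by the inverse of its unique variance, so well-determined variables dominate the estimate. A sketch with hypothetical loadings, uniquenesses, and scores:

```python
# One-factor sketch of Bartlett's estimator: f_hat equals the weighted sum
# of observed scores divided by the weighted sum of squared loadings, with
# weights 1/u_i^2 (inverse unique variances). All numbers are hypothetical.
def bartlett_score(loadings, unique_vars, z):
    num = sum(a * zi / u for a, u, zi in zip(loadings, unique_vars, z))
    den = sum(a * a / u for a, u in zip(loadings, unique_vars))
    return num / den

loadings = [0.8, 0.6]
unique_vars = [0.36, 0.64]     # 1 - a^2 for each variable
z = [1.0, 1.0]
print(round(bartlett_score(loadings, unique_vars, z), 3))  # -> 1.35
```

Note the estimate exceeds both observed scores, in keeping with the point above that Bartlett estimates have variances greater than 1.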
Uncorrelated Scores Minimizing Unique Factors

Anderson and Rubin [1] revised Bartlett's method so that the factor score estimates are both uncorrelated with the other m - 1 factors and not correlated with each other. These two properties result from the following matrix equation for the factor score estimates:

F(1×m) = Z(1×n) U^-2(n×n) A(n×m) Φ(m×m) (A' U^-2 R U^-2 A)^-1/2,  (25)

where Φ is a matrix of factor correlations. While resembling (14), (25) is substantially more complex to solve: the term A'U^-2RU^-2A is raised to a power of -1/2. This power indicates that the inversion of the symmetric square root of the matrix product is required. The symmetric square root of a matrix can be found for any positive definite symmetric matrix. To illustrate, we define G as an n × n positive semidefinite symmetric matrix. The symmetric square root of G, G^1/2, must meet the following condition:

G(n×n) = G^1/2(n×n) G^1/2(n×n).  (26)

Perhaps the most straightforward method of obtaining G^1/2 is to obtain the spectral decomposition of G, such that G can be reproduced by a function of its eigenvalues (λ) and eigenvectors (x):

G(n×n) = X(n×n) D(n×n) X'(n×n),  (27)

where X is an n × n matrix of eigenvectors and D is an n × n diagonal matrix of eigenvalues. It follows then that

G^1/2(n×n) = X(n×n) D^1/2(n×n) X'(n×n).  (28)

If we set G = A' U^-2 R U^-2 A, (25) can now be rewritten as

F = Z(1×n) U^-2(n×n) A(n×m) Φ(m×m) G^-1/2.  (29)

To illustrate the Anderson and Rubin method, we specify

X = [ 0.74  -0.67
      0.67   0.74 ]  (30)

and

D = [ 23.73  0.00
       0.00  1.64 ].  (31)

Then, for A'U^-2RU^-2A, the spectral decomposition is

G = X D X',  (32)

and therefore, with D^1/2 = diag(4.87, 1.28),

G^1/2 = X D^1/2 X' = [ 3.26  1.79
                       1.79  2.89 ].  (33)

Inverting this symmetric square root yields

G^-1/2 = [ 0.46  -0.29
          -0.29   0.52 ].
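The spectral-decomposition route to the symmetric square root in (26)-(28) can be written in closed form for a 2 x 2 matrix. The check at the end feeds in a G obtained by squaring the G^1/2 of (33), so the printed result should approximately recover [3.26 1.79; 1.79 2.89]; the function is a sketch assuming g12 != 0 and a positive definite input.

```python
import math

# Sketch of the symmetric square root via spectral decomposition,
# equations (26)-(28), specialized to a 2x2 symmetric positive definite
# matrix: G^(1/2) = X D^(1/2) X'. Assumes g12 != 0.
def sqrtm_2x2(g11, g12, g22):
    tr, det = g11 + g22, g11 * g22 - g12 * g12
    gap = math.sqrt(tr * tr / 4 - det)
    lam1, lam2 = tr / 2 + gap, tr / 2 - gap        # eigenvalues
    vx, vy = g12, lam1 - g11                       # eigenvector for lam1
    n = math.hypot(vx, vy)
    c, s = vx / n, vy / n                          # X = [[c, -s], [s, c]]
    r1, r2 = math.sqrt(lam1), math.sqrt(lam2)
    return [[r1 * c * c + r2 * s * s, (r1 - r2) * c * s],
            [(r1 - r2) * c * s, r1 * s * s + r2 * c * c]]

# G below approximately squares the article's G^(1/2) from (33).
half = sqrtm_2x2(13.83, 11.01, 11.55)
print([[round(x, 2) for x in row] for row in half])
```

For larger matrices a general eigensolver would be used (scipy.linalg.sqrtm performs this job); the 2 x 2 closed form only illustrates (27)-(28).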
Then, from (29), the two factor scores for the three cases are

F = [ 0.67  0.32
      0.68  0.53
      1.35  0.20 ].  (34)

The unique factor scores are computed as in the Bartlett method (15). Substituting the results of the Anderson and Rubin method into (15) yields

V = [ 0.32  0.40  1.44  0.19  1.01  0.99
      1.34  0.45  1.07  0.18  0.68  0.30
      1.03  0.86  0.36  0.01  0.33  1.31 ].  (35)

Conclusion

For convenience, we reproduce the factor scores estimated by the regression, Bartlett, and Anderson and Rubin methods:

Regression    Bartlett      Anderson-Rubin
0.69  0.72    0.85  0.49    0.67  0.32
0.44  0.68    0.69  0.45    0.68  0.53
1.14  0.04    1.54  0.04    1.35  0.20

The similarity of the factor score estimates computed by the three methods is striking. This is in part surprising: empirical studies have found that although the factor score estimates obtained from different methods correlate substantially, they often have very different values [8]. So it would seem that the important issue is not which of the three estimation methods should be used, but whether any of them should be used at all due to factor score indeterminacy, implying that only principal component scores should be obtained. Readers seeking additional information on this area of controversy specifically, and on factor scores generally, should consult, in addition to those references already cited, [3–6, 10].

References

[1] Anderson, T.W. & Rubin, H. (1956). Statistical inference in factor analysis, Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability 5, 111–150.
[2] Bartlett, M.S. (1937). The statistical conception of mental factors, British Journal of Psychology 28, 97–104.
[3] Cattell, R.B. (1978). The Scientific Use of Factor Analysis in the Behavioral and Life Sciences, Plenum Press, New York.
[4] Gorsuch, R.L. (1983). Factor Analysis, 2nd Edition, Lawrence Erlbaum Associates, Hillsdale.
[5] Harman, H.H. (1976). Modern Factor Analysis, 3rd Edition revised, The University of Chicago Press, Chicago.
[6] Lawley, D.N. & Maxwell, A.E. (1971). Factor Analysis as a Statistical Method, American Elsevier Publishing, New York.
[7] McDonald, R.P. (1985). Factor Analysis and Related Methods, Lawrence Erlbaum Associates, Hillsdale.
[8] Mulaik, S.A. (1972). The Foundations of Factor Analysis, McGraw-Hill, New York.
[9] Thomson, G.H. (1939). The Factorial Analysis of Human Ability, Houghton Mifflin, Boston.
[10] Yates, A. (1987). Multivariate Exploratory Data Analysis: A Perspective on Exploratory Factor Analysis, State University of New York Press, Albany.

SCOTT L. HERSHBERGER
Factorial Designs
PHILIP H. RAMSEY
Volume 2, pp. 644–645
PHILIP H. RAMSEY
Family History Versus Family Study Methods in
Genetics
AILBHE BURKE AND PETER MCGUFFIN
Volume 2, pp. 646–647
(See also Family Study and Relative Risk)

AILBHE BURKE AND PETER MCGUFFIN
Family Study and Relative Risk
PETER MCGUFFIN AND AILBHE BURKE
Volume 2, pp. 647–648
of the disorder. This problem can be addressed using an age correction, the most straightforward of which, originally proposed by Weinberg (the same Weinberg after whom the Hardy–Weinberg equilibrium in population genetics is named), is to calculate a corrected denominator or Bezugsziffer (BZ). The lifetime risk or morbidity risk (MR) of the disorder can be estimated as the number of affecteds (A) divided by the BZ, where the BZ is calculated as:

BZ = Σᵢ wᵢ + A

and where wᵢ is the weight given to the ith unaffected individual on the basis of their current age. The simplest system of assigning weights, the shorter Weinberg method, is to give a weight of zero to those younger than the age of risk, a weight of a half to those within the age of risk, and a weight of one to those beyond the age of risk. A more accurate modification devised by Erik Strömgren is to use an empirical age of onset distribution from a large separate sample, for example, a national registry of psychiatric disorders, to obtain the cumulative frequency of disorder over a range of age bands from which weights can be derived [4]. Unfortunately, national registry data are often unavailable, and an alternative method is to take the age of onset distribution in the probands and transform it to a normal distribution, for example, using a log transform [2]. The log age for each unaffected relative can be converted to a standard score, and a weight, the proportion of the period of risk that has been lived through, can be assigned by reference to the standard normal integral.

Another approach is to carry out life table analysis. The method most often used in family studies is called the Weinberg morbidity table, but essentially the method is the same as in the life table analysis performed in other spheres. The distribution of survival times (or times to becoming ill) is divided into a number of intervals. For each of these, we can calculate the number and proportion of subjects who entered the interval unaffected and the number and proportion of cases that became affected during that interval, as well as the number of cases that were lost to follow-up (because they had died or had otherwise disappeared from view). On the basis of these numbers and proportions, we can calculate the proportion failing or becoming ill over a certain time interval that is usually taken as the entire period of risk. A further alternative is to use a Kaplan–Meier product-limit estimator. This allows us to estimate the survival function (see Survival Analysis) directly from continuous survival or failure times instead of classifying observed survival times into a life table. Effectively, this means creating a life table in which each time interval contains exactly one case. It therefore has an advantage over a life table method in that the results do not depend on grouping of the data.

References

[1] McGuffin, P. & Huckle, P. (1990). Simulation of Mendelism revisited: the recessive gene for attending medical school, American Journal of Human Genetics 46, 994–999.
[2] McGuffin, P., Katz, R., Aldrich, J. & Bebbington, P. (1988). The Camberwell Collaborative Depression Study. II. Investigation of family members, British Journal of Psychiatry 152, 766–774.
[3] Risch, N. (1990). Linkage strategies for genetically complex traits. III: the effect of marker polymorphism analysis on affected relative pairs, American Journal of Human Genetics 46, 242–253.
[4] Slater, E. & Cowie, V. (1971). The Genetics of Mental Disorders, Oxford University Press, London.

PETER MCGUFFIN AND AILBHE BURKE
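The shorter Weinberg correction described above can be sketched in a few lines. This is an illustrative sketch only, not from the entry: the 15–40 risk period, the ages, and the function names are hypothetical.

```python
# Shorter Weinberg age correction (sketch): weights of 0, 1/2, and 1
# for unaffected relatives below, within, and beyond the risk period.
# The 15-40 risk period and the ages below are hypothetical examples.

def weinberg_bz(unaffected_ages, n_affected, risk_start=15, risk_end=40):
    """Corrected denominator BZ = sum_i w_i + A."""
    total = 0.0
    for age in unaffected_ages:
        if age < risk_start:
            total += 0.0          # not yet at risk
        elif age <= risk_end:
            total += 0.5          # partway through the risk period
        else:
            total += 1.0          # lived through the whole risk period
    return total + n_affected

def morbidity_risk(unaffected_ages, n_affected, **kwargs):
    """MR = A / BZ."""
    return n_affected / weinberg_bz(unaffected_ages, n_affected, **kwargs)

ages = [10, 22, 35, 50, 61]          # unaffected relatives
mr = morbidity_risk(ages, n_affected=3)
# weights 0 + 0.5 + 0.5 + 1 + 1 = 3, so BZ = 6 and MR = 3/6 = 0.5
```

A Strömgren-style refinement would replace the 0/½/1 scheme with weights read off an empirical cumulative age-of-onset distribution.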
Fechner, Gustav T
HELEN ROSS
Volume 2, pp. 649–650
examined the effects of classroom-based performance-assessment-driven instruction using 16 teachers who were assigned randomly to performance-assessment and nonperformance-assessment conditions. All of the teachers had volunteered to participate. Neither the teachers nor their classes were matched, but Fuchs et al. used inferential statistics to indicate that the teachers in the two conditions were comparable on the demographic variables of years of teaching, class size, ethnicity, and educational level. Teachers completed a demographic information form reporting on information about each student in the class. Results of statistical tests revealed that the groups were comparable. Although the researchers were not able to control extraneous variables, they tested to assess whether extraneous variables could affect the outcome of the research.

Field experiments can be conducted with the general public and with selected groups. Each of these two types of studies has certain limitations. Field experiments that are conducted in an unrestricted public area in order to generalize to the typical citizen generally are studying social behaviors. Such investigations, if conducted as laboratory experiments, would reduce the reliability and validity of the results. The experiments with the general public generally are carried out in one of two ways: individuals can be targeted and their responses observed to a condition of an environmental independent variable, or the researcher or a confederate creates a condition by approaching the public and exhibiting a behavior to elicit a response. Some of the limitations of field studies with the public are that the situations are contrived, external validity is limited to situations similar to those in the study, and random selection is not possible in that the sample is a convenient one depending on the individuals who are present in the location at the time of the study. Albas and Albas [1] studied personal-space behavior while conducting a fictitious poll by measuring how far the participant would stop from the pollster. They manipulated the factors of the meeting occurring in the safety of a shopping mall versus a less safe city park, and of whether the pollster made eye contact or did not because of wearing dark glasses.

The other approach to field experiments involves studying a specific group of participants that exists already. Some examples would be studying young children at a day-care center or elderly individuals at a senior center. Sometimes the researcher is not able to disguise the study, especially if the researcher must provide instructions; other times the researcher is able to act unobtrusively. For example, if there is a one-way mirror at the day-care center, the researcher may be able to view the behavior of the children without their knowing that they are being observed. If teachers or supervisors will be serving as the researcher, they should be given explicit training, and the researcher should use a double-blind procedure. A double-blind procedure was used in the 1954 field trial of the Salk poliomyelitis vaccine. Both the child getting the treatment and the physician who gave the vaccine and evaluated the outcome were kept in ignorance of the treatment given [7].

Several references that can be consulted for additional details regarding field experiments are [2], [3], and [10]. Kerlinger [6], in his second edition, has a detailed discussion with examples of field experiments and field studies (see Quasi-experimental Designs).

References

[1] Albas, D.C. & Albas, C.A. (1989). Meaning in context: the impact of eye contact and perception of threat on proximity, The Journal of Social Psychology 129, 525–531.
[2] Boruch, R.F. (1997). Randomized Experiments for Planning & Evaluation: A Practical Guide, Sage Publications, Thousand Oaks.
[3] Boruch, R.F. & Wothke, W., eds (1985). Randomized Field Experimentation, Jossey-Bass, San Francisco.
[4] Cook, T.D., Habib, F.-N., Phillips, M., Settersten, R.A., Shagle, S.C. & Degirmencioglu, S.M. (1999). Comer's school development program in Prince George's County, Maryland: a theory-based evaluation, American Educational Research Journal 36, 543–597.
[5] Fuchs, L.S., Fuchs, D., Karns, K., Hamlett, C.L. & Katzaroff, M. (1999). Mathematics performance assessment in the classroom: effects on teacher planning and student problem solving, American Educational Research Journal 36, 609–646.
[6] Kerlinger, F.N. (1973). Foundations of Behavioral Research, 2nd Edition, Holt, Rinehart and Winston, New York.
[7] Meier, P. (1972). The biggest public health experiment ever: the 1954 field trial of the Salk poliomyelitis vaccine, in Statistics: A Guide to the Unknown, J.M. Tanur, F. Mosteller, W.H. Kruskal, R.F. Link, R.S. Pieters & G.R. Rising, eds, Holden-Day, San Francisco, pp. 2–13.
[8] Ritter, G.W. & Boruch, R.F. (1999). The political and institutional origins of a randomized controlled trial on elementary class size: Tennessee's project STAR, Educational Evaluation and Policy Analysis 21, 111–125.
[9] Rouse, C.E. (1998). Private school vouchers and student achievement: an evaluation of the Milwaukee parental choice program, Quarterly Journal of Economics 113, 553–602.
[10] Shadish, W.R., Cook, T.D. & Campbell, D.T. (2002). Experimental and Quasi-experimental Designs for Generalized Causal Inference, Houghton Mifflin Co., Boston.

PATRICIA L. BUSK
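The kind of random assignment to two conditions described in this entry can be sketched as follows. This is a hypothetical illustration: the teacher labels, seed, and function name are invented, not taken from the study.

```python
import random

# Sketch: randomly split 16 teachers into performance-assessment (PA)
# and no-performance-assessment (NPA) conditions of equal size.
# Teacher labels and the seed are hypothetical.

def assign_conditions(teachers, seed=0):
    rng = random.Random(seed)    # fixed seed so the split is reproducible
    shuffled = list(teachers)
    rng.shuffle(shuffled)        # random order, then split down the middle
    half = len(shuffled) // 2
    return {"PA": shuffled[:half], "NPA": shuffled[half:]}

teachers = [f"teacher_{i}" for i in range(16)]
groups = assign_conditions(teachers)
```

Comparability of the two groups on demographic variables would then be checked afterwards with inferential tests, as Fuchs et al. did.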
Finite Mixture Distributions
BRIAN S. EVERITT
Volume 2, pp. 652–658
the analytic difficulties, even for a mixture of two components, are so considerable that it may be

Figure 1 Frequency polygon of ratio of forehead to body length in 1000 crabs and fitted mixture and single normal distributions

A sex difference in the age of onset of schizophrenia was noted in [6]. Subsequently, it has been found to be one of the most consistent findings in the epidemiology of the disorder. Levine [7], for example, collated the results of 7 studies on the age of onset of the illness, and 13 studies on age at first admissions, and showed that all these studies were consistent in reporting an earlier onset of schizophrenia in men than in women. Levine suggested two competing models to explain these data:

    The timing model states that schizophrenia is essentially the same disorder in the two sexes, but has an early onset in men and a late onset in women . . . In contrast with the timing model, the subtype model posits two types of schizophrenia. One is characterized by early onset, typical symptoms, and poor premorbid competence, and the other by late onset, atypical symptoms, and good premorbid competence . . . the early onset typical schizophrenia is largely a disorder of men, and late onset, atypical schizophrenia is largely a disorder in women.

The subtype model implies that the age of onset distribution for both male and female schizophrenics will be a mixture, with the mixing proportion for early onset schizophrenia being larger for men than for women. To investigate this model, finite mixture distributions with normal components were fitted to age of onset (determined as age on first admission) of 99 female and 152 male schizophrenics using maximum likelihood. The data are shown in Table 1 and the results in Table 2. Confidence intervals were obtained by using the bootstrap (see [2] and Bootstrap Inference). The bootstrap distributions for each parameter for the data on women are shown in Figure 2. Histograms of the data showing both the fitted two-component mixture distribution and a single normal fit are shown in Figure 3.

For both sets of data, the likelihood ratio test for number of groups (see [8, 9], and Maximum Likelihood Estimation) provides strong evidence that a two-component mixture provides a better fit than a single normal, although it is difficult to draw convincing conclusions about the proposed subtype model of schizophrenia because of the very wide
Table 1 Age of onset of schizophrenia

(1) Women

20 30 21 23 30 25 13 19 16 25 20 25 27 43 6 21 15 26 23 21 23 23
34 14 17 18 21 16 35 32 48 53 51 48 29 25 44 23 36 58 28 51 40 43
21 48 17 23 28 44 28 21 31 22 56 60 15 21 30 26 28 23 21 20 43 39
40 26 50 17 17 23 44 30 35 20 41 18 39 27 28 30 34 33 30 29 46 36
58 28 30 28 37 31 29 32 48 49 30

(2) Men

21 18 23 21 27 24 20 12 15 19 21 22 19 24 9 19 18 17 23 17 23 19
37 26 22 24 19 22 19 16 16 18 16 33 22 23 10 14 15 20 11 25 9 22
25 20 19 22 23 24 29 24 22 26 20 25 17 25 28 22 22 23 35 16 29 33
15 29 20 29 24 39 10 20 23 15 18 20 21 30 21 18 19 15 19 18 25 17
15 42 27 18 43 20 17 21 5 27 25 18 24 33 32 29 34 20 21 31 22 15
27 26 23 47 17 21 16 21 19 31 34 23 23 20 21 18 26 30 17 21 19 22
52 19 24 19 19 33 32 29 58 39 42 32 32 46 38 44 35 45 41 31
confidence intervals for the parameters. Far larger sample sizes are required to get accurate estimates than those used here.

Table 2 Age of onset of schizophrenia: results of fitting finite mixture densities

(1) Women

Parameter   Initial value   Final value   Bootstrap 95% CI(a)
p           0.5             0.74          (0.19, 0.83)
µ1          25              24.80         (21.72, 27.51)
σ1²         10              42.75         (27.92, 85.31)
µ2          50              46.45         (34.70, 50.50)
σ2²         10              49.90         (18.45, 132.40)

(2) Men

Parameter   Initial value   Final value   Bootstrap 95% CI(a)
p           0.5             0.51          (0.24, 0.77)
µ1          25              20.25         (19.05, 22.06)
σ1²         10              9.42          (3.43, 36.70)
µ2          50              27.76         (23.48, 34.67)
σ2²         10              112.24        (46.00, 176.39)

(a) Number of bootstrap samples used was 250.

Identifying Activated Brain Regions

In [3], an experiment is reported in which functional magnetic resonance imaging (fMRI) data were collected from a healthy male volunteer during a visual stimulation procedure. A measure of the experimentally determined signal at each voxel in the image was calculated, as described in [1]. Under the null hypothesis of no experimentally determined signal change (no activation), the derived statistic has a chi-square distribution with two degrees of freedom (see Catalogue of Probability Density Functions). Under the presence of an experimental effect (activation), however, the statistic has a noncentral chi-squared distribution (see [3]). Consequently, it follows that the distribution of the statistic over all voxels in an image, both activated and nonactivated, can be modeled by a mixture of those two component densities (for details, again see [3]). Once the parameters of the assumed mixture distribution have been estimated, so can the probability of each voxel in the image being activated or nonactivated. For the visual stimulation data, voxels were classified as activated if their posterior probability of activation was greater than 0.5 and nonactivated otherwise. Figure 4 shows the mixture model activation map
Figure 2 Bootstrap distributions for five parameters of a two-component normal mixture fitted to the age of onset data for women: (a) mixing proportion; (b) mean of first distribution; (c) standard deviation of first distribution; (d) mean of second distribution; (e) standard deviation of second distribution
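The percentile bootstrap behind intervals like those in Table 2 can be sketched as follows. This is an illustration only: the data are simulated rather than the onset data, and the function names are invented, though B = 250 resamples matches the entry.

```python
import random

# Percentile bootstrap sketch: resample the data with replacement,
# recompute the statistic each time, and read the interval off the
# sorted replicates. Data here are simulated, not the onset data.

def bootstrap_ci(data, stat, n_boot=250, alpha=0.05, seed=0):
    rng = random.Random(seed)
    reps = sorted(
        stat([rng.choice(data) for _ in range(len(data))])
        for _ in range(n_boot)
    )
    # cut off alpha/2 of the replicates at each end
    lo = reps[int((alpha / 2) * n_boot)]
    hi = reps[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

def mean(xs):
    return sum(xs) / len(xs)

rng = random.Random(42)
sample = [rng.gauss(25, 5) for _ in range(99)]   # 99 cases, as for the women
lo, hi = bootstrap_ci(sample, mean)
```

For mixture parameters, `stat` would refit the mixture to each resample and return the parameter of interest, which is why the intervals in Table 2 are so expensive to compute.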
Figure 3 Histograms and fitted mixture distributions for age of onset data for women and men
of the visual stimulation data for selected slices of the brain (activated voxels indicated).

Finite mixture models are now widely used in many disciplines. In psychology, for example, latent class analysis, essentially a finite mixture with multivariate Bernoulli components (see Catalogue of Probability Density Functions), is often used as a categorical data analogue of factor analysis. And, in medicine, mixture models have been successful in analyzing survival times (see [9] and Survival Analysis). Many other applications of finite mixture distributions are described in the comprehensive account given in [9].

Software for Fitting Finite Mixture Distributions

The following sites provide information on software for mixture modeling:
NORMIX was the first program for clustering data that consist of mixtures of multivariate normal distributions. The program was originally written by John H. Wolfe in the 1960s [13]. A version that runs under MSDOS-Windows is available as freeware at http://alumni.caltech.edu/wolfe/normix.htm.
Geoff McLachlan and colleagues have developed the EMMIX algorithm for the automatic fitting and testing of normal mixtures for multivariate data (see http://www.maths.uq.edu.au/gjm/).
A further program is that developed by Jorgensen and Hunt [5], and the source code is available from Murray Jorgensen's website (http://www.stats.waikato.ac.nz/Staff/maj.html).

Figure 4 Mixture model activation map of visual stimulation data. Each slice of data (Z) is displayed in the standard anatomical space described in [11]

References

[1] Bullmore, E.T., Brammer, M.J., Rouleau, G., Everitt, B.S., Simmons, A., Sharma, T., Frangou, S., Murray, R. & Dunn, G. (1995). Computerized brain tissue classification of magnetic resonance images: a new approach to the problem of partial volume artifact, Neuroimage 2, 133–147.
[2] Efron, B. & Tibshirani, R.J. (1993). An Introduction to the Bootstrap, Chapman & Hall/CRC, New York.
[3] Everitt, B.S. & Bullmore, E.T. (1999). Mixture model mapping of brain activation in functional magnetic resonance images, Human Brain Mapping 7, 1–14.
[4] Everitt, B.S. & Hand, D.J. (1981). Finite Mixture Distributions, Chapman & Hall/CRC, London.
[5] Jorgensen, M. & Hunt, L.A. (1999). Mixture model clustering using the MULTIMIX program, Australian and New Zealand Journal of Statistics 41, 153–171.
[6] Kraepelin, E. (1919). Dementia Praecox and Paraphrenia, Churchill Livingstone, Edinburgh.
[7] Levine, R.R.J. (1981). Sex differences in schizophrenia: timing or subtypes? Psychological Bulletin 90, 432–444.
[8] McLachlan, G.J. (1987). On bootstrapping the likelihood ratio test statistic for the number of components in a normal mixture, Applied Statistics 36, 318–324.
[9] McLachlan, G.J. & Peel, D. (2000). Finite Mixture Models, Wiley, New York.
[10] Pearson, K. (1894). Contributions to the mathematical theory of evolution, Philosophical Transactions A 185, 71–110.
[11] Talairach, J. & Tournoux, P. (1988). A Coplanar Stereotaxic Atlas of the Human Brain, Thieme-Verlag, Stuttgart.
[12] Titterington, D.M., Smith, A.F.M. & Makov, U.E. (1985). Statistical Analysis of Finite Mixture Distributions, Wiley, New York.
[13] Wolfe, J.H. (1970). Pattern clustering by multivariate mixture analysis, Multivariate Behavioral Research 5, 329–350.

BRIAN S. EVERITT
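The two-component fitting by maximum likelihood and the posterior-probability classification used in this entry can be sketched with a minimal EM loop. This is not the article's code: the data are synthetic and the function names are invented.

```python
import math
import random

# Minimal EM sketch for a two-component normal mixture, followed by
# classification of each point by its posterior probability (> 0.5 rule).

def normal_pdf(x, mu, var):
    return math.exp(-((x - mu) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_two_normals(data, p=0.5, var1=1.0, var2=1.0, iters=200):
    mu1, mu2 = min(data), max(data)      # crude starting values
    for _ in range(iters):
        # E-step: posterior probability that each point is from component 1
        post = []
        for x in data:
            a = p * normal_pdf(x, mu1, var1)
            b = (1 - p) * normal_pdf(x, mu2, var2)
            post.append(a / (a + b))
        # M-step: re-estimate mixing proportion, means, and variances
        n1 = sum(post)
        n2 = len(data) - n1
        p = n1 / len(data)
        mu1 = sum(w * x for w, x in zip(post, data)) / n1
        mu2 = sum((1 - w) * x for w, x in zip(post, data)) / n2
        var1 = sum(w * (x - mu1) ** 2 for w, x in zip(post, data)) / n1
        var2 = sum((1 - w) * (x - mu2) ** 2 for w, x in zip(post, data)) / n2
    return p, mu1, var1, mu2, var2, post

rng = random.Random(1)
data = [rng.gauss(0, 1) for _ in range(200)] + [rng.gauss(6, 1) for _ in range(100)]
p, mu1, var1, mu2, var2, post = em_two_normals(data)
labels = [1 if w > 0.5 else 2 for w in post]   # posterior > 0.5, as for the voxels
```

Production fitting would add a convergence check on the log-likelihood and multiple starting values; packages such as EMMIX automate this.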
Fisher, Sir Ronald Aylmer
ANDY P. FIELD
Volume 2, pp. 658–659
referring in his own journal to apparent errors made by Fisher [8]. With Neyman, the rift developed from a now infamous paper delivered by Neyman to the Royal Statistical Society that openly criticized Fisher's work [7]. Such was their antagonism that Neyman openly attacked Fisher's factorial designs and ideas on randomization in lectures while they both worked at University College, London. The two feuding groups even took afternoon tea (a common practice in the British academic community of the time) in the same room but at different times [6]. However, these accounts portray a one-sided view of all concerned, and it seems fitting to end with Irwin's (who worked with both K. Pearson and Fisher) observation that Fisher "was always a wonderful conversationalist and a good companion. His manners were informal . . . and he was friendliness itself to his staff and any visitors" ([5, p. 160]).

References

[1] Barnard, G.A. (1963). Ronald Aylmer Fisher, 1890–1962: Fisher's contributions to mathematical statistics, Journal of the Royal Statistical Society, Series A (General) 126, 162–166.
[2] Field, A.P. & Hole, G. (2003). How to Design and Report Experiments, Sage Publications, London.
[3] Fisher, R.A. (1935). The Design of Experiments, Oliver & Boyd, Edinburgh.
[4] Fisher Box, J. (1978). R.A. Fisher: The Life of a Scientist, Wiley, New York.
[5] Irwin, J.O. (1963). Sir Ronald Aylmer Fisher, 1890–1962: Introduction, Journal of the Royal Statistical Society, Series A (General) 126, 159–162.
[6] Olkin, I. (1992). A conversation with Churchill Eisenhart, Statistical Science 7, 512–530.
[7] Reid, C. (1982). Neyman-From Life, Springer-Verlag, New York.
[8] Salsburg, D. (2001). The Lady Tasting Tea: How Statistics Revolutionized Science in the Twentieth Century, W.H. Freeman, New York.
[9] Savage, L.J. (1976). On re-reading R. A. Fisher, The Annals of Statistics 4, 441–500.
[10] Stigler, S.M. (1999). Statistics on the Table: A History of Statistical Concepts and Methods, Harvard University Press, Cambridge.
[11] Yates, F. (1984). Review of Neyman-From Life by Constance Reid, Journal of the Royal Statistical Society, Series A (General) 147, 116–118.

ANDY P. FIELD
Fisherian Tradition in Behavioral Genetics
DAVID DICKINS
Volume 2, pp. 660–664
notion that once mutations in a certain direction occurred, it was more likely that further ones in the same direction would follow, is shown to be redundant (and incorrect).

Fisher was fond of comments such as "Natural selection is a mechanism for generating an exceedingly high degree of improbability" [11]. Now the idea of evolution being due to natural selection acting on mutations occurring by chance was repellent to many critics from the publication of Darwin's Origin of Species [5]. Fisher explained, however, that this did not resemble an extraordinary run of good luck such as every client of a casino wishes he might enjoy, but the inexorable workings of the laws of chance over a longer sample upon which the profitability of such establishments depends. Just as Mendel had been influenced by the physicist von Ettingshausen to apply combinatorial algebra to his breeding experiments, so a postdoctoral year at Cambridge with James Jeans⁴ (after his Mathematics degree there) enthused Fisher with Maxwell's statistical theory of gases, Boltzmann's statistical thermodynamics, and quantum mechanics. He looked for an integration in the new physics of the novelty generation of biological evolution and the converse principle of entropy in nonliving systems [10].

Alongside these adjustments to post-Newtonian science, transcending the Newtonian scientific world view of Darwin, Fisher was in a sense the grandson of Darwin (see [10]), for he was much influenced by two of Charles's sons, Horace, and particularly Leonard. This must have been very exciting to a young middle-class man who might just as well have taken a scholarship in biology at Cambridge, and took as the last of his many school prizes the complete set of Darwin's books, which he read and reread throughout his life; with Leonard in particular he interchanged many ideas and was strongly encouraged by him. Now both these sons of such a renowned father and from such a prominent family were involved in the eugenics movement, to which Fisher heartily subscribed⁵. There is a lot of this in the second half of his The Genetical Theory of Natural Selection. It is not clear how the logical Fisher reconciled his attachment to Nietzsche's ideas of superior people who should endeavor to generate lots of offspring and be placed beyond good or evil, with his stalwart adherence to Anglicanism⁶, but he certainly practiced what he preached by in due course fathering eight children on his 17-year-old bride. There is inflammable material here for those whose opposition to sociobiology and evolutionary psychology is colored by their political sensitivity to the dangerous beasts of Nazism and fascism forever lurking in human society.

In his population genetics, Fisher shifted the emphasis from the enhanced chances of survival of favored individuals to the study, for each of the many alleles in a large population, of the comparative success of being duplicated by reproduction. In this calculus, alleles conferring very slight benefits (to their possessor) in terms of Darwinian fitness⁷ would spread in the gene pool at the expense of alternative alleles. This gave rise to important theoretical advances that were particularly influential in the study of behavior. The adaptive significance of the minutiae of behavior could in principle be assessed in this way, and in some cases measured in the field with some confidence about its validity. When Fisher was back in Cambridge as Professor of Genetics, one of his most enthusiastic students was William Hamilton, who from this notion of the gene as the unit of selection derived inclusive fitness theory [14], popularly expounded by Richard Dawkins in [9] and a series of similar books. Expressing Fisher's [12] degrees of genetic resemblance between closer and more distant relatives in terms of the probability, r, of a rare gene in one individual also occurring in a relative, Hamilton propounded that for an apparently altruistic social action, in which an actor appears to incur a cost C (in terms of Darwinian fitness) in the course of conferring a benefit B to a relative, the following inequality applies:

rB − C > 0

This has become known as Hamilton's rule, and it means that an allele predisposing an animal to help a relative will tend to spread if the cost to the donor is less than the benefit to the recipient, downgraded by the degree of relatedness between the two. This is because r is an estimate of the chances that the allele will indirectly make copies of itself via the progeny of the beneficiary, another way of spreading in the gene pool. This is also known as kin selection.

Robert Trivers [20] soon followed this up with the notion of reciprocal altruism, in which similar altruistic acts could be performed by a donor for an unrelated recipient, provided that the cost was
small in relation to a substantial benefit, if the social conditions made it likely that roles would at some future time be liable to be reversed, and that the former recipient could now do a favor for the former donor. This would entail discrimination by the individuals concerned between others who were and others who were not likely to reciprocate in this way (probably on the basis of memory of past encounters). There are fascinating sequelae to these ideas, as when the idea of cheating is considered, which Trivers does in his benchmark paper [20]. Such a cozy mutual set-up is always open to exploitation, either by gross cheaters, who are happy to act as beneficiaries but simply do not deliver when circumstances would require them to act as donors, or by more subtle strategists who give substandard altruism, spinning the balance of costs and benefits in their own favor. These provide fertile ground for cheater-detection countermeasures to evolve, an escalating story intuitively generative of many of the social psychological features of our own species.

The pages of high-quality journals in animal behavior (such as Animal Behaviour) are today packed with meticulous studies conducted in the field to test the ideas of Hamilton and Trivers, and a corresponding flow of theoretical papers fine-tuning the implications.

Another key idea attributed to Fisher is the runaway theory of sexual selection. This concerns male adornment, and Darwin's complementary notion of female choice, long treated with skepticism by many, but now demonstrated across the animal kingdom. If it comes about that a particular feature of a male bird, for example, such as a longer than usual tail, attracts females more than tails of more standard length (which in some species can be demonstrated by artificial manipulation [1]), and both the anatomical feature and the female preference are under genetic control, then the offspring of resultant unions will produce sexy sons and size-sensitive females who in turn will tend to corner the mating market. This is likely to lead, according to Fisher [13], to further lengthening of the tail and strengthening of the preference, at an exponential rate, since the greater the change, the greater the reproductive advantage, so long as this is not outweighed by other selective disadvantages, such as dangerous conspicuousness of the males. Miller [17] has inverted this inference and, from fossil data supporting such a geometric increase in brain size in hominids, has speculated that a major influence on human evolution has been the greater attractiveness to females of males with larger brains enabling them to generate a more alluring diversity of courtship behaviors. Other explanations, either alternative or complementary, have also been forwarded for the evolution of flamboyant attraction devices in male animals [22].

The evolutionary stabilization of the sex ratio (that except under special circumstances the proportion of males to females in a population will always approximate to 1:1) is another fecund idea that has traditionally been attributed to Fisher. Actually, the idea (like many another) goes back to Darwin, and to The Descent of Man. Like many of us, Fisher possessed (and read and reread) the Second Edition of this book [8], in which Darwin backtracks in a quote I would have been critical of were it to occur in a student's essay today⁸:

    I formerly thought that when a tendency to produce the two sexes in equal numbers was advantageous to the species, it would follow from natural selection, but I now see that the whole problem is so intricate that it is safer to leave its solution for the future [8] (pp. 199–200).

Fisher [13] also quotes this, but gives an incorrect citation (there are no references in his book) as if it were from the first edition. In the first edition [7], Darwin does indeed essay ways in which the sex ratio might come under the influence of natural selection. He does not rate these effects of selection as a major force compared with unknown forces:

    Nevertheless we may conclude that natural selection will always tend, though sometimes inefficiently, to equalise the relative numbers of the two sexes. [ibid. Vol. I, p. 318]

Then Darwin acknowledges Herbert Spencer, not for the above, but for the idea of what we would now call the balance between r and K selection. Darwin is unclear how fertility might be reduced by natural selection once it has been strongly selected, for direct selection, by chance, would always favor parents with more offspring in overpopulation situations, but the cost of producing more, to the parents, and the likely lower quality of more numerous offspring, would be indirect selective influences reducing fertility in severely competitive conditions.
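Hamilton's rule introduced above, rB − C > 0, reduces to a one-line check. The fitness numbers below are invented purely for illustration, as is the function name.

```python
# Hamilton's rule: an allele for helping spreads when the
# relatedness-weighted benefit exceeds the actor's cost.
# All numbers below are illustrative only.

def helping_favored(r, benefit, cost):
    """True when rB - C > 0."""
    return r * benefit - cost > 0

# Full siblings (r = 0.5): a 1-unit cost pays off when the sibling
# gains more than 2 units; for first cousins (r = 0.125) it does not.
favored_sibling = helping_favored(0.5, 3.0, 1.0)    # True:  0.5*3 - 1 = 0.5 > 0
favored_cousin = helping_favored(0.125, 3.0, 1.0)   # False: 0.125*3 - 1 < 0
```

The strict inequality matters: at rB − C = 0 the allele is selectively neutral, so helping is not favored.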
There is an anticipation here too of Game Theory, which was developed by the late, great successor to Fisher as the (on one occasion self-styled!) Voice of Neodarwinism, John Maynard Smith [16]. The relevance of game theory is to any situation in which the adaptive consequences (Darwinian fitness) of some (behavioral or other) characteristic of an individual depend, not only on the environment in general, but upon what variants of this other members of the same species possess. Put succinctly, your best strategy depends on others' strategies⁹. In the case of the sex ratio, if other parents produce lots of daughters, it is to your advantage to produce sons. In the case of sexual selection, if males with long tails are cornering the mating market, because females with a preference for long tails are predominant, it is to your advantage to produce sons with even longer tails. The combination of some of the theories here mentioned, such as game theory and the principle of reciprocal altruism [2], is an index of the potential of the original insights of Fisher. Reiterative computer programs have made such subtle interactions easier to predict, and fruitfully theorize about, than the unaided though brilliant mathematics of Fisher.

Notes

1. Genetics, "the study of the hereditary mechanism, and of the rules by which heritable qualities are transmitted from one generation to the next. . .". Fisher, R.A. (1953). Population genetics (The Croonian Lecture), Proceedings of the Royal Society, B, 141, 510–523.
2. Work made possible by the delineation of the entire genome of mice and men and an increasing number
8. The troublesome phrase here is "advantageous to the species". The point about the action of selection here is that it is the advantage to the genes of the individual that leads it to produce more male or more female offspring.
9. There can be interactions between species as well, as for example in arms races: for example, selection for ever-increasing speed both of predator and prey in, say, cheetahs hunting antelope.

References

[1] Andersson, M. (1982). Female choice selects for extreme tail length in a widowbird, Nature 299, 818–820.
[2] Axelrod, R. (1990). The Evolution of Co-operation, Penguin, London.
[3] Barrett, L., Lycett, J. & Dunbar, R.I.M. (2002). Human Evolutionary Psychology, Palgrave, Basingstoke & New York.
[4] Charlesworth, B. (2000). Fisher, Medawar, Hamilton and the evolution of aging, Genetics 156(3), 927–931.
[5] Darwin, C. (1859). On the Origin of Species by Means of Natural Selection, or the Preservation of Favoured Races in the Struggle for Life, 1st Edition, John Murray, London.
[6] Darwin, C. (1868). The Variation of Plants and Animals Under Domestication, Vols 1–2, John Murray, London.
[7] Darwin, C. (1871). The Descent of Man, and Selection in Relation to Sex, John Murray, London.
[8] Darwin, C. (1888). The Descent of Man, and Selection in Relation to Sex, 2nd Edition, John Murray, London.
[9] Dawkins, R. (1989). The Selfish Gene, 2nd Edition, Oxford University Press, Oxford.
[10] Depew, D.J. & Weber, B.H. (1996). Darwinism Evolving: Systems Dynamics and the Genealogy of Natural Selection, The MIT Press, Cambridge & London.
[11] Edwards, A.W.F. (2000). The genetical theory of natural selection, Genetics 154(4), 1419–1426.
[12] Fisher, R.A. (1918). The correlation between relatives on the supposition of Mendelian inheritance, Transactions
of other species. of the Royal Society of Edinburgh 222, 309368.
3. For sociobiology and evolutionary psychology, some [13] Fisher, R.A. (1930). The Genetical Theory of Natural
degree of a hereditary basis for behavior is axiomatic. Selection, Oxford University Press, Oxford.
Behavior genetics seeks to demonstrate and analyze [14] Hamilton, W.D. (1964). The genetical evolution of social
specific examples of this, both in animal breeding behaviour I & II, Journal of Theoretical Biology 7, 152.
experiments and human familial studies, for practical [15] Huxley, J. (1942). Evolution, the Modern Synthesis,
as well as theoretical purposes. Allen & Unwin, London.
4. It was Jeans who was once asked whether it was [16] Maynard Smith, J. (1982). Evolution and the Theory of
true that he was one of the only three people to Games, Cambridge University Press, Cambridge.
understand relativity theory. Whos the third? he [17] Miller, G. (2000). The Mating Mind: How Sexual Choice
allegedly asked. Shaped the Evolution of Human Nature, Heinemann,
5. I cherish the conviction that Charles was entirely London.
egalitarian. [18] Plomin, R., DeFries, J.C., McGuffin, P. & McClearn,
6. While Nietzsche clearly recognized Christian values G.E. (2000). Behavioral Genetics, 4th Edition, Worth,
as in direct opposition to his own. New York.
7. Measured as the number of fertile offspring an indi- [19] Spencer, H.G., Clark, A.G. & Feldman, M.W. (1999).
vidual produces which survive to sexual maturity. Genetic conflicts and the evolutionary origin of genomic
Fisherian Tradition in Behavioral Genetics 5
imprinting, Trends in Ecology & Evolution 14(5), [22] Zahavi, A. & Zahavi, A. (1997). The Handicap Princi-
197201. ple: A Missing Piece of Darwins Puzzle, Oxford Uni-
[20] Trivers, R.L. (1971). The evolution of reciprocal altru- versity Press, New York.
ism, Quarterly Review of Biology 46, 3557.
[21] Wilson, E.O. (1975). Sociobiology: The New Synthesis, DAVID DICKINS
Harvard University Press, Cambridge.
Fixed and Random Effects
TOM A.B. SNIJDERS
Volume 2, pp. 664-665
the number of variables with random effects should not be so large that the model becomes unwieldy.

Modeling an effect as random usually (although not necessarily) goes with the assumption of a normal distribution for the random effects. Sometimes, this is not in accordance with reality, which then can lead to biased results. The alternative, entertaining models with nonnormally distributed residuals, can be complicated, but methods were developed, see [2]. In addition, the assumption is made that the random effects are uncorrelated with the explanatory variables. If there are doubts about normality or independence for a so-called nuisance effect, that is, an effect the researcher is interested in not for its own sake but only because it must be statistically controlled for, then there is an easy way out. If the doubts concern the main effect of a categorical variable, which also would be a candidate for being modeled as a level as discussed above, then the easy solution (at least for linear models) is to model this categorical control variable by fixed effects, that is, using dummy variables for the units in the sample. If it is a random slope for which such a statistical control is required without making the assumption of residuals being normally distributed and independent of the other explanatory variables, then the analog is to use an interaction variable obtained by multiplying the explanatory variable in question by the dummy variables for the units. The consequence of this easy way out, however, is that the statistical generalizability to the population of these units is lost (see Generalizability).

References

[1] Raudenbush, S.W. & Bryk, A.S. (2002). Hierarchical Linear Models: Applications and Data Analysis Methods, 2nd Edition, Sage Publications, Newbury Park.
[2] Seltzer, M. & Choi, K. (2002). Model checking and sensitivity analysis for multilevel models, in Multilevel Modeling: Methodological Advances, Issues, and Applications, N. Duan & S. Reise, eds, Lawrence Erlbaum, Hillsdale.
[3] Snijders, T.A.B. & Bosker, R.J. (1999). Multilevel Analysis: An Introduction to Basic and Advanced Multilevel Modeling, Sage Publishers, London.

(See also Random Effects and Fixed Effects Fallacy)

TOM A.B. SNIJDERS
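The dummy-variable device described in the entry above can be sketched in a few lines of pure Python. This is an illustrative construction only, not code from the entry; the variable names (`units`, `x`) are hypothetical. It builds one 0/1 indicator column per unit (the fixed-effects coding of a categorical control variable) and the interaction columns obtained by multiplying an explanatory variable by each unit dummy (the device for controlling a random slope without distributional assumptions).

```python
def dummy_columns(units):
    """One 0/1 indicator column per distinct unit (fixed-effects coding)."""
    levels = sorted(set(units))
    return {u: [1 if obs == u else 0 for obs in units] for u in levels}

def interaction_columns(units, x):
    """Multiply x by each unit dummy: fixed-effects control for a slope."""
    return {u: [xi * d for xi, d in zip(x, col)]
            for u, col in dummy_columns(units).items()}

units = ["a", "a", "b", "b", "c"]   # hypothetical unit (group) membership
x = [1.0, 2.0, 1.5, 0.5, 3.0]       # hypothetical explanatory variable

print(dummy_columns(units)["b"])           # [0, 0, 1, 1, 0]
print(interaction_columns(units, x)["b"])  # [0.0, 0.0, 1.5, 0.5, 0.0]
```

In a regression, these columns would simply be appended to the design matrix; the cost, as the entry notes, is that generalizability to the population of units is lost.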
Fixed Effect Models
ROBERT J. VANDENBERG
Volume 2, pp. 665-666
Focus Group Techniques
good facilitation, verbal and otherwise, are discussed in published research.

The facilitator and discussants are only the most obvious members of focus group research. Someone, usually the primary researcher, must meet with clients to determine the focus of the topic and the sample. Then, someone must locate the sample, frequently using a sampling screen, and contact them with a prepared script that will encourage their participation. Depending on the topic, the client, and the respondents' involvement with the issue, the amount that discussants are paid to participate varies. Participants are usually paid $25 to $50 each. Professionals can usually only be persuaded to participate by the offer of a noon meeting, a catered lunch, and the promise that they will be out by 1:15. For professionals, money is rarely an effective incentive to participate. The script inviting them to participate generally states that although the money offered will not completely reimburse them for their time, it will help pay for their gas. Some participants are willing to attend at 5:30 so they can drop by on their way home from work. Older or retired participants prefer meetings in late morning, at noon, or at three in the afternoon. Other participants can only attend at around 7:30 or 9:00 in the evening. Groups are scheduled according to such demographic characteristics as age, location, and vocation.

The site for the meeting must be selected for ease of accessibility for the participants. Although our facilities have adequate parking and can be easily accessed from the interstate and from downtown, focus groups for some of our less affluent participants have been held in different parts of the city, sometimes in agency facilities with which they are familiar and comfortable. Focus group facilities in most research centers have a conference room that is wired for audio and video and a one-way mirror for unobtrusive viewing. Participants are informed if they are being taped and if anyone is behind the mirror. They are told that their contributions are so important that the researchers want to carefully record them. In addition, our center also employs two note takers who are in the room, but not at the table, with the discussants. They are introduced when the procedure and reasons for the research are explained, but before the discussion begins. Note takers have the interview questions on a single sheet and record responses on a notepad with answers following the number of each question. They are instructed to type up their answers the day of the interviews while the discussion is fresh in their minds and before they take notes on another group.

When all focus groups have been conducted, the primary investigator must organize the data so that it can be easily understood, interpret the data in terms of the client's goals, and make recommendations.

Focus groups are fairly expensive if done correctly. The facilitator's fee and the discussants' payments alone can run almost $1000 per focus group. This amount does not include the cost of locating and contacting the participants, the note takers' payment, the facilities rental, data analysis, and report preparation.

Uses of Focus Groups

Focus group research has become a sophisticated tool for researching a wide variety of topics. Focus groups are used for planning programs, uncovering background information prior to quantitative surveys, testing new program ideas, discovering what customers consider when making decisions, evaluating current programs, understanding an organization's image, assessing a product, and providing feedback to administrators. Frequently linked with other research techniques, the focus group can precede the major data-gathering technique and be used for general exploration and questionnaire design. Researchers can learn what content is necessary to include in questionnaires that are administered to representative samples of the target populations.

The focus group technique is used for a variety of purposes. One of the most widespread uses of focus groups involves general exploration into unfamiliar territory or into the area of new product development. Also, habits and usage studies utilize the focus group technique in obtaining basic information from participants about their use of different products or services and for identifying new opportunities to fill the shifting needs in the market. The dynamic of the focus group allows information to flow easily and allows market researchers to find the deep motivations behind people's actions. Focus groups can lead to new ideas, products, services, themes, explanations, thoughts, images, and metaphors.

Focus groups are commonly used to provide information about or predict the effectiveness of advertising campaigns. For example, participants can be
Focus Group Techniques 3
shown promotional material, or even the sales presentation, in a series of focus groups. Focus groups also provide an excellent way for researchers to listen as people deliberate a purchase. The flexibility of focus groups makes them an excellent technique for developing the best positioning for products. They also are used to determine consumer reactions to new packaging, consumer attitudes towards products, services, and programs, and for public relations purposes.

The authors have used focus groups to learn how to recruit Campfire Girls leaders, to uncover attitudes toward a river-walk development, to determine the strengths and needed changes for a citywide library system, to discover how to involve university alumni in the alumni association, to determine which magazine supplement to include in the Sunday newspaper, to learn ways to get young single nonsubscribers who read the newspaper to subscribe to it, to determine what signs and slogans worked best for political candidates, to determine which arguments worked best in specific cases for trial lawyers, to decide changes needed in a university continuing education program, to establish the packages and pricing for a new telecommunications company, to uncover non-homeowners' opinions toward different loan programs that might put them in their own home, to determine which student recruitment techniques work best for a local community college, to discover how the birthing facilities at a local hospital could be made more user friendly, and to uncover attitudes toward different agencies supported by the United Way and ways to encourage better utilization of their services.

Focus groups are flexible, useful, and widely used.

References

[1] Client Guide to the Focus Group. (2000). Retrieved November 24, 2003, from http://www.mnav.com/cligd.htm.
[2] Greenbaum, T.L. (1993). The Handbook for Focus Group Research, Lexington Books, New York.

(See also Qualitative Research; Survey Questionnaire Design)

TILLMAN RODABOUGH AND LACEY WIGGINS
Free Response Data Scoring
BRIAN E. CLAUSER AND MELISSA J. MARGOLIS
Volume 2, pp. 669-673
the response(s) to a simple key. In some sense, this is true of all scoring procedures. However, as the complexity of the key increases, so does the complexity of the required computer algorithm. Essay scoring has been a kind of Holy Grail of computerized scoring. Researchers have devoted decades to advancing the state of the art for essay scoring. Page's efforts [14, 15] have been joined by those of numerous others [9]. All of these efforts share in common the basic approach that quantifiable aspects of the performance are used to predict expert ratings for a sample of essays. Although the analytic procedures may vary, the regression procedures used by Page provide a conceptual basis for understanding the general approach used by these varying procedures. Early efforts in this arena used relatively simple variables. When the score was to be interpreted as a general measure of writing proficiency, this was a reasonable approach. More recently, the level of sophistication has increased as serious efforts were made to evaluate the content as well as the stylistic features of the essay. One obvious approach to assessing content is to scan the essay for the presence of key words; an essay about the battle of Gettysburg might reasonably be expected to make reference to Pickett's charge. This approach is less likely to be useful when the same concept can be expressed in many different ways. To respond to this problem, Landauer and Dumais [11] developed a procedure in which the relationship between words is represented mathematically. To establish these relationships, large quantities of related text are analyzed. The inferred relationships make it possible to define any essay in terms of a point in n-dimensional space. The similarity between a selected essay and other previously rated essays can then be defined as a function of the distance between the two essays in n-dimensional space.

Essays are not the only context in which complex constructed responses have been successfully scored by computer. Long-term projects by the National Council of Architectural Registration Boards [17] and the National Board of Medical Examiners (NBME) [12] have resulted in computerized scoring procedures for simulations used in certification of architects and licensure of physicians, respectively. In the case of the architectural simulations, the examinee uses a computer interface to complete a design problem. When the design is completed, the computer scores it by applying a branching rule-based algorithm that maps the presence or absence of different design components into corresponding score categories.

With the computer-based case simulations used in medical licensure assessment, examinees manage patients in a simulated patient-care environment. The examinee uses free-text entry to order diagnostic tests and treatments, and results become available after the passage of simulated time. As the examinee advances the case through simulated time, the patient's condition changes based both on the examinee's actions and the underlying problem. Boolean logic is applied to the actions ordered by the examinee to produce scorable items. For example, an examinee may receive credit for an ordered treatment if (a) it occurs after the results of an appropriate diagnostic test were seen, (b) no other equivalent treatment had already been ordered, and (c) the treatment was ordered within a specified time frame. After the logical statements are applied to the performance record to convert behaviors into scorable actions, regression-based weights are applied to the items to calculate a score on the case. These weights are derived using expert ratings as the dependent measure in a regression equation.

Empirical Results

Most of the empirical research presented to support the usefulness of the various procedures focuses on the correspondence between scores produced by these systems and those produced by experts. In general, the relationship between automated scores and those of experts is at least as strong as that between the same criterion and those produced by a single expert rater [7, 11, 12, 15]. In the case of the hypothesis generation and mathematical expression item types, the procedures have been assessed in terms of the proportion of examinee responses that could be interpreted successfully by the computer. Several authors have presented conceptual discussions of the validity issues that arise with the use of computerized scoring procedures [1, 6]. There has, however, been relatively little in the way of sophisticated psychometric evaluation of the scores produced with these procedures. One exception is the evaluative work done on the NBME's computer-based case simulations. A series of papers summarized by Margolis and Clauser [12] compare not only the correspondence between ratings and automated scores
but (a) compare the generalizability of the resulting scores (they are similar), (b) examine the extent to which the results vary as a function of the group of experts used as the basis for modeling the scores (this was at most a minor source of error), (c) examine the extent to which the underlying proficiencies assessed by ratings and scores are identical (correlations were essentially unity), and (d) compare the performance of rule-based and regression-based procedures (the regression-based procedures were superior in this application).

Conceptual Issues

It may be too early in the evolution of automated scoring procedures to establish a useful taxonomy, but some conceptual distinctions between procedures likely will prove helpful. One such distinction relates to whether the criterion on which the scores are based is explicit or implicit. In some circumstances, the scoring rules can be made explicit. When experts can define scorable levels of performance in terms of variables that can directly be quantified by the computer, these rules can be programmed directly. Both the mathematical formulation items and the architectural problems belong to this category. These approaches have the advantage that the rules can be explicitly examined and openly critiqued. Such examination facilitates refinement of the rules; it also has the potential to strengthen the argument supporting the validity of the resulting score interpretations.

By contrast, in some circumstances it is difficult to define performance levels in terms of quantifiable variables. As a result, many of the currently used procedures rely on implicit, or inferred, criteria. Examples of these include essentially all currently available approaches for scoring essays. These procedures require expert review and rating of a sample of examinee performances. Scores are then modeled on the basis of the implicit relationship between the observed set of ratings and the quantifiable variables from the performances. The most common procedure for deriving this implicit relationship is multiple linear regression; Page's early work on computerized scoring of essays and the scoring of computer-based case simulations both took this approach. One important characteristic of the implicit nature of this relationship is that the quantified characteristics may not actually represent the characteristics that experts consider when rating a performance; they may instead act as proxies for those characteristics.

The use of proxies has both advantages and disadvantages. One advantage is that it may be difficult to identify or quantify the actual characteristics of interest. Consider the problem of defining and quantifying the characteristics that make one essay better than another. However, the use of proxy variables may have associated risks. If examinees know that the essay is being judged on the basis of the number of words, and so on, they may be able to manipulate the system to increase their scores without improving the quality of their essays. The use of implicit criteria opens the possibility of using proxy measures as the basis for scoring, but it does not require the use of such measures.

Another significant issue in the use of computer-delivered assessments that require complex automated scoring procedures is the potential for the scores to be influenced by construct-irrelevant variance. To the extent that computer delivery and/or scoring of assessments results in modifying the assessment task so that it fails to correspond to the criterion real-world behavior, the modifications may result in construct-irrelevant variance. Limitations of the scoring system may also induce construct-irrelevant variance. Consider the writing task described by Davey, Goodwin, and Mettelholtz [8]. To the extent that competent writers may not be careful and critical readers, the potential exists for scores that are interpreted as representing writing skills to be influenced by an examinee's editorial skills. Similarly, in the mathematical expressions tasks, Bennett and colleagues [4] describe a problem with scoring resulting from the fact that, if examinees include labels in their expression (e.g., days), the scoring algorithm may not correctly interpret expressions that would be scored correctly by expert review.

A final important issue with computerized scoring procedures is that the computer scores with mechanical consistency. Even the most highly trained human raters will fall short of this standard. This level of consistency is certainly a strength for these procedures. However, to the extent that the automated algorithm introduces error into the scores, this error will also be propagated with mechanical consistency. This has important implications because it has the potential to replace random errors (which will tend
to average out across tasks or raters) with systematic errors.

The Future of Automated Scoring

It does not require uncommon prescience to predict that the use of automated scoring procedures for complex computer-delivered assessments will increase both in use and complexity. The improvements in recognition of handwriting and vocal speech will broaden the range of the assessment format. The availability of low-cost computers and the construction of secure computerized test-delivery networks has opened the potential for large and small-scale computerized testing administrations in high- and low-stakes contexts.

The increasing use of computers in assessment and the concomitant increasing use of automated scoring procedures seems all but inevitable. This increase will be facilitated to the extent that two branches of research and development are successful. First, there is the need to make routine what is now state of the art. The level of expertise required to develop the more sophisticated of the procedures described in this article puts their use beyond the resources available to most test developers. Secondly, new procedures are needed that will support development of task-specific keys for automated scoring. Issues of technical expertise aside, the human resources currently required to develop the scoring algorithms for individual tasks are well in excess of those required for testing on the basis of multiple-choice items. To the extent that computers can replace humans in this activity, the applications will become much more widely applicable.

Finally, at present, there are a limited number of specific formats that are scorable by computer; this entry has referenced many of them. New and increasingly innovative formats and scoring procedures are sure to be developed within the coming years. Technologies such as artificial neural networks are promising [16]. Similarly, advances in cognitive science may provide a framework for developing new approaches [13].

References

[1] Bejar, I.I. & Bennett, R.E. (1997). Validity and automated scoring: It's not only the scoring, Educational Measurement: Issues and Practice 17(4), 9-16.
[2] Bennett, R.E., Gong, B., Hershaw, R.C., Rock, D.A., Soloway, E. & Macalalad, A. (1990). Assessment of an expert system's ability to automatically grade and diagnose students' constructed responses to computer science problems, in Artificial Intelligence and the Future of Testing, R.O. Freedle, ed., Lawrence Erlbaum Associates, Hillsdale, 293-320.
[3] Bennett, R.E. & Sebrechts, M.M. (1997). A computerized task for representing the representational component of quantitative proficiency, Journal of Educational Measurement 34, 64-77.
[4] Bennett, R.E., Steffen, M., Singley, M.K., Morley, M. & Jacquemin, D. (1997). Evaluating an automatically scorable, open-ended response type for measuring mathematical reasoning in computer-adaptive testing, Journal of Educational Measurement 34, 162-176.
[5] Braun, H.I., Bennett, R.E., Frye, D. & Soloway, E. (1990). Scoring constructed responses using expert systems, Journal of Educational Measurement 27, 93-108.
[6] Clauser, B.E., Kane, M.T. & Swanson, D.B. (2002). Validity issues for performance based tests scored with computer-automated scoring systems, Applied Measurement in Education 15, 413-432.
[7] Clauser, B.E., Margolis, M.J., Clyman, S.G. & Ross, L.P. (1997). Development of automated scoring algorithms for complex performance assessments: a comparison of two approaches, Journal of Educational Measurement 34, 141-161.
[8] Davey, T., Goodwin, J. & Mettelholtz, D. (1997). Developing and scoring an innovative computerized writing assessment, Journal of Educational Measurement 34, 21-41.
[9] Deane, P. (in press). Strategies for evidence identification through linguistic assessment of textual responses, in Automated Scoring of Complex Tasks in Computer Based Testing, D. Williamson, I. Bejar & R. Mislevy, eds, Lawrence Erlbaum Associates, Hillsdale.
[10] Kaplan, R.M. & Bennett, R.E. (1994). Using a Free-response Scoring Tool to Automatically Score the Formulating-hypotheses Item (RR 94-08), Educational Testing Service, Princeton.
[11] Landauer, T.K., Laham, D. & Foltz, P.W. (2003). Automated scoring and annotation of essays within the Intelligent Essay Assessor, in Automated Essay Scoring: A Cross-disciplinary Perspective, M.D. Shermis & J. Burstein, eds, Lawrence Erlbaum Associates, London, 87-112.
[12] Margolis, M.J. & Clauser, B.E. (in press). A regression-based procedure for automated scoring of a complex medical performance assessment, in Automated Scoring of Complex Tasks in Computer Based Testing, D. Williamson, I. Bejar & R. Mislevy, eds, Lawrence Erlbaum Associates, Hillsdale.
[13] Mislevy, R.J., Steinberg, L.S., Breyer, F.J., Almond, R.G. & Johnson, L. (2002). Making sense of data from complex assessments, Applied Measurement in Education 15, 363-390.
[14] Page, E.B. (1966). Grading essays by computer: progress report, Proceedings of the 1966 Invitational Conference on Testing, Educational Testing Service, Princeton, 87-100.
[15] Page, E.B. & Petersen, N.S. (1995). The computer moves into essay grading, Phi Delta Kappan 76, 561-565.
[16] Stevens, R.H. & Casillas, A. (in press). Artificial neural networks, in Automated Scoring of Complex Tasks in Computer Based Testing, D. Williamson, I. Bejar & R. Mislevy, eds, Lawrence Erlbaum Associates, Hillsdale.
[17] Williamson, D.M., Bejar, I.I. & Hone, A.S. (1999). Mental model comparison of automated and human scoring, Journal of Educational Measurement 36, 158-184.

BRIAN E. CLAUSER AND MELISSA J. MARGOLIS
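The Boolean credit logic described in the case-simulation portion of the entry above, i.e. credit only if (a) an appropriate test result was seen first, (b) no equivalent treatment preceded it, and (c) it fell within a time window, can be sketched as follows. This is a hypothetical illustration, not the NBME algorithm; all names and the action format are invented.

```python
def treatment_credit(actions, treatment, test, equivalents, deadline):
    """actions: list of (simulated_time, action) pairs; returns True if the
    ordered treatment satisfies all three Boolean credit conditions."""
    times = {}
    for t, a in actions:
        times.setdefault(a, t)          # first time each action occurred
    if treatment not in times:
        return False
    t_rx = times[treatment]
    seen_test = test in times and times[test] < t_rx            # condition (a)
    no_equiv = all(e not in times or times[e] >= t_rx           # condition (b)
                   for e in equivalents)
    in_window = t_rx <= deadline                                # condition (c)
    return seen_test and no_equiv and in_window

# Hypothetical performance record in simulated-time order.
record = [(1, "order glucose test"), (3, "start insulin")]
print(treatment_credit(record, "start insulin", "order glucose test",
                       equivalents=["start oral hypoglycemic"], deadline=6))  # True
```

In the procedure the entry describes, items scored this way would then be combined with regression-based weights derived from expert ratings.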
Friedman's Test
SHLOMO SAWILOWSKY AND GAIL FAHOOME
Volume 2, pp. 673-674
Procedure

Rank the observations for each row from 1 to k. For each of the k columns, the ranks are added and averaged, and the mean is designated \bar{R}_j. The mean of the ranks is \bar{R} = (k + 1)/2. The sum of squared deviations of the column mean ranks from the overall mean rank is computed. The test statistic is a multiple of this sum.

Assumptions

It is assumed that the rows are independent and there are no tied observations in a row. Because comparisons are made within rows, tied values may not pose a serious threat. Typically, average ranks are assigned to ties.

Test Statistic

The test statistic, M, is a multiple of S:

S = \sum_{j=1}^{k} (\bar{R}_j - \bar{R})^2

M = \frac{12n}{k(k+1)} S,    (1)

where n is the number of rows, and k is the number of columns. An alternate formula, in terms of the column rank sums R_j, is:

M = \frac{12}{nk(k+1)} \sum_{j=1}^{k} R_j^2 - 3n(k+1).    (2)

Example

Friedman's test is calculated with Samples 1 to 5 in Table 1, n1 = n2 = n3 = n4 = n5 = 15.

Table 1 Sample data

Sample 1   Sample 2   Sample 3   Sample 4   Sample 5
   20         11          9         34         10
   33         34         14         10          2
    4         23         33         38         32
   34         37          5         41          4
   13         11          8          4         33
    6         24         14         26         19
   29          5         20         10         11
   17          9         18         21         21
   39         11          8         13          9
   26         33         22         15         31
   13         32         11         35         12
    9         18         33         43         20
   33         27         20         13         33
   16         21          7         20         15
   36          8          7         13         15

The rows are ranked, with average ranks assigned to tied ranks as in Table 2. The column sums are: R1 = 48.5, R2 = 47.0, R3 = 33.0, R4 = 52.5, and R5 = 44.0. The sum of the squared rank sums is 10342.5. By formula (2), M = (12/(15 · 5 · 6))(10342.5) - 3 · 15 · 6 = 275.8 - 270 = 5.8. The large-sample approximation of the critical value is 9.488, chi-square with 5 - 1 = 4 degrees of freedom and α = 0.05. Because 5.8 <
9.488, the null hypothesis cannot be rejected on the basis of the evidence from these samples.
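The worked example can be reproduced with a short pure-Python sketch (not part of the original entry). It implements the alternate formula, using average ranks for ties, on the Table 1 data:

```python
def avg_ranks(row):
    """Within-row ranks 1..k, with tied values sharing the average rank."""
    return [sum(v < x for v in row) + (row.count(x) + 1) / 2 for x in row]

def friedman_m(rows):
    """Friedman's test statistic M for a list of equal-length rows."""
    n, k = len(rows), len(rows[0])
    col_sums = [0.0] * k
    for row in rows:
        for j, r in enumerate(avg_ranks(row)):
            col_sums[j] += r
    # M = 12/(nk(k+1)) * sum of squared column rank sums - 3n(k+1)
    return 12 / (n * k * (k + 1)) * sum(R * R for R in col_sums) - 3 * n * (k + 1)

data = [  # the 15 rows of Table 1
    [20, 11, 9, 34, 10], [33, 34, 14, 10, 2], [4, 23, 33, 38, 32],
    [34, 37, 5, 41, 4], [13, 11, 8, 4, 33], [6, 24, 14, 26, 19],
    [29, 5, 20, 10, 11], [17, 9, 18, 21, 21], [39, 11, 8, 13, 9],
    [26, 33, 22, 15, 31], [13, 32, 11, 35, 12], [9, 18, 33, 43, 20],
    [33, 27, 20, 13, 33], [16, 21, 7, 20, 15], [36, 8, 7, 13, 15],
]

print(round(friedman_m(data), 1))  # 5.8, matching the worked example
```

The column rank sums produced along the way (48.5, 47.0, 33.0, 52.5, 44.0) agree with those reported in the example.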
References

[1] Agresti, A. & Pendergast, J. (1986). Comparing mean ranks for repeated measures data, Communications in Statistics, Theory and Methods 15, 1417-1433.
[2] Fahoome, G. (2002). Twenty nonparametric statistics and their large-sample approximations, Journal of Modern Applied Statistical Methods 1(2), 248-268.
[3] Fahoome, G. & Sawilowsky, S. (2000). Twenty nonparametric statistics, Annual Meeting of the American Educational Research Association, SIG/Educational Statisticians, New Orleans.
[4] Friedman, M. (1937). The use of ranks to avoid the assumption of normality implicit in the analysis of variance, Journal of the American Statistical Association 32, 675-701.
[5] Hodges, J.L., Ramsey, P.H. & Shaffer, J.P. (1993). Accurate probabilities for the sign test, Communications in Statistics, Theory and Methods 22, 1235-1255.
[6] Neave, H.R. & Worthington, P.L. (1988). Distribution-Free Tests, Unwin Hyman, London.

(See also Distribution-free Inference, an Overview)

SHLOMO SAWILOWSKY AND GAIL FAHOOME
Functional Data Analysis
JAMES RAMSAY
Volume 2, pp. 675–678
in Encyclopedia of Statistics in Behavioral Science
ISBN-13: 978-0-470-86080-9; ISBN-10: 0-470-86080-4
What is Functional Data Analysis?

Functional data analysis, or FDA, is the modeling of data using functional parameters. By a functional parameter, we mean a function whose shape and complexity are not known in advance of the analysis, and therefore the modeling process must provide as much flexibility in the estimated function as the data require. By contrast, more classical parametric approaches to function estimation assume a fixed form for the function defined by a small number of parameters, and focus on estimating these parameters as the goal of the modeling process. As a consequence, while FDA certainly estimates parameters, the attention is on the entire function rather than on the values of these parameters.

[Figure 1: The heights of 10 girls, plotted as height (cm) against age (years). The height measurements are indicated by the circles. The smooth lines are height functions estimated using monotone smoothing methods (data taken from [5]).]

Some of the oldest problems in psychology and education are functional in nature. Psychophysics aims to estimate a curve relating a physical measurement to a subjective or perceived counterpart.

[Figure 2: acceleration curves, acceleration (cm/yr²) plotted against age.]

[Figure 3: Three item response functions showing the relation between the probability of getting an item correct on a mathematics test as a function of ability (reproduced from [4]; copyright 2002 by the American Educational Research Association and the American Statistical Association; reproduced with permission from the publisher).]
Functions of arbitrary shape and complexity are constructed using a set of K functional building blocks called basis functions. These basis functions are combined linearly by multiplying each basis function φ_k(t), k = 1, ..., K, by a coefficient c_k and summing the results. That is,

    x(t) = Σ_{k=1}^{K} c_k φ_k(t).   (1)

A familiar example is the polynomial, constructed by taking linear combinations of powers of t. When x(t) is constructed with a Fourier series, the basis functions are one of a series of sine and cosine pairs, each pair being a function of an integer multiple of a base frequency. This is appropriate when the data are periodic.

Where an unconstrained function is to be estimated, the preferred basis functions tend to be the splines, constructed by joining polynomial segments together at a series of values of t called knots. Splines have pretty much replaced polynomials for functional work because of their much greater flexibility and computational convenience (see Scatterplot Smoothers).

No matter what the basis system, the flexibility of the resulting curve is determined by the number K of basis functions, and a typical analysis involves determining how large K must be in order to capture the required features of the function being estimated.

FDA assumes that the function being estimated is smooth. In practice, this means that the function has one or more derivatives that are themselves smooth or at least continuous. Derivatives play many roles in the technology of FDA. The growth curve analysis had the study of the second derivative as its immediate goal.

Derivatives are also used to quantify smoothness. A frequently used method is to define the total curvature of a function by the integral of the square of its second or higher-order derivative. This measure is called a roughness penalty, and a functional parameter is estimated by explicitly controlling its roughness.

In many situations, we need to study rates of change directly, that is, the dynamics of a process distributed over time, space, or some other continuum. In these situations, it can be natural to develop differential equations, which are functional relationships between a function and one or more of its derivatives. For example, sinusoidal oscillation in a function x(t) can be expressed by the equation D²x(t) = -ω²x(t), where the notation D²x(t) refers to the second derivative of x, and 2π/ω is the period. The use of FDA techniques to estimate differential equations has many applications in fields such as chemical engineering and control theory, but should also prove important in the emerging study of the dynamic aspects of human behavior.

We often need to fit functions to data that have special constraints. A familiar example is the probability
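Equation (1) can be illustrated with a small sketch: a Fourier basis (a constant plus sine/cosine pairs at integer multiples of a base frequency) combined linearly. The coefficients below are invented purely for illustration.

```python
import math

# Build K Fourier basis functions and form x(t) = sum_k c_k * phi_k(t),
# as in equation (1). Coefficients are made-up illustration values.
def fourier_basis(K, period=1.0):
    """Return K basis functions: 1, sin(wt), cos(wt), sin(2wt), ..."""
    w = 2 * math.pi / period
    basis = [lambda t: 1.0]
    m = 1
    while len(basis) < K:
        basis.append(lambda t, m=m: math.sin(m * w * t))
        if len(basis) < K:
            basis.append(lambda t, m=m: math.cos(m * w * t))
        m += 1
    return basis

def expand(coefs, basis):
    """x(t) = sum_k c_k * phi_k(t)."""
    return lambda t: sum(c * phi(t) for c, phi in zip(coefs, basis))

basis = fourier_basis(5)                       # 1, sin, cos, sin(2wt), cos(2wt)
x = expand([0.5, 1.0, -0.3, 0.2, 0.1], basis)  # invented coefficients
print(round(x(0.25), 4))                       # 1.4
```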
density function p(t) that we estimate to summarize the distribution of a sample of N values t_i (see Catalogue of Probability Density Functions). A density function must be positive and must integrate to one. It is reasonable to assume that growth curves such as those shown in Figure 1 are strictly increasing. Item response functions must take values within the unit interval [0, 1]. Constrained functions like these can often be elegantly expressed in terms of differential equations. For example, any strictly increasing curve x(t) can be expressed in terms of the equation D²x(t) = w(t)Dx(t), where the alternative functional parameter w(t) has no constraints whatever. Estimating w(t) rather than x(t) is both easier and assures monotonicity.

Phase Variation and Registration in FDA

An FDA can confront new problems not encountered in multivariate and other older types of statistical procedures. One of these is the presence of phase variation, illustrated in Figure 2. We see there that the pubertal growth spurt varies in both amplitude and timing from girl to girl. This is because each child has a physiological age that does not evolve at the same rate as chronological age. We call this variation in the timing of curve features phase variation.

The problem with phase variation is illustrated in the heavy dashed mean curve in Figure 2. Because girls are at different stages of their maturation at any particular clock age, the cross-sectional mean is a terrible estimate of the average child's growth pattern. The mean acceleration displays a pubertal growth spurt that lasts longer than that for any single girl, and also has less amplitude variation as well.

Before we can conduct even the simplest analyses, such as computing means and standard deviations, we must remove phase variation. This is done by computing a nonlinear, but strictly increasing, transformation of clock time called a time warping function, such that when a child's curve values are plotted against transformed time, features such as the pubertal growth spurt are aligned. This procedure is often called curve registration, and can be an essential first step in an FDA.

What are Some Functional Data Analyses?

Nearly all the analyses that are used in multivariate statistics have their functional counterparts. For example, estimating functional descriptive statistics such as a mean curve, a standard deviation curve, and a bivariate correlation function are usual first steps in an FDA, after, of course, registering the curves, if required.

Then many investigators will turn to a functional version of principal components analysis (PCA) to study the dominant modes of variation among a sample of curves. Here, the principal component vectors in multivariate PCA become principal functional components of variation. As in ordinary PCA, a central issue is determining how many of these components are required to adequately account for the functional variation in the data, and rotating principal components can be helpful here, too. A functional analogue of canonical correlation analysis may also be useful.

Multiple regression analysis or the linear model has a wide range of functional counterparts. A functional analysis of variance involves dependent variables that are curves. We could, for example, compute a functional version of the t Test to see if the acceleration curves in Figure 2 differ between boys and girls. In such tests, it can be useful to identify regions on the t-axis where there are significant differences, rather than being content just to show that differences exist. This is the functional analogue of the multiple comparison problem (see Multiple Comparison Procedures).

What happens when an independent variable in a regression analysis is itself a function? Such situations often arise in medicine and engineering, where a patient or some industrial process produces a measurable response over time to a time-varying input of some sort, such as drug dose or raw material, respectively. In some situations, varying the input has an immediate effect on the output, and, in other situations, we need to compute causal effects over earlier times as well. Functional independent variables introduce a fascinating number of new technical challenges, as it is absolutely essential to impose smoothness on the estimated functional regression coefficients.

Differential equations are, in effect, functional linear models where the output is a derivative, and among the inputs are the function's value and
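The time-warping idea can be sketched with a toy, strictly increasing, piecewise-linear warp that maps each child's growth-spurt age to a common target age. A real registration method would estimate a smooth warping function from the data; the ages and landmark times below are invented.

```python
# Toy landmark registration: a strictly increasing, piecewise-linear
# time-warping function h with h(0) = 0, h(landmark) = target, and
# h(t_max) = t_max. All numbers are invented for illustration.
def make_warp(landmark, target, t_max):
    def h(t):
        if t <= landmark:
            return t * target / landmark
        return target + (t - landmark) * (t_max - target) / (t_max - landmark)
    return h

# Two "children" whose growth spurts occur at ages 10 and 13; both spurts
# are mapped to the common target age 11.5 on the warped time axis.
warps = {
    "child_a": make_warp(10.0, 11.5, 18.0),
    "child_b": make_warp(13.0, 11.5, 18.0),
}
aligned = {name: h(spurt)
           for (name, h), spurt in zip(warps.items(), (10.0, 13.0))}
```

After warping, both landmark ages coincide at 11.5, so features such as the pubertal growth spurt line up before means or standard deviations are computed.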
[Figure residue: plots of membership functions u_1(w) and u_2(w) against w.]
As in the case of crisp cluster analysis, the inputs for the fuzzy cluster analysis are the results of m measurements made on n objects, which can be represented:

    x_1^1  x_2^1  ...  x_n^1
    x_1^2  x_2^2  ...  x_n^2
     ...    ...   ...   ...      (1)
    x_1^m  x_2^m  ...  x_n^m

The situation differs significantly, depending on whether measurements are continuous or categorical. First, we consider continuous measurements, for which the fuzzy K-means algorithm will be discussed in detail and other methods will be briefly characterized. Then we will discuss methods for analyzing categorical data which may be used for fuzzy clustering.

Fuzzy Clustering of Continuous Data: Fuzzy K-means Algorithm

When the result of every measurement is a real number, the columns of matrix (1) (which represent objects) may be considered as points in m-dimensional space. In crisp K-means clustering, the goal is to split objects into K clusters c_1, ..., c_K with centers v_1, ..., v_K such that

    Σ_{k=1}^{K} Σ_{x_i ∈ c_k} d(x_i, v_k)   (2)

is minimal. In terms of 0/1 membership indicators u_ik, this objective can be written as

    Σ_{k=1}^{K} Σ_{i=1}^{n} u_ik d(x_i, v_k),   (3)

with constraints

    Σ_{k=1}^{K} u_ik = 1   (4)

for every i = 1, ..., n. The allowed values for u_ik are 0 and 1; therefore, (4) means that for every i, only one value among u_i1, ..., u_iK is 1 and all others are 0. The distance d(x, y) may be chosen from a wide range of formulas, but for computational efficiency it is necessary to have a simple way to compute centers of clusters. The usual choice is the squared Euclidean distance, d(x, y) = Σ_j (x^j - y^j)², where the center of a cluster is its center of gravity. For the sake of simplicity, we restrict our consideration to the squared Euclidean distance.

Equation (3) suggests that fuzzy clustering may be obtained by relaxing the restriction that u_ik is either 0 or 1; rather, u_ik is allowed to take any value in the interval [0, 1] and is treated as the degree of membership of object i in cluster k. However, this is not as simple as it appears. One can show that the minimum of (3) with constraints (4) is still obtained when the u_ik are 0s or 1s, despite the admissibility of intermediate values. In this problem, an additional parameter f > 1, called a fuzzifier, can be introduced in (3):

    Σ_{k=1}^{K} Σ_{i=1}^{n} (u_ik)^f d(x_i, v_k).   (5)

The fuzzifier has no effect in crisp K-means clustering (as 0^f = 0 and 1^f = 1), but it produces nontrivial minima of (5) with constraints (4).
Fuzzy Cluster Analysis 3
Now the fuzzy clustering problem is a problem of finding the minimum of (5) under constraints (4). The fuzzy K-means algorithm searches for this minimum by alternating two steps: (a) optimizing membership degrees u_ik while cluster centers v_k are fixed; and (b) optimizing v_k while u_ik are fixed. The minimum of (5), with respect to u_ik, is

    u_ik = 1 / Σ_{k'=1}^{K} [d(x_i, v_k) / d(x_i, v_k')]^{1/(f-1)},   (6)

and the minimum of (5), with respect to v_k, is

    v_k = Σ_{i=1}^{n} (u_ik)^f x_i / Σ_{i=1}^{n} (u_ik)^f.   (7)

Equation (7) is a vector equation; it defines a center of gravity of masses (u_1k)^f, ..., (u_nk)^f placed at points x_1, ..., x_n. The right side of formula (6) is undefined if d(x_i, v_k0) is 0 for some k0; in this case, one lets u_ik0 = 1 and u_ik = 0 for all other k. The algorithm stops when changes in u_ik and v_k during the last step are below a predefined threshold.

The fuzziness of the clusters depends on the fuzzifier f. If f is close to 1, the membership is close to a crisp one; if f tends to infinity, the fuzzy K-means algorithm tends to give equal membership in all clusters to all objects. Figures 4, 5, and 6 demonstrate membership functions for f = 2, 3, 5. The most common choice for the fuzzifier is f = 2.

This method gives nonzero membership in all clusters for any object that does not coincide with the center of one of the clusters. Some researchers, however, prefer to have a crisp membership for objects close to cluster centers and fuzzy membership for objects that are close to cluster boundaries. One possibility was suggested by Klawonn and Hoeppner [4]. Their central idea is to consider the subexpression u^f as a special case of a function g(u). To be used in place of the fuzzifier, such functions must (a) be a
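Equations (6) and (7) translate directly into an alternating algorithm. The sketch below uses the squared Euclidean distance and the common choice f = 2 on an invented two-group data set; the deterministic initialization is a simplification (real implementations randomize and restart).

```python
# Fuzzy K-means sketch: alternate membership updates (equation (6)) and
# center updates (equation (7)) until the centers stop moving.
def fuzzy_kmeans(points, K, f=2.0, iters=100, tol=1e-10):
    # Deterministic start: K points spread across the data set.
    idx = [i * (len(points) - 1) // (K - 1) for i in range(K)]
    centers = [tuple(points[i]) for i in idx]
    U = []
    for _ in range(iters):
        # Step (a): memberships from equation (6), centers fixed.
        U = []
        for x in points:
            d = [sum((a - b) ** 2 for a, b in zip(x, v)) for v in centers]
            if min(d) == 0.0:  # x coincides with a center: crisp membership
                U.append([1.0 if dk == 0.0 else 0.0 for dk in d])
            else:
                U.append([1.0 / sum((d[k] / d[kk]) ** (1.0 / (f - 1.0))
                                    for kk in range(K)) for k in range(K)])
        # Step (b): centers from equation (7), memberships fixed.
        new = []
        for k in range(K):
            w = [U[i][k] ** f for i in range(len(points))]
            s = sum(w)
            new.append(tuple(sum(wi * p[j] for wi, p in zip(w, points)) / s
                             for j in range(len(points[0]))))
        shift = max(sum((a - b) ** 2 for a, b in zip(u, v))
                    for u, v in zip(centers, new))
        centers = new
        if shift < tol:
            break
    return centers, U

pts = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2),   # invented group near (0.1, 0.1)
       (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]   # invented group near (5.0, 5.0)
centers, U = fuzzy_kmeans(pts, K=2)
```

Each membership row satisfies constraint (4), and with f = 2 the well-separated points receive nearly crisp memberships.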
[Figures 4–6: membership functions u_1(w) and u_2(w) plotted against w for fuzzifier values f = 2, 3, 5.]
considered as clusters; consequently, membership in a cluster is the probability of belonging to a component distribution law (conditional on observations) (see Finite Mixture Distributions; Model Based Cluster Analysis). In the fish example, the observed distribution law can be represented as a mixture of two normal distributions, which leads to two clusters similar to those previously considered.

The applicability of this approach is restricted in that a representation as a finite mixture may or may not exist, or may not be unique, even when an obvious decomposition into clusters is present in the data. On the other hand, although there is no obvious extension of the fuzzy K-means algorithm to categorical data, the mixed distribution approach can be applied to categorical data.

Fuzzy Clustering of Categorical Data: Latent Class Models

Latent structure analysis [3, 6, 7] deals with categorical measurements. The columns of (1) are vectors of measurements made on an object. These vectors may be considered as realizations of a random vector x = (X^1, ..., X^m). We say that the distribution law of random vector x is independent if the component random variables X^1, ..., X^m are mutually independent.

The observed distribution law is not required to be independent; however, under some circumstances, it may be represented as a mixture of independent distribution laws. This allows considering a population as a disjoint union of classes (latent classes), such that the distribution law of random vector x in every class is independent. Probabilities for objects belonging to a class, conditional on the outcomes of measurements, can be calculated and can be considered as degrees of membership in corresponding classes (see Latent Class Analysis). Most widely used algorithms for construction of latent class models are based on maximizing the likelihood function and involve heavy computation.

Fuzzy Clustering of Categorical Data: Grade of Membership Analysis

Grade of membership (GoM) analysis [5, 9, 11] works with the same data as latent structure analysis. It also searches for a representation of the observed distribution as a mixture of independent distributions. However, in GoM, though the mixture sought is allowed to be infinite, all mixed distributions must belong to a low-dimensional linear subspace Q of a space of independent distributions. Under weak conditions this linear subspace is identifiable, and the algorithm for finding this subspace reduces the problem to an eigenvalue/eigenvector problem.

The mixing distribution may be considered a distribution of a random vector g taking values in Q. Individual scores g_i are expectations of random vector g conditional on the outcomes of measurements x_i1, ..., x_im. Conditional expectations may be found as a solution of a linear system of equations. Let subspace Q be K-dimensional, let λ_1, ..., λ_K be its basis, and let g_i1, ..., g_iK be the coordinates of the vector of individual scores g_i in this basis. Often, for an appropriate choice of basis, g_ik may be interpreted as a partial membership of object i in cluster k. Alternatively, a crisp or fuzzy clustering algorithm may be applied to the individual scores g_i to obtain other classifications. The low computational complexity of the GoM algorithm makes it very attractive for analyzing data involving a large number (hundreds or thousands) of categorical variables.

Example: Analysis of Gene Expression Data

We used as our example the gene expression data underlying Figure 2 in [1]. These authors performed a hierarchical cluster analysis on 2427 genes in the yeast S. cerevisiae. Data were drawn at time points during several processes given in Table 1, taken from footnotes in [1]: for example, cell division after synchronization by alpha factor arrest (ALPH; 18 time points) and after centrifugal elutriation (ELU; 14 time points).

Gene expression (log ratios) was measured for each of these time points and subjected to hierarchical cluster analysis. For details of their cluster analysis method, see [1]. They took the results of their cluster analysis and made plots of genes falling in various clusters. Their plots consisted of raw data values (log ratios) to which they assigned a color varying from saturated green at the small end of the scale to saturated red at the high end of the scale. The resulting plots exhibited large areas similarly colored. These areas indicated genes that clustered together.
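The latent-class notion of membership degrees, posterior class probabilities under a mixture of independent distributions, can be sketched numerically; the two-class, two-binary-variable parameters below are invented for illustration.

```python
# Degrees of membership in latent classes: P(class | x) by Bayes' rule for
# a mixture of independent distributions. All parameters are invented.
prior = [0.6, 0.4]          # mixing proportions of the two latent classes
cond = [                    # cond[c][j] = P(X_j = 1 | class c)
    [0.9, 0.8],
    [0.2, 0.3],
]

def membership(x, prior, cond):
    """Posterior P(class | x) for x a tuple of 0/1 outcomes."""
    joint = []
    for pc, pj in zip(prior, cond):
        lik = 1.0
        for xj, p1 in zip(x, pj):
            lik *= p1 if xj == 1 else 1.0 - p1   # independence within class
        joint.append(pc * lik)
    total = sum(joint)
    return [jc / total for jc in joint]

m = membership((1, 1), prior, cond)   # strongly favors the first class
```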
Table 1  Cell division processes

Process type   Time points   Process description
ALPH           18            Synchronization by alpha factor arrest
ELU            14            Centrifugal elutriation
CDC15          15            Temperature-sensitive mutant
SPO            11            Sporulation
HT              6            High-temperature shock
Dtt             4            Reducing agents
DT              4            Low temperature
DX              7            Diauxic shift

The primary purpose of this example is to describe the use of GoM for fuzzy cluster analysis. A secondary purpose was to identify some genes that cluster together and constitute an interesting set. To use the Grade of Membership (GoM) model, we categorized the data by breaking each range of expression into 5 parts roughly according to the empirical distribution. GoM constructs K groups (types) with different characteristics to explain heterogeneity of the data. The product form of the multinomial GoM likelihood for categorical variables x_ij is:

    L = Π_i Π_j Π_{l=1}^{L_j} (p_ijl)^{y_ijl},   (11)

where

    p_ijl = Σ_{h=1}^{K} g_ih λ_hjl,   (12)

with constraints g_ih, λ_hjl ≥ 0 for all i, h, j, l, and Σ_{h=1}^{K} g_ih = 1 for all i; Σ_{l=1}^{L_j} λ_hjl = 1 for all h, j. y_ijl is the binary coding of x_ij.

In (11), p_ijl is the probability that observation i gives rise to the response l for variable j; λ_hjl is the probability that an observation belonging exclusively to type h gives rise to the response l for variable j; g_ih is the degree to which observation i belongs to type h; L_j is the number of possible values for variable j; and K is the number of types needed to fully characterize the data, and must be specified. The parameters {g_ih, λ_hjl} are estimated from (11) by the principle of maximum likelihood.

In practice, one starts with a low value of K = K_0, usually 4 or 5. Then, analogous to the way one fits a polynomial, successively higher values of K are tried until increasing K does not improve the fit. One of the important features of GoM is that, by inspecting the whole ensemble of λ_hjl for one value of K, one can construct a description of each type that usually makes sense. For most of the analyses done with GoM, K works out to be between 5 and 7. For this analysis, runs for K as high as 15 were done before settling on K = 10. The program DSIGoM, available from dsisoft.com, was used for the computations. Further analysis was done by standard statistical programs.

The sizes of the types are 205.8, 364.1, 188.7, 234.1, 386.1, 202.4, 211.2, 180.0, 187.9, and 207.2 for types 1 through 10, respectively. Although the GoM analysis makes a partial assignment of each observation or case to the 10 types, it is sometimes desirable to have a crisp assignment. A forced crisp assignment can be made by assigning the case to the type with the largest membership. For the ith case, define k_i such that g_ik_i > g_ih for all h ≠ k_i.

The GoM output consists of the profiles (or variable clusters) characterized by the {λ_hjl} and the grades of membership {g_ik} values, along with several goodness-of-fit measures. To construct clusters from the results of the GoM analysis, one can compare the empirical cumulative distribution function (CDF) for each type with the CDF for the population frequencies. This is done for each variable in the analysis. Based on the CDFs for each variable, a determination was made whether the type upregulated or downregulated gene expression more than the population average. Results of this process are given in Table 2.

D means that the type downregulates gene expression for that variable more than the population. A value of U indicates the type upregulates gene expression for the variable more than the population average. A blank value indicates that gene regulation for the type did not differ from the population average. In Table 2, Type 4 downregulates ALPH stress test expression values for all experiment times while Type 8 upregulates the same expression values for all experiment times. Types 4 and 8 represent different dimensions of the sample space. Heterogeneity of the data is assumed to be fully described by the 10 types. Descriptions for all 10 types are given in Table 3.

The g_ih represent the degree to which the ith case belongs to the hth type or cluster. Each case was assigned the type with the largest g_ih value. The 79 data values for each case were assigned colors according to the scheme used by [1] and plotted. We
[2] Everitt, B.S. & Hand, D.J. (1981). Finite Mixture Distributions, Chapman & Hall, New York.
[3] Heinen, T. (1996). Latent Class and Discrete Latent Trait Models, Sage Publications, Thousand Oaks.
[4] Klawonn, F. & Hoppner, F. (2003). What is fuzzy about fuzzy clustering? Understanding and improving the concept of the fuzzifier, in Advances in Intelligent Data Analysis, M.R. Berthold, H.-J. Lenz, E. Bradley, R. Kruse & C. Borgelt, eds, Springer-Verlag, Berlin, pp. 254–264.
[5] Kovtun, M., Akushevich, I., Manton, K.G. & Tolley, H.D. (2004). Grade of membership analysis: one possible approach to foundations, Focus on Probability Theory, Nova Science Publishers, New York. (To be published in 2005.)
[6] Langeheine, R. & Rost, J. (1988). Latent Trait and Latent Class Models, Plenum Press, New York.
[7] Lazarsfeld, P.F. & Henry, N.W. (1968). Latent Structure Analysis, Houghton-Mifflin, Boston.
[8] Lindsay, B.G. (1995). Mixture Models: Theory, Geometry and Applications, Institute of Mathematical Statistics, Hayward.
[9] Manton, K., Woodbury, M. & Tolley, H. (1994). Statistical Applications Using Fuzzy Sets, Wiley Interscience, New York.
[10] Titterington, D.M., Smith, A.F. & Makov, U.E. (1985). Statistical Analysis of Finite Mixture Distributions, John Wiley & Sons, New York.
[11] Woodbury, M.A. & Clive, J. (1974). Clinical pure types as a fuzzy partition, Journal of Cybernetics 4, 111–121.

Further Reading

Boreiko, D. & Oesterreichische Nationalbank. (2002). EMU and Accession Countries: Fuzzy Cluster Analysis of Membership, Oesterreichische Nationalbank, Wien.
Halstead, M.H. (1977). Elements of Software Science, Elsevier Science, New York.
Hartigan, J. (1975). Clustering Algorithms, Wiley, New York.
Hoppner, F. (1999). Fuzzy Cluster Analysis: Methods for Classification, Data Analysis, and Image Recognition, John Wiley, Chichester; New York.
Jordan, B.K. (1986). A fuzzy cluster analysis of antisocial behavior: implications for deviance theory, Dissertation, Duke University, Durham, North Carolina.
Lance, G.N. & Williams, W.T. (1967). A general theory of classificatory sorting strategies: 1. Hierarchical systems, Computer Journal 9, 373–380.
Ling, R.F. (1973). A probability theory of cluster analysis, Journal of the American Statistical Association 68(341), 159–164.

KENNETH G. MANTON, GENE LOWRIMORE, ANATOLI YASHIN AND MIKHAIL KOVTUN
Galton, Francis
ROGER THOMAS
Volume 2, pp. 687–688
Game Theory
ANDREW M. COLMAN
Volume 2, pp. 688–694
so on. This is an everyday phenomenon that occurs, for example, whenever a public announcement is made, so that everyone present not only knows it, but knows that others know it, and so on [21].

Key Concepts

Other key concepts of game theory are most easily explained by reference to a specific example. Figure 1 depicts the best known of all strategic games, the Prisoner's Dilemma game. The figure shows its payoff matrix, which specifies the game in normal form (or strategic form), the principal alternative being extensive form, which will be illustrated in the section titled 'Subgame-perfect and Trembling-hand Equilibria'. Player I chooses between the rows labeled C (cooperate) and D (defect), Player II chooses between the columns labeled C and D, and the pair of numbers in each cell represents the payoffs to Player I and Player II, in that order by convention. In noncooperative game theory, which is being outlined here, it is assumed that the players choose their strategies simultaneously, or at least independently, without knowing what the coplayer has chosen. A separate branch of game theory, called cooperative game theory, deals with games in which players are free to share the payoffs by negotiating coalitions based on binding and enforceable agreements. The rank order of the payoffs, rather than their absolute values, determines the strategic structure of a game. Replacing the payoffs 5, 3, 1, 0 in Figure 1 by 4, 3, 2, 1, respectively, or by 10, 1, -2, -20, respectively, changes some properties of the game but leaves its strategic structure (Prisoner's Dilemma) intact.

                II
             C       D
    I   C   3,3     0,5
        D   5,0     1,1

Figure 1  Prisoner's Dilemma game

The Prisoner's Dilemma game is named after an interpretation suggested in 1950 by Tucker [30] and popularized by Luce and Raiffa [18, pp. 94–97]. Two people, charged with joint involvement in a crime, are held separately by the police, who have insufficient evidence for a conviction unless at least one of them discloses incriminating evidence. The police offer each prisoner the following deal. If neither discloses incriminating evidence, then both will go free; if both disclose incriminating evidence, then both will receive moderate sentences; and if one discloses incriminating evidence and the other conceals it, then the former will be set free with a reward for helping the police, and the latter will receive a heavy sentence. Each prisoner, therefore, faces a choice between cooperating with the coplayer (concealing the evidence) and defecting (disclosing it). If both cooperate, then the payoffs are good for both (3, 3); if both defect, then the payoffs are worse for both (1, 1); and if one defects while the other cooperates, then the one who defects receives the best possible payoff and the cooperator the worst, with payoffs (5, 0) or (0, 5), depending on who defects.

This interpretation rests on the assumption that the utility numbers shown in the payoff matrix do, in fact, reflect the prisoners' preferences. Considerations of loyalty and a reluctance to betray a partner-in-crime might reduce the appeal of being the sole defector for some criminals, in which case that outcome might not yield the highest payoff. But the payoff numbers represent von Neumann–Morgenstern utilities, and they are, therefore, assumed to reflect the players' preferences after taking into account such feelings and everything else affecting their preferences. Many everyday interactive decisions involving cooperation and competition, trust and suspicion, altruism and spite, threats, promises, and commitments turn out, on analysis, to have the strategic structure of the Prisoner's Dilemma game [7]. An obvious example is price competition between two companies, each seeking to increase its market share by undercutting the other.

How should a rational player act in a Prisoner's Dilemma game played once? The first point to notice is that D is a best reply to both of the coplayer's strategies. A best reply (or best response) to a coplayer's strategy is a strategy that yields the highest payoff against that particular strategy. It is clear that D is a best reply to C because it yields a payoff of 5, whereas a C reply to C yields only 3; and D is also a best reply to D because it yields 1 rather than 0. In this game, D is a best reply to both of the coplayer's strategies, which means that defection is a best reply whatever the coplayer chooses. In technical terminology, D is a dominant strategy for both players. A dominant strategy is one that is a best
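The dominance argument can be checked mechanically from the Figure 1 payoffs; the 0 = C, 1 = D coding below is just a convenience for the sketch.

```python
# Payoffs from Figure 1, indexed by (row, col) with 0 = C, 1 = D;
# each cell holds (payoff to Player I, payoff to Player II).
payoffs = {
    (0, 0): (3, 3), (0, 1): (0, 5),
    (1, 0): (5, 0), (1, 1): (1, 1),
}

def dominant_for_row_player(payoffs):
    """Return a row strategy that is a best reply to every column, if any."""
    rows = {r for r, _ in payoffs}
    cols = {c for _, c in payoffs}
    for r in rows:
        if all(payoffs[(r, c)][0] >= payoffs[(rr, c)][0]
               for c in cols for rr in rows):
            return r
    return None

print(dominant_for_row_player(payoffs))  # 1, i.e., D dominates C
```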
reply to all the strategies available to the coplayer (or coplayers, if there are several).

Strategic dominance is a decisive argument for defecting in the one-shot Prisoner's Dilemma game: it is in the rational self-interest of each player to defect, whatever the other player might do. In general, if a game has a dominant strategy, then a rational player will certainly choose it. A dominated strategy, such as C in the Prisoner's Dilemma game, is inadmissible, inasmuch as no rational player would choose it. But the Prisoner's Dilemma game embodies a genuine paradox, because if both players cooperate, then each receives a better payoff (each gets 3) than if both defect (each gets 1).

Nash Equilibrium

The most important solution concept of game theory flows directly from best replies. A Nash equilibrium (or equilibrium point, or simply equilibrium) is an outcome in which the players' strategies are best replies to each other. In the Prisoner's Dilemma game, joint defection is a Nash equilibrium, because D is a best reply to D for both players, and it is a unique equilibrium, because no other outcome has this property. A Nash equilibrium has strategic stability, because neither player could obtain a better payoff by choosing differently, given the coplayer's choice, and the players, therefore, have no reason to regret their own choices when the outcome is revealed.

The fundamental theoretical importance of Nash equilibrium rests on the fact that if a game has a uniquely rational solution, then it must be a Nash equilibrium. Von Neumann and Morgenstern [31, pp. 146–148] established this important result via a celebrated indirect argument, the most frequently cited version of which was presented later by Luce and Raiffa [18, pp. 63–65]. Informally, by CKR2, the players are expected utility maximizers, and by CKR1, any rational deduction about the game is common knowledge. Taken together, these premises imply that, in a game with a uniquely rational solution, each player anticipates that solution and plays a best reply to it, and an outcome in which all strategies are best replies to one another is a Nash equilibrium by definition. A uniquely rational solution must, therefore, be a Nash equilibrium.

The indirect argument also provides a proof that a player cannot solve a game with the techniques of standard (individual) decision theory (see strategies of decision making) by assigning subjective probabilities to the coplayer's strategies as if they were states of nature and then simply maximizing SEU. The proof is by reductio ad absurdum. Suppose that a player were to assign subjective probabilities and maximize SEU in the Prisoner's Dilemma game. The specific probabilities are immaterial, so let us suppose that Player I, for whatever reason, believed that Player II was equally likely to choose C or D. Then, Player I could compute the SEU of choosing C as 1/2(3) + 1/2(0) = 1.5, and the SEU of choosing D as 1/2(5) + 1/2(1) = 3; therefore, to maximize SEU, Player I would choose D. But if that were a rational conclusion, then by CKR1, Player II would anticipate it, and by CKR2, would choose (with certainty) a best reply to D, namely D. This leads immediately to a contradiction, because it proves that Player II was not equally likely to choose C or D, as assumed from the outset. The only belief about Player II's choice that escapes contradiction is that Player II will choose D with certainty, because joint defection is the game's unique Nash equilibrium.

Nash proved in 1950 [22] that every game with a finite number of pure strategies has at least one equilibrium point, provided that the rules of the game allow mixed strategies to be used. The problem with Nash equilibrium as a solution concept is that many games have multiple equilibria that are nonequivalent and noninterchangeable, and this means that game theory is systematically indeterminate. This is illustrated in Figure 2, which shows the payoff matrix of the Stag Hunt game, first outlined in 1755 by Rousseau [26, Part II, paragraph 9], introduced into the literature of game theory by Lewis [16, p. 7], brought to prominence by Aumann [1], and discussed
imply that, in a two-person game, if it is uniquely C D
rational for the players to choose particular strategies, C 9,9 0, 8
then those strategies must be best replies to each
I
other. Each player can anticipate the coplayers
D 8,0 7,7
rationally chosen strategy (by CKR1) and necessarily
chooses a best reply to it (by CKR2); and because the
strategies are best replies to each other, they are in Figure 2 Stag Hunt game
4 Game Theory
in an influential book by Harsanyi and Selten [13, Furthermore, it is not intuitively obvious that play-
pp. 357359]. It is named after Rousseaus inter- ers should choose C, because, by so doing, they risk
pretation of it in terms of a hunt in which joint the worst possible payoff of zero. The D strategy
cooperation is required to catch a stag, but each is a far safer choice, risking a worst possible payoff
hunter is tempted to go after a hare, which can be of 7. This leads naturally to Harsanyi and Seltens
caught without the others help. If both players defect secondary criterion of selection among multiple equi-
in this way, then each is slightly less likely to succeed libria, called the risk-dominance principle, to be used
in catching a hare, because they may end up chasing only if payoff dominance fails to yield a determinate
the same one. solution. If e and f are any two equilibria in a game,
This game has no dominant strategies, and the (C, then e risk-dominates f if, and only if, the mini-
C) and (D, D) outcomes are both Nash equilibria mum possible payoff resulting from the choice of e
because, for both players, C is the best reply to C, and is strictly greater than the minimum possible pay-
D is the best reply to D. In fact, there is a third Nash off resulting from the choice of f , and players who
equilibrium virtually all games have odd numbers follow the risk-dominance principle choose its strate-
of equilibria in which both players use the mixed gies. In the Stag Hunt game, D risk-dominates C
strategy (7/8C, 1/8D), yielding expected payoffs for each player, but the payoff-dominance principle
of 63/8 to each. The existence of multiple Nash takes precedence, because, in this game, it yields a
equilibria means that formal game theory specifies determinate solution.
no rational way of playing this game, and other
psychological factors are, therefore, likely to affect
strategy choices. Subgame-perfect and Trembling-hand
Equilibria
Player II. This emerges most clearly from an examination of the extensive form of the game, shown in Figure 3(b). The extensive form is a game tree depicting the players' moves as if they occurred sequentially. This extensive form is read from Player I's move on the left. If the game were played sequentially, and if the second decision node were reached, then a utility-maximizing Player II would choose C at that point, to secure a payoff of 2 rather than zero. At the first decision node, Player I would anticipate Player II's reply, and would, therefore, choose C rather than D, to secure 2 rather than 1. This form of analysis, reasoning backward from the end, is called backward induction and is the basic method of finding subgame-perfect equilibria. In this game, it shows that the (D, D) equilibrium could not be reached by rational choice in the extensive form, and that means that it is imperfect in the normal form. By definition, a subgame-perfect equilibrium is one that induces payoff-maximizing choices in every branch or subgame of its extensive form.

In a further refinement, Selten [28] introduced the concept of the trembling-hand equilibrium to identify and eliminate imperfect equilibria. At every decision node in the extensive form or game tree, there is assumed to be a small probability (epsilon) that the player acts irrationally and makes a mistake. The introduction of these error probabilities, generated by a random process, produces a perturbed game in which every move that could be played has some positive probability of being played. Assuming that the players' trembling hands are common knowledge in a game, Selten proved that only the subgame-perfect equilibria of the original game remain equilibria in the perturbed game, and they continue to be equilibria as the probability tends to zero. According to this widely accepted refinement of the equilibrium concept, the standard game-theoretic rationality assumption (CKR2) is reinterpreted as a limiting case of incomplete rationality.

Experimental Games

Experimental games have been performed since the 1950s in an effort to understand the strategic interaction of human decision makers with bounded rationality and a variety of nonrational influences on their behavior (for detailed reviews, see [6, 11, 15, Chapters 1-4], [17, 25]). Up to the end of the 1970s, experimental attention focused largely on the Prisoner's Dilemma and closely related games. The rise of behavioral economics in the 1980s led to experiments on a far broader range of games (see [4, 5]). The experimental data show that human decision makers deviate widely from the rational prescriptions of orthodox game theory. This is partly because of bounded rationality and severely limited capacity to carry out indefinitely iterated recursive reasoning (I think that you think that I think ...) (see [8, 14, 29]), and partly for a variety of unrelated reasons, including a strong propensity to cooperate, even when cooperation cannot be justified on purely rational grounds [7].

Evolutionary Game Theory

The basic concepts of game theory can be interpreted as elements of the theory of natural selection as follows. Players correspond to individual organisms, strategies to organisms' genotypes, and payoffs to the changes in their Darwinian fitness (the numbers of offspring resembling themselves that they transmit to future generations). In evolutionary game theory, the players do not choose their strategies rationally, but natural selection mimics rational choice. Maynard Smith and Price [20] introduced the concept of the evolutionarily stable strategy (ESS) to handle such games. It is a strategy with the property that if most members of a population adopt it, then no mutant strategy can invade the population by natural selection, and it is, therefore, the strategy that we should expect to see commonly in nature. An ESS is invariably a symmetric Nash equilibrium, but not every symmetric Nash equilibrium is an ESS.

The standard formalization of ESS is as follows [19]. Suppose most members of a population adopt strategy I, but a small fraction of mutants or invaders adopt strategy J. The expected payoff to an I individual against a J individual is written E(I, J), and similarly for other combinations of strategies. Then, I is an ESS if either of the conditions (1) or (2) below is satisfied:

E(I, I) > E(J, I)                                  (1)

E(I, I) = E(J, I), and E(I, J) > E(J, J)           (2)
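Several of the numerical claims in this article, including the SEU calculation in the Prisoner's Dilemma, the equilibria of the Stag Hunt, and the ESS conditions (1) and (2), can be checked mechanically. The following is an illustrative sketch (not part of the original article), using the payoff numbers given in the text:

```python
from fractions import Fraction

# Prisoner's Dilemma payoffs quoted in the article (row player; 0 = C, 1 = D):
# 3 each for joint cooperation, 1 each for joint defection, and the lone
# defector gets 5 against a cooperator's 0. The game is symmetric.
PD = [[3, 0], [5, 1]]
PD_col = [[3, 5], [0, 1]]  # column player's payoffs

# SEU of C and D when Player II is believed equally likely to choose C or D.
seu_c = Fraction(1, 2) * 3 + Fraction(1, 2) * 0
seu_d = Fraction(1, 2) * 5 + Fraction(1, 2) * 1
assert seu_c == Fraction(3, 2) and seu_d == 3  # 1.5 versus 3, as in the text

def pure_nash(A, B):
    """Outcomes in which each player's strategy is a best reply to the other's."""
    return [(i, j) for i in range(2) for j in range(2)
            if all(A[i][j] >= A[k][j] for k in range(2))
            and all(B[i][j] >= B[i][k] for k in range(2))]

assert pure_nash(PD, PD_col) == [(1, 1)]  # joint defection is the unique equilibrium

# Stag Hunt (Figure 2): (C, C) and (D, D) are both equilibria, and the mixed
# strategy (7/8)C, (1/8)D equalizes the expected payoffs of C and D at 63/8.
SH = [[9, 0], [8, 7]]
SH_col = [[9, 8], [0, 7]]
assert pure_nash(SH, SH_col) == [(0, 0), (1, 1)]
p = Fraction(7, 8)
assert 9 * p == 8 * p + 7 * (1 - p) == Fraction(63, 8)

def is_ess(E, i):
    """Conditions (1)/(2): for every mutant j != i, either E(i,i) > E(j,i),
    or E(i,i) == E(j,i) and E(i,j) > E(j,j)."""
    return all(E[i][i] > E[j][i]
               or (E[i][i] == E[j][i] and E[i][j] > E[j][j])
               for j in range(len(E)) if j != i)

assert is_ess(PD, 1)       # D is an ESS: E(D,D) = 1 > E(C,D) = 0
assert not is_ess(PD, 0)   # C is not:    E(C,C) = 3 < E(D,C) = 5
```

Here the ESS check reduces to condition (1) alone, because D is a strict best reply to itself in the Prisoner's Dilemma.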
Condition (1) or (2) ensures that J cannot spread through the population by natural selection. In addition, differential and difference equations called replicator dynamics have been developed to model the evolution of a population under competitive selection pressures. If a population contains k genetically distinct types, each associated with a different pure strategy, and if their proportions at time t are x(t) = (x1(t), ..., xk(t)), then the replicator dynamic equation specifies the population change from x(t) to x(t + 1).

Evolutionary game theory turned out to solve several long-standing problems in biology, and it was described by Dawkins as 'one of the most important advances in evolutionary theory since Darwin' [9, p. 90]. In particular, it helped to explain the evolution of cooperation and altruistic behavior: conventional (ritualized) rather than escalated fighting in numerous species, alarm calls by birds, distraction displays by ground-nesting birds, and so on.

Evolutionary game theory is also used to study adaptive learning in games repeated many times. Evolutionary processes in games have been studied analytically and computationally, sometimes by running simulations in which strategies are pitted against one another and transmit copies of themselves to future generations in proportion to their payoffs (see [2, 3, Chapters 1, 2; 23, 24]).

References

[1] Aumann, R.J. (1976). Agreeing to disagree, Annals of Statistics 4, 1236-1239.
[2] Axelrod, R. (1984). The Evolution of Cooperation, Basic Books, New York.
[3] Axelrod, R. (1997). The Complexity of Cooperation: Agent-based Models of Competition and Collaboration, Princeton University Press, Princeton.
[4] Camerer, C.F. (2003). Behavioral Game Theory: Experiments in Strategic Interaction, Princeton University Press, Princeton.
[5] Camerer, C.F., Loewenstein, G. & Rabin, M., eds (2004). Advances in Behavioral Economics, Princeton University Press, Princeton.
[6] Colman, A.M. (1995). Game Theory and its Applications in the Social and Biological Sciences, 2nd Edition, Routledge, London.
[7] Colman, A.M. (2003a). Cooperation, psychological game theory, and limitations of rationality in social interaction, The Behavioral and Brain Sciences 26, 139-153.
[8] Colman, A.M. (2003b). Depth of strategic reasoning in games, Trends in Cognitive Sciences 7, 2-4.
[9] Dawkins, R. (1976). The Selfish Gene, Oxford University Press, Oxford.
[10] Dimand, M.A. & Dimand, R.W. (1996). A History of Game Theory (Vol 1): From the Beginnings to 1945, Routledge, London.
[11] Foddy, M., Smithson, M., Schneider, S. & Hogg, M., eds (1999). Resolving Social Dilemmas: Dynamic, Structural, and Intergroup Aspects, Psychology Press, Hove.
[12] Harsanyi, J.C. (1967-1968). Games with incomplete information played by Bayesian players, Parts I-III, Management Science 14, 159-182, 320-334, 486-502.
[13] Harsanyi, J.C. & Selten, R. (1988). A General Theory of Equilibrium Selection in Games, MIT Press, Cambridge.
[14] Hedden, T. & Zhang, J. (2002). What do you think I think you think? Strategic reasoning in matrix games, Cognition 85, 1-36.
[15] Kagel, J.H. & Roth, A.E., eds (1995). Handbook of Experimental Economics, Princeton University Press, Princeton.
[16] Lewis, D.K. (1969). Convention: A Philosophical Study, Harvard University Press, Cambridge.
[17] Liebrand, W.B.G., Messick, D.M. & Wilke, H.A.M., eds (1992). Social Dilemmas: Theoretical Issues and Research Findings, Pergamon, Oxford.
[18] Luce, R.D. & Raiffa, H. (1957). Games and Decisions: Introduction and Critical Survey, Wiley, New York.
[19] Maynard Smith, J. (1982). Evolution and the Theory of Games, Cambridge University Press, Cambridge.
[20] Maynard Smith, J. & Price, G.R. (1973). The logic of animal conflict, Nature 246, 15-18.
[21] Milgrom, P. (1981). An axiomatic characterization of common knowledge, Econometrica 49, 219-222.
[22] Nash, J.F. (1950). Equilibrium points in n-person games, Proceedings of the National Academy of Sciences of the United States of America 36, 48-49.
[23] Nowak, M.A., May, R.M. & Sigmund, K. (1995). The arithmetics of mutual help, Scientific American 272(6), 76-81.
[24] Nowak, M.A. & Sigmund, K. (1993). A strategy of win-stay, lose-shift that outperforms tit-for-tat in the Prisoner's Dilemma game, Nature 364, 56-58.
[25] Pruitt, D.G. & Kimmel, M.J. (1977). Twenty years of experimental gaming: critique, synthesis, and suggestions for the future, Annual Review of Psychology 28, 363-392.
[26] Rousseau, J.-J. (1755). Discours sur l'origine d'inégalité parmi les hommes [Discourse on the origin of inequality among men], in Oeuvres Complètes, Vol. 3, J.-J. Rousseau, Édition Pléiade, Paris.
[27] Selten, R. (1965). Spieltheoretische Behandlung eines Oligopolmodells mit Nachfrageträgheit [Game-theoretic treatment of an oligopoly model with demand inertia], Zeitschrift für die Gesamte Staatswissenschaft 121, 301-324, 667-689.
[28] Selten, R. (1975). Re-examination of the perfectness concept for equilibrium points in extensive games, International Journal of Game Theory 4, 25-55.
[29] Stahl, D.O. & Wilson, P.W. (1995). On players' models of other players: theory and experimental evidence, Games and Economic Behavior 10, 218-254.
[30] Tucker, A. (1950/2001). A two-person dilemma, in Readings in Games and Information, E. Rasmusen, ed., Blackwell, Oxford, pp. 7-8 (Reprinted from a handout distributed at Stanford University in May 1950).
[31] von Neumann, J. & Morgenstern, O. (1944). Theory of Games and Economic Behavior, 2nd Edition, 1947; 3rd Edition, 1953, Princeton University Press, Princeton.

ANDREW M. COLMAN
Gauss, Johann Carl Friedrich
RANDALL D. WIGHT AND PHILIP A. GABLE
Volume 2, pp. 694-696
took more time. By the time the use of least squares flowered in the social sciences, Galton, Pearson, and Yule had uniquely transformed the procedure into the techniques of regression (see Multiple Linear Regression) and analysis of variance [3, 4].

In addition to the Gauss-Laplace synthesis, Gauss's more general contributions include the fundamental theorems of arithmetic and algebra and development of the algebra of congruence. He published important work on actuarial science, celestial mechanics, differential geometry, geodesy, magnetism, number theory, and optics. He invented a heliotrope, magnetometer, photometer, and telegraph. Sub rosa, he was among the first to investigate non-Euclidean geometry and, in 1851, approved Riemann's doctoral thesis. Indeed a "titan of science" [1], Gauss was extraordinarily productive throughout his life, although his personal life was not without turmoil. After developing heart disease, Gauss died in his sleep in late February, 1855.

References

[1] Dunnington, G.W. (1955). Carl Friedrich Gauss, Titan of Science: A Study of his Life and Work, Exposition Press, New York.
[2] Gauss, C.F. (1809/1857/2004). Theoria Motus Corporum Coelestium in Sectionibus Conicis Solem Ambientium [Theory of the motion of the heavenly bodies moving about the sun in conic sections], Translated by C.H. Davis, ed., Dover Publications, Mineola (Original work published 1809; original translation published 1857).
[3] Stigler, S.M. (1986). The History of Statistics: The Measurement of Uncertainty Before 1900, Harvard University Press, Cambridge.
[4] Stigler, S.M. (1999). Statistics on the Table: The History of Statistical Concepts and Methods, Harvard University Press, Cambridge.
[5] Youden, W.J. (1994). Experimentation and Measurement, U.S. Department of Commerce, Washington.

RANDALL D. WIGHT AND PHILIP A. GABLE
Gene-Environment Correlation
DANIELLE M. DICK
Volume 2, pp. 696-698
individuals' genotype influenced the group of individuals they selected as peers. Peer environments are known to then play a significant role in adolescent outcomes [1].

A number of other findings exist supporting genetic influence on environmental measures. Substantial genetic influence has been reported for adolescents' reports of family warmth [10-12]. Genes have been found to influence the degree of anger and hostility that children receive from their parents [11]. They influence the experience of life events [2, 3], and exposure to trauma [4]. In fact, genetic influence has been found for nearly all of the most widely used measures of the environment [8]. Perhaps even more convincing is that these environmental measures include not only reports by children, parents, and teachers, but also observations by independent observers [11].

Thus, many sources of behavioral influence that we might consider environmental are actually under a degree of genetic influence. An individual's family environment is correlated with their genotype when they are reared among biological relatives. Furthermore, genes influence an individual's temperament and personality, which impacts both the way that other people react to the individual and the environments that person seeks out and experiences. Thus, environmental experiences are not always random, but can be influenced by a person's genetic predispositions. It is important to note that in standard twin designs, the effects of gene-environment correlation are included in the genetic component. For example, if genetic influences enhance the likelihood that delinquent youth seek out other delinquents for their peers, and socialization with these peers further contributes to the development of externalizing behavior, that effect could be subsumed in the genetic component of the model, because genetic effects led to the risky environment, which then influenced behavioral development. Thus, genetic estimates may represent upper bound estimates of direct genetic effects on the disorders because they also include gene-environment correlation effects.

References

[2] Kendler, K.S. (1995). Adversity, stress and psychopathology: a psychiatric genetic perspective, International Journal of Methods in Psychiatric Research 5, 163-170.
[3] Kendler, K.S., Neale, M., Kessler, R., Heath, A. & Eaves, L. (1993). A twin study of recent life events and difficulties, Archives of General Psychiatry 50, 789-796.
[4] Lyons, M.J., Goldberg, J., Eisen, S.A., True, W., Tsuang, M.T., Meyer, J.M. & Hendersen, W.G. (1993). Do genes influence exposure to trauma? A twin study of combat, American Journal of Medical Genetics 48, 22-27.
[5] Manke, B., McGuire, S., Reiss, D., Hetherington, E.M. & Plomin, R. (1995). Genetic contributions to adolescents' extrafamilial social interactions: teachers, best friends, and peers, Social Development 4, 238-256.
[6] McGue, M., Sharma, A. & Benson, P. (1996). The effect of common rearing on adolescent adjustment: evidence from a U.S. adoption cohort, Developmental Psychology 32(6), 604-613.
[7] O'Connor, T.G., Deater-Deckard, K., Fulker, D., Rutter, M. & Plomin, R. (1998). Genotype-environment correlations in late childhood and early adolescence: antisocial behavioral problems and coercive parenting, Developmental Psychology 34, 970-981.
[8] Plomin, R. & Bergeman, C.S. (1991). The nature of nurture: genetic influence on environmental measures, Behavioral and Brain Sciences 14, 373-427.
[9] Plomin, R., DeFries, J.C. & Loehlin, J.C. (1977). Genotype-environment interaction and correlation in the analysis of human behavior, Psychological Bulletin 84, 309-322.
[10] Plomin, R., McClearn, G.E., Pedersen, N.L., Nesselroade, J.R. & Bergeman, C.S. (1989). Genetic influences on adults' ratings of their current family environment, Journal of Marriage and the Family 51, 791-803.
[11] Reiss, D. (1995). Genetic influence on family systems: implications for development, Journal of Marriage and the Family 57, 543-560.
[12] Rowe, D. (1981). Environmental and genetic influences on dimensions of perceived parenting: a twin study, Developmental Psychology 17, 209-214.
[13] Rutter, M.L. (1997). Nature-nurture integration: the example of antisocial behavior, American Psychologist 52, 390-398.
[14] Scarr, S. & McCartney, K. (1983). How people make their own environments: a theory of genotype-environment effects, Child Development 54, 424-435.

DANIELLE M. DICK
Gene-Environment Interaction

Gene x environment interaction, or GxE, as it is commonly called in behavioral genetic literature, is the acknowledgement that genetic and environmental influences on behaviors and traits do not act additively and independently, but may instead be dependent upon one another. One of the working hypotheses among researchers is the diathesis-stress model [7], whereby genetic variants confer an underlying diathesis, or vulnerability, to a certain behavior or trait. These genetic vulnerabilities may only impact upon development when accompanied by a specific environmental stressor. Other hypotheses for GxE interactions include the protective effect of genetic variants on environmental risk, and genetic sensitivity to environmental exposure. There are three main strategies for assessing GxE interactions in behavioral genetic research, primarily through the use of adoption studies, twin studies, and studies of genotyped individuals. Each method has its relative strengths and weaknesses.

[Figure 1: bar chart of the average number of antisocial behaviors for the four combinations of genetic factor (absent/present) and environmental factor (absent/present).]

Figure 1 Least-squares means (SE) for simple genetic, environmental, and interaction effects (Iowa 1980 data; N = 367). (Figure reproduced from Kluwer Academic Publishers' Behavior Genetics, 13, 1983, p. 308, Evidence for Gene-environment Interaction in the Development of Adolescent Antisocial Behavior, R.J. Cadoret, C.A. Cain, and R.R. Crowe, Figure 1, copyright 1983, Plenum Publishing Corporation, with kind permission of Springer Science and Business Media)
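The pattern in Figure 1 is what a regression interaction term captures: the effect of environmental risk differs depending on whether genetic risk is present. A minimal sketch of the "difference of differences" contrast, using hypothetical cell means chosen only to mimic the shape of Figure 1 (not the actual Iowa 1980 values):

```python
# Hypothetical cell means for the average number of antisocial behaviors in a
# 2 x 2 (genetic risk x environmental risk) design; values are illustrative only.
mean = {
    ("absent", "absent"): 1.0,
    ("absent", "present"): 1.2,   # environmental risk alone: little change
    ("present", "absent"): 1.1,   # genetic risk alone: little change
    ("present", "present"): 3.0,  # both risks together: large increase
}

# The interaction contrast is a difference of differences: the environmental
# effect among the genetically at-risk minus that effect among the not-at-risk.
env_effect_with_genetic_risk = mean[("present", "present")] - mean[("present", "absent")]
env_effect_without = mean[("absent", "present")] - mean[("absent", "absent")]
interaction = env_effect_with_genetic_risk - env_effect_without

# A contrast far from zero means the two risks do not act additively.
assert abs(interaction - 1.7) < 1e-9
```

If the risks acted purely additively, the two environmental effects would be equal and the contrast would be zero; the large positive value reproduces the interaction pattern described in the text.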
The Adoption Study Method
Some of the earliest examples of gene x environment interaction appear in the adoption literature in the early 1980s. Theoretically, adoption studies are ideal methods for assessing GxE, as they allow for a clean separation of genetic influence (via the biological parent characteristics) from salient environmental characteristics (defined by adoptive parent characteristics). Figure 1 shows an example of gene x environment interaction using the adoption design for the development of adolescent antisocial behavior [1]. In this study, genetic risk was defined as the presence of alcoholism or antisocial behavior in the biological parent, and environmental risk was defined as being raised in an adoptive family with significant psychopathology in adoptive siblings or parents, and/or the presence of adoptive parent marital separation or divorce. Standard multivariate regression analyses (see Multivariate Multiple Regression) were performed assessing the independent effects of genetic and environmental risk, as well as the interaction term between the two variables. Figure 1 illustrates that neither the presence of genetic risk nor the presence of environmental risk was sufficient, in and of itself, to cause an increase in the average number of antisocial behaviors in adolescent adoptees, compared with adoptees with neither genetic nor environmental risk. In contrast, the presence of both genetic and environmental risk factors was related to a higher mean number of antisocial behaviors, compared to all other groups.

Adoption studies have the advantage over other methods that use samples of related individuals of being able to more cleanly separate genetic risk from environmental risk, as adoptees typically have limited or no contact with their biological parents. Thus, in theory, genetic risk in the biological parent is unlikely to be correlated with environmental risk in the adoptive home environment via passive gene-environment
correlation. However, as shown in more recent studies, results from adoption studies can still potentially be confounded by evocative gene-environment correlation. For example, Ge et al. [4] reported that hostile and aggressive parenting from the adoptive parent was, in fact, correlated with psychopathology in the biological parent. This relationship was largely mediated through the adoptee's own hostile and aggressive behavior, demonstrating that gene-environment correlation can occur when adoptive parents respond to the behavior of their adopted child (which is, in turn, partly influenced by genetic factors). Thus, genetically influenced behaviors of the adoptee can evoke a gene-environment correlation. Other limitations to the adoption study method have typically included: (1) the fact that adoptive parents are screened prior to placement, indicating that the range of environmental factors within adoptive samples is likely to be truncated, and that severe environmental deprivation therefore is unlikely; and (2) the limited generalizability of results from adoptive samples to the general population (see Adoption Studies).

The Twin Study Method

Twin studies typically estimate the proportion of variation in a given behavior or trait that is due to latent genetic and environmental factors (see ACE Model). In GxE studies using twin samples, the central question is generally whether genetic variation on behavior or traits changes across some level of a measured environmental variable. Methods to assess GxE interaction in twin studies include extensions of the DeFries-Fulker regression model (see DeFries-Fulker Analysis), testing whether heritabilities (see Heritability) are the same or different among individuals in two different groups (e.g., the finding that the heritability of alcohol use is higher among unmarried vs. married women [5]), or through the inclusion of a measured environmental variable as a continuous moderator of genetic and environmental influences in the standard ACE model. Examples of replicated GxE interactions using twin data include the finding that the heritability of cognitive ability is greater among adolescents from more advantaged socioeconomic backgrounds [9, 11]. In both of these studies, the absolute magnitude of shared environmental influences on variation was stronger among adolescents from poorer and/or less educated homes. Conversely, the absolute magnitude of genetic variation was greater among adolescents from higher socioeconomic status families. Both of these factors contributed to the finding that the heritability of cognitive ability (which is defined as the proportion of phenotypic variance attributed to genetic factors) was higher among adolescents in more educated homes.

Advantages to the twin method include the fact that these studies call into question the assumption that heritability is a constant, and can identify salient aspects of the environment that may either promote or inhibit genetic effects. In addition, there are many large twin studies in existence, which make replication of potential GxE interactions possible. The primary disadvantage to this method is that genetic factors are defined as latent variables. Thus, these studies cannot identify the specific genetic variants that may confer greater or lesser risk at different levels of the environment.

Studies of Genotyped Individuals

Arguably, the gold standard for assessing GxE interaction is studies that investigate whether a specific genetic variant interacts with a measured environmental characteristic. One of the first examples of these studies is the finding that a polymorphism in the monoamine oxidase A (MAOA) gene interacts with child maltreatment to influence mean levels of antisocial behavior [2]. Figure 2 shows the relevant results from this study for four measures of antisocial behavior. As can be seen in this figure, maltreated children with the genetic variant of the MAOA gene that confers high levels of MAOA activity showed mean levels of antisocial behavior that were not significantly different from mean levels of antisocial behavior among non-maltreated children, indicating that this genetic variant had a protective influence against the effects of child maltreatment. Interestingly, although there was a main effect of child maltreatment in these analyses, there was no main effect of the MAOA genotype, indicating that genetic variants that confer low levels of MAOA activity are not, in and of themselves, a risk factor for antisocial behavior.

Advantages to this method are many. Analysis is relatively straightforward, requiring simply the use of multivariate regression techniques. Studies can be done using any sample of genotyped
[Figure 2: four bar-chart panels, (a)-(d), comparing antisocial-behavior outcomes for low- and high-MAOA-activity groups (group sizes n = 108, 42, 13 and 180, 79, 20) under no maltreatment, probable maltreatment, and severe maltreatment.]
Figure 2 The association between childhood maltreatment and subsequent antisocial behavior as a function of MAOA activity. (a) Percentage of males (and standard errors) meeting diagnostic criteria for Conduct Disorder between ages 10 and 18. In a hierarchical logistic regression model, the interaction between maltreatment and MAOA activity was in the predicted direction, b = 0.63, SE = 0.33, z = 1.87, P = 0.06. Probing the interaction within each genotype group showed that the effect of maltreatment was highly significant in the low-MAOA activity group (b = 0.96, SE = 0.27, z = 3.55, P < 0.001), and marginally significant in the high-MAOA group (b = 0.34, SE = 0.20, z = 1.72, P = 0.09). (b) Percentage of males convicted of a violent crime by age 26. The G x E interaction was in the predicted direction, b = 0.83, SE = 0.42, z = 1.95, P = 0.05. Probing the interaction, the effect of maltreatment was significant in the low-MAOA activity group (b = 1.20, SE = 0.33, z = 3.65, P < 0.001), but was not significant in the high-MAOA group (b = 0.37, SE = 0.27, z = 1.38, P = 0.17). (c) Mean z scores (M = 0, SD = 1) on the Disposition Toward Violence Scale at age 26. In a hierarchical ordinary least squares (OLS) regression model, the G x E interaction was in the predicted direction (b = 0.24, SE = 0.15, t = 1.62, P = 0.10); the effect of maltreatment was significant in the low-MAOA activity group (b = 0.35, SE = 0.11, t = 3.09, P = 0.002) but not in the high-MAOA group (b = 0.12, SE = 0.07, t = 1.34, P = 0.17). (d) Mean z scores (M = 0, SD = 1) on the Antisocial Personality Disorder symptom scale at age 26. The G x E interaction was in the predicted direction (b = 0.31, SE = 0.15, t = 2.02, P = 0.04); the effect of maltreatment was significant in the low-MAOA activity group (b = 0.45, SE = 0.12, t = 3.83, P < 0.001) but not in the high-MAOA group (b = 0.14, SE = 0.09, t = 1.57, P = 0.12). (Reprinted with permission from Caspi et al. (2002). Role of Genotype in the Cycle of Violence in Maltreated Children. Science, 297, 851-854. Copyright 2002 AAAS)
individuals; no special adoptive or family samples are required. Because these studies rely on measured genotypes, they can pinpoint more precisely the genetic variants and the potential underlying biological mechanisms that confer risk or protective effects across different environments. On the other hand, the effects of any one individual gene (both additively and/or interactively)
a representative sample, but if presented with a sample, one could check the extent to which this sample represents the target population (assuming that characteristics of this target population are also known). Having gone through the step of assessing the extent to which the sample is representative of the target population, would one then care if this sample were obtained at random? Certainly, other sampling schemes, such as convenience sampling, may create a sample that appears to represent the target population well, at least with respect to the dimensions of the sample that can be checked. For example, it may be feasible to examine the gender, age, and annual salary of the study subjects for representativeness, but possibly not their political beliefs. It is conceivable that unmeasured factors contribute to results of survey questions, and ignoring them may lead to unexpected errors. From this perspective, then, randomization does confer added benefits beyond those readily checked and classified under the general heading of representativeness.

Of course, one issue that remains, even with a sample that has been obtained randomly, is a variant of the Heisenberg uncertainty principle [1]. Specifically, being in the study may alter the subjects in ways that cannot be measured, and the sample differs from the population at large with respect to a variable that may assume some importance. That is, if X is a variable that takes the value 1 for subjects in the sample, and the value 0 for subjects not in the sample, then the sample differs from the target population maximally with respect to X. Of course, prior to taking the sample, each subject in the target population had a value of X = 0, but for those subjects in the sample, the value of X was changed to 1, in time for X to exert whatever influence it may have on the primary outcomes of the study. This fact has implications for anyone not included in a survey.

If, for example, a given population (say male smokers over the age of 50 years) is said to have a certain level of risk regarding a given disease (say lung cancer), then what does this mean to a male smoker who was not included in the studies on which this finding was based? Hypothetically, suppose that this risk is 25%. Does this then mean that each male smoker over 50, whether in the sample or not, has a one in four chance of contracting lung cancer? Or does it mean that there is some unmeasured variable, possibly a genetic mutation, which we will call a predisposition towards lung cancer (for lack of a better term), which a quarter of the male smokers over 50 happens to have? Presumably, there is no recognizable subset of this population, male smokers over 50, which would allow for greater separation (as in those who exercise a certain amount have 15% risk while those who do not have 35% risk).

Suppose, further, that one study finds a 20% risk in males and a 35% risk in smokers, but that no study had been done that cross-classified by both gender and smoking status. In such a case, what would the risk be for a male smoker? The most relevant study for any given individual is a study performed in that individual, but the resources are not generally spent towards such studies of size one. Even if they were, the sample size would still be far too small to study all variables that would be of interest, and so there is a trade-off between the specificity of a study (for a given individual or segment of the population) and the information content of a study.

In contrast to surveys, association studies require not only that the sample be representative of the study population but also that it be homogeneous. As mentioned previously, the use of run-ins, prior to randomization, to filter out poor responders to an active treatment creates a distortion that may result in a spurious association [2]. That is, there may well be an association, among this highly select group of randomized subjects who are already known to respond well to the active treatment, between treatment received and outcome, but this association may not reflect the reality of the situation in the population at large. But even if the sample is representative of the target population, there is still a risk of spurious association that arises from pooling heterogeneous segments of the population together. Suppose, for example, that one group tends to be older and to smoke more than another group, but that within either group there is no association between smoking status and age. Ignoring the group, and studying only age and smoking status, would lead to the mistaken impression that these two variables are positively associated. This is the ecological fallacy [5].

When trying to generalize associations in behavioral sciences, one needs to consider different characteristics of exposure, effect modifiers, confounders, and outcome. Duration, dose, route, and age at exposure may all be important. In general, extrapolating the results obtained from a certain range of exposure to values outside that range may be very misleading. While short-term low-dose stress may be
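The spurious association that arises from pooling heterogeneous groups can be illustrated numerically. The sketch below is not from this entry; the group means and spreads are invented for illustration, and age and smoking amount are generated independently within each group:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two hypothetical groups (values invented for illustration):
# group 1 is younger and smokes less, group 2 is older and smokes more.
# Within each group, age and smoking amount are drawn independently,
# so the within-group association is zero by construction.
age_1 = rng.normal(35, 5, 5000)
smoke_1 = rng.normal(5, 2, 5000)     # cigarettes per day
age_2 = rng.normal(60, 5, 5000)
smoke_2 = rng.normal(15, 2, 5000)

within_1 = np.corrcoef(age_1, smoke_1)[0, 1]
within_2 = np.corrcoef(age_2, smoke_2)[0, 1]

# Pooling the heterogeneous groups induces a strong positive correlation.
pooled = np.corrcoef(np.concatenate([age_1, age_2]),
                     np.concatenate([smoke_1, smoke_2]))[0, 1]

print(f"within-group r: {within_1:.2f}, {within_2:.2f}; pooled r: {pooled:.2f}")
```

The within-group correlations hover near zero while the pooled correlation is strongly positive, which is exactly the mistaken impression described above.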
stimulating, very high levels of stress may inhibit productivity. Effect modifiers may vary among different populations. Single parenthood may be a stigma in some societies, and, therefore, may lead to behavioral abnormalities in the children. However, societies that show high support for single parent families may modify and lower such detrimental effects. Differences in the distribution of confounding factors result in failure of replication.

For example, higher education may be associated with higher income in some societies, but not in others. A clear definition of exposure and outcome is necessary, and these definitions should be maintained when generalizing the results. Sufficient variability in both exposure and outcome is also important. Family income may not be a predictor of future educational success when studied in a select group of families that all have an annual income between $80,000 and $100,000, but it may be a strong predictor in a wider range of families.

Despite the fact that the term generalizability is frequently used, and the rules mentioned above are commonly taught in methodology classes, the meaning of generalizability is often not clear, and these rules give us only a vague idea about how and when we are allowed to generalize information. One reason for such vagueness is that generalizability is a continuum rather than a dichotomous phenomenon, and the degree of acceptable similarity is not well defined. For example, suppose that in comparing treatments A and B for controlling high blood pressure, treatment A is more effective by 20 mmHg on average in one population, but only 10 mmHg more effective on average in another population. Although treatment A is better than treatment B in both populations, the magnitude of the blood pressure reduction is different. Are the results obtained from one population generalizable to the other? There is no clear answer. Despite centuries of thinking and examination, the process of synthesis of knowledge from individual observations is not well understood [3, 4, 6]. This process is neither mechanical nor statistical; that is, the process requires abstraction [7].

References

[1] Berger, V.W. (2000). Pros and cons of permutation tests in clinical trials, Statistics in Medicine 19, 1319–1328.
[2] Berger, V.W., Rezvani, A. & Makarewicz, V.A. (2003). Direct effect on validity of response run-in selection in clinical trials, Controlled Clinical Trials 24, 156–166.
[3] Chalmers, A.F. (1999). What is this Thing Called Science? 3rd Edition, Hackett Publishing Company, Indianapolis.
[4] Feyerabend, P. (1993). Against Method, 3rd Edition, Verso Books, New York.
[5] Piantadosi, S., Byar, D. & Green, S. (1988). The ecological fallacy, American Journal of Epidemiology 127(5), 893–904.
[6] Rothman, K.J. & Greenland, S. (1998). Modern Epidemiology, 2nd Edition, Lippincott–Raven, Philadelphia.
[7] Szklo, M. (1998). Population-based cohort studies, Epidemiologic Reviews 20, 81–90.

VANCE W. BERGER
Generalizability Theory: Basics
JOHN R. BOULET
Volume 2, pp. 704–711
tasks (nt). This is a two-facet design and is denoted p × t × r (person by task by rater). Where each level of one facet (rater) is observed in combination with each level of the other (task), the result is a crossed design. If levels of one facet are observed in combination with specific level(s) of another, the design is said to be nested. For example, variance components can be estimated for people, tasks, raters, and the associated interactions. These components are simply estimates of differences in scores attributable to a given facet or interaction of sources.

The purpose of a G study is to obtain estimates of variance components associated with the universe of admissible observations. These estimates can be used in D (decision) studies to design efficient measurement procedures. For D studies, the researcher must specify a universe of generalization. This could contain all facets in the universe of admissible observations (e.g., p × T × R; for D study designs, facets are typically denoted by capital letters) or be otherwise restricted. For the scenario above, one may want to generalize persons' scores based on the specific tasks and raters used in the G study to persons' scores for a universe of generalization that involves many other tasks and raters. The sample sizes in the D study (n′t, n′r) need not be the same as the sample sizes in the G study (nt, nr). Also, the focus of the D study is on mean scores for persons rather than single person by task by rater observations. If a person's score is based on his or her mean score over n′t n′r observations, the researcher can explore, through various D studies, the specific conditions that can make the measurement process more efficient.

It is conceivable to obtain a person's mean score for every instance of the measurement procedure (e.g., tasks, raters) in the universe of generalization. The expected value of these mean scores in the stated universe is the person's universe score. The variance of universe scores over all persons in the population is called the universe score variance. More simply, it is the sum of all variance components that contribute to differences in observed scores. Universe score variance is conceptually similar to true score variance (T) in classical test theory.

As mentioned previously, once the G study variance components are estimated, various D studies can be completed to determine the optimal conditions for measurement. Unlike CTT, where observed score variance can only be partitioned into two parts (σ²(true) and σ²(error)), generalizability theory affords the opportunity to further partition error variance. More important, since some error sources are only critical with respect to relative decisions (e.g., rank ordering people based on scores), and others can influence absolute decisions (e.g., determining mastery based on defined standards or cutoffs), it is essential that they can be identified and disentangled. Once this is accomplished, both error main effects and error interaction effects can be studied. For example, in figure-skating, multiple raters are typically used to judge the performance of skaters across multiple programs (short, long). Measurement error can be introduced as a function of the choice of rater, the type of program (task), and, most important, the interaction between the two. For this situation, if we accept that any person in the population can participate in any program in the universe and can be evaluated by any rater in the universe, the observable score for a single program evaluated by a single rater can be represented:

Xptr = μ + νp + νt + νr + νpt + νpr + νtr + νptr.    (1)

For this design, μ is the grand mean in the population and universe and ν specifies any one of the seven components. From this, the total observed score variance can be decomposed into seven independent variance components:

σ²(Xptr) = σ²(p) + σ²(t) + σ²(r) + σ²(pt) + σ²(pr) + σ²(tr) + σ²(ptr).    (2)

The variance components depicted above are for single person by programs (tasks) by rater combinations. From a CTT perspective, one could collapse scores over the raters and estimate the consistency of person scores between the long and short programs. Likewise, one could collapse scores over the two programs and estimate error attributable to the raters. While these analyses could prove useful, only generalizability theory evaluates the interaction effects that introduce additional sources of measurement error.

In generalizability theory, reliability-like coefficients can be computed both for situations where scores are to be used for relative decisions and for conditions where absolute decisions are warranted. For both cases, relevant measurement error variances (determined by the type of decision, relative or absolute) are pooled. The systematic variance (universe score variance) is then divided by the sum of the
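In practice, the seven variance components in (2) are estimated by equating observed mean squares to their expectations. The function below is a minimal sketch of that ANOVA approach for a fully crossed p × t × r random-effects design with one observation per cell; it is not code from this entry, and the function name and interface are invented:

```python
import numpy as np

def g_study_ptr(X):
    """Estimate the seven variance components of a crossed p x t x r
    random-effects G study (one observation per cell) by solving the
    standard expected-mean-square equations."""
    n_p, n_t, n_r = X.shape
    grand = X.mean()
    # Marginal means for each facet and each two-way combination.
    m_p = X.mean(axis=(1, 2))
    m_t = X.mean(axis=(0, 2))
    m_r = X.mean(axis=(0, 1))
    m_pt = X.mean(axis=2)
    m_pr = X.mean(axis=1)
    m_tr = X.mean(axis=0)
    # Mean squares = sums of squares / degrees of freedom.
    ms_p = n_t * n_r * ((m_p - grand) ** 2).sum() / (n_p - 1)
    ms_t = n_p * n_r * ((m_t - grand) ** 2).sum() / (n_t - 1)
    ms_r = n_p * n_t * ((m_r - grand) ** 2).sum() / (n_r - 1)
    ms_pt = n_r * ((m_pt - m_p[:, None] - m_t[None, :] + grand) ** 2).sum() \
        / ((n_p - 1) * (n_t - 1))
    ms_pr = n_t * ((m_pr - m_p[:, None] - m_r[None, :] + grand) ** 2).sum() \
        / ((n_p - 1) * (n_r - 1))
    ms_tr = n_p * ((m_tr - m_t[:, None] - m_r[None, :] + grand) ** 2).sum() \
        / ((n_t - 1) * (n_r - 1))
    resid = (X - m_pt[:, :, None] - m_pr[:, None, :] - m_tr[None, :, :]
             + m_p[:, None, None] + m_t[None, :, None] + m_r[None, None, :]
             - grand)
    ms_ptr = (resid ** 2).sum() / ((n_p - 1) * (n_t - 1) * (n_r - 1))
    # Solve the expected-mean-square equations for the variance components.
    return {'ptr': ms_ptr,
            'pt': (ms_pt - ms_ptr) / n_r,
            'pr': (ms_pr - ms_ptr) / n_t,
            'tr': (ms_tr - ms_ptr) / n_p,
            'p': (ms_p - ms_pt - ms_pr + ms_ptr) / (n_t * n_r),
            't': (ms_t - ms_pt - ms_tr + ms_ptr) / (n_p * n_r),
            'r': (ms_r - ms_pr - ms_tr + ms_ptr) / (n_p * n_t)}

# Quick check: with a pure person effect, only the p component is nonzero.
a = np.array([0.0, 1.0, 2.0, 3.0])
X = np.broadcast_to(a[:, None, None], (4, 3, 2)).copy()
v = g_study_ptr(X)
print(v['p'])  # equals the sample variance of a, i.e. 5/3
```

Dedicated software (e.g., GENOVA) or the variance-components routines of general statistical packages perform the same estimation for a wide range of designs.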
systematic and the measurement error variance to estimate reliability. When relative decisions are being considered, only measurement error variances that could affect the rank orderings of the scores are important. For this use of scores, the ratio of systematic variance to the total variance, known as a generalizability coefficient (Eρ²), is the reliability estimate. This is simply a quantification of how well persons' observed scores correspond to the universe scores. When absolute decisions are considered (i.e., where scores are interpreted in relation to a standard or cutoff), all the measurement error variances can impact the reliability of the scores. Here, the reliability coefficient, Phi (Φ), is also calculated as the ratio of systematic variance to total variance. If all the measurement error variances that are uniquely associated with absolute decisions are zero, then the generalizability and Phi coefficients will be equal.

The defining treatment of generalizability theory is provided by Cronbach et al. [10]. Brennan provides a history of the theory [2]. For the interested reader, there are numerous books and articles, both technical and nontechnical, covering all aspects of the theory [17].

Purpose

The purpose of this entry is to familiarize the reader with the basic concepts of generalizability theory. For the most part, the treatment is nontechnical and concentrates on the utility of the theory and associated methodology for handling an assortment of measurement problems. In addition, only univariate models are considered. For more information on specific estimation procedures, multivariate specifications, confidence intervals for estimated variance components, and so on, the reader should consult Brennan [3]. For this entry, the basic concepts of generalizability theory are illustrated through the analysis of assessment data taken from an examination developed to evaluate the critical-care skills of physicians [1].

Measurement Example

…care skills of physicians training in anesthesiology. The assessment utilized a sensorized, life-size electromechanical patient mannequin that featured, amongst other things, breath sounds, heart sounds, and pulses. Computer-driven physiologic and pharmacologic models determine cardiac and respiratory responses, and are used to simulate acute medical conditions. The simulator offers simple as well as advanced programming actions to create and then save a unique scenario for repeated evaluation of performances. A variety of additional features (e.g., heart rate, lung compliance, vascular resistance) can be manipulated independently to create a unique, but reproducible event that effectively tests the skill level of the medical provider. Six acute care scenarios (cases) were developed. Each simulated scenario was constructed to model a medical care situation that required a rapid diagnosis and acute intervention in a brief period of time.

Twenty-eight trainees were recruited and evaluated. Each of the 28 participants was assessed in each of the six simulated scenarios. Each trainee's performance was videotaped and recorded. Three raters independently observed and scored each of the performances from the videotaped recordings. A global score, based on the time to diagnosis and treatment as well as potentially egregious or unnecessary diagnostic or therapeutic actions, was obtained. The raters were instructed to make a mark on a 10-cm horizontal line based on their assessment of the trainee's performance. The global rating system was anchored by the lowest value 0 (unsatisfactory) and the highest value 10 (outstanding).

Analysis. From a generalizability standpoint, the G study described above was fully crossed (p × t × r). All of the raters (nr = 3) provided a score for each of the six (nt = 6) scenarios (referred to as tasks) across all 28 trainees (objects of measurement). The person by rater by task design can be used to investigate the sources of measurement error in the simulation scores. Here, it was expected that the principal source of variance in scores would be associated with differences in individual residents' abilities, not choice of task or choice of rater.
Table 1 Estimated variance components, standard errors of measurement, and generalizability and dependability coefficients for simulation scores (G and D studies)

                G study                  D studies (mean score variance components)
Component       Estimate(a)   Component      n′t = 6, n′r = 3   n′t = 6, n′r = 2   n′t = 8, n′r = 2
Person σ²(p)    1.28          Person σ²(p)   1.28               1.28               1.28
Task σ²(t)      0.51          Task σ²(T)     0.09               0.09               0.06
Rater σ²(r)     0.24          Rater σ²(R)    0.08               0.12               0.12
σ²(pt)          2.09          σ²(pT)         0.35               0.35               0.26
σ²(pr)          0.30          σ²(pR)         0.10               0.15               0.15
σ²(tr)          0.11          σ²(TR)         0.01               0.01               0.01
σ²(ptr)         1.07          σ²(pTR)        0.06               0.09               0.07
                              σ²(Δ)          0.69               0.81               0.67
                              σ(Δ)           0.83               0.90               0.82
                              σ²(δ)          0.51               0.59               0.48
                              σ(δ)           0.71               0.77               0.69
                              Φ              0.65               0.61               0.66
                              Eρ²            0.72               0.68               0.73

(a) Estimates are for single person by task by rater combinations.
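The D-study columns of Table 1 can be reproduced from the G-study column by the scaling rules described in this entry: each component is divided by the number of conditions of the facets it contains, and the scaled components are then pooled into error variances and coefficients. A minimal sketch (the function name is invented); the results agree with Table 1 up to the rounding used there:

```python
# G-study estimates from Table 1 (single person-by-task-by-rater metric).
g = {'p': 1.28, 't': 0.51, 'r': 0.24,
     'pt': 2.09, 'pr': 0.30, 'tr': 0.11, 'ptr': 1.07}

def d_study(g, nt, nr):
    """Scale the G-study components to mean scores over nt tasks and nr
    raters, then pool them into absolute and relative error variances and
    the two reliability-like coefficients."""
    d = {'p': g['p'],
         'T': g['t'] / nt, 'R': g['r'] / nr,
         'pT': g['pt'] / nt, 'pR': g['pr'] / nr,
         'TR': g['tr'] / (nt * nr), 'pTR': g['ptr'] / (nt * nr)}
    # sigma^2(Delta): every component except sigma^2(p).
    abs_err = d['T'] + d['R'] + d['pT'] + d['pR'] + d['TR'] + d['pTR']
    # sigma^2(delta): components containing p, excluding sigma^2(p).
    rel_err = d['pT'] + d['pR'] + d['pTR']
    e_rho2 = d['p'] / (d['p'] + rel_err)  # generalizability coefficient
    phi = d['p'] / (d['p'] + abs_err)     # dependability coefficient
    return abs_err, rel_err, e_rho2, phi

for nt, nr in [(6, 3), (6, 2), (8, 2)]:
    abs_err, rel_err, e_rho2, phi = d_study(g, nt, nr)
    print(f"n't={nt}, n'r={nr}: s2(Delta)={abs_err:.2f}, "
          f"s2(delta)={rel_err:.2f}, E rho^2={e_rho2:.2f}, Phi={phi:.2f}")
```

Rerunning the function with other (nt, nr) pairs is exactly the "what if" exercise a D study performs when weighing more tasks against more raters.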
make this task much less cumbersome [4, 5, 9]. In addition, commonly used statistical programs typically have routines for estimating variance components for a multitude of G study designs. For this example, the SAS PROC VARCOMP routine was used [16].

The estimated variance components for the G study are presented in Table 1. The person (trainee) variance component (σ²(p)) is an estimate of the variance across trainees of trainee-level mean scores. If one could obtain each person's expected score over all tasks and raters in the universe of admissible observations, the variance of these scores would be σ²(p). Ideally, most of the variance should be here, indicating that individual abilities account for differences in observed scores. The other main effect variance components include task (σ²(t)) and rater (σ²(r)). The task component is the estimated variance of scenario mean scores. Since the estimate is greater than zero, we know that the six tasks vary somewhat in average difficulty. Not surprisingly, mean performance, by simulation scenario, ranged from a low of 5.7 to a high of 8.2. The rater component is the variance of the rater mean scores. The nonzero value indicates that raters vary somewhat in terms of average stringency. The mean rater scores, on a scale from 0 to 10, were 7.4, 6.3, and 7.0, respectively. Interestingly, the task variance component is approximately twice as large as the rater component. We can, therefore, conclude that raters differ much less in average stringency than simulation scenarios differ in average difficulty.

The largest interaction variance component was person by task (σ²(pt)). The magnitude of this component suggests that there are considerably different rank orderings of examinee mean scores for each of the various simulation scenarios. The relatively small person by rater component suggests that the various raters rank order persons similarly. Likewise, the small rater by task component indicates that the raters rank order the difficulty of the simulation scenarios similarly. The final variance component is the residual variance that includes the triple-order interactions and all other unexplained sources of variation.

Decision (D) Studies. The G study noted above was used to derive estimates of variance components associated with a universe of admissible observations. Decision (D) studies can use these estimates to design efficient measurement procedures for future operations. To do this, one must specify a universe of generalization. For the simulation assessment, we may want to generalize trainees' scores based on the six tasks and three raters to trainees' scores for a universe of generalization that includes many other tasks and many other raters. In this instance, the universe of generalization is infinite in that we wish to generalize to any set of raters and any set of tasks. Here, consistent with ANOVA terminology, both the rater and task facets are said to be random as opposed to fixed.
For the simulation assessment, we may decide that we want each trainee to be assessed in each of the six encounters (tasks; n′t = 6) and each of the tasks be rated by three independent raters (n′r = 3). Although the sample sizes for the D study are the same as those for the G study, this need not be the case. Unlike the G study, which focused on single trainee by task by rater observations, the D study focuses on mean scores for persons.

The D study variance components can be easily estimated using the G study variance components in Table 1 (see Table 1). The estimated random effects variance components are for person mean scores over n′t = 6 tasks and n′r = 3 raters. The calculation of the variance components for this fully crossed design is relatively straightforward. The estimated universe score variance (σ²(p)) stays the same. To obtain means, variance components that contain t but not r are divided by n′t. Components that contain r but not t are divided by n′r. And components that contain both t and r are divided by n′t n′r.

Since an infinite universe of generalization has been defined, all variance components other than σ²(p) contribute to one or more types of error variance. If the trainee scores are going to be used for mastery decisions (e.g., pass/fail), then all sources of error are important. Here, both simulation scenario difficulty and rater stringency are potential sources of error in estimating true ability. Absolute error is simply the difference between a trainee's observed and universe score. Variance of the absolute errors, σ²(Δ), is the sum of all variance components except σ²(p) (see Table 1). The square root of this value, σ(Δ), is interpretable as the absolute standard error of measurement (SEM). On the basis of the D study described above, σ(Δ) = 0.83. As a result, XpTR ± 1.62 constitutes an approximate 95% confidence interval for trainees' universe scores.

If the purpose of the simulation assessment is simply to rank order the trainees, then some component variances will not contribute to error. For these measurement situations, relative error variance, σ²(δ), similar to error variance in CTT, is central. For the p × T × R D study with n′t = 6 and n′r = 3, relative error variance is the sum of all variance components, excluding σ²(p), that contain p (i.e., σ²(pT), σ²(pR), σ²(pTR)). These are the only sources of variance that can impact the relative ordering of trainee scores. The calculated value (σ²(δ) = 0.51) is necessarily lower than σ²(Δ), in that fewer variance components are considered. The square root of the relative error variance (σ(δ) = 0.71) is interpretable as an estimate of the relative SEM. As would be expected, and borne out by the data, absolute interpretations of a trainee's score are more error-prone than relative ones.

In addition to calculating error variances, two types of reliability-like coefficients are widely used in generalizability theory. The generalizability coefficient (Eρ²), analogous to a reliability coefficient in CTT, is the ratio of universe score variance to itself plus relative error variance:

Eρ² = σ²(p) / [σ²(p) + σ²(δ)].    (3)

For n′t = 6 and n′r = 3, Eρ² = 0.72. An index of dependability (Φ) can also be calculated:

Φ = σ²(p) / [σ²(p) + σ²(Δ)].    (4)

This is the ratio of universe score variance to itself plus absolute error variance. For n′t = 6 and n′r = 3, Φ = 0.65. The dependability coefficient is apropos when absolute decisions about scores are being made. For example, if the simulation scores, in conjunction with a defined standard, were going to be used for licensure or certification decisions, then all potential error sources are important, including those associated with variability in task difficulty and rater stringency.

The p × T × R design with two random facets (tasks, n′t = 6; raters, n′r = 3) was used for illustrative purposes. However, based on the relative magnitudes of the G study variance components, it is clear that the reliability of the simulation scores is generally more dependent on the number of simulation scenarios as opposed to the number of raters. One could easily model a different D study design and calculate mean variance components for n′t = 6 and n′r = 2 (see Table 1). By keeping the same number of simulated encounters, and decreasing the number of raters per case (n′r = 2), the overall generalizability and dependability coefficients are only slightly lower. Increasing the number of tasks (n′t = 8) while decreasing the number of raters per task (n′r = 2) has the effect of lowering both absolute and relative error variance. Ignoring the specific costs associated with developing simulation exercises, testing trainees, and rating performances, increasing the number of tasks, as opposed to raters per given task, would appear to
with a computer-automated scoring system, Journal of Educational Measurement 37(3), 245–262.
[9] Crick, J.E. & Brennan, R.L. (1983). Manual for GENOVA: A Generalized Analysis of Variance System, American College Testing, Iowa City.
[10] Cronbach, L.J., Gleser, G.C., Nanda, H. & Rajaratnam, N. (1972). The Dependability of Behavioral Measurements: Theory of Generalizability for Scores and Profiles, John Wiley, New York.
[11] Fitzpatrick, A.R. & Lee, G. (2003). The effects of a student sampling plan on estimates of the standard errors for student passing rates, Journal of Educational Measurement 40(1), 17–28.
[12] Hartman, B.W., Fuqua, D.R. & Jenkins, S.J. (1988). Multivariate generalizability analysis of three measures of career indecision, Educational and Psychological Measurement 48, 61–68.
[13] Hurtz, N.R. & Hurtz, G.M. (1999). How many raters should be used for establishing cutoff scores with the Angoff method: a generalizability theory study, Educational and Psychological Measurement 59(6), 885–897.
[14] Kane, M. (2002). Inferences about variance components and reliability-generalizability coefficients in the absence of random sampling, Journal of Educational Measurement 39(2), 165–181.
[15] Lord, F.M. & Novick, M.R. (1968). Statistical Theories of Mental Test Scores, Addison-Wesley, Reading.
[16] SAS Institute Inc. (1989). SAS/STAT User's Guide, Version 6, Vol. 2, 4th Edition, SAS Institute Inc., Cary.
[17] Shavelson, R.J. & Webb, N.M. (1991). Generalizability Theory: A Primer, Sage, Newbury Park.

JOHN R. BOULET
Generalizability Theory: Estimation
PIET F. SANDERS
Volume 2, pp. 711–717
many raters should evaluate the responses. A G study of a one-facet crossed random effects design is presented below. A D study of this design is discussed in the next section.

Generalizability Study One-facet Design

The Analysis of Variance table of a crossed one-facet random effects design, a design where a random sample of np persons from a population of persons responds to a random sample of ni items from a universe of items, is presented in Table 1. From Table 1, we can see that we first have to compute the sums of squares in order to estimate the variance components. For that, we substitute the three parameters μ, μp, and μi in (1) with their observed counterparts, which results in the following decomposition:

Xpi = X̄ + (X̄p − X̄) + (X̄i − X̄) + (Xpi − X̄p − X̄i + X̄)
⟹ Xpi − X̄ = (X̄p − X̄) + (X̄i − X̄) + (Xpi − X̄p − X̄i + X̄).    (2)

By squaring and summing the observed deviation scores in (2), four sums of squares are obtained: the total sums of squares and the sums of squares of persons, items, and interactions. The total sums of squares, Σp Σi (Xpi − X̄)², is equal to the sum of the three other sums of squares: Σp Σi (X̄p − X̄)² + Σp Σi (X̄i − X̄)² + Σp Σi (Xpi − X̄p − X̄i + X̄)². The former is also written as SStot = SSp + SSi + SSpi,e. The mean squares can be computed from the sums of squares. Solving the equations of the expected mean squares for the variance components and replacing the observed mean squares by their expected values results in the following estimators for the variance components: σ̂²p = (MSp − MSpi,e)/ni, σ̂²i = (MSi − MSpi,e)/np, and σ̂²pi,e = MSpi,e.

The artificial example in Table 2 was used to obtain the G study results presented in Table 3. Table 2 contains the scores (0 or 1) of four persons on three items, the mean scores of the persons, the mean scores of the items, and the general mean. The mean scores of the persons vary between a perfect mean score of 1 and a mean score of 0. The mean scores of the items range from an easy item of 0.75 to a difficult item of 0.25.
Table 2 Scores of four persons on three items, with person means, item means, and the general mean

Person   Item 1   Item 2   Item 3   X̄p
1        1        1        1        1.000
2        1        1        0        0.667
3        1        0        0        0.333
4        0        0        0        0.000
X̄i       0.75     0.50     0.25     0.500 = X̄
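The estimators just given can be applied directly to the Table 2 scores. A minimal sketch (the variable names are invented), which reproduces the G-study components reported for this example:

```python
import numpy as np

# Score matrix from Table 2: four persons (rows) by three items (columns).
X = np.array([[1, 1, 1],
              [1, 1, 0],
              [1, 0, 0],
              [0, 0, 0]], dtype=float)
n_p, n_i = X.shape
grand = X.mean()
m_p = X.mean(axis=1)   # person means: 1.000, 0.667, 0.333, 0.000
m_i = X.mean(axis=0)   # item means: 0.75, 0.50, 0.25

# Sums of squares for persons, items, and the residual (pi,e).
ss_p = n_i * ((m_p - grand) ** 2).sum()
ss_i = n_p * ((m_i - grand) ** 2).sum()
ss_tot = ((X - grand) ** 2).sum()
ss_pi_e = ss_tot - ss_p - ss_i

# Mean squares and the variance-component estimators from the text.
ms_p = ss_p / (n_p - 1)
ms_i = ss_i / (n_i - 1)
ms_pi_e = ss_pi_e / ((n_p - 1) * (n_i - 1))
var_p = (ms_p - ms_pi_e) / n_i
var_i = (ms_i - ms_pi_e) / n_p
var_pi_e = ms_pi_e

# Prints sigma2(p)=0.139, sigma2(i)=0.028, sigma2(pi,e)=0.139.
print(f"sigma2(p)={var_p:.3f}, sigma2(i)={var_i:.3f}, "
      f"sigma2(pi,e)={var_pi_e:.3f}")
```

These are the single-person, single-item components that the D study below then rescales for tests of more than one item.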
It is therefore common practice to report the size of the component as a percentage of the total variance. Since the items are scored on a 0 to 1 score scale, the variance component cannot be larger than 0.25. The reason for the large universe score variance is the large differences between the mean scores of the four persons. The estimated variance component for the items is relatively small. This can be confirmed by taking the square root of the variance component, resulting in a standard deviation of 0.17, which is approximately one-sixth of the range for items scored on a dichotomous score scale. This value is what we might expect under a normal distribution of the scores.

Decision Study One-facet Design

The model in (1) and its associated variance components relate to scores of single persons on single items from the universe of admissible observations. However, the evaluation of a person's performance is never based on the score obtained on a single item, but on a test with a number of items. What the effect is of increasing the number of items on the variance components was investigated in a D study.

The linear model for the decomposition of the mean score of a person on a test with n′i items, denoted by XpI, is

XpI = μ + (μp − μ) + (μI − μ) + (XpI − μp − μI + μ).    (3)

In (3), the symbol I is used to indicate the mean score on a number of items. In (3), the universe score is defined as μp = EI XpI, the expected value of XpI over random parallel tests. By taking the expectation over I in (3), the universe score variance σ²p does not change; the two other variance components do change and are defined as σ²I = σ²i/n′i and σ²pI,e = σ²pi,e/n′i. The total variance, σ²X = σ²(XpI), is equal to σ²X = σ²p + σ²I + σ²pI,e.

Table 4 contains the variance components from the G study and the D study with three items. The results in Table 4 show how the variance component of the items and the variance component of the interaction or error component change if we increase the number of items. To gauge the effect of using three more items from the universe of admissible observations, we have to divide the appropriate G-study variance components by 6.

The purpose of many tests is to determine the position of a person in relation to other persons. In generalizability theory, the relative position of a person is called the relative universe score and defined as μp − μ. The relative universe score is estimated by XpI − X̄PI, the difference between the mean test score of a person and the mean test score of the sample of persons. The deviation between XpI − X̄PI and μp − μ is called relative error and is defined as δpI = (XpI − X̄PI) − (μp − μ). The estimated relative error variance is equal to σ̂²δ = σ̂²pI,e. (Note that the prime is used to indicate the sample sizes in a D study.) For the crossed one-
Table 4  Results of G study and D study for the example from Table 2

Effects           Variance components G study    Variance components D study
Persons (p)       σ_p² = 0.139                   σ_p² = 0.139
Items (i)         σ_i² = 0.028                   σ_I² = 0.028/3 = 0.009
Residual (pi, e)  σ_pi,e² = 0.139                σ_pI,e² = 0.139/3 = 0.046
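The G-study values in Table 4 can be reproduced from the Table 2 scores with the usual ANOVA (expected mean squares) estimators. The following plain-Python sketch is ours, not the article's; the variable names are illustrative:

```python
# Variance-component estimation for the persons-by-items data of Table 2,
# using the standard one-facet ANOVA (expected mean squares) method.
scores = [[1, 1, 1],   # four persons (rows) by three items (columns)
          [1, 1, 0],
          [1, 0, 0],
          [0, 0, 0]]
n_p, n_i = len(scores), len(scores[0])
grand = sum(map(sum, scores)) / (n_p * n_i)
person_means = [sum(row) / n_i for row in scores]
item_means = [sum(row[j] for row in scores) / n_p for j in range(n_i)]

ss_p = n_i * sum((m - grand) ** 2 for m in person_means)
ss_i = n_p * sum((m - grand) ** 2 for m in item_means)
ss_total = sum((x - grand) ** 2 for row in scores for x in row)
ms_p = ss_p / (n_p - 1)
ms_i = ss_i / (n_i - 1)
ms_res = (ss_total - ss_p - ss_i) / ((n_p - 1) * (n_i - 1))

var_res = ms_res                  # residual component, sigma^2(pi,e) ~ 0.139
var_p = (ms_p - ms_res) / n_i     # universe score component, sigma^2(p) ~ 0.139
var_i = (ms_i - ms_res) / n_p     # item component, sigma^2(i) ~ 0.028
# D-study components for a three-item test (Table 4): divide by n_i' = 3
var_I, var_pI_e = var_i / 3, var_res / 3
```

Rounded to three decimals, these reproduce both columns of Table 4.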
Table 5  The item scores of six persons on four items and two raters; per rater, the mean score per item and per person; the mean score per rater; the mean score per person; and the general mean
and items is much larger than the interaction component between persons and raters. Interaction between persons and items means that persons do not give consistent reactions to different questions, with the result that, depending on the question, the relative position of the persons differs.

Decision Study Two-facet Design

The linear model for the decomposition of the average score of a person on a test with n_i items of which the answers were rated by n_r raters, denoted by X_pIR, is

X_pIR = μ + (μ_p − μ) + (μ_I − μ) + (μ_R − μ)
      + (μ_pI − μ_p − μ_I + μ)
      + (μ_pR − μ_p − μ_R + μ)
      + (μ_IR − μ_I − μ_R + μ)
      + (X_pIR − μ_pI − μ_pR − μ_IR + μ_p + μ_I + μ_R − μ).

The seven variance components associated with this model are σ_p², σ_I² = σ_i²/n_i, σ_R² = σ_r²/n_r, σ_pI² = σ_pi²/n_i, σ_pR² = σ_pr²/n_r, σ_IR² = σ_ir²/(n_i n_r), and σ_pIR,e² = σ_pir,e²/(n_i n_r). The total variance is equal to σ_X² = σ_p² + σ_I² + σ_R² + σ_pI² + σ_pR² + σ_IR² + σ_pIR,e².

The estimate of the generalizability coefficient for relative decisions for the crossed two-facet random effects design is defined as

ρ² = σ_p² / (σ_p² + σ_pi²/n_i + σ_pr²/n_r + σ_pir,e²/(n_i n_r)).

The denominator of this coefficient has three variance components that relate to interactions with persons. Interaction between persons and items means that on certain items a person performs better than other persons, while on certain other items the performance is worse. This inconsistent performance by persons on items contributes to error variance. Interaction between persons and raters
means that a person is awarded different scores by different raters. This inconsistent rating by raters contributes to error variance. The residual variance component is by definition error variance and the interaction component between persons, items, and raters.

For the example in Table 5, with four items and two raters, the generalizability coefficient is equal to 2.16/(2.16 + 0.99/4 + 0.18/2 + 1.96/8) = 0.79. This generalizability coefficient can be improved by increasing the number of observations, that is, the product of the number of items and the number of raters. Having more items, however, will have a much greater effect than more raters because the variance component of the interaction between persons and items is much larger than the variance component of the interaction between persons and raters. This example shows that the Spearman–Brown formula from classical theory does not apply to multifacet designs from generalizability theory. Procedures for selecting the optimal number of conditions in multifacet designs have been presented by Sanders, Theunissen, and Baas [5].

The estimate of the generalizability coefficient for absolute decisions for the crossed two-facet random effects design is defined as

ρ² = σ_p² / (σ_p² + σ_i²/n_i + σ_r²/n_r + σ_pi²/n_i + σ_pr²/n_r + σ_ir²/(n_i n_r) + σ_pir,e²/(n_i n_r)).

For making absolute decisions, it does matter whether we administer a test with difficult items or a test with easy items, or have the answers rated by lenient or strict raters. Therefore, the variance components of the items and the raters, and the variance component of the interaction between items and raters, also contribute to the error variance. For the example in Table 5, the generalizability coefficient for absolute decisions is equal to 2.16/(2.16 + 1.26/4 + 0.0/2 + 0.99/4 + 0.18/2 + 1.55/8 + 1.96/8) = 0.66.

Other Designs

In the previous sections, it was shown that modifying the number of items and/or raters could affect the generalizability coefficient. However, the generalizability coefficient can also be affected by changing the universe to which we want to generalize. We can, for example, change the universe by interpreting a random facet as a fixed facet. If the items in the example with four items and two raters are to be interpreted as a fixed facet, only these four items are admissible. If the facet items is interpreted as a fixed facet, generalization is no longer to the universe of random parallel tests with four items and two raters, but to the universe of random parallel tests with two raters. Interpreting a random effect as a fixed facet means that fewer variance components can be estimated. In a crossed two-facet mixed effects design, the three variance components that can be estimated, expressed in terms of the variance components of the crossed two-facet random effects design, are σ*_p² = σ_p² + σ_pi²/n_i, σ*_r² = σ_r² + σ_ir²/n_i, and σ*_pr,e² = σ_pr² + σ_pir,e²/n_i. The estimate of the generalizability coefficient for relative decisions for the crossed two-facet mixed effects design, originally derived by Maxwell and Pilliner [4], is defined as

ρ² = σ*_p² / (σ*_p² + σ*_pr,e²/n_r)
   = (σ_p² + σ_pi²/n_i) / (σ_p² + σ_pi²/n_i + σ_pr²/n_r + σ_pir,e²/(n_i n_r)).

With the facet items fixed, the generalizability coefficient for our example is equal to 0.88, compared to a generalizability coefficient of 0.79 with the facet items being random. This increase of the coefficient is expected since, by restricting the universe, the relative decisions about persons will be more accurate.

In G theory, nested designs can also be analyzed. Our example with two facets would be a nested design if the first two questions were evaluated by the first rater and the other two questions by the second rater. In a design where raters are nested within questions, the variance component of raters and the variance component of the interaction between persons and raters cannot be estimated. The estimate of the generalizability coefficient for relative decisions for the nested two-facet random effects design is defined as

ρ² = σ_p² / (σ_p² + σ_pi²/n_i + σ_pr,pir,e²/(n_i n_r)).
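The three coefficients quoted in this section (0.79 for relative decisions, 0.66 for absolute decisions, and 0.88 with the facet items fixed) follow directly from the variance-component estimates used in the text; a small Python sketch (the variable names are ours, not the article's):

```python
# Generalizability coefficients for the two-facet example with
# n_i = 4 items and n_r = 2 raters, from the variance components in the text.
s_p, s_i, s_r = 2.16, 1.26, 0.0
s_pi, s_pr, s_ir, s_pir_e = 0.99, 0.18, 1.55, 1.96
n_i, n_r = 4, 2

# relative decisions: only the interactions with persons count as error
rel_err = s_pi / n_i + s_pr / n_r + s_pir_e / (n_i * n_r)
rho2_relative = s_p / (s_p + rel_err)                          # about 0.79

# absolute decisions: the item, rater, and item-by-rater components count too
abs_err = rel_err + s_i / n_i + s_r / n_r + s_ir / (n_i * n_r)
rho2_absolute = s_p / (s_p + abs_err)                          # about 0.66

# mixed effects design with the facet items fixed (Maxwell and Pilliner [4])
s_p_star = s_p + s_pi / n_i
s_pr_e_star = s_pr + s_pir_e / n_i
rho2_items_fixed = s_p_star / (s_p_star + s_pr_e_star / n_r)   # about 0.88
```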
The estimates of variance components of the crossed two-facet random effects design can be used to estimate the variance components of not only a nested two-facet random effects design but also those of a nested two-facet mixed effects design. Because of their versatility, crossed designs should be given preference.

G theory is not limited to the analysis of univariate models; multivariate models where persons have more than one universe score can also be analyzed. In G theory, persons as well as facets can be selected as objects of measurement, making G theory a conceptual and statistical framework for a wide range of research problems from different disciplines.

References

[1] Brennan, R.L. (1992). Elements of Generalizability Theory, ACT, Iowa.
[2] Brennan, R.L. (2001). Generalizability Theory, Springer, New York.
[3] Cronbach, L.J., Gleser, G.C., Nanda, H. & Rajaratnam, N. (1972). The Dependability of Behavioral Measurements: Theory of Generalizability for Scores and Profiles, Wiley, New York.
[4] Maxwell, A.E. & Pilliner, A.E.G. (1968). Deriving coefficients of reliability and agreement, The British Journal of Mathematical and Statistical Psychology 21, 105–116.
[5] Sanders, P.F., Theunissen, T.J.J.M. & Baas, S.M. (1989). Minimizing the numbers of observations: a generalization of the Spearman-Brown formula, Psychometrika 54, 587–598.
[6] Shavelson, R.J. & Webb, N.M. (1991). Generalizability Theory: A Primer, Sage Publications, Newbury Park.
[7] Thorndike, R.L. (1982). Applied Psychometrics, Houghton Mifflin Company, Boston.
[8] Smith, P.L. (1978). Sampling errors of variance components in small sample multifacet generalizability studies, Journal of Educational Statistics 3, 319–346.

PIET F. SANDERS
Generalizability Theory: Overview
NOREEN M. WEBB AND RICHARD J. SHAVELSON
Volume 2, pp. 717–719
the generalizability of the average over the random facets. Alternatives include conducting a separate G study within each condition of the fixed facet, or a multivariate analysis with the levels of the fixed facet comprising a vector of dependent variables.

As an example, consider a G study in which persons responded to 10 randomly selected science items on each of 2 randomly sampled occasions. Table 1 gives the estimated variance components from the G study. The large σ_p² (1.376, 32% of the total variation) shows that, averaging over items and occasions, persons in the sample differed systematically in their science achievement. The other estimated variance components constitute error variation; they concern the item facet more than the occasion facet. The nonnegligible σ_i² (5% of total variation) shows that items varied somewhat in difficulty level. The large σ_pi² (20%) reflects different relative standings of persons across items. The small σ_o² (1%) indicates that performance was stable across occasions, averaging over persons and items. The nonnegligible σ_po² (6%) shows that the relative standing of persons differed somewhat across occasions. The zero σ_io² indicates that the rank ordering of item difficulty was similar across occasions. Finally, the large σ_pio,e² (36%) reflects the varying relative standing of persons across occasions and items and/or other sources of error not systematically incorporated into the G study.

Because more of the error variability in science achievement scores came from items than from occasions, changing the number of items will have a larger effect on the estimated variance components and generalizability coefficients than will changing the number of occasions. For example, the estimated G and Phi coefficients for 4 items and 2 occasions are 0.72 and 0.69, respectively; the coefficients for 2 items and 4 occasions are 0.67 and 0.63, respectively. Choosing the number of conditions of each facet in a D study, as well as the design (nested vs. crossed, fixed vs. random facet), involves logistical and cost considerations as well as issues of dependability.

Table 1  Estimated variance components in a generalizability study of science achievement (p × i × o design)

Source        Variance component    Estimate    Total variability (%)
Person (p)    σ_p²                  1.376       32
Item (i)      σ_i²                  0.215        5
Occasion (o)  σ_o²                  0.043        1
p × i         σ_pi²                 0.860       20
p × o         σ_po²                 0.258        6
i × o         σ_io²                 0.001        0
p × i × o, e  σ_pio,e²              1.548       36

(See also Generalizability Theory: Basics; Generalizability Theory: Estimation)

NOREEN M. WEBB AND RICHARD J. SHAVELSON
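The D-study comparisons quoted above can be checked directly from the Table 1 components; a small Python sketch (ours, not part of the original entry):

```python
# G (relative) and Phi (absolute) coefficients for a p x i x o design,
# computed from the Table 1 variance-component estimates.
comp = {"p": 1.376, "i": 0.215, "o": 0.043,
        "pi": 0.860, "po": 0.258, "io": 0.001, "pio_e": 1.548}

def g_and_phi(n_i, n_o, c=comp):
    """Return (G, Phi) for a D study with n_i items and n_o occasions."""
    rel_err = c["pi"] / n_i + c["po"] / n_o + c["pio_e"] / (n_i * n_o)
    abs_err = rel_err + c["i"] / n_i + c["o"] / n_o + c["io"] / (n_i * n_o)
    return c["p"] / (c["p"] + rel_err), c["p"] / (c["p"] + abs_err)

# 4 items and 2 occasions give about (0.72, 0.69);
# 2 items and 4 occasions give about (0.67, 0.63).
```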
Generalized Additive Model
BRIAN S. EVERITT
Volume 2, pp. 719–721
Figure 1  Scatterplot matrix of crime rate, age, and unemployment

Figure 2  Form of locally weighted regression fit for crime rate and age [lo(age) represents the lowess fit; see scatterplot smoothers] and for crime rate and unemployment [lo(unemployment) represents the lowess fit]
As an example of the application of GAMs, we consider some data on crime rates in the United States given in [2]. The question of interest is how crime rate (number of offenses known to the police per one million population) in different states of the United States is related to the age of males in the age group 14 to 24 per 1000 of the total state population, and to unemployment in urban males per 1000 population in the age group 14 to 24. A scatterplot matrix of the data is shown in Figure 1 and suggests that the relationship between crime rate and each of the other two variables may depart from linearity in some subtle fashion that is worth investigating using a GAM. Using a locally weighted regression to model the relationship between crime rate and each of the explanatory variables, the model can be fitted simply using software available in, for example, SAS or S-PLUS (see Software for Statistical Analyses). Rather than giving the results in detail, we simply show the locally weighted fits of crime rate on age and unemployment in Figure 2. The locally weighted regression fit for age suggests, perhaps, that a linear fit for crime rate on age might be appropriate, with crime declining with an increasingly aged state population. But the relationship between crime rate and unemployment is clearly nonlinear. Use of the GAM suggests, perhaps, that crime rate might be modeled by a multiple regression approach with a linear term for age and a quadratic term for unemployment.

References

[1] Hastie, T.J. & Tibshirani, R.J. (1990). Generalized Additive Models, CRC Press/Chapman & Hall, Boca Raton.
[2] Vandele, W. (1978). Participation in illegitimate activities: Ehrlich revisited, in Deterrence and Incapacitation, A. Blumstein, J. Cohen & D. Nagin, eds, National Academy of Science, Washington.

BRIAN S. EVERITT
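The locally weighted fits shown in Figure 2 rest on the lowess idea: at each evaluation point, fit a straight line to the nearest neighbours, weighted by a tricube kernel. The helper below is a minimal, hypothetical plain-Python sketch of that idea, not the SAS or S-PLUS code the entry alludes to:

```python
def lowess_fit(x, y, x0, frac=0.5):
    """Locally weighted linear regression (lowess-style) at a single point x0.

    The frac nearest neighbours of x0 receive tricube weights, a weighted
    straight line y = a + b*(x - x0) is fitted, and its value a at x0 is
    returned.  A sketch for illustration only.
    """
    n = len(x)
    k = max(2, int(round(frac * n)))
    # the k-th smallest distance to x0 defines the local bandwidth
    d = sorted(abs(xi - x0) for xi in x)
    h = d[k - 1] or 1.0
    w = [(1 - min(1.0, abs(xi - x0) / h) ** 3) ** 3 for xi in x]
    # weighted least squares for the local line; its intercept is the fit at x0
    sw = sum(w)
    swx = sum(wi * (xi - x0) for wi, xi in zip(w, x))
    swy = sum(wi * yi for wi, yi in zip(w, y))
    swxx = sum(wi * (xi - x0) ** 2 for wi, xi in zip(w, x))
    swxy = sum(wi * (xi - x0) * yi for wi, xi, yi in zip(w, x, y))
    det = sw * swxx - swx * swx
    return (swy * swxx - swx * swxy) / det
```

Evaluating this at a grid of x0 values produces a smooth curve like the lo(age) and lo(unemployment) panels; on exactly linear data it reproduces the line.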
Generalized Estimating Equations (GEE)
JAMES W. HARDIN
Volume 2, pp. 721–729
equal to the observed Hessian matrix. This means that the model-based variance estimate (inverse of the expected Hessian) usually provided by the IRLS algorithm for GLMs will be the same as the model-based variance estimate (inverse of the observed Hessian) usually provided from a maximum likelihood algorithm. One should note, however, that this property does not automatically mean that the canonical link function is the best choice for a given dataset.

The large sample covariance matrix of the estimated regression coefficients may be estimated using the inverse of the expected information matrix (the expectation of the matrix outer product of the scores, Σ_{i=1}^n Ψ_i Ψ_i^T), or the inverse of the observed information matrix (the matrix of derivatives of the score vector, ∂Ψ/∂β). These two variance estimators, evaluated at β̂, are the same if the canonical link is used.

The Independence Model

A basic individual-level model is written in terms of the n individual observations y_i for i = 1, . . . , n. When observations may be clustered (see Clustered Data), owing to repeated observations on the sampling unit or because the observations are related to some cluster identifier variable, the model may be written in terms of the observations y_it for the clusters i = 1, . . . , n and the within-cluster repeated, or related, observations t = 1, . . . , n_i. The total number of observations is then N = Σ_i n_i. The clusters may also be referred to as panels, subjects, or groups. In this presentation, the clusters i are independent, but the within-cluster observations it may be correlated. An independence model, however, assumes that the within-cluster observations are not correlated.

The independence model is a special case of more sophisticated correlated-data approaches (such as GEE). This model assumes that there is no correlation within clusters. Therefore, the model specification is in terms of the individual observations y_it. While the independence model assumes that the repeated measures are independent, the model still provides consistent estimators in the presence of correlated data. Of course, this approach is paid for through inefficiency, though the efficiency loss is not always large, as investigated by Glonek et al. [5]. As such, this model remains an attractive alternative because of its computational simplicity. The independence model also serves as a reference model in the derivation of diagnostics for more sophisticated models for clustered data (such as GEE models).

Analysts can use the independence model to obtain point estimates along with standard errors based on the modified sandwich variance estimator to ensure that inference is robust to any type of within-cluster correlation. While the inference regarding marginal effects is valid (assuming that the model for the mean is correctly specified), the estimator from the independence model is not efficient when the data are correlated.

Modified Sandwich Variance Estimator

The validity of the (naive) model-based variance estimators, using the inverse of either the observed or expected Hessian, depends on the correct specification of the variance; in turn this depends on the correct specification of the working correlation model. A formal justification for an alternative estimator known as the sandwich variance estimator is given in [9].

The sandwich variance estimator is presented in the general form A⁻¹ B A⁻ᵀ. Here A⁻¹ (the so-called bread of the sandwich) is the standard model-based (naive) variance estimator, which can be based on the expected Hessian or the observed Hessian (see Information Matrix). The B variance estimator is the sum of the cross-products of the scores.

The B variance estimator does not depend on the correct specification of the assumed model and is given by B = Σ_{i=1}^n Σ_{t=1}^{n_i} Ψ_it Ψ_it^T. As the expected value of the estimating equation is zero, this formula is similar to the usual variance estimator. A generalization is obtained by squaring the sums of the terms for each cluster (since we assume that the clusters are independent) instead of summing the squares of the terms for each observation. This summation over clusters, B = Σ_{i=1}^n (Σ_{t=1}^{n_i} Ψ_it)(Σ_{t=1}^{n_i} Ψ_it)^T, is what adds the modified adjective to the modified sandwich variance estimator.

The beneficial properties of the sandwich variance estimator, in the usual or the modified form, make it a popular choice for many analysts. However, the acceptance of this estimator is not without some controversy. A discussion of the decreased efficiency and increased variability of the sandwich estimator in common applications is presented in [11], and [3]
argues against blind application of the sandwich estimator by considering an independent samples test of means.

It should be noted that assuming independence is not always conservative; the model-based (naive) variance estimates based on the observed or expected Hessian matrix are not always smaller than those of the modified sandwich variance estimator. Since the sandwich variance estimator is sometimes called the robust variance estimator, this result may seem counterintuitive. However, it is easily seen by assuming negative within-cluster correlation leading to clusters with both positive and negative residuals. The clusterwise sums of those residuals will be small, and the resulting modified sandwich variance estimator will yield smaller standard errors than the model-based Hessian variance estimators.

Subject-specific (SS) versus Population-averaged (PA) Models

There are two main approaches to dealing with correlation in repeated or longitudinal data. One approach focuses on the marginal effects averaged across the individuals (see Marginal Models for Clustered Data) (the population-averaged approach), and the second approach focuses on the effects for given values of the random effects by fitting parameters of the assumed random-effects distribution (the subject-specific approach). Formally, we specify a generalized linear mixed model and include a source of the nonindependence. We can then either explicitly model the conditional expectation given the random effects ν_i using μ_it^SS = E(y_it | x_it, ν_i), or we can focus on the marginal expectation (integrated over the distribution of the random effects) as μ_it^PA = E_{ν_i}[E(y_it | x_it, ν_i)]. The responses in these approaches are characterized by

g(μ_it^SS) = x_it β^SS + z_it ν_i
V(y_it | x_it, ν_i) = V(μ_it^SS)
g(μ_it^PA) = x_it β^PA
V(y_it | x_it) = V(μ_it^PA).    (3)

The population-averaged approach models the average response for observations sharing the same covariates (across all of the clusters or subjects). The superscripts are used to emphasize that the fitted coefficients are not the same. The subject-specific approach explicitly models the source of heterogeneity so that the fitted regression coefficients have an interpretation in terms of the individuals.

The most commonly applied GEE is described in [12]. This is a population-averaged approach. It is possible to derive subject-specific GEE models, but such models are not currently part of software packages and so do not appear nearly as often in the literature.

(Population-averaged) Generalized Estimating Equations

The genesis of population-averaged generalized estimating equations is presented in [12]. The basic idea behind this novel approach is illustrated as follows. We consider the estimating equation for a model specifying the exponential family of distributions

Ψ = Σ_{i=1}^n Ψ_i = Σ_{i=1}^n X_i^T D(∂μ_i/∂η_i) [V(μ_i)]⁻¹ (y_i − μ_i)/φ = 0_{p×1},    (4)

where D(d_i) denotes a diagonal matrix with diagonal elements given by the n_i × 1 vector d_i, X_i is the n_i × p matrix of covariates for cluster i, and y_i = (y_i1, . . . , y_in_i) and μ_i = (μ_i1, . . . , μ_in_i) are n_i × 1 vectors for cluster i. Assuming independence, V(μ_i) is clearly an n_i × n_i diagonal matrix which can be factored into

V(μ_i) = [D(V(μ_it))^{1/2} I_{(n_i×n_i)} D(V(μ_it))^{1/2}]_{n_i×n_i},    (5)

where D(d_it) is an n_i × n_i diagonal matrix with diagonal elements d_it for t = 1, . . . , n_i. This presentation makes it clear that the estimating equation treats each observation within a cluster as independent. A (pooled) model associated with this estimating equation is called the independence model.

There are two other aspects of the estimating equation to note. The first aspect is that the estimating equation is written in terms of β while the scale parameter φ is treated as ancillary. For discrete families, this parameter is theoretically equal to one, while for continuous families φ is a scalar multiplying the assumed variance (φ is estimated in this case).
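For the simplest case (identity link and Gaussian variance, so the independence fit is ordinary least squares), the modified sandwich estimator A⁻¹ B A⁻ᵀ described above can be sketched in a few lines of numpy. The function name and data below are our illustrative assumptions, not from the entry:

```python
import numpy as np

def ols_modified_sandwich(X, y, cluster):
    """Independence-model (OLS) fit with the modified sandwich variance.

    Sketch of A^{-1} B A^{-T} for the Gaussian identity-link case: the bread
    A is the Hessian X'X, and B sums outer products of the cluster-summed
    scores X_i'(y_i - X_i b), which is what makes the estimator 'modified'.
    """
    X = np.asarray(X, float)
    y = np.asarray(y, float)
    beta = np.linalg.solve(X.T @ X, X.T @ y)   # independence point estimates
    resid = y - X @ beta
    A_inv = np.linalg.inv(X.T @ X)             # model-based (naive) piece
    B = np.zeros((X.shape[1], X.shape[1]))
    for g in sorted(set(cluster)):
        rows = [j for j, c in enumerate(cluster) if c == g]
        score = X[rows].T @ resid[rows]        # score summed within cluster g
        B += np.outer(score, score)
    return beta, A_inv @ B @ A_inv             # A symmetric, so A^{-T} = A^{-1}
```

With every observation in its own cluster this reduces to the usual observation-level sandwich estimator; with genuine clusters, B squares the cluster sums of the scores rather than summing the squared observation-level scores.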
The second aspect of the estimating equation to note is that it is written in terms of the clusters i instead of the observations it.

The genesis of the original population-averaged generalized estimating equations is to replace the identity matrix with a parameterized working correlation matrix R(α):

V(μ_i) = [D(V(μ_it))^{1/2} R(α)_{(n_i×n_i)} D(V(μ_it))^{1/2}]_{n_i×n_i}.    (6)

To address correlated data, the working correlation matrix is parameterized via α in order to specify structural constraints (see the section Estimating the Working Correlation Matrix). In this way, the independence model is a special case of the GEE specification where R(α) is an identity matrix.

Formally, [12] introduces a second estimating equation for the parameters of the working correlation matrix. The authors then establish the properties of the estimators resulting from the solution of these estimating equations. The GEE moniker was applied as the model is derived through a generalization of the estimating equation rather than a derivation from some assumed distribution. Example applications of these models in behavioral statistics studies can be found in [4] and [1].

GEE is a generalization of the quasilikelihood approach to GLMs which merely uses first and second moments and does not require a likelihood. There are several software packages that support estimation of these models. These packages include R, SAS, S-PLUS, Stata, and SUDAAN. R and S-PLUS users can easily find user-written software tools for fitting GEE models, while such support is included in the other packages (see Software for Statistical Analyses).

Estimating the Working Correlation Matrix

One should carefully consider the parameterization of the working correlation matrix since including the correct parameterization leads to more efficient estimates. We want to carefully consider this choice even if we employ the modified sandwich variance estimator in the calculation of standard errors and confidence intervals for the regression parameters. While the use of the modified sandwich variance estimator assures robustness in the case of misspecification of the working correlation matrix, the advantage of more efficient point estimates is still worth this effort. There is no controversy as to the fact that the GEE estimates are consistent, but there is some controversy as to how efficient they are. This controversy centers on how well the correlation parameters can be estimated.

The full generalized estimating equation for population-averaged GEEs is given in partitioned form by Ψ = (Ψ_β, Ψ_α) = (0, 0), where the regression and correlation components are given by

Ψ_β = Σ_{i=1}^n X_i^T D(∂μ_i/∂η_i) [V(μ_i)]⁻¹ (y_i − μ_i)/φ = 0
Ψ_α = Σ_{i=1}^n H_i⁻¹ (W_i − ξ_i) = 0,    (7)

where W_i = (r_i1 r_i2, r_i1 r_i3, . . . , r_i(n_i−1) r_in_i)^T, H_i = D(V(W_it)), and ξ_i = E(W_i). From this specification (using r_it for the itth Pearson residual), it is clear that the parameterization of the working correlation matrix enters through the specification of ξ. For example, the specification ξ = (α, α, . . . , α) signals a single unknown correlation; we assume that the conditional correlations for all pairs of observations within a given cluster are the same. For instance, the correlations do not depend on a time lag.

Typically, a careful analyst chooses some small number of candidate parameterizations. The quasilikelihood information criterion (QIC) measure for choosing between candidate parameterizations is discussed in [17]. This criterion measure is similar to the well-known Akaike information criterion (AIC).

The most common choices for parameterizing the working correlation matrix R are then given by parameterizing the elements of the matrix as

independent:            R_uv = 0
exchangeable:           R_uv = α
autocorrelated AR(1):   R_uv = α^|u−v|
stationary(k):          R_uv = α_|u−v| if |u − v| ≤ k, 0 otherwise
nonstationary(k):       R_uv = α_uv if |u − v| ≤ k, 0 otherwise
unstructured:           R_uv = α_uv    (8)

for u ≠ v; R_uu = 1.
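The first three structures in (8) translate directly into an m × m matrix R(α); the helper below is an illustrative sketch (ours, not from the entry):

```python
def working_correlation(structure, m, alpha=0.0):
    """Return an m x m working correlation matrix R(alpha) for the
    independent, exchangeable, or AR(1) structure listed in (8)."""
    def r(u, v):
        if u == v:
            return 1.0                      # R_uu = 1 by definition
        if structure == "independent":
            return 0.0                      # identity matrix: the GLM special case
        if structure == "exchangeable":
            return alpha                    # one common within-cluster correlation
        if structure == "ar1":
            return alpha ** abs(u - v)      # correlation decays with the lag
        raise ValueError(structure)
    return [[r(u, v) for v in range(m)] for u in range(m)]
```

The independence model corresponds to the special case R(α) = I.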
The independence model admits no extra parameters, and the resulting model is equivalent to a generalized linear model specification. The exchangeable correlation parameterization admits one extra parameter, and the unstructured working correlation parameterization admits (M² − M)/2 extra parameters, where M = max{n_i}. The exchangeable correlation specification is also known as equal correlation, common correlation, and compound symmetry (see Sphericity Test).

The elements of the working correlation matrix are estimated using the Pearson residuals from the current fit. Estimation alternates between estimating the regression parameters β for the current estimates of α, and then using those estimates to obtain residuals to update the estimate of α.

In addition to estimating (β, α), the continuous families also require estimation of the scale parameter φ; this is the same scale parameter as in generalized linear models. Discrete families theoretically define this parameter to be 1, but one can optionally estimate this parameter in the same manner as is required by continuous exponential family members. Software documentation should specify the conditions under which the parameter is either assumed to be known or is estimated.

The usual approach in GLMs for N = Σ_i n_i total observations is to estimate the scale parameter as φ = (1/N) Σ_{i=1}^n Σ_{t=1}^{n_i} r_it², though some software packages will use (N − p), where p is the dimension of β, as the denominator. Software users should understand that this seemingly innocuous difference will lead to slightly different answers in various software packages. The scale parameter is the denominator in the estimation of the correlation parameters, and a change in the estimates of the correlation parameters will lead to slightly different regression coefficient estimates β.

Extensions to the Population-averaged GEE Model

The GEE models described in [12] are so commonly used that analyses simply refer to their application as GEE. However, GEE derivations are not limited to population-averaged models. In fact, generalized estimating equations methods can be used to construct subject-specific models; see section Subject-specific (SS) versus Population-averaged (PA) Models in this entry and [25].

Several areas of research have led to extensions of the original GEE models. The initial extensions were to regression models not usually supported in generalized linear models. In particular, generalized logistic regression models for multinomial logit, cumulative logistic regression models, and ordered outcome models (ordered logistic and ordered probit) have all found support in various statistical software packages.

An extension of the quasilikelihood such that both partial derivatives have score-like properties is given in [15], and then [7], and later [6], derive an extended generalized estimating equation (EGEE) model from this extended quasilikelihood. To give some context to this extension, the estimating equation for β does not change, but the estimating equation for φ is then

Ψ_φ = Σ_{i=1}^n [ (y_i − μ_i)^T (∂[V(μ_i)]⁻¹/∂φ) (y_i − μ_i) + tr( (∂[V(μ_i)]⁻¹/∂φ) V(μ_i) ) ] = 0.    (9)

The EGEE model is similar to the population-averaged GEE model in that the two estimating equations are assumed to be orthogonal; it is assumed that Cov(β, φ) = 0, a property usually referred to in the literature as GEE1.

At the mention of GEE1, it should be obvious that there is another extension to the original GEE model known as GEE2. A model derived from GEE2 does not assume that β and α are uncorrelated. The GEE2, which is not robust against misspecification of the correlation, is a more general approach that has fewer restrictions and which provides standard errors for the correlation parameters α. Standard errors for α are not generally available in population-averaged GEE models, though one can calculate bootstrap standard errors.

One other extension of note is the introduction of estimating methods that are resistant to outliers. One such approach, by Preisser and Qaqish [19], generalizes GEE model estimation following the ideas in robust regression. This generalization downweights outliers to remove exaggerated influence. The estimating equation for the regression coefficients becomes
Σ_{i=1}^{n} D_i′ [V(μ_i)]^{−1} [w_i (y_i − μ_i) − c_i] = 0_{p×1}.  (10)

The usual GEE is a special case where, for all i, the weights w_i are given by an n_i × n_i identity matrix and c_i by a vector of zeros. Typical approaches use Mallows-type weights calculated from influence diagnostics, though other approaches are possible.

Missing Data

Population-averaged GEE models are derived for complete data. If there are missing observations, the models are still applicable if the data are missing completely at random (MCAR).

Techniques for dealing with missing data are a source of active research in all areas of statistical modeling, but methods for dealing with missing data are difficult to implement as turnkey solutions. This means that software packages are not likely to support specific solutions to every research problem. An investigation into the missingness of data requires, as a first step, the means for communicating the nature of the missing data.

If data are not missing completely at random, then an application of GEE analysis is performed under a violation of assumptions, leading to suspect results and interpretation. Analyses that specifically address data that do not satisfy the MCAR assumption are referred to as informatively missing methods; for further discussion, see [22] for applications of inverse probability weighting and [10] for additional relevant discussion.

A formal study for modeling missing data due to dropouts is presented in [13], while [22] and [21] each discuss the application of sophisticated semiparametric methods under nonignorable missingness mechanisms, which extend the usual GEE models to provide consistent estimators. One of the assumptions of GEE is that if there is dropout, the dropout mechanism (see Dropouts in Longitudinal Studies: Methods of Analysis) does not depend on the values of the outcomes (outcome-dependent dropout), but as [13] points out, such missingness may depend on the values of the fixed covariates (covariate-dependent dropout).

Diagnostics

One of the most prevalent measures of model adequacy is the Akaike information criterion, or AIC. An extension of this measure, given in [17], is called the quasilikelihood information criterion (QIC). This measure is useful for comparing models that differ only in the assumed correlation structure. For choosing covariates in the model, [18] introduces the QICu measure, which plays a similar role for covariate selection in GEE models as the adjusted R² plays in regression.

Since MCAR is an important assumption in GEE models, [2] provides evidence of the utility of the Wald–Wolfowitz nonparametric run test. This test provides a formal approach for assessing compliance of a dataset with the MCAR assumption. While this test is useful, one should not forget the basics of exploratory data analysis. The first assessment of the data and the missingness of the data should be subjectively illustrated through standard graphical techniques.

As in GLMs, the careful investigator looks at influence measures of the data. Standard DFBETA and DFFIT residuals introduced in the case of linear regression are generalized for clustered data analysis by considering deletion diagnostics based on deleting a cluster i at a time, rather than an observation (i, t) at a time. For goodness of fit, [26] provides discussion of measures based on entropy (as a proportional reduction in variation), along with discussion in terms of the concordance correlation.

A χ² goodness-of-fit test for GEE binomial models is presented in [8]. The basic idea of the test is to group results into deciles and investigate the frequencies as a χ² test of the expected and observed counts. As with the original test, analysts should use caution if there are many ties at the deciles, since breaking the ties will be a function of the sort order of the data. In other words, the results will be random.

Standard Wald-type hypothesis tests of regression coefficients can be performed using the estimated covariance matrix of the regression parameters. In addition, [23] provides alternative extensions of Wald, Rao (score), and likelihood ratio tests (deviance difference based on the independence model). These tests are available in the SAS commercial packages via specified contrasts.
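The Wald–Wolfowitz run test mentioned above can be sketched as follows. This is a generic normal-approximation runs test applied to a 0/1 missingness-indicator sequence, not the exact procedure of [2]; the sample sequence is hypothetical.

```python
import numpy as np
from math import sqrt

def runs_test(x):
    """Wald-Wolfowitz runs test (normal approximation) for a binary sequence.

    x: sequence of 0/1 values, e.g. indicators of whether each observation
    (taken in the data set's time order) is missing.  Returns the observed
    number of runs and the z statistic; a large |z| suggests the pattern of
    0s and 1s is not random, casting doubt on MCAR.
    """
    x = np.asarray(x)
    n1 = int(np.sum(x == 1))
    n0 = int(np.sum(x == 0))
    n = n1 + n0
    runs = 1 + int(np.sum(x[1:] != x[:-1]))   # a run ends at each value change
    mean = 2.0 * n1 * n0 / n + 1.0            # E[runs] under randomness
    var = (2.0 * n1 * n0 * (2.0 * n1 * n0 - n)) / (n**2 * (n - 1.0))
    z = (runs - mean) / sqrt(var)
    return runs, z

# A clearly non-random missingness pattern: all the 1s are clustered together.
runs, z = runs_test([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])
print(runs, z)
```

With five 0s followed by five 1s there are only two runs against an expected six, so the z statistic is strongly negative.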
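The cluster-at-a-time deletion diagnostics described above can be sketched minimally, using ordinary least squares on simulated data as a stand-in for the working GEE fit; all data, dimensions, and coefficient values here are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical clustered data: 10 clusters with 4 observations each.
n_clusters, n_per = 10, 4
cluster = np.repeat(np.arange(n_clusters), n_per)
X = np.column_stack([np.ones(n_clusters * n_per),
                     rng.normal(size=n_clusters * n_per)])
y = X @ np.array([1.0, 2.0]) + rng.normal(size=n_clusters * n_per)

def fit(X, y):
    # Least-squares coefficients; a GEE fit would replace this step.
    return np.linalg.lstsq(X, y, rcond=None)[0]

beta_full = fit(X, y)

# Cluster-level DFBETA: refit with cluster i deleted and record the shift
# in the coefficient vector, one whole cluster (not one observation) at a time.
dfbeta = np.array([beta_full - fit(X[cluster != i], y[cluster != i])
                   for i in range(n_clusters)])
most_influential = int(np.argmax(np.abs(dfbeta).sum(axis=1)))
print(dfbeta.shape, most_influential)
```

Large rows of `dfbeta` flag clusters whose removal moves the coefficient estimates the most.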
Table 2 Estimated incidence rate ratios and standard errors for various Poisson models (columns: Model, time, trt, age, baseline; table body not recovered)

to modeling data of this type. The weakness of the approach is that the estimators will not be as efficient as a model including the true underlying within-cluster correlation structure. Another standard approach to modeling this type of repeated measures is to hypothesize that the correlations are due to individual-specific random intercepts (see Generalized Linear Mixed Models). These random effects (one could also hypothesize fixed effects) will lead to alternate models for the data.

Results from two different random-effects models are included in the table. The gamma-distributed random-effects model is rather easy to program and fit to data, as the log-likelihood of the model is in analytic form. The normally distributed random-effects model, on the other hand, has a log-likelihood specification that includes an integral. Sophisticated numeric techniques are required for the calculation of this model; see [20].

We could hypothesize that the correlation follows an autoregressive process, since the data are collected over time. However, this is not always the best choice in an experiment, since we must believe that the hypothesized correlation structure applies to both the treated and untreated groups.

The QIC values for the independence, exchangeable, ar1, and unstructured correlation structures are respectively given by 5826.23, 5826.25, 5832.20, and 5847.91. This criterion measure indicates a preference for the unstructured model over the autoregressive model. The fitted correlation matrices for these models (printing only the bottom half of the symmetric matrices) are given by

R_AR(1) =
  1.00
  0.51  1.00
  0.26  0.51  1.00
  0.13  0.26  0.51  1.00

R_unst =
  1.00
  0.25  1.00
  0.42  0.68  1.00
  0.22  0.28  0.58  1.00   (11)

References

[1] Alexander, J.A., D'Aunno, T.A. & Succi, M.J. (1996). Determinants of profound organizational change: choice of conversion or closure among rural hospitals, Journal of Health and Social Behavior 37, 238–251.
[2] Chang, Y.-C. (2000). Residuals analysis of the generalized linear models for longitudinal data, Statistics in Medicine 19, 1277–1293.
[3] Drum, M. & McCullagh, P., Comment on Fitzmaurice, G.M., Laird, N. & Rotnitzky, A. (1993). Regression models for discrete longitudinal responses, Statistical Science 8, 284–309.
[4] Ennet, S.T., Flewelling, R.L., Lindrooth, R.C. & Norton, E.C. (1997). School and neighborhood characteristics associated with school rates of alcohol, cigarette, and marijuana use, Journal of Health and Social Behavior 38(1), 55–71.
[5] Glonek, G.F.V. & McCullagh, P. (1995). Multivariate logistic models, Journal of the Royal Statistical Society, Series B 57, 533–546.
[6] Hall, D.B. (2001). On the application of extended quasilikelihood to the clustered data case, The Canadian Journal of Statistics 29(2), 1–22.
[7] Hall, D.B. & Severini, T.A. (1998). Extended generalized estimating equations for clustered data, Journal of the American Statistical Association 93, 1365–1375.
[8] Horton, N.J., Bebchuk, J.D., Jones, C.L., Lipsitz, S.R., Catalano, P.J., Zahner, G.E.P. & Fitzmaurice, G.M. (1999). Goodness-of-fit for GEE: an example with mental health service utilization, Statistics in Medicine 18, 213–222.
[9] Huber, P.J. (1967). The behavior of maximum likelihood estimates under nonstandard conditions, in Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Vol. 1, University of California Press, Berkeley, 221–233.
[10] Ibrahim, J.G. & Lipsitz, S.R. (1999). Missing covariates in generalized linear models when the missing data mechanism is non-ignorable, Journal of the Royal Statistical Society, Series B 61(1), 173–190.
[11] Kauermann, G. & Carroll, R.J. (2001). The sandwich variance estimator: efficiency properties and coverage probability of confidence intervals, Journal of the American Statistical Association 96, 1386–1397.
[12] Liang, K.-Y. & Zeger, S.L. (1986). Longitudinal data analysis using generalized linear models, Biometrika 73, 13–22.
[13] Little, R.J.A. (1995). Modelling the drop-out mechanism in repeated measures studies, Journal of the American Statistical Association 90, 1112–1121.
[14] McCullagh, P. & Nelder, J.A. (1989). Generalized Linear Models, 2nd Edition, Chapman & Hall, London.
[15] Nelder, J.A. & Pregibon, D. (1987). An extended quasi-likelihood function, Biometrika 74, 221–232.
[16] Nelder, J.A. & Wedderburn, R.W.M. (1972). Generalized linear models, Journal of the Royal Statistical Society, Series A 135(3), 370–384.
[17] Pan, W. (2001a). Akaike's information criterion in generalized estimating equations, Biometrics 57, 120–125.
[18] Pan, W. (2001b). Model selection in estimating equations, Biometrics 57, 529–534.
[19] Preisser, J.S. & Qaqish, B.F. (1999). Robust regression for clustered data with application to binary responses, Biometrics 55, 574–579.
[20] Rabe-Hesketh, S., Skrondal, A. & Pickles, A. (2002). Reliable estimation of generalized linear mixed models using adaptive quadrature, The Stata Journal 2, 1–21.
[21] Robins, J.M., Rotnitzky, A. & Zhao, L.P. (1994). Estimation of regression coefficients when some regressors are not always observed, Journal of the American Statistical Association 89(427), 846–866.
[22] Robins, J.M., Rotnitzky, A. & Zhao, L.P. (1995). Analysis of semiparametric regression models for repeated outcomes in the presence of missing data, Journal of the American Statistical Association 90(429), 106–121.
[23] Rotnitzky, A. & Jewell, N.P. (1990). Hypothesis testing of regression parameters in semiparametric generalized linear models for cluster correlated data, Biometrika 77(3), 485–497.
[24] Thall, P.F. & Vail, S.C. (1990). Some covariance models for longitudinal count data with overdispersion, Biometrics 46, 657–671.
[25] Zeger, S.L., Liang, K.-Y. & Albert, P.S. (1988). Models for longitudinal data: a generalized estimating equation approach, Biometrics 44, 1049–1060.
[26] Zheng, B. (2000). Summarizing the goodness of fit of generalized linear models for longitudinal data, Statistics in Medicine 19, 1265–1275.

JAMES W. HARDIN
Generalized Linear Mixed Models
DONALD HEDEKER
Volume 2, pp. 729–738
result, GLMMs are often referred to as conditional models, in contrast to the marginal generalized estimating equations (GEE) models (see Generalized Estimating Equations (GEE)) [29], which represent an alternative generalization of GLMs for correlated data (see Marginal Models for Clustered Data).

The model can be easily extended to include multiple random effects. For example, in longitudinal problems, it is common to have a random subject intercept and a random linear time-trend. For this, denote z_ij as the r × 1 vector of variables having random effects (a column of ones is usually included for the random intercept). The vector of random effects v_i is assumed to follow a multivariate normal distribution with mean vector 0 and variance–covariance matrix Σ_v (see Catalogue of Probability Density Functions). The model is now written as

η_ij = x_ij′ β + z_ij′ v_i,  (5)

where the inverse link function Ψ(η_ij) is the logistic cumulative distribution function (cdf), namely Ψ(η_ij) = [1 + exp(−η_ij)]^{−1}. A nicety of the logistic distribution that simplifies parameter estimation is that the probability density function (pdf) is related to the cdf in a simple way, as ψ(η_ij) = Ψ(η_ij)[1 − Ψ(η_ij)].

The probit model, which is based on the standard normal distribution, is often proposed as an alternative to the logistic model [13]. For the probit model, the normal cdf and pdf replace their logistic counterparts. A useful feature of the probit model is that it can be used to yield tetrachoric correlations for the clustered binary responses, and polychoric correlations for ordinal outcomes (discussed below). For this reason, in some areas, for example familial studies, the probit formulation is often preferred to its logistic counterpart.
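The logistic cdf/pdf identity and the linear predictor of (5) can be checked numerically; the coefficient and random-effect values below are hypothetical.

```python
import numpy as np

def Psi(eta):
    """Logistic cdf, the inverse link: Psi(eta) = 1 / (1 + exp(-eta))."""
    return 1.0 / (1.0 + np.exp(-eta))

def psi(eta):
    """Logistic pdf via the identity psi = Psi * (1 - Psi)."""
    return Psi(eta) * (1.0 - Psi(eta))

eta = np.linspace(-3, 3, 7)
# Check the cdf/pdf identity against a finite-difference derivative of Psi.
h = 1e-6
deriv = (Psi(eta + h) - Psi(eta - h)) / (2 * h)
print(np.max(np.abs(deriv - psi(eta))))   # essentially zero

# Linear predictor of (5) for one subject: eta_ij = x_ij' beta + z_ij' v_i,
# with z_ij = (1, t_ij)' giving a random intercept and random time trend.
beta = np.array([-0.5, 0.8])              # hypothetical fixed effects
v_i = np.array([0.3, -0.1])               # this subject's random effects
t = np.arange(4.0)                        # four time points
x = np.column_stack([np.ones(4), t])      # here x_ij coincides with z_ij
eta_ij = x @ beta + x @ v_i
print(Psi(eta_ij))                        # subject-specific response probabilities
```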
Table 1 Smoking cessation study: smoking status (0 = smoking, 1 = not smoking) across time (N = 489), GLMM logistic parameter estimates (Est.), standard errors (SE), and P values, for a random intercept model and a random intercept and trend model (table body not recovered)
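The boundary-of-parameter-space remedy discussed in the example that follows can be sketched with its likelihood-ratio statistic; for 2 degrees of freedom the chi-square survival function has the closed form exp(−x/2), so no statistics library is needed.

```python
import math

lr_stat = 36.3   # likelihood-ratio statistic comparing the random
df = 2           # intercept-and-trend model to the random-intercept model
                 # (two extra parameters: trend variance and covariance)

# For df = 2, P(chi-square > x) = exp(-x/2) exactly.
p_naive = math.exp(-lr_stat / 2)
p_adjusted = p_naive / 2          # boundary remedy: halve the P value

print(p_naive, p_adjusted)        # both far below .001 here
```

Halving can only make the P value smaller, so when the naive P value is already below the significance level, as here, the conclusion is unchanged.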
test, the model with random intercept and linear time trend is preferred over the simpler random intercept model (χ²₂ = 36.3). Thus, there is considerable evidence for subjects varying in both their intercepts and time trends. It should be noted that the test statistic does not have a chi-square distribution when testing variance parameters, because the null hypothesis is on the border of the parameter space, making the P value conservative. Snijders and Bosker [46] elaborate on this issue and point out that a simple remedy, which has been shown to be reasonable in simulation studies, is to divide the P value based on the likelihood-ratio chi-square test statistic by two. In the present case, it doesn't matter, because the P value is <.001 for χ²₂ = 36.3 even without dividing by two.

In terms of the fixed effects, both models indicate a nonsignificant time effect for the control condition, and a highly significant condition effect at time 0 (e.g., z = 1.495/.415 = 3.6 in the second model). This indicates a positive effect of the experimental conditions on smoking abstinence, relative to control, at postintervention. There is also some evidence of a negative condition by time interaction, suggesting that the beneficial condition effect diminishes across time. Note that this interaction is not significant (P < .18) in the random intercept and trend model, but it is significant in the random intercept model (P < .02). Since the former is preferred by the likelihood-ratio test, we would conclude that the interaction is not significant.

This example shows that the significance of model terms can depend on the structure of the random effects. Thus, one must decide upon a reasonable model for the random effects as well as for the fixed effects. A commonly recommended approach for this is to perform a sequential procedure for model selection. First, one includes all possible covariates of interest into the model and selects between the possible models of random effects using likelihood-ratio tests and model fit criteria. Then, once a reasonable random effects structure is selected, one trims model covariates in the usual way.

IRT Models

Because the logistic model is based on the logistic response function, and the random effects are assumed normally distributed, this model and models closely related to it are often referred to as logistic/normal models, especially in the latent trait model literature [4]. Similarly, the probit model is sometimes referred to as a normal/normal model. In many respects, latent trait or item response theory (IRT) models, developed in the educational testing and psychometric literatures, represent some of the earliest GLMMs. Here, item responses (j = 1, 2, . . . , n) are nested within subjects (i = 1, 2, . . . , N). The simplest IRT model is the Rasch model [40], which posits the probability of a correct response to the dichotomous item j (Y_ij = 1), conditional on the random effect or ability of subject i (θ_i), in terms of the logistic cdf as

P(Y_ij = 1 | θ_i) = Ψ(θ_i − b_j),  (8)

where b_j is the threshold or difficulty parameter for item j (i.e., item difficulty). Subjects' ability is commonly denoted as θ in the IRT literature (i.e., θ instead of v). Note that the Rasch model is simply a random-intercepts model that includes
item dummies as fixed regressors. Because there is only one parameter per item, the Rasch model is also called the one-parameter IRT model. A more general IRT model, the two-parameter model [5], also includes a parameter for the discrimination of the item in terms of ability.

Though IRT models were not originally cast as GLMMs, formulating them in this way easily allows covariates to enter the model at either level (i.e., items or subjects). This and other advantages of casting IRT models as mixed models are described by Rijmen et al. [43], who provide a comprehensive overview of, and bridge between, IRT models, mixed models, and GLMMs. As they point out, the Rasch model, and variants of it, belong to the class of GLMMs. However, the more extended two-parameter model is not within the class of GLMMs, because the predictor is no longer linear, but includes a product of parameters.

Ordinal Outcomes

Extending the methods for dichotomous responses to ordinal response data has also been actively pursued; Agresti and Natarajan [2] review many of these developments. Because the proportional odds model described by McCullagh [31], which is based on the logistic regression formulation, is a common choice for analysis of ordinal data, many of the GLMMs for ordinal data are generalizations of this model, though models relaxing this assumption have also been described [27]. The proportional odds model expresses the ordinal responses in C categories (c = 1, 2, . . . , C) in terms of C − 1 cumulative category comparisons, specifically, C − 1 cumulative logits (i.e., log odds). Here, denote the conditional cumulative probabilities for the C categories of the outcome Y_ij as P_ijc = P(Y_ij ≤ c | v_i, x_ij) = Σ_{c′=1}^{c} p_ijc′, where p_ijc represents the conditional probability of response in category c. The logistic GLMM for the conditional cumulative probabilities is given in terms of the cumulative logits as

log[P_ijc / (1 − P_ijc)] = λ_ijc  (c = 1, . . . , C − 1),  (9)

where the linear predictor is now

λ_ijc = γ_c − [x_ij′ β + z_ij′ v_i],  (10)

with C − 1 strictly increasing model thresholds γ_c (i.e., γ₁ < γ₂ < · · · < γ_{C−1}). The thresholds allow the cumulative response probabilities to differ. For identification, either the first threshold γ₁ or the model intercept β₀ is typically set to zero. As the regression coefficients β do not carry the c subscript, the effects of the regressors do not vary across categories. McCullagh [31] calls this assumption of identical odds ratios across the C − 1 cutoffs the proportional odds assumption.

Because the ordinal model is defined in terms of the cumulative probabilities, the conditional probability of a response in category c is obtained as the difference of two conditional cumulative probabilities:

P(Y_ij = c | v_i, x_ij, z_ij) = Ψ(λ_ijc) − Ψ(λ_ij,c−1).  (11)

Here, γ₀ = −∞ and γ_C = ∞, and so Ψ(λ_ij0) = 0 and Ψ(λ_ijC) = 1 (see Ordinal Regression Models).

Example

Hedeker and Gibbons [25] described a random-effects ordinal probit regression model, examining longitudinal data collected in the NIMH Schizophrenia Collaborative Study on treatment-related changes in overall severity. The dependent variable was item 79 of the Inpatient Multidimensional Psychiatric Scale (IMPS; [30]), scored as: (a) normal or borderline mentally ill, (b) mildly or moderately ill, (c) markedly ill, and (d) severely or among the most extremely ill. In this study, patients were randomly assigned to receive one of four medications: placebo, chlorpromazine, fluphenazine, or thioridazine. Since previous analyses revealed similar effects for the three antipsychotic drug groups, they were combined in the analysis. The experimental design and corresponding sample sizes are listed in Table 2.

As can be seen from Table 2, most of the measurement occurred at weeks 0, 1, 3, and 6, with some scattered measurements at the remaining timepoints.

Table 2 Experimental design and weekly sample sizes

                       Sample size at week
Group                  0    1    2    3    4    5    6
Placebo (n = 108)    107  105    5   87    2    2   70
Drug (n = 329)       327  321    9  287    9    7  265

Note: Drug = Chlorpromazine, Fluphenazine, or Thioridazine.
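Equations (9)–(11) can be illustrated numerically; the thresholds and the linear-predictor value below are hypothetical.

```python
import numpy as np

def Psi(x):
    """Logistic cdf."""
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical thresholds for C = 4 ordered categories (C - 1 = 3 of them,
# strictly increasing), and one observation's linear predictor value.
gamma = np.array([-1.0, 0.5, 2.0])      # gamma_1 < gamma_2 < gamma_3
eta = 0.7                               # x_ij' beta + z_ij' v_i

# lambda_c = gamma_c - eta, with gamma_0 = -inf and gamma_C = +inf appended
# so that the cumulative probabilities run from 0 to 1, as in (11).
lam = np.concatenate([[-np.inf], gamma - eta, [np.inf]])
cum = Psi(lam)                          # cumulative probabilities

# Category probabilities as differences of adjacent cumulative probabilities.
p = np.diff(cum)
print(p, p.sum())                       # four positive probabilities summing to 1
```

Because the thresholds are strictly increasing, every category probability is positive and the four probabilities sum to one.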
Here, a logistic GLMM with random intercept and trend was fit to these data using SAS PROC NLMIXED with adaptive quadrature. Fixed effects included a dummy-coded drug effect (placebo = 0 and drug = 1), a time effect (square root of week; this was used to linearize the relationship between the cumulative logits and week), and a drug by time interaction. Results from this analysis are given in Table 3.

The results indicate that the treatment groups do not significantly differ at baseline (drug effect), the placebo group does improve over time (significant negative time effect), and the drug group has greater improvement over time relative to the placebo group (significant negative drug by time interaction). Thus, the analysis supports use of the drug, relative to placebo, in the treatment of schizophrenia.

Comparing this model to a simpler random-intercepts model (not shown) yields clear evidence of significant variation in both the individual intercept and time-trends (likelihood-ratio χ²₂ = 77.7). Also, a moderate negative association between the intercept and linear time terms is indicated; expressed as a correlation, it equals −.40, suggesting that those patients with the highest initial severity show the greatest improvement across time (e.g., largest negative time-trends). This latter finding could be a result of a floor effect, in that patients with low initial severity scores cannot exhibit large negative time-trends due to the limited range in the ordinal outcome variable. Finally, comparing this model to one that allows nonproportional odds for all model covariates (not shown) supports the proportional odds assumption (χ²₆ = 3.63). Thus, the three covariates (drug, time, and drug by time) have similar effects on the three cumulative logits.

Survival Analysis Models

Connections between ordinal regression and survival analysis models (see Survival Analysis) have led to developments of discrete and grouped-time survival analysis GLMMs [49]. The basic notion is that the time to the event can be considered as an ordinal variable with C possible event times, albeit with right-censoring accommodated. Vermunt [50] also describes related log-linear mixed models for survival analysis or event history analysis.

Nominal Outcomes

Nominal responses occur when the categories of the response variable are not ordered. General regression models for multilevel nominal data have been considered, and Hartzel et al. [22] synthesize much of the work in this area, describing a general mixed-effects model for both clustered ordinal and nominal responses.

In the nominal GLMM, the probability that Y_ij = c (a response occurs in category c) for a given individual i, conditional on the random effects v, is given by:
and nonzero counts, have been developed [21]. A somewhat related model is described by Olsen and Schafer [36], who propose a two-part model that includes a logistic model for the probability of a nonzero response and a conditional linear model for the mean response given that it is nonzero.

Integration over the random-effects distribution

to the location and dispersion of the distribution to be integrated [39].

More computer-intensive methods, involving iterative simulations, can also be used to approximate the integration over the random effects distribution. Such methods fall under the rubric of Markov chain Monte Carlo (MCMC; [15]) algorithms. Use of MCMC for estimation of a wide variety of models has exploded in the last 10 years or so; MCMC solutions for GLMMs are described in [9].

log-likelihood yields ML estimates (which are sometimes referred to as maximum marginal likelihood estimates) of the regression coefficients and the variance–covariance matrix of the random effects v_i.

random-intercepts models or two-level models, for example, and several vary in terms of how the integration over the random effects is performed. However, though the availability of these software programs is relatively recent, they have definitely facilitated application of GLMMs in psychology and elsewhere. The continued development of these models and their software implementations should only lead to greater use and understanding of GLMMs for analysis of correlated nonnormal data.
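A minimal sketch of the quadrature-based integration over a normal random-intercept distribution, for one cluster of a logistic GLMM, using Gauss–Hermite nodes from NumPy; the response vector and linear-predictor values are hypothetical.

```python
import numpy as np

def Psi(x):
    """Logistic cdf."""
    return 1.0 / (1.0 + np.exp(-x))

def marginal_loglik_cluster(y, eta_fixed, sigma, n_points=20):
    """Gauss-Hermite approximation to one cluster's marginal log-likelihood
    for a random-intercept logistic model: the random intercept
    v ~ N(0, sigma^2) is integrated out numerically."""
    nodes, weights = np.polynomial.hermite.hermgauss(n_points)
    v = np.sqrt(2.0) * sigma * nodes            # change of variables
    # Conditional likelihood of the cluster's responses at each node.
    p = Psi(eta_fixed[:, None] + v[None, :])    # shape (n_obs, n_points)
    cond = np.prod(np.where(y[:, None] == 1, p, 1.0 - p), axis=0)
    return float(np.log(np.sum(weights * cond) / np.sqrt(np.pi)))

y = np.array([1, 0, 1, 1])
eta = np.array([0.2, -0.1, 0.4, 0.3])   # hypothetical x_ij' beta values
print(marginal_loglik_cluster(y, eta, sigma=1.0))
```

With sigma = 0 the integral collapses and the result agrees with the ordinary logistic log-likelihood, which gives a simple correctness check; adaptive quadrature, as used by the software discussed above, additionally recenters and rescales the nodes per cluster.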
[12] Fahrmeir, L. & Tutz, G.T. (2001). Multivariate Statistical Modelling Based on Generalized Linear Models, 2nd Edition, Springer-Verlag, New York.
[13] Gibbons, R.D. & Bock, R.D. (1987). Trend in correlated proportions, Psychometrika 52, 113–124.
[14] Gibbons, R.D. & Hedeker, D. (1997). Random-effects probit and logistic regression models for three-level data, Biometrics 53, 1527–1537.
[15] Gilks, W., Richardson, S. & Spiegelhalter, D.J. (1997). Markov Chain Monte Carlo in Practice, Chapman & Hall, New York.
[16] Goldstein, H. (1995). Multilevel Statistical Models, 2nd Edition, Halstead Press, New York.
[17] Goldstein, H. & Rasbash, J. (1996). Improved approximations for multilevel models with binary responses, Journal of the Royal Statistical Society, Series A 159, 505–513.
[18] Goldstein, H., Rasbash, J., Plewis, I., Draper, D., Browne, W. & Wang, M. (1998). A User's Guide to MLwiN, University of London, Institute of Education, London.
[19] Greene, W.H. (1998). LIMDEP Version 7.0 User's Manual (revised edition), Econometric Software, Plainview.
[20] Gruder, C.L., Mermelstein, R.J., Kirkendol, S., Hedeker, D., Wong, S.C., Schreckengost, J., Warnecke, R.B., Burzette, R. & Miller, T.Q. (1993). Effects of social support and relapse prevention training as adjuncts to a televised smoking cessation intervention, Journal of Consulting and Clinical Psychology 61, 113–120.
[21] Hall, D.B. (2000). Zero-inflated Poisson and binomial regression with random effects: a case study, Biometrics 56, 1030–1039.
[22] Hartzel, J., Agresti, A. & Caffo, B. (2001). Multinomial logit random effects models, Statistical Modelling 1, 81–102.
[23] Hedeker, D. (1999). MIXNO: a computer program for mixed-effects nominal logistic regression, Journal of Statistical Software 4(5), 1–92.
[24] Hedeker, D. (2003). A mixed-effects multinomial logistic regression model, Statistics in Medicine 22, 1433–1446.
[25] Hedeker, D. & Gibbons, R.D. (1994). A random-effects ordinal regression model for multilevel analysis, Biometrics 50, 933–944.
[26] Hedeker, D. & Gibbons, R.D. (1996). MIXOR: a computer program for mixed-effects ordinal probit and logistic regression analysis, Computer Methods and Programs in Biomedicine 49, 157–176.
[27] Hedeker, D. & Mermelstein, R.J. (1998). A multilevel thresholds of change model for analysis of stages of change data, Multivariate Behavioral Research 33, 427–455.
[28] Laird, N.M. & Ware, J.H. (1982). Random-effects models for longitudinal data, Biometrics 38, 963–974.
[29] Liang, K.-Y. & Zeger, S.L. (1986). Longitudinal data analysis using generalized linear models, Biometrika 73, 13–22.
[30] Lorr, M. & Klett, C.J. (1966). Inpatient Multidimensional Psychiatric Scale: Manual, Consulting Psychologists Press, Palo Alto.
[31] McCullagh, P. (1980). Regression models for ordinal data (with discussion), Journal of the Royal Statistical Society, Series B 42, 109–142.
[32] McCullagh, P. & Nelder, J.A. (1989). Generalized Linear Models, 2nd Edition, Chapman & Hall, New York.
[33] McCulloch, C.E. & Searle, S.R. (2001). Generalized, Linear, and Mixed Models, Wiley, New York.
[34] McKnight, B. & Van Den Eeden, S.K. (1993). A conditional analysis for two-treatment multiple period crossover designs with binomial or Poisson outcomes and subjects who drop out, Statistics in Medicine 12, 825–834.
[35] Nelder, J.A. & Wedderburn, R.W.M. (1972). Generalized linear models, Journal of the Royal Statistical Society, Series A 135, 370–384.
[36] Olsen, M.K. & Schafer, J.L. (2001). A two-part random effects model for semicontinuous longitudinal data, Journal of the American Statistical Association 96, 730–745.
[37] Pendergast, J.F., Gange, S.J., Newton, M.A., Lindstrom, M.J., Palta, M. & Fisher, M.R. (1996). A survey of methods for analyzing clustered binary response data, International Statistical Review 64, 89–118.
[38] Rabe-Hesketh, S., Pickles, A. & Skrondal, A. (2001). GLLAMM Manual, Technical Report 2001/01, Institute of Psychiatry, King's College, University of London, Department of Biostatistics and Computing.
[39] Rabe-Hesketh, S., Skrondal, A. & Pickles, A. (2002). Reliable estimation of generalized linear mixed models using adaptive quadrature, The Stata Journal 2, 1–21.
[40] Rasch, G. (1960). Probabilistic Models for Some Intelligence and Attainment Tests, Danish Institute of Educational Research, Copenhagen.
[41] Raudenbush, S.W. & Bryk, A.S. (2002). Hierarchical Linear Models in Social and Behavioral Research: Applications and Data-Analysis Methods, 2nd Edition, Sage Publications, Thousand Oaks.
[42] Raudenbush, S.W., Yang, M.-L. & Yosef, M. (2000). Maximum likelihood for generalized linear models with nested random effects via high-order, multivariate Laplace approximation, Journal of Computational and Graphical Statistics 9, 141–157.
[43] Rijmen, F., Tuerlinckx, F., De Boeck, P. & Kuppens, P. (2003). A nonlinear mixed model framework for item response theory, Psychological Methods 8, 185–205.
[44] Rodríguez, G. & Goldman, N. (1995). An assessment of estimation procedures for multilevel models with binary responses, Journal of the Royal Statistical Society, Series A 158, 73–89.
[45] Skrondal, A. & Rabe-Hesketh, S. (2003). Multilevel logistic regression for polytomous data and rankings, Psychometrika 68, 267–287.
[46] Snijders, T. & Bosker, R. (1999). Multilevel Analysis: An Introduction to Basic and Advanced Multilevel Modeling, Sage Publications, Thousand Oaks.
[47] Stukel, T.A. (1993). Comparison of methods for the analysis of longitudinal interval count data, Statistics in Medicine 12, 1339–1351.
[48] Ten Have, T.R. (1996). A mixed effects model for multivariate ordinal response data including correlated discrete failure times with ordinal responses, Biometrics 52, 473–491.
[49] Ten Have, T.R. & Uttal, D.H. (1994). Subject-specific and population-averaged continuation ratio logit models for multiple discrete time survival profiles, Applied Statistics 43, 371–384.
[50] Vermunt, J.K. (1997). Log-linear Models for Event Histories, Sage Publications, Thousand Oaks.

DONALD HEDEKER
Generalized Linear Models (GLM)
BRIAN S. EVERITT
Volume 2, pp. 739–743
y ∼ N(μ, σ²),

where μ = β₀ + β₁x₁ + · · · + β_q x_q. This makes it clear that this model is only suitable for continuous response variables with, conditional on the values of the explanatory variables, a normal distribution with constant variance. Analysis of variance is essentially exactly the same model, with x₁, x₂, . . . , x_q being

Logistic Regression

Logistic regression is a technique widely used to study the relationship between a binary response and a set of explanatory variables. The expected value (μ) of a binary response is simply the probability, π, that the response variable takes the value one (usually
used as the coding for the occurrence of the event of interest, say 'improved'). Modeling this expected value directly as a linear function of explanatory variables, as is done in multiple linear regression, is now clearly not sensible, since it could result in fitted values of the response variable outside the range (0, 1). And, in addition, the error distribution of the response, given the explanatory variables, will clearly not be normal. Consequently, the multiple regression model is adapted by first introducing a transformation of the expected value of the response, g(μ), and then using a more suitable error distribution. The transformation g is called a link function in GLM, and a suitable link function for a binary response is the logistic, or logit, giving the model

logit(π) = log[π / (1 − π)] = β₀ + β₁x₁ + · · · + β_q x_q.  (2)

As π varies from 0 to 1, the logit of π can vary from −∞ to ∞, so overcoming the first problem noted above. Now, we need to consider the appropriate error distribution. In linear regression, the observed value of the response variable is expressed as its expected value, given the explanatory variables, plus an error term. With a binary response, we can express an observed value in the same way, that is:

y = π + ε,  (3)

but here, ε can only assume one of two possible values; if y = 1, then ε = 1 − π with probability π, and if y = 0, then ε = −π with probability 1 − π. Consequently, ε has a distribution with mean zero and variance equal to π(1 − π), that is, a binomial distribution for a single trial (also known as a Bernoulli distribution; see Catalogue of Probability Density Functions).

The Generalized Linear Model

A linear predictor, η, formed from the explanatory variables,

η = β₀ + β₁x₁ + β₂x₂ + · · · + β_q x_q.  (4)

A transformation of the mean, μ, of the response variable called the link function, g(μ). In a GLM, it is g(μ) which is modeled by the linear predictor,

g(μ) = η.  (5)

In multiple linear regression and analysis of variance, the link function is the identity function. Other link functions include the log, logit, probit, inverse, and power transformations, although the log and logit are those most commonly met in practice. The logit link, for example, is the basis of logistic regression.

The distribution of the response variable given its mean μ is assumed to be a distribution from the exponential family; this has the form

f(y; θ, φ) = exp{[yθ − b(θ)] / a(φ) + c(y, φ)},  (6)

for some specific functions a, b, and c, and parameters θ and φ.

For example, in linear regression, a normal distribution is assumed with mean μ and constant variance σ². This can be expressed via the exponential family as follows:

f(y; θ, φ) = [1 / √(2πσ²)] exp{−(y − μ)² / (2σ²)}
           = exp{(yμ − μ²/2) / σ² − [y²/(2σ²) + (1/2) log(2πσ²)]},  (7)

so that θ = μ, b(θ) = θ²/2, φ = σ², and a(φ) = φ. Other distributions in the exponential family include the binomial distribution, Poisson distribution, gamma distribution, and exponential distribution (see Catalogue of Probability Density
Having seen the changes needed to the basic multiple Functions).
linear regression model needed to accommodate a Particular link function in GLMs are naturally
binary response variable, we can now see how the associated with particular error distributions, for
model is generalized in a GLM to accommodate example, the identity link with the Gaussian
a wide range of possible response variables with distribution, the logit with the binomial, and the
differing link functions and error distributions. The log with the Poisson. In these cases, the term
three essential components of a GLM are: canonical link is used.
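The exponential-family form (6) and its normal special case (7) can be checked numerically. The following is an illustrative sketch (plain Python; the function names are ours, not from the entry), evaluating the N(μ, σ²) density both directly and through θ = μ, b(θ) = θ²/2, a(φ) = φ = σ²:

```python
import math

def normal_pdf(y, mu, sigma2):
    """Direct evaluation of the N(mu, sigma2) density."""
    return math.exp(-(y - mu) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

def exp_family_pdf(y, theta, phi):
    """Exponential-family form (6) with the normal choices of a, b, and c:
    b(theta) = theta**2 / 2, a(phi) = phi, and
    c(y, phi) = -(y**2 / phi + log(2*pi*phi)) / 2, as in (7)."""
    b = theta ** 2 / 2
    c = -(y ** 2 / phi + math.log(2 * math.pi * phi)) / 2
    return math.exp((y * theta - b) / phi + c)
```

Setting θ to μ and φ to σ² makes the two functions agree for every y, which is exactly the content of (7).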
The choice of probability distribution determines the relationship between the variance of the response variable (conditional on the explanatory variables) and its mean. This relationship is known as the variance function, denoted V(μ). We shall say more about the variance function later.

Estimation of the parameters in a GLM is usually carried out through maximum likelihood. Details are given in [2, 6]. Having estimated the parameters, the question of the fit of the model for the sample data will need to be addressed. Clearly, a researcher needs to be satisfied that the chosen model describes the data adequately before drawing conclusions and making interpretations about the parameters themselves. In practice, most interest will lie in comparing the fit of competing models, particularly in the context of selecting subsets of explanatory variables so as to achieve a more parsimonious model. In GLMs, a measure of fit is provided by a quantity known as the deviance. This is essentially a statistic that measures how closely the model-based fitted values of the response approximate the observed values; the deviance quoted in most examples of GLM fitting is actually −2 times the maximized log-likelihood for a model, so that differences in deviances of competing models give a likelihood ratio test for comparing the models. A more detailed account of the assessment of fit for GLMs is given in [1].

Table 1 Distribution by months prior to interview of stressful events reported from subjects; 147 subjects reporting exactly one stressful event in the period from 1 to 18 months prior to interview. (Taken with permission from Haberman, 1978)

Time    y
 1     15
 2     11
 3     14
 4     17
 5      5
 6     11
 7     10
 8      4
 9      8
10     10
11      7
12      9
13     11
14      3
15      6
16      1
17      1
18      4

Explicitly, the model to be fitted to the mean number of recalls, μ, is:

log(μ) = β₀ + β₁(time). (8)

Table 2 Results of a Poisson regression on the data in Table 1

Covariates     Estimated regression coefficient   Standard error   Estimate/SE
(Intercept)     2.803                              0.148            18.920
Time           −0.084                              0.017            −4.987

(Dispersion Parameter for Poisson family taken to be 1.)
Null Deviance: 50.84 on 17 degrees of freedom.
Residual Deviance: 24.57 on 16 degrees of freedom.

The variance of the response takes the general form var(y) = φV(μ), where φ is a constant and V(μ) specifies how the variance depends on the mean μ. For the error distributions considered previously, this general form becomes:

(1) Normal: V(μ) = 1, φ = σ²; here the variance does not depend on the mean and so can be freely estimated.
(2) Binomial: V(μ) = μ(1 − μ), φ = 1.
(3) Poisson: V(μ) = μ, φ = 1.

In the case of a Poisson variable, we see that the mean and variance are equal, and in the case

(Figure: number of events remembered, plotted against months prior to interview.)
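The fit reported in Table 2 can be reproduced with a short iteratively reweighted least squares (IRLS) routine for the Poisson log-linear model (8); the following is a sketch in plain NumPy (the variable names are ours, not the original authors' code):

```python
import numpy as np

# Table 1 data: months prior to interview (1-18) and counts of reported events
time = np.arange(1, 19, dtype=float)
y = np.array([15, 11, 14, 17, 5, 11, 10, 4, 8, 10, 7, 9, 11, 3, 6, 1, 1, 4], dtype=float)
X = np.column_stack([np.ones_like(time), time])   # columns: intercept, time

# IRLS for the Poisson GLM with log link: log(mu) = b0 + b1 * time
beta = np.array([np.log(y.mean()), 0.0])          # start from the intercept-only fit
for _ in range(100):
    eta = X @ beta
    mu = np.exp(eta)
    z = eta + (y - mu) / mu                       # working response
    XtW = X.T * mu                                # IRLS weights: W = diag(mu)
    beta_new = np.linalg.solve(XtW @ X, XtW @ z)
    if np.max(np.abs(beta_new - beta)) < 1e-12:
        beta = beta_new
        break
    beta = beta_new

mu = np.exp(X @ beta)
se = np.sqrt(np.diag(np.linalg.inv((X.T * mu) @ X)))         # asymptotic standard errors
residual_deviance = 2 * np.sum(y * np.log(y / mu) - (y - mu))
null_deviance = 2 * np.sum(y * np.log(y / y.mean()))
pearson_dispersion = np.sum((y - mu) ** 2 / mu) / (len(y) - 2)  # Pearson X^2 / df
```

Run as written, this recovers the quantities of Table 2: the coefficient estimates and standard errors, and the null and residual deviances of 50.84 and 24.57. The Pearson-based dispersion estimate is the quantity used later in the entry when checking for overdispersion.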
the method used with Gaussian models. Parameter estimates remain the same, but parameter standard errors are increased by multiplying them by the square root of the estimated dispersion parameter. This process can be carried out manually or, almost equivalently, the overdispersed model can be formally fitted using a procedure known as quasi-likelihood; this allows estimation of model parameters without fully knowing the error distribution of the response variable (see [6] for full technical details of the approach).

When fitting generalized linear models with binomial or Poisson error distributions, overdispersion can often be spotted by comparing the residual deviance with its degrees of freedom. For a well-fitting model, the two quantities should be approximately equal. If the deviance is far greater than the degrees of freedom, overdispersion may be indicated.

An example of the occurrence of overdispersion when fitting a GLM with a log link and Poisson errors is reported in [8], for data consisting of the observation of number of days absent from school during the school year amongst Australian Aboriginal and white children. The explanatory variables of interest in this study were gender, age, type (average or slow learner), and ethnic group (Aboriginal or White). Fitting the usual Poisson regression model resulted in a deviance of 1768 with 141 degrees of freedom, a clear indication of overdispersion. In this model, both gender and type were indicated as being highly significant predictors of number of days absent. But when overdispersion was allowed for in the way described above, both these variables became nonsignificant. A possible reason for overdispersion in these data is the substantial variability in children's tendency to miss days of school that cannot be fully explained by the variables that have been included in the model.

Summary

Generalized linear models provide a very powerful and flexible framework for the application of regression models to medical data. Some familiarity with the basis of such models might allow medical researchers to consider more realistic models for their data rather than to rely solely on linear and logistic regression.

References

[1] Cook, R.J. (1998). Generalized linear models, in Encyclopedia of Biostatistics, P. Armitage & T. Colton, eds, Wiley, Chichester.
[2] Dobson, A.J. (2001). An Introduction to Generalized Linear Models, 2nd Edition, Chapman & Hall/CRC Press, Boca Raton.
[3] Everitt, B.S. (2001). Statistics for Psychologists, LEA, Mahwah.
[4] Greenwood, M. & Yule, G.U. (1920). An inquiry into the nature of frequency-distributions of multiple happenings, Journal of the Royal Statistical Society 83, 255.
[5] Haberman, S. (1978). Analysis of Qualitative Data, Vol. I, Academic Press, New York.
[6] McCullagh, P. & Nelder, J.A. (1989). Generalized Linear Models, 2nd Edition, Chapman & Hall, London.
[7] Nelder, J.A. & Wedderburn, R.W.M. (1972). Generalized linear models, Journal of the Royal Statistical Society, Series A 135, 370–384.
[8] Rabe-Hesketh, S. & Everitt, B.S. (2003). A Handbook of Statistical Analyses Using Stata, Chapman & Hall/CRC Press, Boca Raton.
[9] Seeber, G.U.H. (1998). Poisson regression, in Encyclopedia of Biostatistics, P. Armitage & T. Colton, eds, Wiley, Chichester.

(See also Generalized Additive Model; Generalized Estimating Equations (GEE))

BRIAN S. EVERITT
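The overdispersion diagnostic and the quasi-likelihood correction described in the preceding entry can be illustrated on simulated counts; the following sketch (our own construction, using negative-binomial data that are overdispersed relative to the Poisson by design) fits an intercept-only Poisson model and scales its standard error:

```python
import numpy as np

rng = np.random.default_rng(1)
# Counts with mean 5 but variance about 30: overdispersed relative to Poisson
y = rng.negative_binomial(n=1, p=1 / 6, size=300).astype(float)

# Intercept-only Poisson fit: the MLE of the mean is simply the sample mean
mu = y.mean()
df = len(y) - 1

# Residual deviance (terms with y = 0 contribute nothing) and Pearson dispersion
deviance = 2 * np.sum(np.where(y > 0, y * np.log(y / mu), 0.0) - (y - mu))
dispersion = np.sum((y - mu) ** 2 / mu) / df     # Pearson X^2 / df

# Quasi-likelihood correction: inflate the naive SE of log(mu) by sqrt(dispersion)
naive_se = 1 / np.sqrt(y.sum())                  # Poisson SE of the log-mean estimate
corrected_se = naive_se * np.sqrt(dispersion)
```

For a well-fitting Poisson model both deviance/df and the Pearson dispersion hover near 1; here both are far larger, and the corrected standard error is inflated accordingly, which is the pattern the school-absence example exhibits on a much larger scale.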
Genotype

ROBERT C. ELSTON

Volume 2, pp. 743–744
Goodness of Fit
variance/covariance matrix S has been drawn. Next, define Σ̂₀ to be the maximum likelihood (ML) estimate of a model with free, fixed, and constrained parameters fit to S. The error of fit given by the discrepancy function value F(S; Σ̂₀) contains both sampling error and error of approximation of the model to the population covariance matrix. Analogously, let Σ̃₀ be the estimate of the same model fit to the population variance–covariance matrix Σ. The error of fit in this case is F(Σ; Σ̃₀) and is known as the error of approximation. It contains no sampling error. It is a population parameter of lack of fit of the model to the data. It is never measured directly and is only inferred from the data. F(Σ̂₀; Σ̃₀) is known as the error of estimation, the discrepancy between sample estimate and population estimate for the model.

The chi-squared statistic of an exact fit test is given as

χ²_df = (N − 1)F(S; Σ̂₀) = (N − 1)[ln|Σ̂₀| − ln|S| + tr(Σ̂₀⁻¹S) − p]. (3)

F(S; Σ̂₀) is the minimum value of the discrepancy function when the maximum likelihood estimates are optimized. Assuming the variables have a joint multivariate normal distribution (see Catalogue of Probability Density Functions), this statistic is used to test the hypothesis that the constrained model covariance matrix is the population covariance matrix that generated the sample covariance matrix S. One rejects the null hypothesis that the model under the constraints generated S when χ²_df > c, where c is some constant such that P(χ²_df > c | H₀ is true) = α. Here α is the probability one accepts of making a Type I error.

In many cases, the model fails to fit the data. In this case, chi squared does not have an approximate chi-squared distribution, but a noncentral chi-squared distribution, whose expectation is

E(χ²_df) = df + δ², (4)

where df are the degrees of freedom of the model and δ² the noncentrality parameter for the model. The noncentrality parameter is a measure of the lack of fit of the model for samples of size N. Thus, an unbiased estimate of the noncentrality parameter is given by

δ̂² = χ²_df − df. (5)

McDonald [8] recommended that instead of using this as a fundamental index of lack of fit, one should normalize this estimate by dividing it by (N − 1) to control for the effect of sample size. Thus, the normalized noncentrality estimate is given by

Δ̂ = δ̂²/(N − 1) = (χ²_df − df)/(N − 1) = F̂_ML − df/(N − 1). (6)

The population raw noncentrality δ² and the population normalized noncentrality Δ are related as δ² = (N − 1)Δ. As N increases without bound, and Δ > 0, the noncentrality parameter is undefined in the limit.

Browne and Cudeck [3] argue that Δ̂ is a less biased estimator of the normalized population discrepancy Δ = F(Σ; Σ̃₀) than is the raw F̂_ML = F(S; Σ̂₀) = χ²_df/(N − 1), which has for its expectation

E(F̂_ML) = F(Σ; Σ̃₀) + df/(N − 1). (7)

In fact,

E(Δ̂) = E(F̂_ML − df/(N − 1)) = E(F̂_ML) − df/(N − 1) = F(Σ; Σ̃₀). (8)

So, the estimated normalized noncentrality parameter is an unbiased estimate of the population normalized discrepancy.

Several indices of approximation are based on the noncentrality and normalized noncentrality parameters. Bentler [1] and McDonald and Marsh [9] simultaneously defined an index given the name FI by Bentler:

FI = (δ̂²_null − δ̂²_k)/δ̂²_null
   = [(χ²_null − df_null) − (χ²_k − df_k)]/(χ²_null − df_null)
   = (Δ̂_null − Δ̂_k)/Δ̂_null. (9)

χ²_null is the chi squared of a null model in which one hypothesizes that the population covariance matrix is a diagonal matrix with zero off-diagonal elements and free diagonal elements. Each nonzero covariance between any pair of variables in the data will produce a lack of fit for the corresponding zero covariance in this model. Hence, the lack of fit χ²_null of this model
can serve as an extreme norm against which to compare the lack of fit of model k, which is actually hypothesized. The difference in lack of fit between the null model and model k is compared to the lack of fit of the null model itself. The result on the right is obtained by dividing the numerator and the denominator by (N − 1). The index depends on unbiased estimates and is itself relatively free of bias at different sample sizes. Bentler [1] further corrected the FI index to be 0 when it became occasionally negative and to be 1 when it occasionally exceeded 1. He called the resulting index the CFI (comparative fit index). A common rule of thumb is that models with CFI ≥ .95 are acceptable approximations.

Another approximation index, first recommended by Steiger and Lind [11] but popularized by Browne and Cudeck [3], is the RMSEA (root mean squared error of approximation) index, given by

RMSEA = √(Max[(χ²_dfk − df_k)/((N − 1)df_k), 0]) = √(Max[Δ̂_k/df_k, 0]). (10)

This represents the square root of the estimated normalized noncentrality of the model divided by the model's degrees of freedom. In other words, it is the average normalized noncentrality per degree of freedom. Although some have asserted that this represents the noncentrality adjusted for model parsimony, this is not the case. A model may introduce constraints and be more parsimonious with more degrees of freedom, and the average discrepancy per additional degree of freedom may not change. The RMSEA index ranges between 0 (perfect fit) and infinity. A value of RMSEA ≤ .05 is considered to be acceptable approximate fit. Browne and Cudeck [3] indicate that a confidence interval estimate for the RMSEA is available to indicate the precision of the RMSEA estimate.

Another popular index, first popularized by Jöreskog and Sörbom's LISREL program [7] (see Structural Equation Modeling: Software), is inspired by Fisher's intraclass correlation [5]:

R² = 1 − (Error Variance/Total Variance). (11)

The GFI index computes error as the sum of (weighted and possibly transformed) squared differences between the elements of the observed variance/covariance matrix S and those of the estimated model variance/covariance matrix Σ̂₀ and compares this sum to the total sum of squares of the elements in S. The matrix (S − Σ̂₀) is symmetric and produces the element-by-element differences between S and Σ̂₀. W is a transformation matrix that weights and combines the elements of these matrices, depending on the method of estimation. Thus, we have

GFI = 1 − tr{[W^(1/2)(S − Σ̂₀)W^(1/2)]²} / tr{[W^(1/2)SW^(1/2)][W^(1/2)SW^(1/2)]}, (12)

where Σ̂₀ is the model variance/covariance matrix, S is the unrestricted, sample variance/covariance matrix, and

W = I (unweighted least squares), S⁻¹ (weighted least squares), Σ̂₀⁻¹ (maximum likelihood). (13)

For maximum likelihood estimation, the GFI simplifies to

GFI_ML = 1 − tr[(Σ̂₀⁻¹S − I)(Σ̂₀⁻¹S − I)] / tr(Σ̂₀⁻¹S Σ̂₀⁻¹S). (14)

A rule of thumb is again to consider a GFI > 0.95 to be an acceptable approximation. Hu and Bentler [6] found that the GFI tended to underestimate its asymptotic value in small samples, especially when the latent variables were interdependent. Furthermore, the maximum likelihood (ML) and generalized least squares (GLS) variants of the index performed poorly in samples less than 250.

Steiger [10] has suggested a variant of the GFI such that, under a general condition where the model is invariant under a constant scaling function, the GFI has a known population parameter

γ₁ = p / [2F_ML(Σ; Σ̃₀) + p] (15)

to estimate. Note that as F(Σ; Σ̃₀) becomes close to zero, this index approaches unity, whereas when F(Σ; Σ̃₀) is greater than zero and increasing, this
parameter declines toward zero, with its becoming zero when F(Σ; Σ̃₀) is infinitely large. Steiger shows that

γ̂₁ = p / [2F_ML(S; Σ̂₀) + p] (16)

is equivalent to the GFI_ML and an estimate of γ₁. But it is a biased estimate, for the expectation of γ̂₁ is approximately

E(γ̂₁) ≈ p / [2F_ML(Σ; Σ̃₀) + 2df/(N − 1) + p]. (17)

The bias leads γ̂₁ to underestimate γ₁, but the bias diminishes as sample size N becomes large relative to the degrees of freedom of the model. Steiger [10] and Browne and Cudeck [3] report a confidence interval estimate using γ̂₁ that may be used to test hypotheses about γ₁.

There are numerous other indices of approximate fit, but those described here are the most popular. Goodness of fit should not be the only criterion for evaluating a model. Models with zero degrees of freedom always fit perfectly as a mathematical necessity and, thus, are useless for testing hypotheses. Besides having acceptable fit, the model should be parsimonious in having numerous degrees of freedom relative to the number of nonredundant elements in the variance–covariance matrix, and should be realistic in representing processes in the phenomenon modeled.

References

[1] Bentler, P.M. (1990). Comparative fit indexes in structural models, Psychological Bulletin 107, 238–246.
[2] Browne, M.W. (1982). Covariance structures, in Topics in Applied Multivariate Analysis, D.M. Hawkins, ed., University Press, Cambridge.
[3] Browne, M.W. & Cudeck, R. (1993). Alternative ways of assessing model fit, in Testing Structural Equation Models, K.A. Bollen & J.S. Long, eds, Sage Publications, Newbury Park, pp. 136–162.
[4] Cudeck, R. & Henley, S.J. (1991). Model selection in covariance structures analysis and the problem of sample size: a clarification, Psychological Bulletin 109, 512–519.
[5] Fisher, R.A. (1925). Statistical Methods for Research Workers, Oliver and Boyd, London.
[6] Hu, L.-T. & Bentler, P.M. (1995). Evaluating model fit, in Structural Equation Modeling: Concepts, Issues and Applications, R.H. Hoyle, ed., Sage Publications, Thousand Oaks.
[7] Jöreskog, K.G. & Sörbom, D. (1981). LISREL V: Analysis of Linear Structural Relationships by the Method of Maximum Likelihood, National Educational Resources, Chicago.
[8] McDonald, R.P. (1989). An index of goodness of fit based on noncentrality, Journal of Classification 6, 97–103.
[9] McDonald, R.P. & Marsh, H.W. (1990). Choosing a multivariate model: noncentrality and goodness-of-fit, Psychological Bulletin 107, 247–255.
[10] Steiger, J.H. (1995). Structural Equation Modeling (SEPATH), in Statistica/W (Version 5), Statsoft, Inc., Tulsa, OK, pp. 3539–3688.
[11] Steiger, J.H. & Lind, J.C. (1980). Statistically based tests for the number of factors, Paper presented at the Annual Spring Meeting of the Psychometric Society, Iowa City.
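The indices described in this entry can be computed directly from a sample covariance matrix and a fitted model matrix. The following worked sketch uses an invented 2 × 2 example (the matrices, sample size, and degrees of freedom are ours, chosen for illustration): the hypothesized "model" fixes the covariance at 0.4, and the null model is the diagonal matrix described under (9).

```python
import numpy as np

def f_ml(S, Sigma0):
    """ML discrepancy F(S; Sigma0) = ln|Sigma0| - ln|S| + tr(Sigma0^-1 S) - p (eq. 3)."""
    p = S.shape[0]
    return (np.log(np.linalg.det(Sigma0)) - np.log(np.linalg.det(S))
            + np.trace(np.linalg.inv(Sigma0) @ S) - p)

def gfi_ml(S, Sigma0):
    """GFI for maximum likelihood estimation (eq. 14)."""
    B = np.linalg.inv(Sigma0) @ S
    A = B - np.eye(S.shape[0])
    return 1 - np.trace(A @ A) / np.trace(B @ B)

N = 200
S = np.array([[1.0, 0.5], [0.5, 1.0]])            # sample covariance matrix (invented)
Sigma_k = np.array([[1.0, 0.4], [0.4, 1.0]])      # model k: covariance fixed at 0.4, df_k = 1
Sigma_null = np.diag(np.diag(S))                  # null model: diagonal, df_null = 1

chi2_k = (N - 1) * f_ml(S, Sigma_k)               # eq. (3)
chi2_null = (N - 1) * f_ml(S, Sigma_null)
df_k = df_null = 1

d2_k = chi2_k - df_k                              # noncentrality estimates, eq. (5)
d2_null = chi2_null - df_null
cfi = min(1.0, max(0.0, (d2_null - d2_k) / d2_null))   # FI of eq. (9), clipped to [0, 1]
rmsea = np.sqrt(max(d2_k / ((N - 1) * df_k), 0.0))     # eq. (10)
```

A saturated model (Sigma0 equal to S) gives F_ML = 0 and GFI = 1 exactly, while the fixed-covariance model yields a nonzero noncentrality, a CFI just under .95, and an RMSEA above the .05 rule of thumb.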
Goodness of Fit for Categorical Variables
superior to other values of λ. It leads to a statistic that keeps the α level better, and has better small-sample power characteristics.

Comparisons of the two best known of these six statistics, the Pearson X² and the likelihood ratio X² (see Contingency Tables), have shown that the Pearson statistic is often closer to the χ² distribution than G². However, G² has better decomposition characteristics than X². Therefore, decompositions of the effects in cross-classifications (see Chi-Square Decomposition) and comparisons of hierarchically related log-linear models are typically performed using the G² statistic.

There exists a number of other goodness-of-fit tests. These include, for instance, the Kolmogorov–Smirnov test, the Cramér–von Mises test, and runs tests.

Testing Cellwise Goodness-of-fit

Testing goodness-of-fit for individual cells has been proposed for at least three reasons. First, the distribution of residuals in cross-classifications that are evaluated using particular probability models can be very uneven, such that the residuals are large for a small number of cells and rather small for the remaining cells. Attempts at improving model fit then focus on reducing the discrepancies in the cells with the large residuals. Second, cells with large residuals can be singled out and declared structural frequencies, that is, fixed, and not taken into account when the expected frequencies are estimated. Typically, model fit improves considerably when outlandish cells are fixed. Third, cell-specific goodness-of-fit indicators are examined with the goal of finding types and antitypes in configural frequency analysis (see von Eye in this encyclopedia, [10, 16, 17]). Types and antitypes are then interpreted with respect to the probability model that was used to explain a cross-classification. Different probability models can yield different patterns of types and antitypes. In either context, decisions are made concerning single cells or groups of cells.

The most popular measures of cellwise lack of fit include the raw residual, the Pearson residual, the standardized Pearson residual, and the adjusted residual. The raw residual is defined as the difference between the observed and the expected cell frequencies, that is, ni − mi. This measure indicates the number of cases by which the observed frequency distribution and the probability model differ in Cell i. The Pearson residual for Cell i is the ith summand of the overall X² given above.

For the standardized Pearson residual, one can find different definitions in the literature. According to Agresti [1], for Cell i, which has an estimated leverage of hi, the standardized Pearson residual is

ri = (ni − mi) / √[mi(1 − mi/N)(1 − hi)], (7)

where hi is defined as the diagonal element of the hat matrix (for more detail on the hat matrix, see [11]). The absolute value of the standardized Pearson residual is slightly larger than the square root of the Pearson residual (which is often called the standardized residual; see [8]), and it is approximately normally distributed if the model holds.

The adjusted residual [5] is a standardized residual that is divided by its standard deviation. Adjusted residuals are typically closer to the normal distribution than √X². Deviance residuals are the components of the likelihood ratio statistic, G², given above. Exact residual tests can be performed using, for instance, the binomial test and the hypergeometric test. The latter requires product-multinomial sampling (see Sampling Issues in Categorical Data).

The characteristics and performance of these and other residual tests have been examined in a number of studies (e.g., [6, 9, 17, 19, 20]). The following list presents a selection of repeatedly reported results of comparison studies.

1. The distribution of the adjusted residual is closer to the normal distribution than that of the standardized residual.
2. Both the approximative and the exact tests tend to be conservative when the expected cell frequencies are estimated from the sample marginals.
3. As long as cell frequencies are small (less than about ni = 100), the distribution of residuals tends to be asymmetric such that positive residuals are more likely than negative residuals; for larger cell sizes, this ratio is inverted; this applies to tables of all sizes greater than 2 × 2 and under multinomial as well as product-multinomial sampling.
4. The α-curves of the residuals suggest conservative decisions as long as cell sizes are small.
5. The curves of the residuals also suggest that large sample sizes are needed to make sure the β-error is not severely inflated; this applies to tables of varying sizes, to both multinomial and product-multinomial sampling, as well as to the α-levels of 0.05 and 0.01.
6. None of the tests presented here and the other tests that were also included in some of the comparison studies consistently outperformed all others; that is, the tests are differentially sensitive to characteristics of data and tables.

Data Example

The following data example presents a reanalysis of data from a project on the mental health outcomes of women experiencing domestic violence [2]. For the example, we attempt to predict Anxiety (A) from Poverty (Po), Psychological Abuse (Ps), and Physical Abuse (Ph). Anxiety and Poverty were dichotomized at the median, and Psychological and Physical Abuse were dichotomized at the score of 0.7 (to separate the no-abuse cases from the abuse cases). For each variable, a 1 indicates a low score, and a 2 indicates a high score.

To test the prediction hypotheses, we estimated the hierarchical log-linear model [A, Po], [A, Ps], [A, Ph], [Po, Ps, Ph]. This model is equivalent to the logistic regression model that predicts A from Po, Ps, and Ph. The cross-tabulation of the four variables, including the observed and the estimated expected cell frequencies, appears in Table 1.

Table 1 Prediction of anxiety from poverty, psychological violence, and physical violence

Variable patterns          Frequencies
A   Po   Ps   Ph      Observed   Expected
1    1    1    1         6         6.93
1    1    1    2         3         2.07
1    1    2    1        13        11.64
1    1    2    2        13        14.36
1    2    1    1         2         1.11
1    2    1    2         0         0.89
1    2    2    1         1         2.32
1    2    2    2         9         7.68
2    1    1    1         4         3.96
2    1    1    2         1         1.04
2    1    2    1        11        11.47
2    1    2    2        13        12.53
2    2    1    1         0         0.00
2    2    1    2         0         0.00
2    2    2    1         5         4.57
2    2    2    2        13        13.43

Table 1 suggests that the observed cell frequencies are relatively close to the estimated expected cell frequencies. Indeed, the overall goodness-of-fit likelihood ratio X² = 4.40 (df = 4; p = 0.35) indicates no significant model-data discrepancies. The Pearson X² = 3.50 (df = 4; p = 0.48) leads one to the same conclusion. The values of these two overall goodness-of-fit measures are the same if a model is the true one. In the present case, the values of the test statistics are not exactly the same, but since they are both small and suggest the same statistical decision, there is no reason to alter the model. Substantively, we find that the two abuse variables are significant predictors of Anxiety. In contrast, Poverty fails to make a significant contribution.

Table 2 presents the raw, the adjusted, the deviance, the Pearson, and the Freeman–Tukey residuals for the above model.

In the following paragraphs, we discuss four characteristics of the residuals in Table 2. First, the various residual measures indicate that no cell qualifies as extreme. None of the residuals that are distributed either normally or as χ² has values that would indicate that a cell deviates from the model (to make this statement, we use the customary thresholds of 2 for normally distributed residuals and 4 for χ²-distributed residuals). Before retaining a model, researchers can do worse than inspecting residuals for local model-data deviations.

Second, the Pearson X² is the only one that does not indicate the direction of the deviation. From inspecting the Pearson residual alone, one cannot determine whether an observed frequency is larger or smaller than the corresponding estimated expected one.

Third, the correlations among the four arrays of residuals vary within a narrow range, thus indicating that the measures are sensitive largely to the same characteristics of model-data discrepancies. Table 3 displays the correlation matrix. The correlations in Table 3 are generally very high. Only the correlations with Pearson's measure are low. The reason for this is that the Pearson scores are positive by definition. Selecting only the positive
residuals, the correlations with the Pearson residuals would be high also.

Table 3 Intercorrelations of the residuals in Table 2

               Raw     Adjusted   Deviance   Pearson
Adjusted       0.992   1.000
Deviance       0.978   0.975      1.000
Pearson        0.249   0.314      0.297      1.000
Freeman–Tukey  0.946   0.971      0.940      0.344

Fourth, the standard deviations of the residual scores are different than 1. Table 4 displays descriptive measures for the variables in Table 2. It is a well-known result that the standard deviations of residuals can be less than 1 when a model fits. The Freeman–Tukey standard deviation is clearly less than 1. The deviance residual has a standard deviation greater than one. This is an unusual result and may be specific to the data used for the present example. In general, the deviance residual is less variable than N(0, 1), but it can be standardized.

References

[1] Agresti, A. (2002). Categorical Data Analysis, 2nd Edition, Wiley, New York.
[2] Bogat, G.A., Levendosky, A.A., De Jonghe, E., Davidson, W.S. & von Eye, A. (2004). Pathways of suffering: the temporal effects of domestic violence, Maltrattamento e abuso all'infanzia 6(2), 97–112.
[3] Cressie, N. & Read, T.R.C. (1984). Multinomial goodness-of-fit tests, Journal of the Royal Statistical Society, Series B 46, 440–464.
[4] Freeman, M.F. & Tukey, J.W. (1950). Transformations related to the angular and the square root, Annals of Mathematical Statistics 21, 607–611.
[5] Haberman, S.J. (1973). The analysis of residuals in cross-classifications, Biometrics 29, 205–220.
[6] Koehler, K.J. & Larntz, K. (1980). An empirical investigation of goodness-of-fit statistics for sparse multinomials, Journal of the American Statistical Association 75, 336–344.
[7] Kullback, S. (1959). Information Theory and Statistics, Wiley, New York.
[8] Lawal, B. (2003). Categorical Data Analysis with SAS and SPSS Applications, Lawrence Erlbaum, Mahwah, NJ.
[9] Lawal, H.B. (1984). Comparisons of X², G², Y², Freeman–Tukey, and Williams improved G² test statistics in small samples of one-way multinomials, Biometrika 71, 415–458.
[10] Lienert, G.A. & Krauth, J. (1975). Configural frequency analysis as a statistical tool for defining types, Educational and Psychological Measurement 35, 231–238.
[11] Neter, J., Kutner, M.H., Nachtsheim, C.J. & Wasserman, W. (1996). Applied Linear Statistical Models, 4th Edition, Irwin, Chicago.
[12] Neyman, J. (1949). Contribution to the theory of the χ² test, in Proceedings of the First Berkeley Symposium on Mathematical Statistics and Probability, J. Neyman, ed., University of California Press, Berkeley, pp. 239–273.
[13] Read, T.R.C. & Cressie, N. (1988). Goodness-of-Fit Statistics for Discrete Multivariate Data, Springer-Verlag, New York.
[14] SYSTAT (2002). SYSTAT 10.2, SYSTAT Software, Richmond.
[15] Upton, G.J.G. (1978). The Analysis of Cross-Tabulated Data, Wiley, Chichester.
[16] von Eye, A. (2002a). Configural Frequency Analysis – Methods, Models, and Applications, Lawrence Erlbaum, Mahwah.
[17] von Eye, A. (2002b). The odds favor antitypes – A comparison of tests for the identification of configural types and antitypes, Methods of Psychological Research – Online 7, 1–29.
[18] von Weber, S., Lautsch, E. & von Eye, A. (2003). Table-specific continuity corrections for configural frequency analysis, Psychology Science 45, 355–368.
[19] von Weber, S., von Eye, A. & Lautsch, E. (2004). The Type II error of measures for the analysis of 2 × 2 tables, Understanding Statistics 3, 259–282.
[20] West, E.N. & Kempthorne, O. (1972). A comparison of the χ² and likelihood ratio tests for composite alternatives, Journal of Statistical Computation and Simulation 1, 1–33.
[21] Wickens, T. (1989). Multiway Contingency Tables Analysis for the Social Sciences, Lawrence Erlbaum, Hillsdale.

(See also Configural Frequency Analysis)

ALEXANDER VON EYE, G. ANNE BOGAT AND STEFAN VON WEBER
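The overall fit statistics and several of the cellwise residuals for Table 1 can be recomputed directly from the observed and estimated expected frequencies; the following sketch is plain Python (the Freeman–Tukey residual is taken in its usual √n + √(n+1) − √(4m+1) form, an assumption on our part, since the entry does not spell out the formula):

```python
import math

# Observed and expected frequencies from Table 1
obs = [6, 3, 13, 13, 2, 0, 1, 9, 4, 1, 11, 13, 0, 0, 5, 13]
exp = [6.93, 2.07, 11.64, 14.36, 1.11, 0.89, 2.32, 7.68,
       3.96, 1.04, 11.47, 12.53, 0.00, 0.00, 4.57, 13.43]

# Pearson X^2: cells with expected frequency 0 (and observed 0) contribute nothing
X2 = sum((n - m) ** 2 / m for n, m in zip(obs, exp) if m > 0)

# Likelihood ratio G^2: terms with n = 0 contribute 0 (n * log(n/m) -> 0)
G2 = 2 * sum(n * math.log(n / m) for n, m in zip(obs, exp) if n > 0 and m > 0)

# Cellwise residuals
raw = [n - m for n, m in zip(obs, exp)]                      # raw residual n_i - m_i
pearson = [(n - m) / math.sqrt(m) if m > 0 else 0.0          # Pearson residual
           for n, m in zip(obs, exp)]
freeman_tukey = [math.sqrt(n) + math.sqrt(n + 1) - math.sqrt(4 * m + 1)
                 for n, m in zip(obs, exp)]
```

This reproduces the two overall statistics (Pearson X² ≈ 3.50, likelihood ratio G² ≈ 4.40 on 4 degrees of freedom), and every cellwise residual stays well inside the customary threshold of 2, matching the conclusion that no cell qualifies as extreme.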
Gosset, William Sealy

ROGER THOMAS

Volume 2, pp. 753–754
Graphical Chain Models

NANNY WERMUTH

Volume 2, pp. 755–757
confounders, and that there is no interactive effect on the response by an unobserved variable and the intervening variable. One consequence of these assumptions is that, for linear models, the effect of the intervening variable on the response averaged over past variables coincides with its conditional effects given past unobserved variables. Some authors have named this a "causal effect". For a comparison of different definitions of causality from a statistical viewpoint, including many references, and for the use of graphical Markov models in this context, see [3].

References

[1] Badsberg, J.H. (2004). DynamicGraph: Interactive Graphical Tool for Manipulating Graphs, URL: http://cran.r-project.org.
[2] Cox, D.R. & Wermuth, N. (1996). Multivariate Dependencies: Models, Analysis, and Interpretation, Chapman & Hall, London.
[3] Cox, D.R. & Wermuth, N. (2004). Causality: a statistical view, International Statistical Review 72, 285–305.
[4] Cramer, H. (1946). Mathematical Methods of Statistics, Princeton University Press, Princeton.
[5] Dahlhaus, R. (2000). Graphical interaction models for multivariate time series, Metrika 51, 157–172.
[6] Dobra, A. (2003). Markov bases for decomposable graphical models, Bernoulli 9, 1093–1108.
[7] Edwards, D. (2000). Introduction to Graphical Modelling, 2nd Edition, Springer, New York.
[8] Eichler, M., Dahlhaus, R. & Sandkuhler, J. (2003). Partial correlation analysis for the identification of synaptic connections, Biological Cybernetics 89, 289–302.
[9] Fried, R. & Didelez, V. (2003). Decomposability and selection of graphical models for time series, Biometrika 90, 251–267.
[10] Gibbs, W. (1902). Elementary Principles of Statistical Mechanics, Yale University Press, New Haven.
[11] Green, P.J., Hjort, N.L. & Richardson, S. (2003). Highly Structured Stochastic Systems, Oxford University Press, Oxford.
[12] Grzebyk, M., Wild, P. & Chouaniere, D. (2003). On identification of multi-factor models with correlated residuals, Biometrika 91, 141–151.
[13] Koster, J.T.A. (1999). On the validity of the Markov interpretation of path diagrams of Gaussian structural equations systems with correlated errors, Scandinavian Journal of Statistics 26, 413–431.
[14] Koster, J.T.A. (2002). Marginalizing and conditioning in graphical models, Bernoulli 8, 817–840.
[15] Lauritzen, S.L. (1996). Graphical Models, Oxford University Press, Oxford.
[16] Lauritzen, S.L. & Sheehan, N.A. (2003). Graphical models for genetic analyses, Statistical Science 18, 489–514.
[17] Lindley, D.V. (2002). Seeing and doing: the concept of causation, International Statistical Review 70, 191–214.
[18] Marchetti, G.M. (2004). R functions for computing graphs induced from a DAG after marginalization and conditioning, Proceedings of the American Statistical Association, Alexandria, VA.
[19] Marchetti, G.M. & Drton, M. (2003). GGM: An R Package for Gaussian Graphical Models, URL: http://cran.r-project.org.
[20] Markov, A.A. (1912). Wahrscheinlichkeitsrechnung (German translation of 2nd Russian edition: A.A. Markoff, ed., 1908), Teubner, Leipzig.
[21] Pearl, J. (1998). Graphs, causality and structural equation models, Sociological Methods and Research 27, 226–284.
[22] Richardson, T.S. & Spirtes, P. (2002). Ancestral graph Markov models, Annals of Statistics 30, 962–1030.
[23] Stanghellini, E. (1997). Identification of a single-factor model using graphical Gaussian rules, Biometrika 84, 241–254.
[24] Stanghellini, E. & Wermuth, N. (2004). On the identification of path analysis models with one hidden variable, Biometrika, to appear.
[25] Vicard, P. (2000). On the identification of a single-factor model with correlated residuals, Biometrika 84, 241–254.
[26] Wermuth, N. (1998). Graphical Markov models, in Encyclopedia of Statistical Sciences, Second Update Volume, S. Kotz, C. Read & D. Banks, eds, Wiley, New York, pp. 284–300.
[27] Wermuth, N. & Cox, D.R. (2001). Graphical models: overview, in International Encyclopedia of the Social and Behavioral Sciences, Vol. 9, P.B. Baltes & N.J. Smelser, eds, Elsevier, Amsterdam, pp. 6379–6386.
[28] Wermuth, N. & Cox, D.R. (2004). Joint response graphs and separation induced by triangular systems, Journal of the Royal Statistical Society, Series B 66, 687–717.
[29] Whittaker, J. (1990). Graphical Models in Applied Multivariate Statistics, Wiley, Chichester.
[30] Wright, S. (1921). Correlation and causation, Journal of Agricultural Research 20, 162–177.

(See also Markov Chain Monte Carlo and Bayesian Statistics)

NANNY WERMUTH
Graphical Methods pre-20th Century
LAURENCE D. SMITH
Volume 2, pp. 758–762
methods; a preference for the numerical precision of tabled numbers; and a general Platonic distrust of visual images as sources of reliable knowledge. The view that statistical graphics represent a vulgar substitute for rigorous numerical methods may have been abetted by Playfair himself, who touted his graphs as a means of communicating data to politicians and businessmen. It seems likely that the disdain of many scientists for graphical methods also stemmed from the dual roots of science in academic natural philosophy and the less prestigious tradition of mechanical arts, of which Playfair was a part [5, 19]. Only when these two traditions successfully merged in the nineteenth century, combining Baconian hands-on manipulation of data with academic mathematical theory, did graphical methods achieve widespread acceptance in science. In Playfair's time, the more common response to graphs was that of the French statistician Jacques Peuchet, who dismissed graphs as mere by-plays of the imagination that are "foreign to the aims of science" [quoted in 10, p. 295]. Such resistance to graphical methods – which never waned entirely, even during their rise to popularity during the late nineteenth century – is reflected in the fact that the first published graph of a normal distribution [6] appeared more than a century after De Moivre had determined the mathematical properties of the normal curve [10].

The use of graphs spread slowly through the first half of the nineteenth century, but not without significant developments. In 1811, the German polymath Alexander von Humboldt, acknowledging Playfair's precedence, published a variety of graphs in his treatise on the Americas. Displaying copious data on the geography, geology, and climate of the New World, he used line graphs as well as divided bar graphs, the latter his own invention. Humboldt echoed Playfair's judgment of the cognitive efficiency of graphs, praising their ability to "speak to the senses without fatiguing the mind" [quoted in 3, p. 223] and defending them against the charge of being mere trifles "foreign to science" [quoted in 10, p. 95]. In 1821, the French mathematician J. B. J. Fourier, known for his method of decomposing waveforms, used data from the 1817 Paris census to produce the first published cumulative frequency distribution, later given the name "ogive" by Francis Galton. The application of graphical analysis to human data was further explored by Fourier's student Adolphe Quetelet in a series of publications beginning in the 1820s. These included line charts of birth rates, mortality curves, and probability distributions fitted to histograms of empirical data [32].

Graphical methods also entered science from a different direction with the application of self-registering instruments to biological phenomena. Notable was Carl Ludwig's 1847 invention of the kymograph, which quickly came into common use as a way to make visible a range of effects that were invisible to the naked eye. Hermann von Helmholtz achieved acclaim in the 1850s by touring Europe with his Froschcurven, myographic records of the movements of frog muscles. These records included the graphs by which he had measured the speed of the neural impulse, one of the century's most celebrated scientific achievements and one that, as Helmholtz recognized, depended crucially on graphical methods [4, 15]. By midcentury, graphical methods had also gained the attention of philosophers and methodologists. In his influential Philosophy of the Inductive Sciences (1837–60), William Whewell hailed the graphical method – which he called "the method of curves" – as a fundamental means of discovering the laws of nature, taking its place alongside the traditional inductive methods of Bacon. Based partly on his own investigations of the tides, Whewell judged the method of curves superior to numerical methods, for when curves are drawn, "the eye often spontaneously detects the law of recurrence in their sinuosities" [36, p. 405]. For such reasons, he even favored the graphical method over the newly developed method of least squares, which was also treated in his text.

The Golden Age of Graphics

The second half of the nineteenth century saw an unprecedented flourishing of graphical methods, leading to its designation as the Golden Age of graphics. According to Funkhouser [10], this period was marked by enthusiasm for graphs not only among scientists and statisticians but also among engineers (notably the French engineers Cheysson and Minard), government officials, and the public. The standardization imposed by the government bureaucracies of the time produced torrents of data well suited to graphical treatment [26]. Under Quetelet's leadership, a series of International Statistical Congresses from 1853 to 1876 staged massive exhibitions of graphical displays (a partial listing of the charts at the 1876
Congress cited 686 items), as well as attempts to standardize the nomenclature of graphs and the rules for their construction. The Golden Age also saw the first published systems for classifying graphical forms, as well as a proliferation of novel graphical formats. In 1857, Florence Nightingale produced coxcomb plots for displaying the mortality of British soldiers across the cycle of months, a format that survives as today's rose plots. In 1878, the Italian statistician Luigi Perozzo devised perspective plots called stereograms in which complex relations of variables (such as probability of marriage by age and sex) were shown as three-dimensional surfaces. When the results of the ninth U.S. Census were published in 1874, they included such now-standard formats as population pyramids and bilateral frequency polygons. The 1898 report of the eleventh U.S. Census, published toward the end of the Golden Age, contained over 400 graphs and statistical maps in a wide variety of formats, many of them in color. The widespread acceptance of graphs by the end of this era was also signaled by the attention they drew from leading statisticians. During the 1890s, Karl Pearson, then a rising star in the world of statistics, delivered a series of lectures on graphical methods at Gresham College. In them, he treated dozens of graphical formats, challenged the "erroneous opinion that graphs are but a means of popular presentation", and described the graphical method as "a fundamental method of investigating and analyzing statistical material" [23, p. 142, emphasis in original].

The spread of graphs among political and economic statisticians during the Golden Age was paralleled by their growing currency in the natural sciences. Funkhouser reports that graphs became "an important adjunct of almost every kind of scientific gathering" [10, p. 330]. Their use was endorsed by leading scientists such as Ernst Mach and Emil du Bois-Reymond on the Continent and Willard Gibbs in America. For his part, Gibbs saw the use of graphs as central to the breakthroughs he achieved in thermodynamics; in fact, his first paper on the subject concerned the design of optimal graphs for displaying abstract physical quantities [14]. The bible of the burgeoning graphics movement was Etienne-Jules Marey's 1878 masterwork, La methode graphique [22]. This richly illustrated tome covered both statistical graphs and instrument-generated recordings, and included polemics on the cognitive and epistemological advantages of graphical methods. It was one of many late nineteenth-century works that hailed graphs as the new langue universelle of science – a visual language that, true to the positivist ideals of the era, would enhance communication between scientists while neutralizing national origins, ideological biases, and disciplinary boundaries. In 1879, the young G. Stanley Hall, soon to emerge as a leading figure of American psychology, reported in The Nation that the graphic method – a method said to be superior to all other modes of describing many phenomena – was fast becoming "the international language of science" [13, p. 238]. Having recently toured Europe's leading laboratories (including Ludwig's in Leipzig), Hall also reported on the pedagogical applications of graphs he had witnessed at European universities. In an account foreshadowing today's instructional uses of computerized graphics, he wrote that the graphical method had converted the lecture room into "a sort of theatre, where graphic charts are the scenery, changed daily with the theme" [13, p. 238]. Hall himself would later make extensive use of graphs, including some sophisticated charts with multiple ordinates, in his classic two-volume work Adolescence (1904).

Graphs in Behavioral Science

Hall was not alone among the early behavioral scientists in making effective use of graphic methods during the Golden Age. Hermann Ebbinghaus's 1885 classic Memory [7] contained charts showing the repetitions required for memorizing syllable lists as a function of list length, as well as time series graphs that revealed unanticipated periodicities in memory performance, cycles that Ebbinghaus attributed to oscillations in attention. In America, James McKeen Cattell applied graphical methods to one of the day's pressing issues – the span of consciousness – by estimating the number of items held in awareness from the asymptotes of speed-reading curves. Cattell also analyzed psychophysical data by fitting them against theoretical curves of Weber's law and his own square-root law, and later assessed the fit of educational data to normal distributions in the course of arguing for differential tuition fees favoring the academically proficient [24]. Cattell's Columbia colleague Edward L. Thorndike drew heavily on graphical methods in analyzing and presenting the results of the famous puzzle-box experiments that formed
a cornerstone of later behaviorist research. His 1898 paper Animal Intelligence [33] contained more than 70 graphs showing various conditioning phenomena and demonstrating that trial-and-error learning occurs gradually, not suddenly as implied by the theory of animal reason.

Despite such achievements, however, the master of statistical graphics among early behavioral scientists was Francis Galton. Galton gained experience with scientific visualization in the 1860s when he constructed statistical maps to chart weather patterns, work which led directly to his discovery of anticyclones. In the 1870s, he introduced the quincunx – a device that demonstrates normal distributions by funneling lead shot across an array of pins – for purposes of illustrating his lectures on heredity and to facilitate his own reasoning about sources of variation and their partitioning [32]. In the 1880s, he began to make contour plots of bivariate distributions by connecting cells of equal frequencies in tabular displays. From these plots, it was a small step to the scatter plots that he used to demonstrate regression and, in 1888, to determine the first numerical correlation coefficient, an achievement attained using wholly graphical means [11]. Galton's graphical intuition, which often compensated for the algebraic errors to which he was prone, was crucial to his role in the founding of modern statistics [25]. Indeed, the ability of graphical methods to protect against numerical errors was recognized by Galton as one of its advantages. "It is always well", he wrote, "to retain a clear geometric view of the facts when we are dealing with statistical problems, which abound with dangerous pitfalls, easily overlooked by the unwary, while they are cantering gaily along upon their arithmetic" [quoted in 32, p. 291].

Conclusion

By the end of the nineteenth century, statistical graphics had come a long way. Nearly all of the graphical formats in common use today had been established, the Golden Age of graphs had drawn attention to their fertility, and prominent behavioral scientists had used graphs in creative and sophisticated ways. Yet for all of these developments, the adoption of graphical methods in the behavioral sciences would proceed slowly in the following century. At the time of his lectures on graphical techniques in the 1890s, Pearson had planned to devote an entire book to the subject. But his introduction of the chi-square test in 1900 drew his interests back to numerical methods, and this shift of interests would become emblematic of ensuing developments in the behavioral sciences. It was the inferential statistics of Pearson and his successors (notably Gosset and Fisher) that preoccupied psychologists in the century to come [21, 28]. And while the use of hypothesis-testing statistics became nearly universal in the behavioral research of the twentieth century [16], the use of graphical methods lay fallow [9, 29]. Even with the advent of exploratory data analysis (an approach more often praised than practiced by researchers), graphical methods would continue to endure waves of popularity and of neglect, both among statisticians [8, 18, 35] and among behavioral scientists [30, 31].

References

[1] Beniger, J.R. & Robyn, D.L. (1978). Quantitative graphics in statistics: a brief history, American Statistician 32, 1–11.
[2] Biderman, A.D. (1990). The Playfair enigma: the development of the schematic representation of statistics, Information Design Journal 6, 3–25.
[3] Brain, R.M. (1996). The graphic method: inscription, visualization, and measurement in nineteenth-century science and culture, Ph.D. dissertation, University of California, Los Angeles.
[4] Chadarevian, S. (1993). Graphical method and discipline: self-recording instruments in nineteenth-century physiology, Studies in the History and Philosophy of Science 24, 267–291.
[5] Costigan-Eaves, P. & Macdonald-Ross, M. (1990). William Playfair (1759–1823), Statistical Science 5, 318–326.
[6] De Morgan, A. (1838). An Essay on Probabilities and on Their Application to Life Contingencies and Insurance Offices, Longman, Brown, Green & Longman, London.
[7] Ebbinghaus, H. (1885/1964). Memory: A Contribution to Experimental Psychology, Dover Publications, New York.
[8] Fienberg, S.E. (1979). Graphical methods in statistics, American Statistician 33, 165–178.
[9] Friendly, M. & Denis, D. (2000). The roots and branches of modern statistical graphics, Journal de la Societe Francaise de Statistique 141, 51–60.
[10] Funkhouser, H.G. (1937). Historical development of the graphical representation of statistical data, Osiris 3, 269–404.
[11] Galton, F. (1888). Co-relations and their measurement, chiefly from anthropometric data, Proceedings of the Royal Society of London 45, 135–145.
[12] Gross, A.G., Harmon, J.E. & Reidy, M. (2002). Communicating Science: The Scientific Article from the 17th Century to the Present, Oxford University Press, Oxford.
[13] Hall, G.S. (1879). The graphic method, The Nation 29, 238–239.
[14] Hankins, T.L. (1999). Blood, dirt, and nomograms: a particular history of graphs, Isis 90, 50–80.
[15] Holmes, F.L. & Olesko, K.M. (1995). The images of precision: Helmholtz and the graphical method in physiology, in The Values of Precision, M.N. Wise, ed., Princeton University Press, Princeton, pp. 198–221.
[16] Hubbard, R. & Ryan, P.A. (2000). The historical growth of statistical significance testing in psychology and its future prospects, Educational and Psychological Measurement 60, 661–681.
[17] Krohn, R. (1991). Why are graphs so central in science? Biology and Philosophy 6, 181–203.
[18] Kruskal, W. (1978). Taking data seriously, in Toward a Metric of Science, Y. Elkana, J. Lederberg, R. Merton, A. Thackray & H. Zuckerman, eds, Wiley, New York, pp. 139–169.
[19] Kuhn, T.S. (1977). Mathematical versus experimental traditions in the development of physical science, in The Essential Tension, T.S. Kuhn, ed., University of Chicago Press, Chicago, pp. 31–65.
[20] Latour, B. (1990). Drawing things together, in Representation in Scientific Practice, M. Lynch & S. Woolgar, eds, MIT Press, Cambridge, pp. 19–68.
[21] Lovie, A.D. (1979). The analysis of variance in experimental psychology: 1934–1945, British Journal of Mathematical and Statistical Psychology 32, 151–178.
[22] Marey, E.-J. (1878). La Methode Graphique Dans les Sciences Experimentales, Masson, Paris.
[23] Pearson, E.S. (1938). Karl Pearson: An Appreciation of Some Aspects of His Life and Work, Cambridge University Press, Cambridge.
[24] Poffenberger, A.T., ed. (1947). James McKeen Cattell, Man of Science: Vol. 1, Psychological Research, Science Press, Lancaster.
[25] Porter, T.M. (1986). The Rise of Statistical Thinking, 1820–1900, Princeton University Press, Princeton.
[26] Porter, T.M. (1995). Trust in Numbers: The Pursuit of Objectivity in Science and Public Life, Princeton University Press, Princeton.
[27] Royston, E. (1956). Studies in the history of probability and statistics: III. A note on the history of the graphical representation of data, Biometrika 43, 241–247.
[28] Rucci, A.J. & Tweney, R.D. (1980). Analysis of variance and the "second discipline" of scientific psychology: a historical account, Psychological Bulletin 87, 166–184.
[29] Smith, L.D., Best, L.A., Cylke, V.A. & Stubbs, D.A. (2000). Psychology without p values: data analysis at the turn of the 19th century, American Psychologist 55, 260–263.
[30] Smith, L.D., Best, L.A., Stubbs, D.A., Archibald, A.B. & Roberson-Nay, R. (2002). Constructing knowledge: the role of graphs and tables in hard and soft psychology, American Psychologist 57, 749–761.
[31] Smith, L.D., Best, L.A., Stubbs, D.A., Johnston, J. & Archibald, A.M. (2000). Scientific graphs and the hierarchy of the sciences: a Latourian survey of inscription practices, Social Studies of Science 30, 73–94.
[32] Stigler, S.M. (1986). The History of Statistics: The Measurement of Uncertainty Before 1900, Harvard University Press, Cambridge.
[33] Thorndike, E.L. (1898). Animal intelligence: an experimental study of the associative processes in animals, Psychological Review Monograph Supplements 2, 1–109.
[34] Tilling, L. (1975). Early experimental graphs, British Journal for the History of Science 8, 193–213.
[35] Wainer, H. & Thissen, D. (1981). Graphical data analysis, Annual Review of Psychology 32, 191–241.
[36] Whewell, W. (1847/1967). Philosophy of the Inductive Sciences, 2nd Edition, Frank Cass, London.
[37] Ziman, J. (1978). Reliable Knowledge: An Exploration of the Grounds for Belief in Science, Cambridge University Press, Cambridge.

(See also Exploratory Data Analysis; Graphical Presentation of Longitudinal Data)

LAURENCE D. SMITH
Graphical Presentation of Longitudinal Data
HOWARD WAINER AND IAN SPENCE
Volume 2, pp. 762–772
[Figure 1 omitted: a line plot of the number of annual christenings in London (in thousands) against Year, 1620 to 1720, annotated with the beginning and end of the English Civil War (Charles I killed), plague outbreaks, the Great Plague of 1665, and the Great Fire of 1666.]

Figure 1 A plot of the annual christenings in London between 1630 and 1710 from the London Bills of Mortality. These data were taken from a table published by John Arbuthnot in 1710
substantial jiggles. Yet, each jiggle, save one, can be explained. Some of these explanations are written on the plot. The big dip that began in 1642 can only partially be explained by the onset of the English Civil War. Surely the chaos common to civil war can explain the initial drop, but the war ended in 1649 with the beheading of Charles I at Whitehall, whereas the christenings did not return to their earlier levels until 1660 (1660 marks the end of the protectorate of Oliver and Richard Cromwell and the beginning of the Restoration). Graunt offered a more complex explanation that involved the distinction between births and christenings, and the likelihood that Anglican ministers would not enter children born to Catholics or Protestant dissenters into the register.

Many of the other irregularities observed are explained in Figure 1, but what about the mysterious drop in 1704? That year has about 4000 fewer christenings than one might expect from observing the adjacent data points. What happened? There was no sudden outbreak of a war or pestilence, no great civil uprising, nothing that could explain this enormous drop.

The plot not only reveals the anomaly, it also presents a credible explanation. In Figure 2, we have duplicated the christening data and drawn a horizontal line across the plot through the 1704 data point. In doing so, we immediately see that the line goes through exactly one other point – 1674. If we went back to Arbuthnot's table, we would see that in 1674 the numbers of christenings of boys and girls were 6113 and 5738, exactly the same numbers as he had for 1704. Thus, the 1704 anomaly is likely to be a copying error! In fact, the correct figure for that year is 15,895 (8153 boys and 7742 girls), which lies comfortably between the christenings of 1703 and 1705 as expected.

It seems reasonable to assume that if Arbuthnot had noticed such an unusual data point, he would have investigated, and finding a clerical error, would have corrected it. Yet he did not. He did not, despite the fact that when graphed the error stands out, literally, like a sore thumb. Thus, we must conclude
[Figure 2 omitted: the christening data of Figure 1 redrawn with a horizontal line through the 1704 data point and the correct value marked.]

Figure 2 The solution to the mystery of 1704 is suggested by noting that only one other point (1674) had exactly the same values as the 1704 outlier. This coincidence provided the hint that allowed Zabell [11] to trace down Arbuthnot's clerical error. (Data source: Arbuthnot 1710)
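The duplicate-value check just described is easy to mechanize. The sketch below is illustrative only: the 1674 and 1704 totals are the figures quoted in the text, while the surrounding years are hypothetical placeholders, not Arbuthnot's full table.

```python
# Flag years whose christening total exactly duplicates an earlier year's
# AND deviates sharply from the mean of its immediate neighbors - the same
# two clues that expose the 1704 copying error in the text.
# Only the 1674/1704 totals (6113 boys + 5738 girls = 11851) come from the
# article; the other years are invented placeholders for illustration.
christenings = {
    1673: 11000, 1674: 11851, 1675: 11500,
    1703: 15800, 1704: 11851, 1705: 16000,   # 1704 repeats 1674's total
}

def suspect_copies(series, tolerance=0.2):
    """Return (year, earlier_year) pairs where the later year's count
    duplicates an earlier year's and sits far from its neighbors' mean."""
    suspects = []
    years = sorted(series)
    for i, year in enumerate(years):
        for earlier in years[:i]:
            if series[year] == series[earlier] and 0 < i < len(years) - 1:
                neighbor_mean = (series[years[i - 1]] + series[years[i + 1]]) / 2
                if abs(series[year] - neighbor_mean) > tolerance * neighbor_mean:
                    suspects.append((year, earlier))
    return suspects

print(suspect_copies(christenings))   # prints [(1704, 1674)]
```

Run on the full table, a check of this kind would have surfaced the 1704 entry just as the horizontal line in Figure 2 does, by pairing it with 1674.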
that he never graphed his data. Why not? The answer, very simply, is that graphs were not yet part of the statistician's toolbox. (There were a very small number of graphical applications prior to 1710, but they were not widely circulated and Arbuthnot, a very clever and knowledgeable scientist, had likely not been familiar with them.)

The Beginnings of Graphs

Graphs are the most important tool for examining longitudinal data because they convey comparative information in ways that no table or description ever could. Trends, differences, and associations are effortlessly seen in the blink of an eye. The eye perceives immediately what the brain would take much longer to deduce from a table of numbers. This is what makes graphs so appealing – they give numbers a voice, allowing them to speak clearly. Graphs and charts not only show what numbers tell, they also help scientists tease out the critical clues from their data, much as a detective gathers clues at the scene of a crime. Graphs are truly international – a German can read the same graph that an Australian draws. There is no other form of communication that more appropriately deserves the description "universal language".

Who invented this versatile device? Have graphs been around for thousands of years, the work of inventors unknown? The truth is that statistical graphs were not invented in the remote past; they were not at all obvious and their creator lived only two centuries ago. He was a man of such unusual skills and experience that had he not devised and published his charts during the Age of Enlightenment we might have waited for another hundred years before the appearance of statistical graphs.

The Scottish engineer and political economist, William Playfair (1759–1823) is the principal inventor of statistical graphs. Although one may point to solitary instances of simple line graphs that precede Playfair's work (see Wainer & Velleman, [10]), such examples generally lack refinement and, without exception, failed to inspire others. In contrast, Playfair's graphs were detailed and well drawn; they appeared regularly over a period of more than 30 years; and they introduced a surprising variety of practices that are still in use today. He invented three of the four basic forms: the statistical line graph, the bar chart, and the pie chart. The other important basic form – the scatterplot – did not appear until at least a half century later (some credit Herschel [4] with its first use, others believe that Herschel's plot was a time-series plot, no different than Playfair's). Playfair also invented other graphical elements, for example, the circle diagram and statistical Venn diagram; but these innovations are less widely used.

Two Time-series Line Graphs

In 1786, Playfair [5] published his Commercial and Political Atlas, which contained 44 charts, but no maps; all of the charts, save one, were variants of the statistical time-series line graph. Playfair acknowledged the influence of the work of Joseph Priestley (1733–1804), who had also conceived of representing time geometrically [6, 7]. The use of a grid with time on the horizontal axis was a revolutionary idea, and the representation of the lengths of reigns of monarchs by bars of different lengths allowed immediate visual comparisons that would otherwise have required significant mental arithmetic. An interesting sidelight to Priestley's plot is that he accompanied the original (1765) version with extensive explanations, which were entirely omitted in the 1769 elaboration when he realized how naturally his audience could comprehend it (Figure 3).

At about the same time that Priestley was drafting his time lines, the French physician Jacques Barbeu-Dubourg (1709–1779) and the Scottish philosopher Adam Ferguson (1723–1816) produced plots that followed a similar principle. In 1753, Dubourg published a scroll that was a complex timeline spanning the 6480 years from The Creation until Dubourg's time. This is demarked as a long thin line at the top of the scroll with the years marked off vertically in small, equal, one-year increments. Below the timeline, Dubourg laid out his record of world history. He includes the names of kings, queens, assassins, sages, and many others, as well as short phrases summarizing events of consequence. These are fixed in their proper place in time horizontally and grouped vertically either by their country of origin or in Dubourg's catch-all category at the bottom of the chart – "evenements memorables". In 1780, Ferguson published a timeline of the birth and death of civilizations that begins at the time of the Great Flood
Figure 3 Lifespans of 59 famous people in the six centuries before Christ (Priestley, [6]). Its principal innovation is the use of the horizontal axis to depict time. It also uses dots to show the lack of precise information on the birth and/or death of the individual shown
(2344 BC, indicating clearly, though, that this was 1656 years after The Creation) and continued until 1780. And in 1782, the Scottish minister James Playfair (unrelated to William), published A System of Chronology, in the style of Priestley.

The motivation behind the drafting of graphical representations of longitudinal data remains the same today as it was in eighteenth-century France. Dubourg declared that history has two ancillary fields: geography and chronology. Of the two he believed that geography was the more developed as a means for studying history, calling it "lively, convenient, attractive". By comparison, he characterizes chronology as "dry, laborious, unprofitable", offering the spirit "a welter of repulsive dates, a prodigious multitude of numbers which burden the memory". He believed that by wedding the methods of geography to the data of chronology he could make the latter as accessible as the former. Dubourg's name for his invention – chronographie – tells a great deal about what he intended, derived as it is from the Greek chronos (time) and grapheikos (writing). Dubourg intended to provide the means for chronology to be a science that, like geography, "speaks to the eyes and the imagination, a picture moving and animated".

Joseph Priestley used his line chart to depict the life spans of famous figures from antiquity; Pythagoras, Socrates, Pericles, Livy, Ovid, and Augustus all found their way onto Priestley's plot. Priestley's use of this new tool was clearly in the classical tradition.

Twenty-one years later, William Playfair used a variant on the same form (see Figure 4) to show the extent of imports and exports of Scotland to 17 other places. Playfair, as has been amply documented (Spence & Wainer, [8]), was an iconoclast and a versatile borrower of ideas who could readily adapt the chronological diagram to show economic data; in doing so, he invented the bar chart. Such unconventional usage did not occur to his more conservative peers in Great Britain, or on the Continent. He had previously done something equally
[Figure 4 here: Playfair's bar chart, titled "Exports and imports of Scotland to and from different parts for one year from Christmas 1780 to Christmas 1781." Seventeen places are listed (Jersey &c., Iceland, Poland, Isle of Man, Greenland, Prussia, Portugal, Holland, Sweden, Guernsey, Germany, Denmark and Norway, Flanders, West Indies, America, Russia, Ireland) against a scale running from £10 000 to £300 000. "The Upright Divisions are ten thousand pounds each. The black lines are Exports, the ribbed lines, imports." Published as the Act directs, June 7th 1786, by Wm. Playfair; Neele sculpt., 352 Strand, London.]
Figure 4 Imports from and exports to Scotland for 17 different places (after Playfair, [5], plate 23)
novel when he adapted the line graph, which was becoming popular in the natural sciences, to display economic time series. However, Playfair did not choose to adapt Priestley's chronological diagram because of any special affection for it, but rather out of necessity, since he lacked the time-series data he needed to show what he wanted. He would have preferred a line chart similar to the others in his Atlas. In his own words,

"The limits of this work do not admit of representing the trade of Scotland for a series of years, which, in order to understand the affairs of that country, would be necessary to do. Yet, though they cannot be represented at full length, it would be highly blameable entirely to omit the concerns of so considerable a portion of this kingdom."

Playfair's practical subject matter provides a sharp contrast to the classical content chosen by Priestley to illustrate his invention.

In 1787, shortly after publishing the Atlas, Playfair moved to Paris. Thomas Jefferson spent five years as ambassador to France (from 1784 until 1789). During that time, he was introduced to Playfair personally (Donnant [2]), and he was certainly familiar with his graphical inventions. One of the most important influences on Jefferson at William and Mary College in Virginia was his tutor, Dr. William Small, a Scots teacher of mathematics and natural philosophy. Small was Jefferson's only teacher during most of his time as a student. From Small, Jefferson received both friendship and an abiding love of science. Coincidentally, through his friendships with James Watt and John Playfair, Small was responsible for introducing the 17-year-old William Playfair to James Watt, with the former serving for three years as Watt's assistant and draftsman in Watt's steam engine business in Birmingham, England.

Although Jefferson was a philosopher whose vision of democracy helped shape the political structure of the emerging American nation, he was also a farmer, a scientist, and a revolutionary whose feet were firmly planted in the American ethos. So it is not surprising that Jefferson would find uses for graphical displays that were considerably more down to earth than the life spans of heroes from classical antiquity. What is surprising is that he found time, while President of the United States, to keep a keen eye on the availability of 37 varieties of vegetables in the Washington market and compile a chart of his findings (a detail of which is shown in Figure 5).

When Playfair had longitudinal data, he made good use of them, producing some of the most beautiful and informative graphs of such data ever made. Figure 6 is one remarkable example of these. Not only is it the first skyrocketing government debt chart but it also uses the innovation of an irregularly
Figure 5 An excerpt from a plot by Thomas Jefferson showing the availability of 16 vegetables in the Washington market
during 1802. This figure is reproduced, with permission, from Froncek ([3], p. 101)
Figure 6 This remarkable Chart of the National Debt of England appeared as plate 20, opposite page 83, in the third edition of Playfair's Commercial and Political Atlas in 1801
spaced grid along the time axis to demark events of important economic consequence.

Modern Developments

Recent developments in displaying longitudinal data show remarkably few modifications to what was developed more than 200 years ago, fundamentally because Playfair got it right. Modern high-speed computing allows us to make more graphs faster, but they are not, in any important way, different from those Playfair produced. One particularly useful modern example (Figure 7) is taken from Diggle, Heagerty, Liang & Zeger ([1], pp. 37-38), which is a hybrid plot combining a scatterplot with a line drawing. The data plotted are the number of CD4+ cells found in HIV positive individuals over time. (CD4+ cells orchestrate the body's immunoresponse to infectious agents. HIV attacks this cell, and so keeping track of the number of CD4+ cells allows us to monitor the progress of the disease.) Figure 7 contains the longitudinal data (see Longitudinal Data Analysis) from 100 HIV positive individuals over a period that begins about two years before HIV was detectable (seroconversion) and continues for four more years. If the data were to be displayed as a scatterplot, the time trend would not be visible because we have no idea of which points go with which. But (Figure 7(a)) if we connect all the dots together appropriately, the graph is so busy that no pattern is discernable. Diggle et al. [1] propose a compromise solution in which the data from a small, randomly chosen, subset of subjects are connected (Figure 7(b)). This provides a guide to the eye of the general shape of the longitudinal trends. Other similar schemes are obviously possible: for example, fitting a function to the aggregate data and connecting the points for some of the residuals to look for idiosyncratic trends.

A major challenge of data display is how to represent multidimensional data on a two-dimensional surface (see Multidimensional Scaling; Principal Component Analysis). When longitudinal data are
Figure 7 Figures 3.4 and 3.5 from Diggle et al. [1], reprinted with permission (pp. 37-38), showing CD4+ counts against time since seroconversion, with sequences of data on each subject connected (a) or connecting only a randomly selected subset of subjects (b)
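The compromise shown in panel (b) is straightforward to emulate with modern tools. The sketch below is a Python/matplotlib illustration on simulated CD4+-style data (the subject count, time range, and decay model are invented for illustration, not taken from Diggle et al.): it scatters every observation but connects the dots for only a small random subset of subjects.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # draw off-screen; no display needed
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)

# Simulated data: 100 subjects, 8 measurement occasions each,
# spanning two years before seroconversion to four years after.
n_subjects, n_obs = 100, 8
times = np.sort(rng.uniform(-2, 4, (n_subjects, n_obs)), axis=1)
baseline = rng.normal(800, 250, (n_subjects, 1))
# Counts decline after seroconversion (time > 0), plus measurement noise.
y = baseline * np.exp(-0.2 * np.clip(times, 0, None)) \
    + rng.normal(0, 80, (n_subjects, n_obs))

fig, ax = plt.subplots()
ax.plot(times.ravel(), y.ravel(), ".", markersize=2, alpha=0.4)

# The compromise of Diggle et al.: connect the measurements of only a
# small, randomly chosen subset of subjects to guide the eye.
subset = rng.choice(n_subjects, size=10, replace=False)
for i in subset:
    ax.plot(times[i], y[i], "-", linewidth=1)

ax.set_xlabel("Years since seroconversion")
ax.set_ylabel("CD4+ cell number")
fig.savefig("cd4_subset.png")
```

Connecting every subject instead (the panel (a) version) amounts to looping over all subjects rather than `subset`, which quickly produces the uninterpretable tangle the text describes.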
Figure 8 An 1869 plot by Charles Joseph Minard, Tableaux Graphiques et Cartes Figuratives de M. Minard, 1845-1869, depicting the size of Hannibal's army as it crossed from Spain to Italy in his ill-fated campaign in the Second Punic War (218-202 BC). A portfolio of Minard's work is held by the Bibliothèque de l'École Nationale des Ponts et Chaussées, Paris. This figure was reproduced from Edward R. Tufte, The Visual Display of Quantitative Information (Cheshire, Connecticut, 1983, 2001), p. 176, with permission
themselves multivariate (see Multivariate Analysis: Overview), this is a problem that has few completely satisfying solutions. Interestingly, we must look back more than a century for the best of these. In 1846, the French civil engineer Charles Joseph Minard (1781-1870) developed a format to show longitudinal data on a geographic background. He used a metaphorical data river flowing across the landscape, tied to a timescale. The river's width was proportional to the amount of materials being depicted (e.g., freight, immigrants) flowing from one geographic region to another. He used this almost exclusively to portray the transport of goods by water or land. This metaphor was employed to perfection in his 1869 graphic (Figure 8), in which, through the substitution of soldiers for merchandise, he was able to show the catastrophic loss of life in Napoleon's ill-fated Russian campaign. The rushing river of 422 000 men that crossed into Russia, when compared with the returning trickle of 10 000, seemed to defy the pen of the historian by its brutal eloquence. This now-famous display has been called (Tufte, [9]) the best graph ever produced.

Minard paired his Napoleon plot with a parallel one depicting the loss of life in the Carthaginian general Hannibal's ill-fated crossing of the Alps in the Second Punic War. He began his campaign in 218 BC in Spain with more than 97 000 men. His bold plan was to traverse the Alps with elephants and surprise the Romans with an attack from the north, but the rigors of the voyage reduced his army to only 6000 men. Minard's beautiful depiction shows the Carthaginian river that flowed across Gaul being reduced to a trickle by the time they crossed the Alps. This chart has been less often reproduced than Napoleon's march, and so we prefer to include it here.

Note

1. This exposition is heavily indebted to the scholarly work of Sandy Zabell, to whose writings the interested reader is referred for a much fuller description (Zabell, [11, 12]). It was Zabell who first uncovered Arbuthnot's clerical error.

References

[1] Diggle, P.J., Heagerty, P.J., Liang, K.-Y. & Zeger, S.L. (2002). Analysis of Longitudinal Data, 2nd Edition, Clarendon Press, Oxford, pp. 37-38.
[2] Donnant, D.F. (1805). Statistical Account of the United States of America, Messrs Greenland and Norris, London.
[3] Froncek, T. (1985). An Illustrated History of the City of Washington, Knopf, New York.
Figure 1 Developing a growth curve model using data on cognitive performance over time. The left-hand panel plots
the cognitive performance (COG) of one child in the control group versus his age (rescaled here in years). The middle
panel presents fitted OLS trajectories for a random subset of 28 children (coded using solid lines for program participants
and dashed lines for nonparticipants). The right-hand panel presents fitted change trajectories for program participants and
nonparticipants
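The exploratory step behind the middle panel, fitting a separate OLS line to each child's observations, can be sketched in a few lines of Python. All numbers below are hypothetical, chosen only to echo the pattern the caption describes (participants scoring higher and declining less steeply), not the study's actual data:

```python
import numpy as np

rng = np.random.default_rng(0)

ages = np.array([1.0, 1.5, 2.0])           # three occasions, in years
n_children = 28
program = rng.integers(0, 2, n_children)   # 1 = program participant

# Hypothetical trajectories: scores decline with age; participants
# start higher and decline less steeply.
intercepts = 108 + 7 * program + rng.normal(0, 10, n_children)
slopes = -21 + 5 * program + rng.normal(0, 3, n_children)

fitted = []
for i in range(n_children):
    cog = intercepts[i] + slopes[i] * (ages - 1) + rng.normal(0, 5, ages.size)
    # One straight-line OLS fit per child, as drawn in the middle panel.
    slope_hat, intercept_hat = np.polyfit(ages - 1, cog, deg=1)
    fitted.append((intercept_hat, slope_hat))

fitted = np.array(fitted)   # one (intercept, slope) pair per child
```

Plotting each fitted line over ages 1 to 2, coded by `program`, reproduces the spaghetti of trajectories that motivates the level-2 model below.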
for child i at time j, is a linear function of his (or her) age on that occasion (AGEij). The model assumes that a straight line adequately represents each person's true change trajectory and that any deviations from linearity in sample data result from random error (εij). Although everyone in this dataset was assessed on the same three occasions (ages 1.0, 1.5, and 2.0), this basic level-1 model can be used in a wide variety of datasets, even those in which the timing and spacing of waves varies across people.

The brackets in (1) identify the model's structural component, which represents our hypotheses about each person's true trajectory of change over time. The model stipulates that this trajectory is linear in age and has individual growth parameters, π0i and π1i, which characterize its shape for the ith child in the population. If the model is appropriate, these parameters represent fundamental features of each child's true growth trajectory, and as such, become the objects of prediction when specifying the linked level-2 model.

An important feature of the level-1 specification is that the researcher controls the substantive meaning of these parameters by choosing an appropriate metric for the temporal predictor. For example, in this level-1 model, the intercept, π0i, represents child i's true cognitive performance at age 1. This interpretation accrues because we centered AGE in the level-1 model using the predictor (AGE - 1). Had we not centered AGE, π0i would represent child i's true value of Y at age 0, which is meaningless and predates the onset of data collection. Centering time on the first wave of data collection is a popular approach because it allows us to interpret π0i using simple nomenclature: it is child i's true initial status, his or her true status at the beginning of the study.

The more important individual growth parameter is the slope, π1i, which represents the rate at which individual i changes over time. By clocking age in years (instead of the original metric of months), we can adopt the simple interpretation that π1i represents child i's true annual rate of change. During the single year under study, as child i goes from age 1 to 2, his trajectory rises by π1i. Because we hypothesize that each individual in the population has his (or her) own rate of change, this growth parameter has the subscript i.

In specifying a level-1 model, we implicitly assume that all the true individual change trajectories in the population have a common algebraic form. But because each person has his or her own individual growth parameters, we do not assume that everyone follows the same trajectory. The level-1 model allows us to distinguish the trajectories of different people using just their individual growth parameters. This leap is the cornerstone of growth curve modeling because it means that we can study interindividual
Growth Curve Modeling
differences in growth curves by studying interindividual variation in growth parameters. This allows us to recast vague questions about the relationship between change and predictors as specific questions about the relationship between the individual growth parameters and predictors.

The Level-2 Model

The level-2 model codifies the relationship between interindividual differences in the change trajectories (embodied by the individual growth parameters) and time-invariant characteristics of individuals. To develop an intuition for this model, examine the middle panel of Figure 1, which presents fitted OLS trajectories for a random subset of 28 children in the study (coded using solid lines for program participants and dashed lines for nonparticipants). As noted for the one child in the left panel, cognitive performance (on this age-standardized scale) tends to decline over time. In addition, program participants have generally higher scores at age 1 and decline less precipitously over time. This suggests that their intercepts are higher but their slopes are shallower. Also note the substantial interindividual heterogeneity within groups. Not all participants have higher intercepts than nonparticipants; not all nonparticipants have steeper slopes. Our level-2 model must simultaneously account for both the general patterns (the between-group differences in intercepts and slopes) and interindividual heterogeneity in patterns within groups.

This suggests that an appropriate level-2 model will have four specific features. First, the level-2 outcomes will be the level-1 individual growth parameters (here, π0i and π1i from (1)). Second, the level-2 model must be written in separate parts, one distinct model for each level-1 growth parameter (here, π0i and π1i). Third, each part must specify a relationship between the individual growth parameter and the predictor (here, PROGRAM, which takes on only two values, 0 and 1). Fourth, each model must allow individuals who share common predictor values to vary in their individual change trajectories. This means that each level-2 model must allow for stochastic variation (also known as random variation) in the individual growth parameters.

These considerations lead us to postulate the following level-2 model:

π0i = γ00 + γ01 PROGRAMi + ζ0i
π1i = γ10 + γ11 PROGRAMi + ζ1i          (2)

Like all level-2 models, (2) has more than one component; taken together, they treat the intercept (π0i) and the slope (π1i) of an individual's growth trajectory as level-2 outcomes that may be associated with identified predictors (here, PROGRAM). As in regular regression, we can modify the level-2 model to include other predictors, adding, for example, maternal education or family size. Each component of the level-2 model also has its own residual, here ζ0i and ζ1i, that permits the level-1 parameters (the π's) to differ across individuals.

The structural parts of the level-2 model contain four level-2 parameters, γ00, γ01, γ10, and γ11, known collectively as the fixed effects. The fixed effects capture systematic interindividual differences in change trajectories according to values of the level-2 predictor(s). In (2), γ00 and γ10 are known as level-2 intercepts; γ01 and γ11 are known as level-2 slopes. As in regular regression, the slopes are of greater interest because they represent the effect of predictors (here, the effect of PROGRAM) on the individual growth parameters. You interpret the level-2 parameters much like regular regression coefficients, except that they describe variation in outcomes that are level-1 individual growth parameters. For example, γ00 represents the average true initial status (cognitive score at age 1) for nonparticipants, while γ01 represents the hypothesized difference in average true initial status between groups. Similarly, γ10 represents the average true annual rate of change for nonparticipants, while γ11 represents the hypothesized difference in average true annual rate of change between groups. The level-2 slopes, γ01 and γ11, capture the effects of PROGRAM. If γ01 and γ11 are nonzero, the average population trajectories in the two groups differ; if they are both 0, they are the same. The two level-2 slope parameters therefore address the question: What is the difference in the average trajectory of true change associated with program participation?

An important feature of both the level-1 and level-2 models is the presence of stochastic terms, εij at level-1, ζ0i and ζ1i at level-2, also known as residuals. In the level-1 model, εij accounts for the difference between an individual's true and observed trajectory. For these data, each level-1 residual represents that part of child i's value of COG at time j not predicted by his (or her) age. The level-2 residuals, ζ0i and ζ1i, allow each person's individual growth parameters to be scattered around their relevant population averages. They represent those
portions of the outcomes, the individual growth parameters, that remain unexplained by the level-2 predictor(s). As is true of most residuals, we are usually less interested in their specific values than in their variance. The level-1 residual variance, σε², summarizes the scatter of the level-1 residuals around each person's true change trajectory. The level-2 residual variances, σ0² and σ1², summarize the variation in true individual intercept and slope around the average trajectories left over after accounting for the effect(s) of the model's predictor(s). As a result, these level-2 residual variances are conditional residual variances. Conditional on the model's predictors, σ0² represents the population residual variance in true initial status and σ1² represents the population residual variance in true annual rate of change. The level-2 variance components allow us to address the question: How much heterogeneity in true change remains after accounting for the effects of program participation?

But there is another complication at level-2: might there be an association between individual initial status and individual rates of change? Children who begin at a higher level may have higher (or lower) rates of change. To account for this possibility, we permit the level-2 residuals to be correlated. Since ζ0i and ζ1i represent the deviations of the individual growth parameters from their population averages, their population covariance, σ01, summarizes the association between true individual intercepts and slopes. Again because of their conditional nature, the population covariance of the level-2 residuals, σ01, summarizes the magnitude and direction of the association between true initial status and true annual rate of change, controlling for program participation. This parameter allows us to address the question: Controlling for program participation, are true initial status and true rate of change related?

To fit the model to data, we must make some distributional assumptions about the residuals. At level-1, the situation is relatively simple. In the absence of evidence suggesting otherwise, we usually invoke the classical normality assumption, εij ~ N(0, σε²). At level-2, the presence of two (or sometimes more) residuals necessitates that we describe their underlying behavior using a bivariate (or multivariate) distribution:

[ζ0i]      ( [0]   [σ0²  σ01] )
[ζ1i]  ~  N( [0] , [σ10  σ1²] )          (3)

The complete set of residual variances and covariances, both the level-2 error variance-covariance matrix and the level-1 residual variance, σε², is known as the model's variance components.

The Composite Growth Curve Model

The level-1/level-2 representation is not the only specification of a growth curve model. A more parsimonious representation results if you collapse the level-1 and level-2 models together into a single composite model. The composite representation, while identical to the level-1/level-2 specification mathematically, provides an alternative way of codifying hypotheses and is the specification required by many multilevel statistical software programs.

To derive the composite specification, also known as the reduced form growth curve model, notice that any pair of linked level-1 and level-2 models share some common terms. Specifically, the individual growth parameters of the level-1 model are the outcomes of the level-2 model. We can therefore collapse the submodels together by substituting for π0i and π1i from the level-2 model in (2) into the level-1 model (in (1)). Substituting the more generic temporal predictor TIMEij for the specific predictor (AGEij - 1), we write:

Yij = π0i + π1i TIMEij + εij
    = (γ00 + γ01 PROGRAMi + ζ0i)
      + (γ10 + γ11 PROGRAMi + ζ1i) TIMEij + εij          (4)

Multiplying out and rearranging terms yields the composite model:

Yij = [γ00 + γ10 TIMEij + γ01 PROGRAMi + γ11 (PROGRAMi × TIMEij)]
      + [ζ0i + ζ1i TIMEij + εij]          (5)

where we once again use brackets to distinguish the model's structural and stochastic components.

Even though the composite specification in (5) appears more complex than the level-1/level-2 specification, the two forms are logically and mathematically equivalent. The level-1/level-2 specification is often more substantively appealing; the composite specification is algebraically more parsimonious. In addition, the γ's in the composite model describe patterns of change in a different way. Rather than
postulating first how COG is related to TIME and the individual growth parameters, and second how the individual growth parameters are related to PROGRAM, the composite specification postulates that COG depends simultaneously on: (a) the level-1 predictor, TIME; (b) the level-2 predictor, PROGRAM; and (c) the cross-level interaction, PROGRAM × TIME. From this perspective, the composite model's structural portion strongly resembles a regular regression model with predictors, TIME and PROGRAM, appearing as main effects (associated with γ10 and γ01, respectively) and in a cross-level interaction (associated with γ11).

How did this cross-level interaction arise, when the level-1/level-2 specification appears to have no similar term? Its appearance arises from the multiplying-out procedure used to generate the composite model. When we substitute the level-2 model for π1i into its appropriate position in the level-1 model, the parameter γ11, previously associated only with PROGRAM, gets multiplied by TIME. In the composite model, then, this parameter becomes associated with the interaction term, PROGRAM × TIME. This association makes sense if you consider the following logic. When γ11 is nonzero in the level-1/level-2 specification, the slopes of the change trajectories differ according to values of PROGRAM. Stated another way, the effect of TIME (whose effect is represented by the slopes of the change trajectories) differs by levels of PROGRAM. When the effects of one predictor (here, TIME) differ by the levels of another predictor (here, PROGRAM), we say that the two predictors interact. The cross-level interaction in the composite specification codifies this effect.

Another distinctive feature of the composite model is its composite residual, the three terms in the second set of brackets on the right side of (5) that combine together the one level-1 residual and the two level-2 residuals:

Composite residual: ζ0i + ζ1i TIMEij + εij

Although the constituent residuals have the same meaning under both representations, the composite residual provides valuable insight into our assumptions about the behavior of residuals over time. Instead of being a simple sum, the second level-2 residual, ζ1i, is multiplied by the level-1 predictor, TIME. Despite its unusual construction, the interpretation of the composite residual is straightforward: it describes the difference between the observed and predicted value of Y for individual i on occasion j.

The mathematical form of the composite residual reveals two important properties about the occasion-specific residuals not readily apparent in the level-1/level-2 specification: they can be both autocorrelated and heteroscedastic within person. These are exactly the kinds of properties that you would expect among residuals for repeated measurements of a changing outcome.

When residuals are heteroscedastic, the unexplained portions of each person's outcome have unequal variances across occasions of measurement. Although heteroscedasticity has many roots, one major cause is the effect of omitted predictors, the consequences of failing to include variables that are, in fact, related to the outcome. Because their effects have nowhere else to go, they bundle together, by default, into the residuals. If their impact differs across occasions, the residuals' magnitude may differ as well, creating heteroscedasticity. The composite model allows for heteroscedasticity via the level-2 residual ζ1i. Because ζ1i is multiplied by TIME in the composite residual, its magnitude can differ (linearly, at least, in a linear level-1 submodel) across occasions. If there are systematic differences in the magnitudes of the composite residuals across occasions, there will be accompanying differences in residual variance, hence heteroscedasticity.

When residuals are autocorrelated, the unexplained portions of each person's outcome are correlated with each other across repeated occasions. Once again, omitted predictors, whose effects are bundled into the residuals, are a common cause. Because their effects may be present identically in each residual over time, an individual's residuals may become linked across occasions. The presence of the time-invariant ζ0i's and ζ1i's in the composite residual of (5) allows the residuals to be autocorrelated. Because they have only an i subscript (and no j), they feature identically in each individual's composite residual on every occasion, allowing for autocorrelation across time.

Fitting Growth Curve Models to Data

Many different software programs can fit growth curve models to data. Some are specialized packages written expressly for this purpose (e.g., HLM,
Table 1 Results of fitting a growth curve for change to data (n = 103). This model predicts cognitive functioning between ages 1 and 2 years as a function of (AGE - 1) (at level-1) and PROGRAM (at level-2)

Parameter                                            Estimate     se       z
Fixed effects
  Initial status, π0i     Intercept, γ00             107.84***    2.04   52.97
                          PROGRAM, γ01                 6.85*      2.71    2.53
  Rate of change, π1i     Intercept, γ10             -21.13***    1.89  -11.18
                          PROGRAM, γ11                 5.27*      2.52    2.09
Variance components
  Level-1: Within-person, εij          σε²            74.24***
  Level-2: In initial status, ζ0i      σ0²           124.64***
           In rate of change, ζ1i      σ1²            12.29
           Covariance between ζ0i and ζ1i, σ01       -36.41

*p < .05, **p < .01, ***p < .001
Note: Full ML, HLM
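The logic of the level-1/level-2 specification can be mimicked with a rough two-stage OLS sketch: first fit each child's straight-line trajectory, then regress the fitted growth parameters on PROGRAM. This is not the full-maximum-likelihood estimation HLM performs (it ignores the differing precision of the stage-1 estimates), and the simulated data below simply reuse the Table 1 estimates as hypothetical population values:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical population values, borrowed from Table 1 for illustration.
g00, g01, g10, g11 = 107.84, 6.85, -21.13, 5.27
n = 103
program = (np.arange(n) < 54).astype(float)    # roughly half participate
time = np.array([0.0, 0.5, 1.0])               # (AGE - 1) at the 3 waves

# Level-2: each child's true growth parameters scatter around group means.
pi0 = g00 + g01 * program + rng.normal(0, np.sqrt(124.64), n)
pi1 = g10 + g11 * program + rng.normal(0, np.sqrt(12.29), n)
# Level-1: observed scores add within-person error around each trajectory.
y = pi0[:, None] + pi1[:, None] * time + rng.normal(0, np.sqrt(74.24), (n, 3))

# Stage 1: per-child OLS estimates of (pi0, pi1); polyfit handles all
# children at once when y is passed column-wise.
pi1_hat, pi0_hat = np.polyfit(time, y.T, deg=1)

# Stage 2: regress each estimated growth parameter on PROGRAM, as in (2).
X = np.column_stack([np.ones(n), program])
(g00_hat, g01_hat), *_ = np.linalg.lstsq(X, pi0_hat, rcond=None)
(g10_hat, g11_hat), *_ = np.linalg.lstsq(X, pi1_hat, rcond=None)
```

With only three waves per child the stage-1 estimates are noisy, which is one reason the multilevel ML fit reported in Table 1, rather than this two-stage shortcut, is the preferred analysis.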
MLwiN, and MIXREG). Others are part of popular multipurpose software packages, including SAS (PROC MIXED and PROC NLMIXED), SPSS (MIXED), STATA (xtreg and gllamm), and SPLUS (NLME) (see Software for Statistical Analyses). At their core, each program does the same job: it fits the growth model to data and provides parameter estimates, measures of precision, diagnostics, and so on. There is also some evidence that all the different packages produce the same, or similar, answers to a given problem [5]. So, in one sense, it does not matter which program you choose. But the packages do differ in many important ways, including the look and feel of their interfaces, their ways of entering and preprocessing data, their model specification process (the level-1/level-2 specification or the composite specification), their estimation methods (e.g., full maximum likelihood vs restricted maximum likelihood; see Direct Maximum Likelihood Estimation), their strategies for hypothesis testing, and provision of diagnostics. It is beyond the scope of this entry to discuss these details. Instead, we turn to the results of fitting the growth curve model to data using one statistical program, HLM. Results are presented in Table 1.

Interpreting the Results of Fitting the Growth Curve Model to Data

The fixed effects parameters, the γ's of (2) and (5), quantify the effects of predictors on the individual change trajectories. In our example, they quantify the relationship between the individual growth parameters and program participation. We interpret these estimates much as we do any regression coefficient, with one key difference: the level-2 outcomes that these fixed effects describe are level-1 individual growth parameters. In addition, you can conduct a hypothesis test for each fixed effect using a single parameter test (most commonly, examining the null hypothesis H0: γ = 0). As shown in Table 1, we reject all four null hypotheses, suggesting that each parameter plays a role in the story of the program's effect on children's cognitive development.

Substituting the γ's in Table 1 into the level-2 model in (2), we have:

π0i = 107.84 + 6.85 PROGRAMi
π1i = -21.13 + 5.27 PROGRAMi          (6)

The first part of the fitted model describes the effects of PROGRAM on initial status; the second part describes its effects on the annual rates of change.

Begin with the first part of the fitted model, for initial status. In the population from which this sample was drawn, we estimate the true initial status (COG at age 1) for the average nonparticipant to be 107.84; for the average participant, we estimate that it is 6.85 points higher (114.69). In rejecting (at the .001 level) the null hypotheses for the two level-2 intercepts, we conclude that the average nonparticipant had a nonzero cognitive score at age 1 (hardly surprising!) but experienced a statistically significant decline over time. Given that this was a randomized trial, you may be surprised to find that the initial status of program participants is 6.85 points
higher than that of nonparticipants. Before concluding that this differential in initial status casts doubt on the randomization mechanism, remember that the intervention started before the first wave of data collection, when the children were already 6 months old. This modest 7-point elevation in initial status may reflect early treatment gains attained between ages 6 months and 1 year.

Next examine the second part of the fitted model, for the annual rate of change. In the population from which this sample was drawn, we estimate the true annual rate of change for the average nonparticipant to be −21.13; for the average participant, we estimate it to be 5.27 points higher (−15.86). In rejecting (at the .05 level) the null hypotheses for the two level-2 slopes, we conclude that the difference between program participants and nonparticipants in their mean annual rates of change is statistically significant. The average nonparticipant dropped over 20 points during the second year of life; the average participant dropped just over 15. The cognitive performance of both groups of children declines over time, but program participation slows the rate of decline.

Another way of interpreting fixed effects is to plot fitted trajectories for prototypical individuals. Even in a simple analysis like this, which involves just one dichotomous predictor, we find it invaluable to inspect prototypical trajectories visually. For this particular model, only two prototypes are possible: a program participant (PROGRAM = 1) and a nonparticipant (PROGRAM = 0). Substituting these values into equation (5) yields the predicted initial status and annual growth rates for each:

When PROGRAM = 0:
π̂0i = 107.84 + 6.85(0) = 107.84
π̂1i = −21.13 + 5.27(0) = −21.13

When PROGRAM = 1:
π̂0i = 107.84 + 6.85(1) = 114.69
π̂1i = −21.13 + 5.27(1) = −15.86    (7)

We use these estimates to plot the fitted change trajectories in the right-hand panel of Figure 1. These plots reinforce the numeric conclusions just articulated. In comparison to nonparticipants, the average participant has a higher score at age 1 and a slower annual rate of decline.

Estimated variance components assess the amount of outcome variability left, at either level-1 or level-2, after fitting the multilevel model. Because they are harder to interpret in absolute terms, many researchers use null-hypothesis tests, for at least they provide some benchmark for comparison. Some caution is necessary, however, because the null hypothesis is on the border of the parameter space (by definition, these components cannot be negative) and, as a result, the asymptotic distributional properties that hold in simpler settings may not apply [9].

The level-1 residual variance, σε², summarizes the population variability in an average person's outcome values around his or her own true change trajectory. Its estimate for these data is 74.24, a number that is difficult to evaluate on its own. Rejection of the associated null-hypothesis test (at the .001 level) suggests the existence of additional outcome variation at level-1 (within-person) that may be predictable. This suggests it might be profitable to add time-varying predictors to the level-1 model (such as the number of books in the home or the amount of parent-child interaction).

The level-2 variance components summarize the variability in change trajectories that remains after controlling for predictors (here, PROGRAM). Associated tests for these variance components evaluate whether there is any remaining residual outcome variation that could potentially be explained by other predictors. For these data, we reject only one of these null hypotheses (at the .001 level), for initial status, σ0². This again suggests the need for additional predictors, but because this is a level-2 variance component (describing residual variation in true initial status), we would consider adding both time-varying and time-invariant predictors to the model. Failure to reject the null hypothesis for σ1² indicates that PROGRAM explains all the potentially predictable variation between children in their true annual rates of change.

Finally, turn to the level-2 covariance component, σ01. Failure to reject this null hypothesis indicates that the intercepts and slopes of the individual true change trajectories are uncorrelated: there is no association between true initial status and true annual rates of change (once the effects of PROGRAM are removed). Were we to continue with model building, this result might lead us to drop the second level-2 residual, ζ1i, from our model, for neither its variance nor its covariance with ζ0i is significantly different from 0.
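The prototype calculations in (7) can be reproduced with a short sketch. This is illustrative Python, not HLM output; the parameter estimates are the ones reported above, and the fitted_cog helper simply evaluates the linear change trajectory anchored at initial status (age 1):

```python
# Fitted level-2 model from equation (6): initial status and annual
# rate of change as functions of program participation.
GAMMA_00, GAMMA_01 = 107.84, 6.85    # initial status: intercept, PROGRAM effect
GAMMA_10, GAMMA_11 = -21.13, 5.27    # rate of change: intercept, PROGRAM effect

def growth_parameters(program):
    """Return (initial status, annual rate of change) for PROGRAM = 0 or 1."""
    pi0 = GAMMA_00 + GAMMA_01 * program
    pi1 = GAMMA_10 + GAMMA_11 * program
    return pi0, pi1

def fitted_cog(program, age):
    """Predicted COG at a given age in years (initial status is defined at age 1)."""
    pi0, pi1 = growth_parameters(program)
    return pi0 + pi1 * (age - 1)

for program in (0, 1):
    pi0, pi1 = growth_parameters(program)
    print(f"PROGRAM={program}: initial status {pi0:.2f}, annual change {pi1:.2f}")
```

Evaluating fitted_cog over a range of ages for each prototype reproduces the two fitted trajectories plotted in the right-hand panel of Figure 1.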
Postscript

Growth curve modeling offers empirical researchers a wealth of analytic opportunities. The method can accommodate any number of waves of data, the occasions of measurement need not be equally spaced, and different participants can have different data collection schedules. Individual change can be represented by a variety of substantively interesting trajectories: not only the linear functions presented here but also curvilinear and discontinuous functions. Multiple predictors of change can be included in a single analysis, and simultaneous change across multiple domains (e.g., change in cognitive function and motor function) can be investigated. Readers wishing to learn more about growth curve modeling should consult one of the recent books devoted to the topic [3, 4, 6, 8–10].

References

[1] Burchinal, M.R., Campbell, F.A., Bryant, D.M., Wasik, B.H. & Ramey, C.T. (1997). Early intervention and mediating processes in cognitive performance of low income African American families, Child Development 68, 935–954.
[2] Cronbach, L.J. & Furby, L. (1970). How should we measure change, or should we? Psychological Bulletin 74, 68–80.
[3] Diggle, P., Heagerty, P., Liang, K.-Y. & Zeger, S. (2002). Analysis of Longitudinal Data, 2nd Edition, Oxford University Press, New York.
[4] Fitzmaurice, G.M., Laird, N.M. & Ware, J.H. (2004). Applied Longitudinal Analysis, Wiley, New York.
[5] Kreft, I.G.G. & de Leeuw, J. (1998). Introducing Multilevel Modeling, Sage Publications, Thousand Oaks.
[6] Raudenbush, S.W. & Bryk, A.S. (2002). Hierarchical Linear Models: Applications and Data Analysis Methods, 2nd edition, Sage Publications, Thousand Oaks.
[7] Rogosa, D.R., Brandt, D. & Zimowski, M. (1982). A growth curve approach to the measurement of change, Psychological Bulletin 90, 726–748.
[8] Singer, J.D. & Willett, J.B. (2003). Applied Longitudinal Data Analysis: Modeling Change and Event Occurrence, Oxford University Press, New York.
[9] Snijders, T.A.B. & Bosker, R.J. (1999). Multilevel Analysis: An Introduction to Basic and Advanced Multilevel Modeling, Sage Publications, London.
[10] Verbeke, G. & Molenberghs, G. (2000). Linear Mixed Models for Longitudinal Data, Springer, New York.
[11] Willett, J.B. (1988). Questions and answers in the measurement of change, in Review of Research in Education (1988–1989), E. Rothkopf, ed., American Education Research Association, Washington, pp. 345–422.

(See also Heteroscedasticity and Complex Variation; Multilevel and SEM Approaches to Growth Curve Modeling; Structural Equation Modeling: Latent Growth Curve Analysis)

JUDITH D. SINGER AND JOHN B. WILLETT
Guttman, Louis (Eliyahu)
DAVID CANTER
Volume 2, pp. 780–781
from the main group of assemblers and the use of small group contingencies may have stimulated peer influence to favor rather than limit productivity. If, as seems most likely, the learning interpretation is correct, a serious design flaw in the Hawthorne relay-assembler experiment was the failure to establish a performance baseline before varying rest periods and other working conditions presumed to be related to fatigue. Once reward for performance and feedback were both present, productivity began to increase and generally increased thereafter, even during periods when other conditions were constant. For details, see accounts by Gottfredson and Parsons [3, 7]. Furthermore, observation logs imply that the workers set personal goals of improving their performance. For example, one assembler said, "I made 421 yesterday, and I'm going to make better today" [8, p. 74].

In contemporary perspective, the Hawthorne effect is understandable in terms of goal-setting theory [5, 6]. According to goal-setting theory, workers attend to feedback on performance when they adopt personal performance goals. Contingent rewards, goals set by workers, attention to information, and the removal of social obstacles to improved productivity led workers to learn to assemble relays faster and to display their learning by producing more relays per hour. Gottfredson [3] has provided additional examples in which a similar process (a Hawthorne effect according to this understanding) is produced. Understanding the remarkable improvement in worker performance in the Hawthorne relay-assembly study in this way is important because it suggests how one obtains the Hawthorne effect when improvements in worker performance are desired: remove obstacles to improvement, set goals, and provide feedback. Then, if learning is possible, performance may improve.

What of the Hawthorne effect as research design flaw? There was a flaw in the relay-assembly study: the failure of the design to rule out learning as a rival hypothesis to working conditions as an explanation for the changes in productivity observed. Designs that rule out this rival hypothesis, such as the establishment of an adequate baseline or the use of a randomly equivalent control group, are therefore often desirable in research.

References

[1] Carey, A. (1967). The Hawthorne studies: a radical criticism, American Sociological Review 32, 403–416.
[2] Franke, R.H. & Kaul, J.D. (1978). The Hawthorne experiments: first statistical interpretation, American Sociological Review 43, 623–643.
[3] Gottfredson, G.D. (1996). The Hawthorne misunderstanding (and how to get the Hawthorne effect in action research), Journal of Research in Crime and Delinquency 33, 28–48.
[4] Lawler, E.E. (1971). Pay and Organizational Effectiveness, McGraw-Hill, New York.
[5] Locke, E.A. & Latham, G.P. (1990). A Theory of Goal Setting and Task Performance, Prentice Hall, Englewood Cliffs.
[6] Locke, E.A. & Latham, G.P. (2002). Building a practically useful theory of goal setting and task motivation: a 35 year Odyssey, American Psychologist 57, 705–717.
[7] Parsons, H.M. (1974). What happened at Hawthorne? Science 183, 922–932.
[8] Roethlisberger, F.J. & Dickson, W.J. (1939). Management and the Worker, Harvard University Press, Cambridge.
[9] Snow, C.E. (1927). A discussion of the relation of illumination intensity to productive efficiency, Technical Engineering News 256.
[10] Stagner, R. (1956). Psychology of Industrial Conflict, Wiley, New York.
[11] Viteles, M.S. (1953). Motivation and Morale in Industry, Norton, New York.

GARY D. GOTTFREDSON
Heritability
JOHN L. HOPPER
Volume 2, pp. 786–787
Heritability is a relative concept in another sense: it is derived from the comparative similarity of relatives who differ in their shared genes. The most common approach is to compare samples of the two kinds of twins. Monozygotic (MZ) twins derive from a single zygote, and, barring rare events, they share all their genes identical-by-descent. Dizygotic (DZ) twins, like ordinary siblings, arise from two zygotes created by the same parents, and share, on average, one-half of their segregating genes. If, in a large and representative sample of twins, behavioral similarity of DZ twins approaches that found for MZ twins, genetic factors play little role in creating individual differences in that behavior; heritability is negligible, and the observed behavioral differences must be due to differences in environments shared by both kinds of cotwins in their homes, schools, and neighborhoods. Conversely, if the observed correlation of MZ cotwins doubles that found for DZ twins (a difference in resemblance that parallels their differences in genetic similarity), heritability must be nonzero (see Twin Designs). We can extend the informational yield found in contrasts of the two kinds of twins by adding additional members of the families of the twins. Consider, for example, children in families of monozygotic twin parents. Children in each of the two nuclear families derive half their genes from a twin parent, and those genes are identical with the genes of the parent's twin sister or brother, the children's twin aunt or twin uncle. Because the children and the twin aunt or uncle do not live in the same household, their resemblance cannot be due to household environment. And because the MZ twin parents have identical sets of nuclear genes, their children are genetically related to one another as half-siblings; socially, they are reared as cousins in separate homes. Thus, MZ twin families yield informative relationships ranging from those who share all their genes (MZ parents) to those sharing one-half (siblings in each nuclear family; parents and their children; children and their twin aunt/uncle), one-quarter (the cousins who are half-siblings), or zero (children and their spousal aunt or uncle).

We studied two measures in families of MZ twin parents: one, a behavioral measure of nonverbal intelligence; the other, the sum of fingerprint ridge counts, a morphological measure known to be highly heritable. For both measures, familial resemblance appeared to be a direct function of shared genes [8]. But there was a substantial difference in the magnitude of heritability estimates found for the two measures, ranging from 0.68 to 0.92 for total ridge count, but much less, 0.40 to 0.54, for the measure of nonverbal intelligence. That finding is consistent with research on many species [5]: behavioral traits exhibit moderate levels of heritability, much less than what is found for morphological and physiological traits, but greater than is found for life-history characteristics. The heritability estimates illustrated from families of MZ twins date from a 1979 study. They were derived from coefficients of correlation and regression among different groups of relatives in these families; interpretation of those estimates was confounded by the imprecision of the coefficients on which they were based, and the fact that effects of common environment were ignored. Now, 25 years later, estimates of heritability typically include 95% confidence intervals, and they are derived from robust analytic models. The estimates are derived from models fit to data from sets of relatives, and heritability is documented by showing that models that set it to zero result in a significantly poorer fit of the model to the observed data. Effects of common environments are routinely tested in an analogous manner. Analytic techniques for estimating heritability are now much more rigorous, and allow for tests of differential heritability in males and females. But they remain estimates derived from the relative resemblance of relatives who differ in the proportion of their shared genes.

Why do people differ? Why do brothers and sisters, growing up together, sharing half their genes and many of their formative experiences, turn out differently in their interests, aptitudes, and lifestyles? The classic debate was framed as nature versus nurture, as though genetic dispositions and experiential histories were somehow oppositional, and as though a static decomposition of genetic and environmental factors could adequately capture a child's developmental trajectory. But, clearly, this is a simplification. If all environmental differences were removed in a population, such that all environments offered the same opportunities and incentives for acquisition of cognitive skills (and if all tests were perfectly reliable), people with the same genes would obtain the same aptitude test scores. If, conversely, the environments people experienced were very different, for reasons independent of their genetic differences, heritability would be negligible. High heritability estimates do not elucidate how genetic differences effect differences in
behavioral outcomes, and it seems likely that many, perhaps most, gene effects on behavior are largely indirect, influencing the trait-relevant environment to which people are exposed.

Much recent research in the fields of behavioral and psychiatric genetics demonstrates substantial gene by environment interaction. Such research makes it increasingly apparent that the meaning of heritability depends on the circumstances in which it is assessed. Recent data suggest that it is nonsensical to conceptualize the heritability of a complex behavioral trait as if it were fixed and stable over time and environments. Across different environments, the modulation of genetic effects on adolescent substance use ranges as much as five- or sixfold, even when those environments are crudely differentiated as rural versus urban residential communities [1, 7], or religious versus secular households [4]. Similarly, heritability estimates for tobacco consumption vary dramatically for younger and older cohorts of twin sisters [3]. Such demonstrations suggest that genetic factors play much more of a role in adolescent alcohol use in environments where alcohol is easily accessed and community surveillance is reduced. And, similarly, as social restrictions on smoking have relaxed across generations, heritable influences have increased. Equally dramatic modulation of genetic effects by environmental variation is evident in effects of differences in socioeconomic status on the heritability of children's IQ [10].

In recent years, twin studies have almost monotonously demonstrated that estimates of heritable variance are nonzero across all domains of individual behavioral variation that can be reliably assessed. These estimates are often modest in magnitude and, perhaps more surprisingly, quite uniform across different behavioral traits. But if heritability is so ubiquitous, what consequences does it have for scientific understanding of behavioral development?

All human behaviors are, to some degree, heritable, but that cannot be taken as evidence that the complexity of human behavior can be reduced to relatively simple genetic mechanisms [9]. Confounding heritability with strong biological determinism is an error. An example is criminality; like nearly all behaviors, criminality is to some degree heritable. There is nothing surprising, and nothing morally repugnant, in the notion that not all children have equal likelihood of becoming a criminal. To suggest that, however, is neither to suggest that specific biological mechanisms for criminality are known, nor even that they exist. All behavior is biological and genetic, but some behaviors are biological in a stronger sense than others, and some behaviors are genetic in a stronger sense than others [9]. Criminality is, in part, heritable, but unnecessary mischief is caused by reference to "genes for criminality" or any similar behavioral outcome. We inherit dispositions, not destinies [6]. Heritability is an important concept, but it is important to understand what it is. And what it is not.

References

[1] Dick, D.M., Rose, R.J., Viken, R.J., Kaprio, J. & Koskenvuo, M. (2001). Exploring gene-environment interactions: socioregional moderation of alcohol use, Journal of Abnormal Psychology 110, 625–632.
[2] English, H.B. & English, A.C. (1958). A Comprehensive Dictionary of Psychological and Psychoanalytical Terms: a Guide to Usage, Longmans, Green & Co.
[3] Kendler, K.S., Thornton, L.M. & Pedersen, N.L. (2000). Tobacco consumption in Swedish twins reared apart and reared together, Archives of General Psychiatry 57, 886–892.
[4] Koopmans, J.R., Slutske, W.S., van Baal, C.M. & Boomsma, D.I. (1999). The influence of religion on alcohol use initiation: evidence for a Genotype X environment interaction, Behavior Genetics 29, 445–453.
[5] Mousseau, T.A. & Roff, D.A. (1987). Natural selection and the heritability of fitness components, Heredity 59, 181–197.
[6] Rose, R.J. (1995). Genes and human behavior, Annual Review of Psychology 46, 625–654.
[7] Rose, R.J., Dick, D.M., Viken, R.J. & Kaprio, J. (2001). Gene-environment interaction in patterns of adolescent drinking: regional residency moderates longitudinal influences on alcohol use, Alcoholism: Clinical & Experimental Research 25, 637–643.
[8] Rose, R.J., Harris, E.L., Christian, J.C. & Nance, W.E. (1979). Genetic variance in nonverbal intelligence: data from the kinships of identical twins, Science 205, 1153–1155.
[9] Turkheimer, E. (1998). Heritability and biological explanation, Psychological Review 105, 782–791.
[10] Turkheimer, E.N., Haley, A., Waldron, M., D'Onofrio, B.M. & Gottesman, I.I. (2003). Socioeconomic status modifies heritability of IQ in young children, Psychological Science 14, 623–628.

RICHARD J. ROSE
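The MZ/DZ contrast that anchors the twin-design logic in the entry above can be sketched numerically. Falconer's classic decomposition (a standard textbook formula, used here purely as an illustration; it is not given explicitly in the entry, and the correlations below are invented) doubles the difference between the MZ and DZ correlations:

```python
def falconer_estimates(r_mz, r_dz):
    """Classic twin-study decomposition of trait variance.

    h2: additive genetic component (heritability),
    c2: shared (common) environment,
    e2: nonshared environment plus measurement error.
    """
    h2 = 2 * (r_mz - r_dz)   # MZ twins share all genes, DZ share half
    c2 = 2 * r_dz - r_mz     # what remains of MZ similarity beyond genes
    e2 = 1 - r_mz            # whatever makes even MZ cotwins differ
    return h2, c2, e2

# Hypothetical correlations: if the MZ correlation doubles the DZ
# correlation, the shared-environment estimate is zero and the
# heritability estimate equals the MZ correlation itself.
h2, c2, e2 = falconer_estimates(0.80, 0.40)
```

As the entry stresses, modern estimates come instead from model fitting with confidence intervals; this sketch only formalizes the informal "MZ doubles DZ" comparison.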
Heteroscedasticity and Complex Variation
HARVEY GOLDSTEIN
Volume 2, pp. 790–795
Table 2 OLS estimates from separate gender means model (2)

                                 Coefficient    Standard error
Fixed
  Boy (β1)                       0.140          0.024
  Girl (β2)                      0.093          0.032
Random
  Residual variance (σe²)        0.99           0.023
−2 log-likelihood                11455.7

have a fixed underlying population value, and the residual variance is under the heading "random" since it is associated with the random part of the model (residual term).

identical (see Maximum Likelihood Estimation). Note that the difference in the −2 log-likelihood values is 6.2, which, judged against a chi-squared distribution on 1 degree of freedom (because we are adding just 1 parameter to the model), is significant at approximately the 1% level.

Now let us rewrite (3) in a form that will allow us to generalize to more complex variance functions.

yi = β0 + β1 x1i + ei
ei = e0i + e1i x1i
var(ei) = σe0² + 2σe01 x1i + σe1² x1i²,  σe1² ≡ 0
x1i = 1 if a boy, 0 if a girl    (4)

Model (4) is equivalent to (3) with
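The likelihood-ratio comparison described above (a drop of 6.2 in the −2 log-likelihood for one added parameter) can be checked with a short sketch. This is plain Python; for 1 degree of freedom the chi-squared tail probability reduces to erfc(√(x/2)), so no statistics library is needed:

```python
import math

def chi2_sf_1df(x):
    """Survival function of a chi-squared variate with 1 degree of freedom.

    P(X > x) = P(Z**2 > x) = 2 * P(Z > sqrt(x)) = erfc(sqrt(x / 2)).
    """
    return math.erfc(math.sqrt(x / 2))

# Difference in -2 log-likelihood between the two nested models,
# with one extra parameter: 6.2 on 1 degree of freedom.
p_value = chi2_sf_1df(6.2)
print(f"p = {p_value:.4f}")  # significant at roughly the 1% level, as stated
```

The same function reproduces the familiar 5% critical value: chi2_sf_1df(3.84) is approximately 0.05.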
Table 4 Estimates from fitting reading score as an explanatory variable with a quadratic variance function

                                 Coefficient    Standard error
Fixed
  Intercept (β0)                 0.002
  Reading (β3)                   0.596          0.013
Random
  Intercept variance (σe0²)      0.638          0.017
  Covariance (σe03)              0.002          0.007
  Reading variance (σe3²)        0.010          0.011
−2 log-likelihood                9759.6

Table 5 Estimates from fitting reading score and gender (girl = 1) as explanatory variables with linear variance function

                                 Coefficient    Standard error
Fixed
  Intercept (β0)                 0.103
  Girl (β2)                      0.170          0.026
  Reading (β3)                   0.590          0.013
Random
  Intercept (σe0²)               0.665          0.023
  Girl (σe02)                    0.038          0.030
  Reading (σe03)                 0.006          0.014
−2 log-likelihood                9715.3
Table 6 Estimates from fitting reading score and gender (girl = 1) as explanatory variables with linear variance function including interaction

                                 Coefficient    Standard error
Fixed
  Intercept (β0)                 0.103
  Girl (β2)                      0.170          0.026
  Reading (β3)                   0.590          0.013
Random
  Intercept (σe0²)               0.661          0.023
  Girl (σe02)                    0.034          0.030
  Reading (σe03)                 0.040          0.022
  Interaction (σe04)             0.072          0.028
−2 log-likelihood                9709.1

form

log[var(ei)] = Σh αh xhi,  x0i ≡ 1    (9)

We shall look at estimation algorithms suitable for either the linear or nonlinear formulations below.

Covariance Modeling and Multilevel Structures

Consider the repeated measures model where the response is, for example, a growth measure at successive occasions on a sample of individuals as a polynomial function of time (t):

yij = Σ(h=0..p) βh tij^h + eij,  cov(ej) = Σe,  ej = {eij}    (10)

where ej is the vector of residuals for the jth individual and i indexes the occasion. The residual covariance matrix between measurements at different occasions (Σe) is nondiagonal since the same individuals are measured at each occasion and typically there would be a relatively large between-individual variation. The covariance between the residuals, however, might be expected to vary as a function of their distances apart, so that a simple model might be as follows:

cov(etj, e(t−s)j) = σe² exp(−αs)    (11)

which resolves to a first-order autoregressive structure (see Time Series Analysis) where the time intervals are equal.

The standard formulation for a repeated measures model is as a two-level structure where individual random effects are included to account for the covariance structure with correlated residuals. A simple such model with a random intercept u0j and random slope u1j can be written as follows:

yij = Σ(h=0..p) βh tij^h + u0j + u1j tij + eij
cov(ej) = σe² I,  cov(uj) = Ωu,  uj = (u0j, u1j)′    (12)

This model incorporates the standard assumption that the covariance matrix of the level 1 residuals is diagonal, but we can allow it to have a more complex structure as in (11). In general, we can fit complex variance and covariance structures to the level 1 residual terms in any multilevel model. Furthermore, we can fit such structures at any level of a data hierarchy. A general discussion can be found in Goldstein [2, Chapter 3] and an application modeling the level 2 variance in a multilevel generalized linear model (see Generalized Linear Mixed Models) is given by Goldstein and Noden [4]; in the case of generalized linear models, the level 1 variance is heterogeneous by virtue of its dependence on the linear part of the model through the (nonlinear) link function.

Estimation

For normally distributed variables, the likelihood equations can be solved, iteratively, in a variety of ways. Goldstein et al. [3] describe an iterative generalized least squares procedure (see Least Squares Estimation) that will handle either linear models such as (7) or nonlinear ones such as (9) for both variances and covariances. Bayesian estimation can be carried out readily using Markov chain Monte Carlo (MCMC) methods (see Markov Chain Monte Carlo and Bayesian Statistics), and a detailed comparison of likelihood and Bayesian estimation for models with complex variance structures is given in Browne et al. [1]. These authors also compare the fitting of linear and loglinear models for the variance.

Conclusions

This article has shown how to specify and fit a model that expresses the residual variance in a linear
model as a function of explanatory variables. These variables may or may not also enter the fixed, regression part of the model. It indicates how this can be extended to the case of multilevel models and to the general modeling of a covariance matrix. The example chosen shows how such models can uncover differences between groups and according to the values of a continuous variable. The finding that an interaction exists in the model for the variance underlines the need to apply considerations of model adequacy and fit for the variance modeling. The relationships exposed by modeling the variance will often be of interest in their own right, as well as better specifying the model under consideration.

References

[2] Goldstein, H. (2003). Multilevel Statistical Models, 3rd Edition, Edward Arnold, London.
[3] Goldstein, H., Healy, M.J.R. & Rasbash, J. (1994). Multilevel time series models with applications to repeated measures data, Statistics in Medicine 13, 1643–1655.
[4] Goldstein, H. & Noden, P. (2003). Modelling social segregation, Oxford Review of Education 29, 225–237.
[5] Goldstein, H., Rasbash, J., Yang, M., Woodhouse, G., Pan, H., Nuttall, D. & Thomas, S. (1993). A multilevel analysis of school examination results, Oxford Review of Education 19, 425–433.
[6] Rasbash, J., Browne, W., Goldstein, H., Yang, M., Plewis, I., Healy, M., Woodhouse, G., Draper, D., Langford, I. & Lewis, T. (2000). A User's Guide to MLwiN, 2nd Edition, Institute of Education, London.
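The exponential covariance model in equation (11) above can be made concrete by building the implied residual covariance matrix for a set of measurement occasions. A small illustrative sketch (the values σe² = 1.0 and α = 0.5, and the occasion times, are invented, not estimates from this article):

```python
import math

def exp_covariance(times, sigma2_e, alpha):
    """Residual covariance matrix with cov(e_t, e_{t-s}) = sigma2_e * exp(-alpha*s),
    where s is the absolute time separation between two occasions."""
    return [[sigma2_e * math.exp(-alpha * abs(t1 - t2)) for t2 in times]
            for t1 in times]

# Four equally spaced occasions: the matrix has the first-order
# autoregressive pattern, with autocorrelation rho = exp(-alpha)
# between adjacent occasions and rho**s at lag s.
cov = exp_covariance([0, 1, 2, 3], sigma2_e=1.0, alpha=0.5)
rho = math.exp(-0.5)
```

With equal spacing the correlation at lag s is rho**s, which is exactly the first-order autoregressive structure the text describes; unequal spacing is handled by the same formula through the actual time separations.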
Antonio, or San Diego? If you have grown up in with respect to their number of inhabitants, it could
the United States, you probably have a considerable be shown that they chose the object predicted by the
amount of knowledge about both cities, and should recognition heuristic in more than 90% of the cases
do far better than chance when comparing the cities this was even true in a study in which participants
with respect to their populations. Indeed, about two- were taught knowledge contradicting recognition [7].
thirds of University of Chicago undergraduates got Moreover, the authors could empirically demonstrate
this question right [7]. In contrast, German citizens two less-is-more effects, one in which participants
knowledge of the two cities is negligible. So how performed better in a domain in which they rec-
much worse will they perform? The amazing answer ognized a lower percentage of objects, and another
is that within a German sample of participants, one in which performance decreased through suc-
100% answered the question correctly [7]. How could cessively working on the same questions (so that
this be? Most Germans might have heard of San recognition of objects increased during the course of
Diego, but do not have any specific knowledge the experiment).
about it. Even worse, most have never even heard
of San Antonio. However, this difference with respect to name recognition was sufficient to make an inference, namely that San Diego has more inhabitants. Their lack of knowledge allowed them to use the recognition heuristic, which, in general, says: If one of two objects is recognized and the other not, then infer that the recognized object has the higher value with respect to the criterion [7, p. 76]. The Chicago undergraduates could not use this heuristic, because they had heard of both cities: they knew too much.

The ecological rationality of the recognition heuristic lies in the positive correlation between criterion and recognition values of cities (if such a correlation were negative, the inference would have to go in the opposite direction). In the present case, the correlation is positive, because larger cities (as compared to smaller cities) are more likely to be mentioned in mediators such as newspapers, which, in turn, increases the likelihood that their names are recognized by a particular person. It should thus be clear that the recognition heuristic only works if recognition is correlated with the criterion. Examples include size of cities, length of rivers, or productivity of authors; in contrast, the heuristic will probably not work when, for instance, cities have to be compared with respect to their mayors' age or their altitude above sea level. By means of mathematical analysis, it is possible to specify the conditions in which a less-is-more effect can be obtained, that is, the maximum percentage of recognized objects in the reference class that would increase the performance in a complete paired comparison task and the point from which recognizing more objects would lead to a decrease in performance [7]. In a series of experiments in which participants had to compare cities

Take The Best

If both objects are recognized in a pair-comparison task (see Bradley–Terry Model), the recognition heuristic does not discriminate between them. A fast and frugal heuristic that could be used in such a case is Take The Best. For simplicity, it is assumed that all cues (i.e., predictors) are binary (positive or negative), with positive cue values indicating higher criterion values. Take The Best is a simple, lexicographic strategy that consists of the following building blocks:

(0) Recognition heuristic: see above.
(1) Search rule: If both objects are recognized, choose the cue with the highest validity (where validity is defined as the percentage of correct inferences among those pairs of objects in which the cue discriminates) among those that have not yet been considered for this task. Look up the cue values of the two objects.
(2) Stopping rule: If one object has a positive value and the other does not (i.e., has either a negative or unknown value), then stop search and go on to Step 3. Otherwise go back to Step 1 and search for another cue. If no further cue is found, then guess.
(3) Decision rule: Infer that the object with the positive cue value has the higher value on the criterion.

Note that Take The Best's search rule ignores cue dependencies and will therefore most likely not establish the optimal ordering. Further note that the stopping rule does not attempt to compute an optimal stopping point at which the costs of further search exceed its benefits. Rather, the motto of the heuristic is "Take The Best, ignore the rest."
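The building blocks above can be sketched in a few lines of Python. This is an illustrative sketch, not code from the article: the objects, cues, and recognition values below are invented, and the cues are assumed to be pre-sorted by validity, as the search rule requires.

```python
# Illustrative sketch of Take The Best. Cue values: 1 = positive,
# 0 = negative, None = unknown. Cues are assumed pre-sorted by validity.

def take_the_best(obj_a, obj_b, cues_by_validity, recognized):
    """Infer which object has the higher criterion value: 'a', 'b', or 'guess'."""
    # (0) Recognition heuristic: if only one object is recognized, choose it.
    if recognized[obj_a] != recognized[obj_b]:
        return 'a' if recognized[obj_a] else 'b'

    # (1) Search rule: try cues in order of validity.
    for cue in cues_by_validity:
        va, vb = cue[obj_a], cue[obj_b]
        # (2) Stopping rule: stop when exactly one object has a positive value.
        if va == 1 and vb != 1:
            return 'a'        # (3) Decision rule
        if vb == 1 and va != 1:
            return 'b'
    return 'guess'            # no cue discriminates

# Invented toy example: two recognized cities, three binary cues by validity.
cue1 = {'X': 1, 'Y': 1}   # does not discriminate -> search continues
cue2 = {'X': 1, 'Y': 0}   # discriminates -> stop, choose X
cue3 = {'X': 0, 'Y': 1}   # never looked up: one-reason decision making

recognized = {'X': True, 'Y': True}
print(take_the_best('X', 'Y', [cue1, cue2, cue3], recognized))  # -> a
```

Note how cue3 is never consulted: once a cue discriminates, search stops, which is exactly what makes the strategy frugal.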
Finally, note that Take The Best uses one-reason decision making because its decision rule does not weight and integrate information, but relies on one cue only.

Another heuristic that also employs one-reason decision making is the Minimalist. It is even simpler than Take The Best because it does not try to order cues by validity, but chooses them in random order. Its stopping rule and its decision rule are the same as those of Take The Best.

What price does one-reason decision making have to pay for being fast and frugal? How much more accurate are strategies that use all cues and combine them? To answer these questions, Czerlinski, Gigerenzer, and Goldstein [1] evaluated the performance of various strategies in 20 data sets containing real-world structures rather than convenient multivariate normal structures; they ranged from having 11 to 395 objects, and from 3 to 19 cues. The predicted criteria included demographic variables, such as mortality rates in US cities and population sizes of German cities; sociological variables, such as drop-out rates in Chicago public high schools; health variables, such as obesity at age 18; economic variables, such as selling prices of houses and professors' salaries; and environmental variables, such as the amount of rainfall, ozone, and oxidants. In the tests, half of the objects from each environment were randomly drawn. From all possible pairs within this training set, the order of cues according to their validities was determined (Minimalist used the training set only to determine whether a positive cue value indicates a higher or lower criterion). Thereafter, performance was tested both on the training set (fitting) and on the other half of the objects (generalization). Two linear models were introduced as competitors: multiple linear regression and a simple unit-weight linear model [2]. To determine which of two objects has the higher criterion value, multiple regression estimated the criterion of each object, and the unit-weight model simply added up the number of positive cue values.

Table 1 shows the counterintuitive results obtained by averaging frugality and percentages of correct choices across the 20 different prediction problems. The two simple heuristics were the most frugal: they looked up fewer than a third of the cues (on average, 2.2 and 2.4, as compared to 7.7). What about accuracy? Multiple regression was the winner when the strategies were tested on the training set, that is, on the set in which their parameters were fitted. However, when it came to predictive accuracy, that is, to accuracy in the hold-out sample, the picture changed. Here, Take The Best was not only more frugal but also more accurate than the two linear strategies (and even Minimalist, which looked up the fewest cues, did not perform too far behind the two linear strategies). This result may sound paradoxical because multiple regression processed all the information that Take The Best did, and more. However, by being sensitive to many features of the data (for instance, by taking correlations between cues into account) multiple regression suffered from overfitting, especially with small data sets (see Model Evaluation). Take The Best, on the other hand, uses few cues. The first cues tend to be highly valid and, in general, they will remain so across different subsets of the same class of objects. The stability of highly valid cues is a main factor for the robustness of Take The Best, that is, its low danger of overfitting in cross-validation as well as in other forms of incremental learning.
Table 1 Performance of two fast and frugal heuristics (Take The Best and Minimalist) and two linear models (multiple regression and a unit-weight linear model) across 20 data sets. Frugality denotes the average number of cue values looked up; Fitting and Generalization refer to the performance in the training set and the test set, respectively (see text). Adapted from Gigerenzer, G., Todd, P.M., and the ABC Research Group. (1999). Simple Heuristics that Make us Smart, Oxford University Press, New York [6]. (Columns: Frugality; Accuracy (% correct): Fitting, Generalization.)
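The evaluation protocol just described rests on cue validity: the percentage of correct inferences among those pairs in which a cue discriminates, computed on a training half of the objects. A minimal sketch, with invented objects and cue values rather than the article's data sets:

```python
# Sketch of cue-validity computation over all object pairs.
# Objects, criterion values, and cues are invented for illustration.
from itertools import combinations

def cue_validity(objects, criterion, cue):
    """Fraction of correct inferences among pairs where the cue discriminates."""
    correct = discriminating = 0
    for a, b in combinations(objects, 2):
        if cue[a] == cue[b]:
            continue                       # cue does not discriminate this pair
        discriminating += 1
        higher = a if criterion[a] > criterion[b] else b
        positive = a if cue[a] == 1 else b
        correct += (higher == positive)    # positive cue should mark the higher object
    return correct / discriminating if discriminating else 0.0

# Invented toy environment: criterion values and one binary cue.
objects = ['A', 'B', 'C', 'D']
criterion = {'A': 10, 'B': 8, 'C': 5, 'D': 1}
cue = {'A': 1, 'B': 1, 'C': 0, 'D': 0}     # perfectly aligned with the criterion

print(cue_validity(objects, criterion, cue))  # -> 1.0
```

Ranking cues by this statistic on the training half, and then scoring pairwise accuracy on the held-out half, reproduces the fitting-versus-generalization contrast reported in Table 1.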
Thus, Take The Best can have an advantage against more savvy strategies that capture more aspects of the data when the task requires making out-of-sample predictions (for other aspects of Take The Best's ecological rationality, see [9]).

There is by now much empirical evidence that people use fast and frugal heuristics, in particular, when under time pressure or when information is costly (for a review of empirical studies, see [3]). For other fast and frugal heuristics beyond the two introduced above, for instance, QuickEst (for numerical estimation), Categorization-by-Elimination (for categorization), RAFT (Reconstruction After Feedback with Take The Best, for an application to a memory phenomenon, namely the hindsight bias), the gaze heuristic (for catching balls on the playground), or various simple rules for terminating search through sequentially presented options, see [3, 5, 6, 12, 13]; for a discussion of this research program, see the commentaries and the authors' reply following [12].

References

[1] Czerlinski, J., Gigerenzer, G. & Goldstein, D.G. (1999). How good are simple heuristics? in Simple Heuristics that Make us Smart, G. Gigerenzer, P.M. Todd, and the ABC Research Group, Oxford University Press, New York, pp. 97–118.
[2] Dawes, R.M. (1979). The robust beauty of improper linear models in decision making, American Psychologist 34, 571–582.
[3] Gigerenzer, G. & Dieckmann, A. (in press). Empirical evidence on fast and frugal heuristics.
[4] Gigerenzer, G. & Goldstein, D.G. (1996). Reasoning the fast and frugal way: models of bounded rationality, Psychological Review 103, 650–669.
[5] Gigerenzer, G. & Selten, R. (2001). Bounded Rationality: The Adaptive Toolbox, The MIT Press, Cambridge.
[6] Gigerenzer, G., Todd, P.M., and the ABC Research Group. (1999). Simple Heuristics that Make us Smart, Oxford University Press, New York.
[7] Goldstein, D.G. & Gigerenzer, G. (2002). Models of ecological rationality: the recognition heuristic, Psychological Review 109, 75–90.
[8] Hertwig, R., Hoffrage, U. & Martignon, L. (1999). Quick estimation: letting the environment do the work, in Simple Heuristics that Make us Smart, G. Gigerenzer, P.M. Todd, and the ABC Research Group, Oxford University Press, New York, pp. 209–234.
[9] Martignon, L. & Hoffrage, U. (2002). Fast, frugal and fit: simple heuristics for paired comparison, Theory and Decision 52, 29–71.
[10] Rieskamp, J. & Hoffrage, U. (1999). When do people use simple heuristics, and how can we tell? in Simple Heuristics that Make us Smart, G. Gigerenzer, P.M. Todd, and the ABC Research Group, Oxford University Press, New York, pp. 141–167.
[11] Simon, H. (1982). Models of Bounded Rationality, The MIT Press, Cambridge.
[12] Todd, P.M. & Gigerenzer, G. (2000). Précis of Simple heuristics that make us smart, Behavioral and Brain Sciences 23, 727–780.
[13] Todd, P.M., Gigerenzer, G. & the ABC Research Group (in press). Ecological Rationality of Simple Heuristics.

ULRICH HOFFRAGE
Hierarchical Clustering
MORVEN LEESE
Volume 2, pp. 799–805
ISBN-13: 978-0-470-86080-9
ISBN-10: 0-470-86080-4
For components of distance derived from binary or categorical data sijk takes the value of 1 for a complete match and 0 otherwise. For components derived from continuous data, the range-scaled city block measure mentioned above is suggested. The value of wijk can be set to 0 or 1 depending on whether the comparison is valid (for example, with a binary variable it can be set to 0 to exclude negative matches, as in the Jaccard coefficient); wijk can also be used to exclude similarity components when one or both values are missing for variable k.

Single Linkage Clustering

Once a proximity matrix has been defined, the next step is to form the clusters in a hierarchical sequence. There are many algorithms for doing this, depending on the way in which clusters are merged or divided. The algorithm usually entails defining proximity between clusters, as well as between individuals, as outlined above. In one of the simplest hierarchical clustering methods, single linkage [14], also known as the nearest neighbor technique, the distance between clusters is defined as that between the closest pair of individuals, where only pairs consisting of one individual from each group are considered. Single linkage can be applied as an agglomerative method, or as a divisive method by initially splitting the data into two clusters with maximum nearest neighbor distance. The fusions made at each stage of agglomerative single linkage are now shown in a numerical example.

Consider the following distance matrix:

          1      2      3      4      5
     1   0.0
     2   2.0    0.0
D1 = 3   6.0    5.0    0.0                      (4)
     4  10.0    9.0    4.0    0.0
     5   9.0    8.0    5.0    3.0    0.0

The smallest entry in the matrix is that for individuals 1 and 2; consequently, these are joined to form a two-member cluster. Distances between this cluster and the other three individuals are obtained as

d(1,2)3 = min[d13, d23] = d23 = 5.0
d(1,2)4 = min[d14, d24] = d24 = 9.0
d(1,2)5 = min[d15, d25] = d25 = 8.0             (5)

A new matrix may now be constructed whose entries are inter-individual and cluster-individual distances:

            (1,2)    3      4      5
     (1,2)   0.0
D2 =    3    5.0    0.0                         (6)
        4    9.0    4.0    0.0
        5    8.0    5.0    3.0    0.0

The smallest entry in D2 is that for individuals 4 and 5, so these now form a second two-member cluster and a new set of distances is found:

d(1,2)3 = 5.0 as before
d(1,2)(4,5) = min[d14, d15, d24, d25] = d25 = 8.0
d(4,5)3 = min[d34, d35] = d34 = 4.0             (7)

These may be arranged in a matrix D3:

            (1,2)    3    (4,5)
     (1,2)   0.0
D3 =    3    5.0    0.0                         (8)
     (4,5)   8.0    4.0    0.0

The smallest entry is now d(4,5)3 and individual 3 is added to the cluster containing individuals 4 and 5. Finally, the groups containing individuals 1, 2 and 3, 4, 5 are combined into a single cluster.

Dendrogram

The agglomerative process above is illustrated in a dendrogram in Figure 1.

Figure 1 A dendrogram produced by single linkage hierarchical clustering. The process successively merges five individuals based on their pairwise distances (see text) to form a sequence of five partitions P1–P5:

Distance   Partition   Members
  5.0         P5       [1 2 3 4 5]
  4.0         P4       [1 2], [3 4 5]
  3.0         P3       [1 2], [3], [4 5]
  2.0         P2       [1 2], [3], [4], [5]
  0.0         P1       [1], [2], [3], [4], [5]
The nodes of the dendrogram represent clusters and the lengths of the stems (heights) represent the dissimilarities at which clusters are joined. The same data and clustering procedure can give rise to 2^(n−1) dendrograms with different appearances, and it is usual for the software to choose an order for displaying the nodes that is optimal (in some sense). Drawing a line across the dendrogram at a particular height defines a particular partition or cluster solution (such that clusters below that height are distant from each other by at least that amount). The structure resembles an evolutionary tree, and it is in applications where hierarchies are implicit in the subject matter, such as biology and anthropology, where hierarchical clustering is perhaps most relevant. In other areas it can still be used to provide a starting point for other methods, for example, optimization methods such as k-means.

While the dendrogram illustrates the hierarchical process by which series of cluster solutions are produced, low-dimensional plots of the data (e.g., principal component plots) are more useful for interpreting particular solutions. Such plots can show the relationships among clusters, and among individual cases within clusters, which may not be obvious from a dendrogram. Comparisons between the mean levels or frequency distributions of individual variables within clusters, and the identification of representative members of the clusters (centrotypes or exemplars [19]), can also be useful. The latter are defined as the objects having the maximum within-cluster average similarity (or minimum dissimilarity), for example, the medoid (the object with the minimum absolute distance to the other members of the cluster).

The dendrogram can be regarded as representing the original relationships amongst the objects, as implied by their observed proximities. Its success in doing this can be measured using the cophenetic matrix, whose elements are the heights at which two objects become members of the same cluster in the dendrogram. The product-moment correlation between the entries in the cophenetic matrix and the corresponding ones in the proximity matrix (excluding 1s on the diagonals) is known as the cophenetic correlation. Comparisons using the cophenetic correlation can also be made between different dendrograms representing different clusterings of the same data set. Dendrograms can be compared using randomization tests to assess the statistical significance of the cophenetic correlation [11].

Other Agglomerative Clustering Methods

Single linkage operates directly on a proximity matrix. Another type of clustering, centroid clustering [15], however, requires access to the original data. To illustrate this type of method, it will be applied to the set of bivariate data shown in Table 1.

Table 1 Data on five objects used in example of centroid clustering

Object   Variable 1   Variable 2
  1         1.0          1.0
  2         1.0          2.0
  3         6.0          3.0
  4         8.0          2.0
  5         8.0          0.0

Suppose Euclidean distance is chosen as the inter-object distance measure, giving the following distance matrix:

          1      2      3      4      5
     1   0.0
     2   1.0    0.0
D1 = 3   5.39   5.10   0.0                      (9)
     4   7.07   7.0    2.24   0.0
     5   7.07   7.28   3.61   2.0    0.0

Examination of D1 shows that d12 is the smallest entry, and objects 1 and 2 are fused to form a group. The mean vector (centroid) of the group is calculated (1, 1.5) and a new Euclidean distance matrix is calculated:

            (1,2)    3      4      5
     (1,2)   0.0
D2 =    3    5.22   0.0                         (10)
        4    7.02   2.24   0.0
        5    7.16   3.61   2.0    0.0

The smallest entry in D2 is d45 and objects 4 and 5 are therefore fused to form a second group, the mean vector of which is (8.0, 1.0). A further distance matrix D3 is now calculated:

            (1,2)    3    (4,5)
     (1,2)   0.0
D3 =    3    5.22   0.0                         (11)
     (4,5)   7.02   2.83   0.0

In D3 the smallest entry is d(45)3, and so objects 3, 4, and 5 are merged into a three-member cluster. The final stage consists of the fusion of the two remaining groups into one.
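Both worked examples can be checked with a short script. This is an illustrative sketch, not code from the article: it re-runs agglomerative single linkage on distance matrix (4) and recomputes one centroid distance from the Table 1 data.

```python
# Re-running the worked examples: single linkage on distance matrix (4),
# and one centroid-clustering distance from the Table 1 data.
import math

def single_linkage_heights(labels, d):
    """Agglomerate by nearest-neighbor distance; return successive fusion heights."""
    clusters = [frozenset([x]) for x in labels]
    heights = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # single-linkage: distance of the closest pair across the two clusters
                dij = min(d[frozenset({p, q})]
                          for p in clusters[i] for q in clusters[j])
                if best is None or dij < best[0]:
                    best = (dij, i, j)
        h, i, j = best
        heights.append(h)
        merged = clusters[i] | clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return heights

# Distance matrix (4) as a lookup of pairwise distances.
d = {frozenset(pair): v for pair, v in {
    (1, 2): 2.0, (1, 3): 6.0, (1, 4): 10.0, (1, 5): 9.0,
    (2, 3): 5.0, (2, 4): 9.0, (2, 5): 8.0,
    (3, 4): 4.0, (3, 5): 5.0, (4, 5): 3.0}.items()}

print(single_linkage_heights([1, 2, 3, 4, 5], d))  # -> [2.0, 3.0, 4.0, 5.0]

# Centroid clustering: the centroid of objects 1 and 2 is (1, 1.5); its
# Euclidean distance to object 3 at (6, 3) reproduces the 5.22 entry of (10).
print(round(math.dist((1.0, 1.5), (6.0, 3.0)), 2))  # -> 5.22
```

The four fusion heights 2.0, 3.0, 4.0, and 5.0 match the partitions P2–P5 shown in Figure 1.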
Different definitions of intergroup proximity give rise to different agglomerative methods. Median linkage [5] is similar to centroid linkage except that the centroids of the clusters to be merged are weighted equally to produce the new centroid of the merged cluster, thus avoiding the more numerous of the pair of clusters dominating. The new centroid is intermediate between the two constituent clusters. In the centroid linkage shown above, Euclidean distance was used, as is usual. While other proximity measures are possible with centroid or median linkage, they would lack interpretation in terms of the raw data. Complete linkage (or furthest neighbor) [16] is opposite to single linkage, in the sense that the distance between groups is now defined as that of the most distant pair of individuals (the diameter of the cluster). In (group) average linkage [15], the distance between two clusters is the average of the distances between all pairs of individuals that are made up of one individual from each group. Average, centroid, and median linkage are also known as UPGMA, UPGMC, and WPGMC methods, respectively (U: unweighted; W: weighted; PG: pair group; A: average; C: centroid).

Ward introduced a method based on minimizing an objective function at each stage in the hierarchical process, the most widely used version of which is known as Ward's method [17]. The objective at each stage is to minimize the increase in the total within-cluster error sum of squares. This increase is in fact a function of the weighted Euclidean distance between the centroids of the merged clusters. Lance and Williams' flexible method is defined by values of the parameters of a general recurrence formula [10], and many of the standard methods mentioned above can be defined in terms of the parameters of the Lance and Williams formulation.

Divisive Clustering Methods

As mentioned earlier, divisive methods operate in the other direction from agglomerative methods, starting with one large cluster and successively splitting clusters. Polythetic divisive methods are relatively rarely used and are more akin to the agglomerative methods discussed above, since they use all variables simultaneously, and are computationally demanding. For data consisting of binary variables, however, relatively simple and computationally efficient monothetic divisive methods are available. These methods divide clusters according to the presence or absence of each variable, so that at each stage clusters contain members with certain attributes that are either all present or all absent.

Instead of cluster homogeneity, the attribute used at each step in a divisive method can be chosen according to its overall association with the other attributes: this is sometimes termed association analysis [18]. The split at each stage is made according to the presence or absence of the attribute whose association with the others (i.e., the summed criterion above) is a maximum. For example, for one pair of variables Vi and Vj with values 0 and 1, the frequencies observed might be as follows:

           Vi = 1   Vi = 0
Vj = 1       a        b
Vj = 0       c        d

Examples of measures of association based on these frequencies (summed over all pairs of variables) are |ad − bc| and (ad − bc)²n/[(a + b)(a + c)(b + d)(c + d)]. Hubalek gives a review of 43 such coefficients [7]. A general problem with this method is that the possession of a particular attribute, which is either rare or rarely found in combination with others, may take an individual down the wrong path.

Choice of Number of Clusters

It is often the case that an investigator is not interested in the complete hierarchy but only in one or two partitions obtained from it (or cluster solutions), and this involves deciding on the number of groups present. In standard agglomerative or polythetic divisive clustering, partitions are achieved by selecting one of the solutions in the nested sequence of clusterings that comprise the hierarchy. This is equivalent to cutting a dendrogram at an optimal height (this choice sometimes being termed the best cut). The choice of height is generally based on large changes in fusion levels in the dendrogram, and a scree-plot of height against number of clusters can be used as an informal guide. A relatively widely used formal test procedure is based on the relative sizes of the different fusion levels [13], and a number of other formal approaches for determining
the number of clusters have been reviewed [12] (see Number of Clusters).

Choice of Cluster Method

Apart from the problem of deciding on the number of clusters, the choice of the appropriate method in any given situation may also be difficult. Hierarchical clustering algorithms are only stepwise optimal, in the sense that at any stage the next step is chosen to be optimal in some sense, but that may not guarantee the globally optimal partition, had all possibilities been examined. Empirical and theoretical studies have rarely been conclusive. For example,

Figure 2 Dendrogram produced by cluster analysis of similarity judgments of pain descriptors obtained from healthy volunteers, using average-linkage cluster analysis. The data are healthy people's responses to the descriptors in the MAPS (Multidimensional Affect and Pain Survey). The dendrogram has been cut at 30 clusters and also shows superclusters joining at higher distances. A separate factor analysis of the 30-cluster concepts, obtained from patients' responses, found six factors, indicated along the left-hand side.
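The informal guide from the previous section (look for a large change in fusion levels when choosing the best cut) can be sketched as follows. The helper `suggest_cut` is hypothetical, not a published procedure, and the fusion heights below are invented for illustration.

```python
# Informal "best cut" rule: cut the dendrogram where successive fusion
# heights jump the most. suggest_cut is a hypothetical helper; the heights
# below are invented for illustration.
def suggest_cut(heights):
    """Given fusion heights in merge order, return (cut_height, n_clusters)."""
    n_objects = len(heights) + 1
    jumps = [later - earlier for earlier, later in zip(heights, heights[1:])]
    i = max(range(len(jumps)), key=jumps.__getitem__)
    # After the first i + 1 fusions there are n_objects - (i + 1) clusters;
    # the large jump up to heights[i + 1] suggests stopping there.
    return heights[i], n_objects - (i + 1)

print(suggest_cut([1.0, 1.2, 1.3, 4.0]))  # -> (1.3, 2)
```

Here the last fusion is far higher than the earlier ones, so the rule suggests stopping before it, leaving two clusters; a scree-plot of the same heights conveys this visually.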
[18] Williams, W.T. & Lambert, J.M. (1959). Multivariate methods in plant ecology, 1. Association analysis in plant communities, Journal of Ecology 47, 83–101.
[19] Wishart, D. (1999). ClustanGraphics 3: interactive graphics for cluster analysis, in Classification in the Information Age, W. Gaul & H. Locarek-Junge, eds, Springer-Verlag, Berlin, pp. 268–275.

(See also Additive Tree; Cluster Analysis: Overview; Fuzzy Cluster Analysis; Overlapping Clusters; Two-mode Clustering)

MORVEN LEESE
Hierarchical Item Response Theory Modeling
DANIEL BOLT AND JEE-SEON KIM
Volume 2, pp. 805–810
traits. A hierarchical IRT model avoids problems of attenuation bias that are introduced when using a two-step estimation procedure, specifically, one that first estimates the person traits using an IRT model, and then in a separate analysis regresses the trait estimates on the person variables. When these analyses are executed simultaneously, as in hierarchical IRT, the regression coefficient estimates are based on the latent traits and thus are not attenuated due to estimation error.

A third advantage of hierarchical IRT is its capacity to include additional levels above persons [9, 13]. To illustrate, we consider a three-level dataset from the 1999 administration of the mathematics section of the Texas Assessment of Academic Skills (TAAS). The dataset contains correct/incorrect item responses to 52 items for a sample of 26,289 fifth-graders from 363 schools. Student variables related to socioeconomic status (FLUNCH = 1 implies free or reduced-price lunch, 0 = regular-price lunch) and gender (GENDER = 1 implies female, 0 = male) were also considered. Here we analyze just 20 of the 52 items. The three-level model involves item responses nested within students, and students nested within schools.

For the within-person model, we use a Rasch model. In a Rasch model, the probability of correct item response is modeled through an item difficulty parameter bj and a single person trait parameter θik, the latter now double-indexed to identify student i from school k:

P(Uik,j = 1 | θik, bj) = exp(θik − bj) / [1 + exp(θik − bj)].   (2)

At Level 2, between-student variability is modeled as:

θik = β0k + β1k FLUNCHik + β2k GENDERik + rik,   (3)

where β0k, β1k, and β2k are the intercept and regression coefficient parameters for school k; rik is a normally distributed residual with mean zero and variance σr.

Next, we add a third level associated with school. At the school level, we can account for the possibility that certain effects at the student level (represented by the coefficients β0k, β1k, β2k) may vary across schools. In the current model, we allow for between-school variability in the intercepts (β0k) and the within-school FLUNCH effects (β1k). We also create a school variable, FLUNCH.SCHk, the mean of FLUNCH across all students within school k, to represent the average socioeconomic status of students within the school. The variable FLUNCH.SCHk is added as a predictor both of the school intercepts and the within-school FLUNCH effect. This results in the following Level-3 (between-school) model:

β0k = γ00 + γ01 FLUNCH.SCHk + u0k,   (4)
β1k = γ10 + γ11 FLUNCH.SCHk + u1k,   (5)
β2k = γ20.   (6)

In this representation, each of the γ parameters represents a fixed effect, while u0k and u1k are random effects associated with the school intercepts and school FLUNCH effects, respectively. Across schools, we assume (u0k, u1k) to be bivariate normally distributed with mean (0, 0) and covariance matrix Tu, having diagonal elements τu00 and τu11. (Note that by omitting a similar residual for β2k, the effects of GENDER are assumed to be the same, that is, fixed, across schools.) Variations on the above model could be considered by modifying the nature of effects (fixed versus random) and/or predictors of the effects.

To estimate the model above, we follow a procedure described by Kamata [9]. In Kamata's method, a hierarchical IRT model is portrayed as a hierarchical generalized linear model. Random person effects are introduced through a random intercept, while item effects are introduced through the fixed (across persons) coefficients of item-identifier dummy variables at Level 1 of the model (see [9] for details). In this way, the model can be estimated using a quasi-likelihood algorithm implemented for generalized linear models in the software program HLM [16].

A portion of the results is shown in Tables 1 and 2. In Table 1, the fixed effect estimates are seen to be statistically significant for FLUNCH, FLUNCH.SCH, and GENDER, implying lower levels of math ability on average for students receiving free or reduced-price lunch within a school (γ10 = −.53), and also lower (on average) abilities for students coming from schools that have a larger percentage of students that receive free or reduced-price lunch (γ01 = −.39). The effect for gender is significant but weak (γ20 = .04), with females having slightly higher ability.
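As a numerical sketch, equations (2)–(6) can be composed directly. The school and student values below are invented; the fixed-effect values follow the Table 1 estimates, with negative signs on the FLUNCH effects as the surrounding text implies.

```python
# Numerical sketch of the three-level Rasch model in equations (2)-(6).
# School and student values are invented; fixed effects follow Table 1
# (negative signs on the FLUNCH effects, as the text implies).
import math

def rasch_p(theta, b):
    """Equation (2): probability of a correct item response."""
    return math.exp(theta - b) / (1 + math.exp(theta - b))

# Level 3 (equations 4-6): school-level model for one school k.
g00, g01, g10, g11, g20 = 1.53, -0.39, -0.53, 0.05, 0.04
flunch_sch = 0.30                      # invented school mean of FLUNCH
u0k, u1k = 0.0, 0.0                    # school residuals set to their means
beta0 = g00 + g01 * flunch_sch + u0k   # school intercept
beta1 = g10 + g11 * flunch_sch + u1k   # within-school FLUNCH effect
beta2 = g20                            # GENDER effect, fixed across schools

# Level 2 (equation 3): trait of one student (FLUNCH = 1, female, r_ik = 0).
theta = beta0 + beta1 * 1 + beta2 * 1 + 0.0

print(round(theta, 3))                 # composed ability value
print(round(rasch_p(theta, b=0.5), 3)) # P(correct) on an item of difficulty 0.5
```

Changing FLUNCH, GENDER, or the residuals and re-running shows how the between-school and between-student terms propagate down to the item-response probability.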
Table 1 Estimates of fixed and random effect parameters in multilevel Rasch model, Texas Assessment of Academic Skills data

Fixed effect estimates            Coeff    se     t-stat   Approx. df   P value
γ00, intercept                     1.53    .04     34.24       361       .000
γ01, FLUNCH.SCH                    −.39    .06     −6.15       361       .000
γ10, FLUNCH                        −.53    .04    −12.04       361       .000
γ11, FLUNCH × FLUNCH.SCH           .05    .08       .67        361       .503
γ20, GENDER                        .04    .01      3.02      26824       .003
No significant interaction was detected for FLUNCH and FLUNCH.SCH.

The variance estimates, also shown in Table 1, suggest significant between-school variability both in the residual intercepts (τu00 = .16) and in the residual FLUNCH effects (τu11 = .05). This implies that even after accounting for the effects of FLUNCH.SCH, there remains significant between-school variability in both the mean ability levels of non-FLUNCH students and in the within-school effects of FLUNCH. Likewise, significant between-student variability remains across students within school (σr = 1.26) even after accounting for the effects of FLUNCH and GENDER. Recall that Rasch item difficulties are also estimated as fixed effects in the model. These estimates (not shown here) ranged from −2.49 to 2.06.

Table 2 provides empirical Bayes estimates of the residuals for three schools (see Random Effects in Multivariate Linear Models: Prediction). Such estimates allow a school-specific inspection of the two effects allowed to vary across schools (in this model, the intercept, u0k, and FLUNCH, u1k, effects). More specifically, they indicate how each school's effect departs from what is expected given the corresponding fixed effects (in this model, the fixed intercept and FLUNCH.SCH effects). These estimates illustrate another way in which hierarchical IRT can be useful, namely, its capacity to provide group-level assessment. Recalling that both residuals have means of zero across schools, we observe that School 1 has a lower intercept (−1.03), implying lower ability levels for non-FLUNCH students, and a more negative FLUNCH effect (−0.26), than would be expected given the school's level on FLUNCH.SCH (.30). School 2, which has a much higher proportion of FLUNCH students than School 1 (.93), has a higher than expected intercept and a more negative than expected FLUNCH effect, while School 3, with FLUNCH.SCH = .17, has a higher than expected intercept, but a FLUNCH effect that is equivalent to what is expected.

Despite the popularity of Kamata's method, it is limited to use with the Rasch model. Other estimation methods have been proposed for more general models. For example, other models within the Rasch family (e.g., Masters' partial credit model) can be estimated using a general EM algorithm in the software CONQUEST [24]. Still more general models, such as hierarchical two-parameter IRT models, can
Table 2 Examples of empirical Bayes estimates for individual schools, Texas Assessment of Academic Skills data (columns: empirical Bayes estimates of the intercept and FLUNCH residuals, and the corresponding fixed effects)
be estimated using fully Bayesian methods, such as Gibbs sampling [6, 7]. Such procedures are appealing in that they provide a full characterization of the joint posterior distribution of model parameters, as opposed to point estimates. Because they are easy to implement, they also permit greater flexibility in manipulation of other features of the hierarchical IRT model. For example, Maier [10] explores use of alternative distributions for the residuals (e.g., inverse chi-square, uniform) in a hierarchical Rasch model.

Several advantages of hierarchical IRT modeling can be attributed to its use of a latent trait. First, the use of a latent variable as the outcome in the between-person model allows for more realistic treatment of measurement error. When modeling test scores as outcomes, for example, variability in the standard error of measurement across persons is not easily accounted for, as a common residual variance applies for all persons. Second, the invariance properties of IRT allow it to accommodate a broader array of data designs, such as matrix-sampled educational assessments (as in the National Assessment of Educational Progress), or others that may involve missing data. Finally, the interval-level properties of the IRT metric can be beneficial. For example, Raudenbush, Sampson, and Johnson [16] note the value of a Rasch trait metric when modeling self-reported criminal behavior across neighborhoods, where simple counts of crime tend to produce observed scores that are highly skewed and lack interval scale properties.

Hierarchical IRT Models with Both Random Person and Item Effects

Less common, but equally useful, are hierarchical IRT models that model random item effects. Such models typically introduce a between-item model in which the item parameters of the IRT model become outcomes. Modeling the predictive effects of item features on item parameters can be very useful. Advantages include improved estimation of the item parameters (i.e., collateral information), as well as information about item features that can be useful for item construction and item-level validity checks [4]. A common IRT model used for this purpose is Fischer's [5] linear logistic test model (LLTM). In the LLTM, Rasch item difficulty is modeled as a linear function of prespecified item characteristics, such as the cognitive skill requirements of the item. Hierarchical IRT models can extend models such as the LLTM by allowing item parameters to be random, thus allowing for less-than-perfect prediction of the item difficulty parameters [8].

A hierarchical IRT model with random person and item effects can be viewed as possessing two forms of nesting, as item responses are nested within both persons and items. Van den Noortgate, De Boeck, and Meulders [21] show how random item effect models can be portrayed as cross-classified hierarchical models [20] in that each item response is associated with both a person and an item. With cross-classified models, it is possible to consider not only main effects associated with item and person variables, but also item-by-person variables [19, 21]. This further extends the range of applications that can be portrayed within the hierarchical IRT framework. For example, hierarchical IRT models such as the random weights LLTM [18], where the LLTM weights vary randomly across persons, and dynamic Rasch models [22], where persons learn over the course of a test, can both be portrayed in terms of item-by-person covariates [19]. Similarly, IRT applications such as differential item functioning can be portrayed in a hierarchical IRT model where the product of person group by item is a covariate [19].

Additional levels of nesting can also be defined for the items. For example, in the hierarchical IRT model of Janssen, Tuerlinckx, Meulders, and De Boeck [8], items are nested within target content categories. An advantage of this model is that a prototypical item for each category can be defined, thus allowing criterion-related classification decisions based on each person's estimated trait level.

Different estimation strategies have been considered for random person and item effect models. Using the cross-classified hierarchical model representation, Van den Noortgate et al. [21] propose use of quasi-likelihood procedures implemented in the SAS macro GLIMMIX [23]. Patz and Junker [14] presented a very general Markov chain Monte Carlo strategy for hierarchical IRT models that can incorporate both item and person covariates. A related application is given in Patz, Junker, Johnson, and Moriano [15], where dependence due to rater effects is addressed. General formulations such as this offer the clearest
is expressed as a weighted linear combination of indication of the future potential for hierarchical IRT
Hierarchical Item Response Theory Modeling 5
models, which should continue to offer the method- data, and rated responses, Journal of Educational and
ologist exciting new ways of investigating sources of Behavioral Statistics 24, 342366.
hierarchical structure in item response data. [15] Patz, R.J., Junker, B.W., Johnson, M.S. & Mariano, L.T.
(2002). The hierarchical rater model for rated test items
and its application to large-scale educational assessment
References data, Journal of Educational and Behavioral Statistics
27, 341384.
[16] Raudenbush, S.W., Bryk, A.S., Cheong, Y.F. & Cong-
[1] Adams, R.J., Wilson, M. & Wang, W. (1997). The
don, R. (2001). HLM 5: Hierarchical Linear and Non-
multidimensional random coefficients multinomial logit
linear Modeling, 2nd Edition, Scientific Software Inter-
model, Applied Psychological Measurement 21, 123.
national, Chicago.
[2] Adams, R.J., Wilson, M. & Wu, M. (1997). Multilevel
[17] Raudenbush, S.W., Johnson, C. & Sampson, R.J. (2003).
item response models: an approach to errors in vari-
A multivariate multilevel Rasch model with application
ables regression, Journal of Educational and Behavioral
to self-reported criminal behavior, Sociological Method-
Statistics 22, 4776.
ology 33, 169211.
[3] Bock, R.D. & Zimowski, M.F. (1997). Multi-group IRT,
[18] Rimjen, F. & De Boeck, P. (2002). The random weights
in Handbook of Modern Item Response Theory, W.J. van
linear logistic test model, Applied Psychological Mea-
der Linden & R.K. Hambleton, eds, Springer-Verlag,
surement 26, 269283.
New York, pp. 433448.
[19] Rimjen, F., Tuerlinckx, F., De Boeck, P. & Kuppens, P.
[4] Embretson, S.E. & Reise, S.P. (2000). Item Response
(2003). A nonlinear mixed model framework for item
Theory for Psychologists, Lawrence Erlbaum, Mahwah.
response theory, Psychological Methods 8, 185205.
[5] Fisher, G.H. (1973). Linear logistic test model as an
[20] Snijders, T. & Bosker, R. (1999). Multilevel Analysis,
instrument in educational research, Acta Psychologica
Sage Publications, London.
37, 359374.
[21] Van den Noortgate, W., De Boeck, P. & Meulders, M.
[6] Fox, J.P. (in press) Multilevel IRT Using Dichotomous
(2003). Cross-classification multilevel logistic models in
and Polytomous Response Items. British Journal of
psychometrics, Journal of Educational and Behavioral
Mathematical and Statistical Psychology.
Statistics 28, 369386.
[7] Fox, J.-P. & Glas, C. (2001). Bayesian estimation of a
[22] Verhelst, N.D. & Glas, C.A.W. (1993). A dynamic
multilevel IRT model using Gibbs sampling, Psychome-
generalization of the Rasch model, Psychometrika 58,
trika 66, 271288.
395415.
[8] Janssen, R., Tuerlinckx, F., Meulders, M. & De
[23] Wolfinger, R. & OConnell, M. (1993). Generalized
Boeck, P. (2000). A hierarchical IRT model for criterion
linear mixed models: a pseudo-likelihood approach,
referenced measurement, Journal of Educational and
Journal of Statistical Computation and Simulation 48,
Behavioral Statistics 25, 285306.
233243.
[9] Kamata, A. (2001). Item analysis by the hierarchical
[24] Wu, M.L., Adams, R.J. & Wilson, M.R. (1997). Con-
generalized linear model, Journal of Educational Mea-
quest: Generalized Item Response Modeling Software
surement 38, 7993.
[Software manual], Australian Council for Educational
[10] Maier, K. (2001). A Rasch hierarchical measurement
Research, Melbourne.
model, Journal of Educational and Behavioral Statistics
[25] Zwinderman, A.H. (1991). A generalized Rasch model
26, 307330.
for manifest predictors, Psychometrika 56, 589600.
[11] Mislevy, R.J. (1987). Exploiting auxiliary information
about examinees in the estimation of item parameters,
Applied Psychological Measurement 11, 8191. Further Reading
[12] Mislevy, R.J., & Sheehan K.M. (1989). The role of
collateral information about examinees in item parameter
estimation, Psychometrika 54, 661679. De Boeck, P. & Witson, M. (2004). Explanatory Item Response
[13] Pastor, D. (2003). The use of multilevel item response Models, Springes-Verlag, New York.
theory modeling in applied research: an illustration,
Applied Measurement in Education 16, 223243. DANIEL BOLT AND JEE-SEON KIM
[14] Patz, R.J. & Junker, B.W. (1999). Application and exten-
sions of MCMC in IRT: multiple item types, missing
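The crossed person-and-item structure discussed in the entry above can be made concrete with a small simulation of Rasch-type responses, in which every response depends on one random person effect and one random item effect. This is only an illustrative sketch; the sample sizes and unit variances below are made up, not taken from any of the cited models.

```python
import math
import random

random.seed(0)

def inv_logit(x):
    return 1.0 / (1.0 + math.exp(-x))

# Hypothetical sizes, chosen for illustration only.
n_persons, n_items = 500, 20

# Crossed random effects: one draw per person and one per item, so every
# response (p, i) shares theta[p] with other items and b[i] with other
# persons -- the two forms of nesting described in the entry.
theta = [random.gauss(0, 1) for _ in range(n_persons)]  # person abilities
b = [random.gauss(0, 1) for _ in range(n_items)]        # item difficulties

responses = [[1 if random.random() < inv_logit(theta[p] - b[i]) else 0
              for i in range(n_items)] for p in range(n_persons)]

# Easier items (low difficulty b) should be answered correctly more often.
p_correct = [sum(responses[p][i] for p in range(n_persons)) / n_persons
             for i in range(n_items)]
easiest, hardest = b.index(min(b)), b.index(max(b))
print(p_correct[easiest], p_correct[hardest])
```

Fitting such data would require the cross-classified or MCMC machinery the entry describes; the simulation only shows how the two random effects jointly generate each response.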
Hierarchical Models
ROBERT E. PLOYHART
Volume 2, pp. 810-816
in
ISBN-13: 978-0-470-86080-9
ISBN-10: 0-470-86080-4
Editors
Y_i = B0 + B1(X_i) + e_i                          (1)

satisfy_ij = B0j + B1j(autonomy_ij) + e_ij        (2)
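The unit-varying intercepts and slopes behind (2), and depicted in Figure 2, can be simulated directly: generate a B0j and B1j for each supervisor, then fit a separate least-squares line within each unit. All numbers below are hypothetical, chosen only to mimic the satisfaction/autonomy example.

```python
import random

random.seed(1)

# Hypothetical two-level data: employees (Level 1) nested within
# supervisors/units (Level 2). Values are illustrative only.
n_units, n_per_unit = 30, 20

def ols_slope(xs, ys):
    # Least-squares slope: cov(x, y) / var(x).
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den

slopes = []
for j in range(n_units):
    b0j = random.gauss(6.0, 0.8)  # unit-specific intercept B0j
    b1j = random.gauss(0.4, 0.3)  # unit-specific slope B1j
    xs = [random.uniform(1, 7) for _ in range(n_per_unit)]  # autonomy
    ys = [b0j + b1j * x + random.gauss(0, 1) for x in xs]   # satisfaction
    slopes.append(ols_slope(xs, ys))

# The estimated slopes vary considerably across units, as in Figure 2;
# a single pooled regression line would misrepresent this heterogeneity.
print(min(slopes), max(slopes))
```

This is exactly the pattern a random effects model captures: the B0j and B1j are treated as draws from a population of supervisors rather than as fixed constants.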
Figure 2 Illustration of what random effects for intercepts and slopes really mean (hypothetical unit regression lines: dependent variable plotted against a Level 1 predictor ranging from low to high)
(e.g., autocratic, democratic) also influence satisfaction? Note employees are nested within supervisors: a given supervisor may be in charge of numerous employees, and, therefore, employees who have a particular supervisor may share some similarities not shared by employees of another supervisor. This makes it necessary to use HLM. Supervisors may have independent and direct effects on satisfaction (arrow 2), or moderate the relationship between autonomy and satisfaction (arrow 1).

Equation (1) is the classic regression model, which assumes errors are independent and normally distributed with a mean of zero and constant variance (see Multiple Linear Regression). The model assumes that the regression weights are constant across different supervisors; hence, there are no subscripts for B0 and B1. This exemplifies a fixed effects model because the weights do not vary across units (see Fixed and Random Effects). In contrast, the HLM exemplifies a random effects model because the regression weights B0 and B1 vary across supervisors (levels of j; see (2)), who are assumed to be randomly selected from a population of supervisors (see Fixed and Random Effects). Figure 2 illustrates this visually, where hypothetical regression lines are shown for 10 subjects. The solid lines represent five subjects from one unit, the dashed lines represent five subjects from a different unit. Notice that across both units there is considerable variability in intercepts and slopes, with members of the second unit tending to show negative slopes (denoted by dashed lines). The solid bold line represents the regression line obtained from a traditional regression, clearly an inappropriate representation of these data.

The real benefit of HLM comes in two forms. First, because it explicitly models the nonindependence/heterogeneity in the data, it provides accurate standard errors and, hence, statistical significance tests. Second, it allows one to explain between-unit differences in the regression weights. This can be seen in (3) and (4). Equations (3) and (4) state that between-unit differences in the intercept (slope) are explained by supervisory style. Note that in this model, the supervisory effect is a fixed effect. Level 2 predictors may either be categorical (e.g., experimental condition) or continuous (e.g., individual differences). It is important to center the continuous data to facilitate interpretation of the lower level effects (see Centering in Linear Multilevel Models). Often, the most useful centering method is to center the Level 2 predictors across all units (grand mean centering), and then center the Level 1 predictors within each unit (unit mean centering).

The basic HLM assumptions are (a) errors at both levels have a mean of zero, (b) Level 1 and Level 2 errors are uncorrelated with each other, (c) Level 1 errors (eij) have a constant variance (sigma-squared), and (d) Level 2 errors take a known form, but this form allows heterogeneity (nonconstant variance) and covariances among the error terms. Note also that HLM models frequently use restricted maximum likelihood (REML) estimation that assumes multivariate normality (see Maximum Likelihood Estimation; Catalogue of Probability Density Functions).

Comparing and Interpreting HLM Models

The HLM provides several estimates of model fit. These frequently include the -2 residual log likelihood, Akaike's Information Criterion (AIC), and the Bayesian Information Criterion (BIC). Smaller values for AIC and BIC are better. There are no statistical significance tests associated with these indices, so one must conduct a model comparison approach in which simple models are compared to more complicated models. The difference between the simple and complex models is evaluated via the change in -2 residual log likelihood (distributed as a chi-square), and/or examining which has the smaller AIC and BIC values. Table 1 shows a generic sequence of model comparisons for HLM models. The model comparison approach permits only the minimum amount of model complexity needed to explain the data.

HLM also provides statistical significance tests of fixed and random effects. Statistical significance tests for the fixed effects are interpreted just like in the usual ordinary least squares (OLS) regression model. However, the statistical significance tests for the random effects should be avoided because they are often erroneous (see [13]). It is better to test the random effects using the model comparison approach described above and in Table 1 (see [3, 11]).

Example

Let us now illustrate how to model hierarchical data in SAS. Suppose we have 1567 employees nested within 168 supervisors. We hypothesize a simple two-level model identical to that shown in the lower part of Figure 1.

Following the model testing sequence in Table 1, we start with determining how much variance in satisfaction is explainable by differences between supervisors. This is known as an intraclass correlation coefficient (ICC) and is calculated by taking the variance in the intercept and dividing it by the sum of the intercept variance plus residual variance. Generic SAS notation for running all models is shown in Table 1. The COVTEST option requests significance tests of the random effects (although we know not to put too much faith in these tests), and the UPDATE option asks the program to keep us informed of the REML iteration progress. The CLASS statement identifies the Level 2 variable within which the Level 1 variables are nested (here referred to as unit). The statement dv = /SOLUTION CL DDFM = BW specifies the nature of the fixed effects (satisfy is the dependent variable; SOLUTION asks for significance tests for the fixed effects, CL requests confidence intervals, and DDFM = BW requests denominator degrees of freedom be calculated using the between-within method, see [9, 13]). The RANDOM statement is where we specify the random effects. Because the INTERCEPT is specified, we allow random variation among the intercept term. TYPE = UN specifies the structure of the random effects, where UN means unstructured (other options include variance components, etc.). SUB = unit again identifies the nesting variable. Because no predictor variable is specified in this model, it is equivalent to a one-way random effects Analysis of Variance (ANOVA) with unit as the grouping factor. When we run this analysis, we find a variance component of 0.71 and residual variance of 2.78; therefore the ICC = 0.71/(0.71 + 2.78) = 0.20. This means 20% of the variance in satisfaction can be explained by higher-level effects.

Step 2 includes autonomy and the intercept as fixed effects. This model is identical to the usual regression model. Table 2 shows the command syntax and results for this model. The regression weights for the intercept (6.28) and autonomy (0.41) are statistically significant.

Step 3 determines whether there are between-unit differences in the intercept, which would represent a random effect. We start with the intercept and compare the fit indices for this model to those from the fixed-effects model. To conserve space, I note only that these comparisons supported the inclusion of the random effect for the intercept. The fourth step is to examine the regression weight for autonomy (also a random effect), and compare the fit of this model to the previous random intercept model. When we compare the two models, we find no improvement in model fit by allowing the slope parameter to be a random effect. This suggests the relationship between autonomy and satisfaction does not differ across supervisors. However, the intercept parameter does show significant variability (0.73) across units.

The last step is to determine whether supervisory style differences explain the variability in job satisfaction. To answer this question, we could include a measure of supervisory style as a Level 2 fixed effects predictor. Table 2 shows supervisory style has a significant effect (0.39).

Thus, one concludes (a) autonomy is positively related to satisfaction, and (b) this relationship is not moderated by supervisory style, but (c) there are between-supervisor differences in job satisfaction, and (d) supervisory style helps explain these differences.

Conclusion

Many substantive questions in the behavioral sciences must deal with nested and hierarchical data. Hierarchical models were developed to address these problems. This entry provided a brief introduction to such models and illustrated their application using SAS. But there are many extensions to this basic model. HLM can also be used to model longitudinal data and growth curves. In such models, the Level 1 model represents intraindividual change and the Level 2 model represents individual differences in intraindividual change (for introductions, see [3, 12]). HLM has many research and real-world applications and provides researchers with a powerful theory testing and building methodology.
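A quick numerical check of the ICC arithmetic and the model-comparison statistic from the example, using the variance components reported in the text and the -2 log likelihoods from Table 2 (degrees of freedom and P values are omitted in this sketch):

```python
# Variance components from the unconditional (intercept-only) model.
tau00 = 0.71   # between-supervisor intercept variance
sigma2 = 2.78  # residual (within-supervisor) variance

icc = tau00 / (tau00 + sigma2)
print(round(icc, 2))  # about 0.20: 20% of satisfaction variance is between units

# Nested models are compared via the change in -2 residual log likelihood,
# referred to a chi-square distribution (values from Table 2: fixed-effects
# model versus random-intercept model).
neg2ll_fixed = 6307.7
neg2ll_random_int = 6157.8
lr_change = neg2ll_fixed - neg2ll_random_int
print(round(lr_change, 1))  # large relative to any conventional critical value
```

A change this large on one additional covariance parameter is consistent with the text's conclusion that the random intercept is warranted.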
Table 2 Sample table of HLM results

Level 1 Model: AIC = 6309.7, SBC = 6315.1, -2LLR = 6307.7
  Intercept (fixed): df = 1565, estimate = 6.28, 95% C.I. (6.19; 6.37)
  Autonomy: df = 1565, estimate = 0.41, 95% C.I. (0.31; 0.50)
  Residual variance: 3.27
  SAS code:
    PROC MIXED COVTEST UPDATE;
    CLASS unit;
    MODEL satisfy = autonomy/SOLUTION CL DDFM = BW;
    RUN;

Level 1 and 2 Model (random intercept): AIC = 6161.8, SBC = 6168.1, -2LLR = 6157.8
  Intercept (random): df = 164, estimate = 6.28, 95% C.I. (6.12; 6.44), random parameter = 0.73
  Autonomy: df = 1401, estimate = 0.41, 95% C.I. (0.32; 0.49)
  Residual variance: 2.62
  SAS code:
    PROC MIXED COVTEST UPDATE;
    CLASS unit;
    MODEL satisfy = autonomy/SOLUTION CL DDFM = BW;
    RANDOM INTERCEPT /TYPE = UN SUB = unit;
    RUN;

Level 1 and 2 Model (supervisory style added): AIC = 6134.1, SBC = 6140.3, -2LLR = 6130.1
  Intercept (random): df = 163, estimate = 6.30, 95% C.I. (6.16; 6.44), random parameter = 0.52
  Autonomy: df = 1401, estimate = 0.41, 95% C.I. (0.32; 0.49)
  Supervisory Style: df = 163, estimate = 0.39, 95% C.I. (0.26; 0.52)
  Residual variance: 2.63
  SAS code:
    PROC MIXED COVTEST UPDATE;
    CLASS unit;
    MODEL satisfy = autonomy suprstyl/SOLUTION CL DDFM = BW;
    RANDOM INTERCEPT /TYPE = UN SUB = unit;
    RUN;
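The AIC and SBC values in Table 2 can be reproduced from the -2 residual log likelihoods. The bookkeeping below is an inference from the table's numbers rather than anything stated in the entry: it matches the convention of counting only the covariance parameters k, with m the number of independent subjects (1567 employees when no RANDOM statement is given, 168 supervisors otherwise). Treat it as a sketch.

```python
import math

def aic(neg2ll, k):
    # Akaike's Information Criterion from a -2 log likelihood.
    return neg2ll + 2 * k

def sbc(neg2ll, k, m):
    # Schwarz's Bayesian Criterion (BIC) with m independent units.
    return neg2ll + k * math.log(m)

# Level 1 model: one covariance parameter (residual), 1567 employees.
print(round(aic(6307.7, 1), 1), round(sbc(6307.7, 1, 1567), 1))
# Level 1 and 2 models: intercept variance + residual (k = 2), 168 units.
print(round(aic(6157.8, 2), 1), round(sbc(6157.8, 2, 168), 1))
print(round(aic(6130.1, 2), 1), round(sbc(6130.1, 2, 168), 1))
```

These reproduce the tabled fit indices to within 0.1 (the table's -2LLR values are themselves rounded), which is what makes the "smaller AIC/BIC is better" comparisons in the text possible.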
given knot sequence. Best fits for such models are easily computed these days by using alternating least squares algorithms that iteratively alternate fitting the transformations for fixed coefficients and fitting the coefficients for fixed transformations [1]. Although generalized additive models add a great deal of flexibility to the regression situation, they do not directly deal with the instability and multicollinearity that comes from the very large number of predictors. They do not address the data reduction problem; they just add more parameters to obtain a better fit.

A next step is to combine the ideas of PCR and GAM into projection pursuit regression or PPR [4]. The model now is

E(y_i | x_i) = Σ_{t=1..q} φ_t( Σ_{s=1..p} α_{ts} x_{is} ).   (6)

This is very much like GAM, but the transformations are applied to a presumably small number of linear combinations of the original variables. PPR models are closely related to neural networks, in which the linear combinations are the single hidden layer and the nonlinear transformations are sigmoids or other squashers (see Neural Networks). PPR models can be fit by general neural network algorithms. PPR is generalized in Li's sliced inverse regression or SIR [7, 8], in which the model is

E(y_i | x_i) = F( Σ_{s=1..p} α_{1s} x_{is}, ..., Σ_{s=1..p} α_{qs} x_{is} ).   (7)

For details on the SIR and PHD algorithms, we refer to (see Slicing Inverse Regression).

Another common, and very general, approach is to use a finite basis of functions h_{st}, with t = 1, ..., q_s, for each of the predictors x_s. The basis functions can be polynomials, piecewise polynomials or splines, or radial basis functions. We then approximate the multivariate function F by a sum of products of these basis functions. Thus we obtain the model

E(y_i | x_i) = Σ_{t1=1..q1} ... Σ_{tp=1..qp} β_{t1...tp} h_{1t1}(x_{i1}) ... h_{ptp}(x_{ip}).   (8)

This approach is used in multivariate adaptive regression splines, or MARS [3]. The basis functions are splines, and they adapt to the data by locating the knots of the splines.

A different strategy is to use the fact that any multivariate function can be approximated by a multivariate step function. This fits into the product model, if we realize that multivariate functions constant on rectangles are products of univariate functions constant on intervals. In general, we fit

E(y_i | x_i) = Σ_{t=1..q} β_t I(x_i ∈ R_t).   (9)

Here, the R_t define a partitioning of the p-dimensional space of predictors, and the I(·) are indicator functions of the q regions. In each of the q regions the regression function is a constant. The problem, of course, is how to define the regions. The most popular solution is to use a recursive partitioning algorithm such as Classification and Regression Trees, or CART [2], which defines the regions as rectangles in variable space. Partitionings are refined by splitting along a variable, and by finding the variable and the split which minimize the residual sum of squares. If the variable is categorical, we split into two arbitrary subsets of categories. If the variable is quantitative, we split an interval into two pieces. This recursive partitioning builds up a binary tree, in which leaves are refined in each stage by splitting the rectangles into two parts.

It is difficult, at the moment, to suggest a best technique for high-dimensional regression. Formal statistical sensitivity analysis, in the form of standard errors and confidence intervals, is largely missing. Decision procedures, in the form of tests, are also in their infancy. The emphasis is on exploration and on computation. Since the data sets are often enormous, we do not really have to worry too much about significance; we just have to worry about predictive performance and about finding (mining) interesting aspects of the data.

References

[1] Breiman, L. & Friedman, J.H. (1985). Estimating optimal transformations for multiple regression and correlation, Journal of the American Statistical Association 80, 580-619.
[2] Breiman, L., Friedman, J., Olshen, R. & Stone, C. (1984). Classification and Regression Trees, Wadsworth.
[3] Friedman, J. (1991). Multivariate adaptive regression splines (with discussion), Annals of Statistics 19, 1-141.
[4] Friedman, J. & Stuetzle, W. (1981). Projection pursuit regression, Journal of the American Statistical Association 76, 817-823.
[5] Hastie, T., Tibshirani, R. & Friedman, J. (2001). The Elements of Statistical Learning, Springer.
[6] Hastie, T.J. & Tibshirani, R.J. (1990). Generalized Additive Models, Chapman and Hall, London.
[7] Li, K.C. (1991). Sliced inverse regression for dimension reduction (with discussion), Journal of the American Statistical Association 86, 316-342.
[8] Li, K.C. (1992). On principal Hessian directions for data visualization and dimension reduction: another application of Stein's Lemma, Journal of the American Statistical Association 87, 1025-1039.

JAN DE LEEUW
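The split search at the heart of the recursive partitioning described in the entry above - try every variable and every candidate threshold, and keep the split that minimizes the residual sum of squares - can be sketched in a few lines. The data below are a toy illustration; a real CART implementation adds recursion over the resulting rectangles, categorical splits, and pruning.

```python
def rss(ys):
    # Residual sum of squares around the region mean (the constant fit).
    if not ys:
        return 0.0
    m = sum(ys) / len(ys)
    return sum((y - m) ** 2 for y in ys)

def best_split(xs, ys):
    # xs: rows of predictor values; ys: responses.
    best = (float("inf"), None, None)  # (rss, variable index, threshold)
    p = len(xs[0])
    for s in range(p):
        for threshold in sorted({row[s] for row in xs}):
            left = [y for row, y in zip(xs, ys) if row[s] <= threshold]
            right = [y for row, y in zip(xs, ys) if row[s] > threshold]
            if not left or not right:
                continue
            total = rss(left) + rss(right)
            if total < best[0]:
                best = (total, s, threshold)
    return best

# y depends only on the second variable, so the search should split on it.
xs = [[0.1, 0.0], [0.4, 0.2], [0.9, 0.4], [0.2, 0.6], [0.7, 0.8], [0.5, 1.0]]
ys = [1.0, 1.1, 0.9, 5.0, 5.2, 4.9]
total, var, thr = best_split(xs, ys)
print(var, thr)  # splits on variable 1 at 0.4
```

Each accepted split corresponds to refining one rectangle R_t in (9) into two, growing the binary tree the entry describes.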
Hills Criteria of Causation
KAREN J. GOODMAN AND CARL V. PHILLIPS
Volume 2, pp. 818-820
intended for all time and all occasions, but as "a formal description of how we drew our. . . conclusions. . ." [7, p. 527]. Since the 1970s [20], Susser has advocated the use of causal criteria for discriminating between "a true causal factor and an imposter" [21, pp. 63-78], proposing a refined list of criteria in 1991, including strength, specificity, consistency (both replicability and survivability on diverse tests of the causal hypothesis), predictive performance, and coherence (including theoretical, factual, biological, and statistical) [21]. Susser's historical analysis argues against ossified causal criteria ("epidemiologists have modified their causal concepts as the nature of their tasks has changed. . .. Indeed, the current set of criteria may well be displaced as the tasks of the discipline change, which they are bound to do." [21, pp. 64-67]).

Limitations of Criteria for Inferring Causation

With the exception of temporality, no item on any proposed list is necessary for causation, and none is sufficient. More importantly, it is not clear how to quantify the degree to which each criterion is met, let alone how to aggregate such results into a judgment about causation. In their advanced epidemiology textbook, Rothman and Greenland question the utility of each item on Hill's list except temporality [17]. Studies of how epidemiologists apply causal criteria reveal wide variations in how the criteria are selected, defined, and judged [24]. Furthermore, there appear to be no empirical assessments to date of the validity or usefulness of causal criteria (e.g., retrospective studies of whether appealing to criteria improves the conclusions of an analysis). In short, the value of checklists of criteria for causal inference is severely limited and has not been tested.

Beyond Causal Criteria

Although modern thinking reveals limitations of causal criteria, Hill's landmark paper contains crucial insights. Hill anticipated modern statistical approaches to critically analyzing associations, asking "Is there any way of explaining the set of facts before us, is there any other answer equally, or more, likely than cause and effect?" [8, p. 299]. Ironically, although Hill correctly identified this as the fundamental question, consulting a set of criteria does little to answer this question. Recent developments in methods for uncertainty quantification [15], however, are creating tools for assessing the probability that an observed association is due to alternative explanations, which include random error or study bias rather than a causal relationship. Equally important, Hill, though described as "the greatest medical statistician of the twentieth century" [4], had his formal academic training in economics rather than medicine or statistics, and anticipated modern expected-net-benefit-based decision analysis [16]. He stated, "finally, in passing from association to causation. . . we shall have to consider what flows from that decision" [8, p. 300], and suggested that the degree of evidence required, in so far as alternate explanations appear unlikely, depends on the potential costs and benefits of taking action. Recognizing the inevitable scientific uncertainty in establishing cause and effect, for Hill, the bottom line for causal inference - overlooked in most discussions of causation or statistical inference - was deciding whether the evidence was convincing enough to warrant a particular policy action when considering expected costs and benefits.

References

[1] Armitage, P. (2003). Fisher, Bradford Hill, and randomization, International Journal of Epidemiology 32, 925-928.
[2] Chalmers, I. (2003). Fisher and Bradford Hill: theory and pragmatism? International Journal of Epidemiology 32, 922-924.
[3] Doll, R. (1992). Sir Austin Bradford Hill and the progress of medical science, British Medical Journal 305, 1521-1526.
[4] Doll, R. (1993). Obituary: Sir Austin Bradford Hill, pp. 795-797, in Sir Austin Bradford Hill, 1899-1991, Statistics in Medicine 12, 795-808.
[5] Doll, R. (2003). Fisher and Bradford Hill: their personal impact, International Journal of Epidemiology 32, 929-931.
[6] Evans, A.S. (1978). Causation and disease: a chronological journey, American Journal of Epidemiology 108, 249-258.
[7] Hamill, P.V.V. (1997). Re: Invited commentary: Response to Science article, "Epidemiology faces its limits", American Journal of Epidemiology 146, 527.
[8] Hill, A.B. (1965). The environment and disease: association or causation?, Proceedings of the Royal Society of Medicine 58, 295-300.
[9] Hill, A.B. (1971). Principles of Medical Statistics, 9th Edition, Oxford University Press, New York.
[10] Hill, A.B. (1977). Short Textbook of Medical Statistics, Oxford University Press, New York.
[11] Hill, A.B. (1984). Short Textbook of Medical Statistics, Oxford University Press, New York.
[12] Hill, A.B. (1991). Bradford Hill's Principles of Medical Statistics, Oxford University Press, New York.
[13] Hill, I.D. (1982). Austin Bradford Hill - ancestry and early life, Statistics in Medicine 1, 297-300.
[14] Hume, D. (1978). A Treatise of Human Nature (originally published in 1739), 2nd Edition, Oxford University Press, Oxford.
[15] Phillips, C.V. (2003). Quantifying and reporting uncertainty from systematic errors, Epidemiology 14(4), 459-466.
[16] Phillips, C.V. & Goodman, K.J. (2004). The missed lessons of Sir Austin Bradford Hill, Epidemiologic Perspectives and Innovations 1, 3.
[17] Rothman, K.J. & Greenland, S. (1998). Modern Epidemiology, Chap. 2, 2nd Edition, Lippincott-Raven Publishers, Philadelphia, pp. 7-28.
[18] Sartwell, P.E. (1960). On the methodology of investigations of etiologic factors in chronic disease - further comments, Journal of Chronic Diseases 11, 61-63.
[19] Silverman, W.A. & Chalmers, I. (1992). Sir Austin Bradford Hill: an appreciation, Controlled Clinical Trials 13, 100-105.
[20] Susser, M. (1973). Causal Thinking in the Health Sciences. Concepts and Strategies of Epidemiology, Oxford University Press, New York.
[21] Susser, M. (1991). What is a cause and how do we know one? A grammar for pragmatic epidemiology, American Journal of Epidemiology 133, 635-648.
[22] Statistics in Medicine, Special Issue to Mark the 85th Birthday of Sir Austin Bradford Hill, 1(4), 297-375, 1982.
[23] Weed, D.L. (1995). Causal and preventive inference, in Cancer Prevention and Control, P. Greenwald, B.S. Kramer & D. Weed, eds, Marcel Dekker, pp. 285-302.
[24] Weed, D.L. & Gorelic, L.S. (1996). The practice of causal inference in cancer epidemiology, Cancer Epidemiology, Biomarkers & Prevention 5, 303-311.

(See also INUS Conditions)

KAREN J. GOODMAN AND CARL V. PHILLIPS
Histogram
BRIAN S. EVERITT
Volume 2, pp. 820-821
A histogram is perhaps the graphical display that is used most often in the initial exploration of a set of measurements. Essentially, it is a simple graphical representation of a frequency distribution in which each class interval (category) is represented by a vertical bar whose base is the class interval and whose height is proportional to the number of observations in the class interval. When the class intervals are unequally spaced, the histogram is drawn in such a way that the area of each bar is proportional to the frequency for that class interval. Scott [1] considers how to choose the optimal number of classes in a histogram. Figure 1 shows a histogram of the murder rates (per 100 000) for 30 cities in southern USA in 1970.

Figure 1 Murder rates for 30 cities in the United States (frequency plotted against murder rate, 0-25)

The histogram is generally used for two purposes, counting and displaying the distribution of a variable, although it is relatively ineffective for both; stem and leaf plots are preferred for counting and box plots are preferred for assessing distributional properties. A histogram is the continuous data counterpart of the bar chart.

Reference

[1] Scott, D.W. (1979). On optimal and data-based histograms, Biometrika 66, 605-610.

BRIAN S. EVERITT
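The area convention for unequally spaced class intervals mentioned above amounts to plotting height = frequency/width for each bar, so that height times width recovers the class frequency. A sketch with hypothetical class boundaries and counts:

```python
# Class boundaries (unequal widths) and observations per class.
# These numbers are illustrative, not the murder-rate data of Figure 1.
breaks = [0, 5, 10, 20, 40]
freqs = [10, 8, 6, 6]

# Height of each bar: frequency divided by interval width.
heights = [f / (hi - lo) for f, lo, hi in zip(freqs, breaks, breaks[1:])]
# Bar areas (height * width) are then proportional to the frequencies.
areas = [h * (hi - lo) for h, lo, hi in zip(heights, breaks, breaks[1:])]

print(heights)  # [2.0, 1.6, 0.6, 0.3]
print(areas)    # each equals its class frequency (up to floating point)
```

With equal widths this reduces to the ordinary frequency histogram, since dividing every count by the same width rescales all bars identically.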
Historical Controls
VANCE W. BERGER AND RANDI SHAFRAN
Volume 2, pp. 821-823
extent that it is comprised of roughly 20% potential responders (those having the 50% response rate) and 80% nonresponders, following the population proportions. But if these proportions are distorted in the cohort, then the response rate in the cohort will not reflect the response rate in the population. In the extreme case, if the cohort is comprised entirely of potential responders, then the response rate will be 50%.

The key, though, is that if this distortion from the population is not recognized, then one might be inclined to attribute the increased response rate, 50% versus 10%, not to the composition of the sample but rather to how they were treated. One would go on to associate the 50% response rate with the new treatment, and conclude that it is superior to the standard one. We see that the selection bias discussed above can render historically controlled data misleading, so that observed response rates are not equal to the true response rates. However, historically controlled data can be used to help evaluate new treatments if selection bias can be minimized. The conditions under which selection bias can be demonstrated to be minimal are generally not very plausible, and require a uniform prognosis of untreated patients [3, 7]. An extreme example can illustrate this point. If a vaccine were able to confer immortality, or even the ability to survive an event that currently is uniformly fatal, then it would be clear, even without randomization, that this vaccine is effective. But this is not likely, and so it is probably safe to say that no historically controlled trial can be known to be free of biases.

Of course, if the evidentiary standard required for progress in science were an ironclad guarantee of no biases, then science would probably not make very much progress, and so it may be unfair to single out historically controlled studies as unacceptable based on the biases they may introduce. If historically controlled trials tend to be more biased than concur-

This is especially important for patients with life-threatening diseases. Besides potential ethical advantages, studies with historical controls may require a smaller number of participants and may require less time than comparable randomized trials [4, 6].

If feasible, then randomized control trials should be used. However, this is not always the case, and historical control trials may be used as an alternative. The limitations of historical controls must be taken into account in order to prevent false conclusions regarding the evaluation of new treatments. It is probably not prudent to use formal inferential analyses with any nonrandomized studies, including historically controlled studies, because without a sample space of other potential outcomes and known probabilities for each, the only outcome that can be considered to have been possible (with a known probability, for inclusion in a sample space) is the one that was observed. This means that the only valid P value is the uninteresting value of 1.00 [1].

References

[1] Berger, V.W. & Bears, J. (2003). When can a clinical trial be called "randomized"? Vaccine 21, 468-472.
[2] Berger, V.W. & Christophi, C.A. (2003). Randomization technique, allocation concealment, masking, and susceptibility of trials to selection bias, Journal of Modern Applied Statistical Methods 2, 80-86.
[3] Byar, D.P., Schoenfeld, D.A., Green, S.B., Amato, D.A., Davis, R., De Gruttola, V., Finkelstein, D.M., Gatsonis, C., Gelber, R.D., Lagakos, S., Lefkopoulou, M., Tsiatis, A.A., Zelen, M., Peto, J., Freedman, L.S., Gail, M., Simon, R., Ellenberg, S.S., Anderson, J.R., Collins, R., Peto, R. & Peto, T. (1990). Design considerations for AIDS trials, New England Journal of Medicine 323, 1343-1348.
[4] Gehan, E.A. (1984). The evaluation of therapies: historical control studies, Statistics in Medicine 3, 315-324.
[5] Green, S.B. & Byar, D.P. (1984). Using observational data from registries to compare treatments: the fallacy of omnimetrics, Statistics in Medicine 3, 361-370.
[6] Hoehler, F.K. (1999). Sample size calculations when out-
rent, or especially randomized trials, then this has comes will be compared with historical control, Comput-
to be a disadvantage that counts against historically ers in Biology and Medicine 29, 101110.
[7] Lewis, J.A. & Facey, K.M. (1998). Statistical shortcom-
controlled trials. However, it is not the only consid- ings in licensing applications, Statistics in Medicine 17,
eration. Despite the limitations of historical control 16631673.
data, there are, under certain conditions, advantages [8] Thall, P. & Simon, R. (1990). Incorporating historical
to employing this technique. For example, if the control data in planning phase II clinical trials, Statistics
new treatment turns out to be truly superior to the in Medicine 9, 215228.
control treatment, then finding this out with a histori-
cally controlled trial would not involve exposing any VANCE W. BERGER AND RANDI SHAFRAN
patients to the less effective control treatment [4].
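The cohort-composition arithmetic in this entry can be sketched in a few lines; this is a hypothetical illustration only, in which "potential responders" are assumed to respond with probability 0.5 and nonresponders never respond, as in the example above (the function name is invented for the sketch):

```python
def cohort_response_rate(frac_potential_responders, p_respond=0.5):
    """Expected response rate of a cohort mixing potential responders
    (who respond with probability p_respond) with nonresponders (who never respond)."""
    return frac_potential_responders * p_respond

# Population mix: 20% potential responders, 80% nonresponders -> 10% response rate.
print(cohort_response_rate(0.20))  # 0.1
# Selection-biased cohort made up entirely of potential responders -> 50% response rate.
print(cohort_response_rate(1.00))  # 0.5
```

The 50% versus 10% gap here is produced entirely by the composition of the cohort, not by any treatment, which is the point of the selection-bias argument above.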
History of Analysis of Variance
RYAN D. TWENEY
Volume 2, pp. 823–826
in
ISBN-13: 978-0-470-86080-9
ISBN-10: 0-470-86080-4
Editors
need for a method of analyzing complex experimental designs. Lovie [7] noted that such designs were used long prior to the appearance of ANOVA techniques and that even factorial and nested designs were in occasional use in the 1920s and 1930s. He suggested that the late appearance of ANOVA was instead due in part to the cognitive complexity of its use and the relatively limited mathematical backgrounds of the experimental psychologists who were its most likely clients. Further, Lovie noted the deeply theoretical nature of the concept of interaction. Until the simplistic rule of one variable that dominated the experimental methodological thinking could be transcended, there was no proper understanding of the contribution that ANOVA could make.

In the United States, the demands of war research during World War II exposed many psychologists to new problems, new techniques, and a need to face the limitations of accepted psychological methods. In contrast to the first war, there was much less of the "measure everyone" emphasis that characterized the Yerkes-led mental testing project of World War I. Instead, a variety of projects used the research abilities of psychologists, often in collaboration with other disciplinary scientists. War research also affected the nature of statistical analysis itself, and, in fact, also provided an opportunity for statisticians to establish their autonomy as a distinct profession. Many of the common uses of statistical inference were being extended by statisticians and mathematicians during the war, for example, studies of bombing and fire control (Neyman), sequential analysis (Wald and others), and quality control statistics (Deming). More to the point, significance testing began to find its way into the specific applications that psychologists were working upon. Rucci & Tweney [11] found only 17 articles in psychology journals that used ANOVA between 1934 and 1939, and of these most of the applications were, as Lovie [7] noted, rather unimpressive. Yet the wartime experiences of psychologists drove home the utility of these procedures, led to many more psychologists learning the new procedures, and provided paradigmatic exemplars of their use.

After 1945, there was a rapid expansion of graduate training programs in psychology, driven in large part by a perceived societal need for more clinical and counseling services, and also by the needs of Cold War military, corporate, and governmental bureaucracies. Experimental psychologists, newly apprised of Fisherian statistical testing, and measurement-oriented psychologists who had had their psychometric and statistical skills sharpened by war research, were thus able to join hands in recommending that statistical training in both domains, ANOVA and correlational, be a requirement for doctoral-level education in psychology. As psychology expanded, experimental psychologists in the United States were therefore entrusted with ensuring the scientific credentials of the training of graduate students (most of whom had little background in mathematics or physical sciences), even in clinical domains. The adoption of ANOVA training permitted a new generation of psychologists access to a set of tools of perceived scientific status and value, without demanding additional training in mathematics or physical science. As a result, by the 1970s, ANOVA was frequent in all experimental research, and the use of significance testing had penetrated all areas of the behavioral sciences, including those that relied upon correlational and factor-analytic techniques. In spite of frequent reminders that there were "two psychologies" [2], one correlational and one ANOVA-based, the trend was toward the statistical merging of the two via the common embrace of significance testing.

In the last decades of the twentieth century, ANOVA techniques displayed a greater sophistication, including repeated measures designs (see Repeated Measures Analysis of Variance), mixed designs, multivariate analysis of variance, and other procedures. Many of these were available long before their incorporation in psychology and other behavioral sciences. In addition, recent decades have seen a greater awareness of the formal identity between ANOVA and multiple linear regression techniques, both of which are, in effect, applications of a generalized linear model [9].

In spite of this growing sophistication, the use of ANOVA techniques has not always been seen as a good thing. In particular, the ease with which complex statistical procedures can be carried out on modern desktop computers has led to what many see as the misuse and overuse of otherwise powerful programs. For example, one prominent recent critic, Geoffrey Loftus [6], has urged the replacement of null hypothesis testing by the pictorial display of experimental effects, together with relevant confidence intervals, even for very complex designs.

Many of the criticisms of ANOVA use in the behavioral sciences are based upon a claim that
inferential canons are being violated. Some have criticized the "mechanized inference" practiced by many in the behavioral sciences, for whom a significant effect is a true finding and a nonsignificant effect is a finding of no difference [1]. Gigerenzer [5] argued that psychologists were using an incoherent hybrid model of inference, one that inappropriately blended aspects of Neyman/Pearson approaches with those of Fisher. In effect, the charge is that a kind of misappropriated Bayesianism (see Bayesian Statistics) has been at work, one in which the P value of a significance test, p(D|H0), was confused with the posterior probability, p(H0|D), and, even more horribly, that p(H1|D) was equated with 1 − p(D|H0). Empirical evidence that such confusions were rampant even among professional behavioral scientists was given by Tversky & Kahneman [12].

By the beginning of the twenty-first century, the ease and availability of sophisticated ANOVA techniques continued to grow, along with increasingly powerful graphical routines. These hold out the hope that better uses of ANOVA will appear among the behavioral sciences. The concerns over mechanized inference and inappropriate inferential beliefs will not, however, be resolved by any amount of computer software. Instead, these will require better methodological training and more careful evaluation by journal editors of submitted articles.

References

[1] Bakan, D. (1966). The test of significance in psychological research, Psychological Bulletin 66, 423–437.
[2] Cronbach, L. (1957). The two disciplines of scientific psychology, American Psychologist 12, 671–684.
[3] Danziger, K. (1990). Constructing the Subject: Historical Origins of Psychological Research, Cambridge University Press, Cambridge.
[4] Fisher, R.A. (1935). The Design of Experiments, Oliver & Boyd, Edinburgh.
[5] Gigerenzer, G. (1993). The superego, the ego, and the id in statistical reasoning, in A Handbook for Data Analysis in the Behavioral Sciences: Methodological Issues, Keren, G. & Lewis, C., eds, Lawrence Erlbaum Associates, Hillsdale, pp. 311–339.
[6] Loftus, G.R. (1993). A picture is worth a thousand p values: on the irrelevance of hypothesis testing in the microcomputer age, Behavior Research Methods, Instruments, & Computers 25, 250–256.
[7] Lovie, A.D. (1979). The analysis of variance in experimental psychology: 1934–1945, British Journal of Mathematical and Statistical Psychology 32, 151–178.
[8] Myers, C.S. (1909). A Text-book of Experimental Psychology, Edward Arnold, London.
[9] Neter, J., Wasserman, W., Kutner, M.H. & Nachtsheim, C.J. (1996). Applied Linear Statistical Models, 4th Edition, McGraw-Hill, New York.
[10] Reitz, W. (1934). Statistical techniques for the study of institutional differences, Journal of Experimental Education 3, 11–24.
[11] Rucci, A.J. & Tweney, R.D. (1980). Analysis of variance and the "Second Discipline" of scientific psychology: an historical account, Psychological Bulletin 87, 166–184.
[12] Tversky, A. & Kahneman, D. (1971). Belief in the law of small numbers, Psychological Bulletin 76, 105–110.

Further Reading

Capshew, J.H. (1999). Psychologists on the March: Science, Practice, and Professional Identity in America, 1929–1969, Cambridge University Press, Cambridge.
Cowles, M. (2001). Statistics in Psychology: An Historical Perspective, 2nd Edition, Lawrence Erlbaum Associates, Mahwah.
Fienberg, S.E. (1985). Statistical developments in World War II: an international perspective, in A Celebration of Statistics: The ISI Centenary Volume, Atkinson, A.C. & Fienberg, S.E., eds, Springer, New York, pp. 25–30.
Herman, E. (1995). The Romance of American Psychology: Political Culture in the Age of Experts, University of California Press, Berkeley.
Lovie, A.D. (1981). On the early history of ANOVA in the analysis of repeated measure designs in psychology, British Journal of Mathematical and Statistical Psychology 34, 1–15.
Tweney, R.D. (2003). Whatever happened to the brass and glass? The rise of statistical instruments in psychology, 1900–1950, in Thick Description and Fine Texture: Studies in the History of Psychology, Baker, D., ed., University of Akron Press, Akron, pp. 123–142 & 200–205, notes.

RYAN D. TWENEY
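The confusion between p(D|H0) and p(H0|D) discussed above can be made concrete with Bayes' theorem; the sketch below uses purely hypothetical numbers (a prior of 0.5 and likelihoods of 0.05 and 0.10), and the function name is invented for the illustration:

```python
def posterior_h0(p_d_given_h0, p_d_given_h1, prior_h0):
    """p(H0|D) via Bayes' theorem for two exhaustive hypotheses H0 and H1."""
    numerator = p_d_given_h0 * prior_h0
    return numerator / (numerator + p_d_given_h1 * (1.0 - prior_h0))

# A "significant" result with p(D|H0) = 0.05 need not mean p(H0|D) = 0.05:
# here p(H0|D) = (0.05 * 0.5) / (0.05 * 0.5 + 0.10 * 0.5) = 1/3.
print(posterior_h0(0.05, 0.10, 0.5))
```

The posterior depends on the prior and on p(D|H1) as well as on the P value, which is exactly what the hybrid model of inference criticized by Gigerenzer ignores.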
History of Behavioral Statistics
SANDY LOVIE
Volume 2, pp. 826–829
world; second, that height is a useful property of a house that can be expressed as a measurable variable; third, that the measurements are not biased in any way during their collection, hence any variation represents pure error around the true height and not a variation in the height of the building itself; and finally, that the estimator of height has certain useful properties, which commends it to the analyst. Thus, one has a mixture of consensually achieved background ideas about the context, how the particular aspect of interest can be experimentally investigated, and agreement as to what summary numbers and other representations of the sample yield the most useful information. Statistics is about all of these, since all of them affect how any set of numbers is generated and how they are interpreted.

When we move on quickly to the middle and latter parts of the nineteenth century, we find that Quetelet's precepts have been accepted and well learnt (although not perhaps his radical social physics, or his emphasis on l'homme moyen, or the average man), and that people are now more than happy to draw conclusions from data, since they are convinced that their approach warrants them to do so. Thus, Galton is happy to use the emerging methods of regression analysis to argue a hereditarian case for all manner of human properties, not just his famous one of heights. Karl Pearson's monumental extension and elaboration of both Galton's relatively simple ideas about relationships and his rather plain vanilla version of Darwinian evolution marks the triumph of a more mathematical look to statistics, which is now increasingly seen as the province of the data modeler rather than the mere gatherer-in of numbers and their tabulation. This is also coincident, for example, with a switch from individual teaching and examining to mass education and testing in the schools, thus hastening the use of statistics in psychology and areas related to it like education (see [2] for more information). One cannot, in addition, ignore the large amount of data generated by psychophysicists like Fechner (who was a considerable statistician in his own right), or his model building in the name of panpsychism, that is, his philosophy that the whole universe was linked together by a form of psychic energy. The 1880s and 1890s also saw the increasing use of systematic, multifactor experimental designs in the psychological and pedagogical literature, including several on the readability of print. In other words, there was an unrequited thirst in psychology for more quantitative analyses that went hand in hand with how the psychologist and the educational researcher viewed the complexity of their respective worlds.

The first decade or so of the twentieth century also saw the start of the serious commitment of psychology to statistical analysis, in so far as this marks the publication of the first textbooks in the subject by popular and well-respected authors like Thorndike. There was also a lively debate pursued by workers for over 30 years as to the best test for the difference between a pair of means. This had been kick-started by Yerkes and Thorndike at the turn of the century and had involved various measures of variation to scale the difference. This was gradually subsumed within Student's t Test, but the existence of a range of solutions to the problem meant that, unlike analysis of variance (or ANOVA), the acceptance of the standard analysis was somewhat delayed; but notice that the essentially uncertain nature of the world and any data gathered from it had been explicitly recognized by psychologists in this analysis (see [7] for more details). Unfortunately, the major twentieth-century debates in statistics about inference seemed to pass psychology by. It was only in the 1950s, for example, that the Neyman–Pearson work on Type I and II errors (see Neyman–Pearson Inference) from the 1930s had begun to seep into the textbooks (see [7] for a commentary as to why this might have been). An exception could possibly be made for Bayesian inference and its vigorous take-up in the 1960s by Ward Edwards [3] and others (see his 1963 paper, for instance), but even this seemed to peter out as psychology came to reluctantly embrace power, sample size, and effects, with its obvious line to a Neyman–Pearson analysis of inference. Again, such an analysis brings a strong structuring principle and worldview to the task of choosing between uncertain outcomes.

The other early significant work on psychological statistics, which simultaneously both acknowledged the uncertain nature of the world and sought structuring principles to reduce the effects of this uncertainty, was Charles Spearman's foray into factor analysis from 1904 onwards. Using an essentially latent variable approach, Spearman looked for support for the strongly held nineteenth-century view that human nature could be accounted for by a single general factor (sometimes referred to as mental energy) plus an array of specific factors whose operation would be determined by the individual demands of the situation
or task. This meant that a hierarchical structure could be imposed, a priori, on the intercorrelations between the various scholastic test results that Spearman had obtained during 1903. Factor analysis as a deductive movement lasted until the 1930s, when the scepticism of Godfrey Thomson and the risky, inductive philosophy of Louis Thurstone combined to turn factor analysis into the exploratory method that it has become today (see [7] and [1] for more detail). But notice that the extraction of an uncertainty-taming structure is still the aim of the enterprise, whatever flavor of factor analysis we are looking at. And this is also the case for all the multivariable methods, which were originated by Karl Pearson, from principal component analysis to multiple linear regression.

My final section is devoted to a brief outline of the rapid acceptance by psychology of the most widely employed method in psychological statistics, that is, ANOVA (see [1, 5, 6, 10] for much more detail). This was a technique for testing the differences between more than two samples, which was developed in 1923 by the leading British statistician of this time, R A Fisher, as part of his work in agricultural research. It was also a method which crucially depended for its validity on the source of the data, specifically from experiments that randomly allocated the experimental material, for instance, varieties of wheat, to the experimental plots. Differences over the need for random allocation were the cause of much bitterness between Fisher and Student (W S Gosset), but it is really an extension of Quetelet's rule that any variation in the measurement of a homogeneous quality such as a single wheat variety should reflect error and nothing else, a property that only random allocation could guarantee. In psychology, ANOVA and its extension to more than one factor were quickly taken into the discipline after the appearance of Fisher's first book introducing the method (in 1935), and was rapidly applied to the complex, multifactor experiments which psychology had been running for decades. Indeed, so fast was this process that Garrett and Zubin, in 1943, were able to point to over 30 papers and books that used ANOVA and variations, while the earliest example that I could find of its application to an area close to psychology was by the statistician Reitz who, in 1934, used the technique to compare student performance across schools. Clearly, psychology had long taken people's actions, beliefs, and thought to be determined by many factors. Here at last was a method that allowed them to quantitatively represent and explore such a structuring worldview.

References

[1] Cowles, M. (2001). Statistics in Psychology: An Historical Perspective, 2nd Edition, Lawrence Erlbaum, Mahwah.
[2] Danziger, K. (1987). Statistical method and the historical development of research practice in American psychology, in The Probabilistic Revolution: Ideas in Modern Science, Vol. II, G. Gigerenzer, L. Kruger & M. Morgan, eds, MIT Press, Cambridge.
[3] Edwards, W., Lindman, H. & Savage, L.J. (1963). Bayesian statistical inference for psychological research, Psychological Review 70(3), 193–242.
[4] Hacking, I. (1976). The Emergence of Probability, Cambridge University Press, Cambridge.
[5] Lovie, A.D. (1979). The analysis of variance in experimental psychology: 1934–1945, British Journal of Mathematical and Statistical Psychology 32, 151–178.
[6] Lovie, A.D. (1981). On the early history of ANOVA in the analysis of repeated measure designs in psychology, British Journal of Mathematical and Statistical Psychology 34, 1–15.
[7] Lovie, A.D. (1991). A short history of statistics in Twentieth Century psychology, in New Developments in Statistics for Psychology and the Social Sciences, Vol. 2, P. Lovie & A.D. Lovie, eds, BPS Books & Routledge, London.
[8] Mackenzie, D. (1981). Statistics in Britain, 1865–1930: The Social Construction of Scientific Knowledge, Edinburgh University Press, Edinburgh.
[9] Porter, T.M. (1986). The Rise in Statistical Thinking: 1820–1900, Princeton University Press, Princeton.
[10] Rucci, A.J. & Tweney, R.D. (1979). Analysis of variance and the "Second Discipline" of scientific psychology, Psychological Bulletin 87, 166–184.

SANDY LOVIE
History of the Control Group
TRUDY DEHUE
Volume 2, pp. 829–836
to check the claims of homeopathy [24]. And several examples of group comparison in the treatment of illnesses (although without randomization) are also presented in the electronic James Lind Library (www.jameslindlibrary.org).

Entertaining, however, as such examples of comparison may be, they are hardly surprising, since checking the effects of one's actions by sometimes withholding them is a matter of everyday logic. Moreover, it would be quite artificial to depict these examples as early, if still incomplete, steps toward the present-day methodological rule of employing control groups. Historians of science use derogatory labels such as "presentist history", "finalist history", "justificationary history", and "feel good history" for histories applying present-day criteria in selecting predecessors who took early steps toward our own viewpoints, whilst also excusing these pioneers for the understandable shortcomings still present in their ideas. Arranging the examples in chronological order, as such histories do, suggests a progressive trajectory from the past to the present, whereas they actually drew their own line from the present back into the past. Historian and philosopher of science Thomas Kuhn discussed the genre under the name of "preface history", referring to the typical historical introduction in textbooks. Apart from worshipping the present, Kuhn argued, preface histories convey a deeply misleading view of scientific development as a matter of slow, but accumulative, discovery by a range of mutually unrelated great men [25, pp. 1–10; 136–144].

Rather than lining up unconnected look-alikes through the ages, the present account asks when, why, and how employing control groups became a methodological condition. The many reputed nineteenth-century scholars who explicitly rejected experimental comparison are neither scorned nor excused for their deficiency. Rather, their views are analyzed as contributions to debates in their own time. Likewise, the ideas of early twentieth-century scholars who advanced group comparison are discussed as part of debates with their own contemporaries.

Nineteenth-century Qualms

If control groups were not recommended before the early twentieth century, the expression of social experimentation did appear in much earlier methodological texts. Eighteenth-century scholars had already discussed the issue of experimentation as a suitable method for investigating human life [7]. David Hume's Treatise of Human Nature, first published in 1739, is subtitled: Being an Attempt to Introduce the Experimental Method of Reasoning into Moral Subjects. Hume and his Enlightenment contemporaries, however, borrowed the terminology of experimentation from natural science as a metaphor for major events happening without the intervention of researchers. Observing disturbances of regular life, they argued, is the human science substitute of natural science experimentation.

Nineteenth-century views on social experimentation were largely, but not entirely, the same as those of the eighteenth century. Distinguished scholars such as Adolphe Quetelet (1796–1874) in Belgium, Auguste Comte (1798–1857) in France, and George Cornwall Lewis (1806–1863) as well as John Stuart Mill (1806–1873) in Britain used the terminology of experimentation for incidents such as natural disasters, famines, economic crises, and also government interventions. Different, however, from eighteenth-century scholars and in accordance with later twentieth-century views, they preserved the epithet of scientific experimentation for experiments with active manipulation by researchers. As scientific experimentation entails intentional manipulation by researchers, they maintained, research with human beings cannot be scientific.

Roughly speaking, there were two reasons why they excluded deliberate manipulation from the useable methods of research with human beings. One reason was of a moral nature. When George Cornwall Lewis in 1852 published his two-volume Treatise on the Methods of Observation and Reasoning in Politics, he deliberately omitted the method of experimentation from the title. Experimentation, Lewis maintained, is inapplicable to man "as a sentient, and also as an intellectual and moral being". This is not because man lies beyond the reach of our powers, but because experiments could not be applied to him "without destroying his life, or wounding his sensibility, or at least subjecting him to annoyance and restraint" [26, pp. 160–161].

The second reason was of an epistemological nature. In 1843, the prominent British philosopher, economist, and methodologist John Stuart Mill published his System of Logic that was to become very influential in the social sciences. This work extensively discussed Mill's method of difference, which
entailed comparing cases in which an effect does and does not occur. According to Mill, this most perfect of the methods of experimental inquiry was not suitable for research with people. He illustrated this view with the frequent topic of debate in the present century, that is, whether or not government intervention into free enterprise impedes national wealth. The method of difference is unhelpful in a case like this, he explained, because comparability is not achievable: "[I]f the two nations differ in this portion of their institutions, it is from some differences in their position, and thence in their apparent interests, or in some portion or the other of their opinions, habits and tendencies; which opens a view of further differences without any assignable limit, capable of operating on their industrial prosperity, as well as on every other feature of their condition, in more ways than can be enumerated or imagined" [31, pp. 881–882].

Mill raised the objection of incomparability not only in complex issues such as national economic policies but in relation to all research with people. Even a comparatively simple question such as, whether or not mercury cures a particular disease, was quite chimerical as it was impossible in medical research to isolate a single factor from all other factors that might constitute an effect. Although the efficacy of quinine, colchicum, lime juice, and cod liver oil was shown in so many cases that their "tendency to restore health . . . may be regarded as an experimental truth", real experimentation was out of the question, and "[S]till less is this method applicable to a class of phenomena more complicated than those of physiology, the phenomena of politics and history" [31, pp. 451–452].

Organicism and Determinism

How to explain the difference between these nineteenth-century objections and the commonness of experimentation with experimental and control groups in our own time? How could Lewis be compunctious about individual integrity even to the level of not annoying people, whereas, in our time, large group experiments hardly raise an eyebrow? And why did a distinguished methodologist like Mill not promote the solution, so self-evident to present-day researchers, of simply creating comparable groups if natural ones did not exist?

The answer is that their qualms were inspired by the general holism and determinism of their time. Nineteenth-century scholars regarded communities as well as individuals as organic systems in which every element is closely related to all others, and in which every characteristic is part of an entire pattern of interwoven strands rather than caused by one or more meticulously isolated factors. In addition, they ascribed the facts of life to established laws of God or Nature rather than to human purposes and plans. According to nineteenth-century determinism, the possibilities of engineering human life were very limited. Rather than initiating permanent social change, the role of responsible authorities was to preserve public stability. Even Mill, for whom the disadvantages of a laissez-faire economy posed a significant problem, nevertheless held that government interference should be limited to a small range of issues and should largely aim at the preservation of regular social order.

In this context, the common expression of social experimentation could not be more than a metaphor to express the view that careful observation of severe disturbances offers an understanding of the right and balanced state of affairs. The same holistic and determinist philosophy expressed itself in nineteenth-century statistics, where indeterminism or chance had the negative connotation of lack of knowledge and whimsicality rather than the present-day association of something to take and as an instrument to make good use of [33, 22]. Nineteenth-century survey researchers, for instance, did not draw representative population samples. This was not because of the inherent complexity of the idea, nor because of sluggishness on the researchers' part, but because they investigated groups of people as organic entities and prototypical communities [17]. To nineteenth-century researchers, the idea of using chance for deriving population values, or, for that matter, allocating people to groups, was literally unimaginable.

Even the occasional proponent of active experimentation in clinical research rejected chance as an instrument of scientific research. In 1865, the illustrious French physiologist Claude Bernard (1813–1878) published a book with the deliberately provocative title of Introduction à l'Étude de la Médecine Expérimentale [1], translated into English as An Introduction to the Study of Experimental Medicine. Staunchly, Bernard stated that philosophic obstacles
to experimental medicine arise from vicious methods, bad mental habits, and certain false ideas [2, p. 196]. For the sake of valid knowledge, he maintained, comparative experiments have to be made "at the same time and on as comparable patients as possible" [2, p. 194].

Yet, one searches Bernard's Introduction in vain for comparison of experimental to control groups. As ardently as he defended experimentation, he rejected statistical averages. He sneered about the startling instance of a physiologist who collected urine from a railroad station urinal where people of all nations passed, as if it were possible to analyze "the average European urine!" (italics and exclamation mark in original). And he scorned surgeons who published the success rates of their operations, because average success does not give any certainty on the next operation to come. Bernard's expression of comparative experimentation did refer to manipulating animals and humans for the sake of research. Instead of comparing group averages, however, he recommended that one should present "our most perfect experiment as a type" [2, pp. 134–135]. To Bernard, the rise of probabilistic statistics meant literally nothing scientifically [2, p. 137].

Impending Changes

The British statistician, biometrician, and eugenicist Sir Francis Galton (1822–1911) was a crucial figure in the gradual establishment of probabilism as an instrument of social and scientific progress. Galton was inspired by Adolphe Quetelet's notion of the statistical mean and the normal curve as a substitute for the ideal of absolute laws. In Quetelet's own writings, however, this novelty was not at odds with determinism. His well-known L'homme moyen (average man) represented normalcy, and dispersion from the mean signified abnormality. It was Galton who gave Quetelet's mean a progressive twist.

Combining the evolution theory of his cousin Charles Darwin with eugenic ideals of human improvement, Galton held that an average man is "morally and intellectually a very uninteresting being. The class to which he belongs is bulky, and no doubt serves to keep the course of social life in action... But the average man is of no direct help towards evolution, which appears to our dim vision to be the primary purpose, so to speak, of all living existence". Whereas "[E]volution is an unresting progression", Galton added, "the nature of the average individual is essentially unprogressive" [20, p. 406].

Galton was interested in finding more ways of employing science for the sake of human progress. In an 1872 article, "Statistical Inquiries into the Efficacy of Prayer", he questioned the common belief that sick persons who pray, or are prayed for, recover on the average more rapidly than others. This article opened with the statement that there were two methods of studying an issue like the profits of piety. The first one was to deal with isolated instances. Anyone, however, using that method should suspect his own judgments or otherwise would certainly run the risk of being suspected by others in choosing one-sided examples. Galton vigorously broke a lance for substituting the study of representative types with statistical comparison. The most reliable method was to examine "large classes of cases, and to be guided by broad averages" [19, p. 126].

Galton elaborately explained how the latter method could be applied in finding out the revenues of praying: "We must gather cases for statistical comparison, in which the same object is keenly pursued by two classes similar in their physical but opposite in their spiritual state; the one class being spiritual, the other materialistic. Prudent pious people must be compared with prudent materialistic people and not with the imprudent nor the vicious. We simply look for the final result - whether those who pray attain their objects more frequently than those who do not pray, but who live in all other respects under similar conditions" [19, p. 126].

As it seems, Galton was the first to advocate comparison of group averages. Yet, his was not an example of treating one group while withholding the treatment from a comparison group. The emergence of the control group in the present-day sense occurred when his fears of being suspected by others in choosing one-sided examples began to outgrow anxieties about doing injustice to organic wholes. This transition took place with the general changeover from determinism to progressivism in a philosophical as well as social sense.

Progressivism and Distrust

By the end of the nineteenth century, extreme destitution among the working classes led to social
movements for mitigation of laissez faire capitalism. Enlightened members of the upper middle class pleaded for some State protection of laborers via minimum wage bills, child labor bills, and unemployment insurances. Their appeals for the extension of government responsibility met with strong fears that help would deprive people of their own responsibility and that administrations would squander public funds. It was progressivism combined with distrust that constituted a new definition of social experimentation as statistical comparison of experimental and control groups. Three interrelated maxims of twentieth-century economic liberalism were crucial to the gradual emergence of the present-day ideal experiment. The first maxim was that of individual responsibility. Social success and failure remained an individual affair. This implied that ameliorative attempts were to be directed first and foremost at problematic individuals rather than at further structural social change. Helping people implied trying to turn them into independent citizens by educating, training, punishing, and rewarding them. The second maxim was that of efficiency. Ameliorative actions financed with public money had to produce instant results with simple economical means. The fear that public funds would be squandered created a strong urge to attribute misery and backwardness to well-delineated causes rather than complex patterns of individual and social relations. And the third maxim was that of impersonal procedures. Fears of abuse of social services evoked distrust of people's own claims of needs, and the consequent search for impersonal techniques to establish the truth behind their stories [38]. In addition, not only was the self-assessment of the interested recipients of help to be distrusted but also that of the politicians and administrators providing help. Measurement also had to control administrators' claims of efficiency [34].

Academic experts on psychological, sociological, political, and economical matters adapted their questions and approaches to the new demands. They began to produce technically useful data collected according to standardized methodological rules. Moreover, they established a partnership with statisticians who now began to focus on population varieties rather than communalities. In this context, the interpretation of chance as something one must make good use of replaced the traditional one of chance as something to defeat [17, 22, 33].

The new social scientists measured people's abilities, motives, and attitudes, as well as social phenomena such as crime, alcoholism, and illiteracy. Soon, they arrived at the idea that these instruments could also be used for establishing the results of ameliorative interventions. In 1917, the well-reputed sociologist F. Stuart Chapin lengthily discussed the issue. Simple before-and-after measurement of one group, he stated, would not suffice for excluding personal judgement. Yet, Chapin rejected comparison of treated and untreated groups. Like Mill before him, he maintained that fundamental differences between groups would always invalidate the conclusions of social experiments. Adding a twentieth-century version to Lewis' moral objections, he argued that it would be immoral to withhold help from needy people just for the sake of research [9, 10]. It was psychologists who introduced the key idea to create equal groups rather than search for them in natural life, and they did so in a context with few ethical barriers.

Creating Groups

Psychologists had a tradition of psychophysiological experimentation with small groups of volunteers in laboratory settings for studying the law-like relationships between physical stimuli and mental sensations. During the administrative turn of both government and human science, many of them adapted their psychophysiological methods to the new demands of measuring progress rather than just discovering laws [14, 15]. One of these psychologists was John Edgar Coover, who studied at Stanford University in Palo Alto (California) with the psychophysical experimenter Frank Angell. As a former school principal, Coover gave Angell's academic interests an instrumental twist. He engaged in a debate among school administrators on the utility of teaching subjects such as Latin and formal mathematics. Opponents wanted to abolish such redundant subjects from the school curriculum, but proponents argued that formal discipline strengthens general mental capacities. Coover took part in this debate with laboratory experiments testing whether or not the training of one skill improves performance in another ability. In a 1907 article, published together with Angell, he explained that in the context of this kind of research a one-group design does not do. Instead, he compared the achievements of "experimental reagents"
who received training with those of "control reagents" who did not [13]. Coover and Angell's article seems to be the first report of an experiment in which one group of people is treated and tested, while another one is only tested.

From the 1910s, a vigorous movement started in American schools for efficiency and scientific (social) engineering [6]. In the school setting, it was morally warrantable and practically doable to compare groups. Like the earlier volunteers in laboratories, school children and teachers were comparatively easy to handle. Whereas historian Edwin Boring found no control groups in the 1916 volume of the American Journal of Psychology [3, p. 587], historian Kurt Danziger found 14 to 18% in the 1914–1916 volumes of the Journal of Educational Psychology [14, pp. 113–115].

Psychological researchers experimented in real classrooms where they tested the effects of classroom circumstances such as fresh versus ventilated air, the sex of the teacher, memorizing methods, and educational measures such as punishing and praising. They also sought ways of excluding the possibility that their effects were due to some other difference between the groups than the variable that was tested. During the 1920s, it became customary to handle the problem by matching. Matching, however, violated the guiding maxims of efficiency and impersonality. It was quite time- and money-consuming to test each child on every factor suspected of creating bias. And, even worse, determining these factors depended on the imaginative power and reliability of the researchers involved. Matching only covered possibly contaminating factors that the designers of an experiment were aware of, did not wish to neglect, and were able to pretest the participants on.

In 1923, William A. McCall at Columbia University in New York published the methodological manual How to Experiment in Education, in which he emphasized the need of comparing similar groups [30]. In the introduction to this volume, McCall predicted that enhancing the efficiency of education could save billions of dollars. Further on, he proposed to equate the groups on the basis of chance as an economical substitute for matching. McCall did not take randomization lightly. For example, he rejected the method of writing numbers on pieces of paper because papers with larger numbers contain more ink and are therefore likely to sink further to the bottom of a container. But, he stated, "any device which will make the selection truly random is satisfactory" [30, pp. 41–42].

Fisher's Support

In the meantime, educational psychologists were testing various factors simultaneously, which made the resulting data hard to handle. The methodological handbook The Design of Experiments, published in 1935 by the British biometrician and agricultural statistician Ronald A. Fisher, provided the solution of analysis of variance (ANOVA). As Fisher repeatedly stressed, random allocation to groups was a central condition for the validity of this technique. When working as a visiting professor at the agricultural station of Iowa State College, he met the American statistician George W. Snedecor. Snedecor published a book based on Fisher's statistical methods [37] that was easier to comprehend than Fisher's own, rather intricate, writings and that was widely received by methodologists in biology as well as psychology [28, 35]. Subsequently, Snedecor's Iowa colleague, the educational psychologist Everett Lindquist, followed with the book Statistical Analysis in Educational Research, which became a much-cited source in the international educational community [27].

Fisher's help was welcomed with open arms by methodologists, not only because it provided a means to handle multifactor research but also because it regulated experimentation from the stage of the experimental design. As Snedecor expressed it in 1936, the designs researchers employed often baffled the statisticians. "No more than a decade past, the statistician was distinctly on the defence", he revealed, but "[U]nder the leadership of R. A. Fisher, the statistician has become the aggressor. He has found that the key to the problem is the intimate relation between the statistical method and the experimental plan" [36, p. 690]. This quote confirms the thesis of historians that the first and foremost motive to prescribe randomization was not the logic of probabilistic statistics, but the wish to regulate the conduct of practicing researchers [8, 16, 29, 34]. Canceling out personal judgment, together with economical reasons, was the predominant drive to replace matching by randomization. Like Galton in 1872, who warned against eliciting accusations of having chosen one-sided examples, early twentieth-century
statisticians and methodologists cautioned against the danger of selection bias caused by high hopes on particular outcomes.

Epilogue

It took a while before randomization became more than a methodological ideal. Practicing physicians argued that the hopes of a particular outcome are often a substantial part of the treatment itself. They also maintained that it is immoral to let chance determine which patient gets the treatment his doctor believes in and which patient does not, as well as to keep it a secret as to which group a patient has been assigned. Moreover, they put forward the argument that subjecting patients to standardized tests rather than examining them in a truly individual way would harm, rather than enhance, the effectiveness of diagnoses and treatments.

In social research, there were protests too. After he learned about the solution of random allocation, sociologist F. Stuart Chapin unambiguously rejected it. Allocating people randomly to interventions, he maintained, clashes with "the humanitarian mores of reform" [11, 12]. And the Russian-American anthropologist Alexander Goldenweiser objected that human reality "resents highhanded manipulation", for which reason it demands "true dictators to reduce variety by fostering uniformity" [21, p. 631]. An extensive search for the actual use of random allocation in social experiments led to the earliest instance in a 1932 article on educational counseling of university students, whereas the next seven appeared in research reports dating from the 1940s (all but one in the field of educational psychology) [18].

Nevertheless, the more twentieth-century welfare capitalism replaced nineteenth-century laissez-faire capitalism, the more administrators and researchers felt that it is both necessary and morally acceptable to experiment with randomized groups of children as well as adults. From about the 1960s onward, therefore, protesting doctors could easily be accused of an unwillingness to give up an outdated elitist position for the truly scientific attitude. Particularly in the United States, the majority of behavioral and social researchers too began to regard experiments with randomly composed groups as the ideal experiment. Since President Johnson's War on Poverty, many such social experiments have been conducted, sometimes with thousands of people. Apart from school children and university students, soldiers, slum dwellers, spouse beaters, drug abusers, disabled food-stamp recipients, bad parents, and wild teenagers have all participated in experiments testing the effects of special training, social housing programs, marriage courses, safe-sex campaigns, health programs, income maintenance, employment programs, and the like, in an impersonal, efficient, and standardized way [4, 5, 32].

References

[1] Bernard, C. (1865). Introduction à l'Étude de la Médecine Expérimentale, Ballière, Paris.
[2] Bernard, C. (1957). An Introduction to the Study of Experimental Medicine, Dover Publications, New York.
[3] Boring, E.G. (1954). The nature and history of experimental control, American Journal of Psychology 67, 573–589.
[4] Boruch, R. (1997). Randomised Experiments for Planning and Evaluation, Sage Publications, London.
[5] Bulmer, M. (1986). Evaluation research and social experimentation, in Social Science and Social Policy, M. Bulmer, K.G. Banting, M. Carley & C.H. Weiss, eds, Allen and Unwin, London, pp. 155–179.
[6] Callahan, R.E. (1962). Education and the Cult of Efficiency, The University of Chicago Press, Chicago.
[7] Carrithers, D. (1995). The enlightenment science of society, in Inventing Human Science. Eighteenth-Century Domains, C. Fox, R. Porter & R. Wokler, eds, University of California Press, Berkeley, pp. 232–270.
[8] Chalmers, I. (2001). Comparing like with like. Some historical milestones in the evolution of methods to create unbiased comparison groups in therapeutic experiments, International Journal of Epidemiology 30, 1156–1164.
[9] Chapin, F.S. (1917a). The experimental method and sociology. I. The theory and practice of the experimental method, Scientific Monthly 4, 133–144.
[10] Chapin, F.S. (1917b). The experimental method and sociology. II. Social legislation is social experimentation, Scientific Monthly 4, 238–247.
[11] Chapin, F.S. (1938). Design for social experiments, American Sociological Review 3, 786–800.
[12] Chapin, F.S. (1947). Experimental Designs in Social Research, Harper & Row, New York.
[13] Coover, J.E. & Angell, F. (1907). General practice effect of special exercise, American Journal of Psychology 18, 328–340.
[14] Danziger, K. (1990). Constructing the Subject, Cambridge University Press, Cambridge.
[15] Dehue, T. (2000). From deception trials to control reagents. The introduction of the control group about a century ago, American Psychologist 55, 264–269.
[16] Dehue, T. (2004). Historiography taking issue. Analyzing an experiment with heroin maintenance, Journal of the History of the Behavioral Sciences 40(3), 247–265.
[17] Desrosières, A. (1998). The Politics of Large Numbers. A History of Statistical Reasoning, Harvard University Press, Cambridge.
[18] Forsetlund, L., Bjørndal, A. & Chalmers, I. (2004, submitted for publication). Random allocation to assess the effects of social interventions does not appear to have been used until the 1930s.
[19] Galton, F. (1872). Statistical inquiries into the efficacy of prayer, Fortnightly Review XII, 124–135.
[20] Galton, F. (1889). Human variety, Journal of the Anthropological Institute 18, 401–419.
[21] Goldenweiser, A. (1938). The concept of causality in physical and social science, American Sociological Review 3(5), 624–636.
[22] Hacking, I. (1990). The Taming of Chance, Cambridge University Press, New York.
[23] Hempel, C.G. (1966). Philosophy of Natural Science, Prentice-Hall, Englewood Cliffs.
[24] Kaptchuck, T.J. (1998). Intentional ignorance: a history of blind assessment and placebo controls in medicine, Bulletin of the History of Medicine 72, 389–433.
[25] Kuhn, T.S. (1962, reprinted 1970). The Structure of Scientific Revolutions, Chicago University Press, Chicago.
[26] Lewis, C.G. (1852, reprinted 1974). A Treatise on the Methods of Observation and Reasoning in Politics, Vol. 1, Arno Press, New York.
[27] Lindquist, E.F. (1940). Statistical Analysis in Educational Research, Houghton-Mifflin, Boston.
[28] Lovie, A.D. (1979). The analysis of variance in experimental psychology: 1934–1945, British Journal of Mathematical and Statistical Psychology 32, 151–178.
[29] Marks, H.M. (1997). The Progress of Experiment. Science and Therapeutic Reform in the United States, 1900–1990, Cambridge University Press, New York.
[30] McCall, W.A. (1923). How to Experiment in Education, Macmillan, New York.
[31] Mill, J.S. (1843, reprinted 1973). A System of Logic, Ratiocinative and Inductive: Being a Connected View of the Principles of Evidence and the Methods of Scientific Investigation, University of Toronto, Toronto.
[32] Orr, L. (1999). Social Experiments. Evaluating Public Programs with Experimental Methods, Sage Publications, London.
[33] Porter, T.M. (1986). The Rise of Statistical Thinking, 1820–1900, Princeton University Press, Princeton.
[34] Porter, T.M. (1995). Trust in Numbers: The Pursuit of Objectivity in Science and Public Life, Princeton University Press, Princeton.
[35] Rucci, A.J. & Ryan, D.T. (1980). Analysis of variance and the second discipline of scientific psychology: a historical account, Psychological Bulletin 87, 166–184.
[36] Snedecor, G.W. (1936). The improvement of statistical techniques in biology, Journal of the American Statistical Association 31, 690–701.
[37] Snedecor, G.W. (1937). Statistical Methods, Collegiate Press, Ames, Iowa.
[38] Stone, D.A. (1993). Clinical authority in the construction of citizenship, in Public Policy for Democracy, H. Ingram & S. Rathgeb Smith, eds, Brookings Institution, Washington, pp. 45–68.

TRUDY DEHUE
History of Correlational Measurement
MICHAEL COWLES
Volume 2, pp. 836–840
in
ISBN-13: 978-0-470-86080-9
ISBN-10: 0-470-86080-4
Editors
is a correlation coefficient that takes into account the correlation, the overlap, between the independent variables. This procedure is just one of the techniques used in multivariate statistical analysis.

It must be appreciated that the model is applicable to cases where the fundamental question is "what goes with what?" (the correlational study) and "How is the dependent variable changed by the independent variables that have been chosen or manipulated?" (the formal experiment). Some workers have been known to reject the correlational study largely because the differences in the dependent variable are, in general, individual differences or errors. The true experiment attempts to reduce error so that the effect of the independent variables is brought out. Moreover, the independent variables in the correlational study are most often, but not always, continuous variables, whereas these variables in, for example, the analysis of variance are more likely to be categorical. The unnecessary disputes arise from the historical investigation of variate and categorical data and do not reflect the mathematical bases of the applications.

Among the earliest of studies that made use of the idea of correlational measurement in the fields of the biological and social sciences was the one carried out in 1877 by an American researcher, Henry Bowditch, who drew up correlation charts based on data from a large sample of Massachusetts school children. Although he did not have a method to compute measures of correlation, there is no doubt that he thought that one was necessary, as was an assessment of partial correlation. It was at this time that Sir Francis Galton, a founding father of statistical techniques, was in correspondence with Bowditch and was himself working on the measurement of what he termed correlation and on the beginnings of regression analysis.

The partial correlation is the correlation of two variables when a third is held constant. For example, there is a correlation between height and weight in children, but the relationship is affected by the fact that age will influence both variables.

r12.3 = (r12 − r13 r23) / √[(1 − r13²)(1 − r23²)]    (2)

and, for four variables a, b, c, and d, we find that rac/rad = rbc/rbd, and therefore rac·rbd − rad·rbc = 0. If there are two variables a and b and g is a constant, then rab = rag·rbg, and, when a is set equal to b, the variance in a would be accounted for by g, and this leads us to what are called the communalities in a correlation matrix. This approach leads us to Spearman's tetrad differences and the beginnings of what he thought was a mathematical approach to his two-factor theory of intelligence and the development of the methods of factor analysis.

The general aim of using correlation to identify specific and general intelligences began to occupy a number of researchers, notably Thurstone and Kelley in the United States, and Thomson and Burt in Britain. The ongoing problem for many researchers was the subjective element in the methods. The fact that they did not produce determinate results reflected an argument that has not yet totally expired: what goes into the analysis, say the critics, reflects what comes out. But increasing attention was given to developing methods that avoid subjective decisions. Apart from its beginnings in the study of intelligence and ability, factor analysis is used by a number of workers in the field of personality research in attempts to produce nonsubjective assessments of the existence of personality traits.

The growing use of factor analytic techniques produced a burgeoning of interest in the assessment of the reliability of tests. Tests were becoming increasingly sophisticated as researchers worried not only about their validity (and validity had been largely a matter of subjective face validity) but also about the respectability of their reliability. A leading scholar in this field was Cronbach, who listed those aspects of test reliability that are of concern. They are test-retest reliability (is a test consistent over repeated administrations?); internal consistency (do the test items relate to the whole set of items?); and alternate or parallel forms reliability (do equivalent forms of the test show high correlations?). Cronbach himself offered a useful test, Cronbach's α. A popular early test of reliability, the Kuder–Richardson estimate of reliability, was developed to offset the difficulties of split-half methods, as was the Spearman–Brown formula, which compares the reliabilities of tests with their length: rnn = n·r11/(1 + (n − 1)·r11), where n is test length and r11 is the reliability of the test of unit length.

Galton's view that ability, talent, and intellectual power are characteristics that are primarily innately determined (the nature side of the nature-nurture issue) sparked a series of investigations that examined the weights and the sizes of sweet pea seeds
over two generations. Later, he looked at human characteristics in a similar context, these latter data being collected by offering prizes for the submission of family records of physical endowments and from visitors to an anthropometric laboratory at the International Health Exhibition, held in 1884. He pondered on the data and noted (the occasion was when he was waiting for a train, which shows that his work was never far from his thoughts) that the frequency of adult children's measurements of height charted against those of the parents (he had devised a measure that incorporated the heights of both parents) produced a set of ellipses centred on the mean of all the measurements. This discovery provided Galton with a method of describing the relationship between parents and offspring using the regression slope.

An event of greatest importance led to the investigations that were to lie at the heart of the new science of biometrics. In his memoirs [2], Galton noted that,

"As these lines are being written, the circumstances under which I first clearly grasped the important generalisation that the laws of Heredity were solely concerned with deviations expressed in statistical units, are vividly recalled in my memory. It was in the grounds of Naworth Castle, where an invitation had been given to ramble freely. A temporary shower drove me to seek refuge in a reddish recess in the rock by the side of the pathway. There the idea flashed across me, and I forgot everything for a moment in my great delight." (p. 300)

An insight of the utmost utility shows us that if the characteristics of interest are measured on a scale that is based on their variability, then the regression coefficient could be applied to these data.

The formula is, of course, the mean of the products of what we now call z scores, the standard scores: r = Σ zx zy / n.

It may be shown that the best estimate of the slope of the regression line is b = rXY·sY/sX, where sY and sX are the sample standard deviations of Y and X, the two variables of interest.

The multiple linear regression model is given by Y = bYX.Z·X + bYZ.X·Z + a, where Y is termed the dependent or criterion variable, X and Z are the independent or predictor variables, and a is a constant. The b's are the constants that represent the weight given to the independent (predictor) variables in the estimation (prediction) of Y, the dependent variable. In other words, the regression model may be used to predict the value of a dependent variable from a set of independent variables. When we have just one dependent and one independent variable, then the slope of the regression line is equivalent to r. For the values of b, we have constants that represent the weights given to the independent variables, and these are calculated on the basis of the partial regression coefficients.

The first use of the word correlation in a statistical context is by Galton in his 1888 paper, "Correlations and their measurement, chiefly from anthropometric data". Pearson maintains that Galton had first approached the idea of correlation via the use of ranked data before he turned to the measurement of variates (see Spearman's Rho). The use of ranks in these kinds of data is usually attributed to Charles Spearman, who became the first Professor of Psychology at University College, London. He was, then, for a time a colleague of Pearson's in the same institution, but the two men disliked each other and were critical of each other's work, so that a collaboration that may have been valuable was never entertained. A primary reason was that Pearson did not relish his approach to correlation, which was central to his espousal of eugenics, being sullied by methods that did not openly acknowledge the use of variates, essential to the law of ancestral heredity. It can, in fact, be rather easily shown that the modern formula for correlation using ranked data may be derived directly from the product-moment formula.

Spearman first offered the formula for the rank differences R = 1 − 3S·d/(n² − 1). Here, he uses S for the sum, rather than the modern Σ, and d is the difference in ranks. Later, the formula becomes rs = 1 − 6Σd²/(n(n² − 1)). An alternative measure of correlation using ranks was suggested by Kendall. This is his tau statistic (see Kendall's Tau).

If two people are asked to rank the quality of service in four restaurants, the data may be presented thus:

Restaurant   a   b   c   d
Judge 1      3   4   2   1
Judge 2      3   1   4   2

Reordered

Restaurant   d   c   a   b
Judge 1      1   2   3   4
Judge 2      2   4   3   1
What is the degree of correspondence between the judges? We examine the data from Judge 2. Considering the rank of 2 and comparing it with the other ranks, 2 precedes 4, 2 precedes 3, but 2 does not precede 1. These outcomes produce the scores +1, +1, and −1. When they are summed, we obtain +1. We proceed to examine each of the possible pairs of ranks and their totals. The maximum possible total obtained if there was perfect agreement between Judge 1 and Judge 2 would be 6. τ = (actual total)/(maximum possible total) = −2/6 = −0.33. This is a measure of agreement. This index is not identical with that of Spearman, but they both reflect association in the population.

George Udny Yule initially trained as an engineer, but turned to statistics when Pearson offered him a post at University College, London. Although, at first, the two maintained a friendly relationship, this soured when Yule's own work did not meet with Pearson's favor. In particular, Yule's development of a coefficient of association in 2 × 2 contingency tables created disagreement and bitter controversy between the two men. In general, χ² = Σ(f_o − f_e)²/f_e, where f_o is the observed, and f_e the expected, frequency of the observations. In a 2 × 2 table, this becomes

χ² = Σ (f_o − f_e)²/f_e.    (3)

When two variables, X and Y, have been reduced to two categories, it is possible to compute the tetrachoric correlation coefficient. This measure demands normality of distribution of the continuous variables and a linear relationship. The basic calculation is difficult and approximations to the formula are available. The procedure was just one of a number of methods provided by Pearson, but it lacks reliability and is rarely used nowadays.

The contingency coefficient is also an association method for two sets of attributes (see Measures of Association). However, it makes no assumptions about an underlying continuity in the data and is most suitable for nominal variables. The technique is usually associated with Yule, and this, together with Pearson's insistence that the variables should be continuous and normally distributed, almost certainly contributed toward the Pearson–Yule disputes.

The correlation technique of Spearman [4] is well known, but his legacy must be his early work on what is now called factor analysis. Factor analyses applied to matrices of intercorrelations among observed score variables are techniques that psychology can call its own, for they were developed in that discipline, particularly in the context of the measurement of ability.

All the developments discussed here have led us to modern approaches of increasing sophistication. But these approaches have not supplanted the early methods, and correlational techniques produced in the nineteenth century and the later approaches to regression analysis will be popular in the behavioral sciences for a good while yet.

References

[1] Bravais, A. (1846). Sur les probabilités des erreurs de situation d'un point [On the probability of errors in the position of a point], Mémoires de l'Académie Royale des Sciences de l'Institut de France 9, 255–332.
[2] Galton, F. (1908). Memories of My Life, Methuen, London.
[3] Pearson, K. (1896). Mathematical contributions to the theory of evolution. III. Regression, heredity and panmixia, Philosophical Transactions of the Royal Society, Series A 187, 253–318.
[4] Spearman, C. (1927). The Abilities of Man, their Nature and Measurement, Macmillan Publishers, New York.
[5] Walker, H.M. (1975). Studies in the History of Statistical Method, Arno Press, New York (facsimile edition from the edition of 1929).

MICHAEL COWLES
History of Discrimination and Clustering
SCOTT L. HERSHBERGER
Volume 2, pp. 840–842
factor analysis according to Tryon) as an alternative to using factor analysis for classifying people into types. Most of the methods developed by Tryon were in fact variants of multiple factor analysis [13]. Cattell [1], who also emphasized the use of cluster analysis for classifying types of persons, discussed four clustering methods: (a) ramifying linkage, which was a variation on what is now termed single linkage; (b) a matrix diagonal method, which was a graphical procedure; (c) Tryon's method, which is related to what currently would be described as average linkage (see Hierarchical Clustering); and (d) the approximate delimitation method, which was Cattell's extension of the ramifying linkage method. Cattell et al. [2] presented an iterative extension of the ramifying linkage method in order to identify two general classes of types: homostats and segregates. A homostat is a group in which every member has a high degree of resemblance with every other member in the group. On the other hand, a segregate is a group in which each member resembles more members of that group than of other groups.

Since the 1960s, interest in cluster analysis has increased considerably, and a large number of different methods for clustering have been proposed. The new interest in cluster analysis was primarily due to two sources: (a) the availability of high-speed computers, and (b) the advocacy of cluster analysis as a method of numerical taxonomy [10]. The introduction of high-speed computers permitted the development of sophisticated cluster analysis methods, methods nearly impossible to carry out by hand. Most of the methods available at the time when high-speed computers first became available required the computation and analysis of an N × N similarity matrix, where N refers to the number of observations to be clustered. For example, if a sample consisted of 100 observations, this would require the analysis of a 100 × 100 matrix, which would contain 4950 unique values, hardly an analysis to be undertaken without mechanical assistance.

Cluster analysis appears now to be in a stage of consolidation, in which synthesizing and popularizing currently available methods, rather than introducing new ones, are emphasized. Consolidation is important, if for no other reason than to remove existing discrepancies and ambiguities. For example, the same methods of cluster analysis are often confusingly called by different names. Single linkage is the standard name for a method of hierarchical agglomerative clustering, but it is also referred to pseudonymously as the nearest neighbor method, the minimum method, the space-contracting method, hierarchical analysis, elementary linkage analysis, and the connectedness method.

References

[1] Cattell, R.B. (1944). A note on correlation clusters and cluster search methods, Psychometrika 9, 169–184.
[2] Cattell, R.B., Coulter, M.A. & Tsuijoka, B. (1966). The taxonomic recognition of types and functional emergents, in Handbook of Multivariate Experimental Psychology, R.B. Cattell, ed., Rand-McNally, Chicago, pp. 288–329.
[3] Das Gupta, S. (1974). Theories and methods of discriminant analysis: a review, in Discriminant Analysis and Applications, T. Cacoullos, ed., Academic Press, New York, pp. 77–138.
[4] Fisher, R.A. (1936). The use of multiple measurements in taxonomic problems, Annals of Eugenics 7, 179–188.
[5] Huberty, C.J. (1994). Applied Discriminant Analysis, John Wiley & Sons, New York.
[6] Mahalanobis, P.C. (1930). On tests and measurements of group divergence, The Journal and Proceedings of the Asiatic Society of Bengal 26, 541–588.
[7] Pearson, K. (1926). On the coefficient of racial likeness, Biometrika 13, 247–251.
[8] Rao, C.R. (1947). A statistical criterion to determine the group to which an individual belongs, Nature 160, 835–836.
[9] Raudys, S. & Young, D.M. (2004). Results in statistical discriminant analysis: a review of the former Soviet Union literature, Journal of Multivariate Analysis 89, 1–35.
[10] Sokal, R.R. & Sneath, P. (1963). Principles of Numerical Taxonomy, Freeman, San Francisco.
[11] Stephenson, W. (1936). Introduction of inverted factor analysis with some applications to studies in orexia, Journal of Educational Psychology 5, 553–567.
[12] Tryon, R. (1939). Cluster Analysis, McGraw-Hill, New York.
[13] Tryon, R. & Bailey, D.E. (1970). Cluster Analysis, McGraw-Hill, New York.
[14] von Mises, R. (1944). On the classification of observation data into distinct groups, Annals of Mathematical Statistics 16, 68–73.
[15] Wald, A. (1944). On a statistical problem arising in the classification of an individual into one of two groups, Annals of Mathematical Statistics 15, 145–162.
[16] Welch, B.L. (1939). Note on discriminant functions, Biometrika 31, 218–220.
[17] Zubin, J.A. (1938). A technique for measuring likemindedness, Journal of Abnormal Psychology 33, 508–516.

SCOTT L. HERSHBERGER
History of Factor Analysis: A Psychological Perspective
ROBERT M. THORNDIKE
Volume 2, pp. 842–851
factors of intellect (his preferred term; see [29] for a description of the debate). In the face of this criticism, Spearman was forced to develop an analytic method to support his claim that a single factor was sufficient to account for the correlations among a set of tests. He was able to show [8] that a sufficient condition for the existence of a single factor was that an equation of the form

r_ab r_cd − r_ac r_bd = 0    (5)

be satisfied for all possible sets of four tests. This criterion, known as the tetrad difference equation, would not be exactly satisfied for all possible sets of four tests with real data, but it might be approximated.

Debate over the nature of intelligence continued as one side produced a set of data satisfying the tetrad criterion and the other side countered with one that did not. Then, in 1917, Cyril Burt [2] offered a method for extracting a factor from a matrix of correlations that approximated Pearson's principal component, but at great computational savings. Because his method placed the first factor through the average or geometric center of the set of variables, it became known as the centroid method for extracting factors (determining the initial location of a factor is called factor extraction). The centroid method was computationally straightforward and yielded useful factors. In the hands of L. L. Thurstone, it would become the standard method of factor extraction until computers became widely available in the late 1950s.

Although Spearman continued to offer his tetrad criterion as providing evidence of a single general factor of intelligence [25], the two-factor theory was dealt a serious blow in 1928 by Truman Kelley [17]. Using the method of partial correlation to remove g from the matrix of correlations among a set of ability variables, Kelley showed that additional meaningful factors could be found in the matrix of residual correlations. He argued that the distribution of residual correlations after extracting g could be used to test (and reject) the hypothesis of a single general factor, and that an important goal for psychological measurement should be to construct tests that were pure measures of the multiple factors that he had found.

Somewhat earlier, Thompson [26] had proposed a sampling approach to the conceptualization of factors that resulted logically in a hierarchy of factors depending on the breadth of the sampling. The concept of a hierarchy was later explicitly developed by Vernon [39] and Humphreys [11] into general theories about the organization of human abilities.

Enter L. L. Thurstone, the most important single contributor to the development of factor analysis after Spearman himself. In 1931, Thurstone [30] published an important insight. He recognized that satisfying the tetrad criterion for any set of four variables was equivalent to saying that the rank of the 4 × 4 correlation matrix was 1. (We can roughly define the rank of a matrix as the number of independent dimensions it represents. More formal definitions require a knowledge of matrix algebra.) In this important paper, Thurstone argued that the rank of a matrix is the equivalent of the number of factors required to account for the correlations. Unless the rank of a matrix was 1, it would require more than one factor to reproduce the correlations (see below). He also showed how the centroid method could be used to extract successive factors much more simply and satisfactorily than Kelley's partial correlation procedure.

Through the remainder of the 1930s, Thurstone continued to expand his conception of common factor analysis. He undertook a massive study of mental abilities, known as the Primary Mental Abilities study, in which 240 college-student volunteers took a 15-hour battery of 56 tests [31–34]. From analysis of this battery, he identified as many as 12 factors, seven of which were sufficiently well defined to be named as scientific constructs of ability. In addition, he developed the geometric interpretation of factors as the axes of a multidimensional space defined by the variables. This insight allowed him to recognize that the location of any factor is arbitrary. Once the multidimensional space (whose dimensionality is defined by the rank of the correlation matrix) is defined by the variables, centroid factors or principal components are used to define the nonzero axes of the space by satisfying certain conditions (see below), but these initial factors seldom seemed meaningful. Thurstone argued that one could (and should) move the axes to new positions that had the greatest psychological meaning. This process was called factor rotation (see Factor Analysis: Exploratory).

In his original work, Thurstone rotated the factors rigidly, maintaining their orthogonal or uncorrelated character. By 1938, he was advocating allowing the factors to become correlated or oblique. Geometrically, this means allowing the factors to assume
positions at other than 90 degrees to each other. Others, such as Vernon [39] and Humphreys [11], would later apply factor analysis to the matrices of correlations among the first-order factors to obtain their hierarchical models.

Thurstone's insights created three significant problems. First, the actual rank of any proper correlation matrix would always be equal to the number of variables, because the diagonal entries in the matrix included not only common variance (the g-related variance of Spearman's two-factor model) but also the specific variance. Thurstone suggested that the correlation matrix to be explained by the factors should not be the original matrix but one in which an estimate of the common variance of each variable had been placed in the appropriate location in the diagonal. This left investigators with the problem of how to estimate the common variance (or communality, as it came to be known).

The problem of estimating the communality was intimately related to the problem of how many factors were needed to account for the correlations. More factors would always result in higher communalities. In an era of hand computation, one did not want to extract factors more than once, so good communality estimates and a correct decision on the number of factors were crucial. Thurstone himself tended to favor using the largest correlation that a variable had with any other variable in the matrix as an estimate of the communality. Roff [22] argued that the squared multiple correlation of each variable with the other variables in the matrix provided the best estimate of the communality, and this is a starting point commonly used today. Others suggested an estimate of the reliability of each variable provided the best communality estimate.

The third problem resulted from the practice of rotation. The criteria for factor extraction provided a defined solution for the factors, but once rotation was introduced, there were an infinite number of equally acceptable answers. Thurstone attempted to solve this problem with the introduction of the concept of simple structure. In its most rudimentary form, the principle of simple structure says that each observed variable should be composed of the smallest possible number of factors, ideally one. In his most comprehensive statement on factor analysis, Thurstone [35, p. 335] offered five criteria that a pattern of factor loadings should meet to qualify as satisfying the simple structure principle, but most attention has been directed to finding a rotation that produces a small number of nonzero loadings for any variable.

There are two primary arguments in favor of factor patterns that satisfy simple structure. First, they are likely to be the most interpretable and meaningful. A strong argument can be made that meaningfulness is really the most important property for the results of a factor analysis to have. Second, Thurstone argued that a real simple structure would be robust across samples and with respect to the exact selection of variables. He argued convincingly that one could hardly claim to have discovered a useful scientific construct unless it would reliably appear in data sets designed to reveal it.

Thurstone always did his rotations graphically by inspection of a plot of the variables and a pair of factors. However, this approach was criticized as lacking objectivity. With the advent of computers in the 1950s, several researchers offered objective rotation programs that optimized a numerical function of the factor loadings [e.g., 3, 19]. The most successful of these in terms of widespread usage has been the varimax criterion for rotation to an orthogonal simple structure proposed by Kaiser [14], although the direct oblimin procedure of Jennrich and Sampson [12] is also very popular as a way to obtain an oblique simple structure.

In addition to making analytic rotation possible, the rise of computers also sounded the death knell for centroid extraction. By the late 1960s, the Pearson–Hotelling method of principal axis factor extraction had replaced all others. Several alternatives had been offered for how to estimate the communalities, including maximum likelihood [13, 18], alpha [16], and minimum residuals [7], but all employed the same basic extraction strategy that is described below.

There was also progress on the number-of-factors question that can be traced to the availability of computers. Although Hotelling [10] and Bartlett [1] had provided tests of the statistical significance of principal components (Bartlett's sphericity test is still an option in SPSS), neither was used until computers were available because they did not apply to centroid factors. Rippe [21] offered a general test for the number of factors in large samples, and Lawley [18] had provided the foundation of a significance test for use with maximum likelihood factors. Others,
notably Kaiser [15] and Cattell [4], offered nonstatistical rules of thumb for the number of principal components to retain for rotation. Kaiser's criterion held that only factors that have eigenvalues (see below) greater than 1.0 should be considered, and Cattell suggested that investigators examine the plot of the eigenvalues to determine where a scree (random noise factors) began. Kaiser's criterion became so popular that it is the default in SPSS and some other computer programs, and many programs will output the plot of the eigenvalues as an option.

Statistical criteria for the number of factors were criticized as being highly sensitive to sample size. On the other hand, one person's scree is another person's substantive factor, and Kaiser's criterion, although objective, could result in keeping a factor with an eigenvalue of 1.0001 and dropping one at 0.999. To solve these problems, Horn [9] proposed that in a study with m variables, m × m matrices of correlations from random data be analyzed, and only factors from the real data with eigenvalues larger than the paired eigenvalue from random data be kept. This approach has worked well in simulation studies, but has not seen widespread application. A method with similar logic by Velicer [36], based on average squared partial correlations, has also shown promise but seen little application.

By the early 1970s, the development of common factor analysis was all but complete. That this is so can be inferred from the fact that there has not been a major book devoted to the subject since 1983 [5], while before that date several important treatments appeared every decade. This does not mean that the method has been abandoned. Far from it; unrestricted (exploratory) factor analysis remains one of the most popular data analytic methods. Rather, work has focused on technical issues such as rules for the number of factors to extract, how large samples need to be, and how many variables need to be included to represent each factor. Although many investigators have contributed to developments on these topics, Wayne Velicer and his associates have been among the most frequent and influential contributors [37].

Overview of Factor Analysis

There are two basic ways to conceptualize factor analysis, an algebraic approach and a graphic or geometric approach. In this section, we will review each of these briefly. For a further description, see the entry on common factor analysis. A thorough description of both approaches can also be found in Harman [6] or Gorsuch [5].

Algebraic Approach

Spearman [24] and Thurstone [35] both considered factors to represent real latent causal variables that were responsible for individual differences in test scores. Individuals are viewed as having levels of ability or personality on whatever traits the factors represent. The task for factor analysis is to determine from the correlations among the variables how much each factor contributes to scores on each variable. We can therefore think of a series of regression equations with the factors as predictors and the observed variables as the criteria. If there are K factors and p observed variables, we will have p regression equations, each with the same K predictors, but the predictors will have different weights in each equation, reflecting their individual contributions to that variable.

Suppose we have a set of six variables, three measures of verbal ability and three measures of quantitative ability. We might expect there to be two factors in such a set of data. Using a generalization of Spearman's two-factor equation, we could then think of the score of a person (call him i, for Ishmael) on the first test (X1i) as being composed of some part of Ishmael's score on factor 1, plus some part of his score on factor 2, plus a portion specific to this test. For convenience, we will put everything in standard score form.

Z_X1i = β_X1F1 Z_F1i + β_X1F2 Z_F2i + U_X1i    (6)

Ishmael's score on variable X1 (Z_X1i) is composed of the contribution factor 1 makes to variable X1 (β_X1F1) multiplied by Ishmael's score on factor 1 (Z_F1i), plus the contribution factor 2 makes to X1 (β_X1F2) times Ishmael's score on factor 2 (Z_F2i), plus the residual or unique part of the score, U_X1i. U is whatever is not contributed by the common factors and is called uniqueness. We shall see shortly that uniqueness is composed of two additional parts. Likewise, Ishmael's scores on each of the other variables are composed of a contribution from each of the factors plus a unique part. For example,

Z_X2i = β_X2F1 Z_F1i + β_X2F2 Z_F2i + U_X2i    (7)
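A short numerical sketch of the model in (6) and (7) may help. With hypothetical loadings (the beta values below are illustrative, not taken from this article), we can generate standardized scores for two orthogonal factors, compose two observed z-scores from them, and confirm that the observed correlation equals the sum of the cross-products of the loadings:

```python
import numpy as np

# Hypothetical pattern coefficients (betas) for two tests on two
# orthogonal factors -- illustrative values only.
b = np.array([[0.8, 0.3],   # test X1: beta_X1F1, beta_X1F2
              [0.7, 0.4]])  # test X2: beta_X2F1, beta_X2F2

rng = np.random.default_rng(0)
n = 200_000
zf = rng.standard_normal((n, 2))   # standardized, uncorrelated factor scores

# Scale the unique parts so each observed z-score has unit variance.
u_sd = np.sqrt(1.0 - (b ** 2).sum(axis=1))
zx = zf @ b.T + rng.standard_normal((n, 2)) * u_sd   # Eqs. (6) and (7)

# For orthogonal factors, the model-implied correlation of X1 and X2
# is the sum of the cross-products of their loadings.
implied = float((b[0] * b[1]).sum())   # 0.8*0.7 + 0.3*0.4 = 0.68
observed = float(np.corrcoef(zx, rowvar=False)[0, 1])
```

With 200,000 simulated examinees, `observed` lands within about 0.01 of the implied value of 0.68.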
If we have scores on variable 1 for a set of people, we can use (6) to see that factor analysis decomposes the variance of these scores into contributions by each of the factors. That is, for each variable X_j we can develop an expression of the following form

σ²_Xj = β²_XjF1 σ²_F1 + β²_XjF2 σ²_F2 + U²_Xj    (8)

The variance of each observed variable is a weighted combination of the factor variances plus a unique contribution due to that variable.

The betas are known as factor pattern coefficients or factor loadings. As is the case in multiple correlation generally, if the predictors (factors) are uncorrelated, the regression weights are equal to the predictor–criterion correlations. That is, for orthogonal factors, the pattern coefficients are also the correlations between the factors and the observed variables. In factor analysis, the correlations between the variables and the factors are called factor structure coefficients. One of the major arguments that has been made in favor of orthogonal rotations of the factors is that as long as the factors are orthogonal, the equivalence between the pattern and structure coefficients is maintained, so interpretation of the results is simplified.

Let us consider an example like the one above. Table 1 contains hypothetical correlations among three verbal and three quantitative tests from the Stanford–Binet Fourth Edition [28]. The matrix was constructed to be similar to the results obtained with the actual instrument.

Table 1 Hypothetical correlations between six subtests of the Stanford–Binet, Fourth Edition

Variable            1      2      3      4      5      6
Vocabulary          1.000
Comprehension       0.710  1.000
Absurdities         0.586  0.586  1.000
Equation building   0.504  0.460  0.330  1.000
Number series       0.562  0.563  0.522  0.630  1.000
Quantitative        0.570  0.567  0.491  0.594  0.634  1.000

Applying principal axis factor analysis (see the entry on common factor analysis) to this matrix yields the factor matrix in Table 2. This matrix is fairly typical of results from factoring sets of ability variables. There is a large first factor with all positive loadings and a smaller second factor with about half positive and half negative loadings (called a bipolar factor). The large first factor has often been equated with Spearman's g factor.

Table 2 Initial factor matrix for six Stanford–Binet tests

                      Factor
                    1       2      h²
Vocabulary          0.80  −0.20   0.68
Comprehension       0.79  −0.26   0.69
Absurdities         0.67  −0.26   0.52
Equation building   0.71   0.43   0.69
Number series       0.78   0.18   0.64
Quantitative        0.76   0.14   0.60
Factor variances    3.40   0.41   3.81

One interpretation of a correlation is that its square corresponds to the proportion of variance in one variable that is accounted for by the other variable. For orthogonal factors, this means that a squared factor loading is the proportion of the variable's variance contributed by that factor. Summed across all common factors, the result is the proportion of the variable's variance that is accounted for by the set of common factors (note that uniqueness is not included). This quantity is called the common variance of the variable, or its communality. For the case of K factors,

Communality of X1 = Σ (j = 1 to K) β²_X1Fj = h²_X1    (9)

The communality of each variable is given in the last column of Table 2. The symbol h² is often used for communality and represents a variance term.

The remainder of each variable's variance, (1 − h²), is the variance unique to that variable, its uniqueness (symbolized u², also a variance term). The unique variance is composed of two parts, variance that is due to reliable individual differences that are not accounted for by the common factors, and random errors of measurement. The first is called specificity (symbolized s²) and the second is simply error (e²). Thus, Spearman's two 1904 papers lead to the following way to view a person's score on a variable

Z_Xji = β_XjF1 Z_F1i + β_XjF2 Z_F2i + s_Xji + e_Xji    (10)

and link common factor theory with measurement theory. If we once again think of the scores for N people on the set of variables, the variance of each
variable (we are still considering standard scores, so each variable's total variance is 1.0) can be viewed in three interlocking ways (each letter corresponds to a kind of variance derived from (10)):

Factor theory:        1.0 = h² + u²         u² = s² + e²
                      1.0 = h² + s² + e²    r² = h² + s²
Measurement theory:   1.0 = r² + e²

The symbol r² is used to indicate the reliable variance in test scores.

We can also consider the factor loading as revealing the amount of a factor's variance that is contributed by each variable. Again taking the squared factor loadings, but this time summing down the column of each factor, we get the values at the bottom of Table 2. These values are often referred to, somewhat inappropriately, as eigenvalues. This term is really only appropriate in the case of principal components. In this example, we used squared multiple correlations as initial communality estimates, so factor variance is the correct term to use. Note that the first factor accounts for over half of the variance of the set of six variables and the two factors combined account for about 2/3 of the variance.

Now, let us see what happens if we apply varimax rotation to these factors. What we would expect for a simple structure is for some of the loadings on the first factor to become small, while some of the loadings on the second factor become larger. The results are shown in Table 3. The first factor now has large loadings for the three verbal tests and modest loadings for the three quantitative tests, and the reverse pattern is shown on the second factor. We would be inclined to call factor 1 a verbal ability factor and factor 2 a quantitative ability factor. Note that the factors still account for the same amount of each variable's variance, but that variance has been redistributed between the factors. That is, the communalities are unchanged by rotation, but the factor variances are now more nearly equal.

Table 3 Varimax-rotated factor matrix for six Stanford–Binet tests

                      Factor
                    1      2      h²
Vocabulary          0.73   0.38   0.68
Comprehension       0.76   0.34   0.69
Absurdities         0.68   0.26   0.52
Equation building   0.24   0.79   0.69
Number series       0.46   0.66   0.64
Quantitative        0.48   0.61   0.60
Factor variances    2.07   1.76   3.81

There are two things about the varimax factor matrix that might cause us concern. First, the small loadings are not that small. The structure is not that simple. Second, there is no particular reason why we would or should expect the factors to be orthogonal in nature. We will allow the data to speak to us more clearly if we permit the factors to become correlated. If they remain orthogonal with the restriction of orthogonality relaxed, so be it, but we might not want to force this property on them. Table 4 contains the factor pattern matrix after rotation by direct oblimin.

There are two things to notice about these pattern coefficients. First, the large or primary coefficients display the same basic pattern and size as the coefficients in Table 3. Second, the secondary loadings are quite a lot smaller. This is the usual result of an oblique rotation. The other important statistic to note is that the factors in this solution are correlated 0.70. That is, according to these data, verbal ability and quantitative ability are quite highly correlated. This makes sense when we observe that the smallest correlation between a verbal test and a quantitative test in Table 1 is 0.33 and most are above 0.50. It is also what Spearman's theory would have predicted.

We can note one final feature of this analysis, which addresses the question of whether there is a difference between principal components analysis and common factor analysis. The factors provide a model for the original data and we can ask how well the model fits the data. We can reproduce the

Table 4 Pattern matrix from a direct oblimin rotation for six Stanford–Binet tests

                      Factor*
                    1      2
Vocabulary          0.76   0.10
Comprehension       0.82   0.02
Absurdities         0.75   0.04
Equation building   0.09   0.89
Number series       0.27   0.59
Quantitative        0.30   0.53
* Factors correlate +0.70.
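The arithmetic behind Tables 2 and 3 can be verified directly: row sums of squared loadings give the communalities, column sums give the factor variances, and a rigid rotation of about 42 degrees (the optimum reported in the text) carries the initial matrix into the varimax one. A sketch, taking the verbal tests' factor-2 loadings in Table 2 as negative, consistent with the bipolar second factor described in the text:

```python
import numpy as np

# Initial (unrotated) loadings from Table 2; the three verbal tests are
# given negative factor-2 loadings (the bipolar second factor).
initial = np.array([
    [0.80, -0.20],   # Vocabulary
    [0.79, -0.26],   # Comprehension
    [0.67, -0.26],   # Absurdities
    [0.71,  0.43],   # Equation building
    [0.78,  0.18],   # Number series
    [0.76,  0.14],   # Quantitative
])

# Communalities (Eq. 9): row sums of squared loadings.
h2 = (initial ** 2).sum(axis=1)           # about 0.68, 0.69, 0.52, 0.69, 0.64, 0.60

# "Factor variances": column sums of squared loadings.
factor_var = (initial ** 2).sum(axis=0)   # about 3.40 and 0.41

# A rigid (orthogonal) rotation by roughly 42 degrees reproduces Table 3.
theta = np.radians(42.0)
rot = np.array([[ np.cos(theta), np.sin(theta)],
                [-np.sin(theta), np.cos(theta)]])
rotated = initial @ rot                   # matches Table 3 within rounding
```

Within the rounding of the two-decimal table entries, `rotated` matches Table 3, and the communality column is left unchanged by the rotation, exactly as the text describes.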
2 contrasts verbal and quantitative tests. If we rotate the factors clockwise, factor 1 will come to represent the verbal tests more clearly and factor 2 will align with the quantitative ones. It looks like a rotation of about 45 degrees will do the trick.

When we apply the varimax rotation criterion, a rotation of 42 degrees produces an optimum solution. The plot of the rotated solution is shown in Figure 2. Notice that the variables stay in the same place and the factors rotate to new locations. Now, all of the variables project toward the positive ends of both factors, and this fact is reflected by the uniformly positive loadings in Table 3.

Figure 3 is a plot of the direct oblimin rotation from Table 4. Here we can see that the two factors have been placed near the centers of the two clusters of variables. The verbal cluster is a relatively pure representation of the verbal factor (1). None of the variables are far from the factor and all of their pattern coefficients on factor 2 are essentially zero. Equation building is a relatively pure measure of the quantitative factor, but two of the quantitative variables seem to also involve some elements of verbal behavior. We can account for this fact in the case of the Quantitative test because it is composed in part of word problems that might involve a verbal component. The reason for the nonzero coefficient for Number Series is not clear from the test content.

The algebraic and graphical representations of the factors complement each other for factor interpretation because they provide two different ways to view exactly the same outcome. Either one allows us to formulate hypotheses about causal constructs that underlie and explain a set of observed variables. As Thurstone [35] pointed out many years ago, however, this is only a starting point. The scientific value of the constructs so discovered must be tested in additional studies to demonstrate both their stability with respect to the specific selection of variables and their generality across subject populations. Often they may be included in studies involving experimental manipulations to test whether they behave as predicted by theory.

References

[1] Bartlett, M.S. (1950). Tests of significance in factor analysis, British Journal of Psychology, Statistical Section 3, 77–85.
[2] Burt, C. (1917). The Distributions and Relations of Educational Abilities, London County Council, London.
[3] Carroll, J.B. (1953). Approximating simple structure in factor analysis, Psychometrika 18, 23–38.
[4] Cattell, R.B. (1966). The scree test for the number of factors, Multivariate Behavioral Research 1, 245–276.
[15] Kaiser, H.F. (1960). The application of electronic computers to factor analysis, Educational and Psychological Measurement 20, 141–151.
[16] Kaiser, H.F. & Caffrey, J. (1965). Alpha factor analysis, Psychometrika 30, 1–14.
[17] Kelley, T.L. (1928). Crossroads in the Mind of Man, Stanford University Press, Stanford.
[18] Lawley, D.N. (1940). The estimation of factor loadings by the method of maximum likelihood, Proceedings of the Royal Society of Edinburgh 60, 64–82.
[19] Neuhaus, J.O. & Wrigley, C. (1954). The quartimax method: an analytic approach to orthogonal simple structure, British Journal of Statistical Psychology 7, 81–91.
[20] Pearson, K. (1901). On lines and planes of closest fit to systems of points in space, Philosophical Magazine, Series B 2, 559–572.
[21] Rippe, D.D. (1953). Application of a large sampling criterion to some sampling problems in factor analysis, Psychometrika 18, 191–205.
[22] Roff, M. (1936). Some properties of the communality in multiple factor theory, Psychometrika 1, 1–6.
[23] Spearman, C. (1904a). The proof and measurement of the association between two things, American Journal of Psychology 15, 72–101.
[24] Spearman, C. (1904b). General intelligence, objectively determined and measured, American Journal of Psychology 15, 201–293.
[25] Spearman, C. (1927). The Abilities of Man, Macmillan Publishing, New York.
[26] Thomson, G.H. (1920). General versus group factors in mental activities, Psychological Review 27, 173–190.
[27] Thorndike, E.L., Lay, W. & Dean, P.R. (1909). The relation of accuracy in sensory discrimination to general intelligence, American Journal of Psychology 20, 364–369.
[28] Thorndike, R.L., Hagen, E.P. & Sattler, J.M. (1986). The Stanford-Binet Intelligence Scale, 4th Edition, Technical Manual, Riverside, Itasca.
[29] Thorndike, R.M. (1990). A Century of Ability Testing, Riverside, Itasca.
[30] Thurstone, L.L. (1931). Multiple factor analysis, Psychological Review 38, 406–427.
[31] Thurstone, L.L. (1935). The Vectors of Mind, University of Chicago Press, Chicago.
[32] Thurstone, L.L. (1936a). A new conception of intelligence and a new method of measuring primary abilities, Educational Record 17(Suppl. 10), 124–138.
[33] Thurstone, L.L. (1936b). A new conception of intelligence, Educational Record 17, 441–450.
[34] Thurstone, L.L. (1938). Primary Mental Abilities, Psychometric Monographs No. 1.
[35] Thurstone, L.L. (1947). Multiple Factor Analysis, University of Chicago Press, Chicago.
[36] Velicer, W.F. (1976). Determining the number of components from the matrix of partial correlations, Psychometrika 41, 321–327.
[37] Velicer, W.F. & Fava, J.L. (1998). The effects of variable and subject sampling on factor pattern recovery, Psychological Methods 3, 231–251.
[38] Velicer, W.F. & Jackson, D.N. (1990). Component analysis versus common factor analysis: some issues in selecting an appropriate procedure, Multivariate Behavioral Research 25, 1–28.
[39] Vernon, P.E. (1950). The Structure of Human Abilities, Wiley, New York.

(See also Factor Analysis: Confirmatory; Factor Analysis: Multitrait–Multimethod; Factor Analysis of Personality Measures)

ROBERT M. THORNDIKE
History of Factor Analysis: A Statistical Perspective

DAVID J. BARTHOLOMEW

Volume 2, pp. 851–858

in Encyclopedia of Statistics in Behavioral Science

ISBN-13: 978-0-470-86080-9
ISBN-10: 0-470-86080-4
Origins

Factor analysis is usually dated from Charles Spearman's paper General Intelligence Objectively Determined and Measured, published in the American Journal of Psychology in 1904 [18]. However, like most innovations, traces of the idea can be found in earlier work by Karl Pearson [17] and others. All the same, it was a remarkable idea. Spearman, of course, did not invent factor analysis in the full glory of its later development. He actually proposed what would now be called a one-factor model, though then it was, perversely, called a two-factor model. It arose in the context of the theory of correlation and partial correlation, which was one of the few topics in multivariate statistics that was reasonably well developed at that time. Technically speaking, it was not such a great step forward but it proved enough to unlock the door to a huge field of applications.

Spearman and most of his immediate followers were interested in measuring human abilities and, in particular, general intelligence. There was no interest in developing the general method of multivariate analysis which factor analysis later became. Factor analysis is unusual among multivariate statistical techniques in that it was developed almost entirely within the discipline of psychology. Its line of development was therefore subservient to the needs of psychological measurement of abilities in particular. This has had advantages and disadvantages. On the positive side, it has earthed or grounded the subject, ensuring that it did not wander off into theoretical irrelevancies. Negatively, it had a distorting effect that emphasized some aspects and ignored others.

Returning to Spearman and the origins of factor analysis: the theory quickly grew. Sir Cyril Burt, see for example [5], was one of the first on the scene and, with his access to large amounts of data from the London County Council, was able to press ahead with practical applications.

The key idea was that it might be possible to explain the correlations in sets of observable variables by the hypothesis that they all had some dependence on a common factor (or, later, factors). The fact that, in practice, the correlations were not wholly accounted for in this way was explained by the influence of other variables specific to each observable variable. If this hypothesis were correct, then conditioning on the common variables (factors) should render the variables independent. In that sense, their correlations were explained by the common factors. It was then but a short step to show that the variances of each variable could be partitioned into two parts, one arising from the common factor(s) and the other from the rest. The importance of each variable (its saturation with the common factor) could be measured by its correlation with that factor and this could be estimated from the observed correlations. In essence, this was achieved by Spearman in 1904.

In 1904, there was little statistical theory available to help Spearman but what there was proved to be enough. Correlation had been a major field of study. The invention of the product-moment correlation (see Pearson Product Moment Correlation) had been followed by expressions for partial correlations. A first-order partial correlation gives the correlation between a pair of variables when a third is fixed. Second-order coefficients deal with the case when two other variables are fixed, and so on. The expressions for the partial correlations presupposed that the relationships between the variables were linear. That was because product-moment correlation is a measure of linear correlation. Inspection of early editions of Yule's Introduction to the Theory of Statistics (starting with [21]) will show how prominent a place partial correlation occupied in the early days. Later, the emphasis shifted to multiple regression (see Multiple Linear Regression), which offered an alternative way of investigating the same phenomenon.

The result of Spearman's idea is that if the correlation between two variables is due to their common dependence on a third variable, then one can deduce that the correlations have a particularly simple form. It is not entirely clear from the 1904 paper how Spearman went about this or what form of the relationship among the correlations he actually used, but a simple way of arriving at his result is as follows.

Suppose we have a set of variables correlated among themselves. We suspect that these correlations are induced by their common dependence on a factor called G (Spearman used G in this context because he was using it to denote general intelligence). If that is the case, then conditioning on G should remove the correlation. Consider two variables i and j with correlation r_ij. If our hypothesis is correct, that correlation should vanish if we condition on G. That is, the partial correlation r_ij.G should be zero (the dot is used to denote 'given'). Now,

    r_ij.G = (r_ij − r_iG r_jG) / [√(1 − r_iG²) √(1 − r_jG²)],   (1)

and so the necessary and sufficient condition for r_ij.G to vanish is that

    r_ij = r_iG r_jG   (i, j = 1, 2, ..., p; i ≠ j).   (2)

If we can find values r_iG (i = 1, 2, ..., p) to satisfy these relations (approximately), then we shall have established that the mutual correlation among the variables can, indeed, be explained by their common dependence on the common factor G. This derivation shows that what came to be called factor loadings are, in fact, correlations of the manifest variables with the factor. As we shall see, this idea can easily be extended to cover additional factors but that was not part of Spearman's original discovery.

The Statistical Strand

The first passing contact of statistics with the developing factor analysis was the publication of Harold Hotelling's seminal paper [6] on principal component analysis. PCA is quite distinct from factor analysis but the distinction was, perhaps, less clear in the 1930s. Hotelling himself was critical of factor analysis, especially because of its lack of the statistical paraphernalia of inferential statistics.

Hotelling was followed, quite independently it seems, by Bartlett [2–4], whose name is particularly remembered in this field for what are known as Bartlett scores. These are factor scores and we shall return to them below (see Factor Score Estimation). He also wrote more widely on the subject and, through his influence, Whittle [20] made a brief excursion into the field.

There the matter appears to have rested until the immediate postwar period. By then, statistics, in a modern guise, was making great progress. M. G. Kendall, who was a great systematizer, turned his attention to factor analysis in [9] and also included it in taught courses at about that time and in one of his early monographs on multivariate analysis. This period also marks D. N. Lawley's contribution concerned especially with fitting the factor model, see, for example, [10]. His one-time colleague, A. E. Maxwell, who collaborated in the writing of the book Factor Analysis as a Statistical Method [11], did practical factor analysis in connection with his work at the London Institute of Psychiatry. His expository paper [14], first read at a conference of the Royal Statistical Society in Durham and subsequently published in the Journal, Series A, is an admirable summary of the state of play around 1961. In particular, it highlights the problems of implementing the methods of fitting the model that had already been developed, uncertain convergence being prominent among them.

However, factor analysis did not catch on in a big way within the statistical community and there were a number of critical voices. These tended to focus on the alleged arbitrariness of the method that so often seemed to lead to an unduly subjective treatment. The range of rotations available, oblique as well as orthogonal, left the user with a bewildering array of solutions, one of which, surely, must show what the analyst desired. Much of this unfriendly fire was occasioned by the fact that, in practice, factor analysts showed little interest in sampling error. It was easily possible to demonstrate the pitfalls by simulation studies on the basis of small sample sizes, where sampling error was often mistaken for arbitrariness. To many statisticians, the solidity of principal components analysis provided a surer foundation even if it was, basically, only a descriptive technique. However, to psychologists, meaningfulness was at least as important a criterion in judging solutions as statistical significance.

The immediate postwar period, 1950–1960 say, marks an important watershed in the history of factor analysis, and of statistics in general. We shall come to this shortly, but it owed its origin to two important happenings of this period. One was the introduction of the electronic computer, which was, ultimately, to revolutionize multivariate statistical analysis. The other was the central place given to probability models in the specification and analysis of statistical problems. In a real sense, statistics became a branch of applied probability in a way that it had not been earlier.

Prior to this watershed, the theory of factor analysis was largely about the numerical analysis of correlation (and related) matrices. In a sense, this might be called a deterministic or mathematical theory. This became such a deeply held orthodoxy that it still has a firm grip in some quarters. The so-called problem of factor scores, for example, is sometimes still spoken of as a problem even though its problematic character evaporates once the problem is formulated in modern terms.

Next Steps

The first main extension was to introduce more than one common factor. It soon became apparent in applied work that the original one-factor hypothesis did not fit much of the data available. It was straightforward, in principle, to extend the theory, and Burt was among the pioneers, though it is doubtful whether his claim to have invented multifactor analysis can be substantiated (see [13]).

At about the same time, the methods were taken up across the Atlantic, most conspicuously by L. L. Thurstone [19]. He, too, claimed to have invented multifactor analysis and, for a time at least, his approach was seen as a rival to Spearman's. Spearman's work had led him to see a single underlying factor (G) as being common to, and the major determinant of, measures of human ability. Eventually, Spearman realized that this dominant factor could not wholly explain the correlations and that other group factors had to be admitted. Nevertheless, he continued to believe that the one-factor model captured the essence of the situation.

Thurstone, on the other hand, emphasized that the evidence could be best explained by supposing that there were several (7 or 9) primary abilities and, moreover, that these were correlated among themselves. To demonstrate the latter fact, it was necessary to recognize that once one passed beyond one factor the solution was not unique. One could move from one solution to another by simple transformations, known as rotations, because that is how they can be regarded when viewed geometrically. Once that fact was recognized, the question naturally arose as to whether some rotations were better or more meaningful than others. Strong claims may be advanced for those having what Thurstone called simple structure. In such a rotation, each factor depends only (or largely) on a subset of the observable variables. Such variables are sometimes called group variables, for obvious reasons.

Two Factors

The question of whether the correlation matrix can be explained by a single underlying factor therefore resolved itself into the question of whether it has the structure (2). If one factor failed to suffice, one could go on to ask whether two factors or more would do the job better. The essentials can be made clear if we first limit ourselves to the case of two factors.

Suppose, then, we introduce two factors G1 and G2. We then require r_ij.G1G2 to be zero for all i ≠ j. If G1 and G2 are uncorrelated, it turns out that

    r_ij = r_iG1 r_jG1 + r_iG2 r_jG2   (i ≠ j)   (3)

         = λ_i1 λ_j1 + λ_i2 λ_j2, say.   (4)

Pursuing this line of argument to incorporate further factors, we find, in the q-factor case, that

    r_ij = Σ_{k=1}^{q} λ_ik λ_jk   (i ≠ j).   (5)

In matrix notation,

    R = ΛΛ′ + Ψ,   (6)

where Λ = {λ_ik} and Ψ is a diagonal matrix whose elements are chosen to ensure that the diagonal elements of the matrix on the right add up to 1 and so match those of R. The complements of the elements of Ψ are known as the communalities because they provide a measure of the variance attributable to the common factor.

The foregoing, of course, is not a complete account of the basis of factor analysis, even in its original form, but it shows why the structure of the correlation matrix was the focal point. No question of a probability model arose and there was no discussion, for example, of standard errors of estimates. Essentially and originally, factor analysis was the numerical analysis of a correlation matrix. This approach dominated the development of factor analysis before the Second World War and is still sometimes found today. For this reason, (6) was (and sometimes still is) spoken of as the factor analysis model.

which is of exactly the same form as (6) and so justifies us in regarding it as a stochastic version of the old (Spearman) model. The difference is that Σ is the covariance matrix rather than the correlation matrix. This is often glossed over by supposing that the x_i's have unit variance. This, of course, imposes a further constraint on Λ and Ψ by requiring that
or suspected, that only the members of a given subset are indicators of a particular factor. This amounts to believing that certain factor loadings are zero. In cases like these, there is a prior hypothesis about the factor structure and we may then wish to test whether this is confirmed by a new data set.

This is called confirmatory factor analysis (see Factor Analysis: Confirmatory). Confirmatory factor analysis is a rather special case of a more general extension known as linear structural relations modeling or structural equation modeling. This originated with [7] and has developed enormously in the last 30 years. In general, it supposes that there are linear relationships among the latent variables. The object is then to not only determine how many factors are needed but to estimate the relationships between them. This is done, as in factor analysis, by comparing the observed covariance matrix with that predicted by the model and choosing the parameters of the latter to minimize the distance between them. For obvious reasons, this is often called covariance structure analysis.

Another long-standing part of factor analysis can also be cast into the mold of linear structural relations modeling. This is what is known as hierarchical factor analysis, and it has been mainly used in intelligence testing. When factor analysis is carried out on several sets of test scores in intelligence testing, it is common to find that several factors are needed to account for the covariances, perhaps as many as 8 or 9. Often, the most meaningful solution will be obtained using an oblique rotation in which the resulting factors will themselves be correlated. It is then natural to enquire whether their covariances might be explained by factors at a deeper level, to which they are related. A second stage analysis would then be carried out to reveal this deeper factor structure. It might even be possible to carry the analysis further to successively deeper levels. In the past, hierarchical analysis has been carried out in an ad hoc way much as we have just described it. A more elegant way is to write the dependence between the first-level factors and the second level as linear relations to be determined. In this way, the whole factor structure can be estimated simultaneously.

The second kind of recent development has been to extend the range of observed variables that can be considered. Factor analysis was born in the context of continuous variables for which correlations are the appropriate measure of association. It was possible, as we have seen, because the theory of partial correlation already existed. At the time, there was no such theory for categorical variables, whether ordered or not. This lopsided development reflected much that was going on elsewhere in statistics. Yet, in practice, categorical variables are very common, especially in the behavioral sciences, and are often mixed up with continuous variables. There is no good reason why this separation should persist. The logic of the problem does not depend, essentially, on the type of variable.

Extension to Variables of Other Types

Attempts to cope with this problem have been made in a piecemeal fashion, centering, to a large extent, on the work of Lazarsfeld, much of it conveniently set out in [12]. He introduced latent structure analysis to do for categorical and especially binary variables what factor analysis had done for continuous variables. Although he noted some similarities, he seemed more interested in the differences that concerned the computational rather than the conceptual aspects. What was needed was a broader framework within which a generalized form of factor analysis could be carried out regardless of the type of variable. Lazarsfeld's work also pointed to a second generalization that was needed. This concerns the factors, or latent variables. In traditional factor analysis, the factors have been treated as continuous variables, usually normally distributed. There may be circumstances in which it would be more appropriate to treat the factors as categorical variables. This was done by Lazarsfeld with his latent class and latent profile models. It may have been partly because the formulae for models involving categorical variables look so different from those for continuous variables that their essential unity was overlooked.

The key to providing a generalized factor analysis was found in the recognition that the exponential family of distributions provided a sufficient variety of forms to accommodate most kinds of observed variable. It includes the normal distribution, of course, but also the Bernoulli and multinomial distributions, to cover categorical data, and many other forms as well that have not been much considered in latent variables analysis. A full development on these lines will be found in [1].
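Lazarsfeld's latent class model, mentioned above, can be sketched in a few lines. All numbers below are hypothetical: two latent classes with prior membership probabilities, and four binary items that are independent given the class (local independence); a respondent is then allocated to the class with the larger posterior probability, computed by Bayes' rule.

```python
from math import prod

# Hypothetical two-class latent structure for four binary items.
prior = [0.6, 0.4]                      # class membership probabilities
p_item = [[0.9, 0.8, 0.7, 0.9],         # P(item = 1 | class 0)
          [0.2, 0.3, 0.1, 0.4]]         # P(item = 1 | class 1)

def posterior(responses):
    # P(class | responses) by Bayes' rule; items are independent
    # given the latent class (local independence).
    joint = [prior[c] * prod(p if x else 1 - p
                             for p, x in zip(p_item[c], responses))
             for c in (0, 1)]
    total = sum(joint)
    return [j / total for j in joint]

post = posterior([1, 1, 0, 1])
print(post[0] > post[1])  # True: this respondent is allocated to class 0
```

The same posterior-probability logic is what the text later describes as the undisputed practice for class allocation in latent class analysis.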
In this more general approach, the normal linear factor model is replaced by one in which the canonical parameter (rather than the mean) of the distribution is expressed as a linear function of the factors. Many features of the standard linear model carry over to this more general framework. Thus, one can fit the model by maximum likelihood, rotate factors, and so on. However, in one important respect it differs. It moves the focus away from correlations as the basic data about dependencies and toward the more fundamental conditional dependencies that the model is designed to express. It also resolves the disputes that have raged for many years about factor scores. A factor score is an estimate or prediction of the value of the factor corresponding to a set of values of the observed variables (see Factor Score Estimation). Such a value is not uniquely determined but, within the general framework, is a random variable. The factor score may then be taken as the expected value of the factor, given the data. It is curious that this has been the undisputed practice in latent class analysis from the beginning, where allocation to classes has been based on posterior probabilities of class membership. Only recently is it becoming accepted that this is the obvious way to proceed in all cases.

Posterior probability analysis also shows that, in a broad class of cases, all the information about the latent variables is contained in a single statistic, which, in the usual statistical sense, is sufficient for the factor. It is now possible to have a single program for fitting virtually any model in this wider class when the variables are of mixed type. One such is GENLAT due to Moustaki [15]. A general account of such models is given in [16].

Computation

Factor analysis is a computer-intensive technique. This fact made it difficult to implement before the coming of electronic computers. Various methods were devised for estimating the factor loadings and communalities for use with the limited facilities then available. The commonest of these, known as the centroid method, was based on geometrical ideas and it survived long enough to be noted in the first edition of Lawley and Maxwell [11]. Since then, almost all methods have involved minimizing the distance between the observed covariance (or correlation) matrix S and its theoretical equivalent given by

    Σ = ΛΛ′ + Ψ.   (12)

These methods include least squares, weighted (or generalized) least squares, and maximum likelihood. The latter has been generally favored because it allows the calculation of standard errors and measures of goodness-of-fit. It is not immediately obvious that this involves a minimization of distance but this becomes apparent when we note that the log(likelihood) turns out to be

    log(likelihood) = constant + (n/2) ln det[Σ⁻¹S] − (n/2) trace[Σ⁻¹S],   (13)

where Σ is the covariance matrix according to the model and S is the sample covariance matrix. We note that Σ⁻¹S = I if Σ = S and that, otherwise, the distance is positive. This means that, even if the distributional assumptions required by the model are not met, the maximum likelihood method will still be a reasonable fitting method. There are two principal approaches to maximizing (13). One, adopted by Jöreskog and Sörbom [8], uses the Fletcher–Powell (see Optimization Methods) algorithm. The second is based on the E-M algorithm. The latter has the conceptual advantage that it can be developed for the much wider class of models described in the section titled Extension to Variables of Other Types.

The major software packages that are now available also allow for various kinds of rotation, the calculation of factor scores, and many other details of the analysis.

In spite of the fact that the main computational problems of fitting have been solved, there are still complications inherent in the model itself. Most noteworthy are what are known as Heywood cases. These arise from the fact that the elements of the diagonal matrix Ψ are variances and must, therefore, be nonnegative. Viewed geometrically, we are looking for a point in the parameter space (of Λ and Ψ) that maximizes the likelihood. It may then happen that the maximum is a boundary point at which one or more elements of Ψ is zero. The problem arises because such a boundary solution can, and often does, arise when the true values of all the elements of Ψ are strictly positive. There is nothing inherently impossible about a zero value of a residual variance but they do seem practically implausible.
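The distance idea behind equations (12) and (13) can be illustrated for p = 2 in plain Python. The loadings and the "sample" matrix below are hypothetical; the function evaluates trace(Σ⁻¹S) − ln det(Σ⁻¹S) − p, the quantity that the maximum likelihood criterion drives to its minimum of zero, attained exactly when Σ = S.

```python
import math

# Hypothetical one-factor loadings for two standardized variables;
# psi holds the specific variances so that Sigma has a unit diagonal,
# as in equation (12): Sigma = Lambda Lambda' + Psi.
lam = [0.8, 0.7]
psi = [1 - l * l for l in lam]
Sigma = [[lam[i] * lam[j] + (psi[i] if i == j else 0.0)
          for j in range(2)] for i in range(2)]

def distance(Sigma, S):
    # trace(Sigma^-1 S) - ln det(Sigma^-1 S) - p, which maximizing the
    # log(likelihood) of equation (13) minimizes; zero iff Sigma = S.
    (a, b), (c, d) = Sigma
    det = a * d - b * c
    inv = [[d / det, -b / det], [-c / det, a / det]]
    M = [[sum(inv[i][k] * S[k][j] for k in range(2)) for j in range(2)]
         for i in range(2)]
    tr = M[0][0] + M[1][1]
    det_M = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    return tr - math.log(det_M) - 2

S = [[1.0, 0.56], [0.56, 1.0]]  # a "sample" matrix the model fits exactly
print(distance(Sigma, S) < 1e-9)                  # True: perfect fit
print(distance([[1.0, 0.0], [0.0, 1.0]], S) > 0)  # True: misfit is positive
```

A real fitting routine would search over Λ and Ψ to minimize this distance, which is what the Fletcher–Powell and E-M approaches mentioned in the text do.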
Heywood cases are an inconvenience but their occurrence emphasizes the inherent uncertainty in the estimation of the parameters. They are much more common with small sample sizes and the only ultimate cure is to use very large samples.

What Then is Factor Analysis?

Factor analysis has appeared under so many guises in its 100-year history that one may legitimately ask whether it has retained that unitary character that would justify describing it as a single entity. Retrospectively, we can discern three overlapping phases that have coexisted. The prominence we give to each may depend, to some extent, on what vantage point we adopt: that of psychologist, statistician, or general social scientist.

At the beginning, and certainly in Spearman's view, it was concerned with explaining the pattern in a correlation matrix. Why, in short, are the variables correlated in the way they are? It thus became a technique for explaining the pattern of correlation coefficients in terms of their dependence on underlying variables. It is true that the interpretation of those correlations depended on the linearity of relations between the variables but, in essence, it was the correlation coefficients that contained the relevant information. Obviously, the technique could only be used on variables for which correlation coefficients could be calculated or estimated.

The second approach is to write down a probability model for the observed (manifest) variables. Traditionally, these variables have been treated as continuous and it is then natural to express them as linear in the latent variables, or factors. In the standard normal linear factor model, the joint distribution of the manifest variables is multivariate normal and thus depends, essentially, on the covariance matrix of the data. We are thus led to the covariance rather than the correlation matrix as the basis for fitting. Formally, we have reached almost the same point as in the first approach, though this is only because of the particular assumptions we have made. However, we can now go much further because of the distributional assumptions we have made. In particular, we can derive standard errors for the parameter estimates, devise goodness-of-fit tests, and so forth.

The third and final approach is to drop the specific assumptions about the kinds of variable and their distributions. The focus then shifts to the essential question that has underlain factor analysis from the beginning. That is, is the interdependence among the manifest variables indicative of their dependence on a (small) number of factors (latent variables)? It is then seen as one tool among many for studying the dependence structure of a set of random variables. From that perspective, it is seen to have a much wider relevance than Spearman could ever have conceived.

References

[1] Bartholomew, D.J. & Knott, M. (1999). Latent Variable Models and Factor Analysis, 2nd Edition, Arnold, London.
[2] Bartlett, M.S. (1937). The statistical concept of mental factors, British Journal of Psychology 28, 97–104.
[3] Bartlett, M.S. (1938). Methods of estimating mental factors, Nature 141, 609–610.
[4] Bartlett, M.S. (1948). Internal and external factor analysis, British Journal of Psychology (Statistical Section) 1, 73–81.
[5] Burt, C. (1941). The Factors of the Mind: An Introduction to Factor Analysis in Psychology, Macmillan, New York.
[6] Hotelling, H. (1933). Analysis of a complex of statistical variables into principal components, Journal of Educational Psychology 24, 417–441, 498–520.
[7] Jöreskog, K.G. (1970). A general method for analysis of covariance structures, Biometrika 57, 239–251.
[8] Jöreskog, K.G. & Sörbom, D. (1977). Statistical models and methods for analysis of longitudinal data, in Latent Variables in Socioeconomic Models, D.J. Aigner & A.S. Goldberger, eds, North-Holland, Amsterdam.
[9] Kendall, M.G. (1950). Factor analysis as a statistical technique, Journal of the Royal Statistical Society, Series B 12, 60–63.
[10] Lawley, D.N. (1943). The application of the maximum likelihood method to factor analysis, British Journal of Psychology 33, 172–175.
[11] Lawley, D.N. & Maxwell, A.E. (1963). Factor Analysis as a Statistical Method, Butterworths, London.
[12] Lazarsfeld, P.F. & Henry, N.W. (1968). Latent Structure Analysis, Houghton Mifflin, New York.
[13] Lovie, A.D. & Lovie, P. (1993). Charles Spearman, Cyril Burt and the origins of factor analysis, Journal of the History of the Behavioral Sciences 29, 308–321.
[14] Maxwell, A.E. (1961). Recent trends in factor analysis, Journal of the Royal Statistical Society, Series A 124, 49–59.
[15] Moustaki, I. (2003). A general class of latent variable models for ordinal manifest variables with covariate effects on the manifest and latent variables, British Journal of Mathematical and Statistical Psychology 56, 337–357.
History of Factor Analysis: A Statistical Perspective
(See also Structural Equation Modeling: Categorical Variables; Structural Equation Modeling: Latent Growth Curve Analysis; Structural Equation Modeling: Multilevel)

DAVID J. BARTHOLOMEW
History of Intelligence Measurement
NADINE WEIDMAN
Volume 2, pp. 858–861
in
ISBN-13: 978-0-470-86080-9
ISBN-10: 0-470-86080-4
Editors
could be removed and given special education, allowing the other children to progress normally. Binet had previously been interested in the experimental study of the highest and most complex mental processes, and of individuals of high ability; with his colleague Théodore Simon (1873–1961), Binet determined what tasks a normal child, of a given age, could be expected to do, and then based his test on a series of 30 tasks of graded difficulty. Binet and Simon published their results in L'Année Psychologique in 1905, revising their test in 1908 and again in 1911. Though some identified the general capacity that such a test seemed to assess with Spearman's general factor of intelligence, Binet and Simon referred to the ability being tested as judgment. They were, unlike Spearman, more interested in the description of individuals than in developing a theory of general intelligence, and their work did not have the strong hereditarian overtones that Spearman's did [6].

Binet and Simon's test for mental ability was refined and put to use by many other psychologists. In Germany, the psychologist William Stern argued in 1912 that the mental age of the child, as determined by the test, should be divided by the child's chronological age, and gave the number that resulted the name IQ, for intelligence quotient [9]. But it was in the United States that the Binet–Simon IQ test found its most receptive audience, and where it was put to the hereditarian ends that both Binet and Stern had renounced. The psychologist Henry Herbert Goddard (1866–1957), for example, used the test to classify patients at the New Jersey Training School for Feebleminded Boys and Girls, a medical institution housing both children and adults diagnosed with mental, behavioral, and physical problems. Goddard subsequently developed, in part on the basis of his experience at the Training School, a theory that intelligence was unitary and was determined by a single genetic factor. He also used IQ tests on immigrants who came to America through the Ellis Island immigration port [13].

At Stanford University, the educational psychologist Lewis Terman (1877–1956) and his colleagues used IQ tests to determine the mental level of normal children, rather than to identify abnormal ones, an application that represented a significant departure from Binet's original intention. Terman called his elaboration of Binet's test the Stanford–Binet, and it became the predecessor and prototype of the standardized, multiple-choice tests routinely taken by American students from the elementary grades through the college and postgraduate years [3]. But psychologists moved the intelligence test beyond its application to performance in school. With the entrance of the United States into World War I in 1917, the comparative psychologist Robert M. Yerkes, supported by the National Research Council, proposed to the US Army a system of mass testing of recruits, which would determine whether they were fit for army service and, if so, what tasks best suited them [2]. Mass testing of thousands of soldiers differed greatly from Binet's individualist emphasis, but it raised psychology's public profile considerably: after the war, psychologists could justifiably call themselves experts in human management [4, 10]. Again, the results of the army testing were interpreted in hereditarian ways: psychologists argued that they showed that blacks and immigrants, especially from southern and eastern Europe, were less intelligent than native-born whites. Such arguments lent support to the call for immigration restriction, which passed into law in 1924.

Even as the IQ testers achieved these successes, they began to receive harsh criticism. Otto Klineberg (1899–1992), a psychologist trained under the anthropologist Franz Boas, made the best known and most influential attack. Klineberg argued that the supposedly neutral intelligence tests were actually compromised by cultural factors, and that the level of education, experience, and upbringing so affected a child's score that it could not be interpreted as a marker of innate intelligence. Klineberg's work drew on that of lesser-known black psychologists, most notably Horace Mann Bond (1904–1972), an educator, sociologist, and university administrator. Bond showed that the scores of blacks from the northern states of New York, Ohio, and Pennsylvania were higher than those of southern whites, and explained the difference in terms of better access to education on the part of northern blacks. Such an argument flew in the face of innatist explanations. Nonetheless, despite his criticisms of hereditarian interpretations of the tests, Bond never condemned the tests outright and in fact used them in his work as a college administrator. Intelligence tests could, he argued, be used to remedy the subjectivity of individual teachers' judgments. If used properly, that is, for the diagnosis of learning problems,
and if interpreted in an environmentalist way, Bond believed that the tests could actually subvert bias. By the mid-1930s, Bond's evidence and arguments had severely damaged the hereditarian interpretation of IQ test results [11].

By 1930, too, several prominent psychologists had made public critiques or undergone well-publicized reversals on testing. E. G. Boring (1886–1968) expressed skepticism that intelligence tests actually measured intelligence. And Carl Brigham (1890–1943), who had in 1923 published a racist text on intelligence, recanted his views by the end of that decade [6].

The trend toward environmentalist and cultural critiques of intelligence testing met a strong opponent in Arthur Jensen, a psychologist at the University of California, Berkeley. In 1969, his controversial article "How Much Can We Boost I.Q. and Scholastic Achievement?" claimed that compensatory education "has been tried and it apparently has failed" [8]. Jensen argued that it was in fact low IQ, not discrimination, cultural or social disadvantages, or racism, that accounted for minority students' poor performance in intelligence tests and in school. His claim relied to an extent on Cyril Burt's twin studies, which purported to show that identical twins separated at birth and raised in different environments were highly similar in mental traits, and that such similarity meant that intelligence was largely genetically determined. (In 1974, Leon J. Kamin investigated Burt's twin studies and concluded that Burt had fabricated his data.) Jensen's argument was in turn echoed by the Harvard psychologist Richard Herrnstein (1930–1994), who argued that because IQ was so highly heritable, one should expect a growing stratification of society based on intelligence, and that this was in fact happening in late twentieth-century America. Expanded and developed, this same argument appeared in The Bell Curve: Intelligence and Class Structure in American Life, which Herrnstein published with the political scientist Charles Murray in 1994 [7]. Both in 1970 and 1994, Herrnstein's argument met a firestorm of criticism.

Attempts to define and measure intelligence are always tied to social and political issues, so the controversy that attends such attempts should come as no surprise. Just as the post-World War I enthusiasm for IQ testing must be understood in the context of immigration restriction, Jensen's and Herrnstein's interest in intelligence and heredity arose against a background of debates over civil rights, affirmative action, and multiculturalism. From Galton's day to the present, IQ testers and their critics have been key players in the ongoing conversation about the current state and future direction of society.

References

[1] Burt, C. (1940). The Factors of the Mind, University of London Press, London.
[2] Carson, J. (1993). Army alpha, army brass, and the search for army intelligence, Isis 84, 278–309.
[3] Chapman, P.D. (1988). Schools as Sorters: Lewis M. Terman, Applied Psychology, and the Intelligence Testing Movement, 1890–1930, New York University Press, New York.
[4] Fancher, R. (1985). The Intelligence Men: Makers of the I.Q. Controversy, Norton, New York.
[5] Galton, F. (1869). Hereditary Genius: An Inquiry into Its Laws and Consequences, Horizon Press, New York.
[6] Gould, S.J. (1981). The Mismeasure of Man, Norton, New York.
[7] Herrnstein, R. & Murray, C. (1994). The Bell Curve: Intelligence and Class Structure in American Life, Free Press, New York.
[8] Jensen, A. (1969). How much can we boost I.Q. and scholastic achievement? Harvard Educational Review 39, 1–123.
[9] Smith, R. (1997). The Norton History of the Human Sciences, Norton, New York.
[10] Sokal, M.M., ed. (1987). Psychological Testing and American Society, Rutgers University Press, New Brunswick.
[11] Urban, W.J. (1989). The black scholar and intelligence testing: the case of Horace Mann Bond, Journal of the History of the Behavioral Sciences 25, 323–334.
[12] Wooldridge, A. (1994). Measuring the Mind: Education and Psychology in England, c. 1860–c. 1990, Cambridge University Press, Cambridge.
[13] Zenderland, L. (1998). Measuring Minds: Henry Herbert Goddard and the Origins of American Intelligence Testing, Cambridge University Press, Cambridge.

NADINE WEIDMAN
History of Mathematical Learning Theory
SANDY LOVIE
Volume 2, pp. 861–864
system of linear operators (see their [5, Chapter 2]; also their initial comments that a stimulus model is not necessary to the operator approach, [5, p. 46]). What they were attempting was the development of a flexible mathematical system which could be tweaked to model many theoretical approaches in psychology by varying the range (and meaning) of allowable parameter values (but not model type) according to both the theory and the experimental domain. So ambitious a project was, however, almost impossible to carry out in practice, particularly as it also assumed a narrowly defined class of models, and was eminently mis-understandable by learning theorist and experimentalist alike. And so it proved.

What now happened to MLT from the late 1950s was an increasing concentration on technical details and the fragmentation of the field as a result of strong creative disagreements, with infighting over model fit replacing Bush and Mosteller's 1955 plea for a cumulative process of model development; tendencies, which, paradoxically, they did little to discourage. Indeed, their eight-model comparison in [6], using the 1953 Solomon and Wynne shock avoidance data, could be said to have kick-started the competitive phase of MLT, a direction hastened by the work of Bush, Galanter, and Luce in the same 1959 volume, which pitted Luce's beta model for individual choice against the linear operator one, in part using the same Solomon and Wynne summary numbers [4].

Of course, comparing one model with another is a legitimate way of developing a field, but the real lack at the time of any deep or well-worked-out theories of learning meant that success or failure in model-fitting was never unambiguous, with the contenders usually having to fall back on informal claims of how much closer their (unprioritized and unprioritizable) collective predictions were to the data than those of their opponents. This also made the issue of formal tests of goodness of fit, such as chi-square, problematic for many workers (see Bower's careful but ultimately unsatisfactory trip around this issue in [9, pp. 375–376]). Furthermore, the epistemological deficit meant that MLT would sooner or later have to face up to the problem of identifiability, that is, how well do models need to be substantively and formally specified in order to uniquely and unambiguously represent a particular data set. Not to do so opened up the possibility of finding that MLT's theories are typically underdetermined by the data, to quote the standard postpositivist mantra. (Consult [13, especially Chapter 12], for a carefully drawn instance of how to handle some problems of identifiability in reaction time studies originally raised by Townsend in, for example, [16]; see also [12] for a case study in the history of factor analysis.)

Meanwhile, and seemingly almost oblivious to this debate, Estes single-mindedly pursued his vision of MLT as Stimulus Sampling Theory (SST), which claimed to be closer than most versions of MLT to psychological theorizing. Increasingly, however, SST was viewed as a kind of meta-theory in that its major claim to fame was as a creative resource rather than its instantiation in a series of detailed and specific models. Thus, according to Atkinson et al. [1, p. 372], "Much as with any general heuristic device, stimulus sampling theory should not be thought of as provable or disprovable, right or wrong. Instead, we judge the theory by how useful it is in suggesting specific models that may explain and bring some degree of orderliness into the data." Consequently, from the 1966 edition of their authoritative survey of learning theory onwards, Hilgard and Bower treated MLT as if it was SST, pointing to the approach's ability to generate testable models in just about every field of learning, from all varieties of conditioning to concept identification and two-person interactive games, via signal detection and recognition, and spontaneous recovery and forgetting. In fact, Bower, on pages 376 to 377 of his 1966 survey of MLT [9], lists over 25 distinctive areas of learning and related fields into which MLT, in the guise of SST, had infiltrated (see also [1], in which Atkinson et al. take the same line over SST's status and success). By the time of the 1981 survey [3], Bower was happy to explicitly equate MLT with SST, and to impute genius to Estes himself (p. 252).

Finally, all these developments, together with the increasing power of the computer metaphor for the human cognitive system, also speeded up the recasting and repositioning of the use of mathematics in psychology. For instance, Atkinson moved away from a completely analytical and formal approach by mixing semiformal devices such as flow charts and box models (used to sketch in the large-scale anatomy of such systems as human memory) with mathematical models of the process side, for example, the operation of the rehearsal buffer linking the short- and long-term memory stores (see [2]). Such
hybrid models or approaches allowed MLT to remain within, and contribute to, the mainstream of experimental psychology, for which a more thoroughgoing mathematical modeling was a minority taste. Interestingly, a related point was also advanced by Bower in his survey of MLT [9], where a distinction was made between rigorous mathematical systems with only minimal contact with psychology (like the class of linear models proposed by Bush and Mosteller) and overall ones like SST, which claimed to represent well-understood psychological processes and results, but which made few, if any, specific predictions. Thus, on page 338 of [9], Bower separates specific-quantitative from quasi-quantitative approaches to MLT, but then opts for SST on the pragmatic grounds that it serves as a persuasive example of both.

References

[1] Atkinson, R.C., Bower, G.H. & Crothers, E.J. (1965). An Introduction to Mathematical Learning Theory, Wiley, New York.
[2] Atkinson, R.C. & Shiffrin, R.M. (1968). Human memory: a proposed system and its control processes, in The Psychology of Learning and Motivation, Vol. 2, K.W. Spence & J.T. Spence, eds, Academic Press, New York.
[3] Bower, G.H. & Hilgard, E.R. (1981). Theories of Learning, 5th Edition, Prentice Hall, Englewood Cliffs.
[4] Bush, R.R., Galanter, E. & Luce, R.D. (1959). Tests of the beta model, in Studies in Mathematical Learning Theory, R.R. Bush & W.K. Estes, eds, Stanford University Press, Stanford.
[5] Bush, R.R. & Mosteller, F. (1955). Stochastic Models for Learning, Wiley, New York.
[6] Bush, R.R. & Mosteller, F. (1959). A comparison of eight models, in Studies in Mathematical Learning Theory, R.R. Bush & W.K. Estes, eds, Stanford University Press, Stanford.
[7] Estes, W.K. (1950). Towards a statistical theory of learning, Psychological Review 57, 94–107.
[8] Feller, W. (1957). An Introduction to Probability Theory and its Applications, Vol. 1, 2nd Edition, Wiley, New York.
[9] Hilgard, E.R. & Bower, G.H. (1966). Theories of Learning, 3rd Edition, Appleton-Century-Crofts, New York.
[10] Hull, C.L. (1943). Principles of Behavior, Appleton-Century-Crofts, New York.
[11] Hull, C.L., Hovland, C.J., Ross, R.T., Hall, M., Perkins, D.T. & Fitch, F.G. (1940). Mathematico-deductive Theory of Rote Learning, Yale University Press, New York.
[12] Lovie, P. & Lovie, A.D. (1995). The cold equations: Spearman and Wilson on factor indeterminacy, British Journal of Mathematical and Statistical Psychology 48(2), 237–253.
[13] Luce, R.D. (1986). Response Times: Their Role in Inferring Elementary Mental Organisation, Oxford University Press, New York.
[14] Luce, R.D., Bush, R.R. & Galanter, E., eds (1963). Handbook of Mathematical Psychology, Vols. 1 & 2, Wiley, New York.
[15] Luce, R.D., Bush, R.R. & Galanter, E., eds (1965). Handbook of Mathematical Psychology, Vol. 3, Wiley, New York.
[16] Townsend, J.T. (1974). Issues and models concerning the processing of a finite number of inputs, in Human Information Processing: Tutorials in Performance and Cognition, B.H. Kantowitz, ed., Erlbaum, Hillsdale.

SANDY LOVIE
History of Multivariate Analysis of Variance
SCOTT L. HERSHBERGER
Volume 2, pp. 864–869
where S is the pooled within-groups covariance matrix. For L1, we need the likelihood of k separate multivariate normal distributions; in other words, each of the k multivariate normal distributions is described by a different mean and covariance structure. The likelihood for the nonnull hypothesis, of group inequality, is

$$ L_1 = \frac{e^{-\frac{1}{2}N}}{(2\pi)^{\frac{1}{2}Np} \prod_{t=1}^{k} |S_t|^{\frac{1}{2}n_t}}, \qquad (5) $$

where n_t is the sample size of an individual group. Thus, to test the hypothesis that the k samples are drawn from the same population, as against the alternative that they come from different populations, we test the ratio L0/L1:

$$ LR = \frac{L_0}{L_1} = \frac{\dfrac{e^{-\frac{1}{2}N}}{(2\pi)^{\frac{1}{2}Np}\,|S|^{\frac{1}{2}N}}}{\dfrac{e^{-\frac{1}{2}N}}{(2\pi)^{\frac{1}{2}Np} \prod_{t=1}^{k} |S_t|^{\frac{1}{2}n_t}}} = \frac{\prod_{t=1}^{k} |S_t|^{\frac{1}{2}n_t}}{|S|^{\frac{1}{2}N}}. \qquad (6) $$

Further simplification of the LR is possible when we recognize that the numerator represents the between-groups variance and the denominator represents the total variance. Therefore, we have

$$ LR = \prod_{t=1}^{k} \left\{ \frac{|S_t|}{|S|} \right\}^{\frac{1}{2}n_t} = \frac{|W|}{|B + W|} = \frac{|W|}{|T|}, \qquad (7) $$

where |W| is the determinant of the within-groups sum of squares (SS_within) and cross-products (CP_within) matrix, |B + W| is the determinant of the sum of the between-groups sum of squares (SS_between) and cross-products (CP_between) matrix and the within-groups SS_within, CP_within matrix, and |T| is the determinant of the total sample sum of squares (SS_total) and cross-products (CP_total) matrix. The ratio |W|/|T| is Wilks's Λ. Note that as |T| increases relative to |W|, the ratio decreases in size, with an accompanying increase in the probability of rejecting H0.

Lambda is a family of three-parameter curves, with parameters based on the number of groups, the number of subjects, and the number of dependent variables, and is thus complex. Although Λ has been tabled for specific values of its parameters [7, 8, 10, 20], the utility of Λ depends on its transformation to either an exact or approximate χ² or F statistic. Bartlett [2] proposed an approximation to Λ in 1939 based on the chi-square distribution:

$$ \chi^2 = -[(N - 1) - 0.5(p + k)] \ln \Lambda, \qquad (8) $$

which is evaluated at p(k − 1) degrees of freedom. Closer asymptotic approximations have been given by Box [4] and Anderson [1]. Transformations to exact chi-squared distributions have been given by Schatzoff [21], Lee [11], and Pillai and Gupta [14]. Rao [15] derived an F statistic in 1952 which provides better approximations to cumulative probability densities than the approximate chi-square statistics, especially when sample size is relatively small:

$$ F = \frac{1 - \Lambda^{1/s}}{\Lambda^{1/s}} \cdot \frac{ms - p(k - 1)/2 + 1}{p(k - 1)}, \qquad (9) $$

where m = N − 1 − (p + k)/2 and s = √{[p²(k − 1)² − 4]/[p² + (k − 1)² − 5]}, with p(k − 1) and ms − p(k − 1)/2 + 1 degrees of freedom.

In general, the LR principle provides several optimal properties for reasonably sized samples, and is convenient for hypotheses formulated in terms of multivariate normal parameters. In particular, the attractiveness of the LR presented by Wilks is that it yields test statistics that reduce to the familiar univariate F and t statistics when p = 1. If only one dependent variable is considered, |W| = SS_within and |B + W| = SS_between + SS_within. Hence, the value of Wilks's Λ is

$$ \Lambda = \frac{SS_{within}}{SS_{between} + SS_{within}}. \qquad (10) $$

Because the F ratio is traditionally formulated as

$$ F = \frac{SS_{between}/(k - 1)}{SS_{within}/(N - k)}, \qquad (11) $$
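The χ² and F transformations of Λ in equations (8) and (9) are straightforward to compute directly. A minimal Python sketch (the function names are mine; the general form of s is undefined when p²+(k−1)² ≤ 5, in which case s is conventionally taken as 1):

```python
import math

def bartlett_chi2(lmbda, N, p, k):
    """Bartlett's chi-square approximation to Wilks's lambda (eq. 8).
    Returns the statistic and its p*(k-1) degrees of freedom."""
    chi2 = -((N - 1) - 0.5 * (p + k)) * math.log(lmbda)
    return chi2, p * (k - 1)

def rao_F(lmbda, N, p, k):
    """Rao's F approximation to Wilks's lambda (eq. 9).
    Returns the statistic and its two degrees of freedom."""
    m = N - 1 - (p + k) / 2
    if p ** 2 + (k - 1) ** 2 - 5 > 0:
        s = math.sqrt((p ** 2 * (k - 1) ** 2 - 4) /
                      (p ** 2 + (k - 1) ** 2 - 5))
    else:
        s = 1.0  # conventional value where the general form is undefined
    df1 = p * (k - 1)
    df2 = m * s - p * (k - 1) / 2 + 1
    F = (1 - lmbda ** (1 / s)) / lmbda ** (1 / s) * df2 / df1
    return F, df1, df2
```

For instance, with Λ = 0.49, N = 56, p = 2, and k = 8 (the values of the fertilizer example later in the article), `rao_F` gives F ≈ 2.88 on 14 and 94 degrees of freedom.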
Wilks's Λ can also be written as

$$ \Lambda = \frac{1}{1 + \dfrac{(k - 1)}{(N - k)} F}, \qquad (12) $$

where N = the total sample size. This indicates that the relationship between Λ and F is somewhat inverse: the larger the F ratio, the smaller Wilks's Λ.

Most computational algorithms for Wilks's Λ take advantage of the fact that Λ can be expressed as a function of the eigenvalues of a matrix. Consider Wilks's Λ rewritten as

$$ \Lambda = \frac{|W|}{|B + W|} = \frac{1}{|BW^{-1} + I|}. \qquad (13) $$

Also consider that for any matrix X there are eigenvalues λ_i, and for the matrix (X + I) the eigenvalues are (λ_i + 1). In addition, the product of the eigenvalues of a matrix is always equal to the determinant of the matrix (i.e., ∏λ_i = |X|). Hence, ∏(λ_i + 1) = |X + I|. Based on this information, the value of Wilks's Λ can be written as a product over the eigenvalues λ_i of the matrix BW⁻¹:

$$ \Lambda = \prod_i \frac{1}{\lambda_i + 1}. \qquad (14) $$

MANOVA Example

We illustrate MANOVA using as an example one of its earliest applications [13]. In this study, there were five samples, each with 12 members, of aluminum diecastings (k = 5, n_k = 12, N = 60). On each specimen p = 2 measurements are taken: tensile strength (TS, 1000 lb per square inch) and hardness (H, Rockwell's E). The data may be summarized as shown in Table 1.

We wish to test the multivariate null hypothesis of sample equality with the χ² approximation to Wilks's Λ. Recall that Λ = |W|/|B + W|, so W and B are needed. First we calculate W. Recognizing that each sample provides an estimate of W, we use a pooled estimate of the within-sample variability for the two variables:

$$ W = W_1 + W_2 + W_3 + W_4 + W_5 = \begin{pmatrix} 78.95 & 214.18 \\ 214.18 & 1247.18 \end{pmatrix} + \begin{pmatrix} 223.70 & 657.62 \\ 657.62 & 2519.31 \end{pmatrix} + \begin{pmatrix} 57.45 & 190.63 \\ 190.63 & 1241.78 \end{pmatrix} + \begin{pmatrix} 187.62 & 375.91 \\ 375.91 & 1473.44 \end{pmatrix} + \begin{pmatrix} 88.46 & 259.18 \\ 259.18 & 1171.73 \end{pmatrix} = \begin{pmatrix} 636.17 & 1697.52 \\ 1697.52 & 7653.44 \end{pmatrix}. \qquad (15) $$

The diagonal elements of B are defined as follows:

$$ b_{ii} = \sum_{j=1}^{k} n_j (\bar{y}_{ij} - \bar{y}_i)^2, \qquad (16) $$

where n_j is the number of specimens in group j, ȳ_ij is the mean for variable i in group j, and ȳ_i is the grand mean for variable i. The off-diagonal elements of B are defined as follows:

$$ b_{mi} = b_{im} = \sum_{j=1}^{k} n_j (\bar{y}_{ij} - \bar{y}_i)(\bar{y}_{mj} - \bar{y}_m). \qquad (17) $$

Table 2 Results from MANOVA examining the effect of fertilizer treatment on straw and grain yield

Source       df    SS_x1       CP_x1x2     SS_x2
Blocks        7     86045.8     56073.6     75841.5
Treatments    7     12496.8     -6786.6     32985.0
Residual     49    136972.6     58549.0     71496.1
Total        63    235515.2    107836.0    180322.6
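The determinant form (13) and the eigenvalue form (14) of Λ can be checked numerically for the p = 2 case using the pooled W of equation (15). A pure-Python sketch; the B matrix below is an arbitrary illustrative value, NOT the between-groups matrix of the original diecasting study:

```python
import math

# Check eq. (13) against eq. (14): Lambda = |W| / |B + W| must equal the
# product of 1/(eigenvalue + 1) over the eigenvalues of B W^{-1}.
W = [[636.17, 1697.52], [1697.52, 7653.44]]   # pooled W from eq. (15)
B = [[250.0, 120.0], [120.0, 900.0]]          # hypothetical between-groups SSCP

def det2(m):
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]

def add2(a, b):
    return [[a[i][j] + b[i][j] for j in range(2)] for i in range(2)]

def matmul2(a, b):
    return [[sum(a[i][t] * b[t][j] for t in range(2)) for j in range(2)]
            for i in range(2)]

def inv2(m):
    d = det2(m)
    return [[m[1][1] / d, -m[0][1] / d], [-m[1][0] / d, m[0][0] / d]]

def eig2(m):
    # Eigenvalues of a 2x2 matrix via its characteristic polynomial.
    tr, d = m[0][0] + m[1][1], det2(m)
    disc = math.sqrt(tr * tr - 4.0 * d)
    return (tr + disc) / 2.0, (tr - disc) / 2.0

lambda_det = det2(W) / det2(add2(B, W))        # eq. (13)
l1, l2 = eig2(matmul2(B, inv2(W)))
lambda_eig = 1.0 / ((l1 + 1.0) * (l2 + 1.0))   # eq. (14)
```

For 2 × 2 matrices the eigenvalues come straight from the characteristic polynomial; for larger p one would hand BW⁻¹ to a numerical eigensolver instead.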
Table 3 Results from MANCOVA examining the effect of fertilizer treatment on straw and grain yield

Source   df    SS_x1       CP_x1x2    SS_x2
Total    58    149469.4    51762.4    104481.1
removed from the total variability, resulting in a new total (Table 3).

The multivariate null hypothesis of equality among the eight fertilizers was tested using Wilks's Λ:

$$ \Lambda = \frac{|W|}{|T|} = \frac{\begin{vmatrix} 136972.6 & 58549.0 \\ 58549.0 & 71496.1 \end{vmatrix}}{\begin{vmatrix} 149469.4 & 51762.4 \\ 51762.4 & 104481.1 \end{vmatrix}} = \frac{7113660653}{7649097476} = 0.49. \qquad (24) $$

Next, the chi-square approximation to Λ was computed:

$$ \chi^2 = -[(56 - 1) - 0.5(2 + 8)] \ln(0.49) = 50.0(0.31) = 15.5, \qquad (25) $$

with 2(8 − 1) = 14 df, p = 0.34. The conclusion was that the eight fertilizer treatments yielded equal amounts of straw and grain.

References

[1] Anderson, T.W. (1984). Introduction to Multivariate Statistical Analysis, 2nd Edition, Wiley, New York.
[2] Bartlett, M.S. (1939). A note on tests of significance in multivariate analysis, Proceedings of the Cambridge Philosophical Society 35, 180–185.
[3] Bartlett, M.S. (1947). Multivariate analysis, Journal of the Royal Statistical Society B 9, 176–197.
[4] Box, G.E.P. (1949). A general distribution theory for a class of likelihood criteria, Biometrika 36, 317–346.
[5] Fisher, R.A. (1912). On the absolute criterion for fitting frequency curves, Messenger of Mathematics 41, 155–160.
[6] Fisher, R.A. (1922). On the mathematical foundations of theoretical statistics, Philosophical Transactions of the Royal Society of London Series A 222, 309–368.
[7] Fisher, R.A. (1939). The sampling distribution of some statistics obtained from nonlinear equations, Annals of Eugenics 9, 238–249.
[8] Girshick, M.A. (1939). On the sampling theory of roots of determinantal equations, Annals of Mathematical Statistics 10, 203–224.
[9] Hotelling, H. (1951). A generalized T test and measure of multivariate dispersion, Proceedings of the Second Berkeley Symposium on Mathematical Statistics, University of California Press, Berkeley, pp. 23–41.
[10] Hsu, P.L. (1939). On the distribution of roots of certain determinantal equations, Annals of Eugenics 9, 250–258.
[11] Lee, Y.S. (1972). Some results on the distribution of Wilks's likelihood-ratio criterion, Biometrika 59, 649–664.
[12] Neyman, J. & Pearson, E. (1928). On the use and interpretation of certain test criteria for purposes of statistical inference, Biometrika 20, 175–240, 263–294.
[13] Pearson, E.S. & Wilks, S.S. (1933). Methods of statistical analysis appropriate for K samples of two variables, Biometrika 25, 353–378.
[14] Pillai, K.C.S. & Gupta, A.K. (1968). On the non-central distribution of the second elementary symmetric function of the roots of a matrix, Annals of Mathematical Statistics 39, 833–839.
[15] Rao, C.R. (1952). Advanced Statistical Methods in Biometric Research, Wiley, New York.
[16] Roy, S.N. (1939). p-statistics or some generalizations in analysis of variance appropriate to multivariate problems, Sankhya 4, 381–396.
[17] Roy, S.N. (1942a). Analysis of variance for multivariate normal populations: the sampling distribution of the requisite p-statistics on the null and non-null hypothesis, Sankhya 6, 35–50.
[18] Roy, S.N. (1942b). The sampling distribution of p-statistics and certain allied statistics on the non-null hypothesis, Sankhya 6, 15–34.
[19] Roy, S.N. (1946). Multivariate analysis of variance: the sampling distribution of the numerically largest of the p-statistics on the non-null hypotheses, Sankhya 8, 15–52.
[20] Roy, S.N. (1957). Some Aspects of Multivariate Analysis, Wiley, New York.
[21] Schatzoff, M. (1966). Exact distributions of Wilks's likelihood ratio criterion, Biometrika 53, 347–358.
[22] Wilks, S.S. (1932). Certain generalizations in the analysis of variance, Biometrika 24, 471–494.

Further Reading

Wilks, S.S. (1962). Mathematical Statistics, Wiley, New York.

SCOTT L. HERSHBERGER
History of Path Analysis
STANLEY A. MULAIK
Volume 2, pp. 869–875
[Figure: two path diagrams with variables A, B, M, N, and X, error term e_X on X, and path coefficients p_M.A, p_M.B, p_M.N, and p_X.M]

Further, suppose X, Y, M and N have unit variances and that we observe the correlations r_XY, r_XM, r_XN, r_YM, r_YN, and r_MN. From these equations, and assuming e_X and e_Y are mutually uncorrelated and further each is uncorrelated with M and N, we can derive the following:

$$ r_{XY} = p_{XM} p_{YN} r_{MN} $$
$$ r_{XM} = p_{XM} $$

The path diagram shows that B and N are uncorrelated. Furthermore, A is uncorrelated with B and M is uncorrelated with N. We will again assume that all variables have unit variances and zero means. This gives rise to the equations for the
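The derived correlations can be verified mechanically: writing X and Y as linear combinations of (M, N, e_X, e_Y), the implied covariance matrix is AΦAᵀ, and its off-diagonal element reproduces r_XY = p_XM p_YN r_MN. A sketch with illustrative coefficient values (not taken from the text):

```python
# Model: X = p_XM * M + e_X,  Y = p_YN * N + e_Y, with Var(M) = Var(N) = 1,
# Corr(M, N) = r_MN, and errors uncorrelated with everything else.
p_XM, p_YN, r_MN = 0.6, 0.5, 0.4   # illustrative values only

# Covariance matrix of (M, N, e_X, e_Y); error variances are chosen so
# that X and Y come out with unit variance.
Phi = [[1.0,  r_MN, 0.0,           0.0],
       [r_MN, 1.0,  0.0,           0.0],
       [0.0,  0.0,  1 - p_XM ** 2, 0.0],
       [0.0,  0.0,  0.0,           1 - p_YN ** 2]]
# X and Y as linear combinations of (M, N, e_X, e_Y).
A = [[p_XM, 0.0,  1.0, 0.0],
     [0.0,  p_YN, 0.0, 1.0]]

def matmul(a, b):
    return [[sum(a[i][t] * b[t][j] for t in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

AT = [list(col) for col in zip(*A)]
Sigma = matmul(matmul(A, Phi), AT)   # implied covariance matrix of (X, Y)

r_XY_implied = Sigma[0][1]           # off-diagonal element
r_XY_tracing = p_XM * p_YN * r_MN    # Wright's tracing-rule value
```

Because X and Y have unit variance here, the implied covariance Sigma[0][1] is itself the implied correlation, and it agrees with the tracing-rule decomposition.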
among the variables. So, path analysis was a model-testing method.

Another reason that likely retarded the uptake of path analysis into the behavioral and social sciences at the outset was that this was a method used in genetic research, published in genetics and biological journals, so the technique was little known to researchers in the behavioral and social sciences for many years.

By a different route, the econometricians began implementing regression models and then extended these to a method mathematically equivalent to path analysis known as structural equation modeling. This was initially stimulated by such mathematical models of the economy as formulated by John Maynard Keynes [17], which used sets of simultaneous linear equations to specify relations between variables in the economy. The econometricians distinguished exogenous variables (inputs into a system of variables) from endogenous variables (variables that are dependent on other variables in the system). Econometricians also used matrix algebra to express their model equations. They sought further to solve several problems, such as determining the conditions under which the free parameters of their models would be identified, that is, determinable uniquely from the observed data [18]. They showed how the endogenous variables could ultimately be made to be just effects of the exogenous variables, given in the reduced equations. They developed several new methods of parameter estimation, such as two-stage [32] and three-stage least squares [36]. They developed both Full Information Maximum Likelihood (FIML) and Limited Information Maximum Likelihood (LIML) estimates of unspecified parameters (see Maximum Likelihood Estimation). However, generally their models involved only measured observed variables.

Although in the 1950s logical empiricism still reigned as the dominant philosophy of science and continued to issue skeptical critiques of the idea of causation as an outdated remnant of determinism, or as something to be replaced by a form of logical implication, several philosophers sought to restore causality as a central idea of science. Bunge [7] issued a significant book on causality. Simon [26] argued that causality is to be understood as a functional relation between variables, not a relation between individual events, like logical implication. This laid the groundwork for what followed in sociology and, later, psychology.

Blalock [3], a sociologist who had been originally trained in mathematics and physics, authored a highly influential book in sociology that drew upon the method of path analysis of Wright [35]. Blalock [4] also edited a collection of key articles in the study of causation in the social sciences, which was highly influential in the treatment of causality and its detection, and in providing research examples. A second edition [5] provided newer material. This was accompanied by a second volume [6] devoted to issues of detecting causation with experimental and panel designs. Duncan [8] wrote an influential introductory textbook on structural equation models for sociologists. Heise [10] also authored an important text on how to study causes with flowgraph analysis, a variant of path analysis.

A highly important development began in the latter half of the 1960s in psychology. Bock and Bargmann [2] described a new way of testing hypotheses about linear functional relations known as analysis of covariance structures. This was followed up in the work of Karl Jöreskog, a Swedish mathematical statistician, who came to Educational Testing Service to work on problems of factor analysis. After solving the problem of finding a full information maximum likelihood estimation method for exploratory common factor analysis [12] (see Factor Analysis: Exploratory), Jöreskog turned his attention to solving a similar problem for confirmatory factor analysis [13] (see Factor Analysis: Confirmatory), which prior to that time had received little attention among factor analysts. This was followed by an even more general model that he called analysis of covariance structures [14]. Collaboration with Arthur S. Goldberger led Jöreskog to produce an algorithm for estimating parameters and testing the fit of a structural equation model with latent variables [15], which combined concepts from factor analysis with those of structural equation modeling. He was also able to provide for a distinction between free, fixed, and constrained parameters in his models. But of greatest importance for the diffusion of his methods was his making available computer programs that implemented the algorithms described in his papers. By showing that confirmatory factor analysis, analysis of covariance structures, and structural equation modeling could all be accomplished with a single computer program, he provided researchers with a general, highly flexible method for studying a great variety of
linear causal structures. He called this program LISREL, for linear structural relations. It has gone through numerous revisions. But his program was shortly followed by others that sought to simplify the representation of structural equation models, such as COSAN [19], EQS [1], and several others (see Structural Equation Modeling: Software). Numerous texts, too many to mention here, followed, based on Jöreskog's breakthrough. A new journal, Structural Equation Modeling, appeared in 1994.

The availability of easy-to-use computer programs for doing structural equation modeling in the 1980s and 1990s produced almost a paradigm shift in correlational psychological research from descriptive studies to testing causal models, and renewed investigations of the concept of causality and the conditions under which it may be inferred. James, Mulaik, and Brett [11] sought to remind psychological researchers that structural equation modeling is not exploratory research, and that, in designing their studies, they needed to focus on establishing certain conditions that facilitated inferences of causation as opposed to spurious causes. Among these was the need to make a formal statement of the substantive theory underlying a model, to provide a theoretical rationale for causal hypotheses, to specify a causal order of variables, to establish self-contained systems of structural equations representing all relevant causes in the phenomenon, to specify boundaries such as the populations and environments to which the model applies, to establish that the phenomenon had reached an equilibrium condition when measurements were taken, to properly operationalize the variables in terms of conditions of measurement, to confirm empirical support for the functional equations in the model, and to confirm the model empirically in terms of its overall fit to data.

Mulaik [21] provided an amplified account, first suggested by Simon in 1953 [28], of how one might generalize the concept of causation as a functional relation between variables to the probabilistic case. Simon had written ". . . we can replace the causal ordering of the variables in the deterministic model by the assumption that the realized values of certain variables at one point or period in time determines the probability distribution of certain variables at later points in time" [27, 1977, p. 54]. This allows one to join linear structural equation modeling with other, nonlinear forms of probabilistic causation, such as item-response theory.

Four philosophers of science [9] put forth a description of a method for discovering causal structure in correlations based on an artificial intelligence algorithm that implemented heuristic searches for certain zero partial correlations between variables and/or zero tetrad differences among correlations [29] that implied certain causal path structures. Their approach combined graph theory with artificial intelligence search algorithms and statistical tests of vanishing partial correlations and vanishing tetrad differences. They also produced a computer program for accomplishing these searches known as Tetrad. In a brief history of heuristic search in applied statistics, they argued that researchers had abandoned an optimal approach to testing causal theories and discovering causal structure, first suggested by Spearman's (1904) use of tetrad difference tests, by turning to a less optimal approach in factor analysis. The key idea was that instead of estimating parameters, checking the fit of the reproduced covariance matrix to the observed covariance matrix, and then, if the fit was poor, taking another factor with associated loadings to estimate, as in factor analysis, Spearman had identified constraints implied by a causal model on the elements of the covariance matrix, and sought to test these constraints directly. Generalizing from this, Glymour et al. [9] showed how one could search for those causal structures having the greatest number of constraints implying vanishing partial correlations and vanishing tetrad differences on the population covariance matrix for the variables that would be most consistent with the sample covariance matrix. The aim was to find a causal structure that would apply regardless of the values of the model parameters.

Spirtes, Glymour, and Scheines [31] followed the previous work with a book that went into considerable detail to show how probability could be connected with causal graphs. To do this, they considered that three conditions were needed: the Causal Markov Condition, the Causal Minimality Condition, and the Faithfulness Condition. Kinship metaphors were used to identify certain sets of variables. For example, the parents of a variable V would be all those variables that are immediate causes of V, represented by directed edges of a graph leading from these variables to the variable in question. The descendants would be all those variables that lie on directed paths from V. A directed acyclic
graph for a set of variables V and a probability distribution would be said to satisfy the Markov Condition if and only if for every variable W in V, W is independent of all variables in V that are neither parents nor descendants of W, conditional on the parents of W. Satisfying the Markov Condition allows one to specify conditional independence between certain sets of variables, which could be represented by vanishing partial correlations between the variables in question, conditional on their parents. This gives one a way to perform tests on the causal structure without estimating model parameters. Spirtes, Glymour, and Scheines [31] showed how from these assumptions one could develop discovery algorithms for causally sufficient structures. Their book was full of research examples and advice on how to design empirical studies. A somewhat similar book by Pearl [24] (similar because Spirtes, Glymour, and Scheines [31] drew upon many of Pearl's earlier works) attempted to restore the study of causation to a prominent place in scientific thought by laying out the conditions by which causal relations could, and could not, be established between variables. Both of these works differ from earlier practice in emphasizing tests of conditional independence implied by a causal structure, rather than tests of the fit of an estimated model to the data, in evaluating the model.

References

[1] Bentler, P.M. & Weeks, D.G. (1980). Linear structural equations with latent variables, Psychometrika 45, 289–308.
[2] Bock, R.D. & Bargmann, R.E. (1966). Analysis of covariance structures, Psychometrika 31, 507–534.
[3] Blalock Jr, H.M. (1961). Causal Inferences in Nonexperimental Research, University of North Carolina Press, Chapel Hill.
[4] Blalock Jr, H.M. (1971). Causal Models in the Social Sciences, Aldine-Atherton, Chicago.
[5] Blalock Jr, H.M. (1985a). Causal Models in the Social Sciences, 2nd Edition, Aldine-Atherton, Chicago.
[6] Blalock, H.M. (ed.) (1985b). Causal Models in Panel and Experimental Designs, Aldine, New York.
[7] Bunge, M. (1959). Causality, Harvard University Press, Cambridge.
[8] Duncan, O.D. (1975). Introduction to Structural Equation Models, Academic Press, New York.
[9] Glymour, C., Scheines, R., Spirtes, P. & Kelly, K. (1987). Discovering Causal Structure, Academic Press, Orlando.
[10] Heise, D.R. (1975). Causal Analysis, Wiley, New York.
[11] James, L.R., Mulaik, S.A. & Brett, J.M. (1982). Causal Analysis: Assumptions, Models, and Data, Sage Publications and the American Psychological Association Division 14, Beverly Hills.
[12] Jöreskog, K.G. (1967). Some contributions to maximum likelihood factor analysis, Psychometrika 32, 443–482.
[13] Jöreskog, K.G. (1969). A general approach to confirmatory maximum likelihood factor analysis, Psychometrika 34, 183–202.
[14] Jöreskog, K.G. (1970). A general method for the analysis of covariance structures, Biometrika 57, 239–251.
[15] Jöreskog, K.G. (1973). A general method for estimating a linear structural equation system, in Structural Equation Models in the Social Sciences, A.S. Goldberger & O.D. Duncan, eds, Seminar Press, New York, pp. 85–112.
[16] Kempthorne, O. (1969). An Introduction to Genetic Statistics, Iowa State University Press, Ames.
[17] Keynes, J.M. (1936). The General Theory of Employment, Interest and Money, Macmillan, London.
[18] Koopmans, T.C. (1953). Identification problems in economic model construction, in Studies in Econometric Method, Cowles Commission Monograph 14, W.C. Hood & T.C. Koopmans, eds, Wiley, New York.
[19] McDonald, R.P. (1978). A simple comprehensive model for the analysis of covariance structures, British Journal of Mathematical and Statistical Psychology 31, 59–72.
[20] Mulaik, S.A. (1985). Exploratory statistics and empiricism, Philosophy of Science 52, 410–430.
[21] Mulaik, S.A. (1986). Toward a synthesis of deterministic and probabilistic formulations of causal relations by the functional relation concept, Philosophy of Science 53, 313–332.
[22] Mulaik, S.A. (1987). A brief history of the philosophical foundations of exploratory factor analysis, Multivariate Behavioral Research 22, 267–305.
[23] Niles, H.E. (1922). Correlation, causation and Wright's theory of path coefficients, Genetics 7, 258–273.
[24] Pearl, J. (2000). Causality: Models, Reasoning and Inference, Cambridge University Press, Cambridge.
[25] Pearson, K. (1892). The Grammar of Science, Adam & Charles Black, London.
[26] Simon, H.A. (1952). On the definition of the causal relation, Journal of Philosophy 49, 517–528.
[27] Simon, H.A. (1953). Causal ordering and identifiability, in Studies in Econometric Method, W.C. Hood & T.C. Koopmans, eds, Wiley, New York, pp. 49–74.
[28] Simon, H.A. (1977). Models of Discovery, R. Reidel, Dordrecht, Holland.
[29] Spearman, C. (1904). General intelligence objectively determined and measured, British Journal of Psychology 15, 201–293.
[30] Spearman, C. (1927). The Abilities of Man, MacMillan, New York.
[31] Spirtes, P., Glymour, C. & Scheines, R. (1993). Causation, Prediction and Search, Springer-Verlag, New York.
[32] Theil, H. (1953). Estimation and Simultaneous Correlation in Complete Equation Systems, Central Planbureau (mimeographed), The Hague.
to calibrate the probabilistic rather than the deterministic relationship between two variables. He used scatterplots and noticed how scores on one variable were useful for predicting the scores on another, and developed a measure of the correlation of two sets of scores. His colleague, the biometric statistician Karl Pearson, formalized and extended this work. Using the terms normal curve and standard deviation from the mean, Pearson developed what would become the statistical building blocks for modern psychometrics (e.g., the product-moment correlation (see Pearson Product Moment Correlation), multiple correlation (see Multiple Linear Regression), and biserial correlation (see Point Biserial Correlation)) [8, 13].

By the turn of the twentieth century, James Cattell and a number of American psychologists had developed a more elaborate set of anthropometric measures, including tests of reaction time and sensory acuity. Cattell was reluctant to measure higher mental processes, arguing these were a result of more basic faculties that could be measured more precisely. However, Cattell's tests did not show consistent relationships with the outcomes they were expected to predict, like school grades and later professional achievements. Pearson's colleague and rival Charles Spearman argued this may have been due to the inherent unreliability of the various measures Cattell and others used. Spearman reasoned that any test would inevitably contain measurement error, and any correlation with other equally error-prone tests would underestimate the true correlation. According to Spearman, one way of estimating the measurement error of a particular test was to correlate the results of successive administrations. Spearman provided a calculation that corrected for this attenuation due to accidental error, as did William Brown independently, and both gave proofs they attributed to Yule. Calibrating measurement error in this way proved foundational. Spearman's expression of the correlation of two composite measures in terms of their variance and covariance later became known as the index of reliability [9].

Practical Measures

The first mental testers lacked effective means for assessing the qualities they were interested in. In France in 1905, Binet introduced a scale that provided a different kind of measurement. Binet did not attempt to characterize intellectual processes; instead he assumed that performance on a uniform set of tasks would constitute a basis for a meaningful ranking of school children's ability. Binet thought it necessary to sample complex mental functions, since these most resembled the tasks faced at school and provided for a maximum spread of scores [15].

Binet did not interpret his scale as a measure of innate intelligence; he insisted it was only a screening device for children with special needs. However, Goddard and many other American psychologists thought Binet's test reflected a general factor in intellectual functioning and also assumed this was largely hereditary. Terman revised the Binet test just prior to World War I, paying attention to relevant cultural content and documenting the score profiles of various American age groups of children. But Terman's revision (called the Stanford-Binet) remained an age-referenced scale, with sets of problems or items grouped according to age-appropriate difficulty, yielding an intelligence quotient (IQ) score of mental age/chronological age.

Widespread use of Binet-style tests in the US army during World War I helped streamline the testing process and standardize its procedures. It was the first large-scale deployment of group testing and multiple-choice response formats with standardized tests [6, 16].

Branching Out

In the 1920s, criticism of the interpretation of the Army test data (that the average mental age of soldiers, a large sample of the US population, was below average) drew attention to the problem of appropriate normative samples that gave meaning to test scores. The innovations of the subsequent Wechsler intelligence scales, with test results compared to a representative sample of adult scores, could be seen as a response to the limitations of the age-referenced Binet tests. The interwar period also saw the gradual emergence of the concept of validity, that is, whether the test measured what it was supposed to. Proponents of Binet-style tests wriggled out of the validity question with a tautology: intelligence was what intelligence tests measured. However, this stance was developed more formally as operationism, a stopgap or creative solution (depending on your point of view) to the problem of quantitative ontology. In the mid-1930s, S. S. Stevens argued that
History of Psychometrics 3
the theoretical meaning of a psychological concept could be defined by the operations used to measure it, which usually involved the systematic assignment of number to quality. For many psychologists, the operations necessary to transform a concept into something measurable were taken as producing the concept itself [11, 14, 18].

The practical success of intelligence scales allowed psychologists to extend operationism to various interest, attitude, and personality measures. While pencil-and-paper questionnaires dated back to at least Galton's time, the new branch of testing appearing after World War I took up the standardization and group comparison techniques of intelligence scales. Psychologists took to measuring what were assumed to be dispositional properties that differed from individual to individual not so much in quality but in amount. New tests of personal characteristics contained short question items sampling seemingly relevant content. Questions usually had fixed response formats, with response scores combined to form additive, linear scales. Scale totals were interpreted as a quantitative index of the concept being measured, calibrated through comparisons with the distribution of scores of normative groups. Unlike intelligence scales, responses to interest, attitude, or personality inventory items were not thought of as unambiguously right or wrong, although different response options usually reflected an underlying psychosocial ordering. Ambiguous item content and poor relationships with other measures saw the first generation of personality and interest tests replaced by instruments in which definitions of what was to be measured were largely determined by reference to external criteria. For example, items on the Minnesota Multiphasic Personality Inventory were selected by contrasting the responses of normal and psychiatric subject groups [3, 4].

Grafting on Theoretical Respectability

In the post-World War II era, psychologists subtly modified their operationist approach to measurement. Existing approaches were extended and given theoretical rationalizations. The factor analytic techniques (see Factor Analysis: Exploratory) that Spearman, Thurstone, and others had developed and refined became a mathematical means to derive latent concepts (see Latent Variable) from more directly measured variables [1, 10]. They also played a role in guaranteeing both the validity and reliability of tests, especially in the construction phase. Items could be selected that apparently measured the same underlying variable. Several key personality and attitude scales, such as R.B. Cattell's 16 PF and Eysenck's personality questionnaires, were developed primarily using factor analysis. Thurstone used factor analysis to question the unitary concept of intelligence. New forms of item analyses and scaling (e.g., indices of item difficulty, discrimination, and consistency) also served to guide the construction of reliable and valid tests.

In the mid-1950s, the American Psychological Association stepped in to upgrade all aspects of testing, spelling out the empirical requirements of a good test, as well as extending publishing and distribution regulations. They also introduced the concept of construct validity, the test's conceptual integrity borne out by its theoretically expected relationships with other measures. Stung by damaging social critiques of cultural or social bias in the 1960s, testers further revived the importance of theory in a historically pragmatic field. Representative content coverage and relevant, appropriate predictive criteria all became keystones for fair and valid tests [5, 14].

The implications of Spearman's foundational work were finally formalized by Gulliksen in 1950, who spelt out the assumptions the classical true score model required. The true score model was given a probabilistic interpretation by Lord and Novick in 1968 [17]. More recently, psychometricians have extended item-level analyses to formulate generalized response models. Proponents of item response theory claim it enables the estimation of latent aptitudes or attributes free from the constraints imposed by particular populations and item sets [2, 7].

References

[1] Bartholomew, D.J. (1995). Spearman and the origin and development of factor analysis, British Journal of Mathematical and Statistical Psychology 48, 211–220.
[2] Bock, R.D. (1997). A brief history of item response theory, Educational Measurement: Issues and Practice 16, 21–33.
[3] Buchanan, R.D. (1994). The development of the MMPI, Journal of the History of the Behavioral Sciences 30, 148–161.
[4] Buchanan, R.D. (1997). Ink blots or profile plots: the Rorschach versus the MMPI as the right tool for a science-based profession, Science, Technology and Human Values 21, 168–206.
[5] Buchanan, R.D. (2002). On not Giving Psychology Away: the MMPI and public controversy over testing in the 1960s, History of Psychology 5, 284–309.
[6] Danziger, K. (1990). Constructing the Subject: Historical Origins of Psychological Research, Cambridge University Press, Cambridge.
[7] Embretson, S.E. (1996). The new rules of measurement, Psychological Assessment 8, 341–349.
[8] Gillham, N.W. (2001). A Life of Sir Francis Galton: From African Exploration to the Birth of Eugenics, Oxford University Press, New York.
[9] Levy, P. (1995). Charles Spearman's contributions to test theory, British Journal of Mathematical and Statistical Psychology 48, 221–235.
[10] Lovie, A.D. & Lovie, P. (1993). Charles Spearman, Cyril Burt, and the origins of factor analysis, Journal of the History of the Behavioral Sciences 29, 308–321.
[11] Michell, J. (1999). Measurement in Psychology: A Critical History of a Methodological Concept, Cambridge University Press, Cambridge.
[12] Porter, T.M. (1995). Trust in Numbers: The Pursuit of Objectivity in Science and Public Life, Princeton University Press, Princeton.
[13] Porter, T.M. (2004). The Scientific Life in a Statistical Age, Princeton University Press, Princeton.
[14] Rogers, T.B. (1995). The Psychological Testing Enterprise: An Introduction, Brooks/Cole, Pacific Grove.
[15] Rose, N. (1979). The psychological complex: mental measurement and social administration, Ideology and Consciousness 5, 5–68.
[16] Sokal, M.M. (1987). Psychological Testing and American Society, 1890–1930, Rutgers University Press, New Brunswick.
[17] Traub, R.E. (1997). Classical test theory in historical perspective, Educational Measurement: Issues and Practice 16, 8–14.
[18] Wright, B.D. (1997). A history of social science measurement, Educational Measurement: Issues and Practice 16, 33–52.

(See also Measurement: Overview)

RODERICK D. BUCHANAN AND SUSAN J. FINCH
History of Surveys of Sexual Behavior
BRIAN S. EVERITT
Volume 2, pp. 878–887
techniques disguise the true response yet allow the researcher sufficient data for analysis. The most common of these techniques is the randomized response approach, but there is little evidence of its use in the vast majority of investigations into human sexual behavior.

Surveys of Sexual Behavior

The possibility that women might enjoy sex was not considered by the majority of our Victorian ancestors. The general Victorian view was that women should show no interest in sex and preferably be ignorant of its existence unless married; then they must submit to their husbands without giving any sign of pleasure. A lady was not even supposed to be interested in sex, much less have a sexual response. (A Victorian physician, Dr. Acton, even went as far as to claim "It is a vile aspersion to say that women were ever capable of sexual feelings.") Women were urged to be shy, blushing, and genteel. As Mary Shelley wrote in the early 1800s, "Coarseness is completely out of fashion." (Such attitudes might, partially at least, help explain both the increased interest in pornography amongst Victorian men and the parallel growth in the scale of prostitution.)

But in a remarkable document written in the 1890s by Clelia Mosher, such generalizations about the attitudes of Victorian women to matters sexual are thrown into some doubt, at least for a minority of women. The document, Study of the Physiology and Hygiene of Marriage, opens with the following introduction:

   In 1892, while a student in biology at the University of Wisconsin, I was asked to discuss the marital relation in a Mothers Club composed largely of college women. The discussion was based on replies given by members to a questionnaire.

Mosher probed the sex lives of 45 Victorian women by asking them whether they liked intercourse, how often they had intercourse, and how often they wanted to have intercourse. She compiled approximately 650 pages of spidery handwritten questionnaires but did not have the courage to publish, instead depositing the material in Stanford University Archives. Publication had to await the heroic efforts of James MaHood and his colleagues, who collated and edited the questionnaires, leading in 1980 to their book, The Mosher Survey [9].

Clelia Mosher's study, whilst not satisfactory from a sampling point of view because the results can in no way be generalized (the 45 women interviewed were, after all, mature, married, experienced, largely college-educated American women), remains a primary historical document of premodern sex and marriage in America. The reasons are clearly identified in [9]:

   . . . it contains statements of great rarity directly from Victorian women, whose lips previously had been sealed on the intimate questions of their private lives and cravings. Although one day it may come to light, we know of no other sex survey of Victorian women, in fact no earlier American sex survey of any kind, and certainly no earlier survey conducted by a woman sex researcher.

Two of the most dramatic findings of the Mosher survey are:

- The Victorian women interviewed by Mosher appeared to relish sex, and claimed higher rates of orgasm than those reported in far more recent surveys.
- They practised effective birth-control techniques beyond merely abstinence or withdrawal.

For these experienced, college-educated women at least, the material collected by Mosher produced little evidence of Victorian prudery.

Nearly 40 years on from Mosher's survey, Katharine Davis studied the sex lives of 2200 upper-middle-class married and single women. The results of Davis's survey are described in her book, Factors in The Sex Life of Twenty Two Hundred Women, published in 1929 [2]. Her stated aim was to gather data as to normal experiences of sex on which to base educational programs. Davis considered such normal sexual experiences to be, to a great extent, "scientifically unexplored country". Unfortunately, the manner in which the eponymous women were selected for her study probably meant that these experiences were to remain so for some time to come.

Initially, a letter asking for cooperation was sent to 10000 married women in all parts of the United States. Half of the addresses were furnished by a large national organization (not identified by Davis). Recipients were asked to submit names of "normal married women", that is, women of "good standing in the community, with no known physical, mental, or moral handicap, of sufficient
intelligence and education to understand and answer in writing a rather exhaustive set of questions as to sex experience. (The questionnaire was eight pages long.)
Another 5000 names were selected from published membership lists of clubs belonging to the General Federation of Women's Clubs, or from the alumnae registers of women's colleges and coeducational universities.
In each letter was a return card and envelope. The women were asked to indicate on the card whether they would cooperate by filling out the questionnaire, which was sent only to women requesting it. This led to returned questionnaires from 1000 married women.
The unmarried women in the study were those five years out of a college education; again 10 000 such women were sent a letter asking whether or not they would be willing to fill out, in their case, a 12-page questionnaire. This resulted in the remaining 1200 women in the study.
Every aspect of the selection of the 2200 women in Dr Davis's study is open to statistical criticism. The respondents were an unrepresentative sample of volunteers who were educationally far above average, and only about 10% of those contacted ever returned a questionnaire. The results are certainly not generalizable to any recognisable population of more universal interest. But despite its flaws, a number of the charts and tables in the report retain a degree of fascination. Part of the questionnaire, for example, dealt with the use of methods of contraception. At the time, contraceptive information was categorized as obscene literature under federal law. Despite this, 730 of the 1000 married women who filled out questionnaires had used some form of contraceptive measure. Where did they receive their advice about these measures? Davis's report gives the sources shown in Table 1.

Table 1 Sources of information about contraceptive measures (from [2])

Physicians 370
Married women friends 174
Husband 139
Mother 42
Friend of husband 39
Books 33
Birth-control circulars 31
Common knowledge 27
Nurse 15
Medical studies 9
Various 8
Drug-store man 6
The Bible 2
A servant 1
A psychoanalyst 1

Davis, along with most organizers of sex surveys, also compiled figures on frequency of sex; these are shown in Table 2.

Table 2 Frequency of intercourse of married women (from [2])

Answer                                      Number   Percent
More than once a day                          19       2.0
Once a day                                    71       7.6
Over twice, less than seven times a week     305      31.3
Once or twice a week                         391      40.0
One to three times a month                   125      12.8
Often or frequently                           22       2.4
Seldom or infrequently                        38       3.9
Total answers to frequency questions         971     100
None in early years                            8
Unanswered (No answer)                        21
Total group                                 1000

Davis's rationale for compiling the figures in Table 2 was to investigate the frequency of intercourse as a possible factor in sterility, and for this purpose she breaks down the results in a number of ways. She found no evidence to suggest a relationship between marked frequency of intercourse and sterility; indeed, she suggests that her results indicate the reverse.

From a methodological point of view, one of the most interesting aspects of the Davis report is her attempt to compare the answers of women who responded by both interview and questionnaire. Only a relatively small number of women (50) participated in this comparison, but in general there was a considerably higher incidence of sex practices reported on the questionnaire. Davis makes the following argument as to why she considers the questionnaire results to be more likely to be closer to the truth:

In the evolutionary process civilization, for its own protection, has had to build up certain restraints on sexual instincts which, for the most part, have been in sense of shame, especially for sex outside of the legal sanction of marriage. Since sex practices prior to marriage have not the general approval of society, and since the desire for social approval is one of the
fundamental motives in human behavior, admitting such a practice constitutes a detrimental confession on the part of the individual and is more likely to be true than a denial of it. In other words, the group admitting the larger number of sex practices is assumed to contain the greater number of honest replies [2].

The argument is not wholly convincing, and would certainly not be one that could be made about the respondents in contemporary surveys of sexual behavior.

Perhaps the most famous sex survey ever conducted was the one by Kinsey and his colleagues in the 1940s. Alfred Charles Kinsey was undoubtedly the most famous American student of human sexual behavior in the first half of the twentieth century. He was born in 1894 and had a strict Methodist upbringing. Originally a biologist who studied Cynipidae (gall wasps), Kinsey was a professor of zoology who never thought to study human sexuality until 1938, when he was asked to teach the sexuality section of a course on marriage. In preparing his lectures, he discovered that there was almost no information on the subject. Initially, and without assistance, he gathered sex histories on weekend field trips to nearby cities. Gradually this work involved a number of research assistants and was supported by grants from Indiana University and the Rockefeller Foundation.

Until Kinsey's work (and despite the earlier investigations of people like Mosher and Davis) most of what was known about human sexual behavior was based on what biologists knew about animal sex, what anthropologists knew about sex among natives in non-Western, nonindustrialized societies, or what Freud and others learnt about sexuality from emotionally disturbed patients. Kinsey and his colleagues were the first psychological researchers to interview volunteers in depth about their sexual behavior. The research was often hampered by political investigations and threats of legal action. But in spite of such harassment, the first Kinsey report, Sexual Behavior in the Human Male, appeared in 1948 [7], and the second, Sexual Behavior in the Human Female, in 1953 [8]. It is no exaggeration to say that both caused a sensation and had massive impact. Sexual Behavior in the Human Male quickly became a bestseller, despite its apparent drawbacks of stacks of tables, graphs, bibliography, and a scholarly text that it is kind to label as merely monotonous. The report certainly does not make for lively reading. Nevertheless, six months after its publication it still held second place on the list of nonfiction bestsellers in the USA.

The first report proved of interest not only to the general public, but to psychiatrists, clergymen, lawyers, anthropologists, and even home economists. Reaction to it ranged all the way from extremely favourable to extremely unfavourable; here are some examples of both:

'The Kinsey Report has done for sex what Columbus did for geography',
'. . . a revolutionary scientific classic, ranking with such pioneer books as Darwin's Origin of the Species, Freud's and Copernicus' original works',
'. . . it is an assault on the family as the basic unit of society, a negation of moral law, and a celebration of licentiousness',
'there should be a law against doing research dealing exclusively with sex'.

What made the first Kinsey report the talk of every town in the USA lies largely in the following summary of its main findings:

Of American males,
86% have premarital intercourse by the age of 30,
37%, at some time in their lives, engaged in homosexual activity climaxed by orgasm,
70% have, at some time, intercourse with prostitutes,
97% engage in forms of sexual activity, at some time in their lives, that are punishable as crimes under the law,
of American married males, 40% have been involved in extramarital relations,
of American farm boys, 16% have sexual contacts with animals.

These figures shocked because they suggested that there was much more sex, and much more variety of sexual behavior, amongst American men than was suspected.

But we need to take only a brief look at some of the details of Kinsey's study to see that the figures above and the many others given in the report hardly stand up to statistical scrutiny. Although well aware of the scientific principles of sampling, Kinsey based all his tables, charts, and so on, on a total of 5300 interviews with volunteers. He knew that the ideal situation would have
been to select people at random, but he did not think it possible to coax a randomly selected group of American males to answer truthfully when asked deeply personal questions about their sex lives. Kinsey sought volunteers from a diversity of sources so that all types would be sampled. The work was, for example, carried on in every state of the Union, and individuals from various educational groups were interviewed. But the diversification was rather haphazard and the proportion of respondents in each cell did not reflect the United States population data. So the study begins with the disadvantage of volunteers and without a representative sample in any sense. The potential for introducing bias seems to loom large since, for example, those who volunteer to take part in a sex survey might very well have different behavior, different experiences, and different attitudes towards sex than the general population. In fact, recent studies show that people who volunteer to take part in surveys about sexual behavior are likely to be more sexually experienced and also more interested in sexual variety than those who do not volunteer.

A number of procedures were used by Kinsey to obtain interviews and to reduce refusals. Contacts were made through organizations and institutions that in turn persuaded their members to volunteer. In addition, public appeals were made and often one respondent would recommend another. Occasionally, payments were given as incentives. The investigators attempted to get an unbiased selection by seeking all kinds of histories and by long attempts to persuade those who were initially hostile to come into the sample. In a two-hour interview, Kinsey's investigators covered from 300 to 500 items about the respondent's sexual history, but no sample questionnaire is provided in the published report. The definition of each item in the survey was standard, but the wording of the questions and the order in which they were given were varied for each respondent. In many instances leading questions were asked, such as 'When did you last. . .' or 'When was the first time you. . .', thereby placing the onus of denial on the respondent. The use of leading questions is generally thought to lead to the overreporting of an activity. Kinsey's aim was to provide the ideal setting for each individual interview whilst retaining an equivalence in the interviews administered to all respondents. So the objective conditions of the interview were not uniform, and variation in sexual behavior between individuals might be confounded with differences in question wording and order.

The interview data in the Kinsey survey were recorded in the respondent's presence by a system of coding that was consigned to memory by all six interviewers during the year-long training that preceded data collection. Coding in the field has several advantages, such as speed and the possibility of clarifying ambiguous answers; memory was used in preference to a written version of the code to preserve the confidence of the interviewee. But the usual code ranged from six to twenty categories for each of the maximum of 521 items that could be covered in the interview, so prodigious feats of memory were called for. One can only marvel at the feat. Unfortunately, although field coding was continually checked, no specific data on the reliability of coding are presented, and there has to be some suspicion that occasionally, at least, the interviewer made coding mistakes.

Memory certainly also played a role in the accuracy of respondents' answers to questions about events which might have happened long ago. It's difficult to believe, for example, that many people can remember details of frequency of orgasm per week, per five-year period, but this is how these frequencies are presented. Many of the interviews in the first Kinsey report were obtained through the cooperation of key individuals in a community who recommended friends and acquaintances, and through the process of developing a real friendship with the prospective respondent before starting the interview, as the following quotation from the report indicates:

We go with them to dinner, to concerts, to nightclubs, to the theatre, we become acquainted with them at community dances and in poolrooms and taverns, and in other places which they frequent. They in turn invite us to meet friends in their homes, at teas, at dinners, at other social events [7, p. 40].

This all sounds very pleasant both for the respondents and the interviewers, but is it good survey research practice? Probably not, since experience suggests that the sociological stranger gets the more accurate information in a sensitive survey, because the respondent is wary about revealing his most private behavior to a friend or acquaintance. And assuming that all the interviewers were white males, the question arises as to how this affected interviews with, say, African-American respondents (and in the second report, with women)?
Finally there are some more direct statistical criticisms that can be levelled at the first Kinsey report. There is, for example, often a peculiar variation in the number of cases in a given cell, from table to table. A particular group will be reported on one type of sexual behavior, and this same group may be of slightly different size in another table. The most likely explanation is that the differences are due to loss of information through 'Don't know' responses or omissions of various items, but the discrepancies are left unexplained in the report. And Kinsey seems shaky on the definition of terms such as median, although this statistic is often used to summarize findings. Likewise he uses the sample range as a measure of how much particular measurements varied amongst his respondents, rather than the preferable standard deviation statistic.

Kinsey addressed the possibility of bias in his study of male sexual behavior and somewhat surprisingly suggested that any lack of validity in the reports he obtained would be in the direction of concealment or understatement. Kinsey gives little credence to the possibility of overstatement:

Cover-up is more easily accomplished than exaggeration in giving a history [7, p. 54].

Kinsey thought that the interview approach provided considerable protection against exaggeration but not so much against understatement. But given all the points made earlier this claim is not convincing, and it is not borne out by later, better-designed studies, which generally report lower levels of sexual activity than Kinsey. For example, the Sex in America survey [10] was based on a representative sample of Americans and it showed that individuals were more monogamous and more sexually conservative than had been reported previously.

Kinsey concludes his first report with the following:

We have performed our function when we have published the record of what we have found the human male doing sexually, as far as we have been able to ascertain the facts.

Unfortunately, the facts arrived at by Kinsey and his colleagues may have been distorted in a variety of ways because of the many flaws in the study. But despite the many methodological errors, Kinsey's studies remain gallant attempts to survey the approximate range and norms of sexual behavior.

The Kinsey report did have the very positive effect of encouraging others to take up the challenge of investigating human sexual behavior in a scientific and objective manner. In the United Kingdom, for example, an organization known as Mass-Observation carried out a sex survey in 1949 that was directly inspired by Kinsey's first study. In fact it became generally known as 'Little Kinsey' [3]. Composed of three related surveys, Little Kinsey was actually very different methodologically from its American predecessor. The three components of the study were as follows:

1. A street sample survey of over 2000 people selected by random sampling methods carried out in a wide cross section of cities, towns and villages in Britain.
2. A postal survey of about 1000 each of three groups of opinion leaders: clergymen, teachers, and doctors.
3. A set of interrelated questions sent to members of Mass-Observation's National Panel, which produced responses from around 450 members.

The report's author, Tom Harrison, was eager to get to the human content lying behind the line-up of percentages and numbers central to the Kinsey report proper, and he suggested that the Mass-Observation study was both something less and something more than Kinsey. It tapped into more of the actuality, the real life, the personal stuff of the problem. He tried to achieve these aims by including in each chapter some very basic tables of responses, along with large numbers of comments from respondents to particular questions. Unfortunately this idiosyncratic approach meant that the study largely failed to have any lasting impact, although later authors, for example, Liz Stanley in Sex Surveyed 1949–1994 [11], claim it was of pioneering importance and was remarkable for pinpointing areas of behavioral and attitudinal change. It does appear to be one of the earliest surveys of sex that used random sampling. Here are some of the figures and comments from Chapter 7 of the report, 'Sex Outside Marriage'.

The percentages who disapproved of extramarital relations were

24% on the National Panel,
63% of the street sample,
65% of doctors,
75% of teachers,
3. Manual clitoral stimulation by partner,
4. Oral stimulation by a partner,
5. Intercourse plus manual clitoral stimulation,
6. Never have orgasms.

Also indicate above how many orgasms you usually have during each activity, and how long you usually take.
Please give a graphic description of how your body could best be stimulated to orgasm.

Hite's questionnaire began with items about orgasm, and much of her book dwells on her interpretation of the results from these items. She concludes that women can reach orgasm easily through masturbation but far less easily, if at all, through intercourse with their male partners. Indeed, one of her main messages is that intercourse is less satisfying to women than masturbation. She goes on to blame what she sees as the sorry state of female sexual pleasure in patriarchal societies, such as the United States, that glorify intercourse. Critics pointed out that there may be something in all of this, but that Hite was being less than honest to suppose that her views were an inescapable conclusion from the results of her survey. As the historian Linda Gordon pointed out [5], the Hite report was orientated towards young, attractive, autonomous career women, who were focused on pleasure and unencumbered by children. These women could purchase vibrators, read the text, and undergo the self-improvement necessary for one-person sexual bliss.

The Hite report has severe methodological flaws, and these are compounded by the suspicion that its writer is hardly objective about the issues under investigation. The numbers are neither likely to have accurately reflected the facts, nor to have been value-free.

(It is not, of course, feminist theory that is at fault in the Hite report, as the comprehensive study of sex survey research given in [4] demonstrates; these two authors combine feminist theory with a critical analysis of survey research to produce a well-balanced and informative account.)

If the Hite Report was largely a flash in the media pan ('Sheer Hype' perhaps?), the survey on sexual attitudes and lifestyles undertaken in the UK in the late 1980s and early 1990s by Kaye Wellings and her coworkers [12] acts as a model of excellence for survey research in such a sensitive area. The impetus for the survey was the emergence of the HIV pandemic, and the attendant effort to assess and control its spread. The emergence in the 1980s of a lethal epidemic of sexually transmitted infection focused attention on the profound ignorance that still remained about many aspects of sexual behavior, despite Kinsey and others. The collaboration of epidemiologists, statisticians, and survey researchers produced a plan and a survey about sex in which all the many problems with such surveys identified earlier were largely overcome.

A feasibility study assessed the acceptability of the survey, the extent to which it would produce valid and reliable results, and the sample size needed to produce statistically acceptable accuracy in estimates of minority behavior. The results of the feasibility study guided the design of the final questionnaire that was used in obtaining results from a carefully selected random sample of individuals representative of the general population. Of the 20 000 planned interviews, 18 876 were completed. Nonresponse rates were generally low. The results provided by the survey give a convincing account of sexual lifestyle in Britain at the end of the twentieth century. For interest, one of the tables from the survey is reproduced in Table 3.

The impact of AIDS has also been responsible for an increasing number of surveys about sexual behavior in the developing world, particularly in parts of Africa. A comprehensive account of such surveys is given in [1].

Summary

Since 1892, when a biology student, Clelia Mosher, questioned 45 upper middle-class married Victorian women about their sex lives, survey researchers have asked thousands of people about their sexual behavior. According to Julia Ericksen [4] in Kiss and Tell, 'Sexual behavior is a volatile and sensitive topic, and surveys designed to reveal it have great power and great limits.' Their power has been to help change, radically change in particular aspects, attitudes about sex compared to 50 years ago. Their limits have often been their methodological flaws. And, of course, when it comes to finding out about their sexual behavior, people may not want to tell, and even if they agree to be interviewed they may not be entirely truthful. But despite these caveats, the information from many of the surveys of human sexual behavior has probably helped remove the conspiracy of silence about sex that existed in society, which condemned
many men and women to a miserable and unfulfilling sex life. The results have challenged views of the past 100 years that sex was not central to a happy marriage and that sex, as a pleasure for its own sake, debased the marital relationship. Sex as pleasure is no longer regarded by most people as a danger likely to overwhelm the supposedly more spiritual bond between a man and a woman thought by some to be achieved when sex occurs solely for the purposes of reproduction. Overall, the information about human sexual behavior gathered from sex surveys has helped to promote, albeit in a modest way, a healthier attitude toward sexual matters and perhaps a more enjoyable sex life for many people.

Table 3 Number and percent of respondents taking part in different sexual practices in the last year and ever, by social class (from [12])
Vaginal intercourse    Oral sex

References

[1] Cleland, J. & Ferry, B. (1995). Sexual Behavior in the Developing World, Taylor & Francis, Bristol.
[2] Davis, K. (1929). Factors in the Sex Life of Twenty-Two Hundred Women, Harper and Brothers, New York.
[3] England, L.R. (1949). Little Kinsey: an outline of sex attitudes in Britain, Public Opinion 13, 587–600.
[4] Ericksen, J.A. & Steffen, S.A. (1999). Kiss and Tell: Surveying Sex in the Twentieth Century, Harvard University Press, Cambridge.
[5] Gordon, L. (2002). The Moral Property of Women: A History of Birth Control Politics in America, 3rd Edition, University of Illinois Press, Champaign.
[6] Hite, S. (1976). The Hite Report: A Nationwide Study on Female Sexuality, Macmillan, New York.
[7] Kinsey, A.C., Pomeroy, W.B. & Martin, C.E. (1948). Sexual Behavior in the Human Male, Saunders, Philadelphia.
[8] Kinsey, A.C., Pomeroy, W.B., Martin, C.E. & Gebhard, P.H. (1953). Sexual Behavior in the Human Female, Saunders, Philadelphia.
[9] MaHood, J. & Wenburg, K. (1980). The Mosher Survey, C.D. Mosher, ed., Arno Press, New York.
[10] Michael, R.T., Gagnon, J., Laumann, E.O. & Kolata, G. (1994). Sex in America: A Definitive Survey, Little, Brown, Boston.
[11] Stanley, L. (1995). Sex Surveyed 1949–1994, Taylor & Francis, London.
[12] Wellings, K., Field, J., Johnson, A. & Wadsworth, J. (1994). Sexual Behavior in Britain, Penguin Books, London.

BRIAN S. EVERITT
Hodges–Lehmann Estimator
CLIFFORD E. LUNNEBORG
Volume 2, pp. 887–888
in
ISBN-13: 978-0-470-86080-9
ISBN-10: 0-470-86080-4
Editors
proximity data (see Proximity Measures). A classic example is the recreation of a map from a matrix of, say, intercity road distances in a country. Often, such a map can be successfully recreated if only the ranking of the distances is given (see [2]). With such data, the underlying structure is essentially two-dimensional, and so can be represented with little distortion in a two-dimensional scaling solution. But when the observed data have a one-dimensional structure, for example, in a chronological study, representing the observed proximities in a two-dimensional scaling solution often gives rise to what is commonly referred to as the horseshoe effect. This effect appears to have been first identified in [2] and can be illustrated by the following example.

Consider 51 objects, O1, O2, . . . , O51, assumed to be arranged along a straight line with the jth object being located at the point with coordinate j. Define the similarity, sij, between object i and object j, as follows:

        9 if i = j,
        8 if 1 ≤ |i − j| ≤ 3,
sij =   7 if 4 ≤ |i − j| ≤ 6,
        . . .                              (1)
        1 if 22 ≤ |i − j| ≤ 24,
        0 if |i − j| ≥ 25.

Next, convert these similarities into dissimilarities, δij, using δij = (sii + sjj − 2sij)^(1/2), and then apply classical multidimensional scaling (see Multidimensional Scaling) to the resulting dissimilarity matrix. The two-dimensional solution is shown in Figure 1. The original order has been reconstructed very well, but the plot shows the characteristic horseshoe shape, which is a consequence of the blurring of the large distances and is characteristic of such situations. Further discussion of the horseshoe effect is given in [3] and some examples of its appearance in practice are described in [1].

Figure 1 An example of the horseshoe effect

References

[1] Everitt, B.S. & Rabe-Hesketh, S. (1997). The Analysis of Proximity Data, Arnold, London.
[2] Kendall, D.G. (1971). A mathematical approach to seriation, Philosophical Transactions of the Royal Society of London A269, 125–135.
[3] Podani, J. & Miklos, I. (2002). Resemblance coefficients and the horseshoe effect in principal coordinates analysis, Ecology 83, 3331–3343.

BRIAN S. EVERITT
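The construction just described is easy to reproduce numerically. The following is a minimal Python sketch (it assumes NumPy is available, and implements Torgerson's classical scaling directly rather than relying on any particular MDS package):

```python
import numpy as np

n = 51
# Similarity matrix from equation (1): 9 on the diagonal, then bands of
# three in |i - j| with similarity 8, 7, ..., 1, and 0 once |i - j| >= 25.
d = np.abs(np.subtract.outer(np.arange(n), np.arange(n)))
s = np.where(d == 0, 9, np.clip(8 - (d - 1) // 3, 0, 8))

# Dissimilarities: delta_ij = (s_ii + s_jj - 2 s_ij)^(1/2).
delta = np.sqrt(np.diag(s)[:, None] + np.diag(s)[None, :] - 2 * s)

# Classical (Torgerson) scaling: double-center the squared dissimilarities
# and use the two largest eigenvalue/eigenvector pairs as coordinates.
J = np.eye(n) - np.ones((n, n)) / n
B = -0.5 * J @ (delta ** 2) @ J
eigvals, eigvecs = np.linalg.eigh(B)
top = np.argsort(eigvals)[::-1][:2]
coords = eigvecs[:, top] * np.sqrt(eigvals[top])
```

Plotting `coords` (with matplotlib, for instance) bends the 51 collinear objects into the horseshoe of Figure 1, even though their true configuration is one-dimensional.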
Hotelling, Howard
SCOTT L. HERSHBERGER
Volume 2, pp. 889–891
in
ISBN-13: 978-0-470-86080-9
ISBN-10: 0-470-86080-4
Editors
The structure of the models was also essentially deterministic in that the only random variable was error, that is, behavioral oscillation, which was added to the fixed effects of reinforcement, stimulus generalization, and so on, in order to account for the results.

The last years of Hull's life were dogged by increasing ill health, and he was only able to publish the updated and expanded version of the Principles, A Behavior System [2], in 1962, the year of his death. Hull fought against considerable personal and bureaucratic odds to demonstrate that learning, the most self-consciously scientific area of all experimental psychology, could be moved that much closer to respectability through the application of mathematics to data.

References

[1] Hull, C.L. (1943). Principles of Behavior, Appleton-Century, New York.
[2] Hull, C.L. (1962). A Behavior System, Yale University Press, New Haven.
[3] Miller, G.A. (1964). Mathematics and Psychology, Wiley, New York.

SANDY LOVIE
Identification
DAVID KAPLAN
Volume 2, pp. 893–896
Definition of Identification

We begin with a definition of identification from the perspective of covariance structure modeling. The advantage of this perspective is that covariance structure modeling includes, as a special case, the simple linear regression model, and, therefore, we can understand the role of identification even in this simple case. First, arrange the unknown parameters of the model in the vector θ. Consider next a population covariance matrix Σ whose elements are population variances and covariances (see Correlation and Covariance Matrices). It is assumed that a substantive model can be specified to explain the variances and covariances contained in Σ. Such a substantive model can be as simple as a two-variable linear regression model or as complicated as a simultaneous equation model. We know that the variances and covariances contained in Σ can be estimated by their sample counterparts in the sample covariance matrix S using straightforward formulae for the calculation of sample variances and covariances. Thus, the parameters in Σ are identified.

Having established that the elements in Σ are identified from their sample counterparts, what we need to establish in order to permit estimation of the model parameters is whether the model parameters can be solved in terms of the variances and covariances contained in Σ.

y = Γx + ζ.   (2)

When the vector of endogenous variables contains only one column (i.e., only one explanatory variable is considered), then we have the case of simple linear regression.

To begin, we note that there exists an initial set of restrictions that must be imposed even for simple regression models. The first restriction, referred to as normalization, requires that we set the diagonal elements of B to zero, such that an endogenous variable cannot have a direct effect on itself.

The second requirement concerns the vector of disturbance terms ζ. Note that the disturbances for each equation are unobserved, and, hence, have no inherent metric. The most common way to set the metric of ζ, and the one used in simple regression modeling, is to fix the coefficient relating the endogenous variables to the disturbance terms to 1.0. An inspection of (2) reveals that ζ is actually being multiplied by the scaling factor 1.0. Thus, the disturbance terms are in the same scale as their relevant endogenous variables.

With the normalization rule in place and the metric of ζ fixed, we can now discuss some common rules for the identification of simultaneous equation model parameters. Recall again that we wish to know whether the variances and covariances of the
exogenous variables (contained in Φ), the variances and covariances of the disturbance terms (contained in Ψ), and the regression coefficients (contained in B and Γ) can be solved in terms of the variances and covariances contained in Σ.

Two classical approaches to identification can be distinguished in terms of whether identification is evaluated on the model as a whole, or whether identification is evaluated on each equation comprising the system of simultaneous equations. The former approach is generally associated with social science applications of simultaneous equation modeling, while the latter approach appears to be favored in the econometrics field, applied mainly to simultaneous (i.e., nonrecursive) models. Nevertheless, they both provide a consistent picture of identification in that if any equation is not identified, the model as a whole is not identified.

The first, and perhaps simplest, method for ascertaining the identification of the model parameters is referred to as the counting rule. Let s = p + q be the total number of endogenous and exogenous variables, respectively. Then the number of nonredundant elements in Σ is equal to 1/2 s(s + 1). Let t be the total number of parameters in the model that are to be estimated (i.e., the free parameters). Then, the counting rule states that a necessary condition for identification is that t ≤ 1/2 s(s + 1). If the equality holds, then we say that the model may be just identified. If t is strictly less than 1/2 s(s + 1), then we say that the model may be overidentified. If t is greater than 1/2 s(s + 1), then the model may be not identified.

Clearly, the advantage to the counting rule is its simplicity. However, the counting rule is a necessary but not sufficient rule. We can, however, provide rules for identification that are sufficient, but that pertain only to recursive models, or special cases of recursive models. Specifically, a sufficient condition for identification is that B is triangular and that Ψ is a diagonal matrix. However, this is the same as saying that recursive models are identified. Indeed, this is the case, and [1] refers to this rule as the recursive rule of identification. In combination with the counting rule above, recursive models can be either just identified or overidentified.

A special case of the recursive rule concerns the situation where B = 0 and again Ψ a diagonal matrix. Under this condition, the model in (1) reduces to the model in (2). Here too, we can utilize the counting rule to show that regression models are also just identified.

Note that recursive models place restrictions on the form of B and Ψ and that the identification conditions stated above are directly related to these types of restrictions. Nonrecursive models, however, do not restrict B and Ψ in the same way. Thus, we need to consider identification rules that are relevant to nonrecursive models.

As noted above, the approach to identification arising out of econometrics (see [2]) considers one equation at a time. The concern is whether a true simultaneous equation can be distinguished from a false one formed by a linear combination of the other equations in the model (see [3]). In complex systems of equations, trying to determine linear combinations of equations is a tedious process. One approach would be to evaluate the rank of a given matrix, because if a given matrix is not of full rank, then it means that there exist columns (or rows) that are linear combinations of each other. This leads to developing a rank condition for identification.

To motivate the rank and order conditions, consider the simultaneous equation model represented in path analytic form shown in Figure 1. Let p be the number of endogenous variables and let q be the number of exogenous variables. We can write this model as

( y1 )   (  0    β12 ) ( y1 )   ( γ11  γ12   0  ) ( x1 )   ( ζ1 )
(    ) = (           ) (    ) + (               ) ( x2 ) + (    ).   (3)
( y2 )   ( β21    0  ) ( y2 )   (  0    0   γ23 ) ( x3 )   ( ζ2 )

In this example, p = 2 and q = 3.

Figure 1 Prototype nonrecursive path model

As a useful device for assessing the rank and order condition, we can
Identification 3
arrange the structural coefficients in a partitioned A corollary of the rank condition is referred to as
matrix A of dimension p s as the order condition. The order condition states that
the number of variables (exogenous and endogenous)
A = [(I B)| ], excluded (restricted) from any of the equations in
the model must be at least p 1 [2]. Despite the
1 12 11 12 0
= , (4) simplicity of the order condition, it is only a necessary
21 1 0 0 23
condition for the identification of an equation of the
where s = p + q. Note that the zeros placed in (4) model. Thus, the order condition guarantees that there
represent paths that have been excluded (restricted) is a solution to the equation, but it does not guarantee
from the model based on a priori model specification. that the solution is unique. A unique solution is
We can represent the restrictions in the first equation guaranteed by the rank condition.
of A, say A1 , as A1 1 = 0, where 1 is a column As an example of the order condition, we observe
vector whose hth element (h = 1, . . . , s) is unity and that the first equation has one restriction and the
the remaining elements are zero. Thus, 1 selects the second equation has two restrictions as required
particular element of A1 for restriction. A similar by the condition that the number of restrictions
equality can be formed for A2 , the second equation in must be as least p 1 (here, equal to one). It
the system. The rank condition states that a necessary may be of interest to modify the model slightly
and sufficient condition for the identifiability of the to demonstrate how the first equation of the model
first equation is that the rank of A1 must be at least would not be identified according to the order con-
equal to p 1. Similarly for the second equation. dition. Referring to Figure 1, imagine a path from
x3 to y1 . Then the zero in the first row of A would
The proof of the rank condition is given in [2]. If
be replaced by 13 . Using the simple approach
the rank is less than p 1, then the parameters of
for determining the order condition, we find that
the equation are not identified. If the rank is exactly
there are no restrictions in the first equation and,
equal to p 1, then the parameters of the equation
therefore, the first equation is not identified. Simi-
in question are just identified. If the rank is greater
larly, the first equation fails the rank condition of
than p 1, then the parameters of the equation are
identification.
overidentified.
This chapter considered identification for recursive
The rank condition can be easily implemented
and nonrecursive simultaneous equation models. A
as follows. Delete the columns containing nonzero
much more detailed exposition of identification can
elements in the row corresponding to the equation
be found in [2]. It should be pointed out that the
of interest. Next, check the rank of the resulting
above discussion of identification is model-specific
submatrix. If the rank is p 1, then the equa-
and the data play no role. Problems of identification
tion is identified. To take the above example, con-
can arise from specific aspects of the data. This
sider the identification status of the first equation.
is referred to as empirical identification and the
Recall that for this example, p 1 = 1. According
problem is most closely associated with issues of
to the procedure just described, the resulting subma-
colinearity.
trix is Briefly, consider a simple linear regression model
0
.
23 y = 1 x1 + 2 x2 + . (5)
With the first row zero, the rank of this matrix is one, If x1 and x2 were perfectly collinear, then x1 = x2 =
and, hence, the first equation is identified. Consider- x, and equation (5) can be rewritten as
ing the second equation, the resulting submatrix is
y = 1 x + 2 x + ,
11 12 = (1 + 2 )x + . (6)
.
0 0
It can be seen from application of the counting rule
Again, because of the zeros in the second row, the that (5) is identified, whereas (6) is not. Therefore,
rank of this submatrix is 1 and we conclude that the the problem of colinearity can induce empirical
second equation is identified. nonidentification.
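The column-deletion procedure for the rank condition lends itself to a few lines of code. The sketch below (an illustration, not part of the original entry) uses NumPy, with arbitrary numeric stand-ins for the free parameters β and γ:

```python
import numpy as np

def equation_identified(A, row, p):
    """Rank condition via the column-deletion procedure: delete the columns
    with nonzero entries in the row for the equation of interest, then
    require the remaining submatrix to have rank >= p - 1."""
    sub = np.delete(A, np.nonzero(A[row])[0], axis=1)
    if sub.shape[1] == 0:          # no restrictions at all in this equation
        return 0 >= p - 1
    return bool(np.linalg.matrix_rank(sub) >= p - 1)

# A = [(I - B) | -Gamma] for the model in (3)-(4); the numeric values are
# arbitrary stand-ins for the free parameters.
A = np.array([[ 1.0, -0.5, -0.3, -0.7,  0.0],
              [-0.4,  1.0,  0.0,  0.0, -0.2]])
p = 2
print(equation_identified(A, 0, p), equation_identified(A, 1, p))  # True True

# Adding a path from x3 to y1 (so gamma13 != 0) removes the only
# restriction in the first equation, which then fails the condition.
A_mod = A.copy()
A_mod[0, 4] = -0.6
print(equation_identified(A_mod, 0, p))  # False
```

The helper mirrors the text exactly: for the first equation the surviving submatrix is the single column [0, −γ23]ᵀ, whose rank of one equals p − 1.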
to the Elevated Plus Maze). Omega-squared is an estimate of the dependent variance accounted for by the independent variable in the population for a fixed effects model, and so is a measure of the importance of that treatment effect in relation to all effects in the model:

    ω² = (SS_effect − (DF_effect)(MS_error)) / (MS_error + SS_total)   (1)

where SS is the Sum of Squares, DF is the degrees of freedom, and MS is the Mean Square (SS/DF).

The multiple R-squared (R²) (see Multiple Linear Regression) describes the proportion of all variance accounted for by the corrected model. It is calculated as the sum of squares for the fitted model divided by the total sum of squares.

The researchers concluded that, while there were significant interactions between laboratories and genotypes for the observed effects, the magnitude of the interactions depended upon the measurements in question. The results suggested that test standardization alone is unlikely to completely overcome the influences of different laboratory environments. Most of the larger differences between inbred strains were able to be successfully replicated across the labs in the study, though strain differences of moderate effect size were less likely to be resolved.

References

[1] Festing, F.W. (1979). Inbred Strains in Biomedical Research, Oxford University Press, New York.
[2] Wade, C.M., Kulbokas, E.J., Kirby, A.W., Zody, M.C., Mullikin, J.C., Lander, E.S., Lindblad-Toh, K. & Daly, M.J. (2002). The mosaic structure of variation in the laboratory mouse genome, Nature 420(6915), 574–578.
[3] Wahlsten, D., Metten, P., Phillips, T.J., Boehm, S.L. II, Burkhart-Kasch, S., Dorow, J., Doerksen, S., Downing, C., Fogarty, J., Rodd-Henricks, K., Hen, R., McKinnon, C.S., Merrill, C.M., Nolte, C., Schalomon, M., Schlumbohm, J.P., Sibert, J.R., Wenger, C.D., Dudek, B.C. & Crabbe, J.C. (2003). Different data from different labs: lessons from studies of gene-environment interaction, Journal of Neurobiology 54(1), 283–311.

(See also Selection Study (Mouse Genetics))

CLAIRE WADE
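Equation (1) is straightforward to script. The sketch below is an illustration (not from the entry), and the ANOVA summary values it uses are made up:

```python
def omega_squared(ss_effect, df_effect, ms_error, ss_total):
    """Omega-squared for a fixed effects design, following equation (1):
    (SS_effect - DF_effect * MS_error) / (MS_error + SS_total)."""
    return (ss_effect - df_effect * ms_error) / (ms_error + ss_total)

# Hypothetical ANOVA summary: SS_effect = 100 on 2 df,
# MS_error = 5, SS_total = 500.
print(round(omega_squared(100, 2, 5, 500), 3))  # 0.178
```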
Incidence
HANS GRIETENS
Volume 2, pp. 898–899
One strategy that is often used inadvisably is to add a small constant, such as 1/2, to cell counts so that previously empty cells are no longer empty. The problem with this strategy is that it tends to increase the apparent equality of the cell frequencies, resulting in a loss of power for finding significant effects. If the strategy of adding a constant is adopted, an extremely small constant should be used, much smaller than 1/2. Agresti [1] recommends a constant on the order of 10⁻⁸. He also recommends conducting a sensitivity analysis in which the analysis is repeated using different constants, in order to evaluate the relative effects of the constants on parameter estimation and model testing.

References

[1] Agresti, A. (2002). Categorical Data Analysis, 2nd Edition, John Wiley & Sons, New York.
[2] Berry, K.J. & Mielke, P.W. (1986). R by C chi-square analyses with small expected cell frequencies, Educational & Psychological Measurement 46, 169–173.
[3] Berry, K.J. & Mielke, P.W. (1988). Monte Carlo comparisons of the asymptotic chi-square and likelihood ratio tests with the nonasymptotic chi-square tests for sparse r × c tables, Psychological Bulletin 103, 256–264.
[4] Greenwood, P.E. & Nikulin, M.S. (1996). A Guide to Chi-squared Testing, John Wiley & Sons, New York.
[5] Koehler, K. (1986). Goodness-of-fit tests for log-linear models in sparse contingency tables, Journal of the American Statistical Association 81, 483–493.
[6] Koehler, K. & Larntz, K. (1980). An empirical investigation of goodness-of-fit statistics for sparse multinomials, Journal of the American Statistical Association 75, 336–344.
[7] Mielke, P.W. & Berry, K.J. (1988). Cumulant methods for analyzing independence of r-way contingency tables and goodness-of-fit frequency data, Biometrika 75, 790–793.
[8] Mielke, P.W., Berry, K.J. & Johnston, J.E. (2004). Asymptotic log-linear analysis: some cautions concerning sparse frequency tables, Psychological Reports 94, 19–32.
[9] Read, T.R.C. & Cressie, N.A.C. (1988). Goodness-of-fit Statistics for Discrete Multivariate Data, Springer-Verlag, New York.
[10] von Davier, M. (1997). Bootstrapping goodness-of-fit statistics for sparse categorical data: results of a Monte Carlo study, Methods of Psychological Research 2, 29–48.
[11] Whittaker, J. (1990). Graphical Models in Applied Multivariate Statistics, John Wiley & Sons, New York.
[12] Wickens, T.D. (1989). Multiway Contingency Tables Analysis for the Social Sciences, Lawrence Erlbaum Associates, Hillsdale.

SCOTT L. HERSHBERGER
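The flattening effect described above is easy to see numerically. This sketch uses an invented sparse table (not data from the entry) and compares the Pearson chi-square after adding 1/2 versus a constant on the order of 10⁻⁸:

```python
import numpy as np

def pearson_chi2(obs):
    """Pearson chi-square for a two-way table (test of independence)."""
    obs = np.asarray(obs, dtype=float)
    exp = np.outer(obs.sum(axis=1), obs.sum(axis=0)) / obs.sum()
    return ((obs - exp) ** 2 / exp).sum()

table = np.array([[12, 0],
                  [1, 7]])               # invented table with an empty cell

stat_tiny = pearson_chi2(table + 1e-8)   # barely perturbs the data
stat_half = pearson_chi2(table + 0.5)    # noticeably flattens the table
print(stat_half < stat_tiny)             # True: adding 1/2 shrinks the statistic
```

The larger constant pulls the cell frequencies toward equality, shrinking the statistic and hence the apparent evidence against independence, which is exactly the power loss the entry warns about.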
Incompleteness of Probability Models
RANALD R. MACDONALD
Volume 2, pp. 900–902
she said that events have to be cut and dried in order to be assigned probabilities. Because of this, a probability may always be improved by making more distinctions between events, regardless of how well an associated probability model may appear to be consistent with the available evidence.

Fisher [4] considered what an ideal probability model would be like, starting with a cut and dried outcome. Such a model would classify events into sets containing no identifiable subsets, that is to say, sets that could not be further broken down into subsets where the outcome in question was associated with different probabilities. This involves postulating sets of equivalent or exchangeable events, which is how de Finetti conceived of ideal probability models [3]. Probabilities assigned to events in such sets could not be improved on by taking additional variables into account. Such probabilities are important as the laws of large numbers ensure that, in this case, each probability has a correct value, the limit of the relative frequency of equivalent events as n increases. Betting that one of the equivalent events will occur, taking the odds to be p:(1 − p), where p is this correct probability, will do better than any other value over the long term (Dutch book theorem: [7]). Unfortunately, one can never know that any two events are equivalent, let alone that any set of events has no identifiable subsets.

If all the uncertainty concerning happenings in the world could be captured by a probability, then it would have a correct value, and optimal solutions to probability problems regarding real-life events would exist. As things stand, however, probabilities regarding real-life events are based on analogies between the uncertainties in the world and models originating in people's heads. Furthermore, probability models do not explain how these analogies come to be formed [13]. As Aristotle foresaw around 2000 years before probability was invented, what probability should be assigned to an event depends on how it is characterized, and that is a matter for reasoned argument. Inferences from probabilities, be they relative frequencies, estimates of people's beliefs, the P values in statistical tests, or posterior probabilities in Bayesian statistics, are necessarily subject to revision when it is argued that the events being modeled should be characterized in more detail or that the probability models should be modified to take account of possibilities their inventors had not foreseen. Unforeseen occurrences can also pose problems for theories that suppose that probabilities are the limits of relative frequencies, because the relative frequency of events that have never occurred is zero, but clearly novel events occur [6]. Probabilities and their interpretation should be seen as matters for debate rather than as the necessary consequences of applying the correct analysis to a particular problem [1].

References

[1] Abelson, R.P. (1995). Statistics as Principled Argument, Lawrence Erlbaum Associates, Hillsdale.
[2] Anscombe, G.E.M. (1979). Under a description, Nous 13, 219–233.
[3] de Finetti, B. (1972). Probability, Induction & Statistics: The Art of Guessing, Wiley, London.
[4] Fisher, R.A. (1957/1973). Statistical Methods and Scientific Inference, 3rd Edition, Hafner, New York.
[5] Fisher, R.A. (1957). The underworld of probability, Sankhya 18, 201–210.
[6] Franklin, J. (2001). The Science of Conjecture: Evidence and Probability before Pascal, Johns Hopkins University Press, Baltimore.
[7] Howson, C. & Urbach, P. (1993). Scientific Reasoning: the Bayesian Approach, 2nd Edition, Open Court, Peru, Illinois.
[8] Luchins, A.S. & Luchins, E.H. (1965). Logical Foundations of Mathematics for Behavioral Scientists, Holt Rinehart & Winston, New York.
[9] Macdonald, R.R. (2000). The limits of probability modelling: A serendipitous tale of goldfish, transfinite numbers and pieces of string, Mind and Society 1(part 2), 17–38.
[10] Mandelbrot, B. (1982). The Fractal Geometry of Nature, Freeman, New York.
[11] Reichenbach, H. (1970). The Theory of Probability, 2nd Edition, University of California Press, Berkeley.
[12] Ross, W.D. (1966). The Works of Aristotle Translated into English, Clarendon, Oxford.
[13] Searle, J.R. (2001). Rationality in Action, MIT Press, Massachusetts.
[14] Thompson, J.J. (1971). The time of a killing, Journal of Philosophy 68, 115–132.

RANALD R. MACDONALD
Independence: Chi-square and Likelihood Ratio Tests
BRUCE L. BROWN AND KENT A. HENDRIX
Volume 2, pp. 902–907
divided by the expected value (E) for each cell, and then to sum all of these:

    χ² = Σ (O − E)²/E
       = 89.23²/140.77 + (−56.94)²/302.94 + (−32.29)²/41.29
         + (−60.11)²/190.11 + 68.88²/409.12 + (−8.77)²/55.77
         + (−29.12)²/44.12 + (−11.94)²/94.94 + 41.06²/12.94
       = 275.48   (4)

The degrees of freedom value for this chi-square test is the product (R − 1)(C − 1), where R is the number of rows in the contingency table and C is the number of columns. For this 3 × 3 contingency table, (R − 1)(C − 1) = (3 − 1)(3 − 1) = 4 df. In a table of critical values for the chi-square distribution, the value needed to reject the null hypothesis at the 0.001 level for 4 degrees of freedom is found to be 18.467. The obtained value of 275.48 exceeds this, so the null hypothesis of independence can be rejected at the 0.001 level. That is, these data give substantial evidence that occupation and musical preference are systematically related, and therefore not independent. Although this example is for a 3 × 3 contingency table, the chi-square test of independence may be used for two-way tables with any number of rows and any number of columns.

Expanding Chi-square to Other Kinds of Hypotheses

Other chi-square tests are possible besides the test of independence. Suppose in the example just given that the 1292 respondents were a random sample of workers in a particular Alaskan city, and that you wish to test workers' relative preferences for these three musical groups. In other words, you wish to test the null hypothesis that workers in the population from which this random sample is obtained are equally distributed in their preferences for the three musical groups. The observed (O) matrix is the column marginal totals:

    O = [ 375  807  110 ]   (5)

The null hypothesis in this case is that the three column categories have equal frequencies in the population, so the expected (E) matrix consists of the total 1292 divided by 3 (which gives 430.67), with this entry in each of the three positions. The O − E matrix is therefore

    O − E = [ 375  807  110 ] − [ 430.67  430.67  430.67 ]
          = [ −55.67  376.33  −320.67 ]   (6)

The obtained chi-square statistic is

    χ²_columns = Σ (O − E)²/E = 574.81   (7)

This chi-square test has 2 degrees of freedom, three columns minus one (C − 1), for which the critical ratio at the 0.001 level is 13.816. The null hypothesis of equal preferences for the three groups can therefore be rejected at the 0.001 level.

So far, two chi-square statistics have been calculated on this set of data, a test of independence and a test of equality of proportions across columns (musical preference). The third chi-square test to be computed is the test for row effects (occupation). The observed (O) matrix is the row marginal totals:

    O = [ 485 ]
        [ 655 ]   (8)
        [ 152 ]

The null hypothesis in this case is that the three row categories (occupations) have equal frequencies within the population. Therefore, the expected (E) matrix consists of the total 1292 divided by 3 (which gives 430.67), with this same entry in each of the three positions. The obtained chi-square statistic is

    χ²_rows = Σ (O − E)²/E = 304.01   (9)

With 2 degrees of freedom, the critical ratio for significance at the 0.001 level is 13.816, so the null hypothesis of equal occupational distribution can be rejected at the 0.001 level.

The fourth and final chi-square statistic to be computed is that for the total matrix. The observed (O) matrix is again the 3 × 3 matrix of observed frequencies, just as it was for the R × C test of independence given above. However, even though the observed matrix is the same, the null hypothesis (and therefore the expected matrix) is very different.
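As a numerical cross-check (an illustrative sketch, not part of the entry), the statistics in equations (4), (7), and (9) can be reproduced from the observed table with NumPy:

```python
import numpy as np

O = np.array([[230., 246., 9.],
              [130., 478., 47.],
              [15.,  83., 54.]])  # occupations (rows) by musical preference (columns)
N = O.sum()                        # 1292 respondents

def chi2(obs, exp):
    """Pearson chi-square: the sum of (O - E)^2 / E."""
    return ((obs - exp) ** 2 / exp).sum()

# Test of independence, equation (4): E is the product of the margins over N.
E_ind = np.outer(O.sum(axis=1), O.sum(axis=0)) / N
chi2_rc = chi2(O, E_ind)                            # about 275.48

# Column (preference) and row (occupation) effects, equations (7) and (9).
chi2_cols = chi2(O.sum(axis=0), np.full(3, N / 3))  # about 574.81
chi2_rows = chi2(O.sum(axis=1), np.full(3, N / 3))  # about 304.01
```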
This test is an omnibus test of whether there is any significance in the matrix considered as a whole (row, column, or R × C interaction), and the null hypothesis is therefore the hypothesis that all of the cells are equal. All nine entries in the expected (E) matrix are 143.56, which is one-ninth of 1292. The O − E matrix is therefore

    O − E = [ 230  246   9 ]   [ 143.56  143.56  143.56 ]
            [ 130  478  47 ] − [ 143.56  143.56  143.56 ]
            [  15   83  54 ]   [ 143.56  143.56  143.56 ]

          = [   86.44   102.44  −134.56 ]
            [  −13.56   334.44   −96.56 ]   (10)
            [ −128.56   −60.56   −89.56 ]

The obtained chi-square statistic is

    χ²_total = Σ (O − E)²/E = 1293.20   (11)

This chi-square has 8 degrees of freedom, the total number of cells minus one (RC − 1), with a critical ratio, at the 0.001 level, of 26.125. The null hypothesis of no differences of any kind within the entire data matrix can therefore be rejected at the 0.001 level.

These four kinds of information can be obtained from a contingency table using chi-square. However, there is a problem in using chi-square in this way. The total analysis should have a value that is equal to the sum of the other three analyses that make it up. The values should be additive, but this is not the case, as shown with the example data:

    χ²_total = 1293.20 ≠ 1154.30
             = 304.01 + 574.81 + 275.48
             = χ²_rows + χ²_columns + χ²_RC   (12)

As will be shown in the next section, this additivity property does hold for the likelihood ratio G² statistic of log-linear analysis. Log-linear analysis is supported by a more coherent mathematical theory than chi-square that enables this additivity property to hold, and also enables one to use the full power of linear models applied to categorical data (see Log-linear Models).

Multiplicative Models, Additive Models, and the Rationale for Log-linear

The procedure given above for obtaining the matrix of expected frequencies is a direct application of the multiplication rule of probabilities for independent joint events:

    P(A and B) = P(A) × P(B)   (13)

For example, for the Alaskan sample described above, the probability of a respondent being a lumberjack is

    P(A) = frequency of lumberjacks / total frequency = 485/1292 = 0.375   (14)

Similarly, the probability of a respondent preferring Jethro Tull music is

    P(B) = frequency of Jethro Tull preference / total frequency = 110/1292 = 0.085   (15)

If these two characteristics were independent of one another, the joint probability of a respondent being a lumberjack and also preferring Jethro Tull music would (by the multiplication rule for joint events) be

    P(A and B) = P(A) × P(B) = (0.375)(0.085) = 0.032   (16)

Multiplying this probability by 1292, the number in the sample, gives 41.3, which is (to one decimal place) the value of the expected frequency of the lumberjack/Jethro Tull cell.

Interaction terms in analysis of variance (ANOVA) use a similar kind of observed minus expected logic and an analogous method for obtaining expected values. The simple definitional formula for the sum of squares of a two-way interaction is

    SS(AB) = n Σ (X̄_ab − X̄_a − X̄_b + X̄)²
           = n Σ (X̄_ab − Ē_ab)²   (17)

This sum can be decomposed as the multiplicative factor n (cell frequency) times the sum of squared
deviations from additivity, where deviations from additivity refers to the deviations of the observed means (O) from the expected means (E), those that would be expected if an additive model were true. This is because the additive model for creating means is given by

    Ē_ab = X̄_a + X̄_b − X̄   (18)

The two processes are analogous. To obtain the expected cell means for ANOVA, one sums the marginal row mean and the marginal column mean and subtracts the grand mean. To obtain expected cell frequencies in a contingency table, one multiplies marginal row frequencies by marginal column frequencies and divides by the total frequency. By taking logarithms of the frequency values, one converts the multiplicative computations of contingency table expected values into the additive computations of ANOVA, thus making frequency data amenable to linear models analysis. This log-linear approach comes with a number of advantages: it enables one to test three-way and higher contingency tables, to additively decompose test statistics, and in general to apply powerful general linear models analysis to categorical data. The log-linear model will be briefly demonstrated for two-way tables.

Log-linear Models

The observed and expected matrices for each of the four tests demonstrated above are the same for log-linear analysis as for the chi-square analysis. All that differs is the formula for calculating the likelihood ratio statistic G². It is given as two times the sum over all the cells of the following quantity: the observed value times the natural logarithm of the ratio of the observed value to the expected value:

    G²_total = 2 Σ O loge(O/E)   (19)

Four likelihood ratio statistics will be demonstrated, corresponding to the four chi-square statistics just calculated. The first is the test of the row by column interaction. This is called the likelihood ratio test for independence. The likelihood ratio statistic for this test (using O and E values obtained above) is calculated as

    G²_RC = 2 Σ O loge(O/E)
          = 2 [ 230 loge(230/140.77) + 246 loge(246/302.94) + 9 loge(9/41.29)
              + 130 loge(130/190.11) + 478 loge(478/409.12) + 47 loge(47/55.77)
              + 15 loge(15/44.12) + 83 loge(83/94.94) + 54 loge(54/12.94) ]
          = 229.45   (20)

This likelihood ratio has the same degrees of freedom as the corresponding chi-square, (R − 1)(C − 1) = 4, and it is also looked up in an ordinary chi-square table. The critical ratio for significance at the 0.001 level with 4 degrees of freedom is 18.467. The null hypothesis of independence can therefore be rejected at the 0.001 level.

The likelihood ratio statistic for rows is

    G²_rows = 2 Σ O loge(O/E)
            = 2 [ 485 loge(485/430.67) + 655 loge(655/430.67) + 152 loge(152/430.67) ]
            = 347.93   (21)

This test has a 0.001 critical ratio of 13.816 with 2 (R − 1) degrees of freedom, so the null hypothesis can be rejected at the 0.001 level.

The third likelihood ratio statistic to be calculated is that for columns:

    G²_columns = 2 Σ O loge(O/E) = 609.50   (22)
which is also significant at the 0.001 level. The likelihood ratio for the total matrix is

    G²_total = 2 Σ O loge(O/E) = 1186.88   (23)

With 8 degrees of freedom and a critical ratio of 26.125, this test is also significant at the 0.001 level.

The additivity property holds with these four likelihood ratio statistics. That is, the sum of the obtained G² values for rows, columns, and R × C interaction is equal to the obtained G² value for the total matrix:

    G²_total = 1186.88 = 347.93 + 609.50 + 229.45
             = G²_rows + G²_columns + G²_RC   (24)

The history of log-linear models for categorical data is given by Imrey, Koch, and Stokes [3], and detailed accounts of the mathematical development are given by Agresti [1], and by Imrey, Koch, and Stokes [4]. Marascuilo and Levin [6] give a particularly readable account of how the logarithmic transformation enables one to analyze categorical data with the general linear model, and Brown, Hendrix, and Hendrix [2] demonstrate the convergence of chi-square and ANOVA through log-linear models with simplest-case data.

The calculations involved in both the chi-square analysis and also the log-linear analysis are simple enough to be easily accomplished using a spreadsheet, such as Microsoft Excel, Quattro Pro, ClarisWorks, and so on. They can also be accomplished using computer statistical packages such as SPSS and SAS. Chi-square analysis can be accomplished using the FREQ procedure of SAS, and log-linear analysis can be accomplished in SAS using CATMOD (see Software for Statistical Analyses). Landau and Everitt [5] demonstrate (in Chapter 3) how to use SPSS to do chi-square analysis and also how to do cross-tabulation of categorical and continuous data.

References

[1] Agresti, A. (2002). Categorical Data Analysis, 2nd Edition, John Wiley & Sons, Hoboken.
[2] Brown, B.L., Hendrix, S.B. & Hendrix, K.A. (in preparation). Multivariate for the Masses, Prentice-Hall, Upper Saddle River.
[3] Imrey, P.B., Koch, G.G. & Stokes, M.E. (1981). Categorical data analysis: some reflections on the log-linear model and logistic regression. Part I: historical and methodological overview, International Statistics Review 49, 265–283.
[4] Imrey, P.B., Koch, G.G. & Stokes, M.E. (1982). Categorical data analysis: some reflections on the log-linear model and logistic regression. Part II: data analysis, International Statistics Review 50, 35–63.
[5] Landau, S. & Everitt, B.S. (2004). A Handbook of Statistical Analyses Using SPSS, Chapman & Hall, Boca Raton.
[6] Marascuilo, L.A. & Levin, J.R. (1983). Multivariate Statistics in the Social Sciences: A Researcher's Guide, Brooks-Cole, Monterey.

BRUCE L. BROWN AND KENT A. HENDRIX
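The four G² statistics and the additivity property in (24) can be verified numerically. This is an illustrative sketch (not part of the entry) in NumPy:

```python
import numpy as np

O = np.array([[230., 246., 9.],
              [130., 478., 47.],
              [15.,  83., 54.]])
N = O.sum()

def g2(obs, exp):
    """Likelihood ratio statistic: G^2 = 2 * sum(O * ln(O / E))."""
    obs = np.asarray(obs, dtype=float)
    return 2.0 * (obs * np.log(obs / exp)).sum()

g2_rc = g2(O, np.outer(O.sum(axis=1), O.sum(axis=0)) / N)  # about 229.45
g2_rows = g2(O.sum(axis=1), np.full(3, N / 3))             # about 347.93
g2_cols = g2(O.sum(axis=0), np.full(3, N / 3))             # about 609.50
g2_total = g2(O, np.full((3, 3), N / 9))                   # about 1186.88

# Unlike chi-square, G^2 decomposes additively, as in equation (24):
print(np.isclose(g2_rows + g2_cols + g2_rc, g2_total))     # True
```

The additivity here is exact up to floating-point error, not an artifact of rounding, which is the property that makes the likelihood ratio statistic the natural choice for decomposing log-linear models.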
Independent Component Analysis
JAMES V. STONE
Volume 2, pp. 907–912
[Figure 2: Signal mixtures have Gaussian or normal histograms. Signals (top row) and corresponding histograms of signal values (bottom row), where each histogram approximates the probability density function (pdf) of one signal. The top panels display only a small segment of the signals used to construct the displayed histograms. (a) A speech source signal, and (d) a histogram of amplitude values in that signal. (b) A sawtooth source signal, and (e) its histogram. (c) A signal mixture, which is the sum of the source signals in (a) and (b), and (f) its bell-shaped histogram.]
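The pattern Figure 2 illustrates, that a sum of non-Gaussian signals is distributed more nearly like a Gaussian than its sources, can be checked numerically. This sketch uses synthetic stand-in signals (a unit-variance Laplacian for the speech-like source and a unit-variance sawtooth), not the signals in the figure:

```python
import numpy as np

def excess_kurtosis(x):
    """Excess kurtosis: 0 for a Gaussian, positive for peaky (super-Gaussian)
    signals such as speech, negative for flat (sub-Gaussian) signals."""
    x = x - x.mean()
    return (x ** 4).mean() / ((x ** 2).mean() ** 2) - 3.0

rng = np.random.default_rng(0)
n = 200_000
s1 = rng.laplace(size=n) / np.sqrt(2.0)                   # speech-like source
s2 = np.sqrt(3.0) * (2 * ((np.arange(n) / 100) % 1) - 1)  # sawtooth source
mixture = s1 + s2

k1, k2, km = excess_kurtosis(s1), excess_kurtosis(s2), excess_kurtosis(mixture)
print(abs(km) < abs(k1) and abs(km) < abs(k2))  # True: mixture is more Gaussian
```

With equal-variance sources, the Laplacian's excess kurtosis of about 3 and the sawtooth's of about −1.2 average out in the mixture to roughly 0.45, closer to the Gaussian value of zero than either source.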
a joint pdf ps, which particular mixing matrix A (and, therefore, which unmixing matrix W = A⁻¹) is most likely to have generated the observed signal mixtures x? In other words, if the likelihood of obtaining the observed mixtures (from some unknown source signals with joint pdf ps) were to vary with A, then which particular A would maximize this likelihood? The likelihood of the observed mixtures, considered as a function of W, defines the likelihood function L(W) of W, and its logarithm defines the log likelihood function ln L(W). If the M source signals are mutually independent, so that the joint pdf ps is the product of its M marginal pdfs, then (7) can be written
[Figure 1: Independent Pathway Model for a twin pair. Path diagram in which the common factors Ac, Cc, and Ec load on Var1, Var2, and Var3 for each twin, with a correlation of 1 (MZ) or .5 (DZ) connecting the two twins' Ac factors. Ac, Cc, and Ec are the common additive genetic, common shared environmental, and common nonshared environmental factors, respectively. The factors at the bottom estimate the variable-specific A and E influences. For simplicity, the specific C factors were omitted from the diagram.]
(which will be a function of both their h² (see Heritability) and genetic correlation) and by common shared and unique environmental effects (which will be a function of their c² and e² and the C and E correlation [see ACE Model]). For more information on genetic and environmental correlations between variables, see the general section on multivariate genetic analysis. Parameters are estimated from the observed variances and covariances by structural equation modeling.

So what is the meaning and interpretation of this factor model? An obvious one is to examine the etiology of comorbidity. One of the first applications in this sense was by Kendler et al. [1], who illustrated that the tendency of self-report symptoms of anxiety and depression to form separate symptom clusters was mainly due to shared genetic influences. This means that genes act largely in a nonspecific way to influence the overall level of psychiatric symptoms. The effects of the environment were mainly specific. The conclusion was that separable anxiety and depression symptom clusters in the general population are largely the result of environmental factors.

Independent pathway models with more than one set of common genetic and environmental factors are also possible. Multivariate genetic analyses of schizotypal personality data indicated that at least two latent factor (see Latent Variable) structures are required to account for the genetic covariation between the various components of schizotypy. The positive components (reality distortion, such as magical ideas, unusual perceptions, and referential ideas) and negative components (anhedonia, social isolation, and restricted affect) are relatively genetically independent, although each in turn may be related to cognitive disorganization [2].

References

[1] Kendler, K.S., Heath, A.C., Martin, N.G. & Eaves, L.J. (1987). Symptoms of anxiety and symptoms of depression: same genes, different environment? Archives of General Psychiatry 44, 451–457.
[2] Linney, Y.M., Murray, R.M., Peters, E.R., Macdonald, A.M., Rijsdijk, F.V. & Sham, P.C. (2003). A quantitative genetic analysis of Schizotypal personality traits, Psychological Medicine 33, 803–816.
[3] McArdle, J.J. & Goldsmith, H.H. (1990). Alternative common-factor models for multivariate biometrical analyses, Behavior Genetics 20, 569–608.

FRUHLING RIJSDIJK
Index Plots
PAT LOVIE
Volume 2, pp. 914–915
in
ISBN-13: 978-0-470-86080-9
ISBN-10: 0-470-86080-4
Editors
Table 1 Percentage of studies in Journal of Applied Psychology that used 10 statistical techniques, 1995–2003
Type of analysis 1995 1996 1997 1998 1999 2000 2001 2002 2003 Ave SD
ANOVA 53 58 41 50 41 35 59 61 47 49.44 9.14
CFA 14 14 16 15 11 24 29 16 27 18.44 6.46
CHISQ 4 11 11 5 8 6 8 10 4 7.44 2.83
CORR 76 72 66 78 72 78 86 84 78 76.67 6.16
EFA 8 13 8 5 15 9 30 12 23 13.67 8.06
HLM 0 1 0 0 1 2 4 5 8 2.33 2.78
LOGR 5 3 3 4 1 5 3 4 0 3.11 1.69
MA 7 10 8 11 9 5 7 11 1 7.67 3.20
MR 46 28 32 45 40 38 57 48 43 41.89 8.68
SEM 7 11 11 9 17 8 17 14 13 11.89 3.66
Notes: Values rounded to the nearest whole number. Studies associated with nontraditional IO research topics such as eyewitness
testimony, jury selection, and suspect lineup studies were not included. ANOVA = t Tests, analysis of variance, analysis of
covariance, multivariate analysis of variance, and multivariate analysis of covariance; CFA = confirmatory factor analysis;
CHISQ = chi-square; CORR = bivariate correlations; EFA = exploratory factor analysis; HLM = hierarchical linear modeling;
LOGR = logistic regression; MA = meta-analysis; MR = multiple regression analysis, and SEM = structural equation modeling.
Table 2 Percentage of studies in Personnel Psychology that used 10 statistical techniques, 1995–2003
Type of analysis 1995 1996 1997 1998 1999 2000 2001 2002 2003 Ave SD
ANOVA 56 36 59 54 36 34 46 62 40 47.00 10.95
CFA 0 7 19 14 32 7 15 8 27 14.33 10.30
CHISQ 9 11 7 8 0 5 13 8 7 7.56 3.68
CORR 66 71 72 86 82 66 73 77 80 74.78 6.98
EFA 25 29 25 29 18 24 15 15 33 23.67 6.42
HLM 0 4 0 0 0 0 4 0 0 0.89 1.76
LOGR 4 0 7 4 9 5 4 4 0 4.11 2.89
MA 13 7 9 0 14 7 4 0 27 9.00 8.37
MR 41 39 41 57 54 52 46 38 67 48.33 9.85
SEM 9 7 6 11 7 3 15 4 7 7.67 3.64
Notes: Values rounded to the nearest whole number. Studies associated with nontraditional IO research topics such as eyewitness
testimony, jury selection, and suspect lineup studies were not included. ANOVA = t Tests, analysis of variance, analysis of
covariance, multivariate analysis of variance, and multivariate analysis of covariance; CFA = confirmatory factor analysis;
CHISQ = chi-square; CORR = bivariate correlations; EFA = exploratory factor analysis; HLM = hierarchical linear modeling;
LOGR = logistic regression; MA = meta-analysis; MR = multiple regression analysis, and SEM = structural equation modeling.
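The Ave and SD columns in Tables 1 and 2 are summary statistics of the yearly percentages; a minimal Python sketch (assuming the SD column is the sample standard deviation) reproduces the ANOVA row of Table 1:

```python
import statistics

# Yearly percentages for the ANOVA row of Table 1 (1995-2003).
anova = [53, 58, 41, 50, 41, 35, 59, 61, 47]

# Mean and sample standard deviation, rounded as in the table.
print(round(statistics.mean(anova), 2))   # 49.44
print(round(statistics.stdev(anova), 2))  # 9.14
```

The same computation reproduces the remaining rows of both tables, which confirms that the tabled SD is the sample (n − 1) rather than the population standard deviation.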
pronounced and that no other techniques show similar differences, we can conclude that technique use does not vary as a function of publication outlet in the two leading journals in the IO field.

Owing to the consistency of the observed results, data were collapsed across year and journal to produce an overall ranking of statistics use (see Table 3). Rankings are based on the percentage of all studies that report each statistical technique.

Table 3 makes clear the use of statistics in IO psychology. Correlations are the statistics most frequently used, appearing in approximately 78% of empirical research in JAP and PP during the nine years. In a second tier, ANOVA-based statistics and multiple regression appeared in 49% and 43% of studies, respectively. Confirmatory factor analysis (see Factor Analysis: Confirmatory) (18%), exploratory factor analysis (17%), and structural equation modeling (11%) comprised a third tier. Meta-analysis (9%), chi-square analysis (see Contingency Tables) (8%), logistic regression (3%), and hierarchical linear modeling (2%) appeared in the fewest coded studies.

These techniques do not exhaust the statistical toolbox used by IO psychologists. For example, statistical techniques such as canonical correlation
Industrial/Organizational Psychology 3
and discriminant function analysis also appear in IO-related articles. Table 3 simply provides a Top 10 of the most widely used techniques.

Table 3 Top ten statistical techniques in IO psychology

Rank  Type of analysis              Overall percent
1     Correlation                   78
2     Analysis of Variance          49
3     Multiple Regression           43
4     Confirmatory Factor Analysis  18
5     Exploratory Factor Analysis   17
6     Structural Equation Modeling  11
7     Meta-Analysis                  9
8     Chi-square                     8
9     Logistic Regression            3
10    Hierarchical Linear Modeling   2

Information not included in the tabled data, but especially striking, is the percentage of articles in which the primary focus was some aspect of statistics or research methodology. In JAP, 4.8% of the articles fell into this category, whereas 15.6% of the articles in PP addressed a primarily psychometric, methodological, or statistical issue. Although the nature of these articles varied widely, many reported results of statistical simulations, especially simulations of the consequences of certain statistical considerations (e.g., violations of assumptions). Others detailed the development or refinement of a given technique or analytic procedure.

Types of Analyses used in IO Research

Although the distinction between industrial and organizational psychology may be perceived as somewhat nebulous, treating the two areas as unique is a useful heuristic. Industrial psychologists traditionally study issues related to worker performance, productivity, motivation, and efficiency. To understand and predict these criterion variables, researchers explore how concepts such as individual differences (e.g., cognitive ability), workplace interventions (e.g., training programs), and methods of employee selection (e.g., job interviews) impact job-related outcomes. The statistical orientation of industrial psychology is largely a function of both the need to quantify abstract psychological constructs (e.g., cognitive ability, personality traits, job performance) and the practical difficulties faced in organizational settings (e.g., lack of random assignment, measuring change due to a specific intervention).

Organizational psychologists generally study a broader range of topics including job attitudes, worker well-being, motivation, careers, and leadership. Addressing such issues presents a number of special statistical considerations. Measurement issues and practical considerations of data collection similarly confront organizational researchers. In addition, the hierarchical nature of organizations (i.e., individuals nested within groups nested within companies) presents unique methodological challenges for organizational psychologists. To address such factors, researchers often employ a variety of methodological and statistical techniques in order to draw strong conclusions.

Bivariate Correlation

As Table 3 illustrates, simple correlational analyses appear with the most frequency in IO research. For present purposes, correlational analyses include those that involve Pearson correlations, phi coefficients, biserial correlations, point-biserial correlations, and tetrachoric correlations: methods that assess relationships between two variables. Simple correlations are reported in studies related to nearly every subarea (e.g., selection, training, job performance, work attitudes, organizational climate, and motivation).

One factor responsible for the ubiquity of correlations in IO research is the field's interest in reliability information. Whether discussing predictor tests, criterion ratings, attitude scales, or various self-report measures, IO psychologists are typically concerned about the consistency of the observed data. Types of reliability reported include test-retest, alternate forms, internal consistency (most frequently operationalized through coefficient alpha), and interrater reliability.

Analysis of Variance

Statistics such as t Tests, analysis of variance, analysis of covariance (ANCOVA), multivariate analysis of variance (MANOVA), and multivariate analysis of covariance (MANCOVA) appear in studies that involve comparisons of known or manipulated groups. In addition, researchers often utilize t Tests and ANOVA prior to conducting other analyses to ensure that different groups do not differ significantly from one another on the
primary variables of interest. ANCOVA allows one to statistically control for confounding variables, or covariates, that potentially distort results and conclusions. Because organizational reality often precludes true experimentation, ANCOVA is quite popular among IO psychologists, especially those in organizations. MANOVA and MANCOVA are useful when dealing with either multiple criteria and/or repeated measurements (see Repeated Measures Analysis of Variance). For example, evaluating separate training outcomes or evaluating one or more outcomes repeatedly would warrant one of these techniques.

Multiple Regression

Multiple regression (MR) analysis is used in three situations: (1) identifying the combination of predictors that can most accurately forecast a criterion variable, (2) testing for mediated relationships (see Mediation), and (3) testing for the presence of statistical interactions (see Interaction Effects).

The first situation occurs, for example, when IO psychologists attempt to identify the optimal set of predictor variables that an organization should utilize in selecting employees. Using MR in this manner yields potential practical and financial benefits to organizations by enabling them to eliminate useless or redundant selection tests while maintaining optimal prediction.

Utilizing MR for mediated relationships is increasingly common in IO and has led to both theoretical and practical advances. For example, researchers utilize MR when attempting to identify the intervening variables and processes that explain bivariate relationships between predictor variables (e.g., cognitive ability, personality traits) and relevant criteria (e.g., job performance). Moderated MR (see Moderation) is also encountered in IO, both to uncover complex relationships that main effects fail to capture and to identify important boundary conditions that limit the generalizability of conclusions. In addition, organizations use moderated MR to ensure that the empirical relationship between a given selection measure and the criterion is constant across subgroups and protected classes. Any evidence to the contrary, revealed by a significant interaction between predictor and group, necessitates that the organization abandon the procedure.

Confirmatory & Exploratory Factor Analysis

Exploratory factor analysis is used by IO psychologists to provide construct validity evidence in many substantive interest areas. In particular, exploratory factor analysis is used in situations that involve newly created or revised measures. Often, but not always, those using exploratory factor analysis for this purpose hope to find that all of the items load on a single factor, suggesting that the measure is unidimensional.

Confirmatory factor analysis has become increasingly popular in recent years, largely due to the increasing availability of computer packages such as LISREL and EQS (see Structural Equation Modeling: Software). Unlike exploratory techniques, confirmatory approaches allow one to specify an a priori factor structure, indicating which items are expected to load on which factors. Confirmatory factor analysis is also useful for investigating the presence of method variance, often through multitrait-multimethod data, as well as ensuring that factor structures are similar, or invariant, across different subgroups.

Structural Equation Modeling

Because path analysis using ordinary least squares regression does not allow for the inclusion of measurement error, structural equation modeling is used to test hypothesized measurement and structural relationships between variables. Although the use of structural equation modeling in IO remains relatively infrequent (see Tables 1 and 2), this approach holds great promise, especially given the increasing sophistication of IO theories and models. As IO psychologists become more familiar with structural equation modeling and the associated software, its frequency should increase.

Meta-analysis

The use of meta-analytic techniques has led to several truths in IO psychology. From the selection literature, meta-analytic results reveal that the best predictor of job performance across all jobs is cognitive ability and the best personality-related predictor is conscientiousness. More generally, meta-analysis led to the insight that disparities in results between studies are due largely to artifacts inherent in the measurement process. This conclusion has the potential
to change the ways that IO psychologists undertake applied and academic research questions. In addition to reducing the necessity of conducting local validation studies in organizations, academic researchers may choose meta-analytic methodologies rather than individual studies.

Logistic Regression

Logistic regression is useful for predicting dichotomous outcomes, especially when fundamental assumptions underlying linear regression are violated. This technique is especially common among IO psychologists examining issues related to employee turnover, workplace health and safety, and critical performance behaviors because the relevant criteria are dichotomous. For example, a researcher might use logistic regression to investigate the variables that are predictive of whether one is involved in a driving accident or whether a work team performs a critical behavior.

Contributions of IO Psychologists to the Statistics Literature

In addition to extensively utilizing existing analytical techniques, IO psychologists also conduct research in which they examine, refine, and create statistical tools. Largely driving these endeavors is the complex nature of organizational phenomena that IO psychologists address. Often, however, these statistical advances not only enable researchers and practitioners to answer their questions but also propagate new insights and questions. In addition, other areas both within and outside of the organizational realm often benefit by applying IO psychologists' statistical advances. In the following paragraphs, we list several statistical topics to which IO researchers made especially novel and significant contributions. This listing is not exhaustive with respect to topics or results but is presented for illustrative purposes.

IO researchers have made contributions in quantitative meta-analysis, especially in terms of validity generalization. Until the late 1970s, IO psychologists believed that to identify those variables that best predicted job performance, practitioners must conduct a validation study for each job within each organization. This notion, however, was radically altered by demonstrating that predictor-criterion validity often generalizes across organizations, thereby suggesting that local validation studies are not always essential. Specifically, early meta-analytic work revealed that validity estimates from different organizations and situations often differed from each other primarily as a function of statistical artifacts inherent in the measurement process (e.g., sampling error, low reliability, range restriction) and not as a result of specific contextual factors. These insights led to several lines of statistical research on how best to conceptualize and correct for these artifacts, especially when the primary studies do not contain the necessary information (e.g., reliability estimates).

Beginning in the early 1980s, IO psychologists also made strides in examining and developing various aspects of structural equation modeling. Some of these advances were related to the operationalization of continuous moderators, procedures for evaluating the influence of method variance, techniques for assessing model invariance across groups, the use of formative versus reflective manifest variables, and the impact of item parceling on model fit. Notably, some developments engendered novel research questions that, prior to these advances, IO psychologists may not have considered. For example, recent developments in latent growth modeling allowed IO psychologists to study how individuals' changes over time on a given construct impact or are impacted by their standing on another variable. Thus, to study how changes in workers' job satisfaction influence their subsequent job performance, the researcher can now measure how intraindividual changes in satisfaction affect performance, instead of relying on a design in which Time 1 satisfaction simply is correlated with Time 2 performance.

Yet another statistical area that IO psychologists contributed to is difference scores. Organizational researchers traditionally utilized difference scores to examine issues such as the degree of fit between a person and a job or a person and an organization. Throughout the 1990s, however, a series of articles highlighted several problematic aspects of difference scores and advanced the use of an alternative technique, polynomial regression, to study questions of fit and congruence.

The preceding discussion covers only a few of IO psychology's contributions to statistical methods. Given the continuing advances in computer technology as well as the ever-increasing complexity of management and organizational theory, IO psychologists
probably will continue their research work on statistical issues in the future.

The Role of Statistics in Theory Development

This entry illustrates the interest that IO psychologists have in quantitative methods. Although IO psychologists pride themselves on the relative methodological rigor and statistical sophistication of the field, our focus on these issues is not without criticism. For example, IO psychologists may be viewed as overly concerned with psychometric and statistical issues at the expense of underlying constructs and theory. To be sure, this criticism may once have possessed some merit. Recently, however, IO psychologists have made significant theoretical advancements as evidenced by recent efforts to understand, instead of simply predict, job performance and other important criteria. Without our embrace of sophisticated statistical analyses, this theoretical focus might not have emerged. Procedures such as structural equation modeling, hierarchical linear modeling, and meta-analysis have enabled researchers to assess complex theoretical formulations, and have allowed practitioners to better serve organizations and workers. Moreover, many other areas of psychology often benefit from the statistical skills of IO psychologists, especially in terms of the availability of new or refined techniques. Thus, IO psychologists continue embracing statistics both as an instrumental tool to address theoretical research questions and as an area of study and application worthy of addressing in its own right.

RONALD S. LANDIS AND SETH A. KAPLAN
Influential Observations
LAWRENCE PETTIT
Volume 2, pp. 920–923
Introduction
It gives a measure of the influence on T of adding an observation at x as n → ∞.

Several finite sample versions of the influence curve have been suggested. The empirical influence curve (EIC) is obtained by substituting the sample cdf F̂ for F in the influence curve. For linear models,

EIC(x, y) = n(XᵀX)⁻¹x(y − xᵀβ̂)

Note that unlike most of the frequentist diagnostics it is not zero if ti is. This is because I(i) measures the effect on the whole posterior. Deleting an observation with ti = 0 would not affect … but may affect its variance if vi is large.

The idea of using case deletion or a sample influence curve to measure influence has been extended to many situations, for example,
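The empirical influence curve for the linear model can be sketched numerically. The data below are simulated purely for illustration (they are not from the entry), with one intercept and one regressor:

```python
import numpy as np

# Sketch of the empirical influence curve for a linear model:
#   EIC(x, y) = n * (X'X)^{-1} x (y - x'beta_hat),
# evaluated at a candidate point (x, y).

rng = np.random.default_rng(0)
n = 50
X = np.column_stack([np.ones(n), rng.normal(size=n)])  # intercept + one regressor
beta_true = np.array([1.0, 2.0])
y = X @ beta_true + rng.normal(scale=0.5, size=n)

beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]  # least squares fit
XtX_inv = np.linalg.inv(X.T @ X)

def eic(x_new, y_new):
    """Empirical influence of adding the point (x_new, y_new)."""
    resid = y_new - x_new @ beta_hat
    return n * XtX_inv @ x_new * resid

# A point far from the fitted line has a large influence on beta_hat.
print(eic(np.array([1.0, 3.0]), 20.0))
```

A point lying exactly on the fitted line has zero residual and therefore zero empirical influence, mirroring the way the residual enters the formula multiplicatively.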
Jolliffe and Lukudu [12] find similar behavior when looking at the effect of a contaminant on the T statistic.

The question remains as to what an analyst should do when they find an observation is influential. It should certainly be reported. Sometimes, as in the Pack et al. [14] example, it is a sign of a recording error that can be corrected. If a designed experiment results in an influential observation, it suggests that taking some more observations in that part of the design space would be a good idea. Another possibility is to use a more robust procedure that automatically downweights such observations. It may also suggest that a hypothesized model does not hold for the whole of the space of regressors.

References

[1] Andrews, D.F. & Pregibon, D. (1978). Finding the outliers that matter, Journal of the Royal Statistical Society, Series B 40, 85–93.
[2] Atkinson, A.C. (1981). Robustness, transformations and two graphical displays for outlying and influential observations in regression, Biometrika 68, 13–20.
[3] Bruce, A.G. & Martin, R.D. (1989). Leave-k-out diagnostics for time series (with discussion), Journal of the Royal Statistical Society, Series B 57, 363–424.
[4] Cook, R.D. (1977). Detection of influential observations in linear regression, Technometrics 19, 15–18.
[5] Cook, R.D. & Weisberg, S. (1980). Characterizations of an empirical influence function for detecting influential cases in regression, Technometrics 22, 495–508.
[6] Cook, R.D. & Weisberg, S. (1982). Residuals and Influence in Regression, Chapman & Hall, New York.
[7] Critchley, F. (1985). Influence in principal components analysis, Biometrika 72, 627–636.
[8] Groeneveld, R.A. (1991). An influence function approach to describing the skewness of a distribution, American Statistician 45, 97–102.
[9] Hampel, F. (1968). Contributions to the theory of robust estimation, Unpublished PhD dissertation, University of California, Berkeley.
[10] Johnson, W. & Geisser, S. (1983). A predictive view of the detection and characterization of influential observations in regression analysis, Journal of the American Statistical Association 78, 137–144.
[11] Jolliffe, I.T., Jones, B. & Morgan, B.J.T. (1995). Identifying influential observations in hierarchical cluster analysis, Journal of Applied Statistics 22, 61–80.
[12] Jolliffe, I.T. & Lukudu, S.G. (1993). The influence of a single observation on some standard statistical tests, Journal of Applied Statistics 20, 143–151.
[13] Pack, P. & Jolliffe, I.T. (1992). Influence in correspondence analysis, Applied Statistics 41, 365–380.
[14] Pack, P., Jolliffe, I.T. & Morgan, B.J.T. (1988). Influential observations in principal component analysis: a case study, Journal of Applied Statistics 15, 39–52.
[15] Pettit, L.I. & Smith, A.F.M. (1985). Outliers and influential observations in linear models, in Bayesian Statistics 2, J.M. Bernardo, M.H. DeGroot, D.V. Lindley & A.F.M. Smith, eds, Elsevier Science Publishers B.V., North Holland, pp. 473–494.
[16] Pettit, L.I. & Young, K.D.S. (1990). Measuring the effect of observations on Bayes factors, Biometrika 77, 455–466.
[17] Welsch, R.E. & Kuh, E. (1977). Linear regression diagnostics, Working paper No. 173, National Bureau of Economic Research, Cambridge.
[18] Young, K.D.S. (1992). Influence of groups of observations on Bayes factors, Communications in Statistics – Theory and Methods 21, 1405–1426.

LAWRENCE PETTIT
Information Matrix
JAY I. MYUNG AND DANIEL J. NAVARRO
Volume 2, pp. 923–924
is used to define a noninformative prior that generalizes the notion of uniform. This is called Jeffreys prior [3], defined as πJ(θ) ∝ |I1(θ)|^(1/2), where |I1(θ)| is the determinant of the information matrix. This prior can be useful for three reasons. First, it is reparametrization-invariant, so the same prior is obtained under all reparameterizations. Second, Jeffreys prior is a uniform density on the space of probability distributions in the sense that it assigns equal mass to each different distribution [1]. In comparison, the uniform prior defined as U(θ) = c for some constant c assigns equal mass to each different value of the parameter and is not reparametrization-invariant. Third, Jeffreys prior is the one that maximizes the amount of information about θ, in the Kullback–Leibler sense, that the data are expected to provide [2].

References

[2] Bernardo, J.M. (1979). Reference posterior distributions for Bayesian inference (with discussion), Journal of the Royal Statistical Society, Series B 41, 113–147.
[3] Jeffreys, H. (1961). Theory of Probability, 3rd Edition, Oxford University Press, London.
[4] Lehmann, E.L. & Casella, G. (1998). Theory of Point Estimation, 2nd Edition, Springer, New York.
[5] Rao, C.R. (1945). Information and accuracy attainable in the estimation of statistical parameters, Bulletin of the Calcutta Mathematical Society 37, 81–91 (Republished in S. Kotz & N. Johnson, eds, Breakthroughs in Statistics: 1889–1990, Vol. 1).
[6] Schervish, M.J. (1995). Theory of Statistics, Springer, New York.
[7] Spanos, A. (1999). Probability Theory and Statistical Inference: Econometric Modeling with Observational Data, Cambridge University Press, Cambridge.

JAY I. MYUNG AND DANIEL J. NAVARRO
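As a worked illustration of the Jeffreys prior discussed in the entry above, consider a Bernoulli(θ) model (an assumed example, not from the entry). Its Fisher information is I(θ) = 1/(θ(1 − θ)), so the Jeffreys prior is proportional to θ^(−1/2)(1 − θ)^(−1/2), i.e., a Beta(1/2, 1/2) density with normalizing constant 1/π:

```python
import math

# Sketch, assuming a Bernoulli(theta) likelihood.
def fisher_information(theta):
    """Fisher information of a single Bernoulli observation."""
    return 1.0 / (theta * (1.0 - theta))

def jeffreys_unnormalized(theta):
    """Jeffreys prior: square root of the Fisher information."""
    return math.sqrt(fisher_information(theta))

# Normalizing by B(1/2, 1/2) = pi gives the Beta(1/2, 1/2) density.
theta = 0.5
print(jeffreys_unnormalized(theta) / math.pi)  # 2/pi, about 0.6366
```

The symmetry of the result around θ = 0.5 and its U-shape near the endpoints are the familiar features of the Beta(1/2, 1/2) reference prior.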
[Figure: A general communication system: message source → encoder → channel (subject to noise) → decoder → destination.]
another particular word will follow. For example, the probability that a noun follows after the word "the" is higher than the probability that an adverb follows. The opposite of information is redundancy.

Some Background

The mathematical background of the study of sequences of words or other messages is given by Stochastic Processes and by Markov Chains. If a speaker builds a message word by word, the probability of the next word is considered given only by the immediately preceding word, but not by the words used before. This is the concept of a first-order Markov Chain. The mathematical and statistical terminology used in this context is that of an Ergodic Markov Chain, the most important case of Markov chains. In more technical terms, let Pi be the probability of state i, and pi(j) the probability of arriving at state j coming from state i. This probability is also called the transition probability. For a stationary process, the following constraint holds:

Pj = Σi Pi pi(j). (2)

In the ergodic case, it can be shown that the probability Pj(N), that is, the probability of reaching state j after N signs, converges to the equilibrium values …

H = −Σi pi log pi, (3)

where p1, …, pN are the probabilities of particular messages from a set of N independent messages. Redundant messages add nothing or only little information to a message. This concept is important because it helps track down and minimize noise, for example, in the form of repeating a message, in a communicating system.

The aim of the original theory of information [3] was to find out how many calls can be transmitted in one phone transmission. This number is called the channel capacity. To determine channel capacity, it is essential to take the length of signs into account. In general, the channel capacity C is

C = lim(T→∞) log N(T)/T, (4)

where N(T) is the number of permitted signals of length T. Based on these considerations, the following fundamental theorem of Shannon and Weaver can be derived: Using an appropriate coding scheme, a source is able to transmit messages via the transmission channel at an average transmission rate of almost C/H, where C is the channel capacity in bits per second and H is the entropy, measured in bits per sign. The exact value of C/H is never reached, regardless of which coding scheme is used.
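Equation (3) can be illustrated with a short sketch; taking logarithms to base 2 gives the entropy in bits per sign:

```python
import math

def entropy(probs):
    """Shannon entropy H = -sum(p_i * log2 p_i), in bits per sign."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# A uniform source over 4 messages carries log2(4) = 2 bits per sign.
print(entropy([0.25, 0.25, 0.25, 0.25]))  # 2.0
# A skewed source carries less information per sign.
print(entropy([0.7, 0.1, 0.1, 0.1]))
```

The uniform distribution maximizes H for a given number of messages, which is why redundancy (departure from uniformity) reduces the information carried per sign.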
Information Theory 3
either since it is vulnerable to bias. A large part of the recent incomplete data and so-called noncompliance literature is devoted to ways of dealing with this question [12]. This issue is nontrivial since the only definitive way to settle it would be to dispose of the unavailable data, which, by definition, is impossible. Whatever assumptions are made to progress with the analysis, they will always be unverifiable, at least in part, which typically results in sensitivity to model assumptions.

The translation of the ITT principle to longitudinal clinical studies (see Longitudinal Data Analysis), that is, studies where patient data are collected at multiple measurement occasions throughout the study period, is a controversy in its own right. For a long time, the view has prevailed that only carrying the last measurement (also termed last value) actually obtained on a given patient forward throughout the remainder of the follow-up period is a sensible approach in this respect. (This is known as last observation carried forward (LOCF).) With the advent of modern likelihood-based longitudinal data analysis tools, flexible modeling approaches that avoid the need for both imputation and deletion of data have come within reach. Part of this discussion can be found in [6, 8].

References

[1] Armitage, P. (1998). Attitudes in clinical trial, Statistics in Medicine 17, 2675–2683.
[2] Buyse, M.E., Staquet, M.J. & Sylvester, R.J. (1984). Cancer Clinical Trials: Methods and Practice, Oxford University Press, Oxford.
[3] Friedman, L.M., Furberg, C.D. & DeMets, D.L. (1998). Fundamentals of Clinical Trials, Springer-Verlag, New York.
[4] Green, S., Benedetti, J. & Crowley, J. (1997). Clinical Trials in Oncology, Chapman & Hall, London.
[5] Hill, A.B. (1961). Principles of Medical Statistics, 7th Edition, The Lancet, London.
[6] Mazumdar, S., Liu, K.S., Houck, P.R. & Reynolds, C.F. III. (1999). Intent-to-treat analysis for longitudinal clinical trials: coping with the challenge of missing values, Journal of Psychiatric Research 33, 87–95.
[7] McMahon, A.D. (2002). Study control, violators, inclusion criteria and defining explanatory and pragmatic trials, Statistics in Medicine 21, 1365–1376.
[8] Molenberghs, G., Thijs, H., Jansen, I., Beunckens, C., Kenward, M.G., Mallinckrodt, C. & Carroll, R.J. (2004). Analyzing incomplete longitudinal clinical trial data, Biostatistics 5, 445–464.
[9] Piantadosi, S. (1997). Clinical Trials: A Methodologic Perspective, John Wiley, New York.
[10] Pocock, S.J. (1983). Clinical Trials: A Practical Approach, John Wiley, Chichester.
[11] Schwartz, D. & Lellouch, J. (1967). Explanatory and pragmatic attitudes in therapeutic trials, Journal of Chronic Diseases 20, 637–648.
[12] Sheiner, L.B. & Rubin, D.B. (1995). Intention-to-treat analysis and the goals of clinical trials, Clinical Pharmacology and Therapeutics 57, 6–15.

(See also Dropouts in Longitudinal Data; Dropouts in Longitudinal Studies: Methods of Analysis)

GEERT MOLENBERGHS
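The LOCF scheme described in the entry above can be sketched in a few lines; the measurement series below is hypothetical, with None marking a missed visit:

```python
# Sketch of last observation carried forward (LOCF): a missing visit (None)
# is imputed with the most recent observed value for that patient.

def locf(measurements):
    filled, last = [], None
    for value in measurements:
        if value is not None:
            last = value
        filled.append(last)
    return filled

print(locf([4.2, None, 5.1, None, None]))  # [4.2, 4.2, 5.1, 5.1, 5.1]
```

Note that values missing before any observation exists remain missing, and that the scheme freezes a patient's trajectory at dropout, which is exactly the assumption the likelihood-based alternatives cited in the entry seek to avoid.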
Interaction Effects
LEONA S. AIKEN AND STEPHEN G. WEST
Volume 2, pp. 929–933
Table 1 Cell means, marginal means (row and column means), and grand mean for performance on a statistics examination as a function of study guide and review session

                     Table 1(a): No interaction      Table 1(b): Interaction

                     Study guide                     Study guide
Review session       Yes     No     Row mean        Yes     No     Row mean
Yes                  90      70     80              80      70     75
No                   60      40     50              90      40     65
Column mean          75      55     65              85      55     70
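The additive structure of Table 1(a), and its failure in Table 1(b), can be checked directly from the cell means: a cell residual is the cell mean minus the value predicted from the row mean, column mean, and grand mean. A brief illustrative sketch (the function name is hypothetical; it is not part of the original entry) using the cell means above:

```python
# Illustrative check of Table 1: cell residuals (cell mean minus the
# additive prediction from row and column effects) are all zero when
# there is no interaction, and nonzero when there is one.

def residuals(cells):
    """cells: 2x2 cell means; rows = review session (yes, no),
    columns = study guide (yes, no). Balanced design assumed, so the
    grand mean is the mean of the row means."""
    row_means = [sum(r) / len(r) for r in cells]
    col_means = [sum(c) / len(c) for c in zip(*cells)]
    grand = sum(row_means) / len(row_means)
    return [[cells[i][j] - (row_means[i] + col_means[j] - grand)
             for j in range(len(col_means))]
            for i in range(len(row_means))]

table_1a = [[90, 70], [60, 40]]   # Table 1(a): no interaction
table_1b = [[80, 70], [90, 40]]   # Table 1(b): interaction

print(residuals(table_1a))  # [[0.0, 0.0], [0.0, 0.0]]
print(residuals(table_1b))  # [[-10.0, 10.0], [10.0, -10.0]]
```

For Table 1(a), every cell equals its additive prediction (e.g., 90 = 80 + 75 - 65); for Table 1(b), each cell misses its prediction by 10 points, the pure interaction effect discussed below.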
review session collapsed over study guide. The grand mean (65 in Table 1(a)) is average performance over all four cells.

Interactions as Conditional Effects

Conditional effects are the effects of one variable at a particular level of another variable. The effect of the review session when a study guide is also given is one conditional effect; the effect of the review session without a study guide is a second conditional effect. When there is no interaction, as in Table 1(a), the conditional effects of a variable are constant over all levels of the other variable (here, the constant 30-point gain from the review session). If there is an interaction, the conditional effects of one variable differ across values of the other variable. As we have already seen in Table 1(b), the effect of the review session changes dramatically depending on whether a study guide has been given: a 30-point gain when there is no study guide versus a 10-point loss in the presence of a study guide. One variable is said to be a moderator of the effect of the other variable, or to moderate the effect of the other variable; here we say that the study guide moderates the effect of the review session.

Interactions as Nonadditive Effects

Nonadditive effects signify that the combination of two or more variables does not produce an outcome that is the sum of their individual effects. First, consider Table 1(a), associated with Figure 1(a) (no interaction); here, the cell means are additive effects of the two variables. In Table 1(a), the row means tell us that the average effect of the review session is a 30-point gain, from 50 to 80; the column means tell us that the average effect of the study guide is 20 points, from 55 to 75. With neither study guide nor review session, the cell mean is 40; introducing the review session yields a 30-point gain to a cell mean of 70; then, introducing the study guide yields another 20-point gain to a cell mean of 90. Table 1(b), associated with Figure 1(b) (interaction), contains nonadditive effects. The row means show a 10-point average gain from the review session, from 65 to 75; the column means show a 30-point gain from the study guide, from 55 to 85. However, the cell means do not follow the pattern of the marginal means. With neither study guide nor review session, the cell mean is 40. Introducing the study guide yields a 50-point gain to 90, and not the 30-point gain expected from the marginal mean; then, introducing the review session on top of the study guide yields a loss of 10 points, rather than the gain of 10 points expected from the marginal means. The unique combinations of effects represented by the cells do not follow the marginal means.

Interactions as Cell Residuals

The characterization of interactions as cell residuals [8] follows from the analysis of variance framework [5, 6]. By cell residual is meant the discrepancy between the cell mean and the grand mean that would not be expected from the additive effects of each variable. When there is an interaction between variables, the cell residuals are nonzero and are pure measures of the amount of interaction. When there is no interaction between variables, the cell residuals are all zero.

Types of Interactions by Variable (Categorical and Continuous)

Categorical by Categorical Interactions

Thus far, our discussion of interactions is in terms of variables that take on discrete values, categorical
by categorical interactions, as in the factors in the analysis of variance (ANOVA) framework. In the ANOVA framework, the conditional effect of one variable at a value of another variable, for example, the effect of the review session when there is no study guide, is referred to as a simple main effect. See [5] and [6] for complete treatments of interactions in the ANOVA framework.

Categorical by Continuous Variable Interactions

We can also characterize interactions between categorical and continuous variables. To continue our example, suppose we measure the mathematics ability of each student on a continuous scale. We can examine whether mathematics ability interacts with having a review session in producing performance on a statistics examination. Figure 2(a) illustrates an interaction between these variables. For students who do not receive the review session, there is a strong positive relationship between mathematics ability and performance on the statistics examination. However, the review session has a compensatory effect for weaker students. When students receive a review session, there is a much-reduced relationship between mathematics ability and performance; the weaker students catch up with the stronger students. Put another way, the effect of mathematics ability on performance is conditional on whether or not the instructor provides a review session. An introduction to the categorical by continuous variable interaction is given in [1], with an extensive treatment in [10].

[Figure 2(a), "Categorical by continuous variable interaction" (vertical axis: statistics exam scores), appears here; its caption notes that the review session weakens the positive effect of ability on performance.]

Continuous by Continuous Variable Interactions

Finally, two or more continuous variables may interact. Suppose we have a continuous measure of motivation to succeed. Motivation may interact with mathematics ability, as shown in Figure 2(b). The relationship of ability to performance is illustrated for three values of motivation along a motivation continuum. The effect of ability becomes increasingly more positive as motivation increases; with low motivation, ability does not matter. The effect of ability is conditional on the strength of motivation; put another way, motivation moderates the effect of ability on performance.

Crossover versus Noncrossover Interactions

Crossover interactions (or disordinal interactions) are ones in which the direction of effect of one variable reverses as a function of the variable with which it interacts. Figure 1(b) illustrates a crossover interac-

Interactions Beyond the ANOVA and Multip