
The Robust Beauty of Improper Linear Models

in Decision Making

ROBYN M. DAWES University of Oregon

ABSTRACT: Proper linear models are those in which predictor variables are given weights in such a way that the resulting linear composite optimally predicts some criterion of interest; examples of proper linear models are standard regression analysis, discriminant function analysis, and ridge regression analysis. Research summarized in Paul Meehl's book on clinical versus statistical prediction—and a plethora of research stimulated in part by that book—all indicates that when a numerical criterion variable (e.g., graduate grade point average) is to be predicted from numerical predictor variables, proper linear models outperform clinical intuition. Improper linear models are those in which the weights of the predictor variables are obtained by some nonoptimal method; for example, they may be obtained on the basis of intuition, derived from simulating a clinical judge's predictions, or set to be equal. This article presents evidence that even such improper linear models are superior to clinical intuition when predicting a numerical criterion from numerical predictors. In fact, unit (i.e., equal) weighting is quite robust for making such predictions. The article discusses, in some detail, the application of unit weights to decide what bullet the Denver Police Department should use. Finally, the article considers commonly raised technical, psychological, and ethical resistances to using linear models to make important social decisions and presents arguments that could weaken these resistances.

Work on this article was started at the University of Oregon and Decision Research, Inc., Eugene, Oregon; it was completed while I was a James McKeen Cattell Sabbatical Fellow at the Psychology Department at the University of Michigan and at the Research Center for Group Dynamics at the Institute for Social Research there. I thank all these institutions for their assistance, and I especially thank my friends at them who helped. This article is based in part on invited talks given at the American Psychological Association (August 1977), the University of Washington (February 1978), the Aachen Technological Institute (June 1978), the University of Groningen (June 1978), the University of Amsterdam (June 1978), the Institute for Social Research at the University of Michigan (September 1978), Miami University, Oxford, Ohio (November 1978), and the University of Chicago School of Business (January 1979). I received valuable feedback from most of the audiences. Requests for reprints should be sent to Robyn M. Dawes, Department of Psychology, University of Oregon, Eugene, Oregon 97403.

Paul Meehl's (1954) book Clinical Versus Statistical Prediction: A Theoretical Analysis and a Review of the Evidence appeared 25 years ago. It reviewed studies indicating that the prediction of numerical criterion variables of psychological interest (e.g., faculty ratings of graduate students who had just obtained a PhD) from numerical predictor variables (e.g., scores on the Graduate Record Examination, grade point averages, ratings of letters of recommendation) is better done by a proper linear model than by the clinical intuition of people presumably skilled in such prediction. The point of this article is to review evidence that even improper linear models may be superior to clinical predictions.

A proper linear model is one in which the weights given to the predictor variables are chosen in such a way as to optimize the relationship between the prediction and the criterion. Simple regression analysis is the most common example of a proper linear model; the predictor variables are weighted in such a way as to maximize the correlation between the subsequent weighted composite and the actual criterion. Discriminant function analysis is another example of a proper linear model; weights are given to the predictor variables in such a way that the resulting linear composites maximize the discrepancy between two or more groups. Ridge regression analysis, another example (Darlington, 1978; Marquardt & Snee, 1975), attempts to assign weights in such a way that the linear composites correlate maximally with the criterion of interest in a new set of data.

Thus, there are many types of proper linear models and they have been used in a variety of contexts. One example (Dawes, 1971) was presented in this Journal; it involved the prediction of faculty ratings of graduate students.

All graduate students at the University of Oregon's Psychology Department who had been admitted between the fall of 1964 and the fall of 1967—and who had not dropped out of the program for nonacademic reasons (e.g., psychosis or marriage)—were rated by the faculty in the spring of 1969; faculty members rated only students whom they felt comfortable rating. The following rating scale was used: 5, outstanding; 4, above average; 3, average; 2, below average; 1, dropped out of the program in academic difficulty. Such overall ratings constitute a psychologically interesting criterion because the subjective impressions of faculty members are the main determinants of the job (if any) a student obtains after leaving graduate school. A total of 111 students were in the sample; the number of faculty members rating each of these students ranged from 1 to 20, with the mean number being 5.67 and the median being 5. The ratings were reliable. (To determine the reliability, the ratings were subjected to a one-way analysis of variance in which each student being rated was regarded as a treatment. The resulting between-treatments variance ratio (η²) was .67, and it was significant beyond the .001 level.) These faculty ratings were predicted from a proper linear model based on the student's Graduate Record Examination (GRE) score, the student's undergraduate grade point average (GPA), and a measure of the selectivity of the student's undergraduate institution.¹ The cross-validated multiple correlation between the faculty ratings and predictor variables was .38. Congruent with Meehl's results, the correlation of these latter faculty ratings with the average rating of the people on the admissions committee who selected the students was .19;² that is, it accounted for one fourth as much variance. This example is typical of those found in psychological research in this area in that (a) the correlation with the model's predictions is higher than the correlation with clinical prediction, but (b) both correlations are low. These characteristics often lead psychologists to interpret the findings as meaning that while the low correlation of the model indicates that linear modeling is deficient as a method, the even lower correlation of the judges indicates only that the wrong judges were used.

¹ This index was based on Cass and Birnbaum's (1968) rating of selectivity given at the end of their book Comparative Guide to American Colleges. The verbal categories of selectivity were given numerical values according to the following rule: most selective, 6; highly selective, 5; very selective (+), 4; very selective, 3; selective, 2; not mentioned, 1.

² Unfortunately, only 23 of the 111 students could be used in this comparison because the rating scale the admissions committee used changed slightly from year to year.

An improper linear model is one in which the weights are chosen by some nonoptimal method. They may be chosen to be equal, they may be chosen on the basis of the intuition of the person making the prediction, or they may be chosen at random. Nevertheless, improper models may have great utility. When, for example, the standardized GREs, GPAs, and selectivity indices in the previous example were weighted equally, the resulting linear composite correlated .48 with later faculty rating. Not only is the correlation of this linear composite higher than that with the clinical judgment of the admissions committee (.19), it is also higher than that obtained upon cross-validating the weights obtained from half the sample.
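The equal-weighting recipe just described is easy to state as a procedure: standardize each predictor so that incommensurable scales become comparable, orient each so that higher values favor the criterion, and add. The following Python sketch illustrates it with made-up applicant numbers; the variable ranges are merely plausible stand-ins, not the Oregon data.

```python
import numpy as np

def unit_weighted_composite(X):
    """Improper linear model with equal (unit) weights: standardize
    each predictor column, then add the columns up. Assumes every
    predictor is already oriented so that higher = better."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    return Z.sum(axis=1)

# Illustrative applicant pool: columns are GRE, GPA, and the 1-6
# selectivity index (all values synthetic, not the Oregon sample).
rng = np.random.default_rng(0)
X = np.column_stack([
    rng.normal(600, 80, 111),                      # GRE scores
    np.clip(rng.normal(3.5, 0.3, 111), 2.0, 4.0),  # GPAs
    rng.integers(1, 7, 111).astype(float),         # selectivity ratings
])
ranking = np.argsort(-unit_weighted_composite(X))  # best applicants first
```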
An example of an improper model that might be of somewhat more interest—at least to the general public—was motivated by a physician who was on a panel with me concerning predictive systems. Afterward, at the bar with his wife and me, he said that my paper might be of some interest to my colleagues, but success in graduate school in psychology was not of much general interest: "Could you, for example, use one of your improper linear models to predict how well my wife and I get along together?" he asked. I realized that I could—or might. At that time, the Psychology Department at the University of Oregon was engaged in sex research, most of which was behavioristically oriented. So the subjects of this research monitored when they made love, when they had fights, when they had social engagements (e.g., with in-laws), and so on. These subjects also made subjective ratings about how happy they were in their marital or coupled situation. I immediately thought of an improper linear model to predict self-ratings of marital happiness: rate of lovemaking minus rate of fighting. My colleague John Howard had collected just such data on couples when he was an undergraduate at the University of Missouri—Kansas City, where he worked with Alexander (1971). After establishing the intercouple reliability of judgments of lovemaking and fighting, Alexander had one partner from each of 42 couples monitor these events. She allowed us to analyze her data, with the following results: "In the thirty happily married



couples (as reported by the monitoring partner) only two argued more often than they had intercourse. All twelve of the unhappily married couples argued more often" (Howard & Dawes, 1976, p. 478). We then replicated this finding at the University of Oregon, where 27 monitors rated happiness on a 7-point scale, from "very unhappy" to "very happy," with a neutral midpoint. The correlation of rate of lovemaking minus rate of arguments with these ratings of marital happiness was .40 (p < .05); neither variable alone was significant. The findings were replicated in Missouri by D. D. Edwards and Edwards (1977) and in Texas by Thornton (1977), who found a correlation of .81 (p < .01) between the sex-argument difference and self-rating of marital happiness among 28 new couples. (The reason for this much higher correlation might be that Thornton obtained the ratings of marital happiness after, rather than before, the subjects monitored their lovemaking and fighting; in fact, one subject decided to get a divorce after realizing that she was fighting more than loving; Thornton, Note 1.) The conclusion is that if we love more than we hate, we are happy; if we hate more than we love, we are miserable. This conclusion is not very profound, psychologically or statistically. The point is that this very crude improper linear model predicts a very important variable: judgments about marital happiness.

The bulk (in fact, all) of the literature since the publication of Meehl's (1954) book supports his generalization about proper models versus intuitive clinical judgment. Sawyer (1966) reviewed a plethora of these studies, and some of these studies were quite extensive (cf. Goldberg, 1965). Some 10 years after his book was published, Meehl (1965) was able to conclude, however, that there was only a single example showing clinical judgment to be superior, and this conclusion was immediately disputed by Goldberg (1968) on the grounds that even the one example did not show such superiority. Holt (1970) criticized details of several studies, and he even suggested that prediction as opposed to understanding may not be a very important part of clinical judgment. But a search of the literature fails to reveal any studies in which clinical judgment has been shown to be superior to statistical prediction when both are based on the same codable input variables. And though most nonpositivists would agree that understanding is not synonymous with prediction, few would agree that it doesn't entail some ability to predict.

Why? Because people—especially the experts in a field—are much better at selecting and coding information than they are at integrating it.

But people are important. The statistical model may integrate the information in an optimal manner, but it is always the individual (judge, clinician, subject) who chooses variables. Moreover, it is the human judge who knows the directional relationship between the predictor variables and the criterion of interest, or who can code the variables in such a way that they have clear directional relationships. And it is in precisely the situation where the predictor variables are good and where they have a conditionally monotone relationship with the criterion that proper linear models work well.³

³ Relationships are conditionally monotone when variables can be scaled in such a way that higher values on each predict higher values on the criterion. This condition is the combination of two more fundamental measurement conditions: (a) independence (the relationship between each variable and the criterion is independent of the values on the remaining variables) and (b) monotonicity (the ordinal relationship is one that is monotone). (See Krantz, 1972; Krantz, Luce, Suppes, & Tversky, 1971.) The true relationships need not be linear for linear models to work; they must merely be approximated by linear models. It is not true that "in order to compute a correlation coefficient between two variables the relationship between them must be linear" (advice found in one introductory statistics text). In the first place, it is always possible to compute something.
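For readers who prefer a symbolic statement, the condition in footnote 3 can be put as follows; the expectation form is a paraphrase, not Krantz's axiomatization.

```latex
% Conditional monotonicity, stated informally: after suitable
% rescaling of each predictor X_j,
\text{for every } j \text{ and every fixed } x_{-j}:\quad
a \;\mapsto\; \mathbb{E}\bigl[\,Y \mid X_j = a,\ X_{-j} = x_{-j}\,\bigr]
\text{ is nondecreasing in } a .
```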
The linear model cannot replace the expert in deciding such things as "what to look for," but it is precisely this knowledge of what to look for in reaching the decision that is the special expertise people have. Even in as complicated a judgment as making a chess move, it is the ability to code the board in an appropriate way to "see" the proper moves that distinguishes the grand master from the expert from the novice (deGroot, 1965; Simon & Chase, 1973). It is not in the ability to integrate information that people excel (Slovic, Note 2). Again, the chess grand master considers no more moves than does the expert; he just knows which ones to look at. The distinction between knowing what to look for and the ability to integrate information is perhaps best illustrated in a study by Einhorn (1972). Expert doctors coded biopsies of patients with Hodgkin's disease and then made an overall rating of the severity of the process. The overall rating did not predict survival time of the 193 patients, all of whom



died. (The correlations of rating with survival time were all virtually 0, some in the wrong direction.) The variables that the doctors coded did, however, predict survival time when they were used in a multiple regression model.

In summary, proper linear models work for a very simple reason. People are good at picking out the right predictor variables and at coding them in such a way that they have a conditionally monotone relationship with the criterion. People are bad at integrating information from diverse and incomparable sources. Proper linear models are good at such integration when the predictions have a conditionally monotone relationship to the criterion.

Consider, for example, the problem of comparing one graduate applicant with GRE scores of 750 and an undergraduate GPA of 3.3 with another with GRE scores of 680 and an undergraduate GPA of 3.7. Most judges would agree that these indicators of aptitude and previous accomplishment should be combined in some compensatory fashion, but the question is how to compensate. Many judges attempting this feat have little knowledge of the distributional characteristics of GREs and GPAs, and most have no knowledge of studies indicating their validity as predictors of graduate success. Moreover, these numbers are inherently incomparable without such knowledge, GREs running from 500 to 800 for viable applicants, and GPAs from 3.0 to 4.0. Is it any wonder that a statistical weighting scheme does better than a human judge in these circumstances?

Suppose now that it is not possible to construct a proper linear model in some situation. One reason we may not be able to do so is that our sample size is inadequate. In multiple regression, for example, b weights are notoriously unstable; the ratio of observations to predictors should be as high as 15 or 20 to 1 before b weights, which are the optimal weights, do better on cross-validation than do simple unit weights. Schmidt (1971), Goldberg (1972), and Claudy (1972) have demonstrated this need empirically through computer simulation, and Einhorn and Hogarth (1975) and Srinivasan (Note 3) have attacked the problem analytically. The general solution depends on a number of parameters such as the multiple correlation in the population and the covariance pattern between predictor variables. But the applied implication is clear. Standard regression analysis cannot be used in situations where there is not a "decent" ratio of observations to predictors.
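A small simulation can make the 15-or-20-to-1 point concrete. The sketch below is in the spirit of the Schmidt and Claudy computer studies but is not a reproduction of their designs; the data-generating assumptions (uncorrelated predictors, all-positive true weights, modest signal-to-noise) are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)

def mean_cross_validated_r(n, p, trials=1000):
    """Average held-out correlation for fitted b weights vs. unit
    weights, with n training observations and p predictors."""
    r_b, r_unit = 0.0, 0.0
    for _ in range(trials):
        w = rng.uniform(0.5, 1.5, p)          # true weights, all positive
        X_tr = rng.normal(size=(n, p))
        X_te = rng.normal(size=(n, p))
        y_tr = X_tr @ w + rng.normal(0, 2.0, n)
        y_te = X_te @ w + rng.normal(0, 2.0, n)
        b, *_ = np.linalg.lstsq(X_tr, y_tr, rcond=None)  # fitted b weights
        r_b += np.corrcoef(X_te @ b, y_te)[0, 1]
        r_unit += np.corrcoef(X_te.sum(axis=1), y_te)[0, 1]
    return r_b / trials, r_unit / trials

print(mean_cross_validated_r(n=30, p=6))   # 5:1 ratio: unit weights tend to win
print(mean_cross_validated_r(n=120, p=6))  # 20:1 ratio: b weights pull ahead
```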
Another situation in which proper linear models cannot be used is that in which there are no measurable criterion variables. We might, nevertheless, have some idea about what the important predictor variables would be and the direction they would bear to the criterion if we were able to measure the criterion. For example, when deciding which students to admit to graduate school, we would like to predict some future long-term variable that might be termed "professional self-actualization." We have some idea what we mean by this concept, but no good, precise definition as yet. (Even if we had one, it would be impossible to conduct the study using records from current students, because that variable could not be assessed until at least 20 years after the students had completed their doctoral work.) We do, however, know that in all probability this criterion is positively related to intelligence, to past accomplishments, and to ability to snow one's colleagues. In our applicants' files, GRE scores assess the first variable; undergraduate GPA, the second; and letters of recommendation, the third. Might we not, then, wish to form some sort of linear combination of these variables in order to assess our applicants' potentials? Given that we cannot perform a standard regression analysis, is there nothing to do other than fall back on unaided intuitive integration of these variables when we assess our applicants?

One possible way of building an improper linear model is through the use of bootstrapping (Dawes & Corrigan, 1974; Goldberg, 1970). The process is to build a proper linear model of an expert's judgments about an outcome criterion and then to use that linear model in place of the judge. That such linear models can be accurate in predicting experts' judgments has been pointed out in the psychological literature by Hammond (1955) and Hoffman (1960). (This work was anticipated by 32 years by the late Henry Wallace, Vice-President under Roosevelt, in a 1923 agricultural article suggesting the use of linear models to analyze "what is on the corn judge's mind.") In his influential article, Hoffman termed the use of linear models a paramorphic representation of judges, by which he meant that the judges' psychological processes did not involve computing an implicit or explicit weighted average of input variables, but that it could be simulated by such a weighting. Paramorphic representations have been extremely successful (for reviews see Dawes & Corrigan, 1974; Slovic & Lichtenstein, 1971) in contexts in



which predictor variables have conditionally monotone relationships to criterion variables.
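As a concrete sketch of the process just described: fit an ordinary least-squares model to the judge's own ratings, then score new cases with the fitted weights. Everything in the example below (cue structure, implicit weights, noise level) is hypothetical.

```python
import numpy as np

def bootstrap_judge(cues, judgments):
    """Fit a linear model of the judge's own judgments (a paramorphic
    representation) and return it as a prediction function."""
    A = np.column_stack([cues, np.ones(len(cues))])    # cues plus intercept
    w, *_ = np.linalg.lstsq(A, judgments, rcond=None)  # least squares fit
    return lambda c: np.column_stack([c, np.ones(len(c))]) @ w

# Hypothetical illustration: a noisy judge who implicitly weights four
# standardized cues. The fitted model applies those implicit weights
# consistently, with the judge's trial-to-trial noise stripped out.
rng = np.random.default_rng(2)
cues = rng.normal(size=(100, 4))
judgments = cues @ np.array([0.5, 0.3, 0.1, 0.1]) + rng.normal(0, 0.3, 100)
model = bootstrap_judge(cues, judgments)
predictions_for_new_cases = model(rng.normal(size=(10, 4)))
```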
The bootstrapping models make use of the weights derived from the judges; because these weights are not derived from the relationship between the predictor and criterion variables themselves, the resulting linear models are improper. Yet these paramorphic representations consistently do better than the judges from which they were derived (at least when the evaluation of goodness is in terms of the correlation between predicted and actual values).

Bootstrapping has turned out to be pervasive. For example, in a study conducted by Wiggins and Kohen (1971), psychology graduate students at the University of Illinois were presented with 10 background, aptitude, and personality measures describing other (real) Illinois graduate students in psychology and were asked to predict these students' first-year graduate GPAs. Linear models of every one of the University of Illinois judges did a better job than did the judges themselves in predicting actual grade point averages. This result was replicated in a study conducted in conjunction with Wiggins, Gregory, and Diller (cited in Dawes & Corrigan, 1974). Goldberg (1970) demonstrated it for 26 of 29 clinical psychology judges predicting psychiatric diagnosis of neurosis or psychosis from Minnesota Multiphasic Personality Inventory (MMPI) profiles, and Dawes (1971) found it in the evaluation of graduate applicants at the University of Oregon. The one published exception to the success of bootstrapping of which I am aware was a study conducted by Libby (1976). He asked 16 loan officers from relatively small banks (located in Champaign-Urbana, Illinois, with assets between $3 million and $56 million) and 27 loan officers from large banks (located in Philadelphia, with assets between $.6 billion and $4.4 billion) to judge which 30 of 60 firms would go bankrupt within three years after their financial statements. The loan officers requested five financial ratios on which to base their judgments (e.g., the ratio of present assets to total assets). On the average, the loan officers correctly categorized 44.4 businesses (74%) as either solvent or future bankruptcies, but on the average, the paramorphic representations of the loan officers could correctly classify only 43.3 (72%). This difference turned out to be statistically significant, and Libby concluded that he had an example of a situation where bootstrapping did not work—perhaps because his judges were highly skilled experts attempting to predict a highly reliable criterion. Goldberg (1976), however, noted that many of the ratios had highly skewed distributions, and he reanalyzed Libby's data, normalizing the ratios before building models of the loan officers. Libby found 77% of his officers to be superior to their paramorphic representations, but Goldberg, using his rescaled predictor variables, found the opposite; 72% of the models were superior to the judges from whom they were derived.⁴

⁴ It should be pointed out that a proper linear model does better than either loan officers or their paramorphic representations. Using the same task, Beaver (1966) and Deacon (1972) found that linear models predicted with about 78% accuracy on cross-validation. But I can't resist pointing out that the simplest possible improper model of them all does best. The ratio of assets to liabilities (!) correctly categorizes 48 (80%) of the cases studied by Libby.

Why does bootstrapping work? Bowman (1963), Goldberg (1970), and Dawes (1971) all maintained that its success arises from the fact that a linear model distills underlying policy (in the implicit weights) from otherwise variable behavior (e.g., judgments affected by context effects or extraneous variables).

Belief in the efficacy of bootstrapping was based on the comparison of the validity of the linear model of the judge with the validity of his or her judgments themselves. This is only one of two logically possible comparisons. The other is the validity of the linear model of the judge versus the validity of linear models in general; that is, to demonstrate that bootstrapping works because the linear model catches the essence of the judge's valid expertise while eliminating unreliability, it is necessary to demonstrate that the weights obtained from an analysis of the judge's behavior are superior to those that might be obtained in other ways, for example, randomly. Because both the model of the judge and the model obtained randomly are perfectly reliable, a comparison of the random model with the judge's model permits an evaluation of the judge's underlying linear representation, or policy. If the random model does equally well, the judge would not be "following valid principles but following them poorly" (Dawes, 1971, p. 182), at least not principles any more valid than any others that weight variables in the appropriate direction.



Table 1 presents five studies summarized by Dawes and Corrigan (1974) in which validities (i.e., correlations) obtained by various methods were compared.

TABLE 1
Correlations Between Predictions and Criterion Values

Example                                          Average     Average      Average       Validity of   Cross-validity  Validity of
                                                 validity    validity of  validity of   equal-        of regression   optimal
                                                 of judge    judge model  random model  weighting     analysis        linear model
                                                                                        model
Prediction of neurosis vs. psychosis               .28         .31          .30           .34            .46             .46
Illinois students' predictions of GPA              .33         .50          .51           .60            .57             .69
Oregon students' predictions of GPA                .37         .43          .51           .60            .57             .69
Prediction of later faculty ratings at Oregon      .19         .25          .39           .48            .38             .54
Yntema & Torgerson's (1961) experiment             .84         .89          .84           .97                            .97

Note. GPA = grade point average.

In the first study, a pool of 861 psychiatric patients took the MMPI in various hospitals; they were later categorized as neurotic or psychotic on the basis of more extensive information. The MMPI profiles consist of 11 scores, each of which represents the degree to which the respondent answers questions in a manner similar to patients suffering from a well-defined form of psychopathology. A set of 11 scores is thus associated with each patient, and the problem is to predict whether a later diagnosis will be psychosis (coded 1) or neurosis (coded 0). Twenty-nine clinical psychologists "of varying experience and training" (Goldberg, 1970, p. 425) were asked to make this prediction on an 11-step forced-normal distribution. The second two studies concerned 90 first-year graduate students in the Psychology Department of the University of Illinois who were evaluated on 10 variables that are predictive of academic success. These variables included aptitude test scores, college GPA, various peer ratings (e.g., extraversion), and various self-ratings (e.g., conscientiousness). A first-year GPA was computed for all these students. The problem was to predict the GPA from the 10 variables. In the second study this prediction was made by 80 (other) graduate students at the University of Illinois (Wiggins & Kohen, 1971), and in the third study this prediction was made by 41 graduate students at the University of Oregon. The details of the fourth study have already been covered; it is the one concerned with the prediction of later faculty ratings at Oregon. The final study (Yntema & Torgerson, 1961) was one in which experimenters assigned values to ellipses presented to the subjects, on the basis of the figures' size, eccentricity, and grayness. The formula used was ij + jk + ik, where i, j, and k refer to values on the three dimensions just mentioned. Subjects in this experiment were asked to estimate the value of each ellipse and were presented with outcome feedback at the end of each trial. The problem was to predict the true (i.e., experimenter-assigned) value of each ellipse on the basis of its size, eccentricity, and grayness.

The first column of Table 1 presents the average validity of the judges in these studies, and the second presents the average validity of the paramorphic model of these judges. In all cases, bootstrapping worked. But then what Corrigan and I constructed were random linear models, that is, models in which weights were randomly chosen except for sign and were then applied to standardized variables:⁵

The sign of each variable was determined on an a priori basis so that it would have a positive relationship to the criterion. Then a normal deviate was selected at random from a normal distribution with unit variance, and the absolute value of this deviate was used as a weight for the variable. Ten thousand such models were constructed for each example. (Dawes & Corrigan, 1974, p. 102)

⁵ Unfortunately, Dawes and Corrigan did not spell out in detail that these variables must first be standardized and that the result is a standardized dependent variable. Equal or random weighting of incomparable variables—for example, GRE score and GPA—without prior standardization would be nonsensical.
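The quoted construction translates almost line for line into code. This sketch uses synthetic stand-in data, since the original cue sets are not reproduced here; only the weighting procedure follows Dawes and Corrigan.

```python
import numpy as np

rng = np.random.default_rng(3)

def random_linear_model(Z, signs):
    """One Dawes-Corrigan random model: for each predictor, take the
    absolute value of a unit normal deviate as the weight, keep the
    a priori sign, and form the weighted sum."""
    return Z @ (np.abs(rng.normal(size=Z.shape[1])) * signs)

# Synthetic standardized predictors and criterion (stand-in data).
n, m = 200, 5
Z = rng.normal(size=(n, m))
y = Z @ np.array([0.4, 0.3, 0.3, 0.2, 0.2]) + rng.normal(0, 1.5, n)
signs = np.ones(m)                    # all cues scored positively a priori

r_random = np.mean([np.corrcoef(random_linear_model(Z, signs), y)[0, 1]
                    for _ in range(10_000)])  # ten thousand models
r_equal = np.corrcoef(Z @ signs, y)[0, 1]     # unit-weighted model
# On average r_equal exceeds r_random, as footnote 6 below guarantees.
```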
On the average, these random linear models perform about as well as the paramorphic models of the judges; these averages are presented in the third column of the table. Equal-weighting models, presented in the fourth column, do even better. (There is a mathematical reason why equal-weighting models must outperform the average random model.⁶) Finally, the last two columns present



the cross-validated validity of the standard regression model and the validity of the optimal linear model.

⁶ Consider a set of standardized variables X₁, X₂, ..., X_M, each of which is positively correlated with a standardized variable Y. The correlation of the average of the Xs with Y is equal to the correlation of the sum of the Xs with Y. The covariance of this sum with Y is equal to

$$\frac{1}{n}\sum_{i} y_i \left(x_{i1} + x_{i2} + \cdots + x_{iM}\right) = \frac{1}{n}\sum_{i} y_i x_{i1} + \cdots + \frac{1}{n}\sum_{i} y_i x_{iM} = r_1 + r_2 + \cdots + r_M$$

(the sum of the correlations). The variance of Y is 1, and the variance of the sum of the Xs is $M + M(M-1)\bar{r}$, where $\bar{r}$ is the average intercorrelation between the Xs. Hence, the correlation of the average of the Xs with Y is $\bigl(\sum r_i\bigr)/\bigl(M + M(M-1)\bar{r}\bigr)^{1/2}$, which is greater than $\bigl(\sum r_i\bigr)/\bigl(M + M^2 - M\bigr)^{1/2} = \bigl(\sum r_i\bigr)/M$, the average correlation. Because each of the random models is positively correlated with the criterion, the correlation of their average, which is the unit-weighted model, is higher than the average of the correlations.

Essentially the same results were obtained when the weights were selected from a rectangular distribution. Why? Because linear models are robust over deviations from optimal weighting. In other words, the bootstrapping finding, at least in these studies, has simply been a reaffirmation of the earlier finding that proper linear models are superior to human judgments—the weights derived from the judges' behavior being sufficiently close to the optimal weights that the outputs of the models are highly similar. The solution to the problem of obtaining optimal weights is one that—in terms of von Winterfeldt and Edwards (Note 4)—has a "flat maximum." Weights that are near to optimal level produce almost the same output as do optimal beta weights. Because the expert judge knows at least something about the direction of the variables, his or her judgments yield weights that are nearly optimal (but note that in all cases equal weighting is superior to models based on judges' behavior).

The fact that different linear composites correlate highly with each other was first pointed out 40 years ago by Wilks (1938). He considered only situations in which there was positive correlation between predictors. This result seems to hold generally as long as these intercorrelations are not negative; for example, the correlation between X + 2Y and 2X + Y is .80 when X and Y are uncorrelated. The ways in which outputs are relatively insensitive to changes in coefficients (provided changes in sign are not involved) have been investigated most recently by Green (1977), Wainer (1976), Wainer and Thissen (1976), W. M. Edwards (1978), and Gardiner and Edwards (1975).
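As a quick check of the .80 figure (assuming, as the example implies, that X and Y are standardized):

```latex
% X and Y uncorrelated with unit variance:
\mathrm{Cov}(X + 2Y,\; 2X + Y) = 2\,\mathrm{Var}(X) + 2\,\mathrm{Var}(Y) = 4,
\qquad \mathrm{Var}(X + 2Y) = \mathrm{Var}(2X + Y) = 5,
\qquad r = \frac{4}{\sqrt{5}\,\sqrt{5}} = .80 .
```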
Dawes and Corrigan (1974, p. 105) concluded that "the whole trick is to know what variables to look at and then know how to add." That principle is well illustrated in the following study, conducted since the Dawes and Corrigan article was published. In it, Hammond and Adelman (1976) both investigated and influenced the decision about what type of bullet should be used by the Denver City Police, a decision having much more obvious social impact than most of those discussed above. To quote Hammond and Adelman (1976):

In 1974, the Denver Police Department (DPD), as well as other police departments throughout the country, decided to change its handgun ammunition. The principal reason offered by the police was that the conventional round-nosed bullet provided insufficient "stopping effectiveness" (that is, the ability to incapacitate and thus to prevent the person shot from firing back at a police officer or others). The DPD chief recommended (as did other police chiefs) the conventional bullet be replaced by a hollow-point bullet. Such bullets, it was contended, flattened on impact, thus decreasing penetration, increasing stopping effectiveness, and decreasing ricochet potential. The suggested change was challenged by the American Civil Liberties Union, minority groups, and others. Opponents of the change claimed that the new bullets were nothing more than outlawed "dum-dum" bullets, that they created far more injury than the round-nosed bullet, and should, therefore, be barred from use. As is customary, judgments on this matter were formed privately and then defended publicly with enthusiasm and tenacity, and the usual public hearings were held. Both sides turned to ballistics experts for scientific information and support. (p. 392)

The disputants focused on evaluating the merits of specific bullets—confounding the physical effect of the bullets with the implications for social policy; that is, rather than separating questions of what it is the bullet should accomplish (the social policy question) from questions concerning ballistic characteristics of specific bullets, advocates merely argued for one bullet or another. Thus, as Hammond and Adelman pointed out, social policymakers inadvertently adopted the role of (poor) ballistics experts, and vice versa. What Hammond and Adelman did was to discover the important policy dimensions from the policymakers, and then they had the ballistics experts rate the bullets with respect to these dimensions. These dimensions turned out to be stopping effectiveness (the probability that someone hit in the torso could not return fire), probability of serious injury, and probability of harm to bystanders.



When the ballistics experts rated the bullets with respect to these dimensions, it turned out that the last two were almost perfectly confounded, but they were not perfectly confounded with the first. Bullets do not vary along a single dimension that confounds effectiveness with lethalness. The probability of serious injury or harm to bystanders is highly related to the penetration of the bullet, whereas the probability of the bullet's effectively stopping someone from returning fire is highly related to the width of the entry wound. Since policymakers could not agree about the weights given to the three dimensions, Hammond and Adelman suggested that they be weighted equally. Combining the equal weights with the (independent) judgments of the ballistics experts, Hammond and Adelman discovered a bullet that "has greater stopping effectiveness and is less apt to cause injury (and is less apt to threaten bystanders) than the standard bullet then in use by the DPD" (Hammond & Adelman, 1976, p. 395). The bullet was also less apt to cause injury than was the bullet previously recommended by the DPD. That bullet was "accepted by the City Council and all other parties concerned, and is now being used by the DPD" (Hammond & Adelman, 1976, p. 395).⁷ Once again, "the whole trick is to decide what variables to look at and then know how to add" (Dawes & Corrigan, 1974, p. 105).

⁷ It should be pointed out that there were only eight bullets on the Pareto frontier; that is, there were only eight that were not inferior to some particular other bullet in both stopping effectiveness and probability of harm (or inferior on one of the variables and equal on the other). Consequently, any weighting rule whatsoever would have chosen one of these eight.
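Footnote 7's observation is itself mechanical enough to code. The sketch below first filters options to the Pareto frontier and then applies the equal-weight rule; the ratings matrix is hypothetical, since the experts' actual numbers appear only in Hammond and Adelman (1976).

```python
import numpy as np

def pareto_frontier(scores):
    """Indices of options not dominated on every dimension (higher =
    better). Footnote 7 notes only eight bullets survived this filter."""
    keep = []
    for i, s in enumerate(scores):
        dominated = any(np.all(t >= s) and np.any(t > s)
                        for j, t in enumerate(scores) if j != i)
        if not dominated:
            keep.append(i)
    return keep

def equal_weight_choice(scores):
    """Hammond-Adelman style aggregation: standardize each dimension,
    weight equally, pick the option with the best composite."""
    Z = (scores - scores.mean(axis=0)) / scores.std(axis=0)
    return int(np.argmax(Z.sum(axis=1)))

# Hypothetical ratings: rows = bullets, columns = (stopping
# effectiveness, low injury, low bystander risk), all oriented so that
# higher is better.
ratings = np.array([[.7, .2, .3], [.5, .6, .6], [.6, .5, .5], [.3, .8, .7]])
best = equal_weight_choice(ratings[pareto_frontier(ratings)])
```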
So why don't people do it more often? I know of four universities (University of Illinois; New York University; University of Oregon; University of California, Santa Barbara—there may be more) that use a linear model for applicant selection, but even these use it as an initial screening device and substitute clinical judgment for the final selection of those above a cut score. Goldberg's (1965) actuarial formula for diagnosing neurosis or psychosis from MMPI profiles has proven superior to clinical judges attempting the same task (no one to my or Goldberg's knowledge has ever produced a judge who does better), yet my one experience with its use (at the Ann Arbor Veterans Administration Hospital) was that it was discontinued on the grounds that it made obvious errors (an interesting reason, discussed at length below). In 1970, I suggested that our fellowship committee at the University of Oregon apportion cutbacks of National Science Foundation and National Defense Education Act fellowships to departments on the basis of a quasi-linear point system based on explicitly defined indices, departmental merit, and need; I was told "you can't systemize human judgment." It was only six months later, after our committee realized the political and ethical impossibility of cutting back fellowships on the basis of intuitive judgment, that such a system was adopted. And so on.

In the past three years, I have written and talked about the utility (and in my view, ethical superiority) of using linear models in socially important decisions. Many of the same objections have been raised repeatedly by different readers and audiences. I would like to conclude this article by cataloging these objections and answering them.

Objections to Using Linear Models

These objections may be placed in three broad categories: technical, psychological, and ethical. Each category is discussed in turn.

TECHNICAL

The most common technical objection is to the use of the correlation coefficient; for example, Remus and Jenicke (1978) wrote:

It is clear that Dawes and Corrigan's choice of the correlation coefficient to establish the utility of random and unit rules is inappropriate [sic, inappropriate for what?]. A criterion function is also needed in the experiments cited by Dawes and Corrigan. Surely there is a cost function for misclassifying neurotics and psychotics or refusing qualified students admissions to graduate school while admitting marginal students. (p. 221)

Consider the graduate admission problem first. Most schools have k slots and N applicants. The problem is to get the best k (who are in turn willing to accept the school) out of N. What better way is there than to have an appropriate rank? None. Remus and Jenicke write as if the problem were not one of comparative choice but of absolute choice. Most social choices, however, involve selecting the better or best from a set of alternatives: the students that will be better, the bullet that will be best, a possible airport site that will be superior, and so on. The correlation



coefficient, because it reflects ranks so well, is clearly appropriate for evaluating such choices.

The neurosis-psychosis problem is more subtle and even less supportive of their argument. "Surely," they state, "there is a cost function," but they don't specify any candidates. The implication is clear: If they could find it, clinical judgment would be found to be superior to linear models. Why? In the absence of such a discovery on their part, the argument amounts to nothing at all. But this argument from a vacuum can be very compelling to people (for example, to losing generals and losing football coaches, who know that "surely" their plans would work "if"—when the plans are in fact doomed to failure no matter what).

A second related technical objection is to the comparison of average correlation coefficients of judges with those of linear models. Perhaps by averaging, the performance of some really outstanding judges is obscured. The data indicate otherwise. In the Goldberg (1970) study, for example, only 5 of 29 trained clinicians were better than the unit-weighted model, and none did better than the proper one. In the Wiggins and Kohen (1971) study, no judges were better than the unit-weighted model, and we replicated that effect at Oregon. In the Libby (1976) study, only 9 of 43 judges did better than the ratio of assets to liabilities at predicting bankruptcies (3 did equally well). While it is then conceded that clinicians should be able to predict diagnosis of neurosis or psychosis, that graduate students should be able to predict graduate success, and that bank loan officers should be able to predict bankruptcies, the possibility is raised that perhaps the experts used in the studies weren't the right ones. This again is arguing from a vacuum: If other experts were used, then the results would be different. And once again no such experts are produced, and once again the appropriate response is to ask for a reason why these hypothetical other people should be any different. (As one university vice-president told me, "Your research only proves that you used poor judges; we could surely do better by getting better judges"—apparently not from the psychology department.)

A final technical objection concerns the nature of the criterion variables. They are admittedly short-term and unprofound (e.g., GPAs, diagnoses); otherwise, most studies would be infeasible. The question is then raised of whether the findings would be different if a truly long-range important criterion were to be predicted. The answer is that of course the findings could be different, but we have no reason to suppose that they would be different. First, the distant future is in general less predictable than the immediate future, for the simple reason that more unforeseen, extraneous, or self-augmenting factors influence individual outcomes. (Note that we are not discussing aggregate outcomes, such as an unusually cold winter in the Midwest in general spread out over three months.) Since, then, clinical prediction is poorer than linear to begin with, the hypothesis would hold only if linear prediction got much worse over time than did clinical prediction. There is no a priori reason to believe that this differential deterioration in prediction would occur, and none has ever been suggested to me. There is certainly no evidence. Once again, the objection consists of an argument from a vacuum.

Particularly compelling is the fact that people who argue that different criteria or judges or variables or time frames would produce different results have had 25 years in which to produce examples, and they have failed to do so.

PSYCHOLOGICAL

One psychological resistance to using linear models lies in our selective memory about clinical prediction. Our belief in such prediction is reinforced by the availability (Tversky & Kahneman, 1974) of instances of successful clinical prediction—especially those that are exceptions to some formula: "I knew someone once with ... who ...." (e.g., "I knew of someone with a tested IQ of only 130 who got an advanced degree in psychology.") As Nisbett, Borgida, Crandall, and Reed (1976) showed, such single instances often have greater impact on judgment than do much more valid statistical compilations based on many instances. (A good prophylactic for clinical psychologists basing resistance to actuarial prediction on such instances would be to keep careful records of their own predictions about their own patients—prospective records not subject to hindsight. Such records could make all instances of successful and unsuccessful prediction equally available for impact; in addition, they could serve for another clinical versus statistical study using the best possible judge—the clinician himself or herself.)

Moreover, an illusion of good judgment may be reinforced due to selection (Einhorn & Hogarth, 1978) in those situations in which the prediction



of a positive or negative outcome has a self-fulfilling effect. For example, admissions officers who judge that a candidate is particularly qualified for a graduate program may feel that their judgment is exonerated when that candidate does well, even though the candidate's success is in large part due to the positive effects of the program. (In contrast, a linear model of selection is evaluated by seeing how well it predicts performance within the set of applicants selected.) Or a waiter who believes that particular people at the table are poor tippers may be less attentive than usual and receive a smaller tip, thereby having his clinical judgment exonerated.⁸

⁸ This example was provided by Einhorn (Note 5).

A second psychological resistance to the use of linear models stems from their "proven" low validity. Here, there is an implicit (as opposed to explicit) argument from a vacuum because neither changes in evaluation procedures, nor in judges, nor in criteria are proposed. Rather, the unstated assumption is that these criteria of psychological interest are in fact highly predictable, so it follows that if one method of prediction (a linear model) doesn't work too well, another might do better (reasonable), which is then translated into the belief that another will do better (which is not a reasonable inference)—once it is found. This resistance is best expressed by a dean considering the graduate admissions who wrote, "The correlation of the linear composite with future faculty ratings is only .4, whereas that of the admissions committee's judgment correlates .2. Twice nothing is nothing." In 1976, I answered as follows (Dawes, 1976, pp. 6-7):

In response, I can only point out that 16% of the variance is better than 4% of the variance. To me, however, the fascinating part of this argument is the implicit assumption that that other 84% of the variance is predictable and that we can somehow predict it.

Now what are we dealing with? We are dealing with personality and intellectual characteristics of [uniformly bright] people who are about 20 years old. ... Why are we so convinced that this prediction can be made at all? Surely, it is not necessary to read Ecclesiastes every night to understand the role of chance. ... Moreover, there are clearly positive feedback effects in professional development that exaggerate threshold phenomena. For example, once people are considered sufficiently "outstanding" that they are invited to outstanding institutions, they have outstanding colleagues with whom to interact—and excellence is exacerbated. This same problem occurs for those who do not quite reach such a threshold level. Not only do all these factors mitigate against successful long-range prediction, but studies of the success of such prediction are necessarily limited to those accepted, with the incumbent problems of restriction of range and a negative covariance structure between predictors (Dawes, 1975).

Finally, there are all sorts of nonintellectual factors in professional success that could not possibly be evaluated before admission to graduate school, for example, success at forming a satisfying or inspiring libidinal relationship, not yet evident genetic tendencies to drug or alcohol addiction, the misfortune to join a research group that "blows up," and so on, and so forth.

Intellectually, I find it somewhat remarkable that we are able to predict even 16% of the variance. But I believe that my own emotional response is indicative of those of my colleagues who simply assume that the future is more predictable. I want it to be predictable, especially when the aspect of it that I want to predict is important to me. This desire, I suggest, translates itself into an implicit assumption that the future is in fact highly predictable, and it would then logically follow that if something is not a very good predictor, something else might do better (although it is never correct to argue that it necessarily will).

Statistical prediction, because it includes the specification (usually a low correlation coefficient) of exactly how poorly we can predict, bluntly strikes us with the fact that life is not all that predictable. Unsystematic clinical prediction (or "postdiction"), in contrast, allows us the comforting illusion that life is in fact predictable and that we can predict it.

ETHICAL

When I was at the Los Angeles Renaissance Fair last summer, I overheard a young woman complain that it was "horribly unfair" that she had been rejected by the Psychology Department at the University of California, Santa Barbara, on the basis of mere numbers, without even an interview. "How can they possibly tell what I'm like?" The answer is that they can't. Nor could they with an interview (Kelly, 1954). Nevertheless, many people maintain that making a crucial social choice without an interview is dehumanizing. I think that the question of whether people are treated in a fair manner has more to do with the question of whether or not they have been dehumanized than does the question of whether the treatment is face to face. (Some of the worst doctors spend a great deal of time conversing with



their patients, read no medical journals, order few or no tests, and grieve at the funerals.) A GPA represents 3½ years of behavior on the part of the applicant. (Surely, not all the professors are biased against his or her particular form of creativity.) The GRE is a more carefully devised test. Do we really believe that we can do a better or a fairer job by a 10-minute folder evaluation or a half-hour interview than is done by these two mere numbers? Such cognitive conceit (Dawes, 1976, p. 7) is unethical, especially given the fact of no evidence whatsoever indicating that we do a better job than does the linear equation. (And even making exceptions must be done with extreme care if it is to be ethical, for if we admit someone with a low linear score on the basis that he or she has some special talent, we are automatically rejecting someone with a higher score, who might well have had an equally impressive talent had we taken the trouble to evaluate it.)

No matter how much we would like to see this or that aspect of one or another of the studies reviewed in this article changed, no matter how psychologically uncompelling or distasteful we may find their results to be, no matter how ethically uncomfortable we may feel at "reducing people to mere numbers," the fact remains that our clients are people who deserve to be treated in the best manner possible. If that means—as it appears at present—that selection, diagnosis, and prognosis should be based on nothing more than the addition of a few numbers representing values on important attributes, so be it. To do otherwise is cheating the people we serve.

REFERENCE NOTES

1. Thornton, B. Personal communication, 1977.
2. Slovic, P. Limitations of the mind of man: Implications for decision making in the nuclear age. In H. J. Otway (Ed.), Risk vs. benefit: Solution or dream? (Report LA 4860-MS). Los Alamos, N.M.: Los Alamos Scientific Laboratory, 1972. [Also available as Oregon Research Institute Bulletin, 1971, 11(17).]
3. Srinivasan, V. A theoretical comparison of the predictive power of the multiple regression and equal weighting procedures (Research Paper No. 347). Stanford, Calif.: Stanford University, Graduate School of Business, February 1977.
4. von Winterfeldt, D., & Edwards, W. Costs and payoffs in perceptual research. Unpublished manuscript, University of Michigan, Engineering Psychology Laboratory, 1973.
5. Einhorn, H. J. Personal communication, January 1979.

REFERENCES

Alexander, S. A. H. Sex, arguments, and social engagements in marital and premarital relations. Unpublished master's thesis, University of Missouri—Kansas City, 1971.
Beaver, W. H. Financial ratios as predictors of failure. In Empirical research in accounting: Selected studies. Chicago: University of Chicago, Graduate School of Business, Institute of Professional Accounting, 1966.
Bowman, E. H. Consistency and optimality in managerial decision making. Management Science, 1963, 9, 310-321.
Cass, J., & Birnbaum, M. Comparative guide to American colleges. New York: Harper & Row, 1968.
Claudy, J. G. A comparison of five variable weighting procedures. Educational and Psychological Measurement, 1972, 32, 311-322.
Darlington, R. B. Reduced-variance regression. Psychological Bulletin, 1978, 85, 1238-1255.
Dawes, R. M. A case study of graduate admissions: Application of three principles of human decision making. American Psychologist, 1971, 26, 180-188.
Dawes, R. M. Graduate admissions criteria and future success. Science, 1975, 187, 721-723.
Dawes, R. M. Shallow psychology. In J. Carroll & J. Payne (Eds.), Cognition and social behavior. Hillsdale, N.J.: Erlbaum, 1976.
Dawes, R. M., & Corrigan, B. Linear models in decision making. Psychological Bulletin, 1974, 81, 95-106.
Deacon, E. B. A discriminant analysis of predictors of business failure. Journal of Accounting Research, 1972, 10, 167-179.
deGroot, A. D. Het denken van den schaker [Thought and choice in chess]. The Hague, The Netherlands: Mouton, 1965.
Edwards, D. D., & Edwards, J. S. Marriage: Direct and continuous measurement. Bulletin of the Psychonomic Society, 1977, 10, 187-188.
Edwards, W. M. Technology for director dubious: Evaluation and decision in public contexts. In K. R. Hammond (Ed.), Judgment and decision in public policy formation. Boulder, Colo.: Westview Press, 1978.
Einhorn, H. J. Expert measurement and mechanical combination. Organizational Behavior and Human Performance, 1972, 7, 86-106.
Einhorn, H. J., & Hogarth, R. M. Unit weighting schemas for decision making. Organizational Behavior and Human Performance, 1975, 13, 171-192.
Einhorn, H. J., & Hogarth, R. M. Confidence in judgment: Persistence of the illusion of validity. Psychological Review, 1978, 85, 395-416.
Gardiner, P. C., & Edwards, W. Public values: Multiattribute-utility measurement for social decision making. In M. F. Kaplan & S. Schwartz (Eds.), Human judgment and decision processes. New York: Academic Press, 1975.
Goldberg, L. R. Diagnosticians vs. diagnostic signs: The diagnosis of psychosis vs. neurosis from the MMPI. Psychological Monographs, 1965, 79(9, Whole No. 602).
Goldberg, L. R. Seer over sign: The first "good" example? Journal of Experimental Research in Personality, 1968, 3, 168-171.
Goldberg, L. R. Man versus model of man: A rationale, plus some evidence for a method of improving on clinical inferences. Psychological Bulletin, 1970, 73, 422-432.
Goldberg, L. R. Parameters of personality inventory construction and utilization: A comparison of prediction strategies and tactics. Multivariate Behavioral Research Monographs, 1972, No. 72-2.

Goldberg, L. R. Man versus model of man: Just how conflicting is that evidence? Organizational Behavior and Human Performance, 1976, 16, 13-22.
Green, B. F., Jr. Parameter sensitivity in multivariate methods. Multivariate Behavioral Research, 1977, 3, 263.
Hammond, K. R. Probabilistic functioning and the clinical method. Psychological Review, 1955, 62, 255-262.
Hammond, K. R., & Adelman, L. Science, values, and human judgment. Science, 1976, 194, 389-396.
Hoffman, P. J. The paramorphic representation of clinical judgment. Psychological Bulletin, 1960, 57, 116-131.
Holt, R. R. Yet another look at clinical and statistical prediction. American Psychologist, 1970, 25, 337-339.
Howard, J. W., & Dawes, R. M. Linear prediction of marital happiness. Personality and Social Psychology Bulletin, 1976, 2, 478-480.
Kelly, L. Evaluation of the interview as a selection technique. In Proceedings of the 1953 Invitational Conference on Testing Problems. Princeton, N.J.: Educational Testing Service, 1954.
Krantz, D. H. Measurement structures and psychological laws. Science, 1972, 175, 1427-1435.
Krantz, D. H., Luce, R. D., Suppes, P., & Tversky, A. Foundations of measurement (Vol. 1). New York: Academic Press, 1971.
Libby, R. Man versus model of man: Some conflicting evidence. Organizational Behavior and Human Performance, 1976, 16, 1-12.
Marquardt, D. W., & Snee, R. D. Ridge regression in practice. American Statistician, 1975, 29, 3-19.
Meehl, P. E. Clinical versus statistical prediction: A theoretical analysis and a review of the evidence. Minneapolis: University of Minnesota Press, 1954.
Meehl, P. E. Seer over sign: The first good example. Journal of Experimental Research in Personality, 1965, 1, 27-32.
Nisbett, R. E., Borgida, E., Crandall, R., & Reed, H. Popular induction: Information is not necessarily normative. In J. Carroll & J. Payne (Eds.), Cognition and social behavior. Hillsdale, N.J.: Erlbaum, 1976.
Remus, W. E., & Jenicke, L. O. Unit and random linear models in decision making. Multivariate Behavioral Research, 1978, 13, 215-221.
Sawyer, J. Measurement and prediction, clinical and statistical. Psychological Bulletin, 1966, 66, 178-200.
Schmidt, F. L. The relative efficiency of regression and simple unit predictor weights in applied differential psychology. Educational and Psychological Measurement, 1971, 31, 699-714.
Simon, H. A., & Chase, W. G. Skill in chess. American Scientist, 1973, 61, 394-403.
Slovic, P., & Lichtenstein, S. Comparison of Bayesian and regression approaches to the study of information processing in judgment. Organizational Behavior and Human Performance, 1971, 6, 649-744.
Thornton, B. Linear prediction of marital happiness: A replication. Personality and Social Psychology Bulletin, 1977, 3, 674-676.
Tversky, A., & Kahneman, D. Judgment under uncertainty: Heuristics and biases. Science, 1974, 184, 1124-1131.
Wainer, H. Estimating coefficients in linear models: It don't make no nevermind. Psychological Bulletin, 1976, 83, 312-317.
Wainer, H., & Thissen, D. Three steps toward robust regression. Psychometrika, 1976, 41, 9-34.
Wallace, H. A. What is in the corn judge's mind? Journal of the American Society of Agronomy, 1923, 15, 300-304.
Wiggins, N., & Kohen, E. S. Man vs. model of man revisited: The forecasting of graduate school success. Journal of Personality and Social Psychology, 1971, 19, 100-106.
Wilks, S. S. Weighting systems for linear functions of correlated variables when there is no dependent variable. Psychometrika, 1938, 8, 23-40.
Yntema, D. B., & Torgerson, W. S. Man-computer cooperation in decisions requiring common sense. IRE Transactions of the Professional Group on Human Factors in Electronics, 1961, 2(1), 20-26.
