
Statistical Science

2001, Vol. 16, No. 3, 199-231

Statistical Modeling: The Two Cultures


Leo Breiman

Abstract. There are two cultures in the use of statistical modeling to reach conclusions from data. One assumes that the data are generated by a given stochastic data model. The other uses algorithmic models and treats the data mechanism as unknown. The statistical community has been committed to the almost exclusive use of data models. This commitment has led to irrelevant theory, questionable conclusions, and has kept statisticians from working on a large range of interesting current problems. Algorithmic modeling, both in theory and practice, has developed rapidly in fields outside statistics. It can be used both on large complex data sets and as a more accurate and informative alternative to data modeling on smaller data sets. If our goal as a field is to use data to solve problems, then we need to move away from exclusive dependence on data models and adopt a more diverse set of tools.

Leo Breiman is Professor, Department of Statistics, University of California, Berkeley, California 94720-4735 (e-mail: leo@stat.berkeley.edu).

1. INTRODUCTION

Statistics starts with data. Think of the data as being generated by a black box in which a vector of input variables x (independent variables) go in one side, and on the other side the response variables y come out. Inside the black box, nature functions to associate the predictor variables with the response variables, so the picture is like this:

    y <-- nature <-- x

There are two goals in analyzing the data:

Prediction. To be able to predict what the responses are going to be to future input variables;
Information. To extract some information about how nature is associating the response variables to the input variables.

There are two different approaches toward these goals:

The Data Modeling Culture

The analysis in this culture starts with assuming a stochastic data model for the inside of the black box. For example, a common data model is that data are generated by independent draws from

    response variables = f(predictor variables, random noise, parameters)

The values of the parameters are estimated from the data and the model then used for information and/or prediction. Thus the black box is filled in like this:

    y <-- [linear regression, logistic regression, Cox model] <-- x

Model validation. Yes-no using goodness-of-fit tests and residual examination.
Estimated culture population. 98% of all statisticians.

The Algorithmic Modeling Culture

The analysis in this culture considers the inside of the box complex and unknown. Their approach is to find a function f(x), an algorithm that operates on x to predict the responses y. Their black box looks like this:

    y <-- unknown <-- x
          (decision trees, neural nets)

Model validation. Measured by predictive accuracy.
Estimated culture population. 2% of statisticians, many in other fields.
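A minimal sketch of the two routes, side by side on the same data, might look like the following (Python with numpy and scikit-learn is assumed; the synthetic data-generating mechanism, the choice of a random forest as the algorithmic model, and all settings are illustrative only):

```python
# Sketch: the same data analyzed by the two cultures.
# Assumes numpy and scikit-learn; a synthetic mechanism stands in for nature's black box.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, size=(1000, 5))
y = np.sin(3 * x[:, 0]) + x[:, 1] * x[:, 2] + rng.normal(0, 0.2, size=1000)

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=0)

# Data modeling culture: assume y = b0 + sum(b_m x_m) + noise, then examine the fit.
linear = LinearRegression().fit(x_train, y_train)

# Algorithmic modeling culture: treat the inside of the box as unknown and
# judge f(x) only by how well it predicts cases it has not seen.
forest = RandomForestRegressor(n_estimators=200, random_state=0).fit(x_train, y_train)

print("linear data model, R^2 on its own training data:",
      round(r2_score(y_train, linear.predict(x_train)), 2))
print("linear data model, R^2 on the test set:",
      round(r2_score(y_test, linear.predict(x_test)), 2))
print("random forest, R^2 on the test set:",
      round(r2_score(y_test, forest.predict(x_test)), 2))
```

Both models are easy to fit; the difference lies in what each is asked to do and in how each is checked.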


In this paper I will argue that the focus in the statistical community on data models has:

- Led to irrelevant theory and questionable scientific conclusions;
- Kept statisticians from using more suitable algorithmic models;
- Prevented statisticians from working on exciting new problems.

I will also review some of the interesting new developments in algorithmic modeling in machine learning and look at applications to three data sets.

2. ROAD MAP

It may be revealing to understand how I became a member of the small second culture. After a seven-year stint as an academic probabilist, I resigned and went into full-time free-lance consulting. After thirteen years of consulting I joined the Berkeley Statistics Department in 1980 and have been there since. My experiences as a consultant formed my views about algorithmic modeling. Section 3 describes two of the projects I worked on. These are given to show how my views grew from such problems.

When I returned to the university and began reading statistical journals, the research was distant from what I had done as a consultant. All articles begin and end with data models. My observations about published theoretical research in statistics are in Section 4.

Data modeling has given the statistics field many successes in analyzing data and getting information about the mechanisms producing the data. But there is also misuse leading to questionable conclusions about the underlying mechanism. This is reviewed in Section 5. Following that is a discussion (Section 6) of how the commitment to data modeling has prevented statisticians from entering new scientific and commercial fields where the data being gathered is not suitable for analysis by data models.

In the past fifteen years, the growth in algorithmic modeling applications and methodology has been rapid. It has occurred largely outside statistics in a new community, often called machine learning, that is mostly young computer scientists (Section 7). The advances, particularly over the last five years, have been startling. Three of the most important changes in perception to be learned from these advances are described in Sections 8, 9, and 10, and are associated with the following names:

Rashomon: the multiplicity of good models;
Occam: the conflict between simplicity and accuracy;
Bellman: dimensionality, curse or blessing?

Section 11 is titled "Information from a Black Box" and is important in showing that an algorithmic model can produce more and more reliable information about the structure of the relationship between inputs and outputs than data models. This is illustrated using two medical data sets and a genetic data set. A glossary at the end of the paper explains terms that not all statisticians may be familiar with.

3. PROJECTS IN CONSULTING

As a consultant I designed and helped supervise surveys for the Environmental Protection Agency (EPA) and the state and federal court systems. Controlled experiments were designed for the EPA, and I analyzed traffic data for the U.S. Department of Transportation and the California Transportation Department. Most of all, I worked on a diverse set of prediction projects. Here are some examples:

Predicting next-day ozone levels.
Using mass spectra to identify halogen-containing compounds.
Predicting the class of a ship from high altitude radar returns.
Using sonar returns to predict the class of a submarine.
Identity of hand-sent Morse Code.
Toxicity of chemicals.
On-line prediction of the cause of a freeway traffic breakdown.
Speech recognition.
The sources of delay in criminal trials in state court systems.

To understand the nature of these problems and the approaches taken to solve them, I give a fuller description of the first two on the list.

3.1 The Ozone Project

In the mid- to late 1960s ozone levels became a serious health problem in the Los Angeles Basin. Three different alert levels were established. At the highest, all government workers were directed not to drive to work, children were kept off playgrounds and outdoor exercise was discouraged.

The major source of ozone at that time was automobile tailpipe emissions. These rose into the low atmosphere and were trapped there by an inversion layer. A complex chemical reaction, aided by sunlight, cooked away and produced ozone two to three hours after the morning commute hours. The alert warnings were issued in the morning, but would be more effective if they could be issued 12 hours in advance. In the mid-1970s, the EPA funded a large effort to see if ozone levels could be accurately predicted 12 hours in advance.

Commuting patterns in the Los Angeles Basin are regular, with the total variation in any given daylight hour varying only a few percent from one weekday to another. With the total amount of emissions about constant, the resulting ozone levels depend on the meteorology of the preceding days. A large data base was assembled consisting of lower and upper air measurements at U.S. weather stations as far away as Oregon and Arizona, together with hourly readings of surface temperature, humidity, and wind speed at the dozens of air pollution stations in the Basin and nearby areas.

Altogether, there were daily and hourly readings of over 450 meteorological variables for a period of seven years, with corresponding hourly values of ozone and other pollutants in the Basin. Let x be the predictor vector of meteorological variables on the nth day. There are more than 450 variables in x since information several days back is included. Let y be the ozone level on the (n+1)st day. Then the problem was to construct a function f(x) such that for any future day and future predictor variables x for that day, f(x) is an accurate predictor of the next day's ozone level y.

To estimate predictive accuracy, the first five years of data were used as the training set. The last two years were set aside as a test set. The algorithmic modeling methods available in the pre-1980s decades seem primitive now. In this project large linear regressions were run, followed by variable selection. Quadratic terms in, and interactions among, the retained variables were added and variable selection used again to prune the equations. In the end, the project was a failure: the false alarm rate of the final predictor was too high. I have regrets that this project can't be revisited with the tools available today.

3.2 The Chlorine Project

The EPA samples thousands of compounds a year and tries to determine their potential toxicity. In the mid-1970s, the standard procedure was to measure the mass spectra of the compound and to try to determine its chemical structure from its mass spectra.

Measuring the mass spectra is fast and cheap. But the determination of chemical structure from the mass spectra requires a painstaking examination by a trained chemist. The cost and availability of enough chemists to analyze all of the mass spectra produced daunted the EPA. Many toxic compounds contain halogens. So the EPA funded a project to determine if the presence of chlorine in a compound could be reliably predicted from its mass spectra.

Mass spectra are produced by bombarding the compound with ions in the presence of a magnetic field. The molecules of the compound split and the lighter fragments are bent more by the magnetic field than the heavier. Then the fragments hit an absorbing strip, with the position of the fragment on the strip determined by the molecular weight of the fragment. The intensity of the exposure at that position measures the frequency of the fragment. The resultant mass spectrum has numbers reflecting frequencies of fragments from molecular weight 1 up to the molecular weight of the original compound. The peaks correspond to frequent fragments and there are many zeroes. The available data base consisted of the known chemical structure and mass spectra of 30,000 compounds.

The mass spectrum predictor vector x is of variable dimensionality. Molecular weight in the data base varied from 30 to over 10,000. The variable to be predicted is

    y = 1: contains chlorine,
    y = 2: does not contain chlorine.

The problem is to construct a function f(x) that is an accurate predictor of y where x is the mass spectrum of the compound.

To measure predictive accuracy the data set was randomly divided into a 25,000 member training set and a 5,000 member test set. Linear discriminant analysis was tried, then quadratic discriminant analysis. These were difficult to adapt to the variable dimensionality. By this time I was thinking about decision trees. The hallmarks of chlorine in mass spectra were researched. This domain knowledge was incorporated into the decision tree algorithm by the design of the set of 1,500 yes-no questions that could be applied to a mass spectrum of any dimensionality. The result was a decision tree that gave 95% accuracy on both chlorines and nonchlorines (see Breiman, Friedman, Olshen and Stone, 1984).

3.3 Perceptions on Statistical Analysis

As I left consulting to go back to the university, these were the perceptions I had about working with data to find answers to problems:

(a) Focus on finding a good solution; that's what consultants get paid for.
(b) Live with the data before you plunge into modeling.
(c) Search for a model that gives a good solution, either algorithmic or data.
(d) Predictive accuracy on test sets is the criterion for how good the model is.
(e) Computers are an indispensable partner.
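Perception (d) can be sketched in a few lines. The sketch below is an illustration only: it uses a synthetic two-class problem and scikit-learn's generic decision tree rather than the 1,500 domain-specific yes-no questions described in Section 3.2, but the protocol (fit on a training set, report error on a held-out test set) is the same.

```python
# Sketch of perception (d): judge a classifier by its error on a held-out test set.
# Synthetic stand-in for the chlorine data; assumes scikit-learn.
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=30000, n_features=50, n_informative=10,
                           random_state=0)
# 25,000 training cases and 5,000 test cases, echoing the split used in the project.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=5000,
                                                    random_state=0)

tree = DecisionTreeClassifier(max_depth=10, random_state=0).fit(X_train, y_train)
print("test set accuracy:", tree.score(X_test, y_test))
```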

4. RETURN TO THE UNIVERSITY

I had one tip about what research in the university was like. A friend of mine, a prominent statistician from the Berkeley Statistics Department, visited me in Los Angeles in the late 1970s. After I described the decision tree method to him, his first question was, "What's the model for the data?"

4.1 Statistical Research

Upon my return, I started reading the Annals of Statistics, the flagship journal of theoretical statistics, and was bemused. Every article started with

    "Assume that the data are generated by the following model: ..."

followed by mathematics exploring inference, hypothesis testing and asymptotics. There is a wide spectrum of opinion regarding the usefulness of the theory published in the Annals of Statistics to the field of statistics as a science that deals with data. I am at the very low end of the spectrum. Still, there have been some gems that have combined nice theory and significant applications. An example is wavelet theory. Even in applications, data models are universal. For instance, in the Journal of the American Statistical Association (JASA), virtually every article contains a statement of the form:

    "Assume that the data are generated by the following model: ..."

I am deeply troubled by the current and past use of data models in applications, where quantitative conclusions are drawn and perhaps policy decisions made.

5. THE USE OF DATA MODELS

Statisticians in applied research consider data modeling as the template for statistical analysis: faced with an applied problem, think of a data model. This enterprise has at its heart the belief that a statistician, by imagination and by looking at the data, can invent a reasonably good parametric class of models for a complex mechanism devised by nature. Then parameters are estimated and conclusions are drawn. But when a model is fit to data to draw quantitative conclusions:

- The conclusions are about the model's mechanism, and not about nature's mechanism.

It follows that:

- If the model is a poor emulation of nature, the conclusions may be wrong.

These truisms have often been ignored in the enthusiasm for fitting data models. A few decades ago, the commitment to data models was such that even simple precautions such as residual analysis or goodness-of-fit tests were not used. The belief in the infallibility of data models was almost religious. It is a strange phenomenon: once a model is made, then it becomes truth and the conclusions from it are infallible.

5.1 An Example

I illustrate with a famous (also infamous) example: assume the data is generated by independent draws from the model

(R)    y = b_0 + \sum_{m=1}^{M} b_m x_m + \varepsilon,

where the coefficients {b_m} are to be estimated, \varepsilon is N(0, \sigma^2) and \sigma^2 is to be estimated. Given that the data is generated this way, elegant tests of hypotheses, confidence intervals, distributions of the residual sum-of-squares and asymptotics can be derived. This made the model attractive in terms of the mathematics involved. This theory was used both by academic statisticians and others to derive significance levels for coefficients on the basis of model (R), with little consideration as to whether the data on hand could have been generated by a linear model. Hundreds, perhaps thousands of articles were published claiming proof of something or other because the coefficient was significant at the 5% level.

Goodness-of-fit was demonstrated mostly by giving the value of the multiple correlation coefficient R^2, which was often closer to zero than one and which could be overinflated by the use of too many parameters. Besides computing R^2, nothing else was done to see if the observational data could have been generated by model (R). For instance, a study was done several decades ago by a well-known member of a university statistics department to assess whether there was gender discrimination in the salaries of the faculty. All personnel files were examined and a data base set up which consisted of salary as the response variable and 25 other variables which characterized academic performance; that is, papers published, quality of journals published in, teaching record, evaluations, etc. Gender appears as a binary predictor variable.

A linear regression was carried out on the data and the gender coefficient was significant at the 5% level. That this was strong evidence of sex discrimination was accepted as gospel. The design of the study raises issues that enter before the consideration of a model: Can the data gathered answer the question posed? Is inference justified when your sample is the entire population? Should a data model be used? The deficiencies in analysis occurred because the focus was on the model and not on the problem.

The linear regression model led to many erroneous conclusions that appeared in journal articles waving the 5% significance level without knowing whether the model fit the data. Nowadays, I think most statisticians will agree that this is a suspect way to arrive at conclusions. At the time, there were few objections from the statistical profession about the fairy-tale aspect of the procedure. But, hidden in an elementary textbook, Mosteller and Tukey (1977) discuss many of the fallacies possible in regression and write "The whole area of guided regression is fraught with intellectual, statistical, computational, and subject matter difficulties."

Even currently, there are only rare published critiques of the uncritical use of data models. One of the few is David Freedman, who examines the use of regression models (1994), the use of path models (1987) and data modeling (1991, 1995). The analysis in these papers is incisive.

5.2 Problems in Current Data Modeling

Current applied practice is to check the data model fit using goodness-of-fit tests and residual analysis. At one point, some years ago, I set up a simulated regression problem in seven dimensions with a controlled amount of nonlinearity. Standard tests of goodness-of-fit did not reject linearity until the nonlinearity was extreme. Recent theory supports this conclusion. Work by Bickel, Ritov and Stoker (2001) shows that goodness-of-fit tests have very little power unless the direction of the alternative is precisely specified. The implication is that omnibus goodness-of-fit tests, which test in many directions simultaneously, have little power, and will not reject until the lack of fit is extreme.

Furthermore, if the model is tinkered with on the basis of the data, that is, if variables are deleted or nonlinear combinations of the variables added, then goodness-of-fit tests are not applicable. Residual analysis is similarly unreliable. In a discussion after a presentation of residual analysis in a seminar at Berkeley in 1993, William Cleveland, one of the fathers of residual analysis, admitted that it could not uncover lack of fit in more than four to five dimensions. The papers I have read on using residual analysis to check lack of fit are confined to data sets with two or three variables.

With higher dimensions, the interactions between the variables can produce passable residual plots for a variety of models. A residual plot is a goodness-of-fit test, and lacks power in more than a few dimensions. An acceptable residual plot does not imply that the model is a good fit to the data.

There are a variety of ways of analyzing residuals. For instance, Landwehr, Pregibon and Shoemaker (1984, with discussion) give a detailed analysis of fitting a logistic model to a three-variable data set using various residual plots. But each of the four discussants presents other methods for the analysis. One is left with an unsettled sense about the arbitrariness of residual analysis.

Misleading conclusions may follow from data models that pass goodness-of-fit tests and residual checks. But published applications to data often show little care in checking model fit using these methods or any other. For instance, many of the current application articles in JASA that fit data models have very little discussion of how well their model fits the data. The question of how well the model fits the data is of secondary importance compared to the construction of an ingenious stochastic model.

5.3 The Multiplicity of Data Models

One goal of statistics is to extract information from the data about the underlying mechanism producing the data. The greatest plus of data modeling is that it produces a simple and understandable picture of the relationship between the input variables and responses. For instance, logistic regression in classification is frequently used because it produces a linear combination of the variables with weights that give an indication of the variable importance. The end result is a simple picture of how the prediction variables affect the response variable plus confidence intervals for the weights. Suppose two statisticians, each one with a different approach to data modeling, fit a model to the same data set. Assume also that each one applies standard goodness-of-fit tests, looks at residuals, etc., and is convinced that their model fits the data. Yet the two models give different pictures of nature's mechanism and lead to different conclusions.

McCullagh and Nelder (1989) write "Data will often point with almost equal emphasis on several possible models, and it is important that the statistician recognize and accept this." Well said, but different models, all of them equally good, may give different pictures of the relation between the predictor and response variables. The question of which one most accurately reflects the data is difficult to resolve. One reason for this multiplicity is that goodness-of-fit tests and other methods for checking fit give a yes-no answer. With the lack of power of these tests with data having more than a small number of dimensions, there will be a large number of models whose fit is acceptable. There is no way, among the yes-no methods for gauging fit, of determining which is the better model. A few statisticians know this. Mountain and Hsiao (1989) write, "It is difficult to formulate a comprehensive model capable of encompassing all rival models. Furthermore, with the use of finite samples, there are dubious implications with regard to the validity and power of various encompassing tests that rely on asymptotic theory."

Data models in current use may have more damaging results than the publications in the social sciences based on a linear regression analysis. Just as the 5% level of significance became a de facto standard for publication, the Cox model for the analysis of survival times and logistic regression for survive-nonsurvive data have become the de facto standard for publication in medical journals. That different survival models, equally well fitting, could give different conclusions is not an issue.

5.4 Predictive Accuracy

The most obvious way to see how well the model box emulates nature's box is this: put a case x down nature's box getting an output y. Similarly, put the same case x down the model box getting an output y'. The closeness of y and y' is a measure of how good the emulation is. For a data model, this translates as: fit the parameters in your model by using the data, then, using the model, predict the data and see how good the prediction is.

Prediction is rarely perfect. There are usually many unmeasured variables whose effect is referred to as noise. But the extent to which the model box emulates nature's box is a measure of how well our model can reproduce the natural phenomenon producing the data.

McCullagh and Nelder (1989) in their book on generalized linear models also think the answer is obvious. They write, "At first sight it might seem as though a good model is one that fits the data very well; that is, one that makes the model predicted value very close to y (the response value)." Then they go on to note that the extent of the agreement is biased by the number of parameters used in the model and so is not a satisfactory measure. They are, of course, right. If the model has too many parameters, then it may overfit the data and give a biased estimate of accuracy. But there are ways to remove the bias. To get a more unbiased estimate of predictive accuracy, cross-validation can be used, as advocated in an important early work by Stone (1974). If the data set is larger, put aside a test set.

Mosteller and Tukey (1977) were early advocates of cross-validation. They write, "Cross-validation is a natural route to the indication of the quality of any data-derived quantity ... . We plan to cross-validate carefully wherever we can."

Judging by the infrequency of estimates of predictive accuracy in JASA, this measure of model fit that seems natural to me (and to Mosteller and Tukey) is not natural to others. More publication of predictive accuracy estimates would establish standards for comparison of models, a practice that is common in machine learning.

6. THE LIMITATIONS OF DATA MODELS

With the insistence on data models, multivariate analysis tools in statistics are frozen at discriminant analysis and logistic regression in classification and multiple linear regression in regression. Nobody really believes that multivariate data is multivariate normal, but that data model occupies a large number of pages in every graduate textbook on multivariate statistical analysis.

With data gathered from uncontrolled observations on complex systems involving unknown physical, chemical, or biological mechanisms, the a priori assumption that nature would generate the data through a parametric model selected by the statistician can result in questionable conclusions that cannot be substantiated by appeal to goodness-of-fit tests and residual analysis. Usually, simple parametric models imposed on data generated by complex systems, for example, medical data, financial data, result in a loss of accuracy and information as compared to algorithmic models (see Section 11).

There is an old saying "If all a man has is a hammer, then every problem looks like a nail." The trouble for statisticians is that recently some of the problems have stopped looking like nails. I conjecture that the result of hitting this wall is that more complicated data models are appearing in current published applications. Bayesian methods combined with Markov Chain Monte Carlo are cropping up all over. This may signify that as data becomes more complex, the data models become more cumbersome and are losing the advantage of presenting a simple and clear picture of nature's mechanism.

Approaching problems by looking for a data model imposes an a priori straight jacket that restricts the ability of statisticians to deal with a wide range of statistical problems. The best available solution to a data problem might be a data model; then again it might be an algorithmic model. The data and the problem guide the solution. To solve a wider range of data problems, a larger set of tools is needed.
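A minimal sketch of the bias discussed in Section 5.4, and of cross-validation as the corrective, is given below (Python with scikit-learn and synthetic data are assumed; the polynomial degrees are arbitrary).

```python
# Sketch for Section 5.4: apparent fit on the training data is optimistic when the
# model has many parameters; cross-validation gives a fairer estimate.
# Assumes numpy and scikit-learn; the data are synthetic.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, size=(60, 1))
y = np.sin(2 * x[:, 0]) + rng.normal(0, 0.3, size=60)

for degree in (1, 3, 12):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    apparent = model.fit(x, y).score(x, y)            # R^2 on the data used to fit
    cv = cross_val_score(model, x, y, cv=10).mean()   # 10-fold cross-validated R^2
    print(f"degree {degree:2d}: apparent R^2 = {apparent:.2f}, "
          f"cross-validated R^2 = {cv:.2f}")
```

As the number of parameters grows, the apparent agreement with the fitting data keeps improving while the cross-validated estimate does not.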

Perhaps the damaging consequence of the insistence on data models is that statisticians have ruled themselves out of some of the most interesting and challenging statistical problems that have arisen out of the rapidly increasing ability of computers to store and manipulate data. These problems are increasingly present in many fields, both scientific and commercial, and solutions are being found by nonstatisticians.

7. ALGORITHMIC MODELING

Under other names, algorithmic modeling has been used by industrial statisticians for decades. See, for instance, the delightful book Fitting Equations to Data (Daniel and Wood, 1971). It has been used by psychometricians and social scientists. Reading a preprint of Gifi's book (1990) many years ago uncovered a kindred spirit. It has made small inroads into the analysis of medical data starting with Richard Olshen's work in the early 1980s. For further work, see Zhang and Singer (1999). Jerome Friedman and Grace Wahba have done pioneering work on the development of algorithmic methods. But the list of statisticians in the algorithmic modeling business is short, and applications to data are seldom seen in the journals. The development of algorithmic methods was taken up by a community outside statistics.

7.1 A New Research Community

In the mid-1980s two powerful new algorithms for fitting data became available: neural nets and decision trees. A new research community using these tools sprang up. Their goal was predictive accuracy. The community consisted of young computer scientists, physicists and engineers plus a few aging statisticians. They began using the new tools in working on complex prediction problems where it was obvious that data models were not applicable: speech recognition, image recognition, nonlinear time series prediction, handwriting recognition, prediction in financial markets.

Their interests range over many fields that were once considered happy hunting grounds for statisticians and have turned out thousands of interesting research papers related to applications and methodology. A large majority of the papers analyze real data. The criterion for any model is what is the predictive accuracy. An idea of the range of research of this group can be got by looking at the Proceedings of the Neural Information Processing Systems Conference (their main yearly meeting) or at the Machine Learning Journal.

7.2 Theory in Algorithmic Modeling

Data models are rarely used in this community. The approach is that nature produces data in a black box whose insides are complex, mysterious, and, at least, partly unknowable. What is observed is a set of x's that go in and a subsequent set of y's that come out. The problem is to find an algorithm f(x) such that for future x in a test set, f(x) will be a good predictor of y.

The theory in this field shifts focus from data models to the properties of algorithms. It characterizes their "strength" as predictors, convergence if they are iterative, and what gives them good predictive accuracy. The one assumption made in the theory is that the data is drawn i.i.d. from an unknown multivariate distribution.

There is isolated work in statistics where the focus is on the theory of the algorithms. Grace Wahba's research on smoothing spline algorithms and their applications to data (using cross-validation) is built on theory involving reproducing kernels in Hilbert Space (1990). The final chapter of the CART book (Breiman et al., 1984) contains a proof of the asymptotic convergence of the CART algorithm to the Bayes risk by letting the trees grow as the sample size increases. There are others, but the relative frequency is small.

Theory resulted in a major advance in machine learning. Vladimir Vapnik constructed informative bounds on the generalization error (infinite test set error) of classification algorithms which depend on the "capacity" of the algorithm. These theoretical bounds led to support vector machines (see Vapnik, 1995, 1998) which have proved to be more accurate predictors in classification and regression than neural nets, and are the subject of heated current research (see Section 10).

My last paper, "Some infinity theory for tree ensembles" (Breiman, 2000), uses a function space analysis to try and understand the workings of tree ensemble methods. One section has the heading, "My kingdom for some good theory." There is an effective method for forming ensembles known as "boosting," but there isn't any finite sample size theory that tells us why it works so well.
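As a small illustration of the boosting idea mentioned above (scikit-learn's AdaBoost on synthetic data; this is not the specific algorithm, data, or theory referred to in the text), an ensemble of weak trees can be compared with a single weak tree by test set accuracy:

```python
# Sketch: an ensemble formed by boosting, judged by predictive accuracy on a
# held-out test set. Assumes scikit-learn; the data are synthetic.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, n_features=20, n_informative=8,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=0)

# AdaBoost's default weak learner is a depth-1 tree (a "stump").
stump = DecisionTreeClassifier(max_depth=1).fit(X_train, y_train)
boosted = AdaBoostClassifier(n_estimators=300, random_state=0).fit(X_train, y_train)

print("single stump, test set accuracy:", stump.score(X_test, y_test))
print("boosted ensemble, test set accuracy:", boosted.score(X_test, y_test))
```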

7.3 Recent Lessons

The advances in methodology and increases in predictive accuracy since the mid-1980s that have occurred in the research of machine learning have been phenomenal. There have been particularly exciting developments in the last five years. What has been learned? The three lessons that seem most important to me:

Rashomon: the multiplicity of good models;
Occam: the conflict between simplicity and accuracy;
Bellman: dimensionality, curse or blessing.

8. RASHOMON AND THE MULTIPLICITY OF GOOD MODELS

Rashomon is a wonderful Japanese movie in which four people, from different vantage points, witness an incident in which one person dies and another is supposedly raped. When they come to testify in court, they all report the same facts, but their stories of what happened are very different.

What I call the Rashomon Effect is that there is often a multitude of different descriptions [equations f(x)] in a class of functions giving about the same minimum error rate. The most easily understood example is subset selection in linear regression. Suppose there are 30 variables and we want to find the best five variable linear regressions. There are about 140,000 five-variable subsets in competition. Usually we pick the one with the lowest residual sum-of-squares (RSS), or, if there is a test set, the lowest test error. But there may be (and generally are) many five-variable equations that have RSS within 1.0% of the lowest RSS (see Breiman, 1996a). The same is true if test set error is being measured.

So here are three possible pictures with RSS or test set error within 1.0% of each other:

Picture 1
    y = 2.1 + 3.8 x_3 - 0.6 x_8 + 83.2 x_12 - 2.1 x_17 + 3.2 x_27

Picture 2
    y = -8.9 + 4.6 x_5 + 0.01 x_6 + 12.0 x_15 + 17.5 x_21 + 0.2 x_22

Picture 3
    y = -76.7 + 9.3 x_2 + 22.0 x_7 - 13.2 x_8 + 3.4 x_11 + 7.2 x_28

Which one is better? The problem is that each one tells a different story about which variables are important.

The Rashomon Effect also occurs with decision trees and neural nets. In my experiments with trees, if the training set is perturbed only slightly, say by removing a random 2-3% of the data, I can get a tree quite different from the original but with almost the same test set error. I once ran a small neural net 100 times on simple three-dimensional data, reselecting the initial weights to be small and random on each run. I found 32 distinct minima, each of which gave a different picture, and having about equal test set error.

This effect is closely connected to what I call instability (Breiman, 1996a) that occurs when there are many different models crowded together that have about the same training or test set error. Then a slight perturbation of the data or in the model construction will cause a skip from one model to another. The two models are close to each other in terms of error, but can be distant in terms of the form of the model.

If, in logistic regression or the Cox model, the common practice of deleting the less important covariates is carried out, then the model becomes unstable: there are too many competing models. Say you are deleting from 15 variables to 4 variables. Perturb the data slightly and you will very possibly get a different four-variable model and a different conclusion about which variables are important. To improve accuracy by weeding out less important covariates you run into the multiplicity problem. The picture of which covariates are important can vary significantly between two models having about the same deviance.

Aggregating over a large set of competing models can reduce the nonuniqueness while improving accuracy. Arena et al. (2000) bagged (see Glossary) logistic regression models on a data base of toxic and nontoxic chemicals where the number of covariates in each model was reduced from 15 to 4 by standard best subset selection. On a test set, the bagged model was significantly more accurate than the single model with four covariates. It is also more stable. This is one possible fix. The multiplicity problem and its effect on conclusions drawn from models needs serious attention.
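The subset-selection version of the Rashomon Effect is easy to reproduce. The sketch below (numpy assumed; synthetic correlated data; the problem is scaled down to 15 variables and subsets of four so that exhaustive enumeration runs quickly) counts how many subsets fit within 1% of the best one.

```python
# Sketch of the Rashomon Effect in subset selection: many different small subsets
# of the variables fit nearly equally well. Assumes numpy; synthetic data.
from itertools import combinations
import numpy as np

rng = np.random.default_rng(2)
n, p = 200, 15
latent = rng.normal(size=(n, 3))
X = latent @ rng.normal(size=(3, p)) + 0.5 * rng.normal(size=(n, p))  # correlated predictors
y = X[:, 0] - X[:, 3] + 0.5 * X[:, 7] + rng.normal(size=n)

def rss(columns):
    """Residual sum-of-squares of the least squares fit on the chosen columns."""
    design = np.column_stack([np.ones(n), X[:, columns]])
    residuals = y - design @ np.linalg.lstsq(design, y, rcond=None)[0]
    return float(residuals @ residuals)

subsets = list(combinations(range(p), 4))
errors = np.array([rss(s) for s in subsets])
best = errors.min()
close = [subsets[i] for i in np.flatnonzero(errors <= 1.01 * best)]
print(f"{len(close)} of {len(subsets)} four-variable subsets have RSS within 1% of the best")
print("a few of them:", close[:5])
```

Each of the near-best subsets tells a different story about which variables matter, which is exactly the multiplicity problem described above.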

9. OCCAM AND SIMPLICITY VS. ACCURACY

Occam's Razor, long admired, is usually interpreted to mean that simpler is better. Unfortunately, in prediction, accuracy and simplicity (interpretability) are in conflict. For instance, linear regression gives a fairly interpretable picture of the y-x relation. But its accuracy is usually less than that of the less interpretable neural nets. An example closer to my work involves trees.

On interpretability, trees rate an A+. A project I worked on in the late 1970s was the analysis of delay in criminal cases in state court systems. The Constitution gives the accused the right to a speedy trial. The Center for the State Courts was concerned that in many states, the trials were anything but speedy. It funded a study of the causes of the delay. I visited many states and decided to do the analysis in Colorado, which had an excellent computerized court data system. A wealth of information was extracted and processed.

The dependent variable for each criminal case was the time from arraignment to the time of sentencing. All of the other information in the trial history were the predictor variables. A large decision tree was grown, and I showed it on an overhead and explained it to the assembled Colorado judges. One of the splits was on District N which had a larger delay time than the other districts. I refrained from commenting on this. But as I walked out I heard one judge say to another, "I knew those guys in District N were dragging their feet."

While trees rate an A+ on interpretability, they are good, but not great, predictors. Give them, say, a B on prediction.

9.1 Growing Forests for Prediction

Instead of a single tree predictor, grow a forest of trees on the same data, say 50 or 100. If we are classifying, put the new x down each tree in the forest and get a vote for the predicted class. Let the forest prediction be the class that gets the most votes. There has been a lot of work in the last five years on ways to grow the forest. All of the well-known methods grow the forest by perturbing the training set, growing a tree on the perturbed training set, perturbing the training set again, growing another tree, etc. Some familiar methods are bagging (Breiman, 1996b), boosting (Freund and Schapire, 1996), arcing (Breiman, 1998), and additive logistic regression (Friedman, Hastie and Tibshirani, 1998).

My preferred method to date is random forests. In this approach successive decision trees are grown by introducing a random element into their construction. For example, suppose there are 20 predictor variables. At each node choose several of the 20 at random to use to split the node. Or use a random combination of a random selection of a few variables. This idea appears in Ho (1998), in Amit and Geman (1997) and is developed in Breiman (1999).

9.2 Forests Compared to Trees

We compare the performance of single trees (CART) to random forests on a number of small and large data sets, mostly from the UCI repository (ftp.ics.uci.edu/pub/MachineLearningDatabases). A summary of the data sets is given in Table 1.

Table 1. Data set descriptions

    Data set      Training sample size   Test sample size   Variables   Classes
    Cancer               699                    --                9         2
    Ionosphere           351                    --               34         2
    Diabetes             768                    --                8         2
    Glass                214                    --                9         6
    Soybean              683                    --               35        19
    ---------------------------------------------------------------------------
    Letters           15,000                 5,000               16        26
    Satellite          4,435                 2,000               36         6
    Shuttle           43,500                14,500                9         7
    DNA                2,000                 1,186               60         3
    Digit              7,291                 2,007              256        10

Table 2. Test set misclassification error (%)

    Data set          Forest   Single tree
    Breast cancer        2.9        5.9
    Ionosphere           5.5       11.2
    Diabetes            24.2       25.3
    Glass               22.0       30.4
    Soybean              5.7        8.6
    ---------------------------------------
    Letters              3.4       12.4
    Satellite            8.6       14.8
    Shuttle (x10^3)      7.0       62.0
    DNA                  3.9        6.2
    Digit                6.2       17.1

Table 2 compares the test set error of a single tree to that of the forest. For the five smaller data sets above the line, the test set error was estimated by leaving out a random 10% of the data, then running CART and the forest on the other 90%. The left-out 10% was run down the tree and the forest and the error on this 10% computed for both. This was repeated 100 times and the errors averaged. The larger data sets below the line came with a separate test set. People who have been in the classification field for a while find these increases in accuracy startling. Some errors are halved. Others are reduced by one-third.

In regression, where the forest prediction is the average over the individual tree predictions, the decreases in mean-squared test set error are similar.

9.3 Random Forests are A+ Predictors

The Statlog Project (Michie, Spiegelhalter and Taylor, 1994) compared 18 different classifiers. Included were neural nets, CART, linear and quadratic discriminant analysis, nearest neighbor, etc. The first four data sets below the line in Table 1 were the only ones used in the Statlog Project that came with separate test sets. In terms of rank of accuracy on these four data sets, the forest comes in 1, 1, 1, 1 for an average rank of 1.0. The next best classifier had an average rank of 7.3.

The fifth data set below the line consists of 16 x 16 pixel gray scale depictions of handwritten ZIP Code numerals. It has been extensively used by AT&T Bell Labs to test a variety of prediction methods. A neural net handcrafted to the data got a test set error of 5.1% vs. 6.2% for a standard run of random forest.

9.4 The Occam Dilemma

So forests are A+ predictors. But their mechanism for producing a prediction is difficult to understand. Trying to delve into the tangled web that generated a plurality vote from 100 trees is a Herculean task. So on interpretability, they rate an F. Which brings us to the Occam dilemma:

Accuracy generally requires more complex prediction methods. Simple and interpretable functions do not make the most accurate predictors.

Using complex predictors may be unpleasant, but the soundest path is to go for predictive accuracy first, then try to understand why. In fact, Section 10 points out that from a goal-oriented statistical viewpoint, there is no Occam's dilemma. (For more on Occam's Razor see Domingos, 1998, 1999.)

10. BELLMAN AND THE CURSE OF DIMENSIONALITY

The title of this section refers to Richard Bellman's famous phrase, "the curse of dimensionality." For decades, the first step in prediction methodology was to avoid the curse. If there were too many prediction variables, the recipe was to find a few features (functions of the predictor variables) that contain most of the information and then use these features to replace the original variables. In procedures common in statistics such as regression, logistic regression and survival models the advised practice is to use variable deletion to reduce the dimensionality. The published advice was that high dimensionality is dangerous. For instance, a well-regarded book on pattern recognition (Meisel, 1972) states "the features ... must be relatively few in number." But recent work has shown that dimensionality can be a blessing.

10.1 Digging It Out in Small Pieces

Reducing dimensionality reduces the amount of information available for prediction. The more predictor variables, the more information. There is also information in various combinations of the predictor variables. Let's try going in the opposite direction: instead of reducing dimensionality, increase it by adding many functions of the predictor variables. There may now be thousands of features. Each potentially contains a small amount of information. The problem is how to extract and put together these little pieces of information. There are two outstanding examples of work in this direction, the Shape Recognition Forest (Y. Amit and D. Geman, 1997) and Support Vector Machines (V. Vapnik, 1995, 1998).

10.2 The Shape Recognition Forest

In 1992, the National Institute of Standards and Technology (NIST) set up a competition for machine algorithms to read handwritten numerals. They put together a large set of pixel pictures of handwritten numbers (223,000) written by over 2,000 individuals. The competition attracted wide interest, and diverse approaches were tried.

The Amit-Geman approach defined many thousands of small geometric features in a hierarchical assembly. Shallow trees are grown, such that at each node, 100 features are chosen at random from the appropriate level of the hierarchy, and the optimal split of the node based on the selected features is found.

When a pixel picture of a number is dropped down a single tree, the terminal node it lands in gives probability estimates p_0, ..., p_9 that it represents numbers 0, 1, ..., 9. Over 1,000 trees are grown, the probabilities averaged over this forest, and the predicted number is assigned to the largest averaged probability.

Using a 100,000 example training set and a 50,000 test set, the Amit-Geman method gives a test set error of 0.7%, close to the limits of human error.

10.3 Support Vector Machines

Suppose there is two-class data having prediction vectors in M-dimensional Euclidean space. The prediction vectors for class #1 are {x(1)} and those for class #2 are {x(2)}. If these two sets of vectors can be separated by a hyperplane then there is an optimal separating hyperplane. "Optimal" is defined as meaning that the distance of the hyperplane to any prediction vector is maximal (see below).

The set of vectors in {x(1)} and in {x(2)} that achieve the minimum distance to the optimal separating hyperplane are called the support vectors. Their coordinates determine the equation of the hyperplane. Vapnik (1995) showed that if a separating hyperplane exists, then the optimal separating hyperplane has low generalization error (see Glossary).

[Figure: two classes of points separated by the optimal hyperplane, with the support vectors lying closest to it.]

In two-class data, separability by a hyperplane does not often occur. However, let us increase the dimensionality by adding as additional predictor variables all quadratic monomials in the original predictor variables; that is, all terms of the form x_m1 x_m2. A hyperplane in the original variables plus quadratic monomials in the original variables is a more complex creature. The possibility of separation is greater. If no separation occurs, add cubic monomials as input features. If there are originally 30 predictor variables, then there are about 40,000 features if monomials up to the fourth degree are added.

The higher the dimensionality of the set of features, the more likely it is that separation occurs. In the ZIP Code data set, separation occurs with fourth degree monomials added. The test set error is 4.1%. Using a large subset of the NIST data base as a training set, separation also occurred after adding up to fourth degree monomials and gave a test set error rate of 1.1%.

Separation can always be had by raising the dimensionality high enough. But if the separating hyperplane becomes too complex, the generalization error becomes large. An elegant theorem (Vapnik, 1995) gives this bound for the expected generalization error:

    Ex(GE) <= Ex(number of support vectors) / (N - 1),

where N is the sample size and the expectation is over all training sets of size N drawn from the same underlying distribution as the original training set.

The number of support vectors increases with the dimensionality of the feature space. If this number becomes too large, the separating hyperplane will not give low generalization error. If separation cannot be realized with a relatively small number of support vectors, there is another version of support vector machines that defines optimality by adding a penalty term for the vectors on the wrong side of the hyperplane.

Some ingenious algorithms make finding the optimal separating hyperplane computationally feasible. These devices reduce the search to a solution of a quadratic programming problem with linear inequality constraints that are of the order of the number N of cases, independent of the dimension of the feature space. Methods tailored to this particular problem produce speed-ups of an order of magnitude over standard methods for solving quadratic programming problems.

Support vector machines can also be used to provide accurate predictions in other areas (e.g., regression). It is an exciting idea that gives excellent performance and is beginning to supplant the use of neural nets. A readable introduction is in Cristianini and Shawe-Taylor (2000).
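A minimal sketch of the support vector machine just described (scikit-learn assumed; a degree-4 polynomial kernel stands in for explicitly adding all monomials up to the fourth degree, and the data are synthetic, not the ZIP Code or NIST sets):

```python
# Sketch for Section 10.3: raise the dimensionality with polynomial terms and
# separate the classes with a maximum-margin hyperplane. Assumes scikit-learn.
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_circles(n_samples=2000, noise=0.1, factor=0.4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Not linearly separable in the original two coordinates; the degree-4 polynomial
# kernel finds the separating hyperplane in the expanded feature space.
svm = SVC(kernel="poly", degree=4, C=1.0).fit(X_train, y_train)
print("number of support vectors:", svm.n_support_.sum())
print("test set accuracy:", svm.score(X_test, y_test))
```

The number of support vectors reported by the fit is the quantity that enters the generalization error bound quoted above.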

11. INFORMATION FROM A BLACK BOX

The dilemma posed in the last section is that the models that best emulate nature in terms of predictive accuracy are also the most complex and inscrutable. But this dilemma can be resolved by realizing the wrong question is being asked. Nature forms the outputs y from the inputs x by means of a black box with complex and unknown interior.

    y <-- nature <-- x

Current accurate prediction methods are also complex black boxes.

    y <-- [neural nets, forests, support vectors] <-- x

So we are facing two black boxes, where ours seems only slightly less inscrutable than nature's. In data generated by medical experiments, ensembles of predictors can give cross-validated error rates significantly lower than logistic regression. My biostatistician friends tell me, "Doctors can interpret logistic regression. There is no way they can interpret a black box containing fifty trees hooked together. In a choice between accuracy and interpretability, they'll go for interpretability."

Framing the question as the choice between accuracy and interpretability is an incorrect interpretation of what the goal of a statistical analysis is. The point of a model is to get useful information about the relation between the response and predictor variables. Interpretability is a way of getting information. But a model does not have to be simple to provide reliable information about the relation between predictor and response variables; neither does it have to be a data model.

The goal is not interpretability, but accurate information.

The following three examples illustrate this point. The first shows that random forests applied to a medical data set can give more reliable information about covariate strengths than logistic regression. The second shows that it can give interesting information that could not be revealed by a logistic regression. The third is an application to a microarray data where it is difficult to conceive of a data model that would uncover similar information.

11.1 Example I: Variable Importance in a Survival Data Set

The data set contains survival or nonsurvival of 155 hepatitis patients with 19 covariates. It is available at ftp.ics.uci.edu/pub/MachineLearningDatabases and was contributed by Gail Gong. The description is in a file called hepatitis.names. The data set has been previously analyzed by Diaconis and Efron (1983), and Cestnik, Konenenko and Bratko (1987). The lowest reported error rate to date, 17%, is in the latter paper.

Diaconis and Efron refer to work by Peter Gregory of the Stanford Medical School who analyzed this data and concluded that the important variables were numbers 6, 12, 14, 19 and reports an estimated 20% predictive accuracy. The variables were reduced in two stages: the first was by informal data analysis. The second refers to a more formal (unspecified) statistical procedure which I assume was logistic regression.

Efron and Diaconis drew 500 bootstrap samples from the original data set and used a similar procedure to isolate the important variables in each bootstrapped data set. The authors comment, "Of the four variables originally selected not one was selected in more than 60 percent of the samples. Hence the variables identified in the original analysis cannot be taken too seriously." We will come back to this conclusion later.

Logistic Regression

The predictive error rate for logistic regression on the hepatitis data set is 17.4%. This was evaluated by doing 100 runs, each time leaving out a randomly selected 10% of the data as a test set, and then averaging over the test set errors.

Usually, the initial evaluation of which variables are important is based on examining the absolute values of the coefficients of the variables in the logistic regression divided by their standard deviations. Figure 1 is a plot of these values.

The conclusion from looking at the standardized coefficients is that variables 7 and 11 are the most important covariates. When logistic regression is run using only these two variables, the cross-validated error rate rises to 22.9%. Another way to find important variables is to run a best subsets search which, for any value k, finds the subset of k variables having lowest deviance.

This procedure raises the problems of instability and multiplicity of models (see Section 7.1). There are about 4,000 subsets containing four variables. Of these, there are almost certainly a substantial number that have deviance close to the minimum and give different pictures of what the underlying mechanism is.

Fig. 1. Standardized coefficients, logistic regression (absolute coefficient divided by standard deviation, plotted against variable number).
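The bootstrap stability check described above can be sketched as follows. This is an illustration only: the data are a synthetic stand-in for the hepatitis set, scikit-learn's (penalized) logistic regression is used, and variables are ranked by a crude standardized coefficient (absolute coefficient times the predictor's spread) rather than the coefficient-to-standard-error ratio plotted in Figure 1.

```python
# Sketch: how often does the "top two variables" selection survive bootstrapping?
# Assumes numpy and scikit-learn; synthetic data with 155 cases and 19 covariates.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)
n, p = 155, 19
latent = rng.normal(size=(n, 4))
X = latent @ rng.normal(size=(4, p)) + rng.normal(size=(n, p))   # correlated covariates
logit = latent[:, 0] - latent[:, 1]
y = (rng.uniform(size=n) < 1.0 / (1.0 + np.exp(-logit))).astype(int)

def top_two(Xb, yb):
    """Indices of the two variables with the largest crude standardized coefficients."""
    model = LogisticRegression(max_iter=2000).fit(Xb, yb)
    score = np.abs(model.coef_[0]) * Xb.std(axis=0)
    return np.argsort(score)[-2:]

counts = np.zeros(p)
for _ in range(500):
    idx = rng.integers(0, n, size=n)          # one bootstrap sample of the cases
    for var in top_two(X[idx], y[idx]):
        counts[var] += 1

print("fraction of 500 bootstrap samples in which each variable was selected:")
print(np.round(counts / 500, 2))
```

If no variable appears in most of the selected pairs, the selection is unstable in exactly the sense Efron and Diaconis describe.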



Fig. 2. Variable importance, random forest (percent increase in error plotted against variable number).

Random Forests

The random forests predictive error rate, evaluated by averaging errors over 100 runs, each time leaving out 10% of the data as a test set, is 12.3%, almost a 30% reduction from the logistic regression error.

Random forests consists of a large number of randomly constructed trees, each voting for a class. Similar to bagging (Breiman, 1996), a bootstrap sample of the training set is used to construct each tree. A random selection of the input variables is searched to find the best split for each node.

To measure the importance of the mth variable, the values of the mth variable are randomly permuted in all of the cases left out in the current bootstrap sample. Then these cases are run down the current tree and their classification noted. At the end of a run consisting of growing many trees, the percent increase in misclassification rate due to noising up each variable is computed. This is the measure of variable importance that is shown in Figure 2.

Random forests singles out two variables, the 12th and the 17th, as being important. As a verification both variables were run in random forests, individually and together. The test set error rates over 100 replications were 14.3% each. Running both together did no better. We conclude that virtually all of the predictive capability is provided by a single variable, either 12 or 17.

To explore the interaction between 12 and 17 a bit further, at the end of a random forest run using all variables, the output includes the estimated value of the probability of each class vs. the case number. This information is used to get plots of the variable values (normalized to mean zero and standard deviation one) vs. the probability of death. The variable values are smoothed using a weighted linear regression smoother. The results are in Figure 3 for variables 12 and 17.

Fig. 3. Variables 12 and 17 vs. probability #1 (two panels: the normalized value of each variable plotted against class 1 probability).
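The permutation measure of importance described above can be sketched directly (scikit-learn assumed; for brevity, one variable at a time is permuted in a single held-out set rather than in the out-of-bag cases of each tree, which is what random forests actually uses, and the data are synthetic):

```python
# Sketch of permutation importance: noise up one variable at a time in held-out
# data and record how much the misclassification rate increases.
# Assumes numpy and scikit-learn; synthetic data with informative variables first.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=15, n_informative=4,
                           shuffle=False, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

forest = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_train, y_train)
base_error = 1.0 - forest.score(X_test, y_test)

rng = np.random.default_rng(0)
for m in range(X.shape[1]):
    X_noised = X_test.copy()
    X_noised[:, m] = rng.permutation(X_noised[:, m])   # destroy variable m's information
    increase = (1.0 - forest.score(X_noised, y_test)) - base_error
    print(f"variable {m:2d}: increase in error = {100 * increase:5.2f}%")
```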



Fig. 4. Variable importanceBupa data.

The graphs of the variable values vs. class death probability are almost linear and similar. The two variables turn out to be highly correlated. Thinking that this might have affected the logistic regression results, it was run again with one or the other of these two variables deleted. There was little change.
Out of curiosity, I evaluated variable importance in logistic regression in the same way that I did in random forests, by permuting variable values in the 10% test set and computing how much that increased the test set error. Not much help: variables 12 and 17 were not among the 3 variables ranked as most important. In partial verification of the importance of 12 and 17, I tried them separately as single variables in logistic regression. Variable 12 gave a 15.7% error rate, variable 17 came in at 19.3%.
To go back to the original Diaconis-Efron analysis, the problem is clear. Variables 12 and 17 are surrogates for each other. If one of them appears important in a model built on a bootstrap sample, the other does not. So each one's frequency of occurrence is automatically less than 50%. The paper lists the variables selected in ten of the samples. Either 12 or 17 appear in seven of the ten.
11.2 Example II: Clustering in Medical Data
The Bupa liver data set is a two-class biomedical data set also available at ftp.ics.uci.edu/pub/MachineLearningDatabases. The covariates are:
1. mcv: mean corpuscular volume
2. alkphos: alkaline phosphatase
3. sgpt: alanine aminotransferase
4. sgot: aspartate aminotransferase
5. gammagt: gamma-glutamyl transpeptidase
6. drinks: half-pint equivalents of alcoholic beverage drunk per day
The first five attributes are the results of blood tests thought to be related to liver functioning. The 345 patients are classified into two classes by the severity of their liver malfunctioning. Class two is severe malfunctioning.

Fig. 5. Cluster averages, Bupa data. [Plot not reproduced.]
In a random forests run, the misclassification error rate is 28%. The variable importance given by random forests is in Figure 4. Blood tests 3 and 5 are the most important, followed by test 4. Random forests also outputs an intrinsic similarity measure which can be used to cluster. When this was applied, two clusters were discovered in class two. The average of each variable is computed and plotted in each of these clusters in Figure 5.
An interesting facet emerges. The class two subjects consist of two distinct groups: those that have high scores on blood tests 3, 4, and 5 and those that have low scores on those tests.
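The intrinsic similarity measure can be approximated by the fraction of trees in which two cases land in the same terminal node. The sketch below, on synthetic stand-in data rather than the Bupa set itself, clusters the class two cases with that proximity:

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from scipy.cluster.hierarchy import linkage, fcluster
    from scipy.spatial.distance import squareform

    # Synthetic two-class stand-in with six covariates, as in the Bupa example.
    X, y = make_classification(n_samples=345, n_features=6, n_informative=4, random_state=1)
    forest = RandomForestClassifier(n_estimators=300, random_state=1).fit(X, y)

    leaves = forest.apply(X)                              # terminal node of each case in each tree
    prox = (leaves[:, None, :] == leaves[None, :, :]).mean(axis=2)  # similarity in [0, 1]

    idx = np.where(y == 1)[0]                             # the "class two" cases
    dist = 1.0 - prox[np.ix_(idx, idx)]
    np.fill_diagonal(dist, 0.0)
    tree = linkage(squareform(dist, checks=False), method="average")
    clusters = fcluster(tree, t=2, criterion="maxclust")
    for c in (1, 2):                                      # average of each variable within each cluster
        print(c, X[idx][clusters == c].mean(axis=0).round(2))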
11.3 Example III: Microarray Data
Random forests was run on a microarray lymphoma data set with three classes, sample size of 81 and 4,682 variables (genes) without any variable selection [for more information about this data set, see Dudoit, Fridlyand and Speed (2000)]. The error rate was low. What was also interesting from a scientific viewpoint was an estimate of the importance of each of the 4,682 gene expressions.
The graph in Figure 6 was produced by a run of random forests. This result is consistent with assessments of variable importance made using other algorithmic methods, but appears to have sharper detail.
11.4 Remarks about the Examples
The examples show that much information is available from an algorithmic model. Friedman (1999) derives similar variable information from a different way of constructing a forest. The similarity is that they are both built as ways to give low predictive error.
There are 32 deaths and 123 survivors in the hepatitis data set. Calling everyone a survivor gives a baseline error rate of 20.6%. Logistic regression lowers this to 17.4%. It is not extracting much useful information from the data, which may explain its inability to find the important variables. Its weakness might have been unknown and the variable importances accepted at face value if its predictive accuracy was not evaluated.
Random forests is also capable of discovering important aspects of the data that standard data models cannot uncover. The potentially interesting clustering of class two patients in Example II is an illustration. The standard procedure when fitting data models such as logistic regression is to delete variables; to quote from Diaconis and Efron (1983) again, "...statistical experience suggests that it is unwise to fit a model that depends on 19 variables with only 155 data points available." Newer methods in machine learning thrive on variables: the more the better. For instance, random forests does not overfit. It gives excellent accuracy on the lymphoma data set of Example III which has over 4,600 variables, with no variable deletion and is capable of extracting variable importance information from the data.
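To illustrate the "more variables the better" point, a forest can be fit to a wide synthetic matrix shaped like the lymphoma data (81 cases, thousands of variables, three classes). This is only a sketch; the built-in importance used here is the Gini-based measure, a rougher proxy for the permutation importance discussed above:

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    X, y = make_classification(n_samples=81, n_features=4682, n_informative=30,
                               n_classes=3, n_clusters_per_class=1, random_state=0)
    forest = RandomForestClassifier(n_estimators=500, random_state=0)
    print(1.0 - cross_val_score(forest, X, y, cv=5).mean())    # estimated error rate
    forest.fit(X, y)
    top_genes = np.argsort(forest.feature_importances_)[::-1][:10]
    print(top_genes)                                           # ten highest-importance "genes"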

Fig. 6. Microarray variable importance. [Plot not reproduced.]
These examples illustrate the following points:
Higher predictive accuracy is associated with more reliable information about the underlying data mechanism. Weak predictive accuracy can lead to questionable conclusions.
Algorithmic models can give better predictive accuracy than data models, and provide better information about the underlying mechanism.
12. FINAL REMARKS
The goals in statistics are to use data to predict and to get information about the underlying data mechanism. Nowhere is it written on a stone tablet what kind of model should be used to solve problems involving data. To make my position clear, I am not against data models per se. In some situations they are the most appropriate way to solve the problem. But the emphasis needs to be on the problem and on the data.
Unfortunately, our field has a vested interest in data models, come hell or high water. For instance, see Dempster's (1998) paper on modeling. His position on the 1990 Census adjustment controversy is particularly interesting. He admits that he doesn't know much about the data or the details, but argues that the problem can be solved by a strong dose of modeling. That more modeling can make error-ridden data accurate seems highly unlikely to me.
Terabytes of data are pouring into computers from many sources, both scientific and commercial, and there is a need to analyze and understand the data. For instance, data is being generated at an awesome rate by telescopes and radio telescopes scanning the skies. Images containing millions of stellar objects are stored on tape or disk. Astronomers need automated ways to scan their data to find certain types of stellar objects or novel objects. This is a fascinating enterprise, and I doubt if data models are applicable. Yet I would enter this in my ledger as a statistical problem.
The analysis of genetic data is one of the most challenging and interesting statistical problems around. Microarray data, like that analyzed in Section 11.3, can lead to significant advances in understanding genetic effects. But the analysis of variable importance in Section 11.3 would be difficult to do accurately using a stochastic data model.
Problems such as stellar recognition or analysis of gene expression data could be high adventure for statisticians. But it requires that they focus on solving the problem instead of asking what data model they can create. The best solution could be an algorithmic model, or maybe a data model, or maybe a combination. But the trick to being a scientist is to be open to using a wide variety of tools.
The roots of statistics, as in science, lie in working with data and checking theory against data. I hope in this century our field will return to its roots. There are signs that this hope is not illusory. Over the last ten years, there has been a noticeable move toward statistical work on real world problems and reaching out by statisticians toward collaborative work with other disciplines. I believe this trend will continue and, in fact, has to continue if we are to survive as an energetic and creative field.
GLOSSARY
Since some of the terms used in this paper may not be familiar to all statisticians, I append some definitions.
Infinite test set error. Assume a loss function L(y, ŷ) that is a measure of the error when y is the true response and ŷ the predicted response. In classification, the usual loss is 1 if ŷ ≠ y and zero if ŷ = y. In regression, the usual loss is (y - ŷ)². Given a set of data (training set) consisting of (yn, xn), n = 1, 2, ..., N, use it to construct a predictor function f(x) of y. Assume that the training set is i.i.d. drawn from the distribution of the random vector (Y, X). The infinite test set error is E[L(Y, f(X))]. This is called the generalization error in machine learning.
The generalization error is estimated either by setting aside a part of the data as a test set or by cross-validation.
the data. For instance, data is being generated Predictive accuracy. This refers to the size of
at an awesome rate by telescopes and radio tele- the estimated generalization error. Good predictive
scopes scanning the skies. Images containing mil- accuracy means low estimated error.
lions of stellar objects are stored on tape or disk. Trees and nodes. This terminology refers to deci-
Astronomers need automated ways to scan their sion trees as described in the Breiman et al book
data to nd certain types of stellar objects or novel (1984).
objects. This is a fascinating enterprise, and I doubt Dropping an x down a tree. When a vector of pre-
if data models are applicable. Yet I would enter this dictor variables is dropped down a tree, at each
in my ledger as a statistical problem. intermediate node it has instructions whether to go
The analysis of genetic data is one of the most left or right depending on the coordinates of x. It
challenging and interesting statistical problems stops at a terminal node and is assigned the predic-
around. Microarray data, like that analyzed in tion given by that node.
Section 11.3 can lead to signicant advances in Bagging. An acronym for bootstrap aggregat-
understanding genetic effects. But the analysis ing. Start with an algorithm such that given any
of variable importance in Section 11.3 would be training set, the algorithm produces a prediction
difcult to do accurately using a stochastic data function x. The algorithm can be a decision tree
model. construction, logistic regression with variable dele-
Problems such as stellar recognition or analysis tion, etc. Take a bootstrap sample from the training
of gene expression data could be high adventure for set and use this bootstrap training set to construct
statisticians. But it requires that they focus on solv- the predictor 1 x. Take another bootstrap sam-
ing the problem instead of asking what data model ple and using this second training set construct the
they can create. The best solution could be an algo- predictor 2 x. Continue this way for K steps. In
rithmic model, or maybe a data model, or maybe a regression, average all of the k x to get the
bagged predictor at x. In classification, that class which has the plurality vote of the fk(x) is the bagged predictor. Bagging has been shown effective in variance reduction (Breiman, 1996b).
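A minimal sketch of the procedure just defined, using tree construction as the base algorithm and a plurality vote for a two-class problem on synthetic data:

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=400, n_features=8, random_state=0)
    rng = np.random.default_rng(0)
    K = 50
    predictors = []
    for k in range(K):
        boot = rng.integers(0, len(X), size=len(X))       # indices of a bootstrap sample
        predictors.append(DecisionTreeClassifier(random_state=k).fit(X[boot], y[boot]))

    votes = np.stack([p.predict(X) for p in predictors])  # each predictor votes for a class
    bagged = (votes.mean(axis=0) > 0.5).astype(int)       # plurality vote of the K predictors
    print(np.mean(bagged == y))                           # accuracy of the bagged predictor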
Boosting. This is a more complex way of forming an ensemble of predictors in classification than bagging (Freund and Schapire, 1996). It uses no randomization but proceeds by altering the weights on the training set. Its performance in terms of low prediction error is excellent (for details see Breiman, 1998).
ACKNOWLEDGMENTS
Many of my ideas about data modeling were formed in three decades of conversations with my old friend and collaborator, Jerome Friedman. Conversations with Richard Olshen about the Cox model and its use in biostatistics helped me to understand the background. I am also indebted to William Meisel, who headed some of the prediction projects I consulted on and helped me make the transition from probability theory to algorithms, and to Charles Stone for illuminating conversations about the nature of statistics and science. I'm grateful also for the comments of the editor, Leon Gleser, which prompted a major rewrite of the first draft of this manuscript and resulted in a different and better paper.
REFERENCES
Amit, Y. and Geman, D. (1997). Shape quantization and recognition with randomized trees. Neural Computation 9 1545-1588.
Arena, C., Sussman, N., Chiang, K., Mazumdar, S., Macina, O. and Li, W. (2000). Bagging structure-activity relationships: a simulation study for assessing misclassification rates. Presented at the Second Indo-U.S. Workshop on Mathematical Chemistry, Duluth, MI. (Available at NSussman@server.ceoh.pitt.edu.)
Bickel, P., Ritov, Y. and Stoker, T. (2001). Tailor-made tests for goodness of fit for semiparametric hypotheses. Unpublished manuscript.
Breiman, L. (1996a). The heuristics of instability in model selection. Ann. Statist. 24 2350-2381.
Breiman, L. (1996b). Bagging predictors. Machine Learning J. 26 123-140.
Breiman, L. (1998). Arcing classifiers. Discussion paper, Ann. Statist. 26 801-824.
Breiman, L. (2000). Some infinity theory for tree ensembles. (Available at www.stat.berkeley.edu/technical reports.)
Breiman, L. (2001). Random forests. Machine Learning J. 45 5-32.
Breiman, L. and Friedman, J. (1985). Estimating optimal transformations in multiple regression and correlation. J. Amer. Statist. Assoc. 80 580-619.
Breiman, L., Friedman, J., Olshen, R. and Stone, C. (1984). Classification and Regression Trees. Wadsworth, Belmont, CA.
Cristianini, N. and Shawe-Taylor, J. (2000). An Introduction to Support Vector Machines. Cambridge Univ. Press.
Daniel, C. and Wood, F. (1971). Fitting Equations to Data. Wiley, New York.
Dempster, A. (1998). Logicist statistics I. Models and modeling. Statist. Sci. 13(3) 248-276.
Diaconis, P. and Efron, B. (1983). Computer intensive methods in statistics. Scientific American 248 116-131.
Domingos, P. (1998). Occam's two razors: the sharp and the blunt. In Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining (R. Agrawal and P. Stolorz, eds.) 37-43. AAAI Press, Menlo Park, CA.
Domingos, P. (1999). The role of Occam's razor in knowledge discovery. Data Mining and Knowledge Discovery 3 409-425.
Dudoit, S., Fridlyand, J. and Speed, T. (2000). Comparison of discrimination methods for the classification of tumors. (Available at www.stat.berkeley.edu/technical reports.)
Freedman, D. (1987). As others see us: a case study in path analysis (with discussion). J. Ed. Statist. 12 101-223.
Freedman, D. (1991). Statistical models and shoe leather. Sociological Methodology 1991 (with discussion) 291-358.
Freedman, D. (1991). Some issues in the foundations of statistics. Foundations of Science 1 19-83.
Freedman, D. (1994). From association to causation via regression. Adv. in Appl. Math. 18 59-110.
Freund, Y. and Schapire, R. (1996). Experiments with a new boosting algorithm. In Machine Learning: Proceedings of the Thirteenth International Conference 148-156. Morgan Kaufmann, San Francisco.
Friedman, J. (1999). Greedy predictive approximation: a gradient boosting machine. Technical report, Dept. Statistics, Stanford Univ.
Friedman, J., Hastie, T. and Tibshirani, R. (2000). Additive logistic regression: a statistical view of boosting. Ann. Statist. 28 337-407.
Gifi, A. (1990). Nonlinear Multivariate Analysis. Wiley, New York.
Ho, T. K. (1998). The random subspace method for constructing decision forests. IEEE Trans. Pattern Analysis and Machine Intelligence 20 832-844.
Landwehr, J., Pregibon, D. and Shoemaker, A. (1984). Graphical methods for assessing logistic regression models (with discussion). J. Amer. Statist. Assoc. 79 61-83.
McCullagh, P. and Nelder, J. (1989). Generalized Linear Models. Chapman and Hall, London.
Meisel, W. (1972). Computer-Oriented Approaches to Pattern Recognition. Academic Press, New York.
Michie, D., Spiegelhalter, D. and Taylor, C. (1994). Machine Learning, Neural and Statistical Classification. Ellis Horwood, New York.
Mosteller, F. and Tukey, J. (1977). Data Analysis and Regression. Addison-Wesley, Reading, MA.
Mountain, D. and Hsiao, C. (1989). A combined structural and flexible functional approach for modeling energy substitution. J. Amer. Statist. Assoc. 84 76-87.
Stone, M. (1974). Cross-validatory choice and assessment of statistical predictions. J. Roy. Statist. Soc. B 36 111-147.
Vapnik, V. (1995). The Nature of Statistical Learning Theory. Springer, New York.
Vapnik, V. (1998). Statistical Learning Theory. Wiley, New York.
Wahba, G. (1990). Spline Models for Observational Data. SIAM, Philadelphia.
Zhang, H. and Singer, B. (1999). Recursive Partitioning in the Health Sciences. Springer, New York.
Comment
D. R. Cox
Professor Breimans interesting paper gives both key points concern the precise meaning of the data,
a clear statement of the broad approach underly- the possible biases arising from the method of ascer-
ing some of his inuential and widely admired con- tainment, the possible presence of major distorting
tributions and outlines some striking applications measurement errors and the nature of processes
and developments. He has combined this with a cri- underlying missing and incomplete data and data
tique of what, for want of a better term, I will call that evolve in time in a way involving complex inter-
mainstream statistical thinking, based in part on dependencies. For some of these, at least, it is hard
a caricature. Like all good caricatures, it contains to see how to proceed without some notion of prob-
enough truth and exposes enough weaknesses to be abilistic modeling.
thought-provoking. Next Professor Breiman emphasizes prediction
There is not enough space to comment on all the as the objective, success at prediction being the
many points explicitly or implicitly raised in the criterion of success, as contrasted with issues
paper. There follow some remarks about a few main of interpretation or understanding. Prediction is
issues. indeed important from several perspectives. The
One of the attractions of our subject is the aston- success of a theory is best judged from its ability to
ishingly wide range of applications as judged not predict in new contexts, although one cannot dis-
only in terms of substantive eld but also in terms miss as totally useless theories such as the rational
of objectives, quality and quantity of data and action theory (RAT), in political science, which, as
so on. Thus any unqualied statement that in I understand it, gives excellent explanations of the
past but which has failed to predict the real politi-
applications   has to be treated sceptically. One
cal world. In a clinical trial context it can be argued
of our failings has, I believe, been, in a wish to
that an objective is to predict the consequences of
stress generality, not to set out more clearly the
treatment allocation to future patients, and so on.
distinctions between different kinds of application
If the prediction is localized to situations directly
and the consequences for the strategy of statistical
similar to those applying to the data there is then
analysis. Of course we have distinctions between
an interesting and challenging dilemma. Is it prefer-
decision-making and inference, between tests and
able to proceed with a directly empirical black-box
estimation, and between estimation and predic- approach, as favored by Professor Breiman, or is
tion and these are useful but, I think, are, except it better to try to take account of some underly-
perhaps the rst, too phrased in terms of the tech- ing explanatory process? The answer must depend
nology rather than the spirit of statistical analysis. on the context but I certainly accept, although it
I entirely agree with Professor Breiman that it goes somewhat against the grain to do so, that
would be an impoverished and extremely unhis- there are situations where a directly empirical
torical view of the subject to exclude the kind of approach is better. Short term economic forecasting
work he describes simply because it has no explicit and real-time ood forecasting are probably further
probabilistic base. exemplars. Key issues are then the stability of the
Professor Breiman takes data as his starting predictor as practical prediction proceeds, the need
point. I would prefer to start with an issue, a ques- from time to time for recalibration and so on.
tion or a scientic hypothesis, although I would However, much prediction is not like this. Often
be surprised if this were a real source of disagree- the prediction is under quite different conditions
ment. These issues may evolve, or even change from the data; what is the likely progress of the
radically, as analysis proceeds. Data looking for incidence of the epidemic of v-CJD in the United
a question are not unknown and raise puzzles Kingdom, what would be the effect on annual inci-
but are, I believe, atypical in most contexts. Next, dence of cancer in the United States of reducing by
even if we ignore design aspects and start with data, 10% the medical use of X-rays, etc.? That is, it may
be desired to predict the consequences of something
only indirectly addressed by the data available for
D. R. Cox is an Honorary Fellow, Nufeld College, analysis. As we move toward such more ambitious
Oxford OX1 1NF, United Kingdom, and associate tasks, prediction, always hazardous, without some
member, Department of Statistics, University of understanding of underlying process and linking
Oxford (e-mail: david.cox@nufeld.oxford.ac.uk). with other sources of information, becomes more
and more tentative. Formulation of the goals of process involving somehow or other choosing a
analysis solely in terms of direct prediction over the model, often a default model of standard form,
data set seems then increasingly unhelpful. and applying standard methods of analysis and
This is quite apart from matters where the direct goodness-of-t procedures. Thus for survival data
objective is understanding of and tests of subject- choose a priori the proportional hazards model.
matter hypotheses about underlying process, the (Note, incidentally, that in the paper, often quoted
nature of pathways of dependence and so on. but probably rarely read, that introduced this
What is the central strategy of mainstream sta- approach there was a comparison of several of the
tistical analysis? This can most certainly not be dis- many different models that might be suitable for
cerned from the pages of Bernoulli, The Annals of this kind of data.) It is true that many of the anal-
Statistics or the Scandanavian Journal of Statistics yses done by nonstatisticians or by statisticians
nor from Biometrika and the Journal of Royal Sta- under severe time constraints are more or less like
tistical Society, Series B or even from the application those Professor Breiman describes. The issue then
pages of Journal of the American Statistical Associa- is not whether they could ideally be improved, but
tion or Applied Statistics, estimable though all these whether they capture enough of the essence of the
journals are. Of course as we move along the list, information in the data, together with some rea-
there is an increase from zero to 100% in the papers sonable indication of precision as a guard against
containing analyses of real data. But the papers under or overinterpretation. Would more rened
do so nearly always to illustrate technique rather analysis, possibly with better predictive power and
than to explain the process of analysis and inter- better t, produce subject-matter gains? There can
pretation as such. This is entirely legitimate, but be no general answer to this, but one suspects that
is completely different from live analysis of current quite often the limitations of conclusions lie more
data to obtain subject-matter conclusions or to help in weakness of data quality and study design than
solve specic practical issues. Put differently, if an in ineffective analysis.
important conclusion is reached involving statisti- There are two broad lines of development active
cal analysis it will be reported in a subject-matter at the moment arising out of mainstream statistical
journal or in a written or verbal report to colleagues, ideas. The rst is the invention of models strongly
government or business. When that happens, statis- tied to subject-matter considerations, represent-
tical details are typically and correctly not stressed. ing underlying dependencies, and their analysis,
Thus the real procedures of statistical analysis can perhaps by Markov chain Monte Carlo methods.
be judged only by looking in detail at specic cases, In elds where subject-matter considerations are
and access to these is not always easy. Failure to largely qualitative, we see a development based on
discuss enough the principles involved is a major Markov graphs and their generalizations. These
criticism of the current state of theory. methods in effect assume, subject in principle to
I think tentatively that the following quite com- empirical test, more and more about the phenom-
monly applies. Formal models are useful and often ena under study. By contrast, there is an emphasis
almost, if not quite, essential for incisive thinking. on assuming less and less via, for example, kernel
Descriptively appealing and transparent methods estimates of regression functions, generalized addi-
with a rm model base are the ideal. Notions of tive models and so on. There is a need to be clearer
signicance tests, condence intervals, posterior about the circumstances favoring these two broad
intervals and all the formal apparatus of inference approaches, synthesizing them where possible.
are valuable tools to be used as guides, but not in a My own interest tends to be in the former style
mechanical way; they indicate the uncertainty that of work. From this perspective Cox and Wermuth
would apply under somewhat idealized, may be (1996, page 15) listed a number of requirements of a
very idealized, conditions and as such are often statistical model. These are to establish a link with
lower bounds to real uncertainty. Analyses and background knowledge and to set up a connection
model development are at least partly exploratory. with previous work, to give some pointer toward
Automatic methods of model selection (and of vari- a generating process, to have primary parameters
able selection in regression-like problems) are to be with individual clear subject-matter interpretations,
shunned or, if use is absolutely unavoidable, are to to specify haphazard aspects well enough to lead
be examined carefully for their effect on the nal to meaningful assessment of precision and, nally,
conclusions. Unfocused tests of model adequacy are that the t should be adequate. From this perspec-
rarely helpful. tive, t, which is broadly related to predictive suc-
By contrast, Professor Breiman equates main- cess, is not the primary basis for model choice and
stream applied statistics to a relatively mechanical formal methods of model choice that take no account
of the broader objectives are suspect in the present the analysis, an important and indeed fascinating
context. In a sense these are efforts to establish data question, but a secondary step. Better a rough
descriptions that are potentially causal, recognizing answer to the right question than an exact answer
that causality, in the sense that a natural scientist to the wrong question, an aphorism, due perhaps to
would use the term, can rarely be established from Lord Kelvin, that I heard as an undergraduate in
one type of study and is at best somewhat tentative. applied mathematics.
Professor Breiman takes a rather defeatist atti- I have stayed away from the detail of the paper
tude toward attempts to formulate underlying but will comment on just one point, the interesting
processes; is this not to reject the base of much sci- theorem of Vapnik about complete separation. This
entic progress? The interesting illustrations given conrms folklore experience with empirical logistic
by Beveridge (1952), where hypothesized processes regression that, with a largish number of explana-
in various biological contexts led to important
tory variables, complete separation is quite likely to
progress, even though the hypotheses turned out in
occur. It is interesting that in mainstream thinking
the end to be quite false, illustrate the subtlety of
this is, I think, regarded as insecure in that com-
the matter. Especially in the social sciences, repre-
sentations of underlying process have to be viewed plete separation is thought to be a priori unlikely
with particular caution, but this does not make and the estimated separating plane unstable. Pre-
them fruitless. sumably bootstrap and cross-validation ideas may
The absolutely crucial issue in serious main- give here a quite misleading illusion of stability.
stream statistics is the choice of a model that Of course if the complete separator is subtle and
will translate key subject-matter questions into a stable Professor Breimans methods will emerge tri-
form for analysis and interpretation. If a simple umphant and ultimately it is an empirical question
standard model is adequate to answer the subject- in each application as to what happens.
matter question, this is ne: there are severe It will be clear that while I disagree with the main
hidden penalties for overelaboration. The statisti- thrust of Professor Breimans paper I found it stim-
cal literature, however, concentrates on how to do ulating and interesting.

Comment
Brad Efron
At rst glance Leo Breimans stimulating paper dominant interpretational methodology in dozens of
looks like an argument against parsimony and sci- elds, but, as we say in California these days, it
entic insight, and in favor of black boxes with lots is power purchased at a price: the theory requires a
of knobs to twiddle. At second glance it still looks modestly high ratio of signal to noise, sample size to
that way, but the paper is stimulating, and Leo has number of unknown parameters, to have much hope
some important points to hammer home. At the risk of success. Good experimental design amounts to
of distortion I will try to restate one of those points, enforcing favorable conditions for unbiased estima-
the most interesting one in my opinion, using less tion and testing, so that the statistician wont nd
confrontational and more historical language. himself or herself facing 100 data points and 50
From the point of view of statistical development parameters.
the twentieth century might be labeled 100 years Now it is the twenty-rst century when, as the
of unbiasedness. Following Fishers lead, most of paper reminds us, we are being asked to face prob-
our current statistical theory and practice revolves lems that never heard of good experimental design.
around unbiased or nearly unbiased estimates (par- Sample sizes have swollen alarmingly while goals
ticularly MLEs), and tests based on such estimates. grow less distinct (nd interesting data structure).
The power of this theory has made statistics the
New algorithms have arisen to deal with new prob-
lems, a healthy sign it seems to me even if the inno-
Brad Efron is Professor, Department of Statis- vators arent all professional statisticians. There are
tics, Sequoia Hall, 390 Serra Mall, Stanford Uni- enough physicists to handle the physics case load,
versity, Stanford, California 943054065 (e-mail: but there are fewer statisticians and more statistics
brad@stat.stanford.edu). problems, and we need all the help we can get. An
attractive feature of Leos paper is his openness to The prediction culture, at least around Stan-
new ideas whatever their source. ford, is a lot bigger than 2%, though its constituency
The new algorithms often appear in the form of changes and most of us wouldnt welcome being
black boxes with enormous numbers of adjustable typecast.
parameters (knobs to twiddle), sometimes more Estimation and testing are a form of prediction:
knobs than data points. These algorithms can be In our sample of 20 patients drug A outperformed
quite successful as Leo points outs, sometimes drug B; would this still be true if we went on to test
more so than their classical counterparts. However, all possible patients?
unless the bias-variance trade-off has been sus- Prediction by itself is only occasionally suf-
pended to encourage new statistical industries, their cient. The post ofce is happy with any method
success must hinge on some form of biased estima- that predicts correct addresses from hand-written
tion. The bias may be introduced directly as with the scrawls. Peter Gregory undertook his study for pre-
regularization of overparameterized linear mod- diction purposes, but also to better understand the
els, more subtly as in the pruning of overgrown medical basis of hepatitis. Most statistical surveys
regression trees, or surreptitiously as with support have the identication of causal factors as their ulti-
mate goal.
vector machines, but it has to be lurking somewhere
inside the theory. The hepatitis data was rst analyzed by Gail
Of course the trouble with biased estimation is Gong in her 1982 Ph.D. thesis, which concerned pre-
that we have so little theory to fall back upon. diction problems and bootstrap methods for improv-
Fishers information bound, which tells us how well ing on cross-validation. (Cross-validation itself is an
a (nearly) unbiased estimator can possibly perform, uncertain methodology that deserves further crit-
is of no help at all in dealing with heavily biased ical scrutiny; see, for example, Efron and Tibshi-
methodology. Numerical experimentation by itself, rani, 1996). The Scientic American discussion is
unguided by theory, is prone to faddish wandering: quite brief, a more thorough description appearing
Rule 1. New methods always look better than old in Efron and Gong (1983). Variables 12 or 17 (13
ones. Neural nets are better than logistic regres- or 18 in Efron and Gongs numbering) appeared
sion, support vector machines are better than neu- as important in 60% of the bootstrap simulations,
ral nets, etc. In fact it is very difcult to run an which might be compared with the 59% for variable
honest simulation comparison, and easy to inadver- 19, the most for any single explanator.
tently cheat by choosing favorable examples, or by In what sense are variable 12 or 17 or 19
not putting as much effort into optimizing the dull important or not important? This is the kind of
old standard as the exciting new challenger. interesting inferential question raised by prediction
methodology. Tibshirani and I made a stab at an
Rule 2. Complicated methods are harder to crit-
answer in our 1998 annals paper. I believe that the
icize than simple ones. By now it is easy to check
current interest in statistical prediction will eventu-
the efciency of a logistic regression, but it is no
ally invigorate traditional inference, not eliminate
small matter to analyze the limitations of a sup-
it.
port vector machine. One of the best things statis-
A third front seems to have been opened in
ticians do, and something that doesnt happen out- the long-running frequentist-Bayesian wars by the
side our profession, is clarify the inferential basis advocates of algorithmic prediction, who dont really
of a proposed new methodology, a nice recent exam- believe in any inferential school. Leos paper is at its
ple being Friedman, Hastie, and Tibshiranis anal- best when presenting the successes of algorithmic
ysis of boosting, (2000). The past half-century has modeling, which comes across as a positive devel-
seen the clarication process successfully at work on opment for both statistical practice and theoretical
nonparametrics, robustness and survival analysis. innovation. This isnt an argument against tradi-
There has even been some success with biased esti- tional data modeling any more than splines are an
mation in the form of Stein shrinkage and empirical argument against polynomials. The whole point of
Bayes, but I believe the hardest part of this work science is to open up black boxes, understand their
remains to be done. Papers like Leos are a call for insides, and build better boxes for the purposes of
more analysis and theory, not less. mankind. Leo himself is a notably successful sci-
Prediction is certainly an interesting subject but entist, so we can hope that the present paper was
Leos paper overstates both its role and our profes- written more as an advocacy device than as the con-
sions lack of interest in it. fessions of a born-again black boxist.
Comment
Bruce Hoadley

INTRODUCTION next 6 months. The goal is to estimate the function,


fx = logPry = 1x/ Pry = 0x. Professor
Professor Breimans paper is an important one
Breiman argues that some kind of simple logistic
for statisticians to read. He and Statistical Science
regression from the data modeling culture is not
should be applauded for making this kind of mate-
the way to solve this problem. I agree. Lets take
rial available to a large audience. His conclusions
are consistent with how statistics is often practiced a look at how the engineers at Fair, Isaac solved
in business. This discussion will consist of an anec- this problemway back in the 1960s and 1970s.
dotal recital of my encounters with the algorithmic The general form used for fx was called a
modeling culture. Along the way, areas of mild dis- segmented scorecard. The process for developing
agreement with Professor Breiman are discussed. I a segmented scorecard was clearly an algorithmic
also include a few proposals for research topics in modeling process.
algorithmic modeling. The rst step was to transform x into many inter-
pretable variables called prediction characteristics.
CASE STUDY OF AN ALGORITHMIC This was done in stages. The rst stage was to
MODELING CULTURE compute several time series derived from the orig-
inal two. An example is the time series of months
Although I spent most of my career in manage- delinquenta nonlinear function. The second stage
ment at Bell Labs and Bellcore, the last seven years was to dene characteristics as operators on the
have been with the research group at Fair, Isaac. time series. For example, the number of times in
This company provides all kinds of decision sup- the last six months that the customer was more
port solutions to several industries, and is very than two months delinquent. This process can lead
well known for credit scoring. Credit scoring is a to thousands of characteristics. A subset of these
great example of the problem discussed by Professor characteristics passes a screen for further analysis.
Breiman. The input variables, x, might come from The next step was to segment the population
company databases or credit bureaus. The output based on the screened characteristics. The segmen-
variable, y, is some indicator of credit worthiness. tation was done somewhat informally. But when
Credit scoring has been a protable business for I looked at the process carefully, the segments
Fair, Isaac since the 1960s, so it is instructive to turned out to be the leaves of a shallow-to-medium
look at the Fair, Isaac analytic approach to see how tree. And the tree was built sequentially using
it ts into the two cultures described by Professor mostly binary splits based on the best splitting
Breiman. The Fair, Isaac approach was developed by characteristicsdened in a reasonable way. The
engineers and operations research people and was
algorithm was manual, but similar in concept to the
driven by the needs of the clients and the quality
CART algorithm, with a different purity index.
of the data. The inuences of the statistical com-
Next, a separate function, fx, was developed for
munity were mostly from the nonparametric side
each segment. The function used was called a score-
things like jackknife and bootstrap.
card. Each characteristic was chopped up into dis-
Consider an example of behavior scoring, which
crete intervals or sets called attributes. A score-
is used in credit card account management. For
card was a linear function of the attribute indicator
pedagogical reasons, I consider a simplied version
(dummy) variables derived from the characteristics.
(in the real world, things get more complicated) of
The coefcients of the dummy variables were called
monthly behavior scoring. The input variables, x,
score weights.
in this simplied version, are the monthly bills and
payments over the last 12 months. So the dimen- This construction amounted to an explosion of
sion of x is 24. The output variable is binary and dimensionality. They started with 24 predictors.
is the indicator of no severe delinquency over the These were transformed into hundreds of charac-
teristics and pared down to about 100 characteris-
tics. Each characteristic was discretized into about
Dr. Bruce Hoadley is with Fair, Isaac and Co., 10 attributes, and there were about 10 segments.
Inc., 120 N. Redwood Drive, San Rafael, California This makes 100 × 10 × 10 = 10,000 features. Yes
94903-1996 (e-mail: BruceHoadley@FairIsaac.com).
94903-1996 (e-mail: BruceHoadley@ FairIsaac.com). indeed, dimensionality is a blessing.
What Fair, Isaac calls a scorecard is now elsewhere called a generalized additive model (GAM) with bin smoothing. However, a simple GAM would not do. Client demand, legal considerations and robustness over time led to the concept of score engineering. For example, the score had to be monotonically decreasing in certain delinquency characteristics. Prior judgment also played a role in the design of scorecards. For some characteristics, the score weights were shrunk toward zero in order to moderate the influence of these characteristics. For other characteristics, the score weights were expanded in order to increase the influence of these characteristics. These adjustments were not done willy-nilly. They were done to overcome known weaknesses in the data.
So how did these Fair, Isaac pioneers fit these complicated GAM models back in the 1960s and 1970s? Logistic regression was not generally available. And besides, even today's commercial GAM software will not handle complex constraints. What they did was to maximize (subject to constraints) a measure called divergence, which measures how well the score, S, separates the two populations with different values of y. The formal definition of divergence is 2[E(S|y = 1) - E(S|y = 0)]² / [V(S|y = 1) + V(S|y = 0)]. This constrained fitting was done with a heuristic nonlinear programming algorithm. A linear transformation was used to convert to a log odds scale.
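A minimal sketch of that divergence calculation, with hypothetical scores and 0/1 outcomes standing in for a real scorecard:

    import numpy as np

    def divergence(score, y):
        """2(E[S|y=1] - E[S|y=0])^2 / (V[S|y=1] + V[S|y=0])."""
        s1, s0 = score[y == 1], score[y == 0]
        return 2.0 * (s1.mean() - s0.mean()) ** 2 / (s1.var() + s0.var())

    rng = np.random.default_rng(0)
    y = rng.integers(0, 2, size=1000)
    score = 1.5 * y + rng.normal(size=1000)   # a score that partially separates the two groups
    print(round(divergence(score, y), 2))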
Characteristic selection was done by analyzing the change in divergence after adding (removing) each candidate characteristic to (from) the current best model. The analysis was done informally to achieve good performance on the test sample. There were no formal tests of fit and no tests of score weight statistical significance. What counted was performance on the test sample, which was a surrogate for the future real world.
These early Fair, Isaac engineers were ahead of their time and charter members of the algorithmic modeling culture. The score formula was linear in an exploded dimension. A complex algorithm was used to fit the model. There was no claim that the final score formula was correct, only that it worked well on the test sample. This approach grew naturally out of the demands of the business and the quality of the data. The overarching goal was to develop tools that would help clients make better decisions through data. What emerged was a very accurate and palatable algorithmic modeling solution, which belies Breiman's statement: "The algorithmic modeling methods available in the pre-1980s decades seem primitive now." At a recent ASA meeting, I heard talks on treed regression, which looked like segmented scorecards to me.
After a few years with Fair, Isaac, I developed a talk entitled, "Credit Scoring: A Parallel Universe of Prediction and Classification." The theme was that Fair, Isaac developed in parallel many of the concepts used in modern algorithmic modeling.
Certain aspects of the data modeling culture crept into the Fair, Isaac approach. The use of divergence was justified by assuming that the score distributions were approximately normal. So rather than making assumptions about the distribution of the inputs, they made assumptions about the distribution of the output. This assumption of normality was supported by a central limit theorem, which said that sums of many random variables are approximately normal, even when the component random variables are dependent and multiples of dummy random variables.
Modern algorithmic classification theory has shown that excellent classifiers have one thing in common, they all have large margin. Margin, M, is a random variable that measures the comfort level with which classifications are made. When the correct classification is made, the margin is positive; it is negative otherwise. Since margin is a random variable, the precise definition of large margin is tricky. It does not mean that E(M) is large. When I put my data modeling hat on, I surmised that large margin means that E(M)/√V(M) is large. Lo and behold, with this definition, large margin means large divergence.
Since the good old days at Fair, Isaac, there have been many improvements in the algorithmic modeling approaches. We now use genetic algorithms to screen very large structured sets of prediction characteristics. Our segmentation algorithms have been automated to yield even more predictive systems. Our palatable GAM modeling tool now handles smooth splines, as well as splines mixed with step functions, with all kinds of constraint capability. Maximizing divergence is still a favorite, but we also maximize constrained GLM likelihood functions. We also are experimenting with computationally intensive algorithms that will optimize any objective function that makes sense in the business environment. All of these improvements are squarely in the culture of algorithmic modeling.
OVERFITTING THE TEST SAMPLE
Professor Breiman emphasizes the importance of performance on the test sample. However, this can be overdone. The test sample is supposed to represent the population to be encountered in the future. But in reality, it is usually a random sample of the
current population. High performance on the test So far, all I had was the scorecard GAM. So clearly
sample does not guarantee high performance on I was missing all of those interactions that just had
future samples, things do change. There are prac- to be in the model. To model the interactions, I tried
tices that can be followed to protect against change. developing small adjustments on various overlap-
One can monitor the performance of the mod- ping segments. No matter how hard I tried, noth-
els over time and develop new models when there ing improved the test sample performance over the
has been sufcient degradation of performance. For global scorecard. I started calling it the Fat Score-
some of Fair, Isaacs core products, the redevelop- card.
ment cycle is about 1824 months. Fair, Isaac also Earlier, on this same data set, another Fair, Isaac
does score engineering in an attempt to make researcher had developed a neural network with
the models more robust over time. This includes 2,000 connection weights. The Fat Scorecard slightly
damping the inuence of individual characteristics, outperformed the neural network on the test sam-
using monotone constraints and minimizing the size ple. I cannot claim that this would work for every
of the models subject to performance constraints data set. But for this data set, I had developed an
on the current test sample. This score engineer- excellent algorithmic model with a simple data mod-
ing amounts to moving from very nonparametric (no eling tool.
score engineering) to more semiparametric (lots of Why did the simple additive model work so well?
score engineering). One idea is that some of the characteristics in the
model are acting as surrogates for certain inter-
SPIN-OFFS FROM THE DATA action terms that are not explicitly in the model.
MODELING CULTURE Another reason is that the scorecard is really a
In Section 6 of Professor Breimans paper, he says sophisticated neural net. The inputs are the original
that multivariate analysis tools in statistics are inputs. Associated with each characteristic is a hid-
frozen at discriminant analysis and logistic regres- den node. The summation functions coming into the
sion in classication     This is not necessarily all hidden nodes are the transformations dening the
that bad. These tools can carry you very far as long characteristics. The transfer functions of the hid-
as you ignore all of the textbook advice on how to den nodes are the step functions (compiled from the
use them. To illustrate, I use the saga of the Fat score weights)all derived from the data. The nal
Scorecard. output is a linear function of the outputs of the hid-
Early in my research days at Fair, Isaac, I den nodes. The result is highly nonlinear and inter-
was searching for an improvement over segmented active, when looked at as a function of the original
scorecards. The idea was to develop rst a very inputs.
good global scorecard and then to develop small The Fat Scorecard study had an ingredient that
adjustments for a number of overlapping segments. is rare. We not only had the traditional test sample,
To develop the global scorecard, I decided to use but had three other test samples, taken one, two,
logistic regression applied to the attribute dummy and three years later. In this case, the Fat Scorecard
variables. There were 36 characteristics available outperformed the more traditional thinner score-
for tting. A typical scorecard has about 15 char- card for all four test samples. So the feared over-
acteristics. My variable selection was structured tting to the traditional test sample never mate-
so that an entire characteristic was either in or rialized. To get a better handle on this you need
out of the model. What I discovered surprised an understanding of how the relationships between
me. All models t with anywhere from 27 to 36 variables evolve over time.
characteristics had the same performance on the I recently encountered another connection
test sample. This is what Professor Breiman calls between algorithmic modeling and data modeling.
Rashomon and the multiplicity of good models. To In classical multivariate discriminant analysis, one
keep the model as small as possible, I chose the one assumes that the prediction variables have a mul-
with 27 characteristics. This model had 162 score tivariate normal distribution. But for a scorecard,
weights (logistic regression coefcients), whose P- the prediction variables are hundreds of attribute
values ranged from 0.0001 to 0.984, with only one dummy variables, which are very nonnormal. How-
less than 0.05; i.e., statistically signicant. The con- ever, if you apply the discriminant analysis algo-
dence intervals for the 162 score weights were use- rithm to the attribute dummy variables, you can
less. To get this great scorecard, I had to ignore get a great algorithmic model, even though the
the conventional wisdom on how to use logistic assumptions of discriminant analysis are severely
regression. violated.
A SOLUTION TO THE OCCAM DILEMMA used as a surrogate for the decision process, and
misclassication error is used as a surrogate for
I think that there is a solution to the Occam
dilemma without resorting to goal-oriented argu- prot. However, I see a mismatch between the algo-
ments. Clients really do insist on interpretable func- rithms used to develop the models and the business
tions, fx. Segmented palatable scorecards are measurement of the models value. For example, at
very interpretable by the customer and are very Fair, Isaac, we frequently maximize divergence. But
accurate. Professor Breiman himself gave single when we argue the models value to the clients, we
trees an A+ on interpretability. The shallow-to- dont necessarily brag about the great divergence.
medium tree in a segmented scorecard rates an A++. We try to use measures that the client can relate to.
The palatable scorecards in the leaves of the trees The ROC curve is one favorite, but it may not tell
are built from interpretable (possibly complex) char- the whole story. Sometimes, we develop simulations
acteristics. Sometimes we cant implement them of the clients business operation to show how the
until the lawyers and regulators approve. And that model will improve their situation. For example, in
requires super interpretability. Our more sophisti-
a transaction fraud control process, some measures
cated products have 10 to 20 segments and up to
of interest are false positive rate, speed of detec-
100 characteristics (not all in every segment). These
models are very accurate and very interpretable. tion and dollars saved when 0.5% of the transactions
I coined a phrase called the Ping-Pong theorem. are agged as possible frauds. The 0.5% reects the
This theorem says that if we revealed to Profes- number of transactions that can be processed by
sor Breiman the performance of our best model and the current fraud management staff. Perhaps what
gave him our data, then he could develop an algo- the client really wants is a score that will maxi-
rithmic model using random forests, which would mize the dollars saved in their fraud control sys-
outperform our model. But if he revealed to us the tem. The score that maximizes test set divergence
performance of his model, then we could develop or minimizes test set misclassications does not do
a segmented scorecard, which would outperform it. The challenge for algorithmic modeling is to nd
his model. We might need more characteristics, an algorithm that maximizes the generalization dol-
attributes and segments, but our experience in this lars saved, not generalization error.
kind of contest is on our side. We have made some progress in this area using
However, all the competing models in this game of
ideas from support vector machines and boosting.
Ping-Pong would surely be algorithmic models. But
By manipulating the observation weights used in
some of them could be interpretable.
standard algorithms, we can improve the test set
THE ALGORITHM TUNING DILEMMA performance on any objective of interest. But the
price we pay is computational intensity.
As far as I can tell, all approaches to algorithmic
model building contain tuning parameters, either
explicit or implicit. For example, we use penalized
objective functions for tting and marginal contri- MEASURING IMPORTANCEIS IT
bution thresholds for characteristic selection. With REALLY POSSIBLE?
experience, analysts learn how to set these tuning I like Professor Breimans idea for measuring the
parameters in order to get excellent test sample
importance of variables in black box models. A Fair,
or cross-validation results. However, in industry
Isaac spin on this idea would be to build accurate
and academia, there is sometimes a little tinker-
ing, which involves peeking at the test sample. The models for which no variable is much more impor-
result is some bias in the test sample or cross- tant than other variables. There is always a chance
validation results. This is the same kind of tinkering that a variable and its relationships will change in
that upsets test of t pureness. This is a challenge the future. After that, you still want the model to
for the algorithmic modeling approach. How do you work. So dont make any variable dominant.
optimize your results and get an unbiased estimate I think that there is still an issue with measuring
of the generalization error? importance. Consider a set of inputs and an algo-
rithm that yields a black box, for which x1 is impor-
GENERALIZING THE GENERALIZATION ERROR tant. From the Ping Pong theorem there exists a
In most commercial applications of algorithmic set of input variables, excluding x1 and an algorithm
modeling, the function, fx, is used to make deci- that will yield an equally accurate black box. For
sions. In some academic research, classication is this black box, x1 is unimportant.
IN SUMMARY GAMs in the leaves. They are very accurate and


Algorithmic modeling is a very important area of interpretable. And you can do it with data modeling
statistics. It has evolved naturally in environments tools as long as you (i) ignore most textbook advice,
with lots of data and lots of decisions. But you (ii) embrace the blessing of dimensionality, (iii) use
can do it without suffering the Occam dilemma; constraints in the tting optimizations (iv) use reg-
for example, use medium trees with interpretable ularization, and (v) validate the results.

Comment

Emanuel Parzen

Emanuel Parzen is Distinguished Professor, Department of Statistics, Texas A&M University, 415 C Block Building, College Station, Texas 77843 (e-mail: eparzen@stat.tamu.edu).

1. BREIMAN DESERVES OUR APPRECIATION

I strongly support the view that statisticians must face the crisis of the difficulties in their practice of regression. Breiman alerts us to systematic blunders (leading to wrong conclusions) that have been committed applying current statistical practice of data modeling. In the spirit of "statistician, avoid doing harm," I propose that the first goal of statistical ethics should be to guarantee to our clients that any mistakes in our analysis are unlike any mistakes that statisticians have made before.

The two goals in analyzing data which Leo calls prediction and information I prefer to describe as "management" and "science." Management seeks profit, practical answers (predictions) useful for decision making in the short run. Science seeks truth, fundamental knowledge about nature which provides understanding and control in the long run. As a historical note, Student's t-test has many scientific applications but was invented by Student as a management tool to make Guinness beer better (bitter?).

Breiman does an excellent job of presenting the case that the practice of statistical science, using only the conventional data modeling culture, needs reform. He deserves much thanks for alerting us to the algorithmic modeling culture. Breiman warns us that if the model is a poor emulation of nature, the conclusions may be wrong. This situation, which I call "the right answer to the wrong question," is called by statisticians the error of the third kind. Engineers at M.I.T. define "suboptimization" as "elegantly solving the wrong problem."

Breiman presents the potential benefits of algorithmic models (better predictive accuracy than data models, and consequently better information about the underlying mechanism and avoiding questionable conclusions which results from weak predictive accuracy) and support vector machines (which provide almost perfect separation and discrimination between two classes by increasing the dimension of the feature set). He convinces me that the methods of algorithmic modeling are important contributions to the tool kit of statisticians.

If the profession of statistics is to remain healthy, and not limit its research opportunities, statisticians must learn about the cultures in which Breiman works, but also about many other cultures of statistics.

2. HYPOTHESES TO TEST TO AVOID BLUNDERS OF STATISTICAL MODELING

Breiman deserves our appreciation for pointing out generic deviations from standard assumptions (which I call bivariate dependence and two-sample conditional clustering) for which we should routinely check. "Test null hypothesis" can be a useful algorithmic concept if we use tests that diagnose in a model-free way the directions of deviation from the null hypothesis model.

Bivariate dependence (correlation) may exist between features [independent (input) variables] in a regression, causing them to be proxies for each other and our models to be unstable, with different forms of regression models being equally well fitting. We need tools to routinely test the hypothesis of statistical independence of the distributions of independent (input) variables.

Two-sample conditional clustering arises in the distributions of independent (input) variables to discriminate between two classes, which we call the conditional distribution of input variables X given each class. Class I may have only one mode (cluster) at low values of X while class II has two modes (clusters) at low and high values of X. We would like to conclude that high values of X are observed only for members of class II but low values of X occur for members of both classes. The hypothesis we propose testing is equality of the pooled distribution of both samples and the conditional distribution of sample I, which is equivalent to P(class I | X) = P(class I). For successful discrimination one seeks to increase the number (dimension) of inputs (features) X to make P(class I | X) close to 1 or 0.

3. STATISTICAL MODELING, MANY CULTURES, STATISTICAL METHODS MINING

Breiman speaks of two cultures of statistics; I believe statistics has many cultures. At specialized workshops (on maximum entropy methods or robust methods or Bayesian methods or ...) a main topic of conversation is "Why don't all statisticians think like us?"

I have my own eclectic philosophy of statistical modeling to which I would like to attract serious attention. I call it "statistical methods mining," which seeks to provide a framework to synthesize and apply the past half-century of methodological progress in computationally intensive methods for statistical modeling, including EDA (exploratory data analysis), FDA (functional data analysis), density estimation, Model DA (model selection criteria data analysis), Bayesian priors on function space, continuous parameter regression analysis and reproducing kernels, fast algorithms, Kalman filtering, complexity, information, quantile data analysis, nonparametric regression, conditional quantiles.

I believe data mining is a special case of data modeling. We should teach in our introductory courses that one meaning of statistics is statistical data modeling done in a systematic way by an iterated series of stages which can be abbreviated SIEVE (specify problem and general form of models, identify tentatively numbers of parameters and specialized models, estimate parameters, validate goodness-of-fit of estimated models, estimate final model nonparametrically or algorithmically). MacKay and Oldford (2000) brilliantly present the statistical method as a series of stages PPDAC (problem, plan, data, analysis, conclusions).

4. QUANTILE CULTURE, ALGORITHMIC MODELS

A culture of statistical data modeling based on quantile functions, initiated in Parzen (1979), has been my main research interest since 1976. In my discussion to Stone (1977) I outlined a novel approach to estimation of conditional quantile functions which I only recently fully implemented. I would like to extend the concept of algorithmic statistical models in two ways: (1) to mean data fitting by representations which use approximation theory and numerical analysis; (2) to use the notation of probability to describe empirical distributions of samples (data sets) which are not assumed to be generated by a random mechanism.

My quantile culture has not yet become widely applied because "you cannot give away a good idea, you have to sell it" (by integrating it in computer programs usable by applied statisticians and thus promote statistical methods mining).

A quantile function Q(u), 0 ≤ u ≤ 1, is the inverse F^{-1}(u) of a distribution function F(x), -∞ < x < ∞. Its rigorous definition is Q(u) = inf{x: F(x) ≥ u}. When F is continuous with density f, F(Q(u)) = u and q(u) = Q'(u) = 1/f(Q(u)). We use the notation Q for a true unknown quantile function, Q~ for a raw estimator from a sample, and Q^ for a smooth estimator of the true Q.

Concepts defined for Q(u) can be defined also for other versions of quantile functions. Quantile functions can compress data by a five-number summary: the values of Q(u) at u = 0.5, 0.25, 0.75, 0.1, 0.9 (or 0.05, 0.95). Measures of location and scale are QM = 0.5[Q(0.25) + Q(0.75)] and QD = 2[Q(0.75) - Q(0.25)]. To use quantile functions to identify distributions fitting data we propose the quantile-quartile function Q/Q(u) = [Q(u) - QM]/QD. The five-number summary of a distribution becomes QM, QD, Q/Q(0.5) (skewness), Q/Q(0.1) (left tail), Q/Q(0.9) (right tail). The elegance of Q/Q(u) is its universal values at u = 0.25, 0.75. Values |Q/Q(u)| > 1 are outliers as defined by Tukey EDA.
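These summaries are easy to compute from a sample. The short sketch below uses NumPy's sample quantile as a raw estimator of Q(u) and prints QM, QD and a few Q/Q values; the library choice and the synthetic skewed sample are assumptions of the illustration, not part of the comment.

```python
# A rough numerical illustration of the quantile summaries on a sample;
# np.quantile plays the role of a raw estimator of Q(u). Synthetic data.
import numpy as np

rng = np.random.default_rng(1)
y = rng.exponential(scale=2.0, size=500)   # a skewed sample

def Q(u):
    return np.quantile(y, u)

QM = 0.5 * (Q(0.25) + Q(0.75))             # mid-quartile location
QD = 2.0 * (Q(0.75) - Q(0.25))             # quartile-based scale

def QQ(u):                                  # quantile-quartile function Q/Q(u)
    return (Q(u) - QM) / QD

summary = {
    "QM": QM,
    "QD": QD,
    "Q/Q(0.5) (skewness)": QQ(0.5),
    "Q/Q(0.1) (left tail)": QQ(0.1),
    "Q/Q(0.9) (right tail)": QQ(0.9),
}
print({k: round(v, 3) for k, v in summary.items()})
# |Q/Q(u)| > 1 flags tail values that Tukey's EDA rules would call outliers;
# Q/Q(0.25) and Q/Q(0.75) are always -0.25 and +0.25, the "universal values."
```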
For the fundamental problem of comparison of two distributions F and G we define the comparison distribution D(u; F, G) and comparison density d(u; F, G) = D'(u; F, G). For F, G continuous, define D(u; F, G) = G(F^{-1}(u)) and d(u; F, G) = g(F^{-1}(u))/f(F^{-1}(u)), assuming that f(x) = 0 implies g(x) = 0, written G ≪ F. For F, G discrete with probability mass functions pF and pG define (assuming G ≪ F) d(u; F, G) = pG(F^{-1}(u))/pF(F^{-1}(u)).

Our applications of comparison distributions often assume F to be an unconditional distribution and G a conditional distribution. To analyze bivariate data (X, Y) a fundamental tool is the dependence density d(t, u) = d(u; F_Y, F_{Y|X=Q_X(t)}). When (X, Y) is jointly continuous,

d(t, u) = f_{X,Y}(Q_X(t), Q_Y(u)) / [f_X(Q_X(t)) f_Y(Q_Y(u))].

The statistical independence hypothesis F_{X,Y} = F_X F_Y is equivalent to d(t, u) = 1, all t, u. A fundamental formula for estimation of conditional quantile functions is

Q_{Y|X=x}(u) = Q_Y(D^{-1}(u; F_Y, F_{Y|X=x})) = Q_Y(s), where u = D(s; F_Y, F_{Y|X=x}).

To compare the distributions of two univariate samples, let Y denote the continuous response variable and X be binary 0, 1 denoting the population from which Y is observed. The comparison density is defined (note F_Y is the pooled distribution function)

d1(u) = d(u; F_Y, F_{Y|X=1}) = P(X = 1 | Y = Q_Y(u)) / P(X = 1).
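One rough way to estimate d1(u) from data, consistent with the definition above, is to transform the sample-I observations by the pooled empirical distribution function and look at the density of the transformed values. The sketch below does exactly that; NumPy, the synthetic shifted samples, and the choice of ten equal bins are assumptions of the illustration rather than software from the comment.

```python
# A crude estimate of the two-sample comparison density d1(u):
# transform sample-I observations by the pooled empirical CDF and
# histogram the result; a flat histogram near 1 means the two
# distributions agree. Synthetic data; bin count is an arbitrary choice.
import numpy as np

rng = np.random.default_rng(2)
y0 = rng.normal(0.0, 1.0, size=400)        # population 0
y1 = rng.normal(0.8, 1.0, size=200)        # population I, shifted upward
pooled = np.concatenate([y0, y1])

def pooled_cdf(values):
    return np.searchsorted(np.sort(pooled), values, side="right") / len(pooled)

u1 = pooled_cdf(y1)                        # ~uniform if sample I matches the pool
bins = np.linspace(0.0, 1.0, 11)
counts, _ = np.histogram(u1, bins=bins)
d1_hat = counts / len(y1) / np.diff(bins)  # density estimate of d1(u)

for lo, hi, d in zip(bins[:-1], bins[1:], d1_hat):
    print(f"u in ({lo:.1f}, {hi:.1f}]: d1 ~ {d:.2f}")
# Values well above 1 at high u say population I is over-represented among
# the larger pooled values, as in the class I / class II example above.
```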
5. QUANTILE IDEAS FOR HIGH DIMENSIONAL DATA ANALYSIS

By high dimensional data we mean multivariate data (Y1, ..., Ym). We form approximate high dimensional comparison densities d(u1, ..., um) to test statistical independence of the variables and, when we have two samples, d1(u1, ..., um) to test equality of sample I with the pooled sample. All our distributions are empirical distributions but we use notation for true distributions in our formulas. Note that

∫_0^1 ... ∫_0^1 du1 ... dum d(u1, ..., um) d1(u1, ..., um) = 1.

A decile quantile bin B(k1, ..., km) is defined to be the set of observations (Y1, ..., Ym) satisfying, for j = 1, ..., m, Q_{Yj}((kj - 1)/10) < Yj ≤ Q_{Yj}(kj/10). Instead of deciles k/10 we could use k/M for another base M.

To test the hypothesis that Y1, ..., Ym are statistically independent we form, for all kj = 1, ..., 10,

d(k1, ..., km) = P(Bin(k1, ..., km)) / P(Bin(k1, ..., km) | independence).

To test equality of distribution of a sample from population I and the pooled sample we form

d1(k1, ..., km) = P(Bin(k1, ..., km) | population I) / P(Bin(k1, ..., km) | pooled sample)

for all (k1, ..., km) such that the denominator is positive and otherwise defined arbitrarily. One can show (letting X denote the population observed)

d1(k1, ..., km) = P(X = I | observation from Bin(k1, ..., km)) / P(X = I).

To test the null hypotheses in ways that detect directions of deviations from the null hypothesis our recommended first step is quantile data analysis of the values d(k1, ..., km) and d1(k1, ..., km).
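For m = 2 the decile-bin diagnostic can be computed directly. The following sketch forms the 10 x 10 decile bins and prints a few of the ratios d(k1, k2); NumPy and the synthetic correlated sample are assumptions of the illustration.

```python
# A small bivariate (m = 2) illustration of decile quantile bins and the
# ratio d(k1, k2) = P(Bin(k1, k2)) / P(Bin(k1, k2) | independence);
# with 10 x 10 decile bins the independence probability of each bin is 1/100.
# Synthetic correlated data.
import numpy as np

rng = np.random.default_rng(3)
n = 5000
y1 = rng.normal(size=n)
y2 = 0.7 * y1 + 0.7 * rng.normal(size=n)   # dependent with y1

def decile_index(y):
    # k = 1..10 such that Q((k-1)/10) < y_i <= Q(k/10)
    edges = np.quantile(y, np.linspace(0.1, 0.9, 9))
    return np.searchsorted(edges, y, side="left") + 1

k1, k2 = decile_index(y1), decile_index(y2)
counts = np.zeros((10, 10))
for a, b in zip(k1, k2):
    counts[a - 1, b - 1] += 1

d = (counts / n) / (1.0 / 100.0)           # observed / independence probability
print("d(1,1)   =", round(d[0, 0], 2))     # both in the lowest decile
print("d(1,10)  =", round(d[0, 9], 2))     # opposite corners are depleted
print("d(10,10) =", round(d[9, 9], 2))
# Values far from 1 in particular bins show the direction of the departure
# from independence, which is the point of the diagnostic described above.
```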
I appreciate this opportunity to bring to the attention of researchers on high dimensional data analysis the potential of quantile methods. My conclusion is that statistical science has many cultures and statisticians will be more successful when they emulate Leo Breiman and apply as many cultures as possible (which I call statistical methods mining). Additional references are on my web site at stat.tamu.edu.
Rejoinder

Leo Breiman

I thank the discussants. I'm fortunate to have comments from a group of experienced and creative statisticians, even more so in that their comments are diverse. Manny Parzen and Bruce Hoadley are more or less in agreement, Brad Efron has serious reservations and D. R. Cox is in downright disagreement.

I address Professor Cox's comments first, since our disagreement is crucial.

D. R. COX

Professor Cox is a worthy and thoughtful adversary. We walk down part of the trail together and then sharply diverge. To begin, I quote: "Professor Breiman takes data as his starting point. I would prefer to start with an issue, a question, or a scientific hypothesis, ..." I agree, but would expand the starting list to include the prediction of future events. I have never worked on a project that has started with "Here is a lot of data; let's look at it and see if we can get some ideas about how to use it." The data has been put together and analyzed starting with an objective.

C1 Data Models Can Be Useful

Professor Cox is committed to the use of data models. I readily acknowledge that there are situations where a simple data model may be useful and appropriate; for instance, if the science of the mechanism producing the data is well enough known to determine the model apart from estimating parameters. There are also situations of great complexity posing important issues and questions in which there is not enough data to resolve the questions to the accuracy desired. Simple models can then be useful in giving qualitative understanding, suggesting future research areas and the kind of additional data that needs to be gathered.

At times, there is not enough data on which to base predictions; but policy decisions need to be made. In this case, constructing a model using whatever data exists, combined with scientific common sense and subject-matter knowledge, is a reasonable path. Professor Cox points to examples when he writes:

"Often the prediction is under quite different conditions from the data; what is the likely progress of the incidence of the epidemic of v-CJD in the United Kingdom, what would be the effect on annual incidence of cancer in the United States reducing by 10% the medical use of X-rays, etc.? That is, it may be desired to predict the consequences of something only indirectly addressed by the data available for analysis ... prediction, always hazardous, without some understanding of the underlying process and linking with other sources of information, becomes more and more tentative."

I agree.

C2 Data Models Only

From here on we part company. Professor Cox's discussion consists of a justification of the use of data models to the exclusion of other approaches. For instance, although he admits, "... I certainly accept, although it goes somewhat against the grain to do so, that there are situations where a directly empirical approach is better ...", the two examples he gives of such situations are short-term economic forecasts and real-time flood forecasts, among the less interesting of all of the many current successful algorithmic applications. In his view, the only use for algorithmic models is short-term forecasting; there are no comments on the rich information about the data and covariates available from random forests or in the many fields, such as pattern recognition, where algorithmic modeling is fundamental.

He advocates construction of stochastic data models that summarize the understanding of the phenomena under study. The methodology in the Cox and Wermuth book (1996) attempts to push understanding further by finding causal orderings in the covariate effects. The sixth chapter of this book contains illustrations of this approach on four data sets. The first is a small data set of 68 patients with seven covariates from a pilot study at the University of Mainz to identify psychological and socioeconomic factors possibly important for glucose control in diabetes patients. This is a regression-type problem with the response variable measured by GHb (glycosylated haemoglobin). The model fitting is done by a number of linear regressions and validated by the checking of various residual plots. The only other reference to model validation is the statement, "R² = 0.34, reasonably large by the standards usual for this field of study." Predictive accuracy is not computed, either for this example or for the three other examples.

My comments on the questionable use of data models apply to this analysis. Incidentally, I tried to get one of the data sets used in the chapter to conduct an alternative analysis, but it was not possible to get it before my rejoinder was due. It would have been interesting to contrast our two approaches.

C3 Approach to Statistical Problems

Basing my critique on a small illustration in a book is not fair to Professor Cox. To be fairer, I quote his words about the nature of a statistical analysis:

"Formal models are useful and often almost, if not quite, essential for incisive thinking. Descriptively appealing and transparent methods with a firm model base are the ideal. Notions of significance tests, confidence intervals, posterior intervals, and all the formal apparatus of inference are valuable tools to be used as guides, but not in a mechanical way; they indicate the uncertainty that would apply under somewhat idealized, maybe very idealized, conditions and as such are often lower bounds to real uncertainty. Analyses and model development are at least partly exploratory. Automatic methods of model selection (and of variable selection in regression-like problems) are to be shunned or, if use is absolutely unavoidable, are to be examined carefully for their effect on the final conclusions. Unfocused tests of model adequacy are rarely helpful."
Given the right kind of data: relatively small sample size and a handful of covariates, I have no doubt that his experience and ingenuity in the craft of model construction would result in an illuminating model. But data characteristics are rapidly changing. In many of the most interesting current problems, the idea of starting with a formal model is not tenable.

C4 Changes in Problems

My impression from Professor Cox's comments is that he believes every statistical problem can be best solved by constructing a data model. I believe that statisticians need to be more pragmatic. Given a statistical problem, find a good solution, whether it is a data model, an algorithmic model or (although it is somewhat against my grain), a Bayesian data model or a completely different approach.

My work on the 1990 Census Adjustment (Breiman, 1994) involved a painstaking analysis of the sources of error in the data. This was done by a long study of thousands of pages of evaluation documents. This seemed the most appropriate way of answering the question of the accuracy of the adjustment estimates.

The conclusion that the adjustment estimates were largely the result of bad data has never been effectively contested and is supported by the results of the Year 2000 Census Adjustment effort. The accuracy of the adjustment estimates was, arguably, the most important statistical issue of the last decade, and could not be resolved by any amount of statistical modeling.

A primary reason why we cannot rely on data models alone is the rapid change in the nature of statistical problems. The realm of applications of statistics has expanded more in the last twenty-five years than in any comparable period in the history of statistics.

In an astronomy and statistics workshop this year, a speaker remarked that in twenty-five years we have gone from being a small sample-size science to a very large sample-size science. Astronomical data bases now contain data on two billion objects comprising over 100 terabytes and the rate of new information is accelerating.

A recent biostatistics workshop emphasized the analysis of genetic data. An exciting breakthrough is the use of microarrays to locate regions of gene activity. Here the sample size is small, but the number of variables ranges in the thousands. The questions are which specific genes contribute to the occurrence of various types of diseases.

Questions about the areas of thinking in the brain are being studied using functional MRI. The data gathered in each run consists of a sequence of 150,000 pixel images. Gigabytes of satellite information are being used in projects to predict and understand short- and long-term environmental and weather changes.

Underlying this rapid change is the rapid evolution of the computer, a device for gathering, storing and manipulation of incredible amounts of data, together with technological advances incorporating computing, such as satellites and MRI machines.

The problems are exhilarating. The methods used in statistics for small sample sizes and a small number of variables are not applicable. John Rice, in his summary talk at the astronomy and statistics workshop, said, "Statisticians have to become opportunistic." That is, faced with a problem, they must find a reasonable solution by whatever method works. One surprising aspect of both workshops was how opportunistic statisticians faced with genetic and astronomical data had become. Algorithmic methods abounded.

C5 Mainstream Procedures and Tools

Professor Cox views my critique of the use of data models as based in part on a caricature. Regarding my references to articles in journals such as JASA, he states that they are not typical of mainstream statistical analysis, but are used to illustrate technique rather than explain the process of analysis. His concept of mainstream statistical analysis is summarized in the quote given in my Section C3. It is the kind of thoughtful and careful analysis that he prefers and is capable of.

Following this summary is the statement:

"By contrast, Professor Breiman equates mainstream applied statistics to a relatively mechanical process involving somehow or other choosing a model, often a default model of standard form, and applying standard methods of analysis and goodness-of-fit procedures."

The disagreement is definitional: what is mainstream? In terms of numbers my definition of mainstream prevails, I guess, at a ratio of at least 100 to 1. Simply count the number of people doing their statistical analysis using canned packages, or count the number of SAS licenses.

In the academic world, we often overlook the fact that we are a small slice of all statisticians and an even smaller slice of all those doing analyses of data. There are many statisticians and nonstatisticians in diverse fields using data to reach conclusions and depending on tools supplied to them by SAS, SPSS, etc. Their conclusions are important and are sometimes published in medical or other subject-matter journals. They do not have the statistical expertise, computer skills, or time needed to construct more appropriate tools. I was faced with this problem as a consultant when confined to using the BMDP linear regression, stepwise linear regression, and discriminant analysis programs. My concept of decision trees arose when I was faced with nonstandard data that could not be treated by these standard methods.
When I rejoined the university after my consulting years, one of my hopes was to provide better general purpose tools for the analysis of data. The first step in this direction was the publication of the CART book (Breiman et al., 1984). CART and other similar decision tree methods are used in thousands of applications yearly in many fields. It has proved robust and reliable. There are others that are more recent; random forests is the latest. A preliminary version of random forests is free source with f77 code, S+ and R interfaces available at www.stat.berkeley.edu/users/breiman.

A nearly completed second version will also be put on the web site and translated into Java by the Weka group. My collaborator, Adele Cutler, and I will continue to upgrade, add new features, graphics, and a good interface.

My philosophy about the field of academic statistics is that we have a responsibility to provide the many people working in applications outside of academia with useful, reliable, and accurate analysis tools. Two excellent examples are wavelets and decision trees. More are needed.

BRAD EFRON

Brad seems to be a bit puzzled about how to react to my article. I'll start with what appears to be his biggest reservation.

E1 From Simple to Complex Models

Brad is concerned about the use of complex models without simple interpretability in their structure, even though these models may be the most accurate predictors possible. But the evolution of science is from simple to complex.

The equations of general relativity are considerably more complex and difficult to understand than Newton's equations. The quantum mechanical equations for a system of molecules are extraordinarily difficult to interpret. Physicists accept these complex models as the facts of life, and do their best to extract usable information from them.

There is no consideration given to trying to understand cosmology on the basis of Newton's equations or nuclear reactions in terms of hard ball models for atoms. The scientific approach is to use these complex models as the best possible descriptions of the physical world and try to get usable information out of them.

There are many engineering and scientific applications where simpler models, such as Newton's laws, are certainly sufficient, say, in structural design. Even here, for larger structures, the model is complex and the analysis difficult. In scientific fields outside statistics, answering questions is done by extracting information from increasingly complex and accurate models.

The approach I suggest is similar. In genetics, astronomy and many other current areas statistics is needed to answer questions, construct the most accurate possible model, however complex, and then extract usable information from it.

Random forests is in use at some major drug companies whose statisticians were impressed by its ability to determine gene expression (variable importance) in microarray data. They were not concerned about its complexity or black-box appearance.

E2 Prediction

"Leo's paper overstates both its [prediction's] role, and our profession's lack of interest in it ... Most statistical surveys have the identification of causal factors as their ultimate role."

My point was that it is difficult to tell, using goodness-of-fit tests and residual analysis, how well a model fits the data. An estimate of its test set accuracy is a preferable assessment. If, for instance, a model gives predictive accuracy only slightly better than the "all survived" or other baseline estimates, we can't put much faith in its reliability in the identification of causal factors.

I agree that often "... statistical surveys have the identification of causal factors as their ultimate role." I would add that the more predictively accurate the model is, the more faith can be put into the variables that it fingers as important.
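A minimal illustration of the baseline comparison described here, with scikit-learn's logistic regression and synthetic data standing in as assumptions of the sketch: the model's test set accuracy is printed next to the majority-class ("all survived") baseline.

```python
# A minimal illustration of the point about baselines: a model whose test set
# accuracy barely beats "predict the majority class for everyone" does not
# earn much trust when its coefficients are read causally. scikit-learn and
# the synthetic data are assumptions for illustration only.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(4)
n = 2000
X = rng.normal(size=(n, 5))
# A weak signal: the response depends only faintly on the covariates.
p = 1 / (1 + np.exp(-(0.3 * X[:, 0] - 1.2)))
y = rng.binomial(1, p)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

baseline = max(np.mean(y_te), 1 - np.mean(y_te))   # "all survived" accuracy
model_acc = model.score(X_te, y_te)
print(f"baseline accuracy: {baseline:.3f}")
print(f"model accuracy:    {model_acc:.3f}")
# If the two numbers are close, goodness-of-fit summaries alone say little
# about whether the fitted coefficients identify real causal factors.
```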
E3 Variable Importance

A significant and often overlooked point raised by Brad is what meaning can one give to statements that "variable X is important or not important." This has puzzled me on and off for quite a while. In fact, variable importance has always been defined operationally. In regression the important variables are defined by doing "best subsets" or variable deletion.

Another approach used in linear methods such as logistic regression and survival models is to compare the size of the slope estimate for a variable to its estimated standard error. The larger the ratio, the more important the variable. Both of these definitions can lead to erroneous conclusions.

My definition of variable importance is based on prediction. A variable might be considered important if deleting it seriously affects prediction accuracy. This brings up the problem that if two variables are highly correlated, deleting one or the other of them will not affect prediction accuracy. Deleting both of them may degrade accuracy considerably. The definition used in random forests spots both variables.

Importance does not yet have a satisfactory theoretical definition (I haven't been able to locate the article Brad references but I'll keep looking). It depends on the dependencies between the output variable and the input variables, and on the dependencies between the input variables. The problem begs for research.
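The following sketch illustrates a prediction-based importance measure in this spirit: permute one input at a time on a held-out set and record the drop in accuracy. It uses scikit-learn's random forest and synthetic data as stand-ins; it is an illustration of the idea, not Breiman's own random forests code.

```python
# A simplified, prediction-based importance measure: permute one input at a
# time on a test set and record how much test set accuracy drops.
# Assumptions: scikit-learn, synthetic data with x2 a near-copy of x1.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(5)
n = 3000
x1 = rng.normal(size=n)
x2 = x1 + 0.1 * rng.normal(size=n)          # nearly a copy of x1
x3 = rng.normal(size=n)                      # irrelevant
X = np.column_stack([x1, x2, x3])
y = (x1 + 0.2 * rng.normal(size=n) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
base_acc = forest.score(X_te, y_te)

for j, name in enumerate(["x1", "x2", "x3"]):
    X_perm = X_te.copy()
    X_perm[:, j] = rng.permutation(X_perm[:, j])   # break the link to y
    drop = base_acc - forest.score(X_perm, y_te)
    print(f"{name}: accuracy drop {drop:.3f}")
# Both x1 and its near-copy x2 typically show a drop, whereas refitting
# without either one alone would leave accuracy nearly unchanged.
```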
E4 Other Reservations

"Sample sizes have swollen alarmingly while goals grow less distinct ('find interesting data structure')."

I have not noticed any increasing fuzziness in goals, only that they have gotten more diverse. In the last two workshops I attended (genetics and astronomy) the goals in using the data were clearly laid out. "Searching for structure" is rarely seen even though data may be in the terabyte range.

"The new algorithms often appear in the form of black boxes with enormous numbers of adjustable parameters ('knobs to twiddle')."

This is a perplexing statement and perhaps I don't understand what Brad means. Random forests has only one adjustable parameter that needs to be set for a run, is insensitive to the value of this parameter over a wide range, and has a quick and simple way for determining a good value. Support vector machines depend on the settings of 1-2 parameters. Other algorithmic models are similarly sparse in the number of knobs that have to be twiddled.
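As an illustration of the single-knob point, the sketch below varies one parameter over a wide range and checks accuracy. It uses scikit-learn's max_features (the number of inputs tried at each split) as that parameter, which is an assumption of the sketch rather than something stated in the text, and synthetic data.

```python
# An illustration of the "one knob" point with scikit-learn's random forest,
# where max_features (inputs tried at each split) plays the role of the
# single adjustable parameter; checking accuracy over a range of values is
# a quick-and-simple way to pick it. Synthetic data.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=2000, n_features=30, n_informative=8,
                           random_state=0)

for m in [2, 5, 10, 20]:                     # a wide range of the one parameter
    forest = RandomForestClassifier(n_estimators=200, max_features=m,
                                    random_state=0)
    acc = cross_val_score(forest, X, y, cv=3).mean()
    print(f"max_features={m:>2}: cv accuracy {acc:.3f}")
# Accuracy typically varies little over this range, which is the sense in
# which the single parameter does not require delicate tuning.
```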
"New methods always look better than old ones. ... Complicated models are harder to criticize than simple ones."

In 1992 I went to my first NIPS conference. At that time, the exciting algorithmic methodology was neural nets. My attitude was grim skepticism. Neural nets had been given too much hype, just as AI had been given and failed expectations. I came away a believer. Neural nets delivered on the bottom line! In talk after talk, in problem after problem, neural nets were being used to solve difficult prediction problems with test set accuracies better than anything I had seen up to that time.

My attitude toward new and/or complicated methods is pragmatic. Prove that you've got a better mousetrap and I'll buy it. But the proof had better be concrete and convincing.

Brad questions where the bias and variance have gone. It is surprising when, trained in classical bias-variance terms and convinced of the curse of dimensionality, one encounters methods that can handle thousands of variables with little loss of accuracy. It is not voodoo statistics; there is some simple theory that illuminates the behavior of random forests (Breiman, 1999). I agree that more theoretical work is needed to increase our understanding.

Brad is an innovative and flexible thinker who has contributed much to our field. He is opportunistic in problem solving and may, perhaps not overtly, already have algorithmic modeling in his bag of tools.

BRUCE HOADLEY

I thank Bruce Hoadley for his description of the algorithmic procedures developed at Fair, Isaac since the 1960s. They sound like people I would enjoy working with. Bruce makes two points of mild contention. One is the following:

"High performance (predictive accuracy) on the test sample does not guarantee high performance on future samples; things do change."

I agree: algorithmic models accurate in one context must be modified to stay accurate in others. This does not necessarily imply that the way the model is constructed needs to be altered, but that data gathered in the new context should be used in the construction.

His other point of contention is that the Fair, Isaac algorithm retains interpretability, so that it is possible to have both accuracy and interpretability. For clients who like to know what's going on, that's a sellable item. But developments in algorithmic modeling indicate that the Fair, Isaac algorithm is an exception.

A computer scientist working in the machine learning area joined a large money management company some years ago and set up a group to do portfolio management using stock predictions given by large neural nets. When we visited, I asked how he explained the neural nets to clients. "Simple," he said; "We fit binary trees to the inputs and outputs of the neural nets and show the trees to the clients. Keeps them happy!" In both stock prediction and credit rating, the priority is accuracy. Interpretability is a secondary goal that can be finessed.
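The trick in this anecdote, fitting a small binary tree to a black box's inputs and outputs, is easy to sketch. Below, scikit-learn's MLP stands in for the "large neural nets" and the printed tree is what would be shown to the client; all library and data choices are illustrative assumptions.

```python
# Sketch of the surrogate-tree trick: fit a black-box model, then fit a
# small binary tree to the black box's inputs and outputs and show the tree.
# Assumptions: scikit-learn, an MLP as the black box, synthetic data.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier, export_text
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=3000, n_features=8, n_informative=4,
                           random_state=0)

black_box = MLPClassifier(hidden_layer_sizes=(64, 64), max_iter=500,
                          random_state=0).fit(X, y)
y_box = black_box.predict(X)                 # the black box's outputs

surrogate = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y_box)
fidelity = accuracy_score(y_box, surrogate.predict(X))

print(f"surrogate agrees with the black box on {fidelity:.1%} of inputs")
print(export_text(surrogate, feature_names=[f"x{i}" for i in range(8)]))
# The tree is what gets shown to the client; accuracy still comes from the
# black box, which is the division of labor described in the anecdote.
```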
MANNY PARZEN

Manny Parzen opines that there are not two but many modeling cultures. This is not an issue I want to fiercely contest. I like my division because it is pretty clear cut: are you modeling the inside of the box or not? For instance, I would include Bayesians in the data modeling culture. I will keep my eye on the quantile culture to see what develops.

Most of all, I appreciate Manny's openness to the issues raised in my paper. With the rapid changes in the scope of statistical problems, more open and concrete discussion of what works and what doesn't should be welcomed.

WHERE ARE WE HEADING?

Many of the best statisticians I have talked to over the past years have serious concerns about the viability of statistics as a field. Oddly, we are in a period where there has never been such a wealth of new statistical problems and sources of data. The danger is that if we define the boundaries of our field in terms of familiar tools and familiar problems, we will fail to grasp the new opportunities.

ADDITIONAL REFERENCES

Beveridge, W. I. B. (1952). The Art of Scientific Investigation. Heinemann, London.
Breiman, L. (1994). The 1990 Census adjustment: undercount or bad data (with discussion)? Statist. Sci. 9 458–475.
Cox, D. R. and Wermuth, N. (1996). Multivariate Dependencies. Chapman and Hall, London.
Efron, B. and Gong, G. (1983). A leisurely look at the bootstrap, the jackknife, and cross-validation. Amer. Statist. 37 36–48.
Efron, B. and Tibshirani, R. (1996). Improvements on cross-validation: the .632+ rule. J. Amer. Statist. Assoc. 91 548–560.
Efron, B. and Tibshirani, R. (1998). The problem of regions. Ann. Statist. 26 1287–1318.
Gong, G. (1982). Cross-validation, the jackknife, and the bootstrap: excess error estimation in forward logistic regression. Ph.D. dissertation, Stanford Univ.
MacKay, R. J. and Oldford, R. W. (2000). Scientific method, statistical method, and the speed of light. Statist. Sci. 15 224–253.
Parzen, E. (1979). Nonparametric statistical data modeling (with discussion). J. Amer. Statist. Assoc. 74 105–131.
Stone, C. (1977). Consistent nonparametric regression. Ann. Statist. 5 595–645.