You are on page 1of 23

Statistical Science

2007, Vol. 22, No. 4, 598–620


DOI: 10.1214/07-STS249
© Institute of Mathematical Statistics, 2007

The Epic Story of Maximum Likelihood


Stephen M. Stigler

Abstract. At a superficial level, the idea of maximum likelihood must be


prehistoric: early hunters and gatherers may not have used the words “method
of maximum likelihood” to describe their choice of where and how to hunt
and gather, but it is hard to believe they would have been surprised if their
method had been described in those terms. It seems a simple, even unassail-
able idea: Who would rise to argue in favor of a method of minimum likeli-
hood, or even mediocre likelihood? And yet the mathematical history of the
topic shows this “simple idea” is really anything but simple. Joseph Louis
Lagrange, Daniel Bernoulli, Leonard Euler, Pierre Simon Laplace and Carl
Friedrich Gauss are only some of those who explored the topic, not always in
ways we would sanction today. In this article, that history is reviewed from
back well before Fisher to the time of Lucien Le Cam’s dissertation. In the
process Fisher’s unpublished 1930 characterization of conditions for the con-
sistency and efficiency of maximum likelihood estimates is presented, and
the mathematical basis of his three proofs discussed. In particular, Fisher’s
derivation of the information inequality is seen to be derived from his work
on the analysis of variance, and his later approach via estimating functions
was derived from Euler’s Relation for homogeneous functions. The reaction
to Fisher’s work is reviewed, and some lessons drawn.
Key words and phrases: R. A. Fisher, Karl Pearson, Jerzy Neyman, Harold
Hotelling, Abraham Wald, maximum likelihood, sufficiency, efficiency,
superefficiency, history of statistics.

1. INTRODUCTION Galton reported it, during a pause in the conversation


Herbert Spencer said, “You would little think it, but
In the 1860s a small group of young English in-
I once wrote a tragedy.” Huxley answered promptly,
tellectuals formed what they called the X Club. The
“I know the catastrophe.” Spencer declared it was im-
name was taken as the mathematical symbol for the
possible, for he had never spoken about it before then.
unknown, and the plan was to meet for dinner once
a month and let the conversation take them where Huxley insisted. Spencer asked what it was. Huxley
chance would have it. The group included the Dar- replied, “A beautiful theory, killed by a nasty, ugly lit-
winian biologist Thomas Henry Huxley and the social tle fact” (Galton, 1908, page 258).
philosopher-scientist Herbert Spencer. One evening Huxley’s description of a scientific tragedy is singu-
about 1870 they met for dinner at the Athenaeum Club larly appropriate for one telling of the history of Maxi-
in London, and that evening included one exchange mum Likelihood. The theory of maximum likelihood is
that so struck those present that it was repeated on sev- very beautiful indeed: a conceptually simple approach
eral occasions. Francis Galton was not present at the to an amazingly broad collection of problems. This the-
dinner, but he heard separate accounts from three men ory provides a simple recipe that purports to lead to the
who were, and he recorded it in his own memoirs. As optimum solution for all parametric problems and be-
yond, and not only promises an optimum estimate, but
Stephen M. Stigler is the Ernest DeWitt Burton also a simple all-purpose assessment of its accuracy.
Distinguished Service Professor, Department of Statistics, And all this comes with no need for the specification
University of Chicago, Chicago, Illinois 60637, USA of a priori probabilities, and no complicated deriva-
(e-mail: stigler@galton.uchicago.edu). tion of distributions. Furthermore, it is capable of being

598
THE EPIC STORY OF MAXIMUM LIKELIHOOD 599

Joe Hodges’s Nasty, Ugly Little Fact (1951) a better assumption, were supposed equally able to be
1 positive and negative, and large errors were expected
Tn = X̄n if |X̄n | ≥ 1/4 to be less frequently encountered than small. Indeed,
n
1 it was generally accepted that their frequency distribu-
= α X̄n if |X̄n | < 1/4 .
√ n tion followed a smooth symmetric curve. Even the goal
Then n(Tn − θ ) is asymptotically N(0, 1) if θ = 0, of the observer was agreed upon: while the words em-
and asymptotically N(0, α 2 ) if θ = 0. ployed varied, the observer sought the most probable
Tn is then “super-efficient” for θ = 0 if α 2 < 1. position for the object of observation, be it a star dec-
lination or a geodetic location. But in the few serious
F IG . 1. The example of a superefficient estimate due to Joseph L.
attempts to treat this problem, the details varied in im-
Hodges, Jr. The example was presented in lectures in 1951, but
was first published in Le Cam (1953). Here X̄n is the sample mean portant ways. It was to prove quite difficult to arrive at
of a random sample of size n from a N (θ, 1) population, with a precise formulation that incorporated these elements,
n Var(X̄n ) = 1 all n, all θ (Bahadur, 1983; van der Vaart, 1997). covered useful applications, and also permitted analy-
sis.
automated in modern computers and extended to any There were early intelligent comments related to this
number of dimensions. But as in Huxley’s quip about problem already in the 1750s by Thomases Simpson
Spencer’s unpublished tragedy, some would have it that and Bayes and by Johann Heinrich Lambert in 1760,
this theory has been “killed by a nasty, ugly little fact,” but the first serious assault related to our topic was by
most famously by Joseph Hodges’s elegant simple ex- Joseph Louis Lagrange in 1769 (Stigler, 1986, Chap-
ample in 1951, pointing to the existence of “superef- ter 2; 1999, Chapter 16; Sheynin, 1971; Hald, 1998,
ficient” estimates (estimates with smaller asymptotic 2007). Lagrange postulated that observations varied
variances than the maximum likelihood estimate). See about the desired mean according to a multinomial dis-
Figure 1. And then, just as with fatally wounded slaves tribution, and in an analytical tour de force he showed
in the Roman Colosseum, or fatally wounded bulls in a that the probability of a set of observations was largest
Spanish bullring, the theory was killed yet again, sev- if the relative frequencies of the different possible val-
eral times over by others, by ingenious examples of in- ues were used as the values of the probabilities. In
consistent maximum likelihood estimates. modern terminology, he found that the maximum like-
The full story of maximum likelihood is more com- lihood estimates of the multinomial probabilities are
plicated and less tragic than this simple account would the sample relative frequencies. He concluded that the
have it. The history of maximum likelihood is more in most probable value for the desired mean was then the
the spirit of a Homeric epic, with long periods of peace mean value found from these probabilities, which is the
punctuated by some small attacks building to major arithmetic mean of the observations. It was only then,
battles; a mixture of triumph and tragedy, all of this and contrary to modern practice, that Lagrange intro-
dominated by a few characters of heroic stature if not duced the hypothesis that the multinomial probabilities
heroic temperament. For all its turbulent past, maxi- followed a symmetric curve, and so he was left with
mum likelihood has survived numerous assaults and only the problem of finding the probability distribu-
remains a beautiful, if increasingly complicated the- tion of the arithmetic mean when the error probabili-
ory. I propose to review that history, with a sketch of ties follow a curve. This he solved for several exam-
the conceptual problems of the early years and then a ples by introducing and using “Laplace Transforms.”
closer look at the bold claims of the 1920s and 1930s, By introducing restrictions in the form of the curve
only after deriving the estimates of probabilities, La-
and at the early arguments, some unpublished, that
grange’s analysis had the curious consequence of al-
were devised to support them.
ways arriving at method of moment estimates, even
though starting with maximum likelihood! (Lagrange,
2. THE EARLY HISTORY OF
1776; Stigler, 1999, Chapter 14; Hald, 1998, page 48.)
MAXIMUM LIKELIHOOD
At about the same time, Daniel Bernoulli considered
By the mid-1700s it seems to have become a com- the problem in two successively very different ways.
monplace among natural philosophers that problems First, in 1769 he tried using the hypothesized curve as
of observational error were susceptible to mathemat- a weight function, in order to weight, then iteratively
ical description. There was essential agreement upon reweight and average the observations. This was very
some elements of that description: errors, for want of much like some modern robust M-estimates. Second,
600 S. M. STIGLER

in 1778 (possibly after he had seen a 1774 memoir place in history, more for what in the end it seemed
of Laplace’s with a Bayesian analytical formulation), to suggest, rather than for what it accomplished. The
Bernoulli changed his view dramatically and used the two authors considered a very general setting for the
same curve as a density for single observations. He estimation problem—a set of multivariate observations
multiplied these densities together, and he sought as with a distribution depending upon a potentially large
the true value for the observed quantity, that value that array of constants to be determined. They did not re-
made the product a maximum (Bernoulli, 1769, 1778; fer to the constants as parameters, but it would be hard
Stigler, 1999, Chapter 14; Laplace, 1774). for a modern reader to view them in any other light,
These and the other attempts of that time were even though a close reading of the memoir shows that
primarily theoretical explorations, and did not attract it lacked the parametric view Fisher was to introduce
many practical applications or further development. more than 20 years later (Stigler, 2007).
And while they all used phrases that could easily be The main result of Pearson and Filon (expressed in
translated into modern English as “Maximum Likeli- modern terminology) came from taking a likelihood ra-
hood,” and in some cases even be defended as maxi- tio (a ratio of the frequency distribution of the observed
mum likelihood, in no case was there a reasoned de-
data and the frequency distribution evaluated for the
fense for them or their performance. The most that was
same data, but with the constants slightly perturbed),
to be found was the superficial invocation that the value
expanding its logarithm in a multivariate Taylor’s ex-
derived was “most probable” because it made the only
pansion, then approximating the coefficients by their
probability in sight (the probability of the observed
data) as large as possible. expected values and claiming that the resulting expres-
The philosophically most cogent of these early treat- sion gave the frequency distribution of the errors made
ments was that of Gauss, in his first publication on in estimating the constants. They erred in taking the
least squares in 1809 (Gauss, 1809). Gauss, like Daniel limit of the coefficients, in effect using a procedure that
Bernoulli in 1778, adopted Laplace’s analytical formu- did not at all depend upon the method of estimation
lation, but unlike Bernoulli, Gauss explicitly invoked used and would at most be valid for maximum likeli-
Laplace’s Bayesian perspective using a uniform prior hood estimates, a fact they failed to recognize. Their
distribution for the unknowns. Where Laplace had then last step employed an implicit Bayesian step in the
sought (and found) the posterior median (which mini- manner of Gauss. When cubic and higher order terms
mized the posterior expected error), Gauss chose the were neglected, their formula would give a multivari-
posterior mode. In accord with modern maximum like- ate normal posterior distribution (extending results of
lihood with normally distributed errors, this led Gauss Laplace a century earlier), although Pearson and Filon
to the method of least squares. The simplicity and cautioned against doing this with skewed frequency
tractability of the analysis made this approach very distributions. A modern reader would recognize their
popular over the nineteenth century. By the end of that resulting distribution as the normal distribution some-
century this was sometimes known as the Gaussian times used to approximate the distribution of maximum
method, and the approach became the staple of many likelihood estimates, but Pearson and Filon made no
textbooks, often without the explicit invocation of a such restriction in the choice of estimate and applied
uniform prior that Gauss had seen as needed to justify it heedlessly to all manner of estimates, particularly to
the procedure. method of moments estimates.
The result may in hindsight be seen to be a mess,
3. KARL PEARSON AND L. N. G. FILON not even applying to the examples presented, and the
Over the 19th century, the theory of estimation gen- approach was soon to be abandoned by Pearson him-
erally remained around the level Laplace and Gauss left self. But it led to some correct results for the bivari-
it, albeit with frequent retreats to lower levels. With re- ate normal correlation coefficient, and it was bold and
gard to maximum likelihood, the most important event surely highly suggestive to a reader like Ronald Fisher,
after Gauss’s publication of 1809 occurred only on the to whom I now turn. I have recently published a de-
eve of a new century, with a long memoir by Karl Pear- tailed study (Stigler, 2005) of how Fisher was led to
son and Louis Napoleon George Filon, published in write his 1922 watershed work on “The Mathematical
the Transactions of the Royal Society of London in Foundations of Theoretical Statistics,” so I will only
1898 (Pearson and Filon, 1898). The memoir has a briefly review the main points leading to that memoir.
THE EPIC STORY OF MAXIMUM LIKELIHOOD 601

4. R. A. FISHER the following. Suppose you have two candidates as es-


timates for a parameter θ , denoted by S and T . Suppose
At Cambridge Fisher had studied the theory of er-
that T is a sufficient statistic for θ . Since generally both
rors and even published in 1912 a short piece com-
S and T are approximately normal with large samples,
mending the virtues of the Gaussian approach to esti- let us (anticipating a species of argument Wald was to
mation, particularly of the standard deviation of a nor- develop rigorously in 1943) follow Fisher in consid-
mally distributed sample. He had been so taken by the ering that S and T actually have a bivariate normal
invariance of the estimates so derived, how (for exam- distribution, both with expectation = θ , and with stan-
ple) the estimate of the square of a frequency constant dard deviations σS and σT and correlation ρ. Then the
was the square of the estimate of the constant, that he standard facts of the bivariate normal distribution tell
termed the criterion “absolute” (Fisher, 1912). But his us that E(S|T = t) = θ + ρ(σS /σT )(t − θ ). Since T
approach at that time was superficial in most respects, is sufficient, this cannot depend upon θ , which is only
tacitly endorsing the naïve Bayesian approach Gauss possible if ρ(σS /σT ) = 1, or if σT = ρσS ≤ σS . Thus T
had used, without noticing the lurking inconsistency in cannot have a larger mean squared error than any other
even the example he considered, in that the estimate of such estimate S, and so must be optimum according to
the squared standard deviation based upon the distrib- a clear metric criterion, expected squared error! In one
1
ution of the data, namely n (xi − x̄)2 , did not agree stroke Fisher had (if one accepts the substitution of ex-
with that found

applying the same principle to distrib- act for approximate normality) the simple and powerful
ution of n1 (xi − x̄)2 alone. result:
Four years later, Fisher sent to Pearson for possi-
ble publication a short, equally superficial critique of Sufficiency implies optimality, at least when
a Biometrika article by Kirstine Smith advocating the combined with consistency and asymptotic
minimum chi-square approach to estimation (Smith, normality.
1916). Pearson’s thoughtful rejection letter to Fisher The question was, how general is this result? Neither
focused on the lack of a clear and convincing ratio- Fisher nor much of posterity thought of consistency
nale for the method of choosing constants to maxi- and asymptotic normality as major restrictions. After
mize the frequency function, and Pearson even stated all, who would use an inconsistent estimate, and while
that he now thought the Pearson–Filon paper was re- there are noted exceptions, is not asymptotic normality
miss on the same count. He called particular attention the general rule? Indeed, Fisher clearly knew the result
to a perceptive footnote in Smith’s paper that argued was stronger than this, that a sufficient estimate cap-
the case against the Gaussian method: the probabil- tured all the information in the data in even stronger
ity being maximized was not a probability but rather senses; the argument was only to present the claim in
a probability density, an infinitesimal probability, and terms of a specific criterion, minimum standard error.
of what force was such meager evidence in defense of a But what about sufficiency?
choice? At least the minimum chi-square method opti- At this point Fisher appears to have made an in-
mized with respect to an actual metric. Two more years teresting and highly productive mistake. He quickly
passed, and in 1918 Fisher discovered sufficiency in explored a number of other parametric examples and
the context of estimating the normal standard deviation came to the conclusion that maximizing the likelihood
(Fisher, 1920); he recalled Pearson’s challenge to pro- always led to an estimate that was a function of a suf-
duce a rationale for the method, and he was off to the ficient statistic! When he read the paper to the Royal
races, quickly setting to work on the monumental pa- Society in November 1921, his abstract, as printed in
per on the theory of statistics that he read to the Royal Nature (November 24, 1921) emphatically stated, “Sta-
Society in November 1921 and published in 1922. tistics obtained by the method of maximum likelihood
are always sufficient statistics.” And from this it would
5. FISHER’S FIRST PROOF follow, with the minor quibble that perhaps consistency
and asymptotic normality may be needed, that maxi-
By my reconstruction, Fisher’s discovery of suffi- mum likelihood estimates are always optimum. A truly
ciency was quickly followed by the development of a beautiful theory was born, after over a century and a
short argument that he gave in that great 1922 paper; half in gestation.
indeed it was the first mathematical argument in the pa- Even as the paper was being readied for press, doubts
per. The essence of the argument in modern notation is occurred to the one person best equipped to understand
602 S. M. STIGLER

the theory, Fisher himself. The bold claim of the ab- 1922 paper Fisher also pointedly included a section il-
stract does not appear in the published version; neither lustrating the use of maximum likelihood for Pearson’s
does its denial. He expressed himself in this way: Type-III distributions (gamma distributions), contrast-
ing his results with the erroneous ones Pearson and
“For the solution of problems of estimation
Filon had given in 1898 for the same family.
we require a method which for each partic-
ular problem will lead us automatically to 6. THREE YEARS LATER
the statistic by which the criterion of suffi-
ciency is satisfied. Such a method is, I be- By 1925 Fisher’s earlier optimism had faded some-
lieve, provided by the Method of Maximum what, and he prepared a revised version of his theory
Likelihood, although I am not satisfied as to for presentation to the Cambridge Philosophical Soci-
the mathematical rigour of any proof which ety. At some point in the interim he had recognized
I can put forward to that effect. Readers that sufficient statistics of the same dimension as the
of the ensuing pages are invited to form parameter did not always exist. What led to this re-
their own opinion as to the possibility of the alization? Fisher did not say, although in a 1935 dis-
method of maximum likelihood leading in cussion he wrote, “I ought to mention that the theo-
any case to an insufficient statistic. For my rem that if a sufficient statistic exists, then it is given
own part I should gladly have withheld pub- by the method of maximum likelihood was proved in
lication until a rigourously complete proof my paper of [1922]. . . . It was this that led me to at-
could be formulated; but the number and tach especial importance to this method. I did not at
variety of new results which the method that time, however, appreciate the cases in which there
discloses press for publication, and at the is no sufficient statistic, or realize that other proper-
same time I am not insensible of the advan- ties of the likelihood function, in addition to the posi-
tage which accrues to Applied Mathematics tion of its maximum, could supply what was lacking”
from the co-operation of the Pure Mathe- (Fisher, 1935, page 82). I speculate that he learned this
matician, and this co-operation is not infre- in considering a problem where no sufficient statistic
quently called forth by the very imperfec- exists, namely the problem that figured prominently in
tions of writers on Applied Mathematics” the 1925 paper, the estimation of a location parameter
(Fisher, 1922, page 323). for a Cauchy distribution. In any event, in that 1925
paper Fisher did not dwell on this discovery of insuf-
The 1922 paper did present several related argu- ficiency; quite the contrary. The possibility that suffi-
ments in addition to the Waldian one I reported above. cient statistics need not exist was only casually noted
It stated less boldly a converse of the statement in the as a fact 14 pages into the paper, and a reader of both
1921 abstract that, “it appears that any statistic which the 1922 and 1925 papers might not even notice the
fulfils the condition of sufficiency must be a solution subtle shift in emphasis that had taken place.
obtained by the method of the optimum [e.g. maxi- Where in 1922 Fisher started with consistency and
mum likelihood]” (page 331). But Fisher did not now sufficiency, in 1925 he began with efficiency. Writing
claim that a sufficient statistic need always exist. In- of consistent and asymptotically normal estimates, he
stead Fisher gave an improved non-Bayesian version of stated, “The criterion of efficiency requires that the
the Pearson–Filon argument for asymptotic normality, fixed value to which the variance of a statistic (of the
expanding the likelihood function about the true value class of which we are speaking) multiplied by n, tends,
and pointing out how and why the argument requires shall be as small as possible. An efficient statistic is one
maximum likelihood estimates (and that it would not for which this criterion is satisfied” (page 703). With
apply to moment estimates), and how it could be used this in mind, his main claim now was (page 707), “We
to assess the accuracy of maximum likelihood esti- shall see that the method of maximum likelihood will
mates (pages 328–329). And there, in a long footnote, always provide a statistic which, if normally distributed
he called Karl Pearson to task for not earlier calling at- in large samples with variance falling off inversely to
tention himself to the error in the 1898 paper. Fisher the sample number, will be an efficient statistic.”
noted that in 1903 Pearson had published correct stan- Thus in 1925 the theory said that if there is an effi-
dard errors for moment estimates, even while citing cient statistic, then the maximum likelihood estimate is
the 1898 paper without noting that the standard errors efficient. When a sufficient and consistent estimate ex-
given in 1898 for several examples were wrong. In the ists, it will also be maximum likelihood, but that is not
THE EPIC STORY OF MAXIMUM LIKELIHOOD 603

necessary for efficiency. He granted that more than one Fisher did not discuss conditions under which the lin-
efficient estimate could exist, but he repeated a proof ear approximation would prove adequate; he was con-
he had already given in 1924 (Fisher, 1924a) that any tent to exploit it as a simple route to the asymptotic dis-
two efficient estimates are correlated with correlation tribution of the maximum likelihood estimate, namely
that approaches 1.0 as n increases. N(θ, 1/I (θ )). Thus far he had not gone beyond the
1922 argument.
7. THE 1925 “ANOVA” PROOF The part of the argument that was novel in 1925, the
What did Fisher offer by way of proof of this new “ANOVA proof,” then went as follows: Let T be any
efficiency-based formulation? His 1922 treatment had estimate of θ , assumed to be consistent and asymptot-
leaned crucially on sufficiency, but that was no longer ically normal N(θ, V ). In the proof Fisher used this
generally available. In its place he depended upon a as the exact distribution of T , and further treated V
new and limited but mathematically rather clever proof as not depending upon θ , as would approximately be
that I will call the “analysis of variance proof.” The the case for “reasonable” estimates T in what we now
proof was clearly based upon a probabilistic version of call “regular” parametric problems. Fisher considered
the analysis of variance breakdown of a sum of squares the score function X as a function of the sample and
that Fisher was developing separately at about the same looked at its variation over different samples in two
time for agricultural field trials. Fisher’s own 1925 pre- ways. The first was to consider the total variation of X
sentation of the argument is fairly opaque and does not over all samples, namely its variance Var(X) = I (θ ).
explain clearly its underlying logic; in 1935 he gave an And for the second, he evaluated Var(X|T ), the con-
improved presentation that helps some (Fisher, 1935, ditional variation in X given the value of T for the
pages 42–44). The mathematical details of the proof sample (i.e., the variance of X among all samples
have been clearly re-presented by Hinkley (1980) at that give the same value for T ). From this he com-
some length. I will be content to offer only a sketch puted E[Var(X|T )], which he found equal to Var(X) −
emphasizing the essence of the argument, what I be- 1/V . Since Var(X) = E[Var(X|T )] + Var[E(X|T )]
lieve to be the logical development Fisher had in mind.
(this is the ANOVA-like breakdown I refer to), this
It will help the historical discussion to divide his 1925
would give Var[E(X|T )] = 1/V . But Var(X|T ) ≥ 0
argument into two parts, just as Fisher did in the 1935
always, which implies that necessarily E[Var(X|T )] ≥
version.
0, and so Var(X) − 1/V ≥ 0. This gave V1 ≤ I (θ ), or
Let f (x; θ ) be the density of a single observation,
and let φ be the likelihood function for a sample of n V ≥ I (θ)
1
for any such T , with equality for efficient
independent observations, so that log φ =  log f . Fol- estimates—what we now refer to as the information
lowing Fisher, let X = φ1 ∂φ inequality. Thus if the maximum likelihood estimate
∂θ = ∂θ log φ—what we now

sometimes refer to as the score function. Fisher was indeed has asymptotic variance 1/I (θ ), he had estab-
only concerned here with situations where the maxi- lished efficiency.
mum likelihood estimate could be found from solving The logic of the proof—and the likely route that led
the equation X = 0 for θ . The first part of the argument Fisher to it—seems clear. If there were a sufficient sta-
was really more of a restatement of what he had shown tistic S, then the factorization theorem (which Fisher
in 1922: from expanding the score function in a Tay- had recognized in 1922, at least in part) would give
lor series, he had that the score function was approx- φ = C · h(S; θ ), where the proportionality factor C
imately a linear function of the maximum likelihood may depend upon the sample but not on θ . By suf-
estimate; as he put it, X = −nA(θ − θ̂ ) “if θ − θ̂ is a ficiency, X would then depend upon the sample only
small quantity of order n−1/2 ,” where his −nA denoted through S, and so Var(X|S) = 0 for all values of S,
what we now call the Fisher Information in a sample, and consequently E[Var(X|S)] = 0 also. Also, if S
I (θ ). Since under fairly general regularity conditions is sufficient, the maximum likelihood estimate (found
  ∂φ ∂ 
E(X) = φ1 ∂φ ∂θ φ = ∂θ = ∂θ φ = ∂θ 1 = 0, we also
∂ through solving X = 0 for θ ) is a function of S. The
have Var(X) = I (θ ). As Fisher noted, I (θ ) may be failure of T to capture all of the information in the
found from any of the alternative expressions sample is then reflected through the variation in the val-
 2   2 ues of X given T , namely through Var(X|T ) and thus
∂ log φ ∂ log φ
I (θ ) = −E =E E[Var(X|T )]. This latter quantity plays the role of a
∂θ 2 ∂θ residual sum of squares and measures the loss of effi-
 2   2
∂ log f ∂ log f ciency of T over S (or at least over what would have
= −nE = nE . been achievable had there been a sufficient statistic).
∂θ 2 ∂θ
604 S. M. STIGLER

What is more, this interpretation gave Fisher a target and scientifically (he was working on crop estimating
to pursue in trying to measure the amount of lost infor- at that time) that he was able to engage in just such a
mation, or even to determine how one might recover dialogue. I refer to Harold Hotelling.
it, just as in an analysis of variance one can advance Hotelling received his Ph.D. from Princeton Uni-
the analysis by introducing factors that lead to a de- versity in 1924, for a dissertation in point set topol-
crease in the residual sum of squares. In the remainder ogy. In that same year he joined the Food Research
of the 1925 paper Fisher pursued just such courses. He Institute at Stanford University, where he worked on
introduced both the term and the concept of an ancil- agricultural problems. Soon after, he discovered Fisher
lary statistic, in effect as a covariate designed to reduce through Fisher’s 1925 book, Statistical Methods for
the residual sum of squares toward its theoretical min- Research Workers. Hotelling reviewed that book for
imum achievable value. He gave particular attention to JASA; in fact he reviewed each of the first seven edi-
multinomial problems and focused on a study of the tions and the first three of these were volunteered re-
information loss when no sufficient estimate existed, views, not requested by the Editor (Hotelling, 1951).
and the loss in information in using an estimate that He started up a correspondence with Fisher, and tried
was efficient but not maximum likelihood (e.g., a min- unsuccessfully to get Fisher to visit Stanford in 1928
imum chi-square estimate). He found the latter differ- and 1929 (Stigler, 1999a). After several friendly ex-
ence tended to a finite limit, a measure of what C. R. changes of letters, on October 15, 1928, Fisher (who
Rao (1961, 1962) was later to term “second-order effi- had had several requests from others for detailed math-
ciency.” ematical proofs) wrote, asking Hotelling, “Now I want
By 1935 Fisher evidently had come to see the first your considered opinion as to the utility of collect-
part of the argument—the part establishing that the ing such scraps of theory as are needed to prove just
maximum likelihood estimate actually achieved the what is wanted for my practical methods.” Hotelling
lower bound 1/I (θ )—as unsatisfactory, and he offered replied December 8, strongly encouraging such a work
as valuable for mathematics generally, and stated that
in its place a different argument to show the bound was
“a knowledge of the grounds for belief in a theory helps
achieved. That argument (Fisher, 1935, pages 45–46)
to dispel the absurd notions which tend to cluster even
was derived from what I will call his third proof; I shall
about sound doctrines.” Fisher’s Christmas Eve 1928
comment on it later in that connection.
reply proposed that they collaborate:
Fisher’s 1925 work was conceptually deep and has
been the subject of much fruitful modern discussion, 24 Dec ’28
particularly by Efron (1975, 1978, 1982, 1998), Efron Dear Prof. Hotelling
and Hinkley (1978) and Hinkley (1980). Your letter has arrived on Christmas Eve,
and has given me plenty to think about for
8. AFTER 1925: CORRESPONDENCE the holidays. You will not expect too much
WITH HOTELLING of my answer, as you see that I am writing
first and thinking afterwards; but I can see
Fisher’s beautiful theory had become more compli- already that I have a great deal to thank you
cated but was still quite attractive. The proofs Fisher for.
offered in 1925 were not such as would satisfy the After a few hours consideration I believe
Pure Mathematician he had referred to in 1922, nor my right course is to send you a draft con-
would they withstand the challenges that would come tents, to be pulled to pieces or recast as
a quarter century later. Were they all that he could of- much as you like, and to say I will do my
fer? To answer this, it would help us to listen in on a best to fill the bill if you will be joint author
dialogue between Fisher and a nonhostile, highly in- and be responsible for the pure mathemat-
telligent party. Many in the audience in England who ics. If you consent to this and to taking the
were interested in this question had axes to wield, and first decision, like an editor, as to inclusion
Fisher’s transparent digs at Karl Pearson, even though or exclusion, on the clear understanding that
they came in the form of legitimately pointing out ma- either of us may throw it up as soon as we
jor errors in Pearson’s previous work, just set those think it is not worth while, I will start send-
axes a-grinding. But there was one reader who ap- ing stuff in. It will be mostly new as many
proached Fisher’s level as a mathematician and was of the proofs can be done much better than
so distant both geographically (he was in California) in my old publications.
THE EPIC STORY OF MAXIMUM LIKELIHOOD 605

Have you all my old stuff? I believe you whereas you call them all inconsistent. Thus
have, but if not I will try to find anything I should not call the mean of a sample from
still lacking. 1 dx
It seems a monstrous lot of work, but I
π 1 + (x − m)2
will not grumble if I need not think too
much about arrangement. an inconsistent statistic, though you would.
Yours sincerely Congratulations on a very fine paper.”
R. A. Fisher Hotelling’s paper is little referred to today, which
[Hotelling Papers Box 3] seems a shame. It is beautifully written, as was most
Fisher’s draft table of contents is given as Appen- of Hotelling’s work, and among other things he ex-
dix 1 below. That work was never to be completed. plained Fisher’s own work on this topic more clearly
There was no apparent split between the two, but as the than Fisher ever did. He reviewed Fisher’s proof of as-
project went on, Fisher’s increasing focus on genetics ymptotic normality (the one based upon the Pearson–
as his 1930 book The Genetical Theory of Natural Se- Filon approach), and he gently noted that “it is not clear
lection went through the press, and Hotelling’s move what conditions, particularly of continuity, are neces-
in 1931 to the Department of Economics at Columbia sary in order that the proofs which have been given
University, were likely causes for the drop in inter- shall be valid.” To repair this omission Hotelling of-
fered two explicit proofs for the case of one contin-
est. By February 1930 Fisher was writing, “It is a
uous variable, stating overconfidently that “the exten-
grind getting anything serious done in the way of a
sions to any number of variables are perfectly obvious;
text book; I hope you will stick to yours, though; as
and the corresponding theorems for discrete variables
well as developing the purely mathematical develop-
follow immediately. . . .” The problem is, as Hotelling’s
ments.” Nonetheless, Hotelling spent nearly six months
clear exposition makes apparent to a modern reader,
at Rothamsted over the last half of 1929 and saw quite
the proof does not work. He simplified the problem by
a bit of Fisher over that time. Hotelling returned to
transforming the parameter space to a finite interval (if
the United States in late December in time to submit
necessary) by an arc tangent transformation, and dis-
a paper to the American Mathematical Society (AMS)
cretized the observed variable by grouping in a finite
and present it at their meeting in Des Moines, Decem- number of small intervals, and did not realize that the
ber 31. That paper was entitled, “The consistency and two combined do not ensure the uniformity he would
ultimate distribution of optimum statistics”; that is, on need to achieve the desired result for other than discrete
the consistency and asymptotic normality of maximum distributions with bounded parameter sets. The error
likelihood estimates. It was published in the October evidently came to Hotelling’s attention by 5 Decem-
1930 issue of the Transactions of the AMS. ber 1931, when he circulated a list of 37 “Outstanding
It is a reasonable guess that the approach taken in the Problems in the Theory of Statistics.” Problem #16 on
paper reflected Fisher’s views to some degree, com- the list was, “Prove the validity of the double limiting
ing directly after the long visit with Fisher, although process used in the proof of (Hotelling, 1930), for as
Fisher apparently played no direct role in the writing. general a situation as possible.”
At any rate, when Fisher wrote to Hotelling on the 7th
of January in 1930 to thank him for a copy, Fisher’s 9. THE GEOMETRIC SHADOW OF A NASTY
only complaint was that the definition of “consistency” LITTLE FACT
Hotelling gave was slightly different from Fisher’s.
Fisher wrote, To this point there had been not even a hint of the fu-
ture appearance of any nasty, ugly little fact that might
“It is worth noting to avoid future confusion sully the beautiful theory. But then, on November 15,
that you are using consistency in a some- 1930, Hotelling wrote to Fisher with some pointed
what different sense from mine. To me a sta- questions. The letter reflected a geometric view of the
tistic is inconsistent if it tends to the wrong inference problem that Hotelling seems to have found
limit as the sample is increased indefinitely. in Fisher’s work by 1926 and developed further after
I do not think I have ever attempted to ap- their conversations at Rothamsted. Hotelling gave one
ply the distinction of consistency or incon- statement of the view in his 1930 paper (which must
sistency to statistics which tend to no limit, have been drafted at Rothamsted), and he restated it
606 S. M. STIGLER

F IG . 2. A reconstruction of Hotelling’s geometric view of the multinomial estimation problem, circa Fall 1929. Here x represents a multino-
mial observed relative frequency vector in the simplex, and the curve f (p) the potential values of the multinomial probability vector; the true
value of the parameter p (p0 ) is shown, as is the MLE and a contour of the likelihood surface.

in his November letter in different but equivalent nota- exact meaning of the theorem. One of sev-
tion. The essence is captured by Figure 2, drawn to dis- eral questions is whether the variance of a
play what Hotelling conveyed in words and symbols. statistic or its mean square deviation from
Hotelling considered a parameterized multinomial the true value should be used as a measure
problem with m cells, where the observations are a vec- of accuracy.
tor of relative frequencies of counts x = (x1 , . . . , xm ) Denoting by p̂ the optimum estimate of a
taking values in the m-dimensional simplex parameter p, whose true value is p0 , can it
m be said that the variance of p̂, assuming p̂
t=1 xt = 1, xt ≥ 0 all t. Let the probabilities of the normally distributed, is less than that of any
cells f (p) = (f1 (p), . . . , fm (p)) depend upon a pa-
other function of the same observations?
rameter p; this describes a curve in the simplex as p Obviously not without further qualification,
varies. Let p = p0 denote the true value of the para- since a function of the observations can be
meter, let p̂ be the maximum likelihood estimate of p, defined having an arbitrarily small variance.
and let f (p0 ) and f (p̂) be the points on the curve We must therefore restrict the comparison
corresponding to these two values. In his 1930 paper, to a special class of functions suitable for
Hotelling stated further, “The likelihood L is constant estimating p, but the definition of this class
over a system of approximately spherical hypersur- must not involve p0 . How should the class
faces about [x]. The point [f (p̂)] is the point of the be defined? As the class of consistent statis-
curve which lies on the smallest of the approximate tics? If so, the following difficulty must be
spheres meeting the curve, and is therefore approxi- faced.
mately the nearest point on the curve to [x]” (Hotelling, Consider a distribution of frequency
1930). among a finite number m of classes, involv-
Here then is how Hotelling raised his question in cor- ing a parameter p. In a sample of n, let xt
respondence, in the context of what must have been a be the number [Hotelling evidently means
relative frequency] falling in the tth class.
shared frame of discourse they had adopted at Rotham-
Let ft (p) be the probability of an individual
sted.
falling into this class. If we take x1 , . . . , xm
Dear Dr. Fisher: as coordinates in m-space, the equations
Thank you very much for your recent let- xt = ft (p) (t = 1, . . . , m)
ter, with graph and data. represent a curve with p as parameter. The
I have been examining various problems points corresponding to samples will form
in Maximum Likelihood of late; I wonder if a “globular cluster” (as you so well put it
you can enlighten me as to the conditions in 1915)1 about that point on the curve for
under which your proof holds good regard-
ing the minimum variance of statistics ob- 1 Here Hotelling evidently refers to Fisher’s use in Fisher (1924,
tained by this method, or rather, as to the at page 101) of the evocative astronomical term “globular cluster”
THE EPIC STORY OF MAXIMUM LIKELIHOOD 607

which p = p0 . The method of maximum I have two students working on the opti-
likelihood corresponds approximately, for mum estimates of m for the above curve and
large samples, to taking for p̂ the para- for the Type III case you treated. Failing to
meter of the point of the curve nearest to get anything of consequence for small sam-
that representing the sample; i.e., to project- ples by purely mathematical methods, they
ing orthogonally. Now consider some other will probably soon resort to experiment.2
method of projecting sample points upon Cordially yours,
the curve; for example an orthogonal pro- Harold Hotelling
jection followed by an alternate stretching
Hotelling’s letters posed a challenging question in a
and contracting along the curve. Then if p0
direct but nonconfrontational way. Clearly, Hotelling
happens to give a point in one of the re-
said, some more constraints on the class of estimates
gions of condensation [i.e. high density],
would be needed; the geometric view they had ev-
this method of estimation will, for suffi-
idently shared at Rothamsted suggested that consis-
ciently large samples, yield a statistic with
tency alone was not enough. There is no obvious
smaller variance than that by the method of
guarantee that the curve f (p) and the contours of the
maximum likelihood. To be sure, its vari-
likelihood are such that improvement over maximum
ance will be larger if the true value p0 lies
likelihood is not possible. What would be needed to
in a region of rarefaction [i.e. low density],
prevent this, or at least to convince a reader such as
and averaging for different possible values
Hotelling that the worry was groundless? Hotelling’s
of p0 might indicate a greater average vari-
hypothetical improvements were certainly vague. A
ance than that of the optimum statistic. But
modern reader might be tempted to see them as fore-
such an averaging would seem to be of a
shadowing Hodges’s estimate or even shrinkage via
piece with “Bayes’ Theorem,” in supposing
Stein estimation, but even though they fall short of that,
equal a priori probabilities.
they presented a clear challenge to Fisher.
Hotelling went on to state that even in the particu-
lar case of symmetric beta densities, maximum likeli- 10. FISHER’S REPLY: A THIRD PROOF OF THE
hood failed to be optimum, but his derivation there was EFFICIENCY OF MAXIMUM LIKELIHOOD
marred by a simple error in differentiation. Before he By 1930 Fisher was no stranger to challenges by
received Fisher’s reply of the 28th of November, 1930, skeptical readers. His general reaction to one from a
Hotelling wrote again, on December 12, correcting his friendly source was to state clearly what he was pre-
own error with regard to the beta estimation problem pared to say, while avoiding speaking directly to the
and enlarging on his other comment, to the point of point raised. Without addressing the criticism, much
rather clearly speculating on the possibility of superef- less admitting its validity, he would move directly to
ficient estimates. a new and improved position, often not giving any in-
The general question of the exact circum- dication that it represented the strongest statement that
stances in which optimum statistics have could be made and perhaps even hinting otherwise or
minimum variance. . . is extremely interest- at least allowing the reader to speculate so. Such was
ing. That the property is not perfectly gen- the case here.
eral seems clear from a consideration of Fisher’s reply to the first of Hotelling’s letters was
some of the distributions having discontinu- brief, but it included one enclosure (A) that outlined a
ities; and also from the fact that, if the true new proof that illuminated Fisher’s views, as well as
value were known, a system of estimation a second short note (B) correcting Hotelling’s error in
could be devised which would give it with differentiating the beta density.
arbitrarily small variance; and such a system 28 November 1930
of estimation might happen to be adopted Dear Hotelling,
even if the true value were unknown. I enclose two notes A and B on the points
you raise. The first brings in the general
to describe a point cloud. Fisher (1924) used the term in summariz-
ing the multiple dimensional space approach he had taken in Fisher
(1915). Fisher did not use the term in Fisher (1915), although it 2 From comments elsewhere in the correspondence it is clear that
would have been appropriate there also. by “experiment” Hotelling means simulation with dice or cards.
608 S. M. STIGLER

variances and covariances for the multino- where φx denotes the partial derivative of φ with re-
mial, and is done more prettily by replacing spect to x (see, e.g., Courant, 1936, Vol. 2, pages 108–
the multinomial by a multiple Poisson,3 but 109). In Fisher’s case, the homogeneous functions φ
the argument is probably clearer as it stands. of degree zero would be functions of the sample rela-
This is a very short note; the meat of my tive frequencies only, and would not otherwise depend
letter is in the enclosures. . . . . upon the sample size n. This might be considered a
Yours sincerely, strong restriction upon the class of estimates (Fisher
R. A. Fisher did not comment upon this), but with the assumed dif-
[Hotelling Papers Box 45] ferentiability and Euler’s Relation with h = 0, and the
exactly known covariances for the multinomial, Fisher
Fisher’s Enclosure A presented a sketch of a new, had an easy expression for the asymptotic variance for
third proof of the efficiency of maximum likelihood, all estimates T in this class. He did not require recourse
one that took a different point of attack. The argu- to the regularity assumptions implicit in the substitut-
ment, given in toto as Appendix 2 below, was ele- ing normal distributions for approximately normal dis-
gant, geometric, and I believe also correct, or at least tributions, or in assuming the variance of T was ap-
completable, under the tacit regularity conditions im- proximately constant, as he had in his previous proof.
plied by the analysis. The geometric stance he took It was then an easy step to use standard Lagrangian
was that which he and Hotelling would have discussed methods to minimize this asymptotic expression for
at Rothamsted, restricted to the case of a parametric the variance for consistent estimates within this class
multinomial family of distributions. Without any dis- and show the resulting equations were those that also
cussion, but in evident reply to Hotelling’s request for determined the maximum likelihood estimate.
further restrictions on the class of estimates allowed in Fisher published this third proof later only in a dis-
order to rule out nasty little facts of the sort Hotelling guised form, namely where he assumed that the esti-
had hinted at, Fisher did introduce a new restriction mate T was to be found from an estimating equation
on the class of estimates T for which the result was restricted to be a linear function of the relative frequen-
claimed. As Fisher put it, the estimates under consider- cies; that is, without stating where he had begun, he
ation were now assumed to be homogeneous functions jumped directly to Euler’s Relation. In that guise, and
T = φ(x1 , . . . , xs ) of degree zero, where (x1 , . . . , xs ) without the geometric setting and intuition, it appeared
is the vector of counts. This, and the tacitly assumed in his 1935 paper (Fisher, 1935, pages 45–46) where
smooth differentiability, gave him access to a number it served to provide an improved version of the demon-
of simple relationships leading to a conclusion he sum- stration that the maximum likelihood estimate achieves
marized as follows: the information lower bound. It also appeared in Fisher
(1938, pages 30–32), a 45-page tract he put together
“The criterion of consistency thus fixes the for a visit to India at Mahalonobis’s invitation in Janu-
value of T at all points on the expectation ary 1938. That tract was mostly cobbled together from
line, while the criterion of efficiency in con- Fisher’s papers, and it summarized his view to that
junction with it fixes the direction in which time. And in his 1956 book, he gave (apparently only
the equistatistical surface cuts that line. All as an illustration) another, simplified version, restricted
statistics which are both consistent and effi- to estimates T that were themselves linear in the rela-
cient thus have surfaces which touch on that tive frequencies (Fisher, 1956, pages 145–148).
line. The surface for Maximum Likelihood Hotelling’s role in this was that of an important cat-
has the plane surface of this type.” alyst. He helped lead Fisher to reconsider the problem
and provided a remarkably acute audience, but he him-
A homogeneous function φ of degree h is one self did not contribute further to the theory of maxi-
where φ(cx, cy, . . .) = ch φ(x, y, . . .), and in the ap- mum likelihood. Hotelling did write one other related
plied mathematics of Fisher’s day their principal ad- paper during his nearly six months at Rothamsted in
vantage was that if differentiable, they satisfied Euler’s 1929. It was an investigation of the differential geome-
Relation xφx + yφy + zφz + · · · = hφ(x, y, z, . . .), try of parameter spaces, with what is sometimes called
the Jeffreys information metric, after Jeffreys (1946)
3 An analytical trick he had introduced in Fisher (1922a), see also (Kass, 1989; Kass and Vos, 1997). The paper, entitled
the revised footnote in the reprinting of this paper in Fisher (1974). “Spaces of statistical parameters,” included Type-III or
THE EPIC STORY OF MAXIMUM LIKELIHOOD 609

gamma densities as one example, must also have been States, Daniel Dugué (1937) in France, and then Harald
written by Hotelling at Rothamsted. On 27 December Cramér (1946, 1946a) writing during the war in isola-
1929, a summary of the paper was read by Oystein Ore tion in Sweden (having read Fisher, Doob, Dugué, but
in Hotelling’s absence to the Annual Meeting of the apparently not Wald). Both Doob and Wald had strong
American Mathematical Society in Bethlehem, Penn- connections with Hotelling; both pursued their stud-
sylvania. Only an abstract was ever published, but the ies of this topic on Carnegie Fellowships working with
summary Ore read survives (part of a thick folder of Hotelling at Columbia, Doob in 1934–1935, and Wald
other, later notes by Hotelling), and it is printed here, in 1938–1939. Doob left for the University of Illinois
together with the abstract (Hotelling, 1930a), as Ap- in 1935, but Wald stayed on at Columbia, replacing
pendix 3. Hotelling in 1939–1940 while Hotelling was on leave,
and again permanently when Hotelling moved to North
11. THE SITUATION TO 1950 Carolina in 1946.
Of these writers, Doob and Dugué fell into new diffi-
In all, Fisher gave three proofs of the optimality of culties (as Hotelling had in his 1930 paper); Doob was
maximum likelihood. The first, in 1922, was based gently corrected by Wald, and Dugué’s slip was appar-
upon the erroneous belief that maximum likelihood ently first noticed a decade later, in the mid-1940s by
estimates were always sufficient statistics, and it de- Edith Mourier, who brought it to Darmois’s attention.
pended upon treating approximately normally distrib- The Wald and Cramér treatments were the most satis-
uted random variables as if they were in fact normally factory; both raised the level of rigor to new heights,
distributed. The second proof, in 1925, was what I although both suffered from the complexity of the con-
called the ANOVA proof. It too required the same im- ditions assumed and the limitations imposed. Wald was
plicit appeal to regularity by using normality in place already publishing on the theory of estimation by 1939,
of approximate normality, as well as assuming that and his 1943 proof of the asymptotic sufficiency of
the asymptotic variances of the estimates were approx- the maximum likelihood estimates can be seen as a
imately constant, and that the likelihood was suffi- form of the completion of Fisher’s 1922 proof. Cramér
ciently regular to permit the evaluation and manipu- also was firmly based on Fisher; indeed his develop-
lation of various integrals. The third, in 1930 in cor- ment followed the structure of Fisher’s work closely,
respondence (and later in print in 1935 in a version but with rigorous demonstrations and explicit state-
restated in terms of estimating functions that lost the ments of conditions. Much of what Cramér presented
geometric origin), placed more severe restrictions upon might be seen as a realization of the book Fisher and
the distributions (assumed multinomial) and estimates Hotelling might have written, albeit without the geom-
(smoothly differentiable functions of the relative fre- etry.
quencies only, not varying with sample size), but it While this reaction to Fisher’s theory (namely that
yielded a more satisfactory proof. Even if not all de- it was not true, or at least not proven as stated) pro-
tails were filled in, that task was fairly easy for the lim- gressed in some quarters, another appeared, namely
ited setting considered. Indeed, the third proof was im- claims that the theory was not new. In this respect
mune to the ugly little facts that Hotelling hinted at in the reactions were like those in the seventeenth cen-
1930 and Hodges produced explicitly in 1951, but at tury to William Harvey’s 1628 demonstration of the
a cost in generality. Still, multinomial distributions are circulation of blood, where denials of the truth of the
as general as one could hope for in the discrete case, claimed phenomenon coexisted with priority claims
and the intuition developed from the geometric setting on behalf of Hippocrates, circa 400 BC (Stigler, 1999,
of that third proof provided at least superficial promise pages 207ff). Karl Pearson to his death and some others
that the result held much more generally, for continu- in his camp considered Fisher’s maximum likelihood
ous parametric families. simply the Gaussian method, warmed over and served
Over the next few years, several mathematicians again without overt reference to any Bayesian under-
recognized the unsatisfactory extent of rigorous sup- pinnings. That can be attributed to a lack of under-
port given for such a broad theory and tried their standing of what Fisher was accomplishing, a phenom-
hands at filling in the gaps Fisher had knowingly leapt enon that afflicted even such first-class older statisti-
over as well as some he had not even recognized. cians as G. Udny Yule. Yule’s otherwise excellent 1911
The major early efforts were by Joseph Doob (1934, textbook was frequently revised but never made more
1936) and Abraham Wald (1943, 1949) in the United than the most superficial reference to Fisher (other than
610 S. M. STIGLER

to his correction to Pearson on degrees of freedom), is not enough original material to justify publication
even to the 10th edition of 1936 (Yule, 1936). Another, as a book, and too much that is really trivial” (Fisher,
more recent claimant’s name was added to Gauss’s in 1938–1939). In June 1951, Neyman also wrote to the
1935 when Arthur Bowley, in moving a vote of thanks editor of the Journal of the American Statistical As-
for Fisher (1935), called attention to work by Edge- sociation, W. Allen Wallis, unsuccessfully requesting
worth in 1908–1909 that bore at least superficial simi- that the Journal reprint Edgeworth’s 1908–1909 papers
larity to some of Fisher’s work, namely the information (letter in Neyman papers, Bancroft Library).
inequality of the second of the three proofs. But what of the basic question, did Edgeworth pre-
Bowley clearly had only a dim understanding of this cede Fisher and did he in any way influence him if he
work of Fisher’s, and his remarks were mild compared did? My own view, which is in general accord with
to those 15 years later by Jerzy Neyman. Neyman’s dif- the conclusions Jimmie Savage (1976, pages 447–448)
ficulties with Fisher began in 1934 and involved both and particularly John Pratt (1976) came to from a de-
scientific and personal issues, in what would become a tailed study of both Edgeworth and Fisher, is that while
long-running feud. Generally the dispute simmered at there was indeed merit to Edgeworth’s work on this,
a low level: Fisher would, after the initial split, mostly there was no merit to the 1951 accusation of “an un-
ignore Neyman except for occasional barbs (usually justified claim of priority.” In the course of a long, ob-
veiled, without mentioning Neyman by name), and scure and rambling series of papers emphasizing the
Neyman would generally downplay the importance and use of inverse probability in estimation, Edgeworth
originality of Fisher’s work, rising on occasion for a did include a treatment of what he called “the direct
more detailed published blast (Zabell, 1992; Kruskal, method free from the speculative character which at-
1980). taches to inverse probability.” He made what can in
In 1937 Neyman had been content to attribute the retrospect be best interpreted as a statement that max-
simple idea of maximum likelihood to Karl Pearson, imizing the likelihood within a very restricted class
citing Pearson’s derivation of the product moment es- of estimates (basically M-estimates for location para-
timate of the normal correlation coefficient as the meters) gives the estimate with smallest standard de-
“most probable” value, using the Gaussian method viation. The proof he offered (suggested by Professor
Pearson later abandoned (Neyman, 1937, page 345; A. E. H. Love, an expert on the calculus of variations)
1938, pages 132, 136; Pearson, 1896, pages 262–265). was based explicitly upon Schwarz’s inequality, and
But in 1951 Neyman’s focus on Fisher reached a peak, bore no resemblance to any Fisher gave.
and he latched on to the claim of priority for Edgeworth There is no indication that this work of Edge-
and deployed it as a rhetorical weapon in the feud. In a worth’s ever had any influence upon Fisher or any
review of the collection of papers (Fisher, 1950), Ney- other worker on this topic. And the obscurity of the
man resurrected Bowley’s discovery, accusing Fisher prose—uncommonly dense, even by Edgeworthian
of “an unjustified claim of priority” with respect to “the standards—is such that it is hard to believe the re-
so-called property of efficiency of the maximum likeli- sult would have been recognized there by any con-
hood estimates” (Neyman, 1951). What is more, Ney- temporary reader other than Edgeworth himself. Even
man wrote, “Actually, the proofs of the efficiency of at a later time, its recognition required a reader with
maximum likelihood estimates offered both by Edge- Fisher’s work in hand and either extensive experi-
worth and by Fisher are inaccurate, and the assertion, ence with Edgeworth or a strong historical or per-
taken at its full generality, is false.” This comes close sonal motive. Bowley had studied Edgeworth’s work
to being an accusation of a false claim of priority for and mode of expression thoroughly in preparing an
a false discovery of an untrue fact, which would be a extended commemorative summary in 1928. Neyman
rare triple-negative in the history of intellectual prop- had both historical and personal motives, as well as
erty disputes. Savage (1976) wrote of Fisher with this Bowley’s 1935 prompt. Even today anyone who tries
review in mind, “nor did he always emerge as the to learn what Edgeworth accomplished from Bowley’s
undisputed champion in bad manners.” On the other 1928 summary (Bowley, 1928, pages 26–28) would
side, in 1938 Fisher had reviewed Neyman’s influential emerge completely at sea, no matter how long the text
Lectures and Conferences on Mathematical Statistics is puzzled over. This is not to deny that when one
(1938), a book which had only a few grudging refer- has dug through the thicket of the 1908–1909 origi-
ences to Fisher. Fisher’s review consisted of only two nal, there is a limited result and a hint of understanding
sentences, the first innocuous and the second, “There that went beyond the limited result. Edgeworth was
THE EPIC STORY OF MAXIMUM LIKELIHOOD 611

a statistical scientist with an uncommonly subtle and these variances? If so, how can such an estimate be ob-
deep mind (Stigler, 1986, Chapter 9; 1999, Chapter 5), tained in general? But Edgeworth would have been far
and his work here is further evidence of that. But, ahead of his time had he asked them.” Fisher would
for all that, the work stands as an independent partial grant Edgeworth “an inkling,” but no more. Some
anticipation—a hint, not an instance, of what was to might see more in Edgeworth than Fisher did, but they
come. do so from a different historical perspective. I believe
Edgeworth died in 1926 without ever commenting Fisher owed no intellectual debt to Edgeworth on this
on Fisher, and Fisher, as was his wont, dug in his heels issue, and it was his own loss. Had he taken the time
and refused to seriously engage the issue in print. His and trouble to learn from Edgeworth’s insight, he might
frankest private statements were in two letters. The have gone even further. Savage (1976) proffered as ex-
first was a 12 February 1940 letter to Maurice Fréchet, planations for this neglect, that Fisher initially thought
where he described Edgeworth’s statement as confus- Edgeworth’s premises ridiculous, and later “because it
ingly linked to inverse probability, even though the is hard to seek diligently after the unwelcome.”
mathematics could be dissociated from that approach. Neyman’s was not the only review to raise priority
In that letter Fisher summed up his view in these words: issues about Fisher’s work. In a tendentious 1930 re-
“The confusion of associating this method with Bayes’ view of the 3rd edition of Fisher’s Statistical Methods
theorem seems to have been originally due to Gauss, for Research Workers, Charles Grove seemed to claim
who certainly recognized its merits as a method of es- that all in Fisher was to be found earlier in Scandina-
timation, though I do not know whether he proved any- vian work by Thiele, Gram or Charlier. Grove (1930)
thing definite about it. I do not know of any explicit did not focus on maximum likelihood, which he evi-
statement of the properties, consistency, efficiency and dently thought was unsupported, but put forth instead
sufficiency, which may characterize estimates prior to the claim that Thiele had in 1889 anticipated Fisher
my 1922 paper” (Bennett, 1990, page 125). The second on small sample inference and particularly on estimat-
letter, dated 2 July 1951, was to a Californian, Horace ing cumulants with k-statistics, and Gram had done so
Gray, who had spent time with Fisher in London 1935– on the use of orthogonal polynomials in regression.
1936, and he had written to Fisher to call attention to Fisher replied in the same publication, and more col-
Neyman’s review. Fisher replied, orfully in a private letter to Grove’s colleague Arne
“Neyman is, judging from my own experi- Fisher (a Dane who seems to have been the instigator of
ence, a malicious mischief-maker. . . . Edge- Grove’s review). Fisher stated that Thiele “had no more
worth’s paper of 1908 has, of course, been glimmer than [Karl] Pearson of some of the ideas we
long familiar to me, and to other English now use” (Grove, 1930; Fisher, 1931; Bennett, 1990,
statisticians. No one could now read it with- page 313). A scrupulous recent translation of Thiele
out realizing that the author was profoundly from the Danish (Lauritzen, 2002) with accompanying
confused. I should say, for my own part, that commentary allows a better assessment of his excellent
he certainly had an inkling of what I later work, which, however, did not include contributions to
demonstrated. The view that, in any proper maximum likelihood estimation.
sense, he anticipated me is made difficult
by a number of verifiable facts” (Bennett, 12. DOUBTS ABOUT MAXIMUM LIKELIHOOD
1990, pages 138–139). The possibility that maximum likelihood estimates
The facts Fisher listed were that (i) Edgeworth based could actually perform badly, or that they might be dra-
his investigation on inverse probability, (ii) he limited matically improved upon by another method, seems to
attention to location parameters, and (iii) the formula have not been raised prior to Hotelling’s probing let-
they shared in common, for the variance of efficient es- ters to Fisher of November 15 and December 12, 1930.
timates, had been drawn from Pearson and Filon with Kirstine Smith and Karl Pearson had questioned the
no notice given to the major errors in that work. Fisher relative merits of the “Gaussian method” versus min-
noted that since by 1903 Sheppard’s works had shown imum chi-square in 1916, but any difference there was
that moment estimates had variances different from minor; both were later seen to be asymptotically effi-
those given by Pearson and Filon, this to Fisher raised cient estimates. For the most part, the early reservations
the questions: “Had Pearson and Filon’s variances any about Fisher’s maximum likelihood centered on ques-
validity at all? Does any class of estimate actually have tions of priority (was he preceded? was anything really
612 S. M. STIGLER

new in the method?) and issues of practical usefulness j = 1, . . . , n, in which case the maximum likelihood
(were the calculations too hard relative to the method estimate of σ 2 consistently estimates half the correct
of moments?). As maximum likelihood became more value (Neyman and Scott, 1948).
widely adopted in the 1930s, the increased attention In June of 1951, just as Jerzy Neyman’s review
to proofs of its effectiveness (could a rigorous general of Fisher’s Collected Papers appeared, the Berkeley
demonstration be devised?) led inevitably to questions Statistical Laboratory convened for the summer un-
of when it might break down. The earliest explicit ex- der Neyman’s general direction. One of three research
ample is perhaps due to Abraham Wald, in correspon- groups took as its charge “a complex of questions aris-
dence with Jerzy Neyman in 1938. ing from considerations of superefficiency and identi-
Wald immigrated to the United States from Vi- fiability.” The group concentrating on this topic was
enna in Spring 1938, when shortly after Hitler’s an- comprised of Joseph L. Hodges, Jr., Lucien Le Cam
nexation of Austria he accepted an offer to join the and Agnes Berger. It was presumably shortly before
Cowles Commission for Research in Economics, then that time that Hodges, then an Assistant Professor at
located in Colorado Springs. He remained with Cowles Berkeley, constructed his example; in any event the
through the summer before joining Harold Hotelling study was soon sufficiently advanced that a session
at Columbia University in Fall 1938. On Septem- on the topic “Efficiency and superefficiency of esti-
ber 20, 1938, a week before he left for Columbia, Wald mates” was arranged by Neyman to be held on Sat-
wrote to Neyman sending a promised manuscript on urday, December 29, 1951, at the Boston meeting
the Markov inequality, but also describing a different of the Institute of Mathematical Statistics. Four talks
problem he had encountered. The problem described were presented in that session, by Jerzy Neyman (“On
was a slight generalization of one he would treat in the problem of asymptotic efficiency of estimates”),
Wald (1940), namely estimating a straight line when Joe Hodges (“Local superefficiency”), Lucien Le Cam
n points on the line are observed but both coordinates (“On sets of parameter points where it is possible
are subject to independent errors. However, Wald’s let- to achieve superefficiency of estimates”) and Joseph
ter to Neyman contained a statement that he omitted Berkson (“Relative precision of least squares and max-
from the 1940 article: “I have shown that the method imum likelihood estimates of regression coefficients”)
of maximum likelihood leads to false estimations of (Biometrics, 1951; Littauer and Mode, 1952). Neither
the parameters. . . . (i.e., leads to statistics of which the
Neyman’s nor Hodges’s talks were ever published; Le
stochastic limits are unequal to the values of the re-
Cam’s was developed into his Ph.D. dissertation and
spective parameters to be estimated). Hence the max-
published in 1953. That publication (Le Cam, 1953) in-
imum likelihood method cannot be applied” (Neyman
cluded Hodges’s example (credited to Hodges), and Le
Papers, Box 14, Folder 28). Wald stated that he had
Cam proved among other things that while supereffi-
solved this general estimation problem for the case of
ciency was clearly possible, the set of parameter points
independent normally distributed errors with possibly
where it could be achieved had Lebesgue measure zero.
unequal variances.4
In the decade that followed, a number of other ex-
Neyman replied on September 23 that he was quite
interested in the new problem, “the more so as it is amples were discovered or devised. Of these, the least
rather close to what I am trying to do myself.” Ten contrived was the problem of estimation for the five-
years later Neyman and Elizabeth Scott published, with parameter mixture of two normal distributions, where
a general citation to Wald (1940), a simplified version the likelihood function explodes to infinity when either
of Wald’s example as one of several with increasing mean parameter equals any observation. This and sev-
numbers of parameters where maximum likelihood es- eral other examples, including an important one by Ba-
timates are inconsistent. That version, in which the hadur, are reviewed in Le Cam (1990) and Cox (2006,
straight line is y = x and the two coordinates’ er- Chapter 7). Le Cam speculates that the normal mixture
ror variances are equal, has come to be known as the example (known in the folklore of the 1950s but appar-
Neyman–Scott example. It is usually expressed as fol- ently not published then) was due to Jack Kiefer and
lows: Xij are independent N(μj , σ 2 ), for i = 1, 2, and Jacob Wolfowitz; Cox (2006, pages 134–135) consid-
ers it to some extent pathological.
4 If the observed points’ means are modeled as a random sam- These early examples created a flurry of excitement
ple, the parameters do not grow in number with the sample size but are for the most part not seen today as debilitating
and their maximum likelihood estimates are consistent under mild to the theory. Hodges’s example made a substantial im-
conditions; see Kiefer and Wolfowitz (1956). pact when it first became known, but it has, ever since
THE EPIC STORY OF MAXIMUM LIKELIHOOD 613

Le Cam’s dissertation, come to be seen as an ingenious method of moment estimates that would be thought
but minor technical achievement. Hodges (see Figure 1 woefully inefficient by the Fisher generation. Perhaps
above) showed you could improve locally on maxi- Gauss’s use of a uniform prior, which rendered his so-
mum likelihood, basically by shrinking the estimate to- lution susceptible to change by nonlinear transforma-
ward zero, and as such it might also be viewed as an tions of the parameters, would be considered an error.
early hint of the 1955 shrinkage estimates of Charles Certainly Pearson and Filon erred in their promiscuous
Stein that in multiparameter problems can improve uni- use of a naïve passage to a limit in ways where it gave
formly on maximum likelihood. But Hodges’s example wrong answers (Stigler, 2007). And certainly Fisher’s
itself was for finite samples inferior to maximum likeli- 1921 assumption that sufficient statistics always exist
hood for parameter values not near zero, and it was not was an error, and Hotelling’s 1930 proof of the consis-
long seen as a serious threat. The Wald–Neyman–Scott tency and asymptotic normality of the maximum like-
example was of more practical import, and still serves lihood estimate cannot be counted correct for the gen-
as a warning of what might occur in modern highly erality claimed.
parameterized problems, where the information in the There are other errors I have not discussed. When
data may be spread too thinly to achieve asymptotic Lambert (1760) in a sketchy presentation gave only one
consistency. The normal mixture example remains cer- example, he got what was arguably the wrong answer
tainly of at least computational importance, as showing there. Lambert’s only specific result was for n = 2,
how in complex settings it may be necessary to seek lo- claiming the sample mean in that case always gave the
cal maxima or to constrain the parameter space. Fisher most probable result, a claim that would fail for the
never commented on any of these examples. Cauchy density. See Stigler (1999, Chapter 16). And
There continued over this period to be a number of there were later smaller and subtler lapses in rigor in
attempts to complete the theory, to give a rigorous de- attempts by Doob and by Dugué to themselves correct
scription of conditions that approached necessary and some of Fisher’s oversights. But I do not mean at all to
sufficient, conditions describing situations in which suggest these pioneers had feet of clay. To the contrary.
maximum likelihood would not mislead. As work on Without Lagrange’s error he might not have found the
the topic became more refined and more correct, the Laplace transform at that early date. Without Pearson
intrinsic difficulties of the topic also became more ap- and Filon, Fisher might not have started down the road
parent. The lists of conditions needed to prove opti- he did. Without Fisher’s 1921 mistaken jump to a con-
mality by Wald and Cramér were already unwieldy and clusion, he might not have rushed to complete his the-
the basic logic of the solutions retreated from sight; in- ory, which even flawed and incomplete, was instrumen-
deed one problem was that achieving rigor sometimes tal in launching twentieth century theoretical statistics.
led to the exclusion of basic examples, such as the es- Great explorations in uncharted territory seem to re-
timation of the normal standard deviation, as in Wald quire great boldness, and even mischance can lead to
(1943). The consequences can still be seen today, in the major advance.
best textbook treatments, such as those by Bickel and
Doksum (2001) and by van der Vaart (1998), where 14. CONCLUSION
the elegance of the exposition comes from strategically
Despite all these difficulties, maximum likelihood
restricting the range of the coverage. Bahadur (1964)
remains one of the most used and useful techniques
gave a succinct and elegant theorem that builds upon
of modern statistics. How can that be, in the face of
work of Le Cam, but only treated a one-dimensional
the nasty little facts uncovered by the 1950s? For one
parameter and was restricted to estimates that are as-
thing, there is solid mathematical support in a wide
ymptotically normal with variances that are continuous
class of problems. Fisher’s proofs can all be defended
in the parameter.
as correct, at least if one accepts as given the regularity
conditions and assumptions that were clearly implicit,
13. OF ERRORS IN THEORY
including the limitation in 1922 to sufficient estimates,
At many junctures in this story we have encountered and in 1925 to score functions linearly approximable
what might be judged theoretical errors committed by by maximum likelihood estimates. Of course that de-
the workers involved. Perhaps Lagrange, by ignoring fense flirts with tautology: any statement is true if all
the curve his probabilities followed until the final stage, the conditions required for its truth are assumed; even
could be judged in error; it certainly left him with the Pearson–Filon derivation of the “probable errors of
614 S. M. STIGLER

frequency constants” might be so defended. But there theory might arise. Hostility bred uncivil discourse; it
is a big difference between the two cases. Fisher’s also led to principled focus.
implicit assumptions are in part fairly clear (smooth Yet despite these problems, time and again maxi-
differentiability, consistent estimates, e.g.) and were mum likelihood has proved useful even in situations
clearly evident to Fisher himself; if he made a false where no general theorem could be found to defend its
application of the theory, it is not known to me. On use. Perhaps as Fisher’s powerful geometric intuition
the other hand, with Pearson–Filon the case was dif- may have foreseen, the scope of useful application of
ferent, as the inappropriate applications in the same maximum likelihood exceeds that of any reasonably
paper make clear. Nonetheless, the intense mathemat- achievable proof, even though this comes at the po-
ical investigations after 1938 and particularly in the tential cost of inadvertently blundering into a region
1950s revealed potential problems Fisher had not con- of inapplicability. We now understand the limitations
sidered, with increasing numbers of parameters, un- of maximum likelihood better than Fisher did, but far
bounded likelihood functions and the possibility of lo- from well enough to guarantee safety in its application
cal improvement over maximum likelihood. Fisher was in complex situations where it is most needed. Maxi-
surely aware of some of these problems, at least when mum likelihood remains a truly beautiful theory, even
they were published, if not before. The first of them though tragedy may lurk around a corner.
he might have countered by noting that in such situa-
tions the amount of information in the data (the mea-
APPENDIX 1: FISHER’S DECEMBER 1928 DRAFT
surement of which was one of his pioneering advances)
TABLE OF CONTENTS [HOTELLING PAPERS,
was being spread over a space of dimension increasing
BOX 3]
in proportion with the sample size, and so of course
problems with consistency could be expected. But he I. Distributions
did not. The third possibility, local improvement, had Varieties and variables
been brought early to his attention by Hotelling, but Types of distribution
here too Fisher remained silent, as he did in the face of
other examples as well. An explanation for this silence (a) Discontinuous, step like integrals
might, ironically, have been given by Fisher himself in (b) Continuous, differentiable integrals
a 14 January 1933 letter to Egon Pearson, commiserat- (c) General type, integral not differentiable but
ing with him on the difficulties he faced with his father, frequency not confined to zero measure
Karl: “Many original men are for that reason unrecep- Specification by moments
tive, and this is a fault which age does nothing to cure”  itx
Characteristic
 itx
function, e f (x) dx or
(Fisher papers). e dF (x)
Personalities played a role in this development. Its logarithm, cumulative property
Fisher’s hostilities with Neyman surely increased his
Cumulative moment functions or seminvariants
stubborn resistance to public discussion of areas where
[sic]
questions remained, and they surely contributed to the
Illustrative cases, uniqueness of normal distribu-
zeal with which Neyman pursued the discovery and
tion, multinomial and multiple Poisson
public discussion of such problems. The latter might
II. Distributions derived from normal
be viewed as a benefit of the feud: when peace reigned
χ 2 distribution is that of S1n (xp2 ) when xp is dis-
in the early 1930s and the only attention to the prob-
tributed with unit variance about zero
lem was by Fisher, Hotelling, and those Hotelling in- n
spired to work on this (Doob, Wald) or a noncombatant Transformation of ξq = p=1 cpq xp ,
n n
p=1 cpq = 1, p=1 cpq cpq = 0
2 
(Dugué), the problems in the proofs and the limitations
of the theory were not on public view. Indeed, there Application of χ 2 to frequencies
has been no published criticism that clearly identified Distribution of t = χnx̄2 [sic]; application to regres-
the source of errors in the proofs of Hotelling, Doob or n χ2
Dugué even to the present day; either the early works sion coefficients; of z = 12 log n2 χ12 .
1 2
were ignored, were merely cited, or referred to with III. Distribution of correlation coefficient, partial cor-
a polite allusion such as to the proofs being “not rig- relation, multiple correlation. Hyperspace treat-
orous” (e.g., Doob, 1934; Le Cam, 1953). The reader ment
got no sense of where and how real problems with the IV. Moment estimates of seminvariants
THE EPIC STORY OF MAXIMUM LIKELIHOOD 615

Simple and multiple distribution of such estimates Any values ∂φ


∂x are admissible subject to this condi-
{paper in Lond. Math. Soc. They will not publish tion for consistency, we may therefore minimize the
for a year if then} expression for the variance subject to this condition and
Combinatorial method obtain equations of the form
V. Theory of estimation ∂θ f1  ∂φ ∂f1
(Much as already done but more about Sufficient f1 (θ ) − f =λ
∂x1 n ∂x ∂θ
Statistics)
Method of maximum likelihood Now if φ is homogeneous in x of zero degree,
 ∂φ
Bayes’ theorem. Inverse probability and likeli- f ∂x = 0, hence, for all classes
hood. Illustrate by inefficiency of moments with ∂φ λ ∂f
Pearsonian curves. = or
∂x f ∂θ
VI. Experimental design (not so agricultural as in Sta-  2
tistical Methods for Research Workers), more use ∂φ 1 ∂f  1 ∂f
=
of amount of information ∂x f ∂θ f ∂θ
VII. Statistical mechanics; argument put in a clear The criterion of consistency thus fixes the value of T at
light without taking x! as a continuous function all points on the expectation line, while the criterion of
when x is small! NOT YET DONE! efficiency in conjunction with it fixes the direction in
Analogous biological problems. Rather worth do- which the equistatistical surface cuts that line. All sta-
ing though. tistics which are both consistent and efficient thus have
Fowler has now done a great deal, but still the surfaces which touch on that line. The surface for Max-
method of steepest descent seems very indirect, imum Likelihood has the plane surface of this type.
and obviously this limits statistical argument.
APPENDIX 3: HOTELLING ON
APPENDIX 2: FISHER’S ENCLOSURE A FROM PARAMETER SPACES
THE NOV. 28, 1930 LETTER TO HOTELLING. THE
Hotelling briefly attended the American Mathemat-
LETTERS WERE TYPED BUT THE FORMULAS ical Society’s Annual Meeting in Bethlehem, Penn-
WERE WRITTEN IN BY HAND, AND THE sylvania, December 26–29, 1929, but he left before
APPARENT TYPOGRAPHICAL ERROR IN THE this paper was scheduled to be read on December 27.
∂θ
FOURTH FORMULA FROM THE BOTTOM ( ∂X The paper was read in his absence by Professor Oys-
1
∂φ
FOR ∂X ) IS AS WRITTEN IN THE ORIGINAL tein Ore of Yale University, and Ore subsequently re-
1
[HOTELLING PAPERS, BOX 45] turned the manuscript to Hotelling; only the abstract
was ever published (Hotelling, 1930a). Meanwhile,
Expectation line x = f (θ ) Hotelling traveled on to the AMS Regular Meeting De-
Equistatistical surface (or region) T = φ(x1 , . . . , xs ) cember 30–31 in Des Moines, Iowa, where on Decem-
 ∂φ ber 31 he read his paper “The consistency and ultimate
x =0 if φ is homogeneous of zero degree. distribution of optimum statistics” (Hotelling, 1930).
∂x
The summary that follows is the entire manuscript as
For consistency θ = φ(f1 , . . . , fs ) read by Ore, from the Hotelling Papers at Columbia
For large samples, provided there is no bias of order University (Box 44).
as high as n−1/2 ,
 2 SPACES OF STATISTICAL PARAMETERS
 ∂φ
V (T ) = Mean δx By Harold Hotelling, Stanford University.
∂x
  b

∂φ
2   ff  ∂φ ∂φ [Abstract]
= f 1− − For a space of n dimensions representing the para-
x ∂x n ∂x ∂x 
meters p1 , . . . , pn of a frequency distribution, a statisti-
for multinomial, where differentials refer to the expec- cally significant metric is defined by means of the vari-
tation point. ances and co-variances of efficient estimates of these
Differentiating the condition of consistency, dθ = parameters. Such a space, for the ordinary types of dis-
 ∂f  ∂φ ∂f
( ∂φ ∂x ∂θ ) dθ , or ∂x ∂θ = 1 tributions, is always curved. For the two parameters of
616 S. M. STIGLER

the normal law the manifold may be represented in part statistical conclusions. It should be said at once that
as a surface of revolution of negative curvature, with a these spaces are not flat, but are curved in a manner
sharp circular edge. On this surface variation of the dis- depending on the initial population distributions.
persion is represented by moving along a generator. For Problems of “random migration” by short leaps in
a Pearson Type III curve [i.e. gamma distributions] of the k-space occur in various biological problems, when
any given shape the same surface occurs. For the unre- evolution is supposed to take place by small mutations.
stricted Type III curve there are three parameters; their Such problems occur also in experimental work, as in
space is investigated. Certain metrical properties which the dilution method of counting soil bacteria developed
hold in general for spaces of statistical parameters are by Cutter at Rothamsted. These problems, for short
given. steps, are equivalent to problems regarding heat con-
duction and geodesics in the curved space.
SUMMARY OF “SPACES OF If we are considering an initial distribution curve of
STATISTICAL PARAMETERS” any fixed shape, we have two parameters to estimate,
A “population” is specified by a function giving the location and scale of the curve, for exam-
ple the mean and standard deviation of a normal er-
f (x, p1 , . . . , pk ) ror curve. Our k-space is in such cases a surface of
such that f dx is the probability of an observation constant negative curvature. Representing the normal
falling in the range dx. In statistical theory we have curve by means of a pseudosphere, variation of the
given observations x1 , . . . , xN and wish to estimate the standard deviation is represented by motion along a
values of the parameters p1 , . . . , pk . There is an infin- generator, variations of the mean by rotation about the
ity of possible methods of making these estimates; but axis. A greater variance means closer propinquity to
one possessing certain peculiarly valuable properties is the axis.
that of maximum likelihood. The likelihood is defined Since a geodesic on a pseudosphere between two
as points on the same meridian comes closer to the axis

N than the meridian, we have an interesting biological


f (xi , p1 , . . . , pk ). conclusion. If we have two related species having about
i=1 the same variance but a difference in means, the most
Denote its logarithm by L. Let p̂1 , . . . , p̂k be the values likely common ancestors had a greater variance than
maximizing L. They have been called optimum statis- either existing species.
tics, or optimum estimates of the parameters, by R. A. For a Pearson Type III curve the measures of posi-
Fisher. The errors of estimate p̂α − pα derived from tion and scale vary, not along geodesic but along loxo-
samples of N have a distribution which for large val- dromes.
ues of N approaches the normal form Spaces of statistical parameters lend themselves to
1 the treatment of a wide range of problems in which dis-
Ke− 2 T d p̂1 , . . . , d p̂k , crepancies between hypothesis and observation which
where involve two or more observations are to be tested. Thus
 if the hypothesis to be tested is that a species, in which
T= gαβ (p̂α − pα )(p̂β − pβ ). the frequency distribution of some dimension has the
Here gαβ is the mathematical expectation of normal form, has arisen by a succession of small mu-
tations from another, and if we consider the difference
∂ 2L of the variances along with that of the means, we are
,
∂pα ∂pβ led to apply the distribution of χ 2 for n = 2, just as in
and is a covariant tensor of second order under transfor- judging marksmanship we may combine vertical with
mations pα = φα (p1 , . . . , pk )—though of course the horizontal deviations of a shot from the center of the
second derivative is not itself a tensor. target. But the fact that the surface, on which the mean
This tensor property suggests that and variance are coordinates, is a pseudosphere instead
of a plane, shows that a correction must be applied
gαβ dpα dpβ to the probability of a greater deviation as calculated
be taken as distance element in a space of coordinates from χ 2 . Indeed, the area or circumference of a geo-
p1 , . . . , pk . Indeed a considerable amount of differen- desic circle is greater than for one of the same radius
tial geometry carries over immediately to give novel on a plane. The excess of area measures the correction
THE EPIC STORY OF MAXIMUM LIKELIHOOD 617

which must be applied to obtain the true probability of and for an elegant and insightful modern develop-
a greater discrepancy. ment of Fisher’s geometric approach with historical
If about a point on the pseudosphere representing references, see Kass and Vos (1997). On Hotelling’s
any population we describe a geodesic circle, the points life and work see Arrow and Lehmann (2005), Dar-
on the circumference represent statistics, such as mean nell (1988), Hotelling (1990) and Smith (1978); on
and variance, which might with equal likelihood have Fisher’s life see Box (1978); on Pearson’s life see
been obtained in a sample from this population. And Porter (2004). The Hotelling–Fisher correspondence is
inversely, if corresponding to a given sample, we fix housed in the Hotelling Collection at Columbia Uni-
upon a point as center of a geodesic circle, the points versity (Rare Book and Manuscript Library) and the
on the circumference represent populations which, on Fisher Papers at the University of Adelaide; each has a
the evidence of this sample, are all equally likely. more complete set of the received letters than the sent
letters. Neyman’s papers are at the Bancroft Library of
ACKNOWLEDGMENTS the University of California Berkeley.

I am grateful to Henry Bennett for permission to con-


sult and quote from the Fisher Papers in Adelaide, to A LDRICH , J. (1997). R. A. Fisher and the making of maximum
likelihood 1912–1922. Statist. Sci. 12 162–176. MR1617519
Michael Ryan for permission to quote from the Harold A RROW, K. J. and L EHMANN , E. L. (2005). Harold Hotelling
Hotelling Papers, Rare Book and Manuscript Library, 1895–1973. Biographical Memoirs of the National Academy of
Columbia University, and to Susan Snyder for per- Sciences 87 3–15.
mission to quote from the Jerzy Neyman Papers (Call BAHADUR , R. R. (1964). On Fisher’s bound for asymptotic vari-
Number BANC MSS 84/30C, Box 14, Folder 28), ances. Ann. Math. Statist. 35 1545–1552. MR0166867
BAHADUR , R. R. (1983). Hodges superefficiency. In Encyclopedia
Bancroft Library, University of California Berkeley. of Statistical Sciences (S. Kotz and N. L. Johnson, eds.) 3 645–
For comments during the course of this investigation, 646.
I thank Peter Bickel, Larry Brown, Bernard Bru, David B ENNETT, J. H., ED . (1990). Statistical Inference and Analysis:
Cox, Persi Diaconis, Anthony Edwards, Brad Efron, Selected Correspondence of R. A. Fisher. Clarendon Press, Ox-
Tim Gregoire, Marc Hallin, Lucien Le Cam, Erich ford. MR1076366
B ERNOULLI , D. (1769). Dijudicatio maxime probabilis plurium
Lehmann, Peter McCullagh, Edith Mourier, Ingram observationum discrepantium atque verisimillima inductio inde
Olkin. The paper is based upon material presented as formanda. Manuscript; Bernoulli MSS f.299–305, University of
the Lucien Le Cam Memorial Lecture at the IMS An- Basel. English translation in Stigler (1997).
nual Meeting in Rio de Janeiro, August 2, 2006. B ERNOULLI , D. (1778). Dijudicatio maxime probabilis plurium
observationum discrepantium atque verisimillima inductio inde
formanda. Acta Academiae Scientiarum Imperialis Petropoli-
REFERENCES tanae for 1777, pars prior 3–23. Reprinted in Bernoulli (1982).
English translation in Kendall (1961) 3–13, reprinted 1970 in
Norden (1972–1973) surveys the literature to 1972,
Pearson, Egon S. and Kendall, M. G. (eds.), Studies in the His-
with an extensive bibliography. Hald (1998) gives a de- tory of Statistics and Probability, pp. 157–167. Charles Griffin,
tailed report in modern notation of Fisher’s published London.
work in statistical inference, as well as work related B ERNOULLI , D. (1982). Die Werke von Daniel Bernoulli.
to maximum likelihood by Pearson and Filon and by Band 2. Analysis. Wahrscheinlichkeitsrechnung. Birkhäuser,
Basel. MR0685593
Edgeworth. Aldrich (1997) and Edwards (1997a) dis- B ICKEL , P. J. and D OKSUM , K. (2001). Mathematical Statistics.
cuss Fisher’s earliest work on maximum likelihood; Basic Ideas and Selected Topics, 2nd ed. 1. Prentice Hall, Upper
I also describe this work from a different perspec- Saddle River, NJ. MR0443141
tive in Stigler (2005), and other aspects of Fisher’s B IOMETRICS (1951). News and Notes. Biometrics 7 449–450.
work in Stigler (1973, 2001, 2007). Edwards (1974), B OWLEY, A. L. (1928). F. Y. Edgeworth’s Contributions to Mathe-
matical Statistics. Royal Statistical Society, London. (Reprinted
Kendall (1961) and Hald (1998, 2007) are among those 1972 by Augustus M. Kelley, Clifton, NJ.)
who describe the predecessors of maximum likeli- B OX , J. F. (1978). R. A. Fisher. The Life of a Scientist. Wiley, New
hood; see Stigler (1999, Chapter 16), for more detail York. MR0500579
on this and further references. Savage (1976), Pratt C OURANT, R. (1936). Differential and Integral Calculus. Norde-
(1976) and Hald (1998, 2007) give accounts of the man, New York.
C OX , D. R. (2006). Principles of Statistical Inference. Cambridge
relationship of Fisher’s work to Edgeworth’s. For the Univ. Press. MR2278763
best appreciation of the role of geometry in Fisher’s C RAMÉR , H. (1946). Mathematical Methods of Statistics. Prince-
work on estimation see Efron’s Wald Lecture (1982), ton Univ. Press. MR0016588
618 S. M. STIGLER

C RAMÉR , H. (1946a). A contribution to the theory of statisti- F ISHER , R. A. (1925). Theory of statistical estimation. Proc. Cam-
cal estimation. Skand. Aktuarietidskr. 29 85–94. Reprinted in bridge Philos. Soc. 22 700–725; reprinted as Paper 42 in Fisher
H. Cramér, Collected Works 2 948–957. Springer, Berlin (1994). (1974).
MR0017505 F ISHER , R. A. (1931). Letter to the Editor. Amer. Math. Monthly
DARNELL , A. C. (1988). Harold Hotelling 1895–1973. Statist. Sci. 38 335–338. MR1522291
3 57–62. MR0959716 F ISHER , R. A. (1935). The logic of inductive inference. J. Roy.
D OOB , J. L. (1934). Probability and statistics. Trans. Amer. Math. Statist. Soc. 98 39–54; reprinted as Paper 124 in Fisher (1974).
Soc. 36 759–775. MR1501765 F ISHER , R. A. (1938). Statistical Theory of Estimation. Univ. Cal-
D OOB , J. L. (1936). Statistical estimation. Trans. Amer. Math. Soc. cutta.
39 410–421. MR1501855 F ISHER , R. A. (1938–1939). Review of “Lectures and Conferences
D UGUÉ , D. (1937). Application des propriétés de la limite au on Mathematical Statistics” by J. Neyman. Science Progress 33
sens du calcul des probabilités a l’étude de diverse questions 577.
d’estimation. J. l’École Polytechnique 3e série (n. 4) 305–373. F ISHER , R. A. (1950). Contributions to Mathematical Statistics.
Wiley, New York.
E DWARDS , A. W. F. (1974). The history of likelihood. Internat.
F ISHER , R. A. (1956). Statistical Methods and Scientific Inference.
Statist. Rev. 42 9–15. MR0353514
Oliver and Boyd, Edinburgh.
E DWARDS , A. W. F. (1997). Three early papers on efficient para-
F ISHER , R. A. (1974). The Collected Papers of R. A. Fisher U. of
metric estimation. Statist. Sci. 12 35–47. MR1466429
Adelaide Press. MR0505093
E DWARDS , A. W. F. (1997a). What did Fisher mean by “in- G ALTON , F. (1908). Memories of my Life. Methuen, London.
verse probability” in 1912–1922? Statist. Sci. 12 177–184. G AUSS , C. F. (1809). Theoria Motus Corporum Coelestium.
MR1617520 Perthes et Besser, Hamburg. Translated, 1857, as Theory of Mo-
E FRON , B. (1975). Defining the curvature of a statistical problem tion of the Heavenly Bodies Moving about the Sun in Conic
(with applications to second order efficiency). Ann. Statist. 3 Sections, trans. C. H. Davis. Little, Brown; Boston. Reprinted,
1189–1242. MR0428531 1963, Dover, New York.
E FRON , B. (1978). The geometry of exponential families. Ann. Sta- G ROVE , C. C. (1930). Review of “Statistical Methods for Research
tist. 6 362–376. MR0471152 Workers.” Amer. Math. Monthly 37 547–550. MR1522136
E FRON , B. (1982). Maximum likelihood and decision theory (The H ALD , A. (1998). A History of Mathematical Statistics from 1750
1981 Wald Memorial Lectures). Ann. Statist. 10 340–356. to 1930. Wiley, New York. MR1619032
MR0653516 H ALD , A. (2007). A History of Parametric Statistical Inference
E FRON , B. (1998). R. A. Fisher in the 21st century (with discus- from Bernoulli to Fisher, 1713 to 1935. Springer, New York.
sion). Statist. Sci. 13 95–122. MR1647499 MR2284212
E FRON , B. and H INKLEY, D. V. (1978). Assessing the accuracy of H INKLEY, D. V. (1980). Theory of statistical estimation: The 1925
the maximum likelihood estimator: Observed versus expected paper. Pp. 85–94 in Fienberg and Hinkley (1980).
Fisher information. Biometrika 65 457–482. MR0521817 H OTELLING , H. (1930). The consistency and ultimate distribu-
F IENBERG , S. E. and H INKLEY, D. V. EDS . (1980). R. A. Fisher: tion of optimum statistics. Trans. Amer. Math. Soc. 32 847–859.
An Appreciation. Springer, New York. MR0578886 MR1501565
F ISHER , R. A. (1912). On an absolute criterion for fitting H OTELLING , H. (1930a). Spaces of statistical parameters (Ab-
frequency curves. Messenger of Mathematics 41 155–160; stract). Bull. Amer. Math. Soc. 36 191.
reprinted as Paper 1 in Fisher (1974); reprinted in Edwards H OTELLING , H. (1951). The impact of R. A. Fisher on statistics.
(1997). J. Amer. Statist. Assoc. 46 35–46.
F ISHER , R. A. (1915). Frequency distribution of the values of the H OTELLING , H. (1990). The Collected Economic Articles of
Harold Hotelling. Springer, New York. MR1030045
correlation coefficient in samples from an indefinitely large pop-
J EFFREYS , H. (1946). An invariant form for the prior probability
ulation. Biometrika 10 507–521; reprinted as Paper 4 in Fisher
in estimation problems. Proc. Roy. Soc. London Ser. A 186 453–
(1974).
461. MR0017504
F ISHER , R. A. (1920). A mathematical examination of the methods
K ASS , R. E. (1989). The geometry of asymptotic inference. Statist.
of determining the accuracy of an observation by the mean error,
Sci. 4 188–219. MR1015274
and by the mean square error. Mon. Notices Roy. Astron. Soc. 80 K ASS , R. E. and VOS , P. W. (1997). Geometrical Foundations of
758–770; reprinted as Paper 12 in Fisher (1974). Asymptotic Inference. Wiley, New York. MR1461540
F ISHER , R. A. (1922). On the mathematical foundations of the- K ENDALL , M. G. (1961). Daniel Bernoulli on maximum likeli-
oretical statistics. Philos. Trans. Roy. Soc. London Ser. A 222 hood. Biometrika 48 1–18. Reprinted in 1970 in Pearson, Egon
309–368; reprinted as Paper 18 in Fisher (1974). S. and Kendall, M. G. (eds.), Studies in the History of Statistics
F ISHER , R. A. (1922a). On the interpretation of χ 2 from contin- and Probability. Charles Griffin, London, pages 155–172.
gency tables, and the calculation of P. J. Roy. Statist. Soc. 85 K IEFER , J. and W OLFOWITZ , J. (1956). Consistency of the max-
87–94; reprinted as Paper 19 in Fisher (1974). imum likelihood estimator in the presence of infinitely many
F ISHER , R. A. (1924). The Influence of Rainfall on the Yield of parameters. Ann. Math. Statist. 27 887–906. MR0086464
Wheat at Rothamsted. Philos. Trans. Roy. Soc. London Ser. B K RUSKAL , W. H. (1980). The significance of Fisher: A review
213 89–142; reprinted as Paper 37 in Fisher (1974). of “R. A. Fisher: The Life of a Scientist” by Joan Fisher Box.
F ISHER , R. A. (1924a). Conditions under which χ 2 measures the J. Amer. Statist. Assoc. 75 1019–1030.
discrepancy between observation and hypothesis. J. Roy. Statist. L AGRANGE , J.-L. (1776). Mémoire sur l’utilité de la méthode de
Soc. 87 442–450; reprinted as Paper 34 in Fisher (1974). prendre le milieu entre les résultats de plusieurs observations;
THE EPIC STORY OF MAXIMUM LIKELIHOOD 619

dans lequel on examine les avantages de cette méthode par le R AO , C. R. (1961). Asymptotic efficiency and limiting informa-
calcul d es probabilités, & ou l’on resoud differens problèmes tion. Proc. Fourth Berkeley Symp. Math. Statist. Probab. 1 531–
relatifs à cette matière. Miscellanea Taurinensia 5 167–232. 546. Univ. California Press, Berkeley. MR0133192
Reprinted in Lagrange (1868) 2 173–236. R AO , C. R. (1962). Efficient estimates and optimum inference pro-
L AGRANGE , J.-L. (1868). Oeuvres de Lagrange, 2. Gauthier- cedures in large samples, with discussion. J. Roy. Statist. Soc.
Villars, Paris. Ser. B 24 46–72. MR0293766
L AMBERT, J. H. (1760). Photometria, sive de Mensura et S AVAGE , L. J. (1976). On rereading R. A. Fisher. Ann. Statist. 4
Gradibus Luminis, Colorum et Umbrae. Detleffsen, Augsburg. 441–500. MR0403889
(French translation 1997, L’Harmattan, Paris; English transla- S HEYNIN , O. B. (1971). J. H. Lambert’s work on probability.
tion 2001, by David L. DiLaura, for The Illuminating Engineer- Archive for History of Exact Sciences 7 244–256. MR1554145
ing Society of North America). S MITH , K. (1916). On the ‘best’ values of the constants in fre-
L APLACE , P. S. (1774). Mémoire sur la probabilité des causes quency distributions. Biometrika 11 262–276.
par les évènemens. Mémoires de mathématique et de physique, S MITH , W. L. (1978). Harold Hotelling 1985–1973. Ann. Statist. 6
presentés à l’Académie Royale des Sciences, par divers sa- 1173–1183. MR0523758
vans, & lû dans ses assemblées 6 621–656. Translated in Stigler S TIGLER , S. M. (1973). Laplace, Fisher, and the discovery of the
(1986a). concept of sufficiency. Biometrika 60 439–445. Reprinted in
L AURITZEN , S. L. (2002). Thiele: Pioneer in Statistics. Oxford 1977 in Kendall, Maurice G. and Robin L. Plackett, eds., Stud-
Univ. Press. MR2055773 ies in the History of Statistics and Probability, Vol. 2. Griffin,
L E C AM , L. (1953). On some asymptotic properties of maximum London, pp. 271–277. MR0326872
likelihood estimates and relates Bayes estimates. University of S TIGLER , S. M. (1986). The History of Statistics: The Measure-
California Publications in Statistics 1 277–330. MR0054913 ment of Uncertainty Before 1900. Harvard Univ. Press, Cam-
L E C AM , L. (1990). Maximum likelihood: An introduction. Inter- bridge, MA. MR0852410
nat. Statist. Rev. 58 153–171 [Previously issued in 1979 by the S TIGLER , S. M. (1986a). Laplace’s 1774 memoir on inverse prob-
Statistics Branch of the Department of Mathematics, University ability. Statist. Sci. 1 359–378. MR0858515
of Maryland, as Lecture Notes No. 18]. S TIGLER , S. M. (1997). Daniel Bernoulli, Leonhard Euler, and
Maximum Likelihood. In Festschrift for Lucien LeCam (D. Pol-
L ITTAUER , S. B. and M ODE , E. B. (1952). Report of the Boston
lard, E. Torgersen and G. Yang, eds.) 345–367. Springer, New
Meeting of the Institute. Ann. Math. Statist. 23 155–159.
York. Extensively revised and reprinted as Chapter 16 of Stigler
N EYMAN , J. (1937). Outline of a theory of statistical estimation
(1999). MR1462957
based upon the classical theory of probability. Phil. Trans. Royal
S TIGLER , S. M. (1999). Statistics on the Table. Harvard Univ.
Soc. London Ser. A 236 333–380.
Press, Cambridge, MA. MR1712969
N EYMAN , J. (1938). Lectures and Conferences on Mathematical
S TIGLER , S. M. (1999a). The Foundations of Statistics at Stanford.
Statistics (edited by W. Edwards Deming). The Graduate School
Amer. Statist. 53 263–266. MR1711551
of the USDA, Washington DC.
S TIGLER , S. M. (2001). Ancillary history. In State of the Art in
N EYMAN , J. and S COTT, E. L. (1948). Consistent estimates based
Probability and Statistics (C. M. de Gunst, C. A. J. Klaassen
on partially consistent observations. Econometrica 16 1–32. and A. W. van der Vaart, eds.). IMS Lecture Notes Monogr. Ser.
MR0025113 36 555–567. IMS, Beachwood, OH. MR1836581
N EYMAN , J. (1951). Review of R. A. Fisher “Contributions to S TIGLER , S. M. (2005). Fisher in 1921. Statist. Sci. 20 32–49.
Mathematical Statistics.” The Scientific Monthly 72 406–408. MR2182986
N ORDEN , R. H. (1972–1973). A survey of maximum likelihood S TIGLER , S. M. (2007). Karl Pearson’s theoretical errors and the
estimation. Internat. Statist. Rev. 40 329–354, 41 39–58. advances they inspired. To appear.
P EARSON , K. (1896). Mathematical contributions to the theory of VAN DER VAART, A. W. (1997). Superefficiency. In Festschrift for
evolution, III: regression, heredity and panmixia. Philos. Trans. Lucien Le Cam (D. Pollard, E. Torgersen and G. L. Yang, eds.)
Roy. Soc. London Ser. A 187 253–318. Reprinted in Karl Pear- 397–410. Springer, New York. MR1462961
son’s Early Statistical Papers, Cambridge: Cambridge Univer- VAN DER VAART, A. W. (1998). Asymptotic Statistics. Cambridge
sity Press, 1956, pp. 113–178. Univ. Press. MR1652247
P EARSON , K. and F ILON , L. N. G. (1898). Mathematical contri- WALD , A. (1940). The fitting of straight lines if both variables are
butions to the theory of evolution IV. On the probable errors of subject to error. Ann. Math. Statist. 11 284–300. [A summary
frequency constants and on the influence of random selection of the main results of this article, as presented in a talk July 6,
on variation and correlation. Philos. Trans. Roy. Soc. London 1939, was published pp. 25–28 in Report of the Fifth Annual
Ser. A 191 229–311. Reprinted in Karl Pearson’s Early Statisti- Research Conference on Economics and Statistics Held at Col-
cal Papers, Cambridge: Cambridge University Press, 1956, pp. orado Springs July 3 to 28, 1939, Cowles Commission, Univer-
179–261. sity of Chicago, 1939.] MR0002739
P ORTER , T. M. (2004). Karl Pearson: The Scientific Life in a Sta- WALD , A. (1943). Tests of statistical hypotheses concerning sev-
tistical Age. Princeton Univ. Press. MR2054951 eral parameters when the number of observations is large. Trans.
P RATT, J. W. (1976). F. Y. Edgeworth and R. A. Fisher on the effi- Amer. Math. Soc. 54 426–482. MR0012401
ciency of maximum likelihood estimation. Ann. Statist. 4 501– WALD , A. (1949). Note on the consistency of the maximum likeli-
514. MR0415867 hood estimate. Ann. Math. Statist. 20 595–601. MR0032169
620 S. M. STIGLER

Y ULE , G. U. (1936). An Introduction to the Theory of Statistics, Z ABELL , S. L. (1992). R. A. Fisher and the fiducial argument. Sta-
10th ed. Charles Griffin, London. [This was the last edition re- tist. Sci. 7 369–387. Reprinted in 2005 in S. L. Zabell, Symmetry
vised by Yule himself; subsequent revisions from 1937 by M. G. and its Discontents: Essays on the History of Inductive Philos-
Kendall were not greatly changed in emphasis.] ophy. Cambridge Univ. Press. MR1181418

You might also like