Author(s): R. A. Fisher
Source: Philosophical Transactions of the Royal Society of London. Series A, Containing Papers
of a Mathematical or Physical Character, Vol. 222 (1922), pp. 309-368
Published by: The Royal Society
Stable URL: http://www.jstor.org/stable/91208
ON THE MATHEMATICAL FOUNDATIONS OF THEORETICAL STATISTICS.

By R. A. FISHER, M.A.

Communicated by Dr. E. J. RUSSELL, F.R.S.
CONTENTS.

Section                                                                    Page
 1. The Neglect of Theoretical Statistics . . . . . . . . . . . . . . . .  310
 2. The Purpose of Statistical Methods  . . . . . . . . . . . . . . . . .  311
 3. The Problems of Statistics  . . . . . . . . . . . . . . . . . . . . .  313
 4. Criteria of Estimation  . . . . . . . . . . . . . . . . . . . . . . .  316
 5. Examples of the Use of the Criterion of Consistency . . . . . . . . .  317
 6. Formal Solution of Problems of Estimation . . . . . . . . . . . . . .  323
 7. Satisfaction of the Criterion of Sufficiency  . . . . . . . . . . . .  330
 8. The Efficiency of the Method of Moments in Fitting Curves of the
    Pearsonian Type I . . . . . . . . . . . . . . . . . . . . . . . . . .  332
 9. Location and Scaling of Frequency Curves in general . . . . . . . . .  338
10. The Efficiency of the Method of Moments in Fitting Pearsonian Curves   342
11. The Reason for the Efficiency of the Method of Moments in a Small
    Region surrounding the Normal Curve . . . . . . . . . . . . . . . . .  355
12. Discontinuous Distributions . . . . . . . . . . . . . . . . . . . . .  356
      (1) The Poisson Series  . . . . . . . . . . . . . . . . . . . . . .  359
      (2) Grouped Normal Data . . . . . . . . . . . . . . . . . . . . . .  359
      (3) Distribution of Observations in a Dilution Series . . . . . . .  363
13. Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  366
DEFINITIONS.

Centre of Location.—That abscissa of a frequency curve for which the sampling errors of optimum location are uncorrelated with those of optimum scaling. (9.)

Consistency.—A statistic satisfies the criterion of consistency if, when calculated from the whole population, it is equal to the parameter to be estimated. (4.)

Efficiency.—The efficiency of a statistic is the ratio which its intrinsic accuracy bears to that of the most efficient statistic possible. It expresses the proportion of the total available relevant information of which that statistic makes use. (4 and 10.)

Efficiency (Criterion).—The criterion of efficiency is satisfied by those statistics which, when derived from large samples, tend to a normal distribution with the least possible standard deviation. (4.)

Estimation.—Problems of estimation are those in which it is required to estimate the value of one or more of the population parameters from a random sample of the population. (3.)

Intrinsic Accuracy.—The intrinsic accuracy of an error curve is the weight in large samples, divided by the number in the sample, of that statistic of location which satisfies the criterion of sufficiency. (9.)
1. THE NEGLECT OF THEORETICAL STATISTICS.

SEVERAL reasons have contributed to the prolonged neglect into which the study of
statistics, in its theoretical aspects, has fallen. In spite of the immense amount of
fruitful labour which has been expended in its practical applications, the basic principles
of this organ of science are still in a state of obscurity, and it cannot be denied that,
during the recent rapid development of practical methods, fundamental problems have
been ignored and fundamental paradoxes left unresolved. This anomalous state of
statistical science is strikingly exemplified by a recent paper (1) entitled "The Fundamental Problem of Practical Statistics," in which one of the most eminent of modern
statisticians presents what purports to be a general proof of BAYES' postulate, a proof
which, in the opinion of a second statistician of equal eminence, " seems to rest upon a
very peculiar-not to say hardly supposable-relation." (2.)
Leaving aside the specific question here cited, to which we shall recur, the obscurity
which envelops the theoretical bases of statistical methods may perhaps be ascribed
to two considerations. In the first place, it appears to be widely thought, or rather
felt, that in a subject in which all results are liable to greater or smaller errors, precise
definition of ideas or concepts is, if not impossible, at least not a practical necessity.
In the second place, it has happened that in statistics a purely verbal confusion has
hindered the distinct formulation of statistical problems; for it is customary to apply
the same name, mean, standard deviation, correlation coefficient, etc., both to the true
value which we should like to know, but can only estimate, and to the particular value
at which we happen to arrive by our methods of estimation; so also in applying the
term probable error, writers sometimes would appear to suggest that the former quantity,
and not merely the latter, is subject to error.
It is this last confusion, in the writer's opinion, more than any other, which has led
to the survival to the present day of the fundamental paradox of inverse probability,
which like an impenetrable jungle arrests progress towards precision of statistical
concepts.
The criticisms of BOOLE, VENN, and CHRYSTAL have done something towards banishing the method, at least from the elementary text-books of Algebra; but though we may agree wholly with CHRYSTAL that inverse probability is a mistake (perhaps the only mistake to which the mathematical world has so deeply committed itself), there yet remains the feeling that such a mistake would not have captivated the minds of LAPLACE and POISSON if there had been nothing in it but error.
2. THE PURPOSE OF STATISTICAL METHODS.
Since the number of independent facts supplied in the data is usually far greater than the number of facts sought, much of the information
supplied by any actual sample is irrelevant. It is the object of the statistical processes
employed in the reduction of data to exclude this irrelevant information, and to isolate
the whole of the relevant information contained in the data.
When we speak of the probability of a certain object fulfilling a certain condition, we
imagine all such objects to be divided into two classes, according as they do or do not
fulfil the condition. This is the only characteristic in them of which we take cognisance.
For this reason probability is the most elementary of statistical concepts. It is a
parameter which specifies a simple dichotomy in an infinite hypothetical population, and it
represents neither more nor less than the frequency ratio which we imagine such a
population to exhibit. For example, when we say that the probability of throwing a
five with a die is one-sixth, we must not be taken to mean that of any six throws with
that die one and one only will necessarily be a five; or that of any six million
throws, exactly one million will be fives; but that of a hypothetical population of an
infinite number of throws, with the die in its original condition, exactly one-sixth will
be fives. Our statement will not then contain any false assumption about the actual
die, as that it will not wear out with continued use, or any notion of approximation, as
in estimating the probability from a finite sample, although this notion may be logically
developed once the meaning of probability is apprehended.
The concept of a discontinuous frequency distribution is merely an extension of that of
a simple dichotomy, for though the number of classes into which the population is
divided may be infinite, yet the frequency in each class bears a finite ratio to that of the
whole population. In frequency curves, however, a second infinity is introduced. No
finite sample has a frequency curve: a finite sample may be represented by a histogram,
or by a frequency polygon, which to the eye more and more resembles a curve, as the
size of the sample is increased. To reach a true curve, not only would an infinite number
of individuals have to be placed in each class, but the number of classes (arrays) into
which the population is divided must be made infinite. Consequently, it should be
clear that the concept of a frequency curve includes that of a hypothetical infinite
population, distributed according to a mathematical law, represented by the curve.
This law is specified by assigning to each element of the abscissa the corresponding
element of probability. Thus, in the case of the normal distribution, the probability
of an observation falling in the range dx is

$$\frac{1}{\sigma\sqrt{2\pi}}\; e^{-\frac{(x-m)^2}{2\sigma^2}}\, dx,$$

in which expression x is the value of the variate, while m, the mean, and σ, the standard
deviation, are the two parameters by which the hypothetical population is specified.
If a sample of n be taken from such a population, the data comprise n independent facts.
The statistical process of the reduction of these data is designed to extract from them
all relevant information respecting the values of m and σ, and to reject all other
information as irrelevant.
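The reduction described here can be made concrete with a short simulation. The sketch below (modern Python, purely illustrative; the values chosen for m and σ are arbitrary) replaces a sample of n observations by the two statistics estimating m and σ.

```python
import math
import random

random.seed(1)

# Draw a sample of n = 1000 from a normal population with the
# (arbitrary, illustrative) parameters m = 10 and sigma = 2.
m, sigma, n = 10.0, 2.0, 1000
sample = [random.gauss(m, sigma) for _ in range(n)]

# The reduction of the data: the n observations are replaced by
# two statistics, one estimating m and one estimating sigma.
mean = sum(sample) / n
sd = math.sqrt(sum((x - mean) ** 2 for x in sample) / n)

print(round(mean, 2), round(sd, 2))
```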
3. THE PROBLEMS OF STATISTICS.

The problems which arise in reduction of data may be conveniently divided into three
types:
(1) Problems of Specification. These arise in the choice of the mathematical form of
the population.
(2) Problems of Estimation. These involve the choice of methods of calculating from
a sample statistical derivates, or, as we shall call them, statistics, which are designed
to estimate the values of the parameters of the hypothetical population.
(3) Problems of Distribution. These include discussions of the distribution of
statistics derived from samples, or in general any functions of quantities whose
distribution is known.
It will be clear that when we know (1) what parameters are required to specify the
population from which the sample is drawn, (2) how best to calculate from the sample
estimates of these parameters, and (3) the exact form of the distribution, in different
samples, of our derived statistics, then the theoretical aspect of the treatment of any
particular body of data has been completely elucidated.
As regards problems of specification, these are entirely a matter for the practical
statistician, for those cases where the qualitative nature of the hypothetical population
is known do not involve any problems of this type. In other cases we may know by
experience what forms are likely to be suitable, and the adequacy of our choice may
be tested a posteriori. We must confine ourselves to those forms which we know how
to handle, or for which any tables which may be necessary have been constructed.
More or less elaborate forms will be suitable according to the volume of the data.
Evidently these are considerations the nature of which may change greatly during the
work of a single generation. We may instance the development by PEARSON of a very
extensive system of skew curves, the elaboration of a method of calculating their
parameters, and the preparation of the necessary tables, a body of work which has enormously
extended the power of modern statistical practice, and which has been, by pertinacity
and inspiration alike, practically the work of a single man. Nor is the introduction of
the Pearsonian system of frequency curves the only contribution which their author has
made to the solution of problems of specification: of even greater importance is the
introduction of an objective criterion of goodness of fit. For empirical as the specification of the hypothetical population may be, this empiricism is cleared of its dangers if
we can apply a rigorous and objective test of the adequacy with which the proposed
population represents the whole of the available facts. Once a statistic, suitable for
applying such a test, has been chosen, the exact form of its distribution in random
samples must be investigated, in order that we may evaluate the probability that a
worse fit should be obtained from a random sample of a population of the type considered. The possibility of developing complete and self-contained tests of goodness of
fit deserves very careful consideration, since therein lies our justification for the free
use which is made of empirical frequency formulae. Problems of distribution of great
mathematical difficulty have to be faced in this direction.
Although problems of estimation and of distribution may be studied separately, they
are intimately related in the development of statistical methods. Logically problems of
distribution should have prior consideration,for the study of the random distribution of
different suggested statistics, derived from samples of a given size, must guide us in the
choice of which statistic it is most profitable to calculate. The fact is, however, that
very little progress has been made in the study of the distribution of statistics derived
from samples. In 1900 PEARSON (15) gave the exact form of the distribution of χ², the
Pearsonian test of goodness of fit, and in 1915 the same author published (18) a similar
result of more general scope, valid when the observations are regarded as subject to
linear constraints. By an easy adaptation (17) the tables of probability derived from
this formula may be made available for the more numerous cases in which linear constraints are imposed.

4. CRITERIA OF ESTIMATION.
The common-sense criterion employed in problems of estimation may be stated thus: That when applied to the whole population the derived statistic should be equal to the parameter. This may be called the Criterion of Consistency. It is often the only test applied: thus, in estimating the standard deviation of a normally distributed population, from an ungrouped sample, either of the two statistics

$$\sigma_1 = \sqrt{\frac{\pi}{2}}\cdot\frac{S\,(|x-\bar{x}|)}{n} \quad \text{(mean error)}$$

and

$$\sigma_2 = \sqrt{\frac{S\,(x-\bar{x})^2}{n}} \quad \text{(mean square error)}$$

will lead to the correct value, σ, when calculated from the whole population. They both
thus satisfy the criterion of consistency, and this has led many computers to use the
first formula, although the result of the second has 14 per cent. greater weight (7), and
the labour of increasing the number of observations by 14 per cent. can seldom be less
than that of applying the more accurate formula.
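The 14 per cent. difference in weight is easily checked empirically. The sketch below (modern Python, illustrative only; sample size and trial count are arbitrary) computes both statistics over repeated normal samples and compares their sampling variances, whose ratio should lie near 1.14.

```python
import math
import random

random.seed(2)

def sigma_1(xs):
    # "Mean error" statistic: sqrt(pi/2) times the mean absolute deviation.
    xbar = sum(xs) / len(xs)
    return math.sqrt(math.pi / 2) * sum(abs(x - xbar) for x in xs) / len(xs)

def sigma_2(xs):
    # "Mean square error" statistic: root of the mean squared deviation.
    xbar = sum(xs) / len(xs)
    return math.sqrt(sum((x - xbar) ** 2 for x in xs) / len(xs))

n, trials = 100, 4000
e1, e2 = [], []
for _ in range(trials):
    xs = [random.gauss(0.0, 1.0) for _ in range(n)]
    e1.append(sigma_1(xs))
    e2.append(sigma_2(xs))

def var(v):
    mu = sum(v) / len(v)
    return sum((x - mu) ** 2 for x in v) / len(v)

# Ratio of sampling variances: close to 1.14 for normal samples,
# i.e. sigma_2 carries about 14 per cent. greater weight.
ratio = var(e1) / var(e2)
print(round(ratio, 2))
```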
Consideration of the above example will suggest a second criterion, namely: That in
large samples, when the distributions of the statistics tend to normality, that statistic
is to be chosen which has the least probable error.
This may be called the Criterion of Efficiency. It is evident that if for large samples
one statistic has a probable error double that of a second, while both are proportional
to n^{-1/2}, then the first method applied to a sample of 4n values will be no more accurate
than the second applied to a sample of any n values. If the second method makes use
of the whole of the information available, the first makes use of only one-quarter of it,
and its efficiency may therefore be said to be 25 per cent. To calculate the efficiency of
any given method, we must therefore know the probable error of the statistic calculated
by that method, and that of the most efficient statistic which could be used. The
square of the ratio of these two quantities then measures the efficiency.
The criterion of efficiency is still to some extent incomplete, for different
methods of calculation may tend to agreement for large samples, and yet differ for
all finite samples. The complete criterion suggested by our work on the mean
square error (7) is: That the statistic chosen should summarise the whole of the relevant information
supplied by the sample.
This may be called the Criterion of Sufficiency.
In mathematical language we may interpret this statement by saying that if θ be
the parameter to be estimated, θ₁ a statistic which contains the whole of the information
as to the value of θ which the sample supplies, and θ₂ any other statistic, then the
surface of distribution of pairs of values of θ₁ and θ₂, for a given value of θ, is such that
for a given value of θ₁, the distribution of θ₂ does not involve θ. In other words, when
θ₁ is known, knowledge of the value of θ₂ throws no further light upon the value of θ.
It may be shown that a statistic which fulfils the criterion of sufficiency will also
fulfil the criterion of efficiency, when the latter is applicable. For, if this be so, the
distribution of the statistics will in large samples be normal, the standard deviations
being proportional to n^{-1/2}. Let this distribution be

$$df = \frac{1}{2\pi\sigma_1\sigma_2\sqrt{1-r^2}} \exp\left\{-\frac{1}{2(1-r^2)}\left[\frac{(\theta_1-\theta)^2}{\sigma_1^2} - \frac{2r\,(\theta_1-\theta)(\theta_2-\theta)}{\sigma_1\sigma_2} + \frac{(\theta_2-\theta)^2}{\sigma_2^2}\right]\right\} d\theta_1\, d\theta_2;$$

then, for a given value of θ₁, the distribution of θ₂ is

$$df = \frac{1}{\sigma_2\sqrt{2\pi(1-r^2)}} \exp\left\{-\frac{\Bigl(\theta_2-\theta-\dfrac{r\sigma_2}{\sigma_1}(\theta_1-\theta)\Bigr)^2}{2\sigma_2^2(1-r^2)}\right\} d\theta_2;$$

and if this is not to involve θ, as the criterion of sufficiency requires, we must have
rσ₂ = σ₁; showing that σ₁ is necessarily less than σ₂, and that the efficiency of θ₂ is measured by
r², when r is its correlation in large samples with θ₁.
Besides this case we shall see that the criterion of sufficiency is also applicable to finite
samples, and to those cases when the weight of a statistic is not proportional to the
number of the sample from which it is calculated.
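The defining property of sufficiency, that once θ₁ is known no other statistic carries further information about θ, can be illustrated by simulation. In the sketch below (modern Python; the binomial example and all figures are illustrative additions, not taken from the text), the total number of successes is sufficient for the chance p: conditional on that total, the distribution of a second statistic is the same whatever the value of p.

```python
import random

random.seed(3)

def cond_dist(p, n=10, total=5, trials=100000):
    # Distribution of "successes in the first half of the sample",
    # conditional on the sufficient statistic (total successes) = total.
    counts = {}
    kept = 0
    for _ in range(trials):
        xs = [1 if random.random() < p else 0 for _ in range(n)]
        if sum(xs) == total:
            kept += 1
            s = sum(xs[: n // 2])
            counts[s] = counts.get(s, 0) + 1
    return {k: v / kept for k, v in counts.items()}

d1 = cond_dist(0.3)
d2 = cond_dist(0.7)
# The two conditional distributions agree (both are hypergeometric),
# whatever the value of p: the conditioned data carry no news of p.
for k in range(6):
    print(k, round(d1.get(k, 0), 2), round(d2.get(k, 0), 2))
```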
5. EXAMPLES OF THE USE OF THE CRITERION OF CONSISTENCY.
In certain cases the criterion of consistency is sufficient for the solution of problems
of estimation. An example of this occurs when a fourfold table is interpreted as representing the double dichotomy of a normal surface. In this case the dichotomic ratios
of the two variates, together with the correlation, completely specify the four fractions
into which the population is divided. If these are equated to the four fractions into
which the sample is divided, the correlation is determined uniquely.
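For the special case in which both variates are dichotomised at their medians, the consistency argument gives the correlation in closed form, since the (+,+) cell of the population then contains the fraction 1/4 + arcsin(r)/2π. The sketch below (modern Python, illustrative; this closed form holds only for median dichotomies, the general case requiring the bivariate normal integral) recovers r from a simulated fourfold table.

```python
import math
import random

random.seed(4)

# Simulate a bivariate normal population with correlation r_true,
# dichotomise both variates at their medians (zero), and recover r by
# equating the observed (+,+) fraction to 1/4 + arcsin(r)/(2*pi).
r_true = 0.6
n = 100000
plus_plus = 0
for _ in range(n):
    u = random.gauss(0, 1)
    v = r_true * u + math.sqrt(1 - r_true ** 2) * random.gauss(0, 1)
    if u > 0 and v > 0:
        plus_plus += 1

f = plus_plus / n
r_est = math.sin(2 * math.pi * (f - 0.25))
print(round(r_est, 2))  # close to 0.6
```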
In other cases where a small correction has to be made, the amount of the correction
is not of sufficient importance to justify any great refinement in estimation, and it is
sufficient to calculate the discrepancy which appears when the uncorrected method is
applied to the whole population. Of this nature is SHEPPARD'S correction for grouping,
and it will illustrate this use of the criterion of consistency if we derive formulae for
this correction without approximation.
Let ξ be the value of the variate at the mid point of any group, a the interval of
grouping, and x the true value of the variate at any point; then the kth moment of an
infinite grouped sample is

$$\sum_{p=-\infty}^{\infty} \xi^k \int_{\xi-\frac{1}{2}a}^{\xi+\frac{1}{2}a} f(x)\, dx, \qquad \xi = \frac{a\theta}{2\pi} + pa,$$

in which f(x) dx is the frequency, in any element dx, of the ungrouped population, and
θ fixes the position of the grouping grid. Regarded as a function of θ this moment is
periodic, with period 2π, and may be expanded in the Fourier series

$$A_0 + \sum_{s=1}^{\infty} \left( A_s \cos s\theta + B_s \sin s\theta \right),$$

where

$$A_0 = \frac{1}{2\pi}\int_{-\pi}^{\pi} d\theta \sum_{p} \xi^k \int_{\xi-\frac{1}{2}a}^{\xi+\frac{1}{2}a} f(x)\, dx, \qquad
A_s = \frac{1}{\pi}\int_{-\pi}^{\pi} \cos s\theta\, d\theta \sum_{p} \xi^k \int_{\xi-\frac{1}{2}a}^{\xi+\frac{1}{2}a} f(x)\, dx,$$

and similarly for B_s with sin sθ in place of cos sθ. But as θ ranges over (−π, π) and p
over all integers, ξ ranges once over the whole line, and

$$d\theta = \frac{2\pi}{a}\, d\xi, \qquad \sin s\theta = \sin\frac{2\pi s\xi}{a}, \qquad \cos s\theta = \cos\frac{2\pi s\xi}{a};$$

hence, inverting the order of integration,

$$A_0 = \frac{1}{a}\int_{-\infty}^{\infty} \xi^k\, d\xi \int_{\xi-\frac{1}{2}a}^{\xi+\frac{1}{2}a} f(x)\, dx
     = \int_{-\infty}^{\infty} f(x)\, dx \cdot \frac{1}{a}\int_{x-\frac{1}{2}a}^{x+\frac{1}{2}a} \xi^k\, d\xi.$$
Inserting the values 1, 2, 3 and 4 for k, we obtain for the aperiodic terms of the four
moments of the grouped population

$$_1A_0 = \int_{-\infty}^{\infty} x\, f(x)\, dx,$$

$$_2A_0 = \int_{-\infty}^{\infty} \left(x^2 + \frac{a^2}{12}\right) f(x)\, dx,$$

$$_3A_0 = \int_{-\infty}^{\infty} \left(x^3 + \frac{a^2 x}{4}\right) f(x)\, dx,$$

$$_4A_0 = \int_{-\infty}^{\infty} \left(x^4 + \frac{a^2 x^2}{2} + \frac{a^4}{80}\right) f(x)\, dx.$$
If we ignore the periodic terms, these equations lead to the ordinary SHEPPARD
corrections for the second and fourth moments. The nature of the approximation involved
is brought out by the periodic terms. In the absence of high contact at the ends of the
curve, the contribution of these will, of course, include the terms given in a recent paper
by PEARSON (8); but even with high contact it is of interest to see for what degree of
coarseness of grouping the periodic terms become sensible.
Now, for the first moment (k = 1), the same substitution gives

$$A_s = (-1)^s\, \frac{a}{\pi s} \int_{-\infty}^{\infty} \sin\frac{2\pi s x}{a}\, f(x)\, dx, \qquad
B_s = (-1)^{s+1}\, \frac{a}{\pi s} \int_{-\infty}^{\infty} \cos\frac{2\pi s x}{a}\, f(x)\, dx,$$

so that for the normal curve, of standard deviation σ, the sth periodic term has amplitude

$$\frac{a}{\pi s}\; e^{-\frac{2\pi^2 s^2 \sigma^2}{a^2}},$$

which falls off with extreme rapidity as the ratio σ/a increases.
For the normal curve, take the grouping interval equal to the standard deviation,
a = σ. The leading periodic term of the first moment then has amplitude (a/π) e^{−2π²},
and, equating this to one-tenth of the standard error of the mean, σ/10√n, it becomes
sensible only when

$$n = \frac{\pi^2}{100}\; e^{4\pi^2} = 13{,}790 \text{ billion}.$$

For the second moment the corresponding calculation gives about 175 billion.
The error, while still very minute, is thus more important for the second than for
the first moment. For the third moment the sample required is about 14.7 billion,
and for the fourth moment about 1.34 billion.
In a similar manner the exact form of SHEPPARD'S correction may be found for other
curves; for the normal curve we may say that the periodic terms are exceedingly minute
so long as a is less than σ, though they increase very rapidly if a is increased beyond
this point. They are of increasing importance as higher moments are used, not only
absolutely, but relatively to the increasing probable errors of the higher moments.
The principle upon which the correction is based is merely to find the error when the
moments are calculated from an infinite grouped sample; the corrected moment therefore
fulfils the criterion of consistency, and so long as the correction is small no greater
refinement is required.
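The correction derived above for the second moment, namely the subtraction of a²/12, is easily verified by simulation. In the sketch below (modern Python; the values of σ, a and n are illustrative), normal observations are replaced by the mid-points of their groups before the second moment is computed.

```python
import math
import random

random.seed(5)

sigma, a, n = 2.0, 1.0, 200000   # population s.d., grouping interval, sample size
xs = [random.gauss(0.0, sigma) for _ in range(n)]

# Replace each observation by the mid-point of its group of width a.
grouped = [a * math.floor(x / a) + a / 2 for x in xs]

mean_g = sum(grouped) / n
m2_grouped = sum((g - mean_g) ** 2 for g in grouped) / n

# Sheppard's correction for the second moment: subtract a^2 / 12.
m2_corrected = m2_grouped - a ** 2 / 12

# m2_grouped overstates the variance (here 4.0) by about a^2/12;
# the corrected value is consistent.
print(round(m2_grouped, 3), round(m2_corrected, 3))
```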
Perhaps the most extended use of the criterion of consistency has been developed by
PEARSON in the "Method of Moments." In this method, which is without question of
great practical utility, different forms of frequency curves are fitted by calculating as
many moments of the sample as there are parameters to be evaluated. The parameters
chosen are those of an infinite population of the specified type having the same moments
as those calculated from the sample.
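The procedure can be sketched in miniature. The example below (modern Python, illustrative; it fits a two-parameter gamma population by its first two moments rather than a four-parameter Pearson curve, but the principle is the same) chooses the parameters of the specified type whose population moments equal the sample moments.

```python
import random

random.seed(6)

# Method-of-moments fit of a gamma population (shape k, scale s):
# the population moments are mean = k*s and variance = k*s^2, so
# equating them to the sample moments gives
#   k = mean^2 / var,   s = var / mean.
k_true, s_true, n = 3.0, 2.0, 100000
xs = [random.gammavariate(k_true, s_true) for _ in range(n)]

mean = sum(xs) / n
var = sum((x - mean) ** 2 for x in xs) / n

k_hat = mean ** 2 / var
s_hat = var / mean
print(round(k_hat, 2), round(s_hat, 2))
```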
The system of curves developed by PEARSON has four variable parameters, and may
be fitted by means of the first four moments. For this purpose it is necessary to confine
attention to curves of which the first four moments are finite; further, if the accuracy
of the fourth moment should increase with the size of the sample, that is, if its probable
error should not be infinitely great, the first eight moments must be finite. This
restriction requires that the class of distribution in which this condition is not fulfilled
should be set aside as "heterotypic," and that the fourth moment should become
practically valueless as this class is approached. It should be made clear, however,
that there is nothing anomalous about these so-called " heterotypic " distributions
except the fact that the method of moments cannot be applied to them. Moreover, for that class of distribution to which the method can be applied, it has not
been shown, except in the case of the normal curve, that the best values will be
obtained by the method of moments. The method will, in these cases, certainly be
serviceable in yielding an approximation, but to discover whether this approximation
is a good or a bad one, and to improve it, if necessary, a more adequate criterion is
required.
A single example will be sufficient to illustrate the practical difficulty alluded to
above. If a point P lie at known (unit) distance from a straight line AB, and lines be
drawn at random through P, then the distribution of the points of intersection with
AB will be distributed so that the frequency in any range dx is

$$df = \frac{1}{\pi} \cdot \frac{dx}{1 + (x-m)^2},$$

in which x is the distance of the infinitesimal range dx from a fixed point O on the line,
and m is the distance, from this point, of the foot of the perpendicular PM. The
distribution will be a symmetrical one (Type VII.) having its centre at x = m (fig. 1). It is
therefore a perfectly definite problem to estimate the value of m (to find the best value of
m) from a random sample of values of x. We have stated the problem in its simplest
possible form: only one parameter is required, the middle point of the distribution.
[Fig. 1. The symmetrical Type VII. curve df = dx / π(1+x²), with abscissa running from −6 to +6.]
By the method of moments, this should be given by the first moment, that is by the
mean of the observations: such would seem to be at least a good estimate. It is,
however, entirely valueless. The distribution of the mean of such samples is in fact the
same, identically, as that of a single observation. In taking the mean of 100 values of
x, we are no nearer obtaining the value of m than if we had chosen any value of x out
of the 100. The problem, however, is not in the least an impracticable one: clearly
from a large sample we ought to be able to estimate the centre of the distribution with
some precision; the mean, however, is an entirely useless statistic for the purpose.
By taking the median of a large sample, a fair approximation is obtained, for the standard
error of the median of a large sample of n is π/2√n; so that by adopting adequate statistical
methods it must be possible to estimate the value for
m, with increasing accuracy, as the size of the sample is increased.
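Both halves of the claim, the uselessness of the mean and the serviceability of the median, are easily exhibited by simulation (modern Python, illustrative; m = 1 is an arbitrary choice):

```python
import math
import random
import statistics

random.seed(7)

m = 1.0        # true centre of the Type VII (Cauchy) distribution
n, trials = 101, 2000
means, medians = [], []
for _ in range(trials):
    # Inverse-CDF sampling: x = m + tan(pi * (u - 1/2)).
    xs = [m + math.tan(math.pi * (random.random() - 0.5)) for _ in range(n)]
    means.append(sum(xs) / n)
    medians.append(statistics.median(xs))

# The mean of n observations is distributed like a single observation,
# so its spread does not shrink; the median concentrates about m,
# with standard error near pi / (2*sqrt(n)) ~ 0.156 here.
spread_mean = statistics.pstdev(means)
spread_median = statistics.pstdev(medians)
print(round(spread_median, 2))
```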
This example serves also to illustrate the practical difficulty which observers often
find, that a few extreme observations appear to dominate the value of the mean. In
these cases the rejection of extreme values is often advocated, and it may often happen
that gross errors are thus rejected. As a statistical measure, however, the rejection of
observations is too crude to be defended: and unless there are other reasons for
rejection than mere divergence from the majority, it would be more philosophical to accept
these extreme values, not as gross errors, but as indications that the distribution of
errors is not normal. As we shall show, the only Pearsonian curve for which the mean
is the best statistic for locating the curve, is the normal or gaussian curve of errors. If
the curve is not of this form the mean is not necessarily, as we have seen, of any value
whatever. The determination of the true curves of variation for different types of work
is therefore of great practical importance, and this can only be done by different workers
recording their data in full without rejections, however they may please to treat the
data so recorded. Assuredly an observer need be exposed to no criticism, if after
recording data which are not probably normal in distribution, he prefers to adopt some
value other than the arithmetic mean.
6. FORMAL SOLUTION OF PROBLEMS OF ESTIMATION.
The form in which the criterion of sufficiency has been presented is not of direct
assistance in the solution of problems of estimation. For it is necessary first to know
the statistic concerned and its surface of distribution, with an infinite number of other
statistics, before its sufficiency can be tested. For the solution of problems of
estimation we require a method which for each particular problem will lead us
automatically to the statistic by which the criterion of sufficiency is satisfied. Such a
method is, I believe, provided by the Method of Maximum Likelihood, although I am
not satisfied as to the mathematical rigour of any proof which I can put forward to
that effect. Readers of the ensuing pages are invited to form their own opinion as
to the possibility of the method of the maximum likelihood leading in any case to an
insufficient statistic. For my own part I should gladly have withheld publication until
a rigorously complete proof could have been formulated; but the number and variety
of the new results which the method discloses press for publication, and at the same
time I am not insensible of the advantage which accrues to Applied Mathematics from
the co-operation of the Pure Mathematician, and this co-operation is not infrequently
called forth by the very imperfections of writers on Applied Mathematics.
If in any distribution involving unknown parameters θ₁, θ₂, θ₃, …, the chance of
an observation falling in the range dx be represented by

$$f(x, \theta_1, \theta_2, \ldots)\, dx,$$

then the chance that in a sample of n, n₁ fall in the range dx₁, n₂ in the range dx₂, and
so on, will be

$$\frac{n!}{n_1!\; n_2! \cdots}\; \prod \bigl( f\, dx \bigr)^{n_p},$$

the product being taken over the ranges observed. The method of maximum likelihood
consists simply in choosing that set of values
for the parameters which makes this quantity a maximum, and since in this expression
the parameters are only involved in the function f, we have to make

$$S\,(\log f)$$
a maximum for variations of θ₁, θ₂, θ₃, &c. In this form the method is applicable to
the fitting of populations involving any number of variates, and equally to discontinuous
as to continuous distributions.
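As a concrete sketch (modern Python, illustrative only), the method may be applied to the Type VII curve of Section 5: S(log f) is computed over the sample and maximised numerically with respect to the location parameter m.

```python
import math
import random

random.seed(8)

# Sample from the Type VII (Cauchy) curve with centre m_true = 1.
m_true, n = 1.0, 500
xs = [m_true + math.tan(math.pi * (random.random() - 0.5)) for _ in range(n)]

def log_likelihood(m):
    # S(log f) for f = 1/(pi*(1 + (x - m)^2)); additive constants dropped.
    return -sum(math.log(1.0 + (x - m) ** 2) for x in xs)

# Maximise S(log f) over a fine grid of candidate values of m.
grid = [i / 100 for i in range(-1000, 1001)]
m_hat = max(grid, key=log_likelihood)
print(round(m_hat, 1))
```

Unlike the sample mean, this estimate improves steadily with the size of the sample.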
In order to make clear the distinction between this method and that of BAYES, we
will apply it to the same type of problem as that which BAYES discussed, in the hope
of bringing out the distinction as clearly as possible. If an event has occurred x times
and failed y times in n trials, the chance of the observed result is

$$\frac{n!}{x!\; y!}\; p^x\, (1-p)^y.$$

Applying the method of maximum likelihood, we have

$$S\,(\log f) = x \log p + y \log\,(1-p),$$

whence, differentiating with respect to p, in order to make this quantity a maximum,

$$\frac{x}{p} - \frac{y}{1-p} = 0, \qquad \text{or} \qquad \hat{p} = \frac{x}{n}.$$
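The stationary value x/n can be checked directly (modern Python; the observed counts are illustrative figures):

```python
import math

x, y = 30, 70            # illustrative observed successes and failures
n = x + y

def log_lik(p):
    # S(log f) = x log p + y log (1 - p), constants dropped.
    return x * math.log(p) + y * math.log(1 - p)

p_hat = x / n            # the maximum-likelihood value x/n = 0.3
# The log likelihood at p_hat exceeds that at any nearby value.
assert all(log_lik(p_hat) >= log_lik(p_hat + d) for d in (-0.05, -0.01, 0.01, 0.05))
print(p_hat)
```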
The question then arises as to the accuracy of this determination. This question was
first discussed by BAYES (10), in a form which we may state thus. After observing
this sample, what is the probability that p lies in any range dp? In
other words, what is the frequency distribution of the values of p in populations which
are selected by the restriction that a sample of n taken from each of them yields x
successes. Without further data, as BAYES perceived, this problem is insoluble. To
render it capable of mathematical treatment, BAYES introduced the datum, that among
the populations upon which the experiment was tried, those in which p lay in the range
dp were equally frequent for all equal ranges dp. The probability that the value of p
lay in any range dp was therefore assumed to be simply dp, before the sample was
taken. After the selection effected by observing the sample, the probability is clearly
proportional to

$$p^x\, (1-p)^y\, dp.$$
After giving this solution, based upon the particular datum stated, BAYES adds a
scholium the purport of which would seem to be that in the absence of all knowledge
save that supplied by the sample, it is reasonable to assume this particular a priori
distribution of p. The result, the datum, and the postulate implied by the scholium, have
all been somewhat loosely spoken of as BAYES' Theorem.
Suppose, instead of p, we introduce the quantity θ, defined by the relation

$$p = \frac{1 + \sin\theta}{2};$$

the quantity θ measures the degree of probability just as well as p, and is even, for
some purposes, the more suitable variable. The chance of obtaining a sample of x
successes and y failures is now

$$\frac{n!}{2^n\; x!\; y!}\; (1+\sin\theta)^x\, (1-\sin\theta)^y;$$

applying the method of maximum likelihood as before,

$$\frac{x\cos\theta}{1+\sin\theta} - \frac{y\cos\theta}{1-\sin\theta} = 0,$$

whence

$$\sin\theta = \frac{x-y}{n},$$
an exactly equivalent solution to that obtained using the variable p. But what a priori
assumption are we to make as to the distribution of θ? Are we to assume that θ is
equally likely to lie in all equal ranges dθ? In this case the a priori probability will
be dθ/π, and that after making the observations will be proportional to

$$(1+\sin\theta)^x\, (1-\sin\theta)^y\, d\theta.$$

But if we interpret this in terms of p, we obtain

$$\frac{p^x\, (1-p)^y}{\sqrt{p\,(1-p)}}\; dp,$$

a result inconsistent with that obtained previously. In fact, the distribution previously
assumed for p was equivalent to assuming the special distribution for θ,

$$df = \tfrac{1}{2}\cos\theta\; d\theta,$$

the arbitrariness of which is fully apparent when we use any variable other than p.
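The inconsistency of the two a priori assumptions is easily exhibited numerically: drawing θ uniformly and transforming to p shows that the induced distribution of p is far from uniform, piling its mass near 0 and 1 (modern Python, illustrative):

```python
import math
import random

random.seed(10)

n = 100000
# Draw theta uniformly on (-pi/2, pi/2) and transform to p = (1 + sin theta)/2.
ps = [(1 + math.sin(math.pi * (random.random() - 0.5))) / 2 for _ in range(n)]

# If p itself were uniform, each tenth of (0, 1) would hold ~10% of the
# draws; instead the induced density 1/(pi*sqrt(p(1-p))) concentrates
# mass in the outer bins.
bins = [0] * 10
for p in ps:
    bins[min(int(p * 10), 9)] += 1
fractions = [b / n for b in bins]
print([round(f, 3) for f in fractions])
```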
In a less obtrusive form the same species of arbitrary assumption underlies the method
known as that of inverse probability. Thus, if the same observed result A might be
the consequence of one or other of two hypothetical conditions X and Y, it is assumed
that the probabilities of X and Y are in the same ratio as the probabilities of A occurring
on the two assumptions, X is true, Y is true. This amounts to assuming that before
A was observed, it was known that our universe had been selected at random from an
infinite population in which X was true in one half, and Y true in the other half.
Clearly such an assumption is entirely arbitrary, nor has any method been put forward
by which such assumptions can be made even with consistent uniqueness. There
is nothing to prevent an irrelevant distinction being drawn among the hypothetical
conditions represented by X, so that we have to consider two hypothetical possibilities
X₁ and X₂, on both of which A will occur with equal frequency. Such a distinction
should make no difference whatever to our conclusions; but on the principle of inverse
probability it does so, for if previously the relative probabilities were reckoned to be
in the ratio x to y, they must now be reckoned 2x to y. Nor has any criterion been
suggested by which it is possible to separate such irrelevant distinctions from those
which are relevant.
There would be no need to emphasise the baseless character of the assumptions made under the titles of inverse probability and BAYES' Theorem in view of the decisive criticism to which they have been exposed at the hands of BOOLE, VENN, and CHRYSTAL, were it not for the fact that the older writers, such as LAPLACE and POISSON, who accepted these assumptions, also laid the foundations of the modern theory of statistics, and have introduced into their discussions of this subject ideas of a similar character. I must indeed plead guilty, in my original statement of the Method of the Maximum Likelihood (9), to having based my argument upon the principle of inverse probability; in the same paper, it is true, I emphasised the fact that such inverse probabilities were relative only. That is to say, that while we might speak of one value of p as having an inverse probability three times that of another value of p, we might on no account introduce the differential element dp, so as to be able to say that it was three times as probable that p should lie in one rather than the other of two equal elements. Upon consideration, therefore, I perceive that the word probability is wrongly used in such a connection:
probability is a ratio of frequencies, and about the frequencies of such values we can
know nothing whatever. We must return to the actual fact that one value of p, of
the frequency of which we know nothing, would yield the observed result three times
as frequently as would another value of p. If we need a word to characterise this
relative property of different values of p, I suggest that we may speak without confusion of the likelihood of one value of p being thrice the likelihood of another, bearing always
in mind that likelihood is not here used loosely as a synonym of probability, but simply
to express the relative frequencies with which such values of the hypothetical quantity
p would in fact yield the observed sample.
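Fisher's distinction can be made concrete in a few lines of code. The sketch below is ours, not part of the paper, and the sample values x and n are arbitrary illustrations: it computes the ratio of the likelihoods of two values of p, and locates the value of greatest likelihood at x/n. Nothing here is, or need be, a probability density; only ratios of likelihoods are used.

```python
def likelihood(p, x, n):
    """Likelihood of the parameter p given x successes in n trials (constant factor dropped)."""
    return p**x * (1.0 - p)**(n - x)

x, n = 35, 100

# Ratio of the likelihoods of two candidate values of p.
ratio = likelihood(0.35, x, n) / likelihood(0.5, x, n)

# The likelihood is greatest at p = x/n.
grid = [i / 1000 for i in range(1, 1000)]
p_best = max(grid, key=lambda p: likelihood(p, x, n))

print(ratio, p_best)
```

The ratio states how many times more frequently the observed sample would arise from p = 0.35 than from p = 0.5, which is exactly the sense of "likelihood" defined in the text.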
The solution of the problems of calculating from a sample the parameters of the hypothetical population, which we have put forward in the method of maximum likelihood, consists, then, simply of choosing such values of these parameters as have the
maximum likelihood. Formally, therefore, it resembles the calculation of the mode of
an inverse frequency distribution. This resemblance is quite superficial: if the scale of measurement of the hypothetical quantity be altered, the mode must change its position, and can be brought to have any value, by an appropriate change of scale; but the optimum, as the position of maximum likelihood may be called, is entirely unchanged by any such transformation. Likelihood also differs from probability* in that it is not a differential element, and is incapable of being integrated: it is assigned to a particular
point of the range of variation, not to a particular element of it. There is therefore, for any sample, a value of p whose likelihood is greater than that of any other value, and this is the best value obtainable from the sample; we shall term this the optimum value of p. Other values of p for which the likelihood is not much less cannot, however,
be deemed unlikely values for the true value of p. We do not, and cannot, know, from
the information supplied by a sample, anything about the probability that p should lie
between any named values.
It should be remarked that likelihood, as above defined, is not only fundamentally distinct from mathematical probability, but also from the logical "probability" by which Mr. KEYNES (21) has recently attempted to develop a method of treatment of uncertain inference, applicable to those cases where we lack the statistical information necessary for the application of mathematical probability. Although, in an important class of cases, the likelihood may be held to measure the degree of our rational belief in a conclusion, in the same sense as Mr. KEYNES' "probability," yet since the latter quantity is constrained, somewhat arbitrarily, to obey the addition theorem of mathematical probability, the likelihood is a quantity which falls definitely outside its scope.
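The invariance of the optimum under a change of variable, claimed in the text above, can be checked directly. In this illustrative sketch (ours, with an arbitrary sample) the binomial likelihood is maximised over θ, where p = (1 + sin θ)/2, and the maximising θ is found to correspond to p = x/n, exactly as maximisation over p itself would give.

```python
import math

x, y = 3, 7

def log_likelihood_theta(t):
    """Log-likelihood in the variable theta, where p = (1 + sin t)/2."""
    return x * math.log(1.0 + math.sin(t)) + y * math.log(1.0 - math.sin(t))

# Maximise over a fine grid of theta strictly inside (-pi/2, pi/2).
N = 200000
grid = [-math.pi / 2 + math.pi * (i + 0.5) / (N + 1) for i in range(N + 1)]
t_best = max(grid, key=log_likelihood_theta)

p_best = (1.0 + math.sin(t_best)) / 2.0
print(p_best)  # close to x/(x + y) = 0.3
```

The mode of a posterior density, by contrast, shifts under such a substitution, because the differential element dθ transforms; the likelihood carries no such element.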
may write down the chance for a given value of the parameter θ, that θ₁ should lie in the range dθ₁, in the form

$$d\phi = \frac{1}{\sigma\sqrt{2\pi}}\,e^{-\frac{(\theta_1-\theta)^2}{2\sigma^2}}\,d\theta_1.$$

The mean value of θ₁ will be the true value θ, and the standard deviation is σ, the sample being assumed sufficiently large for us to disregard the dependence of σ upon θ. The likelihood of any value, θ, is proportional to

$$e^{-\frac{(\theta_1-\theta)^2}{2\sigma^2}},$$

so that

$$\log\phi = C - \frac{(\theta_1-\theta)^2}{2\sigma^2},$$

whence

$$\frac{\partial^2}{\partial\theta^2}\log\phi = -\frac{1}{\sigma^2}.$$

Now φ stands for the total frequency of all samples for which the chosen statistic has the value θ₁; consequently φ = Σ′ψ, the summation being taken over all such samples, where ψ stands for the probability of occurrence of a certain specified sample. For which we know that

$$\log\psi = C + S(\log f),$$

the summation being taken over the individual members of the sample. If now we expand log f in the form

$$\log f(\theta) = \log f(\theta_1) + (\theta-\theta_1)\,\frac{\partial}{\partial\theta_1}\log f(\theta_1) + \frac{(\theta-\theta_1)^2}{2}\,\frac{\partial^2}{\partial\theta_1^2}\log f(\theta_1) + \ldots,$$

or

$$\log f = \log f_1 + (\theta-\theta_1)\,a + \frac{(\theta-\theta_1)^2}{2}\,b + \ldots,$$

we have

$$\log\psi = C + (\theta-\theta_1)\,S(a) + \frac{(\theta-\theta_1)^2}{2}\,S(b) + \ldots;$$

for an optimum statistic S(a) = 0, and for sufficiently large samples S(b) may be replaced by n b̄, where b̄ is the mean value of b; hence

$$\psi \propto e^{\frac{1}{2}n\bar{b}\,(\theta-\theta_1)^2}.$$

Now this factor is constant for all samples which have the same value of θ₁, hence the variation of φ with respect to θ is represented by the same factor, and consequently

$$\log\phi = C' + \frac{1}{2}\,n\bar{b}\,(\theta-\theta_1)^2;$$

whence

$$\frac{1}{\sigma^2} = -n\bar{b},$$

where

$$\bar{b} = \overline{\frac{\partial^2}{\partial\theta^2}\log f(\theta_1)};$$

the formula

$$\frac{1}{\sigma^2} = -S\left(\frac{\partial^2}{\partial\theta^2}\log f\right)$$
It may be seen that the above proof applies only to statistics obtained by the method
of maximum likelihood.*
For example, to find the standard deviation of the optimum value of p, namely x/n, in samples from a binomial distribution in which x successes and y failures are observed, we have
* A similar method of obtaining the standard deviations and correlations of statistics derived from large samples was developed by PEARSON and FILON in 1898 (16). It is unfortunate that in this memoir no sufficient distinction is drawn between the population and the sample, in consequence of which the formulae obtained indicate that the likelihood is always a maximum (for continuous distributions) when the mean of each variate in the sample is equated to the corresponding mean in the population (16, p. 232, "A = 0"). If this were so the mean would always be a sufficient statistic for location; but as we have already seen, and will see later in more detail, this is far from being the case. The same argument, indeed, is applied to all statistics, as to which nothing but their consistency can be truly affirmed.

The probable errors obtained in this way are those appropriate to the method of maximum likelihood, but not in other cases to statistics obtained by the method of moments, by which method the examples given were fitted. In the 'Tables for Statisticians and Biometricians' (1914), the probable errors of the constants of the Pearsonian curves are those proper to the method of moments; no mention is there made of this change of practice, nor is the publication of 1898 referred to.

It would appear that shortly before 1898 the process which leads to the correct value of the probable errors of optimum statistics was hit upon and found to agree with the probable errors of statistics found by the method of moments for normal curves and surfaces; without further enquiry it would appear to have been assumed that this process was valid in all cases, its directness and simplicity being peculiarly attractive. The mistake was at that time, perhaps, a natural one; but that it should have been discovered and corrected without revealing the inefficiency of the method of moments is a very remarkable circumstance. In 1903 the correct formulae for the probable errors of statistics found by the method of moments are given in 'Biometrika' (19); references are there given to SHEPPARD (20), whose method is employed, as well as to PEARSON and FILON (16), although both the method and the results differ from those of the latter.
$$\log f = C + x\log p + y\log(1-p),$$

so that

$$\frac{\partial}{\partial p}\log f = \frac{x}{p} - \frac{y}{1-p},\qquad
\frac{\partial^2}{\partial p^2}\log f = -\frac{x}{p^2} - \frac{y}{(1-p)^2}.$$

Now the mean value of x is pn, and of y is (1−p) n; hence the mean value of

$$-\frac{\partial^2}{\partial p^2}\log f$$

is

$$\frac{n}{p\,(1-p)};$$

therefore

$$\sigma^2 = \frac{p\,(1-p)}{n}.$$
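This result is easily checked by simulation. The sketch below is illustrative (the values of p, n, and the number of repetitions are arbitrary choices of ours): it draws repeated binomial samples, forms the optimum x/n for each, and compares the observed variance with p(1−p)/n.

```python
import random

random.seed(1)
p, n, reps = 0.3, 250, 4000

estimates = []
for _ in range(reps):
    x = sum(1 for _ in range(n) if random.random() < p)  # number of successes
    estimates.append(x / n)

mean = sum(estimates) / reps
var_observed = sum((e - mean) ** 2 for e in estimates) / reps
var_formula = p * (1 - p) / n
print(var_observed, var_formula)
```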
The region dθ dθ₁ evidently lies wholly in the isostatistical region dφ. Hence the equation

$$\frac{\partial}{\partial\theta}\log f(x;\,\theta,\,\theta_1) = 0$$

is satisfied, irrespective of θ₁, by the optimum value of θ, for which

$$\frac{\partial L}{\partial\theta} = 0;$$

and the variance of the statistic is given by

$$\frac{1}{\sigma^2} = -\overline{\frac{\partial^2 L}{\partial\theta^2}},$$
this latter formula being applicable only where θ₁ is normally distributed, as is often the case with considerable accuracy in large samples. The value of σ so found is in these cases the least possible value for the standard deviation of a statistic designed to estimate the same parameter; it may therefore be applied to calculate the efficiency of any other such statistic.
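By way of illustration (an example of ours, not drawn from the text): for the normal curve the optimum statistic for location attains the least variance σ²/n, while the variance of the median of a large sample approaches πσ²/2n, so that the efficiency of the median is 2/π, about 64 per cent. A short simulation exhibits this.

```python
import random

random.seed(2)
n, reps = 101, 3000

def median(xs):
    return sorted(xs)[len(xs) // 2]   # n is odd

means, medians = [], []
for _ in range(reps):
    sample = [random.gauss(0.0, 1.0) for _ in range(n)]
    means.append(sum(sample) / n)
    medians.append(median(sample))

def var(vs):
    m = sum(vs) / len(vs)
    return sum((v - m) ** 2 for v in vs) / len(vs)

efficiency = var(means) / var(medians)   # expect about 2/pi = 0.637
print(efficiency)
```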
When several parameters are determined simultaneously, we must equate the second differentials of L, with respect to the parameters, to the coefficients of the quadratic terms in the index of the normal expression which represents the distribution of the corresponding statistics. Thus with two parameters,

$$\frac{\partial^2 L}{\partial\theta_1^2} = -\frac{1}{1-r^2}\cdot\frac{1}{\sigma_1^2},\qquad
\frac{\partial^2 L}{\partial\theta_2^2} = -\frac{1}{1-r^2}\cdot\frac{1}{\sigma_2^2},\qquad
\frac{\partial^2 L}{\partial\theta_1\,\partial\theta_2} = \frac{1}{1-r^2}\cdot\frac{r}{\sigma_1\sigma_2};$$

or, in effect, σ₁² is found by dividing the minor of ∂²L/∂θ₁², in the Hessian determinant of L with respect to the parameters, by the value of the determinant itself.
Curves of PEARSON's Type III offer a good example for the calculation of the efficiency of the Method of Moments. The chance of an observation falling in the range dx is

$$df = \frac{1}{a\cdot p!}\left(\frac{x-m}{a}\right)^p e^{-\frac{x-m}{a}}\,dx.$$

By the method of moments the curve is located by means of the statistic μ̄₁ (the mean), its dimensions are ascertained from the second moment μ₂, and the remaining parameter p is determined from μ₃. Considering first the problem of location, if a and p were known and we had only to determine m, we should take, according to the method of moments,

$$\bar{\mu}_1 = m_1 + a\,(p+1),$$

where m₁ represents the estimate of the parameter m obtained by using the method of moments. The variance of m₁ is, therefore,

$$\sigma^2_{m_1} = \frac{\mu_2}{n} = \frac{a^2\,(p+1)}{n}.$$

If, on the other hand, we aim at greater accuracy, and make the likelihood of the sample a maximum for variations of m, we have

$$L = -n\log a - n\log p! + p\,S\left(\log\frac{x-m}{a}\right) - S\left(\frac{x-m}{a}\right),$$

whence

$$\frac{\partial L}{\partial m} = \frac{n}{a} - p\,S\left(\frac{1}{x-m}\right). \qquad (1)$$

Differentiating a second time, and inserting the mean value of (x−m)⁻², namely 1/a²p(p−1), we find

$$\overline{\frac{\partial^2 L}{\partial m^2}} = -\frac{n}{a^2\,(p-1)},\qquad
\sigma^2_m = \frac{a^2\,(p-1)}{n},$$

so that the efficiency of the method of moments in location is

$$\frac{p-1}{p+1}.$$

Efficiencies of over 80 per cent. for location are therefore obtained if p exceeds 9; for p = 1 the efficiency of location vanishes, as in other cases where the curve makes an angle with the axis at the end of its range.
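The factor (p−1)/(p+1) can be recovered numerically from the definitions. In the sketch below (illustrative only; a = 1, m = 0, and the value p = 10 are our conveniences) the mean of (x−m)⁻² is computed by quadrature over the Type III density, giving the optimum variance (p−1)/n, which is then compared with the moment variance (p+1)/n.

```python
import math

p = 10  # illustrative value of the index

# E[1/(x-m)^2] for the Type III curve with a = 1, m = 0:
# density x^p e^(-x) / p!, so we integrate x^(p-2) e^(-x) / p!.
steps, upper = 400000, 80.0
h = upper / steps
total = 0.0
for i in range(1, steps):
    xv = i * h
    total += xv ** (p - 2) * math.exp(-xv)
mean_inv_sq = total * h / math.factorial(p)   # should equal 1/(p(p-1))

var_optimum = 1.0 / (p * mean_inv_sq)         # times 1/n; equals p - 1
var_moments = p + 1.0                         # times 1/n
efficiency = var_optimum / var_moments        # expect (p-1)/(p+1)
print(efficiency)
```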
Turning now to the problem of scaling, we have, by the method of moments,

$$\mu_2 = a_1^2\,(p+1);$$

and since

$$\mu_4 = 3a^4\,(p+1)(p+3),$$

we must have

$$\sigma^2_{\mu_2} = \frac{\mu_4-\mu_2^2}{n} = \frac{2a^4\,(p+1)(p+4)}{n},$$

whence

$$\sigma^2_{a_1} = \frac{a^2\,(p+4)}{2n\,(p+1)}.$$

By the method of the optimum, on the other hand,

$$\frac{\partial L}{\partial a} = -\frac{n\,(p+1)}{a} + \frac{1}{a^2}\,S(x-m), \qquad (2)$$

and

$$\overline{\frac{\partial^2 L}{\partial a^2}} = -\frac{n\,(p+1)}{a^2},\qquad
\overline{\frac{\partial^2 L}{\partial m\,\partial a}} = -\frac{n}{a^2};$$

dividing the minor of ∂²L/∂a² by the determinant of the second differential coefficients in m and a, which reduces to 2n²/a⁴(p−1), we find

$$\sigma^2_a = \frac{a^2}{2n},$$

so that the efficiency of the method of moments in scaling is

$$\frac{p+1}{p+4}.$$

Efficiencies of over 80 per cent. for scaling are, therefore, obtained when p exceeds 11. The efficiency of scaling does not, however, vanish for any possible value of p, though it diminishes without limit as p approaches its limiting value, −1.
Now, by the method of moments, the remaining parameter p is determined from the third moment,

$$\mu_3 = 2a^3\,(p+1);$$

using the higher moments of the Type III curve,

$$\mu_5 = 4a^5\,(p+1)(5p+11),\qquad \mu_6 = 5a^6\,(p+1)(3p^2+32p+53),$$

the variance of p₁, as so determined, is found to be

$$\sigma^2_{p_1} = \frac{6}{n}\,(p+1)(p+2)(p+6).$$

By the method of the optimum, on the other hand, we have, for variations of p,

$$\frac{\partial L}{\partial p} = S\left(\log\frac{x-m}{a}\right) - n\,\frac{d}{dp}\log p! = 0,$$

which equation, solved for m, a and p as a simultaneous equation with (1) and (2), will yield the set of values of the parameters which has the maximum likelihood. To find
the variance of the value of p so obtained, observe that

$$\overline{\frac{\partial^2 L}{\partial m\,\partial p}} = -\frac{n}{ap},\qquad
\overline{\frac{\partial^2 L}{\partial a\,\partial p}} = -\frac{n}{a},\qquad
\frac{\partial^2 L}{\partial p^2} = -n\,\frac{d^2}{dp^2}\log p!.$$

The variance is then found by dividing the minor of ∂²L/∂p², namely

$$\frac{2n^2}{a^4\,(p-1)},$$

by the determinant of the whole set of second differential coefficients, which reduces to

$$\frac{n^3}{a^4}\cdot\frac{1}{p-1}\left\{2\,\frac{d^2}{dp^2}\log p! - \frac{2p-1}{p^2}\right\};$$

hence

$$\sigma^2_p = \frac{2}{n}\div\left\{2\,\frac{d^2}{dp^2}\log p! - \frac{2p-1}{p^2}\right\}.$$
When p is large,

$$\frac{d^2}{dp^2}\log p! = \frac{1}{p} - \frac{1}{2p^2} + \frac{1}{6p^3} - \ldots,$$

so that

$$2\,\frac{d^2}{dp^2}\log p! - \frac{2p-1}{p^2} = \frac{1}{3p^3}\left(1-\frac{1}{5p^2}\right)$$

approximately, and therefore, approximately,

$$\sigma^2_p = \frac{6}{n}\left(p^3+\frac{p}{5}\right);$$

for large values of p, the efficiency of the method of moments is, therefore, approximately

$$\frac{p^3}{(p+1)(p+2)(p+6)}.$$
Efficiencies of over 80 per cent. occur when p exceeds 38.1 (β₁ = 0.102); evidently the method of moments is effective for determining the form of the curve only when it is relatively close to the normal form. For small values of p, the above approximation for the efficiency is not adequate. The true values can easily be obtained from the recently published tables of the Trigamma* function (11). The following values are obtained for the integral values of p from 0 to 5.

p . . . . . .      0        1        2        3        4        5
Efficiency . .     0      0.0274   0.0871   0.1532   0.2159   0.2727
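These tabular values can be reproduced from the two variance formulae above. The sketch below is ours, not Fisher's: it computes d²/dp² log p! for integral p from the elementary series for the trigamma function, and divides the optimum variance by the moment variance 6(p+1)(p+2)(p+6)/n.

```python
import math

def trigamma_log_factorial(p):
    """d^2/dp^2 log(p!) at integral p, i.e. trigamma(p+1) = pi^2/6 - sum_{k=1}^p 1/k^2."""
    return math.pi ** 2 / 6 - sum(1.0 / k ** 2 for k in range(1, p + 1))

def efficiency(p):
    var_optimum = 2.0 / (2.0 * trigamma_log_factorial(p) - (2.0 * p - 1.0) / p ** 2)
    var_moments = 6.0 * (p + 1) * (p + 2) * (p + 6)
    return var_optimum / var_moments

print([round(efficiency(p), 4) for p in range(1, 6)])
# reproduces the tabulated efficiencies for p = 1, ..., 5
```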
An interesting problem is to find the variance of m, when a and p are not known, derived from the above set of simultaneous equations; that is to say, to calculate the accuracy with which the limiting point of the curve is determined; such determinations are often stated as the result of fitting curves of limited range, but their probable errors are seldom, if ever, evaluated.
To obtain the greatest possible accuracy with which such a point can be determined we must divide the minor of ∂²L/∂m², namely,

$$\frac{n^2}{a^2}\left\{(p+1)\,\frac{d^2}{dp^2}\log p! - 1\right\},$$

by the determinant

$$\frac{n^3}{a^4}\cdot\frac{1}{p-1}\left\{2\,\frac{d^2}{dp^2}\log p! - \frac{2p-1}{p^2}\right\};$$

whence

$$\sigma^2_m = \frac{a^2\,(p-1)}{n}\left\{(p+1)\,\frac{d^2}{dp^2}\log p! - 1\right\}\div\left\{2\,\frac{d^2}{dp^2}\log p! - \frac{2p-1}{p^2}\right\}.$$
The position of the limiting point will, when p is at all large, evidently be determined with much less accuracy than is the position, as a whole, of a curve of known form and size.

* It is sometimes convenient to write ψ(x) for (d²/dx²) log (x!).

Let n′ be a multiplier such that the position of the extremity of a curve calculated from nn′ observations will be determined with the same accuracy as the position, as a whole, of a curve of known form and size, can be determined from a sample of n observations when n is large. Then
$$n' = \left\{(p+1)\,\frac{d^2}{dp^2}\log p! - 1\right\}\div\left\{2\,\frac{d^2}{dp^2}\log p! - \frac{2p-1}{p^2}\right\}.$$

Now, for large values of p,

$$(p+1)\,\frac{d^2}{dp^2}\log p! - 1 = \frac{1}{2p}\left(1-\frac{2}{3p}+\ldots\right),$$

and

$$2\,\frac{d^2}{dp^2}\log p! - \frac{2p-1}{p^2} = \frac{1}{3p^3}\left(1-\frac{1}{5p^2}\right);$$

therefore, approximately,

$$n' = \frac{3}{2}\,p^2 - p + \ldots.$$

For large values of p the probable error of the determination of the end-point may be found approximately by multiplying the probable error of location by

$$\sqrt{\tfrac{3}{2}}\left(p-\tfrac{1}{3}\right).$$

As p grows smaller, n′ diminishes until it reaches unity when p = 1. For values of p less than 1 it would appear that the end-point had a smaller probable error than the probable error of location; but, as a matter of fact, for these values location is determined by the end-point, and, as we see from the vanishing of σ²ₘ, whether or not p and a are known, when p = 1, the weight of the determination from this point onwards increases more rapidly than n, as the sample increases. (See Section 10.)
The above method illustrates how it is possible to calculate the variance of any function of the population parameters, as estimated from large samples; by comparing this variance with that of the same function estimated by the method of moments, we may find the efficiency of that method for any proposed function. The above examination, in which the determinations of the locus, the scale, and the form of the curve are treated separately, will serve as a general criterion of the application of the method of moments to curves of Type III. Special combinations of the parameters will, however, be of interest in special cases. It may be noted here that by virtue of equation (2) the function m + a(p+1) is the same, whether determined by moments or by the method of the optimum:

$$m_1 + a_1\,(p_1+1) = m + a\,(p+1).$$

The efficiency of the method of moments in determining this function is therefore 100 per cent.*

* That this function is the abscissa of the mean does not imply 100 per cent. efficiency of location, for the centre of location of these curves is not the mean (see p. 340).
The general problem of the location and scaling of curves may now be treated more generally. This is the problem which presents itself with respect to error curves of assumed form, when to find the best value of the quantity measured we must locate the curve as accurately as possible, and to find the probable error of the result of this process we must, as accurately as possible, estimate its scale.

The form of the curve may be specified by a function φ, such that

$$df \propto \phi\,d\xi,\qquad\text{when}\qquad \xi = \frac{x-m}{a}.$$

In this expression φ specifies the form of the curve, which is unaltered by variations of a and m.

When a sample of n observations has been taken, the likelihood of any combination of values of a and m is

$$L = C - n\log a + S(\log\phi),$$

whence

$$\frac{\partial L}{\partial m} = -\frac{1}{a}\,S\left(\frac{d}{d\xi}\log\phi\right),$$

since

$$\frac{\partial\xi}{\partial m} = -\frac{1}{a};$$

also

$$\frac{\partial L}{\partial a} = -\frac{1}{a}\,S\left(1+\xi\,\frac{d}{d\xi}\log\phi\right),$$

since

$$\frac{\partial\xi}{\partial a} = -\frac{\xi}{a}.$$

Differentiating a second time,

$$\overline{\frac{\partial^2 L}{\partial m^2}} = \frac{n}{a^2}\;\overline{\frac{d^2}{d\xi^2}\log\phi},$$

therefore

$$\sigma^2_m = -\frac{a^2}{n}\div\overline{\frac{d^2}{d\xi^2}\log\phi}.$$

This expression enables us to compare the accuracy of error curves of different form, when the location is performed in each case by the method which yields the minimum error.
Example. The curve

$$df = \frac{1}{\pi}\cdot\frac{dx}{1+x^2},$$

referred to in Section 5, has an infinite standard deviation, but it is not on that account an error curve of zero accuracy, for

$$\log\phi = -\log\,(1+\xi^2),$$

$$\frac{d}{d\xi}\log\phi = -\frac{2\xi}{1+\xi^2},\qquad
\frac{d^2}{d\xi^2}\log\phi = -\frac{2\,(1-\xi^2)}{(1+\xi^2)^2}.$$

Now the mean value of

$$\frac{2\,(1-\xi^2)}{(1+\xi^2)^2}$$

is

$$\frac{2}{\pi}\int_{-\infty}^{\infty}\frac{(1-\xi^2)\,d\xi}{(1+\xi^2)^3} = \frac{1}{2};$$

hence

$$\sigma^2_m = \frac{2a^2}{n}.$$
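The mean value 1/2 is readily confirmed by quadrature; in this illustrative sketch (ours) the infinite range is truncated at ±200, which introduces only a negligible error.

```python
import math

# Mean of 2(1 - u^2)/(1 + u^2)^2 over the curve df = du / (pi (1 + u^2)):
# numerically integrate 2(1 - u^2) / (pi (1 + u^2)^3).
steps, limit = 400000, 200.0
h = 2 * limit / steps
total = 0.0
for i in range(steps + 1):
    u = -limit + i * h
    total += 2.0 * (1.0 - u * u) / (math.pi * (1.0 + u * u) ** 3)
mean_value = total * h
print(mean_value)   # approaches 1/2, so that sigma^2 = 2 a^2 / n
```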
Errors so distributed have the same intrinsic accuracy as errors distributed according to the normal curve

$$df = \frac{1}{\sigma\sqrt{2\pi}}\,e^{-\frac{x^2}{2\sigma^2}}\,dx,$$

provided

$$\sigma^2 = 2a^2.$$
For the determination of the scale we have

$$\frac{\partial L}{\partial a} = -\frac{1}{a}\,S\left(1+\xi\,\frac{d}{d\xi}\log\phi\right),$$

and

$$-\overline{\frac{\partial^2 L}{\partial a^2}} = \frac{n}{a^2}\left(1-\overline{\xi^2\,\frac{d^2}{d\xi^2}\log\phi}\right).$$

The latter expression will directly give the accuracy with which a is determined only if

$$\overline{\frac{\partial^2 L}{\partial m\,\partial a}} = 0,$$

and we can always arrange that this shall be so by subtracting from ξ the quantity

$$\overline{\xi\,\frac{d^2}{d\xi^2}\log\phi}\;\div\;\overline{\frac{d^2}{d\xi^2}\log\phi}.$$
Thus in a Type III curve, where, referred to the end of the range,

$$\log\phi = p\log\xi - \xi,\qquad
\frac{d}{d\xi}\log\phi = \frac{p}{\xi}-1,\qquad
\frac{d^2}{d\xi^2}\log\phi = -\frac{p}{\xi^2},$$

instead of measuring ξ from the end of the range we must write

$$\log\phi = p\log\,(\xi+p-1) - (\xi+p-1);$$

then

$$\frac{d}{d\xi}\log\phi = \frac{p}{\xi+p-1}-1,\qquad
\frac{d^2}{d\xi^2}\log\phi = -\frac{p}{(\xi+p-1)^2},$$

and the mean value of ∂²L/∂m∂a vanishes; hence

$$1-\overline{\xi^2\,\frac{d^2}{d\xi^2}\log\phi} = 2,\qquad
\sigma^2_a = \frac{a^2}{2n}.$$

For one particular point of origin, therefore, the variations of the abscissa are uncorrelated with those of a; this point may be termed the centre of location.
Example. To determine the centre of location of the curve of Type IV,

$$df \propto \left(1+\xi^2\right)^{-\frac{r+2}{2}}\,e^{-v\tan^{-1}\xi}\,d\xi.$$

Here

$$\log\phi = -\frac{r+2}{2}\,\log\,(1+\xi^2) - v\tan^{-1}\xi,$$

whence

$$\frac{d}{d\xi}\log\phi = -\frac{(r+2)\,\xi+v}{1+\xi^2},\qquad
\frac{d^2}{d\xi^2}\log\phi = -\frac{(r+2)(1-\xi^2)-2v\xi}{(1+\xi^2)^2};$$

taking the mean values of ξ (d²/dξ²) log φ and of (d²/dξ²) log φ, the centre of location is found to lie at the distance

$$\frac{av}{r+4}$$

from the mode.
Example. Determine the intrinsic accuracy of an error curve of Type IV, and the efficiency of the method of moments in location and scaling. Since

$$-\overline{\frac{d^2}{d\xi^2}\log\phi} = \frac{(r+1)(r+2)(r+4)}{(r+4)^2+v^2},$$

we have

$$\sigma^2_m = \frac{a^2\left\{(r+4)^2+v^2\right\}}{n\,(r+1)(r+2)(r+4)};$$

but

$$\mu_2 = \frac{a^2\,(r^2+v^2)}{r^2\,(r-1)},$$

so that the efficiency of the method of moments in location is

$$E = \frac{r^2\,(r-1)\left\{(r+4)^2+v^2\right\}}{(r+1)(r+2)(r+4)(r^2+v^2)}. \qquad (3)$$

When v = 0,

$$\sigma^2_m = \frac{a^2\,(r+4)}{n\,(r+1)(r+2)},\qquad E = \frac{(r-1)(r+4)}{(r+1)(r+2)}.$$

The efficiency of location of these curves vanishes at r = 1, at which value the standard deviation becomes infinite. Although values down to −1 give admissible frequency curves, the conventional limit at which curves are reckoned as heterotypic is at r = 7. For this value the efficiency is

$$\frac{49}{132}\cdot\frac{121+v^2}{49+v^2},$$

which varies from 91.67 per cent. for the symmetrical Type VII curve, to 37.12 per cent. as v → ∞ and the curve approaches Type V.
For the scaling,

$$1-\overline{\xi^2\,\frac{d^2}{d\xi^2}\log\phi} = \frac{2\,(r+1)}{r+4},$$

whence

$$\sigma^2_a = \frac{a^2\,(r+4)}{2n\,(r+1)};$$

the intrinsic accuracy of scaling is therefore independent of v. Now for these curves
$$\beta_2-1 = \frac{2\left\{r^3\,(r-2)+v^2\,(r^2+10r-12)\right\}}{(r-2)(r-3)(r^2+v^2)},$$

reducing, when v = 0, to

$$\beta_2 = \frac{3\,(r-1)}{r-3};$$

and the variance of a₁, as determined from the second moment, is

$$\sigma^2_{a_1} = \frac{a^2\,(\beta_2-1)}{4n},$$

so that the efficiency of the method of moments in scaling is

$$E = \frac{2\,(r+4)}{(r+1)(\beta_2-1)} = \frac{(r-2)(r-3)(r+4)(r^2+v^2)}{(r+1)\left\{r^3\,(r-2)+v^2\,(r^2+10r-12)\right\}}. \qquad (4)$$

The efficiency of the method of moments in scaling these curves vanishes at r = 3, where β₂ becomes infinite; for r = 7, the efficiency of scaling is

$$\frac{55}{2}\cdot\frac{49+v^2}{1715+107v^2},$$

varying in value from 78.57 per cent. for the symmetrical Type VII curve, to 25.70 per cent. as v → ∞ and the curve approaches Type V.
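The general formulae (3) and (4), as reconstructed above, can be checked against the percentages quoted for r = 7. In the sketch below (ours, not Fisher's) the limit v → ∞ is approximated by a very large value of v².

```python
def eff_location(r, v2):
    """Formula (3): efficiency of the mean in locating a Type IV curve; v2 stands for v squared."""
    return r**2 * (r - 1) * ((r + 4)**2 + v2) / ((r + 1) * (r + 2) * (r + 4) * (r**2 + v2))

def eff_scaling(r, v2):
    """Formula (4): efficiency of the second moment in scaling a Type IV curve."""
    return ((r - 2) * (r - 3) * (r + 4) * (r**2 + v2)
            / ((r + 1) * (r**3 * (r - 2) + v2 * (r**2 + 10 * r - 12))))

big = 1e12   # stands in for v -> infinity
print(round(eff_location(7, 0), 4), round(eff_location(7, big), 4),
      round(eff_scaling(7, 0), 4), round(eff_scaling(7, big), 4))
# 91.67 and 37.12 per cent. for location; 78.57 and 25.70 per cent. for scaling
```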
10. THE EFFICIENCY OF THE METHOD OF MOMENTS IN FITTING THE PEARSONIAN
CURVES.
The Pearsonian group of skew curves are obtained as solutions of the equation

$$\frac{1}{y}\,\frac{dy}{dx} = -\frac{x-m}{a+bx+cx^2}, \qquad (5)$$

the solutions of which may be put into the forms

$$df \propto \left(1+\frac{x}{a_1}\right)^{m_1}\left(1-\frac{x}{a_2}\right)^{m_2}dx$$

and

$$df \propto \left(1+\frac{x^2}{a^2}\right)^{-\frac{r+2}{2}}\,e^{-v\tan^{-1}\frac{x}{a}}\,dx,$$

according as the roots of the quadratic expression in (5) are real or imaginary.
When the roots are real, and equal in magnitude but opposite in sign, the former expression gives the symmetrical curves

$$df \propto \left(1-\frac{x^2}{a^2}\right)^{-\frac{r+2}{2}}\,dx.$$

(2) When the curve extends to infinity, the ordinate, when x is large, must diminish more rapidly than 1/x;
Fig. 2. [Diagram of the Pearsonian system of curves; the line v = 0, the Heterotypic Limit r = 7, and the Limit of the diagram r = 3 are marked.]
the lines AC and AC′ represent the limits along which the area between the curve and a vertical ordinate tends to infinity, and on which m₁, or m₂, = −1; the line CC′ represents the limit at which unbounded curves enclose an infinite area with the horizontal axis; at this limit r = −1.

The symmetrical curves of Type II,

$$df \propto \left(1-\frac{x^2}{a^2}\right)^{-\frac{r+2}{2}}\,dx,$$

extend from the point N, representing the normal curve, at which r is infinite, through the point P, at which r = −4 and the curve is a parabola, to the point B (r = −2), where the curve takes the form of a rectangle; from this point the curves are U-shaped, and at A, where the arms of the U are hyperbolic, we have the limiting curve of this type, which is the discontinuous distribution of equal or unequal dichotomy (r = 0).

The unsymmetrical curves of Type I are divided by PEARSON into three classes according as the terminal ordinate is infinite at neither end, at one end (J curves), or at both ends (U curves); the dividing lines are C′BD and CBD′, along which one of the terminal ordinates is finite (m₁, or m₂, = 0); the U curves extend to F (p = −1), at which point the integral ceases to converge. In curves of Type III, r is infinite; v is also infinite, but one of the quantities m₁ and m₂ is finite, or zero (= p); as p tends to infinity we approach the normal curve

$$df \propto e^{-x^2}\,dx.$$
Type VI, like Type III, consists of curves bounded only at one end; here r is positive, and both m₁ and m₂ are finite or zero. For the J curves of Type VI both m₁ and m₂ are negative, but for the remainder of these curves they are of opposite sign, the negative index being numerically the greater by at least unity, in order that the representative curve, which when x is large behaves as

$$df \propto x^{m_1+m_2}\,dx,$$

may enclose a finite area. As r tends to infinity the curve tends to the normal form; the integral does not become divergent until m₁ + m₂ = −1, that is r = −1.
In Type IV,

$$df \propto \left(1+\frac{x^2}{a^2}\right)^{-\frac{r+2}{2}}\,e^{-v\tan^{-1}\frac{x}{a}}\,dx;$$

we have written v, not as previously for the difference between m₁ and m₂, for these quantities are now complex, and their difference is a pure imaginary, but for the difference divided by √−1; v is then real and finite throughout Type IV, and it vanishes along the line NS, representing the symmetrical curves of Type VII,

$$df \propto \left(1+\frac{x^2}{a^2}\right)^{-\frac{r+2}{2}}\,dx.$$
It might have been thought that use could have been made of the criterion,

$$\kappa = \frac{\beta_1\,(\beta_2+3)^2}{4\,(4\beta_2-3\beta_1)(2\beta_2-3\beta_1-6)},$$

by which PEARSON distinguishes these curves; but this criterion is only valid in the region treated by PEARSON.* For when r = 0, κ = 1, and we should have to place a variety of curves of Types VII, IV, V, and VI all in Type V in order to adhere to the criterion.

This diagram gives, I believe, the simplest possible conspectus of the whole of the Pearsonian system of curves; the inclusion of the curves beyond r = 3 becomes necessary as soon as we take a view unrestricted by the method of moments. Of the so-called heterotypic curves between r = 3 and r = 7 it should be noticed that they not only fall into the ordinary Pearsonian types, but have finite values for the moment coefficients β₁ and β₂; they differ from those in which r exceeds 7 merely in the fact that the value of β₂, calculated from the fourth moment of a sample, has an infinite probable error. It is therefore evident that this is not the right method to treat the sample, but this does not constitute, as it has been called, "the failure of Type IV," but merely the failure of the method of moments to make a valid estimate of the form of these curves. As we shall see in more detail, the method of moments, when its efficiency is tested, fails equally in other parts of the diagram.

* The true limit is the line β₂ = β₁ + 1, along which the curves degenerate into simple dichotomies.
In expression (3) we have found that the efficiency of the method of moments for the location of a curve of Type IV is

$$E = \frac{r^2\,(r-1)\left\{(r+4)^2+v^2\right\}}{(r+1)(r+2)(r+4)(r^2+v^2)},$$

whence, if we substitute for r and v in terms of the co-ordinates of our diagram, we obtain a general formula for the efficiency of the method of moments in locating Pearsonian curves, which is applicable within the boundary of the zero contour (fig. 3).

Fig. 3. [Region of validity of the first moment (the mean) applied in the location of Pearsonian curves.]

This may be called the region of validity of the first moment; it is bounded at the base by the line r = 1, so that the first moment is valid far beyond the heterotypic limit; its other boundary, however, represents those curves which make a finite angle with the axis at the end of their range (m₁, or m₂, = 1). This boundary has a double point at P, which thus forms the apex of the region of validity.
In fig. 3 are shown the contours along which the efficiency is 20, 40, 60, and 80 per cent.

Fig. 4. [Contours of equal efficiency of the second moment applied to the scaling of Pearsonian curves.]

The corresponding region of validity for the second moment extends over part of the J curves, though the maximum efficiency among the J curves is about 30 per cent. As before, the contours are centred about the normal curve (N) and for high efficiencies tend to the system of concentric circles,

$$12x^2 + 12y^2 = 1-E,$$

showing that the region of high efficiency is somewhat more restricted for the second moment, as compared to the first.

The lower boundary to the efficiencies of these statistics is due merely to their probable errors becoming infinite, a weakness of the method of moments which has been partially recognised by the exclusion of the so-called heterotypic curves (r < 7). The stringency of the upper boundary is much more unexpected; the probable errors of the moments do not here become infinite; only the ratio of the probable errors of the moments to the probable errors of the corresponding optimum statistics is great, and tends to infinity as the size of the sample is increased.
That this failure as regards location occurs when the curve makes a finite angle with the axis may be seen by considering the occurrence of observations near the terminus of the curve. Let

$$df = k\,x^a\,dx$$

in the neighbourhood of the terminus; then the chance of an observation falling within a distance x of the terminus is

$$\frac{k}{a+1}\,x^{a+1};$$

when a < 1, this quantity diminishes more rapidly than x², and consequently for large samples it is much more accurate to locate the curve by the extreme observation than by the mean.
Since it might be doubted whether such a simple method could really be more accurate than the process of finding the actual mean, we will take as example the location of the curve (B) in the form of a rectangle,

$$df = \frac{dx}{a},\qquad m-\frac{a}{2} < x < m+\frac{a}{2},$$

and

$$df = 0$$

outside these limits. The standard deviation of this distribution is a/√12, so that the error curve of the mean of a sample of n observations is, for large n, the normal curve

$$df = \sqrt{\frac{6n}{\pi}}\;\frac{1}{a}\,e^{-\frac{6nx^2}{a^2}}\,dx.$$

The difference of the extreme observation from the end of the range is distributed as

$$df = \frac{n}{a}\,e^{-\frac{n\xi}{a}}\,d\xi;$$

if ξ is the difference at one end of the range and η the difference at the other end, the joint distribution (since, when n is considerable, these two quantities may be regarded as independent) is

$$df = \frac{n^2}{a^2}\,e^{-\frac{n}{a}(\xi+\eta)}\,d\xi\,d\eta.$$
Now if we take the mean of the extreme observations of the sample, our error is ½(ξ−η), for which we write x; writing also y for ½(ξ+η), we have the joint distribution of x and y,

$$df = \frac{2n^2}{a^2}\,e^{-\frac{2ny}{a}}\,dx\,dy.$$

For a given value of x the values of y range from |x| to ∞, whence, integrating with respect to y, we find the distribution of x to be

$$df = \frac{n}{a}\,e^{-\frac{2n|x|}{a}}\,dx.$$
Fig. A. [The two error curves, drawn on the same scale, the abscissae running from −25 to 25.]
The two error curves are thus of a radically different form, and strictly no value for the efficiency can be calculated; if, however, we consider the ratio of the two standard deviations, then

$$\sigma^2_1 = \frac{a^2}{2n^2}\ \text{for the mean of the extremes},\qquad
\sigma^2_2 = \frac{a^2}{12n}\ \text{for the mean of the sample},$$

so that

$$\frac{\sigma^2_1}{\sigma^2_2} = \frac{6}{n}$$

when n is large, a quantity which diminishes indefinitely as the sample is increased.
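The ratio 6/n is easily confirmed by simulation. The sketch below is illustrative, taking a = 1 and m = ½, with n = 100 as in the experiment which follows; it compares the mean square error of the mean of the extreme observations with that of the sample mean.

```python
import random

random.seed(3)
n, reps = 100, 4000

mid_errors, mean_errors = [], []
for _ in range(reps):
    sample = [random.random() for _ in range(n)]              # rectangle on (0, 1)
    mid_errors.append((min(sample) + max(sample)) / 2 - 0.5)  # mean of extremes
    mean_errors.append(sum(sample) / n - 0.5)                 # sample mean

def mean_square(vs):
    return sum(v * v for v in vs) / len(vs)

ratio = mean_square(mid_errors) / mean_square(mean_errors)    # expect about 6/n = 0.06
print(ratio)
```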
For example, we have taken from VEGA (14) sets of digits from the table of Natural Logarithms to 48 places of decimals. The last block of four digits was taken from the logarithms of 100 consecutive numbers from 101 to 200, giving a sample of 100 numbers distributed evenly over a limited range. It is sufficient to take the three first digits to the nearest integer; then each number has an equal chance of all values between 0 and 1000. The true mean of the population is 500, and the standard deviation 289. The standard error of the mean of a sample of 100 is therefore 28.9.

Twenty-five such samples were taken, using the last five blocks of digits, for the logarithms of the numbers from 101 to 600, and the mean determined merely from the highest and lowest number occurring; the following values were obtained:
(In each cell are given the lowest and highest numbers observed, and the resulting error of the mean of the extremes.)

Digits.     1st hundred.        2nd hundred.        3rd hundred.        4th hundred.        5th hundred.
45-48     1   999    0.0     24  978   +1.0     39  980   +9.5      3  960  -18.5      6  997   +1.5
41-44    16   983   -0.5     18  994   +6.0      1  978  -10.5      4  979   -8.5      4  978   -9.0
37-40     9   988   -1.5     11  999   +5.0     31  984   +7.5      3  995   -1.0     13  997   +5.0
33-36     4   998   +1.0      0  994   -3.0     ..  988    ..       3  988   -4.5      4  992   -2.0
29-32     1   996   -1.5     ..  ..     ..       2  986   -6.0      3  981   -8.0     21  977   -1.0
It will be seen that these errors rarely exceed one-half of the standard error of the mean of the sample. The actual mean square error of these 25 values is 6.86, while the calculated value, √50, is 7.07. It will therefore be seen that, with samples of only 100, there is no exaggeration in placing the efficiency of the method of moments as low as 6 per cent. in comparison with the more accurate method, which in this case happens to be far less laborious.

Such a value for the efficiency of the mean in this case is, however, purely conventional, since the curve of distribution is outside the region of its valid application, and the two curves of sampling do not tend to assume the same form. It is, however, convenient to have an estimate of the effectiveness of statistics for small samples, and in such cases we should prefer to treat the curve of distribution of the statistic as an error curve, and to judge the effectiveness of the statistic by the intrinsic accuracy of the curve, as defined in Section 9. Thus the intrinsic accuracy of the curve of distribution of the mean of all the observations is

$$\frac{12n}{a^2},$$
while that of the curve of distribution of the mean of the extreme observations is

$$\frac{2n^2}{a^2},$$

so that the relative effectiveness of the two statistics is again represented by the ratio 6/n.

In the same way the four parameters of a curve of Type IV may be treated simultaneously. The second differential coefficients of L with respect to m, a, v, and r may all be expressed in terms of r, v, and the derivatives of log F, such as (∂²/∂r²) log F, where

$$F = e^{-\frac{1}{2}v\pi}\int_0^{\pi}e^{v\theta}\,\sin^r\theta\,d\theta.$$

The ratios of the minors of this determinant to the value of the determinant give the standard deviations and correlations of the optimum values of the four parameters obtained from a number of large samples.

In discussing the efficiency of the method of moments in respect of the form of the curve, it is doubtful if it be possible to isolate in a unique and natural manner, as we have done in respect of location and scaling, a series of parameters which shall successively represent different aspects of the process of curve fitting. Thus we might find the efficiencies with which r and v are determined by the method of moments, or those of the parametric functions corresponding to β₁ and β₂, or we might use m₁ and m₂ as
ellipses are coaxial, the deviations of r and v being uncorrelated; in the case of Type VII we put v = 0 in the determinant given above, and the variances of the optimum values of v and r are then expressible in terms of r and the transcendental quantities tabulated below.

By the method of moments, v is determined from the third moment and r from the fourth; for the symmetrical curves

$$\beta_2 = \frac{3\,(r-1)}{r-3},$$

and the variance of r, as determined by moments, is

$$\sigma^2_{r_1} = \frac{2\,r\,(r-1)^2\,(r-3)(r^2-r+18)}{3n\,(r-5)(r-7)};$$

while for the third moment of these curves

$$n\,\sigma^2_{\sqrt{\beta_1}} = \frac{6\,(r^2+r+10)}{(r-3)(r-5)}.$$

The following table gives the values of the transcendental quantities required, and the efficiency of the method of moments in estimating the values of v and r from samples drawn from a Type VII distribution.
r.      Transcendental quantity (v₁).   Efficiency of v₁.   Transcendental quantity (r₁).   Efficiency of r₁.
 5          5.31271                        0                     ..                             ..
 6          5.31736                        0.2572                ..                             ..
 7          5.32060                        0.4338               5.9473                          0
 8          5.32296                        0.5569               5.9574                          0.1687
 9          5.32472                        0.6449               5.9649                          0.3130
10          5.32607                        0.7097               5.9706                          0.4403
11          5.32713                        0.7586               5.9750                          0.5207
12          5.32797                        0.7963               5.9787                          0.5935
13          5.32866                        0.8259               5.9810                          0.6519
14          5.32919                        0.8497               5.9839                          0.6990
15             ..                            ..                 5.9853                          0.7376
16             ..                            ..                 5.9870                          0.7694
17             ..                            ..                 5.9883                          0.7959
18             ..                            ..                 5.9895                          0.8182
It will be seen that we do not attain to 80 per cent. efficiency in estimating the form
of the curve until r is about 17.2, which corresponds to β₂ = 3.42. Even for symmetrical
curves, higher values of β₂ imply that the method of moments makes use of
less than four-fifths of the information supplied by the sample.
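The figure of 17.2 is recovered by linear interpolation in the efficiency-of-r column of the table; the assignment of the last two tabulated entries to r = 17 and r = 18 is an assumed reading of the table.

```python
# Last two entries of the efficiency-of-r column, read as r = 17 and r = 18:
r_lo, eff_lo = 17, 0.7959
r_hi, eff_hi = 18, 0.8182
r_80 = r_lo + (0.80 - eff_lo) / (eff_hi - eff_lo) * (r_hi - r_lo)
print(round(r_80, 1))   # 17.2
```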
On the other side of the normal point, among the Type II. curves, very similar formulae
apply. The fundamental Hessian is
[the same in form as before, with r−1, r−2 and r−4 written for r+1, r+2 and r+4; the
detailed expressions are not fully legible in this copy. Using the relation

Γ(½(r−2)) = ½(r−4) Γ(½(r−4)),

it follows that the quantity which governs the efficiency of ν is the same function of
r−4 as the corresponding Type VII. quantity is of r; and, in a similar manner, the
quantity which governs the efficiency of r is the same function of r−3 as its Type VII.
analogue is of r.]
In all these functions and those of the following table, r must be substituted as a
positive quantity, although it must not be forgotten that r changes sign as we pass from
Type VII. to Type II., and we have hitherto adhered to the convention that r is to
be taken positive for Type VII. and negative for Type II.
r.    [Quantity for ν.]    Efficiency of ν.    [Quantity for r.]    Efficiency of r.
 2       4                    0                    4                   0
 3       4.93480              0.0576               5.1595              0.0431
 4       5.15947              0.2056               5.5648              0.1445
 5       5.23966              0.3590               5.7410              0.2613
 6       5.27578              0.4865               5.8305              0.3708
 7       5.29472              0.5857               5.8813              0.4653
 8       5.30576              0.6615               5.9126              0.5441
 9       5.31271              0.7198               5.9331              0.6090
10       5.31736              0.7650               5.9473              0.6624
11       5.32060              0.8005               5.9574              0.7063
12       5.32296              0.8287               5.9649              0.7427
13       5.32472              0.8516               5.9706              0.7731
14       5.32607              0.8702               5.9750              0.7986
15         ..                   ..                 5.9787              0.8202

(As before, the column headings containing the transcendental expressions are not
legible in this copy.)
In both cases the region of validity is bounded by the rectangle at the point B
(fig. 2, p. 343). Efficiency of 80 per cent. is reached when r is about 14.1 (β₂ = 2.65).
Thus for symmetrical curves of the Pearsonian type we may say that the method of
moments has an efficiency of 80 per cent. or more, when β₂ lies between 2.65 and 3.42.
The limits within which the values of the parameters obtained by moments cannot be
greatly improved are thus much narrower than has been imagined.
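Both quoted limits are consistent with the relation β₂ = 3(r − 1)/(r − 3), with r taken negative for Type II.; this formula is an assumed reading of the expressions above, not stated explicitly in the surviving text.

```python
def beta2(r):
    # assumed Type VII form; r is taken negative for Type II
    return 3 * (r - 1) / (r - 3)

print(round(beta2(17.2), 2))    # 3.42, the Type VII boundary quoted
print(round(beta2(-14.1), 2))   # 2.65, the Type II boundary quoted
```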
11. THE REASON FOR THE EFFICIENCY OF THE METHOD OF MOMENTS IN A SMALL
REGION SURROUNDING THE NORMAL CURVE.
We have seen that the method of moments applied in fitting Pearsonian curves has
an efficiency exceeding 80 per cent. only in the restricted region for which β₂ lies between
the limits 2.65 and 3.42, and, as we have seen in Section 8, for which β₁ does not exceed
0.1. The contours of equal efficiency are nearly circular or elliptical within these
limits, if the curves are represented as in fig. 2, p. 343, and are ultimately centred round
the normal point, at which point the efficiencies of all parameters tend to 100 per cent.
It was, of course, to be expected that the first two moments would have 100 per cent.
efficiencies at this point, for they happen to be the optimum statistics for fitting
the normal curve. That the moment coefficients β₁ and β₂ also tend to 100 per cent.
efficiency in this region suggests that in the immediate neighbourhood of the normal
curve the departures from normality specified by the Pearsonian formula agree with
those of that system of curves for which the method of moments gives the solution of
maximum likelihood.
The system of curves for which the method of moments is the best method of fitting
may easily be deduced, for if the frequency in the range dx be
y(x, θ₁, θ₂, θ₃, θ₄) dx,

then log y must involve x only as polynomials up to the fourth degree; consequently
y = α e^{−a(x⁴ + p₁x³ + p₂x² + p₃x + p₄)}.

If we let

y = α exp{ −x²/2σ² + k₃x³/σ³ + k₄x⁴/σ⁴ }
represent a curve of the quartic exponent, sufficiently near to the normal curve for the
squares of k₃ and k₄ to be neglected, then
d/dx log y = −x/σ² + 3k₃x²/σ³ + 4k₄x³/σ⁴ = −x / {σ² (1 + 3k₃x/σ + 4k₄x²/σ²)},
neglecting powers of k₃ and k₄. Since the only terms in the denominator constitute a
quadratic in x, the curve satisfies the fundamental equation of the Pearsonian type of
curves. In the neighbourhood of the normal point, therefore, the Pearsonian curves
are equivalent to curves of the quartic exponent; it is to this that the efficiency of β₁
and β₂, in the neighbourhood of the normal curve, is to be ascribed.
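The equivalence asserted here can be checked numerically: to first order in k₃ and k₄, the derivative of the quartic exponent agrees with a Pearsonian form whose denominator is a quadratic in x. The values of σ, k₃, k₄ and x below are arbitrary small illustrative numbers.

```python
# log y = -x^2/(2 s^2) + k3 x^3/s^3 + k4 x^4/s^4 gives
#   d(log y)/dx = -x/s^2 + 3 k3 x^2/s^3 + 4 k4 x^3/s^4,
# which, to first order in k3 and k4, equals the Pearsonian form
#   -x / {s^2 (1 + 3 k3 x/s + 4 k4 x^2/s^2)}.
s, k3, k4, x = 1.0, 1e-4, 2e-4, 0.7

exact = -x / s**2 + 3 * k3 * x**2 / s**3 + 4 * k4 * x**3 / s**4
pearson = -x / (s**2 * (1 + 3 * k3 * x / s + 4 * k4 * x**2 / s**2))
print(abs(exact - pearson))   # differs only by terms of order k^2
```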
12. DISCONTINUOUS DISTRIBUTIONS.
The applications hitherto made of the optimum statistics have been problems in
which the data are ungrouped, or at least in which the grouping intervals are so small
as not to disturb the values of the derived statistics. By grouping, these continuous
distributions are replaced by discontinuous distributions, for which, n_s being the
observed and m_s the expected frequency in any group,

L = S{n_s log (n_s/m_s)},

where L differs by a constant only from the logarithm of the likelihood, with sign
reversed, and therefore the method of the optimum will consist in finding the minimum
value of L. The equations so found are of the form

∂L/∂θ = −S{(n_s/m_s)(∂m_s/∂θ)} = 0 . . . . . . . . . . (6),
Now

χ² = S{(n_s − m_s)²/m_s} = S(n_s²/m_s) − n,

so that on differentiating by θ, the condition that χ² should be a minimum for variations
of θ is

S{(n_s²/m_s²)(∂m_s/∂θ)} = 0 . . . . . . . . . . (7).
Equation (7) has actually been used (12) to "improve" the values obtained by the
method of moments, even in cases of the normal distribution and of the POISSON series,
where the method of moments gives a strictly sufficient solution. The discrepancy between
these two methods arises from the fact that χ² is itself an approximation, applicable
only when n_s and m_s are large, and the difference between them of a lower order of
magnitude. In such cases,
writing x for n_s − m_s,

L = S{n_s log (n_s/m_s)} = S{(m+x) log (1 + x/m)} = S(x + x²/2m − x³/6m² + x⁴/12m³ − …),

and since

S(x) = 0,

L = ½ S(x²/m) − ⅙ S(x³/m²) + … .
When the differences are small, therefore, 2L = χ² nearly, so that the distribution of L
is given by

df ∝ e^{−½χ²} χ^{n′−2} dχ,

or

df ∝ e^{−L} L^{(n′−3)/2} dL,
where n' is one more than the number of degrees of freedom in which the sample may
differ from expectation (17).
In other cases we are at present faced with the difficulty that the distribution of L
requires a special investigation. This distribution will in general be discontinuous (as
is that of χ²), but it is not impossible that mathematical research will reveal the existence
of effective graduations for the most important groups of cases to which χ² cannot
be applied.
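The approximation 2L = χ² is easily verified on illustrative figures; the frequencies below are invented for the purpose, with equal observed and expected totals so that S(x) = 0.

```python
import math

# Invented observed and expected frequencies with equal totals (S(x) = 0).
n_obs = [95, 205, 296, 204, 103, 97]
m_exp = [100, 200, 300, 200, 100, 100]

L = sum(n * math.log(n / m) for n, m in zip(n_obs, m_exp))
chi2 = sum((n - m) ** 2 / m for n, m in zip(n_obs, m_exp))
print(2 * L, chi2)   # nearly equal while the differences n - m stay small
```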
1. The POISSON Series.

For the POISSON series, in which the chance of observing the number x is

e^{−m} m^x / x!,

we have

∂L/∂m = S(x/m − 1) = 0,

whence

m = S(x)/n.
The most likely value of m is therefore found by taking the first moment of the series.
Differentiating a second time,

∂²L/∂m² = −S(x)/m²,

of which the mean value is −n/m, so that

σ_m² = m/n,
as is well known.
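A short check that the sample mean does maximise the POISSON likelihood, on invented counts:

```python
import math

def poisson_log_likelihood(m, xs):
    # log-likelihood of a POISSON sample, dropping the constant S(log x!)
    return sum(x * math.log(m) - m for x in xs)

xs = [2, 0, 3, 1, 4, 2, 1, 3]   # invented counts
mean = sum(xs) / len(xs)
for trial in (mean - 0.1, mean + 0.1):
    assert poisson_log_likelihood(mean, xs) > poisson_log_likelihood(trial, xs)
print(mean)
```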
2. Grouped Normal Data.
In the case of the normal curve of distribution it is evident that the second moment
is a sufficient statistic for estimating the standard deviation; in investigating a sufficient
solution for grouped normal data, we are therefore in reality finding the optimum
correction for grouping; the SHEPPARD correction having been proved only to satisfy
the criterion of consistency.
For grouped normal data we have

p_s = (1/σ√(2π)) ∫ from x_s to x_{s+1} of e^{−(x−m)²/2σ²} dx,

and the optimum values of m and σ are obtained from the equations,

∂L/∂m = S{(n_s/p_s)(∂p_s/∂m)} = 0,
∂L/∂σ = S{(n_s/p_s)(∂p_s/∂σ)} = 0;

or, if we write ξ = (x−m)/σ, and z = (1/√(2π)) e^{−½ξ²} for the ordinate of the normal
curve at a group boundary, so that

σ (∂p_s/∂m) = z_s − z_{s+1},   σ (∂p_s/∂σ) = ξ_s z_s − ξ_{s+1} z_{s+1},

the equations become

S{(n_s/p_s)(z_s − z_{s+1})} = 0,

and

S{(n_s/p_s)(ξ_s z_s − ξ_{s+1} z_{s+1})} = 0.
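The fitting of grouped normal data by maximum likelihood can be sketched as follows; this is a crude grid search on invented, symmetric grouped data, not Fisher's tabular method.

```python
import math

def Phi(z):
    # standard normal probability integral, via the error function
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def grouped_log_likelihood(m, s, bounds, counts):
    # S(n_s log p_s), p_s being the normal probability of the group
    total = 0.0
    for (lo, hi), n in zip(bounds, counts):
        p = Phi((hi - m) / s) - Phi((lo - m) / s)
        total += n * math.log(p)
    return total

# Invented grouped data, roughly normal with mean 0 and unit standard deviation.
bounds = [(-4, -2), (-2, -1), (-1, 0), (0, 1), (1, 2), (2, 4)]
counts = [23, 136, 341, 341, 136, 23]

# Crude grid search for the optimum m and s (steps of 0.01).
best = max(((m / 100.0, s / 100.0)
            for m in range(-20, 21) for s in range(80, 121)),
           key=lambda ms: grouped_log_likelihood(ms[0], ms[1], bounds, counts))
print(best)
```

By symmetry of the data the optimum mean falls at zero, and the optimum standard deviation very close to unity.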
As a simple example we shall take the case chosen by K. SMITH in her investigation
of the minimum χ² method. [The grouped frequencies of the example are not legible
in this copy.] The values of σ found were

2.264214,    2.355860,

by moments with SHEPPARD'S correction, and by minimum χ², respectively.
If the latter value were accepted we should have to conclude that SHEPPARD'S correction,
even when it is small, and applied to normal data, might be altogether of the
wrong magnitude, and even in the wrong direction. In order to obtain the optimum
value of σ, we tabulate the values of ∂L/∂σ in the region under consideration; this may
be done without great labour if values of σ be chosen suitable for the direct application
of the table of the probability integral, with the following values:
[The tabulated values of σ and ∂L/∂σ are not legible in this copy.] By interpolation,

σ = 2.26437.
σ, uncorrected  . . . . . . . . . .   2.28254
SHEPPARD'S correction . . . . . . .  −0.01833
Correction for maximum likelihood  .  −0.01817
" Correction " for minimum χ²  . . .  +0.07332
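The SHEPPARD entry is reproduced by elementary arithmetic, assuming a unit grouping interval:

```python
import math

sigma_crude = 2.28254
# With a unit grouping interval, SHEPPARD's correction subtracts 1/12
# from the crude variance:
sigma_sheppard = math.sqrt(sigma_crude ** 2 - 1.0 / 12.0)
print(round(sigma_sheppard - sigma_crude, 5))   # -0.01833
```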
The effect of grouping upon the likelihood may be investigated by means of the
expansion

v = f(ξ) + (1/24) f″(ξ) + (1/1920) f⁗(ξ) + (1/322560) f⁽⁶⁾(ξ) + …,

v being the frequency in a group of unit width centred at ξ, whence

log v = log f + (1/24)(f″/f) + … .

For the normal curve the successive terms may be evaluated by means of the
polynomials ξ⁴ − 6ξ² + 3, ξ⁶ − 15ξ⁴ + 45ξ² − 15, … . [The detailed reductions which
follow are not legible in this copy.] Their conclusion is that, neglecting the periodic
terms, the estimate of the variance obtained from the grouped data differs from the
SHEPPARD-corrected moment only in terms of the fourth order in the group interval,
the loss of efficiency so incurred having the divisor 2880.
We may conclude, therefore, that the high agreement between the optimum value of σ
and that obtained by SHEPPARD'S correction in the above example is characteristic
of grouped normal data. The method of moments with SHEPPARD'S correction is highly
efficient in treating such material, the gain in efficiency obtainable by increasing the
likelihood to its maximum value is trifling, and far less than can usually be gained by
using finer groups. The loss of efficiency involved in grouping may be kept below
1 per cent. by making the group interval less than one-quarter of the standard deviation.
Although for the normal curve the loss of efficiency due to moderate grouping is very
small, such is not the case with curves making a finite angle with the axis, or having at
an extreme a finite or infinitely great ordinate. In such cases even moderate grouping
may result in throwing away the greater part of the information which the sample
provides.
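If the leading loss of information from grouping is ε²/12, ε being the group interval in units of the standard deviation (an assumption consistent with, but not stated in, the text), the quarter-interval rule follows:

```python
e = 0.25   # group interval of a quarter of the standard deviation
loss = e ** 2 / 12
print(round(100 * loss, 2))   # 0.52 per cent., i.e. below 1 per cent.
```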
3. Distribution of Observations in a Dilution Series.
An important type of discontinuous distribution occurs in the application of the
dilution method to the estimation of the number of micro-organisms in a sample of
water or of soil. The method here presented was originally developed in connection
with Mr. CUTLER'S extensive counts of soil protozoa carried out in the protozoological
laboratory at Rothamsted, and although the method is of very wide application, this
particular investigation affords an admirable example of the statistical principles
involved.
In principle the method consists in making a series of dilutions of the soil sample,
and determining the presence or absence of each type of protozoa in a cubic centimetre
of the dilution, after incubation in a nutrient medium.
The series in use proceeds by powers of 2, so that the frequency of protozoa in each
dilution is one-half that in the last.
The frequency at any stage of the process may then be represented by

n/2^x,

x being the number of dilutions.
If p is the chance of a plate at any dilution being sterile, and q = 1 − p the chance of
its being fertile, then

log p = −n/2^x,

so that

∂p/∂ log n = p log p,   ∂q/∂ log n = −p log p,

and the optimum value of n is obtained from the equation

∂L/∂ log n = S₁(log p) − S₂{(p/q) log p} = 0,

S₁ denoting summation over the sterile plates, and S₂ summation over the fertile.
Differentiating a second time,

∂²L/∂(log n)² = S₁(log p) − S₂{(p/q) log p (1 + log p) + (p²/q²)(log p)²};

now the mean number of sterile plates is ps, and of fertile plates qs, so that the mean
value of −∂²L/∂(log n)² is

s S{p (log p)²/q},

s being the number of plates at each dilution, the summation now extending over the
several dilutions.
We give below a table of the values of p, and of the weight w = p (log p)²/q, for the
dilution series log p = −2^{−x}, from x = −4 to x = 11.
x.        p.              w.          Percentage of total weight.
−4        0.00000014167   0.000036     0.002
−3        0.0003354626    0.021477     0.906
−2        0.01831564      0.298518    13.485
−1        0.1353353       0.626071    39.865
 0        0.3678794       0.581977    64.388
 1        0.6065306       0.385355    80.625
 2        0.7788009       0.220051    89.897
 3        0.8824969       0.117350    94.842
 4        0.9394110       0.060565    97.394
 5        0.9692333       0.030764    98.690
 6        0.9844965       0.015503    99.343
 7        0.9922179       0.007782    99.671
 8        0.9961014       0.003899    99.836
 9        0.9980488       0.001951    99.918
10        0.9990239       0.000976    99.959
11        0.9995119       0.000488    99.979
Remainder                 0.000488
Total                     2.373251
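The table can be recomputed directly from p = e^{−2^{−x}} and w = p (log p)²/q, and the total checked against the stated average:

```python
import math

# p = exp(-2^(-x)) is the chance of a sterile plate at the x-th dilution,
# and the weight is w = p (log p)^2 / q with q = 1 - p.
def p(x):
    return math.exp(-2.0 ** (-x))

def w(x):
    return p(x) * math.log(p(x)) ** 2 / (1.0 - p(x))

assert abs(p(0) - 0.3678794) < 1e-6
assert abs(w(0) - 0.581977) < 1e-6
assert abs(w(-1) - 0.626071) < 1e-6

# Summed over all dilutions, S(w) is close to the stated average pi^2/(6 log 2).
total = sum(w(x) for x in range(-9, 50))
print(total, math.pi ** 2 / (6 * math.log(2)))
```

The summation range is truncated where the terms underflow; the omitted tails are negligible.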
For the same dilution constant the total S(w) is nearly independent of the particular
series chosen, its average value being π²/6 log a, or in this case 2.373138. The fourth
column shows the total weight attained at any stage, expressed as a percentage of that
obtained from an infinite series of dilutions. It will be seen that a set of eight dilutions
comprises all but about 2 per cent. of the weight. With a loss of efficiency of only 2 to
2½ per cent., therefore, the number of dilutions which give information as to a particular
species may be confined to eight. To this number must be added a number depending
on the range which it is desired to explore. Thus to explore a range from 100 to 100,000
per gramme (about 10 octaves) we should require 10 more dilutions, making 18 in all,
while to explore a range of a millionfold, or about 20 octaves, 28 dilutions would be
needed.
In practice it would be exceedingly laborious to calculate the optimum value of n for
each series observed (of which 38 are made daily). On the advice of the statistical
department, therefore, Mr. CUTLER adopted the plan of counting the total number of
sterile plates, and taking the value of n which on the average would give that number.
When a sufficient number of dilutions are made, log n is diminished by log a for each
additional sterile plate, and even near the ends of the series the appropriate values of
n may easily be tabulated. Since this method of estimation is of wide application,
and appears at first sight to be a very rough one, it is important to calculate its efficiency.
hence the variance of the value of log n so obtained is

(log a)² S(pq),

S(pq) being very nearly constant and within a small fraction of unity; whence the efficiency
of the method of counting the sterile plates is

6/(π² log 2) = 87.71 per cent.,
a remarkably high efficiency, considering the simplicity of the method, the efficiency
being independent of the dilution ratio.
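Both the quoted percentage and the statement about S(pq) check out numerically:

```python
import math

eff = 6.0 / (math.pi ** 2 * math.log(2.0))
print(round(100 * eff, 2))   # 87.71 per cent.

# S(pq) for the series p = exp(-2^(-x)) is indeed very nearly unity:
spq = sum(math.exp(-2.0 ** (-x)) * (1.0 - math.exp(-2.0 ** (-x)))
          for x in range(-9, 50))
print(round(spq, 3))
```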
13. SUMMARY.
During the rapid development of practical statistics in the past few decades, the
theoretical foundations of the subject have been involved in great obscurity. Adequate
distinction has seldom been drawn between the sample recorded and the hypothetical
population from which it is regarded as drawn. This obscurity is centred in the so-called
"inverse" methods.
On the bases that the purpose of the statistical reduction of data is to obtain statistics
which shall contain as much as possible, ideally the whole, of the relevant information
contained in the sample, and that the function of Theoretical Statistics is to show how
such adequate statistics may be calculated, and how much and of what kind is the
information contained in them, an attempt is made to formulate distinctly the types
of problems which arise in statistical practice.
Of these, problems of Specification are found to be dominated by considerations which
may change rapidly during the progress of Statistical Science. In problems of Distribution
relatively little progress has hitherto been made, these problems still affording
a field for valuable enquiry by highly trained mathematicians. The principal purpose
of this paper is to put forward a general solution of problems of Estimation.
Of the criteria used in problems of Estimation only the criterion of Consistency has
hitherto been widely applied; in Section 5 are given examples of the adequate and
inadequate application of this criterion. The criterion of Efficiency is shown to be a
special but important case of the criterion of Sufficiency, which latter requires that the
whole of the relevant information supplied by a sample shall be contained in the statistics
calculated.
In order to make clear the nature of the general method of satisfying the criterion
of Sufficiency, which is here put forward, it has been thought necessary to reconsider
BAYES' problem in the light of the more recent criticisms to which the idea of "inverse
probability" has been exposed. The conclusion is drawn that two radically distinct
concepts, both of importance in influencing our judgment, have been confused under
the single name of probability. It is proposed to use the term likelihood to designate
the state of our information with respect to the parameters of hypothetical populations,
and it is shown that the quantitative measure of likelihood does not obey the
mathematical laws of probability.
A proof is given in Section 7 that the criterion of Sufficiency is satisfied by that set
of values for the parameters of which the likelihood is a maximum, and that the same
function may be used to calculate the efficiency of any other statistics, or, in other
words, the percentage of the total available information which is made use of by such
statistics.
This quantitative treatment of the information supplied by a sample is illustrated by
an investigation of the efficiency of the method of moments in fitting the Pearsonian
curves of Type III.
Section 9 treats of the location and scaling of Error Curves in general, and contains
definitions and illustrations of the intrinsic accuracy, and of the centre of location of such
curves.
In Section 10 the efficiency of the method of moments in fitting the general Pearsonian
curves is tested and discussed. High efficiency is only found in the neighbourhood of
the normal point. The two causes of failure of the method of moments in locating these
curves are discussed and illustrated. The special cause is discovered for the high
efficiency of the third and fourth moments in the neighbourhood of the normal point.
It is to be understood that the low efficiency of the moments of a sample in estimating
the form of these curves does not at all diminish the value of the notation of moments as
a means of the comparative specification of the form of such curves as have finite moment
coefficients.
Section 12 illustrates the application of the method of maximum likelihood to
discontinuous distributions. The POISSON series is shown to be sufficiently fitted by the
mean. In the case of grouped normal data, the SHEPPARD correction of the crude
moments is shown to have a very high efficiency, as compared to recent attempts to
improve such fits by making χ² a minimum; the reason being that χ² is an expression
only approximate to a true value derivable from likelihood. As a final illustration of
the scope of the new process, the theory of the estimation of micro-organisms by the
dilution method is investigated.
Finally it is a pleasure to thank Miss W. A. MACKENZIE for her help in the preparation
of the diagrams.
REFERENCES.

(1) K. PEARSON (1920). "The Fundamental Problem of Practical Statistics," 'Biom.,' xiii., pp. 1-16.
(2) F. Y. EDGEWORTH (1921). "Molecular Statistics," 'J.R.S.S.,' lxxxiv., p. 83.
(3) G. U. YULE (1912). "On the Methods of Measuring Association between two Attributes," 'J.R.S.S.,' lxxv., p. 587.
(4) STUDENT (1908). "The Probable Error of a Mean," 'Biom.,' vi., p. 1.
(5) R. A. FISHER (1915). "Frequency Distribution of the Values of the Correlation Coefficient in Samples from an Indefinitely Large Population," 'Biom.,' x., p. 507.
(6) R. A. FISHER (1921). "On the 'Probable Error' of a Coefficient of Correlation deduced from a Small Sample," 'Metron,' i., pt. iv., p. 82.
(7) R. A. FISHER (1920). "A Mathematical Examination of the Methods of Determining the Accuracy of an Observation by the Mean Error and by the Mean Square Error," 'Monthly Notices of R.A.S.,' lxxx., p. 758.
(8) E. PAIRMAN and K. PEARSON (1919). "On Corrections for the Moment Coefficients of Limited Range Frequency Distributions when there are finite or infinite Ordinates and any Slopes at the Terminals," 'Biom.,' xi., p. 262.
(13) K. PEARSON (1914). "Tables for Statisticians and Biometricians," Camb. Univ. Press.
(14) G. VEGA (1764). "Thesaurus Logarithmorum Completus," p. 643.
(15) K. PEARSON (1900). "On the Criterion that a given System of Deviations from the Probable in the case of a Correlated System of Variables is such that it can be reasonably supposed to have arisen from Random Sampling."
"... On the Probable Errors of Frequency Constants, and on the influence of Random Selection on Variation and Correlation."
"On the Probable Errors of Frequency Constants," 'Biom.,' ii., p. 273.
(20) W. F. SHEPPARD (1898). "On the Application of the Theory of Error to Cases of Normal Distribution and Correlations," 'Phil. Trans.,' A., cxcii., p. 101.
(21) J. M. KEYNES (1921). "A Treatise on Probability," Macmillan & Co., London.