

A Short History of Probability Theory and Its
Applications
Lokenath Debnath
Department of Mathematics,
University of Texas-Pan American, Edinburg, TX
78539-2999 USA
e-mail: debnathl@utpa.edu
and
Kanadpriya Basu
Department of Mathematical Sciences
University of Texas at El Paso,
El Paso, TX 79968
e-mail: kbasu@utep.edu

“It is remarkable that a science (Probability) which began with consideration of games of chance,
should have become the most important object of human knowledge.”
“Probability has reference partly to our ignorance, partly to our knowledge... . The Theory of
chances consists in reducing all events of the same kind to a certain number of cases equally
possible, that is, such that we are equally undecided as to their existence; and determining the
number of these cases which are favourable to the event sought. The ratio of that number to the
number of all the possible cases is the measure of the probability ... .”

P.S. Laplace

“The true logic of this world is to be found in theory of probability.”

James Clerk Maxwell

Abstract
This paper gives a brief history of probability theory and its applications, including Jacob Bernoulli's
famous law of large numbers, the theory of errors in observations, physics, and the distribution of prime
numbers. Included are major contributions of Jacob Bernoulli and Laplace. It is written to pay a
tricentennial tribute to Jacob Bernoulli, since the year 2013 marks the tricentennial anniversary of
Bernoulli's law of large numbers, published in 1713. Special attention is given to Bayes' celebrated
theorem and the famous controversy between the Bayesian and frequentist approaches to probability and
statistics. The article also pays a special tribute to Thomas Bayes, since the year 2013 marks the 250th
anniversary of Bayes' celebrated work in probability and statistics, published posthumously in 1763. This
is followed by the modern axiomatic theory of probability first formulated by A.N. Kolmogorov in 1933.
The last section is devoted to applications of probability theory to the distribution of prime numbers.

Keywords and phrases: probability, Bayes theorem, least squares, binomial, normal and Poisson
distributions
2010 AMS Mathematics Subject Classification Codes: 01A, 60-02

1 Ancient Origins of the Concept of Probability
In ancient times, Plato (428-348 BC) and his famous student, Aristotle (384-322 BC), used to discuss the
word chance philosophically. Around 324 BC, the Greek Antimenes first developed a system of insurance which
guaranteed a sum of money against the wins or losses of certain events. The many uncertainties of everyday
life, such as health, weather, birth, death and games, led to the concept of chance and of random variables
as the outputs of an experiment (for example, the length of an object, the height of a person, or the
temperature in a city on a given day). Almost all measurements in mathematics or science have the
fundamental property that the results vary in different trials. In other words, results are, in general,
random in nature. Thus, a quantity we want to measure is called a random variable.
Historically, the word probability was associated with the Latin word ‘probo’ and the English words
probe and probable. In other languages, this word, used in a mathematical sense, had a meaning more or
less like plausibility. In ancient times, the concept of probability arose in problems of gambling dealing
with winning or losing a game. It began with the famous physician, mathematician and gambler, Gerolamo
Cardano (1501-1576), who became Professor and Chair of mathematics at the University of Bologna in Italy.
During the fifteenth century, the pragmatic approach to problems of games of chance with dice began in Italy.
During that time, references to games of chance were more numerous, but no suggestions were made on how to
calculate probabilities of events. Cardano wrote a short manual, Liber de Ludo Aleae (The Book on Games of Chance),
which contained the first mathematical treatment of probability, dealing with problems of mathematical
expectation (or mean), addition of probabilities, frequency tables for the throwing of dice, n successes in n
independent trials, and the law of large numbers. However, his work attracted little attention and did
not provide any real development of probability theory. Cardano's manual was published only in 1663, nearly a
century later. In this published manual, Cardano introduced the idea of assigning a probability p between 0 and 1 to
an event whose outcome is random, and then applied this idea to games of chance. He also developed the
law of large numbers, which states that when the probability of an event is p, then after a large number of
trials n, the number of times it will occur is close to np.
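Cardano's statement can be checked numerically. The following short Python sketch (not part of the original article) simulates n independent trials with success probability p and compares the observed count of successes with np; the function name and parameter values are illustrative only.

```python
import random

def simulate_counts(p, n, seed=42):
    """Simulate n independent trials with success probability p
    and return the number of successes."""
    rng = random.Random(seed)
    return sum(1 for _ in range(n) if rng.random() < p)

# For a large number of trials the count is close to n*p,
# as Cardano's statement of the law of large numbers asserts.
p, n = 1 / 6, 100_000          # e.g. rolling a six with a fair die
count = simulate_counts(p, n)
print(count, n * p)            # the two numbers are close for large n
```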

2 Development of Classical Probability During the Sixteenth and Seventeenth Centuries
During the sixteenth and seventeenth centuries, a great deal of attention was given to games of chance,
such as tossing coins, throwing dice or playing cards, in particular, and to problems of gambling, in general.
An Italian nobleman suggested a problem of throwing dice to Galileo Galilei (1564-1642), a great Italian
astronomer and physicist, a solution of which was the first recorded result in the history of mathematical
probability theory. A decade after Galileo's death, a French nobleman and gambling expert, the Chevalier de
Méré (1610-1685), proposed some mathematical questions on games of chance to Blaise Pascal (1623-1662), a
French mathematician, who communicated these to another French mathematician, Pierre de Fermat (1601-
1665). From 1654, both Pascal and Fermat began a lively correspondence about questions and problems
dealing with games of chance, arrangements of objects, and chance of winning a fair game. Their famous
correspondence introduced the concept of probability, mean (or expected) value and conditional probability
and hence, can be regarded as the mark of the birth of classical probability theory. According to Pascal,
when a coin is tossed twice, there are four possible outcomes: HH, HT , T H and T T where H stands for
heads and T for tails, a solution of which was closely related to the binomial coefficients of
$$(H + T)^2 = H^2 + HT + TH + T^2 = H^2 + 2HT + T^2. \qquad (2.1)$$

Similarly, when an ideal coin is tossed three times, there are eight possible outcomes: HHH, HHT, HTH,
THH, HTT, THT, TTH, TTT, a solution of which was closely linked with the binomial coefficients
(collecting all terms of H and T together) of
$$(H + T)^3 = H^3 + 3H^2 T + 3HT^2 + T^3. \qquad (2.2)$$

The binomial coefficients in (2.2) led to the famous Pascal triangle shown in Figure 1. In connection with
this triangle, Pascal observed the famous formula for the binomial coefficients

Figure 1: Pascal triangle: coefficients of binomial formulas for n = 0, 1, 2, 3, 4, 5, · · · .

 
$${}^nC_r = \binom{n}{r},$$
the coefficient of $a^r b^{n-r}$ in the binomial expansion of $(a + b)^n$ for any integer
$n \ge 0$ and $0 \le r \le n$. He also linked binomial coefficients with the combinatorial coefficients that arose in
probability, and used these coefficients to solve the problem of points in the case where one player requires r
points and the other n points to win a game. At the same time, Pascal also wrote his Traité du triangle
arithmétique (Treatise on the Arithmetical Triangle) in which he found a triangular arrangement of the
binomial coefficients and proved many of their new properties. In this work, he gave what seemed to be the
first satisfactory statement of the principle of proof by mathematical induction (see Debnath [1]). However,
the Pascal triangle for n = 1, 2, · · · , 8 was found earlier in the work of the Chinese mathematician Chu
Shih-Chieh in 1303.
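As an illustration of the triangular arrangement described above, the short sketch below builds the first few rows of Pascal's triangle using only the additive rule that each entry is the sum of the two entries above it. It is an illustrative aid with a helper name of our choosing, not part of the original article.

```python
def pascal_triangle(num_rows):
    """Return the first num_rows rows of Pascal's triangle,
    each inner entry being the sum of the two entries above it."""
    rows = [[1]]
    for _ in range(num_rows - 1):
        prev = rows[-1]
        rows.append([1] + [prev[i] + prev[i + 1] for i in range(len(prev) - 1)] + [1])
    return rows

for row in pascal_triangle(6):   # rows for n = 0, 1, ..., 5 as in Figure 1
    print(row)
# The row for n = 3 reads [1, 3, 3, 1], matching the coefficients in (2.2).
```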
With the ancient rudiments provided by Cardano a century before, the work of Pascal and Fermat laid the
classical mathematical foundation of probability theory in the middle of the seventeenth century. Stim-
ulated by the significant advances of Pascal and Fermat, Christiaan Huygens (1629-1695), an extraordinary
mathematical physicist and astronomer of Holland, became actively involved in the study of probability, and
published the first treatise on probability theory, “De Ratiociniis in Ludo Aleae (On Reasoning in Games of
Dice)”, in 1657. This major treatise contained the logical treatment of probability theory, and many impor-
tant results including problems of throwing of dice, chance of winning games, problems of games involving
three or four players in addition to some works of Pascal and Fermat. In his treatise, Huygens introduced
the definition of probability of an event as a quotient of favorable cases by all possible cases. He also rein-
troduced the concept of mathematical expectation (or mean) and elaborated Cardano’s idea of expected
value (or mean) of random variables. In those days, and even today, there was a disagreement
among different disciplines on the meaning and significance of the terms chance, probability, likelihood and
uncertainty. However, Huygens' treatise provided the greatest impetus to the subsequent development of
probability theory, and remained the best book on the subject until the publication of Jacob
Bernoulli's (1654-1705) first significant work on probability, “Ars Conjectandi (The Art of Conjecturing)”, in 1713,
eight years after his death. So, Huygens may perhaps be regarded as the father of the theory of probability.
In addition to the many major ideas and results included in his monumental treatise, Bernoulli formulated the
fundamental principle of probability theory, known as Bernoulli’s theorem or the law of large numbers, a
name given by a celebrated French mathematician, Siméon Poisson (1781-1840).
The Bernoulli’s law of large numbers is so fundamental that the subsequent development of probability
theory is based on the law of large numbers. In his letter to Gottfried Wilhelm Leibniz (1646-1716) on
October 3, 1703, Jacob Bernoulli described the major contents of his Ars Conjectandi and stated that the
most of it has been completed except applications to civil, moral, social and economic problems, and to
uncertainties of life beyond the analysis of games of chance. From the time of Bernoulli and even today, the
famous phrase such as proof beyond reasonable doubt in law involve probability. However, he discussed the
applications of probability theory in great details. Also included in his book were some problems as well as
solutions of Huygens with his own ideas and solutions, problems of throwing dice and games of chance. In
particular, he added the problem of repeated experiments in which a particular outcome is either a success or
a failure. Using the general properties of the binomial coefficients $\binom{n}{r}$, which count the number of ways in
which r elements can be chosen from a population of n (≥ r) elements, Bernoulli formulated the probability
B(r; n, p) of r successes (and hence, n − r failures) in n independent trials in the form (see Lévy and Roth [2])
$$B(r; n, p) = \binom{n}{r} p^r q^{n-r}, \qquad (2.3)$$
where p is the probability of a success and q = 1 − p is the probability of a failure of an individual event.
Clearly, (2.3) is the rth term of the binomial expansion of $(p + q)^n$ and hence, B(r; n, p) is called Bernoulli's
binomial distribution of the random variable representing the number of r successes in n trials.
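A minimal sketch of Bernoulli's binomial distribution (2.3), using the standard library function math.comb for the binomial coefficient; the helper name binomial_pmf and the sample parameters are ours, chosen only for illustration.

```python
from math import comb

def binomial_pmf(r, n, p):
    """B(r; n, p) = C(n, r) * p**r * q**(n - r): the probability of
    exactly r successes in n independent trials, as in (2.3)."""
    q = 1.0 - p
    return comb(n, r) * p**r * q**(n - r)

# The probabilities over all r sum to 1, since (p + q)**n = 1.
n, p = 10, 0.3
print(sum(binomial_pmf(r, n, p) for r in range(n + 1)))  # ~1.0
print(binomial_pmf(3, n, p))  # probability of exactly 3 successes
```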
Writing r = np + x and n − r = nq − x in (2.3), if n is large then r is also large, provided x is small
compared with λ = np. Using James Stirling's (1693-1770) celebrated asymptotic approximation of n!
for large n, discovered by Stirling in 1718, in the form
$$n! \sim \sqrt{2\pi}\; n^{\,n+\frac{1}{2}}\, e^{-n} \left(1 + \frac{1}{12n}\right) \qquad (2.4)$$
and then expanding in descending powers of n in (2.3) leads to
$$\log B = \log n! + (np + x) \log p + (nq - x) \log q - \log (np + x)! - \log (nq - x)!$$
$$= -\frac{1}{2} \log (2\pi npq) - \frac{1}{2n} \left[ \frac{x^2 + x(1 - 2p)}{pq} \right]. \qquad (2.5)$$
Or, equivalently,
$$B \approx \frac{1}{\sqrt{2\pi npq}} \exp\left[ -\frac{1}{2n} \left( \frac{x^2 + x(1 - 2p)}{pq} \right) \right]. \qquad (2.6)$$
If |x| ≫ |1 − 2p|, the term (1 − 2p)x can be neglected compared to x² in the exponential term, so that (2.6)
is approximately equal to
$$B(x) = \frac{1}{\sqrt{2\pi npq}} \exp\left( -\frac{x^2}{2npq} \right) = \frac{1}{\sigma \sqrt{2\pi}} \exp\left( -\frac{x^2}{2\sigma^2} \right), \qquad (2.7)$$
where $\sigma = \sqrt{npq}$ is the standard deviation. This is universally known as the normal (or Gaussian) probability
density function B(x) ≡ f(x) in standard notation.
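The approximation (2.7) can be checked numerically. The sketch below, with illustrative parameter values of our choosing, compares the exact binomial probability B(r; n, p) with the normal density evaluated at x = r − np; the helper names are not from the article.

```python
from math import comb, exp, pi, sqrt

def binomial_pmf(r, n, p):
    """Exact binomial probability (2.3)."""
    return comb(n, r) * p**r * (1 - p)**(n - r)

def normal_approx(r, n, p):
    """Right-hand side of (2.7) with x = r - n*p and sigma**2 = n*p*q."""
    x, var = r - n * p, n * p * (1 - p)
    return exp(-x * x / (2 * var)) / sqrt(2 * pi * var)

n, p = 1000, 0.4
for r in (380, 400, 420):       # values near the mean np = 400
    print(r, binomial_pmf(r, n, p), normal_approx(r, n, p))
# The two columns agree closely for large n, as De Moivre observed.
```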
This normal distribution function (often called the Gaussian distribution function) was first derived
by French mathematician, Abraham de Moivre (1667-1754) who left France due to the revocation of the
edict of Nantes in 1685 and took shelter in London. Subsequently, he was elected a Fellow of the Royal
Society of London in 1697 for his many outstanding contributions to probability and statistics. Even before
the publication of Bernoulli’s Ars Conjectandi in 1713, De Moivre published his ground breaking book in
1711 on probabilty entitled The Doctrine of Chances in which he established the importance of the law of
large numbers and its computation. This book went through multiple editions which contained numerical
analysis with approximation of sums of terms of the binomial expansion $(a + b)^n$ for large n. The Stirling
approximation for n! allowed him to estimate the middle term of the binomial expansion for power n = 2r,
so that
$$\binom{n}{r} = \binom{2r}{r} \approx \frac{2^{2r}}{\sqrt{r\pi}}. \qquad (2.8)$$

Although De Moivre realized that the histogram of the frequency distribution is very close to the bell-
shaped normal curve, he could not write down its equation. Subsequently, it has been recognized that
when independent measurements are averaged, they tend to look like the bell-shaped normal curve, a result known
as the central limit theorem. This is just the generalization of Bernoulli's original idea due to De Moivre,
who formulated it in terms of the central limit theorem, which became the core of probability theory. More
generally, if $\mu_n$ denotes the probability distribution of $\frac{1}{\sqrt{n}}X$, then the central limit theorem states that
the n-fold convolutions of $\mu_n$ converge to a normal distribution on the real line if X has zero mean and
finite second moment. Francis Galton (1822-1911), a famous British mathematician, described the normal
distribution as “the supreme law of unreason”.
The variance of the probability density is calculated from its definition as follows:
$$\operatorname{var}(X) = \int_{-\infty}^{\infty} x^2 f(x)\, dx = \frac{1}{\sigma\sqrt{2\pi}} \int_{-\infty}^{\infty} x^2 \exp\left(-\frac{x^2}{2\sigma^2}\right) dx = \frac{\sigma^2}{\sqrt{2\pi}} \int_{-\infty}^{\infty} t\left(t\, e^{-\frac{1}{2}t^2}\right) dt, \qquad t = \frac{x}{\sigma},$$
which is, integrating by parts,
$$= \frac{\sigma^2}{\sqrt{2\pi}} \left\{ \left[ -t\, e^{-\frac{1}{2}t^2} \right]_{-\infty}^{\infty} + \int_{-\infty}^{\infty} e^{-\frac{1}{2}t^2}\, dt \right\} = \frac{\sigma^2}{\sqrt{2\pi}} \cdot \sqrt{2\pi} = \sigma^2. \qquad (2.9)$$
Replacing x by x/σ in (2.7) leads to the standard normal probability density function f(x) given by
$$f(x) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{x^2}{2}\right). \qquad (2.10)$$
The corresponding cumulative distribution function is a non-decreasing function of x which
tends to zero as x → −∞, and to one as x → +∞:
$$F(x) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} e^{-\frac{1}{2}t^2}\, dt, \qquad (2.11)$$
where F(x) represents the area under the curve f(x) to the left of x.
The more general normal density function f(x) with mean µ and standard deviation σ is given by
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left[-\frac{(x - \mu)^2}{2\sigma^2}\right]. \qquad (2.12)$$
The curve of f(x) is symmetric about x = µ. Putting µ = 0, the symmetric curves of f(x) are drawn in
Figure 2 for the different values σ = σ1 = 0.25, σ = σ2 = 0.5 and σ = σ3 = 1.

Figure 2: The curves of f(x) with µ = 0 and for different σ.

These symmetric curves for f(x) are called bell-shaped curves; they attain the maximum value
$\frac{1}{\sigma\sqrt{2\pi}}$ and have a peak at x = 0 (or at x = µ when translated). For small values of the variance σ², the
curve has a high peak and steep slopes, whereas with increasing σ² the curves get flatter and flatter, and the
density function f(x) is spread out more and more, which means that the variance or the standard deviation
measures the spread.

Thus, for large values of n (n → ∞), the binomial probability distribution tends asymptotically to the
normal density function (2.12) with the mean µ and the standard deviation σ. This is one of the most
famous limit theorems of De Moivre and Laplace in probability theory.
The probability distribution function corresponding to the probability density function (2.12) (density
function is a modern term, synonymous with the classical and older term frequency function) is defined by
integrating f(x) from −∞ to x, that is,
$$F(X = x) = \frac{1}{\sigma\sqrt{2\pi}} \int_{-\infty}^{x} \exp\left[-\frac{(\xi - \mu)^2}{2\sigma^2}\right] d\xi. \qquad (2.13)$$
The probability that a normal random variable X assumes any value in a < x ≤ b is given by
$$P(a < X \le b) = F(b) - F(a) = \frac{1}{\sigma\sqrt{2\pi}} \int_{a}^{b} \exp\left[-\frac{(x - \mu)^2}{2\sigma^2}\right] dx. \qquad (2.14)$$

The integral in (2.13) cannot be evaluated exactly by the methods of calculus, but it is extremely useful and
frequently used, and hence it has been tabulated. It is sufficient and easy to tabulate the standard normal
distribution function (2.13) of the standard normal variable $X^* = \frac{X - \mu}{\sigma}$ with mean zero and
standard deviation σ = 1, so that we use the standard probability distribution function in the form
$$F(X = x) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} e^{-\frac{t^2}{2}}\, dt = \frac{1}{\sqrt{\pi}} \int_{-\infty}^{x/\sqrt{2}} e^{-\xi^2}\, d\xi = \frac{1}{\sqrt{\pi}} \left[ \int_{-\infty}^{0} + \int_{0}^{x/\sqrt{2}} \right] e^{-\xi^2}\, d\xi = \frac{1}{2} \left[ 1 + \operatorname{erf}\left(\frac{x}{\sqrt{2}}\right) \right], \qquad (2.15)$$
where the integral
$$\operatorname{erf}(x) = \frac{2}{\sqrt{\pi}} \int_{0}^{x} e^{-\xi^2}\, d\xi \qquad (2.16)$$
is called the probability integral or error function, erf(x), and the complementary error function, denoted
by erfc(x), is defined by
$$\operatorname{erfc}(x) = \frac{2}{\sqrt{\pi}} \int_{x}^{\infty} e^{-\xi^2}\, d\xi = 1 - \operatorname{erf}(x). \qquad (2.17)$$
Evidently, erf(0) = 0, erf(∞) = 1, erfc(0) = 1 and erfc(∞) = 0.
Consequently, the distribution function F(x) can be expressed in terms of the complementary error
function as
$$F(x) = \frac{1}{2}\left[1 + \operatorname{erf}\left(\frac{x}{\sqrt{2}}\right)\right] = 1 - \frac{1}{2}\operatorname{erfc}\left(\frac{x}{\sqrt{2}}\right). \qquad (2.18)$$
Similarly, the general distribution function F(x) given by (2.13) can be expressed in terms of the error
function in the form
$$F(x) = \frac{1}{\sigma\sqrt{2\pi}} \int_{-\infty}^{x} \exp\left[-\frac{(\xi - \mu)^2}{2\sigma^2}\right] d\xi = \frac{1}{2}\left[1 + \operatorname{erf}\left(\frac{x - \mu}{\sigma\sqrt{2}}\right)\right] = 1 - \frac{1}{2}\operatorname{erfc}\left(\frac{x - \mu}{\sigma\sqrt{2}}\right). \qquad (2.19)$$
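Formula (2.19) is easy to use in practice because erf is available in the Python standard library. The sketch below, with a helper name and test values of our choosing, evaluates the general normal distribution function F(x).

```python
from math import erf, sqrt

def normal_cdf(x, mu=0.0, sigma=1.0):
    """F(x) = (1/2) * [1 + erf((x - mu) / (sigma * sqrt(2)))], as in (2.19)."""
    return 0.5 * (1.0 + erf((x - mu) / (sigma * sqrt(2.0))))

# Familiar checks: F(mu) = 0.5, and about 68% of the mass lies within one sigma.
print(normal_cdf(0.0))                              # 0.5
print(normal_cdf(1.0) - normal_cdf(-1.0))           # ~0.6827
print(normal_cdf(75.0, mu=70.0, sigma=5.0))         # ~0.8413
```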
In the real world there are numerous applications of the theory associated with n independent Bernoulli
trials. Very often n is very large and p is small, but the mean of the binomial distribution, np = λ, is finite.
Under these conditions, Poisson derived an asymptotic approximation of B(r; n, p) as follows:
$$B(r; n, p) = \frac{n(n-1)(n-2)\cdots(n-r+1)}{r!}\, p^r (1-p)^{n-r} \qquad (2.20)$$
$$= \frac{(np)^r}{r!} \left(1 - \frac{1}{n}\right)\left(1 - \frac{2}{n}\right)\cdots\left(1 - \frac{r-1}{n}\right)(1-p)^{n-r}. \qquad (2.21)$$
For any fixed r > 0, as n → ∞ and p → 0,
$$\left(1 - \frac{1}{n}\right)\left(1 - \frac{2}{n}\right)\cdots\left(1 - \frac{r-1}{n}\right) \to 1, \qquad \text{and} \qquad n \approx n-1 \approx n-2 \approx \cdots \approx n-r+1,$$
so that
$$(1-p)^n = 1 - np + \frac{n(n-1)}{2!}\,p^2 - \cdots \approx 1 - np + \frac{(np)^2}{2!} - \cdots \approx e^{-np}, \qquad (1-p)^{-r} \to 1 \;\text{ as } p \to 0.$$
Consequently, (2.21) reduces to the famous form
$$B(r; n, p) \approx \frac{(np)^r}{r!}\, e^{-np} = \frac{\lambda^r}{r!}\, e^{-\lambda}. \qquad (2.22)$$
This is well known as the Poisson probability distribution, or more precisely, the Poisson approximation to
the binomial distribution, and
$$\sum_{r=0}^{\infty} B(r; n, p) = e^{-np} \sum_{r=0}^{\infty} \frac{(np)^r}{r!} = e^{-np}\, e^{np} = 1. \qquad (2.23)$$
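The limiting form (2.22) can be illustrated directly. The short Python sketch below, with illustrative values of n and p chosen so that λ = np stays moderate, compares the exact binomial probabilities with the Poisson probabilities; the helper names are ours.

```python
from math import comb, exp, factorial

def binomial_pmf(r, n, p):
    """Exact binomial probability (2.3)."""
    return comb(n, r) * p**r * (1 - p)**(n - r)

def poisson_pmf(r, lam):
    """The limiting form (2.22): lam**r * exp(-lam) / r!"""
    return lam**r * exp(-lam) / factorial(r)

# Large n, small p, with lam = n*p held fixed, as in Poisson's derivation.
n, p = 2000, 0.002
lam = n * p
for r in range(6):
    print(r, binomial_pmf(r, n, p), poisson_pmf(r, lam))
# The two columns are nearly identical, illustrating (2.22).
```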

Based on the Bernoulli’s law of large numbers, Poisson derived the Poisson probability distribution (2.22)
from the Bernoulli’s binomial distribution for large number of trials (n → ∞). It is evident from the above
discussion that Jacob Bernoulli’s great contributions had a significant influence on De Moivre’s work on
Bernoulli’s law of large numbers and the central limit theorems that have been extended over the years by
many great mathematical scientists to its more precise present form. It is also evident that the Bernoulli’s
Law of Large Numbers(see Denker[3]) had tremendous significance and lasting impact during the development
of probability and statistics for the last 300 years. This year 2013 marks the tricentennial anniversary of
the Bernoulli’s Law of Large Numbers since its publication in 1713. There has been an enormous progress
on probability and statistics during the twentieth century that can be traced back to Jacob Bernoulli’s
celebrated work in many ways. He will be remembered forever for his many fundamental contributions,
especially for his first discovery of the classical law of large numbers.
The mean of the Bernoulli distribution (2.3) is np and the standard deviation is
$$\sigma = \sqrt{npq} = \sqrt{np(1-p)} \approx \sqrt{np} \quad \text{as } p \to 0. \qquad (2.24)$$

It may be appropriate to mention that Karl Pearson (1857-1936), a celebrated British mathematical scientist
and one of the founders of modern statistics, first introduced the standard deviation to denote the natural
unit of measure of probability.
It seems clear that there is no precise definition or interpretation of the concept of probability, and hence,
many questions have been raised about this concept and its applications to scientific investigations. In his
famous work on Recherches sur la probabilité des jugemens (Investigations into Plausibility of Inference),
Poisson first wrote: “The probability of an event is the reason we have to believe that it has taken place or
that it will take place.” This is simply a subjective definition of probability. He then immediately proceeded
to introduce an objective definition of probability in his following statement: “The measure of the probability
of an event is the ratio of the number of cases favorable to that event to the total number of cases favorable or
contrary.” In fact, he introduced the definition of probability in both (subjective and objective) ways which
led to some controversy among mathematicians and mathematical philosophers. On the other hand, in 1770,
Joseph Louis Lagrange (1736-1813) published his celebrated memoir on the method of determining the best
value from among a set of experimental observations. He also showed how to determine the probability that
the error of the mean should have an assigned value or lie within given limits. In general, all experimental
results are liable to some inevitable observational errors, regardless of whether the measurements are made by
different observers, or even by the same observer at different times or in different places. There is a common
phrase used in all branches of experimental science that the same experiment always produces the same
results when carried out under the same essential conditions. However, no two experiments can be the same

because they always differ in place or time and in equipment and observer. So, it is almost impossible to
avoid experimental errors in physical experiments. What, then, is probability when applied to a physical
experiment? It may be a simple matter of frequency of observations, and so objective. Or, it may be
that the observer assigns probabilities to events based on subjective judgement. For example, when two coins
are tossed simultaneously, there are four distinguishable outcomes: HH, HT , T H, T T . Are these four
events equally likely? If we do not distinguish the order of the outcomes, then there are only three possible
outcomes: HH, HT = T H, T T . When these ideas are applied to quantum statistics, both possibilities exist.
In Bose-Einstein statistics, order of HT and T H is not distinguished. Consequently, particles obeying Bose-
Einstein statistics are called Bosons which are described by symmetric wave functions. On the other hand,
the order HT and T H is very important in the Fermi-Dirac statistics and particles obeying the Fermi-Dirac
statistics are known as Fermions which are governed by antisymmetric wave functions. These statistics can
be described in terms of probability models as follows. In statistical physics, the phase space is usually
subdivided into a large number of smaller cells so that each cell may contain a particle. In this way the state
of the entire system is described in terms of a random distribution of r particles in n cells, so that all possible
$n^r$ arrangements (or distributions) are assumed to have equal probabilities. In this model, cell numbers
$1, 2, 3, \cdots, n$ contain $r_1, r_2, \cdots, r_n$ particles respectively, where $r_1 + r_2 + \cdots + r_n = r$, and each of the $n^r$
distributions has equal probability $n^{-r}$. This model is universally known as Maxwell-Boltzmann statistics.
James Clerk Maxwell (1831-1879), a celebrated British mathematical scientist, and Ludwig Boltzmann (1844-
1906), a famous Austrian mathematical scientist, founded statistical physics in the nineteenth century. They
applied methods of probability theory to describe physical systems with a large number of particles (like gases).
Maxwell made fundamental contributions to the theory of electrodynamics and kinetic theory of gases. In
1864, he formulated the fundamental equations for all electromagnetic phenomena, universally known as the
Maxwell equations. Boltzmann was the first scientist who recognized that both entropy and the second law
of thermodynamics have a statistical nature. Numerous attempts have been made to test whether physical
particles obey Maxwell-Boltzmann statistics. Modern theories have shown beyond any doubt that this
statistics cannot be applied to any known particles, because not all $n^r$ distributions are equally probable.
In the early twentieth century, two different modern universally accepted models have been discovered to
describe the physical behavior of known particles. One of these models is universally known as Bose-Einstein
statistics, which was discovered by S.N. Bose (1894-1974), a famous Indian mathematical scientist, and Albert
Einstein (1879-1955), a celebrated German mathematical scientist. This model deals with r indistinguishable
identical particles in n cells, so that only distinguishable distributions (or arrangements) are considered,
each assigned the probability $A_{r,n}^{-1}$, where
$$A_{r,n} = \binom{n+r-1}{r} = \binom{n+r-1}{n-1}. \qquad (2.25)$$

It is also universally well known that photons, π-mesons and atoms containing an even number of elemen-
tary particles obey the Bose-Einstein statistics.
The other universally accepted probability model is known as the Fermi-Dirac statistics which is based
on two basic assumptions:
(i) no two or more particles occupy the same cell, so that r ≤ n, and
(ii) all distinguishable distributions (or arrangements) satisfying (i) must have equal probabilities. This model
can then be completely described by stating which of the n cells contain a particle. Since there are r particles,
the corresponding cells can be chosen in $\binom{n}{r}$ ways and hence, there are in all $\binom{n}{r}$ possible arrangements (or
distributions), each with probability $\binom{n}{r}^{-1}$, in the Fermi-Dirac statistics. All elementary particles including
electrons, protons, and neutrons obey the Fermi-Dirac statistics. Enrico Fermi (1901-1954), a famous Italian
mathematical scientist and Paul Dirac (1902-1985), a British mathematical scientist discovered this statistics
independently. So far two famous probability models have been discovered, and it is highly possible that
some day a third model will be discovered for certain kinds of elementary particles.
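The three counting models above can be compared directly. The sketch below, with cell and particle numbers chosen only for illustration, computes the number of equally likely arrangements of r particles in n cells under each statistics, using the counts n^r, (2.25), and C(n, r) respectively; the function names are ours.

```python
from math import comb

def maxwell_boltzmann(n, r):
    """Distinguishable particles: n**r equally likely arrangements."""
    return n**r

def bose_einstein(n, r):
    """Indistinguishable particles, any occupancy: A_{r,n} = C(n+r-1, r), as in (2.25)."""
    return comb(n + r - 1, r)

def fermi_dirac(n, r):
    """Indistinguishable particles, at most one per cell (r <= n): C(n, r)."""
    return comb(n, r)

n, r = 6, 3
for count in (maxwell_boltzmann(n, r), bose_einstein(n, r), fermi_dirac(n, r)):
    print(count, 1 / count)   # each admissible arrangement has probability 1/count
```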

3 Laplace’s Notable Contributions to the Theory of Probability


During the seventeenth, eighteenth and nineteenth centuries, a great deal of attention had been given to the theory of
probability by many eminent mathematicians including Jacob Bernoulli, Abraham de Moivre, Pierre Simon
Laplace (1749-1827), Thomas Bayes (1702-1761), Joseph Louis Lagrange (1736-1813), Leonhard Euler (1707-
1783), A.M. Legendre (1752-1833), Karl Friedrich Gauss (1777-1855), John Venn (1834-1923), Richard von
Mises (1883-1953), Augustus De Morgan (1806-1871), Emile Borel (1871-1956) and P.L. Chebyshev (1821-
1894). Among these great mathematicians and others, Laplace made the most notable and major contributions
to the theory of probability following the significant work of Jacob Bernoulli. Laplace’s masterpiece treatise
entitled Théorie Analytique des Probabilités (Analytical Theory of Probability) was published in 1812. It
contained all of his own discoveries and all other results that were then known in probability and its applica-
tions to mathematical and social sciences including life insurance, mortality, and life tables of mortality which
provided the basis of insurance statistics. He was the first to make major attempts to develop the theory
of probability as a new subject beyond the theory of games of chance. He also first recognized the real signif-
icance of analysis of results of observations or measurements involved in mathematical physics, astronomy,
and social sciences. Although his major research work was on celestial mechanics, Laplace gave the modern
rigorous treatment of probability and statistical inference. His great synthesis brought together many
basic mathematics and philosophical ideas and results which had emerged from classical problems of games
of chance, on the one hand, and from statistical analysis of experimental errors of measurements, on the other
hand. It was inevitable at that time that developments in fields of a philosophical, logical, mathematical,
experimental, financial, industrial, actuarial and statistical nature were bound to affect one another, and to
grow from the same broad general principles.
In 1810, Laplace investigated the problem of finding the mean value from a large number of observations
which he interpreted as the problem of probability that the mean value lies between certain limits. Assuming
that the positive and negative errors in a large number of observations are equally likely, he proved a law of
large numbers, and then showed that their mean value converges to a precise limit. Based on some general
assumptions, he discovered the Method of Least Squares of Errors. He introduced the most appropriate mean
(or average) value for a series of astronomical observations or measurements, and determined how the limits
of the errors are associated with the number of observations of different kinds. At the same time, a new
major subject, the so-called theory of errors, arose, associated with sets of observations or measurements
from trials of experiments in physical science and astronomy. He also proved that the distribution of the average of
random observational errors which are uniformly distributed in an interval symmetric about the origin tends
to the normal distribution as the number of observations increases to infinity. This is a celebrated result
which became known as the central limit theorem. Laplace’s work in applied mathematics and probability
led to many non-trivial mathematical methods including a law of large numbers, generating functions, analysis
of errors, and the Laplace transform which was found to be a simple and elegant method of solving finite
difference, differential and integral equations. Some of his original ideas and results of probability theory
were derived from celestial mechanics and theoretical astronomy. Conversely, the theory of probability led
Laplace to recognize the existence of constant causes of many results in physical astronomy obtained by
observations. He also made an attempt to verify the existence of these constant causes by mathematical
investigations.
Most of the original calculations are based on the analysis of Laplace’s old but standard treatise on
Analytical Theory of Probability. Laplace asked whether the observed sequence of events was to be ascribed
to the operation of an unknown law of nature or to chance. The more useful applications of the theory of
probability to natural phenomena have been in connection with the law relating to errors of observations
and with the kinetic theory of gases due to Maxwell (1831-1879) consisting of a large number of molecules
in rapid random motions. By 1860 Maxwell applied the Laplace-Gauss law of probability to describe an
elaborate statistical theory of gases based on classical thermodynamics. He proved that the velocities of gas
molecules satisfy the normal probability distribution as a consequence of the central limit theorem. When
he became the Head of the Cavendish Laboratory at Cambridge in 1871, he stated in his inaugural lecture
that “the statistical method involves an abandonment of strict dynamical principles and an adoption of the
mathematical methods belonging to the theory of probability ... .”
In view of the importance and usefulness of both erf(x) and erfc(x) in applied mathematics and probability
theory, Laplace introduced these functions by (2.16) and (2.17) respectively and studied their ordinary
properties and asymptotic representations for small and large values of x. He also introduced the double-
exponential density function $f_X(x) = \frac{1}{2}\lambda \exp(-\lambda |x|)$, λ > 0. This is known as the Laplace density function
and is used to model navigation and pilot errors, and in speech and image processing.
Based on a set of very general assumptions, Laplace established the Method of Least Squares. In fact, if
the mean value of a set of observations is the most probable value, and positive errors are as likely as negative ones,
the error function for the observations is of the Gaussian form
$$\frac{h}{\sqrt{\pi}}\, e^{-h^2 x^2}, \qquad (3.1)$$
where h is known as the measure of precision, or the precision constant. The value of h to be chosen
is that which makes the probability a maximum. If $x_1, x_2, \cdots, x_n$ are the measured observations of a number
x, then the deviations of the observations from x are respectively $x - x_1, x - x_2, \cdots, x - x_n$. Assume that
the probability $p(x - x_k)$, $k = 1, 2, 3, \cdots, n$, is Gaussian, that is,
$$p(x - x_k) = \frac{h}{\sqrt{\pi}} \exp\left[-h^2 (x - x_k)^2\right]. \qquad (3.2)$$

Thus, the probability of all deviations is the product
$$\prod_{k=1}^{n} p(x - x_k) = \frac{h^n}{\pi^{n/2}} \exp\left[-h^2 \sum_{k=1}^{n} (x - x_k)^2\right]. \qquad (3.3)$$

The problem is to find x so that the probability is a maximum, which is equivalent to determining x so that
$\sum_{k=1}^{n} (x - x_k)^2$ is a minimum. This gives the mean value $\bar{x}$ of $x_1, x_2, \cdots, x_n$. This method of finding the best
value of an observation by assuming that the sum of the squares of the deviations from it will be a minimum
is called the Method of Least Squares. Putting the mean value $\bar{x} = \frac{1}{n}\sum_{k=1}^{n} x_k$ in the right hand side of (3.3),
and making the above probability a maximum, it turns out that h is determined by the equation
$$\frac{d}{dh}\left[ h^n \exp\left\{ -h^2 \sum_{k=1}^{n} (\bar{x} - x_k)^2 \right\} \right] = 0, \qquad (3.4)$$
so that $\frac{n}{h} = 2h \sum_{k=1}^{n} (\bar{x} - x_k)^2$. Therefore,
$$h^2 = \frac{n}{2\sum_{k=1}^{n} (\bar{x} - x_k)^2} = \frac{1}{2\sigma'^2}, \qquad (3.5)$$
where $\sigma'$ is the standard deviation for the given set of observations, defined by $n\sigma'^2 = \sum_{k=1}^{n} (\bar{x} - x_k)^2$. It
follows that the choice of h is such as to make the standard deviation for the observed set $x_1, x_2, \cdots, x_n$
coincide with that of the given population.
Thus, the works of Fermat, Pascal, Huygens, Laplace and Bernoulli marked the beginning of the theory
of probability in the seventeenth and eighteenth centuries.
However, the Laplace method of least squares was also used earlier by Euler, Legendre and Gauss. Gauss
was the first who justified it by an appeal to the theory of probability. Laplace’s fundamental contributions
to the theory of probability became much more notable than his astronomical works including the analysis
of planetary motions. About eleven years after the discovery of the Bayes theorem in 1763, Laplace also
made a deep investigation of the conditional (or inverse) probability and the analysis of errors in scientific
observations. Obviously, he was in favor of the Bayesian approach based on a priori probability, and the
principle of insufficient reason which was used to determine a priori probability. It is a delight to quote
Laplace’s[4] remarks at the close of his Essai philosophique sur les Probabilités in 1812: “the theory of
probabilities is nothing more than common-sense reduced to calculation. It determines with exactness what
a well-balanced mind perceives by a kind of instinct, without being aware of the process... It is, therefore, a
most valuable supplement to the ignorance and frailty of mind.” Laplace followed the basic ideas of Bayes
to enunciate the principle for calculating the probabilities of the causes by which an observed event may
have been produced. Thus, by any standards, Laplace was a great French mathematician and theoretical

astronomer in the golden age of France who became so famous in his time that he was known as the
Newton (1642-1727) of France. The year 2012 marked the two hundredth anniversary of Laplace's treatise
on the Analytical Theory of Probability published in 1812. He will be remembered forever for his notable
contributions to probability theory, physical astronomy and applied mathematics, and for the major impact
of his marvelous treatise on the subsequent development of the theory of probability and its widespread
applications.
Stimulated by the celebrated work of De Moivre, Thomas Simpson (1710-1761), an eminent British
mathematician, made some contributions to the subject of chance. In his 1757 Miscellaneous Tracts, he also
made a critical investigation of the mean as the best possible value of a series of observations or measurements
in physical astronomy. This work marked the starting point of what is now known as the theory of errors,
or errors of observations, as a branch of the theory of probability. Simpson's work entitled The Nature and
Laws of Chance was very similar to, and less original than, the famous work of De Moivre.
In 1770, J.L. Lagrange (1736-1813) and then, in 1805, A.M. Legendre (1752-1833) spent a considerable
amount of time and effort developing methods for determining the best value from a set of observations.
Legendre also formulated a principle of least squares, which states that the most probable value for the observed
quantities is that for which the sum of the squares of the individual errors is a minimum. On the other hand,
Lagrange showed how to determine the probability that the error of the mean should have an assigned value or
lie within given limits. He also introduced the error (or probability) curves, which are represented by the
equation
$$y = f(x) = k\, e^{-h^2 x^2}, \qquad (3.6)$$
where k and h are constants.


It is almost certain that all errors are included between the limits ±∞, and since certainty is represented by
unity, (3.6) gives
$$1 = k\int_{-\infty}^{\infty} e^{-h^2 x^2}\, dx = \frac{k}{h}\sqrt{\pi}, \qquad (3.7)$$
and hence $k = h/\sqrt{\pi}$, and (3.6) reduces to (3.1). Thus, the result (3.1) is well known as the probability of
errors of observations or measurements.
In order to introduce the probability integral, we consider a set of errors $x_0, x_1, x_2, \cdots, x$ in ascending
order of magnitude from $x_0$ to x, with equal differences between the successive values of x. If x denotes
the error, then the probability P of an error between $x_0$ and x is the sum of the separate probabilities
$k\exp(-h^2 x_0^2),\; k\exp(-h^2 x_1^2), \cdots$, that is,
$$P = k \sum_{x_0}^{x} \exp(-h^2 x^2). \qquad (3.8)$$
Replacing the summation by an integral, with dx representing the successive intervals, between any two limits
$x_0$ and x, (3.8) becomes
$$P = \frac{k}{dx} \int_{x_0}^{x} \exp(-h^2 x^2)\, dx. \qquad (3.9)$$
Since all errors are included between −∞ and +∞, so that the total probability is one, it turns out that
$$1 = \frac{k}{dx} \int_{-\infty}^{\infty} e^{-h^2 x^2}\, dx = \frac{k}{dx} \cdot \frac{\sqrt{\pi}}{h}, \qquad (3.10)$$
so that $k = \frac{h}{\sqrt{\pi}}\, dx$, and then (3.8) becomes
$$\frac{h}{\sqrt{\pi}} \exp(-h^2 x^2)\, dx. \qquad (3.11)$$

This represents the probability of errors of observations between x and x + dx and is known as Gauss’ law of
errors. Consequently, the probability integral that an error lies between any two limits x0 and x is given by
$$P(x) = \frac{h}{\sqrt{\pi}} \int_{x_0}^{x} e^{-h^2 x^2}\, dx. \qquad (3.12)$$

Thus, the probability of a set of errors is greatest when the sum of the squares of the individual errors
$x_0, x_1, \cdots, x_n$ is least, that is, when
$$x_0^2 + x_1^2 + \cdots + x_n^2 = \sum x^2 = \text{a minimum}. \qquad (3.13)$$
This is known as the Legendre principle of least squares.
This is known as the Legendre principle of least squares.
On the other hand, if x1 , x2 , · · · , xn represents a set of n observations and x is its best possible value,
then according to the Method of Least Squares, the sum of the squares of the deviations (or errors)
n
X 2
f (x) = (x − xk ) (3.14)
k=1
0 00
is a minimum. It follows from calculus that f (x) is minimum when f (x) = 0 and f (x) > 0 so that
(x − x1 ) + (x − x2 ) + · · · (x − xn ) = 0 (3.15)
so that the required value of x is the arithematic mean x given by
n
1 1 X
x = (x1 + x2 + · · · + xn ) = xk (3.16)
n n
k=1
00
and it makes (3.14) minimum because f (x) = 2n > 0. This shows that the best value of a given set of
measurements of equal weight is their arithmetic mean (3.16).
Similarly, the best value of a set of observations of different weights is obtained by multiplying each
observation by its weight and dividing the sum of these products by the sum of the weights. With
the same notation as before, if $x_1, x_2, \cdots, x_n$ are n observations with weights $w_1, w_2, \cdots, w_n$ respectively,
then using the Method of Least Squares, the sum of the squares of the deviations (or errors)
$$f(x) = \sum_{k=1}^{n} w_k (x - x_k)^2 \qquad (3.17)$$
is a minimum, so that $f'(x) = 0$ and $f''(x) > 0$. Consequently, the most probable value of the set of observations
of different weights is then given by
$$\bar{x} = \frac{w_1 x_1 + w_2 x_2 + \cdots + w_n x_n}{w_1 + w_2 + \cdots + w_n} = \frac{\sum_{k=1}^{n} w_k x_k}{\sum_{k=1}^{n} w_k}. \qquad (3.18)$$
Thus, the value $\bar{x}$ in (3.18) is called the arithmetic mean of weighted observations or, often, the weighted
(general) arithmetic mean. When $w_1 = w_2 = \cdots = w_n = 1$, the result (3.18) reduces to (3.16).
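The least-squares results (3.16) and (3.18) amount to the following short computation; the function name weighted_mean and the sample observations are ours, chosen only for illustration.

```python
def weighted_mean(observations, weights=None):
    """Best value in the least-squares sense: the weighted arithmetic
    mean (3.18), reducing to the plain mean (3.16) for equal weights."""
    if weights is None:
        weights = [1.0] * len(observations)
    return sum(w * x for w, x in zip(weights, observations)) / sum(weights)

obs = [9.8, 10.1, 10.0, 9.9]
print(weighted_mean(obs))                        # plain arithmetic mean (3.16)
print(weighted_mean(obs, weights=[1, 2, 2, 1]))  # weighted mean (3.18)
```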
Alternatively, using the normal distribution of the probability of errors, Laplace determined the most
probable value of x from a large number of observations of n different kinds. The probability of the
error of the result obtained from the rth observation lying between $x - x_r$ and $x - x_r + dx$ is
$$\sqrt{\frac{w_r}{\pi}} \exp\left[-w_r (x - x_r)^2\right], \qquad (3.19)$$
where $w_r$ is constant and $r = 1, 2, 3, \cdots, n$. Thus, the probability that x is the true value of the
unknown quantity is obtained by multiplying together the results for the first, second, ..., and nth observations, and
is proportional to
$$\exp\left[-\sum_{r=1}^{n} w_r (x - x_r)^2\right] = \exp\left[-f(x)\right]. \qquad (3.20)$$
We determine x from (3.20) so that the probability of errors has a maximum value, which is equivalent to
the minimum value of f(x) for some x. This implies that $f'(x) = 0$ and $f''(x) > 0$, so that x is given by
(3.18).
On the other hand, Laplace raised the basic question in his treatise, Théorie Analytique des Probabilités,
whether an observed set of data could be reasonably attributed to the operation of an unknown law of Nature
or to chance. So, equation (3.1) or (3.6) may not be a precise representation of the law of errors.
However, John Venn, a renowned British mathematician, stated in his 1866 book The Logic of Chance that
the exponential law of errors (3.1) is a law because it expresses a physical fact relating to the frequency
with which errors are found to present themselves in practice, while the method of least squares is a rule
which shows how the best representative value can be extracted from a set of experimental measurements. In
the preface of his Thermodynamique published in 1892, Henri Poincaré (1854-1912) made a precise remark:
“everybody firmly believe in it (the law of errors), because mathematicians imagine that it is a fact of
observation, and observers that it is a theorem of mathematics.” At any rate, several theoretical attempts
have been made notably by Laplace, Gauss and others to derive the law of errors without any solid foundation.
The probability curve represented by the formula (3.6) can be considered as the graphical representation
of the law relating the probability of the occurrence of an error with its magnitude. According to Venn,
“although probability may require considerable mathematical knowledge, the discussion of the fundamental
principles on which rules are based does not necessarily require any such quantification.” His view was that
the application of probability in such matters as insurance was simply one more aspect of induction, the
extrapolation of past experience into the future. John Venn revised his book twice during two decades,
and made many important contributions to the theory of probability with applications in jurisprudence, logic
and set theory. He also created the so-called Venn diagrams to describe relations geometrically between
subsets, which were originally known to Leonhard Euler.
Another famous British mathematician and logician, Augustus De Morgan (1806-1871) made major
contributions to symbolic algebra and symbolic logic, and has widely been recognized for his laws which are
now known as De Morgan’s laws in symbolic logic. In his celebrated treatise on Formal Logic, he included
three chapters on probability and induction to provide the foundation of probability theory which was based
on prior information and beliefs. So, his views on probability were totally opposed to those of John Venn, who
also expressed a poor opinion of the contributions and writings of De Morgan. Like Laplace, De Morgan
provided the usual rules of probability of disjoint and independent random events and even stated Bayes’
law of conditional probability.
Laplace provided a remarkable view about the theory of probability through his famous quotation that
“a science which began with the consideration of play has risen to the most important object of human
knowledge.” But it is even more remarkable that, although so much attention has been given to the
subject of probability due to its numerous applications and enormous impact, mathematical scientists and
philosophers carried on heated discussions about probability for more than a century, and even after that
they were unable to agree on the correct definition and/or interpretation of probability. In fact, probability
is not a univocal term, and so it has different meanings in different contexts. However, we can give at least
three main interpretations of probability. The first, classical subjective view, due to Laplace and De Morgan,
refers to a state of mind leading to a degree of belief in a proposition, as none of our knowledge is certain.
In An Essay on Probabilities, published in 1838, De Morgan interpreted probability “as meaning the state of
mind with respect to an assertion, a coming event, or any other matter on which absolute knowledge does
not exist.” The second view defines probability as an intuitively understandable, logical relation between
propositions. A principal exponent of this view, John Maynard Keynes (1883-1946), a British mathematician
and gambler, wrote: “We must have a logical intuition of the probable relations between propositions. Once
the existence of this relation between evidence and conclusion, the latter becomes the subject of the degree
of belief.” The third and perhaps popular objective view of probability is essentially based on the statistical
concept of relative frequencies of events from observations and experiments. Based on games of chance with
finite number of equally likely outcomes, the objective probability is measured by the ratio of the number of
favorable outcomes to the total number of outcomes. This popular view was developed by several eminent
mathematical scientists and philosophers including Bernhard Bolzano (1781-1848), John Venn, von Mises
and Sir R. A. Fisher (1890-1962). Thus, the frequency interpretation was developed by frequentists in the
twentieth century and seems to be the most satisfactory approach for analyzing the meaning of the term
probability in the contexts of every-day discourse, statistics and measurement, and within many fields of

mathematical and statistical sciences.
A renowned Russian mathematician, P. L. Chebyshev (1821-1894), made some major contributions to
probability theory including the concepts of mathematical expectation, variance and the arithmetic mean of
random variables. He is best known for his 1846 simple proof of the weak law of large numbers for
repeated independent trials based on what is now known as Chebyshev's inequality: the probability that a
random variable will assume a value more than k standard deviations from its mean is at most $k^{-2}$. The
more explicit form of the Chebyshev inequality is given by
$$P(|X - \bar{x}| > k\sigma) \le \frac{1}{k^2}, \qquad (3.21)$$
where $\bar{x}$ is the mean and σ is the standard deviation of the random variable X. He also worked extensively on the central limit
theorems, and in 1887 he gave an explicit statement of the central limit theorem for independent random
variables. He is considered the intellectual father of a long list of renowned Russian scientists including
A. A. Markov (1856-1922), S. N. Bernstein (1880-1968), A. N. Kolmogorov (1903-1987), A. Y. Khinchin
(1894-1959) and others. Chebyshev's student Markov made a remarkable extension of the law of large
numbers to dependent trials. The idea of dependent trials, now known as Markov chains, is one of the major
topics of active current research in the statistical sciences. In its simplest form it applies to a system in one
of a number of states $\{S_1, S_2, \cdots, S_n\}$ which at given times can change from one state to another. The
probabilities of transitions from $S_i$ to $S_j$ form an n × n matrix $(p_{ij})$, called the Markov transition matrix.
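In this matrix notation a Markov chain is fully specified by (p_ij). The sketch below, with an illustrative two-state chain of our own choosing, propagates an initial distribution forward in time; it is a toy example rather than anything discussed in the article.

```python
def step(distribution, transition):
    """One step of a Markov chain: new_j = sum_i distribution_i * p_ij."""
    n = len(transition)
    return [sum(distribution[i] * transition[i][j] for i in range(n))
            for j in range(n)]

# A two-state chain with transition matrix (p_ij); each row sums to 1.
P = [[0.9, 0.1],
     [0.5, 0.5]]
dist = [1.0, 0.0]             # start in state S1 with certainty
for _ in range(20):
    dist = step(dist, P)
print(dist)                    # approaches the stationary distribution [5/6, 1/6]
```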
Using Henri Lebesgue’s (1875-1941) new and general theory of integration and measure theory, Emile
Borel (1871-1956), a celebrated French mathematician, first formulated the modern approach to probability
theory and random variables. In his short book on Probabilité et Certitude (Probability and Certainty),
Borel pointed out that probability theory is essential for the study of many diverse problems in natural and
social sciences, engineering, business, economics, agricultural, medical and statistical sciences.

4 Bayes’ Celebrated Theorem in Probability


Thomas Bayes (see Bellhouse [5]), a British clergyman, had a remarkable interest in mathematical probability
and made some fundamental contributions to probability theory. He provided a mathematical analysis of
finding the probability p based on k successes and (n − k) failures observed in n trials. This problem was
exactly opposite to that of Jacob Bernoulli, who assigned probabilities to the event of getting k successes
in n independent trials under the assumption that the probability of success in each trial was p. Bayes made
an attempt to determine the parameter in a distribution from observed data. He asserted that p would lie
between a and b with a probability proportional to the area under the curve $y = x^k (1 - x)^{n-k}$ between a
and b. Suppose the probability of event B is p, given that event A has occurred, and that after a number of trials,
without reference to whether A has occurred or not, it is found that event B has occurred m times and has
not occurred n times. What is the probability of event A? In order to answer this question, Bayes also
introduced a new fundamental idea, the conditional probability P(A | B) of an event A, given that B has
occurred, defined by
$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}, \qquad P(B) \ne 0. \qquad (4.1)$$
Or, equivalently,
$$P(A \cap B) = P(A \mid B)\, P(B). \qquad (4.2)$$
This result is known as the multiplication law for determining the probability that each of the two events,
A and B, occurs. However, the use of this multiplication law is not the only way to determine the probability
that each of two events occurs. Since P(A ∩ B) = P(B ∩ A), it follows that
$$P(B \cap A) = P(B \mid A)\, P(A) = P(A \mid B)\, P(B). \qquad (4.3)$$
Dividing by P(B) leads to the celebrated Bayes' theorem:
$$P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}. \qquad (4.4)$$

This expresses the conditional (or inverse) probability of A given B in terms of the conditional probability
of B given A. In fact, it also provides a direct relation between the a posteriori probability, P(A | B), and the a
priori probability, P(A), through the likelihood function P(B | A). In other words, the a priori probability
P(A) is combined with the likelihood function to determine the a posteriori probability P(A | B). According to
Bayes, it is possible to assign a probability to an event before a similar event has occurred. This is exactly
the opposite of the view of frequentism, which asserts that the probability of a future event depends on its occurrence in the
past. But it is not always clear how to determine the a priori probability P(A). Classical probabilities dealing
with throwing a coin or dice, or drawing from a deck of cards, or similar problems are in effect Bayesian,
where symmetry or subjective intuitive considerations helped assign probabilities to hypothetical events.
However, it is an interesting and subtle problem to develop methods for determining a priori probabilities. One
classical method, first introduced by Jacob Bernoulli and then advocated by Laplace, is called the Principle of
Insufficient Reason (PIR), which is a rule for assigning the probability of m events among n (≥ m) mutually exclusive
and indistinguishable events, so that the probability is m/n. For example, in the case of tossing an ideal coin, the
PIR can be used to assign an a priori probability of 1/2 to each of the two possible outcomes, head or tail. Both Bernoulli
and Laplace strongly believed that when we are ignorant of the ways an event can occur, the event is equally
likely to occur in any way. So, it seems clear that, in many cases, an a priori probability can be determined
by the PIR. It is also often possible to find an a priori probability using some external evidence or some kind of
good guess or observations.
We next give a simple example of Bayes' theorem. When two coins are thrown simultaneously, there are
four possible outcomes: HH, HT, TH, TT. Let A = {HH} be the event consisting of heads on both coins,
and let B = {HH, HT} be the event that the first coin shows a head. Thus, P(A) = 1/4, P(B) = 2/4 = 1/2 and
P(A ∩ B) = P(HH) = P(A) = 1/4. According to the Bayes theorem, the conditional probability is given by
$$P(A \mid B) = \frac{P(A \cap B)}{P(B)} = \frac{1}{2}.$$
The natural and simple generalization of the Bayes theorem deals with the probabilities of mutually exclusive
and exhaustive events $A_1, A_2, \cdots, A_n$ when the event B has occurred, and has the form
$$P(A_i \mid B) = \frac{P(A_i \cap B)}{P(B)} = \frac{P(B \mid A_i)\, P(A_i)}{P(B)} = \frac{P(B \mid A_i)\, P(A_i)}{\sum_{j=1}^{n} P(B \mid A_j)\, P(A_j)}. \qquad (4.5)$$
Thus, the generalized version of the Bayes theorem provides a direct relation between the a priori probabilities,
$P(A_i)$, and the a posteriori probabilities, $P(A_i \mid B)$, given that B has already occurred. If the events $A_i$ are called
causes, then (4.5) represents the generalized Bayes rule of the probability of causes. It has a wide variety
of applications to problems in many areas including statistics, communication theory and medical imaging.
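The generalized rule (4.5) is a one-line computation. In the sketch below the priors and likelihoods are illustrative numbers only (a hypothetical two-cause setting), not data from the article, and the function name is ours.

```python
def bayes_posterior(priors, likelihoods):
    """Posterior probabilities P(A_i | B) from priors P(A_i) and
    likelihoods P(B | A_i), as in (4.5)."""
    joint = [p * l for p, l in zip(priors, likelihoods)]
    total = sum(joint)                    # P(B), the total probability
    return [j / total for j in joint]

# Hypothetical example: two causes A1 (prior 0.01) and A2 (prior 0.99),
# with P(B | A1) = 0.95 and P(B | A2) = 0.05.
print(bayes_posterior([0.01, 0.99], [0.95, 0.05]))
# -> roughly [0.161, 0.839]: observing B makes A1 more credible, though A2 still dominates.
```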
We next formulate Bayes’ theorem for the conditional probability density functions. The conditional
probability distribution function FX (x | A) given that an event A has occurred is defined by
P (X ≤ x | A)
FX (x | A) = P (X ≤ x | A) = , P (A) 6= 0. (4.6)
P (A)
If the event X ≤ x is independent of A, then P (X ≤ x | A) = P (X ≤ x). The corresponding conditional
probability density function fX (x | A) is defined by
d P (x < X ≤ x + ∆x | A)
fX (x | A) = FX (x | A) = lim . (4.7)
dx ∆x→0 ∆x
If P (A) 6= 0, the conditional probability is given by
fX (x | B) P (B)
P (B | X = x) = lim P (B | x < X ≤ x + ∆x) = . (4.8)
∆x→0 fX (x)
Multiplying both sides of (4.8) by fX (x) and integrating the result from −∞ to ∞ gives
Z ∞ Z ∞
fX (x) P (B | X = x) dx = P (B) fX (x | B) dx = P (B) . (4.9)
−∞ −∞

This represents the total probability formula for density functions. Consequently, the Bayes theorem for the
conditional density function $f_X(x \mid B)$ follows from (4.8) as
$$f_X(x \mid B) = \frac{P(B \mid X = x)\, f_X(x)}{P(B)}. \qquad (4.10)$$
Substituting the value of P(B) from (4.9) into (4.10) gives
$$f_X(x \mid B) = \frac{P(B \mid X = x)\, f_X(x)}{\int_{-\infty}^{\infty} P(B \mid X = x)\, f_X(x)\, dx}. \qquad (4.11)$$
If B represents the event (Y = y), then (4.11) reduces, after some algebra, to the form
$$f_{X\mid Y}(x \mid y) = \frac{f_{Y\mid X}(y \mid x)\, f_X(x)}{\int_{-\infty}^{\infty} f_{Y\mid X}(y \mid x)\, f_X(x)\, dx}. \qquad (4.12)$$

This is the Bayes theorem for probability density functions which connects the a posteriori probability density
fX|Y (x | y) given that the event Y = y has already occurred with the a priori probability density fX (x).
On the other hand, the Bayes theorem can be extended to probability functions of a continuous variable.
The probability of a sample, U (t) of a continuous variable t drawn from a population V (t) can be calculated
by a certain process p (x) in the range a < x < b in the form
Z b
U (t) = V (t + x) p (x) dx. (4.13)
a

This is the Bayes theorem for the probability of the sample U(t) drawn from the original population V(t) with a given process p(x). Conversely, recovering V(t) from a given U(t) in (4.13) may not be possible; when it is possible, the solution of the integral equation (4.13) yields a V(t) which is not necessarily unique, because the law of selection of p(x) is involved. The general problem of finding V(t) is beyond the scope of this article, and hence the reader is referred to Levy and Roth [2] for more information about it.
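For the forward problem in (4.13), the integral can be evaluated by elementary numerical quadrature. The population V(t), the selection density p(x) and the interval (a, b) in the sketch below are illustrative assumptions, not taken from the paper.

```python
import numpy as np

# Numerical sketch of the "forward" problem in (4.13): computing the sample
# U(t) from an assumed population V(t) and selection process p(x) on (a, b).
a, b = -1.0, 1.0
xs = np.linspace(a, b, 2001)
dx = xs[1] - xs[0]

V = lambda t: np.exp(-t**2)              # assumed population curve V(t)
p = lambda x: 0.5 * np.ones_like(x)      # uniform selection density on (-1, 1)

def U(t):
    # U(t) = integral_a^b V(t + x) p(x) dx, evaluated by the trapezoidal rule
    vals = V(t + xs) * p(xs)
    return dx * (vals.sum() - 0.5 * (vals[0] + vals[-1]))

print(round(U(0.0), 4))    # a smoothed (smaller) value than V(0) = 1
```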
Bayes' work, An Essay towards solving a Problem in the Doctrine of Chances, including his celebrated theorem (4.13), was published in 1763, after his death, in the Philosophical Transactions of the Royal Society of London by his close friend Richard Price (1723-1791), who also did some major editing and updating before publication. If an event has p successes and q failures, Bayes proved that the probability P of the chance of success lying between a and b is given by

P = ∫_a^b x^p (1 − x)^q dx / ∫_0^1 x^p (1 − x)^q dx = (1/B(p + 1, q + 1)) ∫_a^b x^p (1 − x)^q dx.   (4.14)

He then evaluated these definite integrals by a method of approximation. The integral in the denominator is the beta function B(p + 1, q + 1), which was introduced by both Euler and Legendre around 1770. Thus, the study of the beta function and its properties by Euler and Legendre was suggested by the work of Bayes.
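Bayes' integral (4.14) is easy to check numerically. The sketch below uses illustrative values p = 7, q = 3 and (a, b) = (0.5, 0.9), and confirms that the normalizing integral equals the beta function B(p + 1, q + 1), so that P is simply the area of a Beta(p + 1, q + 1) density between a and b (assuming SciPy is available).

```python
from scipy import integrate, special, stats

# Illustrative values: p successes, q failures, and an interval (a, b).
p, q = 7, 3
a, b = 0.5, 0.9

num, _ = integrate.quad(lambda x: x**p * (1 - x)**q, a, b)
den, _ = integrate.quad(lambda x: x**p * (1 - x)**q, 0, 1)

P_quad = num / den
# The denominator is the beta function B(p + 1, q + 1), so the same probability
# is the Beta(p + 1, q + 1) distribution function evaluated between a and b.
P_beta = stats.beta(p + 1, q + 1).cdf(b) - stats.beta(p + 1, q + 1).cdf(a)

print(abs(den - special.beta(p + 1, q + 1)) < 1e-9)   # True
print(round(P_quad, 6), round(P_beta, 6))             # the two values agree
```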
In addition, Bayes' published work also contained the principle of mathematical induction, which is a method of proving a general result from particular ones. This principle is now regarded as a fundamental axiom of modern mathematics. However, Bayes' work gained only limited exposure until it was independently reconsidered by Laplace in the early 1770s and then further developed by him in his modern treatise of 1812. In spite of some weaknesses of the Bayesian approach, Gauss used the idea of inverse probability to derive and justify the Method of Least Squares. Moreover, Bayes' theorem of conditional (or inverse) probability also appeared in an 1837 paper of De Morgan in connection with Laplace's work on probability. It is important to note that Bayes' theorem can also be derived from the basic axioms of probability and has many applications in a wide range of calculations involving probabilities, not just in Bayesian inference in statistics. Sir Harold Jeffreys (1891-1989), a celebrated British applied mathematician, became actively involved in research on Bayes' theorem, and then modified it into what is now known as the Objective Bayes-Jeffreys principle. It is worth adding Jeffreys' famous remark about the Bayes theorem: "Bayes' theorem is to the theory of probability what Pythagoras' theorem is to geometry".

5 The Famous Controversy Between Bayesian and Frequentism Approaches to Probability and Statistics
Since the publication of Bayes' work [6] in 1763, the Bayesian approach had been very influential in probability and statistics over the next 157 years, until the development of another new and fundamentally opposite principle, known as frequentism, in the early twentieth century. This principle was developed by three modern founders of statistical science: Sir Ronald A. Fisher (1890-1962), a celebrated British statistician; Jerzy Neyman (1894-1981), a Polish-born American mathematical statistician; and Karl Pearson (1857-1936), a famous British mathematician, philosopher and lawyer. Frequentism states that the probability of any future event depends on its frequency of occurrence in the past. The famous question was: What is the probability that the Sun will rise tomorrow? The short answer is that there have been no exceptions during the last thousands of years, and so the Sun will rise tomorrow. As William Feller (1906-1970) pointed out in his classic book on probability, our confidence that the Sun rose daily in the remote past for 5000 years is based on the same considerations that lead us to believe that the Sun will rise tomorrow.
It was Fisher who boldly rejected the Bayesian formulation and declared that it cannot be accepted as the fundamental approach to probability and statistics, for the following reasons. In the absence of genuine prior information, the Bayesian approach is inherently subjective. When the Bayesian method is applied to solve problems of statistics, it often produces different answers depending on the choice of prior. According to him, frequentism instead begins with the choice of a specific method that meets the high demand of scientific objectivity.
Fisher made many fundamental contributions to modern statistical science, including the maximum likelihood theory of optimal estimation, fiducial inference, and the derivations of various sampling distributions. In his first famous paper of 1915, he showed that maximum likelihood is the best possible frequentist estimation method in the sense that it minimizes the expected squared difference between an unknown parameter x and its maximum likelihood estimate x̂, no matter what x may be. Neyman also made outstanding contributions to many major problems in probability and statistics, including hypothesis testing, confidence intervals, and survey sampling. He also successfully completed Fisher's theory of maximum likelihood by developing optimal frequentist methods for testing and confidence intervals. His research revolutionized both the theory and practice of statistical science. On the other hand, Pearson was the third founder of modern statistical science and became very famous for his many major contributions to mathematical statistics, applied statistics, the chi-square test of significance, correlation and regression analysis, and the mathematical investigation of Charles Darwin's (1809-1882) theory of evolution. Pearson proclaimed himself a frequentist. Due to the strong influence of these three modern founders of statistical science, together with John Venn (1834-1923) and George Boole (1815-1864), the twentieth century became predominantly frequentist in both theory and statistical practice. The combined work of Fisher and Neyman firmly established the strong influence and domination of frequentism and its statistical practice in the twentieth century. In the early 1920s, relations between these three famous men were very uncordial and mutually unfriendly, and subsequently they became bitter rivals during the 1930s.
Finally, we summarize various implications and major differences between the Bayesian and frequentist approaches. Contrary to the frequentist view, the Bayesian approach states that a probability can be assigned to an event before a similar event has occurred. In other words, it is essentially based on a priori beliefs or information; that is, it is inherently subjective, which was totally unacceptable to frequentists. On the other hand, frequentism is associated with the frequency interpretation of probability, in which probability is regarded as an objective property of events. In particular, any specific experiment can be regarded as one of an infinite set of possible repetitions of the same experiment, each capable of producing independent results. In this approach, unknown quantities are treated as having fixed unknown values that are not random variables, and hence no probability can be assigned to them. On the contrary, the Bayesian approach to inference allows a probability to be assigned to an unknown quantity, and that probability can sometimes have a frequency interpretation. So the major difference between these two basic principles is that the Bayesian approach is essentially based on a priori beliefs or information, while frequentism focuses on the objective behavior of particular estimates and tests under repeated experiments. In general, the latter is formulated under the basic assumption that the design of any further experiment depends on the outcomes of the same experiment performed in the past.
In spite of the serious criticism against the Bayesian principle, it has been found to be extremely useful in many scientific investigations in recent years, especially in modern machine learning and talent analytics. In addition, Sir Harold Jeffreys' book [7] on the theory of probability, published in 1939, as well as his work on
the improved formulation of the Bayesian principle, now known as the Bayes-Jeffreys principle, has played a major role in the revival of the Bayesian approach in the twentieth century. Although the Bayesian-frequentist controversy has not yet been fully resolved, there are increasing signs of a revival of the Bayesian approach, its applications, and its popularity in the twenty-first century. It is interesting to note that this year 2013 marks the 250th anniversary of Bayes' celebrated work (see Efron [8]) in probability and statistics since its posthumous publication in 1763. Thomas Bayes had a vision, and 250 years later we realize that his vision has become a reality in both theory and practice. Indeed, the Bayesian approach had a tremendous impact on the twentieth century, which has seen the most modern advances in probability and statistics. Thomas Bayes will be remembered forever for his remarkable discovery of one of the two fundamental approaches to probability and statistics.

6 Modern Axiomatic Theory of Probability - The Legacy of A.N. Kolmogorov
The most influential figure in probability theory during the early twentieth century was A.N. Kolmogorov, a
Russian mathematical scientist. He made many fundamental contributions to many areas of mathematical
and physical sciences including Fourier and trigonometric series, measure theory, set theory, the theory of in-
tegration, mathematical logic, approximation theory, probability theory, topology, the theory of algorithms,
mathematical linguistics, the theory of random processes, mathematical statistics, information theory, dy-
namical systems, theory of turbulence, celestial mechanics, automata theory, differential equations, ballistics,
Hilbert's 13th problem, mathematics education, and applications of mathematics to problems in biology, geology and the crystallization of metals.
In his famous Foundations of the Theory of Probability, first published in German in 1933 and in English translation in 1956, Kolmogorov [9] created the modern axiomatic foundations of probability theory. His general approach is based on the idea that each possible outcome of a random experiment has a well-defined probability of occurrence, independent of any measurements. The probability measure is defined on the field F of events (subsets of the sample space S), and it is a mapping from F to the interval [0, 1] of the real line that satisfies the following Kolmogorov axioms for all events belonging to the field F:
(i) For any event E (subset) ∈ F, the probability P (E) is a real number which satisfies

0 ≤ P (E) ≤ 1. (6.1)

(ii)

P (S) = 1 and P (φ) = 0, (6.2)

where φ denotes the empty set which contains no element.


(iii) For any two mutually exclusive (or disjoint) events A and B in F, that is, A ∩ B = φ,

P(A ∪ B) = P(A) + P(B).   (6.3)

(iv) If E1, E2, · · · , En, · · · is a sequence of events with Ei ∩ Ej = φ for all i ≠ j, then

P(∪_{n=1}^{∞} En) = Σ_{n=1}^{∞} P(En).   (6.4)

The following results follow from the Kolmogorov axioms (i)–(iv):


We have P (φ) = 0, but if P (A) = 0, it does not imply that A = φ. All sets with zero probability are called
null sets. An empty set is a null set, but null sets need not be empty.
Since A ∪ Ā = S and A ∩ Ā = φ, it follows that

P(Ā) = 1 − P(A).   (6.5)
For any two events (subsets) A and B of S, not necessarily mutually exclusive,

P (A ∪ B) = P (A) + P (B) − P (A ∩ B) . (6.6)

We can write

P(A ∪ B) = P(A ∪ (Ā ∩ B)) = P(A) + P(Ā ∩ B),
P(B) = P((A ∩ B) ∪ (Ā ∩ B)) = P(A ∩ B) + P(Ā ∩ B).

Eliminating P(Ā ∩ B) from these two results gives (6.6).
If A and B are mutually exclusive events, that is, A ∩ B = φ, then P(A ∩ B) = P(φ) = 0 and (6.6) reduces to Kolmogorov's third axiom (6.3). On the other hand, if P(A) = P(B), that is, the two sets have equal probability, this does not mean that A is equal to B.
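The consequences (6.5) and (6.6) of the Kolmogorov axioms can be verified by direct enumeration on any finite sample space. The sketch below uses two fair coin tosses with the uniform measure as an illustrative model; it is not part of the original paper.

```python
from fractions import Fraction
from itertools import product

# Finite check of (6.5) and (6.6) on the sample space of two fair coin tosses.
S = set(product("HT", repeat=2))
P = lambda E: Fraction(len(E), len(S))          # uniform probability measure on S

A = {w for w in S if w[0] == "H"}               # first toss is a head
B = {w for w in S if w[1] == "H"}               # second toss is a head

assert P(S) == 1 and P(set()) == 0              # axiom (ii)
assert P(S - A) == 1 - P(A)                     # complement rule (6.5)
assert P(A | B) == P(A) + P(B) - P(A & B)       # inclusion-exclusion (6.6)
print("axioms and (6.5), (6.6) verified on this finite model")
```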
It should be pointed out that the Kolmogorov axiomatic approach is compatible with the view of the famous philosopher and natural scientist Immanuel Kant (1724-1804), who introduced the basic idea of the a priori, which is intrinsic to human nature, and the a posteriori, which humans acquire through experience. According to Kolmogorov, probability exists a priori, while relative frequencies are products of measurements and are thus a posteriori. Kolmogorov will be remembered forever for his many remarkable contributions to the mathematical and physical sciences, especially for his 1933 discovery of the modern axiomatic treatment of probability.

7 Applications of Probability Theory to the Distribution of Prime Numbers
From ancient times, the natural numbers n > 1 have been divided into two classes: prime numbers, such as 2, 3, 5, 7, 11, 13, 17, 19, 23, 29, · · · , which are divisible only by 1 and themselves, and composite numbers, such as 4, 6, 8, 9, 10, 12, 14, 16, 18, 20, which admit factors other than 1 and themselves. The two most fundamental problems in early mathematics were whether a given natural number n > 1 is prime, and how to compute, for every positive real number x, the number π(x) of primes p less than or equal to x. In general, there were no easy algorithms or methods of solution for these problems. However, for small values of x, it is always possible to compute π(x) exactly. For example, π(10) = 4 and π(20) = 8. But for large x, it is not easy to compute π(x). This is one of the major problems in all of mathematics.
Another basic problem in number theory was how to factor a natural number into primes, which led to the Fundamental Theorem of Arithmetic: every natural number n > 1 can be uniquely factored into primes. In other words, every such number can be expressed as a product of primes in exactly one way. However, there is no corresponding theorem for the additive decomposition of natural numbers into primes.
From the table of prime numbers, ancient Greek mathematicians knew that there were infinitely many primes. About twenty-three centuries ago, Euclid (365-300 BC) recognized that there is no largest prime number, and he gave a beautiful proof by contradiction that there are an infinite number of primes, so that π(x) → ∞ as x → ∞. After the discovery of Euclid's celebrated theorem, a Greek mathematician, Eratosthenes (266-194 BC), developed a method, called the Sieve of Eratosthenes, to determine whether a given natural number is a prime number and to estimate the number of primes up to a given number x.
It follows from the table of prime numbers that they seem to occur quite frequently and irregularly at first, but less frequently and irregularly later on. In fact, about 25% of the numbers up to 100 are primes, and the corresponding figure for numbers up to 10^6 is just about 8%. So this raises fundamental questions of how prime numbers are distributed, and whether there is any mathematical formula that describes all of them. A table of prime numbers also shows a chaotic nature whose apparent disorder somewhat resembles classical random behavior, like the Brownian motion of an ideal gas in the physical sciences. Thus, one of the remarkable features of the distribution of primes is their tendency to exhibit local irregularity and global regularity. In other words, prime numbers behave as randomly as possible and, at a given point, statistical fluctuations occur as in a game of chance. Evidently, probability theory can help describe the distribution of primes, and probabilistic methods and/or computational techniques become very useful for testing primality and determining whether a given number is composite. In spite of the powerful modern technology available in recent years, factoring large composite numbers is still a difficult problem.
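The counts quoted above are easy to reproduce with the Sieve of Eratosthenes. The short Python sketch below (an illustration only) computes π(x) for x = 10, 20, 100 and 10^6 and the corresponding densities π(x)/x, namely 25% at 100 and about 7.85% at 10^6.

```python
import numpy as np

def primes_up_to(n):
    # Sieve of Eratosthenes: return an array of all primes <= n.
    is_prime = np.ones(n + 1, dtype=bool)
    is_prime[:2] = False
    for p in range(2, int(n**0.5) + 1):
        if is_prime[p]:
            is_prime[p * p::p] = False
    return np.flatnonzero(is_prime)

for x in (10, 20, 100, 10**6):
    pi_x = len(primes_up_to(x))
    print(x, pi_x, round(pi_x / x, 4))
# pi(10) = 4, pi(20) = 8, pi(100) = 25 (25%), pi(10**6) = 78498 (about 7.85%)
```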

In 1748, Euler used a precise analytic treatment to prove his celebrated identity for the Euler zeta function ζ(x), for real x > 1, in the form (see Debnath [10])

ζ(x) = Σ_{n=1}^{∞} 1/n^x = Π_p (1 − 1/p^x)^{−1},   (7.1)
where the product is taken over all primes p. The Euler zeta function (7.1) is closely associated with the distribution of the prime numbers. By considering x = 1, Euler also proved that the number of primes is infinite by showing that the sum of the reciprocals of the primes diverges, that is,

Σ_{p prime} 1/p = ∞.   (7.2)

This leads to the equivalent statement of the Euclid theorem:

lim_{x→∞} π(x) = ∞.   (7.3)
The function π(x) has been the subject of intensive research for the last several hundred years. The precise asymptotic behaviour of π(x) as x → ∞ has become known as the celebrated Prime Number Theorem:

lim_{x→∞} π(x)/(x/log x) = 1,   (7.4)

or, equivalently, as x → ∞,

π(x) ∼ x/log x   or   π(x) ∼ x/(log x − c) = L(x),   (7.5ab)

where c is a numerical constant.
In 1798, Legendre proved the following limiting result:

lim_{x→∞} π(x)/x = 0.   (7.6)
This implies that there are considerably fewer prime numbers than natural numbers.
It is quite remarkable that, at the age of sixteen in 1791, Gauss made a detailed study of a table of prime numbers up to three million, and then made a famous conjecture that the density of primes at x is about (log x)^{−1}, so that the number π(x) of primes p between 1 and x, for very large x, is given approximately by

π(x) ≈ Σ_{n=2}^{x} 1/log n ≈ ∫_2^x dt/log t = Li(x).   (7.7)
This became well known as the celebrated Prime Number Theorem.


Among all asymptotic approximations of π(x) for very large x obtained by many great mathematicians, Li(x) provides a much closer asymptotic approximation to π(x) than does (x/log x) or L(x). Although Euler, Legendre and Gauss suggested different forms of the Prime Number Theorem, the question about the distribution of primes remained inconclusive. It follows from the available data that over long intervals the density of primes tends to decrease as one approaches larger and larger integers.
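The relative quality of the approximations x/log x and Li(x) is easy to see numerically. The sketch below compares them with the exact count π(x) for x = 10^4 and 10^6, assuming sympy (for primepi) and SciPy (for the quadrature defining Li) are available.

```python
import math
from scipy.integrate import quad
from sympy import primepi      # exact prime-counting function; assumed available

def Li(x):
    # Gauss's logarithmic integral Li(x) = integral from 2 to x of dt / log t.
    value, _ = quad(lambda t: 1.0 / math.log(t), 2, x)
    return value

for x in (10**4, 10**6):
    print(x, int(primepi(x)),
          round(x / math.log(x)),   # the cruder approximation x / log x
          round(Li(x)))             # Li(x) is much closer to pi(x)
```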
It also follows from the Euler formula (7.1) that its equivalent form is

1/ζ(x) = Π_p (1 − 1/p^x).   (7.8)

Evidently, Euler’s formula (7.1) or (7.8) represents a remarkable connection between the zeta function ζ (x)
and the infinite product over prime numbers only.

In 1859, Bernhard Riemann (1826-1866) made a remarkable generalization of ζ(x) to ζ(z) for a complex variable z = x + iy in the form

1/ζ(z) = Π_p (1 − 1/p^z),   (7.9)

where the product is again taken over all primes p.

The Riemann zeta function ζ(z) is also closely associated with the distribution of prime numbers. Indeed, the asymptotic distribution of prime numbers is closely related to the zeros of ζ(z). Riemann made a remarkable conjecture that all non-trivial zeros of ζ(z) lie on the critical line Re(z) = 1/2. This is universally known as the Riemann Hypothesis, which has not been proved or disproved to this day. However, it has been verified computationally for the first ten billion non-trivial zeros, which all lie on the critical line. This is one of the most celebrated unsolved problems in all of mathematics. If the Riemann Hypothesis is true, the known estimate of π(x) is

π(x) = Li(x) + O(√x log x).   (7.10)
This clearly shows that the Riemann Hypothesis is closely related to the asymptotic distribution of prime
numbers.
It is relevant to mention that Gauss's celebrated conjecture (7.7) for π(x) was independently proved by Jacques Hadamard (1865-1963) and Charles-Jean de La Vallée Poussin (1866-1962) in 1896. About fifty years after Hadamard and de La Vallée Poussin's proof, Paul Erdös (1913-1996) and Atle Selberg (1917-2007) independently gave, in 1949, an elementary proof of the Prime Number Theorem ([11]); their proofs are elementary in the sense that no knowledge of the theory of complex functions is required, but they are still very long and complicated. It was Norbert Wiener (1894-1964) who gave a proof of the prime number theorem from the Wiener-Ikehara Tauberian theorem. This is perhaps a more transparent proof from the analytical point of view.
In the 1930s, Harald Cramér (1893-1985), a great Swedish mathematician, first gave a probabilistic interpretation of Gauss's conjecture, which then became known as the Gauss-Cramér model. With the help of the central limit theorem in probability and statistics, the Gauss-Cramér model, with its representation of the primes as a sequence of 0s and 1s, provides the same result (7.10) (see Ingham [12]). Although the Gauss-Cramér model does not provide any precise proof, it does give a remarkable way to describe the question of the distribution of primes. In 2008, Green and Tao [13] made a truly remarkable discovery by proving a celebrated theorem which states that, for every k, there are infinitely many k-term arithmetic progressions of primes; that is, there exist integers a and d such that a, a + d, a + 2d, · · · , a + (k − 1)d are all primes.
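The Gauss-Cramér heuristic can be simulated directly: declare each integer n "prime" independently with probability 1/log n and compare the resulting count with the true π(N). The sketch below is an illustration of the idea only; the cutoff N = 10^6 and the random seed are arbitrary choices.

```python
import math
import random

random.seed(1)
N = 10**6

def pi_true(n):
    # Sieve of Eratosthenes giving the exact count pi(n).
    sieve = bytearray([1]) * (n + 1)
    sieve[0] = sieve[1] = 0
    for p in range(2, int(n**0.5) + 1):
        if sieve[p]:
            sieve[p * p::p] = bytearray(len(range(p * p, n + 1, p)))
    return sum(sieve)

# Cramer's random model: n >= 3 is declared "prime" with probability 1 / log n.
model_count = sum(1 for n in range(3, N + 1) if random.random() < 1.0 / math.log(n))

print(pi_true(N))      # the exact count pi(10**6) = 78498
print(model_count)     # the random model gives a count of the same order
```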

References
[1] Debnath, L., The Legacy of Leonhard Euler — A Tricentennial Tribute, Internat. J. Math. Educ. Sci. Technol., 40 (2009) 353-388.
[2] Levy, H. and Roth, L., Elements of Probability, Oxford University Press, Oxford (1936).
[3] Denker, M., Tercentennial Anniversary of Bernoulli’s Law of Large Numbers, Bull. Amer. Math. Soc.,
50 (2013) 373-390.

[4] Laplace, P.S., A Philosophical Essay on Probability, Constable and Co. Ltd. (1951) (Translation of 1819
French Edition).
[5] Bellhouse, D.R., The Reverend Thomas Bayes, FRS: A biography to celebrate the tercentenary of his
birth, Statist. Sci. 19 (2004) 3-43.
[6] Bayes, T., Essay towards solving a problem in the doctrine of chances, Biometrika, 45 (1958) 293-315. (Reproduction of 1763 paper).
[7] Jeffreys, H., Theory of Probability (Third Edition), Oxford University Press, Oxford (1961).

[8] Efron, B., A 250-Year Argument: Belief, Behavior, and the Bootstrap, Bull. Amer. Math. Soc., 50 (2013) 129-146.
[9] Kolmogorov, A.N., Foundations of the Theory of Probability, Chelsea Publishing Company, New York
(1956) (Translation of original 1933 German Edition).

[10] Debnath, L., The Legacy of Leonhard Euler - A Tricentennial Tribute, Imperial College Press, London
(2010).
[11] Erdös, P. and Selberg, A., On a new method in elementary number theory which leads to an elementary proof of the prime number theorem, Proc. Nat. Acad. Sci., 35 (1949) 374-384.
[12] Ingham, A.E., The Distribution of Prime Numbers, Cambridge University Press, Cambridge, (1932).

[13] Green, B. and Tao, T., The primes contain arbitrarily long arithmetic progressions, Ann. Math., 167 (2008) 481-547.
