A Sample Problem
In this footnote I will venture into a philosophical speculation. I think the reason for
this difficulty is that our brains are not well attuned to reasoning about chances. Kahneman
and Tversky have shown that when making decisions under uncertainty, humans often
misjudge the risks and make silly mistakes. We are just not good at it. Evolution
has prepared us to judge the paths of solid objects in space well. Hence our mechanics
is good and our cars, planes, and bridges are quite reliable. But we are not good at judging
risks, and hence our stock market is a mess.
Probability Space

For a sequence of disjoint events A_1, A_2, ... the probability of their union is the sum of their probabilities:

    Pr(A_1 ∪ A_2 ∪ ...) = ∑_{i=1}^∞ Pr(A_i) .    (1)

If B_1, B_2, ... are disjoint events covering the whole sample space, then

    Pr(A) = ∑_{n=1}^∞ Pr(A ∩ B_n) .
Rain in London and the state of the stock exchange are more or less
independent³, so we can write Pr(R ∩ S) = Pr(R) Pr(S).
Conditional probability Pr(A | B) (read "the probability of A given B")
is the ratio

    Pr(A | B) = Pr(A ∩ B) / Pr(B) .
The intuitive meaning of conditional probability may be described as follows.
Suppose we know that B has occurred. Our probability space effectively
contracts to B: anything outside B is no longer possible. As B is the new Ω,
its probability is now 1. What is the new probability of A? The conditional
probabilities are just these new probabilities given B.
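This contraction of the sample space can be checked by direct enumeration in a small finite space. A minimal Python sketch (the two-dice events here are invented for illustration and are not from the text):

```python
from fractions import Fraction

# Sample space: all ordered outcomes of rolling two fair dice.
omega = [(i, j) for i in range(1, 7) for j in range(1, 7)]

A = {(i, j) for (i, j) in omega if i == 6}        # first die shows 6
B = {(i, j) for (i, j) in omega if i + j >= 10}   # sum is at least 10

def pr(event):
    """Probability of an event in the equiprobable space omega."""
    return Fraction(len(event), len(omega))

# Definition: Pr(A | B) = Pr(A and B) / Pr(B).
p_cond = pr(A & B) / pr(B)

# "Contracted space" view: treat B as the new sample space.
p_contracted = Fraction(len(A & B), len(B))

print(p_cond, p_contracted)  # both equal 1/2
```

Both computations give the same answer, which is exactly the point: conditioning on B is the same as recomputing probabilities inside B.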
If A and B are independent, then Pr(A ∩ B) = Pr(A) Pr(B) and

    Pr(A | B) = Pr(A ∩ B) / Pr(B) = Pr(A) Pr(B) / Pr(B) = Pr(A) .
Thus stating that Pr(A | B) = Pr(A) is another way of saying that A and
B are independent. The equation Pr(A | B) = Pr(A) implies Pr(A ∩ B) =
Pr(A) Pr(B) and Pr(B | A) = Pr(B). All these formulations are equivalent.
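The equivalence of the three formulations is easy to verify by enumeration. A short sketch (the two-dice events are illustrative assumptions, not from the text):

```python
from fractions import Fraction

omega = [(i, j) for i in range(1, 7) for j in range(1, 7)]

A = {(i, j) for (i, j) in omega if i % 2 == 0}  # first die is even
B = {(i, j) for (i, j) in omega if j > 4}       # second die is 5 or 6

def pr(event):
    return Fraction(len(event), len(omega))

# All three formulations of independence hold simultaneously:
assert pr(A & B) == pr(A) * pr(B)   # Pr(A ∩ B) = Pr(A) Pr(B)
assert pr(A & B) / pr(B) == pr(A)   # Pr(A | B) = Pr(A)
assert pr(A & B) / pr(A) == pr(B)   # Pr(B | A) = Pr(B)

print("all three independence formulations agree")
```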
The law of total probability can now be written as

    Pr(A) = ∑_{n=1}^∞ Pr(A ∩ B_n) = ∑_{n=1}^∞ Pr(A | B_n) Pr(B_n) .

Conditioning everything on an event C gives the analogous formula

    Pr(A | C) = ∑_{n=1}^∞ Pr(A ∩ B_n | C) = ∑_{n=1}^∞ Pr(A | B_n, C) Pr(B_n | C) .
Indeed, imagine the world where C has occurred and has become the new
Ω. In the formula above Pr(A | B_n, C) is the probability of A given B_n and
C (provided B_n and C have occurred). It is the same as Pr(A | B_n ∩ C).
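The law of total probability can be illustrated with a small numerical sketch (the two-urn setup and its numbers are invented for illustration):

```python
# Law of total probability: Pr(A) = sum_n Pr(A | B_n) Pr(B_n).
# Hypothetical setup: an urn is chosen at random, then a ball is drawn.
pr_urn = {"urn1": 0.5, "urn2": 0.5}            # the B_n: which urn was chosen
pr_red_given_urn = {"urn1": 0.3, "urn2": 0.6}  # Pr(A | B_n): chance of a red ball

# Weight each conditional probability by the probability of its case.
pr_red = sum(pr_red_given_urn[u] * pr_urn[u] for u in pr_urn)
print(pr_red)  # 0.5 * 0.3 + 0.5 * 0.6 = 0.45
```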
³ A special message for finance students: no, they are in fact not. The performance of
many companies is affected by the weather. In the extreme case a torrent in London will
send the economy and, consequently, the stock market into a complete collapse. These
probabilities are tiny but should not be ignored. Sometimes they may play their part.
Bayes Rule
A common error is to confuse Pr(A | B) and Pr(B | A). These two conditional probabilities may be very different! Let us find a formula linking
them together.
There are two ways to write the joint probability Pr(A ∩ B):

    Pr(A | B) Pr(B) = Pr(A ∩ B) = Pr(B | A) Pr(A) .

Therefore

    Pr(A | B) = Pr(B | A) Pr(A) / Pr(B) .
This is the celebrated Bayes theorem or Bayes rule. (Thomas Bayes lived
in the 18th century. He studied at the University of Edinburgh and was
a minister in Tunbridge Wells. He formulated only a special case of the formula,
so some say that the complete formula is due to the 19th-century French
mathematician Laplace. But so many other things are named after Laplace
that Bayes gets the credit for this formula.)
If we have disjoint events A_1, A_2, ... covering the whole sample space
(i.e., A_n ∩ A_m = ∅ for n ≠ m and ∪_{n=1}^∞ A_n = Ω), then by the law of total
probability Pr(B) = ∑_{n=1}^∞ Pr(B | A_n) Pr(A_n). The Bayes formula can then be
written as

    Pr(A_i | B) = Pr(B | A_i) Pr(A_i) / ∑_{n=1}^∞ Pr(B | A_n) Pr(A_n) .    (3)
The immense importance of this formula for data analysis is based on the
following interpretation.
Let the A_n be disjoint hypotheses. The number Pr(B | A_i) is the probability
that the hypothesis A_i assigns to some observation B. The value of Pr(B | A_i) usually
follows from the nature of A_i. Different hypotheses give us different values.
The values Pr(A_i) are the probabilities we assign to the hypotheses.
These probabilities are called prior probabilities. They represent our trust in
the hypotheses before we have done any experiments or observations. Now
suppose that B has occurred. We need to adjust our beliefs. Using (3)
we can calculate the posterior probabilities Pr(A_i | B), i.e., our trust in the
hypotheses after B has occurred.
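A small numerical sketch of this prior-to-posterior update (the disease-testing numbers below are invented for illustration, not from the text):

```python
# Hypotheses: "ill" and "healthy"; observation B: a positive test result.
prior = {"ill": 0.01, "healthy": 0.99}       # Pr(A_i): our trust beforehand
likelihood = {"ill": 0.99, "healthy": 0.05}  # Pr(B | A_i): what each hypothesis predicts

# Denominator of formula (3): the total probability of B.
pr_b = sum(likelihood[h] * prior[h] for h in prior)

# Posterior probabilities Pr(A_i | B) via formula (3).
posterior = {h: likelihood[h] * prior[h] / pr_b for h in prior}

print(posterior["ill"])  # about 0.167: a single positive test is far from conclusive
```

Note how the small prior keeps the posterior modest even though the likelihood ratio strongly favours "ill"; this is exactly the Pr(A | B) versus Pr(B | A) distinction.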
Sometimes we only need to compare the probabilities Pr(A_i | B) with each other. We
may then write the Bayes formula as

    Pr(A_i | B) ∝ Pr(B | A_i) Pr(A_i)

(the sign ∝ reads "proportional to"). We have removed the denominator
because it is the same for every i and does not matter for the comparison.
Random Variables
Let Y be the result of rolling a fair die: it takes the values 1, 2, 3, 4,
5, and 6, each with probability 1/6. This can be written as Pr(Y = i) = 1/6,
where i = 1, 2, 3, 4, 5, 6. One can say that Y has a uniform distribution over
the six values i = 1, 2, 3, 4, 5, 6.
Some random variables take infinitely many values. Let X be the
height in centimetres of a person in the street. It has a finite range (certainly
greater than 0 but less than 250), but within that range it can take
any value. The distribution can no longer be described by enumerating all
possible values with their probabilities.
A continuous random variable has a density function (sometimes called
probability density function and abbreviated pdf), which can be used to
describe the distribution. Let Y be a random variable taking real values
and p_Y be its density. By definition p_Y(x) ≥ 0 for all x ∈ ℝ, and the
probability Pr(a ≤ Y ≤ b) is the integral ∫_a^b p_Y(x) dx. The integral can be
intuitively interpreted as the area bounded by the curve y = p_Y(x) from
above, the coordinate axis y = 0 from below, and the lines x = a and
x = b on the left and right.
As an example consider the variable X with the uniform distribution on
the segment [0, 10]. It has the density

    p_X(x) = 0       if x < 0 ;
    p_X(x) = 1/10    if 0 ≤ x ≤ 10 ;
    p_X(x) = 0       if x > 10 .

If an interval [a, b] is inside [0, 10], i.e., 0 ≤ a < b ≤ 10, then
Pr(a ≤ X ≤ b) = (b − a)/10.
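The area interpretation can be checked numerically. A sketch using a simple midpoint-rule approximation of the integral of the density above:

```python
def p_X(x):
    """Density of the uniform distribution on [0, 10]."""
    return 0.1 if 0 <= x <= 10 else 0.0

def integrate(f, a, b, n=100_000):
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    h = (b - a) / n
    return sum(f(a + (k + 0.5) * h) for k in range(n)) * h

# Pr(2 <= X <= 5) should equal (5 - 2) / 10 = 0.3.
approx = integrate(p_X, 2, 5)
print(approx)
```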
It is customary to describe distributions of random variables by means of
distribution functions. A distribution function (sometimes called cumulative
distribution function and abbreviated cdf) of a random variable Y is the
function

    F_Y(x) = Pr(Y ≤ x) .

The distribution function can be used to calculate Pr(a < Y ≤ b) = F_Y(b) − F_Y(a).
Exercise 5. Plot the graphs of the distribution functions of Y such that
Pr(Y = i) = 1/6, where i = 1, 2, 3, 4, 5, 6, and of X uniformly distributed
on the interval [0, 10]. Hint: The former graph consists of steps. The latter
graph is flat to the left of 0 and to the right of 10.
A distribution function is always non-decreasing (if a < b then Pr(X ≤ a) ≤ Pr(X ≤ b)
and hence F(a) ≤ F(b)). Its values are always between 0
and 1 (probabilities are always between 0 and 1). As x goes to −∞, F(x)
tends to 0, and as x goes to +∞, F(x) tends to 1.
Gaussian Distribution
The Gaussian (or normal) distribution is very important. The Gaussian
distribution with mean μ and standard deviation σ (or variance σ², which
means the same thing) has the density function

    p_{μ,σ}(x) = (1 / (σ √(2π))) e^{−(x−μ)² / (2σ²)} ,

where π ≈ 3.14159 and e ≈ 2.71828 is Euler's number. The function has
a characteristic bell shape with the peak at μ. The statement that X has
a Gaussian distribution with mean μ and variance σ² can be written as
X ∼ N(μ, σ²).
The distribution function of a Gaussian distribution cannot be written
explicitly in terms of standard functions of analysis.
The standard normal distribution has μ = 0 and σ = 1. Its density is
often denoted by φ (a Greek letter called phi):

    φ(x) = (1 / √(2π)) e^{−x²/2}

and the distribution function by Φ (capital phi):

    Φ(x) = (1 / √(2π)) ∫_{−∞}^{x} e^{−t²/2} dt .
(Technically speaking, there are subsets A ⊆ ℝ such that Pr(X ∈ A) cannot be
measured. You will most certainly never meet such sets in practice.)
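Although Φ has no closed form in elementary functions, it can be computed through the closely related error function, since Φ(x) = (1 + erf(x/√2))/2. A sketch using the standard library:

```python
import math

def phi(x):
    """Standard normal density."""
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

def Phi(x):
    """Standard normal distribution function, via the error function."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

print(Phi(0.0))                 # 0.5 by symmetry
print(Phi(1.96) - Phi(-1.96))   # about 0.95: the classic 95% interval
```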
The following parameters are very important for describing and analysing
distributions.
The expectation (or average or mean value) of a random variable is defined
as follows. If Y is discrete and takes values x_1, x_2, ..., x_n with probabilities
p_1, p_2, ..., p_n, respectively, then its expectation is

    E Y = ∑_{i=1}^n p_i x_i .
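For the fair die from the previous section this sum gives the familiar value 3.5. A minimal sketch (exact arithmetic with fractions, to match the hand calculation):

```python
from fractions import Fraction

values = [1, 2, 3, 4, 5, 6]
probs = [Fraction(1, 6)] * 6   # uniform distribution over six values

# E Y = sum_i p_i x_i
expectation = sum(p * x for p, x in zip(probs, values))
print(expectation)  # 7/2
```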
The standard deviation of X is the square root of its variance: σ_X = √(var X).
For a Gaussian random variable X ∼ N(μ, σ²) with mean μ and
standard deviation σ we have

    E X = μ ,    σ_X = σ ,

as the names of the parameters suggest.
Here are important properties of variances:
1. The variance of a constant is zero. If a random variable X is constant
and takes only one value c, then varX = 0.
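Property 1 can be verified directly, recalling the standard definition var X = E (X − E X)². A sketch for a constant variable and, for contrast, the fair die:

```python
from fractions import Fraction

def expectation(dist):
    """E X = sum_i p_i x_i for a discrete distribution {x_i: p_i}."""
    return sum(p * x for x, p in dist.items())

def variance(dist):
    """var X = E (X - E X)^2."""
    m = expectation(dist)
    return sum(p * (x - m) ** 2 for x, p in dist.items())

constant = {5: Fraction(1)}                      # X takes the single value c = 5
die = {i: Fraction(1, 6) for i in range(1, 7)}   # fair die

print(variance(constant))  # 0: a constant never deviates from its mean
print(variance(die))       # 35/12
```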