
# Probability

Alan Beggs
Michaelmas Term 2012
Minor Corrections Trinity Term 2013
This part of the course assumes familiarity with the Probability material delivered in
Michaelmas and in the QE lectures in Trinity of the second year. The material will be
revised but rapidly.
The material can be found in a large number of books, although many books on
Probability are aimed at mathematicians and are rather long. Reference will be made to
Ross, S. A First Course in Probability
Other good books include
Grinstead, C. and J. Snell Introduction to Probability (downloadable from http://www.dartmouth.edu/~chance/teaching_aids/books_articles/probability_book/book.html)
Hoel, P., Port, S. and C. Stone Introduction to Probability
Stirzaker, G. Elementary Probability (not in SSL but widely available in College
Libraries and RSL)
Most mathematical statistics books have an introduction to Probability and may
be more concise and accessible sources. A good example is:
Hoel, P. Introduction to Mathematical Statistics
Other good books include
Freund, J. Mathematical Statistics (more recent editions edited by Miller and Miller under the title John E. Freund's Mathematical Statistics)
Hogg, R. and A. Craig Introduction to Mathematical Statistics (more recent editions
with McKean)
Mood, A., Graybill, F. and D. Boes Introduction to the Theory of Statistics
Casella, G. and R. Berger Statistical Inference (advanced)
1 Classical Probability
Ross 1
1.1 Sample Space and Probability
Suppose we have an experiment with a set Ω of possible outcomes. Ω is called the sample space and an element ω ∈ Ω a sample outcome.
For example, in tossing a coin Ω = {H, T}. In tossing two coins Ω = {(H, H), (H, T), (T, H), (T, T)}.
An event is a subset of Ω. The event A occurs if ω ∈ A.
In classical probability all possible outcomes of the experiment are considered equally likely. So
$$P(A) = \frac{|A|}{|\Omega|}$$
where |A| denotes the number of elements in A. Hence
$$P(\{(H, T)\}) = \frac{1}{4}$$
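The classical definition can be checked with a short computation. The following Python sketch (not part of the original notes; the function names are illustrative) counts outcomes in the two-coin sample space:

```python
from fractions import Fraction
from itertools import product

# Sample space for tossing two coins: [('H','H'), ('H','T'), ('T','H'), ('T','T')]
omega = list(product("HT", repeat=2))

def prob(event):
    """Classical probability: |A| / |Omega|, counting equally likely outcomes."""
    return Fraction(len([w for w in omega if event(w)]), len(omega))

print(prob(lambda w: w == ("H", "T")))   # 1/4
print(prob(lambda w: "H" in w))          # 3/4
```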
1.2 Combinatorial Analysis
Fundamental principle of counting:
Suppose two experiments are to be conducted. If experiment 1 can result in any of n possible outcomes and if, for each outcome of experiment 1, there are m possible outcomes of experiment 2, then there are mn possible outcomes of the two experiments.
So if there are 10 mothers, each of whom has 3 children, there are 10 × 3 = 30 combinations of mother and child one can choose. Hence if one picks at random, the probability of picking a particular pair is 1/30.
Obvious generalisation to n experiments.
1.3 Permutations
There are 6 = 3 × 2 × 1 ways of arranging the letters a, b, c (3 choices for the first letter, 2 for the second, 1 for the third).
In general there are
$$n! = n(n-1)\cdots 1$$
ways of arranging n distinguishable objects.
If there are n objects of r kinds, with $n_i$ of each kind ($n_1 + \dots + n_r = n$), then there are
$$\frac{n!}{n_1! \cdots n_r!}$$
ways of arranging them.
For example, with the letters a, a, b there are $\frac{3!}{2!\,1!} = 3$ ways of arranging them (aab, aba, baa).
1.4 Combinations
The number of ways of choosing r objects from n is
$$\binom{n}{r} \equiv \frac{n!}{r!\,(n-r)!}$$
The notation ${}^{n}C_{r}$ is also used.
To choose 3 objects from 5: there are 5 ways of choosing the first, 4 the second and 3 the third, but any order yields the same combination, so there are only
$$\frac{5 \cdot 4 \cdot 3}{3 \cdot 2 \cdot 1} = \frac{5!}{3!\,2!} = 10$$
distinct combinations.
Example: 307 Conservative MPs, 57 Liberal Democrats. A cabinet is to be formed with 19 Conservatives and 5 Liberal Democrats. The number of different ways of picking the cabinet is
$${}^{57}C_{5} \times {}^{307}C_{19}.$$
[Note Stirling's Formula: $n! \sim (2\pi n)^{1/2}\, n^n e^{-n}$, where $\sim$ means the ratio of the two sides tends to 1 as n tends to infinity.]
Recall the Binomial theorem:
$$(x + y)^n = \sum_{r=0}^{n} \binom{n}{r} x^r y^{n-r}$$
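The counting formulas above can be verified directly. The following sketch (not from the original notes; it uses Python's standard library) checks the combination formula, the cabinet example and the binomial theorem for sample values:

```python
from math import comb, factorial

# C(n, r) = n! / (r!(n-r)!): choosing 3 objects from 5 gives 10 combinations
assert comb(5, 3) == factorial(5) // (factorial(3) * factorial(2)) == 10

# Cabinet example: choose 5 of 57 Liberal Democrats and 19 of 307 Conservatives
print(comb(57, 5) * comb(307, 19))

# Binomial theorem check with arbitrary sample values x = 2, y = 3, n = 6
x, y, n = 2, 3, 6
assert sum(comb(n, r) * x**r * y**(n - r) for r in range(n + 1)) == (x + y)**n
```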
2 Axiomatic Probability
Experiment 1. We know intuitively that for a fair die:
$$P(\{2, 4, 6\}) = P(N \text{ is even}) = \frac{1}{2} \quad \text{and} \quad P(\{1, 2\}) = P(N < 3) = \frac{1}{3}$$
How can this be justified? For a fair die, all the outcomes in Ω are equally likely. So, for example, if you throw a die many times, then you expect a 1 to occur 1/6 of the time, on average. We can think of probability as frequency of occurrence.
Formally, however, we will define the probability function by specifying a set of axioms that it must satisfy. Then we can interpret it as a frequency where appropriate, but alternatively, for example, we could interpret it as a subjective belief in the likelihood of an event.
2.1 Formal Definition of Probability
Ross 2
If Ω is a sample space:
an event is a subset of Ω
an event space (or σ-field) F is a (non-empty) set of subsets of Ω satisfying:
(i) Ω ∈ F.
(ii) If A ∈ F then $A^c \in F$
(iii) If $A_1, A_2, \ldots \in F$ then $\bigcup_{i=1}^{\infty} A_i \in F$
We need the definition of an event space because it is the domain of the probability function.
Note that not all subsets of Ω need be events. One interpretation is that only certain subsets are observable and so probability is only assigned to them. There are also technical reasons why this restriction is sometimes necessary (discussed in more advanced treatments of probability).
A probability measure is defined as a mapping P : F → [0, 1] satisfying:
The Probability Axioms:
1. P(Ω) = 1
2. P(A) ≥ 0 for all A ∈ F
3. If $A_1, A_2, \ldots$ are disjoint events in F, $P\left(\bigcup_{i=1}^{\infty} A_i\right) = \sum_{i=1}^{\infty} P(A_i)$
From these axioms we can derive some results for the probability function. If A and B are events:
4. $P(A^c) = 1 - P(A)$
5. $P(\emptyset) = 0$
6. $P(A \cup B) = P(A) + P(B) - P(A \cap B)$
7. If $A \subseteq B$, $P(A) \le P(B)$
You can check these for experiment 1 (fair die). For example:
Result 4: $P(N \ge 3) = 1 - P(N < 3)$, i.e. $\frac{2}{3} = 1 - \frac{1}{3}$
Result 6: $P(N < 3 \text{ or } N \text{ even}) = P(N < 3) + P(N \text{ even}) - P(N < 3 \text{ and } N \text{ even})$, i.e.
$$P(\{1, 2, 4, 6\}) = P(\{1, 2\}) + P(\{2, 4, 6\}) - P(\{2\}), \qquad \frac{2}{3} = \frac{1}{3} + \frac{1}{2} - \frac{1}{6}$$
Result 7: $P(N < 3) < P(N < 5)$, i.e. $\frac{1}{3} < \frac{2}{3}$
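These consequences of the axioms can be checked mechanically for the fair die. A small sketch (illustrative, not from the notes), with events represented as Python sets:

```python
from fractions import Fraction

omega = {1, 2, 3, 4, 5, 6}
P = lambda A: Fraction(len(A & omega), len(omega))   # classical probability on the die

A = {1, 2}          # N < 3
B = {2, 4, 6}       # N even

# Result 4: P(A^c) = 1 - P(A)
assert P(omega - A) == 1 - P(A)
# Result 6: P(A u B) = P(A) + P(B) - P(A n B)
assert P(A | B) == P(A) + P(B) - P(A & B)
# Result 7: A subset of B implies P(A) <= P(B)
assert P({1, 2}) <= P({1, 2, 3, 4})
```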
More technical consequences of the axioms, included for reference, are:
8. If $\{A_i, i \ge 1\}$ is an increasing sequence of events, that is $A_1 \subseteq A_2 \subseteq \ldots \subseteq A_i \subseteq A_{i+1} \subseteq \ldots$, and $A = \bigcup_{i=1}^{\infty} A_i$, then $P(A) = \lim_{i \to \infty} P(A_i)$.
9. If $\{A_i, i \ge 1\}$ is a decreasing sequence of events, that is $A_1 \supseteq A_2 \supseteq \ldots \supseteq A_i \supseteq A_{i+1} \supseteq \ldots$, and $A = \bigcap_{i=1}^{\infty} A_i$, then $P(A) = \lim_{i \to \infty} P(A_i)$.
3 Conditioning and Independence
Ross 3
3.1 Conditional Probability
If we know that one event has happened then this information will in general cause us to change our view of the probability of other events.
Definition: For two events A and B,
$$P(A|B) = \frac{P(A \cap B)}{P(B)} \quad \text{if } P(B) > 0.$$
P(A|B) is the (conditional) probability of A given B.
E1. Fair die:
$$P(N < 3 \,|\, N \text{ even}) = \frac{P(N < 3 \text{ and } N \text{ even})}{P(N \text{ even})}$$
Note that if B is held fixed and A varies, then P(A|B) defines a probability distribution on Ω (exercise: check the axioms).
3.2 The Law of Total Probability
Let $A_1, A_2, \ldots$ be a partition of Ω: (i) $\bigcup_i A_i = \Omega$ (the $A_i$ are exhaustive), (ii) $A_i \cap A_j = \emptyset$ for $j \ne i$ (the $A_i$ are mutually exclusive). If B is any event:
$$P(B) = \sum_i P(B \cap A_i) = \sum_i P(B|A_i)P(A_i)$$
Conditioning on appropriate events can often simplify calculations.
Example (Gambler's Ruin): A gambler repeatedly plays a game in which he wins 1 if a (fair) coin comes up heads and loses 1 if it comes up tails. He quits if his capital reaches m or if he goes bankrupt (his capital reaches 0). Let $p_x$ be the probability that he eventually goes bankrupt if he starts with capital x.
Let A be the event that he eventually goes bankrupt. Condition on the outcome of the first toss:
$$P(A) = P(A|H)P(H) + P(A|T)P(T)$$
But $P(A|H) = p_{x+1}$ and $P(A|T) = p_{x-1}$. Hence
$$p_x = \tfrac{1}{2} p_{x+1} + \tfrac{1}{2} p_{x-1}$$
This is a linear difference equation with boundary conditions $p_0 = 1$ and $p_m = 0$. It is easy to check that the solution is
$$p_x = 1 - \frac{x}{m}$$
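The closed-form solution can be compared against a Monte Carlo estimate. A simulation sketch (illustrative, not from the notes; the trial count is an arbitrary choice):

```python
import random

def ruin_prob(x, m, trials=20000, rng=random.Random(0)):
    """Estimate the probability of eventual bankruptcy starting from capital x."""
    ruined = 0
    for _ in range(trials):
        c = x
        while 0 < c < m:                      # play until quitting at m or ruin at 0
            c += 1 if rng.random() < 0.5 else -1
        ruined += (c == 0)
    return ruined / trials

x, m = 3, 10
print(ruin_prob(x, m), 1 - x / m)             # estimate should be close to 0.7
```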
3.3 Bayes' Theorem
Note that from the definition of conditional probability:
$$P(A_i|B) = \frac{P(A_i \cap B)}{P(B)} = \frac{P(B|A_i)P(A_i)}{P(B)}$$
If $A_1, A_2, \ldots$ is a partition of Ω, then using the Law of Total Probability we can write
$$P(B) = \sum_{i=1}^{\infty} P(B \cap A_i) = \sum_{i=1}^{\infty} P(B|A_i)P(A_i)$$
Hence we obtain
Bayes' Rule:
$$P(A_j|B) = \frac{P(B|A_j)P(A_j)}{\sum_{i=1}^{\infty} P(B|A_i)P(A_i)}$$
$P(A_j)$ is sometimes referred to as the prior belief and $P(A_j|B)$ as the posterior.
Suppose 15% of Olympic athletes take performance-enhancing drugs. If P(fails drug test | takes drugs) = 0.90 and P(fails test | doesn't take drugs) = 0.12, what is the probability that an athlete who fails the test has taken drugs?
$$P(\text{drugs}|\text{fails}) = \frac{P(\text{fails}|\text{drugs})P(\text{drugs})}{P(\text{fails}|\text{drugs})P(\text{drugs}) + P(\text{fails}|\text{no drugs})P(\text{no drugs})} = \frac{0.9 \times 0.15}{0.9 \times 0.15 + 0.12 \times 0.85} \approx 57\%$$
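The drug-test calculation is a one-liner in code. A sketch (not from the notes; the function name is illustrative):

```python
def posterior(prior, p_fail_given_drugs, p_fail_given_clean):
    """Bayes' rule: P(drugs | fails) with a two-event partition (drugs / no drugs)."""
    num = p_fail_given_drugs * prior
    den = num + p_fail_given_clean * (1 - prior)
    return num / den

p = posterior(0.15, 0.90, 0.12)
print(round(p, 3))   # 0.57
```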
The probability that a second-hand car is a lemon is $p_g$ if you buy from a good dealer, and $p_b > p_g$ if from a bad dealer. A proportion θ (say) of dealers are good. Having selected a dealer at random, you ask a previous customer whether his car was a lemon. If he says no, what is the probability that the dealer is good?
3.4 Independence
In some cases knowing that one event has happened does not change our estimate of the probability that another has happened. That is,
$$P(A|B) = P(A)$$
Since $P(A|B) = P(A \cap B)/P(B)$, this leads to the definition: Two events A and B are independent if
$$P(A \cap B) = P(A)P(B)$$
that is, if the probability that both occur is the product of the individual probabilities of occurrence.
Intuitively, two events are independent if the probability of one of them occurring is not affected by whether or not the other occurs. So, for example, if you toss a fair coin twice, the chance of getting a head on the second throw is not affected by getting a head on the first throw.
Examples
Suppose you toss a fair coin twice. The sample space is Ω = {(H, H), (H, T), (T, H), (T, T)} and each of these outcomes is equally likely, with probability 1/4. Consider the events A = head on first throw and B = head on second throw:
A = {(H, H), (H, T)}; P(A) = 1/2
B = {(H, H), (T, H)}; P(B) = 1/2
A ∩ B = {(H, H)}; P(A ∩ B) = 1/4
⇒ P(A ∩ B) = P(A)P(B), so the events are independent.
Suppose you throw a fair die. Consider the events C = the number is odd and D = the number is less than 4:
C = {1, 3, 5}; P(C) = 1/2
D = {1, 2, 3}; P(D) = 1/2
C ∩ D = {1, 3}; P(C ∩ D) = 1/3
⇒ P(C ∩ D) ≠ P(C)P(D), so the events are not independent.
A family $\mathcal{A} = \{A_i : i \in I\}$ is independent if for any finite subset $J \subseteq I$,
$$P\Big(\bigcap_{j \in J} A_j\Big) = \prod_{j \in J} P(A_j).$$
Note that it is not enough to check pairwise independence.
Example: Suppose that 2 dice are thrown. Let $A_1$ be the event that the first die shows an odd number, $A_2$ the event that the second does, and $A_3$ the event that the sum of the two numbers is odd. Then the events are pairwise independent, but $P(A_1 \cap A_2 \cap A_3) = 0 \ne P(A_1)P(A_2)P(A_3) = \frac{1}{8}$.
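The two-dice example can be verified by enumerating the 36 outcomes. A sketch (illustrative, not from the notes):

```python
from fractions import Fraction
from itertools import product

omega = list(product(range(1, 7), repeat=2))            # all 36 outcomes of two dice
P = lambda A: Fraction(len(A), len(omega))

A1 = {w for w in omega if w[0] % 2 == 1}                # first die odd
A2 = {w for w in omega if w[1] % 2 == 1}                # second die odd
A3 = {w for w in omega if sum(w) % 2 == 1}              # sum odd

# Pairwise independent ...
assert P(A1 & A2) == P(A1) * P(A2)
assert P(A1 & A3) == P(A1) * P(A3)
assert P(A2 & A3) == P(A2) * P(A3)
# ... but not mutually independent: both dice odd forces an even sum
assert P(A1 & A2 & A3) == 0 != P(A1) * P(A2) * P(A3)
```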
4 Random Variables and Probability Distributions
Ross 4
A Random Variable is a real-valued variable whose value is determined by the outcome of an experiment. It is a mapping from the sample space to the real line: X : Ω → ℝ.
For the toss of a fair coin, Ω = {heads, tails}. A random variable can be defined by:
Y(tails) = 0; Y(heads) = 1
Experiment 2 (Lifetime of a lightbulb). Ω = [0, ∞). Two random variables are:
$$X_1(\omega) = \omega \quad \text{and} \quad X_2(\omega) = \begin{cases} 0 & \text{if } \omega < 100 \\ 1 & \text{if } \omega \ge 100 \end{cases}$$
In these examples, $X_1$ is a continuous random variable, taking values on the real line; Y and $X_2$ are discrete random variables, taking values in a countable set.
4.1 The Cumulative Distribution Function
The probability distribution of a random variable X is a description of the probabilities corresponding to different values of X. One way to describe these probabilities is the cumulative distribution function.
If x ∈ ℝ, {ω : X(ω) ≤ x} is an event, so it has a probability. As a shorthand we write P(X ≤ x) for P({ω : X(ω) ≤ x}).
Then we define the Cumulative Distribution Function for X:
$$F_X(x) \equiv P(X \le x)$$
If a function F(x) is a cumulative distribution function (cdf), it must satisfy:
1. $\lim_{x \to -\infty} F(x) = 0$; $\lim_{x \to +\infty} F(x) = 1$
2. F(x) is non-decreasing
3. F(x) is right-continuous
and from the definition of the cdf we can see that $P(x_1 < X \le x_2) = F(x_2) - F(x_1)$.
For a discrete random variable (with a countable set of values) the cdf is a step function.
If N is the number on a fair die:
$$F_N(x) = 0 \text{ for } x < 1, \quad F_N(x) = \tfrac{1}{6} \text{ for } 1 \le x < 2, \quad F_N(x) = \tfrac{1}{3} \text{ for } 2 \le x < 3, \;\ldots,\; F_N(x) = 1 \text{ for } x \ge 6$$
and
$$P(1 < N \le 5) = \frac{5}{6} - \frac{1}{6} = \frac{2}{3}$$
[Figure: the cdf $F_N(n)$ of N, the number on a fair die — a step function rising from 0 to 1.]
If X is a continuous random variable (there is a continuum of possible values), then the cdf is continuous.
Suppose X is the height of a randomly-
chosen person measured in metres.
This is a continuous random variable.
The diagram shows the typical shape of
the cdf for such a random variable.
[Figure: the cdf $F_X(x)$ of X, the height of a randomly chosen person — a continuous curve rising from 0 to 1.]
5 Discrete Distributions and the Probability Mass Function
Ross 4
For a discrete random variable X, which can take values $\{x_i\}$, we can define the probability mass function (pmf):
$$f(x) = P(X = x)$$
Then:
- The sum of the probabilities over all the possible values is one: $\sum_i f(x_i) = 1$
- The relationship between the pmf and the cdf is given by: $F(x) = \sum_{x_i \le x} f(x_i)$
- and also: $f(x_i) = F(x_i) - F(x_i^-)$, i.e. the probability that $X = x_i$ is equal to the size of the step at $x_i$ in the cdf.
Some examples of discrete distributions and their probability mass functions are given in the following pages.
5.1 Discrete Uniform Distribution
If N is the number on a fair die, $f_N(1) = \frac{1}{6}, f_N(2) = \frac{1}{6}, \ldots, f_N(6) = \frac{1}{6}$.
This is an example of the discrete uniform distribution. More generally, the discrete uniform distribution is the distribution of a random variable X that has n equally probable values. The probability mass function is given by:
$$f(x; n) = \begin{cases} \frac{1}{n} & \text{if } x = 1, 2, \ldots, n \\ 0 & \text{otherwise} \end{cases}$$
5.2 Bernoulli Distribution
A random variable X has a Bernoulli distribution if
- it has just 2 possible outcomes, 0 and 1 (corresponding to failure and success),
- and the probability of success is given by the parameter p,
- so that: f(0) = 1 − p and f(1) = p.
The pmf can be written:
$$f(x; p) = \begin{cases} p^x (1-p)^{1-x} & \text{if } x = 0, 1 \\ 0 & \text{otherwise} \end{cases}$$
For example, a coin toss has a Bernoulli distribution (with $p = \frac{1}{2}$ if the coin is fair).
5.3 Binomial Distribution
Suppose the random variable N is the number of heads in 3 tosses of a fair coin.
By listing the outcomes ((H,H,H), (H,H,T), etc.), we find:
$$f_N(0) = \frac{1}{8}, \quad f_N(1) = \frac{3}{8}, \quad f_N(2) = \frac{3}{8}, \quad f_N(3) = \frac{1}{8}.$$
This is an example of the Binomial Distribution with parameters 3 and 1/2.
More generally, suppose X is the number of successes in n independent trials (draws) from a Bernoulli distribution with parameter p. Then we say that X has a Binomial distribution with parameters n and p, and write this as X ∼ B(n, p).
The formula for the pmf of the Binomial Distribution B(n, p) is:
$$f(x; n, p) = \begin{cases} \frac{n!}{x!\,(n-x)!}\, p^x (1-p)^{n-x} & \text{if } x = 0, 1, 2, \ldots, n \\ 0 & \text{otherwise} \end{cases}$$
Examples:
If N = number of heads in three tosses of a fair coin, N ∼ B(3, 0.5) and:
$$P(N = 2) = f(2; 3, 0.5) = \frac{3!}{2!\,1!} (0.5)^2 (0.5)^1 = 0.375$$
If the probability that a new light-bulb is defective is 0.05, let X be the number of defective bulbs in a sample of 10. Then X ∼ B(10, 0.05) and, for example, the probability of getting more than one defective bulb in a sample of 10 is:
$$P(X > 1) = f(2) + f(3) + \cdots + f(10)$$
The easiest way to calculate this is:
$$P(X > 1) = 1 - P(X \le 1) = 1 - f(0) - f(1) = 1 - (0.95)^{10} - 10\,(0.95)^9 (0.05) \approx 0.086$$
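Both Binomial examples can be reproduced with a few lines of code. A sketch (not from the notes; the helper name is illustrative):

```python
from math import comb

def binom_pmf(x, n, p):
    """Binomial pmf: C(n, x) p^x (1-p)^(n-x)."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

# N = heads in 3 tosses of a fair coin
assert abs(binom_pmf(2, 3, 0.5) - 0.375) < 1e-12

# Defective bulbs: X ~ B(10, 0.05); P(X > 1) = 1 - f(0) - f(1)
p_more_than_one = 1 - binom_pmf(0, 10, 0.05) - binom_pmf(1, 10, 0.05)
print(round(p_more_than_one, 4))   # 0.0861
```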
5.4 The Geometric Distribution
This distribution describes the probability that it takes x trials to achieve a success in a series of independent trials when the probability of success is p at each trial (with q = 1 − p):
$$f(x) = p\,q^{x-1}, \quad x = 1, 2, \ldots$$
5.5 The Negative Binomial Distribution
The number of trials needed to achieve r successes in the preceding setup has pmf
$$f(x) = \binom{x-1}{r-1} p^r q^{x-r}$$
5.6 Relationship between Poisson and Binomial
The Poisson distribution is a limiting form of the Binomial as n becomes large, holding the mean constant. Start with the Binomial pmf, put np = λ, and let n → ∞:
$$f(x; n, p) = \frac{n!}{x!\,(n-x)!}\, p^x (1-p)^{n-x} = \frac{n!}{x!\,(n-x)!} \frac{\lambda^x}{n^x} \Big(1 - \frac{\lambda}{n}\Big)^{n-x}$$
$$= \frac{\lambda^x}{x!} \Big(1 - \frac{\lambda}{n}\Big)^{n} \frac{n(n-1)\cdots(n-x+1)}{(n-\lambda)^x} \;\to\; \frac{e^{-\lambda} \lambda^x}{x!}$$
Poisson pmf:
$$f(x; \lambda) = \begin{cases} \dfrac{e^{-\lambda} \lambda^x}{x!} & \text{if } x = 0, 1, 2, 3, \ldots \text{ (countably infinite)} \\ 0 & \text{otherwise} \end{cases}$$
[Figure: the pmf of the Binomial distribution B(6, 0.3) alongside the pmf of the Poisson distribution with λ = 2.]
The Poisson distribution could be used to model:
- The number of deaths from horse-kicks in a Prussian army corps in a year
- The number of job offers received by an unemployed person in a week
- The number of times an individual patient visits a doctor in one year
In general it applies to systems with a large number of potential events, each one of which is rare.
Check that $\sum_i f(x_i) = 1$:
$$\sum_{x=0}^{\infty} \frac{e^{-\lambda} \lambda^x}{x!} = e^{-\lambda} \sum_{x=0}^{\infty} \frac{\lambda^x}{x!} = e^{-\lambda} e^{\lambda} = 1$$
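The limiting relationship can be seen numerically: with np = λ held fixed, the Binomial pmf approaches the Poisson pmf as n grows. A sketch (illustrative, not from the notes):

```python
from math import comb, exp, factorial

lam = 2.0

def binom_pmf(x, n, p):
    return comb(n, x) * p**x * (1 - p)**(n - x)

def poisson_pmf(x, lam):
    return exp(-lam) * lam**x / factorial(x)

# The gap at x = 3 shrinks as n grows with np = lam fixed
for n in (10, 100, 1000):
    print(n, abs(binom_pmf(3, n, lam / n) - poisson_pmf(3, lam)))
```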
6 Expectations for Discrete Random Variables
Ross 4
6.1 The mean of a discrete random variable
If X is a discrete random variable with probability mass function f(x), the expected value, or mean, of X is:
$$E(X) = \sum_i x_i f(x_i)$$
[provided that $\sum_i |x_i| f(x_i) < \infty$].
The mean is a measure of the average value of the random variable.
If, in a lottery, you win 100 with probability 0.2, 50 with probability 0.3, and otherwise nothing, the expected prize is:
$$E(X) = 0.2 \times 100 + 0.3 \times 50 + 0.5 \times 0 = 35$$
Bernoulli Distribution
$$E(X) = p \times 1 + (1-p) \times 0 = p$$
Binomial Distribution
$$E(X) = \sum_{r=0}^{n} r \frac{n!}{r!\,(n-r)!} p^r (1-p)^{n-r} = np \sum_{r=1}^{n} \frac{(n-1)!}{(r-1)!\,(n-r)!} p^{r-1} (1-p)^{n-r} = np \sum_{s=0}^{n-1} \binom{n-1}{s} p^s (1-p)^{n-1-s} = np$$
Poisson Distribution
$$E(X) = \sum_{x=0}^{\infty} x \frac{\lambda^x}{x!} e^{-\lambda} = \lambda \sum_{x=1}^{\infty} \frac{\lambda^{x-1}}{(x-1)!} e^{-\lambda} = \lambda \sum_{s=0}^{\infty} \frac{\lambda^s}{s!} e^{-\lambda} = \lambda$$
6.2 Functions of Random Variables
If X is a random variable and g a real-valued function, then Y = g(X) is also a random variable. It has probability mass function
$$f_Y(y) = \sum_{x : g(x) = y} f(x)$$
To calculate expectations one does not need to find $f_Y$:
$$E g(X) = \sum_i g(x_i) f(x_i)$$
[provided $\sum_i |g(x_i)| f(x_i) < \infty$.]
Proof: If $g_j$ are the possible values of g then
$$E g(X) = \sum_{g_j} g_j f_Y(g_j) = \sum_{g_j} g_j \sum_{x_i : g(x_i) = g_j} f(x_i) = \sum_i g(x_i) f(x_i)$$
Example: For a Bernoulli random variable,
$$E(X^2) = 1^2 \times p + 0^2 \times (1-p) = p$$
6.3 Variance and Higher Moments
The variance of a random variable X is
$$\operatorname{Var}(X) = E\left[(X - E(X))^2\right]$$
The variance is a measure of dispersion. A more convenient expression is:
$$\operatorname{Var}(X) = E(X^2) - (EX)^2$$
Proof: Denote EX by μ. Then
$$E(X - \mu)^2 = \sum_i f(x_i)(x_i - \mu)^2 = \sum_i f(x_i)(x_i^2 - 2\mu x_i + \mu^2)$$
$$= \sum_i f(x_i) x_i^2 - 2\mu \sum_i x_i f(x_i) + \mu^2 = \sum_i x_i^2 f(x_i) - 2\mu^2 + \mu^2 = E(X^2) - \mu^2$$
Example: For a Bernoulli random variable,
$$\operatorname{Var}(X) = E(X^2) - (EX)^2 = p - p^2 = p(1-p)$$
$E(X^k)$ is called the kth moment of X. $E(X - EX)^k$ is called the kth central moment.
6.4 Conditioning and Expectation
Suppose that B is some event. Recall that the conditional probability of events given B defines a distribution on Ω.
For a random variable X, its mass function conditional on B is
$$f(x_i|B) = P(X = x_i|B)$$
Its expectation conditional on B is
$$E(X|B) = \sum_{x_i} x_i f(x_i|B)$$
The Law of Total Expectation
Let $A_1, A_2, \ldots$ be a partition of Ω: (i) $\bigcup_i A_i = \Omega$ (the $A_i$ are exhaustive), (ii) $A_i \cap A_j = \emptyset$ for $j \ne i$ (the $A_i$ are mutually exclusive). If X is a random variable:
$$E(X) = \sum_i E(X|A_i) P(A_i)$$
Example (Gambler's Ruin Again): A gambler repeatedly plays a game in which he wins 1 if a (fair) coin comes up heads and loses 1 if it comes up tails. He quits if his capital reaches m or if he goes bankrupt (his capital reaches 0). Find the expected duration of the game if his initial capital is k.
Let $X_k$ denote the number of games to be played when his capital is k and let $n_k = E(X_k)$. Clearly $n_0 = n_m = 0$. Condition on the outcome of the first toss:
$$n_k = E(X_k|H)P(H) + E(X_k|T)P(T)$$
But $E(X_k|H) = 1 + n_{k+1}$ and $E(X_k|T) = 1 + n_{k-1}$. Hence
$$n_k = 1 + \tfrac{1}{2} n_{k+1} + \tfrac{1}{2} n_{k-1}$$
This is a linear difference equation with boundary conditions $n_0 = 0$ and $n_m = 0$. One can check that the solution is
$$n_k = k(m - k)$$
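The formula k(m − k) can be checked by simulation, in the same spirit as the earlier bankruptcy-probability check. A sketch (illustrative, not from the notes; the trial count is an arbitrary choice):

```python
import random

def mean_duration(k, m, trials=20000, rng=random.Random(1)):
    """Average number of games until the gambler stops, starting from capital k."""
    total = 0
    for _ in range(trials):
        c, steps = k, 0
        while 0 < c < m:                      # play until quitting at m or ruin at 0
            c += 1 if rng.random() < 0.5 else -1
            steps += 1
        total += steps
    return total / trials

k, m = 3, 10
print(mean_duration(k, m), k * (m - k))       # estimate should be close to 21
```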
7 Multivariate Distributions
Ross 6 and 7
7.1 Joint Distributions
Let X and Y be two discrete random variables taking values $x_i$, $i = 1, \ldots$ and $y_j$, $j = 1, \ldots$. Their joint probability mass function is defined to be
$$f(x, y) = P(X = x, Y = y)$$
where x and y range over all possible values of X and Y.
Example: Suppose a fair coin is tossed twice. Let X be the number of heads and Y the number of tails. (X, Y) takes values in the set {0, 1, 2} × {0, 1, 2}. Moreover
$$f(0, 2) = \frac{1}{4}, \quad f(1, 1) = \frac{1}{2}, \quad f(2, 0) = \frac{1}{4}$$
and f is zero at all other points.
The marginal distributions of X and Y are
$$f_X(x) = \sum_y f(x, y) \qquad f_Y(y) = \sum_x f(x, y)$$
So in the example the marginal distribution of X is
$$f_X(0) = \frac{1}{4}, \quad f_X(1) = \frac{1}{2}, \quad f_X(2) = \frac{1}{4}$$
7.2 Independence
Two random variables X and Y are independent if for all x and y
$$f(x, y) = f_X(x) f_Y(y)$$
This is equivalent to $P(A \cap B) = P(A)P(B)$ where $A = \{\omega : X(\omega) = x\}$ and $B = \{\omega : Y(\omega) = y\}$, which is the definition of independence of the events A and B.
Example The random variables X and Y in the previous subsection are not independent.
Note that
1. If X and Y are independent random variables, then for any countable sets A and B,
$$P(X \in A, Y \in B) = P(X \in A)P(Y \in B)$$
2. For any real functions g and h, g(X) and h(Y) are independent.
7.3 Expectation and Moments
7.3.1 Expectation
Let Z = g(X, Y). The following is a straightforward generalisation of the result in Section 6.2:
$$E(Z) = \sum_{x,y} g(x, y) f(x, y)$$
Let X and Y be random variables and λ a real number. The result above implies the following key properties of expectations (all expectations assumed to exist):
1. $E(X + Y) = E(X) + E(Y)$
2. $E(\lambda X) = \lambda E(X)$
Note that X and Y are not required to be independent for this result to hold.
Example: If X ∼ B(n, p) then X is the sum of n independent Bernoulli random variables $X_i$, each with probability of success p. So
$$E(X) = E(X_1 + \ldots + X_n) = \sum_i E(X_i) = np$$
7.4 Variance and Covariance
If X and Y are two random variables, their covariance is
$$\operatorname{Cov}(X, Y) = E\left[(X - EX)(Y - EY)\right]$$
The following are key properties of variances (all assumed to exist). X and Y are random variables and λ a scalar:
1. $\operatorname{Var}(\lambda X) = \lambda^2 \operatorname{Var}(X)$
2. $\operatorname{Var}(X + Y) = \operatorname{Var}(X) + \operatorname{Var}(Y) + 2\operatorname{Cov}(X, Y)$
3. If X and Y are independent,
$$\operatorname{Var}(X + Y) = \operatorname{Var}(X) + \operatorname{Var}(Y)$$
Proof: The first result is obvious. Let $\mu_X = EX$ and $\mu_Y = EY$; then $E(X + Y) = \mu_X + \mu_Y$ and
$$E(X + Y - \mu_X - \mu_Y)^2 = E(X - \mu_X)^2 + E(Y - \mu_Y)^2 + 2E\left[(X - \mu_X)(Y - \mu_Y)\right]$$
The third result follows from the important fact that if X and Y are independent then
$$E(XY) = E(X)E(Y)$$
Hence if X and Y are independent,
$$\operatorname{Cov}(X, Y) = 0$$
Note that Cov(X, Y) = 0 does not imply that X and Y are independent.
Proof:
$$E(XY) = \sum_{x,y} xy f(x, y) = \sum_{x,y} xy f_X(x) f_Y(y) = \Big(\sum_x x f_X(x)\Big)\Big(\sum_y y f_Y(y)\Big) = E(X)E(Y)$$
Example: If X ∼ B(n, p) then X is the sum of n independent Bernoulli random variables $X_i$ with probability of success p. In 6.3 we showed that $\operatorname{Var}(X_i) = p(1-p)$. Hence
$$\operatorname{Var}(X) = \operatorname{Var}(X_1 + \ldots + X_n) = \sum_i \operatorname{Var}(X_i) = np(1-p)$$
The Cauchy-Schwarz Inequality
$$(EXY)^2 \le E(X^2)E(Y^2)$$
whenever these expectations exist, with equality only if Y = λX with probability 1 for some λ.
Proof: $E(X - \lambda Y)^2 \ge 0$ for all λ. Expanding,
$$E(X^2) - 2\lambda E(XY) + \lambda^2 E(Y^2) \ge 0$$
For this to be the case, the expression considered as a quadratic in λ must have either one double root or no real roots, that is
$$(EXY)^2 \le E(X^2)E(Y^2)$$
If equality holds then at the double root $E(X - \lambda Y)^2 = 0$, so X = λY with probability 1.
It follows that if we define the correlation coefficient between X and Y to be
$$\rho = \frac{\operatorname{Cov}(X, Y)}{\sqrt{\operatorname{Var} X}\,\sqrt{\operatorname{Var} Y}}$$
then
1. $-1 \le \rho \le 1$.
2. $\rho = 1$ if and only if Y = aX + b with probability 1 for some a, b with a > 0.
3. $\rho = -1$ if and only if Y = aX + b with probability 1 for some a, b with a < 0.
Proof: Apply the Cauchy-Schwarz inequality to the random variables X − E(X) and Y − E(Y).
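The correlation coefficient can be computed directly from a joint pmf. A sketch (illustrative, not from the notes), using the heads/tails example where Y = 2 − X exactly, so ρ should equal −1:

```python
from math import sqrt

# Joint pmf of (X, Y) = (heads, tails) in two tosses of a fair coin
f = {(0, 2): 0.25, (1, 1): 0.5, (2, 0): 0.25}

E = lambda g: sum(g(x, y) * p for (x, y), p in f.items())   # E[g(X, Y)]
mx, my = E(lambda x, y: x), E(lambda x, y: y)
cov = E(lambda x, y: (x - mx) * (y - my))
rho = cov / sqrt(E(lambda x, y: (x - mx)**2) * E(lambda x, y: (y - my)**2))
print(rho)   # -1.0, since Y = 2 - X is a decreasing linear function of X
```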
7.5 Conditional Expectation
If X and Y have the joint probability mass function f(x, y), then given Y = y the random variable X has a conditional probability mass function given by
$$f_{X|Y}(x|y) = \frac{f(x, y)}{f_Y(y)}$$
for all y such that $f_Y(y) > 0$.
The conditional expectation of X, given that Y = y where $f_Y(y) > 0$, is
$$E(X|Y = y) = \sum_x x f(x, y)/f_Y(y)$$
As y varies, E(X|Y = y) defines a random variable. This is denoted E(X|Y). Provided both sides are defined, we have:
$$E(E(X|Y)) = E(X)$$
This is just another version of the law of total expectation (with partition the events $Y = y_j$).
Example: A fair coin is tossed a random number N of times, where N has mean 10. Find the expected number of heads.
$$E(H|N) = \frac{N}{2} \quad \Rightarrow \quad E(H) = E(E(H|N)) = E(N/2) = 5$$
Conditional expectation has the properties of ordinary expectations, for example
$$E(aX + bY|Z) = aE(X|Z) + bE(Y|Z)$$
Note also the following useful properties (proofs as exercises):
1. $E(g(Y)X|Y) = g(Y)E(X|Y)$.
2. $E(E(X|Y, Z)|Y) = E(X|Y)$ (the tower property)
The conditional expectation is in a sense the best predictor of X given Y. Suppose we want to choose a function g of Y to minimise
$$E(X - g(Y))^2$$
(assuming all needed second moments exist). Then g(Y) = E(X|Y) solves this problem.
Proof:
$$E(X - g(Y))^2 = E\big(X - E(X|Y) + E(X|Y) - g(Y)\big)^2 = E\big(X - E(X|Y)\big)^2 + E\big(E(X|Y) - g(Y)\big)^2$$
where the cross-product term has vanished because
$$E\big[(X - E(X|Y))(g(Y) - E(X|Y))\big] = E\Big[E\big[(X - E(X|Y))(g(Y) - E(X|Y))\,\big|\,Y\big]\Big] = E\Big[(g(Y) - E(X|Y))\,E\big[X - E(X|Y)\,\big|\,Y\big]\Big] = 0$$
using the properties of conditional expectation given above. The optimal choice of g(Y) is therefore E(X|Y).
In more advanced treatments this property is sometimes used to define conditional expectation.
Another useful formula:
$$\operatorname{Var}(X) = E\operatorname{Var}(X|Y) + \operatorname{Var} E(X|Y)$$
Proof:
$$\operatorname{Var}(X) = EX^2 - (EX)^2 = EX^2 - E\big[(E(X|Y))^2\big] + E\big[(E(X|Y))^2\big] - (EX)^2$$
$$= E\big[E(X^2|Y) - (E(X|Y))^2\big] + E\big[(E(X|Y))^2\big] - \big(E E(X|Y)\big)^2 = E\operatorname{Var}(X|Y) + \operatorname{Var} E(X|Y)$$
7.6 Random Walks
Gambler's ruin is an example of a random walk. In a random walk the state of the system at time n + 1 is
$$S_{n+1} = S_n + X_n$$
where the $X_n$ are independent over time and have the same distribution for all n.
In the simple random walk
$$X_n = \begin{cases} 1 & \text{with probability } p \\ -1 & \text{with probability } 1 - p \end{cases}$$
We will simply use the simple random walk as an example of the use of difference equations, but it can be studied in much greater depth.
In the Hall model of consumption, the random walk property refers to the fact that $C_{t+1} = C_t + \varepsilon_t$, where $\varepsilon_t$ has mean zero conditional on current information. This is somewhat of a misnomer, as $\varepsilon_t$ need not be independent or identically distributed over time. Probabilists would instead refer to $C_t$ as a martingale.
8 Generating Functions
Ross 7
8.1 Basic Properties
Suppose that the random variable X takes on non-negative integer values. Its probability generating function $G_X(s)$ is
$$G_X(s) = E(s^X) = \sum_k P(X = k) s^k$$
Ross discusses the moment-generating function $M(t) = E(e^{tX})$, which we will come across for continuous random variables, rather than the pgf. Note that $G(s) = M(\ln s)$, so one can move between the two.
It can be shown that if two integer-valued random variables have the same pgf they have the same distribution, so if we recognise the pgf we have identified the distribution.
Examples
Bernoulli: Let P(X = 1) = p and P(X = 0) = 1 − p = q. Then
$$G_X(s) = q + ps$$
Binomial: If X is B(n, p),
$$G_X(s) = \sum_{r=0}^{n} s^r \binom{n}{r} p^r (1-p)^{n-r} = (ps + q)^n$$
Poisson: If X is Poisson with parameter λ,
$$G_X(s) = \sum_{r=0}^{\infty} s^r \frac{e^{-\lambda} \lambda^r}{r!} = e^{-\lambda} \sum_{r=0}^{\infty} \frac{(\lambda s)^r}{r!} = e^{\lambda(s-1)}$$
Note that
$$G_X'(s) = \sum_k k P(X = k) s^{k-1}$$
So
1. $G_X'(1) = E(X)$
2. $G_X''(1) = E(X(X-1))$
Examples
Bernoulli: $G_X(s) = q + ps$, so $G_X'(1) = p = EX$.
Binomial: $G_X(s) = (q + ps)^n$, so $G_X'(1) = np(q + p)^{n-1} = np = EX$.
Poisson: $G_X(s) = e^{\lambda(s-1)}$, so $G_X'(1) = \lambda = EX$.
8.2 Sums of Random Variables
Often we are interested in the distribution of the sum of two independent random variables. The direct approach is to note that if Z = X + Y and X and Y are independent,
$$P(X + Y = k) = \sum_r P(X = r) P(Y = k - r)$$
Generating functions can sometimes save work. If X and Y are independent integer-valued random variables,
$$G_{X+Y}(s) = E(s^{X+Y}) = E(s^X s^Y) = E(s^X)E(s^Y) = G_X(s) G_Y(s)$$
Example: If $X_1, \ldots, X_n$ are independent Bernoulli random variables then
$$G_{X_1 + \ldots + X_n}(s) = \prod_{i=1}^{n} G_{X_i}(s) = (q + ps)^n$$
Hence (by the uniqueness of pgfs) $X_1 + \ldots + X_n$ has a B(n, p) distribution.
pgfs are also useful when the number of summands is random.
Suppose N and $X_1, X_2, \ldots$ are independent non-negative integer random variables. If the $X_1, X_2, \ldots$ are identically distributed, each having pgf G(s), and N has pgf $G_N(s)$, then
$$G_{X_1 + \ldots + X_N}(s) = E(s^{X_1 + \ldots + X_N}) = E\big(E(s^{X_1 + \ldots + X_N}|N)\big) = E(G(s)^N) = G_N(G(s))$$
Example: Suppose a fair coin is tossed N times, where N has a Poisson distribution with parameter λ. The number of heads has pgf
$$e^{\lambda\left(\frac{1}{2}s + \frac{1}{2} - 1\right)} = e^{\frac{\lambda}{2}(s-1)}$$
since each toss is a Bernoulli random variable. Hence one can calculate means and variances etc. using the derivatives of the pgf.
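The composition formula can be checked numerically at a sample point: composing the Poisson pgf with the fair-coin pgf should give the pgf of a Poisson with parameter λ/2. A sketch (illustrative, not from the notes):

```python
from math import exp

lam, s = 2.0, 0.7
G_bern = 0.5 + 0.5 * s                       # pgf of one fair-coin toss, q + ps
G_N = lambda t: exp(lam * (t - 1))           # pgf of N ~ Poisson(lam)

# pgf of the random sum, G_N(G(s)), vs the Poisson(lam/2) pgf evaluated directly
lhs = G_N(G_bern)
rhs = exp((lam / 2) * (s - 1))
print(lhs, rhs)                               # both equal exp(lam*(s-1)/2)
assert abs(lhs - rhs) < 1e-12
```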
9 Continuous Distributions and the Probability Density Function
Ross 5
For a continuous random variable, P(X = x), the probability that it is equal to a particular value x, is zero, because the distribution function is continuous:
$$P(x \le X \le x + h) = F(x + h) - F(x) \to 0 \text{ as } h \to 0$$
But we can look at the density of probability:
$$\frac{P(x \le X \le x + h)}{h} = \frac{F(x + h) - F(x)}{h} \to f(x) \text{ as } h \to 0$$
where f is the derivative of the distribution function (if differentiable).
f(x) is the probability density function (pdf); f(x)dx is the probability that X is in a very small (infinitesimal) interval:
$$f(x)\,dx = F(x + dx) - F(x)$$
Formal definition of a continuous random variable: X is a continuous random variable if the distribution function F(x) is continuous and there is a non-negative function f(x) such that:
$$F(x) = \int_{-\infty}^{x} f(t)\,dt$$
Then:
1. If F is differentiable, $f(x) = F'(x)$
2. $P(a \le X \le b) = F(b) - F(a) = \int_a^b f(x)\,dx$
3. $\int_{-\infty}^{\infty} f(x)\,dx = 1$
Definition: The support of a distribution is the closure of the set of points for which the density function (or pmf) is positive, $\{x \in \mathbb{R} : f(x) > 0\}$ (that is, the smallest closed set containing it).
9.1 Uniform Distribution U(a, b)
A Uniform random variable is one that can take any value in an interval [a, b] of the real line, and all values between a and b are equally likely. So between a and b the probability density function is a constant, and elsewhere it is zero:
$$f(x) = \begin{cases} \frac{1}{b-a} & \text{if } x \in [a, b] \\ 0 & \text{otherwise} \end{cases} \qquad \text{and} \qquad F(x) = \begin{cases} 0 & x < a \\ \frac{x-a}{b-a} & a \le x \le b \\ 1 & x > b \end{cases}$$
If X is uniformly distributed on [a, b], we write X ∼ U(a, b). Its support is [a, b].
It is quite hard to think of real-world random variables that are uniformly distributed, but the uniform distribution is commonly used in economic models because it is so simple (e.g. the Hotelling model of product differentiation, where consumers are uniformly distributed along a line).
9.2 Normal Distribution N(μ, σ²)
Many real-world random variables are Normally distributed, or approximately so: for example, measurements in a population, such as heights or IQ; in economics the logarithm of wages or incomes or prices may be approximately Normal. The Normal density function is:
$$f(x; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{1}{2\sigma^2}(x - \mu)^2\right)$$
The Normal distribution has two parameters, μ and σ. (We will interpret them later.) If X is a normally distributed random variable we write X ∼ N(μ, σ²).
Think about the shape of the graph of the density function. Note that its derivative is $f'(x) = -\frac{x - \mu}{\sigma^2} f(x)$. Then we can see that:
1. It has a single maximum point at μ (it is single-peaked, or equivalently uni-modal)
2. It is symmetric about μ: $f(\mu + x) = f(\mu - x)$
3. It is positive for all values on the real line (that is, it has unbounded support)
4. $f \to 0$ as $x \to \pm\infty$
5. It is more spread out if σ is high.
It has the well-known bell-shape.
Problem: there is no analytical form for the cumulative distribution function:
$$F(x; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \int_{-\infty}^{x} \exp\left(-\frac{1}{2\sigma^2}(t - \mu)^2\right) dt$$
so it has to be evaluated numerically. It can be shown analytically that $\int_{-\infty}^{\infty} f(x)\,dx = 1$. This follows from the fact that
$$\int_{-\infty}^{\infty} e^{-x^2/2}\,dx = \sqrt{2\pi}$$
9.2.1 The Standard Normal Distribution
Suppose X ∼ N(μ, σ²) and let
$$Z = \frac{X - \mu}{\sigma}$$
Z is also a random variable, and we can work out its distribution. Let Φ(z) be the distribution function of Z. Then:
$$\Phi(z) = P\left(\frac{X - \mu}{\sigma} \le z\right) = P(X \le \mu + \sigma z) = \frac{1}{\sqrt{2\pi\sigma^2}} \int_{-\infty}^{\mu + \sigma z} \exp\left(-\frac{1}{2\sigma^2}(t - \mu)^2\right) dt$$
$$= \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{z} e^{-\frac{1}{2}u^2}\,du \qquad \text{(using the substitution } u = (t - \mu)/\sigma\text{)}$$
Hence we can see that:
- the density of Z is $\phi(z) = \frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2}z^2}$
- this is the normal density with μ = 0 and σ = 1; so Z ∼ N(0, 1)
We say that Z has a Standard Normal distribution. The values of the cdf for the standard normal, Φ(z), are tabulated.
From the table for the Standard Normal distribution:
(0) = 0.5
(1.645) = 0.95
(1.96) = 0.975
9.2.2 Finding Normal Probabilities
We can obtain probabilities for any Normal distribution using tables for N(0, 1). If
X ∼ N(μ, σ²), its distribution function is:
    F_X(x) = Φ((x − μ)/σ)
If the height in cm of adult males has a normal distribution with μ = 175 and
σ = 10, what is the probability that a randomly-chosen individual is taller than
190cm?
    P(X ≤ 190) = Φ((190 − 175)/10) = Φ(1.5) = 0.9332  ⇒  P(X > 190) = 0.0668
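The height calculation can be reproduced without tables, using the identity Φ(z) = (1 + erf(z/√2))/2 (a Python sketch, not part of the notes):

```python
from math import erf, sqrt

def norm_cdf(z):
    # Standard Normal cdf via the error function: Phi(z) = (1 + erf(z/sqrt(2)))/2
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

mu, sigma = 175.0, 10.0
p_taller = 1.0 - norm_cdf((190.0 - mu) / sigma)   # P(X > 190), about 0.0668
```

This reproduces both the tabulated value Φ(1.96) = 0.975 and the 0.0668 answer of the example.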
9.3 Exponential Distribution E(λ)
This is another continuous distribution. It has a single parameter λ. If X is an exponentially-
distributed random variable, X ∼ E(λ), its density is:
    f(x; λ) = λe^{−λx} for x ≥ 0
Note that the support of X is R₊: it can only take positive values.
9.4 The Cauchy Distribution
    f(x) = 1/(π(1 + x²)),   −∞ < x < ∞
This is a density because ∫_{−∞}^{∞} 1/(1 + x²) dx = [tan⁻¹ x]_{−∞}^{∞} = π.
9.5 The Gamma Distribution Γ(α, λ)
The Gamma distribution with parameters α and λ has the density function
    f(x) = (1/Γ(α)) λ^α x^{α−1} e^{−λx} for x > 0,  and 0 for x ≤ 0
where
    Γ(α) = ∫₀^∞ u^{α−1} e^{−u} du
Note that when n is a positive integer
    Γ(n) = (n − 1)!
9.6 The Chi-squared Distribution χ²(n)
The χ² distribution with n degrees of freedom is the Γ(n/2, 1/2) distribution:
    f(x) = (1/(2Γ(n/2))) ((1/2)x)^{n/2 − 1} e^{−x/2} for x > 0,  and 0 for x ≤ 0
9.7 Monotonic Transformations
Suppose X is continuous, with distribution and density functions F_X(x) and f_X(x), and
g(.) is a differentiable and monotonic function. Then g has an inverse g⁻¹, and there is
a general formula for the distribution and density of Y = g(X).
If g(.) is an increasing transformation:
    P(Y ≤ y) = P(g(X) ≤ y) = P(X ≤ g⁻¹(y))  ⇒  F_Y(y) = F_X(g⁻¹(y))
Differentiating:
    f_Y(y) = dF_Y/dy = f_X(g⁻¹(y)) (d/dy)g⁻¹(y)
Example Let X be uniformly distributed on (0, 1). What is the probability density of
Y = ln X? X might represent income and Y its utility to a consumer.
    F_Y(y) = P(Y ≤ y) = P(ln X ≤ y) = P(X ≤ e^y) = e^y,   y < 0
So Y has density
    f_Y(y) = F′_Y(y) = e^y,   y < 0  (0 otherwise)
Or using the formula, for y < 0, with g(x) = ln x and g⁻¹(y) = e^y,
    f_Y(y) = f_X(g⁻¹(y)) (d/dy)g⁻¹(y) = 1 · e^y = e^y,   y < 0
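The example can be checked by simulation (a Python sketch, not part of the notes): draw X ∼ U(0, 1), set Y = ln X, and compare the empirical frequency P(Y ≤ y) with F_Y(y) = e^y at an illustrative point y = −1.

```python
import math
import random

# Y = ln X with X ~ U(0,1); check F_Y(y) = e^y for y < 0 by simulation.
# Using 1 - random.random() gives a draw in (0, 1], avoiding log(0).
random.seed(1)
n = 100_000
y = -1.0                                  # illustrative evaluation point
draws = [math.log(1.0 - random.random()) for _ in range(n)]
empirical = sum(d <= y for d in draws) / n
theoretical = math.exp(y)                 # e^{-1}
```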
If g(.) is a decreasing transformation:
    P(Y ≤ y) = P(g(X) ≤ y) = P(X ≥ g⁻¹(y))  ⇒  F_Y(y) = 1 − F_X(g⁻¹(y))
Differentiating:
    f_Y(y) = dF_Y/dy = −f_X(g⁻¹(y)) (d/dy)g⁻¹(y)
So if g(.) is any monotonic transformation:
    f_Y(y) = f_X(g⁻¹(y)) |(d/dy)g⁻¹(y)|
For any random variable X let Y = −X. Then:
    F_Y(y) = 1 − F_X(−y)  and  f_Y(y) = f_X(−y)
Important Example: Lognormal Distribution
Suppose X ∼ N(μ, σ²) and Y = e^X. Inverting: X = ln Y.
    F_Y(y) = P(Y ≤ y) = P(e^X ≤ y) = P(X ≤ ln y) = Φ((ln y − μ)/σ)   (y > 0)
and:
    f_Y(y) = (1/(yσ)) φ((ln y − μ)/σ) = (1/(y√(2πσ²))) e^{−(1/2)((ln y − μ)/σ)²}
(Have not defined expectations yet but...)
Note: Y = e^X, and E(X) = μ, but E(Y) ≠ e^μ:
    E(Y) = E(e^X) = (1/√(2πσ²)) ∫_{−∞}^{∞} e^x e^{−(1/2)((x − μ)/σ)²} dx = e^{μ + σ²/2}
    Var(Y) = e^{2μ+σ²}(e^{σ²} − 1)
Hence Coefficient of Variation = √(e^{σ²} − 1), which depends on σ only.
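A quick simulation check of E(Y) = e^{μ+σ²/2} (Python sketch; the values μ = 0.5 and σ = 0.8 are illustrative, not from the notes):

```python
import math
import random

# Sample mean of Y = e^X, X ~ N(mu, sigma^2), versus exp(mu + sigma^2/2).
random.seed(2)
mu, sigma = 0.5, 0.8
n = 200_000
sample_mean = sum(math.exp(random.gauss(mu, sigma)) for _ in range(n)) / n
theoretical = math.exp(mu + sigma**2 / 2)   # e^{0.82}, about 2.27
```

The sample mean sits close to e^{0.82}, well above e^{μ} = e^{0.5}, illustrating E(Y) ≠ e^μ.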
We can use the lognormal distribution to model wages (positive, skewed, . . . ).
If real wages in year 1 have a lognormal distribution: W₁ ∼ LN(μ, σ²)
and increase by a factor k due to productivity growth: W₂ = kW₁
then W₂ ∼ LN(μ + ln k, σ²).
Can also use to model stock prices
    S_{t+1} = A_t S_t
where ln A_t has a Normal distribution.
9.8 A non-monotonic example: the Chi-Squared distribution
It is possible to find the distribution of g(X) for some non-monotonic functions.
Suppose Z ∼ N(0, 1) and let Y = Z². Then for y > 0:
    F_Y(y) = P(Y < y) = P(Z² < y) = P(−√y ≤ Z ≤ √y) = 2Φ(√y) − 1
and hence
    f_Y(y) = (d/dy)F_Y(y) = 2φ(√y) · (1/2)y^{−1/2} = (1/√(2πy)) e^{−y/2}.
This is the density for the chi-squared distribution with 1 degree of freedom: Y ∼ χ²(1).
One can show that Γ(1/2) = √π, so this is consistent with the definition above:
    Γ(1/2) = ∫₀^∞ x^{−1/2} e^{−x} dx.
Use the substitution x = y² to reduce to 2∫₀^∞ e^{−y²} dy = √π (cf. integral for Normal density).
10 Expectation for Continuous Random Variables
Ross 5
10.1 The mean of a continuous random variable
If X is a continuous random variable with probability density function f(x), the expected
value, or mean, of X is:
    E(X) = ∫_{−∞}^{∞} x f(x) dx
(Note that this is an improper integral, so it may not exist. We will discuss this later.)
Uniform X ∼ U(a, b)
    E(X) = ∫_a^b x/(b − a) dx = (1/(b − a)) [x²/2]_a^b = (a + b)/2
Standard Normal X ∼ N(0, 1)
    E(X) = (1/√(2π)) ∫_{−∞}^{∞} x e^{−x²/2} dx = 0 by symmetry
(This doesn't really prove that the mean is zero, although it is. We ought to check
formally that the integral exists: to be done later.)
Note also that we can integrate here:
    ∫ x e^{−x²/2} dx = −e^{−x²/2}
Normal X ∼ N(μ, σ²)
It can be shown that the mean of the Normal distribution is μ.
Exponential distribution X ∼ E(λ): f(x; λ) = λe^{−λx} for x ≥ 0
    E(X) = ∫₀^∞ x λe^{−λx} dx = [−x e^{−λx}]₀^∞ + ∫₀^∞ e^{−λx} dx = [−(1/λ) e^{−λx}]₀^∞ = 1/λ
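The integration-by-parts result E(X) = 1/λ can be confirmed numerically (a Python sketch; λ = 2 and the truncation point of the integral are illustrative choices):

```python
import math

# Riemann-sum approximation of E(X) = integral of x * lam * e^{-lam x} over [0, inf).
lam = 2.0
dx = 1e-4
upper = 25.0    # e^{-2*25} is negligible, so truncating here loses almost nothing
grid = (i * dx for i in range(int(upper / dx)))
mean = sum(x * lam * math.exp(-lam * x) * dx for x in grid)
```

The sum agrees with 1/λ = 0.5 to better than three decimal places.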
10.2 The expectation of a function of a random variable
If X is a random variable with density function f, and g is a function g : R → R, then
g(X) is a random variable too, and we can find its expected value:
    E(g(X)) = ∫_{−∞}^{∞} g(x) f(x) dx
Proof similar to that in discrete case.
10.3 Variance and Dispersion
If X is a continuous random variable with mean E(X) = μ and density function f, its
variance is defined by:
    Var(X) = E((X − μ)²) = ∫_{−∞}^{∞} (x − μ)² f(x) dx
As with discrete random variables (same proof):
    Var(X) = E(X²) − (E(X))²
Examples
Uniform, X ∼ U(a, b)
    E(X²) = ∫_a^b x²/(b − a) dx = (1/(b − a)) [x³/3]_a^b = (b³ − a³)/(3(b − a)) = (b² + ab + a²)/3
    ⇒ Var(X) = (b² + ab + a²)/3 − ((a + b)/2)² = (b − a)²/12
Standard Normal, X ∼ N(0, 1)
    Var(X) = (1/√(2π)) ∫_{−∞}^{∞} x² e^{−x²/2} dx = 1   (Use integration by parts)
So the variance is 1, and so is the standard deviation.
Normal, X ∼ N(μ, σ²)
It can be shown that Var(X) = σ², and the standard deviation is σ.
10.4 Moments
The mean μ = E(X), and variance Var(X) = E(X − μ)², of a random variable X are the
first two (central) moments of its distribution. We can define higher-order moments:
The nth moment is: μ′_n = E(Xⁿ)
The nth central moment is: μ_n = E((X − μ)ⁿ)
So μ′₁ = μ and μ₂ = variance. If we write σ for the standard deviation (so σ = √μ₂):
The standardised 3rd central moment α₃ = μ₃/σ³ is the skewness
and the standardised fourth central moment α₄ = μ₄/σ⁴ is the kurtosis
N(μ, σ²): skewness = 0; kurtosis = 3.
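The Normal values skewness = 0 and kurtosis = 3 can be checked by simulation (a Python sketch, not part of the notes):

```python
import random
import statistics

# Standardised third and fourth sample moments of N(0,1) draws.
random.seed(3)
n = 200_000
xs = [random.gauss(0.0, 1.0) for _ in range(n)]
m = statistics.fmean(xs)
s = statistics.pstdev(xs)
skew = sum((x - m) ** 3 for x in xs) / (n * s**3)   # should be near 0
kurt = sum((x - m) ** 4 for x in xs) / (n * s**4)   # should be near 3
```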
10.5 Existence of Expectations
If E(|g(X)|) is infinite, then E(g(X)) does not exist.
So, for example, the correct definition of the mean of a continuous random variable is:
    E(X) = ∫_{−∞}^{∞} x f(x) dx  provided that  E(|X|) = ∫_{−∞}^{∞} |x| f(x) dx < ∞
For any density function f(x) the integral ∫_{−∞}^{∞} f(x) dx must exist. But the mean
∫_{−∞}^{∞} x f(x) dx may not exist, or if the mean exists ∫_{−∞}^{∞} x² f(x) dx may not (etc.).
Standard Normal Distribution Z ∼ N(0, 1)
    E(Z) = (1/√(2π)) ∫_{−∞}^{+∞} z e^{−z²/2} dz
         = (1/√(2π)) ∫_{−∞}^{0} z e^{−z²/2} dz + (1/√(2π)) ∫₀^{+∞} z e^{−z²/2} dz
         = −(1/√(2π)) ∫₀^{+∞} z e^{−z²/2} dz + (1/√(2π)) ∫₀^{+∞} z e^{−z²/2} dz
         = −(1/√(2π)) [−e^{−z²/2}]₀^∞ + (1/√(2π)) [−e^{−z²/2}]₀^∞ = 0
Positive and negative parts both finite.
Cauchy distribution f(x) = (1/π) · 1/(1 + x²)
    E(X) = ∫_{−∞}^{0} x/(π(1 + x²)) dx + ∫₀^{∞} x/(π(1 + x²)) dx
         = [(1/(2π)) ln(1 + x²)]_{−∞}^{0} + [(1/(2π)) ln(1 + x²)]₀^{∞}
Positive and negative parts are infinite, so E(X) not defined.
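A simulation illustrates the point (a Python sketch, not from the notes). Cauchy draws can be generated by the inverse-cdf method as X = tan(π(U − 1/2)) with U ∼ U(0, 1); the sample median is stable at 0, but the sample mean never settles down, since E(X) does not exist:

```python
import math
import random
import statistics

# Cauchy draws via the inverse cdf. The median estimate is stable;
# the sample mean is not (no mean exists), so we make no claim about it.
random.seed(4)
n = 100_000
xs = [math.tan(math.pi * (random.random() - 0.5)) for _ in range(n)]
med = statistics.median(xs)     # close to 0
mean = statistics.fmean(xs)     # erratic: dominated by a few huge draws
```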
10.6 Conditional Distributions
As in the discrete case, one can consider the distribution of a continuous random variable
X, with density f(x), conditional on some event A, usually of the form A = {a < X ≤ b},
that is, X is known to lie in some subset of its range.
The distribution function of X given A is
    F_{X|A}(x) = P(X ≤ x | X ∈ A) = P(X ≤ x & X ∈ A)/P(A) = ∫_{A∩(−∞,x]} f(y) dy / ∫_A f(y) dy
∫_C denotes the integral over the set C.
If X is U(a, b) and A is the event X ≤ c, where a < c ≤ b, then the conditional
distribution function of X given A is
    F_{X|A}(x) = 0 for x ≤ a
               = (∫_a^x (1/(b−a)) dy) / (∫_a^c (1/(b−a)) dy) = (x − a)/(c − a) for a < x ≤ c
               = 1 for x ≥ c
The conditional density is
    f_{X|A}(x) = 1/(c − a) for a < x ≤ c,  and 0 otherwise
More generally if A is an event of the form a < X ≤ b and X has density f then
    P(X ≤ x | A) = (F(x) − F(a))/(F(b) − F(a)) for a < x ≤ b
So its conditional density is
    f_{X|A}(x) = f(x)/(F(b) − F(a)) if a < x ≤ b,  and 0 otherwise
Thus one can define expectations conditional on A etc.
Example Consider the exponential distribution, X ∼ E(λ).
    P(X > t + s | X > s) = P(X > t + s & X > s)/P(X > s) = P(X > t + s)/P(X > s) = e^{−λ(t+s)}/e^{−λs} = e^{−λt}
since 1 minus the cdf of X is 1 − F(x) = e^{−λx}.
This is referred to as the lack of memory property of the exponential distribution.
Consider a worker looking for a job. Let X be a positive continuous random variable,
representing the time it takes him to find a job. The hazard rate is defined to be
    λ(t) = f(t)/(1 − F(t))
This can be thought of as the probability that he finds a job in the interval (t, t + dt)
given that he has not found one by time t:
    P(X ∈ (t, t + dt) | X > t) ≈ f(t) dt/(1 − F(t))
using the above definition of the conditional density.
For the exponential
    λ(t) = λe^{−λt}/e^{−λt} = λ
That is a constant hazard rate: the probability of finding a job in the next instant is
independent of how long he has been searching (cf. lack of memory property).
Specifying λ(t) determines F (exercise). Useful in analysis of duration models, e.g.
job search, survival from smoking etc.
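The lack-of-memory property behind the constant hazard rate can be checked directly (a Python sketch; λ, s and t are illustrative values):

```python
import math
import random

# P(X > t + s | X > s) versus P(X > t) = e^{-lam t} for X ~ E(lam).
random.seed(5)
lam, s, t = 1.5, 0.4, 0.7
n = 200_000
xs = [random.expovariate(lam) for _ in range(n)]
survivors = [x for x in xs if x > s]
conditional = sum(x > s + t for x in survivors) / len(survivors)
unconditional = math.exp(-lam * t)
```

The conditional frequency among "survivors" past time s matches the unconditional survival probability e^{−λt}.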
11 Bivariate Distributions for Continuous Random Variables
Ross 6
11.1 Introduction
X and Y are jointly continuous random variables if the joint distribution function:
    F_{X,Y}(x, y) = P(X ≤ x, Y ≤ y)
is continuous and there is a non-negative joint density function f_{X,Y}(x, y) satisfying
    F_{X,Y}(x, y) = ∫_{−∞}^{x} ∫_{−∞}^{y} f_{X,Y}(s, t) dt ds   (with ∫∫ f_{X,Y}(s, t) ds dt = 1)
Then
    f_{X,Y}(x, y) = ∂²F_{X,Y}(x, y)/∂x ∂y
and f_{X,Y}(x, y) dx dy represents the probability P(x ≤ X ≤ x + dx, y ≤ Y ≤ y + dy).
    F_{X,Y}(x, y) = 1/(x + y − 1) − 1/x − 1/y + 1 for x, y ≥ 1
is one of the Pareto family of distributions:
    F_{X,Y}(1, 1) = 0,   lim_{x,y→∞} F_{X,Y}(x, y) = 1,   and f_{X,Y}(x, y) = 2/(x + y − 1)³
11.2 Marginal Distributions
If we know the joint distribution of X and Y we can work out the distributions of each
of the variables individually.
    F_X(x) = P(X ≤ x, Y < ∞) = lim_{y→∞} F(x, y)
To find the density, start from the cdf:
    F_X(x) = ∫_{−∞}^{x} ∫_{−∞}^{∞} f_{X,Y}(s, t) dt ds = ∫_{−∞}^{x} [∫_{−∞}^{∞} f_{X,Y}(s, y) dy] ds
           = ∫_{−∞}^{x} f_X(s) ds   where   f_X(x) = ∫_{−∞}^{∞} f_{X,Y}(x, y) dy
For the Pareto distribution above, F_{X,Y}(x, y) = 1/(x + y − 1) − 1/x − 1/y + 1:
    F_X(x) = lim_{y→∞} F_{X,Y}(x, y) = 1 − 1/x   and so   f_X(x) = 1/x²
Alternatively, starting from the joint density f_{X,Y}(x, y) = 2/(x + y − 1)³:
    f_X(x) = ∫₁^∞ 2/(x + y − 1)³ dy = 1/x²
11.3 Independence
X and Y are independent if and only if: f_{X,Y}(x, y) = f_X(x) f_Y(y) for all x, y.
Equivalently:
    P(X ≤ x, Y ≤ y) = P(X ≤ x) P(Y ≤ y)   or   F_{X,Y}(x, y) = F_X(x) F_Y(y)
(Compare with independent events.)
For the Pareto distribution above, X and Y are not independent:
    f_{X,Y}(x, y) = 2/(x + y − 1)³,  but  f_X(x) = 1/x²  and  f_Y(y) = 1/y²
But the joint density g_{X,Y}(x, y) = 1/(x²y²), for x, y ≥ 1, has the same marginal
distributions; so here the random variables are independent.
(Note that we can work out the marginal distributions from the joint distribution,
but not vice-versa, unless we know that the variables are independent.)
11.4 Conditional Distribution
Recall that for events A and B: P(A|B) = P(A ∩ B)/P(B)
Similarly we define the conditional pdf of X given Y as
    f_{X|Y}(x|y) = f_{X,Y}(x, y)/f_Y(y)
    F_{X|Y}(x|y) = ∫_{−∞}^{x} f_{X|Y}(u|y) du
Note that if X and Y are independent: f_{X|Y}(x|y) = f_X(x) f_Y(y)/f_Y(y) = f_X(x)
Suppose X and Y are jointly distributed: f_{X,Y}(x, y) = x e^{−x(y+1)}, 0 ≤ x, y.
Find the marginal density of Y, and the conditional density of Y given X.
The marginal density of Y is: f_Y(y) = ∫₀^∞ x e^{−x(y+1)} dx = 1/(y + 1)²
To find the conditional density we also need: f_X(x) = ∫₀^∞ x e^{−x(y+1)} dy = e^{−x}
Then the conditional density of Y|X is: f_{Y|X}(y|x) = x e^{−x(y+1)}/e^{−x} = x e^{−xy}
where y, x ≥ 0. That is, Y|_{X=x} ∼ E(x).
11.5 Expectation and Moments in Bivariate Distributions
If g(X, Y) is a function of jointly distributed random variables X and Y, the expectation
of g(X, Y) can be shown to be
    E(g(X, Y)) = ∫_x ∫_y g(x, y) f_{X,Y}(x, y) dy dx
In particular:
1. E(X) = ∫_x ∫_y x f_{X,Y}(x, y) dy dx = ∫_x x [∫_y f_{X,Y}(x, y) dy] dx = ∫_x x f_X(x) dx
(and similarly for the variance).
2. E(aX + bY) = aE(X) + bE(Y)
3. If X and Y are independent:
(a) E(XY) = E(X)E(Y)
(b) Var(aX + bY) = a² Var(X) + b² Var(Y)
If X ∼ U(0, 1) and Y ∼ U(0, 1) are independent uniform random variables, find
the mean and variance of X + Y and X − Y.
    E(X + Y) = E(X) + E(Y) = 1/2 + 1/2 = 1     E(X − Y) = E(X) − E(Y) = 0
    Var(X + Y) = Var(X) + Var(Y) = 1/12 + 1/12 = 1/6
    Var(X − Y) = Var(X) + Var(Y) = 1/6
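A simulation check of the uniform example (Python sketch, not part of the notes):

```python
import random
import statistics

# Mean and variance of X + Y for independent U(0,1) draws:
# should be near 1 and 1/12 + 1/12 = 1/6.
random.seed(6)
n = 200_000
sums = [random.random() + random.random() for _ in range(n)]
mean_sum = statistics.fmean(sums)
var_sum = statistics.pvariance(sums)
```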
11.6 Covariance and Correlation
The covariance of X and Y is defined:
    Cov(X, Y) = E((X − E(X))(Y − E(Y)))
Equivalently: Cov(X, Y) = E(XY) − E(X)E(Y)
so if X and Y are independent, their covariance is zero. (Reverse NOT generally true.)
Recall that, in general:
    Var(aX + bY) = a² Var(X) + b² Var(Y) + 2ab Cov(X, Y)
The correlation coefficient, often denoted ρ(X, Y), is defined:
    Corr(X, Y) = ρ(X, Y) = Cov(X, Y)/√(Var(X)Var(Y))
As in discrete case
1. −1 ≤ ρ ≤ 1
2. ρ = 1 if and only if Y = aX + b with a > 0 with probability 1.
3. ρ = −1 if and only if Y = aX + b with a < 0 with probability 1.
on account of the
Cauchy-Schwarz Inequality: (E(XY))² ≤ E(X²)E(Y²) with equality if and only if X = λY,
for some constant λ, with probability 1.
11.7 Conditional Expectation
If the conditional distribution of Y, given that X = x, has density f_{Y|X}(y|x) then the
conditional expectation is:
    E(Y|X = x) = ∫ y f_{Y|X}(y|x) dy
(If X and Y are independent E(Y|X = x) = E(Y).)
The Law of Total Expectations or Law of Iterated Expectations:
    E(E(Y|X)) = ∫_{−∞}^{∞} E(Y|X = x) f_X(x) dx = ∫_{−∞}^{∞} [∫_{−∞}^{∞} y f_{X,Y}(x, y)/f_X(x) dy] f_X(x) dx = E(Y)
Suppose the conditional distribution of consumption at time t given income is:
    C_t | y_t ∼ N(k y_t, σ²)
Then:
    E(C_t | y_t) = k y_t   and hence   E(C_t) = E(E(C_t | y_t)) = k E(y_t)
As in discrete case, conditional expectation has the properties of ordinary expectation
such as
    E(aX + bY | Z) = aE(X|Z) + bE(Y|Z)
and also:
1. E(g(Y)X | Y) = g(Y)E(X|Y).
2. E(E(X|Y, Z)|Y) = E(X|Y) (tower property).
3. E(X|Y) solves the problem min_g E(X − g(Y))².
Also note
    Var(Y) = E(Var(Y|X)) + Var(E(Y|X))
11.8 Using Matrix Notation
X = (X₁, X₂)′ is a random vector. Suppose E(Xᵢ) = μᵢ. We can write the first and
second moments using matrix notation. For the mean:
    E(X) = (E(X₁), E(X₂))′ = (μ₁, μ₂)′ = μ
For the variance:
    Var(X) = E((X − μ)(X − μ)′)
           = E[ (X₁ − μ₁, X₂ − μ₂)′ (X₁ − μ₁, X₂ − μ₂) ]
           = [ Var(X₁)        Cov(X₁, X₂) ]
             [ Cov(X₁, X₂)    Var(X₂)     ]
11.9 The Bivariate Normal Distribution
The density function for the bivariate Normal distribution is:
    f_{X₁,X₂}(x₁, x₂) = f_X(x) = (1/(2π√|Σ|)) e^{−(1/2)(x−μ)′Σ⁻¹(x−μ)}
where:
    Σ = [ σ₁²      ρσ₁σ₂ ]
        [ ρσ₁σ₂    σ₂²   ]
|Σ| is the determinant:
    |Σ| = σ₁²σ₂²(1 − ρ²)
and
    (x−μ)′Σ⁻¹(x−μ) = (1/(1 − ρ²)) [ ((x₁−μ₁)/σ₁)² − 2ρ((x₁−μ₁)/σ₁)((x₂−μ₂)/σ₂) + ((x₂−μ₂)/σ₂)² ].
As in the univariate case there is no analytical form for the distribution function
F_{X₁,X₂}(x₁, x₂).
1. The marginal distributions are X₁ ∼ N(μ₁, σ₁²), X₂ ∼ N(μ₂, σ₂²)
To prove this we have to integrate:
    f_{X₁}(x₁) = ∫_{−∞}^{∞} f_{X₁,X₂}(x₁, x₂) dx₂
Step (i): Substitute z₂ = (x₂ − μ₂)/σ₂, and also write z₁ = (x₁ − μ₁)/σ₁:
    f_{X₁}(x₁) = (σ₂/(2π√|Σ|)) ∫_{−∞}^{∞} exp( −(1/(2(1 − ρ²))) (z₁² − 2ρz₁z₂ + z₂²) ) dz₂
Step (ii): Write z₁² − 2ρz₁z₂ + z₂² = z₁²(1 − ρ²) + (z₂ − ρz₁)²
    f_{X₁}(x₁) = (1/(√(2π)σ₁)) exp(−z₁²/2) ∫_{−∞}^{∞} (1/√(2π(1 − ρ²))) exp( −(z₂ − ρz₁)²/(2(1 − ρ²)) ) dz₂
              = (1/(√(2π)σ₁)) exp( −(x₁ − μ₁)²/(2σ₁²) )
(These steps are often useful in Normal integrals.)
2. Hence E(X₁) = μ₁, Var(X₁) = σ₁², E(X₂) = μ₂, Var(X₂) = σ₂².
3. ρ is the correlation coefficient: ρ = Corr(X₁, X₂).
To prove this, show that:
    Cov(X₁, X₂) = ∫∫ (x₁ − μ₁)(x₂ − μ₂) f_{X₁,X₂}(x₁, x₂) dx₁ dx₂ = ρσ₁σ₂
using the steps above.
4. Hence Σ = [ σ₁², ρσ₁σ₂ ; ρσ₁σ₂, σ₂² ] is the variance-covariance matrix.
In matrix notation: X ∼ N(μ, Σ).
Note that for the Normal distribution, ρ = 0 ⇒ X₁ and X₂ are independent.
Conditional Distribution
If X₁ and X₂ are bivariate Normal random variables, the conditional distributions are
Normal too. Substituting the Normal densities into:
    f_{X₁|X₂}(x₁|x₂) = f_{X₁,X₂}(x₁, x₂)/f_{X₂}(x₂)
and rearranging, we find that this is the density of a Normally-distributed variable:
    X₁|X₂ ∼ N( μ₁ + ρ(σ₁/σ₂)(x₂ − μ₂),  σ₁²(1 − ρ²) ).
12 Multivariate Distributions
12.1 Introduction
We can extend from bivariate to multivariate distributions in the obvious way. If X₁,
X₂, . . . , Xₙ are n jointly continuous random variables, we can write them as a random
vector, X, with distribution and density functions:
    F_X(x₁, x₂, . . . , xₙ) and f_X(x₁, x₂, . . . , xₙ)
We can find marginal densities such as f_{X₁}(x₁) and f_{X₃,X₅}(x₃, x₅), and conditional
densities such as f_X(xₙ | x₁, x₂, . . . , x_{n−1}) and f_X(x₃, x₄, . . . , xₙ | x₁, x₂).
Writing μᵢ = E(Xᵢ) and σᵢⱼ = Cov(Xᵢ, Xⱼ) the first and second moments are:
    μ = E(X) = (E(X₁), E(X₂), . . . , E(Xₙ))′ = (μ₁, μ₂, . . . , μₙ)′
    Σ = Var(X) = E((X − μ)(X − μ)′)
      = [ Var(X₁)        Cov(X₁, X₂)   . .   Cov(X₁, Xₙ) ]
        [ Cov(X₁, X₂)    Var(X₂)       . .   Cov(X₂, Xₙ) ]
        [ .                                  .           ]
        [ Cov(X₁, Xₙ)    .                   Var(Xₙ)     ]
      = (σᵢⱼ)
12.1.1 IID Random Variables
If the random variables X₁, X₂, . . . , Xₙ are independent:
    f_X(x₁, x₂, . . . , xₙ) = f_{X₁}(x₁) f_{X₂}(x₂) . . . f_{Xₙ}(xₙ) = ∏ᵢ f_{Xᵢ}(xᵢ)
If, in addition, they all have the same marginal distribution:
    f_X(x₁, x₂, . . . , xₙ) = ∏ᵢ f_X(xᵢ)
and we say that they are independent identically-distributed (IID) random variables.
12.1.2 Linear Functions
Remember: if E(X) = μ and Var(X) = σ², then E(aX + b) = aμ + b and Var(aX + b) = a²σ².
Generalising to the multivariate case, if E(X) = μ and Var(X) = Σ:
1. E(a′X) = a′μ and Var(a′X) = a′Σa
2. E(AX) = Aμ and Var(AX) = AΣA′
12.2 Multivariate Normal Distribution
If X is a Normal random vector with values in Rⁿ, mean μ, and variance-covariance
matrix Σ = (σᵢⱼ), the joint density function is:
    f_X(x) = (1/√((2π)ⁿ|Σ|)) e^{−(1/2)(x−μ)′Σ⁻¹(x−μ)}.
The marginal distributions are Normal: Xᵢ ∼ N(μᵢ, σᵢᵢ). If Σ is a diagonal matrix, the
Xᵢ are independent.
If X has a multivariate Normal distribution then a′X has a univariate Normal dis-
tribution for any n × 1 vector a.
12.2.1 Conditional Distribution
If
    (X₁, X₂)′ ∼ N( (μ₁, μ₂)′, [ Σ₁₁ Σ₁₂ ; Σ₂₁ Σ₂₂ ] )
Then
    X₁|X₂ ∼ N( μ₁ + Σ₁₂Σ₂₂⁻¹(X₂ − μ₂),  Σ₁₁ − Σ₁₂Σ₂₂⁻¹Σ₂₁ )
12.3 The Distribution of Functions of Several Random Variables
Suppose we know the joint distribution of X₁, X₂, . . . , Xₙ.
Then if g : Rⁿ → R, Y = g(X₁, X₂, . . . , Xₙ) is a random variable, and we can find its
distribution as in the 1-variable case from:
    P(Y ≤ y) = P(g(X) ≤ y)
12.3.1 Sums of Random Variables: Y = X₁ + X₂
    F_Y(y) = P(X₁ + X₂ ≤ y) = ∫∫_{x₁+x₂ ≤ y} f(x₁, x₂) dx₁ dx₂
           = ∫_{−∞}^{∞} [∫_{−∞}^{y−x₂} f(x₁, x₂) dx₁] dx₂
           = ∫_{−∞}^{∞} [∫_{−∞}^{y} f(u − x₂, x₂) du] dx₂   (substituting u = x₁ + x₂)
           = ∫_{−∞}^{y} [∫_{−∞}^{∞} f(u − x₂, x₂) dx₂] du
and hence:
    f_Y(y) = ∫_{−∞}^{∞} f(y − x₂, x₂) dx₂   and also   f_Y(y) = ∫_{−∞}^{∞} f(x₁, y − x₁) dx₁
Suppose X₁ and X₂ are independent standard Normal random variables.
    f_Y(y) = ∫_{−∞}^{∞} φ(y − x)φ(x) dx = (1/(2π)) ∫ exp(−(y − x)²/2) exp(−x²/2) dx
           = (1/(2π)) ∫ exp(−(x − y/2)²) exp(−y²/4) dx
           = (1/√(4π)) exp(−y²/4) ∫ (1/√π) exp(−(x − y/2)²) dx = (1/√(4π)) exp(−y²/4)
So Y ∼ N(0, 2)
12.3.2 Linear Functions of Normal Random Variables
If X ∼ N(μ, Σ): (i) a′X ∼ N(a′μ, a′Σa) and (ii) AX ∼ N(Aμ, AΣA′)
In particular, if the variables are independent:
    a′X ∼ N( Σᵢ₌₁ⁿ aᵢμᵢ,  Σᵢ₌₁ⁿ aᵢ²σᵢ² )
(ii) can be expressed in words as saying that linear functions of jointly Normal random
variables are themselves jointly Normal.
12.3.3 The Maximum of a Set of Independent Random Variables: Y = max(X₁, X₂, . . . , Xₙ)
    F_Y(y) = P(max(X₁, X₂, . . . , Xₙ) ≤ y) = P(X₁ ≤ y, X₂ ≤ y, . . . , Xₙ ≤ y)
           = P(X₁ ≤ y)P(X₂ ≤ y) . . . P(Xₙ ≤ y)
           = F_{X₁}(y)F_{X₂}(y) . . . F_{Xₙ}(y)
If they are IID: F_Y(y) = F_X(y)ⁿ
Suppose that in an auction with two bidders, the bids are independently uniformly
distributed on [0, 1]:
    b₁ ∼ U(0, 1), b₂ ∼ U(0, 1)
What is the expected winning bid?
    y = max(b₁, b₂)  ⇒  F(y) = y² and f(y) = 2y (0 ≤ y ≤ 1)
    E(y) = ∫₀¹ 2y² dy = 2/3
One can also generalise the results of 9.7 to consider joint distributions of transformations of random
variables, but we will not consider this here (see Ross 6.7).
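The auction example can be verified by simulation (a Python sketch, not part of the notes):

```python
import random
import statistics

# Expected winning bid with two independent U(0,1) bidders: E(max) = 2/3.
random.seed(7)
n = 200_000
wins = [max(random.random(), random.random()) for _ in range(n)]
avg_winning_bid = statistics.fmean(wins)
```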
12.4 Distributions Related to the Normal
12.4.1 The chi-square distribution
Suppose Z₁, Z₂, . . . , Zₙ are IID standard Normal: Zᵢ ∼ IN(0, 1). Then
    Y = Σᵢ₌₁ⁿ Zᵢ²
has a chi-square distribution with n degrees of freedom: Y ∼ χ²(n).
The pdf for χ²(n) is f(y; n) = (1/(2^{n/2}Γ(n/2))) y^{n/2 − 1} e^{−y/2}  (y ≥ 0). [See earlier]
12.4.2 The t-distribution
If X ∼ N(μ, σ²) and Y ∼ χ²(k) are independent:
    t = ((X − μ)/σ)/√(Y/k) ∼ t(k)
(Looks like a flattened Normal distribution; very close to standard Normal when k is large.)
The t-distribution with k degrees of freedom has pdf
    f(t; k) = (Γ((k+1)/2)/Γ(k/2)) (1/(πk)^{1/2}) (1/(1 + t²/k)^{(k+1)/2}),   −∞ < t < ∞
When k = 1 this is the Cauchy, so no mean. In general if k degrees of freedom, only k − 1 moments exist.
12.4.3 The F-distribution
If Y₁ ∼ χ²(n₁) and Y₂ ∼ χ²(n₂) are independent:
    W = (Y₁/n₁)/(Y₂/n₂) ∼ F(n₁, n₂)   (W ≥ 0)
The F-distribution with n₁ and n₂ degrees of freedom has pdf
    f_F(x) = (Γ((n₁+n₂)/2)/(Γ(n₁/2)Γ(n₂/2))) (n₁/n₂)^{n₁/2} x^{(n₁/2)−1} [1 + (n₁/n₂)x]^{−(n₁+n₂)/2},   0 < x < ∞
Note that t² ∼ F(1, k).
The t and F probability densities can be derived by using the results of Ross 6.7 on joint distributions
of transformations of random variables, but their explicit form is not usually useful.
13 Moment Generating Functions
Ross 7
13.1 Introduction
For a random variable X its moment-generating function is defined to be
    M_X(t) = E(e^{tX})
This clearly exists for t = 0 but may not exist if t ≠ 0. One can show that if M_X exists in a
neighbourhood of zero, it corresponds to a unique distribution.
Examples
1. Note that the pgf G_X(s) of a discrete variable is M_X(ln s) or equivalently M_X(t) =
G_X(e^t), so for example for the Bernoulli:
    M_X(t) = q + pe^t
2. Exponential, X ∼ E(λ)
    M_X(t) = ∫₀^∞ e^{tx} λe^{−λx} dx = λ ∫₀^∞ e^{−(λ−t)x} dx = λ/(λ − t),   t < λ
3. Standard Normal, X ∼ N(0, 1),
    M_X(t) = (1/√(2π)) ∫_{−∞}^{∞} exp(tx) exp(−x²/2) dx
           = (1/√(2π)) ∫_{−∞}^{∞} exp(−(x − t)²/2) exp(t²/2) dx   (completing the square)
           = exp(t²/2) ∫_{−∞}^{∞} (1/√(2π)) exp(−(x − t)²/2) dx
           = exp(t²/2)
Note that
    M_{aX+b}(t) = E(e^{t(aX+b)}) = e^{bt} E(e^{taX}) = e^{bt} M_X(at)
Example Let X ∼ N(μ, σ²); then X = μ + σZ where Z ∼ N(0, 1). Hence
    M_X(t) = e^{μt} M_Z(σt) = e^{μt} e^{σ²t²/2} = exp(μt + σ²t²/2)
The reason for the moment-generating function's name:
    M(t) = E(e^{tX}) = E(1 + tX + (t²/2!)X² + . . . + (tⁿ/n!)Xⁿ + . . .)
         = 1 + tE(X) + (t²/2!)E(X²) + . . . + (tⁿ/n!)E(Xⁿ) + . . .
Can be used to calculate moments:
    (dⁿ/dtⁿ)M(t) = E(Xⁿ e^{tX})
Hence
    E(X) = M⁽¹⁾(0)
    E(X²) = M⁽²⁾(0)
    E(X³) = M⁽³⁾(0)
    E(X⁴) = M⁽⁴⁾(0)
etc.
Example: Moments of Standard Normal (X ∼ N(0, 1)). M(t) = exp(t²/2). Hence
    M⁽¹⁾(t) = t exp(t²/2)
    M⁽²⁾(t) = (1 + t²) exp(t²/2)
    M⁽³⁾(t) = (3t + t³) exp(t²/2)
    M⁽⁴⁾(t) = (3 + 6t² + t⁴) exp(t²/2)
    E(X) = 0    E(X²) = 1
    E(X³) = 0   E(X⁴) = 3
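The same moments drop out of finite-difference derivatives of M(t) = exp(t²/2) at t = 0 (a Python sketch; the step size h is an illustrative choice):

```python
import math

def M(t):
    # mgf of the standard Normal
    return math.exp(t * t / 2.0)

h = 1e-2
m1 = (M(h) - M(-h)) / (2 * h)                                   # E(X)  = 0
m2 = (M(h) - 2 * M(0) + M(-h)) / h**2                           # E(X^2) = 1
m4 = (M(2*h) - 4*M(h) + 6*M(0) - 4*M(-h) + M(-2*h)) / h**4      # E(X^4) = 3
```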
13.2 Sums of Random Variables
As with pgfs, mgfs can be used to find the distribution of sums of random variables.
If X and Y are independent random variables
    M_{X+Y}(t) = E(e^{t(X+Y)}) = E(e^{tX} e^{tY}) = E(e^{tX})E(e^{tY}) = M_X(t)M_Y(t)
Example If X₁, . . . , Xₙ are independent N(0, 1) random variables then
    M_{X₁+...+Xₙ}(t) = (exp(t²/2))ⁿ = exp(n t²/2)
Hence X₁ + . . . + Xₙ has a N(0, n) distribution.
mgfs are also useful when the number of summands is random.
Suppose N and X₁, . . . , are independent random variables, with N taking non-
negative integer values. If the X₁, . . . are identically distributed, each having mgf
M(t), and N has pgf G_N(s) then
    M_{X₁+...+X_N}(t) = E(e^{t(X₁+...+X_N)}) = E(E(e^{t(X₁+...+X_N)} | N)) = E(M(t)^N) = G_N(M(t))
The fact that M_X(t) does not always exist for t ≠ 0 is inconvenient. More advanced treatments use
instead the characteristic function E(exp(itX)), which always exists. We shall, however, not do so.
14 Inequalities
Ross 7
14.1 Jensen's Inequality
If g(.) is a concave function: E(g(X)) ≤ g(E(X))
Proof: If g is a concave function then for any point a there exists λ such that
    g(x) ≤ g(a) + λ(x − a)
(Picture: the graph of a concave function lies below any such line through (a, g(a)).)
If g is differentiable at a, λ equals g′(a).
Take a = μ = E(X); then
    g(x) ≤ g(μ) + λ(x − μ)
Taking expectations
    E g(X) ≤ g(E(X))
If X is a random variable with mean μ, and Y = ln(X), then E(Y) ≤ ln(μ)
Consider a lottery in which you win a monetary prize yᵢ with probability pᵢ
(i = 1, 2, . . . , N). This is a finite discrete probability distribution with pmf:
    f(yᵢ) = pᵢ, i = 1, 2, . . . , N   (Σᵢ pᵢ = 1)
The expected value of the lottery is:
    ȳ = Σᵢ yᵢ pᵢ
If your utility function is u(y), the expected utility of the lottery is:
    Eu = Σᵢ u(yᵢ) pᵢ
By Jensen's inequality, if the utility function is concave: Eu ≤ u(ȳ)
14.2 Chebyshev's Inequality
If g(.) is a non-negative function, and r > 0:
    P(g(X) ≥ r) ≤ E(g(X))/r
Proof: Consider the function I with
    I(x) = 1 if g(x) ≥ r,  and 0 otherwise
Since g is non-negative, g(x) ≥ rI(x) for all x, hence
    E(g(X)) ≥ rE(I(X)) = rP(g(X) ≥ r)
Chebyshev's Inequality If X has mean μ and finite variance then for k > 0
    P(|X − μ| ≥ k) ≤ Var(X)/k²
Apply the result above to g(x) = (x − μ)² and r = k².
Chebyshev's Inequality is general, but not always very informative for individual
distributions:
If Z ∼ N(0, 1), Φ(1.96) = 0.975 ⇒ P(|Z| ≥ 1.96) = 0.05.
Chebyshev's Inequality only tells us: P(|Z| ≥ 1.96) ≤ Var(Z)/1.96² ≈ 0.26
Markov's Inequality If X is a random variable which takes only non-negative values,
then for r > 0
    P(X ≥ r) ≤ E(X)/r
Just the first result with g the identity.
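The comparison for the Normal can be computed directly, again using Φ(z) = (1 + erf(z/√2))/2 (a Python sketch):

```python
from math import erf, sqrt

def norm_cdf(z):
    # Standard Normal cdf via the error function
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

k = 1.96
exact_tail = 2.0 * (1.0 - norm_cdf(k))   # P(|Z| >= 1.96) = 0.05
chebyshev_bound = 1.0 / k**2             # Var(Z)/k^2, about 0.26
```

The exact tail probability 0.05 is far below the Chebyshev bound of roughly 0.26, illustrating how loose the bound can be.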
15 The Law of Large Numbers and the Central Limit Theorem
Ross 7
15.1 The Law of Large Numbers
The sequence of random variables {Xₙ} is said to converge in probability to X if for all
ε > 0
    P(|Xₙ − X| > ε) → 0
This is sometimes written as plim Xₙ = X or Xₙ →ᵖ X.
The Weak Law of Large Numbers Let X₁, X₂, . . . , be independent and identically dis-
tributed with mean μ; then
    X̄ₙ = (X₁ + . . . + Xₙ)/n converges to μ in probability
In the case where the Xₙ have a second moment, this can be proved using Chebyshev's
Inequality.
    E(X̄ₙ) = μ
and
    Var(X̄ₙ) = Var((X₁ + . . . + Xₙ)/n) = (1/n²)(Var(X₁) + . . . + Var(Xₙ)) = σ²/n
So by Chebyshev
    P(|X̄ₙ − μ| > ε) ≤ Var(X̄ₙ)/ε² = σ²/(nε²) → 0
The Strong Law of Large Numbers claims that with probability 1 X̄ₙ converges to μ. This is a stronger
assertion than the Weak Law. The Weak Law is usually sufficient for statistical applications.
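The Weak Law is easy to see numerically (a Python sketch; the E(1) distribution, with μ = 1, is an illustrative choice):

```python
import random

# Sample means of IID Exponential(1) draws (mu = 1) for increasing n.
random.seed(8)

def sample_mean(n):
    return sum(random.expovariate(1.0) for _ in range(n)) / n

errors = {n: abs(sample_mean(n) - 1.0) for n in (100, 10_000, 200_000)}
```

By n = 200,000 the sample mean is within a couple of hundredths of μ, in line with the σ²/(nε²) bound above.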
15.2 The Central Limit Theorem
The Central Limit Theorem Let X₁, X₂, . . . , be independent and identically dis-
tributed with mean μ and variance σ²; then for any x
    P( (X̄ₙ − μ)/(σ/√n) ≤ x ) → ∫_{−∞}^{x} (1/√(2π)) exp(−t²/2) dt
Idea of Proof. Replace Xᵢ − μ by Yᵢ. E(Yᵢ) = 0 and E(Yᵢ²) = σ². It is enough to show that
Ȳₙ/(σ/√n) converges to a standard Normal.
    M_{(Y₁+...+Yₙ)/(σ√n)}(t) = M_{Y₁+...+Yₙ}(t/(σ√n)) = (M_{Y₁}(t/(σ√n)))ⁿ   (Independence)
    = (1 + E(Y²) t²/(2σ²n) + terms of smaller order than 1/n)ⁿ → exp(t²/2)
So the moment-generating function of Ȳₙ/(σ/√n) converges to that of a standard Normal.
The proof in the general case is along these lines but uses characteristic functions.
The Law of Large Numbers and Central Limit Theorem can be generalised to relax somewhat the
assumptions of independence and identical distributions.
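A simulation sketch of the theorem (Python; n = 48 summands per mean and U(0, 1) draws are illustrative choices):

```python
import math
import random

# Standardised sample means of U(0,1) draws; the fraction below x = 1
# should approach Phi(1), about 0.8413.
random.seed(9)
n, reps = 48, 20_000
mu, sigma = 0.5, math.sqrt(1.0 / 12.0)
zs = [(sum(random.random() for _ in range(n)) / n - mu) / (sigma / math.sqrt(n))
      for _ in range(reps)]
frac_below_one = sum(z <= 1.0 for z in zs) / reps
```

Even at n = 48 the standardised means track the standard Normal cdf closely, as the theorem predicts.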