
EC509 Probability and Statistics

Slides 3

Bilkent

This Version: 4 Nov 2013



Preliminaries

In this part of the lectures, we will discuss multiple random variables.


This mostly involves extending the concepts we have learned so far to the case of
more than one variable. This might sound straightforward but, as always, we have to
be careful.
This part is heavily based on Casella & Berger, Chapter 4.



Joint and Marginal Distribution

So far, our interest has been on events involving a single random variable only. In
other words, we have only considered “univariate models.”
Multivariate models, on the other hand, involve more than one variable.
Consider an experiment about health characteristics of the population. Would we be
interested in one characteristic only, say weight? Not really. There are many
important characteristics.
Definition (4.1.1): An n-dimensional random vector is a function from a sample
space Ω into R^n, n-dimensional Euclidean space.
Suppose, for example, that with each point in a sample space we associate an
ordered pair of numbers, that is, a point (x, y) ∈ R^2, where R^2 denotes the plane.
Then, we have defined a two-dimensional (or bivariate) random vector (X, Y).



Joint and Marginal Distribution

Example (4.1.2): Consider the experiment of tossing two fair dice. The sample
space has 36 equally likely points. For example:

(3, 3): both dice show a 3,
(4, 1): first die shows a 4 and the second die a 1,
etc.

Now, let

X = sum of the two dice  and  Y = |difference of the two dice|.

Then,

(3, 3): X = 6 and Y = 0,
(4, 1): X = 5 and Y = 3,

and so we can define the bivariate random vector (X, Y) thus.



Joint and Marginal Distribution

What is, say, P(X = 5 and Y = 3)? One can verify that the two relevant sample
points in Ω are (4, 1) and (1, 4). Therefore, the event {X = 5 and Y = 3} will only
occur if the event {(4, 1), (1, 4)} occurs. Since each of these sample points in Ω is
equally likely,

P({(4, 1), (1, 4)}) = 2/36 = 1/18.

Thus,

P(X = 5 and Y = 3) = 1/18.

For example, can you see why

P(X = 7, Y ≤ 4) = 1/9?

This is because the only sample points that yield this event are (4, 3), (3, 4), (5, 2)
and (2, 5).
Note that from now on we will use P(event a, event b) rather than P(event a and
event b).



Joint and Marginal Distribution

Definition (4.1.3): Let (X, Y) be a discrete bivariate random vector. Then the
function f(x, y) from R^2 into R, defined by f(x, y) = P(X = x, Y = y), is called
the joint probability mass function or joint pmf of (X, Y). If it is necessary to stress
the fact that f is the joint pmf of the vector (X, Y) rather than some other vector, the
notation f_{X,Y}(x, y) will be used.



Joint and Marginal Distribution

The joint pmf of (X, Y) completely defines the probability distribution of the
random vector (X, Y), just as the pmf does in the univariate discrete case.
The value of f(x, y) for each of the 21 possible values is given in the following table.

  x:      2     3     4     5     6     7     8     9    10    11    12
  y=0:  1/36        1/36        1/36        1/36        1/36        1/36
  y=1:        1/18        1/18        1/18        1/18        1/18
  y=2:              1/18        1/18        1/18        1/18
  y=3:                    1/18        1/18        1/18
  y=4:                          1/18        1/18
  y=5:                                1/18

Table: Probability table for Example (4.1.2)

The joint pmf is defined for all (x, y) ∈ R^2, not just the 21 pairs in the above table.
For any other (x, y), f(x, y) = P(X = x, Y = y) = 0.
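As a minimal Python sketch (not part of the slides; the enumeration and variable names are mine), the joint pmf above can be built by brute force over the 36 equally likely outcomes and checked against the table:

```python
# Build the joint pmf of X = sum and Y = |difference| for two fair dice.
from fractions import Fraction
from itertools import product

joint = {}  # (x, y) -> probability
for d1, d2 in product(range(1, 7), repeat=2):
    key = (d1 + d2, abs(d1 - d2))
    joint[key] = joint.get(key, Fraction(0)) + Fraction(1, 36)

print(joint[(5, 3)])        # 1/18, matching P(X = 5, Y = 3)
print(len(joint))           # 21 support points, as in the table
print(sum(joint.values()))  # 1, as any pmf must satisfy
```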



Joint and Marginal Distribution

As before, we can use the joint pmf to calculate the probability of any event defined
in terms of (X, Y). For A ⊂ R^2,

P((X, Y) ∈ A) = Σ_{(x,y)∈A} f(x, y).

We could, for example, have A = {(x, y): x = 7 and y ≤ 4}. Then,

P((X, Y) ∈ A) = P(X = 7, Y ≤ 4) = f(7, 1) + f(7, 3) = 1/18 + 1/18 = 1/9.

Expectations are also dealt with in the same way as before. Let g(x, y) be a
real-valued function defined for all possible values (x, y) of the discrete random
vector (X, Y). Then, g(X, Y) is itself a random variable and its expected value is

E[g(X, Y)] = Σ_{(x,y)∈R^2} g(x, y) f(x, y).



Joint and Marginal Distribution

Example (4.1.4): For the (X, Y) whose joint pmf is given in the above table, what
is the expected value of XY? Letting g(x, y) = xy, we have

E[XY] = (2)(0)(1/36) + (4)(0)(1/36) + ... + (8)(4)(1/18) + (7)(5)(1/18) = 13 11/18.

As before,

E[a g1(X, Y) + b g2(X, Y) + c] = a E[g1(X, Y)] + b E[g2(X, Y)] + c.

One very useful result is that any nonnegative function from R^2 into R that is
nonzero for at most a countable number of (x, y) pairs and sums to 1 is the joint
pmf for some bivariate discrete random vector (X, Y).



Joint and Marginal Distribution

Example (4.1.5): Define f(x, y) by

f(0, 0) = f(0, 1) = 1/6,
f(1, 0) = f(1, 1) = 1/3,
f(x, y) = 0 for any other (x, y).

Then, f(x, y) is nonnegative and sums up to one, so f(x, y) is the joint pmf for
some bivariate random vector (X, Y).



Joint and Marginal Distribution

Suppose we have a multivariate random variable (X, Y) but are concerned with, say,
P(X = 2) only.
We know the joint pmf f_{X,Y}(x, y) but we need f_X(x) in this case.
Theorem (4.1.6): Let (X, Y) be a discrete bivariate random vector with joint pmf
f_{X,Y}(x, y). Then the marginal pmfs of X and Y, f_X(x) = P(X = x) and
f_Y(y) = P(Y = y), are given by

f_X(x) = Σ_{y∈R} f_{X,Y}(x, y)  and  f_Y(y) = Σ_{x∈R} f_{X,Y}(x, y).

Proof: For any x ∈ R, let A_x = {(x, y): −∞ < y < ∞}. That is, A_x is the line in
the plane with first coordinate equal to x. Then, for any x ∈ R,

f_X(x) = P(X = x) = P(X = x, −∞ < Y < ∞)
       = P((X, Y) ∈ A_x) = Σ_{(x,y)∈A_x} f_{X,Y}(x, y)
       = Σ_{y∈R} f_{X,Y}(x, y).



Joint and Marginal Distribution

Example (4.1.7): Now we can compute the marginal distributions of X and Y from
the joint distribution given in the above table. For instance,

f_Y(0) = f_{X,Y}(2, 0) + f_{X,Y}(4, 0) + f_{X,Y}(6, 0)
         + f_{X,Y}(8, 0) + f_{X,Y}(10, 0) + f_{X,Y}(12, 0) = 1/6.

As an exercise, you can check that

f_Y(1) = 5/18, f_Y(2) = 2/9, f_Y(3) = 1/6, f_Y(4) = 1/9, f_Y(5) = 1/18.

Notice that Σ_{y=0}^{5} f_Y(y) = 1, as expected, since these are the only six possible values
of Y.
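A small Python sketch (mine, purely illustrative) that recovers these marginals by summing the joint pmf over x, exactly as Theorem (4.1.6) prescribes:

```python
# Marginal pmf of Y obtained by summing the dice joint pmf over x.
from fractions import Fraction
from itertools import product

joint = {}
for d1, d2 in product(range(1, 7), repeat=2):
    key = (d1 + d2, abs(d1 - d2))
    joint[key] = joint.get(key, Fraction(0)) + Fraction(1, 36)

f_Y = {}
for (x, y), p in joint.items():
    f_Y[y] = f_Y.get(y, Fraction(0)) + p  # f_Y(y) = sum over x of f(x, y)

print(f_Y)                # {0: 1/6, 1: 5/18, 2: 2/9, 3: 1/6, 4: 1/9, 5: 1/18}
print(sum(f_Y.values()))  # 1
```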



Joint and Marginal Distribution

Now, it is crucial to understand that the marginal distributions of X and Y, described
by the marginal pmfs f_X(x) and f_Y(y), do not completely describe the joint
distribution of X and Y.
There are, in fact, many different joint distributions that have the same marginal
distributions.
Knowledge of the marginal distributions alone does not allow us to determine the
joint distribution (except under certain assumptions).



Joint and Marginal Distribution
Example (4.1.9): Define a joint pmf by

f(0, 0) = 1/12, f(1, 0) = 5/12, f(0, 1) = f(1, 1) = 3/12,
f(x, y) = 0 for all other values.

Then,

f_Y(0) = f(0, 0) + f(1, 0) = 1/2,
f_Y(1) = f(0, 1) + f(1, 1) = 1/2,
f_X(0) = f(0, 0) + f(0, 1) = 1/3,
and f_X(1) = f(1, 0) + f(1, 1) = 2/3.

Now consider the marginal pmfs for the distribution considered in Example (4.1.5):

f_Y(0) = f(0, 0) + f(1, 0) = 1/6 + 1/3 = 1/2,
f_Y(1) = f(0, 1) + f(1, 1) = 1/6 + 1/3 = 1/2,
f_X(0) = f(0, 0) + f(0, 1) = 1/6 + 1/6 = 1/3,
and f_X(1) = f(1, 0) + f(1, 1) = 1/3 + 1/3 = 2/3.

We have the same marginal pmfs but the joint distributions are different!



Joint and Marginal Distribution

Consider now the corresponding definition for continuous random variables.
Definition (4.1.10): A function f(x, y) from R^2 to R is called a joint probability
density function or joint pdf of the continuous bivariate random vector (X, Y) if, for
every A ⊂ R^2,

P((X, Y) ∈ A) = ∫∫_A f(x, y) dx dy.

The notation ∫∫_A means that the integral is evaluated over all (x, y) ∈ A.
Naturally, for real-valued functions g(x, y),

E[g(X, Y)] = ∫_{−∞}^{∞} ∫_{−∞}^{∞} g(x, y) f(x, y) dx dy.

It is important to realise that the joint pdf is defined for all (x, y) ∈ R^2. The pdf
may equal 0 on a large set A if P((X, Y) ∈ A) = 0, but the pdf is still defined for
the points in A.



Joint and Marginal Distribution

Again, naturally,

f_X(x) = ∫_{−∞}^{∞} f(x, y) dy,  −∞ < x < ∞,
f_Y(y) = ∫_{−∞}^{∞} f(x, y) dx,  −∞ < y < ∞.

As before, a useful result is that any function f(x, y) satisfying f(x, y) ≥ 0 for all
(x, y) ∈ R^2 and

1 = ∫_{−∞}^{∞} ∫_{−∞}^{∞} f(x, y) dx dy

is the joint pdf of some continuous bivariate random vector (X, Y).



Joint and Marginal Distribution

Example (4.1.11): Let

f(x, y) = 6xy^2  if 0 < x < 1 and 0 < y < 1,  and  f(x, y) = 0 otherwise.

Is f(x, y) a joint pdf? Clearly, f(x, y) ≥ 0 for all (x, y) in the defined range.
Moreover,

∫_{−∞}^{∞} ∫_{−∞}^{∞} f(x, y) dx dy = ∫_0^1 ∫_0^1 6xy^2 dx dy = ∫_0^1 3x^2 y^2 |_{x=0}^{1} dy
                                   = ∫_0^1 3y^2 dy = y^3 |_{y=0}^{1} = 1.



Joint and Marginal Distribution

Now, consider calculating P(X + Y ≥ 1). Let A = {(x, y): x + y ≥ 1}, in which
case we are interested in calculating P((X, Y) ∈ A). Now, f(x, y) = 0 everywhere
except where 0 < x < 1 and 0 < y < 1. Then, the region we have in mind is
bounded by the lines

x = 1,  y = 1  and  x + y = 1.

Therefore,

A = {(x, y): x + y ≥ 1, 0 < x < 1, 0 < y < 1}
  = {(x, y): x ≥ 1 − y, 0 < x < 1, 0 < y < 1}
  = {(x, y): 1 − y ≤ x < 1, 0 < y < 1}.



Joint and Marginal Distribution
Hence,

P(X + Y ≥ 1) = ∫∫_A f(x, y) dx dy = ∫_0^1 ∫_{1−y}^{1} 6xy^2 dx dy
             = ∫_0^1 3x^2 y^2 |_{x=1−y}^{1} dy = ∫_0^1 [3y^2 − 3(1 − y)^2 y^2] dy
             = ∫_0^1 (6y^3 − 3y^4) dy = [(6/4) y^4 − (3/5) y^5]_{y=0}^{1}
             = 3/2 − 3/5 = 9/10.

Moreover,

f_X(x) = ∫_{−∞}^{∞} f(x, y) dy = ∫_0^1 6xy^2 dy = 2xy^3 |_{y=0}^{1} = 2x.

Then, for example,

P(1/2 < X < 3/4) = ∫_{1/2}^{3/4} 2x dx = 5/16.
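A minimal numerical check of this example (my own sketch, not part of the slides) using SciPy's double integration:

```python
# Verify Example (4.1.11): f integrates to 1 and P(X + Y >= 1) = 9/10.
from scipy import integrate

f = lambda y, x: 6 * x * y**2  # joint pdf on 0 < x < 1, 0 < y < 1

total, _ = integrate.dblquad(f, 0, 1, lambda x: 0, lambda x: 1)
print(total)   # ~1.0: f is a valid joint pdf

# For each x in (0, 1), integrate y over [1 - x, 1] to get P(X + Y >= 1).
prob, _ = integrate.dblquad(f, 0, 1, lambda x: 1 - x, lambda x: 1)
print(prob)    # ~0.9 = 9/10
```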



Joint and Marginal Distribution

The joint probability distribution of (X, Y) can be completely described with the
joint cdf (cumulative distribution function) rather than with the joint pmf or joint
pdf.
The joint cdf is the function F(x, y) defined by

F(x, y) = P(X ≤ x, Y ≤ y)  for all (x, y) ∈ R^2.

Although for discrete random vectors it might not be convenient to use the joint cdf,
for continuous random variables the following relationship makes the joint cdf very
useful:

F(x, y) = ∫_{−∞}^{x} ∫_{−∞}^{y} f(s, t) dt ds.

From the bivariate Fundamental Theorem of Calculus,

∂^2 F(x, y) / (∂x ∂y) = f(x, y)

at continuity points of f(x, y). This relationship is very important.



Conditional Distributions and Independence

We have talked a little bit about conditional probabilities before. Now we will
consider conditional distributions.
The idea is the same. If we have some extra information about the sample, we can
use that information to make better inference.
Suppose we are sampling from a population where X is the weight (in kgs) and Y is
the height (in cms). What is P(X > 95)? Would we have a better/more relevant
answer if we knew that the person in question has Y = 202 cms? Presumably,
P(X > 95 | Y = 202) is much larger than P(X > 95 | Y = 165).
Once we have the joint distribution for (X, Y), we can calculate the conditional
distributions as well.
Notice that now we have three distribution concepts: marginal distribution,
conditional distribution and joint distribution.



Conditional Distributions and Independence

Definition (4.2.1): Let (X, Y) be a discrete bivariate random vector with joint pmf
f(x, y) and marginal pmfs f_X(x) and f_Y(y). For any x such that
P(X = x) = f_X(x) > 0, the conditional pmf of Y given that X = x is the function
of y denoted by f(y|x) and defined by

f(y|x) = P(Y = y | X = x) = f(x, y) / f_X(x).

For any y such that P(Y = y) = f_Y(y) > 0, the conditional pmf of X given that
Y = y is the function of x denoted by f(x|y) and defined by

f(x|y) = P(X = x | Y = y) = f(x, y) / f_Y(y).

Can we verify that, say, f(y|x) is a pmf? First, since f(x, y) ≥ 0 and f_X(x) > 0,
f(y|x) ≥ 0 for every y. Then,

Σ_y f(y|x) = Σ_y f(x, y) / f_X(x) = f_X(x) / f_X(x) = 1.



Conditional Distributions and Independence
Example (4.2.2): Define the joint pmf of (X, Y) by

f(0, 10) = f(0, 20) = 2/18,  f(1, 10) = f(1, 30) = 3/18,
f(1, 20) = 4/18  and  f(2, 30) = 4/18,

while f(x, y) = 0 for all other combinations of (x, y).
Then,

f_X(0) = f(0, 10) + f(0, 20) = 4/18,
f_X(1) = f(1, 10) + f(1, 20) + f(1, 30) = 10/18,
f_X(2) = f(2, 30) = 4/18.

Moreover,

f(10|0) = f(0, 10) / f_X(0) = (2/18) / (4/18) = 1/2,
f(20|0) = f(0, 20) / f_X(0) = 1/2.

Therefore, given the knowledge that X = 0, Y is equal to either 10 or 20, with equal
probability.
Conditional Distributions and Independence

In addition,

f(10|1) = f(30|1) = (3/18) / (10/18) = 3/10,
f(20|1) = (4/18) / (10/18) = 4/10,
f(30|2) = (4/18) / (4/18) = 1.

Interestingly, when X = 2, we know for sure that Y will be equal to 30.
Finally,

P(Y > 10 | X = 1) = f(20|1) + f(30|1) = 7/10,
P(Y > 10 | X = 0) = f(20|0) = 1/2,
etc.
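The same conditional pmfs can be obtained mechanically by dividing the joint pmf by the relevant marginal. A short Python sketch (illustrative only; the dictionaries and names are mine):

```python
# Conditional pmfs of Example (4.2.2): f(y|x) = f(x, y) / f_X(x).
from fractions import Fraction as F

joint = {(0, 10): F(2, 18), (0, 20): F(2, 18), (1, 10): F(3, 18),
         (1, 30): F(3, 18), (1, 20): F(4, 18), (2, 30): F(4, 18)}

f_X = {}
for (x, y), p in joint.items():
    f_X[x] = f_X.get(x, F(0)) + p

cond = {(y, x): p / f_X[x] for (x, y), p in joint.items()}  # f(y|x)
print(cond[(10, 0)], cond[(20, 0)])    # 1/2 1/2
print(cond[(30, 2)])                   # 1
print(cond[(20, 1)] + cond[(30, 1)])   # 7/10 = P(Y > 10 | X = 1)
```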



Conditional Distributions and Independence

The analogous definition for continuous random variables is given next.
Definition (4.2.3): Let (X, Y) be a continuous bivariate random vector with joint
pdf f(x, y) and marginal pdfs f_X(x) and f_Y(y). For any x such that f_X(x) > 0, the
conditional pdf of Y given that X = x is the function of y denoted by f(y|x) and
defined by

f(y|x) = f(x, y) / f_X(x).

For any y such that f_Y(y) > 0, the conditional pdf of X given that Y = y is the
function of x denoted by f(x|y) and defined by

f(x|y) = f(x, y) / f_Y(y).

Note that for discrete random variables, P(X = x) = f_X(x) and P(X = x, Y = y)
= f(x, y). Then Definition (4.2.1) is actually parallel to the definition of
P(Y = y | X = x) in Definition (1.3.2). The same interpretation is not valid for
continuous random variables since P(X = x) = 0 for every x. However, replacing
pmfs with pdfs leads to Definition (4.2.3).



Conditional Distributions and Independence

The conditional expected value of g(Y) given X = x is given by

E[g(Y)|x] = Σ_y g(y) f(y|x)  and  E[g(Y)|x] = ∫_{−∞}^{∞} g(y) f(y|x) dy,

in the discrete and continuous cases, respectively.
The conditional expected value has all of the properties of the usual expected value
listed in Theorem 2.2.5.



Conditional Distributions and Independence
Example (2.2.6): Suppose we measure the distance between a random variable X
and a constant b by (X − b)^2. What is

β = arg min_b E[(X − b)^2],

i.e. the best predictor of X in the mean-squared sense?
Now,

E[(X − b)^2] = E[{X − E[X] + E[X] − b}^2]
             = E[({X − E[X]} + {E[X] − b})^2]
             = E[{X − E[X]}^2] + {E[X] − b}^2 + 2 E[{X − E[X]}{E[X] − b}],

where

E[{X − E[X]}{E[X] − b}] = {E[X] − b} E[{X − E[X]}] = 0.

Then,

E[(X − b)^2] = E[{X − E[X]}^2] + {E[X] − b}^2.



Conditional Distributions and Independence

We have no control over the first term, but the second term is always nonnegative, and
so it is minimised by setting b = E[X]. Therefore, b = E[X] is the value that
minimises the prediction error and is the best predictor of X.
In other words, the expectation of a random variable is its best predictor.
Can you show this for the conditional expectation, as well? See Exercise 4.13, which
you will be asked to solve as homework.


Conditional Distributions and Independence
Example (4.2.4): Let (X, Y) have joint pdf f(x, y) = e^{−y}, 0 < x < y < ∞.
Suppose we wish to compute the conditional pdf of Y given X = x.
Now, if x ≤ 0, f(x, y) = 0 for all y, so f_X(x) = 0.
If x > 0, f(x, y) > 0 only if y > x. Thus

f_X(x) = ∫_{−∞}^{∞} f(x, y) dy = ∫_{x}^{∞} e^{−y} dy = e^{−x}.

Then, the marginal distribution of X is the exponential distribution.
Now, likewise, for any x > 0, f(y|x) can be computed as follows:

f(y|x) = f(x, y) / f_X(x) = e^{−y} / e^{−x} = e^{−(y−x)},  if y > x,

and

f(y|x) = f(x, y) / f_X(x) = 0 / e^{−x} = 0,  if y ≤ x.

Thus, given X = x, Y has an exponential distribution, where x is the location
parameter in the distribution of Y and β = 1 is the scale parameter. Notice that the
conditional distribution of Y is different for every value of x. Hence,

E[Y | X = x] = ∫_{x}^{∞} y e^{−(y−x)} dy = 1 + x.


Conditional Distributions and Independence
This is obtained by integration by parts:

∫_{x}^{∞} y e^{−(y−x)} dy = [−y e^{−(y−x)}]_{y=x}^{∞} + ∫_{x}^{∞} e^{−(y−x)} dy
                         = [0 + x e^{−(x−x)}] + [−e^{−(y−x)}]_{y=x}^{∞}
                         = x + 0 + e^{−(x−x)} = x + 1.

What about the conditional variance? Using the standard definition,

Var(Y|x) = E[Y^2|x] − {E[Y|x]}^2.

Hence,

Var(Y|x) = ∫_{x}^{∞} y^2 e^{−(y−x)} dy − ( ∫_{x}^{∞} y e^{−(y−x)} dy )^2 = 1.

Again, you can obtain the first part of this result by integration by parts.
What is the implication of this? One can show that the marginal distribution of Y is
gamma(2, 1) and so Var(Y) = 2. Hence, the knowledge that X = x has reduced the
variability of Y by 50%!
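A Monte Carlo sketch of this example (my own illustration, not from the slides). It uses the facts derived here, that Y is marginally gamma(2, 1) and X | Y = y is uniform on (0, y), so that the product of those densities recovers f(x, y) = e^{−y}:

```python
# Simulate (X, Y) with joint pdf e^{-y} on 0 < x < y and check the moments above.
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000
y = rng.gamma(shape=2.0, scale=1.0, size=n)  # marginal of Y: gamma(2, 1)
x = rng.uniform(0.0, y)                      # X | Y = y is uniform on (0, y)

print(x.mean(), x.var())             # ~1, ~1: X is exponential(1)
print(y.var())                       # ~2: Var(Y) = 2
sel = (x > 0.95) & (x < 1.05)        # condition on X close to x = 1
print(y[sel].mean(), y[sel].var())   # ~2 = 1 + x and ~1, as derived
```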
Conditional Distributions and Independence

The conditional distribution of Y given X = x is possibly a different probability
distribution for each value of x.
Thus, we really have a family of probability distributions for Y, one for each x.
When we wish to describe this entire family, we will use the phrase “the distribution
of Y|X.”
For example, if X ≥ 0 is an integer-valued random variable and the conditional
distribution of Y given X = x is binomial(x, p), then we might say the distribution
of Y|X is binomial(X, p) or write Y|X ~ binomial(X, p).
In addition, E[g(Y)|x] is a function of x, as well. Hence, the conditional
expectation is a random variable whose value depends on the value of X. When we
think of Example (4.2.4), we can then say that

E[Y|X] = 1 + X.


Conditional Distributions and Independence

Yet, in some other cases, the conditional distribution might not depend on the
conditioning variable.
Say, the conditional distribution of Y given X = x is not different for different
values of x.
In other words, knowledge of the value of X does not provide any more information.
This situation is defined as independence.
Definition (4.2.5): Let (X, Y) be a bivariate random vector with joint pdf or pmf
f(x, y) and marginal pdfs or pmfs f_X(x) and f_Y(y). Then X and Y are called
independent random variables if, for every x ∈ R and y ∈ R,

f(x, y) = f_X(x) f_Y(y).     (1)


Conditional Distributions and Independence

Now, in the case of independence, clearly,

f(y|x) = f(x, y) / f_X(x) = f_X(x) f_Y(y) / f_X(x) = f_Y(y).

We can either start with the joint distribution and check independence for each
possible value of x and y, or start with the assumption that X and Y are
independent and model the joint distribution accordingly. In this latter direction, our
economic intuition might have to play an important role.
“Would information on the value of X really increase our information about the
likely value of Y?”


Conditional Distributions and Independence

Example (4.2.6): Consider the discrete bivariate random vector (X, Y), with joint
pmf given by

f(10, 1) = f(20, 1) = f(20, 2) = 1/10,
f(10, 2) = f(10, 3) = 1/5  and  f(20, 3) = 3/10.

The marginal pmfs are then given by

f_X(10) = f_X(20) = .5  and  f_Y(1) = .2, f_Y(2) = .3, f_Y(3) = .5.

Now, for example,

f(10, 3) = 1/5 ≠ (1/2)(1/2) = f_X(10) f_Y(3),

although f(10, 1) = 1/10 = (1/2)(1/5) = f_X(10) f_Y(1).
Do we always have to check all possible pairs, one by one?
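Before the shortcut of Lemma (4.2.7), here is what the brute-force check looks like in a short Python sketch (mine, for illustration):

```python
# Brute-force independence check for Example (4.2.6).
from fractions import Fraction as F

joint = {(10, 1): F(1, 10), (20, 1): F(1, 10), (20, 2): F(1, 10),
         (10, 2): F(1, 5), (10, 3): F(1, 5), (20, 3): F(3, 10)}

xs = {x for x, _ in joint}
ys = {y for _, y in joint}
f_X = {x: sum(p for (a, _), p in joint.items() if a == x) for x in xs}
f_Y = {y: sum(p for (_, b), p in joint.items() if b == y) for y in ys}

independent = all(joint.get((x, y), F(0)) == f_X[x] * f_Y[y]
                  for x in xs for y in ys)
print(independent)   # False: f(10, 3) = 1/5 != (1/2)(1/2)
```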



Conditional Distributions and Independence
Lemma (4.2.7): Let (X, Y) be a bivariate random vector with joint pdf or pmf
f(x, y). Then X and Y are independent random variables if and only if there exist
functions g(x) and h(y) such that, for every x ∈ R and y ∈ R,

f(x, y) = g(x) h(y).

Proof: First, we deal with the “only if” part. Define

g(x) = f_X(x)  and  h(y) = f_Y(y).

Then, by (1),

f(x, y) = f_X(x) f_Y(y) = g(x) h(y).

For the “if” part, suppose f(x, y) = g(x) h(y). Define

∫_{−∞}^{∞} g(x) dx = c  and  ∫_{−∞}^{∞} h(y) dy = d,

where

c d = ( ∫_{−∞}^{∞} g(x) dx )( ∫_{−∞}^{∞} h(y) dy ) = ∫_{−∞}^{∞} ∫_{−∞}^{∞} g(x) h(y) dy dx
    = ∫_{−∞}^{∞} ∫_{−∞}^{∞} f(x, y) dy dx = 1.


Conditional Distributions and Independence

Moreover,

f_X(x) = ∫_{−∞}^{∞} g(x) h(y) dy = g(x) d  and  f_Y(y) = ∫_{−∞}^{∞} g(x) h(y) dx = h(y) c.

Then, since cd = 1,

f(x, y) = g(x) h(y) = g(x) h(y) c d = [g(x) d][h(y) c] = f_X(x) f_Y(y),

proving that X and Y are independent.
To prove the Lemma for discrete random vectors, replace integrals with summations.


Conditional Distributions and Independence

It might not be clear enough at first sight, but this is a powerful result. It implies
that we do not have to calculate the marginal distributions first and then check
whether their product gives the joint distribution.
Instead, it is enough to check whether the joint distribution is equal to the product
of some function of x and some function of y, for all values of (x, y) ∈ R^2.
Example (4.2.8): Consider the joint pdf

f(x, y) = (1/384) x^2 y^4 e^{−y−(x/2)},  x > 0 and y > 0.

If we define

g(x) = x^2 e^{−x/2}  and  h(y) = y^4 e^{−y} / 384,

for x > 0 and y > 0, and g(x) = h(y) = 0 otherwise, then, clearly,

f(x, y) = g(x) h(y)

for all (x, y) ∈ R^2. By Lemma (4.2.7), X and Y are independently distributed!


Conditional Distributions and Independence
Example (4.2.9): Let X be the number of living parents of a student randomly
selected from an elementary school in Kansas City and Y be the number of living
parents of a retiree randomly selected from Sun City. Suppose, furthermore, that we
have

f_X(0) = .01, f_X(1) = .09, f_X(2) = .90,
f_Y(0) = .70, f_Y(1) = .25, f_Y(2) = .05.

It seems reasonable that X and Y will be independent: knowledge of the number of
parents of the student does not give us any information on the number of parents of
the retiree and vice versa. Therefore, we should have

f_{X,Y}(x, y) = f_X(x) f_Y(y).

Then, for example,

f_{X,Y}(0, 0) = .0070,  f_{X,Y}(0, 1) = .0025,

etc.
We can thus calculate quantities such as

P(X = Y) = f(0, 0) + f(1, 1) + f(2, 2)
         = (.01)(.70) + (.09)(.25) + (.90)(.05) = .0745.


Conditional Distributions and Independence

Theorem (4.2.10): Let X and Y be independent random variables.
1. For any A ⊂ R and B ⊂ R, P(X ∈ A, Y ∈ B) = P(X ∈ A) P(Y ∈ B); that is, the
events {X ∈ A} and {Y ∈ B} are independent events.
2. Let g(x) be a function only of x and h(y) be a function only of y. Then
E[g(X) h(Y)] = E[g(X)] E[h(Y)].

Proof: Start with (2) and consider the continuous case. Now,

E[g(X) h(Y)] = ∫_{−∞}^{∞} ∫_{−∞}^{∞} g(x) h(y) f(x, y) dx dy
             = ∫_{−∞}^{∞} ∫_{−∞}^{∞} g(x) h(y) f_X(x) f_Y(y) dx dy
             = ( ∫_{−∞}^{∞} g(x) f_X(x) dx )( ∫_{−∞}^{∞} h(y) f_Y(y) dy )
             = E[g(X)] E[h(Y)].

As before, the discrete case is proved by replacing integrals with sums.


Conditional Distributions and Independence
Now, consider (1). Remember that for some set A ⊂ R, the indicator function I_A(x)
is given by

I_A(x) = 1 if x ∈ A,  and  I_A(x) = 0 otherwise.

Let g(x) and h(y) be the indicator functions of the sets A and B, respectively.
Now, g(x) h(y) is the indicator function of the set C ⊂ R^2 where
C = {(x, y): x ∈ A, y ∈ B}.
Note that for an indicator function such as g(x), E[g(X)] = P(X ∈ A), since

P(X ∈ A) = ∫_{x∈A} f(x) dx = ∫ I_A(x) f(x) dx = E[I_A(X)].

Then, using the result we proved on the previous slide,

P(X ∈ A, Y ∈ B) = P((X, Y) ∈ C) = E[g(X) h(Y)]
                = E[g(X)] E[h(Y)]
                = P(X ∈ A) P(Y ∈ B).

These results make life a lot easier when calculating expectations of certain random
variables.


Conditional Distributions and Independence
Example (4.2.11): Let X and Y be independent exponential(1) random variables.
Then, from the previous theorem we have that

P(X ≥ 4, Y < 3) = P(X ≥ 4) P(Y < 3) = e^{−4}(1 − e^{−3}).

Letting g(x) = x^2 and h(y) = y, we see that

E[X^2 Y] = E[X^2] E[Y] = (Var(X) + {E[X]}^2) E[Y] = (1 + 1^2)(1) = 2.

The moment generating function is back!
Theorem (4.2.12): Let X and Y be independent random variables with moment
generating functions M_X(t) and M_Y(t). Then the moment generating function of
the random variable Z = X + Y is given by

M_Z(t) = M_X(t) M_Y(t).

Proof: Now, we know from before that M_Z(t) = E[e^{tZ}]. Then,

E[e^{tZ}] = E[e^{t(X+Y)}] = E[e^{tX} e^{tY}] = E[e^{tX}] E[e^{tY}] = M_X(t) M_Y(t),

where we have used the result that for two independent random variables,
E[g(X) h(Y)] = E[g(X)] E[h(Y)].
Of course, if independence does not hold, life gets pretty tough! But we will not
deal with that here.
Conditional Distributions and Independence

We will now introduce a special case of Theorem (4.2.12).
Theorem (4.2.14): Let X ~ N(µ_X, σ_X^2) and Y ~ N(µ_Y, σ_Y^2) be independent
normal random variables. Then the random variable Z = X + Y has a
N(µ_X + µ_Y, σ_X^2 + σ_Y^2) distribution.
Proof: The mgfs of X and Y are

M_X(t) = exp( µ_X t + σ_X^2 t^2 / 2 )  and  M_Y(t) = exp( µ_Y t + σ_Y^2 t^2 / 2 ).

Then, from Theorem (4.2.12),

M_Z(t) = M_X(t) M_Y(t) = exp[ (µ_X + µ_Y) t + (σ_X^2 + σ_Y^2) t^2 / 2 ],

which is the mgf of a normal random variable with mean µ_X + µ_Y and variance
σ_X^2 + σ_Y^2.



Bivariate Transformations

We now consider transformations involving two random variables rather than only
one.
Let (X, Y) be a random vector and consider (U, V) where

U = g1(X, Y)  and  V = g2(X, Y),

for some pre-specified functions g1(·) and g2(·).
For any B ⊂ R^2, notice that

(U, V) ∈ B if and only if (X, Y) ∈ A, where A = {(x, y): (g1(x, y), g2(x, y)) ∈ B}.

Hence,

P((U, V) ∈ B) = P((X, Y) ∈ A).

This implies that the probability distribution of (U, V) is completely determined by
the probability distribution of (X, Y).


Bivariate Transformations

Consider the discrete case first.
If (X, Y) is a discrete random vector, then there is only a countable set of values for
which the joint pmf of (X, Y) is positive. Define this set as A.
Moreover, let

B = {(u, v): u = g1(x, y) and v = g2(x, y) for some (x, y) ∈ A}.

Therefore, B gives the countable set of all possible values for (U, V).
Finally, define

A_{uv} = {(x, y) ∈ A: g1(x, y) = u and g2(x, y) = v},  for any (u, v) ∈ B.

Then, the joint pmf of (U, V) is given by

f_{U,V}(u, v) = P(U = u, V = v) = P((X, Y) ∈ A_{uv}) = Σ_{(x,y)∈A_{uv}} f_{X,Y}(x, y).


Bivariate Transformations

Example (4.3.1): Let X ~ Poisson(θ) and Y ~ Poisson(λ) be independent. Due to
independence, the joint pmf is given by

f_{X,Y}(x, y) = (θ^x e^{−θ} / x!)(λ^y e^{−λ} / y!),  x = 0, 1, 2, ... and y = 0, 1, 2, ... .

Obviously

A = {(x, y): x = 0, 1, 2, ... and y = 0, 1, 2, ...}.

Define

U = X + Y  and  V = Y,

implying that

g1(x, y) = x + y  and  g2(x, y) = y.


Bivariate Transformations

What is B? Since y = v, for any given v,

u = x + y = x + v.

Hence, u = v, v + 1, v + 2, v + 3, ... . Therefore,

B = {(u, v): v = 0, 1, 2, ... and u = v, v + 1, v + 2, v + 3, ...}.

Moreover, for any (u, v) the only (x, y) satisfying u = x + y and v = y is given by
x = u − v and y = v. Therefore, we always have

A_{uv} = {(u − v, v)}.

As such,

f_{U,V}(u, v) = Σ_{(x,y)∈A_{uv}} f_{X,Y}(x, y) = f_{X,Y}(u − v, v)
             = (θ^{u−v} e^{−θ} / (u − v)!)(λ^v e^{−λ} / v!),  v = 0, 1, 2, ... and u = v, v + 1, v + 2, ... .


Bivariate Transformations

What is the marginal pmf of U?
For any fixed non-negative integer u, f_{U,V}(u, v) > 0 only for v = 0, 1, ..., u. Then,

f_U(u) = Σ_{v=0}^{u} (θ^{u−v} e^{−θ} / (u − v)!)(λ^v e^{−λ} / v!) = e^{−(θ+λ)} Σ_{v=0}^{u} θ^{u−v} λ^v / ((u − v)! v!)
       = (e^{−(θ+λ)} / u!) Σ_{v=0}^{u} [u! / ((u − v)! v!)] θ^{u−v} λ^v
       = (e^{−(θ+λ)} / u!) (θ + λ)^u,  u = 0, 1, 2, ... ,

which follows from the binomial formula (a + b)^n = Σ_{k=0}^{n} [n! / ((n − k)! k!)] a^k b^{n−k}.
This is the pmf of a Poisson random variable with parameter θ + λ. A theorem
follows.
Theorem (4.3.2): If X ~ Poisson(θ), Y ~ Poisson(λ) and X and Y are
independent, then X + Y ~ Poisson(θ + λ).
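A quick Monte Carlo illustration of Theorem (4.3.2) in Python (my own sketch; the parameter values θ = 2 and λ = 3 are chosen arbitrarily):

```python
# Compare the empirical pmf of X + Y with that of a Poisson(theta + lambda) draw.
import numpy as np

rng = np.random.default_rng(0)
theta, lam, n = 2.0, 3.0, 1_000_000
u = rng.poisson(theta, n) + rng.poisson(lam, n)  # X + Y, X and Y independent
w = rng.poisson(theta + lam, n)                  # direct Poisson(theta + lambda)

for k in range(8):
    print(k, (u == k).mean(), (w == k).mean())   # the two columns nearly coincide
```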



Bivariate Transformations
Consider the continuous case now.
Let (X, Y) ~ f_{X,Y}(x, y) be a continuous random vector and

A = {(x, y): f_{X,Y}(x, y) > 0},
B = {(u, v): u = g1(x, y) and v = g2(x, y) for some (x, y) ∈ A}.

The joint pdf f_{U,V}(u, v) will be positive on the set B.
Assume now that the transformation u = g1(x, y) and v = g2(x, y) is a one-to-one
transformation from A onto B.
Why onto? Because we have defined B in such a way that any (u, v) ∈ B
corresponds to some (x, y) ∈ A.
We assume that the one-to-one assumption is maintained. That is, we are assuming
that for each (u, v) ∈ B there is only one (x, y) ∈ A such that
(u, v) = (g1(x, y), g2(x, y)).
When the transformation is one-to-one and onto, we can solve the equations

u = g1(x, y)  and  v = g2(x, y),

and obtain

x = h1(u, v)  and  y = h2(u, v).


Bivariate Transformations

The last remaining ingredient is the Jacobian of the transformation. This is the
determinant of the matrix of partial derivatives:

J = det [ ∂x/∂u  ∂x/∂v ; ∂y/∂u  ∂y/∂v ] = (∂x/∂u)(∂y/∂v) − (∂x/∂v)(∂y/∂u).

These partial derivatives are given by

∂x/∂u = ∂h1(u, v)/∂u,  ∂x/∂v = ∂h1(u, v)/∂v,  ∂y/∂u = ∂h2(u, v)/∂u,  ∂y/∂v = ∂h2(u, v)/∂v.

Then, the transformation formula is given by

f_{U,V}(u, v) = f_{X,Y}(h1(u, v), h2(u, v)) |J|,

while f_{U,V}(u, v) = 0 for (u, v) ∉ B.
Clearly, we need J ≠ 0 on B.



Bivariate Transformations

The next example is based on the beta distribution, which is related to the gamma
distribution.
The beta(α, β) pdf is given by

f_X(x | α, β) = [Γ(α + β) / (Γ(α) Γ(β))] x^{α−1} (1 − x)^{β−1},

where 0 < x < 1, α > 0 and β > 0.



Bivariate Transformations

Example (4.3.3): Let X ~ beta(α, β), Y ~ beta(α + β, γ) and X ⊥⊥ Y.
The symbol “⊥⊥” is used to denote independence.
Due to independence, the joint pdf is given by

f_{X,Y}(x, y) = [Γ(α + β) / (Γ(α) Γ(β))] x^{α−1} (1 − x)^{β−1} · [Γ(α + β + γ) / (Γ(α + β) Γ(γ))] y^{α+β−1} (1 − y)^{γ−1},

where 0 < x < 1 and 0 < y < 1.
We consider

U = XY  and  V = X.

Let's define A and B first.
As usual,

A = {(x, y): f_{X,Y}(x, y) > 0}.



Bivariate Transformations

Now we know that V = X and the set of possible values for X is 0 < x < 1. Hence,
the set of possible values for V is given by 0 < v < 1.
Since U = XY = VY, for any given value of V = v, U will vary between 0 and v, as
the set of possible values for Y is 0 < y < 1. Hence

0 < u < v.

Therefore, the given transformation maps A onto

B = {(u, v): 0 < u < v < 1}.

For any (u, v), we can uniquely solve for (x, y). That is,

x = h1(u, v) = v  and  y = h2(u, v) = u/v.

An aside: of course, we assume that this transformation is one-to-one on A only.
This is not the case on R^2 because for any value of y, (0, y) is mapped into the
point (0, 0).



Bivariate Transformations

For

x = h1(u, v) = v  and  y = h2(u, v) = u/v,

we have

J = det [ ∂x/∂u  ∂x/∂v ; ∂y/∂u  ∂y/∂v ] = det [ 0  1 ; 1/v  −u/v^2 ] = −1/v.

Then, the transformation formula gives

f_{U,V}(u, v) = [Γ(α + β + γ) / (Γ(α) Γ(β) Γ(γ))] v^{α−1} (1 − v)^{β−1} (u/v)^{α+β−1} (1 − u/v)^{γ−1} (1/v),

where 0 < u < v < 1.
Obviously, since V = X, V ~ beta(α, β). What about U?



Bivariate Transformations

Define K = Γ(α + β + γ) / (Γ(α) Γ(β) Γ(γ)). Then,

f_U(u) = ∫_{u}^{1} f_{U,V}(u, v) dv
       = K ∫_{u}^{1} v^{α−1} (1 − v)^{β−1} (u/v)^{α+β−1} (1 − u/v)^{γ−1} (1/v) dv.

Collecting the powers of u and v, the integrand can be rearranged so that

f_U(u) = K u^{α−1} ∫_{u}^{1} (u/v − u)^{β−1} (1 − u/v)^{γ−1} (u/v^2) dv,

which sets up the change of variable used on the next slide.



Bivariate Transformations

Let

y = (u/v − u) / (1 − u),

which implies that

dy = −(u/v^2) [1 / (1 − u)] dv,

or, equivalently,

dv = −dy (1 − u) v^2 / u.

Now, observe that

y^{β−1} = [(u/v − u) / (1 − u)]^{β−1},
(1 − y)^{γ−1} = [(1 − u − u/v + u) / (1 − u)]^{γ−1} = [(1 − u/v) / (1 − u)]^{γ−1}.



Bivariate Transformations
Moreover, for v = u, y = 1 and for v = 1, y = 0. Since (u/v − u) = y(1 − u),
(1 − u/v) = (1 − y)(1 − u) and (u/v^2) dv = −(1 − u) dy, we therefore have

f_U(u) = K u^{α−1} ∫_{u}^{1} (u/v − u)^{β−1} (1 − u/v)^{γ−1} (u/v^2) dv
       = K u^{α−1} ∫_{1}^{0} [y(1 − u)]^{β−1} [(1 − y)(1 − u)]^{γ−1} ( −(1 − u) ) dy
       = K u^{α−1} (1 − u)^{β+γ−1} ∫_{0}^{1} y^{β−1} (1 − y)^{γ−1} dy.

Now, y^{β−1} (1 − y)^{γ−1} is the kernel of the beta(β, γ) pdf

f_Y(y | β, γ) = [Γ(β + γ) / (Γ(β) Γ(γ))] y^{β−1} (1 − y)^{γ−1},  0 < y < 1, β > 0, γ > 0,

and it is straightforward to show that

∫_0^1 y^{β−1} (1 − y)^{γ−1} dy = Γ(β) Γ(γ) / Γ(β + γ).



Bivariate Transformations

Hence,

f_U(u) = K u^{α−1} (1 − u)^{β+γ−1} ∫_0^1 y^{β−1} (1 − y)^{γ−1} dy
       = [Γ(α + β + γ) / (Γ(α) Γ(β) Γ(γ))] [Γ(β) Γ(γ) / Γ(β + γ)] u^{α−1} (1 − u)^{β+γ−1}
       = [Γ(α + β + γ) / (Γ(α) Γ(β + γ))] u^{α−1} (1 − u)^{β+γ−1},

where 0 < u < 1.
This shows that the marginal distribution of U is beta(α, β + γ).
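A simulation sketch of this result (mine, not from the slides; the parameter values α = 2, β = 3, γ = 4 are arbitrary illustrations):

```python
# Monte Carlo check that U = XY is beta(alpha, beta + gamma).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
a, b, g, n = 2.0, 3.0, 4.0, 500_000
x = rng.beta(a, b, n)        # X ~ beta(alpha, beta)
y = rng.beta(a + b, g, n)    # Y ~ beta(alpha + beta, gamma), independent of X
u = x * y

target = stats.beta(a, b + g)
print(u.mean(), target.mean())                 # close
print(u.var(), target.var())                   # close
print(stats.kstest(u, target.cdf).statistic)   # small KS distance if they match
```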



Bivariate Transformations

Example (4.3.4): Let X ~ N(0, 1), Y ~ N(0, 1) and X ⊥⊥ Y. The transformation
is given by

U = g1(X, Y) = X + Y  and  V = g2(X, Y) = X − Y.

Now, the pdf for (X, Y) is given by

f_{X,Y}(x, y) = [1/√(2π)] exp(−x^2/2) · [1/√(2π)] exp(−y^2/2) = [1/(2π)] exp[ −(x^2 + y^2)/2 ],

where −∞ < x < ∞ and −∞ < y < ∞.
Therefore,

A = {(x, y): f_{X,Y}(x, y) > 0} = R^2.



Bivariate Transformations

We have

u = x + y  and  v = x − y,

and to determine B we need to find out all the values taken on by (u, v) as we
choose different (x, y) ∈ A.
Thankfully, when these equations are solved for (x, y) they yield unique solutions:

x = h1(u, v) = (u + v)/2  and  y = h2(u, v) = (u − v)/2.

Now, the reasoning is as follows: A = R^2. Moreover, for every (u, v) ∈ B, there is a
unique (x, y). Therefore, B = R^2, as well! The mapping is one-to-one and, by
definition, onto.



Bivariate Transformations

The Jacobian is given by

J = det [ ∂x/∂u  ∂x/∂v ; ∂y/∂u  ∂y/∂v ] = det [ 1/2  1/2 ; 1/2  −1/2 ] = −1/4 − 1/4 = −1/2.

Therefore,

f_{U,V}(u, v) = f_{X,Y}(h1(u, v), h2(u, v)) |J|
             = [1/(2π)] exp[ −(1/2)((u + v)/2)^2 − (1/2)((u − v)/2)^2 ] (1/2),

where −∞ < u < ∞ and −∞ < v < ∞.
Rearranging gives

f_{U,V}(u, v) = [1/(√(2π) √2)] exp(−u^2/4) · [1/(√(2π) √2)] exp(−v^2/4).

Since the joint density factors into a function of u and a function of v, by Lemma
(4.2.7), U and V are independent.



Bivariate Transformations
Remember Theorem (4.2.14): if X ~ N(µ_X, σ_X^2) and Y ~ N(µ_Y, σ_Y^2) are independent
normal random variables, then Z = X + Y has a N(µ_X + µ_Y, σ_X^2 + σ_Y^2) distribution.
Then,

U = X + Y ~ N(0, 2).

What about V? Define Z = −Y and notice that

Z = −Y ~ N(0, 1).

Then, by Theorem (4.2.14),

V = X − Y = X + Z ~ N(0, 2),

as well.
That the sums and differences of independent normal random variables are
independent normal random variables is true regardless of the means of X and Y,
as long as Var(X) = Var(Y). See Exercise 4.27.
Note that the more difficult bit here is to prove that U and V are indeed
independent.
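A short Python simulation of Example (4.3.4) (my own illustration):

```python
# U = X + Y and V = X - Y for independent standard normals: both ~ N(0, 2),
# and their sample correlation is ~0 (with joint normality, that means independence).
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(1_000_000)
y = rng.standard_normal(1_000_000)
u, v = x + y, x - y

print(u.var(), v.var())          # both ~2
print(np.corrcoef(u, v)[0, 1])   # ~0
```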



Bivariate Transformations
Theorem (4.3.5): Let X ⊥⊥ Y be two random variables. Define U = g(X) and
V = h(Y), where g(x) is a function only of x and h(y) is a function only of y.
Then U ⊥⊥ V.
Proof: Consider the proof for continuous random variables U and V. Define, for any
u ∈ R and v ∈ R,

A_u = {x: g(x) ≤ u}  and  B_v = {y: h(y) ≤ v}.

Then,

F_{U,V}(u, v) = P(U ≤ u, V ≤ v) = P(X ∈ A_u, Y ∈ B_v) = P(X ∈ A_u) P(Y ∈ B_v),

where the last result follows by Theorem (4.2.10). Then,

f_{U,V}(u, v) = ∂^2 F_{U,V}(u, v) / (∂u ∂v) = [d P(X ∈ A_u)/du] [d P(Y ∈ B_v)/dv],

and clearly the joint pdf factorises into a function only of u and a function only of
v. Therefore, by Lemma (4.2.7), U and V are independent.
Bivariate Transformations

What if we start with a bivariate random variable (X , Y ) but are only interested in
U = g (X , Y ) and not (U , V )?
We can then choose a convenient V = h (X , Y ) , such that the resulting
transformation from (X , Y ) to (U , V ) is one-to-one on A.
Then, the joint pdf of (U , V ) can be calculated as usual and we can obtain the
marginal pdf of U from the joint pdf of (U , V ) .
Which V is “convenient” would generally depend on what U is.



Bivariate Transformations
What if the transformation of interest is not one-to-one?
In a similar fashion to Theorem 2.1.8, we can define

A = {(x, y): f_{X,Y}(x, y) > 0},

and then divide this into a partition where the transformation is one-to-one on each
block.
Specifically, let A_0, A_1, ..., A_k be a partition of A which satisfies the following
properties.
1. P((X, Y) ∈ A_0) = 0;
2. The transformation U = g1(X, Y) and V = g2(X, Y) is a one-to-one transformation
from A_i onto B for each i = 1, ..., k.
Then we can calculate the inverse functions from B to A_i for each i. Let
x = h_{1i}(u, v) and y = h_{2i}(u, v) for the i-th partition.
For (u, v) ∈ B, this inverse gives the unique (x, y) ∈ A_i such that
(u, v) = (g1(x, y), g2(x, y)).
Let J_i be the Jacobian for the i-th inverse. Then,

f_{U,V}(u, v) = Σ_{i=1}^{k} f_{X,Y}(h_{1i}(u, v), h_{2i}(u, v)) |J_i|.



Hierarchical Models and Mixture Distributions

As an introduction to the idea of hierarchical models, consider the following example.
Example (4.4.1): An insect lays a large number of eggs, each surviving with
probability p. On average, how many eggs will survive?
Often, the “large number of eggs” is taken to be Poisson(λ). Now, assume that
survival of each egg is an independent event. Therefore, survival of eggs constitutes
Bernoulli trials.
Let's put this into a statistical framework:

X | Y ~ binomial(Y, p),
Y ~ Poisson(λ).

Therefore, survival is a random variable, which in turn is a function of another
random variable.
This framework is convenient in the sense that complex models can be analysed as a
collection of simpler models.



Hierarchical Models and Mixture Distributions

Let's get back to the original question.
Example (4.4.2): Now,

P(X = x) = Σ_{y=0}^{∞} P(X = x, Y = y)
         = Σ_{y=0}^{∞} P(X = x | Y = y) P(Y = y)
         = Σ_{y=x}^{∞} [ y! / ((y − x)! x!) ] p^x (1 − p)^{y−x} [ e^{−λ} λ^y / y! ],

where the summation in the third line starts from y = x rather than y = 0 since, if
y < x, the conditional probability is equal to zero: clearly, the number of
surviving eggs cannot be larger than the number of those laid.



Hierarchical Models and Mixture Distributions

Let t = y − x. This gives

P(X = x) = Σ_{y=x}^{∞} [ y! / ((y − x)! x!) ] p^x (1 − p)^{y−x} e^{−λ} λ^x λ^{y−x} / y!
         = [(λp)^x e^{−λ} / x!] Σ_{y=x}^{∞} [(1 − p)λ]^{y−x} / (y − x)!
         = [(λp)^x e^{−λ} / x!] Σ_{t=0}^{∞} [(1 − p)λ]^t / t!.

But the summation in the final term is the kernel of the Poisson((1 − p)λ) pmf.
Remember that

Σ_{t=0}^{∞} [(1 − p)λ]^t / t! = e^{(1−p)λ}.

Therefore,

P(X = x) = [(λp)^x e^{−λ} / x!] e^{(1−p)λ} = (λp)^x e^{−λp} / x!,

which implies that X ~ Poisson(λp).
The answer to the original question then is E[X] = λp: on average, λp eggs survive.



Hierarchical Models and Mixture Distributions

Now comes a very useful theorem which you will, most likely, use frequently in the
future.
Remember that E[X|y] is a function of y and E[X|Y] is a random variable whose
value depends on the value of Y.
Theorem (4.4.3): If X and Y are two random variables, then

E_X[X] = E_Y{ E_{X|Y}[X|Y] },

provided that the expectations exist.
It is important to notice that the two expectations are with respect to two different
probability densities, f_X(·) and f_{X|Y}(· | Y = y).
This result is widely known as the Law of Iterated Expectations.



Hierarchical Models and Mixture Distributions

Proof: Let f_{X,Y}(x, y) denote the joint pdf of (X, Y). Moreover, let f_Y(y) and
f_{X|Y}(x|y) denote the marginal distribution of Y and the conditional distribution of
X (conditional on Y = y), respectively. By definition,

E_X[X] = ∫_{−∞}^{∞} ∫_{−∞}^{∞} x f(x, y) dx dy = ∫_{−∞}^{∞} [ ∫_{−∞}^{∞} x f_{X|Y}(x|y) dx ] f_Y(y) dy
       = ∫_{−∞}^{∞} E_{X|Y}[X | Y = y] f_Y(y) dy
       = E_Y{ E_{X|Y}[X|Y] }.

The corresponding proof for the discrete case can be obtained by replacing integrals
with sums.
How does this help us? Consider calculating the expected number of survivors again:

E_X[X] = E_Y{ E_{X|Y}[X|Y] } = E_Y[pY] = p E_Y[Y] = pλ.
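A simulation sketch of the egg example (mine; λ = 10 and p = 0.3 are arbitrary illustrative values):

```python
# Simulate the binomial-Poisson hierarchy of Example (4.4.1)-(4.4.2).
import numpy as np

rng = np.random.default_rng(0)
lam, p, n = 10.0, 0.3, 1_000_000
eggs = rng.poisson(lam, n)          # Y ~ Poisson(lambda)
survivors = rng.binomial(eggs, p)   # X | Y ~ binomial(Y, p)

print(survivors.mean())  # ~3 = lambda * p, matching E[X] = E_Y{E[X|Y]} = p*lambda
print(survivors.var())   # ~3 as well, consistent with X ~ Poisson(lambda * p)
```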



Hierarchical Models and Mixture Distributions

The textbook gives the following definition for mixture distributions.
Definition (4.4.4): A random variable X is said to have a mixture distribution if the
distribution of X depends on a quantity that also has a distribution.
Therefore, a mixture distribution is a distribution that is generated through a
hierarchical mechanism.
Hence, as far as Example (4.4.1) is concerned, the Poisson(λp) distribution, which
is the result of combining a binomial(Y, p) and a Poisson(λ) distribution, is a
mixture distribution.



Hierarchical Models and Mixture Distributions

Example (4.4.5): Now, consider the following hierarchical model:

X | Y ~ binomial(Y, p),
Y | Λ ~ Poisson(Λ),
Λ ~ exponential(β).

Then,

E_X[X] = E_Y{ E_{X|Y}[X|Y] } = E_Y[pY]
       = E_Λ{ E_{Y|Λ}[pY | Λ] } = p E_Λ{ E_{Y|Λ}[Y | Λ] }
       = p E_Λ[Λ] = pβ,

which is obtained by successive application of the Law of Iterated Expectations.
Note that in this example we considered both discrete and continuous random
variables. This is fine.



Hierarchical Models and Mixture Distributions
The three-stage hierarchy of the previous model can also be considered as a
two-stage hierarchy by combining the last two stages.
Now,

P(Y = y) = P(Y = y, 0 < Λ < ∞) = ∫_0^∞ f(y, λ) dλ = ∫_0^∞ f(y|λ) f(λ) dλ
         = ∫_0^∞ [ e^{−λ} λ^y / y! ] (1/β) e^{−λ/β} dλ
         = [1/(β y!)] ∫_0^∞ λ^y e^{−λ(1 + β^{−1})} dλ
         = [1/(β y!)] Γ(y + 1) [ 1/(1 + β^{−1}) ]^{y+1}
         = [ 1/(1 + β) ] [ 1 − 1/(1 + β) ]^y.

In order to see that

∫_0^∞ λ^y e^{−λ(1 + β^{−1})} dλ = Γ(y + 1) [ 1/(1 + β^{−1}) ]^{y+1},

just use the pdf of λ ~ gamma(y + 1, (1 + 1/β)^{−1}).



Hierarchical Models and Mixture Distributions

The final expression is that of the negative binomial pmf. Therefore, the two-stage
hierarchy is given by

X | Y ~ binomial(Y, p),
Y ~ negative binomial( r = 1, p = 1/(1 + β) ).



Hierarchical Models and Mixture Distributions
Another useful result is given next.
Theorem (4.4.7): For any two random variables X and Y,

Var_X(X) = E_Y[ Var_{X|Y}(X|Y) ] + Var_Y{ E_{X|Y}[X|Y] }.

Proof: Remember that

Var_X(X) = E_X[X^2] − {E_X[X]}^2
         = E_Y{ E_{X|Y}[X^2|Y] } − ( E_Y{ E_{X|Y}[X|Y] } )^2
         = E_Y{ E_{X|Y}[X^2|Y] } − E_Y{ (E_{X|Y}[X|Y])^2 }
           + E_Y{ (E_{X|Y}[X|Y])^2 } − ( E_Y{ E_{X|Y}[X|Y] } )^2
         = E_Y{ E_{X|Y}[X^2|Y] − (E_{X|Y}[X|Y])^2 }
           + E_Y{ (E_{X|Y}[X|Y])^2 } − ( E_Y{ E_{X|Y}[X|Y] } )^2
         = E_Y[ Var_{X|Y}(X|Y) ] + Var_Y{ E_{X|Y}[X|Y] },

which yields the desired result.


Hierarchical Models and Mixture Distributions
Example (4.4.6): Consider the following generalisation of the binomial distribution,
where the probability of success varies according to a distribution.
Specifically,

X | P ~ binomial(n, P),
P ~ beta(α, β).

Then

E_X[X] = E_P{ E_{X|P}[X|P] } = E_P[nP] = n α/(α + β),

where the last result follows from the fact that for P ~ beta(α, β),
E[P] = α/(α + β).
Example (4.4.8): Now, let's calculate the variance of X. By Theorem (4.4.7),

Var_X(X) = Var_P{ E_{X|P}[X|P] } + E_P[ Var_{X|P}(X|P) ].

Now, E_{X|P}[X|P] = nP and, since P ~ beta(α, β),

Var_P( E_{X|P}[X|P] ) = Var_P(nP) = n^2 αβ / [ (α + β)^2 (α + β + 1) ].

Moreover, Var_{X|P}(X|P) = nP(1 − P), due to X|P being a binomial random
variable.
Hierarchical Models and Mixture Distributions

What about E_P[ Var_{X|P}(X|P) ]? Remember that the beta(α, β) pdf is given by

f_X(x | α, β) = [Γ(α + β) / (Γ(α) Γ(β))] x^{α−1} (1 − x)^{β−1},

where 0 < x < 1, α > 0 and β > 0.
Then,

E_P[ Var_{X|P}(X|P) ] = E_P[ nP(1 − P) ]
                      = n [Γ(α + β) / (Γ(α) Γ(β))] ∫_0^1 p(1 − p) p^{α−1} (1 − p)^{β−1} dp.

The integrand is the kernel of another beta pdf with parameters α + 1 and β + 1,
since the pdf for P ~ beta(α + 1, β + 1) is given by

[Γ(α + β + 2) / (Γ(α + 1) Γ(β + 1))] p^α (1 − p)^β.



Hierarchical Models and Mixture Distributions

Therefore,

E_P[ Var_{X|P}(X|P) ] = n [Γ(α + β) / (Γ(α) Γ(β))] [Γ(α + 1) Γ(β + 1) / Γ(α + β + 2)]
                      = n [Γ(α + β) / (Γ(α) Γ(β))] [ α Γ(α) β Γ(β) / ((α + β + 1)(α + β) Γ(α + β)) ]
                      = n αβ / [ (α + β + 1)(α + β) ].

Thus,

Var_X(X) = n αβ / [ (α + β + 1)(α + β) ] + n^2 αβ / [ (α + β)^2 (α + β + 1) ]
         = n αβ (α + β + n) / [ (α + β)^2 (α + β + 1) ].
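A simulation sketch of the beta-binomial mean and variance formulas (mine; n = 20, α = 2, β = 5 are arbitrary illustrative values):

```python
# Check Examples (4.4.6) and (4.4.8) by simulating the hierarchy.
import numpy as np

rng = np.random.default_rng(0)
n, a, b, reps = 20, 2.0, 5.0, 1_000_000
p = rng.beta(a, b, reps)     # P ~ beta(alpha, beta)
x = rng.binomial(n, p)       # X | P ~ binomial(n, P)

print(x.mean(), n * a / (a + b))                                  # both ~ 5.71
print(x.var(), n * a * b * (a + b + n) / ((a + b)**2 * (a + b + 1)))  # both ~ 13.8
```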



Covariance and Correlation
Let X and Y be two random variables. To keep notation concise, we will use the
following notation:

E[X] = µ_X,  E[Y] = µ_Y,  Var(X) = σ_X^2  and  Var(Y) = σ_Y^2.

Definition (4.5.1): The covariance of X and Y is the number defined by

Cov(X, Y) = E[(X − µ_X)(Y − µ_Y)].

Definition (4.5.2): The correlation of X and Y is the number defined by

ρ_{XY} = Cov(X, Y) / (σ_X σ_Y),

which is also called the correlation coefficient.
If large (small) values of X tend to be observed with large (small) values of Y, then
Cov(X, Y) will be positive.
Why so? In this setting, when X > µ_X then Y > µ_Y is likely to be true,
whereas when X < µ_X then Y < µ_Y is likely to be true. Hence
E[(X − µ_X)(Y − µ_Y)] > 0.
Similarly, if large (small) values of X tend to be observed with small (large) values of
Y, then Cov(X, Y) will be negative.
Covariance and Correlation

Correlation normalises covariance by the standard deviations and is, therefore, a
more informative measure.
If Cov(X, Y) = 300 while Cov(W, Z) = 10, this does not necessarily mean that
there is a much stronger relationship between X and Y. For example, if
Var(X) = Var(Y) = 100 while Var(W) = Var(Z) = 1, then

ρ_{XY} = 3  and  ρ_{WZ} = 10.

Theorem (4.5.3): For any random variables X and Y,

Cov(X, Y) = E[XY] − µ_X µ_Y.

Proof:

Cov(X, Y) = E[(X − µ_X)(Y − µ_Y)]
          = E[XY − µ_X Y − µ_Y X + µ_X µ_Y]
          = E[XY] − µ_X E[Y] − µ_Y E[X] + µ_X µ_Y
          = E[XY] − µ_X µ_Y.



Covariance and Correlation

Example (4.5.4): Let (X, Y) be a bivariate random variable with

f_{X,Y}(x, y) = 1,  0 < x < 1,  x < y < x + 1.

Now,

f_X(x) = ∫_x^{x+1} f_{X,Y}(x, y) dy = ∫_x^{x+1} 1 dy = y |_{y=x}^{x+1} = 1,

implying that X ~ uniform(0, 1). Therefore, E[X] = 1/2 and Var(X) = 1/12.
Now, f_Y(y) is a bit more complicated to calculate. Considering the region where
f_{X,Y}(x, y) > 0, we observe that 0 < x < y when 0 < y < 1 and y − 1 < x < 1
when 1 ≤ y < 2. Therefore,

f_Y(y) = ∫_0^y f_{X,Y}(x, y) dx = ∫_0^y 1 dx = y,  when 0 < y < 1,
f_Y(y) = ∫_{y−1}^1 f_{X,Y}(x, y) dx = ∫_{y−1}^1 1 dx = 2 − y,  when 1 ≤ y < 2.



Covariance and Correlation

As an exercise, you can show that

µ_Y = 1  and  σ_Y^2 = 1/6.

Moreover,

E_{X,Y}[XY] = ∫_0^1 ∫_x^{x+1} xy dy dx = ∫_0^1 [ x y^2 / 2 ]_{y=x}^{x+1} dx
            = ∫_0^1 ( x^2 + x/2 ) dx = 7/12.

Therefore,

Cov(X, Y) = 1/12  and  ρ_{XY} = 1/√2.



Covariance and Correlation

Theorem (4.5.5): If X ⊥⊥ Y, then Cov(X, Y) = ρ_{XY} = 0.
Proof: Since X ⊥⊥ Y, by Theorem (4.2.10),

E[XY] = E[X] E[Y].

Then

Cov(X, Y) = E[XY] − µ_X µ_Y = µ_X µ_Y − µ_X µ_Y = 0,

and consequently,

ρ_{XY} = Cov(X, Y) / (σ_X σ_Y) = 0.

It is crucial to note that although X ⊥⊥ Y implies that Cov(X, Y) = ρ_{XY} = 0, the
implication does not necessarily hold in the reverse direction.



Covariance and Correlation
Theorem (4.5.6): If X and Y are any two random variables and a and b are any
two constants, then

Var(aX + bY) = a^2 Var(X) + b^2 Var(Y) + 2ab Cov(X, Y).

If X and Y are independent random variables, then

Var(aX + bY) = a^2 Var(X) + b^2 Var(Y).

You will be asked to prove this as a homework assignment.


Note that if two random variables, X and Y , are positively correlated, then

Var (X + Y ) > Var (X ) + Var (Y ),

whereas if X and Y are negatively correlated, then

Var (X + Y ) < Var (X ) + Var (Y ).

For positively correlated random variables, large values in one tend to be
accompanied by large values in the other. Therefore, the total variance is magnified.
Similarly, for negatively correlated random variables, large values in one tend to be
accompanied by small values in the other. Hence, the variance of the sum is
dampened.
Covariance and Correlation

Example (4.5.8): Let X ~ uniform(0, 1), Z ~ uniform(0, 1/10) and Z ⊥⊥ X. Let,
moreover,

Y = X + Z,

and consider (X, Y).
Consider the following intuitive derivation of the distribution of (X, Y). We are
given X = x and Y = x + Z for a particular value of X. Now, the conditional
distribution of Z given X is

Z | X ~ uniform(0, 1/10),

since Z and X are independent! Therefore, one can interpret x as a location
parameter in the conditional distribution of Y given X = x.
This conditional distribution is uniform(x, x + 1/10): we simply shift the location
of the distribution of Z by x units.
Now,

f_{X,Y}(x, y) = f_{Y|X}(y|x) f_X(x) = [1 / (x + 1/10 − x)] · 1 = 1 / (1/10) = 10,

where 0 < x < 1 and x < y < x + 1/10.



Covariance and Correlation
Based on these findings, it can be shown that

E[X] = 1/2  and  E[Y] = E[X + Z] = 1/2 + 1/20 = 11/20.

Then,

Cov(X, Y) = E[XY] − E[X] E[Y]
          = E[X(X + Z)] − E[X] E[X + Z]
          = E[X^2] + E[XZ] − {E[X]}^2 − E[X] E[Z]
          = E[X^2] + E[X] E[Z] − {E[X]}^2 − E[X] E[Z]
          = E[X^2] − {E[X]}^2 = Var(X) = 1/12.

By Theorem (4.5.6),

Var(Y) = Var(X) + Var(Z) = 1/12 + 1/1200.

Then,

ρ_{XY} = (1/12) / ( √(1/12) √(1/12 + 1/1200) ) = √(100/101).
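A Monte Carlo version of this example in Python (my own sketch):

```python
# Simulate Example (4.5.8): Y = X + Z with X ~ U(0,1), Z ~ U(0, 1/10) independent.
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000
x = rng.uniform(0.0, 1.0, n)
z = rng.uniform(0.0, 0.1, n)
y = x + z

print(np.cov(x, y)[0, 1])        # ~1/12 = Cov(X, Y)
print(np.corrcoef(x, y)[0, 1])   # ~sqrt(100/101) ~ 0.995
```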



Covariance and Correlation

A quick comparison of Examples (4.5.4) and (4.5.8) reveals an interesting result.
In the first example, one can show that Y|X ~ uniform(x, x + 1), while in the latter
example we have Y|X ~ uniform(x, x + 1/10).
Therefore, in the latter case, knowledge about the value of X gives us a more precise
idea about the possible values that Y can take on. Seen from a different
perspective, in the second example Y is conditionally more tightly distributed around
the particular value of X.
In relation to this observation, ρ_{XY} for the first example is equal to 1/√2, which is
much smaller than ρ_{XY} = √(100/101) from the second example.
Another important result, provided without proof at this point, is that

−1 ≤ ρ_{XY} ≤ 1;

see part (1) of Theorem (4.5.7) in Casella & Berger (p. 172). We will provide an
alternative proof of this result when we deal with inequalities.



Covariance and Correlation
Bivariate Normal Distribution

We now introduce the bivariate normal distribution.
Definition (4.5.10): Let −∞ < µ_X < ∞, −∞ < µ_Y < ∞, σ_X > 0, σ_Y > 0 and
−1 < ρ < 1. The bivariate normal pdf with means µ_X and µ_Y, variances σ_X^2 and
σ_Y^2, and correlation ρ is the bivariate pdf given by

f_{X,Y}(x, y) = [1 / (2π σ_X σ_Y √(1 − ρ^2))] exp( −[u^2 − 2ρuv + v^2] / [2(1 − ρ^2)] ),

where u = (y − µ_Y)/σ_Y and v = (x − µ_X)/σ_X, while −∞ < x < ∞ and −∞ < y < ∞.



Covariance and Correlation
Bivariate Normal Distribution

More concisely, this would be written as

(X, Y)' ~ N( (µ_X, µ_Y)',  [ σ_X^2  ρσ_Xσ_Y ; ρσ_Xσ_Y  σ_Y^2 ] ).

In addition, starting from the bivariate distribution, one can show that

Y | X = x ~ N( µ_Y + ρσ_Y (x − µ_X)/σ_X ,  σ_Y^2 (1 − ρ^2) ),

and, likewise,

X | Y = y ~ N( µ_X + ρσ_X (y − µ_Y)/σ_Y ,  σ_X^2 (1 − ρ^2) ).

Finally, again starting from the bivariate distribution, it can be shown that

X ~ N(µ_X, σ_X^2)  and  Y ~ N(µ_Y, σ_Y^2).

Therefore, joint normality implies conditional and marginal normality. However, this
does not go in the opposite direction; marginal normality does not imply joint
normality.
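A simulation sketch of the conditional-mean and conditional-variance formulas above (my own illustration; all parameter values are arbitrary):

```python
# Sample a bivariate normal and check E[Y|X=x] and Var(Y|X=x) numerically.
import numpy as np

rng = np.random.default_rng(0)
mu_x, mu_y, s_x, s_y, rho = 1.0, 2.0, 1.5, 0.5, 0.8
cov = [[s_x**2, rho * s_x * s_y], [rho * s_x * s_y, s_y**2]]
x, y = rng.multivariate_normal([mu_x, mu_y], cov, size=2_000_000).T

sel = np.abs(x - 2.0) < 0.01   # condition on X close to x = 2
print(y[sel].mean(), mu_y + rho * s_y * (2.0 - mu_x) / s_x)  # conditional mean
print(y[sel].var(), s_y**2 * (1 - rho**2))                   # conditional variance
```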



Covariance and Correlation
Bivariate Normal Distribution
The following is based on Section 3.6 of Gallant (1997).
Start with

f_{X,Y}(x, y) = [1 / (2π σ_X σ_Y √(1 − ρ^2))] exp( −[u^2 − 2ρuv + v^2] / [2(1 − ρ^2)] ).

Now,

u^2 − 2ρuv + v^2 = u^2 − 2ρuv + ρ^2 v^2 + v^2 − ρ^2 v^2 = (u − ρv)^2 + (1 − ρ^2) v^2.

Then,

f_{X,Y}(x, y) = [1 / (2π σ_X σ_Y √(1 − ρ^2))] exp( −[(u − ρv)^2 + (1 − ρ^2)v^2] / [2(1 − ρ^2)] )
             = [1 / (σ_Y √(2π(1 − ρ^2)))] exp( −(u − ρv)^2 / [2(1 − ρ^2)] )
               × [1 / (σ_X √(2π))] exp( −v^2 / 2 ).



Covariance and Correlation
Bivariate Normal Distribution

Then, the bivariate pdf factors into

[1 / (σ_Y √(2π(1 − ρ^2)))] exp( −(u − ρv)^2 / [2(1 − ρ^2)] )
  = [1 / (σ_Y √(2π(1 − ρ^2)))] exp( −[ (y − µ_Y) − ρ(σ_Y/σ_X)(x − µ_X) ]^2 / [2(1 − ρ^2)σ_Y^2] )

and

[1 / (σ_X √(2π))] exp( −v^2 / 2 ) = [1 / (σ_X √(2π))] exp( −(x − µ_X)^2 / (2σ_X^2) ).

The first of these can be considered as the conditional density f_{Y|X}(y|x), which is
normal with mean and variance

µ_{Y|X} = µ_Y + (ρσ_Y/σ_X)(x − µ_X)  and  σ_{Y|X}^2 = (1 − ρ^2)σ_Y^2.

The second, on the other hand, is the unconditional density f_X(x), which is normal
with mean µ_X and variance σ_X^2.
Covariance and Correlation
Bivariate Normal Distribution

Now, the covariance is given by

σ_{YX} = ∫_{−∞}^{∞} ∫_{−∞}^{∞} (y − µ_Y)(x − µ_X) f_{X,Y}(x, y) dx dy
       = ∫_{−∞}^{∞} ∫_{−∞}^{∞} (y − µ_Y)(x − µ_X) f_{Y|X}(y|x) f_X(x) dx dy
       = ∫_{−∞}^{∞} [ ∫_{−∞}^{∞} (y − µ_Y) f_{Y|X}(y|x) dy ] (x − µ_X) f_X(x) dx
       = ∫_{−∞}^{∞} [ µ_Y + (ρσ_Y/σ_X)(x − µ_X) − µ_Y ] (x − µ_X) f_X(x) dx
       = (ρσ_Y/σ_X) ∫_{−∞}^{∞} (x − µ_X)^2 f_X(x) dx
       = (ρσ_Y/σ_X) Var(X) = (ρσ_Y/σ_X) σ_X^2 = ρ σ_Y σ_X.

This shows that the correlation is given by ρ, since

ρ_{XY} = σ_{XY} / (σ_X σ_Y) = ρ σ_X σ_Y / (σ_X σ_Y) = ρ.



Multivariate Distributions

So far, we have discussed multivariate random variables which consist of two random
variables only. Now, we extend these ideas to general multivariate random variables.
For example, we might have the random vector (X1 , X2 , X3 , X4 ) where X1 is
temperature, X2 is weight, X3 is height and X4 is blood pressure.
The ideas and concepts we have discussed so far extend to such random vectors
directly.



Multivariate Distributions

Let X = (X_1, ..., X_n). Then the sample space for X is a subset of R^n, the
n-dimensional Euclidean space.
If this sample space is countable, then X is a discrete random vector and its joint
pmf is given by

f(x) = f(x_1, ..., x_n) = P(X_1 = x_1, X_2 = x_2, ..., X_n = x_n)  for each (x_1, ..., x_n) ∈ R^n.

For any A ⊂ R^n,

P(X ∈ A) = Σ_{x∈A} f(x).

Similarly, for a continuous random vector, we have the joint pdf given by
f(x) = f(x_1, ..., x_n), which satisfies

P(X ∈ A) = ∫···∫_A f(x) dx = ∫···∫_A f(x_1, ..., x_n) dx_1 ... dx_n.

Note that ∫···∫_A is an n-fold integration, where the limits of integration are such
that the integral is calculated over all points x ∈ A.

(Institute) EC509 This Version: 4 Nov 2013 93 / 123


Multivariate Distributions

Let $g(\mathbf{x}) = g(x_1, ..., x_n)$ be a real-valued function defined on the sample space of $\mathbf{X}$.
Then, for the random variable $g(\mathbf{X})$,
$$
(\text{discrete}): \quad E[g(\mathbf{X})] = \sum_{\mathbf{x} \in \mathbb{R}^n} g(\mathbf{x}) f(\mathbf{x}),
$$
$$
(\text{continuous}): \quad E[g(\mathbf{X})] = \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} g(\mathbf{x}) f(\mathbf{x})\,d\mathbf{x}.
$$
The marginal pdf or pmf of $(X_1, ..., X_k)$, the first $k$ coordinates of $(X_1, ..., X_n)$, is
given by
$$
(\text{discrete}): \quad f(x_1, ..., x_k) = \sum_{(x_{k+1}, ..., x_n) \in \mathbb{R}^{n-k}} f(x_1, ..., x_n),
$$
$$
(\text{continuous}): \quad f(x_1, ..., x_k) = \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} f(x_1, ..., x_n)\,dx_{k+1} \cdots dx_n,
$$
for every $(x_1, ..., x_k) \in \mathbb{R}^k$.

(Institute) EC509 This Version: 4 Nov 2013 94 / 123


Multivariate Distributions

As you would expect, if $f(x_1, ..., x_k) > 0$, then
$$
f(x_{k+1}, ..., x_n \mid x_1, ..., x_k) = \frac{f(x_1, ..., x_n)}{f(x_1, ..., x_k)}
$$
gives the conditional pmf or pdf of $(X_{k+1}, ..., X_n)$ given
$(X_1 = x_1, X_2 = x_2, ..., X_k = x_k)$, which is a function of $(x_1, ..., x_k)$.
Note that you can pick your k coordinates as you like. For example, if you have
(X1 , X2 , X3 , X4 , X5 ) and want to condition on X2 and X5 , then simply consider the
reordered random vector (X2 , X5 , X1 , X3 , X4 ) where one would condition on the …rst
two coordinates.
Let’s illustrate these concepts using the following example.

(Institute) EC509 This Version: 4 Nov 2013 95 / 123


Multivariate Distributions
Example (4.6.1): Let n = 4 and
$$
f(x_1, x_2, x_3, x_4) =
\begin{cases}
\frac{3}{4}\left(x_1^2 + x_2^2 + x_3^2 + x_4^2\right) & 0 < x_i < 1, \; i = 1, 2, 3, 4 \\
0 & \text{otherwise}
\end{cases}.
$$
Clearly this is a nonnegative function and it can be shown that
$$
\int_{-\infty}^{\infty}\int_{-\infty}^{\infty}\int_{-\infty}^{\infty}\int_{-\infty}^{\infty}
f(x_1, x_2, x_3, x_4)\,dx_1\,dx_2\,dx_3\,dx_4 = 1.
$$
For example, you can check that $P\left(X_1 < \frac{1}{2},\, X_2 < \frac{3}{4},\, X_4 > \frac{1}{2}\right)$ is given by
$$
\int_{1/2}^{1}\int_{0}^{1}\int_{0}^{3/4}\int_{0}^{1/2}
\frac{3}{4}\left(x_1^2 + x_2^2 + x_3^2 + x_4^2\right)dx_1\,dx_2\,dx_3\,dx_4 = \frac{171}{1024}.
$$
Now, consider
$$
f(x_1, x_2) = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} f(x_1, x_2, x_3, x_4)\,dx_3\,dx_4
= \int_{0}^{1}\int_{0}^{1} \frac{3}{4}\left(x_1^2 + x_2^2 + x_3^2 + x_4^2\right)dx_3\,dx_4
= \frac{3}{4}\left(x_1^2 + x_2^2\right) + \frac{1}{2},
$$
where $0 < x_1 < 1$ and $0 < x_2 < 1$.
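These integrations are easy to reproduce with a computer algebra system. The following sketch (not part of the original slides; it assumes sympy is available) checks the normalisation, the probability computed above and the marginal of $(X_1, X_2)$.

```python
# Symbolic checks for Example (4.6.1) (assumes sympy).
import sympy as sp

x1, x2, x3, x4 = sp.symbols('x1 x2 x3 x4', positive=True)
f = sp.Rational(3, 4) * (x1**2 + x2**2 + x3**2 + x4**2)

print(sp.integrate(f, (x1, 0, 1), (x2, 0, 1), (x3, 0, 1), (x4, 0, 1)))   # 1

prob = sp.integrate(f, (x1, 0, sp.Rational(1, 2)), (x2, 0, sp.Rational(3, 4)),
                    (x3, 0, 1), (x4, sp.Rational(1, 2), 1))
print(prob)                                   # 171/1024

marginal = sp.integrate(f, (x3, 0, 1), (x4, 0, 1))
print(sp.simplify(marginal))                  # 3*x1**2/4 + 3*x2**2/4 + 1/2
```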
(Institute) EC509 This Version: 4 Nov 2013 96 / 123
Multivariate Distributions

This marginal pdf can be used to compute any probability or expectation involving $X_1$
and $X_2$. For instance,
$$
E[X_1 X_2] = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} x_1 x_2\, f(x_1, x_2)\,dx_1\,dx_2
= \int_{0}^{1}\int_{0}^{1} x_1 x_2 \left[\frac{3}{4}\left(x_1^2 + x_2^2\right) + \frac{1}{2}\right]dx_1\,dx_2 = \frac{5}{16}.
$$
Now let's find the conditional distribution of $X_3$ and $X_4$ given $X_1 = x_1$ and $X_2 = x_2$.
For any $(x_1, x_2)$ where $0 < x_1 < 1$ and $0 < x_2 < 1$,
$$
f(x_3, x_4 \mid x_1, x_2) = \frac{f(x_1, x_2, x_3, x_4)}{f(x_1, x_2)}
= \frac{\frac{3}{4}\left(x_1^2 + x_2^2 + x_3^2 + x_4^2\right)}{\frac{3}{4}\left(x_1^2 + x_2^2\right) + \frac{1}{2}}
= \frac{x_1^2 + x_2^2 + x_3^2 + x_4^2}{x_1^2 + x_2^2 + \frac{2}{3}}.
$$

(Institute) EC509 This Version: 4 Nov 2013 97 / 123


Multivariate Distributions

Then, for example,
$$
\begin{aligned}
P\left(X_3 > \frac{3}{4},\, X_4 < \frac{1}{2} \,\Big|\, X_1 = \frac{1}{3},\, X_2 = \frac{2}{3}\right)
&= \int_{0}^{1/2}\int_{3/4}^{1}
\frac{(1/3)^2 + (2/3)^2 + x_3^2 + x_4^2}{(1/3)^2 + (2/3)^2 + 2/3}\,dx_3\,dx_4 \\
&= \int_{0}^{1/2}\int_{3/4}^{1}\left(\frac{5}{11} + \frac{9}{11}x_3^2 + \frac{9}{11}x_4^2\right)dx_3\,dx_4 \\
&= \int_{0}^{1/2}\left(\frac{5}{44} + \frac{111}{704} + \frac{9}{44}x_4^2\right)dx_4 \\
&= \frac{5}{88} + \frac{111}{1408} + \frac{3}{352} = \frac{203}{1408}.
\end{aligned}
$$
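The same symbolic approach verifies the conditional calculations (a sketch assuming sympy is available; the quantities checked are exactly those derived above):

```python
# Symbolic checks of E[X1 X2] and the conditional probability above (assumes sympy).
import sympy as sp

x1, x2, x3, x4 = sp.symbols('x1 x2 x3 x4', positive=True)
joint = sp.Rational(3, 4) * (x1**2 + x2**2 + x3**2 + x4**2)
marginal12 = sp.Rational(3, 4) * (x1**2 + x2**2) + sp.Rational(1, 2)

# E[X1 X2] from the marginal of (X1, X2).
print(sp.integrate(x1 * x2 * marginal12, (x1, 0, 1), (x2, 0, 1)))   # 5/16

# P(X3 > 3/4, X4 < 1/2 | X1 = 1/3, X2 = 2/3).
cond = (joint / marginal12).subs({x1: sp.Rational(1, 3), x2: sp.Rational(2, 3)})
prob = sp.integrate(cond, (x3, sp.Rational(3, 4), 1), (x4, 0, sp.Rational(1, 2)))
print(prob)                                                         # 203/1408
```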

(Institute) EC509 This Version: 4 Nov 2013 98 / 123


Multivariate Distributions
Let's introduce some other useful extensions to the concepts we covered for the
univariate and bivariate random variables.
Definition (4.6.5): Let $\mathbf{X}_1, ..., \mathbf{X}_n$ be random vectors with joint pdf or pmf
$f(\mathbf{x}_1, ..., \mathbf{x}_n)$. Let $f_{\mathbf{X}_i}(\mathbf{x}_i)$ denote the marginal pdf or pmf of $\mathbf{X}_i$. Then, $\mathbf{X}_1, ..., \mathbf{X}_n$ are
called mutually independent random vectors if, for every $(\mathbf{x}_1, ..., \mathbf{x}_n)$,
$$
f(\mathbf{x}_1, ..., \mathbf{x}_n) = f_{\mathbf{X}_1}(\mathbf{x}_1) \cdots f_{\mathbf{X}_n}(\mathbf{x}_n) = \prod_{i=1}^{n} f_{\mathbf{X}_i}(\mathbf{x}_i).
$$
If the $\mathbf{X}_i$s are all one-dimensional, then $X_1, ..., X_n$ are called mutually independent
random variables.
Clearly, if $X_1, ..., X_n$ are mutually independent, then knowledge about the values of
some coordinates gives us no information about the values of the other coordinates.
Remember that mutual independence implies pairwise independence, but the converse does not hold.
For example, it is possible to specify a probability distribution for $(X_1, ..., X_n)$ with
the property that each pair $X_i, X_j$ is pairwise independent but $X_1, ..., X_n$ are not
mutually independent.
Theorem (4.6.6) (Generalisation of Theorem 4.2.10): Let $X_1, ..., X_n$ be mutually
independent random variables. Let $g_1, ..., g_n$ be real-valued functions such that
$g_i(x_i)$ is a function only of $x_i$, $i = 1, ..., n$. Then
$$
E[g_1(X_1) \cdots g_n(X_n)] = E[g_1(X_1)]\, E[g_2(X_2)] \cdots E[g_n(X_n)].
$$

(Institute) EC509 This Version: 4 Nov 2013 99 / 123


Multivariate Distributions

Theorem (4.6.7) (Generalisation of Theorem 4.2.12): Let $X_1, ..., X_n$ be mutually
independent random variables with mgfs $M_{X_1}(t), ..., M_{X_n}(t)$. Let
$$
Z = X_1 + ... + X_n.
$$
Then, the mgf of $Z$ is
$$
M_Z(t) = M_{X_1}(t) \cdots M_{X_n}(t).
$$
In particular, if $X_1, ..., X_n$ all have the same distribution with mgf $M_X(t)$, then
$$
M_Z(t) = [M_X(t)]^n.
$$
Example (4.6.8): Suppose $X_1, ..., X_n$ are mutually independent random variables
and the distribution of $X_i$ is gamma$(\alpha_i, \beta)$. The mgf of a gamma$(\alpha, \beta)$ distribution
is $M(t) = (1 - \beta t)^{-\alpha}$. Thus, if $Z = X_1 + ... + X_n$, the mgf of $Z$ is
$$
M_Z(t) = M_{X_1}(t) \cdots M_{X_n}(t) = (1 - \beta t)^{-\alpha_1} \cdots (1 - \beta t)^{-\alpha_n}
= (1 - \beta t)^{-(\alpha_1 + ... + \alpha_n)}.
$$
This is the mgf of a gamma$(\alpha_1 + ... + \alpha_n, \beta)$ distribution. Thus, the sum of
independent gamma random variables that have a common scale parameter $\beta$ also
has a gamma distribution.
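A quick simulation sketch (not from the slides; it assumes numpy and scipy are available, and the shape parameters, scale and sample size are arbitrary) is consistent with this result: draws of $Z = X_1 + ... + X_n$ behave like draws from gamma$(\sum_i \alpha_i, \beta)$. Note that scipy parameterises the gamma distribution with shape $a = \alpha$ and scale $= \beta$.

```python
# Simulation check: a sum of independent gamma(alpha_i, beta) variables with a
# common scale beta behaves like gamma(sum(alpha_i), beta).  Assumes numpy/scipy.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
alphas, beta, n = [0.5, 1.0, 2.5], 2.0, 100_000

z = sum(rng.gamma(shape=a, scale=beta, size=n) for a in alphas)

print(z.mean(), sum(alphas) * beta)   # means agree: both approximately 8.0
ks = stats.kstest(z, stats.gamma(a=sum(alphas), scale=beta).cdf)
print(ks.pvalue)                      # a small p-value here would signal a poor fit
```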

(Institute) EC509 This Version: 4 Nov 2013 100 / 123


Multivariate Distributions
Corollary (4.6.9): Let $X_1, ..., X_n$ be mutually independent random variables with
mgfs $M_{X_1}(t), ..., M_{X_n}(t)$. Let $a_1, ..., a_n$ and $b_1, ..., b_n$ be fixed constants. Let
$$
Z = (a_1 X_1 + b_1) + ... + (a_n X_n + b_n).
$$
Then, the mgf of $Z$ is
$$
M_Z(t) = \exp\left(t \sum_{i=1}^{n} b_i\right) M_{X_1}(a_1 t) \cdots M_{X_n}(a_n t).
$$
Proof: From the definition, the mgf of $Z$ is
$$
\begin{aligned}
M_Z(t) = E[\exp(tZ)] &= E\left[\exp\left(t \sum_{i=1}^{n} (a_i X_i + b_i)\right)\right] \\
&= \exp\left(t \sum_{i=1}^{n} b_i\right) E[\exp(t a_1 X_1) \cdots \exp(t a_n X_n)] \\
&= \exp\left(t \sum_{i=1}^{n} b_i\right) M_{X_1}(a_1 t) \cdots M_{X_n}(a_n t),
\end{aligned}
$$
where the last equality uses mutual independence (Theorem 4.6.6). This yields the desired result.


(Institute) EC509 This Version: 4 Nov 2013 101 / 123
Multivariate Distributions

Corollary (4.6.10): Let $X_1, ..., X_n$ be mutually independent random variables with
$X_i \sim N(\mu_i, \sigma_i^2)$. Let $a_1, ..., a_n$ and $b_1, ..., b_n$ be fixed constants. Then,
$$
Z = \sum_{i=1}^{n} (a_i X_i + b_i) \;\sim\; N\left(\sum_{i=1}^{n} (a_i \mu_i + b_i),\; \sum_{i=1}^{n} a_i^2 \sigma_i^2\right).
$$
Proof: Recall that the mgf of a $N(\mu, \sigma^2)$ random variable is
$M(t) = \exp\left(\mu t + \sigma^2 t^2 / 2\right)$. Substituting into the expression in Corollary (4.6.9)
yields
$$
\begin{aligned}
M_Z(t) &= \exp\left(t \sum_{i=1}^{n} b_i\right)
\exp\left(\mu_1 a_1 t + \frac{\sigma_1^2 a_1^2 t^2}{2}\right) \cdots \exp\left(\mu_n a_n t + \frac{\sigma_n^2 a_n^2 t^2}{2}\right) \\
&= \exp\left(\sum_{i=1}^{n} (a_i \mu_i + b_i)\, t + \frac{\sum_{i=1}^{n} a_i^2 \sigma_i^2\; t^2}{2}\right),
\end{aligned}
$$
which is the mgf of the normal distribution given in Corollary (4.6.10).


This is an important result. A linear combination of independent normal random
variables is normally distributed.
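A simulation sketch of this corollary (assuming numpy; the coefficients, means and standard deviations below are arbitrary choices) confirms the mean and variance formulas:

```python
# Check the mean and variance of Z = sum_i (a_i X_i + b_i) for independent normals.
# Assumes numpy; all constants are illustrative.
import numpy as np

rng = np.random.default_rng(2)
mu = np.array([0.0, 1.0, -2.0])
sigma = np.array([1.0, 0.5, 2.0])
a = np.array([2.0, -1.0, 0.5])
b = np.array([1.0, 0.0, -3.0])
n = 200_000

x = rng.normal(mu, sigma, size=(n, 3))   # independent N(mu_i, sigma_i^2) draws
z = (a * x + b).sum(axis=1)

print(z.mean(), (a * mu + b).sum())      # both approximately -4.0
print(z.var(), (a**2 * sigma**2).sum())  # both approximately 5.25
```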

(Institute) EC509 This Version: 4 Nov 2013 102 / 123


Multivariate Distributions

Theorem (4.6.11) (Generalisation of Lemma 4.2.7): Let $\mathbf{X}_1, ..., \mathbf{X}_n$ be random
vectors. Then $\mathbf{X}_1, ..., \mathbf{X}_n$ are mutually independent random vectors if and only if
there exist functions $g_i(\mathbf{x}_i)$, $i = 1, ..., n$, such that the joint pdf or pmf of
$(\mathbf{X}_1, ..., \mathbf{X}_n)$ can be written as
$$
f(\mathbf{x}_1, ..., \mathbf{x}_n) = g_1(\mathbf{x}_1) \cdots g_n(\mathbf{x}_n).
$$
Theorem (4.6.12) (Generalisation of Theorem 4.3.5): Let $\mathbf{X}_1, ..., \mathbf{X}_n$ be
independent random vectors. Let $g_i(\mathbf{x}_i)$ be a function only of $\mathbf{x}_i$, $i = 1, ..., n$. Then,
the random variables $U_i = g_i(\mathbf{X}_i)$, $i = 1, ..., n$, are mutually independent.

(Institute) EC509 This Version: 4 Nov 2013 103 / 123


Multivariate Distributions
We can also calculate the distribution of a transformation of a random vector, in the
same fashion as before.
Let $(X_1, ..., X_n)$ be a random vector with pdf $f_X(x_1, ..., x_n)$. Let
$\mathcal{A} = \{\mathbf{x} : f_X(\mathbf{x}) > 0\}$. Consider a new random vector $(U_1, ..., U_n)$, defined by
$U_1 = g_1(X_1, ..., X_n)$, $U_2 = g_2(X_1, ..., X_n)$, ..., $U_n = g_n(X_1, ..., X_n)$.
Suppose that $A_0, A_1, ..., A_k$ form a partition of $\mathcal{A}$ with the following properties.
The set $A_0$, which may be empty, satisfies $P((X_1, ..., X_n) \in A_0) = 0$.
The transformation
$$
(U_1, ..., U_n) = (g_1(\mathbf{X}), g_2(\mathbf{X}), ..., g_n(\mathbf{X}))
$$
is a one-to-one transformation from $A_i$ onto $\mathcal{B}$ for each $i = 1, 2, ..., k$. Then for each
$i$, the inverse functions from $\mathcal{B}$ to $A_i$ can be found. Denote the $i$th inverse by
$$
x_1 = h_{1i}(u_1, ..., u_n), \quad x_2 = h_{2i}(u_1, ..., u_n), \quad ..., \quad x_n = h_{ni}(u_1, ..., u_n).
$$
This $i$th inverse gives, for $(u_1, ..., u_n) \in \mathcal{B}$, the unique $(x_1, ..., x_n) \in A_i$ such that
$$
(u_1, ..., u_n) = (g_1(x_1, ..., x_n), g_2(x_1, ..., x_n), ..., g_n(x_1, ..., x_n)).
$$

(Institute) EC509 This Version: 4 Nov 2013 104 / 123


Multivariate Distributions

Let $J_i$ denote the Jacobian computed from the $i$th inverse,
$$
J_i = \begin{vmatrix}
\frac{\partial h_{1i}(\mathbf{u})}{\partial u_1} & \frac{\partial h_{1i}(\mathbf{u})}{\partial u_2} & \cdots & \frac{\partial h_{1i}(\mathbf{u})}{\partial u_n} \\
\frac{\partial h_{2i}(\mathbf{u})}{\partial u_1} & \frac{\partial h_{2i}(\mathbf{u})}{\partial u_2} & \cdots & \frac{\partial h_{2i}(\mathbf{u})}{\partial u_n} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial h_{ni}(\mathbf{u})}{\partial u_1} & \frac{\partial h_{ni}(\mathbf{u})}{\partial u_2} & \cdots & \frac{\partial h_{ni}(\mathbf{u})}{\partial u_n}
\end{vmatrix},
$$
which is the determinant of an $n \times n$ matrix.
Then, assuming that these Jacobians are non-zero on $\mathcal{B}$,
$$
f_U(u_1, ..., u_n) = \sum_{i=1}^{k} f_X\left(h_{1i}(u_1, ..., u_n), ..., h_{ni}(u_1, ..., u_n)\right) |J_i|.
$$
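For the simplest one-to-one case ($k = 1$), the recipe can be carried out symbolically. The sketch below (not from the slides; it assumes sympy is available, and the transformation $U_1 = X_1 + X_2$, $U_2 = X_1 - X_2$ together with the standard normal density are illustrative choices) computes the inverse, the Jacobian and the resulting joint pdf of $(U_1, U_2)$.

```python
# Change of variables for U1 = X1 + X2, U2 = X1 - X2 (one-to-one, so k = 1).
# Assumes sympy; the transformation and density are illustrative choices.
import sympy as sp

x1, x2, u1, u2 = sp.symbols('x1 x2 u1 u2', real=True)

# Inverse transformation: x1 = (u1 + u2)/2, x2 = (u1 - u2)/2.
h1 = (u1 + u2) / 2
h2 = (u1 - u2) / 2

J = sp.Matrix([[sp.diff(h1, u1), sp.diff(h1, u2)],
               [sp.diff(h2, u1), sp.diff(h2, u2)]]).det()
print(J)   # -1/2, so |J| = 1/2

# Example: X1, X2 independent standard normals.
f_X = sp.exp(-(x1**2 + x2**2) / 2) / (2 * sp.pi)
f_U = f_X.subs({x1: h1, x2: h2}) * sp.Abs(J)
print(sp.simplify(f_U))   # exp(-(u1**2 + u2**2)/4)/(4*pi): two independent N(0, 2) densities
```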

(Institute) EC509 This Version: 4 Nov 2013 105 / 123


Inequalities

We will now cover some basic inequalities used in statistics and econometrics.
Most of the time, more complicated expressions can be written in terms of simpler
expressions. Inequalities on these simpler expressions can then be used to obtain an
inequality, or often a bound, on the original complicated term.
This part is based on Sections 3.6 and 4.7 in Casella & Berger.

(Institute) EC509 This Version: 4 Nov 2013 106 / 123


Inequalities
We start with one of the most famous probability inequalities.
Theorem (3.6.1) (Chebychev's Inequality): Let $X$ be a random variable and let
$g(x)$ be a nonnegative function. Then, for any $r > 0$,
$$
P(g(X) \geq r) \leq \frac{E[g(X)]}{r}.
$$
Proof: Using the definition of the expected value,
$$
\begin{aligned}
E[g(X)] &= \int_{-\infty}^{\infty} g(x) f_X(x)\,dx \\
&\geq \int_{\{x : g(x) \geq r\}} g(x) f_X(x)\,dx \\
&\geq r \int_{\{x : g(x) \geq r\}} f_X(x)\,dx \\
&= r\, P(g(X) \geq r).
\end{aligned}
$$
Hence,
$$
E[g(X)] \geq r\, P(g(X) \geq r),
$$
implying
$$
P(g(X) \geq r) \leq \frac{E[g(X)]}{r}.
$$

(Institute) EC509 This Version: 4 Nov 2013 107 / 123


Inequalities

The interesting feature of the inequality is that it manages to convert a probability
statement into a statement about the expectation and vice versa.
Example (3.6.2): Let $g(x) = (x - \mu)^2 / \sigma^2$, where $\mu = E[X]$ and $\sigma^2 = Var(X)$.
Let, for convenience, $r = t^2$. Then,
$$
P\left(\frac{(X - \mu)^2}{\sigma^2} \geq t^2\right) \leq \frac{1}{t^2} E\left[\frac{(X - \mu)^2}{\sigma^2}\right] = \frac{1}{t^2}.
$$
This implies that
$$
P\left[\,|X - \mu| \geq t\sigma\,\right] \leq \frac{1}{t^2},
$$
and, consequently,
$$
P\left[\,|X - \mu| < t\sigma\,\right] \geq 1 - \frac{1}{t^2}.
$$
Therefore, for instance for $t = 2$, we get
$$
P\left[\,|X - \mu| \geq 2\sigma\,\right] \leq 0.25 \quad \text{or} \quad P\left[\,|X - \mu| < 2\sigma\,\right] \geq 0.75.
$$
This says that there is at least a 75% chance that a random variable will be within
$2\sigma$ of its mean (independent of the distribution of $X$).

(Institute) EC509 This Version: 4 Nov 2013 108 / 123


Inequalities
This information is useful. However, many times, it might be possible to obtain an
even tighter bound in the following sense.
Example (3.6.3): Let $Z \sim N(0, 1)$. Then,
$$
\begin{aligned}
P(Z \geq t) &= \frac{1}{\sqrt{2\pi}} \int_{t}^{\infty} e^{-x^2/2}\,dx \\
&\leq \frac{1}{\sqrt{2\pi}} \int_{t}^{\infty} \frac{x}{t}\, e^{-x^2/2}\,dx \\
&= \frac{1}{\sqrt{2\pi}} \frac{e^{-t^2/2}}{t}.
\end{aligned}
$$
The second inequality follows from the fact that for $x > t$, $x/t > 1$. Now, since $Z$
has a symmetric distribution, $P(|Z| \geq t) = 2 P(Z \geq t)$. Hence,
$$
P(|Z| \geq t) \leq \sqrt{\frac{2}{\pi}}\, \frac{e^{-t^2/2}}{t}.
$$
Set $t = 2$ and observe that
$$
P(|Z| \geq 2) \leq \sqrt{\frac{2}{\pi}}\, \frac{e^{-2}}{2} \approx .054.
$$
This is a much tighter bound compared to the one given by Chebychev's Inequality.
Generally, Chebychev's Inequality provides a more conservative bound.
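The comparison is easy to reproduce numerically (a minimal sketch assuming numpy and scipy are available):

```python
# At t = 2 for Z ~ N(0,1): exact two-sided tail, the bound above, and Chebychev.
# Assumes numpy/scipy.
import numpy as np
from scipy.stats import norm

t = 2.0
exact = 2 * (1 - norm.cdf(t))                              # about 0.0455
tail_bound = np.sqrt(2 / np.pi) * np.exp(-t**2 / 2) / t    # about 0.054
chebychev = 1 / t**2                                       # 0.25

print(exact, tail_bound, chebychev)
```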
(Institute) EC509 This Version: 4 Nov 2013 109 / 123
Inequalities

A related inequality is Markov's Inequality.
Lemma (3.8.3): If $P(Y \geq 0) = 1$ and $P(Y = 0) < 1$, then, for any $r > 0$,
$$
P(Y \geq r) \leq \frac{E[Y]}{r},
$$
and the relationship holds with equality if and only if
$P(Y = r) = p = 1 - P(Y = 0)$, where $0 < p \leq 1$.
The more general form of Chebychev's Inequality, provided in White (2001), is as
follows.
Proposition 2.41 (White 2001): Let $X$ be a random variable such that
$E[|X|^r] < \infty$, $r > 0$. Then, for every $t > 0$,
$$
P(|X| \geq t) \leq \frac{E[|X|^r]}{t^r}.
$$
Setting $r = 2$ (and re-arranging) gives the usual Chebychev's Inequality. If we
let $r = 1$, then we obtain Markov's Inequality. See White (2001, pp. 29-30).

(Institute) EC509 This Version: 4 Nov 2013 110 / 123


Inequalities

We continue with inequalities involving expectations. However, first we present a
useful Lemma.
Lemma (4.7.1): Let $a$ and $b$ be any positive numbers, and let $p$ and $q$ be any
positive numbers (necessarily greater than 1) satisfying
$$
\frac{1}{p} + \frac{1}{q} = 1.
$$
Then,
$$
\frac{1}{p} a^p + \frac{1}{q} b^q \geq ab, \qquad (2)
$$
with equality if and only if $a^p = b^q$.

(Institute) EC509 This Version: 4 Nov 2013 111 / 123


Inequalities

Proof: Fix $b$, and consider the function
$$
g(a) = \frac{1}{p} a^p + \frac{1}{q} b^q - ab.
$$
To minimise $g(a)$, differentiate and set equal to 0:
$$
\frac{d}{da} g(a) = 0 \implies a^{p-1} - b = 0 \implies b = a^{p-1}.
$$
A check of the second derivative will establish that this is indeed a minimum. The
value of the function at the minimum is
$$
\frac{1}{p} a^p + \frac{1}{q} \left(a^{p-1}\right)^q - a\, a^{p-1}
= \frac{1}{p} a^p + \frac{1}{q} a^p - a^p = 0.
$$
Note that $(p-1)q = p$ since $p^{-1} + q^{-1} = 1$.
Hence, the unique minimum is zero and (2) is established. Since the minimum is
unique, equality holds only if $a^{p-1} = b$, which is equivalent to $a^p = b^q$, again from
$p^{-1} + q^{-1} = 1$.

(Institute) EC509 This Version: 4 Nov 2013 112 / 123


Inequalities
Theorem (4.7.2) (Hölder's Inequality): Let $X$ and $Y$ be any two random
variables, and let $p$ and $q$ be such that
$$
\frac{1}{p} + \frac{1}{q} = 1.
$$
Then,
$$
|E[XY]| \leq E[|XY|] \leq \left\{E[|X|^p]\right\}^{1/p} \left\{E[|Y|^q]\right\}^{1/q}.
$$
Proof: The first inequality follows from $-|XY| \leq XY \leq |XY|$ and Theorem (2.2.5).
To prove the second inequality, define
$$
a = \frac{|X|}{\left\{E[|X|^p]\right\}^{1/p}} \quad \text{and} \quad b = \frac{|Y|}{\left\{E[|Y|^q]\right\}^{1/q}}.
$$
Applying Lemma (4.7.1), we get
$$
\frac{1}{p} \frac{|X|^p}{E[|X|^p]} + \frac{1}{q} \frac{|Y|^q}{E[|Y|^q]}
\geq \frac{|XY|}{\left\{E[|X|^p]\right\}^{1/p} \left\{E[|Y|^q]\right\}^{1/q}}.
$$
Then,
$$
E\left[\frac{1}{p} \frac{|X|^p}{E[|X|^p]} + \frac{1}{q} \frac{|Y|^q}{E[|Y|^q]}\right] = \frac{1}{p} + \frac{1}{q} = 1,
$$
and so
$$
E[|XY|] \leq \left\{E[|X|^p]\right\}^{1/p} \left\{E[|Y|^q]\right\}^{1/q}.
$$

(Institute) EC509 This Version: 4 Nov 2013 113 / 123


Inequalities
Now consider a common special case of Hölder's Inequality.
Theorem (4.7.3) (Cauchy-Schwarz Inequality): For any two random variables $X$
and $Y$,
$$
|E[XY]| \leq E[|XY|] \leq \left\{E[|X|^2]\right\}^{1/2} \left\{E[|Y|^2]\right\}^{1/2}.
$$
Proof: Set $p = q = 2$.
Example (4.7.4): If $X$ and $Y$ have means $\mu_X$ and $\mu_Y$ and variances $\sigma_X^2$ and $\sigma_Y^2$,
respectively, we can apply the Cauchy-Schwarz Inequality to get
$$
E\left[\,|(X - \mu_X)(Y - \mu_Y)|\,\right] \leq \left\{E\left[(X - \mu_X)^2\right]\right\}^{1/2} \left\{E\left[(Y - \mu_Y)^2\right]\right\}^{1/2}.
$$
Squaring both sides yields
$$
[Cov(X, Y)]^2 \leq \sigma_X^2 \sigma_Y^2.
$$
A particular implication of the previous result is that
$$
-1 \leq \rho_{XY} \leq 1.
$$
One can show this without using the Cauchy-Schwarz Inequality, as well. However,
the corresponding calculations would be a lot more tedious. See the proof of Theorem
4.5.7 in Casella & Berger (2001, pp. 172-173).
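A quick numerical illustration of the squared-covariance bound and the resulting restriction on the correlation (assuming numpy; the data-generating process is an arbitrary choice):

```python
# [Cov(X,Y)]^2 <= Var(X) Var(Y), hence -1 <= rho_XY <= 1.  Assumes numpy.
import numpy as np

rng = np.random.default_rng(3)
x = rng.exponential(scale=2.0, size=100_000)
y = 0.5 * x + rng.normal(size=x.size)      # correlated with x by construction

cov_xy = np.cov(x, y)[0, 1]
var_x, var_y = x.var(ddof=1), y.var(ddof=1)

print(cov_xy**2 <= var_x * var_y)          # True
print(np.corrcoef(x, y)[0, 1])             # strictly between -1 and 1
```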
(Institute) EC509 This Version: 4 Nov 2013 114 / 123
Inequalities

Theorem (4.7.5) (Minkowski's Inequality): Let $X$ and $Y$ be any two random
variables. Then, for $1 \leq p < \infty$,
$$
\left\{E[|X + Y|^p]\right\}^{1/p} \leq \left\{E[|X|^p]\right\}^{1/p} + \left\{E[|Y|^p]\right\}^{1/p}.
$$
Proof: We will use the triangle inequality, which says that
$$
|X + Y| \leq |X| + |Y|.
$$
Now,
$$
\begin{aligned}
E[|X + Y|^p] &= E\left[|X + Y|\, |X + Y|^{p-1}\right] \\
&\leq E\left[|X|\, |X + Y|^{p-1}\right] + E\left[|Y|\, |X + Y|^{p-1}\right], \qquad (3)
\end{aligned}
$$
where we have used the triangle inequality.

(Institute) EC509 This Version: 4 Nov 2013 115 / 123


Inequalities
Next, we apply Hölder's Inequality to each expectation on the right-hand side of (3)
and obtain
$$
E[|X + Y|^p] \leq \left\{E[|X|^p]\right\}^{1/p} \left\{E\left[|X + Y|^{q(p-1)}\right]\right\}^{1/q}
+ \left\{E[|Y|^p]\right\}^{1/p} \left\{E\left[|X + Y|^{q(p-1)}\right]\right\}^{1/q},
$$
where $q$ is such that $p^{-1} + q^{-1} = 1$. Notice that $q(p-1) = p$ and $1 - q^{-1} = p^{-1}$.
Then,
$$
\frac{E[|X + Y|^p]}{\left\{E\left[|X + Y|^{q(p-1)}\right]\right\}^{1/q}}
= \frac{E[|X + Y|^p]}{\left\{E[|X + Y|^p]\right\}^{1/q}}
= \left\{E[|X + Y|^p]\right\}^{1 - \frac{1}{q}}
= \left\{E[|X + Y|^p]\right\}^{\frac{1}{p}},
$$
and so
$$
\left\{E[|X + Y|^p]\right\}^{1/p} \leq \left\{E[|X|^p]\right\}^{1/p} + \left\{E[|Y|^p]\right\}^{1/p}.
$$

(Institute) EC509 This Version: 4 Nov 2013 116 / 123


Inequalities

The previous results can be used for the case of numerical sums, as well.
For example, let $a_1, ..., a_n$ and $b_1, ..., b_n$ be positive nonrandom values. Let $X$ be a
random variable with range $a_1, ..., a_n$ and $P(X = a_i) = 1/n$, $i = 1, ..., n$. Similarly,
let $Y$ be a random variable taking on values $b_1, ..., b_n$ with probability
$P(Y = b_i) = 1/n$. Moreover, let $p$ and $q$ be such that $p^{-1} + q^{-1} = 1$.
Then,
$$
E[|XY|] = \frac{1}{n} \sum_{i=1}^{n} |a_i b_i|,
\qquad
\left\{E[|X|^p]\right\}^{1/p} = \left[\frac{1}{n} \sum_{i=1}^{n} |a_i|^p\right]^{1/p}
\qquad \text{and} \qquad
\left\{E[|Y|^q]\right\}^{1/q} = \left[\frac{1}{n} \sum_{i=1}^{n} |b_i|^q\right]^{1/q}.
$$

(Institute) EC509 This Version: 4 Nov 2013 117 / 123


Inequalities
Hence, by Hölder's Inequality,
$$
\frac{1}{n} \sum_{i=1}^{n} |a_i b_i|
\leq \left[\frac{1}{n} \sum_{i=1}^{n} |a_i|^p\right]^{1/p} \left[\frac{1}{n} \sum_{i=1}^{n} |b_i|^q\right]^{1/q}
= \left(\frac{1}{n}\right)^{\frac{1}{p} + \frac{1}{q}} \left[\sum_{i=1}^{n} |a_i|^p\right]^{1/p} \left[\sum_{i=1}^{n} |b_i|^q\right]^{1/q},
$$
and so, since $\frac{1}{p} + \frac{1}{q} = 1$,
$$
\sum_{i=1}^{n} |a_i b_i| \leq \left[\sum_{i=1}^{n} |a_i|^p\right]^{1/p} \left[\sum_{i=1}^{n} |b_i|^q\right]^{1/q}.
$$
A special case of this result is obtained by letting $b_i = 1$ for all $i$ and setting
$p = q = 2$. Then,
$$
\sum_{i=1}^{n} |a_i| \leq \left[\sum_{i=1}^{n} |a_i|^2\right]^{1/2} \left[\sum_{i=1}^{n} 1\right]^{1/2}
= \left[\sum_{i=1}^{n} |a_i|^2\right]^{1/2} n^{1/2},
$$
and so,
$$
\frac{1}{n} \left(\sum_{i=1}^{n} |a_i|\right)^2 \leq \sum_{i=1}^{n} a_i^2.
$$
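Both sum inequalities are easy to check numerically (a sketch assuming numpy; the vectors and the exponent $p$ are arbitrary choices):

```python
# Sum form of Hölder's Inequality and the special case (1/n)(sum|a_i|)^2 <= sum a_i^2.
# Assumes numpy; the vectors and exponents are illustrative.
import numpy as np

rng = np.random.default_rng(4)
a = rng.uniform(0.1, 5.0, size=50)
b = rng.uniform(0.1, 5.0, size=50)
p = 3.0
q = p / (p - 1)                  # so that 1/p + 1/q = 1

lhs = np.sum(np.abs(a * b))
rhs = np.sum(np.abs(a)**p)**(1 / p) * np.sum(np.abs(b)**q)**(1 / q)
print(lhs <= rhs)                                     # True

n = a.size
print(np.sum(np.abs(a))**2 / n <= np.sum(a**2))       # True
```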

(Institute) EC509 This Version: 4 Nov 2013 118 / 123


Inequalities

Next, we turn to functional inequalities. These are inequalities that rely on
properties of real-valued functions rather than on any statistical properties.
Let's start by defining a convex function.
Definition (4.7.6): A function $g(x)$ is convex if
$$
g(\lambda x + (1 - \lambda) y) \leq \lambda g(x) + (1 - \lambda) g(y), \quad \text{for all } x \text{ and } y, \text{ and } 0 < \lambda < 1.
$$
The function $g(x)$ is concave if $-g(x)$ is convex.
Now we can introduce Jensen's Inequality.
Theorem (4.7.7): For any random variable $X$, if $g(x)$ is a convex function, then
$$
E[g(X)] \geq g(E[X]).
$$

(Institute) EC509 This Version: 4 Nov 2013 119 / 123


Inequalities
Proof: Let $\ell(x)$ be a tangent line to $g(x)$ at the point $g(E[X])$. Write
$\ell(x) = a + bx$ for some $a$ and $b$.
By the convexity of $g$, we have $g(x) \geq a + bx$. Since expectations preserve
inequalities,
$$
E[g(X)] \geq E[a + bX] = a + bE[X] = \ell(E[X]) = g(E[X]).
$$
Note, for example, that $g(x) = x^2$ is a convex function. Then,
$$
E[X^2] \geq \left\{E[X]\right\}^2.
$$
Moreover, if $x$ is positive, then $1/x$ is convex and
$$
E\left[\frac{1}{X}\right] \geq \frac{1}{E[X]}.
$$
Note that, for concave functions, we have, accordingly,
$$
E[g(X)] \leq g(E[X]).
$$
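A short Monte Carlo illustration of the two consequences of Jensen's Inequality noted above (assuming numpy; the distribution is an arbitrary positive one):

```python
# E[X^2] >= (E[X])^2 and, for positive X, E[1/X] >= 1/E[X].  Assumes numpy.
import numpy as np

rng = np.random.default_rng(5)
x = rng.gamma(shape=3.0, scale=1.5, size=200_000)   # a positive random variable

print(np.mean(x**2) >= np.mean(x)**2)       # True
print(np.mean(1 / x) >= 1 / np.mean(x))     # True
```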

(Institute) EC509 This Version: 4 Nov 2013 120 / 123


Inequalities

We finish by providing the following theorem.
Theorem (4.7.9): Let $X$ be any random variable and $g(x)$ and $h(x)$ any functions
such that $E[g(X)]$, $E[h(X)]$ and $E[g(X)h(X)]$ exist.
1. If $g(x)$ is a nondecreasing function and $h(x)$ is a nonincreasing function, then
$E[g(X)h(X)] \leq E[g(X)]\,E[h(X)]$.
2. If $g(x)$ and $h(x)$ are either both nondecreasing or both nonincreasing, then
$E[g(X)h(X)] \geq E[g(X)]\,E[h(X)]$.
Unlike, say, Hölder's Inequality or the Cauchy-Schwarz Inequality, this inequality allows
us to bound an expectation without using higher-order moments.
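A simulation sketch of both parts (assuming numpy; the functions and the distribution of $X$ are arbitrary choices that satisfy the monotonicity conditions):

```python
# Part 1: g nondecreasing, h nonincreasing => E[g(X)h(X)] <= E[g(X)]E[h(X)].
# Part 2: both nondecreasing              => E[g(X)h(X)] >= E[g(X)]E[h(X)].
# Assumes numpy; functions and distribution are illustrative.
import numpy as np

rng = np.random.default_rng(6)
x = rng.normal(size=500_000)

g = np.exp(x)      # nondecreasing
h = -x**3          # nonincreasing
print(np.mean(g * h) <= np.mean(g) * np.mean(h))    # True (part 1)

h2 = x**3          # nondecreasing, like g
print(np.mean(g * h2) >= np.mean(g) * np.mean(h2))  # True (part 2)
```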

(Institute) EC509 This Version: 4 Nov 2013 121 / 123


Inequalities

Let us prove the first part of this Theorem for the easier case where
$E[g(X)] = E[h(X)] = 0$.
Now,
$$
\begin{aligned}
E[g(X)h(X)] &= \int_{-\infty}^{\infty} g(x)h(x)f_X(x)\,dx \\
&= \int_{\{x : h(x) \leq 0\}} g(x)h(x)f_X(x)\,dx + \int_{\{x : h(x) \geq 0\}} g(x)h(x)f_X(x)\,dx.
\end{aligned}
$$
Let $x_0$ be such that $h(x_0) = 0$. Then, since $h(x)$ is nonincreasing,
$$
\{x : h(x) \leq 0\} = \{x : x_0 \leq x < \infty\} \quad \text{and} \quad \{x : h(x) \geq 0\} = \{x : -\infty < x \leq x_0\}.
$$
Moreover, since $g(x)$ is nondecreasing,
$$
g(x_0) = \min_{\{x : h(x) \leq 0\}} g(x) \quad \text{and} \quad g(x_0) = \max_{\{x : h(x) \geq 0\}} g(x).
$$

(Institute) EC509 This Version: 4 Nov 2013 122 / 123


Inequalities

Hence,
$$
\int_{\{x : h(x) \leq 0\}} g(x)h(x)f_X(x)\,dx \leq g(x_0) \int_{\{x : h(x) \leq 0\}} h(x)f_X(x)\,dx,
$$
$$
\text{and} \quad \int_{\{x : h(x) \geq 0\}} g(x)h(x)f_X(x)\,dx \leq g(x_0) \int_{\{x : h(x) \geq 0\}} h(x)f_X(x)\,dx.
$$
Thus,
$$
\begin{aligned}
E[g(X)h(X)] &= \int_{\{x : h(x) \leq 0\}} g(x)h(x)f_X(x)\,dx + \int_{\{x : h(x) \geq 0\}} g(x)h(x)f_X(x)\,dx \\
&\leq g(x_0) \int_{\{x : h(x) \leq 0\}} h(x)f_X(x)\,dx + g(x_0) \int_{\{x : h(x) \geq 0\}} h(x)f_X(x)\,dx \\
&= g(x_0) \int_{-\infty}^{\infty} h(x)f_X(x)\,dx = g(x_0)E[h(X)] = 0.
\end{aligned}
$$
You can try to do the proof of the second part following the same method.

(Institute) EC509 This Version: 4 Nov 2013 123 / 123
