
Probability IIb¹

Petri Koistinen, 2013

Translated from the Finnish original by Andrei Ovod, 2019

Translation revised by Tuomas Hytönen, 2020

¹ Chapter 11 currently missing
Contents

6 Inequalities 3
6.1 Markov and Chebyshev inequalities . . . . . . . . . . . . . . . . . 3
6.2 Convex functions and Jensen’s inequality . . . . . . . . . . . . . 5
6.3 Hölder’s inequality . . . . . . . . . . . . . . . . . . . . . . . . . . 7
6.4 Cauchy–Schwarz inequality and correlation . . . . . . . . . . . . 9
6.5 Inequalities for generating functions . . . . . . . . . . . . . . . . 9

7 Bivariate distribution 11
7.1 Continuous bivariate distribution . . . . . . . . . . . . . . . . . . 11
7.2 Uniform distribution in a planar region . . . . . . . . . . . . . . . 15
7.3 Independence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
7.4 Expectation of a transformed random vector . . . . . . . . . . . . 19
7.5 Covariance and other joint moments . . . . . . . . . . . . . . . . 21
7.6 Best linear approximation . . . . . . . . . . . . . . . . . . . . . . 22
7.7 Expectation vector and covariance matrix . . . . . . . . . . . . . 24
7.8 Distribution of a transformed random vector . . . . . . . . . . . 26
7.9 Properties of the t-distribution . . . . . . . . . . . . . . . . . . . 31

8 Conditional distribution 34
8.1 Conditional distributions . . . . . . . . . . . . . . . . . . . . . . . 34
8.2 The chain rule (multiplication rule) . . . . . . . . . . . . . . . . . 36
8.3 The joint distribution of discrete and continuous random variables 38
8.4 The conditional expectation . . . . . . . . . . . . . . . . . . . . . 39
8.5 Hierarchical definition of a joint distribution . . . . . . . . . . . . 43

9 Multivariate distribution 47
9.1 Random vector . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
9.2 The expectation vector and the covariance matrix . . . . . . . . . 49
9.3 Conditional distributions, the multiplication rule and the condi-
tional expectation . . . . . . . . . . . . . . . . . . . . . . . . . . 53
9.4 Conditional independence . . . . . . . . . . . . . . . . . . . . . . 55
9.5 Statistical models . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
9.6 The density function transformation formula . . . . . . . . . . . 59
9.7 The moment generating function of a random vector . . . . . . . 61

10 Multivariate normal distribution 65
10.1 The standard normal distribution Nn (0, I) . . . . . . . . . . . . . 65
10.2 General multinormal distribution . . . . . . . . . . . . . . . . . . 66
10.3 The distribution of an affine transformation . . . . . . . . . . . . 68
10.4 The density function . . . . . . . . . . . . . . . . . . . . . . . . . 68
10.5 The contours of the density function . . . . . . . . . . . . . . . . 69
10.6 Non-correlation and independence . . . . . . . . . . . . . . . . . 71
10.7 Conditional distributions . . . . . . . . . . . . . . . . . . . . . . . 74
10.8 The two dimensional normal distribution . . . . . . . . . . . . . . 75
10.9 The joint distribution of the sample mean and variance of a nor-
mal distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76

Chapter 6

Inequalities

Inequalities are among the most powerful tools available to a mathematician.
An equation a = b says that two supposedly different things are in fact the
same, while an inequality a ≤ b may give us an opportunity to make a complex
situation easier to understand.
Probability inequalities are typically used as aids in theoretical arguments,
e.g. in deriving limits, and in many other applications. Inequalities can also
be used to analyze the complexity of randomized algorithms.
When working with inequalities, one must pay close attention to the form of
both sides of the inequality as well as to the conditions under which the
inequality is valid. Furthermore, it is essential to always keep in mind which
side of the inequality is the larger one. Mathematical argumentation
often uses long inequality sequences

a ≤ a1 ≤ · · · ≤ ak ≤ b

in order to derive a result a ≤ b. If even one ≤ sign in this chain is accidentally
written pointing in the wrong direction, the whole argument becomes invalid.

6.1 Markov and Chebyshev inequalities


Theorem 6.1 (Markov's inequality). Let X ≥ 0 be a random variable with a
finite expectation. In this case

P(X ≥ a) ≤ EX/a,   ∀a > 0.

Proof. Let the random variable Y be defined by

Y = a · 1_{[a,∞)}(X) = 0 when 0 ≤ X < a,  and  Y = a when X ≥ a.

As Y ≤ X,

EX ≥ EY = 0 · P (Y = 0) + a · P (Y = a) = aP (X ≥ a).

Theorem 6.2 (Chebyshev's inequality). Let X be a random variable with
expectation µ and variance σ². Then

P[|X − µ| ≥ t] ≤ σ²/t²,   ∀t > 0.

In particular, if σ² > 0, then

P[|X − µ| ≥ kσ] ≤ 1/k²,   ∀k > 0.
Proof. We apply Markov's inequality to the random variable Y = (X − µ)². If
t > 0, then

P[|X − µ| ≥ t] = P[(X − µ)² ≥ t²] ≤ E(X − µ)²/t² = σ²/t².

The second inequality follows by choosing t = kσ.

Chebyshev’s inequality can be used to prove the weak law of large numbers
(WLLN). This particular form of convergence of the random variables is referred
to as convergence in probability.
Theorem 6.3 (The weak law of large numbers). Let X1 , X2 , . . . be a sequence of
independent random variables with the same expectation µ = EXi and variance
σ² = var X_i < ∞. Then, the sequence formed from the averages (X̄_n),

X̄_n = (1/n) ∑_{i=1}^{n} X_i,

converges in probability towards µ, i.e.,

P(|X̄_n − µ| ≥ ε) → 0 as n → ∞,   for all ε > 0.

Proof. Now,

E X̄_n = µ,   var X̄_n = (1/n²) · nσ² = σ²/n.

By Chebyshev's inequality,

P(|X̄_n − µ| ≥ ε) ≤ σ²/(nε²).
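As a quick numerical illustration, the following Python sketch compares the empirical probability P(|X̄_n − µ| ≥ ε) for i.i.d. Exp(1) variables with the Chebyshev bound σ²/(nε²) used in the proof; the choice of distribution, the tolerance ε and the sample sizes are arbitrary.

    import numpy as np

    rng = np.random.default_rng(0)
    mu, sigma2, eps = 1.0, 1.0, 0.2   # Exp(1) has mean 1 and variance 1
    reps = 10_000                     # Monte Carlo repetitions

    for n in (10, 100, 1000):
        xbar = rng.exponential(scale=1.0, size=(reps, n)).mean(axis=1)
        p_emp = np.mean(np.abs(xbar - mu) >= eps)   # empirical P(|Xbar_n - mu| >= eps)
        bound = sigma2 / (n * eps**2)               # Chebyshev bound sigma^2 / (n eps^2)
        print(f"n={n:5d}  empirical {p_emp:.4f}  bound {bound:.4f}")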

6.2 Convex functions and Jensen’s inequality
A function is convex if its graph remains below a chord drawn through any two
points of the graph of the function. A function is concave if its graph remains
above a chord drawn through any two points of the graph of the function. The
idea is made precise below. Most functions are neither convex nor concave.
By interval we understand any closed, open or half-closed, finite or infinite
interval: (a, b), [a, b), (a, b] or [a, b], where a ≤ b and a = −∞, b = ∞ are
allowed. If I is an interval and x, y ∈ I, then the points λx + (1 − λ)y, where
0 ≤ λ ≤ 1, situated on a chord connecting x and y are located inside the interval
I. A convex set is characterised by this particular feature and convex subsets
of the real line are intervals.
Definition 6.1 (Convex function, concave function). Let I ⊂ R be an interval.
A function g : I → R is called convex, if

g(λx + (1 − λ)y) ≤ λg(x) + (1 − λ)g(y), for all x, y ∈ I and 0 ≤ λ ≤ 1

The function g : I → R is called concave, if −g is convex.


The definition of convexity could very well be restricted to the values 0 <
λ < 1, since the cases λ = 0 and λ = 1 are trivially satisfied. The functions
x², e^x and |x| are convex over the whole real line; 1/x is convex when x > 0 and
log x is concave when x > 0. Affine functions a + bx are both convex and concave
over the whole real line.
The following theorem includes difference quotients which should be inter-
preted geometrically as slopes of the function g. Draw a picture yourself. Prov-
ing the theorem is left as an exercise.

Theorem 6.4. Let I be an open interval. A function g : I → R is convex, if


and only if the following inequality holds for all a < b < c with a, b, c ∈ I:

(g(b) − g(a))/(b − a) ≤ (g(c) − g(b))/(c − b).   (6.1)

Theorem 6.5 (Conditions for convexity). Let I ⊂ R be an open interval and
g : I → R be a function.

(a) If g is continuously differentiable on the whole interval I and g′ is increasing on I, then g is convex.

(b) If g has a second derivative on I and g″(x) ≥ 0 for all x ∈ I, then g is convex.

Proof. (a): Let a, b, c ∈ I and a < b < c. By the mean value theorem for the
derivative, we have

(g(b) − g(a))/(b − a) = g′(x),   (g(c) − g(b))/(c − b) = g′(y)

for some x ∈ (a, b) and y ∈ (b, c). Since g′ is increasing and x < b < y, we
have g′(x) ≤ g′(y). This proves (6.1), which by Theorem 6.4 is equivalent to
the convexity of g.
(b): Since g″ ≥ 0, the function g′ is increasing on I.
If g : I → R is convex, then the definition of a convex function can be used
to derive (utilising induction) the inequality
g( ∑_{i=1}^{n} p_i x_i ) ≤ ∑_{i=1}^{n} p_i g(x_i),   (6.2)

which is valid when x_i ∈ I, p_i ≥ 0 and ∑_i p_i = 1. This kind of linear
combination x = ∑_i p_i x_i, where the coefficients p_i are non-negative and sum
up to 1, is referred to as a convex combination of the points x_1, . . . , x_n. Naturally,
the convex combination satisfies x ∈ I.
Jensen’s inequality is a generalisation of the previous inequality, where the
discrete distribution with the probability mass function f (xi ) = pi is replaced by
an arbitrary distribution. Proving Jensen’s inequality is easy after the following
lemma, which states that the graph of a convex function is always above its
tangent.
Theorem 6.6. If I ⊂ R is an open interval, g : I → R is a convex function,
and m ∈ I, then we can find a constant k such that

g(x) ≥ g(m) + k(x − m), ∀x ∈ I.

Proof. The claimed inequality is trivial for x = m and any k, so it is enough


to consider x ∈ I with x ≠ m. Observing that division by a negative number
reverses an inequality, the claim is equivalent to the existence of k ∈ R such
that

(g(y) − g(m))/(y − m) ≤ k ≤ (g(z) − g(m))/(z − m)   (∗)

for all y < m < z such that y, m, z ∈ I.
By Theorem 6.4, we already know that

(g(m) − g(y))/(m − y) ≤ (g(z) − g(m))/(z − m)   (∗∗)

for all y < m < z such that y, m, z ∈ I, so all we need is to “squeeze” a number
k in between. It is perhaps intuitively clear that this can be done; a rigorous
argument makes use of the following notions and facts:

• u ∈ R is called an upper bound of a set A ⊂ R, if a ≤ u for every a ∈ A.


• s ∈ R is called a least upper bound (supremum) of A ⊂ R, if s is an upper
bound of A, and every other upper bound u of A satisfies s ≤ u, i.e., we
have a ≤ s ≤ u for every a ∈ A and every upper bound u of A.

• If a set A ⊂ R has an upper bound, it has a least upper bound. (This is
a deep completeness property of the real line R that distinguishes it for
instance from the set of rational numbers Q.)
• The least upper bound of A is unique (this is easy) and denoted by sup A.
Denoting

A = { (g(m) − g(y))/(m − y) : y ∈ I, y < m },   B = { (g(z) − g(m))/(z − m) : z ∈ I, z > m },

the inequality (∗∗) shows that every b ∈ B is an upper bound of the set A. Thus
there is a least upper bound k = sup A, and this satisfies (∗).
The proof above shows that if g is differentiable at m, then k = g′(m) is
the unique choice in Theorem 6.6. Indeed, both the left and right sides of (∗)
converge to g′(m) when y, z → m.
Theorem 6.7 (Jensen’s inequality). Let I ⊂ R be an open interval and g :
I → R a convex function. If X is a random variable assuming values within I
(with probability one) and EX and Eg(X) exist, then
g(EX) ≤ Eg(X). (6.3)
Proof. As µ = EX ∈ I, there is a constant k such that,
g(x) ≥ g(µ) + k(x − µ) , ∀x ∈ I.
With probability one
g(X) ≥ g(µ) + k(X − µ),
and Jensen’s inequality follows by taking the expectation of both sides.
Example 6.1. As the function x² is convex, according to Jensen's inequality,
(EX)2 ≤ E(X 2 ),
if these moments exist. This implies the inequality
var X = E(X 2 ) − (EX)2 ≥ 0,
which we already knew.
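The inequality can also be checked numerically; the sketch below uses the convex function g(x) = e^x and normally distributed samples, both arbitrary choices.

    import numpy as np

    rng = np.random.default_rng(1)
    x = rng.normal(loc=0.5, scale=1.0, size=100_000)   # any distribution with E e^X finite

    lhs = np.exp(x.mean())     # g(EX) for the convex function g(x) = e^x
    rhs = np.exp(x).mean()     # E g(X)
    print(lhs, "<=", rhs)      # Jensen: g(EX) <= E g(X)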

6.3 Hölder’s inequality


Theorem 6.8. Let p, q > 1 be numbers such that
1/p + 1/q = 1.

Then

E|XY| ≤ [E|X|^p]^{1/p} [E|Y|^q]^{1/q}.

Proof. The claim is trivially true if E|X|^p = ∞ or E|Y|^q = ∞, or if E|X|^p = 0 or
E|Y|^q = 0. Therefore, we assume that both expectations are finite and greater
than 0.
First we will prove an inequality for positive real numbers. Let a, b > 0 be
arbitrary and let s and t be such that

a = exp(s/p),   b = exp(t/q).

As exp(x) = e^x is convex,

ab = exp(s/p + t/q) ≤ (1/p)e^s + (1/q)e^t = a^p/p + b^q/q.

The same inequality is also valid when a = 0 or b = 0, if we agree that zero to
the power of a positive number is zero.
If we let

u = [E|X|^p]^{1/p},   v = [E|Y|^q]^{1/q},

and take the expectation of both sides of the inequality

|XY/(uv)| ≤ (1/p)|X/u|^p + (1/q)|Y/v|^q,

we get

E|XY/(uv)| ≤ (1/p) E|X|^p/E|X|^p + (1/q) E|Y|^q/E|Y|^q = 1.

As an easy application of Hölder's inequality we can prove that

[E|X|^r]^{1/r} ≤ [E|X|^s]^{1/s},   for all 0 < r ≤ s.   (6.4)

This result is called Lyapunov’s inequality.

Minkowski’s inequality states that


[E|X + Y|^p]^{1/p} ≤ [E|X|^p]^{1/p} + [E|Y|^p]^{1/p},   for all p ≥ 1.   (6.5)

(If one of the expectations does not exist as a real number, the corresponding
term is understood as ∞.) Minkowski's inequality can be derived from the triangle inequality
and Hölder's inequality.

6.4 Cauchy–Schwarz inequality and correlation
The Cauchy–Schwarz inequality is the special case of Hölder's inequality
with p = q = 2, which results in

E|XY| ≤ √(EX²) √(EY²).

In some cases this inequality only holds in the form E|XY| ≤ ∞, which does
not say anything interesting. If EX² < ∞ and EY² < ∞, then the
upper bound is finite and we have

|E(XY)| ≤ E|XY| ≤ √(EX²) √(EY²).   (6.6)

Here, the first inequality follows from Theorem 4.4.


Let us assume that the random variables X and Y have finite variances,
σ_X² = var X and σ_Y² = var Y. Their covariance is

cov(X, Y ) = E[(X − EX)(Y − EY )].

This expectation is finite as seen by applying the Cauchy–Schwarz inequality to


the random variables X − EX and Y − EY , which produces the inequality

|cov(X, Y)| ≤ σ_X σ_Y.   (6.7)

Here the upper bound is the product of the standard deviations of the variables.
If both standard deviations σX and σY are strictly positive, then the corre-
lation coefficient of X and Y (often denoted by corr(X, Y ) or ρXY ) is defined
by the formula
corr(X, Y) = ρ_XY = ρ = cov(X, Y)/(σ_X σ_Y).   (6.8)
By the Cauchy–Schwarz inequality,

|corr(X, Y )| ≤ 1. (6.9)

If corr(X, Y) > 0, then also cov(X, Y ) > 0 and X and Y are said to be positively
correlated. If corr(X, Y) < 0, then also cov(X, Y) < 0, and X and Y are said
to be negatively correlated. If corr(X, Y) = 0, then also cov(X, Y) = 0, and X
and Y are said to have no (linear) correlation or to be (linearly) uncorrelated.
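The following sketch estimates the covariance and the correlation coefficient from simulated data and checks inequalities (6.7) and (6.9); the linear relationship used to generate Y is an arbitrary choice.

    import numpy as np

    rng = np.random.default_rng(3)
    x = rng.normal(size=100_000)
    y = 2.0 * x + rng.normal(size=100_000)            # Y depends linearly on X plus noise

    cov_xy = np.mean((x - x.mean()) * (y - y.mean()))
    rho = cov_xy / (x.std() * y.std())
    print("cov(X, Y)  =", cov_xy)
    print("corr(X, Y) =", rho)                        # must lie in [-1, 1] by (6.9)
    print("|cov| <= sigma_X sigma_Y:", abs(cov_xy) <= x.std() * y.std())   # (6.7)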

6.5 Inequalities for generating functions


Let us recall some definitions. The moment generating function and the cumu-
lant generating function of a random variable X are

M_X(t) = E e^{tX},   K_X(t) = ln M_X(t),

if the expectation is finite.

Theorem 6.9 (Properties of the generating functions). a) The tail probabilities
satisfy the following upper bounds for all a ∈ R:

P(X ≥ a) ≤ inf_{t>0} e^{−ta} M_X(t)   and   P(X ≤ a) ≤ inf_{t<0} e^{−ta} M_X(t).

b) The cumulant generating function and the moment generating function are
both convex functions.

Proof. Part (a): when t > 0, the function x ↦ e^{tx} is strictly increasing and
therefore, according to Markov's inequality,

P(X ≥ a) = P(e^{tX} ≥ e^{ta}) ≤ E e^{tX} / e^{ta}.

As this bound is valid for all t > 0, it remains valid when the right-hand side
is replaced by its infimum over t > 0. The other inequality is proved by
applying the first one to the random variable Y = −X.
The convexity of the cumulant generating function follows from Hölder's
inequality. The convexity of the moment generating function follows from the fact
that e^x is an increasing and convex function.
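Part (a) gives the so-called Chernoff-type bound. As an illustration, the sketch below evaluates the bound numerically for X ∼ N(0, 1), whose moment generating function is M_X(t) = exp(t²/2), and compares it with the exact tail probability; the threshold a = 2 is an arbitrary choice.

    import numpy as np
    from scipy.stats import norm

    # X ~ N(0, 1), so M_X(t) = exp(t**2 / 2) and the bound of Theorem 6.9(a) is
    # inf_{t > 0} exp(-t*a) * M_X(t); for a > 0 the infimum is attained at t = a.
    a = 2.0
    t = np.linspace(1e-3, 10.0, 10_000)
    bound = np.min(np.exp(-t * a + t**2 / 2))   # numerical infimum over t > 0
    print("generating function bound:", bound)  # exp(-a**2/2) = 0.1353...
    print("exact tail P(X >= a):     ", norm.sf(a))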

Chapter 7

Bivariate distribution

7.1 Continuous bivariate distribution


Definition 7.1 (Continuous (joint) distribution, joint density function). We
say that a random vector (X, Y ) has a continuous distribution, or that the
random variables X and Y have a continuous joint distribution, if there exists
a function f ≥ 0 defined on R2 such that
P((X, Y) ∈ B) = ∬_B f(x, y) dx dy,   for all B ⊂ R².   (7.1)

In this situation, f = fX,Y is called the density function of the random vector
(X, Y ) or the joint density function of the random variables X and Y .
Formula (7.1) is sometimes heuristically expressed as

P (X ∈ dx, Y ∈ dy) = f (x, y)dxdy.

Here the notation P(X ∈ dx, Y ∈ dy) is understood as
P(x ≤ X ≤ x + dx, y ≤ Y ≤ y + dy), where dx and dy are very small positive numbers. As in the
one-dimensional case, the value of the density function at point (x, y) has a
frequency interpretation.
A bivariate continuous distribution can be visualised in different ways. One
possibility is to draw a perspective picture of the function z = f (x, y). Another
option is to plot the function’s level curves (also known as contour lines) in the
xy-plane. The function’s contour lines at level c consist of the points (x, y),
where f (x, y) = c. Different values of the constant c produce different level
curves. A third option is to simulate points (Xi , Yi ), i = 1, . . . , N from the
distribution and draw the points in the plane. Figure 7.1. illustrates these
alternatives.
When dealing with a bivariate continuous distribution we will have to compute
planar (area) integrals over a given integration domain. The integral of a function g over


Figure 7.1: Different ways of visualising a continuous bivariate distribution: (a)
a perspective drawing, (b) contour lines of the density function together with a
sample of points simulated from the distribution.

a set B means an integral of the function 1B g over the whole domain R2 . This
type of an integral is written as
∫_B g = ∫_B g(x, y) d(x, y) = ∬_B g(x, y) dx dy = ∬_{R²} 1_B(x, y) g(x, y) dx dy.

Such area integrals are usually calculated as iterated integrals. For this we can
apply Fubini's theorem (which will not be proved during this course). The
following theorem is formulated for the Lebesgue integral.

Theorem 7.1 (Fubini). In the following two cases the area integral can be
calculated as an iterated integral, meaning that the identities below are valid:

∫_B g(x, y) d(x, y) = ∫_{−∞}^{∞} ( ∫_{−∞}^{∞} 1_B(x, y) g(x, y) dx ) dy
                    = ∫_{−∞}^{∞} ( ∫_{−∞}^{∞} 1_B(x, y) g(x, y) dy ) dx.   (7.2)

(a) If g ≥ 0; in this case the common value of all these integrals is allowed to be ∞.

(b) If ∫_B |g| < ∞; in this case the common value of all the integrals above is
finite. (Notice that ∫_B |g| can always be calculated as an iterated integral by
case (a).)

For example, if B can be presented as

B = {(x, y) : a < x < b, c(x) < y < d(x)},

and ∫_B |g| < ∞ or g ≥ 0, then

∫_B g = ∬_B g(x, y) dx dy = ∫_a^b ( ∫_{c(x)}^{d(x)} g(x, y) dy ) dx.

We will get the same result even if, within the definition of B, one or more of the
inequalities “<” is replaced by “≤”. This iterated integral can also be written
in either of the following ways:

∫_B g = ∫_{x=a}^{b} ∫_{y=c(x)}^{d(x)} g(x, y) dx dy = ∫_a^b dx ∫_{c(x)}^{d(x)} g(x, y) dy.

If B can be expressed as

B = {(x, y) : c < y < d, a(y) < x < b(y)},

and ∫_B |g| < ∞ or g ≥ 0, then the planar integral can be calculated as an
iterated integral:

∬_B g(x, y) dx dy = ∫_c^d ( ∫_{a(y)}^{b(y)} g(x, y) dx ) dy
                  = ∫_{y=c}^{d} ∫_{x=a(y)}^{b(y)} g(x, y) dx dy = ∫_c^d dy ∫_{a(y)}^{b(y)} g(x, y) dx.

Most integration domains encountered in applications can be divided into parts


that can be expressed in one of the ways above. The area integral can be
calculated over the whole original set by summing up the integrals calculated
over the different parts.
Remark. A notation like

∫_a^b ∫_c^d g(x, y) dx dy

is rather common. It is unclear which variable is related to which integration limits.
In some sources, this notation is used to denote the integral over (x, y) ∈ (a, b)×
(c, d); in some other sources the same notation is used to denote integration over
(x, y) ∈ (c, d) × (a, b).
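Iterated integrals of this kind can also be evaluated numerically. The sketch below uses scipy.integrate.dblquad, whose own argument-order convention is exactly the kind of thing one has to check; the integrand and the region are arbitrary choices.

    from scipy.integrate import dblquad

    # Integrate g(x, y) = x*y over B = {(x, y): 0 < x < 1, 0 < y < x} as an
    # iterated integral.  Note scipy's convention: the integrand is func(y, x),
    # the outer variable x runs over [a, b] and the inner variable y over
    # [gfun(x), hfun(x)].
    g = lambda y, x: x * y

    value, abserr = dblquad(g, 0.0, 1.0, lambda x: 0.0, lambda x: x)
    print(value)   # exact value 1/8 = 0.125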
The density function of a bivariate distribution is not completely unique,
but it is almost unique. If f is a density function and g is a density function
of the same distribution, then f = g almost everywhere, which means that
f(x, y) = g(x, y) everywhere except possibly in a set of measure zero. Observe
that in the two-dimensional situation, all one-dimensional curves (for example
graphs of functions) have measure zero.

Theorem 7.2. Let the random vector (X, Y) be continuously distributed. In this
case the marginal distributions are also continuous and their density functions
can be obtained by integrating the other variable out of the joint density function:

f_X(x) = ∫_{−∞}^{∞} f_{X,Y}(x, y) dy,   f_Y(y) = ∫_{−∞}^{∞} f_{X,Y}(x, y) dx.

Proof. If B ⊂ R, then

P(X ∈ B) = P((X, Y) ∈ B × R) = ∫_B ( ∫_{−∞}^{∞} f_{X,Y}(x, y) dy ) dx.

The formula for the density function of Y is derived in a similar manner.

Example 7.1. In the converse direction, the continuity of the marginal distri-
butions does not automatically imply the continuity of the joint distribution.
For example, let X be a random variable with a density function fX and let a
random variable Y be defined by Y = X, or
Y (ω) = X(ω), for all ω ∈ Ω.
Now the distribution of Y is equal to the distribution of X and therefore both
distributions are continuous. The distribution of a random vector (X, Y ) is
concentrated on the diagonal
B = {(x, y) : x = y}.
However, for any function f : R² → R we have ∫_B f = 0, since the diagonal
B has measure zero as a one-dimensional set. For this reason, the random
vector (X, Y) cannot be continuously distributed, i.e. the joint distribution is not
continuous.
In the case of a continuous joint distribution, the joint cumulative distribution
function has the formula

F_{X,Y}(x, y) = P((X, Y) ∈ (−∞, x] × (−∞, y]) = ∫_{−∞}^{x} ds ∫_{−∞}^{y} f_{X,Y}(s, t) dt.

Let us differentiate this formula first with respect to x,

(∂/∂x) F_{X,Y}(x, y) = ∫_{−∞}^{y} f_{X,Y}(x, t) dt,

and then with respect to y,

(∂²/∂x∂y) F_{X,Y}(x, y) = f_{X,Y}(x, y).
This result is valid at least in those points (x, y) where fX,Y is continuous.
To prove the next theorem would unfortunately require tools we do not have
available in this course.
Theorem 7.3. If a random vector (X, Y ) has a continuous distribution, then
f_{X,Y}(x, y) = (∂²/∂x∂y) F_{X,Y}(x, y)

can be chosen as a density function for the distribution. This second-order mixed
partial derivative exists for almost all (x, y). In those points where the mixed
partial derivative is not defined, an arbitrary definition for fX,Y can be used.
Note. It can be difficult to determine whether an arbitrary cumulative distri-
bution function of a bivariate distribution represents a continuous distribution
or not. In principle we should investigate whether the second-order mixed par-
tial derivative of the function can be a density function of the distribution.
Theorem 7.4. If f : R2 → R is non-negative and its integral over the whole
plane is one, then it is a density function of a random vector.

7.2 Uniform distribution in a planar region
Definition 7.2 (Uniform distribution in region A). Let A ⊂ R2 be a set with
an area

m(A) = ∬_{R²} 1_A(x, y) dx dy

that satisfies 0 < m(A) < ∞. A random vector (X, Y) is said to have the uniform
distribution in A if it has the joint density function

f(x, y) = 1_A(x, y)/m(A).

The density function of a uniform distribution vanishes outside the region A,
and at the points (x, y) ∈ A it attains a constant non-zero value. Suppose that
the random vector (X, Y) has a uniform distribution in a region A. If B ⊂ R²,
then

P((X, Y) ∈ B) = ∬_B f(x, y) dx dy = m(A ∩ B)/m(A).

In particular, P((X, Y) ∈ R²) = 1, as it should be.
Notice that, for example, the uniform distributions on the closed square [0, 1] × [0, 1]
and an open square (0, 1) × (0, 1) are equal, since the boundary of the square
has measure zero.
In the next example we will investigate a uniform distribution over a region
between the graph of a non-negative and integrable function h and the x-axis.
Notice that in this case the marginal distribution of the x-coordinate has a density
function proportional to h.

Example 7.2. Let h(x) = exp(−x²/2) and suppose that the random vector
(X, Y) has a uniform distribution under h, i.e. over the set

A = {(x, y) : 0 < y < h(x)}.

The area of this set is

m(A) = ∬ 1_A(x, y) dx dy = ∫_{−∞}^{∞} dx ∫_0^{h(x)} 1 dy = ∫_{−∞}^{∞} h(x) dx = √(2π),

and therefore the joint density function is

f_{X,Y}(x, y) = (1/√(2π)) 1{0 < y < h(x)},

where the following notation is used:

1{0 < y < h(x)} = 1_{{(u,v): 0 < v < h(u)}}(x, y) = 1 if 0 < y < h(x), and 0 otherwise.

The marginal density of X is

f_X(x) = ∫_{−∞}^{∞} f_{X,Y}(x, y) dy = ∫_0^{h(x)} (1/√(2π)) dy = (1/√(2π)) exp(−x²/2),

and therefore the marginal distribution of X is N(0, 1).

What is the marginal distribution of Y? As 0 < h(x) ≤ 1 for all x, the inequality

0 < y < h(x)

can only hold when 0 < y < 1. If 0 < y < 1, then the inequality

0 < y < exp(−x²/2)

can easily be solved with respect to x, and the result is

−√(−2 ln y) < x < √(−2 ln y).

(Notice that −ln y > 0 when 0 < y < 1.) Therefore, the joint density function
can also be expressed as

f_{X,Y}(x, y) = (1/√(2π)) 1{0 < y < 1, −√(−2 ln y) < x < √(−2 ln y)},

from which the marginal density of Y is easy to calculate:

f_Y(y) = (2/√π) √(−ln y),   0 < y < 1.
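The result can be checked by simulation. The sketch below assumes the hierarchical sampling scheme X ∼ N(0, 1) and, given X = x, Y uniform on (0, h(x)), which produces the uniform distribution on A; it then compares a histogram of the simulated Y values with the derived density f_Y.

    import numpy as np

    rng = np.random.default_rng(4)
    n = 200_000

    # Hierarchical sampling (assumed here): X ~ N(0, 1) and, given X = x,
    # Y uniform on (0, h(x)) with h(x) = exp(-x**2 / 2).
    x = rng.normal(size=n)
    y = rng.uniform(low=0.0, high=np.exp(-x**2 / 2))

    # Compare a histogram of Y with f_Y(y) = (2/sqrt(pi)) sqrt(-log y).
    hist, edges = np.histogram(y, bins=20, range=(0.0, 1.0), density=True)
    mid = 0.5 * (edges[:-1] + edges[1:])
    f_y = 2.0 / np.sqrt(np.pi) * np.sqrt(-np.log(mid))
    print(np.round(hist, 2))
    print(np.round(f_y, 2))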

7.3 Independence
Let us recall a definition. Two random variables X and Y are called independent
(denoted by X ⊥⊥ Y) if

P(X ∈ A, Y ∈ B) = P(X ∈ A)P(Y ∈ B),   for all (Borel sets) A, B ⊂ R.

According to Section 3.4,

X ⊥⊥ Y ⇔ F_{X,Y}(x, y) = F_X(x)F_Y(y) for all (x, y).

Next we will show that the joint mass functions (jmf) and the joint density
functions (jdf) of independent random variables can be expressed as the prod-
uct of the marginal distributions when the joint distribution is either discrete
or continuous. Notice that the joint cumulative distribution function of independent random
variables is not the pointwise product of the functions F_X and F_Y,

u ↦ F_X(u)F_Y(u),

but a so-called tensor product,

(u, v) ↦ F_X(u)F_Y(v).

Similarly, the jmfs and the jdfs of independent random variables can be ex-
pressed as the product of marginal density/mass functions as long as we under-
stand that we are dealing with the tensor product.
Theorem 7.5. Suppose that the random vector (X, Y ) has a discrete or con-
tinuous distribution with joint mass function or joint density function fX,Y ,
respectively.
(a) If X ⊥⊥ Y, then f_{X,Y}(x, y) = f_X(x)f_Y(y).

(b) If f_{X,Y} can be expressed as a product

f_{X,Y}(x, y) = g(x)h(y),   ∀x, y,

where g ≥ 0 and h ≥ 0 are functions of one variable, then X ⊥⊥ Y.
Proof. (a): The discrete case was proved in Section 3.4, and we shall now
prove the continuous case. Let us assume that the joint distribution is
continuous and that X ⊥⊥ Y. We will check that the value of the joint cumulative
distribution function (jcf) at an arbitrary point (x, y) can be obtained by
integrating the claimed jdf. The statement then follows from the fact that the jcf
determines the distribution. Based on independence, for all x, y ∈ R we have

F_{X,Y}(x, y) = F_X(x)F_Y(y) = ∫_{−∞}^{x} f_X(s) ds ∫_{−∞}^{y} f_Y(t) dt
             = ∫_{−∞}^{x} f_X(s) ( ∫_{−∞}^{y} f_Y(t) dt ) ds
             = ∫_{−∞}^{x} ds ∫_{−∞}^{y} f_X(s)f_Y(t) dt.

Therefore f_X(x)f_Y(y) is a jdf of (X, Y).

(b): The proof will be presented in the continuous case; the idea of the proof
remains the same in the discrete case. Let us assume that the jdf can be
written as a product f_{X,Y}(x, y) = g(x)h(y). Now let

c = ∫_{−∞}^{∞} g(x) dx,   d = ∫_{−∞}^{∞} h(y) dy.

Then

1 = ∬_{R²} f_{X,Y}(x, y) dx dy = ∫_{−∞}^{∞} g(x) dx ∫_{−∞}^{∞} h(y) dy = cd.

The marginal density of X is

f_X(x) = ∫_{−∞}^{∞} f_{X,Y}(x, y) dy = g(x)d = g(x)/c,

and the marginal density of Y is

f_Y(y) = ∫_{−∞}^{∞} f_{X,Y}(x, y) dx = c h(y) = h(y)/d.

In other words, the marginal densities f_X and f_Y are the factors g and
h normalised into density functions. For all x, y, the jcf is the product of the
marginal cdfs because

F_{X,Y}(x, y) = ∫_{−∞}^{x} ds ∫_{−∞}^{y} f_{X,Y}(s, t) dt = ∫_{−∞}^{x} ds ∫_{−∞}^{y} (g(s)/c)(h(t)/d) dt
             = ∫_{−∞}^{x} f_X(s) ds ∫_{−∞}^{y} f_Y(t) dt = F_X(x)F_Y(y),

and consequently X ⊥⊥ Y.

Sometimes we wish to prove that two random variables X and Y are not
independent. If X and Y are discrete, their dependence can be proved by finding
one point (x, y), where the multiplicative formula fails:
f_{X,Y}(x, y) ≠ f_X(x)f_Y(y):

this shows that P(X ∈ A, Y ∈ B) is not equal to P(X ∈ A)P(Y ∈ B) when we
choose A = {x} and B = {y}.
In case of a continuous joint distribution, the proof is more complicated, since
many different functions can be density functions of the same distribution.

Example 7.3. Suppose that the random vector (X, Y) has a uniform distribution
in the triangle

C = {(x, y) : x ≥ 0, y ≥ 0, x + y ≤ 1}.

Are X and Y independent? As the area of this triangle is 1/2, the jdf is

f_{X,Y}(x, y) = 2,   when (x, y) ∈ C.

It might misleadingly seem that the jdf takes the product form g(x)h(y). We
notice that this is false when we write down the condition (x, y) ∈ C using
indicator functions:

f_{X,Y}(x, y) = 2 · 1{x ≥ 0} 1{y ≥ 0} 1{x + y ≤ 1}.

Here the last factor is definitely not of product form.
If A = B = (1/2, 1), then (A × B) ∩ C = ∅ and

P((X, Y) ∈ A × B) = 0,

but as P(X ∈ A) > 0 and P(Y ∈ B) > 0, we have

0 = P(X ∈ A, Y ∈ B) ≠ P(X ∈ A)P(Y ∈ B) > 0.

Directly using the definition of independence, we have shown that X and Y are
not independent.
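The conclusion can also be seen from a quick simulation; the sketch below samples points uniformly from the triangle by rejection and estimates the two probabilities compared above.

    import numpy as np

    rng = np.random.default_rng(5)

    # Uniform points in the triangle C = {x >= 0, y >= 0, x + y <= 1},
    # obtained by rejection from the unit square.
    pts = rng.uniform(size=(400_000, 2))
    pts = pts[pts[:, 0] + pts[:, 1] <= 1.0]
    x, y = pts[:, 0], pts[:, 1]

    in_A = x > 0.5
    in_B = y > 0.5
    print("P(X in A, Y in B)  ~", np.mean(in_A & in_B))           # equals 0 here
    print("P(X in A)P(Y in B) ~", np.mean(in_A) * np.mean(in_B))  # strictly positive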

7.4 Expectation of a transformed random vector
Theorem 7.6. Suppose that the random vector (X, Y ) has a discrete distribu-
tion and let Z = g(X, Y ) be its real-valued transformation. Then
EZ = ∑_{x,y} g(x, y) f_{X,Y}(x, y),

if the sum converges absolutely.


Proof. We will apply Lemma 4.1. in order to divide the probability space. Let
A = {(xi , yi ) : i ≥ 1} be the (at most countable) range of the random vector
(X, Y ). Then the sets

{(X, Y) ∉ A},   ({X = x_i, Y = y_i})_{i≥1}

partition the probability space. Now

g(X, Y )(ω) = g(X(ω), Y (ω)) = g(xi , yi ), when ω ∈ {X = xi , Y = yi },

and therefore,
EZ = ∑_{i≥1} g(x_i, y_i) P(X = x_i, Y = y_i) = ∑_{i≥1} g(x_i, y_i) f_{X,Y}(x_i, y_i).

In case of a joint continuous distribution, the expectation Eg(X, Y ) can be


obtained by calculating an integral.
Theorem 7.7. Let (X, Y ) be a random vector with a continuous distribution,
and let Z = g(X, Y ) be its real-valued transformation. Then
EZ = ∬_{R²} g(x, y) f_{X,Y}(x, y) dx dy

if the integral converges absolutely.


Proof. Let us consider a special case where the function g only takes finitely
many different values a_i; thus

g = ∑_{i≥1} a_i 1_{A_i}

for some disjoint A_i ⊂ R². Then

Eg(X, Y) = E ∑_{i≥1} a_i 1_{A_i}(X, Y) = ∑_{i≥1} a_i E(1_{A_i}(X, Y)) = ∑_{i≥1} a_i P((X, Y) ∈ A_i).

But

P((X, Y) ∈ A_i) = ∬_{A_i} f_{X,Y}(x, y) dx dy = ∬_{R²} 1_{A_i}(x, y) f_{X,Y}(x, y) dx dy,

and hence

Eg(X, Y) = ∑_{i≥1} a_i ∬_{R²} 1_{A_i}(x, y) f_{X,Y}(x, y) dx dy
         = ∬_{R²} ∑_{i≥1} a_i 1_{A_i}(x, y) f_{X,Y}(x, y) dx dy = ∬_{R²} g(x, y) f_{X,Y}(x, y) dx dy.

Let us define

s_n(t) := ∑_{k=0}^{4^n − 1} k2^{−n} · 1_{[k2^{−n}, (k+1)2^{−n})}(t) + 2^n · 1_{[2^n, ∞)}(t).

Then 0 ≤ s_n(t) ≤ 2^n and s_n(t) ≤ s_{n+1}(t) → t as n → ∞ for t ≥ 0. Moreover,
for any function g, the function s_n ∘ g takes only finitely many values. By the
previous part, we have

E s_n(g(X, Y)) = ∬_{R²} s_n(g(x, y)) f_{X,Y}(x, y) dx dy.

If g ≥ 0, then 0 ≤ s_n(g(x, y)) ≤ s_{n+1}(g(x, y)) → g(x, y). In this situation,
the monotone convergence theorem (proved in the course Measure and Integral,
taken for granted here) allows exchanging the limit with either the integral or
the expectation, with the result that

Eg(X, Y) = E lim_{n→∞} s_n(g(X, Y)) = lim_{n→∞} E s_n(g(X, Y))    (MON)
         = lim_{n→∞} ∬_{R²} s_n(g(x, y)) f_{X,Y}(x, y) dx dy
         = ∬_{R²} lim_{n→∞} s_n(g(x, y)) f_{X,Y}(x, y) dx dy    (MON)
         = ∬_{R²} g(x, y) f_{X,Y}(x, y) dx dy.

This proves the theorem for g ≥ 0.


Finally, if g is arbitrary, we write it as g⁺ − g⁻, where g^± = max(±g, 0) ≥ 0,
and we apply the previous case to both g⁺ and g⁻ and sum up.
As a summary of the two previous theorems we get a bivariate version of
the “law of the unconscious statistician” (sometimes abbreviated as “LOTUS”),
namely
Eg(X, Y) = ∑_{x,y} g(x, y) f_{X,Y}(x, y),          if (X, Y) is discrete,
Eg(X, Y) = ∬_{R²} g(x, y) f_{X,Y}(x, y) dx dy,     if (X, Y) is continuous.

This formula is valid under the assumption that the expectation exists and is
finite, meaning that the sum or integral converges absolutely.

Why this funny name? In order to compute the expectation Eg(X, Y ), the
statistician can be entirely unconscious about the distribution of the transformed
random variable g(X, Y ), as long as they know the distribution of the original
random vector (X, Y ).
This bivariate version is compatible with the previous univariate version.
For example, for a discrete joint distribution and a function g of one variable,
the bivariate version gives

Eg(X) = ∑_x ∑_y g(x) f_{X,Y}(x, y) = ∑_x g(x) ∑_y f_{X,Y}(x, y) = ∑_x g(x) f_X(x),

where the last expression agrees with the univariate version of the law of the
unconscious statistician, which was discussed in Section 4.4.

Example 7.4. If integrable random variables X and Y have a continuous joint
distribution, then for all constants a, b ∈ R

E(aX + bY) = ∬ (ax + by) f_{X,Y}(x, y) dx dy
           = a ∫ x ( ∫ f_{X,Y}(x, y) dy ) dx + b ∫ y ( ∫ f_{X,Y}(x, y) dx ) dy
           = a ∫ x f_X(x) dx + b ∫ y f_Y(y) dy = a EX + b EY.

Here each single integral is calculated over the whole real axis. We have thus
derived, in this special case, a fact that we already knew: the expectation
is a linear operator.
Note. The identity Eg(X, Y ) = g(EX, EY ) holds for the function g(x, y) =
ax + by, but it is usually false for other functions.
Example 7.5. If X ⊥⊥ Y, then g(X) ⊥⊥ h(Y) for all functions g, h : R → R and
therefore,

E(g(X)h(Y)) = E(g(X))E(h(Y))

if these expectations exist. If the random vector (X, Y) is continuously distributed,
then f_{X,Y}(x, y) = f_X(x)f_Y(y), and the previous identity can also be derived
from the computation

E(g(X)h(Y)) = ∬ g(x)h(y) f_X(x)f_Y(y) dx dy = ( ∫ g(x)f_X(x) dx )( ∫ h(y)f_Y(y) dy ) = E(g(X))E(h(Y)).
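The product formula can be illustrated by Monte Carlo; in the sketch below the distributions of X and Y and the functions g and h are arbitrary choices.

    import numpy as np

    rng = np.random.default_rng(6)
    x = rng.normal(size=500_000)        # X and Y simulated independently
    y = rng.exponential(size=500_000)

    g = lambda t: np.cos(t)
    h = lambda t: t**2

    lhs = np.mean(g(x) * h(y))          # Monte Carlo estimate of E[g(X)h(Y)]
    rhs = np.mean(g(x)) * np.mean(h(y)) # E[g(X)] E[h(Y)]
    print(lhs, rhs)                     # close up to Monte Carlo error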

7.5 Covariance and other joint moments


Definition 7.3. Let m, n ≥ 0 be integers. If the expectation
E[X^m Y^n]

exists, it is referred to as the joint moment (or product moment) of order
(m, n). If the expectation

E[(X − EX)^m (Y − EY)^n]

exists, it is referred to as the central moment of order (m, n).

The most important of these moments are the moments EX^m and
EY^n of the marginal distributions (namely the joint moments of orders (m, 0)
and (0, n)) and the central moment of order (1, 1), which is better known as the
covariance of X and Y:

cov(X, Y ) = E[(X − EX)(Y − EY )].

7.6 Best linear approximation


In some cases, the value of a random variable Y cannot be directly determined
but somehow the value of a random variable X (which is related to the random
variable Y ) might be established. In this case we might want to predict the value
of Y by utilizing the value of X. The prediction can, for example, take the
linear form a + bX, or equivalently α + β(X − EX). We then still have to
determine the coefficients so that the prediction is as good as
possible. If the quality of the approximation is measured by how small the
mean squared error is,

E[(Y − (α + β(X − EX)))²] = min!,

then it turns out that the best linear approximation can be found by a simple
calculation, which we will carry out next. Let us assume that

EX² < ∞,   EY² < ∞,   var X > 0,   var Y > 0.

Now let us denote µ_X = EX and µ_Y = EY. If α and β are constants, then the
prediction α + β(X − µ_X) deviates from the actual Y by

Z = Y − α − β(X − µ_X).

The expectation and variance of the prediction error Z are

EZ = µ_Y − α − β(µ_X − µ_X) = µ_Y − α,
var Z = var Y + β² var X − 2β cov(X, Y),

hence the mean square error (MSE) is

EZ² = (EZ)² + var Z = (µ_Y − α)² + var Y + β² var X − 2β cov(X, Y).

Next we will complete the square in the terms that include the coefficient β:

EZ² = (µ_Y − α)² + var Y + var X ( β² − 2β cov(X, Y)/var X + (cov(X, Y))²/(var X)² ) − (cov(X, Y))²/var X
    = var Y + (µ_Y − α)² + var X ( β − cov(X, Y)/var X )² − (cov(X, Y))²/var X.

In the last form only the second and third terms depend on α or β, and both
of these terms are non-negative. No matter how we choose α and β, the MSE
will always be greater than or equal to

var Y − (cov(X, Y))²/var X = (1 − ρ²) var Y,   (7.3)

and the minimum MSE is achieved by choosing

α = µ_Y,   β = cov(X, Y)/var X.

In other words, the best linear approximation of Y in the MSE sense is

EY + (cov(X, Y)/var X)(X − EX).   (7.4)

Above, ρ denotes the correlation coefficient

ρ = corr(X, Y) = cov(X, Y)/(√(var X) √(var Y)),

which by the Cauchy–Schwarz inequality satisfies −1 ≤ ρ ≤ 1. As the expectation
of the approximation is EY, the MSE is equal to the variance of the prediction
error

Y − EY − (cov(X, Y)/var X)(X − EX).

We will later solve a more ambitious example by determining the best MSE-
based approximation of Y of the form m(X), where the function m can be
chosen freely.
Formula (7.3) for the MSE of the best linear approximation shows how the
correlation coefficient measures the strength of the linear dependence between
the variables. If the correlation coefficient attains either extreme value ρ = ±1,
then the MSE is zero and the prediction error is zero almost surely (see Theorem 4.3),
i.e.,

Y = EY + (cov(X, Y)/var X)(X − EX)   (a.s.).   (7.5)

The abbreviation a.s. stands for almost surely, which means with probability one.
In other words, if |ρ| = 1, then

Y(ω) = EY + (cov(X, Y)/var X)(X(ω) − EX)

for all elementary events ω, except possibly those in an exceptional set of
probability zero.
Notice also that the sign of the correlation coefficient is the same as the sign
of the covariance cov(X, Y ) as well as the sign of the slope of the best linear
estimate. If ρ > 0, meaning cov(X, Y ) > 0, then Y has a tendency to grow as
X increases, since the slope of best linear prediction is in this case positive. If,
on the other hand, ρ < 0, meaning cov(X, Y ) < 0, then Y has a tendency to
decrease as X increases.
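The formulas above are easy to try out on simulated data. The sketch below (with an arbitrary joint distribution) estimates α and β from a sample, forms the prediction (7.4), and compares the empirical MSE with (1 − ρ²) var Y from (7.3).

    import numpy as np

    rng = np.random.default_rng(7)
    n = 200_000
    x = rng.normal(size=n)
    y = 1.5 * x + rng.normal(size=n)          # an arbitrary correlated pair

    beta = np.cov(x, y, bias=True)[0, 1] / x.var()
    alpha = y.mean()
    pred = alpha + beta * (x - x.mean())      # the best linear approximation (7.4)

    mse = np.mean((y - pred)**2)
    rho = np.corrcoef(x, y)[0, 1]
    print("empirical MSE    :", mse)
    print("(1 - rho^2) var Y:", (1 - rho**2) * y.var())   # formula (7.3)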

7.7 Expectation vector and covariance matrix


When a random vector V = (X, Y ) appears in a formula involving vectors and
matrices, it is understood as a column vector. In this course material, vectors
and matrices are written with bold face letters, while scalars are written in
italic. In more advanced materials, this distinction is usually not indicated by
the notation, and the reader needs to infer the nature of the different symbols
from the context.

Definition 7.4 (Expectation of a random vector). The expectation (vector) of
a random vector (X, Y) is obtained by calculating the expectation of each
component,

EV = E(X, Y) = E\begin{pmatrix} X \\ Y \end{pmatrix} = (EX, EY) = \begin{pmatrix} EX \\ EY \end{pmatrix},

if the expectation of each component exists.
The two-component vector (X, Y ) can be viewed as a 2×1-matrix. Therefore,
the definition of the expectation of a random vector is a special case of the
definition of the expectation of a random matrix.
Definition 7.5 (Random matrix and its expectation). A random matrix is
a matrix whose elements are random variables. Its expectation is a constant
matrix whose element (i, j) is the expectation of the original element (i, j) (if
the expectations of all the elements exist).
It is immediate from the definition that

E(Zᵀ) = (EZ)ᵀ,   (7.6)

if Z is a random matrix.
Based on the linearity of the expectation we can easily see that,

E(AZB + C) = A(EZ)B + C, (7.7)

when Z is a random matrix and A, B and C are constant matrices, with such
dimensions that the expression is defined. Constant matrices can be pulled
outside the expectation if they stand at the far right or the far left of the
matrix product expression. In contrast, matrices located in the middle of the
expression cannot be pulled out using this formula. Formula (7.7) is also
applicable to a random vector Z, since a column vector with d components can
be interpreted as a d × 1 matrix.
Definition 7.6 (Covariance matrix). The covariance matrix of a random vector
V = (X, Y ) is defined as

Cov(V) = E[(V − EV)(V − EV)ᵀ].   (7.8)

Remarks: In these notes, the capital symbol Cov(V) is used to denote


the covariance matrix of a random vector, while the lower-case symbol cov(X, Y)
denotes the covariance of the random variables X and Y. We will soon
observe that these concepts are closely related to each other. In other sources,
the covariance matrix of a random vector V is also denoted by several other
symbols such as, for instance, VarV. The covariance matrix is also known as
the variance-covariance matrix, variance matrix, and the dispersion matrix.
By definition, we have
Cov(V) = E{ \begin{pmatrix} X − EX \\ Y − EY \end{pmatrix} (X − EX, Y − EY) }
       = E \begin{pmatrix} (X − EX)(X − EX) & (X − EX)(Y − EY) \\ (Y − EY)(X − EX) & (Y − EY)(Y − EY) \end{pmatrix}
       = \begin{pmatrix} cov(X, X) & cov(X, Y) \\ cov(Y, X) & cov(Y, Y) \end{pmatrix} = \begin{pmatrix} var(X) & cov(X, Y) \\ cov(X, Y) & var(Y) \end{pmatrix}.

The main diagonal of the covariance matrix has the variances of the random
variables and elsewhere the covariances between the random variables. For this
reason the covariance matrix is sometimes called the variance-covariance matrix.
If we denote σ_X² = var X and σ_Y² = var Y and assume σ_X, σ_Y > 0, then the
correlation coefficient ρ = corr(X, Y) of the random variables X and Y is defined
by the expression

ρ = corr(X, Y) = cov(X, Y)/(√(var X) √(var Y)) = cov(X, Y)/(σ_X σ_Y).

Due to this, the covariance matrix of the random vector V = (X, Y) can also be
written as

Cov(V) = \begin{pmatrix} σ_X² & ρ σ_X σ_Y \\ ρ σ_X σ_Y & σ_Y² \end{pmatrix}.
If a random vector W can be obtained from the random vector V by an
affine transformation
W = AV + b,

where A is a constant matrix and b is a constant vector, then

EW = A(EV) + b  ⇒  (W − EW)(W − EW)ᵀ = A(V − EV)(V − EV)ᵀAᵀ.

Therefore, by formula (7.7),

Cov(AV + b) = A Cov(V) Aᵀ.   (7.9)

Let us apply formula (7.9) to V = (X, Y) and A = uᵀ, where u = (a, b).
This leads to a familiar expression:

var(aX + bY) = var(uᵀV) = uᵀ Cov(V) u
            = (a, b) \begin{pmatrix} var X & cov(X, Y) \\ cov(X, Y) & var Y \end{pmatrix} \begin{pmatrix} a \\ b \end{pmatrix}
            = a² var(X) + 2ab cov(X, Y) + b² var(Y).
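Formula (7.9) can be verified numerically; the sketch below uses an arbitrary simulated random vector V and an arbitrary matrix A and vector b.

    import numpy as np

    rng = np.random.default_rng(8)
    n = 500_000
    x = rng.normal(size=n)
    y = 0.8 * x + rng.normal(size=n)
    V = np.vstack([x, y])                  # random vector V = (X, Y), shape (2, n)

    A = np.array([[1.0, 2.0],
                  [0.0, 3.0]])
    b = np.array([[1.0], [-1.0]])
    W = A @ V + b                          # affine transformation W = AV + b

    print(np.cov(W, bias=True))            # empirical Cov(W)
    print(A @ np.cov(V, bias=True) @ A.T)  # A Cov(V) A^T, formula (7.9)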

7.8 Distribution of a transformed random vector


If a random vector (X, Y ) has a discrete distribution and a random variable or
a random vector Z is defined as its transformation, Z = g(X, Y ), then Z has a
discrete distribution as well. The probability mass function of this distribution
can be directly calculated from the definition
f_Z(z) = P(Z = z) = ∑_{(x,y): g(x,y)=z} f_{X,Y}(x, y).

A continuous distribution requires more elaborate calculations.


Let us investigate a continuously distributed random vector (X, Y ) that has
a density function fX,Y , and a function g : A → B where A, B ⊂ R2 , and A is
such an open set that
P ((X, Y ) ∈ A) = 1. (7.10)
We will define a two-component random vector (U, V) by

\begin{pmatrix} U \\ V \end{pmatrix} = g(X, Y) = \begin{pmatrix} g_1(X, Y) \\ g_2(X, Y) \end{pmatrix}.

We will assume that

g : A → B is diffeomorphism, (7.11)

which means that

• the sets A and B are open,


• the function g : A → B is a continuously differentiable bijection,
• its inverse h = g⁻¹ is a continuously differentiable bijection B → A.

In practice, checking that a function is a diffeomorphism can be done as
follows. First, we find an expression for the inverse function h = (h₁, h₂) by
solving from the equations

g₁(x, y) = u,   g₂(x, y) = v,   (u, v) ∈ B,

the variables x and y as functions of u and v. If the solution

(x, y) = (h₁(u, v), h₂(u, v))

is unique for all (u, v) ∈ B and satisfies (x, y) ∈ A, then we have checked that
we are dealing with a bijection. After this, it is usually quite straightforward to
check whether the expressions g₁(x, y) and g₂(x, y) are continuously differentiable
on A and whether the expressions h₁(u, v) and h₂(u, v) are continuously differentiable
on B. For example, for g₁(x, y) we must check whether both first-order partial
derivatives ∂g₁(x, y)/∂x and ∂g₁(x, y)/∂y exist and whether they are continuous
functions on the set A.
Now let us consider a bijective correspondence
(u, v) = (g1 (x, y), g2 (x, y)) ⇔ (x, y) = (h1 (u, v), h2 (u, v)), (7.12)
which is convenient to write as
(u, v) = (u(x, y), v(x, y)) ⇔ (x, y) = (x(u, v), y(u, v)). (7.13)
The derivative (matrix) of the mapping h or the so-called Jacobian matrix at
the point (u, v) is
\begin{pmatrix} ∂h₁(u,v)/∂u & ∂h₁(u,v)/∂v \\ ∂h₂(u,v)/∂u & ∂h₂(u,v)/∂v \end{pmatrix} = \begin{pmatrix} ∂x/∂u & ∂x/∂v \\ ∂y/∂u & ∂y/∂v \end{pmatrix}.   (7.14)

The determinant of the derivative matrix (the Jacobian matrix) is called the
functional determinant or the Jacobian determinant or the Jacobian. The
concept of a Jacobian determinant was introduced in the 19th century by the German
mathematician Carl Gustav Jacobi. (Warning: some textbooks use the term
Jacobian to refer to the derivative matrix and not its determinant.)
The Jacobian, i.e. the determinant of the Jacobian matrix, is denoted by
various symbols. We will be using both the modern notation

J_h(u, v),

and the following classical notation, which has been in use since the mid-19th
century:

∂(x, y)/∂(u, v).

The classical notation is more useful when the functions are defined by concrete
expressions. Here

J_h(u, v) = ∂(x, y)/∂(u, v) = det \begin{pmatrix} ∂x/∂u & ∂x/∂v \\ ∂y/∂u & ∂y/∂v \end{pmatrix} = (∂x/∂u)(∂y/∂v) − (∂x/∂v)(∂y/∂u).

The derivative matrices of a function g and its inverse function h are inverse
matrices of each other which implies the following identity for their Jacobians:
∂(x, y) ∂(x, y)
1 = Jh (u, v)Jg (x, y) = . (7.15)
∂(u, v) ∂(u, v)
In this expression (x, y) and (u, v) are related to each other bijectively according
to formula (7.13). Formula (7.15) follows from the facts that the determinant
of an identity matrix is equal to one, and that det(A B) =det(A)det(B), which
is valid when A and B are square matrices of equal size.
Theorem 7.8. Under assumptions (7.10) and (7.11), and using the notation
above in this section, the random vector (U, V) has a continuous distribution
with a density function

f_{U,V}(u, v) = f_{X,Y}(h₁(u, v), h₂(u, v)) |J_h(u, v)| when (u, v) ∈ B, and f_{U,V}(u, v) = 0 otherwise.   (7.16)

Proof. We have (U, V) ∈ B with probability one, since

P((U, V) ∈ B) = P(g(X, Y) ∈ B) = P((X, Y) ∈ g⁻¹(B)) = P((X, Y) ∈ A) = 1.

Let C ⊂ B be an arbitrary set. Now


P((U, V) ∈ C) = P(g(X, Y) ∈ C) = P((X, Y) ∈ h(C)) = ∬_{h(C)} f_{X,Y}(x, y) dx dy.

By changing variables (u, v) = g(x, y), or (x, y) = h(u, v), the area integral
takes the following form (an appropriate change-of-variables formula for the
Lebesgue integral can be found, for example, in Billingsley's book [1], Theorem
17.2):

∬_{h(C)} f_{X,Y}(x, y) dx dy = ∬_C f_{X,Y}(h₁(u, v), h₂(u, v)) |J_h(u, v)| du dv.

These observations prove the claim.


The change-of-variable formula (7.16) is perhaps easiest to remember in the
form
fX,Y (x, y)|∂(x, y)| = fU,V (u, v)|∂(u, v)|, (7.17)
when

(u, v) = g(x, y) ⇔ (x, y) = h(u, v) , (x, y) ∈ A, (u, v) ∈ B. (7.18)

From this relation the unknown quantities are solved in terms of the known
ones. If f_{X,Y} is known, then

f_{U,V}(u, v) = f_{X,Y}(x, y) |∂(x, y)/∂(u, v)| = f_{X,Y}(h₁(u, v), h₂(u, v)) |J_h(u, v)|,

when (u, v) ∈ B. The same joint density function can also be expressed as

f_{U,V}(u, v) = f_{X,Y}(x, y) / |∂(u, v)/∂(x, y)| = f_{X,Y}(h₁(u, v), h₂(u, v)) / |J_g(h₁(u, v), h₂(u, v))|,

when (u, v) ∈ B. The validity of this expression is based on formula (7.15).


When applying the transformation formula (7.16) of the density function,
it is important to make a careful bookkeeping of the sets in which the formula
is valid As we will see in the next example, indicators of sets can be helpful
in this bookkeeping. Unfortunately, for further considerations (for example for
computing the marginal distributions), it is often necessary to express the set
B in one of the forms

B = {(u, v) : a < u < b, c(u) < v < d(u)}

or
B = {(u, v) : c < v < d, a(v) < u < b(v)}.
This can result in some extra work.
Example 7.6. Suppose that a random vector (X, Y ) has a continuous distri-
bution with a joint density function given by
f_{X,Y}(x, y) = k(x, y) when x > 0 and y > 0, and f_{X,Y}(x, y) = 0 otherwise.   (7.19)

Let us present this jdf as a product of the indicator 1{x > 0, y > 0} of the
positive quadrant, and of the function k(x, y)

fX,Y (x, y) = 1{x > 0, y > 0}k(x, y) .

In this situation, the expression k(x, y) is not necessarily well-defined if x ≤ 0 or


y ≤ 0. The previous expression should be viewed as an abbreviation for formula
(7.19). We will use this convention, where the indicator of a set will annihilate
a possibly undefined expression, if needed.
Let us consider a random vector (U, V ) where U = X + Y and V = X − Y .
The diffeomorphism corresponding to this transformation is
u = x + y,   x = (u + v)/2,
v = x − y,   y = (u − v)/2.

This is a diffeomorphism between the whole plane and itself. Thus, the joint
density function of the random variables U and V is

f_{U,V}(u, v) = f_{X,Y}(x, y) |∂(x, y)/∂(u, v)| = (1/2) · 1{u + v > 0, u − v > 0} k((u + v)/2, (u − v)/2).

Next, let us calculate the marginal density of U. In order to determine the
integration limits correctly, the pair of inequalities u + v > 0 and u − v > 0
should be solved into the form

a < u < b,   c(u) < v < d(u).

After some examination we will notice (draw a picture) that the solution is

u > 0, −u < v < u.

Hence, the marginal density function of U is


f_U(u) = ∫_{−u}^{u} (1/2) k((u + v)/2, (u − v)/2) dv,   u > 0.
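The Jacobian computations of this example can be reproduced with a computer algebra system; the sketch below (using sympy, an arbitrary choice of tool) computes the Jacobian matrix and determinant of the inverse mapping h.

    import sympy as sp

    u, v = sp.symbols('u v', real=True)

    # Inverse mapping h of (u, v) = (x + y, x - y):
    x = (u + v) / 2
    y = (u - v) / 2

    J = sp.Matrix([x, y]).jacobian([u, v])   # Jacobian matrix of h
    print(J)                                 # Matrix([[1/2, 1/2], [1/2, -1/2]])
    print(J.det())                           # -1/2, so |J_h(u, v)| = 1/2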

Note that a diffeomorphism is always a mapping between two Euclidean
spaces of the same dimension. Sometimes the goal is to derive a density function
of a scalar transformation U = g₁(X, Y) of a continuously distributed random
vector (X, Y). This can be done with two different techniques.

1. We first calculate the cumulative distribution function of U,

   F_U(u) = P(g₁(X, Y) ≤ u) = ∬_{{(x,y): g₁(x,y) ≤ u}} f_{X,Y}(x, y) dx dy,

   by integrating over this set. The density function of U can then be calculated
   by differentiation, if the distribution of U turns out to be continuous.
2. We can try to (artificially) complement the transformation to make it a
bijection by choosing a suitable V = g2 (X, Y ) and then deriving the density
function of the random vector (U, V ). Finally the density function of the
variable U is calculated by integration.

Out of these two approaches the latter one is often more straightforward.
Example 7.7 (The density function of a sum). Let a random vector (X, Y )
have a density function fX,Y . Let

U = X + Y.

Derive the density function of the random variable U. What kind of expression
do you get if X ⊥⊥ Y?

Solution by complementing into a bijection: In this problem we
could complement the transformation into a bijection in various ways, but here we
choose V = X. Then

u = x + y,   x = v,
v = x,       y = u − v,

and

f_{U,V}(u, v) = f_{X,Y}(x, y) |∂(x, y)/∂(u, v)| = f_{X,Y}(v, u − v).

The marginal density function of the random variable U can be calculated from this
expression by integrating out v:

f_U(u) = ∫_{−∞}^{∞} f_{X,Y}(v, u − v) dv.

Solution via the cumulative distribution function: When we calculate
the cumulative distribution function at the point u, the integration set is

{(x, y) : x + y ≤ u} = {(x, y) : x ∈ R, y ≤ u − x}.

Thus

F_U(u) = ∫_{−∞}^{∞} dx ∫_{−∞}^{u−x} f_{X,Y}(x, y) dy.

If moving the derivative under the integral sign can be justified, then

f_U(u) = F_U′(u) = ∫_{−∞}^{∞} (∂/∂u) ∫_{−∞}^{u−x} f_{X,Y}(x, y) dy dx = ∫_{−∞}^{∞} f_{X,Y}(x, u − x) dx.

In the latter calculation, it is not entirely clear what conditions should be
placed on the joint density function so that the calculation can be rigorously
justified. The first solution showed that in fact no extra conditions are needed.
If X ⊥⊥ Y, then f_{X,Y}(x, y) = f_X(x)f_Y(y) identically, and

f_U(u) = ∫_{−∞}^{∞} f_X(v) f_Y(u − v) dv.

Thus f_U is the convolution of the density functions f_X and f_Y,

f_U = f_X ∗ f_Y.
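The convolution formula can be checked numerically. In the sketch below X and Y are taken to be independent Exp(1) variables (an arbitrary choice), in which case the convolution agrees with the Gamma(2, 1) density.

    from scipy.integrate import quad
    from scipy.stats import expon, gamma

    # X, Y independent Exp(1); then f_U = f_X * f_Y is the Gamma(2, 1) density.
    f_X = expon.pdf
    f_Y = expon.pdf

    def f_U(u):
        # numerical convolution integral: int f_X(v) f_Y(u - v) dv
        value, _ = quad(lambda v: f_X(v) * f_Y(u - v), 0.0, u)
        return value

    for u in (0.5, 1.0, 2.0, 4.0):
        print(u, f_U(u), gamma.pdf(u, a=2))   # the last two columns should agree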

Literature
[1] P. Billingsley. Probability and Measure. John Wiley & Sons, 2nd ed., 1986.

7.9 Properties of the t-distribution


The Student t-distribution with ν > 0 degrees of freedom, or simply the t_ν-distribution,
is defined as the distribution of the random variable

Y = Z / √(X/ν),   (7.20)

where X ∼ χ²_ν = Gam(ν/2, 1/2), Z ∼ N(0, 1), and X ⊥⊥ Z. To demonstrate the
change of variables technique, we will calculate the density function of the
t-distribution. Furthermore, we will derive the expectation and the variance of
this distribution by using its stochastic representation.
By independence,

f_{X,Z}(x, z) = f_X(x)f_Z(z) = ((1/2)^{ν/2} / Γ(ν/2)) x^{ν/2 − 1} e^{−x/2} · (1/√(2π)) e^{−z²/2},   x > 0.

We complement the mapping to a bijection by choosing U = X. Then we are


led to consider a diffeomorphism
u = x,           x = u,
y = z/√(x/ν),    z = y√(u/ν),

where x > 0 and u > 0. Thus

f_{U,Y}(u, y) = f_{X,Z}(x, z) |∂(x, z)/∂(u, y)| = f_{X,Z}(u, y√(u/ν)) √(u/ν),   u > 0.

From this we get the marginal density by integrating out u:

f_Y(y) = ∫_0^∞ f_{X,Z}(u, y√(u/ν)) √(u/ν) du = (Γ((ν + 1)/2) / (Γ(ν/2) √(νπ))) · 1/(1 + y²/ν)^{(ν+1)/2}.

The integral above can be calculated by “integrating like a statistician”: as we


know the formula of the gamma density, we can directly write down the value of
this kind of integral.
With the value ν = 1, the density function takes the form

f_Y(y) = 1/(π(1 + y²)),

and therefore t₁ is the Cauchy distribution.
The Student t-distribution does not have absolute moments of all orders. If a > 0,
then

E|Y|^a < ∞  ⇔  a < ν.

This fact can be seen, for example, from the stochastic representation (7.20) of
the distribution, which gives

E|Y|^a = ν^{a/2} E|Z|^a E X^{−a/2},

which is finite if both expectations on the right-hand side are finite. The
absolute moments E|Z|^a of the normal distribution are finite, but by considering
the latter expectation we easily end up with the condition a < ν. For example,
the Cauchy distribution does not have an expectation, and the t-distribution has
a variance only when ν > 2. As not all moments of the distribution are finite,
the moment generating function of the t-distribution does not exist in any
neighbourhood of the origin, and therefore the moment generating function is a
useless tool for this distribution.
The expectation and the variance of the Student t-distribution are easily
derived from the stochastic representation (7.20) of the distribution. If ν > 1,
then the t_ν-distribution has expectation zero. This is the case since, when
Z ⊥⊥ X,

EY = EZ · E√(ν/X) = 0 · E√(ν/X) = 0.

If ν > 2, then the variance of the t_ν-distribution is

var Y = EY² = EZ² · E(ν/X) = 1 · E(ν/X).

After a few steps (integrate like a statistician and use the functional equation
of the gamma function) it can be seen that

var Y = ν/(ν − 2),   when ν > 2.
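The moment formulas can be illustrated by simulating from the stochastic representation (7.20); in the sketch below ν = 5 is an arbitrary choice.

    import numpy as np

    rng = np.random.default_rng(9)
    nu = 5.0
    n = 1_000_000

    # Stochastic representation (7.20): Y = Z / sqrt(X / nu),
    # with Z ~ N(0, 1) and X ~ chi^2_nu independent.
    z = rng.normal(size=n)
    x = rng.chisquare(df=nu, size=n)
    y = z / np.sqrt(x / nu)

    print("sample mean    :", y.mean())          # ~ 0 when nu > 1
    print("sample variance:", y.var())           # ~ nu / (nu - 2) when nu > 2
    print("nu / (nu - 2)  =", nu / (nu - 2))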

Chapter 8

Conditional distribution

8.1 Conditional distributions


If a random vector (X, Y ) has a discrete distribution, then the conditional prob-
ability mass function (pmf) of the random variable X, under condition (Y = y),
is defined by the formula of conditional probability

f_{X|Y}(x|y) = P(X = x | Y = y) = P(X = x, Y = y)/P(Y = y) = f_{X,Y}(x, y)/f_Y(y),   (8.1)

when y is a point, where fY (y) > 0. The function fX|Y (·|y) is the pmf of the
random variable X, when we know that Y = y. The conditional pmf fY |X (y|x)
is defined analogously.
Formula (8.1) can be used to define a conditional density function in the
case of a continuous joint distribution.
Definition 8.1. Let (X, Y ) have a continuous distribution with density function
fX,Y . The conditional density function, given (Y = y), of the random variable
X is defined as
f_{X|Y}(x|y) = f_{X,Y}(x, y)/f_Y(y),   x ∈ R,

when y is such that f_Y(y) > 0. Similarly we define

f_{Y|X}(y|x) = f_{X,Y}(x, y)/f_X(x),   y ∈ R,

when f_X(x) > 0.


Note that x ↦ f_{X|Y}(x|y) is a density function since it is non-negative and

∫_{−∞}^{∞} f_{X|Y}(x|y) dx = (1/f_Y(y)) ∫_{−∞}^{∞} f_{X,Y}(x, y) dx = 1.

Figure 8.1: A joint density function and conditional densities x ↦ f_{X|Y}(x|y)
with some choices of y (y = 0, 3, 6).

The function x ↦ f_{X|Y}(x|y) is derived from the joint density function f_{X,Y}
by normalising its horizontal “section” x ↦ f_{X,Y}(x, y) into a density function.
Similarly the function y ↦ f_{Y|X}(y|x) is derived by normalising the vertical
section y ↦ f_{X,Y}(x, y). Figure 8.1 shows how a conditional density function
can be derived from the joint density function.
A continuous joint distribution can be represented by different density func-
tions as long as they agree almost everywhere. For the purpose of defining a
conditional density function, we can use any choice of the joint density func-
tion. Therefore, the conditional densities are not unique either. Nevertheless,
it turns out that the ambiguity does not cause serious problems. Therefore,
the non-uniqueness will not be emphasised, and we will simply talk about “the”
conditional density function (rather than more accurately about a version of
the conditional density function).
In case of a continuous joint distribution, the probability of a condition
Y = y is always zero. Therefore, the conditional density function cannot be
directly interpreted in terms of a conditional probability. It is a definition that
we will explore further now.
We will make it plausible that, if the jdf f_{X,Y} is smooth enough (precise
conditions for this will not be specified) and f_Y(y) > 0, then

P(X ∈ A | y ≤ Y ≤ y + h) → ∫_A f_{X|Y}(x|y) dx   as h → 0+,   for all A ⊂ R.   (8.2)

In other words, when h > 0 is small, the conditional probability
P(X ∈ A | y ≤ Y ≤ y + h) can be roughly obtained by integrating the conditional
density function f_{X|Y}(x|y) with respect to the variable x over the set A, for
arbitrary A ⊂ R. The value of a continuous random variable can only be observed
with a certain accuracy. Therefore, this type of interpretation is appropriate
when we are dealing with a conditional density function, conditioned on Y = y,
where y is an observed value.
The claim (8.2) will be explained starting from
\[
P(X \in A \mid y \le Y \le y + h)
= \frac{\tfrac1h P(X \in A,\; y \le Y \le y + h)}{\tfrac1h P(y \le Y \le y + h)}
= \frac{\tfrac1h \int_y^{y+h} \bigl( \int_A f_{X,Y}(x, u)\,dx \bigr)\,du}{\tfrac1h \int_y^{y+h} f_Y(u)\,du} .
\]

Both the numerator and the denominator contain integral averages of the form
\[
\frac{1}{h} \int_y^{y+h} g(u)\,du,
\]
which approach the value g(y) under quite general assumptions, when h → 0+.
For example, if the function g is continuous at the point y and integrable in some
neighbourhood of it, then for any ε > 0 there is a δ > 0 such that |g(u) − g(y)| < ε
whenever |u − y| < δ. If h is any number that satisfies 0 < h < δ, then
\[
g(y) - \varepsilon = \frac{1}{h} \int_y^{y+h} (g(y) - \varepsilon)\,du
\le \frac{1}{h} \int_y^{y+h} g(u)\,du
\le \frac{1}{h} \int_y^{y+h} (g(y) + \varepsilon)\,du = g(y) + \varepsilon .
\]

This proves that the integral average approaches the limit g(y) when h → 0+.
Using this with g(u) = fY(u) in the denominator, and g(u) = ∫_A fX,Y(x, u)dx
in the numerator, we obtain
\[
P(X \in A \mid y \le Y \le y + h) \xrightarrow{\;h \to 0+\;} \frac{\int_A f_{X,Y}(x, y)\,dx}{f_Y(y)} = \int_A \frac{f_{X,Y}(x, y)}{f_Y(y)}\,dx .
\]

Even without continuity, we have this limit at almost every y ∈ R (i.e., with
possible exception only in a set of measure zero) by Lebesgue’s differentiation
theorem from Real Analysis.

8.2 The chain rule (multiplication rule)


Based on the definitions of the conditional probability mass function (pmf) and
density function (df) we can derive the multiplication rule or the chain rule

fX,Y (x, y) = fX (x)fY |X (y|x) = fY (y)fX|Y (x|y), for all x, y. (8.3)

In case of a continuous distribution, this equation needs some interpretation as


the jdf and the conditional df are not unique. The equation can be interpreted to
mean that if the marginal density function fX (x) and conditional df fY |X (y|x)

are formulated from some versions (possibly different from one another) of the
jdf, then their product is a valid jdf.
Another problem with the formula is that fY|X(y|x) is not necessarily defined
for all x. We shall agree that, for the purposes of the multiplication rule, the
definitions of the conditional df and the conditional pmf will be extended (if needed)
to include those points where the density of the conditioning variable is zero. For
example, we can agree that
\[
f_{Y|X}(y|x) =
\begin{cases}
\dfrac{f_{X,Y}(x, y)}{f_X(x)}, & \text{when } f_X(x) > 0, \\[1ex]
0, & \text{otherwise.}
\end{cases} \tag{8.4}
\]
(We could also use some other consistent convention.) A similar extension will
naturally be used for the other conditional df fX|Y(x|y) as well. After this, the
multiplication rule is valid for all x, y ∈ R. These complications do not cause
any real problems with actual computations.
From the multiplication rule, we can solve one conditional density function
if the marginal densities and the other conditional df are known. For
example, when fX(x) > 0 (and other values of x are irrelevant), then
\[
f_{Y|X}(y|x) = \frac{f_{X,Y}(x, y)}{f_X(x)} = \frac{f_Y(y)\, f_{X|Y}(x|y)}{f_X(x)} . \tag{8.5}
\]
This is the Bayes formula for density functions.
Notice that, when x is fixed, according to the Bayes formula,
\[
f_{Y|X}(y|x) = \frac{1}{f_X(x)}\, f_Y(y) f_{X|Y}(x|y) \propto f_Y(y) f_{X|Y}(x|y) = f_{X,Y}(x, y).
\]
Proportionality g(y) ∝ h(y) means that
\[
g(y) = c\,h(y)
\]
with some proportionality constant c (which may depend on the parameters and
on the other variables, but not on the variable y). This observation sometimes gives a
shortcut to determining conditional probability densities, as illustrated in the
following:
Example 8.1. In Example 7.7, the jdf has the expression
\[
f_{X,Y}(x, y) = \frac{1}{\sqrt{2\pi}}\, \mathbf{1}\{0 < y < \exp(-\tfrac12 x^2)\}.
\]
The conditional density functions can either be calculated by dividing this jdf by
the marginal density functions obtained earlier, or by the following reasoning.
a) fY|X: With a fixed x, the jdf fX,Y(x, y) has a constant positive value when
0 < y < exp(−½x²). For this reason the conditional distribution is a uniform
distribution,
\[
Y \mid (X = x) \sim \mathrm{U}\bigl(0, \exp(-\tfrac12 x^2)\bigr).
\]
Recalling that the marginal distribution of X is X ∼ N(0, 1), the joint distribution
can be rewritten as
\[
Y \mid X \sim \mathrm{U}\bigl(0, \exp(-\tfrac12 X^2)\bigr), \qquad X \sim N(0, 1).
\]
b) fX|Y: For fixed 0 < y < 1, the jdf has a constant positive value over a certain
interval on the x-axis, which can be found by solving the inequality
\[
0 < y < \exp(-\tfrac12 x^2)
\]
for the variable x. This inequality was solved in Example 7.2. The result shows
that the conditional distribution is a uniform distribution,
\[
X \mid (Y = y) \sim \mathrm{U}\bigl(-\sqrt{-2 \ln y},\; \sqrt{-2 \ln y}\bigr), \qquad 0 < y < 1.
\]

8.3 The joint distribution of discrete and continuous random variables
Let us consider the joint distribution of two random variables X and Y that are
defined on the same probability space, where X is discrete and Y is continuous.
Let X have the possible values x1, x2, . . . and let fX be its (marginal) probability
mass function. It can be shown that, under these conditions, there exist density
functions y ↦ fY|X(y|xi), i ≥ 1, such that
\[
P(X = x_i,\, Y \in B) = f_X(x_i) \int_B f_{Y|X}(y|x_i)\,dy \qquad \text{for all } i \text{ and all } B \subset \mathbb{R},
\]

but the proof would require the so-called Radon–Nikodym theorem from measure
theory.
Because of this, the joint distribution is given by

fX,Y (x, y) = fX (x)fY |X (y|x)

The following probabilities can be calculated by summing over the discrete variable
and integrating with respect to the continuous variable,
\[
P(X \in A,\, Y \in B) = \sum_{x \in A} \int_B f_{X,Y}(x, y)\,dy, \qquad A, B \subset \mathbb{R}.
\]

Even in this case we might refer to the joint distribution fX,Y as a density or a
density function, but it is important to keep in mind that we sum with respect
to one variable and integrate with respect to the other variable.
The probability mass function (pmf), given Y = y, is defined by the Bayes
formula, namely
\[
f_{X|Y}(x|y) = \frac{f_{X,Y}(x, y)}{f_Y(y)},
\]
when fY (y) > 0. This formula could also be motivated via limit considerations
as we did with the joint continuous distribution.
The multiplication rule and the Bayes formula are valid in this case as well. If
needed, the definitions of the conditional density fY |X (y|x) and the probability
mass function fX|Y (x|y) can be extended also to such values of the arguments
that were previously not covered by the definition.
When one of the random variables is discrete and another one is continuous,
expectations are computed by summing in the discrete variable and integrating
in the continuous variable. For instance, the law of the unconscious statistician
now takes the form:
Theorem 8.1. Let (X, Y ) be a random vector such that X has a discrete dis-
tribution and Y a continuous distribution. Let Z = g(X, Y ) be its real-valued
transformation. Then
\[
\operatorname{E} Z = \sum_x \int g(x, y) f_{X,Y}(x, y)\,dy = \int \sum_x g(x, y) f_{X,Y}(x, y)\,dy,
\]
provided that
\[
\sum_x \int |g(x, y)|\, f_{X,Y}(x, y)\,dy < \infty.
\]
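As a quick numerical check of this sum-plus-integral recipe, the following sketch evaluates E g(X, Y) for a hypothetical toy model (all values chosen purely for illustration): X ∼ Bernoulli(0.3), Y |(X = x) exponential with rate 1 + x, and g(x, y) = x + y; the exact value is 0.7 · 1 + 0.3 · 1.5 = 1.15.

import numpy as np
from scipy.integrate import quad
from scipy.stats import expon

p = 0.3
f_X = {0: 1 - p, 1: p}                       # marginal pmf of X

def f_cond(y, x):                            # conditional density f_{Y|X}(y|x): Exp(rate 1 + x)
    return expon.pdf(y, scale=1.0 / (1 + x))

def g(x, y):
    return x + y

# Theorem 8.1: sum over the discrete variable, integrate over the continuous one
EZ = sum(f_X[x] * quad(lambda y: g(x, y) * f_cond(y, x), 0, np.inf)[0] for x in f_X)
print(EZ)    # approximately 1.15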

8.4 The conditional expectation


Definition 8.2 (Conditional expectation and conditional variance, given the
value of a random variable). Let (X, Y ) be a random vector, let g(X, Y ) be its
transformation, and let us assume that we can define the conditional distribution
of the random variable Y , given X = x. The conditional expectation of the
random variable g(X, Y ), given X = x, denoted by

E(g(X, Y )|X = x),

is the expectation of the random variable g(x, Y ), evaluated according to the


conditional distribution of Y , given X = x. The conditional variance, given
X = x, of the random variable g(X, Y ), denoted by

var(g(X, Y )|X = x),

is the variance of random variable g(x, Y ), evaluated according to the conditional


distribution of Y , given X = x.
If Y has a continuous distribution, then
\[
\operatorname{E}(g(X, Y) \mid X = x) = \int g(x, y) f_{Y|X}(y|x)\,dy,
\]
and
\[
\operatorname{var}(g(X, Y) \mid X = x) = \int \bigl[g(x, y) - \operatorname{E}(g(X, Y) \mid X = x)\bigr]^2 f_{Y|X}(y|x)\,dy.
\]
If the square in the latter formula is expanded and the terms are collected, we see that
\[
\operatorname{var}(g(X, Y) \mid X = x) = \operatorname{E}[(g(X, Y))^2 \mid X = x] - \{\operatorname{E}[g(X, Y) \mid X = x]\}^2 . \tag{8.6}
\]
This is the counterpart, for the conditional variance, of the familiar formula
var Z = EZ² − (EZ)². If X is a discrete random variable, summation should be used
in place of integration in the previous formulas, and identity (8.6) remains valid.
If a function g takes the form
g(x, y) = g1 (x)g2 (x, y),
then,
E(g1 (X)g2 (X, Y )|X = x) = g1 (x)E(g2 (X, Y )|X = x). (8.7)
This can be expressed by saying that the known factors can be pulled out of the
conditional expectation.
The conditional expectation corresponding to g(x, y) = y is called the
conditional expectation of Y, given X = x.
Definition 8.3 (Conditional expectation of Y , given X = x; regression func-
tion). The conditional expectation of a random variable Y , given X = x, is the
expectation of its conditional distribution.
E(Y |X = x).
The function
x 7→ E(Y |X = x)
is called the regression function of Y on X.
If Y has a continuous distribution, then
\[
\operatorname{E}(Y \mid X = x) = \int y\, f_{Y|X}(y|x)\,dy,
\]
and if Y has a discrete distribution, then
\[
\operatorname{E}(Y \mid X = x) = \sum_y y\, f_{Y|X}(y|x).
\]

Figure 8.2 illustrates the conditional expectation E(Y |X = x), or the regression
function, and the conditional variance var(Y |X = x) for the joint distribution
of Figure 8.1.
Definition 8.4 (Conditional expectation given a random variable). Let us tem-
porarily denote
m(x) = E(g(X, Y )|X = x).
Let us agree that m(x) = 0 for those x for which m(x) cannot be defined,
i.e., the conditional distribution Y |(X = x) is not defined. After this, we can
talk about the transformed random variable m(X). It is called the conditional
expectation of the random variable g(X, Y ) given the random variable X, and
denoted by
E(g(X, Y )|X) = m(X).

Figure 8.2: (a) A joint distribution, some conditional densities y ↦ fY|X(y|x)
and the conditional expectation E(Y |X = x). (b) The conditional variance
var(Y |X = x).

Note. It might occur to one to use the notation E(g(X, Y )|X = X), but it
does not make sense. The conditional expectation E(g(X, Y )|X) is a random
variable defined in such a way that it takes the value E(g(X, Y )|X = x) with a
probability of one, if X takes the value x.
Theorem 8.2. If E[g(X, Y)²] < ∞, then E(g(X, Y)|X) is the best approximation,
in terms of the mean square error, of the random variable g(X, Y) among
all transformations of the random variable X only. Namely,
\[
\operatorname{E}\bigl[\bigl(g(X, Y) - \operatorname{E}(g(X, Y) \mid X)\bigr)^2\bigr] \le \operatorname{E}\bigl[\bigl(g(X, Y) - h(X)\bigr)^2\bigr]
\]
for all functions h : R → R.


Proof. Exercise.
In particular, if we need to predict the value of a random variable Y by a
function of a random variable X, then, in terms of the mean square error, the
best possible prediction is m(X), where m is the regression function defined by
m(x) = E[Y |X = x].
Theorem 8.3. The expectation can be computed as an iterated expectation,
namely
Eg(X, Y ) = EE(g(X, Y )|X),
if the expectation Eg(X, Y ) exists as an extended real number.

Proof. We present the argument for the discrete case:
\[
\operatorname{E} g(X, Y) = \sum_{x,y} g(x, y) f_{X,Y}(x, y) = \sum_{x,y} g(x, y) f_X(x) f_{Y|X}(y|x)
= \sum_x f_X(x) \sum_y g(x, y) f_{Y|X}(y|x)
= \sum_x f_X(x) \operatorname{E}(g(X, Y) \mid X = x) = \operatorname{E}\operatorname{E}(g(X, Y) \mid X).
\]

Example 8.2. Let us consider the joint distribution
\[
X \mid Y \sim \operatorname{Bin}(Y, \theta), \qquad Y \sim \operatorname{Poi}(\lambda),
\]
where λ > 0 and 0 < θ < 1. Now
\[
\operatorname{E}(X \mid Y) = Y\theta,
\]
and therefore
\[
\operatorname{E} X = \operatorname{E}\operatorname{E}(X \mid Y) = \operatorname{E}(Y\theta) = \theta \operatorname{E} Y = \theta\lambda.
\]
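A quick Monte Carlo check of this iterated-expectation calculation (a minimal sketch, with the arbitrary illustrative choices λ = 4 and θ = 0.25, so that θλ = 1):

import numpy as np

rng = np.random.default_rng(0)
lam, theta = 4.0, 0.25
Y = rng.poisson(lam, size=100_000)      # Y ~ Poi(lambda)
X = rng.binomial(Y, theta)              # X | Y ~ Bin(Y, theta)
print(X.mean(), theta * lam)            # both approximately 1.0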
As is the case with the conditional expectation, the conditional variance can also
be calculated given a random variable X (rather than its value X = x).
First, let us define a function
v(x) = var(g(X, Y)|X = x).
The definition is extended to the whole real axis by agreeing that v(x) = 0
for those arguments x for which the conditional distribution Y |(X = x) is not
naturally defined. After this, we define

var(g(X, Y )|X) = v(X).

Theorem 8.4. The variance of a random variable equals the sum of the
expectation of the conditional variance and the variance of the conditional
expectation, namely

var(g(X, Y )) = Evar(g(X, Y )|X) + varE(g(X, Y )|X). (8.8)

Proof. Exercise.
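As an illustration, formula (8.8) can be checked on the model of Example 8.2, where var(X|Y) = Yθ(1 − θ) and E(X|Y) = Yθ (a sketch of the calculation):
\[
\operatorname{var} X = \operatorname{E}\operatorname{var}(X \mid Y) + \operatorname{var}\operatorname{E}(X \mid Y)
= \operatorname{E}[Y\theta(1-\theta)] + \operatorname{var}(Y\theta)
= \lambda\theta(1-\theta) + \theta^2\lambda = \lambda\theta,
\]
which is consistent with the fact, derived in Example 8.3 below, that X has the Poi(λθ) distribution.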
Note. In more advanced probability theory (based on measure theory),
the conditional expectation E(g(X, Y )|X) is defined with a completely different
technique than we are using now. Furthermore, it is usually defined in text-
books only when g(X, Y ) is integrable, i.e., when E|g(X, Y )| < ∞. However,
integrability is an unnecessary restriction here; it is sufficient to assume quasi-
integrability, i.e., the existence of Eg(X, Y) as an extended real number. This
extended theory is quite difficult to come across in textbooks, but it is present
in the works of Ash [1], Chow and Teicher [2], and Jacod and Protter [3].
According to this extended theory, the integrability of the random variable
g(X, Y) can be checked by calculating the expectation E|g(X, Y)| using
the formula EE(|g(X, Y)| | X). If the result is finite, then g(X, Y) is integrable
meaning that Eg(X, Y ) is a real number, which can be calculated with the
formula EE(g(X, Y )|X). When checking whether the integral is finite, we can
apply the same technique that was used with Fubini’s theorem.

Literature
1. Robert B. Ash. Probability and Measure Theory. Academic Press, 2nd edition, 2000.
2. Y. S. Chow and H. Teicher. Probability Theory: Independence, Interchangeability, Martingales. Springer-Verlag, 2nd edition, 1988.
3. Jean Jacod and Philip Protter. Probability Essentials. Springer, 2nd edition, 2002.

8.5 Hierarchical definition of a joint distribution


Statistical models are often specified by revealing the marginal distribution of
one random variable and the conditional distribution of another random vari-
able. In this case we talk about a hierarchical model. Let us consider some
examples.
Example 8.3 (Discrete and discrete). The number of eggs laid by an insect
is distributed according to Y ∼ Poi(λ), where λ > 0 is the expectation of the
distribution. Each egg develops into a larva with probability 0 < θ < 1,
independently of the other eggs. Let X be the number of eggs that develop
into a larva. Now,
\[
X \mid (Y = y) \sim \operatorname{Bin}(y, \theta).
\]
The joint distribution is discrete and its joint probability mass function is
\[
f_{X,Y}(x, y) = f_Y(y) f_{X|Y}(x|y) = e^{-\lambda} \frac{\lambda^y}{y!} \binom{y}{x} \theta^x (1 - \theta)^{y - x}, \qquad 0 \le x \le y,
\]

where x and y are integers 0, 1, 2, . . .


This model can be defined by stating that
X|Y ∼ Bin(Y, θ) ,
Y ∼ Poi (λ)
where λ > 0 and 0 < θ < 1 are constants.

We can now determine the marginal distribution of X. Its probability mass
function can be expressed as a sum
\[
f_X(x) = \sum_y f_{X,Y}(x, y).
\]
After some calculations we find that X ∼ Poi(λθ).
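A sketch of the calculation: substituting j = y − x,
\[
f_X(x) = \sum_{y \ge x} e^{-\lambda} \frac{\lambda^y}{y!} \binom{y}{x} \theta^x (1-\theta)^{y-x}
= e^{-\lambda} \frac{(\lambda\theta)^x}{x!} \sum_{j \ge 0} \frac{(\lambda(1-\theta))^j}{j!}
= e^{-\lambda\theta} \frac{(\lambda\theta)^x}{x!}, \qquad x = 0, 1, 2, \ldots
\]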


Example 8.4 (Continuous and discrete). Let the random variable Θ have a
beta distribution Be(α, β), where α, β > 0 are constants. A biased coin, whose
probability of tails is Θ, is tossed n times and the number of tails is counted.
Given Θ = θ, the distribution of the number of tails X is Bin(n, θ).
Now, the joint distribution can be expressed with the function
\[
f_{\Theta,X}(\theta, x) = \frac{1}{B(\alpha, \beta)}\, \theta^{\alpha - 1} (1 - \theta)^{\beta - 1} \binom{n}{x} \theta^x (1 - \theta)^{n - x},
\]
where 0 < θ < 1 and x = 0, 1, . . . , n.
This model could be specified by stating that
\[
X \mid \Theta \sim \operatorname{Bin}(n, \Theta), \qquad \Theta \sim \operatorname{Be}(\alpha, \beta),
\]
where α, β > 0 and n ≥ 0 are constants.
What is the distribution of the random variable Θ conditioned on the observation
X = x? In the 18th century Thomas Bayes provided a solution to this
problem. We solve it by examining the expression of the joint distribution as a
function of the variable θ, which leads to the conclusion that
\[
\Theta \mid (X = x) \sim \operatorname{Be}(\alpha + x,\; \beta + n - x).
\]
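A sketch of this step, using the proportionality argument of Section 8.2: as a function of θ, for fixed x,
\[
f_{\Theta|X}(\theta|x) \propto f_{\Theta,X}(\theta, x) \propto \theta^{\alpha-1}(1-\theta)^{\beta-1}\,\theta^{x}(1-\theta)^{n-x}
= \theta^{\alpha+x-1}(1-\theta)^{\beta+n-x-1}, \qquad 0 < \theta < 1,
\]
which is, up to a normalising constant, the density of the Be(α + x, β + n − x) distribution.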

This calculation is an example of Bayesian inference. Here, the parameter of


the binomial distribution was treated as a random variable with a certain prior
distribution. After the observation, the parameter has a conditional distribu-
tion conditioned on the observation, and this distribution is referred to as the
posterior distribution of the parameter. In this example, we were lucky as both
the prior and the posterior distributions belong to the same family of distribu-
tions: they are both beta distributions. In this situation, we say that the prior
and posterior distributions form a conjugate family with respect to the likeli-
hood function under consideration. It can also be said that the beta distribution is
the conjugate prior of the binomial likelihood.
Example 8.5 (Continuous and continuous). Many computer programs have
an inbuilt random generator for the normal distribution. Let us consider the
following algorithm, where σX > 0, µX and σZ > 0 are given numbers and m is
a real-valued function of a real variable.
1 Simulate X ∼ N(µ_X, σ_X²).
2 Simulate Z ∼ N(0, σ_Z²).
3 Set Y = m(X) + Z.
Describe the joint distribution of the random variables X and Y .
Note. When a random number generator is called upon multiple times (as
in previous steps 1 and 2), the returned numbers can be regarded as values of
independent random variables, since actual random number generators function
in this way. In textbooks, this property is usually taken as self-evident and is
therefore rarely explained.
Solution: The joint distribution can be expressed by the formulas
\[
Y \mid X \sim N(m(X), \sigma_Z^2), \qquad X \sim N(\mu_X, \sigma_X^2).
\]
The conditional distribution Y |(X = x) can be deduced from the fact that X ⊥⊥ Z,
and hence conditioning on X = x does not affect the distribution of Z.
As the distribution of X is continuous and the conditional distribution of Y,
given X = x, is continuous for all x, the joint distribution is continuous as
well. Its joint density function is
\[
f_{X,Y}(x, y) = f_X(x) f_{Y|X}(y|x)
= \frac{1}{\sigma_X \sqrt{2\pi}} \exp\Bigl(-\frac{(x - \mu_X)^2}{2\sigma_X^2}\Bigr) \cdot
\frac{1}{\sigma_Z \sqrt{2\pi}} \exp\Bigl(-\frac{(y - m(x))^2}{2\sigma_Z^2}\Bigr).
\]
Note: The marginal distribution of X is N(µ_X, σ_X²). The regression function
and the conditional variance of Y are
\[
\operatorname{E}[Y \mid X = x] = m(x), \qquad \operatorname{var}(Y \mid X = x) = \sigma_Z^2,
\]
but the marginal distribution of Y is not a familiar distribution unless a more
specific form of the regression function m is imposed.
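The three simulation steps are straightforward to implement; the following minimal sketch uses the arbitrary illustrative choices m(x) = sin x, µ_X = 0, σ_X = 1 and σ_Z = 0.5.

import numpy as np

rng = np.random.default_rng(1)
mu_X, sigma_X, sigma_Z = 0.0, 1.0, 0.5
m = np.sin                                    # any real-valued function m would do

X = rng.normal(mu_X, sigma_X, size=10_000)    # step 1: X ~ N(mu_X, sigma_X^2)
Z = rng.normal(0.0, sigma_Z, size=10_000)     # step 2: Z ~ N(0, sigma_Z^2)
Y = m(X) + Z                                  # step 3

# each pair (X[i], Y[i]) is a draw from the joint density f_{X,Y} derived above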
Example 8.6 (The previous example continued). Let us choose a linear form
for the regression function m. Now, in terms of the mean square error, m(X)
gives the best estimate of the value of a random variable Y . As m(X) is linear,
it also has to be the best (in terms of the mean square error) linear estimate.
Now, in accordance with the formulas of Section 7.6, it must hold that
\[
m(x) = \mu_Y + \rho \frac{\sigma_Y}{\sigma_X} (x - \mu_X),
\]
where µ_Y and σ_Y > 0 (by assumption) are the expectation and the standard
deviation of the random variable Y, and −1 < ρ < 1 (by assumption) is the
correlation coefficient of X and Y. Furthermore,
\[
\sigma_Z^2 = \sigma_Y^2 (1 - \rho^2).
\]
After laborious but straightforward calculations, the joint density function can
be written as
\[
f_{X,Y}(x, y) = \frac{1}{2\pi \sqrt{\det \mathbf{C}}} \exp\Bigl(-\frac12 (\mathbf{z} - \boldsymbol{\mu})^T \mathbf{C}^{-1} (\mathbf{z} - \boldsymbol{\mu})\Bigr), \qquad \mathbf{z} = \begin{pmatrix} x \\ y \end{pmatrix},
\]
where
\[
\boldsymbol{\mu} = \begin{pmatrix} \mu_X \\ \mu_Y \end{pmatrix}, \qquad
\mathbf{C} = \begin{pmatrix} \sigma_X^2 & \rho\sigma_X\sigma_Y \\ \rho\sigma_X\sigma_Y & \sigma_Y^2 \end{pmatrix}.
\]
Here, µ is the expectation vector of the random vector (X, Y) and C is its
covariance matrix.
The derived distribution is a bivariate normal distribution with parameters
µ and C, which is written as

(X, Y ) ∼ N (µ, C).

Later we will further investigate the multivariate normal distribution, also known
as the multinormal distribution, which is a generalisation of the normal distri-
bution to higher dimensions.

Chapter 9

Multivariate distribution

A random vector may have more than two components. Its distribution is
described in a similar way as the distribution of a bivariate random vector. For
this reason, most of the ideas presented in this chapter are already familiar and
therefore, some of the conclusions will not be explained again. A statistician
needs multivariate distributions in order to define and analyse statistical models.
The new concepts in this chapter are the definition of a marginal distribution
for any subset of components, and calculating conditional distributions by con-
ditioning on any subset of components. Conditional independence is a concept
that becomes interesting only in three or higher dimensions.
Visualising multivariate distributions is difficult, as pictures do not work well in
higher dimensions. Instead, we have to rely on manipulating formulas.

9.1 Random vector


If X1 , . . . , Xn are random variables defined on the same probability space Ω,
then X = (X1 , . . . , Xn ) is an n-dimensional random vector (abbreviated rv).
It is a mapping Ω → R^n such that
\[
\mathbf{X}(\omega) = (X_1(\omega), \ldots, X_n(\omega)) = \begin{pmatrix} X_1(\omega) \\ \vdots \\ X_n(\omega) \end{pmatrix}.
\]

If all the components Xi of the random vector X are discrete random vari-
ables, then its pmf (or the joint probability mass function of random variables
Xi ) is
fX (x) = P (X = x) = P (X1 = x1 , . . . , Xn = xn ). (9.1)
If g(X) is a real-valued transformation of X, then its expectation can be computed by
\[
\operatorname{E} g(\mathbf{X}) = \sum_{\mathbf{x}} g(\mathbf{x}) f_{\mathbf{X}}(\mathbf{x}), \tag{9.2}
\]
if the sum converges absolutely.
A random vector X has a continuous distribution if it has a density function
fX, meaning that
\[
P(\mathbf{X} \in B) = \int_B f_{\mathbf{X}} = \int_B f_{\mathbf{X}}(\mathbf{x})\,d\mathbf{x} \qquad \text{for all } B \subset \mathbb{R}^n. \tag{9.3}
\]
In this case the components X1, . . . , Xn have a continuous joint distribution, and the
density function
\[
f_{X_1,\ldots,X_n}(x_1, \ldots, x_n) = f_{\mathbf{X}}(\mathbf{x}), \qquad \mathbf{x} = (x_1, \ldots, x_n),
\]

is called the joint density function of the random variables X1, . . . , Xn. The
integral in (9.3) is an n-fold integral,
\[
\int_B f_{\mathbf{X}}(\mathbf{x})\,d\mathbf{x} = \int \cdots \int_B f_{\mathbf{X}}(\mathbf{x})\,dx_1 \cdots dx_n,
\]
which can be calculated as an iterated integral by using any integration order.


If g(X) is a real-valued transformation of the rv X, then its expectation can
be calculated using the formula
\[
\operatorname{E} g(\mathbf{X}) = \int g(\mathbf{x}) f_{\mathbf{X}}(\mathbf{x})\,d\mathbf{x}, \tag{9.4}
\]

if the integral converges absolutely. In this case, the integral can be calculated
as an iterated integral in any integration order. When the integration domain is
not indicated, we understand that we are computing the integral over the whole
space (here Rn ). (If the dimension is n ≥ 2, this notation should not cause any
confusion with the indefinite integral, as the concept of an indefinite integral is
only used in dimension one).
In case of a continuous (joint) distribution, the cdf is
\[
F_{\mathbf{X}}(\mathbf{x}) = \int_{-\infty}^{x_1} \cdots \int_{-\infty}^{x_n} f_{\mathbf{X}}(s_1, \ldots, s_n)\,ds_1 \cdots ds_n.
\]
When this function is partially differentiated once in each variable, we see that
\[
\frac{\partial^n F_{\mathbf{X}}(x_1, \ldots, x_n)}{\partial x_1 \cdots \partial x_n} = f_{\mathbf{X}}(x_1, \ldots, x_n),
\]
at least at those points (x1, . . . , xn) where fX is continuous. We can
show that this function is a valid density function of a continuously distributed
rv (when it is arbitrarily defined in those points where the derivative does not
exist).
If X is a discrete rv and Y is continuously distributed random vector, and
they are both defined on the same probability space Ω, then their joint distri-
bution can be expressed with a function

fX,Y (x, y) = fX (x)fY|X (y|x) ,

where fX is the joint pmf of the marginal distribution of the random vector
X, and fY|X (y|x) is the conditional joint density function, given X=x, of the
rv Y. If g(X, Y) is a real-valued transformation, then its expectation can be
calculated with
\[
\operatorname{E} g(\mathbf{X}, \mathbf{Y}) = \sum_{\mathbf{x}} \int g(\mathbf{x}, \mathbf{y}) f_{\mathbf{X},\mathbf{Y}}(\mathbf{x}, \mathbf{y})\,d\mathbf{y}, \tag{9.5}
\]

if the expectation exists.

9.2 The expectation vector and the covariance matrix
Definition 9.1. We will set the following definitions if the expectations exist.
• The expectation (vector) of a random vector X = (X1, . . . , Xn) is the constant
vector with n components
\[
\operatorname{E}\mathbf{X} = \operatorname{E}(\mathbf{X}) = (\operatorname{E}X_1, \ldots, \operatorname{E}X_n) = \begin{pmatrix} \operatorname{E}X_1 \\ \vdots \\ \operatorname{E}X_n \end{pmatrix}.
\]
• The expectation (matrix) of an m × n random matrix Z = [Z_{ij}] is the m × n
constant matrix whose element (i, j) is the expectation of Z_{ij}:
\[
\operatorname{E}\mathbf{Z} = \operatorname{E}\begin{pmatrix}
Z_{11} & Z_{12} & \cdots & Z_{1n} \\
Z_{21} & Z_{22} & \cdots & Z_{2n} \\
\vdots & & & \vdots \\
Z_{m1} & Z_{m2} & \cdots & Z_{mn}
\end{pmatrix}
= \begin{pmatrix}
\operatorname{E}Z_{11} & \operatorname{E}Z_{12} & \cdots & \operatorname{E}Z_{1n} \\
\operatorname{E}Z_{21} & \operatorname{E}Z_{22} & \cdots & \operatorname{E}Z_{2n} \\
\vdots & & & \vdots \\
\operatorname{E}Z_{m1} & \operatorname{E}Z_{m2} & \cdots & \operatorname{E}Z_{mn}
\end{pmatrix}.
\]

• The n × k covariance matrix of random vectors X = (X1, . . . , Xn) and
Y = (Y1, . . . , Yk) is
\[
\operatorname{cov}(\mathbf{X}, \mathbf{Y}) = \operatorname{E}[(\mathbf{X} - \operatorname{E}\mathbf{X})(\mathbf{Y} - \operatorname{E}\mathbf{Y})^T]
= \begin{pmatrix}
\operatorname{cov}(X_1, Y_1) & \operatorname{cov}(X_1, Y_2) & \cdots & \operatorname{cov}(X_1, Y_k) \\
\operatorname{cov}(X_2, Y_1) & \operatorname{cov}(X_2, Y_2) & \cdots & \operatorname{cov}(X_2, Y_k) \\
\vdots & & & \vdots \\
\operatorname{cov}(X_n, Y_1) & \operatorname{cov}(X_n, Y_2) & \cdots & \operatorname{cov}(X_n, Y_k)
\end{pmatrix}.
\]

If cov(X, Y) = 0, then we say that the random vectors X and Y do not


correlate or are uncorrelated.

• The covariance matrix of a random vector X = (X1, . . . , Xn) is the n × n matrix
\[
\operatorname{Cov}(\mathbf{X}) = \operatorname{cov}(\mathbf{X}, \mathbf{X}) = \operatorname{E}[(\mathbf{X} - \operatorname{E}\mathbf{X})(\mathbf{X} - \operatorname{E}\mathbf{X})^T]
= \begin{pmatrix}
\operatorname{cov}(X_1, X_1) & \operatorname{cov}(X_1, X_2) & \cdots & \operatorname{cov}(X_1, X_n) \\
\operatorname{cov}(X_2, X_1) & \operatorname{cov}(X_2, X_2) & \cdots & \operatorname{cov}(X_2, X_n) \\
\vdots & & & \vdots \\
\operatorname{cov}(X_n, X_1) & \operatorname{cov}(X_n, X_2) & \cdots & \operatorname{cov}(X_n, X_n)
\end{pmatrix}.
\]

Notice that the covariance matrix of a random vector is symmetric and its
main diagonal contains the variances of the component random variables.
The covariance Cov of a random vector is a one-argument function, while
the covariance cov is a two-argument function. Notice that the covariance
cov(X, Y) of two random vectors is a matrix.
It follows from the definition that

E(ZT ) = (EZ)T . (9.6)

The calculation rule

E[AZB + C] = A(EZ)B + C (9.7)

is valid when Z is a random matrix and A, B and C are constant matrices with
such dimensions that the expression is defined. In other words, the constant
matrices can be pulled out of the expectation if they are located to the extreme
right or left of the matrix product.
If X consists of subvectors X = (Y, Z), and G(Y) and H(Z) are matrix
expressions, then
\[
\mathbf{Y} \perp\!\!\!\perp \mathbf{Z} \;\Rightarrow\; G(\mathbf{Y}) \perp\!\!\!\perp H(\mathbf{Z}). \tag{9.8}
\]
(Since this implication is valid for matrix-valued functions, it is naturally
valid for vector- and scalar-valued functions as well.) In other words, functions of
independent random vectors are independent.
Here, the independence of matrix expressions means that all components of
G(Y) are independent of the components of H(Z). Based on independence and
the definition of the matrix product, it follows that
\[
\mathbf{Y} \perp\!\!\!\perp \mathbf{Z} \;\Rightarrow\; \operatorname{E}[G(\mathbf{Y}) H(\mathbf{Z})] = \operatorname{E}[G(\mathbf{Y})]\,\operatorname{E}[H(\mathbf{Z})], \tag{9.9}
\]
if the dimensions of the matrix product G(Y)H(Z) are compatible and if the
expectations exist.
Theorem 9.1 (Properties of the covariance). (a) If X ⊥⊥ Y, where X has m
components and Y has n components, then
\[
\operatorname{cov}(\mathbf{X}, \mathbf{Y}) = \mathbf{0}_{m \times n}.
\]
(b) The covariances of X and Y can be calculated with the formula

cov(X, Y) = E[XYT ] − (EX)(EY)T .

(c) If v ∈ Rm is a constant vector and X is a vector with n components, then

cov(v, X) = 0m×n , cov(X, v) = 0n×m .

(d) cov(X, Y) = cov(Y, X)T .


(e) If X1 , X2 and Y are random vectors, then

cov(X1 + X2 , Y) = cov(X1 , Y) + cov(X2 , Y) ,

cov(Y, X1 + X2 ) = cov(Y, X1 ) + cov(Y, X2 ) .

(f) If A and B are constant matrices and v and w are constant vectors, then

cov(AX + v, BY + w) = A cov(X, Y)BT . (9.10)

In particular,
Cov(AX + v) = A Cov(X)AT (9.11)

Proof. All parts follow from the definition of the covariance and the multiplica-
tion rules (9.7) and (9.9).
Example 9.1 (The covariance matrix of a sum). If the random vectors X and
Y have the same length, then

Cov(X + Y) = cov(X + Y, X + Y) = cov(X, X + Y) + cov(Y, X + Y)


= cov(X, X) + cov(X, Y) + cov(Y, X) + cov(Y, Y)
= Cov(X) + Cov(Y) + cov(X, Y) + cov(Y, X).

If X and Y do not correlate, then cov(X, Y) = 0 and cov(Y, X) = 0. Therefore,
in the case of uncorrelated random vectors, the covariance matrix of the sum equals
the sum of the covariance matrices of the random vectors. In particular, this
holds for independent random vectors.
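A small numerical illustration of formula (9.11), a sketch with an arbitrarily chosen matrix A and vector v:

import numpy as np

rng = np.random.default_rng(2)
A = np.array([[2.0, 0.0],
              [1.0, 1.0],
              [0.0, 3.0]])                    # arbitrary 3 x 2 constant matrix
v = np.array([1.0, -1.0, 0.5])                # arbitrary constant vector

U = rng.standard_normal((2, 100_000))         # columns are iid draws with Cov(U) = I_2
X = A @ U + v[:, None]                        # columns are draws of AU + v

print(np.cov(X))                              # empirical covariance matrix
print(A @ A.T)                                # Cov(AU + v) = A Cov(U) A^T = A A^T by (9.11)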

Let us recall that the square matrix B is called


• regular or invertible or non-singular, if it has an inverse matrix B−1 ;
• symmetric if BT = B;

• positive semi-definite if vT Bv ≥ 0 for all v;


• positive definite if vT Bv > 0 for all v 6= 0.

Note: A matrix need not be positive semi-definite even if it is symmetric
and all of its elements are positive. For example, the matrix
\[
\mathbf{B} = \begin{pmatrix} 1 & 2 \\ 2 & 1 \end{pmatrix}
\]
is symmetric, but B is not positive semi-definite: if we choose, for example,
v = (−1, 1), we get v^T B v = −2. Instead, definiteness can be characterised
through the signs (positive / negative) of the eigenvalues of the matrix; a
numerical check is sketched after the list below. Recall that λ ∈ R is an
eigenvalue of a matrix B if there is a nonzero vector v such that Bv = λv.
• A symmetric matrix B is positive semi-definite if and only if all of its
eigenvalues are greater than or equal to zero.
• A symmetric matrix B is positive definite if and only if all of its eigenval-
ues are strictly greater than zero.
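For instance, the eigenvalues of the matrix B above can be checked numerically (a minimal sketch):

import numpy as np

B = np.array([[1.0, 2.0],
              [2.0, 1.0]])
print(np.linalg.eigvalsh(B))    # [-1.  3.]: one eigenvalue is negative,
                                # so B is not positive semi-definite
v = np.array([-1.0, 1.0])
print(v @ B @ v)                # -2.0, the same conclusion directly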

Theorem 9.2. The covariance matrix of a random vector X is a symmetric
and positive semi-definite matrix. If it fails to be positive definite, then there
are binding constraints between the components of the random vector X. This
means that X can take values only on a certain fixed hyperplane.
Proof. Symmetry is a consequence of the symmetry of the covariance operator.
Positive semi-definiteness follows from
\[
\mathbf{v}^T \operatorname{Cov}(\mathbf{X}) \mathbf{v} = \mathbf{v}^T \operatorname{E}[(\mathbf{X} - \operatorname{E}\mathbf{X})(\mathbf{X} - \operatorname{E}\mathbf{X})^T] \mathbf{v}
= \operatorname{E}[\mathbf{v}^T (\mathbf{X} - \operatorname{E}\mathbf{X})(\mathbf{X} - \operatorname{E}\mathbf{X})^T \mathbf{v}] = \operatorname{E}[(\mathbf{v}^T (\mathbf{X} - \operatorname{E}\mathbf{X}))^2] \ge 0,
\]
since the expectation of a non-negative random variable is non-negative. If the
covariance matrix is not positive definite, then there exists v ≠ 0 such that
\[
0 = \mathbf{v}^T \operatorname{Cov}(\mathbf{X})\, \mathbf{v} = \operatorname{E}[(\mathbf{v}^T (\mathbf{X} - \operatorname{E}\mathbf{X}))^2],
\]
and therefore
\[
\mathbf{v}^T (\mathbf{X} - \operatorname{E}\mathbf{X}) = 0
\]
with probability one. Geometrically, this means that X − EX lies in the hyperplane
perpendicular to the vector v.
Note. If Cov X is not positive definite, then X cannot have a continuous
distribution, since the hyperplane, as a set of lower dimensions, is a null set.

9.3 Conditional distributions, the multiplication
rule and the conditional expectation
Let us consider a random vector Z. Its first r coordinates constitute a random
vector X and the remaining s coordinates constitute another random vector Y,

Z = (X, Y) = (X1 , . . . , Xr , Y1 , . . . , Ys ).

Let us assume that the density of the joint distribution of random vector Z is
given by
fZ (z) = fX,Y (x, y), z = (x, y).
We now allow the situation where some of the coordinates of Z are discrete
random variables and the rest have a continuous joint distribution. The
density of the marginal distribution of X is denoted by fX. This density can be
obtained by summing out the discrete components of Y from the density fX,Y and
integrating out its continuous components. By abusing notation, this idea can
be expressed as
\[
f_{\mathbf{X}}(\mathbf{x}) = \int f_{\mathbf{X},\mathbf{Y}}(\mathbf{x}, \mathbf{y})\,d\mathbf{y}, \tag{9.12}
\]
where the integration symbol represents summation in the case of the discrete
components. In this case, we say that the distribution of X is marginalised
from the joint distribution. Similarly, fY can be obtained using
\[
f_{\mathbf{Y}}(\mathbf{y}) = \int f_{\mathbf{X},\mathbf{Y}}(\mathbf{x}, \mathbf{y})\,d\mathbf{x}.
\]

A situation where a marginal distribution has to be presented for an arbitrary


subset of the components can always be reduced to the previous situation by
permuting the coordinates.
Example 9.2. Let U and V be discrete random variables and (X, Y) a continuously
distributed random vector. In this case, the joint distribution of the random
variables V and Y can be expressed by the density
\[
f_{V,Y}(v, y) = \sum_u \int f_{U,V,X,Y}(u, v, x, y)\,dx,
\]
since (first permute V and Y)
\[
f_{V,Y,U,X}(v, y, u, x) = f_{U,V,X,Y}(u, v, x, y),
\]
and (sum and integrate out everything else)
\[
f_{V,Y}(v, y) = \sum_u \int f_{V,Y,U,X}(v, y, u, x)\,dx.
\]

The density of the conditional distribution of Y, given X = x, is obtained by
\[
f_{\mathbf{Y}|\mathbf{X}}(\mathbf{y}|\mathbf{x}) = \frac{f_{\mathbf{X},\mathbf{Y}}(\mathbf{x}, \mathbf{y})}{f_{\mathbf{X}}(\mathbf{x})},
\]
when fX(x) > 0. If necessary, this expression can be defined as, for example,
zero for those x with fX(x) = 0. Similarly, we define
\[
f_{\mathbf{X}|\mathbf{Y}}(\mathbf{x}|\mathbf{y}) = \frac{f_{\mathbf{X},\mathbf{Y}}(\mathbf{x}, \mathbf{y})}{f_{\mathbf{Y}}(\mathbf{y})}.
\]
The multiplication formula (chain rule) is valid, and

fX,Y (x, y) = fX (x)fY|X (y|x) = fY (y)fX|Y (x|y).

The multiplication formula can be iterated. Let us assume that the random vector
X consists of two parts, X = (U, V). Now,
\[
f_{\mathbf{U},\mathbf{V}}(\mathbf{u}, \mathbf{v}) = f_{\mathbf{U}}(\mathbf{u}) f_{\mathbf{V}|\mathbf{U}}(\mathbf{v}|\mathbf{u}),
\]
and therefore
\[
f_{\mathbf{U},\mathbf{V},\mathbf{Y}}(\mathbf{u}, \mathbf{v}, \mathbf{y}) = f_{\mathbf{U}}(\mathbf{u}) f_{\mathbf{V}|\mathbf{U}}(\mathbf{v}|\mathbf{u}) f_{\mathbf{Y}|\mathbf{U},\mathbf{V}}(\mathbf{y}|\mathbf{u}, \mathbf{v}).
\]
This process can be continued until we reach scalar components. For example,
the density of four random variables U, V, X, Y can be expressed as
\[
f_{U,V,X,Y}(u, v, x, y) = f_U(u) f_{V|U}(v|u) f_{X|U,V}(x|u, v) f_{Y|U,V,X}(y|u, v, x).
\]

The multiplication rule can also be applied using any other permutation
of the random variables. The multiplication rule is also valid for conditional
distributions. For example,
\[
f_{\mathbf{U},\mathbf{V}|\mathbf{Y}}(\mathbf{u}, \mathbf{v}|\mathbf{y}) = \frac{f_{\mathbf{U},\mathbf{V},\mathbf{Y}}(\mathbf{u}, \mathbf{v}, \mathbf{y})}{f_{\mathbf{Y}}(\mathbf{y})}
= \frac{f_{\mathbf{U},\mathbf{Y}}(\mathbf{u}, \mathbf{y})}{f_{\mathbf{Y}}(\mathbf{y})} \cdot \frac{f_{\mathbf{U},\mathbf{V},\mathbf{Y}}(\mathbf{u}, \mathbf{v}, \mathbf{y})}{f_{\mathbf{U},\mathbf{Y}}(\mathbf{u}, \mathbf{y})}
= f_{\mathbf{U}|\mathbf{Y}}(\mathbf{u}|\mathbf{y})\, f_{\mathbf{V}|\mathbf{U},\mathbf{Y}}(\mathbf{v}|\mathbf{u}, \mathbf{y}).
\]

The marginal distribution of a conditional distribution can be calculated by
marginalising the conditional distribution. For example,
\[
f_{\mathbf{U}|\mathbf{Y}}(\mathbf{u}|\mathbf{y}) = \frac{f_{\mathbf{U},\mathbf{Y}}(\mathbf{u}, \mathbf{y})}{f_{\mathbf{Y}}(\mathbf{y})}
= \frac{1}{f_{\mathbf{Y}}(\mathbf{y})} \int f_{\mathbf{U},\mathbf{V},\mathbf{Y}}(\mathbf{u}, \mathbf{v}, \mathbf{y})\,d\mathbf{v}
= \int f_{\mathbf{U},\mathbf{V}|\mathbf{Y}}(\mathbf{u}, \mathbf{v}|\mathbf{y})\,d\mathbf{v}.
\]

Note. When connections between distributions are derived by using the multiplication
formula and marginalisation, the subscripts are often left unwritten
because they are evident from the arguments of the density function. This
abuse of notation does not result in confusion as long as we remember that,
for example, f(x), f(y), f(x|y) and f(y|x) are typically all different functions.

The conditional expectation (vector), given X = x, of a random vector
g(X, Y) is defined as the expectation vector of the random vector g(x, Y) when
the distribution of Y is the conditional distribution fY|X(·|x). By abusing notation,
\[
\operatorname{E}(g(\mathbf{X}, \mathbf{Y}) \mid \mathbf{X} = \mathbf{x}) = \int g(\mathbf{x}, \mathbf{y}) f_{\mathbf{Y}|\mathbf{X}}(\mathbf{y}|\mathbf{x})\,d\mathbf{y}.
\]

The integral here may mean addition with respect to some of the components
of the vector y. If we condition on a random vector X, then the random vector
E(g(X, Y)|X) is defined as m(X), where

m(x) = E(g(X, Y)|X = x) ,

and the definition of the function m is extended by setting m(x) equal to the zero
vector for those arguments x for which it is originally undefined.
The conditional covariance matrix, given X = x, of a random vector g(X, Y)
is similarly defined as the covariance matrix of a random vector g(x, Y) when
the conditional distribution fY|X (·|x) is used as the distribution for Y. This is
denoted by
Cov (g(X, Y)|X = x) .
If we condition on random vector X, then the conditional covariance matrix is
a random matrix and is denoted by
Cov (g(X, Y)|X) .

9.4 Conditional independence


Two random vectors X and Y are independent if

fX,Y (x, y) = fX (x)fY (y)

for all x, y. Similarly, random vectors X1 , . . . , Xn are independent if

fX1 ,...,Xn (x1 , . . . , xn ) = fX1 (x1 ) · · · fXn (xn )

for all arguments.


Definition 9.2. Two random vectors X and Y are conditionally independent,
given Z, if they are independent in their joint distribution, given Z = z for any
z. This means that

fX,Y|Z (x, y|z) = fX|Z (x|z)fY|Z (y|z), for all x, y and z. (9.13)

Similarly, random vectors X1 , . . . , Xn are conditionally independent, given


Z, if
n
Y
fX1 ,...,Xn |Z (x1 , . . . , xn |z) = fXi |Z (xi |z)
i=1
for all x1 , . . . , xn and all z.

Conditionally independent random vectors need not be marginally independent,
meaning independent in their joint marginal distribution. For example,
even if (X ⊥⊥ Y) | Z, then
\[
f_{\mathbf{X},\mathbf{Y}}(\mathbf{x}, \mathbf{y}) = \int f_{\mathbf{X},\mathbf{Y}|\mathbf{Z}}(\mathbf{x}, \mathbf{y}|\mathbf{z}) f_{\mathbf{Z}}(\mathbf{z})\,d\mathbf{z}
= \int f_{\mathbf{X}|\mathbf{Z}}(\mathbf{x}|\mathbf{z}) f_{\mathbf{Y}|\mathbf{Z}}(\mathbf{y}|\mathbf{z}) f_{\mathbf{Z}}(\mathbf{z})\,d\mathbf{z},
\]
which typically cannot be factored further.

9.5 Statistical models


Statistical inference has two main approaches: the so called frequentist approach
and the Bayesian approach. Both are founded on the use of a likelihood function.
Broadly speaking, the layout is the following.
We have numerical data in the form of a vector y = (y1, . . . , yn). Before the
observations are made, the value of the data is uncertain (due to measurement
errors, the natural variation of the population, etc.). Therefore, we model the
situation so that y is the observed value of a random vector Y, which is
defined on some probability space. In other words,
\[
\mathbf{y} = \mathbf{Y}(\omega^{\mathrm{act}}),
\]
where ω^act is the actualised elementary event in the probability model.


Typically, the distribution of vector Y is modelled with a parametric model,
with one or more parameters θ1 , . . . , θp that are combined into a parameter
vector θ. When the value of the parameter is fixed, then the distribution of the
random vector Y is expressed by the density

y 7→ f (y|θ).

We are interested in the distribution of a random vector Y. It can be approxi-


mated, if we first estimate the unknown parameter vector θ.
When a data y is observed, and this observed value is used as the first
argument of the function f (y|θ), then the function

θ 7→ f (y|θ)

is called the likelihood function. In this case, coefficients that are independent
of θ may be omitted from the density function and this function will still be
called the likelihood function.
In so-called classical or frequentist statistics, the parameter vector θ is
regarded as an unknown constant; we only know in which set (the parameter
space) its values can lie. In this case, the notation f(y|θ) is purely formal and
not interpreted as a conditional distribution, since the parameter vector does not
have a probability distribution. A more common notation in this case would
be f(y; θ). The best-known estimation principle is maximum
likelihood, ML. According to ML, the best estimate of the parameter vector within
the parameter space is the value θ̂ ML that maximises the likelihood function. It
is called the maximum likelihood estimate, MLE.
According to Bayesian inference, the parameter vector is interpreted as the
value of the random vector Θ. The function f (y|θ) is identified with the con-
ditional distribution fY|Θ (y|θ). According to Bayesian inference, apart from
the likelihood function, the statistical model requires a marginal distribution
for the vector Θ. This is called the prior distribution. The joint distribution of
the parameter vector and the observational vector is the density

(y, θ) 7→ fΘ,Y (θ, y) = fΘ (θ)fY|Θ (y|θ).

After observing Y = y, the statistical conclusions are made by characterising
the posterior distribution of the parameter vector, which is
\[
f_{\boldsymbol{\Theta}|\mathbf{Y}}(\boldsymbol{\theta}|\mathbf{y}) = \frac{f_{\boldsymbol{\Theta},\mathbf{Y}}(\boldsymbol{\theta}, \mathbf{y})}{f_{\mathbf{Y}}(\mathbf{y})}
= \frac{f_{\boldsymbol{\Theta}}(\boldsymbol{\theta})\, f_{\mathbf{Y}|\boldsymbol{\Theta}}(\mathbf{y}|\boldsymbol{\theta})}{\int f_{\boldsymbol{\Theta}}(\mathbf{t})\, f_{\mathbf{Y}|\boldsymbol{\Theta}}(\mathbf{y}|\mathbf{t})\,d\mathbf{t}} .
\]

Both approaches of the statistical inference require a concrete formula for the
likelihood function. For this reason, a statistician has to be able to derive density
functions for multivariate random vectors.

Example 9.3. A factory produces components that are either functional or
flawed. Let us define
\[
Y_i =
\begin{cases}
1, & \text{if the } i\text{th component is flawed,} \\
0, & \text{if the } i\text{th component is functional.}
\end{cases}
\]
In this situation, at its simplest, the model can be constructed as follows. When
the value of the parameter is 0 ≤ θ ≤ 1, the random variables Yi are independent
and have the same distribution, the Bernoulli distribution Bernoulli(θ).
(Outcome one now means that the component is flawed.) Now the likelihood
function is
\[
f(y_1, \ldots, y_n \mid \theta) = \prod_{i=1}^n \theta^{y_i} (1 - \theta)^{1 - y_i}, \qquad \mathbf{y} = (y_1, \ldots, y_n) \in \{0, 1\}^n,
\]
which equals θ^k (1 − θ)^{n−k} if y has exactly k components equal to 1.

In this situation the ML estimate is easy to find as the unique zero of the
derivative of this function with respect to θ: it is k/n, where k = Σ yi is the
number of flawed components.
When applying Bayesian inference, a statistician would give the parameter Θ
a prior distribution, e.g. the uniform distribution Be(1, 1) over the interval
(0, 1), and then derive a posterior distribution using the same likelihood function.
Now the random variables Yi are conditionally independent given Θ and
they have a Bernoulli(Θ) distribution. The joint distribution is
\[
f_{\Theta,\mathbf{Y}}(\theta, \mathbf{y}) = f(y_1, \ldots, y_n \mid \theta), \qquad 0 \le \theta \le 1,\ \mathbf{y} \in \{0, 1\}^n,
\]
and the posterior distribution is Be(1 + k, 1 + n − k). This calculation was
already carried out in Example 8.4.
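A minimal sketch, with a hypothetical true value θ = 0.15 and n = 200 simulated components, that computes both the ML estimate and the Be(1 + k, 1 + n − k) posterior:

import numpy as np
from scipy.stats import beta

rng = np.random.default_rng(3)
theta_true, n = 0.15, 200
y = rng.binomial(1, theta_true, size=n)     # simulated indicator data y_1, ..., y_n
k = y.sum()                                  # number of flawed components

theta_ml = k / n                             # maximum likelihood estimate
posterior = beta(1 + k, 1 + n - k)           # posterior under the uniform Be(1, 1) prior
print(theta_ml, posterior.mean(), posterior.interval(0.95))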
Let us assume that during the production process we want to measure some
information xi , which is somehow related to the functionality of the components.
In this case, xi is called the explanatory variable or the covariate. One way to
account for covariates in a model is the following.
Let us choose parameters α and β and try to explain the success (flawed
component) probability in the ith Bernoulli trial using the linear expression
α + βxi . This expression might attain arbitrary values and therefore is not valid
for a parameter of the Bernoulli distribution. Instead, by using the expression
\[
p(\alpha, \beta, x) = \frac{\exp(\alpha + \beta x)}{1 + \exp(\alpha + \beta x)},
\]
we obtain a number between zero and one no matter what the values of α, β
and x are. When the parameters are α, β, then according to the model
\[
Y_i \sim \operatorname{Bernoulli}(p(\alpha, \beta, x_i)), \qquad i = 1, \ldots, n,
\]
independently. Now the likelihood function is


\[
f(y_1, \ldots, y_n \mid \alpha, \beta) = \prod_{i=1}^n \Bigl(\frac{\exp(\alpha + \beta x_i)}{1 + \exp(\alpha + \beta x_i)}\Bigr)^{y_i} \Bigl(\frac{1}{1 + \exp(\alpha + \beta x_i)}\Bigr)^{1 - y_i}.
\]
This is an example of a logistic regression model, which is in turn a special case
of the generalized linear model, GLM. Many statistical software packages can
solve the ML estimate of a logistic regression model iteratively.
In case of Bayesian inference, the parameters need to be assigned a prior dis-
tribution fA,B (α, β), which could for example be a bivariate normal distribution.
According to the model, the random variables Yi are conditionally independent
given (A, B) = (α, β), and the conditional distribution of the random variable
Yi is the Bernoulli distribution Bernoulli(p(α, β, xi )) . The joint distribution of
the parameters and the data is

fA,B (α, β)f (y1 , . . . , yn |α, β) ,

and in practice the properties of the posterior distribution have to be computed
numerically.
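As an illustration of what such software does, the following sketch maximises the log-likelihood of the logistic regression model by plain gradient ascent on simulated data with hypothetical true values α = −1 and β = 2; real packages use faster Newton-type (IRLS) iterations.

import numpy as np

def fit_logistic(x, y, steps=20_000, lr=0.5):
    """Gradient ascent on the logistic log-likelihood (a minimal sketch)."""
    a, b = 0.0, 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(a + b * x)))   # p(a, b, x_i)
        a += lr * np.mean(y - p)                 # d/da of the average log-likelihood
        b += lr * np.mean((y - p) * x)           # d/db of the average log-likelihood
    return a, b

rng = np.random.default_rng(4)
x = rng.normal(size=500)
p_true = 1.0 / (1.0 + np.exp(-(-1.0 + 2.0 * x)))
y = rng.binomial(1, p_true)
print(fit_logistic(x, y))    # roughly (-1, 2)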
Example 9.4. In some situations, an observed time series y1 , . . . , yn can
be modelled by random variables Y0 , Y1 , . . . , Yn , where Y0 = y0 is some known
constant. For the other values, we use the autoregression model
\[
Y_t = h(Y_{t-1}, \beta) + \varepsilon_t, \qquad t \ge 1,
\]
where h(y, β) is some known function and the error terms obey the distribution
ε_t ∼ N(0, σ²) independently. Here, β and σ are the parameters of the model.

This model satisfies the Markov property
\[
f(y_t \mid y_1, \ldots, y_{t-1}) = f(y_t \mid y_{t-1}), \qquad \text{for all } t \ge 1.
\]
The multiplication rule gives the joint distribution as
\[
f(y_1, \ldots, y_n) = f(y_1) f(y_2 \mid y_1) f(y_3 \mid y_1, y_2) \cdots f(y_n \mid y_1, \ldots, y_{n-1})
= f(y_1) f(y_2 \mid y_1) f(y_3 \mid y_2) \cdots f(y_n \mid y_{n-1}).
\]

Now the likelihood function takes the form
\[
f(y_1, \ldots, y_n \mid \beta, \sigma) = \prod_{t=1}^n \frac{1}{\sigma\sqrt{2\pi}} \exp\Bigl(-\frac{(y_t - h(y_{t-1}, \beta))^2}{2\sigma^2}\Bigr).
\]

In this situation we can use either frequentist or Bayesian inference. In particular,
if h(y, β) = βy, then there is plenty of readily available software and theory
for the situation. In the general case, however, the software for estimating the
model has to be developed separately.
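A minimal sketch of the linear special case h(y, β) = βy: it simulates a series with arbitrarily chosen values β = 0.8 and σ = 1 and evaluates the logarithm of the likelihood function above.

import numpy as np

def simulate_ar(beta, sigma, y0, n, rng):
    """Simulate Y_t = beta * Y_{t-1} + eps_t, eps_t ~ N(0, sigma^2), t = 1, ..., n."""
    y = np.empty(n + 1)
    y[0] = y0
    for t in range(1, n + 1):
        y[t] = beta * y[t - 1] + rng.normal(0.0, sigma)
    return y

def log_likelihood(y, beta, sigma):
    """log f(y_1, ..., y_n | beta, sigma), with the known starting value y_0 = y[0]."""
    resid = y[1:] - beta * y[:-1]
    n = resid.size
    return -n * np.log(sigma * np.sqrt(2.0 * np.pi)) - np.sum(resid**2) / (2.0 * sigma**2)

rng = np.random.default_rng(5)
y = simulate_ar(0.8, 1.0, 0.0, 200, rng)
print(log_likelihood(y, 0.8, 1.0))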
Example 9.5 (Missing data). A statistician often needs to deal with data,
where some observations that it was originally supposed to include are missing.
On the other hand, it is sometimes handy to construct a probability model that
includes random variables, whose values cannot be directly observed. In both
cases, the unobserved random variables can be referred to as missing data or
latent variables, or auxiliary variables. In this type of model, the likelihood
function of the observed data Y can be obtained by marginalising the missing
data U out of the joint density of U and Y:
\[
f_{\mathbf{Y}|\boldsymbol{\Theta}}(\mathbf{y}|\boldsymbol{\theta}) = \int f_{\mathbf{U},\mathbf{Y}|\boldsymbol{\Theta}}(\mathbf{u}, \mathbf{y}|\boldsymbol{\theta})\,d\mathbf{u}.
\]

Often this integral cannot be dealt with analytically, and for this reason special
techniques (such as the so-called EM algorithm) have been developed for handling
missing data.

9.6 The density function transformation formula


Let us consider a diffeomorphism g : A → B, where A, B ⊂ Rn are open sets.
This means that g has an inverse function h : B → A, and both g and h are
continuously differentiable. The component functions of these projections are

g = (g1 , . . . , gn ) , h = (h1 , . . . , hn ).

Let us use the notation D_i h_j(y) to denote the partial derivative of the real-valued
function h_j with respect to the ith variable at the point y:
\[
D_i h_j(\mathbf{y}) = \frac{\partial}{\partial y_i} h_j(\mathbf{y}).
\]

Let us consider the bijective correspondence

y = g(x) ⇔ x = h(y) ,

also written as
y = y(x) ⇔ x = x(y).
The derivative matrix, or the Jacobian matrix, of the mapping h = (h1, . . . , hn)
at the point y ∈ B is
\[
\begin{pmatrix}
D_1 h_1(\mathbf{y}) & D_2 h_1(\mathbf{y}) & \cdots & D_n h_1(\mathbf{y}) \\
D_1 h_2(\mathbf{y}) & D_2 h_2(\mathbf{y}) & \cdots & D_n h_2(\mathbf{y}) \\
\vdots & & & \vdots \\
D_1 h_n(\mathbf{y}) & D_2 h_n(\mathbf{y}) & \cdots & D_n h_n(\mathbf{y})
\end{pmatrix}.
\]
The Jacobian, or Jacobian determinant, or functional determinant, of the mapping h,
\[
J_{\mathbf{h}}(\mathbf{y}) = \frac{\partial \mathbf{x}}{\partial \mathbf{y}} = \frac{\partial(x_1, \ldots, x_n)}{\partial(y_1, \ldots, y_n)},
\]
is the determinant of the Jacobian matrix.
The transformation formula of the density function in the n-dimensional
situation is proved in the same way as in the two-dimensional case (see Theorem
7.8.)

Theorem 9.3. If g : A → B is a diffeomorphism, h = g−1 : B → A denotes its


inverse, X is a random vector with continuous distribution, and P (X ∈ A) = 1,
then the random vector Y = g(X) has a continuous distribution with density
function
fY (y) = fX (h(y))|Jh (y)|, when y ∈ B,
and zero elsewhere.
This result is best memorised in the form
\[
f_{\mathbf{X}}(\mathbf{x})\,|\partial \mathbf{x}| = f_{\mathbf{Y}}(\mathbf{y})\,|\partial \mathbf{y}|, \tag{9.14}
\]
when
\[
\mathbf{y} = \mathbf{g}(\mathbf{x}) \iff \mathbf{x} = \mathbf{h}(\mathbf{y}), \qquad \mathbf{x} \in A,\ \mathbf{y} \in B. \tag{9.15}
\]


Example 9.6 (Affine transformation). If A is a constant matrix and b is a
constant vector that makes the dimensions of the following equation compatible,
then the function
g(x) = Ax + b, x ∈ Rn
is called an affine transformation. If A is a regular (invertible) matrix, hence in
particular a square matrix, then the inverse of this mapping is another affine
transformation, since
\[
\mathbf{y} = \mathbf{A}\mathbf{x} + \mathbf{b} \iff \mathbf{x} = \mathbf{h}(\mathbf{y}) = \mathbf{A}^{-1}(\mathbf{y} - \mathbf{b}).
\]
The Jacobian of the mapping h can be easily calculated to be
\[
J_{\mathbf{h}}(\mathbf{y}) = \frac{\partial \mathbf{x}}{\partial \mathbf{y}} = \det(\mathbf{A}^{-1}) = \frac{1}{\det(\mathbf{A})}.
\]

If X is an n-dimensional random vector with a continuous distribution, and the
random vector Y is defined by Y = g(X), then from
\[
f_{\mathbf{X}}(\mathbf{x})\,|\partial \mathbf{x}| = f_{\mathbf{Y}}(\mathbf{y})\,|\partial \mathbf{y}|
\]
we get the result
\[
f_{\mathbf{Y}}(\mathbf{y}) = f_{\mathbf{X}}(\mathbf{x})\,\Bigl|\frac{\partial \mathbf{x}}{\partial \mathbf{y}}\Bigr| = f_{\mathbf{X}}(\mathbf{A}^{-1}(\mathbf{y} - \mathbf{b}))\,\frac{1}{|\det(\mathbf{A})|}.
\]

Recall that a diffeomorphism is always a mapping between spaces of the same


dimension. If our goal is to derive the density function of a lower dimensional
random vector, then the mapping is first completed into a bijection (if this is
possible), the density function of the transformation is derived, and finally the
density function of the sub-vector is derived through integration.
Sometimes the transformation formula of the density function is needed in a
situation where g : A → B is not a diffeomorphism as such, but A can be split
into parts A_0 and A_i, i ≥ 1, in such a manner that the following conditions hold:
• P(X ∈ A_0) = 0.
• When i ≥ 1, the mapping g, restricted to the set A_i (denoted by g|A_i), is a
diffeomorphism A_i → B_i, where B_i is the image of A_i under g. Let
h_i : B_i → A_i be the inverse of this restriction.
Now the random vector Y = g(X) has a continuous distribution with the density
function
\[
f_{\mathbf{Y}}(\mathbf{y}) = \sum_{i \ge 1} \mathbf{1}_{B_i}(\mathbf{y})\, f_{\mathbf{X}}(\mathbf{h}_i(\mathbf{y}))\, |J_{\mathbf{h}_i}(\mathbf{y})|. \tag{9.16}
\]

This result is the multivariate generalisation of the one dimensional result (The-
orem 2.13).

9.7 The moment generating function of a random vector
Definition 9.3. The moment generating function of a random vector X =
(X1 , . . . , Xn ), or the joint moment generating function of the random variables
X1 , . . . , Xn , is defined as
\[
M(\mathbf{t}) = M_{\mathbf{X}}(\mathbf{t}) = \operatorname{E} \exp(\mathbf{t}^T \mathbf{X}) = \operatorname{E} \exp\Bigl(\sum_{j=1}^n t_j X_j\Bigr),
\]

for those t = (t1 , . . . , tn ) ∈ Rn , for which the expectation is defined. The
cumulant generating function of a random vector X (or the joint cumulant
generating function of the random variables X1 , . . . , Xn ) is

K(t) = KX (t) = ln MX (t),

for those t ∈ Rn , for which M (t) is defined.


Recall that the notation D_i g designates the partial derivative of the function
g : R^n → R with respect to its ith variable,
\[
D_i g(\mathbf{x}) = \frac{\partial}{\partial x_i} g(x_1, \ldots, x_n).
\]
The notation D_i^k means the kth partial derivative with respect to the ith variable,
and if k_1, . . . , k_n are non-negative integers, then
\[
D_1^{k_1} D_2^{k_2} \cdots D_n^{k_n} g(\mathbf{x}) = \frac{\partial^{k_1 + k_2 + \cdots + k_n}}{\partial x_1^{k_1} \partial x_2^{k_2} \cdots \partial x_n^{k_n}}\, g(\mathbf{x}).
\]

If g is continuously differentiable infinitely many times, then the order of com-


puting the previous partial derivatives does not matter.
The moment generating function and the cumulant generating function have
similar properties as in the case of a scalar.
Theorem 9.4. If MX(t) exists in a neighbourhood of the origin, then MX is
infinitely continuously differentiable, and MX can be expressed as a converging
power series in the same neighbourhood of the origin. Moreover,
(a) The moment of order (k1 , . . . , kn ) is obtained as a derivative

E(X1k1 X2k2 · · · Xnkn ) = D1k1 D2k2 · · · Dnkn MX (0),

where (k1 , . . . , kn ) is a vector whose components are non-negative integers.


(b) MX determines the distribution of a random vector X.

Definition 9.4 (Gradient vector and Hessian matrix). The (vertical) vector
\[
\nabla g(\mathbf{x}) = (D_1 g(\mathbf{x}), D_2 g(\mathbf{x}), \ldots, D_n g(\mathbf{x}))
\]
is called the gradient of a function g : R^n → R evaluated at the point x. The
Hessian matrix H_g(x) of a function g, evaluated at the point x, is the n × n matrix
constructed from the second-order partial derivatives as follows:
\[
H_g(\mathbf{x}) = \begin{pmatrix}
D_1 D_1 g(\mathbf{x}) & D_1 D_2 g(\mathbf{x}) & \cdots & D_1 D_n g(\mathbf{x}) \\
D_2 D_1 g(\mathbf{x}) & D_2 D_2 g(\mathbf{x}) & \cdots & D_2 D_n g(\mathbf{x}) \\
\vdots & & & \vdots \\
D_n D_1 g(\mathbf{x}) & D_n D_2 g(\mathbf{x}) & \cdots & D_n D_n g(\mathbf{x})
\end{pmatrix}.
\]

Note. The concept of a Hessian matrix was introduced by the Prussian
mathematician Ludwig Otto Hesse in the 19th century. Many sources treat the
gradient as a horizontal vector but we will understand it as a vertical vector.
The Hessian matrix is sometimes denoted by ∇2 g(x) or g 00 (x) .
Theorem 9.5. Suppose that the moment generating function M and the cumulant
generating function K of a random vector X are defined in a neighbourhood of
the origin. Then
\[
\operatorname{E}\mathbf{X} = \nabla M_{\mathbf{X}}(\mathbf{0}) = \nabla K_{\mathbf{X}}(\mathbf{0}), \qquad \operatorname{Cov}(\mathbf{X}) = H_{K_{\mathbf{X}}}(\mathbf{0}).
\]

Proof. The results are obtained by the differentiation formula for the composi-
tion of functions as in the one-dimensional case.
Example 9.7. The moment generating function of the multinomial distribution
X ∼ Mult(k, p) can be obtained from the multinomial formula,
\[
M_{\mathbf{X}}(\mathbf{t}) = \operatorname{E} \exp(\mathbf{t}^T \mathbf{X}) = \sum \binom{k}{x_1, \ldots, x_n} (p_1 e^{t_1})^{x_1} \cdots (p_n e^{t_n})^{x_n}
= (p_1 e^{t_1} + \cdots + p_n e^{t_n})^k .
\]
By differentiating this expression or its logarithm, we find that
\[
\operatorname{E}\mathbf{X} = k\mathbf{p}, \qquad \operatorname{Cov}(\mathbf{X}) = k(\operatorname{diag}(\mathbf{p}) - \mathbf{p}\mathbf{p}^T),
\]
where diag(p) is the diagonal matrix with the numbers p1, . . . , pn on the diagonal.
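A sketch of the differentiation for the expectation, using the cumulant generating function:
\[
K_{\mathbf{X}}(\mathbf{t}) = k \ln\Bigl(\sum_j p_j e^{t_j}\Bigr), \qquad
D_i K_{\mathbf{X}}(\mathbf{t}) = \frac{k\, p_i e^{t_i}}{\sum_j p_j e^{t_j}}, \qquad
D_i K_{\mathbf{X}}(\mathbf{0}) = k p_i,
\]
so EX = ∇K_X(0) = kp by Theorem 9.5; differentiating once more and evaluating at the origin gives the stated covariance matrix.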
Theorem 9.6. Let X = (Y, Z) be a random vector whose moment generating
function MX (t) exists in a neighbourhood of the origin. Let t = (u, v) be divided
into parts with the same dimensions as Y and Z. Then
(a) the moment generating functions of the random vectors Y and Z are

MY (u) = MX (u, 0), MZ (v) = MX (0, v);

(b) Y ⊥⊥ Z if and only if
\[
M_{\mathbf{X}}(\mathbf{u}, \mathbf{v}) = M_{\mathbf{Y}}(\mathbf{u})\, M_{\mathbf{Z}}(\mathbf{v})
\]
for all small enough u and v.


Proof. (a) :

MY (u) = E exp(uT Y) = E exp(uT Y + 0T Z) = MX (u, 0) ,

and the expression for MZ (v) is proved in the same way.


(b) First let us assume that Y ⊥⊥ Z. Then
\[
M_{\mathbf{X}}(\mathbf{u}, \mathbf{v}) = \operatorname{E} \exp(\mathbf{u}^T \mathbf{Y} + \mathbf{v}^T \mathbf{Z}) = \operatorname{E}\bigl(\exp(\mathbf{u}^T \mathbf{Y}) \exp(\mathbf{v}^T \mathbf{Z})\bigr) = M_{\mathbf{Y}}(\mathbf{u}) M_{\mathbf{Z}}(\mathbf{v}),
\]
where we used the fact that exp(u^T Y) ⊥⊥ exp(v^T Z).
For the converse implication, we assume that MX(u, v) = MY(u)MZ(v) for
all small enough arguments. Let Y_0 and Z_0 be random vectors such that
\[
\mathbf{Y}_0 \stackrel{d}{=} \mathbf{Y}, \qquad \mathbf{Z}_0 \stackrel{d}{=} \mathbf{Z}, \qquad \mathbf{Y}_0 \perp\!\!\!\perp \mathbf{Z}_0.
\]
Now
\[
M_{\mathbf{Y}_0,\mathbf{Z}_0}(\mathbf{u}, \mathbf{v}) = M_{\mathbf{Y}_0}(\mathbf{u}) M_{\mathbf{Z}_0}(\mathbf{v}) = M_{\mathbf{Y}}(\mathbf{u}) M_{\mathbf{Z}}(\mathbf{v}) = M_{\mathbf{Y},\mathbf{Z}}(\mathbf{u}, \mathbf{v}),
\]
and hence the moment generating functions of the random vectors (Y_0, Z_0) and
(Y, Z) agree in a neighbourhood of the origin. Since the moment generating
function determines the distribution, we have (Y_0, Z_0) =d (Y, Z), and hence
Y ⊥⊥ Z.
Chapter 10

Multivariate normal
distribution

In this chapter we will consider the multidimensional generalisation of the nor-


mal distribution, which is referred to as multivariate normal distribution or
multinormal distribution, and is used in many applications. We will also demon-
strate how the multinormal distributions are treated using vector and matrix
notations.

10.1 The standard normal distribution Nn (0, I)


Definition 10.1. A random vector U = (U1 , . . . , Un ) is said to have the n-
dimensional standard normal distribution or the normal distribution Nn (0, I) if
components are independent random variables distributed according to N (0, 1).
The density function of a random vector U ∼ Nn (0, I) is thus given by
\[
f_{\mathbf{U}}(\mathbf{u}) = \prod_{i=1}^n f_{U_i}(u_i) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi}} e^{-\frac12 u_i^2}
= (2\pi)^{-n/2} \exp\bigl(-\tfrac12 (u_1^2 + \cdots + u_n^2)\bigr) = (2\pi)^{-n/2} \exp\bigl(-\tfrac12 \mathbf{u}^T \mathbf{u}\bigr). \tag{10.1}
\]
The expectation vector of U is an n-dimensional zero vector and its covariance
matrix is an identity matrix of dimensions n × n:

EU = 0n , Cov U = In . (10.2)

The moment generating function of the random vector U is


\[
M_{\mathbf{U}}(\mathbf{t}) = \operatorname{E} \exp(\mathbf{t}^T \mathbf{U}) = \prod_{i=1}^n \operatorname{E} \exp(t_i U_i) = \prod_{i=1}^n e^{\frac12 t_i^2} = \exp\bigl(\tfrac12 \mathbf{t}^T \mathbf{t}\bigr). \tag{10.3}
\]

If a random vector Z has a k-dimensional standard normal distribution Z ∼ N_k(0, I),
then the square of its length, ‖Z‖² = Z^T Z, has a chi-square distribution
with k degrees of freedom, since
\[
\|\mathbf{Z}\|^2 = \mathbf{Z}^T \mathbf{Z} = \sum_{i=1}^k Z_i^2,
\]
where Z_i ∼ N(0, 1) independently. In this case E‖Z‖² = k.

10.2 General multinormal distribution


The goal of this section is to define the multinormal distribution Nm (µ, Σ),
where µ is the expectation of the distribution, Σ is the covariance matrix and
m is the dimension. We can leave out the subscript m if the dimension is either
clear from the context or unimportant.
We will first make an observation. If a random vector X is defined with the
formula
X = AU + µ, (10.4)
where A is an m × n constant matrix, µ ∈ Rm is a constant vector, and U ∼
Nn (0, In ) , then
EX = µ, CovX = AIn AT = AAT . (10.5)
Note that the matrix AAT is automatically symmetric, since
(AAT )T = AAT
and furthermore, it is positive semi-definite, since for an arbitrary v ∈ R^m we have
vT AAT v = (AT v)T (AT v),
which is non-negative, as it is the squared length of the vector AT v. A question arises:
can the covariance matrix Σ always be represented as a product AAT ? If it is
possible, we can use the formula (10.4) to define the multinormal distribution.
One (but not the only) possibility to find Σ = AAT is to use the Cholesky
decomposition. According to the Cholesky decomposition, a symmetric and positive
semi-definite matrix Σ can be represented as
Σ = LLT , (10.6)
where L is a lower triangular matrix. (A lower triangular matrix is a square
matrix, whose upper triangle elements (i, j), where j > i, are all zero). The
Cholesky decomposition is usually available from the programme libraries of
matrix software (but typically programs return the second factor in the product,
i.e., the upper triangular matrix LT ).
Note. If Σ is a positive definite matrix, then it is invertible. If it is repre-
sented in a form
Σ = AAT ,
where A is a square matrix, then the matrix A has to be invertible as well.

Definition 10.2. A random vector X has a multinormal distribution, if it has
the same distribution as
AU + µ (10.7)
where A is an m × n constant matrix, µ ∈ Rm is a constant vector, and
U ∼ Nn (0, In ) for some n.
The one dimensional normal distribution N (µ, σ 2 ) is a special case of the
multinormal distribution, since
X ∼ N (µ, σ 2 ) ⇒ (X = µ + σU, U ∼ N (0, 1)) .
According to the definition and formula (10.5),
EX = µ, Σ = Cov X = AAT ;
therefore formula (10.7) defines the normal distribution Nm (µ, AAT ) . In the
next theorem we will check that the Definition 10.2. is well posed.
Theorem 10.1. Definition 10.2 implies that EX = µ and Cov X = AAT .
The distribution of the random vector X only depends on its expectation vector
and covariance matrix, and not on the representation dimension n or other
properties of representation matrix A.
Conversely, if µ ∈ Rm is a constant vector and Σ ∈ Rm×m is a positive
semi-definite constant matrix, then there exists a random vector X that has a
multinormal distribution with expectation vector µ and covariance matrix Σ.
Proof. The formulas for the expectation vector and covariance matrix have been
verified already.
The moment generating function of the random vector X can be calculated
in terms of the moment generating function (10.3) of the standard normal distribution:
\[
M_{\mathbf{X}}(\mathbf{t}) = \operatorname{E} \exp(\mathbf{t}^T \mathbf{X}) = \operatorname{E} \exp(\mathbf{t}^T (\mathbf{A}\mathbf{U} + \boldsymbol{\mu})) = \exp(\mathbf{t}^T \boldsymbol{\mu})\, M_{\mathbf{U}}(\mathbf{A}^T \mathbf{t})
= \exp\bigl(\mathbf{t}^T \boldsymbol{\mu} + \tfrac12 \mathbf{t}^T \mathbf{A}\mathbf{A}^T \mathbf{t}\bigr) = \exp\bigl(\mathbf{t}^T \boldsymbol{\mu} + \tfrac12 \mathbf{t}^T \boldsymbol{\Sigma} \mathbf{t}\bigr). \tag{10.8}
\]
As the moment generating function of X only depends on its expectation vector
and covariance, the distribution of X only depends on these parameters as well.
In order to prove the converse result, given a positive semi-definite matrix
Σ, we represent it as a product Σ = AAT , and then a random vector with the
correct distribution is constructed according to formula (10.7), as we already
checked.
The definition can be viewed as a simulation recipe. If we wish to simulate
the distribution N_m(µ, Σ), where Σ is a given symmetric positive semi-definite
matrix, then we first look for a decomposition of the covariance matrix Σ,
\[
\boldsymbol{\Sigma} = \mathbf{A}\mathbf{A}^T.
\]
Here we can use e.g. the Cholesky decomposition. Next we (independently)
simulate m values u_1, . . . , u_m from the standard normal distribution N(0, 1),
and finally we calculate
\[
\mathbf{x} = \mathbf{A}\mathbf{u} + \boldsymbol{\mu}, \qquad \text{where } \mathbf{u} = (u_1, \ldots, u_m).
\]
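A minimal sketch of this recipe, using an arbitrary positive definite Σ chosen purely for illustration:

import numpy as np

rng = np.random.default_rng(6)
mu = np.array([1.0, -2.0, 0.5])
Sigma = np.array([[4.0, 1.0, 0.0],
                  [1.0, 3.0, 1.5],
                  [0.0, 1.5, 2.0]])           # an arbitrary positive definite matrix

A = np.linalg.cholesky(Sigma)                 # lower triangular, Sigma = A A^T
U = rng.standard_normal((3, 100_000))         # columns are iid N_3(0, I) draws
X = A @ U + mu[:, None]                       # columns are draws from N_3(mu, Sigma)

print(X.mean(axis=1))                         # approximately mu
print(np.cov(X))                              # approximately Sigma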

10.3 The distribution of an affine transformation
Theorem 10.2 (Multinormality is preserved under affine transformations). Let
X ∼ N_m(µ, Σ), let B ∈ R^{p×m} be a constant matrix, and let b ∈ R^p be a
constant vector. Then
\[
\mathbf{B}\mathbf{X} + \mathbf{b} \sim N_p(\mathbf{B}\boldsymbol{\mu} + \mathbf{b},\; \mathbf{B}\boldsymbol{\Sigma}\mathbf{B}^T).
\]
Proof. Let us use the representation (10.7) for the distribution of X. Then
\[
\mathbf{B}\mathbf{X} + \mathbf{b} \stackrel{d}{=} \mathbf{B}(\mathbf{A}\mathbf{U} + \boldsymbol{\mu}) + \mathbf{b} = (\mathbf{B}\mathbf{A})\mathbf{U} + (\mathbf{B}\boldsymbol{\mu} + \mathbf{b}).
\]
According to Definition 10.2, the random vector BX + b has a multinormal
distribution with
E(BX + b) = BE(X) + b = Bµ + b,
Cov(BX + b) = BΣBT .

This theorem also provides all marginal distributions. If we define the ith
unit vector ei to be such that its ith coordinate is one and other coordinates
are zero, then
Xi = eTi X.
If a random vector Y consists of the components (X_{i_1}, ..., X_{i_k}) of the random vector X, then Y is obtained by multiplying the random vector X from the left hand side by a constant matrix whose jth row is e_{i_j}^T,

Y = \begin{pmatrix} X_{i_1} \\ \vdots \\ X_{i_k} \end{pmatrix}
  = \begin{pmatrix} e_{i_1}^T \\ \vdots \\ e_{i_k}^T \end{pmatrix} X.
Now it follows that Y has a multinormal distribution, and therefore all the
marginal distributions of the multinormal distribution are multinormal as well.
The parameters for the random vector Y can be obtained through Theorem
10.2, or simply by picking the relevant parameters from the distribution of X,
since

EY = \begin{pmatrix} EX_{i_1} \\ \vdots \\ EX_{i_k} \end{pmatrix}
   = \begin{pmatrix} µ_{i_1} \\ \vdots \\ µ_{i_k} \end{pmatrix},
   \qquad (Cov Y)(p, q) = cov(X_{i_p}, X_{i_q}) = Σ(i_p, i_q).
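In matrix software, picking out the parameters of such a marginal distribution amounts to plain indexing. A small illustrative sketch (not from the original text; the index set and parameter values are made up):

    import numpy as np

    mu = np.array([1.0, 2.0, 3.0])
    Sigma = np.array([[4.0, 1.0, 0.5],
                      [1.0, 3.0, 0.2],
                      [0.5, 0.2, 2.0]])

    idx = [0, 2]                       # components (X_1, X_3) of X
    mu_Y = mu[idx]                     # expectation vector of Y
    Sigma_Y = Sigma[np.ix_(idx, idx)]  # covariance matrix of Y
    # Y = (X_1, X_3) ~ N(mu_Y, Sigma_Y)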

10.4 The density function


Theorem 10.3 (The density function of a multinormal distribution). If Σ is a positive definite matrix, then the density function of the distribution N_m(µ, Σ) is

f(x) = (2π)^{-m/2} (\det Σ)^{-1/2} \exp\Big( -\tfrac{1}{2} (x − µ)^T Σ^{-1} (x − µ) \Big).    (10.9)

Proof. Let A be a regular m × m matrix such that Σ = AA^T. It is possible to find such an A, since Σ is positive definite. After this, X ∼ N_m(µ, Σ) when

X = AU + µ,   where U ∼ N(0, I_m).

Next we will change variables

x = Au + µ  ⇐⇒  u = A^{-1}(x − µ).

(In order to apply the transformation formula of the density function, we need a regular coefficient matrix A, and the vectors X and U need to have the same dimension in the representation of the multinormal distribution.) Applying the transformation formula of the density function, we get

f_X(x) = f_U(u) \Big| \det \frac{∂u}{∂x} \Big| = f_U(A^{-1}(x − µ)) \, |\det(A^{-1})|
       = (2π)^{-m/2} |\det(A^{-1})| \exp\Big( -\tfrac{1}{2} (x − µ)^T (A^{-1})^T A^{-1} (x − µ) \Big).
The claim is a consequence of this formula and the following calculations:

Σ^{-1} = (AA^T)^{-1} = (A^T)^{-1} A^{-1} = (A^{-1})^T A^{-1},

\det(Σ) = \det(AA^T) = \det(A)\det(A^T) = (\det(A))^2 > 0,

|\det(A^{-1})| = \frac{1}{|\det(A)|} = \frac{1}{\sqrt{\det(Σ)}}.
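For numerical work, formula (10.9) is usually evaluated through a Cholesky factor of Σ rather than by forming Σ^{-1} explicitly, exactly as in the calculations above. A minimal sketch in NumPy (my own illustration, assuming Σ is positive definite):

    import numpy as np

    def multinormal_logpdf(x, mu, Sigma):
        # Logarithm of the density (10.9) of N_m(mu, Sigma) at the point x.
        x = np.asarray(x, dtype=float)
        mu = np.asarray(mu, dtype=float)
        m = mu.shape[0]
        A = np.linalg.cholesky(np.asarray(Sigma, dtype=float))   # Sigma = A A^T
        # Solve A z = x - mu, so that z = A^{-1}(x - mu) and
        # (x - mu)^T Sigma^{-1} (x - mu) = z^T z.
        z = np.linalg.solve(A, x - mu)
        log_det = 2.0 * np.sum(np.log(np.diag(A)))               # log det(Sigma)
        return -0.5 * (m * np.log(2.0 * np.pi) + log_det + z @ z)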

Note. The multinormal distribution N_m(µ, Σ) has a density function precisely when Σ is positive definite. If Σ is only positive semi-definite but not positive definite, then according to Theorem 9.2, X takes values on some hyperplane with probability one, and therefore X cannot have a density function. Even these singular multinormal distributions are important in applications; for example, in linear models the fit vector and the residual vector both have singular multinormal distributions.

10.5 The contours of the density function


The contours of the density function of the multivariate normal distribution are m-dimensional ellipsoids. This can be seen from the eigenvalue decomposition of the covariance matrix.
Let Σ ∈ R^{n×n} be a symmetric and positive semi-definite matrix. As a symmetric matrix, it has real eigenvalues λ_i and eigenvectors v_i. The eigenvectors can be chosen to be orthonormal, and hence

Σv_i = λ_i v_i,   i = 1, ..., n,    (10.10)

v_i^T v_j = \begin{cases} 1, & i = j, \\ 0, & i \neq j. \end{cases}    (10.11)
Routines for finding the eigenvalues and eigenvectors are included in all matrix computation software.
As Σ is positive semi-definite, we have

0 ≤ v_i^T Σ v_i = v_i^T (λ_i v_i) = λ_i.

Thus all the eigenvalues are non-negative. If Σ is positive definite, then all the eigenvalues are strictly positive.
Let us construct a matrix V from the eigenvectors by placing them as the columns of the matrix, and let us form a diagonal matrix Λ from the eigenvalues,

V = [v_1, ..., v_n],   Λ = diag(λ_1, ..., λ_n).

Now from the definitions of the eigenvalues and eigenvectors it follows that

ΣV = [Σv_1, ..., Σv_n] = [λ_1 v_1, ..., λ_n v_n] = VΛ.

As a consequence of the orthonormality of the eigenvectors, V^T V = I_n. Since V is a square matrix, this shows that V^{-1} = V^T. These types of matrices, whose inverse coincides with the transpose, are called orthogonal matrices. Now we have derived the eigenvalue decomposition of the matrix Σ,

Σ = VΛV^T,   where V^{-1} = V^T.    (10.12)

Let Σ be a covariance matrix of a multinormal distribution, and let us assume that it is positive definite. In this case we can produce the decomposition of its inverse matrix from the eigenvalue decomposition:

Σ = VΛV^T  ⇒  Σ^{-1} = VΛ^{-1}V^T.
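In numerical software this decomposition is produced by a symmetric eigensolver. A small sketch (my own illustration with an arbitrary positive definite matrix, not from the original text):

    import numpy as np

    Sigma = np.array([[4.0, 1.0],
                      [1.0, 2.0]])

    lam, V = np.linalg.eigh(Sigma)   # eigenvalues (ascending) and orthonormal eigenvectors
    Lam = np.diag(lam)

    # Reconstruction Sigma = V Lam V^T and the inverse Sigma^{-1} = V Lam^{-1} V^T.
    print(np.allclose(Sigma, V @ Lam @ V.T))
    print(np.allclose(np.linalg.inv(Sigma), V @ np.diag(1.0 / lam) @ V.T))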

The density function f(x) in (10.9) of the multinormal distribution N_m(µ, Σ) obtains a constant value for all arguments x for which the expression inside the exponential function obtains a constant value. This expression can be written as

(x − µ)^T Σ^{-1} (x − µ) = (x − µ)^T VΛ^{-1}V^T (x − µ) = \frac{y_1^2}{λ_1} + \cdots + \frac{y_m^2}{λ_m},

where the vector y is defined by

y = V^T(x − µ)  ⇐⇒  x = µ + Vy.

The vector y is thus obtained from the vector x by a coordinate transformation, where µ is the new origin and the vectors Ve_i = v_i are the new coordinate axes (where e_i is the ith standard unit vector of R^n). These vectors are orthonormal because

(Ve_i)^T (Ve_j) = e_i^T V^T V e_j = e_i^T e_j.


Figure 10.1: A two-dimensional normal distribution with two contour lines and
the directions of the eigenvectors.

Since all λ_i > 0, the contour lines satisfy

\frac{y_1^2}{(\sqrt{λ_1})^2} + \cdots + \frac{y_m^2}{(\sqrt{λ_m})^2} = c,   c > 0.

This is the formula of an m-dimensional ellipsoid in the new coordinates y. The lengths of the half axes of the ellipsoid are \sqrt{cλ_1}, ..., \sqrt{cλ_m}.
Example 10.1 (Visualisation of the two-dimensional normal distribution). Let

µ = \begin{pmatrix} 0 \\ -1 \end{pmatrix}, \quad
Λ = \begin{pmatrix} 4 & 0 \\ 0 & 1 \end{pmatrix}, \quad
V = \begin{pmatrix} \cos θ & -\sin θ \\ \sin θ & \cos θ \end{pmatrix}.

Let us consider the distribution N(µ, Σ) with Σ = VΛV^T. Figure 10.1 shows two contour lines of this distribution with θ = 30°, as well as the vectors Ve_1 and Ve_2 at the point µ. The dots represent a sample of vectors u_i from the two-dimensional standard normal distribution N(0, I), transformed according to the formula

x_i = Au_i + µ,   A = VΛ^{1/2},

so as to give a sample of the two-dimensional normal distribution N(µ, Σ).
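The construction of Example 10.1 is easy to reproduce numerically. A sketch (my own illustration, not part of the original text; the sample size is arbitrary):

    import numpy as np

    theta = np.deg2rad(30.0)
    mu = np.array([0.0, -1.0])
    Lam = np.diag([4.0, 1.0])
    V = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])

    Sigma = V @ Lam @ V.T          # covariance matrix of the example
    A = V @ np.sqrt(Lam)           # A = V Lam^{1/2}, so Sigma = A A^T

    rng = np.random.default_rng(0)
    U = rng.standard_normal((2, 500))
    X = (A @ U + mu[:, None]).T    # a sample from N(mu, Sigma)

    # The half axes of the contour ellipses point along the columns of V,
    # with lengths proportional to sqrt(4) = 2 and sqrt(1) = 1.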

10.6 Non-correlation and independence


Independent random vectors X ⊥⊥ Y are uncorrelated, since

cov(X, Y) = E[(X − EX)(Y − EY)^T] = E[X − EX] \, E[(Y − EY)^T] = 0.

In general, non-correlation does not guarantee independence. However, in the special case that a random vector (X, Y) has a multinormal distribution, the non-correlation and the independence of its partial vectors X and Y turn out to be equivalent, as we are about to prove.
Let us observe, for later use, the parameters of the multinormal distributions of the partial vectors X = (X_1, ..., X_k) and Y = (Y_1, ..., Y_m), when the joint vector Z = (X, Y) has a multinormal distribution N(µ, Σ). The expectation vector µ consists of the parts

µ = EZ = \begin{pmatrix} µ_X \\ µ_Y \end{pmatrix} = \begin{pmatrix} EX \\ EY \end{pmatrix},    (10.13)

and the covariance matrix Σ = Cov Z can also be partitioned as

Σ = \begin{pmatrix}
      cov(X_1, X_1) & \cdots & cov(X_1, X_k) & cov(X_1, Y_1) & \cdots & cov(X_1, Y_m) \\
      \vdots        &        & \vdots        & \vdots        &        & \vdots        \\
      cov(X_k, X_1) & \cdots & cov(X_k, X_k) & cov(X_k, Y_1) & \cdots & cov(X_k, Y_m) \\
      cov(Y_1, X_1) & \cdots & cov(Y_1, X_k) & cov(Y_1, Y_1) & \cdots & cov(Y_1, Y_m) \\
      \vdots        &        & \vdots        & \vdots        &        & \vdots        \\
      cov(Y_m, X_1) & \cdots & cov(Y_m, X_k) & cov(Y_m, Y_1) & \cdots & cov(Y_m, Y_m)
    \end{pmatrix}
  = \begin{pmatrix} Σ_{XX} & Σ_{XY} \\ Σ_{YX} & Σ_{YY} \end{pmatrix}
  = \begin{pmatrix} cov(X, X) & cov(X, Y) \\ cov(Y, X) & cov(Y, Y) \end{pmatrix}.    (10.14)
According to Section 10.3, the marginal distributions are given by

X ∼ N(µ_X, Σ_{XX}),   Y ∼ N(µ_Y, Σ_{YY}).    (10.15)

Theorem 10.4. If a random vector (X, Y) has a multinormal distribution, then

X ⊥⊥ Y  ⇐⇒  cov(X, Y) = 0.

Proof. As the moment generating function of a multinormal joint distribution is defined for all arguments, according to Theorem 9.6 the random vectors X and Y are independent if and only if their joint moment generating function factorises into a product of the moment generating functions of X and Y,

M_{X,Y}(u, v) = M_X(u) M_Y(v)  for all u, v.

Using the partitions (10.13) and (10.14), we find that

M_{X,Y}(u, v) = \exp\Big( u^T µ_X + v^T µ_Y + \tfrac{1}{2} \big( u^T Σ_{XX} u + u^T Σ_{XY} v + v^T Σ_{YX} u + v^T Σ_{YY} v \big) \Big).

On the other hand, the product of the moment generating functions of the marginal distributions is

M_X(u) M_Y(v) = \exp\Big( u^T µ_X + \tfrac{1}{2} u^T Σ_{XX} u + v^T µ_Y + \tfrac{1}{2} v^T Σ_{YY} v \Big).

In the earlier expression, we can write

v^T Σ_{YX} u = (v^T Σ_{YX} u)^T = u^T Σ_{XY} v,

and therefore the functions M_{X,Y} and M_X M_Y are equal if and only if

u^T Σ_{XY} v = 0  for all u, v,

which is the same as Σ_{XY} = cov(X, Y) = 0.
If we only assume multinormal marginal distributions

X ∼ N(µ_X, Σ_{XX}),   Y ∼ N(µ_Y, Σ_{YY}),

then the distribution of the joint vector (X, Y) need not be multinormal in general, not even under the extra assumption that cov(X, Y) = 0. However, observing the formulas of the previous theorem, it is easy to prove the following result:
Theorem 10.5. If

X ∼ N(µ_X, Σ_{XX}),   Y ∼ N(µ_Y, Σ_{YY}),

and X ⊥⊥ Y, then the vector (X, Y) has a multinormal distribution N(µ, Σ) with parameters

µ = \begin{pmatrix} µ_X \\ µ_Y \end{pmatrix},   Σ = \begin{pmatrix} Σ_{XX} & 0 \\ 0 & Σ_{YY} \end{pmatrix}.
Proof. By independence, the moment generating function of the joint distribution is the product of the moment generating functions of the marginal distributions. Therefore, for all u, v,

M_{X,Y}(u, v) = M_X(u) M_Y(v)
  = \exp\Big( u^T µ_X + \tfrac{1}{2} u^T Σ_{XX} u + v^T µ_Y + \tfrac{1}{2} v^T Σ_{YY} v \Big)
  = \exp\Big( [u^T\ v^T] \begin{pmatrix} µ_X \\ µ_Y \end{pmatrix}
    + \tfrac{1}{2} [u^T\ v^T] \begin{pmatrix} Σ_{XX} & 0 \\ 0 & Σ_{YY} \end{pmatrix} \begin{pmatrix} u \\ v \end{pmatrix} \Big),

and this is the moment generating function of the multinormal distribution N(µ, Σ).
Example 10.2. If X ⊥⊥ Y and

X ∼ N(µ_X, Σ_{XX}),   Y ∼ N(µ_Y, Σ_{YY}),

and A and B are constant matrices such that the vectors AX and BY are well-defined and of equal length, then the random vector

Z = AX + BY

has a multinormal distribution.
This is the case, since the combined vector (X, Y) has a multinormal distribution, and

AX + BY = [A\ \ B] \begin{pmatrix} X \\ Y \end{pmatrix}.

The parameters of the distribution can be obtained by calculating the expectation vector and the covariance matrix,

EZ = E(AX + BY) = Aµ_X + Bµ_Y,   Cov Z = Cov(AX) + Cov(BY) = AΣ_{XX}A^T + BΣ_{YY}B^T.
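As a quick numerical illustration of Example 10.2 (my own made-up matrices, not from the text), the parameters of Z = AX + BY can be computed directly:

    import numpy as np

    mu_X, Sigma_XX = np.array([1.0, 0.0]), np.array([[2.0, 0.5], [0.5, 1.0]])
    mu_Y, Sigma_YY = np.array([3.0]),      np.array([[4.0]])

    A = np.array([[1.0, -1.0],
                  [0.5,  2.0]])      # 2 x 2
    B = np.array([[1.0],
                  [2.0]])            # 2 x 1, so AX and BY are both 2-vectors

    mu_Z = A @ mu_X + B @ mu_Y
    Sigma_Z = A @ Sigma_XX @ A.T + B @ Sigma_YY @ B.T
    # Z = AX + BY ~ N(mu_Z, Sigma_Z) when X and Y are independent.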

10.7 Conditional distributions


Theorem 10.6. Let (X, Y) ∼ N(µ, Σ), and let us use the partitions (10.13) and (10.14) for the expectation vector µ and the covariance matrix Σ. If the partial matrix Σ_{XX} is regular, then the conditional distribution of the random vector Y given X = x is multinormal, and the parameters of the conditional distribution are

Y | (X = x) ∼ N\big( µ_Y + Σ_{YX} Σ_{XX}^{-1} (x − µ_X),\ Σ_{YY} − Σ_{YX} Σ_{XX}^{-1} Σ_{XY} \big).

Proof. Even though the formulas are complicated, the idea of this proof is simple. We first form an auxiliary random vector V by using the formula

V = Y − BX,    (10.16)

where B is a constant matrix to be chosen shortly. The combined vector (V, X) can be obtained from the multinormal vector (X, Y) by a linear transformation, and therefore (V, X) also has a multinormal distribution. We want to choose B in such a way that V and X do not correlate, which means that

0 = cov(V, X) = Σ_{YX} − BΣ_{XX},

which can be solved as B = Σ_{YX} Σ_{XX}^{-1}.

As the vector (V, X) is multinormally distributed, non-correlation implies independence, and hence V ⊥⊥ X. According to the representation (10.16),

Y = V + BX,

where V and X are independent. Now the conditional distribution of Y given X = x is the same as the distribution of the random vector

V + Bx,

since the condition X = x does not affect the distribution of the random vector V, because V ⊥⊥ X. Therefore, the conditional distribution is multinormal. Now

the parameters of the conditional distribution can be obtained by calculating the parameters of the distribution of the random vector V + Bx. The expectation vector of the conditional distribution is

E(V + Bx) = µ_Y + B(x − µ_X) = µ_Y + Σ_{YX} Σ_{XX}^{-1} (x − µ_X),

and the covariance matrix is

Cov(V + Bx) = Cov(V) = cov(Y − BX, Y − BX)
            = Σ_{YY} − Σ_{YX}B^T − BΣ_{XY} + BΣ_{XX}B^T
            = Σ_{YY} − Σ_{YX} Σ_{XX}^{-1} Σ_{XY}.
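The formulas of Theorem 10.6 are easy to implement. A minimal sketch (my own illustration; the function name is made up, and the partition is assumed to list the X block before the Y block):

    import numpy as np

    def conditional_params(mu, Sigma, k, x):
        # Parameters of Y | (X = x) when (X, Y) ~ N(mu, Sigma) and X has k components.
        mu = np.asarray(mu, dtype=float)
        Sigma = np.asarray(Sigma, dtype=float)
        mu_X, mu_Y = mu[:k], mu[k:]
        S_XX, S_XY = Sigma[:k, :k], Sigma[:k, k:]
        S_YX, S_YY = Sigma[k:, :k], Sigma[k:, k:]
        # Solve S_XX w = (x - mu_X) instead of inverting S_XX explicitly.
        w = np.linalg.solve(S_XX, np.asarray(x, dtype=float) - mu_X)
        cond_mean = mu_Y + S_YX @ w
        cond_cov = S_YY - S_YX @ np.linalg.solve(S_XX, S_XY)
        return cond_mean, cond_cov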

10.8 The two dimensional normal distribution


Let us consider some properties of the two-dimensional normal distribution (X, Y) ∼ N(µ, Σ). The derived formulas are lengthy and not worth memorising. When in need of them, you can derive them yourself or look them up in the literature.
Let −1 < ρ = corr(X, Y) < 1, and let us write the parameters of the distribution in the following form:

µ = \begin{pmatrix} µ_X \\ µ_Y \end{pmatrix},   Σ = \begin{pmatrix} σ_X^2 & ρσ_Xσ_Y \\ ρσ_Xσ_Y & σ_Y^2 \end{pmatrix}.
We assume that σ_X > 0 and σ_Y > 0.
The inverse matrix of the covariance matrix is

Σ^{-1} = \frac{1}{\det(Σ)} \begin{pmatrix} σ_Y^2 & -ρσ_Xσ_Y \\ -ρσ_Xσ_Y & σ_X^2 \end{pmatrix},
  where  \det(Σ) = (1 − ρ^2) σ_X^2 σ_Y^2.
The joint density function of the distribution is

f_{X,Y}(x, y) = \frac{1}{2πσ_Xσ_Y\sqrt{1−ρ^2}}
  \exp\Big( -\frac{1}{2(1−ρ^2)} \Big[ \Big(\frac{x−µ_X}{σ_X}\Big)^2
  − 2ρ\,\frac{x−µ_X}{σ_X}\,\frac{y−µ_Y}{σ_Y}
  + \Big(\frac{y−µ_Y}{σ_Y}\Big)^2 \Big] \Big).

The argument of the exponential function can be obtained by multiplying out the matrix product

−\frac{1}{2}\, [(x−µ_X)\ \ (y−µ_Y)]\, Σ^{-1} \begin{pmatrix} x−µ_X \\ y−µ_Y \end{pmatrix}.
The marginal distributions are

X ∼ N(µ_X, σ_X^2),   Y ∼ N(µ_Y, σ_Y^2).

The random variables X and Y are independent if and only if ρ = 0. The conditional distributions are obtained from Theorem 10.6:

Y | (X = x) ∼ N\Big( µ_Y + ρ\frac{σ_Y}{σ_X}(x − µ_X),\ (1 − ρ^2)σ_Y^2 \Big),

X | (Y = y) ∼ N\Big( µ_X + ρ\frac{σ_X}{σ_Y}(y − µ_Y),\ (1 − ρ^2)σ_X^2 \Big).
Here we can, for example, obtain the parameters of the distribution Y | (X = x) by calculating

µ_Y + Σ_{YX}Σ_{XX}^{-1}(x − µ_X) = µ_Y + \frac{ρσ_Xσ_Y}{σ_X^2}(x − µ_X),

Σ_{YY} − Σ_{YX}Σ_{XX}^{-1}Σ_{XY} = σ_Y^2 − \frac{(ρσ_Xσ_Y)^2}{σ_X^2} = (1 − ρ^2)σ_Y^2.
It is straightforward but laborious to check that these conditional distributions can also be derived using the formulas

f_{Y|X}(y | x) = \frac{f_{X,Y}(x, y)}{f_X(x)},   f_{X|Y}(x | y) = \frac{f_{X,Y}(x, y)}{f_Y(y)}.

10.9 The joint distribution of the sample mean and variance of a normal distribution
Let us consider random variables X_1, ..., X_n. Their sample mean X̄ and sample variance S² are defined by the formulas

\bar X = \frac{1}{n} \sum_{i=1}^{n} X_i,   S^2 = \frac{1}{n−1} \sum_{i=1}^{n} (X_i − \bar X)^2.    (10.17)

If the random variables X_i are independent and have the same expectation µ and the same variance σ², then we can easily show that

E X̄ = µ,   ES² = σ².

In other words, the sample mean and the sample variance are unbiased estimators of the population parameters µ and σ².
Let us assume from now on that the random variables X_1, ..., X_n are independent with normal distribution N(µ, σ²). The vector (X_1, ..., X_n) has a multinormal distribution N_n(m, Σ), where

m = \begin{pmatrix} µ \\ \vdots \\ µ \end{pmatrix},   Σ = σ^2 I_n.

Next, we will derive the joint distribution of the vector (X̄, S²).

It is easy to derive the marginal distribution of the sample mean, namely X̄ ∼ N(µ, σ²/n), as it is a linear transformation of a multinormally distributed random vector, and we can calculate the mean and variance of the random variable X̄. The next theorem shows that, after suitable scaling, the sample variance S² has a chi squared distribution with n − 1 degrees of freedom.
Theorem 10.7. Let X_1, ..., X_n be independent with X_i ∼ N(µ, σ²), and let us define X̄ and S² by equation (10.17). Then
(a) X̄ ∼ N(µ, σ²/n),
(b) (n − 1)S²/σ² ∼ χ²_{n−1}, so ES² = σ²,
(c) X̄ ⊥⊥ S².

Proof. Part (a) was proved already, so we will focus on (b) and (c).
The assumption about the random vector X = (X_1, ..., X_n) can be expressed as

X ∼ N_n(µ1, σ^2 I_n),

where 1 = (1, ..., 1) ∈ R^n is a constant n-component vector with all its components equal to one.
Let us define another n-component vector u by

u = \frac{1}{\sqrt{n}} 1.
Now u is a unit vector parallel to the vector 1, in other words u^T u = 1. Notice that the sample mean can be represented as

\bar X = \frac{1}{n} 1^T X = \frac{1}{\sqrt{n}} u^T X,

and then

\bar X\, 1 = \Big( \frac{1}{\sqrt{n}} u^T X \Big) \sqrt{n}\, u = uu^T X.
Let us define the residual vector R by

R = \begin{pmatrix} X_1 − \bar X \\ \vdots \\ X_n − \bar X \end{pmatrix} = X − \bar X 1 = X − uu^T X = (I_n − uu^T)X.

(The vector X̄1 is called the fit vector.) The sample variance S² can be written as

(n − 1)S^2 = R^T R = \|R\|^2.    (10.18)

The combined vector (X̄, R) can be obtained by a linear transformation from the vector X, since

\begin{pmatrix} \bar X \\ R \end{pmatrix} = \begin{pmatrix} u^T/\sqrt{n} \\ I − uu^T \end{pmatrix} X,
and for this reason it has a multinormal joint distribution. Moreover, the components X̄ and R are uncorrelated, since

cov(\bar X, R) = \frac{1}{\sqrt{n}} cov(u^T X, (I − uu^T)X) = \frac{1}{\sqrt{n}} u^T σ^2 I (I − uu^T) = 0.

By Theorem 10.4, X̄ and R are independent. As S² can be calculated (using formula (10.18)) as a function of the residual vector R, we find that X̄ and S² are also independent. This proves part (c).
In order to establish the distribution of the sample variance, let us define an n × n square matrix Q so that it has u as its first column. The rest of the columns v_1, ..., v_{n−1} of the matrix Q are chosen so that, together with u, they form an orthonormal basis of R^n. Then Q^T Q = I_n and, as Q is a square matrix, it is an orthogonal matrix with Q^{-1} = Q^T. From this it follows that QQ^T = I_n as well. When we use the partition Q = [u, V], where the columns of the n × (n − 1) matrix V are v_1, ..., v_{n−1}, we get the formula

I_n = QQ^T = uu^T + VV^T.
From this we see that

R = (I_n − uu^T)X = VV^T X.

As the columns of the matrix V are orthonormal, we have V^T V = I_{n−1}, and

R^T R = (VV^T X)^T (VV^T X) = X^T VV^T X = \|V^T X\|^2.    (10.19)

The vector V^T X has an (n − 1)-dimensional normal distribution with expectation

E(V^T X) = V^T(µ1) = 0,

since u is perpendicular to every column of the matrix V, and 1 = \sqrt{n}\, u. The covariance matrix is

Cov(V^T X) = V^T(σ^2 I_n)V = σ^2 I_{n−1}.
Hence

\frac{1}{σ} V^T X ∼ N_{n−1}(0, I_{n−1}),

and so the square of the length of this random vector has a chi squared distribution with n − 1 degrees of freedom. By formulas (10.18) and (10.19), we see that

\frac{n−1}{σ^2} S^2 = \frac{1}{σ^2} R^T R = \Big(\frac{1}{σ} V^T X\Big)^T \Big(\frac{1}{σ} V^T X\Big) = \Big\|\frac{1}{σ} V^T X\Big\|^2 ∼ χ^2_{n−1}.

Thus E[(n − 1)S²/σ²] = n − 1, and hence ES² = σ².

Let us recall that the t-distribution with ν degrees of freedom is defined as the distribution of a random variable

\frac{Z}{\sqrt{Y/ν}},

where Z ∼ N(0, 1), Y ∼ χ²_ν and Z ⊥⊥ Y.
The random variable X̄ − µ has a normal distribution with expectation zero and variance σ²/n, so the random variable

\frac{\bar X − µ}{σ/\sqrt{n}}

has a standard normal distribution. If we replace the unknown standard deviation parameter σ with its sample estimate, we obtain the so-called t-test statistic

T = \frac{\bar X − µ}{S/\sqrt{n}}.

This t-test statistic can also be expressed as

T = \frac{(\bar X − µ)/(σ/\sqrt{n})}{\sqrt{S^2/σ^2}}.
Based on the previous theorem, the numerator and the denominator are independent, the numerator's distribution is N(0, 1), and the denominator is the square root of a χ²-distributed random variable divided by its number of degrees of freedom, n − 1. As a consequence, T has a t-distribution with n − 1 degrees of freedom.
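A small Monte Carlo experiment makes Theorem 10.7 and the resulting t-distribution concrete. The following sketch (my own illustration; the sample size, parameter values and number of repetitions are arbitrary) simulates the statistic T repeatedly and compares a few of its empirical quantiles with those of the t-distribution with n − 1 degrees of freedom:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    n, mu, sigma = 10, 5.0, 2.0
    reps = 100_000

    X = rng.normal(mu, sigma, size=(reps, n))
    xbar = X.mean(axis=1)
    S = X.std(axis=1, ddof=1)          # sample standard deviation, cf. (10.17)
    T = (xbar - mu) / (S / np.sqrt(n))

    probs = [0.05, 0.5, 0.95]
    print(np.quantile(T, probs))          # empirical quantiles of T
    print(stats.t.ppf(probs, df=n - 1))   # quantiles of t_{n-1}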
The previous considerations can easily be extended to the situation of a linear model

X ∼ N_n(m, σ^2 I),

where the expectation vector m is known to belong to a subspace L of R^n of dimension p ≤ n. Let us choose orthonormal basis vectors u_1, ..., u_p for this subspace, and let us use these vectors as the columns of a matrix U, so that U = [u_1, ..., u_p]. Now the matrix

H = UU^T

is a projection matrix onto the subspace L, and so H is a symmetric and idempotent matrix (the latter means that HH = H) and Hv = v for all v ∈ L. By adapting the proof of the previous theorem, one can check that

HX ⊥⊥ (I − H)X,   and that   \frac{1}{σ^2} \|X − HX\|^2 ∼ χ^2_{n−p}.
Hence the random variable

\frac{1}{n−p} \|X − HX\|^2

is an unbiased estimator of the variance σ², and its distribution is a scaled chi squared distribution. The fit vector HX and the residual vector X − HX = (I − H)X are independent random vectors.
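These projection formulas translate directly into code. A minimal sketch (my own illustration, not from the original text), where the subspace L is spanned by p orthonormalised random vectors:

    import numpy as np

    rng = np.random.default_rng(2)
    n, p, sigma = 6, 2, 1.5

    # Orthonormal basis U of a p-dimensional subspace L (via QR of a random matrix).
    U, _ = np.linalg.qr(rng.standard_normal((n, p)))
    H = U @ U.T                        # projection matrix onto L

    m = U @ np.array([2.0, -1.0])      # an expectation vector m lying in L
    X = m + sigma * rng.standard_normal(n)

    fit = H @ X                        # fit vector HX
    resid = X - fit                    # residual vector (I - H)X
    s2_hat = resid @ resid / (n - p)   # unbiased estimate of sigma^2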

