Chapter 11 currently missing
Contents

6 Inequalities
6.1 Markov and Chebyshev inequalities
6.2 Convex functions and Jensen's inequality
6.3 Hölder's inequality
6.4 Cauchy–Schwarz inequality and correlation
6.5 Inequalities for generating functions
7 Bivariate distribution
7.1 Continuous bivariate distribution
7.2 Uniform distribution in a planar region
7.3 Independence
7.4 Expectation of a transformed random vector
7.5 Covariance and other joint moments
7.6 Best linear approximation
7.7 Expectation vector and covariance matrix
7.8 Distribution of a transformed random vector
7.9 Properties of the t-distribution
8 Conditional distribution
8.1 Conditional distributions
8.2 The chain rule (multiplication rule)
8.3 The joint distribution of discrete and continuous random variables
8.4 The conditional expectation
8.5 Hierarchical definition of a joint distribution
9 Multivariate distribution
9.1 Random vector
9.2 The expectation vector and the covariance matrix
9.3 Conditional distributions, the multiplication rule and the conditional expectation
9.4 Conditional independence
9.5 Statistical models
9.6 The density function transformation formula
9.7 The moment generating function of a random vector
10 Multivariate normal distribution
10.1 The standard normal distribution Nn(0, I)
10.2 General multinormal distribution
10.3 The distribution of an affine transformation
10.4 The density function
10.5 The contours of the density function
10.6 Non-correlation and independence
10.7 Conditional distributions
10.8 The two-dimensional normal distribution
10.9 The joint distribution of the sample mean and variance of a normal distribution
Chapter 6
Inequalities
Inequalities are among the most powerful tools available to a mathematician.
An equation a = b asserts that two supposedly different things are in fact the
same, while an inequality a ≤ b may give us an opportunity to make a complex
situation easier to understand.
Probability inequalities are typically used to support theoretical arguments,
e.g. in deriving limits, and in many other applications. Inequalities can also be
used to analyze the complexity of randomized algorithms.
When working with inequalities, one must pay close attention to the form of
both sides of the inequality as well as the conditions under which the inequality
is valid. Furthermore, it is essential to always keep in mind the correct order
of magnitude relation between the two sides of the inequality. Mathematical
argumentation often uses long chains of inequalities
a ≤ a1 ≤ · · · ≤ ak ≤ b
As Y ≤ X,
EX ≥ EY = 0 · P (Y = 0) + a · P (Y = a) = aP (X ≥ a).
P[|X − µ| ≥ t] ≤ σ²/t², ∀t > 0.

In particular, if σ² > 0, then

P[|X − µ| ≥ kσ] ≤ 1/k², ∀k > 0.
Proof. We apply Markov’s inequality to the random variable Y = (X − µ)2 . If
t > 0, then
P[|X − µ| ≥ t] = P[(X − µ)² ≥ t²] ≤ σ²/t².
The second inequality follows by choosing t = kσ.
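As a quick sanity check (not part of the original text), Chebyshev's bound can be verified exactly for a small discrete distribution; the fair-die example below is purely illustrative.

```python
# Exact check of Chebyshev's inequality P(|X - mu| >= t) <= sigma^2 / t^2
# for X uniform on {1, ..., 6} (a fair die); the values are illustrative.
from fractions import Fraction

values = [Fraction(k) for k in range(1, 7)]
p = Fraction(1, 6)                            # uniform probabilities
mu = sum(v * p for v in values)               # = 7/2
var = sum((v - mu) ** 2 * p for v in values)  # = 35/12

for t in [Fraction(1), Fraction(3, 2), Fraction(2), Fraction(5, 2)]:
    tail = sum(p for v in values if abs(v - mu) >= t)  # exact tail probability
    bound = var / t ** 2                               # Chebyshev's upper bound
    assert tail <= bound
```

Working with `Fraction` keeps all probabilities exact, so the comparison involves no rounding.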
Chebyshev’s inequality can be used to prove the weak law of large numbers
(WLLN). This particular form of convergence of the random variables is referred
to as convergence in probability.
Theorem 6.3 (The weak law of large numbers). Let X1, X2, . . . be a sequence of
independent random variables with the same expectation µ = EXi and variance
σ² = var Xi < ∞. Then, the sequence formed from the averages

X̄n = (1/n) Σ_{i=1}^{n} Xi

converges to µ in probability, i.e. P(|X̄n − µ| ≥ ε) → 0 as n → ∞ for every ε > 0.
Proof. Now,
EX̄n = µ,  var X̄n = (1/n²) · nσ² = σ²/n.
By Chebyshev’s inequality,
P(|X̄n − µ| ≥ ε) ≤ σ²/(nε²) → 0, as n → ∞.
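The weak law can also be watched in action numerically. The following simulation (illustrative, with arbitrarily chosen parameters) estimates the tail probability P(|X̄n − µ| ≥ ε) for fair coin flips at two sample sizes.

```python
# Monte Carlo illustration of the weak law of large numbers: the probability
# that the sample mean of fair coin flips deviates from mu = 0.5 by at least
# eps shrinks as n grows.  Parameters are illustrative.
import random

def tail_prob(n, eps, trials=2000, seed=0):
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        mean = sum(rng.random() < 0.5 for _ in range(n)) / n
        if abs(mean - 0.5) >= eps:
            hits += 1
    return hits / trials

p_small, p_large = tail_prob(10, 0.1), tail_prob(1000, 0.1)
assert p_large < p_small          # the tail probability decreases with n
assert p_large < 0.05             # and is already tiny at n = 1000
```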
6.2 Convex functions and Jensen’s inequality
A function is convex if its graph remains below a chord drawn through any two
points of the graph of the function. A function is concave if its graph remains
above a chord drawn through any two points of the graph of the function. The
idea is made precise below. Most functions are neither convex nor concave.
By interval we understand any closed, open or half-closed, finite or infinite
interval: (a, b), [a, b), (a, b] or [a, b], where a ≤ b and a = −∞, b = ∞ are
allowed. If I is an interval and x, y ∈ I, then the points λx + (1 − λ)y, where
0 ≤ λ ≤ 1, situated on a chord connecting x and y are located inside the interval
I. A convex set is characterised by this particular feature and convex subsets
of the real line are intervals.
Definition 6.1 (Convex function, concave function). Let I ⊂ R be an interval.
A function g : I → R is called convex, if

g(λx + (1 − λ)y) ≤ λg(x) + (1 − λ)g(y) for all x, y ∈ I and 0 ≤ λ ≤ 1.

The function g is called concave if the reversed inequality holds, i.e. if −g is convex.

(a) If g is differentiable on I and the derivative g′ is increasing, then g is convex.

(b) If g has a second derivative on I and g″(x) ≥ 0 for all x ∈ I, then g is
convex.
Proof. (a): Let a, b, c ∈ I and a < b < c. By the mean value theorem for the
derivative, we have

(g(b) − g(a))/(b − a) = g′(x)  and  (g(c) − g(b))/(c − b) = g′(y)

for some x ∈ (a, b) and y ∈ (b, c). Since g′ is increasing and x < b < y, we
have g′(x) ≤ g′(y). This proves (6.1), which by Theorem 6.4 is equivalent to
the convexity of g.
(b): Since g″ ≥ 0, the function g′ is increasing on I, and the claim follows from (a).
If g : I → R is convex, then the definition of a convex function can be used
to derive (utilising induction) the inequality
g(p1 x1 + · · · + pn xn) ≤ p1 g(x1) + · · · + pn g(xn),  (6.2)

whenever x1, . . . , xn ∈ I and the weights pi ≥ 0 satisfy p1 + · · · + pn = 1.
• If a set A ⊂ R has an upper bound, it has a least upper bound. (This is
a deep completeness property of the real line R that distinguishes it for
instance from the set of rational numbers Q.)
• The least upper bound of A is unique (this is easy) and denoted by sup A.
Denoting
A = { (g(m) − g(y))/(m − y) : y ∈ I, y < m },   B = { (g(z) − g(m))/(z − m) : z ∈ I, z > m },
the inequality (∗∗) shows that every b ∈ B is an upper bound of the set A. Thus
there is a least upper bound k = sup A, and this satisfies (∗).
The proof above shows that if g is differentiable at m, then k = g 0 (m) is
the unique choice in Theorem 6.6. Indeed, both the left and right sides of (∗)
converge to g 0 (m) when y, z → m.
Theorem 6.7 (Jensen’s inequality). Let I ⊂ R be an open interval and g :
I → R a convex function. If X is a random variable assuming values within I
(with probability one) and EX and Eg(X) exist, then
g(EX) ≤ Eg(X). (6.3)
Proof. As µ = EX ∈ I, there is a constant k such that,
g(x) ≥ g(µ) + k(x − µ) , ∀x ∈ I.
With probability one
g(X) ≥ g(µ) + k(X − µ),
and Jensen’s inequality follows by taking the expectation of both sides.
Example 6.1. As the function x² is convex, according to Jensen's inequality,
(EX)2 ≤ E(X 2 ),
if these moments exist. This implies the inequality
var X = E(X 2 ) − (EX)2 ≥ 0,
which we already knew.
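Jensen's inequality (6.3) is easy to verify numerically for a small discrete distribution; the values below are illustrative and not from the text.

```python
# Numerical check of Jensen's inequality g(EX) <= E g(X) for the convex
# functions x^2 and exp on a small discrete distribution (illustrative).
import math

xs = [-1.0, 0.5, 2.0, 3.0]
ps = [0.1, 0.4, 0.3, 0.2]           # probabilities summing to 1
EX = sum(p * x for x, p in zip(xs, ps))

for g in (lambda x: x * x, math.exp):
    Eg = sum(p * g(x) for x, p in zip(xs, ps))
    assert g(EX) <= Eg              # Jensen's inequality (6.3)
```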
6.3 Hölder's inequality

Theorem (Hölder's inequality). Let p, q > 1 satisfy 1/p + 1/q = 1. Then

E|XY| ≤ [E|X|^p]^{1/p} [E|Y|^q]^{1/q}.
Proof. The claim is trivially true if E|X|p = ∞ or E|Y |q = ∞ or if E|X|p = 0 or
E|Y |q = 0. Therefore, we assume that both expectations are finite and greater
than 0.
First we prove an inequality for positive real numbers. Let a, b > 0 be
arbitrary and choose s and t such that

a = exp(s/p),  b = exp(t/q).

Since the exponential function is convex and 1/p + 1/q = 1,

ab = exp((1/p)s + (1/q)t) ≤ (1/p)e^s + (1/q)e^t = a^p/p + b^q/q.

The same inequality is also valid when a = 0 or b = 0, if we agree that zero
raised to a positive power is zero.
If we let

u = [E|X|^p]^{1/p},  v = [E|Y|^q]^{1/q},

and take the expectation of both sides of the inequality

|XY/(uv)| ≤ (1/p)|X/u|^p + (1/q)|Y/v|^q,

we get

E|XY/(uv)| ≤ (1/p) E|X|^p/E|X|^p + (1/q) E|Y|^q/E|Y|^q = 1.
As an easy application of Hölder's inequality we can prove that

[E|X|^r]^{1/r} ≤ [E|X|^s]^{1/s}, for all 0 < r ≤ s.  (6.4)

(If one of the expectations does not exist as a real number, the term is under-
stood as ∞.) Minkowski's inequality can be derived from the triangle inequality
and Hölder's inequality.
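A small numerical sketch (illustrative joint table, not from the text) confirms Hölder's inequality for several conjugate exponent pairs.

```python
# Numerical check of Hölder's inequality
#   E|XY| <= (E|X|^p)^(1/p) * (E|Y|^q)^(1/q),  1/p + 1/q = 1,
# on a small joint distribution (the table below is illustrative).
pairs = [(-1.0, 2.0), (0.5, -1.0), (2.0, 0.5), (3.0, 1.5)]
probs = [0.25, 0.25, 0.25, 0.25]

for p in (2.0, 3.0, 1.5):
    q = p / (p - 1.0)                           # conjugate exponent
    E_XY = sum(w * abs(x * y) for (x, y), w in zip(pairs, probs))
    rhs_x = sum(w * abs(x) ** p for (x, y), w in zip(pairs, probs)) ** (1 / p)
    rhs_y = sum(w * abs(y) ** q for (x, y), w in zip(pairs, probs)) ** (1 / q)
    assert E_XY <= rhs_x * rhs_y + 1e-12
```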
6.4 Cauchy–Schwarz inequality and correlation

The Cauchy–Schwarz inequality is the special case of Hölder's inequality with
p = q = 2, which results in

E|XY| ≤ √(EX²) √(EY²).

In some cases this inequality holds only in the trivial form E|XY| ≤ ∞, which
conveys no useful information. If EX² < ∞ and EY² < ∞, then the upper
bound is finite and we have

|E(XY)| ≤ E|XY| ≤ √(EX²) √(EY²).  (6.6)
Applying (6.6) to the centred random variables X − EX and Y − EY gives

|cov(X, Y)| ≤ σX σY.  (6.7)

Here the upper bound is the product of the standard deviations of the variables.
If both standard deviations σX and σY are strictly positive, then the corre-
lation coefficient of X and Y (often denoted by corr(X, Y ) or ρXY ) is defined
by the formula
corr(X, Y) = ρXY = ρ = cov(X, Y)/(σX σY).  (6.8)
By the Cauchy–Schwarz inequality,
|corr(X, Y )| ≤ 1. (6.9)
If corr(X, Y) > 0, then also cov(X, Y) > 0, and X and Y are said to be positively
correlated. If corr(X, Y) < 0, then also cov(X, Y) < 0, and X and Y are said
to be negatively correlated. If corr(X, Y) = 0, then also cov(X, Y) = 0, and X
and Y are said to have no (linear) correlation, or to be (linearly) uncorrelated.
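The correlation coefficient is straightforward to compute from data, and inequality (6.9) can be observed directly; the data points below are illustrative.

```python
# Correlation coefficient from data: sample covariance over the product of
# sample standard deviations; |corr| <= 1 by Cauchy-Schwarz.  Data illustrative.
import math

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 8.0, 9.8]        # roughly 2x, so strongly positive

n = len(x)
mx, my = sum(x) / n, sum(y) / n
cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
sx = math.sqrt(sum((a - mx) ** 2 for a in x) / n)
sy = math.sqrt(sum((b - my) ** 2 for b in y) / n)
rho = cov / (sx * sy)

assert -1.0 <= rho <= 1.0            # inequality (6.9)
assert rho > 0.99                    # nearly perfect linear relationship
```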
Theorem 6.9 (Properties of the generating functions). a) The tail probabilities
satisfy the following upper bounds for all a ∈ R:

P(X ≥ a) ≤ inf_{t>0} e^{−ta} E e^{tX},   P(X ≤ a) ≤ inf_{t<0} e^{−ta} E e^{tX}.

b) The cumulant generating function and the moment generating function are
both convex functions.
Proof. Part (a): when t > 0, the function x ↦ e^{tx} is strictly increasing and
therefore, according to Markov's inequality,

P(X ≥ a) = P(e^{tX} ≥ e^{ta}) ≤ E e^{tX}/e^{ta}.

As this bound is valid for all t > 0, it remains valid when the right-hand side
is replaced by its infimum over t > 0. The other inequality is proved by
applying the first one to the random variable Y = −X.
The convexity of the cumulant generating function is a consequence of Hölder's
inequality. The convexity of the moment generating function follows from
x ↦ e^x being an increasing and convex function.
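The tail bound of part (a) can be evaluated numerically by minimising over a grid of t. The sketch below uses a Binomial(n, 1/2) variable, whose mgf ((1 − p) + pe^t)^n is standard; the parameters are illustrative.

```python
# The generating-function tail bound P(X >= a) <= inf_{t>0} e^{-ta} E e^{tX},
# evaluated over a grid of t > 0 for X ~ Binomial(n, p), whose mgf is
# ((1 - p) + p e^t)^n.  Parameters are illustrative.
import math

n, p = 20, 0.5

def binom_pmf(k):
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

a = 15
exact_tail = sum(binom_pmf(k) for k in range(a, n + 1))
bound = min(
    math.exp(-t * a) * ((1 - p) + p * math.exp(t)) ** n
    for t in [i / 100 for i in range(1, 300)]    # grid over t > 0
)
assert exact_tail <= bound
```

The grid minimum is only an approximation of the infimum, but since the bound holds for every t > 0, it still dominates the exact tail probability.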
Chapter 7
Bivariate distribution
In this situation, f = fX,Y is called the density function of the random vector
(X, Y ) or the joint density function of the random variables X and Y .
Formula (7.1) is sometimes heuristically expressed as

P(X ∈ (x, x + dx], Y ∈ (y, y + dy]) ≈ fX,Y(x, y) dx dy.
An integral of a function g over a set B means an integral of the function 1B g over the whole domain R². This
type of an integral is written as
∫_B g = ∫∫_B g(x, y) d(x, y) = ∫∫_B g(x, y) dx dy = ∫∫_{R²} 1B(x, y) g(x, y) dx dy.
Such area integrals are usually calculated as iterated integrals. Under these cir-
cumstances we can apply Fubini's theorem (which will not be proved during
this course). The following theorem is formulated for the Lebesgue integral.
Theorem 7.1 (Fubini). In the following two cases the area integral can be
calculated as an iterated integral, meaning that the identities below are valid:

∫_B g(x, y) d(x, y) = ∫_{−∞}^{∞} ( ∫_{−∞}^{∞} 1B(x, y) g(x, y) dx ) dy
                    = ∫_{−∞}^{∞} ( ∫_{−∞}^{∞} 1B(x, y) g(x, y) dy ) dx.  (7.2)

(a) If g ≥ 0, in which case the common value of all these integrals is allowed
to be ∞.

(b) If ∫_B |g| < ∞, in which case the common value of all the integrals above
is finite (in contrast to case (a)).
For example if B can be presented as
B = {(x, y) : a < x < b, c(x) < y < d(x)},
and ∫_B |g| < ∞ or g ≥ 0, then

∫_B g = ∫∫_B g(x, y) dx dy = ∫_a^b ( ∫_{c(x)}^{d(x)} g(x, y) dy ) dx.
We will get the same result even if, within the definition of B, one or more of the
inequalities "<" is replaced by "≤". This iterated integral can also be written
in either of the following ways:

∫_B g = ∫_{x=a}^{b} ∫_{y=c(x)}^{d(x)} g(x, y) dy dx = ∫_a^b dx ∫_{c(x)}^{d(x)} g(x, y) dy.
If B can be expressed as
B = {(x, y) : c < y < d, a(y) < x < b(y)},
and ∫_B |g| < ∞ or g ≥ 0, then the area integral can be calculated as an
iterated integral:

∫∫_B g(x, y) dx dy = ∫_c^d ( ∫_{a(y)}^{b(y)} g(x, y) dx ) dy
                   = ∫_{y=c}^{d} ∫_{x=a(y)}^{b(y)} g(x, y) dx dy = ∫_c^d dy ∫_{a(y)}^{b(y)} g(x, y) dx.
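Both iteration orders can be tried numerically on a concrete region. The sketch below (illustrative function and region, not from the text) approximates ∫∫_B xy over the triangle B = {0 < x < 1, 0 < y < x} with midpoint Riemann sums in both orders; the exact value is ∫₀¹ x · (x²/2) dx = 1/8.

```python
# Fubini in practice: the integral of g(x, y) = x * y over the triangle
# B = {0 < x < 1, 0 < y < x}, computed in both iteration orders.
# Exact value: 1/8.  Midpoint Riemann sums; parameters illustrative.
N = 400
h = 1.0 / N

def g(x, y):
    return x * y

# dy inside, dx outside:  int_0^1 ( int_0^x g dy ) dx
dy_dx = sum(
    sum(g(x, y) * h for y in [(j + 0.5) * h for j in range(N)] if y < x) * h
    for x in [(i + 0.5) * h for i in range(N)]
)
# dx inside, dy outside:  int_0^1 ( int_y^1 g dx ) dy
dx_dy = sum(
    sum(g(x, y) * h for x in [(i + 0.5) * h for i in range(N)] if x > y) * h
    for y in [(j + 0.5) * h for j in range(N)]
)
assert abs(dy_dx - 0.125) < 1e-3 and abs(dx_dy - 0.125) < 1e-3
```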
Proof. If B ⊂ R, then
P(X ∈ B) = P((X, Y) ∈ B × R) = ∫_B ( ∫_{−∞}^{∞} fX,Y(x, y) dy ) dx.
Example 7.1. In the converse direction, the continuity of the marginal distri-
butions does not automatically imply the continuity of the joint distribution.
For example, let X be a random variable with a density function fX and let a
random variable Y be defined by Y = X, or
Y (ω) = X(ω), for all ω ∈ Ω.
Now the distribution of Y is equal to the distribution of X and therefore both
distributions are continuous. The distribution of a random vector (X, Y ) is
concentrated on the diagonal
B = {(x, y) : x = y}.
However, for any function f : R² → R we have ∫_B f = 0, since the diagonal
B, being a one-dimensional set, has two-dimensional measure zero. For this
reason, the random vector (X, Y) cannot be continuous and therefore, the joint
distribution is not continuous either.
In the case of a continuous joint distribution, the joint cumulative distribu-
tion has the formula
FX,Y(x, y) = P((X, Y) ∈ (−∞, x] × (−∞, y]) = ∫_{−∞}^{x} ds ∫_{−∞}^{y} fX,Y(s, t) dt.
7.2 Uniform distribution in a planar region
Definition 7.2 (Uniform distribution in region A). Let A ⊂ R² be a set with
an area

m(A) = ∫∫_{R²} 1A(x, y) dx dy

that satisfies 0 < m(A) < ∞. A random vector (X, Y) is said to have the
uniform distribution in A, if it has the joint density function

f(x, y) = 1A(x, y)/m(A).
Example 7.2. Let h(x) = exp(−½x²) and suppose that the random vector
(X, Y) has a uniform distribution under the graph of h, i.e. in the set

A = {(x, y) : 0 < y < h(x)},

whose area is m(A) = ∫_{−∞}^{∞} h(x) dx = √(2π).
The marginal distribution of X is
fX(x) = ∫_{−∞}^{∞} fX,Y(x, y) dy = ∫_0^{h(x)} (1/√(2π)) dy = (1/√(2π)) exp(−½x²).
(Notice that −ln y > 0 when 0 < y < 1.) Therefore, the joint density function
can also be expressed as

fX,Y(x, y) = (1/√(2π)) 1{0 < y < 1, −√(−2 ln y) < x < √(−2 ln y)},

from which the marginal distribution of Y is easy to calculate:

fY(y) = (2/√π) √(−ln y),  0 < y < 1.
7.3 Independence
Let us recall a definition. Two random variables X and Y are called independent
(denoted by X ⊥⊥ Y), if

P(X ∈ A, Y ∈ B) = P(X ∈ A) P(Y ∈ B) for all sets A, B ⊂ R.
Next we will show that the joint mass functions (jmf) and the joint density
functions (jdf) of independent random variables can be expressed as the prod-
uct of the marginal distributions when the joint distribution is either discrete
or continuous. Notice that the joint distribution function of independent random
variables is not the one-variable product of the functions FX and FY,

u ↦ FX(u) FY(u),

but the so-called tensor product

(x, y) ↦ FX(x) FY(y).
Similarly, the jmfs and the jdfs of independent random variables can be ex-
pressed as the product of marginal density/mass functions as long as we under-
stand that we are dealing with the tensor product.
Theorem 7.5. Suppose that the random vector (X, Y) has a discrete or con-
tinuous distribution with joint mass function or joint density function fX,Y,
respectively.

(a) If X ⊥⊥ Y, then fX,Y(x, y) = fX(x) fY(y).

(b) If fX,Y(x, y) = g(x)h(y) for some functions g and h, then X ⊥⊥ Y.

Proof. (a): The discrete case was proved in Section 3.4, and we shall now
prove the continuous case. Assume that the joint distribution is continuous
and that X ⊥⊥ Y. Then the joint cumulative distribution function factorises,
FX,Y(x, y) = P(X ≤ x)P(Y ≤ y) = FX(x)FY(y), and differentiating with
respect to x and y gives fX,Y(x, y) = fX(x)fY(y).

(b): Suppose that fX,Y(x, y) = g(x)h(y) and denote c = ∫_{−∞}^{∞} g(x) dx
and d = ∫_{−∞}^{∞} h(y) dy. Then

1 = ∫∫_{R²} fX,Y(x, y) dx dy = ∫_{−∞}^{∞} g(x) dx · ∫_{−∞}^{∞} h(y) dy = cd.
The marginal distribution of X is

fX(x) = ∫_{−∞}^{∞} fX,Y(x, y) dy = d g(x) = g(x)/c,

and the marginal distribution of Y is

fY(y) = ∫_{−∞}^{∞} fX,Y(x, y) dx = c h(y) = h(y)/d.
In other words, the marginal densities fX and fY are the factors g and
h normalised into density functions. For all x, y, the joint cumulative
distribution function is the product of the marginal distribution functions,
because
FX,Y(x, y) = ∫_{−∞}^{x} ds ∫_{−∞}^{y} fX,Y(s, t) dt = ∫_{−∞}^{x} ∫_{−∞}^{y} (g(s)/c)(h(t)/d) ds dt
           = ∫_{−∞}^{x} fX(s) ds · ∫_{−∞}^{y} fY(t) dt = FX(x) FY(y),

and consequently X ⊥⊥ Y.
Sometimes we wish to prove that two random variables X and Y are not
independent. If X and Y are discrete, their dependence can be proved by finding
one point (x, y) where the multiplicative formula fails:

fX,Y(x, y) ≠ fX(x) fY(y);

this shows that P(X ∈ A, Y ∈ B) is not equal to P(X ∈ A)P(Y ∈ B) when we
choose A = {x} and B = {y}.
In case of a continuous joint distribution, the proof is more complicated, since
many different functions can be density functions of the same distribution.
Example 7.3. Suppose that the random vector (X, Y) has a uniform distribu-
tion in the triangle

C = {(x, y) : x ≥ 0, y ≥ 0, x + y ≤ 1}.

Are X and Y independent? As the area of this triangle is ½, the jdf is

fX,Y(x, y) = 2, when (x, y) ∈ C.
It might misleadingly seem that the jdf takes the product form of g(x)h(y). We
can notice that this is false when we write down the condition (x, y) ∈ C using
indicator functions:
fX,Y (x, y) = 2 · 1{x≥0} 1{y≥0} 1{x+y≤1}
Here, the last term is definitely not of the product form.
If A = B = (½, 1), then (A × B) ∩ C = ∅ and

P((X, Y) ∈ A × B) = 0,

but as P(X ∈ A) > 0 and P(Y ∈ B) > 0, we get

0 = P(X ∈ A, Y ∈ B) ≠ P(X ∈ A)P(Y ∈ B) > 0.

Directly using the definition of independence we have shown that X and Y are
not independent.
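The argument above can also be seen empirically. The sketch below (illustrative simulation) draws points uniformly from the triangle by rejection sampling; the event {X > 1/2, Y > 1/2} never occurs because x + y ≤ 1 excludes it, while each marginal event has positive probability.

```python
# Empirical illustration of Example 7.3: for (X, Y) uniform in the triangle
# x, y >= 0, x + y <= 1, the event {X > 1/2, Y > 1/2} is impossible even
# though both {X > 1/2} and {Y > 1/2} have positive probability.
import random

rng = random.Random(42)

def sample_triangle():
    # rejection sampling: draw from the unit square until inside the triangle
    while True:
        x, y = rng.random(), rng.random()
        if x + y <= 1.0:
            return x, y

pts = [sample_triangle() for _ in range(20000)]
p_x = sum(x > 0.5 for x, _ in pts) / len(pts)    # estimates P(X > 1/2) = 1/4
p_y = sum(y > 0.5 for _, y in pts) / len(pts)
p_xy = sum(x > 0.5 and y > 0.5 for x, y in pts) / len(pts)

assert p_xy == 0.0                  # geometrically impossible: x + y <= 1
assert p_x > 0.0 and p_y > 0.0      # so P(X>1/2)P(Y>1/2) > 0 = P(both)
```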
7.4 Expectation of a transformed random vector
Theorem 7.6. Suppose that the random vector (X, Y) has a discrete distribu-
tion and let Z = g(X, Y) be a real-valued transformation of it. Then

EZ = Σ_{x,y} g(x, y) fX,Y(x, y),

and therefore,

EZ = Σ_{i≥1} g(xi, yi) P(X = xi, Y = yi) = Σ_{i≥1} g(xi, yi) fX,Y(xi, yi).
But
P((X, Y) ∈ Ai) = ∫∫_{Ai} fX,Y(x, y) dx dy = ∫∫_{R²} 1Ai(x, y) fX,Y(x, y) dx dy,
and hence
Eg(X, Y) = Σ_{i≥1} ai ∫∫_{R²} 1Ai(x, y) fX,Y(x, y) dx dy
         = ∫∫_{R²} ( Σ_{i≥1} ai 1Ai(x, y) ) fX,Y(x, y) dx dy = ∫∫_{R²} g(x, y) fX,Y(x, y) dx dy.
Let us define
sn(t) := Σ_{k=0}^{4^n − 1} k 2^{−n} · 1_{[k2^{−n}, (k+1)2^{−n})}(t) + 2^n · 1_{[2^n, ∞)}(t).
If g ≥ 0, then 0 ≤ sn(g(x, y)) ≤ sn+1(g(x, y)) → g(x, y). In this situation,
the monotone convergence theorem (proved in the course Measure and Integral,
taken for granted here) allows exchanging the limit with either the integral or
the expectation, with the result that

Eg(X, Y) = E lim_{n→∞} sn(g(X, Y))
         = lim_{n→∞} E sn(g(X, Y))                                  (MON)
         = lim_{n→∞} ∫_{R²} sn(g(x, y)) fX,Y(x, y) dx dy
         = ∫_{R²} lim_{n→∞} sn(g(x, y)) fX,Y(x, y) dx dy            (MON)
         = ∫_{R²} g(x, y) fX,Y(x, y) dx dy.
This formula is valid under the assumption that the expectation exists and is
finite, meaning that the sum or integral converges absolutely.
Why this funny name? In order to compute the expectation Eg(X, Y), the
statistician can remain entirely unconscious of the distribution of the transformed
random variable g(X, Y), as long as they know the distribution of the original
random vector (X, Y).
This bivariate version is compatible with the previous univariate version.
For example, the bivariate version of the law of the unconscious statistician
can be used to compute Eg(X) for a discrete joint distribution and a function
g of one variable:

Eg(X) = Σ_x Σ_y g(x) fX,Y(x, y) = Σ_x g(x) Σ_y fX,Y(x, y) = Σ_x g(x) fX(x),

where the last formula agrees with the univariate version of the law of the
unconscious statistician, which was discussed in Section 4.4.
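The discrete law of the unconscious statistician is easy to check directly on a small joint pmf table; the table and the function g below are illustrative.

```python
# The (discrete) law of the unconscious statistician: E g(X, Y) computed
# directly from the joint pmf, without first finding the distribution of
# Z = g(X, Y).  The joint pmf table below is illustrative.
joint = {  # (x, y) -> P(X = x, Y = y)
    (0, 0): 0.2, (0, 1): 0.3,
    (1, 0): 0.1, (1, 1): 0.4,
}
assert abs(sum(joint.values()) - 1.0) < 1e-12

def g(x, y):
    return (x + y) ** 2

# E g(X, Y) via the joint pmf ...
E_direct = sum(g(x, y) * p for (x, y), p in joint.items())

# ... agrees with E Z computed from the distribution of Z = g(X, Y)
pmf_z = {}
for (x, y), p in joint.items():
    pmf_z[g(x, y)] = pmf_z.get(g(x, y), 0.0) + p
E_via_z = sum(z * p for z, p in pmf_z.items())

assert abs(E_direct - E_via_z) < 1e-12
```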
Example 7.4. If integrable random variables X and Y have a continuous joint
distribution, then for all constants a, b ∈ R
E(aX + bY) = ∫∫ (ax + by) fX,Y(x, y) dx dy
           = a ∫ x ( ∫ fX,Y(x, y) dy ) dx + b ∫ y ( ∫ fX,Y(x, y) dx ) dy
           = a ∫ x fX(x) dx + b ∫ y fY(y) dy = a EX + b EY.

Here each integral is calculated over the whole real axis. We have thus derived,
in this special case, a fact that we already knew: the expectation is a linear
operator.
Note. The identity Eg(X, Y ) = g(EX, EY ) holds for the function g(x, y) =
ax + by, but it is usually false for other functions.
Example 7.5. If X ⊥⊥ Y, then g(X) ⊥⊥ h(Y) for all functions g, h : R → R and
therefore,
E(g(X)h(Y )) = E(g(X))E(h(Y ))
if these expectations exist. If a random vector (X, Y ) is continuously distributed,
then fX,Y (x, y) = fX (x)fY (y), and the previous identity can also be derived
from the computation
E(g(X)h(Y)) = ∫∫ g(x)h(y) fX(x) fY(y) dx dy
            = ( ∫ g(x) fX(x) dx )( ∫ h(y) fY(y) dy ) = E(g(X)) E(h(Y)).
exists, it is referred to as the joint moment (or product moment) of order
(m, n). If the expectation

E[(X − EX)^m (Y − EY)^n]

exists, it is referred to as the central joint moment of order (m, n).
The most important of these moments are the moments EX^m and
EY^n of the marginal distributions (namely the joint moments of orders (m, 0)
and (0, n)) and the central moment of order (1, 1), which is better known as the
covariance of X and Y:

cov(X, Y) = E[(X − EX)(Y − EY)].
7.6 Best linear approximation

It turns out that the best linear approximation can be found from a simple
calculation, which we will work through next. Let us assume that

EX² < ∞, EY² < ∞, var X > 0, var Y > 0.

Denote µX = EX and µY = EY. If α and β are constants, then the
prediction α + β(X − µX) deviates from the actual Y by

Z = Y − α − β(X − µX).

Its expectation is

EZ = µY − α − β(µX − µX) = µY − α.
Next we will complete the square in the terms that include the β-coefficient,
EZ² = (µY − α)² + var Y + var X ( β² − 2β cov(X, Y)/var X + (cov(X, Y))²/(var X)² ) − (cov(X, Y))²/var X
    = var Y + (µY − α)² + var X ( β − cov(X, Y)/var X )² − (cov(X, Y))²/var X.
In the last form only the second and third terms depend on the variables α or
β and both of these terms are non-negative. No matter how we choose α and
β, the MSE will always be greater than or equal to
var Y − (cov(X, Y))²/var X = (1 − ρ²) var Y,  (7.3)
and the minimum MSE can be achieved by choosing
α = µY,  β = cov(X, Y)/var X.
In other words, the best linear approximation of Y in the MSE sense is

EY + (cov(X, Y)/var X) (X − EX),  (7.4)

where

ρ = corr(X, Y) = cov(X, Y)/(√(var X) √(var Y)),

which by the Cauchy–Schwarz inequality takes values in the range −1 ≤ ρ ≤ 1.
As the expectation of the approximation is EY, the MSE is equal to the
variance of the prediction error

Y − EY − (cov(X, Y)/var X) (X − EX).
We will later solve a more ambitious example by determining the best MSE-
based approximation of Y of the form m(X), where the function m can be
chosen freely.
Formula (7.3) of the best linear MSE approximation shows how the correla-
tion coefficient measures the intensity of the linear dependence between variables.
If the correlation coefficient reaches either extreme value of ρ = ±1, then the
MSE is zero and the prediction error is zero almost surely (see Theorem 4.3.),
i.e.,
Y = EY + (cov(X, Y)/var X) (X − EX)  (a.s.),  (7.5)
The abbreviation a.s. comes from the words almost surely, which in turn means
with probability one. In other words, if |ρ| = 1, it follows that

Y(ω) = EY + (cov(X, Y)/var X) (X(ω) − EX)

for all elementary events ω, except those that belong to an exceptional set of
probability zero.
Notice also that the sign of the correlation coefficient is the same as the sign
of the covariance cov(X, Y ) as well as the sign of the slope of the best linear
estimate. If ρ > 0, meaning cov(X, Y ) > 0, then Y has a tendency to grow as
X increases, since the slope of best linear prediction is in this case positive. If,
on the other hand, ρ < 0, meaning cov(X, Y ) < 0, then Y has a tendency to
decrease as X increases.
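The optimality of the coefficients α = EY and β = cov(X, Y)/var X can be checked numerically: since the MSE is a quadratic function of (α, β), any perturbation away from the optimum must increase it. The data below are illustrative.

```python
# Best linear MSE approximation (7.4): alpha = EY, beta = cov(X, Y)/var X.
# We verify on sample data that perturbing (alpha, beta) never lowers the
# mean squared error.  Data are illustrative.
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1.0, 2.9, 5.2, 6.8, 9.1]

n = len(x)
mx, my = sum(x) / n, sum(y) / n
var_x = sum((a - mx) ** 2 for a in x) / n
cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
alpha, beta = my, cov / var_x        # optimal coefficients

def mse(al, be):
    return sum((b - al - be * (a - mx)) ** 2 for a, b in zip(x, y)) / n

best = mse(alpha, beta)
for da in (-0.5, 0.5):
    for db in (-0.5, 0.5):
        assert best <= mse(alpha + da, beta + db)
```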
7.7 Expectation vector and covariance matrix

The expectation of a random vector or matrix is defined componentwise:
(EZ)ij = E(Zij) if Z is a random matrix.
Based on the linearity of the expectation we can easily see that

E(AZB + C) = A (EZ) B + C,  (7.7)

when Z is a random matrix and A, B and C are constant matrices, with such
dimensions that the expression is defined. Constant matrices can be pulled
outside the expectation if they are located in the extreme right or the extreme
left of the matrix product expression. On the contrary, matrices located in
the middle of the expression cannot be pulled out using this formula. Formula
(7.7) is also applicable to a random vector Z, since a column vector with d
components can be interpreted as a d × 1 matrix.
Definition 7.6 (Covariance matrix). The covariance matrix of a random vector
V = (X, Y) is defined as

Cov(V) = E[(V − EV)(V − EV)ᵀ].

The main diagonal of the covariance matrix contains the variances of the random
variables, and the off-diagonal entries are the covariances between the random
variables. For this reason the covariance matrix is sometimes called the
variance-covariance matrix.
If we denote σ²X = var X and σ²Y = var Y, with σX, σY > 0, then the corre-
lation coefficient ρ = corr(X, Y) of the random variables X and Y is defined by
the expression

ρ = corr(X, Y) = cov(X, Y)/(√(var X) √(var Y)) = cov(X, Y)/(σX σY).
Due to this, the covariance matrix of the random vector V = (X, Y) can also be
written as

Cov(V) = ( σ²X        ρ σX σY
           ρ σX σY    σ²Y    ).
If a random vector W can be obtained from the random vector V by an
affine transformation

W = AV + b,

where A is a constant matrix and b is a constant vector, then

EW = A(EV) + b  and  Cov(W) = A Cov(V) Aᵀ.
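The affine transformation rule for covariance matrices is an exact linear-algebra identity, so it also holds for empirical covariance matrices computed from data; the sketch below (illustrative numbers) checks this with numpy.

```python
# Check of the affine transformation rule Cov(AV + b) = A Cov(V) A^T on
# sample covariance matrices (the identity is exact linear algebra, so it
# holds for empirical covariances too).  Numbers are illustrative.
import numpy as np

rng = np.random.default_rng(1)
V = rng.standard_normal((2, 500))        # 500 samples of a 2-dim vector
A = np.array([[2.0, 1.0], [0.0, 3.0]])
b = np.array([[5.0], [-1.0]])

W = A @ V + b                            # affine transformation of each sample
assert np.allclose(np.cov(W), A @ np.cov(V) @ A.T)
assert np.allclose(np.mean(W, axis=1, keepdims=True),
                   A @ np.mean(V, axis=1, keepdims=True) + b)
```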
g : A → B is a diffeomorphism,  (7.11)
In practice, checking that a function is a diffeomorphism can be done as
follows. First, we find an expression for the inverse function h = (h1, h2) by
solving from the equations

g1(x, y) = u,  g2(x, y) = v,  (u, v) ∈ B,
the variables x and y as functions of u and v. If the solution
(x, y) = (h1 (u, v), h2 (u, v))
is unique for all (u, v) ∈ B and satisfies (x, y) ∈ A, then we have checked that
we are dealing with a bijection. After this, it is usually quite straightforward to
check whether the expressions g1(x, y) and g2(x, y) are continuously differentiable
on A, and whether the expressions h1(u, v) and h2(u, v) are continuously
differentiable on B. For example, for g1(x, y) we must check whether both
first-order partial derivatives ∂g1(x, y)/∂x and ∂g1(x, y)/∂y exist and whether
they are continuous functions on the set A.
Now let us consider a bijective correspondence
(u, v) = (g1 (x, y), g2 (x, y)) ⇔ (x, y) = (h1 (u, v), h2 (u, v)), (7.12)
which is convenient to write as
(u, v) = (u(x, y), v(x, y)) ⇔ (x, y) = (x(u, v), y(u, v)). (7.13)
The derivative matrix of the mapping h, the so-called Jacobian matrix, at
the point (u, v) is

( ∂h1(u, v)/∂u   ∂h1(u, v)/∂v )   ( ∂x/∂u   ∂x/∂v )
( ∂h2(u, v)/∂u   ∂h2(u, v)/∂v ) = ( ∂y/∂u   ∂y/∂v ).   (7.14)
The determinant of the derivative matrix (the Jacobian matrix) is called the
functional determinant or the Jacobian determinant or the Jacobian. The con-
cept of a Jacobian determinant was developed in the 19th century by the German
mathematician Carl Gustav Jacob Jacobi. (Warning: some textbooks use the
term Jacobian to refer to the derivative matrix and not its determinant.)
The Jacobian, i.e. the determinant of the Jacobian matrix, is denoted by
various symbols. We will be using both the modern notation

Jh(u, v)

and the following classical notation, which has been in use since the mid-19th
century:

∂(x, y)/∂(u, v).
The classical notation is more useful when the functions are defined by concrete
expressions. Here
Jh(u, v) = ∂(x, y)/∂(u, v) = det ( ∂x/∂u  ∂x/∂v ; ∂y/∂u  ∂y/∂v )
         = (∂x/∂u)(∂y/∂v) − (∂x/∂v)(∂y/∂u).
The derivative matrices of a function g and its inverse function h are inverse
matrices of each other which implies the following identity for their Jacobians:
1 = Jh(u, v) Jg(x, y) = (∂(x, y)/∂(u, v)) · (∂(u, v)/∂(x, y)).  (7.15)
In this expression (x, y) and (u, v) are related to each other bijectively according
to formula (7.13). Formula (7.15) follows from the facts that the determinant
of an identity matrix is equal to one, and that det(A B) =det(A)det(B), which
is valid when A and B are square matrices of equal size.
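Identity (7.15) can be checked numerically with finite differences. The sketch below uses the map g(x, y) = (x + y, x − y) and its inverse h(u, v) = ((u + v)/2, (u − v)/2) as an illustrative example.

```python
# Numerical check of the Jacobian identity (7.15): J_h(u, v) J_g(x, y) = 1,
# for g(x, y) = (x + y, x - y) with inverse h(u, v) = ((u+v)/2, (u-v)/2),
# using central finite differences.  The chosen point is illustrative.
def jacobian(f, s, t, eps=1e-6):
    # 2x2 Jacobian determinant of f at (s, t) by central differences
    f_s = [(a - b) / (2 * eps) for a, b in zip(f(s + eps, t), f(s - eps, t))]
    f_t = [(a - b) / (2 * eps) for a, b in zip(f(s, t + eps), f(s, t - eps))]
    return f_s[0] * f_t[1] - f_t[0] * f_s[1]

g = lambda x, y: (x + y, x - y)
h = lambda u, v: ((u + v) / 2.0, (u - v) / 2.0)

x0, y0 = 1.3, -0.7
u0, v0 = g(x0, y0)
Jg = jacobian(g, x0, y0)      # = -2
Jh = jacobian(h, u0, v0)      # = -1/2
assert abs(Jg - (-2.0)) < 1e-6 and abs(Jh - (-0.5)) < 1e-6
assert abs(Jh * Jg - 1.0) < 1e-6
```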
Theorem 7.8. Under assumptions (7.10) and (7.11), and using the notation
above in this section, the random vector (U, V) has a continuous distribution
with the density function

fU,V(u, v) = fX,Y(h1(u, v), h2(u, v)) |Jh(u, v)| when (u, v) ∈ B, and
fU,V(u, v) = 0 otherwise.  (7.16)
By changing variables (u, v) = g(x, y), or (x, y) = h(u, v), the area integral
takes the following form (an appropriate change-of-variables formula for the
Lebesgue integral can be found, for example, in Billingsley's book [1], Theorem
17.2):

∫∫_{h(C)} fX,Y(x, y) dx dy = ∫∫_C fX,Y(h1(u, v), h2(u, v)) |Jh(u, v)| du dv.
From here the unknown values are solved by using the known values. If fX,Y is
known, then
fU,V(u, v) = fX,Y(x, y) |∂(x, y)/∂(u, v)| = fX,Y(h1(u, v), h2(u, v)) |Jh(u, v)|,
when (u, v) ∈ B. The same joint density function can also be expressed as
or
B = {(u, v) : c < v < d, a(v) < u < b(v)}.
This can result in some extra work.
Example 7.6. Suppose that a random vector (X, Y ) has a continuous distri-
bution with a joint density function given by
fX,Y(x, y) = k(x, y) when x > 0 and y > 0, and fX,Y(x, y) = 0 otherwise.  (7.19)
Let us present this jdf as a product of the indicator 1{x > 0, y > 0} of the
positive quadrant and the function k(x, y). Consider the change of variables
(U, V) = (X + Y, X − Y), whose inverse is (x, y) = h(u, v) = (½(u + v), ½(u − v)).
This is a diffeomorphism between the whole plane and itself. Thus, the joint
density function of random variables U and V is
fU,V(u, v) = fX,Y(x, y) |∂(x, y)/∂(u, v)| = 1{u + v > 0, u − v > 0} k(½(u + v), ½(u − v)) · ½.
Next, let us calculate the marginal density of U. In order to determine the
integration limits correctly, the pair of inequalities u + v > 0 and u − v > 0
should be solved for v. After some examination we notice (draw a picture)
that the solution is −u < v < u, which is possible only when u > 0.
1. We can calculate the cumulative distribution function of U by integrating
the joint density over a suitable set. The density function of U can then be
obtained by differentiation, if the distribution of U turns out to be continuous.
2. We can try to (artificially) complement the transformation to make it a
bijection by choosing a suitable V = g2 (X, Y ) and then deriving the density
function of the random vector (U, V ). Finally the density function of the
variable U is calculated by integration.
Out of these two approaches the latter one is often more straightforward.
Example 7.7 (The density function of a sum). Let a random vector (X, Y )
have a density function fX,Y . Let
U = X + Y.
Derive the density function of the random variable U. What kind of an expres-
sion will you get if X ⊥⊥ Y?
and

fU,V(u, v) = fX,Y(x, y) |∂(x, y)/∂(u, v)| = fX,Y(v, u − v).

The marginal density function of the random variable U can be calculated from
this expression by integrating out v:

fU(u) = ∫_{−∞}^{∞} fX,Y(v, u − v) dv.

Thus

FU(u) = ∫_{−∞}^{∞} dx ∫_{−∞}^{u−x} fX,Y(x, y) dy.

If X ⊥⊥ Y, then fX,Y(v, u − v) = fX(v) fY(u − v) and

fU(u) = ∫_{−∞}^{∞} fX(v) fY(u − v) dv,

i.e. fU is the convolution of the marginal densities:

fU = fX ∗ fY.
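The convolution formula can be evaluated numerically for a concrete pair of densities. The sketch below (illustrative Riemann-sum approximation) takes X and Y independent Uniform(0, 1), whose sum has the triangular density with peak f_U(1) = 1.

```python
# Density of the sum U = X + Y for independent X, Y ~ Uniform(0, 1) via the
# convolution formula f_U(u) = int f_X(v) f_Y(u - v) dv; the result is the
# triangular density with peak f_U(1) = 1.  Riemann-sum sketch, illustrative.
def f_uniform(t):
    return 1.0 if 0.0 <= t <= 1.0 else 0.0

def f_sum(u, n=20000):
    dv = 1.0 / n
    # integrate over v in (0, 1), the support of f_X, midpoint rule
    return sum(f_uniform((i + 0.5) * dv) * f_uniform(u - (i + 0.5) * dv)
               for i in range(n)) * dv

assert abs(f_sum(1.0) - 1.0) < 1e-3    # peak of the triangle
assert abs(f_sum(0.5) - 0.5) < 1e-3    # f_U(u) = u on (0, 1)
assert abs(f_sum(1.5) - 0.5) < 1e-3    # f_U(u) = 2 - u on (1, 2)
assert f_sum(2.5) == 0.0               # outside the support
```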
Literature
[1] P. Billingsley. Probability and Measure. John Wiley & Sons, 2nd ed., 1986.
7.9 Properties of the t-distribution

The Student t-distribution with ν degrees of freedom has the stochastic
representation

Y = Z √(ν/X),  (7.20)

where X ∼ χ²_ν = Gam(ν/2, ½), Z ∼ N(0, 1), and X ⊥⊥ Z. To demonstrate the
technique of changing variables we will calculate the density function of the
t-distribution. Furthermore, we will derive the expectation and the variance of
this distribution by using the stochastic representation (7.20).
By independence,

fX,Z(x, z) = fX(x) fZ(z) = ((½)^{ν/2}/Γ(ν/2)) x^{ν/2 − 1} e^{−x/2} · (1/√(2π)) e^{−z²/2},  x > 0,
which is finite if both expectations on the right-hand side of the equation are
finite. The absolute moments E|Z|^a of the normal distribution are finite, but
by considering the latter expectation we easily end up with the condition a <
ν. For example, the Cauchy distribution does not have an expectation, and the
t-distribution has a variance only when ν > 2. As not all moments of the
distribution are finite, the moment generating function of the t-distribution
does not exist in any neighbourhood of the origin and therefore, the moment
generating function is a useless tool for this distribution.
The expectation and the variance of the Student t-distribution are easily
derived from the stochastic representation (7.20) of the distribution. If ν > 1,
then the t_ν-distribution has expectation zero. This is the case since, when
Z ⊥⊥ X,

EY = EZ · E√(ν/X) = 0 · E√(ν/X) = 0.
If ν > 2, then the variance of the t_ν-distribution is

var Y = EY² = EZ² · E(ν/X) = 1 · E(ν/X).
After several steps (integrate like a statistician and use the functional equation
of the gamma function) it can be seen that

var Y = ν/(ν − 2), when ν > 2.
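The stochastic representation (7.20) also gives a direct way to simulate the t-distribution and check these moments numerically; the simulation below (illustrative, with ν = 6) should produce a sample mean near 0 and a sample variance near ν/(ν − 2) = 1.5.

```python
# Monte Carlo check of the stochastic representation Y = Z * sqrt(nu / X),
# X ~ chi^2_nu, Z ~ N(0, 1) independent: for nu = 6 the t-distribution has
# mean 0 and variance nu / (nu - 2) = 1.5.  Illustrative simulation.
import numpy as np

rng = np.random.default_rng(0)
nu = 6
n = 400_000
Z = rng.standard_normal(n)
X = rng.chisquare(nu, size=n)
Y = Z * np.sqrt(nu / X)

assert abs(Y.mean()) < 0.05                 # EY = 0 for nu > 1
assert 1.3 < Y.var() < 1.7                  # var Y = 1.5 for nu > 2
```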
Chapter 8
Conditional distribution
fX|Y(x|y) = P(X = x | Y = y) = P(X = x, Y = y)/P(Y = y) = fX,Y(x, y)/fY(y),  (8.1)
when y is a point, where fY (y) > 0. The function fX|Y (·|y) is the pmf of the
random variable X, when we know that Y = y. The conditional pmf fY |X (y|x)
is defined analogously.
Formula (8.1) can be used to define a conditional density function in the
case of a continuous joint distribution.
Definition 8.1. Let (X, Y) have a continuous distribution with density function fX,Y. The conditional density function, given (Y = y), of the random variable X is defined as
fX|Y(x|y) = fX,Y(x, y) / fY(y), x ∈ R,
when y is such that fY(y) > 0. Similarly we define
fY|X(y|x) = fX,Y(x, y) / fX(x), y ∈ R,
Figure 8.1: A joint density function and conditional densities x ↦ fX|Y(x|y) for some choices of y.
The function x ↦ fX|Y(x|y) is derived from the joint density function fX,Y by normalising its horizontal “section” x ↦ fX,Y(x, y) into a density function. Similarly the function y ↦ fY|X(y|x) is derived by normalising the vertical section y ↦ fX,Y(x, y). Figure 8.1 shows how a conditional density function can be derived from the joint density function.
A continuous joint distribution can be represented by different density functions as long as they agree almost everywhere. For the purpose of defining a conditional density function, we can use any version of the joint density function. Therefore, the conditional densities are not unique either. Nevertheless, it turns out that the ambiguity does not cause serious problems. Therefore, the non-uniqueness will not be emphasised, and we will simply talk about “the” conditional density function (rather than, more accurately, about a version of the conditional density function).
In case of a continuous joint distribution, the probability of a condition
Y = y is always zero. Therefore, the conditional density function cannot be
directly interpreted in terms of a conditional probability. It is a definition that
we will explore further now.
We will make it plausible that, if a jdf fX,Y is smooth enough (precise conditions for this will not be specified) and fY(y) > 0, then
P(X ∈ A | y ≤ Y ≤ y + h) −→ ∫_A fX|Y(x|y) dx, as h → 0+, for all A ⊂ R.  (8.2)
Here we are dealing with a conditional density function, conditioned on Y = y, where y is an observed value.
The claim (8.2) will be explained starting from
P(X ∈ A | y ≤ Y ≤ y + h) = [ (1/h) P(X ∈ A, y ≤ Y ≤ y + h) ] / [ (1/h) P(y ≤ Y ≤ y + h) ]
= [ (1/h) ∫_y^{y+h} ( ∫_A fX,Y(x, u) dx ) du ] / [ (1/h) ∫_y^{y+h} fY(u) du ].
Both the numerator and the denominator contain integral averages of the form
(1/h) ∫_y^{y+h} g(u) du,
which approach the value g(y) under quite general assumptions when h → 0+. For example, if the function g is continuous at the point y and integrable somewhere in its proximity, then for any ε > 0 there is a δ > 0 such that |g(u) − g(y)| < ε whenever |u − y| < δ. If h is any number that satisfies 0 < h < δ, then
g(y) − ε = (1/h) ∫_y^{y+h} (g(y) − ε) du ≤ (1/h) ∫_y^{y+h} g(u) du ≤ (1/h) ∫_y^{y+h} (g(y) + ε) du = g(y) + ε.
This proves that the integral average approaches the limit g(y) when h → 0+. Using this with g(u) = fY(u) in the denominator, and g(u) = ∫_A fX,Y(x, u) dx in the numerator, we obtain
P(X ∈ A | y ≤ Y ≤ y + h) −→ [ ∫_A fX,Y(x, y) dx ] / fY(y) = ∫_A [ fX,Y(x, y) / fY(y) ] dx, as h → 0+.
Even without continuity, we have this limit at almost every y ∈ R (i.e., with
possible exception only in a set of measure zero) by Lebesgue’s differentiation
theorem from Real Analysis.
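The limit (8.2) can also be checked numerically for a concrete joint distribution. The sketch below is not from the text: it uses a hypothetical example X ∼ Exp(1), Y | X = x ∼ Exp(x), for which fX|Y(·|y) works out to a Gam(2, 1 + y) density, and compares a Monte Carlo estimate of P(X ≤ a | y ≤ Y ≤ y + h) with the exact conditional probability.

```python
import math
import random

random.seed(1)
y, h, a = 1.0, 0.05, 1.0
n = 400_000

hits = total = 0
for _ in range(n):
    x = random.expovariate(1.0)   # X ~ Exp(1)
    w = random.expovariate(x)     # Y | X = x ~ Exp(rate x)
    if y <= w <= y + h:           # condition on the thin slab y <= Y <= y+h
        total += 1
        if x <= a:
            hits += 1

mc_prob = hits / total

# Exact limit: X | (Y = y) ~ Gam(2, 1 + y), whose cdf at a is
# 1 - exp(-lam*a)*(1 + lam*a) with lam = 1 + y.
lam = 1.0 + y
exact = 1.0 - math.exp(-lam * a) * (1.0 + lam * a)
print(mc_prob, exact)  # the two values should be close for small h
```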
Even if the factors are formulated from some versions (possibly different from one another) of the jdf, their product is a valid jdf.
Another problem with the formula is that fY|X(y|x) is not necessarily defined for all x. We shall agree that, in the case of the multiplication rule, the definitions of the conditional densities will be extended (if needed) to include those points where the density of the conditioning variable is zero. For example, we can agree that
fY|X(y|x) = fX,Y(x, y) / fX(x), when fX(x) > 0, and fY|X(y|x) = 0 otherwise.  (8.4)
(We could also use some other consistent convention.) A similar extension will naturally be used for the other conditional df fX|Y(x|y) as well. After this, the multiplication rule is valid for all x, y ∈ R. These complications do not cause any real problems with actual computations.
From the multiplication rule, we can solve for the other conditional density function if the marginal densities and one of the conditional dfs are known. For example, when fX(x) > 0 (and other values of x are irrelevant), then
g(y) = ch(y)
Recalling that the marginal distribution of X is X ∼ N(0, 1), the joint distribution can be rewritten as
Y | X ∼ U(0, exp(−X²/2)),
X ∼ N(0, 1).
b) fX|Y: For fixed 0 < y < 1, the jdf has a constant positive value over a certain interval on the x-axis, which can be found by solving the inequality
0 < y < exp(−x²/2)
for the variable x. This inequality was solved in Example 7.2. The result shows that the conditional distribution is a uniform distribution,
X | (Y = y) ∼ U(−√(−2 ln y), √(−2 ln y)), 0 < y < 1.
but the proof would require the so-called Radon–Nikodym theorem from measure
theory.
Because of this, the joint distribution is given by
Even in this case we might refer to the joint distribution fX,Y as a density or a density function, but it is important to keep in mind that we sum with respect to one variable and integrate with respect to the other variable.
The probability mass function (pmf) of X, given Y = y, is defined by the Bayes formula, namely
fX|Y(x|y) = fX,Y(x, y) / fY(y),
when fY (y) > 0. This formula could also be motivated via limit considerations
as we did with the joint continuous distribution.
The multiplication rule and the Bayes formula are valid in this case as well. If
needed, the definitions of the conditional density fY |X (y|x) and the probability
mass function fX|Y (x|y) can be extended also to such values of the arguments
that were previously not covered by the definition.
When one of the random variables is discrete and another one is continuous,
expectations are computed by summing in the discrete variable and integrating
in the continuous variable. For instance, the law of the unconscious statistician
now takes the form:
Theorem 8.1. Let (X, Y) be a random vector such that X has a discrete distribution and Y a continuous distribution. Let Z = g(X, Y) be a real-valued transformation. Then
EZ = Σ_x ∫ g(x, y) fX,Y(x, y) dy = ∫ Σ_x g(x, y) fX,Y(x, y) dy,
provided that
Σ_x ∫ |g(x, y)| fX,Y(x, y) dy < ∞.
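As a numerical illustration (not from the text), take the hypothetical model X ∼ Bernoulli(0.3), Y | X = x ∼ N(x, 1), and g(x, y) = x + y². Then EZ = Σ_x fX(x) ∫ g(x, y) fY|X(y|x) dy = 0.7·(0 + 1) + 0.3·(1 + 2) = 1.6, which the sum-then-integrate recipe reproduces:

```python
import math

def normal_pdf(y, mu):
    # Density of N(mu, 1)
    return math.exp(-0.5 * (y - mu) ** 2) / math.sqrt(2 * math.pi)

p_x = {0: 0.7, 1: 0.3}       # pmf of X ~ Bernoulli(0.3)
g = lambda x, y: x + y * y   # the transformation g(X, Y)

# EZ = sum over x of p(x) * integral of g(x, y) f_{Y|X}(y|x) dy
# (midpoint rule on a wide grid stands in for the integral)
dy = 0.001
ez = 0.0
for x, px in p_x.items():
    ys = [-10 + (i + 0.5) * dy for i in range(int(20 / dy))]
    ez += px * sum(g(x, y) * normal_pdf(y, x) * dy for y in ys)

print(ez)  # close to the exact value 1.6
```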
and
var(g(X, Y) | X = x) = ∫ [g(x, y) − E(g(X, Y) | X = x)]² fY|X(y|x) dy.
If the square in the latter equation is expanded and the terms are organised, we see that
var(g(X, Y)|X = x) = E[(g(X, Y))²|X = x] − {E[g(X, Y)|X = x]}².  (8.6)
This is the conditional counterpart of the familiar formula var Z = EZ² − (EZ)². If X is a discrete variable, summation should be used in the previous formulas, and identity (8.6) remains valid.
If a function g takes the form
g(x, y) = g1 (x)g2 (x, y),
then,
E(g1 (X)g2 (X, Y )|X = x) = g1 (x)E(g2 (X, Y )|X = x). (8.7)
This can be expressed by saying that the known factors can be pulled out of the
conditional expectation.
The conditional expectation corresponding to g(x, y) = y is called the conditional expectation of Y, given X = x.
Definition 8.3 (Conditional expectation of Y , given X = x; regression func-
tion). The conditional expectation of a random variable Y , given X = x, is the
expectation of its conditional distribution.
E(Y |X = x).
The function
x 7→ E(Y |X = x)
is called the regression function of Y on X.
If Y has a continuous distribution, then
E(Y|X = x) = ∫ y fY|X(y|x) dy.
Figure 8.4 illustrates the conditional expectation E(Y |X = x), or the regression
function, and the conditional variance var(Y |X = x) for the joint distribution
of Figure 8.1.
Definition 8.4 (Conditional expectation given a random variable). Let us tem-
porarily denote
m(x) = E(g(X, Y )|X = x).
Let us agree that m(x) = 0 for those x for which m(x) cannot be defined,
i.e., the conditional distribution Y |(X = x) is not defined. After this, we can
talk about the transformed random variable m(X). It is called the conditional
expectation of the random variable g(X, Y ) given the random variable X, and
denoted by
E(g(X, Y )|X) = m(X).
Note. It might occur to one to use the notation E(g(X, Y )|X = X), but it
does not make sense. The conditional expectation E(g(X, Y )|X) is a random
variable defined in such a way that it takes the value E(g(X, Y )|X = x) with a
probability of one, if X takes the value x.
Theorem 8.2. If E[g(X, Y)²] < ∞, then E(g(X, Y)|X) is the best approximation, in terms of the mean square error, of the random variable g(X, Y) among all transformations of the random variable X alone. Namely,
E[(g(X, Y) − E(g(X, Y)|X))²] ≤ E[(g(X, Y) − h(X))²]
for every transformation h(X).
Proof. We present the argument for the discrete case:
Eg(X, Y) = Σ_{x,y} g(x, y) fX,Y(x, y) = Σ_{x,y} g(x, y) fX(x) fY|X(y|x)
= Σ_x fX(x) Σ_y g(x, y) fY|X(y|x)
= Σ_x fX(x) E(g(X, Y)|X = x) = E E(g(X, Y)|X).
Y ∼ Poi(λ),
where λ > 0 and 0 < θ < 1. Now
E(X|Y) = Y θ,
and therefore
EX = E E(X|Y) = E(Y θ) = θ EY = θλ.
As is the case with the conditional expectation, the conditional variance can also be calculated given a random variable X (rather than its value X = x). First, let us define a function
v(x) = var(g(X, Y)|X = x).
The definition is extended over the whole real axis by agreeing that v(x) = 0 for those arguments for which the conditional distribution Y|(X = x) is not naturally defined. After this, we define
var(g(X, Y)|X) = v(X).
Theorem 8.4. The variance of a random variable equals the sum of the expectation of the conditional variance and the variance of the conditional expectation, namely
var Y = E[var(Y|X)] + var[E(Y|X)].
Proof. Exercise.
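Theorem 8.4 can be verified numerically in the earlier Poisson example, assuming (as in the thinning model behind that example, which the text does not spell out) that X|Y ∼ Bin(Y, θ) with Y ∼ Poi(λ). Then E[var(X|Y)] = λθ(1 − θ) and var[E(X|Y)] = θ²λ, which sum to var X = λθ. The sketch below (λ = 4, θ = 0.3 chosen arbitrarily) evaluates both sides by truncated sums:

```python
import math

lam, theta = 4.0, 0.3   # illustrative parameter choices
Y_MAX = 80              # truncation point for the Poisson sums

def poi_pmf(y):
    return math.exp(-lam) * lam ** y / math.factorial(y)

# Given Y = y (assumed thinning model X|Y ~ Bin(Y, theta)):
#   E(X|Y=y) = y*theta,  var(X|Y=y) = y*theta*(1-theta)
e_cond_var = sum(poi_pmf(y) * y * theta * (1 - theta) for y in range(Y_MAX))
e_cond_mean = sum(poi_pmf(y) * y * theta for y in range(Y_MAX))
e_cond_mean_sq = sum(poi_pmf(y) * (y * theta) ** 2 for y in range(Y_MAX))
var_cond_mean = e_cond_mean_sq - e_cond_mean ** 2

total = e_cond_var + var_cond_mean
print(total)  # equals var X = lam * theta = 1.2 (thinned Poisson)
```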
Note. In more advanced probability theory (based on measure theory), the conditional expectation E(g(X, Y)|X) is defined with a completely different technique than we are using now. Furthermore, it is usually defined in textbooks only when g(X, Y) is integrable, i.e., when E|g(X, Y)| < ∞. However, integrability is an unnecessary restriction here; it is sufficient to assume quasi-integrability, i.e., the existence of Eg(X, Y) as an extended real number. This extended theory is quite difficult to come across in textbooks, but it is present in the works of Ash [1], Chow and Teicher [2], and Jacod and Protter [3]. According to this extended theory, the integrability of the random variable g(X, Y) can be checked by calculating the expectation E|g(X, Y)| using the formula E E(|g(X, Y)| | X). If the result is finite, then g(X, Y) is integrable, meaning that Eg(X, Y) is a real number, which can be calculated with the formula E E(g(X, Y)|X). When checking whether the expectation is finite, we can apply the same technique that was used with Fubini’s theorem.
Literature
[1] Robert B. Ash. Probability and Measure Theory. Academic Press, 2nd edition, 2000.
[2] Y. S. Chow and H. Teicher. Probability Theory: Independence, Interchangeability, Martingales. Springer-Verlag, 2nd edition, 1988.
[3] Jean Jacod and Philip Protter. Probability Essentials. Springer, 2nd edition, 2002.
We can now determine the marginal distribution of X. Its probability mass
function can be expressed as a sum
fX(x) = Σ_y fX,Y(x, y).
3. Set Y = m(X) + Z.
Describe the joint distribution of the random variables X and Y .
Note. When a random number generator is called upon multiple times (as
in previous steps 1 and 2), the returned numbers can be regarded as values of
independent random variables, since actual random number generators function
in this way. In textbooks, this property is usually taken as self-evident and is
therefore rarely explained.
Solution: The joint distribution can be expressed with the formulas
Y | X ∼ N(m(X), σZ²),
X ∼ N(µX, σX²).
This holds since X ⫫ Z, and hence conditioning on X = x does not affect the distribution of Z.
As the distribution of X is continuous and the conditional distribution of Y, given X = x, is continuous for all x, the joint distribution is continuous as well. Its joint density function is
fX,Y(x, y) = fX(x) fY|X(y|x)
= (1/(σX√(2π))) exp(−(x − µX)²/(2σX²)) · (1/(σZ√(2π))) exp(−(y − m(x))²/(2σZ²)).
Note: The marginal distribution of X is N(µX, σX²). The regression function and the conditional variance of Y are
E[Y|X = x] = m(x), var(Y|X = x) = σZ²,
but the marginal distribution of Y is not a familiar distribution unless a more specific form of the regression function m is imposed.
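The hierarchical recipe can be simulated directly. The sketch below is illustrative (the linear regression function m(x) = 2 + 0.5x and all parameter values are arbitrary choices, not from the text); for a linear m the marginal moments of Y are easy to check: EY = 2 + 0.5µX and var Y = 0.25σX² + σZ².

```python
import random
import statistics

random.seed(7)
mu_x, sigma_x, sigma_z = 1.0, 2.0, 1.0   # illustrative parameter choices
m = lambda x: 2.0 + 0.5 * x              # hypothetical regression function

# Hierarchical simulation: X ~ N(mu_x, sigma_x^2), then Y = m(X) + Z
# with Z ~ N(0, sigma_z^2) independent of X, i.e. Y|X ~ N(m(X), sigma_z^2).
ys = []
for _ in range(200_000):
    x = random.gauss(mu_x, sigma_x)
    z = random.gauss(0.0, sigma_z)
    ys.append(m(x) + z)

mean_y = statistics.fmean(ys)
var_y = statistics.pvariance(ys)
print(mean_y)  # close to 2 + 0.5*mu_x = 2.5
print(var_y)   # close to 0.25*sigma_x^2 + sigma_z^2 = 2.0
```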
Example 8.6 (The previous example continued). Let us choose a linear form for the regression function m. Now, in terms of the mean square error, m(X) gives the best estimate of the value of the random variable Y. As m(X) is linear, it also has to be the best (in terms of the mean square error) linear estimate. Hence, in accordance with the formulas in Section 7.6,
m(x) = µY + ρ (σY/σX) (x − µX),
where µY and σY > 0 (assumption) are the expectation and the standard deviation of the random variable Y, and −1 < ρ < 1 (assumption) is the correlation coefficient of X and Y. Furthermore,
σZ² = σY²(1 − ρ²).
After laborious but straightforward calculations, the joint density function can be written as
fX,Y(x, y) = (1/(2π√(det C))) exp(−(1/2)(z − µ)ᵀ C⁻¹ (z − µ)), z = (x, y)ᵀ,
where
µ = (µX, µY)ᵀ, C = ( σX²  ρσXσY ; ρσXσY  σY² ).
Here, µ is the expectation vector of the random vector (X, Y) and C is its covariance matrix.
The derived distribution is a bivariate normal distribution with parameters µ and C, which is written as (X, Y) ∼ N₂(µ, C).
Later we will further investigate the multivariate normal distribution, also known
as the multinormal distribution, which is a generalisation of the normal distri-
bution to higher dimensions.
Chapter 9
Multivariate distribution
A multivariate vector may have more than two components. Its distribution is
described in a similar way as the distribution of a bivariate random vector. For
this reason, most of the ideas presented in this chapter are already familiar and
therefore, some of the conclusions will not be explained again. A statistician
needs multivariate distributions in order to define and analyse statistical models.
The new concepts in this chapter are the definition of a marginal distribution
for any subset of components, and calculating conditional distributions by con-
ditioning on any subset of components. Conditional independence is a concept
that becomes interesting only in three or higher dimensions.
Visualising multivariate distributions is difficult, as pictures do not work in higher dimensions. Instead, we have to rely on manipulating the formulas.
X(ω) = (X1(ω), . . . , Xn(ω)).
If all the components Xi of the random vector X are discrete random vari-
ables, then its pmf (or the joint probability mass function of random variables
Xi ) is
fX (x) = P (X = x) = P (X1 = x1 , . . . , Xn = xn ). (9.1)
If g(X) is a real-valued transformation of X, then its expectation can be computed by
Eg(X) = Σ_x g(x) fX(x),  (9.2)
if the sum converges absolutely.
A random vector X has a continuous distribution if it has a density function fX, meaning that
P(X ∈ B) = ∫_B fX = ∫_B fX(x) dx, for all B ⊂ Rⁿ,  (9.3)
if the integral converges absolutely. In this case, the integral can be calculated
as an iterated integral in any integration order. When the integration domain is
not indicated, we understand that we are computing the integral over the whole
space (here Rn ). (If the dimension is n ≥ 2, this notation should not cause any
confusion with the indefinite integral, as the concept of an indefinite integral is
only used in dimension one).
In case of a continuous (joint) distribution, the jdf is
FX(x) = ∫_{s1=−∞}^{x1} ··· ∫_{sn=−∞}^{xn} fX(s1, . . . , sn) ds1 ··· dsn.
where fX is the joint pmf of the marginal distribution of the random vector
X, and fY|X (y|x) is the conditional joint density function, given X=x, of the
rv Y. If g(X, Y) is a real valued transformation, then its expectation can be
calculated with
Eg(X, Y) = Σ_x ∫ g(x, y) fX,Y(x, y) dy,  (9.5)
EX = (EX1, . . . , EXn).
• The covariance matrix of a random vector X = (X1, . . . , Xn) is the n × n matrix
Cov X = E[(X − EX)(X − EX)ᵀ].
Notice that the covariance matrix of a random vector is symmetric and its main diagonal contains the variances of the component random variables.
The covariance Cov of a random vector is a one-argument function, while the covariance cov is a two-argument function. Notice that the covariance cov(X, Y) of two random vectors is a matrix.
It follows from the definition that
E(AZB + C) = A (EZ) B + C
is valid when Z is a random matrix and A, B and C are constant matrices with such dimensions that the expression is defined. In other words, constant matrices can be pulled out of the expectation if they are located at the extreme right or left of the matrix product.
If X consists of subvectors X = (Y, Z), and G(Y) and H(Z) are matrix expressions, then
Y ⫫ Z ⇒ G(Y) ⫫ H(Z).  (9.8)
(Since this implication is valid for matrix-valued functions, it is naturally valid for vector- and scalar-valued functions.) In other words, functions of independent random vectors are also independent.
Here, the independence of matrix expressions means that all components of G(Y) are independent of the components of H(Z). Based on independence and the definition of the matrix product, it follows that
Y ⫫ Z ⇒ E[G(Y)H(Z)] = E[G(Y)] E[H(Z)],  (9.9)
if the dimensions of the matrix product G(Y)H(Z) are compatible and if the expectations exist.
Theorem 9.1 (Properties of the covariance). (a) If X ⫫ Y and both X and Y have expectations, then
cov(X, Y) = 0_{m×n}.
(b) The covariances of X and Y can be calculated with the formula
cov(X, Y) = E(XYᵀ) − (EX)(EY)ᵀ.
(f) If A and B are constant matrices and v and w are constant vectors, then
cov(AX + v, BY + w) = A cov(X, Y) Bᵀ.
In particular,
Cov(AX + v) = A Cov(X) Aᵀ.  (9.11)
Proof. All parts follow from the definition of the covariance and the multiplica-
tion rules (9.7) and (9.9).
Example 9.1 (The covariance matrix of a sum). If the random vectors X and Y have the same length, then
Cov(X + Y) = Cov X + cov(X, Y) + cov(Y, X) + Cov Y.
Note: A matrix need not be positive semi-definite even if it is symmetric and all the elements of the matrix are positive. For example, the matrix
B = ( 1 2 ; 2 1 )
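The claim about B can be confirmed with a single quadratic form: for v = (1, −1)ᵀ one gets vᵀBv = 1 − 2 − 2 + 1 = −2 < 0, so B is not positive semi-definite. A small sketch (not from the text):

```python
# B is symmetric with positive entries, yet not positive semi-definite.
B = [[1, 2],
     [2, 1]]
v = [1, -1]

# Quadratic form v^T B v = sum over i, j of v_i * B_ij * v_j
quad = sum(v[i] * B[i][j] * v[j] for i in range(2) for j in range(2))
print(quad)  # -2, a negative value, so B is not positive semi-definite
```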
9.3 Conditional distributions, the multiplication
rule and the conditional expectation
Let us consider a random vector Z. Its first r coordinates constitute a random
vector X and the remaining s coordinates constitute another random vector Y,
Z = (X, Y) = (X1 , . . . , Xr , Y1 , . . . , Ys ).
Let us assume that the density of the joint distribution of random vector Z is
given by
fZ (z) = fX,Y (x, y), z = (x, y).
We will now allow the situation where some of the coordinates of Z are discrete random variables and the rest have a continuous joint distribution. Now, the density of the marginal distribution of X is given by fX. This density can be obtained by summing out the discrete components of Y from the density fX,Y and integrating out its continuous components. By abusing notation, this idea can be expressed as
fX(x) = ∫ fX,Y(x, y) dy,  (9.12)
where the integration symbol represents addition in the case of the discrete
components. In this case, we can say that the distribution of X is marginalised
from the joint distribution. Similarly, fY can be obtained using
fY(y) = ∫ fX,Y(x, y) dx.
The density of the conditional distribution of Y, given X = x, is obtained by
fY|X(y|x) = fX,Y(x, y) / fX(x),
when fX(x) > 0. If necessary, this expression can be defined as, for example, zero for those x with fX(x) = 0. Similarly, we define
fX|Y(x|y) = fX,Y(x, y) / fY(y).
The multiplication formula (chain rule) is valid, and
fX,Y(x, y) = fX(x) fY|X(y|x) = fY(y) fX|Y(x|y).
The multiplication formula can be iterated. Let us assume that a random vector is X = (U, V). Now
fX,Y(x, y) = fU,V,Y(u, v, y) = fU,V(u, v) fY|U,V(y|u, v),
and therefore
fU,V,Y(u, v, y) = fU(u) fV|U(v|u) fY|U,V(y|u, v).
This process can be continued until we reach scalar components. For example, the density of four random variables U, V, X, Y can be expressed as
fU,V,X,Y(u, v, x, y) = fU(u) fV|U(v|u) fX|U,V(x|u, v) fY|U,V,X(y|u, v, x).
The multiplication rule can also be applied using any other permutation of the random variables. The multiplication rule is also valid for conditional distributions. For example,
fU,V|Y(u, v|y) = fU,V,Y(u, v, y) / fY(y) = [fU,Y(u, y) / fY(y)] · [fU,V,Y(u, v, y) / fU,Y(u, y)]
= fU|Y(u|y) fV|U,Y(v|u, y).
Note. When connections between distributions are derived by using the multiplication formula and marginalisation, the subscripts are often left unwritten because they are evident from the arguments of the density function. This abuse of notation will not result in confusion as long as we remember that, for example, f(x), f(y), f(x|y) and f(y|x) are typically all different functions.
The conditional expectation (vector), given X = x, of a random vector g(X, Y) is defined as the expectation vector of the random vector g(x, Y) when the distribution of Y is the conditional distribution fY|X(·|x). By abusing notation,
E(g(X, Y)|X = x) = ∫ g(x, y) fY|X(y|x) dy.
The integral here may mean summation with respect to some of the components of the vector y. If we condition on a random vector X, then the random vector E(g(X, Y)|X) is defined as m(X), where
m(x) = E(g(X, Y)|X = x),
and the definition of the function m is extended by the zero vector for those arguments for which it is originally undefined.
The conditional covariance matrix, given X = x, of a random vector g(X, Y)
is similarly defined as the covariance matrix of a random vector g(x, Y) when
the conditional distribution fY|X (·|x) is used as the distribution for Y. This is
denoted by
Cov (g(X, Y)|X = x) .
If we condition on random vector X, then the conditional covariance matrix is
a random matrix and is denoted by
Cov (g(X, Y)|X) .
Random vectors X and Y are conditionally independent, given Z, if
fX,Y|Z(x, y|z) = fX|Z(x|z) fY|Z(y|z), for all x, y and z.  (9.13)
Conditionally independent random vectors need not be marginally independent, meaning independent in their joint marginal distribution. For example, even if (X ⫫ Y) | Z, then
fX,Y(x, y) = ∫ fX,Y|Z(x, y|z) fZ(z) dz = ∫ fX|Z(x|z) fY|Z(y|z) fZ(z) dz,
y = Y(ω_act),
y ↦ f(y|θ).
The function
θ ↦ f(y|θ)
is called the likelihood function. In this case, coefficients that are independent of θ may be omitted from the density function, and the resulting function will still be called the likelihood function.
In the so-called classical or frequentist statistics, the parameter vector θ is regarded as an unknown coefficient; we only know in which set (the parameter space) its values could lie. In this case, the notation f(y|θ) is purely formal and not interpreted as a conditional distribution, since the parameter vector does not have any probability distribution. A more common notation in this case would be f(y; θ). The best-known estimation principle is the so-called maximum likelihood, ML. According to ML, the best estimate of a parameter vector within the parameter space is the value θ̂_ML that maximises the likelihood function. It is called the maximum likelihood estimate, MLE.
According to Bayesian inference, the parameter vector is interpreted as the value of a random vector Θ. The function f(y|θ) is identified with the conditional density fY|Θ(y|θ). In Bayesian inference, apart from the likelihood function, the statistical model requires a marginal distribution for the vector Θ. This is called the prior distribution. The joint distribution of the parameter vector and the observation vector is then given by the density
f(θ, y) = fΘ(θ) fY|Θ(y|θ).
Both approaches to statistical inference require a concrete formula for the likelihood function. For this reason, a statistician has to be able to derive density functions for multivariate random vectors.
In this situation, at its simplest, the model can be constructed as follows. When the value of the parameter is 0 ≤ θ ≤ 1, the random variables Yi are independent and have the same distribution, the Bernoulli distribution Bernoulli(θ). (Outcome one now means that the component is flawed.) Now the likelihood function is
f(y1, . . . , yn|θ) = ∏_{i=1}^{n} θ^{yi} (1 − θ)^{1−yi}, y = (y1, . . . , yn) ∈ {0, 1}ⁿ
= θᵏ(1 − θ)^{n−k}, if y has exactly k components equal to 1.
In this situation the ML estimate is easy to find as the unique zero of the derivative of this function with respect to θ: it is k/n, where k = Σ yi is the number of flawed components.
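The claim θ̂_ML = k/n is easy to confirm numerically by maximising the log-likelihood k log θ + (n − k) log(1 − θ) on a grid. A small sketch (the values n = 10, k = 3 are arbitrary illustrations):

```python
import math

n, k = 10, 3  # illustrative sample: 3 flawed components out of 10

def log_lik(theta):
    # log of theta^k (1 - theta)^(n - k)
    return k * math.log(theta) + (n - k) * math.log(1 - theta)

# Grid search over the open interval (0, 1)
grid = [i / 1000 for i in range(1, 1000)]
theta_hat = max(grid, key=log_lik)

print(theta_hat)  # 0.3, i.e. k/n
```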
When applying Bayesian inference, a statistician would give the parameter Θ a prior distribution, e.g. the uniform distribution Be(1, 1) over the interval (0, 1), and then derive the posterior distribution using the same likelihood function. Now the random variables Yi are conditionally independent given Θ, and they have the Bernoulli(Θ) distribution. The joint distribution is
f(θ, y) = 1 · θᵏ(1 − θ)^{n−k}, 0 < θ < 1,
and the posterior distribution is Be(1 + k, 1 + n − k). This calculation was already carried out in Example 8.4.
Let us assume that during the production process we want to measure some
information xi , which is somehow related to the functionality of the components.
In this case, xi is called the explanatory variable or the covariate. One way to
account for covariates in a model is the following.
Let us choose parameters α and β and try to explain the success (flawed component) probability in the ith Bernoulli trial using the linear expression α + βxi. This expression may attain arbitrary values and therefore is not valid as a parameter of the Bernoulli distribution. Instead, by using the expression
p(α, β, x) = exp(α + βx) / (1 + exp(α + βx)),
we obtain a number between zero and one no matter what the values of α, β and x are. When the parameters are α, β, then according to the model
Yi ∼ Bernoulli(p(α, β, xi)), i = 1, . . . , n,
where the properties of the posterior distribution will have to be solved numer-
ically in practice.
Example 9.4. In some situations, an observed time series y1, . . . , yn can be modelled by random variables Y0, Y1, . . . , Yn, where Y0 = y0 is some known constant. For the other values, we use the autoregression model
Yt = h(Yt−1, β) + εt, t ≥ 1,
where h(y, β) is some known function and the random variables εt ∼ N(0, σ²) are independent. Here, β and σ are parameters of the model.
This model satisfies the Markov property.
Often this integral cannot be dealt with analytically, and for this reason special techniques (like the so-called EM algorithm) have been developed to handle missing data.
g = (g1, . . . , gn), h = (h1, . . . , hn).
Let us use the notation Di hj(y) to denote the partial derivative of the real-valued function hj with respect to the ith variable at the point y:
Di hj(y) = ∂hj(y)/∂yi.
Let us consider the bijective correspondence
y = g(x) ⇔ x = h(y) ,
also written as
y = y(x) ⇔ x = x(y).
The derivative matrix or the Jacobian matrix of the mapping h = (h1, . . . , hn) at a point y ∈ B is
( D1h1(y) D2h1(y) ··· Dnh1(y)
  D1h2(y) D2h2(y) ··· Dnh2(y)
  ...
  D1hn(y) D2hn(y) ··· Dnhn(y) ).
The Jacobian of the mapping h can easily be calculated to be
Jh(y) = ∂x/∂y = det(A⁻¹) = 1/det(A).
fX (x)|∂x| = fY (y)|∂y|
• P(X ∈ A0) = 0.
• When i ≥ 1, the mapping g, restricted to the set Ai (denoted by g|Ai), is a diffeomorphism Ai → Bi, where Bi is the image of the restriction. Let hi : Bi → Ai be the inverse of this restriction.
Now the random vector Y = g(X) has a continuous distribution with the density function
fY(y) = Σ_{i≥1} 1_{Bi}(y) fX(hi(y)) |J_{hi}(y)|.  (9.16)
This result is the multivariate generalisation of the one dimensional result (The-
orem 2.13).
for those t = (t1 , . . . , tn ) ∈ Rn , for which the expectation is defined. The
cumulant generating function of a random vector X (or the joint cumulant
generating function of the random variables X1 , . . . , Xn ) is
The notation Di^k means the kth partial derivative with respect to the ith variable, and if k1, . . . , kn are non-negative integers, then
D1^{k1} D2^{k2} ··· Dn^{kn} g(x) = ∂^{k1+k2+···+kn} g(x) / (∂x1^{k1} ∂x2^{k2} ··· ∂xn^{kn}).
Definition 9.4 (Gradient vector and Hessian matrix). The (vertical) vector
Note. The concept of a Hessian matrix was introduced by the Prussian mathematician Ludwig Otto Hesse in the 19th century. Many sources treat the gradient as a horizontal vector, but we will understand it as a vertical vector. The Hessian matrix is sometimes denoted by ∇²g(x) or g″(x).
Theorem 9.5. Suppose that the moment generating function M and the cumulant generating function K of a random vector X are defined in a neighbourhood of the origin. Then
∇M(0) = EX, ∇K(0) = EX, ∇²K(0) = Cov X.
Proof. The results are obtained by the differentiation formula for the composition of functions, as in the one-dimensional case.
Example 9.7. The moment generating function for the multinomial distribution X ∼ Mult(k, p) can be obtained by the multinomial formula,
MX(t) = E exp(tᵀX) = Σ (k choose x1, . . . , xn) (p1 e^{t1})^{x1} ··· (pn e^{tn})^{xn} = (p1 e^{t1} + ··· + pn e^{tn})ᵏ.
Differentiating the cumulant generating function gives EX = kp and Cov X = k(diag(p) − ppᵀ), where diag(p) is the diagonal matrix with the numbers (p1, . . . , pn) on the diagonal.
Theorem 9.6. Let X = (Y, Z) be a random vector whose moment generating function MX(t) exists in a neighbourhood of the origin. Let t = (u, v) be divided into parts with the same dimensions as Y and Z. Then
(a) the moment generating functions of the random vectors Y and Z are
MY(u) = MX(u, 0) and MZ(v) = MX(0, v);
where we used the fact that exp(uᵀY) ⫫ exp(vᵀZ).
For the converse implication, we assume that MX(u, v) = MY(u) MZ(v) for all small enough arguments. Let Y0 and Z0 be random vectors such that
Y0 =d Y, Z0 =d Z, Y0 ⫫ Z0.
Now
MY0,Z0(u, v) = MY0(u) MZ0(v) = MY(u) MZ(v) = MY,Z(u, v),
and hence the moment generating functions of the random vectors (Y0, Z0) and (Y, Z) agree in a neighbourhood of the origin. Since the moment generating function determines the distribution, we have (Y0, Z0) =d (Y, Z), and hence Y ⫫ Z.
Chapter 10
Multivariate normal
distribution
EU = 0n , Cov U = In . (10.2)
If a random vector Z has a k-dimensional standard normal distribution Z ∼ Nk(0, I), then the square of its length, ‖Z‖² = ZᵀZ, has a chi-square distribution with k degrees of freedom, since
‖Z‖² = ZᵀZ = Σ_{i=1}^{k} Zi²,
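This is another fact that a short simulation makes concrete (an illustrative sketch, with k = 3 chosen arbitrarily): the sample mean and variance of ‖Z‖² should approach the χ²_k values k and 2k.

```python
import random
import statistics

random.seed(3)
k = 3
n = 200_000

# ||Z||^2 = Z_1^2 + ... + Z_k^2 for Z ~ N_k(0, I)
sq_norms = [sum(random.gauss(0.0, 1.0) ** 2 for _ in range(k))
            for _ in range(n)]

mean_sq = statistics.fmean(sq_norms)
var_sq = statistics.pvariance(sq_norms)
print(mean_sq)  # close to k = 3 (chi-square mean)
print(var_sq)   # close to 2k = 6 (chi-square variance)
```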
Definition 10.2. A random vector X has a multinormal distribution, if it has
the same distribution as
AU + µ (10.7)
where A is an m × n constant matrix, µ ∈ Rm is a constant vector, and
U ∼ Nn (0, In ) for some n.
The one-dimensional normal distribution N(µ, σ²) is a special case of the multinormal distribution, since
X ∼ N(µ, σ²) ⇔ X =d µ + σU, where U ∼ N(0, 1).
According to the definition and formula (10.5),
EX = µ, Σ = Cov X = AAᵀ;
therefore formula (10.7) defines the normal distribution Nm(µ, AAᵀ). In the next theorem we will check that Definition 10.2 is well posed.
Theorem 10.1. Definition 10.2 implies that EX = µ and Cov X = AAT .
The distribution of the random vector X only depends on its expectation vector
and covariance matrix, and not on the representation dimension n or other
properties of representation matrix A.
Conversely, if µ ∈ Rm is a constant vector and Σ ∈ Rm×m is a positive
semi-definite constant matrix, then there exists a random vector X that has a
multinormal distribution with expectation vector µ and covariance matrix Σ.
Proof. The formulas for the expectation vector and covariance matrix have been
verified already.
The moment generating function of the random vector X can be calculated in terms of the moment generating function of the standard normal distribution (10.3):
MX(t) = E exp(tᵀX) = E exp(tᵀ(AU + µ)) = exp(tᵀµ) MU(Aᵀt)
= exp(tᵀµ + (1/2) tᵀAAᵀt) = exp(tᵀµ + (1/2) tᵀΣt).  (10.8)
As the moment generating function of X only depends on its expectation vector
and covariance, the distribution of X only depends on these parameters as well.
In order to prove the converse result, given a positive semi-definite matrix
Σ, we represent it as a product Σ = AAT , and then a random vector with the
correct distribution is constructed according to formula (10.7), as we already
checked.
The definition can be viewed as a simulation recipe. If we wish to simulate
a distribution Nm (µ, Σ), where Σ is a given symmetric positive semi-definite
matrix, then we first look for a decomposition of the covariance matrix Σ,

Σ = AA^T.

Here we can use e.g. the Cholesky decomposition. Next we (independently)
simulate m values (u_1, . . . , u_m) from the standard normal distribution N(0, 1),
and finally we calculate

x = Au + µ, where u = (u_1, . . . , u_m).
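As a concrete illustration, the recipe can be carried out in a few lines of NumPy. This is a sketch with made-up parameter values; the function name `simulate_multinormal` is our own:

```python
import numpy as np

def simulate_multinormal(mu, Sigma, n_samples, rng):
    """Draw n_samples vectors from N_m(mu, Sigma) via x = A u + mu."""
    A = np.linalg.cholesky(Sigma)            # decomposition Sigma = A A^T
    m = len(mu)
    U = rng.standard_normal((n_samples, m))  # i.i.d. N(0, 1) components u
    return U @ A.T + mu                      # each row is one draw x = A u + mu

rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])
X = simulate_multinormal(mu, Sigma, 200_000, rng)
print(X.mean(axis=0))           # close to mu
print(np.cov(X, rowvar=False))  # close to Sigma
```

Note that `np.linalg.cholesky` requires Σ to be positive definite; for a merely positive semi-definite Σ one can decompose via the eigendecomposition instead.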
10.3 The distribution of an affine transformation
Theorem 10.2 (Multinormality is preserved in affine transformations). Let
X ∼ Nm (µ, Σ) , let B ∈ Rp×m be a constant matrix, and let b ∈ Rp be a
constant vector. Then
BX + b ∼ Np (Bµ + b, BΣBT ).
Proof. Let us use the representation (10.7) for the distribution of X.
BX + b =_d B(AU + µ) + b = (BA)U + (Bµ + b),

where =_d denotes equality in distribution.
According to Definition 10.2, the random vector BX + b has a multinormal
distribution with
E(BX + b) = BE(X) + b = Bµ + b,
Cov(BX + b) = BΣBT .
This theorem also provides all marginal distributions. If we define the ith
unit vector ei to be such that its ith coordinate is one and other coordinates
are zero, then
X_i = e_i^T X.
If a random vector Y consists of the components (X_{i1}, . . . , X_{ik}) of the random
vector X, then Y is obtained by multiplying the random vector X from the left
by a constant matrix whose jth row is e_{ij}^T:

Y = (X_{i1}, . . . , X_{ik})^T = [e_{i1}, . . . , e_{ik}]^T X.
Now it follows that Y has a multinormal distribution, and therefore all the
marginal distributions of the multinormal distribution are multinormal as well.
The parameters for the random vector Y can be obtained through Theorem
10.2, or simply by picking the relevant parameters from the distribution of X,
since
EY = (EX_{i1}, . . . , EX_{ik})^T = (µ_{i1}, . . . , µ_{ik})^T,   (Cov Y)(p, q) = cov(X_{ip}, X_{iq}) = Σ(i_p, i_q).
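In code, picking the marginal parameters amounts to indexing the expectation vector and covariance matrix; a small NumPy sketch with illustrative values:

```python
import numpy as np

# Illustrative parameters of a 3-dimensional multinormal distribution.
mu = np.array([1.0, 2.0, 3.0])
Sigma = np.array([[4.0, 1.0, 0.5],
                  [1.0, 3.0, 0.2],
                  [0.5, 0.2, 2.0]])

# Marginal distribution of Y = (X_1, X_3): keep rows and columns 0 and 2.
idx = [0, 2]
mu_Y = mu[idx]                     # expectation vector of Y
Sigma_Y = Sigma[np.ix_(idx, idx)]  # covariance matrix of Y
print(mu_Y)
print(Sigma_Y)
```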
Proof. Let A be a regular m × m matrix such that Σ = AAT . It is possible to
find such an A, since Σ is positive definite. After this, X ∼ Nm (µ, Σ), when
X = AU + µ, where U ∼ N (0, Im ).
x = Au + µ ⇐⇒ u = A−1 (x − µ).
(In order to apply the transformation formula of the density function, we need
a regular coefficient matrix A, and vectors X and U need to have the same
dimension in the representation of the multinormal distribution). Applying the
transformation formula of the density function, we get
f_X(x) = f_U(u) |∂u/∂x| = f_U(A^{-1}(x − µ)) |det(A^{-1})|
       = (2π)^{-m/2} |det(A^{-1})| exp(−(1/2)(x − µ)^T (A^{-1})^T A^{-1} (x − µ)).
The claim is a consequence of this formula together with the identities
|det(A^{-1})| = det(Σ)^{-1/2} and (A^{-1})^T A^{-1} = (AA^T)^{-1} = Σ^{-1}.
Let λ_1, . . . , λ_m be the eigenvalues of Σ and v_1, . . . , v_m the corresponding
orthonormal eigenvectors, so that Σ v_i = λ_i v_i and

v_i^T v_j = 1, if i = j;   v_i^T v_j = 0, if i ≠ j.        (10.11)
The search for the eigenvalues and eigenvectors is included in all matrix computation software.
As Σ is positive semi-definite, we have

0 ≤ v_i^T Σ v_i = λ_i v_i^T v_i = λ_i.

Thus all the eigenvalues are non-negative. If Σ is positive definite, then all the
eigenvalues are strictly positive.
Let us construct a matrix V from the eigenvectors by placing them as the
columns of the matrix, and let us form a diagonal matrix Λ from the eigenvalues,
V = [v_1, . . . , v_m],   Λ = diag(λ_1, . . . , λ_m).

Now from the definitions of the eigenvalues and eigenvectors it follows that
ΣV = VΛ, and since V is orthogonal, Σ = VΛV^T. Let us change coordinates by

y = V^T (x − µ) ⟺ x = µ + Vy.
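The decomposition Σ = VΛV^T can be computed with any linear algebra library; a NumPy sketch with an illustrative Σ:

```python
import numpy as np

Sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])  # an illustrative positive definite covariance

# eigh is meant for symmetric matrices: it returns the eigenvalues
# (in ascending order) and the orthonormal eigenvectors as columns of V.
lam, V = np.linalg.eigh(Sigma)
Lambda = np.diag(lam)

print(lam)               # all eigenvalues positive (Sigma positive definite)
print(V.T @ V)           # V^T V = I: the eigenvectors are orthonormal
print(V @ Lambda @ V.T)  # recovers Sigma = V Lambda V^T
```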
Figure 10.1: A two-dimensional normal distribution with two contour lines and
the directions of the eigenvectors.
In the new coordinates the contour surfaces of the density are the ellipsoids

y_1²/(√λ_1)² + · · · + y_m²/(√λ_m)² = c,   c > 0.
Let us consider the distribution N(µ, Σ) with Σ = VΛV^T. Figure 10.1 shows
two contour lines of this distribution with θ = 30°, as well as the vectors Ve_1
and Ve_2 at the point µ. The dots represent a sample of vectors u_i from the
two-dimensional standard normal distribution N(0, I), transformed according to
the formula

x_i = Au_i + µ,   A = VΛ^{1/2},

so as to give a sample of the two-dimensional normal distribution N(µ, Σ).
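The same transformation can be sketched in NumPy, reusing the illustrative Σ from above; note that A = VΛ^{1/2} also satisfies AA^T = Σ, so it is an alternative to the Cholesky factor:

```python
import numpy as np

mu = np.array([0.0, 0.0])
Sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])

lam, V = np.linalg.eigh(Sigma)  # Sigma = V diag(lam) V^T
A = V @ np.diag(np.sqrt(lam))   # A = V Lambda^{1/2}, so A A^T = Sigma

rng = np.random.default_rng(1)
U = rng.standard_normal((100_000, 2))  # rows u_i from N(0, I)
X = U @ A.T + mu                       # rows x_i = A u_i + mu
print(np.cov(X, rowvar=False))         # close to Sigma
```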
In general, non-correlation does not guarantee independence. However, in the
special case that a random vector (X, Y) has a multinormal distribution, the
non-correlation and independence of its partial vectors X and Y turn out to
be equivalent, as we are about to prove.
Let us observe, for later use, the parameters of the multinormal distributions
of the partial vectors X = (X_1, . . . , X_k) and Y = (Y_1, . . . , Y_m), when the joint
vector Z = (X, Y) has a multinormal distribution N(µ, Σ). The expectation
vector µ consists of the parts

µ = EZ = [ µ_X ] = [ EX ]        (10.13)
         [ µ_Y ]   [ EY ]
and the covariance matrix Σ = Cov Z can also be partitioned as

Σ = [ cov(X_1, X_1) · · · cov(X_1, X_k)   cov(X_1, Y_1) · · · cov(X_1, Y_m) ]
    [       ⋮                 ⋮                 ⋮                 ⋮         ]
    [ cov(X_k, X_1) · · · cov(X_k, X_k)   cov(X_k, Y_1) · · · cov(X_k, Y_m) ]
    [ cov(Y_1, X_1) · · · cov(Y_1, X_k)   cov(Y_1, Y_1) · · · cov(Y_1, Y_m) ]
    [       ⋮                 ⋮                 ⋮                 ⋮         ]
    [ cov(Y_m, X_1) · · · cov(Y_m, X_k)   cov(Y_m, Y_1) · · · cov(Y_m, Y_m) ]

  = [ Σ_XX  Σ_XY ] = [ cov(X, X)  cov(X, Y) ]        (10.14)
    [ Σ_YX  Σ_YY ]   [ cov(Y, X)  cov(Y, Y) ]
According to Section 10.3, the marginal distributions are X ∼ N_k(µ_X, Σ_XX)
and Y ∼ N_m(µ_Y, Σ_YY). The moment generating function of the joint
distribution is

M_{X,Y}(u, v) = exp(u^T µ_X + v^T µ_Y + (1/2)(u^T Σ_XX u + u^T Σ_XY v + v^T Σ_YX u + v^T Σ_YY v)).

On the other hand, the product of the moment generating functions of the
marginal distributions is

M_X(u) M_Y(v) = exp(u^T µ_X + (1/2) u^T Σ_XX u + v^T µ_Y + (1/2) v^T Σ_YY v).
In the earlier expression, we can write v^T Σ_YX u = u^T Σ_XY v, so that
M_{X,Y}(u, v) = M_X(u) M_Y(v) exp(u^T Σ_XY v), and therefore the functions
M_{X,Y} and M_X M_Y are equal if and only if Σ_XY = cov(X, Y) = 0.
If X and Y separately have multinormal distributions, then the distribution of
the joint vector (X, Y) need not be multinormal in general, not even under an
extra assumption that cov(X, Y) = 0. However, observing the formulas of the
previous theorem, it is easy to prove the following result:
Theorem 10.5. If X ∼ N_k(µ_X, Σ_XX) and Y ∼ N_m(µ_Y, Σ_YY) are independent,
then the joint vector (X, Y) has a multinormal distribution with parameters

µ = [ µ_X ],   Σ = [ Σ_XX    0  ].
    [ µ_Y ]        [   0   Σ_YY ]
Proof. By independence, the moment generating function of the joint distribu-
tion is the product of the moment generating functions of the marginal distri-
butions. Therefore, for all u, v,

M_{X,Y}(u, v) = M_X(u) M_Y(v) = exp(u^T µ_X + (1/2) u^T Σ_XX u + v^T µ_Y + (1/2) v^T Σ_YY v),

which is the moment generating function of the multinormal distribution with
the stated parameters.
If, in the setting of Theorem 10.5, A and B are constant matrices such that the
vectors AX and BY are well-defined and of equal length, then the random vector

Z = AX + BY
has a multinormal distribution.
This is the case, since the combined vector (X, Y) has a multinormal dis-
tribution, and
AX + BY = [A  B] [ X ].
                 [ Y ]
The parameters of the distribution can be obtained by calculating the expecta-
tion and the covariance matrix
EZ = E(AX + BY) = AµX + BµY ,
Cov Z = Cov(AX) + Cov(BY) = AΣXX AT + BΣY Y BT .
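The parameter formulas can be evaluated directly; a NumPy sketch with made-up parameters for independent X and Y:

```python
import numpy as np

# Illustrative parameters of independent X ~ N_2(mu_X, S_XX), Y ~ N_2(mu_Y, S_YY).
mu_X = np.array([1.0, 0.0]);  S_XX = np.array([[1.0, 0.3], [0.3, 2.0]])
mu_Y = np.array([0.0, 2.0]);  S_YY = np.array([[1.5, 0.0], [0.0, 0.5]])
A = np.array([[1.0, 1.0]])   # 1x2 matrices, so AX and BY are scalars
B = np.array([[2.0, -1.0]])

mu_Z = A @ mu_X + B @ mu_Y               # E(AX + BY)
Cov_Z = A @ S_XX @ A.T + B @ S_YY @ B.T  # Cov(AX + BY), using independence
print(mu_Z)
print(Cov_Z)
```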
Theorem 10.6. Let the joint vector (X, Y) have a multinormal distribution with
the partitioned parameters (10.13) and (10.14), and let Σ_XX be positive definite.
Then the conditional distribution of Y given X = x is the multinormal distribution

Y | (X = x) ∼ N( µ_Y + Σ_YX Σ_XX^{-1} (x − µ_X),  Σ_YY − Σ_YX Σ_XX^{-1} Σ_XY ).

Proof. Even though the formulas are complicated, the idea of this proof is sim-
ple. We first form an auxiliary random vector V by using the formula

V = Y − BX,        (10.16)

where the constant matrix B is chosen so that V and X are uncorrelated:

0 = cov(V, X) = Σ_YX − BΣ_XX,  i.e.  B = Σ_YX Σ_XX^{-1}.

As (V, X) is obtained from (X, Y) by a linear transformation, it has a multinormal
distribution, and since V and X are uncorrelated, they are independent. Writing

Y = V + BX,

we see that the conditional distribution of Y given X = x is the distribution of

V + Bx,

since the condition X = x does not affect the distribution of the random vector
V, as V ⊥⊥ X. Therefore, the conditional distribution is multinormal. Now
the parameters of the conditional distribution can be obtained by calculating
the parameters of the distribution of random vector V + Bx. The expectation
vector of the conditional distribution is
E(V + Bx) = µ_Y + B(x − µ_X) = µ_Y + Σ_YX Σ_XX^{-1} (x − µ_X),

and the covariance matrix is

Cov(V + Bx) = Cov V = Σ_YY − Σ_YX Σ_XX^{-1} Σ_XY.
The random variables X and Y are independent if and only if ρ = 0. The
conditional distributions are obtained from Theorem 10.6:

Y | (X = x) ∼ N( µ_Y + ρ (σ_Y/σ_X)(x − µ_X),  (1 − ρ²) σ_Y² ),
X | (Y = y) ∼ N( µ_X + ρ (σ_X/σ_Y)(y − µ_Y),  (1 − ρ²) σ_X² ).
Here we can, for example, obtain the parameters of the distribution Y | (X = x)
by calculating

µ_Y + Σ_YX Σ_XX^{-1} (x − µ_X) = µ_Y + (ρ σ_X σ_Y / σ_X²)(x − µ_X) = µ_Y + ρ (σ_Y/σ_X)(x − µ_X),

Σ_YY − Σ_YX Σ_XX^{-1} Σ_XY = σ_Y² − (ρ σ_X σ_Y)² / σ_X² = (1 − ρ²) σ_Y².
It is straightforward but laborious to check that these conditional distributions
can also be derived using the formulas

f_{Y|X}(y|x) = f_{X,Y}(x, y) / f_X(x),   f_{X|Y}(x|y) = f_{X,Y}(x, y) / f_Y(y).
If the random variables X_i are independent and have the same expectation µ
and the same variance σ², then we can easily show that

EX̄ = µ,   ES² = σ².

In other words, the sample mean and the sample variance are unbiased estima-
tors of the population parameters µ and σ².
Let us assume from now on that random variables X1 , . . . , Xn are inde-
pendent with normal distribution N (µ, σ 2 ) . The vector (X1 , . . . , Xn ) has a
multinormal distribution Nn (m, Σ), where
m = (µ, . . . , µ)^T,   Σ = σ² I_n.
It is easy to derive the marginal distribution of the sample mean, namely
X̄ ∼ N(µ, σ²/n), as it is a linear transformation of a multinormally distributed
random vector, and we can calculate the mean and variance of the random
variable X̄. The next theorem shows that, after suitable scaling, the sample
variance S² has a chi-squared distribution with n − 1 degrees of freedom.
Theorem 10.7. Let X_1, . . . , X_n ∼ N(µ, σ²) be independent, and let X̄ and S²
be defined by equation (10.17). Then

(a) X̄ ∼ N(µ, σ²/n),
(b) (n − 1)S²/σ² ∼ χ²_{n−1}, so ES² = σ²,
(c) X̄ ⊥⊥ S².
Proof. Part (a) was proved already, so we will focus on (b) and (c).
The assumption about the random vector X = (X1 , . . . , Xn ) can be ex-
pressed as
X ∼ Nn (µ1, σ 2 In ) ,
where 1 = (1, . . . , 1) ∈ Rn is a constant n-component vector with all its compo-
nents equal to one.
Let us define another n-component vector u by
u = (1/√n) 1.
Now u is a unit vector parallel to the vector 1, in other words u^T u = 1. Notice
that the sample mean can be represented as

X̄ = (1/n) 1^T X = (1/√n) u^T X,
and then

X̄1 = ((1/√n) u^T X) √n u = uu^T X.
Let us define the residual vector R by
R = (X_1 − X̄, . . . , X_n − X̄)^T = X − X̄1 = X − uu^T X = (I_n − uu^T) X.
(The vector X̄1 is called the fit vector.) The sample variance S² can be written
as

(n − 1)S² = R^T R = ||R||².        (10.18)
The combined vector (X̄, R) can be obtained by a linear transformation from
the vector X, since

[ X̄ ]   [ (1/√n) u^T ]
[ R  ] = [ I − uu^T   ] X,
and for this reason it has a multinormal joint distribution. Moreover, the com-
ponents X̄ and R are uncorrelated, since

cov(X̄, R) = (1/√n) cov(u^T X, (I − uu^T)X) = (1/√n) u^T (σ² I)(I − uu^T) = 0.
Let us recall that the t-distribution with ν degrees of freedom is defined as
the distribution of the random variable

Z / √(Y/ν),

where Z ∼ N(0, 1) and Y ∼ χ²_ν are independent, Z ⊥⊥ Y.
The random variable X̄ − µ has a normal distribution with expectation zero
and variance σ²/n, so the random variable

(X̄ − µ) / (σ/√n)
has a standard normal distribution. If we replace the unknown standard de-
viation parameter σ with the sample estimate, we obtain the so called t-test
value
T = (X̄ − µ) / (S/√n).
This t-test value can also be expressed as
T = [(X̄ − µ)/(σ/√n)] / √(S²/σ²).
Based on the previous theorem, the numerator and the denominator are in-
dependent, the numerator's distribution is N(0, 1), and the denominator is the
square root of a χ² distributed random variable divided by its number of degrees
of freedom, n − 1. As a consequence, T has a t-distribution with n − 1 degrees
of freedom.
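A quick numerical sanity check of the two expressions for T, with made-up sample parameters:

```python
import numpy as np

rng = np.random.default_rng(3)
mu, sigma, n = 5.0, 2.0, 12
X = rng.normal(mu, sigma, size=n)

xbar = X.mean()
S = X.std(ddof=1)                   # sample standard deviation
T = (xbar - mu) / (S / np.sqrt(n))  # the t-test value

# The same value through the N(0,1)-numerator / chi-squared-denominator form:
Z = (xbar - mu) / (sigma / np.sqrt(n))
T_alt = Z / np.sqrt(S**2 / sigma**2)
print(T, T_alt)  # equal up to rounding
```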
The previous considerations can easily be extended to the situation of a linear
model

X ∼ N_n(m, σ² I),

where the expectation vector m is known to belong to a subspace L of R^n of
dimension p ≤ n. Let us choose orthonormal basis vectors u_1, . . . , u_p for this
subspace, and let us use these vectors as the columns of a matrix U, so that
U = [u_1, . . . , u_p]. Now the matrix

H = UU^T

is the projection matrix onto the subspace L, and so H is a symmetric and
idempotent matrix (the latter means that HH = H) and Hv = v for all v ∈ L.
By adapting the proof of the previous theorem, one can check that
HX ⊥⊥ (I − H)X,
and that
(1/σ²) ||X − HX||² ∼ χ²_{n−p}.
Hence the random variable
(1/(n − p)) ||X − HX||²
is an unbiased estimator of variance σ 2 , and its distribution is a scaled chi
squared. The fit vector HX and the residual vector X − HX = (I − H)X are
independent random vectors.
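The projection properties of H can be checked numerically; a NumPy sketch in which the orthonormal basis U is obtained from a QR decomposition of an arbitrary full-rank matrix (illustrative dimensions):

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 6, 2
M = rng.standard_normal((n, p))  # an arbitrary full-rank n x p matrix
U, _ = np.linalg.qr(M)           # columns of U: orthonormal basis of L

H = U @ U.T                      # projection matrix onto L
v = U @ np.array([1.0, -2.0])    # some vector v in the subspace L

print(np.allclose(H, H.T))       # H is symmetric
print(np.allclose(H @ H, H))     # H is idempotent: HH = H
print(np.allclose(H @ v, v))     # Hv = v for v in L
```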