
Advanced Econometrics I

Ingo Steinke, Anne Leucht, Enno Mammen

University of Mannheim

Fall 2014



Organisation

Important dates
Start: 2014-10-07
End: 2014-12-05
Lectures:
Tuesday 10:15 - 11:45 in L 7, 3-5 - 001
Thursday 10:15 - 11:45 in L 7, 3-5 - 001
Exercises:
Thursday 13:45 - 15:15 in L 9, 1-2 003
Thursday 15:30 - 17:00 in L 9, 1-2 003
Teaching assistant: Maria Marchenko
Slides will be provided via Ilias, usually on Friday for the next week.



Exercise sheets
will be provided via Ilias and usually published on Tuesday (or
Wednesday).
Hand in written solutions in the lecture on Tuesday (you may work in
pairs).
Discussion of the solutions on Thursday.
There is 1 point per exercise (graded in steps 0.25, 0.5, 0.75, 1).
You need 75% of the points of the exercise sheets to get (at most)
20% of the exam points.
There will be starred exercises which can be used to make up for
missing points of one exercise sheet.



Exam
written exam, 180 min
Date: 2014-12-17

Contact
Office: L7, 3 - 5, room 142
Phone: 1940
E-Mail: isteinke@rumms.uni-mannheim.de
Office hour: on appointment



Contents

Overview:
1 Probability theory
2 Asymptotic theory
3 Conditional expectations
4 Linear regression



Literature
Ash, R. B. and Doléans-Dade, C. (1999). Probability & Measure
Theory. Academic Press.
Billingsley, P. (1994). Probability and Measure. Wiley.
Hayashi, F. (2009). Econometrics. Princeton University Press.
Jacod, J. and Protter, P. (2000). Probability Essentials. Springer.
Van der Vaart, A. W. and Wellner, J. A. (2000). Weak Convergence
and Empirical Processes. With Applications to Statistics. New York:
Springer.
Wooldridge, J. M. (2004). Introductory Econometrics: A Modern
Approach. Thomson/Southwestern.



Introduction
Motivation

Application in Statistics ...


Study relationships between variables, e.g.
consumption and income
−→ How does raising income affect consumption behaviour?
evaluation of effectiveness of job market training (treatment effects)
...
Econometrics (Wooldridge (2004)):
development of statistical methods for estimating economic
relationships
testing economic theories
evaluation of government and business policies



Classical model in econometrics: linear regression

Figure: http://en.wikipedia.org/wiki/File:Linear_regression.svg

Y = β0 + β1 X + u,

e.g. Y consumption, X wage, u error term. Typically the data are not
generated by experiments; the error term “collects all other effects on
consumption besides wage”.
→ variables are somehow “random”
→ How do we formalize randomness?
Aims of this course:
(1) Provide basic probabilistic framework and statistical tools for
econometric theory.
(2) Application of these tools to the classical multiple linear regression
model.
→ Application of these results to economic problems in Advanced
Econometrics II/III and follow-up elective courses.


Chapter 1: Elementary probability theory


Overview

1 Probability measures
2 Probability measures on R
3 Random variables
4 Expectation
5 Variance and covariance


1.1 Probability measures

Aim: Formal description of “probability measures”


Setup:
The set Ω ≠ ∅ of the possible outcomes of a random experiment is
called sample space, e.g. Ω = N = {1, 2, . . . }.
A ⊆ Ω is called an event, e.g. A = {2, 4, 6, 8, . . . }
outcome: ω ∈ Ω
−→ Want to assign a “probability” P(A) to event A
Consider first the case that Ω is a countable set, i.e.

Ω = {ω1 , ω2 , ω3 , · · · }

(e.g. Ω = N, Ω = Z).


∅ = { } denotes the empty set,


P(Ω) = {A : A ⊆ Ω} the power set.

Definition 1.1
A probability measure P on a countable set Ω is a set function that
maps subsets of Ω to [0, 1], i.e. P : P(Ω) → [0, 1], and has the following
properties:
(i) P(Ω) = 1.
(ii) It holds

P(⋃_{i=1}^∞ A_i) = Σ_{i=1}^∞ P(A_i)

for any A_i ⊆ Ω, i ∈ N, that are pairwise disjoint, i.e. A_i ∩ A_j = ∅ for
i ≠ j.


Recap: Index notation


Let I ≠ ∅ be some set and A_i ⊆ Ω for all i ∈ I. Then

x ∈ ⋃_{i∈I} A_i ⟺ ∃ j ∈ I : x ∈ A_j.

If I = {1, . . . , n} and J = N, then

⋃_{i∈I} A_i = A_1 ∪ · · · ∪ A_n = ⋃_{i=1}^n A_i,
⋃_{j∈J} A_j = A_1 ∪ · · · ∪ A_n ∪ · · · = ⋃_{j=1}^∞ A_j.

Especially, for I = A and A_x = {x},

A = ⋃_{x∈A} {x}.


Recap: Series

A set A ⊆ Ω is countable iff there is a set N ⊆ N and a bijection
(one-to-one map) m : N → A. Then A can be written A = {a_1, a_2, . . . , a_n}
or A = {a_1, a_2, . . . , a_n, . . . }.
A series is an infinite sum, defined by

s = Σ_{k=1}^∞ a_k = lim_{n→∞} Σ_{k=1}^n a_k

if the limit exists. The series s is absolutely convergent if

Σ_{k=1}^∞ |a_k| < ∞.


The series s is unconditionally well-defined if for any
{k_1, k_2, k_3, . . . } = N we have s = Σ_{i=1}^∞ a_{k_i}, i.e. a rearrangement of its
members does not change the (infinite) sum, which might be ∞ or −∞.
Note:
If a series is absolutely convergent, then it is unconditionally
convergent.
A series is unconditionally well-defined iff the (infinite) sum of all its
positive members or the sum of all its negative members is finite.
Let I be countable and a_i ∈ R for any i ∈ I. If I = {i_1, i_2, . . . }, then

Σ_{i∈I} a_i := Σ_{j=1}^∞ a_{i_j},

if the right-hand series is unconditionally convergent.

Lemma 1.2
Let Ω = {ω_i}_{i∈I} (with countable I) be a countable sample space and P
a probability measure on Ω. Then for every A ⊆ Ω it holds that

P(A) = Σ_{ω∈A} P({ω}).

Proof: Exercise.
Let ω ∈ Ω. An event {ω} that only contains one element is also called an
elementary event.
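
To make Lemma 1.2 concrete, here is a minimal Python sketch for a fair die (an illustration added here, not the exercise solution):

```python
# Fair die: Omega = {1,...,6}; every elementary event has probability 1/6.
p = {omega: 1 / 6 for omega in range(1, 7)}

# Lemma 1.2: P(A) equals the sum of the elementary-event probabilities in A.
A = {2, 4, 6}
print(sum(p[omega] for omega in A))  # 0.5
```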


Arbitrary sample spaces

It is often impossible to define P “appropriately” for all subsets A ⊆ Ω


such that Definition 1.1 holds true; see e.g. Billingsley (1994).
Definition 1.3 A family A of subsets of Ω with
(i) ∅ ∈ A,
(ii) if A ∈ A, then A^C = Ω\A ∈ A,
(iii) if A_1, A_2, · · · ∈ A, then ⋃_{i=1}^∞ A_i ∈ A,
is called a σ-field or σ-algebra.
For a σ-field A on Ω it holds that A ⊆ P(Ω).


A σ-field A is called the smallest σ-field containing B ⊆ P(Ω) if for any
σ-field C on Ω it holds: if B ⊆ C, then A ⊆ C.
Notation: A = σ(B).

Example 1.4 Let Ω ≠ ∅ be a set.

{Ω, ∅} is the smallest σ-field on Ω and is called the trivial σ-field.
The power set P(Ω) is the largest σ-field on Ω.
If ∅ ≠ B ⊂ Ω, the family {Ω, ∅, B, B^C} is the smallest σ-field on Ω
that contains B.
Suppose that A is a σ-field on a set Ω. Then the tuple (Ω, A) is called a
measurable space.


Definition 1.5
A set function P : A → [0, ∞) is a measure on (Ω, A) if for A_1, A_2, . . .
pairwise disjoint ∈ A it holds that

P(⋃_{i=1}^∞ A_i) = Σ_{i=1}^∞ P(A_i)    (σ-additivity)

If, in addition, P(Ω) = 1, then it is called a probability measure.

The triple (Ω, A, P) is then called a probability space.
Example 1.6 Let (Ω, A) be a measurable space and ω_0 ∈ Ω.
Then ν(A) = |A|, A ∈ A, is the so-called counting measure.
The Dirac measure δ_{ω_0} is defined by

δ_{ω_0}(A) := 1_A(ω_0), A ∈ A.


Theorem 1.7 (Properties of probability measures)

Suppose that (Ω, A, P) is a probability space. Let A, B, A_1, A_2, · · · ∈ A.
Then:
(i) P(∅) = 0.
(ii) Finite additivity: A_1, . . . , A_n pairwise disjoint imply

P(⋃_{i=1}^n A_i) = Σ_{i=1}^n P(A_i).

(iii) P(A^C) = 1 − P(A).
(iv) P(A) ≤ 1 for all A ∈ A.
(v) Subtractivity: A ⊆ B implies P(B\A) = P(B) − P(A).


(vi) Monotonicity: A ⊆ B implies P(A) ≤ P(B).
(vii) P(A ∪ B) = P(A) + P(B) − P(A ∩ B).
(viii) Continuity from below: A_n ⊆ A_{n+1} for all n ∈ N implies
P(A_n) −→ P(⋃_{k=1}^∞ A_k) as n → ∞.
(ix) Continuity from above: A_{n+1} ⊆ A_n for all n ∈ N implies
P(A_n) −→ P(⋂_{k=1}^∞ A_k) as n → ∞.
(x) Sub-σ-additivity: P(⋃_{n=1}^∞ A_n) ≤ Σ_{n=1}^∞ P(A_n).

Proof: Exercise.
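
To indicate the flavour of these exercises, here is the standard one-line argument for (iii), written out in LaTeX (a sketch, not the official solution):

```latex
% A and A^C are disjoint with A \cup A^C = \Omega, so P(\Omega) = 1 and
% finite additivity (ii) give
1 = P(\Omega) = P(A \cup A^{C}) = P(A) + P(A^{C})
  \;\Longrightarrow\; P(A^{C}) = 1 - P(A).
```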


1.2 Probability measures on R


Definition 1.8 The smallest σ-field B that contains all open intervals
(a, b) (−∞ ≤ a ≤ b ≤ ∞) is called the Borel σ-field.
A set A ∈ B is called a Borel set.

Theorem 1.9 Put

A1 = {(a, b] : −∞ ≤ a < b < +∞},


A2 = {[a, b) : −∞ < a < b ≤ +∞},
A3 = {[a, b] : −∞ < a ≤ b < +∞},
A4 = {(−∞, b] : −∞ < b < +∞}.

Then it follows for j = 1, . . . , 4: B = σ(Aj ).

Proof: Exercise.

Definition 1.10
A class A* of subsets of Ω is a field if
(i) ∅ ∈ A*,
(ii) if A ∈ A*, then A^C ∈ A*,
(iii) if A_1, A_2 ∈ A*, then A_1 ∪ A_2 ∈ A*.

Suppose that A* is a field and define A as the smallest σ-field with
A* ⊆ A (notation: A = σ(A*)). Then a set function P* : A* → [0, ∞) s.t.
for A_1, A_2, . . . pairwise disjoint ∈ A* with ⋃_{i=1}^∞ A_i ∈ A* it holds that

P*(⋃_{i=1}^∞ A_i) = Σ_{i=1}^∞ P*(A_i)

is called a pre-measure.
If, in addition, P*(Ω) = 1, it is called a probability pre-measure.


Theorem 1.11 (Carathéodory) Let A∗ be a field, A = σ(A∗ ) and P ∗


a probability pre-measure on A∗ . Then there exists a unique
probability measure P on A with

P(A) = P ∗ (A) for A ∈ A∗ .

For a proof see Ash and Doléans-Dade (1999), Theorem 1.3.10.


Definition 1.12 For a probability measure P on (R, B) the function
F : R → [0, 1] given by

F (b) = P((−∞, b]) ∀b ∈ R

is called a (cumulative) distribution function (CDF).


Proposition 1.13 (Properties of the CDF) Suppose that F is the


distribution function of a probability measure P on (R, B). Then
(i) P((a, b]) = F (b) − F (a) for a < b,
(ii) F is non-decreasing (i.e. F (a) ≤ F (b) for a ≤ b),
(iii) F is continuous from the right (i.e. F (bn ) → F (b) for bn → b,
bn ≥ b (or for bn ↓ b)),
(iv) limx→−∞ F (x) = 0 and limx→+∞ F (x) = 1.
(v) F (b−) := limn→∞ F (bn ) = P((−∞, b)) for any bn ↑ b.
(vi) P({b}) = F (b) − F (b−) for all b ∈ R.

Define P by F on A1 : P((a, b]) = F (b) − F (a), a < b.


Can this function be uniquely extended to a set function on B?


Theorem 1.14 Consider a function F : R → R satisfying (ii) to (iv) of


Proposition 1.13. Then F is a distribution function (i.e. then there
exists a unique probability measure P on (R, B) with
F (b) = P((−∞, b]) for all b ∈ R).

Some ideas of the proof: First, define a set function P* : A_1 → [0, 1] as

P*((a, b]) = F(b) − F(a).

Extend this function as follows: P* : A* → [0, 1], where A* consists of the
empty set and all finite unions of sets of A_1 and their complements, and
for disjoint intervals

P*(⋃_{i=1}^n (a_i, b_i]) = Σ_{i=1}^n P*((a_i, b_i]) with notation (c, ∞] = (c, ∞).


Discrete probability measures

The function f : R → R is called a probability mass function (pmf) if
1 f(x) ≥ 0 for all x ∈ R and
2 it holds Σ_{x∈S_f} f(x) = 1 with S_f = {x ∈ R : f(x) > 0}.
Note that S_f must be countable if 2. holds. S_f is called the support of f.
Define

P(A) = Σ_{x∈S_f∩A} f(x).    (1)

Lemma 1.15 P, defined by (1), is a probability measure.

Then the CDF is given by F(x) = Σ_{a∈S_f∩(−∞,x]} f(a).


Definition 1.16 A probability measure P on the measurable space (R, B)
is discrete if there is an at most countable set A ⊂ R such that P(A) = 1.
By

S_P = {a ∈ R : P({a}) > 0}

we denote the support of a discrete probability measure P.

Lemma 1.17 P is discrete iff f : R → R, f(x) = P({x}), is a pmf.

Remark 1.18
1 SP ⊆ A is countable and P(SP ) = 1.
2 If P is a discrete probability measure with support SP then F has
jumps at a ∈ SP with jump heights P({a}).


Example 1.19
1 Binomial distribution

P({i}) = (n choose i) π^i (1 − π)^{n−i} for i = 0, 1, . . . , n,

P({i}) = 0 elsewhere. Parameters: 0 ≤ π ≤ 1, n ≥ 1.

2 Geometric distribution

P({i}) = (1 − π)^{i−1} π for i = 1, 2, 3, . . .

P({i}) = 0 elsewhere. Parameter: 0 ≤ π ≤ 1.

3 Poisson distribution

P({i}) = (λ^i / i!) e^{−λ}, i = 0, 1, 2, . . .

P({i}) = 0 elsewhere. Parameter: λ > 0.
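
A quick numerical sanity check that these three pmfs sum to 1 (a Python sketch; the parameter values and truncation points of the infinite sums are arbitrary):

```python
from math import comb, exp, factorial

n, pi_, lam = 10, 0.3, 2.5  # arbitrary parameter choices

# Binomial: finite sum over i = 0,...,n.
binom = sum(comb(n, i) * pi_**i * (1 - pi_)**(n - i) for i in range(n + 1))
# Geometric and Poisson: infinite sums, truncated where the terms are tiny.
geom = sum((1 - pi_)**(i - 1) * pi_ for i in range(1, 200))
poiss = sum(lam**i / factorial(i) * exp(-lam) for i in range(200))

print(binom, geom, poiss)  # each approximately 1.0
```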


Absolutely continuous probability measures

A (Riemann) integrable function f : R → R is called a probability density
function (pdf) if
1 f(x) ≥ 0 for all x and
2 ∫_{−∞}^∞ f(x) dx = 1.

In the following we assume that f is piecewise continuous, i.e. there is an
at most countable index set I and pairwise disjoint open intervals A_i ⊆ R
with ⋃_{i∈I} A_i = R such that
f(x) is continuous on A_i for all i ∈ I.

Lemma 1.20 Let f : R → R be a piecewise continuous pdf. Then
there exists a unique probability measure on (R, B) such that

P((a, b]) = ∫_a^b f(x) dx for all a < b.


The corresponding distribution P is called absolutely continuous.
Then the CDF is given by F(x) = ∫_{−∞}^x f(t) dt.
Note that F is continuous and

F′(x) = f(x),

if f is continuous at x.
A density is not unique, but almost unique.

Lemma 1.21 Let f, g be piecewise continuous pdfs such that

∫_a^b f(x) dx = ∫_a^b g(x) dx for all a < b.

Then {x : f(x) ≠ g(x)} is countable.


Example 1.22
1 Normal distribution

f(x) = (1/√(2πσ²)) exp(−(x − µ)²/(2σ²)).

Parameters: µ ∈ R, σ > 0
2 Uniform distribution

f(x) = (1/(b − a)) 1_{[a,b]}(x)

Parameters: −∞ < a < b < ∞
3 Exponential distribution

f(x) = λ e^{−λx} 1_{[0,∞)}(x)

Parameter: λ > 0
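
A numerical sanity check for the exponential case (a Python sketch; the truncation at x = 50 is arbitrary but makes the neglected tail negligible):

```python
import numpy as np

lam = 1.5                              # arbitrary rate parameter
x = np.linspace(0.0, 50.0, 200_001)    # tail beyond 50 is negligible
f = lam * np.exp(-lam * x)             # pdf of Exp(lam)

dx = x[1] - x[0]
integral = np.sum((f[:-1] + f[1:]) * dx / 2)  # trapezoidal rule
print(integral)                        # approximately 1.0
```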


Extension to R^k

The Borel σ-field B^k is the σ-field generated by the open intervals
(a_1, b_1) × · · · × (a_k, b_k). As in the real-valued case, probability measures on
(R^k, B^k) are uniquely defined via the multivariate distribution function:

F(b_1, . . . , b_k) = P({(x_1, . . . , x_k) : x_1 ≤ b_1, . . . , x_k ≤ b_k}).

F is called absolutely continuous if

F(b_1, . . . , b_k) = ∫_{−∞}^{b_1} · · · ∫_{−∞}^{b_k} f(x_1, . . . , x_k) dx_k · · · dx_1

for all b_1, . . . , b_k ∈ R. Here, if f is continuous at (x_1, . . . , x_k), then

∂^k F / (∂x_1 · · · ∂x_k) (x_1, . . . , x_k) = f(x_1, . . . , x_k).


The function f : R^k → R is called a (multivariate) probability mass
function if
1 f(x) ≥ 0 for all x ∈ R^k and
2 it holds Σ_{x∈S_f} f(x) = 1 with S_f = {x ∈ R^k : f(x) > 0}.

Define the discrete probability measure on (R^k, B^k) by

P(A) = Σ_{x∈S_f∩A} f(x),

cf. (1).


A (Riemann) integrable function f : R^k → R is called a (multivariate)
probability density function if
1 f(x) ≥ 0 for all x and
2 it holds

∫_{R^k} f(x) dx := ∫_{−∞}^∞ · · · ∫_{−∞}^∞ f(x_1, . . . , x_k) dx_k · · · dx_1 = 1.

Then by

P((a_1, b_1] × · · · × (a_k, b_k]) = ∫_{a_1}^{b_1} · · · ∫_{a_k}^{b_k} f(x_1, . . . , x_k) dx_k · · · dx_1

a probability measure can be introduced on (R^k, B^k) which is called
absolutely continuous.


Notation: For x = (x_1, . . . , x_k)′, y = (y_1, . . . , y_k)′ ∈ R^k we write

x = y iff x_i = y_i for all i = 1, . . . , k,
x ≤ y iff x_i ≤ y_i for all i = 1, . . . , k,
x < y iff x_i < y_i for all i = 1, . . . , k,
(x, y] = {z ∈ R^k : x < z and z ≤ y} ⊂ R^k,
(x, y) = {z ∈ R^k : x < z and z < y} ⊂ R^k etc.
Let

A_0^{(k)} = {(a, b) : a, b ∈ R^k and a < b}.

Then

B^k := σ(A_0^{(k)})

is the Borel σ-field on R^k and its members are called Borel sets.


1.3 Random variables

Definition 1.23 Let (Ω_1, A_1) and (Ω_2, A_2) be measurable spaces.
A function g : Ω_1 → Ω_2 is called measurable (or A_1–A_2-measurable) if

g^{−1}(B) = {ω ∈ Ω_1 : g(ω) ∈ B} ∈ A_1 ∀ B ∈ A_2.    (2)

Notation: g : (Ω_1, A_1) → (Ω_2, A_2).

Remark 1.24
1 1_A : Ω → R is A–B-measurable if A ∈ A.
2 If g : R^m → R^k is continuous, then g is B^m–B^k-measurable.
3 If f : (Ω_1, A_1) → (Ω_2, A_2) and g : (Ω_2, A_2) → (Ω_3, A_3), then
h : Ω_1 → Ω_3, defined by h(ω_1) = g(f(ω_1)), is A_1–A_3-measurable.


Definition 1.25 An R^k-valued random variable (r.v.) is a function
X : Ω → R^k, where (Ω, A) is a measurable space and X fulfills:

X^{−1}(B) = {ω ∈ Ω : X(ω) ∈ B} ∈ A ∀ B ∈ B^k,    (3)

i.e. X is A–B^k-measurable.

Notation: for B ∈ B^k,

P(X ∈ B) := P(X^{−1}(B)) = P({ω ∈ Ω : X(ω) ∈ B}),
P(X = x) := P(X^{−1}({x})) = P({ω ∈ Ω : X(ω) = x}).

(3) guarantees that X^{−1}(B) ∈ A, i.e. P(X ∈ B) is well-defined.


Definition 1.26 Suppose that X is an R^k-valued random variable on a
probability space (Ω, A, P). Then

P^X(B) := P(X ∈ B) = P(X^{−1}(B)), B ∈ B^k,    (4)

is called the distribution of X.

Lemma 1.27 P^X is a probability measure on (R^k, B^k).

Notation: Let X_1, . . . , X_l be r.v. on a probability space (Ω, A, P).

P(X_1 ∈ A_1, . . . , X_l ∈ A_l) := P(X_1^{−1}(A_1) ∩ · · · ∩ X_l^{−1}(A_l)).


Definition 1.28
(i) Random variables X_1, . . . , X_l on a probability space (Ω, A, P) are
independent if

P(X_1 ∈ A_1, . . . , X_l ∈ A_l) = P(X_1 ∈ A_1) · · · · · P(X_l ∈ A_l)

for all Borel sets A_1, . . . , A_l.

(ii) Suppose that (X_t)_{t∈T} with some nonempty index set T is a family of
R^k-valued random variables on (Ω, A, P). These random variables are
independent if for any finite, nonempty I_0 ⊆ T and any
A_t ∈ B^k, t ∈ I_0,

P(⋂_{t∈I_0} X_t^{−1}(A_t)) = Π_{t∈I_0} P(X_t^{−1}(A_t)).


Let X : Ω → R^k and Y : Ω → R^m be r.v.

Lemma 1.29 If X, Y are independent and g : R^k → R^l and
h : R^m → R^n are B^k–B^l- and B^m–B^n-measurable, respectively, then
g(X) and h(Y) are independent.

The cumulative distribution function (CDF) of X, F_X : R^k → [0, 1], is
defined by

F_X(x) = P(X ≤ x) for all x ∈ R^k.

Note that for k = 2, a = (a_1, a_2)′, b = (b_1, b_2)′ it holds that

P(X ∈ (a, b]) = F_X(b_1, b_2) − F_X(a_1, b_2) − F_X(b_1, a_2) + F_X(a_1, a_2).    (5)


Discrete random vectors


Let Z be an r.v. with values in R^k.
Z is discrete if P^Z is discrete, i.e. P(Z ∈ S_Z) = 1, where

S_Z := {z ∈ R^k : P(Z = z) > 0}

is countable. S_Z is called the support of Z. If Z = (X, Y)′, then

P(X = x) = Σ_{y∈S_Y} P(X = x, Y = y),
P(Y = y) = Σ_{x∈S_X} P(X = x, Y = y).

X , Y are independent iff

P(X = x, Y = y ) = P(X = x)P(Y = y ) ∀x ∈ SX , y ∈ SY .


Let X be real-valued; cf. Example 1.19.

X is called binomially distributed with parameters π ∈ [0, 1] and
n ∈ N, in signs X ∼ B(n, π), if P^X is a binomial distribution, i.e.

P(X = x) = (n choose x) π^x (1 − π)^{n−x} for x = 0, 1, . . . , n.

X is called geometrically distributed with parameter π ∈ [0, 1], i.s.
X ∼ Geo(π), if P^X is a geometric distribution, i.e.

P(X = x) = (1 − π)^{x−1} π for x = 1, 2, 3, . . .

X is called Poisson distributed with parameter λ > 0, i.s.
X ∼ Po(λ), if P^X is a Poisson distribution, i.e.

P(X = x) = (λ^x / x!) e^{−λ}, x = 0, 1, 2, . . .


Continuous random variables

Let X be an r.v. with values in R.
X is continuous if P^X is absolutely continuous, i.e. there exists a pdf f_X s.t.

P(a ≤ X ≤ b) = P^X([a, b]) = ∫_a^b f_X(x) dx

for all a < b. f_X is then called the probability density function (pdf) of X.

The CDF of X is given by

F_X(x) = P(X ≤ x) = ∫_{−∞}^x f_X(t) dt.

Let S_X^> = {x ∈ R : f_X(x) > 0}. Then the closure S_X of S_X^> is called
the support of X. It holds that

P(X ∈ S_X) = 1.


Let X be continuous; cf. Example 1.22.

X is normally distributed with parameters µ and σ² > 0, i.s.
X ∼ N(µ, σ²), if its density can be written as

f_X(x) = (1/√(2πσ²)) exp(−(x − µ)²/(2σ²)).

X is uniformly distributed with parameters a and b, a < b, i.s.
X ∼ U(a, b), if its density can be written as

f_X(x) = (1/(b − a)) 1_{[a,b]}(x).

X is exponentially distributed with parameter λ > 0, i.s.
X ∼ Exp(λ), if its density can be written as

f_X(x) = λ e^{−λx} 1_{[0,∞)}(x).


Continuous random vectors

Let X = (X_1, . . . , X_k)′ be an r.v. with values in R^k.
X is continuous if P^X is absolutely continuous, i.e. there is a multivariate
pdf f_X s.th.

F(b_1, . . . , b_k) = ∫_{−∞}^{b_1} · · · ∫_{−∞}^{b_k} f_X(x_1, . . . , x_k) dx_k · · · dx_1

for all b_1, . . . , b_k ∈ R. f_X is then called the probability density function
(pdf) of X. Let S_X^> = {x ∈ R^k : f_X(x) > 0}; the closure S_X of S_X^> is
called the support of X. Especially, for k = 2 and a_1 < b_1, a_2 < b_2,

P(a_1 ≤ X_1 ≤ b_1, a_2 ≤ X_2 ≤ b_2) = ∫_{a_1}^{b_1} ∫_{a_2}^{b_2} f_{X_1,X_2}(x_1, x_2) dx_2 dx_1.


Lemma 1.30 Let X = (X_1, X_2)′ be continuous with density f_{X_1,X_2}.
Then X_1 is continuous with density

f_{X_1}(x_1) = ∫_{−∞}^∞ f_{X_1,X_2}(x_1, x_2) dx_2

and X_2 is continuous with density

f_{X_2}(x_2) = ∫_{−∞}^∞ f_{X_1,X_2}(x_1, x_2) dx_1.

f_{X_1} and f_{X_2} are called marginal densities and the distributions of X_1 and
X_2, resp., are called marginal distributions.


1.4 Expectation

In the following we consider real-valued r.v., i.e. with values in R.


Definition 1.31 Suppose that X is a discrete random variable with
support S_X ⊆ R on a probability space (Ω, A, P).

E*X = E*[X] = Σ_{x∈S_X} x · P(X = x)    (6)

is well-defined if the sum (6) is unconditionally well-defined. Then E*[X]
is called the expectation (or mean) of (the discrete r.v.) X.
E*[X] is finite iff

Σ_{x∈S_X} |x| · P(X = x) < ∞.    (7)


Especially, if S_X = {a_1, . . . , a_N}, p_i = P(X = a_i) for i = 1, . . . , N and
Σ_{i=1}^N p_i = 1, then

E*X = Σ_{i=1}^N a_i p_i = Σ_{i=1}^N a_i P(X = a_i) = Σ_{i=1}^N a_i P^X({a_i}).

Example 1.32 If P(X = a) = 1, then S_X = {a} and

E*X = a · P(X = a) = a.

Example 1.33 Let (Ω, A, P) be a probability space and A ∈ A. Then 1_A
is an r.v. with support in {0, 1} and

E*1_A = 0 · P(1_A = 0) + 1 · P(1_A = 1) = P(A).


Proposition 1.34 Let Z be a discrete r.v. with values in R^k and
support S_Z, and let g : R^k → R.
(a) The support of Z* = g(Z) is S_{Z*} = g(S_Z) = {g(z) : z ∈ S_Z}.
(b) If E*[g(Z)] is well-defined, then

E*[g(Z)] = Σ_{z∈S_Z} g(z) P(Z = z).

Especially, if X, Y are random vectors and Z = (X, Y)′, then

E*[g(Z)] = Σ_{x∈S_X} Σ_{y∈S_Y} g(x, y) P(X = x, Y = y).


Special case:

E*|X| = Σ_{x∈S_X} |x| P(X = x).

Remark 1.35
(a) E*|X| is always well-defined.
(b) E*X is finite iff E*|X| < ∞. See (7).

Recap: Some laws for real numbers. Let a_n, b_n, c_n be real numbers.

Triangle inequality: |Σ_{i∈I} a_i| ≤ Σ_{i∈I} |a_i|.
If a_n ≤ b_n, lim_{n→∞} a_n = a, and lim_{n→∞} b_n = b, then a ≤ b.
If a_n ≤ c_n ≤ b_n, lim_{n→∞} a_n = a, and lim_{n→∞} b_n = a, then
lim_{n→∞} c_n = a.


Laws for the expectation of discrete r.v.

Lemma 1.36 Let X, Y be discrete r.v. and E*[X], E*[Y] well-defined.
Then: if X ≤ Y, then E*[X] ≤ E*[Y].

Lemma 1.37
Let X , Y be discrete r.v., E ∗ [X ], E ∗ [Y ] finite, and a, b, c ∈ R.
(a) |E ∗ [X ]| ≤ E ∗ [|X |].
(b) E ∗ [a + bX + cY ] is finite and
E ∗ [a + bX + cY ] = a + bE ∗ [X ] + cE ∗ [Y ].
(c) If X , Y are independent, then E ∗ [X · Y ] is finite and
E ∗ [X · Y ] = E ∗ [X ] · E ∗ [Y ].


General definition of expectation

Definition 1.38 For a real-valued random variable X on (Ω, A) define

X_n*(ω) = k/n if k/n ≤ X(ω) < (k + 1)/n, for k ∈ Z.

If
(i) E*[X_n*] is well-defined for every n ∈ N and
(ii) lim_{n→∞} E*[X_n*] is well-defined,
then

E[X] := EX := lim_{n→∞} E*[X_n*]    (8)

is called the expectation (or mean) of X.

Note that by definition, for all n ∈ N,

X_n* ≤ X ≤ X_n* + 1/n and |X − X_n*| ≤ 1/n.    (9)

Denote

X^+ = max(0, X), X^− = max(0, −X).

Then X^+ ≥ 0, X^− ≥ 0,

X = X^+ − X^−, and |X| = X^+ + X^−.

For a discrete r.v. X it holds, by definition of E*[X], that

E*[X] is well-defined iff E*[X^+] < ∞ or E*[X^−] < ∞.
E*[X] is finite iff E*[X^+] < ∞ and E*[X^−] < ∞.

Lemma 1.39 Let X, Y be discrete r.v., E*Y well-defined, and
|X − Y| ≤ c for some constant c. Then E*X is well-defined.


For a discrete r.v. X we have two definitions of an expectation, E ∗ [X ] and


E [X ], but they coincide.

Proposition 1.40 Let X be a discrete r.v.


(i) E ∗ [X ] is well-defined iff E [X ] is well-defined.
(ii) If E ∗ [X ] is well-defined, then E ∗ [X ] = E [X ].

A technical lemma:

Lemma 1.41 Let X be any r.v. and Y_n, Z_n discrete r.v. Assume that

|X − Y_n| ≤ Z_n, lim_{n→∞} E[Z_n] = 0, and lim_{n→∞} E[Y_n] = a ∈ R.

Then E[X] is finite and E[X] = a.


Laws for the expectation


We are ready to generalize Lemma 1.36 and Lemma 1.37.

Proposition 1.42 (Monotonicity) Let X, Y be r.v. and E[X], E[Y]
well-defined. Then: if X ≤ Y, then E[X] ≤ E[Y].

Theorem 1.43 Let X , Y be r.v. and E [X ], E [Y ] finite. Then:


(a) |EX | ≤ E |X |.
(b) (Linearity) E [a + bX + cY ] is finite and
E [a + bX + cY ] = a + b · E [X ] + c · E [Y ].
(c) (Product law) If X , Y are independent, then E [X · Y ] is finite
and it holds E [X · Y ] = E [X ] · E [Y ].


Continuous random variables


A continuous version of Proposition 1.34.

Theorem 1.44 Let X be real-valued and continuous with density f_X.
(a) If E[X] is finite, then

E[X] = ∫_{−∞}^∞ x · f_X(x) dx.    (10)

(b) (Expectation rule) Let g : R → R (measurable). If E[g(X)] is
finite, then

E[g(X)] = ∫_{−∞}^∞ g(x) · f_X(x) dx.    (11)

Cf. Jacod and Protter (2000), Corollary 9.1.
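
As an illustration of the expectation rule, a small Monte Carlo sketch in Python (assuming NumPy) for X ∼ Exp(λ) and g(x) = x², where (11) evaluates to E[g(X)] = 2/λ²:

```python
import numpy as np

rng = np.random.default_rng(0)

# For X ~ Exp(lam) and g(x) = x**2, evaluating (11) analytically gives
# E[g(X)] = 2 / lam**2; a Monte Carlo average should come close.
lam = 2.0
x = rng.exponential(scale=1 / lam, size=1_000_000)
print(np.mean(x**2), 2 / lam**2)  # both approximately 0.5
```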



1.5 Variance and covariance

Suppose that X, Y are real-valued random variables and denote
µ_X = E[X], µ_Y = E[Y] if the expectations are defined.
Definition 1.45
(i) The s-th moment of X is defined as E[X^s] (if well-defined),
(ii) the s-th absolute moment as E[|X|^s],
(iii) and the s-th central moment as E[(X − µ_X)^s] (if well-defined).
(iv) The 2nd central moment is also called the variance:

Var[X] = E[(X − µ_X)²].

For X, Y define by

Cov[X, Y] = E[(X − µ_X)(Y − µ_Y)]

the covariance between X and Y.



Existence of higher moments

Lemma 1.46 Let X be any real-valued r.v.


(a) If E [|X |s ] < ∞ and 0 < p < s, then E [|X |p ] < ∞.
(b) Var [X ] < ∞ iff E [X 2 ] < ∞.
(c) If E [X 2 ] < ∞ and E [Y 2 ] < ∞, then Cov [X , Y ] is finite.

Higher order moments guarantee the existence of lower order moments.

Lemma 1.47 If X , Y are independent and µX , µY are finite, then


Cov [X , Y ] is finite and Cov [X , Y ] = 0.

If Cov [X , Y ] = 0, then X and Y are called uncorrelated.



Laws for variances and covariances

A selection of laws for variances and covariances; some of them can be
easily generalized.

Proposition 1.48 Let X, Y be real-valued r.v. with finite variances
and µ_X = E[X].
(a) Var [X ] = E [X 2 ] − µ2X .
(b) Var [a + bX ] = b 2 Var [X ].
(c) Var [X + Y ] = Var [X ] + Var [Y ] + 2Cov [X , Y ].
(d) Cov [X , Y ] = Cov [Y , X ].
(e) Cov [X , X ] = Var [X ].
(f) Cov [a + bX , c + dY ] = b · d · Cov [X , Y ].


Examples for expectations and variances

Expectations and variances can be computed for specific discrete and
absolutely continuous distributions using Proposition 1.34 and Theorem
1.44.

Examples 1.49
1 If X ∼ B(n, π), then E [X ] = nπ and Var [X ] = nπ(1 − π).
2 If X ∼ Po(λ), then E [X ] = λ and Var [X ] = λ.
3 If X ∼ U(a, b), then E [X ] = (a + b)/2, Var [X ] = (b − a)2 /12.
4 If X ∼ Exp(λ), then E [X ] = 1/λ, Var [X ] = 1/λ2 .
5 If X ∼ N(µ, σ 2 ), then E [X ] = µ, Var [X ] = σ 2 .
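
For instance, the uniform case in 3 follows directly from Theorem 1.44; as a sketch:

```latex
E[X] = \int_a^b \frac{x}{b-a}\,dx
     = \frac{b^2 - a^2}{2(b-a)}
     = \frac{a+b}{2}.
```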


Property (c) of Proposition 1.48 can be generalized.

Proposition 1.50 If X1 , X2 , . . . , Xn are independent with finite


variances, then

Var [X1 + · · · + Xn ] = Var [X1 ] + · · · + Var [Xn ].

More generally,

Var[Σ_{i=1}^n c_i X_i] = Σ_{i=1}^n c_i² Var[X_i] + 2 Σ_{i=1}^{n−1} Σ_{j=i+1}^n c_i c_j Cov[X_i, X_j]    (12)

for any r.v. X_1, . . . , X_n, i.e. the X_i's need not be independent.


Expectations and covariances of random vectors

For an R^k-valued random variable X = (X_1, . . . , X_k)′ such that EX_j exists
for all j, we define the expectation (vector) as

µ_X = E[X] = (E[X_1], . . . , E[X_k])′.

Lemma 1.51 Let X be an r.v. with values in R^k and E[X] finite. Let
a ∈ R^m and B ∈ R^{m×k} a matrix. Then

E[a + BX] = a + B · E[X].


If E|X_j|² < ∞ for all j, the covariance matrix of X is defined by

Var[X] := Cov[X, X] := Σ_X = E[(X − µ_X)(X − µ_X)′],

i.e. the k × k matrix whose (i, j)-th entry is E[(X_i − µ_i)(X_j − µ_j)],
with diagonal entries E[(X_i − µ_i)²], i = 1, . . . , k.

More generally, for two random vectors X, Y, we define the covariance
matrix of X and Y by

Cov[X, Y] := E[(X − µ_X)(Y − µ_Y)′].


Chapter 2: Asymptotic theory

1 Convergence of expectations
2 Modes of convergence
3 Convergence in distribution
4 Limit Theorems
5 Application in Statistics
6 Stochastic Boundedness


2.1 Convergence of expectations


Let X , X1 , . . . , Xn , . . . be real-valued r.v. on R
(Xn )n converges pointwise to X , i.s. limn→∞ Xn = X or Xn → X , iff

lim Xn (ω) = X (ω) for all ω ∈ Ω. (13)


n→∞

In general, (13) is not sufficient for limn→∞ E [Xn ] = E [X ].

Theorem 2.1 (Monotone convergence theorem)

Assume (13) and

0 ≤ X_n ≤ X_{n+1} for all n ≥ 1.    (14)

Then

E[X_n] −→ E[X] as n → ∞.


For a proof see Jacod and Protter (2000).

Recall that for a sequence of real numbers (a_n)_n

lim inf_{n→∞} a_n = lim_{n→∞} inf_{k≥n} a_k.

Lemma 2.2 (Fatou’s Lemma) Assume that X_n ≥ 0 and define

X(ω) = lim inf_{n→∞} X_n(ω).

Then

E[X] ≤ lim inf_{n→∞} E[X_n].

Idea of the proof: Put Y_n = inf_{k≥n} X_k and apply the monotone
convergence theorem.


Theorem 2.3 (Dominated Convergence Theorem) Assume (13) and

|X_n| ≤ Y for n ≥ 1    (15)

for a random variable Y with E[Y] < ∞. Then

E[X_n] −→ E[X] as n → ∞.    (16)

For a proof see Jacod and Protter (2000).

Especially, (16) holds if X_n → X and the X_n are uniformly bounded, i.e.
|X_n| ≤ C for all n and for some C ∈ R.


Theorem 2.1, Lemma 2.2, and Theorem 2.3 are still valid if (13), (14),
and (15) are replaced by a.s. (almost surely) statements. I.e. put

A = {ω : X_n(ω) → X(ω) as n → ∞},
B_n = {ω : X_n(ω) ≤ X_{n+1}(ω)},
C_n = {ω : |X_n(ω)| ≤ |Y(ω)|}.

If

P(A) = P(B_n) = P(C_n) = 1 for all n ∈ N,

then (13), (14), and (15) are said to hold almost surely, Theorem 2.1,
Lemma 2.2, and Theorem 2.3 stay valid, and (16) holds true.


2.2 Modes of convergence


Let ‖·‖ denote a norm on R^k, e.g. ‖x‖ = ‖x‖_1 = Σ_{j=1}^k |x_j| or
‖x‖ = ‖x‖_2 = (Σ_{j=1}^k |x_j|²)^{1/2}.

Definition 2.4 Suppose that (X_n)_n and X are random variables on a
probability space (Ω, A, P) and with values in (R^k, B^k).
(i) (Convergence in probability)
The sequence (X_n)_n converges in probability to X if

P({ω : ‖X_n(ω) − X(ω)‖ > ε}) = P(‖X_n − X‖ > ε) −→ 0 as n → ∞, ∀ ε > 0.

Notation: X_n →^P X, p-lim_{n→∞} X_n = X.


(ii) (Almost sure convergence)
The sequence (X_n)_n converges almost surely to X if

P({ω : lim_{n→∞} X_n(ω) = X(ω)}) = P(X_n → X) = 1.

Notation: X_n → X P-a.s., X_n → X a.s., X_n →^{a.s.} X.

(iii) (Convergence in the p-th mean (L_p-convergence))
Let p ≥ 1. The sequence (X_n)_n converges in p-th mean to X if

E‖X_n − X‖^p −→ 0 as n → ∞.

Notation: X_n →^{L_p} X, L_p-lim_{n→∞} X_n = X.


Definition 2.5 Suppose that (X_n)_n and X are random variables with
values in (R^k, B^k). (Convergence in distribution)
The sequence (X_n)_n converges to X in distribution if

E[f(X_n)] −→ E[f(X)] as n → ∞, for all f ∈ C_b(R^k),

i.e. for all functions f : R^k → R that are continuous and bounded.

Notation: X_n →^L X, X_n →^D X, X_n →^d X
(d, D, and L for “in distribution”, “in law”).


Relation between different modes of convergence

In the following suppose that (X_n)_n and X are random variables on a
probability space (Ω, A, P).
Then the following scheme holds true:

X_n →^{L_p} X ⟹ X_n →^P X ⟹ X_n →^d X
X_n →^{a.s.} X ⟹ X_n →^P X

L_p-convergence (p ≥ r ≥ 1):

X_n →^{L_p} X ⟹ X_n →^{L_r} X ⟹ X_n →^{L_1} X

L_p-convergence and convergence in probability

Theorem 2.6 (Markov’s inequality) For a random variable X and a
monotone increasing function g : [0, ∞) → [0, ∞) with g(x) > 0 for
all x > 0 it holds for every ε > 0 that

P(‖X‖ ≥ ε) ≤ E[g(‖X‖)] / g(ε).

For a real-valued random variable X with finite second moment and ε > 0
it holds that

P(|X − EX| ≥ ε) ≤ Var[X] / ε²    (Chebychev’s inequality).
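
A simulation sketch of Chebychev's inequality (Python/NumPy; illustration only, with an arbitrary choice of ε):

```python
import numpy as np

rng = np.random.default_rng(1)

# For X ~ N(0, 1): P(|X - EX| >= eps) <= Var[X] / eps**2 = 1 / eps**2.
eps = 2.0
x = rng.standard_normal(1_000_000)
print((np.abs(x) >= eps).mean())  # approximately 0.0455
print(1 / eps**2)                 # Chebychev bound: 0.25
```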


Suppose that (Xn )n and X are random variables on some probability space
(Ω, A, P).

Corollary 2.7 (a) Let p ≥ 1. Then

X_n →^{L_p} X ⟹ X_n →^P X.

(b) For p ≥ r ≥ 1 it holds that

X_n →^{L_p} X ⟹ X_n →^{L_r} X.

Lemma 2.8 Let X_n be real-valued and a ∈ R.
If E[X_n] → a and Var[X_n] → 0 as n → ∞, then X_n →^{L_2} a.


Weak laws of large numbers (WLLN)

Lemma 2.9 (Weak law of large numbers 1)
Suppose that X_1, X_2, . . . are real-valued and uncorrelated r.v.
(i.e. Cov[X_i, X_j] = 0, i ≠ j) with E[X_1] = E[X_2] = · · · = µ ∈ R and
Var[X_i] ≤ c for all i and some c ∈ R. Then

X̄_n = (1/n) Σ_{i=1}^n X_i →^P µ.

Theorem 2.10 (Weak law of large numbers 2)
For a sequence of independently and identically distributed (i.i.d.)
r.v. X_1, X_2, . . . with finite mean µ = E[X_i] it holds that X̄_n →^P µ.
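
A small simulation sketch of the WLLN (Python/NumPy; the sample sizes and the tolerance 0.05 are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(2)

# Sample means of i.i.d. Exp(1) draws (mu = 1) concentrate around mu:
# the empirical P(|Xbar_n - mu| > 0.05) shrinks as n grows.
for n in (10, 1_000, 100_000):
    x = rng.exponential(size=(100, n))  # 100 replications of size n
    xbar = x.mean(axis=1)
    print(n, np.mean(np.abs(xbar - 1.0) > 0.05))
```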


Convergence in probability and almost surely

Theorem 2.11 It holds that

X_n →^{a.s.} X ⟹ X_n →^P X.

In general, convergence in probability does not imply a.s. convergence.

Example 2.12 Suppose that (Ω, A, P) = ([0, 1], B, U(0, 1)) and define

X_{2^k+j}(ω) = 1_{[j2^{−k},(j+1)2^{−k}]}(ω), k ∈ N_0, j = 0, . . . , 2^k − 1.

Put X = 0. Then X_n →^P X, but not X_n →^{a.s.} X.


2.3 Convergence in distribution

Theorem 2.13 Suppose that (X_n)_n, X are R^k-valued random
variables and F_{X_n}, F_X their CDFs.
The following statements are equivalent:
(i) X_n →^d X.
(ii) F_{X_n}(x) −→ F_X(x) as n → ∞ at all continuity points of F_X.
(iii) E[f(X_n)] −→ E[f(X)] as n → ∞ for all bounded Lipschitz
functions f : R^k → R.

Cf. Van der Vaart (1998): Asymptotic Statistics. Cambridge University
Press, Lemma 2.2.


Let f : R^k → R be a function.
C(f) = {x ∈ R^k : f is continuous at x} is called the set of
continuity points of f.
f is bounded if there is a c ∈ R s.th. |f(x)| ≤ c for all x ∈ R^k.
f is a Lipschitz function if there is an L ∈ R such that

for all x, y: |f(x) − f(y)| ≤ L‖x − y‖.

Remark: A continuous function f : R^k → R is uniformly continuous on a
compact subset C ⊂ R^k, i.e. for every ε > 0 there exists a δ = δ(ε) > 0
s.th.

for all x, y ∈ C: ‖x − y‖ ≤ δ ⟹ |f(x) − f(y)| ≤ ε.


Convergence in probability and convergence in distribution

Theorem 2.14 For R^k-valued r.v. (X_n)_n and X it holds that

X_n →^P X ⟹ X_n →^d X.

In general, convergence in probability implies convergence in distribution
but not vice versa.

Theorem 2.15 For R^k-valued r.v. (X_n)_n on a probability space
(Ω, A, P) and deterministic a ∈ R^k it holds that

X_n →^P a ⟺ X_n →^d a.


Characteristic functions
Let X be a random vector in R^k.
The function ϕ_X : R^k → C, defined by

ϕ_X(t) = E[e^{it′X}] = E[cos(t′X)] + iE[sin(t′X)],

is called the characteristic function of X.
Here i is the imaginary unit, i.e. i² = −1.

Proposition 2.16
Let Y be a random vector in R^k, a ∈ R^m and B ∈ R^{m×k}.
1 ϕ_X is uniformly continuous.
2 ϕ_{a+BX}(t) = e^{ia′t} · ϕ_X(B′t) for t ∈ R^m.
3 If X and Y are independent, then ϕ_{X+Y}(t) = ϕ_X(t)ϕ_Y(t).
4 If Z ∼ N(0, 1), then ϕ_Z(t) = e^{−t²/2}.
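
Property 4 can be checked by Monte Carlo (Python/NumPy sketch; the grid of t values is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(6)

# Monte Carlo estimate of phi_Z(t) = E[exp(itZ)] for Z ~ N(0, 1),
# compared with the closed form exp(-t**2 / 2).
z = rng.standard_normal(1_000_000)
for t in (0.5, 1.0, 2.0):
    print(t, np.exp(1j * t * z).mean().real, np.exp(-t**2 / 2))
```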


Let X, Y, X_n be random vectors in R^k.

Lemma 2.17 X =^d Y iff ϕ_X = ϕ_Y.

Cf. v.d. Vaart (1998), Lemma 2.15.

Theorem 2.18 (Lévy’s continuity theorem)

X_n →^d X iff for all t ∈ R^k: ϕ_{X_n}(t) → ϕ_X(t).

Cf. v.d. Vaart (1998), Lemma 2.13.

Theorem 2.19 (Cramér-Wold) X_n →^d X iff ∀ t ∈ R^k: t′X_n →^d t′X.


Theorem 2.20 (Continuous mapping theorem) Let g : R^k → R^m be
continuous on C with P(X ∈ C) = 1. Then
(i) X_n →^P X ⟹ g(X_n) →^P g(X),
(ii) X_n →^d X ⟹ g(X_n) →^d g(X).

Cf. v.d. Vaart (1998), Theorem 2.3.

Lemma 2.21 X_n →^P X iff X_{n,j} →^P X_j for j = 1, . . . , k.

The vector (X_n) converges in probability if and only if all components
converge in probability.


Note that for P(A_n = a_n) = 1 it holds: a_n → a iff A_n →^P a.
Application: Let X_n, Y_n be r.v. and a_n, b_n, c_n be real numbers.
Let X_n →^P X, Y_n →^P Y, a_n → a, b_n → b, c_n → c. Then

a_n + b_n X_n + c_n Y_n →^P a + bX + cY,
X_n Y_n →^P XY etc.

Lemma 2.22 (Slutsky’s Lemma) Let X_n, Z_n, Z be r.v. with values in R^k
and let X_n →^P c ∈ R^k and Z_n →^d Z. Then

X_n + Z_n →^d c + Z.


Theorem 2.23 (Slutsky’s Lemma)

(i) Let X_n, Z_n be r.v. with values in R^k and R^m, resp., with X_n →^P c
and Z_n →^d Z, where c is a constant. Then

(X_n, Z_n) →^d (c, Z).

(ii) Let c ∈ R^m and B ∈ R^{m×k}. Let X_n →^P c with values in R^m,
B_n →^P B (m × k matrices), and Z_n →^d Z with values in R^k. Then

X_n + B_n Z_n →^d c + BZ.

Cf. v.d. Vaart, Theorem 2.7 and Lemma 2.8.

B_n →^P B means B_{n,i,j} →^P B_{i,j} for all 1 ≤ i ≤ m, 1 ≤ j ≤ k.


2.4 Limit Theorems

Theorem 2.24 (Strong law of large numbers (SLLN)) For a sequence
of i.i.d. random variables X_1, X_2, . . . on some probability space
(Ω, A, P) with finite mean µ = E[X_j] it holds that X̄_n −→ µ almost
surely.

Theorem 2.25 (Central limit theorem for i.i.d. sequences) Let
X_1, X_2, X_3, . . . be i.i.d. real-valued random variables with EX_i = µ,
Var(X_i) = σ² ∈ (0, ∞). Then

(X_1 + · · · + X_n − nµ) / (σ√n) →^d Z ∼ N(0, 1).
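
A simulation sketch of Theorem 2.25 (Python/NumPy; illustration only):

```python
import numpy as np

rng = np.random.default_rng(3)

# Standardized sums of i.i.d. U(0,1) draws (mu = 1/2, sigma^2 = 1/12)
# are approximately N(0, 1) for large n.
n, reps = 500, 20_000
x = rng.random((reps, n))
z = (x.sum(axis=1) - n * 0.5) / np.sqrt(n / 12)
print(z.mean(), z.var())   # approximately 0 and 1
print((z <= 1.96).mean())  # approximately 0.975 = Phi(1.96)
```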


Taylor formula: Let f : R → R be (m+1)-times differentiable in (a, b)
and x, x + h ∈ (a, b). Then

f(x + h) = Σ_{i=0}^m (f^{(i)}(x)/i!) h^i + (f^{(m+1)}(ξ)/(m + 1)!) h^{m+1},

where ξ lies between x and x + h.

Special case, m = 1:

f(x + h) = f(x) + f′(x)h + (1/2) f″(ξ)h²
         = f(x) + f′(x)h + (1/2) f″(x)h² + (1/2)[f″(ξ) − f″(x)]h².


Lemma 2.26 Let X be real-valued. If E[|X|^m] < ∞, then ϕ_X is m-times
differentiable and, for 1 ≤ r ≤ m,

ϕ_X^{(r)}(t) = E[(iX)^r e^{itX}], E(X^r) = ϕ_X^{(r)}(0)/i^r, and

ϕ_X(t) = Σ_{r=0}^m ((it)^r/r!) E[X^r] + ((it)^m/m!) R_m(t),

where |R_m(t)| ≤ 3E[|X|^m] and R_m(t) tends to 0 as t → 0.

Especially, if E(X²) < ∞, then

ϕ_X(t) = 1 + itE[X] − (t²/2) E[X²] + t² R*(t)

with R*(t) → 0 as t → 0.


Theorem 2.27 (Lyapounov CLT) Let X_1, X_2, . . . be independent
real-valued random variables with µ_t = EX_t, σ_t² = Var(X_t) and
m_{3,t} = E|X_t − µ_t|³ < ∞. Assume

(Σ_{t=1}^n m_{3,t})^{1/3} / (Σ_{t=1}^n σ_t²)^{1/2} −→ 0 as n → ∞.

Then

(X_1 + · · · + X_n − µ_1 − · · · − µ_n) / (σ_1² + · · · + σ_n²)^{1/2} →^d Z ∼ N(0, 1).


Excursus to the multivariate normal distribution: A vector X with density

φ(x) = (1/√((2π)^k det Σ)) exp(−0.5 (x − µ)′ Σ^{−1} (x − µ)), x ∈ R^k,

has a multivariate normal distribution with mean µ ∈ R^k and covariance
matrix Σ ∈ R^{k×k}, which is assumed to be positive definite. One can
show that a′X ∼ N(a′µ, a′Σa) for any a ∈ R^k\{0_k}.

Theorem 2.28 (Multivariate CLT) Suppose that X_1, X_2, . . . are i.i.d.
R^k-valued random variables with mean vector µ and finite, positive
definite covariance matrix Σ. Then

(1/√n)(X_1 + · · · + X_n − nµ) →^d Z̃ ∼ N(0_k, Σ).


2.5 Application in Statistics

Example 2.29 We assume that

X_1, . . . , X_n ∼ N(µ, σ²) i.i.d.

are defined on the same sample space. µ and σ² are unknown and could
be “determined” by observations. Θ = R × (0, ∞).
Note that the distribution P^{Z_n} of Z_n = (X_1, . . . , X_n)′ changes with the
choice of the parameter θ = (µ, σ²).

Let Z_n be an r.v. with values in R^{mn} defined on some sample space. Let
Θ ≠ ∅ be a set and P_θ^{Z_n}, for any θ ∈ Θ, probability measures on R^{mn}. Then
(R^{mn}, B^{mn}, {P_θ^{Z_n} : θ ∈ Θ}) is called a statistical experiment, Θ its
parameter space. θ ∈ Θ is called a parameter.
n is usually some sample size.


Bias, Consistency
Let there be given a statistical experiment with parameter space Θ. An r.v.
g(Z_n), g : R^{mn} → S, Θ ⊂ S, can be called an estimator of θ.
Let T̂, T̂_n be estimators with values in R^k.
T̂ is unbiased for τ = τ(θ) ∈ Θ if

E_θ[T̂] = τ ∀ θ ∈ Θ.

The sequence of estimators (T̂_n)_n is asymptotically unbiased if

lim_{n→∞} E_θ[T̂_n] = τ ∀ θ ∈ Θ.

The sequence of estimators (T̂_n)_n is (weakly) consistent if

T̂_n →^P τ ∀ θ ∈ Θ.


Example 2.30 We assume that X_1, . . . , X_n are real-valued and i.i.d. with
E[X_i²] < ∞. Then

M̂_n = X̄_n and S_n² = (1/(n − 1)) Σ_{i=1}^n (X_i − X̄_n)²

are unbiased and consistent estimators for µ = E[X_i] and σ² = Var[X_i],
respectively.
Notation: If F is the CDF of Y, we write E[g(Y)] = ∫ g(y) F(dy). Let
DF(R) denote the set of all CDFs on R.
Note that the parameter space of Example 2.30 can be written as

Θ = {F ∈ DF(R) : ∫ x² F(dx) < ∞};

θ = F ∈ Θ, µ = µ(θ) = ∫ x F(dx) and σ² = σ²(θ) = ∫ (x − µ)² F(dx).


Example 2.31 (Plug-in principle) Let X_1, . . . , X_n be R^k-valued r.v. and
θ̂_n = T(X_1, . . . , X_n) an estimator of some parameter θ, T : R^{kn} → Θ.
Let g be continuous. Then

θ̂_n →^P θ ∀ θ ∈ Θ ⟹ g(θ̂_n) →^P g(θ) ∀ θ ∈ Θ.

Example 2.32 Let X_1, . . . , X_n ∼ Exp(λ) i.i.d.
By the WLLN, Theorem 2.10,

X̄_n →^P E[X_i] = µ = 1/λ for all λ > 0.

Consequently, with g(x) = 1/x, by Theorem 2.20,

Λ̂_n = 1/X̄_n = g(X̄_n) →^P g(µ) = 1/µ = λ for all λ > 0,

i.e. Λ̂_n = 1/X̄_n is a consistent estimator for λ.
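
Example 2.32 can be checked by simulation (Python/NumPy sketch; λ = 3 is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(4)

# Lambda_hat_n = 1 / Xbar_n for an Exp(lambda) sample approaches lambda.
lam = 3.0
for n in (50, 5_000, 500_000):
    x = rng.exponential(scale=1 / lam, size=n)
    print(n, 1 / x.mean())  # approaches 3.0
```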


Some Applications
Let X, X_1, . . . , X_n be i.i.d. with values in R^k with E[X] = µ, Var[X] = Σ.
Then
(i) (1/n) Σ_{i=1}^n X_i X_i′ →^P E[XX′].    (LLN)
(ii) Σ̂_n = (1/n) Σ_{i=1}^n (X_i − X̄_n)(X_i − X̄_n)′ = (1/n) Σ_{i=1}^n X_i X_i′ − X̄_n X̄_n′
→^P E[XX′] − E[X] E[X]′ = E[(X − E[X])(X − E[X])′] = Σ.
(iii) √n Σ̂_n^{−1/2} (X̄_n − µ) →^d N(0, I_k).


2.6 Stochastic Boundedness

Definition 2.33 (Stochastic boundedness, tightness)

The sequence (X_n)_n is stochastically bounded if for every ε > 0 there
exist a C = C(ε) > 0 and n_0 = n_0(ε) ∈ N s.th.

P(‖X_n‖ ≤ C) ≥ 1 − ε for all n ≥ n_0.

Notation: X_n = O_P(1).
Note that for an r.v. X, in general, there is no C ∈ R s.th.

P(‖X‖ ≤ C) = 1.

Consider e.g. X ∼ N(0, 1):

P(|X| ≤ C) = Φ(C) − Φ(−C) < 1 for all C > 0.


Notation: Z_n = o_P(1) iff Z_n →^P 0.

Theorem 2.34
(i) X_n →^d X ⟹ X_n = O_p(1).
(ii) X_n = X + o_p(1) ⟹ X_n = O_p(1)
(X_n, X scalar or vector or matrix).
(iii) For X_n = o_p(1), Y_n = o_p(1), U_n = O_p(1), W_n = O_p(1) it holds
(a) X_n + Y_n = o_p(1),
(b) U_n + W_n = O_p(1),
(c) U_n · W_n = O_p(1),
(d) X_n · U_n = o_p(1).
(iv) g : R^k → R^l continuous at x_0. Then

X_n = x_0 + o_p(1) ⟹ g(X_n) = g(x_0) + o_p(1).


Lemma 2.35 Let c_n Z_n = O_P(1) for c_n → ∞. Then Z_n = o_P(1).

Theorem 2.36 (Delta method)

Let U ⊂ R^k be a neighborhood of c ∈ R^k, φ : U → R^m differentiable
at c, and X_n an R^k-valued random variable with

√n (X_n − c) →^d N(0, Σ).

Then:

√n (φ(X_n) − φ(c)) →^d N(0, φ′(c) Σ φ′(c)′).

Cf. v.d.Vaart Theorem 3.1.
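
A simulation sketch of the delta method for φ(x) = 1/x applied to exponential sample means, where φ′(µ)² σ² = λ⁴ · (1/λ²) = λ² (Python/NumPy; illustration only):

```python
import numpy as np

rng = np.random.default_rng(5)

# For Exp(lambda) data, mu = 1/lambda and sigma^2 = 1/lambda**2;
# phi(x) = 1/x has phi'(mu) = -lambda**2, so the limit variance is
# phi'(mu)**2 * sigma**2 = lambda**2.
lam, n, reps = 2.0, 1_000, 10_000
x = rng.exponential(scale=1 / lam, size=(reps, n))
t = np.sqrt(n) * (1 / x.mean(axis=1) - lam)
print(t.var(), lam**2)  # both approximately 4.0
```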



Summary

Important concepts and statements:

Modes of convergence and their relationship:
X_n →^P X, X_n →^{L_p} X, X_n →^{a.s.} X, X_n →^d X
continuous mapping theorem, dominated convergence theorem
Slutsky’s lemma, Delta method
characteristic functions: definition, identification of distributions,
expansion, Lévy’s continuity theorem
Cramér-Wold
LLN, CLT, Lyapounov CLT
Landau symbols: o_P(1), O_P(1)
algebra with Landau symbols


Chapter 3: Conditional expectations, probabilities and variances

1 Conditional expectation and conditional probabilities:


Definition and special cases
2 Important properties of conditional expectations
3 Conditional variances


3.1 Conditional expectations and conditional probabilities

Regression problem: How much of the random fluctuations of Y can be


explained by X ?
Find a (measurable) function g : Rk −→ R that minimizes

E [{Y − g (X )}2 ]. (∗)

Definition 3.1
Each (measurable) function g that minimizes (∗) is called conditional
expectation of Y given X .
Notation:

E [Y |X ] = g (X ), E [Y |X = x] = g (x).

Remark: For c ∈ R we have E [c|X ] = c.


Theorem 3.2 For an Rk -valued random variable X and a real-valued


random variable Y assume that EY 2 < ∞. Then the following are
equivalent (TFAE):
(i) g (X ) = E [Y |X ] a.s.
(ii) E [{Y − g (X )}h(X )] = 0
for all measurable functions h with E [h2 (X )] < ∞.
(iii) E [Y · h(X )] = E [g (X ) · h(X )]
for all measurable, bounded functions h.
(iv) E [{Y − g (X )}h(X )] = 0
for all measurable functions h : Rk → {0, 1}.

These characterizations can be used to prove properties of conditional


expectations or to compute specific ones.


(iv) can be rewritten as

E [Y 1(X ∈ B)] = E [g (X )1(X ∈ B)] (**)

for all (Borel-) sets B ⊂ Rk . (**) is often used as definition of a


conditional expectation; it does not require E [Y 2 ] < ∞ but only
E |Y | < ∞ or Y ≥ 0.

Theorem 3.3 (Uniqueness of conditional expectation)


For two minimizers g1 , g2 of (∗) it holds

g1 (X ) = g2 (X ) a.s.

Consequently, E[Y|X] is almost surely unique.


Recall the relation between expectation and probability E 1A = P(A).


Now, we define conditional distributions via conditional expectations.
Suppose that X and Y are random variables with values in (Rk , B k ) and
(Rl , B l ), respectively, on a probability space (Ω, A, P).
Definition 3.4 For any A ∈ A,

P(A | X ) = E (1A | X )

is called conditional probability of A given X . Since 1A is bounded,


P(A | X ) always exists. All conditional probabilities

P Y |X (B) := P(Y ∈ B | X ) = E (1B (Y ) | X ), ∀B ∈ B l ,

together are called the conditional distribution of Y given X .


Moreover, P Y |X =x (B) = E (1B (Y ) | X = x) is called conditional
distribution of Y given X = x.


Special case: X, Y discrete

Let X, Y be discrete with supports S_X, S_Y and pmf
f_{X,Y}(x, y) = P(X = x, Y = y). Define

f_{Y|X}(y|x) := f_{X,Y}(x, y)/f_X(x) = P(X = x, Y = y)/P(X = x) = P(Y = y|X = x)

for x ∈ S_X. Then

E(Y | X = x) = g(x) = Σ_{y∈S_Y} y · f_{Y|X}(y|x)

and

P^{Y|X=x}(B) = Σ_{y∈S_Y∩B} f_{Y|X}(y|x).

f_{Y|X} is called the conditional probability mass function of Y given X.



Example 3.5 Suppose that X and Z are the thrown numbers of two
independent dice throws and define Y = X + Z. Then, for x = 1, . . . , 6,

E[Y | X = x] = x + 3.5

and hence E[Y | X] = X + 3.5.
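
Example 3.5 can be verified by direct enumeration (Python sketch):

```python
# Two fair dice, Y = X + Z: E[Y | X = x] averages Y over the
# 6 equally likely values of Z.
for x in range(1, 7):
    expected_y = sum(x + z for z in range(1, 7)) / 6
    print(x, expected_y)  # prints x and x + 3.5
```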


Example 3.6
Suppose that Y_1, . . . , Y_n are i.i.d. random variables with Y_i ∼ B(1, θ),
where θ ∈ (0, 1) is an unknown parameter. Let Y = (Y_1, . . . , Y_n)′. We
consider the statistic X = T(Y) = Σ_{i=1}^n Y_i, i.e. X ∼ B(n, θ). Then

P(Y = y | X = k) = 1/(n choose k) if y_1 + · · · + y_n = k, and 0 else,

y = (y_1, . . . , y_n)′, k = 0, . . . , n, which is independent of θ.


Special case: Continuous distributions

Let (X, Y)′ be a continuous random vector with joint pdf f_{X,Y}. Define

f_{Y|X}(y|x) = f_{X,Y}(x, y) / f_X(x), if f_X(x) > 0; any density, elsewhere.

Then

E[Y | X = x] = ∫_{−∞}^{∞} y · f_{Y|X}(y|x) dy

and

P^{Y|X=x}([a, b]) = ∫_a^b f_{Y|X}(y|x) dy.

f_{Y|X} is called the conditional probability density function of Y given X.
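As an illustration beyond the slides: if (X, Y)′ is bivariate standard normal with correlation ρ, the formula above yields f_{Y|X}(·|x) as the N(ρx, 1 − ρ²) density, hence E[Y | X = x] = ρx. A Monte Carlo sketch (seed and bin width arbitrary) recovers this by averaging Y over narrow bins of X:

```python
import numpy as np

# (X, Y) bivariate standard normal with correlation rho:
# f_{Y|X}(y|x) is the N(rho*x, 1 - rho^2) density, so E[Y | X = x] = rho*x.
rng = np.random.default_rng(1)
rho, n = 0.6, 2_000_000
X = rng.normal(size=n)
Y = rho * X + np.sqrt(1 - rho**2) * rng.normal(size=n)

for x0 in (-1.0, 0.0, 1.5):
    in_bin = np.abs(X - x0) < 0.05           # condition on X close to x0
    print(x0, Y[in_bin].mean(), rho * x0)    # empirical vs. theoretical
```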


Theorem 3.7 Let Y be a square-integrable real-valued r.v. and X an R^k-valued random variable on a probability space (Ω, A, P). Then
(i) X, Y are independent iff for all B ∈ B: P^{Y|X}(B) = P(B) a.s.
(ii) If X, Y are independent, then E[Y|X] = E[Y] a.s.

Note: If (X, Y)′ is continuous with pdf f_{X,Y}, then independence can be expressed by

f_{X,Y}(x, y) = f_X(x) f_Y(y) a.e.

In that case

f_{Y|X}(y|x) = f_{X,Y}(x, y) / f_X(x) = f_X(x) f_Y(y) / f_X(x) = f_Y(y) a.e.


3.2 Important properties of conditional expectations

Theorem 3.8 (Iterated expectations)
Let Y be a real-valued r.v., X an R^k-valued r.v., and Z an R^m-valued r.v. on a probability space (Ω, A, P). Then
(i) E[E[Y|X]] = E[Y],
(ii) E[E[Y|X, Z]|Z] = E[Y|Z] a.s.,
(iii) E[E[Y|X]|X, Z] = E[Y|X] a.s.,
(iv) E[E[Y|X]|f(X)] = E[Y|f(X)] a.s.,
(v) E[Y f(X)|X] = f(X) E[Y|X] a.s., where f is an R-valued function such that E[f²(X)] + E[(Y f(X))²] < ∞,
(vi) E[Y|X, f(X)] = E[Y|X] a.s.
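A short simulation sketch (model and numbers invented for illustration) makes (i) and (v) concrete for Y = XZ + ε with X, Z, ε independent, where E[Y|X] = X · E[Z]:

```python
import numpy as np

# Y = X*Z + eps with X, Z, eps independent, so E[Y | X] = X * E[Z].
rng = np.random.default_rng(2)
n = 1_000_000
X = rng.uniform(1.0, 2.0, n)
Z = rng.exponential(1.0, n)          # E[Z] = 1
Y = X * Z + rng.normal(0.0, 1.0, n)

cond_mean = X * 1.0                  # closed form: E[Y | X] = X * E[Z]
print(Y.mean(), cond_mean.mean())    # (i): both approx E[X] = 1.5
# (v) after taking expectations: E[Y f(X)] = E[f(X) E[Y|X]] for f(x) = x^2.
print((Y * X**2).mean(), (X**2 * cond_mean).mean())   # both approx 3.75
```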


Remarks:
(i)-(iv): when expectations are iterated, the coarser (less informative) conditioning set prevails.
(v): Conditionally on X, f(X) can be treated like a constant and pulled out of the conditional expectation.
(vi): Redundant information can be dropped.

Example 3.9 (Application of (vi)). Consider a model equation for the wage in terms of education and experience:

E[wage | educ, exper, educ², educ · exper]
= β0 + β1 educ + β2 exper + β3 educ · exper + β4 educ²
= E[wage | educ, exper] a.s.

Thus, it is redundant to also condition on educ² and educ · exper.


Theorem 3.10 (Properties of conditional expectation) Suppose that Y1, Y2 are square-integrable real-valued random variables, X is an R^k-valued random variable on a probability space (Ω, A, P) and a1, a2 are scalars. Then
(i) E[a1 Y1 + a2 Y2 | X] = a1 E[Y1|X] + a2 E[Y2|X] a.s.
(ii) If Y1 ≤ Y2, then E[Y1|X] ≤ E[Y2|X] a.s.
(iii) (E[Y1 Y2 | X])² ≤ E[Y1²|X] E[Y2²|X] a.s. (Cauchy-Schwarz inequality), provided E[Yi⁴] < ∞.
(iv) For any ε > 0 and E[Y⁴] < ∞,

P(|Y| ≥ ε | X) ≤ E[Y²|X] / ε² a.s.

Note that the moment conditions for Y and the Yi could be relaxed.



A function ρ : (a, b) → R is convex iff for all x1, x2 ∈ (a, b) and all α ∈ (0, 1):

ρ(αx1 + (1 − α)x2) ≤ αρ(x1) + (1 − α)ρ(x2).

If ρ is twice differentiable, then ρ is convex iff ρ″(x) ≥ 0 on (a, b).

Theorem 3.11 (Properties of conditional expectation)
Let E[Y²] < ∞ and X be R^k-valued. Then
(i) If ρ : R → R is convex and E[ρ(Y)²] < ∞, then

ρ(E[Y|X]) ≤ E[ρ(Y)|X] a.s.

(Jensen's inequality).
(ii) 0 ≤ Yn ↑ Y =⇒ E[Yn|X] ↑ E[Y|X] a.s. (monotone convergence).


3.3 Conditional Variances

Definition 3.12 For a real-valued random variable Y (with E[Y⁴] < ∞) and an R^k-valued random variable X on a probability space (Ω, A, P) a conditional variance of Y given X is defined as

Var[Y|X] = E[(Y − E[Y|X])² | X].

Lemma 3.13 Under the conditions of the Definition,
(i) Var[a(X)Y + b(X) | X] = a²(X) Var[Y|X] a.s., where a and b are measurable functions such that a(X)Y + b(X) satisfies the assumptions of the Definition,
(ii) Var[c|X] = 0 a.s. for any constant c, and Var[a(X)|X] = 0 a.s.,
(iii) If X, Y are independent, then Var[Y|X] = Var[Y] a.s.


Theorem 3.14 Under the conditions of the Definition,
(i) Var(Y) = E[Var(Y|X)] + Var(E[Y|X]) (variance decomposition).
(ii) E[Var(Y|X)] ≥ E[Var(Y|X, Z)] for an additional R^l-valued random variable Z on the same space.

If Y is a vector with values in R^k, then

Var[Y|X] = E[(Y − E[Y|X])(Y − E[Y|X])′ | X]

is the conditional covariance matrix of Y given X. It holds for A ∈ R^{m×k} that

Var[AY|X] = A Var[Y|X] A′. (17)
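The decomposition in (i) is easy to verify numerically. The following sketch (distributions chosen arbitrarily for illustration) uses Y | X ∼ N(2X, X²) with X ∼ Uniform(1, 3), so E[Y|X] = 2X and Var(Y|X) = X²:

```python
import numpy as np

# Check Var(Y) = E[Var(Y|X)] + Var(E[Y|X]) for Y | X ~ N(2X, X^2),
# X ~ Uniform(1, 3): E[Y|X] = 2X, Var(Y|X) = X^2.
rng = np.random.default_rng(3)
n = 2_000_000
X = rng.uniform(1.0, 3.0, n)
Y = 2 * X + X * rng.normal(size=n)

lhs = Y.var()
rhs = (X**2).mean() + (2 * X).var()   # E[Var(Y|X)] + Var(E[Y|X])
print(lhs, rhs)                       # both approx 13/3 + 4/3 = 17/3
```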


Summary

Important concepts and statements:


definition and equivalent characterization of a conditional expectation
of Y given X, E [Y |X ],
conditional distribution of Y given X,
formulas for the computation of E[Y|X = x] for jointly discrete and jointly continuous X and Y,
laws for E [Y |X ],
definition and laws for Var [Y |X ], the conditional variance of Y given
X.


Chapter 4: Linear regression

1 The classic model


2 Parameter estimation: finite sample properties
3 Parameter estimation: asymptotic properties
4 Hypothesis tests in the classical linear regression model


4.1 The classic model


Definition 4.1 A (multiple) linear regression model based on n observations (Yi, Xi′), Xi′ = (Xi,1, . . . , Xi,K), i = 1, . . . , n, with (unknown) regression coefficients β1, . . . , βK is given by

Yi = β1 Xi,1 + · · · + βK Xi,K + εi, i = 1, . . . , n.

Matrix notation:

Y = Xβ + ε; (*)

X is called the design matrix; the εi are unobserved. Here Y = (Y1, . . . , Yn)′, β = (β1, . . . , βK)′, ε = (ε1, . . . , εn)′, and X is the n × K matrix with rows Xi′ = (Xi,1, . . . , Xi,K).


Classical linear regression model: Model assumptions I

Let Y and X satisfy the model (*) for some β ∈ R^K.

Model assumptions I:
1 n > K.
2 P(rank(X) = K) = 1 (no multicollinearity).
3 E[ε|X] = 0 (strict exogeneity).
4 Var[ε|X] = σ²In (homoscedasticity).
Consequently,

E[Y|X] = Xβ, Var[Y|X] = σ²In.

Remember: conditional expectations are only a.s. unique.


In most cases an intercept, i.e. a constant, is included in the model, i.e. Xi,1 = 1 for i = 1, . . . , n, and we get

Yi = β1 + β2 Xi,2 + · · · + βK Xi,K + εi, i = 1, . . . , n.

Let A be an n × K matrix with n > K and z ∈ R^K. Then

rank(A) = K ⇐⇒ (Az = 0 =⇒ z = 0) ⇐⇒ det(A′A) ≠ 0,
rank(A) < K ⇐⇒ ∃z ≠ 0 : Az = 0.

If rank(A) < K, then one column of A can be written as a linear combination of the other columns. If rank(X) < K in the linear model, then the parameter β is not uniquely specified by the model equation (*).


4.2 Parameter estimation: finite sample properties

Note that

E[Yi|X] = Xi′β a.s.

Hence for a given choice of β the prediction for Yi given Xi would be Xi′β, and

ei = Yi − β′Xi

is the prediction error or residual. Therefore,

Q(β) = Σ_{i=1}^n ei² = Σ_{i=1}^n (Yi − β′Xi)² = (Y − Xβ)′(Y − Xβ)

becomes small if the prediction errors are small.


Notation: Let f : R^d → R^m, i.e. f(x) = (f1(x), . . . , fm(x))′ and x = (x1, . . . , xd)′. Then the m × d matrix of partial derivatives

∂f/∂x (x0) = ( ∂fi/∂xj (x)|_{x=x0} )_{1≤i≤m, 1≤j≤d}

is the derivative of f w.r.t. x.

If f has a local minimum or maximum at x0 and is differentiable at x0, then ∂f/∂x (x0) = 0.

Let A ∈ R^{m×d} and B ∈ R^{d×d} be matrices. It can be shown:

∂/∂x (Ax) = A,   ∂/∂x (x′Bx) = x′(B + B′).
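These rules deliver the OLS first-order condition directly. The derivation below is a standard computation written out for reference (notation as above; it is not verbatim from the slides):

```latex
% Minimizing Q(beta) = (Y - X beta)'(Y - X beta) over beta:
\begin{align*}
Q(\beta) &= Y'Y - 2\beta'X'Y + \beta'X'X\beta,\\
\frac{\partial Q}{\partial \beta}
  &= -2Y'X + \beta'\bigl(X'X + (X'X)'\bigr)
   = -2Y'X + 2\beta'X'X \overset{!}{=} 0\\
\Longrightarrow\quad X'X\beta &= X'Y \qquad \text{(normal equations)},\\
\widehat\beta_{\mathrm{OLS}} &= (X'X)^{-1}X'Y
  \qquad \text{if } \det(X'X)\neq 0.
\end{align*}
```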


OLS estimator

Definition 4.2 In the classical linear regression model the OLS (ordinary least squares) estimator is defined by

β̂_OLS = argmin_{β ∈ R^K} (Y − Xβ)′(Y − Xβ).

Then the OLS-fitted y-values are

Ŷ = X β̂_OLS,  Ŷi = Xi′ β̂_OLS,

and the OLS residuals are

ê = Y − Ŷ,  êi = Yi − Ŷi.

Notation: 1n = (1, . . . , 1)′ ∈ R^n.


Theorem 4.3 If det(X′X) ≠ 0, then

β̂_OLS = (X′X)⁻¹ X′Y.

To emphasize the dependence of β̂_OLS on the sample size n we may write β̂_{n,OLS}.

Lemma 4.4 Under the conditions of Theorem 4.3,

X′ê = 0 (normal equations).

If the model contains a constant, e.g. Xi,1 = 1 for i = 1, . . . , n, then 1n′ ê = 0.
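A compact numerical sketch (simulated data, arbitrary coefficients) computes β̂_OLS as in Theorem 4.3 and confirms the normal equations of Lemma 4.4:

```python
import numpy as np

# OLS on simulated data; beta and the design are invented for illustration.
rng = np.random.default_rng(4)
n, beta = 500, np.array([1.0, 2.0, -0.5])
X = np.column_stack([np.ones(n),                  # constant: X_{i,1} = 1
                     rng.normal(size=(n, 2))])
Y = X @ beta + rng.normal(0.0, 1.0, n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)      # (X'X)^{-1} X'Y
e_hat = Y - X @ beta_hat
print(beta_hat)      # close to (1.0, 2.0, -0.5)
print(X.T @ e_hat)   # approx 0: normal equations, including 1_n' e_hat = 0
```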


A symmetric m × m matrix A is called positive semi-definite, in signs A ≥ 0, if x′Ax ≥ 0 for all x ∈ R^m. For two symmetric matrices A, B of the same dimension,

A ≥ B :⇔ A − B ≥ 0.

An estimator β̂ is called
linear iff β̂ = AY for some K × n matrix A = A(X). (A may depend on X, but not on Y.)
conditionally unbiased iff E[β̂|X] = β.
BLUE (best linear unbiased estimator) iff β̂ is linear and unbiased and

Var[β̃ | X] ≥ Var[β̂ | X]

for any other linear and unbiased estimator β̃.


Theorem 4.5 (Gauss-Markov theorem) In the classical linear regression model the OLS estimator is BLUE if Var[β̂_OLS|X] is finite.

If the linear model contains a constant, then

Σ_{i=1}^n (Yi − Ȳn)² = Σ_{i=1}^n (Ŷi − Ȳn)² + Σ_{i=1}^n êi²

(total variability of Y = variability of regression + variability of residuals).

The coefficient of determination is then defined by

R² = Σ_{i=1}^n (Ŷi − Ȳn)² / Σ_{i=1}^n (Yi − Ȳn)² = 1 − Σ_{i=1}^n êi² / Σ_{i=1}^n (Yi − Ȳn)² ∈ [0, 1].

R² = 1 iff êi = 0 for all i.



Estimation of σ²
Definition 4.6 If n > K, the OLS estimate of the variance σ² > 0 is given by

σ̂²_OLS = σ̂²_{n,OLS} = ê′ê / (n − K),

and √(σ̂²_OLS) is called the standard error of the regression (SER).
Note: For a square matrix C = (c_{i,j})_{1≤i,j≤m}, tr(C) = Σ_{i=1}^m c_{i,i} is the trace of C. It holds for a matrix Z = (Z_{i,j})_{1≤i,j≤m} of r.v. and matrices A ∈ R^{m×k} and B ∈ R^{k×m}:

E[tr(Z)] = tr(E[Z]),  tr(AB) = tr(BA).

Theorem 4.7 In the classical linear regression model σ̂²_OLS is a (conditionally) unbiased estimator for σ².
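The sketch below (simulated data, invented coefficients) computes R² both ways and the unbiased variance estimate of Definition 4.6:

```python
import numpy as np

# R^2 and the OLS variance estimate on simulated data (true sigma = 1.5).
rng = np.random.default_rng(5)
n, K = 400, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, K - 1))])
Y = X @ np.array([0.5, 1.0, -2.0]) + rng.normal(0.0, 1.5, n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
Y_hat = X @ beta_hat
e_hat = Y - Y_hat

tss = np.sum((Y - Y.mean())**2)
r2_a = np.sum((Y_hat - Y.mean())**2) / tss   # first formula
r2_b = 1 - np.sum(e_hat**2) / tss            # second formula, same value
sigma2_hat = e_hat @ e_hat / (n - K)         # approx sigma^2 = 2.25
print(r2_a, r2_b, sigma2_hat)
```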


4.3 Parameter estimation: asymptotic properties


Classical linear model: Model assumptions II

For the model

Yi = Xi′β + εi, i = 1, . . . , n.

Model assumptions II:
1 n > K.
2 P(rank(X) = K) = 1 (no multicollinearity).
3 E[εi|Xi] = 0 a.s. (strict exogeneity).
4 E[εi²|Xi] = σ² a.s. (homoscedasticity).
5 (X1′, ε1), . . . , (Xn′, εn) are i.i.d.


Theorem 4.8 Let E[Y²] < ∞ and X = (X1, . . . , Xk)′. Then g(X) = E[Y|X] iff

E[Y 1(X1 ∈ B1) · · · 1(Xk ∈ Bk)] = E[g(X) 1(X1 ∈ B1) · · · 1(Xk ∈ Bk)]

for all Borel sets B1, . . . , Bk. This is a weaker version of Theorem 3.2(iv).

Proposition 4.9 The model assumptions II imply

E[ε|X] = 0 and Var[ε|X] = σ²In.


Theorem 4.10 (Consistency and asymptotic normality of the OLS estimator) In the classical linear regression model with model assumptions II we assume that E[X1 X1′] is finite and invertible. Then

β̂_{n,OLS} → β in probability

and

√n (β̂_{n,OLS} − β) → Z ∼ N(0_K, Σ) in distribution, with Σ = σ² (E[X1 X1′])⁻¹.

Consequently,

√n (β̂_{n,OLS,k} − βk) → Zk ∼ N(0, Σ_{k,k}) in distribution.
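A Monte Carlo sketch (sample size, number of replications and model all invented for illustration) shows the theorem at work: across replications the standardized error √n(β̂_{n,OLS,k} − βk) behaves like a N(0, Σ_{k,k}) draw:

```python
import numpy as np

# Monte Carlo for sqrt(n)*(beta_hat_k - beta_k); here E[X1 X1'] = I_2 and
# sigma = 1, so Sigma_{2,2} = 1: mean approx 0, std approx 1.
rng = np.random.default_rng(6)
n, reps, beta = 200, 5000, np.array([1.0, 0.5])
draws = np.empty(reps)
for r in range(reps):
    X = np.column_stack([np.ones(n), rng.normal(size=n)])
    Y = X @ beta + rng.normal(size=n)
    beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
    draws[r] = np.sqrt(n) * (beta_hat[1] - beta[1])

print(draws.mean(), draws.std())
```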


Confidence intervals
Consider a statistical experiment with parameter space Θ and a parameter of interest τ = τ(θ).
Definition 4.11 Let Ln, Un be r.v. (not depending on θ) and α ∈ (0, 1). [Ln, Un] is called an asymptotic (1 − α)-confidence interval for τ iff

lim inf_{n→∞} Pθ(Ln ≤ τ ≤ Un) ≥ 1 − α for all θ ∈ Θ.

Lemma 4.12 Let √n(T̂n − τ) → Z ∼ N(0, σ²) in distribution and Ŝn → σ in probability. Then

[T̂n − z_{1−α/2} Ŝn/√n, T̂n + z_{1−α/2} Ŝn/√n]

is an asymptotic (1 − α)-confidence interval for τ.


Here zβ denotes the β-quantile of the standard normal distribution, i.e. Φ(zβ) = β for β ∈ (0, 1).

Theorem 4.13 Under the conditions of Theorem 4.10,

σ̂²_{n,OLS} → σ² in probability as n → ∞.

A consequence of the continuous mapping theorem, Theorem 2.20:

Lemma 4.14 Let Zn = (Z_{n,i,j})_{1≤i,j≤m} be matrices with random entries and Zn → A in probability for a matrix A with det(A) ≠ 0. Define Zn⁻ = Zn⁻¹ if det(Zn) ≠ 0 and Zn⁻ = Im else. Then Zn⁻ → A⁻¹ in probability.


By the law of large numbers

(1/n) X′X = (1/n) Σ_{i=1}^n Xi Xi′ → E[X1 X1′] in probability.

Consequently,

Σ̂n = σ̂²_{n,OLS} · ((1/n) X′X)⁻ → σ² · (E[X1 X1′])⁻¹ = Σ in probability.

By Lemma 4.12,

[β̂_{n,OLS,k} − z_{1−α/2} √(Σ̂_{n,k,k}/n), β̂_{n,OLS,k} + z_{1−α/2} √(Σ̂_{n,k,k}/n)]

is an asymptotic (1 − α)-confidence interval for βk.
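Putting the pieces together, here is a sketch of the interval for the slope in a simulated model (parameter values and seed arbitrary; scipy's norm.ppf supplies z_{1−α/2}):

```python
import numpy as np
from scipy.stats import norm

# Asymptotic 95% confidence interval for beta_k with
# Sigma_hat = sigma2_hat * (X'X / n)^{-1}, as on this slide.
rng = np.random.default_rng(7)
n, K, beta = 500, 2, np.array([1.0, 0.5])
X = np.column_stack([np.ones(n), rng.normal(size=n)])
Y = X @ beta + rng.normal(size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
e_hat = Y - X @ beta_hat
sigma2_hat = e_hat @ e_hat / (n - K)
Sigma_hat = sigma2_hat * np.linalg.inv(X.T @ X / n)

k, z = 1, norm.ppf(0.975)
half = z * np.sqrt(Sigma_hat[k, k] / n)
print(beta_hat[k] - half, beta_hat[k] + half)   # covers 0.5 in ~95% of runs
```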


4.4 Hypothesis tests in the classical linear regression model

Example 4.15 A company delivers packages of pasta to the canteen of the University of Mannheim and claims that the weight of a randomly chosen package is N(5, 0.5)-distributed. Based on a sample of size n we intend to decide whether
(a) the expected weight is at least 5,
(b) the assumption of normality is justified.
Question: How can we decide these problems properly?

Remark:
Tests to decide (a) are called parameter tests.
Tests to decide (b) are called goodness-of-fit tests.


Statistical tests

Let En = (R^{mn}, B^{mn}, {P^{Zn}_θ : θ ∈ Θ}) be a statistical experiment for some r.v. Zn with parameter space Θ, cf. p. 91.

Based on our data Zn we aim to decide a testing problem of the following form:

H0 : θ ∈ Θ0 ⊆ Θ vs. H1 : θ ∈ Θ1 = Θ\Θ0. (*)

Here, H0 is called the null hypothesis and H1 is referred to as the alternative (hypothesis).

E.g. if Θ0 = {θ0}, (*) can be rewritten as

H0 : θ = θ0 vs. H1 : θ ≠ θ0.


Definition 4.16 Let there be given a statistical experiment En and a testing problem (*). A (measurable) function ϕ : R^{mn} → {0, 1} is called a (non-randomized) (statistical) test if

ϕ(z) = 0 if Zn = z implies acceptance of H0,
ϕ(z) = 1 if Zn = z implies rejection of H0.

Often tests are given in the form

ϕ(Zn) = 0 if Tn ≤ c, and ϕ(Zn) = 1 if Tn > c.

Then Tn = g(Zn) is called the test statistic and c the critical value of the test ϕ.


Definition 4.17 Let ϕ be a test to decide the problem (*).
(i) A type I error (error of first kind) occurs when H0 is true but rejected.
(ii) A type II error (error of second kind) occurs when H1 is true but H0 is accepted (i.e. H1 is rejected).

Decision scheme:
                    Decision for H0    Decision for H1
H0 is true          correct            type I error
H1 is true          type II error      correct


Definition 4.18 In the set-up of Definition 4.17,
(i) a test ϕ is called an α-test if

Eθ[ϕ(Zn)] = Pθ(ϕ(Zn) = 1) ≤ α for all θ ∈ Θ0.

(ii) a sequence of tests (ϕn)n is called consistent if

Pθ(ϕn(Zn) = 1) → 1 as n → ∞ for all θ ∈ Θ1.

(iii) a sequence of tests (ϕn)n is called an asymptotic α-test if

lim sup_{n→∞} Pθ(ϕn(Zn) = 1) ≤ α for all θ ∈ Θ0.


Test in the classic linear regression model

In the linear regression model (Definition 4.1),

Yi = β1 Xi,1 + · · · + βK Xi,K + εi , i = 1, . . . , n,

we might want to test:

H0 : βj = 0 or H0 : β2 = · · · = βK = 0.

More generally, we want to test the following null hypothesis

H0 : Rβ = θ0 vs. H1 : Rβ 6= θ0

for some prescribed (r × K )-matrix R of rank r and a prescribed


r -dimensional vector θ0 .


Example 4.19
This general hypothesis covers several interesting special cases.
1 H0 : βk = 0 with R = (0, . . . , 0, 1, 0, . . . , 0) (the 1 in the k-th position) and θ0 = 0.
2 H0 : β1 = β2 with R = (1, −1, 0, . . . , 0) and θ0 = 0.
3 H0 : β1 + β2 + β3 = 1 with R = (1, 1, 1, 0, . . . , 0) and θ0 = 1.
4 H0 : β1 = β2 = β3 = 0 with the 3 × K matrix

R = ( 1 0 0 0 . . . 0
      0 1 0 0 . . . 0
      0 0 1 0 . . . 0 )

and θ0 = (0, 0, 0)′.


Ideas to proceed:
Put τ = Rβ − θ0, i.e. H0 is true iff τ = 0_r (equivalently, ‖τ‖ = 0).
Estimate τ by τ̂n = R β̂_{n,OLS} − θ0.
Find some distance function d : R^r → [0, ∞) s.th. d(0_r) = 0 and d(x) is "large" if ‖x‖ is "large".
Decision rule: reject H0 if Tn = d(τ̂n) > c (is "large").
Determine c s.th. the decision rule becomes an α-test.

For the last step we need to specify the distribution of d(τ̂n), at least approximately.


χ²-distribution
Definition 4.20 X∗ is χ²-distributed with k degrees of freedom, k ∈ N, k ≥ 1, if X∗ is continuous with density

f_{χ²_k}(x) = 1/(2^{k/2} Γ(k/2)) · x^{k/2−1} exp(−x/2) · 1_{[0,∞)}(x),

where Γ denotes the so-called Gamma function, defined by

Γ(a) = ∫_0^∞ x^{a−1} e^{−x} dx, a > 0.

Let F_{χ²_k} denote the CDF of X∗ and χ²_{k,α} its α-quantile, α ∈ (0, 1), i.e. F_{χ²_k}(χ²_{k,α}) = α.

Note (see next page): If X1, . . . , Xk ∼ N(0, 1) are i.i.d., then

Σ_{i=1}^k Xi² ∼ χ²(k).


Multivariate normal distribution

Let Z ∼ N(µ, Σ) be a multivariate normally distributed k-dimensional random vector, cf. p. 90, with det(Σ) ≠ 0, i.e. Z is continuous with density

φ(z) = ((2π)^k det Σ)^{−1/2} exp(−0.5 (z − µ)′ Σ⁻¹ (z − µ)), z ∈ R^k.

Proposition 4.21
Let Z ∼ N(µ, Σ), a ∈ R^m and B ∈ R^{m×k}. Then:
1 a + BZ ∼ N(a + Bµ, BΣB′).
2 Z = (Z1, . . . , Zk)′ ∼ N(0_k, I_k) iff Z1, . . . , Zk ∼ N(0, 1) are i.i.d.
3 If µ = 0, then Z′Σ⁻¹Z ∼ χ²(k).

See Jacod, Protter (2000), Chapter 16.


Convergence to infinity
Let (Zn)n be r.v. with values in R^k. (Zn) converges in probability to ∞ if for all C > 0

lim inf_{n→∞} P(‖Zn‖ > C) = 1.

Note: If (zn) is a sequence of non-random vectors with ‖zn‖ → ∞, then zn → ∞ in probability holds as well. For the notation A⁻ see Lemma 4.14.

Lemma 4.22 Let Xn, Zn be r.v. with values in R^k and Σ̂n, Σ k × k matrices, Σ̂n with random entries.
(a) If Zn → ∞ in probability and Xn = O_P(1), then Xn + Zn → ∞ in probability.
(b) If Zn → ∞ in probability and Σ̂n → Σ > 0 in probability, then Zn′ Σ̂n⁻ Zn → ∞ in probability.


Wald test for H0 : Rβ = θ0
Put

Tn = (R β̂_{n,OLS} − θ0)′ (R (X′X)⁻ R′)⁻¹ (R β̂_{n,OLS} − θ0) / σ̂²_{n,OLS}.

Definition 4.23 A Wald test of level α ∈ (0, 1) for H0 : Rβ = θ0 based on n observations Zi = (Yi, Xi,1, . . . , Xi,K)′, i = 1, 2, . . . , n, is given by

ϕn((Z1, . . . , Zn)) = 1, if Tn > χ²_{r,1−α}; 0, else, (**)

where χ²_{r,1−α} denotes the (1 − α)-quantile of the χ²_r distribution.

Theorem 4.24 Suppose the conditions of Theorem 4.10 hold. Then (ϕn)n from (**) is an asymptotic α-test and consistent.
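A sketch of the test on simulated data (design, coefficients and seed invented; scipy's chi2.ppf supplies the critical value) for H0 : β2 = β3 = 0, i.e. r = 2 restrictions, in a model where H0 actually holds:

```python
import numpy as np
from scipy.stats import chi2

# Wald test of Definition 4.23 for H0: beta_2 = beta_3 = 0 (true here),
# so rejections should occur in roughly alpha of repeated samples.
rng = np.random.default_rng(8)
n, alpha = 500, 0.05
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
Y = X @ np.array([1.0, 0.0, 0.0]) + rng.normal(size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
e_hat = Y - X @ beta_hat
sigma2_hat = e_hat @ e_hat / (n - X.shape[1])

R = np.array([[0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]])                 # r = 2 restrictions
theta0 = np.zeros(2)
tau_hat = R @ beta_hat - theta0
M = R @ np.linalg.inv(X.T @ X) @ R.T            # R (X'X)^{-1} R'
Tn = tau_hat @ np.linalg.solve(M, tau_hat) / sigma2_hat
crit = chi2.ppf(1 - alpha, df=2)                # chi^2_{r, 1-alpha}
print(Tn, crit, Tn > crit)                      # reject iff Tn > crit
```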


Summary

Important concepts and statements:
In the classical linear model we assume a linear relationship between Yi and Xi, and strict exogeneity and homoscedasticity of the error terms εi.
β is estimated by the method of ordinary least squares. β̂_OLS is linear, conditionally unbiased and optimal in the sense of the Gauss-Markov theorem.
Under the model assumptions II, β̂_OLS is consistent and asymptotically normally distributed.
Important concepts: test, test statistic, α-test, type I and II errors, consistency of tests.
The Wald test for H0 : Rβ = θ0 is an asymptotic α-test and consistent.
