Goals
In the kiss example, the estimator was intuitively the right thing to do: $\hat{p} = \bar{X}_n$.
In view of the LLN, since $p = \mathbb{E}[X]$, we have $\bar{X}_n \xrightarrow[n\to\infty]{\mathbb{P}} p$, so $\hat{p} \approx p$ for $n$ large enough.
What if the parameter of interest is some $\theta \neq \mathbb{E}[X]$? How do we proceed?
1. Maximum likelihood estimation: a generic approach with very good properties
2. Method of moments: a (fairly) generic and easy approach
3. M-estimators: a flexible approach, close to machine learning
Total variation distance
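For reference, the standard definition: for two probability measures $\mathbb{P}_\theta$ and $\mathbb{P}_{\theta'}$ on a sample space $E$,
\[
\mathrm{TV}(\mathbb{P}_\theta, \mathbb{P}_{\theta'}) = \max_{A \subseteq E} \bigl|\mathbb{P}_\theta(A) - \mathbb{P}_{\theta'}(A)\bigr|.
\]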
Total variation distance between discrete measures
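In the discrete case, the maximum over events reduces to a sum over points: writing $p_\theta$ and $p_{\theta'}$ for the respective pmfs, the standard identity is
\[
\mathrm{TV}(\mathbb{P}_\theta, \mathbb{P}_{\theta'}) = \frac{1}{2} \sum_{x \in E} \bigl|p_\theta(x) - p_{\theta'}(x)\bigr|.
\]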
Total variation distance between continuous measures
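Similarly, when $\mathbb{P}_\theta$ and $\mathbb{P}_{\theta'}$ have densities $f_\theta$ and $f_{\theta'}$, the standard identity is
\[
\mathrm{TV}(\mathbb{P}_\theta, \mathbb{P}_{\theta'}) = \frac{1}{2} \int_E \bigl|f_\theta(x) - f_{\theta'}(x)\bigr|\, dx.
\]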
Properties of Total variation
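For reference, the classical properties:
▶ symmetric: $\mathrm{TV}(\mathbb{P}, \mathbb{Q}) = \mathrm{TV}(\mathbb{Q}, \mathbb{P})$;
▶ bounded: $0 \le \mathrm{TV}(\mathbb{P}, \mathbb{Q}) \le 1$;
▶ triangle inequality: $\mathrm{TV}(\mathbb{P}, \mathbb{Q}) \le \mathrm{TV}(\mathbb{P}, \mathbb{R}) + \mathrm{TV}(\mathbb{R}, \mathbb{Q})$;
▶ definite: $\mathrm{TV}(\mathbb{P}, \mathbb{Q}) = 0$ if and only if $\mathbb{P} = \mathbb{Q}$.
In particular, TV is a distance between probability measures.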
Exercises
Compute:
a) TV(Ber(0.5), Ber(0.1)) =
b) TV(Ber(0.5), Ber(0.9)) =
d) TV$(X, X + a)$ = , for any $a \in (0, 1)$, where $X \sim \mathrm{Ber}(0.5)$
e) TV$\bigl(2\sqrt{n}(\bar{X}_n - 1/2),\, Z\bigr)$ = , where $X_i \overset{\text{i.i.d.}}{\sim} \mathrm{Ber}(0.5)$ and $Z \sim \mathcal{N}(0, 1)$
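A minimal numerical sketch for parts a) and b), assuming the discrete formula above; the helper `tv_bernoulli` is illustrative, not part of the course material.

```python
# TV between Ber(p) and Ber(q) via (1/2) * sum over E = {0, 1}
def tv_bernoulli(p, q):
    return 0.5 * (abs(p - q) + abs((1 - p) - (1 - q)))

print(tv_bernoulli(0.5, 0.1))  # 0.4
print(tv_bernoulli(0.5, 0.9))  # 0.4
```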
An estimation strategy
Build an estimator $\widehat{\mathrm{TV}}(\mathbb{P}_\theta, \mathbb{P}_{\theta^*})$ for all $\theta \in \Theta$. Then find $\hat{\theta}$ that minimizes the function $\theta \mapsto \widehat{\mathrm{TV}}(\mathbb{P}_\theta, \mathbb{P}_{\theta^*})$.

Problem: it is unclear how to build $\widehat{\mathrm{TV}}(\mathbb{P}_\theta, \mathbb{P}_{\theta^*})$!
Kullback-Leibler (KL) divergence
Definition
The Kullback-Leibler¹ (KL) divergence between two probability measures $\mathbb{P}_\theta$ and $\mathbb{P}_{\theta'}$ is defined by
\[
\mathrm{KL}(\mathbb{P}_\theta, \mathbb{P}_{\theta'}) =
\begin{cases}
\displaystyle \sum_{x \in E} p_\theta(x) \log\Bigl(\frac{p_\theta(x)}{p_{\theta'}(x)}\Bigr) & \text{if } E \text{ is discrete} \\[2ex]
\displaystyle \int_E p_\theta(x) \log\Bigl(\frac{p_\theta(x)}{p_{\theta'}(x)}\Bigr)\, dx & \text{if } E \text{ is continuous}
\end{cases}
\]

¹ The KL divergence is also known as "relative entropy".
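A quick numerical sketch of the discrete formula for two Bernoulli measures; it also previews the asymmetry discussed on the next slide (the helper is illustrative).

```python
import math

# KL(Ber(p), Ber(q)) from the discrete formula above
def kl_bernoulli(p, q):
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

print(kl_bernoulli(0.5, 0.1))  # ~0.511
print(kl_bernoulli(0.1, 0.5))  # ~0.368: KL is not symmetric
```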
Properties of KL-divergence
▶ $\mathrm{KL}(\mathbb{P}_\theta, \mathbb{P}_{\theta'}) \neq \mathrm{KL}(\mathbb{P}_{\theta'}, \mathbb{P}_\theta)$ in general, and the triangle inequality fails: KL is not a distance.
▶ $\mathrm{KL}(\mathbb{P}_\theta, \mathbb{P}_{\theta'}) \ge 0$, with equality if and only if $\mathbb{P}_\theta = \mathbb{P}_{\theta'}$.
This is why it is called a divergence rather than a distance.
Maximum likelihood estimation
Estimating the KL
\[
\mathrm{KL}(\mathbb{P}_{\theta^*}, \mathbb{P}_\theta)
 = \mathbb{E}_{\theta^*}\Bigl[\log\Bigl(\frac{p_{\theta^*}(X)}{p_\theta(X)}\Bigr)\Bigr]
 = \mathbb{E}_{\theta^*}\bigl[\log p_{\theta^*}(X)\bigr] - \mathbb{E}_{\theta^*}\bigl[\log p_\theta(X)\bigr]
\]
The first term does not depend on $\theta$, and each expectation can be estimated by the LLN:
\[
\mathbb{E}_{\theta^*}[h(X)] \approx \frac{1}{n} \sum_{i=1}^n h(X_i)
\]
Hence
\[
\widehat{\mathrm{KL}}(\mathbb{P}_{\theta^*}, \mathbb{P}_\theta) = \text{``constant''} - \frac{1}{n} \sum_{i=1}^n \log p_\theta(X_i)
\]
Maximum likelihood
\[
\widehat{\mathrm{KL}}(\mathbb{P}_{\theta^*}, \mathbb{P}_\theta) = \text{``constant''} - \frac{1}{n} \sum_{i=1}^n \log p_\theta(X_i)
\]
\[
\min_{\theta \in \Theta} \widehat{\mathrm{KL}}(\mathbb{P}_{\theta^*}, \mathbb{P}_\theta)
 \iff \min_{\theta \in \Theta} -\frac{1}{n} \sum_{i=1}^n \log p_\theta(X_i)
 \iff \max_{\theta \in \Theta} \prod_{i=1}^n p_\theta(X_i)
\]

Definition
The likelihood of the model is the map $L_n$ (or just $L$) defined as:
\[
L_n : E^n \times \Theta \to \mathbb{R}, \qquad
(x_1, \ldots, x_n, \theta) \mapsto \mathbb{P}_\theta[X_1 = x_1, \ldots, X_n = x_n].
\]
Likelihood for the Bernoulli model
Example 1 (Bernoulli trials): If $X_1, \ldots, X_n \overset{\text{iid}}{\sim} \mathrm{Ber}(p)$ for some $p \in (0, 1)$:
▶ $E = \{0, 1\}$;
▶ $\Theta = (0, 1)$;
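A worked form of the likelihood (the standard computation for this model):
\[
L(x_1, \ldots, x_n, p) = \prod_{i=1}^n p^{x_i} (1 - p)^{1 - x_i} = p^{\sum_i x_i} (1 - p)^{n - \sum_i x_i}.
\]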
Likelihood for the Poisson model
Example (Poisson model): If $X_1, \ldots, X_n \overset{\text{iid}}{\sim} \mathrm{Poiss}(\lambda)$ for some $\lambda > 0$:
▶ $E = \mathbb{N}$;
▶ $\Theta = (0, \infty)$;
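A worked form of the likelihood (the standard computation for this model):
\[
L(x_1, \ldots, x_n, \lambda)
 = \prod_{i=1}^n e^{-\lambda} \frac{\lambda^{x_i}}{x_i!}
 = e^{-n\lambda}\, \frac{\lambda^{\sum_i x_i}}{x_1! \cdots x_n!}.
\]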
Likelihood, Continuous case
Definition
\[
L : E^n \times \Theta \to \mathbb{R}, \qquad
(x_1, \ldots, x_n, \theta) \mapsto \prod_{i=1}^n f_\theta(x_i),
\]
where $f_\theta$ denotes the density of $\mathbb{P}_\theta$.
Likelihood for the Gaussian model
Example (Gaussian model): If $X_1, \ldots, X_n \overset{\text{iid}}{\sim} \mathcal{N}(\mu, \sigma^2)$ for some $\mu \in \mathbb{R}$, $\sigma^2 > 0$:
▶ $E = \mathbb{R}$;
▶ $\Theta = \mathbb{R} \times (0, \infty)$.
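A worked form of the likelihood (the standard computation for this model):
\[
L(x_1, \ldots, x_n, \mu, \sigma^2)
 = \prod_{i=1}^n \frac{1}{\sigma\sqrt{2\pi}} \exp\Bigl(-\frac{(x_i - \mu)^2}{2\sigma^2}\Bigr)
 = (2\pi\sigma^2)^{-n/2} \exp\Bigl(-\frac{1}{2\sigma^2} \sum_{i=1}^n (x_i - \mu)^2\Bigr).
\]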
Exercises
a) What is E?
b) What is $\Theta$?
Exercise
b) What is $\Theta$?
Maximum likelihood estimator
Definition
The maximum likelihood estimator of $\theta$ is defined as:
\[
\hat{\theta}_n^{\mathrm{MLE}} = \underset{\theta \in \Theta}{\operatorname{argmax}}\; L(X_1, \ldots, X_n, \theta),
\]
provided it exists.
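A minimal sketch of the definition in action for the Bernoulli model; the grid search is purely illustrative (here the argmax also has the closed form $\bar{X}_n$), and the simulation parameters are arbitrary.

```python
import numpy as np

# Maximize the Bernoulli log-likelihood over a grid of candidate p's:
# log L = sum(x) * log(p) + (n - sum(x)) * log(1 - p)
rng = np.random.default_rng(0)
x = rng.binomial(1, 0.3, size=1000)           # simulated Ber(0.3) sample

grid = np.linspace(0.001, 0.999, 999)
loglik = x.sum() * np.log(grid) + (len(x) - x.sum()) * np.log(1 - grid)
p_hat = grid[np.argmax(loglik)]
print(p_hat, x.mean())                         # both close to 0.3
```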
Interlude: maximizing/minimizing functions
Note that
\[
\min_{\theta \in \Theta} h(\theta) \iff \max_{\theta \in \Theta} -h(\theta)
\]
Example: $\theta \mapsto \prod_{i=1}^n (\theta - X_i)$
Concave and convex functions
Definition
A twice differentiable function $h : \Theta \subset \mathbb{R} \to \mathbb{R}$ is said to be concave if its second derivative satisfies
\[
h''(\theta) \le 0, \quad \forall\, \theta \in \Theta.
\]
▶ In the multivariate case $h : \Theta \subset \mathbb{R}^d \to \mathbb{R}$, the second derivative is replaced by the Hessian matrix:
\[
\nabla^2 h(\theta) =
\begin{pmatrix}
\dfrac{\partial^2 h}{\partial \theta_1 \partial \theta_1}(\theta) & \cdots & \dfrac{\partial^2 h}{\partial \theta_1 \partial \theta_d}(\theta) \\
\vdots & \ddots & \vdots \\
\dfrac{\partial^2 h}{\partial \theta_d \partial \theta_1}(\theta) & \cdots & \dfrac{\partial^2 h}{\partial \theta_d \partial \theta_d}(\theta)
\end{pmatrix}
\in \mathbb{R}^{d \times d}
\]
Exercises
Examples of maximum likelihood estimators
▶ Poisson model: $\hat{\lambda}_n^{\mathrm{MLE}} = \bar{X}_n$.
▶ Gaussian model: $\bigl(\hat{\mu}_n, \hat{\sigma}_n^2\bigr) = \bigl(\bar{X}_n, \hat{S}_n\bigr)$, where $\hat{S}_n = \frac{1}{n}\sum_{i=1}^n (X_i - \bar{X}_n)^2$ is the empirical variance.
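A quick numeric sanity check of the Poisson claim on simulated data (an illustrative sketch; the parameter values are arbitrary).

```python
import numpy as np

# The Poisson log-likelihood is -n*lam + sum(x)*log(lam) - sum(log(x_i!));
# it is maximized at lam = mean(x). Grid check, dropping the constant term.
rng = np.random.default_rng(1)
x = rng.poisson(4.0, size=2000)

lams = np.linspace(0.1, 10, 1000)
loglik = -len(x) * lams + x.sum() * np.log(lams)
print(lams[np.argmax(loglik)], x.mean())   # both close to 4
```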
Consistency of maximum likelihood estimator
\[
\hat{\theta}_n^{\mathrm{MLE}} \xrightarrow[n\to\infty]{\mathbb{P}} \theta^*
\]
Covariance
\[
\mathrm{Cov}(X, Y)
 = \mathbb{E}\bigl[(X - \mathbb{E}[X])(Y - \mathbb{E}[Y])\bigr]
 = \mathbb{E}[X \cdot Y] - \mathbb{E}[X]\, \mathbb{E}[Y]
 = \mathbb{E}\bigl[X \cdot (Y - \mathbb{E}[Y])\bigr]
\]
Properties
▶ $\mathrm{Cov}(X, X) = \mathrm{var}(X)$
▶ $\mathrm{Cov}(X, Y) = \mathrm{Cov}(Y, X)$
▶ If $X$ and $Y$ are independent, then $\mathrm{Cov}(X, Y) = 0$
▶ $\mathrm{Cov}(AX + B) = A\, \mathrm{Cov}(X)\, A^\top$ for a random vector $X$, a deterministic matrix $A$ and a deterministic vector $B$
The multivariate Gaussian distribution
If $(X, Y)^\top$ is a Gaussian vector³ then its pdf depends on 5 parameters:
\[
\mathbb{E}[X], \quad \mathbb{E}[Y], \quad \mathrm{var}(X), \quad \mathrm{var}(Y), \quad \text{and} \quad \mathrm{Cov}(X, Y).
\]
³ As before, this means that $\alpha^\top X$ is Gaussian for any $\alpha \in \mathbb{R}^d$, $\alpha \neq 0$.
The multivariate CLT
Equivalently,
\[
\sqrt{n}\, \Sigma^{-1/2} \bigl(\bar{X}_n - \mathbb{E}[X]\bigr) \xrightarrow[n\to\infty]{(d)} \mathcal{N}_d(0, I_d)
\]
Multivariate Delta method
Let $(T_n)_{n \ge 1}$ be a sequence of random vectors in $\mathbb{R}^d$ such that
\[
\sqrt{n}\, (T_n - \theta) \xrightarrow[n\to\infty]{(d)} \mathcal{N}_d(0, \Sigma).
\]
Then, for any map $g : \mathbb{R}^d \to \mathbb{R}^k$ that is continuously differentiable at $\theta$,
\[
\sqrt{n}\, \bigl(g(T_n) - g(\theta)\bigr) \xrightarrow[n\to\infty]{(d)} \mathcal{N}_k\bigl(0,\; \nabla g(\theta)^\top\, \Sigma\, \nabla g(\theta)\bigr),
\]
where $\nabla g(\theta) = \dfrac{\partial g}{\partial \theta}(\theta) = \Bigl(\dfrac{\partial g_j}{\partial \theta_i}(\theta)\Bigr)_{\substack{1 \le i \le d \\ 1 \le j \le k}} \in \mathbb{R}^{d \times k}$.
Fisher Information
If $\Theta \subset \mathbb{R}$, we get:
\[
I(\theta) = \mathrm{var}\bigl[\ell'(\theta)\bigr] = -\mathbb{E}\bigl[\ell''(\theta)\bigr],
\]
where $\ell(\theta) = \log p_\theta(X)$ is the log-likelihood of one observation.
Fisher information of the Bernoulli experiment
Let $X \sim \mathrm{Ber}(p)$.
\[
\ell(p) = X \log p + (1 - X) \log(1 - p)
\]
\[
\ell'(p) = \frac{X}{p} - \frac{1 - X}{1 - p}, \qquad \mathrm{var}\bigl[\ell'(p)\bigr] = \frac{1}{p(1 - p)}
\]
\[
\ell''(p) = -\frac{X}{p^2} - \frac{1 - X}{(1 - p)^2}, \qquad \mathbb{E}\bigl[\ell''(p)\bigr] = -\frac{1}{p(1 - p)}
\]
Hence $I(p) = \dfrac{1}{p(1 - p)}$.
Asymptotic normality of the MLE
Theorem
Let $\theta^* \in \Theta$ be the true parameter. Under suitable regularity conditions, $\hat{\theta}_n^{\mathrm{MLE}}$ satisfies:
▶ $\hat{\theta}_n^{\mathrm{MLE}} \xrightarrow[n\to\infty]{\mathbb{P}} \theta^*$ w.r.t. $\mathbb{P}_{\theta^*}$;
▶ $\sqrt{n}\, \bigl(\hat{\theta}_n^{\mathrm{MLE}} - \theta^*\bigr) \xrightarrow[n\to\infty]{(d)} \mathcal{N}\bigl(0, I(\theta^*)^{-1}\bigr)$ w.r.t. $\mathbb{P}_{\theta^*}$.
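A small Monte Carlo sketch of the second bullet for the Bernoulli model, where $I(p) = 1/(p(1-p))$; the simulation parameters are arbitrary.

```python
import numpy as np

# For Ber(p), sqrt(n)*(p_hat - p) should have variance close to
# 1/I(p) = p*(1-p). Check over many replications of p_hat.
rng = np.random.default_rng(2)
p, n = 0.3, 500
p_hats = rng.binomial(n, p, size=20000) / n   # 20000 replications of p_hat
print(np.var(np.sqrt(n) * (p_hats - p)))      # ~0.21
print(p * (1 - p))                            # 0.21
```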
The method of moments
Moments
Let $X_1, \ldots, X_n$ be an i.i.d. sample associated with a statistical model $\bigl(E, (\mathbb{P}_\theta)_{\theta \in \Theta}\bigr)$ with $\Theta \subseteq \mathbb{R}^d$. Write $m_k(\theta) = \mathbb{E}_\theta[X_1^k]$ for the population moments and $\hat{m}_k = \frac{1}{n}\sum_{i=1}^n X_i^k$ for the empirical moments.
▶ From the LLN,
\[
\hat{m}_k \xrightarrow[n\to\infty]{\mathbb{P}/\text{a.s.}} m_k(\theta).
\]

Definition
Moments estimator of $\theta$:
\[
\hat{\theta}_n^{\mathrm{MM}} = M^{-1}\bigl(\hat{m}_1, \ldots, \hat{m}_d\bigr),
\]
provided it exists, where $M(\theta) = \bigl(m_1(\theta), \ldots, m_d(\theta)\bigr)$.
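A minimal sketch of the method of moments for $\mathcal{N}(\mu, \sigma^2)$, where $m_1 = \mu$ and $m_2 = \mu^2 + \sigma^2$, so $M^{-1}(m_1, m_2) = (m_1, m_2 - m_1^2)$; the simulation values are arbitrary.

```python
import numpy as np

# Match the first two empirical moments to (mu, mu^2 + sigma^2).
rng = np.random.default_rng(3)
x = rng.normal(2.0, 1.5, size=5000)

m1_hat, m2_hat = x.mean(), (x**2).mean()
mu_hat, sigma2_hat = m1_hat, m2_hat - m1_hat**2
print(mu_hat, sigma2_hat)   # close to 2.0 and 2.25
```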
Statistical analysis
Method of moments (5)
Remark: The method of moments can be extended to more general moments, even when $E \not\subset \mathbb{R}$.

Theorem
\[
\sqrt{n}\, \bigl(\hat{\theta}_n^{\mathrm{MM}} - \theta\bigr) \xrightarrow[n\to\infty]{(d)} \mathcal{N}\bigl(0, \Gamma(\theta)\bigr) \quad (\text{w.r.t. } \mathbb{P}_\theta),
\]
where
\[
\Gamma(\theta) = \Bigl[\frac{\partial M}{\partial \theta}(\theta)\Bigr]^{-1} \Sigma(\theta)\, \Bigl(\Bigl[\frac{\partial M}{\partial \theta}(\theta)\Bigr]^{-1}\Bigr)^{\!\top}
\]
and $\Sigma(\theta)$ is the covariance matrix of the vector $\bigl(X_1, X_1^2, \ldots, X_1^d\bigr)$.
MLE vs. Moment estimator
M-estimation
M-estimators
Idea:
▶ Let $X_1, \ldots, X_n$ be i.i.d. with some unknown distribution $\mathbb{P}$ in some sample space $E$ ($E \subseteq \mathbb{R}^d$ for some $d \ge 1$).
▶ No statistical model needs to be assumed (similar to ML).
▶ Goal: estimate a parameter of interest $\mu^*$ attached to $\mathbb{P}$, defined as the minimizer of $Q(\mu) = \mathbb{E}\bigl[\rho(X_1, \mu)\bigr]$ over $\mu \in \mathcal{M}$, for a suitable loss function $\rho : E \times \mathcal{M} \to \mathbb{R}$.
Examples (1)
Examples (2)
If $E = \mathcal{M} = \mathbb{R}$, $\alpha \in (0, 1)$ is fixed and $\rho(x, \mu) = C_\alpha(x - \mu)$ for all $x \in \mathbb{R}$, $\mu \in \mathbb{R}$: $\mu^*$ is an $\alpha$-quantile of $\mathbb{P}$.

Check function:
\[
C_\alpha(x) =
\begin{cases}
-(1 - \alpha)\, x & \text{if } x < 0 \\
\alpha\, x & \text{if } x \ge 0.
\end{cases}
\]
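A minimal sketch checking this claim: minimizing the empirical check loss over a grid recovers (approximately) the sample $\alpha$-quantile. The grid bounds and the sample are arbitrary choices for illustration.

```python
import numpy as np

# C_alpha applied coordinatewise: -(1-alpha)*u for u < 0, alpha*u for u >= 0
def check_loss(u, alpha):
    return np.where(u < 0, -(1 - alpha) * u, alpha * u)

rng = np.random.default_rng(4)
x = rng.normal(size=5000)
alpha = 0.9

grid = np.linspace(-3, 3, 1201)
risks = [check_loss(x - m, alpha).mean() for m in grid]
print(grid[np.argmin(risks)], np.quantile(x, alpha))   # both ~ 1.28
```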
MLE is an M-estimator
Theorem
In a statistical model $\bigl(E, (\mathbb{P}_\theta)_{\theta \in \Theta}\bigr)$ with densities (or pmfs) $p_\theta$, the MLE is the M-estimator associated with the loss $\rho(x, \theta) = -\log p_\theta(x)$.
Definition
The M-estimator associated with $\rho$ is the minimizer of the empirical counterpart of $Q$:
\[
\hat{\mu}_n = \underset{\mu \in \mathcal{M}}{\operatorname{argmin}}\; \frac{1}{n} \sum_{i=1}^n \rho(X_i, \mu),
\]
provided it exists.
Statistical analysis
▶ Let $J(\mu) = \dfrac{\partial^2 Q}{\partial \mu\, \partial \mu^\top}(\mu)$ $\Bigl(= \mathbb{E}\Bigl[\dfrac{\partial^2 \rho}{\partial \mu\, \partial \mu^\top}(X_1, \mu)\Bigr]$ under suitable regularity conditions$\Bigr)$.
▶ Let $K(\mu) = \mathrm{Cov}\Bigl[\dfrac{\partial \rho}{\partial \mu}(X_1, \mu)\Bigr]$.
In the log-likelihood case $\rho(x, \theta) = -\log p_\theta(x)$: $J(\theta) = K(\theta) = I(\theta)$, the Fisher information.
Asymptotic normality
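A minimal statement of the standard result presented here, assuming the regularity conditions under which $J(\mu^*)$ is invertible (the "sandwich" formula):
\[
\sqrt{n}\, \bigl(\hat{\mu}_n - \mu^*\bigr) \xrightarrow[n\to\infty]{(d)} \mathcal{N}\bigl(0,\; J(\mu^*)^{-1}\, K(\mu^*)\, J(\mu^*)^{-1}\bigr).
\]
In the log-likelihood case this reduces to $\mathcal{N}\bigl(0, I(\theta^*)^{-1}\bigr)$, recovering the asymptotic normality of the MLE.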
M-estimators in robust statistics
How to estimate m?
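A minimal sketch of the robustness point, taking the median as the M-estimator with $\rho(x, \mu) = |x - \mu|$; the contamination setup is illustrative.

```python
import numpy as np

# Sample with 1% gross outliers: the median (M-estimator for |x - mu|)
# barely moves, while the mean is dragged toward the outliers.
rng = np.random.default_rng(5)
x = np.concatenate([rng.normal(0, 1, 990), np.full(10, 100.0)])

grid = np.linspace(-3, 3, 1201)
risks = [np.abs(x - m).mean() for m in grid]
print(grid[np.argmin(risks)], np.median(x))   # both ~ 0
print(x.mean())                               # ~ 1.0
```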
Recap