Total variation distance (1)
Let (E, (IPθ)θ∈Θ) be a statistical model associated with a sample of i.i.d. r.v. X1, . . . , Xn. Assume that there exists θ∗ ∈ Θ such that X1 ∼ IPθ∗: θ∗ is the true parameter.
Definition
The total variation distance between two probability measures IPθ and IPθ′ is defined by

TV(IPθ, IPθ′) = max_{A⊆E} |IPθ(A) − IPθ′(A)|.
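The definition above can be checked numerically on a small finite space: take the maximum discrepancy over all events, and compare with the standard half-L1 formula for discrete measures. A minimal sketch with made-up pmfs (the numbers are illustrative, not from the slides):

```python
from itertools import combinations

# Two illustrative pmfs on a small sample space E = {0, 1, 2} (made-up numbers)
E = [0, 1, 2]
p = {0: 0.2, 1: 0.5, 2: 0.3}   # pmf of IP_theta
q = {0: 0.4, 1: 0.4, 2: 0.2}   # pmf of IP_theta'

# TV as the largest discrepancy in probability over all events A of E
tv_max = max(
    abs(sum(p[x] for x in A) - sum(q[x] for x in A))
    for r in range(len(E) + 1)
    for A in combinations(E, r)
)

# Standard equivalent formula in the discrete case: half the L1 distance of the pmfs
tv_half_l1 = 0.5 * sum(abs(p[x] - q[x]) for x in E)

print(tv_max, tv_half_l1)  # the two computations agree
```

Enumerating all 2^|E| events is only feasible for tiny spaces, which is precisely why the half-L1 formula is the one used in practice.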
Total variation distance (2)
Equivalently, in terms of the pmfs/densities:

TV(IPθ, IPθ′) = (1/2) Σ_{x∈E} |pθ(x) − pθ′(x)|   if E is discrete,

TV(IPθ, IPθ′) = (1/2) ∫_E |fθ(x) − fθ′(x)| dx   if E is continuous.
Total variation distance (3)
Total variation distance (4)
Properties: TV is a distance between probability measures:
◮ TV(IPθ, IPθ′) = TV(IPθ′, IPθ) (symmetric)
◮ TV(IPθ, IPθ′) ≥ 0 (nonnegative)
◮ If TV(IPθ, IPθ′) = 0, then IPθ = IPθ′ (definite)
◮ TV(IPθ, IPθ′) ≤ TV(IPθ, IPθ′′) + TV(IPθ′′, IPθ′) (triangle inequality)
Total variation distance (5)
Problem: it is unclear how to build an estimator TV̂(IPθ, IPθ∗) of TV(IPθ, IPθ∗) from the sample!
Kullback-Leibler (KL) divergence (1)
Definition
The Kullback-Leibler (KL) divergence between two probability measures IPθ and IPθ′ is defined by

KL(IPθ, IPθ′) = Σ_{x∈E} pθ(x) log( pθ(x) / pθ′(x) )   if E is discrete,

KL(IPθ, IPθ′) = ∫_E fθ(x) log( fθ(x) / fθ′(x) ) dx   if E is continuous.
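The discrete formula is a one-liner in code. The sketch below (with made-up Bernoulli pmfs) evaluates KL in both argument orders, which also previews a property discussed next: KL is not symmetric.

```python
import math

# KL divergence between two discrete pmfs given as parallel lists (illustrative numbers)
def kl(p, q):
    """KL(P, Q) = sum over x of p(x) * log(p(x) / q(x))."""
    return sum(px * math.log(px / qx) for px, qx in zip(p, q) if px > 0)

p = [0.5, 0.5]   # Ber(0.5) written as [P(X=0), P(X=1)]
q = [0.9, 0.1]   # Ber(0.1)

print(kl(p, q), kl(q, p))  # both positive, but different: KL is not symmetric
```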
Kullback-Leibler (KL) divergence (2)
Properties of KL-divergence:
◮ KL(IPθ, IPθ′) ≥ 0, with equality if and only if IPθ = IPθ′ (definiteness);
◮ in general KL(IPθ, IPθ′) ≠ KL(IPθ′, IPθ) (not symmetric);
◮ it does not satisfy the triangle inequality in general.
Hence KL is not a distance; it is a divergence.
Kullback-Leibler (KL) divergence (3)

KL(IPθ∗, IPθ) = IEθ∗[ log( pθ∗(X) / pθ(X) ) ]
             = IEθ∗[ log pθ∗(X) ] − IEθ∗[ log pθ(X) ]

The first term does not depend on θ, and by the law of large numbers the second expectation can be estimated by a sample average. This yields the estimator

KL̂(IPθ∗, IPθ) = “constant” − (1/n) Σ_{i=1}^n log pθ(Xi)
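The law-of-large-numbers step can be seen in a simulation: replace the expectations by sample averages and compare with the exact KL. A sketch for a Bernoulli model, with made-up values of the true parameter p_star and a candidate p:

```python
import math
import random

random.seed(0)

# Hypothetical Bernoulli setup: true parameter p_star, candidate p (values made up)
p_star, p, n = 0.6, 0.3, 200_000
xs = [1 if random.random() < p_star else 0 for _ in range(n)]

def log_pmf(x, p):
    return x * math.log(p) + (1 - x) * math.log(1 - p)

# Replace both expectations by sample averages (law of large numbers)
kl_hat = sum(log_pmf(x, p_star) - log_pmf(x, p) for x in xs) / n

# Exact KL(Ber(p_star), Ber(p)) for comparison
kl_exact = p_star * math.log(p_star / p) + (1 - p_star) * math.log((1 - p_star) / (1 - p))

print(kl_hat, kl_exact)  # close for large n
```

In practice p_star is unknown, which is why only the θ-dependent part of KL̂ is usable; that is exactly the point of the next slide.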
Kullback-Leibler (KL) divergence (4)

KL̂(IPθ∗, IPθ) = “constant” − (1/n) Σ_{i=1}^n log pθ(Xi)

min_{θ∈Θ} KL̂(IPθ∗, IPθ) ⇔ min_{θ∈Θ} − (1/n) Σ_{i=1}^n log pθ(Xi)
                        ⇔ max_{θ∈Θ} (1/n) Σ_{i=1}^n log pθ(Xi)
                        ⇔ max_{θ∈Θ} Σ_{i=1}^n log pθ(Xi)
                        ⇔ max_{θ∈Θ} Π_{i=1}^n pθ(Xi)   (since log is increasing)
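The chain of equivalences can be verified numerically: minimizing KL̂ and maximizing the log-likelihood pick out the same parameter value. A grid-search sketch for a Bernoulli model with made-up parameters:

```python
import math
import random

random.seed(1)

# Made-up Bernoulli sample from a hypothetical true parameter
p_star, n = 0.7, 1000
xs = [1 if random.random() < p_star else 0 for _ in range(n)]

def log_lik(p):
    return sum(x * math.log(p) + (1 - x) * math.log(1 - p) for x in xs)

def kl_hat(p):
    # "constant" minus the scaled log-likelihood; the constant does not move the argmin
    const = log_lik(p_star) / n
    return const - log_lik(p) / n

grid = [i / 100 for i in range(1, 100)]
p_min_kl = min(grid, key=kl_hat)
p_max_ll = max(grid, key=log_lik)

print(p_min_kl, p_max_ll)  # the same grid point: the two problems are equivalent
```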
Interlude: maximizing/minimizing functions (1)
Note that

min_{θ∈Θ} −h(θ) ⇔ max_{θ∈Θ} h(θ)

Example: θ ↦ Σ_{i=1}^n (θ − Xi)
Interlude: maximizing/minimizing functions (2)
Definition
A twice differentiable function h : Θ ⊂ IR → IR is said to be concave if its second derivative satisfies h″(θ) ≤ 0 for all θ ∈ Θ.
Example:
◮ Θ = IR, h(θ) = 2θ − 3 (h″(θ) = 0, so h is concave)
Interlude: maximizing/minimizing functions (3)
More generally, for a multivariate function h : Θ ⊂ IR^d → IR, d ≥ 2, define the
◮ gradient vector: ∇h(θ) = ( ∂h/∂θ1(θ), . . . , ∂h/∂θd(θ) )⊤ ∈ IR^d
◮ Hessian matrix: ∇²h(θ) = ( ∂²h/∂θi∂θj(θ) )_{1≤i,j≤d} ∈ IR^{d×d}
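A quick way to build intuition for the gradient is to compare the analytic partial derivatives with central finite differences. A sketch on a toy bivariate function of our own choosing (not from the slides):

```python
# Toy bivariate function of our own choosing: h(t1, t2) = -(t1 - 1)^2 - 2*(t2 + 3)^2.
# Its gradient is (-2*(t1 - 1), -4*(t2 + 3)); we check it against central differences.
def h(t1, t2):
    return -((t1 - 1) ** 2) - 2 * (t2 + 3) ** 2

def grad(t1, t2):
    return [-2 * (t1 - 1), -4 * (t2 + 3)]

eps = 1e-5
t = (0.5, -2.0)
fd = [
    (h(t[0] + eps, t[1]) - h(t[0] - eps, t[1])) / (2 * eps),
    (h(t[0], t[1] + eps) - h(t[0], t[1] - eps)) / (2 * eps),
]
print(grad(*t), fd)  # analytic and finite-difference gradients agree
```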
Interlude: maximizing/minimizing functions (4)
An interior maximizer of a differentiable function must be a critical point, i.e. satisfy

h′(θ) = 0   if Θ ⊂ IR,

∇h(θ) = 0 ∈ IR^d   if Θ ⊂ IR^d.

For a concave function, any interior critical point is a global maximizer.
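For a concave h the derivative is decreasing, so the critical point can be located by bisection on h′. A sketch on a toy concave function of our own choosing, where the answer θ = 2 is known in closed form:

```python
# Toy concave function of our own choosing: h(theta) = log(theta) - theta/2 on (0, inf),
# with h'(theta) = 1/theta - 1/2 and h''(theta) = -1/theta^2 < 0 (concave).
# The critical point h'(theta) = 0 is theta = 2, found here by bisection on h'.
def h_prime(theta):
    return 1 / theta - 0.5

lo, hi = 0.1, 10.0          # h'(lo) > 0 > h'(hi), so the root is bracketed
for _ in range(60):
    mid = (lo + hi) / 2
    if h_prime(mid) > 0:
        lo = mid
    else:
        hi = mid

print(lo)  # converges to 2.0, the global maximizer of h
```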
Likelihood, Discrete case (1)
Let (E, (IPθ)θ∈Θ) be a statistical model associated with a sample of i.i.d. r.v. X1, . . . , Xn. Assume that E is discrete (i.e., finite or countable).
Definition
The likelihood of the model is the map Ln (or just L) defined as:
Ln : Eⁿ × Θ → IR
(x1, . . . , xn, θ) ↦ IPθ[X1 = x1, . . . , Xn = xn].
Likelihood, Discrete case (2)
Example 1 (Bernoulli trials): If X1, . . . , Xn are i.i.d. Ber(p) for some p ∈ (0, 1):
◮ E = {0, 1};
◮ Θ = (0, 1);

L(x1, . . . , xn, p) = Π_{i=1}^n IPp[Xi = xi]
                    = Π_{i=1}^n p^{xi} (1 − p)^{1−xi}
                    = p^{Σ_{i=1}^n xi} (1 − p)^{n − Σ_{i=1}^n xi}.
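The Bernoulli likelihood depends on the data only through Σxi, and it is easy to evaluate on a grid. A sketch with a small made-up sample, showing that the grid maximizer lands on the sample mean:

```python
# Bernoulli likelihood p^(sum x) * (1 - p)^(n - sum x) on a grid, for a made-up sample
xs = [1, 0, 1, 1, 0, 1, 1, 0, 1, 1]
n, s = len(xs), sum(xs)

def likelihood(p):
    return p ** s * (1 - p) ** (n - s)

grid = [i / 1000 for i in range(1, 1000)]
p_hat = max(grid, key=likelihood)

print(p_hat, s / n)  # the grid maximizer coincides with the sample mean
```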
Likelihood, Discrete case (3)
Example 2 (Poisson model): If X1, . . . , Xn are i.i.d. Poiss(λ) for some λ > 0:
◮ E = IN;
◮ Θ = (0, ∞);

L(x1, . . . , xn, λ) = Π_{i=1}^n IPλ[Xi = xi]
                    = Π_{i=1}^n e^{−λ} λ^{xi} / xi!
                    = e^{−nλ} λ^{Σ_{i=1}^n xi} / (x1! · · · xn!).
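Working with the log of the Poisson likelihood avoids the factorials overflowing; log(x!) is available as math.lgamma(x + 1). A sketch with made-up counts, checking that the sample mean beats nearby candidate values of λ:

```python
import math

# Poisson log-likelihood: -n*lam + (sum x)*log(lam) - sum log(x_i!), for made-up counts
xs = [3, 1, 4, 1, 5, 2, 2, 0, 3, 2]
n, s = len(xs), sum(xs)

def log_lik(lam):
    return -n * lam + s * math.log(lam) - sum(math.lgamma(x + 1) for x in xs)

lam_hat = s / n  # candidate MLE: the sample mean
# The sample mean beats nearby values, consistent with lambda-hat = X-bar
assert all(log_lik(lam_hat) >= log_lik(lam_hat + d) for d in (-0.5, -0.1, 0.1, 0.5))
print(lam_hat)
```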
Likelihood, Continuous case (1)
Let (E, (IPθ)θ∈Θ) be a statistical model associated with a sample of i.i.d. r.v. X1, . . . , Xn. Assume that all the IPθ have density fθ.
Definition
The likelihood of the model is the map L defined as:
L : Eⁿ × Θ → IR
(x1, . . . , xn, θ) ↦ Π_{i=1}^n fθ(xi).
Likelihood, Continuous case (2)
Example 1 (Gaussian model): If X1, . . . , Xn are i.i.d. N(µ, σ²) for some µ ∈ IR, σ² > 0:
◮ E = IR;
◮ Θ = IR × (0, ∞);

L(x1, . . . , xn, µ, σ²) = 1 / (σ√(2π))ⁿ · exp( −(1/(2σ²)) Σ_{i=1}^n (xi − µ)² ).
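For the Gaussian model the maximizer has a closed form (stated on a later slide): µ̂ = X̄n and σ̂² = (1/n) Σ (Xi − X̄n)². A simulation sketch with made-up parameters, checking that this pair beats small perturbations of either coordinate:

```python
import math
import random

random.seed(2)

# Made-up Gaussian sample; the claimed MLE is (X-bar, (1/n) * sum (x - X-bar)^2)
mu_true, sigma_true, n = 1.5, 2.0, 50_000
xs = [random.gauss(mu_true, sigma_true) for _ in range(n)]

mu_hat = sum(xs) / n
sigma2_hat = sum((x - mu_hat) ** 2 for x in xs) / n

def log_lik(mu, sigma2):
    return -n / 2 * math.log(2 * math.pi * sigma2) - sum((x - mu) ** 2 for x in xs) / (2 * sigma2)

best = log_lik(mu_hat, sigma2_hat)
assert all(best >= log_lik(mu_hat + d, sigma2_hat) for d in (-0.1, 0.1))
assert all(best >= log_lik(mu_hat, sigma2_hat + d) for d in (-0.1, 0.1))
print(mu_hat, sigma2_hat)  # close to the true (1.5, 4.0)
```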
Maximum likelihood estimator (1)
Definition
The maximum likelihood estimator of θ is defined as:

θ̂n^MLE = argmax_{θ∈Θ} L(X1, . . . , Xn, θ),

provided it exists.
Maximum likelihood estimator (2)
Examples
◮ Poisson model: λ̂n^MLE = X̄n.
◮ Gaussian model: (µ̂n, σ̂n²) = (X̄n, Ŝn), where Ŝn = (1/n) Σ_{i=1}^n (Xi − X̄n)² is the sample variance.
Maximum likelihood estimator (3)
Let ℓ(θ) = log L(X1, θ) denote the log-likelihood of one observation. If Θ ⊂ IR, we get the Fisher information:

I(θ) = var[ℓ′(θ)] = −IE[ℓ″(θ)]
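For the Bernoulli model both expressions for I(p) can be evaluated exactly, since X takes only two values; they both reduce to 1/(p(1 − p)). A sketch with an arbitrary illustrative value of p:

```python
# Exact check of I(p) = var[l'(p)] = -E[l''(p)] for one Bernoulli observation,
# where l(p) = x*log(p) + (1 - x)*log(1 - p); p = 0.3 is an arbitrary choice.
p = 0.3

def l1(x):  # l'(p)
    return x / p - (1 - x) / (1 - p)

def l2(x):  # l''(p)
    return -x / p ** 2 - (1 - x) / (1 - p) ** 2

weights = {0: 1 - p, 1: p}   # pmf of Ber(p)
mean_l1 = sum(w * l1(x) for x, w in weights.items())
var_l1 = sum(w * (l1(x) - mean_l1) ** 2 for x, w in weights.items())
neg_mean_l2 = -sum(w * l2(x) for x, w in weights.items())

print(var_l1, neg_mean_l2, 1 / (p * (1 - p)))  # all three agree
```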
Maximum likelihood estimator (4)
Theorem
Under mild regularity conditions, θ̂n^MLE satisfies:
◮ θ̂n^MLE → θ∗ in IP-probability as n → ∞ (consistency);
◮ √n (θ̂n^MLE − θ∗) → N(0, I(θ∗)⁻¹) in distribution as n → ∞ (asymptotic normality).
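The asymptotic-normality statement can be explored by Monte Carlo: for the Bernoulli MLE p̂ = X̄n, the empirical variance of √n (p̂ − p∗) across many replications should be close to 1/I(p∗) = p∗(1 − p∗). A sketch with made-up parameters:

```python
import random
import statistics

random.seed(3)

# Monte Carlo sketch of sqrt(n) * (p_hat - p_star) for the Bernoulli MLE p_hat = X-bar;
# asymptotic normality predicts variance 1/I(p_star) = p_star * (1 - p_star)
p_star, n, reps = 0.4, 500, 2000
zs = []
for _ in range(reps):
    p_hat = sum(1 for _ in range(n) if random.random() < p_star) / n
    zs.append(n ** 0.5 * (p_hat - p_star))

print(statistics.mean(zs), statistics.variance(zs), p_star * (1 - p_star))
```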
MIT OpenCourseWare
https://ocw.mit.edu
For information about citing these materials or our Terms of Use, visit: https://ocw.mit.edu/terms.