Second lecture
1 Previously...
$Z_1, \dots, Z_n$ are iid; their common law is $P$ (unknown).
There are 2 types of learning:
• Supervised : Zi = (Xi , Yi )
• Unsupervised : Zi = Xi
$$\eta^*(x) = \mathbb{E}[Y \mid X = x]$$
$$g^*(x) = \mathbf{1}\Big\{\eta^*(x) > \tfrac{1}{2}\Big\}, \quad \text{with } \eta^*(x) = \mathbb{E}[Y \mid X = x].$$
Risk
$$R[g] = \mathbb{E}_P\big[\ell(Y, g(X))\big]$$
Bayes Predictor
$$g^* \in \arg\min_{g : \mathcal{X} \to \mathcal{Y}} R[g]$$
Excess risk
$$R[\hat g] - R[g^*] \ge 0$$
Question: how good is the plug-in rule $\hat g_n$?
Proposition 1 Let $\hat\eta$ be an estimator of the regression function $\eta^*$, and let $\hat g(x) = \mathbf{1}\{\hat\eta(x) > \tfrac{1}{2}\}$. Then, we have:
$$R_{\mathrm{class}}[\hat g] - R_{\mathrm{class}}[g^*] \le 2\sqrt{R_{\mathrm{reg}}[\hat\eta] - R_{\mathrm{reg}}[\eta^*]}.$$
Proof. One can check that
$$R_{\mathrm{class}}[\hat g] - R_{\mathrm{class}}[g^*] = 2\,\mathbb{E}\Big[\mathbf{1}\Big\{\tfrac{1}{2} \in [\eta^*(X), \hat\eta(X)]\Big\}\,\Big|\eta^*(X) - \tfrac{1}{2}\Big|\Big].$$
If $\hat\eta(X) \le \tfrac{1}{2}$ and $\eta^*(X) \ge \tfrac{1}{2}$ (the symmetric case is analogous), then $|\eta^*(X) - \tfrac{1}{2}| \le |\eta^*(X) - \hat\eta(X)|$, and thus
$$R_{\mathrm{class}}[\hat g] - R_{\mathrm{class}}[g^*] \le 2\,\mathbb{E}\big[|\eta^*(X) - \hat\eta(X)|\big] \le 2\sqrt{\mathbb{E}\big[(\eta^*(X) - \hat\eta(X))^2\big]} = 2\sqrt{R_{\mathrm{reg}}[\hat\eta] - R_{\mathrm{reg}}[\eta^*]},$$
where the middle step is Jensen's inequality and the last equality uses that, for the squared loss, $R_{\mathrm{reg}}[\hat\eta] - R_{\mathrm{reg}}[\eta^*] = \mathbb{E}[(\hat\eta(X) - \eta^*(X))^2]$. Since this inequality is true for every deterministic $\eta$, we get the desired property.
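To see the bound in action, here is a minimal Monte Carlo sketch (not part of the lecture; the choice $\eta^*(x) = x$ with $X \sim \mathrm{Uniform}[0,1]$ and the deliberately biased estimator $\hat\eta$ are illustrative assumptions) that estimates both sides of the inequality from simulated data.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

X = rng.uniform(0.0, 1.0, size=n)
eta_star = X                                    # true regression function eta*(x) = x
Y = (rng.uniform(size=n) < eta_star).astype(int)

eta_hat = np.clip(X + 0.2, 0.0, 1.0)            # a deliberately biased "estimator" of eta*

g_star = (eta_star > 0.5).astype(int)           # Bayes classifier
g_hat = (eta_hat > 0.5).astype(int)             # plug-in classifier built from eta_hat

# Monte Carlo estimates of the 0-1 risks and of the excess regression risk
R_class_hat = np.mean(Y != g_hat)
R_class_star = np.mean(Y != g_star)
excess_reg = np.mean((eta_hat - eta_star) ** 2)  # = R_reg[eta_hat] - R_reg[eta*] for squared loss

lhs = R_class_hat - R_class_star
rhs = 2.0 * np.sqrt(excess_reg)
print(f"excess classification risk ~ {lhs:.4f} <= {rhs:.4f} = 2*sqrt(excess regression risk)")
```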
2.2 Link between classification and density estimation
Framework: $(X_1, Y_1), \dots, (X_n, Y_n) \overset{\text{iid}}{\sim} P$.
Assume that, conditionally on $Y$, $P$ admits a density on $\mathcal{X}$. Let $f_1(x)$ be the density of $X$ conditionally on $Y = 1$, and $f_0$ likewise for $Y = 0$. Let $\pi_1 = P(Y = 1)$ and $\pi_0 = 1 - \pi_1$. The well-known Bayes formula implies that:
$$\eta^*(x) = \mathbb{E}[Y \mid X = x] = P(Y = 1 \mid X = x) = \frac{f_1(x)\pi_1}{f_1(x)\pi_1 + f_0(x)\pi_0}.$$
$$g^*(x) = \begin{cases} 1, & \text{if } \eta^*(x) > \tfrac{1}{2} \\ 0, & \text{otherwise} \end{cases} \;=\; \begin{cases} 1, & \text{if } f_1(x)\pi_1 > f_0(x)\pi_0 \\ 0, & \text{otherwise.} \end{cases}$$
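As a concrete instance of this formula (illustrative, not from the lecture), suppose the class-conditional densities are Gaussian, $f_1 = \mathcal{N}(+1, 1)$ and $f_0 = \mathcal{N}(-1, 1)$, with $\pi_1 = 0.3$; the Bayes rule then simply compares $f_1(x)\pi_1$ with $f_0(x)\pi_0$.

```python
import numpy as np

def gauss_pdf(x, mu, sigma=1.0):
    """Density of N(mu, sigma^2)."""
    return np.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

pi1, pi0 = 0.3, 0.7
f1 = lambda x: gauss_pdf(x, +1.0)   # density of X given Y = 1 (assumed for illustration)
f0 = lambda x: gauss_pdf(x, -1.0)   # density of X given Y = 0

def g_star(x):
    # Bayes classifier: predict 1 iff f1(x)*pi1 > f0(x)*pi0
    return (f1(x) * pi1 > f0(x) * pi0).astype(int)

xs = np.array([-1.0, 0.0, 0.4, 0.5, 1.0])
print(g_star(xs))   # the decision flips at x = 0.5*log(pi0/pi1) ~ 0.424
```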
General approach
• Estimate $\pi_0, \pi_1$ by:
$$\hat\pi_1 = \frac{1}{n}\sum_{i=1}^n Y_i, \qquad \hat\pi_0 = 1 - \hat\pi_1.$$
• Estimate the class-conditional densities $f_1$ and $f_0$ (this is the density estimation problem of the next section) and plug $\hat\pi_1$, $\hat\pi_0$, $\hat f_1$, $\hat f_0$ into the expression of $g^*$; a sketch is given below.
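A minimal end-to-end sketch of this plug-in approach, under assumptions made only for illustration ($\mathcal{X} = [0,1]$, Beta-distributed class-conditional samples, histogram density estimates with $K$ bins as in the next section):

```python
import numpy as np

rng = np.random.default_rng(1)
n, K = 5_000, 20
h = 1.0 / K

# Synthetic data on [0, 1]: Y ~ Bernoulli(0.4), X|Y=1 ~ Beta(5, 2), X|Y=0 ~ Beta(2, 5)
Y = (rng.uniform(size=n) < 0.4).astype(int)
X = np.where(Y == 1, rng.beta(5, 2, size=n), rng.beta(2, 5, size=n))

# Step 1: estimate the class priors
pi1_hat = Y.mean()
pi0_hat = 1.0 - pi1_hat

# Step 2: estimate f1, f0 by histograms with K equal bins on [0, 1]
def hist_density(sample, K):
    counts, _ = np.histogram(sample, bins=K, range=(0.0, 1.0))
    return counts / (len(sample) * (1.0 / K))          # per-bin heights theta_hat_k / h

f1_heights = hist_density(X[Y == 1], K)
f0_heights = hist_density(X[Y == 0], K)

# Step 3: plug everything into the Bayes rule
def g_hat(x):
    k = np.minimum((np.asarray(x) / h).astype(int), K - 1)   # index of the bin containing x
    return (f1_heights[k] * pi1_hat > f0_heights[k] * pi0_hat).astype(int)

print(g_hat([0.1, 0.5, 0.9]))   # predictions move from class 0 to class 1 as x grows
```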
3 Density model
Let now $X_1, \dots, X_n$ be iid with density $f$. To start with, we assume that $X_i \in [0, 1]$; relaxations of this condition will be discussed later.
Let us also assume that $f \in \mathrm{Lip}(L)$, i.e., $|f(x) - f(y)| \le L|x - y|$ for all $x, y \in [0, 1]$.
We are going to estimate $f$ by a histogram. The set $\mathrm{Lip}(L)$ can be approximated by a $K$-dimensional space $\Theta_K$:
$$\Theta_K = \Big\{\theta \in \mathbb{R}^K : \theta_k \ge 0, \ \sum_{k=1}^K \theta_k = 1\Big\},$$
$$\theta \in \Theta_K \;\mapsto\; f_\theta(x) = \frac{\theta_k}{h} \quad \forall x \in [(k-1)h, kh), \qquad \text{where } h = \frac{1}{K}.$$
Under the assumption that $f = f_\theta$ with $\theta \in \Theta_K$, the estimation of $f$ is equivalent to estimating $\theta$, which can be done by the method of maximum likelihood:
$$\hat\theta_n = \arg\max_{\theta \in \Theta_K} \prod_{i=1}^n f_\theta(X_i) = \arg\max_{\theta \in \Theta_K} \sum_{i=1}^n \ln f_\theta(X_i).$$
It is easy to check (a good exercise for the students) that the maximum is attained at:
$$\hat\theta_n = \big(\hat\theta_n^{(1)}, \dots, \hat\theta_n^{(K)}\big) \quad \text{s.t.} \quad \hat\theta_n^{(k)} = \frac{1}{n}\sum_{i=1}^n \mathbf{1}\big(X_i \in [(k-1)h, kh)\big).$$
The following estimator is then called a histogram with bandwidth $h = \frac{1}{K}$:
$$\hat f_{n,h}(x) = \frac{1}{nh}\sum_{k=1}^K \mathbf{1}_{C_{k,h}}(x)\left[\sum_{i=1}^n \mathbf{1}_{C_{k,h}}(X_i)\right], \quad \text{with } C_{k,h} = [(k-1)h, kh).$$
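The estimator is straightforward to implement; below is a short numpy sketch (the example density $f(x) = 2x$ and the values of $n$ and $K$ are illustrative assumptions, not part of the lecture):

```python
import numpy as np

def histogram_estimator(X, K):
    """Return x -> f_hat_{n,h}(x) for a sample X_1, ..., X_n in [0, 1), with h = 1/K."""
    n = len(X)
    h = 1.0 / K
    # theta_hat_k = (1/n) * sum_i 1(X_i in C_{k,h}), the maximum-likelihood estimate over Theta_K
    counts = np.bincount(np.minimum((X / h).astype(int), K - 1), minlength=K)
    theta_hat = counts / n

    def f_hat(x):
        k = np.minimum((np.asarray(x) / h).astype(int), K - 1)   # bin index of x
        return theta_hat[k] / h                                   # f_hat_{n,h}(x) = theta_hat_k / h

    return f_hat

# Example: data drawn from the Lipschitz density f(x) = 2x via inverse-CDF sampling
rng = np.random.default_rng(2)
X = np.sqrt(rng.uniform(size=2_000))
f_hat = histogram_estimator(X, K=10)
print(np.round(f_hat(np.array([0.05, 0.5, 0.95])), 2))   # should be close to 0.1, 1.0, 1.9
```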
Note that, for every fixed bin $C_{k,h}$, the bin count $\sum_{i=1}^n \mathbf{1}_{C_{k,h}}(X_i)$ follows the distribution $B(n, p_k)$, where $B(n, p)$ stands for the binomial distribution with parameters $(n, p)$ and $p_k = \int_{C_{k,h}} f$. The integral over $C_k$ comes from the fact that $P(X_i \in C_{k,h}) = \int_{C_{k,h}} f(x)\,dx$.
Proposition 2 Let $X_1, \dots, X_n$ be iid with density $f(x)$ and $K \in \mathbb{N}$. Set $p_j = \int_{C_j} f$ for every $j = 1, \dots, K$. The Mean Integrated Squared Error of the histogram $\hat f_{n,h}$ is given by:
$$\mathrm{MISE}_f(h) \triangleq \mathbb{E}_f \int_0^1 \big(\hat f_{n,h}(x) - f(x)\big)^2 dx = \int_0^1 f^2(x)\,dx + \frac{1}{nh} - \frac{n+1}{nh}\sum_{j=1}^K p_j^2.$$
It is clear that, for every $x \in C_k$:
$$nh\,\hat f_{n,h}(x) \sim B(n, p_k) \;\Rightarrow\; \mathbb{E}_f\big[\hat f_{n,h}(x)\big] = \frac{n p_k}{nh} \quad \text{and} \quad \mathrm{Var}_f\big[\hat f_{n,h}(x)\big] = \frac{n p_k (1 - p_k)}{(nh)^2}.$$
Therefore,
$$\mathrm{MISE}_f(h) = \sum_{k=1}^K \left[\int_{C_k}\Big(\frac{p_k}{h} - f(x)\Big)^2 dx + \frac{p_k(1 - p_k)}{nh}\right] \tag{1}$$
$$= \sum_{k=1}^K \left[-\frac{p_k^2}{h} + \int_{C_k} f^2\,dx + \frac{p_k - p_k^2}{nh}\right] \qquad \Big(\text{we used that } \int_{C_k} f = p_k\Big)$$
$$= \int_0^1 f^2\,dx + \frac{\sum_{k=1}^K p_k}{nh} - \frac{n+1}{nh}\sum_{k=1}^K p_k^2.$$
The desired result follows from the obvious relation $\sum_k p_k = 1$.
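The closed-form expression can be checked numerically; here is a small Monte Carlo sanity check (a sketch, with the density $f(x) = 2x$ on $[0, 1]$ and the values of $n$, $K$ chosen only for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
n, K, reps = 500, 8, 2_000
h = 1.0 / K
edges = np.linspace(0.0, 1.0, K + 1)

# For f(x) = 2x: p_j = integral of f over C_j, and the integral of f^2 over [0, 1] is 4/3
p = edges[1:] ** 2 - edges[:-1] ** 2
mise_formula = 4.0 / 3.0 + 1.0 / (n * h) - (n + 1) / (n * h) * np.sum(p ** 2)

# Monte Carlo: average the integrated squared error over many independent samples
grid = (np.arange(2000) + 0.5) / 2000          # midpoints of a fine grid on [0, 1]
f_grid = 2.0 * grid
bin_of_grid = np.minimum((grid / h).astype(int), K - 1)
ise = np.empty(reps)
for r in range(reps):
    X = np.sqrt(rng.uniform(size=n))           # inverse-CDF sampling from f(x) = 2x
    theta_hat = np.bincount(np.minimum((X / h).astype(int), K - 1), minlength=K) / n
    f_hat_grid = theta_hat[bin_of_grid] / h
    ise[r] = np.mean((f_hat_grid - f_grid) ** 2)   # approximates the integral of (f_hat - f)^2
print(f"closed form: {mise_formula:.4f}   Monte Carlo: {ise.mean():.4f}")
```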
The result of the previous proposition is very useful, since it provides the explicit form of the risk.
However, it is not very easy to see how this expression depends on the bandwidth h, which is the
parameter responsible for overfitting or underfitting. Under the assumption that f ∈ Lip (L), we
can get a simpler bound on the risk. Indeed, for every $x \in C_k$, we have:
$$\Big|f(x) - \frac{p_k}{h}\Big| = \Big|f(x) - \frac{1}{h}\int_{C_k} f(y)\,dy\Big| = \frac{1}{h}\Big|\int_{C_k} \big(f(x) - f(y)\big)\,dy\Big| \le \frac{1}{h}\int_{C_k} |f(x) - f(y)|\,dy \le \frac{L}{h}\int_{C_k} |x - y|\,dy \le Lh. \tag{2}$$
Plugging (2) into (1) and using $\sum_k p_k(1 - p_k) \le \sum_k p_k = 1$, we obtain $\mathrm{MISE}_f(h) \le L^2 h^2 + \frac{1}{nh}$.
Remark 2
• $L^2 h^2 = \frac{L^2}{K^2}$ is an upper bound on the square of the bias. It describes the approximation error.
• $\frac{1}{nh} = \frac{K}{n}$ is an upper bound on the variance of the estimator; it describes the estimation error.
• If we minimize $L^2 h^2 + \frac{1}{nh}$ w.r.t. $h > 0$, we get $h_{\mathrm{opt}} = \big(2nL^2\big)^{-1/3}$, and $\mathrm{MISE}(h_{\mathrm{opt}}) \le 3(L/2)^{2/3} n^{-2/3}$.
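A short numerical check of the last point (the values of $n$ and $L$ below are illustrative assumptions): minimize the bound $L^2 h^2 + \frac{1}{nh}$ on a grid and compare with the closed-form $h_{\mathrm{opt}}$ and the $n^{-2/3}$ rate.

```python
import numpy as np

def mise_bound(h, n, L):
    return L ** 2 * h ** 2 + 1.0 / (n * h)      # upper bound on MISE_f(h) for f in Lip(L)

n, L = 1_000, 1.0                                # illustrative values, not from the lecture
h_opt = (2 * n * L ** 2) ** (-1.0 / 3.0)
hs = np.linspace(0.01, 0.5, 1_000)
h_grid = hs[np.argmin(mise_bound(hs, n, L))]

print(f"h_opt = {h_opt:.4f}, grid minimizer ~ {h_grid:.4f}")
print(f"bound at h_opt = {mise_bound(h_opt, n, L):.5f} <= {3 * (L / 2) ** (2 / 3) * n ** (-2 / 3):.5f}")
```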