
Statistique en grande dimension

Lecturer : Dalalyan A., Scribe : Thomas F.-X.

Second lecture

1 Previously...
$Z_1, \ldots, Z_n$ are iid; their common law is $P$ (unknown).
There are 2 types of learning :

• Supervised : Zi = (Xi , Yi )
• Unsupervised : Zi = Xi

Prediction $(X, Y) \in \mathcal{X} \times \mathcal{Y}$: given an example $X$, we'd like to predict the value of $Y$.

• Binary classification: $\mathcal{Y} = \{0, 1\}$ and $\ell(y, y') = \mathbf{1}(y \neq y')$, with
$$\eta^*(x) = \mathbb{E}[Y \mid X = x] \qquad \text{and} \qquad g^*(x) = \mathbf{1}\Big(\eta^*(x) > \frac{1}{2}\Big).$$

• Least-squares regression: $\mathcal{Y} \subset \mathbb{R}$ and $\ell(y, y') = (y - y')^2$, with
$$g^*(x) = \eta^*(x) = \mathbb{E}[Y \mid X = x].$$

Risk
$$R[g] = \mathbb{E}_P\big[\ell(Y, g(X))\big]$$

Bayes predictor
$$g^* \in \arg\min_{g : \mathcal{X} \to \mathcal{Y}} R[g]$$

Excess risk
$$R[\hat{g}] - R[g^*] \geq 0$$

2 Link between Binary Classification & Regression

2.1 Plug-in rule


• We start by estimating $\eta^*(x)$ by $\hat{\eta}_n(x)$;
• We define $\hat{g}_n(x) = \mathbf{1}\big(\hat{\eta}_n(x) > \frac{1}{2}\big)$ (see the sketch below).
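As an illustration, here is a minimal sketch of the plug-in rule in Python/NumPy (not part of the original notes): $\eta^*$ is estimated by a simple partition-based regression estimate $\hat{\eta}_n$, which is then thresholded at $1/2$. The simulated model and all function names are illustrative choices, not prescribed by the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated supervised data on [0, 1], with known eta*(x) = P(Y = 1 | X = x).
eta_star = lambda x: 0.5 + 0.4 * np.sin(2 * np.pi * x)
n = 2000
X = rng.uniform(0, 1, n)
Y = rng.binomial(1, eta_star(X))

def eta_hat(x, X, Y, n_bins=20):
    """Partition-based (regressogram) estimate of eta*(x) on [0, 1]."""
    cells = np.clip((np.atleast_1d(x) * n_bins).astype(int), 0, n_bins - 1)
    train_cells = np.clip((X * n_bins).astype(int), 0, n_bins - 1)
    sums = np.bincount(train_cells, weights=Y, minlength=n_bins)
    counts = np.bincount(train_cells, minlength=n_bins)
    # Average the Y_i falling in the same cell as x (default to 1/2 on empty cells).
    means = np.where(counts > 0, sums / np.maximum(counts, 1), 0.5)
    return means[cells]

def g_hat(x, X, Y):
    """Plug-in classifier: threshold the regression estimate at 1/2."""
    return (eta_hat(x, X, Y) > 0.5).astype(int)

print(g_hat(np.array([0.1, 0.4, 0.8]), X, Y))
```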


Question: how good is the plug-in rule $\hat{g}_n$?

Proposition 1. Let $\hat{\eta}$ be an estimator of the regression function $\eta^*$, and let $\hat{g}(x) = \mathbf{1}\big(\hat{\eta}(x) > \frac{1}{2}\big)$. Then, we have:
$$R_{\mathrm{class}}[\hat{g}] - R_{\mathrm{class}}[g^*] \leq 2\sqrt{R_{\mathrm{reg}}[\hat{\eta}] - R_{\mathrm{reg}}[\eta^*]}.$$

Proof. Let $\eta : \mathcal{X} \to \mathcal{Y} \subset \mathbb{R}$, let $g(x) = \mathbf{1}\big(\eta(x) > \frac{1}{2}\big)$, and let us compute the excess risk of $g$. By definition:

$$R_{\mathrm{class}}[g] - R_{\mathrm{class}}[g^*] = \mathbb{E}\big[\mathbf{1}(Y \neq g(X)) - \mathbf{1}(Y \neq g^*(X))\big]$$

Simple algebra yields:

$$\mathbb{E}\big[\mathbf{1}(Y \neq g(X))\big] = \mathbb{E}\big[\mathbf{1}(Y \neq g(X))\,\mathbf{1}(Y = 1)\big] + \mathbb{E}\big[\mathbf{1}(Y \neq g(X))\,\mathbf{1}(Y = 0)\big]$$
$$= \mathbb{E}\Big[\mathbb{E}\big(\mathbf{1}(1 \neq g(X))\,\mathbf{1}(Y = 1) \mid X\big)\Big] + \mathbb{E}\Big[\mathbb{E}\big(\mathbf{1}(0 \neq g(X))\,\mathbf{1}(Y = 0) \mid X\big)\Big].$$

Let us now compute the conditional expectations appearing above:
$$\mathbb{E}\Big[\underbrace{\mathbf{1}(1 \neq g(X))}_{=1-g(X)}\cdot\mathbf{1}(Y = 1) \,\Big|\, X\Big] = \big(1 - g(X)\big)\,\mathbb{P}(Y = 1 \mid X) = \big(1 - g(X)\big)\,\eta^*(X),$$
$$\mathbb{E}\Big[\underbrace{\mathbf{1}(0 \neq g(X))}_{=g(X)}\cdot\mathbf{1}(Y = 0) \,\Big|\, X\Big] = g(X)\,\mathbb{P}(Y = 0 \mid X) = g(X)\,\big(1 - \eta^*(X)\big).$$

Combining these relations, we get

$$\mathbb{E}\big[\mathbf{1}(Y \neq g(X))\big] = \mathbb{E}[\eta^*(X)] + \mathbb{E}\big[g(X)\,(1 - 2\eta^*(X))\big].$$


Therefore,

$$R_{\mathrm{class}}[g] - R_{\mathrm{class}}[g^*] = \mathbb{E}\big[(g(X) - g^*(X))\,(1 - 2\eta^*(X))\big].$$


Since g and g ∗ are both indicator functions and, therefore, take only the values 0 and 1, their
difference will be nonzero if and only if one of them is equal to 1 and the other one is equal to 0.
This leads to

$$R_{\mathrm{class}}[g] - R_{\mathrm{class}}[g^*] \leq \mathbb{E}\Big[\mathbf{1}\big(\eta(X) \leq \tfrac{1}{2} < \eta^*(X)\big)\,|2\eta^*(X) - 1|\Big] + \mathbb{E}\Big[\mathbf{1}\big(\eta^*(X) \leq \tfrac{1}{2} < \eta(X)\big)\,|2\eta^*(X) - 1|\Big]$$
$$= 2\,\mathbb{E}\Big[\mathbf{1}\big(\tfrac{1}{2} \in [\eta^*(X), \eta(X)]\big)\,\big|\eta^*(X) - \tfrac{1}{2}\big|\Big],$$
where $[\eta^*(X), \eta(X)]$ denotes the interval with endpoints $\eta^*(X)$ and $\eta(X)$ (in either order), and we used that $|2\eta^*(X) - 1| = 2\,\big|\eta^*(X) - \tfrac{1}{2}\big|$.

If $\eta(X) \leq \frac{1}{2}$ and $\eta^*(X) > \frac{1}{2}$, then $\eta^*(X) - \frac{1}{2} \leq |\eta^*(X) - \eta(X)|$, and thus:

$$R_{\mathrm{class}}[g] - R_{\mathrm{class}}[g^*] \leq 2\,\mathbb{E}\Big[\mathbf{1}\big(\tfrac{1}{2} \in [\eta(X), \eta^*(X)]\big)\,|\eta(X) - \eta^*(X)|\Big]$$
$$\leq 2\,\mathbb{E}\big[|\eta(X) - \eta^*(X)|\big]$$
$$\leq 2\,\sqrt{\mathbb{E}\big[(\eta(X) - \eta^*(X))^2\big]} \qquad \text{(by the Cauchy--Schwarz inequality)}$$
$$= 2\,\sqrt{R_{\mathrm{reg}}(\eta) - R_{\mathrm{reg}}(\eta^*)},$$
where the last equality uses the identity $R_{\mathrm{reg}}(\eta) - R_{\mathrm{reg}}(\eta^*) = \mathbb{E}\big[(\eta(X) - \eta^*(X))^2\big]$.

Since this inequality is true for every deterministic η, we get the desired property. 
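To see the inequality in action, here is a small Monte Carlo sketch (Python/NumPy; not part of the notes). It uses a synthetic model with a known $\eta^*$ and an arbitrary fixed $\eta$ playing the role of the estimator, and compares the two sides of Proposition 1; the specific functions chosen are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

# Known regression function eta*(x) and a deliberately crude surrogate eta(x).
eta_star = lambda x: 0.5 + 0.4 * np.sin(2 * np.pi * x)
eta = lambda x: 0.5 + 0.2 * np.sin(2 * np.pi * x + 0.3)   # plays the role of eta_hat

X = rng.uniform(0, 1, 200_000)          # X ~ Uniform([0, 1]) for the Monte Carlo average
g_star = (eta_star(X) > 0.5).astype(int)
g = (eta(X) > 0.5).astype(int)

# Excess classification risk: E[ 1(g(X) != g*(X)) * |2 eta*(X) - 1| ]  (see the proof above).
excess_class = np.mean((g != g_star) * np.abs(2 * eta_star(X) - 1))

# Excess regression risk: E[ (eta(X) - eta*(X))^2 ].
excess_reg = np.mean((eta(X) - eta_star(X)) ** 2)

print(excess_class, 2 * np.sqrt(excess_reg))   # the first value should not exceed the second
```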

2.2 Link between classification and density estimation
Framework: $(X_1, Y_1), \ldots, (X_n, Y_n) \stackrel{\text{iid}}{\sim} P$.
Assume that, conditionally on $Y$, $P$ admits a density on $\mathcal{X}$. Let $f_1(x)$ be the density of $X$ conditionally on $Y = 1$, and $f_0$ likewise for $Y = 0$. Let $\pi_1 = \mathbb{P}(Y = 1)$ and $\pi_0 = 1 - \pi_1$. The well-known Bayes formula implies that:
$$\eta^*(x) = \mathbb{E}[Y \mid X = x] = \mathbb{P}(Y = 1 \mid X = x) = \frac{f_1(x)\pi_1}{f_1(x)\pi_1 + f_0(x)\pi_0}.$$

Hence, the Bayes rule can be written as:
$$g^*(x) = \begin{cases} 1, & \text{if } \eta^*(x) > \frac{1}{2} \\ 0, & \text{otherwise} \end{cases} \;=\; \begin{cases} 1, & \text{if } f_1(x)\pi_1 > f_0(x)\pi_0 \\ 0, & \text{otherwise.} \end{cases}$$

General approach

• Estimate $\pi_0, \pi_1$ by:
$$\hat{\pi}_1 = \frac{1}{n}\sum_{i=1}^n Y_i, \qquad \hat{\pi}_0 = 1 - \hat{\pi}_1.$$
• Use a non-parametric method for estimating $f_1$ and $f_0$.
• Set $\hat{g}(x) = \mathbf{1}\big(\hat{f}_1(x)\hat{\pi}_1 > \hat{f}_0(x)\hat{\pi}_0\big)$ (see the sketch below).
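A minimal end-to-end sketch of this general approach (Python/NumPy; not from the notes). It assumes one-dimensional features in $[0, 1]$ and uses histogram estimates of $f_1$ and $f_0$, anticipating the histogram estimator studied in Section 3; the simulated class-conditional densities and all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic labelled sample in which X | Y = y has a density on [0, 1].
n = 5000
Y = rng.binomial(1, 0.3, n)
X = np.where(Y == 1, rng.beta(2, 5, n), rng.beta(5, 2, n))

# Step 1: estimate the class proportions.
pi1_hat = Y.mean()
pi0_hat = 1 - pi1_hat

# Step 2: non-parametric (histogram) estimates of f1 and f0 on [0, 1].
K = 20                                    # number of bins, bandwidth h = 1/K
edges = np.linspace(0, 1, K + 1)
f1_hat, _ = np.histogram(X[Y == 1], bins=edges, density=True)
f0_hat, _ = np.histogram(X[Y == 0], bins=edges, density=True)

# Step 3: plug everything into the Bayes rule.
def g_hat(x):
    k = np.clip((np.atleast_1d(x) * K).astype(int), 0, K - 1)   # bin index of x
    return (f1_hat[k] * pi1_hat > f0_hat[k] * pi0_hat).astype(int)

print(g_hat([0.1, 0.5, 0.9]))
```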

Multi-class classification. In the case where $\mathcal{Y} = \{a_1, \ldots, a_M\}$, similar arguments yield
$$g^*(x) = \arg\max_{a \in \mathcal{Y}} \mathbb{P}(Y = a \mid X = x).$$

3 Density model
Now let $X_1, \ldots, X_n$ be iid with density $f$. To start with, we assume that $X_i \in [0, 1]$; relaxations of this condition will be discussed later.
Let us assume that
$$f \in \mathrm{Lip}(L) = \big\{f : [0, 1] \to \mathbb{R} : |f(x) - f(y)| \leq L\,|x - y|, \ \forall x, y \in [0, 1]\big\}.$$

We are going to estimate $f$ by a histogram. The set $\mathrm{Lip}(L)$ can be approximated by a $K$-dimensional space $\Theta_K$:
$$\Theta_K = \Big\{\theta \in \mathbb{R}^K : \theta_k \geq 0, \ \sum_{k=1}^K \theta_k = 1\Big\}.$$

In fact, the following map is an injection from $\Theta_K$ into the set of probability densities on $[0, 1]$:
$$\theta \in \Theta_K \mapsto f_\theta(x) = \frac{\theta_k}{h}, \quad \forall x \in [(k-1)h, kh[, \qquad \text{where } h = \frac{1}{K}.$$
Under the assumption that $f = f_\theta$ with $\theta \in \Theta_K$, the estimation of $f$ is equivalent to estimating $\theta$, which can be done by the method of maximum likelihood:
$$\hat{\theta}_n = \arg\max_{\theta \in \Theta_K} \prod_{i=1}^n f_\theta(X_i) = \arg\max_{\theta \in \Theta_K} \sum_{i=1}^n \ln f_\theta(X_i).$$

It is easy to check (a good exercise for the students) that the maximum is attained at
$$\hat{\theta}_n = \big(\hat{\theta}_n^{(1)}, \ldots, \hat{\theta}_n^{(K)}\big) \quad \text{such that} \quad \hat{\theta}_n^{(k)} = \frac{1}{n}\sum_{i=1}^n \mathbf{1}\big(X_i \in [(k-1)h, kh[\big).$$
The following estimator is then called the histogram with bandwidth $h = \frac{1}{K}$:
$$\hat{f}_{n,h}(x) = \frac{1}{nh}\sum_{k=1}^K \mathbf{1}_{C_{k,h}}(x)\Big[\sum_{i=1}^n \mathbf{1}_{C_{k,h}}(X_i)\Big], \qquad \text{with } C_{k,h} = [(k-1)h, kh[.$$
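The estimator is straightforward to implement directly from this definition. A minimal sketch in Python/NumPy (not from the notes; the function name and the test sample are illustrative):

```python
import numpy as np

def hist_estimator(x, sample, K):
    """Histogram estimator f_hat_{n,h}(x) on [0, 1] with bandwidth h = 1/K."""
    h = 1.0 / K
    n = len(sample)
    # 0-based index of the cell C_{k,h} = [(k-1)h, kh[ containing each point.
    k_x = np.clip((np.atleast_1d(x) / h).astype(int), 0, K - 1)
    k_sample = np.clip((sample / h).astype(int), 0, K - 1)
    counts = np.bincount(k_sample, minlength=K)       # number of X_i in each cell
    return counts[k_x] / (n * h)

rng = np.random.default_rng(3)
sample = rng.beta(2, 3, 1000)        # the true density here is Beta(2, 3), supported on [0, 1]
print(hist_estimator([0.25, 0.5, 0.75], sample, K=10))
```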

We have 2 cases :

• If K is too large : overfitting


• If K is too small : underfitting

Remark 1. If $x \in C_{k,h}$, then:
$$nh\,\hat{f}_{n,h}(x) = \sum_{i=1}^n \mathbf{1}_{C_{k,h}}(X_i) \sim \mathcal{B}\Big(n, \int_{C_k} f(x)\,dx\Big),$$
where $\mathcal{B}(n, p)$ stands for the binomial distribution with parameters $(n, p)$. Note that the integral over $C_k$ in the above formula comes from the fact that:
$$\mathbb{P}\big(\mathbf{1}_{C_{k,h}}(X_i) = 1\big) = \mathbb{P}(X_i \in C_{k,h}) = \int_{C_k} f(x)\,dx.$$

Proposition 2. Let $X_1, \ldots, X_n$ be iid with density $f(x)$ and $K \in \mathbb{N}$. Set $p_j = \int_{C_j} f$ for every $j = 1, \ldots, K$. The Mean Integrated Squared Error of the histogram $\hat{f}_{n,h}$ is given by:
$$\mathrm{MISE}_f(h) := \mathbb{E}_f\Big[\int_0^1 \big(\hat{f}_{n,h}(x) - f(x)\big)^2\,dx\Big] = \int_0^1 f^2(x)\,dx + \frac{1}{nh} - \frac{n+1}{nh}\sum_{j=1}^K p_j^2.$$

Proof. Combining Fubini with the bias-variance decomposition, we get:
$$\mathbb{E}_f\Big[\int_0^1 \big(\hat{f}_{n,h}(x) - f(x)\big)^2\,dx\Big] = \int_0^1 \mathbb{E}_f\Big[\big(\hat{f}_{n,h}(x) - f(x)\big)^2\Big]\,dx$$
$$= \sum_{k=1}^K \int_{C_k} \mathbb{E}_f\Big[\big(\hat{f}_{n,h}(x) - f(x)\big)^2\Big]\,dx$$
$$= \sum_{k=1}^K \int_{C_k} \Big(\big(\mathbb{E}_f[\hat{f}_{n,h}(x)] - f(x)\big)^2 + \mathrm{Var}_f\big[\hat{f}_{n,h}(x)\big]\Big)\,dx.$$

It is clear that:
$$nh\,\hat{f}_{n,h}(x) \sim \mathcal{B}(n, p_k) \ \Longrightarrow\ \mathbb{E}_f\big[\hat{f}_{n,h}(x)\big] = \frac{np_k}{nh} \quad \text{and} \quad \mathrm{Var}_f\big[\hat{f}_{n,h}(x)\big] = \frac{np_k(1 - p_k)}{(nh)^2}.$$
Therefore,
$$\mathrm{MISE}_f(h) = \sum_{k=1}^K \Big[\int_{C_k} \Big(\frac{p_k}{h} - f(x)\Big)^2 dx + \frac{p_k(1 - p_k)}{nh}\Big] \qquad (1)$$
$$= \sum_{k=1}^K \Big[-\frac{p_k^2}{h} + \int_{C_k} f^2\,dx + \frac{p_k - p_k^2}{nh}\Big] \qquad \Big(\text{we used that } \int_{C_k} f = p_k\Big)$$
$$= \int_0^1 f^2\,dx + \frac{\sum_{k=1}^K p_k}{nh} - \frac{n+1}{nh}\sum_{k=1}^K p_k^2.$$

The desired result follows from the obvious relation $\sum_k p_k = 1$. □
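Since Proposition 2 gives the MISE in closed form, it can be evaluated exactly for a known density to see how the risk varies with the bandwidth. A small sketch (Python/NumPy; the density $f(x) = 1 + 0.5\cos(2\pi x)$ is an illustrative Lipschitz density on $[0, 1]$, not one used in the notes):

```python
import numpy as np

# Illustrative Lipschitz density on [0, 1]: f(x) = 1 + 0.5 * cos(2 * pi * x).
F = lambda x: x + np.sin(2 * np.pi * x) / (4 * np.pi)   # antiderivative of f
int_f_squared = 1.125                                   # exact value of \int_0^1 f(x)^2 dx

def mise(n, K):
    """Exact MISE of the histogram with h = 1/K, using the formula of Proposition 2."""
    h = 1.0 / K
    edges = np.linspace(0, 1, K + 1)
    p = F(edges[1:]) - F(edges[:-1])                    # p_j = \int_{C_j} f
    return int_f_squared + 1 / (n * h) - (n + 1) / (n * h) * np.sum(p ** 2)

n = 1000
for K in (2, 5, 10, 20, 50, 200):
    print(K, mise(n, K))   # the risk first decreases, then increases with K (under/overfitting)
```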

The result of the previous proposition is very useful, since it provides the explicit form of the risk.
However, it is not very easy to see how this expression depends on the bandwidth h, which is the
parameter responsible for overfitting or underfitting. Under the assumption that f ∈ Lip (L), we
can get a simpler bound on the risk. Indeed, for every x ∈ Ck , we have :
$$\Big|f(x) - \frac{p_k}{h}\Big| = \Big|f(x) - \frac{1}{h}\int_{C_k} f(y)\,dy\Big| = \frac{1}{h}\Big|\int_{C_k} \big(f(x) - f(y)\big)\,dy\Big|$$
$$\leq \frac{1}{h}\int_{C_k} |f(x) - f(y)|\,dy \leq \frac{L}{h}\int_{C_k} |x - y|\,dy \leq Lh. \qquad (2)$$

This leads to the following result.

Proposition 3. If $f \in \mathrm{Lip}(L)$, then:
$$\mathrm{MISE}_f(h) \leq L^2 h^2 + \frac{1}{nh}.$$
Proof. Using (1), (2) and $\sum_{k=1}^K p_k(1 - p_k) \leq \sum_{k=1}^K p_k = 1$, we get the result. □

Remark 2

• $L^2 h^2 = \frac{L^2}{K^2}$ is an upper bound on the square of the bias. It describes the approximation error.
• $\frac{1}{nh} = \frac{K}{n}$ is an upper bound on the variance of the estimator; it describes the estimation error.
• If we minimize $L^2 h^2 + \frac{1}{nh}$ w.r.t. $h > 0$, we get $h_{\mathrm{opt}} = \big(2nL^2\big)^{-1/3}$ and $\mathrm{MISE}(h_{\mathrm{opt}}) \leq 3\,(L/2)^{2/3}\,n^{-2/3}$ (strictly worse than the parametric rate $n^{-1}$); see the sketch below.
• $h_{\mathrm{opt}}$ depends on $L$, which is unknown! We cannot use it in practice...
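For concreteness, a tiny sketch (plain Python; the values of $L$ and $n$ are arbitrary illustrations, not from the notes) evaluating the optimal bandwidth and checking that it attains the bound of the previous bullet; since $L$ is unknown in practice, this is only a thought experiment.

```python
# Optimal bandwidth and risk bound from Remark 2, for illustrative values of L and n.
L, n = 2.0, 1000
h_opt = (2 * n * L**2) ** (-1 / 3)
bound = 3 * (L / 2) ** (2 / 3) * n ** (-2 / 3)            # claimed bound at h_opt
print(h_opt, L**2 * h_opt**2 + 1 / (n * h_opt), bound)    # the last two values coincide
```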
