
Statistique en grande dimension

Lecturer : Dalalyan A., Scribe : Thomas F.-X.

Second lecture

1 Previously...
$Z_1, \ldots, Z_n$ are iid; their common law is $P$ (unknown).
There are 2 types of learning :

• Supervised : Zi = (Xi , Yi )
• Unsupervised : Zi = Xi

Prediction $(X, Y) \in \mathcal{X} \times \mathcal{Y}$: given an example $X$, we'd like to predict the value of $Y$.

• Binary classification: $\mathcal{Y} = \{0, 1\}$ and $\ell(y, y') = \mathbf{1}(y \neq y')$, with
$$\eta^*(x) = \mathbb{E}[Y \mid X = x] \qquad \text{and} \qquad g^*(x) = \mathbf{1}\Big(\eta^*(x) > \frac{1}{2}\Big).$$

• Least-squares regression: $\mathcal{Y} \subset \mathbb{R}$ and $\ell(y, y') = (y - y')^2$, with
$$g^*(x) = \eta^*(x) = \mathbb{E}[Y \mid X = x].$$

Risk
$$R[g] = \mathbb{E}_P\big[\ell(Y, g(X))\big]$$

Bayes predictor
$$g^* \in \arg\min_{g : \mathcal{X} \to \mathcal{Y}} R[g]$$

Excess risk
$$R[\hat{g}] - R[g^*] \geq 0$$

2 Link between Binary Classification & Regression

2.1 Plug-in rule


• We start by estimating $\eta^*(x)$ by $\hat{\eta}_n(x)$;
• We define $\hat{g}_n(x) = \mathbf{1}\big(\hat{\eta}_n(x) > \frac{1}{2}\big)$ (see the sketch below).
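As an illustration, here is a minimal sketch of the plug-in rule in Python/NumPy (not part of the original notes): $\eta^*$ is estimated by a simple partition-based regression estimate $\hat{\eta}_n$, which is then thresholded at $1/2$. The simulated model and all function names are illustrative choices, not prescribed by the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated supervised data on [0, 1], with known eta*(x) = P(Y = 1 | X = x).
eta_star = lambda x: 0.5 + 0.4 * np.sin(2 * np.pi * x)
n = 2000
X = rng.uniform(0, 1, n)
Y = rng.binomial(1, eta_star(X))

def eta_hat(x, X, Y, n_bins=20):
    """Partition-based (regressogram) estimate of eta*(x) on [0, 1]."""
    cells = np.clip((np.atleast_1d(x) * n_bins).astype(int), 0, n_bins - 1)
    train_cells = np.clip((X * n_bins).astype(int), 0, n_bins - 1)
    sums = np.bincount(train_cells, weights=Y, minlength=n_bins)
    counts = np.bincount(train_cells, minlength=n_bins)
    # Average the Y_i falling in the same cell as x (default to 1/2 on empty cells).
    means = np.where(counts > 0, sums / np.maximum(counts, 1), 0.5)
    return means[cells]

def g_hat(x, X, Y):
    """Plug-in classifier: threshold the regression estimate at 1/2."""
    return (eta_hat(x, X, Y) > 0.5).astype(int)

print(g_hat(np.array([0.1, 0.4, 0.8]), X, Y))
```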


Question: how good is the plug-in rule $\hat{g}_n$?

Proposition 1. Let $\hat{\eta}$ be an estimator of the regression function $\eta^*$, and let $\hat{g}(x) = \mathbf{1}\big(\hat{\eta}(x) > \frac{1}{2}\big)$. Then, we have:
$$R_{\mathrm{class}}[\hat{g}] - R_{\mathrm{class}}[g^*] \leq 2\sqrt{R_{\mathrm{reg}}[\hat{\eta}] - R_{\mathrm{reg}}[\eta^*]}.$$

Proof. Let $\eta : \mathcal{X} \to \mathcal{Y} \subset \mathbb{R}$, let $g(x) = \mathbf{1}\big(\eta(x) > \frac{1}{2}\big)$, and let us compute the excess risk of $g$. By definition:

$$R_{\mathrm{class}}[g] - R_{\mathrm{class}}[g^*] = \mathbb{E}\big[\mathbf{1}(Y \neq g(X)) - \mathbf{1}(Y \neq g^*(X))\big]$$

Simple algebra yields:

$$\mathbb{E}\big[\mathbf{1}(Y \neq g(X))\big] = \mathbb{E}\big[\mathbf{1}(Y \neq g(X))\,\mathbf{1}(Y = 1)\big] + \mathbb{E}\big[\mathbf{1}(Y \neq g(X))\,\mathbf{1}(Y = 0)\big]$$
$$= \mathbb{E}\Big[\mathbb{E}\big(\mathbf{1}(1 \neq g(X))\,\mathbf{1}(Y = 1) \mid X\big)\Big] + \mathbb{E}\Big[\mathbb{E}\big(\mathbf{1}(0 \neq g(X))\,\mathbf{1}(Y = 0) \mid X\big)\Big].$$

Let us now compute the conditional expectations appearing above:
$$\mathbb{E}\Big[\underbrace{\mathbf{1}(1 \neq g(X))}_{=1-g(X)}\cdot\mathbf{1}(Y = 1) \,\Big|\, X\Big] = \big(1 - g(X)\big)\,\mathbb{P}(Y = 1 \mid X) = \big(1 - g(X)\big)\,\eta^*(X),$$
$$\mathbb{E}\Big[\underbrace{\mathbf{1}(0 \neq g(X))}_{=g(X)}\cdot\mathbf{1}(Y = 0) \,\Big|\, X\Big] = g(X)\,\mathbb{P}(Y = 0 \mid X) = g(X)\,\big(1 - \eta^*(X)\big).$$

Combining these relations, we get

$$\mathbb{E}\big[\mathbf{1}(Y \neq g(X))\big] = \mathbb{E}[\eta^*(X)] + \mathbb{E}\big[g(X)\,(1 - 2\eta^*(X))\big].$$


Therefore,

$$R_{\mathrm{class}}[g] - R_{\mathrm{class}}[g^*] = \mathbb{E}\big[(g(X) - g^*(X))\,(1 - 2\eta^*(X))\big].$$


Since g and g ∗ are both indicator functions and, therefore, take only the values 0 and 1, their
difference will be nonzero if and only if one of them is equal to 1 and the other one is equal to 0.
This leads to

$$R_{\mathrm{class}}[g] - R_{\mathrm{class}}[g^*] \leq \mathbb{E}\Big[\mathbf{1}\big(\eta(X) \leq \tfrac{1}{2} < \eta^*(X)\big)\,|2\eta^*(X) - 1|\Big] + \mathbb{E}\Big[\mathbf{1}\big(\eta^*(X) \leq \tfrac{1}{2} < \eta(X)\big)\,|2\eta^*(X) - 1|\Big]$$
$$= 2\,\mathbb{E}\Big[\mathbf{1}\big(\tfrac{1}{2} \in [\eta^*(X), \eta(X)]\big)\,\big|\eta^*(X) - \tfrac{1}{2}\big|\Big],$$
where $[\eta^*(X), \eta(X)]$ denotes the interval with endpoints $\eta^*(X)$ and $\eta(X)$ (in either order), and we used that $|2\eta^*(X) - 1| = 2\,\big|\eta^*(X) - \tfrac{1}{2}\big|$.

If $\eta(X) \leq \frac{1}{2}$ and $\eta^*(X) > \frac{1}{2}$, then $\eta^*(X) - \frac{1}{2} \leq |\eta^*(X) - \eta(X)|$, and thus:

$$R_{\mathrm{class}}[g] - R_{\mathrm{class}}[g^*] \leq 2\,\mathbb{E}\Big[\mathbf{1}\big(\tfrac{1}{2} \in [\eta(X), \eta^*(X)]\big)\,|\eta(X) - \eta^*(X)|\Big]$$
$$\leq 2\,\mathbb{E}\big[|\eta(X) - \eta^*(X)|\big]$$
$$\leq 2\,\sqrt{\mathbb{E}\big[(\eta(X) - \eta^*(X))^2\big]} \qquad \text{(by the Cauchy--Schwarz inequality)}$$
$$= 2\,\sqrt{R_{\mathrm{reg}}(\eta) - R_{\mathrm{reg}}(\eta^*)},$$
where the last equality uses the identity $R_{\mathrm{reg}}(\eta) - R_{\mathrm{reg}}(\eta^*) = \mathbb{E}\big[(\eta(X) - \eta^*(X))^2\big]$.

Since this inequality is true for every deterministic η, we get the desired property. 
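To see the inequality in action, here is a small Monte Carlo sketch (Python/NumPy; not part of the notes). It uses a synthetic model with a known $\eta^*$ and an arbitrary fixed $\eta$ playing the role of the estimator, and compares the two sides of Proposition 1; the specific functions chosen are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

# Known regression function eta*(x) and a deliberately crude surrogate eta(x).
eta_star = lambda x: 0.5 + 0.4 * np.sin(2 * np.pi * x)
eta = lambda x: 0.5 + 0.2 * np.sin(2 * np.pi * x + 0.3)   # plays the role of eta_hat

X = rng.uniform(0, 1, 200_000)          # X ~ Uniform([0, 1]) for the Monte Carlo average
g_star = (eta_star(X) > 0.5).astype(int)
g = (eta(X) > 0.5).astype(int)

# Excess classification risk: E[ 1(g(X) != g*(X)) * |2 eta*(X) - 1| ]  (see the proof above).
excess_class = np.mean((g != g_star) * np.abs(2 * eta_star(X) - 1))

# Excess regression risk: E[ (eta(X) - eta*(X))^2 ].
excess_reg = np.mean((eta(X) - eta_star(X)) ** 2)

print(excess_class, 2 * np.sqrt(excess_reg))   # the first value should not exceed the second
```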

2.2 Link between classification and density estimation
Framework: $(X_1, Y_1), \ldots, (X_n, Y_n) \stackrel{\text{iid}}{\sim} P$.
Assume that, conditionally on $Y$, $P$ admits a density on $\mathcal{X}$. Let $f_1(x)$ be the density of $X$ conditionally on $Y = 1$, and $f_0$ likewise for $Y = 0$. Let $\pi_1 = \mathbb{P}(Y = 1)$ and $\pi_0 = 1 - \pi_1$. The well-known Bayes formula implies that:
$$\eta^*(x) = \mathbb{E}[Y \mid X = x] = \mathbb{P}(Y = 1 \mid X = x) = \frac{f_1(x)\pi_1}{f_1(x)\pi_1 + f_0(x)\pi_0}.$$

Hence, the Bayes rule can be written as:
$$g^*(x) = \begin{cases} 1, & \text{if } \eta^*(x) > \frac{1}{2} \\ 0, & \text{otherwise} \end{cases} \;=\; \begin{cases} 1, & \text{if } f_1(x)\pi_1 > f_0(x)\pi_0 \\ 0, & \text{otherwise.} \end{cases}$$

General approach

• Estimate $\pi_0, \pi_1$ by:
$$\hat{\pi}_1 = \frac{1}{n}\sum_{i=1}^n Y_i, \qquad \hat{\pi}_0 = 1 - \hat{\pi}_1.$$
• Use a non-parametric method for estimating $f_1$ and $f_0$.
• Set $\hat{g}(x) = \mathbf{1}\big(\hat{f}_1(x)\hat{\pi}_1 > \hat{f}_0(x)\hat{\pi}_0\big)$ (see the sketch below).
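A minimal end-to-end sketch of this general approach (Python/NumPy; not from the notes). It assumes one-dimensional features in $[0, 1]$ and uses histogram estimates of $f_1$ and $f_0$, anticipating the histogram estimator studied in Section 3; the simulated class-conditional densities and all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic labelled sample in which X | Y = y has a density on [0, 1].
n = 5000
Y = rng.binomial(1, 0.3, n)
X = np.where(Y == 1, rng.beta(2, 5, n), rng.beta(5, 2, n))

# Step 1: estimate the class proportions.
pi1_hat = Y.mean()
pi0_hat = 1 - pi1_hat

# Step 2: non-parametric (histogram) estimates of f1 and f0 on [0, 1].
K = 20                                    # number of bins, bandwidth h = 1/K
edges = np.linspace(0, 1, K + 1)
f1_hat, _ = np.histogram(X[Y == 1], bins=edges, density=True)
f0_hat, _ = np.histogram(X[Y == 0], bins=edges, density=True)

# Step 3: plug everything into the Bayes rule.
def g_hat(x):
    k = np.clip((np.atleast_1d(x) * K).astype(int), 0, K - 1)   # bin index of x
    return (f1_hat[k] * pi1_hat > f0_hat[k] * pi0_hat).astype(int)

print(g_hat([0.1, 0.5, 0.9]))
```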

Multi-class classification. In the case where $\mathcal{Y} = \{a_1, \ldots, a_M\}$, similar arguments yield
$$g^*(x) = \arg\max_{a \in \mathcal{Y}} \mathbb{P}(Y = a \mid X = x).$$

3 Density model
Now let $X_1, \ldots, X_n$ be iid with density $f$. To start with, we assume that $X_i \in [0, 1]$; relaxations of this condition will be discussed later.
Let us assume that
$$f \in \mathrm{Lip}(L) = \big\{f : [0, 1] \to \mathbb{R} : |f(x) - f(y)| \leq L\,|x - y|, \ \forall x, y \in [0, 1]\big\}.$$

We are going to estimate $f$ by a histogram. The set $\mathrm{Lip}(L)$ can be approximated by a $K$-dimensional space $\Theta_K$:
$$\Theta_K = \Big\{\theta \in \mathbb{R}^K : \theta_k \geq 0, \ \sum_{k=1}^K \theta_k = 1\Big\}.$$

In fact, the following map is an injection from $\Theta_K$ into the set of probability densities on $[0, 1]$:
$$\theta \in \Theta_K \mapsto f_\theta(x) = \frac{\theta_k}{h}, \quad \forall x \in [(k-1)h, kh[, \qquad \text{where } h = \frac{1}{K}.$$
Under the assumption that $f = f_\theta$ with $\theta \in \Theta_K$, the estimation of $f$ is equivalent to estimating $\theta$, which can be done by the method of maximum likelihood:
$$\hat{\theta}_n = \arg\max_{\theta \in \Theta_K} \prod_{i=1}^n f_\theta(X_i) = \arg\max_{\theta \in \Theta_K} \sum_{i=1}^n \ln f_\theta(X_i).$$

It is easy to check (a good exercise for the students) that the maximum is attained at
$$\hat{\theta}_n = \big(\hat{\theta}_n^{(1)}, \ldots, \hat{\theta}_n^{(K)}\big) \quad \text{such that} \quad \hat{\theta}_n^{(k)} = \frac{1}{n}\sum_{i=1}^n \mathbf{1}\big(X_i \in [(k-1)h, kh[\big).$$
The following estimator is then called the histogram with bandwidth $h = \frac{1}{K}$:
$$\hat{f}_{n,h}(x) = \frac{1}{nh}\sum_{k=1}^K \mathbf{1}_{C_{k,h}}(x)\Big[\sum_{i=1}^n \mathbf{1}_{C_{k,h}}(X_i)\Big], \qquad \text{with } C_{k,h} = [(k-1)h, kh[.$$
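The estimator is straightforward to implement directly from this definition. A minimal sketch in Python/NumPy (not from the notes; the function name and the test sample are illustrative):

```python
import numpy as np

def hist_estimator(x, sample, K):
    """Histogram estimator f_hat_{n,h}(x) on [0, 1] with bandwidth h = 1/K."""
    h = 1.0 / K
    n = len(sample)
    # 0-based index of the cell C_{k,h} = [(k-1)h, kh[ containing each point.
    k_x = np.clip((np.atleast_1d(x) / h).astype(int), 0, K - 1)
    k_sample = np.clip((sample / h).astype(int), 0, K - 1)
    counts = np.bincount(k_sample, minlength=K)       # number of X_i in each cell
    return counts[k_x] / (n * h)

rng = np.random.default_rng(3)
sample = rng.beta(2, 3, 1000)        # the true density here is Beta(2, 3), supported on [0, 1]
print(hist_estimator([0.25, 0.5, 0.75], sample, K=10))
```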

We have 2 cases :

• If K is too large : overfitting


• If K is too small : underfitting

Remark 1. If $x \in C_{k,h}$, then:
$$nh\,\hat{f}_{n,h}(x) = \sum_{i=1}^n \mathbf{1}_{C_{k,h}}(X_i) \sim \mathcal{B}\Big(n, \int_{C_k} f(x)\,dx\Big),$$
where $\mathcal{B}(n, p)$ stands for the binomial distribution with parameters $(n, p)$. Note that the integral over $C_k$ in the above formula comes from the fact that:
$$\mathbb{P}\big(\mathbf{1}_{C_{k,h}}(X_i) = 1\big) = \mathbb{P}(X_i \in C_{k,h}) = \int_{C_k} f(x)\,dx.$$

Proposition 2. Let $X_1, \ldots, X_n$ be iid with density $f(x)$ and $K \in \mathbb{N}$. Set $p_j = \int_{C_j} f$ for every $j = 1, \ldots, K$. The Mean Integrated Squared Error of the histogram $\hat{f}_{n,h}$ is given by:
$$\mathrm{MISE}_f(h) := \mathbb{E}_f\Big[\int_0^1 \big(\hat{f}_{n,h}(x) - f(x)\big)^2\,dx\Big] = \int_0^1 f^2(x)\,dx + \frac{1}{nh} - \frac{n+1}{nh}\sum_{j=1}^K p_j^2.$$

Proof. Combining Fubini with the bias-variance decomposition, we get:
$$\mathbb{E}_f\Big[\int_0^1 \big(\hat{f}_{n,h}(x) - f(x)\big)^2\,dx\Big] = \int_0^1 \mathbb{E}_f\Big[\big(\hat{f}_{n,h}(x) - f(x)\big)^2\Big]\,dx$$
$$= \sum_{k=1}^K \int_{C_k} \mathbb{E}_f\Big[\big(\hat{f}_{n,h}(x) - f(x)\big)^2\Big]\,dx$$
$$= \sum_{k=1}^K \int_{C_k} \Big(\big(\mathbb{E}_f[\hat{f}_{n,h}(x)] - f(x)\big)^2 + \mathrm{Var}_f\big[\hat{f}_{n,h}(x)\big]\Big)\,dx.$$

It is clear that:
$$nh\,\hat{f}_{n,h}(x) \sim \mathcal{B}(n, p_k) \ \Longrightarrow\ \mathbb{E}_f\big[\hat{f}_{n,h}(x)\big] = \frac{np_k}{nh} \quad \text{and} \quad \mathrm{Var}_f\big[\hat{f}_{n,h}(x)\big] = \frac{np_k(1 - p_k)}{(nh)^2}.$$
Therefore,
$$\mathrm{MISE}_f(h) = \sum_{k=1}^K \Big[\int_{C_k} \Big(\frac{p_k}{h} - f(x)\Big)^2 dx + \frac{p_k(1 - p_k)}{nh}\Big] \qquad (1)$$
$$= \sum_{k=1}^K \Big[-\frac{p_k^2}{h} + \int_{C_k} f^2\,dx + \frac{p_k - p_k^2}{nh}\Big] \qquad \Big(\text{we used that } \int_{C_k} f = p_k\Big)$$
$$= \int_0^1 f^2\,dx + \frac{\sum_{k=1}^K p_k}{nh} - \frac{n+1}{nh}\sum_{k=1}^K p_k^2.$$

The desired result follows from the obvious relation $\sum_k p_k = 1$. □
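Since Proposition 2 gives the MISE in closed form, it can be evaluated exactly for a known density to see how the risk varies with the bandwidth. A small sketch (Python/NumPy; the density $f(x) = 1 + 0.5\cos(2\pi x)$ is an illustrative Lipschitz density on $[0, 1]$, not one used in the notes):

```python
import numpy as np

# Illustrative Lipschitz density on [0, 1]: f(x) = 1 + 0.5 * cos(2 * pi * x).
F = lambda x: x + np.sin(2 * np.pi * x) / (4 * np.pi)   # antiderivative of f
int_f_squared = 1.125                                   # exact value of \int_0^1 f(x)^2 dx

def mise(n, K):
    """Exact MISE of the histogram with h = 1/K, using the formula of Proposition 2."""
    h = 1.0 / K
    edges = np.linspace(0, 1, K + 1)
    p = F(edges[1:]) - F(edges[:-1])                    # p_j = \int_{C_j} f
    return int_f_squared + 1 / (n * h) - (n + 1) / (n * h) * np.sum(p ** 2)

n = 1000
for K in (2, 5, 10, 20, 50, 200):
    print(K, mise(n, K))   # the risk first decreases, then increases with K (under/overfitting)
```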

The result of the previous proposition is very useful, since it provides the explicit form of the risk.
However, it is not very easy to see how this expression depends on the bandwidth h, which is the
parameter responsible for overfitting or underfitting. Under the assumption that f ∈ Lip (L), we
can get a simpler bound on the risk. Indeed, for every x ∈ Ck , we have :
$$\Big|f(x) - \frac{p_k}{h}\Big| = \Big|f(x) - \frac{1}{h}\int_{C_k} f(y)\,dy\Big| = \frac{1}{h}\Big|\int_{C_k} \big(f(x) - f(y)\big)\,dy\Big|$$
$$\leq \frac{1}{h}\int_{C_k} |f(x) - f(y)|\,dy \leq \frac{L}{h}\int_{C_k} |x - y|\,dy \leq Lh. \qquad (2)$$

This leads to the following result.

Proposition 3. If $f \in \mathrm{Lip}(L)$, then:
$$\mathrm{MISE}_f(h) \leq L^2 h^2 + \frac{1}{nh}.$$
Proof. Using (1), (2) and $\sum_{k=1}^K p_k(1 - p_k) \leq \sum_{k=1}^K p_k = 1$, we get the result. □

Remark 2

• $L^2 h^2 = \frac{L^2}{K^2}$ is an upper bound on the square of the bias. It describes the approximation error.
• $\frac{1}{nh} = \frac{K}{n}$ is an upper bound on the variance of the estimator; it describes the estimation error.
• If we minimize $L^2 h^2 + \frac{1}{nh}$ w.r.t. $h > 0$, we get $h_{\mathrm{opt}} = \big(2nL^2\big)^{-1/3}$ and $\mathrm{MISE}(h_{\mathrm{opt}}) \leq 3\,(L/2)^{2/3}\,n^{-2/3}$ (strictly worse than the parametric rate $n^{-1}$); see the sketch below.
• $h_{\mathrm{opt}}$ depends on $L$, which is unknown! We cannot use it in practice...
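For concreteness, a tiny sketch (plain Python; the values of $L$ and $n$ are arbitrary illustrations, not from the notes) evaluating the optimal bandwidth and checking that it attains the bound of the previous bullet; since $L$ is unknown in practice, this is only a thought experiment.

```python
# Optimal bandwidth and risk bound from Remark 2, for illustrative values of L and n.
L, n = 2.0, 1000
h_opt = (2 * n * L**2) ** (-1 / 3)
bound = 3 * (L / 2) ** (2 / 3) * n ** (-2 / 3)            # claimed bound at h_opt
print(h_opt, L**2 * h_opt**2 + 1 / (n * h_opt), bound)    # the last two values coincide
```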
