by François Laviolette
Learning sample
\[ S = \{ (x_1, y_1), (x_2, y_2), \ldots, (x_m, y_m) \} \sim D^m \]
From Chebyshev's inequality \( \Pr(X - \mathbf{E}\,X \geq a) \leq \frac{\operatorname{Var} X}{a^2 + \operatorname{Var} X} \), we obtain:
The C-bound (Lacasse et al., 2006)
\[
R_D(B_Q) \;\leq\; \mathcal{C}_Q^{D} \;\stackrel{\text{def}}{=}\; 1 - \frac{\bigl(1 - 2\, R_D(G_Q)\bigr)^2}{1 - 2\, d_Q^{D}}
\]
[Figure: the C-bound \(\mathcal{C}_Q^{D}\) plotted as a function of the Gibbs risk \(R_D(G_Q)\) (horizontal axis, 0 to 0.5) and of the disagreement \(d_Q^{D}\) (vertical axis, 0 to 0.5); colour scale from 0.00 to 0.96.]
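As a quick illustration (mine, not from the slides), here is a minimal numerical sketch of the C-bound formula above; the function name and the example values are hypothetical.

```python
def c_bound(gibbs_risk, disagreement):
    """C-bound of Lacasse et al. (2006): upper bound on the majority-vote (Bayes) risk
    computed from the Gibbs risk R_D(G_Q) and the expected disagreement d_Q^D.
    Meaningful when gibbs_risk < 1/2."""
    return 1.0 - (1.0 - 2.0 * gibbs_risk) ** 2 / (1.0 - 2.0 * disagreement)

# Hypothetical values: same Gibbs risk, increasing disagreement between the voters
print(c_bound(gibbs_risk=0.40, disagreement=0.20))  # ~ 0.933
print(c_bound(gibbs_risk=0.40, disagreement=0.45))  # ~ 0.600  (more disagreement, tighter bound)
```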
McAllester Bound
For any D, any H, any P of support H, any δ ∈ (0, 1], we have
\[
\Pr_{S\sim D^m}\!\Bigl( \forall\, Q \text{ on } \mathcal{H}:\;
2\bigl(R_S(G_Q) - R(G_Q)\bigr)^2 \,\leq\, \frac{1}{m}\Bigl[ KL(Q\|P) + \ln\frac{2\sqrt{m}}{\delta} \Bigr] \Bigr) \;\geq\; 1-\delta\,,
\]
or
\[
\Pr_{S\sim D^m}\!\Bigl( \forall\, Q \text{ on } \mathcal{H}:\;
R(G_Q) \,\leq\, R_S(G_Q) + \sqrt{\frac{KL(Q\|P) + \ln\frac{2\sqrt{m}}{\delta}}{2m}} \Bigr) \;\geq\; 1-\delta\,,
\]
where \( \mathrm{kl}(q\|p) \stackrel{\text{def}}{=} q \ln\frac{q}{p} + (1-q)\ln\frac{1-q}{1-p} \).
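Purely as an illustration (not part of the slides), a minimal sketch evaluating the second form of McAllester's bound; all numerical values are hypothetical.

```python
import math

def mcallester_bound(empirical_gibbs_risk, kl_qp, m, delta):
    """Second form of McAllester's bound:
    R(G_Q) <= R_S(G_Q) + sqrt((KL(Q||P) + ln(2*sqrt(m)/delta)) / (2m))."""
    complexity = kl_qp + math.log(2.0 * math.sqrt(m) / delta)
    return empirical_gibbs_risk + math.sqrt(complexity / (2.0 * m))

# Hypothetical numbers: 1000 examples, KL(Q||P) = 5, confidence 95%
print(mcallester_bound(empirical_gibbs_risk=0.10, kl_qp=5.0, m=1000, delta=0.05))  # ~ 0.18
```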
[Figure: lower bound ("Borne Inf") and upper bound ("Borne Sup") plotted as functions of \(R(Q)\), for \(R(Q)\) between 0.1 and 0.5; vertical axis from 0.1 to 0.4.]
where
\[
I_\Delta(m) \;=\; \sup_{r\in[0,1]} \left[\, \sum_{k=0}^{m} \underbrace{\binom{m}{k}\, r^k (1-r)^{m-k}}_{\mathrm{Bin}(k;\,m,\,r)}\; e^{\,m\,\Delta\left(\frac{k}{m},\, r\right)} \right].
\]
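A minimal numerical sketch (mine, not from the slides) of how one could evaluate \(I_\Delta(m)\) by a grid search over \(r\), shown here for \(\Delta = \mathrm{kl}\); the grid resolution and function names are my own choices, and the supremum is only approximated on the open interval (0, 1).

```python
import math
from scipy.stats import binom

def kl_bernoulli(q, p):
    """kl(q||p) for Bernoulli parameters, with the usual 0*log(0) = 0 convention."""
    out = 0.0
    if q > 0:
        out += q * math.log(q / p)
    if q < 1:
        out += (1 - q) * math.log((1 - q) / (1 - p))
    return out

def I_Delta(m, Delta, grid=200):
    """Approximate I_Delta(m) = sup_r sum_k Bin(k; m, r) exp(m * Delta(k/m, r)) on a grid over r."""
    best = 0.0
    for i in range(1, grid):
        r = i / grid
        total = sum(binom.pmf(k, m, r) * math.exp(m * Delta(k / m, r)) for k in range(m + 1))
        best = max(best, total)
    return best

print(I_Delta(50, kl_bernoulli))  # stays below ...
print(2 * math.sqrt(50))          # ... the 2*sqrt(m) value used in the Langford/Seeger proof below
```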
Proof ideas.
Change of Measure Inequality
For any P and Q on H, and for any measurable function φ : H → R, we have
\[
\operatorname*{\mathbf{E}}_{h\sim Q} \phi(h) \;\leq\; KL(Q\|P) + \ln \operatorname*{\mathbf{E}}_{h\sim P} e^{\phi(h)}.
\]
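As a sanity check (mine, not the slides'), the inequality is easy to verify numerically on a finite hypothesis set; the distributions and the function \(\phi\) below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10                                    # size of a finite hypothesis set H
P = rng.dirichlet(np.ones(n))             # arbitrary prior on H
Q = rng.dirichlet(np.ones(n))             # arbitrary posterior on H
phi = rng.normal(size=n)                  # arbitrary function phi: H -> R

lhs = np.dot(Q, phi)                               # E_{h~Q} phi(h)
kl = np.sum(Q * np.log(Q / P))                     # KL(Q||P)
rhs = kl + np.log(np.dot(P, np.exp(phi)))          # KL(Q||P) + ln E_{h~P} e^{phi(h)}
print(lhs <= rhs)  # True for any choice of P, Q and phi
```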
Markov’s inequality
\[
\Pr(X \geq a) \;\leq\; \frac{\mathbf{E}\,X}{a}
\qquad\Longleftrightarrow\qquad
\Pr\!\left( X \leq \frac{\mathbf{E}\,X}{\delta} \right) \;\geq\; 1-\delta\,.
\]
\begin{align*}
& m \cdot \Delta\Bigl( \operatorname*{\mathbf{E}}_{h\sim Q} \widehat{L}_S^{\ell}(h),\; \operatorname*{\mathbf{E}}_{h\sim Q} L_D^{\ell}(h) \Bigr) \\
\text{Jensen's inequality}\quad &\leq\; \operatorname*{\mathbf{E}}_{h\sim Q} m \cdot \Delta\bigl( \widehat{L}_S^{\ell}(h),\, L_D^{\ell}(h) \bigr) \\
\text{Change of measure}\quad &\leq\; KL(Q\|P) + \ln \operatorname*{\mathbf{E}}_{h\sim P} e^{\,m\,\Delta\left( \widehat{L}_S^{\ell}(h),\, L_D^{\ell}(h) \right)} \\
\text{Markov's inequality}\quad &\underset{1-\delta}{\leq}\; KL(Q\|P) + \ln\!\left[ \frac{1}{\delta} \operatorname*{\mathbf{E}}_{S'\sim D^m} \operatorname*{\mathbf{E}}_{h\sim P} e^{\,m\cdot\Delta\left( \widehat{L}_{S'}^{\ell}(h),\, L_D^{\ell}(h) \right)} \right] \\
\text{Expectation swap}\quad &=\; KL(Q\|P) + \ln\!\left[ \frac{1}{\delta} \operatorname*{\mathbf{E}}_{h\sim P} \operatorname*{\mathbf{E}}_{S'\sim D^m} e^{\,m\cdot\Delta\left( \widehat{L}_{S'}^{\ell}(h),\, L_D^{\ell}(h) \right)} \right] \\
\text{Binomial law}\quad &=\; KL(Q\|P) + \ln\!\left[ \frac{1}{\delta} \operatorname*{\mathbf{E}}_{h\sim P} \sum_{k=0}^{m} \mathrm{Bin}\bigl(k;\, m,\, L_D^{\ell}(h)\bigr)\, e^{\,m\cdot\Delta\left( \frac{k}{m},\, L_D^{\ell}(h) \right)} \right] \\
\text{Supremum over risk}\quad &\leq\; KL(Q\|P) + \ln\!\left[ \frac{1}{\delta} \sup_{r\in[0,1]} \sum_{k=0}^{m} \mathrm{Bin}(k;\, m,\, r)\, e^{\,m\,\Delta\left( \frac{k}{m},\, r \right)} \right] \\
&=\; KL(Q\|P) + \ln \frac{I_\Delta(m)}{\delta}\,.
\end{align*}
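Reading the chain off (my paraphrase; the slides leave the conclusion implicit): the inequality marked \(1-\delta\) comes from Markov's inequality applied to a quantity that does not depend on \(Q\), so it holds with probability at least \(1-\delta\) over the draw of \(S\) simultaneously for all posteriors \(Q\). Dividing by \(m\) gives what is presumably the "General Theorem" referred to later:
\[
\Pr_{S\sim D^m}\!\left( \forall\, Q \text{ on } \mathcal{H}:\;
\Delta\Bigl( \operatorname*{\mathbf{E}}_{h\sim Q} \widehat{L}_S^{\ell}(h),\; \operatorname*{\mathbf{E}}_{h\sim Q} L_D^{\ell}(h) \Bigr)
\;\leq\; \frac{1}{m}\Bigl[ KL(Q\|P) + \ln\frac{I_\Delta(m)}{\delta} \Bigr] \right) \;\geq\; 1-\delta\,.
\]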
Corollary
[...] with probability at least \(1-\delta\) over the choice of \(S \sim D^m\), for all Q on \(\mathcal{H}\):

(a) \(\mathrm{kl}\bigl( \widehat{R}_S(G_Q),\, R_D(G_Q) \bigr) \;\leq\; \frac{1}{m}\Bigl[ KL(Q\|P) + \ln\frac{2\sqrt{m}}{\delta} \Bigr]\),  (Langford and Seeger (2001))

(b) \(R_D(G_Q) \;\leq\; \widehat{R}_S(G_Q) + \sqrt{\frac{1}{2m}\Bigl[ KL(Q\|P) + \ln\frac{2\sqrt{m}}{\delta} \Bigr]}\),  (McAllester (1999b, 2003b))

(c) \(R_D(G_Q) \;\leq\; \frac{1}{1-e^{-c}}\Bigl[ c \cdot \widehat{R}_S(G_Q) + \frac{1}{m}\Bigl( KL(Q\|P) + \ln\frac{1}{\delta} \Bigr) \Bigr]\),  (Catoni (2007b))

where
\[
\mathrm{kl}(q, p) \;\stackrel{\text{def}}{=}\; q \ln\frac{q}{p} + (1-q)\ln\frac{1-q}{1-p} \;\geq\; 2(q-p)^2\,,
\]
\[
\Delta_c(q, p) \;\stackrel{\text{def}}{=}\; -\ln\bigl[ 1 - (1-e^{-c})\cdot p \bigr] - c\cdot q\,,
\]
\[
\Delta_\lambda(q, p) \;\stackrel{\text{def}}{=}\; \frac{\lambda}{m}\,(p - q)\,.
\]
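To make (a) concrete (my illustration, not the slides'): the Langford–Seeger bound is used by inverting kl, i.e. by taking the largest \(p\) that keeps \(\mathrm{kl}(\widehat{R}_S, p)\) below the right-hand side. A minimal bisection sketch, with hypothetical numbers:

```python
import math

def kl_bernoulli(q, p):
    """kl(q||p) = q ln(q/p) + (1-q) ln((1-q)/(1-p)), with 0*log(0) = 0."""
    out = 0.0
    if q > 0:
        out += q * math.log(q / p)
    if q < 1:
        out += (1 - q) * math.log((1 - q) / (1 - p))
    return out

def seeger_bound(emp_risk, kl_qp, m, delta, iters=60):
    """Largest p with kl(emp_risk || p) <= (KL(Q||P) + ln(2*sqrt(m)/delta)) / m, by bisection."""
    rhs = (kl_qp + math.log(2.0 * math.sqrt(m) / delta)) / m
    lo, hi = emp_risk, 1.0 - 1e-12
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if kl_bernoulli(emp_risk, mid) <= rhs:
            lo = mid
        else:
            hi = mid
    return lo

# Same hypothetical inputs as the McAllester example above; (a) is tighter than (b)
print(seeger_bound(emp_risk=0.10, kl_qp=5.0, m=1000, delta=0.05))  # < 0.18 (the McAllester value)
```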
Proof of the Langford/Seeger bound
\[
\operatorname*{\mathbf{E}}_{S\sim D^m} e^{\,m\,\mathrm{kl}\left( R_S(h),\, R_D(h) \right)}
\;=\; \sum_{k=0}^{m} \binom{m}{k}\left(\frac{k}{m}\right)^{k}\left(1-\frac{k}{m}\right)^{m-k} \qquad (1)
\]
\[
\;\leq\; 2\sqrt{m}\,.
\]
• Note that, in line (1) of the proof, \(\Pr_{S\sim D^m}\bigl( R_S(h) = \frac{k}{m} \bigr)\) is replaced by the probability mass function of the binomial distribution.
• This is only true if the examples of S are drawn i.i.d. (i.e., S ∼ D^m).
• So this result is no longer valid in the non-i.i.d. case, even if the General Theorem is.
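A quick numerical check (mine) of the inequality in line (1), using the convention \(0^0 = 1\) for the \(k = 0\) and \(k = m\) terms:

```python
import math

def I_kl(m):
    """Closed form of line (1): sum_k C(m,k) (k/m)^k (1 - k/m)^(m-k), with 0^0 = 1."""
    total = 0.0
    for k in range(m + 1):
        q = k / m
        total += math.comb(m, k) * q**k * (1 - q)**(m - k)
    return total

for m in (10, 100, 1000):
    print(m, I_kl(m), 2 * math.sqrt(m))  # I_kl(m) stays below 2*sqrt(m)
```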
Given a PAC-Bayes bound, one can easily derive a learning algorithm: it simply consists of finding the posterior Q that minimizes the bound (a toy example of this appears after Catoni's bound below).
Catoni’s bound
\[
\Pr_{S\sim D^m}\!\left( \forall\, Q \text{ on } \mathcal{H}:\;
R(G_Q) \;\leq\; \frac{1}{1-e^{-C}}\left\{ 1 - \exp\!\left( -\left[ C \cdot R_S(G_Q) + \frac{1}{m}\left( KL(Q\|P) + \ln\frac{1}{\delta} \right) \right] \right) \right\} \right) \;\geq\; 1-\delta\,.
\]
Interestingly, minimizing Catoni's bound (when the prior and the posterior are restricted to Gaussians) gives rise to the SVM!
More precisely, to an SVM in which the hinge loss is replaced by a sigmoid-shaped loss.
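As an illustration only (my sketch, not the algorithm of the slides): since the outer transform in Catoni's bound is increasing, minimizing the bound over Q amounts to minimizing \(C \cdot R_S(G_Q) + \frac{1}{m} KL(Q\|P)\). Assuming a prior \(N(0, I)\) and a posterior \(N(w, I)\) over linear classifiers, the Gibbs risk becomes a sigmoid-shaped (probit) loss and \(KL(Q\|P) = \|w\|^2/2\), which gives the SVM-like objective below. All names and data are hypothetical.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def gibbs_risk(w, X, y):
    """Empirical Gibbs risk of the posterior N(w, I) over linear classifiers h_v(x) = sign(<v, x>):
    E_{v~N(w,I)} 1[y <v,x> < 0] = Phi(-y <w,x> / ||x||), a sigmoid-shaped loss."""
    margins = y * (X @ w) / np.linalg.norm(X, axis=1)
    return np.mean(norm.cdf(-margins))

def catoni_objective(w, X, y, C, delta):
    """Bound-minimization objective C * R_S(G_Q) + (KL(Q||P) + ln(1/delta)) / m,
    with prior P = N(0, I) and posterior Q = N(w, I), so KL(Q||P) = ||w||^2 / 2."""
    m = len(y)
    kl = 0.5 * np.dot(w, w)
    return C * gibbs_risk(w, X, y) + (kl + np.log(1.0 / delta)) / m

# Toy data (hypothetical): two Gaussian blobs in 2D
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(+1, 1, size=(50, 2)), rng.normal(-1, 1, size=(50, 2))])
y = np.concatenate([np.ones(50), -np.ones(50)])

res = minimize(catoni_objective, x0=np.zeros(2), args=(X, y, 5.0, 0.05))
print(res.x, gibbs_risk(res.x, X, y))  # learned weight vector and its empirical Gibbs risk
```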
• However, recall that the prior is not allowed to depend in any way on the training set.
• One may set aside a part of the training set to learn the prior, and use only the remaining part to compute the PAC-Bayesian bound.
• Even if the prior cannot be data-dependent, it can depend on the distribution D that generates the data.
• How can this be possible? D is supposed to be unknown!
• Thus, P will have to remain unknown!
• But maybe we can nevertheless manage to estimate KL(Q‖P).
This is all we need here.
This has been proposed in:
• A. Ambroladze, E. Parrado-Hernández, and J. Shawe-Taylor. Tighter PAC-Bayes bounds. In Advances in Neural Information Processing Systems 18, pages 9-16, 2006.
  The chosen prior was \( \mathbf{w}_p = \operatorname*{\mathbf{E}}_{(x,y)\sim D}\bigl[ y\, \phi(x) \bigr] \) (a sketch of its empirical estimation follows this list).
• O. Catoni. A PAC-Bayesian approach to adaptive classification. Preprint n. 840, Laboratoire de Probabilités et Modèles Aléatoires, Universités Paris 6 and Paris 7, 2003.
• G. Lever, F. Laviolette, and J. Shawe-Taylor. Distribution-Dependent PAC-Bayes Priors. In Proceedings of the 21st International Conference on Algorithmic Learning Theory (ALT 2010), pages 119-133.
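A minimal sketch (mine) of how such a prior direction could be estimated on a held-out split and then used as the prior mean, taking \(\phi\) as the identity feature map for simplicity; all variable names and data are hypothetical.

```python
import numpy as np

def estimate_prior_direction(X_prior, y_prior):
    """Empirical estimate of w_p = E_{(x,y)~D}[y * phi(x)] on a held-out split,
    here with phi(x) = x (identity feature map)."""
    return np.mean(y_prior[:, None] * X_prior, axis=0)

# Hypothetical split: one half learns the prior mean, the other half is kept for the bound
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(+1, 1, size=(100, 2)), rng.normal(-1, 1, size=(100, 2))])
y = np.concatenate([np.ones(100), -np.ones(100)])
idx = rng.permutation(len(y))
prior_idx, bound_idx = idx[:100], idx[100:]

w_prior = estimate_prior_direction(X[prior_idx], y[prior_idx])
print(w_prior)  # prior mean; the PAC-Bayesian bound would then be computed on X[bound_idx], y[bound_idx]
```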
Localized PAC-Bayesian bounds:
(2) Distribution-Dependent PAC-Bayes Priors (cont.)
MAGIC!!!
Proof.
\begin{align*}
& m \cdot \Delta\Bigl( \operatorname*{\mathbf{E}}_{h\sim Q} \widehat{L}_S^{\ell}(h),\; \operatorname*{\mathbf{E}}_{h\sim Q} L_D^{\ell}(h) \Bigr) \\
\text{Jensen's inequality}\quad &\leq\; \operatorname*{\mathbf{E}}_{h\sim Q} m \cdot \Delta\bigl( \widehat{L}_S^{\ell}(h),\, L_D^{\ell}(h) \bigr) \\
\text{Jensen's inequality}\quad &=\; \operatorname*{\mathbf{E}}_{h\sim Q} \ln e^{\,m\,\Delta\left( \widehat{L}_S^{\ell}(h),\, L_D^{\ell}(h) \right)}
\;\leq\; \ln \operatorname*{\mathbf{E}}_{h\sim Q} e^{\,m\cdot\Delta\left( \widehat{L}_S^{\ell}(h),\, L_D^{\ell}(h) \right)} \\
\text{Change of measure}\quad &\leq\; \ln \operatorname*{\mathbf{E}}_{h\sim P} e^{\,m\cdot\Delta\left( \widehat{L}_S^{\ell}(h),\, L_D^{\ell}(h) \right)} \\
\text{Markov's inequality}\quad &\underset{1-\delta}{\leq}\; \ln\!\left[ \frac{1}{\delta} \operatorname*{\mathbf{E}}_{S'\sim D^m} \operatorname*{\mathbf{E}}_{h\sim P} e^{\,m\cdot\Delta\left( \widehat{L}_{S'}^{\ell}(h),\, L_D^{\ell}(h) \right)} \right] \\
\text{Expectation swap}\quad &=\; \ln\!\left[ \frac{1}{\delta} \operatorname*{\mathbf{E}}_{h\sim P} \operatorname*{\mathbf{E}}_{S'\sim D^m} e^{\,m\cdot\Delta\left( \widehat{L}_{S'}^{\ell}(h),\, L_D^{\ell}(h) \right)} \right] \\
\text{Binomial law}\quad &=\; \ln\!\left[ \frac{1}{\delta} \operatorname*{\mathbf{E}}_{h\sim P} \sum_{k=0}^{m} \mathrm{Bin}\bigl(k;\, m,\, L_D^{\ell}(h)\bigr)\, e^{\,m\cdot\Delta\left( \frac{k}{m},\, L_D^{\ell}(h) \right)} \right] \\
\text{Supremum over risk}\quad &\leq\; \ln\!\left[ \frac{1}{\delta} \sup_{r\in[0,1]} \sum_{k=0}^{m} \mathrm{Bin}(k;\, m,\, r)\, e^{\,m\,\Delta\left( \frac{k}{m},\, r \right)} \right]
\;=\; \ln \frac{I_\Delta(m)}{\delta}\,.
\end{align*}
Proof.
\begin{align*}
& m \cdot \Delta\Bigl( \operatorname*{\mathbf{E}}_{h\sim Q} \widehat{L}_S^{\ell}(h),\; \operatorname*{\mathbf{E}}_{h\sim Q} \widehat{L}_Z^{\ell}(h) \Bigr) \\
\text{Jensen's inequality}\quad &\leq\; \operatorname*{\mathbf{E}}_{h\sim Q} m \cdot \Delta\bigl( \widehat{L}_S^{\ell}(h),\, \widehat{L}_Z^{\ell}(h) \bigr) \\
\text{Change of measure}\quad &\leq\; KL(Q\|P) + \ln \operatorname*{\mathbf{E}}_{h\sim P} e^{\,m\,\Delta\left( \widehat{L}_S^{\ell}(h),\, \widehat{L}_Z^{\ell}(h) \right)} \\
\text{Markov's inequality}\quad &\underset{1-\delta}{\leq}\; KL(Q\|P) + \ln\!\left[ \frac{1}{\delta} \operatorname*{\mathbf{E}}_{S'\sim [Z]^m} \operatorname*{\mathbf{E}}_{h\sim P} e^{\,m\cdot\Delta\left( \widehat{L}_{S'}^{\ell}(h),\, \widehat{L}_Z^{\ell}(h) \right)} \right] \\
\text{Expectations swap}\quad &=\; KL(Q\|P) + \ln\!\left[ \frac{1}{\delta} \operatorname*{\mathbf{E}}_{h\sim P} \operatorname*{\mathbf{E}}_{S'\sim [Z]^m} e^{\,m\cdot\Delta\left( \widehat{L}_{S'}^{\ell}(h),\, \widehat{L}_Z^{\ell}(h) \right)} \right] \\
\text{Hypergeometric law}\quad &=\; KL(Q\|P) + \ln\!\left[ \frac{1}{\delta} \operatorname*{\mathbf{E}}_{h\sim P} \sum_{k} \frac{\binom{N \widehat{L}_Z^{\ell}(h)}{k}\binom{N - N \widehat{L}_Z^{\ell}(h)}{m-k}}{\binom{N}{m}}\, e^{\,m\cdot\Delta\left( \frac{k}{m},\, \widehat{L}_Z^{\ell}(h) \right)} \right] \\
\text{Supremum over risk}\quad &\leq\; KL(Q\|P) + \ln\!\left[ \frac{1}{\delta} \max_{K=0,\ldots,N} \sum_{k} \frac{\binom{K}{k}\binom{N-K}{m-k}}{\binom{N}{m}}\, e^{\,m\,\Delta\left( \frac{k}{m},\, \frac{K}{N} \right)} \right] \\
&=\; KL(Q\|P) + \ln \frac{T_\Delta(m, N)}{\delta}\,.
\end{align*}
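A small numerical sketch (mine) of how \(T_\Delta(m, N)\) could be evaluated, here again with \(\Delta = \mathrm{kl}\); the hypergeometric pmf comes from scipy and all parameter values are hypothetical.

```python
import math
from scipy.stats import hypergeom

def kl_bernoulli(q, p):
    """kl(q||p) with the 0*log(0) = 0 convention; p is clipped away from {0, 1}."""
    p = min(max(p, 1e-12), 1 - 1e-12)
    out = 0.0
    if q > 0:
        out += q * math.log(q / p)
    if q < 1:
        out += (1 - q) * math.log((1 - q) / (1 - p))
    return out

def T_Delta(m, N, Delta):
    """T_Delta(m, N) = max_K sum_k Hyp(k; N, K, m) * exp(m * Delta(k/m, K/N)),
    where Hyp is the hypergeometric pmf (population N, K successes, m draws)."""
    best = 0.0
    for K in range(N + 1):
        total = sum(hypergeom.pmf(k, N, K, m) * math.exp(m * Delta(k / m, K / N))
                    for k in range(max(0, m - (N - K)), min(m, K) + 1))
        best = max(best, total)
    return best

print(T_Delta(m=20, N=100, Delta=kl_bernoulli))
```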
Acknowledgements