
Foundations of Machine Learning (FOML)

Lecture 1

Kristiaan Pelckmans

September 8, 2015
Overview

Today:
- Overview of the course.
- Support Vector Machines (SVMs): the separable case.
- Convex optimization.
- Analysis.
- Kernels.
- SVMs: the inseparable case.
Overview (Ctd)

Organization:
- 10 lectures.
- 1 computer lab (mid-October).
- Mini-projects (due end of October).
- Participants give lectures based on the course material.
Overview (Ctd)
Course:
1. Introduction.
2. Support Vector Machines (SVMs).
3. Probably Approximately Correct (PAC) analysis.
4. Boosting.
5. Online Learning.
6. Multi-class classification (*).
7. Ranking (*).
8. Regression (*).
9. Stability-based analysis (*).
10. Dimensionality reduction (*).
11. Reinforcement learning (*).
12. Presentations of the results of the mini-projects.
Introduction

Applications and problems:

- Classification.
- Regression.
- Ranking.
- Clustering.
- Dimensionality reduction, or manifold learning.
Introduction (Ctd)
Definitions and Terminology:
- m examples.
- Features x_i ∈ X.
- Labels y_i ∈ Y.
- Fixed, unknown distribution D underlying the samples.
- Training sample S_m ⊂ X × Y.
- Validation sample S′.
- Test sample S″.
- Loss function L.
- Hypothesis set H = {h : X → Y}.
- Learning algorithm A(θ) : S ↦ h_S,
  where θ collects all the free tuning parameters.
- Risk (true and average loss) R and R̂_m.
Introduction (Ctd)

n-fold cross-validation
- Let S_m = {(x_i, y_i)}_{i=1}^m be the original training set.
- Divide S_m into n disjoint folds, so that every point is included exactly once.
- Form n sets of n − 1 folds each; denote them S_i.
- Let S_i = {(x_{ij}, y_{ij})}_{j=1}^{m_i} be the training set of the i-th iteration.
- Hence h_{S_i} is the outcome of A(θ) applied to the i-th training set.

      R̂_CV(θ) = (1/n) Σ_{i=1}^n (1/m_i) Σ_{j=1}^{m_i} L( h_{S_i}(x_{ij}), y_{ij} ),

  where the inner sum runs over the points held out of S_i.
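The estimate above can be sketched in a few lines. The nearest-centroid learner standing in for A(θ), and the Gaussian toy data, are illustrative assumptions, not part of the slides:

```python
import numpy as np

def cv_risk(X, y, n_folds, fit, loss):
    """n-fold cross-validation: average the held-out loss over the folds."""
    m = len(X)
    idx = np.arange(m)
    folds = np.array_split(idx, n_folds)        # disjoint folds, every point once
    risks = []
    for held_out in folds:
        train = np.setdiff1d(idx, held_out)     # the remaining n - 1 folds = S_i
        h = fit(X[train], y[train])             # h_{S_i} = A(theta)(S_i)
        risks.append(np.mean([loss(h(X[j]), y[j]) for j in held_out]))
    return float(np.mean(risks))                # R_CV(theta)

def fit_centroid(X, y):
    """Illustrative stand-in learner: nearest class centroid."""
    c_pos, c_neg = X[y == 1].mean(axis=0), X[y == -1].mean(axis=0)
    return lambda x: 1 if np.linalg.norm(x - c_pos) < np.linalg.norm(x - c_neg) else -1

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(2.0, 1.0, (20, 2)), rng.normal(-2.0, 1.0, (20, 2))])
y = np.array([1] * 20 + [-1] * 20)
risk = cv_risk(X, y, n_folds=5, fit=fit_centroid, loss=lambda p, t: float(p != t))
```

With well-separated classes the cross-validated 0/1 risk should be close to zero.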
Introduction (Ctd)

Learning scenarios
- Supervised learning.
- Unsupervised learning.
- Semi-supervised learning.
- Transductive inference.
- Online learning.
- Reinforcement learning.
- Active learning.
SVM - separable case
Support Vector Machine (SVM)
- Assume that there is an f s.t. y = f(x).
- Find h ∈ H with minimal risk.
- Hypothesis set H of all linear separators h, characterised by (w, b):

      w·x + b > 0  ⇒  h(x) = +1
      w·x + b < 0  ⇒  h(x) = −1

  or

      H = { x ↦ sign(w·x + b) : w ∈ R^N, b ∈ R }.

- Separable case: ∃ h ∈ H s.t.

      f(x) h(x) = y h(x) > 0   for all x drawn from D.
SVM - separable case (Ctd)

Maximal margin
- Hyperplane {x : w·x + b = 0}.
- Normalise such that min_i |w·x_i + b| = 1 (w.l.o.g.).
- Distance of a point x_0 to the hyperplane:

      |w·x_0 + b| / ‖w‖

- Thus the margin is given as

      ρ = min_i |w·x_i + b| / ‖w‖ = 1 / ‖w‖.
SVM - separable case (Ctd)
Maximal margin
- Maximal hyperplane:

      max_{ρ,w,b} ρ   s.t.  y_i (w·x_i + b) ≥ 0  ∀i,
                            ρ = min_i |w·x_i + b| / ‖w‖,
                            min_i |w·x_i + b| = 1.

- Maximal hyperplane:

      max_{ρ,w,b} ρ   s.t.  y_i (w·x_i + b) ≥ 1  ∀i,  ρ = 1/‖w‖.

- Or

      min_{w,b} (1/2) ‖w‖²   s.t.  y_i (w·x_i + b) ≥ 1  ∀i.

- Why? Find the safest solution: the separator furthest away from all training points.
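A minimal numerical sketch of the last program. The toy points and the use of scipy's generic SLSQP solver (rather than a dedicated QP solver) are illustrative choices, not from the slides:

```python
import numpy as np
from scipy.optimize import minimize

# Toy separable data in R^2 (illustrative).
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

def objective(v):                 # v = (w_1, w_2, b)
    w = v[:2]
    return 0.5 * w @ w            # (1/2) ||w||^2

# One affine constraint y_i (w . x_i + b) - 1 >= 0 per training point.
cons = [{"type": "ineq", "fun": lambda v, i=i: y[i] * (X[i] @ v[:2] + v[2]) - 1.0}
        for i in range(len(X))]

res = minimize(objective, x0=np.zeros(3), constraints=cons)  # SLSQP is used here
w, b = res.x[:2], res.x[2]
margin = 1.0 / np.linalg.norm(w)  # geometric margin rho = 1 / ||w||
```

At the solution every training point satisfies y_i(w·x_i + b) ≥ 1, and the closest ones attain it with equality.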

SVM - separable case (Ctd)

Maximal margin
- Convex objective.
- Affine inequality constraints.
- Dual problem: nice properties!
Convex Optimization

Convex
- A set X is convex iff for any two points x, x′ ∈ X, the segment
  {λx + (1 − λ)x′ : 0 ≤ λ ≤ 1} ⊂ X.
- A function f : X → R is convex iff for all x, x′ ∈ X and all 0 ≤ λ ≤ 1 one has that

      f(λx + (1 − λ)x′) ≤ λ f(x) + (1 − λ) f(x′).

- Let f be a differentiable function; then f is convex if and only if X is convex and

      ∀x, x′ ∈ X :  f(x′) − f(x) ≥ ∇f(x)·(x′ − x).

Convex Optimization (Ctd)
Convex Programming
- Constrained optimisation problem:

      p* = min_{x∈X} f(x)   s.t.  g_i(x) ≤ 0  ∀i.

- Lagrangian:

      ∀x ∈ X, ∀α ≥ 0 :  L(x, α) = f(x) + Σ_i α_i g_i(x).

- Dual function (concave):

      ∀α ≥ 0 :  F(α) = inf_{x∈X} L(x, α),

  so that F(α) ≤ p*.
- Dual problem:

      d* = max_{α≥0} F(α).
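A toy numerical check of these definitions, using the illustrative one-dimensional program min x² s.t. 1 − x ≤ 0 (not from the slides), whose dual function can be written in closed form:

```python
import numpy as np

# Illustrative convex program: min x^2  s.t.  g(x) = 1 - x <= 0.
# The primal optimum is x* = 1 with value p* = 1.
p_star = 1.0

# Lagrangian L(x, a) = x^2 + a (1 - x) is minimised over x at x = a/2,
# which gives the (concave) dual function F(a) = a - a^2 / 4.
def F(a):
    return a - a**2 / 4

alphas = np.linspace(0.0, 10.0, 10001)   # grid over the dual feasible set a >= 0
d_star = F(alphas).max()                 # attained at a* = 2
gap = p_star - d_star                    # duality gap
```

F(α) ≤ p* on the whole grid (weak duality), and since Slater's condition holds (e.g. g(2) = −1 < 0) the gap vanishes (strong duality).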
Convex Optimization (Ctd)

Convex Programming
- Weak duality: p* ≥ d*.
- Strong duality: p* = d*.
- Duality gap: p* − d*.
- Strong duality holds when constraint qualifications hold.
- Strong constraint qualification (Slater):
  ∃x ∈ int(C) : g_i(x) < 0 ∀i.
- Weak constraint qualification (weak Slater):
  ∃x ∈ int(C) : ∀i, either g_i(x) < 0, or g_i is affine and g_i(x) = 0.
Convex Optimization (Ctd)

Karush-Kuhn-Tucker (KKT) conditions:

Assume that f and all g_i : X → R are convex and differentiable, and that the
constraints are qualified. Then x̄ is a solution of the constrained program if
and only if there exists an ᾱ ≥ 0 such that

    ∇_x L(x̄, ᾱ) = 0
    ∇_α L(x̄, ᾱ) ≤ 0   (i.e. g_i(x̄) ≤ 0 ∀i)
    ᾱ_i g_i(x̄) = 0   ∀i.

Analysis of SVMs

- Lagrangian:

      L(w, b, α) = (1/2)‖w‖² − Σ_{i=1}^m α_i ( y_i (w·x_i + b) − 1 ).

- KKT conditions:

      ∇_w L = 0  ⇒  w = Σ_{i=1}^m α_i y_i x_i
      ∂L/∂b = 0  ⇒  Σ_{i=1}^m α_i y_i = 0
      ∀i :  α_i ( y_i (w·x_i + b) − 1 ) = 0.

- Support vectors: the x_i with α_i > 0; their number is N_SV(S).

Analysis of SVMs (Ctd)
- Dual problem: max_{α≥0} inf_{w,b} L(w, b, α).
- Eliminate w and b using the KKT conditions.
- Dual problem:

      max_{α≥0}  Σ_{i=1}^m α_i − (1/2) Σ_{i=1}^m Σ_{j=1}^m α_i α_j y_i y_j (x_i·x_j)
      s.t.  Σ_{i=1}^m α_i y_i = 0.                                          (1)

- At the optimum, w = Σ_{i=1}^m α_i y_i x_i,
- and b = y_i − Σ_{j=1}^m α_j y_j (x_j·x_i) for any support vector x_i.
- Hence we can predict

      h(x) = sign(w·x + b).
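The dual program (1) can be sketched numerically as follows; the toy data and the generic SLSQP solver are illustrative assumptions (a dedicated QP solver would normally be used):

```python
import numpy as np
from scipy.optimize import minimize

# Toy separable data (illustrative).
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
m = len(X)
G = (y[:, None] * X) @ (y[:, None] * X).T      # G_ij = y_i y_j (x_i . x_j)

def neg_dual(a):                               # maximise by minimising the negative
    return -(a.sum() - 0.5 * a @ G @ a)

res = minimize(neg_dual, np.zeros(m),
               bounds=[(0.0, None)] * m,                              # alpha_i >= 0
               constraints=[{"type": "eq", "fun": lambda a: a @ y}])  # sum alpha_i y_i = 0
alpha = res.x
w = (alpha * y) @ X                            # w = sum_i alpha_i y_i x_i
sv = int(np.argmax(alpha))                     # index of a support vector (alpha_i > 0)
b = y[sv] - (alpha * y) @ (X @ X[sv])          # b = y_i - sum_j alpha_j y_j (x_j . x_i)
pred = np.sign(X @ w + b)
```

The recovered (w, b) classifies the training set correctly, and the dual equality constraint holds at the optimum.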
Analysis of SVMs (Ctd)

Generalization error

    R(h_S) = Pr_{x∼D} [ h_S(x) ≠ f(x) ]

- Leave-one-out analysis.
- In terms of N_SV.
- Margin-based analysis.
Analysis of SVMs (Ctd)

Leave-one-out analysis

    R̂_LOO(A(θ), S) = (1/m) Σ_{i=1}^m 1( h_{S∖(x_i,y_i)}(x_i) ≠ y_i )

- A(θ)(S) = h_S.
- 1(z) = 1 iff z is true, 1(z) = 0 otherwise.
- In terms of N_SV.
- Then

      E_{S∼D^m} [ R̂_LOO(A(θ), S) ] = E_{S′∼D^{m−1}} [ R(h_{S′}) ].
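A minimal sketch of the leave-one-out estimate, with a nearest-centroid learner standing in for A(θ) (an illustrative assumption, not the SVM of the slides):

```python
import numpy as np

def loo_risk(X, y, fit):
    """Leave-one-out estimate: train on S without (x_i, y_i), test on (x_i, y_i)."""
    m = len(X)
    errors = 0
    for i in range(m):
        keep = np.arange(m) != i
        h = fit(X[keep], y[keep])       # h_{S \ (x_i, y_i)}
        errors += int(h(X[i]) != y[i])
    return errors / m

def fit_centroid(X, y):
    """Illustrative stand-in learner: nearest class centroid."""
    c_pos, c_neg = X[y == 1].mean(axis=0), X[y == -1].mean(axis=0)
    return lambda x: 1 if np.linalg.norm(x - c_pos) < np.linalg.norm(x - c_neg) else -1

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(2.0, 1.0, (15, 2)), rng.normal(-2.0, 1.0, (15, 2))])
y = np.array([1] * 15 + [-1] * 15)
risk_loo = loo_risk(X, y, fit_centroid)
```

The theorem says that, averaged over samples, this quantity equals the true risk of the learner trained on m − 1 points.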

Analysis of SVMs (Ctd)

Proof:

    E_{S∼D^m} [ R̂_LOO(A(θ), S) ]
        = E_{S∼D^m} [ (1/m) Σ_{i=1}^m 1( h_{S∖(x_i,y_i)}(x_i) ≠ y_i ) ]
        = E_{S∼D^m} [ 1( h_{S∖(x_1,y_1)}(x_1) ≠ y_1 ) ]            (symmetry)
        = E_{S′∼D^{m−1}} [ E_{x_1∼D} [ 1( h_{S′}(x_1) ≠ y_1 ) ] ]
        = E_{S′∼D^{m−1}} [ R(h_{S′}) ].
Analysis of SVMs (Ctd)

Support vector analysis: Let A(θ)(S) = h_S be the hypothesis returned by the
SVM for a sample S, and let N_SV(S) be the number of support vectors that
define h_S. Then

    E_{S∼D^m} [ R(h_S) ] ≤ E_{S′∼D^{m+1}} [ N_SV(S′) / (m + 1) ].

Argument: if x_i is not a support vector, then h_{S∖(x_i,y_i)} = h_S and hence
h_{S∖(x_i,y_i)}(x_i) = f(x_i), so that for a sample S of size m + 1

    R̂_LOO(A(θ), S) ≤ N_SV(S) / (m + 1).
SVM - Margin analysis (Ctd)
Vapnik-Chervonenkis (VC) dimension:
- Distance of a point x_0 with label y_0 to a hyperplane {x : w·x + b = 0} is

      ρ(x_0) = y_0 (w·x_0 + b) / ‖w‖.

- The margin is given as

      ρ = min_i  y_i (w·x_i + b) / ‖w‖.

- Capacity of H (Structural Risk Minimisation: see next lecture).
- The VC dimension of hyperplanes in R^N is N + 1 ...
- But what about high dimensions?
SVM - Margin analysis (Ctd)

Refined analysis of VC dimension

- Margin ρ = 1/‖w‖.
- H = { h(x) = sign(w·x + b) : ‖w‖ ≤ Λ }.
- How many points can be shattered?

      ∃ {x_i}_{i=1}^d :  ∀σ_1, ..., σ_d ∈ {−1, +1}  ∃h ∈ H :
          h(x_1) = σ_1, ..., h(x_d) = σ_d.

- The empirical Rademacher complexity measures the capacity of H:

      R̂_S(H) = E_{σ_1,...,σ_m} [ sup_{h∈H} (1/m) Σ_{i=1}^m σ_i h(x_i) ].
SVM - non-separable case
Maximal Soft Margin:
- Non-separable case: for every (w, b)

      ∃i :  y_i (w·x_i + b) ≱ 1.

- Idea: find the best (w, b) with minimal slack ξ_i:

      y_i (w·x_i + b) ≥ 1 − ξ_i.

- Max soft margin:

      min_{w,b,ξ}  (1/2)‖w‖² + C Σ_{i=1}^m ξ_i
      s.t.  y_i (w·x_i + b) ≥ 1 − ξ_i  ∀i,                                  (2)
            ξ_i ≥ 0  ∀i.
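At the optimum of (2) the slacks are ξ_i = max(0, 1 − y_i(w·x_i + b)), so the program is equivalent to unconstrained hinge-loss minimisation. A sub-gradient descent sketch on illustrative overlapping toy data (step size and iteration count are ad hoc choices):

```python
import numpy as np

# Eliminating the slacks turns (2) into
#   min_(w,b) (1/2)||w||^2 + C sum_i max(0, 1 - y_i (w . x_i + b)),
# which we minimise by plain sub-gradient descent.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(1.5, 1.0, (30, 2)),     # overlapping toy classes
               rng.normal(-1.5, 1.0, (30, 2))])
y = np.array([1.0] * 30 + [-1.0] * 30)
C, eta, iters = 1.0, 0.005, 3000                  # ad hoc choices

w, b = np.zeros(2), 0.0
for _ in range(iters):
    viol = y * (X @ w + b) < 1.0                  # points with positive slack
    grad_w = w - C * (y[viol][:, None] * X[viol]).sum(axis=0)
    grad_b = -C * y[viol].sum()
    w -= eta * grad_w
    b -= eta * grad_b

train_err = float(np.mean(np.sign(X @ w + b) != y))
```

Because the classes overlap, some slacks stay positive; the training error should nonetheless be close to the Bayes error of the toy distribution.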
SVM - non-separable case (Ctd)

Dual problem:
- Dual problem:

      max_{0≤α≤C}  Σ_{i=1}^m α_i − (1/2) Σ_{i=1}^m Σ_{j=1}^m α_i α_j y_i y_j (x_i·x_j)
      s.t.  Σ_{i=1}^m α_i y_i = 0.                                          (3)

- At the optimum, w = Σ_{i=1}^m α_i y_i x_i,
- and b = y_i − Σ_{j=1}^m α_j y_j (x_j·x_i) for any i with ξ_i = 0.
SVM - Analysis.
- Empirical margin loss:

      R̂_ρ(h) = (1/m) Σ_{i=1}^m Φ_ρ( y_i h(x_i) ),

  where Φ_ρ is the ρ-margin loss.
- Empirical Rademacher complexity:

      R̂_S(H) = E_{σ_1,...,σ_m} [ sup_{h∈H} (1/m) Σ_{i=1}^m σ_i h(x_i) ].

- H = { h(x) = w·x + b : ‖w‖ ≤ Λ, b ∈ R }.
- Theorem: Let H be a set of real-valued functions and fix ρ > 0. For any
  δ > 0, with probability exceeding 1 − δ one has that

      ∀h ∈ H :  R(h) ≤ R̂_ρ(h) + (2/ρ) R̂_S(H) + 3 √( log(2/δ) / (2m) ).
SVM - Analysis (Ctd).

    R̂_S(H) = E_{σ_1,...,σ_m} [ sup_{h∈H} (1/m) Σ_{i=1}^m σ_i h(x_i) ]

- H = { h(x) = w·x + b : ‖w‖ ≤ Λ, b ∈ R }.
- Theorem: Let S be a sample of size m with ‖x_i‖ ≤ R for all i; then

      R̂_S(H) ≤ √( R²Λ² / m ).
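For the class without the offset b, the supremum inside the expectation has the closed form (Λ/m)‖Σ_i σ_i x_i‖, so the theorem can be checked by Monte Carlo; the sample sizes below are illustrative:

```python
import numpy as np

# For H = {x -> w . x : ||w|| <= Lam}, for fixed signs sigma,
#   sup_(||w||<=Lam) (1/m) sum_i sigma_i (w . x_i) = (Lam/m) || sum_i sigma_i x_i ||,
# so the empirical Rademacher complexity can be estimated by averaging over sigma.
rng = np.random.default_rng(0)
m, N, Lam = 200, 5, 1.0                          # illustrative sizes
X = rng.normal(size=(m, N))
X /= np.linalg.norm(X, axis=1, keepdims=True)    # normalise so ||x_i|| = R = 1
R = 1.0

sigma = rng.choice([-1.0, 1.0], size=(5000, m))  # 5000 draws of (sigma_1, ..., sigma_m)
sup_vals = (Lam / m) * np.linalg.norm(sigma @ X, axis=1)
rad_hat = float(sup_vals.mean())                 # Monte Carlo estimate of R_S(H)

bound = float(np.sqrt(R**2 * Lam**2 / m))        # the theorem's bound
```

The estimate sits just below the bound, which matches the proof: the only slack comes from Jensen's inequality.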
SVM - Analysis (Ctd)
Proof:

    R̂_S(H) ≜ E_σ [ sup_{‖w‖≤Λ} (1/m) Σ_{i=1}^m σ_i (w·x_i) ]
           = (1/m) E_σ [ sup_{‖w‖≤Λ} w · Σ_{i=1}^m σ_i x_i ]
           ≤ (Λ/m) E_σ [ ‖ Σ_{i=1}^m σ_i x_i ‖ ]
           ≤ (Λ/m) ( E_σ [ ‖ Σ_{i=1}^m σ_i x_i ‖² ] )^{1/2}       (Jensen)
           = (Λ/m) ( Σ_{i=1}^m ‖x_i‖² )^{1/2}
           ≤ (Λ/m) √(m R²) = √( R²Λ² / m ).                         (4)
SVM - Analysis (Ctd)

    R̂_S(Φ ∘ H) ≤ L · R̂_S(H)

where Φ : R → R is L-Lipschitz and H is any hypothesis set.

    R̂_S(Φ ∘ H) ≜ E_{σ_1,...,σ_m} [ sup_{h∈H} (1/m) Σ_{i=1}^m σ_i (Φ ∘ h)(x_i) ]
               = (1/m) E_{σ_1,...,σ_{m−1}} [ E_{σ_m} [ sup_{h∈H} ( u_{m−1}(h) + σ_m (Φ ∘ h)(x_m) ) ] ]   (5)

with u_{m−1}(h) = Σ_{i=1}^{m−1} σ_i (Φ ∘ h)(x_i).
SVM - Analysis (Ctd).

    (1/m) E_{σ_m} [ sup_{h∈H} ( u_{m−1}(h) + σ_m (Φ ∘ h)(x_m) ) ]
        = (1/m) [ (1/2) ( u_{m−1}(h_1) + Φ(h_1(x_m)) ) + (1/2) ( u_{m−1}(h_2) − Φ(h_2(x_m)) ) ]
        ≤ (1/m) [ (1/2) ( u_{m−1}(h_1) + u_{m−1}(h_2) ) + (1/2) s L ( h_1(x_m) − h_2(x_m) ) ]
        ≤ (1/m) E_{σ_m} [ sup_{h∈H} ( u_{m−1}(h) + σ_m L h(x_m) ) ],          (6)

with s = sign( h_1(x_m) − h_2(x_m) ) and h_1, h_2 the (near-)maximisers of the
two suprema; repeating the argument for σ_{m−1}, ..., σ_1 yields the claim.

Kernels.

- Note that the dual problem and the predictor are expressed only in inner
  products (x_i·x_j).
- Let us generalise to (Φ(x_i)·Φ(x_j)) with Φ : R^N → R^ℓ.
- No explicit mapping is needed, just the inner product!
- (Φ(x_i)·Φ(x_j)) = K(x_i, x_j).
- Such a Φ exists iff K is positive semi-definite (PSD)!
- Typical choice:

      K(x_i, x_j) = exp( −‖x_i − x_j‖² / (2σ²) ).

- Hypothesis set:

      H = { Σ_{i=1}^m α_i y_i K(x_i, ·) : ‖α‖ ≤ Λ }.
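The PSD requirement on K can be verified numerically for the Gaussian kernel on illustrative random points:

```python
import numpy as np

# Gram matrix of the Gaussian kernel K(x, x') = exp(-||x - x'||^2 / (2 sigma^2))
# on random points; a valid kernel must make this matrix PSD for any point set.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 3))
sigma = 1.0

sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)  # ||x_i - x_j||^2
K = np.exp(-sq_dists / (2.0 * sigma**2))

eigvals = np.linalg.eigvalsh(K)   # K is symmetric, so the spectrum is real
```

All eigenvalues are nonnegative (up to numerical noise), and the diagonal is identically one since K(x, x) = exp(0).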
Conclusions

Take-home messages:

- SVMs: optimisation, analysis.
- Separable case, non-separable case.
- Linear + kernels.
- Analysis.
- Margin and high dimensions.