Support Vector Machines: An Introduction
Ron Meir
Sources of Information
Web http://www.kernel-machines.org/
Books
Application Domains
Supervised Learning
Unsupervised Learning
Classification I
[Figure: labeled examples "3", "1", "3" and a new, unlabeled example "?"]
The problem: given labeled examples, assign the correct label to a new input.
Classification II
The ‘Model’
$$Y = \operatorname{sgn}[w^\top x + b] = \begin{cases} +1, & w^\top x + b > 0 \\ -1, & w^\top x + b \le 0 \end{cases}$$
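As a quick illustration (toy numbers of our own, not from the slides), the decision rule is a single line:

```python
# Minimal numpy illustration of Y = sgn(w^T x + b); w, b, x are made-up values.
import numpy as np

w = np.array([2.0, -1.0])
b = 0.5
x = np.array([1.0, 3.0])

y = 1 if w @ x + b > 0 else -1   # ties (w.x + b <= 0) go to -1, as in the slide
print(y)                          # w.x + b = 2 - 3 + 0.5 = -0.5, so y = -1
```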
VC Dimension
$\mathrm{VCdim} = 3$
[Figure: 6 dichotomies of three points in the plane realized by hyper-planes]
For $h \in H$:
$L(h)$: probability of misclassification
$\hat{L}_n(h)$: empirical fraction of misclassifications

Vapnik and Chervonenkis 1971: for any distribution, with probability $1 - \delta$, for all $h \in H$,
$$L(h) < \underbrace{\hat{L}_n(h)}_{\text{emp. error}} + \underbrace{c\,\sqrt{\frac{\mathrm{VCdim}(H)\,\log n + \log\frac{1}{\delta}}{n}}}_{\text{complexity penalty}}$$

[Figure: $\hat{L}_n(h)$ and $L(h)$ as functions of $n$]
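The penalty term is easy to tabulate. A small numeric sketch (ours): the constant $c$ is unspecified in the bound, so $c = 1$ below is purely an assumption for illustration.

```python
# Illustrative only: the VC complexity penalty with the constant c = 1 (assumed).
import numpy as np

def vc_penalty(vcdim, n, delta, c=1.0):
    # sqrt((VCdim log n + log(1/delta)) / n), as in the bound above
    return c * np.sqrt((vcdim * np.log(n) + np.log(1.0 / delta)) / n)

for n in (100, 1000, 10000):
    print(n, vc_penalty(vcdim=3, n=n, delta=0.05))  # penalty shrinks as n grows
```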
An Improved VC Bound I
Hyper-plane: $H(w, b) = \{x : w^\top x + b = 0\}$

Distance of a point from a hyper-plane:
$$d(x, H(w, b)) = \frac{|w^\top x + b|}{\|w\|}$$
An Improved VC Bound II
Canonical hyper-plane:
$$\min_{1 \le i \le n} |w^\top x_i + b| = 1$$

The VC dimension of canonical hyper-planes with $\|w\| \le A$, whose points $x_i$ lie in a ball of radius $L$, is bounded by
$$\mathrm{VCdim} \le \min(A^2 L^2, d) + 1$$

In canonical form the separation constraint reads
$$\min_{1 \le i \le n} |w^\top x_i + b| \ge 1$$
Support vectors: $\{x_i : |w^\top x_i + b| = 1\}$
Margin: points satisfying $w^\top x_i + b = \pm 1$ lie at distance
$$\frac{w^\top x_i + b}{\|w\|} = \frac{\pm 1}{\|w\|}$$
from the hyper-plane, so
$$\text{Margin} = \left| \frac{1}{\|w\|} - \frac{-1}{\|w\|} \right| = \frac{2}{\|w\|}$$

Maximizing the margin is thus equivalent to
$$\begin{aligned} \text{minimize} \quad & \tfrac{1}{2}\, w^\top w \\ \text{subject to} \quad & y_i (w^\top x_i + b) \ge 1, \quad i = 1, 2, \dots, n \end{aligned}$$
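A hedged sketch of solving this program in practice: scikit-learn's SVC with a very large C approximates the hard-margin problem (the four toy points are our own).

```python
# Sketch: hard-margin SVM via SVC with a large C on separable toy data (ours).
import numpy as np
from sklearn.svm import SVC

X = np.array([[0.0, 0.0], [0.0, 1.0], [2.0, 0.0], [2.0, 1.0]])
y = np.array([-1, -1, 1, 1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)   # large C ~ hard margin
w = clf.coef_[0]
print(2.0 / np.linalg.norm(w))                # geometric margin = 2/||w||
```

For these points the optimal separator is $x_1 = 1$ with $w = (1, 0)$, so the printed margin is 2.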
Convex Optimization
Problem:
$$\begin{aligned} \text{minimize} \quad & f(x) \\ \text{subject to} \quad & h_i(x) = 0, \quad i = 1, \dots, m, \\ & g_j(x) \le 0, \quad j = 1, \dots, r \end{aligned}$$

At an optimum $(x^*, \lambda^*, \mu^*)$ the KKT conditions hold:
$$\nabla_x L(x^*, \lambda^*, \mu^*) = 0$$
$$\mu_j^* \ge 0, \quad j = 1, 2, \dots, r$$
$$\mu_j^* = 0 \quad \forall j \notin A(x^*) \text{ (the set of active constraints)}$$
With linear constraints:
$$\begin{aligned} \text{minimize} \quad & f(x) \\ \text{subject to} \quad & e_i^\top x = d_i, \quad i = 1, \dots, m, \\ & a_j^\top x \le b_j, \quad j = 1, \dots, r \end{aligned}$$

Lagrangian:
$$L_P(x, \lambda, \mu) = f(x) + \sum_{i=1}^{m} \lambda_i (e_i^\top x - d_i) + \sum_{j=1}^{r} \mu_j (a_j^\top x - b_j)$$
Dual Problem
$$\begin{aligned} \underset{\lambda, \mu}{\text{maximize}} \quad & L_D(\lambda, \mu) \\ \text{subject to} \quad & \mu \ge 0 \end{aligned}$$

Observations:
• $L_P(x, \lambda, \mu)$ quadratic $\Rightarrow L_D(\lambda, \mu)$ quadratic
• The constraints in the dual are greatly simplified
• $m + r$ variables, $r$ constraints
Duality Theorem: the optimal values of P and D coincide.
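A one-variable example (our own) of the construction: take $f(x) = \tfrac{1}{2}x^2$ with the single constraint $a - x \le 0$ for some $a > 0$. Then $L_P(x, \mu) = \tfrac{1}{2}x^2 + \mu(a - x)$; minimizing over $x$ gives $x = \mu$, so $L_D(\mu) = \mu a - \tfrac{1}{2}\mu^2$. Maximizing over $\mu \ge 0$ gives $\mu^* = a$ with dual value $\tfrac{1}{2}a^2$, exactly the primal optimum attained at $x^* = a$.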
$$\begin{aligned} \underset{w, b}{\text{minimize}} \quad & \tfrac{1}{2}\|w\|^2 \\ \text{subject to} \quad & y_i\left(w^\top x_i + b\right) \ge 1, \quad i = 1, \dots, n. \end{aligned}$$

$$L_P(w, b, \alpha) = \frac{1}{2}\|w\|^2 - \sum_{i=1}^{n} \alpha_i \left[ y_i (w^\top x_i + b) - 1 \right]$$
Solution:
$$w = \sum_{i=1}^{n} \alpha_i y_i x_i$$
$$0 = \sum_{i=1}^{n} \alpha_i y_i \qquad (\alpha_i \ge 0)$$

KKT condition:
$$\alpha_i = 0 \quad \text{unless} \quad y_i (w^\top x_i + b) = 1$$
so $\alpha_i$ can be non-zero only where the constraint is obeyed with equality, i.e. on the margin.
$$\begin{aligned} \text{max.} \quad & L_D(\alpha) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j\, x_i^\top x_j \\ \text{s.t.} \quad & \sum_{i=1}^{n} \alpha_i y_i = 0; \quad \alpha_i \ge 0 \end{aligned}$$
For the support vectors, $y_i (w^\top x_i + b) = 1$. Thus
$$b^* = -\frac{1}{2} \left( \min_{y_i = +1} \{ w^{*\top} x_i \} + \max_{y_i = -1} \{ w^{*\top} x_i \} \right)$$
Classifier (recall $w = \sum_i \alpha_i y_i x_i$):
$$f(x) = \operatorname{sgn}\!\left( \sum_{i=1}^{n} \alpha_i^* y_i\, x_i^\top x + b^* \right)$$
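The dual is a standard quadratic program, so any QP solver applies. A minimal sketch using cvxopt (the function name, tolerance, and recovery of $b$ are our own choices, and the data are assumed separable):

```python
# Sketch: the SVM dual as a QP for cvxopt, which solves
#   min 1/2 a^T P a + q^T a   s.t.  G a <= h,  A a = b.
import numpy as np
from cvxopt import matrix, solvers

def svm_dual_fit(X, y):
    """X: (n, d) float inputs; y: (n,) float labels in {-1, +1}."""
    n = X.shape[0]
    Q = (y[:, None] * X) @ (y[:, None] * X).T        # Q_ij = y_i y_j x_i^T x_j
    P, q = matrix(Q), matrix(-np.ones(n))            # max sum(a) - 1/2 a^T Q a
    G, h = matrix(-np.eye(n)), matrix(np.zeros(n))   # alpha_i >= 0
    A, b = matrix(y.reshape(1, -1)), matrix(0.0)     # sum_i alpha_i y_i = 0
    alpha = np.ravel(solvers.qp(P, q, G, h, A, b)["x"])
    w = ((alpha * y)[:, None] * X).sum(axis=0)       # w = sum_i alpha_i y_i x_i
    sv = alpha > 1e-6                                # support vectors: alpha_i > 0
    b0 = np.mean(y[sv] - X[sv] @ w)                  # from y_i (w^T x_i + b) = 1
    return w, b0, alpha
```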
Non-Separable Case I
Non-Separable Case II
Proposed solution: minimize
$$\frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} I(\xi_i > 1) \qquad \text{(non-convex!)}$$

The convex surrogate:
$$\begin{aligned} \underset{w, b, \xi}{\text{minimize}} \quad & L_P(w, \xi) = \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \xi_i \\ \text{subject to} \quad & y_i (w^\top x_i + b) \ge 1 - \xi_i \\ & \xi_i \ge 0 \end{aligned}$$
KKT conditions:
$$0 = \sum_{i=1}^{n} \alpha_i y_i$$
$$0 = \alpha_i \left[ y_i (w^\top x_i + b) - 1 + \xi_i \right]$$
$$0 = (C - \alpha_i)\, \xi_i$$
Non-Separable Case IV
Two types of support vectors. Recall
$$\alpha_i \left[ y_i (w^\top x_i + b) - 1 + \xi_i \right] = 0, \qquad (C - \alpha_i)\, \xi_i = 0$$

Margin vectors:
$$0 < \alpha_i < C \;\Rightarrow\; \xi_i = 0 \;\Rightarrow\; d(x_i, H(w, b)) = \frac{1}{\|w\|}$$

Non-margin vectors: $\alpha_i = C$
• Errors: $\xi_i > 1$, misclassified
• Non-errors: $0 \le \xi_i \le 1$, correctly classified but within the margin
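The two types can be checked numerically. A sketch with scikit-learn (illustrative overlapping data; SVC stores $\alpha_i y_i$ for the support vectors in dual_coef_, so $\alpha_i$ is its absolute value):

```python
# Sketch: counting margin (0 < alpha < C) vs. non-margin (alpha = C) support
# vectors; the overlapping Gaussian classes are our own illustration.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, (50, 2)), rng.normal(+1, 1, (50, 2))])
y = np.array([-1] * 50 + [+1] * 50)

C = 1.0
clf = SVC(kernel="linear", C=C).fit(X, y)
alpha = np.abs(clf.dual_coef_[0])                   # alpha_i = |alpha_i y_i|
print("margin SVs:", int(np.sum(alpha < C - 1e-8)))
print("non-margin SVs:", int(np.sum(alpha >= C - 1e-8)))
```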
Non-Separable Case V
Support vectors: [Figure: non-separable data with the support vectors marked]
Non-linear SVM I
$$\Phi : \mathbb{R}^d \mapsto \mathbb{R}^D \quad (D \gg d), \qquad x \mapsto \Phi(x)$$
Non-Linear SVM II
Example: $\Phi : \mathbb{R}^2 \to \mathbb{R}^3$ maps the input space $(x_1, x_2)$ to the feature space $(x_1^2,\; x_2^2,\; \sqrt{2}\, x_1 x_2)$, where a linear classifier becomes available:
$$f(x) = \operatorname{sgn}\!\left( w_1 x_1^2 + w_2 x_2^2 + w_3 \sqrt{2}\, x_1 x_2 + b \right)$$
[Figure: a pattern that is not linearly separable in $\mathbb{R}^2$ becomes linearly separable in the feature space $\mathbb{R}^3$]
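The pictured map can be verified in a few lines (our own numbers): the feature-space inner product equals $(x^\top z)^2$, so the map never needs to be computed explicitly.

```python
# Numeric check (illustrative) that Phi(x) = (x1^2, x2^2, sqrt(2) x1 x2)
# realizes the kernel (x . z)^2.
import numpy as np

def phi(x):
    return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2) * x[0] * x[1]])

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])
print(phi(x) @ phi(z))   # 1.0
print((x @ z) ** 2)      # 1.0 -- identical, with no explicit 3-D features
```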
We obtained
$$f(x) = \sum_{i=1}^{n} \alpha_i y_i\, x_i^\top x + b$$
In feature space:
$$f(x) = \sum_{i=1}^{n} \alpha_i y_i\, \Phi(x_i)^\top \Phi(x) + b$$
Examples: the polynomial kernel $K(x, z) = (x^\top z + 1)^p$ and the Gaussian kernel $K(x, z) = \exp(-\|x - z\|^2 / 2\sigma^2)$ are standard choices.
Mercer Kernels I
Assumptions:
Mercer's Theorem:
$$K(x, z) = \sum_{j=1}^{\infty} \lambda_j\, \psi_j(x)\, \psi_j(z)$$
where
$$\int K(x, z)\, \psi_j(z)\, dz = \lambda_j\, \psi_j(x)$$

Conclusion: let $\phi_j(x) = \sqrt{\lambda_j}\, \psi_j(x)$; then $K(x, z) = \sum_j \phi_j(x)\, \phi_j(z) = \Phi(x)^\top \Phi(z)$.
Mercer Kernels II
Classifier:
$$f(x) = \operatorname{sgn}\!\left( \sum_{i=1}^{n} \alpha_i y_i\, \Phi(x_i)^\top \Phi(x) + b \right) = \operatorname{sgn}\!\left( \sum_{i=1}^{n} \alpha_i y_i\, K(x_i, x) + b \right)$$
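Tying this to practice (a sketch; the data and hyperparameters are illustrative): scikit-learn's SVC evaluates exactly this expansion, so the decision function can be rebuilt by hand from the fitted support vectors.

```python
# Sketch: reconstructing sum_i alpha_i y_i K(x_i, x) + b from a fitted SVC.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] ** 2 + X[:, 1] ** 2 > 1, 1.0, -1.0)  # circular classes

gamma = 1.0
clf = SVC(kernel="rbf", C=1.0, gamma=gamma).fit(X, y)

x_new = np.array([[0.5, -0.5]])
K = np.exp(-gamma * ((clf.support_vectors_ - x_new) ** 2).sum(axis=1))
manual = clf.dual_coef_ @ K + clf.intercept_       # dual_coef_ holds alpha_i y_i
print(manual, clf.decision_function(x_new))        # the two values agree
```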
Kernel Selection I
Simpler Mercer condition: for any finite set of points, the matrix $K_{ij} = K(x_i, x_j)$ is positive semi-definite,
$$v^\top K v \ge 0 \quad \text{for all } v$$

Classifier:
$$f(x) = \operatorname{sgn}\!\left( \sum_{i=1}^{n} \alpha_i y_i\, K(x_i, x) + b \right)$$
New kernels from old (each is again a Mercer kernel; see the numeric check after this list):
1. $K_1(x, z) + K_2(x, z)$
3. $f(x)\, f(z)$
4. $K_3(\Phi(x), \Phi(z))$
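A quick numeric sanity check of the finite-set condition (our own construction, using a Gaussian kernel):

```python
# Illustrative check that a Gaussian Gram matrix is positive semi-definite.
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # pairwise ||x_i - x_j||^2
K = np.exp(-0.5 * sq)                                 # Gram matrix K_ij
print(np.linalg.eigvalsh(K).min() >= -1e-10)          # True: all eigenvalues >= 0
```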
Kernel Selection II
Explicit construction:
2. $\exp[K(x, z)]$
Applications:
Handwriting recognition
Text classification
Bioinformatics
Many more
$$\begin{aligned} \underset{w, b, \xi}{\text{minimize}} \quad & L_P(w, \xi) = \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \xi_i \\ \text{subject to} \quad & y_i (w^\top x_i + b) \ge 1 - \xi_i \\ & \xi_i \ge 0 \end{aligned}$$

is equivalent to the regularized hinge-loss problem
$$\underset{w, b}{\text{minimize}} \; \left\{ \sum_{i=1}^{n} \left[ 1 - y_i f(x_i) \right]_+ + \lambda \|w\|^2 \right\}$$
[Figure: the hinge loss $[1 - y f(x)]_+$ upper-bounds the 0-1 loss $I[y f(x) < 0]$, plotted against $y f(x)$; the hinge reaches zero at $y f(x) = 1$]
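A tiny sketch (ours) of the two losses; it verifies what the figure shows, namely that the hinge loss upper-bounds the 0-1 loss:

```python
# The hinge loss [1 - yf(x)]_+ vs. the 0-1 loss I[yf(x) < 0], on a grid of margins.
import numpy as np

def hinge(margin):
    return np.maximum(0.0, 1.0 - margin)    # [1 - y f(x)]_+

def zero_one(margin):
    return (margin < 0).astype(float)       # I[y f(x) < 0]

m = np.linspace(-2.0, 2.0, 9)
print(np.all(hinge(m) >= zero_one(m)))      # True: hinge dominates 0-1
```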
SVM Regression I
[Figure: the $\varepsilon$-insensitive loss is zero for $|y - f(x)| \le \varepsilon$ and grows linearly outside; points outside the $\varepsilon$-tube around the fit incur slack $\xi$]
SVM Regression II
$$f(x) = \sum_{i=1}^{n} (\alpha_i^* - \alpha_i)\, K(x_i, x) + b$$
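A hedged sketch of $\varepsilon$-SVR with scikit-learn (data and settings are illustrative); only points on or outside the $\varepsilon$-tube become support vectors:

```python
# Sketch: epsilon-insensitive regression (SVR) on noisy sine data (ours).
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(2)
X = np.sort(rng.uniform(-3, 3, size=(80, 1)), axis=0)
y = np.sin(X).ravel() + 0.1 * rng.normal(size=80)

reg = SVR(kernel="rbf", C=10.0, epsilon=0.1).fit(X, y)
print(len(reg.support_), "support vectors out of", len(X))  # sparse solution
```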
Effect of $\varepsilon$:
[Figure: regression fits for two values of $\varepsilon$]
• Data-dependent complexities
[Figure: test error and span prediction as functions of $\log \sigma$]
[Figure: test error and span prediction as functions of $\log C$]
Summary I
Advantages
Summary II
Drawbacks
Summary III
Extensions
• Online algorithms
• Applications to
– Clustering
– Non-linear principal component analysis
– Independent component analysis