Thorsten Joachims
Cornell University
Computer Science Department
tj@cs.cornell.edu
http://www.joachims.org
0
Overview
What is an SVM?
• optimal hyperplane and softmargin for inseparable data
• handling nonlinear rules and nonstandard data using kernels
1
What I will not (really) talk about...
• SVMs in the transductive setting
[Vapnik, 1998][Joachims, 1999c][Bennet & Demiriz, 1999]
• Kernel Principal Component Analysis
[Schoelkopf et al., 1998]
• connection to related methods (i.e. Gaussian Process Classifiers,
Ridge Regression, Logistic Regression, Boosting)
[Cristianini & ShaweTylor, 2000][MacKay, 1997][Schoelkopf &
Smola, 2002]
Warning: At some points throughout this tutorial, precision is
sacrificed for better intuition (e.g. uniform convergence bounds for
SVMs).
2
Text Classification
3
Learning Text Classifiers
RealWorld
Process
Training Set New Documents
Classifier
Learner
Goal:
• Learner uses training set to find classifier with low prediction error.
4
Representing Text as Attribute Vectors
Attributes: Words
0 baseball
3 specs
(WordStems)
From: xxx@sciences.sdsu.edu
Newsgroups: comp.graphics 0 graphics
Subject: Need specs on Apple QT
1 references
I need to get the specs, or at least a
0 hockey Values: Occurrence
very verbose interpretation of the specs,
for QuickTime. Technical articles from 0 car Frequencies
magazines and references to books would
be nice, too.
0 clinton
.
I also need the specs in a fromat usable .
on a Unix or MSDos system. I can’t .
do much with the QuickTime stuff they
have on ...
1 unix
0 space
2 quicktime
0 computer
5
Paradoxon of Text Classification
30,000 10,000
Attributes Training Examples
6
Experimental Results
Reuters Newswire WebKB Collection Ohsumed MeSH
• 90 categories • 4 categories • 20 categories
• 9603 training doc. • 4183 training doc. • 10000 training doc.
• 3299 test doc. • 226 test doc. • 10000 test doc.
• ~27000 features • ~38000 features • ~38000 features
microaveraged precision/recall
Reuters WebKB Ohsumed
breakevenpoint [0..100]
7
Part 1 (a): What is an SVM? (linear)
•prediction error vs. training error
•learning by empirical risk minimization
•VCDimension and learnability
•linear classification rules
•optimal hyperplane
•softmargin separation
8
Generative vs. Discriminative Training
Process:
• Generator: Generates descriptions x according to distribution P ( x ) .
• Teacher: Assigns a value y to each description x based on P ( y x ) .
N
=> Training examples ( x 1, y1 ), …, ( x n, yn ) ∼ P ( x, y ) xi ∈ ℜ y i ∈ {1, – 1}
9
True (Prediction) Error
What is a “good” classification rule h ?
P(h( x) ≠ y) = ∫ ∆ ( h ( x ) ≠ y ) dP ( x, y ) = Err P ( h )
Loss function ∆:
• 1 if not equal
• 0 if equal
Problem:
P ( x, y ) unknown. Known are training examples ( x 1, y 1 ), …, ( x n, y n ) .
10
Principle: Empirical Risk Minimization (ERM)
Learning Principle:
Find the decision rule h° ∈ H for which the training error is minimal:
h° = arg min h ∈ H { Err S ( h ) }
Training Error:
n
1
Err S ( h ) = 
n ∑ ∆( yi ≠ h ( xi ) )
i=1
11
When is it Possible to Learn?
Definition [Consistency]: ERM is consistent for
• a hypothesis space H and
• independent of the distribution P ( x, y )
converges in probability.
12
Vapnik/Chervonenkis Dimension
Definition: The VCdimension of H is equal to the maximal number d
of examples that can be split into two sets in all 2d ways using
functions from H (shattering).
x1 x2 x3 ... xd
h1 + + + ... +
h2  + + ... +
h3 +  + ... +
h4   + ... +
... ... ... ... ... ...
hN    ... 
13
Linear Classifiers
Rules of the Form: weight vector w , threshold b
N
N
h ( x ) = sign ∑
w i x i + b = 1 if ∑ wi xi + b > 0
i=1
i=1
–1 else
14
Linear Classifiers (Example)
Text Classification: Physics (+1) versus Receipes (1)
D1 1 2 0 0 2 0 2 +1
D2 0 0 0 3 0 1 1 1
D3 0 2 1 0 0 0 3 +1
D4 0 0 1 1 1 1 1 1
w,b 2 3 1 3 1 1 0 b=1
7
D1: ∑ wi xi + b = [ 2 ⋅ 1 + 3 ⋅ 2 + ( –1 ) ⋅ 0 + ( –3 ) ⋅ 0 + ( –1 ) ⋅ 2 + ( –1 ) ⋅ 0 + 0 ⋅ 2]+1
i=1
7
D2: ∑ wi xi + b = [ 2 ⋅ 0 + 3 ⋅ 0 + ( –1 ) ⋅ 0 + ( –3 ) ⋅ 3 + ( –1 ) ⋅ 0 + ( –1 ) ⋅ 1 + 0 ⋅ 1]+1
i=1
15
VCDimension of Hyperplanes in ℜ2
• Three points in ℜ 2 can be shattered with hyperplanes.
16
Rate of Convergence
Question: After n training examples, how close is the training error to
the true error?
η
d ln 2n  + 1 – ln 
1 d 4
Φ ( d, n ) = 
 4 
2 n
17
SVM Motivation: Structural Risk Minimization
Err P ( h i ) ≤ Err S ( h i ) + Φ ( VCdim ( H ) ,n, η )
Idea: Structure on
hypothesis space.
h*
Err ( h )
P i
Φ ( VCdim ( H ) ,n, η )
Err ( h )
S i
opt VCdim(H)
18
Optimal Hyperplane (SVM Type 1)
Assumption: The training examples are linearly separable.
19
VCDimension of “thick” Hyperplanes
Lemma: The VCdim of hyperplanes 〈 w, b〉 with margin δ and
description vectors xi ≤ R is bounded by
2
VCdim ≤ R
 + 1
2
δ
δ δ
21
Computing the Optimal Hyperplane
N
Training Examples: ( x 1, y 1 ), …, ( x n, y n ) xi ∈ ℜ y i ∈ {1, – 1}
Requirement 1: Zero training error!
( y = –1 ) ⇒ [ w ⋅ xi + b ] < 0
⇔ yi [ w ⋅ xi + b ] > 0
( y = 1 ) ⇒ [ w ⋅ xi + b ] > 0
Distance δ of point x
Requirement 2: Maximum margin! from hyperplane
1 <w,b>:
maximize δ, with δ = min i 
 [ w ⋅ xi + b ]
w⋅w 1
δ =  [ w ⋅ x + b ]
w⋅w
=> Requirement 1 & Requirement 2:
1 
maximize δ, with ∀i ∈ [ 1…n ] yi  [ w ⋅ x i + b ] ≥ δ
w⋅w
22
Primal Optimization Problem
1
maximize δ, with ∀i ∈ [ 1…n ] yi 
 [ w ⋅ x i + b ] ≥ δ
w⋅w
1
Set 
 = δ:
w⋅w 1  1  1
=> maximize  , with ∀i ∈ [ 1…n ] y i  [ w ⋅ x i + b ] ≥ 
w⋅w w⋅w w⋅w
Cancel:
1
=> maximize 
 , with ∀i ∈ [ 1…n ] [ y i [ w ⋅ x i + b ] ≥ 1 ]
w⋅w
23
Example: Optimal Hyperplane vs. Perceptron
Perceptron with eta=0.1
30
"perceptron_iter_trainerror.dat"
"perceptron_iter_testerror.dat"
25 hard_margin_svm_testerror.dat
Percent Training/Testing Errors
20
15
10
0
1 2 3 4 5 6 7 8 9 10
Iterations
24
NonSeparable Training Samples
• Forsome training samples there is no separating hyperplane!
• Complete separation is suboptimal for many training samples!
25
SoftMargin Separation
Idea: Maximize margin and minimize training error simultanously.
Hard Margin: Soft Margin: n
Hard Margin
(separable)
ξi
δ Soft Margin
(training error) ξj
26
Controlling SoftMargin Separation
n
•
∑ ξi is an upper bound on the number of training errors.
•C is a parameter that controls tradeoff between margin and error.
Large C
ξi
δ
ξj
Small C
27
Example Reuters “acq”: Varying C
4
"svm_trainerror.dat"
3.5 "svm_testerror.dat"
Percent Training/Testing Errors
2.5
1.5
0.5
hardmargin SVM
0
0.1 1 10
C
28
Part 1 (b): What is an SVM? (nonlinear)
•quadraticprograms and duality
•properties of the dual
•nonlinear classification rules
•kernels and their properties
•kernels for vectorial data
•kernels for nonvectorial data
29
Quadratic Program
n n n
1
minimize P(w) = – k i w i + 
∑ ∑∑ w i wj H ij
2
n i = 1 i = 1j = 1 n
(1) (k)
s.t. ∑ wi gi ≤0 … ∑ wi gi ≤0
ni = 1 i n= 1
(1) (m)
∑ wi hi = 0 … ∑ wi hi = 0
i=1 i=1
30
Fermat Theorem
Given an unconstrained optimization problem
minimize P ( w )
δP ( w° ) = 0

δw
31
Lagrange Function
Given an optimization problem
minimize P ( w )
s.t. g 1 ( w ) ≤ 0 … gk ( w ) ≤ 0
h1 ( w ) = 0 … hm ( w ) = 0
L ( w, α, β ) = P ( w ) + ∑ αi gi ( w ) + ∑ βi hi ( w )
i=1 i=1
32
Lagrange Theorem
Given an optimization problem
minimize P ( w )
s.t. h 1 ( w ) = 0 … hm ( w ) = 0
33
KarushKuhnTucker Theorem
Given an optimization problem
minimize P ( w )
s.t. g 1 ( w ) ≤ 0 … gk ( w ) ≤ 0
h1 ( w ) = 0 … hm ( w ) = 0
g i ( w° ) ≤ 0, i = 1, …, k
α i ° ≥ 0, i = 1, …, k
34
Dual Optimization Problem
n
n n n
1
Dual OP: maximize D(α) = α i – 
∑ ∑∑ α i αj y i y j ( x i ⋅ x j )
2
n i = 1 i = 1j = 1
s.t. ∑ αi yi = 0 and 0 ≤ αi ≤ C
i=1
==> positive semidefinite quadratic program
35
Primal <=> Dual
Theorem: The primal OP and the dual OP have the same solution.
Given the solution α i ° of the dual OP,
n
1 pos neg
w° = ∑ α i°y i x i b° =  ( w 0 ⋅ x + w 0 ⋅ x )
2
i=1
36
Properties of the SoftMargin Dual OP
n n n
s. t. ∑ αi yi = 0 und 0 ≤ αi ≤ C
i=1
• based
exclusively on inner product
between training examples
37
NonLinear Problems
==>
Problem:
• some tasks have nonlinear structure
• no hyperplane is sufficiently accurate
38
Extending the Hypothesis Space
Idea: Input Space
Φ
Feature Space
==> Find hyperplane in feature space!
Example: a b c
Φ
a b c aa ab ac bb bc cc
39
Example
Input Space: x = ( x1, x 2 ) (2 Attributes)
Feature Space: Φ ( x ) = ( x 21, x 22, 2x1, 2x 2, 2x1 x2, 1 ) (6 Attributes)
40
Kernels
Problem: Very many Parameters! Polynomials of degree p over N
attributes in input space lead to O ( Np ) attributes in feature space!
Solution: [Boser et al., 1992] The dual OP need only inner products =>
Kernel Functions
K ( x i ,x j ) = Φ ( x i ) ⋅ Φ ( x j )
2 2
Example: For Φ ( x ) = ( x 1, x 2, 2x 1, 2x 2, 2x 1 x 2, 1 ) calculating
2
K ( x i ,x j ) = [ x i ⋅ x j + 1 ] = Φ ( x i ) ⋅ Φ ( x j )
41
SVM with Kernels
n n n
1
Training: maximize D(α) = α i – 
∑ ∑∑ α i αj y i y j K ( x i, x j )
2
i = 1 i = 1j = 1
n
s. t. ∑ αi yi = 0 und 0 ≤ αi ≤ C
i=1
Classification: For new example x h ( x ) = sign
∑ α i y i K ( x i ,x ) + b
x i ∈ SV
42
Example: SVM with Polynomial of Degree 2
2
Kernel: K ( x i ,x j ) = [ xi ⋅ xj + 1 ]
plot by Bell SVM applet
43
Example: SVM with RBFKernel
2
Kernel: K ( xi ,xj ) = exp ( – xi – xj ⁄ σ 2 ) plot by Bell SVM applet
44
What is a Valid Kernel?
Mercer’s Theorem (see [Cristianini & ShaweTaylor, 2000])
Theorem [Saitoh]: Let X be a finite input space of n points ( x 1 ,…, xn ) .
A function K ( xi ,x j ) is a valid kernel in X iff it produces a Gram matrix
G ij = K ( x i ,x j )
that is symmetric
T
G = G
45
How to Construct Valid Kernels?
Theorem: Let K1 and K2 be valid Kernels over X × X , X ⊆ ℜ N , a ≥ 0 ,
m
0 ≤ λ ≤ 1 , f a realvalued function on X , φ ;X → ℜ with K 3 a kernel over
m m
ℜ × ℜ , and K a summetric positive semidefinite matrix. Then the
following functions are valid Kernels
K ( x, z ) = λK 1 ( x, z ) + ( 1 – λ )K 2 ( x, z )
K ( x, z ) = aK 1 ( x, z )
K ( x, z ) = K 1 ( x, z )K 2 ( x, z )
K ( x, z ) = f ( x )f ( z )
K ( x, z ) = K 3 ( φ ( x ), φ ( z ) )
T
K ( x, z ) = x Kz
46
Kernels for NonVectorial Data
Kernels for Sequences: Two sequences are similar, if the have many
common and consecutive subsequences.
Example [Lodhi et al., 2002]: For 0 ≤ λ ≤ 1 consider the following
features space
47
Computing String Kernel (I)
Definitions:
n
• Σ : sequences of length n over alphabet Σ
• i = ( i 1, …, i n ) : index sequence (sorted)
• s ( i ) : substring operator
• r ( i ) = i n – i 1 + 1 : range of index sequence
48
Computing String Kernel (II)
Kernel:
K n ( s, t ) = 0 if ( min ( s, t ) < n )
2
K n ( sx, t ) = K n ( s, t ) + ∑ K′ n – 1 ( s, t [ 1…j – 1 ] )λ
j ;t j = x
Auxiliary:
K′ 0 ( s, t ) = 1
K′ d ( s, t ) = 0 if ( min ( s, t ) < d )
t –j+2
K′ d ( sx, t ) = λK′ d ( s, t ) + ∑ K′ d – 1 ( s, t [ 1…j – 1 ] )λ
j ;t j = x
49
Other Kernels for Complex Data
General information on Kernels:
• Introduction to Kernels [Cristianini & ShaweTaylor, 2000]
• All the details on Kernels + Background [Schoelkopf & Smola, 2002]
50
Two Reasons for Using a Kernel
51
Summary
What is an SVM?
Given:
N
• Training examples ( x 1, y 1 ), …, ( x n, y n ) x i ∈ ℜ y i ∈ {1, – 1}
• Hypothesis space according to kernel K ( x i ,x j )
• Parameter C for tradingoff training error and margin size
Training:
• Finds hyperplane in feature space generated by kernel.
• The hyperplane has maximum margin in feature space with minimal
training error (upper bound ∑ ξ i ) given C.
• The result of training are α 1, …, α n . They determine 〈 w, b〉 .
Classification: For new example h ( x ) = sign
∑ α i y i K ( x i ,x ) + b
x i ∈ SV
52
Part 2: How to use an SVM effectively and
efficiently?
53
Design Decisions in Working with SVMs
Setting up the learning task
•multiclass problems
•multilabel problems
54
Handling MultiClass / MultiLabel Problems
Standard classification SVM addresses binary problems y ∈ { 1, – 1 }
Multiclass classification: y ∈ { 1, …, k }
• oneagainstrest decomposition into k binary problems
= 1 if ( y = i )
(i) (i)
• learn one binary SVM h per class with y
(i) – 1 else
• assign new example to y = arg max [ h ( x ) ]
• pairwise decomposition into k ( k – 1 ) binary problems
= 1 if ( y = i )
(i) ( i, j )
• learn one binary SVM h per class pair y
– 1 if ( y = j )
• assign new example by majority vote
• reducing number of classifications [Platt et al., 2000]
• multiclass SVM [Weston & Watkins, 1998]
• multiclass SVM via ranking [Crammer & Singer, 2001]
Multilabel classification: y ⊆ { 1, …, k }
= 1 if ( i ∈ y )
(i) (i)
• learn one binary SVM h per class with y
–1 else
55
Which Features to Choose?
Things to take into consideration:
• if features sparse, then dimensionality of space no efficiency problem
• computations based on inner product between vectors
• consider frequency distribution of features (e.g. many rare features)
• Zipf distribution of words
• see TCatmodel
• SVMs can handle redundancy in features
• bagofwords representation redundant for topic classification
• see TCatmodel
• as few irrelevant features as possible
• stopword removal often helps in text classification
• see TCatmodel
56
How to Assign Feature Values?
Things to take into consideration:
• importance of feature is monotonic in its absolute value
• the larger the absolute value, the more influence the feature gets
• typical problem: number of doors [05], price [0100000]
• want relevant features large / irrelevant features low (e.g. IDF)
• normalization to make features equally important
x – mean ( X )
• by mean and variance: x norm = 
var ( X )
• by other distribution
• normalization to bring feature vectors onto the same scale
• directional data: text classification
x
• by normalizing the length of the vector x norm =  according to
some norm x
57
Selecting a Kernel
Things to take into consideration:
• kernel can be thought of as a similarity measure
• examples in the same class should have high kernel value
• examples in different classes should have low kernel value
• ideal kernel: equivalence relation K ( x i ,x j ) = sign ( y i y j )
• normalization also applies to kernel
• relative weight for implicit features
• normalize per example for directional data
K ( x i ,x j )
K ( x i ,x j ) = 

K ( x i ,x i ) K ( x j ,x j )
• potential problems with large numbers, for example polynomial
d
kernel K ( xi ,x j ) = [ x i ⋅ x j + 1 ] for large d
58
Selecting Regularization Parameter C
Common Method
1
• a reasonable starting point and/or default value is C def = 
• search for C on a logscale, for example ∑ K ( xi ,xi )
–4 4
C ∈ [ 10 C def, …, 10 C def ]
59
Selecting Kernel Parameters
Problem
• results often very sensitive to kernel parameters (e.g. variance γ in
RBF kernel)
• need to simultaneously optimize C, since optimal C typically depends
on kernel parameters
Common Method
• search for combination of parameters via exhaustive search
• selection of kernel parameters typically via crossvalidation
Advanced Approach
• avoiding exhaustive search for improved search efficiency [Chapelle
et al, 2002]
60
Handling Unbalanced Datasets
Problem
• often the number of positive examples is much lower than the number
of negative examples
• SVM minimizes error rate
=> always say “no” gives great error rate, but poor recall
Common Methods
• cost model that makes errors on positive examples more expensive
61
Selecting an SVM Training Algorithm
SVMlight (also SVMtorch, mySVM, BSVM, etc.) [Joachims, 1999b]
• solve dual QP to obtain hyperplane from α coefficients
• iteratively decompose large QP into a sequence of small QPs
• handles kernels and treats linear SVM as special case
62
Part 3: How to Train SVMs?
•efficiency
of primal vs. dual
•decomposition algorithm
•working set selection
•optimality criteria
•caching
•shrinking
63
How can One Train SVMs Efficiently?
Solve one of the following quadratic optimization problems:
• n + N + 1 variables
n
• n linear inequality
1
min P ( w, b, ξ ) = w ⋅ w + C ∑ ξ i constraints
2
i=1 • no direct use of kernels
s. t. yi [ w ⋅ x i + b ] ≥ 1 – ξ i and ξ i ≥ 0
• size scales O ( nN )
65
Decomposition
Idea: Solve small subproblems until convergence (Osuna, et al.)!
T T
1 α1 α1 k 11 k 12 k 13 k 14 k 15 k 16 k 17 α 1
1 α2 α2 k 21 k 22 k 23 k 24 k 25 k 26 k 27 α 2
1 α3 α3 k 31 k 32 k 33 k 34 k 35 k 36 k 37 α 3
1
max 1 α 4 –  α 4 k 41 k 42 k 43 k 44 k 45 k 46 k 47 α 4
2
1 α5 α5 k 51 k 52 k 53 k 54 k 55 k 56 k 57 α 5
1 α6 α6 k 61 k 62 k 63 k 64 k 65 k 66 k 67 α 6
1 α7 α7 k 71 k 73 k 73 k 74 k 75 k 76 k 77 α 7
66
What Working Set to Select Next?
Solution: Select subproblem with q variables that minimizes
T
V(d) = g(α) d
yT d = 0
d i ≥ 0, if ( α i = 0 )
subject to d i ≤ 0, if ( α i = C )
–1 ≤ d ≤ 1
{ di ≠ 0 } = q
67
How to Tell that we Found the Optimal Solution?
KarushKuhnTucker conditions lead to the following criterion:
n n n
1
maximize D(α) = α i – 
∑ ∑∑ α i αj y i y j ( x i ⋅ x j )
2
n i = 1 i = 1j = 1 is optimal
s. t. ∑ αi yi = 0 and 0 ≤ αi ≤ C
i=1
<=>
n
(α = 0) ⇒ y
i i ∑
αj y j K ( x i, x j ) + b ≥ 1
j=1
n
∀i ( 0 < α i < C ) ⇒ y i
∑
αj y j K ( x i, x j ) + b = 1
n
j=1
(α = C) ⇒ y
i i ∑
αj y j K ( x i, x j ) + b ≤ 1
j=1
68
Demo
69
Caching
Observation: Most CPUtime is spent on computing the Hessian!
Idea: Cache kernel evaluations.
300
250
200
frequency
150
100
50
0
1 10 100 1000 10000
rank by frequency (logscale)
70
Shrinking
Idea: If we knew the set of SVs, we could solve a smaller problem!
(complexity per iteration from O ( nqf ) to O ( sqf ) )
Algorithm: • monitor the KKTconditions in each iteration
• if a variable is “stuck at bound”, remove it
• do final optimality check
71
Summary
How can One Train SVMs Efficiently?
SVMlight (also SVMtorch, mySVM, BSVM, etc.)
• solve dual QP to obtain hyperplane from α coefficients
• iteratively decompose large QP into a sequence of small QPs
• select working set according to steepest feasible descent criterion
• check optimality using KarushKuhnTucker conditions
73
...classifies as well as possible!?
What is a “good” classification rule h ?
P(h(x) ≠ y) = ∫ ∆ ( h ( x ) ≠ y ) dP ( x, y ) = Err P ( h )
“AverageCase” Learner:
E ( Err P ( h L ) ) = ∫ ErrP ( hL ) dP ( x1, y1 )…P ( xn, yn )
74
SVMs as WorstCase Learner
Goal: Guarantee of the form
P ( Err P ( h L ) > ε ) < η
R2
Theorem: P ErrP ( h ) –ErrS ( h ) ≥ Φ 2, n, η < η [ShaweTaylor et al,1996]
δ
So, if
• the training error Err S ( h ) on sample S is low and
• the margin δ is large,
then with probablility η the SVM will output a classification rule with
true error R2
Err P ( h i ) ≤ Err S ( h i ) + Φ 2 ,n, η .


δ
Problem: For most practical problems this bound is vacuous,
i.e. ErrP ( h i ) ≤ 1 .
75
SVMs as AverageCase Learner
Theorem: The expected error of an SVM is bounded by
n
R 2
2
2E 2 + 2CR E ∑ ξ i
δ
n = 1 1
E ( Err P ( h SVM ) ) ≤  C ≥ 2
n 2R
n
R 2
2 ξ i
2E 2 + 2 ( CR + 1 )E
δ ∑
n = 1 1
E ( Err P ( h SVM ) ) ≤  C < 2
n 2R
n
R 2
with E 2 the expected soft margin and E ∑ ξ i the expected training
δ
n = 1
error bound [Joachims, 2001] [Vapnik, 1998].
76
LeaveOneOut
Training set: ( x1, y 1 ), ( x2, y 2 ), ( x 3, y 3 ), …, ( xn, y n )
train on test on
( x 2, y 2 ), ( x 3, y 3 ), ( x 4, y 4 ), …, ( x n, y n ) ( x 1, y 1 )
77
Necessary Cond. for LeaveOneOut Error of SVM
Lemma: SVM h i ( xi ) ≠ y i ⇒ 2αi R2 + ξ i ≥ 1 [Joachims, 2000] [Jaakkola
& Haussler, 1999] [Vapnik
& Chapelle, 2000]
Input: Example:
• α i dual variable of example i
• ξ i slack variable of example i ραi R 2+ ξ i leaveoneout error
3.5 ERROR
Available after training SVM 0.1 OK
on the full training data 1.3 OK
0.0 OK
0.0 OK
... ...
78
Case 1: Example is no SV
2
( α i = 0 ) ⇒ ( ξ i = 0 ) ⇒ ( 2αi R + ξ i < 1 ) ⇒ no leave – one – out error
79
Case 2: Example is SV with Low Influence
αj αi
αk
0.5 2
αi < 
 < C ⇒ ( ξ = 0 ) ⇒ ( 2αi R + ξ < 1 ) ⇒ no leave – one – out error
i i
2
R
80
Case 2: Example is SV with Low Influence
0.5 2
αi < 
 < C ⇒ ( ξ = 0 ) ⇒ ( 2αi R + ξ < 1 ) ⇒ no leave – one – out error
i i
2
R
81
Case 3: Example has Small Training Error
2 2
( α i = C ) ∧ ( ξ i < 1 – 2CR ) ⇒ ( 2αi R + ξ i < 1 ) ⇒ no leave – one – out error
α i =C
ξi
82
Case 3: Example has Small Training Error
2 2
( α i = C ) ∧ ( ξ i < 1 – 2CR ) ⇒ ( 2αi R + ξ i < 1 ) ⇒ no leave – one – out error
83
Experiment: Reuters21578
• 6451 training examples
• 6451 test examples for holdout testing
• ~27,000 features
4.5 AvgEstimatedError
AvgHoldoutError
4 Default
3.5
3
Error
2.5
1.5
0.5
0
earn acq money−fx grain crude trade interest ship wheat corn
84
Fast LeaveOneOut Estimation for SVMs
Lemma: Training errors are always leaveoutout errors.
Algorithm: • ( R, α, ξ )= train_SVM(X,0,0);
• for all training examples, do
• if ξ i > 1 then loo++;
2
• else if ( ρα i R + ξ i < 1 ) then loo=loo;
• else train_SVM( X i, α, ξ );
Experiment:
Training Retraining Steps (%) CPUTime (sec)
Examples ρ = 1 ρ = 2 ρ = 1 ρ = 2
Reuters 6451 0.20% 0.58% 11.1 32.3
WebKB 2092 6.78% 20.42% 78.5 235.4
Ohsumed 10000 1.07% 2.56% 433.0 1132.3
85
Estimated Error of SVM
n
n
=> Errloo( h ) ≤ 1 i 2α R2 + ξ ≥ 1 ≤ 1 ∑ 2αi R2 + ξ i
n i i
n
i=1
86
Summary
Why do SVMs Work?
If
• the training error Err S ( h ) (on the sample S / on average) is low and
• the margin δ/R (on the sample S / on average) is large
then
• the SVM has learned a classification rule with low error rate with
high probablility (worstcase).
• the SVM learns classification rules that have low error rate on
average.
• the SVM has learned a classification rule for which the (leaveone
out) estimated error rate is low.
87
Part 4: When do SVMs Work Well?
Successful Use:
•Optical Character Recognition (OCR) [Vapnik, 1998]
•Face Recognition, etc. [Osuna et al., 1997]
•Text Classification [Joachims, 1997] [Dumais et al., 1998]
•...
Open Questions:
What characterizes these problems?
How can the good performance be explained?
What are “sufficient conditions” for using (linear) SVMs successfully?
88
Learning Text Classifiers
RealWorld
Process
Training Set New Documents
Classifier
Learner
Goal:
• Learner uses training set to find classifier with low prediction error.
89
Learning Text Classifiers
RealWorld
Process
Training Set New Documents
Classifier
Learner
Goal:
• Learner uses training set to find classifier with low prediction error.
Obstacle:
• No Free Lunch: There is no learner that does well on every task.
90
Learning Text Classifiers Successfully
RealWorld
Process
Training Set New Documents
Classifier
Learner
91
Learning SVM Text Classifiers Successfully
RealWorld
Process
Training Set New Documents
Classifier
Learner
SVM
The learner produces a classifier with low error rate
<=>
SVM
The properties of the learner fit the properties of the process.
92
Representing Text As Feature Vectors
Features: words
0 baseball
3 specs
(wordstems)
From: xxx@sciences.sdsu.edu
Newsgroups: comp.graphics 0 graphics
Subject: Need specs on Apple QT
1 references
I need to get the specs, or at least a
0 hockey Values: occurrence
very verbose interpretation of the specs,
for QuickTime. Technical articles from 0 car frequency
magazines and references to books would
be nice, too.
0 clinton
.
I also need the specs in a fromat usable .
on a Unix or MSDos system. I can’t .
do much with the QuickTime stuff they
have on ...
1 unix
0 space
2 quicktime
0 computer
93
Paradox of Text Classification
30,000 10,000
attributes training examples
94
Experimental Results
Reuters Newswire WebKB Collection Ohsumed MeSH
• 90 categories • 4 categories • 20 categories
• 9603 training doc. • 4183 training doc. • 10000 training doc.
• 3299 test doc. • 226 test doc. • 10000 test doc.
• ~27000 features • ~38000 features • ~38000 features
microaveraged precision/recall
Reuters WebKB Ohsumed
breakevenpoint [0..100]
95
Why Do SVMs Work Well for Text Classification?
96
Margin/Loss Based Bound on the Expected Error
Theorem: The expected error of a soft margin SVM is bounded by
n+1
R 2
2
ρE 2 + ρCR E ∑ ξ i
δ
n = 1 1
E ( Err n ( h SVM ) ) ≤  C ≥ 2
n+1 ρR
n+1
R 2
ρE 2 + ρ ( CR + 1 )E ξ i
2
δ ∑
n = 1 1
E ( Err n ( h SVM ) ) ≤  C < 2
n+1 ρR
n+1
R 2
Where E 2 is the expected soft margin and E ξ i
∑ is the
δ n = 1
expected training loss on training sets of size n + 1 .
97
First Step Completed
text
classification
? 5
properties
=>
low
2
R ⁄δ
and
2
low
E(Errn( h SVM ))
of
task
model ∑ ξi SVM
98
Properties 1+2: Sparse Examples in High Dimension
• High dimensional feature vectors (30,000 features)
• Sparse document vectors: only a few words of the whole language
occur in each document
Number
Training Distinct Words
of
Examples (Sparsity)
Features
Reuters 9,603 27,658 74
Newswire Articles (0.27%)
Ohsumed 10,000 38,679 100
MeSH Abstracts (0.26%)
WebKB 3,957 38,359 130
WWWPages (0.34%)
99
Property 3: Heterogeneous Use Of Words
MODULAIRE BUYS BOISE HOMES PROPERTY USX, CONS. NATURAL END TALKS
Modulaire Industries said it acquired the design USX Corp’s Texas Oil and Gas Corp subsidiary and
library and manufacturing rights of privatelyowned Consolidated Natural Gas Co have mutually agreed
Boise Homes for an undisclosed amount of cash. not to pursue further their talks on Consolidated’s
Boise Homes sold commercial and residential
possible purchase of Apollo Gas Co from Texas
prefabricated structures, Modulaire said.
Oil. No details were given.
JUSTICE ASKS U.S. DISMISSAL OF TWA
FILING E.D. And F. MAN TO BUY INTO HONG KONG
The Justice Department told the Transportation FIRM
Department it supported a request by USAir Group The U.K. Based commodity house E.D. And F.
that the DOT dismiss an application by Trans World Man Ltd and Singapore’s Yeo Hiap Seng Ltd jointly
Airlines Inc for approval to take control of USAir. announced that Man will buy a substantial stake in
``Our rationale is that we reviewed the application Yeo’s 71.1 pct held unit, Yeo Hiap Seng Enterprises
for control filed by TWA with the DOT and Ltd. Man will develop the locally listed soft drinks
ascertained that it did not contain sufficient manufacturer into a securities and commodities
information upon which to base a competitive brokerage arm and will rename the firm Man
review,’’ James Weiss, an official in Justice’s Pacific (Holdings) Ltd.
Antitrust Division, told Reuters.
No pair of documents shares any words, but “it”, “the”, “and”, “of”,
“for”, “an”, “a”, “not”, “that”, “in”.
100
Property 4: High Level Of Redundancy
101
Property 5: “Zipf’s Law”
k
Zipf’s Law: In text, the ith frequent word occurs f i = 
Θ
 times.
(c + i)
=> Most words occur very infrequently!
102
Text Classification Model
Definition: For the TCatconcept there are s disjoint sets of features.
TCat ( [ p 1 n 1 f 1 ], …, [ p s n s f s ] )
[20 20 100]
[ 4 1 200 ]
[ 1 4 200 ]
Example: TCat [ 5 5 600 ]
[ 9 1 3000 ]
[ 1 9 3000 ]
[10 10 4000]
103
TCatConcept for WebKB “Course”
[ 77 29 98 ], [ 4 21 52 ] high frequency
[ 16 2 431 ], [ 1 12 341 ]
medium frequency
TCat
[ 9 1 5045 ], [ 1 21 24276 ] low frequency
[ 169 191 8116 ] rest
104
Real Text Classification Tasks as TCatConcepts
[ 33 2 65 ], [ 32 65 152 ] high frequency
[ 2 1 171 ], [ 3 21 974 ]
medium frequency
Reuters “Earn”: TCat
[ 3 1 3455 ], [ 1 10 17020 ] low frequency
[ 78 52 5821 ] rest
[ 77 29 98 ], [ 4 21 52 ] high frequency
[ 16 2 431 ], [ 1 12 341 ]
medium frequency
Webkb “Course”: TCat
[ 9 1 5045 ], [ 1 21 24276 ] low frequency
[ 169 191 8116 ] rest
[ 2 1 10 ], [ 1 4 22 ] high frequency
[ 2 1 92 ], [ 1 2 94 ] medium frequency
Ohsumed “Pathology”: TCat
[5 1 4080 ], [ 1 10 20922 ] low frequency
[ 197 190 13459 ] rest
105
Second Step Completed
text
classification
5
properties
=>
? low
2
R ⁄δ
and
2
low
E(Errn( h SVM ))
of
task
model ∑ ξi SVM
106
2
The Margin δ of TCatConcepts
Lemma 1: For TCat ( [ p 1 n 1 f 1 ], …, [ p s n s f s ] ) concepts there is
always a hyperplane passing through the origin with margin δ 2 at least
s
p 2i
a = ∑ 
fi
i=1
s
2 n 2i
2 ad – b
δ ≥ 
a + 2b + d
with d = ∑ 
fi
i=1
s
ni pi
b = ∑ 
fi
i=1
107
2
The Length R of Document Vectors
Lemma 2: If the ranked term frequencies f i in a document with l
words have the form of the generalized Zipf’s Law
k
f i = 
Θ
(c + i)
108
2 2
R , δ , and ∑ ξi for Text Classification
Reuters Newswire Stories
n+1
• 10 most frequent categories R 2
2
• 9603 training examples E  +CR E ∑ ξ i
δ2
• 27658 attributes i = 1
E ( Err P ( h SVM ) ) ≤ 
n+1
∑ ∑ ξi
2 2 2 2
R ⁄δ ξi R ⁄δ
earn 1143 0 trade 869 9
109
Learnability of TCatConcepts
Theorem: For TCat ( [ p 1 n 1 f 1 ], …, [ p s n s f s ] ) concepts and
documents with l words that follow the generalized Zipf’s Law
Θ
fi = k ⁄ ( c + i ) the expected generalization error of an unbiased
SVM after training on n examples is bounded by
s
p 2i
a = ∑ 
fi
i=1
s
n 2i
2 2
d = ∑ 
fi
R ad – b i=1
E ( Err n ( h SVM ) ) ≤   with s
n + 1 a + 2b + d ni pi
b = ∑ 
fi
i=1
d
2 k  2

R ≤ ∑
(c + i)
Θ
i=1
110
Comparison Theory vs. Experiments
Predicted
Error Rate in
Learning Curve Bound Bound on
Experiment
Error Rate
Reuters 138
“earn” E ( Err n ( h SVM ) ) ≤  1.5% 1.3%
n+1
WebKB 443
“course” E ( Err n ( h SVM ) ) ≤  11.2% 4.4%
n+1
Ohsumed 9457
“pathology” E ( Err n ( h SVM ) ) ≤  94.6% 23.1%
n+1
111
Sensitivity Analysis
What makes a text classification problem suitable for a linear SVM?
High Redundancy: [ 40 40 50 ] high frequency
TCat [ 25 5 1000 ], [ 5 25 1000 ] medium frequency
[ 30 30 30000 ] low frequency
High Frequency:
[ 16 4 10 ], [ 4 16 10 ], [ 20 20 30 ] high frequency
TCat [ 30 30 2000 ] medium frequency
[ 30 30 30000 ] low frequency
112
What does this Model Provide?
Connects the statistical properties of text classification tasks with
generalization error of SVM!
=>
• Explains the behavior of (linear) SVMs on text classification tasks
113
Summary
When do (Linear) SVMs Work Well?
114
Part 3: SVMX?
•common elements of SVMs for other problem
•learning ranking functions from preferences
•novelty and outlier detection
•regression
115
The Receipe for Cooking an SVMs
Ingredients:
• linear prediction rules h ( x ) = w ⋅ x + b
• training problem with objective a la min w ⋅ w + C
∑ ξi and with linear
constraints (=> quadratic program)
Stirr and add flavor:
• Classification
• Ranking [Herbrich et al., 2000][Joachims, 2002c]
• Novelty Detection [Schoelkopf et al., 2000]
• Regression [Vapnik, 1998][Smola & Schoelkopf, 1998]
That makes:
• nice SVM with global optimal solution and duality
• often sparse solution (#SVs < n)
• Hint: garnish the dual with kernel to get nonlinear prediction rules
116
SVM Ranking
Query:
• “Support Vector Machine”
Goal:
• “rank the document I want
high in the list”
282,000 hits
117
Training Examples from Clickthrough
Assumption: If a user skips a link a and clicks on a link b ranked
lower, then the user preference reflects rank(b) < rank(a).
Example: (3 < 2) and (7 < 2), (7 < 4), (7 < 5), (7 < 6)
Ranking Presented to User:
1. Kernel Machines
http://svm.first.gmd.de/
2. Support Vector Machine
http://jbolivar.freeservers.com/
3. SVMLight Support Vector Machine
http://ais.gmd.de/~thorsten/svm light/
4. An Introduction to Support Vector Machines
http://www.supportvector.net/
5. Support Vector Machine and Kernel ... References
http://svm.research.belllabs.com/SVMrefs.html
6. Archives of SUPPORTVECTORMACHINES ...
http://www.jiscmail.ac.uk/lists/SUPPORT...
7. Lucent Technologies: SVM demo applet
http://svm.research.belllabs.com/SVT/SVMsvt.html
8. Royal Holloway Support Vector Machine
http://svm.dcs.rhbnc.ac.uk/
118
Training Examples from Clickthrough
Assumption: If a user skips a link a and clicks on a link b ranked
lower, then the user preference reflects rank(b) < rank(a).
Example: (3 < 2) and (7 < 2), (7 < 4), (7 < 5), (7 < 6)
Ranking Presented to User:
1. Kernel Machines
http://svm.first.gmd.de/
2. Support Vector Machine
http://jbolivar.freeservers.com/
3. SVMLight Support Vector Machine
http://ais.gmd.de/~thorsten/svm light/
4. An Introduction to Support Vector Machines
http://www.supportvector.net/
5. Support Vector Machine and Kernel ... References
http://svm.research.belllabs.com/SVMrefs.html
6. Archives of SUPPORTVECTORMACHINES ...
http://www.jiscmail.ac.uk/lists/SUPPORT...
7. Lucent Technologies: SVM demo applet
http://svm.research.belllabs.com/SVT/SVMsvt.html
8. Royal Holloway Support Vector Machine
http://svm.dcs.rhbnc.ac.uk/
119
Learning to Rank
Assume:
• distribution of queries P(Q)
• distribution of target rankings for query P(R  Q)
Given:
• collection D of m documents
• i.i.d. training sample ( q 1, r 1 ), …, ( q n, r n )
Design:
D×D
• set of ranking functions F, with elements f: Q → P (weak ordering)
• loss function l ( r a, r b )
• learning algorithm
Goal:
∫
• find f° ∈ F with minimal R P ( f ) = l ( f ( q ), r ) dP ( q, r )
120
A Loss Function for Rankings
For two orderings r a and r b , a pair d i ≠ d j is
• concordant, if r a and r b agree in their ordering
P = number of concordant pairs
• discordant, if r a and r b disagree in their ordering
Q = number of discordant pairs
Loss function: [Wong et al., 88], [Cohen et al., 1999], [Crammer &
Singer, 01], [Herbrich et al., 98] ...
l ( r a, r b ) = Q
Example:
r a = ( a, c, d, b, e, f, g, h )
r b = ( a, b, c, d, e, f, g, h )
121
A Loss Function for Rankings
For two orderings r a and r b , a pair d i ≠ d j is
• concordant, if r a and r b agree in their ordering
P = number of concordant pairs
• discordant, if r a and r b disagree in their ordering
Q = number of discordant pairs
Loss function: [Wong et al., 88], [Cohen et al., 1999], [Crammer &
Singer, 01], [Herbrich et al., 98] ...
l ( r a, r b ) = Q
Example:
r a = ( a, c, d, b, e, f, g, h )
r b = ( a, b, c, d, e, f, g, h )
122
A Loss Function for Rankings
For two orderings r a and r b , a pair d i ≠ d j is
• concordant, if r a and r b agree in their ordering
P = number of concordant pairs
• discordant, if r a and r b disagree in their ordering
Q = number of discordant pairs
Loss function: [Wong et al., 88], [Cohen et al., 1999], [Crammer &
Singer, 01], [Herbrich et al., 98] ...
l ( r a, r b ) = Q
Example:
r a = ( a, c, d, b, e, f, g, h )
r b = ( a, b, c, d, e, f, g, h )
123
Interpretation of Loss Function
Notation:
• P concordant pairs
• Q discordant pairs
124
What does the Ranking Function Look Like?
Sort documents d i by their “retrieval status value” rsv( q , d i ) with query
q [Fuhr, 89]:
rsv( q , d i ) = w1 * #(of query words in title of d i )
+ w2 * #(of query words in H1 headlines of d i )
...
+ w N * PageRank( d i )
= w Φ( q , d i ).
125
Minimizing Training Loss For Linear Ranking
Functions
Given:
• training sample S = ( q 1, r 1 ), …, ( q n, r n )
126
Ranking Support Vector Machine
Optimization Problem (primal):
1
min  w ⋅ w + C
2 ∑ ξl, i, j
∀( d i, d j ) ∈ r 1 ;( wΦ ( q 1, d i ) ≥ wΦ ( q 1, d j ) + 1 – ξ 1, i, j )
…
∀( d i, d j ) ∈ r n ;( wΦ ( q n, d i ) ≥ wΦ ( q n, d j ) + 1 – ξ n, i, j )
Properties:
• minimize tradeoff between training 1
loss and margin size δ = 1 / w 2
• quadratic program, similar to
classification SVM (=> SVMlight)
• convex => unique global optimum R 3
• radius of ball containing the training
δ
points R 4 f(q)
127
How is this different from ...
... classification?
f1(q):   +                                    
f2(q):                                       +
=> both have same error rate (always classify as nonrelevant)
=> very different rank loss
... ordinal regression?
Training set S = ( x 1, y 1 ), …, ( x n, yn ) , with Y ordinal (and finite)
=> ranks need to be comparable between examples
1 2 3
ordinal regression
1 2 3
1 2 3
ranking
1 2 3
128
Experiment Setup
Collected training examples with partial feedback about ordering from
• user skipping links Ranking Presented to User:
1. Kernel Machines
http://svm.first.gmd.de/
2. Support Vector Machine
http://jbolivar.freeservers.com/
3. SVMLight Support Vector Machine
http://ais.gmd.de/~thorsten/svm light/
4. An Introduction to Support Vector Machines
http://www.supportvector.net/
5. Support Vector Machine and Kernel ... References
http://svm.research.belllabs.com/SVMrefs.html
6. Archives of SUPPORTVECTORMACHINES ...
http://www.jiscmail.ac.uk/lists/SUPPORT...
7. Lucent Technologies: SVM demo applet
http://svm.research.belllabs.com/SVT/SVMsvt.html
8. Royal Holloway Support Vector Machine
http://svm.dcs.rhbnc.ac.uk/
=> (3 < 2) and (7 < 2), (7 < 4), (7 < 5), (7 < 6)
129
Query/Document Match Features Φ(q,d)
Rank in other search engine:
• Google, MSNSearch, Altavista, Hotbot, Excite
Query/Content Match:
• cosine between URLwords and query
• cosine between titlewords and query
• query contains domainname
Popularity Attributes:
• length of URL in characters
• country code of URL
• domain of URL
• word “home” appears in title
• URL contains “tilde”
• URL as an atom
130
Experiment I: Learning Curve
Training examples: preferences from 112 queries
25 MSNSearch
Google
Learning
20
Prediction Error (%)
15
10
0
0 10 20 30 40 50 60 70 80
Number of Training Examples
131
Experiment II
Experiment Setup:
• metasearch engine (Google, MSNSearch, Altavista, Hotbot, Excite)
• approx. 20 users
• machine learning students and researchers from University of
Dortmund AI Unit (Prof. Morik)
• asked to use system as any other search engine
• display title and URL of document
132
Experiment: Learning vs. Google/MSNSearch
133
Learned Weights
weight feature
0.24 document was ranked at rank 1 by exactly one of the 5 search engines
...
...
134
Summary: SVM Ranking
• An SVM method for learning ranking functions
• Training examples are rankings
=> pairwise preferences like “A should be ranked higher than B”
• Turn training examples into linear inequality constraints
• Results in quadratic program similar to classification
• Rank new examples by sorting according to distance from hyperplane
Applications:
• personalizing search engines
• tuning retrieval functions in XML intranets
• recommender systems
• betting on horses
135
SVM Novelty/Outlier Detection
Assume:
• distribution of feature vectors P(X)
P(x ∈ R) ≥ 1 – ε
136
Example: Small and Large Volume Regions
Assume that we know the distribution P(X).
All following are regions with P ( x ∈ R ) ≥ 1 – ε :
R
P(X)
137
Find Region using Examples
Problem:
• P(X) cannot be observed directly.
Given:
• training oberservations x 1, …, x n drawn according to P(X).
138
Properties of the Primal/Dual
Primal: Dual:
n n n n
Properties:
• convex => global optimum
ξi
• ξ i measures distance from ball
• α i = 0 : example lies inside the ball r
c
• 0 < α i < C : example on hull of ball
• α i = C : example is training error
139
OneClass SVM: Separating from the Origin
Observation: [Schölkopf et al., 2000][Schölkopf et al., 2001]
• For kernels depending on the distance between points, the dual is the
same as for classification SVM with
• all training observations in the positive class (with slack)
• one virtual negative example with α = – 1 and K ( x i ,x j ) = 0 .
Dual:
n n n
s. t. ∑ αi = 1 and 0 ≤ αi ≤ C
i=1 Constant for kernels depending on
distance between points (e.g. RBF)
2
=> Equivalent for RBF kernel K ( x i ,x j ) = exp ( – x i – x j ⁄ σ 2 ) !
140
Influence of C and RBFWidth σ2
small C
large width σ2
no outliers
small C
large width σ2
some ouliers
large C
large width σ2
small C
small width σ2
(plots courtesy of B. Schoelkopf)
141
Summary: SVM Novelty Detection
• Find small region where most observations fall
• OneClass SVM: separate observations from origin
• Outliers (or new observations after shift in distribution) lie outside of
region
• Training problem similar to classification SVM
Further work:
• Extension to ν SVMs and error bounds [Schölkopf et al., 2001]
[Schölkopf et al., 2001]
• SVM clustering [BenHur et al., 2001]
Applications:
• Text classification [Manevitz & Yousef, 2001]
• Topic detection
142
SVM Regression
Loss function:
• ε insensitive region with zero loss
• linear loss beyond the “tube”
143
Primal SVM Optimization Problems
Classification:
n
minimize J ( w, b, ξ ) = 1 w ⋅ w + C ∑ ξ i
2
i=1
s. t. yi [ w ⋅ x i + b ] ≥ 1 – ξ i and ξ i ≥ 0
δ
Regression:
n
144
Dual SVM Optimization Problems
m m m
1
maximize L(α) = p i α i – 
∑ ∑∑ α i αj u i u j K ( v i, v j )
2
m i = 1 i = 1j = 1
s.t. ∑ αi ui = 0 and 0 ≤ αi ≤ C
i=1
N
Classification: ( x 1, y 1 ), …, ( x n, y n ) ∼ P ( x, y ) xi ∈ ℜ y i ∈ {1, – 1} m = n
• p i = 1 for 1 ≤ i ≤ n
• u i = y i for 1 ≤ i ≤ n
• v i = x i for 1 ≤ i ≤ n
Regression: ( x1, y1 ), …, ( x n, y n ) ∼ P ( x, y ) xi ∈ ℜN y i ∈ ℜ m = 2n
• p i = ε + y i for 1 ≤ i ≤ n and p i = ε – y i for n + 1 ≤ i ≤ 2n
• u i = 1 for 1 ≤ i ≤ n and u i = – 1 for n + 1 ≤ i ≤ 2n
• v i = x i for 1 ≤ i ≤ n and v i = x i for n + 1 ≤ i ≤ 2n
145
Conclusions
• What! How! Why! When! ...and that SVMs solve any other problem!
Info
• Chris Burges’ tutorial (Classification)
http://www.kernelmachines.org/papers/Burges98.ps.gz
• Smola & Schölkopf’s tutorial (Regression)
http://www.kernelmachines.org/papers/tr301998.ps.gz
• Cristianini& ShaweTaylor book: Introduction to SVMs, Cambridge
University Press, 2000.
• Schölkopf’ & Smola book: Learning with Kernels, MIT Press, 2002.
• My dissertation: Learning to Classify Text Using Support Vector
Machines, Kluwer.
• Software: SVM
light for Classification, Regression, and Ranking
http://svmlight.joachims.org/
• General: http://www.kernelmachines.org
146
Bibliography
• [BenHur et al., 2001] BenHur, A., Horn, D., Siegelmann, H.,
Vapnik, V. (2001). Support Vector Clustering. Journal of Machine
Learning Research, 2:125137.
• [Bennet & Demiriz, 1999] K. Bennet and A. Demiriz (1999). Semi
supervised support vector machines, Advances in Neural Information
Processing Systems, 12, pages 368374, MIT Press.
• [Boser et al., 1992] Boser, B. E., Guyon, I. M., and Vapnik, V. N.
(1992). A traininig algorithm for optimal margin classifiers. In
Haussler, D., editor, Proceedings of the 5th Annual ACM Workshop
on Computational Learning Theory, pages 144152.
• [Burges, 1998] C. J. C. Burges. A Tutorial on Support Vector
Machines for Pattern Recognition. Knowledge Discovery and Data
Mining, 2(2), 1998.
147
• [Chapelle et al., 2002] Chapelle, O., V. Vapnik, O. Bousquet and S.
Mukherjee (2002). Choosing Multiple Parameters for Support Vector
Machines. Machine Learning 46(1), pages 131159.
• [Cohen et al., 1999] W. Cohen, R. Shapire, and Y. Singer (1999).
Learning to order things. Journal of Artificial Intelligence Research,
10:243270.
• [Collins & Duffy, 2001] Collins, M., and Duffy, N. (2001).
Convolution kernels for natural language. Proceedings of Advances
in Neural Information Processing Systems (NIPS).
• [Cortes & Vapnik, 1995] Cortes, C. and Vapnik, V. N. (1995).
Supportvector networks. Machine Learning Journal, 20:273297.
• [Crammer & Singer, 2001] K. Crammer and Y. Singer (2001). On the
algorithmic implementation of multiclass kernelbased vector
machines, Journal of Machine Learning Research, 2:265292.
• [Cristianini & ShaweTaylor, 2000] Cristianini, N. and ShaweTaylor,
J. (2000). An Introduction to Support Vector Machines and Other
KernelBased Learning Methods. Cambridge University Press.
148
• [Dumais et al., 1998] Dumais, S., Platt, J., Heckerman, D., and
Sahami, M. (1998). Inductive learning algorithms and representations
for text categorization. In Proceedings of ACMCIKM98.
• [Gaertner et al., 2002] Gaertner, T., Lloyd, J., Flach, P. (2002).
Kernels for structured data. Twelfth International Conference on
Inductive Logic Programming (ILP).
• [Herbrich et al., 2000] R. Herbrich, T. Graepel, K. Obermayer (2000).
Large Margin Rank Boundaries for Ordinal Regression, In Advances
in Large Margin Classifiers, pages 115132, MIT Press.
• [Jaakkola, and Haussler, 1998] Jaakkola, T. and Haussler, D. (1998).
Exploiting generative models in discriminative classifiers. In
Advances in Neural Information Processing Systems 11, MIT Press.
• [Jaakkola and Haussler, 1999] Jaakkola, T. and Haussler, D. (1999).
Probabilistic kernel regression models. In Conference on AI and
Statistics.
• [Joachims, 1998] Joachims, T. (1998). Text categorization with
support vector machines: Learning with many relevant features. In
149
Proceedings of the European Conference on Machine Learning, pages
137  142, Berlin. Springer.
• [Joachims, 1999a] Joachims, T. (1999a). Aktuelles Schlagwort:
Support Vector Machines. Künstliche Intelligenz, 4.
• [Joachims, 1999b] Joachims, T. (1999b). Making LargeScale SVM
Learning Practical. In Schölkopf, B., Burges, C., and Smola, A.,
editors, Advances in Kernel Methods  Support Vector Learning,
pages 169  184. MIT Press, Cambridge.
• [Joachims, 1999c] Joachims, T. (1999c) Transductive Inference for
Text Classification using Support Vector Machines. International
Conference on Machine Learning (ICML).
• [Joachims, 2002] Joachims, T. (2002). Learning to Classify Text
Using Support Vector Machines: Methods, Theory, and Algorithms,
Kluwer.
• [Joachims, 2002c] Joachims, T. (2002). Optimizing Search Engines
using Clickthrough Data, Conference on Knowledge Discovery and
Data Mining (KDD), ACM.
150
• [Keerthi et al., 1999] Keerthi, S., Shevade, S., Bhattacharyya, C., and
Murthy, K. (1999). A fast iterative nearest point algorithm for support
vector machine classifier design. Technical Report TRISL9903,
Indian Institute of Science, Bangalore, India.
• [Kindermann & Paass, 2000] Kindermann, Jörg and Paass, Gerhard:
Multiclass Classification with Error Correcting Codes, FGML, 2000.
• [Kondor & Lafferty, 2002] Kondor, I., and Lafferty, J. (2002).
Diffusion kernels on graphs and other discrete input spaces,
International Conference on Machine Learning (ICML).
• [Lodhi et al., 2002] Lodhi, H., Saunders, C., ShaweTaylor, J.,
Cristianini, N., & Watkins, C. (2002). Text Classification using String
Kernels. Journal of Machine Learning Research, 2:419444.
• [MacKay, 1997] D.J.C. MacKay. Introduction to gaussian processes,
1997. Tutorial at ICANN'97. ftp://wol.ra.phy.cam.ac.uk/pub/mackay/
gpB.ps.gz
151
• [Manevitz & Yousef, 2001] L. Manevitz and M. Yousef (2001). One
Class SVMs for Document Classification, Journal of Machine
Learning Research, 2:139154.
• [Mangasarian and Musicant, 2000] Mangasarian, O. L. and Musicant,
D. R. (2000). Active support vector machine classification. Technical
Report 0004, Data Mining Institute, Computer Sciences Department,
University of Wisconsin, Madison, Wisconsin. ftp://ftp.cs.wisc.edu/
pub/dmi/techreports/0004.ps.
• [Osuna et al., 1997] Osuna, E., Freund, R., and Girosi, F. (1997b).
Training support vector machines: An application to face detection.
In Proceedings CVPR'97.
• [Platt, 1999] Platt, J. (1999). Fast training of support vector machines
using sequential minimal optimization. In Schölkopf, B., Burges, C.,
and Smola, A., editors, Advances in Kernel Methods  Support Vector
Learning, chapter 12. MITPress.
152
• [Plattet al., 2000] Platt, J., Cristianini, N., and ShaweTaylor, J.
(2000). Large margin dags for multiclass classification. In Advances
in Neural Information Processing Systems 12.
• [Schölkopf et al., 1995] Schölkopf, B., Burges, C., Vapnik, V. (1995).
Extracting support data for a given task. Proceedings of the
International Conference on Knowledge Discovery and Data Mining
(KDD).
• [Schölkopf et al., 1998] Schölkopf, B., Smola, A., and Mueller, K.R.
(1998). Nonlinear component analysis as a kernel eigenvalue
problem. Neural Computation, 10:12991319.
• [Schölkopf et al., 2000] Schölkopf, B., R. C. Williamson, A. J.
Smola, J. ShaweTaylor, and J. C. Platt (2000). Support vector
method for novelty detection. In S. Solla, T. Leen, and K.R. Müller
(Eds.), Advances in Neural Information Processing Systems 12, pp.
582588. MIT Press.
153
• [Schölkopf et al., 2001] Schölkopf, B., J. Platt, J. ShaweTaylor, A.J.
Smola and R.C. Williamson (2001). Estimating the support of a high
dimensional distribution. Neural Computation 13(7);14431471.
• [Schölkopf & Smola, 2002] Schölkopf, B., Smola, A. (2002).
Learning with Kernels, MIT Press.
• [ShaweTaylor et al., 1996] ShaweTaylor, J., Bartlett, P., Williamson,
R., and Anthony, M. (1996). Structural risk minimization over data
dependent hierarchies. Technical Report NCTR96053,
NeuroCOLT.
• [Smola & Schölkopf, 1998] A. J. Smola and B. Schölkopf. A Tutorial
on Support Vector Regression. NeuroCOLT Technical Report NC
TR98030, Royal Holloway College of London, UK, 1998.
• [Tax & Duin, 1999] D. Tax and R. Duin (1999). Support vector
domain description. Patter Recognition Letters, 20:19911999.
• [Tax & Duin, 2001] D. Tax and R. Duin (2001). Uniform Object
Generation for Optimizing OneClass Classifiers, Journal of Machine
Learning Research, 2:155173.
154
• [Vapnik, 1998] Vapnik, V. (1998). Statistical Learning Theory. Wiley,
Chichester, GB.
• [Vapnik & Chapelle, 2000] V. Vapnik and O. Chapelle. Bounds on
Error Expectation for Support Vector Machines. In Neural
Computation, 2000, vol 12, 9.
• [Wapnik & Tscherwonenkis, 1979] Wapnik, W. and Tscherwonenkis,
A. (1979). Theorie der Zeichenerkennung. Akademie Verlag, Berlin.
• [Weston & Watkins, 1998] Weston, J. and Watkins, C. (1998). Multi
class support vector machines. Technical Report CSDTR9804,
Royal Holloway University of London.
155