
Kernels in Support Vector Machines

Based on lectures of Martin Law, University of Michigan
Pier Luigi Martelli, Systems and In Silico Biology 2015-2016, University of Bologna


Non-linearly separable problems

A single perceptron can realize the linearly separable functions AND, OR, and NOT, but the XOR problem cannot be solved with a perceptron.

[Figure: decision regions for AND, OR, NOT and for the non-separable XOR]
With NN: Multi-layer feed-forward neural networks

Neurons are organized into hierarchical layers.

Each layer receives its inputs from the previous one and transmits its output to the next one:

$$z^{1}_{j} = g\Big(\sum_i w^{1}_{ij}\, x_i - \theta^{1}_{j}\Big) \qquad\qquad z^{2}_{j} = g\Big(\sum_i w^{2}_{ij}\, z^{1}_{i} - \theta^{2}_{j}\Big)$$
XOR

[Figure: network with inputs x1, x2, two hidden units and one output unit]

Weights and thresholds:
hidden unit 1: $w^{1}_{11} = 0.7$, $w^{1}_{21} = 0.7$, $\theta^{1}_{1} = 0.5$
hidden unit 2: $w^{1}_{12} = 0.3$, $w^{1}_{22} = 0.3$, $\theta^{1}_{2} = 0.5$
output unit: $w^{2}_{11} = 0.7$, $w^{2}_{21} = -0.7$, $\theta^{2}_{1} = 0.5$

Input $x_1 = 0$, $x_2 = 0$:
$a^{1}_{1} = -0.5 \Rightarrow z^{1}_{1} = 0$
$a^{1}_{2} = -0.5 \Rightarrow z^{1}_{2} = 0$
$a^{2}_{1} = -0.5 \Rightarrow z^{2}_{1} = 0$
XOR (same network and weights)

Input $x_1 = 1$, $x_2 = 0$:
$a^{1}_{1} = 0.2 \Rightarrow z^{1}_{1} = 1$
$a^{1}_{2} = -0.2 \Rightarrow z^{1}_{2} = 0$
$a^{2}_{1} = 0.2 \Rightarrow z^{2}_{1} = 1$
XOR (same network and weights)

Input $x_1 = 0$, $x_2 = 1$:
$a^{1}_{1} = 0.2 \Rightarrow z^{1}_{1} = 1$
$a^{1}_{2} = -0.2 \Rightarrow z^{1}_{2} = 0$
$a^{2}_{1} = 0.2 \Rightarrow z^{2}_{1} = 1$
XOR (same network and weights)

Input $x_1 = 1$, $x_2 = 1$:
$a^{1}_{1} = 0.9 \Rightarrow z^{1}_{1} = 1$
$a^{1}_{2} = 0.1 \Rightarrow z^{1}_{2} = 1$
$a^{2}_{1} = -0.5 \Rightarrow z^{2}_{1} = 0$
The hidden layer REMAPS the input into a new representation that is linearly separable:

x1  x2 | desired output | z11  z12
 0   0 |       0        |  0    0
 1   0 |       1        |  1    0
 0   1 |       1        |  1    0
 1   1 |       0        |  1    1
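The forward pass above is easy to check numerically. A minimal NumPy sketch (the step activation and all variable names are our assumptions, inferred from the 0/1 outputs in the slides):

```python
import numpy as np

def step(a):
    # Heaviside step activation: 1 if a >= 0, else 0
    return (a >= 0).astype(int)

# Weights and thresholds from the slides
W1 = np.array([[0.7, 0.7],    # hidden unit 1: w1_11, w1_21
               [0.3, 0.3]])   # hidden unit 2: w1_12, w1_22
theta1 = np.array([0.5, 0.5])
W2 = np.array([0.7, -0.7])    # output unit: w2_11, w2_21
theta2 = 0.5

for x in [(0, 0), (1, 0), (0, 1), (1, 1)]:
    z1 = step(W1 @ np.array(x) - theta1)  # hidden activations
    z2 = step(W2 @ z1 - theta2)           # network output
    print(x, "-> hidden", z1, "-> output", z2)
# (0,0)->0, (1,0)->1, (0,1)->1, (1,1)->0: exactly the XOR function
```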
Extension to Non-linear Decision Boundary
 So far, we have only considered large-margin classifiers with a linear decision boundary
 How to generalize it to become nonlinear?
 Key idea: transform xi to a higher-dimensional space to "make life easier"
 Input space: the space where the points xi are located
 Feature space: the space of the f(xi) after the transformation
 Why transform?
 A linear operation in the feature space is equivalent to a non-linear operation in the input space
 Classification can become easier with a proper transformation. In the XOR problem, for example, adding the new feature x1x2 makes the problem linearly separable, as demonstrated below
XOR
Is not linearly separable:

X  Y | X XOR Y
0  0 |    0
0  1 |    1
1  0 |    1
1  1 |    0

With the additional feature XY, it is linearly separable:

X  Y  XY | X XOR Y
0  0   0 |    0
0  1   0 |    1
1  0   0 |    1
1  1   1 |    0

Find a feature space!
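This can be checked with any linear classifier on the augmented inputs. A short scikit-learn sketch (the library choice and C value are ours):

```python
import numpy as np
from sklearn.svm import LinearSVC

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 0])                      # XOR targets

# Augment each input with the product feature XY
X_aug = np.hstack([X, (X[:, 0] * X[:, 1]).reshape(-1, 1)])

clf = LinearSVC(C=100).fit(X_aug, y)
print(clf.predict(X_aug))  # expected [0 1 1 0]: now linearly separable
```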
Transforming the Data

[Figure: the map f(.) sends every point of the input space to the feature space, where the two classes become linearly separable]

Input space -> Feature space

Note: the feature space is in practice of higher dimension than the input space

 Computation in the feature space can be costly because it is high dimensional
 The feature space can be infinite-dimensional!
 The kernel trick comes to the rescue
The Kernel Trick
 Recall the SVM optimization problem
 The data points only appear as scalar products $x_i^T x_j$
 As long as we can calculate the inner product in the feature space, we do not need the mapping explicitly
 Many common geometric operations (angles, distances) can be expressed by inner products
 Define the kernel function K by
$$K(x_i, x_j) = \varphi(x_i)^T \varphi(x_j)$$
An Example for f(.) and K(.,.)
 Suppose f(.) is given as follows (for a 2-dimensional input $x = [x_1, x_2]$; this is the mapping derived later in these slides):
$$f([x_1, x_2]) = \big(1,\; x_1^2,\; \sqrt{2}\,x_1 x_2,\; x_2^2,\; \sqrt{2}\,x_1,\; \sqrt{2}\,x_2\big)$$
 An inner product in the feature space is
$$\big\langle f([x_1, x_2]),\, f([y_1, y_2]) \big\rangle = \big(1 + x_1 y_1 + x_2 y_2\big)^2$$
 So, if we define the kernel function as $K(x, y) = (1 + x^T y)^2$, there is no need to carry out f(.) explicitly
 This use of the kernel function to avoid carrying out f(.) explicitly is known as the kernel trick
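A quick numerical check that the explicit map and the kernel agree (NumPy sketch; phi and K are our names):

```python
import numpy as np

def phi(x):
    # Explicit feature map for the kernel (1 + x.y)^2 with 2-D inputs
    x1, x2 = x
    return np.array([1.0, x1**2, np.sqrt(2)*x1*x2, x2**2,
                     np.sqrt(2)*x1, np.sqrt(2)*x2])

def K(x, y):
    # Same quantity computed directly in the input space
    return (1.0 + np.dot(x, y))**2

rng = np.random.default_rng(0)
x, y = rng.normal(size=2), rng.normal(size=2)
print(np.dot(phi(x), phi(y)))  # same value...
print(K(x, y))                 # ...without ever building the feature vectors
```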
Kernels
 Given a mapping $x \to \varphi(x)$, a kernel is represented as the inner product
$$K(x, y) = \sum_i \varphi_i(x)\, \varphi_i(y)$$
 A kernel must satisfy Mercer's condition:
$$\forall g(x) \ \text{such that} \ \int g^2(x)\,dx < \infty \;\Rightarrow\; \iint K(x, y)\, g(x)\, g(y)\, dx\, dy \ge 0$$
 This is analogous to positive-semidefinite matrices M, for which
$$\forall z \ne 0: \quad z^T M z \ge 0$$
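Mercer's condition is hard to verify directly, but its finite-sample analogue is easy: on any data set the Gram matrix must be positive semidefinite. A small sketch (helper names are ours):

```python
import numpy as np

def gram_matrix(kernel, X):
    # G[i, j] = kernel(X[i], X[j])
    n = len(X)
    return np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])

def is_psd(G, tol=1e-10):
    # A symmetric matrix is PSD iff all its eigenvalues are >= 0
    return bool(np.all(np.linalg.eigvalsh(G) >= -tol))

X = np.random.default_rng(1).normal(size=(20, 3))
poly = lambda x, y: (1.0 + x @ y)**2
print(is_psd(gram_matrix(poly, X)))        # True: a valid kernel

neg_dist = lambda x, y: -np.linalg.norm(x - y)
print(is_psd(gram_matrix(neg_dist, X)))    # False: not a Mercer kernel
```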
Modification Due to Kernel Function
 Change all inner products to kernel functions
 For training,

Original:
$$\max_\alpha \sum_i \alpha_i - \frac{1}{2}\sum_i \sum_j \alpha_i \alpha_j y_i y_j \big(x_i^T x_j\big) \quad \text{subject to} \ 0 \le \alpha_i \le C, \ \sum_i \alpha_i y_i = 0$$

With kernel function:
$$\max_\alpha \sum_i \alpha_i - \frac{1}{2}\sum_i \sum_j \alpha_i \alpha_j y_i y_j\, K(x_i, x_j) \quad \text{subject to} \ 0 \le \alpha_i \le C, \ \sum_i \alpha_i y_i = 0$$
Modification Due to Kernel Function
 For testing, the new data z is classified as class 1 if f > 0, and as class 2 if f < 0

Original:
$$f(z) = \sum_{i \in SV} \alpha_i y_i \big(x_i^T z\big) + b$$

With kernel function:
$$f(z) = \sum_{i \in SV} \alpha_i y_i\, K(x_i, z) + b$$
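In code, the kernelized discriminant is just a weighted sum over the support vectors (a sketch; names are ours):

```python
def decision_function(z, svs, kernel, b):
    # svs: list of (alpha_i, y_i, x_i) tuples for the support vectors
    # f(z) = sum_i alpha_i * y_i * K(x_i, z) + b; class 1 if f(z) > 0
    return sum(a * y * kernel(x, z) for a, y, x in svs) + b
```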
More on Kernel Functions
 Since the training of SVM only requires the values of K(xi, xj), there is no restriction on the form of xi and xj
 xi can be a sequence or a tree, instead of a feature vector
 K(xi, xj) is just a similarity measure comparing xi and xj
 For a test object z, the discriminant function is essentially a weighted sum of the similarities between z and a pre-selected set of objects (the support vectors)
Example
 Suppose we have 5 1-D data points
 x1=1, x2=2, x3=4, x4=5, x5=6, with 1, 2, 6 as class 1 and 4, 5 as class 2  y1=1, y2=1, y3=-1, y4=-1, y5=1

class 1        class 2        class 1
1    2         4    5         6
Example
 We use the polynomial kernel of degree 2: K(x, y) = (xy + 1)^2
 C is set to 100
 We first find ai (i = 1, ..., 5) by
$$\max_\alpha \sum_{i=1}^{5} \alpha_i - \frac{1}{2}\sum_{i=1}^{5}\sum_{j=1}^{5} \alpha_i \alpha_j y_i y_j \big(x_i x_j + 1\big)^2 \quad \text{subject to} \ 0 \le \alpha_i \le 100, \ \sum_i \alpha_i y_i = 0$$
Example
 By using a QP solver, we get
 a1=0, a2=2.5, a3=0, a4=7.333, a5=4.833
 Note that the constraints are indeed satisfied (e.g. 2.5 - 7.333 + 4.833 = 0)
 The support vectors are {x2=2, x4=5, x5=6}
 The discriminant function is
$$f(z) = 2.5\,(2z + 1)^2 - 7.333\,(5z + 1)^2 + 4.833\,(6z + 1)^2 + b = 0.6667\,z^2 - 5.333\,z + b$$
 b is recovered by solving f(2)=1, or f(5)=-1, or f(6)=1, since the support vectors lie on the margins
 All three give b=9
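The whole example can be verified numerically (NumPy sketch, evaluating f at the training points):

```python
import numpy as np

x = np.array([1.0, 2.0, 4.0, 5.0, 6.0])
y = np.array([1, 1, -1, -1, 1])
alpha = np.array([0.0, 2.5, 0.0, 7.333, 4.833])
b = 9.0

def f(z):
    # f(z) = sum_i alpha_i y_i K(x_i, z) + b, with K(x, z) = (x z + 1)^2
    return np.sum(alpha * y * (x * z + 1)**2) + b

for z in x:
    print(z, round(f(z), 3))
# f(2) ~ 1, f(5) ~ -1, f(6) ~ 1: the support vectors lie on the margins;
# f(1) > 1 and f(4) < -1: the other points are beyond their margins
```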
Example

[Figure: value of the discriminant function f(z) along the 1-D axis: f(z) > 0 around the class 1 points (1, 2 and 6) and f(z) < 0 around the class 2 points (4 and 5)]

class 1        class 2        class 1
1    2         4    5         6
Kernel Functions
 In practical use of SVM, the user specifies the kernel function; the transformation f(.) is not explicitly stated
 Given a kernel function K(xi, xj), the transformation f(.) is given by its eigenfunctions (a concept in functional analysis)
 Eigenfunctions can be difficult to construct explicitly
 This is why people only specify the kernel function without worrying about the exact transformation
 Another view: the kernel function, being a scalar product, is really a similarity measure between the objects
A kernel is associated to a transformation
 Given a kernel, in principle it should be possible to recover the transformation in the feature space that originates it.

$$K(x, y) = (xy + 1)^2 = x^2 y^2 + 2xy + 1$$

If x and y are numbers, this corresponds to the transformation

$$x \;\to\; \begin{pmatrix} x^2 \\ \sqrt{2}\,x \\ 1 \end{pmatrix}$$

What if x and y are 2-dimensional vectors?
A kernel is associated to a transformation

$$K(x_i, x_j) = \big(1 + \langle x_i, x_j \rangle\big)^2 = \big(1 + x_i^{(1)} x_j^{(1)} + x_i^{(2)} x_j^{(2)}\big)^2$$
$$= 1 + \big(x_i^{(1)}\big)^2 \big(x_j^{(1)}\big)^2 + 2\, x_i^{(1)} x_j^{(1)} x_i^{(2)} x_j^{(2)} + \big(x_i^{(2)}\big)^2 \big(x_j^{(2)}\big)^2 + 2\, x_i^{(1)} x_j^{(1)} + 2\, x_i^{(2)} x_j^{(2)}$$

$$f(x_i) = \Big(1,\; \big(x_i^{(1)}\big)^2,\; \sqrt{2}\, x_i^{(1)} x_i^{(2)},\; \big(x_i^{(2)}\big)^2,\; \sqrt{2}\, x_i^{(1)},\; \sqrt{2}\, x_i^{(2)}\Big)^T$$
XOR
 Simple example (XOR problem), with inputs encoded as ±1

Input vector | Y
[-1, -1]     | -1
[-1, +1]     | +1
[+1, -1]     | +1
[+1, +1]     | -1

$$L(\alpha) = \sum_{i=1}^{N} \alpha_i - \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j\, K(x_i, x_j) \qquad K(x_i, x_j) = \big(1 + x_i^T x_j\big)^2$$

Kernel matrix:

         (-1,-1)  (-1,+1)  (+1,-1)  (+1,+1)
(-1,-1)     9        1        1        1
(-1,+1)     1        9        1        1
(+1,-1)     1        1        9        1
(+1,+1)     1        1        1        9
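A two-line check of the kernel matrix (NumPy sketch):

```python
import numpy as np

X = np.array([[-1, -1], [-1, 1], [1, -1], [1, 1]])
G = (1 + X @ X.T)**2   # K(xi, xj) = (1 + xi.xj)^2 for all pairs at once
print(G)               # 9 on the diagonal, 1 everywhere else
```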
XOR

$$L(\alpha) = \alpha_1 + \alpha_2 + \alpha_3 + \alpha_4 - \frac{1}{2}\big(9\alpha_1^2 - 2\alpha_1\alpha_2 - 2\alpha_1\alpha_3 + 2\alpha_1\alpha_4 + 9\alpha_2^2 + 2\alpha_2\alpha_3 - 2\alpha_2\alpha_4 + 9\alpha_3^2 - 2\alpha_3\alpha_4 + 9\alpha_4^2\big)$$

Setting the partial derivatives to zero:

$$\partial L/\partial \alpha_1 = 1 - 9\alpha_1 + \alpha_2 + \alpha_3 - \alpha_4 = 0$$
$$\partial L/\partial \alpha_2 = 1 + \alpha_1 - 9\alpha_2 - \alpha_3 + \alpha_4 = 0$$
$$\partial L/\partial \alpha_3 = 1 + \alpha_1 - \alpha_2 - 9\alpha_3 + \alpha_4 = 0$$
$$\partial L/\partial \alpha_4 = 1 - \alpha_1 + \alpha_2 + \alpha_3 - 9\alpha_4 = 0$$

$$\alpha_1 = \alpha_2 = \alpha_3 = \alpha_4 = \frac{1}{8} \qquad L = \frac{1}{4}$$

The four input vectors are all support vectors.

$$w = \sum_{i=1}^{N} \alpha_i y_i\, f(x_i) = \big[0,\; 0,\; -1/\sqrt{2},\; 0,\; 0,\; 0\big]^T$$
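The solution can be verified by computing w with the explicit feature map (NumPy sketch):

```python
import numpy as np

def phi(x):
    # Explicit map for (1 + x.y)^2 with 2-D inputs
    x1, x2 = x
    return np.array([1, x1**2, np.sqrt(2)*x1*x2, x2**2,
                     np.sqrt(2)*x1, np.sqrt(2)*x2])

X = np.array([[-1, -1], [-1, 1], [1, -1], [1, 1]])
y = np.array([-1, 1, 1, -1])
alpha = np.full(4, 1/8)

w = sum(a * yi * phi(xi) for a, yi, xi in zip(alpha, y, X))
print(np.round(w, 4))   # [0, 0, -0.7071, 0, 0, 0], i.e. -1/sqrt(2) on x1x2
# The resulting discriminant is f(x) = -x1*x2, which reproduces the Y column
```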
XOR

Input vector | Y
[-1, -1]     | -1
[-1, +1]     | +1
[+1, -1]     | +1
[+1, +1]     | -1

$$f(x_i) = \Big(1,\; \big(x_i^{(1)}\big)^2,\; \sqrt{2}\, x_i^{(1)} x_i^{(2)},\; \big(x_i^{(2)}\big)^2,\; \sqrt{2}\, x_i^{(1)},\; \sqrt{2}\, x_i^{(2)}\Big)^T$$

$$w = \sum_{i=1}^{N} \alpha_i y_i\, f(x_i) = \big[0,\; 0,\; -1/\sqrt{2},\; 0,\; 0,\; 0\big]^T$$

Only the x1x2 coordinate of the feature space gets a non-zero weight: the new feature x1x2 separates the classes.

Input vector | Y  | x1x2
[-1, -1]     | -1 | +1
[-1, +1]     | +1 | -1
[+1, -1]     | +1 | -1
[+1, +1]     | -1 | +1
Examples of Kernel Functions
 Polynomial kernel of degree d
$$K(x, y) = \big(x^T y\big)^d$$
 Polynomial kernel up to degree d
$$K(x, y) = \big(x^T y + 1\big)^d$$
 Radial basis function kernel with width s
$$K(x, y) = \exp\!\big(-\|x - y\|^2 / (2s^2)\big)$$
 Sigmoid with parameters k and θ
$$K(x, y) = \tanh\!\big(k\, x^T y + \theta\big)$$
 It does not satisfy the Mercer condition for all k and θ
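These four kernels as plain functions (a sketch, with s, k and theta named as in the slides):

```python
import numpy as np

def poly_kernel(x, y, d=2):
    return (x @ y)**d                     # degree exactly d

def poly_kernel_up_to(x, y, d=2):
    return (x @ y + 1)**d                 # all monomials up to degree d

def rbf_kernel(x, y, s=1.0):
    return np.exp(-np.sum((x - y)**2) / (2 * s**2))

def sigmoid_kernel(x, y, k=1.0, theta=0.0):
    return np.tanh(k * (x @ y) + theta)   # not a Mercer kernel for all k, theta
```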
Polynomial kernel

[Figure: decision boundary of an SVM with a polynomial kernel, from Bishop C, Pattern Recognition and Machine Learning, Springer]
Source: http://www.ivanciuc.org/
Examples of Kernel Functions
 Radial basis function (or Gaussian) kernel with width s:

$$K(x, y) = \exp\!\Big(\!-\frac{\|x - y\|^2}{2s^2}\Big) = \exp\!\Big(\!-\frac{x \cdot x}{2s^2}\Big)\, \exp\!\Big(\frac{x \cdot y}{s^2}\Big)\, \exp\!\Big(\!-\frac{y \cdot y}{2s^2}\Big)$$
Examples of Kernel Functions
 With 1-dim vectors:

$$K(x, y) = \exp\!\Big(\!-\frac{x^2}{2s^2}\Big)\, \exp\!\Big(\frac{xy}{s^2}\Big)\, \exp\!\Big(\!-\frac{y^2}{2s^2}\Big)$$

It corresponds to the scalar product in the infinite-dimensional feature space

$$f(x) = \exp\!\Big(\!-\frac{x^2}{2s^2}\Big)\Big(1,\; \frac{x}{s},\; \frac{x^2}{\sqrt{2!}\, s^2},\; \frac{x^3}{\sqrt{3!}\, s^3},\; \ldots,\; \frac{x^n}{\sqrt{n!}\, s^n},\; \ldots\Big)^T$$

obtained by expanding $\exp(xy/s^2) = \sum_n (xy)^n / (n!\, s^{2n})$.

For vectors in m dimensions the feature space is more complicated.
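A numerical sanity check of the expansion, truncating the infinite map at a finite number of terms (sketch):

```python
import numpy as np
from math import factorial

def phi_truncated(x, s=1.0, n_terms=25):
    # First n_terms coordinates of the infinite feature map of the 1-D Gaussian kernel
    n = np.arange(n_terms)
    norms = np.sqrt([factorial(int(k)) for k in n])
    return np.exp(-x**2 / (2*s**2)) * x**n / (norms * s**n)

x, y, s = 0.7, -1.3, 1.0
approx = phi_truncated(x, s) @ phi_truncated(y, s)
exact = np.exp(-(x - y)**2 / (2*s**2))
print(approx, exact)   # agree to many decimal places
```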
Without slack variables

[Figure: maximum-margin decision boundary without slack variables, from Bishop C, Pattern Recognition and Machine Learning, Springer]
With slack variables

[Figure: soft-margin decision boundary with slack variables, from Bishop C, Pattern Recognition and Machine Learning, Springer]
Gaussian RBF kernel

[Figure: decision boundary of an SVM with a Gaussian RBF kernel, from Bishop C, Pattern Recognition and Machine Learning, Springer]
Building new kernels
 If k1(x,y) and k2(x,y) are two valid kernels, then the following kernels are valid:
 Linear combination (with c1, c2 >= 0)
$$k(x, y) = c_1 k_1(x, y) + c_2 k_2(x, y)$$
 Exponential
$$k(x, y) = \exp\!\big(k_1(x, y)\big)$$
 Product
$$k(x, y) = k_1(x, y)\, k_2(x, y)$$
 Polynomial transformation (Q: polynomial with non-negative coefficients)
$$k(x, y) = Q\big(k_1(x, y)\big)$$
 Function product (f: any function)
$$k(x, y) = f(x)\, k_1(x, y)\, f(y)$$
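These closure rules translate directly into higher-order functions. As an illustration, the Gaussian kernel can be assembled from the linear one using the exponential and function-product rules (sketch; all names are ours):

```python
import numpy as np

def lin_comb(k1, k2, c1=1.0, c2=1.0):      # valid if c1, c2 >= 0
    return lambda x, y: c1 * k1(x, y) + c2 * k2(x, y)

def exponential(k1):
    return lambda x, y: np.exp(k1(x, y))

def func_product(k1, f):
    return lambda x, y: f(x) * k1(x, y) * f(y)

linear = lambda x, y: x @ y
f = lambda x: np.exp(-(x @ x) / 2.0)

# exp(x.y) * f(x) f(y) = exp(-||x - y||^2 / 2): the Gaussian kernel with s = 1
rbf = func_product(exponential(linear), f)

x, y = np.array([1.0, 0.5]), np.array([-0.2, 0.3])
print(rbf(x, y), np.exp(-np.sum((x - y)**2) / 2))   # identical values
```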
Choosing the Kernel Function
 Probably the trickiest part of using SVM.
 The kernel function is important because it creates the kernel matrix, which summarizes all the data
 Many principles have been proposed (diffusion kernel, Fisher kernel, string kernel, ...)
 There is even research on estimating the kernel matrix from the available information
 In practice, a low-degree polynomial kernel or an RBF kernel with a reasonable width is a good initial try
 Note that SVM with an RBF kernel is closely related to RBF neural networks, with the centers of the radial basis functions automatically chosen by the SVM
Kernels can be defined also for structures other than vectors

Computational biology often deals with structures different from vectors:
 Sequences (DNA, RNA, proteins)
 Trees (phylogenetic relationships)
 Graphs (interaction networks)
 3-D structures (proteins)

 Is it possible to build kernels for these structures?
 Transform the data into a feature space made of n-dimensional real vectors and then compute the scalar product.
 Or write a kernel without writing the feature space explicitly (but... what is a kernel, then?)
Defining kernels without defining the feature transformation

What does a kernel represent?
 A distance (scalar product) in the feature space
 A kernel is a SIMILARITY measure
 Moreover, it has to fulfill a "positivity" condition (Mercer's condition)
Spectral kernel for sequences
 Given a DNA sequence x, we can count the number of bases (4-D feature space):
$$f_1(x) = (n_A, n_C, n_G, n_T)$$
 Or the number of dimers (16-D space):
$$f_2(x) = (n_{AA}, n_{AC}, n_{AG}, n_{AT}, n_{CA}, n_{CC}, n_{CG}, n_{CT}, \ldots)$$
 Or of l-mers ($4^l$-D space)
 The spectral kernel is
$$k_l(x, y) = f_l(x) \cdot f_l(y)$$
In practice l is kept small, since the feature space grows as $4^l$.
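A minimal spectral kernel for DNA strings (sketch; helper names are ours):

```python
from collections import Counter

def lmer_counts(seq, l):
    # Sliding-window counts of all l-mers occurring in the sequence
    return Counter(seq[i:i + l] for i in range(len(seq) - l + 1))

def spectral_kernel(x, y, l=2):
    # k_l(x, y) = <f_l(x), f_l(y)>, as a sparse dot product of count vectors
    cx, cy = lmer_counts(x, l), lmer_counts(y, l)
    return sum(cx[m] * cy[m] for m in cx if m in cy)

print(spectral_kernel("ACGTACGT", "ACGGT", l=2))   # shared dimers: AC, CG, GT
```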
Kernel out of generative models
 Given a generative model associating a probability p(x|θ) to a given input, we can define:
$$K(x, y) = p(x \mid \theta)\, p(y \mid \theta)$$
 Fisher Kernel:
$$g(\theta, x) = \nabla_\theta \ln p(x \mid \theta)$$
$$F = E\big[g(\theta, x)\, g(\theta, x)^T\big] \approx \frac{1}{N} \sum_{i=1}^{N} g(\theta, x_i)\, g(\theta, x_i)^T$$
$$K(x, y) = g(\theta, x)^T\, F^{-1}\, g(\theta, y)$$
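As a toy illustration (our own construction, not from the slides): the Fisher kernel for a 1-D Gaussian with unknown mean, where the score is simply x - mu:

```python
import numpy as np

mu = 0.0
# Model p(x | mu) = N(x; mu, 1); Fisher score g(mu, x) = d/dmu ln p(x|mu) = x - mu
score = lambda x: x - mu

X = np.random.default_rng(2).normal(mu, 1.0, size=500)
F = np.mean([score(x)**2 for x in X])               # empirical Fisher information (~1)
fisher_kernel = lambda x, y: score(x) * score(y) / F
print(fisher_kernel(0.5, 1.2))
```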
Other Aspects of SVM
 How to use SVM for multi-class classification?
 One can change the QP formulation to become multi-class
 More often, multiple binary classifiers are combined
 One can train multiple one-versus-all classifiers, or combine multiple pairwise classifiers "intelligently"
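With scikit-learn, for instance, both strategies are one line each (sketch):

```python
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
ovr = OneVsRestClassifier(SVC(kernel="rbf")).fit(X, y)            # one-versus-all
ovo = SVC(kernel="rbf", decision_function_shape="ovo").fit(X, y)  # pairwise votes
```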
Other Aspects of SVM
 How to interpret the SVM discriminant function value as a probability?
 By performing logistic regression on the SVM output on a set of data (validation set) that is not used for training
 Some SVM software (like LIBSVM) has this feature built-in
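In scikit-learn, which wraps LIBSVM, this is enabled with a flag (the logistic fit on held-out folds happens internally):

```python
from sklearn.svm import SVC

clf = SVC(kernel="rbf", probability=True)   # Platt scaling on the SVM outputs
# after clf.fit(X, y), clf.predict_proba(X_new) returns class probabilities
```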
Software
 A list of SVM implementations can be found at http://www.kernel-machines.org/software.html
 Some implementations (such as LIBSVM) can handle multi-class classification
 SVMlight is among the earliest implementations of SVM
 Several Matlab toolboxes for SVM are also available
Summary: Steps for Classification
 Prepare the pattern matrix
 Select the kernel function to use
 Select the parameters of the kernel function and the value of C
 You can use the values suggested by the SVM software, or you can set apart a validation set to determine the values of the parameters
 Execute the training algorithm and obtain the ai
 Unseen data can be classified using the ai and the support vectors
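These steps map onto a few lines of scikit-learn (a sketch; the data set and parameter grid are illustrative placeholders):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# 1. prepare the pattern matrix
X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# 2.-3. choose the kernel; select its parameter and C on validation folds
grid = GridSearchCV(SVC(kernel="rbf"),
                    {"C": [1, 10, 100], "gamma": ["scale", 0.01, 0.001]},
                    cv=5).fit(X_tr, y_tr)

# 4.-5. the fitted model stores the alpha_i and the support vectors
best = grid.best_estimator_
print(best.n_support_, best.score(X_te, y_te))   # classify unseen data
```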
Strengths and Weaknesses of SVM
 Strengths
 Training is relatively easy
 No local optima, unlike in neural networks
 It scales relatively well to high-dimensional data
 The tradeoff between classifier complexity and error can be controlled explicitly
 Non-traditional data like strings and trees can be used as input to SVM, instead of feature vectors
 Weaknesses
 Need to choose a "good" kernel function
Other Types of Kernel Methods
 A lesson learnt from SVM: a linear algorithm in the feature space is equivalent to a non-linear algorithm in the input space
 Standard linear algorithms can be generalized to their non-linear versions by going to the feature space
 Kernel principal component analysis, kernel independent component analysis, kernel canonical correlation analysis, kernel k-means, and 1-class SVM are some examples
