
A Primer on Kernel Methods

„ "A Primer on Kernel Methods", Jean-Philippe Vert, Koji Tsuda, Bernhard Scholkopf,
in Kernel Methods in Computational Biology, MIT Press, 2004.
„ Learning with Kernels, Bernhard Scholkopf and Alexander Smola, MIT Press, 2002.
„ Kernel Methods for Pattern Analysis, John Shawe-Taylor and Nello Cristianini,
Cambridge University Press, 2004.
„ http://www.jaist.ac.jp/~bao/kernels/
Content

1. Introduction
2. Kernels
3. Some Kernel Methods
4. Support Vector Machines
5. More Kernels
6. Designing Kernels

ICML 2006 (720 abstracts): kernel methods shown to be an emerging trend

Kernel methods & SVM                166
Probabilistic, graphical models     146
Unsupervised learning, clustering   128
Statistical models                  121
Language, text & web                 68
Learning in bioinformatics           45
ANN                                  29
ILP                                   9
CRF                                  13
10 challenging problems in data mining
(IEEE ICDM’05)

1. Developing a unifying theory of data mining


2. Scaling up for high dimensional data/high speed streams
3. Mining sequence data and time series data
4. Mining complex knowledge from complex data
5. Data mining in a network setting
6. Distributed data mining and mining multi-agent data
7. Data mining for biological and environmental problems
8. Data-mining-process related problems
9. Security, privacy and data integrity
10. Dealing with non-static, unbalanced and cost-sensitive data
Kernel methods: a bit of history
ƒ Aronszajn (1950) and Parzen (1962) were among the first to
employ positive definite kernels in statistics.
ƒ Aizerman et al. (1964) used positive definite kernels in a way
closer to the kernel trick.
ƒ Boser et al. (1992) constructed SVMs, a generalization of the
optimal hyperplane algorithm, which worked for vectorial data.
ƒ Scholkopf (1997): kernels can work with nonvectorial data by
providing a vectorial representation of the data in the feature space.
ƒ Scholkopf et al. (1998): kernels can be used to build generalizations
of any algorithm that can be carried out in terms of dot products.
ƒ A large number of "kernelizations" of various algorithms have
appeared over the last few years.
Content

1. Introduction
2. Kernels (a general framework to represent data; kernels must satisfy
some mathematical conditions that give them properties useful for
understanding kernel methods and kernel design)
3. Some Kernel Methods
4. Support Vector Machines
5. More Kernels
6. Designing Kernels

The issue of data representation

ƒ Let S = (x1, …, xn) be a set of n objects to be analyzed.
ƒ Suppose that each object xi is an element of a set X, which may
consist of images, molecules, texts, etc.
ƒ Most data analysis methods represent S by:
‰ defining a representation φ(x) ∈ F for each object x ∈ X, where φ(x)
can be a real-valued vector (F = ℜp), a finite-length string, or a more
complex representation;
‰ representing S by the set of representations of its objects

φ(S) = (φ(x1), …, φ(xn))
Kernel representation: idea

ƒ Data are no longer represented individually, but only through
a set of pairwise comparisons.
ƒ Instead of using a mapping φ: X → F to represent each object x ∈ X
by φ(x) ∈ F, a real-valued "comparison function" (called a kernel)
k: X x X → ℜ is used, and the data set S is represented by the n×n
matrix (Gram matrix) of pairwise comparisons ki,j = k(xi, xj).
ƒ A main question is how to find a kernel k such that, in the new space,
problem solving is easier (e.g. linear).
ƒ All kernel methods have two parts:
(1) find such a kernel k, and
(2) process the resulting Gram matrix.
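The sketch below (Python/NumPy, not part of the original slides) shows how such a Gram matrix can be built from any comparison function k; the linear kernel and the toy vectors are illustrative assumptions.

```python
import numpy as np

def gram_matrix(objects, kernel):
    """Represent a data set S only through pairwise comparisons:
    K[i, j] = k(x_i, x_j) for every pair of objects in S."""
    n = len(objects)
    K = np.empty((n, n))
    for i in range(n):
        for j in range(i, n):
            K[i, j] = K[j, i] = kernel(objects[i], objects[j])  # symmetry
    return K

# Example with a linear kernel on vectors; any kernel on any object type works.
S = [np.array([1.0, 0.0]), np.array([0.5, 0.5]), np.array([0.0, 1.0])]
K = gram_matrix(S, kernel=lambda x, y: float(np.dot(x, y)))
print(K)  # 3x3 Gram matrix, regardless of the nature of the objects
```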
Kernel representation: idea

ƒ X is the set of all oligonucleotides; S consists of three oligonucleotides.
ƒ Traditionally, each oligonucleotide is represented by a sequence
of letters.
ƒ In kernel methods, S is represented as a matrix of pairwise similarities
between its elements.
Examples of kernels

ƒ By the polynomial kernel (n = 2): φ : (x1, x2) → (x1², √2 x1x2, x2²)
ƒ By the RBF kernel (n = 2): data are compared through exp(−d(x1, x2)²/2)
[Figure omitted] (Jean-Michel Renders, ICFT'04)


Prerequisite 1. Dot product and Hilbert space
ƒ A vector space (linear space) H over the reals ℜ is a set of "vectors" with two operations,
‰ vector addition: H × H → H, denoted v + w, where v, w ∈ H, and
‰ scalar multiplication: ℜ × H → H, denoted λv, where λ ∈ ℜ and v ∈ H,
such that certain axioms are satisfied (http://www.answers.com/topic/vector-space).
ƒ A dot product (inner, scalar product) on H is a symmetric bilinear form
〈.,.〉: H x H → ℜ, (x, x') |→ 〈x, x'〉 that is strictly positive definite, i.e.,
∀x ∈ H, 〈x, x〉 ≥ 0 with equality only for x = 0.
ƒ Norm: ‖x‖ := 〈x, x〉^(1/2) (‖.‖: H → ℜ+);  p-norm: ‖x‖_p := (Σ_{i=1}^{n} |xi|^p)^(1/p)
ƒ A dot product space (inner product space) H is a vector space endowed
with a dot product. The dot product allows us to introduce geometrical
notions such as angles and lengths of vectors:

cos θ = 〈x, x'〉 / (‖x‖ ‖x'‖)
Prerequisite 1. Dot product and Hilbert space
ƒ A Hilbert space H is a dot product space with the additional properties
that it is separable and complete.
Completeness means every Cauchy sequence {hn}n≥1 of elements of H
converges to an element h ∈ H, where a Cauchy sequence is one satisfying

sup_{m>n} ‖hn − hm‖ → 0 as n → ∞

H is separable if for any ε > 0 there is a countable set of elements h1, h2, … of H
such that for all h ∈ H,

min_i ‖hi − h‖ < ε

ƒ Importance: all elements of a Hilbert space H act as linear functions on H
via the dot product: to a point z corresponds the function fz(x) = 〈z, x〉.
Our target is to learn linear functions represented by a weight vector in
the feature space. Finding the weight vector is therefore equivalent to
identifying an appropriate point of the feature space.
Kernel methods: the scheme
[Figure: points x1, …, xn in the input space X are mapped by φ into the feature space F;
the kernel k(xi, xj) = φ(xi)·φ(xj) yields the n×n Gram matrix K, which a kernel-based
algorithm then processes; an inverse map φ⁻¹ relates patterns found in F back to X.]

ƒ Map the data from X into a (high-dimensional) vector space, the feature space F, by
applying the feature map φ to the data points x.
ƒ Find a linear (or other easy) pattern in F using a well-known algorithm (that works on
the Gram matrix).
ƒ By applying the inverse map, the linear pattern found in F corresponds to a
complex pattern in X.
ƒ This is done implicitly, by only making use of inner products in F (the kernel trick).


Kernel representation: comments
ƒ The representation as a square matrix does not depend on the
nature of the objects to be analyzed.
‰ an algorithm developed to process such a matrix can analyze images
as well as molecules, if valid functions k can be defined.
‰ a complete modularity exists between the design of a function k to
represent data and the design of an algorithm to process the data
representations → of utmost importance in fields where data of different
natures need to be integrated and analyzed in a unified framework.
ƒ The size of the matrix used to represent a dataset of n objects is
always n×n, whatever the nature or the complexity of the objects.
ƒ Comparing objects is in many cases an easier task than finding an
explicit representation of objects that a given algorithm can process.
Valid kernels

ƒ Most kernel methods can only process symmetric positive definite
matrices K: ki,j = kj,i for all i, j, and cᵀKc ≥ 0 for any c ∈ ℜⁿ.
ƒ Definition 1: A function k: X x X → ℜ is called a positive definite
kernel if it is symmetric (that is, k(x, x') = k(x', x) ∀ x, x' ∈ X)
and positive definite, that is, for any n > 0, any choice of n objects
x1, …, xn ∈ X, and any choice of real numbers c1, …, cn ∈ ℜ,

Σ_{i=1}^{n} Σ_{j=1}^{n} ci cj k(xi, xj) ≥ 0     (1)

ƒ A positive definite kernel k is called a valid kernel (or simply a kernel).
ƒ Mercer's theorem: Any positive definite function can be written as
an inner product in some feature space.
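As a small illustration (assumed, not from the slides), condition (1) can be checked numerically on a finite Gram matrix through symmetry and the eigenvalues of the matrix; the helper name and the random data are illustrative.

```python
import numpy as np

def is_valid_gram_matrix(K, tol=1e-10):
    """Check the two conditions of Definition 1 on a finite Gram matrix:
    symmetry (k_ij = k_ji) and positive definiteness (c^T K c >= 0 for all c),
    the latter tested through the eigenvalues of the symmetric matrix."""
    if not np.allclose(K, K.T, atol=tol):
        return False
    eigvals = np.linalg.eigvalsh(K)        # real eigenvalues of a symmetric matrix
    return bool(eigvals.min() >= -tol)     # all (numerically) non-negative

# Example: Gram matrices of the linear kernel are always valid.
X = np.random.randn(5, 3)
print(is_valid_gram_matrix(X @ X.T))   # True
```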
Prerequisite 2: Regularization (1/6)

ƒ Classification is an inverse problem (induction): data → model parameters.
ƒ Inverse problems are typically ill posed, as opposed to the well-posed
problems typically encountered when modeling physical situations, where
the model parameters or material properties are known (a unique solution
exists that depends continuously on the data).
ƒ To solve these problems numerically one must introduce some additional
information about the solution, such as an assumption on the smoothness
or a bound on the norm.
[Figure: model vs. data, deduction vs. induction, illustrated with a measured
intensity curve (intensity in arbitrary units against 2θ).]
Prerequisite 2: Regularization (2/6)
ƒ Input of the classification problem: m pairs of training data (xi, yi)
generated from some distribution P(x, y), xi ∈ X, yi ∈ C = {C1, C2, …, Ck}.
ƒ Task: Predict y given x at a new location, i.e., find a function f
(model) that does the task, f: X → C.
ƒ Training error (empirical risk): average of a loss function on
the training data, for example the 0-1 loss

Remp[f] = (1/m) Σ_{i=1}^{m} c(xi, yi, f(xi)),  with c(xi, yi, f(xi)) = 0 if yi = f(xi), 1 if yi ≠ f(xi)

ƒ Target (risk minimization): find a function f that minimizes
the test error (expected risk)

R[f] := E[c(x, y, f(x))] = ∫_{X×Y} c(x, y, f(x)) dP(x, y)
Prerequisite 2: Regularization (3/6)
ƒ Problem: Small Remp[f] does not always ensure small R[f] (overfitting),
i.e., the probability

Prob{ sup_{f∈F} |Remp[f] − R[f]| < ε }

may be small.
ƒ Fact 1: Statistical learning theory says that the difference is small if F is small.
ƒ Fact 2: Practical work says the difference is small if f is smooth.

[Figure: classifiers of increasing complexity, with training errors Remp = 5/40, 3/40, and 0.]
Prerequisite 2: Regularization (4/6)
ƒ Regularization is the restriction of a class F of possible minimizers (with f∈F)
of the empirical risk functional Remp[f] such that F becomes a compact set.
ƒ Key idea: Add a regularization (stabilization) term Ω[f] such that small
Ω[f] corresponds to smooth f (or otherwise simple f) and minimize
Rreg [ f ] := Remp [ f ] + λΩ[ f ]

‰ Rreg[f]: regularized risk functionals;


‰ Remp[f]: empirical risk;
‰ Ω[f]: regularization term; and
‰ λ: regularization parameter that specifies the trade-off between
minimization of Remp[f] and the smoothness or simplicity enforced by
small Ω[f] (i.e., complexity penalty).
ƒ We need some way to measure whether the set FC = {f | Ω[f] < C} is a "small"
class of functions.
Prerequisite 2: Regularization (5/6)
ƒ Representer theorem: In many cases, the expansion of f in terms
of k(xi, x), where the xi are training data, contains the minimizer of
Remp[f] + λΩ[f].
ƒ Maximizing the margin of classification in feature space corresponds to
using the regularizing term Ω[f] := ½‖w‖², and therefore minimizing

Rreg[f] := Remp[f] + (λ/2)‖w‖²

ƒ Rewriting the above risk functional in terms of the reproducing kernel
Hilbert space (RKHS) H associated with k, we equivalently minimize

Rreg[f] := Remp[f] + (λ/2)‖f‖²_H
Prerequisite 2: Regularization (6/6)

ƒ Reproducibility is one of the main principles of the
scientific method.
ƒ An experimental description produced by a particular researcher or
group of researchers is generally evaluated by other independent
researchers attempting to reproduce it: they repeat the same
experiment themselves and see whether it gives the same results as
reported by the original group.
Fundamental properties of kernels:
Kernels as inner product (1/3)

ƒ Suppose X = ℜᵖ and x = (x1, …, xp)ᵀ. Such vectors can be compared by
their inner product: for any x, x' ∈ ℜᵖ,

kL(x, x') := xᵀx' = Σ_{i=1}^{p} xi xi' = 〈x, x'〉

ƒ This function is a kernel, as it is symmetric and positive
definite; it is usually called the linear kernel.
ƒ Limitation: it is only defined when the data are represented as vectors.
ƒ One systematic way to define kernels for general objects:
1. represent each object x ∈ X as a vector φ(x) ∈ ℜᵖ
2. define a kernel for any x, x' ∈ X as the inner product k(x, x') = φ(x)ᵀφ(x')   (2)
This defines a kernel without requiring X to be a vector space.
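A minimal sketch (assumed, not from the slides) of equation (2) for the homogeneous degree-2 polynomial kernel on ℜ²: the explicit feature map φ and the kernel evaluation give the same value.

```python
import numpy as np

# Explicit degree-2 polynomial feature map on R^2 (the example shown earlier):
# phi(x) = (x1^2, sqrt(2)*x1*x2, x2^2), so that <phi(x), phi(x')> = (x^T x')^2.
def phi(x):
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

def k_poly2(x, xp):
    return float(np.dot(x, xp)) ** 2   # kernel computed without building phi

x, xp = np.array([1.0, 2.0]), np.array([3.0, 0.5])
print(np.dot(phi(x), phi(xp)))  # explicit dot product in feature space
print(k_poly2(x, xp))           # same value, via the kernel (eq. (2))
```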


Fundamental properties of kernels:
Kernels as inner product (2/3)

ƒ Any φ: X → ℜᵖ for some p ≥ 0 results in a valid kernel through (2).
ƒ Do there exist more general kernels than these? At least it
is possible if ℜᵖ is replaced by an infinite-dimensional Hilbert space.
ƒ Theorem 2: For any kernel k on a space X, there exists a Hilbert
space F and a mapping φ: X → F such that

k(x, x') = 〈φ(x), φ(x')〉, for any x, x' ∈ X,

where 〈u, v〉 represents the dot product in the Hilbert space
between any two points u, v ∈ F.
Fundamental properties of kernels:
Kernels as inner product (3/3)

Any kernel on a space X can be represented as an inner product after
the space X is mapped to a Hilbert space F, called the feature space.

ƒ Kernels can all be thought of as dot products in some space F, called
the feature space. Using a kernel boils down to representing each
object x ∈ X as a vector φ(x) ∈ F and computing dot products.
ƒ We do not need to compute φ(x) explicitly for each point in S, but
only the pairwise dot products. F is usually infinite-dimensional, and
φ(x) is tricky to represent, though the kernel is simple to compute.
ƒ Most kernel methods possess such an interpretation when the
points x ∈ X are viewed as points φ(x) in the feature space.
Fundamental properties of kernels:
Kernels as measures of similarity (1/2)

ƒ Kernels are often presented as measures of similarity, in the
sense that k(x, x') is "large" when x and x' are "similar".
ƒ This motivates the design of kernels for particular types of data
or applications, because prior knowledge might suggest
a relevant measure of similarity in a given context.
ƒ The dot product does not always fit one's intuition of similarity. There
are cases where the two notions coincide, e.g. the Gaussian RBF kernel.
ƒ For a general kernel k, with Hilbert distance d(u, v)² = 〈u − v, u − v〉
and Hilbert norm ‖u‖² = 〈u, u〉, the following holds for any x, x' ∈ X:

k(x, x') = ( ‖φ(x)‖² + ‖φ(x')‖² − d(φ(x), φ(x'))² ) / 2

i.e., k(x, x') measures the similarity between x and x' as the opposite of the
squared distance d(φ(x), φ(x'))² between their images, up to their squared norms.
Fundamental properties of kernels:
Kernels as measures of similarity (2/2)

ƒ For more general kernels, one should keep in mind the slight
gap between the notion of dot product and similarity.

ƒ It is generally relevant to think of a kernel as a measure of


similarity, in particular when it is constant on the diagonal.

ƒ This intuition is useful in designing kernels and in


understanding kernel methods.

Fundamental properties of kernels:
Kernels as measures of function regularity (1/4)

ƒ Let k be a kernel on a space X. We show that k is associated with a set
Hk = {f: X → ℜ} of real-valued functions on X, and that Hk is endowed with
the structure of a Hilbert space (a dot product and a norm).
ƒ Understanding the functional space Hk and the norm associated with a kernel
often helps in understanding kernel methods and in designing new kernels.
ƒ Example: linear kernel on X = ℜᵖ, i.e., k(x, x') = xᵀx' = Σi xi xi'. Then Hk is the
space of linear functions f: ℜᵖ → ℜ, with the associated norm:

Hk = {f(x) = wᵀx : w ∈ ℜᵖ}  and  ‖f‖_Hk = ‖w‖ for f(x) = wᵀx.

The norm ‖f‖_Hk decreases as the "smoothness" of f increases, where the
definition of smoothness depends on the kernel (for the linear kernel it relates
to the function's slope: smooth ≡ flat). The notion of smoothness is dual to that
of similarity: a function is "smooth" when it varies slowly between "similar" points.
Fundamental properties of kernels:
Kernels as measures of function regularity (2/4)

ƒ The question is how to systematically construct Hk from k. We define Hk as the set
of functions f: X → ℜ of the form, for some n > 0, a finite number of points
x1, …, xn ∈ X and weights α1, …, αn ∈ ℜ,

f(x) = Σ_{i=1}^{n} αi k(xi, x),  with norm  ‖f‖²_Hk := Σ_{i=1}^{n} Σ_{j=1}^{n} αi αj k(xi, xj)   (*)

ƒ One can prove that Hk is a Hilbert space, with dot product, for two functions
f(x) = Σ_{i=1}^{n} αi k(xi, x) and g(x) = Σ_{j=1}^{m} βj k(xj, x),

〈f, g〉 = Σ_{i=1}^{n} Σ_{j=1}^{m} αi βj k(xi, xj) = Σ_{i=1}^{n} αi g(xi) = Σ_{j=1}^{m} βj f(xj)

ƒ Interesting property: the value f(x) of f ∈ Hk at a point x ∈ X can be
expressed as a dot product in Hk, i.e., if we take g = k(x, .),

〈f, k(x, .)〉 = Σ_{i=1}^{n} αi k(xi, x) = f(x)
Fundamental properties of kernels:
Kernels as measures of function regularity (3/4)

ƒ Taking f(.) = k(x’,.), we derive the reproducing property valid


for any x, x’∈ X:

k(x, x’) = 〈k(x,.), k(x’,.)〉 (3)

ƒ Hk is usually called the reproducing kernel Hilbert space (RKHS)


associated with k. Hk is one possible feature space associated with
k when considering φ: X → Hk defined by φ(x) := k(x, .)
ƒ A general Hilbert space usually contains many non-smooth
functions. RKHS contains smooth functions, and is “smaller” than a
general Hilbert space.

Fundamental properties of kernels:
Kernels as measures of function regularity (4/4)

ƒ We should keep in mind the connection between kernels and norms
on functional spaces. Most kernel methods have an interpretation
in terms of functional analysis.
ƒ Many kernel methods, including SVMs, can be defined as
algorithms that, given a set S of objects, return a function that solves

min_{f ∈ Hk} R(f, S) + c ‖f‖_Hk    (4)

where R(f, S) is small when f "fits" the data well, and the term
‖f‖_Hk ensures that the solution of the above equation is "smooth".
Besides fitting the data well and being smooth, the solution of (4)
turns out to have special properties that are useful for computational
reasons (representer theorem).
Content

1. Introduction
2. Kernels
3. Some Kernel Methods (consider a class of algorithms that take
as input the similarity matrix defined by a kernel)
4. Support Vector Machines
5. SVMs in Practice
6. More Kernels
7. Designing Kernels

The kernel trick (1/5)
ƒ Two concepts that underlie most kernel methods: the kernel
trick and the representer theorem.
ƒ The kernel trick is a simple and general principle based on
the property that kernels can be thought of as inner product:
Any positive definite kernel k(x, y) can be expressed as a
dot product in a high-dimensional space.
ƒ The kernel trick is obvious but has huge practical consequences
that were only recently exploited.

ƒ Proposition 3. Any algorithm for vectorial data that can be


expressed only in terms of dot products between vectors can
be performed implicitly in the feature space associated with
any kernel, by replacing each dot product by a kernel
evaluation.
The kernel trick (2/5)
ƒ Kernelization: transforming linear methods into nonlinear methods
‰ The kernel trick can first be used to transform linear methods (e.g.,
linear discriminant analysis or PCA) into non-linear methods by
simply replacing the classic dot product by a more general
kernel, such as the Gaussian RBF kernel.
‰ Nonlinearity is obtained at no computational cost, as the
algorithm remains exactly the same.
ƒ Non-vectorial data:
‰ The combination of the kernel trick with kernels defined on non-
vectorial data permits the application of many classic algorithms for
vectors to virtually any type of data, as long as a kernel can be defined.
‰ Example: performing PCA on a set of sequences or graphs, thanks to
the availability of kernels for sequences and for graphs.
The kernel trick (3/5)
Example: Computing distances between objects

ƒ Recall that the kernel can be expressed as k(x, x') = 〈φ(x), φ(x')〉
in a dot product space F for some mapping φ: X → F.
ƒ Given x1, x2 ∈ X, mapped to φ(x1), φ(x2) ∈ F, we define the
distance d(x1, x2) as the Hilbert distance between their images:

d(x1, x2) := ‖φ(x1) − φ(x2)‖

Given a space X endowed with a kernel, a distance can be defined between
points of X mapped to the feature space F associated with the kernel. This
distance can be computed without knowing the mapping φ, thanks to the
kernel trick.
The kernel trick (4/5)
Example: Computing distances between objects

ƒ The following equality shows that the distance can be expressed in
terms of dot products in F:

‖φ(x1) − φ(x2)‖² = 〈φ(x1), φ(x1)〉 + 〈φ(x2), φ(x2)〉 − 2〈φ(x1), φ(x2)〉

From here the distance can be computed in terms of the kernel:

d(x1, x2) = √( k(x1, x1) + k(x2, x2) − 2 k(x1, x2) )

ƒ The example shows the effect of the kernel trick: it is possible
to perform operations implicitly in the feature space (a small sketch follows below).
ƒ A more general question: Let S = (x1, …, xn) be a fixed finite set
of objects, and x ∈ X a generic object. Is it possible to assess
how “close” the object x is to the set of objects S?
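A small sketch (assumed, not from the slides) of the pairwise distance formula above, computed from three kernel evaluations only; the RBF kernel and σ = 1 are illustrative choices. The point-to-set case is treated on the next slide.

```python
import numpy as np

def feature_space_distance(x1, x2, kernel):
    """Distance between phi(x1) and phi(x2) computed only from kernel values,
    without ever constructing the feature map phi (the kernel trick)."""
    return np.sqrt(kernel(x1, x1) + kernel(x2, x2) - 2.0 * kernel(x1, x2))

# Example with a Gaussian RBF kernel (sigma is an illustrative choice).
rbf = lambda x, y, sigma=1.0: np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))
print(feature_space_distance(np.array([0.0, 0.0]), np.array([1.0, 1.0]), rbf))
```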

The kernel trick (5/5)
Example: Computing distances between objects

The distance between the white point and the set of three black points in X
endowed with a kernel is defined as the distance between the image of the
white point and the centroid m of the images of the black points. The
centroid m might have no preimage in X. This distance can nevertheless be
computed implicitly with the kernel trick.

dist(x, S) = ‖φ(x) − m‖ = ‖φ(x) − (1/n) Σ_{i=1}^{n} φ(xi)‖

Using the kernel trick we obtain

dist(x, S)² = k(x, x) − (2/n) Σ_{i=1}^{n} k(x, xi) + (1/n²) Σ_{i=1}^{n} Σ_{j=1}^{n} k(xi, xj)

and similarly, between two sets S1 and S2 (distance between their centroids),

dist(S1, S2)² = (1/|S1|²) Σ_{x,x'∈S1} k(x, x') + (1/|S2|²) Σ_{x,x'∈S2} k(x, x') − (2/(|S1||S2|)) Σ_{x∈S1, x'∈S2} k(x, x')
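A sketch (assumed, not from the slides) of the point-to-set distance above, computed purely from kernel evaluations; the kernel and toy data are illustrative.

```python
import numpy as np

def dist_point_to_set(x, S, kernel):
    """Distance in feature space between phi(x) and the centroid of {phi(x_i)},
    computed from kernel evaluations only (the centroid may have no preimage)."""
    n = len(S)
    k_xx = kernel(x, x)
    k_xS = sum(kernel(x, xi) for xi in S)
    k_SS = sum(kernel(xi, xj) for xi in S for xj in S)
    return np.sqrt(k_xx - 2.0 * k_xS / n + k_SS / n**2)

# RBF kernel with sigma = 1 (illustrative choice) and three "black" points.
rbf = lambda x, y: np.exp(-np.sum((np.asarray(x) - np.asarray(y)) ** 2) / 2.0)
S = [np.array([0.0, 0.0]), np.array([1.0, 0.0]), np.array([0.0, 1.0])]
print(dist_point_to_set(np.array([2.0, 2.0]), S, rbf))
```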
The representer theorem
ƒ Theorem (Kimeldorf, Wahba, 1971): If the problem

min_{f ∈ H} C(f, {xi, yi}) + Ω(‖f‖_H)    (5)

satisfies (1) C is pointwise, i.e., C(f, {xi, yi}) = C({xi, yi, f(xi)}), which
depends only on {f(xi)}, and (2) Ω(.) is monotonically increasing, then
every minimizer of the problem admits a representation of the form

f(.) = Σ_{i=1}^{m} αi k(xi, .)    (6)

ƒ The quantity ‖.‖_H included in the function to optimize is a penalization
that forces the solution to be smooth, and usually a powerful protection
against over-fitting of the data.
ƒ Meaning: Instead of finding the optimal solution in an infinite-dimensional space,
(5) can be reformulated as an m-dimensional optimization problem
by plugging (6) into (5) and optimizing over (α1, …, αm) ∈ ℜᵐ.
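As one concrete illustration (not from the slides), kernel ridge regression plugs (6) into (5) with a squared loss and Ω = λ‖f‖²_H, which reduces to a linear system in α; the λ convention and the toy data below are illustrative assumptions.

```python
import numpy as np

def kernel_ridge(K, y, lam):
    """Minimize sum_i (y_i - f(x_i))^2 + lam * ||f||_H^2 over f in H_k.
    By the representer theorem f(.) = sum_i alpha_i k(x_i, .), and plugging the
    expansion into the objective gives alpha = (K + lam I)^{-1} y."""
    m = K.shape[0]
    return np.linalg.solve(K + lam * np.eye(m), y)

def predict(alpha, k_train_test):
    """f(x) = sum_i alpha_i k(x_i, x); k_train_test[i, j] = k(x_i, x_test_j)."""
    return k_train_test.T @ alpha

# Toy usage with a linear kernel on random data (illustrative only).
X = np.random.randn(20, 3)
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * np.random.randn(20)
alpha = kernel_ridge(X @ X.T, y, lam=0.1)
print(predict(alpha, X @ X[:5].T))   # predictions for the first 5 training points
```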
Kernel principal component analysis (kernel PCA)

ƒ PCA: Given x1, …, xm ∈ ℜᵖ, find x'1, …, x'm ∈ ℜᵈ, d << p,
so that x'1, …, x'm preserve most of the information.
ƒ PCA finds the principal axes by diagonalizing the covariance matrix

C = (1/m) Σ_{j=1}^{m} xj xjᵀ,   solving the eigenvalue problem λv = Cv,

which can be rewritten as

v = (1/(mλ)) Σ_{j=1}^{m} (xj · v) xj

ƒ All solutions v with λ ≠ 0 lie in the span of x1, …, xm, i.e.

v = Σ_i αi xi

Procedure: 1. find the eigenvectors and order them by decreasing eigenvalue;
2. project the points onto the eigenvectors; 3. use the new coordinates.
Kernel principal component analysis (kernel PCA)

ƒ To do PCA in the feature space, the covariance matrix

C = (1/m) Σ_{j=1}^{m} φ(xj) φ(xj)ᵀ

can be diagonalized, with nonnegative eigenvalues satisfying λV = CV.
ƒ As V lies in the span of the φ(xi), we have V = Σ_i αi φ(xi), and

λ Σ_i αi φ(xi) = (1/m) Σ_j φ(xj) φ(xj)ᵀ Σ_i αi φ(xi) = (1/m) Σ_j Σ_i αi 〈φ(xi), φ(xj)〉 φ(xj)
Kernel principal component analysis (kernel PCA)

ƒ Applying the kernel trick, with K(xi, xj) = 〈φ(xi), φ(xj)〉, we have

mλ Σ_i αi φ(xi) = Σ_j Σ_i αi K(xi, xj) φ(xj)

ƒ and we can finally write the expression as the eigenvalue problem

mλ α = K α

ƒ For a test pattern x, we extract a nonlinear component via

〈Vᵏ, φ(x)〉 = Σ_{i=1}^{m} αiᵏ K(xi, x)
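A compact kernel PCA sketch (assumed, not from the slides); it adds the feature-space centering step that the derivation above leaves implicit, and normalizes each α so that the corresponding eigenvector Vᵏ has unit norm.

```python
import numpy as np

def center_gram(K):
    """Center the Gram matrix, i.e. work with phi(x) - mean(phi) in feature space."""
    m = K.shape[0]
    one = np.ones((m, m)) / m
    return K - one @ K - K @ one + one @ K @ one

def kernel_pca(K, n_components):
    """Solve the eigenvalue problem on the centered Gram matrix and normalize each
    alpha^k so the feature-space eigenvector V^k has unit norm
    (||V^k||^2 = lambda_k * ||alpha^k||^2)."""
    Kc = center_gram(K)
    eigvals, eigvecs = np.linalg.eigh(Kc)              # ascending eigenvalues
    idx = np.argsort(eigvals)[::-1][:n_components]     # keep the largest ones
    return eigvecs[:, idx] / np.sqrt(np.maximum(eigvals[idx], 1e-12))

# Toy usage with an RBF Gram matrix (sigma = 1 is an illustrative choice).
X = np.random.randn(30, 2)
K = np.exp(-((X[:, None, :] - X[None, :, :]) ** 2).sum(-1) / 2.0)
alphas = kernel_pca(K, n_components=2)
Z = center_gram(K) @ alphas     # nonlinear components sum_i alpha_i^k K(x_i, x)
print(Z.shape)                  # (30, 2)
```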
Content

1. Introduction
2. Kernels
3. Some Kernel Methods
4. Support Vector Machines
5. SVMs in Practice
6. More Kernels
7. Designing Kernels

Linear classifiers

f(x, w, b) = sign(w·x + b)

[Figure: points labeled +1 and -1 in the plane, with the regions w·x + b > 0,
w·x + b = 0, and w·x + b < 0.] How would you classify this data?
The illustrated slides (p.42-48) are taken from Prof. Andrew Moore’s SVM tutorial
Linear classifiers
f(x,w,b) = sign(w x + b)
denotes +1
denotes -1

How would you


classify this data?

Linear classifiers
f(x,w,b) = sign(w x + b)
denotes +1
denotes -1

How would you


classify this data?

Linear classifiers
f(x,w,b) = sign(w x + b)
denotes +1
denotes -1

Any of these
would be fine..

..but which is
best?

Linear classifiers
f(x,w,b) = sign(w x + b)
denotes +1
denotes -1

How would you


classify this data?

Misclassified to +1 class
Classifier margin

f(x,w,b) = sign(w x + b)
denotes +1
denotes -1
Define the margin of a linear classifier as the width by which the
boundary could be increased before hitting a datapoint.
Maximum margin

f(x, w, b) = sign(w·x + b)

The maximum margin linear classifier is the linear classifier with the
maximum margin. This is the simplest kind of SVM (called an LSVM).
Support vectors are those datapoints that the margin pushes up against.

1. Maximizing the margin is good according to intuition and PAC theory.
2. It implies that only support vectors are important; other training
examples are ignorable.
3. Empirically it works very, very well.
Summary

Soft margin (primal) problem:

min_{w,b,ξ1,…,ξn}  ½‖w‖² + C Σ_{i=1}^{n} ξi
subject to  ξi ≥ 0  and  ξi − 1 + yi(wᵀxi + b) ≥ 0  for all i

Equivalent dual problem:

max_α  W(α) = −½ Σ_{i=1}^{n} Σ_{j=1}^{n} yi yj αi αj xiᵀxj + Σ_{i=1}^{n} αi
subject to  Σ_{i=1}^{n} yi αi = 0  and  0 ≤ αi ≤ C for i = 1, …, n

Each point correctly classified with large confidence (y f(x) > 1) has a
null multiplier. The other points are called support vectors. They can be
on the boundary, in which case the multiplier satisfies 0 ≤ αi ≤ C, or
on the wrong side of the boundary, in which case αi = C.
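A hedged usage sketch (assuming scikit-learn is available): SVC solves this soft-margin dual internally (with a kernel in place of xᵀx') and exposes the support vectors and their multipliers; C, γ and the toy data are illustrative.

```python
import numpy as np
from sklearn.svm import SVC

# Toy two-class data (illustrative only).
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(20, 2) - 1, rng.randn(20, 2) + 1])
y = np.array([-1] * 20 + [+1] * 20)

# Soft-margin SVM with an RBF kernel; C and gamma are illustrative choices.
clf = SVC(C=1.0, kernel="rbf", gamma=0.5).fit(X, y)

print(len(clf.support_))        # number of support vectors (points with alpha_i > 0)
print(clf.dual_coef_)           # y_i * alpha_i for the support vectors
print(clf.predict([[0.0, 0.0]]))
```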
Content

1. Introduction
2. Kernels
3. Some Kernel Methods
4. Support Vector Machines
5. SVMs in Practice
6. More Kernels
7. Designing Kernels

SVMs in practice

ƒ Many free and commercial software packages: SVMlight, LIBSVM, mySVM, etc.
ƒ Multiclass problems
‰ SVMs implemented directly for the multiclass problem (just for a few classes)
‰ Combination of binary SVMs, e.g. by the one-against-all scheme
ƒ Kernel normalization
‰ Vectors: linearly scale each attribute to zero mean and unit variance;
then the Gaussian RBF kernel is usually effective.
‰ Strings and graphs: k'ij = kij / (kii kjj)^(1/2)
ƒ Parameter setting
‰ The regularization parameter C of the SVM and the kernel parameters (e.g. γ):
use cross-validation to choose C and γ.
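Two small sketches of these recipes (assumed, not from the slides): the normalization k'ij = kij/(kii kjj)^(1/2) for a precomputed Gram matrix, and cross-validation over C and γ with scikit-learn's GridSearchCV; the grids and toy data are illustrative.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

def normalize_gram(K):
    """Kernel normalization for strings/graphs: k'_ij = k_ij / sqrt(k_ii * k_jj)."""
    d = np.sqrt(np.diag(K))
    return K / np.outer(d, d)

# Choosing C and gamma by cross-validation (toy data; the grids are illustrative).
rng = np.random.RandomState(0)
X = rng.randn(40, 5)
y = np.where(rng.randn(40) > 0, 1, -1)
grid = GridSearchCV(SVC(kernel="rbf"),
                    {"C": [0.1, 1, 10, 100], "gamma": [0.01, 0.1, 1]},
                    cv=5)
grid.fit(X, y)
print(grid.best_params_)
```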
Content

1. Introduction
2. Kernels
3. Some Kernel Methods
4. Support Vector Machines
5. SVMs in Practice
6. More Kernels
7. Designing Kernels

Kernels for vectors
ƒ The linear kernel: kL(x, x') = xᵀx'
ƒ kL is a particular case of the polynomial kernels, defined for d ≥ 0 as

kPoly1(x, x') = (xᵀx')ᵈ  or  kPoly2(x, x') = (xᵀx' + 1)ᵈ

With d = 2 they yield the feature spaces spanned by {x1², x1x2, x2²}
and {1, x1, x2, x1², x1x2, x2²}, respectively.
ƒ The Gaussian RBF kernel:

kG(x, x') = exp( −‖x − x'‖² / (2σ²) )

ƒ The sigmoid kernel: k(x, x') = tanh(κ xᵀx' + θ)
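The kernels above as plain Python functions (a sketch, not from the slides); parameter defaults are illustrative.

```python
import numpy as np

def k_linear(x, xp):
    return float(np.dot(x, xp))

def k_poly(x, xp, d=2, c=1.0):
    # c = 0 gives the homogeneous polynomial kernel (x^T x')^d,
    # c = 1 the inhomogeneous one (x^T x' + 1)^d.
    return (np.dot(x, xp) + c) ** d

def k_rbf(x, xp, sigma=1.0):
    return np.exp(-np.sum((x - xp) ** 2) / (2 * sigma ** 2))

def k_sigmoid(x, xp, kappa=1.0, theta=0.0):
    # Note: the sigmoid "kernel" is not positive definite for all parameter values.
    return np.tanh(kappa * np.dot(x, xp) + theta)

x, xp = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(k_linear(x, xp), k_poly(x, xp), k_rbf(x, xp), k_sigmoid(x, xp))
```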
Kernels for strings
ƒ Lodhi et al. (2002), in the NLP context. Basic idea:
‰ Count the number of subsequences up to length n in a sequence.
‰ Compose a high-dimensional feature vector from these counts.
‰ Kernel = dot product between such feature vectors.
ƒ Σ = set of symbols; a string s = s1 … s|s| ∈ Σ^|s|; X = ∪_{i=0,…,∞} Σ^i.
ƒ An index set i of length l is an l-tuple of positions in s:
i = (i1, …, il), 1 ≤ i1 < … < il ≤ |s|
ƒ Denote by s[i] = s_{i1} … s_{il} the subsequence of s corresponding to i.
ƒ Define the weight of i as λ^{l(i)}, where l(i) = il − i1 + 1, λ < 1.
Kernels for strings
ƒ For u ∈ Σⁿ, with n fixed, we define Φu: X → ℜ by

∀s ∈ X, Φu(s) = Σ_{i: s[i]=u} λ^{l(i)}

ƒ Considering all sequences u of length n, we map each s ∈ X to
a |Σ|ⁿ-dimensional feature space by s → (Φu(s))_{u∈Σⁿ}.
ƒ We can then define a kernel for strings as the dot product
between these representations:

∀s, t ∈ X,  kn(s, t) = Σ_{u∈Σⁿ} ( Σ_{i: u=s[i]} λ^{l(i)} ) ( Σ_{j: u=t[j]} λ^{l(j)} )
                     = Σ_{u∈Σⁿ} Σ_{i: u=s[i]} Σ_{j: u=t[j]} λ^{l(i)+l(j)}
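A deliberately naive sketch (assumed, not from the slides) that evaluates kn(s, t) directly from its definition by enumerating index tuples; it is exponential in n and the string lengths, whereas Lodhi et al. (2002) give an efficient dynamic-programming recursion.

```python
from itertools import combinations

def subsequence_kernel(s, t, n, lam=0.5):
    """Naive evaluation of k_n(s, t): enumerate all index tuples i in s and j in t
    of length n and add lam^(l(i) + l(j)) whenever the subsequences coincide.
    Only suitable for very short strings."""
    total = 0.0
    for i in combinations(range(len(s)), n):
        for j in combinations(range(len(t)), n):
            if all(s[a] == t[b] for a, b in zip(i, j)):
                l_i = i[-1] - i[0] + 1   # span length of the index tuple in s
                l_j = j[-1] - j[0] + 1   # span length of the index tuple in t
                total += lam ** (l_i + l_j)
    return total

print(subsequence_kernel("cat", "cart", n=2, lam=0.5))
```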
The Fisher kernel

ƒ Probabilistic models are convenient to represent families of


complex objects that arise in computational biology.

ƒ Hidden Markov models (HMMs) are a central tool for modeling


protein families or finding genes from DNA sequences (Durbin et al.,
1998). More complicated models called stochastic context-free
grammars (SCFGs) are useful for modeling RNA sequences
(Sakakibara et al., 1994).

ƒ The Fisher kernel (Jaakkola and Haussler, 1999) provides a general


principle to design a kernel for objects well modeled by a
probabilistic distribution, or more precisely a parametric
statistical model.

Content

1. Introduction
2. Kernels
3. Some Kernel Methods
4. Support Vector Machines
5. SVMs in Practice
6. More Kernels
7. Designing Kernels

How to choose kernels?

ƒ There are no absolute rules for choosing the right kernel, adapted to
a particular problem.
ƒ Kernel design can start from the desired feature space,
from combinations of existing kernels, or from the data.
ƒ Some considerations are important:
‰ Use the kernel to introduce a priori (domain) knowledge.
‰ Be sure to keep some information structure in the feature space.
‰ Experimentally, there is some "robustness" in the choice, if the
chosen kernels provide an acceptable trade-off between
– a simpler and more efficient structure (e.g. linear
separability), which requires some "explosion", and
– preserving the information structure, which requires that
the "explosion" is not too strong.
Kernels built from data

ƒ In general, this mode of kernel design can use both labeled and
unlabeled data of the training set! Very useful for semi-
supervised learning

ƒ Basic ideas:
‰ Intuitively, kernels define clusters in the feature space, and we want
to find interesting clusters, i.e. cluster components that can be
associated with labels.
‰ Convex linear combination of kernels in a given family: find the
best coefficient of eigen-components of the (complete) kernel
matrix by maximizing the alignment on the labeled training data.
‰ Build a generative model of the data, then use the Fisher kernel.

Designing kernels:
Operations on kernels

ƒ The class of kernel functions on X has several useful closure


properties. It is a convex cone, which means that if k1 and k2 are
two kernels, then any linear combination, λ1 k1+ λ2k2, with λ1, λ2 ≥ 0
is a kernel.

ƒ If k1 and k2 are two kernels, then k(x, x’):= k1(x, x’) k2(x, x’) is also
a kernel. Several other operations are possible.
ƒ Some other operations are forbidden: if k is a kernel, then log(k) is
not positive definite in general, and neither is k^β for 0 < β < 1.
However, these two operations can be linked.

Designing kernels:
Translation-invariant kernels and kernels on semi-groups

ƒ When X = ℜᵖ, the class of translation-invariant kernels is defined
as the class of kernels of the form, for some function ψ: ℜᵖ → ℜ,

∀x, x' ∈ X, k(x, x') = ψ(x − x')

ƒ More generally, if (X, ·) is a group, then group kernels are
defined, with ψ: X → ℜ, as functions of the form

k(x, x') = ψ(x⁻¹ x')
Designing kernels:
Combining kernels

ƒ Suppose multiple kernels k1, k2, …, kc for the same set of objects
are available.
ƒ We may design a single kernel k from several basic kernels k1, k2, …, kc.
A simple way to achieve this is to take the sum of the kernels:

k = Σ_{i=1}^{c} ki

ƒ More generally, a weighted combination with nonnegative weights μi can be used:

k = Σ_{i=1}^{c} μi ki
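A small sketch (assumed, not from the slides) of combining precomputed Gram matrices; with nonnegative weights the result is again a valid kernel matrix, by the closure properties above.

```python
import numpy as np

def combine_kernels(kernel_matrices, weights=None):
    """Weighted sum of Gram matrices computed on the same set of objects.
    With nonnegative weights the result is again a valid kernel matrix
    (the class of kernels is a convex cone)."""
    if weights is None:
        weights = np.ones(len(kernel_matrices))
    assert np.all(np.asarray(weights) >= 0), "weights must be nonnegative"
    return sum(w * K for w, K in zip(weights, kernel_matrices))

# Example: combine a linear and an RBF Gram matrix on the same data.
X = np.random.randn(10, 4)
K_lin = X @ X.T
K_rbf = np.exp(-((X[:, None, :] - X[None, :, :]) ** 2).sum(-1) / 2.0)
K = combine_kernels([K_lin, K_rbf], weights=[0.3, 0.7])
```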
Designing kernels:
From similarity scores to kernels

ƒ Let X be a set and s: X x X → ℜ a measure of similarity.
ƒ Empirical kernel map (Tsuda, 1999): (1) choose a finite set of
objects t1, …, tr ∈ X called templates; (2) represent each x ∈ X
by its vector of similarities to the templates,

x ∈ X → φ(x) = (s(x, t1), …, s(x, tr))ᵀ ∈ ℜʳ

ƒ The kernel is defined as the dot product between two
similarity vectors:

∀x, x' ∈ X, k(x, x') = φ(x)ᵀφ(x') = Σ_{i=1}^{r} s(x, ti) s(x', ti)
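A sketch of the empirical kernel map (assumed, not from the slides); the similarity function, templates, and toy strings are illustrative.

```python
import numpy as np

def empirical_kernel_map(similarity, templates):
    """Represent any object by its vector of similarities to a fixed set of
    template objects; the kernel is the dot product between these vectors."""
    def phi(x):
        return np.array([similarity(x, t) for t in templates])
    def kernel(x, xp):
        return float(phi(x) @ phi(xp))
    return phi, kernel

# Toy usage: strings compared by the (illustrative) number of shared characters.
sim = lambda a, b: len(set(a) & set(b))
phi, k = empirical_kernel_map(sim, templates=["gattaca", "ccgt", "aaat"])
print(phi("gcat"), k("gcat", "tacc"))
```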

Summary
[Figure: points x1, …, xn in the input space X are mapped by φ into the feature space F;
the kernel k(xi, xj) = φ(xi)·φ(xj) yields the n×n Gram matrix K, which a kernel-based
algorithm then processes; an inverse map φ⁻¹ relates patterns found in F back to X.]

ƒ Map the data from X into a (high-dimensional) vector space, the feature space F, by
applying the feature map φ to the data points x.
ƒ Find a linear (or other easy) pattern in F using a well-known algorithm (that works
on the Gram matrix).
ƒ By applying the inverse map, the linear pattern found in F corresponds to a
complex pattern in X.
ƒ This is done implicitly, by only making use of inner products in F (the kernel trick).


Summary
[Figure: points x1, …, xn in the input space X are mapped by φ into the feature space F;
the kernel k(xi, xj) = φ(xi)·φ(xj) yields the n×n Gram matrix K, which a kernel-based
algorithm then processes; an inverse map φ⁻¹ relates patterns found in F back to X.]

ƒ Foundations: linear algebra, probability/statistics, functional analysis, optimization.
ƒ Mercer's theorem: Any positive definite function can be written as an inner
product in some feature space.
ƒ Kernel trick: Use the kernel matrix instead of inner products in the feature space.
ƒ Representer theorem: Every minimizer of min_{f ∈ H} C(f, {xi, yi}) + Ω(‖f‖_H)
admits a representation of the form f(.) = Σ_{i=1}^{m} αi K(., xi)