Content
1. Introduction
2. Kernels
3. Some Kernel Methods
4. Support Vector Machines
5. More Kernels
6. Designing Kernels
ICML 2006 (720 abstracts): kernel methods shown to be an emerging trend

Topic                      | Abstracts
Kernel methods & SVM       | 166
Learning in bioinformatics | 45
ANN                        | 29
ILP                        | 9
CRF                        | 13
10 challenging problems in data mining
(IEEE ICDM’05)
Content
1. Introduction
2. Kernels (a general framework for representing data; kernels must satisfy
certain mathematical conditions that give them properties useful for
understanding kernel methods and designing kernels)
3. Some Kernel Methods
4. Support Vector Machines
5. More Kernels
6. Designing Kernels
The issue of data representation
Kernel representation: idea

By the polynomial kernel (n = 2), the data are mapped by the feature map

$\varphi : (x_1, x_2) \mapsto (x_1^2, \sqrt{2}\, x_1 x_2, x_2^2)$

By the RBF kernel (n = 2), the data are implicitly mapped by a feature map φ such that

$\langle \varphi(x), \varphi(x') \rangle = \exp\left( -\frac{d(x, x')^2}{2\sigma^2} \right)$

The kernel also defines the angle between points in the feature space:

$\cos\theta = \frac{\langle x, x' \rangle}{\lVert x \rVert \, \lVert x' \rVert}$
Prerequisite 1. Dot product and Hilbert space
A Hilbert space H is a dot product space with the additional properties
that it is separable and complete.
Completeness means every Cauchy sequence {hn}n≥1 of elements of H
converges to an element h ∈ H, where a Cauchy sequence is one satisfying
$\sup_{m > n} \lVert h_n - h_m \rVert \to 0 \quad \text{as } n \to \infty$
[Figure: the input space X is mapped into the feature space F by φ: points x_1, …, x_n become φ(x_1), …, φ(x_n), and k(x_i, x_j) = φ(x_i)·φ(x_j).]
Map the data from X into a (high-dimensional) vector space, the feature space F, by
applying the feature map φ on the data points x.
Find a linear (or other easy) pattern in F using a well-known algorithm (that works on
the Gram matrix).
By applying the inverse map, the linear pattern in F can be found to correspond to a
complex pattern in X.
15
Prerequisite 2: Regularization (1/6)
Prerequisite 2: Regularization (2/6)
Input of the classification problem: m pairs of training data (x_i, y_i)
generated from some distribution P(x, y), with x_i ∈ X and y_i ∈ C = {C_1, C_2, …, C_k}.
Task: predict y given x at a new location, i.e., find a function f
(model) that does the task, f: X → C.
Training error (empirical risk): the average of a loss function over
the training data, for example with the 0-1 loss

$R_{\mathrm{emp}}[f] = \frac{1}{m} \sum_{i=1}^{m} c(x_i, y_i, f(x_i)), \qquad c(x_i, y_i, f(x_i)) = \begin{cases} 0, & y_i = f(x_i) \\ 1, & y_i \neq f(x_i) \end{cases}$

Target (risk minimization): find a function f that minimizes
the test error (expected risk)

$R[f] := E[c(x, y, f(x))] = \int_{X \times Y} c(x, y, f(x)) \, dP(x, y)$
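As a minimal sketch of the empirical risk under the 0-1 loss (the labels and predictions below are made-up illustrations, not data from the slides):

```python
import numpy as np

# Hypothetical labels and predictions for m = 5 training points.
y_true = np.array([1, -1, 1, 1, -1])
y_pred = np.array([1, -1, -1, 1, 1])

# 0-1 loss: c(x_i, y_i, f(x_i)) = 0 if y_i == f(x_i), else 1.
loss = (y_true != y_pred).astype(float)

# Empirical risk: average loss over the m training points.
R_emp = loss.mean()
print(R_emp)  # 0.4
```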
Prerequisite 2: Regularization (3/6)
Problem: small R_emp[f] does not always ensure small R[f] (overfitting);
i.e., the probability

$\mathrm{Prob}\{\sup_{f \in F} \, | R_{\mathrm{emp}}[f] - R[f] | < \varepsilon\}$

may be small.
Fact 1: statistical learning theory says the difference is small if F is small.
Fact 2: practical experience says the difference is small if f is smooth.
Prerequisite 2: Regularization (4/6)
Regularization is the restriction of the class F of possible minimizers (f ∈ F)
of the empirical risk functional R_emp[f] such that F becomes a compact set.
Key idea: add a regularization (stabilization) term Ω[f] such that small
Ω[f] corresponds to smooth f (or otherwise simple f), and minimize

$R_{\mathrm{reg}}[f] := R_{\mathrm{emp}}[f] + \lambda \Omega[f]$
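A minimal sketch of this idea for linear regression with Ω[f] = ‖w‖² (i.e., ridge regression); the synthetic data and the value of λ are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))                                  # 50 points, 3 features (synthetic)
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=50)

lam = 0.1  # regularization weight lambda (illustrative)

# Minimizing R_emp[w] + lam * ||w||^2 with squared loss has the
# closed-form solution (X^T X / m + lam * I) w = X^T y / m.
m, p = X.shape
w = np.linalg.solve(X.T @ X / m + lam * np.eye(p), X.T @ y / m)
print(w)  # close to the true coefficients, shrunk toward zero
```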
Fundamental properties of kernels:
Kernels as inner product (1/3)
Suppose X = ℝ^p and x = (x_1, …, x_p)^T. Such vectors can be compared by the
inner product: for any x, x' ∈ ℝ^p,

$k(x, x') = x^T x' = \sum_{i=1}^{p} x_i x'_i$

More generally, for a feature map φ,

$k(x, x') = \varphi(x)^T \varphi(x') \qquad (2)$

Any φ: X → ℝ^p for some p ≥ 0 results in a valid kernel through (2).
Do more general kernels than these exist? At least, it is possible
if ℝ^p is replaced by an infinite-dimensional Hilbert space.
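A quick numerical check of (2) for the homogeneous degree-2 polynomial kernel on ℝ², using the explicit feature map shown earlier (the test points are arbitrary):

```python
import numpy as np

def phi(x):
    # Explicit feature map for the degree-2 polynomial kernel on R^2.
    x1, x2 = x
    return np.array([x1**2, np.sqrt(2) * x1 * x2, x2**2])

x = np.array([1.0, 2.0])
xp = np.array([3.0, -1.0])

k_direct = (x @ xp) ** 2       # k(x, x') = (x^T x')^2
k_feature = phi(x) @ phi(xp)   # <phi(x), phi(x')>
print(k_direct, k_feature)     # both equal 1.0
```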
Fundamental properties of kernels:
Kernels as inner product (3/3)
For more general kernels, one should keep in mind the slight
gap between the notion of dot product and similarity.
Fundamental properties of kernels:
Kernels as measures of function regularity (1/4)
Understanding the functional space H_k and the norm associated with a kernel
often helps in understanding kernel methods and in designing new kernels.
Example: the linear kernel on X = ℝ^p, i.e., k(x, x') = x^T x' = Σ_i x_i x'_i.
H_k is the space of linear functions f: ℝ^p → ℝ with the associated norm:

$H_k = \{f(x) = w^T x : w \in \mathbb{R}^p\} \quad \text{and} \quad \lVert f \rVert_{H_k} = \lVert w \rVert \text{ for } f(x) = w^T x.$
Fundamental properties of kernels:
Kernels as measures of function regularity (2/4)
One can prove that H_k is a Hilbert space; for two functions
$f = \sum_{i=1}^{n} a_i k(x_i, \cdot)$ and $g = \sum_{j=1}^{m} b_j k(x'_j, \cdot)$,
the dot product is

$\langle f, g \rangle_{H_k} = \sum_{i=1}^{n} \sum_{j=1}^{m} a_i b_j \, k(x_i, x'_j)$
Fundamental properties of kernels:
Kernels as measures of function regularity (4/4)
$\min_{f \in H_k} R(f, S) + c \, \lVert f \rVert_{H_k} \qquad (4)$

where R(f, S) is small when f "fits" the data well, and the term
$\lVert f \rVert_{H_k}$ ensures the solution of the above equation is "smooth".
In fact, besides fitting the data well and being smooth, the
solution of the above equation turns out to have special properties
that are useful for computational reasons (representer theorem).
Content
1. Introduction
2. Kernels
3. Some Kernel Methods (we consider a class of algorithms that take
as input the similarity matrix defined by a kernel)
4. Support Vector Machines
5. SVMs in Practice
6. More Kernels
7. Designing Kernels
The kernel trick (1/5)
Two concepts that underlie most kernel methods: the kernel
trick and the representer theorem.
The kernel trick is a simple and general principle based on
the property that kernels can be thought of as inner products:
Any positive definite kernel k(x, y) can be expressed as a
dot product in a high-dimensional space.
The kernel trick is obvious but has huge practical consequences
that were only recently exploited.
The kernel trick (4/5)
Example: Computing distances between objects
The kernel trick (5/5)
Example: Computing distances between objects
Let m be the center of mass of the set S = {x_1, …, x_n} in the feature space. Then

$\mathrm{dist}(x, S) = \left\lVert \varphi(x) - m \right\rVert = \left\lVert \varphi(x) - \frac{1}{n} \sum_{i=1}^{n} \varphi(x_i) \right\rVert$

Using the kernel trick we obtain

$\mathrm{dist}(x, S)^2 = k(x, x) - \frac{2}{n} \sum_{i=1}^{n} k(x, x_i) + \frac{1}{n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} k(x_i, x_j)$

Similarly, for the distance between the centers of mass of two sets S_1 and S_2:

$\mathrm{dist}(S_1, S_2)^2 = \frac{1}{|S_1|^2} \sum_{x, x' \in S_1} k(x, x') + \frac{1}{|S_2|^2} \sum_{x, x' \in S_2} k(x, x') - \frac{2}{|S_1| |S_2|} \sum_{x \in S_1, \, x' \in S_2} k(x, x')$
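A minimal sketch of the distance-to-center-of-mass computation; the RBF kernel and the synthetic set S are illustrative choices, not from the slides:

```python
import numpy as np

def rbf(a, b, sigma=1.0):
    return np.exp(-np.sum((a - b) ** 2) / (2 * sigma**2))

rng = np.random.default_rng(1)
S = rng.normal(size=(10, 2))   # a set S of n = 10 points (synthetic)
x = np.array([0.5, -0.5])
n = len(S)

# dist(x, S)^2 = k(x,x) - (2/n) sum_i k(x, x_i) + (1/n^2) sum_ij k(x_i, x_j)
d2 = (rbf(x, x)
      - 2.0 / n * sum(rbf(x, xi) for xi in S)
      + 1.0 / n**2 * sum(rbf(xi, xj) for xi in S for xj in S))
print(np.sqrt(d2))  # distance computed with kernel evaluations only
```

Note that φ is never computed explicitly: only kernel evaluations are needed, which is exactly the point of the trick.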
The representer theorem
Theorem (Kimeldorf, Wahba, 1971): If the problem

$\min_{f \in H} C(f, \{x_i, y_i\}) + \Omega(\lVert f \rVert_H) \qquad (5)$

satisfies (1) C is pointwise, i.e., C(f, {x_i, y_i}) = C({x_i, y_i, f(x_i)}),
which depends on f only through the values {f(x_i)}, and (2) Ω(·) is monotonically
increasing, then every minimizer of the problem admits a representation of the form

$f(\cdot) = \sum_{i=1}^{m} \alpha_i \, k(x_i, \cdot)$
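For instance, in kernel ridge regression the representer form f(·) = Σ α_i k(x_i, ·) turns problem (5) into a finite linear system. A sketch under illustrative assumptions (RBF kernel, synthetic sine data, λ = 0.1):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.uniform(-3, 3, size=(30, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=30)

def gram(A, B, sigma=1.0):
    # Gram matrix of the RBF kernel between rows of A and rows of B.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma**2))

lam = 0.1
K = gram(X, X)
# With f = sum_i alpha_i k(x_i, .), squared loss plus lam * ||f||^2
# gives the closed form alpha = (K + lam * I)^{-1} y.
alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)

# Predict at new points via f(x) = sum_i alpha_i k(x_i, x).
X_new = np.array([[0.0], [1.5]])
print(gram(X_new, X) @ alpha)
```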
Principal component analysis (PCA)
PCA: given x_1, …, x_m ∈ ℝ^p, find x'_1, …, x'_m ∈ ℝ^d, d ≪ p,
so that x'_1, …, x'_m preserve most information.
PCA finds the principal axes by diagonalizing the covariance matrix C:

$\lambda v = Cv, \qquad C = \frac{1}{m} \sum_{j=1}^{m} x_j x_j^T$

1. Find the eigenvectors and order them by decreasing eigenvalue.
2. Project the points onto the eigenvectors.
3. Use the new coordinates for further processing.

Substituting C gives

$v = \frac{1}{m\lambda} \sum_{j} x_j x_j^T v = \frac{1}{m\lambda} \sum_{j} (x_j \cdot v) \, x_j$

so all solutions v with λ ≠ 0 lie in the span of x_1, …, x_m, i.e.

$v = \sum_{i} \alpha_i x_i$
Kernel principal component analysis (kernel PCA)
The covariance matrix in the feature space,

$C = \frac{1}{m} \sum_{j=1}^{m} \varphi(x_j) \varphi(x_j)^T,$

can be diagonalized with nonnegative eigenvalues satisfying

$\lambda V = CV = \frac{1}{m} \sum_{j} \sum_{i} \alpha_i \, \varphi(x_j) \, \langle \varphi(x_i), \varphi(x_j) \rangle$

where V = Σ_i α_i φ(x_i) as in the linear case.
Kernel principal component analysis (kernel PCA)
This reduces to an eigenproblem on the Gram matrix K:

$K\alpha = \lambda\alpha$

The projection of a point x onto the k-th eigenvector is then computed with kernels only:

$\langle V^k, \varphi(x) \rangle = \sum_{i=1}^{m} \alpha_i^k \, k(x_i, x)$
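A minimal numpy sketch of kernel PCA with an RBF kernel on synthetic data; the centering of the Gram matrix (standard in kernel PCA but not shown on the slide) is included as an assumption:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 2))  # synthetic data

def rbf_gram(X, sigma=1.0):
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma**2))

m = len(X)
K = rbf_gram(X)

# Center the kernel matrix (centers the features implicitly in F).
one = np.ones((m, m)) / m
Kc = K - one @ K - K @ one + one @ K @ one

# Solve K alpha = lambda alpha; eigh returns eigenvalues in ascending order.
eigvals, eigvecs = np.linalg.eigh(Kc)
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]

# Scale alpha so each eigenvector V^k has unit norm in F:
# <V, V> = lambda * <alpha, alpha> = 1.
alphas = eigvecs[:, :2] / np.sqrt(np.maximum(eigvals[:2], 1e-12))

# Projections of the training points onto the first two components.
Z = Kc @ alphas
print(Z.shape)  # (100, 2)
```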
Content
1. Introduction
2. Kernels
3. Some Kernel Methods
4. Support Vector Machines
5. SVMs in Practice
6. More Kernels
7. Designing Kernels
Linear classifiers
f(x, w, b) = sign(w·x + b)
[Figure: points labeled +1 (where w·x + b > 0) and −1 (where w·x + b < 0) in the plane,
separated by the line w·x + b = 0. How would you classify this data?]
The illustrated slides (pp. 42-48) are taken from Prof. Andrew Moore's SVM tutorial.
Linear classifiers
f(x, w, b) = sign(w·x + b)
[Figure: several candidate separating lines.]
Any of these would be fine.. but which is best?
Linear classifiers
f(x, w, b) = sign(w·x + b)
[Figure: a poorly chosen line misclassifies a point into the +1 class.]
Classifier margin
f(x, w, b) = sign(w·x + b)
Define the margin of a linear classifier as the width that the boundary
could be increased by before hitting a datapoint.
1. Maximizing the margin is good according to
intuition and PAC theory
Each point correctly classified with large confidence (y f(x) > 1) has a
null multiplier. The other points are called support vectors. They can be
on the boundary, in which case the multiplier satisfies 0 ≤ α ≤ C, or
on the wrong side of the boundary, in which case α = C.
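A brief sketch with scikit-learn (assuming it is available; the two Gaussian blobs are synthetic) that exposes the support vectors and their dual coefficients y_i α_i, whose magnitudes are bounded by C:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(-1, 1, size=(20, 2)),
               rng.normal(1, 1, size=(20, 2))])
y = np.array([-1] * 20 + [1] * 20)

clf = SVC(kernel="linear", C=1.0).fit(X, y)

# Points with y*f(x) > 1 have alpha = 0 and do not appear here;
# support vectors have 0 < alpha <= C (alpha = C on the wrong side of the margin).
print(clf.support_)    # indices of the support vectors
print(clf.dual_coef_)  # y_i * alpha_i, bounded by C in absolute value
```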
Content
1. Introduction
2. Kernels
3. Some Kernel Methods
4. Support Vector Machines
5. SVMs in Practice
6. More Kernels
7. Designing Kernels
SVMs in practice
Kernels for vectors
The linear kernel: $k_L(x, x') = x^T x'$
The polynomial kernels: $k(x, x') = (x^T x')^d$ and $k(x, x') = (x^T x' + 1)^d$;
with d = 2 they yield the feature spaces spanned by {x_1^2, x_1 x_2, x_2^2}
and {1, x_1, x_2, x_1^2, x_1 x_2, x_2^2}, respectively.
The sigmoid kernel: $k(x, x') = \tanh(\kappa \, x^T x' + \theta)$
Kernels for strings
Lodhi et al. (2002) in the NLP context. Basic idea:
Count the number of subsequences up to length n in a sequence.
Compose a high-dimensional feature vector by these counts.
Kernel = dot product between such feature vectors.
Let Σ be a set of symbols; a string $s = s_1 \cdots s_{|s|} \in \Sigma^{|s|}$, and $X = \bigcup_{i=0}^{\infty} \Sigma^i$.
An index set i of length l is an l-tuple of positions in s:
$i = (i_1, \ldots, i_l), \quad 1 \leq i_1 < \cdots < i_l \leq |s|$
Denote by $s[i] = s_{i_1} \cdots s_{i_l}$ the subsequence of s corresponding to i.
Define the weight of i as $\lambda^{l(i)}$, where $l(i) = i_l - i_1 + 1$ and $0 < \lambda < 1$.
Kernels for strings
For $u \in \Sigma^n$ (n fixed), define $\Phi_u: X \to \mathbb{R}$ by $\Phi_u(s) = \sum_{i: u = s[i]} \lambda^{l(i)}$.
The string kernel is the dot product of these feature vectors:

$k_n(s, t) = \sum_{u \in \Sigma^n} \Phi_u(s) \Phi_u(t) = \sum_{u \in \Sigma^n} \sum_{i: u = s[i]} \sum_{j: u = t[j]} \lambda^{l(i) + l(j)}$
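A naive sketch that implements this definition directly by enumerating index tuples. This is exponential and only usable for tiny strings; Lodhi et al. use a dynamic-programming recursion, which is not shown here:

```python
from itertools import combinations

def weight(i, lam=0.5):
    # lambda^{l(i)} with l(i) = i_l - i_1 + 1
    return lam ** (i[-1] - i[0] + 1)

def phi_u(s, u, lam=0.5):
    # Phi_u(s) = sum over index tuples i with s[i] == u of lambda^{l(i)}
    n = len(u)
    return sum(weight(i, lam)
               for i in combinations(range(len(s)), n)
               if "".join(s[j] for j in i) == u)

def k_n(s, t, n, lam=0.5):
    # k_n(s, t) = sum_u Phi_u(s) * Phi_u(t); summing over the
    # subsequences u occurring in s suffices (others contribute 0).
    us = {"".join(s[j] for j in i) for i in combinations(range(len(s)), n)}
    return sum(phi_u(s, u, lam) * phi_u(t, u, lam) for u in us)

print(k_n("cat", "car", 2))  # only u = "ca" is shared: lambda^4 = 0.0625
```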
The Fisher kernel
Content
1. Introduction
2. Kernels
3. Some Kernel Methods
4. Support Vector Machines
5. SVMs in Practice
6. More Kernels
7. Designing Kernels
How to choose kernels?
Basic ideas:
Intuitively, kernels define clusters in the feature space, and we want
to find interesting clusters, i.e., cluster components that can be
associated with labels.
Convex linear combination of kernels in a given family: find the
best coefficients of the eigen-components of the (complete) kernel
matrix by maximizing the alignment on the labeled training data.
Build a generative model of the data, then use the Fisher kernel.
In general, this mode of kernel design can use both the labeled and the
unlabeled data of the training set, which is very useful for
semi-supervised learning.
Designing kernels:
Operations on kernels
If k_1 and k_2 are two kernels, then k(x, x') := k_1(x, x') k_2(x, x') is also
a kernel. Several other operations are possible.
Some other operations are forbidden: if k is a kernel, then log(k) is
not positive definite in general, and neither is k^β for 0 < β < 1.
However, these two operations can be linked.
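A quick numerical illustration (on synthetic points; the kernel choices are arbitrary) that the elementwise product of two Gram matrices is again positive semidefinite, as the Schur product theorem guarantees:

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(6, 2))

K1 = X @ X.T                                             # linear-kernel Gram matrix
K2 = np.exp(-((X[:, None] - X[None, :]) ** 2).sum(-1))   # RBF Gram matrix

K = K1 * K2                                    # elementwise (Schur) product
print(np.linalg.eigvalsh(K).min() >= -1e-10)   # True: the product is still PSD
```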
Designing kernels:
Translation-invariant kernels and kernels on semi-groups
When X = ℝ^p, the class of translation-invariant kernels is defined
as the class of kernels of the form, for some function ψ: ℝ^p → ℝ,

$\forall x, x' \in X, \quad k(x, x') = \psi(x - x')$
Designing kernels:
Combining kernels
Multiple kernel matrices k_1, k_2, …, k_c for the same set of objects
may be available. We can then design a single kernel k from these basic
kernels. A simple way to achieve this is to take the sum of the kernels:

$k = \sum_{i=1}^{c} k_i$

More generally, one can take a weighted sum with nonnegative coefficients μ_i:

$k = \sum_{i=1}^{c} \mu_i k_i$
Designing kernels:
From similarity scores to kernels
Summary
[Figure: the input space X is mapped into the feature space F by φ: points
x_1, …, x_n become φ(x_1), …, φ(x_n); the inverse map φ^{-1} carries patterns
in F back to X; k(x_i, x_j) = φ(x_i)·φ(x_j).]
Map the data from X into a (high-dimensional) vector space, the feature space F, by
applying the feature map φ on the data points x.
Find a linear (or other easy) pattern in F using a well-known algorithm (that works
on the Gram matrix).
By applying the inverse map, the linear pattern in F can be found to correspond to a
complex pattern in X.