
KERNEL METHODS

Prof. Navneet Goyal
CS Dept.
BITS-Pilani, Pilani Campus
INDIA

Figure source: http://wwwold.ini.ruhr-uni-bochum.de/thbio/group/neuralnet/index_p.html
Kernel Methods

§ In computer science, kernel methods (KMs) are a class of
algorithms for pattern analysis, whose best known element
is the support vector machine (SVM) (Wikipedia)
§ Transformations
§ Feature Spaces
§ Kernel Functions
§ Kernel Tricks
§ Inner Products
Kernel Methods

Algorithms capable of operating with kernels include:


§ Support vector machine (SVM)
§ Gaussian processes
§ Fisher's linear discriminant analysis (LDA)
§ Principal components analysis (PCA) (Kernel PCA)
§ Canonical correlation analysis
§ Ridge regression
§ Spectral clustering
§ Linear adaptive filters
§ …
Kernel Methods

§ Kernels are non-linear generalizations of inner products
Kernel Methods

§ Any kernel-based method consists of two modules:
ú A mapping into an embedding or feature space
ú A learning algorithm designed to discover linear
patterns in that space
Kernel Methods
§ Why does this approach work?
ú Detecting linear patterns has been the focus of
much research in statistics and machine learning
ú Resulting algorithms are well understood and
efficient
ú Computational shortcut: makes it possible to
represent linear patterns efficiently in high
dimensional space to ensure adequate
representational power
ú The shortcut is nothing but the KERNEL FUNCTION
Kernel Methods
§ Kernel methods allow us to extend algorithms such as
SVMs to define non-linear decision boundaries
§ Other algorithms that only depend on inner products
between data points can be extended similarly
§ Kernel functions that are symmetric and positive
definite allow us to implicitly define inner products in
a high-dimensional space
§ Replacing inner products in input space with positive
definite kernels immediately extends algorithms like
SVM to
ú Linear separation in high dimensional space
ú Or equivalently to a non-linear separation in input space
Types of Kernels
§ Positive definite symmetric kernels (PDS)
§ Negative definite symmetric kernels (NDS)
§ Role of NDS kernels in the construction of PDS kernels!
Kernel Methods
§ Input space, χ
§ High dimensional space, ℍ
§ ℍ can be really large!!
§ Document classification
ú Trigrams
ú Vocabulary of 100,000 words
ú Dimension of the feature space reaches 10¹⁵ (= 100,000³)
ú Generalization ability of large-margin classifiers like
SVM does not depend on dimensions of the feature
space, but on the margin and no. of training examples
Kernel Functions
§ A function K: χ × χ → ℝ is called a kernel over χ
§ For any two points x, x′ ∈ χ,
K(x, x′) = 〈ϕ(x), ϕ(x′)〉
for some mapping ϕ: χ → ℍ to a Hilbert space ℍ
called a feature space
§ K is efficient!
ú Computing K(x, x′) is O(N), where N is the input dimension
ú Computing 〈ϕ(x), ϕ(x′)〉 explicitly is O(dim ℍ), with dim ℍ ≫ N
§ K is flexible!
ú No need to explicitly define or compute ϕ
ú The kernel K can be chosen arbitrarily so long as the existence
of ϕ is guaranteed, i.e. K satisfies Mercer’s condition
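A minimal numerical sketch of this efficiency point, assuming the degree-2 polynomial kernel K(x, x′) = 〈x, x′〉² (whose feature map lists all pairwise products); the data below is arbitrary:

```python
import numpy as np

# Degree-2 polynomial kernel: K(x, x') = <x, x'>**2, computed in O(N).
def poly_kernel(x, xp):
    return np.dot(x, xp) ** 2

# The corresponding explicit feature map: all pairwise products x_i * x_j,
# which lives in a space of dimension N**2.
def phi(x):
    return np.outer(x, x).ravel()

rng = np.random.default_rng(0)
x, xp = rng.normal(size=500), rng.normal(size=500)

# Same value either way, but the kernel never builds the N**2-dimensional vector.
print(np.isclose(poly_kernel(x, xp), np.dot(phi(x), phi(xp))))   # True
```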
Kernel Functions
§ Mercer’s Condition
§ A kernel function K can be expressed as
K(x, x′) = 〈ϕ(x), ϕ(x′)〉
iff, for every function g(x) such that ∫g(x)² dx is
finite,
∬ K(x, x′) g(x) g(x′) dx dx′ ≥ 0

Kernels satisfying Mercer’s condition are called
Positive Definite Kernel Functions!
The transformed space of SVM kernels is called a
Reproducing Kernel Hilbert Space (RKHS)
Kernel Functions
§ Examples
k(x, y) = e^(−‖x − y‖² / c)
k(x, y) = (〈x, y〉 + θ)^d
k(x, y) = tanh(α〈x, y〉 + θ)
k(x, y) = 1 / (‖x − y‖² + c²)
§ Show that the polynomial kernel function satisfies Mercer’s
condition
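A hedged numerical sanity check for this exercise (not a proof): a kernel satisfying Mercer’s condition must produce a positive semi-definite Gram matrix on any finite sample, so the eigenvalues below should all be non-negative. The parameter values and sample are illustrative assumptions.

```python
import numpy as np

def poly_gram(X, theta=1.0, d=3):
    """Gram matrix of the polynomial kernel k(x, y) = (<x, y> + theta)**d."""
    return (X @ X.T + theta) ** d

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))            # 50 random points in R^5

# Positive definite (Mercer) kernels give PSD Gram matrices.
eigvals = np.linalg.eigvalsh(poly_gram(X))
print(eigvals.min() >= -1e-8)           # True, up to round-off
```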
Feature Spaces

FF: :xx®
®FF( x( ),
x), RR ®
dd
®FF

F
L2

example: ( x, y ) ® ( x , y , 2 xy )
2 2

13
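A short sketch (assuming the intended kernel is the squared inner product 〈a, b〉² on ℝ²) verifying that this feature map reproduces it; the test points are arbitrary:

```python
import numpy as np

def phi(p):
    """Feature map (x, y) -> (x**2, y**2, sqrt(2)*x*y)."""
    x, y = p
    return np.array([x ** 2, y ** 2, np.sqrt(2) * x * y])

a = np.array([1.0, 2.0])
b = np.array([3.0, 4.0])

# Inner product in the feature space equals the squared inner product in R^2.
print(np.dot(phi(a), phi(b)))   # 121.0
print(np.dot(a, b) ** 2)        # (1*3 + 2*4)**2 = 121.0
```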
Modularity

Kernel methods consist of two modules:

1) The choice of kernel (this is non-trivial)


2) The algorithm which takes kernels as input

Modularity: Any kernel can be used with any kernel-algorithm.


some kernels:
k(x, y) = e^(−‖x − y‖² / c)
k(x, y) = (〈x, y〉 + θ)^d
k(x, y) = tanh(α〈x, y〉 + θ)
k(x, y) = 1 / (‖x − y‖² + c²)

some kernel algorithms:
- support vector machine
- Fisher discriminant analysis
- kernel regression
- kernel PCA
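A hedged scikit-learn sketch of this modularity (the Gaussian kernel width, toy data, and labels are assumptions): one precomputed kernel matrix is handed unchanged to two different kernel algorithms.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.decomposition import KernelPCA

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 3))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1).astype(int)      # toy labels

# One kernel (Gaussian), computed once as a Gram matrix ...
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
K = np.exp(-sq_dists / 2.0)

# ... plugged into two different kernel algorithms.
svm = SVC(kernel="precomputed").fit(K, y)
kpca = KernelPCA(n_components=2, kernel="precomputed").fit(K)
print(svm.score(K, y), kpca.transform(K).shape)
```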
Kernel Trick
Kernel Trick: Examples

Consider the mapping:

The points in feature space corresponding to x1 and x2:

Find:
Kernel Functions: Example
§ Given x = (x₁, x₂, x₃) and y = (y₁, y₂, y₃), find the
kernel for the function:
f(x) = (x₁x₁, x₁x₂, x₁x₃, x₂x₁, x₂x₂, x₂x₃, x₃x₁, x₃x₂, x₃x₃)
Also find f(x)·f(y) without explicitly computing f(x) and f(y).
Verify your result using x = (1, 2, 3) and y = (4, 5, 6).
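A short verification sketch for this exercise; the closed form used below, K(x, y) = (x·y)², is the expected answer and is stated here as an assumption:

```python
import numpy as np

def f(v):
    """Explicit feature map: all ordered pairwise products v_i * v_j."""
    return np.outer(v, v).ravel()

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 5.0, 6.0])

print(np.dot(f(x), f(y)))    # 1024.0
print(np.dot(x, y) ** 2)     # (4 + 10 + 18)**2 = 32**2 = 1024.0
```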
Kernel Functions: Example
§ Consider a two-dimensional input space X ⊆ ℝ²
together with the feature map:
ϕ: x = (x₁, x₂) ⟼ ϕ(x) = (x₁², x₂², √2 x₁x₂) ∈ F = ℝ³
§ Find the kernel corresponding to ϕ and F
§ Also find the kernel function corresponding to:
ϕ: x = (x₁, x₂) ⟼ ϕ(x) = (x₁², x₂², x₁x₂, x₂x₁) ∈ F = ℝ⁴
§ What conclusions can you draw?
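A small numerical sketch of the expected conclusion (offered as an assumption, not the official solution): both feature maps realize the same kernel K(x, x′) = 〈x, x′〉², so the feature map associated with a kernel is not unique.

```python
import numpy as np

phi3 = lambda x: np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2) * x[0] * x[1]])
phi4 = lambda x: np.array([x[0] ** 2, x[1] ** 2, x[0] * x[1], x[1] * x[0]])

x, xp = np.array([1.0, 3.0]), np.array([2.0, -1.0])

# Two different feature maps, one and the same kernel value.
print(np.dot(phi3(x), phi3(xp)))   # 1.0
print(np.dot(phi4(x), phi4(xp)))   # 1.0
print(np.dot(x, xp) ** 2)          # (1*2 + 3*(-1))**2 = 1.0
```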
Kernel Functions: Example
§ Find a mapping f from ℝ² → ℝ⁶ corresponding to
the kernel K(X, Y) = (2 X·Y + 3)²
§ For a given kernel, is the mapping unique?
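One possible answer, offered as a hedged sketch (expanding (2 X·Y + 3)² term by term; the particular split of the cross term and the scaling below are choices, not the only valid ones):

```python
import numpy as np

def f(v):
    """One R^2 -> R^6 map whose inner product equals (2*X.Y + 3)**2."""
    x1, x2 = v
    return np.array([2 * x1 ** 2,
                     2 * x2 ** 2,
                     2 * np.sqrt(2) * x1 * x2,
                     2 * np.sqrt(3) * x1,
                     2 * np.sqrt(3) * x2,
                     3.0])

X, Y = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(np.dot(f(X), f(Y)))             # 25.0
print((2 * np.dot(X, Y) + 3) ** 2)    # (2*1 + 3)**2 = 25.0
```

As on the previous slide, the cross term could instead be split into two separate coordinates, giving an equally valid map into ℝ⁷; this already answers the second question: the mapping for a given kernel is not unique.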
Kernel Functions
§ You must be wondering about:
ú How to choose the mapping ϕ ?
ú How to avoid the curse of dimensionality (CoD)?
ú Which Kernel to choose?

Let us try to find answers in the context of SVMs!


Kernel Functions
§ How can we extend the linear SVM approach to
obtain a non-linear SVM approach?
§ 2 steps:
ú Transform the original input data to a higher
dimensional space using a non-linear mapping ϕ
ú Search for a linear separating hyperplane in the
new space
Kernel Functions
§ How to choose the mapping ϕ ?
ú Several common non-linear mappings can be used
ú In fact, we do not even need to know what the mapping
is ☺
§ How to avoid CoD?
ú The dot product in feature space can be represented as a
kernel function of the input points:
K(x, x′) = ϕ(x)·ϕ(x′)
ú SVMs do not suffer from CoD
  The complexity of an SVM depends on the number of Support
Vectors (SVs), not on the dimension of the feature space
Kernel Functions
§ Which Kernel to choose?
ú Properties of the kinds of kernels that can be used
to replace the dot product have been studied ☺
ú Admissible kernel functions include:
  Polynomial Kernel
  Gaussian Radial Basis Function (RBF) Kernel
  Sigmoid Kernel
ú Each of these kernels results in a different nonlinear
classifier in the input space (see the sketch below)
ú Connection with ANN?
  Later – after we have discussed ANNs
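A hedged scikit-learn sketch of this point (dataset, parameters, and the training-accuracy check are illustrative assumptions): the same SVC algorithm, with each of the admissible kernels above, yields a different nonlinear classifier.

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Toy data that is not linearly separable in the input space.
X, y = make_circles(n_samples=200, factor=0.4, noise=0.1, random_state=0)

for kernel in ["poly", "rbf", "sigmoid"]:
    clf = SVC(kernel=kernel, gamma="scale").fit(X, y)
    # Each kernel induces its own decision boundary and set of support vectors.
    print(kernel, clf.n_support_, round(clf.score(X, y), 2))
```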
Kernel Functions
§ Which Kernel to choose?
ú There are no golden rules for determining which admissible
kernel will result in the most accurate SVM ☹
ú In practice, the kernel chosen does not generally make a large
difference in the resulting accuracy
ú SVM training always finds a global solution
§ Challenges in SVM
ú Scalability
  Need to improve the speed of training/testing so that SVM
scales to very large data sets (millions of SVs)
ú Determining the best kernel for a given data set
ú More efficient implementations of SVM for multiclass
problems
Kernel Functions

Non-linear SVM             | Input Space (low dimensional)       | Feature Space (high dimensional)
---------------------------|-------------------------------------|---------------------------------
Optimization Problem       | Convex optimization (concept of     | Convex optimization (concept of
                           | Duals and Lagrange Multipliers)     | Duals and Lagrange Multipliers)
Main operation in Training | Xᵢ · Xⱼ                             | Φ(Xᵢ) · Φ(Xⱼ)  (= K(Xᵢ, Xⱼ))
Classifier                 | Non-linear separating hypersurface  | Maximal margin hyperplane
Kernel Functions: Examples
§ Decision boundary: a circle centered at
(0.5, 0.5) with r = 0.2
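A hedged sketch of this example (the sampling scheme and kernel parameters are assumptions): points in the unit square are labeled by whether they fall inside the circle of radius 0.2 centered at (0.5, 0.5), and an RBF-kernel SVM recovers the circular boundary that no linear separator could.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(500, 2))
# Class 1 = inside the circle centered at (0.5, 0.5) with radius 0.2.
y = (((X - 0.5) ** 2).sum(axis=1) <= 0.2 ** 2).astype(int)

clf = SVC(kernel="rbf", gamma=10.0, C=10.0).fit(X, y)
print(round(clf.score(X, y), 3))   # typically close to 1.0 on this data
```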
Goodies and Baddies
Goodies:
• Kernel algorithms are typically constrained convex optimization
problems → solved with either spectral methods or convex
optimization tools.
• Efficient algorithms do exist in most cases.
• The similarity to linear methods facilitates analysis. There are strong
generalization bounds on test error.

Baddies:
• You need to choose the appropriate kernel
• Kernel learning is prone to over-fitting
• All information must go through the kernel bottleneck.

