
Knowledge Discovery WS 14/15

Kernels / SVM 6  
Prof. Dr. Rudi Studer, Dr. Achim Rettinger*, Dipl.-Inform. Lei Zhang
{rudi.studer, achim.rettinger, l.zhang}@kit.edu

INSTITUT FÜR ANGEWANDTE INFORMATIK UND FORMALE BESCHREIBUNGSVERFAHREN (AIFB)

KIT – University of the State of Baden-Württemberg and National Laboratory of the Helmholtz Association    www.kit.edu
Knowledge Discovery Lecture WS14/15
22.10.2014  Introduction (Basics, Overview)
29.10.2014  Design of KD-experiments
05.11.2014  Linear Classifiers
12.11.2014  Data Warehousing & OLAP
19.11.2014  Non-Linear Classifiers (ANNs)
26.11.2014  Kernels, SVM (Supervised Techniques, Vector+Label Representation)
03.12.2014  cancelled
10.12.2014  Decision Trees
17.12.2014  IBL & Clustering (Unsupervised Techniques)
07.01.2015  Relational Learning I
14.01.2015  Relational Learning II (Semi-supervised Techniques, Relational Representation)
21.01.2015  Relational Learning III
28.01.2015  Text Mining
04.02.2015  Guest Lecture (Meta-Topics)
11.02.2015  CRISP, Visualization

2 Institut AIFB
Chapter 3.3

Kernel Methods &


Support Vector Machines
Recommended Reading:
Cristianini & Shawe-Taylor, "Support Vector and Kernel Methods", Ch. 5 in Berthold & Hand (eds.), Intelligent Data Analysis, 2003
3 Institut AIFB
Recap: Linear Classification (Perceptron)

  Artificial Neural Networks (ANNs) are learning systems inspired by the structure of neural information processing
  Single-layer perceptron as the simplest network topology
  Model class is a linear discriminant function
  Decision boundary is an (n-1)-dimensional hyperplane (e.g. a straight line in 2D)
  Training = search for "good" parameters w and b

4 Institut AIFB
Recap: Linear Classification (Perceptron)

5 Institut AIFB
Recap: Basis expansion

  Linear model:  f(x_i, w) = w_0 + \sum_{j=1}^{M-1} w_j x_{i,j}

  Idea: Augment inputs with additional variables: φ(x)

  Basis-expanded linear model:  f(x_i, w) = w_0 + \sum_{j=1}^{M-1} w_j φ_j(x_{i,j})
6 Institut AIFB
Recap: Popular Basis Expansions

  φ_j(x) = x_1, x_2, …, x_{M-1}
  φ_j(x) = x_1^2, x_1 x_2
  φ_j(x) = log(x_1), \sqrt{x_1}
  Piecewise polynomials, splines, wavelet bases
  φ_j(x) = exp(−v ||x − x_m||^2)
  φ(x) = sig(\sum_{j=0}^{M-1} v_j x_j),   where sig(a) = 1 / (1 + exp(−a))

7 Institut AIFB
Three Evolutions in Machine Learning

  Until 1960: first simple statistical techniques


  1960s: efficient algorithms for detecting linear patterns, e.g.
Perceptron training [Rosenblatt, 1957],
  efficient training algorithms (+)
  good generalization performance (+)
  insufficient for nonlinear patterns (-)
  1980s: "nonlinear revolution": backpropagation multilayer
neural networks (gradient descent) and decision trees,
  can deal with nonlinear patterns (+)
  local minima, overfitting (-)
  modeling effort, long training times (-)
  1990s: SVMs and other Kernel Methods,
  considerable success in doing all at once
  principled statistical theory behind the practical techniques

8 Institut AIFB
Chapter 3.3.a

Kernel Functions

9 Institut AIFB
Nonlinearity through Feature Maps

  Often, we can make a nonlinear dataset linearly separable after some suitable preprocessing, e.g. feature creation (cf. basis expansions in the ANN lecture)
  Conceptually, this is the use of a feature mapping function φ
  Example: data inside and outside a 2D sphere (circle); see the sketch below
  NB: Basis functions / hidden layers in ANNs have the same effect
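A minimal sketch of this idea on hypothetical circle data (the feature map and the data are illustrative assumptions, not taken from the slide):

```python
import numpy as np

# Toy data: points inside vs. outside the unit circle (hypothetical example).
rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(200, 2))
y = np.where(X[:, 0] ** 2 + X[:, 1] ** 2 < 1.0, 1, -1)

def feature_map(X):
    """phi(x) = (x1^2, x2^2): inside/outside the circle becomes the half-plane
    x1^2 + x2^2 < 1, i.e. the classes are linearly separable in the mapped space."""
    return X ** 2

Z = feature_map(X)  # any linear classifier on Z can now separate the two classes
```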

10 Institut AIFB
Reminder: “Perceptron Training” Algorithm
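A minimal sketch of the standard primal perceptron update rule (assuming labels y_i in {-1, +1} and learning rate 1; a sketch, not necessarily the exact pseudo-code of the original slide):

```python
import numpy as np

def perceptron_train(X, y, epochs=100):
    """Primal perceptron: on every misclassified example, move the
    hyperplane (w, b) towards the correct side."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        errors = 0
        for i in range(n):
            if y[i] * (X[i] @ w + b) <= 0:    # misclassified (or on the boundary)
                w += y[i] * X[i]
                b += y[i]
                errors += 1
        if errors == 0:                        # converged on separable data
            break
    return w, b
```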

11 Institut AIFB


Dual representation of parameters

  Observation: the weight vector w can always be rewritten as a linear combination of the training data points (summing over the misclassified examples):
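In symbols (one standard way to write this, assuming labels y_i in {-1, +1} and learning rate 1):

    w = \sum_{i=1}^{n} α_i y_i φ(x_i)

where α_i counts how often x_i triggered an update (i.e. was misclassified) during training.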

  The coefficient α_i indicates whether (and how often) x_i has been misclassified during training


  In fact, the optimal weight vector w is a linear combination of the training instances ("Representer Theorem", Kimeldorf & Wahba (1971), simplified)

  Result: Indirect evaluation of inner product with weight


vector is possible without explicitly representing it

12 Institut AIFB
Dual version of “Perceptron Training” algorithm

(In the rewritten dual algorithm, φ(x) is referenced only through pairwise inner products, and only for the known data items.)

Observations:
§  information on pairwise inner products is sufficient for training and classification
§  feature mappings enter the computation only through inner products (see the sketch below)
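A minimal sketch of such a dual ("kernelized") perceptron, assuming labels y_i in {-1, +1}; the kernel is passed in as a parameter, so this is one standard formulation rather than the slide's exact pseudo-code:

```python
import numpy as np

def kernel_perceptron_train(X, y, kernel, epochs=100):
    """Dual perceptron: learn one coefficient alpha_i per training point.
    The data enters only through pairwise kernel evaluations k(x_i, x_j)."""
    n = len(X)
    K = np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])
    alpha, b = np.zeros(n), 0.0
    for _ in range(epochs):
        errors = 0
        for i in range(n):
            f_i = np.sum(alpha * y * K[:, i]) + b
            if y[i] * f_i <= 0:              # misclassified: count one more update
                alpha[i] += 1.0
                b += y[i]
                errors += 1
        if errors == 0:
            break
    return alpha, b

def kernel_perceptron_predict(x, X, y, alpha, b, kernel):
    """Classification also needs only kernel values between x and the training data."""
    return np.sign(sum(alpha[i] * y[i] * kernel(X[i], x) for i in range(len(X))) + b)
```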

13 Institut AIFB
Kernel Functions

  A kernel function is thus an efficient shortcut: it computes dot products of the mapped data without any explicit mapping into the new feature space.
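In symbols: a kernel function k computes, for some feature map φ,

    k(x, z) = ⟨φ(x), φ(z)⟩

so evaluating k on the original inputs equals a dot product in the mapped feature space.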

14 Institut AIFB
Example: Polynomial Kernel of Degree 2

15 Institut AIFB
Example: Polynomial Kernel of Degree 2

"Kernel Trick"

Feature Mapping
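A worked example for 2D inputs x = (x_1, x_2) and z = (z_1, z_2), using one standard variant of the degree-2 polynomial kernel (not necessarily the exact variant on the slide):

    k(x, z) = ⟨x, z⟩² = (x_1 z_1 + x_2 z_2)²
            = x_1² z_1² + 2 x_1 x_2 z_1 z_2 + x_2² z_2²
            = ⟨φ(x), φ(z)⟩    with φ(x) = (x_1², √2 · x_1 x_2, x_2²)

One kernel evaluation in the original 2D space thus replaces an explicit mapping into the 3-dimensional feature space.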

16 Institut AIFB
Mathematical Properties of Kernels

  There is a (slightly more advanced) mathematical theory behind


kernels ("Reproducing Kernel Hilbert Spaces" / Mercer's Theorem)
which we won't cover here

  Core insights:
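For reference, the standard characterization (not necessarily the slide's exact wording): a symmetric function k : X × X → R is a valid kernel, i.e. k(x, z) = ⟨φ(x), φ(z)⟩ for some feature map φ, if and only if it is positive semi-definite:

    \sum_{i=1}^{n} \sum_{j=1}^{n} c_i c_j k(x_i, x_j) ≥ 0    for all n, all x_1, …, x_n ∈ X and all c_1, …, c_n ∈ R.

Equivalently, every Gram matrix K with K_{ij} = k(x_i, x_j) is positive semi-definite.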

17 Institut AIFB
Mathematical Properties of Kernels (cont)

  Kernels are closed under certain operations, so we can combine


known kernel functions and yield valid kernels again

  Most important closure properties:
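Standard closure properties (a reference list rather than necessarily the slide's exact one): if k_1 and k_2 are valid kernels, a > 0, f is any real-valued function, and p is a polynomial with non-negative coefficients, then the following are valid kernels as well:

    k(x, z) = k_1(x, z) + k_2(x, z)
    k(x, z) = a · k_1(x, z)
    k(x, z) = k_1(x, z) · k_2(x, z)
    k(x, z) = f(x) · k_1(x, z) · f(z)
    k(x, z) = p(k_1(x, z))
    k(x, z) = exp(k_1(x, z))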

18 Institut AIFB
Kernel Function Design

  How to come up with an appropriate kernel function?

  Case I: Derive it directly from explicit feature mappings

  Case II: Design a similarity function directly on the input data and
check whether it conforms to a valid kernel function
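For Case II, a quick empirical sanity check (a sketch; it can refute validity on a given sample but never prove it in general):

```python
import numpy as np

def gram_matrix(xs, k):
    """Gram matrix K_ij = k(x_i, x_j) for a candidate similarity function k."""
    n = len(xs)
    return np.array([[k(xs[i], xs[j]) for j in range(n)] for i in range(n)])

def looks_like_a_kernel(xs, k, tol=1e-8):
    """Necessary condition only: symmetric Gram matrix with eigenvalues >= 0
    on this particular sample. A failure disproves validity; success does not prove it."""
    K = gram_matrix(xs, k)
    if not np.allclose(K, K.T):
        return False
    return bool(np.all(np.linalg.eigvalsh(K) >= -tol))
```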

19 Institut AIFB
Example: Gaussian Kernel

  The Gaussian kernel (also known as the Radial Basis Function (RBF) kernel) is one of the "classical" kernels (see the formula below)

  Interpretation of the Gaussian kernel:
  The result of a kernel evaluation depends on the distance between the examples
  Large distances lead to more orthogonality
  The bandwidth parameter sigma controls the rate of decay
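The usual definition (one common parameterization):

    k(x, z) = exp( −‖x − z‖² / (2σ²) )

so k(x, x) = 1, and k(x, z) decays towards 0 as the distance ‖x − z‖ grows, at a rate controlled by the bandwidth σ.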

20 Institut AIFB
Example: Gaussian Kernel

  Feature space of the Gaussian kernel:
  Bell-shaped Gaussians centered at the data points
  Theoretically, an infinite number of dimensions

Decision Boundaries for a 2D Learning Problem with the Gaussian Kernel (figure)

  NB: ||x − z|| could itself be computed in a kernel-induced feature space; the Gaussian kernel then becomes a kernel modifier
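This works because squared distances can themselves be written in terms of inner products, and hence of any kernel κ:

    ‖φ(x) − φ(z)‖² = ⟨φ(x), φ(x)⟩ − 2⟨φ(x), φ(z)⟩ + ⟨φ(z), φ(z)⟩ = κ(x, x) − 2κ(x, z) + κ(z, z)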
21 Institut AIFB
Kernels on Non-Vectorial Data

  Kernels need not be restricted to operate on vectors but


can be defined on other structures as well.
  We can define similarity functions on such structures and
make sure that they are positive semi-definite

  Examples:
  String Kernels
[e.g. Lodhi et al., "Text Classification Using String Kernels", 2002]

  Tree Kernels
[e.g. Collins & Duffy, "Convolution Kernels for Natural Language", 2001]

  Graph Kernels
[e.g. Gärtner et al., "On Graph Kernels: Hardness Results and Efficient Alternatives", 2003]

  Many more…
[Gärtner, "A Survey of Kernels for Structured Data", 2003]

22 Institut AIFB
Summary: Kernel Methods (remember this !)
Linear Model: Classification = side of the hyperplane; Training = estimation of "good" parameters w and b.

Dual Representation: During learning and classification, it is sufficient to access only dot products of data points.

Representer Theorem: The optimal weight vector w is a linear combination of the training instances.

Kernel Function: A similarity function with an interpretation as a dot product (after mapping to a vector representation).
24 Institut AIFB
Summary: Kernel Methods (remember this !)

  "Kernel Trick":
  Rewrite learning algorithm such that any reference to the input data
happens from within inner products
  Replace any such inner product by the kernel function of your
choice.
  Work with the (linear) algorithm as usual
  Non-linearity enters learning through kernels while training
algorithm may remain linear.

  NB: surprisingly many well-known algorithms can be


rewritten using the kernel approach.
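As a usage sketch, reusing the hypothetical kernel_perceptron_train from the dual-perceptron slide earlier: switching the non-linearity just means switching the kernel function, while the training loop itself stays unchanged.

```python
import numpy as np

# Hypothetical kernel choices (the names are illustrative, not from the slides).
linear_kernel   = lambda x, z: float(np.dot(x, z))
poly2_kernel    = lambda x, z: (1.0 + float(np.dot(x, z))) ** 2
gaussian_kernel = lambda x, z, sigma=1.0: float(np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2)))

# Same algorithm, three different hypothesis spaces:
# alpha, b = kernel_perceptron_train(X, y, kernel=linear_kernel)
# alpha, b = kernel_perceptron_train(X, y, kernel=poly2_kernel)
# alpha, b = kernel_perceptron_train(X, y, kernel=gaussian_kernel)
```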

25 Institut AIFB
Chapter 3.3.b

Support Vector Machines (SVMs)

26 Institut AIFB
Support Vector Machines

  Let's imagine this: if we use a fancy kernel and the


perceptron training algorithm, we might always be able to
separate our data perfectly.

However, we are in danger of overfitting!

  Support Vector Machines use a simple and elegant way to


prevent overfitting: margin maximization.
  very simple intuition
  elaborated statistical theory
  connections and analogies to various other learning methods

27 Institut AIFB
Margin Maximization: Intuition

28 Institut AIFB
Margin Maximization

  The (functional) margin of an example with respect to a hyperplane is


the quantity:
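For a labelled example (x_i, y_i) with y_i ∈ {−1, +1} and hyperplane parameters (w, b), the usual definition is

    γ_i = y_i · (⟨w, x_i⟩ + b)

which is positive exactly when the example is classified correctly.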

  Analogously, the geometric margin is the Euclidean distance of the example from the hyperplane (i.e. w, b scaled by 1/||w||).

  The margin of a hyperplane with respect to a training set is the minimum geometric margin over all items in the set.
  The margin of a training set is the maximum geometric margin over all hyperplanes.

29 Institut AIFB
Support Vector Machines

The SVM principle: of all the hyperplanes that separate the


data perfectly, pick the one which maximizes the margin.

30 Institut AIFB
SVM Optimization Problem

  Idea: Fix the norm of the weight vector and maximize the functional
margin.
  Equivalently, if we fix the functional margin to 1, the geometric margin
equals 1/||w||
  Thus: maximizing (geometric) margin = minimizing the norm of the
weight vector while keeping functional margin above a fixed value

  Results in SVM optimization problem:
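In its standard (hard-margin) form:

    minimize over w, b:    ½ ‖w‖²
    subject to:            y_i (⟨w, x_i⟩ + b) ≥ 1    for all i = 1, …, n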

31 Institut AIFB
SVM Optimization Problem (dual)

  Reformulation (Lagrangian dual):
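The standard dual form is

    maximize over α:    \sum_i α_i − ½ \sum_i \sum_j α_i α_j y_i y_j ⟨x_i, x_j⟩
    subject to:         α_i ≥ 0 for all i,    \sum_i α_i y_i = 0

where ⟨x_i, x_j⟩ can be replaced by any kernel k(x_i, x_j).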

  Observation: dual formulation allows us to use kernel functions


  Convex quadratic programming problem (global optimal solution,
solvable in polynomial time)

  NB: parameter b doesn't show up here, but can be determined


from optimal α.
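For example, using any support vector x_j (i.e. α_j > 0), the standard recovery is

    b = y_j − \sum_i α_i y_i k(x_i, x_j)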
32 Institut AIFB
Why "Support Vectors" ?

α_i is positive exactly for those x_i that lie on the margin = "support vectors"

Typically, the solution is very


sparse, i.e. only few support
vectors!
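The resulting classifier therefore only ever touches the support vectors:

    f(x) = sign( \sum_{i: α_i > 0} α_i y_i k(x_i, x) + b )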

33 Institut AIFB
Dealing with Noise: Soft Margins

  Still, noise may make data not linearly separable


  SVM approach: relax margin criterion and let some patterns violate the
margin constraint

  Soft margin version of SVM optimization problem:
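In its standard form, with slack variables ξ_i:

    minimize over w, b, ξ:    ½ ‖w‖² + C \sum_i ξ_i
    subject to:               y_i (⟨w, x_i⟩ + b) ≥ 1 − ξ_i,    ξ_i ≥ 0    for all i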

  Trade-off parameter C controls influence of errors


  NB: for linearly separable case, C may be infinitely high

34 Institut AIFB
Dealing with Noise: Soft Margins

Slack variables ξi are


proportional to distance of
misclassified example xi from
its corresponding margin

35 Institut AIFB
Dealing with Noise: Soft Margins (dual)

  Reformulation to dual optimization problem:
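In its standard form, the soft-margin dual is

    maximize over α:    \sum_i α_i − ½ \sum_i \sum_j α_i α_j y_i y_j k(x_i, x_j)
    subject to:         0 ≤ α_i ≤ C for all i,    \sum_i α_i y_i = 0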

  Solution is surprisingly simple:


  again, it is possible to use kernels
  difference to hard-margin version:
"box constraint" for αi.

36 Institut AIFB
Summary: SVMs (remember this !)

  Generalization through margin maximization


  intuitive motivation
  unique optimal solution
  Maximal margin = minimal norm
  Dual formulation
  makes it possible to use kernel functions
  makes problem become a quadratic programming problem for
which fast algorithms exist
  Hard margin vs. Soft margin

  NB: Many well-known implementations available, e.g.


SVMlight <http://svmlight.joachims.org/>
libSVM <http://www.csie.ntu.edu.tw/~cjlin/libsvm/>
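A minimal usage sketch with scikit-learn, whose SVC class wraps libSVM; the data and the values of C and gamma are placeholders:

```python
import numpy as np
from sklearn.svm import SVC

# Placeholder data: X is an (n_samples, n_features) array, y holds the class labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = np.where(X[:, 0] ** 2 + X[:, 1] ** 2 < 1.0, 1, -1)

# Soft-margin SVM with a Gaussian (RBF) kernel; C and gamma would normally be tuned.
clf = SVC(kernel="rbf", C=1.0, gamma=0.5)
clf.fit(X, y)

print(clf.support_.shape[0], "support vectors")   # typically only a few points
print(clf.predict([[0.0, 0.0], [2.0, 2.0]]))
```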
37 Institut AIFB
Chapter 3.3 – Summary & Example

Graph Analytics with


Kernels, SVMs & friends

38 Institut AIFB
Big picture: Kernels and SVMs

1.  Think about whether you could work with linear techniques if you mapped your data into a richer (higher-dimensional) space.
2.  Implement the learning algorithm in such a way that reference to the mapped data is made only within pairwise dot products.  (Kernel Functions)
3.  Pairwise dot products can be efficiently computed by means of a corresponding kernel function on the original data items.  (Kernel Functions)
4.  Make the SVM training maximize the margin of the training data in the implicit feature space to maximize generalization performance.  (SVMs)

5.  Do tuning and evaluation.

39 Institut AIFB
There is more behind Kernels and SVMs

  In principle, we have a framework for:


  detecting nonlinear patterns in data,
  efficiently (little computational cost),
  with statistical stability (little overfitting)

  The overall family of algorithms is modular & flexible:

    Data → Kernel Function ((a)…, (b)…) → Learning Algorithm (Classification: SVM, (Ranking), (Regression), (Clustering))

40 Institut AIFB
Example: Graph kernel

Given any data in graph format (RDF)…

(Figure: example RDF graph with the entities person100, person200 and topic110, the literals "Machine Learning", "Jane Doe" and "female", and the properties skos:prefLabel, foaf:topic_interest, foaf:knows, foaf:name and foaf:gender.)

…solve any standard statistical relational learning task, like…

41 Institut AIFB
The Learning Tasks (I)

(Figure: the same RDF graph, now with a foaf:gender property of person200 whose value is marked "?".)

… property value prediction, …

42 Institut AIFB
The Learning Tasks (II)

(Figure: the same RDF graph, now with a foaf:topic_interest link between person100 and topic110 marked "?".)

… link prediction, …

43 Institut AIFB
The Learning Tasks (III)

(Figure: the same RDF graph, now with one of the entities marked "?".)

… clustering,…

… or class-membership prediction, entity resolution, …

44 Institut AIFB
The Kernel Trick
Any RDF graph → Define Kernel  κ(x, y) = ⟨φ(x), φ(y)⟩ → Any Kernel Machine (SVM / SVR / Kernel k-means) → Solve any Task (Classify / Predict / Cluster)

45 Institut AIFB
Intersection Graph

Input: RDF data graph, entities e1 and e2
→ instance extraction → instance graphs G(e1), G(e2)
→ intersection → intersection graph G(e1) ∩ G(e2)
→ feature count → kernel value k(e1, e2)  (output)

46 Institut AIFB
Instance graph

(Figure: the example RDF graph from before, with entities person100, person200 and topic110.)

  Instance graph: k-hop-neighbourhood of entity e


Explore graph starting from entity e up to a depth k
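A minimal sketch, representing the RDF graph as a Python set of (subject, predicate, object) triples; the helper name and the toy triples are illustrative and only loosely mirror the example graph:

```python
def instance_graph(triples, entity, depth=2):
    """k-hop neighbourhood of `entity`: all triples reachable from it within
    `depth` steps, following edges in both directions (a simplifying choice)."""
    reached = {entity}
    selected = set()
    for _ in range(depth):
        new_triples = {(s, p, o) for (s, p, o) in triples
                       if (s in reached or o in reached) and (s, p, o) not in selected}
        selected |= new_triples
        for (s, p, o) in new_triples:
            reached |= {s, o}
    return selected

# Toy triples, loosely mirroring the example RDF graph (exact subjects are illustrative):
triples = {
    ("topic110", "skos:prefLabel", '"Machine Learning"'),
    ("person100", "foaf:knows", "person200"),
    ("person200", "foaf:topic_interest", "topic110"),
    ("person200", "foaf:name", '"Jane Doe"'),
    ("person100", "foaf:gender", '"female"'),
}
G_p200 = instance_graph(triples, "person200", depth=2)
```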

47 Institut AIFB
Instance graph - Example

Instance graph of depth 2 for person100 and instance graph of depth 2 for person200 (figures: the respective 2-hop neighbourhoods of these entities in the example RDF graph).

48 Institut AIFB
Intersection Graph

Input: RDF data graph, entities e1 and e2
→ instance extraction → instance graphs G(e1), G(e2)
→ intersection → intersection graph G(e1) ∩ G(e2)
→ feature count → kernel value k(e1, e2)  (output)

49 Institut AIFB
Intersection graph

The intersection graph of graphs G(e1) and G(e2):

    V(G_1 ∩ G_2) = V_1 ∩ V_2
    E(G_1 ∩ G_2) = {(v_1, p, v_2) | (v_1, p, v_2) ∈ E_1 ∧ (v_1, p, v_2) ∈ E_2}
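With the triple-set representation (reusing the hypothetical instance_graph sketch from the previous slide), the edge intersection is literally a set intersection:

```python
def intersection_graph(g1, g2):
    """Edges of the intersection graph: triples contained in both instance graphs.
    The vertex set V1 ∩ V2 can be read off these shared triples (shared but isolated
    vertices carry no edge features and are ignored here)."""
    return g1 & g2   # g1, g2 are sets of (subject, predicate, object) triples

# The kernel computation later only needs: shared = intersection_graph(G_e1, G_e2)
```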

Intersection of depth 2 for person100 and person200 (figure: the part of the example RDF graph shared by both instance graphs).

50 Institut AIFB
Intersection Graph

Input: RDF data graph, entities e1 and e2
→ instance extraction → instance graphs G(e1), G(e2)
→ intersection → intersection graph G(e1) ∩ G(e2)
→ feature count → kernel value k(e1, e2)  (output)

51 Institut AIFB
Feature count

  Kernel function: Count specific substructures of the


intersection graph.
  Any set of edge-induced subgraphs…

    E′ ⊆ E
    V′ = {v | ∃u, p : (u, p, v) ∈ E′ ∨ (v, p, u) ∈ E′}

  …qualifies as a candidate feature set:
    Edges
    Walks/Paths up to an arbitrary length l
    Connected edge-induced subgraphs
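A minimal sketch of the two simplest variants, counting edges and counting walks up to length l in the intersection graph; this follows the general recipe above rather than any specific published graph kernel:

```python
from collections import defaultdict

def edge_count_kernel(shared):
    """k(e1, e2) = number of triples in the intersection graph."""
    return len(shared)

def walk_count_kernel(shared, max_length=2):
    """Count walks of length 1..max_length in the intersection graph,
    following triples in subject -> object direction."""
    out_edges = defaultdict(list)
    for (s, p, o) in shared:
        out_edges[s].append(o)

    def walks_from(node, steps):
        if steps == 0:
            return 1
        return sum(walks_from(nxt, steps - 1) for nxt in out_edges[node])

    nodes = {s for (s, _, o) in shared} | {o for (s, _, o) in shared}
    return sum(walks_from(v, l) for v in nodes for l in range(1, max_length + 1))
```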

52 Institut AIFB
Excursus: String Kernels

(Fig. 2, panels (a) Extracted Entities, (b) Entity Similarity, (c) Entity Clu…: extracted entities e1 (tec0001), e2 (gov_q_ggdebt) and e3 (NY.GDP.MKTP.CN) from Fig. 1; structural similarity as the intersection graphs Ge1 ∩ Ge2 and Ge1 ∩ Ge3; literal similarity for e1 vs. e2 ("2496200.0" vs. "1786882.0") and for e1 vs. e3 ("2496200.0" vs. "2500090.5"). Note, exact matching literals are omitted, as …)

54 Knowledge Discovery Institut AIFB
Excursus: String Kernels

  Strings are also a relational representation: a linear graph (a sequence of nodes)

Note, [16] introduced further kernels, however, we found path kernels to be simple, and perform well in our experiments, cf. Sect. 6.

Example. Extracted entities from sources in Fig. 1 are given in Fig. 2-a. In Fig. 2-b, we compare the structure of tec0001 (short: e1) and gov_q_ggdebt (short: e2). For this, we compute an intersection: Ge1 ∩ Ge2. This yields a set of 4 paths, each with length 0. The unnormalized kernel value is λ⁰ · 4.

For literal similarities, one can use different kernels on, e.g., strings or numbers [18]. For space reasons, we restrict the presentation to the string subsequence kernel [15]. However, a numerical kernel is outlined in our extended report [19].

Definition 7 (String Subsequence Kernel κ_l). Let Σ denote a vocabulary for strings, with each string s a finite sequence of characters in Σ. Let s[i : j] denote a substring s_i, …, s_j of s. Further, let u be a subsequence of s if indices i = (i_1, …, i_|u|) exist with 1 ≤ i_1 < … < i_|u| ≤ |s| such that u = s[i]. The length l(i) of subsequence u is i_|u| − i_1 + 1. Then, a kernel function κ_l is defined as a sum over all common, weighted subsequences for strings s, t:

    κ_l(s, t) = \sum_u \sum_{i : u = s[i]} \sum_{j : u = t[j]} λ^{l(i)} λ^{l(j)},    with λ as decay factor.

Example. For instance, the strings "MI" and "MIO EUR" share a common subsequence "MI" with i = (1, 2). Thus, the unnormalized kernel is λ² · λ².

As κ_l is only defined for two strings, we sample over every possible string pair for two entities, and aggregate the needed kernel for each pair. Finally, we aggregate the kernels κ_s and κ_l, resulting in a single kernel [18]:

    κ(e′, e″) = κ_s(G_{e′}, G_{e″}) ⨁_{s, t ∈ sample(G_{e′}, G_{e″})} κ_l(s, t)

55 Knowledge Discovery Institut AIFB
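A brute-force sketch of Definition 7, enumerating all index tuples; it is only feasible for short strings and is meant to make the weighting concrete, not to be an efficient implementation:

```python
from collections import defaultdict
from itertools import combinations

def subsequence_weights(s, lam):
    """For every subsequence u of s, sum lam**l(i) over all index tuples i with u = s[i],
    where l(i) is the span i_last - i_first + 1."""
    weights = defaultdict(float)
    for k in range(1, len(s) + 1):
        for idx in combinations(range(len(s)), k):
            u = "".join(s[i] for i in idx)
            weights[u] += lam ** (idx[-1] - idx[0] + 1)
    return weights

def string_subsequence_kernel(s, t, lam=0.5):
    """kappa_l(s, t) = sum over common subsequences u of
    (sum_i lam**l(i)) * (sum_j lam**l(j))."""
    ws, wt = subsequence_weights(s, lam), subsequence_weights(t, lam)
    return sum(w * wt[u] for u, w in ws.items() if u in wt)

# "MI" vs. "MIO EUR": the shared subsequences are "M", "I" and "MI";
# the "MI"-term alone contributes lam**2 * lam**2, as in the example above.
print(string_subsequence_kernel("MI", "MIO EUR", lam=0.5))
```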


Knowledge Discovery Lecture WS14/15
22.10.2014  Introduction (Basics, Overview)
29.10.2014  Design of KD-experiments
05.11.2014  Linear Classifiers
12.11.2014  Data Warehousing & OLAP
19.11.2014  Non-Linear Classifiers (ANNs)
26.11.2014  Kernels, SVM (Supervised Techniques, Vector+Label Representation)
03.12.2014  cancelled
10.12.2014  Decision Trees
17.12.2014  IBL & Clustering (Unsupervised Techniques)
07.01.2015  Relational Learning I
14.01.2015  Relational Learning II (Semi-supervised Techniques, Relational Representation)
21.01.2015  Relational Learning III
28.01.2015  Text Mining
04.02.2015  Guest Lecture (Meta-Topics)
11.02.2015  CRISP, Visualization

56 Institut AIFB
