
Knowledge Discovery WS 14/15

Kernels / SVM 6  
Prof. Dr. Rudi Studer, Dr. Achim Rettinger*, Dipl.-Inform. Lei Zhang
{rudi.studer, achim.rettinger, l.zhang}@kit.edu

INSTITUT FÜR ANGEWANDTE INFORMATIK UND FORMALE BESCHREIBUNGSVERFAHREN (AIFB)

KIT – University of the State of Baden-Württemberg and National Laboratory of the Helmholtz Association    www.kit.edu
Knowledge Discovery Lecture WS14/15
22.10.2014  Introduction (Basics, Overview)
29.10.2014  Design of KD-experiments
05.11.2014  Linear Classifiers
12.11.2014  Data Warehousing & OLAP
19.11.2014  Non-Linear Classifiers (ANNs)
26.11.2014  Kernels, SVM (Supervised Techniques, Vector+Label Representation)
03.12.2014  cancelled
10.12.2014  Decision Trees
17.12.2014  IBL & Clustering (Unsupervised Techniques)
07.01.2015  Relational Learning I
14.01.2015  Relational Learning II (Semi-supervised Techniques, Relational Representation)
21.01.2015  Relational Learning III
28.01.2015  Text Mining
04.02.2015  Guest Lecture (Meta-Topics)
11.02.2015  CRISP, Visualization

2 Institut AIFB
Chapter 3.3

Kernel Methods &


Support Vector Machines
Recommended Reading:
Cristianini & Shawe-Taylor, "Support Vector and Kernel Methods", Ch. 5 in Berthold & Hand (eds.), Intelligent Data Analysis, 2003
3 Institut AIFB
Recap: Linear Classification (Perceptron)

  Artificial Neural Networks (ANNs) are learning systems inspired by the structure of neural information processing
  Single-layer perceptron as the simplest network topology
  Model class is a linear discriminant function
  Decision boundary is an (n-1)-dimensional hyperplane (e.g. a straight line in 2D)
  Training = search for "good" parameters w and b

4 Institut AIFB
Recap: Linear Classification (Perceptron)

5 Institut AIFB
Recap: Basis expansion

  Linear model:  f(x_i, w) = w_0 + \sum_{j=1}^{M-1} w_j x_{i,j}

  Idea: Augment inputs with additional variables: φ(x)

  Basis-expanded linear model:  f(x_i, w) = w_0 + \sum_{j=1}^{M-1} w_j φ_j(x_{i,j})
6 Institut AIFB
Recap: Popular Basis Expansions

  φ_j(x) = x_1, x_2, …, x_{M-1}
  φ_j(x) = x_1^2, x_1 x_2
  φ_j(x) = log(x_1), \sqrt{x_1}
  Piecewise polynomials, splines, wavelet bases
  φ_j(x) = exp(−v ||x − x_m||^2)
  φ(x) = sig(\sum_{j=0}^{M-1} v_j x_j),   where sig(a) = 1 / (1 + exp(−a))

7 Institut AIFB
Three Evolutions in Machine Learning

  Until 1960: first simple statistical techniques


  1960s: efficient algorithms for detecting linear patterns, e.g.
Perceptron training [Rosenblatt, 1957],
  efficient training algorithms (+)
  good generalization performance (+)
  insufficient for nonlinear patterns (-)
  1980s: "nonlinear revolution": backpropagation multilayer
neural networks (gradient descent) and decision trees,
  can deal with nonlinear patterns (+)
  local minima, overfitting (-)
  modeling effort, long training times (-)
  1990s: SVMs and other Kernel Methods,
  considerable success in doing all at once
  principled statistical theory behind the practical techniques

8 Institut AIFB
Chapter 3.3.a

Kernel Functions

9 Institut AIFB
Nonlinearity through Feature Maps

  Often, we can make a nonlinear dataset linearly separable after some suitable preprocessing, e.g. feature creation (cf. basis expansions in the ANN lecture)
  Conceptually, this is the use of a feature mapping function φ
  Example: data inside and outside a 2D sphere (circle); see the sketch below
  NB: Basis functions / hidden layers in ANNs have the same effect
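A minimal sketch of this idea on hypothetical circle data (the feature map and the data are illustrative assumptions, not taken from the slide):

```python
import numpy as np

# Toy data: points inside vs. outside the unit circle (hypothetical example).
rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(200, 2))
y = np.where(X[:, 0] ** 2 + X[:, 1] ** 2 < 1.0, 1, -1)

def feature_map(X):
    """phi(x) = (x1^2, x2^2): inside/outside the circle becomes the half-plane
    x1^2 + x2^2 < 1, i.e. the classes are linearly separable in the mapped space."""
    return X ** 2

Z = feature_map(X)  # any linear classifier on Z can now separate the two classes
```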

10 Institut AIFB
Reminder: “Perceptron Training” Algorithm
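A minimal sketch of the standard primal perceptron update rule (assuming labels y_i in {-1, +1} and learning rate 1; a sketch, not necessarily the exact pseudo-code of the original slide):

```python
import numpy as np

def perceptron_train(X, y, epochs=100):
    """Primal perceptron: on every misclassified example, move the
    hyperplane (w, b) towards the correct side."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        errors = 0
        for i in range(n):
            if y[i] * (X[i] @ w + b) <= 0:    # misclassified (or on the boundary)
                w += y[i] * X[i]
                b += y[i]
                errors += 1
        if errors == 0:                        # converged on separable data
            break
    return w, b
```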

11 Institut AIFB


Dual representation of parameters

  Observation: the weight vector w can always be rewritten as a linear combination of the training data points (summing over the misclassified examples):
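In symbols (one standard way to write this, assuming labels y_i in {-1, +1} and learning rate 1):

    w = \sum_{i=1}^{n} α_i y_i φ(x_i)

where α_i counts how often x_i triggered an update (i.e. was misclassified) during training.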

  The coefficient α_i indicates whether (and how often) x_i has been misclassified during training


  In fact, the optimal weight vector w is a linear combination of the training instances ("Representer Theorem", Kimeldorf & Wahba (1971), simplified)

  Result: Indirect evaluation of inner product with weight


vector is possible without explicitly representing it

12 Institut AIFB
Dual version of “Perceptron Training” algorithm

(In the rewritten dual algorithm, φ(x) is referenced only through pairwise inner products, and only for the known data items.)

Observations:
§  information on pairwise inner products is sufficient for training and classification
§  feature mappings enter the computation only through inner products (see the sketch below)
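A minimal sketch of such a dual ("kernelized") perceptron, assuming labels y_i in {-1, +1}; the kernel is passed in as a parameter, so this is one standard formulation rather than the slide's exact pseudo-code:

```python
import numpy as np

def kernel_perceptron_train(X, y, kernel, epochs=100):
    """Dual perceptron: learn one coefficient alpha_i per training point.
    The data enters only through pairwise kernel evaluations k(x_i, x_j)."""
    n = len(X)
    K = np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])
    alpha, b = np.zeros(n), 0.0
    for _ in range(epochs):
        errors = 0
        for i in range(n):
            f_i = np.sum(alpha * y * K[:, i]) + b
            if y[i] * f_i <= 0:              # misclassified: count one more update
                alpha[i] += 1.0
                b += y[i]
                errors += 1
        if errors == 0:
            break
    return alpha, b

def kernel_perceptron_predict(x, X, y, alpha, b, kernel):
    """Classification also needs only kernel values between x and the training data."""
    return np.sign(sum(alpha[i] * y[i] * kernel(X[i], x) for i in range(len(X))) + b)
```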

13 Institut AIFB
Kernel Functions

  A kernel function is thus an efficient shortcut: it computes dot products of the mapped data without any explicit mapping into the new feature space.
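In symbols: a kernel function k computes, for some feature map φ,

    k(x, z) = ⟨φ(x), φ(z)⟩

so evaluating k on the original inputs equals a dot product in the mapped feature space.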

14 Institut AIFB
Example: Polynomial Kernel of Degree 2

15 Institut AIFB
Example: Polynomial Kernel of Degree 2

"Kernel Trick"

Feature Mapping
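A worked example for 2D inputs x = (x_1, x_2) and z = (z_1, z_2), using one standard variant of the degree-2 polynomial kernel (not necessarily the exact variant on the slide):

    k(x, z) = ⟨x, z⟩² = (x_1 z_1 + x_2 z_2)²
            = x_1² z_1² + 2 x_1 x_2 z_1 z_2 + x_2² z_2²
            = ⟨φ(x), φ(z)⟩    with φ(x) = (x_1², √2 · x_1 x_2, x_2²)

One kernel evaluation in the original 2D space thus replaces an explicit mapping into the 3-dimensional feature space.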

16 Institut AIFB
Mathematical Properties of Kernels

  There is a (slightly more advanced) mathematical theory behind


kernels ("Reproducing Kernel Hilbert Spaces" / Mercer's Theorem)
which we won't cover here

  Core insights:
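For reference, the standard characterization (not necessarily the slide's exact wording): a symmetric function k : X × X → R is a valid kernel, i.e. k(x, z) = ⟨φ(x), φ(z)⟩ for some feature map φ, if and only if it is positive semi-definite:

    \sum_{i=1}^{n} \sum_{j=1}^{n} c_i c_j k(x_i, x_j) ≥ 0    for all n, all x_1, …, x_n ∈ X and all c_1, …, c_n ∈ R.

Equivalently, every Gram matrix K with K_{ij} = k(x_i, x_j) is positive semi-definite.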

17 Institut AIFB
Mathematical Properties of Kernels (cont)

  Kernels are closed under certain operations, so we can combine


known kernel functions and yield valid kernels again

  Most important closure properties:
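Standard closure properties (a reference list rather than necessarily the slide's exact one): if k_1 and k_2 are valid kernels, a > 0, f is any real-valued function, and p is a polynomial with non-negative coefficients, then the following are valid kernels as well:

    k(x, z) = k_1(x, z) + k_2(x, z)
    k(x, z) = a · k_1(x, z)
    k(x, z) = k_1(x, z) · k_2(x, z)
    k(x, z) = f(x) · k_1(x, z) · f(z)
    k(x, z) = p(k_1(x, z))
    k(x, z) = exp(k_1(x, z))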

18 Institut AIFB
Kernel Function Design

  How to come up with an appropriate kernel function?

  Case I: Derive it directly from explicit feature mappings

  Case II: Design a similarity function directly on the input data and
check whether it conforms to a valid kernel function
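For Case II, a quick empirical sanity check (a sketch; it can refute validity on a given sample but never prove it in general):

```python
import numpy as np

def gram_matrix(xs, k):
    """Gram matrix K_ij = k(x_i, x_j) for a candidate similarity function k."""
    n = len(xs)
    return np.array([[k(xs[i], xs[j]) for j in range(n)] for i in range(n)])

def looks_like_a_kernel(xs, k, tol=1e-8):
    """Necessary condition only: symmetric Gram matrix with eigenvalues >= 0
    on this particular sample. A failure disproves validity; success does not prove it."""
    K = gram_matrix(xs, k)
    if not np.allclose(K, K.T):
        return False
    return bool(np.all(np.linalg.eigvalsh(K) >= -tol))
```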

19 Institut AIFB
Example: Gaussian Kernel

  The Gaussian kernel (also known as the Radial Basis Function (RBF) kernel) is one of the "classical" kernels (see the formula below)

  Interpretation of the Gaussian kernel:
  The result of a kernel evaluation depends on the distance between the examples
  Large distances lead to more orthogonality
  The bandwidth parameter sigma controls the rate of decay
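The usual definition (one common parameterization):

    k(x, z) = exp( −‖x − z‖² / (2σ²) )

so k(x, x) = 1, and k(x, z) decays towards 0 as the distance ‖x − z‖ grows, at a rate controlled by the bandwidth σ.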

20 Institut AIFB
Example: Gaussian Kernel

  Feature space of the Gaussian kernel:
  Bell-shaped Gaussians centered at the data points
  Theoretically, an infinite number of dimensions

Decision Boundaries for a 2D Learning Problem with the Gaussian Kernel (figure)

  NB: ||x − z|| could itself be computed in a kernel-induced feature space; the Gaussian kernel then becomes a kernel modifier
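This works because squared distances can themselves be written in terms of inner products, and hence of any kernel κ:

    ‖φ(x) − φ(z)‖² = ⟨φ(x), φ(x)⟩ − 2⟨φ(x), φ(z)⟩ + ⟨φ(z), φ(z)⟩ = κ(x, x) − 2κ(x, z) + κ(z, z)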
21 Institut AIFB
Kernels on Non-Vectorial Data

  Kernels need not be restricted to operate on vectors but


can be defined on other structures as well.
  We can define similarity functions on such structures and
make sure that they are positive semi-definite

  Examples:
  String Kernels
[e.g. Lodhi et al., "Text Classification Using String Kernels", 2002]

  Tree Kernels
[e.g. Collins & Duffy, "Convolution Kernels for Natural Language", 2001]

  Graph Kernels
[e.g. Gärtner et al., "On Graph Kernels: Hardness Results and Efficient Alternatives", 2003]

  Many more…
[Gärtner, "A Survey of Kernels for Structured Data", 2003]

22 Institut AIFB
Summary: Kernel Methods (remember this !)
Linear Model: Classification = side of the hyperplane; Training = estimation of "good" parameters w and b.

Dual Representation: During learning and classification, it is sufficient to access only dot products of data points.

Representer Theorem: The optimal weight vector w is a linear combination of the training instances.

Kernel Function: A similarity function with an interpretation as a dot product (after mapping to a vector representation).
24 Institut AIFB
Summary: Kernel Methods (remember this !)

  "Kernel Trick":
  Rewrite learning algorithm such that any reference to the input data
happens from within inner products
  Replace any such inner product by the kernel function of your
choice.
  Work with the (linear) algorithm as usual
  Non-linearity enters learning through kernels while training
algorithm may remain linear.

  NB: surprisingly many well-known algorithms can be


rewritten using the kernel approach.
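As a usage sketch, reusing the hypothetical kernel_perceptron_train from the dual-perceptron slide earlier: switching the non-linearity just means switching the kernel function, while the training loop itself stays unchanged.

```python
import numpy as np

# Hypothetical kernel choices (the names are illustrative, not from the slides).
linear_kernel   = lambda x, z: float(np.dot(x, z))
poly2_kernel    = lambda x, z: (1.0 + float(np.dot(x, z))) ** 2
gaussian_kernel = lambda x, z, sigma=1.0: float(np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2)))

# Same algorithm, three different hypothesis spaces:
# alpha, b = kernel_perceptron_train(X, y, kernel=linear_kernel)
# alpha, b = kernel_perceptron_train(X, y, kernel=poly2_kernel)
# alpha, b = kernel_perceptron_train(X, y, kernel=gaussian_kernel)
```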

25 Institut AIFB
Chapter 3.3.b

Support Vector Machines (SVMs)

26 Institut AIFB
Support Vector Machines

  Let's imagine this: if we use a fancy kernel and the


perceptron training algorithm, we might always be able to
separate our data perfectly.

However, we are in danger of overfitting!

  Support Vector Machines use a simple and elegant way to


prevent overfitting: margin maximization.
  very simple intuition
  elaborated statistical theory
  connections and analogies to various other learning methods

27 Institut AIFB
Margin Maximization: Intuition

28 Institut AIFB
Margin Maximization

  The (functional) margin of an example with respect to a hyperplane is


the quantity:
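For a labelled example (x_i, y_i) with y_i ∈ {−1, +1} and hyperplane parameters (w, b), the usual definition is

    γ_i = y_i · (⟨w, x_i⟩ + b)

which is positive exactly when the example is classified correctly.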

  Analogously, the geometric margin is the Euclidean distance of the example from the hyperplane (i.e. w, b scaled by 1/||w||).

  The margin of a hyperplane with respect to a training set is the minimum geometric margin over all items in the set.
  The margin of a training set is the maximum geometric margin over all hyperplanes.

29 Institut AIFB
Support Vector Machines

The SVM principle: of all the hyperplanes that separate the


data perfectly, pick the one which maximizes the margin.

30 Institut AIFB
SVM Optimization Problem

  Idea: Fix the norm of the weight vector and maximize the functional
margin.
  Equivalently, if we fix the functional margin to 1, the geometric margin
equals 1/||w||
  Thus: maximizing (geometric) margin = minimizing the norm of the
weight vector while keeping functional margin above a fixed value

  Results in SVM optimization problem:
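In its standard (hard-margin) form:

    minimize over w, b:    ½ ‖w‖²
    subject to:            y_i (⟨w, x_i⟩ + b) ≥ 1    for all i = 1, …, n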

31 Institut AIFB
SVM Optimization Problem (dual)

  Reformulation (Lagrangian dual):
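The standard dual form is

    maximize over α:    \sum_i α_i − ½ \sum_i \sum_j α_i α_j y_i y_j ⟨x_i, x_j⟩
    subject to:         α_i ≥ 0 for all i,    \sum_i α_i y_i = 0

where ⟨x_i, x_j⟩ can be replaced by any kernel k(x_i, x_j).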

  Observation: dual formulation allows us to use kernel functions


  Convex quadratic programming problem (global optimal solution,
solvable in polynomial time)

  NB: parameter b doesn't show up here, but can be determined


from optimal α.
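For example, using any support vector x_j (i.e. α_j > 0), the standard recovery is

    b = y_j − \sum_i α_i y_i k(x_i, x_j)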
32 Institut AIFB
Why "Support Vectors" ?

α_i is positive exactly for those x_i that lie on the margin = "support vectors"

Typically, the solution is very


sparse, i.e. only few support
vectors!
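The resulting classifier therefore only ever touches the support vectors:

    f(x) = sign( \sum_{i: α_i > 0} α_i y_i k(x_i, x) + b )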

33 Institut AIFB
Dealing with Noise: Soft Margins

  Still, noise may make data not linearly separable


  SVM approach: relax margin criterion and let some patterns violate the
margin constraint

  Soft margin version of SVM optimization problem:
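In its standard form, with slack variables ξ_i:

    minimize over w, b, ξ:    ½ ‖w‖² + C \sum_i ξ_i
    subject to:               y_i (⟨w, x_i⟩ + b) ≥ 1 − ξ_i,    ξ_i ≥ 0    for all i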

  Trade-off parameter C controls influence of errors


  NB: for linearly separable case, C may be infinitely high

34 Institut AIFB
Dealing with Noise: Soft Margins

Slack variables ξi are


proportional to distance of
misclassified example xi from
its corresponding margin

35 Institut AIFB
Dealing with Noise: Soft Margins (dual)

  Reformulation to dual optimization problem:
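In its standard form, the soft-margin dual is

    maximize over α:    \sum_i α_i − ½ \sum_i \sum_j α_i α_j y_i y_j k(x_i, x_j)
    subject to:         0 ≤ α_i ≤ C for all i,    \sum_i α_i y_i = 0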

  Solution is surprisingly simple:


  again, it is possible to use kernels
  difference to hard-margin version:
"box constraint" for αi.

36 Institut AIFB
Summary: SVMs (remember this !)

  Generalization through margin maximization


  intuitive motivation
  unique optimal solution
  Maximal margin = minimal norm
  Dual formulation
  makes it possible to use kernel functions
  makes problem become a quadratic programming problem for
which fast algorithms exist
  Hard margin vs. Soft margin

  NB: Many well-known implementations available, e.g.


SVMlight <http://svmlight.joachims.org/>
libSVM <http://www.csie.ntu.edu.tw/~cjlin/libsvm/>
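A minimal usage sketch with scikit-learn, whose SVC class wraps libSVM; the data and the values of C and gamma are placeholders:

```python
import numpy as np
from sklearn.svm import SVC

# Placeholder data: X is an (n_samples, n_features) array, y holds the class labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = np.where(X[:, 0] ** 2 + X[:, 1] ** 2 < 1.0, 1, -1)

# Soft-margin SVM with a Gaussian (RBF) kernel; C and gamma would normally be tuned.
clf = SVC(kernel="rbf", C=1.0, gamma=0.5)
clf.fit(X, y)

print(clf.support_.shape[0], "support vectors")   # typically only a few points
print(clf.predict([[0.0, 0.0], [2.0, 2.0]]))
```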
37 Institut AIFB
Chapter 3.3 – Summary & Example

Graph Analytics with


Kernels, SVMs & friends

38 Institut AIFB
Big picture: Kernels and SVMs

1.  Think about whether you could work with linear techniques if you mapped your data into a richer (higher-dimensional) space.
2.  Implement the learning algorithm in such a way that reference to the mapped data is made only within pairwise dot products.  (Kernel Functions)
3.  Pairwise dot products can be efficiently computed by means of a corresponding kernel function on the original data items.  (Kernel Functions)
4.  Make the SVM training maximize the margin of the training data in the implicit feature space to maximize generalization performance.  (SVMs)

5.  Do tuning and evaluation.

39 Institut AIFB
There is more behind Kernels and SVMs

  In principle, we have a framework for:


  detecting nonlinear patterns in data,
  efficiently (little computational cost),
  with statistical stability (little overfitting)

  The overall family of algorithms is modular & flexible:

    Data → Kernel Function ((a)…, (b)…) → Learning Algorithm (Classification: SVM, (Ranking), (Regression), (Clustering))

40 Institut AIFB
Example: Graph kernel

Given any data in graph format (RDF)…

(Figure: example RDF graph with the entities person100, person200 and topic110, the literals "Machine Learning", "Jane Doe" and "female", and the properties skos:prefLabel, foaf:topic_interest, foaf:knows, foaf:name and foaf:gender.)

…solve any standard statistical relational learning task, like…

41 Institut AIFB
The Learning Tasks (I)

(Figure: the same RDF graph, now with a foaf:gender property of person200 whose value is marked "?".)

… property value prediction, …

42 Institut AIFB
The Learning Tasks (II)

(Figure: the same RDF graph, now with a foaf:topic_interest link between person100 and topic110 marked "?".)

… link prediction, …

43 Institut AIFB
The Learning Tasks (III)

(Figure: the same RDF graph, now with one of the entities marked "?".)

… clustering,…

… or class-membership prediction, entity resolution, …

44 Institut AIFB
The Kernel Trick
Any RDF graph → Define Kernel  κ(x, y) = ⟨φ(x), φ(y)⟩ → Any Kernel Machine (SVM / SVR / Kernel k-means) → Solve any Task (Classify / Predict / Cluster)

45 Institut AIFB
Intersection Graph

Input: RDF data graph, entities e1 and e2
→ instance extraction → instance graphs G(e1), G(e2)
→ intersection → intersection graph G(e1) ∩ G(e2)
→ feature count → kernel value k(e1, e2)  (output)

46 Institut AIFB
Instance graph

(Figure: the example RDF graph from before, with entities person100, person200 and topic110.)

  Instance graph: k-hop-neighbourhood of entity e


Explore graph starting from entity e up to a depth k
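A minimal sketch, representing the RDF graph as a Python set of (subject, predicate, object) triples; the helper name and the toy triples are illustrative and only loosely mirror the example graph:

```python
def instance_graph(triples, entity, depth=2):
    """k-hop neighbourhood of `entity`: all triples reachable from it within
    `depth` steps, following edges in both directions (a simplifying choice)."""
    reached = {entity}
    selected = set()
    for _ in range(depth):
        new_triples = {(s, p, o) for (s, p, o) in triples
                       if (s in reached or o in reached) and (s, p, o) not in selected}
        selected |= new_triples
        for (s, p, o) in new_triples:
            reached |= {s, o}
    return selected

# Toy triples, loosely mirroring the example RDF graph (exact subjects are illustrative):
triples = {
    ("topic110", "skos:prefLabel", '"Machine Learning"'),
    ("person100", "foaf:knows", "person200"),
    ("person200", "foaf:topic_interest", "topic110"),
    ("person200", "foaf:name", '"Jane Doe"'),
    ("person100", "foaf:gender", '"female"'),
}
G_p200 = instance_graph(triples, "person200", depth=2)
```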

47 Institut AIFB
Instance graph - Example

Instance graph of depth 2 for person100 and instance graph of depth 2 for person200 (figures: the respective 2-hop neighbourhoods of these entities in the example RDF graph).

48 Institut AIFB
Intersection Graph

Input: RDF data graph, entities e1 and e2
→ instance extraction → instance graphs G(e1), G(e2)
→ intersection → intersection graph G(e1) ∩ G(e2)
→ feature count → kernel value k(e1, e2)  (output)

49 Institut AIFB
Intersection graph

The intersection graph of graphs G(e1) and G(e2):

    V(G_1 ∩ G_2) = V_1 ∩ V_2
    E(G_1 ∩ G_2) = {(v_1, p, v_2) | (v_1, p, v_2) ∈ E_1 ∧ (v_1, p, v_2) ∈ E_2}
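With the triple-set representation (reusing the hypothetical instance_graph sketch from the previous slide), the edge intersection is literally a set intersection:

```python
def intersection_graph(g1, g2):
    """Edges of the intersection graph: triples contained in both instance graphs.
    The vertex set V1 ∩ V2 can be read off these shared triples (shared but isolated
    vertices carry no edge features and are ignored here)."""
    return g1 & g2   # g1, g2 are sets of (subject, predicate, object) triples

# The kernel computation later only needs: shared = intersection_graph(G_e1, G_e2)
```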

Intersection of depth 2 for person100 and person200 (figure: the part of the example RDF graph shared by both instance graphs).

50 Institut AIFB
Intersection Graph

Input: RDF data graph, entities e1 and e2
→ instance extraction → instance graphs G(e1), G(e2)
→ intersection → intersection graph G(e1) ∩ G(e2)
→ feature count → kernel value k(e1, e2)  (output)

51 Institut AIFB
Feature count

  Kernel function: Count specific substructures of the


intersection graph.
  Any set of edge-induced subgraphs…

    E′ ⊆ E
    V′ = {v | ∃u, p : (u, p, v) ∈ E′ ∨ (v, p, u) ∈ E′}

  …qualifies as a candidate feature set:
    Edges
    Walks/Paths up to an arbitrary length l
    Connected edge-induced subgraphs
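A minimal sketch of the two simplest variants, counting edges and counting walks up to length l in the intersection graph; this follows the general recipe above rather than any specific published graph kernel:

```python
from collections import defaultdict

def edge_count_kernel(shared):
    """k(e1, e2) = number of triples in the intersection graph."""
    return len(shared)

def walk_count_kernel(shared, max_length=2):
    """Count walks of length 1..max_length in the intersection graph,
    following triples in subject -> object direction."""
    out_edges = defaultdict(list)
    for (s, p, o) in shared:
        out_edges[s].append(o)

    def walks_from(node, steps):
        if steps == 0:
            return 1
        return sum(walks_from(nxt, steps - 1) for nxt in out_edges[node])

    nodes = {s for (s, _, o) in shared} | {o for (s, _, o) in shared}
    return sum(walks_from(v, l) for v in nodes for l in range(1, max_length + 1))
```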

52 Institut AIFB
Excursus: String Kernels

(Fig. 2, panels (a) Extracted Entities, (b) Entity Similarity, (c) Entity Clu…: extracted entities e1 (tec0001), e2 (gov_q_ggdebt) and e3 (NY.GDP.MKTP.CN) from Fig. 1; structural similarity as the intersection graphs Ge1 ∩ Ge2 and Ge1 ∩ Ge3; literal similarity for e1 vs. e2 ("2496200.0" vs. "1786882.0") and for e1 vs. e3 ("2496200.0" vs. "2500090.5"). Note, exact matching literals are omitted, as …)

54 Knowledge Discovery Institut AIFB
Excursus: String Kernels

  Strings are also a relational representation: a linear graph (a sequence of nodes)

Note, [16] introduced further kernels, however, we found path kernels to be simple, and perform well in our experiments, cf. Sect. 6.

Example. Extracted entities from sources in Fig. 1 are given in Fig. 2-a. In Fig. 2-b, we compare the structure of tec0001 (short: e1) and gov_q_ggdebt (short: e2). For this, we compute an intersection: Ge1 ∩ Ge2. This yields a set of 4 paths, each with length 0. The unnormalized kernel value is λ⁰ · 4.

For literal similarities, one can use different kernels on, e.g., strings or numbers [18]. For space reasons, we restrict the presentation to the string subsequence kernel [15]. However, a numerical kernel is outlined in our extended report [19].

Definition 7 (String Subsequence Kernel κ_l). Let Σ denote a vocabulary for strings, with each string s a finite sequence of characters in Σ. Let s[i : j] denote a substring s_i, …, s_j of s. Further, let u be a subsequence of s if indices i = (i_1, …, i_|u|) exist with 1 ≤ i_1 < … < i_|u| ≤ |s| such that u = s[i]. The length l(i) of subsequence u is i_|u| − i_1 + 1. Then, a kernel function κ_l is defined as a sum over all common, weighted subsequences for strings s, t:

    κ_l(s, t) = \sum_u \sum_{i : u = s[i]} \sum_{j : u = t[j]} λ^{l(i)} λ^{l(j)},    with λ as decay factor.

Example. For instance, the strings "MI" and "MIO EUR" share a common subsequence "MI" with i = (1, 2). Thus, the unnormalized kernel is λ² · λ².

As κ_l is only defined for two strings, we sample over every possible string pair for two entities, and aggregate the needed kernel for each pair. Finally, we aggregate the kernels κ_s and κ_l, resulting in a single kernel [18]:

    κ(e′, e″) = κ_s(G_{e′}, G_{e″}) ⨁_{s, t ∈ sample(G_{e′}, G_{e″})} κ_l(s, t)

55 Knowledge Discovery Institut AIFB
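A brute-force sketch of Definition 7, enumerating all index tuples; it is only feasible for short strings and is meant to make the weighting concrete, not to be an efficient implementation:

```python
from collections import defaultdict
from itertools import combinations

def subsequence_weights(s, lam):
    """For every subsequence u of s, sum lam**l(i) over all index tuples i with u = s[i],
    where l(i) is the span i_last - i_first + 1."""
    weights = defaultdict(float)
    for k in range(1, len(s) + 1):
        for idx in combinations(range(len(s)), k):
            u = "".join(s[i] for i in idx)
            weights[u] += lam ** (idx[-1] - idx[0] + 1)
    return weights

def string_subsequence_kernel(s, t, lam=0.5):
    """kappa_l(s, t) = sum over common subsequences u of
    (sum_i lam**l(i)) * (sum_j lam**l(j))."""
    ws, wt = subsequence_weights(s, lam), subsequence_weights(t, lam)
    return sum(w * wt[u] for u, w in ws.items() if u in wt)

# "MI" vs. "MIO EUR": the shared subsequences are "M", "I" and "MI";
# the "MI"-term alone contributes lam**2 * lam**2, as in the example above.
print(string_subsequence_kernel("MI", "MIO EUR", lam=0.5))
```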


Knowledge Discovery Lecture WS14/15
22.10.2014  Introduction (Basics, Overview)
29.10.2014  Design of KD-experiments
05.11.2014  Linear Classifiers
12.11.2014  Data Warehousing & OLAP
19.11.2014  Non-Linear Classifiers (ANNs)
26.11.2014  Kernels, SVM (Supervised Techniques, Vector+Label Representation)
03.12.2014  cancelled
10.12.2014  Decision Trees
17.12.2014  IBL & Clustering (Unsupervised Techniques)
07.01.2015  Relational Learning I
14.01.2015  Relational Learning II (Semi-supervised Techniques, Relational Representation)
21.01.2015  Relational Learning III
28.01.2015  Text Mining
04.02.2015  Guest Lecture (Meta-Topics)
11.02.2015  CRISP, Visualization

56 Institut AIFB
