Max Planck Institute for Biological Cybernetics
Technical Report No. 156

A Review of Kernel Methods in Machine Learning

Thomas Hofmann¹, Bernhard Schölkopf², Alexander J. Smola³

December 2006

¹ Darmstadt University of Technology, Department of Computer Science, and Fraunhofer Institute for Integrated Publication and Information Systems, Darmstadt, Germany, email: thomas.hofmann@ipsi.fraunhofer.de
² Max Planck Institute for Biological Cybernetics, 72076 Tübingen, Germany, email: bs@tuebingen.mpg.de
³ Statistical Machine Learning Program, National ICT Australia, and Computer Science Laboratory, Australian National University, Canberra, Australia, email: Alex.Smola@nicta.com.au

This report is available in PDF format via anonymous ftp at ftp://ftp.kyb.tuebingen.mpg.de/pub/mpimemos/pdf/TR156.pdf. The complete series of Technical Reports is documented at: http://www.kyb.tuebingen.mpg.de/techreports.html
A Review of Kernel Methods in Machine Learning

Thomas Hofmann, Bernhard Schölkopf, Alexander J. Smola
Abstract. We review recent methods for learning with positive definite kernels. All these methods formulate learning and estimation problems as linear tasks in a reproducing kernel Hilbert space (RKHS) associated with a kernel. We cover a wide range of methods, ranging from simple classifiers to sophisticated methods for estimation with structured data.
(AMS 2000 subject classifications: primary 30C40 Kernel functions and applications; secondary 68T05 Learning and adaptive systems. Key words: machine learning, reproducing kernels, support vector machines, graphical models)
1 Introduction
Over the last ten years, estimation and learning methods utilizing positive definite kernels have become rather popular, particularly in machine learning. Since these methods have a stronger mathematical slant than earlier machine learning methods (e.g., neural networks), there is also significant interest in these methods in the statistics and mathematics communities. The present review aims to summarize the state of the art on a conceptual level. In doing so, we build on various sources (including [Vapnik, 1998, Burges, 1998, Cristianini and Shawe-Taylor, 2000, Herbrich, 2002] and in particular [Schölkopf and Smola, 2002]), but we also add a fair amount of more recent material which helps unify the exposition. We have not had space to include proofs; they can be found either in the long version of the present paper (see Hofmann et al. [2006]), in the references given, or in the above books.
The main idea of all the described methods can be summarized in one paragraph. Traditionally, the theory and algorithms of machine learning and statistics have been very well developed for the linear case. Real-world data analysis problems, on the other hand, often require nonlinear methods to detect the kinds of dependencies that allow successful prediction of properties of interest. By using a positive definite kernel, one can sometimes have the best of both worlds. The kernel corresponds to a dot product in a (usually high-dimensional) feature space. In this space, our estimation methods are linear, but as long as we can formulate everything in terms of kernel evaluations, we never explicitly have to compute in the high-dimensional feature space.
The paper has three main sections: Section 2 deals with fundamental properties of kernels, with special emphasis on (conditionally) positive definite kernels and their characterization. We give concrete examples of such kernels and discuss kernels and reproducing kernel Hilbert spaces in the context of regularization. Section 3 presents various approaches for estimating dependencies and analyzing data that make use of kernels. We provide an overview of the problem formulations as well as their solution using convex programming techniques. Finally, Section 4 examines the use of reproducing kernel Hilbert spaces as a means to define statistical models, the focus being on structured, multidimensional responses. We also show how such techniques can be combined with Markov networks as a suitable framework to model dependencies between response variables.
Note: instead of this TR, please cite the updated journal version:
T. Hofmann, B. Schölkopf, A. Smola: Kernel methods in machine learning.
Annals of Statistics (in press).
Figure 1: A simple geometric classification algorithm: given two classes of points (depicted by 'o' and '+'), compute their means c_+, c_− and assign a test input x to the one whose mean is closer. This can be done by looking at the dot product between x − c (where c = (c_+ + c_−)/2) and w := c_+ − c_−, which changes sign as the enclosed angle passes through π/2. Note that the corresponding decision boundary is a hyperplane (the dotted line) orthogonal to w (from [Schölkopf and Smola, 2002]).
2 Kernels

2.1 An Introductory Example

Suppose we are given empirical data

    (x_1, y_1), . . . , (x_n, y_n) ∈ X × Y.    (1)

Here, the domain X is some nonempty set that the inputs (the predictor variables) x_i are taken from; the y_i ∈ Y are called targets (the response variables). Here and below, i, j ∈ [n], where we use the notation [n] := {1, . . . , n}.
Note that we have not made any assumptions on the domain X other than it being a set. In order to study the problem of learning, we need additional structure. In learning, we want to be able to generalize to unseen data points. In the case of binary pattern recognition, given some new input x ∈ X, we want to predict the corresponding y ∈ {±1} (more complex output domains Y will be treated below). Loosely speaking, we want to choose y such that (x, y) is in some sense similar to the training examples. To this end, we need similarity measures in X and in {±1}. The latter is easier, as two target values can only be identical or different. For the former, we require a function

    k : X × X → R,  (x, x′) ↦ k(x, x′)    (2)

satisfying, for all x, x′ ∈ X,

    k(x, x′) = ⟨Φ(x), Φ(x′)⟩,    (3)

where Φ maps into some dot product space H, sometimes called the feature space. The similarity measure k is usually called a kernel, and Φ is called its feature map.
The advantage of using such a kernel as a similarity measure is that it allows us to construct algorithms in dot product spaces. For instance, consider the following simple classification algorithm, where Y = {±1}. The idea is to compute the means of the two classes in the feature space,

    c_+ = (1/n_+) Σ_{i: y_i = +1} Φ(x_i),  and  c_− = (1/n_−) Σ_{i: y_i = −1} Φ(x_i),

where n_+ and n_− are the numbers of examples with positive and negative target values, respectively. We then assign a new point Φ(x) to the class whose mean is closer to it. This leads to the prediction rule

    y = sgn(⟨Φ(x), c_+⟩ − ⟨Φ(x), c_−⟩ + b)    (4)
with b = (1/2)(∥c_−∥² − ∥c_+∥²). Substituting the expressions for c_± yields

    y = sgn( (1/n_+) Σ_{i: y_i = +1} ⟨Φ(x), Φ(x_i)⟩ − (1/n_−) Σ_{i: y_i = −1} ⟨Φ(x), Φ(x_i)⟩ + b ),    (5)

where each dot product ⟨Φ(x), Φ(x_i)⟩ can be written as a kernel evaluation k(x, x_i), and where

    b = (1/2)( (1/n_−²) Σ_{(i,j): y_i = y_j = −1} k(x_i, x_j) − (1/n_+²) Σ_{(i,j): y_i = y_j = +1} k(x_i, x_j) ).
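As an illustration, the prediction rule (5) can be sketched in a few lines of NumPy. The Gaussian kernel and the toy two-cluster data below are illustrative choices, not part of the text above; this is a minimal sketch, not an optimized implementation.

```python
import numpy as np

def gaussian_kernel(x, xp, sigma=1.0):
    """Gaussian kernel k(x, x') = exp(-||x - x'||^2 / (2 sigma^2))."""
    return np.exp(-np.sum((x - xp) ** 2) / (2 * sigma ** 2))

def mean_classifier(x, X, y, k=gaussian_kernel):
    """Prediction rule (5): compare mean kernel similarity to each class, plus offset b."""
    pos, neg = X[y == +1], X[y == -1]
    # (1/n_+) sum_{i: y_i=+1} k(x, x_i), and analogously for the negative class
    s_pos = np.mean([k(x, xi) for xi in pos])
    s_neg = np.mean([k(x, xi) for xi in neg])
    # b = (1/2)( n_-^{-2} sum over '-' pairs  -  n_+^{-2} sum over '+' pairs )
    b = 0.5 * (np.mean([[k(a, b_) for b_ in neg] for a in neg])
               - np.mean([[k(a, b_) for b_ in pos] for a in pos]))
    return int(np.sign(s_pos - s_neg + b))

# two well-separated 2-d classes (synthetic data for illustration only)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 0.5, (20, 2)), rng.normal(+2, 0.5, (20, 2))])
y = np.array([-1] * 20 + [+1] * 20)

pred = mean_classifier(np.array([1.8, 2.1]), X, y)
```

Because everything is expressed through kernel evaluations, the feature map Φ never has to be computed explicitly.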
Let us consider one well-known special case of this type of classifier. Assume that the class means have the same distance to the origin (hence b = 0), and that k(·, x) is a density for all x ∈ X. If the two classes are equally likely and were generated from two probability distributions that are estimated as

    p_+(x) := (1/n_+) Σ_{i: y_i = +1} k(x, x_i),  p_−(x) := (1/n_−) Σ_{i: y_i = −1} k(x, x_i),    (6)

then (5) is the estimated Bayes decision rule, plugging in the estimates p_+ and p_− for the true densities.
The classifier (5) is closely related to the Support Vector Machine (SVM) that we will discuss below. It is linear in the feature space (4), while in the input domain, it is represented by a kernel expansion (5). In both cases, the decision boundary is a hyperplane in the feature space; however, the normal vectors (for (4), w = c_+ − c_−) are usually rather different.¹ The normal vector not only characterizes the alignment of the hyperplane, its length can also be used to construct tests for the equality of the two class-generating distributions [Borgwardt et al., 2006].
2.2 Positive Definite Kernels

We have required that a kernel satisfy (3), i.e., correspond to a dot product in some dot product space. In the present section, we show that the class of kernels that can be written in the form (3) coincides with the class of positive definite kernels. This has far-reaching consequences. There are examples of positive definite kernels which can be evaluated efficiently even though they correspond to dot products in infinite-dimensional dot product spaces. In such cases, substituting k(x, x′) for ⟨Φ(x), Φ(x′)⟩, as we have done in (5), is crucial. In the machine learning community, this substitution is called the kernel trick.
Definition 1 (Gram matrix) Given a kernel k and inputs x_1, . . . , x_n ∈ X, the n × n matrix

    K := (k(x_i, x_j))_{ij}    (7)

is called the Gram matrix (or kernel matrix) of k with respect to x_1, . . . , x_n.
Definition 2 (Positive definite matrix) A real n × n symmetric matrix K_{ij} satisfying

    Σ_{i,j} c_i c_j K_{ij} ≥ 0    (8)

for all c_i ∈ R is called positive definite. If equality in (8) only occurs for c_1 = · · · = c_n = 0, then we shall call the matrix strictly positive definite.
Definition 3 (Positive definite kernel) Let X be a nonempty set. A function k : X × X → R which for all n ∈ N, x_i ∈ X, i ∈ [n] gives rise to a positive definite Gram matrix is called a positive definite kernel. A function k : X × X → R which for all n ∈ N and distinct x_i ∈ X gives rise to a strictly positive definite Gram matrix is called a strictly positive definite kernel.
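Definitions 1 and 2 are easy to check numerically: one builds the Gram matrix (7) and verifies condition (8) via the eigenvalues of the symmetric matrix. A minimal sketch (the Gaussian kernel and the sample points are illustrative choices):

```python
import numpy as np

def gram_matrix(k, xs):
    """Gram matrix (7): K_ij = k(x_i, x_j)."""
    n = len(xs)
    return np.array([[k(xs[i], xs[j]) for j in range(n)] for i in range(n)])

def is_positive_definite(K, tol=1e-10):
    """Condition (8) holds for a symmetric K iff all eigenvalues are nonnegative."""
    return bool(np.all(np.linalg.eigvalsh(K) >= -tol))

# Gaussian kernel on a few scalar inputs
k = lambda x, xp: np.exp(-(x - xp) ** 2 / 2.0)
K = gram_matrix(k, [0.0, 0.7, 1.5, 3.0])
```

Note that what Definition 2 calls "positive definite" is what linear algebra texts usually call positive semidefinite; the eigenvalue test above checks exactly the quadratic-form condition (8).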
¹ As an aside, note that if we normalize the targets such that ŷ_i = y_i/|{j : y_j = y_i}|, in which case the ŷ_i sum to zero, then ∥w∥² = ⟨K, ŷŷ⊤⟩_F, where ⟨·, ·⟩_F is the Frobenius dot product. If the two classes have equal size, then up to a scaling factor involving ∥K∥_2 and n, this equals the kernel-target alignment defined in [Cristianini et al., 2002].
Occasionally, we shall refer to positive definite kernels simply as kernels. Note that, for simplicity, we have restricted ourselves to the case of real-valued kernels. However, with small changes, the results below will also hold for the complex-valued case.
Since

    Σ_{i,j} c_i c_j ⟨Φ(x_i), Φ(x_j)⟩ = ⟨ Σ_i c_i Φ(x_i), Σ_j c_j Φ(x_j) ⟩ ≥ 0,

kernels of the form (3) are positive definite for any choice of Φ. In particular, if X is already a dot product space, we may choose Φ to be the identity. Kernels can thus be regarded as generalized dot products. Whilst they are not generally bilinear, they share important properties with dot products, such as the Cauchy-Schwarz inequality:
Proposition 4 If k is a positive definite kernel, and x_1, x_2 ∈ X, then

    k(x_1, x_2)² ≤ k(x_1, x_1) · k(x_2, x_2).    (9)
Proof The 2 × 2 Gram matrix with entries K_{ij} = k(x_i, x_j) is positive definite. Hence both its eigenvalues are nonnegative, and so is their product, K's determinant, i.e.,

    0 ≤ K_{11} K_{22} − K_{12} K_{21} = K_{11} K_{22} − K_{12}².    (10)

Substituting k(x_i, x_j) for K_{ij}, we get the desired inequality.
2.2.1 Construction of the Reproducing Kernel Hilbert Space

We now define a map from X into the space of functions mapping X into R, denoted as R^X, via

    Φ : X → R^X,  x ↦ k(·, x).    (11)

Here, Φ(x) = k(·, x) denotes the function that assigns the value k(x′, x) to x′ ∈ X.
We next construct a dot product space containing the images of the inputs under Φ. To this end, we first turn it into a vector space by forming linear combinations

    f(·) = Σ_{i=1}^{n} α_i k(·, x_i).    (12)

Here, n ∈ N, α_i ∈ R and x_i ∈ X are arbitrary.
Next, we define a dot product between f and another function g(·) = Σ_{j=1}^{n′} β_j k(·, x′_j) (with n′ ∈ N, β_j ∈ R and x′_j ∈ X) as

    ⟨f, g⟩ := Σ_{i=1}^{n} Σ_{j=1}^{n′} α_i β_j k(x_i, x′_j).    (13)
To see that this is well-defined although it contains the expansion coefficients and points, note that ⟨f, g⟩ = Σ_{j=1}^{n′} β_j f(x′_j). The latter, however, does not depend on the particular expansion of f. Similarly, for g, note that ⟨f, g⟩ = Σ_{i=1}^{n} α_i g(x_i). This also shows that ⟨·, ·⟩ is bilinear. It is symmetric, as ⟨f, g⟩ = ⟨g, f⟩. Moreover, it is positive definite, since positive definiteness of k implies that for any function f, written as (12), we have

    ⟨f, f⟩ = Σ_{i,j=1}^{n} α_i α_j k(x_i, x_j) ≥ 0.    (14)
Next, note that given functions f_1, . . . , f_p, and coefficients γ_1, . . . , γ_p ∈ R, we have

    Σ_{i,j=1}^{p} γ_i γ_j ⟨f_i, f_j⟩ = ⟨ Σ_{i=1}^{p} γ_i f_i, Σ_{j=1}^{p} γ_j f_j ⟩ ≥ 0.    (15)
Here, the equality follows from the bilinearity of ⟨·, ·⟩, and the inequality from (14).
By (15), ⟨·, ·⟩ is a positive definite kernel, defined on our vector space of functions. For the last step in proving that it even is a dot product, we note that by (13), for all functions (12),

    ⟨k(·, x), f⟩ = f(x), and in particular ⟨k(·, x), k(·, x′)⟩ = k(x, x′).    (16)

By virtue of these properties, k is called a reproducing kernel [Aronszajn, 1950].
Due to (16) and (9), we have

    |f(x)|² = |⟨k(·, x), f⟩|² ≤ k(x, x) · ⟨f, f⟩.    (17)

By this inequality, ⟨f, f⟩ = 0 implies f = 0, which is the last property that was left to prove in order to establish that ⟨·, ·⟩ is a dot product.
Skipping some details, we add that one can complete the space of functions (12) in the norm corresponding to the dot product, and thus obtain a Hilbert space H, called a reproducing kernel Hilbert space (RKHS).
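Since the dot product (13) only involves expansion coefficients and kernel evaluations, the reproducing property (16) can be verified numerically for any finite expansion. A small sketch (the kernel and expansion below are arbitrary illustrative choices):

```python
import numpy as np

k = lambda x, xp: np.exp(-(x - xp) ** 2)   # a positive definite kernel on R

def f_eval(alphas, xs, x):
    """Evaluate f = sum_i alpha_i k(., x_i) at the point x, cf. (12)."""
    return sum(a * k(xi, x) for a, xi in zip(alphas, xs))

def dot(alphas, xs, betas, xps):
    """The dot product (13) between two expansions f and g."""
    return sum(a * b * k(xi, xj)
               for a, xi in zip(alphas, xs)
               for b, xj in zip(betas, xps))

alphas, xs = [0.5, -1.2, 2.0], [0.0, 1.0, -0.5]   # an arbitrary f
x = 0.3
# <k(., x), f>: pair the single-term expansion of k(., x) with the expansion of f
lhs = dot([1.0], [x], alphas, xs)
rhs = f_eval(alphas, xs, x)                        # should coincide, by (16)
```

The check ⟨f, f⟩ ≥ 0 from (14) holds for the same expansion, in line with positive definiteness of k.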
One can define an RKHS as a Hilbert space H of functions on a set X with the property that for all x ∈ X and f ∈ H, the point evaluations f ↦ f(x) are continuous linear functionals (in particular, all point values f(x) are well defined, which already distinguishes RKHSs from many L₂ Hilbert spaces). From the point evaluation functional, one can then construct the reproducing kernel using the Riesz representation theorem. The Moore-Aronszajn theorem [Aronszajn, 1950] states that for every positive definite kernel on X × X, there exists a unique RKHS and vice versa.
There is an analogue of the kernel trick for distances rather than dot products, i.e., dissimilarities rather than similarities. This leads to the larger class of conditionally positive definite kernels. Those kernels are defined just like positive definite ones, with the one difference being that their Gram matrices need to satisfy (8) only subject to

    Σ_{i=1}^{n} c_i = 0.    (18)

Interestingly, it turns out that many kernel algorithms, including SVMs and kernel PCA (see Section 3), can be applied also with this larger class of kernels, due to their being translation invariant in feature space [Schölkopf and Smola, 2002, Hein et al., 2005].
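A standard example of a conditionally positive definite kernel is the negative squared Euclidean distance k(x, x′) = −∥x − x′∥². The following sketch checks the quadratic form (8) under the constraint (18); the data and coefficients are random and purely illustrative:

```python
import numpy as np

def cpd_quadratic_form(xs, c):
    """sum_ij c_i c_j k(x_i, x_j) for the kernel k(x, x') = -(x - x')^2 on R."""
    K = -np.abs(xs[:, None] - xs[None, :]) ** 2
    return c @ K @ c

rng = np.random.default_rng(1)
xs = rng.normal(size=8)
c = rng.normal(size=8)
c -= c.mean()            # enforce the constraint (18): sum_i c_i = 0
val = cpd_quadratic_form(xs, c)
```

Expanding −(x_i − x_j)² = −x_i² + 2 x_i x_j − x_j², the squared terms cancel under Σ_i c_i = 0, leaving 2(Σ_i c_i x_i)² ≥ 0, which is why `val` is nonnegative even though the kernel itself is not positive definite.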
We conclude this section with a note on terminology. In the early years of kernel machine learning research, it was not the notion of positive definite kernels that was being used. Instead, researchers considered kernels satisfying the conditions of Mercer's theorem [Mercer, 1909]; see e.g. Vapnik [1998] and Cristianini and Shawe-Taylor [2000]. However, whilst all such kernels do satisfy (3), the converse is not true. Since (3) is what we are interested in, positive definite kernels are thus the right class of kernels to consider.
2.2.2 Properties of Positive Definite Kernels

We begin with some closure properties of the set of positive definite kernels.

Proposition 5 Below, k_1, k_2, . . . are arbitrary positive definite kernels on X × X, where X is a nonempty set.
(i) The set of positive definite kernels is a closed convex cone, i.e., (a) if α_1, α_2 ≥ 0, then α_1 k_1 + α_2 k_2 is positive definite; and (b) if k(x, x′) := lim_{n→∞} k_n(x, x′) exists for all x, x′, then k is positive definite.
(ii) The pointwise product k_1 k_2 is positive definite.
(iii) Assume that for i = 1, 2, k_i is a positive definite kernel on X_i × X_i, where X_i is a nonempty set. Then the tensor product k_1 ⊗ k_2 and the direct sum k_1 ⊕ k_2 are positive definite kernels on (X_1 × X_2) × (X_1 × X_2).
The proofs can be found in Berg et al. [1984]. We only give a short proof of (ii), a result due to Schur (see e.g., Berg and Forst [1975]). Denote by K, L the positive definite Gram matrices of k_1, k_2 with respect to some data set x_1, . . . , x_n. Being positive definite, L can be expressed as SS⊤ in terms of an n × n matrix S (e.g., the positive definite square root of L). We thus have

    L_{ij} = Σ_{m=1}^{n} S_{im} S_{jm},    (19)

and therefore, for any c_1, . . . , c_n ∈ R,

    Σ_{i,j=1}^{n} c_i c_j K_{ij} L_{ij} = Σ_{i,j=1}^{n} c_i c_j K_{ij} Σ_{m=1}^{n} S_{im} S_{jm} = Σ_{m=1}^{n} Σ_{i,j=1}^{n} (c_i S_{im})(c_j S_{jm}) K_{ij} ≥ 0,    (20)

where the inequality follows from (8).
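The closure properties of Proposition 5 are easy to check numerically on Gram matrices: the minimum eigenvalue of a nonnegative combination, and of the elementwise (Schur) product, stays nonnegative. A sketch with two illustrative kernels:

```python
import numpy as np

def gram(k, xs):
    """Gram matrix (7) of kernel k on the points xs."""
    return np.array([[k(a, b) for b in xs] for a in xs])

def min_eig(K):
    return np.linalg.eigvalsh(K).min()

xs = [0.0, 0.4, 1.1, 2.5, 3.0]
K = gram(lambda x, xp: np.exp(-(x - xp) ** 2), xs)   # Gaussian kernel
L = gram(lambda x, xp: (x * xp + 1.0) ** 2, xs)      # inhomogeneous polynomial, cf. (29)

sum_eig = min_eig(2.0 * K + 3.0 * L)   # Proposition 5(i)(a): conic combination
prod_eig = min_eig(K * L)              # Proposition 5(ii): pointwise (Schur) product
```

Here `K * L` is the elementwise product, exactly the matrix (K_{ij} L_{ij}) appearing in (20).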
It is reassuring that sums and products of positive definite kernels are positive definite. We will now explain that, loosely speaking, there are no other operations that preserve positive definiteness. To this end, let C denote the set of all functions ψ : R → R that map positive definite kernels to (conditionally) positive definite kernels (readers who are not interested in the case of conditionally positive definite kernels may ignore the term in parentheses). We define

    C := {ψ | k is a positive definite kernel ⇒ ψ(k) is a (conditionally) positive definite kernel}.

Moreover, define

    C′ := {ψ | for any Hilbert space F, ψ(⟨x, x′⟩_F) is (conditionally) positive definite}

and

    C″ := {ψ | for all n ∈ N: K is a positive definite n × n matrix ⇒ ψ(K) is (conditionally) positive definite},

where ψ(K) is the n × n matrix with elements ψ(K_{ij}).
Proposition 6 C = C′ = C″.

Proof Let ψ ∈ C. For any Hilbert space F, ⟨x, x′⟩_F is positive definite, thus ψ(⟨x, x′⟩_F) is (conditionally) positive definite. Hence ψ ∈ C′.
Next, assume ψ ∈ C′. Consider, for n ∈ N, an arbitrary positive definite n × n matrix K. We can express K as the Gram matrix of some x_1, . . . , x_n ∈ Rⁿ, i.e., K_{ij} = ⟨x_i, x_j⟩_{Rⁿ}. Since ψ ∈ C′, we know that ψ(⟨x, x′⟩_{Rⁿ}) is a (conditionally) positive definite kernel, hence in particular the matrix ψ(K) is (conditionally) positive definite, thus ψ ∈ C″.
Finally, assume ψ ∈ C″. Let k be a positive definite kernel, n ∈ N, c_1, . . . , c_n ∈ R, and x_1, . . . , x_n ∈ X. Then

    Σ_{ij} c_i c_j ψ(k(x_i, x_j)) = Σ_{ij} c_i c_j (ψ(K))_{ij} ≥ 0    (21)

(for Σ_i c_i = 0 in the conditionally positive definite case), where the inequality follows from ψ ∈ C″. Therefore, ψ(k) is (conditionally) positive definite, and hence ψ ∈ C.
The following proposition follows from a result of FitzGerald et al. [1995] for (conditionally) positive definite matrices; by Proposition 6, it also applies to (conditionally) positive definite kernels, and to functions of dot products. We state the latter case.
Proposition 7 Let ψ : R → R. Then ψ(⟨x, x′⟩_F) is positive definite for any Hilbert space F if and only if ψ is real entire of the form

    ψ(t) = Σ_{n=0}^{∞} a_n tⁿ    (22)

with a_n ≥ 0 for n ≥ 0.
Moreover, ψ(⟨x, x′⟩_F) is conditionally positive definite for any Hilbert space F if and only if ψ is real entire of the form (22) with a_n ≥ 0 for n ≥ 1.

There are further properties of k that can be read off the coefficients a_n.
• Steinwart [2002a] showed that if all a_n are strictly positive, then the kernel of Proposition 7 is universal on every compact subset S of R^d in the sense that its RKHS is dense in the space of continuous functions on S in the ∥·∥_∞ norm. For support vector machines using universal kernels, he then shows (universal) consistency [Steinwart, 2002b]. Examples of universal kernels are (23) and (24) below.
• In Lemma 13, we will show that the a_0 term does not affect an SVM. Hence we infer that it is actually sufficient for consistency to have a_n > 0 for n ≥ 1.
• Let I_even and I_odd denote the sets of even and odd indices i with the property that a_i > 0, respectively. Pinkus [2004] showed that for any F, the kernel of Proposition 7 is strictly positive definite if and only if a_0 > 0 and neither I_even nor I_odd is finite.² Moreover, he states that the necessary and sufficient conditions for universality of the kernel are that in addition both Σ_{i∈I_even} 1/i and Σ_{i∈I_odd} 1/i diverge.
While the proofs of the above statements are fairly involved, one can make certain aspects plausible using simple arguments. For instance [Pinkus, 2004], suppose k is not strictly positive definite. Then we know that for some c ≠ 0 we have Σ_{ij} c_i c_j k(x_i, x_j) = 0. This means that ⟨ Σ_i c_i k(x_i, ·), Σ_j c_j k(x_j, ·) ⟩ = 0, implying that Σ_i c_i k(x_i, ·) = 0. For any f = Σ_j α_j k(x_j, ·), this implies that

    Σ_i c_i f(x_i) = ⟨ Σ_i c_i k(x_i, ·), Σ_j α_j k(x_j, ·) ⟩ = 0.

This equality thus holds for any function in the RKHS. Therefore, the RKHS cannot lie dense in the space of continuous functions on X, and k is thus not universal.
We thus know that if k is universal, it is necessarily strictly positive definite. The latter, in turn, implies that its RKHS H is infinite-dimensional, since otherwise the maximal rank of the Gram matrix of k with respect to a data set in H would equal its dimensionality. If we assume that F is finite-dimensional, then infinite dimensionality of H implies that infinitely many of the a_i in (22) must be strictly positive. If only finitely many of the a_i in (22) were positive, we could construct the feature space H of ψ(⟨x, x′⟩) by taking finitely many tensor products and (scaled) direct sums of copies of F, and H would be finite-dimensional.³
We conclude the section with an example of a kernel which is positive definite by Proposition 7. To this end, let X be a dot product space. The power series expansion of ψ(x) = eˣ then tells us that

    k(x, x′) = e^{⟨x, x′⟩/σ²}    (23)
² One might be tempted to think that given some strictly positive definite kernel k with feature map Φ and feature space H, we could set F to equal H and consider ψ(⟨Φ(x), Φ(x′)⟩_H). In this case, it would seem that choosing a_1 = 1 and a_n = 0 for n ≠ 1 should give us a strictly positive definite kernel which violates Pinkus' conditions. However, the latter kernel is only strictly positive definite on Φ(X), which is a weaker statement than it being strictly positive definite on all of H.
³ Recall that the dot product of F ⊗ F is the square of the original dot product, and the one of F ⊕ F′, where F′ is another dot product space, is the sum of the original dot products.
is positive definite [Haussler, 1999]. If we further multiply k with the positive definite kernel f(x)f(x′), where f(x) = e^{−∥x∥²/(2σ²)} and σ > 0, this leads to the positive definiteness of the Gaussian kernel

    k′(x, x′) = k(x, x′) f(x) f(x′) = e^{−∥x − x′∥²/(2σ²)}.    (24)
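The normalization step leading from (23) to (24) is a short calculation: 2⟨x, x′⟩ − ∥x∥² − ∥x′∥² = −∥x − x′∥². A minimal numerical sketch of this identity (the points and σ are arbitrary):

```python
import numpy as np

sigma = 1.3
exp_kernel = lambda x, xp: np.exp(np.dot(x, xp) / sigma ** 2)   # (23)
f = lambda x: np.exp(-np.dot(x, x) / (2 * sigma ** 2))

def gaussian(x, xp):
    """(24): the exponential kernel normalized by f(x) f(x')."""
    return exp_kernel(x, xp) * f(x) * f(xp)

x, xp = np.array([1.0, -0.5]), np.array([0.2, 0.7])
direct = np.exp(-np.sum((x - xp) ** 2) / (2 * sigma ** 2))
via_product = gaussian(x, xp)
```

A side effect of the normalization is that k′(x, x) = 1 for all x, i.e., the Gaussian kernel is a normalized kernel.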
2.2.3 Properties of Positive Definite Functions

We now let X = R^d and consider positive definite kernels of the form

    k(x, x′) = h(x − x′),    (25)

in which case h is called a positive definite function. The following characterization is due to Bochner [1933]; see also Rudin [1962]. We state it in the form given by Wendland [2005].

Theorem 8 A continuous function h on R^d is positive definite if and only if there exists a finite nonnegative Borel measure µ on R^d such that

    h(x) = ∫_{R^d} e^{−i⟨x, ω⟩} dµ(ω).    (26)

Whilst normally formulated for complex-valued functions, the theorem also holds true for real functions. Note, however, that if we start with an arbitrary nonnegative Borel measure, its Fourier transform may not be real. Real-valued positive definite functions are distinguished by the fact that the corresponding measures µ are symmetric.
We may normalize h such that h(0) = 1 (hence by (9), |h(x)| ≤ 1), in which case µ is a probability measure and h is its characteristic function. For instance, if µ is a normal distribution of the form (2π/σ²)^{−d/2} e^{−σ²∥ω∥²/2} dω, then the corresponding positive definite function is the Gaussian e^{−∥x∥²/(2σ²)}; cf. (24).
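This instance of (26) can be checked by Monte Carlo: sampling ω from the normal measure µ = N(0, I/σ²) and averaging e^{−i⟨x, ω⟩} should recover the Gaussian e^{−∥x∥²/(2σ²)}. A sketch (sample size and points are illustrative):

```python
import numpy as np

d, sigma = 2, 1.5
rng = np.random.default_rng(0)
# mu is the normal distribution N(0, I/sigma^2) from the text
omega = rng.normal(scale=1.0 / sigma, size=(200_000, d))

def h_mc(x):
    """Monte Carlo estimate of (26): h(x) ~ mean over omega of exp(-i <x, omega>)."""
    return np.exp(-1j * omega @ x).mean().real   # mu symmetric, so h is real

x = np.array([0.8, -0.3])
estimate = h_mc(x)
exact = np.exp(-np.dot(x, x) / (2 * sigma ** 2))
```

With 200,000 samples the Monte Carlo error is on the order of 10⁻³, so estimate and exact agree to about two decimal places.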
Bochner's theorem allows us to interpret the similarity measure k(x, x′) = h(x − x′) in the frequency domain. The choice of the measure µ determines which frequency components occur in the kernel. Since the solutions of kernel algorithms will turn out to be finite kernel expansions, the measure µ will thus determine which frequencies occur in the estimates, i.e., it will determine their regularization properties — more on that in Section 2.3.3 below.
A proof of the theorem can be found for instance in Rudin [1962]. One part of the theorem is easy to prove: if h takes the form (26), then

    Σ_{ij} a_i a_j h(x_i − x_j) = Σ_{ij} a_i a_j ∫_{R^d} e^{−i⟨x_i − x_j, ω⟩} dµ(ω) = ∫_{R^d} | Σ_i a_i e^{−i⟨x_i, ω⟩} |² dµ(ω) ≥ 0.    (27)
Bochner's theorem generalizes earlier work of Mathias, and has itself been generalized in various ways, e.g. by Schoenberg [1938]. An important generalization considers Abelian semigroups (e.g., Berg et al. [1984]). In that case, the theorem provides an integral representation of positive definite functions in terms of the semigroup's semicharacters. Further generalizations were given by Krein, for the cases of positive definite kernels and functions with a limited number of negative squares (see Stewart [1976] for further details and references).
As above, there are conditions that ensure that the positive definiteness becomes strict.

Proposition 9 [Wendland, 2005] A positive definite function is strictly positive definite if the carrier of the measure in its representation (26) contains an open subset.
This implies that the Gaussian kernel is strictly positive definite.
An important special case of positive definite functions, which includes the Gaussian, are radial basis functions. These are functions that can be written as h(x) = g(∥x∥²) for some function g : [0, ∞[ → R. They have the property of being invariant under the Euclidean group. If we would like to have the additional property of compact support, which is computationally attractive in a number of large-scale applications, the following result becomes relevant.

Proposition 10 [Wendland, 2005] Assume that g : [0, ∞[ → R is continuous and compactly supported. Then h(x) = g(∥x∥²) cannot be positive definite on every R^d.
2.2.4 Examples of Kernels

We have already seen several instances of positive definite kernels, and now intend to complete our selection with a few more examples. In particular, we discuss polynomial kernels, convolution kernels, ANOVA expansions, and kernels on documents.
Polynomial Kernels From Proposition 5 it is clear that homogeneous polynomial kernels k(x, x′) = ⟨x, x′⟩^p are positive definite for p ∈ N and x, x′ ∈ R^d. By direct calculation, we can derive the corresponding feature map [Poggio, 1975]:

    ⟨x, x′⟩^p = ( Σ_{j=1}^{d} [x]_j [x′]_j )^p = Σ_{j ∈ [d]^p} [x]_{j_1} · · · [x]_{j_p} [x′]_{j_1} · · · [x′]_{j_p} = ⟨C_p(x), C_p(x′)⟩,    (28)
where C_p maps x ∈ R^d to the vector C_p(x) whose entries are all possible p-th degree ordered products of the entries of x (note that [d] is used as a shorthand for {1, . . . , d}). The polynomial kernel of degree p thus computes a dot product in the space spanned by all monomials of degree p in the input coordinates.
Other useful kernels include the inhomogeneous polynomial,

    k(x, x′) = (⟨x, x′⟩ + c)^p  where p ∈ N and c ≥ 0,    (29)

which computes all monomials up to degree p.
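The identity (28) can be verified directly for small d and p by materializing the feature map C_p as the vector of all ordered degree-p products. A sketch (the inputs are arbitrary test points):

```python
import itertools
import numpy as np

def C_p(x, p):
    """Feature map of (28): all ordered degree-p products of the entries of x."""
    d = len(x)
    return np.array([np.prod([x[j] for j in js])
                     for js in itertools.product(range(d), repeat=p)])

x, xp, p = np.array([1.0, 2.0, -1.0]), np.array([0.5, -1.0, 2.0]), 3
kernel_value = np.dot(x, xp) ** p                      # left-hand side of (28)
feature_value = np.dot(C_p(x, p), C_p(xp, p))          # right-hand side of (28)
```

Note that C_p(x) has d^p entries (here 3³ = 27), which is precisely why evaluating ⟨x, x′⟩^p directly, rather than via the feature map, is attractive.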
Spline Kernels It is possible to obtain spline functions as a result of kernel expansions [Smola, 1996, Vapnik et al., 1997] simply by noting that convolution of an even number of indicator functions yields a positive definite kernel function. Denote by I_X the indicator (or characteristic) function on the set X, and denote by ⊗ the convolution operation, (f ⊗ g)(x) := ∫_{R^d} f(x′) g(x′ − x) dx′. Then the B-spline kernels are given by

    k(x, x′) = B_{2p+1}(x − x′)  where p ∈ N,  with B_{i+1} := B_i ⊗ B_0.    (30)

Here B_0 is the characteristic function on the unit ball⁴ in R^d. From the definition (30) it is obvious that for odd m we may write B_m as an inner product between functions B_{m/2}. Moreover, note that for even m, B_m is not a kernel.
Convolutions and Structures Let us now move to kernels defined on structured objects [Haussler, 1999, Watkins, 2000]. Suppose the object x ∈ X is composed of x_p ∈ X_p, where p ∈ [P] (note that the sets X_p need not be equal). For instance, consider the string x = ATG, and P = 2. It is composed of the parts x_1 = AT and x_2 = G, or alternatively, of x_1 = A and x_2 = TG. Mathematically speaking, the set of "allowed" decompositions can be thought of as a relation R(x_1, . . . , x_P, x), to be read as "x_1, . . . , x_P constitute the composite object x."
⁴ Note that in R one typically uses I_{[−1/2, 1/2]}.
Haussler [1999] investigated how to define a kernel between composite objects by building on similarity measures that assess their respective parts; in other words, kernels k_p defined on X_p × X_p. Define the R-convolution of k_1, . . . , k_P as

    [k_1 ⋆ · · · ⋆ k_P](x, x′) := Σ_{x̄ ∈ R(x), x̄′ ∈ R(x′)} Π_{p=1}^{P} k_p(x̄_p, x̄′_p),    (31)

where the sum runs over all possible ways R(x) and R(x′) in which we can decompose x into x̄_1, . . . , x̄_P and x′ analogously.⁵ If there is only a finite number of ways, the relation R is called finite. In this case, it can be shown that the R-convolution is a valid kernel [Haussler, 1999].
ANOVA Kernels Specific examples of convolution kernels are Gaussians and ANOVA kernels [Wahba, 1990, Vapnik, 1998]. To construct an ANOVA kernel, we consider X = S^N for some set S, and kernels k^{(i)} on S × S, where i = 1, . . . , N. For P = 1, . . . , N, the ANOVA kernel of order P is defined as

    k_P(x, x′) := Σ_{1 ≤ i_1 < · · · < i_P ≤ N} Π_{p=1}^{P} k^{(i_p)}(x_{i_p}, x′_{i_p}).    (32)
Note that if P = N, the sum consists only of the term for which (i_1, . . . , i_P) = (1, . . . , N), and k equals the tensor product k^{(1)} ⊗ · · · ⊗ k^{(N)}. At the other extreme, if P = 1, then the products collapse to one factor each, and k equals the direct sum k^{(1)} ⊕ · · · ⊕ k^{(N)}. For intermediate values of P, we get kernels that lie in between tensor products and direct sums.
ANOVA kernels typically use some moderate value of P, which specifies the order of the interactions between attributes x_{i_p} that we are interested in. The sum then runs over the numerous terms that take into account interactions of order P; fortunately, the computational cost can be reduced to O(Pd) by utilizing recurrent procedures for the kernel evaluation. ANOVA kernels have been shown to work rather well in multidimensional SV regression problems [Stitson et al., 1999].
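One such recurrent procedure is the elementary-symmetric-polynomial recursion: writing z_i = k^{(i)}(x_i, x′_i), the sum (32) over index subsets of size P satisfies a simple dynamic program. This is a sketch of that idea, cross-checked against a brute-force enumeration of (32); the base-kernel values z are illustrative:

```python
import numpy as np
from itertools import combinations

def anova_dp(z, P):
    """Order-P ANOVA kernel (32) from the per-coordinate values
    z[i] = k^(i)(x_i, x'_i), via an O(P N) dynamic program."""
    N = len(z)
    dp = np.zeros((P + 1, N + 1))
    dp[0, :] = 1.0                     # empty product contributes 1
    for p in range(1, P + 1):
        for i in range(1, N + 1):
            # coordinate i is either absent, or contributes the factor z[i-1]
            dp[p, i] = dp[p, i - 1] + z[i - 1] * dp[p - 1, i - 1]
    return dp[P, N]

def anova_bruteforce(z, P):
    """Direct evaluation of (32): sum over all size-P index subsets."""
    return sum(np.prod([z[i] for i in js]) for js in combinations(range(len(z)), P))

z = [0.9, 1.4, 0.3, 2.0, 1.1]   # illustrative base-kernel values
val = anova_dp(z, 2)
```

The brute force sums over C(N, P) subsets, while the recursion needs only O(PN) operations, matching the cost quoted in the text.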
Brownian Bridge and Related Kernels The Brownian bridge kernel min(x, x′) defined on R likewise is a positive definite kernel. Note that one-dimensional linear spline kernels with an infinite number of nodes [Vapnik et al., 1997] for x, x′ ∈ R₀⁺ are given by

    k(x, x′) = min(x, x′)³/3 + min(x, x′)² |x − x′|/2 + 1 + x x′.    (33)
These kernels can be used as basis functions k^{(n)} in ANOVA expansions. Note that it is advisable to use a k^{(n)} which never or rarely takes the value zero, since a single zero term would eliminate the product in (32).
Bag of Words One way in which SVMs have been used for text categorization [Joachims, 2002] is the bag-of-words representation. This maps a given text to a sparse vector, where each component corresponds to a word, and a component is set to one (or some other number) whenever the related word occurs in the text. Using an efficient sparse representation, the dot product between two such vectors can be computed quickly. Furthermore, this dot product is by construction a valid kernel, referred to as a sparse vector kernel. One of its shortcomings, however, is that it does not take into account the word ordering of a document. Other sparse vector kernels are also conceivable, such as one that maps a text to the set of pairs of words that are in the same sentence [Joachims, 1998, Watkins, 2000], or those which look only at pairs of words within a certain vicinity with respect to each other [Sim, 2001].
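A sparse bag-of-words kernel can be sketched in a few lines: represent each document by its word counts and sum products only over the words the two documents share. Whitespace tokenization is an illustrative simplification, not a prescription from the text:

```python
from collections import Counter

def bow_kernel(doc1, doc2):
    """Sparse-vector (bag-of-words) kernel: dot product of term-count vectors."""
    c1, c2 = Counter(doc1.lower().split()), Counter(doc2.lower().split())
    # iterate only over the words both documents share (sparse dot product)
    return sum(c1[w] * c2[w] for w in c1.keys() & c2.keys())

k12 = bow_kernel("the cat sat on the mat", "the dog sat on the log")
```

The cost is linear in the number of distinct shared terms, which is what makes the sparse representation attractive; as noted above, all word-order information is discarded.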
⁵ We use the convention that an empty sum equals zero, hence if either x or x′ cannot be decomposed, then [k_1 ⋆ · · · ⋆ k_P](x, x′) = 0.
n-grams and Suffix Trees A more sophisticated way of dealing with string data was proposed [Watkins, 2000, Haussler, 1999]. The basic idea is as described above for general structured objects (31): Compare the strings by means of the substrings they contain. The more substrings two strings have in common, the more similar they are. The substrings need not always be contiguous; that said, the further apart the first and last element of a substring are, the less weight should be given to the similarity. Depending on the specific choice of a similarity measure it is possible to define more or less efficient kernels which compute the dot product in the feature space spanned by all substrings of documents.

Consider a finite alphabet Σ, the set of all strings of length n, Σ^n, and the set of all finite strings, Σ* := ∪_{n=0}^∞ Σ^n. The length of a string s ∈ Σ* is denoted by |s|, and its elements by s(1) . . . s(|s|); the concatenation of s and t ∈ Σ* is written st. Denote by

k(x, x') = Σ_s #(x, s) #(x', s) c_s

a string kernel computed from exact matches. Here #(x, s) is the number of occurrences of s in x and c_s ≥ 0.
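A naive sketch of this exact-match string kernel (our names; all substrings up to a length bound are enumerated directly, whereas the suffix-tree algorithms discussed next avoid this quadratic enumeration):

```python
from collections import Counter

def substring_kernel(x, y, max_len=3, weight=lambda s: 1.0):
    """Exact-match string kernel k(x, x') = sum_s #(x,s) #(x',s) c_s.

    `weight` plays the role of the coefficients c_s >= 0 (here all 1).
    Naive O(|x|^2 + |y|^2) substring enumeration, for illustration only.
    """
    def counts(s):
        c = Counter()
        for i in range(len(s)):
            for j in range(i + 1, min(i + max_len, len(s)) + 1):
                c[s[i:j]] += 1
        return c
    cx, cy = counts(x), counts(y)
    return sum(n * cy[s] * weight(s) for s, n in cx.items())
```

For example, "ab" shares the substrings a, b and ab with itself, each occurring once, so k("ab", "ab") = 3 with unit weights.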
Vishwanathan and Smola [2004] provide an algorithm using suffix trees, which allows one to compute for arbitrary c_s the value of the kernel k(x, x') in O(|x| + |x'|) time and memory. Moreover, also f(x) = ⟨w, Φ(x)⟩ can be computed in O(|x|) time if preprocessing linear in the size of the support vectors is carried out. These kernels are then applied to function prediction (according to the gene ontology) of proteins using only their sequence information. Another prominent application of string kernels is in the field of splice form prediction and gene finding [Rätsch et al., 2007].
For inexact matches of a limited degree, typically up to ε = 3, and strings of bounded length a similar data structure can be built by explicitly generating a dictionary of strings and their neighborhood in terms of a Hamming distance [Leslie et al., 2002b,a]. These kernels are defined by replacing #(x, s) by a mismatch function #(x, s, ε) which reports the number of approximate occurrences of s in x. By trading off computational complexity with storage (hence the restriction to small numbers of mismatches) essentially linear-time algorithms can be designed. Whether a general purpose algorithm exists which allows for efficient comparisons of strings with mismatches in linear time is still an open question. Tools from approximate string matching [Navarro and Raffinot, 1999, Cole and Hariharan, 2000, Gusfield, 1997] promise to be of help in this context.
Mismatch Kernels In the general case it is only possible to find algorithms whose complexity is linear in the lengths of the documents being compared, and the length of the substrings, i.e. O(|x| |x'|) or worse. We now describe such a kernel with a specific choice of weights [Watkins, 2000, Cristianini and Shawe-Taylor, 2000].
Let us now form subsequences u of strings. Given an index sequence i := (i_1, . . . , i_{|u|}) with 1 ≤ i_1 < · · · < i_{|u|} ≤ |s|, we define u := s(i) := s(i_1) . . . s(i_{|u|}). We call l(i) := i_{|u|} − i_1 + 1 the length of the subsequence in s. Note that if i is not contiguous, then l(i) > |u|.
The feature space built from strings of length n is defined to be H_n := R^(Σ^n). This notation means that the space has one dimension (or coordinate) for each element of Σ^n, labelled by that element (equivalently, we can think of it as the space of all real-valued functions on Σ^n). We can thus describe the feature map coordinate-wise for each u ∈ Σ^n via

[Φ_n(s)]_u := Σ_{i: s(i)=u} λ^{l(i)}. (34)

Here, 0 < λ ≤ 1 is a decay parameter: The larger the length of the subsequence in s, the smaller the respective contribution to [Φ_n(s)]_u. The sum runs over all subsequences of s which equal u.
For instance, consider a dimension of H_3 spanned (that is, labelled) by the string asd. In this case, we have [Φ_3(Nasdaq)]_asd = λ^3, while [Φ_3(lass das)]_asd = 2λ^5.⁶ The kernel induced by the map Φ_n takes the form

k_n(s, t) = Σ_{u∈Σ^n} [Φ_n(s)]_u [Φ_n(t)]_u = Σ_{u∈Σ^n} Σ_{(i,j): s(i)=t(j)=u} λ^{l(i)} λ^{l(j)}. (35)
To describe the actual computation of k_n, define

k'_i(s, t) := Σ_{u∈Σ^i} Σ_{(i,j): s(i)=t(j)=u} λ^{|s|+|t|−i_1−j_1+2}, for i = 1, . . . , n − 1. (36)
Using x ∈ Σ^1, we then have the following recursions, which allow the computation of k_n(s, t) for all n = 1, 2, . . . (note that the kernels are symmetric):

k'_0(s, t) = 1 for all s, t
k'_i(s, t) = 0 if min(|s|, |t|) < i
k_i(s, t) = 0 if min(|s|, |t|) < i
k'_i(sx, t) = λ k'_i(s, t) + Σ_{j: t_j = x} k'_{i−1}(s, t[1, . . . , j − 1]) λ^{|t|−j+2}, i = 1, . . . , n − 1
k_n(sx, t) = k_n(s, t) + Σ_{j: t_j = x} k'_{n−1}(s, t[1, . . . , j − 1]) λ^2 (37)

The string kernel k_n can be computed using dynamic programming, see Watkins [2000], Lodhi et al. [2000], and Durbin et al. [1998].
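A direct transcription of the recursions (37) into a dynamic program (our naming; this naive version costs O(n |s| |t|^2), and Lodhi et al. describe an O(n |s| |t|) variant). As a check: for s = cat, t = car and n = 2 the only common length-2 subsequence is ca, contiguous in both strings, so k_2 = λ^2 · λ^2 = λ^4:

```python
def subseq_kernel(s, t, n, lam):
    """Gap-weighted subsequence kernel k_n(s, t) via the recursions (37).

    kp[i][a][b] holds the auxiliary kernel k'_i on prefixes s[:a], t[:b];
    the final loop applies the last recursion of (37) over prefixes of s.
    """
    S, T = len(s), len(t)
    # k'_0(s, t) = 1 for all s, t
    kp = [[[1.0] * (T + 1) for _ in range(S + 1)]]
    for i in range(1, n):
        cur = [[0.0] * (T + 1) for _ in range(S + 1)]
        for a in range(1, S + 1):
            for b in range(1, T + 1):
                # k'_i(sx, t) = lam k'_i(s, t)
                #   + sum_{j: t_j = x} k'_{i-1}(s, t[1..j-1]) lam^{|t|-j+2}
                cur[a][b] = lam * cur[a - 1][b] + sum(
                    kp[i - 1][a - 1][j] * lam ** (b - j + 1)
                    for j in range(b) if t[j] == s[a - 1])
        kp.append(cur)
    # k_n(sx, t) = k_n(s, t) + sum_{j: t_j = x} k'_{n-1}(s, t[1..j-1]) lam^2
    k = 0.0
    for a in range(1, S + 1):
        k += sum(kp[n - 1][a - 1][j] * lam ** 2
                 for j in range(T) if t[j] == s[a - 1])
    return k
```

With λ = 0.5 this gives k_2(cat, car) = 0.5^4 = 0.0625.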
The above kernels on strings, suffix trees, mismatches and trees have been used in sequence analysis. This includes applications in document analysis and categorization, spam filtering, function prediction in proteins, annotation of DNA sequences for the detection of introns and exons, named entity tagging of documents, and the construction of parse trees.
Locality Improved Kernels It is possible to adjust kernels to the structure of spatial data. Recall the Gaussian RBF and polynomial kernels. When applied to an image, it makes no difference whether one uses as x the image or a version of x where all locations of the pixels have been permuted. This indicates that the function space on X induced by k does not take advantage of the locality properties of the data. By taking advantage of the local structure, estimates can be improved. On biological sequences [Zien et al., 2000] one may assign more weight to the entries of the sequence close to the location where estimates should occur.

For images, local interactions between image patches need to be considered. One way is to use the pyramidal kernel [Schölkopf, 1997, DeCoste and Schölkopf, 2002]. It takes inner products between corresponding image patches, then raises the latter to some power p_1, and finally raises their sum to another power p_2. While the overall degree of this kernel is p_1 p_2, the first factor p_1 only captures short-range interactions.
Tree Kernels We now discuss similarity measures on more structured objects. For trees Collins and Duffy [2001] propose a decomposition method which maps a tree x into its set of subtrees. The kernel between two trees x, x' is then computed by taking a weighted sum of all matching subtree terms between both trees. In particular, Collins and Duffy [2001] show a quadratic time algorithm, i.e. O(|x| |x'|), to compute this expression, where |x| is the number of nodes of the tree. When restricting the sum to all proper rooted subtrees it is possible to reduce the computational cost to O(|x| + |x'|) time by means of a tree-to-string conversion [Vishwanathan and Smola, 2004].
6 In the first string, asd is a contiguous substring. In the second string, it appears twice as a non-contiguous substring of length 5 in lass das: the two occurrences differ in which of the two s characters of lass is used.
Graph Kernels Graphs pose a twofold challenge: one may both design a kernel on the vertices of a graph and also a kernel between graphs. In the former case, the graph itself becomes the object defining the metric between the vertices. See Gärtner [2003], Kashima et al. [2004], and Kashima et al. [2003] for details on the latter. In the following we discuss kernels on graphs.

Denote by W ∈ R^{n×n} the adjacency matrix of a graph with W_ij > 0 if an edge between i, j exists. Moreover, assume for simplicity that the graph is undirected, that is W⊤ = W (see Zhou et al. [2005] for extensions to directed graphs). Denote by L = D − W the graph Laplacian and by L̃ = 1 − D^{−1/2} W D^{−1/2} the normalized graph Laplacian. Here D is a diagonal matrix with D_ii = Σ_j W_ij denoting the degree of vertex i.
Fiedler [1973] showed that the second largest eigenvector of L approximately decomposes the graph into two parts according to the signs of its entries. The other large eigenvectors partition the graph into correspondingly smaller portions. L arises from the fact that for a function f defined on the vertices of the graph

Σ_{i,j} W_ij (f(i) − f(j))^2 = 2 f⊤Lf.

Finally, Smola and Kondor [2003] show that under mild conditions and up to rescaling, L is the only quadratic permutation-invariant form which can be obtained as a linear function of W.
Hence it is reasonable to consider kernel matrices K obtained from L (and L̃). Smola and Kondor [2003] suggest kernels K = r(L) or K = r(L̃), which have desirable smoothness properties. Here r : [0, ∞) → [0, ∞) is a monotonically decreasing function. Popular choices include

r(ξ) = exp(−λξ)  diffusion kernel (38)
r(ξ) = (ξ + λ)^{−1}  regularized graph Laplacian (39)
r(ξ) = (λ − ξ)^p  p-step random walk (40)

where λ > 0 is chosen such as to reflect the amount of diffusion in (38), the degree of regularization in (39), or the weighting of steps within a random walk (40), respectively. Eq. (38) was proposed by Kondor and Lafferty [2002]. In Section 2.3.3 we will discuss the connection between regularization operators and kernels in R^n. Without going into details, the function r(ξ) describes the smoothness properties on the graph and L plays the role of the Laplace operator.
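The p-step random walk choice (40) is the easiest to sketch, since applying r to L then reduces to a matrix power (our names; pure-Python matrices, and positive semidefiniteness is only guaranteed when λ is at least the largest eigenvalue of L):

```python
def laplacian(W):
    """Graph Laplacian L = D - W, with D_ii = sum_j W_ij the degree of i."""
    n = len(W)
    return [[(sum(W[i]) if i == j else 0.0) - W[i][j] for j in range(n)]
            for i in range(n)]

def p_step_kernel(W, lam, p):
    """p-step random walk kernel K = r(L) with r(xi) = (lam - xi)^p,
    Eq. (40): applying r to L means forming the matrix power (lam*1 - L)^p."""
    L = laplacian(W)
    n = len(L)
    M = [[(lam if i == j else 0.0) - L[i][j] for j in range(n)]
         for i in range(n)]
    K = [row[:] for row in M]
    for _ in range(p - 1):   # repeated multiplication: K <- K M
        K = [[sum(K[i][r] * M[r][j] for r in range(n)) for j in range(n)]
             for i in range(n)]
    return K
```

For the two-vertex path graph, L = [[1, −1], [−1, 1]] and λ = 2 give K = 1-matrix of ones for p = 1.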
Kernels on Sets and Subspaces Whenever each observation x_i consists of a set of instances we may use a range of methods to capture the specific properties of these sets (for an overview, see Vishwanathan et al. [2006]):

• Take the average of the elements of the set in feature space, that is, φ(x_i) = (1/n) Σ_j φ(x_ij) [Gärtner et al., 2002]. This yields good performance in the area of multi-instance learning.
• Jebara and Kondor [2003] extend the idea by dealing with distributions p_i(x) such that φ(x_i) = E[φ(x)] where x ∼ p_i(x). They apply it to image classification with missing pixels.
• Alternatively, one can study angles enclosed by subspaces spanned by the observations [Wolf and Shashua, 2003, Martin, 2000, Cock and Moor, 2002]. In a nutshell, if U, U' denote the orthogonal matrices spanning the subspaces of x and x' respectively, then k(x, x') = det U⊤U'.
• Vishwanathan et al. [2006] extend this to arbitrary compound matrices (i.e. matrices composed of subdeterminants of matrices) and dynamical systems. Their result exploits the Binet-Cauchy theorem, which states that compound matrices are a representation of the matrix group, i.e. C_q(AB) = C_q(A) C_q(B). This leads to kernels of the form k(x, x') = tr C_q(AB). Note that for q = dim A we recover the determinant kernel of Wolf and Shashua [2003].
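The first bullet is simple to sketch: averaging in feature space turns the set kernel into a double sum over a base kernel, so no explicit feature map is needed (names and the Gaussian RBF base kernel are our choices):

```python
import math

def rbf(x, y, gamma=1.0):
    """Gaussian RBF base kernel on tuples of coordinates."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))

def set_kernel(X, Y, base=rbf):
    """Mean-map set kernel: <(1/n) sum_i phi(x_i), (1/m) sum_j phi(y_j)>
    = (1/(n m)) sum_{i,j} base(x_i, y_j)."""
    return sum(base(x, y) for x in X for y in Y) / (len(X) * len(Y))
```

Duplicating elements of a set leaves the mean map, and hence the kernel value, unchanged.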
Fisher Kernels Jaakkola and Haussler [1999] have designed kernels building on probability density models p(x|θ). Denote by

U_θ(x) := −∂_θ log p(x|θ) (41)
I := E_x [U_θ(x) U_θ(x)⊤] (42)
the Fisher scores and the Fisher information matrix, respectively. Note that for maximum likelihood estimators E_x[U_θ(x)] = 0 and therefore I is the covariance of U_θ(x). The Fisher kernel is defined as

k(x, x') := U_θ(x)⊤ I^{−1} U_θ(x')  or  k(x, x') := U_θ(x)⊤ U_θ(x') (43)
depending on whether we study the normalized or the unnormalized kernel, respectively. In addition, the Fisher kernel has several attractive theoretical properties: Oliver et al. [2000] show that estimation using the normalized Fisher kernel corresponds to estimation subject to a regularization on the L_2(p(·|θ)) norm.
Moreover, in the context of exponential families (see Section 5.1 for a more detailed discussion) where p(x|θ) = exp(⟨φ(x), θ⟩ − g(θ)) we have

k(x, x') = [φ(x) − ∂_θ g(θ)] [φ(x') − ∂_θ g(θ)] (44)

for the unnormalized Fisher kernel. This means that up to centering by ∂_θ g(θ) the Fisher kernel is identical to the kernel arising from the inner product of the sufficient statistics φ(x). This is not a coincidence. In fact, in our analysis of nonparametric exponential families we will encounter this fact several times (cf. Section 5 for further details). Moreover, note that the centering is immaterial, as can be seen in Lemma 13.
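As a minimal worked case of (41)-(44), consider a one-dimensional Gaussian with known σ and mean parameter μ (our function names). Then log p(x|μ) = −(x − μ)^2/(2σ^2) + const, so U_μ(x) = −(x − μ)/σ^2, and for σ = 1 the unnormalized kernel is exactly the centered sufficient-statistic product (x − μ)(x' − μ) of (44):

```python
def fisher_score(x, mu, sigma=1.0):
    """Fisher score U_mu(x) = -d/dmu log p(x|mu) for N(mu, sigma^2)
    with known sigma: U_mu(x) = -(x - mu) / sigma^2."""
    return -(x - mu) / sigma ** 2

def fisher_kernel(x, y, mu, sigma=1.0):
    """Unnormalized Fisher kernel U_mu(x) U_mu(y), cf. Eq. (43);
    equals (x - mu)(y - mu)/sigma^4, i.e. Eq. (44) up to scaling."""
    return fisher_score(x, mu, sigma) * fisher_score(y, mu, sigma)
```

At the maximum likelihood estimate the scores average to zero over the sample, consistent with the remark after (42).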
The above overview of kernel design is by no means complete. The reader is referred to the books of Cristianini and Shawe-Taylor [2000], Schölkopf and Smola [2002], Joachims [2002], Herbrich [2002], Schölkopf et al. [2004], Shawe-Taylor and Cristianini [2004], Bakir et al. [2007] for further examples and details.
2.3 Kernel Function Classes
2.3.1 The Representer Theorem
From kernels, we now move to functions that can be expressed in terms of kernel expansions. The representer theorem [Kimeldorf and Wahba, 1971, Cox and O'Sullivan, 1990] shows that solutions of a large class of optimization problems can be expressed as kernel expansions over the sample points. We present a slightly more general version of the theorem with a simple proof [Schölkopf et al., 2001]. As above, H is the RKHS associated to the kernel k.
Theorem 11 (Representer Theorem) Denote by Ω : [0, ∞) → R a strictly monotonically increasing function, by X a set, and by c : (X × R^2)^n → R ∪ {∞} an arbitrary loss function. Then each minimizer f ∈ H of the regularized risk functional

c((x_1, y_1, f(x_1)), . . . , (x_n, y_n, f(x_n))) + Ω(‖f‖²_H) (45)

admits a representation of the form

f(x) = Σ_{i=1}^n α_i k(x_i, x). (46)
Proof We decompose any f ∈ H into a part contained in the span of the kernel functions k(x_1, ·), . . . , k(x_n, ·), and one in the orthogonal complement:

f(x) = f_∥(x) + f_⊥(x) = Σ_{i=1}^n α_i k(x_i, x) + f_⊥(x). (47)

Here α_i ∈ R and f_⊥ ∈ H with ⟨f_⊥, k(x_i, ·)⟩_H = 0 for all i ∈ [n] := {1, . . . , n}. By (16) we may write f(x_j) (for all j ∈ [n]) as

f(x_j) = ⟨f(·), k(x_j, ·)⟩ = Σ_{i=1}^n α_i k(x_i, x_j) + ⟨f_⊥(·), k(x_j, ·)⟩_H = Σ_{i=1}^n α_i k(x_i, x_j). (48)
Second, for all f_⊥,

Ω(‖f‖²_H) = Ω(‖Σ_i α_i k(x_i, ·)‖²_H + ‖f_⊥‖²_H) ≥ Ω(‖Σ_i α_i k(x_i, ·)‖²_H). (49)

Thus for any fixed α_i ∈ R the risk functional (45) is minimized for f_⊥ = 0. Since this also has to hold for the solution, the theorem holds.
Monotonicity of Ω does not prevent the regularized risk functional (45) from having multiple local
minima. To ensure a global minimum, we would need to require convexity. If we discard the strictness of
the monotonicity, then it no longer follows that each minimizer of the regularized risk admits an expansion
(46); it still follows, however, that there is always another solution that is as good, and that does admit the
expansion.
The significance of the Representer Theorem is that although we might be trying to solve an optimization problem in an infinite-dimensional space H, containing linear combinations of kernels centered on arbitrary points of X, it states that the solution lies in the span of n particular kernels, namely those centered on the training points. We will encounter (46) again further below, where it is called the Support Vector expansion. For suitable choices of loss functions, many of the α_i often equal 0.
2.3.2 Reduced Set Methods
Despite the finiteness of the representation in (46) it can often be the case that the number of terms in the expansion is too large in practice. This can be problematic in practical applications, since the time required to evaluate (46) is proportional to the number of terms. To deal with this issue, we thus need to express or at least approximate (46) using a smaller number of terms. Exact expressions in a reduced number of basis points are often not possible; e.g., if the kernel is strictly positive definite and the data points x_1, . . . , x_n are distinct, then no exact reduction is possible. For the purpose of approximation, we need to decide on a discrepancy measure that we want to minimize. The most natural such measure is the RKHS norm. From a practical point of view, this norm has the advantage that it can be computed using evaluations of the kernel function, which opens the possibility for relatively efficient approximation algorithms. Clearly, if we were given a set of points z_1, . . . , z_m such that the function (cf. (46))

f = Σ_{i=1}^n α_i k(x_i, ·) (50)

can be expressed with small RKHS-error in the subspace spanned by k(z_1, ·), . . . , k(z_m, ·), then we can compute an m-term approximation of f by projection onto that subspace.
The problem thus reduces to the choice of the expansion points z_1, . . . , z_m. In the literature, two classes of methods have been pursued. In the first, one attempts to choose the points as a subset of some larger set x_1, . . . , x_n, which is usually the training set [Schölkopf, 1997, Frieß and Harrison, 1998]. In the second, we compute the difference in the RKHS between (50) and the reduced set expansion [Burges, 1996]

g = Σ_{p=1}^m β_p k(z_p, ·), (51)

leading to

‖f − g‖² = Σ_{i,j=1}^n α_i α_j k(x_i, x_j) + Σ_{p,q=1}^m β_p β_q k(z_p, z_q) − 2 Σ_{i=1}^n Σ_{p=1}^m α_i β_p k(x_i, z_p). (52)
To obtain the z_1, . . . , z_m and β_1, . . . , β_m, we can minimize this objective function, which for most kernels will be a nonconvex optimization problem. Specialized methods for specific kernels exist, as well as iterative methods which take into account the fact that given the z_1, . . . , z_m, the coefficients β_1, . . . , β_m can be computed in closed form. See Schölkopf and Smola [2002] for further details.
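The closed form for fixed expansion points follows by setting the gradient of (52) in β to zero, which gives the linear system K_zz β = K_zx α. A sketch (our names; assumes the Gram matrix K_zz is invertible):

```python
import math

def rbf(a, b):
    """Gaussian RBF kernel on scalars."""
    return math.exp(-(a - b) ** 2)

def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(n):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [x - f * y for x, y in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]

def reduced_set_coeffs(xs, alphas, zs, k):
    """Optimal betas minimizing ||f - g||_H^2 in Eq. (52) for fixed z's:
    the stationarity condition is K_zz beta = K_zx alpha."""
    rhs = [sum(k(z, x) * a for x, a in zip(xs, alphas)) for z in zs]
    Kzz = [[k(zi, zj) for zj in zs] for zi in zs]
    return solve(Kzz, rhs)
```

A sanity check: if the z's coincide with the x's, the optimal β simply reproduces α, and the approximation error (52) vanishes.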
Once we have constructed or selected a set of expansion points, we are working in a finite dimensional subspace of H spanned by these points.
It is known that each closed subspace of an RKHS is itself an RKHS. Denote its kernel by l; then the projection P onto the subspace is given by (e.g., Meschkowski [1962])

(Pf)(x') = ⟨f, l(·, x')⟩. (53)

If the subspace is spanned by a finite set of linearly independent kernels k(·, z_1), . . . , k(·, z_m), the kernel of the subspace takes the form

k_m(x, x') = (k(x, z_1), . . . , k(x, z_m)) K^{−1} (k(x', z_1), . . . , k(x', z_m))⊤, (54)
where K is the Gram matrix of k with respect to z_1, . . . , z_m. This can be seen by noting that (i) the k_m(·, x) span an m-dimensional subspace, and (ii) we have k_m(z_i, z_j) = k(z_i, z_j) for i, j = 1, . . . , m. The kernel of the subspace's orthogonal complement, on the other hand, is k − k_m.
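Property (ii) is easy to verify numerically. A sketch of (54) (our names; K^{-1} applied via a small Gauss-Jordan solver, assuming the k(·, z_i) are linearly independent):

```python
import math

def rbf(a, b):
    """Gaussian RBF kernel on scalars."""
    return math.exp(-(a - b) ** 2)

def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(n):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [x - f * y for x, y in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]

def subspace_kernel(zs, k):
    """Kernel of the subspace spanned by k(., z_1), ..., k(., z_m),
    Eq. (54): k_m(x, x') = k_x^T K^{-1} k_{x'}."""
    K = [[k(zi, zj) for zj in zs] for zi in zs]
    def km(x, y):
        w = solve(K, [k(zi, y) for zi in zs])   # K^{-1} k_y
        return sum(k(zi, x) * wi for zi, wi in zip(zs, w))
    return km
```

Evaluating k_m at pairs of expansion points reproduces k, as claimed in (ii).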
The projection onto the m-dimensional subspace can be written

P = Σ_{i,j=1}^m (K^{−1})_{ij} k(·, z_i) k(·, z_j)⊤, (55)

where k(·, z_j)⊤ denotes the linear form mapping k(·, x) to k(z_j, x) for x ∈ X.
If we prefer to work with coordinates and the Euclidean dot product, we can use the kernel PCA map [Schölkopf and Smola, 2002]

Φ_m : X → R^m, x ↦ K^{−1/2} (k(x, z_1), . . . , k(x, z_m))⊤, (56)

which directly projects into an orthonormal basis of the subspace.⁷
A number of articles exist discussing how to best choose the expansion points z_i for the purpose of processing a given dataset, e.g., Smola and Schölkopf [2000], Williams and Seeger [2000].
2.3.3 Regularization Properties
The regularizer ‖f‖²_H used in Theorem 11, which is what distinguishes SVMs from many other regularized function estimators (e.g., based on coefficient-based L_1 regularizers, such as the Lasso [Tibshirani, 1996] or linear programming machines [Schölkopf and Smola, 2002]), stems from the dot product ⟨f, f⟩_k in the RKHS H associated with a positive definite kernel. The nature and implications of this regularizer, however, are not obvious and we shall now provide an analysis in the Fourier domain. It turns out that if the kernel is translation invariant, then its Fourier transform allows us to characterize how the different frequency components of f contribute to the value of ‖f‖²_H. Our exposition will be informal (see also Poggio and Girosi [1990], Girosi et al. [1993], Smola et al. [1998]), and we will implicitly assume that all integrals are over R^d and exist, and that the operators are well defined.

We will rewrite the RKHS dot product as

⟨f, g⟩_k = ⟨Υf, Υg⟩ = ⟨Υ²f, g⟩, (57)
7 In case K does not have full rank, projections onto the span {k(·, z_1), . . . , k(·, z_m)} are achieved by using the pseudoinverse of K instead.
where Υ is a positive (and thus symmetric) operator mapping H into a function space endowed with the usual dot product

⟨f, g⟩ = ∫ f(x) g(x) dx. (58)

Rather than (57), we consider the equivalent condition (cf. Section 2.2.1)

⟨k(x, ·), k(x', ·)⟩_k = ⟨Υk(x, ·), Υk(x', ·)⟩ = ⟨Υ²k(x, ·), k(x', ·)⟩. (59)

If k(x, ·) is a Green function of Υ², we have

⟨Υ²k(x, ·), k(x', ·)⟩ = ⟨δ_x, k(x', ·)⟩ = k(x, x'), (60)

which by the reproducing property (16) amounts to the desired equality (59).⁸
We now consider the particular case where the kernel can be written k(x, x') = h(x − x') with a continuous strictly positive definite function h ∈ L_1(R^d) (cf. Section 2.2.3). A variation of Bochner's theorem, stated by Wendland [2005], then tells us that the measure corresponding to h has a nonvanishing density υ with respect to the Lebesgue measure, i.e., that k can be written as

k(x, x') = ∫ e^{−i⟨x−x', ω⟩} υ(ω) dω = ∫ e^{−i⟨x, ω⟩} \overline{e^{−i⟨x', ω⟩}} υ(ω) dω. (61)

We would like to rewrite this as ⟨Υk(x, ·), Υk(x', ·)⟩ for some linear operator Υ. It turns out that a multiplication operator in the Fourier domain will do the job. To this end, recall the d-dimensional Fourier transform, given by

F[f](ω) := (2π)^{−d/2} ∫ f(x) e^{−i⟨x, ω⟩} dx, (62)
with the inverse F^{−1}[f](x) = (2π)^{−d/2} ∫ f(ω) e^{i⟨x, ω⟩} dω. (63)
Next, compute the Fourier transform of k as

F[k(x, ·)](ω) = (2π)^{−d/2} ∫ (∫ υ(ω') e^{−i⟨x, ω'⟩} e^{i⟨x', ω'⟩} dω') e^{−i⟨x', ω⟩} dx' (64)
= (2π)^{d/2} υ(ω) e^{−i⟨x, ω⟩}. (65)

Hence we can rewrite (61) as

k(x, x') = (2π)^{−d} ∫ F[k(x, ·)](ω) \overline{F[k(x', ·)](ω)} υ(ω)^{−1} dω. (66)
If our regularization operator maps

Υ : f ↦ (2π)^{−d/2} υ^{−1/2} F[f], (67)

we thus have

k(x, x') = ∫ (Υk(x, ·))(ω) \overline{(Υk(x', ·))(ω)} dω, (68)
8 For conditionally positive definite kernels, a similar correspondence can be established, with a regularization operator whose null space is spanned by a set of functions which are not regularized (in the case (18), which is sometimes called conditionally positive definite of order 1, these are the constants).
i.e., our desired identity (59) holds true.

As required in (57), we can thus interpret the dot product ⟨f, g⟩_k in the RKHS as a dot product ∫ (Υf)(ω) \overline{(Υg)(ω)} dω. This allows us to understand regularization properties of k in terms of its (scaled) Fourier transform υ(ω). Small values of υ(ω) amplify the corresponding frequencies in (67). Penalizing ⟨f, f⟩_k thus amounts to a strong attenuation of the corresponding frequencies. Hence small values of υ(ω) for large ω are desirable, since high frequency components of F[f] correspond to rapid changes in f. It follows that υ(ω) describes the filter properties of the corresponding regularization operator Υ. In view of our comments following Theorem 8, we can translate this insight into probabilistic terms: if the probability measure

υ(ω) dω / ∫ υ(ω) dω

describes the desired filter properties, then the natural translation invariant kernel to use is the characteristic function of the measure.
2.3.4 Remarks and Notes
The notion of kernels as dot products in Hilbert spaces was brought to the field of machine learning by Aizerman et al. [1964], Boser et al. [1992], Vapnik [1998], Schölkopf et al. [1998]. Aizerman et al. [1964] used kernels as a tool in a convergence proof, allowing them to apply the Perceptron convergence theorem to their class of potential function algorithms. To the best of our knowledge, Boser et al. [1992] were the first to use kernels to construct a nonlinear estimation algorithm, the hard margin predecessor of the Support Vector Machine, from its linear counterpart, the generalized portrait [Vapnik and Lerner, 1963, Vapnik, 1982]. Whilst all these uses were limited to kernels defined on vectorial data, Schölkopf [1997] observed that this restriction is unnecessary, and nontrivial kernels on other data types were proposed by Haussler [1999], Watkins [2000]. Schölkopf et al. [1998] applied the kernel trick to generalize principal component analysis and pointed out the (in retrospect obvious) fact that any algorithm which only uses the data via dot products can be generalized using kernels.

In addition to the above uses of positive definite kernels in machine learning, there has been a parallel, and partly earlier, development in the field of statistics, where such kernels have been used for instance for time series analysis [Parzen, 1970] as well as regression estimation and the solution of inverse problems [Wahba, 1990].

In probability theory, positive definite kernels have also been studied in depth since they arise as covariance kernels of stochastic processes, see e.g. [Loève, 1978]. This connection is heavily used in a subset of the machine learning community interested in prediction with Gaussian Processes [Rasmussen and Williams, 2006].

In functional analysis, the problem of Hilbert space representations of kernels has been studied in great detail; a good reference is Berg et al. [1984]; indeed, a large part of the material in the present section is based on that work. Interestingly, it seems that for a fairly long time, there have been two separate strands of development [Stewart, 1976]. One of them was the study of positive definite functions, which started later but seems to have been unaware of the fact that it considered a special case of positive definite kernels. The latter was initiated by Hilbert [1904], Mercer [1909], and pursued for instance by
Schoenberg [1938]. Hilbert calls a kernel k definit if

∫_a^b ∫_a^b k(x, x') f(x) f(x') dx dx' > 0 (69)

for all nonzero continuous functions f, and shows that all eigenvalues of the corresponding integral operator f ↦ ∫_a^b k(x, ·) f(x) dx are then positive. If k satisfies the condition (69) subject to the constraint that ∫_a^b f(x) g(x) dx = 0, for some fixed function g, Hilbert calls it relativ definit. For that case, he shows that k has at most one negative eigenvalue. Note that if f is chosen to be constant, then this notion is closely related to the one of conditionally positive definite kernels, cf. (18). For further historical details see the review of Stewart [1976] or Berg et al. [1984].
3 Convex Programming Methods for Estimation
As we saw, kernels can be used both for the purpose of describing nonlinear functions subject to smoothness constraints and for the purpose of computing inner products in some feature space efficiently. In this section we focus on the latter and how it allows us to design methods of estimation based on the geometry of the problems at hand.

Unless stated otherwise, E[·] denotes the expectation with respect to all random variables of the argument. Subscripts, such as E_X[·], indicate that the expectation is taken over X. We will omit them wherever obvious. Finally we will refer to E_emp[·] as the empirical average with respect to an n-sample.

Given a sample S := {(x_1, y_1), . . . , (x_n, y_n)} ⊆ X × Y we now aim at finding an affine function f(x) = ⟨w, φ(x)⟩ + b or in some cases a function f(x, y) = ⟨φ(x, y), w⟩ such that the empirical risk on S is minimized. In the binary classification case this means that we want to maximize the agreement between sgn f(x) and y.
• Minimization of the empirical risk with respect to (w, b) is NP-hard [Minsky and Papert, 1969]. In fact, Ben-David et al. [2003] show that even approximately minimizing the empirical risk is NP-hard, not only for linear function classes but also for spheres and other simple geometrical objects. This means that even if the statistical challenges could be solved, we still would be confronted with a formidable algorithmic problem.
• The indicator function {y f(x) < 0} is discontinuous and even small changes in f may lead to large changes in both empirical and expected risk. Properties of such functions can be captured by the VC dimension [Vapnik and Chervonenkis, 1971], that is, the maximum number of observations which can be labeled in an arbitrary fashion by functions of the class. Necessary and sufficient conditions for estimation can be stated in these terms [Vapnik and Chervonenkis, 1991]. However, much tighter bounds can be obtained by also using the scale of the class [Alon et al., 1993, Bartlett et al., 1996, Williamson et al., 2001]. In fact, there exist function classes parameterized by a single scalar which have infinite VC-dimension [Vapnik, 1995].
Given the difficulty arising from minimizing the empirical risk we now discuss algorithms which minimize an upper bound on the empirical risk, while providing good computational properties and consistency of the estimators. The statistical analysis is relegated to Section 4.
3.1 Support Vector Classiﬁcation
Assume that S is linearly separable, i.e. there exists a linear function f(x) such that sgn y f(x) = 1 on S. In this case, the task of finding a large margin separating hyperplane can be viewed as one of solving [Vapnik and Lerner, 1963]

minimize_{w,b} (1/2)‖w‖²  subject to  y_i(⟨w, x_i⟩ + b) ≥ 1. (70)

Note that ‖w‖^{−1} f(x_i) is the distance of the point x_i to the hyperplane H(w, b) := {x | ⟨w, x⟩ + b = 0}. The condition y_i f(x_i) ≥ 1 implies that the margin of separation is at least 2‖w‖^{−1}. The bound becomes exact if equality is attained for some y_i = 1 and y_j = −1. Consequently minimizing ‖w‖ subject to the constraints maximizes the margin of separation. Eq. (70) is a quadratic program which can be solved efficiently [Luenberger, 1984, Fletcher, 1989, Boyd and Vandenberghe, 2004].
Mangasarian [1965] devised a similar optimization scheme using ‖w‖_1 instead of ‖w‖_2 in the objective function of (70). The result is a linear program. In general, one can show [Smola et al., 2000] that minimizing the ℓ_p norm of w leads to maximizing the margin of separation in the ℓ_q norm where 1/p + 1/q = 1. The ℓ_1 norm leads to sparse approximation schemes (see also Chen et al. [1999]), whereas the ℓ_2 norm can be extended to Hilbert spaces and kernels.
To deal with nonseparable problems, i.e. cases when (70) is infeasible, we need to relax the constraints of the optimization problem. Bennett and Mangasarian [1992] and Cortes and Vapnik [1995] impose a linear penalty on the violation of the large-margin constraints to obtain:

minimize_{w,b,ξ} (1/2)‖w‖² + C Σ_{i=1}^n ξ_i  subject to  y_i(⟨w, x_i⟩ + b) ≥ 1 − ξ_i and ξ_i ≥ 0, ∀i ∈ [n]. (71)

Eq. (71) is a quadratic program which is always feasible (e.g. w = 0, b = 0 and ξ_i = 1 satisfy the constraints). C > 0 is a regularization constant trading off the violation of the constraints vs. maximizing the overall margin.
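Eliminating the slacks ξ_i turns (71) into the unconstrained objective (1/2)‖w‖² + C Σ_i max(0, 1 − y_i(⟨w, x_i⟩ + b)), which can be attacked directly by subgradient descent. A minimal sketch (our names and step-size schedule; for illustration only, practical solvers work on the dual derived below):

```python
def svm_primal(X, y, C=10.0, lr=0.01, epochs=3000):
    """Subgradient descent on the soft-margin primal, Eq. (71), in its
    unconstrained hinge-loss form. A rough sketch, not a production solver."""
    d = len(X[0])
    w, b = [0.0] * d, 0.0
    for t in range(epochs):
        step = lr / (1.0 + 0.01 * t)          # decaying step size
        gw, gb = w[:], 0.0                    # gradient of 1/2 ||w||^2
        for xi, yi in zip(X, y):
            # margin violated: add subgradient of C * hinge term
            if yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b) < 1:
                gw = [g - C * yi * xj for g, xj in zip(gw, xi)]
                gb -= C * yi
        w = [wj - step * g for wj, g in zip(w, gw)]
        b -= step * gb
    return w, b
```

On the separable toy set x = ±1 with labels ±1, the iterates hover near the maximum margin solution w = 1, b = 0, for which both constraints of (70) hold with equality.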
Whenever the dimensionality of X exceeds n, direct optimization of (71) is computationally inefficient. This is particularly true if we map from X into an RKHS. To address these problems one may solve the problem in dual space as follows. The Lagrange function of (71) is given by

L(w, b, ξ, α, η) = (1/2)‖w‖² + C Σ_{i=1}^n ξ_i + Σ_{i=1}^n α_i (1 − ξ_i − y_i(⟨w, x_i⟩ + b)) − Σ_{i=1}^n η_i ξ_i (72)
where α_i, η_i ≥ 0 for all i ∈ [n]. To compute the dual of L we need to identify the first order conditions in w, b. They are given by

∂_w L = w − Σ_{i=1}^n α_i y_i x_i = 0,  ∂_b L = − Σ_{i=1}^n α_i y_i = 0,  and  ∂_{ξ_i} L = C − α_i − η_i = 0. (73)

This translates into w = Σ_{i=1}^n α_i y_i x_i, the linear constraint Σ_{i=1}^n α_i y_i = 0, and the box constraint α_i ∈ [0, C] arising from η_i ≥ 0. Substituting (73) into L yields the Wolfe [1961] dual

minimize_α (1/2) α⊤Qα − α⊤1  subject to  α⊤y = 0 and α_i ∈ [0, C] ∀i ∈ [n]. (74)
Q ∈ R^{n×n} is the matrix of inner products Q_ij := y_i y_j ⟨x_i, x_j⟩. Clearly this can be extended to feature maps and kernels easily via K_ij := y_i y_j ⟨Φ(x_i), Φ(x_j)⟩ = y_i y_j k(x_i, x_j). Note that w lies in the span of the x_i. This is an instance of the Representer Theorem (Theorem 11). The KKT conditions [Karush, 1939, Kuhn and Tucker, 1951, Boser et al., 1992, Cortes and Vapnik, 1995] require that at optimality α_i (y_i f(x_i) − 1) = 0. This means that only those x_i may appear in the expansion (73) for which y_i f(x_i) ≤ 1, as otherwise α_i = 0. The x_i with α_i > 0 are commonly referred to as support vectors.
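For the toy problem x_1 = 1, x_2 = −1 with labels ±1, the dual (74) is solved by α = (1/2, 1/2) (assuming C ≥ 1/2), and the expansion in (73) recovers w. A sketch (our names) of that expansion and the resulting decision function:

```python
def expand_w(alphas, ys, xs):
    """Recover the primal solution w = sum_i alpha_i y_i x_i from the
    dual variables, the first condition in Eq. (73)."""
    d = len(xs[0])
    return [sum(a * y * x[j] for a, y, x in zip(alphas, ys, xs))
            for j in range(d)]

def decision(x, alphas, ys, xs, b=0.0):
    """f(x) = <w, x> + b via the support vector expansion."""
    w = expand_w(alphas, ys, xs)
    return sum(wj * xj for wj, xj in zip(w, x)) + b
```

Here both training points satisfy y_i f(x_i) = 1 exactly, consistent with the KKT condition α_i (y_i f(x_i) − 1) = 0: both are support vectors sitting on the margin.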
Note that Σ_{i=1}^n ξ_i is an upper bound on the empirical risk, as y_i f(x_i) ≤ 0 implies ξ_i ≥ 1 (see also Lemma 12). The number of misclassified points x_i itself depends on the configuration of the data and the value of C. The result of Ben-David et al. [2003] suggests that finding even an approximate minimum classification error solution is difficult. That said, it is possible to modify (71) such that a desired target number of observations violates y_i f(x_i) ≥ ρ for some ρ ∈ R by making the threshold itself a variable of the optimization problem [Schölkopf et al., 2000]. This leads to the following optimization problem (ν-SV classification):
$$\operatorname*{minimize}_{w,b,\xi} \; \frac{1}{2}\|w\|^2 + \sum_{i=1}^n \xi_i - n\nu\rho \quad \text{subject to} \quad y_i(\langle w, x_i\rangle + b) \geq \rho - \xi_i \text{ and } \xi_i \geq 0. \quad (75)$$
The dual of (75) is essentially identical to (74) with the exception of an additional constraint:
$$\operatorname*{minimize}_{\alpha} \; \frac{1}{2}\alpha^\top Q \alpha \quad \text{subject to} \quad \alpha^\top y = 0, \; \alpha^\top \mathbf{1} = n\nu, \text{ and } \alpha_i \in [0, 1]. \quad (76)$$
One can show that for every $C$ there exists a $\nu$ such that the solution of (76) is a multiple of the solution of (74). Schölkopf et al. [2000] prove that any solution of (76) with $\rho > 0$ satisfies:
1. ν is an upper bound on the fraction of margin errors.
2. ν is a lower bound on the fraction of SVs.
Moreover, under mild conditions, with probability 1, asymptotically $\nu$ equals both the fraction of SVs and the fraction of errors.
This statement implies that whenever the data are sufficiently well separable (that is, $\rho > 0$), $\nu$-SV classification finds a solution with a fraction of at most $\nu$ margin errors. Also note that for $\nu = 1$ all $\alpha_i = 1$, that is, $f$ becomes an affine copy of the Parzen windows classifier (5).
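The degenerate $\nu = 1$ case can be sketched directly: with all $\alpha_i = 1$ and a Gaussian kernel, $f$ is an (unnormalized) Parzen-windows-style discriminant. The data and bandwidth below are made up for illustration.

```python
import numpy as np

def gaussian_kernel(x, xp, sigma=1.0):
    return np.exp(-np.sum((x - xp) ** 2) / (2 * sigma ** 2))

# Made-up training points for the two classes.
X = np.array([[1.0, 1.0], [2.0, 1.5], [-1.0, -1.0], [-2.0, -0.5]])
y = np.array([1, 1, -1, -1])

def f(x):
    # all alpha_i = 1: an (unnormalized) Parzen-windows discriminant
    return sum(y_i * gaussian_kernel(x, x_i) for x_i, y_i in zip(X, y))

pred = np.sign(f(np.array([1.5, 1.0])))   # query near the positive cluster
```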
3.2 Estimating the Support of a Density
We now extend the notion of linear separation to that of estimating the support of a density [Schölkopf et al., 2001, Tax and Duin, 1999]. Denote by $X = \{x_1, \ldots, x_n\} \subseteq \mathcal{X}$ the sample drawn from $P(x)$. Let $\mathcal{C}$ be a class of measurable subsets of $\mathcal{X}$ and let $\lambda$ be a real-valued function defined on $\mathcal{C}$. The quantile function [Einmal and Mason, 1992] with respect to $(P, \lambda, \mathcal{C})$ is defined as
$$U(\mu) = \inf \{\lambda(C) \mid P(C) \geq \mu, \; C \in \mathcal{C}\} \quad \text{where } \mu \in (0, 1]. \quad (77)$$
We denote by $C_\lambda(\mu)$ and $C^m_\lambda(\mu)$ the (not necessarily unique) $C \in \mathcal{C}$ that attain the infimum (when it is achievable) on $P(x)$ and on the empirical measure given by $X$, respectively. A common choice of $\lambda$ is Lebesgue measure, in which case $C_\lambda(\mu)$ is the minimum volume set $C \in \mathcal{C}$ that contains at least a fraction $\mu$ of the probability mass.
Support estimation requires us to find some $C^m_\lambda(\mu)$ such that $|P(C^m_\lambda(\mu)) - \mu|$ is small. This is where the complexity trade-off enters: on the one hand, we want to use a rich class $\mathcal{C}$ to capture all possible distributions; on the other hand, large classes lead to large deviations between $\mu$ and $P(C^m_\lambda(\mu))$. Therefore, we have to consider classes of sets which are suitably restricted. This can be achieved using an SVM regularizer.
In the case where $\mu < 1$, it seems the first work was reported in Sager [1979] and Hartigan [1987], in which $\mathcal{X} = \mathbb{R}^2$, with $\mathcal{C}$ being the class of closed convex sets in $\mathcal{X}$. Nolan [1991] considered higher dimensions, with $\mathcal{C}$ being the class of ellipsoids. Tsybakov [1997] studied an estimator based on piecewise polynomial approximation of $C_\lambda(\mu)$ and showed it attains the asymptotically minimax rate for certain classes of densities. Polonik [1997] studied the estimation of $C_\lambda(\mu)$ by $C^m_\lambda(\mu)$. He derived asymptotic rates of convergence in terms of various measures of richness of $\mathcal{C}$. More information on minimum volume estimators can be found in that work and in Schölkopf et al. [2001].
SV support estimation relates to previous work as follows: set $\lambda(C_w) = \|w\|^2$, where $C_w = \{x \mid f_w(x) \geq \rho\}$, $f_w(x) = \langle w, x\rangle$, and $(w, \rho)$ are respectively a weight vector and an offset. Stated as a convex optimization problem, we want to separate the data from the origin with maximum margin via:
$$\operatorname*{minimize}_{w,\xi,\rho} \; \frac{1}{2}\|w\|^2 + \sum_{i=1}^n \xi_i - n\nu\rho \quad \text{subject to} \quad \langle w, x_i\rangle \geq \rho - \xi_i \text{ and } \xi_i \geq 0. \quad (78)$$
Here, $\nu \in (0, 1]$ plays the same role as in (75), controlling the number of observations $x_i$ for which $f(x_i) \leq \rho$. Since nonzero slack variables $\xi_i$ are penalized in the objective function, if $w$ and $\rho$ solve this problem, then the decision function $f(x)$ will attain or exceed $\rho$ for at least a fraction $1 - \nu$ of the $x_i$ contained in $X$, while the regularization term $\|w\|$ will still be small. The dual of (78) yields:
$$\operatorname*{minimize}_{\alpha} \; \frac{1}{2}\alpha^\top K \alpha \quad \text{subject to} \quad \alpha^\top \mathbf{1} = \nu n \text{ and } \alpha_i \in [0, 1]. \quad (79)$$
To compare (79) to a Parzen windows estimator assume that $k$ is such that it can be normalized as a density in input space, such as a Gaussian. Using $\nu = 1$ in (79) the constraints automatically imply $\alpha_i = 1$. Thus $f$ reduces to a Parzen windows estimate of the underlying density. For $\nu < 1$, the equality constraint in (79) still ensures that $f$ is a thresholded density, now depending only on a subset of $X$, namely those points which are important for deciding whether $f(x) \leq \rho$.
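The $\nu = 1$ density view can be illustrated with a one-dimensional sketch: with equal coefficients and a kernel normalized as a density, $f$ is a Parzen windows estimate, so it should integrate to roughly one. The sample, bandwidth, and grid below are arbitrary choices for illustration.

```python
import numpy as np

X = np.array([-1.0, 0.0, 0.2, 1.5])    # made-up 1-D sample
sigma = 0.5                             # made-up bandwidth

def parzen_density(x):
    # all alpha_i equal, kernel normalized as a density (the nu = 1 case,
    # up to the overall 1/n scaling)
    k = np.exp(-(x[:, None] - X[None, :]) ** 2 / (2 * sigma ** 2))
    k /= np.sqrt(2 * np.pi) * sigma
    return k.mean(axis=1)

grid = np.linspace(-6.0, 6.0, 2001)
mass = parzen_density(grid).sum() * (grid[1] - grid[0])   # ~ total mass
```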
3.3 Regression Estimation
SV regression was first proposed in Vapnik [1995], Vapnik et al. [1997], and Drucker et al. [1997] using the so-called $\epsilon$-insensitive loss function. It is a direct extension of the soft-margin idea to regression: instead of requiring that $yf(x)$ exceeds some margin value, we now require that the values $y - f(x)$ are bounded by a margin on both sides. That is, we impose the soft constraints
$$y_i - f(x_i) \leq \epsilon + \xi_i \quad \text{and} \quad f(x_i) - y_i \leq \epsilon + \xi_i^*, \quad (80)$$
where $\xi_i, \xi_i^* \geq 0$. If $|y_i - f(x_i)| \leq \epsilon$ no penalty occurs. The objective function is given by the sum of the slack variables $\xi_i, \xi_i^*$ penalized by some $C > 0$ and a measure for the slope of the function $f(x) = \langle w, x\rangle + b$, that is $\frac{1}{2}\|w\|^2$.
Before computing the dual of this problem let us consider a somewhat more general situation where we use a range of different convex penalties for the deviation between $y_i$ and $f(x_i)$. One may check that minimizing $\frac{1}{2}\|w\|^2 + C\sum_{i=1}^n (\xi_i + \xi_i^*)$ subject to (80) is equivalent to solving
$$\operatorname*{minimize}_{w,b,\xi} \; \frac{1}{2}\|w\|^2 + C\sum_{i=1}^n \psi(y_i - f(x_i)) \quad \text{where} \quad \psi(\xi) = \max(0, |\xi| - \epsilon). \quad (81)$$
Choosing different loss functions $\psi$ leads to a rather rich class of estimators:
• $\psi(\xi) = \frac{1}{2}\xi^2$ yields penalized least squares (LS) regression [Hoerl and Kennard, 1970, Tikhonov, 1963, Morozov, 1984, Wahba, 1990]. The corresponding optimization problem can be minimized by solving a linear system.
• For $\psi(\xi) = |\xi|$ we obtain the penalized least absolute deviations (LAD) estimator [Bloomfield and Steiger, 1983]. That is, we obtain a quadratic program to estimate the conditional median.
• A combination of LS and LAD loss yields a penalized version of Huber's robust regression [Huber, 1981, Smola and Schölkopf, 1998]. In this case we have $\psi(\xi) = \frac{1}{2\sigma}\xi^2$ for $|\xi| \leq \sigma$ and $\psi(\xi) = |\xi| - \frac{\sigma}{2}$ for $|\xi| \geq \sigma$.
• Note that also quantile regression [Koenker, 2005] can be modified to work with kernels [Schölkopf et al., 2000, Takeuchi et al., 2006] by using as loss function the "pinball" loss, that is $\psi(\xi) = (\tau - 1)\xi$ if $\xi < 0$ and $\psi(\xi) = \tau\xi$ if $\xi \geq 0$.
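The loss functions listed above are straightforward to implement; the sketch below collects them in one place. The parameter values $\epsilon$, $\sigma$, $\tau$ are illustrative choices, not taken from the text.

```python
import numpy as np

eps, sigma, tau = 0.1, 1.0, 0.3        # illustrative parameter choices

def eps_insensitive(xi):               # psi(xi) = max(0, |xi| - eps)
    return np.maximum(0.0, np.abs(xi) - eps)

def squared(xi):                       # penalized least squares
    return 0.5 * np.asarray(xi) ** 2

def lad(xi):                           # least absolute deviations
    return np.abs(xi)

def huber(xi):                         # quadratic near zero, linear tails
    return np.where(np.abs(xi) <= sigma,
                    np.asarray(xi) ** 2 / (2 * sigma),
                    np.abs(xi) - sigma / 2)

def pinball(xi):                       # quantile ("pinball") loss
    xi = np.asarray(xi)
    return np.where(xi > 0, tau * xi, (tau - 1) * xi)
```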
All the optimization problems arising from the above five cases are convex quadratic programs. Their dual resembles that of (80), namely
$$\operatorname*{minimize}_{\alpha,\alpha^*} \; \frac{1}{2}(\alpha - \alpha^*)^\top K (\alpha - \alpha^*) + \epsilon \mathbf{1}^\top (\alpha + \alpha^*) - y^\top (\alpha - \alpha^*) \quad (82a)$$
$$\text{subject to} \quad (\alpha - \alpha^*)^\top \mathbf{1} = 0 \text{ and } \alpha_i, \alpha_i^* \in [0, C]. \quad (82b)$$
Here $K_{ij} = \langle x_i, x_j\rangle$ for linear models and $K_{ij} = k(x_i, x_j)$ if we map $x \to \Phi(x)$. The $\nu$-trick, as described in (75) [Schölkopf et al., 2000], can be extended to regression, allowing one to choose the margin of approximation automatically. In this case (82a) drops the terms in $\epsilon$. In its place, we add a linear constraint $(\alpha - \alpha^*)^\top \mathbf{1} = \nu n$. Likewise, LAD is obtained from (82) by dropping the terms in $\epsilon$ without additional constraints. Robust regression leaves (82) unchanged; however, in the definition of $K$ we have an additional term of $\sigma^{-1}$ on the main diagonal. Further details can be found in Schölkopf and Smola [2002]. For quantile regression we drop $\epsilon$ and we obtain different constants $C(1 - \tau)$ and $C\tau$ for the constraints on $\alpha^*$ and $\alpha$. We will discuss uniform convergence properties of the empirical risk estimates with respect to various $\psi(\xi)$ in Section 4.
3.4 Multicategory Classiﬁcation, Ranking and Ordinal Regression
Many estimation problems cannot be described by assuming that $\mathcal{Y} = \{\pm 1\}$. In this case it is advantageous to go beyond simple functions $f(x)$ depending on $x$ only. Instead, we can encode a larger degree of information by estimating a function $f(x, y)$ and subsequently obtaining a prediction via $\hat{y}(x) := \operatorname*{argmax}_{y \in \mathcal{Y}} f(x, y)$. In other words, we study problems where $y$ is obtained as the solution of an optimization problem over $f(x, y)$ and we wish to find $f$ such that $y$ matches $y_i$ as well as possible for relevant inputs $x$.
Note that the loss may be more than just a simple 0-1 loss. In the following we denote by $\Delta(y, y')$ the loss incurred by estimating $y'$ instead of $y$. Without loss of generality we require that $\Delta(y, y) = 0$ and that $\Delta(y, y') \geq 0$ for all $y, y' \in \mathcal{Y}$. Key in our reasoning is the following:
Lemma 12 Let $f : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$ and assume that $\Delta(y, y') \geq 0$ with $\Delta(y, y) = 0$. Moreover let $\xi \geq 0$ such that $f(x, y) - f(x, y') \geq \Delta(y, y') - \xi$ for all $y' \in \mathcal{Y}$. In this case $\xi \geq \Delta(y, \operatorname*{argmax}_{y' \in \mathcal{Y}} f(x, y'))$.
Proof Denote by $y^* := \operatorname*{argmax}_{y' \in \mathcal{Y}} f(x, y')$. By assumption we have $\xi \geq \Delta(y, y^*) + f(x, y^*) - f(x, y)$. Since $f(x, y^*) \geq f(x, y')$ for all $y' \in \mathcal{Y}$, in particular $f(x, y^*) - f(x, y) \geq 0$, the inequality holds.
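Lemma 12 can be verified on a toy instance: take the smallest $\xi$ satisfying the margin constraints and check that it upper-bounds the loss of the argmax prediction. The scores and loss function below are invented purely for illustration.

```python
import numpy as np

# Invented scores f(x, y) over a small label set and a 0-1 loss Delta.
labels = [0, 1, 2]
y_true = 1
f = {0: 0.2, 1: 0.9, 2: 1.1}                  # f(x, .) for one fixed x
delta = lambda y, yp: float(y != yp)          # Delta(y, y')

# Smallest xi >= 0 with f(x,y) - f(x,y') >= Delta(y,y') - xi for all y'
xi = max(0.0, max(delta(y_true, yp) - (f[y_true] - f[yp]) for yp in labels))

y_hat = max(labels, key=lambda yp: f[yp])     # argmax_y f(x, y)
loss = delta(y_true, y_hat)                   # loss of the actual prediction
```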
The construction of the estimator was suggested in Taskar et al. [2003] and Tsochantaridis et al. [2005], and a special instance of the above lemma is given by Joachims [2005]. While the bound appears quite innocuous, it allows us to describe a much richer class of estimation problems as a convex program.
To deal with the added complexity we assume that $f$ is given by $f(x, y) = \langle \Phi(x, y), w\rangle$. Given the possibly nontrivial connection between $x$ and $y$, the use of $\Phi(x, y)$ cannot be avoided. Corresponding kernel functions are given by $k(x, y, x', y') = \langle \Phi(x, y), \Phi(x', y')\rangle$. We have the following optimization problem [Tsochantaridis et al., 2005]:
$$\operatorname*{minimize}_{w,\xi} \; \frac{1}{2}\|w\|^2 + C\sum_{i=1}^n \xi_i \quad \text{s.t.} \quad \langle w, \Phi(x_i, y_i) - \Phi(x_i, y)\rangle \geq \Delta(y_i, y) - \xi_i, \; \forall i \in [n], \, y \in \mathcal{Y}. \quad (83)$$
This is a convex optimization problem which can be solved efficiently if the constraints can be evaluated without high computational cost. One typically employs column-generation methods [Hettich and Kortanek, 1993, Rätsch, 2001, Bennett et al., 2000, Tsochantaridis et al., 2005, Fletcher, 1989], which identify one violated constraint at a time to find an approximate minimum of the optimization problem.
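Column generation needs a separation oracle that returns the most violated constraint of (83), i.e. $\operatorname{argmax}_y \Delta(y_i, y) + \langle w, \Phi(x_i, y)\rangle$. Below is a sketch for the multicategory case, where $\Phi(x, y)$ stacks $x$ into the block of class $y$; this particular feature map and the helper names are assumptions made for illustration.

```python
import numpy as np

# Assumed multiclass joint feature map: Phi(x, y) places x in class y's block.
def phi(x, y, num_classes):
    out = np.zeros(num_classes * len(x))
    out[y * len(x):(y + 1) * len(x)] = x
    return out

def most_violated_label(w, x, y_i, num_classes):
    # separation oracle: argmax_y  Delta(y_i, y) + <w, Phi(x, y)>
    # with the 0-1 loss Delta(y_i, y) = 1 - delta_{y_i, y}
    scores = [float(y != y_i) + w @ phi(x, y, num_classes)
              for y in range(num_classes)]
    return int(np.argmax(scores))
```

Each call returns the label whose constraint in (83) is violated the most; adding that constraint to the working set and re-solving is one column-generation iteration.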
To describe the ﬂexibility of the framework set out by (83) we give several examples below:
• Binary classification can be recovered by setting $\Phi(x, y) = y\Phi(x)$, in which case the constraint of (83) reduces to $2y_i\langle \Phi(x_i), w\rangle \geq 1 - \xi_i$. Ignoring constant offsets and a scaling factor of 2, this is exactly the standard SVM optimization problem.
• Multicategory classification problems [Crammer and Singer, 2001, Collins, 2000, Allwein et al., 2000, Rätsch et al., 2003] can be encoded via $\mathcal{Y} = [N]$, where $N$ is the number of classes and $\Delta(y, y') = 1 - \delta_{y,y'}$. In other words, the loss is 1 whenever we predict the wrong class and 0 for correct classification. Corresponding kernels are typically chosen to be $\delta_{y,y'} k(x, x')$.
• We can deal with joint labeling problems by setting $\mathcal{Y} = \{\pm 1\}^n$. In other words, the error measure does not depend on a single observation but on an entire set of labels. Joachims [2005] shows that the so-called $F_1$ score [van Rijsbergen, 1979] used in document retrieval and the area under the ROC curve [Bamber, 1975, Gribskov and Robinson, 1996] fall into this category of problems. Moreover, Joachims [2005] derives an $O(n^2)$ method for evaluating the inequality constraint over $\mathcal{Y}$.
• Multilabel estimation problems deal with the situation where we want to find the best subset of labels $y \in 2^{[N]}$ which corresponds to some observation $x$. The problem is described in Elisseeff and Weston [2001], where the authors devise a ranking scheme such that $f(x, i) > f(x, j)$ if label $i \in y$ and $j \notin y$. It is a special case of a general ranking approach described next.
Note that (83) is invariant under translations $\Phi(x, y) \leftarrow \Phi(x, y) + \Phi_0$ where $\Phi_0$ is constant, as $\Phi(x_i, y_i) - \Phi(x_i, y)$ remains unchanged. In practice this means that transformations $k(x, y, x', y') \leftarrow k(x, y, x', y') + \langle \Phi_0, \Phi(x, y)\rangle + \langle \Phi_0, \Phi(x', y')\rangle + \|\Phi_0\|^2$ do not affect the outcome of the estimation process. Since $\Phi_0$ was arbitrary, we have the following lemma:
Lemma 13 Let $\mathcal{H}$ be an RKHS on $\mathcal{X} \times \mathcal{Y}$ with kernel $k$. Moreover let $g \in \mathcal{H}$. Then the function $k(x, y, x', y') + g(x, y) + g(x', y') + \|g\|^2_{\mathcal{H}}$ is a kernel and it yields the same estimates as $k$.
We need a slight extension to deal with general ranking problems. Denote by $\mathcal{Y} = \mathrm{Graph}[N]$ the set of all directed graphs on $N$ vertices which do not contain loops of less than three nodes. Here an edge $(i, j) \in y$ indicates that $i$ is preferred to $j$ with respect to the observation $x$. The goal is to find some function $f : \mathcal{X} \times [N] \to \mathbb{R}$ which imposes a total order on $[N]$ (for a given $x$) by virtue of the function values $f(x, i)$ such that the total order and $y$ are in good agreement.
More specifically, Dekel et al. [2003] and Crammer and Singer [2005] propose a decomposition algorithm $A$ for the graphs $y$ such that the estimation error is given by the number of subgraphs of $y$ which are in disagreement with the total order imposed by $f$. As an example, multiclass classification can be viewed as a graph $y$ where the correct label $i$ is at the root of a directed graph and all incorrect labels are its children. Multilabel classification is then a bipartite graph where the correct labels only contain outgoing arcs and the incorrect labels only incoming ones.
This setting leads to a form similar to (83) except for the fact that we now have constraints over each subgraph $G \in A(y)$. We solve
$$\operatorname*{minimize}_{w,\xi} \; \frac{1}{2}\|w\|^2 + C\sum_{i=1}^n |A(y_i)|^{-1} \sum_{G \in A(y_i)} \xi_{iG} \quad (84)$$
$$\text{subject to} \quad \langle w, \Phi(x_i, u) - \Phi(x_i, v)\rangle \geq 1 - \xi_{iG} \text{ and } \xi_{iG} \geq 0 \text{ for all } (u, v) \in G \in A(y_i).$$
That is, we test for all $(u, v) \in G$ whether the ranking imposed by $G \in A(y_i)$ is satisfied.
Finally, ordinal regression problems, which perform ranking not over labels $y$ but rather over observations $x$, were studied by Herbrich et al. [2000] and Chapelle and Harchaoui [2005] in the context of ordinal regression and conjoint analysis, respectively. In ordinal regression $x$ is preferred to $x'$ if $f(x) > f(x')$ and hence one minimizes an optimization problem akin to (83), with constraint $\langle w, \Phi(x_i) - \Phi(x_j)\rangle \geq 1 - \xi_{ij}$. In conjoint analysis the same operation is carried out for $\Phi(x, u)$, where $u$ is the user under consideration. Similar models were also studied by Basilico and Hofmann [2004]. Further models will be discussed in Section 5, in particular situations where $\mathcal{Y}$ is of exponential size. These models allow one to deal with sequences and more sophisticated structures.
3.5 Applications of SVM Algorithms
When SVMs were ﬁrst presented, they initially met with scepticism in the statistical community. Part of
the reason was that as described, SVMs construct their decision rules in potentially very highdimensional
feature spaces associated with kernels. Although there was a fair amount of theoretical work addressing
this issue (see Section 4 below), it was probably to a larger extent the empirical success of SVMs that
paved their way to becoming a standard method in the statistical toolbox. The first successes of SVMs on practical problems were in handwritten digit recognition, which was the main benchmark task considered in
the Adaptive Systems Department at AT&T Bell Labs where SVMs were developed (see e.g. LeCun et al.
[1998]). Using methods to incorporate transformation invariances, SVMs were shown to beat the world
record on the MNIST benchmark set, at the time the gold standard in the ﬁeld [DeCoste and Sch¨ olkopf,
2002]. There has been a signiﬁcant number of further computer vision applications of SVMs since then,
including tasks such as object recognition [Blanz et al., 1996, Chapelle et al., 1999, Holub et al., 2005,
Everingham et al., 2005] and detection [Romdhani et al., 2004]. Nevertheless, it is probably fair to say
that two other ﬁelds have been more inﬂuential in spreading the use of SVMs: bioinformatics and natural
language processing. Both of them have generated a spectrum of challenging highdimensional problems
on which SVMs excel, such as microarray processing tasks [Brown et al., 2000] and text categorization
[Dumais, 1998]. For further references, see Sch¨ olkopf et al. [2004] and Joachims [2002].
Many successful applications have been implemented using SV classifiers; however, the other variants of SVMs have also led to very good results, including SV regression [Müller et al., 1997], SV novelty detection [Hayton et al., 2001], SVMs for ranking [Herbrich et al., 2000] and, more recently, problems with interdependent labels [Tsochantaridis et al., 2005, McCallum et al., 2005].
At present there exists a large number of readily available software packages for SVM optimization.
For instance, SVMStruct, based on [Tsochantaridis et al., 2005] solves structured estimation problems.
LibSVM is an open source solver which excels on binary problems. The Torch package contains a
number of estimation methods, including SVM solvers. Several SVM implementations are also available
via statistical packages, such as R.
4 Margins and Uniform Convergence Bounds
So far we motivated the algorithms by means of their practicality and the fact that 0-1 loss functions yield hard-to-control estimators. We now follow up on the analysis by providing uniform convergence bounds for large margin classifiers. We focus on the case of scalar-valued functions applied to classification for two reasons: the derivation is well established and it can be presented in a concise fashion. Secondly, the derivation of corresponding bounds for the vectorial case is by and large still an open problem. Preliminary results exist, such as the bounds by Collins [2000] for the case of perceptrons, Taskar et al. [2003], who derive capacity bounds in terms of covering numbers by an explicit covering construction, and Bartlett and Mendelson [2002], who give Gaussian average bounds for vectorial functions. We believe that the scaling behavior of these bounds in the number of classes $|\mathcal{Y}|$ is currently not optimal, when applied to the problems of type (83).
Our analysis is based on the following ideas: firstly, the 0-1 loss is upper bounded by some function $\psi(yf(x))$ which can be efficiently minimized, such as the soft margin function $\max(0, 1 - yf(x))$ of the previous section. Secondly, we prove that the empirical average of the $\psi$-loss is concentrated close to its expectation. This will be achieved by means of Rademacher averages. Thirdly, we show that under rather general conditions the minimization of the $\psi$-loss is consistent with the minimization of the expected risk. Finally, we combine these bounds to obtain rates of convergence which only depend on the Rademacher average and the approximation properties of the function class under consideration.
4.1 Margins and Empirical Risk
While the sign of $yf(x)$ can be used to assess the accuracy of a binary classifier, we saw that for algorithmic reasons one rather optimizes (a smooth function of) $yf(x)$ directly. In the following we assume that the binary loss $\chi(\xi) = \frac{1}{2}(1 - \operatorname{sgn}\xi)$ is majorized by some function $\psi(\xi) \geq \chi(\xi)$, e.g. via the construction of Lemma 12. Consequently $\mathbf{E}[\chi(yf(x))] \leq \mathbf{E}[\psi(yf(x))]$ and likewise $\mathbf{E}_{\mathrm{emp}}[\chi(yf(x))] \leq \mathbf{E}_{\mathrm{emp}}[\psi(yf(x))]$. The hope is (as will be shown in Section 4.3) that minimizing the upper bound leads to consistent estimators.
There is a long-standing tradition of minimizing $yf(x)$ rather than the number of misclassifications. $yf(x)$ is known as the "margin" (based on the geometrical reasoning) in the context of SVMs [Vapnik and Lerner, 1963, Mangasarian, 1965], as the "stability" in the context of neural networks [Krauth and Mézard, 1987, Ruján, 1993], and as the "edge" in the context of arcing [Breiman, 1999]. One may show [Makovoz, 1996, Barron, 1993, Herbrich and Williamson, 2002] that functions $f$ in an RKHS achieving a large margin can be approximated by another function $f'$ achieving almost the same empirical error using a much smaller number of kernel functions.
Note that by default, uniform convergence bounds are expressed in terms of minimization of the empirical risk average with respect to a fixed function class $\mathcal{F}$, e.g. Vapnik and Chervonenkis [1971]. This is very much unlike what is done in practice: in the SVM (83) the sum of empirical risk and a regularizer is minimized. However, one may check that minimizing $\mathbf{E}_{\mathrm{emp}}[\psi(yf(x))]$ subject to $\|w\|^2 \leq W$ is equivalent to minimizing $\mathbf{E}_{\mathrm{emp}}[\psi(yf(x))] + \lambda\|w\|^2$ for suitably chosen values of $\lambda$. The equivalence is immediate by using Lagrange multipliers. For numerical reasons, however, the second formulation is much more convenient [Tikhonov, 1963, Morozov, 1984], as it acts as a regularizer. Finally, for the design of adaptive estimators, so-called luckiness results exist, which provide risk bounds in a data-dependent fashion [Shawe-Taylor et al., 1998, Herbrich and Williamson, 2002].
4.2 Uniform Convergence and Rademacher Averages
The next step is to bound the deviation $\mathbf{E}_{\mathrm{emp}}[\psi(yf(x))] - \mathbf{E}[\psi(yf(x))]$ by means of Rademacher averages. For details see Bousquet et al. [2005], Mendelson [2003], Bartlett et al. [2002], and Koltchinskii [2001]. Denote by $g : \mathcal{X}^n \to \mathbb{R}$ a function of $n$ variables and let $c > 0$ such that $|g(x_1, \ldots, x_n) - g(x_1, \ldots, x_{i-1}, x'_i, x_{i+1}, \ldots, x_n)| \leq c$ for all $x_1, \ldots, x_n, x'_i \in \mathcal{X}$ and for all $i \in [n]$; then [McDiarmid, 1989]
$$P\left\{\mathbf{E}[g(x_1, \ldots, x_n)] - g(x_1, \ldots, x_n) > \epsilon\right\} \leq \exp\left(-2\epsilon^2/(nc^2)\right). \quad (85)$$
Assume that $f(x) \in [0, B]$ for all $f \in \mathcal{F}$ and let $g(x_1, \ldots, x_n) := \sup_{f \in \mathcal{F}} |\mathbf{E}_{\mathrm{emp}}[f(x)] - \mathbf{E}[f(x)]|$. Then it follows that $c \leq \frac{B}{n}$. Solving (85) for $g$ we obtain that with probability at least $1 - \delta$
$$\sup_{f \in \mathcal{F}} \mathbf{E}[f(x)] - \mathbf{E}_{\mathrm{emp}}[f(x)] \leq \mathbf{E}\left[\sup_{f \in \mathcal{F}} \mathbf{E}[f(x)] - \mathbf{E}_{\mathrm{emp}}[f(x)]\right] + B\sqrt{\frac{-\log\delta}{2n}}. \quad (86)$$
This means that with high probability the largest deviation between the sample average and its expectation is concentrated around its mean and within an $O(n^{-\frac{1}{2}})$ term. The expectation can be bounded by a classical symmetrization argument [Vapnik and Chervonenkis, 1971] as follows:
$$\mathbf{E}_X\left[\sup_{f \in \mathcal{F}} \mathbf{E}[f(x')] - \mathbf{E}_{\mathrm{emp}}[f(x)]\right] \leq \mathbf{E}_{X,X'}\left[\sup_{f \in \mathcal{F}} \mathbf{E}_{\mathrm{emp}}[f(x')] - \mathbf{E}_{\mathrm{emp}}[f(x)]\right]$$
$$= \mathbf{E}_{X,X',\sigma}\left[\sup_{f \in \mathcal{F}} \mathbf{E}_{\mathrm{emp}}[\sigma f(x')] - \mathbf{E}_{\mathrm{emp}}[\sigma f(x)]\right] \leq 2\,\mathbf{E}_{X,\sigma}\left[\sup_{f \in \mathcal{F}} \mathbf{E}_{\mathrm{emp}}[\sigma f(x)]\right] =: 2 R_n[\mathcal{F}].$$
The first inequality follows from the convexity of the argument of the expectation; the second equality follows from the fact that $x_i$ and $x'_i$ are drawn i.i.d. from the same distribution, hence we may swap terms. Here the $\sigma_i$ are independent $\pm 1$-valued zero-mean Rademacher random variables. $R_n[\mathcal{F}]$ is referred to as the Rademacher average [Mendelson, 2001, Bartlett and Mendelson, 2002, Koltchinskii, 2001] of $\mathcal{F}$ w.r.t. sample size $n$.
For linear function classes $R_n[\mathcal{F}]$ takes on a particularly nice form. We begin with $\mathcal{F} := \{f \mid f(x) = \langle x, w\rangle \text{ and } \|w\| \leq 1\}$. It follows that $\sup_{\|w\| \leq 1} \sum_{i=1}^n \sigma_i \langle w, x_i\rangle = \left\|\sum_{i=1}^n \sigma_i x_i\right\|$. Hence
$$n R_n[\mathcal{F}] = \mathbf{E}_{X,\sigma}\left[\left\|\sum_{i=1}^n \sigma_i x_i\right\|\right] \leq \mathbf{E}_X\left[\left(\mathbf{E}_\sigma\left[\left\|\sum_{i=1}^n \sigma_i x_i\right\|^2\right]\right)^{\frac{1}{2}}\right] = \mathbf{E}_X\left[\left(\sum_{i=1}^n \|x_i\|^2\right)^{\frac{1}{2}}\right] \leq \sqrt{n\,\mathbf{E}\left[\|x\|^2\right]}. \quad (87)$$
Here the first inequality is a consequence of Jensen's inequality, the second equality follows from the fact that the $\sigma_i$ are i.i.d. zero-mean random variables, and the last step again is a result of Jensen's inequality. Corresponding lower bounds, tight up to a factor of $1/\sqrt{2}$, exist as a result of the Khintchine-Kahane inequality [Kahane, 1968].
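The chain of inequalities in (87) can be probed by Monte Carlo: estimate $\mathbf{E}_\sigma\|\sum_i \sigma_i x_i\|$ for one fixed sample and compare it with the empirical counterpart of the bound, $(\sum_i \|x_i\|^2)^{1/2}$. The data, dimensions, and number of Monte Carlo draws below are arbitrary choices for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 3
X = rng.normal(size=(n, d))                # arbitrary fixed sample

# Monte-Carlo estimate of E_sigma || sum_i sigma_i x_i ||  (= n R_n[F])
sigmas = rng.choice([-1.0, 1.0], size=(20000, n))
estimate = np.linalg.norm(sigmas @ X, axis=1).mean()

# empirical counterpart of the upper bound in (87): (sum_i ||x_i||^2)^(1/2)
bound = np.sqrt((X ** 2).sum())
```

The estimate should sit below the bound but, by the Khintchine-Kahane lower bound, not by more than a constant factor.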
Note that (87) allows us to bound $R_n[\mathcal{F}] \leq n^{-\frac{1}{2}} r$, where $r$ is the average norm of vectors in the sample. An extension to kernel functions is straightforward: by design of the inner product we have $r = \sqrt{\mathbf{E}_x[k(x, x)]}$. Note that this bound is independent of the dimensionality of the data; it depends only on the expected norm of the input vectors. Moreover $r$ can also be interpreted as the trace of the integral operator with kernel $k(x, x')$ and probability measure on $\mathcal{X}$.
Since we are computing $\mathbf{E}_{\mathrm{emp}}[\psi(yf(x))]$, we are interested in the Rademacher complexity of $\psi \circ \mathcal{F}$. Bartlett and Mendelson [2002] show that $R_n[\psi \circ \mathcal{F}] \leq L R_n[\mathcal{F}]$ for any Lipschitz continuous function $\psi$ with Lipschitz constant $L$ and with $\psi(0) = 0$. Secondly, for $\{yb \text{ where } |b| \leq B\}$ the Rademacher average can be bounded by $B\sqrt{2\log 2/n}$, as follows from [Bousquet et al., 2005, eq. (4)]. This takes care of the offset $b$. For sums of function classes $\mathcal{F}$ and $\mathcal{G}$ we have $R_n[\mathcal{F} + \mathcal{G}] \leq R_n[\mathcal{F}] + R_n[\mathcal{G}]$. This means that for linear functions with $\|w\| \leq W$, $|b| \leq B$, and $\psi$ Lipschitz continuous with constant $L$ we have
$$R_n \leq \frac{L}{\sqrt{n}}\left(Wr + B\sqrt{2\log 2}\right).$$
4.3 Upper Bounds and Convex Functions
We briefly discuss consistency of minimization of the surrogate loss function $\psi : \mathbb{R} \to [0, \infty)$, which we assume to be convex and $\chi$-majorizing, i.e. $\psi \geq \chi$ [Jordan et al., 2003, Zhang, 2004]. Examples of such functions are the soft-margin loss $\max(0, 1 - \gamma\xi)$, which we discussed in Section 3, the boosting loss $e^{-\xi}$, which is commonly used in AdaBoost [Schapire et al., 1998, Rätsch et al., 2001], and the logistic loss $\ln\left(1 + e^{-\xi}\right)$ (see Section 5).
Denote by $f^*_\chi$ the minimizer of the expected risk and let $f^*_\psi$ be the minimizer of $\mathbf{E}[\psi(yf(x))]$ with respect to $f$. Then under rather general conditions on $\psi$ [Zhang, 2004] for all $f$ the following inequality holds:
$$\mathbf{E}[\chi(yf(x))] - \mathbf{E}\left[\chi(yf^*_\chi(x))\right] \leq c\left(\mathbf{E}[\psi(yf(x))] - \mathbf{E}\left[\psi(yf^*_\psi(x))\right]\right)^s. \quad (88)$$
In particular we have $c = 4$ and $s = 1$ for the soft margin loss, whereas for boosting and logistic regression $c = \sqrt{8}$ and $s = \frac{1}{2}$. Note that (88) implies that the minimizer of the $\psi$-loss is consistent, i.e. $\mathbf{E}[\chi(yf_\psi(x))] = \mathbf{E}[\chi(yf_\chi(x))]$.
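That $\psi$ majorizes $\chi$ is easy to check on a grid for the surrogate losses just mentioned. Note one assumption in this sketch: the logistic loss is rescaled to base 2 here (the text states it in natural log) so that it also dominates the 0-1 loss at negative margins.

```python
import numpy as np

xi = np.linspace(-3.0, 3.0, 601)
chi = 0.5 * (1 - np.sign(xi))            # 0-1 loss chi(xi)

soft_margin = np.maximum(0.0, 1 - xi)    # hinge loss, gamma = 1
boosting = np.exp(-xi)                   # AdaBoost loss
logistic = np.log2(1 + np.exp(-xi))      # logistic loss, rescaled to base 2
                                         # so that it dominates chi at xi < 0

ok = ((soft_margin >= chi) & (boosting >= chi) & (logistic >= chi)).all()
```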
4.4 Rates of Convergence
We now have all tools at our disposal to obtain rates of convergence to the minimizer of the expected risk which depend only on the complexity of the function class and its approximation properties in terms of the $\psi$-loss. Denote by $f^*_{\psi,\mathcal{F}}$ the minimizer of $\mathbf{E}[\psi(yf(x))]$ restricted to $\mathcal{F}$, let $f^n_{\psi,\mathcal{F}}$ be the minimizer of the empirical $\psi$-risk, and let $\delta(\mathcal{F}, \psi) := \mathbf{E}\left[\psi(yf^*_{\psi,\mathcal{F}}(x))\right] - \mathbf{E}\left[\psi(yf^*_\psi(x))\right]$ be the approximation error due to the restriction of $f$ to $\mathcal{F}$. Then a simple telescope sum yields
$$\mathbf{E}\left[\chi(yf^n_{\psi,\mathcal{F}})\right] \leq \mathbf{E}\left[\chi(yf^*_\chi)\right] + 4\left[\mathbf{E}\left[\psi(yf^n_{\psi,\mathcal{F}})\right] - \mathbf{E}_{\mathrm{emp}}\left[\psi(yf^n_{\psi,\mathcal{F}})\right]\right] + 4\left[\mathbf{E}_{\mathrm{emp}}\left[\psi(yf^*_{\psi,\mathcal{F}})\right] - \mathbf{E}\left[\psi(yf^*_{\psi,\mathcal{F}})\right]\right] + \delta(\mathcal{F}, \psi)$$
$$\leq \mathbf{E}\left[\chi(yf^*_\chi)\right] + \delta(\mathcal{F}, \psi) + \frac{4RW\gamma}{\sqrt{n}}\left(\sqrt{-2\log\delta} + r/R + \sqrt{8\log 2}\right). \quad (89)$$
Here $\gamma$ is the effective margin of the soft-margin loss $\max(0, 1 - \gamma yf(x))$, $W$ is an upper bound on $\|w\|$, $R \geq \|x\|$, $r$ is the average radius as defined in the previous section, and we assumed that $b$ is bounded by the largest value of $\langle w, x\rangle$. A similar reasoning for logistic and exponential loss is given in Bousquet et al. [2005].
Note that we get an $O(1/\sqrt{n})$ rate of convergence regardless of the dimensionality of $x$. Moreover, note that the rate is dominated by $RW\gamma$, that is, the classical radius-margin bound [Vapnik, 1995]. Here $R$ is the radius of an enclosing sphere for the data and $1/(W\gamma)$ is an upper bound on the radius of the data; the soft-margin loss becomes active only for $yf(x) \leq \gamma$.
4.5 Localization and Noise Conditions
In many cases it is possible to obtain better rates of convergence than $O(1/\sqrt{n})$ by exploiting information about the magnitude of the error of misclassification and about the variance of $f$ on $\mathcal{X}$. Such bounds use Bernstein-type inequalities and they lead to localized Rademacher averages [Bartlett et al., 2002, Mendelson, 2003, Bousquet et al., 2005].
Basically, the slow $O(1/\sqrt{n})$ rates arise whenever the region around the Bayes optimal decision boundary is large. In this case, determining this region produces the slow rate, whereas the well-determined region could be estimated at an $O(1/n)$ rate.
Tsybakov's noise condition [Tsybakov, 2003] requires that there exist $\beta, \gamma \geq 0$ such that
$$P\left\{\left|P\{y = 1 \mid x\} - \frac{1}{2}\right| \leq t\right\} \leq \beta t^\gamma \quad \text{for all } t \geq 0. \quad (90)$$
Note that for $\gamma = \infty$ the condition implies that there exists some $s$ such that $\left|P\{y = 1 \mid x\} - \frac{1}{2}\right| \geq s > 0$
The key benefit of (90) is that it implies a relationship between the variance and the expected value of the classification loss. More specifically, for $\alpha = \frac{\gamma}{1+\gamma}$ and $g : \mathcal{X} \to \mathcal{Y}$ we have
$$\mathbf{E}\left[\left[\{g(x) \neq y\} - \{g^*(x) \neq y\}\right]^2\right] \leq c\left[\mathbf{E}\left[\{g(x) \neq y\} - \{g^*(x) \neq y\}\right]\right]^\alpha. \quad (91)$$
Here $g^*(x) := \operatorname*{argmax}_y P(y \mid x)$ denotes the Bayes optimal classifier. This is sufficient to obtain faster rates for finite sets of classifiers. For more complex function classes localization is used. See, e.g., Bousquet et al. [2005] and Bartlett et al. [2002] for more details.
5 Statistical Models and RKHS
As we have argued so far, the Reproducing Kernel Hilbert Space approach offers many advantages in
machine learning: (i) powerful and ﬂexible models can be deﬁned, (ii) many results and algorithms for
linear models in Euclidean spaces can be generalized to RKHS, (iii) learning theory assures that effective
learning in RKHS is possible, for instance, by means of regularization.
In this chapter, we will show how kernel methods can be utilized in the context of statistical models.
There are several reasons to pursue such an avenue. First of all, in conditional modeling, it is often insufficient to compute a prediction without assessing confidence and reliability. Second, when dealing with multiple or structured responses, it is important to model dependencies between responses in addition to the dependence on a set of covariates. Third, incomplete data, be it due to missing variables, incomplete training targets, or a model structure involving latent variables, needs to be dealt with in a principled manner. All of these issues can be addressed by using the RKHS approach to define statistical models and by combining kernels with statistical approaches such as exponential models, generalized linear models, and Markov networks.
5.1 Exponential RKHS Models
5.1.1 Exponential Models
Exponential models or exponential families are among the most important classes of parametric models studied in statistics. Given a canonical vector of statistics $\Phi$ and a $\sigma$-finite measure $\nu$ over the sample space $\mathcal{X}$, an exponential model can be defined via its probability density with respect to $\nu$ (cf. Barndorff-Nielsen [1978]),
$$p(x; \theta) = \exp\left[\langle\theta, \Phi(x)\rangle - g(\theta)\right], \quad \text{where } g(\theta) := \ln \int_{\mathcal{X}} e^{\langle\theta, \Phi(x)\rangle}\, d\nu(x). \quad (92)$$
The $m$-dimensional vector $\theta \in \Theta$ with $\Theta := \{\theta \in \mathbb{R}^m : g(\theta) < \infty\}$ is also called the canonical parameter vector. In general, there are multiple exponential representations of the same model via canonical parameters that are affinely related to one another (Murray and Rice [1993]). A representation with minimal $m$ is called a minimal representation, in which case $m$ is the order of the exponential model. One of the most important properties of exponential families is that they have sufficient statistics of fixed dimensionality, i.e. the joint density for i.i.d. random variables $X_1, X_2, \ldots, X_n$ is also exponential, the corresponding canonical statistics simply being $\sum_{i=1}^n \Phi(X_i)$. It is well known that much of the structure of exponential models can be derived from the log-partition function $g(\theta)$, in particular (cf. Lauritzen [1996])
$$\nabla_\theta g(\theta) = \mu(\theta) := \mathbf{E}_\theta[\Phi(X)], \quad \partial^2_\theta g(\theta) = \mathbf{V}_\theta[\Phi(X)], \quad (93)$$
where $\mu$ is known as the mean-value map. Being a covariance matrix, the Hessian of $g$ is positive semidefinite and consequently $g$ is convex.
Maximum likelihood estimation (MLE) in exponential families leads to a particularly elegant form of the MLE equations: the expected and the observed canonical statistics agree at the MLE $\hat{\theta}$. This means, given an i.i.d. sample $S = (x_i)_{i \in [n]}$,
$$\mathbf{E}_{\hat{\theta}}[\Phi(X)] = \mu(\hat{\theta}) = \frac{1}{n}\sum_{i=1}^n \Phi(x_i) =: \mathbf{E}_S[\Phi(X)]. \quad (94)$$
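The moment-matching characterization (94) can be made concrete with the simplest example, the Bernoulli family, where $\Phi(x) = x$, $g(\theta) = \ln(1 + e^\theta)$, and the mean-value map is the sigmoid. The sample below is made up for illustration.

```python
import numpy as np

# Bernoulli as an exponential family: Phi(x) = x on {0, 1},
# g(theta) = ln(1 + e^theta), mean-value map mu(theta) = sigmoid(theta).
def g(theta):
    return np.log1p(np.exp(theta))

def mu(theta):                                # gradient of the log-partition
    return 1.0 / (1.0 + np.exp(-theta))

sample = np.array([1, 0, 1, 1, 0, 1, 1, 0])   # made-up i.i.d. draws
emp_mean = sample.mean()                      # E_S[Phi(X)]

# moment matching (Eq. 94): mu(theta_hat) = empirical mean
theta_hat = np.log(emp_mean / (1 - emp_mean))  # sigmoid inverse (logit)
```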
5.1.2 Exponential RKHS Models
One can extend the parametric exponential model in (92) by defining a statistical model via an RKHS $\mathcal{H}$ with generating kernel $k$. Linear functions $\langle\theta, \Phi(\cdot)\rangle$ over $\mathcal{X}$ are replaced with functions $f \in \mathcal{H}$, which yields an exponential RKHS model
$$p(x; f) = \exp[f(x) - g(f)], \quad f \in \mathcal{H} := \left\{f : f(\cdot) = \sum_{x \in S} \alpha_x k(\cdot, x), \; S \subseteq \mathcal{X}, \; |S| < \infty\right\}. \quad (95)$$
A justiﬁcation for using exponential RKHS families with rich canonical statistics as a generic way to
deﬁne nonparametric models stems from the fact that if the chosen kernel k is powerful enough, the
associated exponential families become universal density estimators. This can be made precise using the
concept of universal kernels (Steinwart [2002a], cf. Section 2).
Proposition 14 (Dense Densities) Let $X$ be a measurable set with a fixed $\sigma$-finite measure $\nu$ and denote by $P$ a family of densities on $X$ with respect to $\nu$ such that every $p \in P$ is uniformly bounded from above and continuous. Let $k : X \times X \to \mathbb{R}$ be a universal kernel for $H$. Then the exponential RKHS family of densities generated by $k$ according to Eq. (95) is dense in $P$ in the $L_\infty$ sense.
Proof Let $D := \sup_{p} \|p\|_\infty$ and choose $\eta = \eta(\epsilon, D) > 0$ appropriately (see below). By the universality assumption, for every $p \in P$ there exists $f \in H$ such that $\|f - \ln p\|_\infty < \eta$. Exploiting the strict convexity of the exp-function one gets that $e^{-\eta} p < e^f < e^\eta p$, which directly implies $e^{-\eta} < e^{g(f)} = \int_X e^{f(x)}\, d\nu(x) < e^\eta$. Moreover $\|e^f - p\|_\infty < D(e^\eta - 1)$, and hence

$$\|p(f) - p\|_\infty \le \|e^f - p\|_\infty + \|e^{f - g(f)} - e^f\|_\infty < D(e^\eta - 1)(1 + e^\eta) < 2 D e^\eta (e^\eta - 1),$$

where we utilized the upper bound

$$\|e^{f - g(f)} - e^f\|_\infty \le |e^{-g(f)} - 1| \cdot \|e^f\|_\infty < (e^\eta - 1)\, \|e^f\|_\infty < D e^\eta (e^\eta - 1).$$

Equating the last bound with $\epsilon$ and solving for $\eta$ results in $\eta(\epsilon, D) = \ln\big[\tfrac{1}{2} + \tfrac{1}{2}\sqrt{1 + \tfrac{2\epsilon}{D}}\,\big] > 0$.
5.1.3 Conditional Exponential Models
For the rest of the paper, we will focus on the case of predictive or conditional modeling with a – potentially compound or structured – response variable $Y$ and predictor variables $X$. Taking up the concept of joint kernels introduced in the previous section, we will investigate conditional models that are defined by functions $f : X \times Y \to \mathbb{R}$ from some RKHS $H$ over $X \times Y$ with kernel $k$ as follows:

$$p(y|x; f) = \exp\left[f(x,y) - g(x,f)\right], \quad \text{where } g(x,f) := \ln \int_Y e^{f(x,y)}\, d\nu(y). \qquad (96)$$

Notice that in the finite-dimensional case we have a feature map $\Phi : X \times Y \to \mathbb{R}^m$ from which parametric models are obtained via $H := \{f : \exists w,\ f(x,y) = f(x,y;w) := \langle w, \Phi(x,y)\rangle\}$, and each $f$ can be identified with its parameter $w$. Let us discuss some concrete examples to illustrate the rather general model equation (96).
• Let $Y$ be univariate and define $\Phi(x,y) = y\Phi(x)$. Then simply $f(x,y;w) = \langle w, \Phi(x,y)\rangle = y\tilde f(x;w)$ with $\tilde f(x;w) := \langle w, \Phi(x)\rangle$, and the model equation in (96) reduces to

$$p(y|x; w) = \exp\left[y\,\langle w, \Phi(x)\rangle - g(x,w)\right]. \qquad (97)$$

This is a generalized linear model (GLM) (Nelder and Wedderburn [1972], McCullagh and Nelder [1983]) with a canonical link, i.e. the canonical parameters depend linearly on the covariates $\Phi(x)$. For different response scales we obtain several well-known models, for instance logistic regression, where $y \in \{-1, 1\}$.
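To see Eq. (97) in action, here is a minimal numeric sketch (an illustration of ours, not code from the paper) for the binary case $y \in \{-1, +1\}$, where the log-partition function is a sum over just two outputs and the model normalizes by construction.

```python
import numpy as np

# Eq. (97) with y in {-1, +1}: p(y|x; w) = exp(y <w, Phi(x)> - g(x, w)),
# where g(x, w) = log( exp(<w, Phi(x)>) + exp(-<w, Phi(x)>) ).
# The vectors w and phi_x below are arbitrary illustrative values.

def conditional_prob(w, phi_x, y):
    f = np.dot(w, phi_x)
    g = np.logaddexp(f, -f)   # log-partition over the two outputs
    return np.exp(y * f - g)

w = np.array([0.5, -1.0])
phi_x = np.array([2.0, 3.0])
p_plus = conditional_prob(w, phi_x, +1)
p_minus = conditional_prob(w, phi_x, -1)
print(round(p_plus + p_minus, 12))  # 1.0 -- a properly normalized conditional model
```

A short calculation shows $p(+1|x;w) = 1/(1 + e^{-2\langle w,\Phi(x)\rangle})$, i.e. the familiar logistic (sigmoid) link.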
• In the nonparametric extension of generalized linear models following [Green and Yandell, 1985, O'Sullivan et al., 1986], the parametric assumption on the linear predictor $\tilde f(x;w) = \langle w, \Phi(x)\rangle$ in GLMs is relaxed by requiring only that $\tilde f$ come from some sufficiently smooth class of functions, namely an RKHS defined over $X$. In combination with a parametric part, this can also be used to define semi-parametric models. Popular choices of kernels include the ANOVA kernel investigated in Wahba et al. [1995]. This is a special case of defining joint kernels from an existing kernel $k$ over inputs via $k((x,y),(x',y')) := yy'\, k(x,x')$.
• Joint kernels provide a powerful framework for prediction problems with structured outputs. An illuminating example is statistical natural language parsing with lexicalized probabilistic context-free grammars (cf. Magerman [1996]). Here $x$ will be an English sentence and $y$ a parse tree for $x$, i.e. a highly structured and complex output. The productions of the grammar are known, but the conditional probability $p(y|x)$ needs to be estimated based on training data of parsed/annotated sentences. In the simplest case, the extracted statistics $\Phi$ may encode the frequencies of the use of different productions in a sentence with known parse tree. More sophisticated feature encodings are discussed in Taskar et al. [2004], Zettlemoyer and Collins [2005]. The conditional modeling approach provides alternatives to state-of-the-art approaches that estimate joint models $p(x,y)$ with maximum likelihood or maximum entropy (Charniak [2000]) and obtain predictive models by conditioning on $x$.
5.1.4 Risk Functions for Model Fitting
There are different inference principles to determine the optimal function $f \in H$ for the conditional exponential model in (96). One standard approach to parametric model fitting is to maximize the conditional log-likelihood – or, equivalently, minimize a logarithmic loss – a strategy pursued in the Conditional Random Field (CRF) approach of Lafferty et al. [2001]. Here we consider the more general case of minimizing a functional that includes a monotone function of the Hilbert space norm $\|f\|_H$ as a stabilizer (cf. Wahba [1990]). This reduces to penalized log-likelihood estimation in the finite-dimensional case,

$$C^{ll}(f;S) := -\frac{1}{n}\sum_{i=1}^n \ln p(y_i|x_i; f), \qquad \hat f^{\,ll}(S) := \operatorname*{argmin}_{f \in H}\ \frac{\lambda}{2}\|f\|_H^2 + C^{ll}(f;S). \qquad (98)$$
• For the parametric case, Lafferty et al. [2001] have employed variants of improved iterative scaling (Della Pietra et al. [1997], Darroch and Ratcliff [1972]) to optimize Eq. (98), whereas Sha and Pereira [2003] have investigated preconditioned conjugate gradient descent and limited-memory quasi-Newton methods.
• In order to optimize Eq. (98) one usually needs to compute expectations of the canonical statistics $\mathbf{E}_f[\Phi(Y,x)]$ at sample points $x = x_i$, which requires the availability of efficient inference algorithms.
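The role of these expectations can be made explicit in a toy setting (our own sketch, with a hypothetical feature map and random data, assuming a parametric $f(x,y;w) = \langle w, \Phi(x,y)\rangle$ and a small, fully enumerable output set): the gradient of the negative conditional log-likelihood in Eq. (98) is exactly the difference between expected and observed statistics.

```python
import numpy as np

# Gradient of the unregularized log-loss C^ll in Eq. (98):
#   dC/dw = (1/n) sum_i ( E_w[Phi(x_i, Y)] - Phi(x_i, y_i) ),
# which requires the expected statistics under the current model.

rng = np.random.default_rng(0)
n_classes, dim = 3, 2
X = rng.normal(size=(5, dim))

def phi(x, y):
    # Hypothetical joint feature map: class-indexed copy of the input.
    out = np.zeros((n_classes, dim))
    out[y] = x
    return out.ravel()

def neg_log_lik(w, X, ys):
    total = 0.0
    for x, y in zip(X, ys):
        scores = np.array([w @ phi(x, c) for c in range(n_classes)])
        total += np.log(np.exp(scores).sum()) - scores[y]
    return total / len(X)

def grad(w, X, ys):
    g = np.zeros_like(w)
    for x, y in zip(X, ys):
        scores = np.array([w @ phi(x, c) for c in range(n_classes)])
        p = np.exp(scores - scores.max())
        p /= p.sum()
        expected = sum(p[c] * phi(x, c) for c in range(n_classes))  # E_w[Phi(x_i, Y)]
        g += expected - phi(x, y)
    return g / len(X)

ys = rng.integers(0, n_classes, size=len(X))
w = rng.normal(size=n_classes * dim)
# Finite-difference check that the analytic gradient matches.
eps = 1e-6
num = np.array([(neg_log_lik(w + eps * e, X, ys) - neg_log_lik(w - eps * e, X, ys)) / (2 * eps)
                for e in np.eye(len(w))])
print(np.allclose(num, grad(w, X, ys), atol=1e-6))  # True
```

For structured $Y$ the explicit enumeration above is infeasible, which is precisely why the efficient inference algorithms mentioned in the bullet are needed.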
As we have seen in the case of classification and regression, likelihood-based criteria are by no means the only justifiable choice, and large margin methods offer an interesting alternative. To that end, we will present a general formulation of large margin methods for response variables over finite sample spaces that is based on the approach suggested by Altun et al. [2003] and Taskar et al. [2003]. Define

$$r(x,y;f) := f(x,y) - \max_{y' \ne y} f(x,y') = \min_{y' \ne y} \log \frac{p(y|x;f)}{p(y'|x;f)} \quad \text{and} \quad r(S;f) := \min_{i \in [n]}\, r(x_i, y_i; f). \qquad (99)$$

Here $r(S;f)$ generalizes the notion of separation margin used in SVMs. Since the log-odds ratio is sensitive to rescaling of $f$, i.e. $r(x,y;\beta f) = \beta\, r(x,y;f)$, we need to constrain $\|f\|_H$ to make the problem well-defined. We thus replace $f$ by $\phi^{-1} f$ for some fixed dispersion parameter $\phi > 0$ and define the maximum margin problem $\hat f^{\,mm}(S) := \phi^{-1} \operatorname{argmax}_{\|f\|_H = 1} r(S; f/\phi)$. For the sake of the presentation, we will drop $\phi$ in the following.[9]
Using the same line of arguments as was used in Section 3, the maximum margin problem can be reformulated as a constrained optimization problem

$$\hat f^{\,mm}(S) := \operatorname*{argmin}_{f \in H}\ \frac{1}{2}\|f\|_H^2, \quad \text{s.t. } r(x_i, y_i; f) \ge 1,\ \forall i \in [n], \qquad (100)$$

provided the latter is feasible, i.e. if there exists $f \in H$ such that $r(S;f) > 0$. To make the connection to SVMs, consider the case of binary classification with $\Phi(x,y) = y\Phi(x)$ and $f(x,y;w) = \langle w, y\Phi(x)\rangle$, where $r(x,y;f) = \langle w, y\Phi(x)\rangle - \langle w, -y\Phi(x)\rangle = 2y\,\langle w, \Phi(x)\rangle = 2\rho(x,y;w)$. The latter is twice the standard margin for binary classification in SVMs.
A soft margin version can be defined based on the hinge loss as follows:

$$C^{hl}(f;S) := \frac{1}{n}\sum_{i=1}^n \max\{1 - r(x_i, y_i; f),\ 0\}, \qquad \hat f^{\,sm}(S) := \operatorname*{argmin}_{f \in H}\ \frac{\lambda}{2}\|f\|_H^2 + C^{hl}(f;S). \qquad (101)$$
• An equivalent formulation using slack variables $\xi_i$, as discussed in Section 3, can be obtained by introducing soft-margin constraints $r(x_i, y_i; f) \ge 1 - \xi_i$, $\xi_i \ge 0$, and by defining $C^{hl} = \frac{1}{n}\sum_i \xi_i$. Each nonlinear constraint can be further expanded into $|Y|$ linear constraints $f(x_i, y_i) - f(x_i, y) \ge 1 - \xi_i$ for all $y \ne y_i$.
• Prediction problems with structured outputs often involve task-specific loss functions $\triangle : Y \times Y \to \mathbb{R}$, as discussed in Section 3.4. As suggested in Taskar et al. [2003], cost-sensitive large margin methods can be obtained by defining rescaled margin constraints $f(x_i, y_i) - f(x_i, y) \ge \triangle(y_i, y) - \xi_i$.
• Another sensible option in the parametric case is to minimize an exponential risk function of the following type:

$$\hat f^{\,exp}(S) := \operatorname*{argmin}_{w}\ \frac{1}{n}\sum_{i=1}^n \sum_{y \ne y_i} \exp\left[-\big(f(x_i, y_i; w) - f(x_i, y; w)\big)\right]. \qquad (102)$$

This is related to the exponential loss used in the AdaBoost method of Freund and Schapire [1996]. Since we are mainly interested in kernel-based methods here, we refrain from further elaborating on this connection.
[9] We will not deal with the problem of how to estimate $\phi$ here; note, however, that one does need to know $\phi$ in order to make an optimal deterministic prediction.
5.1.5 Generalized Representer Theorem and Dual Soft-Margin Formulation
It is crucial to understand how the representer theorem applies in the setting of arbitrary discrete output spaces, since a finite representation for the optimal $\hat f \in \{\hat f^{\,ll}, \hat f^{\,sm}\}$ is the basis for constructive model fitting. Notice that the regularized log-loss as well as the soft margin functional introduced above depend not only on the values of $f$ on the sample $S$, but rather on the evaluation of $f$ on the augmented sample $\tilde S := \{(x_i, y) : i \in [n],\ y \in Y\}$. This is the case because, for each $x_i$, output values $y \ne y_i$ not observed with $x_i$ show up in the log-partition function $g(x_i, f)$ in (96) as well as in the log-odds ratios in (99). This adds an additional complication compared to binary classification.
Corollary 15 Denote by $H$ an RKHS on $X \times Y$ with kernel $k$ and let $S = ((x_i, y_i))_{i \in [n]}$. Furthermore, let $C(f;S)$ be a functional depending on $f$ only via its values on the augmented sample $\tilde S$. Let $\Omega$ be a strictly monotonically increasing function. Then the solution of the optimization problem $\hat f(S) := \operatorname{argmin}_{f \in H} C(f; \tilde S) + \Omega(\|f\|_H)$ can be written as

$$\hat f(\cdot) = \sum_{i=1}^n \sum_{y \in Y} \beta_{iy}\, k(\cdot, (x_i, y)). \qquad (103)$$

This follows directly from Theorem 11.
Let us focus on the soft margin maximizer $\hat f^{\,sm}$. Instead of solving (101) directly, we first derive the dual program, following essentially the derivation in Section 3.
Proposition 16 (Tsochantaridis et al. [2005]) The minimizer $\hat f^{\,sm}(S)$ can be written as in Corollary 15, where the expansion coefficients can be computed from the solution of the following convex quadratic program:

$$\alpha^* = \operatorname*{argmin}_{\alpha}\ \left\{ \frac{1}{2} \sum_{i,j=1}^n \sum_{y \ne y_i} \sum_{y' \ne y_j} \alpha_{iy}\,\alpha_{jy'}\, K_{iy,jy'} - \sum_{i=1}^n \sum_{y \ne y_i} \alpha_{iy} \right\} \qquad (104a)$$

$$\text{s.t. } \lambda n \sum_{y \ne y_i} \alpha_{iy} \le 1,\ \forall i \in [n]; \qquad \alpha_{iy} \ge 0,\ \forall i \in [n],\ y \in Y, \qquad (104b)$$

where $K_{iy,jy'} := k((x_i,y_i),(x_j,y_j)) + k((x_i,y),(x_j,y')) - k((x_i,y_i),(x_j,y')) - k((x_i,y),(x_j,y_j))$.
Proof The primal program for soft-margin maximization can be expressed as

$$\min_{f \in H,\ \xi}\ R(f,\xi,S) := \frac{\lambda}{2}\|f\|_H^2 + \frac{1}{n}\sum_{i=1}^n \xi_i, \quad \text{s.t. } f(x_i,y_i) - f(x_i,y) \ge 1 - \xi_i,\ \forall y \ne y_i; \quad \xi_i \ge 0,\ \forall i \in [n].$$

Introducing dual variables $\alpha_{iy}$ for the margin constraints and $\eta_i$ for the non-negativity constraints on $\xi_i$ yields the Lagrange function (the objective has been divided by $\lambda$)

$$L(f,\alpha) := \frac{1}{2}\langle f, f\rangle_H + \frac{1}{\lambda n}\sum_{i=1}^n \xi_i - \sum_{i=1}^n \sum_{y \ne y_i} \alpha_{iy}\left[f(x_i,y_i) - f(x_i,y) - 1 + \xi_i\right] - \sum_{i=1}^n \xi_i \eta_i.$$

Solving for $f$ results in

$$\nabla_f L(f,\alpha) = 0 \iff f(\cdot) = \sum_{i=1}^n \sum_{y \ne y_i} \alpha_{iy}\left[k(\cdot,(x_i,y_i)) - k(\cdot,(x_i,y))\right],$$

since for $f \in H$, $\nabla_f f(x,y) = k(\cdot,(x,y))$. Solving for $\xi_i$ implies $\lambda n \sum_{y \ne y_i} \alpha_{iy} \le 1$, and plugging the solution back into $L$ yields the (negative) objective function in the claim. Finally, note that the representation in (103) can be obtained by identifying

$$\beta_{iy} := \begin{cases} -\alpha_{iy} & \text{if } y \ne y_i, \\ \sum_{y' \ne y_i} \alpha_{iy'} & \text{if } y = y_i. \end{cases}$$
• The multiclass SVM formulation of Crammer and Singer [2001] can be recovered as a special case for kernels that are diagonal with respect to the outputs, i.e. $k((x,y),(x',y')) = \delta_{y,y'}\, k(x,x')$. Notice that in this case the quadratic part in Eq. (104a) simplifies to

$$\sum_{i,j} k(x_i,x_j) \sum_{y} \alpha_{iy}\,\alpha_{jy}\left[1 + \delta_{y_i,y}\,\delta_{y_j,y} - \delta_{y_i,y} - \delta_{y_j,y}\right].$$
• The pairs $(x_i, y)$ for which $\alpha_{iy} > 0$ are the support pairs, generalizing the notion of support vectors. As in binary SVMs, their number can be much smaller than the total number of constraints. Notice also that in the final expansion, the contributions $k(\cdot,(x_i,y_i))$ will get non-negative weights, whereas the $k(\cdot,(x_i,y))$ for $y \ne y_i$ will get non-positive weights. Overall one gets a balance equation $\beta_{iy_i} = -\sum_{y \ne y_i} \beta_{iy}$ for every data point.
5.1.6 Sparse Approximation
Proposition 16 shows that sparseness in the representation of $\hat f^{\,sm}$ is linked to the fact that only few $\alpha_{iy}$ in the solution of the dual problem in Eq. (104) are non-zero. Note that each of these Lagrange multipliers is linked to the corresponding soft margin constraint $f(x_i,y_i) - f(x_i,y) \ge 1 - \xi_i$. Hence sparseness is achieved if only few constraints are active at the optimal solution. While this may or may not be the case for a given sample, one can still exploit this observation to define a nested sequence of relaxations, where margin constraints are incrementally added. This corresponds to a constraint selection algorithm (cf. Bertsimas and Tsitsiklis [1997]) for the primal or – equivalently – a variable selection or column generation method for the dual program, and has been investigated in Tsochantaridis et al. [2005]. Solving a sequence of increasingly tighter relaxations of a mathematical program is also known as an outer approximation. In particular, one may iterate through the training examples according to some (fair) visitation schedule and greedily select the constraints that are most violated at the current solution $f$, i.e. for the $i$-th instance one computes

$$\hat y_i = \operatorname*{argmax}_{y \ne y_i} f(x_i, y) = \operatorname*{argmax}_{y \ne y_i} p(y|x_i; f), \qquad (105)$$

and then strengthens the current relaxation by including $\alpha_{i\hat y_i}$ in the optimization of the dual if $f(x_i,y_i) - f(x_i,\hat y_i) < 1 - \xi_i - \epsilon$. Here $\epsilon > 0$ is a predefined tolerance parameter. It is important to understand how many strengthening steps are necessary to achieve a reasonably close approximation to the original problem. The following theorem provides an answer:
Theorem 17 (Tsochantaridis et al. [2005]) Let $\bar R = \max_{i,y} K_{iy,iy}$ and choose $\epsilon > 0$. A sequential strengthening procedure, which optimizes Eq. (101) by greedily selecting violated constraints, will find an approximate solution where all constraints are fulfilled within a precision of $\epsilon$, i.e. $r(x_i,y_i;f) \ge 1 - \xi_i - \epsilon$, after at most

$$\frac{2n}{\epsilon}\,\max\left\{1,\ \frac{4\bar R^2}{\lambda n \epsilon^2}\right\}$$

steps.
Corollary 18 Denote by $(\hat f, \hat\xi)$ the optimal solution of a relaxation of the problem in Proposition 16, minimizing $R(f,\xi,S)$ while violating no constraint by more than $\epsilon$ (cf. Theorem 17). Then

$$R(\hat f, \hat\xi, S) \le R(\hat f^{\,sm}, \xi^*, S) \le R(\hat f, \hat\xi, S) + \epsilon,$$

where $(\hat f^{\,sm}, \xi^*)$ is the optimal solution of the original problem.
Proof The first inequality follows from the fact that $(\hat f, \hat\xi)$ is an optimal solution of a relaxation; the second from the observation that by setting $\tilde\xi = \hat\xi + \epsilon$ one gets an admissible solution $(\hat f, \tilde\xi)$ such that

$$R(\hat f, \tilde\xi, S) = \frac{\lambda}{2}\|\hat f\|_H^2 + \frac{1}{n}\sum_{i=1}^n (\hat\xi_i + \epsilon) = R(\hat f, \hat\xi, S) + \epsilon.$$
• Combined with an efficient QP solver, the above theorem guarantees a runtime polynomial in $n$, $\epsilon^{-1}$, $\bar R$, and $\lambda^{-1}$. This holds irrespective of special properties of the data set utilized; the only dependency on the sample points $x_i$ is through the radius $\bar R$.
• The remaining key problem is how to compute Eq. (105) efficiently. The answer depends on the specific form of the joint kernel $k$ and/or the feature map $\Phi$. In many cases, efficient dynamic programming techniques exist, whereas in other cases one has to resort to approximations or use other methods to identify a set of candidate distractors $Y_i \subset Y$ for a training pair $(x_i, y_i)$ (cf. Collins [2000]). Sometimes one may also have search heuristics available that may not find the solution of Eq. (105), but that find (other) violating constraints with reasonable computational effort.
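The greedy selection step built from Eq. (105) can be sketched as follows (our own illustration, not from the text). We assume a precomputed score table `scores[i, y] = f(x_i, y)` with a small, enumerable output set; producing that table efficiently is exactly the model-specific inference problem discussed above.

```python
import numpy as np

# For each training pair (x_i, y_i), find the most violated margin constraint
# at the current solution f (Eq. (105)) and add it to the working set if it is
# violated by more than a tolerance eps.

def select_violated_constraints(scores, labels, xi, eps=0.1):
    working_set = []
    for i, y_i in enumerate(labels):
        rivals = [y for y in range(scores.shape[1]) if y != y_i]
        y_hat = max(rivals, key=lambda y: scores[i, y])        # Eq. (105)
        if scores[i, y_i] - scores[i, y_hat] < 1 - xi[i] - eps:
            working_set.append((i, y_hat))
    return working_set

scores = np.array([[2.0, 0.3, 0.1],    # example 0: margin 1.7, constraint satisfied
                   [0.2, 0.5, 0.4]])   # example 1: margin 0.1, constraint violated
labels = [0, 1]
xi = np.zeros(2)
print(select_violated_constraints(scores, labels, xi))  # [(1, 2)]
```

In a full cutting-plane solver one would alternate this selection step with re-optimizing the dual QP of Proposition 16 over the current working set, as in Tsochantaridis et al. [2005].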
5.1.7 Generalized Gaussian Process Classification
The model Eq. (96) and the minimization of the regularized log-loss can be interpreted as a generalization of Gaussian process classification (Rasmussen and Williams [2006], Altun et al. [2004b]) by assuming that $(f(x,\cdot))_{x \in X}$ is a vector-valued zero-mean Gaussian process; note that the covariance function $C$ is defined over pairs $X \times Y$. For a given sample $S$, define a multi-index vector $F(S) := (f(x_i,y))_{i,y}$ as the restriction of the stochastic process $f$ to the augmented sample $\tilde S$. Denote the kernel matrix by $K = (K_{iy,jy'})$, where $K_{iy,jy'} := C((x_i,y),(x_j,y'))$ with indices $i,j \in [n]$ and $y,y' \in Y$, so that in summary $F(S) \sim \mathcal N(0, K)$. This induces a predictive model via Bayesian model integration according to

$$p(y|x; S) = \int p(y|F(x,\cdot))\, p(F|S)\, dF, \qquad (106)$$

where $x$ is a test point that has been included in the sample (transductive setting). For an i.i.d. sample, the log-posterior for $F$ can be written as

$$\ln p(F|S) = -\frac{1}{2} F^\top K^{-1} F + \sum_{i=1}^n \left[f(x_i,y_i) - g(x_i, F)\right] + \text{const}. \qquad (107)$$

Invoking the representer theorem for $\hat F(S) := \operatorname{argmax}_F \ln p(F|S)$, we know that

$$\hat F(S)_{iy} = \sum_{j=1}^n \sum_{y' \in Y} \alpha_{jy'}\, K_{iy,jy'}, \qquad (108)$$

which we plug into Eq. (107) to arrive at

$$\min_{\alpha}\ \frac{1}{2}\alpha^\top K \alpha - \sum_{i=1}^n \left[\alpha^\top K e_{iy_i} - \log \sum_{y \in Y} \exp\left(\alpha^\top K e_{iy}\right)\right], \qquad (109)$$

where $e_{iy}$ denotes the respective unit vector. Notice that for $f(\cdot) = \sum_{i,y} \alpha_{iy}\, k(\cdot,(x_i,y))$ the first term is equivalent (up to the factor $\tfrac{1}{2}$) to the squared RKHS norm of $f \in H$, since $\langle f, f\rangle_H = \sum_{i,j}\sum_{y,y'} \alpha_{iy}\,\alpha_{jy'}\,\langle k(\cdot,(x_i,y)), k(\cdot,(x_j,y'))\rangle$. The latter inner product reduces to $k((x_i,y),(x_j,y'))$ due to the reproducing property. Again, the key issue in solving Eq. (109) is how to achieve sparseness in the expansion for $\hat F$.
5.2 Markov Networks and Kernels
In Section 5.1, no assumptions about the specific structure of the joint kernel defining the model in Eq. (96) have been made. In the following, we will focus on a more specific setting with multiple outputs, where dependencies are modeled by a conditional independence graph. This approach is motivated by the fact that independently predicting individual responses based on marginal response models will often be suboptimal, and explicitly modeling these interactions can be of crucial importance.
5.2.1 Markov Networks and Factorization Theorem
Denote predictor variables by $X$, response variables by $Y$, and define $Z := (X, Y)$ with associated sample space $\mathcal Z$. We use Markov networks as the modeling formalism for representing dependencies between covariates and response variables as well as interdependencies among response variables.
Definition 19 A conditional independence graph (or Markov network) is an undirected graph $G = (Z, E)$ such that for any pair of variables $(Z_i, Z_j)$: $(Z_i, Z_j) \notin E$ if and only if $Z_i \perp\!\!\!\perp Z_j \mid Z - \{Z_i, Z_j\}$.
The above definition is based on the pairwise Markov property, but by virtue of the separation theorem (cf. e.g. Whittaker [1990]) this implies the global Markov property for distributions with full support. The global Markov property says that for disjoint subsets $U, V, W \subseteq Z$, where $W$ separates $U$ from $V$ in $G$, one has $U \perp\!\!\!\perp V \mid W$. Even more important in the context of this paper is the factorization result due to Hammersley and Clifford [1971] and Besag [1974].
Theorem 20 Given a random vector $Z$ with conditional independence graph $G$, any density function for $Z$ with full support factorizes over $C(G)$, the set of maximal cliques of $G$, as follows:

$$p(z) = \exp\left[\sum_{c \in C(G)} f_c(z_c)\right], \qquad (110)$$

where the $f_c$ are clique compatibility functions that depend on $z$ only via the restriction to clique configurations $z_c$.
The significance of this result is that in order to specify a distribution for $Z$, one only needs to specify or estimate the simpler functions $f_c$.
5.2.2 Kernel Decomposition over Markov Networks
It is of interest to analyse the structure of kernels $k$ that generate Hilbert spaces $H$ of functions that are consistent with a graph.
Definition 21 A function $f : \mathcal Z \to \mathbb{R}$ is compatible with a conditional independence graph $G$ if $f$ decomposes additively as $f(z) = \sum_{c \in C(G)} f_c(z_c)$ with suitably chosen functions $f_c$. A Hilbert space $H$ over $\mathcal Z$ is compatible with $G$ if every function $f \in H$ is compatible with $G$. Such $f$ and $H$ are also called $G$-compatible.
Proposition 22 Let $H$ with kernel $k$ be a $G$-compatible RKHS. Then there are functions $k_{cd} : \mathcal Z_c \times \mathcal Z_d \to \mathbb{R}$ such that the kernel decomposes as

$$k(u, z) = \sum_{c,d \in C} k_{cd}(u_c, z_d).$$

Proof Let $z \in \mathcal Z$; then $k(\cdot, z) = k(z, \cdot) \in H$, and by assumption for all $u, z \in \mathcal Z$ there exist functions $k_d(\cdot; z)$ such that

$$k(u, z) = \sum_{d} k_d(u_d; z) = \sum_{c} k_c(z_c; u).$$

The left-hand sum involves functions restricted to cliques of the first argument, whereas the right-hand sum involves functions restricted to cliques of the second argument. The following simple lemma then shows that there have to be functions $k_{cd}(u_c, z_d)$ as claimed.
Lemma 23 Let $X$ be a set of $n$-tuples and $f_i, g_i : X \times X \to \mathbb{R}$ for $i \in [n]$ functions such that $f_i(x, y) = f_i(x_i, y)$ and $g_i(x, y) = g_i(x, y_i)$. If $\sum_i f_i(x_i, y) = \sum_j g_j(x, y_j)$ for all $x, y$, then there exist functions $h_{ij}$ such that $\sum_i f_i(x_i, y) = \sum_{i,j} h_{ij}(x_i, y_j)$.
• Proposition 22 is useful for the design of kernels, since it states that only kernels allowing an additive decomposition into local functions $k_{cd}$ are compatible with a given Markov network $G$. Lafferty et al. [2004] have pursued a similar approach by considering kernels for RKHSs with functions defined over $\mathcal Z_C := \{(c, z_c) : c \in C,\ z_c \in \mathcal Z_c\}$. In the latter case one can even deal with situations where the conditional dependency graph is (potentially) different for every instance.
• An illuminating example of how to design kernels via the decomposition in Proposition 22 is the case of conditional Markov chains, for which models based on joint kernels have been proposed in Lafferty et al. [2001], Collins [2000], Altun et al. [2003], Taskar et al. [2003]. Given an input sequence $X = (X_t)_{t \in [T]}$, the goal is to predict a sequence of labels or class variables $Y = (Y_t)_{t \in [T]}$, $Y_t \in \Sigma$. Dependencies between class variables are modeled in terms of a Markov chain, whereas outputs $Y_t$ are assumed to depend (directly) on an observation window $(X_{t-r}, \dots, X_t, \dots, X_{t+r})$. Notice that this goes beyond the standard Hidden Markov Model structure (cf. Rabiner [1989]) by allowing for overlapping features ($r \ge 1$). For simplicity we focus on a window size of $r = 1$, in which case the clique set is given by $C := \{c_t := (x_t, y_t, y_{t+1}),\ c'_t := (x_{t+1}, y_t, y_{t+1}) : t \in [T-1]\}$. We assume an input kernel $k$ is given and introduce indicator vectors (or dummy variates) $I(Y_{\{t,t+1\}}) := (I_{\omega,\omega'}(Y_{\{t,t+1\}}))_{\omega,\omega' \in \Sigma}$. Now we can define the local kernel functions as

$$k_{cd}(z_c, z'_d) := \left\langle I(y_{\{s,s+1\}}),\, I(y'_{\{t,t+1\}})\right\rangle \cdot \begin{cases} k(x_s, x_t) & \text{if } c = c_s \text{ and } d = c_t, \\ k(x_{s+1}, x_{t+1}) & \text{if } c = c'_s \text{ and } d = c'_t. \end{cases} \qquad (111)$$

Notice that the inner product between the indicator vectors is zero unless the variable pairs are in the same configuration.
Conditional Markov chain models have found widespread application in natural language processing (e.g. for part-of-speech tagging and shallow parsing, cf. Sha and Pereira [2003]), in information retrieval (e.g. for information extraction, cf. McCallum et al. [2005]), and in computational biology (e.g. for gene prediction, cf. Culotta et al. [2005]).
5.2.3 Clique-based Sparse Approximation
Proposition 22 immediately leads to an alternative version of the representer theorem as observed by
Lafferty et al. [2004] and Altun et al. [2004b].
Corollary 24 If $H$ is $G$-compatible, then in the same setting as in Corollary 15 the optimizer $\hat f$ can be written as

$$\hat f(u) = \sum_{i=1}^n \sum_{c \in C} \sum_{y_c \in Y_c} \beta^i_{c,y_c} \sum_{d \in C} k_{cd}((x_{ic}, y_c), u_d), \qquad (112)$$

where $x_{ic}$ are the variables of $x_i$ belonging to clique $c$, and $Y_c$ is the subspace of $\mathcal Z_c$ that contains response variables.
Proof According to Proposition 22, the kernel $k$ of $H$ can be written as $k(z, u) = \sum_{c,d} k_{cd}(z_c, u_d)$. Plugging this into the expansion of Corollary 15 yields

$$\hat f(u) = \sum_{i=1}^n \sum_{y \in Y} \beta_{iy} \sum_{c,d \in C} k_{cd}((x_{ic}, y_c), u_d) = \sum_{i=1}^n \sum_{c,d \in C} \sum_{y_c \in Y_c} k_{cd}((x_{ic}, y_c), u_d) \sum_{y' \in Y} \delta_{y'_c, y_c}\, \beta_{iy'}.$$

Setting $\beta^i_{c,y_c} := \sum_{y' \in Y} \delta_{y'_c, y_c}\, \beta_{iy'}$ completes the proof.
• Notice that the number of parameters in the representation Eq. (112) scales with $n \sum_{c \in C} |Y_c|$, as opposed to $n\,|Y|$ in Eq. (103). For cliques with reasonably small state spaces this will be a significantly more compact representation. Notice also that the evaluation of the functions $k_{cd}$ will typically be more efficient than evaluating $k$.
• In spite of this improvement, the number of terms in the expansion in Eq. (112) may in practice still be too large. In this case, one can pursue a reduced set approach, which selects a subset of variables to be included in a sparsified expansion. This has been proposed in Taskar et al. [2003] for the soft margin maximization problem, as well as in Lafferty et al. [2004], Altun et al. [2004a] for conditional random fields and Gaussian processes. For instance, in Lafferty et al. [2004] parameters $\beta^i_{c,y_c}$ that maximize the functional gradient of the regularized log-loss are greedily included in the reduced set. In Taskar et al. [2003] a similar selection criterion is utilized with respect to margin violations, leading to an SMO-like optimization algorithm (cf. Platt [1999]).
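The parameter-count comparison in the first bullet above can be made concrete with a back-of-the-envelope calculation (our own illustrative numbers, counting only the label configurations $y_c \in \Sigma^2$ and one pairwise clique per position for simplicity): for a label chain of length $T$ over an alphabet of size $s$, the flat expansion in Eq. (103) has $n \cdot s^T$ terms, whereas the clique-based expansion in Eq. (112) has only $n \cdot (T-1) \cdot s^2$.

```python
# Flat expansion (Eq. (103)) vs. clique-based expansion (Eq. (112))
# for a label chain: T positions, alphabet size s, n training sequences.
T, s, n = 10, 5, 100

full = n * s ** T                # n * |Y| terms: exponential in T
clique = n * (T - 1) * s ** 2    # n * sum_c |Y_c| terms: linear in T
print(full, clique)              # 976562500 22500
```

The gap grows exponentially with the chain length, which is why the clique-based representation is essential for sequence models.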
5.2.4 Probabilistic Inference
In dealing with structured or interdependent response variables, computing marginal probabilities of interest or computing the most probable response (cf. Eq. (105)) may be non-trivial. However, for dependency graphs with small tree width, efficient inference algorithms exist, such as the junction tree algorithm (cf. Jensen et al. [1990], Dawid [1992]) and variants thereof. Notice that in the case of the conditional or hidden Markov chain, the junction tree algorithm is equivalent to the well-known forward-backward algorithm (Baum [1972]). Recently, a number of approximate inference algorithms have been developed to deal with dependency graphs for which exact inference is not tractable (cf. e.g. Wainwright and Jordan [2003]).
6 Kernel Methods for Unsupervised Learning
This section discusses various methods of data analysis by modeling the distribution of data in feature space. To that extent, we study the behavior of $\Phi(x)$ by means of rather simple linear methods, which has implications for nonlinear methods on the original data space $X$. In particular, we will discuss the extension of PCA to Hilbert spaces, which allows for image denoising, clustering, and nonlinear dimensionality reduction; the study of covariance operators for the measure of independence; the study of mean operators for the design of two-sample tests; and the modeling of complex dependencies between sets of random variables via kernel dependency estimation and canonical correlation analysis.
6.1 Kernel Principal Component Analysis
Principal Component Analysis (PCA) is a powerful technique for extracting structure from possibly high-dimensional data sets. It is readily performed by solving an eigenvalue problem, or by using iterative algorithms which estimate principal components. For reviews of the existing literature, see Jolliffe [1986] and Diamantaras and Kung [1996]; some of the classical papers are Pearson [1901], Hotelling [1933], and Karhunen [1946].
PCA is an orthogonal transformation of the coordinate system in which we describe our data. The new coordinate system is obtained by projection onto the so-called principal axes of the data. A small number of principal components is often sufficient to account for most of the structure in the data, e.g. for the purpose of principal component regression [Draper and Smith, 1981].
The basic idea is strikingly simple: denote by $X = \{x_1, \dots, x_n\}$ an $n$-sample drawn from $P(x)$. Then the covariance operator $C$ is given by $C = \mathbf{E}\left[(x - \mathbf{E}[x])(x - \mathbf{E}[x])^\top\right]$. PCA aims at estimating the leading eigenvectors of $C$ via the empirical estimate $C_{emp} = \mathbf{E}_{emp}\left[(x - \mathbf{E}_{emp}[x])(x - \mathbf{E}_{emp}[x])^\top\right]$. If $X$ is $d$-dimensional, then the eigenvectors can be computed in $O(d^3)$ time [Press et al., 1994].
The problem can also be posed in feature space [Schölkopf et al., 1998] by replacing $x$ with $\Phi(x)$. In this case, however, it is impossible to compute the eigenvectors directly. Yet, note that the image of $C_{emp}$ lies in the span of $\{\Phi(x_1), \dots, \Phi(x_n)\}$. Hence it is sufficient to diagonalize $C_{emp}$ in that subspace. In other words, we replace the outer product $C_{emp}$ by an inner product matrix, leaving the eigenvalues unchanged, which can be computed efficiently. Using $w = \sum_{i=1}^n \alpha_i \Phi(x_i)$ it follows that $\alpha$ needs to satisfy $PKP\alpha = \lambda\alpha$, where $P$ is the centering projection with $P_{ij} = \delta_{ij} - n^{-1}$ and $K$ is the kernel matrix on $X$.
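The eigenproblem $PKP\alpha = \lambda\alpha$ translates directly into code. The following is a minimal sketch of ours (not from the text; kernel choice and normalization details are illustrative assumptions):

```python
import numpy as np

# Minimal kernel PCA: diagonalize the centered kernel matrix P K P,
# where P = I - (1/n) 11^T, instead of the covariance operator in feature space.

def kernel_pca(X, kernel, n_components=2):
    n = len(X)
    K = np.array([[kernel(a, b) for b in X] for a in X])
    P = np.eye(n) - np.ones((n, n)) / n          # centering projection
    Kc = P @ K @ P
    eigvals, eigvecs = np.linalg.eigh(Kc)        # ascending eigenvalues
    idx = np.argsort(eigvals)[::-1][:n_components]
    alphas, lambdas = eigvecs[:, idx], eigvals[idx]
    # Normalize so each feature-space eigenvector w has unit norm:
    # lambda * <alpha, alpha> = 1.
    alphas = alphas / np.sqrt(lambdas)
    return Kc @ alphas   # projections of the training points

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 3))
rbf = lambda a, b: np.exp(-np.sum((a - b) ** 2))   # an illustrative RBF kernel
Z = kernel_pca(X, rbf, n_components=2)
print(Z.shape)  # (20, 2)
```

Since $K_c$ is centered on both sides, each extracted feature has zero mean over the sample, mirroring the mean subtraction of linear PCA.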
Note that the problem can also be recovered as one of maximizing some Contrast$[f, X]$ subject to $f \in F$. This means that the projections onto the leading eigenvectors correspond to the most reliable features. This optimization problem also allows us to unify various feature extraction methods as follows:
• For Contrast$[f, X] = \operatorname{Var}_{emp}[f, X]$ and $F = \{\langle w, x\rangle$ subject to $\|w\| \le 1\}$ we recover PCA.
• Changing $F$ to $F = \{\langle w, \Phi(x)\rangle$ subject to $\|w\| \le 1\}$ we recover kernel PCA.
• For Contrast$[f, X] = \operatorname{Kurtosis}[f, X]$ and $F = \{\langle w, x\rangle$ subject to $\|w\| \le 1\}$ we have projection pursuit [Huber, 1985, Friedman and Tukey, 1974]. Other contrasts lead to further variants, e.g. the Epanechnikov kernel, entropic contrasts, etc. [Cook et al., 1993, Friedman, 1987, Jones and Sibson, 1987].
• If $F$ is a convex combination of basis functions and the contrast function is convex in $w$, one obtains computationally efficient algorithms, as the solution of the optimization problem can be found at one of the vertices [Rockafellar, 1970, Schölkopf and Smola, 2002].
Subsequent projections are obtained, e.g., by seeking directions orthogonal to $f$ or other computationally attractive variants thereof.
Kernel PCA has been applied to numerous problems, from preprocessing and invariant feature extraction [Mika et al., 2003] by simultaneous diagonalization of the data and a noise covariance matrix, to image denoising [Mika et al., 1999] and super-resolution [Kim et al., 2005]. The basic idea in the latter case is to obtain a set of principal directions in feature space $w_1, \dots, w_l$, obtained from noise-free data, and to project the image $\Phi(x)$ of a noisy observation $x$ onto the space spanned by $w_1, \dots, w_l$. This yields a "denoised" solution $\tilde\Phi(x)$ in feature space. Finally, to obtain the pre-image of this denoised solution, one minimizes $\|\Phi(x') - \tilde\Phi(x)\|$. The fact that projections onto the leading principal components turn out to be good starting points for pre-image iterations is further exploited in kernel dependency estimation (Section 6.4). Kernel PCA can be shown to contain several popular dimensionality reduction algorithms as special cases, including LLE, Laplacian Eigenmaps, and (approximately) Isomap [Ham et al., 2004].
6.2 Canonical Correlation and Measures of Independence
Given two samples $X, Y$, canonical correlation analysis [Hotelling, 1936] aims at finding directions of projection $u, v$ such that the correlation coefficient between $X$ and $Y$ is maximized. That is, $(u, v)$ are given by

$$\operatorname*{argmax}_{u,v}\ \operatorname{Var}_{emp}[\langle u, x\rangle]^{-\frac{1}{2}}\, \operatorname{Var}_{emp}[\langle v, y\rangle]^{-\frac{1}{2}}\, \mathbf{E}_{emp}\left[\langle u, x - \mathbf{E}_{emp}[x]\rangle\, \langle v, y - \mathbf{E}_{emp}[y]\rangle\right]. \qquad (113)$$

This problem can be solved by finding the eigensystem of $C_x^{-\frac{1}{2}} C_{xy} C_y^{-\frac{1}{2}}$, where $C_x, C_y$ are the covariance matrices of $X$ and $Y$ respectively, and $C_{xy}$ is the cross-covariance matrix between $X$ and $Y$. Multivariate extensions are discussed in [Kettenring, 1971].
CCA can be extended to kernels by means of replacing linear projections $\langle u, x\rangle$ by projections in feature space $\langle u, \Phi(x)\rangle$. More specifically, Bach and Jordan [2002] used the so-derived contrast to obtain a measure of independence and applied it to Independent Component Analysis with great success. However, the formulation requires an additional regularization term to prevent the resulting optimization problem from becoming distribution-independent. The problem is that bounding the variance of each projection is not a good way to control the capacity of the function class. Instead, one may modify the correlation coefficient by normalizing by the norms of $u$ and $v$. Hence we maximize

$$\|u\|^{-1}\, \|v\|^{-1}\, \mathbf{E}_{emp}\left[\langle u, x - \mathbf{E}_{emp}[x]\rangle\, \langle v, y - \mathbf{E}_{emp}[y]\rangle\right], \qquad (114)$$

which can be solved by diagonalization of the covariance matrix $C_{xy}$. One may check that in feature space, this amounts to finding the eigenvectors of $(PK_xP)^{\frac{1}{2}}(PK_yP)^{\frac{1}{2}}$ [Gretton et al., 2005b].
6.3 Measures of Independence
Rényi [1959] showed that independence between random variables is equivalent to the condition of
vanishing covariance $\mathrm{Cov}[f(x), g(y)] = 0$ for all $C^1$ functions f, g bounded by $L^\infty$
norm 1 on X and Y.
In Das and Sen [1994], Dauxois and Nkiet [1998], Bach and Jordan [2002], Gretton et al. [2005c], and
Gretton et al. [2005a] a constrained empirical estimate of the above criterion is used. That is, one studies
\[
\Lambda(X, Y, F, G) := \sup_{f, g} \mathrm{Cov}_{\mathrm{emp}}[f(x), g(y)]
\quad \text{subject to } f \in F \text{ and } g \in G. \tag{115}
\]
This statistic is often extended to use the entire series $\Lambda_1, \ldots, \Lambda_d$ of maximal
correlations, where each of the function pairs $(f_i, g_i)$ is orthogonal to the previous set of terms.
More specifically, Dauxois and Nkiet [1998] restrict F, G to finite-dimensional linear function classes
subject to their $L_2$ norm being bounded by 1, while Bach and Jordan [2002] use functions in the RKHS for
which some sum of the $\ell_2^n$ norm and the RKHS norm on the sample is bounded. Martin [2000], De Cock
and De Moor [2002], and Vishwanathan et al. [2006] show that degrees of maximal correlation can also be
viewed as an inner product between subspaces spanned by the observations.
Gretton et al. [2005a] use functions with bounded RKHS norm only, which provides a necessary and
sufficient criterion if the kernels are universal: $\Lambda(X, Y, F, G) = 0$ if and only if x and y are
independent. Moreover, $\operatorname{tr} P K_x P K_y P$ has the same theoretical properties and can be
computed much more easily in linear time, as it allows for incomplete Cholesky factorizations. Here $K_x$
and $K_y$ are the kernel matrices on X and Y, respectively.
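The trace criterion is straightforward to evaluate empirically. The following sketch (Gaussian kernels and toy data are our own choices; a practical implementation would use incomplete Cholesky factorizations rather than full Gram matrices) compares its value on a dependent and an independent pair of samples:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200

def rbf_gram(z, sigma=1.0):
    return np.exp(-((z[:, None] - z[None, :]) ** 2) / (2 * sigma**2))

def hsic(a, b):
    """Empirical dependence statistic tr(P Ka P Kb P), normalized by n^2."""
    m = len(a)
    P = np.eye(m) - np.ones((m, m)) / m
    Ka, Kb = P @ rbf_gram(a) @ P, P @ rbf_gram(b) @ P
    return np.sum(Ka * Kb) / m**2    # tr(Ka Kb) for symmetric matrices

t = rng.uniform(-3, 3, n)
x = t + 0.1 * rng.normal(size=n)
y_dep = np.sin(t) + 0.1 * rng.normal(size=n)   # dependent on x through t
y_ind = rng.normal(size=n)                      # independent of x
```

For the dependent pair the statistic is clearly larger than the near-zero value obtained for the independent pair.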
The above criteria can be used to derive algorithms for Independent Component Analysis [Bach and
Jordan, 2002, Gretton et al., 2005a]. In this case one seeks rotations of the data S = V X, where
$V \in SO(d)$, such that S is coordinate-wise independent, i.e. $P(S) = \prod_{i=1}^{d} P(s_i)$. This
approach provides the currently best performing independence criterion, although it comes at considerable
computational cost. For fast algorithms, one should consider the work of Cardoso [1998], Hyvärinen et al.
[2001], and Lee et al. [2000]. The work of Chen and Bickel [2005] and Yang and Amari [1997] is also of
interest in this context.
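A crude two-dimensional sketch of this idea follows, with the gradient-based optimization over SO(d) used in practice replaced by a grid search over rotation angles (the sources, mixing angle, and kernel bandwidth are our own choices):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 400

# Two independent non-Gaussian sources, mixed by a rotation of known angle.
S = rng.uniform(-1, 1, (n, 2)) ** 3
theta_true = 0.6
R = lambda a: np.array([[np.cos(a), -np.sin(a)], [np.sin(a), np.cos(a)]])
X = S @ R(theta_true).T          # observed mixtures, rows x = R s

def rbf_gram(z, sigma=0.5):
    return np.exp(-((z[:, None] - z[None, :]) ** 2) / (2 * sigma**2))

def center(K):
    # equivalent to P K P with the centering projection P = I - (1/n) 11^T
    return K - K.mean(0) - K.mean(1)[:, None] + K.mean()

def hsic(a, b):
    m = len(a)
    return np.sum(center(rbf_gram(a)) * center(rbf_gram(b))) / m**2

# Grid search over candidate demixing rotations: the dependence between the
# two recovered coordinates is minimized at the true mixing angle.
angles = np.linspace(0, np.pi / 2, 91)
scores = [hsic(*(X @ R(a)).T) for a in angles]
theta_hat = angles[int(np.argmin(scores))]
```

Within the searched quadrant the dependence criterion has its minimum near the mixing angle, up to the usual sign and permutation ambiguities of ICA.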
Note that a similar approach can be used to develop two-sample tests based on kernel methods. The basic
idea is that for universal kernels the map between distributions and points on the marginal polytope,
$\mu : p \to \mathbf{E}_{x \sim p}[\phi(x)]$, is bijective and consequently it imposes a norm on
distributions. This builds on the ideas of Fortet and Mourier [1953]. The corresponding distance
$d(p, q) := \|\mu[p] - \mu[q]\|$ leads to a U-statistic which allows one to compute empirical estimates
of distances between distributions efficiently [Borgwardt et al., 2006].
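A minimal sketch of the resulting two-sample statistic, using the biased empirical estimate of $\|\mu[p]-\mu[q]\|^2$ with a Gaussian kernel (the sample sizes and distributions are our own choices):

```python
import numpy as np

rng = np.random.default_rng(4)

def rbf(a, b, sigma=1.0):
    return np.exp(-((a[:, None] - b[None, :]) ** 2) / (2 * sigma**2))

def mmd2(x, y):
    """Biased empirical estimate of ||mu[p] - mu[q]||^2 in the RKHS."""
    return rbf(x, x).mean() - 2 * rbf(x, y).mean() + rbf(y, y).mean()

x = rng.normal(0.0, 1.0, 300)
same = rng.normal(0.0, 1.0, 300)    # drawn from the same distribution
diff = rng.normal(1.5, 1.0, 300)    # shifted mean
```

The statistic is essentially zero for two samples from the same distribution and clearly positive for the shifted sample; in an actual test one would compare it against a permutation-based or concentration-based threshold.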
6.4 Kernel Dependency Estimation
A large part of the previous discussion revolved around estimating dependencies between samples X and
Y for rather structured spaces Y, in particular (83). In general, however, such dependencies can be hard
to compute. Weston et al. [2003] proposed an algorithm which allows one to extend standard regularized
LS regression models, as described in Section 3.3, to cases where Y has complex structure.
It works by first recasting the estimation problem as a linear estimation problem for the map
$f : \Phi(x) \to \Phi(y)$, and then as a nonlinear pre-image problem of finding
$\hat{y} := \operatorname*{argmin}_{y} \|f(x) - \Phi(y)\|$, the point in Y closest to f(x).
Since $\Phi(y)$ lies in a possibly infinite-dimensional space, which may lead to capacity problems for
the estimator, Weston et al. [2003] restrict f to the leading principal components of $\Phi(y)$ and
subsequently perform regularized LS regression onto each of the projections
$\langle v_j, \Phi(y)\rangle$ separately, yielding functions $f_j$. This yields an estimate of f via
$f(x) = \sum_j v_j f_j(x)$. The method works well on strings and other structured data. Cortes et al.
[2005] show that this problem can be solved directly, without the need for subspace projections, and
apply it to the analysis of sequence data.
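The pipeline can be illustrated on scalar toy data (all parameters are our own choices; the centering of the kernel PCA step is omitted for brevity, and the pre-image is found by searching a candidate grid rather than by iterative optimization):

```python
import numpy as np

rng = np.random.default_rng(5)
n, k, lam = 200, 5, 1e-3     # training size, #principal components, ridge parameter

# Toy paired data: the "structured" output y is just a scalar here, so that
# the pre-image step can be read off easily.
x = rng.uniform(-2, 2, n)
y = np.sin(x) + 0.05 * rng.normal(size=n)

def rbf(a, b, sigma=0.7):
    return np.exp(-((a[:, None] - b[None, :]) ** 2) / (2 * sigma**2))

Kx, Ky = rbf(x, x), rbf(y, y)

# Leading principal directions v_j of Phi(y) in dual form:
# v_j = sum_i A[i, j] Phi(y_i).
w, V = np.linalg.eigh(Ky)
A = V[:, -k:] / np.sqrt(w[-k:])

# Targets <v_j, Phi(y_i)> and one regularized LS regression per component f_j.
T = Ky @ A
C = np.linalg.solve(Kx + lam * np.eye(n), T)

def predict(x_new, candidates):
    f = rbf(np.array([x_new]), x)[0] @ C    # predicted projections of f(x_new)
    Pc = rbf(candidates, y) @ A             # projections of each candidate Phi(y)
    # pre-image step: candidate closest to f(x_new) in the projected subspace
    return candidates[np.argmin(((Pc - f) ** 2).sum(axis=1))]

cands = np.linspace(-1.2, 1.2, 241)
```

Predictions are obtained by ranking candidate outputs by their distance to f(x) in the span of the leading components, which stands in for the argmin over the full space Y.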
7 Conclusion
We have summarized some of the advances in the field of machine learning with positive definite kernels.
Due to lack of space this article is by no means comprehensive; in particular, we were not able to cover
statistical learning theory, which is often cited as providing theoretical support for kernel methods.
Nevertheless, we hope that the main ideas that make kernel methods attractive became clear.
In particular, these include the fact that kernels address the following three major issues of learning and
inference:
• they formalize the notion of similarity of data
• they provide a representation of the data in an associated reproducing kernel Hilbert space
• they characterize the function class used for estimation via the representer theorem (see Eqs. (46)
and (112))
We have explained a number of approaches where kernels are useful. Many of them involve the substitution
of kernels for dot products, thus turning a linear geometric algorithm into a nonlinear one. In this way,
one obtains SVMs from hyperplane classifiers, and kernel PCA from linear PCA. There is, however, a more
recent method of constructing kernel algorithms, where the starting point is not a linear algorithm but a
linear criterion (e.g., that two random variables have zero covariance, or that the means of two samples
are identical), which can be turned into a condition involving an optimization over a large function
class using kernels, thus yielding tests for independence of random variables, or tests for solving the
two-sample problem. We believe that these works, as well as the increasing amount of work on the use of
kernel methods for structured data, illustrate that we can expect significant further progress in the
field in the years to come.
Acknowledgments
The main body of this paper was written at the Mathematisches Forschungsinstitut Oberwolfach. The authors
would like to express their thanks for the support during that time. National ICT Australia is funded
through the Australian Government's Backing Australia's Ability initiative, in part through the
Australian Research Council. This work was supported by grants of the ARC and by the Pascal Network of
Excellence. We thank Florian Steinke, Matthias Hein, Jakob Macke, Conrad Sanderson and Holger Wendland
for comments and suggestions.
References
M. A. Aizerman, É. M. Braverman, and L. I. Rozonoér. Theoretical foundations of the potential function
method in pattern recognition learning. Automation and Remote Control, 25:821–837, 1964.
E. L. Allwein, R. E. Schapire, and Y. Singer. Reducing multiclass to binary: a unifying approach for
margin classiﬁers. In P. Langley, editor, Proc. Intl. Conf. Machine Learning, pages 9–16, San Francisco,
California, 2000. Morgan Kaufmann Publishers.
N. Alon, S. Ben-David, N. Cesa-Bianchi, and D. Haussler. Scale-sensitive dimensions, uniform
convergence, and learnability. In Proc. of the 34th Annual Symposium on Foundations of Computer Science,
pages 292–301. IEEE Computer Society Press, Los Alamitos, CA, 1993.
Y. Altun, A. J. Smola, and T. Hofmann. Exponential families for conditional random fields. In
Uncertainty in Artificial Intelligence (UAI), pages 2–9, 2004a.
Y. Altun, A. J. Smola, and T. Hofmann. Gaussian process classiﬁcation for segmenting and annotating
sequences. In Proc. Intl. Conf. Machine Learning, 2004b.
Y. Altun, I. Tsochantaridis, and T. Hofmann. Hidden Markov support vector machines. In Proc. Intl.
Conf. Machine Learning, 2003.
N. Aronszajn. Theory of reproducing kernels. Transactions of the American Mathematical Society, 68:
337–404, 1950.
F. R. Bach and M. I. Jordan. Kernel independent component analysis. Journal of Machine Learning
Research, 3:1–48, 2002.
G. Bakir, T. Hofmann, B. Schölkopf, A. Smola, B. Taskar, and S. V. N. Vishwanathan, editors. Predicting
Structured Data. MIT Press, Cambridge, Massachusetts, 2007.
D. Bamber. The area above the ordinal dominance graph and the area below the receiver operating
characteristic graph. Journal of Mathematical Psychology, 12:387–415, 1975.
O. E. Barndorff-Nielsen. Information and Exponential Families in Statistical Theory. Wiley, Chichester,
1978.
A. R. Barron. Universal approximation bounds for superpositions of a sigmoidal function. IEEE
Transactions on Information Theory, 39(3):930–945, May 1993.
P. Bartlett, P. M. Long, and R. C. Williamson. Fat-shattering and the learnability of real-valued
functions. Journal of Computer and System Sciences, 52(3):434–452, 1996.
P. L. Bartlett, O. Bousquet, and S. Mendelson. Localized Rademacher averages. In Proc. Annual Conf.
Computational Learning Theory, pages 44–58, 2002.
P. L. Bartlett and S. Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural
results. Journal of Machine Learning Research, 3:463–482, 2002.
J. Basilico and T. Hofmann. Unifying collaborative and content-based filtering. In Proc. Intl. Conf.
Machine Learning, 2004.
L. E. Baum. An inequality and associated maximization technique in statistical estimation of probabilistic
functions of a Markov process. Inequalities, 3, 1972.
S. Ben-David, N. Eiron, and P. M. Long. On the difficulty of approximately maximizing agreements. J.
Comput. Syst. Sci., 66(3):496–514, 2003.
K. P. Bennett, A. Demiriz, and J. Shawe-Taylor. A column generation algorithm for boosting. In
P. Langley, editor, Proc. Intl. Conf. Machine Learning, San Francisco, 2000. Morgan Kaufmann Publishers.
K. P. Bennett and O. L. Mangasarian. Robust linear programming discrimination of two linearly
inseparable sets. Optimization Methods and Software, 1:23–34, 1992.
C. Berg, J. P. R. Christensen, and P. Ressel. Harmonic Analysis on Semigroups. Springer, New York,
1984.
C. Berg and G. Forst. Potential Theory on Locally Compact Abelian Groups. Springer-Verlag, Berlin,
1975.
D. Bertsimas and J. Tsitsiklis. Introduction to linear programming. Athena Scientiﬁc, 1997.
J. Besag. Spatial interaction and the statistical analysis of lattice systems (with discussion). Journal of
the Royal Statistical Society, 36(B):192–326, 1974.
V. Blanz, B. Schölkopf, H. Bülthoff, C. Burges, V. Vapnik, and T. Vetter. Comparison of view-based
object recognition algorithms using realistic 3D models. In C. von der Malsburg, W. von Seelen, J. C.
Vorbrüggen, and B. Sendhoff, editors, Artificial Neural Networks ICANN'96, volume 1112 of Lecture
Notes in Computer Science, pages 251–256, Berlin, 1996. Springer-Verlag.
P. Bloomfield and W. L. Steiger. Least absolute deviations: Theory, applications and algorithms.
Birkhäuser, Boston, 1983.
S. Bochner. Monotone Funktionen, Stieltjessche Integrale und harmonische Analyse. Mathematische
Annalen, 108:378–410, 1933.
K. M. Borgwardt, A. Gretton, M. J. Rasch, H.-P. Kriegel, B. Schölkopf, and A. J. Smola. Integrating
structured biological data by kernel maximum mean discrepancy. In Proc. of Intelligent Systems in
Molecular Biology (ISMB), Fortaleza, Brazil, 2006.
B. Boser, I. Guyon, and V. Vapnik. A training algorithm for optimal margin classiﬁers. In D. Haussler,
editor, Proc. Annual Conf. Computational Learning Theory, pages 144–152, Pittsburgh, PA, July 1992.
ACM Press.
O. Bousquet, S. Boucheron, and G. Lugosi. Theory of classiﬁcation: a survey of recent advances. ESAIM:
Probability and Statistics, 9:323 – 375, 2005.
S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, Cambridge, England,
2004.
L. Breiman. Prediction games and arcing algorithms. Neural Computation, 11(7):1493–1518, 1999.
M. P. Brown, W. N. Grundy, D. Lin, N. Cristianini, C. W. Sugnet, T. S. Furey, M. Ares Jr., and
D. Haussler. Knowledge-based analysis of microarray gene expression data by using support vector
machines. Proc. Natl. Acad. Sci. USA, 97(1):262–267, Jan 2000.
C. J. C. Burges. Simpliﬁed support vector decision rules. In L. Saitta, editor, Proc. Intl. Conf. Machine
Learning, pages 71–77, San Mateo, CA, 1996. Morgan Kaufmann Publishers.
C. J. C. Burges. A tutorial on support vector machines for pattern recognition. Data Mining and
Knowledge Discovery, 2(2):121–167, 1998.
J.F. Cardoso. Blind signal separation: statistical principles. Proceedings of the IEEE, 90(8):2009–2026,
1998.
O. Chapelle, P. Haffner, and V. Vapnik. SVMs for histogram-based image classification. IEEE
Transactions on Neural Networks, 10(5), 1999.
O. Chapelle and Z. Harchaoui. A machine learning approach to conjoint analysis. In Lawrence K. Saul,
Yair Weiss, and Léon Bottou, editors, Advances in Neural Information Processing Systems 17, Cambridge,
MA, 2005. MIT Press.
E. Charniak. A maximum-entropy-inspired parser. In North American Association of Computational
Linguistics (NAACL 1), pages 132–139, 2000.
A. Chen and P. Bickel. Consistent independent component analysis and prewhitening. IEEE Transactions
on Signal Processing, 53(10):3625–3632, 2005.
S. Chen, D. Donoho, and M. Saunders. Atomic decomposition by basis pursuit. SIAM Journal of Scientific
Computing, 20(1):33–61, 1999.
K. De Cock and B. De Moor. Subspace angles between ARMA models. Systems and Control Letters, 46:
265–270, 2002.
R. Cole and R. Hariharan. Faster suffix tree construction with missing suffix links. In Proceedings of
the Thirty-Second Annual Symposium on the Theory of Computing, Portland, OR, USA, May 2000. ACM.
M. Collins. Discriminative reranking for natural language parsing. In Proc. Intl. Conf. Machine Learning,
pages 175–182. Morgan Kaufmann, San Francisco, CA, 2000.
M. Collins and N. Duffy. Convolution kernels for natural language. In T. G. Dietterich, S. Becker, and
Z. Ghahramani, editors, Advances in Neural Information Processing Systems 14, Cambridge, MA,
2001. MIT Press.
D. Cook, A. Buja, and J. Cabrera. Projection pursuit indices based on orthonormal function expansions.
Journal of Computational and Graphical Statistics, 2:225–250, 1993.
C. Cortes, M. Mohri, and J. Weston. A general regression technique for learning transductions. In ICML
’05: Proceedings of the 22nd international conference on Machine learning, pages 153–160, New
York, NY, USA, 2005. ACM Press.
C. Cortes and V. Vapnik. Support vector networks. Machine Learning, 20(3):273–297, 1995.
D. Cox and F. O’Sullivan. Asymptotic analysis of penalized likelihood and related estimators. Annals of
Statistics, 18:1676–1695, 1990.
K. Crammer and Y. Singer. On the algorithmic implementation of multiclass kernel-based vector
machines. Journal of Machine Learning Research, 2001.
K. Crammer and Y. Singer. Loss bounds for online category ranking. In Proc. Annual Conf. Computational
Learning Theory, 2005.
N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector Machines. Cambridge University
Press, Cambridge, UK, 2000.
N. Cristianini, J. Shawe-Taylor, A. Elisseeff, and J. Kandola. On kernel-target alignment. In T. G.
Dietterich, S. Becker, and Z. Ghahramani, editors, Advances in Neural Information Processing Systems
14, pages 367–373, Cambridge, MA, 2002. MIT Press.
A. Culotta, D. Kulp, and A. McCallum. Gene prediction with conditional random fields. Technical Report
UM-CS-2005-028, University of Massachusetts, Amherst, 2005.
J. N. Darroch and D. Ratcliff. Generalized iterative scaling for log-linear models. The Annals of
Mathematical Statistics, 43, 1972.
D. Das and P. Sen. Restricted canonical correlations. Linear Algebra and its Applications, 210:29–47,
1994.
J. Dauxois and G. M. Nkiet. Nonlinear canonical analysis and independence tests. Annals of Statistics,
26(4):1254–1278, 1998.
A. P. Dawid. Applications of a general propagation algorithm for probabilistic expert systems. Statistics
and Computing, 2:25–36, 1992.
D. DeCoste and B. Schölkopf. Training invariant support vector machines. Machine Learning, 46:161–190,
2002.
O. Dekel, C. D. Manning, and Y. Singer. Log-linear models for label ranking. In S. Becker, Sebastian
Thrun, and Klaus Obermayer, editors, Advances in Neural Information Processing Systems 15, 2003.
S. Della Pietra, V. Della Pietra, and J. Lafferty. Inducing features of random ﬁelds. IEEE Transactions
on Pattern Analysis and Machine Intelligence, 19:380–393, 1997.
K. I. Diamantaras and S. Y. Kung. Principal Component Neural Networks. John Wiley and Sons, New
York, 1996.
N. R. Draper and H. Smith. Applied Regression Analysis. John Wiley and Sons, New York, NY, 1981.
H. Drucker, C. J. C. Burges, L. Kaufman, A. J. Smola, and V. Vapnik. Support vector regression machines.
In M. C. Mozer, M. I. Jordan, and T. Petsche, editors, Advances in Neural Information Processing
Systems 9, pages 155–161, Cambridge, MA, 1997. MIT Press.
S. Dumais. Using SVMs for text categorization. IEEE Intelligent Systems, 13(4), 1998. In: M. A. Hearst,
B. Schölkopf, S. Dumais, E. Osuna, and J. Platt: Trends and Controversies - Support Vector Machines.
R. Durbin, S. Eddy, A. Krogh, and G. Mitchison. Biological Sequence Analysis: Probabilistic models of
proteins and nucleic acids. Cambridge University Press, 1998.
J. H. J. Einmahl and D. M. Mason. Generalized quantile processes. Annals of Statistics, 20(2):1062–1078,
1992.
A. Elisseeff and J. Weston. A kernel method for multi-labelled classification. In Advances in Neural
Information Processing Systems 14. MIT Press, 2001.
M. Everingham, L. Van Gool, C. Williams, and A. Zisserman. Pascal visual object classes challenge
results. Available from www.pascal-network.org, 2005.
M. Fiedler. Algebraic connectivity of graphs. Czechoslovak Mathematical Journal, 23(98):298–305,
1973.
C. H. FitzGerald, C. A. Micchelli, and A. Pinkus. Functions that preserve families of positive semideﬁnite
matrices. Linear Algebra and its Applications, 221:83–102, 1995.
R. Fletcher. Practical Methods of Optimization. John Wiley and Sons, New York, 1989.
R. Fortet and E. Mourier. Convergence de la répartition empirique vers la répartition théorique. Ann.
Scient. École Norm. Sup., 70:266–285, 1953.
Y. Freund and R. E. Schapire. Experiments with a new boosting algorithm. In Proceedings of the
International Conference on Machine Learning, pages 148–156. Morgan Kaufmann Publishers, 1996.
J. H. Friedman. Exploratory projection pursuit. Journal of the American Statistical Association, 82:
249–266, 1987.
J. H. Friedman and J. W. Tukey. A projection pursuit algorithm for exploratory data analysis. IEEE
Transactions on Computers, C-23(9):881–890, 1974.
T. T. Frieß and R. F. Harrison. Linear programming support vector machines for pattern classification
and regression estimation and the set reduction algorithm. TR RR-706, University of Sheffield,
Sheffield, UK, 1998.
T. Gärtner. A survey of kernels for structured data. SIGKDD Explorations, 5(1):49–58, July 2003.
T. Gärtner, P. A. Flach, A. Kowalczyk, and A. J. Smola. Multi-instance kernels. In Proc. Intl. Conf.
Machine Learning, 2002.
F. Girosi, M. Jones, and T. Poggio. Priors, stabilizers and basis functions: From regularization to radial,
tensor and additive splines. A.I. Memo 1430, Artiﬁcial Intelligence Laboratory, Massachusetts Institute
of Technology, 1993.
P. Green and B. Yandell. Semiparametric generalized linear models. In Proceedings 2nd International
GLIM Conference, number 32 in Lecture Notes in Statistics, pages 44–55. Springer Verlag, New York,
1985.
A. Gretton, O. Bousquet, A. J. Smola, and B. Schölkopf. Measuring statistical dependence with
Hilbert-Schmidt norms. In S. Jain and W. S. Lee, editors, Proceedings Algorithmic Learning Theory,
2005a.
A. Gretton, R. Herbrich, A. Smola, O. Bousquet, and B. Schölkopf. Kernel methods for measuring
independence. Journal of Machine Learning Research, 6:2075–2129, 2005b.
A. Gretton, A. Smola, O. Bousquet, R. Herbrich, A. Belitski, M. Augath, Y. Murayama, J. Pauls,
B. Schölkopf, and N. Logothetis. Kernel constrained covariance for dependence measurement. In
Proceedings of International Workshop on Artificial Intelligence and Statistics, volume 10, 2005c.
M. Gribskov and N. L. Robinson. Use of receiver operating characteristic (ROC) analysis to evaluate
sequence matching. Computers and Chemistry, 20(1):25–33, 1996.
D. Gusﬁeld. Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology.
Cambridge University Press, June 1997. ISBN 0–521–58519–8.
J. Ham, D. Lee, S. Mika, and B. Schölkopf. A kernel view of the dimensionality reduction of manifolds.
In Proceedings of the Twenty-First International Conference on Machine Learning, pages 369–376, 2004.
J. M. Hammersley and P. E. Clifford. Markov fields on finite graphs and lattices. Unpublished
manuscript, 1971.
J. A. Hartigan. Estimation of a convex density contour in two dimensions. Journal of the American
Statistical Association, 82:267–270, 1987.
D. Haussler. Convolution kernels on discrete structures. Technical Report UCSC-CRL-99-10, Computer
Science Department, UC Santa Cruz, 1999.
P. Hayton, B. Schölkopf, L. Tarassenko, and P. Anuzis. Support vector novelty detection applied to jet
engine vibration spectra. In T. K. Leen, T. G. Dietterich, and V. Tresp, editors, Advances in Neural
Information Processing Systems 13, pages 946–952, Cambridge, MA, 2001. MIT Press.
M. Hein, O. Bousquet, and B. Schölkopf. Maximal margin classification for metric spaces. Journal of
Computer and System Sciences, 71:333–359, 2005.
R. Herbrich. Learning Kernel Classiﬁers: Theory and Algorithms. MIT Press, 2002.
R. Herbrich, T. Graepel, and K. Obermayer. Large margin rank boundaries for ordinal regression. In
A. J. Smola, P. L. Bartlett, B. Schölkopf, and D. Schuurmans, editors, Advances in Large Margin
Classifiers, pages 115–132, Cambridge, MA, 2000. MIT Press.
R. Herbrich and R. C. Williamson. Algorithmic luckiness. Journal of Machine Learning Research, 3:
175–212, 2002.
R. Hettich and K. O. Kortanek. Semi-infinite programming: Theory, methods, and applications. SIAM
Rev., 35(3):380–429, 1993.
D. Hilbert. Grundzüge einer allgemeinen Theorie der linearen Integralgleichungen. Nachrichten der
Akademie der Wissenschaften in Göttingen, Mathematisch-Physikalische Klasse, pages 49–91, 1904.
A. E. Hoerl and R. W. Kennard. Ridge regression: Biased estimation for nonorthogonal problems.
Technometrics, 12:55–67, 1970.
T. Hofmann, B. Schölkopf, and A. J. Smola. A review of kernel methods in machine learning. Technical
Report 156, Max-Planck-Institut für biologische Kybernetik, 2006.
A. D. Holub, M. Welling, and P. Perona. Combining generative models and Fisher kernels for object
recognition. In Proceedings of the International Conference on Computer Vision, 2005.
H. Hotelling. Analysis of a complex of statistical variables into principal components. Journal of
Educational Psychology, 24:417–441 and 498–520, 1933.
H. Hotelling. Relations between two sets of variates. Biometrika, 28:321–377, 1936.
P. J. Huber. Robust Statistics. John Wiley and Sons, New York, 1981.
P. J. Huber. Projection pursuit. Annals of Statistics, 13(2):435–475, 1985.
A. Hyvärinen, J. Karhunen, and E. Oja. Independent Component Analysis. John Wiley, 2001.
T. S. Jaakkola and D. Haussler. Probabilistic kernel regression models. In Proceedings of the 1999
Conference on AI and Statistics, 1999.
T. Jebara and I. Kondor. Bhattacharyya and expected likelihood kernels. In B. Schölkopf and
M. Warmuth, editors, Proceedings of the Sixteenth Annual Conference on Computational Learning Theory,
number 2777 in LNAI, pages 57–71, 2003.
F. V. Jensen, S. L. Lauritzen, and K. G. Olesen. Bayesian updates in causal probabilistic networks by
local computation. Computational Statistics Quarterly, 4:269–282, 1990.
T. Joachims. Text categorization with support vector machines: Learning with many relevant features. In
Proceedings of the European Conference on Machine Learning, pages 137–142, Berlin, 1998. Springer.
T. Joachims. Learning to Classify Text Using Support Vector Machines: Methods, Theory, and
Algorithms. Kluwer Academic Publishers, Boston, 2002.
T. Joachims. A support vector method for multivariate performance measures. In Proc. Intl. Conf.
Machine Learning, 2005.
I. T. Jolliffe. Principal Component Analysis. Springer, New York, 1986.
M. C. Jones and R. Sibson. What is projection pursuit? Journal of the Royal Statistical Society A, 150
(1):1–36, 1987.
M. I. Jordan, P. L. Bartlett, and J. D. McAuliffe. Convexity, classiﬁcation, and risk bounds. Technical
Report 638, UC Berkeley, 2003.
J. P. Kahane. Some Random Series of Functions. Cambridge University Press, 1968.
K. Karhunen. Zur Spektraltheorie stochastischer Prozesse. Annales Academiae Scientiarum Fennicae,
37, 1946.
W. Karush. Minima of functions of several variables with inequalities as side constraints. Master’s thesis,
Dept. of Mathematics, Univ. of Chicago, 1939.
H. Kashima, K. Tsuda, and A. Inokuchi. Marginalized kernels between labeled graphs. In Proc. Intl.
Conf. Machine Learning, Washington, DC, United States, 2003.
H. Kashima, K. Tsuda, and A. Inokuchi. Kernels on graphs. In K. Tsuda, B. Schölkopf, and J.-P. Vert,
editors, Kernels and Bioinformatics, pages 155–170, Cambridge, MA, 2004. MIT Press.
J. R. Kettenring. Canonical analysis of several sets of variables. Biometrika, 58:433–451, 1971.
K. Kim, M. O. Franz, and B. Schölkopf. Iterative kernel principal component analysis for image
modeling. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(9):1351–1366, 2005.
G. S. Kimeldorf and G. Wahba. Some results on Tchebycheffian spline functions. J. Math. Anal. Applic.,
33:82–95, 1971.
R. Koenker. Quantile Regression. Cambridge University Press, 2005.
V. Koltchinskii. Rademacher penalties and structural risk minimization. IEEE Transactions on
Information Theory, 47:1902–1914, 2001.
I. R. Kondor and J. D. Lafferty. Diffusion kernels on graphs and other discrete structures. In Proc. Intl.
Conf. Machine Learning, pages 315–322, 2002.
W. Krauth and M. Mézard. Learning algorithms with optimal stability in neural networks. Journal of
Physics A, 20:745–752, 1987.
H. W. Kuhn and A. W. Tucker. Nonlinear programming. In Proc. 2nd Berkeley Symposium on Mathematical
Statistics and Probability, pages 481–492, Berkeley, 1951. University of California Press.
J. Lafferty, X. Zhu, and Y. Liu. Kernel conditional random ﬁelds: Representation and clique selection. In
Proc. Intl. Conf. Machine Learning, volume 21, 2004.
J. D. Lafferty, A. McCallum, and F. Pereira. Conditional random ﬁelds: Probabilistic modeling for
segmenting and labeling sequence data. In Proc. Intl. Conf. Machine Learning, volume 18, 2001.
S. L. Lauritzen. Graphical Models. Oxford University Press, 1996.
Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document
recognition. Proceedings of the IEEE, 86(11):2278–2324, November 1998.
T.-W. Lee, M. Girolami, A. Bell, and T. Sejnowski. A unifying framework for independent component
analysis. Computers and Mathematics with Applications, 39:1–21, 2000.
C. Leslie, E. Eskin, and W. S. Noble. The spectrum kernel: A string kernel for SVM protein classiﬁcation.
In Proceedings of the Paciﬁc Symposium on Biocomputing, pages 564–575, 2002a.
C. Leslie, E. Eskin, J. Weston, and W. S. Noble. Mismatch string kernels for SVM protein
classification. In S. Becker, Sebastian Thrun, and Klaus Obermayer, editors, Advances in Neural
Information Processing Systems 15, volume 15, Cambridge, MA, 2002b. MIT Press.
H. Lodhi, J. Shawe-Taylor, N. Cristianini, and C. Watkins. Text classification using string kernels.
Technical Report 2000-79, NeuroCOLT, 2000. Published in: T. K. Leen, T. G. Dietterich and V. Tresp
(eds.), Advances in Neural Information Processing Systems 13, MIT Press, 2001.
M. Loève. Probability Theory II. Springer-Verlag, New York, 4th edition, 1978.
D. G. Luenberger. Linear and Nonlinear Programming. Addison-Wesley, Reading, second edition, May
1984. ISBN 0-201-15794-2.
D. M. Magerman. Learning grammatical structure using statistical decision trees. In Proceedings ICGI,
volume 1147 of Lecture Notes in Artificial Intelligence, pages 1–21. Springer, Berlin, 1996.
Y. Makovoz. Random approximants and neural networks. Journal of Approximation Theory, 85:98–109,
1996.
O. L. Mangasarian. Linear and nonlinear separation of patterns by linear programming. Operations
Research, 13:444–452, 1965.
R. J. Martin. A metric for ARMA processes. IEEE Transactions on Signal Processing, 48(4):1164–1170,
2000.
A. McCallum, K. Bellare, and F. Pereira. A conditional random field for discriminatively-trained
finite-state string edit distance. In Conference on Uncertainty in AI (UAI), 2005.
P. McCullagh and J. A. Nelder. Generalized Linear Models. Chapman and Hall, London, 1983.
C. McDiarmid. On the method of bounded differences. In Survey in Combinatorics, pages 148–188.
Cambridge University Press, 1989.
S. Mendelson. Rademacher averages and phase transitions in Glivenko-Cantelli classes. IEEE
Transactions on Information Theory, 2001. Submitted.
S. Mendelson. A few notes on statistical learning theory. In S. Mendelson and A. J. Smola, editors,
Advanced Lectures on Machine Learning, number 2600 in LNAI, pages 1–40. Springer, 2003.
J. Mercer. Functions of positive and negative type and their connection with the theory of integral
equations. Philosophical Transactions of the Royal Society, London, A 209:415–446, 1909.
H. Meschkowski. Hilbertsche Räume mit Kernfunktion. Springer-Verlag, Berlin, 1962.
S. Mika, G. Rätsch, J. Weston, B. Schölkopf, A. J. Smola, and K.-R. Müller. Learning discriminative
and invariant nonlinear features. IEEE Transactions on Pattern Analysis and Machine Intelligence,
25(5):623–628, 2003.
S. Mika, B. Schölkopf, A. J. Smola, K.-R. Müller, M. Scholz, and G. Rätsch. Kernel PCA and de-noising
in feature spaces. In M. S. Kearns, S. A. Solla, and D. A. Cohn, editors, Advances in Neural
Information Processing Systems 11, pages 536–542. MIT Press, 1999.
M. Minsky and S. Papert. Perceptrons: An Introduction to Computational Geometry. MIT Press,
Cambridge, MA, 1969.
V. A. Morozov. Methods for Solving Incorrectly Posed Problems. Springer, 1984.
K.-R. Müller, A. J. Smola, G. Rätsch, B. Schölkopf, J. Kohlmorgen, and V. Vapnik. Predicting time
series with support vector machines. In W. Gerstner, A. Germond, M. Hasler, and J.-D. Nicoud, editors,
Artificial Neural Networks ICANN'97, volume 1327 of Lecture Notes in Computer Science, pages
999–1004, Berlin, 1997. Springer-Verlag.
M. K. Murray and J.W. Rice. Differential geometry and statistics, volume 48 of Monographs on statistics
and applied probability. Chapman and Hall, 1993.
G. Navarro and M. Rafﬁnot. Fast regular expression search. In Proceedings of WAE, number 1668 in
LNCS, pages 198–212. Springer, 1999.
J. A. Nelder and R. W. M. Wedderburn. Generalized linear models. Journal of the Royal Statistical
Society, A, 135:370–384, 1972.
D. Nolan. The excess mass ellipsoid. Journal of Multivariate Analysis, 39:348–371, 1991.
N. Oliver, B. Schölkopf, and A. J. Smola. Natural regularization in SVMs. In A. J. Smola, P. L.
Bartlett, B. Schölkopf, and D. Schuurmans, editors, Advances in Large Margin Classifiers, pages 51–60,
Cambridge, MA, 2000. MIT Press.
F. O’Sullivan, B. Yandell, and W. Raynor. Automatic smoothing of regression functions in generalized
linear models. J. Amer. Statist. Assoc., 81:96–103, 1986.
E. Parzen. Statistical inference on time series by RKHS methods. In R. Pyke, editor, Proceedings 12th
Biennial Seminar, pages 1–37, Montreal, 1970. Canadian Mathematical Congress.
K. Pearson. On lines and planes of closest ﬁt to points in space. Philosophical Magazine, 2 (sixth series):
559–572, 1901.
A. Pinkus. Strictly positive deﬁnite functions on a real inner product space. Advances in Computational
Mathematics, 20:263–271, 2004.
J. Platt. Fast training of support vector machines using sequential minimal optimization. In
B. Schölkopf, C. J. C. Burges, and A. J. Smola, editors, Advances in Kernel Methods - Support Vector
Learning, pages 185–208, Cambridge, MA, 1999. MIT Press.
T. Poggio. On optimal nonlinear associative recall. Biological Cybernetics, 19:201–209, 1975.
T. Poggio and F. Girosi. Networks for approximation and learning. Proceedings of the IEEE, 78(9),
September 1990.
W. Polonik. Minimum volume sets and generalized quantile processes. Stochastic Processes and their
Applications, 69:1–24, 1997.
W. H. Press, S. A. Teukolsky, W. T. Vetterling, and B. P. Flannery. Numerical Recipes in C. The Art of
Scientiﬁc Computation. Cambridge University Press, 1994.
L. R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition.
Proceedings of the IEEE, 77(2):257–285, February 1989.
C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, Cam
bridge, MA, 2006.
G. R¨ atsch. Robust Boosting via Convex Optimization: Theory and Applications. PhD thesis, University
of Potsdam, 2001.
G. R¨ atsch, S. Mika, and A. J. Smola. Adapting codes and embeddings for polychotomies. In S. Thrun
S. Becker and K. Obermayer, editors, Advances in Neural Information Processing Systems 15, pages
513–520. MIT Press, Cambridge, MA, 2003.
G. R¨ atsch, T. Onoda, and K. R. M¨ uller. Soft margins for adaboost. Machine Learning, 42(3):287–320,
2001.
G. R¨ atsch, S. Sonnenburg, J. Srinivasan, H. Witte, K.R. M¨ uller, R. J. Sommer, and B. Sch¨ olkopf. Im
proving the Caenorhabditis elegans genome annotation using machine learning. PLoS Computational
Biology, 2007. In press.
A. R´ enyi. On measures of dependence. Acta Math. Acad. Sci. Hungar., 10:441–451, 1959.
R. T. Rockafellar. Convex Analysis, volume 28 of Princeton Mathematics Series. Princeton University
Press, 1970.
S. Romdhani, P. H. S. Torr, B. Sch¨ olkopf, and A. Blake. Efﬁcient face detection by a cascaded support
vector machine expansion. Proceedings of the Royal Society A, 460:3283–3297, 2004.
W. Rudin. Fourier analysis on groups. Interscience Publishers, New York, 1962.
P. Ruj´ an. A fast method for calculating the perceptron with maximal stability. Journal de Physique I
France, 3:277–290, 1993.
T. W. Sager. An iterative method for estimating a multivariate mode and isopleth. Journal of the American
Statistical Association, 74(366):329–339, 1979.
R. Schapire, Y. Freund, P. L. Bartlett, and W. S. Lee. Boosting the margin: A new explanation for the
effectiveness of voting methods. Annals of Statistics, 26:1651–1686, 1998.
I. J. Schoenberg. Metric spaces and completely monotone functions. Annals of Mathematics, 39:811–841,
1938.
B. Sch¨ olkopf. Support Vector Learning. R. Oldenbourg Verlag, Munich, 1997. Download:
http://www.kernelmachines.org.
B. Sch¨ olkopf, R. Herbrich, and A. J. Smola. A generalized representer theorem. In D. P. Helmbold and
B. Williamson, editors, Proc. Annual Conf. Computational Learning Theory, number 2111 in Lecture
Notes in Computer Science, pages 416 – 426, London, UK, 2001. SpringerVerlag.
B. Schölkopf, J. Platt, J. Shawe-Taylor, A. J. Smola, and R. C. Williamson. Estimating the support of a high-dimensional distribution. Neural Computation, 13(7):1443–1471, 2001.
B. Schölkopf and A. Smola. Learning with Kernels. MIT Press, Cambridge, MA, 2002.
B. Schölkopf, A. J. Smola, and K.-R. Müller. Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10:1299–1319, 1998.
B. Schölkopf, A. J. Smola, R. C. Williamson, and P. L. Bartlett. New support vector algorithms. Neural Computation, 12:1207–1245, 2000.
B. Schölkopf, K. Tsuda, and J.-P. Vert. Kernel Methods in Computational Biology. MIT Press, Cambridge, MA, 2004.
F. Sha and F. Pereira. Shallow parsing with conditional random fields. In Proceedings of HLT-NAACL, pages 213–220. Association for Computational Linguistics, 2003.
J. Shawe-Taylor, P. L. Bartlett, R. C. Williamson, and M. Anthony. Structural risk minimization over data-dependent hierarchies. IEEE Transactions on Information Theory, 44(5):1926–1940, 1998.
J. Shawe-Taylor and N. Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, Cambridge, UK, 2004.
K. Sim. Context kernels for text categorization. Master's thesis, The Australian National University, Canberra, Australia, ACT 0200, June 2001.
A. J. Smola. Regression estimation with support vector learning machines. Diplomarbeit, Technische Universität München, 1996.
A. J. Smola, P. L. Bartlett, B. Schölkopf, and D. Schuurmans, editors. Advances in Large Margin Classifiers. MIT Press, Cambridge, MA, 2000.
A. J. Smola and I. R. Kondor. Kernels and regularization on graphs. In B. Schölkopf and M. K. Warmuth, editors, Proc. Annual Conf. Computational Learning Theory, Lecture Notes in Computer Science, pages 144–158. Springer, 2003.
A. J. Smola and B. Schölkopf. On a kernel-based method for pattern recognition, regression, approximation and operator inversion. Algorithmica, 22:211–231, 1998.
A. J. Smola and B. Schölkopf. Sparse greedy matrix approximation for machine learning. In P. Langley, editor, Proc. Intl. Conf. Machine Learning, pages 911–918, San Francisco, 2000. Morgan Kaufmann Publishers.
A. J. Smola, B. Schölkopf, and K.-R. Müller. The connection between regularization operators and support vector kernels. Neural Networks, 11(5):637–649, 1998.
I. Steinwart. On the influence of the kernel on the consistency of support vector machines. Journal of Machine Learning Research, 2:67–93, 2002a.
I. Steinwart. Support vector machines are universally consistent. J. Complexity, 18:768–791, 2002b.
J. Stewart. Positive definite functions and generalizations, an historical survey. Rocky Mountain J. Math., 6:409–434, 1976.
M. Stitson, A. Gammerman, V. Vapnik, V. Vovk, C. Watkins, and J. Weston. Support vector regression with ANOVA decomposition kernels. In B. Schölkopf, C. J. C. Burges, and A. J. Smola, editors, Advances in Kernel Methods - Support Vector Learning, pages 285–292, Cambridge, MA, 1999. MIT Press.
I. Takeuchi, Q. V. Le, T. Sears, and A. J. Smola. Nonparametric quantile estimation. Journal of Machine Learning Research, 2006. Forthcoming.
B. Taskar, C. Guestrin, and D. Koller. Max-margin Markov networks. In S. Thrun, L. Saul, and B. Schölkopf, editors, Advances in Neural Information Processing Systems 16, 2003.
B. Taskar, D. Klein, M. Collins, D. Koller, and C. Manning. Max-margin parsing. In Empirical Methods in Natural Language Processing, 2004.
D. M. J. Tax and R. P. W. Duin. Data domain description by support vectors. In M. Verleysen, editor, Proceedings ESANN, pages 251–256, Brussels, 1999. D Facto.
R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58:267–288, 1996.
A. N. Tikhonov. Solution of incorrectly formulated problems and the regularization method. Soviet Math. Dokl., 4:1035–1038, 1963.
I. Tsochantaridis, T. Joachims, T. Hofmann, and Y. Altun. Large margin methods for structured and interdependent output variables. Journal of Machine Learning Research, 2005.
A. B. Tsybakov. On nonparametric estimation of density level sets. Annals of Statistics, 25(3):948–969, 1997.
A. B. Tsybakov. Optimal aggregation of classifiers in statistical learning. Annals of Statistics, 32(1):135–166, 2003.
C. J. van Rijsbergen. Information Retrieval. Butterworths, London, 2nd edition, 1979.
V. Vapnik. Estimation of Dependences Based on Empirical Data. Springer, Berlin, 1982.
V. Vapnik. The Nature of Statistical Learning Theory. Springer, New York, 1995.
V. Vapnik. Statistical Learning Theory. John Wiley and Sons, New York, 1998.
V. Vapnik and A. Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and its Applications, 16(2):264–281, 1971.
V. Vapnik and A. Chervonenkis. The necessary and sufficient conditions for consistency in the empirical risk minimization method. Pattern Recognition and Image Analysis, 1(3):283–305, 1991.
V. Vapnik, S. Golowich, and A. J. Smola. Support vector method for function approximation, regression estimation, and signal processing. In M. C. Mozer, M. I. Jordan, and T. Petsche, editors, Advances in Neural Information Processing Systems 9, pages 281–287, Cambridge, MA, 1997. MIT Press.
V. Vapnik and A. Lerner. Pattern recognition using generalized portrait method. Automation and Remote Control, 24:774–780, 1963.
S. V. N. Vishwanathan and A. J. Smola. Fast kernels for string and tree matching. In K. Tsuda, B. Schölkopf, and J.-P. Vert, editors, Kernels and Bioinformatics, Cambridge, MA, 2004. MIT Press.
S. V. N. Vishwanathan, A. J. Smola, and R. Vidal. Binet-Cauchy kernels on dynamical systems and its application to the analysis of dynamic scenes. International Journal of Computer Vision, 2006. To appear.
G. Wahba. Spline Models for Observational Data, volume 59 of CBMS-NSF Regional Conference Series in Applied Mathematics. SIAM, Philadelphia, 1990.
G. Wahba, Y. Wang, C. Gu, R. Klein, and B. Klein. Smoothing spline ANOVA for exponential families, with application to the Wisconsin Epidemiological Study of Diabetic Retinopathy. Annals of Statistics, 23:1865–1895, 1995.
M. J. Wainwright and M. I. Jordan. Graphical models, exponential families, and variational inference. Technical Report 649, UC Berkeley, Department of Statistics, September 2003.
C. Watkins. Dynamic alignment kernels. In A. J. Smola, P. L. Bartlett, B. Schölkopf, and D. Schuurmans, editors, Advances in Large Margin Classifiers, pages 39–50, Cambridge, MA, 2000. MIT Press.
H. Wendland. Scattered Data Approximation. Cambridge University Press, Cambridge, UK, 2005.
J. Weston, O. Chapelle, A. Elisseeff, B. Schölkopf, and V. Vapnik. Kernel dependency estimation. In S. Becker, S. Thrun, and K. Obermayer, editors, Advances in Neural Information Processing Systems 15, pages 873–880. MIT Press, Cambridge, MA, 2003.
J. Whittaker. Graphical Models in Applied Multivariate Statistics. John Wiley & Sons, 1990.
C. K. I. Williams and M. Seeger. Using the Nyström method to speed up kernel machines. In T. G. Dietterich, S. Becker, and Z. Ghahramani, editors, Advances in Neural Information Processing Systems 14, 2000.
R. C. Williamson, A. J. Smola, and B. Schölkopf. Generalization bounds for regularization networks and support vector machines via entropy numbers of compact operators. IEEE Transactions on Information Theory, 47(6):2516–2532, 2001.
L. Wolf and A. Shashua. Learning over sets using kernel principal angles. Journal of Machine Learning Research, 4:913–931, 2003.
P. Wolfe. A duality theorem for nonlinear programming. Quarterly of Applied Mathematics, 19:239–244, 1961.
H. H. Yang and S.-I. Amari. Adaptive online learning algorithms for blind separation – maximum entropy and minimum mutual information. Neural Computation, 9(7):1457–1482, 1997.
L. S. Zettlemoyer and M. Collins. Learning to map sentences to logical form: Structured classification with probabilistic categorial grammars. In Uncertainty in Artificial Intelligence UAI, 2005.
T. Zhang. Statistical behavior and consistency of classification methods based on convex risk minimization. Annals of Statistics, 32(1):56–85, 2004.
D. Zhou, J. Huang, and B. Schölkopf. Learning from labeled and unlabeled data on a directed graph. In Proc. Intl. Conf. Machine Learning, 2005.
A. Zien, G. Rätsch, S. Mika, B. Schölkopf, T. Lengauer, and K.-R. Müller. Engineering support vector machine kernels that recognize translation initiation sites. Bioinformatics, 16(9):799–807, 2000.
Note: instead of this TR, please cite the updated journal version: T. Hofmann, B. Schölkopf, A. Smola: Kernel methods in machine learning. Annals of Statistics (in press).
A Review of Kernel Methods in Machine Learning
Thomas Hofmann, Bernhard Schölkopf, Alexander J. Smola
Abstract. We review recent methods for learning with positive definite kernels. All these methods formulate learning and estimation problems as linear tasks in a reproducing kernel Hilbert space (RKHS) associated with a kernel. We cover a wide range of methods, ranging from simple classifiers to sophisticated methods for estimation with structured data. (AMS 2000 subject classifications: primary 30C40 Kernel functions and applications; secondary 68T05 Learning and adaptive systems. Key words: machine learning, reproducing kernels, support vector machines, graphical models)
1 Introduction
Over the last ten years, estimation and learning methods utilizing positive definite kernels have become rather popular, particularly in machine learning. Since these methods have a stronger mathematical slant than earlier machine learning methods (e.g., neural networks), there is also significant interest in the statistics and mathematics community for these methods. The present review aims to summarize the state of the art on a conceptual level. In doing so, we build on various sources (including [Vapnik, 1998, Burges, 1998, Cristianini and Shawe-Taylor, 2000, Herbrich, 2002] and in particular [Schölkopf and Smola, 2002]), but we also add a fair amount of more recent material which helps to unify the exposition. We have not had space to include proofs; they can be found either in the long version of the present paper (see Hofmann et al. [2006]), in the references given, or in the above books.

The main idea of all the described methods can be summarized in one paragraph. Traditionally, theory and algorithms of machine learning and statistics have been very well developed for the linear case. Real world data analysis problems, on the other hand, often require nonlinear methods to detect the kind of dependencies that allow successful prediction of properties of interest. By using a positive definite kernel, one can sometimes have the best of both worlds. The kernel corresponds to a dot product in a (usually high dimensional) feature space. In this space, our estimation methods are linear, but as long as we can formulate everything in terms of kernel evaluations, we never explicitly have to compute in the high dimensional feature space.

The paper has three main sections: Section 2 deals with fundamental properties of kernels, with special emphasis on (conditionally) positive definite kernels and their characterization. We give concrete examples for such kernels and discuss kernels and reproducing kernel Hilbert spaces in the context of regularization.
Section 3 presents various approaches for estimating dependencies and analyzing data that make use of kernels. We provide an overview of the problem formulations as well as their solution using convex programming techniques. Finally, Section 4 examines the use of reproducing kernel Hilbert spaces as a means to define statistical models, the focus being on structured, multidimensional responses. We also show how such techniques can be combined with Markov networks as a suitable framework to model dependencies between response variables.
Figure 1: A simple geometric classification algorithm: given two classes of points (depicted by 'o' and '+'), compute their means c_+, c_- and assign a test input x to the one whose mean is closer. This can be done by looking at the dot product between x − c (where c = (c_+ + c_-)/2) and w := c_+ − c_-, which changes sign as the enclosed angle passes through π/2. Note that the corresponding decision boundary is a hyperplane (the dotted line) orthogonal to w (from [Schölkopf and Smola, 2002]).
2 Kernels
2.1 An Introductory Example

Suppose we are given empirical data

    (x_1, y_1), …, (x_n, y_n) ∈ X × Y.    (1)
Here, the domain X is some nonempty set that the inputs x_i (the predictor variables) are taken from; the y_i ∈ Y are called targets (the response variables). Here and below, i, j ∈ [n], where we use the notation [n] := {1, …, n}.

Note that we have not made any assumptions on the domain X other than it being a set. In order to study the problem of learning, we need additional structure. In learning, we want to be able to generalize to unseen data points. In the case of binary pattern recognition, given some new input x ∈ X, we want to predict the corresponding y ∈ {±1} (more complex output domains Y will be treated below). Loosely speaking, we want to choose y such that (x, y) is in some sense similar to the training examples. To this end, we need similarity measures in X and in {±1}. The latter is easier, as two target values can only be identical or different. For the former, we require a function

    k : X × X → R,  (x, x′) ↦ k(x, x′)    (2)

satisfying, for all x, x′ ∈ X,

    k(x, x′) = ⟨Φ(x), Φ(x′)⟩,    (3)

where Φ maps into some dot product space H, sometimes called the feature space. The similarity measure k is usually called a kernel, and Φ is called its feature map.

The advantage of using such a kernel as a similarity measure is that it allows us to construct algorithms in dot product spaces. For instance, consider the following simple classification algorithm, where Y = {±1}. The idea is to compute the means of the two classes in the feature space, c_+ = (1/n_+) Σ_{i : y_i = +1} Φ(x_i) and c_- = (1/n_-) Σ_{i : y_i = −1} Φ(x_i), where n_+ and n_- are the number of examples with positive and negative target values, respectively. We then assign a new point Φ(x) to the class whose mean is closer to it. This leads to the prediction rule

    y = sgn(⟨Φ(x), c_+⟩ − ⟨Φ(x), c_-⟩ + b)    (4)
with b = ½(‖c_-‖² − ‖c_+‖²). Substituting the expressions for c_± yields

    y = sgn( (1/n_+) Σ_{i : y_i = +1} ⟨Φ(x), Φ(x_i)⟩ − (1/n_-) Σ_{i : y_i = −1} ⟨Φ(x), Φ(x_i)⟩ + b )
      = sgn( (1/n_+) Σ_{i : y_i = +1} k(x, x_i) − (1/n_-) Σ_{i : y_i = −1} k(x, x_i) + b ),    (5)

where b = ½( (1/n_-²) Σ_{(i,j) : y_i = y_j = −1} k(x_i, x_j) − (1/n_+²) Σ_{(i,j) : y_i = y_j = +1} k(x_i, x_j) ).

Let us consider one well-known special case of this type of classifier. Assume that the class means have the same distance to the origin (hence b = 0), and that k(·, x) is a density for all x ∈ X. If the two classes are equally likely and were generated from two probability distributions that are estimated as

    p_+(x) := (1/n_+) Σ_{i : y_i = +1} k(x, x_i),   p_-(x) := (1/n_-) Σ_{i : y_i = −1} k(x, x_i),    (6)

then (5) is the estimated Bayes decision rule, plugging in the estimates p_+ and p_- for the true densities.

The classifier (5) is closely related to the Support Vector Machine (SVM) that we will discuss below. It is linear in the feature space (4), while in the input domain, it is represented by a kernel expansion (5). In both cases, the decision boundary is a hyperplane in the feature space; however, the normal vectors (for (4), w = c_+ − c_-) are usually rather different.¹ The normal vector not only characterizes the alignment of the hyperplane, its length can also be used to construct tests for the equality of the two class-generating distributions [Borgwardt et al., 2006].

2.2 Positive Definite Kernels
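The decision rule (5) involves only kernel evaluations, so it can be implemented without ever computing Φ explicitly. A minimal numerical sketch using NumPy, with hypothetical synthetic data, a Gaussian kernel as in (6), and b set to 0 for simplicity:

```python
import numpy as np

def gaussian_kernel(x, xp, sigma=1.0):
    # k(x, x') = exp(-||x - x'||^2 / (2 sigma^2)), a positive definite kernel
    return np.exp(-np.sum((x - xp) ** 2) / (2 * sigma ** 2))

def mean_classifier(x, X, y, sigma=1.0):
    # Prediction rule (5) with b = 0: compare the kernel means of the two classes.
    pos = np.mean([gaussian_kernel(x, xi, sigma) for xi in X[y == +1]])
    neg = np.mean([gaussian_kernel(x, xi, sigma) for xi in X[y == -1]])
    return np.sign(pos - neg)

# Two hypothetical point clouds around (0, 0) and (4, 4)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(4, 1, (20, 2))])
y = np.array([-1] * 20 + [+1] * 20)
print(mean_classifier(np.array([3.5, 4.2]), X, y))   # assigned to the +1 cloud
```

With b = 0 and a density-like kernel, this is exactly the plug-in Bayes rule built from the Parzen estimates (6); kernel evaluations against the distant cloud are negligibly small.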
We have required that a kernel satisfy (3), i.e., correspond to a dot product in some dot product space. In the present section, we show that the class of kernels that can be written in the form (3) coincides with the class of positive definite kernels. This has far-reaching consequences. There are examples of positive definite kernels which can be evaluated efficiently even though they correspond to dot products in infinite dimensional dot product spaces. In such cases, substituting k(x, x′) for ⟨Φ(x), Φ(x′)⟩, as we have done in (5), is crucial. In the machine learning community, this substitution is called the kernel trick.

Definition 1 (Gram matrix) Given a kernel k and inputs x_1, …, x_n ∈ X, the n × n matrix

    K := (k(x_i, x_j))_{ij}    (7)

is called the Gram matrix (or kernel matrix) of k with respect to x_1, …, x_n.

Definition 2 (Positive definite matrix) A real n × n symmetric matrix K_{ij} satisfying

    Σ_{i,j} c_i c_j K_{ij} ≥ 0    (8)

for all c_i ∈ R is called positive definite. If equality in (8) only occurs for c_1 = ⋯ = c_n = 0, then we shall call the matrix strictly positive definite.

Definition 3 (Positive definite kernel) Let X be a nonempty set. A function k : X × X → R which for all n ∈ N, x_i ∈ X, i ∈ [n] gives rise to a positive definite Gram matrix is called a positive definite kernel. A function k : X × X → R which for all n ∈ N and distinct x_i ∈ X gives rise to a strictly positive definite Gram matrix is called a strictly positive definite kernel.
¹ As an aside, note that if we normalize the targets such that ŷ_i = y_i / |{j : y_j = y_i}|, in which case the ŷ_i sum to zero, then ‖w‖² = ⟨K, ŷŷ⊤⟩_F, where ⟨·, ·⟩_F is the Frobenius dot product. If the two classes have equal size, then up to a scaling factor involving ‖K‖₂ and n, this equals the kernel-target alignment defined in [Cristianini et al., 2002].
Occasionally, we shall refer to positive definite kernels simply as kernels. Note that, for simplicity, we have restricted ourselves to the case of real valued kernels. However, with small changes, the results below will also hold for the complex valued case.

Since Σ_{i,j} c_i c_j ⟨Φ(x_i), Φ(x_j)⟩ = ⟨Σ_i c_i Φ(x_i), Σ_j c_j Φ(x_j)⟩ ≥ 0, kernels of the form (3) are positive definite for any choice of Φ. In particular, if X is already a dot product space, we may choose Φ to be the identity. Kernels can thus be regarded as generalized dot products. Whilst they are not generally bilinear, they share important properties with dot products, such as the Cauchy-Schwarz inequality:

Proposition 4 If k is a positive definite kernel, and x_1, x_2 ∈ X, then

    k(x_1, x_2)² ≤ k(x_1, x_1) · k(x_2, x_2).    (9)

Proof The 2×2 Gram matrix with entries K_{ij} = k(x_i, x_j) is positive definite. Hence both its eigenvalues are nonnegative, and so is their product, K's determinant, i.e.,

    0 ≤ K_{11}K_{22} − K_{12}K_{21} = K_{11}K_{22} − K_{12}².    (10)

Substituting k(x_i, x_j) for K_{ij}, we get the desired inequality.
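Both condition (8) and the Cauchy-Schwarz inequality (9) are easy to probe numerically: (8) holds for all c iff the symmetric Gram matrix has no negative eigenvalues. A small sketch, assuming a Gaussian kernel and arbitrary sample points:

```python
import numpy as np

def k(x, xp, sigma=1.0):
    # Gaussian kernel, positive definite on R^d
    return np.exp(-np.sum((x - xp) ** 2) / (2 * sigma ** 2))

rng = np.random.default_rng(1)
X = rng.normal(size=(10, 3))

# Gram matrix (7)
K = np.array([[k(xi, xj) for xj in X] for xi in X])

# (8): the symmetric matrix K has no negative eigenvalues (up to round-off)
eigvals = np.linalg.eigvalsh(K)
print(eigvals.min() >= -1e-10)

# Cauchy-Schwarz (9) for every pair of points
cs = all(k(X[i], X[j]) ** 2 <= k(X[i], X[i]) * k(X[j], X[j]) + 1e-12
         for i in range(10) for j in range(10))
print(cs)
```

The small tolerances only absorb floating-point round-off; for distinct points the Gaussian kernel is in fact strictly positive definite, so all eigenvalues are strictly positive.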
2.2.1 Construction of the Reproducing Kernel Hilbert Space

We now define a map from X into the space of functions mapping X into R, denoted as R^X, via

    Φ : X → R^X,  x ↦ k(·, x).    (11)

Here, Φ(x) = k(·, x) denotes the function that assigns the value k(x′, x) to x′ ∈ X.

We next construct a dot product space containing the images of the inputs under Φ. To this end, we first turn it into a vector space by forming linear combinations

    f(·) = Σ_{i=1}^{n} α_i k(·, x_i).    (12)

Here, n ∈ N, α_i ∈ R and x_i ∈ X are arbitrary. Next, we define a dot product between f and another function g(·) = Σ_{j=1}^{n′} β_j k(·, x′_j) (with n′ ∈ N, β_j ∈ R and x′_j ∈ X) as

    ⟨f, g⟩ := Σ_{i=1}^{n} Σ_{j=1}^{n′} α_i β_j k(x_i, x′_j).    (13)

To see that this is well-defined although it contains the expansion coefficients and points, note that ⟨f, g⟩ = Σ_{j=1}^{n′} β_j f(x′_j). The latter, however, does not depend on the particular expansion of f. Similarly, for g, note that ⟨f, g⟩ = Σ_{i=1}^{n} α_i g(x_i). This also shows that ⟨·, ·⟩ is bilinear. It is symmetric, as ⟨f, g⟩ = ⟨g, f⟩. Moreover, it is positive definite, since positive definiteness of k implies that for any function f, written as (12), we have

    ⟨f, f⟩ = Σ_{i,j=1}^{n} α_i α_j k(x_i, x_j) ≥ 0.    (14)

Next, note that given functions f_1, …, f_p, and coefficients γ_1, …, γ_p ∈ R, we have

    Σ_{i,j=1}^{p} γ_i γ_j ⟨f_i, f_j⟩ = ⟨Σ_{i=1}^{p} γ_i f_i, Σ_{j=1}^{p} γ_j f_j⟩ ≥ 0.    (15)
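The well-definedness argument for (13) can be checked numerically: the double sum over expansion coefficients and points, the form Σ_j β_j f(x′_j), and the form Σ_i α_i g(x_i) must all agree. A small sketch with a Gaussian kernel and hypothetical expansions:

```python
import numpy as np

def k(x, xp):
    # Gaussian kernel with sigma = 1
    return float(np.exp(-np.sum((x - xp) ** 2) / 2))

rng = np.random.default_rng(2)
Xf, alpha = rng.normal(size=(4, 2)), rng.normal(size=4)   # f = sum_i alpha_i k(., x_i)
Xg, beta = rng.normal(size=(3, 2)), rng.normal(size=3)    # g = sum_j beta_j k(., x'_j)

f = lambda x: sum(a * k(x, xi) for a, xi in zip(alpha, Xf))
g = lambda x: sum(b * k(x, xj) for b, xj in zip(beta, Xg))

# (13): double sum over coefficients and points
dot = sum(a * b * k(xi, xj) for a, xi in zip(alpha, Xf) for b, xj in zip(beta, Xg))

# the two expansion-independent evaluations
via_f = sum(b * f(xj) for b, xj in zip(beta, Xg))
via_g = sum(a * g(xi) for a, xi in zip(alpha, Xf))

print(np.isclose(dot, via_f) and np.isclose(dot, via_g))
```

Since the last two expressions depend only on the function values of f and g, not on how they are expanded, ⟨f, g⟩ is independent of the chosen expansions.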
Here, the equality follows from the bilinearity of ⟨·, ·⟩, and the inequality from (14). By (15), ⟨·, ·⟩ is a positive definite kernel, defined on our vector space of functions. For the last step in proving that it even is a dot product, we note that, by (13), for all functions (12),

    ⟨k(·, x), f⟩ = f(x), and in particular ⟨k(·, x), k(·, x′)⟩ = k(x, x′).    (16)

By virtue of these properties, k is called a reproducing kernel [Aronszajn, 1950]. Due to (16) and (9), we have

    |f(x)|² = |⟨k(·, x), f⟩|² ≤ k(x, x) · ⟨f, f⟩.    (17)

By this inequality, ⟨f, f⟩ = 0 implies f = 0, which is the last property that was left to prove in order to establish that ⟨·, ·⟩ is a dot product. Skipping some details, we add that one can complete the space of functions (12) in the norm corresponding to the dot product, and thus gets a Hilbert space H, called a reproducing kernel Hilbert space (RKHS).

One can define an RKHS as a Hilbert space H of functions on a set X with the property that, for all x ∈ X and f ∈ H, the point evaluations f ↦ f(x) are continuous linear functionals (in particular, all point values f(x) are well defined, which already distinguishes RKHSs from many L₂ Hilbert spaces). From the point evaluation functional, one can then construct the reproducing kernel using the Riesz representation theorem. The Moore-Aronszajn theorem [Aronszajn, 1950] states that for every positive definite kernel on X × X, there exists a unique RKHS and vice versa.

There is an analogue of the kernel trick for distances rather than dot products, i.e., dissimilarities rather than similarities. This leads to the larger class of conditionally positive definite kernels. Those kernels are defined just like positive definite ones, with the one difference being that their Gram matrices need to satisfy (8) only subject to

    Σ_{i=1}^{n} c_i = 0.    (18)

Interestingly, it turns out that many kernel algorithms, including SVMs and kernel PCA (see Section 3), can be applied also with this larger class of kernels, due to their being translation invariant in feature space [Schölkopf and Smola, 2002, Hein et al., 2005].

We conclude this section with a note on terminology. In the early years of kernel machine learning research, it was not the notion of positive definite kernels that was being used. Instead, researchers considered kernels satisfying the conditions of Mercer's theorem [Mercer, 1909]; see, e.g., Vapnik [1998] and Cristianini and Shawe-Taylor [2000]. However, whilst all such kernels do satisfy (3), the converse is not true. Since (3) is what we are interested in, positive definite kernels are thus the right class of kernels to consider.

2.2.2 Properties of Positive Definite Kernels

We begin with some closure properties of the set of positive definite kernels.

Proposition 5 Below, k, k_1, k_2, … are arbitrary positive definite kernels on X × X, where X is a nonempty set.
(i) The set of positive definite kernels is a closed convex cone, i.e., (a) if α_1, α_2 ≥ 0, then α_1 k_1 + α_2 k_2 is positive definite; and (b) if k(x, x′) := lim_{n→∞} k_n(x, x′) exists for all x, x′, then k is positive definite.
(ii) The pointwise product k_1 k_2 is positive definite.
(iii) Assume that for i = 1, 2, k_i is a positive definite kernel on X_i × X_i, where X_i is a nonempty set. Then the tensor product k_1 ⊗ k_2 and the direct sum k_1 ⊕ k_2 are positive definite kernels on (X_1 × X_2) × (X_1 × X_2).
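Parts (i) and (ii) of Proposition 5 can be illustrated directly on Gram matrices: nonnegative combinations and pointwise (Hadamard) products of positive definite Gram matrices remain positive definite. A sketch with two standard kernels on hypothetical data:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(8, 2))

# pairwise squared distances via broadcasting
sq = lambda A: np.sum((A[:, None, :] - A[None, :, :]) ** 2, axis=-1)

K1 = np.exp(-sq(X) / 2)          # Gaussian Gram matrix
K2 = (X @ X.T + 1.0) ** 2        # inhomogeneous polynomial Gram matrix

# a matrix is positive definite in the sense of (8) iff its smallest
# eigenvalue is nonnegative (tolerance absorbs round-off)
psd = lambda K: np.linalg.eigvalsh(K).min() >= -1e-9

print(psd(0.5 * K1 + 2.0 * K2))  # (i)(a): nonnegative combination
print(psd(K1 * K2))              # (ii): pointwise (Schur) product
```

The pointwise product here is elementwise multiplication of the Gram matrices, which is exactly the Schur product appearing in the proof of (ii) below.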
The proofs can be found in Berg et al. [1984]. We only give a short proof of (ii). Denote by K, L the positive definite Gram matrices of k_1, k_2 with respect to some data set x_1, …, x_n. Being positive definite, we can express L as SS⊤ in terms of an n × n matrix S (e.g., the positive definite square root of L). We thus have

    L_{ij} = Σ_{m=1}^{n} S_{im} S_{jm},    (19)

and therefore, for any c_1, …, c_n ∈ R,

    Σ_{i,j=1}^{n} c_i c_j K_{ij} L_{ij} = Σ_{i,j=1}^{n} Σ_{m=1}^{n} c_i c_j K_{ij} S_{im} S_{jm} = Σ_{m=1}^{n} Σ_{i,j=1}^{n} (c_i S_{im})(c_j S_{jm}) K_{ij} ≥ 0,    (20)

where the inequality follows from (8). This is a result due to Schur (see, e.g., Berg and Forst [1975]); it also applies for (conditionally) positive definite kernels.

It is reassuring that sums and products of positive definite kernels are positive definite. We will now explain that, loosely speaking, there are no other operations that preserve positive definiteness. To this end, let C denote the set of all functions ψ : R → R that map positive definite kernels to (conditionally) positive definite kernels (readers who are not interested in the case of conditionally positive definite kernels may ignore the term in parentheses). We define

    C := {ψ | k is a positive definite kernel ⇒ ψ(k) is a (conditionally) positive definite kernel},
    C′ := {ψ | for any Hilbert space F, ψ(⟨x, x′⟩_F) is (conditionally) positive definite},
    C″ := {ψ | for all n ∈ N: K is a positive definite n × n matrix ⇒ ψ(K) is (conditionally) positive definite},

where ψ(K) is the n × n matrix with elements ψ(K_{ij}).

Proposition 6 C = C′ = C″.

Proof Let ψ ∈ C. For any Hilbert space F, ⟨x, x′⟩_F is a positive definite kernel, thus ψ(⟨x, x′⟩_F) is (conditionally) positive definite. Hence ψ ∈ C′.

Next, assume ψ ∈ C′. Consider, for n ∈ N, an arbitrary positive definite n × n matrix K. We can express K as the Gram matrix of some x_1, …, x_n ∈ R^n, i.e., K_{ij} = ⟨x_i, x_j⟩_{R^n}. Since ψ ∈ C′, we know that ψ(⟨x, x′⟩_{R^n}) is a (conditionally) positive definite kernel, hence in particular the matrix ψ(K) is (conditionally) positive definite, thus ψ ∈ C″.

Finally, assume ψ ∈ C″. Let k be a positive definite kernel, n ∈ N, c_1, …, c_n ∈ R, and x_1, …, x_n ∈ X. Then

    Σ_{ij} c_i c_j ψ(k(x_i, x_j)) = Σ_{ij} c_i c_j (ψ(K))_{ij} ≥ 0 (for Σ_i c_i = 0),    (21)

where the inequality follows from ψ ∈ C″. Therefore, ψ(k) is (conditionally) positive definite, and hence ψ ∈ C.

The following proposition follows from a result of FitzGerald et al. [1995] for (conditionally) positive definite matrices. We state the latter case.

Proposition 7 Let ψ : R → R. Then ψ(⟨x, x′⟩_F) is positive definite for any Hilbert space F if and only if ψ is real entire of the form

    ψ(t) = Σ_{n=0}^{∞} a_n t^n    (22)

with a_n ≥ 0 for n ≥ 0. Moreover, ψ(⟨x, x′⟩_F) is conditionally positive definite for any Hilbert space F if and only if ψ is real entire of the form (22) with a_n ≥ 0 for n ≥ 1.

There are further properties of k that can be read off the coefficients a_n:

• Steinwart [2002a] showed that if all a_n are strictly positive, then the kernel of Proposition 7 is universal on every compact subset S of R^d in the sense that its RKHS is dense in the space of continuous functions on S in the ‖·‖_∞ norm. For support vector machines using universal kernels, he then shows (universal) consistency [Steinwart, 2002b]. Examples of universal kernels are (23) and (24) below.

• In Lemma 13, we will show that the a_0 term does not affect an SVM. Hence we infer that it is actually sufficient for consistency to have a_n > 0 for n ≥ 1.

• Let I_even and I_odd denote the sets of even and odd indices i with the property that a_i > 0. Then, for any F, the kernel of Proposition 7 is strictly positive definite if and only if a_0 > 0 and neither I_even nor I_odd is finite [Pinkus, 2004].² Moreover, he states that the necessary and sufficient conditions for universality of the kernel are that in addition both Σ_{i ∈ I_even} 1/i and Σ_{i ∈ I_odd} 1/i diverge.

While the proofs of the above statements are fairly involved, one can make certain aspects plausible using simple arguments. For instance, if k is universal, its RKHS H must be infinite dimensional, since otherwise the maximal rank of the Gram matrix of k with respect to a data set in H would equal its dimensionality, and the RKHS could not lie dense in the space of continuous functions on X. If we assume that F is finite dimensional, this in turn implies that infinitely many of the a_i in (22) must be strictly positive: if only finitely many of the a_i were positive, we could construct the feature space H of ψ(⟨x, x′⟩) by taking finitely many tensor products and (scaled) direct sums of copies of F,³ and H would be finite dimensional.

Moreover, if k is universal, it is necessarily strictly positive definite. To see this, suppose k is not strictly positive definite. Then we know that for some c ≠ 0 we have Σ_{ij} c_i c_j k(x_i, x_j) = 0, implying that Σ_i c_i k(x_i, ·) = 0. For any f = Σ_j α_j k(x_j, ·), this implies that Σ_i c_i f(x_i) = ⟨Σ_i c_i k(x_i, ·), Σ_j α_j k(x_j, ·)⟩ = 0. This equality thus holds for any function in the RKHS. Therefore, the RKHS cannot lie dense in the space of continuous functions on X, and k is thus not universal.

We conclude the section with an example of a kernel which is positive definite by Proposition 7. To this end, let X be a dot product space. The power series expansion of ψ(x) = e^x then tells us that

    k(x, x′) = e^{⟨x, x′⟩/σ²}    (23)

² One might be tempted to think that given some strictly positive definite kernel k with feature map Φ and feature space H, we could set F to equal H and consider ψ(⟨Φ(x), Φ(x′)⟩_H). In this case, it would seem that choosing a_1 = 1 and a_n = 0 for n ≠ 1 should give us a strictly positive definite kernel which violates Pinkus' conditions. However, the latter kernel is only strictly positive definite on Φ(X), which is a weaker statement than it being strictly positive definite on all of H.
³ Recall that the dot product of F ⊗ F is the square of the original dot product, and the one of F ⊕ F is the sum of the original dot products.
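Proposition 7 can be probed numerically: any truncation of the power series of ψ(t) = e^t has nonnegative coefficients a_n = 1/n!, so applying it elementwise to a Gram matrix of dot products must again give a positive definite matrix, and likewise for the full exp kernel (23) with σ = 1. A sketch on hypothetical data:

```python
import numpy as np
from math import factorial

rng = np.random.default_rng(4)
X = rng.normal(size=(6, 3))
G = X @ X.T                    # Gram matrix of the plain dot product

# psi(t) = sum_{n <= N} t^n / n!  (all coefficients a_n >= 0),
# applied elementwise to the matrix G
psi = lambda T, N=8: sum(T ** n / factorial(n) for n in range(N + 1))

for K in (psi(G), np.exp(G)):  # truncated series and the exp kernel (23)
    print(np.linalg.eigvalsh(K).min() >= -1e-9)
```

This is exactly the mechanism of the Schur-product argument above: each elementwise power G^{∘n} is positive definite, and nonnegative combinations of them stay positive definite.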
. 2005] A positive deﬁnite function is strictly positive deﬁnite if the carrier of the measure in its representation (26) contains an open subset. Note. In that case. 8 . x ) = k(x. x )f (x)f (x ) = e− x−x 2 2 σ2 x 2 . One part of the theorem is easy to prove: if h takes the form (26).2. (26) Whilst normally formulated for complex valued functions. however. its Fourier transform may not be real.ω dµ(ω) ≥ 0. then the corresponding positive deﬁnite function is the Gaussian e− 2σ2 . there are conditions that ensure that the positive deﬁniteness becomes strict. e. Berg et al. in which case µ is a probability measure and h is its characteristic function. that if we start with an arbitrary nonnegative Borel measure. this leads to the positive deﬁniteness of the Gaussian kernel k (x.ω dµ(ω) = Rd i ai e −i xi .3 below. Bochner’s theorem allows us to interpret the similarity measure k(x.ω dµ(ω). then 2 σ2 ω 2 x 2 ai aj h(xi − xj ) = ij ij ai aj Rd e −i xi −xj . (24). i. the theorem also holds true for real functions. Further generalizations were given by Krein. (27) Bochner’s theorem generalizes earlier work of Mathias. the measure µ will thus determine which frequencies occur in the estimates.. We may normalize h such that h(0) = 1 (hence by (9) h(x) ≤ 1). An important generalization considers Abelian semigroups (e. (24) 2. The following characterization is due to Bochner [1933]. Theorem 8 A continuous function h on Rd is positive deﬁnite if and only if there exists a ﬁnite nonnegative Borel measure µ on Rd such that h(x) = Rd e−i x. A proof of the theorem can be found for instance in Rudin [1962]. For instance. If we further multiply k with the positive deﬁnite kernel f (x)f (x ). if µ is a normal distribution of the form (2π/σ 2 )−d/2 e− 2 dω.3 Properties of Positive Deﬁnite Functions We now let X = Rd and consider positive deﬁnite kernels of the form k(x. and has itself been generalized in various ways. 
Since the solutions of kernel algorithms will turn out to be ﬁnite kernel expansions. cf. for the cases of positive deﬁnite kernels and functions with a limited number of negative squares (see Stewart [1976] for further details and references). 1999].3. x ) = h(x − x ) in the frequency domain. by Schoenberg [1938]. it will determine their regularization properties — more on that in Section 2. [1984]). x ) = h(x − x ).e.g. where f (x) = e− 2 σ2 and σ > 0.g. As above. We state it in the form given by Wendland [2005]. Proposition 9 [Wendland. The choice of the measure µ determines which frequency components occur in the kernel. (25) in which case h is called a positive deﬁnite function. Real valued positive deﬁnite functions are distinguished by the fact that the corresponding measures µ are symmetric. the theorem provides an integral representation of positive deﬁnite functions in terms of the semigroup’s semicharacters.is positive deﬁnite [Haussler. see also Rudin [1962].
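Theorem 8 can be illustrated numerically for the Gaussian: sampling frequencies ω from the normal measure µ yields an explicit feature map whose Gram matrix approximates K and is positive semidefinite by construction. A minimal NumPy sketch (the points, bandwidth and sample sizes are arbitrary choices, not taken from the text):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, m, sigma = 3, 20, 50000, 1.5

# Input points and the Gaussian kernel k(x, x') = exp(-||x - x'||^2 / (2 sigma^2)), cf. (24).
X = rng.normal(size=(n, d))
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq / (2 * sigma ** 2))

# Theorem 8: with omega ~ N(0, sigma^{-2} I), E[exp(-i<x - x', omega>)] equals the
# Gaussian kernel.  The real feature map z(x) = [cos<x,w_j>, sin<x,w_j>]_j / sqrt(m)
# gives a Gram matrix Z Z^T that approximates K and is manifestly positive
# semidefinite.
W = rng.normal(scale=1.0 / sigma, size=(m, d))
P = X @ W.T
Z = np.hstack([np.cos(P), np.sin(P)]) / np.sqrt(m)
K_mc = Z @ Z.T

print(np.abs(K - K_mc).max())                 # small Monte Carlo error
print(np.linalg.eigvalsh(K).min() >= -1e-10)  # True: K is (numerically) PSD
```

The cos/sin features are simply the real form of e^{−i⟨x,ω⟩}; the approximation error decays as O(m^{−1/2}).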
2.2.4 Examples of Kernels

We have already seen several instances of positive definite kernels, and now intend to complete our selection with a few more examples. In particular, we discuss polynomial kernels, convolution kernels, ANOVA expansions, and kernels on documents.

Polynomial Kernels From Proposition 5 it is clear that homogeneous polynomial kernels k(x, x') = ⟨x, x'⟩^p are positive definite for p ∈ N and x, x' ∈ R^d. By direct calculation we can derive the corresponding feature map [Poggio, 1975]:

    ⟨x, x'⟩^p = (Σ_{j=1}^d [x]_j [x']_j)^p = Σ_{j ∈ [d]^p} [x]_{j_1} · ... · [x]_{j_p} · [x']_{j_1} · ... · [x']_{j_p} = ⟨C_p(x), C_p(x')⟩,    (28)

where C_p maps x ∈ R^d to the vector C_p(x) whose entries are all possible pth degree ordered products of the entries of x (note that [d] is used as a shorthand for {1, ..., d}). The polynomial kernel of degree p thus computes a dot product in the space spanned by all monomials of degree p in the input coordinates. Other useful kernels include the inhomogeneous polynomial,

    k(x, x') = (⟨x, x'⟩ + c)^p  where p ∈ N and c ≥ 0,    (29)

which computes all monomials up to degree p.

Spline Kernels It is possible to obtain spline functions as a result of kernel expansions [Smola, 1996, Vapnik et al., 1997] simply by noting that convolution of an even number of indicator functions yields a positive kernel function. Denote by I_X the indicator (or characteristic) function on the set X, and denote by ⊗ the convolution operation,

    (f ⊗ g)(x) := ∫_{R^d} f(x') g(x − x') dx'.

Then the B-spline kernels are given by

    k(x, x') = B_{2p+1}(x − x')  where p ∈ N, with B_{i+1} := B_i ⊗ B_0.    (30)

Here B_0 is the characteristic function on the unit ball⁴ in R^d. From the definition of (30) it is obvious that for odd m we may write B_m as an inner product between functions B_{m/2}. Note that for even m, B_m is not a kernel.

Radial Basis Functions An important special case of positive definite functions, which includes the Gaussian, are radial basis functions. These are functions that can be written as h(x) = g(‖x‖²) for some function g : [0, ∞) → R. They have the property of being invariant under the Euclidean group. If we would like to have the additional property of compact support, which is computationally attractive in a number of large scale applications, the following result becomes relevant.

Proposition 10 [Wendland, 2005] Assume that g : [0, ∞) → R is continuous and compactly supported. Then h(x) = g(‖x‖²) cannot be positive definite on every R^d.

4 Note that in R one typically uses I_{[−1/2, 1/2]}.
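The identity (28) is easy to verify numerically: expanding ⟨x, x'⟩^p into ordered degree-p products reproduces the kernel value exactly. A small sketch (the helper C is purely illustrative and has d^p entries, so it is only usable for tiny d and p):

```python
import itertools
import numpy as np

def C(x, p):
    # All ordered degree-p products [x]_{j1} * ... * [x]_{jp} with j in [d]^p, cf. (28).
    return np.array([np.prod([x[j] for j in idx])
                     for idx in itertools.product(range(len(x)), repeat=p)])

rng = np.random.default_rng(1)
x, xp = rng.normal(size=4), rng.normal(size=4)
p = 3

lhs = np.dot(x, xp) ** p          # homogeneous polynomial kernel
rhs = np.dot(C(x, p), C(xp, p))   # explicit dot product of monomial features
print(np.isclose(lhs, rhs))       # True
```

This is exactly why the kernel is useful: the left-hand side costs O(d), while the explicit feature space has dimension d^p.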
Convolutions and Structures Let us now move to kernels defined on structured objects [Haussler, 1999, Watkins, 2000]. Suppose the object x ∈ X is composed of parts x_p ∈ X_p, where p ∈ [P] (note that the sets X_p need not be equal). For instance, consider the string x = ATG with P = 2: it is composed of the parts x_1 = AT and x_2 = G, or alternatively of x_1 = A and x_2 = TG. Mathematically speaking, the set of "allowed" decompositions can be thought of as a relation R(x_1, ..., x_P, x), to be read as "x_1, ..., x_P constitute the composite object x."

Haussler [1999] investigated how to define a kernel between composite objects by building on similarity measures that assess their respective parts; in other words, kernels k_p defined on X_p × X_p. Define the R-convolution of k_1, ..., k_P as

    [k_1 ⋆ ... ⋆ k_P](x, x') := Σ_{x̄ ∈ R(x), x̄' ∈ R(x')} Π_{p=1}^P k_p(x̄_p, x̄'_p),    (31)

where the sum runs over all possible ways R(x) and R(x') in which we can decompose x into x̄_1, ..., x̄_P and x' analogously.⁵ If there is only a finite number of ways, the relation R is called finite. In this case, it can be shown that the R-convolution is a valid kernel [Haussler, 1999].

ANOVA Kernels Specific examples of convolution kernels are Gaussians and ANOVA kernels [Wahba, 1990, Vapnik, 1998]. To construct an ANOVA kernel, we consider X = S^N for some set S, and kernels k^(i) on S × S, where i = 1, ..., N. For P = 1, ..., N, the ANOVA kernel of order P is defined as

    k_P(x, x') := Σ_{1 ≤ i_1 < ... < i_P ≤ N} Π_{p=1}^P k^(i_p)(x_{i_p}, x'_{i_p}).    (32)

Note that if P = N, the sum consists only of the term for which (i_1, ..., i_P) = (1, ..., N), and k equals the tensor product k^(1) ⊗ ... ⊗ k^(N). At the other extreme, if P = 1, then the products collapse to one factor each, and k equals the direct sum k^(1) ⊕ ... ⊕ k^(N). For intermediate values of P, we get kernels that lie in between tensor products and direct sums.

ANOVA kernels typically use some moderate value of P, which specifies the order of the interactions between attributes x_{i_p} that we are interested in. The sum then runs over the numerous terms that take into account interactions of order P; fortunately, the computational cost can be reduced to O(Pd) by utilizing recurrent procedures for the kernel evaluation. Note that it is advisable to use a k^(n) which never or rarely takes the value zero, since a single zero term would eliminate the product in (32). ANOVA kernels have been shown to work rather well in multidimensional SV regression problems [Stitson et al., 1999].

Bag of Words One way in which SVMs have been used for text categorization [Joachims, 2002] is the bag-of-words representation. This maps a given text to a sparse vector, where each component corresponds to a word, and a component is set to one (or some other number) whenever the related word occurs in the text. Using an efficient sparse representation, the dot product between two such vectors can be computed quickly. Furthermore, this dot product is by construction a valid kernel, referred to as a sparse vector kernel. One of its shortcomings, however, is that it does not take into account the word ordering of a document. Other sparse vector kernels are also conceivable, such as one that maps a text to the set of pairs of words that are in the same sentence [Joachims, 1998, Watkins, 2000], or those which look only at pairs of words within a certain vicinity of each other [Sim, 2001].

Brownian Bridge and Related Kernels The Brownian bridge kernel min(x, x') likewise is a positive kernel. Note that one-dimensional linear spline kernels with an infinite number of nodes [Vapnik et al., 1997], for x, x' ∈ R_0^+, are given by

    k(x, x') = 1 + xx' + xx' min(x, x') − ((x + x')/2) min(x, x')² + min(x, x')³/3.    (33)

These kernels can be used as basis functions k^(n) in ANOVA expansions.

5 We use the convention that an empty sum equals zero; hence if either x or x' cannot be decomposed, then (k_1 ⋆ ... ⋆ k_P)(x, x') = 0.
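The recurrent O(P·N) evaluation of (32) mentioned above amounts to computing an elementary symmetric polynomial of the per-coordinate kernel values z_i = k^(i)(x_i, x'_i). A sketch in plain Python; the Gaussian base kernel and the data are arbitrary choices for illustration:

```python
import itertools
import math

def anova_kernel(x, xp, base_k, P):
    # O(P*N) evaluation of the ANOVA kernel (32) via the recursion
    # e_p <- e_p + z_i * e_{p-1}, run over the per-coordinate values z_i,
    # instead of summing over all C(N, P) index tuples.
    z = [base_k(a, b) for a, b in zip(x, xp)]
    e = [1.0] + [0.0] * P
    for zi in z:
        for p in range(P, 0, -1):   # downward, so e[p-1] is the previous state
            e[p] += zi * e[p - 1]
    return e[P]

def anova_brute(x, xp, base_k, P):
    # Direct sum over all 1 <= i1 < ... < iP <= N, for comparison.
    return sum(math.prod(base_k(x[i], xp[i]) for i in idx)
               for idx in itertools.combinations(range(len(x)), P))

k = lambda a, b: math.exp(-(a - b) ** 2)   # an assumed Gaussian base kernel
x, y = [0.1, 0.4, -0.2, 0.7, 1.0], [0.0, 0.5, 0.3, 0.6, -1.0]
print(abs(anova_kernel(x, y, k, 3) - anova_brute(x, y, k, 3)) < 1e-12)  # True
```

For P = 1 the routine returns Σ_i z_i (the direct sum), and for P = N it returns Π_i z_i (the tensor product), matching the two extremes discussed above.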
n-grams and Suffix Trees A more sophisticated way of dealing with string data was proposed by Haussler [1999], Watkins [2000]. The basic idea is as described above for general structured objects (31): compare the strings by means of the substrings they contain. The more substrings two strings have in common, the more similar they are. The substrings need not always be contiguous; that said, the further apart the first and last element of a substring are, the less weight should be given to the similarity.

Consider a finite alphabet Σ, the set Σ^n of all strings of length n, and the set of all finite strings Σ* := ∪_{n=0}^∞ Σ^n. The length of a string s ∈ Σ* is denoted by |s|, and its elements by s(1), ..., s(|s|); the concatenation of s and t ∈ Σ* is written st. Depending on the specific choice of a similarity measure, it is possible to define more or less efficient kernels which compute the dot product in the feature space spanned by all substrings of documents. Denote by

    k(x, x') = Σ_s #(x, s) #(x', s) c_s

a string kernel computed from exact matches, where #(x, s) is the number of occurrences of s in x and c_s ≥ 0. Vishwanathan and Smola [2004] provide an algorithm using suffix trees which allows one to compute, for arbitrary c_s, the value of the kernel k(x, x') in O(|x| + |x'|) time and memory. Moreover, also f(x) = ⟨w, Φ(x)⟩ can be computed in O(|x|) time if preprocessing linear in the size of the support vectors is carried out. These kernels have been applied to function prediction (according to the gene ontology) of proteins using only their sequence information. Another prominent application of string kernels is in the field of splice form prediction and gene finding [Rätsch et al., 2007].

Mismatch Kernels In the general case it is only possible to find algorithms whose complexity is proportional to the product of the lengths of the documents being compared, O(|x| · |x'|) or worse. For inexact matches of a limited degree, typically up to ε = 3 mismatches, and strings of bounded length, a similar data structure can be built by explicitly generating a dictionary of strings and their neighborhood in terms of a Hamming distance [Leslie et al., 2002b]. These kernels are defined by replacing #(x, s) by a mismatch function #(x, s, ε) which reports the number of approximate occurrences of s in x. By trading off computational complexity with storage (hence the restriction to small numbers of mismatches), essentially linear-time algorithms can be designed. Whether a general purpose algorithm exists which allows for efficient comparisons of strings with mismatches in linear time is still an open question. Tools from approximate string matching [Gusfield, 1997, Navarro and Raffinot, 1999, Cole and Hariharan, 2000] promise to be of help in this context.

We now describe such a kernel with a specific choice of weights [Watkins, 2000, Lodhi et al., 2002]. Let us form subsequences u of strings. Given an index sequence i := (i_1, ..., i_{|u|}) with 1 ≤ i_1 < ... < i_{|u|} ≤ |s|, we define u := s(i) := s(i_1) ... s(i_{|u|}). We call l(i) := i_{|u|} − i_1 + 1 the length of the subsequence in s. Note that if i is not contiguous, then l(i) > |u|.

The feature space built from strings of length n is defined to be H_n := R^(Σ^n). This notation means that the space has one dimension (or coordinate) for each element of Σ^n, labelled by that element (equivalently, we can think of it as the space of all real-valued functions on Σ^n). We can thus describe the feature map coordinate-wise for each u ∈ Σ^n via

    [Φ_n(s)]_u := Σ_{i : s(i) = u} λ^{l(i)}.    (34)

Here 0 < λ ≤ 1 is a decay parameter: the larger the length of the subsequence in s, i.e., the further apart its first and last element, the smaller the respective contribution to [Φ_n(s)]_u. The sum runs over all subsequences of s which equal u.
For instance, consider a dimension of H_3 spanned (that is, labelled) by the string asd. In this case we have [Φ_3(Nasdaq)]_asd = λ³, while [Φ_3(lass das)]_asd = 2λ⁵: in the first string, asd is a contiguous substring, while in the second it appears twice as a non-contiguous substring of length 5. The kernel induced by the map Φ_n takes the form

    k_n(s, t) = Σ_{u ∈ Σ^n} [Φ_n(s)]_u [Φ_n(t)]_u = Σ_{u ∈ Σ^n} Σ_{(i,j) : s(i) = t(j) = u} λ^{l(i)} λ^{l(j)}.    (35)

The string kernel k_n can be computed using dynamic programming; see Watkins [2000], Lodhi et al. [2002], and Durbin et al. [1998]. To describe the actual computation of k_n, define

    k'_i(s, t) := Σ_{u ∈ Σ^i} Σ_{(i,j) : s(i) = t(j) = u} λ^{|s| + |t| − i_1 − j_1 + 2}.    (36)

Using x ∈ Σ^1, we then have the following recursions, which allow the computation of k_n(s, t) for all n = 1, 2, ... (note that the kernels are symmetric):

    k'_0(s, t) = 1 for all s, t,
    k'_i(s, t) = 0 if min(|s|, |t|) < i,
    k_n(s, t) = 0 if min(|s|, |t|) < n,
    k'_i(sx, t) = λ k'_i(s, t) + Σ_{j : t_j = x} k'_{i−1}(s, t[1, ..., j−1]) λ^{|t| − j + 2},  for i = 1, ..., n − 1,
    k_n(sx, t) = k_n(s, t) + Σ_{j : t_j = x} k'_{n−1}(s, t[1, ..., j−1]) λ².    (37)

Tree Kernels We now discuss similarity measures on more structured objects. For trees, Collins and Duffy [2001] propose a decomposition method which maps a tree x into its set of subtrees. The kernel between two trees x, x' is then computed by taking a weighted sum of all terms between both trees. In particular, Collins and Duffy [2001] show a quadratic time algorithm, i.e., O(|x| · |x'|), to compute this expression, where |x| is the number of nodes of the tree. When restricting the sum to all proper rooted subtrees, it is possible to reduce the computational cost to O(|x| + |x'|) time by means of a tree-to-string conversion [Vishwanathan and Smola, 2004].

The above string, mismatch and tree kernels have been used in sequence analysis. This includes applications in document analysis and categorization, spam filtering, function prediction in proteins, annotations of DNA sequences for the detection of introns and exons, named entity tagging of documents, and the construction of parse trees.

Locality Improved Kernels It is possible to adjust kernels to the structure of spatial data. Recall the Gaussian RBF and polynomial kernels. When applied to an image, it makes no difference whether one uses as x the image or a version of x where all locations of the pixels have been permuted. This indicates that the function space on X induced by k does not take advantage of the locality properties of the data. By taking advantage of the local structure, estimates can be improved. On biological sequences [Zien et al., 2000] one may assign more weight to the entries of the sequence close to the location where estimates should occur.

For images, local interactions between image patches need to be considered. One way is to use the pyramidal kernel [Schölkopf, 1997, DeCoste and Schölkopf, 2002]. It takes inner products between corresponding image patches, then raises the latter to some power p_1, and finally raises their sum to another power p_2. While the overall degree of this kernel is p_1 p_2, the first factor p_1 only captures short range interactions.
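The recursions (36)-(37) can be implemented directly with memoization and checked against the defining feature map (34)-(35); a sketch in plain Python, with λ = 0.5 as an arbitrary decay choice:

```python
from functools import lru_cache
from itertools import combinations

LAM = 0.5  # decay parameter lambda in (34); an arbitrary choice

@lru_cache(maxsize=None)
def kp(i, s, t):
    # k'_i of (36), via the recursion in (37); indices are 0-based here.
    if i == 0:
        return 1.0
    if min(len(s), len(t)) < i:
        return 0.0
    x, sp = s[-1], s[:-1]
    return LAM * kp(i, sp, t) + sum(
        kp(i - 1, sp, t[:j]) * LAM ** (len(t) - j + 1)
        for j in range(len(t)) if t[j] == x)

@lru_cache(maxsize=None)
def kn(n, s, t):
    # The subsequence kernel (35), computed via the recursion (37).
    if min(len(s), len(t)) < n:
        return 0.0
    x, sp = s[-1], s[:-1]
    return kn(n, sp, t) + LAM ** 2 * sum(
        kp(n - 1, sp, t[:j]) for j in range(len(t)) if t[j] == x)

def kn_brute(n, s, t):
    # Direct evaluation of (34)/(35): enumerate all index sequences.
    def phi(s):
        f = {}
        for idx in combinations(range(len(s)), n):
            u = ''.join(s[i] for i in idx)
            f[u] = f.get(u, 0.0) + LAM ** (idx[-1] - idx[0] + 1)
        return f
    fs, ft = phi(s), phi(t)
    return sum(v * ft.get(u, 0.0) for u, v in fs.items())

print(abs(kn(3, 'Nasdaq', 'lass das') - kn_brute(3, 'Nasdaq', 'lass das')))
```

The dynamic program runs in O(n |s| |t|) once intermediate values are cached, whereas the brute-force sum over index tuples grows combinatorially.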
Graph Kernels Graphs pose a twofold challenge: one may both design a kernel on the vertices of a graph and a kernel between graphs. In the former case, the graph itself becomes the object defining the metric between the vertices; see Gärtner [2003] and Kashima et al. [2003] for details on the latter. In the following we discuss kernels on graphs.

Denote by W ∈ R^{n×n} the adjacency matrix of a graph, with W_ij > 0 if an edge between i and j exists. Moreover, assume for simplicity that the graph is undirected, that is, W^T = W (see Zhou et al. [2005] for extensions to directed graphs). Denote by L = D − W the graph Laplacian and by L̃ = 1 − D^{−1/2} W D^{−1/2} the normalized graph Laplacian, where D is a diagonal matrix with D_ii = Σ_j W_ij denoting the degree of vertex i.

Fiedler [1973] showed that the second largest eigenvector of L approximately decomposes the graph into two parts according to its sign; the other large eigenvectors partition the graph into correspondingly smaller portions. The role of L as a smoothness functional arises from the fact that for a function f defined on the vertices of the graph,

    Σ_{i,j} W_ij (f(i) − f(j))² = 2 f^T L f.

Finally, Smola and Kondor [2003] show that, under mild conditions and up to rescaling, L is the only quadratic permutation invariant form which can be obtained as a linear function of W.

Hence it is reasonable to consider kernel matrices K obtained from L (and L̃) which have desirable smoothness properties. Smola and Kondor [2003] suggest kernels K = r(L) or K = r(L̃), applied to the spectrum of the Laplacian, where r : [0, ∞) → [0, ∞) is a monotonically decreasing function. Popular choices include

    r(ξ) = exp(−λξ)      diffusion kernel    (38)
    r(ξ) = (ξ + λ)^{−1}  regularized graph Laplacian    (39)
    r(ξ) = (λ − ξ)^p     p-step random walk    (40)

where λ > 0 is chosen such as to reflect the amount of diffusion in (38), the degree of regularization in (39), or the weighting of steps within a random walk in (40), respectively. Eq. (38) was proposed by Kondor and Lafferty [2002]. In Section 2.3.3 we will discuss the connection between regularization operators and kernels in R^n. Without going into details, the function r(ξ) describes the smoothness properties on the graph, and L plays the role of the Laplace operator.

Kernels on Sets and Subspaces Whenever each observation x_i consists of a set of instances, we may use a range of methods to capture the specific properties of these sets (for an overview, see Vishwanathan et al. [2006]):

• Take the average of the elements of the set in feature space, that is, φ(x_i) = (1/n) Σ_j φ(x_ij) [Gärtner et al., 2002]. This yields good performance in the area of multi-instance learning.
• Jebara and Kondor [2003] extend the idea by dealing with distributions p_i(x) such that φ(x_i) = E[φ(x)], where x ∼ p_i(x). They apply it to image classification with missing pixels.
• Alternatively, one can study angles enclosed by subspaces spanned by the observations [Martin, 2000, Cock and Moor, 2002, Wolf and Shashua, 2003]. In a nutshell, if U, U' denote the orthogonal matrices spanning the subspaces of x and x' respectively, then k(x, x') = det U^T U'.
• Vishwanathan et al. [2006] extend this to arbitrary compound matrices (i.e., matrices composed of subdeterminants of matrices) and dynamical systems. Their result exploits the Binet-Cauchy theorem, which states that compound matrices are a representation of the matrix group, i.e., C_q(AB) = C_q(A) C_q(B). This leads to kernels of the form k(x, x') = tr C_q(AB). Note that for q = dim A we recover the determinant kernel of Wolf and Shashua [2003].
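To make the construction concrete, the following sketch builds the Laplacian of a small path graph and the diffusion kernel (38) via the spectral calculus K = r(L); it also checks the smoothness identity 2fᵀLf = Σ_ij W_ij (f(i) − f(j))² quoted above. The graph and λ = 1 are arbitrary choices:

```python
import numpy as np

# A small undirected graph (a path on 4 vertices) via its adjacency matrix W.
W = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
D = np.diag(W.sum(axis=1))
L = D - W                                  # graph Laplacian L = D - W

# Diffusion kernel (38): apply r(xi) = exp(-lambda * xi) to the spectrum of L.
lam = 1.0
evals, evecs = np.linalg.eigh(L)
K = evecs @ np.diag(np.exp(-lam * evals)) @ evecs.T

# Smoothness identity: sum_ij W_ij (f_i - f_j)^2 = 2 f^T L f.
f = np.array([0.3, -1.2, 0.5, 2.0])
quad = sum(W[i, j] * (f[i] - f[j]) ** 2 for i in range(4) for j in range(4))

print(np.allclose(K, K.T))                 # True: K is symmetric
print(np.linalg.eigvalsh(K).min() > 0)     # True: K is positive definite
print(np.isclose(2 * f @ L @ f, quad))     # True: the quadratic form identity
```

Since r maps the (nonnegative) Laplacian spectrum to positive values, any of the choices (38)-(40) with admissible λ yields a valid kernel matrix this way.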
yn . Section 5 for further details). n}. ·).the Fisher scores and the Fisher information matrix respectively. xj ). x ) = [φ(x) − ∂θ g(θ)] [φ(x ) − ∂θ g(θ)] . Theorem 11 (Representer Theorem) Denote by Ω : [0. . [2000] show that estimation using the normalized Fisher kernel corresponds to estimation subject to a regularization on the L2 (p(·θ)) norm. Note that for maximum likelihood estimators Ex [Uθ (x)] = 0 and therefore I is the covariance of Uθ (x). The representer theorem [Kimeldorf and Wahba. ∞) → R a strictly monotonic increasing function. Moreover. x). Then each minimizer f ∈ H of the regularized risk functional c ((x1 .. f (x1 )) . Herbrich [2002]. in our analysis of nonparametric exponential families we will encounter this fact several times (cf. f (xn ))) + Ω admits a representation of the form n f 2 H (45) f (x) = i=1 αi k(xi . 1971.1 for a more detailed discussion) where p(xθ) = exp( φ(x). o Sch¨ lkopf et al. x ) := Uθ (x)I −1 Uθ (x ) or k(x. xj ) + f⊥ (·). . x) + f⊥ (x). by X a set. note that the centering is immaterial. k(xj . The above overview of kernel design is by no means complete. Cox and O’Sullivan. · · · . in the context of exponential families (see Section 5. . Moreover. ·) f (xj ) (for all j ∈ [n]) as n H = 0 for all i ∈ [n] := {1. k(xn . (48) 14 . [2004]. y1 . 2001]. ·) = i=1 αi k(xi . By (16) we may write n f (xj ) = f (·). [2007] for further examples and o details. 1990] shows that solutions of a large class of optimization problems can be expressed as kernel expansions over the sample points. Joachims [2002]. . (47) Here αi ∈ R and f⊥ ∈ H with f⊥ . x ) := Uθ (x)Uθ (x ) (43) depending on whether whether we study the normalized or the unnormalized kernel respectively. Sch¨ lkopf and Smola [2002]. This is not a coincidence. In fact. As o above. k(xi . The Fisher kernel is deﬁned as k(x. The reader is referred to books of Cristianini and ShaweTaylor [2000]. 
This means that up to centering by ∂θ g(θ) the Fisher kernel is identical to the kernel arising from the inner product of the sufﬁcient statistics φ(x). (46) Proof We decompose any f ∈ H into a part contained in the span of the kernel functions k(x1 .1 The Representer Theorem From kernels. ShaweTaylor and Cristianini [2004]. H is the RKHS associated to the kernel k. . We present a slightly more general version of the theorem with a simple proof [Sch¨ lkopf et al. we now move to functions that can be expressed in terms of kernel expansions. and by c : (X × R2 )n → R ∪ {∞} an arbitrary loss function. and one in the orthogonal complement. ·). ·) H = i=1 αi k(xi . . . k(xj .3. . it has several attractive theoretical properties: Oliver et al. n f (x) = f (x) + f⊥ (x) = i=1 αi k(xi . as can be seen in Lemma 13. θ − g(θ)) we have k(x. 2. In addition to that. (xn . Bakir et al. (44) for the unnormalized Fisher kernel.3 Kernel Function Classes 2.
.. . The signiﬁcance of the Representer Theorem is that although we might be trying to solve an optimization problem in an inﬁnitedimensional space H. . Monotonicity of Ω does not prevent the regularized risk functional (45) from having multiple local minima. (50) can be expressed with small RKHSerror in the subspace spanned by k(z1 . xn are distinct. . . zm . 2. and that does admit the expansion. . zp ). Since this also has to hold for the solution. ·). If we discard the strictness of the monotonicity. zm such that the function (cf. (49) = Ω i αi k(xi .Second. k(zm . . containing linear combinations of kernels centered on arbitrary points of X. two classes of methods have been pursued. The most natural such measure is the RKHS norm. it states that the solution lies in the span of n particular kernels — those centered on the training points. we thus need to express or at least approximate (46) using a smaller number of terms. . then it no longer follows that each minimizer of the regularized risk admits an expansion (46).2 Reduced Set Methods Despite the ﬁniteness of the representation in (46) it can often be the case that the number of terms in the expansion is too large in practice. . Clearly. 1997. 1998]. . ·) i H Thus for any ﬁxed αi ∈ R the risk functional (45) is minimized for f⊥ = 0. Exact expressions in a reduced number of basis points are often not possible. 1996] m g= p=1 βp k(zp . it still follows. To ensure a global minimum. however. We will encounter (46) again further below. Ω( f 2 H) n 2 + f⊥ 2 H ≥ Ω n 2 . (46)) n f= i=1 αi k(xi . This can be problematic in practical applications. . . ·). xn . . then we can compute an mterm approximation of f by projection onto that subspace. many of the αi often equal 0. In the literature. . From a practical point of view. For the purpose of approximation. that there is always another solution that is as good. .q=1 βp βq k(zp . For suitable choices of loss functions. if we were given a set of points z1 . 
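The orthogonality argument just used can be made concrete in the simplest case, the linear kernel on R^d: there H = R^d, k(x_i, ·) corresponds to x_i itself, and the span of the kernel functions is the row space of the data matrix. A NumPy sketch (dimensions arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 5, 8                      # fewer points than feature dimensions
X = rng.normal(size=(n, d))      # rows x_i; linear kernel k(x, x') = <x, x'>
w = rng.normal(size=d)           # an arbitrary function f <-> weight vector w

# Mirror (47): decompose w into a part in span{x_1, ..., x_n} plus a remainder
# orthogonal to every x_i, via least squares on X^T alpha ~ w.
alpha, *_ = np.linalg.lstsq(X.T, w, rcond=None)
w_par = X.T @ alpha
w_perp = w - w_par

print(np.allclose(X @ w_perp, 0))                  # f_perp vanishes at the data, cf. (48)
print(np.allclose(X @ w, X @ w_par))               # identical predictions on the sample
print(np.linalg.norm(w_par) <= np.linalg.norm(w))  # dropping f_perp lowers the norm, cf. (49)
```

Since the loss only sees f at the sample points and the regularizer only decreases when the orthogonal part is removed, the minimizer can be taken in the span, which is the content of (46).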
the theorem holds.

Monotonicity of Ω does not prevent the regularized risk functional (45) from having multiple local minima. To ensure a global minimum, we would need to require convexity. If we discard the strictness of the monotonicity, then it no longer follows that each minimizer of the regularized risk admits an expansion (46); it still follows, however, that there is always another solution that is as good, and that does admit the expansion.

The significance of the Representer Theorem is that although we might be trying to solve an optimization problem in an infinite-dimensional space H, containing linear combinations of kernels centered on arbitrary points of X, it states that the solution lies in the span of n particular kernels, namely those centered on the training points. We will encounter (46) again further below, where it is called the Support Vector expansion. For suitable choices of loss functions, many of the α_i often equal 0.

2.3.2 Reduced Set Methods

Despite the finiteness of the representation in (46), it can often be the case that the number of terms in the expansion is too large in practice. This can be problematic in practical applications, since the time required to evaluate (46) is proportional to the number of terms. To deal with this issue, we thus need to express, or at least approximate, (46) using a smaller number of terms. Exact expressions in a reduced number of basis points are often not possible; e.g., if the kernel is strictly positive definite and the data points x_1, ..., x_n are distinct, then no exact reduction is possible. For the purpose of approximation, we need to decide on a discrepancy measure that we want to minimize. The most natural such measure is the RKHS norm. From a practical point of view, this norm has the advantage that it can be computed using evaluations of the kernel function, which opens the possibility for relatively efficient approximation algorithms. Clearly, if we were given a set of points z_1, ..., z_m such that the function (cf. (46))

    f = Σ_{i=1}^n α_i k(x_i, ·)    (50)

can be expressed with small RKHS-error in the subspace spanned by k(z_1, ·), ..., k(z_m, ·), then we could compute an m-term approximation of f by projection onto that subspace.

In the literature, two classes of methods have been pursued. In the first, it is attempted to choose the points z_i as a subset of some larger set x_1, ..., x_n, which is usually the training set [Schölkopf, 1997, Frieß and Harrison, 1998]. In the second, we compute the difference in the RKHS between (50) and the reduced set expansion [Burges, 1996]

    g = Σ_{p=1}^m β_p k(z_p, ·),    (51)

leading to

    ‖f − g‖² = Σ_{i,j=1}^n α_i α_j k(x_i, x_j) + Σ_{p,q=1}^m β_p β_q k(z_p, z_q) − 2 Σ_{i=1}^n Σ_{p=1}^m α_i β_p k(x_i, z_p).    (52)

To obtain the z_1, ..., z_m and β_1, ..., β_m, we can minimize this objective function, which for most kernels will be a nonconvex optimization problem. Specialized methods for specific kernels exist, as well as iterative methods which take into account the fact that, given the z_1, ..., z_m, the coefficients β_1, ..., β_m can be computed in closed form. See Schölkopf and Smola [2002] for further details. A number of articles discuss how to best choose the expansion points z_i for the purpose of processing a given dataset, e.g., Smola and Schölkopf [2000], Williams and Seeger [2000].
Once we have constructed or selected a set of expansion points, we are working in a finite dimensional subspace of H spanned by these points. It is known that each closed subspace of an RKHS is itself an RKHS. Denote its kernel by l. Then the projection P onto the subspace is given by (e.g., Meschkowski [1962])

    (Pf)(x') = ⟨f, l(x', ·)⟩_H.    (53)

If the subspace is spanned by a finite set of linearly independent kernels k(·, z_1), ..., k(·, z_m), the projection onto the m-dimensional subspace can be written

    P = Σ_{i,j=1}^m (K^{−1})_{ij} k(·, z_i) ⟨k(·, z_j), ·⟩,    (54)

where K is the Gram matrix of k with respect to z_1, ..., z_m, and ⟨k(·, z_j), ·⟩ denotes the linear form mapping k(·, x) to k(z_j, x) for x ∈ X.⁷ The kernel of the subspace takes the form

    k_m(x, x') = (k(x, z_1), ..., k(x, z_m)) K^{−1} (k(x', z_1), ..., k(x', z_m))^T.    (55)

This can be seen by noting that (i) the k_m(·, x), x ∈ X, span an m-dimensional subspace, and (ii) we have k_m(z_i, z_j) = k(z_i, z_j) for i, j = 1, ..., m. The kernel of the subspace's orthogonal complement, on the other hand, is k − k_m. If we prefer to work with coordinates and the Euclidean dot product, we can use the kernel PCA map [Schölkopf and Smola, 2002]

    Φ_m : X → R^m,  x ↦ K^{−1/2} (k(x, z_1), ..., k(x, z_m))^T,    (56)

which directly projects into an orthonormal basis of the subspace.

7 In case K does not have full rank, projections onto the span {k(·, z_1), ..., k(·, z_m)} are achieved by using the pseudoinverse of K instead.

2.3.3 Regularization Properties

The regularizer ‖f‖²_H used in Theorem 11, which is what distinguishes SVMs from many other regularized function estimators (e.g., those based on coefficient-based L1 regularizers, such as the Lasso [Tibshirani, 1996] or linear programming machines [Schölkopf and Smola, 2002]), stems from the dot product ⟨f, f⟩_k in the RKHS H associated with a positive definite kernel. The nature and implications of this regularizer, however, are not obvious, and we shall now provide an analysis in the Fourier domain. It turns out that if the kernel is translation invariant, then its Fourier transform allows us to characterize how the different frequency components of f contribute to the value of ‖f‖²_H. Our exposition will be informal (see also Poggio and Girosi [1990], Girosi et al. [1993], Smola et al. [1998]), and we will implicitly assume that all integrals are over R^d and exist, and that the operators are well defined.

We will rewrite the RKHS dot product as

    ⟨f, g⟩_k = ⟨Υf, Υg⟩ = ⟨Υ²f, g⟩,    (57)

where Υ is a positive (and thus symmetric) operator mapping H into a function space endowed with the usual dot product

    ⟨f, g⟩ = ∫ f(x) g(x) dx.    (58)

Rather than (57), we consider the equivalent condition (cf. Section 2.2.1)

    ⟨k(x, ·), k(x', ·)⟩_k = ⟨Υk(x, ·), Υk(x', ·)⟩ = ⟨Υ²k(x, ·), k(x', ·)⟩.    (59)

If k(x, ·) is a Green function of Υ², we have

    ⟨Υ²k(x, ·), k(x', ·)⟩ = ⟨δ_x, k(x', ·)⟩ = k(x, x'),    (60)

which by the reproducing property (16) amounts to the desired equality (59).

We now consider the particular case where the kernel can be written k(x, x') = h(x − x') with a continuous strictly positive definite function h ∈ L1(R^d) (cf. Section 2.2.3). A variation of Bochner's theorem, stated by Wendland [2005], then tells us that the measure corresponding to h has a nonvanishing density υ with respect to the Lebesgue measure, i.e., that k can be written as

    k(x, x') = ∫ e^{−i⟨x−x', ω⟩} υ(ω) dω = ∫ e^{−i⟨x, ω⟩} e^{i⟨x', ω⟩} υ(ω) dω.    (61)

We would like to rewrite this as ⟨Υk(x, ·), Υk(x', ·)⟩ for some linear operator Υ. It turns out that a multiplication operator in the Fourier domain will do the job. To this end, recall the d-dimensional Fourier transform, given by

    F[f](ω) := (2π)^{−d/2} ∫ f(x) e^{−i⟨x, ω⟩} dx,    (62)

with the inverse

    F^{−1}[f](x) = (2π)^{−d/2} ∫ f(ω) e^{i⟨x, ω⟩} dω.    (63)

Next, we compute the Fourier transform of k(x, ·) as

    F[k(x, ·)](ω) = (2π)^{−d/2} ∫ (∫ υ(ω') e^{−i⟨x, ω'⟩} e^{i⟨x'', ω'⟩} dω') e^{−i⟨x'', ω⟩} dx''    (64)
                  = (2π)^{d/2} υ(ω) e^{−i⟨x, ω⟩}.    (65)

Hence we can rewrite (61) as

    k(x, x') = (2π)^{−d} ∫ F[k(x, ·)](ω) F[k(x', ·)](ω)^* υ(ω)^{−1} dω.    (66)

If our regularization operator maps

    Υ : f ↦ (2π)^{−d/2} υ^{−1/2} F[f],    (67)

we thus have

    k(x, x') = ∫ (Υk(x, ·))(ω) (Υk(x', ·))(ω)^* dω,    (68)

i.e., our desired identity (59) holds true.⁸

As required in (57), we can thus interpret the dot product ⟨f, g⟩_k in the RKHS as a dot product ∫ (Υf)(ω) (Υg)(ω)^* dω. This allows us to understand regularization properties of k in terms of its (scaled) Fourier transform υ(ω). Small values of υ(ω) amplify the corresponding frequencies in (67); penalizing ⟨f, f⟩_k thus amounts to a strong attenuation of those frequencies. Hence small values of υ(ω) for large ‖ω‖ are desirable, since high frequency components of F[f] correspond to rapid changes in f. It follows that υ(ω) describes the filter properties of the corresponding regularization operator Υ. In view of our comments following Theorem 8, we can translate this insight into probabilistic terms: if the probability measure υ(ω) dω / ∫ υ(ω) dω describes the desired filter properties, then the natural translation invariant kernel to use is the characteristic function of the measure.

8 For conditionally positive definite kernels, a similar correspondence can be established, with a regularization operator whose null space is spanned by a set of functions which are not regularized; in the case (18), which is sometimes called conditionally positive definite of order 1, these are the constants.
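A discrete analogue of the correspondence (57)-(68) can be checked on a periodic grid: define a translation invariant kernel through an assumed spectrum υ(ω) (here Gaussian-shaped), so that the Gram matrix is circulant, and compare the RKHS norm cᵀKc of f = Σ_j c_j k(·, x_j) with the Fourier-side expression Σ_ω |F[f](ω)|²/υ(ω). A NumPy sketch:

```python
import numpy as np

N = 128
omega = np.fft.fftfreq(N, d=1.0 / N)          # integer frequencies on the grid
upsilon = np.exp(-0.5 * (omega / 10.0) ** 2)  # assumed Gaussian-type spectrum

# Circulant Gram matrix K whose eigenvalues in the Fourier basis are upsilon:
# the discrete stand-in for a translation invariant kernel k(x, x') = h(x - x').
k_row = np.real(np.fft.ifft(upsilon))
K = np.array([np.roll(k_row, i) for i in range(N)])

rng = np.random.default_rng(5)
c = rng.normal(size=N)
f = K @ c                                     # f = sum_j c_j k(., x_j) on the grid

fhat = np.fft.fft(f) / np.sqrt(N)             # unitary DFT of f
norm_freq = np.sum(np.abs(fhat) ** 2 / upsilon)
norm_rkhs = c @ K @ c                         # ||f||_H^2 for a kernel expansion
print(np.isclose(norm_freq, norm_rkhs))       # True
```

Dividing |F[f](ω)|² by υ(ω) makes high frequencies, where υ is small, expensive, which is exactly the attenuation described above.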
for some ﬁxed function g. [1992]. Hence small values of υ(ω) for large ω are desirable. [1984]. we can translate this insight into probabilistic terms: if the probability measure υ(ω)dω describes the desired ﬁlter properties. If k satisﬁes the condition (69) subject to the constraint b that a f (x)g(x)dx = 0. Sch¨ lkopf et al. In probability theory.g. our desired identity (59) holds true. One of them was the study of positive deﬁnite functions. Boser et al. [1964] o used kernels as a tool in a convergence proof. Whilst all these uses were limited to kernels deﬁned on vectorial data. ·)f (x)dx are then positive. the generalized portrait [Vapnik and Lerner.4 Remarks and Notes The notion of kernels as dot products in Hilbert spaces was brought to the ﬁeld of machine learning by Aizerman et al. In functional analysis.. he shows that k has at most one negative eigenvalue. see e. positive deﬁnite kernels have also been studied in depth since they arise as covariance kernels of stochastic processes. 2006]. then the natural translation invariant υ(ω)dω kernel to use is the characteristic function of the measure. For that case. a good reference is Berg et al. In addition to the above uses of positive deﬁnite kernels in machine learning. Vapnik. f k thus amounts to a strong attenuation of the corresponding frequencies. cf. Penalizing f. the hard margin predecessor of the Support Vector Machine. and pursued for instance by Schoenberg [1938]. Small values of υ(ω) amplify the corresponding frequencies in (67). Vapnik [1998]. then this notion is closely related to the one of conditionally positive deﬁnite kernels. [1964]. Watkins [2000]. Mercer [1909]. Sch¨ lkopf et al. [1998] applied the kernel trick to generalize principal o component analysis and pointed out the (in retrospect obvious) fact that any algorithm which only uses the data via dot products can be generalized using kernels. [1984]. Aizerman et al. As required in (57). 
and partly earlier development in the ﬁeld of statistics. 1963. it seems that for a fairly long time. Hilbert calls it relativ deﬁnit.
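The filter interpretation above can be checked numerically. The sketch below is our own illustration (the Gaussian kernel with unit bandwidth, the grid, and the FFT-based estimate of υ(ω) are assumptions, not part of the original development): it samples a translation-invariant kernel and inspects its Fourier transform, which decays with |ω|, i.e., high-frequency components of f are strongly penalized.

```python
import numpy as np

# Sample a translation-invariant kernel k(x - x') on a grid and estimate its
# Fourier transform v(omega) ("upsilon"): the regularizer's filter function.
x = np.linspace(-20, 20, 4001)
dx = x[1] - x[0]
k = np.exp(-0.5 * x**2)                            # Gaussian kernel, unit bandwidth

v = np.abs(np.fft.fftshift(np.fft.fft(k))) * dx    # magnitude spectrum ~ v(omega)
w = 2 * np.pi * np.fft.fftshift(np.fft.fftfreq(x.size, d=dx))

for target in (0.0, 2.0, 5.0):                     # v decays as |omega| grows
    i = int(np.argmin(np.abs(w - target)))
    print("omega = %.0f  ->  upsilon ~ %.2e" % (target, v[i]))
```

Since υ(ω) is small for large |ω|, penalizing ⟨f, f⟩_k attenuates exactly the rapidly varying components of f, as described above.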
3 Convex Programming Methods for Estimation

As we saw, kernels can be used both for the purpose of describing nonlinear functions subject to smoothness constraints and for the purpose of computing inner products in some feature space efficiently. In this section we focus on the latter and how it allows us to design methods of estimation based on the geometry of the problems at hand.

Unless stated otherwise, E[·] denotes the expectation with respect to all random variables of the argument. Subscripts, such as E_X[·], indicate that the expectation is taken over X. We will omit them wherever obvious. Finally we will refer to Eemp[·] as the empirical average with respect to an n-sample.

Given a sample S := {(x1, y1), ..., (xn, yn)} ⊆ X × Y we now aim at finding an affine function f(x) = ⟨w, φ(x)⟩ + b, or in some cases a function f(x, y) = ⟨φ(x, y), w⟩, such that the empirical risk on S is minimized. In the binary classification case this means that we want to maximize the agreement between sgn f(x) and y.

• Minimization of the empirical risk with respect to (w, b) is NP-hard [Minsky and Papert, 1969]. In fact, Ben-David et al. [2003] show that even approximately minimizing the empirical risk is NP-hard, not only for linear function classes but also for spheres and other simple geometrical objects. This means that even if the statistical challenges could be solved, we still would be confronted with a formidable algorithmic problem.

• The indicator function {yf(x) < 0} is discontinuous and even small changes in f may lead to large changes in both empirical and expected risk. Properties of such functions can be captured by the VC-dimension [Vapnik and Chervonenkis, 1971], that is, the maximum number of observations which can be labeled in an arbitrary fashion by functions of the class. Necessary and sufficient conditions for estimation can be stated in these terms [Vapnik and Chervonenkis, 1991]. However, much tighter bounds can be obtained by also using the scale of the class [Alon et al., 1993, Bartlett et al., 1996, Williamson et al., 2001]. In fact, there exist function classes parameterized by a single scalar which have infinite VC-dimension [Vapnik, 1995].

Given the difficulty arising from minimizing the empirical risk, we now discuss algorithms which minimize an upper bound on the empirical risk, while providing good computational properties and consistency of the estimators. The statistical analysis is relegated to Section 4.

3.1 Support Vector Classification

Assume that S is linearly separable, i.e., there exists a linear function f(x) such that sgn yf(x) = 1 on S. In this case, the task of finding a large margin separating hyperplane can be viewed as one of solving [Vapnik and Lerner, 1963]

minimize_{w,b} ½‖w‖²  subject to  yi(⟨w, xi⟩ + b) ≥ 1.   (70)

Note that ‖w‖⁻¹ f(xi) is the distance of the point xi to the hyperplane H(w, b) := {x | ⟨w, x⟩ + b = 0}. The condition yi f(xi) ≥ 1 implies that the margin of separation is at least 2‖w‖⁻¹. The bound becomes exact if equality is attained for some yi = 1 and yj = −1. Consequently minimizing ‖w‖ subject to these constraints maximizes the margin of separation. Eq. (70) is a quadratic program which can be solved efficiently [Luenberger, 1984, Fletcher, 1989, Boyd and Vandenberghe, 2004].

Mangasarian [1965] devised a similar optimization scheme using ‖w‖₁ instead of ‖w‖₂ in the objective function of (70). The result is a linear program. In general, one can show [Smola et al., 2000] that minimizing the ℓp norm of w leads to the maximizing of the margin of separation in the ℓq norm where 1/p + 1/q = 1. The ℓ1 norm leads to sparse approximation schemes (see also Chen et al. [1999]), whereas the ℓ2 norm can be extended to Hilbert spaces and kernels.
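As a concrete illustration of the primal problem (70), it can be handed directly to a generic constrained solver. The sketch below is our own (the toy data and the choice of SciPy's SLSQP routine are assumptions, not the solvers cited in the text):

```python
import numpy as np
from scipy.optimize import minimize

# Hard-margin SVM, primal form (70): minimize 0.5 * ||w||^2
# subject to y_i * (<w, x_i> + b) >= 1, on a tiny separable toy set.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

def objective(wb):
    w = wb[:2]
    return 0.5 * np.dot(w, w)          # b (wb[2]) is unpenalized

constraints = [{"type": "ineq",
                "fun": lambda wb, i=i: y[i] * (X[i] @ wb[:2] + wb[2]) - 1.0}
               for i in range(len(y))]

res = minimize(objective, x0=np.zeros(3), method="SLSQP", constraints=constraints)
w, b = res.x[:2], res.x[2]
print("w =", w, "b =", b, "min y_i f(x_i) =", (y * (X @ w + b)).min())
```

At the optimum, the constraints hold with equality for the closest points of either class, so the margin of separation is exactly 2/‖w‖, as noted above.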
To deal with nonseparable problems, we need to relax the constraints of the optimization problem, i.e., cases when (70) is infeasible. Bennett and Mangasarian [1992] and Cortes and Vapnik [1995] impose a linear penalty on the violation of the large-margin constraints to obtain:

minimize_{w,b,ξ} ½‖w‖² + C Σᵢ ξi  s.t.  yi(⟨w, xi⟩ + b) ≥ 1 − ξi and ξi ≥ 0, ∀i ∈ [n].   (71)

Eq. (71) is a quadratic program which is always feasible (e.g., w = 0, b = 0 and ξi = 1 satisfy the constraints). C > 0 is a regularization constant trading off the violation of the constraints vs. maximizing the overall margin. Note that Σᵢ ξi is an upper bound on the empirical risk, as yi f(xi) ≤ 0 implies ξi ≥ 1 (see also Lemma 12).

Whenever the dimensionality of X exceeds n, direct optimization of (71) is computationally inefficient. This is particularly true if we map from X into an RKHS. To address these problems one may solve the problem in dual space as follows. The Lagrange function of (71) is given by

L(w, b, ξ, α, η) = ½‖w‖² + C Σᵢ ξi + Σᵢ αi(1 − ξi − yi(⟨w, xi⟩ + b)) − Σᵢ ηi ξi   (72)

where αi, ηi ≥ 0 for all i ∈ [n]. To compute the dual of L we need to identify the first order conditions in w, b, ξ. They are given by

∂w L = w − Σᵢ αi yi xi = 0,  ∂b L = −Σᵢ αi yi = 0,  and  ∂ξi L = C − αi − ηi = 0.   (73)

This translates into w = Σᵢ αi yi xi, the linear constraint Σᵢ αi yi = 0, and the box-constraint αi ∈ [0, C] arising from ηi ≥ 0. Substituting (73) into L yields the Wolfe [1961] dual

minimize_α ½ αᵀQα − αᵀ1  s.t.  αᵀy = 0 and αi ∈ [0, C], ∀i ∈ [n].   (74)

Q ∈ R^{n×n} is the matrix of inner products Qij := yi yj ⟨xi, xj⟩. Clearly this can be extended to feature maps and kernels easily via Kij := yi yj ⟨Φ(xi), Φ(xj)⟩ = yi yj k(xi, xj). Note that w lies in the span of the xi. This is an instance of the Representer Theorem (Theorem 11). The KKT conditions [Karush, 1939, Kuhn and Tucker, 1951, Boser et al., 1992, Cortes and Vapnik, 1995] require that at optimality αi(yi f(xi) − 1) = 0. This means that only those xi may appear in the expansion (73) for which yi f(xi) ≤ 1, as otherwise αi = 0. The xi with αi > 0 are commonly referred to as support vectors.

The number of misclassified points xi itself depends on the configuration of the data and the value of C. The result of Ben-David et al. [2003] suggests that finding even an approximate minimum classification error solution is difficult. That said, it is possible to modify (71) such that a desired target number of observations violates yi f(xi) ≥ ρ for some ρ ∈ R by making the threshold itself a variable of the optimization problem [Schölkopf et al., 2000]. This leads to the following optimization problem (ν-SV classification):

minimize_{w,b,ξ} ½‖w‖² + Σᵢ ξi − nνρ  subject to  yi(⟨w, xi⟩ + b) ≥ ρ − ξi and ξi ≥ 0.   (75)

The dual of (75) is essentially identical to (74) with the exception of an additional constraint:

minimize_α ½ αᵀQα  subject to  αᵀy = 0 and αᵀ1 = nν and αi ∈ [0, 1].   (76)

One can show that for every C there exists a ν such that the solution of (76) is a multiple of the solution of (74). Schölkopf et al. [2000] prove that solving (76) for which ρ > 0 satisfies:
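The dual (74) can be minimized one coordinate at a time. The sketch below is our own illustration (the toy data, the bias-free variant in which a constant feature absorbs b so that the equality constraint αᵀy = 0 drops out, and all constants are assumptions, not part of the text); it also exhibits the sparsity of the α discussed above:

```python
import numpy as np

# Dual coordinate descent for the soft-margin SVM dual (74), bias-free variant.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(1.5, 0.5, (20, 2)), rng.normal(-1.5, 0.5, (20, 2))])
X = np.hstack([X, np.ones((40, 1))])           # constant feature absorbs b
y = np.array([1.0] * 20 + [-1.0] * 20)
C = 1.0

alpha = np.zeros(len(y))
w = np.zeros(X.shape[1])
sq_norms = (X ** 2).sum(axis=1)                # Q_ii = ||x_i||^2
for _ in range(50):                            # passes over the data
    for i in range(len(y)):
        grad = y[i] * (w @ X[i]) - 1.0         # gradient of the dual in alpha_i
        new = np.clip(alpha[i] - grad / sq_norms[i], 0.0, C)
        w += (new - alpha[i]) * y[i] * X[i]    # maintain w = sum_j alpha_j y_j x_j
        alpha[i] = new

train_acc = np.mean(np.sign(X @ w) == y)
n_sv = int((alpha > 1e-8).sum())               # support vectors: alpha_i > 0
print("training accuracy:", train_acc, "| support vectors:", n_sv, "of", len(y))
```

As the KKT conditions predict, only points with yi f(xi) ≤ 1 end up with αi > 0, so the expansion of w uses a small subset of the sample.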
1. ν is an upper bound on the fraction of margin errors.
2. ν is a lower bound on the fraction of SVs.

Moreover, under mild conditions, with probability 1, asymptotically, ν equals both the fraction of SVs and the fraction of errors. This statement implies that whenever the data are sufficiently well separable (that is, ρ > 0), ν-SV classification finds a solution with a fraction of at most ν margin errors.

3.2 Estimating the Support of a Density

We now extend the notion of linear separation to that of estimating the support of a density [Schölkopf et al., 2001, Tax and Duin, 1999]. Denote by X = {x1, ..., xn} ⊆ X the sample drawn from P(x). Let C be a class of measurable subsets of X and let λ be a real-valued function defined on C. The quantile function [Einmal and Mason, 1992] with respect to (P, λ, C) is defined as

U(µ) = inf{λ(C) | P(C) ≥ µ, C ∈ C}  where µ ∈ (0, 1].   (77)

We denote by C_λ(µ) and C^m_λ(µ) the (not necessarily unique) C ∈ C that attain the infimum (when it is achievable) on P(x) and on the empirical measure given by X, respectively. A common choice of λ is Lebesgue measure, in which case C_λ(µ) is the minimum volume set C ∈ C that contains at least a fraction µ of the probability mass.

Support estimation requires us to find some C^m_λ(µ) such that |P(C^m_λ(µ)) − µ| is small. This is where the complexity tradeoff enters: on the one hand, we want to use a rich class C to capture all possible distributions; on the other hand, large classes lead to large deviations between µ and P(C^m_λ(µ)). Therefore, we have to consider classes of sets which are suitably restricted. This can be achieved using an SVM regularizer.

In the case where µ < 1, it seems the first work was reported in Sager [1979] and Hartigan [1987], with C being the class of closed convex sets in X. Nolan [1991] considered higher dimensions, with C being the class of ellipsoids. Tsybakov [1997] studied an estimator based on piecewise polynomial approximation of C_λ(µ) and showed it attains the asymptotically minimax rate for certain classes of densities. Polonik [1997] studied the estimation of C_λ(µ) by C^m_λ(µ). He derived asymptotic rates of convergence in terms of various measures of richness of C. More information on minimum volume estimators can be found in that work, and in Schölkopf et al. [2001].

SV support estimation relates to this previous work as follows: set λ(C_w) = ‖w‖², where C_w = {x | f_w(x) ≥ ρ}, f_w(x) = ⟨w, x⟩, and (w, ρ) are respectively a weight vector and an offset. Stated as a convex optimization problem, we want to separate the data from the origin with maximum margin via:

minimize_{w,ξ,ρ} ½‖w‖² + Σᵢ ξi − nνρ  subject to  ⟨w, xi⟩ ≥ ρ − ξi and ξi ≥ 0.   (78)

Here, ν ∈ (0, 1] plays the same role as in (75), controlling the number of observations xi for which f(xi) ≤ ρ. Since nonzero slack variables ξi are penalized in the objective function, if w and ρ solve this problem, then the decision function f(x) will attain or exceed ρ for at least a fraction 1 − ν of the xi contained in X while the regularization term ‖w‖ will still be small. The dual of (78) yields:

minimize_α ½ αᵀKα  subject to  αᵀ1 = νn and αi ∈ [0, 1].   (79)

To compare (79) to a Parzen windows estimator, assume that k is such that it can be normalized as a density in input space, such as a Gaussian. Using ν = 1 in (79) the constraints automatically imply
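The dual (79) is simple enough to solve directly. The sketch below is our own illustration (the data, kernel bandwidth, projected-gradient solver, and bisection-based projection onto the feasible set are assumptions, not part of the text); it makes the ν-property visible: since αᵀ1 = νn with αi ∈ [0, 1], at least νn coefficients are nonzero (SVs) and at most νn sit at the upper bound (margin errors).

```python
import numpy as np

# Single-class SVM dual (79): minimize 0.5 a^T K a, s.t. a^T 1 = nu*n, a_i in [0,1].
rng = np.random.default_rng(1)
X = rng.normal(0.0, 1.0, (60, 2))
nu, gamma = 0.2, 0.5
n = len(X)

sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-gamma * sq)                      # Gaussian kernel matrix

def project(v, s):
    # Euclidean projection onto {a in [0,1]^n : sum(a) = s}, by bisecting
    # over the shift tau in a = clip(v - tau, 0, 1).
    lo, hi = v.min() - 1.0, v.max()
    for _ in range(60):
        tau = 0.5 * (lo + hi)
        if np.clip(v - tau, 0.0, 1.0).sum() > s:
            lo = tau
        else:
            hi = tau
    return np.clip(v - tau, 0.0, 1.0)

alpha = project(np.full(n, nu), nu * n)
for _ in range(2000):                        # projected gradient descent
    alpha = project(alpha - (K @ alpha) / n, nu * n)

n_sv = int((alpha > 1e-6).sum())
n_bound = int((alpha > 1 - 1e-6).sum())
print("SVs:", n_sv, ">= nu*n =", nu * n, "| at bound:", n_bound, "<=", nu * n)
```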
αi = 1. Thus f reduces to a Parzen windows estimate of the underlying density. For ν < 1, the equality constraint in (79) still ensures that f is a thresholded density, now depending only on a subset of X — those which are important for deciding whether f(x) ≤ ρ.

3.3 Regression Estimation

SV regression was first proposed in Vapnik [1995], Vapnik et al. [1997], and Drucker et al. [1997] using the so-called ε-insensitive loss function. It is a direct extension of the soft-margin idea to regression: instead of requiring that yf(x) exceeds some margin value, we now require that the values y − f(x) are bounded by a margin on both sides. That is, we impose the soft constraints

yi − f(xi) ≤ ε + ξi  and  f(xi) − yi ≤ ε + ξi*,   (80)

where ξi, ξi* ≥ 0. If |yi − f(xi)| ≤ ε no penalty occurs. The objective function is given by the sum of the slack variables ξi, ξi* penalized by some C > 0 and a measure for the slope of the function f(x) = ⟨w, x⟩ + b, that is ½‖w‖².

Before computing the dual of this problem, let us consider a somewhat more general situation where we use a range of different convex penalties for the deviation between yi and f(xi). One may check that minimizing ½‖w‖² + C Σᵢ(ξi + ξi*) subject to (80) is equivalent to solving

minimize_{w,b,ξ} ½‖w‖² + Σᵢ ψ(yi − f(xi))  where ψ(ξ) = max(0, |ξ| − ε).   (81)

Choosing different loss functions ψ leads to a rather rich class of estimators:

• ψ(ξ) = ½ξ² yields penalized least squares (LS) regression [Hoerl and Kennard, 1970, Tikhonov, 1963, Morozov, 1984, Wahba, 1990]. The corresponding optimization problem can be minimized by solving a linear system.

• For ψ(ξ) = |ξ| we obtain the penalized least absolute deviations (LAD) estimator [Bloomfield and Steiger, 1983]. That is, we obtain a quadratic program to estimate the conditional median.

• A combination of LS and LAD loss yields a penalized version of Huber's robust regression [Huber, 1981, Smola and Schölkopf, 1998]. In this case we have ψ(ξ) = (1/(2σ))ξ² for |ξ| ≤ σ and ψ(ξ) = |ξ| − σ/2 for |ξ| ≥ σ.

• Note that also quantile regression [Koenker, 2005] can be modified to work with kernels [Schölkopf et al., 2000, Takeuchi et al., 2006] by using as loss function the "pinball" loss, that is ψ(ξ) = (1 − τ)|ξ| if ξ < 0 and ψ(ξ) = τξ if ξ ≥ 0.

All the optimization problems arising from the above five cases are convex quadratic programs. Their dual resembles that of (80), namely

minimize_{α,α*} ½(α − α*)ᵀK(α − α*) + εᵀ(α + α*) − yᵀ(α − α*)   (82a)
subject to (α − α*)ᵀ1 = 0 and αi, αi* ∈ [0, C].   (82b)

Here Kij = ⟨xi, xj⟩ for linear models and Kij = k(xi, xj) if we map x → Φ(x). The ν-trick, as described in (75) [Schölkopf et al., 2000], can be extended to regression, allowing one to choose the margin of approximation automatically. In this case (82a) drops the terms in ε; in its place, we add a linear constraint on the sum α + α*. Likewise, LAD is obtained from (82) by dropping the terms in ε without additional constraints. Robust regression leaves (82) unchanged; however, in the definition of K we have an additional term of σ on the main diagonal. For quantile regression we drop ε and we obtain different constants, C(1 − τ) and Cτ, for the constraints on α* and α. Further details can be found in Schölkopf and Smola [2002]. We will discuss uniform convergence properties of the empirical risk estimates with respect to various ψ(ξ) in Section 4.
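The quantile-regression case can be illustrated with a few lines of code. The sketch below is our own (a linear model fitted by subgradient descent on the pinball loss; the data, step size, and iteration count are assumptions, not the QP formulation of the text). At a stationary point, roughly a fraction τ of the targets lie below the fitted function, which is what makes the pinball loss estimate the conditional τ-quantile.

```python
import numpy as np

# Pinball-loss regression: fit f(x) = w*x + b by subgradient descent so that
# about a fraction tau of the observations fall below the fitted line.
rng = np.random.default_rng(2)
x = rng.uniform(0, 1, 500)
y = 2.0 * x + rng.exponential(1.0, 500)       # asymmetric noise
tau = 0.8

w, b, lr = 0.0, 0.0, 0.05
for _ in range(4000):
    r = y - (w * x + b)                       # residuals
    g = np.where(r > 0, -tau, 1.0 - tau)      # d(pinball)/d f(x_i)
    w -= lr * np.mean(g * x)
    b -= lr * np.mean(g)

frac_below = np.mean(y <= w * x + b)
print("fraction below fit: %.2f (target tau = %.2f)" % (frac_below, tau))
```

Setting the mean subgradient to zero gives P{y > f(x)} = 1 − τ on the sample, mirroring the constants C(1 − τ) and Cτ in the dual above.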
3.4 Multicategory Classification, Ranking and Ordinal Regression

Many estimation problems cannot be described by assuming that Y = {±1}. In this case it is advantageous to go beyond simple functions f(x) depending on x only. Instead, we can encode a larger degree of information by estimating a function f(x, y) and subsequently obtaining a prediction via ŷ(x) := argmax_{y∈Y} f(x, y). In other words, we study problems where y is obtained as the solution of an optimization problem over f(x, y) and we wish to find f such that y matches yi as well as possible for relevant inputs x.

Note that the loss may be more than just a simple 0−1 loss. In the following we denote by ∆(y, y′) the loss incurred by estimating y′ instead of y. Without loss of generality we require that ∆(y, y) = 0 and that ∆(y, y′) ≥ 0 for all y, y′ ∈ Y. Key in our reasoning is the following:

Lemma 12 Let f : X × Y → R and assume that ∆(y, y′) ≥ 0 with ∆(y, y) = 0. Moreover let ξ ≥ 0 such that f(x, y) − f(x, y′) ≥ ∆(y, y′) − ξ for all y′ ∈ Y. In this case ξ ≥ ∆(y, argmax_{y′∈Y} f(x, y′)).

Proof Denote by y* := argmax_{y∈Y} f(x, y). By assumption we have ξ ≥ ∆(y, y*) + f(x, y*) − f(x, y). Since f(x, y*) ≥ f(x, y′) for all y′ ∈ Y the inequality holds.

The construction of the estimator was suggested in Taskar et al. [2003] and Tsochantaridis et al. [2005], and a special instance of the above lemma is given by Joachims [2005]. While the bound appears quite innocuous, it allows us to describe a much richer class of estimation problems as a convex program. To deal with the added complexity we assume that f is given by f(x, y) = ⟨Φ(x, y), w⟩. Given the possibly nontrivial connection between x and y, the use of Φ(x, y) cannot be avoided. Corresponding kernel functions are given by k(x, y, x′, y′) = ⟨Φ(x, y), Φ(x′, y′)⟩. We have the following optimization problem [Tsochantaridis et al., 2005]:

minimize_{w,ξ} ½‖w‖² + C Σᵢ ξi  s.t.  ⟨Φ(xi, yi) − Φ(xi, y), w⟩ ≥ ∆(yi, y) − ξi, ∀i ∈ [n], y ∈ Y.   (83)

This is a convex optimization problem which can be solved efficiently if the constraints can be evaluated without high computational cost. One typically employs column-generation methods [Hettich and Kortanek, 1993, Rätsch, 2001, Bennett et al., 2000, Tsochantaridis et al., 2005, Fletcher, 1989] which identify one violated constraint at a time to find an approximate minimum of the optimization problem.

To describe the flexibility of the framework set out by (83) we give several examples below:

• Binary classification can be recovered by setting Φ(x, y) = yΦ(x), in which case the constraint of (83) reduces to 2yi⟨Φ(xi), w⟩ ≥ 1 − ξi. Ignoring constant offsets and a scaling factor of 2, this is exactly the standard SVM optimization problem.

• Multicategory classification problems [Crammer and Singer, 2001, Collins, 2000, Allwein et al., 2000, Rätsch et al., 2003] can be encoded via Y = [N], where N is the number of classes and ∆(y, y′) = 1 − δ_{y,y′}. In other words, the loss is 1 whenever we predict the wrong class and 0 for correct classification. Corresponding kernels are typically chosen to be δ_{y,y′} k(x, x′).

• We can deal with joint labeling problems by setting Y = {±1}ⁿ. In other words, the error measure does not depend on a single observation but on an entire set of labels. Joachims [2005] shows that the so-called F₁ score [van Rijsbergen, 1979] used in document retrieval and the area under the ROC curve [Bamber, 1975, Gribskov and Robinson, 1996] fall into this category of problems. Moreover, Joachims [2005] derives an O(n²) method for evaluating the inequality constraint over Y.

• Multilabel estimation problems deal with the situation where we want to find the best subset of labels Y ⊆ 2^[N] which correspond to some observation x. The problem is described in Elisseeff and Weston [2001], where the authors devise a ranking scheme such that f(x, i) > f(x, j) if label i ∈ y and j ∉ y. It is a special case of a general ranking approach described next.
Note that (83) is invariant under translations Φ(x, y) ← Φ(x, y) + Φ₀ where Φ₀ is constant, as Φ(xi, yi) − Φ(xi, y) remains unchanged. In practice this means that transformations k(x, y, x′, y′) ← k(x, y, x′, y′) + ⟨Φ₀, Φ(x, y)⟩ + ⟨Φ₀, Φ(x′, y′)⟩ + ‖Φ₀‖² do not affect the outcome of the estimation process. Since Φ₀ was arbitrary, we have the following lemma:

Lemma 13 Let H be an RKHS on X × Y with kernel k. Moreover let g ∈ H. Then the function k(x, y, x′, y′) + g(x, y) + g(x′, y′) + ‖g‖²_H is a kernel and it yields the same estimates as k.

We need a slight extension to deal with general ranking problems. Denote by Y = Graph[N] the set of all directed graphs on N vertices which do not contain loops of less than three nodes. Here an edge (i, j) ∈ y indicates that i is preferred to j with respect to the observation x. It is the goal to find some function f : X × [N] → R which imposes a total order on [N] (for a given x) by virtue of the function values f(x, i) such that the total order and y are in good agreement. More specifically, Dekel et al. [2003] and Crammer and Singer [2005] propose a decomposition algorithm A for the graphs y such that the estimation error is given by the number of subgraphs of y which are in disagreement with the total order imposed by f. As an example, multiclass classification can be viewed as a graph y where the correct label i is at the root of a directed graph and all incorrect labels are its children. Multilabel classification is then a bipartite graph where the correct labels only contain outgoing arcs and the incorrect labels only incoming ones.

This setting leads to a form similar to (83), except for the fact that we now have constraints over each subgraph G ∈ A(y). We solve

minimize_{w,ξ} ½‖w‖² + C Σᵢ |A(yi)|⁻¹ Σ_{G∈A(yi)} ξiG   (84)

subject to ⟨w, Φ(xi, u) − Φ(xi, v)⟩ ≥ 1 − ξiG and ξiG ≥ 0 for all (u, v) ∈ G ∈ A(yi). That is, we test for all (u, v) ∈ G whether the ranking imposed by G ∈ yi is satisfied.

Finally, ordinal regression problems, which perform ranking not over labels y but rather over observations x, were studied by Herbrich et al. [2000] and Chapelle and Harchaoui [2005] in the context of ordinal regression and conjoint analysis, respectively. In ordinal regression x is preferred to x′ if f(x) > f(x′), and hence one minimizes an optimization problem akin to (83), with constraint ⟨w, Φ(xi) − Φ(xj)⟩ ≥ 1 − ξij. In conjoint analysis the same operation is carried out for Φ(x, u), where u is the user under consideration. Similar models were also studied by Basilico and Hofmann [2004]. Further models will be discussed in Section 5, in particular situations where Y is of exponential size. These models allow one to deal with sequences and more sophisticated structures.
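Lemma 12 can be verified numerically. The check below is our own illustration (random scores, the 0−1 loss ∆(y, y′) = 1 − δ_{y,y′} of the multicategory example, and the trial counts are assumptions): for each draw it computes the smallest slack ξ satisfying the margin constraints and confirms that ξ upper-bounds the loss of the argmax prediction.

```python
import numpy as np

# Check Lemma 12: if f(x,y) - f(x,y') >= Delta(y,y') - xi for all y',
# then xi >= Delta(y, argmax_{y'} f(x, y')).
rng = np.random.default_rng(3)
N = 5                                    # number of classes, Y = [N]
Delta = 1.0 - np.eye(N)                  # 0-1 loss: Delta(y, y') = 1 - delta_{y,y'}

for _ in range(1000):
    f = rng.normal(0, 1, N)              # scores f(x, y') for a fixed x
    y = int(rng.integers(N))             # true label
    # smallest slack satisfying all constraints of (83) for this example
    xi = max(0.0, float(np.max(Delta[y] - (f[y] - f))))
    y_hat = int(np.argmax(f))
    assert xi >= Delta[y, y_hat] - 1e-12
print("Lemma 12 held on all 1000 random trials")
```

This is precisely why Σᵢ ξi in (83) upper-bounds the empirical ∆-risk of the estimator ŷ(x) = argmax_y f(x, y).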
3.5 Applications of SVM Algorithms

When SVMs were first presented, they initially met with scepticism in the statistical community. Part of the reason was that, as described, SVMs construct their decision rules in potentially very high-dimensional feature spaces associated with kernels. Although there was a fair amount of theoretical work addressing this issue (see Section 4 below), it was probably to a larger extent the empirical success of SVMs that paved its way to become a standard method of the statistical toolbox.

The first successes of SVMs on practical problems were in handwritten digit recognition, which was the main benchmark task considered in the Adaptive Systems Department at AT&T Bell Labs where SVMs were developed (see, e.g., LeCun et al. [1998]). Using methods to incorporate transformation invariances, SVMs were shown to beat the world record on the MNIST benchmark set, at the time the gold standard in the field [DeCoste and Schölkopf, 2002]. There has been a significant number of further computer vision applications of SVMs since then, including tasks such as object recognition [Blanz et al., 1996, Chapelle et al., 1999, Holub et al., 2005, Everingham et al., 2005] and detection [Romdhani et al., 2004]. Nevertheless, it is probably fair to say
that two other fields have been more influential in spreading the use of SVMs: bioinformatics and natural language processing. Both of them have generated a spectrum of challenging high-dimensional problems on which SVMs excel, such as microarray processing tasks [Brown et al., 2000] and text categorization [Dumais, 1998]. For further references, see Schölkopf et al. [2004] and Joachims [2002].

Many successful applications have been implemented using SV classifiers; however, also the other variants of SVMs have led to very good results, including SV regression [Müller et al., 1997], SV novelty detection [Hayton et al., 2000], SVMs for ranking [Herbrich et al., 2000] and, more recently, problems with interdependent labels [Tsochantaridis et al., 2005, McCallum et al., 2005].

At present there exists a large number of readily available software packages for SVM optimization. For instance, SVMStruct, based on [Tsochantaridis et al., 2005], solves structured estimation problems. LibSVM is an open source solver which excels on binary problems. The Torch package contains a number of estimation methods, including SVM solvers. Several SVM implementations are also available via statistical packages, such as R.

4 Margins and Uniform Convergence Bounds

So far we motivated the algorithms by means of their practicality and the fact that 0−1 loss functions yield hard-to-control estimators. We now follow up on the analysis by providing uniform convergence bounds for large margin classifiers. We focus on the case of scalar valued functions applied to classification for two reasons: the derivation is well established and it can be presented in a concise fashion. Secondly, the derivation of corresponding bounds for the vectorial case is by and large still an open problem. Preliminary results exist, such as the bounds by Collins [2000] for the case of perceptrons, by Taskar et al. [2003], who derive capacity bounds in terms of covering numbers by an explicit covering construction, and by Bartlett and Mendelson [2002], who give Gaussian average bounds for vectorial functions. We believe that the scaling behavior of these bounds in the number of classes Y is currently not optimal, when applied to the problems of type (83), e.g., via the construction of Lemma 12.

Our analysis is based on the following ideas: firstly, the 0−1 loss is upper bounded by some function ψ(yf(x)) which can be efficiently minimized, such as the soft margin function max(0, 1 − yf(x)) of the previous section. Secondly, we prove that the empirical average of the ψ-loss is concentrated close to its expectation. This will be achieved by means of Rademacher averages. Thirdly, we show that under rather general conditions the minimization of the ψ-loss is consistent with the minimization of the expected risk. Finally, we combine these bounds to obtain rates of convergence which only depend on the Rademacher average and the approximation properties of the function class under consideration.

4.1 Margins and Empirical Risk

While the sign of yf(x) can be used to assess the accuracy of a binary classifier, we saw that for algorithmic reasons one rather optimizes (a smooth function of) yf(x) directly. There is a long-standing tradition of minimizing yf(x) rather than the number of misclassifications: yf(x) is known as the "margin" (based on the geometrical reasoning) in the context of SVMs [Vapnik and Lerner, 1963, Mangasarian, 1965], as the "stability" in the context of Neural Networks [Krauth and Mézard, 1987, Ruján, 1993], and as the "edge" in the context of arcing [Breiman, 1999]. One may show [Makovoz, 1996, Barron, 1993, Herbrich and Williamson, 2002] that functions f in an RKHS achieving a large margin can be approximated by another function f′ achieving almost the same empirical error using a much smaller number of kernel functions.

In the following we assume that the binary loss χ(ξ) = ½(1 − sgn ξ) is majorized by some function ψ(ξ) ≥ χ(ξ). Consequently E[χ(yf(x))] ≤ E[ψ(yf(x))] and likewise Eemp[χ(yf(x))] ≤ Eemp[ψ(yf(x))]. The hope is (as will be shown in Section 4.3) that minimizing the upper bound leads to consistent estimators.

Note that, by default, uniform convergence bounds are expressed in terms of minimization of the empirical risk average with respect to a fixed function class F, e.g., Vapnik and Chervonenkis [1971]. This is
very much unlike what is done in practice: in SVM (83) the sum of empirical risk and a regularizer is minimized. However, one may check that minimizing Eemp[ψ(yf(x))] subject to ‖w‖ ≤ W is equivalent to minimizing Eemp[ψ(yf(x))] + λ‖w‖² for suitably chosen values of λ, as it acts as a regularizer. The equivalence is immediate by using Lagrange multipliers. For numerical reasons, however, the second formulation is much more convenient [Tikhonov, 1963, Morozov, 1984]. Finally, for the design of adaptive estimators, so-called luckiness results exist, which provide risk bounds in a data-dependent fashion [Shawe-Taylor et al., 1998, Herbrich and Williamson, 2002].

4.2 Uniform Convergence and Rademacher Averages

The next step is to bound the deviation Eemp[ψ(yf(x))] − E[ψ(yf(x))] by means of Rademacher averages. For details see Bousquet et al. [2005], Mendelson [2003], Bartlett et al. [2002], and Koltchinskii [2001]. Denote by g : Xⁿ → R a function of n variables and let c > 0 be such that |g(x1, ..., xn) − g(x1, ..., xi−1, x′i, xi+1, ..., xn)| ≤ c for all x1, ..., xn, x′i ∈ X and for all i ∈ [n]. Then [McDiarmid, 1989]

P{E[g(x1, ..., xn)] − g(x1, ..., xn) > ε} ≤ exp(−2ε²/(nc²)).   (85)

Assume that f(x) ∈ [0, B] for all f ∈ F and let g(x1, ..., xn) := sup_{f∈F} |Eemp[f(x)] − E[f(x)]|. Then it follows that c ≤ B/n. Solving (85) for g we obtain that with probability at least 1 − δ

sup_{f∈F} E[f(x)] − Eemp[f(x)] ≤ E[sup_{f∈F} E[f(x)] − Eemp[f(x)]] + B √(−log δ/(2n)).   (86)

This means that with high probability the largest deviation between the sample average and its expectation is concentrated around its mean and within an O(n^(−1/2)) term. The expectation can be bounded by a classical symmetrization argument [Vapnik and Chervonenkis, 1971] as follows:

EX[sup_{f∈F} E[f(x′)] − Eemp[f(x)]] ≤ EX,X′[sup_{f∈F} Eemp[f(x′)] − Eemp[f(x)]]
 = EX,X′,σ[sup_{f∈F} Eemp[σf(x′)] − Eemp[σf(x)]]
 ≤ 2 EX,σ[sup_{f∈F} Eemp[σf(x)]] =: 2 Rn[F].

The first inequality follows from the convexity of the argument of the expectation; the second equality follows from the fact that xi and x′i are drawn i.i.d. from the same distribution, hence we may swap terms. Here the σi are independent ±1-valued zero-mean Rademacher random variables. Rn[F] is referred to as the Rademacher average [Mendelson, 2001, Bartlett and Mendelson, 2002, Koltchinskii, 2001] of F with respect to sample size n.

For linear function classes Rn[F] takes on a particularly nice form. We begin with F := {f | f(x) = ⟨x, w⟩ and ‖w‖ ≤ 1}. It follows that sup_{‖w‖≤1} Σᵢ σi⟨w, xi⟩ = ‖Σᵢ σi xi‖. Hence

n Rn[F] = EX,σ[‖Σᵢ σi xi‖] ≤ EX[(Eσ[‖Σᵢ σi xi‖²])^(1/2)] = EX[(Σᵢ ‖xi‖²)^(1/2)] ≤ √(n E[‖x‖²]).   (87)

Here the first inequality is a consequence of Jensen's inequality, the second equality follows from the fact that the σi are i.i.d. zero-mean random variables, and the last step again is a result of Jensen's inequality. Corresponding lower bounds, tight up to a factor of 1/√2, exist; they are a result of the Khintchine-Kahane inequality [Kahane, 1968].
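The chain of inequalities in (87) can be observed directly by Monte Carlo. The sketch below is our own illustration (sample size, dimension, and the number of Rademacher draws are arbitrary choices): conditional on the sample X, it estimates Eσ‖Σᵢ σi xi‖ and compares it with the Jensen bound √(Σᵢ ‖xi‖²) and the Khintchine-Kahane lower bound.

```python
import numpy as np

# Monte Carlo check of (87): E_sigma || sum_i sigma_i x_i || <= sqrt(sum_i ||x_i||^2),
# and the lower bound is tight up to a factor 1/sqrt(2) (Khintchine-Kahane).
rng = np.random.default_rng(4)
n, d = 200, 5
X = rng.normal(0, 1, (n, d))

sigma = rng.choice([-1.0, 1.0], size=(5000, n))      # Rademacher draws
norms = np.linalg.norm(sigma @ X, axis=1)            # ||sum_i sigma_i x_i||
lhs = norms.mean()                                   # ~ n * R_n (given X)
rhs = np.sqrt((X ** 2).sum())                        # sqrt(sum_i ||x_i||^2)

print("E||sum sigma_i x_i|| = %.1f, upper bound = %.1f" % (lhs, rhs))
```

Note that the estimate lands only slightly below the bound, consistent with the remark that the bound is tight up to the constant 1/√2.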
Note that (87) allows us to bound Rn[F] ≤ n^(−1/2) r, where r is the average radius of the sample, i.e., r² = E[‖x‖²]. An extension to kernel functions is straightforward: by design of the inner product we have r² = Ex[k(x, x)]. Note that this bound is independent of the dimensionality of the data but rather only depends on the expected norm of the input vectors. Moreover, r² can also be interpreted as the trace of the integral operator with kernel k(x, x′) and probability measure on X.

Since we are computing Eemp[ψ(yf(x))], we are interested in the Rademacher complexity of ψ ∘ F. Bartlett and Mendelson [2002] show that Rn[ψ ∘ F] ≤ L Rn[F] for any Lipschitz continuous function ψ with Lipschitz constant L and with ψ(0) = 0. Secondly, for {yb | |b| ≤ B} the Rademacher average can be bounded by B√(2 log 2/n), as follows from [Bousquet et al., 2005, eq. (4)]. This takes care of the offset b. For sums of function classes F and G we have Rn[F + G] ≤ Rn[F] + Rn[G]. This means that for linear functions with ‖w‖ ≤ W, |b| ≤ B, and ψ Lipschitz continuous with constant L, we have Rn ≤ (L/√n)(Wr + B√(2 log 2)).

4.3 Upper Bounds and Convex Functions

We briefly discuss consistency of minimization of the surrogate loss function ψ : R → [0, ∞), which we assume to be convex and χ-majorizing, i.e., ψ ≥ χ [Jordan et al., 2003, Zhang, 2004]. Examples of such functions are the soft-margin loss max(0, 1 − γξ), which we discussed in Section 3, the Boosting loss e^(−ξ), which is commonly used in AdaBoost [Schapire et al., 1998, Rätsch et al., 2001], and the logistic loss ln(1 + e^(−ξ)) (see Section 5).

Denote by f*_χ the minimizer of the expected risk and let f*_ψ be the minimizer of E[ψ(yf(x))] with respect to f. Then under rather general conditions on ψ [Zhang, 2004] for all f the following inequality holds:

E[χ(yf(x))] − E[χ(yf*_χ(x))] ≤ c (E[ψ(yf(x))] − E[ψ(yf*_ψ(x))])^s.   (88)

In particular we have c = 4 and s = 1 for the soft margin loss, whereas for Boosting and logistic regression c = √8 and s = ½. Note that (88) implies that the minimizer of the ψ-loss is consistent, i.e., E[χ(yf*_ψ(x))] = E[χ(yf*_χ(x))].

4.4 Rates of Convergence

We now have all tools at our disposition to obtain rates of convergence to the minimizer of the expected risk which depend only on the complexity of the function class and its approximation properties in terms of the ψ-loss. Denote by f*_{ψ,F} the minimizer of E[ψ(yf(x))] restricted to F, let f^n_{ψ,F} be the minimizer of the empirical ψ-risk, and let δ(F, ψ) := E[ψ(yf*_{ψ,F}(x))] − E[ψ(yf*_ψ(x))] be the approximation error due to the restriction of f to F. Then a simple telescope sum yields

E[χ(yf^n_{ψ,F})] ≤ E[χ(yf*_χ)] + 4(E[ψ(yf^n_{ψ,F})] − Eemp[ψ(yf^n_{ψ,F})]) + 4(Eemp[ψ(yf*_{ψ,F})] − E[ψ(yf*_{ψ,F})]) + δ(F, ψ)
 ≤ E[χ(yf*_χ)] + δ(F, ψ) + (4RWγ/√n)(√(−2 log δ) + r/R + √(8 log 2)).   (89)

Here γ is the effective margin of the soft-margin loss max(0, 1 − γyf(x)), W is an upper bound on ‖w‖, R ≥ ‖x‖, r is the average radius, as defined in the previous section, and we assumed that b is bounded by the largest value of ⟨w, x⟩. Note that we get an O(1/√n) rate of convergence regardless of the dimensionality of x. Moreover, note that the rate is dominated by RWγ, similar to the classical radius-margin bound [Vapnik, 1995]: here R is the radius of an enclosing sphere for the data — the soft-margin loss becomes active only for yf(x) ≤ γ.
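The majorization assumption ψ ≥ χ underlying (88) is easy to verify pointwise. The check below is our own illustration (the grid is arbitrary, and the logistic loss is rescaled by 1/ln 2 so that it dominates χ at the origin, an assumption we add for the check rather than a statement of the text):

```python
import numpy as np

# The surrogates of Section 4.3 majorize the 0-1 loss chi(xi) = (1 - sgn xi)/2.
xi = np.linspace(-5, 5, 2001)
chi = 0.5 * (1.0 - np.sign(xi))

hinge = np.maximum(0.0, 1.0 - xi)                    # soft-margin loss (gamma = 1)
boost = np.exp(-xi)                                  # Boosting loss
logistic = np.log(1.0 + np.exp(-xi)) / np.log(2.0)   # rescaled logistic loss

for name, psi in (("hinge", hinge), ("boost", boost), ("logistic", logistic)):
    print(name, "majorizes chi:", bool(np.all(psi >= chi - 1e-12)))
```

Because each ψ dominates χ, minimizing the empirical ψ-risk controls the empirical 0−1 risk, which is the first ingredient of the rate analysis above.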
4.5 Localization and Noise Conditions

In many cases it is possible to obtain better rates of convergence than O(1/√n) by exploiting information about the magnitude of the error of misclassification and about the variance of f on X. Such bounds use Bernstein-type inequalities and lead to localized Rademacher averages [Bartlett et al., 2002, Mendelson, 2003, Bousquet et al., 2005].

Basically, the slow O(1/√n) rates arise whenever the region around the Bayes optimal decision boundary is large. In this case, determining this region produces the slow rate, whereas the well-determined region could be estimated at an O(1/n) rate.

Tsybakov's noise condition [Tsybakov, 2003] requires that there exist β, γ ≥ 0 such that

    P( |P(y = 1|x) − 1/2| ≤ t ) ≤ β t^γ  for all t ≥ 0.    (90)

Note that for γ = ∞ the condition implies that there exists some s such that |P(y = 1|x) − 1/2| ≥ s > 0 almost surely. This is also known as Massart's noise condition.

The key benefit of (90) is that it implies a relationship between the variance and the expected value of the classification loss. More specifically, for α = γ/(1 + γ) and g : X → Y we have

    E[ (1{g(x) ≠ y} − 1{g*(x) ≠ y})² ] ≤ c ( E[ 1{g(x) ≠ y} − 1{g*(x) ≠ y} ] )^α .    (91)

Here g*(x) := argmax_y P(y|x) denotes the Bayes optimal classifier. This is sufficient to obtain faster rates for finite sets of classifiers. For more complex function classes, localization is used. See, for instance, Bousquet et al. [2002] and Bartlett et al. [2005] for more details.

5 Statistical Models and RKHS

As we have argued so far, the reproducing kernel Hilbert space approach offers many advantages in machine learning: (i) powerful and flexible models can be defined, (ii) many results and algorithms for linear models in Euclidean spaces can be generalized to RKHS, and (iii) learning theory assures that effective learning in RKHS is possible, for instance by means of regularization.

In this chapter, we will show how kernel methods can be utilized in the context of statistical models. There are several reasons to pursue such an avenue. First of all, in conditional modeling it is often insufficient to compute a prediction without assessing confidence and reliability. Second, when dealing with multiple or structured responses, it is important to model dependencies between responses in addition to the dependence on a set of covariates. Third, incomplete data, be it due to missing variables, incomplete training targets, or a model structure involving latent variables, needs to be dealt with in a principled manner. All of these issues can be addressed by using the RKHS approach to define statistical models and by combining kernels with statistical approaches such as exponential models, generalized linear models, and Markov networks.

5.1 Exponential RKHS Models

5.1.1 Exponential Models

Exponential models or exponential families are among the most important classes of parametric models studied in statistics. Given a canonical vector of statistics Φ and a σ-finite measure ν over the sample space X, an exponential model can be defined via its probability density with respect to ν (cf. Barndorff-Nielsen [1978]):

    p(x; θ) = exp[ ⟨θ, Φ(x)⟩ − g(θ) ],  where  g(θ) := ln ∫_X e^{⟨θ, Φ(x)⟩} dν(x).    (92)
The m-dimensional vector θ ∈ Θ with Θ := {θ ∈ ℝ^m : g(θ) < ∞} is also called the canonical parameter vector. In general, there are multiple exponential representations of the same model via canonical parameters that are affinely related to one another (Murray and Rice [1993]). A representation with minimal m is called a minimal representation, in which case m is the order of the exponential model. One of the most important properties of exponential families is that they have sufficient statistics of fixed dimensionality, i.e. the joint density for i.i.d. random variables X₁, X₂, ..., Xₙ is also exponential, the corresponding canonical statistics simply being Σᵢ Φ(Xᵢ).

It is well known that much of the structure of exponential models can be derived from the log-partition function g(θ), in particular (cf. Lauritzen [1996])

    ∇_θ g(θ) = µ(θ) := E_θ[Φ(X)],   ∂²_θ g(θ) = V_θ[Φ(X)],    (93)

where µ is known as the mean-value map. Being a covariance matrix, the Hessian of g is positive semidefinite, and consequently g is convex.

Maximum likelihood estimation (MLE) in exponential families leads to a particularly elegant form of the MLE equations: the expected and the observed canonical statistics agree at the MLE θ̂. This means that, given an i.i.d. sample S = (xᵢ)_{i∈[n]},

    E_θ̂[Φ(X)] = µ(θ̂) = (1/n) Σ_{i=1}^n Φ(xᵢ) =: E_S[Φ(X)].    (94)

5.1.2 Exponential RKHS Models

One can extend the parametric exponential model in (92) by defining a statistical model via an RKHS H with generating kernel k. Linear functions ⟨θ, Φ(·)⟩ over X are replaced with functions f ∈ H, which yields an exponential RKHS model

    p(x; f) = exp[ f(x) − g(f) ],   f ∈ H := { f : f(·) = Σ_{x∈S} α_x k(·, x), S ⊆ X, |S| < ∞ }.    (95)

A justification for using exponential RKHS families with rich canonical statistics as a generic way to define nonparametric models stems from the fact that, if the chosen kernel k is powerful enough, the associated exponential families become universal density estimators. This can be made precise using the concept of universal kernels (Steinwart [2002a], Section 2).

Proposition 14 (Dense Densities) Let X be a measurable set with a fixed σ-finite measure ν, and denote by P a family of densities on X with respect to ν such that every p ∈ P is uniformly bounded from above and continuous. Let k : X × X → ℝ be a universal kernel for H. Then the exponential RKHS family of densities generated by k according to Eq. (95) is dense in P in the L∞ sense.

Proof Let D := sup_p ‖p‖∞ and choose η = η(ε, D) > 0 appropriately (see below). By the universality assumption, for every p ∈ P there exists f ∈ H such that ‖f − ln p‖∞ < η. Exploiting the strict convexity of the exp-function, one gets that e^{−η} p < e^f < e^η p, which directly implies e^{−η} < e^{g(f)} = ∫_X e^{f(x)} dν(x) < e^η. Moreover, ‖e^f − p‖∞ < D(e^η − 1), and hence

    ‖p(f) − p‖∞ ≤ ‖e^f − p‖∞ + ‖e^{f−g(f)} − e^f‖∞ < D(e^η − 1)(1 + e^η) < 2De^η(e^η − 1),

where we utilized the upper bound

    ‖e^{f−g(f)} − e^f‖∞ ≤ |e^{−g(f)} − 1| · ‖e^f‖∞ < (e^η − 1) ‖e^f‖∞ < De^η(e^η − 1).

Equating 2De^η(e^η − 1) with ε and solving for η results in η(ε, D) = ln[ (1/2)(1 + √(1 + 2ε/D)) ].
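As an illustration of the moment-matching equations (94), consider the Bernoulli family with canonical statistic Φ(x) = x and counting measure on {0, 1}, where g(θ) = ln(1 + e^θ) and the MLE has a closed form. The snippet below is a minimal sketch; the helper names are ours, not from the text:

```python
import math

def g(theta):
    """Log-partition function (92) for the Bernoulli family with
    canonical statistic Phi(x) = x and counting measure on {0, 1}."""
    return math.log(1.0 + math.exp(theta))

def mean_value_map(theta, eps=1e-6):
    """mu(theta) = g'(theta) = E_theta[Phi(X)], cf. (93); computed by a
    numerical derivative to emphasize the general identity."""
    return (g(theta + eps) - g(theta - eps)) / (2.0 * eps)

def mle(sample):
    """Moment matching (94): solve mu(theta) = empirical mean, which for
    the Bernoulli family is the logit of the sample mean."""
    m = sum(sample) / len(sample)
    return math.log(m / (1.0 - m))

sample = [1, 0, 1, 1, 0, 1, 1, 1]      # empirical mean 0.75
theta_hat = mle(sample)
```

At θ̂ the expected and observed statistics agree, i.e. mean_value_map(theta_hat) returns the sample mean, which is exactly the content of (94).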
5.1.3 Conditional Exponential Models

For the rest of the paper, we will focus on the case of predictive or conditional modeling with a (potentially compound or structured) response variable Y and predictor variables X. Taking up the concept of joint kernels introduced in the previous section, we will investigate conditional models that are defined by functions f : X × Y → ℝ from some RKHS H over X × Y with kernel k as follows:

    p(y|x; f) = exp[ f(x, y) − g(x, f) ],  where  g(x, f) := ln ∫_Y e^{f(x,y)} dν(y).    (96)

Notice that in the finite-dimensional case we have a feature map Φ : X × Y → ℝ^m, from which parametric models are obtained via H := {f : ∃ w, f(x, y) = ⟨w, Φ(x, y)⟩}, and each f can be identified with its parameter w. Let us discuss some concrete examples to illustrate the rather general model equation (96).

• Let Y be univariate and define Φ(x, y) := yΦ(x). Then simply f(x, y; w) = ⟨w, Φ(x, y)⟩ = y⟨w, Φ(x)⟩, i.e. the canonical parameters depend linearly on the covariates Φ(x), and the model equation in (96) reduces to

    p(y|x; w) = exp[ y ⟨w, Φ(x)⟩ − g(x, w) ].    (97)

This is a generalized linear model (GLM) (Nelder and Wedderburn [1972], McCullagh and Nelder [1983]) with a canonical link. For different response scales we get several well-known models, such as, for instance, logistic regression, where y ∈ {−1, 1}. This construction is a special case of defining joint kernels from an existing kernel k over inputs via k((x, y), (x′, y′)) := yy′ k(x, x′).

• In the nonparametric extension of generalized linear models following [Green and Yandell, 1985, O'Sullivan et al., 1986], the parametric assumption on the linear predictor f(x; w) = ⟨w, Φ(x)⟩ in GLMs is relaxed by requiring that f̃ come from some sufficiently smooth class of functions, namely an RKHS defined over X, with f(x, y) = y f̃(x). In combination with a parametric part, this can also be used to define semiparametric models. Popular choices of kernels include the ANOVA kernel investigated in Wahba et al. [1995].

• Joint kernels provide a powerful framework for prediction problems with structured outputs. An illuminating example is statistical natural language parsing with lexicalized probabilistic context-free grammars (cf. Magerman [1996]). Here x will be an English sentence and y a parse tree for x, i.e. a highly structured and complex output. The productions of the grammar are known, but the conditional probability p(y|x) needs to be estimated based on training data of parsed/annotated sentences. In the simplest case, the extracted statistics Φ may encode the frequencies of the use of different productions in a sentence with known parse tree. More sophisticated feature encodings are discussed in Taskar et al. [2004], Zettlemoyer and Collins [2005]. The conditional modeling approach provides alternatives to state-of-the-art approaches that estimate joint models p(x, y) with maximum likelihood or maximum entropy (Charniak [2000]) and obtain predictive models by conditioning on x.

5.1.4 Risk Functions for Model Fitting

There are different inference principles to determine the optimal function f ∈ H for the conditional exponential model in (96). One standard approach to parametric model fitting is to maximize the conditional log-likelihood, or, equivalently, to minimize a logarithmic loss, a strategy pursued in the Conditional Random Field (CRF) approach of Lafferty et al. [2001]. Here we consider the more general case of minimizing a functional that includes a monotone function of the Hilbert space norm ‖f‖_H as a stabilizer (cf. Wahba [1990]). This reduces to penalized log-likelihood estimation in the finite-dimensional case:

    C^{ll}(f; S) := −(1/n) Σ_{i=1}^n ln p(yᵢ|xᵢ; f),   f̂^{ll}(S) := argmin_{f∈H} (λ/2) ‖f‖²_H + C^{ll}(f; S).    (98)
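For a finite label set and a finite-dimensional feature map, the objective in (98) is a convex function of w and can be minimized by plain gradient descent. The following sketch uses the GLM feature map Φ(x, y) = yΦ(x) of (97); the toy data and step-size settings are illustrative assumptions, not from the text:

```python
import math

def fit_conditional_exponential(data, phi, ys, dim, lam=0.1, lr=0.5, steps=500):
    """Gradient descent on (lam/2)||w||^2 - (1/n) sum_i ln p(y_i|x_i; w)
    for p(y|x; w) proportional to exp(<w, phi(x, y)>), i.e. Eq. (98) with
    a finite label set ys and feature map phi."""
    w = [0.0] * dim
    n = len(data)
    for _ in range(steps):
        grad = [lam * wi for wi in w]
        for x, y in data:
            feats = {yy: phi(x, yy) for yy in ys}
            scores = {yy: sum(wj * fj for wj, fj in zip(w, feats[yy])) for yy in ys}
            z = sum(math.exp(s) for s in scores.values())
            for yy in ys:
                # d(-log-lik)/dw = sum_y (p(y|x) - [y == y_i]) * phi(x, y)
                coeff = (math.exp(scores[yy]) / z - (1.0 if yy == y else 0.0)) / n
                for j in range(dim):
                    grad[j] += coeff * feats[yy][j]
        w = [wi - lr * gi for wi, gi in zip(w, grad)]
    return w

# GLM feature map Phi(x, y) = y * Phi(x) from (97); data are illustrative
phi = lambda x, y: [y * x[0], y * x[1]]
data = [((1.0, 0.5), 1), ((0.8, 1.0), 1), ((-1.0, -0.4), -1), ((-0.7, -1.1), -1)]
w = fit_conditional_exponential(data, phi, ys=(-1, 1), dim=2)
```

For y ∈ {−1, 1} the quantity 2y⟨w, Φ(x)⟩ is the log-odds ratio of the fitted model, so a positive value means the observed label receives probability greater than 1/2.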
• For the parametric case, Lafferty et al. [2001] have employed variants of improved iterative scaling (Della Pietra et al. [1997], Darroch and Ratcliff [1972]) to optimize Eq. (98), whereas Sha and Pereira [2003] have investigated preconditioned conjugate gradient descent and limited-memory quasi-Newton methods.

• In order to optimize Eq. (98) one usually needs to compute expectations of the canonical statistics E_f[Φ(Y, x)] at sample points x = xᵢ, which requires the availability of efficient inference algorithms.

As we have seen in the case of classification and regression, likelihood-based criteria are by no means the only justifiable choice, and large margin methods offer an interesting alternative. To that extent, we will present a general formulation of large margin methods for response variables over finite sample spaces, based on the approach suggested by Altun et al. [2003] and Taskar et al. [2003]. Define

    r(x, y; f) := f(x, y) − max_{y′≠y} f(x, y′) = min_{y′≠y} log [ p(y|x; f) / p(y′|x; f) ],  and  r(S; f) := min_i r(xᵢ, yᵢ; f).    (99)

Here r(x, y; f) generalizes the notion of the separation margin used in SVMs. Since the log-odds ratio is sensitive to rescaling of f, i.e. r(x, y; βf) = β r(x, y; f), we need to constrain ‖f‖_H to make the problem well-defined. We thus replace f by φ^{−1} f for some fixed dispersion parameter φ > 0 and define the maximum margin problem

    f̂^{mm}(S) := φ^{−1} argmax_{‖f‖_H = 1} r(S; f).    (100)

For the sake of the presentation, we will drop φ in the following.⁹ Using the same line of arguments as was used in Section 3, the maximum margin problem can be reformulated as a constrained optimization problem

    f̂^{mm}(S) := argmin_{f∈H} (1/2) ‖f‖²_H,  s.t. r(xᵢ, yᵢ; f) ≥ 1, ∀ i ∈ [n],    (101)

provided the latter is feasible, i.e. if there exists f ∈ H such that r(S; f) > 0. To make the connection to SVMs, consider the case of binary classification with Φ(x, y) = yΦ(x), where f(x, y) = ⟨w, yΦ(x)⟩ and

    r(x, y; w) = ⟨w, yΦ(x)⟩ − ⟨w, −yΦ(x)⟩ = 2y ⟨w, Φ(x)⟩ = 2ρ(x, y; w).

The latter is twice the standard margin for binary classification in SVMs.

• An equivalent formulation using slack variables ξᵢ, as discussed in Section 3, can be obtained by introducing soft-margin constraints r(xᵢ, yᵢ; f) ≥ 1 − ξᵢ, ξᵢ ≥ 0, and by defining C^{hl} = (1/n) Σᵢ ξᵢ. Each nonlinear constraint can be further expanded into |Y| − 1 linear constraints f(xᵢ, yᵢ) − f(xᵢ, y) ≥ 1 − ξᵢ for all y ≠ yᵢ. A soft-margin version can thus be defined based on the hinge loss:

    C^{hl}(f; S) := (1/n) Σ_{i=1}^n max{ 1 − r(xᵢ, yᵢ; f), 0 },   f̂^{sm}(S) := argmin_{f∈H} (λ/2) ‖f‖²_H + C^{hl}(f; S).

• Another sensible option in the parametric case is to minimize an exponential risk function of the following type:

    f̂^{exp}(S) := argmin_w (1/n) Σ_{i=1}^n Σ_{y≠yᵢ} exp[ f(xᵢ, y; w) − f(xᵢ, yᵢ; w) ].    (102)

This is related to the exponential loss used in the AdaBoost method of Freund and Schapire [1996]. Since we are mainly interested in kernel-based methods here, we refrain from further elaborating on this connection.

• Prediction problems with structured outputs often involve task-specific loss functions △ : Y × Y → ℝ, as discussed in Section 3. As suggested in Taskar et al. [2003], cost-sensitive large margin methods can be obtained by defining rescaled margin constraints f(xᵢ, yᵢ) − f(xᵢ, y) ≥ △(yᵢ, y) − ξᵢ for all y ≠ yᵢ.

⁹ We will not deal with the problem of how to estimate φ here. Note, however, that one does need to know φ in order to make an optimal deterministic prediction.
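The margin (99) and the resulting hinge and margin-rescaled losses reduce to maximizations over the label set, which is cheap whenever Y is small or decomposes. A minimal sketch, in which the score function and data are purely illustrative:

```python
def margin(f, x, y, labels):
    """r(x, y; f) = f(x, y) - max_{y' != y} f(x, y') from Eq. (99)."""
    return f(x, y) - max(f(x, yp) for yp in labels if yp != y)

def structured_hinge(f, sample, labels, delta=None):
    """Average hinge loss (1/n) sum_i max(0, 1 - r(x_i, y_i; f)).  With a
    task loss `delta`, the margin-rescaled variant instead penalizes
    violations of f(x_i, y_i) - f(x_i, y) >= delta(y_i, y)."""
    total = 0.0
    for x, y in sample:
        if delta is None:
            total += max(0.0, 1.0 - margin(f, x, y, labels))
        else:
            total += max(max(0.0, delta(y, yp) - (f(x, y) - f(x, yp)))
                         for yp in labels if yp != y)
    return total / len(sample)

# toy scores over three labels (purely illustrative)
f = lambda x, y: {0: 0.0, 1: x, 2: -x}[y]
sample = [(2.0, 1), (0.4, 1), (-1.5, 2)]
loss = structured_hinge(f, sample, labels=(0, 1, 2))
```

Only the second example has margin below one (r = 0.4), so the average hinge loss here is 0.6/3 = 0.2.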
5.1.5 Generalized Representer Theorem and Dual Soft-Margin Formulation

It is crucial to understand how the representer theorem applies in the setting of arbitrary discrete output spaces, since a finite representation of the optimal f̂ ∈ {f̂^{ll}, f̂^{sm}} is the basis for constructive model fitting. Notice that the regularized log-loss as well as the soft-margin functional introduced above depend not only on the values of f on the sample S, but rather on the evaluation of f on the augmented sample S̃ := {(xᵢ, y) : i ∈ [n], y ∈ Y}. This adds an additional complication compared to binary classification: for each xᵢ, output values y ≠ yᵢ not observed with xᵢ show up in the log-partition function g(xᵢ, f) in (96) as well as in the log-odds ratios in (99).

Corollary 15 Denote by H an RKHS on X × Y with kernel k and let S = ((xᵢ, yᵢ))_{i∈[n]}. Furthermore, let C(f; S̃) be a functional depending on f only via its values on the augmented sample S̃, and let Ω be a strictly monotonically increasing function. Then the solution of the optimization problem f̂(S) := argmin_{f∈H} C(f; S̃) + Ω(‖f‖_H) can be written as

    f̂(·) = Σ_{i=1}^n Σ_{y∈Y} β_{iy} k(·, (xᵢ, y)).    (103)

This follows directly from Theorem 11.

Let us focus on the soft-margin maximizer f̂^{sm}. Instead of solving (101) directly, we first derive the dual program, following essentially the derivation in Section 3.

Proposition 16 (Tsochantaridis et al. [2005]) The minimizer f̂^{sm}(S) can be written as in Corollary 15, where the expansion coefficients can be computed from the solution of the following convex quadratic program:

    α* = argmin_α (1/2) Σ_{i,j=1}^n Σ_{y≠yᵢ} Σ_{y′≠yⱼ} α_{iy} α_{jy′} K_{iy,jy′} − Σ_{i=1}^n Σ_{y≠yᵢ} α_{iy}    (104a)
    s.t.  α_{iy} ≥ 0, ∀ i ∈ [n], y ∈ Y;   Σ_{y≠yᵢ} α_{iy} ≤ 1/(λn), ∀ i ∈ [n],    (104b)

where K_{iy,jy′} := k((xᵢ, yᵢ), (xⱼ, yⱼ)) − k((xᵢ, yᵢ), (xⱼ, y′)) − k((xᵢ, y), (xⱼ, yⱼ)) + k((xᵢ, y), (xⱼ, y′)).

Proof The primal program for soft-margin maximization can be expressed as

    min_{f∈H, ξ} R(f, ξ; S) := (λ/2) ‖f‖²_H + (1/n) Σ_{i=1}^n ξᵢ,
    s.t. f(xᵢ, yᵢ) − f(xᵢ, y) ≥ 1 − ξᵢ, ξᵢ ≥ 0, ∀ i ∈ [n], ∀ y ≠ yᵢ.

Introducing dual variables α_{iy} for the margin constraints and ηᵢ for the non-negativity constraints on ξᵢ yields the Lagrange function (the objective has been divided by λ)

    L(f, ξ, α, η) := (1/2) ⟨f, f⟩_H + (1/(λn)) Σ_{i=1}^n ξᵢ − Σ_{i=1}^n Σ_{y≠yᵢ} α_{iy} [ f(xᵢ, yᵢ) − f(xᵢ, y) − 1 + ξᵢ ] − Σ_{i=1}^n ξᵢ ηᵢ.

Solving for f results in

    ∂_f L(f, α) = 0  ⟺  f(·) = Σ_{i=1}^n Σ_{y≠yᵢ} α_{iy} [ k(·, (xᵢ, yᵢ)) − k(·, (xᵢ, y)) ],
since for f ∈ H we have ∂_f f(x, y) = k(·, (x, y)). Solving for ξᵢ implies Σ_{y≠yᵢ} α_{iy} ≤ 1/(λn), and plugging the solution back into L yields the (negative) objective function in the claim. Finally, note that the representation in (103) can be obtained by identifying

    β_{iy} := −α_{iy} for y ≠ yᵢ,  and  β_{iyᵢ} := Σ_{y≠yᵢ} α_{iy}.

Overall one gets a balance equation Σ_{y∈Y} β_{iy} = 0 for every data point. Notice also that in the final expansion, the contributions k(·, (xᵢ, yᵢ)) will get nonnegative weights, whereas the contributions k(·, (xᵢ, y)) for y ≠ yᵢ will get nonpositive weights.

• The multiclass SVM formulation of Crammer and Singer [2001] can be recovered as a special case for kernels that are diagonal with respect to the outputs, i.e. k((x, y), (x′, y′)) = δ_{y,y′} k(x, x′). Notice that in this case the matrix in the quadratic part of Eq. (104a) simplifies to K_{iy,jy′} = k(xᵢ, xⱼ) ( δ_{yᵢ,yⱼ} − δ_{yᵢ,y′} − δ_{y,yⱼ} + δ_{y,y′} ).

• The pairs (xᵢ, y) for which α_{iy} > 0 are the support pairs, generalizing the notion of support vectors. Each of these Lagrange multipliers is linked to the corresponding soft-margin constraint f(xᵢ, yᵢ) − f(xᵢ, y) ≥ 1 − ξᵢ. As in binary SVMs, their number can be much smaller than the total number of constraints.

5.1.6 Sparse Approximation

Proposition 16 shows that sparseness in the representation of f̂^{sm} is linked to the fact that only few α_{iy} in the solution of the dual problem in Eq. (104) are nonzero, i.e. that only few constraints are active at the optimal solution. While this may or may not be the case for a given sample, one can still exploit this observation to define a nested sequence of relaxations, where margin constraints are incrementally added. This corresponds to a constraint selection algorithm (cf. Bertsimas and Tsitsiklis [1997]) for the primal or, equivalently, a variable selection or column generation method for the dual program, and has been investigated in Tsochantaridis et al. [2005]. Solving a sequence of increasingly tighter relaxations of a mathematical program is also known as an outer approximation. In particular, one may iterate through the training examples according to some (fair) visitation schedule and greedily select constraints that are most violated at the current solution f, i.e. for the i-th instance one computes

    ŷᵢ = argmax_{y≠yᵢ} f(xᵢ, y) = argmax_{y≠yᵢ} p(y|xᵢ; f)    (105)

and then strengthens the current relaxation by including α_{iŷᵢ} in the optimization of the dual if f(xᵢ, yᵢ) − f(xᵢ, ŷᵢ) < 1 − ξᵢ − ε. Here ε > 0 is a predefined tolerance parameter.

It is important to understand how many such strengthening steps are necessary to achieve a reasonably close approximation to the original problem. The following theorem provides an answer:

Theorem 17 (Tsochantaridis et al. [2005]) Let R̄² = max_{i,y} K_{iy,iy} and choose ε > 0. A sequential strengthening procedure, which optimizes Eq. (101) by greedily selecting ε-violated constraints, will find an approximate solution in which all constraints are fulfilled within a precision of ε, i.e. r(xᵢ, yᵢ; f) ≥ 1 − ξᵢ − ε, after at most (2n/ε) · max{ 1, 4R̄²/(λn²ε) } steps.
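The strengthening step around (105) can be sketched as a working-set pass: visit each example, compute the best runner-up label, and record the constraint if it is violated by more than ε. The score function, slack values, and tolerance below are illustrative assumptions:

```python
def most_violated(f, x, y, xi, labels, eps):
    """Greedy constraint selection as in Eq. (105): find the maximizer
    y_hat over y' != y and flag it if its margin constraint is violated
    by more than eps."""
    y_hat = max((yp for yp in labels if yp != y), key=lambda yp: f(x, yp))
    violated = f(x, y) - f(x, y_hat) < 1.0 - xi - eps
    return y_hat, violated

def working_set_pass(f, sample, xi, labels, eps=0.1):
    """One (fair) pass over the training examples, collecting the
    constraints to add to the current relaxation of the dual."""
    additions = []
    for i, (x, y) in enumerate(sample):
        y_hat, violated = most_violated(f, x, y, xi[i], labels, eps)
        if violated:
            additions.append((i, y_hat))
    return additions

f = lambda x, y: {0: 0.0, 1: x, 2: -x}[y]      # toy scores, illustrative only
sample = [(2.0, 1), (0.4, 1), (-1.5, 2)]
xi = [0.0, 0.0, 0.0]
adds = working_set_pass(f, sample, xi, labels=(0, 1, 2))
```

In a full implementation, each pass would be followed by re-solving the restricted dual over the enlarged working set, repeating until no ε-violated constraint remains.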
• The remaining key problem is how to compute Eq. (105) efficiently. The answer depends on the specific form of the joint kernel k and/or the feature map Φ. In many cases, efficient dynamic programming techniques exist, whereas in other cases one has to resort to approximations or to other methods for identifying a set of candidate distractors Yᵢ ⊂ Y for a training pair (xᵢ, yᵢ) (cf. Collins [2000]). Sometimes one may also have search heuristics available that may not find the solution of Eq. (105), but that find (other) violating constraints with reasonable computational effort.

• Combined with an efficient QP solver, the above theorem guarantees a runtime polynomial in n, ε^{−1}, R̄, and λ^{−1}. This holds irrespective of special properties of the data set utilized; the only dependency on the sample points xᵢ is through the radius R̄.

Corollary 18 Denote by (f̂, ξ̂) the optimal solution of a relaxation of the problem in Proposition 16, minimizing R(f, ξ; S) while violating no constraint by more than ε (cf. Theorem 17). Then

    R(f̂, ξ̂; S) ≤ R(f̂^{sm}, ξ*; S) ≤ R(f̂, ξ̂; S) + ε,

where (f̂^{sm}, ξ*) is the optimal solution of the original problem.

Proof The first inequality follows from the fact that (f̂, ξ̂) is an optimal solution of a relaxation of the original problem. The second follows from the observation that, by setting ξ̃ = ξ̂ + ε, one gets an admissible solution (f̂, ξ̃) of the original problem such that R(f̂, ξ̃; S) = (λ/2)‖f̂‖²_H + (1/n) Σ_{i=1}^n (ξ̂ᵢ + ε) = R(f̂, ξ̂; S) + ε.

5.1.7 Generalized Gaussian Processes Classification

The model in Eq. (96) and the minimization of the regularized log-loss can be interpreted as a generalization of Gaussian process classification (Rasmussen and Williams [2006], Altun et al. [2004b]) by assuming that (f(x, ·))_{x∈X} is a vector-valued zero-mean Gaussian process; note that the covariance function C is defined over pairs X × Y. For a given sample S, define a multi-index vector F(S) := (f(xᵢ, y))_{i,y} as the restriction of the stochastic process f to the augmented sample S̃. Denote the kernel matrix by K = (K_{iy,jy′}), where K_{iy,jy′} := C((xᵢ, y), (xⱼ, y′)), with indices i, j ∈ [n] and y, y′ ∈ Y, so that in summary F(S) ∼ N(0, K). This induces a predictive model via Bayesian model integration according to

    p(y|x; S) = ∫ p(y|F(x, ·)) p(F|S) dF,    (106)

where x is a test point that has been included in the sample (transductive setting). For an i.i.d. sample, the log-posterior for F can be written as

    ln p(F|S) = −(1/2) Fᵀ K^{−1} F + Σ_{i=1}^n [ f(xᵢ, yᵢ) − g(xᵢ, f) ] + const.    (107)

Invoking the representer theorem for F̂(S) := argmax_F ln p(F|S), we know that

    F̂(S)_{iy} = Σ_{j=1}^n Σ_{y′∈Y} α_{jy′} K_{iy,jy′},    (108)

which we plug into Eq. (107) to arrive at

    min_α (1/2) αᵀKα − Σ_{i=1}^n [ αᵀK e_{iyᵢ} − log Σ_{y∈Y} exp( αᵀK e_{iy} ) ],    (109)
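The finite-dimensional problem (109) can be solved by simple gradient ascent on the log-posterior, parameterized through F = Kα. The sketch below uses a tiny output-diagonal joint kernel and two training points, both hypothetical, and recovers a MAP estimate that favors the observed labels:

```python
import math

def gp_map(K, yi_index, n, num_labels, lr=0.1, steps=2000):
    """Gradient ascent on the log-posterior (107) restricted to the span
    of the kernel matrix, i.e. F = K alpha, which is the objective (109).
    K is the (n*|Y|) x (n*|Y|) kernel matrix over the augmented sample,
    indexed by a = i * num_labels + y."""
    m = n * num_labels
    alpha = [0.0] * m
    for _ in range(steps):
        F = [sum(K[a][b] * alpha[b] for b in range(m)) for a in range(m)]
        gF = [0.0] * m      # gradient of the likelihood term w.r.t. F
        for i in range(n):
            block = range(i * num_labels, (i + 1) * num_labels)
            z = sum(math.exp(F[a]) for a in block)
            for y, a in enumerate(block):
                gF[a] = (1.0 if y == yi_index[i] else 0.0) - math.exp(F[a]) / z
        # d/d alpha [ -1/2 alpha^T K alpha + sum_i (...) ] = K (gF - alpha)
        grad = [sum(K[a][b] * (gF[b] - alpha[b]) for b in range(m))
                for a in range(m)]
        alpha = [ai + lr * gi for ai, gi in zip(alpha, grad)]
    return alpha

# output-diagonal joint kernel over two inputs and two labels (hypothetical)
base = [[1.0, math.exp(-1.0)], [math.exp(-1.0), 1.0]]
K = [[base[a // 2][b // 2] if a % 2 == b % 2 else 0.0 for b in range(4)]
     for a in range(4)]
alpha = gp_map(K, yi_index=[0, 1], n=2, num_labels=2)
F = [sum(K[a][b] * alpha[b] for b in range(4)) for a in range(4)]
```

Since the objective is concave in α for positive semidefinite K, the fixed-step ascent above converges to the unique MAP estimate on this toy problem.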
where e_{iy} denotes the respective unit vector. Notice that for f(·) = Σ_{i,y} α_{iy} k(·, (xᵢ, y)), the first term is equivalent to the squared RKHS norm of f ∈ H, since ⟨f, f⟩_H = Σ_{i,j} Σ_{y,y′} α_{iy} α_{jy′} ⟨k(·, (xᵢ, y)), k(·, (xⱼ, y′))⟩_H, and the latter inner product reduces to k((xᵢ, y), (xⱼ, y′)) due to the reproducing property. Again, the key issue in solving Eq. (109) is how to achieve sparseness in the expansion for F̂.

5.2 Markov Networks and Kernels

In Section 5.1 no assumptions about the specific structure of the joint kernel defining the model in Eq. (96) have been made. In the following, we will focus on a more specific setting with multiple outputs, where dependencies are modeled by a conditional independence graph. This approach is motivated by the fact that independently predicting individual responses based on marginal response models will often be suboptimal, and explicitly modeling these interactions can be of crucial importance.

5.2.1 Markov Networks and Factorization Theorem

Denote predictor variables by X, response variables by Y, and define Z := (X, Y) with associated sample space Z. We use Markov networks as the modeling formalism for representing dependencies between covariates and response variables as well as interdependencies among response variables.

Definition 19 A conditional independence graph (or Markov network) is an undirected graph G = (Z, E) such that for any pair of variables (Zᵢ, Zⱼ) ∉ E if and only if Zᵢ ⫫ Zⱼ | Z − {Zᵢ, Zⱼ}.

The above definition is based on the pairwise Markov property, but by virtue of the separation theorem (cf. Whittaker [1990]) this implies the global Markov property for distributions with full support. The global Markov property says that for disjoint subsets U, V, W ⊆ Z, where W separates U from V in G, one has U ⫫ V | W. Even more important in the context of this paper is the factorization result due to Hammersley and Clifford [1971] and Besag [1974].

Theorem 20 Given a random vector Z with conditional independence graph G, any density function for Z with full support factorizes over C(G), the set of maximal cliques of G, as follows:

    p(z) = exp[ Σ_{c∈C(G)} f_c(z_c) ],    (110)

where the f_c are clique compatibility functions depending on z only via the restriction to the clique configurations z_c. The significance of this result is that, in order to specify a distribution for Z, one only needs to specify or estimate the simpler functions f_c.

5.2.2 Kernel Decomposition over Markov Networks

It is of interest to analyze the structure of kernels k that generate Hilbert spaces H of functions that are all compatible with a given graph.

Definition 21 A function f : Z → ℝ is compatible with a conditional independence graph G if f decomposes additively as f(z) = Σ_{c∈C(G)} f_c(z_c) with suitably chosen functions f_c. A Hilbert space H over Z is compatible with G if every function f ∈ H is compatible with G. Such f and H are also called G-compatible.
Proposition 22 Let H with kernel k be a G-compatible RKHS. Then there are functions k_{cd} : Z_c × Z_d → ℝ such that the kernel decomposes as

    k(u, z) = Σ_{c,d∈C(G)} k_{cd}(u_c, z_d).

Proof Let z ∈ Z; then k(·, z) ∈ H, and by assumption for all u, z ∈ Z there exist functions k_c(·, z) such that k(u, z) = Σ_c k_c(u_c, z). The left-hand sum involves functions restricted to cliques of the first argument. By symmetry, k(u, z) = k(z, u) = Σ_d k_d(z_d, u), where the right-hand sum involves functions restricted to cliques of the second argument. The following simple lemma then shows that there have to be functions k_{cd}(u_c, z_d) as claimed.

Lemma 23 Let X be a set of n-tuples and fᵢ, gᵢ : X × X → ℝ for i ∈ [n] functions such that fᵢ(x, y) = fᵢ(xᵢ, y) and gⱼ(x, y) = gⱼ(x, yⱼ) for all x, y. If Σᵢ fᵢ(xᵢ, y) = Σⱼ gⱼ(x, yⱼ) for all x, y, then there exist functions hᵢⱼ such that Σᵢ fᵢ(xᵢ, y) = Σ_{i,j} hᵢⱼ(xᵢ, yⱼ).

• Proposition 22 is useful for the design of kernels, since it states that only kernels allowing an additive decomposition into local functions k_{cd} are compatible with a given Markov network G. Lafferty et al. [2004] have pursued a similar approach by considering kernels for RKHS with functions defined over Z_C := {(c, z_c) : c ∈ C(G), z_c ∈ Z_c}. In the latter case one can even deal with situations where the conditional dependency graph is (potentially) different for every instance.

• An illuminating example of how to design kernels via the decomposition in Proposition 22 is the case of conditional Markov chains, for which models based on joint kernels have been proposed in Lafferty et al. [2001], Altun et al. [2004b], Taskar et al. [2003], Lafferty et al. [2004]. Given an input sequence X = (X_t)_{t∈[T]}, the goal is to predict a sequence of labels or class variables Y = (Y_t)_{t∈[T]}, Y_t ∈ Σ. Dependencies between class variables are modeled in terms of a Markov chain, whereas outputs Y_t are assumed to depend (directly) on an observation window (X_{t−r}, ..., X_t, ..., X_{t+r}). Notice that this goes beyond the standard hidden Markov model structure (cf. Rabiner [1989]) by allowing for overlapping features (r ≥ 1). For simplicity we focus on a window size of r = 1, in which case the clique set is given by C := { c_t := (x_t, y_t, y_{t+1}), c̄_t := (x_{t+1}, y_t, y_{t+1}) : t ∈ [T − 1] }. We assume an input kernel k is given and introduce indicator vectors (or dummy variates) I(Y_{t,t+1}) := (I_{ω,ω′}(Y_{t,t+1}))_{ω,ω′∈Σ}. Now we can define the local kernel functions as

    k_{cd}(z_c, z_d) := ⟨I(y_{s,s+1}), I(y_{t,t+1})⟩ × { k(x_s, x_t) if c = c_s and d = c_t;  k(x_{s+1}, x_{t+1}) if c = c̄_s and d = c̄_t }.    (111)

Notice that the inner product between indicator vectors is zero unless the variable pairs are in the same configuration.

Conditional Markov chain models have found widespread applications: in natural language processing (e.g. for part-of-speech tagging and shallow parsing, cf. Sha and Pereira [2003]), in information retrieval (e.g. for information extraction, cf. McCallum et al. [2005]), and in computational biology (e.g. for gene prediction, cf. Culotta et al. [2005]).
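For chain-structured models of this kind, the decoding problem argmax_y f(x, y) needed, e.g., in (105), decomposes over the cliques and is solved by the Viterbi dynamic program in O(T |Σ|²) time instead of enumerating all |Σ|^T sequences. A dependency-free sketch with unary (label-observation) and transition (label-label) scores, both illustrative:

```python
def viterbi(unary, transition):
    """Argmax decoding for a chain-structured score
    f(x, y) = sum_t unary[t][y_t] + sum_t transition[y_t][y_{t+1}]."""
    T, S = len(unary), len(unary[0])
    score = [unary[0][:]]      # best score of a prefix ending in each label
    back = []                  # backpointers
    for t in range(1, T):
        prev = score[-1]
        cur, ptr = [], []
        for y in range(S):
            best = max(range(S), key=lambda yp: prev[yp] + transition[yp][y])
            ptr.append(best)
            cur.append(prev[best] + transition[best][y] + unary[t][y])
        score.append(cur)
        back.append(ptr)
    y = max(range(S), key=lambda s: score[-1][s])
    path = [y]
    for ptr in reversed(back):  # trace the backpointers
        y = ptr[y]
        path.append(y)
    return list(reversed(path))

# two labels over three steps; transitions reward staying in the same label
unary = [[2.0, 0.0], [0.0, 0.5], [0.0, 2.0]]
transition = [[0.4, 0.0], [0.0, 0.4]]
path = viterbi(unary, transition)
```

Replacing the max by a sum (in log-space) turns the same recursion into the forward pass of the forward-backward algorithm, which yields the marginals needed for the expectations E_f[Φ(Y, x)] in the log-likelihood gradient.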
5.2.3 Clique-based Sparse Approximation

Proposition 22 immediately leads to an alternative version of the representer theorem, as observed by Lafferty et al. [2004] and Altun et al. [2004b].

Corollary 24 If H is G-compatible, then in the same setting as in Corollary 15 the optimizer f̂ can be written as

    f̂(u) = Σ_{i=1}^n Σ_{c∈C} Σ_{y_c∈Y_c} β^i_{c,y_c} Σ_{d∈C} k_{cd}((x_{ic}, y_c), u_d),    (112)

where x_{ic} are the variables of xᵢ belonging to clique c, and Y_c is the subspace of Z_c that contains response variables.

Proof According to Proposition 22, the kernel k of H can be written as k(z, u) = Σ_{c,d∈C} k_{cd}(z_c, u_d). Plugging this into the expansion of Corollary 15 yields

    f̂(u) = Σ_{i=1}^n Σ_{y∈Y} β_{iy} Σ_{c,d∈C} k_{cd}((x_{ic}, y_c), u_d).

Setting β^i_{c,y_c} := Σ_y δ_{y_c, (y)_c} β_{iy}, i.e. summing the β_{iy} over all y that restrict to y_c on clique c, completes the proof.

• Notice that the number of parameters in the representation of Eq. (112) scales with n · Σ_{c∈C} |Y_c|, as opposed to n · |Y| in Eq. (103). For cliques with reasonably small state spaces this is a significantly more compact representation. Notice also that the evaluation of the functions k_{cd} will typically be more efficient than evaluating k.

• In spite of this improvement, the number of terms in the expansion in Eq. (112) may in practice still be too large. In this case, one can pursue a reduced set approach, which selects a subset of variables to be included in a sparsified expansion. For instance, in Lafferty et al. [2004] parameters β_{c,y_c} that maximize the functional gradient of the regularized log-loss are greedily included in the reduced set. In Taskar et al. [2003] a similar selection criterion is utilized with respect to margin violations, leading to an SMO-like optimization algorithm (cf. Platt [1999]). This has been proposed in Taskar et al. [2003] for the soft-margin maximization problem as well as in Lafferty et al. [2004] and Altun et al. [2004a] for conditional random fields and Gaussian processes.

5.2.4 Probabilistic Inference

In dealing with structured or interdependent response variables, the computation of marginal probabilities of interest or of the most probable response (cf. Eq. (105)) may be nontrivial. However, for dependency graphs with small tree width, efficient inference algorithms exist, such as the junction tree algorithm (cf. Jensen et al. [1990], Dawid [1992]) and variants thereof. Notice that in the case of the conditional or hidden Markov chain, the junction tree algorithm is equivalent to the well-known forward-backward algorithm [Baum, 1972]. Recently, a number of approximate inference algorithms have been developed to deal with dependency graphs for which exact inference is not tractable (cf. Wainwright and Jordan [2003]).

6 Kernel Methods for Unsupervised Learning

This section discusses various methods of data analysis by modeling the distribution of data in feature space. To that extent, we study the behavior of Φ(x) by means of rather simple linear methods, which has implications for nonlinear methods on the original data space X. In particular, we will discuss the extension of PCA to Hilbert spaces, which allows for image denoising, clustering, and nonlinear dimensionality reduction; the study of covariance operators for the measure of independence; the study of mean operators for the design of two-sample tests; and the modeling of complex dependencies between sets of random variables via kernel dependency estimation and canonical correlation analysis.
6.1 Kernel Principal Component Analysis

Principal Component Analysis (PCA) is a powerful technique for extracting structure from possibly high-dimensional data sets. It is readily performed by solving an eigenvalue problem, or by using iterative algorithms which estimate principal components. For reviews of the existing literature, see Jolliffe [1986] and Diamantaras and Kung [1996]; some of the classical papers are Pearson [1901], Hotelling [1933], and Karhunen [1946].

PCA is an orthogonal transformation of the coordinate system in which we describe our data. The new coordinate system is obtained by projection onto the so-called principal axes of the data. A small number of principal components is often sufficient to account for most of the structure in the data.

The basic idea is strikingly simple: denote by X = {x1, ..., xn} an n-sample drawn from P(x). Then the covariance operator C is given by C = E[(x − E[x])(x − E[x])⊤]. PCA aims at estimating the leading eigenvectors of C via the empirical estimate Cemp = Eemp[(x − Eemp[x])(x − Eemp[x])⊤]. If X is d-dimensional, the eigenvectors can be computed in O(d³) time [Press et al., 1994].

The problem can also be posed in feature space [Schölkopf et al., 1998] by replacing x with Φ(x). In this case, however, it is impossible to compute the eigenvectors directly. Yet, note that the image of Cemp lies in the span of {Φ(x1), ..., Φ(xn)}. Hence it is sufficient to diagonalize Cemp in that subspace. In other words, we replace the outer product Cemp by an inner product matrix, leaving the eigenvalues unchanged, which can be computed efficiently. Using w = Σi αi Φ(xi) it follows that α needs to satisfy PKPα = λα, where P is the centering projection operator with Pij = δij − n⁻¹ and K is the kernel matrix on X.

Kernel PCA has been applied to numerous problems, from preprocessing and invariant feature extraction [Mika et al., 2003] to image denoising [Mika et al., 1999], super-resolution [Kim et al., 2005], and principal component regression [Draper and Smith, 1981]. In the denoising applications, the basic idea is to obtain a set of principal directions w1, ..., wl in feature space, obtained from noise-free data, e.g., by simultaneous diagonalization of the data and a noise covariance matrix; this means that the projections onto the leading eigenvectors correspond to the most reliable features. One then projects the image Φ(x) of a noisy observation x onto the space spanned by w1, ..., wl. This yields a "denoised" solution Φ̃(x) in feature space. Subsequently, to obtain the pre-image of this denoised solution, one minimizes ‖Φ(x′) − Φ̃(x)‖. The fact that projections onto the leading principal components turn out to be good starting points for pre-image iterations is further exploited in kernel dependency estimation (Section 6.4). Kernel PCA can be shown to contain several popular dimensionality reduction algorithms as special cases, including LLE, Laplacian Eigenmaps, and (approximately) Isomap [Ham et al., 2004].

Note that the problem can also be posed as one of maximizing some Contrast[f, X] subject to f ∈ F. This view allows us to unify various feature extraction methods as follows:
• For Contrast[f, X] = Varemp[f, X] and F = {⟨w, x⟩ subject to ‖w‖ ≤ 1} we recover PCA.
• Changing F to F = {⟨w, Φ(x)⟩ subject to ‖w‖ ≤ 1} we recover Kernel PCA.
• For Contrast[f, X] = Kurtosis[f, X] and F = {⟨w, x⟩ subject to ‖w‖ ≤ 1} we have Projection Pursuit [Huber, 1985; Friedman and Tukey, 1974]. Other contrasts, e.g., the Epanechnikov kernel, entropic contrasts, etc., lead to further variants [Cook et al., 1993; Friedman, 1987; Jones and Sibson, 1987].
• If F is a convex combination of basis functions and the contrast function is convex in w, one obtains computationally efficient algorithms, as the solution of the optimization problem can be found at one of the vertices [Rockafellar, 1970].
Subsequent projections are obtained, e.g., by seeking directions orthogonal to f, or by other computationally attractive variants thereof.
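The centered eigenproblem PKPα = λα described above, followed by projection onto the resulting directions, is compact enough to sketch directly. The following is a minimal NumPy illustration, not the authors' code; the Gaussian kernel, the random toy data, and all function names are assumptions made for the example:

```python
import numpy as np

def rbf_kernel(X, gamma=1.0):
    # k(x, x') = exp(-gamma * ||x - x'||^2)
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T
    return np.exp(-gamma * d2)

def kernel_pca(K, n_components=2):
    """Kernel PCA via the centered eigenproblem P K P alpha = lambda alpha."""
    n = K.shape[0]
    P = np.eye(n) - np.ones((n, n)) / n   # centering projection, P_ij = delta_ij - 1/n
    Kc = P @ K @ P
    lam, alpha = np.linalg.eigh(Kc)       # eigh returns ascending eigenvalues
    lam, alpha = lam[::-1], alpha[:, ::-1]
    # scale alpha so the feature-space directions w = sum_i alpha_i Phi(x_i) have unit norm
    alpha = alpha[:, :n_components] / np.sqrt(np.maximum(lam[:n_components], 1e-12))
    # projections of the training points onto the leading principal directions
    return Kc @ alpha, lam[:n_components]

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
Z, lam = kernel_pca(rbf_kernel(X), n_components=2)
print(Z.shape)  # (50, 2)
```

Normalizing each eigenvector by the square root of its eigenvalue makes the corresponding feature-space direction unit length, so the columns of Z are the kernel PCA coordinates of the training points.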
6.2 Canonical Correlation and Measures of Independence

Given two samples X, Y, canonical correlation analysis (CCA) [Hotelling, 1936] aims at finding directions of projection u, v such that the correlation coefficient between X and Y is maximized. That is, (u, v) are given by

argmax_{u,v} Varemp[⟨u, x⟩]^{−1/2} Varemp[⟨v, y⟩]^{−1/2} Eemp[⟨u, x − Eemp[x]⟩ ⟨v, y − Eemp[y]⟩].   (113)

This problem can be solved by finding the eigensystem of Cx^{−1/2} Cxy Cy^{−1/2}, where Cx, Cy are the covariance matrices of X and Y and Cxy is the covariance matrix between X and Y. Multivariate extensions are discussed in Kettenring [1971].

CCA can be extended to kernels by means of replacing linear projections ⟨u, x⟩ by projections in feature space ⟨u, Φ(x)⟩. More specifically, Bach and Jordan [2002] used the so-derived contrast to obtain a measure of independence and applied it to Independent Component Analysis with great success. The problem is that bounding the variance of each projection is not a good way to control the capacity of the function class; the formulation then requires an additional regularization term to prevent the resulting optimization problem from becoming distribution independent. Instead, one may modify the correlation coefficient by normalizing by the norms of u and v. Hence we maximize

‖u‖^{−1} ‖v‖^{−1} Eemp[⟨u, x − Eemp[x]⟩ ⟨v, y − Eemp[y]⟩],   (114)

which can be solved by diagonalization of the covariance matrix Cxy. One may check that in feature space this amounts to finding the eigenvectors of (P Kx P)^{1/2} (P Ky P)^{1/2} [Gretton et al., 2005a]. Here Kx and Ky are the kernel matrices on X and Y, respectively. Martin [2000], De Cock and De Moor [2002], and Vishwanathan et al. [2006] show that degrees of maximum correlation can also be viewed as an inner product between subspaces spanned by the observations.

6.3 Measures of Independence

Rényi [1959] showed that independence between random variables is equivalent to the condition of vanishing covariance Cov[f(x), g(y)] = 0 for all C¹ functions f, g bounded by L∞ norm 1 on X and Y. That is, one studies

Λ(X, Y, F, G) := sup Covemp[f(x), g(y)] subject to f ∈ F and g ∈ G.   (115)

This statistic is often extended to use the entire series Λ1, ..., Λd of maximal correlations, where each of the function pairs (fi, gi) is orthogonal to the previous set of terms. The choice of function class matters: Dauxois and Nkiet [1998] restrict F, G to finite-dimensional linear function classes subject to their L2 norm bounded by 1, and restricted variants of canonical correlations are studied in Das and Sen [1994]. Bach and Jordan [2002] use functions in the RKHS for which some sum of the ℓ2n norm on the sample and the RKHS norm is bounded. Gretton et al. [2005a] use functions with bounded RKHS norm only, which provides necessary and sufficient criteria if kernels are universal: Λ(X, Y, F, G) = 0 if and only if x and y are independent. In Gretton et al. [2005a] a constrained empirical estimate of the above criterion is used, which is convenient as it allows for incomplete Cholesky factorizations. Moreover, tr(P Kx P Ky P) has the same theoretical properties and can be computed much more easily [Gretton et al., 2005b].

The above criteria can be used to derive algorithms for Independent Component Analysis [Bach and Jordan, 2002; Gretton et al., 2005a]. In this case one seeks rotations of the data S = V X, where V ∈ SO(d), such that S is coordinate-wise independent, i.e., P(S) = Πi P(si). This approach provides the currently best performing independence criterion, although it comes at a considerable computational cost. For fast algorithms, one should consider the work of Cardoso [1998], Hyvärinen et al. [2001], and Lee et al. [2000].
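The trace statistic tr(P Kx P Ky P) mentioned above requires nothing beyond the two kernel matrices and the centering matrix P, which makes it easy to illustrate. The sketch below is an assumption-laden toy example, not the paper's implementation: the RBF kernels, the 1/n² scaling, the data, and the function name hsic are all choices made for illustration.

```python
import numpy as np

def rbf(X, gamma=1.0):
    # Gaussian kernel matrix on a sample X
    sq = np.sum(X**2, axis=1)
    return np.exp(-gamma * (sq[:, None] + sq[None, :] - 2 * X @ X.T))

def hsic(Kx, Ky):
    """Empirical dependence statistic tr(P Kx P Ky P) / n^2, with P = I - (1/n) 11^T."""
    n = Kx.shape[0]
    P = np.eye(n) - np.ones((n, n)) / n
    return np.trace(P @ Kx @ P @ Ky @ P) / n**2

rng = np.random.default_rng(1)
x = rng.normal(size=(200, 1))
y_dep = x + 0.1 * rng.normal(size=(200, 1))   # strongly dependent on x
y_ind = rng.normal(size=(200, 1))             # drawn independently of x
print(hsic(rbf(x), rbf(y_dep)) > hsic(rbf(x), rbf(y_ind)))  # dependence yields a larger value
```

Since P is idempotent, the statistic can equivalently be written as tr(P Kx P Ky); centering both kernel matrices is what removes the mean contribution of each feature map.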
6.4 Kernel Dependency Estimation

A large part of the previous discussion revolved around estimating dependencies between samples X and Y for rather structured spaces Y, in particular (83). In general, such dependencies can be hard to compute. Weston et al. [2003] proposed an algorithm which allows one to extend standard regularized LS regression models, as described in Section 3, to cases where Y has complex structure. It works by recasting the estimation problem as a linear estimation problem for the map f : Φ(x) → Φ(y), and then as a nonlinear pre-image estimation problem for finding ŷ := argmin_y ‖f(x) − Φ(y)‖ as the point in Y closest to f(x). Since Φ(y) lies in a possibly infinite-dimensional space, which may lead to problems of capacity of the estimator, Weston et al. [2003] restrict f to the leading principal components of Φ(y) and subsequently perform regularized LS regression onto each of the projections vj separately, yielding functions fj. This yields an estimate of f via f(x) = Σj vj fj(x), and it works well on strings and other structured data. The problem can also be solved directly [Cortes et al., 2005] without the need for subspace projections; the authors apply it to the analysis of sequence data.

Note that a similar approach can be used to develop two-sample tests based on kernel methods. The basic idea is that for universal kernels the map between distributions and points on the marginal polytope, μ : p → E_{x∼p}[φ(x)], is bijective, and consequently it imposes a norm on distributions. The corresponding distance d(p, q) := ‖μ[p] − μ[q]‖ leads to a U-statistic which allows one to compute empirical estimates of distances between distributions efficiently [Borgwardt et al., 2006]. This builds on the ideas of Fortet and Mourier [1953]. Also the work of Chen and Bickel [2005] and Yang and Amari [1997] is of interest in this context.

7 Conclusion

We have summarized some of the advances in the field of machine learning with positive definite kernels. Due to lack of space this article is by no means comprehensive; in particular, we were not able to cover statistical learning theory, which is often cited as providing theoretical support for kernel methods. We nevertheless hope that the main ideas that make kernel methods attractive became clear. In particular, these include the fact that kernels address the following three major issues of learning and inference:
• they formalize the notion of similarity of data
• they provide a representation of the data in an associated reproducing kernel Hilbert space
• they characterize the function class used for estimation via the representer theorem (see Eqs. (46) and (112))

We have explained a number of approaches where kernels are useful. Many of them involve the substitution of kernels for dot products, thus turning a linear geometric algorithm into a nonlinear one. This way, one obtains SVMs from hyperplane classifiers, and kernel PCA from linear PCA. There is, however, a more recent method of constructing kernel algorithms, where the starting point is not a linear algorithm, but a linear criterion (e.g., that two random variables have zero covariance, or that the means of two samples are identical), which can be turned into a condition involving an optimization over a large function class using kernels, thus yielding tests for independence of random variables, or tests for solving the two-sample problem. We believe that these works, as well as the increasing amount of work on the use of kernel methods for structured data, illustrate that we can expect significant further progress in the field in the years to come.
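The two-sample distance d(p, q) = ‖μ[p] − μ[q]‖ discussed in Section 6.4 can be expanded purely in kernel evaluations, since ‖μ[p] − μ[q]‖² = E[k(x, x′)] − 2 E[k(x, y)] + E[k(y, y′)] for x, x′ ∼ p and y, y′ ∼ q. The sketch below uses the simple biased (V-statistic) form of the empirical estimate rather than the U-statistic of Borgwardt et al.; the RBF kernel, the toy Gaussians, and the function name mmd2 are illustrative assumptions.

```python
import numpy as np

def rbf(A, B, gamma=1.0):
    # cross-kernel matrix k(a_i, b_j) = exp(-gamma * ||a_i - b_j||^2)
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * d2)

def mmd2(X, Y, gamma=1.0):
    """Biased empirical estimate of ||mu[p] - mu[q]||^2 from samples X ~ p and Y ~ q."""
    return rbf(X, X, gamma).mean() - 2 * rbf(X, Y, gamma).mean() + rbf(Y, Y, gamma).mean()

rng = np.random.default_rng(2)
X = rng.normal(0.0, 1.0, size=(300, 1))
Y_same = rng.normal(0.0, 1.0, size=(300, 1))   # same distribution as X
Y_diff = rng.normal(1.5, 1.0, size=(300, 1))   # mean-shifted distribution
print(mmd2(X, Y_diff) > mmd2(X, Y_same))  # distinct distributions give a larger distance
```

Removing the diagonal terms of the within-sample kernel averages turns this biased estimate into the unbiased U-statistic form mentioned in the text.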
Acknowledgments

The main body of this paper was written at the Mathematisches Forschungsinstitut Oberwolfach; the authors would like to express thanks for the support during that time. We thank Florian Steinke, Matthias Hein, Jakob Macke, Conrad Sanderson and Holger Wendland for comments and suggestions. This work was supported by grants of the ARC and by the Pascal Network of Excellence. National ICT Australia is funded through the Australian Government's Backing Australia's Ability initiative, in part through the Australian Research Council.

References

M. A. Aizerman, É. M. Braverman, and L. I. Rozonoér. Theoretical foundations of the potential function method in pattern recognition learning. Automation and Remote Control, 25:821–837, 1964.
E. L. Allwein, R. E. Schapire, and Y. Singer. Reducing multiclass to binary: a unifying approach for margin classifiers. In P. Langley, editor, Proc. Intl. Conf. Machine Learning, pages 9–16. Morgan Kaufmann Publishers, San Francisco, CA, 2000.
N. Alon, S. Ben-David, N. Cesa-Bianchi, and D. Haussler. Scale-sensitive dimensions, uniform convergence, and learnability. In Proc. of the 34th Annual Symposium on Foundations of Computer Science, pages 292–301. IEEE Computer Society Press, Los Alamitos, CA, 1993.
Y. Altun, I. Tsochantaridis, and T. Hofmann. Hidden Markov support vector machines. In Proc. Intl. Conf. Machine Learning, 2003.
Y. Altun, T. Hofmann, and A. J. Smola. Gaussian process classification for segmenting and annotating sequences. In Proc. Intl. Conf. Machine Learning, 2004a.
Y. Altun, A. J. Smola, and T. Hofmann. Exponential families for conditional random fields. In Uncertainty in Artificial Intelligence (UAI), pages 2–9, 2004b.
N. Aronszajn. Theory of reproducing kernels. Transactions of the American Mathematical Society, 68:337–404, 1950.
F. R. Bach and M. I. Jordan. Kernel independent component analysis. Journal of Machine Learning Research, 3:1–48, 2002.
G. Bakir, T. Hofmann, B. Schölkopf, A. J. Smola, B. Taskar, and S. V. N. Vishwanathan, editors. Predicting Structured Data. MIT Press, Cambridge, Massachusetts, 2007.
D. Bamber. The area above the ordinal dominance graph and the area below the receiver operating characteristic graph. Journal of Mathematical Psychology, 12:387–415, 1975.
O. Barndorff-Nielsen. Information and Exponential Families in Statistical Theory. Wiley, Chichester, 1978.
A. R. Barron. Universal approximation bounds for superpositions of a sigmoidal function. IEEE Transactions on Information Theory, 39(3):930–945, May 1993.
P. L. Bartlett and S. Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3:463–482, 2002.
P. L. Bartlett, O. Bousquet, and S. Mendelson. Localized Rademacher averages. In Proc. Annual Conf. Computational Learning Theory, pages 44–58, 2002.
P. L. Bartlett, M. I. Jordan, and J. D. McAuliffe. Convexity, classification, and risk bounds. Technical Report 638, UC Berkeley, 2003.
P. L. Bartlett, P. M. Long, and R. C. Williamson. Fat-shattering and the learnability of real-valued functions. Journal of Computer and System Sciences, 52(3):434–452, 1996.
J. Basilico and T. Hofmann. Unifying collaborative and content-based filtering. In Proc. Intl. Conf. Machine Learning, 2004.
L. E. Baum. An inequality and associated maximization technique in statistical estimation of probabilistic functions of a Markov process. Inequalities, 3:1–8, 1972.
S. Ben-David, N. Eiron, and P. M. Long. On the difficulty of approximately maximizing agreements. J. Comput. Syst. Sci., 66(3):496–514, 2003.
K. P. Bennett and O. L. Mangasarian. Robust linear programming discrimination of two linearly inseparable sets. Optimization Methods and Software, 1:23–34, 1992.
K. P. Bennett, A. Demiriz, and J. Shawe-Taylor. A column generation algorithm for boosting. In P. Langley, editor, Proc. Intl. Conf. Machine Learning. Morgan Kaufmann Publishers, San Francisco, CA, Jan 2000.
C. Berg and G. Forst. Potential Theory on Locally Compact Abelian Groups. Springer-Verlag, Berlin, 1975.
C. Berg, J. P. R. Christensen, and P. Ressel. Harmonic Analysis on Semigroups. Springer-Verlag, New York, 1984.
D. Bertsimas and J. Tsitsiklis. Introduction to linear programming. Athena Scientific, 1997.
J. Besag. Spatial interaction and the statistical analysis of lattice systems (with discussion). Journal of the Royal Statistical Society B, 36(B):192–326, 1974.
V. Blanz, B. Schölkopf, H. Bülthoff, C. Burges, V. Vapnik, and T. Vetter. Comparison of view-based object recognition algorithms using realistic 3D models. In C. von der Malsburg, W. von Seelen, J. C. Vorbrüggen, and B. Sendhoff, editors, Artificial Neural Networks ICANN'96, volume 1112 of Lecture Notes in Computer Science, pages 251–256. Springer-Verlag, Berlin, 1996.
P. Bloomfield and W. L. Steiger. Least Absolute Deviations: Theory, Applications and Algorithms. Birkhäuser, Boston, 1983.
S. Bochner. Monotone Funktionen, Stieltjessche Integrale und harmonische Analyse. Mathematische Annalen, 108:378–410, 1933.
K. M. Borgwardt, A. Gretton, M. J. Rasch, H.-P. Kriegel, B. Schölkopf, and A. J. Smola. Integrating structured biological data by kernel maximum mean discrepancy. In Proc. of Intelligent Systems in Molecular Biology (ISMB), Fortaleza, Brazil, 2006.
B. E. Boser, I. Guyon, and V. Vapnik. A training algorithm for optimal margin classifiers. In D. Haussler, editor, Proc. Annual Conf. Computational Learning Theory, pages 144–152, Pittsburgh, PA, July 1992. ACM Press.
S. Boucheron, O. Bousquet, and G. Lugosi. Theory of classification: a survey of recent advances. ESAIM: Probability and Statistics, 9:323–375, 2005.
S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, Cambridge, England, 2004.
L. Breiman. Prediction games and arcing algorithms. Neural Computation, 11(7):1493–1518, 1999.
M. P. S. Brown, W. N. Grundy, D. Lin, N. Cristianini, C. Sugnet, T. S. Furey, M. Ares, and D. Haussler. Knowledge-based analysis of microarray gene expression data by using support vector machines. Proc Natl Acad Sci U S A, 97(1):262–267, 2000.
C. J. C. Burges. Simplified support vector decision rules. In L. Saitta, editor, Proc. Intl. Conf. Machine Learning, pages 71–77. Morgan Kaufmann Publishers, San Mateo, CA, 1996.
C. J. C. Burges. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2(2):121–167, 1998.
J.-F. Cardoso. Blind signal separation: statistical principles. Proceedings of the IEEE, 90(8):2009–2026, 1998.
O. Chapelle and Z. Harchaoui. A machine learning approach to conjoint analysis. In L. K. Saul, Y. Weiss, and L. Bottou, editors, Advances in Neural Information Processing Systems 17. MIT Press, Cambridge, MA, 2005.
O. Chapelle, P. Haffner, and V. Vapnik. SVMs for histogram-based image classification. IEEE Transactions on Neural Networks, 10(5), 1999.
E. Charniak. A maximum-entropy-inspired parser. In North American Association of Computational Linguistics (NAACL 1), pages 132–139, 2000.
A. Chen and P. J. Bickel. Consistent independent component analysis and prewhitening. IEEE Transactions on Signal Processing, 53(10):3625–3632, 2005.
S. Chen, D. Donoho, and M. Saunders. Atomic decomposition by basis pursuit. Siam Journal of Scientific Computing, 20(1):33–61, 1999.
R. Cole and R. Hariharan. Faster suffix tree construction with missing suffix links. In Proceedings of the Thirty Second Annual Symposium on the Theory of Computing, Portland, OR, May 2000. ACM.
M. Collins. Discriminative reranking for natural language parsing. In Proc. Intl. Conf. Machine Learning, pages 175–182. Morgan Kaufmann, San Francisco, 2000.
M. Collins and N. Duffy. Convolution kernels for natural language. In T. G. Dietterich, S. Becker, and Z. Ghahramani, editors, Advances in Neural Information Processing Systems 14. MIT Press, Cambridge, MA, 2002.
D. Cook, A. Buja, and J. Cabrera. Projection pursuit indices based on orthonormal function expansions. Journal of Computational and Graphical Statistics, 2:225–250, 1993.
C. Cortes and V. Vapnik. Support vector networks. Machine Learning, 20(3):273–297, 1995.
C. Cortes, M. Mohri, and J. Weston. A general regression technique for learning transductions. In ICML '05: Proceedings of the 22nd international conference on Machine learning, pages 153–160. ACM Press, New York, NY, 2005.
D. Cox and F. O'Sullivan. Asymptotic analysis of penalized likelihood and related estimators. Annals of Statistics, 18:1676–1695, 1990.
K. Crammer and Y. Singer. On the algorithmic implementation of multiclass kernel-based vector machines. Journal of Machine Learning Research, 2:265–292, 2001.
K. Crammer and Y. Singer. Loss bounds for online category ranking. In Proc. Annual Conf. Computational Learning Theory, 2005.
N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector Machines. Cambridge University Press, Cambridge, UK, 2000.
N. Cristianini, J. Shawe-Taylor, A. Elisseeff, and J. Kandola. On kernel-target alignment. In T. G. Dietterich, S. Becker, and Z. Ghahramani, editors, Advances in Neural Information Processing Systems 14, pages 367–373. MIT Press, Cambridge, MA, 2002.
A. Culotta, D. Kulp, and A. McCallum. Gene prediction with conditional random fields. Technical Report UM-CS-2005-028, University of Massachusetts, Amherst, 2005.
J. N. Darroch and D. Ratcliff. Generalized iterative scaling for log-linear models. The Annals of Mathematical Statistics, 43:1470–1480, 1972.
G. Das and P. Sen. Restricted canonical correlations. Linear Algebra and its Applications, 210:29–47, 1994.
J. Dauxois and G. M. Nkiet. Nonlinear canonical analysis and independence tests. Annals of Statistics, 26(4):1254–1278, 1998.
A. P. Dawid. Applications of a general propagation algorithm for probabilistic expert systems. Statistics and Computing, 2:25–36, 1992.
K. De Cock and B. De Moor. Subspace angles between ARMA models. Systems and Control Letters, 46:265–270, 2002.
D. DeCoste and B. Schölkopf. Training invariant support vector machines. Machine Learning, 46:161–190, 2002.
O. Dekel, C. Manning, and Y. Singer. Log-linear models for label ranking. In S. Becker, S. Thrun, and K. Obermayer, editors, Advances in Neural Information Processing Systems 15. MIT Press, Cambridge, MA, 2003.
S. Della Pietra, V. Della Pietra, and J. Lafferty. Inducing features of random fields. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19:380–393, 1997.
K. I. Diamantaras and S. Y. Kung. Principal Component Neural Networks. John Wiley and Sons, New York, 1996.
N. R. Draper and H. Smith. Applied Regression Analysis. John Wiley and Sons, New York, NY, 1981.
H. Drucker, C. J. C. Burges, L. Kaufman, A. Smola, and V. Vapnik. Support vector regression machines. In M. C. Mozer, M. I. Jordan, and T. Petsche, editors, Advances in Neural Information Processing Systems 9, pages 155–161. MIT Press, Cambridge, MA, 1997.
S. Dumais. Using SVMs for text categorization. IEEE Intelligent Systems, 13(4), 1998. In: M. A. Hearst, B. Schölkopf, S. Dumais, E. Osuna, and J. Platt: Trends and Controversies: Support Vector Machines.
R. Durbin, S. Eddy, A. Krogh, and G. Mitchison. Biological Sequence Analysis: Probabilistic models of proteins and nucleic acids. Cambridge University Press, 1998.
J. H. J. Einmal and D. M. Mason. Generalized quantile processes. Annals of Statistics, 20(2):1062–1078, 1992.
A. Elisseeff and J. Weston. A kernel method for multi-labeled classification. In Advances in Neural Information Processing Systems 14, 2001.
M. Everingham, L. Van Gool, C. Williams, and A. Zisserman, editors. Pascal Visual Object Classes Challenge Results. Available from www.pascal-network.org, 2005.
M. Fiedler. Algebraic connectivity of graphs. Czechoslovak Mathematical Journal, 23(98):298–305, 1973.
C. H. FitzGerald, C. A. Micchelli, and A. Pinkus. Functions that preserve families of positive semidefinite matrices. Linear Algebra and its Applications, 221:83–102, 1995.
R. Fletcher. Practical Methods of Optimization. John Wiley and Sons, New York, 1989.
R. Fortet and E. Mourier. Convergence de la répartition empirique vers la répartition théorique. Ann. Scient. École Norm. Sup., 70:266–285, 1953.
Y. Freund and R. E. Schapire. Experiments with a new boosting algorithm. In Proceedings of the International Conference on Machine Learning, pages 148–156. Morgan Kaufmann Publishers, 1996.
J. H. Friedman. Exploratory projection pursuit. Journal of the American Statistical Association, 82:249–266, 1987.
J. H. Friedman and J. W. Tukey. A projection pursuit algorithm for exploratory data analysis. IEEE Transactions on Computers, C-23(9):881–890, 1974.
T.-T. Frieß and R. F. Harrison. Linear programming support vector machines for pattern classification and regression estimation and the set reduction algorithm. TR RR-706, University of Sheffield, Sheffield, UK, 1998.
T. Gärtner. A survey of kernels for structured data. SIGKDD Explorations, 5(1):49–58, July 2003.
T. Gärtner, P. A. Flach, A. Kowalczyk, and A. J. Smola. Multi-instance kernels. In Proc. Intl. Conf. Machine Learning, 2002.
F. Girosi, M. Jones, and T. Poggio. Priors, stabilizers and basis functions: From regularization to radial, tensor and additive splines. A.I. Memo 1430, Artificial Intelligence Laboratory, Massachusetts Institute of Technology, 1993.
P. J. Green and B. S. Yandell. Semiparametric generalized linear models. In Proceedings 2nd International GLIM Conference, number 32 in Lecture Notes in Statistics, pages 44–55. Springer Verlag, New York, 1985.
A. Gretton, R. Herbrich, A. J. Smola, O. Bousquet, and B. Schölkopf. Kernel methods for measuring independence. Journal of Machine Learning Research, 6:2075–2129, 2005a.
A. Gretton, O. Bousquet, A. J. Smola, and B. Schölkopf. Measuring statistical dependence with Hilbert-Schmidt norms. In S. Jain and W. S. Lee, editors, Proceedings Algorithmic Learning Theory, 2005b.
A. Gretton, A. J. Smola, O. Bousquet, R. Herbrich, A. Belitski, M. Augath, Y. Murayama, J. Pauls, B. Schölkopf, and N. Logothetis. Kernel constrained covariance for dependence measurement. In Proceedings of International Workshop on Artificial Intelligence and Statistics, volume 10, 2005c.
M. Gribskov and N. L. Robinson. Use of receiver operating characteristic (ROC) analysis to evaluate sequence matching. Computers and Chemistry, 20(1):25–33, 1996.
D. Gusfield. Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology. Cambridge University Press, June 1997. ISBN 0–521–58519–8.
J. Ham, D. Lee, S. Mika, and B. Schölkopf. A kernel view of the dimensionality reduction of manifolds. In Proceedings of the Twenty-First International Conference on Machine Learning, 2004.
J. M. Hammersley and P. E. Clifford. Markov fields on finite graphs and lattices. Unpublished manuscript, 1971.
J. A. Hartigan. Estimation of a convex density contour in two dimensions. Journal of the American Statistical Association, 82:267–270, 1987.
D. Haussler. Convolution kernels on discrete structures. Technical Report UCSC-CRL-99-10, Computer Science Department, UC Santa Cruz, 1999.
P. Hayton, B. Schölkopf, L. Tarassenko, and P. Anuzis. Support vector novelty detection applied to jet engine vibration spectra. In T. K. Leen, T. G. Dietterich, and V. Tresp, editors, Advances in Neural Information Processing Systems 13, pages 946–952. MIT Press, 2001.
M. Hein, O. Bousquet, and B. Schölkopf. Maximal margin classification for metric spaces. Journal of Computer and System Sciences, 71:333–359, 2005.
R. Herbrich. Learning Kernel Classifiers: Theory and Algorithms. MIT Press, Cambridge, MA, 2002.
R. Herbrich and R. C. Williamson. Algorithmic luckiness. Journal of Machine Learning Research, 3:175–212, 2002.
R. Herbrich, T. Graepel, and K. Obermayer. Large margin rank boundaries for ordinal regression. In A. J. Smola, P. L. Bartlett, B. Schölkopf, and D. Schuurmans, editors, Advances in Large Margin Classifiers, pages 115–132. MIT Press, Cambridge, MA, 2000.
R. Hettich and K. O. Kortanek. Semi-infinite programming: theory, methods, and applications. SIAM Rev., 35(3):380–429, 1993.
D. Hilbert. Grundzüge einer allgemeinen Theorie der linearen Integralgleichungen. Nachrichten der Akademie der Wissenschaften in Göttingen, Mathematisch-Physikalische Klasse, pages 49–91, 1904.
A. E. Hoerl and R. W. Kennard. Ridge regression: biased estimation for nonorthogonal problems. Technometrics, 12:55–67, 1970.
T. Hofmann, B. Schölkopf, and A. J. Smola. A review of kernel methods in machine learning. Technical Report 156, Max-Planck-Institut für biologische Kybernetik, 2006.
A. Holub, M. Welling, and P. Perona. Combining generative models and Fisher kernels for object recognition. In Proceedings of the International Conference on Computer Vision, 2005.
H. Hotelling. Analysis of a complex of statistical variables into principal components. Journal of Educational Psychology, 24:417–441 and 498–520, 1933.
H. Hotelling. Relations between two sets of variates. Biometrika, 28:321–377, 1936.
P. J. Huber. Robust Statistics. John Wiley and Sons, New York, 1981.
P. J. Huber. Projection pursuit. Annals of Statistics, 13(2):435–475, 1985.
A. Hyvärinen, J. Karhunen, and E. Oja. Independent Component Analysis. John Wiley and Sons, New York, 2001.
T. S. Jaakkola and D. Haussler. Probabilistic kernel regression models. In Proceedings of the 1999 Conference on AI and Statistics, 1999.
T. Jebara and I. Kondor. Bhattacharyya and expected likelihood kernels. In B. Schölkopf and M. Warmuth, editors, Proceedings of the Sixteenth Annual Conference on Computational Learning Theory, number 2777 in LNAI, pages 57–71. Springer, 2003.
F. V. Jensen, S. L. Lauritzen, and K. G. Olesen. Bayesian updates in causal probabilistic networks by local computation. Computational Statistics Quarterly, 4:269–282, 1990.
T. Joachims. Text categorization with support vector machines: Learning with many relevant features. In Proceedings of the European Conference on Machine Learning, pages 137–142. Springer, Berlin, 1998.
T. Joachims. Learning to Classify Text Using Support Vector Machines: Methods, Theory, and Algorithms. Kluwer Academic Publishers, Boston, 2002.
T. Joachims. A support vector method for multivariate performance measures. In Proc. Intl. Conf. Machine Learning, 2005.
I. T. Jolliffe. Principal Component Analysis. Springer, New York, 1986.
M. C. Jones and R. Sibson. What is projection pursuit? Journal of the Royal Statistical Society A, 150(1):1–36, 1987.
J.-P. Kahane. Some Random Series of Functions. Cambridge University Press, 1985.
K. Karhunen. Zur Spektraltheorie stochastischer Prozesse. Annales Academiae Scientiarum Fennicae, 37, 1946.
W. Karush. Minima of functions of several variables with inequalities as side constraints. Master's thesis, Dept. of Mathematics, Univ. of Chicago, 1939.
H. Kashima, K. Tsuda, and A. Inokuchi. Marginalized kernels between labeled graphs. In Proc. Intl. Conf. Machine Learning, Washington, DC, United States, 2003.
H. Kashima, K. Tsuda, and A. Inokuchi. Kernels on graphs. In K. Tsuda, B. Schölkopf, and J. P. Vert, editors, Kernels and Bioinformatics, pages 155–170. MIT Press, Cambridge, MA, 2004.
J. R. Kettenring. Canonical analysis of several sets of variables. Biometrika, 58:433–451, 1971.
K. Kim, M. O. Franz, and B. Schölkopf. Iterative kernel principal component analysis for image modeling. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(9):1351–1366, 2005.
G. S. Kimeldorf and G. Wahba. Some results on Tchebycheffian spline functions. J. Math. Anal. Applic., 33:82–95, 1971.
R. Koenker. Quantile Regression. Cambridge University Press, 2005.
V. Koltchinskii. Rademacher penalties and structural risk minimization. IEEE Transactions on Information Theory, 47:1902–1914, 2001.
I. R. Kondor and J. Lafferty. Diffusion kernels on graphs and other discrete structures. In Proc. Intl. Conf. Machine Learning, 2002.
W. Krauth and M. Mézard. Learning algorithms with optimal stability in neural networks. Journal of Physics A, 20:745–752, 1987.
H. W. Kuhn and A. W. Tucker. Nonlinear programming. In Proc. 2nd Berkeley Symposium on Mathematical Statistics and Probabilistics, pages 481–492. University of California Press, Berkeley, 1951.
J. D. Lafferty, A. McCallum, and F. Pereira. Conditional random fields: Probabilistic modeling for segmenting and labeling sequence data. In Proc. Intl. Conf. Machine Learning, volume 18, pages 282–289. Morgan Kaufmann, 2001.
J. Lafferty, X. Zhu, and Y. Liu. Kernel conditional random fields: Representation and clique selection. In Proc. Intl. Conf. Machine Learning, volume 21, 2004.
S. L. Lauritzen. Graphical Models. Oxford University Press, 1996.
Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, November 1998.
T.-W. Lee, M. Girolami, A. J. Bell, and T. J. Sejnowski. A unifying framework for independent component analysis. Computers and Mathematics with Applications, 39:1–21, 2000.
C. Leslie, E. Eskin, and W. S. Noble. The spectrum kernel: A string kernel for SVM protein classification. In Proceedings of the Pacific Symposium on Biocomputing, pages 564–575, 2002a.
C. Leslie, E. Eskin, J. Weston, and W. S. Noble. Mismatch string kernels for SVM protein classification. In S. Becker, S. Thrun, and K. Obermayer, editors, Advances in Neural Information Processing Systems 15, volume 15. MIT Press, Cambridge, MA, 2002b.
H. Lodhi, C. Saunders, J. Shawe-Taylor, N. Cristianini, and C. Watkins. Text classification using string kernels. Technical Report 2000-79, NeuroCOLT, 2000. Published in: T. K. Leen, T. G. Dietterich, and V. Tresp, editors, Advances in Neural Information Processing Systems 13, MIT Press, 2001.
M. Loève. Probability Theory II. Springer-Verlag, New York, 4th edition, 1978.
D. G. Luenberger. Linear and Nonlinear Programming. Addison-Wesley, Reading, MA, second edition, May 1984.
D. M. Magerman. Learning grammatical structure using statistical decision-trees. In Proceedings ICGI, volume 1147 of Lecture Notes in Artificial Intelligence, pages 1–21. Springer-Verlag, Berlin, 1996.
Y. Makovoz. Random approximants and neural networks. Journal of Approximation Theory, 85:98–109, 1996.
O. L. Mangasarian. Linear and nonlinear separation of patterns by linear programming. Operations Research, 13:444–452, 1965.
R. J. Martin. A metric for ARMA processes. IEEE Transactions on Signal Processing, 48(4):1164–1170, 2000.
A. McCallum, K. Bellare, and F. Pereira. A conditional random field for discriminatively-trained finite-state string edit distance. In Conference on Uncertainty in AI (UAI), 2005.
P. McCullagh and J. A. Nelder. Generalized Linear Models. Chapman and Hall, London, 1983.
C. McDiarmid. On the method of bounded differences. In Survey in Combinatorics, pages 148–188. Cambridge University Press, 1989.
S. Mendelson. Rademacher averages and phase transitions in Glivenko-Cantelli classes. IEEE Transactions on Information Theory, 48(1):251–263, 2002.
S. Mendelson. A few notes on statistical learning theory. In S. Mendelson and A. J. Smola, editors, Advanced Lectures on Machine Learning, number 2600 in LNAI, pages 1–40. Springer, 2003.
J. Mercer. Functions of positive and negative type and their connection with the theory of integral equations. Philosophical Transactions of the Royal Society, London, A 209:415–446, 1909.
H. Meschkowski. Hilbertsche Räume mit Kernfunktion. Springer-Verlag, Berlin, 1962.
S. Mika, B. Schölkopf, A. J. Smola, K.-R. Müller, M. Scholz, and G. Rätsch. Kernel PCA and de-noising in feature spaces. In M. S. Kearns, S. A. Solla, and D. A. Cohn, editors, Advances in Neural Information Processing Systems 11, pages 536–542. MIT Press, 1999.
S. Mika, G. Rätsch, J. Weston, B. Schölkopf, A. J. Smola, and K.-R. Müller. Learning discriminative and invariant nonlinear features. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25(5):623–628, 2003.
M. Minsky and S. Papert. Perceptrons: An Introduction To Computational Geometry. MIT Press, Cambridge, MA, 1969.
V. A. Morozov. Methods for Solving Incorrectly Posed Problems. Springer-Verlag, 1984.
K.-R. Müller, A. J. Smola, G. Rätsch, B. Schölkopf, J. Kohlmorgen, and V. Vapnik. Predicting time series with support vector machines. In W. Gerstner, A. Germond, M. Hasler, and J.-D. Nicoud, editors, Artificial Neural Networks ICANN'97, volume 1327 of Lecture Notes in Computer Science, pages 999–1004. Springer-Verlag, Berlin, 1997.
M. K. Murray and J. W. Rice. Differential Geometry and Statistics, volume 48 of Monographs on Statistics and Applied Probability. Chapman and Hall, London, 1993.
G. Navarro and M. Raffinot. Fast regular expression search. In Proceedings of WAE, number 1668 in LNCS, pages 198–212. Springer, 1999.
J. A. Nelder and R. W. M. Wedderburn. Generalized linear models. Journal of the Royal Statistical Society, 135:370–384, 1972.
D. Nolan. The excess mass ellipsoid. Journal of Multivariate Analysis, 39:348–371, 1991.
N. Oliver, B. Schölkopf, and A. J. Smola. Natural regularization in SVMs. In A. J. Smola, P. L. Bartlett, B. Schölkopf, and D. Schuurmans, editors, Advances in Large Margin Classifiers, pages 51–60. MIT Press, Cambridge, MA, 2000.
F. O'Sullivan, B. Yandell, and W. Raynor. Automatic smoothing of regression functions in generalized linear models. J. Amer. Statist. Assoc., 81:96–103, 1986.
E. Parzen. Statistical inference on time series by RKHS methods. In R. Pyke, editor, Proceedings 12th Biennial Seminar, pages 1–37. Canadian Mathematical Congress, Montreal, 1970.
K. Pearson. On lines and planes of closest fit to points in space. Philosophical Magazine, 2 (sixth series):559–572, 1901.
A. Pinkus. Strictly positive definite functions on a real inner product space. Advances in Computational Mathematics, 20:263–271, 2004.
Fast training of support vector machines using sequential minimal optimization. Oldenbourg Verlag. MA. Cambridge. A. J. W. T. T. Metric spaces and completely monotone functions. Mika. Helmbold and o B. 2001. Proceedings of the Royal Society A. Acad. Williams.R. J. Gaussian Processes for Machine Learning. W. 1962. Numerical Recipes in C. A. I.Support Vector Learning. A tutorial on hidden Markov models and selected applications in speech recognition. E. L. S. Advances in Kernel Methods . R. H. J. Proceedings of the IEEE. and W. P. H. 1970. Ima u o proving the Caenorhabditis elegans genome annotation using machine learning. Becker and K. 78(9). Lee.J. Sonnenburg. number 2111 in Lecture Notes in Computer Science. 1994. 3:277–290. Ruj´ n. 1975. R´ nyi. In press. G. Boosting the margin: A new explanation for the effectiveness of voting methods. Annual Conf. R¨ tsch. J. Freund. Onoda. Sch¨ lkopf. Support Vector Learning. and A. 77(2):257–285. H. Cambridge. 10:441–451. Annals of Mathematics. Proc. 2003. Princeton University Press. Minimum volume sets and generalized quantile processes. Sch¨ lkopf. Y. MIT Press. Efﬁcient face detection by a cascaded support o vector machine expansion. G. 1999. 1959. and B. PLoS Computational Biology. Smola. Press. 42(3):287–320. Cambridge. Witte. R. o C. T. Girosi. PhD thesis. Robust Boosting via Convex Optimization: Theory and Applications. C. M¨ ller. 74(366):329–339. o http://www. Annals of Statistics. 2004. UK. Advances in Neural Information Processing Systems 15. Journal of the American Statistical Association. S. Journal de Physique I a France. Srinivasan. Interscience Publishers. In D. 39:811–841. S. Obermayer. editors. G. 2006. R. Bartlett. pages 513–520. Platt. 19:201–209. A generalized representer theorem. 2007. 1997. S. 1979. and A. pages 185–208. Vetterling. Polonik. R. R. In B. Convex Analysis. J. 1998. S. R¨ tsch. R. Stochastic Processes and their Applications. New York. Machine Learning. K. Williamson. MIT Press. J. 
February 1989. 460:3283–3297. A fast method for calculating the perceptron with maximal stability. W. editors. The Art of Scientiﬁc Computation. Rudin. Rabiner. Rockafellar.kernelmachines. Flannery. Herbrich. e R. editors. Schapire. Blake.. Proceedings of the IEEE. Adapting codes and embeddings for polychotomies. Sch¨ lkopf. R¨ tsch. Biological Cybernetics. 1997. Smola. Networks for approximation and learning. B. volume 28 of Princeton Mathematics Series. Romdhani. MA. Poggio and F. Sch¨ lkopf. T. Computational Learning Theory. Poggio. Sci. An iterative method for estimating a multivariate mode and isopleth. 1938. University a of Potsdam. SpringerVerlag. Cambridge University Press. and A. 69:1–24. and A. G. Soft margins for adaboost. C. Teukolsky.org. I. Download: B. L. P. a u 2001. and K. P. B. Munich. P. MIT Press. W. On measures of dependence. pages 416 – 426. P. Schoenberg. 26:1651–1686. and B. In S. 1993. M¨ ller. Fourier analysis on groups. K. J. September 1990. MA. 2001. Rasmussen and C. Hungar. Torr. Acta Math. Thrun a S. On optimal nonlinear associative recall. W. R¨ tsch. Sager. T. S. London. Smola. 49 . Burges. Sch¨ lkopf. Sommer. T.
Positive deﬁnite functions and generalizations. Canberra. Morgan Kaufmann Publishers. Cambridge. and J. J. June 2001. Regression estimation with support vector learning machines. UK. Sim. Smola. Kernels and regularization on graphs. ShaweTaylor and N. Nonparametric quantile estimation. L. o editors. an historical survey. approximao tion and operator inversion. 18:768–791. Sch¨ lkopf.P. Diplomarbeit. 2002b. J. A. A. Williamson. V. Cristianini. Le. A. On a kernelbased method for pattern recognition. In P. Smola and B. Advances in Large Margin Classio ﬁers. and K. Cambridge.. Journal of Machine Learning Research. 2004. pages 213–220. Anthony. J. B. Gammerman. 1976. B. V. I. and J. Cambridge University Press. C. 1996. J. L. R. Context kernels for text categorization. M. Steinwart. Williamson. Proc. MIT Press. Neural o Computation. B. 13(7):1443–1471. J. Watkins. Camo bridge. Support vector machines are universally consistent. Neural Networks. C. 2001.Support Vector Learning. MA. Smola. MIT Press. K. Bartlett. 2006. Kernel Methods for Pattern Analysis. Smola and I. Sch¨ lkopf and A. Langley. Structural risk minimization over datadependent hierarchies. 1998. pages 144–158. 2004. A. forthcoming. In B. 6:409–434. Association for Computational Linguistics. The connection between regularization operators and o u support vector kernels. 11(5):637–649.R. Machine Learning. Complexity. C. Sears. 10:1299–1319. 2:67–93. Kondor. Algorithmica. o Advances in Kernel Methods . Springer. Annual Conf. Stitson. J.J. Smola. Takeuchi. 1998. Proc. o B. and R. Rocky Mountain J. T. 44(5):1926–1940. Journal of Machine Learning Research. MIT Press. 2000. J. The Australian National University. Vapnik. 2002a. Pereira. MA. Sch¨ lkopf. C. MIT Press. J. and A. Learning with Kernels. Smola. editors. Sch¨ lkopf. Technische Universit¨ t M¨ nchen. 1999.V. K.B. Weston. 1998. In Proceedings of HLTNAACL. P. A. Neural Computation. A. Smola. Sha and F. Sch¨ lkopf. Stewart. A. 
Nonlinear component analysis as a kernel eigenvalue o u problem. and M. and K. ShaweTaylor. R. a u A. Australia. Estimating the support of a o highdimensional distribution. Cambridge. I. J. Master’s thesis. 2002. Smola. ShaweTaylor. o editor. C. Neural Computation. J. 2003. P. Bartlett. J. and P. Intl. Smola. Computational Learning Theory. 2003. 12:1207–1245. A. Tsuda. Sch¨ lkopf. 1998. 2000. Vert. J. ACT 0200. B. pages 285–292. Math. F. editors. Williamson. Kernel Methods in Computational Biology. Sch¨ lkopf. 50 . Smola and B. On the inﬂuence of the kernel on the consistency of support vector machines. Cambridge. San Francisco. J. J. regression. Schuurmans. Shallow parsing with conditional random ﬁelds. pages 911–918. Sch¨ lkopf. MA. Lecture Notes in Computer Science. Sparse greedy matrix approximation for machine learning. Platt. In B. I. Sch¨ lkopf. Bartlett. Support vector regression with ANOVA decomposition kernels. L. Q. K. Warmuth. IEEE Transactions on Information Theory. Smola. C. B. M¨ ller. Burges. and D. M¨ ller. J. New support vector algorithms. A.R. Sch¨ lkopf and M. Vovk. Sch¨ lkopf. Smola. J. J. Conf. 22:211–231. MA. R. 2000. Steinwart. and A.
and A. Watkins. Statistical Learning Theory. Journal of the Royal Statistical Society Series B. Annals of Statistics. Taskar. Springer. Smoothing spline ANOVA for exponential families. Smola. and R. Jordan. Vidal. In M. B. Smola. 32(1): 135–166.P. C. Fast kernels for string and tree matching. 58:267–288. and C. Annals of Statistics. John Wiley and Sons. 1991.. Bartlett. and J. V. Springer. V. Technical Report 649. Dokl. Golowich.B. New York. G. New York. P. Mozer. J. On the uniform convergence of relative frequencies of events to their probabilities. volume 59 of CBMSNSF Regional Conference Series in Applied Mathematics. Journal of Machine Learning Research. W. and B. Chervonenkis. Wang. van Rijsbergen. editors. 1996. C. Soviet Math. Department of Statistics. Dynamic alignment kernels. Chervonenkis. V. Tax and R. C. 1998. T. Vapnik and A. In Empirical Methods in Natural Language Processing. exponential families. 1995. Advances in Neural Information Processing Systems 9. 24:774–780. C. M. Joachims. Tibshirani. Guestrin. and D. The Nature of Statistical Learning Theory. Advances in Large Margin Classiﬁers. editor. Wahba. A. pages 281–287. N. pages 39–50. Philadelphia. Koller. with application to the Wisconsin Epidemiological Study of Diabetic Retinopathy. Collins. Verleysen. Taskar. Proceedings ESANN. MA. regression estimation. 4:1035–1038. 23:1865–1895. Annals of Statistics. Pattern Recognition and Image Analysis. Kernels and Bioinformatics. MA. UC Berkeley. Tsochantaridis. Klein. D. 1982. D. Petsche. SIAM. J. 1995. Tsybakov. Vert. R. J. Vapnik and A. Large margin methods for structured and interdependent output variables. 1979. T. 2003. L. I. Maxmargin Markov networks. and T. Vishwanathan and A. and Y. 2003. V. V. MIT Press. A. Automation and Remote Control. MA. Data domain description by support vectors. V. Klein. Altun. Tikhonov. Spline Models for Observational Data. Cambridge. Vapnik. and signal processing. 2006. M. 16(2):264–281. J. MIT Press. Graphical models. 
V. Y. Koller. M. G. Duin. Pattern recognition using generalized portrait method. S. R. V. N. Sch¨ lkopf. Manning. A. Sch¨ lkopf. Smola. N. On nonparametric estimation of density level sets. J. 1997. V. 51 . Support vector method for function approximation. 1(3):283–305. Tsuda. MIT Press. 2000. o S. In K. editors. Wainwright and M. Butterworths. September 2003. editors.J. Vapnik. to appear. Lawrence Saul. Solution of incorrectly formulated problems and the regularization method. Binetcauchy kernels on dynamical systems and its application to the analysis of dynamic scenes. In Sebastian Thrun. 1997. Estimation of Dependences Based on Empirical Data. Vishwanathan.B. In M. In A. Theory of Probability and its Applications. Advances in Neural Information Processing Systems 16. and D. B. 1971. M. 1963. International Journal of Computer Vision. The necessary and sufﬁcient conditions for consistency in the empirical risk minimization method. Gu. and Bernhard Sch¨ lkopf. 1999. Klein. Vapnik. Tsybakov. 2004. B. Hofmann. London. Lerner. 1963. D. A. D Facto. 2004. S. Regression shrinkage and selection via the lasso. Optimal aggregation of classiﬁers in statistical learning. Vapnik and A. and variational inference. Cambridge. 25(3):948–969. Smola. Berlin. o editors. I. C. pages 251–256. Brussels. J. I. Cambridge. Vapnik. Schuurmans. Wahba. 1990. 2 edition. 2005. Maxmargin parsing. Jordan. Information Retrieval. o B. P.
B. editors. In o Proc. Yang and S. Amari. O. Bioinformatics. Sch¨ lkopf. Sch¨ lkopf. B. Zien. C. and K. Learning to map sentences to logical form: Structured classiﬁcation with probabilistic categorial grammars. and B. 2003. Becker and K. Intl. MA. Lengauer. Zhang. 19:239–244. Zhou. 47(6):2516–2532. Sch¨ lkopf. 2003. Jounal of Machine Learning Research. 1961.I. 4:913–931. Vapnik. and Z. 2000. P. Wolfe. Advances in Neural Information Processing Systems 14. I. D. A. Williams and M. 1997. Cambridge. J. J. Weston. 52 . MIT Press. and B. Statistical behavior and consistency of classiﬁcation methods based on convex risk minimization.R. 1990. Adaptive online learning algorithms for blind separation – maximum entropy and minimum mutual information. G. Conf. Graphical Models in Applied Multivariate Statistics. Chapelle. A. Williamson. T. John Wiley & Sons. 16(9):799–807. In T. J. Whittaker. 2001. Obermayer. R. R¨ tsch. 9(7):1457–1482. M¨ ller. Wolf and A. G. A duality theorem for nonlinear programming. and V. L. S. 2005. Generalization bounds for regularization networks and o support vector machines via entropy numbers of compact operators. In Uncertainty in Artiﬁcial Intelligence UAI. Cambridge University Press. UK. Machine Learning. Sch¨ lkopf. pages 873–880. S. In o S. S. 32(1):56–85. 2005. C. L. Kernel dependency estimation. Becker. Thrun S. A. Huang. Cambridge. 2000. Scattered Data Approximation. Quarterly of Applied Mathematics. Advances in Neural Information Processing Systems 15. Elisseeff. Wendland. IEEE Transaction on Information Theory. T. K. Shashua. Zettlemoyer and M. H. Collins. Dietterich. Learning from labeled and unlabeled data on a directed graph. 2005. Learning over sets using kernel principal angles. H. 2004. editors. Seeger. Annals of Statistics. Using the Nystrom method to speed up kernel machines. Mika. Neural Computation. Engineering Support Vector a o u Machine Kernels That Recognize Translation Initiation Sites. Smola.H. Ghahramani. J.