Class03 PDF
[Figure: two plots of f(x) against x — suppose the first is the "true" solution, but suppose ERM gives the second solution instead.]
Regularization
$$\|f\|_{\mathcal{H}}^2 \le A,$$
with $\|\cdot\|$ the norm in the normed function space $\mathcal{H}$
Function space
1. $\|f\| \ge 0$ and $\|f\| = 0$ iff $f = 0$;
2. $\|f + g\| \le \|f\| + \|g\|$;
3. $\|\alpha f\| = |\alpha| \, \|f\|$.

$$\|f\| = \max_{a \le t \le b} |f(t)|.$$
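As a sketch (names are my own), the sup norm above can be approximated on a grid, and the three norm axioms checked numerically:

```python
import numpy as np

# Hypothetical example: approximate the sup norm ||f|| = max_{a<=t<=b} |f(t)|
# on a discrete grid, and check the three norm axioms numerically.
a, b = 0.0, 1.0
t = np.linspace(a, b, 1001)

def sup_norm(f):
    """Grid approximation of the sup norm of f on [a, b]."""
    return np.max(np.abs(f(t)))

f = lambda x: np.sin(2 * np.pi * x)
g = lambda x: x ** 2

# Axiom 1: non-negativity
assert sup_norm(f) >= 0
# Axiom 2: triangle inequality ||f + g|| <= ||f|| + ||g||
assert sup_norm(lambda x: f(x) + g(x)) <= sup_norm(f) + sup_norm(g) + 1e-12
# Axiom 3: homogeneity ||alpha f|| = |alpha| ||f||
alpha = -2.5
assert np.isclose(sup_norm(lambda x: alpha * f(x)), abs(alpha) * sup_norm(f))

print(sup_norm(f))  # sup of |sin(2*pi*t)| on [0, 1] is 1
```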
Minimize $I(x)$
subject to $\Phi(x) \le A$ (for some $A$)
The evaluation functionals
$$F_t[f] = f(t)$$
are bounded:
$$|F_t[f]| = |f(t)| \le M \|f\|_{\mathcal{H}} \quad \forall f \in \mathcal{H},$$
where $\|\cdot\|_{\mathcal{H}}$ is the norm in the Hilbert space of functions.
RKHS
$$K(t, x) := K_t(x)$$
Positive definite kernels
Definition
A symmetric function $K(t, x)$ is positive definite (pd) if, for any $n$, any points $t_1, \ldots, t_n$, and any $c \in \mathbb{R}^n$, $\sum_{i,j=1}^{n} c_i c_j K(t_i, t_j) \ge 0$.
Theorem
The reproducing kernel of an RKHS is a positive definite kernel.
• K is pd because
$$\sum_{i,j=1}^{n} c_i c_j K(t_i, t_j) = \sum_{i,j=1}^{n} c_i c_j \langle K_{t_i}, K_{t_j} \rangle_K = \Big\| \sum_{j=1}^{n} c_j K_{t_j} \Big\|_K^2 \ge 0.$$
Sketch of proof (cont.)
• Linear kernel
$$K(x, x') = x \cdot x'$$
• Gaussian kernel
$$K(x, x') = e^{-\frac{\|x - x'\|^2}{\sigma^2}}, \quad \sigma > 0$$
• Polynomial kernel
$$K(x, x') = (x \cdot x' + 1)^d, \quad d \in \mathbb{N}$$
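A minimal sketch of the three kernels above (function names are my own), together with a numerical check of positive definiteness: for any sample of points, the Gram matrix $K_{ij} = K(t_i, t_j)$ should be symmetric positive semidefinite, matching the sum in the proof sketch.

```python
import numpy as np

# Sketches of the three kernels listed above (names are my own).
def linear_kernel(x, xp):
    return np.dot(x, xp)

def gaussian_kernel(x, xp, sigma=1.0):
    return np.exp(-np.linalg.norm(x - xp) ** 2 / sigma ** 2)

def polynomial_kernel(x, xp, d=2):
    return (np.dot(x, xp) + 1) ** d

# Positive definiteness in practice: the Gram matrix K_ij = K(t_i, t_j)
# has no negative eigenvalues (up to floating-point round-off).
rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))
for k in (linear_kernel, gaussian_kernel, polynomial_kernel):
    G = np.array([[k(xi, xj) for xj in X] for xi in X])
    assert np.linalg.eigvalsh(G).min() > -1e-9  # psd up to round-off
print("all Gram matrices are positive semidefinite")
```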
We want to separate two classes using lines and see how the magnitude
of the slope corresponds to a measure of complexity.
We will look at three examples and see that each successive example requires more complicated functions, i.e. functions with greater slopes, to separate the positive examples from the negative ones.
A linear example (cont.)
Here are three datasets in which a linear function should be used to separate the classes. Notice that as the class distinction becomes finer, a larger slope is required to separate the classes.
[Figure: three datasets plotted as f(x) against x on [−2, 2]; a line with larger slope is needed as the class distinction becomes finer.]
Again Tikhonov Regularization
• Regularized least squares (RLS)
$$V = (y - f(x))^2$$
• Support vector machines for classification (SVMC)
$$V = |1 - y f(x)|_+$$
where
$$(k)_+ \equiv \max(k, 0).$$
The Square Loss
[Figure: the square (L2) loss plotted against y − f(x).]
The Hinge Loss
[Figure: the hinge loss plotted against y · f(x).]
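The two losses above written out directly, as a minimal sketch (the plots show the square loss as a function of $y - f(x)$ and the hinge loss as a function of $y \cdot f(x)$):

```python
import numpy as np

# Square loss, used for regularized least squares.
def square_loss(y, fx):
    return (y - fx) ** 2

# Hinge loss, used for SVM classification: |1 - y f(x)|_+ = max(1 - y f(x), 0).
def hinge_loss(y, fx):
    return np.maximum(1 - y * fx, 0.0)

y, fx = 1.0, 0.4
print(square_loss(y, fx))  # (1 - 0.4)^2 = 0.36
print(hinge_loss(y, fx))   # max(1 - 0.4, 0) = 0.6
```

Note that the hinge loss is zero whenever the margin $y f(x)$ exceeds 1, i.e. for confidently correct predictions.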
Existence and uniqueness of minimum
Both the square loss and the hinge loss are convex; combined with the strictly convex regularization term $\lambda \|f\|_K^2$, this guarantees that the Tikhonov functional has a unique minimizer.
$$V = \Theta(-f(x)\, y),$$
where $\Theta(\cdot)$ is the Heaviside step function, is not convex.
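This can be checked numerically with a midpoint-convexity test (a sketch of my own): a convex function $V$ satisfies $V\!\left(\frac{a+b}{2}\right) \le \frac{V(a)+V(b)}{2}$ for all $a, b$. The hinge loss passes; the 0–1 loss fails.

```python
import numpy as np

# Losses as functions of the margin m = y * f(x).
hinge = lambda m: max(1 - m, 0.0)
zero_one = lambda m: 1.0 if m < 0 else 0.0  # Heaviside(-m), the 0-1 loss

def midpoint_convex(V, points):
    """True if V satisfies midpoint convexity on all pairs from `points`."""
    return all(V((a + b) / 2) <= (V(a) + V(b)) / 2 + 1e-12
               for a in points for b in points)

pts = np.linspace(-3, 3, 61)
print(midpoint_convex(hinge, pts))     # True: the hinge loss is convex
print(midpoint_convex(zero_one, pts))  # False: e.g. a = -1, b = 0.5 fails
```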
The Representer Theorem
The minimizer of the regularized functional
$$I_S[f] + \lambda \|f\|_K^2$$
can be represented by the expression
$$f_S(x) = \sum_{i=1}^{n} c_i K(x_i, x),$$
for some $n$-tuple $(c_1, \ldots, c_n) \in \mathbb{R}^n$.
Let $\mathcal{H}_0 = \mathrm{span}(\{K_{x_i}\}_{i=1,\ldots,n})$ and decompose any $f \in \mathcal{H}$ as $f = f_0 + f_0^\perp$, with $f_0 \in \mathcal{H}_0$ and $f_0^\perp \in \mathcal{H}_0^\perp$.
From the reproducing property of $\mathcal{H}$, $\forall f \in \mathcal{H}_0^\perp$
$$\Big\langle f, \sum_i c_i K_{x_i} \Big\rangle_K = \sum_i c_i \langle f, K_{x_i} \rangle_K = \sum_i c_i f(x_i) = 0.$$
In particular $f_0^\perp(x_i) = \langle f_0^\perp, K_{x_i} \rangle_K = 0$, so $I_S$ depends on $f$ only through $f_0$. Since by orthogonality $\|f_0 + f_0^\perp\|_K^2 = \|f_0\|_K^2 + \|f_0^\perp\|_K^2$,
$$I_S[f_0] + \lambda \|f_0\|_K^2 \le I_S[f_0 + f_0^\perp] + \lambda \|f_0 + f_0^\perp\|_K^2.$$
$$\min_{c \in \mathbb{R}^n} g(c)$$
for a suitable function $g(\cdot)$.
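As an illustration of how the representer theorem reduces the problem to a finite-dimensional one, here is a sketch of regularized least squares, assuming the common choices $I_S[f] = \frac{1}{n}\sum_i (y_i - f(x_i))^2$ and the Gaussian kernel (the data and parameters are hypothetical). Then $g(c) = \frac{1}{n}\|y - Kc\|^2 + \lambda\, c^\top K c$, and setting the gradient to zero yields the linear system $(K + \lambda n I)\, c = y$.

```python
import numpy as np

# Gram matrix for the Gaussian kernel K(x, x') = exp(-||x - x'||^2 / sigma^2).
def gaussian_gram(X, sigma=1.0):
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / sigma ** 2)

# Hypothetical training data: noisy samples of sin(3x) on [-1, 1].
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(30, 1))
y = np.sin(3 * X[:, 0]) + 0.1 * rng.normal(size=30)

# Solve (K + lambda * n * I) c = y for the representer coefficients c.
lam = 1e-3
K = gaussian_gram(X)
c = np.linalg.solve(K + lam * len(y) * np.eye(len(y)), y)

# f_S(x) = sum_i c_i K(x_i, x), evaluated here at the training points:
fS = K @ c
print(np.mean((fS - y) ** 2))  # small training error
```

The choice of the $\lambda n$ scaling follows from the $\frac{1}{n}$ factor in $I_S$; other conventions just rescale $\lambda$.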