e-Chapter 8
Pierre Paquay
Problem 8.1
Adding the two constraints $w^Tx_+ + b \geq 1$ and $-(w^Tx_- + b) \geq 1$ gives
$$w^T(x_+ - x_-) \geq 2.$$
By Cauchy-Schwarz, the smallest $||w||$ compatible with this constraint is attained when $w$ is parallel to $x_+ - x_-$, so we set $w = k(x_+ - x_-)$ and we choose $k$ to be
$$k = \frac{2}{||x_+ - x_-||^2}.$$
Now, we may write that
$$w^* = \frac{2(x_+ - x_-)}{||x_+ - x_-||^2}.$$
It remains to determine the value of $b^*$. To do that we fix the following equality
$$2\frac{(x_+ - x_-)^T}{||x_+ - x_-||^2}x_+ + b^* = 1;$$
this gives
$$b^* = 1 - 2\frac{x_+^Tx_+ - x_-^Tx_+}{||x_+ - x_-||^2} = \frac{x_-^Tx_- - x_+^Tx_+}{||x_+ - x_-||^2} = \frac{||x_-||^2 - ||x_+||^2}{||x_+ - x_-||^2}.$$
It is now easy to verify that $(w^*, b^*)$ satisfies both constraints and minimizes $||w||$, and therefore gives us the optimal hyperplane.
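As a quick numerical sanity check, here is a small R sketch (the two points x_plus and x_minus are made up for illustration) verifying that this $(w^*, b^*)$ puts both points exactly on the margin boundaries:

x_plus <- c(1, 2)
x_minus <- c(-1, 0)
w_star <- 2 * (x_plus - x_minus) / sum((x_plus - x_minus)^2)
b_star <- (sum(x_minus^2) - sum(x_plus^2)) / sum((x_plus - x_minus)^2)
# both constraints should be tight, i.e. both values below equal 1
sum(w_star * x_plus) + b_star
-(sum(w_star * x_minus) + b_star)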
Problem 8.2
$$-b \geq 1,\qquad -(-w_2 + b) \geq 1,\qquad -2w_1 + b \geq 1.$$
If we combine the first and the third ones, we get $w_1 \leq -1$. The quantity we seek to minimize is
$$\frac{1}{2}w^Tw = \frac{1}{2}(w_1^2 + w_2^2) \geq \frac{1}{2}(1 + 0) = \frac{1}{2},$$
where we have equality when $w_1 = -1$ and $w_2 = 0$; consequently, we choose $w^* = (-1, 0)$. With this in mind, the third constraint becomes
$$1 \leq -2w_1^* + b = 2 + b \iff b \geq -1;$$
so we choose $b^* = -1$. It is now easy to verify that $(w^*, b^*)$ satisfies all three constraints and minimizes $||w||$, and therefore gives us the optimal hyperplane. The margin in this case is given by $1/||w^*|| = 1$.
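A quick check in R (a sketch; the three constraint values are evaluated directly) confirms that $(w^*, b^*) = ((-1, 0), -1)$ satisfies all three constraints, all of them tightly:

w_star <- c(-1, 0)
b_star <- -1
c(-b_star,                    # first constraint
  -(-w_star[2] + b_star),     # second constraint
  -2 * w_star[1] + b_star)    # third constraint: all should be >= 1 (here all equal 1)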
Problem 8.3
(a) The Lagrangian is equal to
$$\begin{aligned}
\mathcal{L}(\alpha) &= \frac{1}{2}\sum_n\sum_m y_ny_m\alpha_n\alpha_m x_n^Tx_m - \sum_n \alpha_n\\
&= \frac{1}{2}(8\alpha_2^2 - 4\alpha_2\alpha_3 - 6\alpha_2\alpha_4 - 4\alpha_2\alpha_3 + 4\alpha_3^2 + 6\alpha_3\alpha_4 - 6\alpha_4\alpha_2 + 6\alpha_3\alpha_4 + 9\alpha_4^2) - \alpha_1 - \alpha_2 - \alpha_3 - \alpha_4\\
&= 4\alpha_2^2 + 2\alpha_3^2 + \frac{9}{2}\alpha_4^2 - 4\alpha_2\alpha_3 - 6\alpha_2\alpha_4 + 6\alpha_3\alpha_4 - \alpha_1 - \alpha_2 - \alpha_3 - \alpha_4.
\end{aligned}$$
The constraint $\sum_n y_n\alpha_n = 0$ reads $\alpha_1 + \alpha_2 - \alpha_3 - \alpha_4 = 0$, or equivalently
$$\alpha_1 + \alpha_2 = \alpha_3 + \alpha_4,$$
with $\alpha_1, \alpha_2, \alpha_3, \alpha_4 \geq 0$.
(b) If we replace $\alpha_1$ with $\alpha_3 + \alpha_4 - \alpha_2$, we obtain
$$\mathcal{L}(\alpha) = 4\alpha_2^2 + 2\alpha_3^2 + \frac{9}{2}\alpha_4^2 - 4\alpha_2\alpha_3 - 6\alpha_2\alpha_4 + 6\alpha_3\alpha_4 - 2\alpha_3 - 2\alpha_4.$$
(c) Now, we fix $\alpha_3$ and $\alpha_4$ and we take the derivative of $\mathcal{L}(\alpha)$ with respect to $\alpha_2$; this gives us that
$$\frac{\partial\mathcal{L}}{\partial\alpha_2} = 8\alpha_2 - 4\alpha_3 - 6\alpha_4.$$
By setting the previous expression to 0, we get that
$$\alpha_2 = \frac{\alpha_3}{2} + \frac{3\alpha_4}{4},$$
and also that
$$\alpha_1 = -\alpha_2 + \alpha_3 + \alpha_4 = \frac{\alpha_3}{2} + \frac{\alpha_4}{4}.$$
These expressions are valid since they are both greater than or equal to 0, and obviously $\alpha_1 + \alpha_2 = \alpha_3 + \alpha_4$.
(d) It remains to replace $\alpha_2$ by its new expression from (c); we get that
$$\begin{aligned}
\mathcal{L}(\alpha) &= 4\left(\frac{\alpha_3}{2} + \frac{3\alpha_4}{4}\right)^2 + 2\alpha_3^2 + \frac{9}{2}\alpha_4^2 - 4\left(\frac{\alpha_3}{2} + \frac{3\alpha_4}{4}\right)\alpha_3 - 6\left(\frac{\alpha_3}{2} + \frac{3\alpha_4}{4}\right)\alpha_4 + 6\alpha_3\alpha_4 - 2\alpha_3 - 2\alpha_4\\
&= \alpha_3^2 + (3\alpha_4 - 2)\alpha_3 + \frac{9}{4}\alpha_4^2 - 2\alpha_4\\
&= \left(\alpha_3 + \frac{3\alpha_4 - 2}{2}\right)^2 + \frac{9}{4}\alpha_4^2 - 2\alpha_4 - \frac{(3\alpha_4 - 2)^2}{4}\\
&= \left(\alpha_3 + \frac{3\alpha_4 - 2}{2}\right)^2 + \alpha_4 - 1 \geq -1.
\end{aligned}$$
The minimum of the Lagrangian is attained when $\alpha_3 = 1$ and $\alpha_4 = 0$; in this case we also have
$$\alpha_1 = \frac{\alpha_3}{2} + \frac{\alpha_4}{4} = \frac{1}{2} \quad\text{and}\quad \alpha_2 = \frac{\alpha_3}{2} + \frac{3\alpha_4}{4} = \frac{1}{2}.$$
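As a sanity check, here is a small R sketch (L below is the simplified Lagrangian from (b), with $\alpha_1$ already eliminated) verifying that $(\alpha_2, \alpha_3, \alpha_4) = (1/2, 1, 0)$ attains the minimum value $-1$:

L <- function(a2, a3, a4) {
  4 * a2^2 + 2 * a3^2 + 9/2 * a4^2 - 4 * a2 * a3 - 6 * a2 * a4 + 6 * a3 * a4 - 2 * a3 - 2 * a4
}
L(1/2, 1, 0)  # equals -1
# random non-negative candidates never go below -1
min(replicate(10000, L(runif(1, 0, 2), runif(1, 0, 2), runif(1, 0, 2))))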
Problem 8.4
We have
$$X = \begin{pmatrix} 0 & 0\\ 2 & 2\\ 2 & 0 \end{pmatrix} \quad\text{and}\quad y = \begin{pmatrix} -1\\ -1\\ 1 \end{pmatrix}.$$
The Lagrangian is equal to
$$\begin{aligned}
\mathcal{L}(\alpha) &= \frac{1}{2}\sum_n\sum_m y_ny_m\alpha_n\alpha_m x_n^Tx_m - \sum_n \alpha_n\\
&= 4\alpha_2^2 - 4\alpha_2\alpha_3 + 2\alpha_3^2 - \alpha_1 - \alpha_2 - \alpha_3,
\end{aligned}$$
and the constraints are $\alpha_3 = \alpha_1 + \alpha_2$ with $\alpha_1, \alpha_2, \alpha_3 \geq 0$. Substituting $\alpha_3 = \alpha_1 + \alpha_2$ into the Lagrangian, we obtain
$$\mathcal{L}(\alpha) = 2(\alpha_1^2 - \alpha_1) + 2(\alpha_2^2 - \alpha_2) \geq -\frac{1}{2} - \frac{1}{2} = -1.$$
The minimum of the Lagrangian is attained when $\alpha_1 = \alpha_2 = 1/2$, which gives us $\alpha_3 = \alpha_1 + \alpha_2 = 1$. Then, the optimal Lagrange multipliers are
$$\alpha_1^* = \frac{1}{2},\quad \alpha_2^* = \frac{1}{2},\quad\text{and}\quad \alpha_3^* = 1.$$
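From these multipliers we can recover the optimal hyperplane via $w^* = \sum_n \alpha_n^*y_nx_n$ and $b^* = y_s - w^{*T}x_s$ for any support vector $x_s$. A minimal R sketch of this recovery step (added here only as a check, it is not part of the problem statement):

X <- matrix(c(0, 0,
              2, 2,
              2, 0), ncol = 2, byrow = TRUE)
y <- c(-1, -1, 1)
alpha <- c(1/2, 1/2, 1)
w_star <- colSums(alpha * y * X)       # here (1, -1)
b_star <- y[3] - sum(w_star * X[3, ])  # x_3 is a support vector; here -1
y * (X %*% w_star + b_star)            # all margins should be >= 1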
Problem 8.5
(a) Below, we generate three data points uniformly in the upper half of the input space and three data points in the lower half. We also obtain $g_{random}$ and $g_{SVM}$.
library(ggplot2)  # used for all the plots below
set.seed(101)

f <- function(D) {
  return(sign(D$x2))
}

g <- function(D, a) {
  return(sign(D$x2 - a))
}

# The bodies of the three helpers below were lost in extraction; these are
# minimal reconstructions consistent with how they are used in this problem.
dataset_gen <- function() {
  D <- data.frame(x1 = runif(6, -1, 1),
                  x2 = c(runif(3, 0, 1), runif(3, -1, 0)))
  D$y <- as.factor(f(D))
  return(D)
}
a_random <- function() {
  # a random separating threshold (reads D from the calling environment)
  a <- runif(1, max(D$x2[D$y == "-1"]), min(D$x2[D$y == "1"]))
  return(a)
}
a_SVM <- function(D) {
  # the maximum-margin threshold: midway between the two closest points
  a <- (max(D$x2[D$y == "-1"]) + min(D$x2[D$y == "1"])) / 2
  return(a)
}

D <- dataset_gen()
(b) Here, we create a plot of our data and of our two hypotheses.
a_r <- a_random()
a_s <- a_SVM(D)
ggplot(D, aes(x1, x2, col = y)) + geom_point() +
guides(colour = guide_legend(title = "Class")) +
geom_hline(aes(yintercept = a_r, linetype = "Random"), colour = "green") +
geom_hline(aes(yintercept = a_s, linetype = "SVM"), colour = "black") +
scale_linetype_manual(name = "Type", values = c(2, 2),
guide = guide_legend(override.aes = list(color = c("green", "black"))))
[Figure: the six data points coloured by class, with the random separator (green, dashed) and the SVM separator (black, dashed); axes x1 and x2.]
(c) Now, we repeat (a) for a million data sets to obtain one million random and SVM hypotheses.
N <- 1000000
vec_a_r <- rep(NA, N)
vec_a_s <- rep(NA, N)
for (i in 1:N) {
D <- dataset_gen()
a_r <- a_random()
a_s <- a_SVM(D)
vec_a_r[i] <- a_r
vec_a_s[i] <- a_s
}
(d) Below, we give a histogram of the values of $a_{random}$ and another histogram of the values of $a_{SVM}$.
a_repeat <- data.frame(vec_a_r = vec_a_r, vec_a_s = vec_a_s)
ggplot(a_repeat, aes(x = vec_a_r)) + geom_histogram(colour = "blue") +
geom_histogram(aes(x = vec_a_s), colour = "red") +
xlab("a value") + ylab("Count")
[Figure: overlaid histograms of the one million values of a_random (blue) and a_SVM (red); axes: a value, Count.]
# (The start of this chunk was lost in extraction: it loops over test points
# x.new, computes the average hypotheses g_bar_r and g_bar_s at x.new, and
# accumulates the squared deviations from the target f.)
  bias_r.x <- c(bias_r.x, (g_bar_r - f(x.new))^2)
  bias_s.x <- c(bias_s.x, (g_bar_s - f(x.new))^2)
}
bias_r <- mean(bias_r.x)
bias_s <- mean(bias_s.x)
We get a bias of 0.2855502 for our random algorithm and a bias of 0.1571874 for SVM; the bias for SVM is clearly smaller than the bias for our random algorithm, which is not surprising. Now, we take a look at the variances.
Ni <- 100
Nj <- 100
# (the remainder of the variance-estimation chunk was lost in extraction)
We get a variance of 0.1884158 for our random algorithm and a variance of 0.0929888 for SVM; here the variance for SVM is clearly smaller as well. So, it seems that SVM is much better in the present case.
Problem 8.6
To find the stationary point of $\sum_{n=1}^N ||x_n - \mu||^2$, we compute its gradient and we set it equal to 0; we get
$$\begin{aligned}
\nabla_\mu \sum_{n=1}^N ||x_n - \mu||^2 &= \nabla_\mu \sum_{n=1}^N (x_n - \mu)^T(x_n - \mu)\\
&= \nabla_\mu\left(\sum_{n=1}^N (x_n^Tx_n - 2x_n^T\mu + \mu^T\mu)\right)\\
&= 2N\mu - 2\sum_{n=1}^N x_n = 0.
\end{aligned}$$
This gives us
$$\mu = \frac{1}{N}\sum_{n=1}^N x_n.$$
To check that it is actually a minimum, we need to take a look at the Hessian matrix, which is
$$\text{Hess}(\mu) = 2NI_d;$$
this is a positive definite matrix, which means that our stationary point is in fact a minimum.
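A quick numerical illustration in R (a sketch with made-up points): a general-purpose optimizer started far from the data converges to the coordinate-wise mean.

set.seed(1)
X <- matrix(rnorm(20), ncol = 2)            # 10 made-up points in R^2
obj <- function(mu) sum(sweep(X, 2, mu)^2)  # sum of squared distances to mu
optim(c(5, -5), obj)$par                    # numerical minimizer
colMeans(X)                                 # the centroid: should match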
Problem 8.7
(b) Since $y_n \in \{-1, 1\}$, when $n = m$ we obviously have $y_ny_m = +1$, which gives us immediately $\mathrm{E}[y_ny_m] = 1$ when $n = m$. Now, if $n \neq m$, we may write that
$$\mathrm{E}[y_ny_m] = \mathrm{P}[y_m = y_n] - \mathrm{P}[y_m \neq y_n] = \frac{N/2 - 1}{N - 1} - \frac{N/2}{N - 1} = -\frac{1}{N - 1},$$
since exactly $N/2$ of the labels are $+1$: conditioned on $y_n$, only $N/2 - 1$ of the remaining $N - 1$ labels are equal to $y_n$. Consequently, we get
$$\mathrm{E}[y_ny_m] = \begin{cases} 1 & \text{if } n = m\\ -\frac{1}{N-1} & \text{if } n \neq m \end{cases}.$$
It follows that
$$\begin{aligned}
\mathrm{E}\left[\left\|\sum_{n=1}^N y_nx_n\right\|^2\right] &= \mathrm{E}\left[\sum_{m,n} y_ny_m x_n^Tx_m\right]\\
&= \sum_{m,n} \mathrm{E}[y_ny_m]\,x_n^Tx_m\\
&= \sum_n \underbrace{\mathrm{E}[y_ny_n]}_{=1}x_n^Tx_n + \sum_n\sum_{m\neq n} \underbrace{\mathrm{E}[y_ny_m]}_{=-1/(N-1)}x_n^Tx_m\\
&= \sum_n x_n^Tx_n - \frac{1}{N-1}\underbrace{\sum_n\sum_{m\neq n} x_n^Tx_m}_{=\sum_{m,n}x_n^Tx_m - \sum_n x_n^Tx_n}\\
&= \frac{N}{N-1}\sum_n x_n^Tx_n - \frac{1}{N-1}\sum_{m,n} x_n^Tx_m.
\end{aligned}$$
On the other hand, with $\bar{x} = \frac{1}{N}\sum_n x_n$,
$$\begin{aligned}
\frac{N}{N-1}\sum_{n=1}^N ||x_n - \bar{x}||^2 &= \frac{N}{N-1}\sum_{n=1}^N (x_n^Tx_n - 2x_n^T\bar{x} + \bar{x}^T\bar{x})\\
&= \frac{N}{N-1}\left(\sum_n x_n^Tx_n - N\underbrace{\bar{x}^T\bar{x}}_{=(1/N^2)\sum_{m,n}x_n^Tx_m}\right)\\
&= \frac{N}{N-1}\sum_n x_n^Tx_n - \frac{1}{N-1}\sum_{m,n} x_n^Tx_m.
\end{aligned}$$
Consequently,
$$\mathrm{E}\left[\left\|\sum_{n=1}^N y_nx_n\right\|^2\right] = \frac{N}{N-1}\sum_{n=1}^N ||x_n - \bar{x}||^2 \leq \frac{N}{N-1}NR^2 = \frac{N^2R^2}{N-1}.$$
Now, let us assume that
$$\mathrm{P}\left[\left\|\sum_n y_nx_n\right\| \leq \frac{NR}{\sqrt{N-1}}\right] = 0,$$
or equivalently
$$\mathrm{P}\left[\left\|\sum_n y_nx_n\right\| > \frac{NR}{\sqrt{N-1}}\right] = 1.$$
In this case, we may write that
$$\mathrm{E}\left[\left\|\sum_{n=1}^N y_nx_n\right\|^2\right] > \frac{N^2R^2}{N-1},$$
which contradicts the upper bound we just derived. Consequently, there must exist some labeling $y_1, \cdots, y_N$ for which
$$\left\|\sum_{n=1}^N y_nx_n\right\| \leq \frac{NR}{\sqrt{N-1}}.$$
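A small Monte-Carlo illustration in R (a sketch with made-up points, assuming balanced random labelings as above) checking that the average of $\|\sum_n y_nx_n\|^2$ matches the closed form $\frac{N}{N-1}\sum_n \|x_n - \bar{x}\|^2$:

set.seed(2)
N <- 6
X <- matrix(rnorm(2 * N), ncol = 2)            # made-up points in R^2
labelings <- replicate(100000, sample(c(rep(1, N/2), rep(-1, N/2))))
sq_norm <- apply(labelings, 2, function(y) sum(colSums(y * X)^2))
mean(sq_norm)                                  # Monte-Carlo estimate
N / (N - 1) * sum(sweep(X, 2, colMeans(X))^2)  # closed form: should match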
Problem 8.8
(a) We begin by assuming that the $y_n$ are numbered in such a way that $y_1, \cdots, y_k = +1$ and $y_{k+1}, \cdots, y_N = -1$; there are $k$ positive labels and $k+1$ negative labels, so $N = 2k + 1$. We know that
$$\rho||w|| \leq y_n(w^Tx_n + b)$$
for $n = 1, \cdots, N$; if we sum this expression for $n = 1, \cdots, k$ and for $n = k+1, \cdots, N$, we get that
$$\underbrace{\sum_{n=1}^k \rho||w||}_{=k\rho||w||} \leq \sum_{n=1}^k y_n(w^Tx_n + b) = \sum_{n=1}^k y_nw^Tx_n + kb$$
and
$$\underbrace{\sum_{n=k+1}^N \rho||w||}_{=(k+1)\rho||w||} \leq \sum_{n=k+1}^N y_n(w^Tx_n + b) = \sum_{n=k+1}^N y_nw^Tx_n - (k+1)b.$$
By multiplying the first expression by $k+1$ and the second one by $k$, we may write that
$$\underbrace{(k+1)k\rho||w|| + k(k+1)\rho||w||}_{=2k(k+1)\rho||w||} \leq (k+1)\left(\sum_{n=1}^k y_nw^Tx_n + kb\right) + k\left(\sum_{n=k+1}^N y_nw^Tx_n - (k+1)b\right) = (k+1)\sum_{n=1}^k y_nw^Tx_n + k\sum_{n=k+1}^N y_nw^Tx_n.$$
Dividing by $k(k+1)$, and writing $l_n = 1/k$ for $n = 1, \cdots, k$ and $l_n = 1/(k+1)$ for $n = k+1, \cdots, N$, we get
$$\begin{aligned}
2\rho||w|| &\leq \frac{1}{k}\sum_{n=1}^k y_nw^Tx_n + \frac{1}{k+1}\sum_{n=k+1}^N y_nw^Tx_n\\
&= w^T\sum_{n=1}^N l_ny_nx_n\\
&\leq ||w||\cdot\left\|\sum_{n=1}^N l_ny_nx_n\right\|,
\end{aligned}$$
where the last inequality is due to the Cauchy-Schwarz inequality. Consequently, since $w \neq 0$, we now have that
$$2\rho \leq \left\|\sum_{n=1}^N l_ny_nx_n\right\|.$$
(b) (i) We have immediately that
$$\left\|\sum_{n=1}^N l_ny_nx_n\right\|^2 = \sum_{n=1}^N\sum_{m=1}^N l_nl_my_ny_m x_n^Tx_m.$$
For a random labeling with $k$ labels $+1$, when $n = m$,
$$\mathrm{E}[l_nl_my_ny_m] = \frac{1}{k^2}\underbrace{\mathrm{P}[y_n = +1]}_{=k/N} + \frac{1}{(k+1)^2}\underbrace{\mathrm{P}[y_n = -1]}_{=(k+1)/N} = \frac{1}{kN} + \frac{1}{(k+1)N} = \frac{1}{k(k+1)},$$
and when $n \neq m$,
$$\begin{aligned}
\mathrm{E}[l_nl_my_ny_m] &= \frac{1}{k^2}\underbrace{\mathrm{P}[y_n = +1, y_m = +1]}_{=k(k-1)/N(N-1)} + \frac{1}{(k+1)^2}\underbrace{\mathrm{P}[y_n = -1, y_m = -1]}_{=k(k+1)/N(N-1)} - \frac{2}{k(k+1)}\underbrace{\mathrm{P}[y_n = +1, y_m = -1]}_{=k(k+1)/N(N-1)}\\
&= \frac{(k-1)(k+1) + k^2 - 2k(k+1)}{k(k+1)N(N-1)} = \frac{-1}{k(k+1)(N-1)}.
\end{aligned}$$
Consequently,
$$\begin{aligned}
\mathrm{E}\left[\left\|\sum_{n=1}^N l_ny_nx_n\right\|^2\right] &= \sum_{m,n} \mathrm{E}[l_nl_my_ny_m]\,x_n^Tx_m\\
&= \frac{1}{k(k+1)}\sum_n x_n^Tx_n - \frac{1}{k(k+1)(N-1)}\sum_n\sum_{m\neq n} x_n^Tx_m\\
&= \frac{1}{k(k+1)(N-1)}\left[(N-1)\sum_n x_n^Tx_n - \sum_n\sum_{m\neq n} x_n^Tx_m\right]\\
&= \frac{1}{k(k+1)(N-1)}\left[N\sum_n x_n^Tx_n - \sum_{m,n} x_n^Tx_m\right];
\end{aligned}$$
and also (see Problem 8.7) that
$$\frac{N}{(N-1)k(k+1)}\sum_n ||x_n - \bar{x}||^2 = \frac{N}{(N-1)k(k+1)}\left[\sum_n x_n^Tx_n - \frac{1}{N}\sum_{m,n} x_n^Tx_m\right] = \frac{1}{k(k+1)(N-1)}\left[N\sum_n x_n^Tx_n - \sum_{m,n} x_n^Tx_m\right].$$
Therefore
$$\mathrm{E}\left[\left\|\sum_{n=1}^N l_ny_nx_n\right\|^2\right] = \frac{N}{(N-1)k(k+1)}\sum_n ||x_n - \bar{x}||^2 \leq \frac{N^2R^2}{k(k+1)(N-1)}.$$
If every labeling with $k$ labels $+1$ satisfied $\left\|\sum_n l_ny_nx_n\right\| > NR/\sqrt{k(k+1)(N-1)}$, the expectation above would exceed $\frac{N^2R^2}{k(k+1)(N-1)}$, which is impossible. So we are now able to conclude that there exists a labeling with $k$ labels being $+1$ for which
$$\left\|\sum_{n=1}^N l_ny_nx_n\right\| \leq \frac{NR}{\sqrt{k(k+1)(N-1)}} = \frac{2NR}{(N-1)\sqrt{N+1}}.$$
(c) From points (a) and (b), for that specific labeling we have that
$$2\rho \leq \left\|\sum_{n=1}^N l_ny_nx_n\right\| \leq \frac{2NR}{(N-1)\sqrt{N+1}},$$
and consequently
$$\rho \leq \frac{NR}{(N-1)\sqrt{N+1}}.$$
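A brute-force check in R (a sketch with made-up points; $N = 5$, $k = 2$, and $R = \max_n ||x_n - \bar{x}||$ so that $\sum_n ||x_n - \bar{x}||^2 \leq NR^2$): enumerating all labelings with $k$ labels $+1$ confirms that at least one satisfies the bound.

set.seed(3)
N <- 5; k <- 2
X <- matrix(runif(2 * N, -1, 1), ncol = 2)         # made-up points
R <- max(sqrt(rowSums(sweep(X, 2, colMeans(X))^2)))
combs <- combn(N, k)                               # all choices of the +1 labels
norms <- apply(combs, 2, function(idx) {
  y <- rep(-1, N); y[idx] <- 1
  l <- ifelse(y == 1, 1/k, 1/(k + 1))
  sqrt(sum(colSums(l * y * X)^2))
})
min(norms) <= 2 * N * R / ((N - 1) * sqrt(N + 1))  # should be TRUE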
Problem 8.9
Let $g$ be the maximum margin classifier for the data set $\mathcal{D}$; this means that $g$ may be constructed from $\alpha^* = (\alpha_1^*, \cdots, \alpha_N^*)$, the solution to the following problem
$$\min_\alpha \underbrace{\frac{1}{2}\sum_{m=1}^N\sum_{n=1}^N y_my_n\alpha_m\alpha_n x_m^Tx_n - \sum_{n=1}^N \alpha_n}_{=f(\alpha_1, \cdots, \alpha_N)}$$
subject to
$$\sum_{n=1}^N y_n\alpha_n = 0 \quad\text{and}\quad \alpha_n \geq 0$$
for $n = 1, \cdots, N$. Now, let us assume that we remove the point $(x_N, y_N)$ from $\mathcal{D}$, where $\alpha_N^* = 0$; in this case, we obtain a new problem on our altered data set $\mathcal{D}^-$, which is
$$\min_\alpha \underbrace{\frac{1}{2}\sum_{m=1}^{N-1}\sum_{n=1}^{N-1} y_my_n\alpha_m\alpha_n x_m^Tx_n - \sum_{n=1}^{N-1} \alpha_n}_{=f(\alpha_1, \cdots, \alpha_{N-1}, 0)}$$
subject to
$$\sum_{n=1}^{N-1} y_n\alpha_n = 0 \quad\text{and}\quad \alpha_n \geq 0$$
for $n = 1, \cdots, N-1$. It is easy to see that $(\alpha_1^*, \cdots, \alpha_{N-1}^*)$ is still feasible for our new problem: we have
$$\sum_{n=1}^{N-1} y_n\alpha_n^* = \sum_{n=1}^N y_n\alpha_n^* = 0$$
and $\alpha_n^* \geq 0$ for $n = 1, \cdots, N-1$. Now, suppose that $\alpha' = (\alpha_1', \cdots, \alpha_{N-1}')$ is another solution to our new problem such that
$$f(\alpha_1', \cdots, \alpha_{N-1}', 0) < f(\alpha_1^*, \cdots, \alpha_{N-1}^*, 0).$$
First, we notice that
$$f(\alpha_1^*, \cdots, \alpha_{N-1}^*, 0) \geq \min_\alpha f(\alpha_1, \cdots, \alpha_{N-1}, 0);$$
on the other hand, since $\alpha_N^* = 0$, we have that
$$f(\alpha_1^*, \cdots, \alpha_{N-1}^*, 0) = f(\alpha_1^*, \cdots, \alpha_N^*) = \min_\alpha f(\alpha_1, \cdots, \alpha_N) \leq \min_\alpha f(\alpha_1, \cdots, \alpha_{N-1}, 0),$$
where the last inequality holds because every feasible point of the new problem extends, with $\alpha_N = 0$, to a feasible point of the original one. So $f(\alpha_1^*, \cdots, \alpha_{N-1}^*, 0) = \min_\alpha f(\alpha_1, \cdots, \alpha_{N-1}, 0)$, which contradicts the existence of $\alpha'$. Consequently, $(\alpha_1^*, \cdots, \alpha_{N-1}^*)$ remains optimal for the new problem, and it yields the same
$$w^* = \sum_{n=1}^{N-1} \alpha_n^*y_nx_n = \sum_{n=1}^N \alpha_n^*y_nx_n$$
and
$$b^* = y_s - w^{*T}x_s$$
where $\alpha_s \neq 0$, which is exactly the optimal hyperplane for our former problem. This means that $g$ is also the maximum margin separator for $\mathcal{D}^-$.
Problem 8.10
We consider all the points $(x_n, y_n)$ ($n = 1, \cdots, S$) which are on the boundary of the optimal fat-hyperplane defined by $(b^*, w^*)$; these points are characterized by the condition
$$y_n(w^{*T}x_n + b^*) = 1.$$
Certain points among these are what we call essential support vectors (those with $\alpha_n > 0$); if we put these equalities in matrix form, we get
$$A_Su^* = c$$
where
$$A_S = \begin{pmatrix} y_1 & y_1x_1^T\\ \vdots & \vdots\\ y_S & y_Sx_S^T \end{pmatrix},\quad u^* = \begin{pmatrix} b^*\\ w^* \end{pmatrix},\quad\text{and}\quad c = \begin{pmatrix} 1\\ \vdots\\ 1 \end{pmatrix}.$$
Thus, the vector $u^*$ is determined by the matrix $A_S \in \mathbb{R}^{S\times(d+1)}$ only, which means that only points on the boundary are actually needed to compute $u^*$. Moreover, we know that only the rows of the matrix $A_S$ which are linearly independent are needed to compute $u^*$; these are exactly our essential support vectors, and there are $\text{rank}(A_S) = r \leq d+1$ of them. We also know that
$$E_{cv} = \frac{1}{N}\sum_{n=1}^N e_n$$
where $e_n$ is the leave-one-out error for $(x_n, y_n)$. Since removing any data point that is not an essential support vector results in the same separator, and since $(x_n, y_n)$ was classified correctly before removal, it will remain correct after removal; we get that $e_n = 0$ for these non-essential support vectors, and $e_n \leq 1$ for the essential support vectors. Consequently, we may write that
$$E_{cv} = \frac{1}{N}\sum_{n=1}^N e_n \leq \frac{r}{N} \leq \frac{d+1}{N}.$$
Problem 8.11
(a) We know that the number of updates $T$ made by PLA satisfies
$$T \leq \frac{R^2||w^*||^2}{\rho'^2}$$
where $w^*$ is an optimal set of weights, obtained in this case by the SVM algorithm, $R = \max_n ||x_n||$, and $\rho' = \min_n y_n(w^{*T}x_n)$. By construction of $w^*$, we have that $||w^*|| = 1/\rho$ and $\rho' = \min_n y_n(w^{*T}x_n) = 1$. Consequently, we get that
$$T \leq \frac{R^2||w^*||^2}{\rho'^2} = \frac{R^2}{\rho^2}.$$
(b) Since $R^2/\rho^2$ is the maximum number of updates for PLA, it is obvious that at most $R^2/\rho^2$ different points are visited during the course of this algorithm.
(c) The PLA algorithm generates its final solution by taking a linear combination of the visited points, so deleting the points that have not been visited does not affect this final solution.
(d) Since deleting the points that have not been visited does not affect the final solution, the corresponding $e_n$ will be equal to 0; thus
$$E_{cv}(\text{PLA}) = \frac{1}{N}\sum_{n=1}^N e_n \leq \frac{\#\text{visited points}}{N} \leq \frac{R^2}{N\rho^2}.$$
Problem 8.12
The soft-margin optimization problem is
$$\min_{w,b,\xi}\ \frac{1}{2}w^Tw + C\sum_{n=1}^N \xi_n$$
subject to
$$y_n(w^Tx_n + b) \geq 1 - \xi_n \quad\text{and}\quad \xi_n \geq 0$$
for $n = 1, \cdots, N$; and the optimization problem of Problem 3.6 (c) is
$$\min_{w,\xi}\ \sum_{n=1}^N \xi_n$$
subject to
$$y_n(w^Tx_n) \geq 1 - \xi_n \quad\text{and}\quad \xi_n \geq 0$$
for $n = 1, \cdots, N$. First, we may note that the constraints are identical for these two problems (taking into account the difference in notation). Then, obviously, if $C \to \infty$, the corresponding term will render the term in $w^Tw$ negligible, so in this case we only need to minimize $\sum_n \xi_n$, which is equivalent to the optimization problem of Problem 3.6 (c).
Problem 8.13
(a) We use our data set with the 2nd and 3rd order polynomial transforms $\Phi_2$ and $\Phi_3$, and the pseudo-inverse algorithm for linear regression, to get weights $\tilde{w}$ for our final hypothesis in $\mathcal{Z}$-space. Below, we first plot the data set.
[Figure: the data points coloured by class (−1, 1); axes x1 and x2.]
Below, we plot the classification regions for our final hypothesis in $\mathcal{X}$-space for $\Phi_2$ and $\Phi_3$ respectively.
D_Phi2 <- data.frame(x1 = D$x1, x2 = D$x2, x1_sq = D$x1^2, x1x2 = D$x1 * D$x2,
x2_sq = D$x2^2, class = D$class)
D_Phi3 <- data.frame(x1 = D$x1, x2 = D$x2, x1_sq = D$x1^2, x1x2 = D$x1 * D$x2,
x2_sq = D$x2^2, x1_cub = D$x1^3, x1_sqx2 = D$x1^2 * D$ x2,
x1x2_sq = D$x1 * D$x2^2, x2_cub = D$x2^3, class = D$class)
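The chunk that computes the regression weights did not survive extraction; below is a minimal sketch of what it presumably looked like (assuming the class labels are ±1 and using the pseudo-inverse solution $\tilde{w} = (Z^TZ)^{-1}Z^Ty$), producing the w_lin_Phi2 and w_lin_Phi3 used below.

Z2 <- as.matrix(cbind(1, D_Phi2[, setdiff(names(D_Phi2), "class")]))
Z3 <- as.matrix(cbind(1, D_Phi3[, setdiff(names(D_Phi3), "class")]))
y_class <- as.numeric(as.character(D$class))   # assumes labels are -1/+1
w_lin_Phi2 <- solve(t(Z2) %*% Z2, t(Z2) %*% y_class)
w_lin_Phi3 <- solve(t(Z3) %*% Z3, t(Z3) %*% y_class)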
[Figure: the classification boundary (zero contour) of the final hypothesis with $\Phi_2$, overlaid on the data; axes x1 and x2.]
# p is the base scatter plot of the data (its defining chunk was lost in extraction)
p <- ggplot(D, aes(x1, x2, col = factor(class))) + geom_point()
cc3 <- emdbook::curve3d(1 * w_lin_Phi3[1] + x * w_lin_Phi3[2] + y * w_lin_Phi3[3] +
x^2 * w_lin_Phi3[4] + x * y * w_lin_Phi3[5] +
y^2 * w_lin_Phi3[6] + x^3 * w_lin_Phi3[7] +
x^2 * y * w_lin_Phi3[8] + x * y^2 * w_lin_Phi3[9] +
y^3 * w_lin_Phi3[10], xlim = c(-1, 1), ylim = c(-1, 1),
sys3d = "none")
dimnames(cc3$z) <- list(cc3$x, cc3$y)
mm3 <- reshape2::melt(cc3$z)
p + geom_contour(data = mm3, aes(x = Var1, y = Var2, z = value), breaks = 0,
colour = "black")
[Figure: the classification boundary (zero contour) of the final hypothesis with $\Phi_3$, overlaid on the data; axes x1 and x2.]
(b) It seems that the fit obtained with $\Phi_3$ has overfitted the data, since it cannot capture the overall (circular) data structure very well.
(c) Usually regularization helps to avoid overfitting, so we now use the regularized pseudo-inverse algorithm
$$w_{reg} = (X^TX + \lambda I)^{-1}X^Ty.$$
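A minimal sketch of the corresponding computation (reusing the Z3 design matrix and y_class from the sketch above, with λ = 1 as mentioned below):

lambda <- 1
w_reg_Phi3 <- solve(t(Z3) %*% Z3 + lambda * diag(ncol(Z3)), t(Z3) %*% y_class)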
[Figure: the classification boundary of the regularized $\Phi_3$ fit ($\lambda = 1$), overlaid on the data; axes x1 and x2.]
We may clearly see that the 3rd order polynomial transform $\Phi_3$ regularized with $\lambda = 1$ results in a classifier almost identical to the one obtained with the 2nd order polynomial transform $\Phi_2$.
Problem 8.14
(a) We begin by taking a look at the kernel Gram matrix $K$ defined by $K_{ij} = \Phi(x_i)^T\Phi(x_j)$; with $Z = (\Phi(x_1)\ \cdots\ \Phi(x_N))$, we have that
$$K = \begin{pmatrix} -\ \Phi(x_1)^T\ -\\ \vdots\\ -\ \Phi(x_N)^T\ - \end{pmatrix}\begin{pmatrix} | & & |\\ \Phi(x_1) & \cdots & \Phi(x_N)\\ | & & | \end{pmatrix} = Z^TZ.$$
If we write $\tilde{w} = Z\beta$, then
$$E_{aug}(\tilde{w}) = ||K\beta - y||^2 + \lambda\beta^TK\beta = \beta^TK^TK\beta - 2\beta^TK^Ty + y^Ty + \lambda\beta^TK\beta = E(\beta).$$
Since $\tilde{w}^*$ minimizes $E_{aug}(\tilde{w})$, we obviously have that $\beta^*$ minimizes $E(\beta)$.
(b) The matrix $K$ is clearly symmetric since
$$K^T = (Z^TZ)^T = Z^TZ = K.$$
(c) To solve the minimization problem in part (a), we have to compute the gradient of $E(\beta)$; we get that
$$\begin{aligned}
\nabla_\beta E(\beta) &= \nabla_\beta(\beta^TK^TK\beta - 2\beta^TK^Ty + y^Ty + \lambda\beta^TK\beta)\\
&= (K^TK + K^TK)\beta - 2K^Ty + \lambda(K + K^T)\beta\\
&= 2K[(K + \lambda I)\beta - y].
\end{aligned}$$
This gradient vanishes in particular when
$$(K + \lambda I)\beta = y.$$
Moreover, for any $x$,
$$x^TKx = x^TZ^TZx = ||Zx||^2 \geq 0;$$
and since $\lambda > 0$, we can conclude that $K + \lambda I$ is positive definite, and consequently invertible. Thus the solution $\beta^*$ to the minimization problem is
$$\beta^* = (K + \lambda I)^{-1}y.$$
The $\beta^*$ above can be computed without visiting the $\mathcal{Z}$-space because, in the case of specific kernels (like polynomial kernels or Gaussian-RBF kernels), the entries of $K$ can be obtained directly with the kernel trick.
(d) The final hypothesis is given by
$$g(x) = \text{sign}(\tilde{w}^{*T}\Phi(x)) = \text{sign}\left(\sum_{n=1}^N \beta_n^*\Phi(x_n)^T\Phi(x)\right) = \text{sign}\left(\sum_{n=1}^N \beta_n^*K(x_n, x)\right).$$
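A minimal R sketch of the whole procedure (made-up data, a 2nd order polynomial kernel, and λ = 1; all names here are illustrative):

set.seed(4)
X <- matrix(runif(20, -1, 1), ncol = 2)        # made-up inputs
y <- sign(X[, 1]^2 + X[, 2]^2 - 0.5)           # made-up +/-1 targets
kern <- function(u, v) (1 + sum(u * v))^2      # 2nd order polynomial kernel
K <- outer(1:nrow(X), 1:nrow(X),
           Vectorize(function(i, j) kern(X[i, ], X[j, ])))
lambda <- 1
beta <- solve(K + lambda * diag(nrow(K)), y)   # beta* = (K + lambda I)^{-1} y
g <- function(x) sign(sum(beta * apply(X, 1, kern, v = x)))
g(c(0.9, 0.9))                                 # evaluate the final hypothesis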
Problem 8.15
(a) Since $\mathcal{H}_i \subset \mathcal{H}_{i+1}$, we have
$$E_{in}(g_i) = \min_{h\in\mathcal{H}_i} E_{in}(h) \geq \min_{h\in\mathcal{H}_{i+1}} E_{in}(h) = E_{in}(g_{i+1})$$
for any $i = 1, 2, \cdots$.
(b) Let $p_i = \mathrm{P}[g^* \in \mathcal{H}_i] = \mathrm{P}[g^* = g_i]$; so if $p_i$ is small then $\Omega(\mathcal{H}_i)$ is large, which implies that the model is complex. It is obvious that
$$g^* \in \mathcal{H}_i \Rightarrow g^* \in \mathcal{H}_{i+1},$$
thus we get that
$$p_i = \mathrm{P}[g^* \in \mathcal{H}_i] \leq \mathrm{P}[g^* \in \mathcal{H}_{i+1}] = p_{i+1}$$
for any $i = 1, 2, \cdots$.
(c) We know from the generalization bound that
$$\begin{aligned}
\mathrm{P}[|E_{in}(g_i) - E_{out}(g_i)| > \epsilon \mid g^* = g_i] &\leq \frac{\mathrm{P}[|E_{in}(g_i) - E_{out}(g_i)| > \epsilon \cap g^* = g_i]}{\mathrm{P}[g^* = g_i]}\\
&\leq \frac{\mathrm{P}[|E_{in}(g_i) - E_{out}(g_i)| > \epsilon]}{\mathrm{P}[g^* = g_i]}\\
&\leq \frac{4m_{\mathcal{H}_i}(2N)e^{-\epsilon^2N/8}}{p_i}.
\end{aligned}$$
Problem 8.16
For $C_1 \leq C_2$ we have $\mathcal{H}_{C_1} \subset \mathcal{H}_{C_2}$, so the sets $\{\mathcal{H}_C\}_{C>0}$ are nested and the problem can be posed within the SRM framework. However, the sets $\{\mathcal{H}_\lambda\}_{\lambda>0}$ of the regularized problem do not satisfy such an inclusion relationship, so the SRM framework cannot be used in this case.
Problem 8.17
Problem 8.18
(a) Yes, in my opinion: since we only have $m$ hypotheses in our hypothesis subset $\mathcal{H}_m$, we know, with probability at least $1 - \delta$, that
$$E_{out}(h_m) \leq E_{in}(h_m) + \sqrt{\frac{\ln(2m/\delta)}{2N}} \leq \nu + \sqrt{\frac{\ln(2m/\delta)}{2N}}.$$
(b) We can formulate this process within the SRM framework since, obviously, the $\mathcal{H}_m$'s form a structure ($\mathcal{H}_m \subset \mathcal{H}_{m+1}$). It is easy to see that if we output $h_m$, this means that $h_m$ is the first hypothesis in the sequence with $E_{in}(h_m) \leq \nu$. Moreover, we also have that
$$h^* = \text{argmin}_{h_m}\,[E_{in}(h_m) + \Omega(\mathcal{H}_m)]$$
where $\Omega(\mathcal{H}_m) = m$; this will give priority to the $h_m$ corresponding to the smaller subscript $m$, which characterizes exactly the SRM framework.
(c) Since we are within the SRM framework, we may write that
$$\mathrm{P}[|E_{in}(h_m) - E_{out}(h_m)| > \epsilon \mid h^* = h_m] \leq \frac{4m_{\mathcal{H}_m}(2N)e^{-\epsilon^2N/8}}{p_m}.$$