
Introduction to Machine Learning / Labwork 6

Kernel SVMs and model selection

Maxime Ossonce maxime.ossonce@esme.fr

The purpose of this labwork is to study the support vector machine (SVM) algorithm with the radial basis
function (RBF) kernel.

Remark: for better readability of the code, imports are added where they are needed, even though this is not considered good practice (imports generally go at the beginning of the file).

The kernel trick

The decision function of the SVM algorithm is of the form:

    f(x) = w^\top x + b.

The vector w defining the hyperplane is written as a linear combination of the so-called support vectors, which are a subset of the training dataset D:

    w = \sum_{i=1}^{n} \alpha_i y_i x_i,

the support vectors being the set {x_i : α_i > 0}. Hence, the scoring function f(x) writes:

    f(x) = \sum_{i=1}^{n} \alpha_i y_i x_i^\top x + b.

This solution is a linear separation of the sample space X. One can imagine a feature space H, endowed with a scalar product ⟨·, ·⟩_H, and a map φ : X → H computing a representation of the samples x ∈ X on which the classification is performed.

The advantages of applying the SVM to φ(x) ∈ H rather than to x ∈ X are that:

• classification can be performed on objects that do not live in a vector space (on which the scalar product x_i^\top x would not be defined);
• the linear separation performed by the SVM in H can induce a more complex separation in X.

One important feature of the SVM is that the (dual) optimization, as well as the classification of a new sample x, only involves scalar products. Hence an SVM performed in the feature space would only involve the scalar products ⟨φ(x_i), φ(x)⟩_H. This means that if k(x, z) := ⟨φ(x), φ(z)⟩_H can be computed for any x, z ∈ X without the explicit values of φ(x) and φ(z), then neither the feature space H nor the feature map φ(·) needs to be known to perform an SVM in H. This powerful property is called the kernel trick (the function k(·, ·) is called the kernel).

Furthermore, under some hypotheses on k(·, ·) (symmetry and positive-definiteness), it can be shown that such a Hilbert space H exists. The most used kernel is the so-called RBF, or Gaussian, kernel:

    k(x, z) = \exp(-\gamma \|x - z\|^2).

Sometimes the precision γ is replaced by 1/(2σ²). The Gaussian kernel is positive-definite: for every dataset X = {x_i : i ∈ {1, …, n}} of distinct samples and every (α_1, …, α_n) ∈ ℝ^n \ {0} we have:

    \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j k(x_i, x_j) > 0.

Hence, there exist a Hilbert space H, the reproducing kernel Hilbert space (RKHS), and a feature map φ : X → H such that:

    k(x, z) = \langle \phi(x), \phi(z) \rangle_H    for all x, z ∈ X.

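As a quick numerical illustration of these properties (not part of the labwork), one can build the Gram matrix of the Gaussian kernel on a few random points and check that it is symmetric with non-negative eigenvalues. The sketch below uses scikit-learn's rbf_kernel and an arbitrary γ = 0.5:

import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(20, 2))   # 20 random 2D points

K = rbf_kernel(X_demo, gamma=0.5)   # Gram matrix: K[i, j] = exp(-gamma * ||x_i - x_j||^2)

print(np.allclose(K, K.T))                     # the Gram matrix is symmetric
print(np.linalg.eigvalsh(K).min() >= -1e-10)   # eigenvalues are non-negative (up to numerical precision)
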
1 Kernel SVM on synthetic data

The dataset used here is a 2D (p = 2) dataset with two classes.

kernel_svm.py
from sklearn.datasets import make_moons
n = 900
X, y = make_moons(n_samples=n, noise=0.25, random_state=42)

Q 1-1. After loading the dataset (see above), split the data to create Xa, ya, Xt, yt, Xv, yv the train, test
and validation sets (of equal sizes).
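One possible way to obtain three sets of equal size for Q 1-1 (a sketch using two successive calls to train_test_split; the random_state values are arbitrary):

from sklearn.model_selection import train_test_split

# keep one third for training, then split the remaining two thirds
# into a test set and a validation set of equal size
Xa, X_rest, ya, y_rest = train_test_split(X, y, test_size=2 / 3, random_state=0)
Xt, Xv, yt, yv = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)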
Q 1-2. Visualize the training set to check that you need a non-linear classifier.

kernel_svm.py
from matplotlib import pyplot as plt

plt.figure(figsize=(12, 8))
plt.scatter(Xa[:, 0], Xa[:, 1], c=ya)

The chosen kernel is the RBF (which is also the default kernel in the scikit-learn library):

    k(x, z) = \exp(-\gamma \|x - z\|^2).

Q 1-3. What is the equivalent kernel when γ → 0?

The precision γ is a hyperparameter that has to be chosen by the user (or selected, see section 2). Another hyperparameter to choose is C (a regularization term controlling the slack variables, see lesson 6). First, we set arbitrary values: C = 10, γ = 0.1.

Q 1-4. With the function plot_decision_boundary (from svm_utility), plot the decision boundary of an SVM fit on the training samples. Comment on the obtained decision regions.
Q 1-5. Vary γ (e.g. γ ∈ {10⁻², 10⁻¹, 1, 10, 100}) to see its influence on the decision regions (a possible loop is sketched after the code block below). What can be said for small values of γ? For large values?

kernel_svm.py
from sklearn.svm import SVC

clf = SVC(kernel='rbf')  # SVM with Gaussian (RBF) kernel

clf.gamma = 0.1
clf.C = 10

clf.fit(Xa, ya)

from svm_utility import plot_decision_boundary

plot_decision_boundary(Xa, ya, clf)
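A possible loop for Q 1-5 (a sketch; it assumes that plot_decision_boundary draws on the current figure, so a title is added right after each call):

for gamma in [1e-2, 1e-1, 1, 10, 100]:
    clf = SVC(kernel='rbf', C=10, gamma=gamma)
    clf.fit(Xa, ya)
    plot_decision_boundary(Xa, ya, clf)     # decision regions for this value of gamma
    plt.title('gamma = {}'.format(gamma))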

A validation procedure has to be applied to choose proper values of C and γ: the performance of the model (fit on the training set) is evaluated on the validation set. The chosen pair (C, γ) is the one yielding the best accuracy on the validation set.

Q 1-6. Select C in the logarithmic range {10⁻³, …, 10²} and γ in {10⁻², …, 10²}. What is the optimal pair (C*, γ*) to select?

kernel_svm.py
import numpy as np
from sklearn.metrics import accuracy_score

C_ = np.logspace(-3, 2, 6)
gamma_ = np.logspace(-2, 2, 5)

val_err_ = np.zeros((len(C_), len(gamma_)))

for i_C, C in enumerate(C_):
    for i_g, gamma in enumerate(gamma_):
        clf.C = C
        clf.gamma = gamma
        print('Fitting with C={}, gamma={}'.format(C, gamma))
        clf.fit(Xa, ya)

        err = 1 - accuracy_score(yv, clf.predict(Xv))

        print('Validation error: {:.1%}'.format(err))
        val_err_[i_C, i_g] = err

plt.figure(figsize=(12, 8))

extentC = [min(np.log10(C_)) - 0.5, max(np.log10(C_)) + 0.5]
extentG = [min(np.log10(gamma_)) - 0.5, max(np.log10(gamma_)) + 0.5]
# rows of val_err_ correspond to C, columns to gamma;
# origin='lower' so that increasing C goes upward, matching the extent
plt.imshow(val_err_, extent=[*extentG, *extentC], origin='lower')
plt.colorbar()
plt.xlabel("log(gamma)")
plt.ylabel("log(C)")
plt.title("Validation error rate")

ind_C, ind_gamma = np.unravel_index(np.argmin(val_err_), val_err_.shape)

C_star = C_[ind_C]
gamma_star = gamma_[ind_gamma]
print('C*={}, gamma*={}'.format(C_star, gamma_star))

Q 1-7. Now you can merge the training set and the validation set to train the optimal SVM and evaluate its performance on the test set. Comment on the decision regions.

kernel_svm.py
Xa = np.concatenate([Xa, Xv])
ya = np.concatenate([ya, yv])
clf.C = C_star
clf.gamma = gamma_star
# ... to complete
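One possible completion (a sketch, continuing the code above): refit on the merged set, measure the accuracy on the held-out test set and plot the resulting decision boundary.

clf.fit(Xa, ya)  # refit on the merged training + validation set

test_err = 1 - accuracy_score(yt, clf.predict(Xt))
print('Test error: {:.1%}'.format(test_err))

plot_decision_boundary(Xa, ya, clf)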

2 Model selection

The validation procedure described in question 1-6 can be carried out with a grid search, using GridSearchCV from the scikit-learn library.

The dataset used here is the UCI Breast Cancer Wisconsin dataset.

model_selection.py
from sklearn.datasets import load_breast_cancer
data = load_breast_cancer()

print('Target:', *data.target_names)
print('Features:', ', '.join(data.feature_names))
X, y = data.data, data.target
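A train / test split is required before the grid search below; one possible split (a sketch, reusing the Xa / Xt naming of section 1):

from sklearn.model_selection import train_test_split

# keep one third of the samples as a held-out test set
Xa, Xt, ya, yt = train_test_split(X, y, test_size=1 / 3, random_state=0)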

The selection procedure is a K-fold cross-validation: the training set is split into K subsets, each one being used in turn for validation while the K − 1 remaining subsets are used for model fitting.
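As an illustration (not required by the labwork), the folds produced by such a split can be inspected directly with scikit-learn's KFold:

from sklearn.model_selection import KFold

kf = KFold(n_splits=3, shuffle=True, random_state=0)
for k, (train_idx, val_idx) in enumerate(kf.split(Xa)):
    # each fold is used once for validation, the K - 1 others for fitting
    print('fold {}: {} samples for fitting, {} for validation'.format(
        k, len(train_idx), len(val_idx)))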

Q 2-1. Perform the grid search using a K -fold validation (K = 3).

model_selection.py
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

C_ = np.logspace(-0.5, 2, 25)
gamma_ = np.logspace(-3, 1, 25)

# the grid
parameters = [{"gamma": gamma_, "C": C_}]

# Define the classifier
clf = SVC(kernel='rbf')

# Perform a K-fold validation using the accuracy as the performance measure
K = 3
clf = GridSearchCV(clf, param_grid=parameters, cv=K, scoring='accuracy', verbose=2, n_jobs=2)
clf.fit(Xa, ya)  # of course, you first have to do a train / test split!
print('Best parameters:', clf.best_params_)
print('Best score: {:.1%}'.format(clf.best_score_))

Q 2-2. What is the accuracy attained on the train set? on the test set?
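A possible way to answer Q 2-2 (a sketch; after fit, GridSearchCV refits the best model on the whole training set, so clf can be used directly for prediction):

from sklearn.metrics import accuracy_score

print('Train accuracy: {:.1%}'.format(accuracy_score(ya, clf.predict(Xa))))
print('Test accuracy: {:.1%}'.format(accuracy_score(yt, clf.predict(Xt))))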
