TODEREAN
HOW TO FINE-TUNE
SUPPORT VECTOR MACHINES
FOR CLASSIFICATION
Reviewers
Prof. Dr. Eng. Sergiu Nedevschi
Prof. Dr. Eng. Gabriel Oltean
ISBN 978-973-720-806-4
TABLE OF CONTENTS
INTRODUCTION
2. Classification
4. Performance Evaluation
5. Scalability
6. Dimensionality
4. Kernels
4.1. Linear Kernel
CONCLUSIONS
BIBLIOGRAPHY
NOTATIONS
b Bias
C Cost parameter
Q Kernel matrix
U Set of unbounded support vectors
w m-dimensional vector
yi Dependent variable i
z Feature space
INTRODUCTION
The first part of this book covers the theoretical aspects of support vector machines and their functionality; then, based on the discussed concepts, it explains how to fine-tune a support vector machine to yield highly accurate prediction results adaptable to any classification task. The introductory part is extremely beneficial to someone new to support vector machines, while the more advanced notions are useful for everyone who wants to understand the mathematics behind support vector machines and how to fine-tune them in order to obtain the best predictive performance from a given classification model.
followed by theoretical foundations pertaining to their attributes in supervised learning. It discusses the technique of mapping independent variables into a high-dimensional space, as well as several noteworthy research papers in the literature on various training techniques.
CHAPTER 1
SUPERVISED LEARNING
based on the input values.
1. Training Dataset
Typically, the variables, also named fields or attributes, are of two types:
nominal (values belong to an unordered set) and continuous (values are real
numbers). If the variable ai is of nominal type, its definition domain is denoted
4
by dom(ai ) = {vi,1, vi,2, ..., vi, dom(ai ) }, where dom(ai ) refers to its finite
cardinality. Similarly, dom(y) = {c1, c2, ..., c dom(y) } represents the definition
domain of the dependent variable. Variables of continuous type have an infinite
cardinality.
It is generally assumed that the tuples from the training dataset are generated randomly and independently according to an unknown fixed joint probability distribution D. This setting generalizes the deterministic case, in which each tuple is labeled by a function y = f(x).
2. Classification
{x ∈ X: c(x) = 1}. A set of concepts is known as a concept class C.
ε(I(S), D) = ∑_{⟨x,y⟩∈U} D(x, y) L(y, I(S)(x))   (1.1)

L(y, I(S)(x)) = { 1 if y ≠ I(S)(x); 0 if y = I(S)(x) }   (1.2)
3. Induction Algorithms
obtained by applying this algorithm I to the training dataset S, the notation I(S) is used. The dependent variable of a tuple xq can then be predicted, and this prediction is denoted by I(S)(xq).
4. Performance Evaluation
accuracy.
ε̂(I(S), S) = ∑_{⟨x,y⟩∈S} L(y, I(S)(x))   (1.3)
where L(y, I(S )(x)) is the cost function defined in equation (1.2).
Two different methods of estimating the generalization error are present in the literature: theoretical methods and empirical methods.
4.1.1. Theoretical Estimation of Generalization Error
frameworks, compares them, and highlights their strengths and weaknesses.
These frameworks are useful to estimate the generalization error. Among these,
the VC and the PAC (Probably Approximately Correct) frameworks are
mentioned, which add a penalty function to the training error to indicate the
capacity of an induction algorithm.
A. VC Dimension
The Vapnik-Chervonenkis (VC) theory [46] is the most complete theoretical learning framework relevant to classification models. The VC theory offers all the conditions needed for the consistency of the induction procedure. The concept of consistency comes from statistics and states that both the training and the generalization errors of the classification model must converge to a minimal error as the size of the training dataset tends to infinity. The VC theory defines the VC dimension as a capacity measure of an induction algorithm.
The VC theory addresses the extreme case in which the training error and the generalization error are related by bounds that hold for any induction algorithm and any probability distribution over the input space. These bounds are functions of the training dataset size and of the VC dimension of the induction algorithm.
ε(h, D) − ε̂(h, S) ≤ √((d(ln(2n/d) + 1) − ln(δ/4)) / n),  ∀h ∈ H, ∀δ > 0   (1.4)

with probability 1 − δ, where ε(h, D) is the generalization error of the classification model h over the distribution D, and ε̂(h, S) is the training error of the same classification model h measured over the training dataset S of cardinality n.
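As a quick numeric illustration, the sketch below (a Python snippet written for this text, with hypothetical values for the VC dimension d, the dataset size n, and the confidence parameter δ) evaluates the complexity term on the right-hand side of equation (1.4), showing how the bound tightens as the training dataset grows:

```python
import math

def vc_bound(d, n, delta):
    """The complexity term of equation (1.4): how much the generalization
    error may exceed the training error, with probability 1 - delta."""
    return math.sqrt((d * (math.log(2 * n / d) + 1) - math.log(delta / 4)) / n)

# Hypothetical model of VC dimension 10, evaluated at delta = 0.05.
for n in (1_000, 10_000, 100_000):
    print(n, round(vc_bound(d=10, n=n, delta=0.05), 4))
```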
B. PAC Dimension
The Probably Approximately Correct (PAC) learning model was
introduced by Valiant in 1984 [44]. This framework is useful to characterize the
concept class that can be reliably learned from a reasonable number of
randomly drawn training instances and a reasonable amount of computation
[34]. The following definition of the PAC learning model is adapted from [34]
and [35]:
Definition 2. Let C be a concept class defined over the input space X with m variables. Let I be an induction algorithm that considers the hypothesis space H. C is PAC learnable by I using H if, for ∀c ∈ C, ∀D defined over X, ∀ε ∈ (0, 1/2), and ∀δ ∈ (0, 1/2), the induction algorithm I finds, with probability greater than or equal to 1 − δ, a hypothesis h ∈ H such that ε(h, D) ≤ ε. C is learnable in polynomial time if the induction algorithm has polynomial time complexity in 1/ε, 1/δ, m, and size(c).
n ≥ (1/ε) ln(|H|/δ)   (1.5)
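To make equation (1.5) concrete, the sketch below (assuming a hypothetical finite hypothesis space of size |H| = 2^20) computes a sufficient number of training instances for given ε and δ:

```python
import math

def pac_sample_size(epsilon, delta, h_size):
    """Equation (1.5): instances sufficient for PAC learning over a finite
    hypothesis space of size h_size."""
    return math.ceil((1 / epsilon) * math.log(h_size / delta))

# Hypothetical example: 5% error tolerance with 95% confidence.
print(pac_sample_size(epsilon=0.05, delta=0.05, h_size=2**20))  # 338 instances
```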
The generalization error can be estimated by dividing the available dataset into a training dataset and a test dataset. The training dataset is used by the induction algorithm to build the classification model, and the misclassification rate of this model is then calculated on the test dataset. The error obtained on the test dataset yields a better estimate of the generalization error, because the model tends to overfit the training data, so the training error underestimates the generalization error.
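The sketch below illustrates this empirical procedure using the scikit-learn library; the synthetic dataset and the RBF-kernel SVM are stand-ins for any real dataset and induction algorithm:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Hypothetical synthetic data standing in for a real classification dataset.
X, y = make_classification(n_samples=2_000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

model = SVC(kernel="rbf").fit(X_tr, y_tr)
print("training error:", 1 - model.score(X_tr, y_tr))  # optimistic estimate
print("test error:    ", 1 - model.score(X_te, y_te))  # generalization estimate
```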
technique may increase the risk of a Type I error (false positive), that is, identifying a significant difference when there is none. On the other hand, using a t-test on the generalization error obtained on each fold decreases the risk of a Type I error, but may not provide an adequate estimate of the generalization error. To obtain a more reliable estimate, the k-fold cross-validation technique is usually repeated k times. However, the test datasets are then not independent, and a risk of a Type I error remains. Unfortunately, no satisfactory solution to this problem has been found so far. Dietterich proposed in [14] alternative tests that have a low chance of a Type I error but a high risk of a Type II error (false negative), i.e. not identifying a significant difference when one exists.
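A sketch of the repeated k-fold cross-validation estimate described above, again with scikit-learn (the dataset, the kernel, and the choice k = 10 are illustrative assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=1_000, n_features=20, random_state=0)

# k = 10 folds, repeated 10 times, for a more stable error estimate.
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=0)
scores = cross_val_score(SVC(kernel="rbf"), X, y, cv=cv)
print(f"estimated generalization error: {1 - scores.mean():.4f} "
      f"(std {scores.std():.4f})")
```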
5. Scalability
Induction algorithms have been successfully applied in many situations to solve fairly simple problems, but with the increasing desire to discover knowledge in large datasets, several difficulties and constraints related to time and memory appear.
6. Dimensionality
High-dimensional input data, i.e. datasets with a large number of variables, involve an exponential increase in the size of the search space and consequently increase the chance that an induction algorithm will build classification models that are not valid in general. In [25], the authors explain that in the case of a supervised classification model, the required number of instances increases with the dimensionality of the dataset. Furthermore, the author shows in [20] that in the case of a linear classification model, the required number of instances is linear with respect to the dimensionality, and proportional to the square of the dimensionality in the case of a quadratic classification model. For nonparametric classification models, such as decision trees, the situation is more serious: in order to obtain an efficient estimation of the multivariate densities, it was estimated in [24] that the number of instances must increase exponentially with the number of dimensions.
This situation is called the curse of dimensionality, a term first used by Bellman [2]. Algorithms such as decision trees, which are effective in situations of low dimensionality, do not yield significant results when the dimensionality increases beyond a certain level. Moreover, classification models built on datasets with a small number of variables are easier to interpret and more suitable for visualization using specific data mining methods.
CHAPTER 2
SUPPORT VECTOR MACHINES
complexity of the hyperplane can be bounded by another measure, called the margin. The margin is the minimum distance between a training instance and the decision boundary; thus, if the margin of a function is bounded, its complexity can be controlled. Learning with support vectors implies that the risk is minimized when the margin is maximized. A support vector machine selects the hyperplane with the maximum margin in the transformed input space: it separates the classes of the training instances while maximizing the distance to the nearest training instance. The parameters of the resulting hyperplane are obtained by solving a quadratic programming optimization problem.
the dependent variable yi ∈ {−1, 1}. If these training instances are linearly separable, the decision function can be determined:

yi(w⊤xi + b) ≥ 1   (2.2)

Figure 2.1 illustrates two decision functions that satisfy equation (2.2). It can be noted that there are an infinite number of decision functions that satisfy equation (2.2).
[Figure 2.1: The optimal separating hyperplane and the maximum margin in the (x1, x2) input space.]
D(x)/‖w‖   (2.4)

yk D(xk)/‖w‖ ≥ δ   (2.5)

δ‖w‖ = 1   (2.6)
Q(w, b) = (1/2)‖w‖²   (2.7)
subject to

yi(w⊤xi + b) ≥ 1 for i = 1, …, n   (2.8)
constraints are not unique, but the value of this function is. Thus, the fact that the solutions are not unique does not represent a problem for the SVM algorithm; on the contrary, it is an advantage over neural networks, which generate many local minima.
If the points that satisfy the strict inequalities are removed from equation (2.8), the same optimal separating hyperplane is still obtained. Thus, the points that satisfy the equality are called support vectors (this definition is imprecise for the moment, because the support vectors are properly defined using the solution of the dual problem, as discussed further). In Figure 2.1, the support vectors are represented by the filled squares and circles.
First, the problem subject to the constraints given by equations (2.7) and
(2.8) is transformed into a problem without constraints, namely:
Q(w, b, α) = (1/2)w⊤w − ∑_{i=1}^{n} αi{yi(w⊤xi + b) − 1}   (2.9)
where α = (α1, α2, …, αn)⊤ and αi are the nonnegative Lagrange multipliers [47].
The optimal solution of equation (2.9) is given by the saddle point. In this point,
equation (2.9) is minimized with respect to w, maximized with respect to αi ≥ 0, and minimized or maximized with respect to b according to the sign of ∑_{i=1}^{n} αi yi, and the solution satisfies the Karush-Kuhn-Tucker (KKT) conditions [31]:
∂Q(w, b, α)/∂w = 0   (2.10)

∂Q(w, b, α)/∂b = 0   (2.11)

αi{yi(w⊤xi + b) − 1} = 0 for i = 1, …, n   (2.12)

αi ≥ 0 for i = 1, …, n   (2.13)
Using equation (2.9), equations (2.10) and (2.11) are reduced to:
w = ∑_{i=1}^{n} αi yi xi   (2.14)

∑_{i=1}^{n} αi yi = 0   (2.15)
Then, substituting equations (2.14) and (2.15) into equation (2.9) results in the dual problem, which must be maximized [47]:

Q(α) = ∑_{i=1}^{n} αi − (1/2) ∑_{i=1}^{n} ∑_{j=1}^{n} αi αj yi yj xi⊤xj   (2.16)

subject to

∑_{i=1}^{n} αi yi = 0,  αi ≥ 0 for i = 1, …, n   (2.17)
(1/2) ∑_{i=1}^{n} ∑_{j=1}^{n} αi αj yi yj xi⊤xj = (1/2) (∑_{i=1}^{n} αi yi xi)⊤ (∑_{i=1}^{n} αi yi xi) ≥ 0   (2.18)
The training instances associated with the positive values of αi are support
vectors for yi = 1 or yi = − 1. Thus, from equation (2.14) we obtain the decision
function:
D(x) = ∑_{i∈S} αi yi xi⊤x + b   (2.19)
where S is the set of support vectors. From the KKT conditions given by equation (2.12), b = yi − w⊤xi holds for any support vector xi (2.20); in practice, b is averaged over all the support vectors, as in equation (2.21):

b = (1/|S|) ∑_{i∈S} (yi − w⊤xi)   (2.21)
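The sketch below illustrates equations (2.14), (2.19), and (2.21) numerically: a linear SVM is trained on hypothetical separable data (scikit-learn's SVC with a very large C approximates the hard margin case), and w and b are then recovered from the dual variables and checked against the library's own intercept:

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical linearly separable toy data: two well-separated clusters.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 0.5, (20, 2)), rng.normal(2, 0.5, (20, 2))])
y = np.array([-1] * 20 + [1] * 20)

# A very large C approximates the hard margin SVM.
model = SVC(kernel="linear", C=1e6).fit(X, y)

# Equation (2.14): w = sum_i alpha_i y_i x_i (dual_coef_ stores alpha_i * y_i).
w = (model.dual_coef_ @ model.support_vectors_).ravel()
# Equation (2.21): b averaged over the support vectors.
b = np.mean(y[model.support_] - model.support_vectors_ @ w)
print(w, b, model.intercept_)  # b should match model.intercept_
```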
In the case of the hard margin support vector machines, the training dataset is linearly separable. However, if the training dataset is not linearly separable, no feasible solution exists, and the hard margin support vector problem cannot be solved. Thus, the soft margin support vector machines for linearly inseparable training data are presented next.
In order to allow for inseparability, the nonnegative slack variables ξi ≥ 0 are introduced [12] in equation (2.2):

yi(w⊤xi + b) ≥ 1 − ξi for i = 1, …, n

The introduction of the slack variables guarantees the existence of a feasible solution [12]. For the training instances xi with 0 < ξi < 1 (ξi in Figure 2.2), the instances are correctly classified but do not achieve the maximum margin.

[Figure 2.2: The optimal separating hyperplane, the maximum margin, and the slack variables ξi and ξj in the (x1, x2) input space.]
If ξi ≥ 1 (ξj in Figure 2.2), the training instances are misclassified by the optimal separating hyperplane. In order to obtain the optimal separating hyperplane with a minimum number of training instances that violate the maximum margin, equation (2.24) must be minimized:
Q(w) = ∑_{i=1}^{n} θ(ξi)   (2.24)

where

θ(ξi) = { 1 if ξi > 0; 0 if ξi = 0 }   (2.25)
Q(w, b, ξ) = (1/2)w⊤w + (C/p) ∑_{i=1}^{n} ξi^p   (2.26)
subject to

yi(w⊤xi + b) ≥ 1 − ξi,  ξi ≥ 0 for i = 1, …, n   (2.27)

where ξ = (ξ1, ξ2, …, ξn)⊤ and C represents the cost parameter, which controls the balance between margin maximization and classification error minimization. The parameter p has a value of 1 or 2. The resulting hyperplane is called a soft margin hyperplane. When p = 1, the SVM is called the L1 soft margin SVM, or L1 SVM, and when p = 2 it is called the L2 soft margin SVM, or L2 SVM. This section further details the L1 soft margin SVM because it is the algorithm to be implemented next.
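As an illustration of the role of the cost parameter, the sketch below trains scikit-learn's SVC (which implements the L1 soft margin formulation) with several values of C on hypothetical overlapping-class data; a small C favors a wide margin, while a large C penalizes slack more heavily:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Hypothetical overlapping-class data, so slack variables are required.
X, y = make_classification(n_samples=500, n_features=10, flip_y=0.1,
                           random_state=0)

# C balances margin maximization against classification error minimization.
for C in (0.01, 1.0, 100.0):
    scores = cross_val_score(SVC(kernel="linear", C=C), X, y, cv=5)
    print(f"C = {C:>6}: cross-validated accuracy {scores.mean():.3f}")
```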
Q(w, b, ξ, α, β) = (1/2)w⊤w + C ∑_{i=1}^{n} ξi − ∑_{i=1}^{n} αi{yi(w⊤xi + b) − 1 + ξi} − ∑_{i=1}^{n} βi ξi   (2.28)
For the optimal solution, the following KKT conditions are satisfied [12]:
∂Q(w, b, ξ, α, β)/∂w = 0   (2.29)

∂Q(w, b, ξ, α, β)/∂b = 0   (2.30)

∂Q(w, b, ξ, α, β)/∂ξ = 0   (2.31)

αi{yi(w⊤xi + b) − 1 + ξi} = 0 for i = 1, …, n   (2.32)

βi ξi = 0 for i = 1, …, n   (2.33)

αi ≥ 0, βi ≥ 0, ξi ≥ 0 for i = 1, …, n   (2.34)
Using equation (2.28), equations (2.29), (2.30), and (2.31) are reduced to
[32]:
w = ∑_{i=1}^{n} αi yi xi   (2.35)

∑_{i=1}^{n} αi yi = 0   (2.36)

αi + βi = C for i = 1, …, n   (2.37)

Substituting these into equation (2.28) gives the dual problem to be maximized:

Q(α) = ∑_{i=1}^{n} αi − (1/2) ∑_{i=1}^{n} ∑_{j=1}^{n} αi αj yi yj xi⊤xj   (2.38)

subject to

∑_{i=1}^{n} αi yi = 0,  0 ≤ αi ≤ C for i = 1, …, n   (2.39)
In comparison to the hard margin SVM, for the L1 soft margin SVM, αi cannot be greater than the parameter C. The inequality constraints in equation (2.39) are called box constraints.
The decision function is identical to that of the hard margin SVM and is given by equation (2.40):

D(x) = ∑_{i∈S} αi yi xi⊤x + b   (2.40)

where S is the set of support vectors. The summation in equation (2.40) applies only to the support vectors, since only their αi are nonzero. Using the unbounded support vectors, b is given by equation (2.42):
b = (1/|U|) ∑_{i∈U} (yi − w⊤xi)   (2.42)

where U represents the set of unbounded support vectors.
3. Kernel Trick
We use ϕ(x) = (ϕ1(x), ϕ2(x), …, ϕn(x))⊤, a nonlinear vector function, to map the m-dimensional input vector x into an n-dimensional feature space. The linear decision function in the feature space is shown in equation (2.44):

D(x) = w⊤ϕ(x) + b   (2.44)
∑_{i=1}^{n} ∑_{j=1}^{n} hi hj K(xi, xj) ≥ 0   (2.45)
for all n ∈ ℕ, xi, and hi ∈ ℝ, based on the Hilbert-Schmidt theory [1], a function
ϕ(x) exists that maps the input vector x into a feature space and satisfies
equation:
∑_{i=1}^{n} ∑_{j=1}^{n} hi hj K(xi, xj) = (∑_{i=1}^{n} hi ϕ(xi))⊤ (∑_{i=1}^{n} hi ϕ(xi)) ≥ 0   (2.47)
Equations (2.45) and (2.47) are called Mercer's conditions [47], and the functions that satisfy these conditions are called Mercer kernels, referred to hereafter simply as kernels.
Q(α) = ∑_{i=1}^{n} αi − (1/2) ∑_{i=1}^{n} ∑_{j=1}^{n} αi αj yi yj K(xi, xj)   (2.48)

subject to

∑_{i=1}^{n} yi αi = 0,  0 ≤ αi ≤ C for i = 1, …, n   (2.49)

αi(yi(∑_{j=1}^{n} yj αj K(xi, xj) + b) − 1 + ξi) = 0 for i = 1, …, n   (2.50)
(C − αi )ξi = 0 for i = 1, …, n (2.51)
αi ≥ 0, ξi ≥ 0 for i = 1, …, n (2.52)
D(x) = ∑_{i∈S} αi yi K(xi, x) + b   (2.53)

b = yj − ∑_{i∈S} αi yi K(xi, xj)   (2.54)

b = (1/|U|) ∑_{j∈U} (yj − ∑_{i∈S} αi yi K(xi, xj))   (2.55)
If D(x) = 0, the instance x cannot be classified.
4. Kernels
The SVM algorithm presents a great advantage over other machine learning algorithms: by selecting the appropriate kernel for a particular application, the generalization performance can be significantly improved. Therefore, selecting the appropriate kernel represents a critical step, and ongoing research focuses on developing new kernels [41], [42]. The four main kernel functions used in the SVM algorithm are presented next.
where γ > 0 and r ≥ 0 are parameters, and r is used to make a compromise
between the influence of higher-order terms against lower-order terms. When
r = 0, the kernel is called homogeneous. Generally, the following values γ = 1,
r = 1, and d = 2 or d = 3 are used.
The radial basis function kernel, also called the Gaussian kernel, is given by the equation:

K(x, x′) = exp(−γ‖x − x′‖²),  γ > 0

In this case, the centers of the radial basis functions are the support vectors themselves. It is important to note that the RBF (radial basis function) kernels map the input space into an infinite-dimensional space and, since the Euclidean distance is used, they are not robust to extreme values.
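The sketch below implements the kernels discussed in this section (with illustrative parameter values) and checks Mercer's condition (2.45) empirically: for a valid kernel, the Gram matrix over any finite sample must be positive semidefinite, i.e. all its eigenvalues must be nonnegative:

```python
import numpy as np

# Illustrative kernel definitions; gamma, r, and d follow the text's notation.
def linear_kernel(x, z):
    return x @ z

def polynomial_kernel(x, z, gamma=1.0, r=1.0, d=2):
    return (gamma * (x @ z) + r) ** d

def rbf_kernel(x, z, gamma=1.0):
    return np.exp(-gamma * np.sum((x - z) ** 2))

# Empirical Mercer check (2.45): eigenvalues of the Gram matrix >= 0.
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 5))
K = np.array([[rbf_kernel(a, b) for b in X] for a in X])
print(np.linalg.eigvalsh(K).min() >= -1e-10)  # True within round-off
```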
5. Training Techniques
caching method. One of the inefficiencies of this algorithm is its caching method: it caches only a few rows of the kernel Hessian matrix, which consumes memory and becomes impractical for large datasets. In addition, a failure of the shrinking strategy results in reoptimizing the SVM training problem.
Platt proposed the SMO method [37], which represents an important step in training an SVM. In this algorithm, the size of the working set is equal to two, and the resulting optimization problem with two variables is solved without requiring any additional optimization package. In [37], some heuristics for selecting the working set have been suggested. Keerthi et al. [29] identified an inefficiency in Platt's algorithm in updating the parameters with a single threshold and proposed using two threshold parameters instead, a change that led to improved performance. Hsu and Lin presented a working set selection method through which convergence is attained more quickly in difficult cases [23].
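The following sketch shows the analytic two-variable update at the core of SMO, in the spirit of Platt [37]; the clipping bounds and the step follow the standard derivation, the variable names are ours, and the working set {i, j} is assumed to be already selected:

```python
import numpy as np

def smo_pair_update(alpha, i, j, y, C, K, b):
    """One analytic SMO step on the working set {i, j}: the two-variable
    subproblem is solved in closed form while keeping sum(alpha * y) = 0."""
    f = (alpha * y) @ K + b              # current outputs for all instances
    E_i, E_j = f[i] - y[i], f[j] - y[j]  # prediction errors

    # Feasible segment for alpha_j under the box and equality constraints.
    if y[i] != y[j]:
        L, H = max(0.0, alpha[j] - alpha[i]), min(C, C + alpha[j] - alpha[i])
    else:
        L, H = max(0.0, alpha[i] + alpha[j] - C), min(C, alpha[i] + alpha[j])
    eta = K[i, i] + K[j, j] - 2 * K[i, j]  # curvature along the segment
    if H <= L or eta <= 0:
        return alpha                       # no progress possible on this pair

    alpha = alpha.copy()
    a_j = np.clip(alpha[j] + y[j] * (E_i - E_j) / eta, L, H)
    alpha[i] += y[i] * y[j] * (alpha[j] - a_j)  # preserve the equality constraint
    alpha[j] = a_j
    return alpha
```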
proposed by Keerthi et al. [15]. Their algorithm achieved a five to ten times better performance than the previous algorithms, in which caching the kernel matrix represents the critical part. In this case, the size of the kernel matrix is equal to the size of the working set, which is at least the number of support vectors. The digest method restricts the increase in the number of candidate support vectors, so that the set of candidate support vectors never exceeds the size of the working set during the training process. Therefore, it is possible to maximize the reuse of the cached kernel matrix. In [28], Keerthi and Gilbert demonstrated the convergence of the algorithm proposed by Keerthi et al.
Dong et al. proposed an effective solution for training SVMs that consists of two steps: a parallel optimization and a sequential working set optimization [16]. During the first step, the kernel matrix is approximated by block diagonal matrices, so the optimization problem is divided into multiple subproblems that can be solved more efficiently. The majority of nonsupport vectors are removed, and the training sets are collected for the second step.
CHAPTER 3
FINE-TUNING SUPPORT VECTOR MACHINES
Q(w, b, ξ) = (1/2)w⊤w + C ∑_{i=1}^{n} ξi   (3.1)
subject to

yi(w⊤ϕ(xi) + b) ≥ 1 − ξi,  ξi ≥ 0 for i = 1, …, n   (3.2)
where ϕ(xi ) maps xi in a high-dimensional space and C > 0 is a cost or
regularization parameter.
Q(α) = (1/2) α⊤Qα − e⊤α   (3.3)
subject to

y⊤α = 0,  0 ≤ αi ≤ C for i = 1, …, n   (3.4)
After solving the dual problem, the optimal vector w satisfies equation
(3.5):
w = ∑_{i=1}^{n} yi αi ϕ(xi)   (3.5)

and the decision function is

sgn(w⊤ϕ(x) + b) = sgn(∑_{i=1}^{n} yi αi K(xi, x) + b)   (3.6)
where b is a constant.
∇f(α) + by = λ − ξ,
λi αi = 0,  ξi(C − αi) = 0,  λi ≥ 0,  ξi ≥ 0,  i = 1, …, n   (3.7)

▪ where
▪ Thus, a feasible solution α is stationary if and only if:
āts = { ats if ats > 0; τ otherwise }   (3.17)
▪ Select:
j ∈ argmin_t { −bit² / āit : t ∈ Ilow(αk), −yt ∇t f(αk) < −yi ∇i f(αk) }   (3.19)
3. To accelerate the convergence of the algorithm near the end of the
iterative process, the decomposition method identifies a possible set A
containing all the free support vectors. An optimal solution α of the
dual problem may contain bounded support vectors. To accelerate the
training process, this shrinking technique identifies and removes some
bounded support vectors. Thus, instead of solving the whole problem,
the decomposition method is applied to a smaller problem [26],
namely:
min_{αA} ( (1/2) αA⊤ QAA αA − (pA − QAN αN^k)⊤ αA )   (3.20)
▪ subject to
▪ Until the condition m(αk ) − M(αk ) ≤ e , where e is the tolerance, is
satisfied, the variables in the following set can be shrunk:
Gi = C ∑_{αj = C} Qij,  i = 1, …, n   (3.26)

∇i f(α) = ∑_{j=1}^{n} Qij αj + pi = Gi + pi + ∑_{0<αj<C} Qij αj   (3.27)
∇i f (αk+1) = ∇i f (αk ) + Qit Δαt + QisΔαs (3.28)
min_{αi,αj} ( (1/2) [αi αj] [Qii Qij; Qji Qjj] [αi; αj] + (−pB + QBN αN^k)⊤ [αi; αj] + const )   (3.30)
▪ subject to
▪ and let
αi^new = αi^k + yi bij / aij   (3.33)

αj^new = αj^k − yj bij / aij   (3.34)
▪ where N = {1, …, n}\B, and αB^k and αN^k are the subvectors of α^k corresponding to the sets B and N, respectively.
min_{αi,αj} ( (1/2) [αi αj] [Qii Qij; Qji Qjj] [αi; αj] + (−pB + QBN αN^k)⊤ [αi; αj] + ((τ − aij)/4) ((αi − αi^k)² + (αj − αj^k)²) )   (3.35)

αi^new = αi^k + yi bij / τ   (3.36)

αj^new = αj^k − yj bij / τ   (3.37)
5. Finally, set αB^{k+1} as the optimal point of the subproblem and αN^{k+1} = αN^k. Set k = k + 1 and go to step 2.
decision function is computed. If there exists 0 < αi < C , from the KKT
condition (3.14), we have:
b = −yi ∇i f(α)   (3.38)

or, averaging over all such instances:

b = − (∑_{i: 0<αi<C} yi ∇i f(α)) / |{i : 0 < αi < C}|   (3.39)
conquer approach [5], [7], [4] is used to split the original problem into a set of subproblems, which are solved using the SMO algorithm adapted by Chang and Lin [8].
For each subproblem, the kernel matrix can be stored in a kernel cache allocated as a contiguous block of memory. The size of the kernel matrix is chosen large enough to contain all the support vectors in the training dataset and small enough to meet the memory constraint. Since the kernel matrix for the subproblem is entirely cached, each element of the kernel matrix has to be evaluated only once and is calculated using a fast method, namely the fast SVM algorithm proposed by Dong, Suen, and Krzyzak [16]. This algorithm consists of two stages: a parallel optimization and a fast sequential optimization. Each of these two stages is described below.
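A minimal sketch of the kernel caching idea (our own illustration, not the authors' implementation): the subproblem's kernel matrix is preallocated as one contiguous block, and each entry is computed at most once:

```python
import numpy as np

class KernelCache:
    """Cache for a subproblem small enough to hold its full kernel matrix."""

    def __init__(self, X, kernel):
        self.X, self.kernel = X, kernel
        n = len(X)
        self.K = np.full((n, n), np.nan)  # one preallocated contiguous block

    def get(self, i, j):
        # Evaluate the kernel only on the first request for this entry.
        if np.isnan(self.K[i, j]):
            v = self.kernel(self.X[i], self.X[j])
            self.K[i, j] = self.K[j, i] = v  # exploit symmetry
        return self.K[i, j]
```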
Qdiag = diag(Q11, Q22, …, Qkk)   (3.41)

where the matrices Qii of size ni × ni, i = 1, …, k, with ∑_{i=1}^{k} ni = n, are the diagonal blocks.
Consequently, k optimization subproblems are obtained, as described in the working set selection algorithm [19]. All the subproblems are optimized in parallel using the SMO decomposition algorithm [8], as detailed previously. Upon completion of the parallel optimization, most nonsupport vectors are removed from the training dataset, on the assumption that the nonsupport vectors of the optimization subproblems are a subset of the nonsupport vectors of the initial optimization problem. The computational diagram is illustrated in Figure 3.1 [16].

[Figure 3.1: The parallel optimization diagram: the blocks Q11, Q22, …, Qkk are computed and optimized horizontally, followed by the vertical calculation (adapted from [16]).]
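The sketch below mimics this first, parallel stage under simplifying assumptions (scikit-learn's SVC stands in for the SMO solver, joblib provides the parallelism, and each block is assumed to contain both classes): the training set is split into k blocks, each block is optimized independently, and only the candidate support vectors are kept for the second stage:

```python
import numpy as np
from joblib import Parallel, delayed
from sklearn.svm import SVC

def parallel_first_stage(X, y, k, C=1.0):
    """Optimize k block-diagonal subproblems in parallel and keep only the
    instances that became support vectors in their own block."""
    blocks = np.array_split(np.arange(len(X)), k)

    def solve_block(idx):
        model = SVC(kernel="rbf", C=C).fit(X[idx], y[idx])
        return idx[model.support_]  # candidate support vectors of this block

    kept = np.concatenate(Parallel(n_jobs=-1)(
        delayed(solve_block)(idx) for idx in blocks))
    return X[kept], y[kept]  # reduced training set for the second stage
```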
new training dataset is much smaller compared to the original dataset, the available memory may not be sufficient to store the kernel matrix, especially for large datasets. Thus, the fast sequential optimization technique [16] is used to train the SVM.
In this optimization process, the first step applies the SMO algorithm [8] to optimize a subproblem on the subset E with kernel caching, and updates αi and the kernel matrix. The subset size is chosen to be large enough to contain all the support vectors in the training dataset but at the same time sufficiently small to meet the memory constraint.
The subset is selected using the queue method for subset selection [16].
This queue method selects subsets of the training dataset that can be trained
using the fast sequential optimization algorithm. The method is initialized by
setting the subset to contain the first d instances in the training dataset and the
queue QS to contain the rest of the instances, and computing the kernel matrix
for that specific subset. Once initialized, the subset selection process works as
follows: each nonsupport vector in the subset is added at the end of the queue
and is replaced in the subset with the instance at the front of the queue, which is
consequently removed from the queue. After all the nonsupport vectors are
replaced, the subset is used in the optimization process. In the next iteration, the
same process is applied, starting with the subset and the queue that are in the
same state they were at the end of the last iteration.
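A small sketch of one replacement pass of the queue method (the data are hypothetical, and in practice the support mask would come from the optimization step):

```python
from collections import deque
import numpy as np

def replace_nonsupport_vectors(subset, queue, support_mask):
    """One pass of the queue method [16]: each nonsupport vector in the
    subset is appended to the back of the queue and replaced by the
    instance at the front, which is removed from the queue."""
    for pos in np.flatnonzero(~support_mask):
        if not queue:
            break
        queue.append(subset[pos])      # nonsupport vector rejoins the queue
        subset[pos] = queue.popleft()  # fresh instance enters the subset
    return subset, queue

# Hypothetical state: subset of the first 5 instances, queue holds the rest.
subset, queue = list(range(5)), deque(range(5, 12))
mask = np.array([True, False, True, False, True])  # False = nonsupport vector
print(replace_nonsupport_vectors(subset, queue, mask))
```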
Since the KKT conditions are evaluated at each training instance, this step
has a high computational cost, therefore some heuristic stopping conditions
have been proposed in [16]:
▪ sv ≥ d − 1
where Δsv represents the change in the number of support vectors between
two successive subsets, n the size of the training dataset, and T > 1 represents a
predefined maximum number of loops allowed through the dataset.
CHAPTER 4
ESTIMATING SUPPORT VECTOR MACHINES
The proposed support vector machine architecture [5], [7], [4] is scalable to
large datasets and can be applied to any dataset from any field of activity, as
long as the problem to be solved is a classification problem.
of the original dataset [43]. For optimal training, a machine learning algorithm such as support vector machines requires a training dataset in which the dependent variable, in this case the variable Churn, has an approximately balanced distribution of its classes. The training dataset initially has 388 (14%) instances belonging to the Yes class and 2,279 (86%) instances belonging to the No class.
Table 4.2. The structure of the confusion matrix:

Dataset         Observed   Predicted No   Predicted Yes   % correct
Training/Test   No         TN             FP              TNR
                Yes        FN             TP              TPR
                Total %    NPV            PPV             ACC
The confusion matrix cells are called: true negatives (TN), false positives (FP), false negatives (FN), and true positives (TP) (Table 4.2). The rest of the measures from Table 4.2, such as TNR (true negative rate or specificity), TPR (true positive rate or sensitivity), NPV (negative predictive value), PPV (positive predictive value or precision), and ACC (accuracy), are calculated using the following equations:
TPR = TP / (TP + FN)   (4.1)

TNR = TN / (TN + FP)   (4.2)

PPV = TP / (TP + FP)   (4.3)

NPV = TN / (TN + FN)   (4.4)

ACC = (TP + TN) / (P + N)   (4.5)
where P represents the total number of positives and N is the total number of
negatives.
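Equations (4.1) through (4.5) are straightforward to compute; the sketch below evaluates them on the training-phase cells reported in Table 4.3, reproducing the 99.43% sensitivity and 99.72% accuracy quoted in the text:

```python
def confusion_metrics(tp, tn, fp, fn):
    """Equations (4.1)-(4.5), computed from the confusion matrix cells."""
    return {
        "TPR": tp / (tp + fn),                   # sensitivity, eq. (4.1)
        "TNR": tn / (tn + fp),                   # specificity, eq. (4.2)
        "PPV": tp / (tp + fp),                   # precision,   eq. (4.3)
        "NPV": tn / (tn + fn),                   # eq. (4.4)
        "ACC": (tp + tn) / (tp + tn + fp + fn),  # eq. (4.5), P + N in total
    }

# Training-phase cells from Table 4.3.
print(confusion_metrics(tp=2281, tn=2279, fp=0, fn=13))
```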
Based on these defined measures, Table 4.3 shows the confusion matrix of the machine learning algorithm used, namely the support vector machine, on both the training and the test datasets.
In the training phase, the prediction model correctly classified 2,281 of the 2,294 cases belonging to the Yes class, a true positive rate of 99.43%. All of the 2,279 cases belonging to the No class were classified correctly, giving a specificity of 100%. In other words, approximately 99% of the cases in the training dataset are classified correctly.
Within the test dataset, all of the 95 cases belonging to the Yes class were classified correctly (a true positive rate of 100%), and 569 of the 571 cases belonging to the No class were classified correctly (a specificity of 99.65%). Overall, approximately 99.70% of the cases in the test dataset are classified correctly and approximately 0.30% are misclassified.
Table 4.3. The confusion matrix of the SVM model on the training and the test datasets:

Dataset    Observed   Predicted No   Predicted Yes   % correct
Training   No         2,279          0               100.00%
           Yes        13             2,281           99.43%
           Total %    50.19%         49.81%          99.72%
Test       No         569            2               99.65%
           Yes        0              95              100.00%
           Total %    85.44%         14.56%          99.70%
Another way of interpreting these results is the lift chart. This type of chart sorts the predicted pseudo-probabilities [33] in descending order and displays the corresponding curve. There are two types of lift charts: incremental and cumulative. The incremental lift chart in Figure 4.1 shows the lift factor in each percentile [18], without any accumulation, for the Yes class of the dependent variable Churn. The curve corresponding to this predictive model falls below the gray line, which corresponds to the random expectation (RND E), around the 16th percentile. This means that, compared to the random expectation, the model achieves its maximum performance in the first 16% of the instances.
[Figure 4.1: The incremental lift chart for Churn = Yes: lift per percentile, with the SVM curve plotted against the random expectation (RND E) line.]
The cumulative lift chart indicates the prediction rate of the predictive model compared to the random expectation. Figure 4.2 illustrates the cumulative lift curve for the Yes class of the dependent variable Churn. Reading the chart along the horizontal axis, it can be seen that at the 16th percentile the model has a lift index of approximately 7 on the vertical axis, meaning that its predictive performance is approximately 7 times better than that of a random model.
[Figure 4.2: The cumulative lift chart for Churn = Yes, with the SVM curve plotted against the random expectation (RND E) line.]
The performance of this predictive model can also be evaluated using the
gain measure. The gain chart shows the percentage of positive responses on the
vertical axis, and the percentage of cases predicted on the horizontal axis. The
gain measure is defined as the proportion of cases present in each percentile
relative to the total number of cases. The cumulative gain chart shows the
prediction rate of the model compared to the random expectation.
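The sketch below computes the cumulative gain (and, from it, the cumulative lift) for a hypothetical scoring: the instances are sorted by predicted pseudo-probability in descending order, and the captured fraction of all positives is reported per percentile:

```python
import numpy as np

def cumulative_gain(y_true, scores, percentiles=range(10, 101, 10)):
    """Fraction of all positives captured within each top percentile of
    instances, ranked by score; lift at percentile p is gain / (p / 100)."""
    ranked = np.asarray(y_true)[np.argsort(-np.asarray(scores))]
    total_pos = ranked.sum()
    return {p: ranked[: int(len(ranked) * p / 100)].sum() / total_pos
            for p in percentiles}

# Hypothetical perfect ranking: 16 positives out of 100, scored highest.
y = np.array([1] * 16 + [0] * 84)
s = np.linspace(1, 0, 100)
print(cumulative_gain(y, s))  # all positives captured by the 20th percentile
```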
[Figure 4.3: The cumulative gain chart for Churn = Yes: % gain per percentile, with the SVM curve plotted against the random expectation (RND E) line.]

Figure 4.4 shows the ROC (receiver operating characteristic) curve of the predictive model. The ROC curve is derived from the confusion matrix and uses only the TPR and the FPR (false positive rate) measures, the latter being obtained by subtracting the specificity from one. Following the chart in Figure 4.4, one can observe that the curve approaches the coordinate point (0, 1) in the upper left corner, which implies a perfect prediction. Our predictive model based on support vector machines obtains a sensitivity of 100% and a specificity of 100%.
[Figure 4.4: The ROC curve for Churn = Yes: TPR (sensitivity) versus FPR (1 − specificity), with the SVM curve plotted against the random expectation (RND E) diagonal.]
CHAPTER 5
CONCLUSIONS
To improve the linear separability of the soft margin SVM algorithm, the kernel trick technique is applied using the polynomial kernel of degree 4, which satisfies Mercer's condition [47]. Using this technique, the initial input space is mapped into a high-dimensional feature space that is never treated explicitly. To train the soft margin SVM, a quadratic problem with a number of variables equal to the number of training instances must be solved. When the number of training instances is high, the kernel matrix is dense and cannot be stored in memory. In contrast to the standard decomposition algorithm, which depends on a caching strategy to calculate the kernel matrix, in this book a divide-and-conquer approach is used to split the original problem into a set of subproblems, which are solved using the SMO algorithm adapted by Chang and Lin [8]. For each subproblem, the kernel matrix can be stored in a cache allocated as a contiguous block of memory. The size of the kernel matrix is large enough to contain all the support vectors in the training dataset and small enough to meet the memory constraint. Since the kernel matrix for the subproblem is entirely cached, each element of the kernel matrix has to be evaluated only once and is calculated using a fast method, namely the fast SVM algorithm proposed by Dong, Suen, and Krzyzak [16]. This algorithm consists of a parallel optimization followed by a fast sequential optimization.
The generalization error of this support vector machine model [5], [7], [4]
was then estimated empirically. This predictive model was also evaluated using
four other methods, namely: the incremental and the cumulative lift charts, the
gain chart, and the ROC curve.
Using the incremental lift chart method, it can be observed that the predictive model achieves its maximum performance in the first 16% of the instances, because the corresponding curve of the predictive model falls below the gray line corresponding to the random expectation around the 16th percentile. The cumulative lift chart indicates that, at the 16th percentile, the model has a lift index of approximately 7 on the vertical axis, meaning that its predictive performance is about 7 times better than that of a random model.
Based on the gain chart, at the 16th percentile the predictive model implemented using the support vector machine captures 100% of the respondents relative to the total number of respondents. By interpreting the ROC curve chart, it can be seen that the model is very close to the perfect prediction, with a sensitivity of 100% and a specificity of 100%.
It is extremely important to reiterate that the resulting soft margin support vector machine architecture is scalable to larger datasets and can be applied to any dataset from any field of activity, as long as the problem to be solved is a classification problem.
BIBLIOGRAPHY
8. Chang, C.C. and C.J. Lin, LIBSVM: A Library for Support Vector Machines.
2001.
9. Cherkassky, V. and F. Mulier, Learning from Data: Concepts, Theory, and
Methods. 1998: John Wiley & Sons.
10. Collobert, R. and S. Bengio, SVMTorch: Support Vector Machines for Large-
Scale Regression Problems. Machine Learning Research, 2001. 1.
11. Collobert, R., S. Bengio, and Y. Bengio, A Parallel Mixture of SVMs for Very
Large Scale Problems. Neural Computation, 2002. 14.
12. Cortes, C. and V. Vapnik, Support Vector Networks. Machine Learning, 1995.
20.
13. DeCoste, D. and B. Scholkopf, Training Invariant Support Vector Machines.
Machine Learning, 2002. 46.
14. Dietterich, T.G., Approximate Statistical Tests for Comparing Supervised
Classification Learning Algorithms. Neural Computation, 1998. 10(7).
15. Dong, J.X., A. Krzyzak, and C.Y. Suen, A Fast SVM Training Algorithm.
Pattern Recognition with Support Vector Machines: Lecture Notes in
Computer Science, 2002. 2388.
16. Dong, J.X., A. Krzyzak, and C.Y. Suen, Fast SVM Training Algorithm with
Decomposition on Very Large Data Sets. IEEE Transactions on Pattern
Analysis and Machine Intelligence, 2005. 27(4).
17. Dunteman, G.H., Principal Components Analysis. 1989: Sage Publications.
18. Edwards, D.I., Introduction to Graphical Modeling 2nd Edition. 2000: Springer.
19. Fan, R.E., P.H. Chen, and C.J. Lin, Working Set Selection Using Second Order
Information for Training SVM. Journal of Machine Learning Research, 2005.
6.
20. Fukunaga, K., Introduction to Statistical Pattern Recognition. 1990: Academic
Press.
21. Gorunescu, F., Data Mining Concepts, Models and Techniques. 2011: Springer
Verlag.
22. He, H. and Y. Ma, Imbalanced Learning Foundations, Algorithms, and
Applications. 2013: John Wiley & Sons.
23. Hsu, C.W. and C.J. Lin, A Comparison of Methods for Multi-Class Support
Vector Machines. IEEE Transactions on Neural Networks, 2002.
24. Hwang, J., S. Lay, and A. Lippman, Nonparametric Multivariate Density Estimation: A Comparative Study. IEEE Transactions on Signal Processing, 1994. 42(10).
25. Jimenez, L.O. and D.A. Landgrebe, Supervised Classification in High-Dimensional Space: Geometrical, Statistical, and Asymptotical Properties of Multivariate Data. IEEE Transactions on Systems, Man, and Cybernetics, 1998. 28.
26. Joachims, T., Making Large-Scale Support Vector Machine Learning Practical.
Advances in Kernel Methods: Support Vector Machines, 1998.
27. Jordan, M.I. and R. Thibaux, The Kernel Trick. UC Berkley Lecture Notes,
2004.
28. Keerthi, S.S. and E.G. Gilbert, Convergence of a Generalized SMO Algorithm
for SVM Classifier Design. Machine Learning, 2002. 46.
29. Keerthi, S.S., S.K. Shevade, C. Bhattacharyya, and K.R.K. Murthy, Improvements to Platt's SMO Algorithm for SVM Classifier Design. Neural Computation, 2001. 13.
30. Kim, J.O. and C.W. Mueller, Factor Analysis: Statistical Methods and Practical
Issues. 1978: Sage Publications.
31. Kuhn, H. and A. Tucker. Nonlinear Programming. Proceedings of the Second Berkeley Symposium on Mathematical Statistics and Probability. 1951.
32. Leon, F., Inteligență Artificială: Mașini Cu Vector Suport. 2014: Tehnopress.
33. Minsky, M.L. and S. Papert, Perceptrons. 1969: MIT Press.
34. Mitchell, T., Machine Learning. 1997: McGraw-Hill.
35. Nilsson, N.J., Introduction to Machine Learning. 1998: Stanford University.
36. Osuna, E., R. Freund, and F. Girosi. Training Support Vector Machines: An
Application to Face Detection. Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition. 1997.
37. Platt, J.C., Fast Training of Support Vector Machines Using Sequential
Minimal Optimization. Advances in Kernel Methods: Support Vector
Machines, 1998.
38. Schmitt, M., On the Complexity of Computing and Learning with
Multiplicative Neural Networks. Neural Computation, 2002. 14(2).
39. Scholkopf, B., Support Vector Learning. 1997: Oldenbourg-Verlag.
40. Scholkopf, B., C.J.C. Burges, and V.N. Vapnik. Extracting Support Data for a
Given Task. Proceedings of the International Conference on Knowledge
Discovery and Data Mining AAAI. 1995.
41. Scholkopf, B. and A.J. Smola, Learning with Kernels. 2002: MIT Press.
42. Steinwart, I., On the Optimal Parameter Choice for Nu-Support Vector
Machines. IEEE Transactions on Pattern Analysis and Machine Intelligence,
2003. 25.
43. Sumathi, S. and S.N. Sivanandam, Introduction to Data Mining and its
Applications. Studies in Computational Intelligence, 2006. 29.
44. Valiant, L.G., A Theory of the Learnable. Communications of the ACM, 1984. 27(11).
45. Vapnik, V. and O. Chapelle, Bounds on Error Expectation for Support Vector
Machines. Neural Computation, 2000. 12.
46. Vapnik, V.N., The Nature of Statistical Learning Theory. 1995, New York:
Springer. xv, 188 p.
47. Vapnik, V.N., Statistical Learning Theory. Adaptive and Learning Systems for
Signal Processing, Communications, and Control. 1998, New York: Wiley.
xxiv, 736 p.
48. Wolpert, D.H., The Relationship between PAC, the Statistical Physics
Framework, the Bayesian Framework, and the VC Framework. The
Mathematics of Generalization the SFI Studies in the Sciences of Complexity,
1995.
49. Zoutendijk, G., Methods of Feasible Directions. 1960: Elsevier.