
Ionut B. BRANDUSOIU
Gavril I. TODEREAN

HOW TO FINE-TUNE
SUPPORT VECTOR MACHINES
FOR CLASSIFICATION

GAER Publishing House


Bucharest, 2020
THE GENERAL ASSOCIATION OF ENGINEERS IN ROMANIA
Copyright © Authors, 2020

All rights on this edition are reserved to the Authors.

GAER Publishing House


118 Calea Victoriei
010093, Sector 1, Bucharest, Romania
Phone: 4021-316 89 92, 4021-316 89 93,
4021- 319 49 45 (bookshop); Fax: 4021-312 55 31
E-mail: editura@agir.ro; www.agir.ro

Reviewers
Prof. Dr. Eng. Sergiu Nedevschi
Prof. Dr. Eng. Gabriel Oltean

Description of CIP of the National Library of Romania


BRANDUSOIU, IONUT B.
    How to fine-tune support vector machines for classification / Ionut B.
Brandusoiu, Gavril I. Toderean - Bucharest : G.A.E.R. Publishing House,
2020
    Contains bibliography
ISBN 978-973-720-806-4
621.39

Editor: Eng. Dan Bogdan


Cover: Dr. Eng. Ionut B. Brandusoiu

ISBN 978-973-720-806-4
TABLE OF CONTENTS

INTRODUCTION ...........................................................................................1

SUPERVISED LEARNING ............................................................................3

1. Training Dataset ........................................................................................4

2. Classification.............................................................................................5

3. Induction Algorithms ................................................................................6

4. Performance Evaluation............................................................................7

4.1. Generalization Error ...........................................................................8

5. Scalability ...............................................................................................14

6. Dimensionality ........................................................................................15

SUPPORT VECTOR MACHINES ................................................................17

1. Hard Margin SVMs ................................................................................18

2. Soft Margin SVMs ..................................................................................25

3. Kernel Trick ............................................................................................31

4. Kernels ....................................................................................................35
4.1. Linear Kernel ....................................................................................35

4.2. Polynomial Kernel ............................................................................35

4.3. RBF Kernel.......................................................................................36

4.4. Sigmoid Kernel.................................................................................36

5. Training Techniques ................................................................................37

FINE-TUNING SUPPORT VECTOR MACHINES .....................................41

1. Fast SVM Algorithm...............................................................................50

1.1. Parallel Optimization ........................................................................51

1.2. Fast Sequential Optimization ...........................................................53

ESTIMATING SUPPORT VECTOR MACHINES .........................................55

CONCLUSIONS ............................................................................................63

BIBLIOGRAPHY ..........................................................................................66
NOTATIONS

αi Nonnegative Lagrange multiplier i

α Vector of nonnegative Lagrange multipliers

b Bias

βi Nonnegative Lagrange multiplier i

β Vector of nonnegative Lagrange multipliers

C Cost parameter

ξi Nonnegative slack variable i

D(x) Decision function

δ Min. distance from a training instance to the decision surface

K(x, x′) Kernel function

m Number of independent variables

n Number of training instances

Q Kernel matrix

S Set of support vectors

U Set of unbounded support vectors

w m-dimensional vector

x Independent variables space

xi Vector of training instance i

yi Dependent variable i

z Feature space

INTRODUCTION

The first part of this book covers the theoretical aspects of support vector
machines and their functionality; then, based on the discussed concepts, it
explains how to fine-tune a support vector machine to yield highly accurate
prediction results adaptable to any classification task. The introductory part is
extremely beneficial to someone new to support vector machines, while the more
advanced notions are useful for everyone who wants to understand the
mathematics behind support vector machines and how to fine-tune them in order
to generate the best predictive performance of a certain classification model.

For a better understanding of the underlying theory of support vector


machines for classification, Chapter 1 describes supervised learning in the
context of machine learning and all its related concepts. In terms of the
generalization error as the metric to evaluate the performance of a classification
model, the second half of this chapter contrasts between theoretical and
empirical estimations.

Chapter 2 starts with the introduction to support vector machines,

followed by theoretical foundations pertaining to their attributes in supervised
learning. It discusses the technique of mapping independent variables into a
high-dimensional space and several noteworthy research papers existent in the
literature regarding various training techniques.

Chapter 3 presents a support vector machine architecture for


classification [5], [7], [4]. This architecture is unique and has been implemented
by the author of this book based on various techniques found in remarkable
manuscripts present in the literature, which are discussed beforehand, and each
design decision included is motivated. This support vector machine architecture has been
published by IEEE in a peer-reviewed manuscript authored by the writer of this
book.

This proposed support vector machine is validated in Chapter 4 by


applying it on a synthetic dataset frequently used as a benchmark for testing
machine learning algorithms for classification. The results obtained in this book
are also included in the peer-reviewed research paper published by IEEE. The
resulting support vector machine architecture is scalable to larger datasets and
can be applied to any dataset from any field of activity, as long as the problem to
be solved consists of a classification task.

Chapter 5 concludes this book.

CHAPTER 1
SUPERVISED LEARNING

In the context of machine learning, the modeling techniques can be


categorized into supervised and unsupervised learning techniques. Supervised
learning aims to predict an event through a classification model or estimate the
value of a continuous variable through a regression model. Within these models,
there are several independent variables and a dependent variable. Classification
models translate the input space into predefined classes, while regression
models translate the input space into a real value range. There are many
alternatives for classification models, for example, decision trees, neural
networks, support vector machines, statistical methods, or algebraic functions.
Within each of these classification models the input data is analyzed with
respect to the dependent variable. This way, it can be said that the pattern
recognition is supervised by the dependent variable. This type of predictive
model establishes relationships between the independent variables and the
dependent variable, and generates a function which associates the independent
variables with the dependent variable and allows the prediction of output values

based on the input values.

In the case of unsupervised learning, the data do not contain a dependent


variable, only independent variables. The pattern recognition is undirected, in
other words it is not guided by a specific variable, and the purpose of the data
mining mathematical algorithms is to discover patterns in the input data.
Segmentation and association are two of the most well-known unsupervised
learning methods [21].

Further are presented the theoretical notions related to supervised


learning, the classification model, and how to evaluate such a model.

1. Training Dataset

In the case of a supervised learning model, the training dataset is known


and the goal is to implement a system that can be used to predict previously
unseen instances.

The training dataset can be characterized in a multitude of ways. Most


often it is thought of as a set of instances belonging to a particular schema.
These instances refer to a collection of tuples (instances, rows, or records) that
may contain duplicates. Each tuple is described by a vector which contains the
values of each variable. The schema describes the variables and their definition
domains, denoted by B(A ∪ y) , where A represents the set of m independent
variables A = {a1, a2, ..., am} and y represents the dependent variable.

Typically, the variables, also named fields or attributes, are of two types:
nominal (values belong to an unordered set) and continuous (values are real
numbers). If the variable ai is of nominal type, its definition domain is denoted

by dom(ai) = {vi,1, vi,2, ..., vi,|dom(ai)|}, where |dom(ai)| refers to its finite
cardinality. Similarly, dom(y) = {c1, c2, ..., c|dom(y)|} represents the definition
domain of the dependent variable. Variables of continuous type have an infinite
cardinality.

The input space is defined as the Cartesian product of the definition


domains of all the independent variables, X = dom(a1) × dom(a2) × ...
... × dom(am). The universal input space U is defined as the Cartesian product of
the definition domains of all the independent and dependent variables,
U = X × dom(y). The training dataset is a set of instances which consists of a
set of n tuples, and is denoted by S(B) = (⟨x1, y1⟩, ⟨x2, y2⟩, ..., ⟨xn, yn⟩) where
xq ∈ X and yq ∈ dom(y).

It is generally assumed that the tuples from the training dataset are
generated in a random and independent order in accordance with an unknown
fixed joint probability distribution, D. When tuples are labeled using a deterministic
function y = f(x), this is a particular case of the probabilistic setting, which can thus
be considered a generalization of the deterministic case.

2. Classification

The machine learning community was the first to introduce concept


learning. These concepts are categories built by the human brain for objects,
events, or ideas that have a common set of characteristics. Learning a concept
implies deducing its definition from a set of instances, which can be formulated
explicitly or implicitly, but in any situation it either assigns each instance to the
concept or not. In conclusion, a concept can be viewed as a function defined on
the input space with values in a Boolean set, namely, c: X → {− 1, 1} .
Alternatively, a concept c can be defined as a subset of X, namely,

{x ∈ X: c(x) = 1}. A set of concepts is known as a concept class C.

Definition 1. Given a training dataset S with the set of independent variables


A = {a1, a2, ..., am} and a nominal dependent variable y that follows an
unknown fixed distribution D over the input space, the objective is to build a
classification model that presents a minimal generalization error.

The generalization error of a classification model is defined as the


misclassification rate over the distribution D. For nominal variables, the
generalization error can be defined as:


ε(I(S), D) = Σ_{⟨x,y⟩∈U} D(x, y) L(y, I(S)(x))   (1.1)

where L(y, I(S )(x)) is the cost function, defined as:

L(y, I(S)(x)) = { 0 if y = I(S)(x);  1 if y ≠ I(S)(x) }   (1.2)

3. Induction Algorithms

An induction algorithm, also known as inducer or learner, builds a model


based on a training dataset and is able to generalize the relationships between
the independent variables and the dependent variable. For instance, an induction
algorithm constructs a classification model with tuples and their class labels as
input training data.

The notation I is used to denote an induction algorithm, and I(S) denotes the
model induced by applying this algorithm I to the training dataset S. The
dependent variable of the tuple xq can then be predicted, and this prediction
denoted by I(S )(xq).

Depending on the induction algorithm, the classification models can be


categorized in several ways, for instance, some models are expressed as
decision trees, while others as probabilistic classification models. Additionally,
the classification models can be categorized as being deterministic, in the case
of decision trees, or stochastic, in the case of neural networks with
backpropagation of the error.

A classification model obtained from an induction algorithm can classify a


new unseen tuple by either assigning it to a particular class or by providing a
conditional probability vector of the instance pertaining to each class
(probabilistic model).

4. Performance Evaluation

It is fundamentally important to evaluate the performance of the induction


algorithm. As previously mentioned, an induction algorithm builds a
classification model based on a training dataset, which is then able to label new
unseen instances. By evaluating this classification model one can understand
more about its quality, refine its parameters during the iterative data mining
process, and select the best performing model from a set of models.

When evaluating a classification model, several aspects should be


considered, such as accuracy, comprehensibility, and computational complexity.
In this book, preference is given to classification models that yield high

accuracy.

4.1. Generalization Error

As previously mentioned, I(S ) denotes a classification model induced by


the induction algorithm I on the training dataset S. The generalization error of
this classification model I(S ) is given by the probability of misclassifying a
selected instance according to the distribution D of the labeled input space. The
accuracy of this classification model is obtained by subtracting the
generalization error from one. The training error is defined as the proportion
of misclassified instances from the training dataset:

ε̂(I(S), S) = (1/n) Σ_{⟨x,y⟩∈S} L(y, I(S)(x))   (1.3)

where L(y, I(S )(x)) is the cost function defined in equation (1.2).
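As a small illustration of equations (1.2) and (1.3), the sketch below evaluates the zero-one cost and the resulting training error on a handful of hypothetical predictions; the function and variable names are chosen only for this example.

```python
import numpy as np

def zero_one_loss(y_true, y_pred):
    # Cost function L from equation (1.2): 1 for a misclassification, 0 otherwise.
    return (np.asarray(y_true) != np.asarray(y_pred)).astype(int)

def training_error(y_true, y_pred):
    # Empirical (training) error from equation (1.3): mean zero-one loss over S.
    return zero_one_loss(y_true, y_pred).mean()

# Hypothetical labels produced by some classification model I(S).
y_true = np.array([1, -1, 1, 1, -1, -1])
y_pred = np.array([1, -1, -1, 1, -1, 1])
print(training_error(y_true, y_pred))  # 2 errors out of 6 instances -> 0.333...
```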

Although this type of error seems like a natural criterion, it is difficult to


compute the actual value of the generalization error because the distribution D
of the labeled input space is known only in rare situations, such as in synthetic
cases. One way to calculate the generalization error is to use the training error as
an estimate. The only downside is that this error represents an optimistically
biased estimate, especially if the induction algorithm overfits the training data.

The theoretical and the empirical approaches are the two methods present
in the literature for estimating the generalization error.

4.1.1. Theoretical Estimation of Generalization Error

If one decides to estimate the generalization error using the training error, it is
important to note that a low training error does not necessarily imply a low
generalization error. A compromise frequently arises between the error obtained
during training and the confidence level attributed to this error when estimating
the generalization error; this gap is calculated by subtracting the training error
from the generalization error. The
capacity of an induction algorithm determines the level of confidence in the
training error and is indicative of the classification models that this
algorithm can induce. The capacity of an induction algorithm can be calculated
using the Vapnik-Chervonenkis (VC) dimension which is discussed further.

Induction algorithms that present a large capacity, in other words that


have many free parameters in comparison to the training dataset size, are
susceptible to generate a training error that is low and lead to overfitting the
relationships present in the dataset and yield a poor generalization error. In such
a case, the training error is highly unlikely to be a good estimate for the
generalization error. Conversely, induction algorithms that present a small capacity
in comparison to the training dataset size tend to generate a high training error,
lead to underfitting the relationships present in the dataset, and yield a poor
generalization error. Induction algorithms with just enough free parameters may
generate a slightly higher training error, but on the other hand can yield a
good generalization error. However, taking into account the characteristics and
volume of the training data available, the optimal capacity can be achieved, and
thus obtain the best generalization error.

In [48], the author discusses the relationships between four theoretical

frameworks, compares them, and highlights their strengths and weaknesses.
These frameworks are useful to estimate the generalization error. Among these,
the VC and the PAC (Probably Approximately Correct) frameworks are
mentioned, which add a penalty function to the training error to indicate the
capacity of an induction algorithm.

A. VC Dimension
The Vapnik-Chervonenkis [46] theory is the most complete theoretical
learning framework and relevant to classification models. The VC theory offers
all the conditions needed for the consistency of the induction procedure. The
concept of consistency comes from statistics and states that both, the training
and the generalization errors of the classification model must converge to a
minimal error as the training dataset tends to infinity. The VC theory defines the
VC dimension as a capacity measure of an induction algorithm.

The VC theory provides worst-case bounds relating the training error and the
generalization error, bounds that are valid for any induction algorithm and
probability distribution over the input space.
These bounds are functions of the training dataset size and the VC dimension of
the induction algorithm.

Theorem 1. Given a hypothesis space H with a finite VC dimension d, the upper


bound on its generalization error is defined by:

ε(h, D) − ε̂(h, S) ≤ √( (d(ln(2n/d) + 1) − ln(δ/4)) / n ),     ∀h ∈ H,   ∀δ > 0   (1.4)

with probability 1 − δ, where ε(h, D) is the generalization error of the
classification model h over the distribution D, and ε̂(h, S) is the training error
of the same classification model h measured over the training dataset S of
cardinality n.
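To make the bound concrete, the short sketch below evaluates the right-hand side of equation (1.4) for an assumed VC dimension, training set size, and confidence level; the numbers are illustrative only.

```python
import math

def vc_confidence(d, n, delta):
    # Right-hand side of the VC bound in equation (1.4).
    return math.sqrt((d * (math.log(2 * n / d) + 1) - math.log(delta / 4)) / n)

# Example: a model with VC dimension 10 trained on 10,000 instances,
# with confidence 1 - delta = 0.95.
print(vc_confidence(d=10, n=10_000, delta=0.05))  # roughly 0.1
```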

The VC dimension represents the property of a set H composed of all the


classification models examined by the induction algorithm. In the simplest case
of a two-class model, the VC dimension is defined as being the maximum
number of instances that can be shattered by the set H composed of all relevant
classification models. By definition, a dataset S with n instances is shattered by a
set H, if and only if this set H contains a classification model consistent with any
dichotomy of S. To express it differently, a dataset S is shattered by H if the
instances in S can be separated into two classes in 2ⁿ different ways by some
classification models contained in H. One should take into account that if the
VC dimension of H is denoted by d, then there exists at least one set of d instances
that can be shattered by H. Generally speaking, it will not be true that every set
of d instances can be shattered by H.

As a condition for the consistency of the induction procedure, the VC


dimension of an induction algorithm must be finite. In the case of a linear
classification model, the VC dimension is equal to the size of the input space or
to the number of free parameters of this classification model. The VC dimension
of a general classification model may be different from the number of free
parameters, and in many cases, it might be extremely difficult to calculate it
precisely. In this case, it is advisable to calculate a lower and upper bound of the
VC dimension. The two VC dimension bounds for neural networks are
presented in [38].

B. PAC Dimension
The Probably Approximately Correct (PAC) learning model was
introduced by Valiant in 1984 [44]. This framework is useful to characterize the
concept class that can be reliably learned from a reasonable number of
randomly drawn training instances and a reasonable amount of computation
[34]. The following definition of the PAC learning model is adapted from [34]
and [35]:

Definition 2. Let C be a concept class defined over the input space X with m
variables. Let I be an induction algorithm that considers the hypothesis space
H. C is PAC learnable by I using H if for ∀c ∈ C , ∀D defined over X,
∀ε ∈ (0, 1/2) and ∀δ ∈ (0, 1/2) the induction algorithm I with a probability
greater than or equal to 1 − δ will find the hypothesis h ∈ H such that,
ε(h, D) ≤ ε, and C is learnable in polynomial time if the induction algorithm is of
polynomial time complexity in 1/ε, 1/δ, m, and size(c).

By examining a hypothesis space H with a probability greater than or


equal to 1 − δ to find a hypothesis h ∈ H with an error less than or equal to ε of
the target concept c ∈ C ⊆ H, the PAC learning model offers a general bound on
the number of training instances, which is sufficient for any consistent induction
algorithm I. In particular, the size of the training dataset must satisfy:

n ≥ (1/ε) ln(|H| / δ)   (1.5)
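The sample-size bound in equation (1.5) is straightforward to evaluate; the sketch below does so for an assumed finite hypothesis space, target error, and confidence, with all values chosen only for illustration.

```python
import math

def pac_sample_bound(hypothesis_space_size, epsilon, delta):
    # Lower bound on the training set size from equation (1.5).
    return math.ceil((1.0 / epsilon) * math.log(hypothesis_space_size / delta))

# Example: a finite hypothesis space with 2**20 hypotheses,
# target error 0.05 and confidence 1 - delta = 0.99.
print(pac_sample_bound(hypothesis_space_size=2**20, epsilon=0.05, delta=0.01))  # about 370
```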

4.1.2. Empirical Estimation of Generalization Error

The generalization error can be estimated by dividing the available dataset
into a training and a test dataset. The training dataset is used by the induction
algorithm to build the classification model, and then the misclassification rate of
this model is calculated on the test dataset. The error obtained on the test dataset
yields a better estimate of the generalization error, because the classification model
tends to overfit the training data and thus the training error underestimates the
generalization error.

When the available data is limited, it is a well-known practice to resample


the data, meaning to partition the dataset into a training and a test dataset in
several ways. An induction algorithm is trained and tested on each partition, and
then the arithmetic mean of all the misclassification rates is computed. This way,
a more reliable estimate of the generalization error is obtained.

Random sub-sampling and k-fold cross-validation are two well-known


resampling techniques. The first technique randomly partitions the dataset
multiple times into disjoint training and test datasets, and computes the
arithmetic mean of the errors obtained from each partition. The k-fold cross-
validation technique randomly partitions the dataset into k mutually exclusive
datasets, on which the induction algorithm is trained and tested multiple times.
Each time the algorithm is tested on one of the unseen folds and trained using
the remaining k − 1 folds.

The estimation of the generalization error obtained through cross-


validation is equal to the ratio of the number of misclassifications over the total
number of instances in the dataset. The random sub-sampling technique has the
upside that it can be repeated unlimitedly, and the downside that the test datasets
are not independently selected in relation to the distribution of the instances.
Thus, using a t-test for paired differences using the random sub-sampling

technique may increase the risk of Type I (false positives) error, that is,
identifying a significant difference when there is none. On the other hand, using
a t-test on the generalization error obtained on each fold decreases the risk of a
Type I error, but instead may not provide an adequate estimate of the
generalization error. To obtain a more reliable estimate, the k-fold cross-
validation technique is usually repeated k times. However, the test dataset is not
independent and there is a risk of Type I error. Unfortunately, currently no
satisfactory solution has been found to this problem. Dietterich proposed in [14]
alternative tests that have a low chance of a Type I error, but have a high risk of
a Type II (false negative) error, i.e. not identifying a significant difference when
one exists.

While applying the random sub-sampling and k-fold cross-validation


techniques, a method called stratification is frequently used to ensure that the
distribution of the dependent variable from the initial dataset is kept within each
partition, i.e. in the training and the test datasets. This method reduces the
variance of the estimated error in particular for multi-class datasets.
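As a concrete illustration of the resampling procedure described above, the sketch below estimates the generalization error with stratified 10-fold cross-validation using scikit-learn (an assumed dependency) on a synthetic dataset; the classifier and parameter choices are arbitrary and stand in for any induction algorithm.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

# Synthetic two-class data standing in for a real training dataset.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
errors = []
for train_idx, test_idx in skf.split(X, y):
    model = DecisionTreeClassifier(random_state=0)
    model.fit(X[train_idx], y[train_idx])
    # Misclassification rate on the unseen fold.
    errors.append(np.mean(model.predict(X[test_idx]) != y[test_idx]))

print("Estimated generalization error:", np.mean(errors))
```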

5. Scalability

The induction process represents the main concern throughout many


disciplines, such as pattern recognition, machine learning, and statistics. The
data mining process is different from these traditional methods because of its
capability to scale to large datasets with various input data types. The concept of
scalability implies that the datasets have either a large number of instances, a
large dimensionality, or both.

Induction algorithms have been successfully implemented in multiple
situations to solve fairly basic problems, but with the increasing desire to
discover knowledge in large datasets, several difficulties and constraints related
to time and memory appear.

Since databases have become a standard within many domains, such as


telecommunications, finance, astronomy, biotech, marketing, healthcare, and
many others, the data mining process designated to discover knowledge within
these domains has become a very productive discipline. Organizations that
produce large amounts of data, such as those active in the telecommunications,
financial, and online industries accumulate a few petabytes of data every year.

Thus, difficulties arise in implementing classification algorithms for large


datasets due to their high dimensionality, i.e. large number of instances and
variables. Different sampling methods can be used to select only a part of the
instances, the number of instances can be reduced by grouping them or by
eliminating subsets of unimportant instances, or parallel processing can be used
to solve different aspects of this problem simultaneously.

6. Dimensionality

High dimensional input data, i.e. datasets with a large number of variables,
involve an exponential increase in the size of the search space, and
consequently increase the chance that an induction algorithm will build
classification models that are not valid in general. In [25], the authors explain
that in the case of a supervised classification model, the required number of
instances increases with the dimensionality of that dataset. Furthermore, the

author shows in [20] that in the case of a linear classification model, the
required number of instances is linear with respect to the dimensionality and to
the square of the dimensionality in the case of a quadratic classification model.
Regarding the nonparametric classification models, such as decision trees, the
situation is more serious. In order to obtain an efficient estimation of the
multivariate densities, in [24] it was estimated that as the number of dimensions
increases the number of instances must increase exponentially.

This situation is called the curse of dimensionality, a term which was first
used by Bellman [2]. Algorithms, such as decision trees, which are effective in
situations of low dimensionality, do not yield significant results when the
dimensionality increases beyond a certain level. Moreover, the classification
models that are built on datasets with a small number of variables are easier to
interpret and more suitable for visualization by using different specific data
mining methods.

In recent years, multiple linear dimensionality reduction algorithms have


been developed, of which factor analysis [30] and principal component analysis
(PCA) [17] are mentioned. The main objective of these algorithms is to
transform the input variables into a dataset of smaller dimension. These
algorithms require the input variables to be of continuous type and the
dimensions to be representable as linear combinations of these input variables.
Each newly formed dimension is supposed to represent an unobserved factor.
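As a sketch of such a linear dimensionality reduction, the example below applies PCA with scikit-learn (an assumed dependency); the Iris data merely stands in for a dataset with continuous input variables.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Continuous input variables, standardized before the linear projection.
X, _ = load_iris(return_X_y=True)
X_std = StandardScaler().fit_transform(X)

# Project the 4 original variables onto 2 principal components.
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_std)
print(X_reduced.shape)                  # (150, 2)
print(pca.explained_variance_ratio_)    # share of variance kept by each component
```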

CHAPTER 2
SUPPORT VECTOR MACHINES

The support vector machines (SVM) algorithm is a set of supervised


learning methods applicable to classification problems. Since its introduction,
this algorithm has gained impressive popularity due to its solid theoretical
foundation. This algorithm has been developed by Vapnik et al. [40], [39] to
implement principles from statistical learning theory [47]. In the framework of
statistical learning, the learning refers to the estimation of a function based on a
training dataset. To accomplish this, a SVM chooses, from a given set of
functions, a function that minimizes the empirical risk so that the estimated
function deviates as little as possible from the target function, which is unknown. The
empirical risk depends on the complexity of the chosen set of functions and the
training dataset. Thus, a learning SVM must find the best function in the set of
functions determined by its complexity. In practice, the risk bounds are not easy
to calculate nor useful to analyze the quality of the solution [45].

If it is assumed that the training dataset is separable by a hyperplane,


Vapnik has demonstrated in [47] that for the class of hyperplanes, the

complexity of the hyperplane can be bounded by another measure, called
margin. The margin is the minimum distance between a training instance and
the decision area. Thus, if the margin of a function is bounded, one can control
its complexity. Learning with support vectors implies that the risk is minimized
when the margin is maximized. A support vector machine selects the hyperplane
with the maximum margin in the transformed input space and separates the
classes of the training instances while maximizing the distance to the nearest
separating instance. The parameters of the resulting hyperplane are obtained by
solving a quadratic programming optimization problem.

For a two-class classification problem, the SVM algorithm is trained so


that the decision function maximizes the generalization task [47], [9], i.e. the
initial m-dimensional space of the problem x, called the input space, is mapped
into the n-dimensional (n ≥ m) feature space z, in which the classes become
linearly separable. In the newly obtained feature space z, in order to determine
the optimal separating hyperplane to separate the two classes, it is required to
solve a quadratic programming problem.

We present further the hard margin support vector machines algorithm,


useful when the training dataset is linearly separable. This concept is extended
further to the case in which the training dataset is not linearly separable and the
input space is mapped into a high-dimensional feature space to improve the
linear separability.

1. Hard Margin SVMs

Let xi, i = 1, …, n, be n m-dimensional training instances that correspond to
the dependent variable yi ∈ {−1, 1}. If these training instances are linearly
separable, the decision function can be determined:

D(x) = w⊤x + b (2.1)

where w ∈ ℝm is an m-dimensional vector, b ∈ ℝ is a scalar and represents the


bias and for i = 1, …, n, we have:

yi(w⊤xi + b) ≥ 1 (2.2)

The hyperplane

D(x) = w⊤x + b for −1 < D(x) < 1 (2.3)

forms a separating hyperplane that separates xi, i = 1, …, n [47]. If D(x) = 0, the


separating hyperplane is at equal distance from the hyperplanes D(x) = 1 and
D(x) = − 1. The distance between the training instances nearest to the
hyperplanes D(x) = 1 and D(x) = − 1 , and the separating hyperplane is called
margin. Assuming that the hyperplanes D(x) = 1 and D(x) = − 1 have at least
one training instance, the maximum margin for hyperplane D(x) = 0 is achieved
for −1 < D(x) < 1 . Thus, the generalization region for the decision function is
{x | −1 ≤ D(x) ≤ 1}.

Figure 2.1 illustrates two decision functions that satisfy equation (2.2). It
can be noted that there are an infinite number of decision functions that satisfy


equation (2.2), which are separating hyperplanes.

The location of the separating hyperplane affects the generalization task.


Thus, the hyperplane with the maximum margin is called the optimal separating
hyperplane (Figure 2.1). If the training dataset does not contain any extreme
value and the test dataset follows the same distribution, the generalization task
is maximized if the separating hyperplane is the optimal separating hyperplane.

Figure 2.1. Optimal Separating Hyperplane in a Two-dimensional Space.

The Euclidean distance from a training instance x to the separating


hyperplane is given by:

|D(x)| / ‖w‖   (2.4)

The training dataset should satisfy for k = 1, …, n equation (2.5):

yk D(xk) / ‖w‖ ≥ δ   (2.5)

where δ is the margin.

If (w, b) is a solution, then (aw, ab) is also a solution, where a is a scalar.


Thus, the following constraint is required:

δ ‖w‖ = 1   (2.6)

To determine the optimal separating hyperplane, the vector w with the minimum
Euclidean norm that satisfies equation (2.2) must be determined from equations
(2.5) and (2.6). Consequently, the optimal separating hyperplane is obtained by
solving the optimization problem for w and b [47]:

Q(w, b) = (1/2) ‖w‖²   (2.7)

subject to

yi(w⊤xi + b) ≥ 1 for i = 1, …, n (2.8)

The square of the Euclidean norm ‖w‖ in equation (2.7) is used to


transform the optimization problem into a quadratic programming optimization
problem. The linear separating hypothesis signifies that there is a w and b that
satisfy equation (2.8). The solutions that satisfy this equation are called feasible
solutions. The solutions of the quadratic objective function with inequality

constraints are not unique, but the value of this function is. Thus, the fact that
solutions are not unique does not represent a problem for the SVM algorithm,
on the contrary, it is an advantage over the neural networks, which generate
many local minima.

If the points that satisfy the strict inequalities are removed from equation
(2.8), the same optimal separating hyperplane is still obtained. Thus, the points
that satisfy the equality are called support vectors (this definition is imprecise at
the moment because the support vectors are defined using the solution of the
dual problem, as discussed further). In Figure 2.1, the support vectors are
represented by the full square and circles.

The variables w and b of the convex optimization problem are given by


equations (2.7) and (2.8). Therefore, the total number of variables is equal to the
number of independent variables plus 1, i.e. m + 1 . If m is small, the quadratic
programming optimization technique can be used to solve equations (2.7) and
(2.8). Once the input space is mapped into a high-dimensional feature space,
equations (2.7) and (2.8) are transformed into a dual problem with the number of
variables equal to the number of the training instances.

First, the problem subject to the constraints given by equations (2.7) and
(2.8) is transformed into a problem without constraints, namely:

Q(w, b, α) = (1/2) w⊤w − Σ_{i=1}^{n} αi {yi(w⊤xi + b) − 1}   (2.9)

where α = (α1, α2, …, αn)⊤ and αi are the nonnegative Lagrange multipliers [47].
The optimal solution of equation (2.9) is given by the saddle point. In this point,

equation (2.9) is minimized with respect to w, maximized with respect to αi ≥ 0,
and minimized or maximized with respect to b based on the sign of the sum
Σ_{i=1}^{n} αi yi, and the solution satisfies the Karush-Kuhn-Tucker (KKT) conditions
[31]:

∂Q(w, b, α)/∂w = 0   (2.10)

∂Q(w, b, α)/∂b = 0   (2.11)

αi{yi(w⊤xi + b) − 1} = 0 for i = 1, …, n (2.12)

αi ≥ 0 for i = 1, …, n (2.13)

These inequality constraints and the associated Lagrange multipliers are


called the KKT complementary conditions [31]. In equation (2.12), the
conditions αi = 0 or αi ≠ 0 and yi(w⊤xi + b) = 1 must be satisfied. The training
instances xi that satisfy αi ≠ 0 are called support vectors.

Using equation (2.9), equations (2.10) and (2.11) are reduced to:

w = Σ_{i=1}^{n} αi yi xi   (2.14)

Σ_{i=1}^{n} αi yi = 0   (2.15)

Then, by substituting equations (2.14) and (2.15) into equation (2.9), the dual
problem that needs to be maximized results [47]:

Q(α) = Σ_{i=1}^{n} αi − (1/2) Σ_{i=1}^{n} Σ_{j=1}^{n} αi αj yi yj xi⊤xj   (2.16)

subject to


Σ_{i=1}^{n} αi yi = 0,   αi ≥ 0 for i = 1, …, n   (2.17)

This support vector machine is called a hard margin support vector


machine [32]. Considering the following:


(1/2) Σ_{i=1}^{n} Σ_{j=1}^{n} αi αj yi yj xi⊤xj = (1/2) (Σ_{i=1}^{n} αi yi xi)⊤ (Σ_{i=1}^{n} αi yi xi) ≥ 0   (2.18)

maximizing equation (2.16) subject to the constraints in equation (2.17) is a


concave quadratic programming problem. If there is a solution, in other words if
the classification problem is linearly separable, the global optimal solution is αi,
i = 1, …, n. Regarding the quadratic programming, if the optimal solutions
exist, the values of the primal and dual objective functions are identical.

The training instances associated with the positive values of αi are support
vectors for yi = 1 or yi = − 1. Thus, from equation (2.14) we obtain the decision
function:

D(x) = Σ_{i∈S} αi yi xi⊤x + b   (2.19)

where S is the set of support vectors. From the KKT conditions given by
equation (2.12), b is given by equation (2.20):

b = yi − w⊤xi for i ∈ S (2.20)

To ensure the stability of the calculations, the arithmetic mean of the


support vectors is used [47], such as:

b = (1/|S|) Σ_{i∈S} (yi − w⊤xi)   (2.21)

Consequently, unknown instances x can be classified using equation


(2.22):

yi = 1 if D(x) > 0
yi = − 1 if D(x) < 0   (2.22)

If D(x) = 0 , the instance x is on the boundary and cannot be classified. When


the training dataset is linearly separable, as mentioned previously, the
generalization region for the decision function is {x | 1 > D(x) > −1}.
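To make the hard margin derivation tangible, the sketch below solves the dual problem (2.16)–(2.17) with a general-purpose solver from SciPy (an assumed dependency) on a tiny separable dataset invented for this example, and then recovers w and b from equations (2.14) and (2.21); a production implementation would use a dedicated SVM solver instead.

```python
import numpy as np
from scipy.optimize import minimize

# Tiny linearly separable dataset, assumed for illustration only.
X = np.array([[2.0, 2.0], [3.0, 3.0], [2.5, 3.5],
              [0.0, 0.0], [1.0, 0.5], [0.5, -0.5]])
y = np.array([1, 1, 1, -1, -1, -1], dtype=float)

n = len(y)
Q = (y[:, None] * X) @ (y[:, None] * X).T   # Q_ij = y_i y_j x_i . x_j

def neg_dual(alpha):
    # Negative of equation (2.16), so that minimizing maximizes the dual.
    return 0.5 * alpha @ Q @ alpha - alpha.sum()

constraints = {"type": "eq", "fun": lambda a: a @ y}   # equation (2.17)
bounds = [(0.0, None)] * n                             # alpha_i >= 0
res = minimize(neg_dual, np.zeros(n), bounds=bounds,
               constraints=constraints, method="SLSQP")

alpha = res.x
sv = alpha > 1e-6                                      # support vectors
w = ((alpha * y)[:, None] * X).sum(axis=0)             # equation (2.14)
b = np.mean(y[sv] - X[sv] @ w)                         # equation (2.21)
print("w =", w, "b =", b, "support vectors:", np.where(sv)[0])
```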

2. Soft Margin SVMs

In the case of the hard margin support vector machines, the training
dataset is linearly separable. However, if the training dataset is not linearly
separable, no feasible solution exists, and the hard margin support vector
machines problem cannot be solved. Thus, the soft margin support vector
machines for linearly inseparable training data are presented further.

In order to allow for inseparability, the nonnegative slack variables ξi ≥ 0
are introduced [12] in equation (2.2):

yi(w⊤xi + b) ≥ 1 − ξi for i = 1, …, n (2.23)

The introduction of the slack variables guarantees the existence of a
feasible solution [12]. For the training instances xi, if 0 ≤ ξi < 1 (ξi in Figure 2.2),
the instances are correctly classified but do not have the maximum margin.

Figure 2.2. Inseparable Data in a Two-dimensional Space.

If ξi ≥ 1 (ξi in Figure 2.2), the training instances are misclassified by the
optimal separating hyperplane. In order to obtain the optimal separating
hyperplane with a minimum number of training instances without a maximum
margin, equation (2.24) must be minimized:


Q(w) = Σ_{i=1}^{n} θ(ξi)   (2.24)

where

θ(ξi) = { 0 if ξi = 0;  1 if ξi > 0 }   (2.25)

This represents a difficult combinatorial optimization problem. Thus, the


following minimization problem [12] is considered:

Q(w, b, ξ) = (1/2) w⊤w + (C/p) Σ_{i=1}^{n} ξi^p   (2.26)

subject to

yi(w⊤xi + b) ≥ 1 − ξi,   ξi ≥ 0 for i = 1, …, n (2.27)

where ξ = (ξ1, ξ2, …, ξn)⊤ and C represents the cost parameter and controls the
balance between the margin maximization and the classification error
minimization. The parameter p takes the value 1 or 2. The resulting hyperplane is
called the soft margin hyperplane. When p = 1 the SVM is called an L1 soft margin

SVM, or L1 SVM, and when p = 2 it is called L2 soft margin SVM, or L2 SVM.
This section details further the L1 soft margin SVM because it is the algorithm
to be implemented next.

Similarly to the linearly separable instances, if the nonnegative Lagrange


multipliers, αi and βi, are introduced, we get:

Q(w, b, ξ, α, β) = (1/2) w⊤w + C Σ_{i=1}^{n} ξi − Σ_{i=1}^{n} αi (yi(w⊤xi + b) − 1 + ξi) − Σ_{i=1}^{n} βi ξi   (2.28)

where α = (α1, α2, …, αn)⊤ and β = (β1, β2, …, βn)⊤.

For the optimal solution, the following KKT conditions are satisfied [12]:

∂Q(w, b, ξ, α, β)/∂w = 0   (2.29)

∂Q(w, b, ξ, α, β)/∂b = 0   (2.30)

∂Q(w, b, ξ, α, β)/∂ξ = 0   (2.31)

αi(yi(w⊤xi + b) − 1 + ξi ) = 0 for i = 1, …, n (2.32)

βi ξi = 0 for i = 1, …, n (2.33)

αi ≥ 0,   βi ≥ 0,   ξi ≥ 0, for i = 1, …, n (2.34)

Using equation (2.28), equations (2.29), (2.30), and (2.31) are reduced to
[32]:

w = Σ_{i=1}^{n} αi yi xi   (2.35)

Σ_{i=1}^{n} αi yi = 0   (2.36)

αi + βi = C for i = 1, …, n (2.37)

Then, by substituting equations (2.35), (2.36), and (2.37) into equation (2.28),
the dual problem that needs to be maximized results:

Q(α) = Σ_{i=1}^{n} αi − (1/2) Σ_{i=1}^{n} Σ_{j=1}^{n} αi αj yi yj xi⊤xj   (2.38)

subject to


Σ_{i=1}^{n} αi yi = 0,   0 ≤ αi ≤ C for i = 1, …, n   (2.39)

In comparison to the hard margin SVM, for the L1 soft margin SVM, αi
cannot be greater than the parameter C. The inequality constraints in equation

(2.39) are called box constraints.

Equations (2.32) and (2.33) are called KKT complementarity conditions


[31]. From these two equations and equation (2.37), we have three situations for
αi, namely:

1. If αi = 0 , then ξi = 0 and the instance xi is correctly classified and


called nonsupport vector.

2. If 0 < αi < C , then yi(w⊤xi + b) − 1 + ξi = 0 and ξi = 0 . Thus,


yi(w⊤xi + b) = 1 and the instance xi is a support vector. If 0 < αi < C,
the instance xi is called unbounded support vector.

3. If αi = C , then yi(w⊤xi + b) − 1 + ξi = 0 and ξi ≥ 0 . Thus, the instance


xi is called a bounded support vector. If 0 ≤ ξi < 1 , the instance xi is
correctly classified, and if ξi ≥ 1 , then the instance xi is incorrectly
classified.

The decision function is identical to that of the hard margin SVM and is
given by equation (2.40):

D(x) = Σ_{i∈S} αi yi xi⊤x + b   (2.40)

where S is the set of support vectors. The summation in equation (2.40) applies
only to the support vectors, since only for these αi are not equal to zero. In the case
of the unbounded support vectors, b is given by equation (2.41):

b = yi − w⊤xi for i ∈ U (2.41)

where U represents the set of unbounded support vectors.

To ensure the stability of the calculations, the arithmetic mean of the


support vectors is used [47], such as:

b = (1/|U|) Σ_{i∈U} (yi − w⊤xi)   (2.42)

Consequently, unknown instances x can be classified using equation


(2.43):

yi = 1 if D(x) > 0
yi = − 1 if D(x) < 0   (2.43)

If D(x) = 0 , the instance x is on the boundary and cannot be classified. If there


are no bounded support vectors, the generalization region for the decision
function is {x | 1 > D(x) > −1}, and is identical to the one of the hard margin
SVM.
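As a sketch of how the cost parameter C separates the three cases for αi discussed above, the example below trains an L1 soft margin SVM with scikit-learn (an assumed dependency) on overlapping synthetic data and counts the bounded and unbounded support vectors; the tolerance used in the comparison is an implementation detail of this example, not part of the library's API.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Overlapping (linearly inseparable) two-class data, assumed for illustration.
X, y = make_blobs(n_samples=200, centers=2, cluster_std=2.5, random_state=0)
y = np.where(y == 0, -1, 1)

C = 1.0
model = SVC(kernel="linear", C=C).fit(X, y)

# dual_coef_ holds y_i * alpha_i for the support vectors only.
alphas = np.abs(model.dual_coef_).ravel()
unbounded = np.sum(alphas < C - 1e-8)   # 0 < alpha_i < C
bounded = np.sum(alphas >= C - 1e-8)    # alpha_i = C (inside the margin or misclassified)
print("support vectors:", len(model.support_),
      "unbounded:", unbounded, "bounded:", bounded)
```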

3. Kernel Trick

In the SVM algorithm, the optimal separating hyperplane is selected such


that the generalization ability is maximized. If the training instances are not
linearly separable, even though the separating hyperplane is optimal, the
classification model may not present a high generalization performance.
Consequently, to improve the linear separability, the input space is mapped into
a high-dimensional space, called the feature space [12].

We use ϕ(x) = (ϕ1(x), ϕ2(x), …, ϕn(x))⊤ , a nonlinear vector function, to
map the m-dimensional input vector x into an n-dimensional feature space. The
linear decision function is shown in equation (2.44):

D(x) = w⊤ϕ(x) + b (2.44)

where w is an n-dimensional vector and b represents the bias.

For a symmetric function K(x, x′) that satisfies equation:

Σ_{i=1}^{n} Σ_{j=1}^{n} hi hj K(xi, xj) ≥ 0   (2.45)

for all n ∈ ℕ, xi, and hi ∈ ℝ, based on the Hilbert-Schmidt theory [1], a function
ϕ(x) exists that maps the input vector x into a feature space and satisfies
equation:

K(x, x′) = ϕ ⊤(x)ϕ(x′) (2.46)

If equation (2.46) is satisfied, we have equation:

Σ_{i=1}^{n} Σ_{j=1}^{n} hi hj K(xi, xj) = (Σ_{i=1}^{n} hi ϕ(xi))⊤ (Σ_{i=1}^{n} hi ϕ(xi)) ≥ 0   (2.47)

Equations (2.45) or (2.47) are called Mercer's conditions [47], and the
functions that satisfy these conditions are called Mercer kernels, referred to
hereafter simply as kernels.

By employing kernels during the training and the classification processes,


that is, by using K(x, x′) instead of ϕ(x), it is no longer required to treat the high-
dimensional feature space explicitly [27]. This technique is called the
kernel trick, and the methods of mapping the input space into the feature space
are called kernel methods.

As such, in the feature space, the dual problem to be maximized is given


by equation (2.48):

Q(α) = Σ_{i=1}^{n} αi − (1/2) Σ_{i=1}^{n} Σ_{j=1}^{n} αi αj yi yj K(xi, xj)   (2.48)

subject to


Σ_{i=1}^{n} yi αi = 0,   0 ≤ αi ≤ C for i = 1, …, n   (2.49)

Considering that K(x, x′) is a Mercer kernel, which is a positive


semidefinite kernel, and that we have the feasible solution α = 0 , our concave
quadratic programming optimization problem has an optimal global solution.

The KKT complementarity conditions are given by the following


equations:

αi (yi (Σ_{j=1}^{n} yj αj K(xi, xj) + b) − 1 + ξi) = 0 for i = 1, …, n   (2.50)

(C − αi )ξi = 0 for i = 1, …, n (2.51)

αi ≥ 0, ξi ≥ 0 for i = 1, …, n (2.52)

The decision function is given by equation (2.53):

D(x) = Σ_{i∈S} αi yi K(xi, x) + b   (2.53)

where the term b is given by:

b = yj − Σ_{i∈S} αi yi K(xi, xj)   (2.54)

where the instance xj is an unbounded support vector.

To ensure the stability of the calculations, the arithmetic mean is used,


such as:

b = (1/|U|) Σ_{j∈U} (yj − Σ_{i∈S} αi yi K(xi, xj))   (2.55)

where U is the set of unbounded support vectors.

Consequently, unknown instances x can be classified using equation


(2.56):

yi = 1 if D(x) > 0
yi = − 1 if D(x) < 0   (2.56)

If D(x) = 0, the instance x cannot be classified.
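A quick numerical check of Mercer's condition (2.45) can be sketched as follows: build the Gram matrix of a candidate kernel on random data and verify that it is positive semidefinite; the RBF kernel and the parameter values here are assumptions chosen only for this example.

```python
import numpy as np

def rbf_gram_matrix(X, gamma=0.5):
    # Gram matrix with entries K(x_i, x_j) = exp(-gamma * ||x_i - x_j||^2).
    sq = np.sum(X**2, axis=1)
    sq_dists = sq[:, None] + sq[None, :] - 2 * X @ X.T
    return np.exp(-gamma * sq_dists)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
K = rbf_gram_matrix(X)

# Mercer's condition (2.45): the Gram matrix must be positive semidefinite,
# i.e. all its eigenvalues are nonnegative (up to numerical rounding).
eigvals = np.linalg.eigvalsh(K)
print("smallest eigenvalue:", eigvals.min())
```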

4. Kernels

The SVM algorithm presents a great advantage over other machine
learning algorithms, in that by selecting the appropriate kernel for a particular
application, the generalization performance can be significantly improved.
Therefore, selecting the appropriate kernel represents a critical step, and the
ongoing research focuses on developing new kernels [41], [42]. Further are
presented the four main kernel functions used in the SVM algorithm.

4.1. Linear Kernel

If the classification problem and the training instances are linearly


separable, the input space is not mapped into a high-dimensional feature space
and instead the linear kernel below is used:

K(x, x′) = x⊤x′ (2.57)

4.2. Polynomial Kernel

The polynomial kernel of degree d ∈ ℕ is given by equation (2.58):

K(x, x′) = (γ x⊤x′ + r)d (2.58)

where γ > 0 and r ≥ 0 are parameters, and r is used to make a compromise
between the influence of higher-order terms against lower-order terms. When
r = 0, the kernel is called homogeneous. Generally, the following values γ = 1,
r = 1, and d = 2 or d = 3 are used.

4.3. RBF Kernel

The radial basis function kernel, also called the Gaussian kernel, is given
by equation:

K(x, x′) = exp(−γ ‖x − x′‖²)   (2.59)

where γ > 0 is a parameter that controls the radius of the function.

In this case, the centers of the radial basis functions are actually the
support vectors. It is important to note that the RBF (Radial Basis Functions)
kernels map the input space in an infinitely-dimensional space and since the
Euclidean distance is used, they are not robust to extreme values.

4.4. Sigmoid Kernel

The sigmoid kernel, also known as the hyperbolic tangent kernel or


multilayer perceptron kernel, is similar to the activation function of a perceptron
neuron and is given by equation:

K(x, x′) = tanh(γ x⊤x′ + r) (2.60)
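The four kernels above translate directly into code; the sketch below implements equations (2.57)–(2.60) for a pair of vectors, with parameter defaults chosen only for illustration.

```python
import numpy as np

def linear_kernel(x, z):
    return x @ z                                   # equation (2.57)

def polynomial_kernel(x, z, gamma=1.0, r=1.0, d=2):
    return (gamma * (x @ z) + r) ** d              # equation (2.58)

def rbf_kernel(x, z, gamma=1.0):
    return np.exp(-gamma * np.sum((x - z) ** 2))   # equation (2.59)

def sigmoid_kernel(x, z, gamma=0.01, r=0.0):
    return np.tanh(gamma * (x @ z) + r)            # equation (2.60)

x = np.array([1.0, 2.0, 3.0])
z = np.array([0.5, -1.0, 2.0])
for k in (linear_kernel, polynomial_kernel, rbf_kernel, sigmoid_kernel):
    print(k.__name__, k(x, z))
```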

5. Training Techniques

As presented in the previous section, to train an L1 SVM it is required
to solve a quadratic programming problem with the number of variables equal to
the number of training instances. Many experimental results indicate that the
complexity of the Sequential Minimal Optimization (SMO) algorithm can be
approximated by n³, where n is the size of the training dataset. Thus, in the case
of large datasets, when n is high, the SVM training process is inefficient due to
the long training time and the high memory consumption [13]. In order to avoid
this problem, various decomposition methods have been proposed, which
change only a subset of α per iteration, thus requiring only a few columns
of the matrix Q. This subset of variables, referred to as the working set B, leads
to an optimization subproblem.

Currently, there are three important algorithms for training a SVM,


namely: chunking [36], SMO [37], [29], and SVMlight [26]. The chunking
algorithm starts with an arbitrary subset of the training dataset, and after using a
general optimizer to train the SVM on this subset, the support vectors are kept
in the working set and the rest are replaced by training instances that violate the
KKT conditions [31]. The notion of working set, i.e. the idea of decomposition,
used in training a SVM is introduced by this algorithm, thus making learning
feasible when having large datasets [36]. However, this algorithm is not fast
enough due to an ineffective selection of the working set. Joachims has
developed a decomposition algorithm for the SVM in SVMlight [26]. To select
the working set, he has used the nonzero elements of the steepest descent based
on a strategy found in Zoutendijk's work [49]. To accelerate the training process,
Joachims has also used a shrinking strategy and the Least Recently Used (LRU)

caching method. One of the inefficiencies of this algorithm is given by the
caching method, that is, it caches few rows of the kernel Hessian matrix, which
consumes memory and becomes impractical for large datasets. In addition, a
failure during the shrinking strategy will result in reoptimizing the training
problem of the SVM.

Platt has proposed the SMO method [37], which represents an important
step in training a SVM. In this algorithm, the size of the working set is equal to
two, and this optimization problem with two variables is solved without
requiring any additional optimization package. In [37], some heuristics for
selecting the working set have been suggested. Keerthi et al. [29] have identified
an inefficiency in Platt’s algorithm while updating the parameters with one
threshold and have proposed replacing them with parameters with two
thresholds, a change that has led to improved performance. Hsu and Lin have
presented a working set selection method through which the convergence is
attained more quickly in difficult cases [23].

In [13], DeCoste and Scholkopf have discovered that at the beginning of


the training process, the number of candidate support vectors is greater than the
final number of support vectors. Thus, reducing the number of candidate support
vectors during iterations will affect the performance of the SMO algorithm. To
avoid this inefficiency, the authors have introduced the idea of digest and have
proposed heuristics to reduce the number of kernel reevaluations [13]. However,
these heuristics contain many ad-hoc parameters and caching the entire kernel
matrix is still not efficient.

Dong et al. have effectively integrated the idea of decomposition, caching,


digest, and shrinking, and have applied it to the improved SMO algorithm

proposed by Keerthi et al. [15]. Their algorithm has achieved a five to ten
times better performance than the previous algorithm, in which caching the
kernel matrix represents the critical part. In this case, the size of the kernel
matrix is equal to the working set, which is at least the size of the number of
support vectors. The digest method restricts the increase in the number of
candidate support vectors, so that the number of candidate support vectors sets
during the training process will never exceed the size of the working set.
Therefore, it is possible to maximize the reuse of the kernel matrix caching. In
[28], Keerthi and Gilbert have demonstrated the convergence of the algorithm
proposed by Keerthi et al.

Fan et al. have proposed a decomposition method using second order


information to achieve a fast convergence of the algorithm [19]. The authors
have demonstrated that this method trains a SVM much faster than the first
order selection methods.

In [11], the authors have proposed a parallel mixture of SVMs. Initially,


the model trains multiple SVMs using subsets of the training dataset and
combines their results using a linear hyperplane or a multilayer perceptron.
Although the idea of the mixture of local experts is not new, this model differs
from other mixture models by automatically assigning training instances to
different SVMs according to the prediction made by each local SVM. The
authors have shown that their model is faster and more accurate than a model
using a single SVM, such as SVMTorch [10]. However, in [16], two problems
related to this model have been indicated, namely the way to determine the
optimal number of local SVMs, which is a rather difficult problem, and the fact
that this model is not using a regularization term to control its complexity.

Dong et al. have proposed an effective solution to train SVMs, which
consists of two steps: a parallel optimization and a sequential working set
optimization [16]. During the first step, the kernel matrix is approximated by
block diagonal matrices, and as such, the optimization problem is divided into
multiple subproblems which can be solved more efficiently. The majority of
nonsupport vectors are removed and the training sets are collected for the
second step.

CHAPTER 3
FINE-TUNING SUPPORT VECTOR MACHINES

In this chapter, the support vector machine is modeled for classification,


namely for predicting the state of a two-class dependent variable based on the
independent variables. The architecture presented further has also been proposed
in [5], [7], and [4].

Thus, having the training vectors xi ∈ ℝm, i = 1, …, n, in two classes, and
the vector y ∈ ℝn, so that yi ∈ {−1, 1}, the SVM algorithm solves the following
primal optimization problem that needs to be minimized:

Q(w, b, ξ) = (1/2) w⊤w + C Σ_{i=1}^{n} ξi   (3.1)

subject to

yi(w⊤ϕ(xi ) + b) ≥ 1 − ξi,   ξi ≥ 0 for i = 1, …, n (3.2)

where ϕ(xi ) maps xi in a high-dimensional space and C > 0 is a cost or
regularization parameter.

Due to the high dimensionality of the vector w, the following dual


problem needs to be minimized:

Q(α) = (1/2) α⊤Qα − e⊤α   (3.3)

subject to

y⊤α = 0,   0 ≤ αi ≤ C for i = 1, …, n (3.4)

where α = (α1, α2, …, αn)⊤, e = (1, 1, …, 1)⊤, Q is a positive semidefinite matrix


of size n × n , Qij = yi yj K(xi, xj ) , and K(xi, xj ) = ϕ ⊤(xi )ϕ(xj ) is the kernel
function.

After solving the dual problem, the optimal vector w satisfies equation
(3.5):

w = Σ_{i=1}^{n} yi αi ϕ(xi)   (3.5)

The decision function is given by equation (3.6):

sgn(w⊤ϕ(x) + b) = sgn(Σ_{i=1}^{n} yi αi K(xi, x) + b)   (3.6)

where b is a constant.
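The decision function (3.6) can be reconstructed from a trained model's support vectors, dual coefficients, and bias; the sketch below does so with scikit-learn (an assumed dependency) and compares the result against the library's own decision_function as a sanity check. The dataset and parameter values are illustrative only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=5, random_state=1)
gamma = 0.2
model = SVC(kernel="rbf", C=1.0, gamma=gamma).fit(X, y)

def decision_value(model, x, gamma):
    # Recompute sum_i y_i alpha_i K(x_i, x) + b from equation (3.6);
    # dual_coef_ already stores the products y_i * alpha_i.
    k = np.exp(-gamma * np.sum((model.support_vectors_ - x) ** 2, axis=1))
    return model.dual_coef_.ravel() @ k + model.intercept_[0]

x_new = X[0]
print(decision_value(model, x_new, gamma))   # manual evaluation of (3.6)
print(model.decision_function([x_new])[0])   # library value, should match
```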

Due to the density of the kernel matrix Q, the traditional optimization


methods cannot be applied directly to solve for vector α . Unlike most of the
optimization methods which update the entire vector α in each step of an
iterative process, as presented in the previous subchapter, the decomposition
method modifies only a subset of α per iteration, thus only a few columns of the
kernel matrix Q are needed. This subset, referred to as the working set B,
involves minimizing a smaller subproblem in each iteration. The SMO
algorithm is an extreme example of this approach because it restricts the
working set B to have only two elements. Consequently, an optimization
algorithm is not required to solve a simple problem of two variables during each
iteration.

In this book, to solve the dual quadratic programming problem the


decomposition method using the second order information [19] is used, adapted
by Chang and Lin [8], and proposed in [5], [7], and [4]. The key step of the
chosen SMO algorithm consists of the working set selection method and the
shrinking algorithm, which determine the convergence speed of the algorithm.

The SMO decomposition algorithm proposed by Chang and Lin [8] is


presented below:

1. Search α^1 as the initial feasible solution and set k = 1.

2. If α^k is a stationary point of the dual problem, the algorithm stops.

▪ The KKT optimality conditions of the problem (3.3) imply that a


feasible solution α is a stationary point of the problem (3.3) if and only
if there is a number b and two nonnegative vectors λ and ξ, so that:

∇f(α) + by = λ − ξ,   λi αi = 0,   ξi(C − αi) = 0,   λi ≥ 0,   ξi ≥ 0,   i = 1, …, n   (3.7)

▪ where ∇f (α) = Qα + p is the gradient of f (α).

▪ Condition (3.7) can be rewritten as:

∇i f(α) + byi ≥ 0 if αi < C,   ∇i f(α) + byi ≤ 0 if αi > 0   (3.8)

▪ Since yi ∈ {−1, 1}, condition (3.8) is equivalent to the existence of a b such that:

m(α) ≤ b ≤ M(α) (3.9)

▪ where

m(α) = max_{i∈Iup(α)} ( −yi ∇i f(α) )   (3.10)

M(α) = min_{i∈Ilow(α)} ( −yi ∇i f(α) )   (3.11)

Iup(α) = {t | αt < C, yt = 1  or  αt > 0, yt = −1}   (3.12)

Ilow(α) = {t | αt < C, yt = −1  or  αt > 0, yt = 1}   (3.13)

▪ Thus, a feasible solution α is stationary if and only if:

m(α) ≤ M(α) (3.14)

▪ Otherwise, a two-element working set B = {i, j} ⊂ {1, …, n} is selected using the working set selection algorithm [19]. This algorithm derives the working set B = {i, j} based on a small positive number τ, the parameter C, the output vector y, and the selected kernel function K(xi, xj); that is, for each pair of working set indices t and s, it defines:

ats = K(xt, xt) + K(xs, xs) − 2K(xt, xs)   (3.15)

bts = − yt ∇t f (αk ) + ys ∇s f (αk ) (3.16)

ats = { ats,  if ats > 0;  τ,  otherwise }   (3.17)

▪ Select:

i ∈ argmax_t { −yt ∇t f(αk) | t ∈ Iup(αk) }   (3.18)

j ∈ argmin_t { − bit² / ait | t ∈ Ilow(αk), −yt ∇t f(αk) < −yi ∇i f(αk) }   (3.19)

▪ Return B = {i, j}.

3. To accelerate the convergence of the algorithm near the end of the iterative process, the decomposition method identifies a candidate set A containing the free support vectors. An optimal solution α of the dual problem may contain bounded support vectors. To accelerate the training process, the shrinking technique identifies and removes some of the bounded support vectors. Thus, instead of solving the whole problem, the decomposition method is applied to a smaller problem [26], namely:

min_{αA} ( (1/2) αA⊤ QAA αA − (pA − QAN αNk)⊤ αA )   (3.20)

▪ subject to

0 ≤ (αA)t ≤ C for t = 1, …, q   (3.21)

yA⊤ αA = Δ − yN⊤ αNk   (3.22)

▪ where N = {1, 2, …, n}\A represents the set of shrunk variables, q = |A|, and Δ is the right-hand side of the equality constraint in (3.4) (here Δ = 0).

▪ After every min(n, 1000) iterations, some variables will be shrunk. During the iterative process we have:

m(αk ) > M(αk ) (3.23)

▪ Until the condition m(αk) − M(αk) ≤ ε, where ε is the stopping tolerance, is satisfied, the variables in the following set can be shrunk:

{t | −yt ∇t f(αk) > m(αk), αtk = C, yt = 1  or  αtk = 0, yt = −1} ∪
{t | −yt ∇t f(αk) < M(αk), αtk = 0, yt = 1  or  αtk = C, yt = −1}   (3.24)

▪ Thus, the set A of activated variables is dynamically reduced every min(n, 1000) iterations.

▪ To account for the tendency of the shrinking method to be too aggressive, the gradient is reconstructed once the stopping criterion reaches:

m(αk) ≤ M(αk) + 10ε   (3.25)

▪ To reduce the reconstruction cost of the gradient ∇f(α) during the iterations, the vector G ∈ ℝn is kept:


Gi = C ∑_{αj = C} Qij,   i = 1, …, n   (3.26)

▪ Thus, for the gradient ∇i f (α), i ∉ A we have:

∇i f(α) = ∑_{j=1}^{n} Qij αj + pi = Gi + pi + ∑_{0 < αj < C} Qij αj   (3.27)

▪ And for the gradient ∇i f (α), i ∈ A we have:

∇i f (αk+1) = ∇i f (αk ) + Qit Δαt + QisΔαs (3.28)

▪ where t and s are the working set indices.

▪ After the reconstruction of the gradient, some previously shrunk variables are restored based on the relationship:

{t | −yt ∇t f(αk) > m(αk), αtk = C, yt = 1  or  αtk = 0, yt = −1} ∪
{t | −yt ∇t f(αk) < M(αk), αtk = 0, yt = 1  or  αtk = C, yt = −1}   (3.29)

4. Derive αk+1 as follows:

▪ If aij = K(xi, xi) + K(xj, xj) − 2K(xi, xj) > 0, the following subproblem is solved with αB = [αi  αj]⊤:

min_{αi,αj} ( (1/2) [αi  αj] [Qii  Qij; Qji  Qjj] [αi  αj]⊤ + (−pB + QBN αNk)⊤ [αi  αj]⊤ ) + const   (3.30)

▪ subject to

0 ≤ αi,   αj ≤ C (3.31)

yi αi + yj αj = − yN⊤ αNk   (3.32)

▪ and let

αinew = αik + yi bij / aij   (3.33)

αjnew = αjk − yj bij / aij   (3.34)

▪ where N = {1, ..., n}\B , and αBk and αNk are the subvectors of α k
corresponding to the set B and the set N, respectively.

▪ Otherwise, the following subproblem is solved:

min_{αi,αj} ( (1/2) [αi  αj] [Qii  Qij; Qji  Qjj] [αi  αj]⊤ + (−pB + QBN αNk)⊤ [αi  αj]⊤
            + ((τ − aij)/4) ((αi − αik)² + (αj − αjk)²) )   (3.35)

▪ subject to the same constraints as described in equations (3.31) and (3.32), where τ is a small positive number, and let

αinew = αik + yi bij / τ   (3.36)

αjnew = αjk − yj bij / τ   (3.37)

5. Finally, set αBk+1 as the optimal point of the subproblem, and αNk+1 = αNk .
Set k = k + 1 and go to step 2.
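For illustration only, the second order working set selection of step 2 (equations (3.15)-(3.19)) can be sketched with dense NumPy arrays as follows; LIBSVM [8] implements the same rule together with kernel caching and the shrinking technique of step 3. The kernel matrix K, the gradient grad = ∇f(α), alpha, y, C, and the tolerances are assumed to be supplied by the surrounding solver.

import numpy as np

def select_working_set(K, grad, alpha, y, C, tau=1e-12, eps=1e-3):
    """Return (i, j), or None when m(alpha) - M(alpha) <= eps (stationarity, (3.14))."""
    minus_yG = -y * grad                                             # -y_t * grad_t f(alpha)
    I_up  = ((alpha < C) & (y == 1)) | ((alpha > 0) & (y == -1))     # (3.12)
    I_low = ((alpha < C) & (y == -1)) | ((alpha > 0) & (y == 1))     # (3.13)

    up_idx = np.where(I_up)[0]
    i = up_idx[np.argmax(minus_yG[up_idx])]                          # (3.18)
    m = minus_yG[i]                                                  # m(alpha), (3.10)
    low_idx = np.where(I_low)[0]
    M = minus_yG[low_idx].min()                                      # M(alpha), (3.11)
    if m - M <= eps:
        return None

    cand = low_idx[minus_yG[low_idx] < m]                            # admissible t in (3.19)
    a = K[i, i] + K[cand, cand] - 2.0 * K[i, cand]                   # a_it, (3.15)
    a = np.where(a > 0, a, tau)                                      # (3.17)
    b = m - minus_yG[cand]                                           # b_it > 0, (3.16)
    j = cand[np.argmin(-(b * b) / a)]                                # (3.19)
    return int(i), int(j)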

After the quadratic programming problem is solved, the support vector coefficients in the decision function are obtained. Next, the constant b in the decision function is computed. If there exists an αi with 0 < αi < C, then from the KKT condition (3.14) we have:

b = − yi ∇i f (α) (3.38)

To ensure the stability of the calculations, the arithmetic mean is used:

b = − ( ∑_{i: 0 < αi < C} yi ∇i f(α) ) / |{i | 0 < αi < C}|   (3.39)

If no αi satisfies the condition 0 < αi < C, the KKT condition (3.14) becomes:

m(α) = max{ −yi ∇i f(α) | αi = 0, yi = 1  or  αi = C, yi = −1 } ≤ b ≤
M(α) = min{ −yi ∇i f(α) | αi = 0, yi = −1  or  αi = C, yi = 1 }   (3.40)

Finally, the constant b is the middle point of this interval.
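A minimal sketch of this computation of the constant b, following (3.38)-(3.40) and assuming that the gradient, α, y, and C are available from the solver, is given below.

import numpy as np

def compute_b(grad, alpha, y, C):
    minus_yG = -y * grad
    free = (alpha > 0) & (alpha < C)
    if free.any():
        return minus_yG[free].mean()                                  # arithmetic mean, (3.39)
    # No free support vectors: b is the midpoint of the interval in (3.40).
    upper_set = ((alpha == 0) & (y == 1)) | ((alpha == C) & (y == -1))
    lower_set = ((alpha == 0) & (y == -1)) | ((alpha == C) & (y == 1))
    return (minus_yG[upper_set].max() + minus_yG[lower_set].min()) / 2.0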

1. Fast SVM Algorithm

For a two-class SVM classification model, if the number of training instances n is large, the kernel matrix is dense and cannot be stored in memory. Unlike the standard decomposition algorithm, which depends on a cache storage strategy to calculate the kernel matrix, in this book a divide-and-conquer approach [5], [7], [4] is used to split the original problem into a set of subproblems which are solved using the SMO algorithm adapted by Chang and Lin [8].

For each subproblem, the kernel matrix can be stored in a kernel cache
defined as part of the adjacent memory. The size of the kernel matrix will be
large enough to contain all the support vectors in the training dataset and small
enough to meet the memory constraint. Since the kernel matrix for the
subproblem is entirely cached, each element of the kernel matrix has to be
evaluated only once and calculated using a fast method, namely the fast SVM
algorithm proposed by Dong, Suen, and Krzyzak [16]. This algorithm consists
of two stages: a parallel optimization and a fast sequential optimization. Each of
these two steps is described below.

1.1. Parallel Optimization

Since the kernel matrix Q is symmetric and positive semidefinite, its block diagonal matrices are positive semidefinite, and the block diagonal approximation can be written as [16]:

Qdiag = diag(Q11, Q22, …, Qkk)   (3.41)

where the diagonal blocks Qii have size ni × ni, i = 1, …, k, with ∑_{i=1}^{k} ni = n.
Consequently, k optimization subproblems are obtained, as described in the working set selection algorithm [19]. All the subproblems are optimized in parallel using the SMO decomposition algorithm [8] as detailed previously. Upon completion of the parallel optimization, most nonsupport vectors are removed from the training dataset, on the assumption that the nonsupport vectors of the optimization subproblems are also nonsupport vectors of the initial optimization problem. The computational diagram is illustrated in Figure 3.1 [16].
The calculations illustrated in Figure 3.1 are effective from three points of view. First, the kernel matrix can be divided into block diagonal matrices so that each of these can be stored in memory. Secondly, in the case of the vertical calculation, both classes share the same block kernel matrix, which therefore needs to be calculated only once. Finally, after the block matrices in the first row are calculated, the optimizations for the second class are independent, as are the calculations in the other columns [16].

Figure 3.1. Parallel Optimization Diagram [16] (horizontal calculation of the blocks Q11, Q22, …, Qkk, followed by the vertical, per-class optimizations for Class 1 and Class 2).

As a result of these calculations, a new training dataset is formed from the support vectors collected from the subproblems. Even though the size of the new training dataset is much smaller than the size of the original dataset, the available memory may still not be sufficient to store the kernel matrix, especially for large datasets. Thus, the fast sequential optimization technique [16] is used to train the SVM.
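Before moving to the sequential stage, the parallel stage described above can be sketched as follows; here scikit-learn's SVC stands in for the per-block SMO solver, X and y are assumed to be NumPy arrays, and the block count and parameters are illustrative assumptions rather than the exact settings of the architecture described in this book.

import numpy as np
from joblib import Parallel, delayed
from sklearn.svm import SVC

def solve_block(X_block, y_block, C, kernel):
    # Each block (diagonal matrix Q_ii) is optimized independently; each block is
    # assumed to contain instances of both classes.
    clf = SVC(C=C, kernel=kernel, gamma="scale").fit(X_block, y_block)
    return clf.support_                       # support vector indices within the block

def parallel_stage(X, y, k=4, C=1.0, kernel="rbf"):
    blocks = np.array_split(np.random.default_rng(0).permutation(len(X)), k)
    results = Parallel(n_jobs=-1)(
        delayed(solve_block)(X[idx], y[idx], C, kernel) for idx in blocks
    )
    keep = np.concatenate([idx[sv] for idx, sv in zip(blocks, results)])
    return X[keep], y[keep]                   # reduced training set for the second stage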

1.2. Fast Sequential Optimization

The fast sequential optimization technique works by iteratively optimizing subsets of the problem. Initially, the training dataset is shuffled, the variables αi, i = 1, …, n, are set equal to zero, and a subset E ⊆ S of a preset size d (d ≤ n) is selected from the training dataset S.

In this optimization process, the first step applies the SMO algorithm [8]
to optimize a subproblem in the subset E with kernel caching, and updates αi
and the kernel matrix. The subset size is chosen to be large enough to contain all
the support vectors in the training dataset but at the same time sufficiently small
to meet the memory constraint.

The subset is selected using the queue method for subset selection [16].
This queue method selects subsets of the training dataset that can be trained
using the fast sequential optimization algorithm. The method is initialized by
setting the subset to contain the first d instances in the training dataset and the
queue QS to contain the rest of the instances, and computing the kernel matrix
for that specific subset. Once initialized, the subset selection process works as
follows: each nonsupport vector in the subset is added at the end of the queue
and is replaced in the subset with the instance at the front of the queue, which is
consequently removed from the queue. After all the nonsupport vectors are replaced, the subset is used in the optimization process. In the next iteration, the same process is applied, starting with the subset and the queue in the same state they were in at the end of the last iteration.
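A minimal sketch of this queue-based subset refresh, working on dataset indices and assuming that a boolean array is_sv (true where αi > 0) is produced by the preceding SMO pass, is shown below.

from collections import deque

def refresh_subset(subset, queue, is_sv):
    """subset: list of d dataset indices; queue: deque of the remaining indices."""
    for pos, idx in enumerate(subset):
        if not is_sv[idx] and queue:
            queue.append(idx)                 # nonsupport vector goes to the back of the queue
            subset[pos] = queue.popleft()     # replaced by the instance at the front
    return subset, queue

# Initialization: the first d instances form the subset, the rest form the queue, e.g.
# d = 500; subset = list(range(d)); queue = deque(range(d, n))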

Since the KKT conditions are evaluated at each training instance, this step has a high computational cost; therefore, some heuristic stopping conditions have been proposed in [16]:

▪ Δsv < 20 and the number of trained instances is greater than n

▪ sv ≥ d − 1

▪ the number of trained instances is greater than T ⋅ n

where sv represents the number of support vectors, Δsv the change in the number of support vectors between two successive subsets, n the size of the training dataset, and T > 1 a predefined maximum number of loops allowed through the dataset.

If none of these stopping conditions is satisfied, the procedure returns to step 1, namely to applying the SMO algorithm [8]; otherwise, training stops.

CHAPTER 4
ESTIMATING SUPPORT VECTOR MACHINES

The proposed support vector machine architecture [5], [7], [4] is scalable to
large datasets and can be applied to any dataset from any field of activity, as
long as the problem to be solved is a classification problem.

However, in order to empirically estimate the generalization error of the proposed support vector machine, a relatively small synthetic dataset is used, with 3,333 records and 21 variables, which comes from the University of California, Irvine, Department of Information and Computer Science [3]. Table 4.1 indicates the 21 variables, their type, and the range of their values.

On this dataset, the PCA algorithm is applied as described by the authors in [6] to eliminate any possible collinearity. Therefore, when implementing the classification model, the variable Churn is the dependent variable, and the 11 principal components and 2 nominal variables are the independent variables.
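A possible preprocessing sketch along these lines is shown below; the file name, the two nominal predictors assumed to be kept (International Plan and Voice Mail Plan), and the columns assumed to be dropped before PCA are illustrative assumptions, and the actual pipeline in [6] may differ in detail.

import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("churn.csv")                          # assumed file name
nominal = ["International Plan", "Voice Mail Plan"]    # assumed nominal predictors kept
dropped = ["State", "Area Code", "Phone", "Churn"]     # assumed dropped before PCA
continuous = [c for c in df.columns if c not in nominal + dropped]

Z = StandardScaler().fit_transform(df[continuous])     # standardize the continuous inputs
pcs = PCA(n_components=11).fit_transform(Z)            # keep the 11 principal components

X = pd.concat(
    [pd.DataFrame(pcs, index=df.index, columns=[f"PC{i+1}" for i in range(11)]),
     pd.get_dummies(df[nominal], drop_first=True)],
    axis=1,
)
y = (df["Churn"] == "Yes").astype(int)                 # two-class dependent variable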

The dataset is randomly partitioned into a training dataset, which represents 80% of the original dataset, and a test dataset, which represents 20% of the original dataset [43]. For optimal training, a machine learning algorithm such as the support vector machine requires a training dataset in which the dependent variable, in this case the variable Churn, has an approximately balanced class distribution.

Variable Name             Type         Values           Missing Values
State                     Nominal      AK, AL, …        0
Area Code                 Nominal      408, 415, 510    0
Phone                     Nominal      N/A              0
International Plan        Nominal      Yes/No           0
Voice Mail Plan           Nominal      Yes/No           0
Account Length            Continuous   1 – 243          0
Voice Mail                Continuous   0 – 51           0
Day Minutes               Continuous   0.00 – 350.80    0
Day Calls                 Continuous   0 – 165          0
Day Charge                Continuous   0.00 – 59.64     0
Evening Minutes           Continuous   0.00 – 363.70    0
Evening Calls             Continuous   0 – 170          0
Evening Charge            Continuous   0.00 – 30.91     0
Night Minutes             Continuous   23.20 – 395.00   0
Night Calls               Continuous   33 – 175         0
Night Charge              Continuous   1.04 – 17.77     0
Intl. Minutes             Continuous   0 – 20           0
Intl. Calls               Continuous   0 – 20           0
Intl. Charge              Continuous   0.00 – 5.40      0
Customer Service Calls    Continuous   0 – 9            0
Churn                     Nominal      Yes/No           0

Table 4.1. Dataset Variables [3].

The training dataset initially has 388 (14%) instances belonging to the Yes class and 2,279 (86%) instances belonging to the No class.

Such an imbalanced distribution is common in many industries. However, in order to balance the distribution of the variable Churn, the oversampling technique is used [22]. This technique randomly clones the instances corresponding to the Yes class of the variable Churn until they are approximately equal in number to the instances corresponding to the No class in the training dataset. By adding new instances, the distribution of the variable Churn in the training dataset becomes: 2,294 (50%) instances pertaining to the Yes class and 2,279 (50%) instances pertaining to the No class, with a total of 4,573 instances. In the test dataset, the variable Churn has 95 (15%) instances belonging to the Yes class and 571 (85%) instances belonging to the No class.
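The 80/20 split and the random oversampling of the minority class [22] can be sketched as follows, reusing the X and y built in the previous sketch; the random seeds are illustrative assumptions.

import numpy as np
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=1)

# Randomly clone minority-class (Yes) instances until the two classes are
# approximately equal in the training dataset.
rng = np.random.default_rng(1)
minority = np.where(y_train.to_numpy() == 1)[0]
majority = np.where(y_train.to_numpy() == 0)[0]
extra = rng.choice(minority, size=len(majority) - len(minority), replace=True)
keep = np.concatenate([np.arange(len(y_train)), extra])

X_train_bal = X_train.iloc[keep].reset_index(drop=True)
y_train_bal = y_train.iloc[keep].reset_index(drop=True)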

The predictive performance of the model can be assessed using the confusion matrix. The confusion matrix is a table with two rows and two columns, where the cells on the main diagonal represent correctly classified cases and those on the opposite diagonal represent incorrectly classified cases, for the training and the test datasets respectively, as illustrated in Table 4.2.

Dataset          Observed     Predicted
                              No      Yes     % correct
Training/Test    No           TN      FP      TNR
                 Yes          FN      TP      TPR
                 Total %      NPV     PPV     ACC

Table 4.2. Confusion Matrix.

The confusion matrix cells are called: true negatives (TN), false positives (FP), false negatives (FN), and true positives (TP) (Table 4.2). The rest of the measures from Table 4.2, such as TNR (true negatives rate or specificity), TPR (true positives rate or sensitivity), NPV (negative predictive value), PPV (positive predictive value or precision), and ACC (accuracy), are calculated using the following equations:

TPR = TP / (TP + FN)   (4.1)

TNR = TN / (TN + FP)   (4.2)

PPV = TP / (TP + FP)   (4.3)

NPV = TN / (TN + FN)   (4.4)

ACC = (TP + TN) / (P + N)   (4.5)

where P represents the total number of positives and N is the total number of
negatives.
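These measures can be computed directly from the four confusion matrix cells, as in the following sketch, which is evaluated here on the test dataset cells reported in Table 4.3 below.

def confusion_measures(tp, tn, fp, fn):
    return {
        "TPR (sensitivity)": tp / (tp + fn),                  # (4.1)
        "TNR (specificity)": tn / (tn + fp),                  # (4.2)
        "PPV (precision)":   tp / (tp + fp),                  # (4.3)
        "NPV":               tn / (tn + fn),                  # (4.4)
        "ACC":               (tp + tn) / (tp + tn + fp + fn), # (4.5)
    }

# Test dataset cells from Table 4.3: TP = 95, TN = 569, FP = 2, FN = 0.
print(confusion_measures(tp=95, tn=569, fp=2, fn=0))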

Based on these measures, Table 4.3 shows the confusion matrix for the machine learning algorithm used, namely the support vector machine, on both the training and the test datasets.

In the training phase, the prediction model has correctly classified 2,281 of the 2,294 cases belonging to the Yes class, with a true positives rate of 99.43%. Of the 2,279 cases belonging to the No class, all cases are classified correctly, providing a specificity of 100%. In other words, approximately 99% of the cases in the training dataset are classified correctly.

Within the test dataset, out of the 95 cases belonging to Yes class, all the
cases are classified correctly (a true positives rate of 100%); and of the 571
cases belonging to No class, 569 cases are classified correctly (a specificity of
99.65%). Overall, approximately 99.70% of the cases in the test dataset are
classified correctly and approximately 0.30% are misclassified.

In other words, the information presented in Table 4.3 suggests that, overall, the predictive model will correctly classify approximately 997 out of 1,000 cases.

Dataset     Observed     Predicted
                         No        Yes       % correct
Training    No           2,279     0         100.00%
            Yes          13        2,281     99.43%
            Total %      50.19%    49.81%    99.72%
Test        No           569       2         99.65%
            Yes          0         95        100.00%
            Total %      85.44%    14.56%    99.70%

Table 4.3. Confusion Matrix – SVM.

Another way of interpreting these results is the lift chart. This type of chart sorts the predicted pseudo-probabilities [33] in descending order and displays the corresponding curve. There are two types of lift charts: incremental and cumulative. The incremental lift chart represented in Figure 4.1 shows the lift factor in each percentile [18], without any accumulation, for the Yes class of the dependent variable Churn. The curve corresponding to this predictive model falls below the gray line, which corresponds to the random expectation (RND E), around the 16th percentile. This means that, compared to the random expectation, the model achieves its maximum performance in the first 16% of the instances.

Figure 4.1. Incremental Lift Chart (lift versus percentile for Churn = Yes; SVM curve versus the random expectation, RND E).

The cumulative lift chart indicates the prediction rate of the predictive model compared to the random expectation. Figure 4.2 illustrates the curve of the cumulative lift chart for the Yes class of the dependent variable Churn. Reading the chart along the horizontal axis, it can be seen that at the 16th percentile the model has a lift index of approximately 7 on the vertical axis, meaning that it has a predictive performance approximately 7 times better than a random model.

Figure 4.2. Cumulative Lift Chart (lift versus percentile for Churn = Yes; SVM curve versus the random expectation, RND E).
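For completeness, the following sketch shows one way the incremental and cumulative lift curves of Figures 4.1 and 4.2 can be computed from the predicted pseudo-probabilities; y_true and scores are assumed to be the test labels and the model scores (for example, decision function values).

import numpy as np

def lift_curve(y_true, scores, n_bins=100):
    # Sort instances by descending pseudo-probability, split them into percentile
    # bins, and compare each bin's response rate with the overall response rate.
    order = np.argsort(-np.asarray(scores))
    y_sorted = np.asarray(y_true)[order]
    base_rate = y_sorted.mean()
    bins = np.array_split(y_sorted, n_bins)
    incremental = np.array([b.mean() / base_rate for b in bins])          # lift per percentile
    cumulative = (np.cumsum(y_sorted) /
                  np.arange(1, len(y_sorted) + 1)) / base_rate            # cumulative lift
    return incremental, cumulative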

The performance of this predictive model can also be evaluated using the gain measure. The gain chart shows the percentage of positive responses on the vertical axis and the percentage of cases predicted on the horizontal axis. The gain is defined as the proportion of positive responses captured up to each percentile, relative to the total number of positive responses. The cumulative gain chart shows the prediction rate of the model compared to the random expectation.

Figure 4.3 illustrates the curve corresponding to the support vector machine predictive model for the Yes class of the dependent variable Churn. It can be seen that at the 16th percentile the predictive model already captures approximately 100% of the positive responses.
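A short sketch of the cumulative gain computation behind Figure 4.3 is given below, under the same assumptions about y_true and scores as in the previous sketch.

import numpy as np

def cumulative_gain(y_true, scores):
    # Share of all positive responses captured within the top-ranked instances.
    order = np.argsort(-np.asarray(scores))
    y_sorted = np.asarray(y_true)[order]
    return np.cumsum(y_sorted) / y_sorted.sum()

# For example, the gain at the 16th percentile:
# g = cumulative_gain(y_test, scores)
# print(g[int(0.16 * len(g)) - 1])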

Figure 4.3. Gain Chart (% gain versus percentile for Churn = Yes; SVM curve versus the random expectation, RND E).

Figure 4.4 shows the ROC (receiver operating characteristic) curve of the predictive model. The ROC curve is derived from the confusion matrix and uses only the TPR and the FPR (false positives rate) measures, the latter being obtained by subtracting the specificity from one. Following the chart in Figure 4.4, one can observe that the curve approaches the coordinate point (0, 1) in the upper left corner, which implies a perfect prediction. Our predictive model based on support vector machines obtains a sensitivity of 100% and a specificity of 100%.
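The (FPR, TPR) points of the ROC curve can be computed from the same quantities, as in the following sketch; each threshold on the scores yields one point.

import numpy as np

def roc_points(y_true, scores):
    y = np.asarray(y_true)
    s = np.asarray(scores)
    P, N = (y == 1).sum(), (y == 0).sum()
    thresholds = np.unique(s)[::-1]                                          # highest score first
    tpr = np.array([((s >= t) & (y == 1)).sum() / P for t in thresholds])    # sensitivity, (4.1)
    fpr = np.array([((s >= t) & (y == 0)).sum() / N for t in thresholds])    # 1 - specificity
    return fpr, tpr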

Figure 4.4. ROC Curve (TPR (sensitivity) versus FPR (1 − specificity) for Churn = Yes; SVM curve versus the random expectation, RND E).

CHAPTER 5
CONCLUSIONS

As previously shown, a support vector machine can be fine-tuned to yield highly accurate prediction results for a specific classification task. The support vector machine architecture described in this book is original: it combines several techniques existing in the literature in a way that has not been applied in other research papers to solve the same problem, nor any other classification problem. This architecture has been previously published by the author in IEEE peer-reviewed manuscripts [5], [7], and [4].

To improve the linear separability of the soft margin SVM algorithm, the kernel trick technique is applied using a polynomial kernel of degree 4 that satisfies Mercer's condition [47]. Using this technique, the initial input space is mapped into a high-dimensional feature space that is not treated explicitly. To train the soft margin SVM, a quadratic problem with a number of variables equal to the number of training instances must be solved. When the number of training instances is high, the kernel matrix is dense and cannot be stored in memory. Unlike the standard decomposition algorithm, which depends on a cache storage strategy to calculate the kernel matrix, in this book a divide-and-conquer approach is used to split the original problem into a set of subproblems which are solved using the SMO algorithm adapted by Chang and Lin [8]. For each subproblem, the kernel matrix can be stored in a cache defined as part of the adjacent memory. The size of the kernel matrix is large enough to contain all the support vectors in the training dataset and small enough to meet the memory constraint. Since the kernel matrix for the subproblem is entirely cached, each element of the kernel matrix has to be evaluated only once and is calculated using a fast method, namely the fast SVM algorithm proposed by Dong, Suen, and Krzyzak [16]. This algorithm consists of a parallel optimization followed by a fast sequential optimization.

The generalization error of this support vector machine model [5], [7], [4]
was then estimated empirically. This predictive model was also evaluated using
four other methods, namely: the incremental and the cumulative lift charts, the
gain chart, and the ROC curve.

By evaluating the results obtained using the confusion matrix, it can be observed that for the Yes class of the dependent variable Churn, the support vector machine model has a predictive performance of 100.00%. When considering the results generated for both classes of the dependent variable Churn, the support vector machine achieves an overall predictive performance of 99.70%.

Using the incremental lift chart method, it can be observed that the predictive model achieves its maximum performance in the first 16% of the instances, because the corresponding curve of the predictive model falls below the gray line of the random expectation around the 16th percentile. The cumulative lift chart indicates that, at the 16th percentile, the model has a lift index of approximately 7 on the vertical axis, meaning that it has a predictive performance about 7 times better than that of a random model.

Based on the gain chart, at the 16th percentile the predictive model implemented using the support vector machine captures 100% of the respondents, relative to the total number of respondents. By interpreting the chart of the ROC curve, it can be seen that the model is very close to a perfect prediction, with a sensitivity of 100% and a specificity of 100%.

It is extremely important to reiterate that the resulting soft margin support vector machine architecture is scalable to larger datasets and can be applied to any dataset from any field of activity, as long as the problem to be solved is a classification problem.

In conclusion, it is evident that if one has a clear understanding of the theoretical notions underlying the support vector machines algorithm, a prediction model can be architected to generate extremely accurate results.

BIBLIOGRAPHY

1. Allen, C. and A. Pipkin, Course on Integral Equations. Texts in Applied Mathematics, 1991. 9.
2. Bellman, R.E., Dynamic Programming. 1957, Princeton: Princeton University Press. xxv, 342 p.
3. Blake, C.L. and C.J. Merz, Churn Data Set. 1998: California, USA.
4. Brandusoiu, I.B., Methods for Predicting the Evolution of the Number of
Subscribers in the Mobile Telecommunications Industry, PhD Thesis. 2016,
Technical University of Cluj-Napoca.
5. Brandusoiu, I.B. and G. Toderean, Churn Prediction in the Telecommunications Sector Using Support Vector Machines. Annals of the Oradea University Fascicle of Management and Technological Engineering, 2013. 22(1).
6. Brandusoiu, I.B. and G. Toderean, Applying Principal Component Analysis on Call Detail Records. ACTA Technica Napocensis Electronics and Telecommunications, 2014. 55(4).
7. Brandusoiu, I.B. and G. Toderean, Methods for Churn Prediction in the Pre-
Paid Mobile Telecommunications Industry, 11th International Conference on
Communications (COMM). 2016.

8. Chang, C.C. and C.J. Lin, LIBSVM: A Library for Support Vector Machines.
2001.
9. Cherkassky, V. and F. Mulier, Learning from Data: Concepts, Theory, and
Methods. 1998: John Wiley & Sons.
10. Collobert, R. and S. Bengio, SVMTorch: Support Vector Machines for Large-
Scale Regression Problems. Machine Learning Research, 2001. 1.
11. Collobert, R., S. Bengio, and Y. Bengio, A Parallel Mixture of SVMs for Very
Large Scale Problems. Neural Computation, 2002. 14.
12. Cortes, C. and V. Vapnik, Support Vector Networks. Machine Learning, 1995.
20.
13. DeCoste, D. and B. Scholkopf, Training Invariant Support Vector Machines.
Machine Learning, 2002. 46.
14. Dietterich, T.G., Approximate Statistical Tests for Comparing Supervised
Classification Learning Algorithms. Neural Computation, 1998. 10(7).
15. Dong, J.X., A. Krzyzak, and C.Y. Suen, A Fast SVM Training Algorithm.
Pattern Recognition with Support Vector Machines: Lecture Notes in
Computer Science, 2002. 2388.
16. Dong, J.X., A. Krzyzak, and C.Y. Suen, Fast SVM Training Algorithm with
Decomposition on Very Large Data Sets. IEEE Transactions on Pattern
Analysis and Machine Intelligence, 2005. 27(4).
17. Dunteman, G.H., Principal Components Analysis. 1989: Sage Publications.
18. Edwards, D.I., Introduction to Graphical Modeling 2nd Edition. 2000: Springer.
19. Fan, R.E., P.H. Chen, and C.J. Lin, Working Set Selection Using Second Order
Information for Training SVM. Journal of Machine Learning Research, 2005.
6.
20. Fukunaga, K., Introduction to Statistical Pattern Recognition. 1990: Academic
Press.
21. Gorunescu, F., Data Mining Concepts, Models and Techniques. 2011: Springer
Verlag.
22. He, H. and Y. Ma, Imbalanced Learning Foundations, Algorithms, and
Applications. 2013: John Wiley & Sons.
23. Hsu, C.W. and C.J. Lin, A Comparison of Methods for Multi-Class Support
Vector Machines. IEEE Transactions on Neural Networks, 2002.
24. Hwang, J., S. Lay, and A. Lippman, Nonparametric Multivariate Density
Estimation: A Comparative Study. IEEE Transaction on Signal Processing,
1994. 42(10).
25. Jimenez, L.O. and D.A. Landgrebe, Supervised Classification in High-Dimensional Space: Geometrical, Statistical, and Asymptotical Properties of Multivariate Data. IEEE Transactions on Systems, Man, and Cybernetics, 1998. 28.
26. Joachims, T., Making Large-Scale Support Vector Machine Learning Practical.
Advances in Kernel Methods: Support Vector Machines, 1998.
27. Jordan, M.I. and R. Thibaux, The Kernel Trick. UC Berkeley Lecture Notes, 2004.
28. Keerthi, S.S. and E.G. Gilbert, Convergence of a Generalized SMO Algorithm
for SVM Classifier Design. Machine Learning, 2002. 46.
29. Keerthi, S.S., S.K. Shevade, C. Bhattacharyya, and K.R.K. Murthy, Improvements to Platt's SMO Algorithm for SVM Classifier Design. Neural Computation, 2001. 13.
30. Kim, J.O. and C.W. Mueller, Factor Analysis: Statistical Methods and Practical
Issues. 1978: Sage Publications.
31. Kuhn, H. and A. Tucker. Nonlinear Programming. Proceedings of the Berkeley
Symposium Mathematics, Statistics and Probabilistics. 1951.
32. Leon, F., Inteligență Artificială: Mașini Cu Vector Suport. 2014: Tehnopress.
33. Minsky, M.L. and S. Papert, Perceptrons. 1969: MIT Press.
34. Mitchell, T., Machine Learning. 1997: McGraw-Hill.
35. Nilsson, N.J., Introduction to Machine Learning. 1998: Stanford University.
36. Osuna, E., R. Freund, and F. Girosi. Training Support Vector Machines: An
Application to Face Detection. Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition. 1997.
37. Platt, J.C., Fast Training of Support Vector Machines Using Sequential
Minimal Optimization. Advances in Kernel Methods: Support Vector
Machines, 1998.
38. Schmitt, M., On the Complexity of Computing and Learning with
Multiplicative Neural Networks. Neural Computation, 2002. 14(2).
39. Scholkopf, B., Support Vector Learning. 1997: Oldenbourg-Verlag.
40. Scholkopf, B., C.J.C. Burges, and V.N. Vapnik. Extracting Support Data for a
Given Task. Proceedings of the International Conference on Knowledge
Discovery and Data Mining AAAI. 1995.
41. Scholkopf, B. and A.J. Smola, Learning with Kernels. 2002: MIT Press.
42. Steinwart, I., On the Optimal Parameter Choice for Nu-Support Vector
Machines. IEEE Transactions on Pattern Analysis and Machine Intelligence,
2003. 25.
43. Sumathi, S. and S.N. Sivanandam, Introduction to Data Mining and its
Applications. Studies in Computational Intelligence, 2006. 29.
44. Valiant, L.G., A Theory of the Learnable. 1984: Communications of the ACM.
45. Vapnik, V. and O. Chapelle, Bounds on Error Expectation for Support Vector
Machines. Neural Computation, 2000. 12.
46. Vapnik, V.N., The Nature of Statistical Learning Theory. 1995, New York:
Springer. xv, 188 p.
47. Vapnik, V.N., Statistical Learning Theory. Adaptive and Learning Systems for
Signal Processing, Communications, and Control. 1998, New York: Wiley.
xxiv, 736 p.
48. Wolpert, D.H., The Relationship between PAC, the Statistical Physics
Framework, the Bayesian Framework, and the VC Framework. The
Mathematics of Generalization the SFI Studies in the Sciences of Complexity,
1995.
49. Zoutendijk, G., Methods of Feasible Directions. 1960: Elsevier.
