
Model Selection for Support Vector Machines: Advantages and

Disadvantages of the Machine Learning Theory

Davide Anguita, Member, IEEE, Alessandro Ghio, Member, IEEE, Noemi Greco, Luca Oneto,
and Sandro Ridella, Member, IEEE

Abstract-A common belief is that Machine Learning Theory (MLT) is not very useful, in practice, for performing effective SVM model selection. This belief is supported by experience, because well-known hold-out methods like cross-validation, leave-one-out, and the bootstrap usually achieve better results than the ones derived from MLT. We show in this paper that, in a small sample setting, i.e. when the dimensionality of the data is larger than the number of samples, a careful application of the MLT can outperform other methods in selecting the optimal hyperparameters of an SVM.

I. INTRODUCTION

The Support Vector Machine (SVM) [1] is one of the state-of-the-art techniques for classification tasks. Its learning phase consists in finding a set of parameters by solving a Convex Constrained Quadratic Programming (CCQP) problem, for which many effective techniques have been proposed [2]. However, the search for the optimal parameters does not complete the learning phase of the SVM: a set of additional variables (hyperparameters) must be tuned in order to find the SVM characterized by optimal performance in classifying a particular set of data. This phase is usually called model selection and is strictly linked with the estimation of the generalization ability of a classifier (i.e., the error rate attainable on new and previously unobserved data), as the chosen model is the one characterized by the smallest estimated generalization error. Unfortunately, the tuning of the hyperparameters is not a trivial task and represents an open research problem [3], [4], [5], [6].

As the true probability distribution originating the data is unknown, the generalization error of a classifier cannot be computed, but several techniques have been proposed for obtaining a probabilistic estimate. They can be divided in two main categories [4], [7]: practical and theoretical methods.

Practical methods typically rely on well-known and reliable statistical procedures, whose underlying hypotheses, however, cannot always be satisfied or are only asymptotically valid [8]. Practical methods usually split the available set of data in two independent subsets: one is used for creating a model (training set), while the other is used for computing the generalization error estimate (hold-out set). The most used practical techniques for model selection are the k-Fold Cross Validation (KCV), the Leave One Out (LOO), and the Bootstrap (BTS). The KCV technique [9] consists in splitting a dataset in k independent subsets; in turn, all but one are used to train a classifier, while the remaining one is used as a hold-out set to evaluate an average generalization error. The LOO technique [10] is analogous to a KCV where the number of folds equals the number of available patterns: one sample is used as hold-out, while the remaining ones are used for training a model. The BTS method [11], instead, is a pure resampling technique: a training set, with the same cardinality of the original one, is built by extracting samples with replacement, while the unextracted patterns (approximately 36.8% of the dataset, on average) are used as hold-out set.
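The three hold-out estimators differ only in how the training/hold-out split is generated. The following is a minimal sketch, assuming the dataset is held in NumPy arrays X and y; the routines fit and predict are hypothetical placeholders for any SVM training and prediction implementation, and are not part of the procedures used in the experiments of this paper.

```python
import numpy as np

def kcv_splits(n, k, rng):
    """k-Fold CV: k disjoint folds; each fold in turn is the hold-out set."""
    idx = rng.permutation(n)
    folds = np.array_split(idx, k)
    for i in range(k):
        holdout = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train, holdout

def loo_splits(n):
    """Leave-One-Out: a KCV with as many folds as patterns."""
    for i in range(n):
        yield np.delete(np.arange(n), i), np.array([i])

def bts_splits(n, n_rep, rng):
    """Bootstrap: resample n patterns with replacement; the unextracted
    patterns (about 36.8% on average) form the hold-out set."""
    for _ in range(n_rep):
        train = rng.integers(0, n, size=n)
        holdout = np.setdiff1d(np.arange(n), train)
        yield train, holdout

def kcv_error(X, y, fit, predict, k=10, seed=0):
    """Average hold-out error over the k folds (fit/predict are placeholders)."""
    rng = np.random.default_rng(seed)
    errs = [np.mean(predict(fit(X[tr], y[tr]), X[ho]) != y[ho])
            for tr, ho in kcv_splits(len(y), k, rng)]
    return float(np.mean(errs))
```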
Theoretical methods, instead, provide deep insights on the classification algorithms and are based on rigorous approaches to give a prediction, in probability, of the generalization ability of a classifier. Their main advantage, with respect to hold-out methods, is the use of the whole set of available data for both training the model and estimating the generalization error (from which derives the name of in-sample methods), which makes these approaches very appealing when only few data are available. However, the underlying hypotheses, which must be fulfilled for the consistency of the estimation, are seldom satisfied in practice, and the generalization estimate can be very pessimistic [12], [13], [14].

When targeting classification problems where a large amount of data is available, practical techniques outperform theoretical ones [3], [4]. On the other hand, there are cases where the number of patterns is small compared to the dimensionality of the problem, like, for example, in the case of microarray data, where often less than a hundred samples, composed of thousands of genes, are available. In this case, the classification task belongs to the small sample setting and the practical approaches have several drawbacks, since reducing the size of the training set usually decreases, by a large amount, the reliability of the classifier [15], [16]. For the small sample regime, in-sample methods should be preferred, but their application is usually unfeasible.

We present in this paper a procedure for practically applying an in-sample approach, based on the Maximal Discrepancy (MD) theory, to the SVM model selection. In addition, we show that, using this approach, the hyperparameter space of the SVM can be searched more effectively, and better results, with respect to hold-out methods, are obtained.

Davide Anguita, Alessandro Ghio, Noemi Greco, Luca Oneto and Sandro Ridella are with the Department of Biophysical and Electronic Engineering, University of Genova, Via Opera Pia 11A, I-16145 Genova, Italy (emails: {Davide.Anguita, Alessandro.Ghio, Sandro.Ridella}@unige.it, {greco.noemi, phoenix.luca}@gmail.com).



II. THE SUPPORT VECTOR MACHINE

As we are working in the small sample regime, we focus here on the linear SVM. The extension to the non-linear case, through the use of kernels, is detailed in the Appendix.

Let us consider a dataset D_l, composed of l i.i.d. patterns D_l = {(x_1, y_1), ..., (x_l, y_l)}, where x_i ∈ R^n and y_i = ±1, and let F be a class of possible functions. The SVM is the function f ∈ F

  f(x) = w^T x + b,  (1)

where the weights w and the bias b are found by solving the following primal CCQP problem:

  min_{w,b,ξ}  (1/2)||w||² + C e^T ξ  (2)
  y_i (w^T x_i + b) ≥ 1 - ξ_i  ∀i ∈ [1, ..., l]
  ξ_i ≥ 0  ∀i ∈ [1, ..., l]

where e_i = 1 ∀i. This is equivalent to maximizing the margin and penalizing the errors by the hinge loss function ℓ_ξ:

  ℓ_ξ(f(x_i), y_i) = [1 - y_i f(x_i)]_+,  (3)

where [·]_+ = max(0, ·) [1].

By introducing l Lagrange multipliers (α_1, ..., α_l), it is possible to write the above problem in its dual form, for which efficient solvers have been developed throughout the years [18]:

  min_α  (1/2) Σ_{i=1}^{l} Σ_{j=1}^{l} α_i α_j y_i y_j x_i^T x_j - Σ_{i=1}^{l} α_i  (4)
  0 ≤ α_i ≤ C  ∀i ∈ [1, ..., l]
  y^T α = 0.

After solving the above problem, the Lagrange multipliers can be used to define the SVM classifier in its dual form:

  f(x) = Σ_{i=1}^{l} y_i α_i x_i^T x + b.  (5)

The patterns characterized by y_i f(x_i) ≤ 1 are called Support Vectors (SVs), because they are the only ones for which α_i > 0.

The hyperparameter C in problems (2) and (4) is tuned during the model selection phase, in order to balance the size of the margin with the amount of misclassification, and it indirectly defines the size of the set of functions F. In order to better control this effect, and to apply the MD theory, we propose to use an alternative SVM formulation, based on the concepts of Ivanov regularization [12], where the set F is defined as the set of functions with ||w||² ≤ w_MAX² and b ∈ (-∞, +∞):

  min_{w,b,ξ}  e^T ξ  (6)
  ||w||² ≤ w_MAX²  (7)
  y_i (w^T x_i + b) ≥ 1 - ξ_i  ∀i ∈ [1, ..., l]  (8)
  ξ_i ≥ 0  ∀i ∈ [1, ..., l]  (9)

This formulation is equivalent to (2), for some value of the hyperparameter C, but, during the model selection phase, the hyperparameter w_MAX is tuned instead of C. The hyperparameter w_MAX allows us to control the size of the set F and to perform Structural Risk Minimization [1]: in fact, increasing the value of w_MAX corresponds to increasing the size of the set of functions.
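The Ivanov formulation (6)-(9) only changes which quantity is constrained: the norm of w is bounded explicitly instead of being penalized through C. The following minimal NumPy sketch, which assumes that a candidate linear model (w, b) has already been obtained from some solver, evaluates the slacks ξ_i, the objective of (6) and the feasibility of constraint (7); it is an illustration of the quantities involved, not the solver proposed in Section IV-B.

```python
import numpy as np

def ivanov_quantities(w, b, X, y, w_max):
    """Evaluate a candidate (w, b) against the Ivanov-regularized
    formulation (6)-(9): slacks, objective and constraint (7)."""
    margins = y * (X @ w + b)               # y_i (w^T x_i + b)
    xi = np.maximum(0.0, 1.0 - margins)     # hinge slacks, Eq. (3)
    objective = xi.sum()                    # e^T xi, the objective of (6)
    feasible = w @ w <= w_max ** 2          # constraint (7)
    return xi, objective, feasible
```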

III. THE MAXIMAL DISCREPANCY OF A CLASSIFIER

Let us consider the dataset D_l and a general prediction rule f ∈ F. We can define the empirical error rate¹ L̂_l(f) = (1/l) Σ_{i=1}^{l} ℓ(f(x_i), y_i) associated with f, where ℓ is a suitable loss function.

Let us split D_l in two halves and compute the two empirical errors:

  L̂^(1)_{l/2}(f) = (2/l) Σ_{i=1}^{l/2} ℓ(f(x_i), y_i)  (10)
  L̂^(2)_{l/2}(f) = (2/l) Σ_{i=l/2+1}^{l} ℓ(f(x_i), y_i),  (11)

then the Maximal Discrepancy (MD) is defined as

  MD = max_{f∈F} ( L̂^(1)_{l/2}(f) - L̂^(2)_{l/2}(f) ).  (12)

In practical cases, we want to avoid that a possible "unlucky" shuffling results in an unreliable MD value; therefore, we replicate the splitting procedure m times: at each iteration, the dataset D_l is randomly shuffled and, then, the m MD values are averaged.

If the loss function ℓ(·,·) is bounded (e.g., ℓ(·,·) ∈ [0,1]), an upper bound of the generalization error L(f) in terms of MD can be found using the following theorem [14], [17]:

Theorem 1: Given a dataset D_l, consisting of l patterns x_i ∈ R^n, a class of functions F and a loss function ℓ(·,·) ∈ [0,1], the following procedure can be replicated m times: (a) randomly shuffle the samples in D_l to obtain D_l^(j); (b) compute MD^(j) for each replicate. Then, with probability 1 - δ,

  L(f) ≤ L̂_l(f) + (1/m) Σ_{j=1}^{m} MD^(j) + 3 √( log(2/δ) / (2l) ).  (13)

Furthermore, if the loss function ℓ ∈ [0,1] is such that ℓ(f(x_i), y_i) = 1 - ℓ(f(x_i), -y_i), the MD values can be computed by a conventional empirical minimization procedure. Let us define a new dataset D'_l = {(x'_1, y'_1), ..., (x'_l, y'_l)}, such that (x'_i, y'_i) = (x_i, -y_i) if i ≤ l/2 and (x'_i, y'_i) = (x_i, y_i) if i > l/2; then it is easy to show that

  MD = 1 - 2 min_{f∈F} L̂'_l(f),  (14)

where L̂'_l(f) is the empirical error obtained on D'_l.

¹ In this paper, we use the same notation of [14] and [19].
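Equation (14) turns the computation of each MD^(j) into an ordinary training problem on a label-flipped dataset. The sketch below assembles the averaged MD term and the right-hand side of the bound of Eq. (13), under the assumption that a routine min_empirical_error(X, y) is available and returns min over F of the empirical error on (X, y) (e.g., obtained by solving problem (6) for a given w_MAX, as in Section IV); both min_empirical_error and train_error are hypothetical placeholders.

```python
import numpy as np

def md_bound(X, y, min_empirical_error, train_error, m=100, delta=0.05, seed=0):
    """Estimate the MD-based generalization bound of Eq. (13)."""
    rng = np.random.default_rng(seed)
    l = len(y)
    md_values = []
    for _ in range(m):
        perm = rng.permutation(l)            # step (a): random shuffle of D_l
        y_flip = y[perm].copy()
        y_flip[: l // 2] *= -1               # build D'_l: flip labels of the first half
        md_j = 1.0 - 2.0 * min_empirical_error(X[perm], y_flip)   # Eq. (14)
        md_values.append(md_j)
    md_avg = float(np.mean(md_values))
    confidence = 3.0 * np.sqrt(np.log(2.0 / delta) / (2.0 * l))
    return train_error + md_avg + confidence  # right-hand side of Eq. (13)
```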
IV. THE APPLICATION OF MD TO THE SVM

When targeting classification tasks, we are interested in a hard loss function, which counts the number of misclassifications:

  ℓ_H(f(x_i), y_i) = { 0  if y_i f(x_i) > 0
                       1  if y_i f(x_i) ≤ 0.  (15)

Unfortunately, the use of a hard loss function makes the problem of finding the optimal f computationally hard. For this reason, the conventional SVM algorithm makes use of the hinge loss of Eq. (3), which is convex and Lipschitz continuous, so that the search for the optimal prediction rule is greatly simplified. This simplification, however, has a severe drawback: the unboundedness of the hinge loss complicates the problem of predicting the generalization ability of f [14], [17], and the conditions of Theorem 1 do not hold anymore.

We define here an alternative soft loss function

  ℓ_S(f(x_i), y_i) = { ℓ_H(f(x_i), y_i)     if y_i f(x_i) ≤ -1
                       ℓ_ξ(f(x_i), y_i)/2   if y_i f(x_i) ≥ -1  (16)

and its associated slack variable η_i = 2 ℓ_S(f(x_i), y_i) = min(2, ξ_i), which can be used in the SVM formulation in order to compute the bound of Eq. (13).
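A minimal sketch of the soft loss of Eq. (16) and of the associated slack η_i = min(2, ξ_i) is given below; it assumes that the decision values f(x_i) of a trained classifier are available.

```python
import numpy as np

def soft_loss_and_eta(decision_values, y):
    """Soft loss of Eq. (16) and its slack eta_i = 2*l_S = min(2, xi_i)."""
    margins = y * decision_values            # y_i f(x_i)
    xi = np.maximum(0.0, 1.0 - margins)      # hinge slacks, Eq. (3)
    eta = np.minimum(2.0, xi)                # saturates at 2 when y_i f(x_i) <= -1
    return eta / 2.0, eta                    # (soft loss values, eta values)
```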
As in the case of the hard loss function, the resulting optimization problem is not convex, which makes it intractable and solvable only in an approximate way [20], even for moderate l. However, we propose a practical method, which makes use of a peeling technique and allows us to find an upper bound of min_{f∈F} L̂_l(f) and a lower bound of min_{f∈F} L̂'_l(f), so that the bound of Eq. (13) still holds.

A. The Peeling Technique

It is easy to note that the values of η_i and ξ_i coincide for all the patterns x_i for which y_i f(x_i) ≥ -1. In general, however, some patterns will be characterized by y_i f(x_i) < -1: they are critical for computing the error, since η_i and ξ_i do not coincide; therefore, we call them Critical Support Vectors (CSVs).

Let S = {1, ..., l} be the set of indexes of the l patterns of the dataset, S_C the set of indexes of the CSVs and S_N = S \ S_C the set of indexes of the remaining patterns. Then, a lower bound of min_{f∈F} L̂'_l(f) or, in other words, an upper bound of MD, can be found using the following theorem (proofs are omitted here due to space constraints):

Theorem 2: Let D_l be a dataset of l patterns and let us suppose to know the values η_i for each pattern in D_l. Then, given a class of functions F:

  min_{f∈F} (1/l) Σ_{i∈S} η_i/2  ≥  min_{f∈F} (1/l) Σ_{k∈S_N} ξ_k/2.  (17)

Similarly, we can upper bound the error on the training set min_{f∈F} L̂_l(f):

Theorem 3: Let D_l be a dataset of l patterns and let us suppose to know the values η_i for each pattern in D_l. Then, given a class of functions F:

  min_{f∈F} (1/l) Σ_{i∈S} η_i/2  ≤  |S_C|/l + min_{f∈F} (1/l) Σ_{k∈S_N} ξ_k/2,  (18)

where |S_C| is the cardinality of the set S_C.

In order to obtain the tightest bound, we should choose the set S_C with minimum cardinality, but this approach is obviously infeasible, as it would require examining all the possible combinations of samples. A possible solution is to consider one sample at a time: at first, the SVM learning problem (6) is solved to identify the CSVs; then the CSV with the largest error or, in other words, the sample for which y_i f(x_i) is minimum, is deleted from the training set, and the learning is repeated with the remaining samples. At the final step, the classifier is trained on the set consisting of the remaining |S_N| patterns. The peeling procedure is obviously sub-optimal and could remove, at least in theory, a large number of CSVs, making the bound on the generalization error very loose. In practice, however, the number of CSVs is usually a tiny fraction of the training set (see also Section V-A for the analysis on a real-world dataset), and several replicates are used in Eq. (13) in order to mitigate the effect of CSVs.
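The greedy one-sample-at-a-time procedure can be summarized as in the sketch below. It assumes a routine solve_ivanov_svm(X, y, w_max) (e.g., Algorithm 1 of the next subsection) that returns the weights, bias and slacks of problem (6); it is an illustration of the peeling loop, not an optimized implementation.

```python
import numpy as np

def peeling(X, y, w_max, solve_ivanov_svm):
    """Iteratively remove the worst Critical Support Vector (CSV), i.e. the
    pattern with minimal y_i f(x_i) among those with y_i f(x_i) < -1."""
    keep = np.arange(len(y))                   # indexes currently kept (candidate S_N)
    while True:
        w, b, xi = solve_ivanov_svm(X[keep], y[keep], w_max)
        margins = y[keep] * (X[keep] @ w + b)  # y_i f(x_i)
        csv_mask = margins < -1.0              # CSVs: eta_i and xi_i differ
        if not csv_mask.any():                 # no CSVs left: S_N has been found
            return keep, w, b, xi
        worst = np.argmin(margins)             # the CSV with the largest error
        keep = np.delete(keep, worst)          # peel it and retrain
```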
Moreover, it is important to remark that our approach is consistent in computing MD:

Theorem 4: Let D_l be a dataset of l patterns. Let us suppose to know the soft loss values η_i for each pattern in D_l. Then, given a class of functions F,

  min_{f∈F} (1/l) Σ_{i∈S_N} ξ_i/2  ≤  1/2.  (19)

Therefore, MD ≥ 0, as expected.

B. Solving the SVM Problem (6)

We are interested in finding the value of min_{f∈F} Σ_i ξ_i; therefore, after the peeling procedure ends, we use the obtained minimum in order to estimate the generalization error using the bound of Eq. (13). When applying the MD-based bound, we make use of the SVM formulation (6) for two main reasons: (i) the minimization procedure gives us exactly the error estimate we are looking for, and (ii) the class of functions F can be defined more easily than in the conventional primal or dual formulations of the SVM.

However, to the best of our knowledge, no ad-hoc procedure has been described for solving the SVM problem (6). Our proposal, based on the ideas of [21], makes use of conventional Linear Programming (LP) and Quadratic Programming (QP) optimization algorithms and is presented in Algorithm 1. The first step consists in solving the problem (6), which becomes an LP problem when discarding the quadratic constraint (7). After the optimization procedure ends, the value of ||w||² is computed and two alternatives arise: if the constraint is satisfied, we already have the optimal solution and the routine ends; else, the optimal solution corresponds to ||w|| = w_MAX. In order to find the solution, we have to switch to the dual of the problem (6), which can be obtained by
defining l Lagrange multipliers β for the constraint (8) and one additional Lagrange multiplier γ for the constraint (7):

  min_{β,γ}  (1/(2γ)) Σ_{i=1}^{l} Σ_{j=1}^{l} β_i β_j y_i y_j x_i^T x_j - Σ_{i=1}^{l} β_i + γ w_MAX²/2  (20)
  0 ≤ β_i ≤ 1  ∀i ∈ [1, ..., l]  (21)
  γ ≥ 0  (22)
  y^T β = 0,  (23)

where β is such that w = (1/γ) Σ_{i=1}^{l} β_i y_i x_i. Please note that, if the quadratic constraint (7) were satisfied, γ would equal 0 and the dual would not be solvable due to numerical issues: this is why, as a first step, we make use of LP routines for solving the problem (6), and we exploit the dual formulation only if the constraint is not satisfied.

Our target is to solve the problem (20) using conventional QP optimization routines for SVMs (e.g. SMO [2]); therefore, we use an iterative optimization technique. The first step consists in fixing the value of γ to a value γ_0 > 0 and, then, optimizing the cost function with respect to the other dual variables β. It is easy to see that the term γ w_MAX²/2 is now constant and can be removed from the expression. The dual becomes:

  min_β  (1/2) Σ_{i=1}^{l} Σ_{j=1}^{l} β_i β_j y_i y_j x_i^T x_j - γ_0 Σ_{i=1}^{l} β_i  (24)
  0 ≤ β_i ≤ 1  ∀i ∈ [1, ..., l]
  y^T β = 0,

which is equivalent to the conventional SVM dual problem (4) and can be solved with well-known QP solvers [18].

The next step consists in updating the value of γ_0. We have to compute the Lagrangian of problem (20):

  A = (1/(2γ)) Σ_{i=1}^{l} Σ_{j=1}^{l} β_i β_j y_i y_j x_i^T x_j - Σ_{i=1}^{l} β_i + γ w_MAX²/2
      - Σ_{i=1}^{l} μ_i β_i - Σ_{i=1}^{l} ω_i (1 - β_i) - ργ + b y^T β,  (25)

where μ, ω, b and ρ are the Lagrange multipliers of the constraints (21), (22) and (23). The following derivative of A is the only one of interest for our purposes:

  ∂A/∂γ = 0 = -(1/(2γ²)) Σ_{i=1}^{l} Σ_{j=1}^{l} β_i β_j y_i y_j x_i^T x_j + w_MAX²/2 - ρ.  (26)

Since, from the slackness conditions, we have that ργ = 0, and since, in the cases of interest, γ > 0, it must be ρ = 0, and we find the following updating rule for γ_0:

  γ_0 = √( Σ_{i=1}^{l} Σ_{j=1}^{l} β_i β_j y_i y_j x_i^T x_j ) / w_MAX.  (27)

We iteratively proceed in solving the dual of Eq. (24) and updating the value of γ_0 until the termination condition is met:

  |γ_0 - γ_0^old| ≤ τ,  (28)

where τ is a user-defined tolerance.

Algorithm 1: The algorithm for solving the SVM problem (6).
  Input: w_MAX², a tolerance τ, a dataset D_l
  Output: w, b, ξ
  {w, b, ξ} = solve LP problem (6) removing the constraint (7);
  if ||w||² > w_MAX² then
      γ_0 = 1;
      while |γ_0 - γ_0^old| > τ do
          γ_0^old = γ_0;
          {w, b, ξ} = solve QP problem (24);
          γ_0 = √( Σ_{i=1}^{l} Σ_{j=1}^{l} β_i β_j y_i y_j x_i^T x_j ) / w_MAX;
      end
  end
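A compact rendering of the outer loop of Algorithm 1 is sketched below. The routines solve_lp_relaxation and solve_box_qp are hypothetical placeholders: the first stands for any LP solver applied to (6) without constraint (7), the second for any standard SVM-style QP/SMO solver applied to problem (24); only the γ_0 iteration of Eqs. (24)-(28) is spelled out.

```python
import numpy as np

def solve_problem_6(X, y, w_max, solve_lp_relaxation, solve_box_qp, tol=1e-4):
    """Outer loop of Algorithm 1 for the Ivanov-regularized problem (6)."""
    w, b, xi = solve_lp_relaxation(X, y)          # problem (6) without constraint (7)
    if w @ w <= w_max ** 2:                       # constraint (7) already satisfied
        return w, b, xi
    G = (y[:, None] * X) @ (y[:, None] * X).T     # G_ij = y_i y_j x_i^T x_j
    gamma, gamma_old = 1.0, np.inf
    while abs(gamma - gamma_old) > tol:           # termination condition (28)
        gamma_old = gamma
        # Problem (24): min 0.5 * beta^T G beta - gamma * sum(beta),
        # subject to 0 <= beta_i <= 1 and y^T beta = 0 (placeholder solver).
        beta, b = solve_box_qp(G, y, gamma)
        gamma = np.sqrt(beta @ G @ beta) / w_max  # updating rule (27)
    w = (beta * y) @ X / gamma                    # w = (1/gamma) * sum_i beta_i y_i x_i
    xi = np.maximum(0.0, 1.0 - y * (X @ w + b))   # slacks of constraint (8)
    return w, b, xi
```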
C. Searching for the Optimal Value of w_MAX

The model selection using the conventional primal or dual formulation of the SVM consists in finding the optimal value for the hyperparameter C. Even though some practical methods have been suggested for deriving it in a very simple and efficient way [3], the most effective procedure is to solve the related CCQP problem several times [22], with different C values, and estimate the generalization error at each step. Finally, the optimal hyperparameters are chosen in correspondence of the minimum of the estimated generalization error. Some proposals exist for choosing the admissible search space for C [23], but this choice is far from obvious.

When performing the model selection using the SVM formulation based on Ivanov regularization (6), we have to find the search space for the hyperparameter w_MAX. Differently from the previous case, it is possible to find a simple relation between the value of w_MAX and the size of the margin: then, finding an upper and a lower bound for the margin implies defining the search space for this hyperparameter.

Let us consider the SVM separating hyperplane w·x + b = 0 for a set of data D_l, defined as in Section II. Let us consider a pattern x_k: the distance d between the pattern x_k and the separating hyperplane can be computed as

  d = |w·x_k + b| / ||w||.  (29)

If x_k is such that w·x_k + b = +1, i.e. it lies on the margin boundary, then d = ||w||^{-1} and the margin M equals

  M = 2/||w||.  (30)
Let S_+ and S_- be the sets of indexes of the patterns of D_l which refer to the class +1 and -1, respectively. We can define

  d_MIN = min_{i∈S_+, j∈S_-} δ(x_i, x_j)  (31)
  d_MAX = max_{i∈S_+, j∈S_-} δ(x_i, x_j),  (32)

where δ(·,·) represents the distance between two patterns. Then, the margin can assume values only in the range:

  d_MIN ≤ M ≤ d_MAX  (33)

or, in other words, the search space for the hyperparameter w_MAX is:

  2/d_MAX ≤ w_MAX ≤ 2/d_MIN.  (34)
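The bounds (31)-(34) translate directly into a grid of candidate w_MAX values. The following sketch, assuming the Euclidean distance for δ(·,·), computes d_MIN and d_MAX and builds a logarithmically spaced grid inside the admissible interval; the grid size is an arbitrary choice for illustration.

```python
import numpy as np

def wmax_search_space(X, y, n_values=30):
    """Search space for w_MAX from Eqs. (31)-(34), with Euclidean delta."""
    Xp, Xm = X[y == +1], X[y == -1]
    # Pairwise distances between patterns of opposite classes.
    dists = np.sqrt(((Xp[:, None, :] - Xm[None, :, :]) ** 2).sum(axis=2))
    d_min, d_max = dists.min(), dists.max()
    # w_MAX ranges in [2/d_MAX, 2/d_MIN]; sample it on a logarithmic scale.
    return np.logspace(np.log10(2.0 / d_max), np.log10(2.0 / d_min), n_values)
```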
V. EXPERIMENTAL RESULTS

In the following experiments, three practical approaches (KCV, LOO and BTS) are compared with the results obtained using the MD-based technique. The experimental setup is the following:

• the data are normalized in the range [0, 1];
• the model selection is performed, using the three practical methods, by searching for the optimal value of C in the interval [10^-5, 10^3], which includes the cases of interest, among 30 values, equally spaced on a logarithmic scale [22] (see the sketch after this list). For the KCV technique, k = 10 is used, while the bootstrap procedure is iterated 1000 times;
• the MD-based model selection is performed by searching for the optimal value of w_MAX, as described in Section IV-C. In order to avoid unreliable MD values, we set m = 100 in the bound of Eq. (13);
• the error rates of the optimal models chosen by the KCV, LOO, BTS, and MD approaches are then computed on a separate test set, where available, using the hard loss function of Eq. (15);
• when a separate test set is not available, the approach of [24] is used, by generating different training/test pairs for the comparison.
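For reference, the hyperparameter grid used by the practical methods can be generated as follows; this is a minimal sketch of the 30 log-spaced values of C in [10^-5, 10^3] mentioned in the list above.

```python
import numpy as np

# 30 candidate values of C, equally spaced on a logarithmic scale in [1e-5, 1e3].
C_grid = np.logspace(-5, 3, 30)
```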
A. The MNIST Dataset

The MD-based method is obviously targeted toward small sample problems, where the use of a hold-out set for estimating the generalization ability of a classifier is usually less effective [15], [16]. In order to fairly compare the performance of the MD-based technique versus the hold-out ones, we select a real-world application, the MNIST dataset [25], consisting of a large number of samples, and use only a small amount of the available data as training set. The remaining samples can be used as a test set for the comparison, since they represent a reasonably good estimate of the generalization error L(f).

The MNIST dataset consists of 62000 images, representing the numbers from 0 to 9: in particular, we consider the 13074 patterns containing 0's and 1's, which allow us to deal with a binary classification problem. We build the training set by randomly sampling a small number of patterns, varying from l = 20 to l = 500, while the remaining 13074 - l images are used as a test set. In order to build statistically relevant results and, at the same time, to show that the MD-based approach is almost insensitive with respect to the selection of samples for the training and the model selection phases, we build a set of 30 replicates using a random sampling technique and a set of 30 replicates using the approach of [26], which guarantees that almost-homogeneous subsets of the dataset are built. Note that the dimensionality of the dataset is 784, which is much higher than the number of samples in each of the training sets and, therefore, defines a typical small sample setting.

In Table I, we show the results obtained on the MNIST replicates created using a random sampling technique: the first column represents the number of patterns used in the experiments, while the remaining columns present the error rates obtained on the test set for the BTS, KCV, LOO, and MD approaches, respectively. When only a restrained number of patterns is used, the best overall performance corresponds to the model selected with the MD-based approach: when l is small (in this case, l ≤ 200), the underlying hypotheses of the practical methods are not valid, so the generalization error estimate and, consequently, the model selection become unreliable. On the contrary, when l is large (e.g., l ≥ 300 in the experiments), the practical methods tend to outperform MD. This is mainly due to the fact that the MD-based technique privileges "underfitting" models (i.e., SVMs characterized by large margin values) instead of the "overfitting" classifiers chosen by the practical approaches: this behaviour improves the performance of the classifier when only few training patterns are available, but results in a conservative approach as l increases. This is confirmed also by the results obtained on the replicates created using the approach of [26] and shown in Table II: in these cases, the underlying hypotheses for the practical approaches hold (e.g. the training set is a "good sample" of the entire population) and the BTS method outperforms MD, even for very low values of l.

TABLE I
ERROR RATES ON THE TEST SET OF THE MNIST DATASET, SAMPLED WITH A RANDOM TECHNIQUE, WITH 95% CONFIDENCE INTERVAL. ALL VALUES ARE IN PERCENTAGE.

   l  |   BTS    |   KCV    |   LOO    |    MD
  20  | 1.9 ±0.4 | 1.9 ±0.3 | 2.4 ±0.6 | 1.8 ±0.4
  50  | 0.9 ±0.2 | 1.3 ±0.3 | 1.4 ±0.2 | 0.8 ±0.1
  100 | 0.6 ±0.2 | 0.8 ±0.2 | 0.8 ±0.1 | 0.5 ±0.1
  200 | 0.4 ±0.1 | 0.4 ±0.1 | 0.5 ±0.1 | 0.4 ±0.1
  300 | 0.3 ±0.1 | 0.3 ±0.1 | 0.4 ±0.1 | 0.4 ±0.1
  500 | 0.2 ±0.1 | 0.2 ±0.1 | 0.2 ±0.1 | 0.3 ±0.1
We can also verify experimentally that the probability of finding a large number of CSVs is low. Fig. 1 shows the experimental probability of finding at least s + 1 CSVs (in percentage, with respect to the number of samples) as a function of s: the probability of finding at least one CSV is approximately 70%, but this value decreases to less than 0.2% for a number of CSVs greater than 10% of the training patterns.

[Fig. 1. The experimental probability of finding at least s + 1 CSVs as a function of s. The figure refers to a case of the MNIST dataset with l = 200.]

TABLE II
ERROR RATES ON THE TEST SET OF THE MNIST DATASET, SAMPLED USING THE TECHNIQUE PROPOSED IN [26], WITH 95% CONFIDENCE INTERVAL. ALL VALUES ARE IN PERCENTAGE.

   l  |   BTS    |   KCV    |   LOO    |    MD
  20  | 1.2 ±0.3 | 1.5 ±0.3 | 2.1 ±0.5 | 1.7 ±0.4
  50  | 0.6 ±0.1 | 1.0 ±0.2 | 1.1 ±0.3 | 0.8 ±0.1
  100 | 0.4 ±0.1 | 0.7 ±0.2 | 0.7 ±0.2 | 0.5 ±0.1
  200 | 0.3 ±0.1 | 0.5 ±0.1 | 0.5 ±0.1 | 0.4 ±0.1
  300 | 0.3 ±0.1 | 0.3 ±0.1 | 0.4 ±0.1 | 0.3 ±0.1
  500 | 0.2 ±0.1 | 0.2 ±0.1 | 0.2 ±0.1 | 0.3 ±0.1

B. Human Gene Expression Datasets

In the experiments described in the previous section, we extracted small sample sets in order to fairly compare the theoretical and practical model selection techniques on a large-cardinality test set. In a real-world small sample setting, a test set of such size is not available: therefore, we reproduce the methodology used by [24], which consists in generating p different training/test pairs using a cross-validation approach. In particular, we set p = 5 for our experiments. If the number of patterns of a dataset is not exactly a multiple of p, some patterns are left out of the training set: however, they are not neglected (as in many other applications), and they are simply added to every test set. Analogously to the analysis of the MNIST dataset, we create the training/test splits using two different techniques: a random sampling approach and the stratified almost-homogeneous sampling method of [26], in order to verify the generalization ability of the model selection techniques when both "bad" and "good" samples are available.

In this section, we use two biclass problems, taken from the well-known GEMS datasets [27]: Prostate Tumor and DLBCL. In addition to these two sets, we also make use of a gene expression dataset for myeloma diagnosis taken from [28], and a DNA microarray dataset collected in Casa Sollievo della Sofferenza Hospital, Foggia, Italy, relative to patients affected by colon cancer [29]. Table III presents the main characteristics of the datasets.

TABLE III
CHARACTERISTICS OF THE HUMAN GENE EXPRESSION DATASETS USED IN OUR EXPERIMENTS.

 Dataset        | Reference | # of patterns | # of features
 Prostate Tumor | [27]      | 102           | 10509
 DLBCL          | [27]      | 77            | 5469
 Myeloma        | [28]      | 105           | 28032
 Colon cancer   | [29]      | 47            | 22283

Tables IV and V show the total number of misclassifications, obtained using both the random and the stratified data sampling. When a good sample is available (Table V), the practical techniques tend to outperform the MD-based technique, even in the case of small datasets, since the underlying hypotheses are satisfied. When the data are not carefully selected (Table IV), MD still tends to choose underfitting models, differently from the practical approaches: the models selected by MD obtain the best performance (on average) on the test sets.

TABLE IV
NUMBER OF MISCLASSIFICATIONS ON THE TEST SET OF THE HUMAN GENE EXPRESSION DATASETS, CREATED USING A RANDOM SAMPLING TECHNIQUE.

 Dataset      | BTS | KCV | LOO | MD
 Prostate     | 16  | 16  | 18  | 22
 DLBCL        | 3   | 3   | 4   | 2
 Myeloma      | 8   | 10  | 8   | 0
 Colon cancer | 9   | 8   | 8   | 6
 Total        | 36  | 37  | 38  | 30

TABLE V
NUMBER OF MISCLASSIFICATIONS ON THE TEST SET OF THE HUMAN GENE EXPRESSION DATASETS, CREATED USING THE APPROACH OF [26].

 Dataset      | BTS | KCV | LOO | MD
 Prostate     | 9   | 10  | 10  | 10
 DLBCL        | 0   | 2   | 2   | 3
 Myeloma      | 0   | 0   | 0   | 0
 Colon cancer | 4   | 4   | 3   | 5
 Total        | 13  | 16  | 16  | 18
VI. CONCLUSIONS

We have detailed a method to apply a well-known approach of the MLT, based on the Maximal Discrepancy concept, to the problem of SVM model selection. In particular, we have focused on the small sample regime, where the number of available samples is very low compared to their dimensionality, which is the typical setting of several bioinformatic classification problems. The disadvantage of the MLT-based approach lies in the pessimistic behavior of the Maximal Discrepancy method and in its computational complexity, which is not lower than that of the methods based on resampling techniques. However, the MD method outperforms several resampling algorithms, which are widely used by practitioners, and appears less sensitive to the availability of a 'good' training set for the problem under investigation.

APPENDIX

In this appendix, we propose the non-linear kernel extension of the SVM problem (6), which allows us to use the procedure presented in Algorithm 1. For our purposes, we use the same assumptions of [30] for the non-linear reformulation.

Let φ(x_i) be a non-linear function which maps a pattern x_i from the input to the feature space. Let us define the weights w of the primal formulation (6) as a linear combination of the input patterns, mapped through φ(·):

  w = Σ_{i=1}^{l} y_i ψ_i φ(x_i),  (35)

where ψ_i ∈ R. Then, we can write the following primal formulation:

  min_{ψ,b,ξ}  e^T ξ  (36)
  Σ_{i=1}^{l} Σ_{j=1}^{l} y_i y_j ψ_i ψ_j K_ij ≤ w_MAX²  (37)
  y_i ( Σ_{j=1}^{l} y_j ψ_j K_ij + b ) ≥ 1 - ξ_i  ∀i  (38)
  ξ_i ≥ 0  ∀i  (39)

where K_ij = K(x_i, x_j) = φ(x_i)·φ(x_j). As discussed in Section IV-B, we can remove the quadratic constraint (37), solve the problem (36) and verify if the solution satisfies the constraint (37): if the latter is not satisfied, we have to switch to the dual formulation.

Then, we derive the dual formulation and verify that it can be efficiently solved with conventional QP solvers, by computing the Lagrangian A:

  A = Σ_{i=1}^{l} ξ_i - (γ/2) [ w_MAX² - Σ_{i=1}^{l} Σ_{j=1}^{l} y_i y_j ψ_i ψ_j K_ij ]
      - Σ_{i=1}^{l} β_i [ y_i ( Σ_{j=1}^{l} y_j ψ_j K_ij + b ) - 1 + ξ_i ] - Σ_{i=1}^{l} ζ_i ξ_i,  (40)

from which the following Karush-Kuhn-Tucker (KKT) conditions are obtained:

  ∂A/∂ξ_i = 0  →  β_i ≤ 1  (41)
  ∂A/∂b = 0  →  Σ_{i=1}^{l} y_i β_i = 0  (42)
  ∂A/∂ψ_i = 0  →  ψ_i = β_i/γ.  (43)

By substituting the previous conditions in the Lagrangian A, it is easy to see that the same formulation of problem (20) is obtained.
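In this kernel formulation, both the constraint (37) and the decision function depend on the data only through the Gram matrix. The sketch below evaluates these quantities for a candidate (ψ, b), assuming an RBF kernel as an example choice of K(·,·); it only checks feasibility against (36)-(39), leaving the optimization to the LP/QP scheme of Algorithm 1.

```python
import numpy as np

def rbf_gram(X, gamma_k=0.1):
    """Gram matrix K_ij = exp(-gamma_k * ||x_i - x_j||^2) (example kernel)."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-gamma_k * sq)

def kernel_quantities(psi, b, K, y, w_max):
    """Evaluate a candidate (psi, b) against the kernel formulation (36)-(39)."""
    f = K @ (y * psi) + b                    # f(x_i) = sum_j y_j psi_j K_ij + b
    xi = np.maximum(0.0, 1.0 - y * f)        # slacks of constraint (38)
    norm2 = (y * psi) @ K @ (y * psi)        # ||w||^2 in feature space, lhs of (37)
    return xi.sum(), norm2 <= w_max ** 2     # objective (36), feasibility of (37)
```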
ACKNOWLEDGMENTS

We thank Casa Sollievo della Sofferenza Hospital, Foggia, Italy, for providing the colon cancer dataset.

REFERENCES

[1] V. Vapnik, "An overview of statistical learning theory", IEEE Transactions on Neural Networks, vol. 10, pp. 988-999, 1999.
[2] C.-J. Lin, "Asymptotic convergence of an SMO algorithm without any assumptions", IEEE Transactions on Neural Networks, vol. 13, pp. 248-250, 2002.
[3] B. L. Milenova, J. S. Yarmus, M. M. Campos, "SVM in Oracle database 10g: Removing the barriers to widespread adoption of Support Vector Machines", Proc. of the 31st Int. Conf. on Very Large Data Bases, pp. 1152-1163, 2005.
[4] D. Anguita, A. Boni, S. Ridella, F. Rivieccio, D. Sterpi, "Theoretical and practical model selection methods for Support Vector classifiers", in Support Vector Machines: Theory and Applications, edited by L. Wang, Springer, 2005.
[5] B. Schoelkopf, A. Smola, Learning with Kernels, The MIT Press, 2002.
[6] J. Shawe-Taylor, N. Cristianini, "Margin distribution and soft margin", in Advances in Large Margin Classifiers, edited by A. Smola, P. Bartlett, B. Schoelkopf, D. Schuurmans, The MIT Press, 2000.
[7] K. Duan, S. S. Keerthi, A. Poo, "Evaluation of simple performance measures for tuning SVM parameters", Neurocomputing, vol. 51, pp. 41-59, 2003.
[8] R. Kohavi, "A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection", Proc. of the Int. Joint Conf. on Artificial Intelligence, 1995.
[9] D. Anguita, S. Ridella, S. Rivieccio, "K-fold generalization capability assessment for support vector classifiers", Proc. of the Int. Joint Conf. on Neural Networks, pp. 855-858, Montreal, Canada, 2005.
[10] O. Bousquet, A. Elisseeff, "Stability and generalization", Journal of Machine Learning Research, vol. 2, pp. 499-526, 2002.
[11] B. Efron, R. Tibshirani, An Introduction to the Bootstrap, Chapman and Hall, 1993.
[12] V. Vapnik, The Nature of Statistical Learning Theory, Springer, 2000.
[13] T. Poggio, R. Rifkin, S. Mukherjee, P. Niyogi, "General conditions for predictivity in learning theory", Nature, vol. 428, pp. 419-422, 2004.
[14] P. L. Bartlett, S. Boucheron, G. Lugosi, "Model selection and error estimation", Machine Learning, vol. 48, pp. 85-113, 2002.
[15] U. M. Braga-Neto, E. R. Dougherty, "Is cross-validation valid for small-sample microarray classification?", Bioinformatics, vol. 20, pp. 374-380, 2004.
[16] A. Isaksson, M. Wallman, H. Goeransson, M. G. Gustafsson, "Cross-validation and bootstrapping are unreliable in small sample classification", Pattern Recognition Letters, vol. 29, pp. 1960-1965, 2008.
[17] D. Anguita, A. Ghio, S. Ridella, "Maximal Discrepancy for Support Vector Machines", Proc. of the European Symposium on Artificial Neural Networks, Bruges, Belgium, 2010.
[18] L. Bottou, C.-J. Lin, "Support Vector Machine Solvers", in Large Scale Kernel Machines, edited by L. Bottou, O. Chapelle, D. DeCoste, J. Weston, The MIT Press, pp. 1-28, 2007.
[19] S. Boucheron, O. Bousquet, G. Lugosi, "Theory of Classification: a Survey of Some Recent Advances", ESAIM: Probability and Statistics, vol. 9, pp. 323-375, 2005.
[20] L. Wang, H. Jia, J. Li, "Training robust support vector machines with smooth ramp loss in the primal space", Neurocomputing, vol. 71, pp. 3020-3025, 2008.
[21] L. Martein, S. Schaible, "On solving a linear program with one quadratic constraint", Decisions in Economics and Finance, vol. 10, 1987.
[22] C.-W. Hsu, C.-C. Chang, C.-J. Lin, "A practical guide to support vector classification", Technical report, Dept. of Computer Science, National Taiwan University, 2003.
[23] T. Hastie, S. Rosset, R. Tibshirani, J. Zhu, "The Entire Regularization Path for the Support Vector Machine", Journal of Machine Learning Research, vol. 5, pp. 1391-1415, 2004.
[24] A. Statnikov, C. F. Aliferis, I. Tsamardinos, D. Hardin, S. Levy, "A Comprehensive Evaluation of Multicategory Classification Methods for Microarray Gene Expression Cancer Diagnosis", Bioinformatics, vol. 21, pp. 631-643, 2005.
[25] H. Larochelle, D. Erhan, A. Courville, J. Bergstra, Y. Bengio, "An empirical evaluation of deep architectures on problems with many factors of variation", Proc. of the International Conference on Machine Learning, pp. 473-480, 2007.
[26] M. Aupetit, "Nearly homogeneous multi-partitioning with a deterministic generator", Neurocomputing, vol. 72, pp. 1379-1389, 2009.
[27] A. Statnikov, I. Tsamardinos, Y. Dosbayev, C. F. Aliferis, "GEMS: A System for Automated Cancer Diagnosis and Biomarker Discovery from Microarray Gene Expression Data", International Journal of Medical Informatics, vol. 74, pp. 491-503, 2005.
[28] D. Page, F. Zhan, J. Cussens, W. Waddell, J. Hardin, B. Barlogie, J. Shaughnessy, "Comparative Data Mining for Microarrays: A Case Study Based on Multiple Myeloma", Proc. of the International Conference on Intelligent Systems for Molecular Biology, 2002.
[29] N. Ancona, R. Maglietta, A. Piepoli, A. D'Addabbo, R. Cotugno, M. Savino, S. Liuni, M. Carella, G. Pesole, F. Perri, "On the statistical assessment of classifiers using DNA microarray data", Bioinformatics, vol. 7, 2006.
[30] O. Chapelle, "Training a Support Vector Machine in the Primal", Neural Computation, vol. 19, pp. 1155-1178, 2007.
