

Neurocomputing 71 (2008) 611–619


www.elsevier.com/locate/neucom

Support vector machine classification for large data sets via minimum
enclosing ball clustering
Jair Cervantes (a), Xiaoou Li (a), Wen Yu (b,*), Kang Li (c)

(a) Departamento de Computación, CINVESTAV-IPN, A.P. 14-740, Av. IPN 2508, México, D.F. 07360, México
(b) Departamento de Control Automático, CINVESTAV-IPN, A.P. 14-740, Av. IPN 2508, México, D.F. 07360, México
(c) School of Electronics, Electrical Engineering and Computer Science, Queen's University Belfast, Ashby Building, Stranmillis Road, Belfast BT9 5AH, UK
Available online 9 October 2007

Abstract

Support vector machine (SVM) is a powerful technique for data classification. Despite its good theoretical foundations and high classification accuracy, a normal SVM is not suitable for the classification of large data sets, because its training complexity depends heavily on the size of the data set. This paper presents a novel SVM classification approach for large data sets which uses minimum enclosing ball clustering. After the training data are partitioned by the proposed clustering method, the centers of the clusters are used for the first stage of SVM classification. The clusters whose centers are support vectors, together with the clusters that contain both classes, are then used for the second stage of SVM classification; by this stage most of the data have been removed. Several experimental results show that the approach proposed in this paper has classification accuracy close to that of classic SVM, while its training is significantly faster than several other SVM classifiers.
© 2007 Elsevier B.V. All rights reserved.

Keywords: Support vector machine; Classification; Large data sets

* Corresponding author. Tel.: +52 55 5061 3734; fax: +52 55 5747 7089.
E-mail address: yuw@ctrl.cinvestav.mx (W. Yu).

1. Introduction

There are a number of standard classification techniques in the literature, such as simple rule-based and nearest neighbor classifiers, Bayesian classifiers, artificial neural networks, decision trees, support vector machines (SVM), ensemble methods, etc. Among these techniques, SVM is one of the best known for its optimization-based formulation [10,20,29]. Recently, many new SVM classifiers have been reported. A geometric approach to SVM classification was given in [21]. A fuzzy neural network SVM classifier was studied in [19]. Despite its good theoretical foundations and generalization performance, SVM is not suitable for the classification of large data sets, since SVM needs to solve a quadratic programming (QP) problem in order to find a separating hyperplane, which causes intensive computational complexity.

Many researchers have tried to find possible methods to apply SVM classification to large data sets. Generally, these methods can be divided into two types: (1) modify the SVM algorithm so that it can be applied to large data sets, and (2) select representative training data from a large data set so that a normal SVM can handle them.

For the first type, a standard projected conjugate gradient (PCG) chunking algorithm can scale somewhere between linear and cubic in the training set size [9,16]. Sequential minimal optimization (SMO) is a fast method to train SVM [24,8]. Training SVM requires the solution of a QP optimization problem; SMO breaks this large QP problem into a series of smallest possible QP problems and is faster than PCG chunking. Dong et al. [11] introduced a parallel optimization step where block diagonal matrices are used to approximate the original kernel matrix so that SVM classification can be split into hundreds of subproblems. A recursive and computationally superior mechanism referred to as adaptive recursive partitioning was proposed in [17], where the data are recursively subdivided into smaller subsets. Genetic programming is able to deal with large data sets that do not fit in main memory [12]. Neural network techniques can also be applied to SVM to simplify the training process [15].

0925-2312/$ - see front matter © 2007 Elsevier B.V. All rights reserved.
doi:10.1016/j.neucom.2007.07.028

For the second type, clustering has been proved to be an effective method to collaborate with SVM on classifying large data sets, for example, hierarchical clustering [31,1], k-means clustering [5] and parallel clustering [8]. Clustering-based methods can reduce the computational burden of SVM; however, the clustering algorithms themselves are still complicated for large data sets. Rocchio bundling is a statistics-based data reduction method [26]. The Bayesian committee machine has also been reported to train SVM on large data sets, where the large data set is divided into m subsets of the same size and m models are derived from the individual sets [27]; however, it has a higher error rate than normal SVM and the sparse property does not hold.

In this paper, a new approach for reducing the training data set is proposed which uses minimum enclosing ball (MEB) clustering. MEB computes the smallest ball which contains all the points in a given set. It uses the core-set idea [18,3] to partition the input data set into several balls, called k-balls clustering. For normal clustering, the number of clusters has to be predefined, and determining the optimal number of clusters may involve more computational cost than the clustering itself. The method of this paper does not need the optimal number of clusters: we only need to partition the training data set and to extract support vectors with SMO. We then remove the balls whose centers are not support vectors. For the remaining balls we apply a de-clustering technique and classify the recovered data with SMO again, which gives the final support vectors. The experimental results show that the accuracy obtained by our approach is very close to that of the classic SVM methods, while the training time is significantly shorter. The proposed approach can therefore classify huge data sets with high accuracy.

2. MEB clustering algorithm

The MEB clustering proposed in this paper uses the concept of core-sets, which is defined as follows.

Definition 1. The ball with center c and radius r is denoted as B(c, r).

Definition 2. Given a set of points S = {x_1, ..., x_m} with x_i ∈ R^d, the MEB of S is the smallest ball that contains all balls and also all points in S; it is denoted as MEB(S).

Because it is very difficult to find the optimal ball MEB(S), we use an approximation method, which is defined as follows.

Definition 3. A (1+ε)-approximation of MEB(S) is a ball B(c, (1+ε)r), ε > 0, with r ≥ r_MEB(S) and S ⊂ B(c, (1+ε)r).

Definition 4. A set of points Q is a core-set of S if MEB(Q) = B(c, r) and S ⊂ B(c, (1+ε)r).

For the clustering problem there should be many balls in the data set S, so the definition of the (1+ε)-approximation of MEB(S) is modified as follows.

Definition 5. In clustering, a (1+ε)-approximation of MEB(S) is a set of k balls B_i (i = 1,...,k) containing S, i.e., S ⊂ B_1 ∪ B_2 ∪ ... ∪ B_k.

In other words, given ε > 0, a subset Q is said to be a (1+ε)-approximation of S for clustering if MEB(Q) = ∪_{i=1}^k B_i(c_i, (1+ε)r_i) and S ⊂ ∪_{i=1}^k B_i(c_i, (1+ε)r_i), i.e., Q is a (1+ε)-approximation with an expansion factor (1+ε).

Now we consider a finite set of elements X = {x_1, x_2, ..., x_n} in the p-dimensional Euclidian space, x_i = (x_{i1}, ..., x_{ip})^T ∈ R^p. First we randomly select the ball centers in the data set such that they cover the whole range of the data. The radius r of the balls is the most important parameter in MEB clustering, and how to choose this user-defined parameter is a trade-off. If r is too small, there will be many groups at the end, whose centers will be applied in the first stage SVM, and the data reduction is not good. Conversely, if r is too large, many objects that are not very similar may end up in the same cluster, and some information will be lost. In this paper, we use the following equation:

r_{k,j} = (k − 1 + rand) (x_{max,j} − x_{min,j}) / l,   k = 1,...,l,  j = 1,...,p,    (1)

where l is the number of balls, rand is a random number in (0, 1), x_{max,j} = max_i(x_{ij}), x_{min,j} = min_i(x_{ij}), i = 1,...,n, and n is the number of data. In order to simplify the algorithm, we use the same radius r for all balls,

r = √( Σ_{j=1}^p (x_{j,max} − x_{j,min})² ) / (2l).    (2)
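To make the initialization concrete, the following minimal Python sketch (ours, not the authors' code; the array X, the cluster count l and the function name init_meb are illustrative assumptions) draws the l ball centers with the randomized spread of Eq. (1) and computes the shared radius r of Eq. (2). We offset the centers by x_min so that they fall inside the data range, which is our reading of the requirement that the centers cover the whole range of the data.

```python
import numpy as np

def init_meb(X, l, rng=None):
    """Sketch of the initialization: l ball centers spread over the data
    range as in Eq. (1), and one shared radius r as in Eq. (2)."""
    rng = np.random.default_rng(0) if rng is None else rng
    n, p = X.shape
    x_min, x_max = X.min(axis=0), X.max(axis=0)
    k = np.arange(1, l + 1).reshape(-1, 1)            # k = 1, ..., l
    # Eq. (1): coordinate j of center k is (k - 1 + rand) * range_j / l,
    # shifted by x_min so the centers lie inside the data range.
    centers = x_min + (k - 1 + rng.random((l, p))) * (x_max - x_min) / l
    # Eq. (2): one radius shared by all balls.
    r = np.sqrt(np.sum((x_max - x_min) ** 2)) / (2 * l)
    return centers, r
```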
If a datum is included in more than one ball, it can be assigned to any of them. This does not affect the algorithm, because we only care about the centers of the balls.

In most cases, there is no obvious way to select the optimal number of clusters l. An estimate of l can be obtained from the data using cross-validation; for example, the v-fold cross-validation algorithm [4] can automatically determine the number of clusters in the data. For our algorithm, we can first guess the support vector number sv and then take l ≈ (2/3)sv.

We use Fig. 1 to explain the MEB clustering with l = 3. We check whether the three balls already include all the data; if not, we enlarge the radius to (1+ε)r. From Fig. 1 we see that A1, B1 and C1 are included in the new balls (dashed lines), but A2, B2 and C2 are still outside. We then enlarge ε until every datum in X is inside the balls, i.e.,

X ⊂ ∪_{i=1}^l B(c_i, (1+ε)r_i).

The MEB clustering algorithm for sectioning l balls is as follows:

Step 1: Use the random sampling method (1) to generate l ball centers C = {c_1, ..., c_l} and select the ball radius r as in (2).

Fig. 1. MEB clustering.

Fig. 2. SVM classification via MEB clustering: the training data set is reduced by MEB clustering to a selected data set; a first stage SVM classification yields support vectors; de-clustering recovers the data near the optimal hyperplane; a second stage SVM classification gives the optimal hyperplane.

Step 2: For each point x_i, calculate the distances

φ_k(x_i) = ||x_i − c_k||²,   k = 1,...,l,  i = 1,...,n,

and the distance φ(x_i) = max_k[φ_k(x_i)], which is used to check whether x_i lies inside a ball B(c_k, (1+ε)r_k) of some center.

Step 3: Complete the clustering; the clusters are B_k(c_k, (1+ε)r).

Step 4: If there exists φ(x_i) > (1+ε)r, i = 1,...,n, then increase ε as

ε = ε + r/Δ,

where Δ is the increasing step (we can use Δ = 10), and go to Step 2. Otherwise all data points are included in the balls; go to Step 3.
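A minimal Python sketch of this loop follows (ours, continuing the hypothetical helpers above; for the coverage test we use the distance from each point to its nearest center, which is our reading of the stopping condition in Step 4).

```python
import numpy as np

def meb_clustering(X, centers, r, delta=10):
    """Sketch of Steps 2-4: grow epsilon by r/delta until every point lies
    in some ball B(c_k, (1 + eps) r), then assign each point to one ball."""
    # Step 2: Euclidean distance from every point to every center, shape (n, l)
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    nearest = d.min(axis=1)                 # distance to the closest center
    eps = 0.0
    # Step 4: enlarge eps until all points are covered
    while np.any(nearest > (1 + eps) * r):
        eps += r / delta
    # Step 3: complete the clustering; a point covered by several balls may
    # go to any of them (here: the nearest), which the paper explicitly allows.
    labels = d.argmin(axis=1)
    return labels, eps
```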
3. SVM classification via MEB clustering

Let (X, Y) be the training patterns set,

X = {x_1, ..., x_n},  Y = {y_1, ..., y_n},  y_i = ±1,
x_i = (x_{i1}, ..., x_{ip})^T ∈ R^p.    (3)

The training task of SVM classification is to find the optimal hyperplane from the input X and the output Y which maximizes the margin between the classes. By the sparse property of SVM, the data which are not support vectors do not contribute to the optimal hyperplane. The input data which are far away from the decision hyperplane should therefore be eliminated, while the data which are possibly support vectors should be used.

Our SVM classification can be summarized in four steps, which are shown in Fig. 2: (1) data selection via MEB clustering, (2) the first stage SVM classification, (3) de-clustering, and (4) the second stage SVM classification. The following subsections give a detailed explanation of each step.

3.1. Data selection via MEB clustering

In order to use SVM, we first need to select data from the large data set as the input of the SVM. In our approach, we use MEB clustering as the data selection method. After MEB clustering, there are l balls with initial radius r and a (1+ε)-approximation of MEB(S).

The process of MEB clustering is to find l partitions (or clusters) O_i of X, i = 1,...,l, l < n, O_i ≠ ∅, ∪_{i=1}^l O_i = X. The obtained clusters can be classified into three types:
(1) clusters with only positive labeled data, denoted by O+, i.e., O+ = {∪O_i | y = +1};
(2) clusters with only negative labeled data, denoted by O−, i.e., O− = {∪O_i | y = −1};
(3) clusters with both positive and negative labeled data (or mix-labeled), denoted by Om, i.e., Om = {∪O_i | y = ±1}.

Fig. 3(a) illustrates the clusters after MEB, where the clusters with only red points are positive labeled (O+), the clusters with green points are negative labeled (O−), and clusters A and B are mix-labeled (Om). We select not only the centers of the clusters but also all the data of the mix-labeled clusters as training data for the first SVM classification stage. We denote the sets of the centers of the clusters in O+ and O− by C+ and C−, respectively, i.e.,

C+ = {∪C_i | y = +1}  (positive labeled centers),
C− = {∪C_i | y = −1}  (negative labeled centers).

The selected data which will be used in the first stage SVM classification are then the union of C+, C− and Om, i.e., C+ ∪ C− ∪ Om. In Fig. 3(b), the red points belong to C+ and the green points belong to C−. It is clear that the data in Fig. 3(b) are all cluster centers except the data in the mix-labeled clusters A and B.

Fig. 3. An example. (a) Data selection; (b) 1st stage SVM; (c) de-clustering; (d) 2nd stage SVM.

3.2. The first stage SVM classification

We consider binary classification. Let (X, Y) be the training patterns set,

X = {x_1, ..., x_n},  Y = {y_1, ..., y_n},
y_i = ±1,  x_i = (x_{i1}, ..., x_{ip})^T ∈ R^p.    (4)

The training task of SVM classification is to find the optimal hyperplane from the input X and the output Y which maximizes the margin between the classes, i.e., training the SVM amounts to solving the following QP (primal problem):

min_{w,b} J(w) = (1/2) w^T w + c Σ_{k=1}^n ξ_k    (5)
subject to:  y_k [w^T φ(x_k) + b] ≥ 1 − ξ_k,

where the ξ_k > 0, k = 1,...,n, are slack variables which tolerate mis-classifications, c > 0 is a constant, [w^T φ(x_k) + b] = 0 is the hyperplane, and φ(x_k) is a nonlinear function. The kernel which satisfies the Mercer condition [10] is K(x_k, x_i) = φ(x_k)^T φ(x_i). Eq. (5) is equivalent to the following QP, which is a dual problem with the Lagrange multipliers α_k ≥ 0:

max_α J(α) = −(1/2) Σ_{k,j=1}^n y_k y_j K(x_k, x_j) α_k α_j + Σ_{k=1}^n α_k    (6)
subject to:  Σ_{k=1}^n α_k y_k = 0,  0 ≤ α_k ≤ c.

Many solutions of (6) are zero, i.e., α_k = 0, so the solution vector is sparse and the sum is taken only over the non-zero α_k. The x_i which correspond to non-zero α_i are called support vectors (SV). Let V be the index set of the SV; then the optimal hyperplane is

Σ_{k∈V} α_k y_k K(x_k, x_j) + b = 0.    (7)

The resulting classifier is

y(x) = sign[ Σ_{k∈V} α_k y_k K(x_k, x) + b ],

where b is determined by the Kuhn–Tucker conditions.

SMO breaks the large QP problem into a series of smallest possible QP problems [24]. These small QP problems can be solved analytically, which avoids using a time-consuming numerical QP optimization as an inner loop. The memory required by SMO is linear in the training set size, which allows SMO to handle very large training sets [16]. A requirement in (6) is that Σ_i α_i y_i = 0; it is enforced throughout the iterations and implies that the smallest number of multipliers that can be optimized at each step is 2. At each step SMO chooses two elements α_i and α_j to jointly optimize and finds the optimal values for these two parameters while all the others are fixed. The choice of the two points is determined by a heuristic algorithm, while the optimization of the two multipliers is performed analytically. Experimentally the performance of SMO is very good, despite needing more iterations to converge.

Each iteration uses few operations, so that the algorithm exhibits an overall speedup. Besides convergence time, SMO has other important features: it does not need to store the kernel matrix in memory, and it is fairly easy to implement [24].

In the first stage SVM classification we use SVM classification with the SMO algorithm to get the decision hyperplane. Here, the training data set is C+ ∪ C− ∪ Om, which was obtained in the last subsection. Fig. 3(b) shows the results of the first stage SVM classification.
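As a stand-in for the SMO solver used by the authors, the sketch below trains the first-stage classifier with scikit-learn's SVC, which wraps LIBSVM (an SMO-type solver); the kernel and parameter values are placeholders, and stage1_training_set and the arrays X, y, labels, centers are the hypothetical names from the previous sketches.

```python
from sklearn.svm import SVC

# Stage 1 (sketch): train only on the reduced set C+ ∪ C- ∪ Om.
X1, y1 = stage1_training_set(X, y, labels, centers)
svm1 = SVC(kernel="rbf", C=10.0, gamma="scale")   # placeholder parameters
svm1.fit(X1, y1)
sv1 = svm1.support_vectors_                       # stage-1 support vectors
```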
3.3. De-clustering

We propose to recover data into the training data set by including the data of the clusters whose centers are support vectors of the first stage SVM; we call this process de-clustering. In this way, more original data near the hyperplane can be found. The de-clustering results of the support vectors are shown in Fig. 3(c). The de-clustering process not only overcomes the drawback that only a small part of the original data near the support vectors is trained, but also enlarges the training data set of the second stage SVM, which is good for improving the accuracy.

3.4. Classification of the reduced data: the second stage classification

Taking the recovered data as the new training data set, we use SVM classification with the SMO algorithm again to get the final decision hyperplane

Σ_{k∈V_2} y_k α_{2,k} K(x_k, x) + b_2 = 0,    (8)

where V_2 is the index set of the support vectors in the second stage. Generally, the hyperplane (7) is close to the hyperplane (8).

In the second stage SVM, we use the following two types of data as training data:
(1) the data of the clusters whose centers are support vectors, i.e., ∪_{C_i∈V}{O_i}, where V is the support vector set of the first stage SVM;
(2) the data of the mix-labeled clusters, i.e., Om.
Therefore, the training data set is ∪_{C_i∈V}{O_i} ∪ Om.

Fig. 3(d) illustrates the second stage SVM classification results. One can observe that the two hyperplanes in Fig. 3(b) and (c) are different but similar.
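Continuing the same hypothetical sketch, de-clustering and the second stage can be written as follows: every ball whose center turned out to be a stage-1 support vector is expanded back to its original points, the mix-labeled balls are kept, and the SVM is trained again on the recovered data to obtain the final hyperplane of Eq. (8).

```python
import numpy as np
from sklearn.svm import SVC

def stage2_training_set(X, y, labels, centers, svm1):
    """Sketch of Sections 3.3-3.4: de-cluster the balls whose centers are
    stage-1 support vectors and always keep the mix-labeled balls."""
    sv1 = svm1.support_vectors_
    keep = []
    for k in range(centers.shape[0]):
        idx = np.where(labels == k)[0]
        if idx.size == 0:
            continue
        mixed = np.unique(y[idx]).size > 1                         # Om ball
        center_is_sv = np.any(np.all(np.isclose(sv1, centers[k]), axis=1))
        if mixed or center_is_sv:
            keep.extend(idx)
    keep = np.asarray(keep, dtype=int)
    return X[keep], y[keep]

# Stage 2 (sketch): retrain on the recovered data to get the final hyperplane.
X2, y2 = stage2_training_set(X, y, labels, centers, svm1)
svm2 = SVC(kernel="rbf", C=10.0, gamma="scale").fit(X2, y2)
```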
4. Performance analysis

4.1. Memory space

In the first clustering step the total input data set

X = {x_1, ..., x_n},  Y = {y_1, ..., y_n},  y_i = ±1,  x_i = (x_{i1}, ..., x_{ip})^T ∈ R^p

is loaded into the memory. The data type is float, so each datum occupies 4 bytes. If we use normal SVM classification, the memory size for the input data should be 4(n·p)² because of the kernel matrix, while the size for the clustering data is 4(n·p). In the first stage SVM classification, the training data size is 4(l + m)²p², where l is the number of clusters and m is the number of elements in the mix-labeled clusters. In the second stage SVM classification, the training data size is 4(Σ_{i=1}^l n_i + m)²p², where n_i is the number of elements in the clusters whose centers are support vectors. The total storage space of the MEB approach is

4(n·p) + 4p²[ (Σ_{i=1}^l n_i + m)² + (l + m)² ].    (9)

When n is large (large data sets), n_i, m, l ≪ n, and the memory space (9) of our approach is much smaller than the 4p²n² needed by a normal SVM classification.
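As a rough illustration of Eq. (9), the snippet below plugs in made-up sizes; all the numbers are hypothetical and chosen only to show the order-of-magnitude gap with a normal SVM.

```python
# Hypothetical sizes: n points, p attributes, l clusters, m points in
# mix-labeled clusters, sum_ni points in balls whose centers are SVs.
n, p, l, m, sum_ni = 100_000, 20, 300, 2_000, 5_000
mem_normal_svm = 4 * (n * p) ** 2                                      # 4 (n p)^2
mem_two_stage = 4 * n * p + 4 * p**2 * ((sum_ni + m)**2 + (l + m)**2)  # Eq. (9)
print(mem_two_stage / mem_normal_svm)   # a tiny fraction of the normal SVM cost
```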
4.2. Algorithm complexity

It is clear that without a decomposition algorithm it is almost impossible for a normal SVM to obtain the optimal hyperplane when the training data size n is huge. It is difficult to analyze the complexity of the SVM algorithm precisely; its main operation involves the multiplication of matrices of size n, which has complexity O(n³).

The complexity of our algorithm can be calculated as follows. The complexity of the MEB clustering is O(n). The approximate complexity of the two SVM trainings is O[(l + m)³] + O[(Σ_{i=1}^l n_i + m)³]. The total complexity of the MEB approach is

O(n) + O[(l + m)³] + O[(Σ_{i=1}^l n_i + m)³],    (10)

where l is the total number of clusters, n_i is the number of elements in the ith cluster whose center is a support vector, and m is the number of elements in the mix-labeled clusters. Obviously (10) is much smaller than the complexity O(n³) of a normal SVM.

The above complexity grows linearly with respect to the training data size n. The choice of l is very important in order to obtain fast convergence: when n is large, the cost of each iteration will be high, while a smaller l needs more iterations and hence converges more slowly.

4.3. Training time

The training time of the approach proposed in this paper includes two parts: the clustering algorithm and the two SVMs. The training time of MEB clustering is

T_f = p·l·n·c_f,

where p is the number of times the (1+ε)-approximation is performed, l is the number of clusters, and c_f is the cost of evaluating the Euclidian distance.

The training time of SVM can be estimated as follows. We assume that the major computational cost comes from the multiplication operations (SMO), without considering the cost of other operations such as memory access. The growing rate of the probability of support vectors is assumed to be constant.

Let n_m(t) be the number of non-support vectors at time t. The probability of the number of support vectors at time t is F(t), which satisfies

F(t) = [l + m − n_m(t)] / (l + m),   n_m(t) = (l + m)[1 − F(t)],

where l is the number of cluster centers and m is the number of elements in the mix-labeled clusters. The growing rate of the number of support vectors (or decreasing rate of the number of non-support vectors) is

h(t) = −(d[n_m(t)]/dt) (1/n_m(t)) = F'(t) / {(l + m)[1 − F(t)]}.

Since the growing rate is constant, h(t) = λ, the solution of the following ODE

F'(t) = −λ(l + m)F(t) + λ(l + m)

with F(0) = 0 is

F(t) = 1 − e^{−λ(l+m)t}.

The support vector number of the first stage SVM at time t is n_sv1(t); it satisfies

n_sv1(t) = (l + m)F(t) = (l + m)(1 − e^{−λ(l+m)t}),   λ > 0,    (11)

and is monotonically increasing. The model (11) can be regarded as a growing model in reliability theory [14]. The support vector number of the second stage SVM at time t is n_sv2(t); it satisfies

n_sv2(t) = (l_1 + m)(1 − e^{−λ(l_1+m)t}),   λ > 0,

where

l_1 = Σ_{i=1}^l n_i + m.

We define the final support vector number in each cluster at the first stage SVM as h_i, i = 1,...,l. From (11) we know h_i = (l + m)(1 − e^{−λ(l+m)t_i}), so

t_i = [1/(λ(l + m))] ln[(l + m)/(l + m − h_i)],   i = 1,...,l.

We define c_1 as the cost of each multiplication operation for SMO. For each iteration step, the main cost is 4(l + m)c_1. The cost of the optimization at the first stage is

T_op^(1) = Σ_{i=1}^l 4(l + m)c_1 [1/(λ(l + m))] ln[(l + m)/(l + m − h_i)]
        = Σ_{i=1}^l (4/λ) c_1 ln[1 + h_i/(l + m − h_i)]
        ≤ Σ_{i=1}^l (4c_1/λ) h_i/(l + m − h_i).

In the second stage,

T_op^(2) = Σ_{i=1}^l 4(l_1 + m)c_1 [1/(λ(l_1 + m))] ln[(l_1 + m)/(l_1 + m − h_i)]
        ≤ Σ_{i=1}^l (4c_1/λ) h_i/(l_1 + m − h_i).

Another computing cost is the calculation of the kernels. We define c_2 as the cost of evaluating each element of K. In the first and second stages it is

T_ker^(1) = (l + m)c_2,   T_ker^(2) = (l_1 + m)c_2.

The total time of the three parts (clustering and the two SVM stages) is

T_2 ≤ p·l·n·c_f + (4/λ) Σ_{i=1}^l c_1 [ h_i/(l + m − h_i) + h_i/(l_1 + m − h_i) ].

5. Experimental results

In this section we use four examples to compare our algorithm with some other SVM classification methods. In order to clarify the basic idea of our approach, let us first consider a very simple case of classification and clustering.

Example 1. We generate a set of data randomly in the range (0, 40). The data set has two dimensions, X_i = (x_{i,1}, x_{i,2}). The output (label) is decided as follows:

y_i = +1 if W X_i + b > th;  y_i = −1 otherwise,    (12)

where W = [1.2, 2.3]^T, b = 10 and th = 95. In this way, the data set is linearly separable.
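A sketch of this generator (the sample size below is a placeholder; W, b and th are the values given in the text):

```python
import numpy as np

rng = np.random.default_rng(0)
W, b, th = np.array([1.2, 2.3]), 10.0, 95.0
X = rng.uniform(0.0, 40.0, size=(10_000, 2))   # two attributes in (0, 40)
y = np.where(X @ W + b > th, 1, -1)            # labeling rule of Eq. (12)
```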
Example 2. In this example, we use the benchmark data proposed in IJCNN 2001. It is a neural network competition data set, available in [25,7]. The data set has 49,990 training data points and 91,701 testing data points; each record has 22 attributes.

Example 3. Recently, SVM has been developed to detect RNA sequences [28]; however, a long training time is needed. The training data are at "http://www.ghastlyfop.com/blog/tag_index_svm.html/". To train the SVM classifier, a training set contains every possible sequence pairing. This resulted in 475,865 rRNA and 114,481 tRNA sequence pairs.
The input data were computed for every sequence pair in the resulting training set of 486,201 data points. Each record has eight attributes with continuous values between 0 and 1.

Example 4. Another data set of RNA sequences; the training data are available in [22]. The data set contains 2000 data points; each record has 84 attributes with continuous values between 0 and 1.

We will compare our two stages SVM via MEB clustering (named "SMO + MEB") with LIBSVM [7] (named "SMO") and simple SVM [10] (named "Simple SVM").

For Example 1, we generate 500,000 data randomly whose range and radius are the same as in [31]. The RBF kernel is chosen the same as in FCM clustering. Fig. 4(a) shows the running time vs the training data size, and Fig. 4(b) shows the testing accuracy vs the training data size. We can see that for small data sets LIBSVM has less training time and higher accuracy, and our algorithm does not have any advantage. But for large data sets, the training time is dramatically shortened in comparison with the other SVM implementations. Although the classification accuracy cannot be improved significantly, the testing accuracy is still acceptable.

Fig. 4. The running time vs training data size (a) and testing accuracies vs training data size (b).

For Example 2, we use 1000, 5000, 12,500, 25,000, 37,500 and 49,990 training data. For the RBF kernel

f(x, z) = exp( −(x − z)^T (x − z) / (2σ_ck²) ),    (13)

we choose σ_ck = r̄/5, where r̄ is the average of the radii of the clusters, r̄ = (1/l) Σ_{i=1}^l r_i. The comparison results are shown in Fig. 5, where (a) is the running time vs the training data size and (b) is the testing accuracy vs the training data size. We see that simple SVM has better classification accuracy than our approach; however, its training time is quite long since it works on the entire data set (close to 20,000 s), compared with our result (less than 400 s).
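For reference, Eq. (13) with the width chosen from the average cluster radius can be sketched as follows (the array radii, holding one radius per cluster, is a hypothetical name):

```python
import numpy as np

def rbf_kernel(x, z, sigma):
    """Eq. (13): f(x, z) = exp(-(x - z)^T (x - z) / (2 sigma^2))."""
    diff = np.asarray(x) - np.asarray(z)
    return np.exp(-np.dot(diff, diff) / (2.0 * sigma ** 2))

# sigma_ck = np.mean(radii) / 5.0   # width: one fifth of the average radius
```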

Fig. 5. Comparison with SMO and simple SVM.

For Example 3, the comparison results are shown in Table 1. For Example 4, the comparison results are shown in Table 2. We can see that for the two RNA sequences the accuracies are almost the same, but our training time is significantly shorter than that of LIBSVM.



Table 1
Comparison with the other SVM classifications (RNA sequence 1). # is the training data size, t the training time (s) and Acc the testing accuracy (%).

                MEB two stages                  LIBSVM                  SMO                       Simple SVM
        #        t       Acc    K        #        t       Acc     #       t        Acc      #       t       Acc

        500      45.04   79.6   300      500      0.23    84.88   500     26.125   85.6     500     2.563   85.38
        1000     103.6   82.5   300      1000     0.69    85.71   1000    267.19   87.5     1000    9.40    87.21
        5000     163.2   85.7   300      5000     10.28   86.40   5000    539.88   88.65
        23,605   236.9   88.5   300      23,605   276.9   87.57

Table 2
Comparison with the other SVM classifications (RNA sequence 2). # is the training data size, t the training time (s) and Acc the testing accuracy (%).

                MEB two stages                  LIBSVM                  SMO                       Simple SVM
        #        t       Acc    K        #        t       Acc     #       t        Acc      #       t       Acc

        2000     17.18   75.9   400      2000     8.71    73.15   2000    29.42    78.7     2000    27.35   59.15
        2000     7.81    71.7   100

6. Conclusion and discussion

In this paper, we proposed a new classification method for large data sets which takes advantage of the minimum enclosing ball and the support vector machine (SVM). Our two stages SVM classification has the following advantages compared with other SVM classifiers:

1. It can be as fast as possible depending on the accuracy requirement.
2. The training data size is smaller than that of some other SVM approaches, although we need two classifications.
3. The classification accuracy does not decrease, because the second stage SVM training data are all effective.

References

[1] M. Awad, L. Khan, F. Bastani, I.L. Yen, An effective support vector machine (SVMs) performance using hierarchical clustering, in: Proceedings of the 16th IEEE International Conference on Tools with Artificial Intelligence (ICTAI'04), 2004, pp. 663–667.
[3] M. Badoiu, S. Har-Peled, P. Indyk, Approximate clustering via core-sets, in: Proceedings of the 34th Symposium on Theory of Computing, 2002.
[4] P. Burman, A comparative study of ordinary cross-validation, v-fold cross-validation and the repeated learning-testing methods, Biometrika 76 (3) (1989) 503–514.
[5] J. Cervantes, X. Li, W. Yu, Support vector machine classification based on fuzzy clustering for large data sets, in: MICAI 2006: Advances in Artificial Intelligence, Lecture Notes in Computer Science (LNCS), vol. 4293, Springer, Berlin, 2006, pp. 572–582.
[7] C.-C. Chang, C.-J. Lin, LIBSVM: a library for support vector machines, <http://www.csie.ntu.edu.tw/~cjlin/libsvm>, 2001.
[8] P.-H. Chen, R.-E. Fan, C.-J. Lin, A study on SMO-type decomposition methods for support vector machines, IEEE Trans. Neural Networks 17 (4) (2006) 893–908.
[9] R. Collobert, S. Bengio, SVMTorch: support vector machines for large regression problems, J. Mach. Learn. Res. 1 (2001) 143–160.
[10] N. Cristianini, J. Shawe-Taylor, An Introduction to Support Vector Machines and Other Kernel-based Learning Methods, Cambridge University Press, Cambridge, 2000.
[11] J.-X. Dong, A. Krzyzak, C.Y. Suen, Fast SVM training algorithm with decomposition on very large data sets, IEEE Trans. Pattern Anal. Mach. Intell. 27 (4) (2005) 603–618.
[12] G. Folino, C. Pizzuti, G. Spezzano, GP ensembles for large-scale data classification, IEEE Trans. Evol. Comput. 10 (5) (2006) 604–616.
[14] B.V. Gnedenko, Y.K. Belyayev, A.D. Solovyev, Mathematical Methods of Reliability Theory, Academic Press, New York, 1969.
[15] G.B. Huang, K.Z. Mao, C.K. Siew, D.-S. Huang, Fast modular network implementation for support vector machines, IEEE Trans. Neural Networks, 2006.
[16] T. Joachims, Making large-scale support vector machine learning practical, in: Advances in Kernel Methods: Support Vector Machines, MIT Press, Cambridge, MA, 1998.
[17] S.-W. Kim, B.J. Oommen, Enhancing prototype reduction schemes with recursion: a method applicable for large data sets, IEEE Trans. Syst. Man Cybern. B 34 (3) (2004) 1184–1397.
[18] P. Kumar, J.S.B. Mitchell, A. Yildirim, Approximate minimum enclosing balls in high dimensions using core-sets, ACM J. Exp. Algorithmics 8 (2003).
[19] C.-T. Lin, C.-M. Yeh, S.-F. Liang, J.-F. Chung, N. Kumar, Support-vector-based fuzzy neural network for pattern classification, IEEE Trans. Fuzzy Syst. 14 (1) (2006) 31–41.
[20] O.L. Mangasarian, D.R. Musicant, Successive overrelaxation for support vector machines, IEEE Trans. Neural Networks 10 (1999) 1032–1037.
[21] M.E. Mavroforakis, S. Theodoridis, A geometric approach to support vector machine (SVM) classification, IEEE Trans. Neural Networks 17 (3) (2006) 671–682.
[22] W.S. Noble, S. Kuehn, R. Thurman, M. Yu, J. Stamatoyannopoulos, Predicting the in vivo signature of human gene regulatory sequences, Bioinformatics 21 (1) (2005) i338–i343.
[24] J. Platt, Fast training of support vector machines using sequential minimal optimization, in: Advances in Kernel Methods: Support Vector Machines, MIT Press, Cambridge, MA, 1998.
[25] D. Prokhorov, IJCNN 2001 neural network competition, Ford Research Laboratory, <http://www.geocities.com/ijcnn/nnc_ijcnn01.pdf>, 2001.
[26] L. Shih, D.M. Rennie, Y. Chang, D.R. Karger, Text bundling: statistics-based data reduction, in: Proceedings of the 20th International Conference on Machine Learning (ICML-2003), Washington, DC, 2003.

[27] V. Tresp, A Bayesian committee machine, Neural Comput. 12 (11) (2000) 2719–2741.
[28] A.V. Uzilov, J.M. Keegan, D.H. Mathews, Detection of non-coding RNAs on the basis of predicted secondary structure formation free energy change, BMC Bioinformatics 7 (173) (2006).
[29] Y.S. Xia, J. Wang, A one-layer recurrent neural network for support vector machine learning, IEEE Trans. Syst. Man Cybern. B (2004) 1261–1269.
[31] H. Yu, J. Yang, J. Han, Classifying large data sets using SVMs with hierarchical clusters, in: Proceedings of the 9th ACM SIGKDD 2003, Washington, DC, USA, 2003.

Jair Cervantes received the B.S. degree in Mechanical Engineering from Orizaba Technologic Institute, Veracruz, Mexico, in 2001 and the M.S. degree in Automatic Control from CINVESTAV-IPN, México, in 2005. He is currently pursuing the Ph.D. degree in the Department of Computing, CINVESTAV-IPN. His research interests include support vector machines, pattern classification, neural networks, fuzzy logic and clustering.

Xiaoou Li received her B.S. and Ph.D. degrees in Applied Mathematics and Electrical Engineering from Northeastern University, China, in 1991 and 1995. From 1995 to 1997, she was a lecturer of Electrical Engineering at the Department of Automatic Control of Northeastern University, China. From 1998 to 1999, she was an associate professor of Computer Science at the Centro de Instrumentos, Universidad Nacional Autónoma de México (UNAM), México. Since 2000, she has been a professor of the Departamento de Computación, Centro de Investigación y de Estudios Avanzados del Instituto Politécnico Nacional (CINVESTAV-IPN), México. From September 2006 to August 2007, she was a visiting professor in the School of Electronics, Electrical Engineering and Computer Science, the Queen's University of Belfast, UK. Her research interests include Petri net theory and application, neural networks, knowledge-based systems, and data mining.

Wen Yu received the B.S. degree from Tsinghua University, Beijing, China in 1990 and the M.S. and Ph.D. degrees, both in Electrical Engineering, from Northeastern University, Shenyang, China, in 1992 and 1995, respectively. From 1995 to 1996, he served as a lecturer in the Department of Automatic Control at Northeastern University, Shenyang, China. In 1996, he joined CINVESTAV-IPN, México, where he is a professor in the Departamento de Control Automático. He held a research position with the Instituto Mexicano del Petróleo from December 2002 to November 2003. He was a visiting senior research fellow at Queen's University Belfast from October 2006 to December 2006, and is also a visiting professor of Northeastern University in China from 2006 to 2008. He is currently an associate editor of Neurocomputing and of the International Journal of Modelling, Identification and Control. He is a senior member of IEEE. His research interests include adaptive control, neural networks, and fuzzy control.

Kang Li is a lecturer in intelligent systems and control, Queen's University Belfast. He received the B.Sc. (Xiangtan) in 1989, M.Sc. (HIT) in 1992 and Ph.D. (Shanghai Jiaotong) in 1995. He held various research positions at Shanghai Jiaotong University (1995–1996), Delft University of Technology (1997), and Queen's University Belfast (1998–2002). His research interests cover non-linear system modelling and identification, neural networks, genetic algorithms, process control, and human supervisory control. Dr. Li is a Chartered Engineer and a member of the IEEE and the InstMC.
