
SN Computer Science (2020) 1:2

https://doi.org/10.1007/s42979-019-0006-z

REVIEW ARTICLE

Automatic Learning Algorithms for Local Support Vector Machines


Thanh‑Nghi Do 1,2

Received: 11 April 2019 / Accepted: 13 June 2019 / Published online: 25 June 2019
© Springer Nature Singapore Pte Ltd 2019

Abstract
In this paper, we propose automatic learning algorithms (called autokSVM-1p and autokSVM-dp) that automatically tune the hyper-parameters of k local support vector machines (SVM) for classifying large data sets. The autokSVM algorithms determine the number of clusters k used to partition the large training data, after which they auto-learn a non-linear SVM model in each cluster to classify the data locally, in parallel, on multi-core computers. The autokSVM algorithms combine grid search, the .632 bootstrap estimator, and a hill climbing heuristic to optimize the hyper-parameters of the local non-linear SVM training. Numerical test results on four data sets from the UCI repository and three benchmarks of handwritten character recognition show that our autokSVM algorithms give very competitive classification results compared to the standard LibSVM and the original kSVM. As an example of autokSVM-1p's effectiveness, it reaches an accuracy of 96.74% on the Forest cover-type data set (581,012 datapoints in a 54-dimensional input space, 7 classes) in 334.45 s using a PC with an Intel(R) Core i7-4790 CPU (3.6 GHz, 4 cores).

Keywords Support vector machines · Large data sets · Local support vector machines · Automatic hyper-parameter tuning

This article is part of the topical collection "Future Data and Security Engineering" guest edited by Tran Khanh Dang.

* Thanh-Nghi Do
  dtnghi@cit.ctu.edu.vn

1 College of Information Technology, Can Tho University, 92100 Cantho, Vietnam
2 UMI UMMISCO 209 (IRD/UPMC), Bondy Cedex, France

Introduction

Support vector machines (SVM) proposed by Vapnik [1] are a state-of-the-art classification technique applied to many pattern recognition problems. Successful applications of SVMs include facial recognition, text categorization, and bioinformatics [2]. In spite of these prominent properties, there are two challenging problems in training an SVM model. The first is that the computational cost of an SVM approach is at least quadratic in the number of training datapoints [3], making SVM impractical for massive data sets. The second is that the hyper-parameters of SVM learning must be tuned by hand to obtain a good model, which is intractable for non-experts.

This paper presents an extension of our previous work [4]. The investigation develops the recent kSVM algorithm proposed in [5, 6] for speeding up the non-linear classification of large data sets into the autokSVM algorithms, which search the hyper-parameters automatically. Instead of building a global SVM model, as done by the classical algorithm, which has great difficulty dealing with large data sets, the autokSVM algorithms determine the number of clusters k (thanks to the performance analysis in [7–9]) used to partition the training data, after which they learn a non-linear SVM in each cluster to classify the data locally, in parallel, on multi-core computers. The autokSVM algorithms combine grid search (as proposed in [10–14]), the .632 bootstrap estimator [15], and a hill climbing heuristic to optimize the hyper-parameters in the local SVM training. autokSVM-dp trains heterogeneous kernels for the k local SVM models, automatically searching the hyper-parameters of every local SVM. autokSVM-1p automatically tunes the hyper-parameters on the biggest data partition to select the best hyper-parameters, after which it uses these best hyper-parameters to train the k local SVM models. Numerical test results on four data sets from the UCI repository [16] and three benchmarks of handwritten character recognition, USPS [17], MNIST [18], and a new benchmark [19], show that our proposed autokSVM algorithms achieve good off-the-shelf classification performance compared to the standard LibSVM [20] and the original kSVM [5, 6] in terms of training time and classification accuracy. Without any requirement of hyper-parameter tuning, autokSVM-1p can classify the Forest cover-type data


set, which has 581,012 datapoints in a 54-dimensional input space and 7 classes, with an accuracy of 96.74% in 334.45 s using a PC with an Intel(R) Core i7-4790 CPU (3.6 GHz, 4 cores).

The paper is organized as follows. Section "Training Algorithm of k Local Support Vector Machine Models" briefly introduces the kSVM algorithm. Section "Automatic Learning Algorithms for k Local Support Vector Machine Models (autokSVMs)" illustrates our proposed autokSVM algorithms for automatically tuning the hyper-parameters of local SVM models in the classification of large data sets. Section "Numerical Test Results" shows the experimental results. The next section discusses related works. We then conclude in the "Conclusion and Future Works" section.

Training Algorithm of k Local Support Vector Machine Models

Support Vector Machines

Let us consider a binary classification task for a given data set with m datapoints x_i (i = 1, ..., m) in the n-dimensional input space R^n, having corresponding labels y_i = ±1. The SVM algorithms [1] try to find the separating plane furthest from both class +1 and class −1. As in the main idea of SVM classification depicted in Fig. 1, they simultaneously maximize the distance (the margin) between the supporting planes of each class (x·w − b = +1 for class +1, x·w − b = −1 for class −1) and minimize the errors (any point x_i falling on the wrong side of its supporting plane is considered an error, denoted by z_i). SVM training algorithms pursue these goals via the quadratic program (1):

$$\min_{\alpha}\ \frac{1}{2}\sum_{i=1}^{m}\sum_{j=1}^{m} y_i y_j \alpha_i \alpha_j K\langle x_i, x_j\rangle - \sum_{i=1}^{m}\alpha_i
\quad\text{s.t.}\quad \sum_{i=1}^{m} y_i\alpha_i = 0,\ \ 0 \le \alpha_i \le C\ \ \forall i = 1, 2, \ldots, m, \tag{1}$$

where the α_i are Lagrange multipliers, C is the positive constant used to tune the margin size and the error, and the linear kernel function is K⟨x_i, x_j⟩ = ⟨x_i · x_j⟩ (i.e., the inner product of x_i and x_j).

The support vectors (those for which α_i > 0) are given by the solution of the quadratic program (1), and then the separating surface and the scalar b are determined by the support vectors. The classification of a new datapoint x based on the SVM model is as follows:

$$predict(x, \text{SVM model}) = \operatorname{sign}\left(\sum_{i=1}^{\#SV} y_i \alpha_i K\langle x, x_i\rangle - b\right). \tag{2}$$

Fig. 1 Classification of the datapoints into two classes

SVM algorithms can also be used for non-linear classification tasks [21]. No algorithmic change is required other than substituting another kernel function evaluation for the usual linear inner product K⟨x_i, x_j⟩ = ⟨x_i · x_j⟩. According to [22], the RBF (Radial Basis Function) kernel is the most general, as follows:

$$K\langle x_i, x_j\rangle = e^{-\gamma\|x_i - x_j\|^2}. \tag{3}$$

SVMs are accurate models for facial recognition, text categorization, and bioinformatics [2]. Nevertheless, the study in [3] illustrated that the computational cost of the SVM solutions of (1) is at least O(m²) (where m is the number of training datapoints), making SVM impractical for massive data sets. Furthermore, the hyper-parameters of SVM learning must be tuned by hand to obtain a good model, which is intractable for non-experts.

Learning Algorithm for k Local Support Vector Machine Models (kSVM)

Instead of learning a global SVM model, as done by the classical algorithm, which has great difficulty dealing with large data sets, the kSVM algorithm proposed in [5, 6] performs the training task in two main steps, as described in Fig. 2. The first step uses the kmeans algorithm [23] to partition the full data set D into k clusters {D_1, D_2, ..., D_k}; it is then easy to learn a non-linear SVM in each cluster to classify the data locally, in parallel, on multi-core computers (based on the shared-memory multiprocessing programming model OpenMP [24]). The parallel training algorithm for the k local SVMs is described in Algorithm 1. The example illustrated in Fig. 3 shows the comparison between a global SVM model (left part) and 3 local SVM models (right part), using an RBF kernel function with γ = 10 and a positive constant C = 10^6. The hyper-parameter γ of the kernel function and the positive constant C are denoted by θ = {γ, C}.

The class of a new datapoint x is predicted by the local SVM model lsvm_NN obtained on the cluster D_NN whose center c_NN is nearest to x.
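For concreteness, here is a minimal sketch (ours, not the authors' released code) of how the two kSVM steps might be written on top of the LibSVM C API [20] and OpenMP [24]. The k-means partitioning and data loading are omitted, and the distance computation assumes a dense feature representation, as noted in the comments.

```c
/* Sketch of the parallel kSVM training step (Algorithm 1) and the
 * nearest-cluster prediction rule, assuming LibSVM (svm.h) and OpenMP.
 * The per-cluster svm_problem structs are assumed to come from a k-means
 * partitioning step that is not shown here. */
#include <stdlib.h>
#include <math.h>
#include <omp.h>
#include <svm.h>

struct svm_model **train_ksvm(struct svm_problem *clusters, int k,
                              double gamma, double C)
{
    struct svm_model **models = malloc(k * sizeof(struct svm_model *));

    struct svm_parameter param = {0};
    param.svm_type    = C_SVC;
    param.kernel_type = RBF;     /* K(xi, xj) = exp(-gamma * ||xi - xj||^2) */
    param.gamma       = gamma;
    param.C           = C;
    param.cache_size  = 100;     /* kernel cache, MB */
    param.eps         = 1e-3;    /* stopping tolerance */

    /* The k local SVMs are independent, so they train in parallel on the
     * available cores (shared-memory model, OpenMP). */
    #pragma omp parallel for schedule(dynamic)
    for (int i = 0; i < k; i++)
        models[i] = svm_train(&clusters[i], &param);

    return models;
}

/* Prediction: route x to the cluster D_NN whose center c_NN is nearest,
 * then ask that cluster's local model lsvm_NN. Assumes x is stored densely:
 * x[j].index == j+1 for j = 0..n-1, terminated by index -1. */
double predict_ksvm(const struct svm_node *x, double **centers, int k, int n,
                    struct svm_model **models)
{
    int nn = 0;
    double best = INFINITY;
    for (int i = 0; i < k; i++) {
        double d = 0.0;
        for (int j = 0; j < n; j++) {
            double diff = x[j].value - centers[i][j];
            d += diff * diff;
        }
        if (d < best) { best = d; nn = i; }
    }
    return svm_predict(models[nn], x);
}
```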


Fig. 2 Training algorithm for k local SVMs (kSVM)

Fig. 3 Global SVM model (left part) versus local SVM models (right part)

Automatic Learning Algorithms for k Local Support Vector Machine Models (autokSVMs)

Although the kSVM reduces the computational cost of an SVM training task, its hyper-parameters must still be tuned by hand to obtain a good model. This is hard for non-experts.

A training task of the kSVM described in Fig. 2 and Algorithm 1 requires three parameters: the number of local models k and the hyper-parameters θ = {γ, C}, namely the parameter γ of the kernel function and the cost constant C (the trade-off between the margin size and the errors of the SVMs).
As illustrated in [5, 6], the complexity of training k local SVM models in parallel on a P-core processor is O((k/P)(m/k)²) = O(m²/(kP)). The kSVM is thus kP times faster than building a global SVM model (whose complexity is at least O(m²)). The studies in [5–9] show that the parameter k gives a trade-off between the generalization capacity (the classification correctness) and the computational cost (the training time) (Fig. 4).

Furthermore, the RBF kernel function used in the SVM models is a general and efficient function, as illustrated in [22]. Figure 5 shows the different flexible separation forms obtained according to the hyper-parameter γ used in an SVM classification task. Therefore, we propose to use the RBF kernel function (the RBF kernel of two datapoints x_i and x_j being K[i, j] = exp(−γ‖x_i − x_j‖²)) in the learning algorithm kSVM.
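As a quick, illustrative sanity check of this speed-up (our numbers, chosen only to match the Forest cover-type training-set size, not a measurement from the paper): with m = 400,000 datapoints, k = 800 clusters of about 500 points each, and P = 4 cores,

$$\frac{k}{P}\left(\frac{m}{k}\right)^{2} = \frac{800}{4}\cdot 500^{2} = 5\times 10^{7}
\quad\text{versus}\quad m^{2} = 1.6\times 10^{11},$$

i.e., a speed-up factor of kP = 3200 over the global model.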


Fig. 4 Automatic training algorithms of k local SVMs (autokSVMs)

Fig. 5 RBF kernel with γ = 10⁻⁵ (left part) versus γ = 1 (right part)

Our investigation thus aims to develop the autokSVM algorithms, which automatically search these hyper-parameters k and θ = {γ, C} for the learning task of kSVM.

Setting the Parameter k

According to the performance analysis in terms of algorithmic complexity and generalization capacity studied in [5–9], the role of the parameter k in kSVM is a trade-off between the generalization capacity and the computational complexity. If k is large, then the kSVM significantly reduces the training time (the speed-up factor of the local SVMs over the global one is k), but the size of each cluster is small and the locality is extreme, with a very low generalization capacity. When k → m, the kSVM approaches the 1-nearest-neighbor classifier. If k is small, then the kSVM reduces the training time only insignificantly; however, the size of each cluster is large, which improves the generalization capacity. When k → 1, the kSVM approximates the global model.

This leads us to set k so that the cluster size is large enough (e.g., 200, as proposed by [7, 8]). The empirical results in [5, 6, 25–27] show that a cluster size of about 500 works well. Therefore, the autokSVM algorithms guarantee a cluster size of about 500 by setting k = m/500.

Automatic Tuning of the Hyper-parameters θ = {γ, C} for Local SVMs Using the RBF Kernel Function

What remains is to find the best hyper-parameters θ = {γ, C} for the local SVM models using the RBF kernel. There are two ways to automatically train the local SVM models, sketched in the code fragment after this list:

– the first proposal is to train heterogeneous kernels for the local SVM models, meaning that the autokSVM has to tune, i.e., find, the best kernel parameter γ_i and cost constant C_i for every cluster D_i;
– the second strategy is to train a homogeneous kernel for the k local SVMs, in which the autokSVM has to tune to find the best kernel parameter γ_l and cost constant C_l for the biggest cluster D_l, and then uses these best parameters to train the k local SVM models.

In both strategies, the autokSVMs need to search the hyper-parameter γ of the RBF kernel and the cost constant C for learning the local SVM model from the data partition D_i, so as to obtain the best classification correctness.
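As a rough illustration of how the two strategies differ in control flow, here is a minimal C sketch; tune_theta() stands for the grid search with the .632 bootstrap estimator described in the next subsection, and train_local_svm() wraps the RBF SVM training. Both names and signatures are our assumptions, not identifiers from the released implementation.

```c
/* Sketch of the two autokSVM tuning strategies. */
#include <svm.h>

struct theta { double gamma; double C; };

/* Assumed helpers (signatures are ours): tune_theta() runs the grid search
 * with the .632 bootstrap estimator and hill climbing described in the next
 * subsection; train_local_svm() wraps svm_train() with an RBF parameter set. */
struct theta tune_theta(const struct svm_problem *partition);
struct svm_model *train_local_svm(const struct svm_problem *partition,
                                  double gamma, double C);

/* autokSVM-dp: heterogeneous kernels -- tune (gamma_i, C_i) per cluster D_i. */
void auto_ksvm_dp(struct svm_problem *clusters, int k,
                  struct svm_model **models)
{
    for (int i = 0; i < k; i++) {
        struct theta t = tune_theta(&clusters[i]);   /* per-cluster search */
        models[i] = train_local_svm(&clusters[i], t.gamma, t.C);
    }
}

/* autokSVM-1p: homogeneous kernel -- tune once on the biggest cluster D_l,
 * then reuse (gamma_l, C_l) for all k local models. */
void auto_ksvm_1p(struct svm_problem *clusters, int k,
                  struct svm_model **models)
{
    int l = 0;                                       /* index of D_l */
    for (int i = 1; i < k; i++)
        if (clusters[i].l > clusters[l].l)
            l = i;

    struct theta t = tune_theta(&clusters[l]);       /* single search */
    for (int i = 0; i < k; i++)
        models[i] = train_local_svm(&clusters[i], t.gamma, t.C);
}
```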


Fig. 6 Automatic tuning of the hyper-parameters θ = {γ, C} for the local SVM lSVM_i learned from the data partition D_i

Figure 6 presents the process of automatically tuning the hyper-parameters θ = {γ, C} for training the local SVM model lSVM_i on the data partition D_i. The cost constant C is chosen in G = {1, 10¹, 10², 10³, 10⁴, 10⁵, 10⁶} and the hyper-parameter γ of the RBF kernel is tried among P = {10⁻⁵, 5·10⁻⁵, 10⁻⁴, 5·10⁻⁴, 10⁻³, 5·10⁻³, 10⁻², 5·10⁻², 10⁻¹, 5·10⁻¹, 1, 5, 10, 50}. Furthermore, a local model is less complex than the global one (cf. Fig. 3). The autokSVMs therefore start with the smallest values of the hyper-parameter γ and the cost constant C in the grid, and stop when the classification correctness cannot be improved in the two next trials (the hill climbing heuristic).

We propose to use the .632 bootstrap estimator [15] in the autokSVM algorithms to assess the classification correctness while tuning the hyper-parameter γ and the cost constant C. The .632 bootstrap estimator achieves the same empirical results as the tenfold cross-validation protocol [28], at a cheaper computational cost.

The main idea of automatically tuning the hyper-parameters with the .632 bootstrap estimator in the autokSVM algorithms is as follows. Given a data partition D_i with m_i datapoints, we randomly draw, with replacement, a bootstrap sample B_i of size m_i from D_i. The bootstrap sample B_i includes about 63.2% of the distinct datapoints of the original cluster D_i. The autokSVM algorithms use the bootstrap sample B_i as the learning data set to train the classification model M_i for the data partition D_i. The out-of-bag sample OOB_i consists of the datapoints of D_i that are not in the bootstrap sample B_i. The out-of-bag sample OOB_i, representing about 36.8% of the distinct datapoints of the original data partition D_i, is used as the validation data set to evaluate the classification model M_i. The .632 bootstrap estimator then assesses the error (denoted by Êrr) of the classification model M_i based on the bootstrap sample B_i and the out-of-bag sample OOB_i as follows:

$$\widehat{Err}^{.632} = 0.632 \times \widehat{Err}(OOB_i, M_i) + 0.368 \times \widehat{Err}(B_i, M_i). \tag{4}$$

Figure 6 presents the procedure for automatically searching the hyper-parameters while learning the local SVM model lSVM_i from the data partition D_i. This strategy is used in both autokSVM algorithms.

autokSVM-dp in Algorithm 2 is the autokSVM learning algorithm for k local SVM models using heterogeneous kernel parameters γ and cost constants C.

autokSVM-1p in Algorithm 3 describes the training algorithm for k local SVM models using a homogeneous kernel parameter γ and cost constant C automatically tuned on the biggest data partition D_l.

Figure 7 shows an example of the automatic classification done by autokSVM-dp.
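A compact sketch of the tuning procedure of Fig. 6 is given below (our illustrative reconstruction, not the authors' code). The 63.2%/36.8% split in Eq. (4) comes from the fact that a bootstrap sample of size m_i leaves each point out with probability (1 − 1/m_i)^{m_i} ≈ e⁻¹ ≈ 0.368. The bootstrap_split() and error_rate() helpers are assumed, and the comment notes one plausible reading of the hill-climbing stopping rule.

```c
/* Illustrative reconstruction of the tuning procedure of Fig. 6: grid search
 * over (gamma, C) scored by the .632 bootstrap estimator of Eq. (4), with a
 * hill-climbing stop after two trials without improvement (one plausible
 * reading of the paper's stopping rule). bootstrap_split() (draw B_i with
 * replacement, collect OOB_i) and error_rate() (fraction of misclassified
 * points under a trained model) are assumed helpers. struct theta repeats
 * the earlier sketch's definition so this fragment stands alone. */
#include <svm.h>

static const double C_GRID[] = {1, 1e1, 1e2, 1e3, 1e4, 1e5, 1e6};
static const double G_GRID[] = {1e-5, 5e-5, 1e-4, 5e-4, 1e-3, 5e-3,
                                1e-2, 5e-2, 1e-1, 5e-1, 1, 5, 10, 50};

void bootstrap_split(const struct svm_problem *Di,
                     struct svm_problem *Bi, struct svm_problem *OOBi);
double error_rate(const struct svm_model *M, const struct svm_problem *val);

struct theta { double gamma; double C; };

struct theta tune_theta(const struct svm_problem *Di)
{
    struct svm_problem Bi, OOBi;
    bootstrap_split(Di, &Bi, &OOBi);

    struct theta best = {G_GRID[0], C_GRID[0]};
    double best_err = 1.0;
    int ng = sizeof(G_GRID) / sizeof(*G_GRID);
    int nc = sizeof(C_GRID) / sizeof(*C_GRID);

    for (int c = 0; c < nc; c++) {
        int no_gain = 0;               /* hill climbing: abandon the gamma   */
        for (int g = 0; g < ng; g++) { /* sweep after two trials with no gain */
            struct svm_parameter p = {0};
            p.svm_type = C_SVC; p.kernel_type = RBF;
            p.gamma = G_GRID[g]; p.C = C_GRID[c];
            p.cache_size = 100; p.eps = 1e-3;

            struct svm_model *Mi = svm_train(&Bi, &p);
            /* Eq. (4): .632 bootstrap estimate of the error of M_i */
            double err = 0.632 * error_rate(Mi, &OOBi)
                       + 0.368 * error_rate(Mi, &Bi);
            svm_free_and_destroy_model(&Mi);

            if (err < best_err) {
                best_err = err;
                best.gamma = p.gamma;
                best.C = p.C;
                no_gain = 0;
            } else if (++no_gain >= 2) {
                break;
            }
        }
    }
    return best;
}
```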


Fig. 7 Example of the automatic classification given by the autokSVM-dp

Numerical Test Results

We are interested in the performance evaluation of our proposed autokSVM algorithms for automatic hyper-parameter tuning in classification problems. We have implemented the autokSVM algorithms in C/C++ with OpenMP [24], using the highly efficient standard SVM library LibSVM [20]. The first proposal, denoted autokSVM-dp, trains heterogeneous kernels for the k local SVM models. The second strategy, denoted autokSVM-1p, trains a homogeneous kernel parameter γ and cost constant C for the k local SVM models, selected in the biggest cluster D_l.

Our evaluation of the classification performance is reported in terms of correctness and training time. The comparative study includes classification results obtained by LibSVM, kSVM, autokSVM-dp, and autokSVM-1p. All experiments are run on a machine with Linux Fedora 20, an Intel(R) Core i7-4790 CPU (3.6 GHz, 4 cores), and 32 GB of main memory.


Table 1 Description of data sets

ID | Data set | Individuals | Attributes | Classes | Evaluation protocol
1 | Opt. rec. of handwritten digits | 5620 | 64 | 10 | 3832 Trn – 1797 Tst
2 | Letter | 20,000 | 16 | 26 | 13,334 Trn – 6666 Tst
3 | Isolet | 7797 | 617 | 26 | 6238 Trn – 1559 Tst
4 | USPS handwritten digit | 9298 | 256 | 10 | 7291 Trn – 2007 Tst
5 | A new bench. hand. char. rec. | 40,133 | 3136 | 36 | 36,000 Trn – 4133 Tst
6 | MNIST | 70,000 | 784 | 10 | 60,000 Trn – 10,000 Tst
7 | Forest cover types | 581,012 | 54 | 7 | 400,000 Trn – 181,012 Tst

Data Sets

Experiments are conducted on four data sets from the UCI repository [16] and three benchmarks of handwritten letter recognition: USPS [17], MNIST [18], and a new benchmark for handwritten character recognition [19]. Table 1 presents the description of the data sets; its last column presents the evaluation protocols. The data sets are already divided into a training set (Trn) and a testing set (Tst). We used the training data to build the SVM models and report classification results obtained on the testing set using the resulting models.

Tuning Hyper-parameters for LibSVM, kSVM

We propose to use the RBF kernel type in the SVM models because it is general and efficient [22]. We also tried to tune the hyper-parameter γ of the RBF kernel and the cost constant C (the trade-off between the margin size and the errors) to obtain the best accuracy. Furthermore, the kSVM uses the parameter k of local models (the number of clusters). The cross-validation protocol [28, 29] is used to tune these hyper-parameters. The optimal parameters in Table 2 give the highest correctness on the data sets.

Table 2 Hyper-parameters for LibSVM and kSVM

ID | Data set | γ | C | k
1 | Opt. rec. of handwritten digits | 0.0001 | 100,000 | 10
2 | Letter | 0.0001 | 100,000 | 30
3 | Isolet | 0.0001 | 100,000 | 10
4 | USPS handwritten digit | 0.0001 | 100,000 | 10
5 | A new bench. hand. char. rec. | 0.001 | 100,000 | 50
6 | MNIST | 0.05 | 100,000 | 100
7 | Forest cover types | 0.0001 | 100,000 | 500

Classification Results

Table 3 reports the classification correctness of LibSVM, kSVM, autokSVM-dp, and autokSVM-1p on the seven data sets. Table 4 shows the training time of LibSVM, kSVM, autokSVM-dp, and autokSVM-1p to learn the classification models. The classification accuracy can be assessed quickly from Fig. 11, and the training time from Figs. 8, 9, and 10. Although our proposed autokSVM algorithms are off-the-shelf classification tools (hyper-parameters are tuned automatically while training the local SVM models), they achieve very competitive performance compared to LibSVM and kSVM.

Table 3 Classification results in terms of accuracy (%)

ID | Data sets | LibSVM | kSVM | AutokSVM-dp | AutokSVM-1p
1 | Opt. rec. of handwritten digits | 98.33 | 97.05 | 97.11 | 96.88
2 | Letter | 97.40 | 96.14 | 95.72 | 95.54
3 | Isolet | 96.47 | 95.44 | 95.83 | 95.45
4 | USPS handwritten digit | 96.86 | 95.86 | 95.32 | 95.37
5 | A new bench. hand. char. rec. | 95.14 | 92.98 | 95.54 | 92.72
6 | MNIST | 98.37 | 98.11 | 97.88 | 97.84
7 | Forest cover types | NA | 97.06 | 96.90 | 96.74
 | Average | 97.10 | 96.09 | 95.90 | 95.79
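As a reading aid for the Average row (our arithmetic, assuming the LibSVM average is taken over the six data sets where it finished):

$$\frac{98.33 + 97.40 + 96.47 + 96.86 + 95.14 + 98.37}{6} \approx 97.10,$$

while the averages of kSVM, autokSVM-dp, and autokSVM-1p are taken over all seven data sets.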


Table 4 Training time (s)

ID | Data sets | LibSVM | kSVM | AutokSVM-dp | AutokSVM-1p
1 | Opt. rec. of handwritten digits | 0.58 | 0.21 | 0.89 | 0.29
2 | Letter | 2.87 | 0.5 | 2.20 | 1.21
3 | Isolet | 8.37 | 2.94 | 28.61 | 8.32
4 | USPS handwritten digit | 5.88 | 3.82 | 11.23 | 9.08
5 | A new bench. hand. char. rec. | 107.07 | 35.7 | 125.21 | 105.51
6 | MNIST | 1531.06 | 45.50 | 108.66 | 86.87
7 | Forest cover types | NA (> 1,987,200) | 223.7 | 373.64 | 334.45
 | Average | NA (> 284,122.26) | 44.62 | 92.92 | 77.96

Fig. 8 Comparison of training time on five small data sets

Fig. 10 Comparison of training time on Forest cover-type data set

In terms of average classification correctness, the superiority of LibSVM over kSVM, autokSVM-dp, and autokSVM-1p corresponds to 1.01%, 1.20%, and 1.31%, respectively.

In terms of training time, it must be noted that Table 4 does not include the time spent tuning the hyper-parameters by hand for LibSVM and kSVM. This hyper-parameter search requires significant time to obtain the best results for LibSVM and kSVM, so a comparison of computational time is not really fair to our autokSVM algorithms, which tune their hyper-parameters automatically while training the local SVM models. Even facing this situation, autokSVM-1p still has very competitive training time: it is 1.75 times slower than kSVM in performing the classification tasks of the data sets, and it has about the same computational time as LibSVM for training the first 5 small data sets. On average, autokSVM-dp needs 22 s more than autokSVM-1p to tune the different kernels of the local SVM models (whereas autokSVM-1p tunes only the biggest partition) (Fig. 11).

Fig. 9 Comparison of training time on MNIST data set


Fig. 11 Comparison of classification correctness

With large data sets, kSVM, autokSVM-dp, and autokSVM-1p all achieve a significant speed-up in learning the classification models.

For the MNIST data set, kSVM, autokSVM-dp, and autokSVM-1p are, respectively, 33.64, 14.09, and 17.62 times faster than LibSVM.

Typically, the Forest cover-type data set is well known as a difficult data set for non-linear SVM [30, 31]; LibSVM ran for 23 days without any result [31]. A recent approach parallelizing SVM on distributed computers [32] uses 500 machines to train the classification model in 1655 s with 97.69% accuracy. The kSVM performed this non-linear classification in 223.7 s with 97.06% accuracy. autokSVM-dp needs 373.64 s to automatically train the local SVM models with an accuracy of 96.90%. Our autokSVM-1p classifies the Forest cover-type data set with a correctness of 96.74% in 334.45 s.

Discussion on Related Works

In past years, machine learning has shown great success in computer vision, speech recognition, machine translation, text categorization, and bioinformatics. Nevertheless, it is very hard for non-expert users to effectively apply machine learning techniques, which demand domain expertise and solid backgrounds in mathematics, probability, statistics, optimization, and computer science. As illustrated in the book [33], the performance of a machine learning model depends on both the fundamental quality of the chosen learning technique and the details of its tuning. Users need to decide between many available learning algorithms and then tune by hand the hyper-parameters of the selected technique for the given data set. This is the barrier to non-expert use of machine learning. To overcome this problem, automated machine learning systems [33] try to select the right algorithm and its hyper-parameters in a data-driven way without any human intervention. Automatic machine learning has therefore become an interesting topic for making machine learning available to non-expert users.

Our proposal is in some aspects related to parameter selection for SVM. Carl Staelin proposed an iterative algorithm [10] for searching the parameters on the grid resolution. Keerthi and his colleagues proposed an efficient heuristic [12] and a gradient-based method [11] for searching SVM parameters. Adankon and Cheriet proposed a gradient-descent algorithm [34] for LS-SVM hyper-parameter optimization. The technique in [35] uses meta-learning and case-based reasoning to propose good starting points for evolutionary parameter optimization of SVM. The study in [36] presented a multi-objective optimization framework for tuning SVM hyper-parameters. More recent research [37] proposed a method for automatically picking the kernel type (linear/non-linear) and tuning the parameters.

Recently, research in automatic machine learning has aimed to make machine learning methods work off-the-shelf, without expert knowledge. Bergstra and his colleagues proposed the random search method [38] and two greedy sequential techniques based on the expected improvement criterion [39] for optimizing the hyper-parameters of neural networks. Auto-WEKA proposed by [13] combines algorithm selection and hyper-parameter optimization for the machine learning library WEKA [40]. A generic approach [41] tries to incorporate knowledge from previous experiments for simultaneously tuning a learning algorithm on new problems. The Python library Hyperopt [14] provides algorithms and a software framework, in a distributed asynchronous way, for optimizing the hyper-parameters of learning algorithms. Komer and his colleagues propose hyperopt-sklearn, based on


Hyperopt, to select models among the machine learning algorithms of the library scikit-learn [42]. Another Python automated machine learning tool, TPOT [43], automatically optimizes feature pre-processors and machine learning models in scikit-learn [42] based on cross-validation correctness. A human-domain-expert strategy [44] is used to speed up optimization. The Bayesian optimization method [45] is proposed for the hyper-parameter selection of many learning algorithms. A comparative study of three Bayesian optimization techniques (Spearmint, Tree Parzen Estimator, and Sequential Model-based Algorithm Configuration) was presented in [46]. AUTO-SKLEARN [47] automatically takes into account past performance on similar datasets during the Bayesian hyper-parameter optimization for the machine learning library scikit-learn [42]. The method in [48] presented a Bayesian hyper-parameter optimization technique for ensemble-based learning algorithms. The recent Auto-Keras [49] automatically tunes the architecture and hyper-parameters of deep learning models with Bayesian optimization, using a neural network kernel for the Bayesian optimization and a tree-structured acquisition function optimizer.

Conclusion and Future Works

We presented the off-the-shelf classification algorithms autokSVM-dp and autokSVM-1p, which are able to automatically tune hyper-parameters in classification tasks. The autokSVM algorithms determine by themselves the parameter k to partition the large training data into k clusters, after which they auto-learn a non-linear SVM in each cluster to classify the data locally, in parallel, on multi-core computers. The autokSVM algorithms use the .632 bootstrap estimator and the hill climbing heuristic to search the grid for the hyper-parameters of the local non-linear SVMs. autokSVM-dp automatically trains heterogeneous kernels for the k local SVM models, searching the kernel parameter γ and cost constant C for every local SVM model. autokSVM-1p uses a homogeneous kernel parameter γ and cost constant C automatically tuned on the biggest data partition. The numerical test results showed that our autokSVM algorithms achieve good off-the-shelf classification performance compared to the standard LibSVM and the original kSVM in terms of training time and accuracy.

In the near future, we intend to develop distributed algorithms with MPI [50] or Apache Spark [51] to improve the training time. A promising line of research is to extend this work to automated learning of local classification models [6, 26, 27] and local support vector regression models [52, 53]. We will also study other test protocols for assessing the prediction correctness while tuning hyper-parameters.

References

1. Vapnik V. The nature of statistical learning theory. 2nd ed. New York: Springer; 1999.
2. Guyon I. Web page on SVM applications. http://www.clopinet.com/isabelle/Projects/SVM/app-list.html.
3. Platt J. Fast training of support vector machines using sequential minimal optimization. In: Schölkopf B, Burges C, Smola A, editors. Advances in kernel methods—support vector learning. Cambridge: MIT Press; 1999. p. 185–208.
4. Do TN, Tran-Nguyen MT. Automatic hyper-parameters tuning for local support vector machines. In: Proceedings of 5th international conference on future data and security engineering, FDSE 2018, Ho Chi Minh City, November 28–30; 2018. pp. 185–99.
5. Do TN. Non-linear classification of massive datasets with a parallel algorithm of local support vector machines. In: Advanced computational methods for knowledge engineering. Springer International Publishing; 2015. pp. 231–41.
6. Do TN, Poulet F. Parallel learning of local SVM algorithms for classifying large datasets. Trans Large-Scale Data Knowl Centered Syst. 2016;31:67–93.
7. Bottou L, Vapnik V. Local learning algorithms. Neural Comput. 1992;4(6):888–900.
8. Vapnik V, Bottou L. Local algorithms for pattern recognition and dependencies estimation. Neural Comput. 1993;5(6):893–909.
9. Vapnik V. Principles of risk minimization for learning theory. In: Advances in neural information processing systems, vol. 4. NIPS conference, Denver, December 2–5; 1991. pp. 831–8.
10. Staelin C. Parameter selection for support vector machines. Technical report, HP Laboratories; 2002.
11. Keerthi SS, Sindhwani V, Chapelle O. An efficient method for gradient-based adaptation of hyperparameters in SVM models. In: Proceedings of the 19th international conference on neural information processing systems. NIPS'06. MIT Press, Cambridge; 2006. pp. 673–80.
12. Keerthi SS, Lin CJ. Asymptotic behaviors of support vector machines with Gaussian kernel. Neural Comput. 2003;15(7):1667–89.
13. Thornton C, Hutter F, Hoos HH, Leyton-Brown K. Auto-WEKA: combined selection and hyperparameter optimization of classification algorithms. In: Proceedings of the 19th ACM SIGKDD international conference on knowledge discovery and data mining. KDD '13. ACM, New York; 2013. pp. 847–55.
14. Bergstra J, Komer B, Eliasmith C, Yamins D, Cox DD. Hyperopt: a Python library for model selection and hyperparameter optimization. Comput Sci Discov. 2015;8(1):014008.
15. Efron B, Tibshirani RJ. An introduction to the bootstrap. Softcover reprint of the original 1st ed. 1993. Chapman and Hall/CRC, Boca Raton; 1994.
16. Lichman M. UCI machine learning repository; 2013.
17. LeCun Y, Boser B, Denker J, Henderson D, Howard R, Hubbard W, Jackel L. Backpropagation applied to handwritten zip code recognition. Neural Comput. 1989;1(4):541–51.
18. LeCun Y, Bottou L, Bengio Y, Haffner P. Gradient-based learning applied to document recognition. Proc IEEE. 1998;86(11):2278–324.
19. van der Maaten L. A new benchmark dataset for handwritten character recognition; 2009. http://homepage.tudelft.nl/19j49/Publications_files/characters.zip.
20. Chang CC, Lin CJ. LIBSVM: a library for support vector machines. ACM Trans Intell Syst Technol. 2011;2(27):1–27.
21. Cristianini N, Shawe-Taylor J. An introduction to support vector machines: and other kernel-based learning methods. New York: Cambridge University Press; 2000.
22. Lin C. A practical guide to support vector classification; 2003.

23. MacQueen J. Some methods for classification and analysis of multivariate observations. In: Proceedings of 5th Berkeley symposium on mathematical statistics and probability, vol. 1. University of California Press, Berkeley; 1967. pp. 281–97.
24. OpenMP Architecture Review Board. OpenMP application program interface version 3.0; 2008.
25. Do TN, Poulet F. Random local SVMs for classifying large datasets. In: International conference on future data and security engineering. Springer International Publishing; 2015. pp. 3–15.
26. Do TN, Poulet F. Classifying very high-dimensional and large-scale multi-class image datasets with latent-lSVM. In: IEEE international conference on cloud and big data computing; 2016.
27. Do T, Poulet F. Latent-lSVM classification of very high-dimensional and large-scale multi-class datasets. Concurr Comput Pract Exp. 2019;31(2).
28. Hastie T, Tibshirani R, Friedman J. The elements of statistical learning—data mining, inference, and prediction. 2nd ed. Berlin: Springer; 2009.
29. Pádraig C. Evaluation in machine learning. Tutorial; 2009.
30. Yu H, Yang J, Han J. Classifying large data sets using SVMs with hierarchical clusters. In: Proceedings of the ACM SIGKDD international conference on knowledge discovery and data mining. ACM; 2003. pp. 306–15.
31. Do TN, Poulet F. Towards high dimensional data mining with boosting of PSVM and visualization tools. In: Proceedings of 6th international conference on enterprise information systems; 2004. pp. 36–41.
32. Zhu K, Wang H, Bai H, Li J, Qiu Z, Cui H, Chang EY. Parallelizing support vector machines on distributed computers. In: Platt JC, Koller D, Singer Y, Roweis ST, editors. Advances in neural information processing systems, vol. 20. New York: Curran Associates, Inc.; 2008. p. 257–64.
33. Hutter F, Kotthoff L, Vanschoren J, editors. Automatic machine learning: methods, systems, challenges. Berlin: Springer; 2018. http://automl.org/book (in press).
34. Adankon MM, Cheriet M. Model selection for the LS-SVM. Application to handwriting recognition. Pattern Recognit. 2009;42(12):3264–70.
35. Reif M, Shafait F, Dengel A. Meta-learning for evolutionary parameter optimization of classifiers. Mach Learn. 2012;87(3):357–80.
36. Chatelain C, Adam S, Lecourtier Y, Heutte L, Paquet T. Non-cost-sensitive SVM training using multiple model selection. J Circuits Syst Comput. 2010;19(1):231–42.
37. Huang H, Lin C. Linear and kernel classification: when to use which? In: Proceedings of the 2016 SIAM international conference on data mining. Society for Industrial and Applied Mathematics; 2016. pp. 216–24.
38. Bergstra J, Bengio Y. Random search for hyper-parameter optimization. J Mach Learn Res. 2012;13(1):281–305.
39. Bergstra J, Bardenet R, Bengio Y, Kégl B. Algorithms for hyper-parameter optimization. In: Proceedings of the 24th international conference on neural information processing systems. NIPS'11. Curran Associates Inc.; 2011. pp. 2546–54.
40. Hall M, Frank E, Holmes G, Pfahringer B, Reutemann P, Witten IH. The WEKA data mining software: an update. SIGKDD Explor Newsl. 2009;11(1):10–8.
41. Bardenet R, Brendel M, Kégl B, Sebag M. Collaborative hyperparameter tuning. In: Proceedings of the 30th international conference on machine learning; 2013. pp. 199–207.
42. Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V, Vanderplas J, Passos A, Cournapeau D, Brucher M, Perrot M, Duchesnay E. Scikit-learn: machine learning in Python. J Mach Learn Res. 2011;12:2825–30.
43. Olson RS, Urbanowicz RJ, Andrews PC, Lavender NA, Kidd LC, Moore JH. Automating biomedical data science through tree-based pipeline optimization. In: Applications of evolutionary computation: 19th European conference, EvoApplications 2016, Porto, March 30–April 1, 2016, Part I. Springer International Publishing; 2016. pp. 123–37.
44. Feurer M, Springenberg JT, Hutter F. Initializing Bayesian hyperparameter optimization via meta-learning. In: Proceedings of the twenty-ninth AAAI conference on artificial intelligence. AAAI'15, Austin. AAAI Press; 2015. pp. 1128–35.
45. Snoek J, Larochelle H, Adams RP. Practical Bayesian optimization of machine learning algorithms. In: Proceedings of the 25th international conference on neural information processing systems. NIPS'12. Curran Associates Inc.; 2012. pp. 2951–9.
46. Eggensperger K, Feurer M, Hutter F, Bergstra J, Snoek J, Hoos H, Leyton-Brown K. Towards an empirical foundation for assessing Bayesian optimization of hyperparameters. In: NIPS workshop on Bayesian optimization in theory and practice; 2013.
47. Feurer M, Klein A, Eggensperger K, Springenberg JT, Blum M, Hutter F. Efficient and robust automated machine learning. In: Advances in neural information processing systems, vol. 28: annual conference on neural information processing systems 2015, Montreal, December 7–12; 2015. pp. 2962–70.
48. Lévesque JC, Gagné C, Sabourin R. Bayesian hyperparameter optimization for ensemble learning. In: Proceedings of the thirty-second conference on uncertainty in artificial intelligence. UAI'16, Arlington. AUAI Press; 2016. pp. 437–46.
49. Jin H, Song Q, Hu X. Auto-Keras: efficient neural architecture search with network morphism. CoRR; 2018. arXiv:1806.10282.
50. MPI Forum. MPI: a message-passing interface standard.
51. Zaharia M, Chowdhury M, Franklin MJ, Shenker S, Stoica I. Spark: cluster computing with working sets. In: Proceedings of the 2nd USENIX conference on hot topics in cloud computing. HotCloud'10, Berkeley. USENIX Association; 2010. p. 10.
52. Tran-Nguyen M, Bui L, Kim Y, Do T. Decision tree using local support vector regression for large datasets. In: Intelligent information and database systems—10th Asian conference, ACIIDS 2018, Dong Hoi City, March 19–21, 2018, Proceedings, Part I; 2018. pp. 255–65.
53. Do T, Bui L. Parallel learning algorithms of local support vector regression for dealing with large datasets. Trans Large-Scale Data Knowl Centered Syst. 2019;41:59–77.

Publisher's Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
