https://doi.org/10.1007/s42979-019-0006-z
REVIEW ARTICLE
Received: 11 April 2019 / Accepted: 13 June 2019 / Published online: 25 June 2019
© Springer Nature Singapore Pte Ltd 2019
Abstract
In this paper, we propose automatic learning algorithms (called autokSVM-1p and autokSVM-dp) that automatically tune the hyper-parameters of k local support vector machines (SVMs) for classifying large data sets. The autokSVM algorithms determine the number of clusters k used to partition the large training data, after which they auto-learn a non-linear SVM model in each cluster to classify the data locally, in parallel, on multi-core computers. The autokSVM algorithms combine grid search, the .632 bootstrap estimator, and a hill-climbing heuristic to optimize the hyper-parameters in the local non-linear SVM training. Numerical test results on four data sets from the UCI repository and three benchmarks of handwritten letter recognition show that our autokSVM algorithms give very competitive classification results compared to the standard LibSVM and the original kSVM. As an example of autokSVM-1p's effectiveness, it reaches 96.74% accuracy on the Forest cover-type data set (581,012 datapoints in a 54-dimensional input space, 7 classes) in 334.45 s on a PC with an Intel(R) Core i7-4790 CPU (3.6 GHz, 4 cores).
Keywords Support vector machines · Large data sets · Local support vector machines · Automatic hyper-parameter tuning
SN Computer Science (2020) 1:2
Our investigation therefore aims to develop the autokSVM algorithms, which automatically search for these hyper-parameters k and 𝜃 = {𝛾, C} in the kSVM learning task.

Setting the Parameter k

According to the performance analysis of algorithmic complexity and generalization capacity in [5–9], the parameter k in kSVM governs a trade-off between generalization capacity and computational complexity. If k is large, the kSVM reduces training time significantly (the speed-up factor of local SVMs over the global one is k), but the clusters are small and the locality extreme, giving very low generalization capacity; when k → m, the kSVM approaches the 1-nearest-neighbor classifier. If k is small, the kSVM reduces training time only slightly, but the clusters are large, which improves the generalization capacity; when k → 1, the kSVM approximates the global SVM.

This leads to setting k so that the cluster size is large enough (e.g., 200, as proposed by [7, 8]). The empirical results in [5, 6, 25–27] show that a cluster size of about 500 works well. Therefore, the autokSVM algorithms target a cluster size of ∼500 by setting k = m/500.

Automatic Tuning of Hyper-parameters 𝜃 = {𝛾, C} for Local SVMs Using the RBF Kernel Function

The remaining task is to find the best hyper-parameters 𝜃 = {𝛾, C} for the local SVM models using the RBF kernel. There are two ways to train local SVM models automatically:

– the first proposal is to train heterogeneous kernels for the local SVM models; that is, the autokSVM tunes the kernel parameter 𝛾i and cost constant Ci separately for every cluster Di;
– the second strategy is to train a homogeneous kernel for the k local SVMs; the autokSVM tunes the kernel parameter 𝛾l and cost constant Cl for the biggest cluster Dl only, and then uses these best parameters to train all k local SVM models.

In both strategies, the autokSVMs need to search for the hyper-parameter 𝛾 of the RBF kernel and the cost constant C for learning the local SVM model from the data partition Di, so as to obtain the best classification correctness. Figure 6 presents the process of automatically tuning the hyper-parameters 𝜃 = {𝛾, C} for training the local SVM model lSVMi in the data partition Di. The cost constant C is chosen in G = {1, 10¹, 10², 10³, 10⁴, 10⁵, 10⁶} and the hyper-parameter 𝛾 of the RBF kernel is tried among P = {10⁻⁵, 5·10⁻⁵, 10⁻⁴, 5·10⁻⁴, 10⁻³, 5·10⁻³, 10⁻², 5·10⁻², 10⁻¹, 5·10⁻¹, 1, 5, 10¹, 5·10¹}. Furthermore, since a local model is less complex than the global one (see Fig. 3), the autokSVMs start with the smallest values of 𝛾 and C in the grid and stop when the classification correctness cannot be improved in the two next trials (a hill-climbing heuristic).

We propose to use the .632 bootstrap estimator [15] in the autokSVM algorithms to assess the classification correctness while tuning the hyper-parameter 𝛾 and the cost constant C. The .632 bootstrap estimator achieves the same empirical results as the tenfold cross-validation protocol [28] at a lower computational cost.

The main idea of automatic hyper-parameter tuning with the .632 bootstrap estimator in the autokSVM algorithms is as follows. Given a data partition Di with mi datapoints, it randomly draws, with replacement from Di, a bootstrap sample Bi of size mi datapoints. The bootstrap sample Bi includes about 63.2% of the distinct datapoints of the original cluster Di. The autokSVM algorithms use the bootstrap sample Bi as the learning data set to train the classification model Mi for the data partition Di. The out-of-bag sample OOBi consists of the datapoints of Di that are not in Bi; it represents about 36.8% of the distinct datapoints of Di and is used as the validation data set to evaluate the classification model Mi. The .632 bootstrap estimator then assesses the error (denoted by Êrr) of the classification model Mi based on the bootstrap sample Bi and the out-of-bag sample OOBi as follows:

Êrr^(.632) = 0.632 × Êrr(OOBi, Mi) + 0.368 × Êrr(Bi, Mi).  (4)

Figure 6 presents the procedure for automatically searching the hyper-parameters while learning the local SVM model lSVMi from the data partition Di. This strategy is used in both autokSVM algorithms.

The autokSVM-dp in Algorithm 2 is the autokSVM learning of k local SVM models using heterogeneous kernel parameters 𝛾i and cost constants Ci.

The autokSVM-1p in Algorithm 3 describes the training of k local SVM models using a homogeneous kernel parameter 𝛾 and cost constant C, automatically tuned on the biggest data partition Dl.

Figure 7 is an example of the automatic classification done by the autokSVM-dp.
Table 1 Description of data sets (columns: ID, Data set, Individuals, Attributes, Classes, Evaluation protocol)

Experiments are conducted with four data sets from the UCI repository [16] and three benchmarks of handwritten letter recognition: USPS [17], MNIST [18], and a new benchmark for handwritten character recognition [19]. Table 1 describes the data sets; its last column gives the evaluation protocol. The data sets are already divided into a training set (Trn) and a testing set (Tst); we used the training data to build the SVM models and report the classification results obtained on the testing set with the resulting models.

Tuning Hyper-parameters for LibSVM, kSVM

We propose to use the RBF kernel in the SVM models because it is general and efficient [22]. We also tuned the hyper-parameter 𝛾 of the RBF kernel and the cost constant C (the trade-off between the margin size and the errors) to obtain the best accuracy. Furthermore, the kSVM uses the parameter k, the number of clusters (local models). The cross-validation protocol [28, 29] is used to tune these hyper-parameters; the optimal parameters in Table 2 give the highest correctness on the data sets.

Table 2 Hyper-parameters for LibSVM and kSVM

ID  Data set                         𝛾       C        k
1   Opt. rec. of handwritten digits  0.0001  100,000  10
2   Letter                           0.0001  100,000  30
3   Isolet                           0.0001  100,000  10
4   USPS handwritten digit           0.0001  100,000  10
5   A new bench. hand. char. rec.    0.001   100,000  50
6   MNIST                            0.05    100,000  100
7   Forest cover types               0.0001  100,000  500

Table 3 reports the classification correctness of LibSVM, kSVM, autokSVM-dp, and autokSVM-1p on the seven data sets. Table 4 shows the training time of LibSVM, kSVM, autokSVM-dp, and autokSVM-1p to learn the classification models. The classification accuracy can be assessed quickly from Fig. 11, and the training time from Figs. 8, 9, and 10. Although our proposed autokSVM algorithms are off-the-shelf classification tools (automatically tuning the hyper-parameters for training local SVM models), they achieve very competitive performance compared to LibSVM and kSVM.
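The training times behind these comparisons come from training the k local models concurrently, since each cluster's model is independent (the paper's implementation uses OpenMP [24] on multi-core CPUs). The sketch below is our own Python illustration of that scheme, with an assumed dummy per-cluster trainer in place of the auto-tuned RBF-SVM fit; real speed-up requires a trainer that releases the GIL (e.g., a C library such as LibSVM) or separate processes.

```python
from concurrent.futures import ThreadPoolExecutor

# Heuristic from the text: choose k so that clusters hold ~500 datapoints.
m = 100_000                      # training-set size (example value)
k = max(1, m // 500)             # -> 200 clusters

def train_local_model(cluster):
    """Stand-in per-cluster trainer; autokSVM would run the auto-tuned
    RBF-SVM fit on partition D_i here."""
    return {"size": len(cluster)}

def train_k_local_models(partitions, workers=4):
    # One task per cluster D_1..D_k, executed concurrently by the pool.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(train_local_model, partitions))

partitions = [[(j, j % 2) for j in range(10)] for _ in range(5)]
models = train_k_local_models(partitions)
```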
Fig. 8 Comparison of training time on five small data sets
Fig. 10 Comparison of training time on Forest cover-type data set
On large data sets, kSVM, autokSVM-dp, and autokSVM-1p all achieve a significant speed-up in learning the classification models. For the MNIST data set, kSVM, autokSVM-dp, and autokSVM-1p are, respectively, 33.64, 14.09, and 17.62 times faster than LibSVM.

The Forest cover-type data set is well known as a difficult data set for non-linear SVM [30, 31]; LibSVM ran for 23 days without producing a result [31]. A recent approach parallelizing SVM on distributed computers [32] uses 500 machines to train the classification model in 1655 s with 97.69% accuracy. The kSVM performed this non-linear classification in 223.7 s with 97.06% accuracy. The autokSVM-dp needs 373.64 s to automatically train the local SVM models with an accuracy of 96.90%. Our autokSVM-1p classifies the Forest cover-type data set with a correctness of 96.74% in 334.45 s.

Discussion on Related Works

In past years, machine learning has shown great success in computer vision, speech recognition, machine translation, text categorization, and bioinformatics. Nevertheless, it is very hard for non-expert users to apply machine learning techniques effectively, because doing so requires domain expertise and strong backgrounds in mathematics, probability, statistics, optimization, and computer science. As illustrated in the book [33], the performance of a machine learning model depends both on the fundamental quality of the chosen learning technique and on the details of its tuning. Users need to decide between many available learning algorithms and then tune the hyper-parameters of the selected technique by hand for the given data set. This is a barrier to non-expert use of machine learning. To overcome this problem, automated machine learning systems [33] try to select the right algorithm and its hyper-parameters in a data-driven way, without any human intervention. Automatic machine learning has therefore become an interesting topic for making machine learning available to non-expert users.

Our proposal is in some aspects related to parameter selection for SVM. Carl Staelin proposed an iterative algorithm [10] for searching the parameters on the grid resolution. Keerthi and his colleagues proposed an efficient heuristic [12] and a gradient-based method [11] for searching SVM parameters. Adankon and Cheriet proposed a gradient-descent algorithm [34] for LS-SVM hyper-parameter optimization. The technique in [35] uses meta-learning and case-based reasoning to propose good starting points for evolutionary parameter optimization of SVM. The study in [36] presented a multi-objective optimization framework for tuning SVM hyper-parameters. More recent research [37] proposed a method for automatically picking the kernel type (linear/non-linear) and tuning its parameters.

Recent research on automatic machine learning aims to provide off-the-shelf machine learning methods that require no expert knowledge. Bergstra and his colleagues proposed the random search method [38] and two greedy sequential techniques based on the expected improvement criterion [39] for optimizing the hyper-parameters of neural networks. Auto-WEKA [13] combines algorithm selection and hyper-parameter optimization for the machine learning library WEKA [40]. A generic approach [41] tries to incorporate knowledge from previous experiments when simultaneously tuning a learning algorithm on new problems. The Python library Hyperopt [14] provides algorithms and a software framework, in a distributed asynchronous way, for optimizing the hyper-parameters of learning algorithms. Komer and his colleagues propose hyperopt-sklearn, based on Hyperopt, for tuning models in scikit-learn [42].
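For contrast with the grid-plus-hill-climbing search used by the autokSVMs, the random-search idea of [38] samples hyper-parameters from (log-uniform) distributions instead of scanning a fixed grid. A minimal sketch of ours, with an assumed toy objective standing in for a cross-validated SVM error:

```python
import math
import random

def random_search(objective, n_trials, rng):
    """Keep the best of n_trials log-uniform samples of (gamma, C),
    in the spirit of random search [38]."""
    best_params, best_err = None, float("inf")
    for _ in range(n_trials):
        gamma = 10 ** rng.uniform(-5, 2)   # spans roughly the grid P
        C = 10 ** rng.uniform(0, 6)        # spans roughly the grid G
        err = objective(gamma, C)
        if err < best_err:
            best_params, best_err = (gamma, C), err
    return best_params, best_err

# Toy objective (illustrative only): minimised near gamma = 1e-3, C = 1e3.
rng = random.Random(1)
params, err = random_search(
    lambda g, C: abs(math.log10(g) + 3) + abs(math.log10(C) - 3), 200, rng)
```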
23. MacQueen J. Some methods for classification and analysis of multivariate observations. In: Proceedings of 5th Berkeley symposium on mathematical statistics and probability, vol. 1. University of California Press, Berkeley; 1967. pp. 281–97.
24. OpenMP Architecture Review Board. OpenMP application program interface version 3.0; 2008.
25. Do TN, Poulet F. Random local SVMs for classifying large datasets. In: International conference on future data and security engineering. Springer International Publishing; 2015. pp. 3–15.
26. Do TN, Poulet F. Classifying very high-dimensional and large-scale multi-class image datasets with latent-lSVM. In: IEEE international conference on cloud and big data computing; 2016.
27. Do T, Poulet F. Latent-lSVM classification of very high-dimensional and large-scale multi-class datasets. Concurr Comput Pract Exp. 2019;31(2).
28. Hastie T, Tibshirani R, Friedman J. The elements of statistical learning—data mining, inference. 2nd ed. Berlin: Springer; 2009.
29. Pádraig C. Evaluation in machine learning. Tutorial; 2009.
30. Yu H, Yang J, Han J. Classifying large data sets using SVMs with hierarchical clusters. In: Proceedings of the ACM SIGKDD Intl. Conf. on KDD. ACM; 2003. pp. 306–15.
31. Do TN, Poulet F. Towards high dimensional data mining with boosting of PSVM and visualization tools. In: Proceedings of 6th international conference on enterprise information systems; 2004. pp. 36–41.
32. Zhu K, Wang H, Bai H, Li J, Qiu Z, Cui H, Chang EY. Parallelizing support vector machines on distributed computers. In: Platt JC, Koller D, Singer Y, Roweis ST, editors. Advances in neural information processing systems, vol. 20. New York: Curran Associates, Inc.; 2008. pp. 257–64.
33. Hutter F, Kotthoff L, Vanschoren J, editors. Automatic machine learning: methods, systems, challenges. Berlin: Springer; 2018. http://automl.org/book (in press).
34. Adankon MM, Cheriet M. Model selection for the LS-SVM. Application to handwriting recognition. Pattern Recognit. 2009;42(12):3264–70.
35. Reif M, Shafait F, Dengel A. Meta-learning for evolutionary parameter optimization of classifiers. Mach Learn. 2012;87(3):357–80.
36. Chatelain C, Adam S, Lecourtier Y, Heutte L, Paquet T. Non-cost-sensitive SVM training using multiple model selection. J Circuits Syst Comput. 2010;19(1):231–42.
37. Huang H, Lin C. Linear and kernel classification: when to use which? In: Proceedings of the 2016 SIAM international conference on data mining. Society for Industrial and Applied Mathematics; 2016. pp. 216–24.
38. Bergstra J, Bengio Y. Random search for hyper-parameter optimization. J Mach Learn Res. 2012;13(1):281–305.
39. Bergstra J, Bardenet R, Bengio Y, Kégl B. Algorithms for hyper-parameter optimization. In: Proceedings of the 24th international conference on neural information processing systems. NIPS'11, USA. Curran Associates Inc.; 2011. pp. 2546–54.
40. Hall M, Frank E, Holmes G, Pfahringer B, Reutemann P, Witten IH. The WEKA data mining software: an update. SIGKDD Explor Newsl. 2009;11(1):10–8.
41. Bardenet R, Brendel M, Kégl B, Sebag M. Collaborative hyperparameter tuning. In: Proceedings of the 30th international conference on machine learning; 2013. pp. 199–207.
42. Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V, Vanderplas J, Passos A, Cournapeau D, Brucher M, Perrot M, Duchesnay E. Scikit-learn: machine learning in Python. J Mach Learn Res. 2011;12:2825–30.
43. Olson RS, Urbanowicz RJ, Andrews PC, Lavender NA, Kidd LC, Moore JH. Automating biomedical data science through tree-based pipeline optimization. In: Applications of evolutionary computation: 19th European conference, EvoApplications 2016, Porto, March 30–April 1, 2016, Part I. Springer International Publishing; 2016. pp. 123–37.
44. Feurer M, Springenberg JT, Hutter F. Initializing Bayesian hyperparameter optimization via meta-learning. In: Proceedings of the twenty-ninth AAAI conference on artificial intelligence. AAAI'15, Austin. AAAI Press; 2015. pp. 1128–35.
45. Snoek J, Larochelle H, Adams RP. Practical Bayesian optimization of machine learning algorithms. In: Proceedings of the 25th international conference on neural information processing systems. NIPS'12. Curran Associates Inc.; 2012. pp. 2951–9.
46. Eggensperger K, Feurer M, Hutter F, Bergstra J, Snoek J, Hoos H, Leyton-Brown K. Towards an empirical foundation for assessing Bayesian optimization of hyperparameters. In: NIPS workshop on Bayesian optimization in theory and practice; 2013.
47. Feurer M, Klein A, Eggensperger K, Springenberg JT, Blum M, Hutter F. Efficient and robust automated machine learning. In: Advances in neural information processing systems, vol. 28: annual conference on neural information processing systems 2015, Montreal, December 7–12; 2015. pp. 2962–70.
48. Lévesque JC, Gagné C, Sabourin R. Bayesian hyperparameter optimization for ensemble learning. In: Proceedings of the thirty-second conference on uncertainty in artificial intelligence. UAI'16, Arlington. AUAI Press; 2016. pp. 437–46.
49. Jin H, Song Q, Hu X. Auto-keras: efficient neural architecture search with network morphism. CoRR; 2018. arXiv:1806.10282.
50. MPI-Forum. MPI: a message-passing interface standard.
51. Zaharia M, Chowdhury M, Franklin MJ, Shenker S, Stoica I. Spark: cluster computing with working sets. In: Proceedings of the 2nd USENIX conference on hot topics in cloud computing. HotCloud'10, Berkeley. USENIX Association; 2010. p. 10.
52. Tran-Nguyen M, Bui L, Kim Y, Do T. Decision tree using local support vector regression for large datasets. In: Intelligent information and database systems—10th Asian conference, ACIIDS 2018, Dong Hoi City, March 19–21, 2018, Proceedings, Part I; 2018. pp. 255–65.
53. Do T, Bui L. Parallel learning algorithms of local support vector regression for dealing with large datasets. Trans Large-Scale Data Knowl Centered Syst. 2019;41:59–77.

Publisher's Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.