
Fast Cross-Validation via Sequential Analysis

Tammo Krueger, Danny Panknin, Mikio Braun Technische Universitaet Berlin Machine Learning Group 10587 Berlin t.krueger@tu-berlin.de, {panknin|mikio}@cs.tu-berlin.de

Abstract
With the increasing size of today’s data sets, finding the right parameter configuration via cross-validation can be an extremely time-consuming task. In this paper we propose an improved cross-validation procedure which uses non-parametric testing coupled with sequential analysis to determine the best parameter set on linearly increasing subsets of the data. By eliminating underperforming candidates quickly and keeping promising candidates as long as possible the method speeds up the computation while preserving the capability of the full cross-validation. The experimental evaluation shows that our method reduces the computation time by a factor of up to 70 compared to a full cross-validation with a negligible impact on the accuracy.

1 Introduction

Unarguably, a lot of computing time is spent on cross-validation [1] to tune the free parameters of machine learning methods. While cross-validation can be parallelized easily, with every instance evaluating a single candidate parameter setting, an enormous amount of computing resources is still spent on it, resources which could probably be put to better use in the actual learning methods. Just to give an idea: if you perform five-fold cross-validation over two parameters and take only five candidates for each parameter, you have to train 125 times to perform the cross-validation. Thus, even a training time of one second becomes more than two minutes without parallelization.

In practice, almost no one performs cross-validation on the whole data set, as the parameters can often be inferred reliably on a small subset of the data, thereby speeding up the computation substantially. However, the choice of the subset depends a lot on the structure of the data set: if the subset is too small compared to the complexity of the learning task, the wrong parameters are chosen. Usually, researchers can tell from experience what subset sizes are necessary for specific learning problems, but one would like to have a robust method which can deal with a whole range of learning problems in an automatic fashion.

In this paper, we propose a method based on the sequential analysis framework to achieve exactly this: speed up cross-validation by taking subsets of the data, while being robust with respect to different problem complexities. To this end, the method performs cross-validation on subsets of increasing size up to the full data set size, eliminating suboptimal parameter choices quickly. The statistical tests used for the elimination are tuned such that they retain promising parameters as long as possible, to guard against unreliable measurements at small sample sizes. In experiments, we show that even with such conservative tests we achieve significant speed-ups of typically 25 times and up to 70 times, which translate to literally hours of computing time freed up on our clusters.
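For concreteness, the back-of-the-envelope count behind the two-minute figure; the candidate values below are made up purely for illustration.

```python
folds = 5
sigma_candidates = [0.1, 0.3, 1.0, 3.0, 10.0]   # five illustrative values
nu_candidates = [0.1, 0.2, 0.3, 0.4, 0.5]       # five illustrative values

# one training run per (fold, sigma, nu) combination
runs = folds * len(sigma_candidates) * len(nu_candidates)
print(runs)              # 125 training runs
print(runs / 60.0)       # at one second per run: a bit over two minutes
```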

[Figure 1 (schematic): the pointwise performance matrix over configurations c1, ..., ck and test points d1, ..., dn, the binary trace matrix over the steps, and the cumulative-sum view of the sequential test with its WINNER and LOSER regions; the annotations read maxSteps = 20, ∆ = N/maxSteps, modelSize = s∆, n = N − s∆.]
Figure 1: One step of the fast cross-validation procedure. Shown is the situation in step s = 10. (1) A model with modelSize data points is learned for each configuration (c1 to ck). Test errors are calculated on the current test set (d1 to dn) and transformed into a binary performance indicator. (2) Traces of configurations are filtered via sequential analysis (c1 and c2 are dropped). (3) At the end of each step the procedure checks whether the remaining configurations perform equally well within a time window and stops if this is the case (see Sec. 5 in the Appendix for a complete example run).

2 Fast Cross-Validation

We consider the usual supervised learning setting: we have a data set consisting of data points d1 = (X1, Y1), ..., dN = (XN, YN) ∈ X × Y which we assume to be drawn i.i.d. from PX×Y, and a learning algorithm A which depends on several parameters p. The goal is to select the parameters p∗ such that the learned predictor g has the best generalization error with respect to some loss function ℓ : Y × Y → R. Full k-fold cross-validation estimates the best parameters by splitting the data into k parts, using k − 1 parts for training and estimating the error on the remaining part. Our approach attempts to speed up this process by taking subsamples of size ⌊sN/maxSteps⌋ for 1 ≤ s ≤ maxSteps, starting with the full set of parameter candidates and eliminating clearly underperforming candidates at each step.

Each execution of the main loop of the algorithm depicted in Figure 1 performs the following main parts on a given subset of the data: (1) the procedure transforms the pointwise test errors of the remaining configurations into a binary "top or flop" scheme; (2) it drops significant loser configurations along the way using tests from the sequential analysis framework; (3) applying robust, distribution-free testing techniques allows for an early stopping of the procedure once we have seen enough data for a stable parameter estimate. In the following we discuss the individual steps of the algorithm.

(1) Robust Transformation of Test Errors: As the first step, the pointwise test errors of each configuration are transformed into a binary value encoding whether the configuration is among the best ones or not. We call this the "top or flop" scheme. This step abstracts from the underlying loss function and the scale of the errors, encoding only whether a configuration looks promising for further analysis. From the point of view of statistical test theory, the question is to find the top-performing configurations which show a similar behavior on all tested samples. Traditionally, this test could be performed using ANOVA; however, we propose the following non-parametric tests in order to increase robustness: for classification, we use the Cochran Q test [2] applied to the binary information whether a sample has been correctly classified or not; for regression problems we apply the Friedman test [3] directly to the residuals of the prediction. Note that both tests use a paired approach on the pointwise performance measure, thereby increasing the statistical power of the result (see Sec. 6 in the Appendix for a summary of these tests).
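The exact construction of the top set is deferred to the Appendix, which is not reproduced here. The following is a minimal sketch of how the classification case could look, assuming the top set is grown greedily (best configuration first) until the Cochran Q test rejects homogeneous performance; the helper names and the significance level alpha are illustrative assumptions, not the authors' exact procedure.

```python
import numpy as np
from scipy.stats import chi2

def cochran_q(correct):
    """Cochran Q test for equal success probabilities across k configurations.

    correct: binary array of shape (n_test_points, k_configurations).
    Returns the p-value of the chi-square approximation with k-1 degrees of freedom.
    """
    n, k = correct.shape
    col = correct.sum(axis=0)          # successes per configuration
    row = correct.sum(axis=1)          # successes per test point
    T = correct.sum()
    denom = k * T - (row ** 2).sum()
    if denom == 0:                     # every row identical: no detectable difference
        return 1.0
    Q = (k - 1) * (k * (col ** 2).sum() - T ** 2) / denom
    return chi2.sf(Q, k - 1)

def top_or_flop(correct, alpha=0.05):
    """Binary 'top or flop' indicator per configuration (hypothetical helper).

    Greedily grows the set of best configurations (ordered by error) until the
    Cochran Q test signals that the set no longer performs homogeneously.
    """
    errors = 1.0 - correct.mean(axis=0)
    order = np.argsort(errors)                 # best configuration first
    top = [order[0]]
    for c in order[1:]:
        if cochran_q(correct[:, top + [c]]) < alpha:
            break                              # adding c breaks homogeneity
        top.append(c)
    indicator = np.zeros(correct.shape[1], dtype=int)
    indicator[top] = 1
    return indicator
```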

(2) Determining Significant Losers: Having transformed the test errors into a scale-independent top-or-flop scheme, we can now test whether a given parameter configuration is an overall loser. Sequential testing of binary random variables is addressed by the sequential analysis framework developed by Wald [4]. The main idea is the following: one observes a sequence of i.i.d. binary random variables Z1, Z2, ..., and wants to test whether these variables are distributed according to H0: π0 or H1: π1 with π0 < π1. The significance levels for the acceptance of H1 and H0 can be controlled via the meta-parameters αl and βl. The test computes the likelihood of the data observed so far and rejects one of the hypotheses when the respective likelihood ratio exceeds a factor controlled by the meta-parameters. It can be shown that the procedure has a very intuitive geometric representation, shown in Figure 1, lower left: the binary observations are recorded as a cumulative sum at each time step. If this sum exceeds the upper red line, we accept H1; if the sum falls below the lower red line, we accept H0; if the sum stays between the two red lines, we have to draw another sample. Since our main goal is to use the sequential test to eliminate underperformers, we choose the parameters π0 and π1 such that accepting H1 (a configuration wins) is postponed as long as possible. At the same time, we want to maximize the area in which configurations are eliminated (the region denoted by "LOSER" in Fig. 1), rejecting as many loser configurations along the way as possible (see Sec. 1–3 in the Appendix for the concrete derivation of these test parameters).

(3) Early Stopping and Final Winner: Finally, we employ an early stopping rule which takes the last earlyStoppingWindow columns of the trace matrix and checks whether all remaining configurations performed equally well in the recent past. If that is the case, the procedure is stopped. For this test we again use the Cochran Q test, which is illustrated in Figure 1, lower right: the last three traces at step 10 perform nearly optimally within the given window, but c3 shows a significantly different behavior, so the test indicates a significant effect and the procedure goes on. To determine the final winner after the procedure has stopped, we iteratively go back in time among all winning configurations of each step until we have found an exclusive winner. This way, we make the most of the data accumulated during the course of the procedure.

Efficient Parallelization: As for normal cross-validation, the parallelization of the fast cross-validation procedure is a straightforward map-reduce scheme: the model of each remaining configuration in each step can be calculated on a single cluster node. Only the results of the model on the data points d1, d2, ..., dn have to be transferred back to a central instance to calculate the binary "top or flop" scheme (step 1). This central reduce node then updates the trace matrix accordingly and tests for significant losers (step 2). After eliminating underperforming configurations, the early stopping rule (step 3) checks whether the procedure should iterate once more and schedules the remaining configurations on the cluster.
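A minimal sketch of Wald's sequential test applied to one configuration's top/flop trace; comparing the log-likelihood ratio to the two thresholds is equivalent to the two red lines on the cumulative sum in Figure 1. The concrete values of π0, π1, αl and βl are derived in the Appendix, so any values passed in below are placeholders.

```python
import math

def sprt_decision(trace, pi0, pi1, alpha_l, beta_l):
    """Wald's sequential probability ratio test on a binary top/flop trace.

    trace: sequence of 0/1 observations for one configuration (one per step).
    Tests H0: success probability = pi0 against H1: success probability = pi1
    with pi0 < pi1.  Returns 'H1' (configuration wins), 'H0' (configuration is
    a loser and is dropped), or 'continue' (keep sampling).
    """
    log_a = math.log((1.0 - beta_l) / alpha_l)   # accept-H1 threshold
    log_b = math.log(beta_l / (1.0 - alpha_l))   # accept-H0 threshold
    llr = 0.0
    for z in trace:
        llr += math.log(pi1 / pi0) if z else math.log((1.0 - pi1) / (1.0 - pi0))
        if llr >= log_a:
            return "H1"
        if llr <= log_b:
            return "H0"
    return "continue"
```

With illustrative settings such as pi0=0.1, pi1=0.9, alpha_l=0.01, beta_l=0.1, a trace that starts 1, 0, 0, 0 is already declared a loser by this sketch.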
This stepwise elimination of underperforming configurations will result in a significant speed-up as will be shown in the next section.
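Putting the pieces together, a high-level skeleton of the procedure might look as follows. It reuses the cochran_q, top_or_flop and sprt_decision sketches from above; the train_and_score callback and all default meta-parameters are assumptions rather than the authors' implementation.

```python
import numpy as np

def fast_cv(configs, train_and_score, X, y, max_steps=10, early_window=3,
            pi0=0.1, pi1=0.9, alpha_l=0.01, beta_l=0.1, alpha=0.05):
    """Illustrative skeleton of the fast cross-validation loop.

    train_and_score(config, train, test) is assumed to return a 0/1 vector of
    pointwise correctness on the test points.
    """
    N = len(y)
    remaining = list(range(len(configs)))
    traces = {c: [] for c in remaining}
    for s in range(1, max_steps):            # the last step would leave no test points
        m = s * N // max_steps               # modelSize = s * Delta
        train, test = (X[:m], y[:m]), (X[m:], y[m:])
        correct = np.column_stack(
            [train_and_score(configs[c], train, test) for c in remaining])
        tf = top_or_flop(correct, alpha)                       # step (1)
        for j, c in enumerate(remaining):
            traces[c].append(int(tf[j]))
        survivors = [c for c in remaining                      # step (2)
                     if sprt_decision(traces[c], pi0, pi1, alpha_l, beta_l) != "H0"]
        remaining = survivors or remaining   # never drop the last candidates
        if len(remaining) == 1:
            break
        window = np.array([traces[c][-early_window:] for c in remaining]).T
        if s >= early_window and cochran_q(window) >= alpha:   # step (3)
            break
    # final winner: walk back in time until only one top configuration is left
    for step in range(len(traces[remaining[0]]) - 1, -1, -1):
        tops = [c for c in remaining if traces[c][step] == 1]
        if len(tops) == 1:
            return configs[tops[0]]
        if tops:
            remaining = tops
    return configs[remaining[0]]
```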

3 Experiments

In this section we explore the performance of the fast cross-validation procedure on real-world data sets. First we use the benchmark repository introduced by Rätsch et al. [5]. We split each data set into two halves, using one half for the parameter estimation via full and fast cross-validation and the other half for the calculation of the test error. Additionally, we use the covertype data set [6]: after scaling the data we use the two classes with the most entries and follow the procedure of the paper in sampling 2,000 data points of each class for the model learning, estimating the test error on the remaining data points. For all setups we use a ν-SVM with Gaussian kernel and 610 parameter configurations (σ ∈ [−3, 3], ν ∈ [0.05, 0.5]). The fast cross-validation procedure is carried out with 10 steps, once with the early stopping rule (fast/early) and once without (fast). For each data set we repeat the process 50 times, each time with a different split. Figure 2 shows that the speed improvement of the fast setup with early stopping often ranges between 20 and 30 and goes up to 70 for the covertype data set. Without the early stopping rule the speed gain drops, but for most data sets it stays between 10 and 20. The absolute test error difference of the fast cross-validation compared to the normal cross-validation almost always stays below 1 percentage point (data in Sec. 4 in the Appendix). These results illustrate that the huge speed improvement of the fast cross-validation comes at a very low price in terms of absolute test error difference.
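The paper does not spell out the grid behind the 610 configurations; the sketch below reconstructs one plausible version (a 61 × 10 grid with σ interpreted on a log10 scale and the common conversion gamma = 1/(2σ²)) using scikit-learn's NuSVC, together with a train_and_score callback compatible with the fast_cv skeleton above. All of these choices are assumptions that merely reproduce the reported configuration count.

```python
import numpy as np
from sklearn.svm import NuSVC

# Hypothetical reconstruction of the 610-configuration grid.
sigmas = np.logspace(-3, 3, 61)            # sigma on a log10 scale over [-3, 3]
nus = np.linspace(0.05, 0.5, 10)
configs = [{"gamma": 1.0 / (2.0 * s ** 2), "nu": n} for s in sigmas for n in nus]
assert len(configs) == 610

def train_and_score(config, train, test):
    """Fit a nu-SVM with Gaussian kernel and return pointwise 0/1 correctness."""
    X_tr, y_tr = train
    X_te, y_te = test
    model = NuSVC(kernel="rbf", gamma=config["gamma"], nu=config["nu"])
    model.fit(X_tr, y_tr)
    return (model.predict(X_te) == y_te).astype(int)
```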

Figure 2: Distribution of relative speed gains of the fast cross-validation on the benchmark data sets.

4 Related Work

Using statistical tests to speed up learning has been the topic of several lines of research. However, the existing body of work mostly focuses on reducing the number of test evaluations, while we focus on the overall process of eliminating candidate configurations themselves. To the best of our knowledge, this is a new concept, and it can be combined with the already available racing techniques to further reduce the total calculation time. Maron and Moore introduce the so-called Hoeffding races [7, 8], which are based on the non-parametric Hoeffding bound for the mean of the test error. At each step of the algorithm a new test point is evaluated by all remaining models, and the confidence intervals of the test errors are updated accordingly. Models whose confidence interval lies outside the interval of at least one better-performing model are dropped. Chien et al. [9, 10] devise a similar range of algorithms using concepts from PAC learning and game theory: different hypotheses are ordered by their expected utility according to the test data the algorithm has seen so far. This concept of racing is further extended by Domingos and Hulten [11]: by introducing an upper bound on the learner's loss as a function of the number of examples, their procedure allows for an early stopping of the learning process if the loss is nearly as good as with infinite data. While Bradley and Schapire [12] use similar concepts in the context of boosting (FilterBoost), Mnih et al. [13] introduce empirical Bernstein bounds to extend both the FilterBoost framework and the racing algorithms. In both cases the bounds are used to estimate the error within a specific region with a given probability. These racing concepts have been applied in a wide variety of domains such as reinforcement learning [14], multi-armed bandit problems [15], and timetabling [16], showing the relevance of the topic.
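For concreteness, a minimal sketch of the elimination rule behind such a race; the helper name, the confidence level delta, and the loss range are assumptions, and the reader should consult [7, 8] for the original algorithms.

```python
import math

def hoeffding_race_prune(losses, delta=0.05, loss_range=1.0):
    """One elimination pass of a Hoeffding race (sketch in the spirit of [7, 8]).

    losses maps a model id to the list of its test-point losses observed so far.
    A model is dropped as soon as the lower end of its Hoeffding confidence
    interval lies above the upper end of the interval of the best model so far.
    """
    bounds = {}
    for model, errs in losses.items():
        n = len(errs)
        mean = sum(errs) / n
        # Hoeffding half-width for n samples bounded in [0, loss_range]
        eps = loss_range * math.sqrt(math.log(2.0 / delta) / (2.0 * n))
        bounds[model] = (mean - eps, mean + eps)
    best_upper = min(upper for _, upper in bounds.values())
    return [m for m, (lower, _) in bounds.items() if lower <= best_upper]
```

Note the contrast with the approach of this paper: the race above accumulates test points for fixed models, whereas the fast cross-validation procedure eliminates configurations while the training subset grows.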

5 Conclusion and Further Work

We have proposed a procedure to significantly accelerate cross-validation by performing it on subsets of increasing size and eliminating underperforming candidates early. We first transform the cross-validation problem into a binary trace matrix which contains the winner/loser status of each configuration for each subset size. To speed up cross-validation, the goal is then to identify overall losers as early as possible. Note that the distribution of this matrix is very complex and in general unknown, as it depends on the data distribution, the learning algorithm, and the sample sizes. We can assume that the distribution of the columns of the matrix converges as the sample size becomes larger, but there may also be significant shifts in what the top candidates are at smaller sample sizes. Our approach, which applies robust testing and the sequential analysis framework under several simplifying assumptions, is therefore a first step towards solving this problem. Better understanding the true distribution underlying this problem is an interesting question for future research.

Acknowledgments: This work is generously funded by the BMBF project ALICE (01IB10003B).


References
[1] Sylvain Arlot and Alain Celisse. A survey of cross-validation procedures for model selection. Statistics Surveys, 4:40–79, 2010.
[2] W. G. Cochran. The comparison of percentages in matched samples. Biometrika, 37(3-4):256–266, 1950.
[3] Milton Friedman. The use of ranks to avoid the assumption of normality implicit in the analysis of variance. Journal of the American Statistical Association, 32(200):675–701, 1937.
[4] Abraham Wald. Sequential Analysis. Wiley, 1947.
[5] G. Rätsch, T. Onoda, and K.-R. Müller. Soft margins for AdaBoost. Machine Learning, 42(3):287–320, 2001.
[6] J. A. Blackard and D. J. Dean. Comparative accuracies of artificial neural networks and discriminant analysis in predicting forest cover types from cartographic variables. Computers and Electronics in Agriculture, 24:131–151, 1999.
[7] Oded Maron and Andrew W. Moore. Hoeffding races: Accelerating model selection search for classification and function approximation. In Advances in Neural Information Processing Systems 6, pages 59–66. Morgan Kaufmann, 1994.
[8] Oded Maron and Andrew W. Moore. The racing algorithm: Model selection for lazy learners. Artificial Intelligence Review, 11:193–225, 1997.
[9] Steve Chien, Jonathan Gratch, and Michael Burl. On the efficient allocation of resources for hypothesis evaluation: A statistical approach. IEEE Transactions on Pattern Analysis and Machine Intelligence, 17:652–665, 1995.
[10] Steve Chien, Andre Stechert, and Darren Mutz. Efficient heuristic hypothesis ranking. Journal of Artificial Intelligence Research, 10:375–397, 1999.
[11] Pedro Domingos and Geoff Hulten. A general method for scaling up machine learning algorithms and its application to clustering. In Proceedings of the Eighteenth International Conference on Machine Learning (ICML '01), pages 106–113. Morgan Kaufmann, 2001.
[12] Joseph K. Bradley and Robert Schapire. FilterBoost: Regression and classification on large datasets. In Advances in Neural Information Processing Systems 20, pages 185–192. MIT Press, 2008.
[13] Volodymyr Mnih, Csaba Szepesvári, and Jean-Yves Audibert. Empirical Bernstein stopping. In Proceedings of the 25th International Conference on Machine Learning (ICML '08), pages 672–679. ACM, 2008.
[14] Verena Heidrich-Meisner and Christian Igel. Hoeffding and Bernstein races for selecting policies in evolutionary direct policy search. In Proceedings of the 26th Annual International Conference on Machine Learning (ICML '09), pages 401–408. ACM, 2009.
[15] Jean-Yves Audibert, Rémi Munos, and Csaba Szepesvári. Tuning bandit algorithms in stochastic environments. In Proceedings of the 18th International Conference on Algorithmic Learning Theory (ALT '07), pages 150–165. Springer, 2007.
[16] Mauro Birattari. Tuning Metaheuristics: A Machine Learning Perspective. Springer, 2009.
