
Fast Cross-Validation via Sequential Analysis

Tammo Krueger, Danny Panknin, Mikio Braun Technische Universitaet Berlin Machine Learning Group 10587 Berlin, {panknin|mikio}

With the increasing size of today's data sets, finding the right parameter configuration via cross-validation can be an extremely time-consuming task. In this paper we propose an improved cross-validation procedure which uses non-parametric testing coupled with sequential analysis to determine the best parameter set on linearly increasing subsets of the data. By eliminating underperforming candidates quickly and keeping promising candidates as long as possible, the method speeds up the computation while preserving the capability of the full cross-validation. The experimental evaluation shows that our method reduces the computation time by a factor of up to 70 compared to a full cross-validation with a negligible impact on the accuracy.


Unarguably, a lot of computing time is spent on cross-validation [1] to tune free parameters of machine learning methods. While cross-validation can be parallelized easily with every instance evaluating a single candidate parameter setting, an enormous amount of computing resources is still spent on cross-validation, which could probably be put to better use in the actual learning methods. Just to give you an idea, if you perform five-fold cross-validation over two parameters, and you only take five candidates for each parameter, you have to train 125 times to perform the cross-validation. Thus, even a training time of one second becomes more than two minutes without parallelization.

In practice, almost no one performs cross-validation on the whole data set, though, as the parameters can often already be inferred reliably on a small subset of the data, thereby speeding up the computation time substantially. However, the choice of the subset depends a lot on the structure of the data set. If the subset is too small compared to the complexity of the learning task, the wrong parameter is chosen. Usually, researchers can tell from experience what subset sizes are necessary for specific learning problems, but one would like to have a robust method which is able to deal with a whole range of learning problems in an automatic fashion.

In this paper, we propose a method which is based on the sequential analysis framework to achieve exactly this: speed up cross-validation by taking subsets of the data, while being robust with respect to different problem complexities. To achieve this, the method performs cross-validation on subsets of increasing size up to the full data set size, eliminating suboptimal parameter choices quickly. The statistical tests used for the elimination are tuned such that they try to retain promising parameters as long as possible to guard against unreliable measurements at small sample sizes. In experiments, we show that even using such conservative tests, we can achieve significant speed-ups of typically 25 times up to 70 times, which translate to literally hours of computing time freed up on our clusters.

[Figure 1 graphic not reproduced. Panels: the pointwisePerformance matrix (data points d1 to dn by configurations c1 to ck), the top/flop indicators per configuration, the trace matrix (configurations by steps), the sequential test on the cumulative sum of tops plotted against the step, and the similarPerformance() check on the remaining configurations; maxSteps = 20.]

Figure 1: One step of the fast cross-validation procedure. Shown is the situation in step $s = 10$. A model with modelSize data points is learned for each configuration ($c_1$ to $c_k$). Test errors are calculated on the current test set ($d_1$ to $d_n$) and transformed into a binary performance indicator. Traces of configurations are filtered via sequential analysis ($c_1$ and $c_2$ are dropped). At the end of each step the procedure checks whether the remaining configurations perform equally well in a time window and stops if this is the case (see Sec. 5 in the Appendix for a complete example run).

Fast Cross-Validation

We consider the usual supervised learning setting: We have a data set consisting of data points $d_1 = (X_1, Y_1), \ldots, d_N = (X_N, Y_N) \in \mathcal{X} \times \mathcal{Y}$ which we assume to be drawn i.i.d. from $P_{\mathcal{X} \times \mathcal{Y}}$. We have a learning algorithm $A$ which depends on several parameters $p$. The goal is to select the parameter $p$ such that the learned predictor $g$ has the best generalization error with respect to some loss function $\ell: \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$.

Full $k$-fold cross-validation estimates the best parameter by splitting the data into $k$ parts, using $k - 1$ parts for training and estimating the error on the remaining part. Our approach attempts to speed up the process by taking subsamples of size $[sN/\mathrm{maxSteps}]$ for $1 \le s \le \mathrm{maxSteps}$, starting with the full set of parameter candidates and eliminating clearly underperforming candidates at each step.

Each execution of the main loop of the algorithm depicted in Figure 1 performs the following main parts given a subset of the data: The procedure transforms the pointwise test errors of the remaining configurations into a binary top or flop scheme. It drops significant loser configurations along the way using tests from the sequential analysis framework. Applying robust, distribution-free testing techniques allows for an early stopping of the procedure when we have seen enough data for a stable parameter estimation. In the following we will discuss the individual steps in the algorithm.

Robust Transformation of Test Errors: As the first step, the pointwise test errors for each configuration are transformed into a binary value encoding whether the configuration is among the best ones or not. We call this the top or flop scheme. This step abstracts from the underlying loss function and the scale of the errors, encoding only whether a configuration looks promising for further analysis or not. From the point of view of statistical test theory, the question now is to find the $k$ top-performing configurations which show a similar behavior on all tested samples.
Traditionally, this test could be performed using ANOVA; however, we propose to use the following non-parametric tests in order to increase robustness: For classification, we use the Cochran Q test [2] applied to the binary information whether a sample has been correctly classified or not. For regression problems we apply the Friedman test [3] directly on the residuals of the prediction. Note that both tests use a paired approach on the pointwise performance measure, thereby increasing the statistical power of the result (see Sec. 6 in the Appendix for a summary of these tests).

Determining Significant Losers: Having transformed the test errors into a scale-independent top or flop scheme, we can now test whether a given parameter configuration is an overall loser. Sequential testing of binary random variables is addressed in the sequential analysis framework developed by Wald [4]. The main idea is the following: One observes a sequence of i.i.d. binary random variables $Z_1, Z_2, \ldots$, and one wants to test whether these variables are distributed according to $H_0: \pi_0$ or $H_1: \pi_1$ with $\pi_0 < \pi_1$. Both significance levels for the acceptance of $H_1$ and $H_0$ can be controlled via the meta-parameters $\alpha_l$ and $\beta_l$. The test computes the likelihood for the data observed so far and rejects one of the hypotheses when the respective likelihood ratio is larger than some factor controlled by the meta-parameters. It can be shown that the procedure has a very intuitive geometric representation, shown in Figure 1, lower left: The binary observations are recorded as cumulative sums at each time step. If this sum exceeds the upper red line, we accept $H_1$; if the sum is below the lower red line, we accept $H_0$; if the sum stays between the two red lines, we have to draw another sample.

Since our main goal is to use the sequential test to eliminate underperformers, we choose the parameters $\pi_0$ and $\pi_1$ of the test such that accepting $H_1$ (a configuration wins) is postponed as long as possible. At the same time, we want to maximize the area where configurations are eliminated (region denoted by LOSER in Fig. 1), rejecting as many loser configurations as possible along the way (see Sec. 13 in the Appendix for the concrete derivation of these parameters of the test).
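To make the geometric picture concrete, the following minimal Python sketch implements the textbook form of Wald's sequential probability ratio test for binary observations. It is an illustration under stated assumptions, not the tuned test of the procedure: the function name sprt_bernoulli, the error levels alpha and beta, and the example values of pi0 and pi1 are hypothetical and do not correspond to the parameters derived in the Appendix. Comparing the log-likelihood ratio with two constant thresholds is equivalent to comparing the cumulative sum of the observations with two parallel lines.

```python
import math


def sprt_bernoulli(observations, pi0, pi1, alpha=0.05, beta=0.05):
    """Textbook Wald SPRT for H0: p = pi0 versus H1: p = pi1 (pi0 < pi1).

    observations: iterable of 0/1 values (flop/top).
    alpha, beta: assumed error levels for wrongly accepting H1 / H0.
    Returns (decision, number_of_observations_used).
    """
    if not 0.0 < pi0 < pi1 < 1.0:
        raise ValueError("need 0 < pi0 < pi1 < 1")

    upper = math.log((1.0 - beta) / alpha)  # reaching it accepts H1
    lower = math.log(beta / (1.0 - alpha))  # reaching it accepts H0

    # Contribution of a single observation to the log-likelihood ratio.
    step_top = math.log(pi1 / pi0)
    step_flop = math.log((1.0 - pi1) / (1.0 - pi0))

    llr = 0.0
    used = 0
    for z in observations:
        used += 1
        llr += step_top if z else step_flop
        if llr >= upper:
            return "accept H1", used
        if llr <= lower:
            return "accept H0", used
    # Still between the two boundaries: another sample is needed.
    return "continue", used


# A trace consisting only of flops is classified as a loser quickly,
# while an alternating trace stays undecided.
print(sprt_bernoulli([0, 0, 0, 0], pi0=0.1, pi1=0.9))
print(sprt_bernoulli([1, 0, 1, 0], pi0=0.1, pi1=0.9))
```

With pi0 and pi1 far apart, as in this toy call, decisions are reached after very few observations; moving the two values closer together widens the region in which the test keeps sampling.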
Early Stopping and Final Winner: Finally, we employ an early stopping rule which takes the last earlyStoppingWindow columns from the trace matrix and checks whether all remaining configurations performed equally well in the past. If that is the case, the procedure is stopped. For the test, we again use the Cochran Q test, which is illustrated in Figure 1, lower right: the last three traces at step 10 perform nearly optimally in the given window, but $c_3$ shows a significantly different behavior, so the test will indicate a significant effect and the procedure will go on. To determine the final winner after the procedure has stopped, we iteratively go back in time among all winning configurations in each step until we have found an exclusive winner. This way, we make the most use of the data accumulated during the course of the procedure.

Efficient Parallelization: As for normal cross-validation, the parallelization setup for the fast cross-validation procedure is a solid map-reduce scheme: the model of each remaining configuration in each step of the procedure can be calculated on a single cluster node. Only the results of the model on the data points $d_1, d_2, \ldots, d_n$ have to be transferred back to a central instance to calculate the binary top or flop scheme. This central reduce node will then update the trace matrix accordingly and test for significant losers. After eliminating underperforming configurations, the early stopping rule checks whether the procedure will iterate once more and, if so, schedules the remaining configurations on the cluster. This stepwise elimination of underperforming configurations will result in a significant speed-up, as will be shown in the next section.
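The early stopping check described above can be sketched as follows. This is a minimal pure-Python illustration, assuming the trace matrix is stored as one 0/1 list per remaining configuration. The helper names cochran_q and similar_performance, the window length, and the fixed chi-square critical value are assumptions made for the example (5.991 is the 95% quantile for two degrees of freedom, i.e. three configurations); for windows this short the chi-square approximation is only rough.

```python
def cochran_q(blocks):
    """Cochran's Q statistic.

    blocks: list of rows, one per block (here: one per step), each row
    holding one 0/1 outcome per treatment (here: per configuration).
    Under the null hypothesis of equal performance, Q is approximately
    chi-square distributed with (number of treatments - 1) degrees of freedom.
    """
    k = len(blocks[0])
    col_totals = [sum(row[j] for row in blocks) for j in range(k)]
    row_totals = [sum(row) for row in blocks]
    total = sum(row_totals)
    denom = k * total - sum(r * r for r in row_totals)
    if denom == 0:
        # Every step is all tops or all flops: no evidence of a difference.
        return 0.0
    return (k - 1) * (k * sum(c * c for c in col_totals) - total ** 2) / denom


def similar_performance(trace, window, critical_value):
    """True if the configurations look equally good in the last `window` steps.

    trace: one 0/1 list per remaining configuration (rows of the trace matrix).
    critical_value: assumed chi-square quantile chosen by the caller.
    """
    recent = [row[-window:] for row in trace]
    blocks = [list(step) for step in zip(*recent)]  # steps x configurations
    return cochran_q(blocks) < critical_value


# Toy trace matrices (hypothetical values, three configurations each).
alike = [
    [0, 0, 1, 1, 1, 1],
    [0, 1, 1, 0, 1, 1],
    [1, 0, 1, 1, 1, 1],
]
mixed = [
    [0, 0, 1, 1, 1, 1],
    [1, 0, 1, 1, 1, 1],
    [1, 1, 0, 0, 0, 0],
]
print(similar_performance(alike, window=4, critical_value=5.991))  # stop early
print(similar_performance(mixed, window=4, critical_value=5.991))  # keep going
```

Passing the critical value explicitly keeps the sketch free of dependencies; a statistics library could be used instead to turn the statistic into a p-value.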


In this section we will explore the performance of the fast cross-validation procedure on real-world data sets: First we use the benchmark repository as introduced by Rätsch et al. [5]. We split each data set into two halves, using one half for the parameter estimation via full and fast cross-validation and the other half for the calculation of the test error. Additionally we use the covertype data set [6]: After scaling the data we use the first two classes with the most entries and follow the procedure of the paper in sampling 2,000 data points of each class for the model learning and estimate the test error on the remaining data points. For all setups we use an SVM with Gaussian kernel using 610 parameter configurations ( [3, 3], [0.05, 0.5]). The fast cross-validation procedure is carried out with 10 steps (fast), once with the early stopping rule and once without. For each data set we repeat the process 50 times, each with a different split.

Figure 2 shows that the speed improvement of the fast setup with early stopping often ranges between 20 and 30 and reaches up to 70 for the covertype data set. Without the early stopping rule the speed gain drops, but for most data sets it stays between 10 and 20. The absolute test error difference of the fast cross-validation procedure compared to the normal cross-validation almost always stays below 1 percentage point (data in Sec. 4 in the Appendix). These results illustrate that the huge speed improvement of the fast cross-validation comes at a very low price in terms of absolute test error difference.

[Figure 2 graphic not reproduced. Axis: Relative Speed Factor (full/fast), labeled Relative Speedup; legend entry: fast/early; data sets: banana, breastCancer, diabetis, flareSolar, german, image, ringnorm, splice, thyroid, twonorm, waveform, covertype.]

Figure 2: Distribution of relative speed gains of the fast cross-validation on the benchmark data sets.

Related Work

Using statistical tests in order to speed up learning has been the topic of several lines of research. However, the existing body of work mostly focuses on reducing the number of test evaluations, while we focus on the overall process of eliminating candidates themselves. To the best of our knowledge, this is a new concept and can apparently be combined with the already available racing techniques to further reduce the total calculation time.

Maron and Moore introduce the so-called Hoeffding Races [7, 8], which are based on the non-parametric Hoeffding bound for the mean of the test error. At each step of the algorithm a new test point is evaluated by all remaining models and the confidence intervals of the test errors are updated accordingly. Models whose confidence interval of the test error lies outside of at least one interval of a better performing model are dropped. Chien et al. [9, 10] devise a similar range of algorithms using concepts of PAC learning and game theory: different hypotheses are ordered by their expected utility according to the test data the algorithm has seen so far. This concept of racing is further extended by Domingos and Hulten [11]: By introducing an upper bound for the learner's loss as a function of the examples, the procedure allows for an early stopping of the learning process if the loss is nearly as optimal as for infinite data. While Bradley and Schapire [12] use similar concepts in the context of boosting (FilterBoost), Mnih et al. [13] introduce the empirical Bernstein bounds to extend both the FilterBoost framework and the racing algorithms. In both cases the bounds are used to estimate the error within a specific region with a given probability. These racing concepts are applied in a wide variety of domains like reinforcement learning [14], multi-armed bandit problems [15], and timetabling [16], showing the relevance of the topic.
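For comparison, the elimination rule of a Hoeffding race can be sketched in a few lines. The sketch assumes test errors bounded in an interval of width error_range and uses the two-sided Hoeffding bound for the confidence radius; the function names, the model labels, and the choice delta = 0.05 are hypothetical, and the code illustrates the general idea rather than the exact algorithm of [7, 8].

```python
import math


def hoeffding_radius(n, delta, error_range=1.0):
    """Half-width of a (1 - delta) confidence interval for a mean of n
    bounded values, from the two-sided Hoeffding inequality."""
    return error_range * math.sqrt(math.log(2.0 / delta) / (2.0 * n))


def race_step(mean_errors, n, delta=0.05, error_range=1.0):
    """Drop models whose interval lies entirely above the best model's interval.

    mean_errors: dict mapping a model label to its mean test error over the
    n test points seen so far. Returns the dict of surviving models.
    """
    eps = hoeffding_radius(n, delta, error_range)
    best_upper = min(mean_errors.values()) + eps
    return {m: e for m, e in mean_errors.items() if e - eps <= best_upper}


# Hypothetical mean errors of three models.
errors = {"a": 0.10, "b": 0.15, "c": 0.40}
print(sorted(race_step(errors, n=10)))   # intervals still overlap
print(sorted(race_step(errors, n=200)))  # the clearly worse model is dropped
```

In contrast to this rule, which shrinks intervals as more test points are evaluated, the procedure proposed here eliminates candidates across training subsets of increasing size.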

Conclusion and Further Work

We have proposed a procedure to significantly accelerate cross-validation by performing it on subsets of increasing size and eliminating underperforming candidates. We first transform the cross-validation problem into a binary trace matrix which contains the winners/losers for each configuration for each subset size. To speed up cross-validation, the goal is to identify overall losers as early as possible. Note that the distribution of the matrix is very complex and in general unknown, as it depends on the data distribution, the learning algorithm, and the sample sizes. We can assume that the distribution of the columns of the matrix converges as the sample size becomes larger, but there may also be significant shifts in what the top candidates are at smaller sample sizes. Our approach is therefore a first step towards solving the problem by applying robust testing and the sequential analysis framework which makes several simplifying assumptions. To better understand the true distribution of the problem is an interesting question for future research.

Acknowledgments: This work is generously funded by the BMBF project ALICE (01IB10003B).





[1] Sylvain Arlot and Alain Celisse. A survey of cross-validation procedures for model selection. Statistics Surveys, 4:40-79, 2010.
[2] W. G. Cochran. The comparison of percentages in matched samples. Biometrika, 37(3-4):256-266, 1950.
[3] Milton Friedman. The use of ranks to avoid the assumption of normality implicit in the analysis of variance. Journal of the American Statistical Association, 32(200):675-701, 1937.
[4] Abraham Wald. Sequential Analysis. Wiley, 1947.
[5] G. Rätsch, T. Onoda, and K.-R. Müller. Soft margins for AdaBoost. Machine Learning, 42(3):287-320, 2001.
[6] J. A. Blackard and D. J. Dean. Comparative accuracies of artificial neural networks and discriminant analysis in predicting forest cover types from cartographic variables. Computers and Electronics in Agriculture, 24:131-151, 1999.
[7] Oded Maron and Andrew W. Moore. Hoeffding races: Accelerating model selection search for classification and function approximation. In Advances in Neural Information Processing Systems 6, pages 59-66. Morgan Kaufmann, 1994.
[8] Oded Maron and Andrew W. Moore. The racing algorithm: Model selection for lazy learners. Artif. Intell. Rev., 11:193-225, February 1997.
[9] Steve Chien, Jonathan Gratch, and Michael Burl. On the efficient allocation of resources for hypothesis evaluation: A statistical approach. IEEE Trans. Pattern Anal. Mach. Intell., 17:652-665, July 1995.
[10] Steve Chien, Andre Stechert, and Darren Mutz. Efficient heuristic hypothesis ranking. J. Artif. Int. Res., 10:375-397, June 1999.
[11] Pedro Domingos and Geoff Hulten. A general method for scaling up machine learning algorithms and its application to clustering. In Proceedings of the Eighteenth International Conference on Machine Learning, ICML '01, pages 106-113, San Francisco, CA, USA, 2001. Morgan Kaufmann Publishers Inc.
[12] Joseph K. Bradley and Robert Schapire. FilterBoost: Regression and classification on large datasets. In J.C. Platt, D. Koller, Y. Singer, and S. Roweis, editors, Advances in Neural Information Processing Systems 20, pages 185-192, Cambridge, MA, 2008. MIT Press.
[13] Volodymyr Mnih, Csaba Szepesvári, and Jean-Yves Audibert. Empirical Bernstein stopping. In Proceedings of the 25th International Conference on Machine Learning, ICML '08, pages 672-679, New York, NY, USA, 2008. ACM.
[14] Verena Heidrich-Meisner and Christian Igel. Hoeffding and Bernstein races for selecting policies in evolutionary direct policy search. In Proceedings of the 26th Annual International Conference on Machine Learning, ICML '09, pages 401-408, New York, NY, USA, 2009. ACM.
[15] Jean-Yves Audibert, Rémi Munos, and Csaba Szepesvári. Tuning bandit algorithms in stochastic environments. In Proceedings of the 18th International Conference on Algorithmic Learning Theory, ALT '07, pages 150-165, Berlin, Heidelberg, 2007. Springer-Verlag.
[16] Mauro Birattari. Tuning Metaheuristics: A Machine Learning Perspective. Springer, 2009.