 
Fast Cross-Validation via Sequential Analysis
Tammo Krueger, Danny Panknin, Mikio Braun
Technische Universitaet Berlin, Machine Learning Group, 10587 Berlin
t.krueger@tu-berlin.de, {panknin|mikio}@cs.tu-berlin.de
Abstract
With the increasing size of today’s data sets, finding the right parameter configuration via cross-validation can be an extremely time-consuming task. In this paper we propose an improved cross-validation procedure which uses non-parametric testing coupled with sequential analysis to determine the best parameter set on linearly increasing subsets of the data. By eliminating underperforming candidates quickly and keeping promising candidates as long as possible, the method speeds up the computation while preserving the capability of the full cross-validation. The experimental evaluation shows that our method reduces the computation time by a factor of up to 70 compared to a full cross-validation with a negligible impact on the accuracy.
1 Introduction
Unarguably, a lot of computing time is spent on cross-validation [1] to tune free parameters of machine learning methods. While cross-validation can be parallelized easily with every instance evaluating a single candidate parameter setting, an enormous amount of computing resources is still spent on cross-validation, which could probably be put to better use in the actual learning methods. Just to give you an idea, if you perform five-fold cross-validation over two parameters, and you only take five candidates for each parameter, you have to train 125 times to perform the cross-validation. Thus, even a training time of one second becomes more than two minutes without parallelization.
In practice, almost no one performs cross-validation on the whole data set, though, as the parameters can often already be inferred reliably on a small subset of the data, thereby speeding up the computation time substantially. However, the choice of the subset depends a lot on the structure of the data set. If the subset is too small compared to the complexity of the learning task, the wrong parameter is chosen. Usually, researchers can tell from experience what subset sizes are necessary for specific learning problems, but one would like to have a robust method which is able to deal with a whole range of learning problems in an automatic fashion.
In this paper, we propose a method which is based on the sequential analysis framework to achieve exactly this: Speed up cross-validation by taking subsets of the data, while being robust with respect to different problem complexities. To achieve this, the method performs cross-validation on subsets of increasing size up to the full data set size, eliminating suboptimal parameter choices quickly. The statistical tests used for the elimination are tuned such that they try to retain promising parameters as long as possible to guard against unreliable measurements at small sample sizes.
In experiments, we show that even using such conservative tests, we can achieve significant speedups of typically 25 times up to 70 times, which translate to literally hours of computing time freed up on our clusters.
 
[Figure 1 graphic. Panels: pointwise performance matrix (configurations c_1 to c_k by data points d_1 to d_n, each row labeled top or flop); trace matrix (configurations by steps 1 to 10); sequential test plot with WINNER and LOSER regions; similarPerformance(·) check on the trace columns of steps 7 to 10. Annotation: maxSteps = 20.]
Figure 1: One step of the fast cross-validation procedure. Shown is the situation in step $s = 10$. (1) A model with modelSize data points is learned for each configuration ($c_1$ to $c_k$). Test errors are calculated on the current test set ($d_1$ to $d_n$) and transformed into a binary performance indicator. (2) Traces of configurations are filtered via sequential analysis ($c_1$ and $c_2$ are dropped). (3) At the end of each step the procedure checks whether the remaining configurations perform equally well in a time window and stops if this is the case (see Sec. 5 in the Appendix for a complete example run).
2 Fast Cross-Validation
We consider the usual supervised learning setting: We have a data set consisting of data points $d_1 = (X_1, Y_1), \ldots, d_N = (X_N, Y_N) \in \mathcal{X} \times \mathcal{Y}$ which we assume to be drawn i.i.d. from $P_{\mathcal{X} \times \mathcal{Y}}$. We have a learning algorithm $A$ which depends on several parameters $p$. The goal is to select the parameter $p$ such that the learned predictor $g$ has the best generalization error with respect to some loss function $\ell\colon \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$. Full $k$-fold cross-validation estimates the best parameter by splitting the data into $k$ parts, using $k - 1$ parts for training and estimating the error on the remaining part.
Our approach attempts to speed up the process by taking subsamples of size $[sN/\text{maxSteps}]$ for $1 \le s \le \text{maxSteps}$, starting with the full set of parameter candidates and eliminating clearly underperforming candidates at each step. Each execution of the main loop of the algorithm depicted in Figure 1 performs the following main parts given a subset of the data:
(1) The procedure transforms the pointwise test errors of the remaining configurations into a binary “top or flop” scheme.
(2) It drops significant loser configurations along the way using tests from the sequential analysis framework.
(3) Applying robust, distribution-free testing techniques allows for an early stopping of the procedure when we have seen enough data for a stable parameter estimation.
In the following we will discuss the individual steps in the algorithm.
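The subsample schedule described above can be sketched in a few lines. This is only an illustration: the function name `subset_sizes` is hypothetical, and reading the bracket in $[sN/\text{maxSteps}]$ as rounding down is an assumption, since the text does not say whether it denotes floor or nearest-integer rounding.

```python
def subset_sizes(n_points, max_steps):
    """Return the subsample size used in each step s = 1, ..., max_steps.

    Assumption: the bracket [s * N / maxSteps] is read as rounding down.
    """
    if n_points < max_steps:
        raise ValueError("need at least one data point per step")
    # Integer arithmetic avoids floating-point rounding surprises.
    return [(s * n_points) // max_steps for s in range(1, max_steps + 1)]


if __name__ == "__main__":
    # The sizes grow linearly and end at the full data set size.
    print(subset_sizes(1000, 10))
```

The last entry always equals the full data set size, which matches the statement that the procedure works on subsets "up to the full data set size".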
Robust Transformation of Test Errors: As the first step, the pointwise test errors for each configuration are transformed into a binary value encoding whether the configuration is among the best ones or not. We call this the “top or flop” scheme. This step abstracts from the underlying loss function or the scale of the errors, encoding the information whether a configuration looks promising for further analysis or not. From the point of view of statistical test theory, the question now is to find the $k$ top-performing configurations which show a similar behavior on all tested samples. Traditionally, this test could be performed using ANOVA; however, we propose to use the following non-parametric tests in order to increase robustness: For classification, we use the Cochran Q test [2] applied to the binary information whether a sample has been correctly classified or not. For regression problems we apply the Friedman test [3] directly on the residuals of the prediction. Note that both tests use a paired approach on the pointwise performance measure, thereby increasing the statistical power of the result (see Sec. 6 in the Appendix for a summary of these tests).
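For the classification case, the Cochran Q statistic can be computed directly from the binary correctness matrix. The sketch below uses the textbook form of the statistic; the function name `cochran_q` and the row/column layout are assumptions made for illustration, and the comparison against a chi-square quantile with (number of configurations − 1) degrees of freedom is left to the caller.

```python
def cochran_q(outcomes):
    """Textbook Cochran Q statistic for paired binary outcomes.

    outcomes[i][j] is 1 if configuration j classified sample i correctly,
    else 0 (this layout is an assumption of the sketch).
    Under the null hypothesis that all configurations perform equally,
    Q is approximately chi-square distributed with k - 1 degrees of freedom.
    """
    k = len(outcomes[0])  # number of configurations (columns)
    col_totals = [sum(row[j] for row in outcomes) for j in range(k)]
    row_totals = [sum(row) for row in outcomes]
    grand_total = sum(row_totals)
    denominator = k * grand_total - sum(r * r for r in row_totals)
    if denominator == 0:
        # Every sample is classified identically by all configurations,
        # so there is no evidence of any difference.
        return 0.0
    numerator = (k - 1) * (k * sum(c * c for c in col_totals) - grand_total ** 2)
    return numerator / denominator


if __name__ == "__main__":
    # Four samples, three configurations.
    print(cochran_q([[1, 1, 0], [1, 1, 0], [1, 0, 0], [1, 1, 1]]))
```

Because each row holds the outcomes of all configurations on the same sample, the statistic uses the paired structure that the text credits with increasing statistical power.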
Determining Significant Losers: Having transformed the test errors in a scale-independent top or flop scheme, we can now test whether a given parameter configuration is an overall loser. Sequential testing of binary random variables is addressed in the sequential analysis framework developed by Wald [4]. The main idea is the following: One observes a sequence of i.i.d. binary random variables $B_1, B_2, \ldots$, and one wants to test whether these variables are distributed according to $H_0\colon \pi_0$ or $H_1\colon \pi_1$ with $\pi_0 < \pi_1$. Both significance levels for the acceptance of $H_1$ and $H_0$ can be controlled via the meta-parameters $\alpha_l$ and $\beta_l$. The test computes the likelihood for the data observed so far and rejects one of the hypotheses when the respective likelihood ratio is larger than some factor controlled by the meta-parameters. It can be shown that the procedure has a very intuitive geometric representation, shown in Figure 1, lower left: The binary observations are recorded as cumulative sums at each time step. If this sum exceeds the upper red line, we accept $H_1$; if the sum is below the lower red line we accept $H_0$; if the sum stays between the two red lines we have to draw another sample. Since our main goal is to use the sequential test to eliminate underperformers, we choose the parameters $\pi_0$ and $\pi_1$ of the test such that the acceptance of $H_1$ (a configuration wins) is postponed as long as possible. At the same time, we want to maximize the area where configurations are eliminated (region denoted by “LOSER” in Fig. 1), rejecting as many loser configurations as possible along the way (see Sec. 1–3 in the Appendix for the concrete derivation of these parameters of the test).
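To make the sequential test concrete, here is a minimal sketch of Wald's sequential probability ratio test for binary observations in its standard textbook form. It is not the tuned variant derived in the Appendix: the function name `sprt_bernoulli`, the example values of `pi0`, `pi1`, `alpha` and `beta`, and the use of Wald's usual threshold approximations are all assumptions. Tracking the log-likelihood ratio against two constant thresholds is equivalent to tracking the cumulative sum against two parallel lines, which is the geometric picture described above.

```python
import math


def sprt_bernoulli(trace, pi0, pi1, alpha, beta):
    """Standard Wald SPRT on a binary trace (1 = top, 0 = flop).

    Assumed meaning of the parameters: alpha bounds the probability of
    accepting H1 when H0 holds, beta the probability of accepting H0
    when H1 holds. Returns (decision, step) with decision in
    {"H0", "H1", "continue"}; "H0" marks a loser, "H1" a winner.
    """
    if not 0 < pi0 < pi1 < 1:
        raise ValueError("need 0 < pi0 < pi1 < 1")
    upper = math.log((1 - beta) / alpha)   # crossing it accepts H1
    lower = math.log(beta / (1 - alpha))   # crossing it accepts H0
    gain_top = math.log(pi1 / pi0)               # contribution of a 1
    gain_flop = math.log((1 - pi1) / (1 - pi0))  # contribution of a 0
    llr = 0.0
    for step, observation in enumerate(trace, start=1):
        llr += gain_top if observation else gain_flop
        if llr >= upper:
            return "H1", step
        if llr <= lower:
            return "H0", step
    return "continue", len(trace)


if __name__ == "__main__":
    # A configuration that flops in every step is eliminated quickly.
    print(sprt_bernoulli([0, 0, 0, 0, 0], pi0=0.25, pi1=0.75, alpha=0.05, beta=0.05))
```

In the procedure described here, `pi0` and `pi1` would be chosen so that the "H1" branch is reached as late as possible while the "H0" region stays large; the symmetric values in the example are purely illustrative.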
Early Stopping and Final Winner: Finally, we employ an early stopping rule which takes the last earlyStoppingWindow columns from the trace matrix and checks whether all remaining configurations performed equally well in the past. If that is the case, the procedure is stopped. For the test, we again use the Cochran Q test, which is illustrated in Figure 1, lower right: the last three traces at step 10 are performing nearly optimally in a given window, but $c_3$ shows a significantly different behavior, so the test will indicate a significant effect and the procedure will go on. To determine the final winner after the procedure has stopped, we iteratively go back in time among all winning configurations in each step until we have found an exclusive winner. This way, we make most use of the data accumulated during the course of the procedure.
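The backward search for an exclusive winner can be read in more than one way; the sketch below shows one possible interpretation, not necessarily the authors' exact rule. Starting from the last step, it keeps only those candidates that were marked "top" and moves one step back while more than one candidate remains. The function name `final_winner` and the dictionary layout of the trace matrix are assumptions.

```python
def final_winner(traces):
    """Go back in time through the trace matrix until one winner remains.

    traces maps a configuration name to its list of binary entries
    (1 = top, 0 = flop), one entry per step; all lists have equal length.
    Returns a sorted list, which holds more than one name only if the
    tie cannot be broken with the available steps.
    """
    if not traces:
        return []
    candidates = set(traces)
    n_steps = len(next(iter(traces.values())))
    for step in range(n_steps - 1, -1, -1):
        tops = {c for c in candidates if traces[c][step] == 1}
        if tops:
            # Only narrow the field when at least one candidate was top.
            candidates = tops
        if len(candidates) == 1:
            break
    return sorted(candidates)


if __name__ == "__main__":
    example = {"a": [0, 1, 1, 1], "b": [1, 1, 0, 1], "c": [1, 0, 1, 1]}
    print(final_winner(example))
```

Returning a list rather than a single name keeps unresolved ties visible to the caller, who can then apply any tie-breaking rule of their choice.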
Efficient Parallelization: As for normal cross-validation, the parallelization setup for the fast cross-validation procedure is a solid map-reduce scheme: the model of each remaining configuration in each step of the procedure can be calculated on a single cluster node. Only the results of the model on the data points $d_1, d_2, \ldots, d_n$ have to be transferred back to a central instance to calculate the binary “top or flop” scheme. This central reduce node will then update the trace matrix accordingly and test for significant losers. After eliminating underperforming configurations, the early stopping rule checks whether the procedure will iterate once more, in which case the remaining configurations are scheduled on the cluster. This stepwise elimination of underperforming configurations will result in a significant speed-up, as will be shown in the next section.
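The map-reduce structure of one step can be sketched on a single machine with a thread pool standing in for the cluster. Everything specific here is a hypothetical stand-in: `pointwise_errors` replaces training and evaluating a real model with a toy one-parameter predictor, and `run_step` only gathers the pointwise results that the central node would then turn into the top or flop scheme.

```python
from concurrent.futures import ThreadPoolExecutor


def pointwise_errors(config, test_points):
    """Map task (toy stand-in): absolute error of the predictor
    x -> config * x on every test point (x, y)."""
    return [abs(y - config * x) for x, y in test_points]


def run_step(configs, test_points, max_workers=4):
    """Evaluate all remaining configurations in parallel and collect the
    pointwise results centrally, keyed by configuration."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves the order of configs in its results.
        results = list(pool.map(lambda c: pointwise_errors(c, test_points), configs))
    return dict(zip(configs, results))


if __name__ == "__main__":
    points = [(1, 2), (2, 4)]
    print(run_step([1, 2, 3], points))
```

Only the per-point results travel back to the caller, mirroring the observation that just the model outputs on the test points need to be transferred to the central instance.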
3 Experiments
In this section we will explore the performance of the fast cross-validation procedure on real-world data sets: First we use the benchmark repository as introduced by Rätsch et al. [5]. We split each data set in two halves, using one half for the parameter estimation via full and fast cross-validation and the other half for the calculation of the test error. Additionally we use the covertype data set [6]: After scaling the data we use the first two classes with the most entries and follow the procedure of the paper in sampling 2,000 data points of each class for the model learning and estimate the test error on the remaining data points. For all setups we use a $\nu$-SVM with Gaussian kernel using 610 parameter configurations ($\sigma \in [-3, 3]$, $\nu \in [0.05, 0.5]$). The fast cross-validation procedure is carried out with 10 steps (fast), once with the early stopping rule and once without. For each data set we repeat the process 50 times, each with a different split.
Figure 2 shows that the speed improvement of the fast setup with early stopping often ranges between 20 and 30 and even reaches up to 70 for the covertype data set. Without the early stopping rule the speed gain drops, but for most data sets it stays between 10 and 20. The absolute test error difference of the fast cross-validation procedure compared to the normal cross-validation almost always stays below 1 percentage point (data in Sec. 4 in the Appendix). These results illustrate,
