regression problems we apply the Friedman test [3] directly on the residuals of the prediction. Note that both tests use a paired approach on the pointwise performance measure, thereby increasing the statistical power of the result (see Sec. 6 in the Appendix for a summary of these tests).
Determining Significant Losers:
Having transformed the test errors into a scale-independent top-or-flop scheme, we can now test whether a given parameter configuration is an overall loser. Sequential testing of binary random variables is addressed in the sequential analysis framework developed by Wald [4]. The main idea is the following: One observes a sequence of i.i.d. binary random variables $Z_1, Z_2, \ldots$, and one wants to test whether these variables are distributed according to $H_0: \pi_0$ or $H_1: \pi_1$ with $\pi_0 < \pi_1$. Both significance levels for the acceptance of $H_1$ and $H_0$ can be controlled via the meta-parameters $\alpha_l$ and $\beta_l$. The test computes the likelihood of the data observed so far and rejects one of the hypotheses when the respective likelihood ratio is larger than some factor controlled by the meta-parameters. It can be shown that the procedure has a very intuitive geometric representation, shown in Figure 1, lower left: The binary observations are recorded as cumulative sums at each time step. If this sum exceeds the upper red line, we accept $H_1$; if the sum is below the lower red line, we accept $H_0$; if the sum stays between the two red lines, we have to draw another sample. Since our main goal is to use the sequential test to eliminate underperformers, we choose the parameters $\pi_0$ and $\pi_1$ of the test such that the acceptance of $H_1$ (a configuration wins) is postponed as long as possible. At the same time, we want to maximize the area where configurations are eliminated (region denoted by “LOSER” in Fig. 1), rejecting as many loser configurations as possible along the way (see Sec. 1–3 in the Appendix for the concrete derivation of these parameters of the test).
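The geometric picture of the sequential test can be sketched in a few lines of Python. This is a minimal illustration of Wald's sequential probability ratio test for binary observations, not the parameter choice derived in the Appendix: the function names are hypothetical, and we assume that the two meta-parameters play the role of the error levels `alpha` and `beta` in Wald's standard thresholds $\log((1-\beta)/\alpha)$ and $\log(\beta/(1-\alpha))$.

```python
import math

def sprt_boundaries(pi0, pi1, alpha, beta):
    """Return (slope, lower_intercept, upper_intercept) of the two parallel
    decision lines for the cumulative sum of binary observations.
    Assumption: alpha and beta enter via Wald's standard thresholds."""
    if not 0.0 < pi0 < pi1 < 1.0:
        raise ValueError("need 0 < pi0 < pi1 < 1")
    # Log-likelihood ratio after n draws with sum s:
    #   s * denom - n * log((1 - pi0) / (1 - pi1))
    denom = math.log(pi1 / pi0) + math.log((1 - pi0) / (1 - pi1))
    slope = math.log((1 - pi0) / (1 - pi1)) / denom
    upper = math.log((1 - beta) / alpha) / denom
    lower = math.log(beta / (1 - alpha)) / denom
    return slope, lower, upper

def sprt_decision(observations, pi0, pi1, alpha, beta):
    """Walk through the 0/1 observations and report the first line crossed.
    Returns ("H1", n), ("H0", n) or ("continue", n)."""
    slope, lower, upper = sprt_boundaries(pi0, pi1, alpha, beta)
    total = 0
    for n, z in enumerate(observations, start=1):
        total += z
        if total >= upper + slope * n:   # above the upper line: accept H1
            return "H1", n
        if total <= lower + slope * n:   # below the lower line: accept H0
            return "H0", n
    return "continue", len(observations)  # between the lines: draw again

if __name__ == "__main__":
    print(sprt_decision([0, 0, 0], 0.1, 0.9, 0.05, 0.05))
    print(sprt_decision([1, 0, 1, 0], 0.1, 0.9, 0.05, 0.05))
```

In the setting of the procedure, each observation would be the top-or-flop entry of one configuration at one step, and a decision for $H_0$ would mark that configuration as a loser.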
Early Stopping and Final Winner:
Finally, we employ an early stopping rule which takes the last earlyStoppingWindow columns from the trace matrix and checks whether all remaining configurations performed equally well in the past. If that is the case, the procedure is stopped. For the test, we again use the Cochran Q test, which is illustrated in Figure 1, lower right: the last three traces at step 10 perform nearly optimally in the given window, but $c_3$ shows a significantly different behavior, so the test will indicate a significant effect and the procedure will go on. To determine the final winner after the procedure has stopped, we iteratively go back in time among all winning configurations in each step until we have found an exclusive winner. This way, we make the most use of the data accumulated during the course of the procedure.
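The early stopping check can be illustrated with a small, self-contained Python sketch. It assumes that the trace matrix holds one row of 0/1 entries per remaining configuration and one column per step, that the configurations act as the treatments and the steps in the window as the blocks of the Cochran Q test, and that a significance level of 0.05 is used; the function names and these choices are illustrative assumptions, not taken from the procedure itself.

```python
import math

def chi2_sf(x, df):
    """Upper tail probability of the chi-square distribution for integer df,
    using the closed-form finite series (even df) or erfc plus series (odd df)."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        term, total = 1.0, 1.0
        for i in range(1, df // 2):
            term *= half / i
            total += term
        return math.exp(-half) * total
    total, term = 0.0, math.sqrt(x)
    for r in range(1, (df - 1) // 2 + 1):
        total += term
        term *= x / (2 * r + 1)
    return math.erfc(math.sqrt(half)) + math.sqrt(2 / math.pi) * math.exp(-half) * total

def cochran_q(window):
    """window[i][t] is the 0/1 entry of configuration i at step t.
    Returns (Q, p_value) with configurations as treatments, steps as blocks."""
    k = len(window)
    if k < 2:
        return 0.0, 1.0
    config_totals = [sum(row) for row in window]
    step_totals = [sum(col) for col in zip(*window)]
    n = sum(config_totals)
    denom = k * n - sum(r * r for r in step_totals)
    if denom == 0:  # every step is all-0 or all-1: no evidence of a difference
        return 0.0, 1.0
    q = (k - 1) * (k * sum(c * c for c in config_totals) - n * n) / denom
    return q, chi2_sf(q, k - 1)

def should_stop(trace, early_stopping_window, alpha=0.05):
    """Stop if the test finds no significant difference in the last columns."""
    window = [row[-early_stopping_window:] for row in trace]
    _, p_value = cochran_q(window)
    return p_value >= alpha

if __name__ == "__main__":
    trace = [[1, 1, 1, 1], [0, 1, 1, 1], [1, 0, 0, 0]]
    print(cochran_q([row[-3:] for row in trace]), should_stop(trace, 3))
```

In the example, the third configuration deviates from the other two in the last three steps, so the test reports a significant effect and the procedure would continue.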
Efficient Parallelization:
As for normal cross-validation, the parallelization setup for the fast cross-validation procedure is a solid map-reduce scheme: the model of each remaining configuration in each step of the procedure can be calculated on a single cluster node. Only the results of the model on the data points $d_1, d_2, \ldots, d_n$ have to be transferred back to a central instance to calculate the binary “top or flop” scheme. This central reduce node will then update the trace matrix accordingly and test for significant losers. After eliminating underperforming configurations, the early stopping rule checks whether the procedure will iterate once more and schedule the remaining configurations on the cluster. This stepwise elimination of underperforming configurations will result in a significant speed-up, as will be shown in the next section.
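One step of this map-reduce scheme might look as follows in Python. The sketch is only meant to show the data flow: a thread pool stands in for the cluster nodes, `pointwise_losses` is a hypothetical placeholder for training a model and evaluating it on the data points, and `top_or_flop` replaces the significance test of the procedure with a simple tolerance rule on the mean loss. None of these names or simplifications come from the procedure itself.

```python
from concurrent.futures import ThreadPoolExecutor

def pointwise_losses(config, data):
    """Map step (placeholder): pretend to train a model with `config`
    and return its loss on every data point d_1, ..., d_n."""
    return [abs(config - d) for d in data]

def top_or_flop(losses_per_config, tolerance=0.1):
    """Reduce step (placeholder for the significance test): a configuration
    is 'top' (1) if its mean loss is within `tolerance` of the best one."""
    means = [sum(losses) / len(losses) for losses in losses_per_config]
    best = min(means)
    return [1 if m - best <= tolerance else 0 for m in means]

def run_step(active_configs, data, trace, max_workers=4):
    """Evaluate all remaining configurations in parallel, then update the
    trace (one list of 0/1 entries per configuration) on the central node."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Only the pointwise results travel back to the central instance.
        losses = list(pool.map(lambda c: pointwise_losses(c, data), active_configs))
    for config, flag in zip(active_configs, top_or_flop(losses)):
        trace.setdefault(config, []).append(flag)
    return trace

if __name__ == "__main__":
    print(run_step([1.0, 1.05, 3.0], [1.0, 1.0, 1.0], {}))
```

After each such step, the central node would run the loser test and the early stopping check on the updated trace and reschedule only the surviving configurations.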
3 Experiments
In this section we explore the performance of the fast cross-validation procedure on real-world data sets: First we use the benchmark repository introduced by Rätsch et al. [5]. We split each data set into two halves, using one half for the parameter estimation via full and fast cross-validation and the other half for the calculation of the test error. Additionally we use the covertype data set [6]: After scaling the data we use the two classes with the most entries and follow the procedure of the paper in sampling 2,000 data points of each class for model learning, and estimate the test error on the remaining data points. For all setups we use a $\nu$-SVM with Gaussian kernel using 610 parameter configurations ($\sigma \in [-3, 3]$, $\nu \in [0.05, 0.5]$). The fast cross-validation procedure is carried out with 10 steps (fast), once with the early stopping rule and once without. For each data set we repeat the process 50 times, each with a different split.

Figure 2 shows that the speed improvement of the fast setup with early stopping often ranges between 20 and 30 and even reaches up to 70 for the covertype data set. Without the early stopping rule the speed gain drops, but for most data sets it stays between 10 and 20. The absolute test error difference of the fast cross-validation procedure compared to the normal cross-validation almost always stays below 1 percentage point (data in Sec. 4 in the Appendix). These results illustrate,