Appendix to Fast Cross-Validation via Sequential Analysis
Tammo Krueger, Danny Panknin, Mikio Braun
Technische Universitaet Berlin, Machine Learning Group, 10587 Berlin
t.krueger@tu-berlin.de, {panknin|mikio}@cs.tu-berlin.de
1 Selection of Meta-Parameters for the Fast Cross-validation
The algorithm has a number of free parameters, as can be seen from the pseudo-code in Algorithm 1: maxSteps, the number of subsample sizes to consider; $\alpha$, the significance level for the binarization of the test errors; $\alpha_l$ and $\beta_l$, the significance levels for the sequential analysis test; and earlyStoppingWindow, the number of steps to look back in the early stopping procedure. While we give an in-depth treatment of the selection of $\pi_0$, $\pi_1$ and the maxSteps parameter in the following sections, we give some suggestions for the other parameters here. The parameter $\alpha$ controls the significance level in each step of the test for similar behavior. We suggest setting it to the usual level of $\alpha = 0.05$. Furthermore, $\beta_l$ and $\alpha_l$ control the significance levels of $H_0$ (configuration is a loser) and $H_1$ (configuration is a winner), respectively. We suggest an asymmetric setup: $\beta_l = 0.1$, since we want to drop loser configurations relatively fast, and $\alpha_l = 0.01$, since we want to be really sure when we accept a configuration as overall winner. Finally, we set earlyStoppingWindow to 3 for maxSteps $= 10$ and to 6 for maxSteps $= 20$, as we have observed that this choice works well in practice.
1.1 Choosing the Optimal Sequential Test Parameters
As outlined in the main part of the paper, we want to use the sequential testing framework to eliminate underperforming configurations as fast as possible while postponing the decision for a winner as long as possible. Using the parameters of the sequential testing framework, we have to choose $\pi_0$ and $\pi_1$ such that the area of acceptance for $H_0$ (region $\Delta_{H_0}(\pi_0, \pi_1, \beta_l, \alpha_l)$, denoted by “LOSER” in the overview figure) is maximized, while the earliest point of acceptance of $H_1$ ($S_a(\pi_0, \pi_1, \beta_l, \alpha_l)$ in the overview figure) is postponed until the procedure has run at least maxSteps steps:

$$(\pi_0, \pi_1) = \operatorname*{argmax}_{\pi_0, \pi_1} \Delta_{H_0}(\pi_0, \pi_1, \beta_l, \alpha_l) \quad \text{s.t.} \quad S_a(\pi_0, \pi_1, \beta_l, \alpha_l) \in (\text{maxSteps} - 1, \text{maxSteps}] \tag{1}$$

It turns out that the global optimization in Equation (1) can be approximated by

$$\pi_0 = 0.5, \qquad \pi_1 = \min_{\pi_1} \text{ASN}(\pi_0, \pi_1 \mid \pi = 1.0) \le \text{maxSteps} \tag{2}$$

where $\text{ASN}(\cdot, \cdot)$ (Average Sample Number) is the expected number of steps until the given test yields a decision if the real $\pi = 1.0$. That is, $\pi_1$ is the smallest value for which the ASN does not exceed maxSteps. For details of the sequential analysis, please consult [1].

Note that sequential analysis formally requires i.i.d. variables, which is clearly not the case in our setting. However, we focus on loser configurations, which are always zero (ergo deterministic) and therefore i.i.d. by construction. Also note that the true distribution of the trace matrix is complex and in general unknown. Our method should therefore be considered a first approximation, with more refined methods being the topic of future work.
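Under the ASN approximation stated in Section 3, Equation (2) can be evaluated in closed form. The following minimal Python sketch does so; the function names `asn_at_pi_one` and `choose_pi1` are hypothetical and not part of the paper, and the closed form assumes that the smallest admissible $\pi_1$ is the one for which the approximate ASN equals maxSteps exactly.

```python
import math

def asn_at_pi_one(pi0, pi1, alpha_l, beta_l):
    """Approximate ASN(pi0, pi1 | pi = 1.0) as given in Section 3."""
    return math.log((1.0 - beta_l) / alpha_l) / math.log(pi1 / pi0)

def choose_pi1(max_steps, alpha_l=0.01, beta_l=0.1, pi0=0.5):
    """Smallest pi1 with ASN <= max_steps, i.e. the pi1 solving ASN == max_steps."""
    pi1 = pi0 * ((1.0 - beta_l) / alpha_l) ** (1.0 / max_steps)
    if pi1 > 1.0:
        # No probability pi1 <= 1 satisfies the constraint for so few steps.
        raise ValueError("max_steps is too small for these significance levels")
    return pi1

if __name__ == "__main__":
    for steps in (10, 20):
        pi1 = choose_pi1(steps)
        print(steps, round(pi1, 3), round(asn_at_pi_one(0.5, pi1, 0.01, 0.1), 3))
```

With the suggested levels $\alpha_l = 0.01$ and $\beta_l = 0.1$, this gives $\pi_1$ of roughly 0.78 for maxSteps $= 10$ and roughly 0.63 for maxSteps $= 20$.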
 
Algorithm 1 Fast Cross-Validation

function FASTCV(data, maxSteps, configurations, α, β_l, α_l, earlyStoppingWindow)
    Δ ← N / maxSteps; modelSize ← Δ; test ← getTest(maxSteps, β_l, α_l)
    ∀ s ∈ {1, ..., maxSteps}, c ∈ configurations: traces[c, s] ← performance[c, s] ← 0
    ∀ c ∈ configurations: remainingModel[c] ← true
    for s ← 1 to maxSteps do
        pointwisePerformance ← calcPerformance(data, modelSize, remainingModel)
        performance[remainingModel, s] ← averagePerformance(pointwisePerformance)
        traces[bestPerformingConfigurations(pointwisePerformance, α), s] ← 1
        remainingModel[loserConfigurations(test, traces[remainingModel, 1:s])] ← false
        if similarPerformance(traces[remainingModel, (s-earlyStoppingWindow):s], α) then
            break
        modelSize ← modelSize + Δ
    return selectWinner(performance, remainingModel)
[Figure 1. Left panel: relative speed-up (y-axis) against the number of steps (x-axis) for the easy, medium and hard experiments. Right panel: false negative rate (y-axis) against the change point (x-axis) for steps=10 and steps=20, with one curve per value of $\pi_{\text{before}} \in \{0.1, 0.2, 0.3, 0.4, 0.5\}$.]

Figure 1: Left: Relative speed gain of fast CV compared to full CV. We assume that training time is cubic in the number of samples. Shown are simulated runtimes for 10-fold CV on different problem classes, defined by different loser/winner ratios (easy: 3:1; medium: 1:1; hard: 1:3), over 100 resamples. Right: False negatives generated for non-stationary configurations, i.e., at the given change point the Bernoulli variable changes its parameter from the indicated value $\pi_{\text{before}}$ to 1.0.
1.2 Determine the Number of Steps
In this section we consider the maxSteps parameter. In principle, a larger number of steps leads to more robust estimates, but also to an increase in computation time. We study the effect of different choices of this parameter in a simulation. For the sake of simplicity, we assume that the binary top-or-flop scheme consists of independent Bernoulli variables with $\pi_{\text{winner}} \in [0.9, 1.0]$ and $\pi_{\text{loser}} \in [0.0, 0.1]$. Figure 1 shows the resulting simulated runtimes for different settings. We see that the largest speed-up can be expected for $10 \le \text{maxSteps} \le 20$. The speed gain rapidly decreases afterwards and becomes negligible between 40 steps for the hard setup and 100 steps for the easy setup. These simplified findings suggest that all following experiments should be carried out with either 10 or 20 steps.
2 False Negative Rate
The type of error we must be most concerned with in our procedure is the false negative: a configuration that is eliminated although it is among the top configurations on the full sample. In the following we study the false negative rate, prove a bound on the maximal number of times a configuration can be a loser before it is eliminated, and study the general effect in simulations.

Assume that there exists a change point $cp$ such that a winning configuration loses for the first $cp$ iterations. From the properties of our algorithm we can prove a “security zone” in which the fast cross-validation has a false negative rate (FNR) of zero (see the next section for details): As long as

$$0 \le \frac{cp}{\text{maxSteps}} \le \frac{\log\frac{\beta_l}{1-\alpha_l} \, \log\frac{\pi_1}{\pi_0}}{\log\frac{1-\beta_l}{\alpha_l} \, \log\frac{1-\pi_1}{1-\pi_0}} \quad \text{with} \quad \text{maxSteps} \ge \log\frac{1-\beta_l}{\alpha_l} \Big/ \log 2,$$

the probability of a false negative is zero. For instance, for $\alpha_l = 0.01$ and $\beta_l = 0.1$, a fast cross-validation run needs at least 7 steps, since there is no suitable test available for a smaller number of steps. For maxSteps $= 10$, the security zone amounts to $0.27 \times 10 = 2.7$, meaning that if the change point for all switching configurations occurs at step one or two, the fast cross-validation procedure does not suffer from false negatives. Similarly, for maxSteps $= 20$ the security zone is $0.39 \times 20 = 7.8$.

To illustrate the false negative rate further, we simulate these switching configurations by independent Bernoulli variables that change their parameter $\pi$ from a chosen $\pi_{\text{before}} \in \{0.1, 0.2, \ldots, 0.5\}$ to a constant $1.0$ at a given change point. The relative loss of these configurations for 10 and 20 steps is plotted in Figure 1, right panel, for different change points. As stated by our theoretical result above, the FNR is zero for sufficiently small change points. After that, the probability that the configuration is removed increases. Nevertheless, our experiments show consistently good performance of the fast cross-validation procedure, indicating that the change points are sufficiently small for real data sets.
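The switching-configuration simulation described above can be sketched in a few lines of Python. This is a simplified illustration, not the authors' simulation code: the function names are hypothetical, $\pi_1$ is set so that the approximate ASN equals maxSteps, and a configuration is assumed to be dropped as soon as its cumulative number of wins is at or below the lower line $L_0$ of the sequential test defined in Section 3.

```python
import math
import random

def lower_line(n, pi0, pi1, alpha_l, beta_l):
    """Lower decision line L_0 of the sequential test after n steps (Section 3)."""
    log_loss = math.log((1.0 - pi1) / (1.0 - pi0))
    denom = math.log(pi1 / pi0) - log_loss
    return (math.log(beta_l / (1.0 - alpha_l)) - n * log_loss) / denom

def false_negative_rate(change_point, pi_before, max_steps,
                        alpha_l=0.01, beta_l=0.1, pi0=0.5,
                        trials=2000, seed=0):
    """Fraction of simulated switching configurations that get dropped."""
    # Assumption: pi1 is chosen so that the approximate ASN equals max_steps.
    pi1 = pi0 * ((1.0 - beta_l) / alpha_l) ** (1.0 / max_steps)
    rng = random.Random(seed)
    dropped = 0
    for _ in range(trials):
        wins = 0
        for n in range(1, max_steps + 1):
            # Bernoulli(pi_before) up to the change point, constant 1.0 afterwards.
            p = pi_before if n <= change_point else 1.0
            wins += rng.random() < p
            # Assumed drop rule: H_0 is accepted once wins <= L_0(n).
            if wins <= lower_line(n, pi0, pi1, alpha_l, beta_l):
                dropped += 1
                break
    return dropped / trials

if __name__ == "__main__":
    for cp in (2, 3, 5):
        print(cp, false_negative_rate(cp, pi_before=0.1, max_steps=10))
```

For maxSteps $= 10$, change points of one or two lie inside the security zone, so the simulated rate is exactly zero there and becomes positive from change point three onwards.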
3 Proof of Security Zone Bound
In this section we prove the security zone bound of the previous section. We follow the notation and treatment of the sequential analysis as found in the original publication of Wald [1], Sections 5.3 to 5.5. First of all, Wald proves in Equation 5:27 that the following approximation holds:

$$\text{ASN}(\pi_0, \pi_1 \mid \pi = 1.0) = \frac{\log\frac{1-\beta_l}{\alpha_l}}{\log\frac{\pi_1}{\pi_0}}.$$
The minimal $\text{ASN}(\pi_0, \pi_1 \mid \pi = 1.0)$ is therefore attained if $\log\frac{\pi_1}{\pi_0}$ is maximal, which is clearly the case for $\pi_1 = 1.0$ and $\pi_0 = 0.5$, with $\pi_0 = 0.5$ holding by construction. So we get the lower bound of maxSteps for given significance levels $\alpha_l, \beta_l$:

$$\text{maxSteps} \ge \log\frac{1-\beta_l}{\alpha_l} \Big/ \log 2.$$
The lower line $L_0$ of the graphical sequential analysis test, as exemplified in the overview figure of the paper, is defined as follows (see Equations 5:13 to 5:15):

$$L_0 = \frac{\log\frac{\beta_l}{1-\alpha_l}}{\log\frac{\pi_1}{\pi_0} - \log\frac{1-\pi_1}{1-\pi_0}} - n \, \frac{\log\frac{1-\pi_1}{1-\pi_0}}{\log\frac{\pi_1}{\pi_0} - \log\frac{1-\pi_1}{1-\pi_0}}.$$
Setting $L_0 = 0$, we get the intersection of the lower test line with the x-axis and therefore the earliest step $n_{\text{drop}}$ in which the procedure will drop a constant loser configuration. This yields

$$n_{\text{drop}} = \frac{\log\frac{\beta_l}{1-\alpha_l}}{\log\frac{\pi_1}{\pi_0} - \log\frac{1-\pi_1}{1-\pi_0}} \Bigg/ \frac{\log\frac{1-\pi_1}{1-\pi_0}}{\log\frac{\pi_1}{\pi_0} - \log\frac{1-\pi_1}{1-\pi_0}} = \frac{\log\frac{\beta_l}{1-\alpha_l}}{\log\frac{1-\pi_1}{1-\pi_0}}.$$
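Setting $n_{\text{drop}}$ in relation to $\text{ASN}(\pi_0, \pi_1 \mid \pi = 1.0)$ yields the security zone bound of the previous section: since $\pi_1$ is chosen such that the ASN matches maxSteps (Equation (2)), the ratio $n_{\text{drop}} / \text{ASN}$ is the fraction of maxSteps during which a configuration can lose without being dropped.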
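The quantities of this proof can be checked numerically. The following Python sketch evaluates the lower bound on maxSteps, $n_{\text{drop}}$ and the ratio $n_{\text{drop}} / \text{ASN}$ for the suggested significance levels; the function names are hypothetical, and $\pi_1$ is assumed to be chosen such that the approximate ASN equals maxSteps.

```python
import math

def min_max_steps(alpha_l=0.01, beta_l=0.1):
    """Smallest integer maxSteps satisfying maxSteps >= log((1-beta_l)/alpha_l) / log 2."""
    return math.ceil(math.log((1.0 - beta_l) / alpha_l) / math.log(2.0))

def security_zone(max_steps, alpha_l=0.01, beta_l=0.1, pi0=0.5):
    """Return (n_drop, n_drop / ASN) for pi1 solving ASN == max_steps."""
    log_a = math.log((1.0 - beta_l) / alpha_l)
    pi1 = pi0 * math.exp(log_a / max_steps)
    if pi1 >= 1.0:
        raise ValueError("max_steps is too small for these significance levels")
    asn = log_a / math.log(pi1 / pi0)  # equals max_steps by construction
    n_drop = (math.log(beta_l / (1.0 - alpha_l))
              / math.log((1.0 - pi1) / (1.0 - pi0)))
    return n_drop, n_drop / asn

if __name__ == "__main__":
    print("minimal maxSteps:", min_max_steps())
    for steps in (10, 20):
        n_drop, ratio = security_zone(steps)
        print(steps, round(n_drop, 2), round(ratio, 2))
```

For maxSteps of 10 and 20 the computed ratios agree with the factors 0.27 and 0.39 quoted in Section 2.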
4 Error Rates on Benchmark Data
The following table shows the mean absolute difference in test error (fast versus full cross-validation) in percentage points, together with 95% confidence intervals (standard error, 100 repetitions), for various setups. The “fast” setup runs with maxSteps $= 10$, while the “slow” setup is executed with 20 steps. Each setup is run once with and once without the early stopping rule.
