Appendix to Fast Cross-Validation via Sequential Analysis

Tammo Krueger, Danny Panknin, Mikio Braun
Technische Universitaet Berlin, Machine Learning Group, 10587 Berlin
t.krueger@tu-berlin.de, {panknin|mikio}@cs.tu-berlin.de

1 Selection of Meta-Parameters for the Fast Cross-Validation

The algorithm has a number of free parameters, as can be seen from the pseudo-code in Algorithm 1: maxSteps, the number of subsample sizes to consider; α, the significance level for the binarization of the test errors; αl and βl, the significance levels for the sequential analysis test; and earlyStoppingWindow, the number of steps to look back in the early stopping procedure. While we give an in-depth treatment of the selection of π0, π1 and the maxSteps parameter in the following sections, we here give some suggestions for the other parameters. The parameter α controls the significance level in each step of the test for similar behavior. We suggest setting this to the usual level of α = 0.05. Furthermore, βl and αl control the significance levels of H0 (configuration is a loser) and H1 (configuration is a winner), respectively. We suggest an asymmetric setup, setting βl = 0.1, since we want to drop loser configurations relatively fast, and αl = 0.01, since we want to be really sure when we accept a configuration as overall winner. Finally, we set earlyStoppingWindow to 3 for maxSteps = 10 and to 6 for maxSteps = 20, as we have observed that this choice works well in practice.
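For convenience, these suggested defaults can be gathered in one place. Below is a minimal sketch; the dictionary and its key names are ours, not taken from the authors' implementation:

```python
# Suggested default meta-parameters for the fast cross-validation procedure
# (hypothetical key names, collecting the values recommended above).
FAST_CV_DEFAULTS = {
    "alpha": 0.05,               # significance level for binarizing the test errors
    "alpha_l": 0.01,             # H1 (winner) level: accept winners conservatively
    "beta_l": 0.1,               # H0 (loser) level: drop losers relatively fast
    "max_steps": 10,             # number of subsample sizes (10 or 20, see Section 1.2)
    "early_stopping_window": 3,  # use 6 when max_steps = 20
}
```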

1.1 Choosing the Optimal Sequential Test Parameters

As outlined in the main part of the paper, we want to use the sequential testing framework to eliminate underperforming configurations as fast as possible while postponing the decision for a winner as long as possible. Using the parameters of the sequential testing framework, we have to choose π0 and π1 such that the area of acceptance for H0 (region H0(π0, π1, βl, αl), denoted by "LOSER" in the overview figure) is maximized, while the earliest point of acceptance of H1 (Sa(π0, π1, βl, αl) in the overview figure) is postponed until the procedure has run at least maxSteps steps:

(π0, π1) = argmax_{π0, π1} H0(π0, π1, βl, αl)   s.t.   Sa(π0, π1, βl, αl) ∈ (maxSteps − 1, maxSteps]   (1)

It turns out that the global optimization in Equation (1) can be approximated by

π0 = 0.5  ∧  π1 = argmin_{π1} { ASN(π0, π1 | π = 1.0) : ASN(π0, π1 | π = 1.0) ≥ maxSteps }   (2)

where ASN(·, ·) (Average Sample Number) is the expected number of steps until the given test will yield a decision if the true π = 1.0. For details of the sequential analysis please consult [1]. Note that sequential analysis formally requires i.i.d. variables, which is clearly not the case in our setting. However, we focus on loser configurations, which are always zero (and hence deterministic) and therefore i.i.d. by construction. Also note that the true distribution of the trace matrix is complex and in general unknown. Our method should therefore be considered a first approximation, with more refined methods being the topic of future work.
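The approximation in Equation (2) can be evaluated numerically. The following sketch fixes π0 = 0.5 and scans for the π1 whose ASN just reaches maxSteps, using the ASN approximation restated in Section 3 (from Wald [1]); the coarse grid search and the function names are our own simplification, not the authors' code:

```python
import numpy as np

def asn_pi_one(pi0: float, pi1: float, alpha_l: float, beta_l: float) -> float:
    """Expected number of steps until a decision when the true pi equals 1.0 (Wald's approximation)."""
    return np.log((1 - beta_l) / alpha_l) / np.log(pi1 / pi0)

def choose_test_parameters(max_steps: int, alpha_l: float = 0.01, beta_l: float = 0.1):
    pi0 = 0.5
    # ASN decreases in pi1, so scanning downwards from 1.0 the first pi1 with
    # ASN >= max_steps minimizes the ASN subject to the constraint.
    for pi1 in np.arange(0.999, pi0, -0.001):
        if asn_pi_one(pi0, pi1, alpha_l, beta_l) >= max_steps:
            return pi0, float(pi1)
    return pi0, pi0  # degenerate case; should not occur for reasonable inputs

print(choose_test_parameters(10))   # roughly (0.5, 0.78)
print(choose_test_parameters(20))   # roughly (0.5, 0.63)
```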

Algorithm 1 Fast Cross-Validation
1:  function FASTCV(data, maxSteps, configurations, α, βl, αl, earlyStoppingWindow)
2:    ∆ ← N/maxSteps; modelSize ← ∆; test ← getTest(maxSteps, βl, αl)
3:    ∀s ∈ {1, ..., maxSteps}, c ∈ configurations: traces[c, s] ← performance[c, s] ← 0
4:    ∀c ∈ configurations: remainingModel[c] ← true
5:    for s ← 1 to maxSteps do
6:      pointwisePerformance ← calcPerformance(data, modelSize, remainingModel)
7:      performance[remainingModel, s] ← averagePerformance(pointwisePerformance)
8:      traces[bestPerformingConfigurations(pointwisePerformance, α), s] ← 1
9:      remainingModel[loserConfigurations(test, traces[remainingModel, 1:s])] ← false
10:     if similarPerformance(traces[remainingModel, (s − earlyStoppingWindow):s], α) then
11:       break
12:     modelSize ← modelSize + ∆
13:   return selectWinner(performance, remainingModel)
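To make the control flow concrete, here is a structural sketch of Algorithm 1 in Python. The statistical helpers (calc_performance, best_performing, is_loser, similar_performance) and the winner rule are hypothetical stand-ins for the tests described in the paper, not the authors' implementation:

```python
import numpy as np

def fast_cv(data, configurations, max_steps, calc_performance, best_performing,
            is_loser, similar_performance, early_stopping_window=3):
    n = len(data)
    delta = n // max_steps                              # increment of the model size per step
    n_conf = len(configurations)
    traces = np.zeros((n_conf, max_steps), dtype=int)   # 1 = flagged as top performer at step s
    performance = np.zeros((n_conf, max_steps))
    remaining = np.ones(n_conf, dtype=bool)

    for s in range(max_steps):
        model_size = (s + 1) * delta
        # Pointwise performance of the remaining configurations on a subsample of model_size points.
        pointwise = calc_performance(data, model_size, configurations, remaining)
        performance[remaining, s] = pointwise.mean(axis=1)
        # Mark the configurations that are statistically among the best at this step
        # (best_performing returns indices into the full configuration list).
        traces[best_performing(pointwise, remaining), s] = 1
        # Drop configurations that the sequential test flags as losers based on their trace so far.
        for c in np.where(remaining)[0]:
            if is_loser(traces[c, : s + 1]):
                remaining[c] = False
        # Early stopping: the remaining configurations behaved similarly in the recent window.
        if s + 1 > early_stopping_window and similar_performance(
                traces[remaining, s + 1 - early_stopping_window: s + 1]):
            break

    # One plausible selectWinner rule: best performance at the last evaluated step.
    winner_idx = np.where(remaining)[0][performance[remaining, s].argmax()]
    return configurations[winner_idx]
```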

[Figure 1 plot area (two panels, steps = 10 and steps = 20): Relative Speed-up versus Steps for the easy, medium, and hard experiments, and False Negative Rate versus Change Point for πbefore ∈ {0.1, 0.2, 0.3, 0.4, 0.5}. See the caption below.]
Figure 1: Left: Relative speed gain of fast CV compared to full CV. We assume that training time is cubic in the number of samples. Shown are simulated runtimes for 10-fold CV on different problem classes with different loser/winner ratios (easy: 3:1; medium: 1:1; hard: 1:3) over 100 resamples. Right: False negatives generated for non-stationary configurations, i.e., at the given change point the Bernoulli variable changes its parameter π from the indicated value πbefore to 1.0.

1.2 Determining the Number of Steps

In this section we consider the maxSteps parameter. In principle, a larger number of steps leads to more robust estimates, but also to an increase in computation time. We study the effect of different choices of this parameter in a simulation. For the sake of simplicity we assume that the binary top-or-flop scheme consists of independent Bernoulli variables with πwinner ∈ [0.9, 1.0] and πloser ∈ [0.0, 0.1]. Figure 1 shows the resulting simulated runtimes for different settings. We see that the largest speed-up can be expected for 10 ≤ maxSteps ≤ 20. The speed gain decreases rapidly afterwards and becomes negligible between 40 steps for the hard setup and 100 steps for the easy setup. These simplified findings suggest that all following experiments should be carried out with either 10 or 20 steps.

2 False Negative Rate

The types of errors we must be most concerned with in our procedure are false negatives: configurations which are eliminated although they are among the top configurations on the full sample. In the following we study the false negative rate: we prove the maximal number of times a configuration can be a loser before it is eliminated, and study the general effect in simulations. Assume that there exists a change point cp such that a winning configuration loses for the first cp iterations. From the properties of our algorithm we can prove a “security zone” in which the fast cross-validation has a false negative rate (FNR) of zero (see the next section for details): as long as

0 ≤ cp / maxSteps ≤ [log(βl/(1−αl)) · log(π1/π0)] / [log((1−βl)/αl) · log((1−π1)/(1−π0))]   with   maxSteps ≥ log((1−βl)/αl) / log 2,

the probability of a FNR larger than zero is zero. For instance, for αl = 0.01 and βl = 0.1 we can start a fast cross-validation run with a minimum of 7 steps, since there is no suitable test available for a smaller number of steps. For maxSteps = 10 steps, the security zone amounts to 0.27 × 10 = 2.7, meaning that if the change point of every switching configuration occurs at step one or two, the fast cross-validation procedure does not suffer from false negatives. Similarly, for maxSteps = 20 the security zone is 0.39 × 20 = 7.8. To illustrate the false negative rate further, we simulate such switching configurations by independent Bernoulli variables which change their parameter π from a chosen πbefore ∈ {0.1, 0.2, ..., 0.5} to a constant 1.0 at a given change point. The resulting false negative rates for 10 and 20 steps are plotted in Figure 1, right panel, for different change points. As stated by our theoretical result above, the FNR is zero for sufficiently small change points; after that, the probability that a configuration is erroneously removed increases. As our experiments showed, we nevertheless see consistently good performance of the fast cross-validation procedure, indicating that the change points are sufficiently small for real data sets.
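These numbers can be reproduced directly from the bound. Below is a small numeric check, assuming π0 = 0.5 and π1 chosen via Equation (2) such that ASN(π0, π1 | π = 1.0) = maxSteps:

```python
import numpy as np

def security_zone(max_steps, alpha_l=0.01, beta_l=0.1, pi0=0.5):
    asn_num = np.log((1 - beta_l) / alpha_l)
    pi1 = pi0 * np.exp(asn_num / max_steps)         # solves ASN(pi0, pi1 | pi = 1.0) = maxSteps
    n_drop = np.log(beta_l / (1 - alpha_l)) / np.log((1 - pi1) / (1 - pi0))
    return n_drop / max_steps                       # fraction of steps that is "safe"

print(round(security_zone(10), 2))   # ~0.27, i.e. change points up to step 2 are safe
print(round(security_zone(20), 2))   # ~0.39, i.e. 0.39 * 20 = 7.8
```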

3 Proof of the Security Zone Bound

In this section we prove the security zone bound of the previous section. We follow the notation and treatment of sequential analysis as found in the original publication of Wald [1], Sections 5.3 to 5.5. First of all, Wald proves in Equation 5:27 that the following approximation holds:

ASN(π0, π1 | π = 1.0) = log((1−βl)/αl) / log(π1/π0).

The minimal ASN(π0, π1 | π = 1.0) is therefore attained if log(π1/π0) is maximal, which is clearly the case for π1 = 1.0 and π0 = 0.5, which holds by construction. So we get the lower bound on maxSteps for given significance levels αl, βl:

maxSteps ≥ log((1−βl)/αl) / log 2.
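For the default significance levels used in the paper this bound evaluates as follows (a quick numeric check):

```python
# Minimal number of steps for which a suitable sequential test exists,
# evaluated for alpha_l = 0.01 and beta_l = 0.1.
import math

alpha_l, beta_l = 0.01, 0.1
print(math.ceil(math.log((1 - beta_l) / alpha_l) / math.log(2)))   # -> 7
```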

The lower line L0 of the graphical sequential analysis test, as exemplified in the overview figure of the paper, is defined as follows (see Equations 5:13–5:15):

L0 = log(βl/(1−αl)) / [log(π1/π0) − log((1−π1)/(1−π0))] − n · log((1−π1)/(1−π0)) / [log(π1/π0) − log((1−π1)/(1−π0))].

Setting L0 = 0, we obtain the intersection of the lower test line with the x-axis and therefore the earliest step ndrop at which the procedure will drop a constant loser configuration. This yields

ndrop = { log(βl/(1−αl)) / [log(π1/π0) − log((1−π1)/(1−π0))] } / { log((1−π1)/(1−π0)) / [log(π1/π0) − log((1−π1)/(1−π0))] } = log(βl/(1−αl)) / log((1−π1)/(1−π0)).

Setting ndrop in relation to ASN(π0 , π1 |π = 1.0) yields the security zone bound of the previous Section.
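The last algebraic step can be checked symbolically. A small sketch, assuming sympy is available:

```python
# Symbolic sanity check: solving L0 = 0 for n reproduces the n_drop formula above.
import sympy as sp

n, pi0, pi1, al, bl = sp.symbols('n pi0 pi1 alpha_l beta_l', positive=True)

D = sp.log(pi1 / pi0) - sp.log((1 - pi1) / (1 - pi0))                 # common denominator
L0 = sp.log(bl / (1 - al)) / D - n * sp.log((1 - pi1) / (1 - pi0)) / D

n_drop = sp.solve(sp.Eq(L0, 0), n)[0]
# Difference to the closed form log(beta_l/(1-alpha_l)) / log((1-pi1)/(1-pi0)) should vanish.
print(sp.simplify(n_drop - sp.log(bl / (1 - al)) / sp.log((1 - pi1) / (1 - pi0))))   # -> 0
```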

4 Error Rates on Benchmark Data

The following table shows the mean difference of test error (fast versus full cross-validation) in percentage points and 95% confidence intervals (standard error, 100 repetitions) for various setups. The fast setup runs with maxSteps = 10 steps while the slow setup is executed with 20 steps. Each setup is employed once with and once without the early stopping rule.

Dataset        fast/early         fast               slow/early         slow
banana          0.20 % ± 0.18      0.11 % ± 0.15      0.32 % ± 0.22      0.07 % ± 0.10
breastCancer    2.00 % ± 1.85      2.09 % ± 1.64     -0.38 % ± 2.91      1.46 % ± 1.95
diabetis        0.56 % ± 0.88      0.80 % ± 0.82      0.68 % ± 0.81     -0.00 % ± 0.71
flareSolar      1.44 % ± 2.95      2.53 % ± 3.31      1.39 % ± 1.77     -0.11 % ± 1.86
german          0.45 % ± 0.70      0.92 % ± 0.58      1.14 % ± 0.53      0.86 % ± 0.62
image           0.19 % ± 0.19      0.22 % ± 0.20      0.46 % ± 0.26      0.41 % ± 0.24
ringnorm        0.03 % ± 0.03      0.00 % ± 0.04      0.05 % ± 0.04      0.03 % ± 0.04
splice          0.25 % ± 0.19      0.32 % ± 0.18      0.15 % ± 0.19      0.14 % ± 0.15
thyroid         0.39 % ± 0.53     -0.13 % ± 0.47     -0.06 % ± 0.56     -0.38 % ± 0.44
twonorm        -0.02 % ± 0.03     -0.03 % ± 0.04      0.00 % ± 0.05      0.00 % ± 0.03
waveform        0.27 % ± 0.12      0.21 % ± 0.17      0.33 % ± 0.15      0.21 % ± 0.15
covertype       0.78 % ± 0.21      0.89 % ± 0.19      0.65 % ± 0.19      0.88 % ± 0.20

5 Example Run of Fast Cross-Validation

In this section we give an example of the whole fast cross-validation procedure on a toy data set of n = 1,000 data points, which is based on a sine wave y = sin(x) + ε, x ∈ [0, 2πd], with ε being Gaussian noise (µ = 0, σ = 0.25). The parameter d = 50 controls the inherent complexity of the data and the sign of y is taken as the class membership. The fast cross-validation is executed with maxSteps = 10 and earlyStoppingWindow = 3. We use a ν-SVM [2] and test a parameter grid of σ ∈ {−1, −0.5, 0, 0.5, 1} and ν ∈ {0.1, 0.2, 0.3, 0.4, 0.5}. The procedure runs for 4 steps, after which the early stopping rule takes effect. This yields the following traces matrix (only remaining configurations are shown):

Configuration         modelSize=100  modelSize=200  modelSize=300  modelSize=400
σ = 0,   ν = 0.1            1              1              0              0
σ = 0,   ν = 0.2            1              1              0              0
σ = 0,   ν = 0.3            1              1              0              0
σ = 0,   ν = 0.4            1              1              0              0
σ = 0,   ν = 0.5            1              1              0              0
σ = 0.5, ν = 0.1            1              1              1              1
σ = 0.5, ν = 0.2            1              1              1              1
σ = 0.5, ν = 0.3            1              1              1              0
σ = 0.5, ν = 0.4            1              1              1              0
σ = 0.5, ν = 0.5            1              1              0              0
σ = 1,   ν = 0.1            1              1              1              1
σ = 1,   ν = 0.2            1              1              1              1
σ = 1,   ν = 0.3            1              1              1              1
σ = 1,   ν = 0.4            0              1              1              0

The corresponding performances (prediction accuracy) are as follows, from which the procedure chooses σ = 1, ν = 0.2 as the final winning configuration:

Configuration         modelSize=100  modelSize=200  modelSize=300  modelSize=400
σ = 0,   ν = 0.1          0.659          0.760          0.824          0.858
σ = 0,   ν = 0.2          0.659          0.759          0.826          0.855
σ = 0,   ν = 0.3          0.659          0.759          0.824          0.857
σ = 0,   ν = 0.4          0.659          0.759          0.827          0.857
σ = 0,   ν = 0.5          0.659          0.760          0.824          0.853
σ = 0.5, ν = 0.1          0.657          0.757          0.841          0.873
σ = 0.5, ν = 0.2          0.657          0.759          0.853          0.872
σ = 0.5, ν = 0.3          0.657          0.762          0.851          0.867
σ = 0.5, ν = 0.4          0.658          0.762          0.850          0.865
σ = 0.5, ν = 0.5          0.658          0.756          0.837          0.857
σ = 1,   ν = 0.1          0.652          0.743          0.847          0.878
σ = 1,   ν = 0.2          0.648          0.746          0.866          0.895
σ = 1,   ν = 0.3          0.646          0.766          0.861          0.883
σ = 1,   ν = 0.4          0.624          0.745          0.861          0.860
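As a small illustration, the final selection can be reproduced from the last column of the table above, assuming the winner is the remaining configuration with the best accuracy at the largest evaluated model size (the rule consistent with the numbers shown; the actual selectWinner may differ):

```python
import numpy as np

configs = ["s0.0/n0.1", "s0.0/n0.2", "s0.0/n0.3", "s0.0/n0.4", "s0.0/n0.5",
           "s0.5/n0.1", "s0.5/n0.2", "s0.5/n0.3", "s0.5/n0.4", "s0.5/n0.5",
           "s1.0/n0.1", "s1.0/n0.2", "s1.0/n0.3", "s1.0/n0.4"]
acc_400 = np.array([0.858, 0.855, 0.857, 0.857, 0.853, 0.873, 0.872, 0.867,
                    0.865, 0.857, 0.878, 0.895, 0.883, 0.860])   # modelSize=400 column
print(configs[int(acc_400.argmax())])                            # -> s1.0/n0.2, i.e. sigma = 1, nu = 0.2
```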

6 Non-Parametric Tests

The tests used in the fast cross-validation procedure are common tools in the field of statistical data analysis. Here we give a short summary based on the Dataplot Manual [3]. Both methods deal with a data matrix of c experimental treatments with observations arranged in r blocks:

Block    Treatment 1   Treatment 2   ...   Treatment c
1        x11           x12           ...   x1c
2        x21           x22           ...   x2c
3        x31           x32           ...   x3c
...      ...           ...           ...   ...
r        xr1           xr2           ...   xrc

Both tests treat similar questions (“Do the c treatments have identical effects?”) but are designed for different kinds of data: the Cochran Q test is tuned for binary xij while the Friedman test acts on continuous values. In the context of the fast cross-validation procedure the tests are used for two different tasks:

1. Determine whether a set of configurations are the top performing ones (the corresponding step in the overview figure and the function bestPerformingConfigurations in Algorithm 1).

2. Check whether the remaining configurations behaved similarly in the past (the corresponding step in the overview figure and the function similarPerformance in Algorithm 1).

In both cases, the configurations act as treatments on either the samples (Point 1 above) or on the last earlyStoppingWindow traces (Point 2 above) of the remaining configurations. Depending on the learning problem, either the Friedman test for regression tasks or the Cochran Q test for classification tasks is used in Point 1. In both cases the hypotheses for the tests are as follows:

• H0: All treatments are equally effective (no effect).
• H1: There is a difference in the effectiveness among the treatments, i.e., there is at least one treatment showing a significant effect.

6.1 Cochran Q Test
The test statistic is calculated as follows:

T = c(c − 1) · Σ_{i=1}^{c} (Ci − N/c)² / Σ_{i=1}^{r} Ri (c − Ri)

with Ci denoting the column total for the ith treatment, Ri the row total for the ith block, and N the total number of values. We reject H0 if T > χ²(1 − α, c − 1), with χ²(1 − α, c − 1) denoting the (1 − α)-quantile of the χ² distribution with c − 1 degrees of freedom and α the significance level.
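For illustration, the statistic can be computed directly from this formula. The sketch below is written from the formula above, not from the authors' code; scipy is only used for the χ² quantile:

```python
import numpy as np
from scipy.stats import chi2

def cochran_q_test(x: np.ndarray, alpha: float = 0.05):
    """x: binary matrix with r blocks as rows and c treatments as columns."""
    r, c = x.shape
    col_totals = x.sum(axis=0)            # C_i, one per treatment
    row_totals = x.sum(axis=1)            # R_i, one per block
    n_total = x.sum()                     # N
    t = (c * (c - 1) * ((col_totals - n_total / c) ** 2).sum()
         / (row_totals * (c - row_totals)).sum())
    reject = t > chi2.ppf(1 - alpha, c - 1)
    return t, reject

# Example: binary traces of 4 configurations over 6 steps (rows = blocks/steps).
traces = np.array([[1, 1, 1, 0],
                   [1, 1, 0, 0],
                   [1, 1, 1, 0],
                   [1, 0, 1, 0],
                   [1, 1, 1, 0],
                   [1, 1, 1, 0]])
print(cochran_q_test(traces))
```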

6.2 Friedman Test

Let R(xij) be the rank assigned to xij within block i (i.e., ranks within a given row). Average ranks are used in the case of ties. The ranks are summed per treatment to obtain

Rj = Σ_{i=1}^{r} R(xij).

The test statistic is then calculated as follows:

T = 12 / (r c (c + 1)) · Σ_{j=1}^{c} (Rj − r(c + 1)/2)².

We reject H0 if T > χ²(1 − α, c − 1), with χ²(1 − α, c − 1) denoting the (1 − α)-quantile of the χ² distribution with c − 1 degrees of freedom and α the significance level.
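Analogously to the Cochran Q sketch above, the Friedman statistic can be computed directly from this formula. This is our own illustration, not the authors' code; scipy's rankdata (average ranks for ties) and the χ² quantile are the only library calls:

```python
import numpy as np
from scipy.stats import chi2, rankdata

def friedman_test(x: np.ndarray, alpha: float = 0.05):
    """x: matrix of continuous values with r blocks as rows and c treatments as columns."""
    r, c = x.shape
    ranks = np.apply_along_axis(rankdata, 1, x)   # rank within each block (row)
    rank_sums = ranks.sum(axis=0)                 # R_j, one per treatment
    t = 12.0 / (r * c * (c + 1)) * ((rank_sums - r * (c + 1) / 2) ** 2).sum()
    reject = t > chi2.ppf(1 - alpha, c - 1)
    return t, reject

# Example: pointwise losses of 3 configurations on 5 samples (rows = blocks).
losses = np.array([[0.10, 0.20, 0.30],
                   [0.15, 0.25, 0.20],
                   [0.05, 0.30, 0.25],
                   [0.20, 0.10, 0.40],
                   [0.10, 0.20, 0.35]])
print(friedman_test(losses))
```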

References

[1] Abraham Wald. Sequential Analysis. Wiley, 1947.

[2] Bernhard Schölkopf, Alex J. Smola, Robert C. Williamson, and Peter L. Bartlett. New support vector algorithms. Neural Computation, 12:1207–1245, May 2000.

[3] James J. Filliben and Alan Heckert. Dataplot Reference Manual Volume 1: Commands. Statistical Engineering Division, Information Technology Laboratory, National Institute of Standards and Technology.
