Proceedings of the Twentieth International Conference on Machine Learning (ICML-2003), Washington DC, 2003.
so that it is clear what changes when the df is fixed to a value that differs from the default one.

2.1. k-fold cross validation

In k-fold cross validation, the data is split in k approximately equal parts, and for each fold the algorithms are trained on all data but that fold. The accuracy is estimated by using the data of the fold left out as test set. This gives k accuracy estimates for algorithms A and B, denoted PA,i and PB,i, where i (1 ≤ i ≤ k) is the fold left out. Let xi be the difference xi = PA,i − PB,i; then the mean of xi is normally distributed if the algorithms perform the same and the folds are sufficiently large (containing at least 30 cases) [6]. The mean x. is simply estimated using x. = (1/k) Σ_{i=1}^k xi, and an unbiased estimate of the variance is σ̂² = Σ_{i=1}^k (xi − x.)² / (k − 1). We have a statistic approximating the t distribution with df = k − 1 degrees of freedom:

    t = x. / √(σ̂² / (df + 1)).

Dietterich [2] demonstrated empirically that the k-fold cross validation test has slightly elevated Type I error when using the default degrees of freedom. In Section 4, we will calibrate the test by using a range of values for df and selecting the one that gives the desired Type I error.

2.2. Resampling and corrected resampling

Resampling is repeatedly splitting the training data randomly in two, training on the first fraction, testing on the remaining fraction, and applying a paired t-test. This used to be a popular way to compare algorithm performance before it was demonstrated to have unacceptably high Type I error [2].

Let PA,j and PB,j be the accuracy of algorithm A and algorithm B measured on run j (1 ≤ j ≤ n), and xj the difference xj = PA,j − PB,j. The mean m of xj is estimated by m = (1/n) Σ_{j=1}^n xj, and the variance is first estimated using Σ_{j=1}^n (xj − m)²; to get an unbiased estimate σ̂², this sum is multiplied with 1/(n − 1).

Nadeau and Bengio [7] observe that the high Type I error is due to an underestimation of the variance because the samples are not independent. They propose to make a correction to the estimate of the variance, and multiply σ̂² with 1/n + n2/n1, where n1 is the fraction of the data used for training and n2 the fraction used for testing. Altogether, this gives a statistic approximating the t distribution with df = n − 1 degrees of freedom:

    t = ((1/n) Σ_{j=1}^n xj) / √((1/n) σ̂²)

for resampling, and

    t = ((1/n) Σ_{j=1}^n xj) / √((1/n + n2/n1) σ̂²)

for corrected resampling.

3. Multiple run k-fold cross validation

As our experiments will demonstrate, the tests from the previous section suffer from low replicability. This means that the result of one researcher may differ from that of another doing the same experiment with the same data and the same hypothesis test, but with different random splits of the data. Higher replicability can be expected for hypothesis tests that utilize multiple runs of k-fold cross validation. There are various ways to get the most out of the data, and we will describe them with the help of Figure 1.¹ The figure illustrates the results of a 3 run 3-fold cross validation experiment inside the box at the left half, though in practice a ten run 10-fold cross validation is more appropriate. Each cell contains the difference in accuracy between two algorithms trained on data from all but one fold and measured with the data in the single fold that was left out as test data.

¹ A more formal description and elaborated motivation of these tests can be found in [1].

The use all data approach is a naive way to use this data. It considers the 100 outcomes of a 10 run 10-fold experiment as independent samples. The mean folds test first averages the cells for a single 10-fold cross validation experiment and considers these averages as samples. In Figure 1 this test would use the numbers in the rightmost column. These averages are known to be better estimates of the accuracy [5]. The mean runs test averages over the cells with the same fold number, that is, the numbers in the last row of the top matrix in Figure 1. A better way seems to be to sort the folds before averaging. Sorting the numbers in each of the folds gives the matrix at the bottom of Figure 1. The mean sorted runs test then uses the averages over the runs of the sorted folds.

Figure 1. Example illustrating the data used for averaging over folds, over runs and over sorted runs.
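The averaging schemes above can be sketched in a few lines. This is a minimal illustration, not the paper's code: the matrix layout, function names, and toy data are our own, and t_stat is just the plain one-sample t statistic applied to whichever cells a scheme treats as samples.

```python
import numpy as np

def t_stat(samples):
    # plain t statistic: mean / sqrt(unbiased variance / number of samples)
    n = len(samples)
    return np.mean(samples) / np.sqrt(np.var(samples, ddof=1) / n)

def multirun_cv_stats(x):
    # x[i, j]: accuracy difference P_A - P_B for fold i of run j (k x r matrix)
    return {
        "use all data":     t_stat(x.ravel()),       # all k*r cells as samples
        "mean folds":       t_stat(x.mean(axis=0)),  # per-run averages x_.j
        "mean runs":        t_stat(x.mean(axis=1)),  # per-fold averages x_i.
        # sort the folds within each run, then average over the runs
        "mean sorted runs": t_stat(np.sort(x, axis=0).mean(axis=1)),
    }

# toy 3 run 3-fold example, as in Figure 1
rng = np.random.default_rng(1)
x = rng.normal(0.0, 0.02, size=(3, 3))
for name, t in multirun_cv_stats(x).items():
    print(f"{name:16s} t = {t:+.3f}")
```

Note how the use all data variant treats all k·r cells as independent samples; that is the assumption behind the dependence between samples discussed above.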
The mean folds averaged var test uses an alternative way to estimate the variance used by the mean folds test: instead of estimating the variance directly from the fold averages, a more accurate estimate of the variance may be obtained by averaging the variances obtained from the data of each of the 10-fold cv experiments. The same idea can be extended to the mean run and mean sorted run tests, giving rise to the mean run averaged var and mean sorted run averaged var tests.

The mean folds averaged T test uses a similar idea as the mean folds averaged var test, but instead of averaging just the variances, it averages the test statistics that would be obtained from each individual 10-fold experiment. This idea can be extended to the mean run and mean sorted run tests as well, giving the mean run averaged T and mean sorted run averaged T tests.

More formally, let there be r runs (r > 1) and k folds (k > 1), and two learning schemes A and B that have accuracy aij and bij for fold i (1 ≤ i ≤ k) and run j (1 ≤ j ≤ r). Let x be the difference between those accuracies, xij = aij − bij. We use the short notation x.j for (1/k) Σ_{i=1}^k xij and xi. for (1/r) Σ_{j=1}^r xij. Further, we use the short notation σ̂.j² for Σ_{i=1}^k (xij − x.j)² / (k − 1) and σ̂i.² for Σ_{j=1}^r (xij − xi.)² / (r − 1). For the tests that use sorting, let θ(i, j) be an ordering such that xθ(ij) ≤ xθ((i+1)j). Then we use σ̂i.²(θ) for Σ_{j=1}^r (xθ(ij) − xi.)² / (r − 1). Table 1 gives an overview of the various tests and the way they are calculated. For many of the tests, the mean m and variance σ̂² of x are estimated, from which the test statistic Z is calculated. As evident from Table 1, the averaged T tests take a slightly different approach.

4. Calibrating the tests

The tests are calibrated on the Type I error by generating 1000 random data sets with mutually independent binary variables. The class probability is set to 50%, a value that typically generates the highest Type I error [2]. On each of the training sets, two learning algorithms are used, naive Bayes [4] and C4.5 [8], as implemented in Weka [11]. The nature of the data sets ensures that neither algorithm can outperform the other. So, whenever a test indicates a difference between the two algorithms, this contributes to the Type I error. Each test is run with degrees of freedom ranging from 2 to 100. The degrees of freedom at which the measured Type I error is closest to, but not exceeding, the significance level is chosen as the calibrated value.
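This selection rule, together with the corrected statistic from Section 2.2, can be sketched as follows. This is a schematic illustration, not the paper's implementation: the null model inside measure_type1 (normally distributed differences) and the fixed rejection threshold of 2.0 are simplifying assumptions of ours; in the paper, the threshold comes from the t distribution with the candidate df.

```python
import numpy as np

def corrected_resampled_t(x, n1, n2):
    # Nadeau-Bengio corrected statistic: m / sqrt((1/n + n2/n1) * sigma^2)
    n = len(x)
    return np.mean(x) / np.sqrt((1.0 / n + n2 / n1) * np.var(x, ddof=1))

def measure_type1(reject_fn, n_datasets=1000, seed=0):
    # fraction of null data sets on which the test wrongly reports a difference;
    # reject_fn maps a vector of accuracy differences to True (reject) or False
    rng = np.random.default_rng(seed)
    hits = sum(bool(reject_fn(rng.normal(0.0, 1.0, size=100)))
               for _ in range(n_datasets))
    return hits / n_datasets

def calibrate_df(type1_by_df, alpha=0.05):
    # pick the df whose measured Type I error is closest to alpha
    # without exceeding it
    admissible = {df: err for df, err in type1_by_df.items() if err <= alpha}
    return max(admissible, key=admissible.get) if admissible else None

# measure the Type I error of the corrected test with a hypothetical cutoff
reject = lambda d: abs(corrected_resampled_t(d, n1=90, n2=10)) > 2.0
print(measure_type1(reject))

# choose a calibrated df from (hypothetical) measured Type I errors per df
print(calibrate_df({2: 0.080, 5: 0.061, 10: 0.049, 20: 0.041, 100: 0.030}))  # prints 10
```

In a full replication, reject_fn would run the actual learning algorithms on each generated data set and apply one of the tests from Table 1 with the candidate df.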
Figure 2. Degrees of freedom (left) and Type I error in percentage (right) that coincide with significance level of α = 0.05
and 10 folds for various numbers of runs.
• Those tests show very little sensitivity to the number of runs, though the power decreases slightly with an increasing number of runs. This may be due to the small variability of the calibrated df.

5.3. Vary significance level

Datasets 1 to 4 were used to measure the sensitivity of the tests to the significance level. The tests are for 10 runs and 10 folds. Table 4 shows the Type I error for significance levels 1%, 2.5%, 5%, and 10%. Not surprisingly, the Type I error is close to and below the expected levels, since the tests were calibrated on this set.

Some observations:

• …hand and the other tests on the other hand. The tests that use all data of the 10x10 folds have considerably better power.

• The runs averaged T test is almost always best and always in the top 3 of the tests listed. However, it also has a slightly elevated Type I error at the 10% significance level (10.5%).

• The use all data test is always in the top 3 of the tests listed.

• Overall, all tests (except the first three) have a power that is mutually close, an indication that the calibration compensates appropriately for dependence between samples.