Bootstrap Distributions
K.-D. WERNECKE
and G. KALB
Summary
The bootstrap error estimation method is compared with the known n-method and
with a combined error estimation suggested by us, using simulated, normally distributed
“populations” in 15 and 30 characters, respectively. For small sample sizes (below two to
three times the number of characters per class) the estimates resulting from the bootstrap
method are on average too small and can no longer be accepted. Significantly better results
(at a substantially lower computational cost) are obtained for the n-method and the combined
estimation. The variability is essentially the same for all three methods.
This applies both in the case of rather badly separated and in the case of very well separated
populations.
A bootstrap estimation modified by us also gives unsatisfactory results.
Key words: Bootstrap method; Discriminance analysis; Modified bootstrapping;
Error estimation; n-method; Combined estimation; Simulation.
1. Introduction
2. Efron’s bootstrap estimation
are used as estimations for the expected value and the variance of D*.
According to the definition of D one obtains a bias-corrected estimation for
$P(R, f)$ from $\hat{P}^{*}(R, f) = S(R) + d^{*}$.
EFRON (1979) states that this corrected estimation doesn’t seem worth making
because the variability of the estimation S(R) with respect to the correction d* is
too high. We can confirm this but will come back to this matter later.
MCLACHLAN (1980) has investigated the efficiency of the bootstrap estimation
by means of simulation. LÄUTER (1985) proved the asymptotic efficiency of the
bootstrap estimator as compared to the R- and U-method.
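The bias correction described above can be illustrated with a minimal Python sketch. Several points in it are assumptions rather than the authors' procedure: S(R) is taken to be the resubstitution (apparent) error; D* is taken, following Efron's general approach, as the error of a bootstrap rule on the original sample minus its error on its own bootstrap sample; and a nearest-class-mean rule stands in for the linear discriminant rule so that the code needs only the standard library. All function names are hypothetical.

```python
import random

def class_means(X, y):
    """Per-class mean vectors of a training sample."""
    sums, counts = {}, {}
    for x, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(x))
        for k, v in enumerate(x):
            acc[k] += v
        counts[label] = counts.get(label, 0) + 1
    return {c: [v / counts[c] for v in sums[c]] for c in sums}

def classify(means, x):
    """Allocate x to the class with the nearest mean (stand-in for the linear rule)."""
    return min(means, key=lambda c: sum((a - b) ** 2 for a, b in zip(x, means[c])))

def error_rate(means, X, y):
    """Share of the objects in (X, y) that the rule misallocates."""
    return sum(classify(means, x) != label for x, label in zip(X, y)) / len(X)

def bootstrap_corrected_error(X, y, n_boot=200, seed=0):
    """Return (S(R), d*, S(R) + d*) under the assumptions stated in the lead-in."""
    rng = random.Random(seed)
    apparent = error_rate(class_means(X, y), X, y)  # assumed meaning of S(R)
    by_class = {}
    for idx, label in enumerate(y):
        by_class.setdefault(label, []).append(idx)
    diffs = []
    for _ in range(n_boot):
        # Resample with replacement within each class, keeping class sizes fixed.
        idx = [rng.choice(members) for members in by_class.values() for _ in members]
        Xb = [X[i] for i in idx]
        yb = [y[i] for i in idx]
        rule = class_means(Xb, yb)
        # One realisation of D*: error on the original sample minus own apparent error.
        diffs.append(error_rate(rule, X, y) - error_rate(rule, Xb, yb))
    d_star = sum(diffs) / n_boot  # bootstrap mean of D*
    return apparent, d_star, apparent + d_star
```

The spread of the values collected in `diffs` gives a rough impression of how variable the correction is relative to its mean, which is the point raised about the corrected estimation above.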
From the given sample elements gij we generate, by simulation, new samples
(bootstrap samples) in the following way :
where $m_i \in \{1, 2, \ldots, n_i\}$ was taken at random and $S_i$ describes the diagonal matrix
of the standard deviations of the i-th class, $z_{ij}$ is a p-dimensional normally
distributed random vector with the expected value 0 and the variance vector 1
and with the correlations given by the sample, and $c$ a fixed vector with the
components $c_l = 1/\sqrt{p}$ $(l = 1(1)p)$.
XY thus has the expected value gi and the variance &. With these simulated
-81
Based on the linear discriminance analysis as a classification rule, i.e. under the
condition of normal distribution for the probability densities $f_i(x)$, we want to
check the bootstrap estimations indicated above and to compare them with the
known n-method (estimation S(n)) and with a combined error estimation S(K)
suggested by us (cf. WERNECKE and RUB, 1983).
Using real data (i.e. the mean and variance vectors as well as feature correlations
are given by a sample) we generate p-dimensional normally distributed
3-class random samples of a certain size and regard them as “populations” from
which we draw, in a random process, 3-class samples of a specified size (cf. WERNECKE,
1983).
For each sample we determine the allocation rule indicated, estimate the associated
classification error according to the methods to be compared, and finally
classify all objects of the “population” into the given classes. This operation is
repeated according to a certain repetition rate.
The allocation of the “population” vectors with the sample classifier directly
simulates the problem of determining the actual error rate P(R, f). We call the
resulting error, i.e. the number of wrongly allocated objects divided by the total
number of random vectors, the “allocation error”.
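The simulation loop described above can be sketched as follows. This is a simplified illustration and not the authors' program: the "population" uses independent unit-variance normal features with hypothetical class means (the paper takes means, variances and correlations from real data), a nearest-class-mean rule stands in for the linear discriminant rule, and the sample-based estimate shown is plain resubstitution. All names and default parameters are assumptions.

```python
import random

def fit_means(X, y):
    """Class mean vectors; stand-in for fitting the linear discriminant rule."""
    groups = {}
    for x, label in zip(X, y):
        groups.setdefault(label, []).append(x)
    return {c: [sum(col) / len(rows) for col in zip(*rows)] for c, rows in groups.items()}

def allocate(means, x):
    """Allocate x to the class with the nearest mean."""
    return min(means, key=lambda c: sum((a - b) ** 2 for a, b in zip(x, means[c])))

def misallocated_share(means, X, y):
    """Wrongly allocated objects divided by the total number of objects."""
    return sum(allocate(means, x) != label for x, label in zip(X, y)) / len(X)

def simulate(p=5, pop_per_class=300, n_per_class=10, repetitions=20, shift=1.0, seed=0):
    """Return one (sample estimate, allocation error) pair per repetition."""
    rng = random.Random(seed)
    # Hypothetical 3-class "population": class c has mean c * shift in every feature.
    pop = {c: [[rng.gauss(c * shift, 1.0) for _ in range(p)] for _ in range(pop_per_class)]
           for c in range(3)}
    pop_X = [x for c in pop for x in pop[c]]
    pop_y = [c for c in pop for _ in pop[c]]
    results = []
    for _ in range(repetitions):
        X, y = [], []
        for c in pop:
            X += rng.sample(pop[c], n_per_class)  # draw the class sample at random
            y += [c] * n_per_class
        means = fit_means(X, y)
        estimate = misallocated_share(means, X, y)            # resubstitution estimate
        allocation = misallocated_share(means, pop_X, pop_y)  # "allocation error"
        results.append((estimate, allocation))
    return results
```

Averaging `estimate - allocation` over the repetitions shows how far a given estimator lies from the allocation error on average; any other estimator can be substituted for the resubstitution line.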
An error estimation method is to be regarded as a good one when the difference
between the sample error estimation and the allocation error is as small as possible,
since the most important component of each quality criterion is the valuation of
the actual error rate (VICTOR, 1976).
We determined the repetition rate according to RASCH et al. (1981) using a one-sided
confidence interval.
Since the estimation of the error rate is only important in comparison with the
allocation error, we estimate the expected half width of this interval according to
the respective difference between the allocation error and the bootstrap estima-
tion from pilot studies.
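As a rough illustration of how a repetition rate can follow from a one-sided confidence interval, the sketch below uses the textbook normal-approximation formula N >= (z_(1-alpha) * sigma / d)^2. This formula is an assumption: the procedure of RASCH et al. (1981) is not reproduced in the text and may differ. Here `sigma` is a pilot-study standard deviation of the quantity of interest, `half_width` is the expected half width d taken from the pilot studies, and the function name is hypothetical.

```python
import math
from statistics import NormalDist

def repetitions_needed(sigma, half_width, alpha=0.05):
    """Smallest N with z_(1-alpha) * sigma / sqrt(N) <= half_width (assumed formula)."""
    if sigma <= 0 or half_width <= 0:
        raise ValueError("sigma and half_width must be positive")
    z = NormalDist().inv_cdf(1.0 - alpha)  # one-sided standard normal quantile
    return math.ceil((z * sigma / half_width) ** 2)
```

Under this formula, halving the required half width roughly quadruples the number of repetitions.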
Table 3
Drawing samples of the size 3×30 from the “population” 3×3000,
p=30; 200 repetitions each (Z, 8, u, e according to Table 1)
For 30 characters the values S(B)=8.82, S(n)=9.09 and S(K)=9.06
resulted as error rates of the “populations” 3×3000.
4.3. Conclusions
Since, especially for small samples, it is desirable to have a safe indicator, and the
bootstrap estimation can only be expected to achieve results comparable to the
estimations S(n) and S(K) for adequately large samples, and then at a substantially higher
computational cost, its application in practice seems questionable.
Furthermore, a basic disadvantage of the bootstrap method is that it can hardly be
applied in practice to the determination of conditioned error estimations: a full
bootstrap estimation would have to be carried out for each conditioned error rate,
which in the case of multiclass problems would lead to unjustified expenditure.
The variant of “bootstrapping” suggested by us is indeed always higher in the
cases investigated here than the simple bootstrap method (by about 5 %); however,
for smaller sample sizes it is considerably lower than the n-method and the
combined estimation and therefore cannot be recommended either.
Our statements apply both to rather badly separated “populations” (p=15)
and to well separated “populations” (p=30).
Zusammenfassung (Summary)
Using simulated, normally distributed “populations” in 15 and 30 characters, respectively, the
bootstrap estimation is examined in comparison with the known n-method and a combined
estimation proposed by us.
For small sample sizes (below two to three times the number of characters per class) the
bootstrap estimation yields estimates that are on average too small and can no longer be
accepted. Far better results (at a substantially lower computational cost) are obtained for
the n-method and the combined estimation.
The variability of all three methods turns out to be essentially the same.
This holds both for rather badly separated and for very well separated populations.
A bootstrap estimation modified by us likewise leads to unsatisfactory results.
References