
Solutions to the Exercises on

the Bias-Variance Dilemma

Laurenz Wiskott
Institut für Neuroinformatik
Ruhr-Universität Bochum, Germany, EU

4 February 2017

Contents

1 Bias-Variance Dilemma
  1.1 Exercises
    1.1.1 Exercise: Bias-Variance Dilemma

1 Bias-Variance Dilemma

1.1 Exercises

1.1.1 Exercise: Bias-Variance Dilemma

This exercise illustrates the bias-variance dilemma by means of a simple example. It is also a nice exercise
for practicing the use of probabilities.

Consider the probability density function, or simply probability distribution, $p(s, x)$ shown in figure 1 for the generation of training data. The input variable $x$ can only assume the discrete values $x_i$ with equal probabilities, with $i \in \{0, 1\}$, $x_0 = 0$, and $x_1 = 1$. The desired output value $s$ depends on $x$ and is uniformly distributed in the interval $I_i = [m_i - \epsilon_i,\, m_i + \epsilon_i]$.
© 2017 Laurenz Wiskott (homepage https://www.ini.rub.de/PEOPLE/wiskott/). This work (except for all figures from other sources, if present) is licensed under the Creative Commons Attribution-ShareAlike 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by-sa/4.0/. Figures from other sources have their own copyright, which is generally indicated. Do not distribute parts of these lecture notes showing figures with non-free copyrights (here usually figures I have the rights to publish but you don't, like my own published figures).
Several of my exercises (not necessarily on this topic) were inspired by papers and textbooks by other authors. Unfortunately, I did not document that well, because initially I did not intend to make the exercises publicly available, and now I cannot trace it back anymore. So I cannot give as much credit as I would like to. The concrete versions of the exercises are certainly my own work, though.
These exercises complement my corresponding lecture notes available at https://www.ini.rub.de/PEOPLE/wiskott/Teaching/Material/, where you can also find other teaching material such as programming exercises. The table of contents of the lecture notes is reproduced here to give an orientation as to when the exercises can reasonably be solved. For best learning effect, I recommend first seriously trying to solve the exercises yourself before looking into the solutions.

[Figure 1: Probability density function of the data ($s$ on the vertical axis, $x$ on the horizontal axis; the density is centered at $m_0$ for $x = 0$ and at $m_1$ for $x = 1$). CC BY-SA 4.0]

The probabilities are therefore given by

$$P_x(x_i) := 0.5 \quad \text{for } x_0 = 0 \text{ and } x_1 = 1 \,, \tag{1}$$

$$p_{s|x}(s|x_i) := \begin{cases} 1/(2\epsilon_i) & \text{for } s \in I_i \\ 0 & \text{for } s \notin I_i \end{cases} \tag{2}$$

$$\text{with } I_i := [m_i - \epsilon_i,\, m_i + \epsilon_i] \,. \tag{3}$$

The joint probability $p_{s,x}(s, x_i)$ is then given by $p_{s,x}(s, x_i) = p_{s|x}(s|x_i)\, P_x(x_i)$. Note that $P_x$ is a probability, because $x$ is discrete, while $p_{s|x}$ and $p_{s,x}$ are probability densities, because $s$ is a continuous variable.

Note that in the following, averages over $x$ are usually replaced by averages over $i$, because $x$ is a discrete variable that only assumes the values $x_i$ for $i \in \{0, 1\}$. Also remember that the (pdf of the) output value $s$ depends on the input value $x_i$ or, for short, on the index $i$, even if not explicitly stated. This also means that we can write $p_{s|i}(s|i)$ instead of $p_{s|x}(s|x_i)$.
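To make the setup concrete, here is a minimal sampling sketch in Python (the function name sample_data and its array-based interface are my own choices for illustration, not part of the exercise): it draws pairs $(x_i, s)$ from $p_{s,x}$ by first choosing $i \in \{0, 1\}$ with probability $0.5$ each and then drawing $s$ uniformly from $I_i$.

```python
import numpy as np

def sample_data(m, eps, n, rng=None):
    """Draw n pairs (x_i, s) from p_{s,x}: x_i is 0 or 1 with probability
    0.5 each, and s is uniform on I_i = [m_i - eps_i, m_i + eps_i]."""
    rng = np.random.default_rng() if rng is None else rng
    i = rng.integers(0, 2, size=n)                 # index i, i.e. x_i = i, P = 0.5 each
    s = rng.uniform(m[i] - eps[i], m[i] + eps[i])  # s | x_i uniform on I_i
    return i, s

# Example: m_0 = 0, m_1 = 1, eps_0 = eps_1 = 0.5
i, s = sample_data(np.array([0.0, 1.0]), np.array([0.5, 0.5]), n=5)
```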

(a) Calculate the first and second moment of the distribution of $s$ separately for $i = 0$ (i.e. $x = 0$) and $i = 1$ (i.e. $x = 1$), i.e. calculate $\langle s \rangle_{s|i}$ and $\langle s^2 \rangle_{s|i}$.

Solution: The first moment, $\langle s \rangle_{s|i}$, is the mean, and for symmetry reasons one would expect it to be equal to $m_i$. The second moment, $\langle s^2 \rangle_{s|i}$, is a bit more complicated.
\begin{align}
\langle s \rangle_{s|i} &= \int_{-\infty}^{+\infty} s\, p_{s|i}(s|i)\, ds \tag{4} \\
&\overset{(2)}{=} \int_{m_i-\epsilon_i}^{m_i+\epsilon_i} s\, \frac{1}{2\epsilon_i}\, ds \tag{5} \\
&= \frac{1}{2\epsilon_i} \big[ s^2/2 \big]_{m_i-\epsilon_i}^{m_i+\epsilon_i} \tag{6} \\
&= \frac{1}{4\epsilon_i} \big( (m_i+\epsilon_i)^2 - (m_i-\epsilon_i)^2 \big) \tag{7} \\
&= \frac{1}{4\epsilon_i} \big( (m_i^2 + 2 m_i \epsilon_i + \epsilon_i^2) - (m_i^2 - 2 m_i \epsilon_i + \epsilon_i^2) \big) \tag{8} \\
&= m_i \,, \tag{9} \\
\langle s^2 \rangle_{s|i} &= \int_{-\infty}^{+\infty} s^2\, p_{s|i}(s|i)\, ds \tag{10} \\
&\overset{(2)}{=} \int_{m_i-\epsilon_i}^{m_i+\epsilon_i} s^2\, \frac{1}{2\epsilon_i}\, ds \tag{11} \\
&= \frac{1}{2\epsilon_i} \big[ s^3/3 \big]_{m_i-\epsilon_i}^{m_i+\epsilon_i} \tag{12} \\
&= \frac{1}{6\epsilon_i} \big( (m_i+\epsilon_i)^3 - (m_i-\epsilon_i)^3 \big) \tag{13} \\
&= \frac{1}{6\epsilon_i} \big( (m_i^3 + 3 m_i^2 \epsilon_i + 3 m_i \epsilon_i^2 + \epsilon_i^3) - (m_i^3 - 3 m_i^2 \epsilon_i + 3 m_i \epsilon_i^2 - \epsilon_i^3) \big) \tag{14} \\
&= m_i^2 + \epsilon_i^2/3 \,. \tag{15}
\end{align}
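As a quick sanity check of (9) and (15), the following sketch (the helper name moments_mc is chosen here for illustration) estimates the first and second moments of $s$ for one branch $i$ by Monte Carlo sampling and compares them with $m_i$ and $m_i^2 + \epsilon_i^2/3$.

```python
import numpy as np

def moments_mc(m_i, eps_i, n=1_000_000, seed=0):
    """Monte Carlo estimate of <s> and <s^2> for s uniform on
    [m_i - eps_i, m_i + eps_i]."""
    rng = np.random.default_rng(seed)
    s = rng.uniform(m_i - eps_i, m_i + eps_i, size=n)
    return s.mean(), (s**2).mean()

m_i, eps_i = 1.0, 0.5
m1_hat, m2_hat = moments_mc(m_i, eps_i)
print(m1_hat, "should be close to", m_i)                    # eq. (9)
print(m2_hat, "should be close to", m_i**2 + eps_i**2 / 3)  # eq. (15)
```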

From this probability distribution we randomly draw a set of training data, which is used to learn a function $y(x)$ that best approximates the training data. This function could, for example, be realized by a single neuron.

In the lecture we have learned two error functions that calculate the generalization error.

Error for a given function

The first error function calculates the generalization error for a given function $y(x)$ and is given by

\begin{align}
F(y) :=\ &\underbrace{\frac{1}{2} \Big\langle \big( y(x) - \langle s(x) \rangle_{s|x} \big)^2 \Big\rangle_x}_{=:\, B_S} \quad \text{(misfit of function $y$)} \notag \\
+\ &\underbrace{\frac{1}{2} \Big\langle \big( s(x) - \langle s(x) \rangle_{s|x} \big)^2 \Big\rangle_{s,x}}_{=:\, V_S} \quad \text{(variance in the data $s$)} \,. \tag{16}
\end{align}

Consider the following three functions of growing complexity:


$$y_0(x) := 0 \,, \tag{17}$$
$$y_1(x) := b \,, \tag{18}$$
$$y_2(x) := \begin{cases} c_0 & \text{if } x = 0 \\ c_1 & \text{if } x = 1 \end{cases} \,. \tag{19}$$
The subscript reflects the number of free parameters of the functions.

(b) Which values for the parameters $b$, $c_0$, and $c_1$ minimize $F(y)$ for $y_0$, $y_1$, and $y_2$? You do not need to prove your statement.

Solution: We discuss the three functions $y_0$, $y_1$, and $y_2$ in turn.

$y_0$: This function has no parameters.

$y_1$: The single parameter $b$ should obviously be the mean of the desired output values, which is
$$b = (m_0 + m_1)/2 \,. \tag{20}$$

$y_2$: With the two parameters $c_i$ the function can be adapted to the means at $x = 0$ and $x = 1$ separately, i.e.
$$c_i = m_i \tag{21}$$
would be an optimal choice.
(c) Given $m_i$ and $\epsilon_i$, what are the values of $F(y)$ for the three optimal functions found above?

Solution: The variance $V_S$ of the data is independent of the function $y$. Thus, we can compute it once and for all.

\begin{align}
V_S &\overset{(16)}{=} \frac{1}{2} \Big\langle \big( s - \langle s \rangle_{s|i} \big)^2 \Big\rangle_{s,i} \tag{22} \\
&= \frac{1}{2} \Big\langle \langle s^2 \rangle_{s|i} - \langle s \rangle_{s|i}^2 \Big\rangle_i \tag{23} \\
&\overset{(15,9)}{=} \frac{1}{2} \big\langle m_i^2 + \epsilon_i^2/3 - m_i^2 \big\rangle_i \tag{24} \\
&= \big( \epsilon_0^2 + \epsilon_1^2 \big)/12 \,. \tag{25}
\end{align}

Now we discuss the first term, $B_S$, for the three functions $y_0$, $y_1$, and $y_2$ in turn.

$y_0$: For $y_0$ we obtain
\begin{align}
B_{S0} &\overset{(16)}{=} \frac{1}{2} \Big\langle \big( y_0(x_i) - \langle s(x_i) \rangle_{s|i} \big)^2 \Big\rangle_i \tag{26} \\
&\overset{(17,9)}{=} \frac{1}{2} \big\langle (0 - m_i)^2 \big\rangle_i \tag{27} \\
&= (m_0^2 + m_1^2)/4 \,. \tag{28}
\end{align}

$y_1$: For $y_1$ we obtain
\begin{align}
B_{S1} &\overset{(16)}{=} \frac{1}{2} \Big\langle \big( y_1(x_i) - \langle s(x_i) \rangle_{s|i} \big)^2 \Big\rangle_i \tag{29} \\
&\overset{(18,9)}{=} \frac{1}{2} \big\langle (b - m_i)^2 \big\rangle_i \tag{30} \\
&\overset{(20)}{=} \frac{1}{2} \big\langle \big( (m_0 + m_1)/2 - m_i \big)^2 \big\rangle_i \tag{31} \\
&= \frac{1}{4} \big( (m_1 - m_0)^2/4 + (m_0 - m_1)^2/4 \big) \tag{32} \\
&= (m_0 - m_1)^2/8 \,. \tag{33}
\end{align}

$y_2$: For $y_2$ we obtain
\begin{align}
B_{S2} &\overset{(16)}{=} \frac{1}{2} \Big\langle \big( y_2(x_i) - \langle s(x_i) \rangle_{s|i} \big)^2 \Big\rangle_i \tag{34} \\
&\overset{(19,9)}{=} \frac{1}{2} \big\langle (c_i - m_i)^2 \big\rangle_i \tag{35} \\
&\overset{(21)}{=} \frac{1}{2} \big\langle (m_i - m_i)^2 \big\rangle_i \tag{36} \\
&= 0 \,. \tag{37}
\end{align}

In summary we have
\begin{align}
F(y_0) &\overset{(16)}{=} V_S + B_{S0} \overset{(25,28)}{=} \big( \epsilon_0^2 + \epsilon_1^2 \big)/12 + (m_0^2 + m_1^2)/4 \,, \tag{38} \\
F(y_1) &\overset{(16)}{=} V_S + B_{S1} \overset{(25,33)}{=} \big( \epsilon_0^2 + \epsilon_1^2 \big)/12 + (m_0 - m_1)^2/8 \,, \tag{39} \\
F(y_2) &\overset{(16)}{=} V_S + B_{S2} \overset{(25,37)}{=} \big( \epsilon_0^2 + \epsilon_1^2 \big)/12 + 0 \,. \tag{40}
\end{align}

(d) Prove that $F(y_0) \ge F(y_1) \ge F(y_2)$.

Solution: Since $V_S$ is common to all three values, we only have to compare the $B_{Sj}$-terms. It is quite obvious that all $B_{Sj}$ are non-negative, and since $B_{S2} = 0$, it is evident that $F(y_0) \ge F(y_2)$ and $F(y_1) \ge F(y_2)$. Thus, it only remains to be shown that

\begin{align}
F(y_0) &\overset{?}{\ge} F(y_1) \tag{41} \\
\overset{(16)}{\iff} \quad B_{S0} &\ge B_{S1} \tag{42} \\
\overset{(28,33)}{\iff} \quad (m_0^2 + m_1^2)/4 &\ge (m_1 - m_0)^2/8 \qquad | \cdot 8 \tag{43} \\
\iff \quad 2 m_0^2 + 2 m_1^2 &\ge m_0^2 - 2 m_0 m_1 + m_1^2 \tag{44} \\
\iff \quad m_0^2 + 2 m_0 m_1 + m_1^2 &\ge 0 \tag{45} \\
\iff \quad (m_0 + m_1)^2 &\ge 0 \,, \tag{46}
\end{align}

which is obviously true.
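The closed-form values (38)-(40) and the ordering just proven are easy to check numerically. The sketch below (the function name F_values is hypothetical) simply evaluates $V_S$ and the $B_{Sj}$ for given $m_i$ and $\epsilon_i$.

```python
def F_values(m0, m1, eps0, eps1):
    """F(y_j) = V_S + B_Sj for the three optimally tuned functions,
    following eqs. (25), (28), (33), (37)-(40)."""
    V_S = (eps0**2 + eps1**2) / 12          # eq. (25)
    B_S0 = (m0**2 + m1**2) / 4              # eq. (28)
    B_S1 = (m0 - m1)**2 / 8                 # eq. (33)
    B_S2 = 0.0                              # eq. (37)
    return V_S + B_S0, V_S + B_S1, V_S + B_S2

F0, F1, F2 = F_values(m0=1.0, m1=-1.0, eps0=1.0, eps1=1.0)
assert F0 >= F1 >= F2                       # statement of exercise (d)
```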

Error for an ensemble of functions trained on different data sets

The second error function calculates the mean generalization error when considering infinitely many independent functions that have been trained with (in general) different training data sets $D$, drawn independently from the same probability distribution, i.e. from now on $y(x)$ depends on $D$. It is given by

\begin{align}
\langle F(y) \rangle_D =\ &\underbrace{\frac{1}{2} \Big\langle \big( \langle y(x) \rangle_D - \langle s(x) \rangle_{s|x} \big)^2 \Big\rangle_x}_{=:\, B_D} \quad \text{(bias)}^2 \notag \\
+\ &\underbrace{\frac{1}{2} \Big\langle \big( y(x) - \langle y(x) \rangle_D \big)^2 \Big\rangle_{D,x}}_{=:\, V_D} \quad \text{(variance in the functions $y$)} \notag \\
+\ &\underbrace{\frac{1}{2} \Big\langle \big( s(x) - \langle s(x) \rangle_{s|x} \big)^2 \Big\rangle_{s,x}}_{=:\, V_S} \quad \text{(variance in the data $s$)} \,. \tag{48}
\end{align}

Consider the case where the training data consist of two pairs $D = \{(x_0 = 0, s_0), (x_1 = 1, s_1)\}$, where $s_0$ and $s_1$ are randomly drawn from the probability distribution $p_{s|i}(s|i)$. This leads to a probability distribution for drawing a particular training data set $D$ of $p_D(D) = p_{s|i}(s_0|0)\, p_{s|i}(s_1|1)$.

(e) Given a data set $D = \{(x_0, s_0), (x_1, s_1)\}$ (and no other information about $m_i$ and $\epsilon_i$), how should we choose the free parameters $b$, $c_0$, and $c_1$ in order to minimize $F(y)$ for the functions $y_0$, $y_1$, and $y_2$? (Again no proof required.) These values for $b$, $c_0$, and $c_1$ will in the following be called optimal parameters.

Solution: We proceed analogously to the first exercise.

$y_0$: This function has no parameters.

$y_1$: For $y_1$ an optimal choice would be
$$b = (s_0 + s_1)/2 \,. \tag{49}$$

$y_2$: For $y_2$ an optimal choice would be
$$c_i = s_i \,. \tag{50}$$
(f) The optimal parameters clearly depend on the data set $D$, so they can themselves be regarded as stochastic variables. Averages are now averages over all possible data sets $D = \{(x_0, s_0), (x_1, s_1)\}$. Calculate the mean and the variance of the optimal parameters $b$, $c_0$, and $c_1$ in terms of $m_i$ and $\epsilon_i$. (Hint: Make use of the fact that $s_0$ and $s_1$ are drawn independently, i.e. $\langle s_0 s_1 \rangle_D = \langle s_0 \rangle_D \langle s_1 \rangle_D$. You do not need to calculate the probability distributions of the optimal parameter values.)
Solution:

$y_0$: This function has no parameters.

$y_1$: For $y_1$ we obtain
\begin{align}
\langle b \rangle_D &\overset{(49)}{=} \big( \langle s_0 \rangle_D + \langle s_1 \rangle_D \big)/2 \tag{51} \\
&\overset{(9)}{=} (m_0 + m_1)/2 \,, \tag{52} \\
\big\langle (b - \langle b \rangle_D)^2 \big\rangle_D &= \langle b^2 \rangle_D - \langle b \rangle_D^2 \tag{53} \\
&\overset{(49,52)}{=} \big\langle (s_0 + s_1)^2 \big\rangle_D /4 - (m_0 + m_1)^2/4 \tag{54} \\
&= \big( \langle s_0^2 \rangle_D + 2 \langle s_0 s_1 \rangle_D + \langle s_1^2 \rangle_D \big)/4 - (m_0 + m_1)^2/4 \tag{55} \\
&= \big( \langle s_0^2 \rangle_D + 2 \langle s_0 \rangle_D \langle s_1 \rangle_D + \langle s_1^2 \rangle_D \big)/4 - (m_0 + m_1)^2/4 \tag{56} \\
&\qquad \text{(since $s_0$ and $s_1$ are statistically independent)} \notag \\
&\overset{(15,9)}{=} \big( m_0^2 + \epsilon_0^2/3 + 2 m_0 m_1 + m_1^2 + \epsilon_1^2/3 \big)/4 - \big( m_0^2 + 2 m_0 m_1 + m_1^2 \big)/4 \tag{57} \\
&= \big( \epsilon_0^2 + \epsilon_1^2 \big)/12 \tag{58} \\
&\overset{(25)}{=} V_S \,. \tag{59}
\end{align}

$y_2$: For $y_2$ we obtain
\begin{align}
\langle c_i \rangle_D &\overset{(50)}{=} \langle s_i \rangle_D \tag{60} \\
&\overset{(9)}{=} m_i \,, \tag{61} \\
\big\langle (c_i - \langle c_i \rangle_D)^2 \big\rangle_D &= \langle c_i^2 \rangle_D - \langle c_i \rangle_D^2 \tag{62} \\
&\overset{(50)}{=} \langle s_i^2 \rangle_D - \langle s_i \rangle_D^2 \tag{63} \\
&\overset{(15,9)}{=} m_i^2 + \epsilon_i^2/3 - m_i^2 \tag{64} \\
&= \epsilon_i^2/3 \,. \tag{65}
\end{align}
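The statistics (51)-(65) of the estimated parameters can also be verified by simulation. The sketch below (names chosen here for illustration, not part of the notes) repeatedly draws training sets $D = \{(0, s_0), (1, s_1)\}$, forms $b = (s_0 + s_1)/2$ and $c_i = s_i$, and compares the empirical means and variances with the analytical results.

```python
import numpy as np

def parameter_statistics(m, eps, n_datasets=1_000_000, seed=0):
    """Empirical mean/variance of the optimal parameters b and c_i over
    many training sets D = {(x_0=0, s_0), (x_1=1, s_1)}."""
    rng = np.random.default_rng(seed)
    s0 = rng.uniform(m[0] - eps[0], m[0] + eps[0], size=n_datasets)
    s1 = rng.uniform(m[1] - eps[1], m[1] + eps[1], size=n_datasets)
    b = (s0 + s1) / 2                       # eq. (49); c_0 = s_0, c_1 = s_1 by eq. (50)
    return b.mean(), b.var(), s0.mean(), s0.var()

m, eps = np.array([0.0, 1.0]), np.array([1.0, 1.0])
b_mean, b_var, c0_mean, c0_var = parameter_statistics(m, eps)
print(b_mean, "vs", (m[0] + m[1]) / 2)              # eq. (52)
print(b_var, "vs", (eps[0]**2 + eps[1]**2) / 12)    # eqs. (58,59): V_S
print(c0_mean, "vs", m[0])                          # eq. (61)
print(c0_var, "vs", eps[0]**2 / 3)                  # eq. (65)
```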
(g) Calculate $\langle F(y) \rangle_D$ for the three functions $y_0$, $y_1$, and $y_2$.

Solution: To calculate $\langle F(y) \rangle_D$ we need to calculate the three terms $B_D$, $V_D$, and $V_S$. The latter does not depend on the function $y_j$ and has been calculated before (25). The other two can be directly calculated as follows.

$y_0$: For $y_0$ we obtain
\begin{align}
B_{D0} &\overset{(48)}{=} \frac{1}{2} \Big\langle \big( \langle y_0(x_i) \rangle_D - \langle s(x_i) \rangle_{s|i} \big)^2 \Big\rangle_i \tag{66} \\
&\overset{(17,9)}{=} \frac{1}{2} \big\langle (0 - m_i)^2 \big\rangle_i \tag{67} \\
&= (m_0^2 + m_1^2)/4 \tag{68} \\
&\overset{(28)}{=} B_{S0} \,, \tag{69} \\
V_{D0} &\overset{(48)}{=} \frac{1}{2} \Big\langle \big( y_0(x_i) - \langle y_0(x_i) \rangle_D \big)^2 \Big\rangle_{D,i} \tag{70} \\
&\overset{(17)}{=} \frac{1}{2} \big\langle (0 - 0)^2 \big\rangle_{D,i} \tag{71} \\
&= 0 \,. \tag{72}
\end{align}
$y_1$: For $y_1$ we obtain
\begin{align}
B_{D1} &\overset{(48)}{=} \frac{1}{2} \Big\langle \big( \langle y_1(x_i) \rangle_D - \langle s(x_i) \rangle_{s|i} \big)^2 \Big\rangle_i \tag{73} \\
&\overset{(18,9)}{=} \frac{1}{2} \big\langle (\langle b \rangle_D - m_i)^2 \big\rangle_i \tag{74} \\
&\overset{(52)}{=} \frac{1}{2} \big\langle \big( (m_0 + m_1)/2 - m_i \big)^2 \big\rangle_i \tag{75} \\
&= \frac{1}{4} \big( (m_1 - m_0)^2/4 + (m_0 - m_1)^2/4 \big) \tag{76} \\
&= (m_0 - m_1)^2/8 \tag{77} \\
&\overset{(33)}{=} B_{S1} \,, \tag{78} \\
V_{D1} &\overset{(48)}{=} \frac{1}{2} \Big\langle \big( y_1(x_i) - \langle y_1(x_i) \rangle_D \big)^2 \Big\rangle_{D,i} \tag{79} \\
&\overset{(18)}{=} \frac{1}{2} \big\langle (b - \langle b \rangle_D)^2 \big\rangle_{D,i} \tag{80} \\
&= \frac{1}{2} \big\langle (b - \langle b \rangle_D)^2 \big\rangle_D \tag{81} \\
&\overset{(58)}{=} \big( \epsilon_0^2 + \epsilon_1^2 \big)/24 \tag{82} \\
&\overset{(25)}{=} V_S/2 \,. \tag{83}
\end{align}

$y_2$: For $y_2$ we obtain
\begin{align}
B_{D2} &\overset{(48)}{=} \frac{1}{2} \Big\langle \big( \langle y_2(x_i) \rangle_D - \langle s(x_i) \rangle_{s|i} \big)^2 \Big\rangle_i \tag{84} \\
&\overset{(19,9)}{=} \frac{1}{2} \big\langle (\langle c_i \rangle_D - m_i)^2 \big\rangle_i \tag{85} \\
&\overset{(61)}{=} \frac{1}{2} \big\langle (m_i - m_i)^2 \big\rangle_i \tag{86} \\
&= 0 \tag{87} \\
&\overset{(37)}{=} B_{S2} \,, \tag{88} \\
V_{D2} &\overset{(48)}{=} \frac{1}{2} \Big\langle \big( y_2(x_i) - \langle y_2(x_i) \rangle_D \big)^2 \Big\rangle_{D,i} \tag{89} \\
&\overset{(19)}{=} \frac{1}{2} \big\langle (c_i - \langle c_i \rangle_D)^2 \big\rangle_{D,i} \tag{90} \\
&= \frac{1}{2} \Big\langle \big\langle (c_i - \langle c_i \rangle_D)^2 \big\rangle_D \Big\rangle_i \tag{91} \\
&\overset{(65)}{=} \frac{1}{2} \big\langle \epsilon_i^2/3 \big\rangle_i \tag{92} \\
&= \big( \epsilon_0^2 + \epsilon_1^2 \big)/12 \tag{93} \\
&\overset{(25)}{=} V_S \,. \tag{94}
\end{align}
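To check the decomposition (66)-(94) end to end, one can estimate $\langle F(y_j) \rangle_D$ directly: draw many training sets, fit each $y_j$ to the training set, evaluate its generalization error $F(y_j)$ from (16) against the true distribution, and average over the training sets. Below is a minimal Monte Carlo sketch (hypothetical names; it uses the closed-form $V_S$ from (25) instead of sampling test data).

```python
import numpy as np

def mean_generalization_error(m, eps, n_datasets=200_000, seed=0):
    """Monte Carlo estimate of <F(y_j)>_D for y_0, y_1, y_2, where each
    training set is D = {(0, s_0), (1, s_1)} and F(y) is evaluated
    against the true distribution as in eq. (16)."""
    rng = np.random.default_rng(seed)
    s0 = rng.uniform(m[0] - eps[0], m[0] + eps[0], size=n_datasets)
    s1 = rng.uniform(m[1] - eps[1], m[1] + eps[1], size=n_datasets)
    V_S = (eps[0]**2 + eps[1]**2) / 12                  # eq. (25)

    def F(y_at_0, y_at_1):
        # misfit term B_S of eq. (16) for a function with values y(0), y(1), plus V_S
        return 0.5 * ((y_at_0 - m[0])**2 + (y_at_1 - m[1])**2) / 2 + V_S

    F_y0 = F(0.0, 0.0)                                  # y_0 has no parameters
    b = (s0 + s1) / 2                                   # eq. (49)
    F_y1 = F(b, b).mean()
    F_y2 = F(s0, s1).mean()                             # c_i = s_i, eq. (50)
    return F_y0, F_y1, F_y2

m, eps = np.array([1.0, -1.0]), np.array([1.0, 1.0])
print(mean_generalization_error(m, eps))
# compare with B_S0 + 0 + V_S (68,72), B_S1 + V_S/2 + V_S (77,83), 0 + V_S + V_S (87,94)
```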

(h) Interpret the result: How do the different terms in the generalization error depend on the data and the complexity of the functions? Which of the functions is the best under which circumstances?

Solution: We can summarize the results in the following table.

    parameters optimal for                                   $y_0$                $y_1$                    $y_2$
    true data distribution   $F(y_j) =$                      $B_{S0} + V_S$       $B_{S1} + V_S$           $0 + V_S$
    training data $D$        $\langle F(y_j) \rangle_D =$    $B_{S0} + 0 + V_S$   $B_{S1} + V_S/2 + V_S$   $0 + V_S + V_S$
Remember that
\begin{align}
B_{S0} &\overset{(28)}{=} (m_0^2 + m_1^2)/4 \,, \tag{95} \\
B_{S1} &\overset{(33)}{=} (m_0 - m_1)^2/8 \,, \tag{96} \\
&\quad \text{with } B_{S0} \overset{(42)}{\ge} B_{S1} \,, \tag{97} \\
V_S &\overset{(25)}{=} \big( \epsilon_0^2 + \epsilon_1^2 \big)/12 \,. \tag{98}
\end{align}

Thus, we see that if we choose the parameters optimally for the true data distribution (as we did in the first part of the exercise), it is better to use the function with more parameters, since $F(y_0) \ge F(y_1) \ge F(y_2)$.

If, however, the parameters have to be estimated from some limited training data (as was the case in the second part), there is a tradeoff between the bias of the estimated function and its variance, i.e. its randomness due to the limited training data. The bias term in $\langle F(y_j) \rangle_D$ decreases from $B_{S0}$ over $B_{S1}$ down to $0$, while the function-variance term increases from $0$ over $V_S/2$ up to $V_S$ as the number of parameters increases. Which function is optimal depends on how well the true distribution matches the prior assumptions built into the functions.
The following table shows a few examples for concrete distributions.

    $m_0$   $m_1$   $\epsilon_0$   $\epsilon_1$                                        $y_0$   $y_1$   $y_2$
     0       0       0              0            $12\, F(y_j) =$                        0       0       0
     0       0       0              0            $12\, \langle F(y_j) \rangle_D =$      0       0       0
     0       0       1              1            $12\, F(y_j) =$                        2       2       2
     0       0       1              1            $12\, \langle F(y_j) \rangle_D =$      2       3       4
     1       1       0              0            $12\, F(y_j) =$                        6       0       0
     1       1       0              0            $12\, \langle F(y_j) \rangle_D =$      6       0       0
     1       1       1              1            $12\, F(y_j) =$                        8       2       2
     1       1       1              1            $12\, \langle F(y_j) \rangle_D =$      8       3       4
     1      -1       0              0            $12\, F(y_j) =$                        6       6       0
     1      -1       0              0            $12\, \langle F(y_j) \rangle_D =$      6       6       0
     1      -1       1              1            $12\, F(y_j) =$                        8       8       2
     1      -1       1              1            $12\, \langle F(y_j) \rangle_D =$      8       9       4
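For completeness, the table entries can be reproduced from the closed-form expressions; the short sketch below (the helper name table_row is hypothetical) prints $12\, F(y_j)$ and $12\, \langle F(y_j) \rangle_D$ for the parameter settings listed above.

```python
def table_row(m0, m1, e0, e1):
    """Return 12*F(y_j) and 12*<F(y_j)>_D for j = 0, 1, 2,
    using eqs. (25), (28), (33) and the summary table above."""
    V_S = (e0**2 + e1**2) / 12
    B_S0 = (m0**2 + m1**2) / 4
    B_S1 = (m0 - m1)**2 / 8
    F = (B_S0 + V_S, B_S1 + V_S, 0 + V_S)
    F_D = (B_S0 + 0 + V_S, B_S1 + V_S / 2 + V_S, 0 + V_S + V_S)
    return [12 * v for v in F], [12 * v for v in F_D]

for params in [(0, 0, 0, 0), (0, 0, 1, 1), (1, 1, 0, 0),
               (1, 1, 1, 1), (1, -1, 0, 0), (1, -1, 1, 1)]:
    print(params, *table_row(*params))
```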
