Chapter 1 Pattern Recognition
1.1 Substituting (1.1) into (1.2) and then differentiating with respect to $w_i$ we obtain
$$\sum_{n=1}^{N}\left(\sum_{j=0}^{M} w_j x_n^j - t_n\right) x_n^i = 0. \tag{1}$$
Rearranging terms then gives the required result.
1.2 For the regularized sum-of-squares error function given by (1.4) the corresponding linear equations are again obtained by differentiation, and take the same form as (1.122), but with $A_{ij}$ replaced by $\widetilde{A}_{ij}$, given by
$$\widetilde{A}_{ij} = A_{ij} + \lambda I_{ij}. \tag{2}$$
1.3 Let us denote apples, oranges and limes by a, o and l respectively. The marginal probability of selecting an apple is given by
$$p(a) = p(a|r)p(r) + p(a|b)p(b) + p(a|g)p(g) = \frac{3}{10}\times 0.2 + \frac{1}{2}\times 0.2 + \frac{3}{10}\times 0.6 = 0.34 \tag{3}$$
where the conditional probabilities are obtained from the proportions of apples in each box.
To find the probability that the box was green, given that the fruit we selected was an orange, we can use Bayes' theorem
$$p(g|o) = \frac{p(o|g)p(g)}{p(o)}. \tag{4}$$
The denominator in (4) is given by
$$p(o) = p(o|r)p(r) + p(o|b)p(b) + p(o|g)p(g) = \frac{4}{10}\times 0.2 + \frac{1}{2}\times 0.2 + \frac{3}{10}\times 0.6 = 0.36 \tag{5}$$
from which we obtain
$$p(g|o) = \frac{3}{10}\times\frac{0.6}{0.36} = \frac{1}{2}. \tag{6}$$
1.4 We are often interested in finding the most probable value for some quantity. In the case of probability distributions over discrete variables this poses little problem. However, for continuous variables there is a subtlety arising from the nature of probability densities and the way they transform under non-linear changes of variable.
Consider first the way a function f(x) behaves when we change to a new variable y where the two variables are related by x = g(y). This defines a new function of y given by
$$\bar{f}(y) = f(g(y)). \tag{7}$$
Suppose f(x) has a mode (i.e. a maximum) at $\hat{x}$ so that $f'(\hat{x}) = 0$. The corresponding mode of $\bar{f}(y)$ will occur for a value $\hat{y}$ obtained by differentiating both sides of (7) with respect to y
$$\bar{f}'(\hat{y}) = f'(g(\hat{y}))\,g'(\hat{y}) = 0. \tag{8}$$
Assuming $g'(\hat{y}) \neq 0$ at the mode, then $f'(g(\hat{y})) = 0$. However, we know that $f'(\hat{x}) = 0$, and so we see that the locations of the mode expressed in terms of each of the variables x and y are related by $\hat{x} = g(\hat{y})$, as one would expect. Thus, finding a mode with respect to the variable x is completely equivalent to first transforming to the variable y, then finding a mode with respect to y, and then transforming back to x.
Now consider the behaviour of a probability density $p_x(x)$ under the change of variables x = g(y), where the density with respect to the new variable is $p_y(y)$ and is given by (1.27). Let us write $g'(y) = s|g'(y)|$ where $s \in \{-1, +1\}$. Then (1.27) can be written
$$p_y(y) = p_x(g(y))\, s\, g'(y).$$
Differentiating both sides with respect to y then gives
$$p_y'(y) = s\, p_x'(g(y))\{g'(y)\}^2 + s\, p_x(g(y))\,g''(y). \tag{9}$$
Due to the presence of the second term on the right hand side of (9) the relationship $\hat{x} = g(\hat{y})$ no longer holds. Thus the value of x obtained by maximizing $p_x(x)$ will not be the value obtained by transforming to $p_y(y)$, maximizing with respect to y and then transforming back to x. This causes modes of densities to be dependent on the choice of variables. In the case of a linear transformation, the second term on the right hand side of (9) vanishes, and so the location of the maximum transforms according to $\hat{x} = g(\hat{y})$.
This effect can be illustrated with a simple example, as shown in Figure 1. We begin by considering a Gaussian distribution $p_x(x)$ over x with mean μ = 6 and standard deviation σ = 1, shown by the red curve in Figure 1. Next we draw a sample of N = 50,000 points from this distribution and plot a histogram of their values, which as expected agrees with the distribution $p_x(x)$.
Now consider a non-linear change of variables from x to y given by
$$x = g(y) = \ln(y) - \ln(1-y) + 5. \tag{10}$$
The inverse of this function is given by
$$y = g^{-1}(x) = \frac{1}{1 + \exp(-x + 5)} \tag{11}$$
which is a logistic sigmoid function, and is shown in Figure 1 by the blue curve.

Figure 1: Example of the transformation of the mode of a density under a non-linear change of variables, illustrating the different behaviour compared to a simple function. The curves shown are $g^{-1}(x)$, $p_x(x)$ and $p_y(y)$. See the text for details.

If we simply transform $p_x(x)$ as a function of x we obtain the green curve $p_x(g(y))$ shown in Figure 1, and we see that the mode of the density $p_x(x)$ is transformed via the sigmoid function to the mode of this curve. However, the density over y transforms instead according to (1.27) and is shown by the magenta curve on the left side of the diagram. Note that this has its mode shifted relative to the mode of the green curve.
To confirm this result we take our sample of 50,000 values of x, evaluate the corresponding values of y using (11), and then plot a histogram of their values. We see that this histogram matches the magenta curve in Figure 1 and not the green curve!
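A minimal NumPy sketch of the experiment described above (Gaussian with μ = 6, σ = 1, N = 50,000 samples, transformed through (11)); the empirical mode of the histogram over y differs from the naive value obtained by pushing the mode of $p_x(x)$ through the sigmoid.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n = 6.0, 1.0, 50_000

x = rng.normal(mu, sigma, size=n)            # samples from p_x(x)
y = 1.0 / (1.0 + np.exp(-x + 5.0))           # equation (11): y = g^{-1}(x)

# Mode of p_x pushed through the sigmoid (mode of the green curve in Figure 1)
naive_mode = 1.0 / (1.0 + np.exp(-mu + 5.0))

# Empirical mode of p_y(y) from a histogram (magenta curve in Figure 1)
counts, edges = np.histogram(y, bins=200, range=(0.0, 1.0), density=True)
i = np.argmax(counts)
hist_mode = 0.5 * (edges[i] + edges[i + 1])

print(naive_mode)   # ~0.73
print(hist_mode)    # shifted away from 0.73, illustrating the moved mode
```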
1.5 Expanding the square we have
$$\begin{aligned}
\mathbb{E}[(f(x)-\mathbb{E}[f(x)])^2] &= \mathbb{E}\left[f(x)^2 - 2f(x)\mathbb{E}[f(x)] + \mathbb{E}[f(x)]^2\right]\\
&= \mathbb{E}[f(x)^2] - 2\mathbb{E}[f(x)]\mathbb{E}[f(x)] + \mathbb{E}[f(x)]^2\\
&= \mathbb{E}[f(x)^2] - \mathbb{E}[f(x)]^2
\end{aligned}$$
as required.
1.6 The definition of covariance is given by (1.41) as
$$\operatorname{cov}[x,y] = \mathbb{E}[xy] - \mathbb{E}[x]\mathbb{E}[y].$$
Using (1.33) and the fact that p(x, y) = p(x)p(y) when x and y are independent, we obtain
$$\mathbb{E}[xy] = \sum_x\sum_y p(x,y)\,xy = \sum_x p(x)x\sum_y p(y)y = \mathbb{E}[x]\mathbb{E}[y]$$
and hence cov[x, y] = 0. The case where x and y are continuous variables is analogous, with (1.33) replaced by (1.34) and the sums replaced by integrals.
1.7 The transformation from Cartesian to polar coordinates is defined by
$$x = r\cos\theta \tag{12}$$
$$y = r\sin\theta \tag{13}$$
and hence we have $x^2 + y^2 = r^2$ where we have used the well-known trigonometric result (2.177). Also the Jacobian of the change of variables is easily seen to be
$$\frac{\partial(x,y)}{\partial(r,\theta)} = \begin{vmatrix}\partial x/\partial r & \partial x/\partial\theta\\ \partial y/\partial r & \partial y/\partial\theta\end{vmatrix} = \begin{vmatrix}\cos\theta & -r\sin\theta\\ \sin\theta & r\cos\theta\end{vmatrix} = r$$
where again we have used (2.177). Thus the double integral in (1.125) becomes
$$\begin{aligned}
I^2 &= \int_0^{2\pi}\int_0^{\infty}\exp\left(-\frac{r^2}{2\sigma^2}\right)r\,\mathrm{d}r\,\mathrm{d}\theta \tag{14}\\
&= 2\pi\int_0^{\infty}\exp\left(-\frac{u}{2\sigma^2}\right)\frac{1}{2}\,\mathrm{d}u \tag{15}\\
&= \pi\left[\exp\left(-\frac{u}{2\sigma^2}\right)\left(-2\sigma^2\right)\right]_0^{\infty} \tag{16}\\
&= 2\pi\sigma^2 \tag{17}
\end{aligned}$$
where we have used the change of variables $r^2 = u$. Thus
$$I = \left(2\pi\sigma^2\right)^{1/2}.$$
Finally, using the transformation $y = x - \mu$, the integral of the Gaussian distribution becomes
$$\int_{-\infty}^{\infty}\mathcal{N}\!\left(x|\mu,\sigma^2\right)\mathrm{d}x = \frac{1}{(2\pi\sigma^2)^{1/2}}\int_{-\infty}^{\infty}\exp\left(-\frac{y^2}{2\sigma^2}\right)\mathrm{d}y = \frac{I}{(2\pi\sigma^2)^{1/2}} = 1$$
as required.
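As a sanity check on this normalization result, the one-dimensional Gaussian integral can also be evaluated numerically; this is a small sketch using scipy.integrate.quad, with illustrative values of μ and σ.

```python
import numpy as np
from scipy.integrate import quad

mu, sigma = 1.5, 2.0

def gaussian(x):
    # N(x | mu, sigma^2) as defined in (1.46)
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / np.sqrt(2.0 * np.pi * sigma ** 2)

area, err = quad(gaussian, -np.inf, np.inf)
print(area)   # ~1.0, confirming the normalization derived above
```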
1.8 From the definition (1.46) of the univariate Gaussian distribution, we have
$$\mathbb{E}[x] = \int_{-\infty}^{\infty}\left(\frac{1}{2\pi\sigma^2}\right)^{1/2}\exp\left(-\frac{1}{2\sigma^2}(x-\mu)^2\right)x\,\mathrm{d}x. \tag{18}$$
Now change variables using $y = x - \mu$ to give
$$\mathbb{E}[x] = \int_{-\infty}^{\infty}\left(\frac{1}{2\pi\sigma^2}\right)^{1/2}\exp\left(-\frac{y^2}{2\sigma^2}\right)(y+\mu)\,\mathrm{d}y. \tag{19}$$
We now note that in the factor (y + μ) the first term in y corresponds to an odd integrand and so this integral must vanish (to show this explicitly, write the integral as the sum of two integrals, one from −∞ to 0 and the other from 0 to ∞ and then show that these two integrals cancel). In the second term, μ is a constant and pulls outside the integral, leaving a normalized Gaussian distribution which integrates to 1, and so we obtain (1.49).
To derive (1.50) we first substitute the expression (1.46) for the normal distribution into the normalization result (1.48) and rearrange to obtain
$$\int_{-\infty}^{\infty}\exp\left(-\frac{1}{2\sigma^2}(x-\mu)^2\right)\mathrm{d}x = \left(2\pi\sigma^2\right)^{1/2}. \tag{20}$$
We now differentiate both sides of (20) with respect to $\sigma^2$ and then rearrange to obtain
$$\left(\frac{1}{2\pi\sigma^2}\right)^{1/2}\int_{-\infty}^{\infty}\exp\left(-\frac{1}{2\sigma^2}(x-\mu)^2\right)(x-\mu)^2\,\mathrm{d}x = \sigma^2 \tag{21}$$
which directly shows that
$$\mathbb{E}[(x-\mu)^2] = \operatorname{var}[x] = \sigma^2. \tag{22}$$
Now we expand the square on the left-hand side giving
$$\mathbb{E}[x^2] - 2\mu\mathbb{E}[x] + \mu^2 = \sigma^2.$$
Making use of (1.49) then gives (1.50) as required.
Finally, (1.51) follows directly from (1.49) and (1.50)
$$\mathbb{E}[x^2] - \mathbb{E}[x]^2 = \left(\mu^2+\sigma^2\right) - \mu^2 = \sigma^2.$$
1.9 For the univariate case, we simply differentiate (1.46) with respect to x to obtain
$$\frac{\mathrm{d}}{\mathrm{d}x}\mathcal{N}\!\left(x|\mu,\sigma^2\right) = -\mathcal{N}\!\left(x|\mu,\sigma^2\right)\frac{x-\mu}{\sigma^2}.$$
Setting this to zero we obtain x = μ.
Similarly, for the multivariate case we differentiate (1.52) with respect to x to obtain
$$\frac{\partial}{\partial\mathbf{x}}\mathcal{N}(\mathbf{x}|\boldsymbol{\mu},\boldsymbol{\Sigma}) = -\frac{1}{2}\mathcal{N}(\mathbf{x}|\boldsymbol{\mu},\boldsymbol{\Sigma})\nabla_{\mathbf{x}}\left[(\mathbf{x}-\boldsymbol{\mu})^{\mathrm{T}}\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu})\right] = -\mathcal{N}(\mathbf{x}|\boldsymbol{\mu},\boldsymbol{\Sigma})\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu}),$$
where we have used (C.19), (C.20)¹ and the fact that $\boldsymbol{\Sigma}^{-1}$ is symmetric. Setting this derivative equal to 0, and left-multiplying by Σ, leads to the solution x = μ.
¹ NOTE: In PRML, there are mistakes in (C.20); all instances of x (vector) in the denominators should be x (scalar).
1.10 Since x and z are independent, their joint distribution factorizes p(x, z) = p(x)p(z), and so
$$\mathbb{E}[x+z] = \iint (x+z)p(x)p(z)\,\mathrm{d}x\,\mathrm{d}z \tag{23}$$
$$= \int xp(x)\,\mathrm{d}x + \int zp(z)\,\mathrm{d}z \tag{24}$$
$$= \mathbb{E}[x] + \mathbb{E}[z]. \tag{25}$$
Similarly for the variances, we first note that
$$(x+z-\mathbb{E}[x+z])^2 = (x-\mathbb{E}[x])^2 + (z-\mathbb{E}[z])^2 + 2(x-\mathbb{E}[x])(z-\mathbb{E}[z]) \tag{26}$$
where the final term will integrate to zero with respect to the factorized distribution p(x)p(z). Hence
$$\operatorname{var}[x+z] = \iint(x+z-\mathbb{E}[x+z])^2 p(x)p(z)\,\mathrm{d}x\,\mathrm{d}z = \int(x-\mathbb{E}[x])^2 p(x)\,\mathrm{d}x + \int(z-\mathbb{E}[z])^2 p(z)\,\mathrm{d}z = \operatorname{var}[x]+\operatorname{var}[z]. \tag{27}$$
For discrete variables the integrals are replaced by summations, and the same results are again obtained.
1.11 We write $L = \ln p(\mathbf{X}|\mu,\sigma^2)$ from (1.54). By standard rules of differentiation we obtain
$$\frac{\partial L}{\partial\mu} = \frac{1}{\sigma^2}\sum_{n=1}^N(x_n-\mu).$$
Setting this equal to zero and moving the terms involving μ to the other side of the equation we get
$$\frac{1}{\sigma^2}\sum_{n=1}^N x_n = \frac{1}{\sigma^2}N\mu$$
and by multiplying both sides by $\sigma^2/N$ we get (1.55).
Similarly we have
$$\frac{\partial L}{\partial\sigma^2} = \frac{1}{2(\sigma^2)^2}\sum_{n=1}^N(x_n-\mu)^2 - \frac{N}{2}\frac{1}{\sigma^2}$$
and setting this to zero we obtain
$$\frac{N}{2}\frac{1}{\sigma^2} = \frac{1}{2(\sigma^2)^2}\sum_{n=1}^N(x_n-\mu)^2.$$
Multiplying both sides by $2(\sigma^2)^2/N$ and substituting $\mu_{\mathrm{ML}}$ for μ we get (1.56).
1.12 If m = n then $x_nx_m = x_n^2$ and using (1.50) we obtain $\mathbb{E}[x_n^2] = \mu^2+\sigma^2$, whereas if $n\neq m$ then the two data points $x_n$ and $x_m$ are independent and hence $\mathbb{E}[x_nx_m] = \mathbb{E}[x_n]\mathbb{E}[x_m] = \mu^2$, where we have used (1.49). Combining these two results we obtain (1.130).
Next we have
$$\mathbb{E}[\mu_{\mathrm{ML}}] = \frac{1}{N}\sum_{n=1}^N\mathbb{E}[x_n] = \mu \tag{28}$$
using (1.49).
Finally, consider $\mathbb{E}[\sigma^2_{\mathrm{ML}}]$. From (1.55) and (1.56), and making use of (1.130), we have
$$\begin{aligned}
\mathbb{E}[\sigma^2_{\mathrm{ML}}] &= \mathbb{E}\left[\frac{1}{N}\sum_{n=1}^N\left(x_n-\frac{1}{N}\sum_{m=1}^N x_m\right)^2\right]\\
&= \frac{1}{N}\sum_{n=1}^N\mathbb{E}\left[x_n^2 - \frac{2}{N}x_n\sum_{m=1}^N x_m + \frac{1}{N^2}\sum_{m=1}^N\sum_{l=1}^N x_mx_l\right]\\
&= \mu^2+\sigma^2 - 2\left(\mu^2+\frac{1}{N}\sigma^2\right) + \mu^2 + \frac{1}{N}\sigma^2\\
&= \left(\frac{N-1}{N}\right)\sigma^2 \tag{29}
\end{aligned}$$
as required.
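The bias factor (N − 1)/N in (29) can be checked by simulation; this is a minimal Monte Carlo sketch with illustrative values of N, μ and σ.

```python
import numpy as np

rng = np.random.default_rng(1)
N, mu, sigma = 5, 0.0, 2.0
trials = 200_000

x = rng.normal(mu, sigma, size=(trials, N))
mu_ml = x.mean(axis=1, keepdims=True)
var_ml = ((x - mu_ml) ** 2).mean(axis=1)   # sigma^2_ML for each simulated data set

print(var_ml.mean())                       # ~ (N-1)/N * sigma^2 = 3.2
print((N - 1) / N * sigma ** 2)            # 3.2
```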
1.13 In a similar fashion to Solution 1.12, substituting μ for $\mu_{\mathrm{ML}}$ in (1.56) and using (1.49) and (1.50) we have
$$\mathbb{E}_{\{x_n\}}\left[\frac{1}{N}\sum_{n=1}^N(x_n-\mu)^2\right] = \frac{1}{N}\sum_{n=1}^N\mathbb{E}_{x_n}\left[x_n^2 - 2x_n\mu + \mu^2\right] = \frac{1}{N}\sum_{n=1}^N\left(\mu^2+\sigma^2 - 2\mu\mu + \mu^2\right) = \sigma^2.$$
1.14 Define
$$w_{ij}^{\mathrm{S}} = \frac{1}{2}(w_{ij}+w_{ji}), \qquad w_{ij}^{\mathrm{A}} = \frac{1}{2}(w_{ij}-w_{ji}), \tag{30}$$
from which the (anti)symmetry properties follow directly, as does the relation $w_{ij} = w_{ij}^{\mathrm{S}}+w_{ij}^{\mathrm{A}}$. We now note that
$$\sum_{i=1}^D\sum_{j=1}^D w_{ij}^{\mathrm{A}}x_ix_j = \frac{1}{2}\sum_{i=1}^D\sum_{j=1}^D w_{ij}x_ix_j - \frac{1}{2}\sum_{i=1}^D\sum_{j=1}^D w_{ji}x_ix_j = 0 \tag{31}$$
from which we obtain (1.132). The number of independent components in $w_{ij}^{\mathrm{S}}$ can be found by noting that there are $D^2$ parameters in total in this matrix, and that entries off the leading diagonal occur in constrained pairs $w_{ij} = w_{ji}$ for $j\neq i$. Thus we start with $D^2$ parameters in the matrix $w_{ij}^{\mathrm{S}}$, subtract D for the number of parameters on the leading diagonal, divide by two, and then add back D for the leading diagonal and we obtain $(D^2-D)/2 + D = D(D+1)/2$.
1.15 The redundancy in the coefficients in (1.133) arises from interchange symmetries between the indices $i_k$. Such symmetries can therefore be removed by enforcing an ordering on the indices, as in (1.134), so that only one member in each group of equivalent configurations occurs in the summation.
To derive (1.135) we note that the number of independent parameters n(D, M) which appear at order M can be written as
$$n(D,M) = \sum_{i_1=1}^D\sum_{i_2=1}^{i_1}\cdots\sum_{i_M=1}^{i_{M-1}} 1 \tag{32}$$
which has M terms. This can clearly also be written as
$$n(D,M) = \sum_{i_1=1}^D\left\{\sum_{i_2=1}^{i_1}\cdots\sum_{i_M=1}^{i_{M-1}} 1\right\} \tag{33}$$
where the term in braces has M − 1 terms which, from (32), must equal $n(i_1, M-1)$. Thus we can write
$$n(D,M) = \sum_{i_1=1}^D n(i_1, M-1) \tag{34}$$
which is equivalent to (1.135).
To prove (1.136) we first set D = 1 on both sides of the equation, and make use of 0! = 1, which gives the value 1 on both sides, thus showing the equation is valid for D = 1. Now we assume that it is true for a specific value of dimensionality D and then show that it must be true for dimensionality D + 1. Thus consider the left-hand side of (1.136) evaluated for D + 1, which gives
$$\sum_{i=1}^{D+1}\frac{(i+M-2)!}{(i-1)!(M-1)!} = \frac{(D+M-1)!}{(D-1)!M!} + \frac{(D+M-1)!}{D!(M-1)!} = \frac{(D+M-1)!D + (D+M-1)!M}{D!M!} = \frac{(D+M)!}{D!M!} \tag{35}$$
which equals the right-hand side of (1.136) for dimensionality D + 1. Thus, by induction, (1.136) must hold true for all values of D.
Finally we use induction to prove (1.137). For M = 2 we obtain the standard result $n(D,2) = \frac{1}{2}D(D+1)$, which is also proved in Exercise 1.14. Now assume that (1.137) is correct for a specific order M − 1 so that
$$n(D,M-1) = \frac{(D+M-2)!}{(D-1)!\,(M-1)!}. \tag{36}$$
Substituting this into the right hand side of (1.135) we obtain
$$n(D,M) = \sum_{i=1}^D\frac{(i+M-2)!}{(i-1)!\,(M-1)!} \tag{37}$$
which, making use of (1.136), gives
$$n(D,M) = \frac{(D+M-1)!}{(D-1)!\,M!} \tag{38}$$
and hence shows that (1.137) is true for polynomials of order M. Thus by induction (1.137) must be true for all values of M.
1.16 NOTE: In PRML, this exercise contains two typographical errors. On line 4, M6th should be $M^{\text{th}}$, and on the l.h.s. of (1.139), N(d, M) should be N(D, M).
The result (1.138) follows simply from summing up the coefficients at all orders up to and including order M. To prove (1.139), we first note that when M = 0 the right hand side of (1.139) equals 1, which we know to be correct since this is the number of parameters at zeroth order, which is just the constant offset in the polynomial. Assuming that (1.139) is correct at order M, we obtain the following result at order M + 1
$$\begin{aligned}
N(D,M+1) &= \sum_{m=0}^{M+1} n(D,m)\\
&= \sum_{m=0}^{M} n(D,m) + n(D,M+1)\\
&= \frac{(D+M)!}{D!M!} + \frac{(D+M)!}{(D-1)!(M+1)!}\\
&= \frac{(D+M)!(M+1) + (D+M)!D}{D!(M+1)!}\\
&= \frac{(D+M+1)!}{D!(M+1)!}
\end{aligned}$$
which is the required result at order M + 1.
Now assume $M \gg D$. Using Stirling's formula we have
$$\begin{aligned}
n(D,M) &\simeq \frac{(D+M)^{D+M}e^{-D-M}}{D!\,M^M e^{-M}}\\
&= \frac{M^{D+M}e^{-D}}{D!\,M^M}\left(1+\frac{D}{M}\right)^{D+M}\\
&\simeq \frac{M^D e^{-D}}{D!}\left(1+\frac{D(D+M)}{M}\right)\\
&\simeq \frac{(1+D)e^{-D}}{D!}M^D
\end{aligned}$$
which grows like $M^D$ with M. The case where $D \gg M$ is identical, with the roles of D and M exchanged. By numerical evaluation we obtain N(10, 3) = 286 and N(100, 3) = 176,851.
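The closed-form count $N(D,M) = (D+M)!/(D!\,M!)$ can be evaluated directly, reproducing the values quoted above; the sum over orders is included as a cross-check of (1.139).

```python
from math import comb

def n_independent(D, M):
    # n(D, M): number of independent coefficients at exactly order M, eq. (1.137)
    return comb(D + M - 1, M)

def N_total(D, M):
    # N(D, M): total number up to and including order M, eq. (1.138)
    return sum(n_independent(D, m) for m in range(M + 1))

print(N_total(10, 3), comb(10 + 3, 3))    # 286 286
print(N_total(100, 3), comb(100 + 3, 3))  # 176851 176851
```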
1.17 Using integration by parts we have
$$\Gamma(x+1) = \int_0^{\infty} u^x e^{-u}\,\mathrm{d}u = \left[-e^{-u}u^x\right]_0^{\infty} + \int_0^{\infty}xu^{x-1}e^{-u}\,\mathrm{d}u = 0 + x\Gamma(x). \tag{39}$$
For x = 1 we have
$$\Gamma(1) = \int_0^{\infty}e^{-u}\,\mathrm{d}u = \left[-e^{-u}\right]_0^{\infty} = 1. \tag{40}$$
If x is an integer we can apply proof by induction to relate the gamma function to the factorial function. Suppose that Γ(x + 1) = x! holds. Then from the result (39) we have Γ(x + 2) = (x + 1)Γ(x + 1) = (x + 1)!. Finally, Γ(1) = 1 = 0!, which completes the proof by induction.
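The recursion (39) and the identity Γ(n + 1) = n! can be checked numerically using Python's math.gamma; the test points below are illustrative.

```python
import math

for x in (0.5, 1.7, 3.2):
    # recursion Gamma(x+1) = x * Gamma(x), equation (39)
    print(math.isclose(math.gamma(x + 1), x * math.gamma(x)))

for n in range(6):
    # Gamma(n+1) = n! for non-negative integers n
    print(math.isclose(math.gamma(n + 1), math.factorial(n)))
```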
1.18 On the right-hand side of (1.142) we make the change of variables $u = r^2$ to give
$$\frac{1}{2}S_D\int_0^{\infty}e^{-u}u^{D/2-1}\,\mathrm{d}u = \frac{1}{2}S_D\Gamma(D/2) \tag{41}$$
where we have used the definition (1.141) of the Gamma function. On the left hand side of (1.142) we can use (1.126) to obtain $\pi^{D/2}$. Equating these we obtain the desired result (1.143).
The volume of a sphere of radius 1 in D dimensions is obtained by integration
$$V_D = S_D\int_0^1 r^{D-1}\,\mathrm{d}r = \frac{S_D}{D}. \tag{42}$$
For D = 2 and D = 3 we obtain the following results
$$S_2 = 2\pi,\qquad S_3 = 4\pi,\qquad V_2 = \pi a^2,\qquad V_3 = \frac{4}{3}\pi a^3. \tag{43}$$
1.19 The volume of the cube is $(2a)^D$. Combining this with (1.143) and (1.144) we obtain (1.145). Using Stirling's formula (1.146) in (1.145) the ratio becomes, for large D,
$$\frac{\text{volume of sphere}}{\text{volume of cube}} = \left(\frac{\pi e}{2D}\right)^{D/2}\frac{1}{D} \tag{44}$$
which goes to 0 as D → ∞. The distance from the center of the cube to the mid-point of one of the sides is a, since this is where it makes contact with the sphere. Similarly the distance to one of the corners is $a\sqrt{D}$ from Pythagoras' theorem. Thus the ratio is $\sqrt{D}$.
1.20 Since p(x) is radially symmetric it will be roughly constant over the shell of radius r and thickness ε. This shell has volume $S_Dr^{D-1}\epsilon$ and since $\|\mathbf{x}\|^2 = r^2$ we have
$$\int_{\text{shell}} p(\mathbf{x})\,\mathrm{d}\mathbf{x} \simeq p(r)S_Dr^{D-1}\epsilon \tag{45}$$
from which we obtain (1.148). We can find the stationary points of p(r) by differentiation
$$\frac{\mathrm{d}}{\mathrm{d}r}p(r) \propto \left[(D-1)r^{D-2} + r^{D-1}\left(-\frac{r}{\sigma^2}\right)\right]\exp\left(-\frac{r^2}{2\sigma^2}\right) = 0. \tag{46}$$
Solving for r, and using $D\gg 1$, we obtain $\hat{r}\simeq\sqrt{D}\,\sigma$.
Next we note that
$$p(\hat{r}+\epsilon) \propto (\hat{r}+\epsilon)^{D-1}\exp\left[-\frac{(\hat{r}+\epsilon)^2}{2\sigma^2}\right] = \exp\left[-\frac{(\hat{r}+\epsilon)^2}{2\sigma^2} + (D-1)\ln(\hat{r}+\epsilon)\right]. \tag{47}$$
We now expand p(r) around the point $\hat{r}$. Since this is a stationary point of p(r) we must keep terms up to second order. Making use of the expansion $\ln(1+x) = x - x^2/2 + O(x^3)$, together with $D\gg 1$, we obtain (1.149).
Finally, from (1.147) we see that the probability density at the origin is given by
$$p(\mathbf{x}=\mathbf{0}) = \frac{1}{(2\pi\sigma^2)^{D/2}}$$
while the density at $\|\mathbf{x}\| = \hat{r}$ is given from (1.147) by
$$p(\|\mathbf{x}\|=\hat{r}) = \frac{1}{(2\pi\sigma^2)^{D/2}}\exp\left(-\frac{\hat{r}^2}{2\sigma^2}\right) = \frac{1}{(2\pi\sigma^2)^{D/2}}\exp\left(-\frac{D}{2}\right)$$
where we have used $\hat{r}\simeq\sqrt{D}\,\sigma$. Thus the ratio of densities is given by exp(D/2).
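A small simulation illustrates this result: for a D-dimensional unit-variance Gaussian most of the probability mass sits in a thin shell of radius about $\sqrt{D}\,\sigma$, and the log-density at the origin exceeds that at the shell radius by D/2. The values of D and the sample size are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
D, sigma, n = 100, 1.0, 100_000

x = rng.normal(0.0, sigma, size=(n, D))
r = np.linalg.norm(x, axis=1)

print(r.mean(), np.sqrt(D) * sigma)   # both ~10: radius concentrates near sqrt(D)*sigma
print(r.std())                        # small compared to the mean (thin shell)

# log density ratio ln p(0) - ln p(r_hat) = r_hat^2 / (2 sigma^2), from (1.147)
r_hat = np.sqrt(D) * sigma
log_ratio = r_hat ** 2 / (2 * sigma ** 2)
print(log_ratio, D / 2)               # both 50.0, i.e. a ratio of exp(D/2)
```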
1.21 Since the square root function is monotonic for non-negative numbers, we can take the square root of the relation $a\leqslant b$ to obtain $a^{1/2}\leqslant b^{1/2}$. Then we multiply both sides by the non-negative quantity $a^{1/2}$ to obtain $a\leqslant(ab)^{1/2}$.
The probability of a misclassification is given, from (1.78), by
$$p(\text{mistake}) = \int_{\mathcal{R}_1}p(\mathbf{x},\mathcal{C}_2)\,\mathrm{d}\mathbf{x} + \int_{\mathcal{R}_2}p(\mathbf{x},\mathcal{C}_1)\,\mathrm{d}\mathbf{x} = \int_{\mathcal{R}_1}p(\mathcal{C}_2|\mathbf{x})p(\mathbf{x})\,\mathrm{d}\mathbf{x} + \int_{\mathcal{R}_2}p(\mathcal{C}_1|\mathbf{x})p(\mathbf{x})\,\mathrm{d}\mathbf{x}. \tag{48}$$
Since we have chosen the decision regions to minimize the probability of misclassification we must have $p(\mathcal{C}_2|\mathbf{x})\leqslant p(\mathcal{C}_1|\mathbf{x})$ in region $\mathcal{R}_1$, and $p(\mathcal{C}_1|\mathbf{x})\leqslant p(\mathcal{C}_2|\mathbf{x})$ in region $\mathcal{R}_2$. We now apply the result $a\leqslant b\Rightarrow a\leqslant(ab)^{1/2}$ to give
$$\begin{aligned}
p(\text{mistake}) &\leqslant \int_{\mathcal{R}_1}\{p(\mathcal{C}_1|\mathbf{x})p(\mathcal{C}_2|\mathbf{x})\}^{1/2}p(\mathbf{x})\,\mathrm{d}\mathbf{x} + \int_{\mathcal{R}_2}\{p(\mathcal{C}_1|\mathbf{x})p(\mathcal{C}_2|\mathbf{x})\}^{1/2}p(\mathbf{x})\,\mathrm{d}\mathbf{x}\\
&= \int\{p(\mathcal{C}_1|\mathbf{x})p(\mathbf{x})\,p(\mathcal{C}_2|\mathbf{x})p(\mathbf{x})\}^{1/2}\,\mathrm{d}\mathbf{x} \tag{49}
\end{aligned}$$
since the two integrals have the same integrand. The final integral is taken over the whole of the domain of x.
1.22 Substituting $L_{kj} = 1-\delta_{kj}$ into (1.81), and using the fact that the posterior probabilities sum to one, we find that, for each x we should choose the class j for which $1-p(\mathcal{C}_j|\mathbf{x})$ is a minimum, which is equivalent to choosing the j for which the posterior probability $p(\mathcal{C}_j|\mathbf{x})$ is a maximum. This loss matrix assigns a loss of one if the example is misclassified, and a loss of zero if it is correctly classified, and hence minimizing the expected loss will minimize the misclassification rate.
1.23 From (1.81) we see that for a general loss matrix and arbitrary class priors, the expected loss is minimized by assigning an input x to the class j which minimizes
$$\sum_k L_{kj}p(\mathcal{C}_k|\mathbf{x}) = \frac{1}{p(\mathbf{x})}\sum_k L_{kj}p(\mathbf{x}|\mathcal{C}_k)p(\mathcal{C}_k)$$
and so there is a direct trade-off between the priors $p(\mathcal{C}_k)$ and the loss matrix $L_{kj}$.
1.24 A vector x belongs to class $\mathcal{C}_k$ with probability $p(\mathcal{C}_k|\mathbf{x})$. If we decide to assign x to class $\mathcal{C}_j$ we will incur an expected loss of $\sum_k L_{kj}p(\mathcal{C}_k|\mathbf{x})$, whereas if we select the reject option we will incur a loss of λ. Thus, if
$$j = \arg\min_l\sum_k L_{kl}p(\mathcal{C}_k|\mathbf{x}) \tag{50}$$
then we minimize the expected loss if we take the following action
$$\text{choose}\;\begin{cases}\text{class } j, & \text{if } \min_l\sum_k L_{kl}p(\mathcal{C}_k|\mathbf{x}) < \lambda;\\ \text{reject}, & \text{otherwise.}\end{cases} \tag{51}$$
For a loss matrix $L_{kj} = 1-I_{kj}$ we have $\sum_k L_{kl}p(\mathcal{C}_k|\mathbf{x}) = 1-p(\mathcal{C}_l|\mathbf{x})$ and so we reject unless the smallest value of $1-p(\mathcal{C}_l|\mathbf{x})$ is less than λ, or equivalently if the largest value of $p(\mathcal{C}_l|\mathbf{x})$ is less than 1 − λ. In the standard reject criterion we reject if the largest posterior probability is less than θ. Thus these two criteria for rejection are equivalent provided θ = 1 − λ.
1.25 The expected squared loss for a vectorial target variable is given by
$$\mathbb{E}[L] = \iint\|\mathbf{y}(\mathbf{x})-\mathbf{t}\|^2 p(\mathbf{t},\mathbf{x})\,\mathrm{d}\mathbf{x}\,\mathrm{d}\mathbf{t}.$$
Our goal is to choose y(x) so as to minimize E[L]. We can do this formally using the calculus of variations to give
$$\frac{\delta\mathbb{E}[L]}{\delta\mathbf{y}(\mathbf{x})} = \int 2\left(\mathbf{y}(\mathbf{x})-\mathbf{t}\right)p(\mathbf{t},\mathbf{x})\,\mathrm{d}\mathbf{t} = 0.$$
Solving for y(x), and using the sum and product rules of probability, we obtain
$$\mathbf{y}(\mathbf{x}) = \frac{\displaystyle\int\mathbf{t}\,p(\mathbf{t},\mathbf{x})\,\mathrm{d}\mathbf{t}}{\displaystyle\int p(\mathbf{t},\mathbf{x})\,\mathrm{d}\mathbf{t}} = \int\mathbf{t}\,p(\mathbf{t}|\mathbf{x})\,\mathrm{d}\mathbf{t}$$
which is the conditional average of t conditioned on x. For the case of a scalar target variable we have
$$y(\mathbf{x}) = \int t\,p(t|\mathbf{x})\,\mathrm{d}t$$
which is equivalent to (1.89).
1.26 NOTE: In PRML, there is an error in equation (1.90); the integrand of the second integral should be replaced by var[t|x]p(x).
We start by expanding the square in (1.151), in a similar fashion to the univariate case in the equation preceding (1.90),
$$\begin{aligned}
\|\mathbf{y}(\mathbf{x})-\mathbf{t}\|^2 &= \|\mathbf{y}(\mathbf{x})-\mathbb{E}[\mathbf{t}|\mathbf{x}]+\mathbb{E}[\mathbf{t}|\mathbf{x}]-\mathbf{t}\|^2\\
&= \|\mathbf{y}(\mathbf{x})-\mathbb{E}[\mathbf{t}|\mathbf{x}]\|^2 + (\mathbf{y}(\mathbf{x})-\mathbb{E}[\mathbf{t}|\mathbf{x}])^{\mathrm{T}}(\mathbb{E}[\mathbf{t}|\mathbf{x}]-\mathbf{t})\\
&\quad + (\mathbb{E}[\mathbf{t}|\mathbf{x}]-\mathbf{t})^{\mathrm{T}}(\mathbf{y}(\mathbf{x})-\mathbb{E}[\mathbf{t}|\mathbf{x}]) + \|\mathbb{E}[\mathbf{t}|\mathbf{x}]-\mathbf{t}\|^2.
\end{aligned}$$
Following the treatment of the univariate case, we now substitute this into (1.151) and perform the integral over t. Again the cross-terms vanish and we are left with
$$\mathbb{E}[L] = \int\|\mathbf{y}(\mathbf{x})-\mathbb{E}[\mathbf{t}|\mathbf{x}]\|^2 p(\mathbf{x})\,\mathrm{d}\mathbf{x} + \int\operatorname{var}[\mathbf{t}|\mathbf{x}]\,p(\mathbf{x})\,\mathrm{d}\mathbf{x}$$
from which we see directly that the function y(x) that minimizes E[L] is given by E[t|x].
1.27 Since we can choose y(x) independently for each value of x, the minimum of the expected $L_q$ loss can be found by minimizing the integrand given by
$$\int|y(\mathbf{x})-t|^q\,p(t|\mathbf{x})\,\mathrm{d}t \tag{52}$$
for each value of x. Setting the derivative of (52) with respect to y(x) to zero gives the stationarity condition
$$\int q|y(\mathbf{x})-t|^{q-1}\operatorname{sign}(y(\mathbf{x})-t)\,p(t|\mathbf{x})\,\mathrm{d}t = q\int_{-\infty}^{y(\mathbf{x})}|y(\mathbf{x})-t|^{q-1}p(t|\mathbf{x})\,\mathrm{d}t - q\int_{y(\mathbf{x})}^{\infty}|y(\mathbf{x})-t|^{q-1}p(t|\mathbf{x})\,\mathrm{d}t = 0$$
which can also be obtained directly by setting the functional derivative of (1.91) with respect to y(x) equal to zero. It follows that y(x) must satisfy
$$\int_{-\infty}^{y(\mathbf{x})}|y(\mathbf{x})-t|^{q-1}p(t|\mathbf{x})\,\mathrm{d}t = \int_{y(\mathbf{x})}^{\infty}|y(\mathbf{x})-t|^{q-1}p(t|\mathbf{x})\,\mathrm{d}t. \tag{53}$$
For the case of q = 1 this reduces to
$$\int_{-\infty}^{y(\mathbf{x})}p(t|\mathbf{x})\,\mathrm{d}t = \int_{y(\mathbf{x})}^{\infty}p(t|\mathbf{x})\,\mathrm{d}t \tag{54}$$
which says that y(x) must be the conditional median of t.
For q → 0 we note that, as a function of t, the quantity $|y(\mathbf{x})-t|^q$ is close to 1 everywhere except in a small neighbourhood around t = y(x) where it falls to zero. The value of (52) will therefore be close to 1, since the density p(t) is normalized, but reduced slightly by the 'notch' close to t = y(x). We obtain the biggest reduction in (52) by choosing the location of the notch to coincide with the largest value of p(t), i.e. with the (conditional) mode.
1.28 From the discussion of the introduction of Section 1.6, we have
$$h(p^2) = h(p) + h(p) = 2h(p).$$
We then assume that for all $k\leqslant K$, $h(p^k) = k\,h(p)$. For k = K + 1 we have
$$h(p^{K+1}) = h(p^Kp) = h(p^K) + h(p) = K\,h(p) + h(p) = (K+1)\,h(p).$$
Moreover,
$$h(p^{n/m}) = n\,h(p^{1/m}) = \frac{n}{m}\,m\,h(p^{1/m}) = \frac{n}{m}\,h(p^{m/m}) = \frac{n}{m}\,h(p)$$
and so, by continuity, we have that $h(p^x) = x\,h(p)$ for any real number x.
Now consider the positive real numbers p and q and the real number x such that $p = q^x$. From the above discussion, we see that
$$\frac{h(p)}{\ln(p)} = \frac{h(q^x)}{\ln(q^x)} = \frac{x\,h(q)}{x\ln(q)} = \frac{h(q)}{\ln(q)}$$
and hence $h(p)\propto\ln(p)$.
1.29 The entropy of an M-state discrete variable x can be written in the form
$$H(x) = -\sum_{i=1}^M p(x_i)\ln p(x_i) = \sum_{i=1}^M p(x_i)\ln\frac{1}{p(x_i)}. \tag{55}$$
The function ln(x) is concave and so we can apply Jensen's inequality in the form (1.115) but with the inequality reversed, so that
$$H(x) \leqslant \ln\left(\sum_{i=1}^M p(x_i)\frac{1}{p(x_i)}\right) = \ln M. \tag{56}$$
1.30 NOTE: In PRML, there is a minus sign ('−') missing on the l.h.s. of (1.103).
From (1.113) we have
$$\mathrm{KL}(p\|q) = -\int p(x)\ln q(x)\,\mathrm{d}x + \int p(x)\ln p(x)\,\mathrm{d}x. \tag{57}$$
Using (1.46) and (1.48)–(1.50), we can rewrite the first integral on the r.h.s. of (57) as
$$\begin{aligned}
-\int p(x)\ln q(x)\,\mathrm{d}x &= \int\mathcal{N}\!\left(x|\mu,\sigma^2\right)\frac{1}{2}\left(\ln(2\pi s^2) + \frac{(x-m)^2}{s^2}\right)\mathrm{d}x\\
&= \frac{1}{2}\left(\ln(2\pi s^2) + \frac{1}{s^2}\int\mathcal{N}\!\left(x|\mu,\sigma^2\right)\left(x^2 - 2xm + m^2\right)\mathrm{d}x\right)\\
&= \frac{1}{2}\left(\ln(2\pi s^2) + \frac{\sigma^2+\mu^2-2\mu m+m^2}{s^2}\right). \tag{58}
\end{aligned}$$
The second integral on the r.h.s. of (57) we recognize from (1.103) as the negative differential entropy of a Gaussian. Thus, from (57), (58) and (1.110), we have
$$\mathrm{KL}(p\|q) = \frac{1}{2}\left(\ln(2\pi s^2) + \frac{\sigma^2+\mu^2-2\mu m+m^2}{s^2} - 1 - \ln(2\pi\sigma^2)\right) = \frac{1}{2}\left(\ln\frac{s^2}{\sigma^2} + \frac{\sigma^2+\mu^2-2\mu m+m^2}{s^2} - 1\right).$$
1.31 We first make use of the relation I(x; y) = H(y) − H(y|x) which we obtained in (1.121), and note that the mutual information satisfies $I(\mathbf{x};\mathbf{y})\geqslant 0$ since it is a form of Kullback–Leibler divergence. Finally we make use of the relation (1.112) to obtain the desired result (1.152).
To show that statistical independence is a sufficient condition for the equality to be satisfied, we substitute p(x, y) = p(x)p(y) into the definition of the entropy, giving
$$\begin{aligned}
H(\mathbf{x},\mathbf{y}) &= -\iint p(\mathbf{x},\mathbf{y})\ln p(\mathbf{x},\mathbf{y})\,\mathrm{d}\mathbf{x}\,\mathrm{d}\mathbf{y}\\
&= -\iint p(\mathbf{x})p(\mathbf{y})\{\ln p(\mathbf{x})+\ln p(\mathbf{y})\}\,\mathrm{d}\mathbf{x}\,\mathrm{d}\mathbf{y}\\
&= -\int p(\mathbf{x})\ln p(\mathbf{x})\,\mathrm{d}\mathbf{x} - \int p(\mathbf{y})\ln p(\mathbf{y})\,\mathrm{d}\mathbf{y}\\
&= H(\mathbf{x}) + H(\mathbf{y}).
\end{aligned}$$
To show that statistical independence is a necessary condition, we combine the equality condition
$$H(\mathbf{x},\mathbf{y}) = H(\mathbf{x}) + H(\mathbf{y})$$
with the result (1.112) to give
$$H(\mathbf{y}|\mathbf{x}) = H(\mathbf{y}).$$
We now note that the right-hand side is independent of x and hence the left-hand side must also be constant with respect to x. Using (1.121) it then follows that the mutual information I[x, y] = 0. Finally, using (1.120) we see that the mutual information is a form of KL divergence, and this vanishes only if the two distributions are equal, so that p(x, y) = p(x)p(y) as required.
1.32 When we make a change of variables, the probability density is transformed by the Jacobian of the change of variables. Thus we have
$$p(\mathbf{x}) = p(\mathbf{y})\left|\frac{\partial y_i}{\partial x_j}\right| = p(\mathbf{y})|\mathbf{A}| \tag{59}$$
where | · | denotes the determinant. Then the entropy of y can be written
$$H(\mathbf{y}) = -\int p(\mathbf{y})\ln p(\mathbf{y})\,\mathrm{d}\mathbf{y} = -\int p(\mathbf{x})\ln\left\{p(\mathbf{x})|\mathbf{A}|^{-1}\right\}\mathrm{d}\mathbf{x} = H(\mathbf{x}) + \ln|\mathbf{A}| \tag{60}$$
as required.
1.33 The conditional entropy H(y|x) can be written
$$H(\mathbf{y}|\mathbf{x}) = -\sum_i\sum_j p(y_i|x_j)p(x_j)\ln p(y_i|x_j) \tag{61}$$
which equals 0 by assumption. Since the quantity $-p(y_i|x_j)\ln p(y_i|x_j)$ is non-negative, each of these terms must vanish for any value $x_j$ such that $p(x_j)\neq 0$. However, the quantity $p\ln p$ only vanishes for p = 0 or p = 1. Thus the quantities $p(y_i|x_j)$ are all either 0 or 1. However, they must also sum to 1, since this is a normalized probability distribution, and so precisely one of the $p(y_i|x_j)$ is 1, and the rest are 0. Thus, for each value $x_j$ there is a unique value $y_i$ with non-zero probability.
1.34 Obtaining the required functional derivative can be done simply by inspection. However, if a more formal approach is required we can proceed as follows, using the techniques set out in Appendix D. Consider first the functional
$$I[p(x)] = \int p(x)f(x)\,\mathrm{d}x.$$
Under a small variation $p(x)\to p(x)+\epsilon\eta(x)$ we have
$$I[p(x)+\epsilon\eta(x)] = \int p(x)f(x)\,\mathrm{d}x + \epsilon\int\eta(x)f(x)\,\mathrm{d}x$$
and hence from (D.3) we deduce that the functional derivative is given by
$$\frac{\delta I}{\delta p(x)} = f(x).$$
Similarly, if we define
$$J[p(x)] = \int p(x)\ln p(x)\,\mathrm{d}x$$
then under a small variation $p(x)\to p(x)+\epsilon\eta(x)$ we have
$$J[p(x)+\epsilon\eta(x)] = \int p(x)\ln p(x)\,\mathrm{d}x + \epsilon\left(\int\eta(x)\ln p(x)\,\mathrm{d}x + \int p(x)\frac{1}{p(x)}\eta(x)\,\mathrm{d}x\right) + O(\epsilon^2)$$
and hence
$$\frac{\delta J}{\delta p(x)} = \ln p(x) + 1.$$
Using these two results we obtain the following result for the functional derivative
$$-\ln p(x) - 1 + \lambda_1 + \lambda_2 x + \lambda_3(x-\mu)^2.$$
Rearranging then gives (1.108).
To eliminate the Lagrange multipliers we substitute (1.108) into each of the three constraints (1.105), (1.106) and (1.107) in turn. The solution is most easily obtained by comparison with the standard form of the Gaussian, and noting that the results
$$\lambda_1 = 1 - \frac{1}{2}\ln\left(2\pi\sigma^2\right) \tag{62}$$
$$\lambda_2 = 0 \tag{63}$$
$$\lambda_3 = \frac{1}{2\sigma^2} \tag{64}$$
do indeed satisfy the three constraints.
Note that there is a typographical error in the question, which should read "Use calculus of variations to show that the stationary point of the functional shown just before (1.108) is given by (1.108)".
For the multivariate version of this derivation, see Exercise 2.14.
1.35 NOTE: In PRML, there is a minus sign ('−') missing on the l.h.s. of (1.103).
Substituting the right hand side of (1.109) in the argument of the logarithm on the right hand side of (1.103), we obtain
$$\begin{aligned}
H[x] &= -\int p(x)\ln p(x)\,\mathrm{d}x\\
&= -\int p(x)\left(-\frac{1}{2}\ln(2\pi\sigma^2) - \frac{(x-\mu)^2}{2\sigma^2}\right)\mathrm{d}x\\
&= \frac{1}{2}\left(\ln(2\pi\sigma^2) + \frac{1}{\sigma^2}\int p(x)(x-\mu)^2\,\mathrm{d}x\right)\\
&= \frac{1}{2}\left(\ln(2\pi\sigma^2) + 1\right),
\end{aligned}$$
where in the last step we used (1.107).
1.36 Consider (1.114) with λ = 0.5 and b = a + 2ε (and hence a = b − 2ε),
$$\begin{aligned}
0.5f(a) + 0.5f(b) &> f(0.5a + 0.5b)\\
&= 0.5f(0.5a + 0.5(a+2\epsilon)) + 0.5f(0.5(b-2\epsilon) + 0.5b)\\
&= 0.5f(a+\epsilon) + 0.5f(b-\epsilon).
\end{aligned}$$
We can rewrite this as
$$f(b) - f(b-\epsilon) > f(a+\epsilon) - f(a).$$
We then divide both sides by ε and let ε → 0, giving
$$f'(b) > f'(a).$$
Since this holds at all points, it follows that $f''(x)\geqslant 0$ everywhere.
To show the implication in the other direction, we make use of Taylor's theorem (with the remainder in Lagrange form), according to which there exists an $x^{\star}$ such that
$$f(x) = f(x_0) + f'(x_0)(x-x_0) + \frac{1}{2}f''(x^{\star})(x-x_0)^2.$$
Since we assume that $f''(x) > 0$ everywhere, the third term on the r.h.s. will always be positive and therefore
$$f(x) > f(x_0) + f'(x_0)(x-x_0).$$
Now let $x_0 = \lambda a + (1-\lambda)b$ and consider setting x = a, which gives
$$f(a) > f(x_0) + f'(x_0)(a-x_0) = f(x_0) + f'(x_0)\left((1-\lambda)(a-b)\right). \tag{65}$$
Similarly, setting x = b gives
$$f(b) > f(x_0) + f'(x_0)\left(\lambda(b-a)\right). \tag{66}$$
Multiplying (65) by λ and (66) by 1 − λ and adding up the results on both sides, we obtain
$$\lambda f(a) + (1-\lambda)f(b) > f(x_0) = f(\lambda a + (1-\lambda)b)$$
as required.
1.37 From (1.104), making use of (1.111), we have
$$\begin{aligned}
H[\mathbf{x},\mathbf{y}] &= -\iint p(\mathbf{x},\mathbf{y})\ln p(\mathbf{x},\mathbf{y})\,\mathrm{d}\mathbf{x}\,\mathrm{d}\mathbf{y}\\
&= -\iint p(\mathbf{x},\mathbf{y})\ln\left(p(\mathbf{y}|\mathbf{x})p(\mathbf{x})\right)\mathrm{d}\mathbf{x}\,\mathrm{d}\mathbf{y}\\
&= -\iint p(\mathbf{x},\mathbf{y})\left(\ln p(\mathbf{y}|\mathbf{x})+\ln p(\mathbf{x})\right)\mathrm{d}\mathbf{x}\,\mathrm{d}\mathbf{y}\\
&= -\iint p(\mathbf{x},\mathbf{y})\ln p(\mathbf{y}|\mathbf{x})\,\mathrm{d}\mathbf{x}\,\mathrm{d}\mathbf{y} - \int p(\mathbf{x})\ln p(\mathbf{x})\,\mathrm{d}\mathbf{x}\\
&= H[\mathbf{y}|\mathbf{x}] + H[\mathbf{x}].
\end{aligned}$$
1.38 From (1.114) we know that the result (1.115) holds for M = 1. We now suppose that it holds for some general value M and show that it must therefore hold for M + 1. Consider the left hand side of (1.115)
$$f\left(\sum_{i=1}^{M+1}\lambda_i x_i\right) = f\left(\lambda_{M+1}x_{M+1} + \sum_{i=1}^M\lambda_i x_i\right) \tag{67}$$
$$= f\left(\lambda_{M+1}x_{M+1} + (1-\lambda_{M+1})\sum_{i=1}^M\eta_i x_i\right) \tag{68}$$
where we have defined
$$\eta_i = \frac{\lambda_i}{1-\lambda_{M+1}}. \tag{69}$$
We now apply (1.114) to give
$$f\left(\sum_{i=1}^{M+1}\lambda_i x_i\right) \leqslant \lambda_{M+1}f(x_{M+1}) + (1-\lambda_{M+1})f\left(\sum_{i=1}^M\eta_i x_i\right). \tag{70}$$
We now note that the quantities $\lambda_i$ by definition satisfy
$$\sum_{i=1}^{M+1}\lambda_i = 1 \tag{71}$$
and hence we have
$$\sum_{i=1}^M\lambda_i = 1 - \lambda_{M+1}. \tag{72}$$
Then using (69) we see that the quantities $\eta_i$ satisfy the property
$$\sum_{i=1}^M\eta_i = \frac{1}{1-\lambda_{M+1}}\sum_{i=1}^M\lambda_i = 1. \tag{73}$$
Thus we can apply the result (1.115) at order M and so (70) becomes
$$f\left(\sum_{i=1}^{M+1}\lambda_i x_i\right) \leqslant \lambda_{M+1}f(x_{M+1}) + (1-\lambda_{M+1})\sum_{i=1}^M\eta_i f(x_i) = \sum_{i=1}^{M+1}\lambda_i f(x_i) \tag{74}$$
where we have made use of (69).
1.39 From Table 1.3 we obtain the marginal probabilities by summation and the conditional probabilities by normalization, to give

    p(x):  p(x=0) = 2/3,   p(x=1) = 1/3
    p(y):  p(y=0) = 1/3,   p(y=1) = 2/3

    p(x|y):         y=0    y=1
         x=0         1     1/2
         x=1         0     1/2

    p(y|x):         y=0    y=1
         x=0        1/2    1/2
         x=1         0      1

Figure 2: Diagram showing the relationship between the marginal, conditional and joint entropies and the mutual information (the regions H[x|y], I[x, y] and H[y|x] together make up H[x, y]).

From these tables, together with the definitions
$$H(x) = -\sum_i p(x_i)\ln p(x_i) \tag{75}$$
$$H(x|y) = -\sum_i\sum_j p(x_i,y_j)\ln p(x_i|y_j) \tag{76}$$
and similar definitions for H(y) and H(y|x), we obtain the following results
(a) H(x) = ln 3 − (2/3) ln 2
(b) H(y) = ln 3 − (2/3) ln 2
(c) H(y|x) = (2/3) ln 2
(d) H(x|y) = (2/3) ln 2
(e) H(x, y) = ln 3
(f) I(x; y) = ln 3 − (4/3) ln 2
where we have used (1.121) to evaluate the mutual information. The corresponding diagram is shown in Figure 2.
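These quantities can also be computed directly from the joint distribution implied by the tables above (probability 1/3 on each of the cells (x, y) = (0, 0), (0, 1) and (1, 1)); a small script confirming the values:

```python
import numpy as np

# Joint distribution: rows index x in {0,1}, columns index y in {0,1}
p_xy = np.array([[1/3, 1/3],
                 [0.0, 1/3]])

p_x = p_xy.sum(axis=1)
p_y = p_xy.sum(axis=0)

def entropy(p):
    p = p[p > 0]
    return -np.sum(p * np.log(p))

H_xy = entropy(p_xy.ravel())
H_x, H_y = entropy(p_x), entropy(p_y)
H_y_given_x = H_xy - H_x          # (1.112)
H_x_given_y = H_xy - H_y
I_xy = H_y - H_y_given_x          # (1.121)

print(H_x, np.log(3) - 2/3*np.log(2))     # both ~0.6365
print(H_xy, np.log(3))                    # both ~1.0986
print(I_xy, np.log(3) - 4/3*np.log(2))    # both ~0.1744
```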
1.40 The arithmetic and geometric means are defined as
$$\bar{x}_{\mathrm{A}} = \frac{1}{K}\sum_{k}^{K} x_k \qquad\text{and}\qquad \bar{x}_{\mathrm{G}} = \left(\prod_{k}^{K} x_k\right)^{1/K},$$
respectively. Taking the logarithm of $\bar{x}_{\mathrm{A}}$ and $\bar{x}_{\mathrm{G}}$, we see that
$$\ln\bar{x}_{\mathrm{A}} = \ln\left(\frac{1}{K}\sum_{k}^{K} x_k\right) \qquad\text{and}\qquad \ln\bar{x}_{\mathrm{G}} = \frac{1}{K}\sum_{k}^{K}\ln x_k.$$
By matching f with ln and $\lambda_i$ with 1/K in (1.115), taking into account that the logarithm is concave rather than convex and the inequality therefore goes the other way, we obtain the desired result.
1.41 From the product rule we have p(x, y) = p(y|x)p(x), and so (1.120) can be written as
$$\begin{aligned}
I(\mathbf{x};\mathbf{y}) &= -\iint p(\mathbf{x},\mathbf{y})\ln p(\mathbf{y})\,\mathrm{d}\mathbf{x}\,\mathrm{d}\mathbf{y} + \iint p(\mathbf{x},\mathbf{y})\ln p(\mathbf{y}|\mathbf{x})\,\mathrm{d}\mathbf{x}\,\mathrm{d}\mathbf{y}\\
&= -\int p(\mathbf{y})\ln p(\mathbf{y})\,\mathrm{d}\mathbf{y} + \iint p(\mathbf{x},\mathbf{y})\ln p(\mathbf{y}|\mathbf{x})\,\mathrm{d}\mathbf{x}\,\mathrm{d}\mathbf{y}\\
&= H(\mathbf{y}) - H(\mathbf{y}|\mathbf{x}). \tag{77}
\end{aligned}$$
Chapter 2 Density Estimation
2.1 From the definition (2.2) of the Bernoulli distribution we have
$$\sum_{x\in\{0,1\}} p(x|\mu) = p(x=0|\mu) + p(x=1|\mu) = (1-\mu) + \mu = 1$$
$$\sum_{x\in\{0,1\}} x\,p(x|\mu) = 0\cdot p(x=0|\mu) + 1\cdot p(x=1|\mu) = \mu$$
$$\sum_{x\in\{0,1\}} (x-\mu)^2 p(x|\mu) = \mu^2 p(x=0|\mu) + (1-\mu)^2 p(x=1|\mu) = \mu^2(1-\mu) + (1-\mu)^2\mu = \mu(1-\mu).$$
The entropy is given by
$$H[x] = -\sum_{x\in\{0,1\}} p(x|\mu)\ln p(x|\mu) = -\sum_{x\in\{0,1\}}\mu^x(1-\mu)^{1-x}\{x\ln\mu + (1-x)\ln(1-\mu)\} = -(1-\mu)\ln(1-\mu) - \mu\ln\mu.$$
2.2 The normalization of (2.261) follows from
$$p(x=+1|\mu) + p(x=-1|\mu) = \frac{1+\mu}{2} + \frac{1-\mu}{2} = 1.$$
The mean is given by
$$\mathbb{E}[x] = \frac{1+\mu}{2} - \frac{1-\mu}{2} = \mu.$$
To evaluate the variance we use
$$\mathbb{E}[x^2] = \frac{1-\mu}{2} + \frac{1+\mu}{2} = 1$$
from which we have
$$\operatorname{var}[x] = \mathbb{E}[x^2] - \mathbb{E}[x]^2 = 1 - \mu^2.$$
Finally the entropy is given by
$$H[x] = -\sum_{x=-1}^{x=+1} p(x|\mu)\ln p(x|\mu) = -\frac{1-\mu}{2}\ln\frac{1-\mu}{2} - \frac{1+\mu}{2}\ln\frac{1+\mu}{2}.$$
2.3 Using the definition (2.10) we have
$$\binom{N}{n} + \binom{N}{n-1} = \frac{N!}{n!(N-n)!} + \frac{N!}{(n-1)!(N+1-n)!} = \frac{(N+1-n)N! + nN!}{n!(N+1-n)!} = \frac{(N+1)!}{n!(N+1-n)!} = \binom{N+1}{n}. \tag{78}$$
To prove the binomial theorem (2.263) we note that the theorem is trivially true for N = 0. We now assume that it holds for some general value N and prove its correctness for N + 1, which can be done as follows
$$\begin{aligned}
(1+x)^{N+1} &= (1+x)\sum_{n=0}^N\binom{N}{n}x^n\\
&= \sum_{n=0}^N\binom{N}{n}x^n + \sum_{n=1}^{N+1}\binom{N}{n-1}x^n\\
&= \binom{N}{0}x^0 + \sum_{n=1}^N\left\{\binom{N}{n}+\binom{N}{n-1}\right\}x^n + \binom{N}{N}x^{N+1}\\
&= \binom{N+1}{0}x^0 + \sum_{n=1}^N\binom{N+1}{n}x^n + \binom{N+1}{N+1}x^{N+1}\\
&= \sum_{n=0}^{N+1}\binom{N+1}{n}x^n \tag{79}
\end{aligned}$$
which completes the inductive proof. Finally, using the binomial theorem, the normalization condition (2.264) for the binomial distribution gives
$$\sum_{n=0}^N\binom{N}{n}\mu^n(1-\mu)^{N-n} = (1-\mu)^N\sum_{n=0}^N\binom{N}{n}\left(\frac{\mu}{1-\mu}\right)^n = (1-\mu)^N\left(1+\frac{\mu}{1-\mu}\right)^N = 1 \tag{80}$$
as required.
2.4 Differentiating (2.264) with respect to μ we obtain
$$\sum_{n=1}^N\binom{N}{n}\mu^n(1-\mu)^{N-n}\left[\frac{n}{\mu}-\frac{(N-n)}{(1-\mu)}\right] = 0.$$
Multiplying through by μ(1 − μ) and rearranging we obtain (2.11).
If we differentiate (2.264) twice with respect to μ we obtain
$$\sum_{n=1}^N\binom{N}{n}\mu^n(1-\mu)^{N-n}\left\{\left[\frac{n}{\mu}-\frac{(N-n)}{(1-\mu)}\right]^2 - \frac{n}{\mu^2} - \frac{(N-n)}{(1-\mu)^2}\right\} = 0.$$
We now multiply through by $\mu^2(1-\mu)^2$ and rearrange, making use of the result (2.11) for the mean of the binomial distribution, to obtain
$$\mathbb{E}[n^2] = N\mu(1-\mu) + N^2\mu^2.$$
Finally, we use (1.40) to obtain the result (2.12) for the variance.
2.5 Making the change of variable t = y + x in (2.266) we obtain
$$\Gamma(a)\Gamma(b) = \int_0^{\infty}x^{a-1}\left\{\int_x^{\infty}\exp(-t)(t-x)^{b-1}\,\mathrm{d}t\right\}\mathrm{d}x. \tag{81}$$
We now exchange the order of integration, taking care over the limits of integration
$$\Gamma(a)\Gamma(b) = \int_0^{\infty}\int_0^t x^{a-1}\exp(-t)(t-x)^{b-1}\,\mathrm{d}x\,\mathrm{d}t. \tag{82}$$
The change in the limits of integration in going from (81) to (82) can be understood by reference to Figure 3. Finally we change variables in the x integral using x = tμ to give
$$\Gamma(a)\Gamma(b) = \int_0^{\infty}\exp(-t)\,t^{a-1}t^{b-1}t\,\mathrm{d}t\int_0^1\mu^{a-1}(1-\mu)^{b-1}\,\mathrm{d}\mu = \Gamma(a+b)\int_0^1\mu^{a-1}(1-\mu)^{b-1}\,\mathrm{d}\mu. \tag{83}$$

Figure 3: Plot of the region of integration of (81) in (x, t) space, bounded below by the line t = x.
2.6 From (2.13) the mean of the beta distribution is given by
$$\mathbb{E}[\mu] = \int_0^1\frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}\mu^{(a+1)-1}(1-\mu)^{b-1}\,\mathrm{d}\mu.$$
Using the result (2.265), which follows directly from the normalization condition for the Beta distribution, we have
$$\mathbb{E}[\mu] = \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}\frac{\Gamma(a+1)\Gamma(b)}{\Gamma(a+1+b)} = \frac{a}{a+b}$$
where we have used the property Γ(x + 1) = xΓ(x). We can find the variance in the same way, by first showing that
$$\mathbb{E}[\mu^2] = \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}\int_0^1\mu^{(a+2)-1}(1-\mu)^{b-1}\,\mathrm{d}\mu = \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}\frac{\Gamma(a+2)\Gamma(b)}{\Gamma(a+2+b)} = \frac{a}{(a+b)}\frac{a+1}{(a+1+b)}. \tag{84}$$
Now we use the result (1.40), together with the result (2.15), to derive the result (2.16) for var[μ]. Finally, we obtain the result (2.269) for the mode of the beta distribution simply by setting the derivative of the right hand side of (2.13) with respect to μ to zero and rearranging.
2.7 NOTE: In PRML, the exercise text contains a typographical error. On the third line, "mean value of x" should be "mean value of μ".
Using the result (2.15) for the mean of a Beta distribution we see that the prior mean is a/(a + b) while the posterior mean is (a + n)/(a + b + n + m). The maximum likelihood estimate for μ is given by the relative frequency n/(n + m) of observations of x = 1. Thus the posterior mean will lie between the prior mean and the maximum likelihood solution provided the following equation is satisfied for λ in the interval (0, 1)
$$\lambda\frac{a}{a+b} + (1-\lambda)\frac{n}{n+m} = \frac{a+n}{a+b+n+m},$$
which represents a convex combination of the prior mean and the maximum likelihood estimator. This is a linear equation for λ which is easily solved by rearranging terms to give
$$\lambda = \frac{1}{1+(n+m)/(a+b)}.$$
Since a > 0, b > 0, n > 0, and m > 0, it follows that the term (n + m)/(a + b) lies in the range (0, ∞) and hence λ must lie in the range (0, 1).
2.8 To prove the result (2.270) we use the product rule of probability
$$\mathbb{E}_y\left[\mathbb{E}_x[x|y]\right] = \int\left\{\int x\,p(x|y)\,\mathrm{d}x\right\}p(y)\,\mathrm{d}y = \iint x\,p(x,y)\,\mathrm{d}x\,\mathrm{d}y = \int x\,p(x)\,\mathrm{d}x = \mathbb{E}_x[x]. \tag{85}$$
For the result (2.271) for the conditional variance we make use of the result (1.40), as well as the relation (85), to give
$$\begin{aligned}
\mathbb{E}_y\left[\operatorname{var}_x[x|y]\right] + \operatorname{var}_y\left[\mathbb{E}_x[x|y]\right] &= \mathbb{E}_y\left[\mathbb{E}_x[x^2|y]-\mathbb{E}_x[x|y]^2\right] + \mathbb{E}_y\left[\mathbb{E}_x[x|y]^2\right] - \mathbb{E}_y\left[\mathbb{E}_x[x|y]\right]^2\\
&= \mathbb{E}_x[x^2] - \mathbb{E}_x[x]^2 = \operatorname{var}_x[x]
\end{aligned}$$
where we have made use of $\mathbb{E}_y\!\left[\mathbb{E}_x[x^2|y]\right] = \mathbb{E}_x[x^2]$, which can be proved by analogy with (85).
2.9 When we integrate over $\mu_{M-1}$ the lower limit of integration is 0, while the upper limit is $1-\sum_{j=1}^{M-2}\mu_j$ since the remaining probabilities must sum to one (see Figure 2.4). Thus we have
$$\begin{aligned}
p_{M-1}(\mu_1,\ldots,\mu_{M-2}) &= \int_0^{1-\sum_{j=1}^{M-2}\mu_j} p_M(\mu_1,\ldots,\mu_{M-1})\,\mathrm{d}\mu_{M-1}\\
&= C_M\left[\prod_{k=1}^{M-2}\mu_k^{\alpha_k-1}\right]\int_0^{1-\sum_{j=1}^{M-2}\mu_j}\mu_{M-1}^{\alpha_{M-1}-1}\left(1-\sum_{j=1}^{M-1}\mu_j\right)^{\alpha_M-1}\mathrm{d}\mu_{M-1}.
\end{aligned}$$
In order to make the limits of integration equal to 0 and 1 we change integration variable from $\mu_{M-1}$ to t using
$$\mu_{M-1} = t\left(1-\sum_{j=1}^{M-2}\mu_j\right)$$
which gives
$$\begin{aligned}
p_{M-1}(\mu_1,\ldots,\mu_{M-2}) &= C_M\left[\prod_{k=1}^{M-2}\mu_k^{\alpha_k-1}\right]\left(1-\sum_{j=1}^{M-2}\mu_j\right)^{\alpha_{M-1}+\alpha_M-1}\int_0^1 t^{\alpha_{M-1}-1}(1-t)^{\alpha_M-1}\,\mathrm{d}t\\
&= C_M\left[\prod_{k=1}^{M-2}\mu_k^{\alpha_k-1}\right]\left(1-\sum_{j=1}^{M-2}\mu_j\right)^{\alpha_{M-1}+\alpha_M-1}\frac{\Gamma(\alpha_{M-1})\Gamma(\alpha_M)}{\Gamma(\alpha_{M-1}+\alpha_M)} \tag{86}
\end{aligned}$$
where we have used (2.265). The right hand side of (86) is seen to be a normalized Dirichlet distribution over M − 1 variables, with coefficients $\alpha_1,\ldots,\alpha_{M-2},\alpha_{M-1}+\alpha_M$ (note that we have effectively combined the final two categories), and we can identify its normalization coefficient using (2.38). Thus
$$C_M = \frac{\Gamma(\alpha_1+\ldots+\alpha_M)}{\Gamma(\alpha_1)\cdots\Gamma(\alpha_{M-2})\Gamma(\alpha_{M-1}+\alpha_M)}\cdot\frac{\Gamma(\alpha_{M-1}+\alpha_M)}{\Gamma(\alpha_{M-1})\Gamma(\alpha_M)} = \frac{\Gamma(\alpha_1+\ldots+\alpha_M)}{\Gamma(\alpha_1)\cdots\Gamma(\alpha_M)} \tag{87}$$
as required.
2.10 Using the fact that the Dirichlet distribution (2.38) is normalized we have
$$\int\prod_{k=1}^M\mu_k^{\alpha_k-1}\,\mathrm{d}\boldsymbol{\mu} = \frac{\Gamma(\alpha_1)\cdots\Gamma(\alpha_M)}{\Gamma(\alpha_0)} \tag{88}$$
where $\int\mathrm{d}\boldsymbol{\mu}$ denotes the integral over the (M − 1)-dimensional simplex defined by $0\leqslant\mu_k\leqslant 1$ and $\sum_k\mu_k = 1$. Now consider the expectation of $\mu_j$, which can be written
$$\mathbb{E}[\mu_j] = \frac{\Gamma(\alpha_0)}{\Gamma(\alpha_1)\cdots\Gamma(\alpha_M)}\int\mu_j\prod_{k=1}^M\mu_k^{\alpha_k-1}\,\mathrm{d}\boldsymbol{\mu} = \frac{\Gamma(\alpha_0)}{\Gamma(\alpha_1)\cdots\Gamma(\alpha_M)}\cdot\frac{\Gamma(\alpha_1)\cdots\Gamma(\alpha_j+1)\cdots\Gamma(\alpha_M)}{\Gamma(\alpha_0+1)} = \frac{\alpha_j}{\alpha_0}$$
where we have made use of (88), noting that the effect of the extra factor of $\mu_j$ is to increase the coefficient $\alpha_j$ by 1, and then made use of Γ(x + 1) = xΓ(x). By similar reasoning we have
$$\operatorname{var}[\mu_j] = \mathbb{E}[\mu_j^2] - \mathbb{E}[\mu_j]^2 = \frac{\alpha_j(\alpha_j+1)}{\alpha_0(\alpha_0+1)} - \frac{\alpha_j^2}{\alpha_0^2} = \frac{\alpha_j(\alpha_0-\alpha_j)}{\alpha_0^2(\alpha_0+1)}.$$
Likewise, for $j\neq l$ we have
$$\operatorname{cov}[\mu_j\mu_l] = \mathbb{E}[\mu_j\mu_l] - \mathbb{E}[\mu_j]\mathbb{E}[\mu_l] = \frac{\alpha_j\alpha_l}{\alpha_0(\alpha_0+1)} - \frac{\alpha_j}{\alpha_0}\frac{\alpha_l}{\alpha_0} = -\frac{\alpha_j\alpha_l}{\alpha_0^2(\alpha_0+1)}.$$
2.11 We first of all write the Dirichlet distribution (2.38) in the form
$$\operatorname{Dir}(\boldsymbol{\mu}|\boldsymbol{\alpha}) = K(\boldsymbol{\alpha})\prod_{k=1}^M\mu_k^{\alpha_k-1}$$
where
$$K(\boldsymbol{\alpha}) = \frac{\Gamma(\alpha_0)}{\Gamma(\alpha_1)\cdots\Gamma(\alpha_M)}.$$
Next we note the following relation
$$\frac{\partial}{\partial\alpha_j}\prod_{k=1}^M\mu_k^{\alpha_k-1} = \frac{\partial}{\partial\alpha_j}\prod_{k=1}^M\exp\left((\alpha_k-1)\ln\mu_k\right) = \ln\mu_j\prod_{k=1}^M\mu_k^{\alpha_k-1}$$
from which we obtain
$$\begin{aligned}
\mathbb{E}[\ln\mu_j] &= K(\boldsymbol{\alpha})\int_0^1\!\!\cdots\!\int_0^1\ln\mu_j\prod_{k=1}^M\mu_k^{\alpha_k-1}\,\mathrm{d}\mu_1\ldots\mathrm{d}\mu_M\\
&= K(\boldsymbol{\alpha})\frac{\partial}{\partial\alpha_j}\int_0^1\!\!\cdots\!\int_0^1\prod_{k=1}^M\mu_k^{\alpha_k-1}\,\mathrm{d}\mu_1\ldots\mathrm{d}\mu_M\\
&= K(\boldsymbol{\alpha})\frac{\partial}{\partial\alpha_j}\frac{1}{K(\boldsymbol{\alpha})} = -\frac{\partial}{\partial\alpha_j}\ln K(\boldsymbol{\alpha}).
\end{aligned}$$
Finally, using the expression for $K(\boldsymbol{\alpha})$, together with the definition of the digamma function ψ(·), we have
$$\mathbb{E}[\ln\mu_j] = \psi(\alpha_j) - \psi(\alpha_0).$$
2.12 The normalization of the uniform distribution is proved trivially
$$\int_a^b\frac{1}{b-a}\,\mathrm{d}x = \frac{b-a}{b-a} = 1.$$
For the mean of the distribution we have
$$\mathbb{E}[x] = \int_a^b\frac{1}{b-a}\,x\,\mathrm{d}x = \left[\frac{x^2}{2(b-a)}\right]_a^b = \frac{b^2-a^2}{2(b-a)} = \frac{a+b}{2}.$$
The variance can be found by first evaluating
$$\mathbb{E}[x^2] = \int_a^b\frac{1}{b-a}\,x^2\,\mathrm{d}x = \left[\frac{x^3}{3(b-a)}\right]_a^b = \frac{b^3-a^3}{3(b-a)} = \frac{a^2+ab+b^2}{3}$$
and then using (1.40) to give
$$\operatorname{var}[x] = \mathbb{E}[x^2] - \mathbb{E}[x]^2 = \frac{a^2+ab+b^2}{3} - \frac{(a+b)^2}{4} = \frac{(b-a)^2}{12}.$$
2.13 Note that this solution is the multivariate version of Solution 1.30.
From (1.113) we have
$$\mathrm{KL}(p\|q) = -\int p(\mathbf{x})\ln q(\mathbf{x})\,\mathrm{d}\mathbf{x} + \int p(\mathbf{x})\ln p(\mathbf{x})\,\mathrm{d}\mathbf{x}.$$
Using (2.43), (2.57), (2.59) and (2.62), we can rewrite the first integral on the r.h.s. as
$$\begin{aligned}
-\int p(\mathbf{x})\ln q(\mathbf{x})\,\mathrm{d}\mathbf{x} &= \int\mathcal{N}(\mathbf{x}|\boldsymbol{\mu},\boldsymbol{\Sigma})\frac{1}{2}\left(D\ln(2\pi) + \ln|\mathbf{L}| + (\mathbf{x}-\mathbf{m})^{\mathrm{T}}\mathbf{L}^{-1}(\mathbf{x}-\mathbf{m})\right)\mathrm{d}\mathbf{x}\\
&= \frac{1}{2}\left(D\ln(2\pi) + \ln|\mathbf{L}| + \operatorname{Tr}[\mathbf{L}^{-1}(\boldsymbol{\mu}\boldsymbol{\mu}^{\mathrm{T}}+\boldsymbol{\Sigma})] - \boldsymbol{\mu}^{\mathrm{T}}\mathbf{L}^{-1}\mathbf{m} - \mathbf{m}^{\mathrm{T}}\mathbf{L}^{-1}\boldsymbol{\mu} + \mathbf{m}^{\mathrm{T}}\mathbf{L}^{-1}\mathbf{m}\right). \tag{89}
\end{aligned}$$
The second integral on the r.h.s. we recognize from (1.104) as the negative differential entropy of a multivariate Gaussian. Thus, from (89) and (B.41), we have
$$\mathrm{KL}(p\|q) = \frac{1}{2}\left(\ln\frac{|\mathbf{L}|}{|\boldsymbol{\Sigma}|} + \operatorname{Tr}[\mathbf{L}^{-1}(\boldsymbol{\mu}\boldsymbol{\mu}^{\mathrm{T}}+\boldsymbol{\Sigma})] - \boldsymbol{\mu}^{\mathrm{T}}\mathbf{L}^{-1}\mathbf{m} - \mathbf{m}^{\mathrm{T}}\mathbf{L}^{-1}\boldsymbol{\mu} + \mathbf{m}^{\mathrm{T}}\mathbf{L}^{-1}\mathbf{m} - D\right).$$
2.14 As for the univariate Gaussian considered in Section 1.6, we can make use of Lagrange multipliers to enforce the constraints on the maximum entropy solution. Note that we need a single Lagrange multiplier for the normalization constraint (2.280), a D-dimensional vector m of Lagrange multipliers for the D constraints given by (2.281), and a D × D matrix L of Lagrange multipliers to enforce the $D^2$ constraints represented by (2.282). Thus we maximize
$$\begin{aligned}
\widetilde{H}[p] ={}& -\int p(\mathbf{x})\ln p(\mathbf{x})\,\mathrm{d}\mathbf{x} + \lambda\left(\int p(\mathbf{x})\,\mathrm{d}\mathbf{x} - 1\right)\\
&+ \mathbf{m}^{\mathrm{T}}\left(\int p(\mathbf{x})\mathbf{x}\,\mathrm{d}\mathbf{x} - \boldsymbol{\mu}\right) + \operatorname{Tr}\left\{\mathbf{L}\left(\int p(\mathbf{x})(\mathbf{x}-\boldsymbol{\mu})(\mathbf{x}-\boldsymbol{\mu})^{\mathrm{T}}\,\mathrm{d}\mathbf{x} - \boldsymbol{\Sigma}\right)\right\}. \tag{90}
\end{aligned}$$
By functional differentiation (Appendix D) the maximum of this functional with respect to p(x) occurs when
$$0 = -1 - \ln p(\mathbf{x}) + \lambda + \mathbf{m}^{\mathrm{T}}\mathbf{x} + \operatorname{Tr}\{\mathbf{L}(\mathbf{x}-\boldsymbol{\mu})(\mathbf{x}-\boldsymbol{\mu})^{\mathrm{T}}\}.$$
Solving for p(x) we obtain
$$p(\mathbf{x}) = \exp\left\{\lambda - 1 + \mathbf{m}^{\mathrm{T}}\mathbf{x} + (\mathbf{x}-\boldsymbol{\mu})^{\mathrm{T}}\mathbf{L}(\mathbf{x}-\boldsymbol{\mu})\right\}. \tag{91}$$
We now find the values of the Lagrange multipliers by applying the constraints. First we complete the square inside the exponential, which becomes
$$\lambda - 1 + \left(\mathbf{x}-\boldsymbol{\mu}+\frac{1}{2}\mathbf{L}^{-1}\mathbf{m}\right)^{\mathrm{T}}\mathbf{L}\left(\mathbf{x}-\boldsymbol{\mu}+\frac{1}{2}\mathbf{L}^{-1}\mathbf{m}\right) + \boldsymbol{\mu}^{\mathrm{T}}\mathbf{m} - \frac{1}{4}\mathbf{m}^{\mathrm{T}}\mathbf{L}^{-1}\mathbf{m}.$$
We now make the change of variable
$$\mathbf{y} = \mathbf{x} - \boldsymbol{\mu} + \frac{1}{2}\mathbf{L}^{-1}\mathbf{m}.$$
The constraint (2.281) then becomes
$$\int\exp\left\{\lambda - 1 + \mathbf{y}^{\mathrm{T}}\mathbf{L}\mathbf{y} + \boldsymbol{\mu}^{\mathrm{T}}\mathbf{m} - \frac{1}{4}\mathbf{m}^{\mathrm{T}}\mathbf{L}^{-1}\mathbf{m}\right\}\left(\mathbf{y}+\boldsymbol{\mu}-\frac{1}{2}\mathbf{L}^{-1}\mathbf{m}\right)\mathrm{d}\mathbf{y} = \boldsymbol{\mu}.$$
In the final parentheses, the term in y vanishes by symmetry, while the term in μ simply integrates to μ by virtue of the normalization constraint (2.280), which now takes the form
$$\int\exp\left\{\lambda - 1 + \mathbf{y}^{\mathrm{T}}\mathbf{L}\mathbf{y} + \boldsymbol{\mu}^{\mathrm{T}}\mathbf{m} - \frac{1}{4}\mathbf{m}^{\mathrm{T}}\mathbf{L}^{-1}\mathbf{m}\right\}\mathrm{d}\mathbf{y} = 1,$$
and hence we have
$$-\frac{1}{2}\mathbf{L}^{-1}\mathbf{m} = \mathbf{0}$$
where again we have made use of the constraint (2.280). Thus m = 0 and so the density becomes
$$p(\mathbf{x}) = \exp\left\{\lambda - 1 + (\mathbf{x}-\boldsymbol{\mu})^{\mathrm{T}}\mathbf{L}(\mathbf{x}-\boldsymbol{\mu})\right\}.$$
Substituting this into the final constraint (2.282), and making the change of variable x − μ = z, we obtain
$$\int\exp\left\{\lambda - 1 + \mathbf{z}^{\mathrm{T}}\mathbf{L}\mathbf{z}\right\}\mathbf{z}\mathbf{z}^{\mathrm{T}}\,\mathrm{d}\mathbf{z} = \boldsymbol{\Sigma}.$$
Applying an analogous argument to that used to derive (2.64) we obtain $\mathbf{L} = -\frac{1}{2}\boldsymbol{\Sigma}^{-1}$.
Finally, the value of λ is simply that value needed to ensure that the Gaussian distribution is correctly normalized, as derived in Section 2.3, and hence is given by
$$\lambda - 1 = \ln\left\{\frac{1}{(2\pi)^{D/2}}\frac{1}{|\boldsymbol{\Sigma}|^{1/2}}\right\}.$$
2.15 From the definitions of the multivariate differential entropy (1.104) and the multivariate Gaussian distribution (2.43), we get
$$\begin{aligned}
H[\mathbf{x}] &= -\int\mathcal{N}(\mathbf{x}|\boldsymbol{\mu},\boldsymbol{\Sigma})\ln\mathcal{N}(\mathbf{x}|\boldsymbol{\mu},\boldsymbol{\Sigma})\,\mathrm{d}\mathbf{x}\\
&= \int\mathcal{N}(\mathbf{x}|\boldsymbol{\mu},\boldsymbol{\Sigma})\frac{1}{2}\left(D\ln(2\pi) + \ln|\boldsymbol{\Sigma}| + (\mathbf{x}-\boldsymbol{\mu})^{\mathrm{T}}\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu})\right)\mathrm{d}\mathbf{x}\\
&= \frac{1}{2}\left(D\ln(2\pi) + \ln|\boldsymbol{\Sigma}| + \operatorname{Tr}\left[\boldsymbol{\Sigma}^{-1}\boldsymbol{\Sigma}\right]\right)\\
&= \frac{1}{2}\left(D\ln(2\pi) + \ln|\boldsymbol{\Sigma}| + D\right).
\end{aligned}$$
2.16 We have $p(x_1) = \mathcal{N}(x_1|\mu_1,\tau_1^{-1})$ and $p(x_2) = \mathcal{N}(x_2|\mu_2,\tau_2^{-1})$. Since $x = x_1 + x_2$ we also have $p(x|x_2) = \mathcal{N}(x|\mu_1+x_2,\tau_1^{-1})$. We now evaluate the convolution integral given by (2.284), which takes the form
$$p(x) = \left(\frac{\tau_1}{2\pi}\right)^{1/2}\left(\frac{\tau_2}{2\pi}\right)^{1/2}\int_{-\infty}^{\infty}\exp\left\{-\frac{\tau_1}{2}(x-\mu_1-x_2)^2 - \frac{\tau_2}{2}(x_2-\mu_2)^2\right\}\mathrm{d}x_2. \tag{92}$$
Since the final result will be a Gaussian distribution for p(x) we need only evaluate its precision, since, from (1.110), the entropy is determined by the variance, or equivalently the precision, and is independent of the mean. This allows us to simplify the calculation by ignoring such things as normalization constants.
We begin by considering the terms in the exponent of (92) which depend on $x_2$, which are given by
$$-\frac{1}{2}x_2^2(\tau_1+\tau_2) + x_2\{\tau_1(x-\mu_1)+\tau_2\mu_2\} = -\frac{1}{2}(\tau_1+\tau_2)\left(x_2 - \frac{\tau_1(x-\mu_1)+\tau_2\mu_2}{\tau_1+\tau_2}\right)^2 + \frac{\{\tau_1(x-\mu_1)+\tau_2\mu_2\}^2}{2(\tau_1+\tau_2)}$$
where we have completed the square over $x_2$. When we integrate out $x_2$, the first term on the right hand side will simply give rise to a constant factor independent of x. The second term, when expanded out, will involve a term in $x^2$. Since the precision of x is given directly in terms of the coefficient of $x^2$ in the exponent, it is only such terms that we need to consider. There is one other term in $x^2$ arising from the original exponent in (92). Combining these we have
$$-\frac{\tau_1}{2}x^2 + \frac{\tau_1^2}{2(\tau_1+\tau_2)}x^2 = -\frac{1}{2}\frac{\tau_1\tau_2}{\tau_1+\tau_2}x^2$$
from which we see that x has precision $\tau_1\tau_2/(\tau_1+\tau_2)$.
We can also obtain this result for the precision directly by appealing to the general result (2.115) for the convolution of two linear-Gaussian distributions.
The entropy of x is then given, from (1.110), by
$$H[x] = \frac{1}{2}\ln\left\{\frac{2\pi(\tau_1+\tau_2)}{\tau_1\tau_2}\right\}.$$
2.17 We can use an analogous argument to that used in the solution of Exercise 1.14. Consider a general square matrix Λ with elements $\Lambda_{ij}$. Then we can always write $\boldsymbol{\Lambda} = \boldsymbol{\Lambda}^{\mathrm{A}}+\boldsymbol{\Lambda}^{\mathrm{S}}$ where
$$\Lambda_{ij}^{\mathrm{S}} = \frac{\Lambda_{ij}+\Lambda_{ji}}{2}, \qquad \Lambda_{ij}^{\mathrm{A}} = \frac{\Lambda_{ij}-\Lambda_{ji}}{2} \tag{93}$$
and it is easily verified that $\boldsymbol{\Lambda}^{\mathrm{S}}$ is symmetric so that $\Lambda_{ij}^{\mathrm{S}} = \Lambda_{ji}^{\mathrm{S}}$, and $\boldsymbol{\Lambda}^{\mathrm{A}}$ is antisymmetric so that $\Lambda_{ij}^{\mathrm{A}} = -\Lambda_{ji}^{\mathrm{A}}$. The quadratic form in the exponent of a D-dimensional multivariate Gaussian distribution can be written
$$\frac{1}{2}\sum_{i=1}^D\sum_{j=1}^D(x_i-\mu_i)\Lambda_{ij}(x_j-\mu_j) \tag{94}$$
where $\boldsymbol{\Lambda} = \boldsymbol{\Sigma}^{-1}$ is the precision matrix. When we substitute $\boldsymbol{\Lambda} = \boldsymbol{\Lambda}^{\mathrm{A}}+\boldsymbol{\Lambda}^{\mathrm{S}}$ into (94) we see that the term involving $\boldsymbol{\Lambda}^{\mathrm{A}}$ vanishes, since for every positive term there is an equal and opposite negative term. Thus we can always take Λ to be symmetric.
2.18 We start by pre-multiplying both sides of (2.45) by $\mathbf{u}_i^{\dagger}$, the conjugate transpose of $\mathbf{u}_i$. This gives us
$$\mathbf{u}_i^{\dagger}\boldsymbol{\Sigma}\mathbf{u}_i = \lambda_i\mathbf{u}_i^{\dagger}\mathbf{u}_i. \tag{95}$$
Next consider the conjugate transpose of (2.45) and post-multiply it by $\mathbf{u}_i$, which gives us
$$\mathbf{u}_i^{\dagger}\boldsymbol{\Sigma}^{\dagger}\mathbf{u}_i = \lambda_i^*\mathbf{u}_i^{\dagger}\mathbf{u}_i \tag{96}$$
where $\lambda_i^*$ is the complex conjugate of $\lambda_i$. We now subtract (95) from (96) and use the fact that Σ is real and symmetric, and hence $\boldsymbol{\Sigma} = \boldsymbol{\Sigma}^{\dagger}$, to get
$$0 = (\lambda_i^*-\lambda_i)\,\mathbf{u}_i^{\dagger}\mathbf{u}_i.$$
Hence $\lambda_i^* = \lambda_i$ and so $\lambda_i$ must be real.
Now consider
$$\mathbf{u}_i^{\mathrm{T}}\mathbf{u}_j\lambda_j = \mathbf{u}_i^{\mathrm{T}}\boldsymbol{\Sigma}\mathbf{u}_j = \mathbf{u}_i^{\mathrm{T}}\boldsymbol{\Sigma}^{\mathrm{T}}\mathbf{u}_j = (\boldsymbol{\Sigma}\mathbf{u}_i)^{\mathrm{T}}\mathbf{u}_j = \lambda_i\mathbf{u}_i^{\mathrm{T}}\mathbf{u}_j,$$
where we have used (2.45) and the fact that Σ is symmetric. If we assume that $0\neq\lambda_i\neq\lambda_j\neq 0$, the only solution to this equation is that $\mathbf{u}_i^{\mathrm{T}}\mathbf{u}_j = 0$, i.e., that $\mathbf{u}_i$ and $\mathbf{u}_j$ are orthogonal.
If $0\neq\lambda_i = \lambda_j\neq 0$, any linear combination of $\mathbf{u}_i$ and $\mathbf{u}_j$ will be an eigenvector with eigenvalue $\lambda = \lambda_i = \lambda_j$, since, from (2.45),
$$\boldsymbol{\Sigma}(a\mathbf{u}_i+b\mathbf{u}_j) = a\lambda_i\mathbf{u}_i + b\lambda_j\mathbf{u}_j = \lambda(a\mathbf{u}_i+b\mathbf{u}_j).$$
Assuming that $\mathbf{u}_i\neq\mathbf{u}_j$, we can construct
$$\mathbf{u}_{\alpha} = a\mathbf{u}_i + b\mathbf{u}_j, \qquad \mathbf{u}_{\beta} = c\mathbf{u}_i + d\mathbf{u}_j$$
such that $\mathbf{u}_{\alpha}$ and $\mathbf{u}_{\beta}$ are mutually orthogonal and of unit length. Since $\mathbf{u}_i$ and $\mathbf{u}_j$ are orthogonal to $\mathbf{u}_k$ ($k\neq i$, $k\neq j$), so are $\mathbf{u}_{\alpha}$ and $\mathbf{u}_{\beta}$. Thus, $\mathbf{u}_{\alpha}$ and $\mathbf{u}_{\beta}$ satisfy (2.46).
Finally, if $\lambda_i = 0$, Σ must be singular, with $\mathbf{u}_i$ lying in the nullspace of Σ. In this case, $\mathbf{u}_i$ will be orthogonal to the eigenvectors projecting onto the rowspace of Σ and we can choose $\|\mathbf{u}_i\| = 1$, so that (2.46) is satisfied. If more than one eigenvalue equals zero, we can choose the corresponding eigenvectors arbitrarily, as long as they remain in the nullspace of Σ, and so we can choose them to satisfy (2.46).
2.19 We can write the r.h.s. of (2.48) in matrix form as
$$\sum_{i=1}^D\lambda_i\mathbf{u}_i\mathbf{u}_i^{\mathrm{T}} = \mathbf{U}\boldsymbol{\Lambda}\mathbf{U}^{\mathrm{T}} = \mathbf{M},$$
where U is a D × D matrix with the eigenvectors $\mathbf{u}_1,\ldots,\mathbf{u}_D$ as its columns and Λ is a diagonal matrix with the eigenvalues $\lambda_1,\ldots,\lambda_D$ along its diagonal.
Thus we have
$$\mathbf{U}^{\mathrm{T}}\mathbf{M}\mathbf{U} = \mathbf{U}^{\mathrm{T}}\mathbf{U}\boldsymbol{\Lambda}\mathbf{U}^{\mathrm{T}}\mathbf{U} = \boldsymbol{\Lambda}.$$
However, from (2.45)–(2.47), we also have that
$$\mathbf{U}^{\mathrm{T}}\boldsymbol{\Sigma}\mathbf{U} = \mathbf{U}^{\mathrm{T}}\mathbf{U}\boldsymbol{\Lambda} = \boldsymbol{\Lambda},$$
and so M = Σ and (2.48) holds.
Moreover, since U is orthonormal, $\mathbf{U}^{-1} = \mathbf{U}^{\mathrm{T}}$ and so
$$\boldsymbol{\Sigma}^{-1} = \left(\mathbf{U}\boldsymbol{\Lambda}\mathbf{U}^{\mathrm{T}}\right)^{-1} = \left(\mathbf{U}^{\mathrm{T}}\right)^{-1}\boldsymbol{\Lambda}^{-1}\mathbf{U}^{-1} = \mathbf{U}\boldsymbol{\Lambda}^{-1}\mathbf{U}^{\mathrm{T}} = \sum_{i=1}^D\frac{1}{\lambda_i}\mathbf{u}_i\mathbf{u}_i^{\mathrm{T}}.$$
2.20 Since $\mathbf{u}_1,\ldots,\mathbf{u}_D$ constitute a basis for $\mathbb{R}^D$, we can write
$$\mathbf{a} = \hat{a}_1\mathbf{u}_1 + \hat{a}_2\mathbf{u}_2 + \ldots + \hat{a}_D\mathbf{u}_D,$$
where $\hat{a}_1,\ldots,\hat{a}_D$ are coefficients obtained by projecting a on $\mathbf{u}_1,\ldots,\mathbf{u}_D$. Note that they typically do not equal the elements of a.
Using this we can write
$$\mathbf{a}^{\mathrm{T}}\boldsymbol{\Sigma}\mathbf{a} = \left(\hat{a}_1\mathbf{u}_1^{\mathrm{T}}+\ldots+\hat{a}_D\mathbf{u}_D^{\mathrm{T}}\right)\boldsymbol{\Sigma}\left(\hat{a}_1\mathbf{u}_1+\ldots+\hat{a}_D\mathbf{u}_D\right)$$
and combining this result with (2.45) we get
$$\left(\hat{a}_1\mathbf{u}_1^{\mathrm{T}}+\ldots+\hat{a}_D\mathbf{u}_D^{\mathrm{T}}\right)\left(\hat{a}_1\lambda_1\mathbf{u}_1+\ldots+\hat{a}_D\lambda_D\mathbf{u}_D\right).$$
Now, since $\mathbf{u}_i^{\mathrm{T}}\mathbf{u}_j = 1$ only if i = j, and 0 otherwise, this becomes
$$\hat{a}_1^2\lambda_1 + \ldots + \hat{a}_D^2\lambda_D$$
and since a is real, we see that this expression will be strictly positive for any non-zero a, if all eigenvalues are strictly positive. It is also clear that if an eigenvalue, $\lambda_i$, is zero or negative, there exists a vector a (e.g. $\mathbf{a} = \mathbf{u}_i$) for which this expression will be less than or equal to zero. Thus, that a matrix has eigenvalues which are all strictly positive is a sufficient and necessary condition for the matrix to be positive definite.
2.21 A D × D matrix has $D^2$ elements. If it is symmetric then the elements not on the leading diagonal form pairs of equal value. There are D elements on the diagonal so the number of elements not on the diagonal is $D^2-D$ and only half of these are independent, giving
$$\frac{D^2-D}{2}.$$
If we now add back the D elements on the diagonal we get
$$\frac{D^2-D}{2} + D = \frac{D(D+1)}{2}.$$
2.22 Consider a matrix M which is symmetric, so that $\mathbf{M}^{\mathrm{T}} = \mathbf{M}$. The inverse matrix $\mathbf{M}^{-1}$ satisfies
$$\mathbf{M}\mathbf{M}^{-1} = \mathbf{I}.$$
Taking the transpose of both sides of this equation, and using the relation (C.1), we obtain
$$\left(\mathbf{M}^{-1}\right)^{\mathrm{T}}\mathbf{M}^{\mathrm{T}} = \mathbf{I}^{\mathrm{T}} = \mathbf{I}$$
since the identity matrix is symmetric. Making use of the symmetry condition for M we then have
$$\left(\mathbf{M}^{-1}\right)^{\mathrm{T}}\mathbf{M} = \mathbf{I}$$
and hence, from the definition of the matrix inverse,
$$\left(\mathbf{M}^{-1}\right)^{\mathrm{T}} = \mathbf{M}^{-1}$$
and so $\mathbf{M}^{-1}$ is also a symmetric matrix.
2.23 Recall that the transformation (2.51) diagonalizes the coordinate system and that the quadratic form (2.44), corresponding to the square of the Mahalanobis distance, is then given by (2.50). This corresponds to a shift in the origin of the coordinate system and a rotation so that the hyper-ellipsoidal contours along which the Mahalanobis distance is constant become axis aligned. The volume contained within any one such contour is unchanged by shifts and rotations. We now make the further transformation $z_i = y_i/\lambda_i^{1/2}$ for $i = 1,\ldots,D$. The volume within the hyper-ellipsoid then becomes
$$\int\prod_{i=1}^D\mathrm{d}y_i = \prod_{i=1}^D\lambda_i^{1/2}\int\prod_{i=1}^D\mathrm{d}z_i = |\boldsymbol{\Sigma}|^{1/2}V_D\Delta^D$$
where we have used the property that the determinant of Σ is given by the product of its eigenvalues, together with the fact that in the z coordinates the volume has become a sphere of radius Δ whose volume is $V_D\Delta^D$.
2.24 Multiplying the left hand side of (2.76) by the matrix (2.287) trivially gives the identity matrix. On the right hand side consider the four blocks of the resulting partitioned matrix:
upper left
$$\mathbf{A}\mathbf{M} - \mathbf{B}\mathbf{D}^{-1}\mathbf{C}\mathbf{M} = (\mathbf{A}-\mathbf{B}\mathbf{D}^{-1}\mathbf{C})(\mathbf{A}-\mathbf{B}\mathbf{D}^{-1}\mathbf{C})^{-1} = \mathbf{I}$$
upper right
$$-\mathbf{A}\mathbf{M}\mathbf{B}\mathbf{D}^{-1} + \mathbf{B}\mathbf{D}^{-1} + \mathbf{B}\mathbf{D}^{-1}\mathbf{C}\mathbf{M}\mathbf{B}\mathbf{D}^{-1} = -(\mathbf{A}-\mathbf{B}\mathbf{D}^{-1}\mathbf{C})(\mathbf{A}-\mathbf{B}\mathbf{D}^{-1}\mathbf{C})^{-1}\mathbf{B}\mathbf{D}^{-1} + \mathbf{B}\mathbf{D}^{-1} = -\mathbf{B}\mathbf{D}^{-1} + \mathbf{B}\mathbf{D}^{-1} = \mathbf{0}$$
lower left
$$\mathbf{C}\mathbf{M} - \mathbf{D}\mathbf{D}^{-1}\mathbf{C}\mathbf{M} = \mathbf{C}\mathbf{M} - \mathbf{C}\mathbf{M} = \mathbf{0}$$
lower right
$$-\mathbf{C}\mathbf{M}\mathbf{B}\mathbf{D}^{-1} + \mathbf{D}\mathbf{D}^{-1} + \mathbf{D}\mathbf{D}^{-1}\mathbf{C}\mathbf{M}\mathbf{B}\mathbf{D}^{-1} = \mathbf{D}\mathbf{D}^{-1} = \mathbf{I}.$$
Thus the right hand side also equals the identity matrix.
2.25 We first of all take the joint distribution $p(\mathbf{x}_a,\mathbf{x}_b,\mathbf{x}_c)$ and marginalize to obtain the distribution $p(\mathbf{x}_a,\mathbf{x}_b)$. Using the results of Section 2.3.2 this is again a Gaussian distribution with mean and covariance given by
$$\boldsymbol{\mu} = \begin{pmatrix}\boldsymbol{\mu}_a\\ \boldsymbol{\mu}_b\end{pmatrix}, \qquad \boldsymbol{\Sigma} = \begin{pmatrix}\boldsymbol{\Sigma}_{aa} & \boldsymbol{\Sigma}_{ab}\\ \boldsymbol{\Sigma}_{ba} & \boldsymbol{\Sigma}_{bb}\end{pmatrix}.$$
From Section 2.3.1 the conditional distribution $p(\mathbf{x}_a|\mathbf{x}_b)$ is then Gaussian with mean and covariance given by (2.81) and (2.82) respectively.
2.26 Multiplying the left hand side of (2.289) by (A + BCD) trivially gives the identity matrix I. On the right hand side we obtain
$$\begin{aligned}
&(\mathbf{A}+\mathbf{B}\mathbf{C}\mathbf{D})\left(\mathbf{A}^{-1} - \mathbf{A}^{-1}\mathbf{B}(\mathbf{C}^{-1}+\mathbf{D}\mathbf{A}^{-1}\mathbf{B})^{-1}\mathbf{D}\mathbf{A}^{-1}\right)\\
&= \mathbf{I} + \mathbf{B}\mathbf{C}\mathbf{D}\mathbf{A}^{-1} - \mathbf{B}(\mathbf{C}^{-1}+\mathbf{D}\mathbf{A}^{-1}\mathbf{B})^{-1}\mathbf{D}\mathbf{A}^{-1} - \mathbf{B}\mathbf{C}\mathbf{D}\mathbf{A}^{-1}\mathbf{B}(\mathbf{C}^{-1}+\mathbf{D}\mathbf{A}^{-1}\mathbf{B})^{-1}\mathbf{D}\mathbf{A}^{-1}\\
&= \mathbf{I} + \mathbf{B}\mathbf{C}\mathbf{D}\mathbf{A}^{-1} - \mathbf{B}\mathbf{C}(\mathbf{C}^{-1}+\mathbf{D}\mathbf{A}^{-1}\mathbf{B})(\mathbf{C}^{-1}+\mathbf{D}\mathbf{A}^{-1}\mathbf{B})^{-1}\mathbf{D}\mathbf{A}^{-1}\\
&= \mathbf{I} + \mathbf{B}\mathbf{C}\mathbf{D}\mathbf{A}^{-1} - \mathbf{B}\mathbf{C}\mathbf{D}\mathbf{A}^{-1} = \mathbf{I}.
\end{aligned}$$
2.27 From y = x + z we have trivially that E[y] = E[x] + E[z]. For the covariance we have
$$\begin{aligned}
\operatorname{cov}[\mathbf{y}] &= \mathbb{E}\left[(\mathbf{x}-\mathbb{E}[\mathbf{x}]+\mathbf{z}-\mathbb{E}[\mathbf{z}])(\mathbf{x}-\mathbb{E}[\mathbf{x}]+\mathbf{z}-\mathbb{E}[\mathbf{z}])^{\mathrm{T}}\right]\\
&= \mathbb{E}\left[(\mathbf{x}-\mathbb{E}[\mathbf{x}])(\mathbf{x}-\mathbb{E}[\mathbf{x}])^{\mathrm{T}}\right] + \mathbb{E}\left[(\mathbf{z}-\mathbb{E}[\mathbf{z}])(\mathbf{z}-\mathbb{E}[\mathbf{z}])^{\mathrm{T}}\right]\\
&\quad + \underbrace{\mathbb{E}\left[(\mathbf{x}-\mathbb{E}[\mathbf{x}])(\mathbf{z}-\mathbb{E}[\mathbf{z}])^{\mathrm{T}}\right]}_{=\,0} + \underbrace{\mathbb{E}\left[(\mathbf{z}-\mathbb{E}[\mathbf{z}])(\mathbf{x}-\mathbb{E}[\mathbf{x}])^{\mathrm{T}}\right]}_{=\,0}\\
&= \operatorname{cov}[\mathbf{x}] + \operatorname{cov}[\mathbf{z}]
\end{aligned}$$
where we have used the independence of x and z, together with $\mathbb{E}[(\mathbf{x}-\mathbb{E}[\mathbf{x}])] = \mathbb{E}[(\mathbf{z}-\mathbb{E}[\mathbf{z}])] = \mathbf{0}$, to set the third and fourth terms in the expansion to zero. For 1-dimensional variables the covariances become variances and we obtain the result of Exercise 1.10 as a special case.
2.28 For the marginal distribution p(x) we see from (2.92) that the mean is given by the upper partition of (2.108) which is simply μ. Similarly from (2.93) we see that the covariance is given by the top left partition of (2.105) and is therefore given by $\boldsymbol{\Lambda}^{-1}$.
Now consider the conditional distribution p(y|x). Applying the result (2.81) for the conditional mean we obtain
$$\boldsymbol{\mu}_{y|x} = \mathbf{A}\boldsymbol{\mu} + \mathbf{b} + \mathbf{A}\boldsymbol{\Lambda}^{-1}\boldsymbol{\Lambda}(\mathbf{x}-\boldsymbol{\mu}) = \mathbf{A}\mathbf{x} + \mathbf{b}.$$
Similarly applying the result (2.82) for the covariance of the conditional distribution we have
$$\operatorname{cov}[\mathbf{y}|\mathbf{x}] = \mathbf{L}^{-1} + \mathbf{A}\boldsymbol{\Lambda}^{-1}\mathbf{A}^{\mathrm{T}} - \mathbf{A}\boldsymbol{\Lambda}^{-1}\boldsymbol{\Lambda}\boldsymbol{\Lambda}^{-1}\mathbf{A}^{\mathrm{T}} = \mathbf{L}^{-1}$$
as required.
2.29 We first define
$$\mathbf{X} = \boldsymbol{\Lambda} + \mathbf{A}^{\mathrm{T}}\mathbf{L}\mathbf{A} \tag{97}$$
and
$$\mathbf{W} = -\mathbf{L}\mathbf{A}, \quad\text{and thus}\quad \mathbf{W}^{\mathrm{T}} = -\mathbf{A}^{\mathrm{T}}\mathbf{L}^{\mathrm{T}} = -\mathbf{A}^{\mathrm{T}}\mathbf{L}, \tag{98}$$
since L is symmetric. We can use (97) and (98) to rewrite (2.104) as
$$\mathbf{R} = \begin{pmatrix}\mathbf{X} & \mathbf{W}^{\mathrm{T}}\\ \mathbf{W} & \mathbf{L}\end{pmatrix}$$
and using (2.76) we get
$$\begin{pmatrix}\mathbf{X} & \mathbf{W}^{\mathrm{T}}\\ \mathbf{W} & \mathbf{L}\end{pmatrix}^{-1} = \begin{pmatrix}\mathbf{M} & -\mathbf{M}\mathbf{W}^{\mathrm{T}}\mathbf{L}^{-1}\\ -\mathbf{L}^{-1}\mathbf{W}\mathbf{M} & \mathbf{L}^{-1}+\mathbf{L}^{-1}\mathbf{W}\mathbf{M}\mathbf{W}^{\mathrm{T}}\mathbf{L}^{-1}\end{pmatrix}$$
where now
$$\mathbf{M} = \left(\mathbf{X}-\mathbf{W}^{\mathrm{T}}\mathbf{L}^{-1}\mathbf{W}\right)^{-1}.$$
Substituting X and W using (97) and (98), respectively, we get
$$\mathbf{M} = \left(\boldsymbol{\Lambda}+\mathbf{A}^{\mathrm{T}}\mathbf{L}\mathbf{A}-\mathbf{A}^{\mathrm{T}}\mathbf{L}\mathbf{L}^{-1}\mathbf{L}\mathbf{A}\right)^{-1} = \boldsymbol{\Lambda}^{-1},$$
$$-\mathbf{M}\mathbf{W}^{\mathrm{T}}\mathbf{L}^{-1} = \boldsymbol{\Lambda}^{-1}\mathbf{A}^{\mathrm{T}}\mathbf{L}\mathbf{L}^{-1} = \boldsymbol{\Lambda}^{-1}\mathbf{A}^{\mathrm{T}}$$
and
$$\mathbf{L}^{-1}+\mathbf{L}^{-1}\mathbf{W}\mathbf{M}\mathbf{W}^{\mathrm{T}}\mathbf{L}^{-1} = \mathbf{L}^{-1}+\mathbf{L}^{-1}\mathbf{L}\mathbf{A}\boldsymbol{\Lambda}^{-1}\mathbf{A}^{\mathrm{T}}\mathbf{L}\mathbf{L}^{-1} = \mathbf{L}^{-1}+\mathbf{A}\boldsymbol{\Lambda}^{-1}\mathbf{A}^{\mathrm{T}},$$
as required.
2.30 Substituting the leftmost expression of (2.105) for $\mathbf{R}^{-1}$ in (2.107), we get
$$\begin{aligned}
&\begin{pmatrix}\boldsymbol{\Lambda}^{-1} & \boldsymbol{\Lambda}^{-1}\mathbf{A}^{\mathrm{T}}\\ \mathbf{A}\boldsymbol{\Lambda}^{-1} & \mathbf{S}^{-1}+\mathbf{A}\boldsymbol{\Lambda}^{-1}\mathbf{A}^{\mathrm{T}}\end{pmatrix}\begin{pmatrix}\boldsymbol{\Lambda}\boldsymbol{\mu}-\mathbf{A}^{\mathrm{T}}\mathbf{S}\mathbf{b}\\ \mathbf{S}\mathbf{b}\end{pmatrix}\\
&= \begin{pmatrix}\boldsymbol{\Lambda}^{-1}\left(\boldsymbol{\Lambda}\boldsymbol{\mu}-\mathbf{A}^{\mathrm{T}}\mathbf{S}\mathbf{b}\right)+\boldsymbol{\Lambda}^{-1}\mathbf{A}^{\mathrm{T}}\mathbf{S}\mathbf{b}\\ \mathbf{A}\boldsymbol{\Lambda}^{-1}\left(\boldsymbol{\Lambda}\boldsymbol{\mu}-\mathbf{A}^{\mathrm{T}}\mathbf{S}\mathbf{b}\right)+\left(\mathbf{S}^{-1}+\mathbf{A}\boldsymbol{\Lambda}^{-1}\mathbf{A}^{\mathrm{T}}\right)\mathbf{S}\mathbf{b}\end{pmatrix}\\
&= \begin{pmatrix}\boldsymbol{\mu}-\boldsymbol{\Lambda}^{-1}\mathbf{A}^{\mathrm{T}}\mathbf{S}\mathbf{b}+\boldsymbol{\Lambda}^{-1}\mathbf{A}^{\mathrm{T}}\mathbf{S}\mathbf{b}\\ \mathbf{A}\boldsymbol{\mu}-\mathbf{A}\boldsymbol{\Lambda}^{-1}\mathbf{A}^{\mathrm{T}}\mathbf{S}\mathbf{b}+\mathbf{b}+\mathbf{A}\boldsymbol{\Lambda}^{-1}\mathbf{A}^{\mathrm{T}}\mathbf{S}\mathbf{b}\end{pmatrix}\\
&= \begin{pmatrix}\boldsymbol{\mu}\\ \mathbf{A}\boldsymbol{\mu}+\mathbf{b}\end{pmatrix}.
\end{aligned}$$
2.31 Since y = x + z we can write the conditional distribution of y given x in the form $p(\mathbf{y}|\mathbf{x}) = \mathcal{N}(\mathbf{y}|\boldsymbol{\mu}_z+\mathbf{x},\boldsymbol{\Sigma}_z)$. This gives a decomposition of the joint distribution of x and y in the form p(x, y) = p(y|x)p(x) where $p(\mathbf{x}) = \mathcal{N}(\mathbf{x}|\boldsymbol{\mu}_x,\boldsymbol{\Sigma}_x)$. This therefore takes the form of (2.99) and (2.100) in which we can identify $\boldsymbol{\mu}\to\boldsymbol{\mu}_x$, $\boldsymbol{\Lambda}^{-1}\to\boldsymbol{\Sigma}_x$, $\mathbf{A}\to\mathbf{I}$, $\mathbf{b}\to\boldsymbol{\mu}_z$ and $\mathbf{L}^{-1}\to\boldsymbol{\Sigma}_z$. We can now obtain the marginal distribution p(y) by making use of the result (2.115), from which we obtain $p(\mathbf{y}) = \mathcal{N}(\mathbf{y}|\boldsymbol{\mu}_x+\boldsymbol{\mu}_z,\boldsymbol{\Sigma}_z+\boldsymbol{\Sigma}_x)$. Thus both the means and the covariances are additive, in agreement with the results of Exercise 2.27.
2.32 The quadratic form in the exponential of the joint distribution is given by
$$-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^{\mathrm{T}}\boldsymbol{\Lambda}(\mathbf{x}-\boldsymbol{\mu}) - \frac{1}{2}(\mathbf{y}-\mathbf{A}\mathbf{x}-\mathbf{b})^{\mathrm{T}}\mathbf{L}(\mathbf{y}-\mathbf{A}\mathbf{x}-\mathbf{b}). \tag{99}$$
We now extract all of those terms involving x and assemble them into a standard Gaussian quadratic form by completing the square
$$\begin{aligned}
&= -\frac{1}{2}\mathbf{x}^{\mathrm{T}}(\boldsymbol{\Lambda}+\mathbf{A}^{\mathrm{T}}\mathbf{L}\mathbf{A})\mathbf{x} + \mathbf{x}^{\mathrm{T}}\left[\boldsymbol{\Lambda}\boldsymbol{\mu}+\mathbf{A}^{\mathrm{T}}\mathbf{L}(\mathbf{y}-\mathbf{b})\right] + \text{const}\\
&= -\frac{1}{2}(\mathbf{x}-\mathbf{m})^{\mathrm{T}}(\boldsymbol{\Lambda}+\mathbf{A}^{\mathrm{T}}\mathbf{L}\mathbf{A})(\mathbf{x}-\mathbf{m}) + \frac{1}{2}\mathbf{m}^{\mathrm{T}}(\boldsymbol{\Lambda}+\mathbf{A}^{\mathrm{T}}\mathbf{L}\mathbf{A})\mathbf{m} + \text{const} \tag{100}
\end{aligned}$$
where
$$\mathbf{m} = (\boldsymbol{\Lambda}+\mathbf{A}^{\mathrm{T}}\mathbf{L}\mathbf{A})^{-1}\left[\boldsymbol{\Lambda}\boldsymbol{\mu}+\mathbf{A}^{\mathrm{T}}\mathbf{L}(\mathbf{y}-\mathbf{b})\right].$$
We can now perform the integration over x which eliminates the first term in (100). Then we extract the terms in y from the final term in (100) and combine these with the remaining terms from the quadratic form (99) which depend on y, to give
$$\begin{aligned}
={}& -\frac{1}{2}\mathbf{y}^{\mathrm{T}}\left\{\mathbf{L}-\mathbf{L}\mathbf{A}(\boldsymbol{\Lambda}+\mathbf{A}^{\mathrm{T}}\mathbf{L}\mathbf{A})^{-1}\mathbf{A}^{\mathrm{T}}\mathbf{L}\right\}\mathbf{y}\\
&+ \mathbf{y}^{\mathrm{T}}\left[\left\{\mathbf{L}-\mathbf{L}\mathbf{A}(\boldsymbol{\Lambda}+\mathbf{A}^{\mathrm{T}}\mathbf{L}\mathbf{A})^{-1}\mathbf{A}^{\mathrm{T}}\mathbf{L}\right\}\mathbf{b} + \mathbf{L}\mathbf{A}(\boldsymbol{\Lambda}+\mathbf{A}^{\mathrm{T}}\mathbf{L}\mathbf{A})^{-1}\boldsymbol{\Lambda}\boldsymbol{\mu}\right]. \tag{101}
\end{aligned}$$
We can identify the precision of the marginal distribution p(y) from the second order term in y. To find the corresponding covariance, we take the inverse of the precision and apply the Woodbury inversion formula (2.289) to give
$$\left\{\mathbf{L}-\mathbf{L}\mathbf{A}(\boldsymbol{\Lambda}+\mathbf{A}^{\mathrm{T}}\mathbf{L}\mathbf{A})^{-1}\mathbf{A}^{\mathrm{T}}\mathbf{L}\right\}^{-1} = \mathbf{L}^{-1}+\mathbf{A}\boldsymbol{\Lambda}^{-1}\mathbf{A}^{\mathrm{T}} \tag{102}$$
which corresponds to (2.110).
Next we identify the mean ν of the marginal distribution. To do this we make use of (102) in (101) and then complete the square to give
$$-\frac{1}{2}(\mathbf{y}-\boldsymbol{\nu})^{\mathrm{T}}\left(\mathbf{L}^{-1}+\mathbf{A}\boldsymbol{\Lambda}^{-1}\mathbf{A}^{\mathrm{T}}\right)^{-1}(\mathbf{y}-\boldsymbol{\nu}) + \text{const}$$
where
$$\boldsymbol{\nu} = \left(\mathbf{L}^{-1}+\mathbf{A}\boldsymbol{\Lambda}^{-1}\mathbf{A}^{\mathrm{T}}\right)\left[(\mathbf{L}^{-1}+\mathbf{A}\boldsymbol{\Lambda}^{-1}\mathbf{A}^{\mathrm{T}})^{-1}\mathbf{b} + \mathbf{L}\mathbf{A}(\boldsymbol{\Lambda}+\mathbf{A}^{\mathrm{T}}\mathbf{L}\mathbf{A})^{-1}\boldsymbol{\Lambda}\boldsymbol{\mu}\right].$$
Now consider the two terms in the square brackets, the first one involving b and the second involving μ. The first of these contributions simply gives b, while the term in μ can be written
$$\left(\mathbf{L}^{-1}+\mathbf{A}\boldsymbol{\Lambda}^{-1}\mathbf{A}^{\mathrm{T}}\right)\mathbf{L}\mathbf{A}(\boldsymbol{\Lambda}+\mathbf{A}^{\mathrm{T}}\mathbf{L}\mathbf{A})^{-1}\boldsymbol{\Lambda}\boldsymbol{\mu} = \mathbf{A}(\mathbf{I}+\boldsymbol{\Lambda}^{-1}\mathbf{A}^{\mathrm{T}}\mathbf{L}\mathbf{A})(\mathbf{I}+\boldsymbol{\Lambda}^{-1}\mathbf{A}^{\mathrm{T}}\mathbf{L}\mathbf{A})^{-1}\boldsymbol{\Lambda}^{-1}\boldsymbol{\Lambda}\boldsymbol{\mu} = \mathbf{A}\boldsymbol{\mu}$$
where we have used the general result $(\mathbf{B}\mathbf{C})^{-1} = \mathbf{C}^{-1}\mathbf{B}^{-1}$. Hence we obtain (2.109).
2.33 To find the conditional distribution p(x|y) we start from the quadratic form (99) corresponding to the joint distribution p(x, y). Now, however, we treat y as a constant and simply complete the square over x to give
$$\begin{aligned}
&-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^{\mathrm{T}}\boldsymbol{\Lambda}(\mathbf{x}-\boldsymbol{\mu}) - \frac{1}{2}(\mathbf{y}-\mathbf{A}\mathbf{x}-\mathbf{b})^{\mathrm{T}}\mathbf{L}(\mathbf{y}-\mathbf{A}\mathbf{x}-\mathbf{b})\\
&= -\frac{1}{2}\mathbf{x}^{\mathrm{T}}(\boldsymbol{\Lambda}+\mathbf{A}^{\mathrm{T}}\mathbf{L}\mathbf{A})\mathbf{x} + \mathbf{x}^{\mathrm{T}}\left\{\boldsymbol{\Lambda}\boldsymbol{\mu}+\mathbf{A}^{\mathrm{T}}\mathbf{L}(\mathbf{y}-\mathbf{b})\right\} + \text{const}\\
&= -\frac{1}{2}(\mathbf{x}-\mathbf{m})^{\mathrm{T}}(\boldsymbol{\Lambda}+\mathbf{A}^{\mathrm{T}}\mathbf{L}\mathbf{A})(\mathbf{x}-\mathbf{m}) + \text{const}
\end{aligned}$$
where, as in the solution to Exercise 2.32, we have defined
$$\mathbf{m} = (\boldsymbol{\Lambda}+\mathbf{A}^{\mathrm{T}}\mathbf{L}\mathbf{A})^{-1}\left\{\boldsymbol{\Lambda}\boldsymbol{\mu}+\mathbf{A}^{\mathrm{T}}\mathbf{L}(\mathbf{y}-\mathbf{b})\right\}$$
from which we obtain directly the mean and covariance of the conditional distribution in the form (2.111) and (2.112).
2.34 Differentiating (2.118) with respect to Σ we obtain two terms:
$$-\frac{N}{2}\frac{\partial}{\partial\boldsymbol{\Sigma}}\ln|\boldsymbol{\Sigma}| - \frac{1}{2}\frac{\partial}{\partial\boldsymbol{\Sigma}}\sum_{n=1}^N(\mathbf{x}_n-\boldsymbol{\mu})^{\mathrm{T}}\boldsymbol{\Sigma}^{-1}(\mathbf{x}_n-\boldsymbol{\mu}).$$
For the first term, we can apply (C.28) directly to get
$$-\frac{N}{2}\frac{\partial}{\partial\boldsymbol{\Sigma}}\ln|\boldsymbol{\Sigma}| = -\frac{N}{2}\left(\boldsymbol{\Sigma}^{-1}\right)^{\mathrm{T}} = -\frac{N}{2}\boldsymbol{\Sigma}^{-1}.$$
For the second term, we first rewrite the sum
$$\sum_{n=1}^N(\mathbf{x}_n-\boldsymbol{\mu})^{\mathrm{T}}\boldsymbol{\Sigma}^{-1}(\mathbf{x}_n-\boldsymbol{\mu}) = N\operatorname{Tr}\left[\boldsymbol{\Sigma}^{-1}\mathbf{S}\right],$$
where
$$\mathbf{S} = \frac{1}{N}\sum_{n=1}^N(\mathbf{x}_n-\boldsymbol{\mu})(\mathbf{x}_n-\boldsymbol{\mu})^{\mathrm{T}}.$$
Using this together with (C.21), in which $x = \Sigma_{ij}$ (element (i, j) in Σ), and properties of the trace, we get
$$\begin{aligned}
\frac{\partial}{\partial\Sigma_{ij}}\sum_{n=1}^N(\mathbf{x}_n-\boldsymbol{\mu})^{\mathrm{T}}\boldsymbol{\Sigma}^{-1}(\mathbf{x}_n-\boldsymbol{\mu}) &= N\frac{\partial}{\partial\Sigma_{ij}}\operatorname{Tr}\left[\boldsymbol{\Sigma}^{-1}\mathbf{S}\right]\\
&= N\operatorname{Tr}\left[\frac{\partial}{\partial\Sigma_{ij}}\boldsymbol{\Sigma}^{-1}\mathbf{S}\right]\\
&= -N\operatorname{Tr}\left[\boldsymbol{\Sigma}^{-1}\frac{\partial\boldsymbol{\Sigma}}{\partial\Sigma_{ij}}\boldsymbol{\Sigma}^{-1}\mathbf{S}\right]\\
&= -N\operatorname{Tr}\left[\frac{\partial\boldsymbol{\Sigma}}{\partial\Sigma_{ij}}\boldsymbol{\Sigma}^{-1}\mathbf{S}\boldsymbol{\Sigma}^{-1}\right]\\
&= -N\left(\boldsymbol{\Sigma}^{-1}\mathbf{S}\boldsymbol{\Sigma}^{-1}\right)_{ij}
\end{aligned}$$
where we have used (C.26). Note that in the last step we have ignored the fact that $\Sigma_{ij} = \Sigma_{ji}$, so that $\partial\boldsymbol{\Sigma}/\partial\Sigma_{ij}$ has a 1 in position (i, j) only and 0 everywhere else. Treating this result as valid nevertheless, we get
$$-\frac{1}{2}\frac{\partial}{\partial\boldsymbol{\Sigma}}\sum_{n=1}^N(\mathbf{x}_n-\boldsymbol{\mu})^{\mathrm{T}}\boldsymbol{\Sigma}^{-1}(\mathbf{x}_n-\boldsymbol{\mu}) = \frac{N}{2}\boldsymbol{\Sigma}^{-1}\mathbf{S}\boldsymbol{\Sigma}^{-1}.$$
Combining the derivatives of the two terms and setting the result to zero, we obtain
$$\frac{N}{2}\boldsymbol{\Sigma}^{-1} = \frac{N}{2}\boldsymbol{\Sigma}^{-1}\mathbf{S}\boldsymbol{\Sigma}^{-1}.$$
Rearrangement then yields
$$\boldsymbol{\Sigma} = \mathbf{S}$$
as required.
2.35 NOTE: In PRML, this exercise contains a typographical error; $\mathbb{E}[x_nx_m]$ should be $\mathbb{E}[x_nx_m^{\mathrm T}]$ on the l.h.s. of (2.291).
The derivation of (2.62) is detailed in the text between (2.59) (page 82) and (2.62) (page 83).
If $m = n$ then, using (2.62), we have $\mathbb{E}[x_nx_n^{\mathrm T}] = \mu\mu^{\mathrm T} + \Sigma$, whereas if $n \neq m$ then the two data points $x_n$ and $x_m$ are independent and hence $\mathbb{E}[x_nx_m^{\mathrm T}] = \mu\mu^{\mathrm T}$, where we have used (2.59). Combining these results we obtain (2.291). From (2.59) and (2.62) we then have
$$\mathbb{E}[\Sigma_{\mathrm{ML}}] = \frac{1}{N}\sum_{n=1}^N \mathbb{E}\left[\left(x_n - \frac{1}{N}\sum_{m=1}^N x_m\right)\left(x_n^{\mathrm T} - \frac{1}{N}\sum_{l=1}^N x_l^{\mathrm T}\right)\right]$$
$$= \frac{1}{N}\sum_{n=1}^N \mathbb{E}\left[x_nx_n^{\mathrm T} - \frac{2}{N}x_n\sum_{m=1}^N x_m^{\mathrm T} + \frac{1}{N^2}\sum_{m=1}^N\sum_{l=1}^N x_mx_l^{\mathrm T}\right]$$
$$= \mu\mu^{\mathrm T} + \Sigma - 2\left(\mu\mu^{\mathrm T} + \frac{1}{N}\Sigma\right) + \mu\mu^{\mathrm T} + \frac{1}{N}\Sigma
= \frac{N-1}{N}\Sigma$$ (103)
as required.
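A quick simulation makes the bias factor $(N-1)/N$ in (103) concrete. This is an illustrative sketch only (not part of the original solution), assuming NumPy; the dimension, sample size and true covariance are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
N, trials = 5, 100_000
Sigma = np.array([[1.0, 0.6], [0.6, 2.0]])
mu = np.array([0.5, -1.0])

acc = np.zeros_like(Sigma)
for _ in range(trials):
    X = rng.multivariate_normal(mu, Sigma, size=N)
    D = X - X.mean(axis=0)
    acc += D.T @ D / N              # Sigma_ML for this data set
acc /= trials

print(acc)                           # close to (N-1)/N * Sigma
print((N - 1) / N * Sigma)
```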
2.36 NOTE: In PRML, there are mistakes that affect this solution. The sign in (2.129) is incorrect, and this equation should read
$$\theta^{(N)} = \theta^{(N-1)} - a_{N-1}z\!\left(\theta^{(N-1)}\right).$$
Then, in order to be consistent with the assumption that $f(\theta) > 0$ for $\theta > \theta^\star$ and $f(\theta) < 0$ for $\theta < \theta^\star$ in Figure 2.10, we should find the root of the expected negative log likelihood. This leads to sign changes in (2.133) and (2.134), but in (2.135) these are cancelled against the change of sign in (2.129), so in effect (2.135) remains unchanged. Also, $x_n$ should be $x_N$ on the l.h.s. of (2.133). Finally, the labels $\mu$ and $\mu_{\mathrm{ML}}$ in Figure 2.11 should be interchanged and there are corresponding changes to the caption (see errata on the PRML web site for details).
Consider the expression for $\sigma^2_{(N)}$ and separate out the contribution from observation $x_N$ to give
$$\sigma^2_{(N)} = \frac{1}{N}\sum_{n=1}^N (x_n-\mu)^2
= \frac{1}{N}\sum_{n=1}^{N-1}(x_n-\mu)^2 + \frac{(x_N-\mu)^2}{N}
= \frac{N-1}{N}\sigma^2_{(N-1)} + \frac{(x_N-\mu)^2}{N}$$
$$= \sigma^2_{(N-1)} - \frac{1}{N}\sigma^2_{(N-1)} + \frac{(x_N-\mu)^2}{N}
= \sigma^2_{(N-1)} + \frac{1}{N}\left\{(x_N-\mu)^2 - \sigma^2_{(N-1)}\right\}.$$ (104)
If we substitute the expression for a Gaussian distribution into the result (2.135) for the Robbins-Monro procedure applied to maximizing likelihood, we obtain
$$\sigma^2_{(N)} = \sigma^2_{(N-1)} + a_{N-1}\frac{\partial}{\partial\sigma^2_{(N-1)}}\left[-\frac{1}{2}\ln\sigma^2_{(N-1)} - \frac{(x_N-\mu)^2}{2\sigma^2_{(N-1)}}\right]$$
$$= \sigma^2_{(N-1)} + a_{N-1}\left[-\frac{1}{2\sigma^2_{(N-1)}} + \frac{(x_N-\mu)^2}{2\sigma^4_{(N-1)}}\right]
= \sigma^2_{(N-1)} + \frac{a_{N-1}}{2\sigma^4_{(N-1)}}\left\{(x_N-\mu)^2 - \sigma^2_{(N-1)}\right\}.$$ (105)
Comparison of (105) with (104) allows us to identify
$$a_{N-1} = \frac{2\sigma^4_{(N-1)}}{N}.$$
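The identification $a_{N-1} = 2\sigma^4_{(N-1)}/N$ can be checked by running the Robbins-Monro step (105) alongside the batch ML estimate. A minimal sketch (not from the original solution), assuming NumPy and a known mean $\mu$:

```python
import numpy as np

rng = np.random.default_rng(2)
mu, sigma_true = 0.0, 2.0
x = rng.normal(mu, sigma_true, size=1000)

var_seq = (x[0] - mu) ** 2              # sigma^2_(1)
for N in range(2, len(x) + 1):
    a = 2.0 * var_seq ** 2 / N          # a_{N-1} = 2 * sigma^4_(N-1) / N
    # Robbins-Monro step (105); with this choice of a it coincides with (104)
    var_seq += a / (2.0 * var_seq ** 2) * ((x[N - 1] - mu) ** 2 - var_seq)

var_batch = np.mean((x - mu) ** 2)      # batch ML variance with known mean
print(var_seq, var_batch)               # the two agree to machine precision
```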
2.37 NOTE: In PRML, this exercise requires the additional assumption that we can use the known true mean, $\mu$, in (2.122). Furthermore, for the derivation of the Robbins-Monro sequential estimation formula, we assume that the covariance matrix is restricted to be diagonal. Starting from (2.122), we have
$$\Sigma^{(N)}_{\mathrm{ML}} = \frac{1}{N}\sum_{n=1}^N (x_n-\mu)(x_n-\mu)^{\mathrm T}
= \frac{1}{N}\sum_{n=1}^{N-1}(x_n-\mu)(x_n-\mu)^{\mathrm T} + \frac{1}{N}(x_N-\mu)(x_N-\mu)^{\mathrm T}$$
$$= \frac{N-1}{N}\Sigma^{(N-1)}_{\mathrm{ML}} + \frac{1}{N}(x_N-\mu)(x_N-\mu)^{\mathrm T}
= \Sigma^{(N-1)}_{\mathrm{ML}} + \frac{1}{N}\left[(x_N-\mu)(x_N-\mu)^{\mathrm T} - \Sigma^{(N-1)}_{\mathrm{ML}}\right].$$ (106)
From Solution 2.34, we know that
$$\frac{\partial}{\partial\Sigma^{(N-1)}_{\mathrm{ML}}}\ln p\!\left(x_N\,\middle|\,\mu,\Sigma^{(N-1)}_{\mathrm{ML}}\right)
= \frac{1}{2}\left(\Sigma^{(N-1)}_{\mathrm{ML}}\right)^{-1}\left[(x_N-\mu)(x_N-\mu)^{\mathrm T} - \Sigma^{(N-1)}_{\mathrm{ML}}\right]\left(\Sigma^{(N-1)}_{\mathrm{ML}}\right)^{-1}$$
$$= \frac{1}{2}\left(\Sigma^{(N-1)}_{\mathrm{ML}}\right)^{-2}\left[(x_N-\mu)(x_N-\mu)^{\mathrm T} - \Sigma^{(N-1)}_{\mathrm{ML}}\right]$$
where we have used the assumption that $\Sigma^{(N-1)}_{\mathrm{ML}}$, and hence $\left(\Sigma^{(N-1)}_{\mathrm{ML}}\right)^{-1}$, is diagonal. If we substitute this into the multivariate form of (2.135), we get
$$\Sigma^{(N)}_{\mathrm{ML}} = \Sigma^{(N-1)}_{\mathrm{ML}} + A_{N-1}\frac{1}{2}\left(\Sigma^{(N-1)}_{\mathrm{ML}}\right)^{-2}\left[(x_N-\mu)(x_N-\mu)^{\mathrm T} - \Sigma^{(N-1)}_{\mathrm{ML}}\right]$$ (107)
where $A_{N-1}$ is a matrix of coefficients corresponding to $a_{N-1}$ in (2.135). By comparing (106) with (107), we see that if we choose
$$A_{N-1} = \frac{2}{N}\left(\Sigma^{(N-1)}_{\mathrm{ML}}\right)^2,$$
we recover (106). Note that if the covariance matrix was restricted further, to the form $\sigma^2 I$, i.e. a spherical Gaussian, the coefficient in (107) would again become a scalar.
2.38 The exponent in the posterior distribution of (2.140) takes the form
$$-\frac{1}{2\sigma_0^2}(\mu-\mu_0)^2 - \frac{1}{2\sigma^2}\sum_{n=1}^N(x_n-\mu)^2
= -\frac{\mu^2}{2}\left(\frac{1}{\sigma_0^2} + \frac{N}{\sigma^2}\right) + \mu\left(\frac{\mu_0}{\sigma_0^2} + \frac{1}{\sigma^2}\sum_{n=1}^N x_n\right) + \text{const.}$$
where 'const.' denotes terms independent of $\mu$. Following the discussion of (2.71) we see that the variance of the posterior distribution is given by
$$\frac{1}{\sigma_N^2} = \frac{N}{\sigma^2} + \frac{1}{\sigma_0^2}.$$
Similarly the mean is given by
$$\mu_N = \left(\frac{N}{\sigma^2} + \frac{1}{\sigma_0^2}\right)^{-1}\left(\frac{\mu_0}{\sigma_0^2} + \frac{1}{\sigma^2}\sum_{n=1}^N x_n\right)$$ (108)
$$= \frac{\sigma^2}{N\sigma_0^2+\sigma^2}\mu_0 + \frac{N\sigma_0^2}{N\sigma_0^2+\sigma^2}\mu_{\mathrm{ML}}.$$ (109)
2.39 From (2.142), we see directly that
$$\frac{1}{\sigma_N^2} = \frac{1}{\sigma_0^2} + \frac{N}{\sigma^2} = \frac{1}{\sigma_0^2} + \frac{N-1}{\sigma^2} + \frac{1}{\sigma^2} = \frac{1}{\sigma_{N-1}^2} + \frac{1}{\sigma^2}.$$ (110)
We also note for later use that
$$\frac{1}{\sigma_N^2} = \frac{\sigma^2 + N\sigma_0^2}{\sigma_0^2\sigma^2} = \frac{\sigma^2 + \sigma_{N-1}^2}{\sigma_{N-1}^2\sigma^2}$$ (111)
and similarly
$$\frac{1}{\sigma_{N-1}^2} = \frac{\sigma^2 + (N-1)\sigma_0^2}{\sigma_0^2\sigma^2}.$$ (112)
Using (2.143), we can rewrite (2.141) as
$$\mu_N = \frac{\sigma^2}{N\sigma_0^2+\sigma^2}\mu_0 + \frac{\sigma_0^2\sum_{n=1}^N x_n}{N\sigma_0^2+\sigma^2}
= \frac{\sigma^2\mu_0 + \sigma_0^2\sum_{n=1}^{N-1}x_n}{N\sigma_0^2+\sigma^2} + \frac{\sigma_0^2 x_N}{N\sigma_0^2+\sigma^2}.$$
Using (2.141), (111) and (112), we can rewrite the first term of this expression as
$$\frac{\sigma_N^2}{\sigma_{N-1}^2}\cdot\frac{\sigma^2\mu_0 + \sigma_0^2\sum_{n=1}^{N-1}x_n}{(N-1)\sigma_0^2+\sigma^2} = \frac{\sigma_N^2}{\sigma_{N-1}^2}\mu_{N-1}.$$
Similarly, using (111), the second term can be rewritten as
$$\frac{\sigma_N^2}{\sigma^2}x_N$$
and so
$$\mu_N = \frac{\sigma_N^2}{\sigma_{N-1}^2}\mu_{N-1} + \frac{\sigma_N^2}{\sigma^2}x_N.$$ (113)
Now consider
$$p(\mu|\mu_N,\sigma_N^2) \propto p(\mu|\mu_{N-1},\sigma_{N-1}^2)\,p(x_N|\mu,\sigma^2)
= \mathcal{N}(\mu|\mu_{N-1},\sigma_{N-1}^2)\,\mathcal{N}(x_N|\mu,\sigma^2)$$
$$\propto \exp\left(-\frac{1}{2}\left[\frac{\mu_{N-1}^2 - 2\mu\mu_{N-1} + \mu^2}{\sigma_{N-1}^2} + \frac{x_N^2 - 2x_N\mu + \mu^2}{\sigma^2}\right]\right)$$
$$= \exp\left(-\frac{1}{2}\,\frac{\sigma^2(\mu_{N-1}^2 - 2\mu\mu_{N-1} + \mu^2) + \sigma_{N-1}^2(x_N^2 - 2x_N\mu + \mu^2)}{\sigma_{N-1}^2\sigma^2}\right)$$
$$= \exp\left(-\frac{1}{2}\,\frac{(\sigma_{N-1}^2+\sigma^2)\mu^2 - 2(\sigma^2\mu_{N-1}+\sigma_{N-1}^2x_N)\mu}{\sigma_{N-1}^2\sigma^2} + C\right),$$
where $C$ accounts for all the remaining terms that are independent of $\mu$. From this, we can directly read off
$$\frac{1}{\sigma_N^2} = \frac{\sigma^2+\sigma_{N-1}^2}{\sigma_{N-1}^2\sigma^2} = \frac{1}{\sigma_{N-1}^2} + \frac{1}{\sigma^2}$$
and
$$\mu_N = \frac{\sigma^2\mu_{N-1}+\sigma_{N-1}^2x_N}{\sigma_{N-1}^2+\sigma^2}
= \frac{\sigma^2}{\sigma_{N-1}^2+\sigma^2}\mu_{N-1} + \frac{\sigma_{N-1}^2}{\sigma_{N-1}^2+\sigma^2}x_N
= \frac{\sigma_N^2}{\sigma_{N-1}^2}\mu_{N-1} + \frac{\sigma_N^2}{\sigma^2}x_N$$
and so we have recovered (110) and (113).
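The recursions (110) and (113) are easily verified against the batch formulas (2.141) and (2.142). A minimal numerical sketch (not part of the original solution), assuming NumPy; all parameter values are arbitrary illustrations:

```python
import numpy as np

rng = np.random.default_rng(3)
mu0, s0_sq = 1.0, 4.0            # prior mean and variance
s_sq = 0.25                      # known noise variance
x = rng.normal(2.0, np.sqrt(s_sq), size=50)

# sequential update using (110) and (113)
m, v = mu0, s0_sq
for xn in x:
    v_new = 1.0 / (1.0 / v + 1.0 / s_sq)
    m = (v_new / v) * m + (v_new / s_sq) * xn
    v = v_new

# batch result, (2.141) and (2.142)
N = len(x)
vN = 1.0 / (1.0 / s0_sq + N / s_sq)
mN = vN * (mu0 / s0_sq + x.sum() / s_sq)
print(np.isclose(m, mN), np.isclose(v, vN))
```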
2.40 The posterior distribution is proportional to the product of the prior and the likelihood function
$$p(\mu|X) \propto p(\mu)\prod_{n=1}^N p(x_n|\mu,\Sigma).$$
Thus the posterior is proportional to an exponential of a quadratic form in $\mu$ given by
$$-\frac{1}{2}(\mu-\mu_0)^{\mathrm T}\Sigma_0^{-1}(\mu-\mu_0) - \frac{1}{2}\sum_{n=1}^N(x_n-\mu)^{\mathrm T}\Sigma^{-1}(x_n-\mu)$$
$$= -\frac{1}{2}\mu^{\mathrm T}\left(\Sigma_0^{-1}+N\Sigma^{-1}\right)\mu + \mu^{\mathrm T}\left(\Sigma_0^{-1}\mu_0 + \Sigma^{-1}\sum_{n=1}^N x_n\right) + \text{const}$$
where 'const' denotes terms independent of $\mu$. Using the discussion following (2.71) we see that the mean and covariance of the posterior distribution are given by
$$\mu_N = \left(\Sigma_0^{-1}+N\Sigma^{-1}\right)^{-1}\left(\Sigma_0^{-1}\mu_0 + \Sigma^{-1}N\mu_{\mathrm{ML}}\right)$$ (114)
$$\Sigma_N^{-1} = \Sigma_0^{-1} + N\Sigma^{-1}$$ (115)
where $\mu_{\mathrm{ML}}$ is the maximum likelihood solution for the mean given by
$$\mu_{\mathrm{ML}} = \frac{1}{N}\sum_{n=1}^N x_n.$$
2.41 If we consider the integral of the Gamma distribution over $\tau$ and make the change of variable $b\tau = u$ we have
$$\int_0^\infty \mathrm{Gam}(\tau|a,b)\,\mathrm{d}\tau = \frac{1}{\Gamma(a)}\int_0^\infty b^a\tau^{a-1}\exp(-b\tau)\,\mathrm{d}\tau
= \frac{1}{\Gamma(a)}\int_0^\infty b^a u^{a-1}\exp(-u)\,b^{1-a}b^{-1}\,\mathrm{d}u = 1$$
where we have used the definition (1.141) of the Gamma function.
2.42 We can use the same change of variable as in the previous exercise to evaluate the mean of the Gamma distribution
$$\mathbb{E}[\tau] = \frac{1}{\Gamma(a)}\int_0^\infty b^a\tau^{a-1}\tau\exp(-b\tau)\,\mathrm{d}\tau
= \frac{1}{\Gamma(a)}\int_0^\infty b^a u^a\exp(-u)\,b^{-a}b^{-1}\,\mathrm{d}u
= \frac{\Gamma(a+1)}{b\,\Gamma(a)} = \frac{a}{b}$$
where we have used the recurrence relation $\Gamma(a+1) = a\Gamma(a)$ for the Gamma function. Similarly we can find the variance by first evaluating
$$\mathbb{E}[\tau^2] = \frac{1}{\Gamma(a)}\int_0^\infty b^a\tau^{a-1}\tau^2\exp(-b\tau)\,\mathrm{d}\tau
= \frac{1}{\Gamma(a)}\int_0^\infty b^a u^{a+1}\exp(-u)\,b^{-a-1}b^{-1}\,\mathrm{d}u
= \frac{\Gamma(a+2)}{b^2\Gamma(a)} = \frac{(a+1)\Gamma(a+1)}{b^2\Gamma(a)} = \frac{a(a+1)}{b^2}$$
and then using
$$\mathrm{var}[\tau] = \mathbb{E}[\tau^2] - \mathbb{E}[\tau]^2 = \frac{a(a+1)}{b^2} - \frac{a^2}{b^2} = \frac{a}{b^2}.$$
Finally, the mode of the Gamma distribution is obtained simply by differentiation
$$\frac{\mathrm{d}}{\mathrm{d}\tau}\left\{\tau^{a-1}\exp(-b\tau)\right\} = \left(\frac{a-1}{\tau}-b\right)\tau^{a-1}\exp(-b\tau) = 0$$
from which we obtain
$$\mathrm{mode}[\tau] = \frac{a-1}{b}.$$
Notice that the mode only exists if $a \geqslant 1$, since $\tau$ must be a nonnegative quantity. This is also apparent in the plot of Figure 2.13.
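The three results above (mean $a/b$, variance $a/b^2$, mode $(a-1)/b$) can be sanity-checked by sampling. A minimal sketch (not from the original solution), assuming NumPy; note that numpy.random parameterizes the Gamma by shape $a$ and scale $1/b$:

```python
import numpy as np

rng = np.random.default_rng(4)
a, b = 3.0, 2.0
tau = rng.gamma(shape=a, scale=1.0 / b, size=1_000_000)

print(tau.mean(), a / b)                 # mean
print(tau.var(), a / (b * b))            # variance
hist, edges = np.histogram(tau, bins=400)
print(edges[np.argmax(hist)], (a - 1) / b)   # crude mode estimate vs (a-1)/b
```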
2.43 To prove the normalization of the distribution (2.293) consider the integral
$$I = \int_{-\infty}^{\infty}\exp\left(-\frac{|x|^q}{2\sigma^2}\right)\mathrm{d}x = 2\int_0^\infty\exp\left(-\frac{x^q}{2\sigma^2}\right)\mathrm{d}x$$
and make the change of variable
$$u = \frac{x^q}{2\sigma^2}.$$
Using the definition (1.141) of the Gamma function, this gives
$$I = 2\int_0^\infty \frac{2\sigma^2}{q}\left(2\sigma^2u\right)^{(1-q)/q}\exp(-u)\,\mathrm{d}u = \frac{2\left(2\sigma^2\right)^{1/q}\Gamma(1/q)}{q}$$
from which the normalization of (2.293) follows.
For the given noise distribution, the conditional distribution of the target variable given the input variable is
$$p(t|x,w,\sigma^2) = \frac{q}{2\left(2\sigma^2\right)^{1/q}\Gamma(1/q)}\exp\left(-\frac{|t-y(x,w)|^q}{2\sigma^2}\right).$$
The likelihood function is obtained by taking products of factors of this form, over all pairs $\{x_n, t_n\}$. Taking the logarithm, and discarding additive constants, we obtain the desired result.
2.44 From Bayes' theorem we have
$$p(\mu,\lambda|X) \propto p(X|\mu,\lambda)\,p(\mu,\lambda),$$
where the factors on the r.h.s. are given by (2.152) and (2.154), respectively. Writing this out in full, we get
$$p(\mu,\lambda|X) \propto \left[\lambda^{1/2}\exp\left(-\frac{\lambda\mu^2}{2}\right)\right]^N\exp\left(\lambda\mu\sum_{n=1}^N x_n - \frac{\lambda}{2}\sum_{n=1}^N x_n^2\right)
(\beta\lambda)^{1/2}\exp\left[-\frac{\beta\lambda}{2}\left(\mu^2 - 2\mu\mu_0 + \mu_0^2\right)\right]\lambda^{a-1}\exp(-b\lambda),$$
where we have used the definitions of the Gaussian and Gamma distributions and we have omitted terms independent of $\mu$ and $\lambda$. We can rearrange this to obtain
$$\lambda^{N/2}\lambda^{a-1}\exp\left[-\left(b + \frac{1}{2}\sum_{n=1}^N x_n^2 + \frac{\beta}{2}\mu_0^2\right)\lambda\right]
\left(\lambda(N+\beta)\right)^{1/2}\exp\left[-\frac{\lambda(N+\beta)}{2}\left(\mu^2 - \frac{2}{N+\beta}\left(\beta\mu_0 + \sum_{n=1}^N x_n\right)\mu\right)\right]$$
and by completing the square in the argument of the second exponential,
$$\lambda^{N/2}\lambda^{a-1}\exp\left\{-\left(b + \frac{1}{2}\sum_{n=1}^N x_n^2 + \frac{\beta}{2}\mu_0^2 - \frac{\left(\beta\mu_0+\sum_{n=1}^N x_n\right)^2}{2(N+\beta)}\right)\lambda\right\}
\left(\lambda(N+\beta)\right)^{1/2}\exp\left[-\frac{\lambda(N+\beta)}{2}\left(\mu - \frac{\beta\mu_0+\sum_{n=1}^N x_n}{N+\beta}\right)^2\right],$$
we arrive at an (unnormalised) Gaussian-Gamma distribution,
$$\mathcal{N}\!\left(\mu\,\middle|\,\mu_N,\left((N+\beta)\lambda\right)^{-1}\right)\mathrm{Gam}(\lambda|a_N,b_N),$$
with parameters
$$\mu_N = \frac{\beta\mu_0+\sum_{n=1}^N x_n}{N+\beta},\qquad a_N = a+\frac{N}{2},\qquad
b_N = b + \frac{1}{2}\sum_{n=1}^N x_n^2 + \frac{\beta}{2}\mu_0^2 - \frac{N+\beta}{2}\mu_N^2.$$
2.45 We do this, as in the univariate case, by considering the likelihood function of $\Lambda$ for a given data set $\{x_1,\ldots,x_N\}$:
$$\prod_{n=1}^N\mathcal{N}\!\left(x_n|\mu,\Lambda^{-1}\right) \propto |\Lambda|^{N/2}\exp\left(-\frac{1}{2}\sum_{n=1}^N(x_n-\mu)^{\mathrm T}\Lambda(x_n-\mu)\right)
= |\Lambda|^{N/2}\exp\left(-\frac{1}{2}\mathrm{Tr}\left[\Lambda S\right]\right),$$
where $S = \sum_n(x_n-\mu)(x_n-\mu)^{\mathrm T}$. By simply comparing with (2.155), we see that the functional dependence on $\Lambda$ is indeed the same and thus a product of this likelihood and a Wishart prior will result in a Wishart posterior.
2.46 From (2.158), we have
$$\int_0^\infty \frac{b^a e^{-b\tau}\tau^{a-1}}{\Gamma(a)}\left(\frac{\tau}{2\pi}\right)^{1/2}\exp\left\{-\frac{\tau}{2}(x-\mu)^2\right\}\mathrm{d}\tau
= \frac{b^a}{\Gamma(a)}\left(\frac{1}{2\pi}\right)^{1/2}\int_0^\infty \tau^{a-1/2}\exp\left\{-\tau\left(b+\frac{(x-\mu)^2}{2}\right)\right\}\mathrm{d}\tau.$$
We now make the proposed change of variable $z = \tau\Delta$, where $\Delta = b + (x-\mu)^2/2$, yielding
$$\frac{b^a}{\Gamma(a)}\left(\frac{1}{2\pi}\right)^{1/2}\Delta^{-a-1/2}\int_0^\infty z^{a-1/2}\exp(-z)\,\mathrm{d}z
= \frac{b^a}{\Gamma(a)}\left(\frac{1}{2\pi}\right)^{1/2}\Delta^{-a-1/2}\,\Gamma(a+1/2)$$
where we have used the definition of the Gamma function (1.141). Finally, we substitute $b+(x-\mu)^2/2$ for $\Delta$, $\nu/2$ for $a$ and $\nu/2\lambda$ for $b$:
$$\frac{\Gamma(a+1/2)}{\Gamma(a)}\,b^a\left(\frac{1}{2\pi}\right)^{1/2}\Delta^{-a-1/2}
= \frac{\Gamma((\nu+1)/2)}{\Gamma(\nu/2)}\left(\frac{\nu}{2\lambda}\right)^{\nu/2}\left(\frac{1}{2\pi}\right)^{1/2}\left(\frac{\nu}{2\lambda}+\frac{(x-\mu)^2}{2}\right)^{-(\nu+1)/2}$$
$$= \frac{\Gamma((\nu+1)/2)}{\Gamma(\nu/2)}\left(\frac{\nu}{2\lambda}\right)^{\nu/2}\left(\frac{1}{2\pi}\right)^{1/2}\left(\frac{\nu}{2\lambda}\right)^{-(\nu+1)/2}\left(1+\frac{\lambda(x-\mu)^2}{\nu}\right)^{-(\nu+1)/2}$$
$$= \frac{\Gamma((\nu+1)/2)}{\Gamma(\nu/2)}\left(\frac{\lambda}{\nu\pi}\right)^{1/2}\left(1+\frac{\lambda(x-\mu)^2}{\nu}\right)^{-(\nu+1)/2}.$$
2.47 Ignoring the normalization constant, we write (2.159) as
$$\mathrm{St}(x|\mu,\lambda,\nu) \propto \left[1+\frac{\lambda(x-\mu)^2}{\nu}\right]^{-(\nu+1)/2}
= \exp\left(-\frac{\nu+1}{2}\ln\left[1+\frac{\lambda(x-\mu)^2}{\nu}\right]\right).$$ (116)
For large $\nu$, we make use of the Taylor expansion for the logarithm in the form
$$\ln(1+\epsilon) = \epsilon + O(\epsilon^2)$$ (117)
to rewrite (116) as
$$\exp\left(-\frac{\nu+1}{2}\ln\left[1+\frac{\lambda(x-\mu)^2}{\nu}\right]\right)
= \exp\left(-\frac{\nu+1}{2}\left[\frac{\lambda(x-\mu)^2}{\nu}+O(\nu^{-2})\right]\right)
= \exp\left(-\frac{\lambda(x-\mu)^2}{2}+O(\nu^{-1})\right).$$
We see that in the limit $\nu \to \infty$ this becomes, up to an overall constant, the same as a Gaussian distribution with mean $\mu$ and precision $\lambda$. Since the Student distribution is normalized to unity for all values of $\nu$ it follows that it must remain normalized in this limit. The normalization coefficient is given by the standard expression (2.42) for a univariate Gaussian.
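The limiting behaviour can also be seen numerically by comparing the Student density (2.159) with $\mathcal{N}(x|\mu,\lambda^{-1})$ for increasing $\nu$. A minimal sketch (not part of the original solution), assuming NumPy; the Student density is coded directly from (2.159):

```python
import numpy as np
from math import lgamma, pi

def student_pdf(x, mu, lam, nu):
    # univariate Student's t, equation (2.159)
    logc = lgamma((nu + 1) / 2) - lgamma(nu / 2) + 0.5 * np.log(lam / (pi * nu))
    return np.exp(logc - (nu + 1) / 2 * np.log1p(lam * (x - mu) ** 2 / nu))

def gauss_pdf(x, mu, lam):
    return np.sqrt(lam / (2 * pi)) * np.exp(-0.5 * lam * (x - mu) ** 2)

x = np.linspace(-4, 4, 9)
for nu in [2, 10, 100, 10_000]:
    err = np.max(np.abs(student_pdf(x, 0.0, 1.0, nu) - gauss_pdf(x, 0.0, 1.0)))
    print(nu, err)      # the discrepancy shrinks roughly like 1/nu
```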
2.48 Substituting expressions for the Gaussian and Gamma distributions into (2.161), we have
$$\mathrm{St}(x|\mu,\Lambda,\nu) = \int_0^\infty \mathcal{N}\!\left(x|\mu,(\eta\Lambda)^{-1}\right)\mathrm{Gam}(\eta|\nu/2,\nu/2)\,\mathrm{d}\eta
= \frac{(\nu/2)^{\nu/2}}{\Gamma(\nu/2)}\frac{|\Lambda|^{1/2}}{(2\pi)^{D/2}}\int_0^\infty \eta^{D/2}\eta^{\nu/2-1}e^{-\nu\eta/2}e^{-\eta\Delta^2/2}\,\mathrm{d}\eta.$$
Now we make the change of variable
$$\tau = \eta\left[\frac{\nu}{2}+\frac{\Delta^2}{2}\right]$$
which gives
$$\mathrm{St}(x|\mu,\Lambda,\nu) = \frac{(\nu/2)^{\nu/2}}{\Gamma(\nu/2)}\frac{|\Lambda|^{1/2}}{(2\pi)^{D/2}}\left[\frac{\nu}{2}+\frac{\Delta^2}{2}\right]^{-D/2-\nu/2}\int_0^\infty \tau^{D/2+\nu/2-1}e^{-\tau}\,\mathrm{d}\tau
= \frac{\Gamma(\nu/2+D/2)}{\Gamma(\nu/2)}\frac{|\Lambda|^{1/2}}{(\pi\nu)^{D/2}}\left[1+\frac{\Delta^2}{\nu}\right]^{-D/2-\nu/2}$$
as required.
The correct normalization of the multivariate Student's t-distribution follows directly from the fact that the Gaussian and Gamma distributions are normalized. From (2.161) we have
$$\int\mathrm{St}(x|\mu,\Lambda,\nu)\,\mathrm{d}x = \int\!\!\int\mathcal{N}\!\left(x|\mu,(\eta\Lambda)^{-1}\right)\mathrm{Gam}(\eta|\nu/2,\nu/2)\,\mathrm{d}\eta\,\mathrm{d}x
= \int\left[\int\mathcal{N}\!\left(x|\mu,(\eta\Lambda)^{-1}\right)\mathrm{d}x\right]\mathrm{Gam}(\eta|\nu/2,\nu/2)\,\mathrm{d}\eta
= \int\mathrm{Gam}(\eta|\nu/2,\nu/2)\,\mathrm{d}\eta = 1.$$
2.49 If we make the change of variable $z = x - \mu$, we can write
$$\mathbb{E}[x] = \int\mathrm{St}(x|\mu,\Lambda,\nu)\,x\,\mathrm{d}x = \int\mathrm{St}(z|0,\Lambda,\nu)(z+\mu)\,\mathrm{d}z.$$
In the factor $(z+\mu)$ the first term vanishes as a consequence of the fact that the zero-mean Student distribution is an even function of $z$, that is $\mathrm{St}(z|0,\Lambda,\nu) = \mathrm{St}(-z|0,\Lambda,\nu)$. This leaves the second term, which equals $\mu$ since the Student distribution is normalized.
The covariance of the multivariate Student can be re-expressed by using the expression for the multivariate Student distribution as a convolution of a Gaussian with a Gamma distribution given by (2.161), which gives
$$\mathrm{cov}[x] = \int\mathrm{St}(x|\mu,\Lambda,\nu)(x-\mu)(x-\mu)^{\mathrm T}\,\mathrm{d}x
= \int_0^\infty\left[\int\mathcal{N}\!\left(x|\mu,(\eta\Lambda)^{-1}\right)(x-\mu)(x-\mu)^{\mathrm T}\,\mathrm{d}x\right]\mathrm{Gam}(\eta|\nu/2,\nu/2)\,\mathrm{d}\eta
= \int_0^\infty \eta^{-1}\Lambda^{-1}\,\mathrm{Gam}(\eta|\nu/2,\nu/2)\,\mathrm{d}\eta$$
where we have used the standard result for the covariance of a multivariate Gaussian. We now substitute for the Gamma distribution using (2.146) to give
$$\mathrm{cov}[x] = \frac{1}{\Gamma(\nu/2)}\left(\frac{\nu}{2}\right)^{\nu/2}\int_0^\infty e^{-\nu\eta/2}\eta^{\nu/2-2}\,\mathrm{d}\eta\;\Lambda^{-1}
= \frac{\nu}{2}\,\frac{\Gamma(\nu/2-1)}{\Gamma(\nu/2)}\,\Lambda^{-1} = \frac{\nu}{\nu-2}\Lambda^{-1}$$
where we have used the integral representation for the Gamma function, together with the standard result $\Gamma(1+x) = x\Gamma(x)$.
The mode of the Student distribution is obtained by differentiation
$$\nabla_x\mathrm{St}(x|\mu,\Lambda,\nu) = -\frac{\Gamma(\nu/2+D/2)}{\Gamma(\nu/2)}\frac{|\Lambda|^{1/2}}{(\pi\nu)^{D/2}}\frac{\nu+D}{\nu}\left[1+\frac{\Delta^2}{\nu}\right]^{-D/2-\nu/2-1}\Lambda(x-\mu).$$
Provided $\Lambda$ is nonsingular we therefore obtain
$$\mathrm{mode}[x] = \mu.$$
2.50 Just like in the univariate case (Exercise 2.47), we ignore the normalization coefficient, which leaves us with
$$\left[1+\frac{\Delta^2}{\nu}\right]^{-\nu/2-D/2} = \exp\left(-\left(\frac{\nu}{2}+\frac{D}{2}\right)\ln\left[1+\frac{\Delta^2}{\nu}\right]\right)$$
where $\Delta^2$ is the squared Mahalanobis distance given by
$$\Delta^2 = (x-\mu)^{\mathrm T}\Lambda(x-\mu).$$
Again we make use of (117) to give
$$\exp\left(-\left(\frac{\nu}{2}+\frac{D}{2}\right)\ln\left[1+\frac{\Delta^2}{\nu}\right]\right) = \exp\left(-\frac{\Delta^2}{2}+O(1/\nu)\right).$$
As in the univariate case, in the limit $\nu \to \infty$ this becomes, up to an overall constant, the same as a Gaussian distribution, here with mean $\mu$ and precision $\Lambda$; the univariate normalization argument also applies in the multivariate case.
2.51 Using the relation (2.296) we have
$$1 = \exp(iA)\exp(-iA) = (\cos A + i\sin A)(\cos A - i\sin A) = \cos^2A + \sin^2A.$$
Similarly, we have
$$\cos(A-B) = \Re\,\exp\{i(A-B)\} = \Re\left[\exp(iA)\exp(-iB)\right]
= \Re\left[(\cos A + i\sin A)(\cos B - i\sin B)\right] = \cos A\cos B + \sin A\sin B.$$
Finally
$$\sin(A-B) = \Im\,\exp\{i(A-B)\} = \Im\left[\exp(iA)\exp(-iB)\right]
= \Im\left[(\cos A + i\sin A)(\cos B - i\sin B)\right] = \sin A\cos B - \cos A\sin B.$$
2.52 Expressed in terms of $\xi$ the von Mises distribution becomes
$$p(\xi) \propto \exp\left\{m\cos\!\left(m^{-1/2}\xi\right)\right\}.$$
For large $m$ we have $\cos\!\left(m^{-1/2}\xi\right) = 1 - m^{-1}\xi^2/2 + O(m^{-2})$ and so
$$p(\xi) \propto \exp\left\{-\xi^2/2\right\}$$
and hence $p(\theta) \propto \exp\{-m(\theta-\theta_0)^2/2\}$.
2.53 Using (2.183), we can write (2.182) as
$$\sum_{n=1}^N\left(\cos\theta_0\sin\theta_n - \cos\theta_n\sin\theta_0\right)
= \cos\theta_0\sum_{n=1}^N\sin\theta_n - \sin\theta_0\sum_{n=1}^N\cos\theta_n = 0.$$
Rearranging this, we get
$$\frac{\sum_n\sin\theta_n}{\sum_n\cos\theta_n} = \frac{\sin\theta_0}{\cos\theta_0} = \tan\theta_0,$$
which we can solve w.r.t. $\theta_0$ to obtain (2.184).
2.54 Differentiating the von Mises distribution (2.179) we have
$$p'(\theta) = -\frac{m}{2\pi I_0(m)}\exp\left\{m\cos(\theta-\theta_0)\right\}\sin(\theta-\theta_0)$$
which vanishes when $\theta = \theta_0$ or when $\theta = \theta_0 + \pi\;(\mathrm{mod}\;2\pi)$. Differentiating again we have
$$p''(\theta) = -\frac{m}{2\pi I_0(m)}\exp\left\{m\cos(\theta-\theta_0)\right\}\left[\cos(\theta-\theta_0) - m\sin^2(\theta-\theta_0)\right].$$
Since $I_0(m) > 0$ we see that $p''(\theta) < 0$ when $\theta = \theta_0$, which therefore represents a maximum of the density, while $p''(\theta) > 0$ when $\theta = \theta_0 + \pi\;(\mathrm{mod}\;2\pi)$, which is therefore a minimum.
2.55 NOTE: In PRML, equation (2.187), which will be the starting point for this solution, contains a typo. The "$-$" on the r.h.s. should be a "$+$", as is easily seen from (2.178) and (2.185).
From (2.169) and (2.184), we see that $\bar{\theta} = \theta_0^{\mathrm{ML}}$. Using this together with (2.168) and (2.177), we can rewrite (2.187) as follows:
$$A(m_{\mathrm{ML}}) = \frac{1}{N}\sum_{n=1}^N\cos\theta_n\cos\theta_0^{\mathrm{ML}} + \frac{1}{N}\sum_{n=1}^N\sin\theta_n\sin\theta_0^{\mathrm{ML}}
= \bar{r}\cos\bar{\theta}\cos\theta_0^{\mathrm{ML}} + \bar{r}\sin\bar{\theta}\sin\theta_0^{\mathrm{ML}}
= \bar{r}\left(\cos^2\theta_0^{\mathrm{ML}} + \sin^2\theta_0^{\mathrm{ML}}\right) = \bar{r}.$$
2.56 We can most conveniently cast distributions into standard exponential family form by taking the exponential of the logarithm of the distribution. For the Beta distribution (2.13) we have
$$\mathrm{Beta}(\mu|a,b) = \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}\exp\left\{(a-1)\ln\mu + (b-1)\ln(1-\mu)\right\}$$
which we can identify as being in standard exponential form (2.194) with
$$h(\mu) = 1 \quad (118)\qquad
g(a,b) = \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)} \quad (119)\qquad
u(\mu) = \begin{pmatrix}\ln\mu\\ \ln(1-\mu)\end{pmatrix} \quad (120)\qquad
\eta(a,b) = \begin{pmatrix}a-1\\ b-1\end{pmatrix}. \quad (121)$$
Applying the same approach to the gamma distribution (2.146) we obtain
$$\mathrm{Gam}(\lambda|a,b) = \frac{b^a}{\Gamma(a)}\exp\left\{(a-1)\ln\lambda - b\lambda\right\},$$
from which it follows that
$$h(\lambda) = 1 \quad (122)\qquad
g(a,b) = \frac{b^a}{\Gamma(a)} \quad (123)\qquad
u(\lambda) = \begin{pmatrix}\lambda\\ \ln\lambda\end{pmatrix} \quad (124)\qquad
\eta(a,b) = \begin{pmatrix}-b\\ a-1\end{pmatrix}. \quad (125)$$
Finally, for the von Mises distribution (2.179) we make use of the identity (2.178) to give
$$p(\theta|\theta_0,m) = \frac{1}{2\pi I_0(m)}\exp\left\{m\cos\theta\cos\theta_0 + m\sin\theta\sin\theta_0\right\}$$
from which we find
$$h(\theta) = 1 \quad (126)\qquad
g(\theta_0,m) = \frac{1}{2\pi I_0(m)} \quad (127)\qquad
u(\theta) = \begin{pmatrix}\cos\theta\\ \sin\theta\end{pmatrix} \quad (128)\qquad
\eta(\theta_0,m) = \begin{pmatrix}m\cos\theta_0\\ m\sin\theta_0\end{pmatrix}. \quad (129)$$
2.57 Starting from (2.43), we can rewrite the argument of the exponential as
$$-\frac{1}{2}\mathrm{Tr}\left[\Sigma^{-1}xx^{\mathrm T}\right] + \mu^{\mathrm T}\Sigma^{-1}x - \frac{1}{2}\mu^{\mathrm T}\Sigma^{-1}\mu.$$
The last term is independent of $x$ but depends on $\mu$ and $\Sigma$ and so should go into $g(\eta)$. The second term is already an inner product and can be kept as is. To deal with the first term, we define the $D^2$-dimensional vectors $z$ and $\lambda$, which consist of the columns of $xx^{\mathrm T}$ and $\Sigma^{-1}$, respectively, stacked on top of each other. Now we can write the multivariate Gaussian distribution in the form (2.194), with
$$\eta = \begin{pmatrix}\Sigma^{-1}\mu\\ -\tfrac{1}{2}\lambda\end{pmatrix},\qquad
u(x) = \begin{pmatrix}x\\ z\end{pmatrix},\qquad
h(x) = (2\pi)^{-D/2},\qquad
g(\eta) = |\Sigma|^{-1/2}\exp\left(-\frac{1}{2}\mu^{\mathrm T}\Sigma^{-1}\mu\right).$$
2.58 Taking the first derivative of (2.226) we obtain, as in the text,
$$-\nabla\ln g(\eta) = g(\eta)\int h(x)\exp\left\{\eta^{\mathrm T}u(x)\right\}u(x)\,\mathrm{d}x.$$
Taking the gradient again gives
$$-\nabla\nabla\ln g(\eta) = g(\eta)\int h(x)\exp\left\{\eta^{\mathrm T}u(x)\right\}u(x)u(x)^{\mathrm T}\,\mathrm{d}x + \nabla g(\eta)\int h(x)\exp\left\{\eta^{\mathrm T}u(x)\right\}u(x)\,\mathrm{d}x$$
$$= \mathbb{E}\left[u(x)u(x)^{\mathrm T}\right] - \mathbb{E}[u(x)]\,\mathbb{E}\left[u(x)^{\mathrm T}\right] = \mathrm{cov}[u(x)]$$
where we have used the result (2.226).
2.59
$$\int\frac{1}{\sigma}f\!\left(\frac{x}{\sigma}\right)\mathrm{d}x = \frac{1}{\sigma}\int f(y)\frac{\mathrm{d}x}{\mathrm{d}y}\,\mathrm{d}y = \frac{1}{\sigma}\int f(y)\,\sigma\,\mathrm{d}y = \frac{\sigma}{\sigma}\int f(y)\,\mathrm{d}y = 1,$$
since $f(x)$ integrates to 1.
2.60 The value of the density $p(x)$ at a point $x_n$ is given by $h_{j(n)}$, where the notation $j(n)$ denotes that data point $x_n$ falls within region $j$. Thus the log likelihood function takes the form
$$\sum_{n=1}^N\ln p(x_n) = \sum_{n=1}^N\ln h_{j(n)}.$$
We now need to take account of the constraint that $p(x)$ must integrate to unity. Since $p(x)$ has the constant value $h_i$ over region $i$, which has volume $\Delta_i$, the normalization constraint becomes $\sum_i h_i\Delta_i = 1$. Introducing a Lagrange multiplier $\lambda$ we then maximize the function
$$\sum_{n=1}^N\ln h_{j(n)} + \lambda\left(\sum_i h_i\Delta_i - 1\right)$$
with respect to $h_k$ to give
$$0 = \frac{n_k}{h_k} + \lambda\Delta_k$$
where $n_k$ denotes the total number of data points falling within region $k$. Multiplying both sides by $h_k$, summing over $k$ and making use of the normalization constraint, we obtain $\lambda = -N$. Eliminating $\lambda$ then gives our final result for the maximum likelihood solution for $h_k$ in the form
$$h_k = \frac{n_k}{N}\,\frac{1}{\Delta_k}.$$
Note that, for equal sized bins $\Delta_k = \Delta$ we obtain a bin height $h_k$ which is proportional to the fraction of points falling within that bin, as expected.
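The result $h_k = n_k/(N\Delta_k)$ says that the ML histogram density is just the bin frequency divided by the bin width. A minimal sketch (not part of the original solution), assuming NumPy, confirming that the resulting density integrates to one over the binned range:

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.normal(0.0, 1.0, size=10_000)

edges = np.linspace(-4, 4, 21)            # 20 bins of equal width
counts, _ = np.histogram(x, bins=edges)
widths = np.diff(edges)
h = counts / (len(x) * widths)            # ML solution h_k = n_k / (N * Delta_k)

# equals the fraction of points falling inside the bins (1 if none lie outside)
print(np.sum(h * widths))
```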
2.61 From (2.246) we have
$$p(x) = \frac{K}{NV(\rho)}$$
where $V(\rho)$ is the volume of a $D$-dimensional hypersphere with radius $\rho$, where in turn $\rho$ is the distance from $x$ to its $K$-th nearest neighbour in the data set. Thus, in polar coordinates, if we consider sufficiently large values for the radial coordinate $r$, we have
$$p(x) \propto r^{-D}.$$
If we consider the integral of $p(x)$ and note that the volume element $\mathrm{d}x$ can be written as $r^{D-1}\,\mathrm{d}r$, we get
$$\int p(x)\,\mathrm{d}x \propto \int r^{-D}r^{D-1}\,\mathrm{d}r = \int r^{-1}\,\mathrm{d}r$$
which diverges logarithmically.
Chapter 3 Linear Models for Regression
3.1 NOTE: In PRML, there is a 2 missing in the denominator of the argument to the 'tanh' function in equation (3.102).
Using (3.6), we have
$$2\sigma(2a)-1 = \frac{2}{1+e^{-2a}}-1 = \frac{2}{1+e^{-2a}} - \frac{1+e^{-2a}}{1+e^{-2a}}
= \frac{1-e^{-2a}}{1+e^{-2a}} = \frac{e^a-e^{-a}}{e^a+e^{-a}} = \tanh(a).$$
If we now take $a_j = (x-\mu_j)/2s$, we can rewrite (3.101) as
$$y(x,w) = w_0 + \sum_{j=1}^M w_j\sigma(2a_j)
= w_0 + \sum_{j=1}^M\frac{w_j}{2}\left(2\sigma(2a_j)-1+1\right)
= u_0 + \sum_{j=1}^M u_j\tanh(a_j),$$
where $u_j = w_j/2$, for $j = 1,\ldots,M$, and $u_0 = w_0 + \sum_{j=1}^M w_j/2$.
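Both the identity $2\sigma(2a)-1 = \tanh(a)$ and the resulting re-parametrization can be checked numerically. A minimal sketch (not part of the original solution), assuming NumPy; the basis parameters are arbitrary:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

a = np.linspace(-5, 5, 101)
print(np.allclose(2 * sigmoid(2 * a) - 1, np.tanh(a)))     # the identity used above

rng = np.random.default_rng(6)
M, s = 4, 0.7
mu, w, w0 = rng.normal(size=M), rng.normal(size=M), 0.3
x = np.linspace(-3, 3, 50)
aj = (x[:, None] - mu) / (2 * s)                            # a_j = (x - mu_j) / (2s)
y_sigma = w0 + (sigmoid(2 * aj) * w).sum(axis=1)            # sigmoidal form
u0, u = w0 + w.sum() / 2, w / 2
y_tanh = u0 + (np.tanh(aj) * u).sum(axis=1)                 # tanh form
print(np.allclose(y_sigma, y_tanh))                         # identical outputs
```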
3.2 We first write
$$\Phi(\Phi^{\mathrm T}\Phi)^{-1}\Phi^{\mathrm T}v = \Phi\tilde{v}
= \varphi_1\tilde{v}^{(1)} + \varphi_2\tilde{v}^{(2)} + \ldots + \varphi_M\tilde{v}^{(M)}$$
where $\varphi_m$ is the $m$-th column of $\Phi$ and $\tilde{v} = (\Phi^{\mathrm T}\Phi)^{-1}\Phi^{\mathrm T}v$. By comparing this with the least squares solution in (3.15), we see that
$$y = \Phi w_{\mathrm{ML}} = \Phi(\Phi^{\mathrm T}\Phi)^{-1}\Phi^{\mathrm T}t$$
corresponds to a projection of $t$ onto the space spanned by the columns of $\Phi$. To see that this is indeed an orthogonal projection, we first note that, for any column $\varphi_j$ of $\Phi$,
$$\Phi(\Phi^{\mathrm T}\Phi)^{-1}\Phi^{\mathrm T}\varphi_j = \left[\Phi(\Phi^{\mathrm T}\Phi)^{-1}\Phi^{\mathrm T}\Phi\right]_j = \varphi_j$$
and therefore
$$(y-t)^{\mathrm T}\varphi_j = (\Phi w_{\mathrm{ML}}-t)^{\mathrm T}\varphi_j
= t^{\mathrm T}\left(\Phi(\Phi^{\mathrm T}\Phi)^{-1}\Phi^{\mathrm T} - I\right)^{\mathrm T}\varphi_j = 0,$$
and thus $(y-t)$ is orthogonal to every column of $\Phi$ and hence is orthogonal to the subspace $\mathcal{S}$.
3.3 If we define $R = \mathrm{diag}(r_1,\ldots,r_N)$ to be a diagonal matrix containing the weighting coefficients, then we can write the weighted sum-of-squares cost function in the form
$$E_D(w) = \frac{1}{2}(t-\Phi w)^{\mathrm T}R(t-\Phi w).$$
Setting the derivative with respect to $w$ to zero, and rearranging, then gives
$$w^\star = \left(\Phi^{\mathrm T}R\Phi\right)^{-1}\Phi^{\mathrm T}Rt$$
which reduces to the standard solution (3.15) for the case $R = I$.
If we compare (3.104) with (3.10)-(3.12), we see that $r_n$ can be regarded as a precision (inverse variance) parameter, particular to the data point $(x_n, t_n)$, that either replaces or scales $\beta$.
Alternatively, $r_n$ can be regarded as an effective number of replicated observations of data point $(x_n, t_n)$; this becomes particularly clear if we consider (3.104) with $r_n$ taking positive integer values, although it is valid for any $r_n > 0$.
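Both the closed-form solution $w^\star = (\Phi^{\mathrm T}R\Phi)^{-1}\Phi^{\mathrm T}Rt$ and the 'replicated observations' reading of integer $r_n$ can be checked directly. A minimal sketch (not part of the original solution), assuming NumPy; the data and weights are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(7)
N, M = 30, 3
Phi = rng.normal(size=(N, M))
t = rng.normal(size=N)
r = rng.integers(1, 4, size=N).astype(float)     # positive integer weights
R = np.diag(r)

w_weighted = np.linalg.solve(Phi.T @ R @ Phi, Phi.T @ R @ t)

# replicate each row r_n times and solve ordinary least squares
Phi_rep = np.repeat(Phi, r.astype(int), axis=0)
t_rep = np.repeat(t, r.astype(int))
w_replicated, *_ = np.linalg.lstsq(Phi_rep, t_rep, rcond=None)

print(np.allclose(w_weighted, w_replicated))
```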
3.4 Let
$$\tilde{y}_n = w_0 + \sum_{i=1}^D w_i(x_{ni}+\epsilon_{ni}) = y_n + \sum_{i=1}^D w_i\epsilon_{ni}$$
where $y_n = y(x_n,w)$ and $\epsilon_{ni} \sim \mathcal{N}(0,\sigma^2)$ and we have used (3.105). From (3.106) we then define
$$\tilde{E} = \frac{1}{2}\sum_{n=1}^N\left\{\tilde{y}_n - t_n\right\}^2 = \frac{1}{2}\sum_{n=1}^N\left\{\tilde{y}_n^2 - 2\tilde{y}_nt_n + t_n^2\right\}$$
$$= \frac{1}{2}\sum_{n=1}^N\left\{y_n^2 + 2y_n\sum_{i=1}^D w_i\epsilon_{ni} + \left(\sum_{i=1}^D w_i\epsilon_{ni}\right)^2 - 2t_ny_n - 2t_n\sum_{i=1}^D w_i\epsilon_{ni} + t_n^2\right\}.$$
If we take the expectation of $\tilde{E}$ under the distribution of $\epsilon_{ni}$, we see that the second and fifth terms disappear, since $\mathbb{E}[\epsilon_{ni}] = 0$, while for the third term we get
$$\mathbb{E}\left[\left(\sum_{i=1}^D w_i\epsilon_{ni}\right)^2\right] = \sum_{i=1}^D w_i^2\sigma^2$$
since the $\epsilon_{ni}$ are all independent with variance $\sigma^2$.
From this and (3.106) we see that
$$\mathbb{E}\left[\tilde{E}\right] = E_D + \frac{N}{2}\sigma^2\sum_{i=1}^D w_i^2,$$
as required; the expectation of the noise term contributes $\sigma^2\sum_i w_i^2$ for each of the $N$ data points.
3.5 We can rewrite (3.30) as
$$\frac{1}{2}\left(\sum_{j=1}^M|w_j|^q - \eta\right) \leqslant 0$$
where we have incorporated the $1/2$ scaling factor for convenience. Clearly this does not affect the constraint.
Employing the technique described in Appendix E, we can combine this with (3.12) to obtain the Lagrangian function
$$L(w,\lambda) = \frac{1}{2}\sum_{n=1}^N\left\{t_n - w^{\mathrm T}\phi(x_n)\right\}^2 + \frac{\lambda}{2}\left(\sum_{j=1}^M|w_j|^q - \eta\right)$$
and by comparing this with (3.29) we see immediately that they are identical in their dependence on $w$.
Now suppose we choose a specific value of $\lambda > 0$ and minimize (3.29). Denoting the resulting value of $w$ by $w^\star(\lambda)$, and using the KKT condition (E.11), we see that the value of $\eta$ is given by
$$\eta = \sum_{j=1}^M|w_j^\star(\lambda)|^q.$$
3.6 We first write down the log likelihood function, which is given by
$$\ln L(W,\Sigma) = -\frac{N}{2}\ln|\Sigma| - \frac{1}{2}\sum_{n=1}^N\left(t_n - W^{\mathrm T}\phi(x_n)\right)^{\mathrm T}\Sigma^{-1}\left(t_n - W^{\mathrm T}\phi(x_n)\right).$$
First of all we set the derivative with respect to $W$ equal to zero, giving
$$0 = -\sum_{n=1}^N\Sigma^{-1}\left(t_n - W^{\mathrm T}\phi(x_n)\right)\phi(x_n)^{\mathrm T}.$$
Multiplying through by $\Sigma$ and introducing the design matrix $\Phi$ and the target data matrix $T$ we have
$$\Phi^{\mathrm T}\Phi W = \Phi^{\mathrm T}T.$$
Solving for $W$ then gives (3.15) as required.
The maximum likelihood solution for $\Sigma$ is easily found by appealing to the standard result from Chapter 2, giving
$$\Sigma = \frac{1}{N}\sum_{n=1}^N\left(t_n - W_{\mathrm{ML}}^{\mathrm T}\phi(x_n)\right)\left(t_n - W_{\mathrm{ML}}^{\mathrm T}\phi(x_n)\right)^{\mathrm T},$$
as required. Since we are finding a joint maximum with respect to both $W$ and $\Sigma$, we see that it is $W_{\mathrm{ML}}$ which appears in this expression, as in the standard result for an unconditional Gaussian distribution.
3.7 From Bayes' theorem we have
$$p(w|t) \propto p(t|w)p(w),$$
where the factors on the r.h.s. are given by (3.10) and (3.48), respectively. Writing this out in full, we get
$$p(w|t) \propto \left[\prod_{n=1}^N\mathcal{N}\!\left(t_n|w^{\mathrm T}\phi(x_n),\beta^{-1}\right)\right]\mathcal{N}(w|m_0,S_0)$$
$$\propto \exp\left(-\frac{\beta}{2}(t-\Phi w)^{\mathrm T}(t-\Phi w)\right)\exp\left(-\frac{1}{2}(w-m_0)^{\mathrm T}S_0^{-1}(w-m_0)\right)$$
$$= \exp\left(-\frac{1}{2}\left[w^{\mathrm T}\left(S_0^{-1}+\beta\Phi^{\mathrm T}\Phi\right)w - \beta t^{\mathrm T}\Phi w - \beta w^{\mathrm T}\Phi^{\mathrm T}t + \beta t^{\mathrm T}t - m_0^{\mathrm T}S_0^{-1}w - w^{\mathrm T}S_0^{-1}m_0 + m_0^{\mathrm T}S_0^{-1}m_0\right]\right)$$
$$= \exp\left(-\frac{1}{2}\left[w^{\mathrm T}\left(S_0^{-1}+\beta\Phi^{\mathrm T}\Phi\right)w - \left(S_0^{-1}m_0+\beta\Phi^{\mathrm T}t\right)^{\mathrm T}w - w^{\mathrm T}\left(S_0^{-1}m_0+\beta\Phi^{\mathrm T}t\right) + \beta t^{\mathrm T}t + m_0^{\mathrm T}S_0^{-1}m_0\right]\right)$$
$$= \exp\left(-\frac{1}{2}(w-m_N)^{\mathrm T}S_N^{-1}(w-m_N)\right)\exp\left(-\frac{1}{2}\left[\beta t^{\mathrm T}t + m_0^{\mathrm T}S_0^{-1}m_0 - m_N^{\mathrm T}S_N^{-1}m_N\right]\right)$$
where we have used (3.50) and (3.51) when completing the square in the last step. The first exponential corresponds to the posterior, unnormalized Gaussian distribution over $w$, while the second exponential is independent of $w$ and hence can be absorbed into the normalization factor.
3.8 Combining the prior
$$p(w) = \mathcal{N}(w|m_N,S_N)$$
and the likelihood
$$p(t_{N+1}|x_{N+1},w) = \left(\frac{\beta}{2\pi}\right)^{1/2}\exp\left(-\frac{\beta}{2}\left(t_{N+1}-w^{\mathrm T}\phi_{N+1}\right)^2\right)$$ (130)
where $\phi_{N+1} = \phi(x_{N+1})$, we obtain a posterior of the form
$$p(w|t_{N+1},x_{N+1},m_N,S_N) \propto \exp\left(-\frac{1}{2}(w-m_N)^{\mathrm T}S_N^{-1}(w-m_N) - \frac{\beta}{2}\left(t_{N+1}-w^{\mathrm T}\phi_{N+1}\right)^2\right).$$
We can expand the argument of the exponential, omitting the $-1/2$ factors, as follows
$$(w-m_N)^{\mathrm T}S_N^{-1}(w-m_N) + \beta\left(t_{N+1}-w^{\mathrm T}\phi_{N+1}\right)^2$$
$$= w^{\mathrm T}S_N^{-1}w - 2w^{\mathrm T}S_N^{-1}m_N + \beta w^{\mathrm T}\phi_{N+1}\phi_{N+1}^{\mathrm T}w - 2\beta w^{\mathrm T}\phi_{N+1}t_{N+1} + \text{const}$$
$$= w^{\mathrm T}\left(S_N^{-1}+\beta\phi_{N+1}\phi_{N+1}^{\mathrm T}\right)w - 2w^{\mathrm T}\left(S_N^{-1}m_N+\beta\phi_{N+1}t_{N+1}\right) + \text{const},$$
where const denotes remaining terms independent of $w$. From this we can read off the desired result directly,
$$p(w|t_{N+1},x_{N+1},m_N,S_N) = \mathcal{N}(w|m_{N+1},S_{N+1}),$$
with
$$S_{N+1}^{-1} = S_N^{-1} + \beta\phi_{N+1}\phi_{N+1}^{\mathrm T}$$ (131)
and
$$m_{N+1} = S_{N+1}\left(S_N^{-1}m_N + \beta\phi_{N+1}t_{N+1}\right).$$ (132)
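The sequential update (131)/(132) should reproduce the batch posterior (3.50)/(3.51) computed on all $N+1$ points at once. A minimal sketch (not part of the original solution), assuming NumPy and a zero-mean isotropic prior as the starting point, which is an arbitrary choice for illustration:

```python
import numpy as np

rng = np.random.default_rng(8)
M, N, alpha, beta = 4, 25, 0.5, 10.0
Phi = rng.normal(size=(N + 1, M))
t = rng.normal(size=N + 1)

def batch_posterior(Phi, t):
    S_inv = alpha * np.eye(M) + beta * Phi.T @ Phi     # cf. (3.51)
    S = np.linalg.inv(S_inv)
    m = beta * S @ Phi.T @ t                           # cf. (3.50)
    return m, S

m_N, S_N = batch_posterior(Phi[:N], t[:N])

# one sequential step with the new observation, equations (131) and (132)
phi, t_new = Phi[N], t[N]
S_new = np.linalg.inv(np.linalg.inv(S_N) + beta * np.outer(phi, phi))
m_new = S_new @ (np.linalg.inv(S_N) @ m_N + beta * phi * t_new)

m_batch, S_batch = batch_posterior(Phi, t)
print(np.allclose(m_new, m_batch), np.allclose(S_new, S_batch))
```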
3.9 Identifying (2.113) with (3.49) and (2.114) with (130), such that
$$x \Rightarrow w,\quad \mu \Rightarrow m_N,\quad \Lambda^{-1} \Rightarrow S_N,\quad
y \Rightarrow t_{N+1},\quad A \Rightarrow \phi(x_{N+1})^{\mathrm T} = \phi_{N+1}^{\mathrm T},\quad b \Rightarrow 0,\quad L^{-1} \Rightarrow \beta^{-1},$$
(2.116) and (2.117) directly give
$$p(w|t_{N+1},x_{N+1}) = \mathcal{N}(w|m_{N+1},S_{N+1})$$
where $S_{N+1}$ and $m_{N+1}$ are given by (131) and (132), respectively.
3.10 Using (3.3), (3.8) and (3.49), we can rewrite (3.57) as
$$p(t|x,\mathbf{t},\alpha,\beta) = \int\mathcal{N}\!\left(t|\phi(x)^{\mathrm T}w,\beta^{-1}\right)\mathcal{N}(w|m_N,S_N)\,\mathrm{d}w.$$
By matching the first factor of the integrand with (2.114) and the second factor with (2.113), we obtain the desired result directly from (2.115).
3.11 From (3.59) we have
$$\sigma_{N+1}^2(x) = \frac{1}{\beta} + \phi(x)^{\mathrm T}S_{N+1}\phi(x)$$ (133)
where $S_{N+1}$ is given by (131). From (131) and (3.110) we get
$$S_{N+1} = \left(S_N^{-1}+\beta\phi_{N+1}\phi_{N+1}^{\mathrm T}\right)^{-1}
= S_N - \frac{S_N\phi_{N+1}\beta^{1/2}\,\beta^{1/2}\phi_{N+1}^{\mathrm T}S_N}{1+\beta\phi_{N+1}^{\mathrm T}S_N\phi_{N+1}}
= S_N - \frac{\beta S_N\phi_{N+1}\phi_{N+1}^{\mathrm T}S_N}{1+\beta\phi_{N+1}^{\mathrm T}S_N\phi_{N+1}}.$$
Using this and (3.59), we can rewrite (133) as
$$\sigma_{N+1}^2(x) = \frac{1}{\beta} + \phi(x)^{\mathrm T}\left[S_N - \frac{\beta S_N\phi_{N+1}\phi_{N+1}^{\mathrm T}S_N}{1+\beta\phi_{N+1}^{\mathrm T}S_N\phi_{N+1}}\right]\phi(x)
= \sigma_N^2(x) - \frac{\beta\phi(x)^{\mathrm T}S_N\phi_{N+1}\phi_{N+1}^{\mathrm T}S_N\phi(x)}{1+\beta\phi_{N+1}^{\mathrm T}S_N\phi_{N+1}}.$$ (134)
Since $S_N$ is positive definite, the numerator and denominator of the second term in (134) will be nonnegative and positive, respectively, and hence $\sigma_{N+1}^2(x) \leqslant \sigma_N^2(x)$.
3.12 It is easiest to work in log space. The log of the posterior distribution is given by
$$\ln p(w,\beta|t) = \ln p(w,\beta) + \sum_{n=1}^N\ln p\!\left(t_n|w^{\mathrm T}\phi(x_n),\beta^{-1}\right)$$
$$= \frac{M}{2}\ln\beta - \frac{1}{2}\ln|S_0| - \frac{\beta}{2}(w-m_0)^{\mathrm T}S_0^{-1}(w-m_0) - b_0\beta + (a_0-1)\ln\beta
+ \frac{N}{2}\ln\beta - \frac{\beta}{2}\sum_{n=1}^N\left\{w^{\mathrm T}\phi(x_n)-t_n\right\}^2 + \text{const}.$$
Using the product rule, the posterior distribution can be written as $p(w,\beta|t) = p(w|\beta,t)p(\beta|t)$. Consider first the dependence on $w$. We have
$$\ln p(w|\beta,t) = -\frac{\beta}{2}w^{\mathrm T}\left(\Phi^{\mathrm T}\Phi + S_0^{-1}\right)w + w^{\mathrm T}\beta\left(S_0^{-1}m_0 + \Phi^{\mathrm T}t\right) + \text{const}.$$
Thus we see that $p(w|\beta,t)$ is a Gaussian distribution with mean and covariance given by
$$m_N = S_N\left(S_0^{-1}m_0 + \Phi^{\mathrm T}t\right)$$ (135)
$$\beta S_N^{-1} = \beta\left(S_0^{-1} + \Phi^{\mathrm T}\Phi\right).$$ (136)
To find $p(\beta|t)$ we first need to complete the square over $w$ to ensure that we pick up all terms involving $\beta$ (any terms independent of $\beta$ may be discarded since these will be absorbed into the normalization coefficient, which itself will be found by inspection at the end). We also need to remember that a factor of $(M/2)\ln\beta$ will be absorbed by the normalisation factor of $p(w|\beta,t)$. Thus
$$\ln p(\beta|t) = -\frac{\beta}{2}m_0^{\mathrm T}S_0^{-1}m_0 + \frac{\beta}{2}m_N^{\mathrm T}S_N^{-1}m_N + \frac{N}{2}\ln\beta - b_0\beta + (a_0-1)\ln\beta - \frac{\beta}{2}\sum_{n=1}^N t_n^2 + \text{const}.$$
We recognize this as the log of a Gamma distribution. Reading off the coefficients of $\beta$ and $\ln\beta$ we then have
$$a_N = a_0 + \frac{N}{2}$$ (137)
$$b_N = b_0 + \frac{1}{2}\left(m_0^{\mathrm T}S_0^{-1}m_0 - m_N^{\mathrm T}S_N^{-1}m_N + \sum_{n=1}^N t_n^2\right).$$ (138)
3.13 Following the line of presentation from Section 3.3.2, the predictive distribution is now given by
$$p(t|x,\mathbf{t}) = \int\!\!\int\mathcal{N}\!\left(t|\phi(x)^{\mathrm T}w,\beta^{-1}\right)\mathcal{N}\!\left(w|m_N,\beta^{-1}S_N\right)\mathrm{d}w\;\mathrm{Gam}(\beta|a_N,b_N)\,\mathrm{d}\beta.$$ (139)
We begin by performing the integral over $w$. Identifying (2.113) with (3.49) and (2.114) with (3.8), using (3.3), such that
$$x \Rightarrow w,\quad \mu \Rightarrow m_N,\quad \Lambda^{-1} \Rightarrow \beta^{-1}S_N,\quad
y \Rightarrow t,\quad A \Rightarrow \phi(x)^{\mathrm T} = \phi^{\mathrm T},\quad b \Rightarrow 0,\quad L^{-1} \Rightarrow \beta^{-1},$$
(2.115) and (136) give
$$p(t|\beta) = \mathcal{N}\!\left(t\,\middle|\,\phi^{\mathrm T}m_N,\;\beta^{-1}+\phi^{\mathrm T}\beta^{-1}S_N\phi\right)
= \mathcal{N}\!\left(t\,\middle|\,\phi^{\mathrm T}m_N,\;\beta^{-1}\left[1+\phi^{\mathrm T}\left(S_0^{-1}+\Phi^{\mathrm T}\Phi\right)^{-1}\phi\right]\right).$$
Substituting this back into (139) we get
$$p(t|x,X,\mathbf{t}) = \int\mathcal{N}\!\left(t|\phi^{\mathrm T}m_N,\beta^{-1}s\right)\mathrm{Gam}(\beta|a_N,b_N)\,\mathrm{d}\beta,$$
where we have defined
$$s = 1 + \phi^{\mathrm T}\left(S_0^{-1}+\Phi^{\mathrm T}\Phi\right)^{-1}\phi.$$
We can now use (2.158)-(2.160) to obtain the final result:
$$p(t|x,X,\mathbf{t}) = \mathrm{St}(t|\mu,\lambda,\nu)$$
where
$$\mu = \phi^{\mathrm T}m_N,\qquad \lambda = \frac{a_N}{b_N}s^{-1},\qquad \nu = 2a_N.$$
3.14 For $\alpha = 0$ the covariance matrix $S_N$ becomes
$$S_N = \left(\beta\Phi^{\mathrm T}\Phi\right)^{-1}.$$ (140)
Let us define a new set of orthonormal basis functions given by linear combinations of the original basis functions so that
$$\psi(x) = V\phi(x)$$ (141)
where $V$ is an $M\times M$ matrix. Since both the original and the new basis functions are linearly independent and span the same space, this matrix must be invertible and hence
$$\phi(x) = V^{-1}\psi(x).$$
For the data set $\{x_n\}$, (141) and (3.16) give
$$\Psi = \Phi V^{\mathrm T}$$
and consequently
$$\Phi = \Psi V^{-\mathrm T}$$
where $V^{-\mathrm T}$ denotes $(V^{-1})^{\mathrm T}$. Orthonormality implies
$$\Psi^{\mathrm T}\Psi = I.$$
Note that $(V^{-1})^{\mathrm T} = (V^{\mathrm T})^{-1}$, as is easily verified. From (140), the covariance matrix then becomes
$$S_N = \beta^{-1}\left(\Phi^{\mathrm T}\Phi\right)^{-1} = \beta^{-1}\left(V^{-1}\Psi^{\mathrm T}\Psi V^{-\mathrm T}\right)^{-1} = \beta^{-1}V^{\mathrm T}V.$$
Here we have used the orthonormality of the $\psi_i(x)$. Hence the equivalent kernel becomes
$$k(x,x') = \beta\phi(x)^{\mathrm T}S_N\phi(x') = \phi(x)^{\mathrm T}V^{\mathrm T}V\phi(x') = \psi(x)^{\mathrm T}\psi(x')$$
as required. From the orthonormality condition, and setting $j = 1$, it follows that
$$\sum_{n=1}^N\psi_i(x_n)\psi_1(x_n) = \sum_{n=1}^N\psi_i(x_n) = \delta_{i1}$$
where we have used $\psi_1(x) = 1$. Now consider the sum
$$\sum_{n=1}^N k(x,x_n) = \sum_{n=1}^N\psi(x)^{\mathrm T}\psi(x_n) = \sum_{n=1}^N\sum_{i=1}^M\psi_i(x)\psi_i(x_n)
= \sum_{i=1}^M\psi_i(x)\delta_{i1} = \psi_1(x) = 1$$
which proves the summation constraint as required.
3.15 This is easily shown by substituting the re-estimation formulae (3.92) and (3.95) into (3.82), giving
$$E(m_N) = \frac{\beta}{2}\left\|t-\Phi m_N\right\|^2 + \frac{\alpha}{2}m_N^{\mathrm T}m_N = \frac{N-\gamma}{2} + \frac{\gamma}{2} = \frac{N}{2}.$$
3.16 The likelihood function is a product of independent univariate Gaussians and so can be written as a joint Gaussian distribution over $t$ with diagonal covariance matrix in the form
$$p(t|w,\beta) = \mathcal{N}(t|\Phi w,\beta^{-1}I_N).$$ (142)
Identifying (2.113) with the prior distribution $p(w) = \mathcal{N}(w|0,\alpha^{-1}I)$ and (2.114) with (142), such that
$$x \Rightarrow w,\quad \mu \Rightarrow 0,\quad \Lambda^{-1} \Rightarrow \alpha^{-1}I_M,\quad
y \Rightarrow t,\quad A \Rightarrow \Phi,\quad b \Rightarrow 0,\quad L^{-1} \Rightarrow \beta^{-1}I_N,$$
(2.115) gives
$$p(t|\alpha,\beta) = \mathcal{N}\!\left(t|0,\beta^{-1}I_N + \alpha^{-1}\Phi\Phi^{\mathrm T}\right).$$
Taking the log we obtain
$$\ln p(t|\alpha,\beta) = -\frac{N}{2}\ln(2\pi) - \frac{1}{2}\ln\left|\beta^{-1}I_N+\alpha^{-1}\Phi\Phi^{\mathrm T}\right|
- \frac{1}{2}t^{\mathrm T}\left(\beta^{-1}I_N+\alpha^{-1}\Phi\Phi^{\mathrm T}\right)^{-1}t.$$ (143)
Using the result (C.14) for the determinant, we have
$$\left|\beta^{-1}I_N+\alpha^{-1}\Phi\Phi^{\mathrm T}\right| = \beta^{-N}\left|I_N+\beta\alpha^{-1}\Phi\Phi^{\mathrm T}\right|
= \beta^{-N}\left|I_M+\beta\alpha^{-1}\Phi^{\mathrm T}\Phi\right|
= \beta^{-N}\alpha^{-M}\left|\alpha I_M+\beta\Phi^{\mathrm T}\Phi\right| = \beta^{-N}\alpha^{-M}|A|$$
where we have used (3.81). Next consider the quadratic term in $t$ and make use of the identity (C.7) together with (3.81) and (3.84) to give
$$-\frac{1}{2}t^{\mathrm T}\left(\beta^{-1}I_N+\alpha^{-1}\Phi\Phi^{\mathrm T}\right)^{-1}t
= -\frac{1}{2}t^{\mathrm T}\left[\beta I_N - \beta\Phi\left(\alpha I_M+\beta\Phi^{\mathrm T}\Phi\right)^{-1}\Phi^{\mathrm T}\beta\right]t$$
$$= -\frac{\beta}{2}t^{\mathrm T}t + \frac{\beta^2}{2}t^{\mathrm T}\Phi A^{-1}\Phi^{\mathrm T}t
= -\frac{\beta}{2}t^{\mathrm T}t + \frac{1}{2}m_N^{\mathrm T}Am_N
= -\frac{\beta}{2}\left\|t-\Phi m_N\right\|^2 - \frac{\alpha}{2}m_N^{\mathrm T}m_N$$
where in the last step we have exploited results from Solution 3.18. Substituting for the determinant and the quadratic term in (143) we obtain (3.86).
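Equation (3.86) can be checked against a direct evaluation of the marginal likelihood $\mathcal{N}(t|0,\beta^{-1}I+\alpha^{-1}\Phi\Phi^{\mathrm T})$ obtained above. A minimal numerical sketch (not part of the original solution), assuming NumPy; the model sizes and hyperparameters are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(9)
N, M, alpha, beta = 20, 5, 2.0, 4.0
Phi = rng.normal(size=(N, M))
t = rng.normal(size=N)

# direct evaluation of ln N(t | 0, beta^{-1} I + alpha^{-1} Phi Phi^T)
C = np.eye(N) / beta + Phi @ Phi.T / alpha
_, logdetC = np.linalg.slogdet(C)
log_ev_direct = -0.5 * (N * np.log(2 * np.pi) + logdetC + t @ np.linalg.solve(C, t))

# evidence in the form of (3.86)
A = alpha * np.eye(M) + beta * Phi.T @ Phi            # cf. (3.81)
mN = beta * np.linalg.solve(A, Phi.T @ t)             # cf. (3.84)
E_mN = 0.5 * beta * np.sum((t - Phi @ mN) ** 2) + 0.5 * alpha * mN @ mN   # cf. (3.82)
_, logdetA = np.linalg.slogdet(A)
log_ev_386 = (0.5 * M * np.log(alpha) + 0.5 * N * np.log(beta)
              - E_mN - 0.5 * logdetA - 0.5 * N * np.log(2 * np.pi))

print(np.isclose(log_ev_direct, log_ev_386))
```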
3.17 Using (3.11), (3.12) and (3.52) together with the definition for the Gaussian, (2.43), we can rewrite (3.77) as follows:
$$p(t|\alpha,\beta) = \int p(t|w,\beta)p(w|\alpha)\,\mathrm{d}w
= \left(\frac{\beta}{2\pi}\right)^{N/2}\left(\frac{\alpha}{2\pi}\right)^{M/2}\int\exp\left(-\beta E_D(w)\right)\exp\left(-\frac{\alpha}{2}w^{\mathrm T}w\right)\mathrm{d}w
= \left(\frac{\beta}{2\pi}\right)^{N/2}\left(\frac{\alpha}{2\pi}\right)^{M/2}\int\exp\left(-E(w)\right)\mathrm{d}w,$$
where $E(w)$ is defined by (3.79).
3.18 We can rewrite (3.79)
$$\frac{\beta}{2}\|t-\Phi w\|^2 + \frac{\alpha}{2}w^{\mathrm T}w
= \frac{\beta}{2}\left(t^{\mathrm T}t - 2t^{\mathrm T}\Phi w + w^{\mathrm T}\Phi^{\mathrm T}\Phi w\right) + \frac{\alpha}{2}w^{\mathrm T}w
= \frac{1}{2}\left(\beta t^{\mathrm T}t - 2\beta t^{\mathrm T}\Phi w + w^{\mathrm T}Aw\right)$$
where, in the last line, we have used (3.81). We now use the tricks of adding $0 = m_N^{\mathrm T}Am_N - m_N^{\mathrm T}Am_N$ and using $I = A^{-1}A$, combined with (3.84), as follows:
$$\frac{1}{2}\left(\beta t^{\mathrm T}t - 2\beta t^{\mathrm T}\Phi w + w^{\mathrm T}Aw\right)
= \frac{1}{2}\left(\beta t^{\mathrm T}t - 2\beta t^{\mathrm T}\Phi A^{-1}Aw + w^{\mathrm T}Aw\right)$$
$$= \frac{1}{2}\left(\beta t^{\mathrm T}t - 2m_N^{\mathrm T}Aw + w^{\mathrm T}Aw + m_N^{\mathrm T}Am_N - m_N^{\mathrm T}Am_N\right)$$
$$= \frac{1}{2}\left(\beta t^{\mathrm T}t - m_N^{\mathrm T}Am_N\right) + \frac{1}{2}(w-m_N)^{\mathrm T}A(w-m_N).$$
Here the last term equals the last term of (3.80), and so it remains to show that the first term equals the r.h.s. of (3.82). To do this, we use the same tricks again:
$$\frac{1}{2}\left(\beta t^{\mathrm T}t - m_N^{\mathrm T}Am_N\right)
= \frac{1}{2}\left(\beta t^{\mathrm T}t - 2m_N^{\mathrm T}Am_N + m_N^{\mathrm T}Am_N\right)$$
$$= \frac{1}{2}\left(\beta t^{\mathrm T}t - 2m_N^{\mathrm T}AA^{-1}\Phi^{\mathrm T}t\beta + m_N^{\mathrm T}\left(\alpha I + \beta\Phi^{\mathrm T}\Phi\right)m_N\right)$$
$$= \frac{1}{2}\left(\beta t^{\mathrm T}t - 2\beta m_N^{\mathrm T}\Phi^{\mathrm T}t + \beta m_N^{\mathrm T}\Phi^{\mathrm T}\Phi m_N + \alpha m_N^{\mathrm T}m_N\right)$$
$$= \frac{1}{2}\left(\beta(t-\Phi m_N)^{\mathrm T}(t-\Phi m_N) + \alpha m_N^{\mathrm T}m_N\right)
= \frac{\beta}{2}\|t-\Phi m_N\|^2 + \frac{\alpha}{2}m_N^{\mathrm T}m_N$$
as required.
3.19 From (3.80) we see that the integrand of (3.85) is an unnormalized Gaussian and hence integrates to the inverse of the corresponding normalizing constant, which can be read off from the r.h.s. of (2.43) as
$$(2\pi)^{M/2}\left|A^{-1}\right|^{1/2}.$$
Using (3.78), (3.85) and the properties of the logarithm, we get
$$\ln p(t|\alpha,\beta) = \frac{M}{2}\left(\ln\alpha-\ln(2\pi)\right) + \frac{N}{2}\left(\ln\beta-\ln(2\pi)\right) + \ln\int\exp\{-E(w)\}\,\mathrm{d}w$$
$$= \frac{M}{2}\left(\ln\alpha-\ln(2\pi)\right) + \frac{N}{2}\left(\ln\beta-\ln(2\pi)\right) - E(m_N) - \frac{1}{2}\ln|A| + \frac{M}{2}\ln(2\pi)$$
which equals (3.86).
3.20 We only need to consider the terms of (3.86) that depend on $\alpha$, which are the first, third and fourth terms.
Following the sequence of steps in Section 3.5.2, we start with the last of these terms, $-\frac{1}{2}\ln|A|$. From (3.81), (3.87) and the fact that the eigenvectors $u_i$ are orthonormal (see also Appendix C), we find the eigenvalues of $A$ to be $\alpha + \lambda_i$. We can then use (C.47) and the properties of the logarithm to take us from the left to the right side of (3.88).
The derivatives of the first and third terms of (3.86) are more easily obtained using standard derivatives and (3.82), yielding
$$\frac{M}{2\alpha}\quad\text{and}\quad -\frac{1}{2}m_N^{\mathrm T}m_N,$$
respectively. We combine these results into (3.89), from which we get (3.92) via (3.90). The expression for $\gamma$ in (3.91) is obtained from (3.90) by substituting
$$\sum_{i}^{M}\frac{\lambda_i+\alpha}{\lambda_i+\alpha}$$
for $M$ and rearranging.
3.21 The eigenvector equation for the $M\times M$ real, symmetric matrix $A$ can be written as
$$Au_i = \eta_iu_i$$
where $\{u_i\}$ are a set of $M$ orthonormal vectors, and the $M$ eigenvalues $\{\eta_i\}$ are all real. We first express the left hand side of (3.117) in terms of the eigenvalues of $A$. The log of the determinant of $A$ can be written as
$$\ln|A| = \ln\prod_{i=1}^M\eta_i = \sum_{i=1}^M\ln\eta_i.$$
Taking the derivative with respect to some scalar $\alpha$ we obtain
$$\frac{\mathrm{d}}{\mathrm{d}\alpha}\ln|A| = \sum_{i=1}^M\frac{1}{\eta_i}\frac{\mathrm{d}\eta_i}{\mathrm{d}\alpha}.$$
We now express the right hand side of (3.117) in terms of the eigenvector expansion and show that it takes the same form. First we note that $A$ can be expanded in terms of its own eigenvectors to give
$$A = \sum_{i=1}^M\eta_iu_iu_i^{\mathrm T}$$
and similarly the inverse can be written as
$$A^{-1} = \sum_{i=1}^M\frac{1}{\eta_i}u_iu_i^{\mathrm T}.$$
Thus we have
$$\mathrm{Tr}\left(A^{-1}\frac{\mathrm{d}A}{\mathrm{d}\alpha}\right)
= \mathrm{Tr}\left(\sum_{i=1}^M\frac{1}{\eta_i}u_iu_i^{\mathrm T}\frac{\mathrm{d}}{\mathrm{d}\alpha}\left[\sum_{j=1}^M\eta_ju_ju_j^{\mathrm T}\right]\right)
= \mathrm{Tr}\left(\sum_{i=1}^M\frac{1}{\eta_i}u_iu_i^{\mathrm T}\sum_{j=1}^M\left[\frac{\mathrm{d}\eta_j}{\mathrm{d}\alpha}u_ju_j^{\mathrm T} + \eta_jb_ju_j^{\mathrm T} + \eta_ju_jb_j^{\mathrm T}\right]\right)$$
$$= \mathrm{Tr}\left(\sum_{i=1}^M\frac{1}{\eta_i}u_iu_i^{\mathrm T}\sum_{j=1}^M\frac{\mathrm{d}\eta_j}{\mathrm{d}\alpha}u_ju_j^{\mathrm T}\right)
+ \mathrm{Tr}\left(\sum_{i=1}^M\frac{1}{\eta_i}u_iu_i^{\mathrm T}\sum_{j=1}^M\eta_j\left[b_ju_j^{\mathrm T}+u_jb_j^{\mathrm T}\right]\right)$$ (144)
where $b_j = \mathrm{d}u_j/\mathrm{d}\alpha$. Using the properties of the trace and the orthogonality of eigenvectors, we can rewrite the second term as
$$\mathrm{Tr}\left(\sum_{i=1}^M\frac{1}{\eta_i}u_iu_i^{\mathrm T}\sum_{j=1}^M\eta_j\left[b_ju_j^{\mathrm T}+u_jb_j^{\mathrm T}\right]\right)
= \mathrm{Tr}\left(\sum_{i=1}^M\frac{1}{\eta_i}u_iu_i^{\mathrm T}\sum_{j=1}^M2\eta_ju_jb_j^{\mathrm T}\right)
= \mathrm{Tr}\left(\sum_{i=1}^M\sum_{j=1}^M\frac{2\eta_j}{\eta_i}u_iu_i^{\mathrm T}u_jb_j^{\mathrm T}\right)$$
$$= \mathrm{Tr}\left(\sum_{j=1}^M\left[b_ju_j^{\mathrm T}+u_jb_j^{\mathrm T}\right]\right)
= \mathrm{Tr}\left(\frac{\mathrm{d}}{\mathrm{d}\alpha}\sum_{i=1}^Mu_iu_i^{\mathrm T}\right).$$
However,
$$\sum_{i=1}^Mu_iu_i^{\mathrm T} = I$$
which is constant, and thus its derivative w.r.t. $\alpha$ will be zero and the second term in (144) vanishes.
For the first term in (144), we again use the properties of the trace and the orthogonality of eigenvectors to obtain
$$\mathrm{Tr}\left(A^{-1}\frac{\mathrm{d}A}{\mathrm{d}\alpha}\right) = \sum_{i=1}^M\frac{1}{\eta_i}\frac{\mathrm{d}\eta_i}{\mathrm{d}\alpha}.$$
We have now shown that both the left and right hand sides of (3.117) take the same form when expressed in terms of the eigenvector expansion. Next, we use (3.117) to differentiate (3.86) w.r.t. $\alpha$, yielding
$$\frac{\mathrm{d}}{\mathrm{d}\alpha}\ln p(t|\alpha,\beta) = \frac{M}{2}\frac{1}{\alpha} - \frac{1}{2}m_N^{\mathrm T}m_N - \frac{1}{2}\mathrm{Tr}\left(A^{-1}\frac{\mathrm{d}A}{\mathrm{d}\alpha}\right)
= \frac{1}{2}\left(\frac{M}{\alpha} - m_N^{\mathrm T}m_N - \mathrm{Tr}\left(A^{-1}\right)\right)
= \frac{1}{2}\left(\frac{M}{\alpha} - m_N^{\mathrm T}m_N - \sum_i\frac{1}{\lambda_i+\alpha}\right)$$
which we recognize as the r.h.s. of (3.89), from which (3.92) can be derived as detailed in Section 3.5.2, immediately following (3.89).
3.22 Using (3.82) and (3.93), the derivation of the latter being detailed in Section 3.5.2, we get the derivative of (3.86) w.r.t. $\beta$ as the r.h.s. of (3.94). Rearranging this, collecting the $\beta$-dependent terms on one side of the equation and the remaining term on the other, we obtain (3.95).
3.23 From (3.10), (3.112) and the properties of the Gaussian and Gamma distributions (see Appendix B), we get
$$p(t) = \int\!\!\int p(t|w,\beta)p(w|\beta)\,\mathrm{d}w\;p(\beta)\,\mathrm{d}\beta$$
$$= \int\!\!\int\left(\frac{\beta}{2\pi}\right)^{N/2}\exp\left(-\frac{\beta}{2}(t-\Phi w)^{\mathrm T}(t-\Phi w)\right)
\left(\frac{\beta}{2\pi}\right)^{M/2}|S_0|^{-1/2}\exp\left(-\frac{\beta}{2}(w-m_0)^{\mathrm T}S_0^{-1}(w-m_0)\right)\mathrm{d}w\;
\Gamma(a_0)^{-1}b_0^{a_0}\beta^{a_0-1}\exp(-b_0\beta)\,\mathrm{d}\beta$$
$$= \frac{b_0^{a_0}}{\left((2\pi)^{M+N}|S_0|\right)^{1/2}}\int\left[\int\exp\left(-\frac{\beta}{2}(w-m_N)^{\mathrm T}S_N^{-1}(w-m_N)\right)\mathrm{d}w\right]
\exp\left(-\frac{\beta}{2}\left[t^{\mathrm T}t + m_0^{\mathrm T}S_0^{-1}m_0 - m_N^{\mathrm T}S_N^{-1}m_N\right]\right)\beta^{a_N-1}\beta^{M/2}\exp(-b_0\beta)\,\mathrm{d}\beta$$
where we have completed the square for the quadratic form in $w$, using
$$m_N = S_N\left(S_0^{-1}m_0+\Phi^{\mathrm T}t\right),\qquad S_N^{-1} = S_0^{-1}+\Phi^{\mathrm T}\Phi,$$
$$a_N = a_0+\frac{N}{2},\qquad b_N = b_0+\frac{1}{2}\left(m_0^{\mathrm T}S_0^{-1}m_0 - m_N^{\mathrm T}S_N^{-1}m_N + \sum_{n=1}^N t_n^2\right).$$
Now we are ready to do the integration, first over $w$ and then $\beta$, and rearrange the terms to obtain the desired result
$$p(t) = \frac{b_0^{a_0}}{\left((2\pi)^{M+N}|S_0|\right)^{1/2}}(2\pi)^{M/2}|S_N|^{1/2}\int\beta^{a_N-1}\exp(-b_N\beta)\,\mathrm{d}\beta
= \frac{1}{(2\pi)^{N/2}}\frac{|S_N|^{1/2}}{|S_0|^{1/2}}\frac{b_0^{a_0}}{b_N^{a_N}}\frac{\Gamma(a_N)}{\Gamma(a_0)}.$$
3.24 Substituting the r.h.s. of (3.10), (3.112) and (3.113) into (3.119), we get
$$p(t) = \frac{\mathcal{N}\!\left(t|\Phi w,\beta^{-1}I\right)\mathcal{N}\!\left(w|m_0,\beta^{-1}S_0\right)\mathrm{Gam}(\beta|a_0,b_0)}{\mathcal{N}\!\left(w|m_N,\beta^{-1}S_N\right)\mathrm{Gam}(\beta|a_N,b_N)}.$$ (145)
Using the definitions of the Gaussian and Gamma distributions, we can write this as
$$\left(\frac{\beta}{2\pi}\right)^{N/2}\exp\left(-\frac{\beta}{2}\|t-\Phi w\|^2\right)
\left(\frac{\beta}{2\pi}\right)^{M/2}|S_0|^{-1/2}\exp\left(-\frac{\beta}{2}(w-m_0)^{\mathrm T}S_0^{-1}(w-m_0)\right)
\Gamma(a_0)^{-1}b_0^{a_0}\beta^{a_0-1}\exp(-b_0\beta)$$
$$\times\left\{\left(\frac{\beta}{2\pi}\right)^{M/2}|S_N|^{-1/2}\exp\left(-\frac{\beta}{2}(w-m_N)^{\mathrm T}S_N^{-1}(w-m_N)\right)
\Gamma(a_N)^{-1}b_N^{a_N}\beta^{a_N-1}\exp(-b_N\beta)\right\}^{-1}.$$ (146)
Concentrating on the factors corresponding to the denominator in (145), i.e. the factors inside $\{\ldots\}^{-1}$ in (146), we can use (135)-(138) to get
$$\mathcal{N}\!\left(w|m_N,\beta^{-1}S_N\right)\mathrm{Gam}(\beta|a_N,b_N)$$
$$= \left(\frac{\beta}{2\pi}\right)^{M/2}|S_N|^{-1/2}\exp\left(-\frac{\beta}{2}\left[w^{\mathrm T}S_N^{-1}w - w^{\mathrm T}S_N^{-1}m_N - m_N^{\mathrm T}S_N^{-1}w + m_N^{\mathrm T}S_N^{-1}m_N\right]\right)\Gamma(a_N)^{-1}b_N^{a_N}\beta^{a_N-1}\exp(-b_N\beta)$$
$$= \left(\frac{\beta}{2\pi}\right)^{M/2}|S_N|^{-1/2}\exp\left(-\frac{\beta}{2}\left[w^{\mathrm T}S_0^{-1}w + w^{\mathrm T}\Phi^{\mathrm T}\Phi w - w^{\mathrm T}S_0^{-1}m_0 - w^{\mathrm T}\Phi^{\mathrm T}t - m_0^{\mathrm T}S_0^{-1}w - t^{\mathrm T}\Phi w + m_N^{\mathrm T}S_N^{-1}m_N\right]\right)$$
$$\quad\times\Gamma(a_N)^{-1}b_N^{a_N}\beta^{a_0+N/2-1}\exp\left(-\left[b_0+\frac{1}{2}\left(m_0^{\mathrm T}S_0^{-1}m_0 - m_N^{\mathrm T}S_N^{-1}m_N + t^{\mathrm T}t\right)\right]\beta\right)$$
$$= \left(\frac{\beta}{2\pi}\right)^{M/2}|S_N|^{-1/2}\exp\left(-\frac{\beta}{2}\left[(w-m_0)^{\mathrm T}S_0^{-1}(w-m_0) + \|t-\Phi w\|^2\right]\right)\Gamma(a_N)^{-1}b_N^{a_N}\beta^{a_0+N/2-1}\exp(-b_0\beta).$$
Substituting this into (146), the exponential factors along with $\beta^{a_0+N/2-1}(\beta/2\pi)^{M/2}$ cancel and we are left with (3.118).
Chapter 4 Linear Models for Classiﬁcation
4.1 Assume that the convex hulls of $\{x_n\}$ and $\{y_m\}$ intersect. Then there exists a point $z$ such that
$$z = \sum_n\alpha_nx_n = \sum_m\beta_my_m$$
where $\alpha_n \geqslant 0$ for all $n$ with $\sum_n\alpha_n = 1$, and $\beta_m \geqslant 0$ for all $m$ with $\sum_m\beta_m = 1$. If $\{x_n\}$ and $\{y_m\}$ also were to be linearly separable, we would have that
$$\widehat{w}^{\mathrm T}z + w_0 = \sum_n\alpha_n\widehat{w}^{\mathrm T}x_n + w_0 = \sum_n\alpha_n\left(\widehat{w}^{\mathrm T}x_n + w_0\right) > 0$$
since $\widehat{w}^{\mathrm T}x_n + w_0 > 0$ and the $\{\alpha_n\}$ are all nonnegative and sum to 1, but by the corresponding argument
$$\widehat{w}^{\mathrm T}z + w_0 = \sum_m\beta_m\widehat{w}^{\mathrm T}y_m + w_0 = \sum_m\beta_m\left(\widehat{w}^{\mathrm T}y_m + w_0\right) < 0,$$
which is a contradiction, and hence $\{x_n\}$ and $\{y_m\}$ cannot be linearly separable if their convex hulls intersect.
If we instead assume that $\{x_n\}$ and $\{y_m\}$ are linearly separable and consider a point $z$ in the intersection of their convex hulls, the same contradiction arises. Thus no such point can exist and the intersection of the convex hulls of $\{x_n\}$ and $\{y_m\}$ must be empty.
4.2 For the purpose of this exercise, we make the contribution of the bias weights explicit in (4.15), giving
$$E_D(\widetilde{W}) = \frac{1}{2}\mathrm{Tr}\left\{\left(XW+\mathbf{1}w_0^{\mathrm T}-T\right)^{\mathrm T}\left(XW+\mathbf{1}w_0^{\mathrm T}-T\right)\right\},$$ (147)
where $w_0$ is the column vector of bias weights (the top row of $\widetilde{W}$ transposed) and $\mathbf{1}$ is a column vector of $N$ ones.
We can take the derivative of (147) w.r.t. $w_0$, giving
$$2Nw_0 + 2(XW-T)^{\mathrm T}\mathbf{1}.$$
Setting this to zero, and solving for $w_0$, we obtain
$$w_0 = \bar{t} - W^{\mathrm T}\bar{x}$$ (148)
where
$$\bar{t} = \frac{1}{N}T^{\mathrm T}\mathbf{1}\quad\text{and}\quad\bar{x} = \frac{1}{N}X^{\mathrm T}\mathbf{1}.$$
If we substitute (148) into (147), we get
$$E_D(W) = \frac{1}{2}\mathrm{Tr}\left\{\left(XW+\bar{T}-\bar{X}W-T\right)^{\mathrm T}\left(XW+\bar{T}-\bar{X}W-T\right)\right\},$$
where
$$\bar{T} = \mathbf{1}\bar{t}^{\mathrm T}\quad\text{and}\quad\bar{X} = \mathbf{1}\bar{x}^{\mathrm T}.$$
Setting the derivative of this w.r.t. $W$ to zero we get
$$W = \left(\widehat{X}^{\mathrm T}\widehat{X}\right)^{-1}\widehat{X}^{\mathrm T}\widehat{T} = \widehat{X}^{\dagger}\widehat{T},$$
where we have defined $\widehat{X} = X - \bar{X}$ and $\widehat{T} = T - \bar{T}$.
Now consider the prediction for a new input vector $x^\star$,
$$y(x^\star) = W^{\mathrm T}x^\star + w_0 = W^{\mathrm T}x^\star + \bar{t} - W^{\mathrm T}\bar{x}
= \bar{t} + \widehat{T}^{\mathrm T}\left(\widehat{X}^{\dagger}\right)^{\mathrm T}(x^\star-\bar{x}).$$ (149)
If we apply (4.157) to $\bar{t}$, we get
$$a^{\mathrm T}\bar{t} = \frac{1}{N}a^{\mathrm T}T^{\mathrm T}\mathbf{1} = -b.$$
Therefore, applying (4.157) to (149), we obtain
$$a^{\mathrm T}y(x^\star) = a^{\mathrm T}\bar{t} + a^{\mathrm T}\widehat{T}^{\mathrm T}\left(\widehat{X}^{\dagger}\right)^{\mathrm T}(x^\star-\bar{x}) = a^{\mathrm T}\bar{t} = -b,$$
since
$$a^{\mathrm T}\widehat{T}^{\mathrm T} = a^{\mathrm T}(T-\bar{T})^{\mathrm T} = -b\mathbf{1}^{\mathrm T} + b\mathbf{1}^{\mathrm T} = \mathbf{0}^{\mathrm T}.$$
4.3 When we consider several simultaneous constraints, (4.157) becomes
$$At_n + b = 0,$$ (150)
where $A$ is a matrix and $b$ is a column vector such that each row of $A$ and element of $b$ correspond to one linear constraint.
If we apply (150) to (149), we obtain
$$Ay(x^\star) = A\bar{t} + A\widehat{T}^{\mathrm T}\left(\widehat{X}^{\dagger}\right)^{\mathrm T}(x^\star-\bar{x}) = A\bar{t} = -b,$$
since $A\widehat{T}^{\mathrm T} = A(T-\bar{T})^{\mathrm T} = -b\mathbf{1}^{\mathrm T}+b\mathbf{1}^{\mathrm T} = \mathbf{0}^{\mathrm T}$. Thus $Ay(x^\star) + b = 0$.
4.4 NOTE: In PRML, the text of the exercise refers to equation (4.23) where it should refer to (4.22).
From (4.22) we can construct the Lagrangian function
$$L = w^{\mathrm T}(m_2-m_1) + \lambda\left(w^{\mathrm T}w-1\right).$$
Taking the gradient of $L$ we obtain
$$\nabla L = m_2-m_1 + 2\lambda w$$ (151)
and setting this gradient to zero gives
$$w = -\frac{1}{2\lambda}(m_2-m_1)$$
from which it follows that $w \propto m_2-m_1$.
4.5 Starting with the numerator on the r.h.s. of (4.25), we can use (4.23) and (4.27) to rewrite it as follows:
$$(m_2-m_1)^2 = \left(w^{\mathrm T}(\mathbf{m}_2-\mathbf{m}_1)\right)^2 = w^{\mathrm T}(\mathbf{m}_2-\mathbf{m}_1)(\mathbf{m}_2-\mathbf{m}_1)^{\mathrm T}w = w^{\mathrm T}S_Bw.$$ (152)
Similarly, we can use (4.20), (4.23), (4.24), and (4.28) to rewrite the denominator of the r.h.s. of (4.25):
$$s_1^2+s_2^2 = \sum_{n\in\mathcal{C}_1}(y_n-m_1)^2 + \sum_{k\in\mathcal{C}_2}(y_k-m_2)^2
= \sum_{n\in\mathcal{C}_1}\left(w^{\mathrm T}(x_n-\mathbf{m}_1)\right)^2 + \sum_{k\in\mathcal{C}_2}\left(w^{\mathrm T}(x_k-\mathbf{m}_2)\right)^2$$
$$= \sum_{n\in\mathcal{C}_1}w^{\mathrm T}(x_n-\mathbf{m}_1)(x_n-\mathbf{m}_1)^{\mathrm T}w + \sum_{k\in\mathcal{C}_2}w^{\mathrm T}(x_k-\mathbf{m}_2)(x_k-\mathbf{m}_2)^{\mathrm T}w
= w^{\mathrm T}S_Ww.$$ (153)
Substituting (152) and (153) in (4.25) we obtain (4.26).
4.6 Using (4.21) and (4.34) along with the chosen target coding scheme, we can rewrite the l.h.s. of (4.33) as follows:
$$\sum_{n=1}^N\left(w^{\mathrm T}x_n+w_0-t_n\right)x_n
= \sum_{n=1}^N\left(w^{\mathrm T}x_n - w^{\mathrm T}m - t_n\right)x_n
= \sum_{n=1}^N\left\{\left(x_nx_n^{\mathrm T} - x_nm^{\mathrm T}\right)w - x_nt_n\right\}$$
$$= \sum_{n\in\mathcal{C}_1}\left\{\left(x_nx_n^{\mathrm T} - x_nm^{\mathrm T}\right)w - x_nt_n\right\}
+ \sum_{m'\in\mathcal{C}_2}\left\{\left(x_{m'}x_{m'}^{\mathrm T} - x_{m'}m^{\mathrm T}\right)w - x_{m'}t_{m'}\right\}$$
$$= \left(\sum_{n\in\mathcal{C}_1}x_nx_n^{\mathrm T} - N_1m_1m^{\mathrm T}\right)w - N_1m_1\frac{N}{N_1}
+ \left(\sum_{m'\in\mathcal{C}_2}x_{m'}x_{m'}^{\mathrm T} - N_2m_2m^{\mathrm T}\right)w + N_2m_2\frac{N}{N_2}$$
$$= \left(\sum_{n\in\mathcal{C}_1}x_nx_n^{\mathrm T} + \sum_{m'\in\mathcal{C}_2}x_{m'}x_{m'}^{\mathrm T} - (N_1m_1+N_2m_2)m^{\mathrm T}\right)w - N(m_1-m_2).$$ (154)
We then use the identity
$$\sum_{i\in\mathcal{C}_k}(x_i-m_k)(x_i-m_k)^{\mathrm T}
= \sum_{i\in\mathcal{C}_k}\left(x_ix_i^{\mathrm T} - x_im_k^{\mathrm T} - m_kx_i^{\mathrm T} + m_km_k^{\mathrm T}\right)
= \sum_{i\in\mathcal{C}_k}x_ix_i^{\mathrm T} - N_km_km_k^{\mathrm T}$$
together with (4.28) and (4.36) to rewrite (154) as
$$\left(S_W + N_1m_1m_1^{\mathrm T} + N_2m_2m_2^{\mathrm T} - (N_1m_1+N_2m_2)\frac{1}{N}(N_1m_1+N_2m_2)^{\mathrm T}\right)w - N(m_1-m_2)$$
$$= \left(S_W + \left(N_1-\frac{N_1^2}{N}\right)m_1m_1^{\mathrm T} - \frac{N_1N_2}{N}\left(m_1m_2^{\mathrm T}+m_2m_1^{\mathrm T}\right) + \left(N_2-\frac{N_2^2}{N}\right)m_2m_2^{\mathrm T}\right)w - N(m_1-m_2)$$
$$= \left(S_W + \frac{N_1N_2}{N}\left(m_1m_1^{\mathrm T} - m_1m_2^{\mathrm T} - m_2m_1^{\mathrm T} + m_2m_2^{\mathrm T}\right)\right)w - N(m_1-m_2)$$
$$= \left(S_W + \frac{N_1N_2}{N}S_B\right)w - N(m_1-m_2),$$
where in the last line we also made use of (4.27). From (4.33), this must equal zero, and hence we obtain (4.37).
4.7 From (4.59) we have
$$1-\sigma(a) = 1-\frac{1}{1+e^{-a}} = \frac{1+e^{-a}-1}{1+e^{-a}} = \frac{e^{-a}}{1+e^{-a}} = \frac{1}{e^a+1} = \sigma(-a).$$
The inverse of the logistic sigmoid is easily found as follows
$$y = \sigma(a) = \frac{1}{1+e^{-a}}
\;\Rightarrow\; \frac{1}{y}-1 = e^{-a}
\;\Rightarrow\; \ln\left(\frac{1-y}{y}\right) = -a
\;\Rightarrow\; \ln\left(\frac{y}{1-y}\right) = a = \sigma^{-1}(y).$$
4.8 Substituting (4.64) into (4.58), we see that the normalizing constants cancel and we are left with
$$a = \ln\frac{\exp\left(-\frac{1}{2}(x-\mu_1)^{\mathrm T}\Sigma^{-1}(x-\mu_1)\right)p(\mathcal{C}_1)}{\exp\left(-\frac{1}{2}(x-\mu_2)^{\mathrm T}\Sigma^{-1}(x-\mu_2)\right)p(\mathcal{C}_2)}$$
$$= -\frac{1}{2}\left(x^{\mathrm T}\Sigma^{-1}x - x^{\mathrm T}\Sigma^{-1}\mu_1 - \mu_1^{\mathrm T}\Sigma^{-1}x + \mu_1^{\mathrm T}\Sigma^{-1}\mu_1
- x^{\mathrm T}\Sigma^{-1}x + x^{\mathrm T}\Sigma^{-1}\mu_2 + \mu_2^{\mathrm T}\Sigma^{-1}x - \mu_2^{\mathrm T}\Sigma^{-1}\mu_2\right) + \ln\frac{p(\mathcal{C}_1)}{p(\mathcal{C}_2)}$$
$$= (\mu_1-\mu_2)^{\mathrm T}\Sigma^{-1}x - \frac{1}{2}\left(\mu_1^{\mathrm T}\Sigma^{-1}\mu_1 - \mu_2^{\mathrm T}\Sigma^{-1}\mu_2\right) + \ln\frac{p(\mathcal{C}_1)}{p(\mathcal{C}_2)}.$$
Substituting this into the rightmost form of (4.57) we obtain (4.65), with $w$ and $w_0$ given by (4.66) and (4.67), respectively.
4.9 The likelihood function is given by
$$p\left(\{\phi_n,t_n\}|\{\pi_k\}\right) = \prod_{n=1}^N\prod_{k=1}^K\left\{p(\phi_n|\mathcal{C}_k)\pi_k\right\}^{t_{nk}}$$
and taking the logarithm, we obtain
$$\ln p\left(\{\phi_n,t_n\}|\{\pi_k\}\right) = \sum_{n=1}^N\sum_{k=1}^K t_{nk}\left\{\ln p(\phi_n|\mathcal{C}_k) + \ln\pi_k\right\}.$$ (155)
In order to maximize the log likelihood with respect to $\pi_k$ we need to preserve the constraint $\sum_k\pi_k = 1$. This can be done by introducing a Lagrange multiplier $\lambda$ and maximizing
$$\ln p\left(\{\phi_n,t_n\}|\{\pi_k\}\right) + \lambda\left(\sum_{k=1}^K\pi_k - 1\right).$$
Setting the derivative with respect to $\pi_k$ equal to zero, we obtain
$$\sum_{n=1}^N\frac{t_{nk}}{\pi_k} + \lambda = 0.$$
Rearranging then gives
$$-\pi_k\lambda = \sum_{n=1}^N t_{nk} = N_k.$$ (156)
Summing both sides over $k$ we find that $\lambda = -N$, and using this to eliminate $\lambda$ we obtain (4.159).
4.10 If we substitute (4.160) into (155) and then use the definition of the multivariate Gaussian, (2.43), we obtain
$$\ln p\left(\{\phi_n,t_n\}|\{\pi_k\}\right) = -\frac{1}{2}\sum_{n=1}^N\sum_{k=1}^K t_{nk}\left\{\ln|\Sigma| + (\phi_n-\mu_k)^{\mathrm T}\Sigma^{-1}(\phi_n-\mu_k)\right\},$$ (157)
where we have dropped terms independent of $\{\mu_k\}$ and $\Sigma$.
Setting the derivative of the r.h.s. of (157) w.r.t. $\mu_k$, obtained by using (C.19), to zero, we get
$$\sum_{n=1}^N\sum_{k=1}^K t_{nk}\Sigma^{-1}(\phi_n-\mu_k) = 0.$$
Making use of (156), we can rearrange this to obtain (4.161).
Rewriting the r.h.s. of (157) as
$$-\frac{1}{2}\sum_{n=1}^N\sum_{k=1}^K t_{nk}\left\{\ln|\Sigma| + \mathrm{Tr}\left[\Sigma^{-1}(\phi_n-\mu_k)(\phi_n-\mu_k)^{\mathrm T}\right]\right\},$$
we can use (C.24) and (C.28) to calculate the derivative w.r.t. $\Sigma^{-1}$. Setting this to zero we obtain
$$\frac{1}{2}\sum_{n=1}^N\sum_{k=1}^K t_{nk}\left\{\Sigma - (\phi_n-\mu_k)(\phi_n-\mu_k)^{\mathrm T}\right\} = 0.$$
Again making use of (156), we can rearrange this to obtain (4.162), with $S_k$ given by (4.163).
Note that, as in Exercise 2.34, we do not enforce that $\Sigma$ should be symmetric, but simply note that the solution is automatically symmetric.
4.11 The generative model for $\phi$ corresponding to the chosen coding scheme is given by
$$p(\phi|\mathcal{C}_k) = \prod_{m=1}^M p(\phi_m|\mathcal{C}_k)$$
where
$$p(\phi_m|\mathcal{C}_k) = \prod_{l=1}^L\mu_{kml}^{\phi_{ml}},$$
where in turn $\{\mu_{kml}\}$ are the parameters of the multinomial models for $\phi$.
Substituting this into (4.63) we see that
$$a_k = \ln p(\phi|\mathcal{C}_k)p(\mathcal{C}_k)
= \ln p(\mathcal{C}_k) + \sum_{m=1}^M\ln p(\phi_m|\mathcal{C}_k)
= \ln p(\mathcal{C}_k) + \sum_{m=1}^M\sum_{l=1}^L\phi_{ml}\ln\mu_{kml},$$
which is linear in the $\phi_{ml}$.
4.12 Differentiating (4.59) we obtain
$$\frac{\mathrm{d}\sigma}{\mathrm{d}a} = \frac{e^{-a}}{\left(1+e^{-a}\right)^2} = \sigma(a)\frac{e^{-a}}{1+e^{-a}}
= \sigma(a)\left(\frac{1+e^{-a}}{1+e^{-a}} - \frac{1}{1+e^{-a}}\right) = \sigma(a)(1-\sigma(a)).$$
4.13 We start by computing the derivative of (4.90) w.r.t. $y_n$
$$\frac{\partial E}{\partial y_n} = \frac{1-t_n}{1-y_n} - \frac{t_n}{y_n}$$ (158)
$$= \frac{y_n(1-t_n) - t_n(1-y_n)}{y_n(1-y_n)} = \frac{y_n - y_nt_n - t_n + y_nt_n}{y_n(1-y_n)}$$ (159)
$$= \frac{y_n-t_n}{y_n(1-y_n)}.$$ (160)
From (4.88), we see that
$$\frac{\partial y_n}{\partial a_n} = \frac{\partial\sigma(a_n)}{\partial a_n} = \sigma(a_n)\left(1-\sigma(a_n)\right) = y_n(1-y_n).$$ (161)
Finally, we have
$$\nabla a_n = \phi_n$$ (162)
where $\nabla$ denotes the gradient with respect to $w$. Combining (160), (161) and (162) using the chain rule, we obtain
$$\nabla E = \sum_{n=1}^N\frac{\partial E}{\partial y_n}\frac{\partial y_n}{\partial a_n}\nabla a_n = \sum_{n=1}^N(y_n-t_n)\phi_n$$
as required.
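The compact gradient $\nabla E = \sum_n(y_n-t_n)\phi_n$ is easy to confirm with finite differences. A minimal sketch (not part of the original solution), assuming NumPy; the data are random and purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(10)
N, M = 40, 3
Phi = rng.normal(size=(N, M))
t = rng.integers(0, 2, size=N).astype(float)
w = rng.normal(size=M)

def E(w):
    y = 1.0 / (1.0 + np.exp(-Phi @ w))
    return -np.sum(t * np.log(y) + (1 - t) * np.log(1 - y))   # cross-entropy (4.90)

y = 1.0 / (1.0 + np.exp(-Phi @ w))
grad_analytic = Phi.T @ (y - t)                                # gradient derived above

eps = 1e-6
grad_numeric = np.array([(E(w + eps * np.eye(M)[i]) - E(w - eps * np.eye(M)[i])) / (2 * eps)
                         for i in range(M)])
print(np.allclose(grad_analytic, grad_numeric, atol=1e-4))
```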
4.14 If the data set is linearly separable, any decision boundary separating the two classes will have the property
$$w^{\mathrm T}\phi_n \;\begin{cases}\geqslant 0 & \text{if } t_n = 1,\\ < 0 & \text{otherwise.}\end{cases}$$
Moreover, from (4.90) we see that the negative log-likelihood will be minimized (i.e., the likelihood maximized) when $y_n = \sigma\!\left(w^{\mathrm T}\phi_n\right) = t_n$ for all $n$. This will be the case when the sigmoid function is saturated, which occurs when its argument, $w^{\mathrm T}\phi$, goes to $\pm\infty$, i.e., when the magnitude of $w$ goes to infinity.
4.15 NOTE: In PRML, "concave" should be "convex" on the last line of the exercise.
Assuming that the argument to the sigmoid function (4.87) is finite, the diagonal elements of $R$ will be strictly positive. Then
$$v^{\mathrm T}\Phi^{\mathrm T}R\Phi v = v^{\mathrm T}\Phi^{\mathrm T}R^{1/2}R^{1/2}\Phi v = \left\|R^{1/2}\Phi v\right\|^2 > 0$$
where $R^{1/2}$ is a diagonal matrix with elements $\left(y_n(1-y_n)\right)^{1/2}$, and thus $\Phi^{\mathrm T}R\Phi$ is positive definite.
Now consider a Taylor expansion of $E(w)$ around a minimum $w^\star$,
$$E(w) = E(w^\star) + \frac{1}{2}(w-w^\star)^{\mathrm T}H(w-w^\star)$$
where the linear term has vanished since $w^\star$ is a minimum. Now let
$$w = w^\star + \lambda v$$
where $v$ is an arbitrary, non-zero vector in the weight space, and consider
$$\frac{\partial^2E}{\partial\lambda^2} = v^{\mathrm T}Hv > 0.$$
This shows that $E(w)$ is convex. Moreover, at the minimum of $E(w)$,
$$H(w-w^\star) = 0$$
and since $H$ is positive definite, $H^{-1}$ exists and $w = w^\star$ must be the unique minimum.
4.16 If the values of the $\{t_n\}$ were known, then each data point for which $t_n = 1$ would contribute $\ln p(t_n = 1|\phi(x_n))$ to the log likelihood, and each point for which $t_n = 0$ would contribute $\ln\left(1 - p(t_n = 1|\phi(x_n))\right)$. A data point whose probability of having $t_n = 1$ is given by $\pi_n$ will therefore contribute
$$\pi_n\ln p(t_n = 1|\phi(x_n)) + (1-\pi_n)\ln\left(1 - p(t_n = 1|\phi(x_n))\right)$$
and so the overall log likelihood for the data set is given by
$$\sum_{n=1}^N\left[\pi_n\ln p(t_n = 1|\phi(x_n)) + (1-\pi_n)\ln\left(1 - p(t_n = 1|\phi(x_n))\right)\right].$$ (163)
This can also be viewed from a sampling perspective by imagining sampling the value of each $t_n$ some number $M$ times, with probability of $t_n = 1$ given by $\pi_n$, and then constructing the likelihood function for this expanded data set, and dividing by $M$. In the limit $M \to \infty$ we recover (163).
4.17 From (4.104) we have
$$\frac{\partial y_k}{\partial a_k} = \frac{e^{a_k}}{\sum_ie^{a_i}} - \left(\frac{e^{a_k}}{\sum_ie^{a_i}}\right)^2 = y_k(1-y_k),$$
$$\frac{\partial y_k}{\partial a_j} = -\frac{e^{a_k}e^{a_j}}{\left(\sum_ie^{a_i}\right)^2} = -y_ky_j,\qquad j \neq k.$$
Combining these results we obtain (4.106).
4.18 NOTE: In PRML, the text of the exercise refers to equation (4.91) where it should refer to (4.106).
From (4.108) we have
$$\frac{\partial E}{\partial y_{nk}} = -\frac{t_{nk}}{y_{nk}}.$$
If we combine this with (4.106) using the chain rule, we get
$$\frac{\partial E}{\partial a_{nj}} = \sum_{k=1}^K\frac{\partial E}{\partial y_{nk}}\frac{\partial y_{nk}}{\partial a_{nj}}
= -\sum_{k=1}^K\frac{t_{nk}}{y_{nk}}y_{nk}\left(I_{kj}-y_{nj}\right) = y_{nj}-t_{nj},$$
where we have used that $\forall n:\ \sum_kt_{nk} = 1$.
If we combine this with (162), again using the chain rule, we obtain (4.109).
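The multiclass result $\partial E/\partial a_{nj} = y_{nj} - t_{nj}$ can be verified the same way as in Exercise 4.13. A minimal sketch (not part of the original solution), assuming NumPy, checking the derivative of the cross-entropy (4.108) with respect to the softmax activations for a single data point:

```python
import numpy as np

rng = np.random.default_rng(11)
K = 4
a = rng.normal(size=K)              # activations for one data point
t = np.zeros(K); t[1] = 1.0         # 1-of-K target

def E(a):
    y = np.exp(a) / np.sum(np.exp(a))    # softmax (4.104)
    return -np.sum(t * np.log(y))         # one term of the cross-entropy sum (4.108)

y = np.exp(a) / np.sum(np.exp(a))
grad_analytic = y - t                     # result derived above

eps = 1e-6
grad_numeric = np.array([(E(a + eps * np.eye(K)[j]) - E(a - eps * np.eye(K)[j])) / (2 * eps)
                         for j in range(K)])
print(np.allclose(grad_analytic, grad_numeric, atol=1e-6))
```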
4.19 Using the cross-entropy error function (4.90), and following Exercise 4.13, we have
$$\frac{\partial E}{\partial y_n} = \frac{y_n-t_n}{y_n(1-y_n)}.$$ (164)
Also
$$\nabla a_n = \phi_n.$$ (165)
From (4.115) and (4.116) we have
$$\frac{\partial y_n}{\partial a_n} = \frac{\partial\Phi(a_n)}{\partial a_n} = \frac{1}{\sqrt{2\pi}}e^{-a_n^2}.$$ (166)
Combining (164), (165) and (166), we get
$$\nabla E = \sum_{n=1}^N\frac{\partial E}{\partial y_n}\frac{\partial y_n}{\partial a_n}\nabla a_n
= \sum_{n=1}^N\frac{y_n-t_n}{y_n(1-y_n)}\frac{1}{\sqrt{2\pi}}e^{-a_n^2}\phi_n.$$ (167)
In order to find the expression for the Hessian, it is convenient to first determine
$$\frac{\partial}{\partial y_n}\left(\frac{y_n-t_n}{y_n(1-y_n)}\right)
= \frac{y_n(1-y_n)}{y_n^2(1-y_n)^2} - \frac{(y_n-t_n)(1-2y_n)}{y_n^2(1-y_n)^2}
= \frac{y_n^2 + t_n - 2y_nt_n}{y_n^2(1-y_n)^2}.$$ (168)
Then using (165)-(168) we have
$$\nabla\nabla E = \sum_{n=1}^N\left[\frac{\partial}{\partial y_n}\left\{\frac{y_n-t_n}{y_n(1-y_n)}\right\}\frac{1}{\sqrt{2\pi}}e^{-a_n^2}\phi_n\nabla y_n
+ \frac{y_n-t_n}{y_n(1-y_n)}\frac{1}{\sqrt{2\pi}}e^{-a_n^2}(-2a_n)\phi_n\nabla a_n\right]$$
$$= \sum_{n=1}^N\left[\frac{y_n^2+t_n-2y_nt_n}{y_n(1-y_n)}\frac{1}{\sqrt{2\pi}}e^{-a_n^2} - 2a_n(y_n-t_n)\right]\frac{e^{-a_n^2}}{\sqrt{2\pi}\,y_n(1-y_n)}\phi_n\phi_n^{\mathrm T}.$$
4.20 NOTE: In PRML, equation (4.110) contains an incorrect leading minus sign ('$-$') on the right hand side.
We first write out the components of the $MK\times MK$ Hessian matrix in the form
$$\frac{\partial^2E}{\partial w_{ki}\partial w_{jl}} = \sum_{n=1}^N y_{nk}\left(I_{kj}-y_{nj}\right)\phi_{ni}\phi_{nl}.$$
To keep the notation uncluttered, consider just one term in the summation over $n$ and show that this is positive semidefinite. The sum over $n$ will then also be positive semidefinite. Consider an arbitrary vector of dimension $MK$ with elements $u_{ki}$. Then
$$u^{\mathrm T}Hu = \sum_{i,j,k,l}u_{ki}\,y_k\left(I_{kj}-y_j\right)\phi_i\phi_l\,u_{jl}
= \sum_{j,k}b_j\,y_k\left(I_{kj}-y_j\right)b_k
= \sum_ky_kb_k^2 - \left(\sum_kb_ky_k\right)^2$$
where
$$b_k = \sum_iu_{ki}\phi_{ni}.$$
We now note that the quantities $y_k$ satisfy $0 \leqslant y_k \leqslant 1$ and $\sum_ky_k = 1$. Furthermore, the function $f(b) = b^2$ is a convex function. We can therefore apply Jensen's inequality to give
$$\sum_ky_kb_k^2 = \sum_ky_kf(b_k) \geqslant f\left(\sum_ky_kb_k\right) = \left(\sum_ky_kb_k\right)^2$$
and hence
$$u^{\mathrm T}Hu \geqslant 0.$$
Note that the equality will never arise for finite values of $a_k$ where $\{a_k\}$ is the set of arguments to the softmax function. However, the Hessian can be positive semidefinite since the basis vectors $\phi_{ni}$ could be such as to have zero dot product for a linear subspace of vectors $u_{ki}$. In this case the minimum of the error function would comprise a continuum of solutions all having the same value of the error function.
4.21 NOTE: In PRML, equation (4.116) contains a minor typographical error. On the l.h.s., $\boldsymbol{\Phi}$ should be $\Phi$ (i.e. not bold).
We consider the two cases where $a \geqslant 0$ and $a < 0$ separately. In the first case, we can use (2.42) to rewrite (4.114) as
$$\Phi(a) = \int_{-\infty}^0\mathcal{N}(\theta|0,1)\,\mathrm{d}\theta + \int_0^a\frac{1}{\sqrt{2\pi}}\exp\left(-\frac{\theta^2}{2}\right)\mathrm{d}\theta
= \frac{1}{2} + \frac{1}{\sqrt{2\pi}}\int_0^a\exp\left(-\frac{\theta^2}{2}\right)\mathrm{d}\theta
= \frac{1}{2}\left(1+\frac{1}{\sqrt{2}}\mathrm{erf}(a)\right),$$
where, in the last line, we have used (4.115).
When $a < 0$, the symmetry of the Gaussian distribution gives
$$\Phi(a) = 1-\Phi(-a).$$
Combining this with the result for $a \geqslant 0$, we get
$$\Phi(a) = 1 - \frac{1}{2}\left(1+\frac{1}{\sqrt{2}}\mathrm{erf}(-a)\right) = \frac{1}{2}\left(1+\frac{1}{\sqrt{2}}\mathrm{erf}(a)\right),$$
where we have used the fact that the erf function is antisymmetric, i.e., $\mathrm{erf}(-a) = -\mathrm{erf}(a)$.
4.22 Starting from (4.136), using (4.135), we have
$$p(\mathcal{D}) = \int p(\mathcal{D}|\theta)p(\theta)\,\mathrm{d}\theta
\simeq p(\mathcal{D}|\theta_{\mathrm{MAP}})p(\theta_{\mathrm{MAP}})\int\exp\left(-\frac{1}{2}(\theta-\theta_{\mathrm{MAP}})^{\mathrm T}A(\theta-\theta_{\mathrm{MAP}})\right)\mathrm{d}\theta
= p(\mathcal{D}|\theta_{\mathrm{MAP}})p(\theta_{\mathrm{MAP}})\frac{(2\pi)^{M/2}}{|A|^{1/2}},$$
where $A$ is given by (4.138). Taking the logarithm of this yields (4.137).
4.23 NOTE: In PRML, the text of the exercise contains a typographical error. Following the equation, it should say that $H$ is the matrix of second derivatives of the negative log likelihood.
The BIC approximation can be viewed as a large $N$ approximation to the log model evidence. From (4.138), we have
$$A = -\nabla\nabla\ln p(\mathcal{D}|\theta_{\mathrm{MAP}})p(\theta_{\mathrm{MAP}}) = H - \nabla\nabla\ln p(\theta_{\mathrm{MAP}})$$
and if $p(\theta) = \mathcal{N}(\theta|m,V_0)$, this becomes
$$A = H + V_0^{-1}.$$
If we assume that the prior is broad, or equivalently that the number of data points is large, we can neglect the term $V_0^{-1}$ compared to $H$. Using this result, (4.137) can be rewritten in the form
$$\ln p(\mathcal{D}) \simeq \ln p(\mathcal{D}|\theta_{\mathrm{MAP}}) - \frac{1}{2}(\theta_{\mathrm{MAP}}-m)^{\mathrm T}V_0^{-1}(\theta_{\mathrm{MAP}}-m) - \frac{1}{2}\ln|H| + \text{const}$$ (169)
as required. Note that the phrasing of the question is misleading, since the assumption of a broad prior, or of large $N$, is required in order to derive this form, as well as in the subsequent simplification.
We now again invoke the broad prior assumption, allowing us to neglect the second term on the right hand side of (169) relative to the first term.
Since we assume i.i.d. data, $H = -\nabla\nabla\ln p(\mathcal{D}|\theta_{\mathrm{MAP}})$ consists of a sum of terms, one term for each datum, and we can consider the following approximation:
$$H = \sum_{n=1}^NH_n = N\widehat{H}$$
where $H_n$ is the contribution from the $n$-th data point and
$$\widehat{H} = \frac{1}{N}\sum_{n=1}^NH_n.$$
Combining this with the properties of the determinant, we have
$$\ln|H| = \ln\left|N\widehat{H}\right| = \ln\left(N^M\left|\widehat{H}\right|\right) = M\ln N + \ln\left|\widehat{H}\right|$$
where $M$ is the dimensionality of $\theta$. Note that we are assuming that $\widehat{H}$ has full rank $M$. Finally, using this result together with (169), we obtain (4.139) by dropping the $\ln|\widehat{H}|$ since this is $O(1)$ compared to $\ln N$.
4.24 Consider a rotation of the coordinate axes of the $M$-dimensional vector $w$ such that $w = (w_\parallel, w_\perp)$, where $w^{\mathrm T}\phi = w_\parallel\|\phi\|$ and $w_\perp$ is a vector of length $M-1$. We then have
$$\int\sigma\!\left(w^{\mathrm T}\phi\right)q(w)\,\mathrm{d}w
= \int\!\!\int\sigma\!\left(w_\parallel\|\phi\|\right)q(w_\perp|w_\parallel)q(w_\parallel)\,\mathrm{d}w_\parallel\,\mathrm{d}w_\perp
= \int\sigma\!\left(w_\parallel\|\phi\|\right)q(w_\parallel)\,\mathrm{d}w_\parallel.$$
Note that the joint distribution $q(w_\perp, w_\parallel)$ is Gaussian. Hence the marginal distribution $q(w_\parallel)$ is also Gaussian and can be found using the standard results presented in Section 2.3.2. Denoting the unit vector
$$e = \frac{1}{\|\phi\|}\phi$$
we have
$$q(w_\parallel) = \mathcal{N}\!\left(w_\parallel\,\middle|\,e^{\mathrm T}m_N,\;e^{\mathrm T}S_Ne\right).$$
Defining $a = w_\parallel\|\phi\|$ we see that the distribution of $a$ is given by a simple rescaling of the Gaussian, so that
$$q(a) = \mathcal{N}\!\left(a\,\middle|\,\phi^{\mathrm T}m_N,\;\phi^{\mathrm T}S_N\phi\right)$$
where we have used $\|\phi\|e = \phi$. Thus we obtain (4.151) with $\mu_a$ given by (4.149) and $\sigma_a^2$ given by (4.150).
4.25 From (4.88) we have that
$$\left.\frac{\mathrm{d}\sigma}{\mathrm{d}a}\right|_{a=0} = \sigma(0)\left(1-\sigma(0)\right) = \frac{1}{2}\left(1-\frac{1}{2}\right) = \frac{1}{4}.$$ (170)
Since the derivative of a cumulative distribution function is simply the corresponding density function, (4.114) gives
$$\left.\frac{\mathrm{d}\Phi(\lambda a)}{\mathrm{d}a}\right|_{a=0} = \lambda\mathcal{N}(0|0,1) = \frac{\lambda}{\sqrt{2\pi}}.$$
Setting this equal to (170), we see that
$$\lambda = \frac{\sqrt{2\pi}}{4}\quad\text{or equivalently}\quad\lambda^2 = \frac{\pi}{8}.$$
This is illustrated in Figure 4.9.
4.26 First of all consider the derivative of the right hand side of (4.152) with respect to μ, making use of the definition of the probit function, giving

    \frac{1}{(2\pi)^{1/2}} \exp\left\{ -\frac{\mu^{2}}{2(\lambda^{-2} + \sigma^{2})} \right\}
        \frac{1}{(\lambda^{-2} + \sigma^{2})^{1/2}}.

Now make the change of variable a = μ + σz, so that the left hand side of (4.152) becomes

    \int_{-\infty}^{\infty} \Phi(\lambda\mu + \lambda\sigma z)\,
        \frac{1}{(2\pi\sigma^{2})^{1/2}} \exp\left\{ -\frac{1}{2} z^{2} \right\} \sigma\, dz

where we have substituted for the Gaussian distribution. Now differentiate with respect to μ, making use of the definition of the probit function, giving

    \frac{\lambda}{2\pi} \int_{-\infty}^{\infty}
        \exp\left\{ -\frac{1}{2} z^{2} - \frac{\lambda^{2}}{2}(\mu + \sigma z)^{2} \right\} dz.

The integral over z takes the standard Gaussian form and can be evaluated analytically by making use of the standard result for the normalization coefficient of a Gaussian distribution. To do this we first complete the square in the exponent

    -\frac{1}{2} z^{2} - \frac{\lambda^{2}}{2}(\mu + \sigma z)^{2}
        = -\frac{1}{2} z^{2} (1 + \lambda^{2}\sigma^{2}) - z \lambda^{2}\mu\sigma - \frac{1}{2}\lambda^{2}\mu^{2}
        = -\frac{1}{2} \left( z + \frac{\lambda^{2}\mu\sigma}{1 + \lambda^{2}\sigma^{2}} \right)^{2} (1 + \lambda^{2}\sigma^{2})
          + \frac{1}{2} \frac{\lambda^{4}\mu^{2}\sigma^{2}}{1 + \lambda^{2}\sigma^{2}}
          - \frac{1}{2} \lambda^{2}\mu^{2}.

Integrating over z then gives the following result for the derivative of the left hand side

    \frac{\lambda}{(2\pi)^{1/2}} \frac{1}{(1 + \lambda^{2}\sigma^{2})^{1/2}}
        \exp\left\{ -\frac{1}{2}\lambda^{2}\mu^{2}
            + \frac{1}{2} \frac{\lambda^{4}\mu^{2}\sigma^{2}}{1 + \lambda^{2}\sigma^{2}} \right\}
      = \frac{\lambda}{(2\pi)^{1/2}} \frac{1}{(1 + \lambda^{2}\sigma^{2})^{1/2}}
        \exp\left\{ -\frac{\lambda^{2}\mu^{2}}{2(1 + \lambda^{2}\sigma^{2})} \right\}.

Since λ/(1 + λ²σ²)^{1/2} = 1/(λ⁻² + σ²)^{1/2}, the derivatives of the left and right hand sides of (4.152) with respect to μ are equal. It follows that the left and right hand sides are equal up to a function of σ² and λ. Taking the limit μ → −∞ the left and right hand sides both go to zero, showing that the constant of integration must also be zero.
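The identity (4.152) itself can also be checked directly by one-dimensional quadrature; the sketch below (not part of the solution, assuming SciPy, with arbitrary values of λ, μ and σ) compares the two sides numerically.

    # Sketch: int Phi(lambda*a) N(a | mu, sigma^2) da = Phi(mu / sqrt(lambda^-2 + sigma^2))
    import numpy as np
    from scipy.stats import norm
    from scipy.integrate import quad

    lam, mu, sigma = 0.7, 0.4, 1.3
    lhs = quad(lambda a: norm.cdf(lam * a) * norm.pdf(a, mu, sigma), -40, 40)[0]
    rhs = norm.cdf(mu / np.sqrt(lam**-2 + sigma**2))
    print(lhs, rhs)   # should agree to numerical precision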
Chapter 5 Neural Networks
5.1 NOTE: In PRML, the text of this exercise contains a typographical error. On line 2,
g() should be replaced by h().
See Solution 3.1.
5.2 The likelihood function for an i.i.d. data set, {(x_1, t_1), . . . , (x_N, t_N)}, under the conditional distribution (5.16) is given by

    \prod_{n=1}^{N} \mathcal{N}\left( t_n \,|\, y(x_n, w),\; \beta^{-1} I \right).
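A small illustrative sketch (the data, the stand-in network function y and all names below are invented for illustration and are not taken from the text, with NumPy and SciPy assumed): the logarithm of the likelihood above is simply a sum of Gaussian log-densities over the N data points.

    # Sketch: evaluate ln of prod_n N(t_n | y(x_n, w), beta^{-1} I) for toy data
    import numpy as np
    from scipy.stats import multivariate_normal

    rng = np.random.default_rng(3)
    N, K, beta = 20, 2, 4.0                        # K target dimensions, precision beta
    X = rng.standard_normal((N, 3))
    T = rng.standard_normal((N, K))

    def y(x, w):                                   # placeholder for the network output
        return np.tanh(x @ w)

    w = rng.standard_normal((3, K))
    log_lik = sum(multivariate_normal.logpdf(T[n], mean=y(X[n], w), cov=np.eye(K) / beta)
                  for n in range(N))
    print(log_lik)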
50 Solution 2. 2 σN −1 σ2 (113) 2 2 p(μμN . where C accounts for all the remaining terms that are independent of μ. σ 2 ) 2 = N (μμN −1 .143). the second term can be rewritten as 2 σN xN σ2 and so μN = Now consider 2 σN σ2 μN −1 + N xN . we can rewrite (2. σN −1 )p(xN μ. σ 2 ) ∝ exp − = exp − 1 2 1 2 μ2 −1 − 2μμN −1 + μ2 x2 − 2xN μ + μ2 N + N 2 σN −1 σ2 σ 2 (μ2 −1 − 2μμN −1 + μ2 ) N 2 σN −1 σ 2 + 2 σN −1 (x2 − 2xN μ + μ2 ) N 2 σN −1 σ 2 = exp − 2 2 1 (σN −1 + σ 2 )μ2 − 2(σ 2 μN −1 + σN −1 xN )μ 2 2 σN −1 σ 2 + C.39 and similarly 1 2 σN −1 = 2 σ 2 + (N − 1)σ0 . σN −1 )N (xN μ. we can rewrite the ﬁrst term of this expression as 2 2 σN σ 2 μ0 + σ0 n=1 xn σ2 = 2 N μN −1 .141). using (111). we can directly read off 2 σ 2 + σN −1 1 1 1 = = 2 + 2 2 2 2 σN σN −1 σ σN −1 σ . 2 2 σN −1 (N − 1)σ0 + σ 2 σN −1 N −1 Similarly. 2 2 N σ0 + σ 2 N σ0 + σ 2 N N −1 Using (2. σN ) = p(μμN −1 . (111) and (112). From this.141) as μN = = xn σ2 σ2 μ0 + 0 2n=1 2 2 2 N σ0 + σ N σ0 + σ 2 2 σ 2 μ0 + σ0 n=1 xn σ0 xN + . 2 σ0 σ 2 (112) Using (2.
141) of the Gamma function.41 and μN = = = 2 σ 2 μN −1 + σN −1 xN 2 σN −1 + σ 2 51 σ2 σ2 μN −1 + 2 N −1 2 xN 2 σN −1 + σ 2 σN −1 + σ 2 σN σ2 μN −1 + N xN 2 σN −1 σ2 and so we have recovered (110) and (113). 2.71) we see that the mean and covariance of the posterior distribution are given by μN Σ−1 N = = Σ−1 + N Σ−1 0 Σ−1 0 + NΣ −1 −1 Σ−1 μ0 + Σ−1 N μML 0 (114) (115) where μML is the maximum likelihood solution for the mean given by μML = 1 N N xn . Σ).Solutions 2. Using the discussion following (2.40 The posterior distribution is proportional to the product of the prior and the likelihood function N p(μX) ∝ p(μ) n=1 p(xn μ.41 If we consider the integral of the Gamma distribution over τ and make the change of variable bτ = u we have ∞ 0 Gam(τ a.40–2. b) dτ 1 Γ(a) 1 = Γ(a) = 1 = ∞ 0 ba τ a−1 exp(−bτ ) dτ ba ua−1 exp(−u)b1−a b−1 du ∞ 0 where we have used the deﬁnition (1. . Thus the posterior is proportional to an exponential of a quadratic form in μ given by 1 1 − (μ − μ0 )T Σ−1 (μ − μ0 ) − 0 2 2 N n=1 (xn − μ)T Σ−1 (xn − μ) Σ−1 μ0 + Σ−1 0 N 1 = − μT Σ−1 + N Σ−1 μ + μT 0 2 xn n=1 + const where ‘const. n=1 2.’ denotes terms independent of μ.
since τ must be a nonnegative quantity. Similarly we can ﬁnd the variance by ﬁrst evaluating E[τ 2 ] = = = and then using var[τ ] = E[τ 2 ] − E[τ ]2 = a(a + 1) a2 a − 2 = 2.13.293) consider the integral ∞ from which we obtain I= −∞ exp − xq 2σ 2 ∞ dx = 2 0 exp − xq 2σ 2 dx and make the change of variable u= xq . mode[τ ] = 2.42 We can use the same change of variable as in the previous exercise to evaluate the mean of the Gamma distribution E[τ ] = = = ∞ 1 ba τ a−1 τ exp(−bτ ) dτ Γ(a) 0 ∞ 1 ba ua exp(−u)b−a b−1 du Γ(a) 0 Γ(a + 1) a = bΓ(a) b where we have used the recurrence relation Γ(a + 1) = aΓ(a) for the Gamma function.43 To prove the normalization of the distribution (2.42–2. the mode of the Gamma distribution is obtained simply by differentiation d dτ τ a−1 exp(−bτ ) = a−1 − b τ a−1 exp(−bτ ) = 0 τ a−1 .43 2. 2 b b b ∞ 1 ba τ a−1 τ 2 exp(−bτ ) dτ Γ(a) 0 ∞ 1 ba ua+1 exp(−u)b−a−1 b−1 du Γ(a) 0 (a + 1)Γ(a + 1) a(a + 1) Γ(a + 2) = = b2 Γ(a) b2 Γ(a) b2 Finally. This is also apparent in the plot of Figure 2.52 Solutions 2. 2σ 2 . b Notice that the mode only exists if a 1.
where the factors on the r. Taking the logarithm. and discarding additive constants. ⎧ ⎛ ⎞ ⎫ 2 N ⎪ ⎪ N ⎨ ⎬ βμ0 + n=1 xn 1 β 2 ⎜ ⎟ N/2 a−1 2 exp − ⎝b + xn + μ0 − λ λ λ ⎠ 2 2 2(N + β) ⎪ ⎪ ⎩ ⎭ n=1 (λ(N + β)) 1 /2 exp − λ(N + β) 2 μ− βμ0 + n=1 xn N +β N .154).h. w)q 2σ 2 .293) follows.152) and (2. λ). we get p(μ. σ 2 ) = q 2(2σ 2 )1/q Γ(1/q) exp − t − y(x.141) of the Gamma function. w. this gives ∞ 53 I=2 0 2σ 2 2(2σ 2 )1/q Γ(1/q) (2σ 2 u)(1−q)/q exp(−u) du = q q from which the normalization of (2. where we have used the deﬁntions of the Gaussian and Gamma distributions and we have ommitted terms independent of μ and λ. For the given noise distribution. λX) ∝ p(Xμ. the conditional distribution of the target variable given the input variable is p(tx.44 Using the deﬁnition (1. respectively. The likelihood function is obtained by taking products of factors of this form. over all pairs {xn .Solution 2. are given by (2.44 From Bayes’ theorem we have p(μ.s. we obtain the desired result. Writing this out in full. λ) ∝ λ 1 /2 λμ2 exp − 2 N N exp λμ n=1 λ xn − 2 N x2 n n=1 βλ 2 (βλ)1/2 exp − μ − 2μμ0 + μ2 0 2 λa−1 exp (−bλ) . We can rearrange this to obtain λN/2 λa−1 exp − b+ 1 /2 1 2 N x2 + n n=1 β 2 μ 2 0 2 λ N (λ(N + β)) λ(N + β) exp − 2 2 μ − N +β βμ0 + n=1 xn μ and by completing the square in the argument of the second exponential. 2. λ)p(μ. tn }.
.158). . we have ∞ 0 ba e(−bτ ) τ a−1 Γ(a) = ba Γ(a) τ 2π 1 /2 τ exp − (x − μ)2 2 ∞ 0 dτ b+ (x − μ)2 2 dτ . N μμN . as in the univariate case.54 Solutions 2.155). we see that the functional dependence on Λ is indeed the same and thus a product of this likelihood and a Wishart prior will result in a Wishart posterior. 1 2π 1 /2 τ a−1/2 exp −τ We now make the proposed change of variable z = τ Δ. xN }: N n=1 N (xn μ. yielding ba Γ(a) 1 2π 1 /2 Δ−a−1/2 0 ∞ z a−1/2 exp(−z) dz = ba Γ(a) 1 2π 1 /2 Δ−a−1/2 Γ(a + 1/2) . bN ) .45 We do this. . by considering the likelihood function of Λ for a given data set {x1 . By simply comparing with (2. x2 + n n=1 β 2 N +β 2 μ − μN . where Δ = b + (x − μ)2 /2. 2 T where S = n (xn − μ)(xn − μ) . . Λ−1 ) ∝ ΛN/2 exp − 1 2 N (xn − μ)T Λ(xn − μ) n=1 1 = ΛN/2 exp − Tr [ΛS] . 2 0 2 2.46 we arrive at an (unnormalised) GaussianGamma distribution. 2. ((N + β)λ) with parameters μN aN bN βμ0 + n=1 xn N +β N = a+ 2 = = b+ 1 2 N N −1 Gam (λaN .46 From (2.45–2.
2. 2 (117) = exp − = exp − We see that in the limit ν → ∞ this becomes. ν/2 for a and ν/2λ for b: Γ(−a + 1/2) a b Γ(a) Γ ((ν + 1)/2) = Γ(ν/2) = = Γ ((ν + 1)/2) Γ(ν/2) Γ ((ν + 1)/2) Γ(ν/2) 1 2π ν 2λ ν 2λ λ νπ 1 /2 Δa−1/2 1 2π 1 2π 1+ 1 /2 ν/2 ν (x − μ)2 + 2λ 2 ν 2λ −(ν +1)/2 −(ν +1)/2 ν/2 1 /2 1+ λ(x − μ)2 ν −(ν +1)/2 1 /2 λ(x − μ)2 ν −(ν +1)/2 2.47–2.48 55 where we have used the deﬁnition of the Gamma function (1.161). up to an overall constant. we .141). we substitute b + (x − μ)2 /2 for Δ. we write (2.48 Substituting expressions for the Gaussian and Gamma distributions into (2.47 Ignoring the normalization constant. The normalization coefﬁcient is given by the standard expression (2.42) for a univariate Gaussian. Since the Student distribution is normalized to unity for all values of ν it follows that it must remain normalized in this limit.159) as St(xμ. the same as a Gaussian distribution with mean μ and precision λ. λ.Solutions 2. (116) For large ν. ν) ∝ 1+ λ(x − μ)2 ν −(ν−1)/2 = exp − ν−1 λ(x − μ)2 ln 1 + 2 ν . Finally. we make use of the Taylor expansion for the logarithm in the form ln(1 + ) = + O( 2 ) to rewrite (116) as exp − ν−1 λ(x − μ)2 ln 1 + 2 ν ν − 1 λ(x − μ)2 + O(ν −2 ) 2 ν λ(x − μ)2 + O(ν −1 ) .
which equals μ since the Student distribution is normalized. ν)x dx = St(z0. The covariance of the multivariate Student can be reexpressed by using the expression for the multivariate Student distribution as a convolution of a Gaussian with a . ν)(z + μ) dz. ν) dx = = = N xμ. ν) = St(−z0. Λ. This leaves the second term. Now we make the change of variable τ =η which gives St(xμ. Λ. Λ.49 have St(xμ. ν/2) dη η D/2 η ν/2−1 e−νη/2 e−ηΔ 2 (ν/2)ν/2 Λ1/2 Γ(ν/2) (2π)D/2 /2 dη. From (2. Λ. (ηΛ)−1 Gam (ην/2. ν) = = ∞ 0 N xμ. ν/2) dη dx N xμ. we can write E[x] = St(xμ. Λ. ν) = 1 (ν/2)ν/2 Λ1/2 ν + Δ2 D/2 2 Γ(ν/2) (2π) 2 ∞ 0 1 ν + Δ2 2 2 −1 −D/2−ν/2 τ D/2+ν/2−1 e−τ dτ −D/2−ν/2 = as required. (ηΛ) −1 ∞ 0 Gam(ην/2. ν). ν/2) dη = 1. Δ2 Γ(ν/2 + d/2) Λ1/2 1+ Γ(ν/2) ν (πν)D/2 The correct normalization of the multivariate Student’s tdistribution follows directly from the fact that the Gaussian and Gamma distributions are normalized. Λ. In the factor (z + μ) the ﬁrst term vanishes as a consequence of the fact that the zeromean Student distribution is an even function of z that is St(−z0.56 Solution 2.49 If we make the change of variable z = x − μ. ν/2) dη Gam (ην/2. Λ.161) we have St (xμ. (ηΛ)−1 dx Gam (ην/2. 2.
50 Just like in univariate case (Exercise 2. ν) = 1+ Γ(ν/2) ν (πν)D/2 Provided Λ is nonsingular we therefore obtain mode[x] = μ.47). ηΛ)(x − μ)(x − μ)T dx Gam(ην/2. ν/2) dη η −1 Λ−1 Gam(ην/2. together with the standard result Γ(1 + x) = xΓ(x).146) to give cov[x] = = = ν ν/2 ∞ −νη/2 ν/2−2 1 e η dηΛ−1 Γ(ν/2) 2 0 ν Γ(ν/2 − 2) −1 Λ 2 Γ(ν/2) ν Λ−1 ν−2 where we have used the integral representation for the Gamma function. 2 As in the univariate case. the univariate normalization argument also applies in the multivariate case. Again we make use of (117) to give exp − D ν + 2 2 ln 1 + Δ2 ν = exp − Δ2 + O(1/ν) . ν/2) dη ∞ = 0 where we have used the standard result for the covariance of a multivariate Gaussian. 2. . we ignore the normalization coefﬁcient. the same as a Gaussian distribution. Λ. Λ. We now substitute for the Gamma distribution using (2.Solution 2. in the limit ν → ∞ this becomes.50 Gamma distribution given by (2. The mode of the Student distribution is obtained by differentiation Γ(ν/2 + D/2) Λ1/2 Δ2 ∇x St(xμ. which leaves us with Δ2 1+ ν −ν/2−D/2 −D/2−ν/2−1 1 Λ(x − μ).161) which gives cov[x] = = 0 57 St(xμ. ν)(x − μ)(x − μ)T dx ∞ N (xμ. up to an overall constant. here with mean μ and precision Λ. ν = exp − D ν + 2 2 ln 1 + Δ2 ν where Δ2 is the squared Mahalanobis distance given by Δ2 = (x − μ)T Λ(x − μ).
For large m we have cos(m−1/2 ξ) = 1 − m−1 ξ 2 /2 + O(m−2 ) and so p(ξ) ∝ exp −ξ 2 /2 and hence p(θ) ∝ exp{−m(θ − θ0 )2 /2}.53 Using (2.184). Differentiating again we have 1 exp {m cos(θ − θ0 )} sin2 (θ − θ0 ) + cos(θ − θ0 ) . we have cos(A − B) = exp{i(A − B)} = exp(iA) exp(−iB) = (cos A + i sin A)(cos B − i sin B) = cos A cos B + sin A sin B. Finally sin(A − B) = exp{i(A − B)} = exp(iA) exp(−iB) = (cos A + i sin A)(cos B − i sin B) = sin A cos B − cos A sin B. Rearranging this. 2.296) we have 1 = exp(iA) exp(−iA) = (cos A + i sin A)(cos A − i sin A) = cos2 A + sin2 A. cos θn cos θ0 n n which we can solve w.179) we have p (θ) = − 1 exp {m cos(θ − θ0 )} sin(θ − θ0 ) 2πI0 (m) which vanishes when θ = θ0 or when θ = θ0 + π (mod2π).58 Solutions 2.54 2.51–2. 2. we get sin θn sin θ0 = = tan θ0 . 2.52 Expressed in terms of ξ the von Mises distribution becomes p(ξ) ∝ exp m cos(m−1/2 ξ) .r. we can write (2.51 Using the relation (2.183).182) as N N N (cos θ0 sin θn − cos θn sin θ0 ) = cos θ0 n=1 n=1 sin θn − sin n=1 cos θn = 0.54 Differentiating the von Mises distribution (2.t. θ0 to obtain (2. Similarly. p (θ) = − 2πI0 (m) .
2.177).168) θ and (2. The “−” on the r.146) we obtain Gam(λa.55–2. equation (2. Γ(a) .Solutions 2. we can rewrite (2. contains a typo.13) we have Beta(μa. b−1 (118) (119) (120) (121) Applying the same approach to the gamma distribution (2.56 59 Since I0 (m) > 0 we see that p (θ) < 0 when θ = θ0 .187).187) as follows: A(mML ) = 1 N N cos θn n=1 ML cos θ0 + 1 N N sin θn n=1 ML sin θ0 ML ML = ¯ cos ¯ cos θ0 + ¯ sin ¯ sin θ0 r θ r θ ML ML = ¯ cos2 θ0 + sin2 θ0 r = ¯.184). b) = Γ(a)Γ(b) u(μ) = η(a. we see that ¯ = θ0 . which will be the starting point for this solution.56 We can most conveniently cast distributions into standard exponential family form by taking the exponential of the logarithm of the distribution.s. while p (θ) > 0 when θ = θ0 + π (mod2π). as is easily seen from (2. Using this together with (2.185).194) with h(μ) = 1 Γ(a + b) g(a. b) = Γ(a + b) exp {(a − 1) ln μ + (b − 1) ln(1 − μ)} Γ(a)Γ(b) which we can identify as being in standard exponential form (2. b) = ln μ ln(1 − μ) a−1 .169) and (2. b) = ba exp {(a − 1) ln λ − bλ} . which therefore represents a maximum of the density. which is therefore a minimum.h. For the Beta distribution (2. r 2.55 NOTE: In PRML.178) and (2. ML From (2. should be a “+”.
with η = u(x) = Σ−1 μ −1λ 2 x z h(x) = (2π)−D/2 1 g(η) = Σ−1/2 exp − μT Σ−1 μ . for the von Mises distribution (2. we deﬁne the D2 dimensional vectors z and λ.179) we make use of the identity (2. m) = 1 2πI0 (m) cos θ sin θ m cos θ0 . b) = u(λ) = η(a. which consist of the columns of xxT and Σ−1 . a−1 a (122) (123) (124) (125) Finally.60 Solution 2. we can rewrite the argument of the exponential as 1 1 − Tr Σ−1 xxT + μT Σ−1 x − μT Σ−1 μ. respectively. To deal with the ﬁrst term. m) = 2πI0 (m) from which we ﬁnd h(θ) = 1 g(θ0 . m) = u(θ) = η(θ0 . stacked on top of each other.178) to give 1 exp {m cos θ cos θ0 + m sin θ sin θ0 } p(θθ0 . 2 . Now we can write the multivariate Gaussian distribution on the form (2.57 Starting from (2. The second term is already an inner product and can be kept as is.57 from which it follows that h(λ) = 1 g(a.43). b) = b Γ(a) λ ln λ −b .194). 2 2 The last term is indepedent of x but depends on μ and Σ and so should go into g(η). m sin θ0 (126) (127) (128) (129) 2.
0= with respect to hk to give .60 2.Solutions 2. ln p(xn ) = n=1 n=1 ln hj (n) . 2. the normalization constraint becomes i hi Δi = 1.226).58 Taking the ﬁrst derivative of (2.58–2. as in the text. Multiplying both sides by hk . which has volume Δi . where the notation j(n) denotes that data point xn falls within region j. Since p(x) has the constant value hi over region i. Introducing a Lagrange multiplier λ we then minimize the function N ln hj (n) + λ n=1 i hi Δ i − 1 nk + λΔk hk where nk denotes the total number of data points falling within region k. Thus the log likelihood function takes the form N N 1 σ 1 σ σ σ f (y) dx dy dy f (y)σ dy f (y) dy = 1. 2.60 The value of the density p(x) at a point xn is given by hj (n) .59 1 f σ x σ dx = = = since f (x) integrates to 1. −∇ ln g(η) = g(η) Taking the gradient again gives −∇∇ ln g(η) = g(η) h(x) exp η T u(x) u(x)u(x)T dx h(x) exp η T u(x) u(x) dx h(x) exp η T u(x) u(x) dx 61 +∇g(η) = E[u(x)u(x)T ] − E[u(x)]E[u(x)T ] = cov[u(x)] where we have used the result (2. We now need to take account of the constraint that p(x) must integrate to unity.226) we obtain. summing over k and making use of the normalization constraint.
Multiplying both sides by h_k, summing over k and making use of the normalization constraint, we obtain λ = −N. Eliminating λ then gives our final result for the maximum likelihood solution for h_k in the form

    h_k = \frac{n_k}{N} \frac{1}{\Delta_k}.

Note that, for equal sized bins \Delta_k = \Delta, we obtain a bin height h_k which is proportional to the fraction of points falling within that bin, as expected.

2.61 From (2.246) we have

    p(x) = \frac{K}{N V(\rho)}

where V(\rho) is the volume of a D-dimensional hypersphere with radius \rho, and \rho in turn is the distance from x to its K-th nearest neighbour in the data set. Thus, if we consider sufficiently large values for the radial coordinate r, we have p(x) \propto r^{-D}. If we consider the integral of p(x) and note that, in polar coordinates, the volume element dx can be written as r^{D-1} dr, we get

    \int p(x) \, dx \propto \int r^{-D} r^{D-1} \, dr = \int r^{-1} \, dr

which diverges logarithmically.

Chapter 3 Linear Models for Regression

3.1 NOTE: In PRML, there is a 2 missing in the denominator of the argument to the 'tanh' function in equation (3.102).

Using (3.6), we have

    2\sigma(2a) - 1 = \frac{2}{1 + e^{-2a}} - 1
                    = \frac{2 - 1 - e^{-2a}}{1 + e^{-2a}}
                    = \frac{1 - e^{-2a}}{1 + e^{-2a}}
                    = \frac{e^{a} - e^{-a}}{e^{a} + e^{-a}}
                    = \tanh(a).
If we now take a_j = (x - \mu_j)/2s, we can rewrite (3.101) as

    y(x, w) = w_0 + \sum_{j=1}^{M} w_j \sigma(2a_j)
            = w_0 + \sum_{j=1}^{M} \frac{w_j}{2} \left( 2\sigma(2a_j) - 1 + 1 \right)
            = u_0 + \sum_{j=1}^{M} u_j \tanh(a_j),

where u_j = w_j/2, for j = 1, \ldots, M, and u_0 = w_0 + \sum_{j=1}^{M} w_j/2.

3.2 We first write

    \Phi (\Phi^{\rm T}\Phi)^{-1} \Phi^{\rm T} v = \Phi \tilde{v} = \varphi_1 \tilde{v}^{(1)} + \varphi_2 \tilde{v}^{(2)} + \ldots + \varphi_M \tilde{v}^{(M)}

where \varphi_m is the m-th column of \Phi and \tilde{v} = (\Phi^{\rm T}\Phi)^{-1}\Phi^{\rm T} v. From (3.15), we see that

    y = \Phi w_{\rm ML} = \Phi (\Phi^{\rm T}\Phi)^{-1}\Phi^{\rm T} t

corresponds to a projection of t onto the space S spanned by the columns of \Phi. To see that this is indeed an orthogonal projection, we first note that, for any column of \Phi, \varphi_j,

    \Phi (\Phi^{\rm T}\Phi)^{-1}\Phi^{\rm T} \varphi_j = \left[ \Phi (\Phi^{\rm T}\Phi)^{-1}\Phi^{\rm T}\Phi \right]_j = \varphi_j

and therefore

    (y - t)^{\rm T} \varphi_j = (\Phi w_{\rm ML} - t)^{\rm T} \varphi_j = t^{\rm T} \left( \Phi (\Phi^{\rm T}\Phi)^{-1}\Phi^{\rm T} - I \right)^{\rm T} \varphi_j = 0,

and thus (y - t) is orthogonal to every column of \Phi and hence is orthogonal to S.

3.3 If we define R = diag(r_1, \ldots, r_N) to be a diagonal matrix containing the weighting coefficients, then we can write the weighted sum-of-squares cost function in the form

    E_D(w) = \frac{1}{2} (t - \Phi w)^{\rm T} R (t - \Phi w).

Setting the derivative with respect to w to zero, and rearranging, then gives

    w^{\star} = \left( \Phi^{\rm T} R \Phi \right)^{-1} \Phi^{\rm T} R t

which reduces to the standard solution (3.15) for the case R = I. If we compare (3.104) with (3.10)–(3.12), we see that r_n can be regarded as a precision (inverse variance) parameter, particular to the data point (x_n, t_n), that either replaces or scales \beta.
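As a quick numerical illustration of Solution 3.3, the following sketch (a hypothetical example of my own, not part of the original solutions; all variable names and data are invented) checks that w^{\star} = (\Phi^{\rm T} R \Phi)^{-1} \Phi^{\rm T} R t has vanishing gradient of the weighted error, and that positive integer weights r_n behave like r_n replicated observations of the corresponding data point.

# Hypothetical check of the weighted least-squares solution of Solution 3.3.
import numpy as np

rng = np.random.default_rng(0)
N, M = 20, 4
Phi = rng.normal(size=(N, M))                   # design matrix
t = rng.normal(size=N)                          # targets
r = rng.integers(1, 4, size=N).astype(float)    # positive weights r_n
R = np.diag(r)

w_star = np.linalg.solve(Phi.T @ R @ Phi, Phi.T @ R @ t)

# Gradient of E_D(w) = 0.5 (t - Phi w)^T R (t - Phi w) vanishes at w_star.
grad = -Phi.T @ R @ (t - Phi @ w_star)
assert np.allclose(grad, 0.0, atol=1e-9)

# Integer r_n is equivalent to replicating row n of Phi and t r_n times.
Phi_rep = np.repeat(Phi, r.astype(int), axis=0)
t_rep = np.repeat(t, r.astype(int))
w_rep, *_ = np.linalg.lstsq(Phi_rep, t_rep, rcond=None)
assert np.allclose(w_star, w_rep)
print("weighted LS solution matches replicated-data solution:", w_star)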
From (3.105). tn ). this becomes particularly clear if we consider (3. while for the third term we get ⎡ ⎤ 2 E⎣ D wi i=1 ni ⎦= D 2 wi σ 2 i=1 since the ni are all independent with variance σ 2 . 3.106) {yn − tn } 2 = 2 yn − 2yn tn + t2 n n=1 N = ⎧ ⎨ y 2 + 2yn ⎩ n n=1 D D 2 wi i=1 D ni + i=1 wi ni −2tn yn − 2tn i=1 wi ni + t2 n ⎫ ⎬ ⎭ . 1 2 D 2 wi σ 2 . From this and (3.104) with rn taking positive integer values. since E[ ni ] = 0.4 Alternatively.64 Solution 3. we see that the second and ﬁfth terms disappear. w) and we then deﬁne E = 1 2 1 2 1 2 N n=1 N ni ni ∼ N (0. although it is valid for any rn > 0. rn can be regarded as an effective number of replicated observations of data point (xn .4 Let D yn = w0 + i=1 D wi (xni + wi i=1 ni ) = yn + where yn = y(xn . If we take the expectation of E under the distribution of ni . i=1 .106) we see that E E = ED + as required. σ 2 ) and we have used (3.
6 3.6 We ﬁrst write down the log likelihood function which is given by 1 N ln L(W. Since we are ﬁnding a joint maximum with respect to both W and Σ we see that it is WML which appears in this expression.29) we see immediately that they are identical in their dependence on w. as in the standard result for an unconditional Gaussian distribution.5 We can rewrite (3.5–3. Denoting the resulting value of w by w (λ). Clearly this does not affect the constraint.11). n=1 as required. The maximum likelihood solution for Σ is easily found by appealing to the standard result from Chapter 2 giving Σ= 1 N N T T (tn − WML φ(xn ))(tn − WML φ(xn ))T . .12) to obtain the Lagrangian function L(w. we see that the value of η is given by M η= j =1 wj (λ)q .30) as 1 2 65 M j =1 wj q − η 0 where we have incorporated the 1/2 scaling factor for convenience.15) as required. Now suppose we choose a speciﬁc value of λ > 0 and minimize (3. 3. we can combine this with (3.Solutions 3. λ) = 1 2 N {tn − wT φ(xn )}2 + n=1 λ 2 M j =1 wj q − η and by comparing this with (3. Multiplying through by Σ and introducing the design matrix Φ and the target data matrix T we have ΦT ΦW = ΦT T Solving for W then gives (3. Employing the technique described in Appendix E. and using the KKT condition (E. Σ) = − ln Σ − 2 2 N n=1 (tn − WT φ(xn ))T Σ−1 (tn − WT φ(xn )). giving N 0=− n=1 Σ−1 (tn − WT φ(xn ))φ(xn )T .29). First of all we set the derivative with respect to W equal to zero.
51) when completing the square in the last step. we obtain a posterior of the form p(wtN +1 . are given by (3. mN . SN ) β exp − (tN +1 − wT φN +1 )2 2 (130) where φN +1 = φ(xN +1 ). SN ) 1 1 ∝ exp − (w − mN )T S−1 (w − mN ) − β(tN +1 − wT φN +1 )2 .10) and (3. The ﬁrst exponential corrsponds to the posterior. unnormalized Gaussian distribution over w.7 From Bayes’ theorem we have p(wt) ∝ p(tw)p(w). β −1 N (wm0 .50) and (3.66 Solutions 3. S0 ) β ∝ exp − (t − Φw)T (t − Φw) 2 1 exp − (w − m0 )T S−1 (w − m0 ) 0 2 = exp − 1 T −1 w S0 + βΦT Φ w − βtT Φw − βwT ΦT t + βtT t 2 mT S−1 w − wT S−1 m0 + mT S−1 m0 0 0 0 0 0 = exp − 1 T −1 w S0 + βΦT Φ w − S−1 m0 + βΦT t 0 2 T w −wT S−1 m0 + βΦT t + βtT t + mT S−1 m0 0 0 0 1 T = exp − (w − mN ) S−1 (w − mN ) N 2 exp − 1 βtT t + mT S−1 m0 − mT S−1 mN 0 0 N N 2 where we have used (3.8 3. respectively. w) = β 2π 1 /2 p(w) = N (wmN .8 Combining the prior and the likelihood p(tN +1 xN +1 .7–3. N 2 2 . 3.h.s. xN +1 . Writing this out in full. we get N p(wt) ∝ n=1 N tn wT φ(xn ).48). while the second exponential is independent of w and hence can be absorbed into the normalization factor. where the factors on the r.
10 Using (3.3).11 From (3. 3. 3. xN +1 .113).113) with (3. N N N where const denotes remaining terms independent of w.11 67 We can expand the argument of the exponential.110) we get SN +1 = S−1 + βφN +1 φT +1 N N SN φN +1 β 1/2 β 1/2 φT +1 SN N = SN − = SN − 1 + βφT +1 SN φN +1 N 1 + βφT +1 SN φN +1 N βSN φN +1 φT +1 SN N . SN +1 ). t. SN +1 ) where SN +1 and mN +1 are given by (131) and (132).9 Identifying (2. .Solutions 3. omitting the −1/2 factors.49) and (2. we can rewrite (3. SN ) dw. (3. as follows (w − mN )T S−1 (w − mN ) + β(tN +1 − wT φN +1 )2 N = wT S−1 w − 2wT S−1 mN N N + βwT φT +1 φN +1 w − 2βwT φN +1 tN +1 + const N = wT (S−1 + βφN +1 φT +1 )w − 2wT (S−1 mN + βφN +1 tN +1 ) + const.117) directly give p(wtN +1 .49).115). we obtain the desired result directly from (2. mN . (131) (132) 3. β) = N (tφ(x)T w. SN ) = N (wmN +1 . p(wtN +1 . xN +1 ) = N (wmN +1 . such that A ⇒ φ(xN +1 )T = φT +1 N By matching the ﬁrst factor of the integrand with (2.59) we have 2 σN +1 (x) = 1 + φ(x)T SN +1 φ(x) β −1 (133) where SN +1 is given by (131). N x⇒w y ⇒ tN +1 (2. μ ⇒ mN Λ−1 ⇒ SN b⇒0 L−1 ⇒ βI.8) and (3. N N +1 N mN +1 = SN +1 (S−1 mN + βφN +1 tN +1 ). respectively.114) and the second factor with (2.116) and (2.9–3.114) with (130). α. From (131) and (3. From this we can read off the desired result directly.57) as p(tx. with and S−1 = S−1 + βφN +1 φT +1 . β −1 )N (wmN .
β −1 ) 1 β M ln β − ln S0  − (w − m0 )T S−1 (w − m0 ) = 0 2 2 2 −b0 β + (a0 − 1) ln β + N β ln β − 2 2 N {wT φ(xn ) − tn }2 + const. t) is a Gaussian distribution with mean and covariance given by mN βS−1 N = SN S−1 m0 + ΦT t 0 = β S−1 0 +Φ Φ .12 Using this and (3. 0 0 2 Thus we see that p(wβ. We have β ln p(wβ. n=1 Using the product rule. βt) = p(wβ.68 Solution 3. respectively. the numerator and denominator of the second term in 2 2 (134) will be nonnegative and positive. the posterior distribution can be written as p(w. β) + n=1 ln p(tn wT φ(xn ). βt) = ln p(w. The log of the posterior distribution is given by N ln p(w. t)p(βt). t).12 It is easiest to work in log space. 3. Consider ﬁrst the dependence on w. T (135) (136) To ﬁnd p(βt) we ﬁrst need to complete the square over w to ensure that we pick up all terms involving β (any terms independent of β may be discarded since these will be absorbed into the normalization coefﬁcient which itself will be found by inspection at the end). Thus β β ln p(βt) = − mT S−1 m0 + mT S−1 mN 2 0 0 2 N N + N β ln β − b0 β + (a0 − 1) ln β − 2 2 N t2 + const. and hence σN +1 (x) σN (x). n n=1 . We also need to remember that a factor of (M/2) ln β will be absorbed by the normalisation factor of p(wβ. we can rewrite (133) as 2 σN +1 (x) = βSN φN +1 φT +1 SN 1 N + φ(x)T SN − T β 1 + βφN +1 SN φN +1 βφ(x)T SN φN +1 φT +1 SN φ(x) N 1 + βφT +1 SN φN +1 N . t) = − wT ΦT Φ + S−1 w + wT βS−1 m0 + βΦT t + const.59). φ(x) (134) 2 = σN (x) − Since SN is positive deﬁnite.
μ ⇒ mN Λ−1 ⇒ SN b⇒0 L−1 ⇒ β −1 .3). A ⇒ φ(x)T = φT 3. β −1 N wmN . β −1 1 + φT (S0 + φT φ)−1 φ Substituting this back into (139) we get p(tx. β −1 SN dw Gam (βaN .14 69 We recognize this as the log of a Gamma distribution. such that x⇒w y⇒t (2. t) = where we have deﬁned s = 1 + φT (S0 + φT φ)−1 φ. Identifying (2.114) with (3.115) and (136) give p(tβ) = N tφT mN . t) = St (tμ. β −1 s Gam (βaN . ν) where μ = φT mN λ= aN −1 s bN ν = 2aN . bN ) dβ. N tφT mN .14 For α = 0 the covariance matrix SN becomes SN = (βΦT Φ)−1 . λ.3. Reading off the coefﬁcients of β and ln β we then have aN bN = a0 + = b0 + N 2 1 2 mT S−1 m0 − mT S−1 mN + 0 0 N N N (137) t2 n n=1 . X.Solutions 3. t) = N tφ(x)T w.8).2. (140) .13–3. bN ) dβ (139) We begin by performing the integral over w. using (3. β −1 + φT SN φ = N tφT mN . .113) with (3.49) and (2.13 Following the line of presentation from Section 3. We can now use (2. the predictive distribution is now given by p(tx. X. (138) 3.160) to obtain the ﬁnal result: p(tx.158)– (2.
xn ) = n=1 n=1 M ψ(x) ψ(xn ) = n=1 i=1 T ψi (x)ψi (xn ) = i=1 ψi (x)δi1 = ψ1 (x) = 1 which proves the summation constraint as required. it follows that N N ψi (xn )ψ1 (xn ) = n=1 n=1 ψi (xn ) = δi1 where we have used ψ1 (x) = 1. the covariance matrix then becomes SN = β −1 (ΦT Φ)−1 = β −1 (V−T ΨT ΨV−1 )−1 = β −1 VT V.70 Solution 3. Now consider the sum N N N M k(x. and setting j = 1. Since both the original and the new basis functions are linearly independent and span the same space. this matrix must be invertible and hence φ(x) = V−1 ψ(x). Hence the equivalent kernel becomes k(x.16) give Ψ = ΦVT and consequently Φ = ΨV−T where V−T denotes (V−1 )T . (141) and (3. Orthonormality implies ΨT Ψ = I. x ) = βφ(x)T SN φ(x ) = φ(x)T VT Vφ(x ) = ψ(x)T ψ(x ) as required. From the orthonormality condition. Here we have used the orthonormality of the ψi (x). Note that (V−1 )T = (VT )−1 as is easily veriﬁed. From (140). . For the data set {xn }.14 Let us deﬁne a new set of orthonormal basis functions given by linear combinations of the original basis functions so that ψ(x) = Vφ(x) (141) where V is an M × M matrix.
Next consider the quadratic term in t and make use of the identity (C.113) with the prior distribution p(w) = N (w0.14) for the determinant we have β −1 IN + α−1 ΦΦT where we have used (3.114) with (142). giving β α 2 t − ΦmN + mT mN E(mN ) = 2 2 N γ N N −γ + = . (143) 2 = β −N IN + βα−1 ΦΦT = β −N IM + βα−1 ΦT Φ = β −N α−M αIM + βΦT Φ = β −N α−M A Taking the log we obtain ln p(tα. such that x⇒w (2. β) = N (tΦw.7) together with (3. = 2 2 2 3.81).84) to give 1 −1 t − t β −1 IN + α−1 ΦΦT 2 1 = − tT βIN − βΦ αIM + βΦT Φ 2 β2 β = − tT t + tT ΦA−1 ΦT t 2 2 1 T β T = − t t + mN AmN 2 2 β α 2 t − ΦmN − mT mN = − 2 2 N −1 ΦT β t . y⇒t A⇒Φ b⇒0 p(tα. (142) Identifying (2. β) = N (t0.16 71 3. β −1 IN ).92) and (3.81) and (3.16 The likelihood function is a product of independent univariate Gaussians and so can be written as a joint Gaussian distribution over t with diagonal covariance matrix in the form p(tw.15–3.Solutions 3.15 This is easily shown by substituting the reestimation formulae (3.82). N 1 ln(2π) − ln β −1 IN + α−1 ΦΦT 2 2 1 − tT β −1 IN + α−1 ΦΦT t.95) into (3. α−1 I) and (2. β) = − Using the result (C. β −1 IN + α−1 ΦΦT ).115) gives μ ⇒ 0 Λ−1 ⇒ α−1 IM L−1 ⇒ β −1 IN .
in the last line.12) and (3.43).h.81). Substituting for the determinant and the quadratic term in (143) we obtain (3.80) and so it remains to show that the ﬁrst term equals the r. We now use the tricks of adding 0 = mT AmN − mT AmN and using I = A−1 A.s.11).17–3. β)p(wα) dw β 2π β 2π N/2 α 2π α 2π M/2 α exp (−βED (w)) exp − wT w dw 2 exp (−E(w)) dw. we have exploited results from Solution 3.18 We can rewrite (3. β) = = = p(tw. as follows: N N 1 βtT t − 2βtT Φw + wT Aw 2 1 = βtT t − 2βtT ΦA−1 Aw + wT Aw 2 1 = βtT t − 2mT Aw + wT Aw + mT AmN − mT AmN N N N 2 1 1 = βtT t − mT AmN + (w − mN )T A(w − mN ). N/2 M/2 where E(w) is deﬁned by (3. To do this. combined with (3.18 where in the last step.86). (3. N 2 2 Here the last term equals term the last term of (3.77) as follows: p(tα.52) together with the deﬁnition for the Gaussian.17 Using (3. we can rewrite (3. of (3.84).79). (2. we use the same tricks again: 1 1 βtT t − mT AmN = βtT t − 2mT AmN + mT AmN N N N 2 2 1 = βtT t − 2mT AA−1 ΦT tβ + mT αI + βΦT Φ mN N N 2 1 = βtT t − 2mT ΦT tβ + βmT ΦT ΦmN + αmT mN N N N 2 1 = β(t − ΦmN )T (t − ΦmN ) + αmT mN N 2 β α 2 t − ΦmN + mT mN = 2 2 N . we have used (3. 3.79) β α 2 t − Φw + wT w 2 2 β T α = t t − 2tT Φw + wT ΦT Φw + wT w 2 2 1 T T T βt t − 2βt Φw + w Aw = 2 where.18.82).72 Solutions 3. 3.
85) is an unnormalized Gaussian and hence integrates to the inverse of the corresponding normalizing constant.89).Solutions 3.82).s. The expression for γ in (3.86). we ﬁnd that the eigenvectors of A to be α+λi .92) via (3.21 as required. symmetric matrix A can be written as Aui = ηi ui . yielding 1 2 M + mT mN N α .81). Following the sequence of steps in Section 3.90) by substituting M i λi + α λi + α for M and rearranging. which can be read off from the r. (3.21 The eigenvector equation for the M × M real. β) = = M N (ln α − ln(2π)) + (ln β − ln(2π)) + ln exp{−E(w)} dw 2 2 N M 1 M (ln α − ln(2π)) + (ln β − ln(2π)) − E(mN ) − ln A + ln(2π) 2 2 2 2 which equals (3. third and fourth terms. 3.43) as (2π) M/2 A−1 1/2 . from which we get (3.91) is obtained from (3. which are the ﬁrst. 1 − ln A. Using (3.90). The derivatives for the ﬁrst and third term of (3. We combine these results into (3.86) that depend on α.19–3. of (2. we start with the last of these terms.86) are more easily obtained using standard derivatives and (3.47) and the properties of the logarithm to take us from the left to the right side of (3.20 We only need to consider the terms of (3.88).19 From (3.87) and the fact that that eigenvectors ui are orthonormal (see also Appendix C).85) and the properties of the logarithm.2.80) we see that the integrand of (3. 3.5. 73 3. we get ln p(tα. (3.78). 2 From (3.h. We can then use (C.
Using the properties of the trace and the orthognality of . Taking the derivative with respect to some scalar α we obtain d ln A = dα M i=1 1 d ηi .117) in terms of the eigenvector expansion and show that it takes the same form. and the M eigenvalues {ηi } are all real. First we note that A can be expanded in terms of its own eigenvectors to give M A= i=1 ηi ui uT i and similarly the inverse can be written as A Thus we have Tr A −1 −1 M = i=1 1 ui uT .74 Solution 3. ηi dα We now express the right hand side of (3. The log of the determinant of A can be written as M M ln A = ln i=1 ηi = i=1 ln ηi .21 where {ui } are a set of M orthonormal vectors.117) in terms of the eigenvalues of A. i ηi d A dα M M = Tr i=1 1 d ui uT i ηi dα M ηj uj uT j j =1 = Tr i=1 M 1 ui uT i ηi 1 ui uT i ηi M M j =1 M j =1 dηj uj uT + ηj bj uT + uj bT j j j dα = Tr i=1 dηj uj uT j dα M +Tr i=1 1 ui uT i ηi ηj bj uT + uj bT j j j =1 (144) where bj = duj / dα. We ﬁrst express the left hand side of (3.
ηi dα We have now shown that both the left and right hand sides of (3.t.86) w. yielding d ln p(tαβ) = dα = = M 1 1 d 1 − mT mN − Tr A−1 A 2 α 2 N 2 dα 1 2 1 2 M − mT mN − Tr A−1 N α M − mT mN − N α 1 λi + α i which we recognize as the r.Solution 3.89). we can rewrite the second term as M 75 Tr i=1 1 ui uT i ηi M M ηj bj uT + uj bT j j j =1 M = Tr i=1 M 1 ui uT i ηi M 2ηj uj bT j j =1 = Tr i=1 j =1 M 2ηj ui uT uj bT i j ηi = Tr i=1 b j u T + uj b T j j d dα M M = Tr However. ui uT = I i i which is constant and thus its derivative w. we use (3. immediately following (3.89). from which (3.21 eigenvectors.117) take the same form when expressed in terms of the eigenvector expansion.h. we again use the properties of the trace and the orthognality of eigenvectors to obtain Tr A−1 d A dα M = i=1 1 dηi .s.r. Next. α will be zero and the second term in (144) vanishes. For the ﬁrst term in (144).2.r.5. α.92) can be derived as detailed in Section 3. ui uT i i . of (3.117) to differentiate (3. .t.
we obtain (3. ﬁrst over w and then β. of (3.22 Using (3. Now we are ready to do the integration. using mN = SN S−1 m0 + ΦT t 0 S−1 = β S−1 + ΦT Φ 0 N N aN = a0 + 2 bN 1 = b0 + 2 mT S−1 m0 0 0 − mN S−1 mN N T N + n=1 t2 n .94). we get p(t) = = p(tw.82) and (3.23 From (3.10).2—we get the derivative of (3.76 Solutions 3.s.86) w.22–3.23 3.r. and rearrange the .95). Rearranging this.t. collecting the βdependent terms on one side of the equation and the remaining term on the other.h.93)—the derivation of latter is detailed in Section 3.112) and the properties of the Gaussian and Gamma distributions (see Appendix B). (3. 3. β)p(wβ) dwp(β) dβ β 2π β 2π N/2 β exp − (t − Φw)T (t − Φw) 2 β S0 −1/2 exp − (w − m0 )T S−1 (w − m0 ) 0 2 dw M/2 Γ(a0 )−1 ba0 β a0 −1 exp(−b0 β) dβ 0 = ba0 0 ((2π)M +N S β exp − (t − Φw)T (t − Φw) 2 dw 0 ) 1 /2 β exp − (w − m0 )T S−1 (w − m0 ) 0 2 β a0 −1 β N/2 β M/2 exp(−b0 β) dβ = ba0 0 ((2π)M +N S0 ) β exp − tT t + mT S−1 m0 − mT S−1 mN 0 0 N N 2 β aN −1 β M/2 exp(−b0 β) dβ 1 /2 β exp − (w − mN )T S−1 (w − mN ) N 2 dw where we have completed the square for the quadratic form in w. β as the r.5.
(2π)N/2 S0 1/2 baN Γ(a0 ) N 3. bN ) (145) Using the deﬁnitions of the Gaussian and Gamma distributions.Solution 3.h. (3.24 terms to obtain the desired result 77 p(t) = = ba0 0 ((2π)M +N S0 ) 1 /2 (2π)M/2 SN 1/2 β aN −1 exp(−bN β) dβ SN 1/2 ba0 Γ(aN ) 1 0 . we can write this as β 2π N/2 exp − β 2π β 2π β t − Φw 2 2 M/2 β S0 1/2 exp − (w − m0 )T S−1 (w − m0 ) 0 2 Γ(a0 )−1 ba0 β a0 −1 exp(−b0 β) 0 β SN 1/2 exp − (w − mN )T S−1 (w − mN ) N 2 Γ(aN )−1 baN β aN −1 exp(−bN β) N −1 M/2 .s.e. β −1 S0 ) Gam (βa0 . (146) Concentrating on the factors corresponding to the denominator in (145). we get p(t) = N (tΦw.24 Substituting the r. i.119). b0 ) . of (3. N (wmN .10). the fac .112) and (3. β −1 I) N (wm0 .113) into (3. β −1 SN ) Gam (βaN .
If {xn } and {ym } also were to be linearly separable.})−1 in (146). Then there exist a point z such that αn xn = βm ym z= n m where βm 0 for all m and m βm = 1. β −1 SN Gam (βaN .78 Solution 4. . Chapter 4 Linear Models for Classiﬁcation 4. the exponential factors along with β a0 +N/2−1 (β/2π)M/2 cancel and we are left with (3. we can use (135)–(138) to get N wmN . . but by the corresponding argument wT z + w0 = m βm wT ym + w0 = m βm (wT ym + w0 ) < 0. N Substituting this into (146).1 Assume that the convex hulls of {xn } and {ym } intersect. we would have that wT z + w0 = n αn wT xn + w0 = n αn ( since wT xn + w0 > 0 and the {αn } are all nonnegative and sum to 1. bN ) = β 2π M/2 SN 1/2 exp − β T −1 w SN w − wT S−1 mN − mT S−1 w N N N 2 +mT S−1 mN N N = β 2π M/2 Γ(aN )−1 baN β aN −1 exp(−bN β) N β T −1 w S0 w + wT ΦT Φw − wT S−1 m0 0 2 SN 1/2 exp − −wT ΦT t − mT S−1 w − tT Φw + mT S−1 mN N N 0 N Γ(aN )−1 baN β a0 +N/2−1 N exp − b0 + = β 2π M/2 1 mT S−1 m0 − mT S−1 mN + tT t 0 0 N N 2 β 2 SN 1/2 exp − β (w − m0 )T S0 (w − m0 ) + t − Φw 2 Γ(aN )−1 baN β aN +N/2−1 exp(−b0 β).118). .1 tors inside {.
we make the contribution of the bias weights explicit in (4. y(x ) = WT x + w0 ¯ = WT x + ¯ − WT x t = ¯ − TT X † t If we apply (4. W to zero we get W = (XT X)−1 XT T = X† T. Setting this to zero. 2 (147) where w0 is the column vector of bias weights (the top row of W transposed) and 1 is a column vector of N ones. giving 2N w0 + 2(XW − T)T 1. N (148) If we subsitute (148) into (147).2 79 which is a contradiction and hence {xn } and {ym } cannot be linearly separable if their convex hulls intersect. 4. x Setting the derivative of this w. Thus no such point can exist and the intersection of the convex hulls of {xn } and {ym } must be empty.2 For the purpose of this exercise. If we instead assume that {xn } and {ym } are linearly separable and consider a point z in the intersection of their convex hulls. We can take the derivative of (147) w.t. where we have deﬁned X = X − X and T = T − T. Now consider the prediction for a new input vector x . w0 .157) to ¯ we get t.r.Solution 4.r. we get ED (W) = where 1 Tr (XW + T − XW − T)T (XW + T − XW − T) . t aT ¯ = 1 T T a T 1 = −b. the same contradiction arise. we obtain ¯ w0 = ¯ − WT x t where ¯ = 1 TT 1 t N and ¯ x= 1 T X 1. (149) . N T ¯ (x − x). 2 T = 1¯T t and X = 1¯ T .t.15). and solving for w0 . giving ED (W) = 1 T T Tr (XW + 1w0 − T)T (XW + 1w0 − T) .
4 NOTE: In PRML.80 Solutions 4. we obtain aT y(x ) = aT¯ + aT TT X† t = aT¯ = −b. Thus Ay(x ) + b = 0.3–4.s. t since aT TT = aT (T − T)T = b(1 − 1)T = 0T . From (4. (4.23) and (4.157) becomes Atn + b = 0. If we apply (150) to (149). 4. Taking the gradient of L we obtain ∇L = m2 − m1 + 2λw and setting this gradient to zero gives w=− 1 (m2 − m1 ) 2λ (151) T ¯ (x − x) form which it follows that w ∝ m2 − m1 .157) to (149).25).5 Therefore.5 Starting with the numerator on the r. of (4. applying (4.h. we obtain Ay(x ) = A¯ − ATT X† t = A¯ = −b. (150) T ¯ (x − x) where A is a matrix and b is a column vector such that each row of A and element of b correspond to one linear constraint. we can use (4.23) where it should refer to (4. t since ATT = A(T − T)T = b1T − b1T = 0T .3 When we consider several simultaneous constraints. 4.27) to rewrite it as follows: (m2 − m1 )2 = wT (m2 − m1 ) 2 = wT (m2 − m1 )(m2 − m1 )T w = wT SB w.22) we can construct the Lagrangian function L = wT (m2 − m1 ) + λ wT w − 1 . 4. the text of the exercise refers equation (4. (152) .22).
we can use (4.25): s2 + s2 1 2 = n∈C1 (yn − m1 )2 + wT (xn − n∈C1 T (yk − m2 )2 wT (xk − m2 ) k∈C2 2 = = n∈C1 k∈C2 2 m1 ) + w (xn − m1 )(xn − m1 )T w + k∈C2 wT (xk − m2 )(xk − m2 )T w (153) = w SW w.24).h.Solution 4.34) along with the chosen target coding scheme. we can rewrite the l.25) we obtain (4.33) as follows: N N wT xn − w0 − tn xn = n=1 N n=1 wT xn − wT m − tn xn = n=1 xn xT − xn mT w − xn tn n xn xT − xn mT w − xn tn n n∈C1 = xm xT − xm mT w − xm tm m m∈C2 = n∈C1 xn xT − N1 m1 mT n w − N1 m1 N N1 N N2 w (154) xm xT − N2 m2 mT m m∈C2 w + N2 m2 = n∈C1 xn xT + n m∈C2 xm xT − (N1 m1 + N2 m2 )mT m −N (m1 − m2 ).26).6 81 Similarly. (4.20).6 Using (4. . of (4.h.21) and (4. 4.s. T Substituting (152) and (153) in (4.s.23). and (4. (4. of (4.28) to rewrite the denominator of the r.
From (4.33). and hence we obtain (4. 4. this must equal zero.7 From (4. = a −a 1+e e +1 1 − σ(a) = 1 − = . SW + N SW + w where in the last line we also made use of (4.27).37).28) and (4.59) we have 1 + e−a − 1 1 = 1 + e−a 1 + e−a −a e 1 = σ(−a).36) to rewrite (154) as SW + N 1 m 1 m T + N 2 m 2 m T 1 2 −(N1 m1 + N2 m2 ) = SW + N 1 − + N2 − = SW + + = 2 N1 N 1 (N1 m1 + N2 m2 ) w − N (m1 − m2 ) N N1 N2 (m1 mT + m2 m1 ) 2 N m1 mT − 1 2 N2 N m2 mT w − N (m1 − m2 ) 2 2 (N1 + N2 )N1 − N1 N1 N2 m1 mT − (m1 mT + m2 m1 ) 1 2 N N 2 (N1 + N2 )N2 − N2 m2 mT w − N (m1 − m2 ) 2 N = N2 N1 m1 mT − m1 mT − m2 m1 + m2 mT 1 2 2 N −N (m1 − m2 ) N2 N1 SB w − N (m1 − m2 ).7 We then use the identity (xi − mk ) (xi − mk ) i∈Ck T = i∈Ck xi xT − xi mT − mk xT + mk mT i k i k xi xT − Nk mk mT i k i∈Ck = together with (4.82 Solution 4.
8 Substituting (4.9 The likelihood function is given by N K p ({φn .8–4. we see that the normalizing constants cancel and we are left with a = ln = − exp − 1 (x − μ1 ) Σ−1 (x − μ1 ) p(C1 ) 2 exp − 1 (x − μ2 ) Σ−1 (x − μ2 ) p(C2 ) 2 1 xΣT x − xΣμ1 − μT Σx + μT Σμ1 1 1 2 −xΣT x + xΣμ2 + μT Σx − μT Σμ2 + ln 2 2 = (μ1 − μ2 ) Σ−1 x − T T T 1 T −1 μ Σ μ1 − μT Σμ2 2 2 1 p(C1 ) p(C2 ) p(C1 ) + ln . 4. with w and w0 given by (4.57) we obtain (4. tn }{πk }) = n=1 k=1 tnk {ln p(φn Ck ) + ln πk } .65).9 The inverse of the logistic sigmoid is easily found as follows y = σ(a) = ⇒ ⇒ ⇒ ln ln 1 1 + e−a 83 1 − 1 = e−a y 1−y = −a y y 1−y = a = σ −1 (y). (155) In order to maximize the log likelihood with respect to πk we need to preserve the constraint k πk = 1. respectively.67). tn }{πk }) = n=1 k=1 {p(φn Ck )πk } tnk and taking the logarithm. p(C2 ) Substituting this into the rightmost form of (4. we obtain N K ln p ({φn .64) into (4. This can be done by introducing a Lagrange multiplier λ and maximizing K ln p ({φn . 4. .66) and (4.58). tn }{πk }) + λ k=1 πk − 1 .Solutions 4.
tn }{πk }) = − 1 2 N K tnk ln Σ + (φn − μk )T Σ−1 (φ − μ) . obtained by using (C. Note that.t. (2. n=1 k=1 we can use (C.t.s. we get N K tnk Σ−1 (φn − μk ) = 0. . with Sk given by (4. but simply note that the solution is automatically symmetric.h.161). k Again making use of (156).34. we can rearrange this to obtain (4.84 Solution 4. as in Exercise 2. (156) Summing both sides over k we ﬁnd that λ = −N .19). (157) n=1 k=1 where we have dropped terms independent of {μk } and Σ.r. n=1 k=1 Making use of (156).163).s. πk N Rearranging then gives −πk λ = n=1 tnk = Nk . of (157) w.28) to calculate the derivative w.10 Setting the derivative with respect to πk equal to zero.160) into (155) and then use the deﬁnition of the multivariate Gaussian. Σ−1 . and using this to eliminate λ we obtain (4. Setting this to zero we obtain 1 2 N n=1 T tnk Σ − (φn − μn )(φn − μk )T = 0. μk .159).162). Rewriting the r. we can rearrange this to obtain (4.10 If we substitute (4.h. 4. we obtain ln p ({φn .43).24) and (C. Setting the derivative of the r. we obtain N n=1 tnk + λ = 0. of (157) as 1 − b 2 N K tnk ln Σ + Tr Σ−1 (φn − μk )(φ − μk )T . to zero. we do not enforce that Σ should be symmetric.r.
12 Differentiating (4.t.90) w.13 85 4. = σ(a) 4. kml where in turn {μkml } are the parameters of the multinomial models for φ. which is linear in φml . yn ∂E ∂yn 1 − tn tn − 1 − yn yn yn (1 − tn ) − tn (1 − yn ) = yn (1 − yn ) yn − yn tn − tn + yn tn = yn (1 − yn ) yn − tn .Solutions 4. = yn (1 − yn ) = (158) (159) (160) .r.11 The generative model for φ corresponding to the chosen coding scheme is given by M p (φ  Ck ) = m=1 p (φm  Ck ) where p (φm  Ck ) = L l=1 μφml . Substituting this into (4.11–4.13 We start by computing the derivative of (4. 4.59) we obtain dσ da = e−a 2 (1 + e−a ) e−a = σ(a) 1 + e−a 1 + e−a 1 − −a 1+e 1 + e−a = σ(a)(1 − σ(a)).63) we see that ak = ln p (φ  Ck ) p (Ck ) M = ln p (Ck ) + m=1 M ln p (φm  Ck ) L = ln p (Ck ) + m=1 l=1 φml ln μkml .
Combining (160). 4. we see that ∂yn ∂σ(an ) = = σ(an ) (1 − σ(an )) = yn (1 − yn ). Moreover. the likelihood maximized) when yn = σ (wT φn ) = tn for all n. wT φ. Now let w = w + λv ..14–4. when the magnitude of w goes to inﬁnity. any decision boundary separating the two classes will have the property 0 if tn = 1. This will be the case when the sigmoid function is saturated. w . ∂an ∂an Finally.e.14 If the data set is linearly separable. and thus ΦT RΦ is Now consider a Taylor expansion of E(w) around a minima. from (4. goes to ±∞. “concave” should be “convex” on the last line of the exercise. we have ∇an = φn (162) where ∇ denotes the gradient with respect to w. i.88). 4. which occurs when its argument. the diagonal elements of R will be strictly positive. wT φn < 0 otherwise. we obtain N (161) ∇E = n=1 N ∂E ∂yn ∇an ∂yn ∂an (yn − tn )φn = n=1 as required.e. Then vT ΦT RΦv = vT ΦT R1/2 R1/2 Φv = R1/2 Φv 1 /2 2 >0 where R1/2 is a diagonal matrix with elements (yn (1 − yn )) positive deﬁnite. . E(w) = E(w ) + 1 T (w − w ) H (w − w ) 2 where the linear term has vanished since w is a minimum.90) we see that the negative loglikelihood will be minimized (i. Assuming that the argument to the sigmoid function (4.86 Solutions 4. (161) and (162) using the chain rule..15 From (4.15 NOTE: In PRML.87) is ﬁnite.
106).108) we have tnk ∂E =− . Moreover. 4.Solutions 4.16 If the values of the {tn } were known then each data point for which tn = 1 would contribute p(tn = 1φ(xn )) to the log likelihood. n=1 (163) This can also be viewed from a sampling perspective by imagining sampling the value of each tn some number M times. the text of the exercise refers equation (4. From (4.104) we have ∂yk ∂ak ∂yk ∂aj = = − eak − ai ie eak eaj i eak ai ie 2 2 = yk (1 − yk ). A data point whose probability of having tn = 1 is given by πn will therefore contribute πn p(tn = 1φ(xn )) + (1 − πn )(1 − p(tn = 1φ(xn ))) and so the overall log likelihood for the data set is given by N πn ln p (tn = 1  φ(xn )) + (1 − πn ) ln (1 − p (tn = 1  φ(xn ))) . ∂ynk ynk . Combining these results we obtain (4. and then constructing the likelihood function for this expanded data set.18 where v is an arbitrary. H−1 exists and w = w must be the unique minimum. eai = −yk yj . ∂λ2 This shows that E(w) is convex.106). with probability of tn = 1 given by πn .17 From (4. In the limit M → ∞ we recover (163).16–4. at the minimum of E(w).18 NOTE: In PRML. j = k.91) where it should refer to (4. 4. and dividing by M . 4. nonzero vector in the weight space and consider ∂2E = vT Hv > 0. H (w − w ) = 0 87 and since H is positive deﬁnite. and each point for which tn = 0 would contribute 1 − p(tn = 1φ(xn )) to the log likelihood.
If we combine this with (4.106) using the chain rule, we get

    \frac{\partial E}{\partial a_{nj}} = \sum_{k=1}^{K} \frac{\partial E}{\partial y_{nk}} \frac{\partial y_{nk}}{\partial a_{nj}}
                                       = -\sum_{k=1}^{K} \frac{t_{nk}}{y_{nk}} \, y_{nk} (I_{kj} - y_{nj})
                                       = y_{nj} - t_{nj},

where we have used that \forall n : \sum_k t_{nk} = 1. If we combine this with (162), again using the chain rule, we obtain (4.109).

4.19 Using the cross-entropy error function (4.90), and following Exercise 4.13, we have

    \frac{\partial E}{\partial y_n} = \frac{y_n - t_n}{y_n (1 - y_n)}.   (164)

Also

    \nabla a_n = \phi_n.   (165)

From (4.115) and (4.116) we have

    \frac{\partial y_n}{\partial a_n} = \frac{\partial \Phi(a_n)}{\partial a_n} = \frac{1}{\sqrt{2\pi}} e^{-a_n^2}.   (166)

Combining (164), (165) and (166), we get

    \nabla E = \sum_{n=1}^{N} \frac{\partial E}{\partial y_n} \frac{\partial y_n}{\partial a_n} \nabla a_n
             = \sum_{n=1}^{N} \frac{y_n - t_n}{y_n (1 - y_n)} \frac{1}{\sqrt{2\pi}} e^{-a_n^2} \phi_n.   (167)

In order to find the expression for the Hessian, it is convenient to first determine

    \frac{\partial}{\partial y_n} \frac{y_n - t_n}{y_n (1 - y_n)}
      = \frac{y_n (1 - y_n)}{y_n^2 (1 - y_n)^2} - \frac{(y_n - t_n)(1 - 2 y_n)}{y_n^2 (1 - y_n)^2}
      = \frac{y_n^2 + t_n - 2 y_n t_n}{y_n^2 (1 - y_n)^2}.   (168)

Then using (165)–(168) we have

    \nabla\nabla E = \sum_{n=1}^{N} \left[ \frac{\partial}{\partial y_n}\!\left( \frac{y_n - t_n}{y_n (1 - y_n)} \right) \frac{1}{\sqrt{2\pi}} e^{-a_n^2} \phi_n \nabla y_n^{\rm T}
                     + \frac{y_n - t_n}{y_n (1 - y_n)} \frac{1}{\sqrt{2\pi}} e^{-a_n^2} (-2 a_n) \phi_n \nabla a_n^{\rm T} \right]
                   = \sum_{n=1}^{N} \left[ \frac{y_n^2 + t_n - 2 y_n t_n}{y_n (1 - y_n)} \frac{1}{\sqrt{2\pi}} e^{-a_n^2} - 2 a_n (y_n - t_n) \right] \frac{e^{-a_n^2} \, \phi_n \phi_n^{\rm T}}{\sqrt{2\pi} \, y_n (1 - y_n)}.
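The gradient formula (167) lends itself to a simple finite-difference check. The sketch below is my own illustration, not part of the original solutions; it assumes the standard-normal convention for the probit link, \Phi(a) = \frac{1}{2}(1 + \mathrm{erf}(a/\sqrt{2})), so that d\Phi/da is the standard normal density, and compares the resulting analytic gradient \sum_n \frac{y_n - t_n}{y_n(1-y_n)} \mathcal{N}(a_n|0,1)\,\phi_n with numerical derivatives of the cross-entropy error. All data and variable names are invented.

# Hypothetical finite-difference check of the probit-regression gradient.
import math
import numpy as np

def probit(a):
    return 0.5 * (1.0 + math.erf(a / math.sqrt(2.0)))

def std_normal_pdf(a):
    return math.exp(-0.5 * a * a) / math.sqrt(2.0 * math.pi)

def error(w, Phi, t):
    # cross-entropy error (4.90) with y_n = probit(w^T phi_n)
    E = 0.0
    for phi_n, t_n in zip(Phi, t):
        y_n = probit(float(w @ phi_n))
        E -= t_n * math.log(y_n) + (1.0 - t_n) * math.log(1.0 - y_n)
    return E

def grad(w, Phi, t):
    g = np.zeros_like(w)
    for phi_n, t_n in zip(Phi, t):
        a_n = float(w @ phi_n)
        y_n = probit(a_n)
        g += (y_n - t_n) / (y_n * (1.0 - y_n)) * std_normal_pdf(a_n) * phi_n
    return g

rng = np.random.default_rng(1)
Phi = rng.normal(size=(15, 3))
t = rng.integers(0, 2, size=15).astype(float)
w = rng.normal(size=3) * 0.1

eps = 1e-6
num = np.array([(error(w + eps * np.eye(3)[i], Phi, t)
                 - error(w - eps * np.eye(3)[i], Phi, t)) / (2 * eps)
                for i in range(3)])
assert np.allclose(num, grad(w, Phi, t), atol=1e-5)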
4.20 NOTE: In PRML, equation (4.110) contains an incorrect leading minus sign ('−') on the right hand side.

We first write out the components of the MK × MK Hessian matrix in the form

    \frac{\partial^2 E}{\partial w_{ki} \, \partial w_{jl}} = \sum_{n=1}^{N} y_{nk} (I_{kj} - y_{nj}) \, \phi_{ni} \phi_{nl}.

To keep the notation uncluttered, consider just one term in the summation over n and show that this is positive semidefinite. The sum over n will then also be positive semidefinite. Consider an arbitrary vector of dimension MK with elements u_{ki}. Then

    u^{\rm T} H u = \sum_{i,j,k,l} u_{ki} \, y_k (I_{kj} - y_j) \, \phi_i \phi_l \, u_{jl}
                  = \sum_{j,k} b_j \, y_k (I_{kj} - y_j) \, b_k
                  = \sum_k y_k b_k^2 - \left( \sum_k b_k y_k \right)^2

where

    b_k = \sum_i u_{ki} \phi_{ni}.

We now note that the quantities y_k satisfy 0 \leqslant y_k \leqslant 1 and \sum_k y_k = 1. Furthermore, the function f(b) = b^2 is a convex function. We can therefore apply Jensen's inequality to give

    \sum_k y_k b_k^2 = \sum_k y_k f(b_k) \geqslant f\!\left( \sum_k y_k b_k \right) = \left( \sum_k y_k b_k \right)^2

and hence u^{\rm T} H u \geqslant 0. Note that the equality will never arise for finite values of a_k, where a_k is the set of arguments to the softmax function. However, the Hessian can be positive semidefinite, since the basis vectors \phi_{ni} could be such as to have zero dot product for a linear subspace of vectors u_{ki}. In this case the minimum of the error function would comprise a continuum of solutions all having the same value of the error function.

4.21 NOTE: In PRML, equation (4.116) contains a minor typographical error: on the l.h.s., the Φ should not be in bold.

We consider the two cases where a \geqslant 0 and a < 0 separately. In the first case, we
can use (2.42) to rewrite (4.114) as

    \Phi(a) = \int_{-\infty}^{0} \mathcal{N}(\theta|0,1) \, d\theta + \int_{0}^{a} \frac{1}{\sqrt{2\pi}} \exp\!\left( -\frac{\theta^2}{2} \right) d\theta
            = \frac{1}{2} + \frac{1}{\sqrt{2\pi}} \int_{0}^{a} \exp\!\left( -\frac{\theta^2}{2} \right) d\theta
            = \frac{1}{2} \left\{ 1 + \frac{1}{\sqrt{2}} \, \mathrm{erf}(a) \right\},   (169)

where, in the last line, we have used (4.115). When a < 0, the symmetry of the Gaussian distribution gives Φ(a) = 1 − Φ(−a). Combining this with (169), we get

    \Phi(a) = 1 - \frac{1}{2} \left\{ 1 + \frac{1}{\sqrt{2}} \, \mathrm{erf}(-a) \right\} = \frac{1}{2} \left\{ 1 + \frac{1}{\sqrt{2}} \, \mathrm{erf}(a) \right\},

where we have used the fact that the erf function is antisymmetric, i.e., erf(−a) = −erf(a).

4.22 Starting from (4.136), using (4.135), we have

    p(D) = \int p(D|\theta) \, p(\theta) \, d\theta
         \simeq p(D|\theta_{\rm MAP}) \, p(\theta_{\rm MAP}) \int \exp\!\left\{ -\frac{1}{2} (\theta - \theta_{\rm MAP})^{\rm T} A \, (\theta - \theta_{\rm MAP}) \right\} d\theta
         = p(D|\theta_{\rm MAP}) \, p(\theta_{\rm MAP}) \, \frac{(2\pi)^{M/2}}{|A|^{1/2}},

where A is given by (4.138). Taking the logarithm of this yields (4.137).

4.23 NOTE: In PRML, the text of the exercise contains a typographical error. Following the equation, it should say that H is the matrix of second derivatives of the negative log likelihood.

The BIC approximation can be viewed as a large N approximation to the log model evidence. From (4.138), we have

    A = -\nabla\nabla \ln p(D|\theta_{\rm MAP}) \, p(\theta_{\rm MAP}) = H - \nabla\nabla \ln p(\theta_{\rm MAP})
and if p(θ) = N(θ|m, V_0), this becomes

    A = H + V_0^{-1}.

If we assume that the prior is broad, or equivalently that the number of data points is large, we can neglect the term V_0^{-1} compared to H. Using this result, (4.137) can be rewritten in the form

    \ln p(D) \simeq \ln p(D|\theta_{\rm MAP}) - \frac{1}{2} (\theta_{\rm MAP} - m)^{\rm T} V_0^{-1} (\theta_{\rm MAP} - m) - \frac{1}{2} \ln |H| + \mathrm{const}   (169)

as required. Note that the phrasing of the question is misleading, since the assumption of a broad prior, or of large N, is required in order to derive this form, as well as in the subsequent simplification.

We now again invoke the broad prior assumption, allowing us to neglect the second term on the right hand side of (169) relative to the first term. Since we assume i.i.d. data, H = −∇∇ ln p(D|θ_MAP) consists of a sum of terms, one term for each datum, and we can consider the following approximation:

    H = \sum_{n=1}^{N} H_n = N \widehat{H}

where H_n is the contribution from the n-th data point and

    \widehat{H} = \frac{1}{N} \sum_{n=1}^{N} H_n.

Combining this with the properties of the determinant, we have

    \ln |H| = \ln |N \widehat{H}| = \ln \left( N^M |\widehat{H}| \right) = M \ln N + \ln |\widehat{H}|

where M is the dimensionality of θ. Note that we are assuming that \widehat{H} has full rank M. Finally, using this result together with (169), we obtain (4.139) by dropping the \ln |\widehat{H}| since this is O(1) compared with \ln N.

4.24 Consider a rotation of the coordinate axes of the M-dimensional vector w such that w = (w_\parallel, w_\perp), where w^{\rm T}\phi = w_\parallel \|\phi\| and w_\perp is a vector of length M − 1. We then have

    \int \sigma(w^{\rm T}\phi) \, q(w) \, dw = \iint \sigma(w_\parallel \|\phi\|) \, q(w_\perp | w_\parallel) \, q(w_\parallel) \, dw_\parallel \, dw_\perp
                                             = \int \sigma(w_\parallel \|\phi\|) \, q(w_\parallel) \, dw_\parallel.

Note that the joint distribution q(w_\perp, w_\parallel) is Gaussian. Hence the marginal distribution q(w_\parallel) is also Gaussian and can be found using the standard results presented in
Section 2.3.2. Denoting the unit vector

    e = \frac{1}{\|\phi\|} \phi

we have

    q(w_\parallel) = \mathcal{N}\!\left( w_\parallel \,|\, e^{\rm T} m_N, \; e^{\rm T} S_N e \right).

Defining a = w_\parallel \|\phi\|, we see that the distribution of a is given by a simple rescaling of the Gaussian, so that

    q(a) = \mathcal{N}\!\left( a \,|\, \phi^{\rm T} m_N, \; \phi^{\rm T} S_N \phi \right)

where we have used \|\phi\| e = \phi. Thus we obtain (4.151), with \mu_a given by (4.149) and \sigma_a^2 given by (4.150).

4.25 From (4.88) we have

    \left. \frac{d\sigma}{da} \right|_{a=0} = \sigma(0) (1 - \sigma(0)) = \frac{1}{2} \left( 1 - \frac{1}{2} \right) = \frac{1}{4}.   (170)

Since the derivative of a cumulative distribution function is simply the corresponding density function, (4.114) gives

    \left. \frac{d\Phi(\lambda a)}{da} \right|_{a=0} = \lambda \, \mathcal{N}(0|0,1) = \frac{\lambda}{\sqrt{2\pi}}.

Setting this equal to (170), we see that

    \lambda \left( \frac{1}{2\pi} \right)^{1/2} = \frac{1}{4}

or equivalently \lambda^2 = \pi/8. This is illustrated in Figure 4.9.

4.26 First of all consider the derivative of the right hand side of (4.152) with respect to \mu, which is given by

    \frac{1}{(2\pi)^{1/2}} \frac{1}{(\lambda^{-2} + \sigma^2)^{1/2}} \exp\!\left\{ -\frac{\mu^2}{2(\lambda^{-2} + \sigma^2)} \right\}.

Now make the change of variable a = \mu + \sigma z, so that the left hand side of (4.152) becomes

    \int_{-\infty}^{\infty} \Phi(\lambda\mu + \lambda\sigma z) \, \frac{1}{(2\pi\sigma^2)^{1/2}} \exp\!\left( -\frac{z^2}{2} \right) \sigma \, dz

where we have substituted for the Gaussian distribution. Now differentiate with respect to \mu, making use of the definition of the probit function, giving

    \frac{\lambda}{2\pi} \int_{-\infty}^{\infty} \exp\!\left\{ -\frac{z^2}{2} - \frac{\lambda^2}{2} (\mu + \sigma z)^2 \right\} dz.

To do this we first complete the square in the exponent

    -\frac{z^2}{2} - \frac{\lambda^2}{2} (\mu + \sigma z)^2
      = -\frac{1}{2} z^2 (1 + \lambda^2\sigma^2) - z \lambda^2 \mu \sigma - \frac{1}{2} \lambda^2 \mu^2
      = -\frac{1}{2} (1 + \lambda^2\sigma^2) \left( z + \lambda^2 \mu \sigma (1 + \lambda^2\sigma^2)^{-1} \right)^2 + \frac{1}{2} \frac{\lambda^4 \mu^2 \sigma^2}{1 + \lambda^2\sigma^2} - \frac{1}{2} \lambda^2 \mu^2.

Integrating over z then gives the following result for the derivative of the left hand side

    \frac{\lambda}{(2\pi)^{1/2}} \frac{1}{(1 + \lambda^2\sigma^2)^{1/2}} \exp\!\left\{ -\frac{1}{2} \lambda^2 \mu^2 + \frac{1}{2} \frac{\lambda^4 \mu^2 \sigma^2}{1 + \lambda^2\sigma^2} \right\}
      = \frac{1}{(2\pi)^{1/2}} \frac{1}{(\lambda^{-2} + \sigma^2)^{1/2}} \exp\!\left\{ -\frac{\mu^2}{2(\lambda^{-2} + \sigma^2)} \right\}.

Thus the derivatives of the left and right hand sides of (4.152) with respect to \mu are equal. It follows that the left and right hand sides are equal up to a function of \sigma^2 and \lambda. Taking the limit \mu \to -\infty the left and right hand sides both go to zero, showing that the constant of integration must also be zero.

Chapter 5 Neural Networks

5.1 NOTE: In PRML, the text of this exercise contains a typographical error. On line 2, g(·) should be replaced by h(·).

See Solution 3.1.

5.2 The likelihood function for an i.i.d. data set, {(x_1, t_1), . . . , (x_N, t_N)}, under the conditional distribution (5.16) is given by

    \prod_{n=1}^{N} \mathcal{N}\!\left( t_n \,|\, y(x_n, w), \; \beta^{-1} I \right).
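As a small numerical aside to Solution 4.25 above, the following sketch (my own illustration, not part of the original solutions) checks how closely the rescaled probit \Phi(\lambda a), with \lambda = (\pi/8)^{1/2} and \Phi the standard normal CDF, tracks the logistic sigmoid \sigma(a); with this choice the two curves match in slope at the origin and stay within a few hundredths of each other everywhere.

# Hypothetical check of the probit approximation to the logistic sigmoid
# (Solution 4.25): with lambda^2 = pi/8 the two curves agree closely.
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def std_normal_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

lam = math.sqrt(math.pi / 8.0)
worst = max(abs(std_normal_cdf(lam * a) - sigmoid(a))
            for a in [i * 0.01 for i in range(-1000, 1001)])
print("max |Phi(lambda*a) - sigma(a)| on [-10, 10]:", worst)  # below 0.02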