
Advanced Machine Learning

Lecture 3: Linear models


Sandjai Bhulai
Vrije Universiteit Amsterdam

s.bhulai@vu.nl
12 September 2023
Towards a Bayesian framework

Advanced Machine Learning


The frequentist pitfall

▪ In the frequentist view, the target for an input x_0 is modeled as a Gaussian around the prediction:

p(t | x_0, w, β) = 𝒩(t | y(x_0, w), β⁻¹)
The frequentist pitfall
▪ Given p(t | x) or p(t, x), directly minimize the expected loss

𝔼[L] = ∫∫ L(t, y(x)) p(x, t) dx dt

▪ A natural choice is the squared loss:

𝔼[L] = ∫∫ {y(x) − t}² p(x, t) dx dt
The frequentist pitfall
▪ Given a point x, the expected loss at that point is

𝔼[L(t, y(x))] = ∫ {y(x) − t}² p(t | x) dt

▪ Taking the derivative w.r.t. y(x) yields

2 ∫ {y(x) − t} p(t | x) dt

▪ Setting this expression to 0 yields

y(x) = ∫ t p(t | x) dt = 𝔼_t[t | x]
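A small Monte Carlo sketch of this result (the conditional distribution used here, t | x ~ 𝒩(sin(2πx), 0.3²), is an assumption for illustration): the prediction that minimizes the empirical squared loss coincides with the conditional mean.

```python
import numpy as np

rng = np.random.default_rng(0)

# Fix an input x and draw samples of t from an assumed conditional p(t | x),
# here t | x ~ N(sin(2*pi*x), 0.3^2), purely for illustration.
x = 0.25
t = rng.normal(np.sin(2 * np.pi * x), 0.3, size=100_000)

# Empirical expected squared loss at this x for a candidate prediction y
def expected_loss(y):
    return np.mean((y - t) ** 2)

candidates = np.linspace(-1.0, 2.0, 301)
best = candidates[np.argmin([expected_loss(y) for y in candidates])]

print(f"conditional mean E[t | x]      ~ {t.mean():.3f}")
print(f"minimizer of the squared loss  ~ {best:.3f}")
# Up to Monte Carlo and grid error, the two agree: y(x) = E[t | x].
```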
The frequentist pitfall
▪ The expected loss function

𝔼[L] = ∫∫ {y(x) − t}² p(x, t) dx dt

▪ Rewrite the integrand as

{y(x) − t}² = {y(x) − 𝔼[t | x] + 𝔼[t | x] − t}²
            = {y(x) − 𝔼[t | x]}² + 2{y(x) − 𝔼[t | x]}{𝔼[t | x] − t} + {𝔼[t | x] − t}²

▪ Taking the expected value yields

𝔼[L] = ∫ {y(x) − 𝔼[t | x]}² p(x) dx + ∫ var[t | x] p(x) dx
The frequentist pitfall
▪ Recall the expected squared loss,

𝔼[L] = ∫ {y(x) − h(x)}² p(x) dx + ∫∫ {h(x) − t}² p(x, t) dx dt

where h(x) = 𝔼[t | x] = ∫ t p(t | x) dt

▪ The second term corresponds to the noise inherent in the random variable t

▪ What about the first term?
The frequentist pitfall
▪ Suppose we were given multiple datasets, each of size N. Any particular dataset 𝒟 will give a particular function y(x; 𝒟)

▪ We then have

{y(x; 𝒟) − h(x)}²
  = {y(x; 𝒟) − 𝔼_𝒟[y(x; 𝒟)] + 𝔼_𝒟[y(x; 𝒟)] − h(x)}²
  = {y(x; 𝒟) − 𝔼_𝒟[y(x; 𝒟)]}² + {𝔼_𝒟[y(x; 𝒟)] − h(x)}²
    + 2{y(x; 𝒟) − 𝔼_𝒟[y(x; 𝒟)]}{𝔼_𝒟[y(x; 𝒟)] − h(x)}
The frequentist pitfall
▪ Taking the expectation over 𝒟 yields:

𝔼_𝒟[{y(x; 𝒟) − h(x)}²]
  = {𝔼_𝒟[y(x; 𝒟)] − h(x)}² + 𝔼_𝒟[{y(x; 𝒟) − 𝔼_𝒟[y(x; 𝒟)]}²]

where the first term is the (bias)² and the second term is the variance
The frequentist pitfall
▪ In conclusion:

expected loss = (bias)² + variance + noise

where

(bias)² = ∫ {𝔼_𝒟[y(x; 𝒟)] − h(x)}² p(x) dx

variance = ∫ 𝔼_𝒟[{y(x; 𝒟) − 𝔼_𝒟[y(x; 𝒟)]}²] p(x) dx

noise = ∫∫ {h(x) − t}² p(x, t) dx dt
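To make the decomposition concrete, a minimal simulation sketch in the spirit of the example on the next slides: it draws 100 datasets of 25 points from a noisy sinusoid, fits regularized least squares with Gaussian basis functions for a few values of λ, and estimates (bias)² and variance empirically. The basis functions, noise level, and λ values are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def h(x):                                   # true regression function h(x) = E[t | x]
    return np.sin(2 * np.pi * x)

def design(x, centers, s=0.1):              # Gaussian basis functions
    return np.exp(-0.5 * ((x[:, None] - centers[None, :]) / s) ** 2)

N, L, noise_std = 25, 100, 0.3              # 25 points per dataset, 100 datasets
centers = np.linspace(0, 1, 9)
x_test = np.linspace(0, 1, 200)
Phi_test = design(x_test, centers)

for lam in [1e-3, 1e-1, 10.0]:              # under-, moderately, and over-regularized
    preds = np.empty((L, x_test.size))
    for i in range(L):
        x = rng.uniform(0, 1, N)
        t = h(x) + rng.normal(0, noise_std, N)
        Phi = design(x, centers)
        # regularized least squares: w = (lam I + Phi^T Phi)^{-1} Phi^T t
        w = np.linalg.solve(lam * np.eye(Phi.shape[1]) + Phi.T @ Phi, Phi.T @ t)
        preds[i] = Phi_test @ w
    avg = preds.mean(axis=0)                          # estimate of E_D[y(x; D)]
    bias2 = np.mean((avg - h(x_test)) ** 2)           # (bias)^2 averaged over x
    variance = np.mean(preds.var(axis=0))             # variance averaged over x
    print(f"lambda={lam:6.3f}  bias^2={bias2:.4f}  "
          f"variance={variance:.4f}  noise={noise_std**2:.4f}")
```

Large λ gives a high bias and small variance, small λ the reverse, matching the plots discussed below.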
Bias-variance decomposition
▪ Example: 100 datasets from the sinusoidal with 25 data
points, varying the degree of regularization



Bias-variance decomposition
▪ Example: 100 datasets from the sinusoidal with 25 data
points, varying the degree of regularization



Bias-variance decomposition
▪ Example: 100 datasets from the sinusoidal with 25 data
points, varying the degree of regularization



Bias-variance decomposition
▪ From these plots, we note that an over-regularized model
(large λ) will have a high bias, while an under-regularized
model (small λ) will have a high variance



Bias-variance decomposition
▪ These insights are of limited practical value
▪ The decomposition is based on averages with respect to ensembles of datasets
▪ In practice, we have only a single observed dataset


Bias-variance tradeoff
Bayesian linear regression
▪ Bayes’ theorem:

p(Y | X) = p(X | Y) p(Y) / p(X)

▪ Essentially, this leads to

posterior ∝ likelihood × prior

▪ The idea is to use a probability distribution over the weights, and then update this distribution based on the observed data


Bayesian linear regression
▪ Define a conjugate prior over w

p(w) = 𝒩(w | m_0, S_0)

▪ Combining this with the likelihood function and using results for marginal and conditional Gaussian distributions gives the posterior

p(w | t) = 𝒩(w | m_N, S_N)

where

m_N = S_N (S_0⁻¹ m_0 + β Φ⊤ t)

S_N⁻¹ = S_0⁻¹ + β Φ⊤ Φ
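A minimal NumPy sketch of these update equations; the toy design matrix, prior, and β value here are assumptions for illustration.

```python
import numpy as np

def posterior(Phi, t, m0, S0, beta):
    """Posterior N(w | m_N, S_N) for prior N(w | m0, S0) and noise precision beta."""
    S0_inv = np.linalg.inv(S0)
    SN_inv = S0_inv + beta * Phi.T @ Phi              # S_N^{-1} = S_0^{-1} + beta Phi^T Phi
    SN = np.linalg.inv(SN_inv)
    mN = SN @ (S0_inv @ m0 + beta * Phi.T @ t)        # m_N = S_N (S_0^{-1} m0 + beta Phi^T t)
    return mN, SN

# toy usage with a two-column design matrix (values assumed for illustration)
Phi = np.array([[1.0, 0.0], [1.0, 0.5], [1.0, 1.0]])
t = np.array([0.1, 0.4, 0.9])
mN, SN = posterior(Phi, t, m0=np.zeros(2), S0=np.eye(2), beta=25.0)
print(mN)
```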
Bayesian linear regression
▪ A common choice for the prior is

p(w) = 𝒩(w | 0, α⁻¹ I)

for which

m_N = β S_N Φ⊤ t

S_N⁻¹ = α I + β Φ⊤ Φ

▪ Consider the following example to make the concept less abstract: y(x) = −0.3 + 0.5x
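A small sketch of sequentially updating this posterior on synthetic data generated from y(x) = −0.3 + 0.5x, mirroring the next few slides; the noise level and the values of α and β are assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
a0, a1 = -0.3, 0.5                 # true parameters of y(x) = -0.3 + 0.5 x
alpha, beta = 2.0, 25.0            # assumed prior precision and noise precision

# zero-mean isotropic prior: m_0 = 0, S_0 = alpha^{-1} I
mN = np.zeros(2)
SN_inv = alpha * np.eye(2)

def update(SN_inv, mN, x, t):
    """Fold one observation (x, t) into the posterior; basis phi(x) = [1, x]."""
    phi = np.array([1.0, x])
    old_term = SN_inv @ mN                      # equals S_0^{-1} m_0 + beta * Phi^T t so far
    SN_inv = SN_inv + beta * np.outer(phi, phi)
    mN = np.linalg.solve(SN_inv, old_term + beta * t * phi)
    return SN_inv, mN

for n in range(1, 21):
    x = rng.uniform(-1, 1)
    t = a0 + a1 * x + rng.normal(0, 0.2)
    SN_inv, mN = update(SN_inv, mN, x, t)
    if n in (1, 2, 20):
        print(f"after {n:2d} points: m_N = {np.round(mN, 3)}")
# The posterior mean converges towards (-0.3, 0.5) as more data arrive.
```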
Bayesian linear regression
▪ 0 data points observed



Bayesian linear regression
▪ 1 data point observed



Bayesian linear regression
▪ 2 data points observed



Bayesian linear regression
▪ 20 data points observed



Bayesian linear regression
▪ A common choice for the prior is

p(w) = 𝒩(w | 0, α⁻¹ I)

for which

m_N = β S_N Φ⊤ t

S_N⁻¹ = α I + β Φ⊤ Φ

▪ What is the log of the posterior distribution, i.e., ln p(w | t)?

ln p(w | t) = ln [ p(t | w) p(w) / p(t) ]
            = −(β/2) ∑_{n=1}^{N} {t_n − w⊤φ(x_n)}² − (α/2) w⊤w + const
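Maximizing this log posterior over w is equivalent to regularized (ridge) least squares with regularization coefficient λ = α/β; a quick numerical check sketch with assumed toy data:

```python
import numpy as np

rng = np.random.default_rng(3)
alpha, beta = 2.0, 25.0

# toy design matrix and targets (assumed for illustration)
x = rng.uniform(-1, 1, 30)
Phi = np.column_stack([np.ones_like(x), x, x**2])
t = rng.normal(0.5 * x, 0.2)

# posterior mean: m_N = beta * S_N * Phi^T t, with S_N^{-1} = alpha I + beta Phi^T Phi
SN_inv = alpha * np.eye(Phi.shape[1]) + beta * Phi.T @ Phi
mN = beta * np.linalg.solve(SN_inv, Phi.T @ t)

# ridge solution with lambda = alpha / beta
lam = alpha / beta
w_ridge = np.linalg.solve(lam * np.eye(Phi.shape[1]) + Phi.T @ Phi, Phi.T @ t)

print(np.allclose(mN, w_ridge))   # True: the MAP estimate equals the ridge solution
```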
Bayesian linear regression
▪ A common choice for the prior is

p(w) = 𝒩(w | 0, α⁻¹ I)

for which

m_N = β S_N Φ⊤ t

S_N⁻¹ = α I + β Φ⊤ Φ

▪ What if we have no prior information, i.e., α → 0?

m_N = (Φ⊤Φ)⁻¹ Φ⊤ t
Bayesian linear regression
▪ A common choice for the prior is

p(w) = 𝒩(w | 0, α⁻¹ I)

for which

m_N = β S_N Φ⊤ t

S_N⁻¹ = α I + β Φ⊤ Φ

▪ What if we have precise prior information, i.e., α → ∞?

m_N = 0
Bayesian linear regression
▪ A common choice for the prior is

p(w) = 𝒩(w | 0, α⁻¹ I)

for which

m_N = β S_N Φ⊤ t

S_N⁻¹ = α I + β Φ⊤ Φ

▪ What if we have infinite data, i.e., N → ∞?

lim_{N→∞} m_N = (Φ⊤Φ)⁻¹ Φ⊤ t
Bayesian linear regression
▪ Predict t for new values of x by integrating over w

p(t | t, α, β) = ∫ p(t | w, β) p(w | t, α, β) dw
             = 𝒩(t | m_N⊤ φ(x), σ_N²(x))

where

σ_N²(x) = 1/β + φ(x)⊤ S_N φ(x)
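A minimal sketch of evaluating this predictive distribution with 9 Gaussian basis functions on sinusoidal data, in the spirit of the following slides; the basis width and the values of α and β are assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)
alpha, beta = 2.0, 25.0                       # assumed precisions for illustration
centers = np.linspace(0, 1, 9)                # 9 Gaussian basis functions

def design(x, s=0.1):
    return np.exp(-0.5 * ((np.atleast_1d(x)[:, None] - centers[None, :]) / s) ** 2)

# training data from the sinusoid
N = 4
x_train = rng.uniform(0, 1, N)
t_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 1 / np.sqrt(beta), N)
Phi = design(x_train)

SN_inv = alpha * np.eye(len(centers)) + beta * Phi.T @ Phi
SN = np.linalg.inv(SN_inv)
mN = beta * SN @ Phi.T @ t_train

# predictive mean m_N^T phi(x) and variance sigma_N^2(x) on a grid
x_new = np.linspace(0, 1, 5)
Phi_new = design(x_new)
mean = Phi_new @ mN
var = 1 / beta + np.sum(Phi_new @ SN * Phi_new, axis=1)
for xv, m, v in zip(x_new, mean, var):
    print(f"x={xv:.2f}  mean={m:+.3f}  std={np.sqrt(v):.3f}")
# The predictive std is smallest near observed inputs and never below 1/sqrt(beta).
```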
Bayesian linear regression
▪ Sinusoidal data, 9 Gaussian basis functions: 1 data point



Bayesian linear regression
▪ Sinusoidal data, 9 Gaussian basis functions: 2 data points



Bayesian linear regression
▪ Sinusoidal data, 9 Gaussian basis functions: 4 data points



Bayesian linear regression
▪ Sinusoidal data, 9 Gaussian basis functions: 25 data points



Conclusions
▪ The use of maximum likelihood, or equivalently, least squares, can lead to severe overfitting if complex models are trained using data sets of limited size

▪ A Bayesian approach to machine learning avoids this overfitting and also quantifies the uncertainty in model parameters
Linear models for regression

Advanced Machine Learning


Linear regression



Linear regression



Linear regression
▪ General model is:

y(x, w) = ∑_{j=0}^{M−1} w_j φ_j(x) = w⊤φ(x)

▪ φ_j(x) = x^j
▪ Take M = 2
▪ Calculate w_ML = (Φ⊤Φ)⁻¹ Φ⊤ t
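A minimal sketch of this closed-form solution for M = 2 with φ_j(x) = x^j; the toy data are assumed for illustration.

```python
import numpy as np

rng = np.random.default_rng(6)
x = rng.uniform(0, 1, 30)
t = 1.0 + 2.0 * x + rng.normal(0, 0.1, 30)     # targets roughly on a line

# design matrix with phi_j(x) = x^j for j = 0, 1  (M = 2)
Phi = np.column_stack([x**0, x**1])

# maximum likelihood / least squares: w_ML = (Phi^T Phi)^{-1} Phi^T t
w_ml = np.linalg.solve(Phi.T @ Phi, Phi.T @ t)
print(w_ml)          # close to [1.0, 2.0]
```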


Linear regression
▪ Thus, we have y(x, w) = w_0 + w_1 x

▪ Performance is measured by

E(w) = (1/2N) ∑_{n=1}^{N} {y(x_n, w) − t_n}²

▪ Goal: min_{w_0, w_1} E(w_0, w_1)



Gradient descent



Gradient descent



Gradient descent
▪ Gradient descent algorithm:

repeat until convergence {
    w_j := w_j − α ∂E(w_0, w_1)/∂w_j
}


Gradient descent
▪ Correct (simultaneous) update:

temp0 := w_0 − α ∂E(w_0, w_1)/∂w_0
temp1 := w_1 − α ∂E(w_0, w_1)/∂w_1
w_0 := temp0
w_1 := temp1


Gradient descent
▪ Incorrect (sequential) update, where w_0 is overwritten before the gradient w.r.t. w_1 is evaluated:

temp0 := w_0 − α ∂E(w_0, w_1)/∂w_0
w_0 := temp0
temp1 := w_1 − α ∂E(w_0, w_1)/∂w_1
w_1 := temp1
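The difference between the two update orders is easy to see numerically; a small sketch taking one gradient step both ways on assumed toy data:

```python
import numpy as np

# One gradient-descent step on E(w0, w1) for y(x, w) = w0 + w1 * x,
# contrasting the simultaneous (correct) and sequential (incorrect) updates.
# The toy data and the step size are assumptions for illustration.
x = np.array([0.0, 1.0, 2.0, 3.0])
t = np.array([1.0, 3.0, 5.0, 7.0])
alpha = 0.1

def grad(w0, w1):
    """Partial derivatives of E(w) = 1/(2N) * sum_n (w0 + w1*x_n - t_n)^2."""
    r = w0 + w1 * x - t
    return r.mean(), (r * x).mean()

w0, w1 = 0.0, 0.0

# correct: evaluate both gradients at the *same* (w0, w1), then assign
g0, g1 = grad(w0, w1)
print("simultaneous:", w0 - alpha * g0, w1 - alpha * g1)

# incorrect: w0 is overwritten before the gradient w.r.t. w1 is computed
w0_bad = w0 - alpha * grad(w0, w1)[0]
w1_bad = w1 - alpha * grad(w0_bad, w1)[1]
print("sequential:  ", w0_bad, w1_bad)
```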


Gradient descent
▪ Feature scaling is important

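One common way to scale features is standardization (zero mean, unit variance per feature); a generic sketch, not taken from the slides:

```python
import numpy as np

def standardize(X):
    """Scale each feature (column) to zero mean and unit standard deviation,
    so gradient descent takes comparably sized steps in every direction."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    sigma[sigma == 0.0] = 1.0        # leave constant features unchanged
    return (X - mu) / sigma, mu, sigma

# apply the same mu and sigma to any new inputs before prediction
X = np.array([[2000.0, 3.0], [1500.0, 2.0], [800.0, 1.0]])
X_scaled, mu, sigma = standardize(X)
print(X_scaled)
```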


Gradient descent
▪ Step size is important for convergence



Gradient descent
▪ Convexity of the problem is important for global optimality



Gradient descent for linear regression
▪ General model is:

y(x, w) = ∑_{j=0}^{M−1} w_j φ_j(x) = w⊤φ(x)

▪ Repeat {

    w_j := w_j − α (1/N) ∑_{n=1}^{N} (y(x_n, w) − t_n) φ_j(x_n)

}

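A vectorized sketch of this update rule, with assumed toy data and step size, checked against the closed-form least-squares solution:

```python
import numpy as np

def gradient_descent(Phi, t, alpha=0.1, iters=1000):
    """Minimize E(w) = 1/(2N) * sum_n (w^T phi(x_n) - t_n)^2 by batch gradient
    descent, updating all weights simultaneously (vectorized)."""
    N, M = Phi.shape
    w = np.zeros(M)
    for _ in range(iters):
        residual = Phi @ w - t                  # y(x_n, w) - t_n for all n
        w = w - alpha * (Phi.T @ residual) / N  # w_j -= alpha * (1/N) * sum_n residual_n * phi_j(x_n)
    return w

# toy usage with phi(x) = [1, x]; compare against the closed-form solution
rng = np.random.default_rng(5)
x = rng.uniform(-1, 1, 50)
t = -0.3 + 0.5 * x + rng.normal(0, 0.1, 50)
Phi = np.column_stack([np.ones_like(x), x])

w_gd = gradient_descent(Phi, t, alpha=0.5, iters=2000)
w_ml = np.linalg.solve(Phi.T @ Phi, Phi.T @ t)   # (Phi^T Phi)^{-1} Phi^T t
print(w_gd, w_ml)   # the two should nearly coincide
```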
