
Advanced Machine Learning

Lecture 3: Linear models


Sandjai Bhulai
Vrije Universiteit Amsterdam

s.bhulai@vu.nl
12 September 2023
Towards a Bayesian framework

Advanced Machine Learning


The frequentist pitfall

▪ In the frequentist view, the target for an input x_0 is modeled as a Gaussian around the prediction:

p(t | x_0, w, β) = 𝒩(t | y(x_0, w), β⁻¹)
The frequentist pitfall
▪ Given p(t | x) or p(t, x), directly minimize the expected loss

𝔼[L] = ∫∫ L(t, y(x)) p(x, t) dx dt

▪ A natural choice is the squared loss:

𝔼[L] = ∫∫ {y(x) − t}² p(x, t) dx dt
The frequentist pitfall
▪ Given a point x, the expected loss at that point is

𝔼[L(t, y(x))] = ∫ {y(x) − t}² p(t | x) dt

▪ Taking the derivative w.r.t. y(x) yields

2 ∫ {y(x) − t} p(t | x) dt

▪ Setting this expression to 0 yields

y(x) = ∫ t p(t | x) dt = 𝔼_t[t | x]
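A small Monte Carlo sketch of this result (the conditional distribution used here, t | x ~ 𝒩(sin(2πx), 0.3²), is an assumption for illustration): the prediction that minimizes the empirical squared loss coincides with the conditional mean.

```python
import numpy as np

rng = np.random.default_rng(0)

# Fix an input x and draw samples of t from an assumed conditional p(t | x),
# here t | x ~ N(sin(2*pi*x), 0.3^2), purely for illustration.
x = 0.25
t = rng.normal(np.sin(2 * np.pi * x), 0.3, size=100_000)

# Empirical expected squared loss at this x for a candidate prediction y
def expected_loss(y):
    return np.mean((y - t) ** 2)

candidates = np.linspace(-1.0, 2.0, 301)
best = candidates[np.argmin([expected_loss(y) for y in candidates])]

print(f"conditional mean E[t | x]      ~ {t.mean():.3f}")
print(f"minimizer of the squared loss  ~ {best:.3f}")
# Up to Monte Carlo and grid error, the two agree: y(x) = E[t | x].
```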
The frequentist pitfall
▪ The expected loss function

𝔼[L] = ∫∫ {y(x) − t}² p(x, t) dx dt

▪ Rewrite the integrand as

{y(x) − t}² = {y(x) − 𝔼[t | x] + 𝔼[t | x] − t}²
            = {y(x) − 𝔼[t | x]}² + 2{y(x) − 𝔼[t | x]}{𝔼[t | x] − t} + {𝔼[t | x] − t}²

▪ Taking the expected value yields

𝔼[L] = ∫ {y(x) − 𝔼[t | x]}² p(x) dx + ∫ var[t | x] p(x) dx
The frequentist pitfall
▪ Recall the expected squared loss,

𝔼[L] = ∫ {y(x) − h(x)}² p(x) dx + ∫∫ {h(x) − t}² p(x, t) dx dt

where h(x) = 𝔼[t | x] = ∫ t p(t | x) dt

▪ The second term corresponds to the noise inherent in the random variable t

▪ What about the first term?
The frequentist pitfall
▪ Suppose we were given multiple datasets, each of size N. Any particular dataset 𝒟 will give a particular function y(x; 𝒟)

▪ We then have

{y(x; 𝒟) − h(x)}²
  = {y(x; 𝒟) − 𝔼_𝒟[y(x; 𝒟)] + 𝔼_𝒟[y(x; 𝒟)] − h(x)}²
  = {y(x; 𝒟) − 𝔼_𝒟[y(x; 𝒟)]}² + {𝔼_𝒟[y(x; 𝒟)] − h(x)}²
    + 2{y(x; 𝒟) − 𝔼_𝒟[y(x; 𝒟)]}{𝔼_𝒟[y(x; 𝒟)] − h(x)}
The frequentist pitfall
▪ Taking the expectation over 𝒟 yields:

𝔼_𝒟[{y(x; 𝒟) − h(x)}²]
  = {𝔼_𝒟[y(x; 𝒟)] − h(x)}² + 𝔼_𝒟[{y(x; 𝒟) − 𝔼_𝒟[y(x; 𝒟)]}²]

where the first term is the (bias)² and the second term is the variance
The frequentist pitfall
▪ In conclusion:

expected loss = (bias)² + variance + noise

where

(bias)² = ∫ {𝔼_𝒟[y(x; 𝒟)] − h(x)}² p(x) dx

variance = ∫ 𝔼_𝒟[{y(x; 𝒟) − 𝔼_𝒟[y(x; 𝒟)]}²] p(x) dx

noise = ∫∫ {h(x) − t}² p(x, t) dx dt
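To make the decomposition concrete, a minimal simulation sketch in the spirit of the example on the next slides: it draws 100 datasets of 25 points from a noisy sinusoid, fits regularized least squares with Gaussian basis functions for a few values of λ, and estimates (bias)² and variance empirically. The basis functions, noise level, and λ values are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def h(x):                                   # true regression function h(x) = E[t | x]
    return np.sin(2 * np.pi * x)

def design(x, centers, s=0.1):              # Gaussian basis functions
    return np.exp(-0.5 * ((x[:, None] - centers[None, :]) / s) ** 2)

N, L, noise_std = 25, 100, 0.3              # 25 points per dataset, 100 datasets
centers = np.linspace(0, 1, 9)
x_test = np.linspace(0, 1, 200)
Phi_test = design(x_test, centers)

for lam in [1e-3, 1e-1, 10.0]:              # under-, moderately, and over-regularized
    preds = np.empty((L, x_test.size))
    for i in range(L):
        x = rng.uniform(0, 1, N)
        t = h(x) + rng.normal(0, noise_std, N)
        Phi = design(x, centers)
        # regularized least squares: w = (lam I + Phi^T Phi)^{-1} Phi^T t
        w = np.linalg.solve(lam * np.eye(Phi.shape[1]) + Phi.T @ Phi, Phi.T @ t)
        preds[i] = Phi_test @ w
    avg = preds.mean(axis=0)                          # estimate of E_D[y(x; D)]
    bias2 = np.mean((avg - h(x_test)) ** 2)           # (bias)^2 averaged over x
    variance = np.mean(preds.var(axis=0))             # variance averaged over x
    print(f"lambda={lam:6.3f}  bias^2={bias2:.4f}  "
          f"variance={variance:.4f}  noise={noise_std**2:.4f}")
```

Large λ gives a high bias and small variance, small λ the reverse, matching the plots discussed below.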
Bias-variance decomposition
▪ Example: 100 datasets from the sinusoidal with 25 data
points, varying the degree of regularization



Bias-variance decomposition
▪ Example: 100 datasets from the sinusoidal with 25 data
points, varying the degree of regularization



Bias-variance decomposition
▪ Example: 100 datasets from the sinusoidal with 25 data
points, varying the degree of regularization



Bias-variance decomposition
▪ From these plots, we note that an over-regularized model
(large λ) will have a high bias, while an under-regularized
model (small λ) will have a high variance



Bias-variance decomposition
▪ These insights are of limited practical value
▪ The decomposition is based on averages with respect to ensembles of datasets
▪ In practice, we have only a single observed dataset


Bias-variance tradeoff
Bayesian linear regression
▪ Bayes’ theorem:

p(Y | X) = p(X | Y) p(Y) / p(X)

▪ Essentially, this leads to

posterior ∝ likelihood × prior

▪ The idea is to use a probability distribution over the weights, and then update this distribution based on the observed data


Bayesian linear regression
▪ Define a conjugate prior over w

p(w) = 𝒩(w | m_0, S_0)

▪ Combining this with the likelihood function and using results for marginal and conditional Gaussian distributions gives the posterior

p(w | t) = 𝒩(w | m_N, S_N)

where

m_N = S_N (S_0⁻¹ m_0 + β Φ⊤ t)

S_N⁻¹ = S_0⁻¹ + β Φ⊤ Φ
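A minimal NumPy sketch of these update equations; the toy design matrix, prior, and β value here are assumptions for illustration.

```python
import numpy as np

def posterior(Phi, t, m0, S0, beta):
    """Posterior N(w | m_N, S_N) for prior N(w | m0, S0) and noise precision beta."""
    S0_inv = np.linalg.inv(S0)
    SN_inv = S0_inv + beta * Phi.T @ Phi              # S_N^{-1} = S_0^{-1} + beta Phi^T Phi
    SN = np.linalg.inv(SN_inv)
    mN = SN @ (S0_inv @ m0 + beta * Phi.T @ t)        # m_N = S_N (S_0^{-1} m0 + beta Phi^T t)
    return mN, SN

# toy usage with a two-column design matrix (values assumed for illustration)
Phi = np.array([[1.0, 0.0], [1.0, 0.5], [1.0, 1.0]])
t = np.array([0.1, 0.4, 0.9])
mN, SN = posterior(Phi, t, m0=np.zeros(2), S0=np.eye(2), beta=25.0)
print(mN)
```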
Bayesian linear regression
▪ A common choice for the prior is

p(w) = 𝒩(w | 0, α⁻¹ I)

for which

m_N = β S_N Φ⊤ t

S_N⁻¹ = α I + β Φ⊤ Φ

▪ Consider the following example to make the concept less abstract: y(x) = −0.3 + 0.5x
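A small sketch of sequentially updating this posterior on synthetic data generated from y(x) = −0.3 + 0.5x, mirroring the next few slides; the noise level and the values of α and β are assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
a0, a1 = -0.3, 0.5                 # true parameters of y(x) = -0.3 + 0.5 x
alpha, beta = 2.0, 25.0            # assumed prior precision and noise precision

# zero-mean isotropic prior: m_0 = 0, S_0 = alpha^{-1} I
mN = np.zeros(2)
SN_inv = alpha * np.eye(2)

def update(SN_inv, mN, x, t):
    """Fold one observation (x, t) into the posterior; basis phi(x) = [1, x]."""
    phi = np.array([1.0, x])
    old_term = SN_inv @ mN                      # equals S_0^{-1} m_0 + beta * Phi^T t so far
    SN_inv = SN_inv + beta * np.outer(phi, phi)
    mN = np.linalg.solve(SN_inv, old_term + beta * t * phi)
    return SN_inv, mN

for n in range(1, 21):
    x = rng.uniform(-1, 1)
    t = a0 + a1 * x + rng.normal(0, 0.2)
    SN_inv, mN = update(SN_inv, mN, x, t)
    if n in (1, 2, 20):
        print(f"after {n:2d} points: m_N = {np.round(mN, 3)}")
# The posterior mean converges towards (-0.3, 0.5) as more data arrive.
```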
Bayesian linear regression
▪ 0 data points observed



Bayesian linear regression
▪ 1 data point observed



Bayesian linear regression
▪ 2 data points observed



Bayesian linear regression
▪ 20 data points observed



Bayesian linear regression
▪ A common choice for the prior is

p(w) = 𝒩(w | 0, α⁻¹ I)

for which

m_N = β S_N Φ⊤ t

S_N⁻¹ = α I + β Φ⊤ Φ

▪ What is the log of the posterior distribution, i.e., ln p(w | t)?

ln p(w | t) = ln [ p(t | w) p(w) / p(t) ]
            = −(β/2) ∑_{n=1}^{N} {t_n − w⊤φ(x_n)}² − (α/2) w⊤w + const
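Maximizing this log posterior over w is equivalent to regularized (ridge) least squares with regularization coefficient λ = α/β; a quick numerical check sketch with assumed toy data:

```python
import numpy as np

rng = np.random.default_rng(3)
alpha, beta = 2.0, 25.0

# toy design matrix and targets (assumed for illustration)
x = rng.uniform(-1, 1, 30)
Phi = np.column_stack([np.ones_like(x), x, x**2])
t = rng.normal(0.5 * x, 0.2)

# posterior mean: m_N = beta * S_N * Phi^T t, with S_N^{-1} = alpha I + beta Phi^T Phi
SN_inv = alpha * np.eye(Phi.shape[1]) + beta * Phi.T @ Phi
mN = beta * np.linalg.solve(SN_inv, Phi.T @ t)

# ridge solution with lambda = alpha / beta
lam = alpha / beta
w_ridge = np.linalg.solve(lam * np.eye(Phi.shape[1]) + Phi.T @ Phi, Phi.T @ t)

print(np.allclose(mN, w_ridge))   # True: the MAP estimate equals the ridge solution
```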
Bayesian linear regression
▪ A common choice for the prior is

p(w) = 𝒩(w | 0, α⁻¹ I)

for which

m_N = β S_N Φ⊤ t

S_N⁻¹ = α I + β Φ⊤ Φ

▪ What if we have no prior information, i.e., α → 0?

m_N = (Φ⊤Φ)⁻¹ Φ⊤ t
Bayesian linear regression
▪ A common choice for the prior is

p(w) = 𝒩(w | 0, α⁻¹ I)

for which

m_N = β S_N Φ⊤ t

S_N⁻¹ = α I + β Φ⊤ Φ

▪ What if we have precise prior information, i.e., α → ∞?

m_N = 0
Bayesian linear regression
▪ A common choice for the prior is

p(w) = 𝒩(w | 0, α⁻¹ I)

for which

m_N = β S_N Φ⊤ t

S_N⁻¹ = α I + β Φ⊤ Φ

▪ What if we have infinite data, i.e., N → ∞?

lim_{N→∞} m_N = (Φ⊤Φ)⁻¹ Φ⊤ t
Bayesian linear regression
▪ Predict t for new values of x by integrating over w

p(t | t, α, β) = ∫ p(t | w, β) p(w | t, α, β) dw
             = 𝒩(t | m_N⊤ φ(x), σ_N²(x))

where

σ_N²(x) = 1/β + φ(x)⊤ S_N φ(x)
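A minimal sketch of evaluating this predictive distribution with 9 Gaussian basis functions on sinusoidal data, in the spirit of the following slides; the basis width and the values of α and β are assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)
alpha, beta = 2.0, 25.0                       # assumed precisions for illustration
centers = np.linspace(0, 1, 9)                # 9 Gaussian basis functions

def design(x, s=0.1):
    return np.exp(-0.5 * ((np.atleast_1d(x)[:, None] - centers[None, :]) / s) ** 2)

# training data from the sinusoid
N = 4
x_train = rng.uniform(0, 1, N)
t_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 1 / np.sqrt(beta), N)
Phi = design(x_train)

SN_inv = alpha * np.eye(len(centers)) + beta * Phi.T @ Phi
SN = np.linalg.inv(SN_inv)
mN = beta * SN @ Phi.T @ t_train

# predictive mean m_N^T phi(x) and variance sigma_N^2(x) on a grid
x_new = np.linspace(0, 1, 5)
Phi_new = design(x_new)
mean = Phi_new @ mN
var = 1 / beta + np.sum(Phi_new @ SN * Phi_new, axis=1)
for xv, m, v in zip(x_new, mean, var):
    print(f"x={xv:.2f}  mean={m:+.3f}  std={np.sqrt(v):.3f}")
# The predictive std is smallest near observed inputs and never below 1/sqrt(beta).
```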
Bayesian linear regression
▪ Sinusoidal data, 9 Gaussian basis functions: 1 data point



Bayesian linear regression
▪ Sinusoidal data, 9 Gaussian basis functions: 2 data points



Bayesian linear regression
▪ Sinusoidal data, 9 Gaussian basis functions: 4 data points



Bayesian linear regression
▪ Sinusoidal data, 9 Gaussian basis functions: 25 data points



Conclusions
▪ The use of maximum likelihood, or equivalently, least squares, can lead to severe overfitting if complex models are trained using data sets of limited size

▪ A Bayesian approach to machine learning avoids this overfitting and also quantifies the uncertainty in model parameters
Linear models for regression

Advanced Machine Learning


Linear regression



Linear regression



Linear regression
▪ General model is:

y(x, w) = ∑_{j=0}^{M−1} w_j φ_j(x) = w⊤φ(x)

▪ φ_j(x) = x^j
▪ Take M = 2
▪ Calculate w_ML = (Φ⊤Φ)⁻¹ Φ⊤ t
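A minimal sketch of this closed-form solution for M = 2 with φ_j(x) = x^j; the toy data are assumed for illustration.

```python
import numpy as np

rng = np.random.default_rng(6)
x = rng.uniform(0, 1, 30)
t = 1.0 + 2.0 * x + rng.normal(0, 0.1, 30)     # targets roughly on a line

# design matrix with phi_j(x) = x^j for j = 0, 1  (M = 2)
Phi = np.column_stack([x**0, x**1])

# maximum likelihood / least squares: w_ML = (Phi^T Phi)^{-1} Phi^T t
w_ml = np.linalg.solve(Phi.T @ Phi, Phi.T @ t)
print(w_ml)          # close to [1.0, 2.0]
```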


Linear regression
▪ Thus, we have y(x, w) = w_0 + w_1 x

▪ Performance is measured by

E(w) = (1/2N) ∑_{n=1}^{N} {y(x_n, w) − t_n}²

▪ Goal: min_{w_0, w_1} E(w_0, w_1)



Gradient descent



Gradient descent



Gradient descent
▪ Gradient descent algorithm:

repeat until convergence {
    w_j := w_j − α ∂E(w_0, w_1)/∂w_j
}


Gradient descent
▪ Correct (simultaneous) update:

temp0 := w_0 − α ∂E(w_0, w_1)/∂w_0
temp1 := w_1 − α ∂E(w_0, w_1)/∂w_1
w_0 := temp0
w_1 := temp1


Gradient descent
▪ Incorrect (sequential) update, where w_0 is overwritten before the gradient w.r.t. w_1 is evaluated:

temp0 := w_0 − α ∂E(w_0, w_1)/∂w_0
w_0 := temp0
temp1 := w_1 − α ∂E(w_0, w_1)/∂w_1
w_1 := temp1
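The difference between the two update orders is easy to see numerically; a small sketch taking one gradient step both ways on assumed toy data:

```python
import numpy as np

# One gradient-descent step on E(w0, w1) for y(x, w) = w0 + w1 * x,
# contrasting the simultaneous (correct) and sequential (incorrect) updates.
# The toy data and the step size are assumptions for illustration.
x = np.array([0.0, 1.0, 2.0, 3.0])
t = np.array([1.0, 3.0, 5.0, 7.0])
alpha = 0.1

def grad(w0, w1):
    """Partial derivatives of E(w) = 1/(2N) * sum_n (w0 + w1*x_n - t_n)^2."""
    r = w0 + w1 * x - t
    return r.mean(), (r * x).mean()

w0, w1 = 0.0, 0.0

# correct: evaluate both gradients at the *same* (w0, w1), then assign
g0, g1 = grad(w0, w1)
print("simultaneous:", w0 - alpha * g0, w1 - alpha * g1)

# incorrect: w0 is overwritten before the gradient w.r.t. w1 is computed
w0_bad = w0 - alpha * grad(w0, w1)[0]
w1_bad = w1 - alpha * grad(w0_bad, w1)[1]
print("sequential:  ", w0_bad, w1_bad)
```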


Gradient descent
▪ Feature scaling is important

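One common way to scale features is standardization (zero mean, unit variance per feature); a generic sketch, not taken from the slides:

```python
import numpy as np

def standardize(X):
    """Scale each feature (column) to zero mean and unit standard deviation,
    so gradient descent takes comparably sized steps in every direction."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    sigma[sigma == 0.0] = 1.0        # leave constant features unchanged
    return (X - mu) / sigma, mu, sigma

# apply the same mu and sigma to any new inputs before prediction
X = np.array([[2000.0, 3.0], [1500.0, 2.0], [800.0, 1.0]])
X_scaled, mu, sigma = standardize(X)
print(X_scaled)
```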


Gradient descent
▪ Step size is important for convergence



Gradient descent
▪ Convexity of the problem is important for global optimality



Gradient descent for linear regression
▪ General model is:

y(x, w) = ∑_{j=0}^{M−1} w_j φ_j(x) = w⊤φ(x)

▪ Repeat {

    w_j := w_j − α (1/N) ∑_{n=1}^{N} (y(x_n, w) − t_n) φ_j(x_n)

}

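A vectorized sketch of this update rule, with assumed toy data and step size, checked against the closed-form least-squares solution:

```python
import numpy as np

def gradient_descent(Phi, t, alpha=0.1, iters=1000):
    """Minimize E(w) = 1/(2N) * sum_n (w^T phi(x_n) - t_n)^2 by batch gradient
    descent, updating all weights simultaneously (vectorized)."""
    N, M = Phi.shape
    w = np.zeros(M)
    for _ in range(iters):
        residual = Phi @ w - t                  # y(x_n, w) - t_n for all n
        w = w - alpha * (Phi.T @ residual) / N  # w_j -= alpha * (1/N) * sum_n residual_n * phi_j(x_n)
    return w

# toy usage with phi(x) = [1, x]; compare against the closed-form solution
rng = np.random.default_rng(5)
x = rng.uniform(-1, 1, 50)
t = -0.3 + 0.5 * x + rng.normal(0, 0.1, 50)
Phi = np.column_stack([np.ones_like(x), x])

w_gd = gradient_descent(Phi, t, alpha=0.5, iters=2000)
w_ml = np.linalg.solve(Phi.T @ Phi, Phi.T @ t)   # (Phi^T Phi)^{-1} Phi^T t
print(w_gd, w_ml)   # the two should nearly coincide
```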
