Chris Bishop’s PRML Ch. 3: Linear Models of Regression
Mathieu Guillaumin & Radu Horaud
Chapter content
Figure 1.16, page 29: [schematic of the Gaussian conditional p(t|x0, w, β), of width 2σ, centred on the regression curve y(x, w) at y(x0, w).]
The chapter section by section
Linear Basis Function Models
y(x, w) = Σ_{j=0}^{M−1} w_j φ_j(x) = w^T φ(x)
where:
- w = (w_0, ..., w_{M−1})^T and φ = (φ_0, ..., φ_{M−1})^T, with φ_0(x) = 1 and w_0 the bias parameter.
- In general x ∈ R^D, but it will be convenient to treat the case x ∈ R.
- We observe the set X = {x_1, ..., x_n, ..., x_N} with corresponding target variables t = {t_n}.
Basis function choices
- Polynomial: φ_j(x) = x^j
- Gaussian: φ_j(x) = exp(−(x − μ_j)^2 / (2s^2))
- Sigmoidal: φ_j(x) = σ((x − μ_j)/s), with σ(a) = 1 / (1 + e^{−a})
- splines, Fourier, wavelets, etc. (the first three are sketched below)
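A minimal numpy sketch of the first three choices, assuming evenly spaced centres μ_j and a common width s (both values here are purely illustrative); each helper maps inputs x_n to rows φ(x_n)^T of a design matrix:

```python
import numpy as np

def polynomial_basis(x, M):
    # phi_j(x) = x**j for j = 0..M-1 (phi_0 = 1 acts as the bias term)
    return np.stack([x**j for j in range(M)], axis=-1)

def gaussian_basis(x, mu, s):
    # phi_j(x) = exp(-(x - mu_j)^2 / (2 s^2)), plus a constant bias feature
    phi = np.exp(-(x[:, None] - mu[None, :])**2 / (2 * s**2))
    return np.hstack([np.ones((x.size, 1)), phi])

def sigmoid_basis(x, mu, s):
    # phi_j(x) = sigma((x - mu_j)/s) with sigma(a) = 1 / (1 + exp(-a))
    a = (x[:, None] - mu[None, :]) / s
    return np.hstack([np.ones((x.size, 1)), 1.0 / (1.0 + np.exp(-a))])

x = np.linspace(-1, 1, 5)            # a few sample inputs
mu = np.linspace(-1, 1, 9)           # assumed centres of the local basis functions
Phi = gaussian_basis(x, mu, s=0.2)   # design matrix, one row phi(x_n)^T per input
print(Phi.shape)                     # (5, 10): one row per input, one column per basis function
```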
Examples of basis functions
[Figure: plots of polynomial, Gaussian, and sigmoidal basis functions on the interval (−1, 1).]
Maximum likelihood and least squares
t = y(x, w) + ε, where y(x, w) is the deterministic part and ε is zero-mean Gaussian noise with precision β.

Maximizing the likelihood with respect to w gives the least-squares solution
w_ML = (Φ^T Φ)^{−1} Φ^T t,
where Φ is the N × M design matrix with elements Φ_nj = φ_j(x_n); maximizing with respect to β gives:
1/β_ML = (1/N) Σ_{n=1}^{N} (t_n − w_ML^T φ(x_n))^2
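A minimal sketch of this maximum-likelihood fit, assuming synthetic sinusoidal data and an illustrative Gaussian basis (centres mu, width s are assumptions, not values from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, s = 25, 9, 0.2
x = rng.uniform(0, 1, N)
t = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, N)   # assumed synthetic data

mu = np.linspace(0, 1, M)
Phi = np.hstack([np.ones((N, 1)),                                  # bias feature phi_0(x) = 1
                 np.exp(-(x[:, None] - mu)**2 / (2 * s**2))])      # Gaussian basis features

# w_ML = (Phi^T Phi)^{-1} Phi^T t, computed stably with a least-squares solver
w_ml, *_ = np.linalg.lstsq(Phi, t, rcond=None)

# 1/beta_ML is the mean squared residual around the fitted curve
beta_ml = 1.0 / np.mean((t - Phi @ w_ml) ** 2)
print(w_ml.shape, beta_ml)
```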
Geometry of least squares
[Figure 3.2: the least-squares solution y is the orthogonal projection of the target vector t onto the subspace S spanned by the basis vectors ϕ1 and ϕ2.]
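The figure's statement can be checked numerically: the fitted values Φ w_ML are the orthogonal projection of t onto the column space of Φ. A small sketch, with an arbitrary random matrix standing in for the design matrix Φ:

```python
import numpy as np

rng = np.random.default_rng(1)
N, M = 25, 4
Phi = rng.normal(size=(N, M))        # stand-in design matrix (columns play the role of the phi_j)
t = rng.normal(size=N)               # stand-in target vector

w_ml, *_ = np.linalg.lstsq(Phi, t, rcond=None)
y = Phi @ w_ml                       # least-squares fitted values

P = Phi @ np.linalg.pinv(Phi)        # orthogonal projector onto the column space of Phi
print(np.allclose(y, P @ t))         # True: y is the projection of t onto S
print(np.allclose(Phi.T @ (t - y), np.zeros(M), atol=1e-8))  # residual is orthogonal to S
```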
Sequential learning
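Sequential (on-line) learning processes one observation at a time with stochastic gradient descent on the sum-of-squares error, w ← w + η (t_n − w^T φ(x_n)) φ(x_n) (the LMS rule, PRML eq. 3.23). A minimal sketch, assuming a fixed learning rate eta and a simple polynomial basis (both are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(2)
N, M, eta = 200, 4, 0.05
x = rng.uniform(0, 1, N)
t = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, N)    # assumed synthetic data

def phi(xn, M=M):
    return np.array([xn**j for j in range(M)])       # polynomial basis features

w = np.zeros(M)
for n in range(N):                                   # one LMS update per data point
    phin = phi(x[n])
    w += eta * (t[n] - w @ phin) * phin
print(w)
```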
Regularized least squares
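With a quadratic regularizer (λ/2) w^T w added to the sum-of-squares error, the solution becomes w = (λI + Φ^T Φ)^{−1} Φ^T t (PRML eq. 3.28). A minimal sketch, assuming the same kind of synthetic data and an illustrative value of λ:

```python
import numpy as np

rng = np.random.default_rng(3)
N, M, lam = 25, 10, 1e-3
x = rng.uniform(0, 1, N)
t = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, N)

Phi = np.stack([x**j for j in range(M)], axis=1)     # polynomial design matrix

# Regularized least squares: w = (lambda I + Phi^T Phi)^{-1} Phi^T t
w_reg = np.linalg.solve(lam * np.eye(M) + Phi.T @ Phi, Phi.T @ t)
print(w_reg)
```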
The Bias-Variance Decomposition
Back to section 1.5.5
- The regression loss function: L(t, y(x)) = (y(x) − t)^2
- The decision problem: minimize the expected loss
  E[L] = ∫∫ (y(x) − t)^2 p(x, t) dx dt
- Solution: y(x) = ∫ t p(t|x) dt = E_t[t|x] (checked numerically below)
  - this is known as the regression function
  - the conditional average of t conditioned on x (e.g., figure 1.28, page 47)
- Another expression for the expectation of the loss function:
  E[L] = ∫ (y(x) − E[t|x])^2 p(x) dx + ∫ (E[t|x] − t)^2 p(x) dx   (1.90)
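A small Monte Carlo check that y(x) = E_t[t|x] minimizes the expected squared loss, using an assumed joint distribution where t = sin(2πx) plus Gaussian noise, so that E[t|x] = sin(2πx):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200_000
x = rng.uniform(0, 1, n)
t = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, n)    # samples from the assumed joint p(x, t)

def expected_loss(y):
    # Monte Carlo estimate of E[(y(x) - t)^2]
    return np.mean((y(x) - t) ** 2)

print(expected_loss(lambda x: np.sin(2 * np.pi * x)))        # regression function: ~0.09, the noise variance
print(expected_loss(lambda x: 0.9 * np.sin(2 * np.pi * x)))  # any other choice gives a larger loss
print(expected_loss(lambda x: np.zeros_like(x)))
```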
- The optimal prediction is obtained by minimization of the expected squared loss:
  h(x) = E[t|x] = ∫ t p(t|x) dt   (3.36)
- The expected squared loss then decomposes as
  E[L] = ∫ (y(x) − h(x))^2 p(x) dx + ∫∫ (h(x) − t)^2 p(x, t) dx dt   (3.37)
- The theoretical minimum of the first term is zero for an appropriate choice of the function y(x) (given unlimited data and unlimited computing power).
- The second term arises from noise in the data and represents the minimum achievable value of the expected squared loss.
An ensemble of data sets
Example: L = 100 data sets of N = 25 points each, fitted with M = 25 Gaussian basis functions
[Figure: fits to individual data sets from the ensemble and their average, illustrating the trade-off between bias and variance.]
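A sketch of this ensemble experiment, with assumed values for the noise level and the regularization parameter λ: for each of the L data sets a regularized Gaussian-basis fit y^(l)(x) is computed, the squared gap between the average fit and sin(2πx) estimates (bias)^2, and the spread of the individual fits around their average estimates the variance:

```python
import numpy as np

rng = np.random.default_rng(5)
L, N, M, s, lam, noise = 100, 25, 25, 0.1, 0.1, 0.3
mu = np.linspace(0, 1, M)
x_test = np.linspace(0, 1, 100)
h = np.sin(2 * np.pi * x_test)                       # true regression function h(x)

def design(x):
    # Gaussian basis design matrix (bias term omitted for brevity)
    return np.exp(-(x[:, None] - mu) ** 2 / (2 * s ** 2))

fits = []
for _ in range(L):                                   # one regularized fit per data set
    x = rng.uniform(0, 1, N)
    t = np.sin(2 * np.pi * x) + rng.normal(0, noise, N)
    Phi = design(x)
    w = np.linalg.solve(lam * np.eye(M) + Phi.T @ Phi, Phi.T @ t)
    fits.append(design(x_test) @ w)

fits = np.array(fits)                                # shape (L, 100)
y_bar = fits.mean(axis=0)                            # average prediction over the ensemble
bias2 = np.mean((y_bar - h) ** 2)                    # (bias)^2 estimate
variance = np.mean((fits - y_bar) ** 2)              # variance estimate
print(bias2, variance)
```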
Bayesian Linear Regression (1/5)
Assume additive Gaussian noise with known precision β. The likelihood function p(t|w) is the exponential of a quadratic function of w, so its conjugate prior is Gaussian, p(w) = N(w | m_0, S_0), and the posterior is Gaussian as well, p(w|t) = N(w | m_N, S_N), with
  m_N = S_N (S_0^{−1} m_0 + β Φ^T t)
  S_N^{−1} = S_0^{−1} + β Φ^T Φ   (3.50/3.51)
For the zero-mean isotropic prior p(w|α) = N(w | 0, α^{−1} I), this becomes
  m_N = β S_N Φ^T t
  S_N^{−1} = α I + β Φ^T Φ   (3.53/3.54)
- Note that α → 0 implies m_N → w_ML = (Φ^T Φ)^{−1} Φ^T t (3.35)
The predictive distribution is
  p(t | x, t, α, β) = N(t | m_N^T φ(x), σ_N^2(x))   (3.58)
where
  σ_N^2(x) = β^{−1} + φ(x)^T S_N φ(x)   (3.59)
with the first term accounting for the noise in the data and the second for the uncertainty in w.
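A minimal sketch of these updates, assuming illustrative values of α and β and a Gaussian basis; it computes S_N and m_N from (3.53/3.54) and then the predictive mean and variance from (3.58/3.59):

```python
import numpy as np

rng = np.random.default_rng(6)
N, M, s, alpha, beta = 25, 9, 0.2, 2.0, 25.0
mu = np.linspace(0, 1, M)

def phi(x):
    # feature vectors: bias term plus Gaussian basis functions
    x = np.atleast_1d(x)
    return np.hstack([np.ones((x.size, 1)),
                      np.exp(-(x[:, None] - mu) ** 2 / (2 * s ** 2))])

x = rng.uniform(0, 1, N)
t = np.sin(2 * np.pi * x) + rng.normal(0, beta ** -0.5, N)
Phi = phi(x)

# Posterior over w (zero-mean isotropic prior): S_N^{-1} = alpha I + beta Phi^T Phi, m_N = beta S_N Phi^T t
S_N = np.linalg.inv(alpha * np.eye(M + 1) + beta * Phi.T @ Phi)
m_N = beta * S_N @ Phi.T @ t

# Predictive distribution at new inputs: mean m_N^T phi(x), variance 1/beta + phi(x)^T S_N phi(x)
x_new = np.linspace(0, 1, 5)
Phi_new = phi(x_new)
pred_mean = Phi_new @ m_N
pred_var = 1.0 / beta + np.einsum('ij,jk,ik->i', Phi_new, S_N, Phi_new)
print(pred_mean, pred_var)
```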
Bayesian Linear Regression (2/5)
[Figure: predictive distributions on the sinusoidal data set as observations accumulate (four panels, x ∈ [0, 1], t ∈ [−1, 1]).]
Bayesian Linear Regression (3/5)
[Figure: curves y(x, w) with w drawn from the posterior, for the same data sets (four panels, x ∈ [0, 1], t ∈ [−1, 1]).]
Bayesian Linear Regression (4/5)
The predictive mean y(x, m_N) rewrites as y(x, m_N) = Σ_{n=1}^{N} k(x, x_n) t_n (3.61), where k(x, x') = β φ(x)^T S_N φ(x') (3.62) is the equivalent kernel.
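A sketch verifying this identity numerically, under the same assumed setup as above (Gaussian basis, illustrative α and β): the prediction m_N^T φ(x) coincides with the kernel-weighted sum Σ_n k(x, x_n) t_n.

```python
import numpy as np

rng = np.random.default_rng(7)
N, M, s, alpha, beta = 25, 9, 0.2, 2.0, 25.0
mu = np.linspace(0, 1, M)

def phi(x):
    x = np.atleast_1d(x)
    return np.hstack([np.ones((x.size, 1)),
                      np.exp(-(x[:, None] - mu) ** 2 / (2 * s ** 2))])

x = rng.uniform(0, 1, N)
t = np.sin(2 * np.pi * x) + rng.normal(0, beta ** -0.5, N)
Phi = phi(x)
S_N = np.linalg.inv(alpha * np.eye(M + 1) + beta * Phi.T @ Phi)
m_N = beta * S_N @ Phi.T @ t

x0 = 0.37                                    # an arbitrary query point
k = beta * phi(x0) @ S_N @ Phi.T             # equivalent kernel values k(x0, x_n), shape (1, N)
print(np.allclose(phi(x0) @ m_N, k @ t))     # True: same prediction either way
```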
Bayesian Linear Regression (5/5)
Kernel from Gaussian basis functions
[Figure: the equivalent kernel k(x, x') induced by Gaussian basis functions, plotted as a function of x' for two values of x.]
Bayesian Model Comparison (1/2)
The overfitting that appears in ML can be avoided by
marginalizing over the model parameters.
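For reference, the quantity being compared is the model evidence (marginal likelihood), obtained by marginalizing the likelihood over the parameters; a sketch of the definition, following PRML:

\[
p(\mathcal{D} \mid \mathcal{M}_i) = \int p(\mathcal{D} \mid \mathbf{w}, \mathcal{M}_i)\, p(\mathbf{w} \mid \mathcal{M}_i)\, d\mathbf{w}
\]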
Bayesian Model Comparison (2/2)
[Figure 3.12: a posterior over w of width ∆w_posterior, centred on w_MAP, inside a much broader prior of width ∆w_prior.]
[Figure 3.13: the evidence p(D) as a function of the data set D for three models M1, M2, M3 of increasing complexity; the simple model concentrates its probability mass on few data sets, the complex one spreads it thinly; D0 marks the observed data set.]
About p(D|M_i):
- if M_i is too simple, it fits the data poorly
- if M_i is too complex/powerful, the probability of generating the observed data is washed out (spread thinly over many possible data sets)
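The quantities labelled in the figure combine into PRML's rough single-parameter approximation of the evidence, which makes this simplicity/complexity trade-off explicit:

\[
p(\mathcal{D}) \simeq p(\mathcal{D} \mid w_{\mathrm{MAP}})\,\frac{\Delta w_{\mathrm{posterior}}}{\Delta w_{\mathrm{prior}}},
\qquad
\ln p(\mathcal{D}) \simeq \ln p(\mathcal{D} \mid w_{\mathrm{MAP}}) + \ln\!\left(\frac{\Delta w_{\mathrm{posterior}}}{\Delta w_{\mathrm{prior}}}\right)
\]

The second term is negative and grows in magnitude as the model becomes more flexible (the posterior sharpens relative to the prior), penalizing over-complex models.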
The evidence approximation (1/2)
A fully Bayesian treatment would imply marginalizing over hyperparameters as well as parameters, but this is intractable:
  p(t|t) = ∫∫∫ p(t|w, β) p(w|t, α, β) p(α, β|t) dw dα dβ   (3.74)
Assuming p(α, β|t) is sharply peaked at (α̂, β̂):
  p(t|t) ≃ p(t|t, α̂, β̂) = ∫ p(t|w, β̂) p(w|t, α̂, β̂) dw   (3.75)
The hyperparameters are then set by maximizing the log marginal likelihood (evidence):
  ln p(t|α, β) = (M/2) ln α + (N/2) ln β − E(m_N) − (1/2) ln |S_N^{−1}| − (N/2) ln(2π)   (3.77 → 3.86)
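A minimal sketch evaluating this log evidence for a Gaussian-basis model with assumed α and β, taking E(m_N) = (β/2)‖t − Φ m_N‖^2 + (α/2) m_N^T m_N (the regularized error at the posterior mean):

```python
import numpy as np

rng = np.random.default_rng(8)
N, M, s, alpha, beta = 25, 9, 0.2, 2.0, 25.0
mu = np.linspace(0, 1, M)
x = rng.uniform(0, 1, N)
t = np.sin(2 * np.pi * x) + rng.normal(0, beta ** -0.5, N)
Phi = np.hstack([np.ones((N, 1)),
                 np.exp(-(x[:, None] - mu) ** 2 / (2 * s ** 2))])

A = alpha * np.eye(M + 1) + beta * Phi.T @ Phi       # A = S_N^{-1}
m_N = beta * np.linalg.solve(A, Phi.T @ t)

# E(m_N) = beta/2 ||t - Phi m_N||^2 + alpha/2 m_N^T m_N
E_mN = 0.5 * beta * np.sum((t - Phi @ m_N) ** 2) + 0.5 * alpha * m_N @ m_N

# ln p(t|alpha, beta); note the dimensionality of w here is M + 1 (bias column included)
log_evidence = (0.5 * (M + 1) * np.log(alpha) + 0.5 * N * np.log(beta)
                - E_mN - 0.5 * np.linalg.slogdet(A)[1] - 0.5 * N * np.log(2 * np.pi))
print(log_evidence)
```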
The evidence approximation (2/2)
Plot of the model evidence ln p(t|α, β) versus M, the model complexity, for the polynomial regression of the synthetic sinusoidal example (with fixed α).
[Figure: the model evidence ln p(t|α, β) plotted against M = 0, . . . , 8.]