
Machine Learning

Linear Models for Regression

Marcello Restelli
March 9, 2017
Outline

1 Linear Regression

2 Minimizing Least Squares

3 Regularization

4 Bayesian Linear Regression



Linear Regression

Regression Problems

The goal of regression is to learn a mapping from an input x to a
continuous output t
We assume there exists a model that maps x to t
Examples:
    Predict stock market price
    Predict age of a web user
    Predict effect of an actuation in robotics
    Predict the value of a house
    Predict the temperature in a building


Linear Models

Many real processes can be approximated with linear models
Linear regression often appears as a module of larger systems
Linear problems can be solved analytically
Linear prediction provides an introduction to many of the core concepts
of machine learning
Augmented with kernels, it can model non-linear relationships


Linear Function

Linear function in the parameters w:


y(x, w) = w_0 + \sum_{j=1}^{D-1} w_j x_j = w^T x

x = (1, x_1, \dots, x_{D-1})
w_0 is the offset
Each point in (w_0, w_1) space describes a model

Loss Functions for Regression


We need to quantify what it means to do well or poorly on a task
We need to define a loss (error) function: L(t, y(x))
The average, or expected, loss is given by:

E[L] = \int\int L(t, y(x)) p(x, t) dx dt

A common choice is the squared loss function:

E[L] = \int\int (t - y(x))^2 p(x, t) dx dt

The optimal solution (if we assume a completely flexible function) is the
conditional average:

y(x) = \int t p(t|x) dt = E[t|x]
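
As a brief sanity check of this claim (a standard calculus-of-variations sketch, not spelled out on the slide), setting the derivative of the expected squared loss with respect to y(x) to zero recovers the conditional mean:

```latex
\frac{\partial E[L]}{\partial y(x)} = 2 \int \bigl(y(x) - t\bigr)\, p(x, t)\, dt = 0
\;\Longrightarrow\; y(x)\, p(x) = \int t\, p(x, t)\, dt
\;\Longrightarrow\; y(x) = \int t\, p(t \mid x)\, dt = E[t \mid x]
```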

Other Loss Functions

Simple generalization of the squared loss, called the Minkowski loss:


E[L] = \int\int |t - y(x)|^q p(x, t) dx dt

The minimum of E[L] is given by:


the conditional mean for q = 2
the conditional median for q = 1
the conditional mode for q → 0


Basis Functions

To consider non-linear functions we can use non-linear basis functions:


y(x, w) = w_0 + \sum_{j=1}^{M-1} w_j φ_j(x) = w^T φ(x)

φ(x) = (1, φ_1(x), \dots, φ_{M-1}(x))^T

Examples:
Polynomial: φ_j(x) = x^j
Gaussian: φ_j(x) = \exp\left( -\frac{(x - µ_j)^2}{2σ^2} \right)
Sigmoidal: φ_j(x) = \frac{1}{1 + \exp\left( \frac{µ_j - x}{σ} \right)}
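
A minimal NumPy sketch of these basis functions and of the design matrix Φ they induce (an illustration only; function names, centres, and widths are arbitrary choices, not from the slides):

```python
import numpy as np

def polynomial_basis(x, M):
    """phi_j(x) = x^j for j = 0..M-1 (j = 0 gives the bias column)."""
    return np.vander(x, M, increasing=True)

def gaussian_basis(x, centers, sigma):
    """phi_j(x) = exp(-(x - mu_j)^2 / (2 sigma^2)), plus a bias column."""
    phi = np.exp(-(x[:, None] - centers[None, :]) ** 2 / (2 * sigma ** 2))
    return np.hstack([np.ones((len(x), 1)), phi])

def sigmoidal_basis(x, centers, sigma):
    """phi_j(x) = 1 / (1 + exp((mu_j - x) / sigma)), plus a bias column."""
    phi = 1.0 / (1.0 + np.exp((centers[None, :] - x[:, None]) / sigma))
    return np.hstack([np.ones((len(x), 1)), phi])

# Example: design matrix Phi with 4 Gaussian basis functions plus bias
x = np.linspace(-1, 1, 10)
Phi = gaussian_basis(x, centers=np.linspace(-1, 1, 4), sigma=0.5)
print(Phi.shape)  # (10, 5): N rows, one column per basis function
```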


Basis Functions
Example
[Figure: plots of the target t against the input x and against the basis feature x^2]

Discriminative vs Generative

Generative approach:
Model the joint density: p(x, t) = p(x|t) p(t)
Infer the conditional density: p(t|x) = p(x, t) / p(x)
Marginalize to find the conditional mean: E[t|x] = \int t p(t|x) dt
Discriminative approach:
Model the conditional density p(t|x)
Marginalize to find the conditional mean: E[t|x] = \int t p(t|x) dt
Direct approach:
Find a regression function y(x) directly from the training data



Minimizing Least Squares

Minimizing Least Squares


Given a data set with N samples,
let us consider the following
L = 2.8 · 105
error (loss) function
L = 3.3 · 105
L = 7.6 · 103

w1
N L = 5.4 · 104
1X
L(w) = (y(xn , w) − tn )2 L = 1.0 · 105
2 L = 1.0 · 106
n=1

This is (half) the residual sum of w0


squares (RSS), a.k.a. sum of
squared errors (SSE)
It can also be written as the sum
of the `2 -norm of the vector of t
residual errors
N
X
RSS(w) = kk22 = 2i x
Marcello Restelli i=1 March 9, 2017 13 / 47

Ordinary Least Squares


Closed-Form Optimization

Let's write the RSS in matrix form, with Φ = (φ(x_1), \dots, φ(x_N))^T and
t = (t_1, \dots, t_N)^T:

L(w) = \frac{1}{2} RSS(w) = \frac{1}{2} (t - Φw)^T (t - Φw)

Compute the first and second derivatives:

\frac{\partial L(w)}{\partial w} = -Φ^T (t - Φw) ;
\frac{\partial^2 L(w)}{\partial w \partial w^T} = Φ^T Φ

Assuming Φ^T Φ is nonsingular:

ŵ_{OLS} = (Φ^T Φ)^{-1} Φ^T t

Complexity: O(NM^2 + M^3)
Cholesky: M^3 + NM^2/2
QR: NM^2
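
A minimal NumPy sketch of this closed-form solution (an illustration under the slide's assumptions; in practice one solves the normal equations or calls a least-squares routine rather than forming an explicit inverse):

```python
import numpy as np

def fit_ols(Phi, t):
    """Ordinary least squares: solve (Phi^T Phi) w = Phi^T t."""
    # Solving the normal equations is cheaper and more stable than inverting Phi^T Phi
    return np.linalg.solve(Phi.T @ Phi, Phi.T @ t)

def fit_ols_lstsq(Phi, t):
    """Least squares via numpy.linalg.lstsq, more robust when Phi^T Phi is ill-conditioned."""
    w, *_ = np.linalg.lstsq(Phi, t, rcond=None)
    return w

# Example on synthetic data
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 50)
t = 1.0 + 2.0 * x + 0.1 * rng.standard_normal(50)
Phi = np.column_stack([np.ones_like(x), x])   # design matrix (bias + x)
print(fit_ols(Phi, t))                        # approximately [1.0, 2.0]
```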

Geometric Interpretation

t is an N-dimensional vector
Let's denote with \varphi_j the j-th column of Φ
Define t̂ as the N-dimensional vector whose n-th element is y(x_n, w)
t̂ is a linear combination of \varphi_1, \dots, \varphi_M,
so t̂ lies in an M-dimensional subspace S
Since t̂ minimizes the SSE with respect to t, it is the projection
of t onto the subspace S

t̂ = Φŵ = Φ (Φ^T Φ)^{-1} Φ^T t

H = Φ (Φ^T Φ)^{-1} Φ^T is called the hat matrix

Geometric Example
Assume N = 3 and M = D = 2

Φ = X = \begin{pmatrix} 1 & 2 \\ 1 & -2 \\ 1 & 2 \end{pmatrix}, \quad
t = \begin{pmatrix} 5 \\ 1 \\ 2 \end{pmatrix}, \quad
t̂ = \begin{pmatrix} 3.5 \\ 1 \\ 3.5 \end{pmatrix}

[Figure: t and its projection t̂ onto the plane spanned by the columns of Φ]
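
A short NumPy check of this example (it simply reproduces the slide's numbers via the hat matrix):

```python
import numpy as np

Phi = np.array([[1.0,  2.0],
                [1.0, -2.0],
                [1.0,  2.0]])
t = np.array([5.0, 1.0, 2.0])

# Hat matrix H = Phi (Phi^T Phi)^{-1} Phi^T projects t onto the column space of Phi
H = Phi @ np.linalg.inv(Phi.T @ Phi) @ Phi.T
t_hat = H @ t
print(t_hat)  # [3.5 1.  3.5], as on the slide
```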

Gradient Optimization
Closed-form solution is not practical with big data
We can use sequential (online) updates
Stochastic gradient descent
If the loss function can be expressed as a sum over samples
(L = \sum_n L(x_n)):

w^{(k+1)} = w^{(k)} - α^{(k)} \nabla L(x_n)

w^{(k+1)} = w^{(k)} - α^{(k)} \left( w^{(k)T} φ(x_n) - t_n \right) φ(x_n)

where k is the iteration index and α^{(k)} is the learning rate

For convergence, the learning rate has to satisfy:

\sum_{k=0}^{\infty} α^{(k)} = +\infty \qquad \sum_{k=0}^{\infty} \left( α^{(k)} \right)^2 < +\infty
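
A compact sketch of this sequential update in NumPy (the learning-rate schedule and epoch count are illustrative choices, not prescribed by the slide):

```python
import numpy as np

def sgd_linear_regression(Phi, t, n_epochs=50, alpha0=0.1):
    """Sequential least squares: one weight update per sample."""
    N, M = Phi.shape
    w = np.zeros(M)
    k = 0
    for _ in range(n_epochs):
        for n in np.random.permutation(N):
            alpha = alpha0 / (1.0 + 0.01 * k)   # slowly decaying learning rate
            error = w @ Phi[n] - t[n]           # prediction error on sample n
            w = w - alpha * error * Phi[n]      # gradient step on that sample
            k += 1
    return w

# Example: same synthetic data as the batch solution
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 50)
t = 1.0 + 2.0 * x + 0.1 * rng.standard_normal(50)
Phi = np.column_stack([np.ones_like(x), x])
print(sgd_linear_regression(Phi, t))  # should approach the OLS solution [1.0, 2.0]
```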


Maximum Likelihood (ML)

The output variable t can be modeled as a deterministic function f of the
input x plus a random noise ε:

t = f(x) + ε

We want to approximate f(x) with y(x, w)
We assume ε ∼ N(0, σ^2)
Given N samples, with inputs X = \{x_1, \dots, x_N\} and outputs
t = (t_1, \dots, t_N)^T, the likelihood function is

p(t | X, w, σ^2) = \prod_{n=1}^{N} N(t_n | w^T φ(x_n), σ^2)

Maximum Likelihood (ML)

Assuming the samples to be independent and identically distributed (iid),
we consider the log-likelihood:

\ell(w) = \ln p(t | X, w, σ^2) = \sum_{n=1}^{N} \ln p(t_n | x_n, w, σ^2)
        = -\frac{N}{2} \ln(2πσ^2) - \frac{1}{2σ^2} RSS(w)

To find the maximum likelihood solution, we set the gradient to zero:

\nabla \ell(w) = \sum_{n=1}^{N} t_n φ(x_n)^T - w^T \left( \sum_{n=1}^{N} φ(x_n) φ(x_n)^T \right) = 0

w_{ML} = (Φ^T Φ)^{-1} Φ^T t

Variance of the Parameters


We assume
the observations t_i are uncorrelated and have constant variance σ^2
the x_i are fixed (non-random)
The variance-covariance matrix of the least-squares estimate is

Var(ŵ_{OLS}) = (Φ^T Φ)^{-1} σ^2

Usually, the variance σ^2 is estimated by

\hat{σ}^2 = \frac{1}{N - M - 1} \sum_{n=1}^{N} (t_n - ŵ^T φ(x_n))^2

Assuming that the model is linear in the features φ_1(·), \dots, φ_M(·) and that
the noise is additive and Gaussian:

ŵ ∼ N(w, (Φ^T Φ)^{-1} σ^2), \qquad (N - M - 1)\hat{σ}^2 ∼ σ^2 χ^2_{N-M-1}

Such properties can be used to form hypothesis tests and confidence intervals
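
A sketch of these variance estimates in NumPy (using the slide's N − M − 1 degrees of freedom; the function name is illustrative):

```python
import numpy as np

def ols_with_uncertainty(Phi, t):
    """OLS fit plus the variance estimates from the slide."""
    N, M = Phi.shape
    w_hat = np.linalg.solve(Phi.T @ Phi, Phi.T @ t)
    residuals = t - Phi @ w_hat
    # Noise variance estimate with N - M - 1 degrees of freedom
    sigma2_hat = residuals @ residuals / (N - M - 1)
    # Variance-covariance matrix of the estimated weights
    w_cov = np.linalg.inv(Phi.T @ Phi) * sigma2_hat
    return w_hat, sigma2_hat, w_cov

# Standard errors of the weights are the square roots of the diagonal of w_cov:
# se = np.sqrt(np.diag(w_cov))
```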

Gauss-Markov Theorem

Theorem (Gauss-Markov)
The least squares estimate of w has the smallest variance among all linear
unbiased estimates.

It follows that the least squares estimator has the lowest MSE among all
unbiased linear estimators
However, there may exist a biased estimator with smaller MSE


Multiple Outputs

Let us now consider the case of multiple outputs
We could use a different set of basis functions for each output, thus
having independent regression problems
Usually, a single set of basis functions is used:

Ŵ_{ML} = (Φ^T Φ)^{-1} Φ^T T

For each output t_k we have

ŵ_k = (Φ^T Φ)^{-1} Φ^T t_k

where t_k is an N-dimensional column vector
The solution decouples between the different outputs
The pseudo-inverse Φ^† = (Φ^T Φ)^{-1} Φ^T needs to be computed only once
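
A sketch of the multi-output fit in NumPy, where T is the N × K matrix whose k-th column is t_k (the synthetic data below is purely illustrative):

```python
import numpy as np

def fit_multi_output(Phi, T):
    """Fit all outputs at once: W = (Phi^T Phi)^{-1} Phi^T T."""
    # One factorization of Phi^T Phi serves every output column of T
    return np.linalg.solve(Phi.T @ Phi, Phi.T @ T)

# Example: K = 2 outputs sharing the same design matrix
rng = np.random.default_rng(0)
Phi = np.column_stack([np.ones(30), rng.uniform(0, 1, 30)])
W_true = np.array([[1.0, -2.0],
                   [2.0,  0.5]])
T = Phi @ W_true + 0.05 * rng.standard_normal((30, 2))
W_hat = fit_multi_output(Phi, T)
print(W_hat)  # columns approximately [1, 2] and [-2, 0.5]
```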


Increasing Model Complexity: Quadratic Function

[Figure: noisy data from a quadratic function, fitted with polynomials of degree 1, 2, 10, 15, and 19]


Increasing Model Complexity: Sinusoidal Function

[Figures: polynomial fits of increasing order to the sinusoidal dataset]

Under-fitting vs Over-Fitting

With low-order polynomials we have under-fitting
With high-order polynomials we get an excellent fit on the training data,
but a poor representation of the true function: over-fitting
We want to have good generalization
We use a test set of 100 samples to evaluate generalization:

E_{RMS} = \sqrt{\frac{2 RSS(ŵ)}{N}}


How to Avoid Over-fitting?

This is the problem of model selection (we will see this later)
What happens when the number of training samples increases?


How to Avoid Over-fitting?

What happens to the parameters when the model gets more complex?

        M=0      M=1       M=3            M=9

ŵ0      0.19     0.82      0.31           0.35
ŵ1              -1.27      7.99         232.37
ŵ2                       -25.43       -5321.83
ŵ3                                    48568.31
ŵ4                                  -231639.30
ŵ5                                   640042.26
ŵ6                                 -1061800.52
ŵ7                                  1042400.18
ŵ8                                  -557682.99
ŵ9                                   125201.43



Regularization

Ridge Regression
One way to reduce the MSE is to change the loss function as follows:

L(w) = L_D(w) + λ L_W(w)

L_D(w): error on data (e.g., RSS)
L_W(w): model complexity
By taking L_W(w) = \frac{1}{2} w^T w = \frac{1}{2} \|w\|_2^2 we get

L(w) = \frac{1}{2} \sum_{i=1}^{N} \left( t_i - w^T φ(x_i) \right)^2 + \frac{λ}{2} \|w\|_2^2

It is called ridge regression (or weight decay)
It is a regularization (or parameter shrinkage) method
The loss function is still quadratic in w:

ŵ_{ridge} = (λI + Φ^T Φ)^{-1} Φ^T t
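
A minimal NumPy sketch of the ridge solution (whether to exclude the bias weight w_0 from the penalty is a modeling choice; here all weights are penalized for simplicity):

```python
import numpy as np

def fit_ridge(Phi, t, lam):
    """Ridge regression: w = (lam * I + Phi^T Phi)^{-1} Phi^T t."""
    M = Phi.shape[1]
    return np.linalg.solve(lam * np.eye(M) + Phi.T @ Phi, Phi.T @ t)

# Larger lam shrinks the weights; lam = 0 recovers ordinary least squares
# w_ridge = fit_ridge(Phi, t, lam=1e-3)
```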


Ridge Regression
Quadratic Example

Polynomial of degree 15 fitted to the quadratic dataset with increasing regularization:

λ          \|w\|_2
0          2.74e+6
1e-10      1.75e+3
1e-9       9.32e+2
1e-8       3.92e+2
1e-7       1.54e+2
1e-6       6.22e+1

[Figures: the fitted curves for each λ, and the weight values w as λ varies from 10^{-10} to 10^{-6}]
Ridge Regression
Sinusoidal Example

[Figure: ridge regression fits on the sinusoidal dataset]



Lasso

Another popular regularization method is the lasso:

L(w) = \frac{1}{2} \sum_{i=1}^{N} \left( t_i - w^T φ(x_i) \right)^2 + \frac{λ}{2} \|w\|_1

where \|w\|_1 = \sum_{j=1}^{M} |w_j|
Differently from ridge, the lasso solution is nonlinear in the t_i and no
closed-form solution exists (it is a quadratic programming problem)
Nonetheless, it has the advantage of driving some weights exactly to zero
when λ is sufficiently large
Lasso yields sparse models
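
As an illustration, scikit-learn's Lasso solves this kind of ℓ1-penalized least-squares problem by coordinate descent (a sketch; its alpha parameter plays the role of λ up to scaling conventions, and the data below is synthetic):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 10))
t = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.standard_normal(100)  # only 2 relevant features

lasso = Lasso(alpha=0.1).fit(X, t)
ridge = Ridge(alpha=0.1).fit(X, t)
print(np.round(lasso.coef_, 2))  # most coefficients exactly 0 (sparse model)
print(np.round(ridge.coef_, 2))  # coefficients shrunk but generally nonzero
```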


Lasso vs Ridge Regression


[Figure: the ridge (ℓ2) and lasso (ℓ1) constraint regions in (w_1, w_2) space, with the
unregularized optimum w^*]

Lasso tends to generate sparser solutions than a quadratic regularizer



Bayesian Linear Regression

Bayesian Approach

We formulate our knowledge about the world in a probabilistic way


We define the model that expresses our knowledge qualitatively
Our model will have some unknown parameters
We capture our assumptions about unknown parameters by specifying the
prior distribution over those parameters before seeing the data
We observe the data
We compute the posterior probability distribution for the parameters,
given observed data
We use the posterior distribution to:
Make predictions by averaging over the posterior distribution
Examine/Account for uncertainty in the parameter values
Make decisions by minimizing expected posterior loss


Posterior Distribution
The posterior distribution for the model parameters can be found by
combining the prior with the likelihood for the parameters given the data
This is accomplished using Bayes' rule:

P(parameters | data) = \frac{P(data | parameters) P(parameters)}{P(data)}

p(w | D) = \frac{p(D | w) P(w)}{P(D)}

where
p(w|D) is the posterior probability of parameters w given training data D
p(D|w) is the probability (likelihood) of observing D given w
P(w) is the prior probability over the parameters
P(D) is the marginal likelihood (normalizing constant): P(D) = \int p(D|w) P(w) dw
We want the most probable value of w given the data: the maximum a
posteriori (MAP) estimate. It is the mode of the posterior.

Predictive Distribution

Stating Bayes’ rule in words: posterior ∝ likelihood × prior


Prediction for a new data point x^* (given the training dataset D) can be
done by integrating over the posterior distribution:

p(x^* | D) = \int p(x^* | w, D) p(w | D) dw = E_{p(w|D)}[p(x^* | w, D)]

which is sometimes called the predictive distribution


Note that computing the predictive distribution requires knowledge of
the posterior distribution, which is usually intractable


Bayesian Linear Regression

Another approach to avoid the over-fitting problem of ML is to use Bayesian
linear regression
In the Bayesian approach the parameters of the model are considered as drawn
from some distribution
Assuming a Gaussian likelihood model, the conjugate prior is Gaussian too:

p(w) = N(w | w_0, S_0)

Given the data D, the posterior is still Gaussian:

p(w | t, Φ, σ^2) ∝ N(w | w_0, S_0) N(t | Φw, σ^2 I_N) = N(w | w_N, S_N)

w_N = S_N \left( S_0^{-1} w_0 + \frac{Φ^T t}{σ^2} \right)

S_N^{-1} = S_0^{-1} + \frac{Φ^T Φ}{σ^2}

For sequential data, the posterior acts as the prior for the next iteration
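
A compact NumPy sketch of this posterior update, assuming the isotropic prior w_0 = 0, S_0 = τ²I used in the examples that follow (the function name is illustrative):

```python
import numpy as np

def bayesian_posterior(Phi, t, sigma2, tau2):
    """Posterior N(w | w_N, S_N) under prior w ~ N(0, tau2 * I) and noise variance sigma2."""
    M = Phi.shape[1]
    S0_inv = np.eye(M) / tau2
    SN_inv = S0_inv + Phi.T @ Phi / sigma2   # S_N^{-1} = S_0^{-1} + Phi^T Phi / sigma^2
    SN = np.linalg.inv(SN_inv)
    wN = SN @ (Phi.T @ t / sigma2)           # w_0 = 0, so only the data term remains
    return wN, SN
```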

Relation to Ridge Regression

In Gaussian distributions the mode coincides with the mean
It follows that w_N is the MAP (maximum a posteriori) estimator
If the prior has infinite variance, w_N reduces to the ML estimator
If w_0 = 0 and S_0 = τ^2 I, then w_N reduces to the ridge estimate, with
λ = \frac{σ^2}{τ^2}


1D Example

Data generated from: t(x) = -0.3 + 0.5x + ε, where ε ∼ N(0, 0.04)
x values taken uniformly from [-1, 1]
Model: y(x, w) = w_0 + w_1 x
We assume to know σ^2 = 0.04 and τ^2 = 0.5

[Figure: parameter space (w_0, w_1) and data space (x, t)]

1D Example

[Figure panels: 0, 1, 2, and 20 data points observed]

Predictive Distribution

We are interested in the posterior predictive distribution:

p(t | x, D, σ^2) = \int N(t | w^T φ(x), σ^2) N(w | w_N, S_N) dw
                 = N(t | w_N^T φ(x), σ_N^2(x))

where

σ_N^2(x) = \underbrace{σ^2}_{\text{noise in the target values}}
         + \underbrace{φ(x)^T S_N φ(x)}_{\text{uncertainty associated with parameter values}}

In the limit, as N → ∞, the second term goes to zero
The variance of the predictive distribution then arises only from the
additive noise governed by the parameter σ
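
Continuing the sketch above, the predictive mean and variance at a new input follow directly from w_N and S_N (names are illustrative):

```python
import numpy as np

def predictive_distribution(phi_x, wN, SN, sigma2):
    """Predictive mean and variance at a single feature vector phi_x."""
    mean = wN @ phi_x
    var = sigma2 + phi_x @ SN @ phi_x   # noise variance + parameter uncertainty
    return mean, var
```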


Example
Sinusoidal dataset, 9 Gaussian basis functions


Modeling Challenges

The first challenge is specifying a suitable model and suitable prior
distributions
A suitable model should admit all the possibilities that are thought to be
at all likely
A suitable prior should avoid giving zero or very small probability to
plausible events, but should also avoid spreading the probability over
all possibilities
To avoid uninformative priors, we may need to model dependencies
between parameters
One strategy is to introduce latent variables into the model and
hyperparameters into the prior
Both are ways of modeling dependencies in a tractable way


Computational Challenges
The other big challenge is computing the posterior distribution. There are
several approaches:
Analytical integration: if we use conjugate priors, the posterior
distribution can be computed analytically. It only works for simple models
Gaussian (Laplace) approximation: approximate the posterior
distribution with a Gaussian. It works well when there is a lot of data
compared to the model complexity
Monte Carlo integration: once we have samples from the posterior
distribution, we can do many things. Currently, the most common approach is
Markov Chain Monte Carlo (MCMC), which consists of simulating a
Markov chain that converges to the posterior distribution
Variational approximation: a cleverer way of approximating the
posterior. It is usually faster than MCMC, but it is less general


Pros and Cons of Fixed Basis Functions

Advantages
Closed-form solution
Tractable Bayesian treatment
Arbitrary non-linearity with the proper basis functions
Limitations
Basis functions are chosen independently from the training set
Curse of dimensionality
