
Machine Learning

Linear Models for Regression

Marcello Restelli
March 9, 2017
Outline

1 Linear Regression

2 Minimizing Least Squares

3 Regularization

4 Bayesian Linear Regression



Linear Regression

Regression Problems

The goal of regression is to learn a mapping from an input x to a
continuous output t
We assume there exists a model that maps x to t
Examples:
    Predict stock market price
    Predict age of a web user
    Predict effect of an actuation in robotics
    Predict the value of a house
    Predict the temperature in a building


Linear Models

Many real processes can be approximated with linear models
Linear regression often appears as a module of larger systems
Linear problems can be solved analytically
Linear prediction provides an introduction to many of the core concepts
of machine learning
Augmented with kernels, it can model non-linear relationships


Linear Function

Linear function in the parameters w:


y(x, w) = w_0 + \sum_{j=1}^{D-1} w_j x_j = w^T x

x = (1, x_1, \dots, x_{D-1})
w_0 is the offset
Each point in (w_0, w_1) space describes a model

Loss Functions for Regression


We need to quantify what it means to do well or poorly on a task
We need to define a loss (error) function: L(t, y(x))
The average, or expected, loss is given by:

E[L] = \int\int L(t, y(x)) p(x, t) dx dt

A common choice is the squared loss function:

E[L] = \int\int (t - y(x))^2 p(x, t) dx dt

The optimal solution (if we assume a completely flexible function) is the
conditional average:

y(x) = \int t p(t|x) dt = E[t|x]
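
As a brief sanity check of this claim (a standard calculus-of-variations sketch, not spelled out on the slide), setting the derivative of the expected squared loss with respect to y(x) to zero recovers the conditional mean:

```latex
\frac{\partial E[L]}{\partial y(x)} = 2 \int \bigl(y(x) - t\bigr)\, p(x, t)\, dt = 0
\;\Longrightarrow\; y(x)\, p(x) = \int t\, p(x, t)\, dt
\;\Longrightarrow\; y(x) = \int t\, p(t \mid x)\, dt = E[t \mid x]
```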

Other Loss Functions

Simple generalization of the squared loss, called the Minkowski loss:


E[L] = \int\int |t - y(x)|^q p(x, t) dx dt

The minimum of E[L] is given by:


the conditional mean for q = 2
the conditional median for q = 1
the conditional mode for q → 0


Basis Functions

To consider non-linear functions we can use non-linear basis functions:


y(x, w) = w_0 + \sum_{j=1}^{M-1} w_j φ_j(x) = w^T φ(x)

φ(x) = (1, φ_1(x), \dots, φ_{M-1}(x))^T

Examples:
Polynomial: φ_j(x) = x^j
Gaussian: φ_j(x) = \exp\left( -\frac{(x - µ_j)^2}{2σ^2} \right)
Sigmoidal: φ_j(x) = \frac{1}{1 + \exp\left( \frac{µ_j - x}{σ} \right)}
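
A minimal NumPy sketch of these basis functions and of the design matrix Φ they induce (an illustration only; function names, centres, and widths are arbitrary choices, not from the slides):

```python
import numpy as np

def polynomial_basis(x, M):
    """phi_j(x) = x^j for j = 0..M-1 (j = 0 gives the bias column)."""
    return np.vander(x, M, increasing=True)

def gaussian_basis(x, centers, sigma):
    """phi_j(x) = exp(-(x - mu_j)^2 / (2 sigma^2)), plus a bias column."""
    phi = np.exp(-(x[:, None] - centers[None, :]) ** 2 / (2 * sigma ** 2))
    return np.hstack([np.ones((len(x), 1)), phi])

def sigmoidal_basis(x, centers, sigma):
    """phi_j(x) = 1 / (1 + exp((mu_j - x) / sigma)), plus a bias column."""
    phi = 1.0 / (1.0 + np.exp((centers[None, :] - x[:, None]) / sigma))
    return np.hstack([np.ones((len(x), 1)), phi])

# Example: design matrix Phi with 4 Gaussian basis functions plus bias
x = np.linspace(-1, 1, 10)
Phi = gaussian_basis(x, centers=np.linspace(-1, 1, 4), sigma=0.5)
print(Phi.shape)  # (10, 5): N rows, one column per basis function
```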


Basis Functions
Example
[Figure: plots of the target t against the input x and against the basis feature x^2]

Discriminative vs Generative

Generative approach:
Model the joint density: p(x, t) = p(x|t) p(t)
Infer the conditional density: p(t|x) = p(x, t) / p(x)
Marginalize to find the conditional mean: E[t|x] = \int t p(t|x) dt
Discriminative approach:
Model the conditional density p(t|x)
Marginalize to find the conditional mean: E[t|x] = \int t p(t|x) dt
Direct approach:
Find a regression function y(x) directly from the training data



Minimizing Least Squares

Minimizing Least Squares


Given a data set with N samples,
let us consider the following
L = 2.8 · 105
error (loss) function
L = 3.3 · 105
L = 7.6 · 103

w1
N L = 5.4 · 104
1X
L(w) = (y(xn , w) − tn )2 L = 1.0 · 105
2 L = 1.0 · 106
n=1

This is (half) the residual sum of w0


squares (RSS), a.k.a. sum of
squared errors (SSE)
It can also be written as the sum
of the `2 -norm of the vector of t
residual errors
N
X
RSS(w) = kk22 = 2i x
Marcello Restelli i=1 March 9, 2017 13 / 47

Ordinary Least Squares


Closed-Form Optimization

Let's write the RSS in matrix form, with Φ = (φ(x_1), \dots, φ(x_N))^T and
t = (t_1, \dots, t_N)^T:

L(w) = \frac{1}{2} RSS(w) = \frac{1}{2} (t - Φw)^T (t - Φw)

Compute the first and second derivatives:

\frac{\partial L(w)}{\partial w} = -Φ^T (t - Φw) ;
\frac{\partial^2 L(w)}{\partial w \partial w^T} = Φ^T Φ

Assuming Φ^T Φ is nonsingular:

ŵ_{OLS} = (Φ^T Φ)^{-1} Φ^T t

Complexity: O(NM^2 + M^3)
Cholesky: M^3 + NM^2/2
QR: NM^2
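
A minimal NumPy sketch of this closed-form solution (an illustration under the slide's assumptions; in practice one solves the normal equations or calls a least-squares routine rather than forming an explicit inverse):

```python
import numpy as np

def fit_ols(Phi, t):
    """Ordinary least squares: solve (Phi^T Phi) w = Phi^T t."""
    # Solving the normal equations is cheaper and more stable than inverting Phi^T Phi
    return np.linalg.solve(Phi.T @ Phi, Phi.T @ t)

def fit_ols_lstsq(Phi, t):
    """Least squares via numpy.linalg.lstsq, more robust when Phi^T Phi is ill-conditioned."""
    w, *_ = np.linalg.lstsq(Phi, t, rcond=None)
    return w

# Example on synthetic data
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 50)
t = 1.0 + 2.0 * x + 0.1 * rng.standard_normal(50)
Phi = np.column_stack([np.ones_like(x), x])   # design matrix (bias + x)
print(fit_ols(Phi, t))                        # approximately [1.0, 2.0]
```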

Geometric Interpretation

t is an N-dimensional vector
Let's denote with \varphi_j the j-th column of Φ
Define t̂ as the N-dimensional vector whose n-th element is y(x_n, w)
t̂ is a linear combination of \varphi_1, \dots, \varphi_M,
so t̂ lies in an M-dimensional subspace S
Since t̂ minimizes the SSE with respect to t, it is the projection
of t onto the subspace S

t̂ = Φŵ = Φ (Φ^T Φ)^{-1} Φ^T t

H = Φ (Φ^T Φ)^{-1} Φ^T is called the hat matrix

Geometric Example
Assume N = 3 and M = D = 2

Φ = X = \begin{pmatrix} 1 & 2 \\ 1 & -2 \\ 1 & 2 \end{pmatrix}, \quad
t = \begin{pmatrix} 5 \\ 1 \\ 2 \end{pmatrix}, \quad
t̂ = \begin{pmatrix} 3.5 \\ 1 \\ 3.5 \end{pmatrix}

[Figure: t and its projection t̂ onto the plane spanned by the columns of Φ]
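
A short NumPy check of this example (it simply reproduces the slide's numbers via the hat matrix):

```python
import numpy as np

Phi = np.array([[1.0,  2.0],
                [1.0, -2.0],
                [1.0,  2.0]])
t = np.array([5.0, 1.0, 2.0])

# Hat matrix H = Phi (Phi^T Phi)^{-1} Phi^T projects t onto the column space of Phi
H = Phi @ np.linalg.inv(Phi.T @ Phi) @ Phi.T
t_hat = H @ t
print(t_hat)  # [3.5 1.  3.5], as on the slide
```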

Gradient Optimization
Closed-form solution is not practical with big data
We can use sequential (online) updates
Stochastic gradient descent
If the loss function can be expressed as a sum over samples
(L = \sum_n L(x_n)):

w^{(k+1)} = w^{(k)} - α^{(k)} \nabla L(x_n)

w^{(k+1)} = w^{(k)} - α^{(k)} \left( w^{(k)T} φ(x_n) - t_n \right) φ(x_n)

where k is the iteration index and α^{(k)} is the learning rate

For convergence, the learning rate has to satisfy:

\sum_{k=0}^{\infty} α^{(k)} = +\infty \qquad \sum_{k=0}^{\infty} \left( α^{(k)} \right)^2 < +\infty
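
A compact sketch of this sequential update in NumPy (the learning-rate schedule and epoch count are illustrative choices, not prescribed by the slide):

```python
import numpy as np

def sgd_linear_regression(Phi, t, n_epochs=50, alpha0=0.1):
    """Sequential least squares: one weight update per sample."""
    N, M = Phi.shape
    w = np.zeros(M)
    k = 0
    for _ in range(n_epochs):
        for n in np.random.permutation(N):
            alpha = alpha0 / (1.0 + 0.01 * k)   # slowly decaying learning rate
            error = w @ Phi[n] - t[n]           # prediction error on sample n
            w = w - alpha * error * Phi[n]      # gradient step on that sample
            k += 1
    return w

# Example: same synthetic data as the batch solution
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 50)
t = 1.0 + 2.0 * x + 0.1 * rng.standard_normal(50)
Phi = np.column_stack([np.ones_like(x), x])
print(sgd_linear_regression(Phi, t))  # should approach the OLS solution [1.0, 2.0]
```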


Maximum Likelihood (ML)

The output variable t can be modeled as a deterministic function f of the
input x plus a random noise ε:

t = f(x) + ε

We want to approximate f(x) with y(x, w)
We assume ε ∼ N(0, σ^2)
Given N samples, with inputs X = \{x_1, \dots, x_N\} and outputs
t = (t_1, \dots, t_N)^T, the likelihood function is

p(t | X, w, σ^2) = \prod_{n=1}^{N} N(t_n | w^T φ(x_n), σ^2)

Maximum Likelihood (ML)

Assuming the samples to be independent and identically distributed (iid),
we consider the log-likelihood:

\ell(w) = \ln p(t | X, w, σ^2) = \sum_{n=1}^{N} \ln p(t_n | x_n, w, σ^2)
        = -\frac{N}{2} \ln(2πσ^2) - \frac{1}{2σ^2} RSS(w)

To find the maximum likelihood solution, we set the gradient to zero:

\nabla \ell(w) = \sum_{n=1}^{N} t_n φ(x_n)^T - w^T \left( \sum_{n=1}^{N} φ(x_n) φ(x_n)^T \right) = 0

w_{ML} = (Φ^T Φ)^{-1} Φ^T t

Variance of the Parameters


We assume
the observations t_i are uncorrelated and have constant variance σ^2
the x_i are fixed (non-random)
The variance-covariance matrix of the least-squares estimate is

Var(ŵ_{OLS}) = (Φ^T Φ)^{-1} σ^2

Usually, the variance σ^2 is estimated by

\hat{σ}^2 = \frac{1}{N - M - 1} \sum_{n=1}^{N} (t_n - ŵ^T φ(x_n))^2

Assuming that the model is linear in the features φ_1(·), \dots, φ_M(·) and that
the noise is additive and Gaussian:

ŵ ∼ N(w, (Φ^T Φ)^{-1} σ^2), \qquad (N - M - 1)\hat{σ}^2 ∼ σ^2 χ^2_{N-M-1}

Such properties can be used to form hypothesis tests and confidence intervals
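
A sketch of these variance estimates in NumPy (using the slide's N − M − 1 degrees of freedom; the function name is illustrative):

```python
import numpy as np

def ols_with_uncertainty(Phi, t):
    """OLS fit plus the variance estimates from the slide."""
    N, M = Phi.shape
    w_hat = np.linalg.solve(Phi.T @ Phi, Phi.T @ t)
    residuals = t - Phi @ w_hat
    # Noise variance estimate with N - M - 1 degrees of freedom
    sigma2_hat = residuals @ residuals / (N - M - 1)
    # Variance-covariance matrix of the estimated weights
    w_cov = np.linalg.inv(Phi.T @ Phi) * sigma2_hat
    return w_hat, sigma2_hat, w_cov

# Standard errors of the weights are the square roots of the diagonal of w_cov:
# se = np.sqrt(np.diag(w_cov))
```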

Gauss-Markov Theorem

Theorem (Gauss-Markov)
The least squares estimate of w has the smallest variance among all linear
unbiased estimates.

It follows that the least squares estimator has the lowest MSE among all
unbiased linear estimators
However, there may exist a biased estimator with smaller MSE


Multiple Outputs

Let us now consider the case of multiple outputs
We could use a different set of basis functions for each output, thus
having independent regression problems
Usually, a single set of basis functions is used:

Ŵ_{ML} = (Φ^T Φ)^{-1} Φ^T T

For each output t_k we have

ŵ_k = (Φ^T Φ)^{-1} Φ^T t_k

where t_k is an N-dimensional column vector
The solution decouples between the different outputs
The pseudo-inverse Φ^† = (Φ^T Φ)^{-1} Φ^T needs to be computed only once
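
A sketch of the multi-output fit in NumPy, where T is the N × K matrix whose k-th column is t_k (the synthetic data below is purely illustrative):

```python
import numpy as np

def fit_multi_output(Phi, T):
    """Fit all outputs at once: W = (Phi^T Phi)^{-1} Phi^T T."""
    # One factorization of Phi^T Phi serves every output column of T
    return np.linalg.solve(Phi.T @ Phi, Phi.T @ T)

# Example: K = 2 outputs sharing the same design matrix
rng = np.random.default_rng(0)
Phi = np.column_stack([np.ones(30), rng.uniform(0, 1, 30)])
W_true = np.array([[1.0, -2.0],
                   [2.0,  0.5]])
T = Phi @ W_true + 0.05 * rng.standard_normal((30, 2))
W_hat = fit_multi_output(Phi, T)
print(W_hat)  # columns approximately [1, 2] and [-2, 0.5]
```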


Increasing Model Complexity: Quadratic Function

[Figure: noisy data from a quadratic function, fitted with polynomials of degree 1, 2, 10, 15, and 19]


Increasing Model Complexity: Sinusoidal Function

[Figures: polynomial fits of increasing order to the sinusoidal dataset]

Under-fitting vs Over-Fitting

With low-order polynomials we have under-fitting
With high-order polynomials we get an excellent fit on the training data,
but a poor representation of the true function: over-fitting
We want to have good generalization
We use a test set of 100 samples to evaluate generalization:

E_{RMS} = \sqrt{\frac{2 RSS(ŵ)}{N}}


How to Avoid Over-fitting?

This is the problem of model selection (we will see this later)
What happens when the number of training samples increases?


How to Avoid Over-fitting?

What happens to the parameters when the model gets more complex?

        M=0      M=1       M=3            M=9

ŵ0      0.19     0.82      0.31           0.35
ŵ1              -1.27      7.99         232.37
ŵ2                       -25.43       -5321.83
ŵ3                                    48568.31
ŵ4                                  -231639.30
ŵ5                                   640042.26
ŵ6                                 -1061800.52
ŵ7                                  1042400.18
ŵ8                                  -557682.99
ŵ9                                   125201.43



Regularization

Ridge Regression
One way to reduce the MSE is to change the loss function as follows:

L(w) = L_D(w) + λ L_W(w)

L_D(w): error on data (e.g., RSS)
L_W(w): model complexity
By taking L_W(w) = \frac{1}{2} w^T w = \frac{1}{2} \|w\|_2^2 we get

L(w) = \frac{1}{2} \sum_{i=1}^{N} \left( t_i - w^T φ(x_i) \right)^2 + \frac{λ}{2} \|w\|_2^2

It is called ridge regression (or weight decay)
It is a regularization (or parameter shrinkage) method
The loss function is still quadratic in w:

ŵ_{ridge} = (λI + Φ^T Φ)^{-1} Φ^T t
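
A minimal NumPy sketch of the ridge solution (whether to exclude the bias weight w_0 from the penalty is a modeling choice; here all weights are penalized for simplicity):

```python
import numpy as np

def fit_ridge(Phi, t, lam):
    """Ridge regression: w = (lam * I + Phi^T Phi)^{-1} Phi^T t."""
    M = Phi.shape[1]
    return np.linalg.solve(lam * np.eye(M) + Phi.T @ Phi, Phi.T @ t)

# Larger lam shrinks the weights; lam = 0 recovers ordinary least squares
# w_ridge = fit_ridge(Phi, t, lam=1e-3)
```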


Ridge Regression
Quadratic Example

Polynomial of degree 15 fitted to the quadratic dataset with increasing regularization:

λ          \|w\|_2
0          2.74e+6
1e-10      1.75e+3
1e-9       9.32e+2
1e-8       3.92e+2
1e-7       1.54e+2
1e-6       6.22e+1

[Figures: the fitted curves for each λ, and the weight values w as λ varies from 10^{-10} to 10^{-6}]
Ridge Regression
Sinusoidal Example

[Figure: ridge regression fits on the sinusoidal dataset]



Lasso

Another popular regularization method is the lasso:

L(w) = \frac{1}{2} \sum_{i=1}^{N} \left( t_i - w^T φ(x_i) \right)^2 + \frac{λ}{2} \|w\|_1

where \|w\|_1 = \sum_{j=1}^{M} |w_j|
Differently from ridge, the lasso solution is nonlinear in the t_i and no
closed-form solution exists (it is a quadratic programming problem)
Nonetheless, it has the advantage of driving some weights exactly to zero
when λ is sufficiently large
Lasso yields sparse models
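
As an illustration, scikit-learn's Lasso solves this kind of ℓ1-penalized least-squares problem by coordinate descent (a sketch; its alpha parameter plays the role of λ up to scaling conventions, and the data below is synthetic):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 10))
t = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.standard_normal(100)  # only 2 relevant features

lasso = Lasso(alpha=0.1).fit(X, t)
ridge = Ridge(alpha=0.1).fit(X, t)
print(np.round(lasso.coef_, 2))  # most coefficients exactly 0 (sparse model)
print(np.round(ridge.coef_, 2))  # coefficients shrunk but generally nonzero
```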


Lasso vs Ridge Regression


[Figure: the ridge (ℓ2) and lasso (ℓ1) constraint regions in (w_1, w_2) space, with the
unregularized optimum w^*]

Lasso tends to generate sparser solutions than a quadratic regularizer



Bayesian Linear Regression

Bayesian Approach

We formulate our knowledge about the world in a probabilistic way


We define the model that expresses our knowledge qualitatively
Our model will have some unknown parameters
We capture our assumptions about unknown parameters by specifying the
prior distribution over those parameters before seeing the data
We observe the data
We compute the posterior probability distribution for the parameters,
given observed data
We use the posterior distribution to:
Make predictions by averaging over the posterior distribution
Examine/Account for uncertainty in the parameter values
Make decisions by minimizing expected posterior loss


Posterior Distribution
The posterior distribution for the model parameters can be found by
combining the prior with the likelihood for the parameters given the data
This is accomplished using Bayes' rule:

P(parameters | data) = \frac{P(data | parameters) P(parameters)}{P(data)}

p(w | D) = \frac{p(D | w) P(w)}{P(D)}

where
p(w|D) is the posterior probability of parameters w given training data D
p(D|w) is the probability (likelihood) of observing D given w
P(w) is the prior probability over the parameters
P(D) is the marginal likelihood (normalizing constant): P(D) = \int p(D|w) P(w) dw
We want the most probable value of w given the data: the maximum a
posteriori (MAP) estimate. It is the mode of the posterior.

Predictive Distribution

Stating Bayes’ rule in words: posterior ∝ likelihood × prior


Prediction for a new data point x^* (given the training dataset D) can be
done by integrating over the posterior distribution:

p(x^* | D) = \int p(x^* | w, D) p(w | D) dw = E_{p(w|D)}[p(x^* | w, D)]

which is sometimes called the predictive distribution


Note that computing the predictive distribution requires knowledge of
the posterior distribution, which is usually intractable


Bayesian Linear Regression

Another approach to avoid the over-fitting problem of ML is to use Bayesian
linear regression
In the Bayesian approach the parameters of the model are considered as drawn
from some distribution
Assuming a Gaussian likelihood model, the conjugate prior is Gaussian too:

p(w) = N(w | w_0, S_0)

Given the data D, the posterior is still Gaussian:

p(w | t, Φ, σ^2) ∝ N(w | w_0, S_0) N(t | Φw, σ^2 I_N) = N(w | w_N, S_N)

w_N = S_N \left( S_0^{-1} w_0 + \frac{Φ^T t}{σ^2} \right)

S_N^{-1} = S_0^{-1} + \frac{Φ^T Φ}{σ^2}

For sequential data, the posterior acts as the prior for the next iteration
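
A compact NumPy sketch of this posterior update, assuming the isotropic prior w_0 = 0, S_0 = τ²I used in the examples that follow (the function name is illustrative):

```python
import numpy as np

def bayesian_posterior(Phi, t, sigma2, tau2):
    """Posterior N(w | w_N, S_N) under prior w ~ N(0, tau2 * I) and noise variance sigma2."""
    M = Phi.shape[1]
    S0_inv = np.eye(M) / tau2
    SN_inv = S0_inv + Phi.T @ Phi / sigma2   # S_N^{-1} = S_0^{-1} + Phi^T Phi / sigma^2
    SN = np.linalg.inv(SN_inv)
    wN = SN @ (Phi.T @ t / sigma2)           # w_0 = 0, so only the data term remains
    return wN, SN
```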

Relation to Ridge Regression

In Gaussian distributions the mode coincides with the mean
It follows that w_N is the MAP (maximum a posteriori) estimator
If the prior has infinite variance, w_N reduces to the ML estimator
If w_0 = 0 and S_0 = τ^2 I, then w_N reduces to the ridge estimate, with
λ = \frac{σ^2}{τ^2}


1D Example

Data generated from: t(x) = -0.3 + 0.5x + ε, where ε ∼ N(0, 0.04)
x values taken uniformly from [-1, 1]
Model: y(x, w) = w_0 + w_1 x
We assume to know σ^2 = 0.04 and τ^2 = 0.5

[Figure: parameter space (w_0, w_1) and data space (x, t)]

1D Example

[Figure panels: 0, 1, 2, and 20 data points observed]

Predictive Distribution

We are interested in the posterior predictive distribution:

p(t | x, D, σ^2) = \int N(t | w^T φ(x), σ^2) N(w | w_N, S_N) dw
                 = N(t | w_N^T φ(x), σ_N^2(x))

where

σ_N^2(x) = \underbrace{σ^2}_{\text{noise in the target values}}
         + \underbrace{φ(x)^T S_N φ(x)}_{\text{uncertainty associated with parameter values}}

In the limit, as N → ∞, the second term goes to zero
The variance of the predictive distribution then arises only from the
additive noise governed by the parameter σ
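
Continuing the sketch above, the predictive mean and variance at a new input follow directly from w_N and S_N (names are illustrative):

```python
import numpy as np

def predictive_distribution(phi_x, wN, SN, sigma2):
    """Predictive mean and variance at a single feature vector phi_x."""
    mean = wN @ phi_x
    var = sigma2 + phi_x @ SN @ phi_x   # noise variance + parameter uncertainty
    return mean, var
```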


Example
Sinusoidal dataset, 9 Gaussian basis functions


Modeling Challenges

The first challenge is specifying a suitable model and suitable prior
distributions
A suitable model should admit all the possibilities that are thought to be
at all likely
A suitable prior should avoid giving zero or very small probability to
plausible events, but should also avoid spreading the probability over
all possibilities
To avoid uninformative priors, we may need to model dependencies
between parameters
One strategy is to introduce latent variables into the model and
hyperparameters into the prior
Both are ways of modeling dependencies in a tractable way


Computational Challenges
The other big challenge is computing the posterior distribution. There are
several approaches:
Analytical integration: if we use conjugate priors, the posterior
distribution can be computed analytically. It only works for simple models
Gaussian (Laplace) approximation: approximate the posterior
distribution with a Gaussian. It works well when there is a lot of data
compared to the model complexity
Monte Carlo integration: once we have samples from the posterior
distribution, we can do many things. Currently, the most common approach is
Markov Chain Monte Carlo (MCMC), which consists of simulating a
Markov chain that converges to the posterior distribution
Variational approximation: a cleverer way of approximating the
posterior. It is usually faster than MCMC, but it is less general


Pros and Cons of Fixed Basis Functions

Advantages
Closed-form solution
Tractable Bayesian treatment
Arbitrary non-linearity with the proper basis functions
Limitations
Basis functions are chosen independently from the training set
Curse of dimensionality
