
CS 2750 Machine Learning

Lecture 6

Linear regression

Milos Hauskrecht
milos@cs.pitt.edu
5329 Sennott Square


Administration
• Matlab:
– Statistical and neural network toolboxes are not available on the unixs machines
– Please use the Windows machines in the CSSD labs

Outline
Regression
• Linear model
• Error function based on the least squares fit.
• Parameter estimation.
• Gradient methods.
• On-line regression techniques.
• Linear additive models.


Supervised learning
Data: D = {D_1, D_2, ..., D_n} - a set of n examples
D_i = <x_i, y_i>
x_i = (x_{i,1}, x_{i,2}, ..., x_{i,d}) is an input vector of size d
y_i is the desired output (given by a teacher)
Objective: learn the mapping f : X → Y
s.t. y_i ≈ f(x_i) for all i = 1, ..., n

• Regression: Y is continuous
Example: earnings, product orders → company stock price
• Classification: Y is discrete
Example: handwritten digit in binary form → digit label

Linear regression
• Function f : X → Y is a linear combination of the input components
f(x) = w_0 + w_1 x_1 + w_2 x_2 + ... + w_d x_d = w_0 + ∑_{j=1}^{d} w_j x_j
w_0, w_1, ..., w_d - parameters (weights)

[Figure: a linear unit - the inputs x_1, ..., x_d and a bias term 1 are weighted by w_0, w_1, ..., w_d and summed to produce f(x, w)]

Linear regression
• Shorter (vector) definition of the model
– Include the bias constant in the input vector
x = (1, x_1, x_2, ..., x_d)
f(x) = w_0 x_0 + w_1 x_1 + w_2 x_2 + ... + w_d x_d = w^T x
w_0, w_1, ..., w_d - parameters (weights)
[Figure: the same linear unit, with the bias constant 1 included in the input vector as x_0]
Linear regression. Error.

• Data: D_i = <x_i, y_i>
• Function: x_i → f(x_i)
• We would like to have y_i ≈ f(x_i) for all i = 1, ..., n
• The error function measures how much our predictions deviate from the desired answers

Mean-squared error: J_n = (1/n) ∑_{i=1,...,n} (y_i − f(x_i))^2

• Learning: we want to find the weights minimizing the error!
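As a quick illustration, here is a minimal numpy sketch of this error measure (the function name is illustrative, not part of the lecture):

import numpy as np

def mean_squared_error(y, y_pred):
    # J_n = (1/n) * sum_i (y_i - f(x_i))^2
    y = np.asarray(y, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean((y - y_pred) ** 2)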


Linear regression. Example


• 1-dimensional input x = (x_1)

[Figure: linear regression example with a 1-dimensional input]

Linear regression. Example.
• 2-dimensional input x = (x_1, x_2)

[Figure: linear regression example with a 2-dimensional input]

Linear regression. Optimization.


• We want the weights minimizing the error
J_n = (1/n) ∑_{i=1,...,n} (y_i − f(x_i))^2 = (1/n) ∑_{i=1,...,n} (y_i − w^T x_i)^2
• For the optimal set of parameters, derivatives of the error with
respect to each parameter must be 0

∂J_n(w)/∂w_j = −(2/n) ∑_{i=1}^{n} (y_i − w_0 x_{i,0} − w_1 x_{i,1} − ... − w_d x_{i,d}) x_{i,j} = 0

• Vector of derivatives:
grad_w(J_n(w)) = ∇_w(J_n(w)) = −(2/n) ∑_{i=1}^{n} (y_i − w^T x_i) x_i = 0

Solving linear regression
∂J_n(w)/∂w_j = −(2/n) ∑_{i=1}^{n} (y_i − w_0 x_{i,0} − w_1 x_{i,1} − ... − w_d x_{i,d}) x_{i,j} = 0
∇_w(J_n(w)) = −(2/n) ∑_{i=1}^{n} (y_i − w^T x_i) x_i = 0

By rearranging the terms we get a system of linear equations with d+1 unknowns:
Aw = b
Equation for the jth component:
w_0 ∑_{i=1}^{n} x_{i,0} x_{i,j} + w_1 ∑_{i=1}^{n} x_{i,1} x_{i,j} + ... + w_j ∑_{i=1}^{n} x_{i,j} x_{i,j} + ... + w_d ∑_{i=1}^{n} x_{i,d} x_{i,j} = ∑_{i=1}^{n} y_i x_{i,j}

Can be solved through matrix inversion, if the matrix is not singular:
w = A^{-1} b

Solving linear regression


• Things can also be rewritten in terms of the data matrix X and the output vector y:
J_n(w) = (y − Xw)^T (y − Xw)
∇J_n(w) = −2 X^T (y − Xw)

• Set the derivatives to 0 and solve:
w = (X^T X)^{-1} X^T y

• What if X is singular?
• Some columns of the data matrix are linearly dependent
• Then X^T X is singular. Multiple possible solutions exist.
• Remedy: drop the redundant (linearly dependent) columns
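As a rough illustration (not part of the lecture), a minimal numpy sketch of the closed-form solution; the function name and the use of lstsq instead of an explicit inverse are my own choices:

import numpy as np

def fit_linear_regression(X, y):
    # Least-squares solution w = (X^T X)^{-1} X^T y.
    # X is assumed to already contain a leading column of ones (the bias input x_0 = 1).
    # np.linalg.lstsq avoids forming the explicit inverse and copes better with
    # a nearly singular X^T X (linearly dependent columns).
    w, residuals, rank, sv = np.linalg.lstsq(X, y, rcond=None)
    return w

# Hypothetical usage: prepend the bias column, then solve.
# X = np.hstack([np.ones((X_raw.shape[0], 1)), X_raw])
# w = fit_linear_regression(X, y)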

Solving linear regression
• The linear regression problem comes down to solving a set of linear equations
• Alternative method: gradient descent
– Iterative method: w ← w − α ∇_w J_n(w)

J_n(w) = (y − Xw)^T (y − Xw)
∇J_n(w) = −2 X^T (y − Xw)
w ← w + α 2 X^T (y − Xw)
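A minimal batch gradient descent sketch in numpy (my own illustration; the step size alpha and the number of steps are arbitrary defaults and would need tuning, since too large a step makes the iteration diverge):

import numpy as np

def gradient_descent(X, y, alpha=0.001, num_steps=1000):
    # Minimize J_n(w) = (y - Xw)^T (y - Xw) by following the negative gradient:
    # grad J_n(w) = -2 X^T (y - Xw), so w <- w + alpha * 2 * X^T (y - Xw).
    w = np.zeros(X.shape[1])
    for _ in range(num_steps):
        residual = y - X @ w
        w = w + alpha * 2.0 * (X.T @ residual)
    return w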


Gradient descent method


• Descend to the minimum of the function using the gradient information

[Figure: one-dimensional error surface Error(w), with the gradient ∇_w Error(w)|_{w*} evaluated at the current point w*]

• Change the value of w according to the gradient:
w ← w − α ∇_w J_n(w)
Online regression algorithm
• The error function defined for the whole dataset D:
J_n = (1/n) ∑_{i=1,...,n} (y_i − f(x_i))^2
• Instead of the error for all data points we use the error for an individual sample:
J_online = Error_i(w) = (1/2) (y_i − f(x_i))^2
• Change the regression weights after every example according to the gradient (delta rule):
w_j ← w_j − α ∂/∂w_j Error_i(w)
w ← w − α ∇_w Error_i(w)
α > 0 - a learning rate that depends on the number of updates

On-line gradient descent method


• In every step update weights according to a new example

[Figure: error surface Error(w) with the sequence of weight vectors w^(0), w^(1), w^(2), w^(3) produced by successive on-line updates]

Gradient for on-line learning
Linear model: f(x) = w^T x
On-line error: Error_i(w) = (1/2) (y_i − f(x_i))^2
(i+1)-th update for the linear model:

w^(i+1) ← w^(i) − α(i+1) ∇_w Error_i(w)|_{w^(i)} = w^(i) + α(i+1) (y_i − f(x_i)) x_i

Typical learning rate: α(i) ≈ 1/i

On-line algorithm: repeat online updates for all data points


Online regression algorithm

Online-linear-regression(D, number of iterations)
  initialize weights w = (w_0, w_1, w_2, ..., w_d)
  for i = 1 to number of iterations
    select a data point D_i = (x_i, y_i) from D
    set α = 1/i
    update the weight vector:
      w ← w + α (y_i − f(x_i, w)) x_i
  end for
  return weights w

• Advantages: very easy to implement, works with continuous data streams
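A minimal Python sketch of this algorithm (my own illustration; cycling through the data points is just one way to realize the "select a data point" step, and the bias input 1 is assumed to be part of each x_i):

import numpy as np

def online_linear_regression(D, num_iterations):
    # D: list of (x_i, y_i) pairs; each x_i already starts with the bias input 1.
    d = len(D[0][0])
    w = np.zeros(d)
    for i in range(1, num_iterations + 1):
        x_i, y_i = D[(i - 1) % len(D)]        # cycle through the data points
        x_i = np.asarray(x_i, dtype=float)
        alpha = 1.0 / i                       # annealed learning rate alpha = 1/i
        w = w + alpha * (y_i - np.dot(w, x_i)) * x_i   # delta-rule update
    return w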

On-line learning. Example
[Figure: on-line learning example - four snapshots (panels 1-4) of the model during successive on-line updates]

Practical concerns: Input normalization


• Input normalization
– Makes the data vary roughly on the same scale
– Can make a huge difference in on-line learning

Assume the on-line update (delta) rule for two weights j, k:
w_j ← w_j + α(i) (y_i − f(x_i)) x_{i,j}
w_k ← w_k + α(i) (y_i − f(x_i)) x_{i,k}
The change depends on the magnitude of the input.

For inputs with a large magnitude the change in the weight is huge: inputs with a high magnitude affect the weights disproportionately, as if those inputs were more important.

Input normalization
• Input normalization:
– Solution to the problem of different scales
– Makes all inputs vary in the same range around 0
x̄_j = (1/n) ∑_{i=1}^{n} x_{i,j}
σ_j^2 = (1/(n−1)) ∑_{i=1}^{n} (x_{i,j} − x̄_j)^2

New input: x̃_{i,j} = (x_{i,j} − x̄_j) / σ_j

A more complex normalization approach can be applied when we want to process data with correlations.
Similarly, we can renormalize the outputs y.
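A minimal numpy sketch of this standardization (my own illustration; returning the statistics so that new data can be transformed consistently is a design choice, not part of the slide):

import numpy as np

def normalize_inputs(X):
    # Column-wise standardization: subtract the mean, divide by the standard deviation.
    # ddof=1 gives the 1/(n-1) factor used in the variance estimate above.
    mean = X.mean(axis=0)
    std = X.std(axis=0, ddof=1)
    return (X - mean) / std, mean, std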


Extensions of simple linear model


Replace the inputs to linear units with feature (basis) functions to model nonlinearities:
f(x) = w_0 + ∑_{j=1}^{m} w_j φ_j(x)
φ_j(x) - an arbitrary function of x
[Figure: linear unit on top of basis functions - the inputs x_1, ..., x_d feed φ_1(x), φ_2(x), ..., φ_m(x), which are weighted by w_1, ..., w_m and summed with the bias weight w_0 to produce f(x)]
The same techniques as before can be used to learn the weights.


Additive linear models
• Models linear in the parameters we want to fit:
f(x) = w_0 + ∑_{k=1}^{m} w_k φ_k(x)
w_0, w_1, ..., w_m - parameters
φ_1(x), φ_2(x), ..., φ_m(x) - feature or basis functions
• Basis function examples:
– a higher-order polynomial, one-dimensional input x = (x_1):
φ_1(x) = x, φ_2(x) = x^2, φ_3(x) = x^3
– a multidimensional quadratic, x = (x_1, x_2):
φ_1(x) = x_1, φ_2(x) = x_1^2, φ_3(x) = x_2, φ_4(x) = x_2^2, φ_5(x) = x_1 x_2
– other types of basis functions:
φ_1(x) = sin x, φ_2(x) = cos x
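As a rough illustration (mine, not the lecture's), a one-dimensional polynomial basis expansion in numpy; the expanded matrix can then be handed to the same least-squares solver used for the plain linear model:

import numpy as np

def polynomial_basis(x, degree=3):
    # phi(x) = (1, x, x^2, ..., x^degree) for a one-dimensional input x.
    x = np.asarray(x, dtype=float).reshape(-1, 1)
    return np.hstack([x ** k for k in range(degree + 1)])

# Hypothetical usage with the least-squares solution from before:
# Phi = polynomial_basis(x, degree=3)
# w, *_ = np.linalg.lstsq(Phi, y, rcond=None)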


Multidimensional additive model example

[Figure: multidimensional additive model example]
Multidimensional additive model example

[Figure: another multidimensional additive model example]

Fitting additive linear models


• Error function J n ( w ) = 1 / n ∑ (y −
i = 1 ,.. n
f ( x i )) 2

Assume: φ ( x i ) = (1, φ 1 ( x i ), φ 2 ( x i ), K , φ m ( x i ))
2
∇ w J n (w ) = −
n
∑ (y
i = 1 ,.. n
i − f ( x i )) φ ( x i ) = 0

• Leads to a system of m+1 linear equations:
w_0 ∑_{i=1}^{n} 1·φ_j(x_i) + ... + w_j ∑_{i=1}^{n} φ_j(x_i) φ_j(x_i) + ... + w_m ∑_{i=1}^{n} φ_m(x_i) φ_j(x_i) = ∑_{i=1}^{n} y_i φ_j(x_i)

• Can be solved exactly like the linear case

Statistical model of regression
• A model:
y = f(x, w) + ε, s.t. ε ~ N(0, σ^2)

• The noise models deviations from the parametric linear model


• The model defines the conditional density of y given x: p(y | x)
• It allows us not only to predict the mean but also to explain the nature of deviations from it
• As a result, for a given set of parameters w, σ, we can compute the probability of a specific prediction:
p(y | x, w, σ) = (1/√(2πσ^2)) exp[ −(1/(2σ^2)) (y − f(x, w))^2 ]
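A small numpy sketch of this Gaussian predictive density (my own illustration; the linear model f(x, w) = w^T x is assumed):

import numpy as np

def predictive_density(y, x, w, sigma):
    # p(y | x, w, sigma) = 1/sqrt(2*pi*sigma^2) * exp(-(y - f(x, w))^2 / (2*sigma^2))
    mean = np.dot(w, x)                      # f(x, w) = w^T x
    return np.exp(-(y - mean) ** 2 / (2.0 * sigma ** 2)) / np.sqrt(2.0 * np.pi * sigma ** 2)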


ML estimation of the parameters


• Given the distribution we can compute the probability of all samples (x, y) observed in the dataset D (values of y drawn independently):
L(D, w, σ) = ∏_{i=1}^{n} p(y_i | x_i, w, σ)
• We want to find the optimal set of parameters
• To do this we can optimize w* = arg max_w ∏_{i=1}^{n} p(y_i | x_i, w, σ), or equivalently its logarithm:
l(D, w, σ) = log L(D, w, σ) = log ∏_{i=1}^{n} p(y_i | x_i, w, σ)
Working out the math we get:
l(D, w, σ) = ∑_{i=1}^{n} log p(y_i | x_i, w, σ) = ∑_{i=1}^{n} [ −(1/(2σ^2)) (y_i − f(x_i, w))^2 − c(σ) ]
= −(1/(2σ^2)) ∑_{i=1}^{n} (y_i − f(x_i, w))^2 + C(σ)    Equivalent to the least-squares fit!
ML estimation of parameters
• The loss function and the log-likelihood of the output are related:
−(1/(2σ^2)) Loss(y_i, x_i) = log p(y_i | x_i, w, σ) + c(σ), with Loss(y_i, x_i) = (y_i − f(x_i, w))^2

• We know how to optimize the parameters w (for a given, fixed variance) - the same approach as for the least-squares (LSF) criterion
• How do we estimate the variance of the noise?
• Maximize l(D, w, σ) with respect to the variance
σ̂^2 = (1/n) ∑_{i=1}^{n} (y_i − f(x_i, w*))^2

= the mean squared prediction error for the best predictor
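A small numpy sketch of this estimate (my own illustration; the data matrix X, outputs y, and fitted weights w_star are assumed to be given):

import numpy as np

def estimate_noise_variance(X, y, w_star):
    # ML estimate of the noise variance: the mean squared prediction error
    # of the fitted predictor, sigma_hat^2 = (1/n) * sum_i (y_i - w*^T x_i)^2.
    residuals = y - X @ w_star
    return np.mean(residuals ** 2)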
