
CS 2750 Machine Learning

Lecture 6

Linear regression

Milos Hauskrecht
milos@cs.pitt.edu
5329 Sennott Square


Administration
• Matlab:
– Statistical and neural network toolboxes are not available on the unixs machines
– Please use the Windows machines in the CSSD labs

Outline
Regression
• Linear model
• Error function based on the least squares fit.
• Parameter estimation.
• Gradient methods.
• On-line regression techniques.
• Linear additive models.


Supervised learning
Data: D = {D_1, D_2, ..., D_n} - a set of n examples
D_i = <x_i, y_i>
x_i = (x_{i,1}, x_{i,2}, ..., x_{i,d}) is an input vector of size d
y_i is the desired output (given by a teacher)
Objective: learn the mapping f : X → Y
s.t. y_i ≈ f(x_i) for all i = 1, ..., n

• Regression: Y is continuous
Example: earnings, product orders → company stock price
• Classification: Y is discrete
Example: handwritten digit in binary form → digit label

Linear regression
• Function f : X → Y is a linear combination of the input components
f(x) = w_0 + w_1 x_1 + w_2 x_2 + ... + w_d x_d = w_0 + ∑_{j=1}^{d} w_j x_j
w_0, w_1, ..., w_d - parameters (weights)

[Figure: a linear unit - the inputs x_1, ..., x_d and a bias term 1 are weighted by w_0, w_1, ..., w_d and summed to produce f(x, w)]

Linear regression
• Shorter (vector) definition of the model
– Include the bias constant in the input vector
x = (1, x_1, x_2, ..., x_d)
f(x) = w_0 x_0 + w_1 x_1 + w_2 x_2 + ... + w_d x_d = w^T x
w_0, w_1, ..., w_d - parameters (weights)
[Figure: the same linear unit, with the bias constant 1 included in the input vector as x_0]
Linear regression. Error.

• Data: D_i = <x_i, y_i>
• Function: x_i → f(x_i)
• We would like to have y_i ≈ f(x_i) for all i = 1, ..., n
• The error function measures how much our predictions deviate from the desired answers

Mean-squared error: J_n = (1/n) ∑_{i=1,...,n} (y_i − f(x_i))^2

• Learning: we want to find the weights minimizing the error!
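As a quick illustration, here is a minimal numpy sketch of this error measure (the function name is illustrative, not part of the lecture):

import numpy as np

def mean_squared_error(y, y_pred):
    # J_n = (1/n) * sum_i (y_i - f(x_i))^2
    y = np.asarray(y, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean((y - y_pred) ** 2)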


Linear regression. Example


• 1-dimensional input x = (x_1)

[Figure: linear regression example with a 1-dimensional input]

Linear regression. Example.
• 2-dimensional input x = (x_1, x_2)

[Figure: linear regression example with a 2-dimensional input]

Linear regression. Optimization.


• We want the weights minimizing the error
J_n = (1/n) ∑_{i=1,...,n} (y_i − f(x_i))^2 = (1/n) ∑_{i=1,...,n} (y_i − w^T x_i)^2
• For the optimal set of parameters, derivatives of the error with
respect to each parameter must be 0

∂J_n(w)/∂w_j = −(2/n) ∑_{i=1}^{n} (y_i − w_0 x_{i,0} − w_1 x_{i,1} − ... − w_d x_{i,d}) x_{i,j} = 0

• Vector of derivatives:
grad_w(J_n(w)) = ∇_w(J_n(w)) = −(2/n) ∑_{i=1}^{n} (y_i − w^T x_i) x_i = 0

Solving linear regression
∂J_n(w)/∂w_j = −(2/n) ∑_{i=1}^{n} (y_i − w_0 x_{i,0} − w_1 x_{i,1} − ... − w_d x_{i,d}) x_{i,j} = 0
∇_w(J_n(w)) = −(2/n) ∑_{i=1}^{n} (y_i − w^T x_i) x_i = 0

By rearranging the terms we get a system of linear equations with d+1 unknowns:
Aw = b
Equation for the jth component:
w_0 ∑_{i=1}^{n} x_{i,0} x_{i,j} + w_1 ∑_{i=1}^{n} x_{i,1} x_{i,j} + ... + w_j ∑_{i=1}^{n} x_{i,j} x_{i,j} + ... + w_d ∑_{i=1}^{n} x_{i,d} x_{i,j} = ∑_{i=1}^{n} y_i x_{i,j}

Can be solved through matrix inversion, if the matrix is not singular:
w = A^{-1} b

Solving linear regression


• Things can also be rewritten in terms of the data matrix X and the output vector y:
J_n(w) = (y − Xw)^T (y − Xw)
∇J_n(w) = −2 X^T (y − Xw)

• Set the derivatives to 0 and solve:
w = (X^T X)^{-1} X^T y

• What if X is singular?
• Some columns of the data matrix are linearly dependent
• Then X^T X is singular. Multiple possible solutions exist.
• Remedy: drop the redundant (linearly dependent) columns
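As a rough illustration (not part of the lecture), a minimal numpy sketch of the closed-form solution; the function name and the use of lstsq instead of an explicit inverse are my own choices:

import numpy as np

def fit_linear_regression(X, y):
    # Least-squares solution w = (X^T X)^{-1} X^T y.
    # X is assumed to already contain a leading column of ones (the bias input x_0 = 1).
    # np.linalg.lstsq avoids forming the explicit inverse and copes better with
    # a nearly singular X^T X (linearly dependent columns).
    w, residuals, rank, sv = np.linalg.lstsq(X, y, rcond=None)
    return w

# Hypothetical usage: prepend the bias column, then solve.
# X = np.hstack([np.ones((X_raw.shape[0], 1)), X_raw])
# w = fit_linear_regression(X, y)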

Solving linear regression
• The linear regression problem comes down to solving a set of linear equations
• Alternative method: gradient descent
– Iterative method: w ← w − α ∇_w J_n(w)

J_n(w) = (y − Xw)^T (y − Xw)
∇J_n(w) = −2 X^T (y − Xw)
w ← w + α 2 X^T (y − Xw)
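A minimal batch gradient descent sketch in numpy (my own illustration; the step size alpha and the number of steps are arbitrary defaults and would need tuning, since too large a step makes the iteration diverge):

import numpy as np

def gradient_descent(X, y, alpha=0.001, num_steps=1000):
    # Minimize J_n(w) = (y - Xw)^T (y - Xw) by following the negative gradient:
    # grad J_n(w) = -2 X^T (y - Xw), so w <- w + alpha * 2 * X^T (y - Xw).
    w = np.zeros(X.shape[1])
    for _ in range(num_steps):
        residual = y - X @ w
        w = w + alpha * 2.0 * (X.T @ residual)
    return w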


Gradient descent method


• Descend to the minimum of the function using the gradient information

[Figure: one-dimensional error surface Error(w), with the gradient ∇_w Error(w)|_{w*} evaluated at the current point w*]

• Change the value of w according to the gradient:
w ← w − α ∇_w J_n(w)
Online regression algorithm
• The error function defined for the whole dataset D:
J_n = (1/n) ∑_{i=1,...,n} (y_i − f(x_i))^2
• Instead of the error for all data points we use the error for an individual sample:
J_online = Error_i(w) = (1/2) (y_i − f(x_i))^2
• Change the regression weights after every example according to the gradient (delta rule):
w_j ← w_j − α ∂/∂w_j Error_i(w)
w ← w − α ∇_w Error_i(w)
α > 0 - a learning rate that depends on the number of updates

On-line gradient descent method


• In every step update weights according to a new example

[Figure: error surface Error(w) with the sequence of weight vectors w^(0), w^(1), w^(2), w^(3) produced by successive on-line updates]

Gradient for on-line learning
Linear model: f(x) = w^T x
On-line error: Error_i(w) = (1/2) (y_i − f(x_i))^2
(i+1)-th update for the linear model:

w^(i+1) ← w^(i) − α(i+1) ∇_w Error_i(w)|_{w^(i)} = w^(i) + α(i+1) (y_i − f(x_i)) x_i

Typical learning rate: α(i) ≈ 1/i

On-line algorithm: repeat online updates for all data points


Online regression algorithm

Online-linear-regression(D, number of iterations)
  initialize weights w = (w_0, w_1, w_2, ..., w_d)
  for i = 1 to number of iterations
    select a data point D_i = (x_i, y_i) from D
    set α = 1/i
    update the weight vector:
      w ← w + α (y_i − f(x_i, w)) x_i
  end for
  return weights w

• Advantages: very easy to implement, works with continuous data streams
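A minimal Python sketch of this algorithm (my own illustration; cycling through the data points is just one way to realize the "select a data point" step, and the bias input 1 is assumed to be part of each x_i):

import numpy as np

def online_linear_regression(D, num_iterations):
    # D: list of (x_i, y_i) pairs; each x_i already starts with the bias input 1.
    d = len(D[0][0])
    w = np.zeros(d)
    for i in range(1, num_iterations + 1):
        x_i, y_i = D[(i - 1) % len(D)]        # cycle through the data points
        x_i = np.asarray(x_i, dtype=float)
        alpha = 1.0 / i                       # annealed learning rate alpha = 1/i
        w = w + alpha * (y_i - np.dot(w, x_i)) * x_i   # delta-rule update
    return w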

On-line learning. Example
[Figure: on-line learning example - four snapshots (panels 1-4) of the model during successive on-line updates]

Practical concerns: Input normalization


• Input normalization
– Makes the data vary roughly on the same scale
– Can make a huge difference in on-line learning

Assume the on-line update (delta) rule for two weights j, k:
w_j ← w_j + α(i) (y_i − f(x_i)) x_{i,j}
w_k ← w_k + α(i) (y_i − f(x_i)) x_{i,k}
The change depends on the magnitude of the input.

For inputs with a large magnitude the change in the weight is huge: inputs with a high magnitude affect the weights disproportionately, as if those inputs were more important.

Input normalization
• Input normalization:
– Solution to the problem of different scales
– Makes all inputs vary in the same range around 0
x̄_j = (1/n) ∑_{i=1}^{n} x_{i,j}
σ_j^2 = (1/(n−1)) ∑_{i=1}^{n} (x_{i,j} − x̄_j)^2

New input: x̃_{i,j} = (x_{i,j} − x̄_j) / σ_j

A more complex normalization approach can be applied when we want to process data with correlations.
Similarly, we can renormalize the outputs y.
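A minimal numpy sketch of this standardization (my own illustration; returning the statistics so that new data can be transformed consistently is a design choice, not part of the slide):

import numpy as np

def normalize_inputs(X):
    # Column-wise standardization: subtract the mean, divide by the standard deviation.
    # ddof=1 gives the 1/(n-1) factor used in the variance estimate above.
    mean = X.mean(axis=0)
    std = X.std(axis=0, ddof=1)
    return (X - mean) / std, mean, std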


Extensions of simple linear model


Replace the inputs to linear units with feature (basis) functions to model nonlinearities:
f(x) = w_0 + ∑_{j=1}^{m} w_j φ_j(x)
φ_j(x) - an arbitrary function of x
[Figure: linear unit on top of basis functions - the inputs x_1, ..., x_d feed φ_1(x), φ_2(x), ..., φ_m(x), which are weighted by w_1, ..., w_m and summed with the bias weight w_0 to produce f(x)]
The same techniques as before can be used to learn the weights.


Additive linear models
• Models linear in the parameters we want to fit:
f(x) = w_0 + ∑_{k=1}^{m} w_k φ_k(x)
w_0, w_1, ..., w_m - parameters
φ_1(x), φ_2(x), ..., φ_m(x) - feature or basis functions
• Basis function examples:
– a higher-order polynomial, one-dimensional input x = (x_1):
φ_1(x) = x, φ_2(x) = x^2, φ_3(x) = x^3
– a multidimensional quadratic, x = (x_1, x_2):
φ_1(x) = x_1, φ_2(x) = x_1^2, φ_3(x) = x_2, φ_4(x) = x_2^2, φ_5(x) = x_1 x_2
– other types of basis functions:
φ_1(x) = sin x, φ_2(x) = cos x
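As a rough illustration (mine, not the lecture's), a one-dimensional polynomial basis expansion in numpy; the expanded matrix can then be handed to the same least-squares solver used for the plain linear model:

import numpy as np

def polynomial_basis(x, degree=3):
    # phi(x) = (1, x, x^2, ..., x^degree) for a one-dimensional input x.
    x = np.asarray(x, dtype=float).reshape(-1, 1)
    return np.hstack([x ** k for k in range(degree + 1)])

# Hypothetical usage with the least-squares solution from before:
# Phi = polynomial_basis(x, degree=3)
# w, *_ = np.linalg.lstsq(Phi, y, rcond=None)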


Multidimensional additive model example

[Figure: multidimensional additive model example]
Multidimensional additive model example

[Figure: another multidimensional additive model example]

Fitting additive linear models


• Error function J n ( w ) = 1 / n ∑ (y −
i = 1 ,.. n
f ( x i )) 2

Assume: φ ( x i ) = (1, φ 1 ( x i ), φ 2 ( x i ), K , φ m ( x i ))
2
∇ w J n (w ) = −
n
∑ (y
i = 1 ,.. n
i − f ( x i )) φ ( x i ) = 0

• Leads to a system of m+1 linear equations:
w_0 ∑_{i=1}^{n} 1·φ_j(x_i) + ... + w_j ∑_{i=1}^{n} φ_j(x_i) φ_j(x_i) + ... + w_m ∑_{i=1}^{n} φ_m(x_i) φ_j(x_i) = ∑_{i=1}^{n} y_i φ_j(x_i)

• Can be solved exactly like the linear case

Statistical model of regression
• A model:
y = f(x, w) + ε, s.t. ε ~ N(0, σ^2)

• The noise models deviations from the parametric linear model


• The model defines the conditional density of y given x: p(y | x)
• It allows us not only to predict the mean but also to explain the nature of deviations from it
• As a result, for a given set of parameters w, σ, we can compute the probability of a specific prediction:
p(y | x, w, σ) = (1/√(2πσ^2)) exp[ −(1/(2σ^2)) (y − f(x, w))^2 ]
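A small numpy sketch of this Gaussian predictive density (my own illustration; the linear model f(x, w) = w^T x is assumed):

import numpy as np

def predictive_density(y, x, w, sigma):
    # p(y | x, w, sigma) = 1/sqrt(2*pi*sigma^2) * exp(-(y - f(x, w))^2 / (2*sigma^2))
    mean = np.dot(w, x)                      # f(x, w) = w^T x
    return np.exp(-(y - mean) ** 2 / (2.0 * sigma ** 2)) / np.sqrt(2.0 * np.pi * sigma ** 2)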


ML estimation of the parameters


• Given the distribution we can compute the probability of all samples (x, y) observed in the dataset D (values of y drawn independently):
L(D, w, σ) = ∏_{i=1}^{n} p(y_i | x_i, w, σ)
• We want to find the optimal set of parameters
• To do this we can optimize w* = arg max_w ∏_{i=1}^{n} p(y_i | x_i, w, σ), or equivalently its logarithm:
l(D, w, σ) = log L(D, w, σ) = log ∏_{i=1}^{n} p(y_i | x_i, w, σ)
Working out the math we get:
l(D, w, σ) = ∑_{i=1}^{n} log p(y_i | x_i, w, σ) = ∑_{i=1}^{n} [ −(1/(2σ^2)) (y_i − f(x_i, w))^2 − c(σ) ]
= −(1/(2σ^2)) ∑_{i=1}^{n} (y_i − f(x_i, w))^2 + C(σ)    Equivalent to the least-squares fit!
ML estimation of parameters
• The loss function and the log-likelihood of the output are related:
−(1/(2σ^2)) Loss(y_i, x_i) = log p(y_i | x_i, w, σ) + c(σ), with Loss(y_i, x_i) = (y_i − f(x_i, w))^2

• We know how to optimize the parameters w (for a given, fixed variance) - the same approach as for the least-squares (LSF) criterion
• How do we estimate the variance of the noise?
• Maximize l(D, w, σ) with respect to the variance
σ̂^2 = (1/n) ∑_{i=1}^{n} (y_i − f(x_i, w*))^2

= the mean squared prediction error for the best predictor
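A small numpy sketch of this estimate (my own illustration; the data matrix X, outputs y, and fitted weights w_star are assumed to be given):

import numpy as np

def estimate_noise_variance(X, y, w_star):
    # ML estimate of the noise variance: the mean squared prediction error
    # of the fitted predictor, sigma_hat^2 = (1/n) * sum_i (y_i - w*^T x_i)^2.
    residuals = y - X @ w_star
    return np.mean(residuals ** 2)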
