Lecture 6
Linear regression
Milos Hauskrecht
milos@cs.pitt.edu
5329 Sennott Square
Administration
• Matlab:
– Statistical and neural network toolboxes are not
available on Unix machines
– Please use Windows machines in the CSSD labs
Outline
Regression
• Linear model
• Error function based on the least squares fit.
• Parameter estimation.
• Gradient methods.
• On-line regression techniques.
• Linear additive models.
Supervised learning
Data: $D = \{D_1, D_2, \ldots, D_n\}$, a set of $n$ examples
$D_i = \langle \mathbf{x}_i, y_i \rangle$
$\mathbf{x}_i = (x_{i,1}, x_{i,2}, \ldots, x_{i,d})$ is an input vector of size $d$
$y_i$ is the desired output (given by a teacher)
Objective: learn the mapping $f: X \rightarrow Y$
s.t. $y_i \approx f(\mathbf{x}_i)$ for all $i = 1, \ldots, n$
• Regression: Y is continuous
Example: earnings, product orders → company stock price
• Classification: Y is discrete
Example: handwritten digit in binary form → digit label
Linear regression
• Function $f: X \rightarrow Y$ is a linear combination of the input components:
$$f(\mathbf{x}) = w_0 + w_1 x_1 + w_2 x_2 + \cdots + w_d x_d = w_0 + \sum_{j=1}^{d} w_j x_j$$
$w_0, w_1, \ldots, w_d$ - parameters (weights)
[Figure: a linear unit. Inputs $x_1, \ldots, x_d$ enter with weights $w_1, \ldots, w_d$, a bias term contributes $w_0$, and the weighted sum gives $f(\mathbf{x}, \mathbf{w})$.]
Linear regression
• Shorter (vector) definition of the model
– Include bias constant in the input vector
$$\mathbf{x} = (1, x_1, x_2, \ldots, x_d)$$
$$f(\mathbf{x}) = w_0 x_0 + w_1 x_1 + w_2 x_2 + \cdots + w_d x_d = \mathbf{w}^T \mathbf{x}$$
$w_0, w_1, \ldots, w_d$ - parameters (weights)
[Figure: the same linear unit with the bias input included as a constant $x_0 = 1$.]
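A minimal sketch in Python/NumPy (not part of the original slides) of the vector form of the model, with the bias constant prepended to the input:

```python
import numpy as np

def predict(w, x):
    """Linear model f(x, w) = w^T x, with the bias folded into x."""
    x = np.concatenate(([1.0], x))  # prepend the bias constant x_0 = 1
    return w @ x

# Example with d = 2 inputs and weights (w_0, w_1, w_2):
w = np.array([0.5, 2.0, -1.0])
print(predict(w, np.array([3.0, 1.0])))  # 0.5 + 2.0*3 - 1.0*1 = 5.5
```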
Linear regression. Error.
Mean-squared error: $$J_n = \frac{1}{n} \sum_{i=1}^{n} (y_i - f(\mathbf{x}_i))^2$$
• Learning:
We want to find the weights minimizing the error!
[Figure: one-dimensional data points with a fitted regression line.]
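A short NumPy sketch (an addition, not from the slides) of the mean-squared error for a candidate weight vector:

```python
import numpy as np

def mse(w, X, y):
    """Mean-squared error J_n = (1/n) * sum_i (y_i - w^T x_i)^2.

    X is an n-by-(d+1) matrix whose first column is the bias constant 1;
    y holds the n desired outputs.
    """
    residuals = y - X @ w
    return np.mean(residuals ** 2)
```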
Linear regression. Example.
• Two-dimensional input $\mathbf{x} = (x_1, x_2)$
[Figure: a linear model surface fitted over the two-dimensional input space.]
Solving linear regression
• For the optimal set of weights, the partial derivatives of the error must vanish:
$$\frac{\partial}{\partial w_j} J_n(\mathbf{w}) = -\frac{2}{n} \sum_{i=1}^{n} (y_i - w_0 x_{i,0} - w_1 x_{i,1} - \cdots - w_d x_{i,d})\, x_{i,j} = 0$$
• Vector of derivatives:
$$\mathrm{grad}_{\mathbf{w}}(J_n(\mathbf{w})) = \nabla_{\mathbf{w}}(J_n(\mathbf{w})) = -\frac{2}{n} \sum_{i=1}^{n} (y_i - \mathbf{w}^T \mathbf{x}_i)\, \mathbf{x}_i = \mathbf{0}$$
Solving linear regression
$$\frac{\partial}{\partial w_j} J_n(\mathbf{w}) = -\frac{2}{n} \sum_{i=1}^{n} (y_i - w_0 x_{i,0} - w_1 x_{i,1} - \cdots - w_d x_{i,d})\, x_{i,j} = 0$$
$$\nabla_{\mathbf{w}}(J_n(\mathbf{w})) = -\frac{2}{n} \sum_{i=1}^{n} (y_i - \mathbf{w}^T \mathbf{x}_i)\, \mathbf{x}_i = \mathbf{0}$$
By rearranging the terms we get a system of linear equations with $d+1$ unknowns:
$$A\mathbf{w} = \mathbf{b}$$
Equation for the $j$-th component:
$$w_0 \sum_{i=1}^{n} x_{i,0} x_{i,j} + w_1 \sum_{i=1}^{n} x_{i,1} x_{i,j} + \cdots + w_j \sum_{i=1}^{n} x_{i,j} x_{i,j} + \cdots + w_d \sum_{i=1}^{n} x_{i,d} x_{i,j} = \sum_{i=1}^{n} y_i x_{i,j}$$
• What if $X$ is rank-deficient?
– Some columns of the data matrix are linearly dependent
– Then $X^T X$ is singular and multiple possible solutions exist
– Remedy: drop the redundant (linearly dependent) columns
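A sketch of the closed-form solve in NumPy (not from the slides). The direct solve of the normal equations assumes $X^T X$ is nonsingular; the commented alternative handles the rank-deficient case:

```python
import numpy as np

def fit_normal_equations(X, y):
    """Solve the system A w = b with A = X^T X and b = X^T y."""
    A = X.T @ X
    b = X.T @ y
    return np.linalg.solve(A, b)  # assumes A is nonsingular

# Rank-robust alternative when columns may be linearly dependent:
# w, *_ = np.linalg.lstsq(X, y, rcond=None)
```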
Solving linear regression
• The linear regression problem comes down to solving a set of
linear equations
• Alternative methods: gradient descent
– Iterative method
$$\mathbf{w} \leftarrow \mathbf{w} - \alpha \nabla_{\mathbf{w}} J_n(\mathbf{w})$$
where, in matrix form,
$$J_n(\mathbf{w}) = (\mathbf{y} - X\mathbf{w})^T (\mathbf{y} - X\mathbf{w}) \qquad \nabla J_n(\mathbf{w}) = -2 X^T (\mathbf{y} - X\mathbf{w})$$
so the update becomes
$$\mathbf{w} \leftarrow \mathbf{w} + 2\alpha X^T (\mathbf{y} - X\mathbf{w})$$
[Figure: the error curve $Error(\mathbf{w})$ and its gradient, with the minimum at $\mathbf{w}^*$.]
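A minimal batch gradient descent sketch matching the update above (the step size and iteration count are illustrative choices, not from the lecture):

```python
import numpy as np

def fit_gradient_descent(X, y, alpha=0.001, n_iters=1000):
    """Batch gradient descent: w <- w + alpha * 2 X^T (y - X w)."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        w = w + alpha * 2 * X.T @ (y - X @ w)
    return w
```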
Online regression algorithm
• The error function defined for the whole dataset D
$$J_n = \frac{1}{n} \sum_{i=1}^{n} (y_i - f(\mathbf{x}_i))^2$$
• Instead of the error for all data points we use the error for an
individual sample:
$$J_{online} = Error_i(\mathbf{w}) = \frac{1}{2} (y_i - f(\mathbf{x}_i))^2$$
• Change regression weights after every example according
to the gradient (delta rule):
$$w_j \leftarrow w_j - \alpha \frac{\partial}{\partial w_j} Error_i(\mathbf{w})$$
$$\mathbf{w} \leftarrow \mathbf{w} - \alpha \nabla_{\mathbf{w}} Error_i(\mathbf{w})$$
$\alpha > 0$ - learning rate that depends on the number of updates
[Figure: successive weight estimates $\mathbf{w}^{(0)}, \mathbf{w}^{(1)}, \mathbf{w}^{(2)}, \mathbf{w}^{(3)}$ descending the error curve $Error(\mathbf{w})$.]
Gradient for on-line learning
Linear model: $f(\mathbf{x}) = \mathbf{w}^T \mathbf{x}$
On-line error: $$Error_i(\mathbf{w}) = \frac{1}{2} (y_i - f(\mathbf{x}_i))^2$$
$(i+1)$-th update for the linear model (from the delta rule above):
$$\mathbf{w}^{(i+1)} \leftarrow \mathbf{w}^{(i)} + \alpha(i) \left( y_i - \mathbf{w}^{(i)T} \mathbf{x}_i \right) \mathbf{x}_i$$
Typical learning rate: $\alpha(i) \approx \frac{1}{i}$
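A sketch of the on-line (delta rule) update with the decaying learning rate $\alpha(i) = 1/i$; again NumPy, not from the slides:

```python
import numpy as np

def fit_online(X, y):
    """On-line (delta rule) regression: one gradient step per example."""
    w = np.zeros(X.shape[1])
    for i, (x_i, y_i) in enumerate(zip(X, y), start=1):
        alpha = 1.0 / i                        # decaying learning rate
        w = w + alpha * (y_i - w @ x_i) * x_i  # delta rule update
    return w
```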
On-line learning. Example
[Figure: four panels (1-4) showing the fitted line after successive on-line updates.]
Input normalization
• Input normalization:
– Solution to the problem of different scales
– Makes all inputs vary in the same range around 0
$$\bar{x}_j = \frac{1}{n} \sum_{i=1}^{n} x_{i,j} \qquad \sigma_j^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_{i,j} - \bar{x}_j)^2$$
New input: $$\tilde{x}_{i,j} = \frac{x_{i,j} - \bar{x}_j}{\sigma_j}$$
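A sketch of the normalization step (NumPy; `ddof=1` gives the $n-1$ denominator used above):

```python
import numpy as np

def normalize_inputs(X):
    """Standardize each input column to zero mean and unit variance."""
    mean = X.mean(axis=0)
    std = X.std(axis=0, ddof=1)  # ddof=1 gives the (n-1) denominator
    return (X - mean) / std
```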
Additive linear models
• Models linear in the parameters we want to fit
$$f(\mathbf{x}) = w_0 + \sum_{k=1}^{m} w_k \phi_k(\mathbf{x})$$
$w_0, w_1, \ldots, w_m$ - parameters
$\phi_1(\mathbf{x}), \phi_2(\mathbf{x}), \ldots, \phi_m(\mathbf{x})$ - feature or basis functions
• Basis function examples (a code sketch follows the list):
– a higher-order polynomial, one-dimensional input $\mathbf{x} = (x_1)$:
$$\phi_1(x) = x \quad \phi_2(x) = x^2 \quad \phi_3(x) = x^3$$
– multidimensional quadratic, $\mathbf{x} = (x_1, x_2)$:
$$\phi_1(\mathbf{x}) = x_1 \quad \phi_2(\mathbf{x}) = x_1^2 \quad \phi_3(\mathbf{x}) = x_2 \quad \phi_4(\mathbf{x}) = x_2^2 \quad \phi_5(\mathbf{x}) = x_1 x_2$$
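A sketch of the quadratic feature expansion above (NumPy, an addition to the slides); the model stays linear in $\mathbf{w}$, so the usual least-squares machinery applies:

```python
import numpy as np

def quadratic_features(X):
    """Map each input row (x1, x2) to (1, x1, x1^2, x2, x2^2, x1*x2)."""
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([np.ones(len(X)), x1, x1**2, x2, x2**2, x1 * x2])

# The model is still linear in w, so the normal equations apply:
# w, *_ = np.linalg.lstsq(quadratic_features(X), y, rcond=None)
```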
[Figure: a curved (quadratic) model surface over the two-dimensional input.]
Multidimensional additive model example
Assume: $\boldsymbol{\phi}(\mathbf{x}_i) = (1, \phi_1(\mathbf{x}_i), \phi_2(\mathbf{x}_i), \ldots, \phi_m(\mathbf{x}_i))$
The gradient condition is the same as before, with the feature vector in place of the raw input:
$$\nabla_{\mathbf{w}} J_n(\mathbf{w}) = -\frac{2}{n} \sum_{i=1}^{n} (y_i - f(\mathbf{x}_i))\, \boldsymbol{\phi}(\mathbf{x}_i) = \mathbf{0}$$
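Setting this gradient to zero gives the same normal equations as before, applied to the feature matrix. A sketch (the basis functions passed in are arbitrary callables, my notation rather than the lecture's):

```python
import numpy as np

def fit_additive(X, y, basis_fns):
    """Least-squares fit of f(x) = w_0 + sum_k w_k * phi_k(x).

    basis_fns is a list of callables, each mapping the n-by-d input
    matrix X to a length-n vector of feature values.
    """
    Phi = np.column_stack([np.ones(len(X))] + [f(X) for f in basis_fns])
    w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return w

# Example: cubic polynomial basis for one-dimensional input
# w = fit_additive(X, y, [lambda X: X[:, 0], lambda X: X[:, 0]**2,
#                         lambda X: X[:, 0]**3])
```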
Statistical model of regression
• A model:
$$y = f(\mathbf{x}, \mathbf{w}) + \varepsilon, \quad \text{s.t.} \quad \varepsilon \sim N(0, \sigma^2)$$
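A sketch of sampling from this generative model (the true weights and noise level below are made-up illustration values):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 2
w_true, sigma = np.array([1.0, 2.0, -0.5]), 0.3  # assumed, for illustration

# Inputs with a leading bias column; outputs y = f(x, w) + eps
X = np.column_stack([np.ones(n), rng.uniform(-3, 3, size=(n, d))])
y = X @ w_true + rng.normal(0.0, sigma, size=n)
```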
ML estimation of parameters
• The loss function and the log-likelihood of the output are related:
$$-\log p(y_i \mid \mathbf{x}_i, \mathbf{w}, \sigma) = \frac{1}{2\sigma^2} \left( y_i - f(\mathbf{x}_i, \mathbf{w}) \right)^2 + c(\sigma)$$
• Minimizing the squared loss is therefore equivalent to maximizing the likelihood of the outputs.
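A quick numerical check of this relationship (using SciPy's normal log-pdf; the example values are arbitrary):

```python
import numpy as np
from scipy.stats import norm

y_i, f_xi, sigma = 2.0, 1.4, 0.3  # arbitrary example values

neg_log_lik = -norm.logpdf(y_i, loc=f_xi, scale=sigma)
loss = (y_i - f_xi) ** 2 / (2 * sigma ** 2)
c = 0.5 * np.log(2 * np.pi * sigma ** 2)  # the additive constant c(sigma)

print(np.isclose(neg_log_lik, loss + c))  # True
```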