
Linear Models for Regression

Sargur Srihari
srihari@cedar.buffalo.edu


Linear Regression with One Input


Simplest form of linear regression:
a linear function of a single input variable
  y(x, w) = w_0 + w_1 x
The task is to learn the weights w_0, w_1 from data D = {(x_i, y_i)}, i = 1,...,N

More useful class of functions: polynomial curve fitting
  y(x, w) = w_0 + w_1 x + w_2 x^2 + ... = \sum_j w_j x^j
More generally, we use a linear combination of non-linear functions \phi_i(x) of the
input variable, instead of the powers x^i; these are called basis functions
Such models are linear functions of the parameters (which gives them simple
analytical properties), yet are nonlinear with respect to the input variables

Plan of Discussion
Discuss supervised learning starting with regression
Goal: predict value of one or more target variables t
Given d-dimensional vector x of input variables
Terminology
Regression
When t is continuous-valued
Classification
When t takes values from a set of unordered labels (categories)
Ordinal Regression
Discrete values, ordered categories


Regression with multiple inputs


[Figure: polynomial curve fit of a target variable against a single scalar input variable x]
Generalizations
Predict value of continuous target variable t given value
of d input variables x=[x1,..xd]
t can also be a set of variables (multi-output regression)
Models are linear functions of adjustable parameters,
specifically linear combinations of nonlinear functions of the input variables


Simplest Linear Model with d inputs


Regression with d input variables:
  y(x, w) = w_0 + w_1 x_1 + ... + w_d x_d = w^T x
  (the last form holds when x is augmented with a constant component x_0 = 1)
where x = (x_1,...,x_d)^T are the input variables
This differs from linear regression with one variable and from polynomial
regression with one variable
Called linear regression since it is a linear function of both
  the parameters w_0,...,w_d
  the input variables x_1,...,x_d
Being a linear function of the input variables is a significant limitation
  In the one-dimensional case this amounts to a straight-line fit
  (a degree-one polynomial): y(x, w) = w_0 + w_1 x

Fitting a Regression Plane


Assume t is a function of the inputs x_1, x_2,..., x_d (the independent
variables). The goal is to find the best linear regressor of t on all the inputs.
For d = 2 this means fitting a plane through the N input samples;
in d dimensions it is a hyperplane.
[Table: sample data with columns x_1, x_2, t]

Learning to Rank Problem


LeToR: multiple input features, a single target value
t is discrete (e.g., 1, 2,..., 6) in the training set, but a continuous
value in [1, 6] is learned and used to rank objects


Regression with multiple inputs: LeToR


Input (x_i): d features of a query-URL pair (d > 200), e.g.:
  Log frequency of query in anchor text
  Query word in color on page
  # of images on page
  # of (out) links on page
  PageRank of page
  URL length
  URL contains ~
  Page length
(Traditional IR uses TF/IDF.)

In the LETOR 4.0 dataset: 46 query-document features, with a maximum of
124 URLs per query. The Yahoo! data set has d = 700.

Output (y): relevance value (the target variable)
  Point-wise labels (0, 1, 2, 3)
  Regression returns a continuous value, which allows fine-grained ranking of URLs

Input Features
See http://research.microsoft.com/en-us/projects/mslr/feature.aspx
Feature List of Microsoft Learning to Rank Datasets
feature ids   feature description                               streams
1-5           covered query term number                         body, anchor, title, url, whole document
6-10          covered query term ratio                          body, anchor, title, url, whole document
11-15         stream length                                     body, anchor, title, url, whole document
16-20         IDF (inverse document frequency)                  body, anchor, title, url, whole document
21-25         sum of term frequency                             body, anchor, title, url, whole document
26-30         min of term frequency                             body, anchor, title, url, whole document
31-35         max of term frequency                             body, anchor, title, url, whole document
36-40         mean of term frequency                            body, anchor, title, url, whole document
41-45         variance of term frequency                        body, anchor, title, url, whole document
46-50         sum of stream length normalized term frequency    body, anchor, title, url, whole document

IDF(t, D) = log( N / |{d \in D : t \in d}| )



Feature Statistics
Most of the 46 features are normalized as continuous values in [0, 1];
the exceptions are a few features that are identically 0.
Feature  1      2      3      4      5      6      7  8  9  10  11  12      13      14      15
Min      0      0      0      0      0      0      0  0  0  0   0   0       0       0       0
Max      1      1      1      1      1      0      0  0  0  0   1   1       1       1       1
Mean     0.254  0.1598 0.1392 0.2158 0.1322 0.1614 0  0  0  0   0   0.2841  0.1382  0.2109  0.1218

Feature  16     17     18     19     20     21     22     23     24     25     26     27     28     29    30     31
Min      0      0      0      0      0      0      0      0      0      0      0      0      0      0     0      0
Max      1      1      1      1      1      1      1      1      1      1      1      1      1      1     1      1
Mean     0.2879 0.1533 0.2258 0.3057 0.3332 0.1534 0.5473 0.5592 0.5453 0.5622 0.1675 0.1377 0.1249 0.126 0.2109 0.1705

Feature  32     33     34     35     36     37     38     39     40     41     42     43     44  45     46
Min      0      0      0      0      0      0      0      0      0      0      0      0      0   0      0
Max      1      1      1      1      1      1      1      1      1      1      1      0      1   1      1
Mean     0.1694 0.1603 0.1275 0.0762 0.0762 0.0728 0.5479 0.5574 0.5502 0.5673 0.4321 0.3361 0   0.1065 0.1211

Linear Regression with M Basis Functions


The model is extended by considering nonlinear functions of the input variables:
  y(x, w) = w_0 + \sum_{j=1}^{M-1} w_j \phi_j(x)
where the \phi_j(x) are called basis functions
We now need M weights for the basis functions instead of d weights for the raw features
Defining a dummy basis function \phi_0(x) = 1, this can be written as
  y(x, w) = \sum_{j=0}^{M-1} w_j \phi_j(x) = w^T \phi(x)
where w = (w_0, w_1,..., w_{M-1})^T and \phi = (\phi_0, \phi_1,..., \phi_{M-1})^T
Basis functions allow non-linearity with respect to the d input variables
(see the design-matrix sketch below)
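For concreteness, here is a minimal NumPy sketch (my addition, not from the
slides) that builds a design matrix from a set of basis functions and evaluates
y(x, w) = w^T \phi(x); the polynomial basis and the weight values are arbitrary
illustrative choices.

```python
import numpy as np

def polynomial_basis(x, M):
    """Design matrix Phi with phi_j(x) = x**j, j = 0..M-1 (phi_0 = 1 is the bias)."""
    x = np.asarray(x, dtype=float).reshape(-1)     # N scalar inputs
    return np.vstack([x**j for j in range(M)]).T   # shape (N, M)

def predict(Phi, w):
    """y(x, w) = w^T phi(x), evaluated for every row of the design matrix."""
    return Phi @ w

# Toy usage: N = 5 inputs, M = 3 basis functions (1, x, x^2)
x = np.linspace(0.0, 1.0, 5)
w = np.array([0.5, -1.0, 2.0])          # hypothetical weights
print(predict(polynomial_basis(x, M=3), w))
```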



Polynomial Basis with one variable


Linear basis function model:
  y(x, w) = \sum_{j=0}^{M-1} w_j \phi_j(x) = w^T \phi(x)
Polynomial basis (for a single variable x):
  \phi_j(x) = x^j, giving a polynomial of degree M-1
Disadvantages
  Global: changes in one region of input space affect other regions
  Difficult to formulate: the number of polynomial terms grows rapidly with the
  degree and the input dimension
One remedy is to divide the input space into regions and use a different
polynomial in each region; this is equivalent to spline functions

Polynomial Basis with d variables


(Not used in practice!)
Consider, for a vector x, the basis \phi_j(x) = ||x||^j = (x_1^2 + x_2^2 + ... + x_d^2)^{j/2}
  x = (2,1) and x = (1,2) have the same squared sum, so this is unsatisfactory:
  the vector is converted into a scalar value, thereby losing information
Better approach:
  A polynomial of degree M-1 has terms with the variables taken none, one,
  two,..., M-1 at a time
  Use a multi-index j = (j_1, j_2,..., j_d) such that j_1 + j_2 + ... + j_d <= M-1
  For a quadratic (M = 3) with three variables (d = 3):
    y(x, w) = \sum_{(j_1,j_2,j_3)} w_j \phi_j(x)
            = w_0 + w_{1,0,0} x_1 + w_{0,1,0} x_2 + w_{0,0,1} x_3
              + w_{1,1,0} x_1 x_2 + w_{1,0,1} x_1 x_3 + w_{0,1,1} x_2 x_3
              + w_{2,0,0} x_1^2 + w_{0,2,0} x_2^2 + w_{0,0,2} x_3^2
The number of quadratic terms is 1 + d + d(d-1)/2 + d
  For d = 46, this is 1128 (a small enumeration check follows below)
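As an aside (my addition), a small Python check that enumerates the monomials
with total degree at most M-1 = 2 and confirms the 1 + d + d(d-1)/2 + d count:

```python
from itertools import combinations_with_replacement

def num_polynomial_terms(d, degree):
    """Count monomials x_1^{j_1}...x_d^{j_d} with j_1 + ... + j_d <= degree.

    Each monomial of exact degree k corresponds to choosing k variables with
    repetition from the d inputs (k = 0 gives the constant term)."""
    return sum(len(list(combinations_with_replacement(range(d), k)))
               for k in range(degree + 1))

d = 46
assert num_polynomial_terms(d, degree=2) == 1 + d + d*(d - 1)//2 + d
print(num_polynomial_terms(d, degree=2))   # 1128
```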

Gaussian Radial Basis Functions

Gaussian basis function:
  \phi_j(x) = exp( -(x - \mu_j)^2 / (2 s^2) )
It does not necessarily have a probabilistic interpretation
  The usual normalization term is unimportant, since each basis function is
  multiplied by a weight w_j
Choice of parameters
  The \mu_j govern the locations of the basis functions
    They can be an arbitrary set of points within the range of the data
    We can choose some representative data points
  s governs the spatial scale
    It could be chosen from the data set, e.g., the average variance
Several variables
  A Gaussian kernel would be chosen for each dimension
  For each j a different set of means would be needed, perhaps chosen from the data:
    \phi_j(x) = exp( -(1/2) (x - \mu_j)^T \Sigma^{-1} (x - \mu_j) )
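A minimal sketch (my addition) of a one-dimensional Gaussian RBF design matrix,
with the centres \mu_j taken from representative data points and a single shared
scale s, as suggested above:

```python
import numpy as np

def gaussian_rbf_design(x, centers, s):
    """Phi[n, j] = exp(-(x_n - mu_j)^2 / (2 s^2)); the normalization term is
    omitted because each basis function is multiplied by a learned weight."""
    x = np.asarray(x, dtype=float).reshape(-1, 1)         # (N, 1)
    mu = np.asarray(centers, dtype=float).reshape(1, -1)  # (1, M)
    return np.exp(-(x - mu)**2 / (2.0 * s**2))            # (N, M)

x = np.linspace(0.0, 1.0, 50)
centers = x[::10]                  # a few representative data points as centres
print(gaussian_rbf_design(x, centers, s=0.2).shape)   # (50, 5)
```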

Sigmoidal Basis Function

Sigmoidal basis function:
  \phi_j(x) = \sigma( (x - \mu_j) / s ),   where   \sigma(a) = 1 / (1 + exp(-a))
Equivalently we can use tanh, because it is related to the logistic sigmoid by
  tanh(a) = 2 \sigma(2a) - 1
[Figure: logistic sigmoid basis functions]
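A quick numerical check of that identity (my addition):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

a = np.linspace(-5.0, 5.0, 11)
# tanh(a) = 2*sigmoid(2a) - 1, so the difference should be ~0 up to rounding
print(np.max(np.abs(np.tanh(a) - (2.0 * sigmoid(2.0 * a) - 1.0))))
```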

Other Basis Functions


Fourier basis
  Expansion in sinusoidal functions
  Infinite spatial extent
Wavelets (from signal processing)
  Functions localized in both time and frequency
  Useful for lattices such as images and time series
The further discussion is independent of the choice of basis,
including the identity basis \phi(x) = x

Probabilistic Formulation
Given:
  A data set of N observations {x_n}, n = 1,..., N
  with corresponding target values {t_n}
Goal:
Predict value of t for a new value of x
Simplest approach:
Construct function y(x) whose values are the predictions
Probabilistic Approach:
Model predictive distribution p(t|x)
It expresses uncertainty about value of t for each x
From this conditional distribution
Predict t for any value of x so as to minimize a loss function
Typical loss is squared loss for which the solution is the conditional
expectation of t


Relationship between Maximum Likelihood and Least Squares

We will show that minimizing the sum-of-squares error is the same as the
maximum likelihood solution under a Gaussian noise model
Assume the target variable is a scalar t given by a deterministic
function y(x, w) with additive Gaussian noise:
  t = y(x, w) + \epsilon
where \epsilon is zero-mean Gaussian noise with precision \beta
Thus the distribution of t is univariate normal,
  p(t | x, w, \beta) = N( t | y(x, w), \beta^{-1} )
with mean y(x, w) and variance \beta^{-1}

Likelihood Function
Data set:
  Inputs X = {x_1,..., x_N} with targets t = {t_1,..., t_N}
  The target variables t_n are scalars forming a vector t of size N
Likelihood of the target data:
  It is the probability of observing the data, assuming the observations are independent:
  p(t | X, w, \beta) = \prod_{n=1}^{N} N( t_n | w^T \phi(x_n), \beta^{-1} )

Log-Likelihood Function
Likelihood:
  p(t | X, w, \beta) = \prod_{n=1}^{N} N( t_n | w^T \phi(x_n), \beta^{-1} )
Log-likelihood:
  ln p(t | w, \beta) = \sum_{n=1}^{N} ln N( t_n | w^T \phi(x_n), \beta^{-1} )
                     = (N/2) ln \beta - (N/2) ln 2\pi - \beta E_D(w)
where
  E_D(w) = (1/2) \sum_{n=1}^{N} { t_n - w^T \phi(x_n) }^2    (sum-of-squares error function)
We have used the standard form of the univariate Gaussian:
  N(x | \mu, \sigma^2) = (1 / (2\pi\sigma^2)^{1/2}) exp( -(x - \mu)^2 / (2\sigma^2) )

Maximizing Log-Likelihood Function


Log-likelihood:
  ln p(t | w, \beta) = \sum_{n=1}^{N} ln N( t_n | w^T \phi(x_n), \beta^{-1} )
                     = (N/2) ln \beta - (N/2) ln 2\pi - \beta E_D(w)
where
  E_D(w) = (1/2) \sum_{n=1}^{N} { t_n - w^T \phi(x_n) }^2    (sum-of-squares error function)
Maximizing the likelihood with respect to w is therefore equivalent to minimizing E_D(w)
  Take the derivative of ln p(t | w, \beta) with respect to w, set it equal to zero,
  and solve for w (or, equivalently, take the derivative of -E_D(w))

Gradient of Log-likelihood wrt w

  \nabla_w ln p(t | w, \beta) = \beta \sum_{n=1}^{N} { t_n - w^T \phi(x_n) } \phi(x_n)^T
since
  \nabla_w [ (1/2)(a - w^T b)^2 ] = -(a - w^T b) b^T
The gradient is set to zero and we solve for w:
  0 = \sum_{n=1}^{N} t_n \phi(x_n)^T - w^T ( \sum_{n=1}^{N} \phi(x_n) \phi(x_n)^T )
as shown in the next slide
The second derivative is negative (the Hessian -\beta \Phi^T \Phi is negative
semi-definite), so this stationary point is a maximum

Max Likelihood Solution for w


X = {x_1,..., x_N} are the sample inputs (vectors of d variables) and
t = {t_1,..., t_N} are the targets (scalars)
Solving for w:
  w_ML = \Phi^+ t
where \Phi^+ = (\Phi^T \Phi)^{-1} \Phi^T is the Moore-Penrose pseudo-inverse of the
N x M design matrix \Phi
Design matrix \Phi: rows correspond to the N samples, columns to the M basis functions:
  \Phi = [ \phi_0(x_1)   \phi_1(x_1)   ...   \phi_{M-1}(x_1)
           \phi_0(x_2)   \phi_1(x_2)   ...   \phi_{M-1}(x_2)
             ...
           \phi_0(x_N)   \phi_1(x_N)   ...   \phi_{M-1}(x_N) ]
Pseudo-inverse: a generalization of the notion of matrix inverse to non-square
matrices; if the design matrix is square and invertible, the pseudo-inverse is
the same as the inverse
The \phi_j(x_n) are M basis functions, e.g., Gaussians centred on M data points:
  \phi_j(x) = exp( -(1/2) (x - \mu_j)^T \Sigma^{-1} (x - \mu_j) )
A minimal numerical sketch follows below.
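As an illustration only (not from the slides), a NumPy sketch of the closed-form
solution on synthetic data with a polynomial design matrix; it also computes the
noise precision \beta_ML discussed a few slides later. np.linalg.lstsq is used
rather than forming (\Phi^T \Phi)^{-1} explicitly, since the SVD-based solve is
numerically safer.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (an assumed example): t = sin(2*pi*x) + Gaussian noise
N = 30
x = rng.uniform(0.0, 1.0, N)
t = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.2, N)

# Design matrix with polynomial basis functions phi_j(x) = x**j, j = 0..M-1
M = 6
Phi = np.vstack([x**j for j in range(M)]).T          # shape (N, M)

# w_ML = pseudo-inverse(Phi) @ t; lstsq solves this via an SVD
w_ml, *_ = np.linalg.lstsq(Phi, t, rcond=None)

# 1/beta_ML = residual variance of the targets around the regression function
inv_beta_ml = np.mean((t - Phi @ w_ml) ** 2)

print("w_ML =", w_ml)
print("1/beta_ML =", inv_beta_ml)
```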

What is the role of the Bias parameter w0?

Regression function:
  y(x, w) = w_0 + \sum_{j=1}^{M-1} w_j \phi_j(x)
Error function:
  E_D(w) = (1/2) \sum_{n=1}^{N} { t_n - w^T \phi(x_n) }^2
Making the bias parameter explicit:
  E_D(w) = (1/2) \sum_{n=1}^{N} { t_n - w_0 - \sum_{j=1}^{M-1} w_j \phi_j(x_n) }^2
Setting the derivative with respect to w_0 equal to zero and solving for w_0:
  w_0 = \bar{t} - \sum_{j=1}^{M-1} w_j \bar{\phi}_j
where
  \bar{t} = (1/N) \sum_{n=1}^{N} t_n   and   \bar{\phi}_j = (1/N) \sum_{n=1}^{N} \phi_j(x_n)
The first term is the average of the N target values; the second is a weighted
sum of the averages of the basis function values
Thus the bias w_0 compensates for the difference between the average target
value and the weighted sum of the averages of the basis function values
(a numerical check is sketched below)
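A small numerical check of this identity (my addition), reusing a least-squares
fit with \phi_0 = 1 as the bias basis function; the data are synthetic:

```python
import numpy as np

rng = np.random.default_rng(1)
N, M = 50, 4
x = rng.uniform(-1.0, 1.0, N)
t = 1.5 + 2.0 * x - x**2 + rng.normal(0.0, 0.1, N)

Phi = np.vstack([x**j for j in range(M)]).T    # phi_0 = 1 is the bias column
w, *_ = np.linalg.lstsq(Phi, t, rcond=None)

t_bar = t.mean()
phi_bar = Phi[:, 1:].mean(axis=0)              # averages of phi_1 .. phi_{M-1}
# At the optimum, w_0 = t_bar - sum_j w_j * phi_bar_j (up to numerical precision)
print(w[0], t_bar - w[1:] @ phi_bar)
```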

Maximum Likelihood for the precision \beta

We have determined the maximum-likelihood solution for w using the
probabilistic formulation:
  p(t | x, w, \beta) = N( t | y(x, w), \beta^{-1} ),   w_ML = \Phi^+ t
with log-likelihood
  ln p(t | w, \beta) = (N/2) ln \beta - (N/2) ln 2\pi - \beta E_D(w),
  E_D(w) = (1/2) \sum_{n=1}^{N} { t_n - w^T \phi(x_n) }^2
Taking the gradient with respect to \beta and setting it to zero gives
  1 / \beta_ML = (1/N) \sum_{n=1}^{N} { t_n - w_ML^T \phi(x_n) }^2
Thus the inverse of the noise precision gives the residual variance of the
target values around the regression function

Geometry of Least Squares Solution


Consider an N-dimensional space with axes t_n, so that t = (t_1,..., t_N)^T
Each basis function \phi_j(x_n), corresponding to the jth column of \Phi, is
represented in this space as an N-dimensional vector \varphi_j
If M < N, the M vectors \varphi_j span a subspace S of dimension M
y is the N-dimensional vector with elements y(x_n, w)
The sum-of-squares error is equal to the squared Euclidean distance between y and t
The solution w corresponds to the y that lies in the subspace S and is closest to t
  This corresponds to the orthogonal projection of t onto S
(Recall the design matrix \Phi: N rows for the data points and M columns for the
basis functions \phi_0,..., \phi_{M-1}, each column giving one of the N-dimensional
vectors \varphi_j.)

Difficulty of Direct solution


The direct solution of the normal equations,
  w_ML = \Phi^+ t,   \Phi^+ = (\Phi^T \Phi)^{-1} \Phi^T,
can lead to numerical difficulties
  When two basis functions are (nearly) collinear, \Phi^T \Phi is singular
  (determinant = 0) or ill-conditioned
  The parameters can then have very large magnitudes
  This is not uncommon with real data sets
It can be addressed using
  Singular Value Decomposition (SVD), as sketched below
  Addition of a regularization term, which ensures the matrix is non-singular
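A hedged illustration (my addition) of the numerical issue and the SVD-based
remedy: with two nearly collinear basis columns, solving the normal equations
directly is unstable, whereas an SVD-based solve with a singular-value cutoff
(np.linalg.lstsq with rcond) stays well behaved.

```python
import numpy as np

rng = np.random.default_rng(2)
N = 100
x = rng.uniform(0.0, 1.0, N)

# Two nearly collinear basis functions: phi_1(x) = x and phi_2(x) = x + tiny noise
Phi = np.column_stack([np.ones(N), x, x + 1e-8 * rng.normal(size=N)])
t = 1.0 + 2.0 * x + rng.normal(0.0, 0.1, N)

print("condition number of Phi^T Phi:", np.linalg.cond(Phi.T @ Phi))

# Normal equations: ill-conditioned, can give huge/inaccurate weights or fail
w_normal = np.linalg.solve(Phi.T @ Phi, Phi.T @ t)

# SVD-based solve with a singular-value cutoff: discards the nearly dependent
# direction and returns a small, sensible minimum-norm solution
w_svd, *_ = np.linalg.lstsq(Phi, t, rcond=1e-6)

print("normal equations:", w_normal)
print("SVD (lstsq):     ", w_svd)
```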


Sequential (On-line) Learning


The maximum likelihood solution
  w_ML = (\Phi^T \Phi)^{-1} \Phi^T t
is a batch technique: it processes the entire training set in one go
This is computationally expensive for large data sets, due to the huge
N x M design matrix
The solution is to use a sequential algorithm in which samples are presented
one at a time; this is called stochastic gradient descent

Stochastic Gradient Descent


The error function sums over the data points:
  E_D(w) = (1/2) \sum_{n=1}^{N} { t_n - w^T \phi(x_n) }^2
Writing E_D(w) = \sum_n E_n, update the parameter vector w using
  w^{(\tau+1)} = w^{(\tau)} - \eta \nabla E_n
Substituting for the derivative:
  w^{(\tau+1)} = w^{(\tau)} + \eta ( t_n - w^{(\tau)T} \phi(x_n) ) \phi(x_n)
w is initialized to some starting vector w^{(0)}, and the learning rate \eta must
be chosen with care to ensure convergence
This is known as the Least Mean Squares (LMS) algorithm (a short sketch follows below)
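A minimal LMS sketch (my addition) on synthetic data; the learning rate and the
number of passes are ad-hoc choices, not values from the slides:

```python
import numpy as np

rng = np.random.default_rng(3)
N, M = 200, 3
x = rng.uniform(0.0, 1.0, N)
t = 0.5 - x + 2.0 * x**2 + rng.normal(0.0, 0.1, N)
Phi = np.vstack([x**j for j in range(M)]).T    # polynomial basis, shape (N, M)

eta = 0.05                                     # learning rate (chosen ad hoc)
w = np.zeros(M)                                # w^(0)
for _ in range(200):                           # passes over the data
    for n in rng.permutation(N):               # one sample at a time
        # LMS update: w <- w + eta * (t_n - w^T phi(x_n)) * phi(x_n)
        w = w + eta * (t[n] - w @ Phi[n]) * Phi[n]

# With enough passes this approaches the batch least-squares solution
print("LMS:  ", w)
print("batch:", np.linalg.lstsq(Phi, t, rcond=None)[0])
```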


Derivation of Gradient Descent Rule

The error surface shows the desirability of every weight vector; for a linear
model with a sum-of-squares error it is a parabola (paraboloid) with a single
global minimum

Simple Gradient Descent


Procedure Gradient-Descent(
  \theta^1    // initial starting point
  f           // function to be minimized
  \delta      // convergence threshold
)
  1  t <- 1
  2  do
  3    \theta^{t+1} <- \theta^t - \eta \nabla f(\theta^t)
  4    t <- t + 1
  5  while || \theta^t - \theta^{t-1} || > \delta
  6  return \theta^t

Intuition
  Taylor's expansion of the function f(\theta) in the neighborhood of \theta^t is
    f(\theta) \approx f(\theta^t) + (\theta - \theta^t)^T \nabla f(\theta^t)
  Let \theta = \theta^{t+1} = \theta^t + h; thus f(\theta^{t+1}) \approx f(\theta^t) + h^T \nabla f(\theta^t)
  The derivative of f(\theta^{t+1}) with respect to h is \nabla f(\theta^t); to first order,
  h = \eta \nabla f(\theta^t) increases f, while h = -\eta \nabla f(\theta^t) decreases it
  Alternatively: the slope \nabla f(\theta^t) points in the direction of steepest ascent;
  if we take a step in the opposite direction we decrease the value of f

One-dimensional example (a runnable version is sketched below)
  Let f(\theta) = \theta^2; this function has its minimum at \theta = 0, which we
  want to find using gradient descent
  We have f'(\theta) = 2\theta
  For gradient descent, we update \theta by -\eta f'(\theta)
  If \theta^t > 0 then f'(\theta^t) is positive, thus \theta^{t+1} < \theta^t
  If \theta^t < 0 then f'(\theta^t) = 2\theta^t is negative, thus \theta^{t+1} > \theta^t
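A runnable Python version of the procedure above (my addition), applied to the
one-dimensional example f(\theta) = \theta^2; the learning rate and threshold are
arbitrary illustrative choices:

```python
import numpy as np

def gradient_descent(theta1, grad_f, eta=0.1, delta=1e-6, max_iter=10_000):
    """Minimize f by repeating theta <- theta - eta * grad_f(theta),
    stopping when consecutive iterates are closer than delta."""
    theta_prev = np.asarray(theta1, dtype=float)
    for _ in range(max_iter):
        theta = theta_prev - eta * grad_f(theta_prev)
        if np.linalg.norm(theta - theta_prev) <= delta:
            return theta
        theta_prev = theta
    return theta_prev

# f(theta) = theta^2 has gradient f'(theta) = 2*theta and its minimum at 0
print(gradient_descent(np.array([3.0]), grad_f=lambda th: 2.0 * th))
```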

Difficulties with Simple Gradient Descent


Performance depends on the choice of the learning rate \eta
  Large learning rate: overshoots
  Small learning rate: extremely slow
Solution
  Start with a large \eta and settle on an optimal value
  This requires a schedule for shrinking \eta


Improvements to Gradient Descent


(The Gradient-Descent procedure of the previous slide, with the step size \eta
now chosen adaptively at each iteration.)
Line Search: adaptively choose \eta at each step
  Define a line in the direction of the gradient:
    g(\eta) = \theta^t + \eta \nabla f(\theta^t)
  Find three points \eta_1 < \eta_2 < \eta_3 so that f(g(\eta_2)) is smaller than at
  both of the others, then take \eta = (\eta_1 + \eta_2)/2
Brent's method: the solid line is f(g(\eta)); three points bracket its optimum,
and the dashed line shows a quadratic fit
Dashed line shows quadratic fit

Conjugate Gradient Ascent


In simple gradient ascent, two consecutive gradients are orthogonal:
  \nabla f_obj(\theta^{t+1}) is orthogonal to \nabla f_obj(\theta^t)
so progress is zig-zag and progress toward the maximum slows down
[Figure: zig-zag trajectories for a quadratic f_obj(x,y) = -(x^2 + 10y^2) and an
exponential f_obj(x,y) = exp(-(x^2 + 10y^2))]
Solution:
  Remember past directions and bias the new direction h^t as a combination of
  the current gradient g^t and the past direction h^{t-1}
  This gives faster convergence than simple gradient ascent

Steepest descent
Gradient descent search determines a
weight vector w that minimizes ED(w) by
Starting with an arbitrary initial weight vector
Repeatedly modifying it in small steps
At each step, the weight vector is modified in the direction that produces the
steepest descent along the error surface


Definition of Gradient Vector


The gradient (derivative) of E with respect to each component of the vector w:
  \nabla E[w] \equiv [ \partial E/\partial w_0, \partial E/\partial w_1, ..., \partial E/\partial w_n ]
Notice that \nabla E[w] is a vector of partial derivatives
  It specifies the direction that produces the steepest increase in E
  The negative of this vector specifies the direction of steepest decrease


Direction of Steepest Descent


[Figure: error surface over the w_0-w_1 plane; the negated gradient gives the
direction in the w_0-w_1 plane producing the steepest descent]


Gradient Descent Rule


  w <- w + \Delta w,   where   \Delta w = -\eta \nabla E[w]
\eta is a positive constant called the learning rate
  It determines the step size of the gradient descent search
Component form of gradient descent
  Can also be written as
    w_i <- w_i + \Delta w_i,   where   \Delta w_i = -\eta \partial E/\partial w_i

Regularized Least Squares


A regularization term is added to the error function to control overfitting
Error function to minimize:
  E(w) = E_D(w) + \lambda E_W(w)
where \lambda is the regularization coefficient, which controls the relative importance
of the data-dependent error E_D(w) and the regularization term E_W(w)
A simple form of regularizer:
  E_W(w) = (1/2) w^T w
The total error function then becomes (quadratic regularizer):
  E(w) = (1/2) \sum_{n=1}^{N} { t_n - w^T \phi(x_n) }^2 + (\lambda/2) w^T w
This regularizer is called weight decay, because in sequential learning the weight
values decay towards zero unless supported by the data

Quadratic Regularizer
  E(w) = (1/2) \sum_{n=1}^{N} { t_n - w^T \phi(x_n) }^2 + (\lambda/2) w^T w
Advantage: the error function remains a quadratic function of w,
so its exact minimizer can be found in closed form
Setting the gradient to zero and solving for w:
  w = ( \lambda I + \Phi^T \Phi )^{-1} \Phi^T t
This is a simple extension of the least-squares solution
  w_ML = ( \Phi^T \Phi )^{-1} \Phi^T t
(a numerical sketch follows below)
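An illustrative ridge-regression sketch (my addition); the value of \lambda is
arbitrary and, as a later slide notes, the problem of choosing it remains.

```python
import numpy as np

def ridge_fit(Phi, t, lam):
    """Closed-form regularized least squares: w = (lam*I + Phi^T Phi)^{-1} Phi^T t."""
    M = Phi.shape[1]
    return np.linalg.solve(lam * np.eye(M) + Phi.T @ Phi, Phi.T @ t)

rng = np.random.default_rng(4)
N, M = 25, 10                                     # few points, many basis functions
x = rng.uniform(0.0, 1.0, N)
t = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.2, N)
Phi = np.vstack([x**j for j in range(M)]).T

w_unreg = np.linalg.lstsq(Phi, t, rcond=None)[0]  # lambda = 0
w_ridge = ridge_fit(Phi, t, lam=1e-3)             # weight decay shrinks the weights

print("max |w| unregularized:", np.abs(w_unreg).max())
print("max |w| ridge:        ", np.abs(w_ridge).max())
```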

Generalization of Quadratic Regularizer


Regularized error:
  E(w) = (1/2) \sum_{n=1}^{N} { t_n - w^T \phi(x_n) }^2 + (\lambda/2) \sum_{j=1}^{M} | w_j |^q
where q = 2 corresponds to the quadratic regularizer, and q = 1 is known as the lasso
Contours of the constant regularization term \sum_j |w_j|^q in the space of (w_1, w_2):
[Figure: contours of \sum_j |w_j|^q for several values of q, including
q = 1 (lasso: |w_1| + |w_2| = const), q = 2 (quadratic: w_1^2 + w_2^2 = const),
and q = 4 (w_1^4 + w_2^4 = const)]

Geometric Interpretation of Regularizer


We are trying to find the w that minimizes the unregularized error function
  \sum_{n=1}^{N} { t_n - w^T \phi(x_n) }^2
Any value of w on a contour of this error gives the same error
Choose the value of w subject to the constraint
  \sum_{j=1}^{M} | w_j |^q \le \eta   for some constant \eta
(we do not want the weights to become too large)
The penalty and constraint formulations are related through Lagrange multipliers
[Figure: contours of the unregularized error function in the (w_1, w_2) plane]


Minimization of Unregularized Error Subject to Constraint

[Figure: blue contours of the unregularized error function, the constraint region
for the quadratic regularizer (q = 2), and the optimum value w*]


Sparsity with Lasso constraint


With q = 1, if \lambda is sufficiently large, some of the coefficients w_j are driven to zero
  This leads to a sparse model, in which the corresponding basis functions play no role
The origin of the sparsity is illustrated in the figure:
[Figure: with the quadratic regularizer the solution has both w_1* and w_2* nonzero,
whereas minimization with the lasso regularizer gives a sparse solution with w_1* = 0]
(a brief illustration follows below)
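A hedged illustration (my addition, using scikit-learn rather than anything
specified in the slides): with a q = 1 penalty the lasso tends to zero out some
coefficients, while the q = 2 (ridge) penalty merely shrinks them. The penalty
strength alpha = 0.05 is an arbitrary choice.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(5)
N, M = 100, 10
X = rng.normal(size=(N, M))              # plays the role of the design matrix
w_true = np.zeros(M)
w_true[:3] = [2.0, -1.0, 0.5]            # only 3 basis functions actually matter
t = X @ w_true + rng.normal(0.0, 0.1, N)

lasso = Lasso(alpha=0.05).fit(X, t)      # q = 1 penalty
ridge = Ridge(alpha=0.05).fit(X, t)      # q = 2 penalty

print("nonzero lasso coefficients:", int(np.sum(lasso.coef_ != 0)))
print("nonzero ridge coefficients:", int(np.sum(ridge.coef_ != 0)))
```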


Regularization: Conclusion
Regularization allows
complex models to be trained on small data sets
without severe over-fitting
It limits the effective model complexity
  i.e., how many basis functions to use
The problem of limiting complexity is thus shifted to one of determining a
suitable value of the regularization coefficient \lambda


Returning to LeToR Problem


Try:
  Several basis functions
  Quadratic regularization
Express the results as the root-mean-square error
  E_RMS = sqrt( 2 E(w*) / N )
rather than as the squared error E(w*) or as an error rate with thresholded results

Multiple Outputs
Now there are several target variables t = (t_1,..., t_K), where K > 1
This can be treated as multiple (K) independent regression problems,
using different basis functions for each component of t
A more common solution is to use the same set of basis functions to model
all components of the target vector:
  y(x, w) = W^T \phi(x)
where y is a K-dimensional column vector, W is an M x K matrix of weight
parameters, and \phi(x) is an M-dimensional column vector with elements \phi_j(x)

Solution for Multiple Outputs


The set of target observations t_1,..., t_N is combined into a matrix T of size
N x K such that the nth row is given by t_n^T
Similarly, combine the input vectors x_1,..., x_N into a matrix X
Maximizing the log-likelihood function gives a solution of the same form:
  W_ML = ( \Phi^T \Phi )^{-1} \Phi^T T
(a minimal sketch follows below)
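A small multi-output sketch (my addition): the same pseudo-inverse solve applied
to a target matrix T with K columns gives the M x K weight matrix W_ML; the data
and weights here are synthetic.

```python
import numpy as np

rng = np.random.default_rng(6)
N, M, K = 100, 5, 3
x = rng.uniform(0.0, 1.0, N)
Phi = np.vstack([x**j for j in range(M)]).T        # shared basis, shape (N, M)

W_true = rng.normal(size=(M, K))                   # hypothetical true weights
T = Phi @ W_true + rng.normal(0.0, 0.05, (N, K))   # targets: one row t_n^T per sample

# W_ML = (Phi^T Phi)^{-1} Phi^T T; lstsq handles all K target columns at once
W_ml, *_ = np.linalg.lstsq(Phi, T, rcond=None)
print(W_ml.shape)                                  # (M, K)
```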
