Iterative Reweighted Least Squares Explained

The document discusses iterative reweighted least squares (IRLS), an iterative method for finding solutions to linear and logistic regression problems. IRLS uses the Hessian matrix, which contains the second derivatives of the error function with respect to the weights, to update the weights in each iteration. This allows IRLS to converge faster than simple gradient descent by taking into account the curvature of the error surface. IRLS works by iteratively reweighting the least squares objective based on the current weight estimates.


Iterative Reweighted Least Squares

Sargur N. Srihari
University at Buffalo, State University of New York
USA

Topics in Linear Classification using Probabilistic Discriminative Models

•  Generative vs Discriminative
1.  Fixed basis functions in linear classification
2.  Logistic Regression (two-class)
3.  Iterative Reweighted Least Squares (IRLS)
4.  Multiclass Logistic Regression
5.  Probit Regression
6.  Canonical Link Functions

Topics in IRLS
•  Linear and Logistic Regression
•  Newton-Raphson Method
•  Hessian
•  IRLS for Linear Regression
•  IRLS for Logistic Regression
•  Numerical Example
•  Hessian in Deep Learning

Recall Linear Regression

•  Assuming Gaussian noise: p(t|x,w,\beta) = \mathcal{N}(t \mid y(x,w), \beta^{-1})
•  And a linear model: y(x,w) = \sum_{j=0}^{M-1} w_j \phi_j(x) = w^T \phi(x)
•  Likelihood has the form
   p(\mathbf{t} \mid X, w, \beta) = \prod_{n=1}^{N} \mathcal{N}(t_n \mid w^T \phi(x_n), \beta^{-1})
•  Log-likelihood becomes
   \ln p(\mathbf{t} \mid w, \beta) = \sum_{n=1}^{N} \ln \mathcal{N}(t_n \mid w^T \phi(x_n), \beta^{-1}) = \frac{N}{2}\ln\beta - \frac{N}{2}\ln 2\pi - \beta E_D(w)
   where E_D(w) = \frac{1}{2}\sum_{n=1}^{N} \{ t_n - w^T \phi(x_n) \}^2
•  Which is quadratic in w
•  Derivative has the form
   \nabla \ln p(\mathbf{t} \mid w, \beta) = \beta \sum_{n=1}^{N} \{ t_n - w^T \phi(x_n) \}\, \phi(x_n)^T
•  Setting equal to zero and solving we get
   w_{ML} = \Phi^{+}\mathbf{t}, \quad \Phi^{+} = (\Phi^T \Phi)^{-1}\Phi^T
   where \Phi is the N x M design matrix with elements \Phi_{nj} = \phi_j(x_n):
   \Phi = \begin{pmatrix} \phi_0(x_1) & \phi_1(x_1) & \cdots & \phi_{M-1}(x_1) \\ \phi_0(x_2) & & & \\ \vdots & & \ddots & \\ \phi_0(x_N) & & \cdots & \phi_{M-1}(x_N) \end{pmatrix}
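As a concrete illustration, a minimal NumPy sketch of this maximum-likelihood solution; the polynomial basis functions and the synthetic data are illustrative choices, not taken from the slides.

```python
import numpy as np

# Sketch: w_ML = (Phi^T Phi)^{-1} Phi^T t via the pseudo-inverse, on synthetic data.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=30)
t = 2.0 * x - 1.0 + 0.1 * rng.normal(size=30)    # noisy linear targets

Phi = np.column_stack([x**j for j in range(3)])  # design matrix, phi_j(x) = x^j, M = 3
w_ml = np.linalg.pinv(Phi) @ t                   # Phi^+ t = (Phi^T Phi)^{-1} Phi^T t
print(w_ml)                                      # close to [-1, 2, 0]
```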

Linear and Logistic Regression

•  In linear regression there is a closed-form maximum likelihood solution for determining w: w_{ML} = \Phi^{+}\mathbf{t}
   –  on the assumption of a Gaussian noise model
   –  due to the quadratic dependence of the log-likelihood on w:
      E_D(w) = \frac{1}{2}\sum_{n=1}^{N} \{ t_n - w^T \phi(x_n) \}^2
•  For logistic regression there is no closed-form maximum likelihood solution
   –  due to the nonlinearity of the logistic sigmoid:
      E(w) = -\ln p(\mathbf{t} \mid w) = -\sum_{n=1}^{N} \{ t_n \ln y_n + (1 - t_n)\ln(1 - y_n) \}, \quad y_n = \sigma(w^T \phi_n)
   –  but the departure from a quadratic is not substantial
•  The error function is convex, i.e., it has a unique minimum

What is IRLS?

•  An iterative method to find the solution w*
   –  for linear regression and logistic regression
   •  assuming a least squares objective
•  While simple gradient descent has the form
   w^{(new)} = w^{(old)} - \eta \nabla E(w)
•  IRLS uses the second derivative and has the form
   w^{(new)} = w^{(old)} - H^{-1} \nabla E(w)
•  It is derived from the Newton-Raphson method
   –  where H is the Hessian matrix, whose elements are the second derivatives of E(w) with respect to w
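To make the contrast concrete, a minimal sketch comparing one gradient-descent step with one Newton-Raphson step; the quadratic error E(w) = 0.5 w^T A w - b^T w and its coefficients are illustrative, not from the slides.

```python
import numpy as np

A = np.array([[3.0, 1.0], [1.0, 2.0]])   # positive-definite curvature of the quadratic
b = np.array([1.0, 1.0])

def grad_E(w):
    return A @ w - b                      # gradient of E(w)

H = A                                     # Hessian of a quadratic is constant

w_old = np.zeros(2)
eta = 0.1
w_gd = w_old - eta * grad_E(w_old)                 # simple gradient descent step
w_nr = w_old - np.linalg.solve(H, grad_E(w_old))   # Newton-Raphson step: w - H^{-1} grad E(w)
print(w_gd, w_nr)                                  # the Newton step lands on the minimum A^{-1} b
```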

Newton-Raphson Method (1-D)

•  Based on second derivatives
   –  The derivative of a function at x is the slope of its tangent at that point
•  Illustration of 2nd and higher derivatives
   –  Derivatives of a Gaussian p(x) ~ N(0, \sigma)
   [Figure: Newton-Raphson iteration, showing the current solution being moved toward the critical point]
•  Since we are solving for the derivative of E(w), we need the second derivative
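A minimal 1-D sketch of the idea: to minimize E(w) we apply Newton-Raphson to its derivative, iterating w <- w - E'(w)/E''(w). The function E(w) = w^4 is an illustrative choice, not from the slides.

```python
def dE(w):  return 4 * w**3     # first derivative E'(w)
def d2E(w): return 12 * w**2    # second derivative E''(w)

w = 2.0                          # current solution
for _ in range(10):
    w = w - dE(w) / d2E(w)       # Newton step toward the critical point
print(w)                         # approaches the minimum at w = 0
```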
Second derivative measures curvature

•  Derivative of a derivative
•  Quadratic functions with different curvatures
   [Figure: three quadratics with negative, zero, and positive curvature; the dashed line is the value of the cost function predicted by the gradient alone. With negative curvature the decrease is faster than the gradient predicts; with zero curvature the gradient predicts the decrease correctly; with positive curvature the decrease is slower than expected and the function can actually increase.]
Beyond Gradient: Jacobian and Hessian matrices

•  Sometimes we need to find all derivatives of a function whose input and output are both vectors
•  If we have a function f: R^m \to R^n
   –  then the matrix of partial derivatives is known as the Jacobian matrix J, defined as
      J_{i,j} = \frac{\partial}{\partial x_j} f_i(x)
Second derivative

•  Derivative of a derivative
•  For a function f: R^n \to R, the derivative with respect to x_i of the derivative of f with respect to x_j is denoted \frac{\partial^2 f}{\partial x_i \partial x_j}
•  In a single dimension we can denote \frac{\partial^2 f}{\partial x^2} by f''(x)
•  It tells us how the first derivative will change as we vary the input
•  This is important because it tells us whether a gradient step will cause as much of an improvement as we would expect from the gradient alone
Hessian

•  Second derivative with many dimensions
•  H(f)(x) is defined as
   H(f)(x)_{i,j} = \frac{\partial^2}{\partial x_i \partial x_j} f(x)
•  The Hessian is the Jacobian of the gradient
•  The Hessian matrix is symmetric, i.e., H_{i,j} = H_{j,i},
   •  anywhere that the second partial derivatives are continuous
   –  So the Hessian matrix can be decomposed into a set of real eigenvalues and an orthogonal basis of eigenvectors
•  Eigenvalues of H are useful to determine the learning rate, as seen in the next two slides
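A short numerical sketch of "the Hessian is the Jacobian of the gradient" and of its symmetry; the function f(x) = x1^2 x2 + 3 x1 x2 is an illustrative choice, not from the slides.

```python
import numpy as np

def grad_f(x):
    x1, x2 = x
    return np.array([2*x1*x2 + 3*x2, x1**2 + 3*x1])   # analytic gradient of f

def hessian_f(x, eps=1e-5):
    # finite-difference Jacobian of the gradient = Hessian
    n = len(x)
    H = np.zeros((n, n))
    for j in range(n):
        e = np.zeros(n); e[j] = eps
        H[:, j] = (grad_f(x + e) - grad_f(x - e)) / (2 * eps)
    return H

H = hessian_f(np.array([1.0, 2.0]))
print(np.allclose(H, H.T, atol=1e-4))   # symmetric: H_ij = H_ji
print(np.linalg.eigvalsh(H))            # real eigenvalues of the symmetric Hessian
```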
Role of eigenvalues of Hessian

•  The second derivative in a direction d is d^T H d
   –  If d is an eigenvector, the second derivative in that direction is given by its eigenvalue
   –  For other directions, it is a weighted average of the eigenvalues (weights between 0 and 1, with eigenvectors at a smaller angle to d receiving larger weight)
•  The maximum eigenvalue determines the maximum second derivative and the minimum eigenvalue determines the minimum second derivative
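A small sketch of the directional second derivative d^T H d; the symmetric matrix H below is an arbitrary illustrative example.

```python
import numpy as np

H = np.array([[4.0, 1.0], [1.0, 2.0]])
eigvals, eigvecs = np.linalg.eigh(H)

d = eigvecs[:, 0]                      # an eigenvector direction (unit length)
print(d @ H @ d, eigvals[0])           # matches its eigenvalue

d = np.array([1.0, 1.0]) / np.sqrt(2)  # some other unit direction
val = d @ H @ d
print(eigvals.min() <= val <= eigvals.max())   # lies between the min and max eigenvalue
```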
Learning rate from Hessian

•  Taylor series of f(x) around the current point x^{(0)}:
   f(x) \approx f(x^{(0)}) + (x - x^{(0)})^T g + \frac{1}{2}(x - x^{(0)})^T H (x - x^{(0)})
   •  where g is the gradient and H is the Hessian at x^{(0)}
   –  If we use learning rate \varepsilon, the new point x is given by x^{(0)} - \varepsilon g. Thus we get
      f(x^{(0)} - \varepsilon g) \approx f(x^{(0)}) - \varepsilon g^T g + \frac{1}{2}\varepsilon^2 g^T H g
•  There are three terms:
   –  the original value of f,
   –  the expected improvement due to the slope, and
   –  the correction to be applied due to curvature
•  Solving for the step size that minimizes this approximation (when g^T H g > 0) gives
   \varepsilon^* = \frac{g^T g}{g^T H g}
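A minimal sketch of this optimal step size; the quadratic f(x) = 0.5 x^T H x and the starting point are illustrative, not from the slides.

```python
import numpy as np

H = np.array([[5.0, 0.0], [0.0, 1.0]])
x0 = np.array([1.0, 1.0])
g = H @ x0                               # gradient of f at x0

eps_star = (g @ g) / (g @ H @ g)         # optimal step under the quadratic Taylor model
x1 = x0 - eps_star * g
print(eps_star)
print(0.5 * x1 @ H @ x1 < 0.5 * x0 @ H @ x0)   # f decreased after the step
```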
Another 2nd Derivative Method

•  Using the Taylor series of f(x) around the current x^{(0)}:
   f(x) \approx f(x^{(0)}) + (x - x^{(0)})^T \nabla_x f(x^{(0)}) + \frac{1}{2}(x - x^{(0)})^T H(f)(x^{(0)}) (x - x^{(0)})
   –  solve for the critical point of this function to give
      x^* = x^{(0)} - H(f)(x^{(0)})^{-1} \nabla_x f(x^{(0)})
   –  When f is a quadratic (positive definite) function, use this solution to jump to the minimum of the function directly
   –  When f is not quadratic, apply the solution iteratively
•  Can reach a critical point much faster than gradient descent
   –  But useful only when the nearby point is a minimum
Two applications of IRLS

•  IRLS is applicable to both Linear Regression and Logistic Regression
•  We discuss both; for each we need:
1.  Model
    •  Linear regression: y(x,w) = \sum_{j=0}^{M-1} w_j \phi_j(x) = w^T \phi(x)
    •  Logistic regression: p(C_1|\phi) = y(\phi) = \sigma(w^T \phi)
2.  Objective function E(w)
    •  Linear regression: sum of squared errors
       E(w) = \frac{1}{2}\sum_{n=1}^{N} \{ t_n - w^T \phi(x_n) \}^2
    •  Logistic regression: Bernoulli likelihood function
       p(\mathbf{t} \mid w) = \prod_{n=1}^{N} y_n^{t_n} (1 - y_n)^{1-t_n}
3.  Gradient \nabla E(w)
4.  Hessian H = \nabla\nabla E(w)
5.  Newton-Raphson update
    w^{(new)} = w^{(old)} - H^{-1}\nabla E(w)

IRLS for Linear Regression

1.  Model:
    y(x,w) = \sum_{j=0}^{M-1} w_j \phi_j(x) = w^T \phi(x)
2.  Error function (sum of squares), for data set X = \{x_n, t_n\}, n = 1,..,N:
    E(w) = \frac{1}{2}\sum_{n=1}^{N} \{ t_n - w^T \phi(x_n) \}^2
3.  Gradient of the error function:
    \nabla E(w) = \sum_{n=1}^{N} (w^T \phi_n - t_n)\phi_n = \Phi^T \Phi w - \Phi^T \mathbf{t}
    where \Phi is the N x M design matrix whose nth row is given by \phi_n^T
4.  Hessian:
    H = \nabla\nabla E(w) = \sum_{n=1}^{N} \phi_n \phi_n^T = \Phi^T \Phi
5.  Newton-Raphson update: w^{(new)} = w^{(old)} - H^{-1}\nabla E(w)
    Substituting H = \Phi^T \Phi and \nabla E(w) = \Phi^T \Phi w - \Phi^T \mathbf{t}:
    w^{(new)} = w^{(old)} - (\Phi^T \Phi)^{-1}\{\Phi^T \Phi w^{(old)} - \Phi^T \mathbf{t}\} = (\Phi^T \Phi)^{-1}\Phi^T \mathbf{t}
    which is the standard least squares solution.

    Since the Hessian is independent of w, Newton-Raphson gives the exact solution in one step.
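A minimal sketch of this one-step behaviour, using a synthetic design matrix Phi and targets t (illustrative data, not from the slides).

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 50, 3
Phi = rng.normal(size=(N, M))                 # N x M design matrix, rows phi_n^T
w_true = np.array([1.0, -2.0, 0.5])
t = Phi @ w_true + 0.1 * rng.normal(size=N)   # noisy targets

w = np.zeros(M)                               # w_old
grad = Phi.T @ Phi @ w - Phi.T @ t            # grad E(w) = Phi^T Phi w - Phi^T t
H = Phi.T @ Phi                               # Hessian, independent of w
w_new = w - np.linalg.solve(H, grad)          # one Newton-Raphson step

w_ls = np.linalg.solve(Phi.T @ Phi, Phi.T @ t)   # standard least squares solution
print(np.allclose(w_new, w_ls))                  # exact solution in a single step
```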


IRLS for Logistic Regression

1.  Model: p(C_1|\phi) = y(\phi) = \sigma(w^T \phi). Likelihood:
    p(\mathbf{t} \mid w) = \prod_{n=1}^{N} y_n^{t_n} (1 - y_n)^{1-t_n}
    for data set \{\phi_n, t_n\}, t_n \in \{0,1\}, y_n = \sigma(w^T \phi_n)
2.  Objective function: negative log-likelihood
    E(w) = -\sum_{n=1}^{N} \{ t_n \ln y_n + (1 - t_n)\ln(1 - y_n) \}
    the cross-entropy error function
3.  Gradient of the error function:
    \nabla E(w) = \sum_{n=1}^{N} (y_n - t_n)\phi_n = \Phi^T(\mathbf{y} - \mathbf{t})
    where \Phi is the N x M design matrix whose nth row is given by \phi_n^T
4.  Hessian:
    H = \nabla\nabla E(w) = \sum_{n=1}^{N} y_n(1 - y_n)\phi_n\phi_n^T = \Phi^T R \Phi
    where R is the N x N diagonal matrix with elements R_{nn} = y_n(1 - y_n) = \sigma(w^T\phi_n)(1 - \sigma(w^T\phi_n)).
    The Hessian is not constant: it depends on w through R. Since H is positive definite (i.e., for arbitrary u, u^T H u > 0), the error function is a convex function of w and so has a unique minimum.
5.  Newton-Raphson update: w^{(new)} = w^{(old)} - H^{-1}\nabla E(w)
    Substituting H = \Phi^T R \Phi and \nabla E(w) = \Phi^T(\mathbf{y} - \mathbf{t}):
    w^{(new)} = w^{(old)} - (\Phi^T R \Phi)^{-1}\Phi^T(\mathbf{y} - \mathbf{t})
              = (\Phi^T R \Phi)^{-1}\{\Phi^T R \Phi w^{(old)} - \Phi^T(\mathbf{y} - \mathbf{t})\}
              = (\Phi^T R \Phi)^{-1}\Phi^T R \mathbf{z}
    where \mathbf{z} is the N-dimensional vector with elements \mathbf{z} = \Phi w^{(old)} - R^{-1}(\mathbf{y} - \mathbf{t})

    The update formula is a set of normal equations (a weighted least-squares problem).
    Since the Hessian depends on w, they are applied iteratively, each time using the new weight vector.
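A sketch of the resulting IRLS loop for two-class logistic regression; the design matrix Phi and labels t below are synthetic, illustrative data, not from the slides.

```python
import numpy as np

rng = np.random.default_rng(1)
N, M = 100, 3
X = rng.normal(size=(N, 2))
Phi = np.hstack([np.ones((N, 1)), X])            # design matrix with a bias basis function
w_true = np.array([-0.5, 2.0, -1.0])
t = (rng.uniform(size=N) < 1 / (1 + np.exp(-Phi @ w_true))).astype(float)

w = np.zeros(M)
for it in range(10):
    y = 1 / (1 + np.exp(-Phi @ w))               # y_n = sigma(w^T phi_n)
    grad = Phi.T @ (y - t)                       # grad E(w) = Phi^T (y - t)
    R = np.diag(y * (1 - y))                     # R_nn = y_n (1 - y_n)
    H = Phi.T @ R @ Phi                          # H = Phi^T R Phi
    w = w - np.linalg.solve(H, grad)             # Newton-Raphson / IRLS update
    print(it, np.linalg.norm(grad))              # gradient norm shrinks each iteration
```

The diagonal matrix R is what gives the method its name: each Newton step solves the weighted normal equations (\Phi^T R \Phi) w = \Phi^T R z, with the weights recomputed from the current w.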

Initial Weight Vector, Gradient and Hessian (2-class)

•  Weight vector
•  Gradient
•  Hessian
[Figure: numerical values of the initial weight vector, gradient, and Hessian for a two-class example]

Final Weight Vector, Gradient and Hessian (2-class)

•  Weight vector
•  Gradient
•  Hessian
[Figure: numerical values of the final weight vector, gradient, and Hessian for the same two-class example]

Number of iterations: 10
Error (initial and final): 15.0642, 1.0000e-009

Hessian in Deep Learning

•  The second derivative can be used to determine whether a critical point is a local minimum, a local maximum, or a saddle point
   –  At a critical point, f'(x) = 0
   –  When f''(x) > 0, the first derivative f'(x) increases as we move to the right and decreases as we move to the left
      •  We conclude that x is a local minimum
   –  For a local maximum, f'(x) = 0 and f''(x) < 0
   –  When f''(x) = 0 the test is inconclusive: x may be a saddle point or part of a flat region
Multidimensional Second derivative test
•  In multiple dimensions, we need to examine
second derivatives of all dimensions
•  Eigendecomposition generalizes the test
•  Test eigenvalues of Hessian to determine
whether critical point is a local maximum, local
minimum or saddle point
•  When H is positive definite (all eigenvalues are
positive) the point is a local minimum
•  Similarly negative definite implies a maximum
Saddle point

•  Contains both positive and negative curvature
•  The function is f(x) = x_1^2 - x_2^2
   –  Along the x_1 axis the function curves upward: this axis is an eigenvector of H and has a positive eigenvalue
   –  Along x_2 the function curves downward; this direction is an eigenvector of H with a negative eigenvalue
   –  At a saddle point the eigenvalues are both positive and negative
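A minimal sketch of the eigenvalue test at the critical point x = (0, 0) of f(x) = x_1^2 - x_2^2.

```python
import numpy as np

H = np.array([[2.0, 0.0],      # d^2 f / dx1^2 = 2  (curves upward along x1)
              [0.0, -2.0]])    # d^2 f / dx2^2 = -2 (curves downward along x2)

eigvals = np.linalg.eigvalsh(H)
if np.all(eigvals > 0):
    verdict = "local minimum"
elif np.all(eigvals < 0):
    verdict = "local maximum"
elif eigvals.min() < 0 < eigvals.max():
    verdict = "saddle point"
else:
    verdict = "inconclusive"
print(eigvals, verdict)         # mixed signs -> saddle point
```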
