Iterative Reweighted Least Squares
Sargur N. Srihari
University at Buffalo, State University of New York
USA
Topics in Linear Classification using
Probabilistic Discriminative Models
• Generative vs Discriminative
1. Fixed basis functions in linear classification
2. Logistic Regression (two-class)
3. Iterative Reweighted Least Squares (IRLS)
4. Multiclass Logistic Regression
5. Probit Regression
6. Canonical Link Functions
Topics in IRLS
• Linear and Logistic Regression
• Newton-Raphson Method
• Hessian
• IRLS for Linear Regression
• IRLS for Logistic Regression
• Numerical Example
• Hessian in Deep Learning
Recall Linear Regression
• Assuming Gaussian noise: p(t | x, w, β) = N(t | y(x,w), β^{-1})
• And a linear model: y(x,w) = Σ_{j=0}^{M−1} w_j φ_j(x) = w^T φ(x)
• Likelihood has the form
  p(t | X, w, β) = Π_{n=1}^{N} N(t_n | w^T φ(x_n), β^{-1})
• Log-likelihood becomes
  ln p(t | w, β) = Σ_{n=1}^{N} ln N(t_n | w^T φ(x_n), β^{-1}) = (N/2) ln β − (N/2) ln 2π − β E_D(w)
  where E_D(w) = (1/2) Σ_{n=1}^{N} { t_n − w^T φ(x_n) }²
• Which is quadratic in w
• Derivative has the form
  ∇ ln p(t | w, β) = β Σ_{n=1}^{N} { t_n − w^T φ(x_n) } φ(x_n)^T
• Setting equal to zero and solving we get
⎛ φ 0( x1 ) φ 1( x1 ) ... φ M −1( x1 ) ⎞
w ML = Φ t +
Φ + = (ΦT Φ) −1 ΦT ⎜
⎜ φ 0( x 2 )
⎟
⎟
Φ=⎜ ⎟
⎜ ⎟
⎜φ (x ) φ M −1( x N ) ⎟⎠
⎝ 0 N
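As a quick illustration of this closed-form solution, the sketch below computes w_ML from a design matrix using the pseudo-inverse. The polynomial basis and the synthetic data are illustrative assumptions, not taken from the slides.

```python
import numpy as np

# Synthetic 1-D regression data (illustrative)
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 50)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.shape)

# Design matrix with polynomial basis functions phi_j(x) = x**j, j = 0..M-1
M = 4
Phi = np.vander(x, M, increasing=True)        # shape (N, M); row n is phi(x_n)^T

# Maximum likelihood solution: w_ML = (Phi^T Phi)^{-1} Phi^T t = pinv(Phi) t
w_ml = np.linalg.pinv(Phi) @ t
print("w_ML =", w_ml)
```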
Linear and Logistic Regression
• In linear regression there is a closed-form maximum likelihood solution for determining w: w_ML = Φ^+ t
  • under the assumption of a Gaussian noise model
  • due to the quadratic dependence of the log-likelihood on w:
    E_D(w) = (1/2) Σ_{n=1}^{N} { t_n − w^T φ(x_n) }²
• For logistic regression there is no closed-form maximum likelihood solution
  • due to the nonlinearity of the logistic sigmoid:
    E(w) = − ln p(t | w) = − Σ_{n=1}^{N} { t_n ln y_n + (1 − t_n) ln(1 − y_n) },   y_n = σ(w^T φ_n)
• But the departure from a quadratic form is not substantial
• The error function is convex, i.e., it has a unique minimum (a short sketch of this error function follows below)
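To make the cross-entropy error concrete, here is a minimal sketch of the sigmoid and of E(w); the feature matrix Phi, the targets t, and the clipping constant are illustrative assumptions.

```python
import numpy as np

def sigmoid(a):
    # Logistic sigmoid: sigma(a) = 1 / (1 + exp(-a))
    return 1.0 / (1.0 + np.exp(-a))

def cross_entropy_error(w, Phi, t, eps=1e-12):
    # E(w) = -sum_n { t_n ln y_n + (1 - t_n) ln(1 - y_n) },  y_n = sigma(w^T phi_n)
    y = np.clip(sigmoid(Phi @ w), eps, 1 - eps)   # clip to avoid log(0)
    return -np.sum(t * np.log(y) + (1 - t) * np.log(1 - y))

# Tiny usage example with made-up features and labels
Phi = np.array([[1.0, 0.2], [1.0, 1.5], [1.0, -0.7]])
t = np.array([0.0, 1.0, 0.0])
print(cross_entropy_error(np.zeros(2), Phi, t))
```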
What is IRLS?
• An iterative method to find solution w*
– for linear regression and logistic regression
• assuming least squares objective
• While simple gradient descent has the form
  w^(new) = w^(old) − η ∇E(w)
• IRLS uses the second derivative and has the form
  w^(new) = w^(old) − H^{-1} ∇E(w)
• It is derived from Newton-Raphson method
• where H is the Hessian matrix whose elements are the
second derivatives of E(w) wrt w
Newton-Raphson Method (1-D)
• Based on second derivatives
– Derivative of a function at x is the slope of its
tangent at that point
• Illustration of 2nd and higher derivatives
  – [Figure: derivatives of a Gaussian p(x) ~ N(0, σ), marking the critical point and the current solution]
– Since we are solving for a zero of the derivative of E(w), we need the second derivative of E(w)
Second derivative measures curvature
• Derivative of a derivative
• Quadratic functions with different curvatures
  [Figure: three quadratics with negative, zero, and positive curvature; the dashed line is the value of the cost function predicted by the gradient alone. With negative curvature the decrease is faster than predicted by gradient descent, with zero curvature the gradient predicts the decrease correctly, and with positive curvature the decrease is slower than expected and the cost actually increases.]
Beyond Gradient: Jacobian and
Hessian matrices
• Sometimes we need to find all derivatives of a
function whose input and output are both
vectors
• If we have a function f: R^m → R^n
  – Then the matrix of partial derivatives is known as the Jacobian matrix J, defined as
    J_{i,j} = ∂f_i(x) / ∂x_j
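As an aside, the sketch below estimates the Jacobian by finite differences; the step size h and the example function are illustrative assumptions.

```python
import numpy as np

def jacobian_fd(f, x, h=1e-6):
    # Numerical Jacobian of f: R^m -> R^n at x, with J[i, j] = d f_i / d x_j
    fx = np.asarray(f(x))
    J = np.zeros((fx.size, x.size))
    for j in range(x.size):
        xp = x.copy()
        xp[j] += h
        J[:, j] = (np.asarray(f(xp)) - fx) / h
    return J

# Example: f(x0, x1) = (x0 * x1, sin(x0)) -- purely illustrative
f = lambda x: np.array([x[0] * x[1], np.sin(x[0])])
print(jacobian_fd(f, np.array([1.0, 2.0])))
```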
Second derivative
• Derivative of a derivative
• For a function f: R^n → R, the derivative with respect to x_i of the derivative of f with respect to x_j is denoted ∂²f / ∂x_i ∂x_j
• In a single dimension we can denote ∂²f / ∂x² by f''(x)
• Tells us how the first derivative will change as
we vary the input
• This is important because it tells us whether a gradient step will yield as much of an improvement as the gradient alone would predict
Hessian
• Second derivative with many dimensions
• The Hessian H(f)(x) is defined as H(f)(x)_{i,j} = ∂²f(x) / ∂x_i ∂x_j
• Hessian is the Jacobian of the gradient
• Hessian matrix is symmetric, i.e., H_{i,j} = H_{j,i}
  • anywhere that the second partial derivatives are continuous
– So the Hessian matrix can be decomposed into a
set of real eigenvalues and an orthogonal basis of
eigenvectors
• Eigenvalues of H are useful to determine learning rate as
seen in next two slides
Role of eigenvalues of Hessian
• Second derivative in direction d is dTHd
– If d is an eigenvector, second derivative in that
direction is given by its eigenvalue
– For other directions, it is a weighted average of the eigenvalues (weights between 0 and 1, with eigenvectors at a smaller angle to d receiving more weight)
• Maximum eigenvalue determines maximum
second derivative and minimum eigenvalue
determines minimum second derivative
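The sketch below checks these claims numerically for an arbitrary symmetric matrix standing in for the Hessian: along an eigenvector, d^T H d equals the corresponding eigenvalue, and for any unit direction it lies between the smallest and largest eigenvalues.

```python
import numpy as np

H = np.array([[3.0, 1.0],
              [1.0, 2.0]])                  # arbitrary symmetric "Hessian"
eigvals, eigvecs = np.linalg.eigh(H)

# Along an eigenvector, the directional second derivative equals the eigenvalue
d = eigvecs[:, 0]
print(d @ H @ d, "≈", eigvals[0])

# For any other unit direction, d^T H d lies between the min and max eigenvalues
d = np.array([1.0, 1.0]) / np.sqrt(2.0)
print(eigvals.min(), "<=", d @ H @ d, "<=", eigvals.max())
```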
Learning rate from Hessian
• Taylor's series of f(x) around the current point x^(0):
  f(x) ≈ f(x^(0)) + (x − x^(0))^T g + (1/2) (x − x^(0))^T H (x − x^(0))
  • where g is the gradient and H is the Hessian at x^(0)
– If we use learning rate ε, the new point x is given by x^(0) − εg. Thus we get
  f(x^(0) − εg) ≈ f(x^(0)) − ε g^T g + (1/2) ε² g^T H g
• There are three terms:
– original value of f,
– expected improvement due to slope, and
– correction to be applied due to curvature
• Solving for the step size that minimizes this approximation (when g^T H g > 0) gives
  ε* = g^T g / (g^T H g)   (see the sketch below)
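A minimal sketch of this optimal step size; the gradient g and Hessian H below are made-up values standing in for those at the current point.

```python
import numpy as np

def optimal_step_size(g, H):
    # epsilon* = g^T g / (g^T H g), valid when the curvature term is positive
    curvature = g @ H @ g
    if curvature <= 0:
        raise ValueError("no finite minimizing step under this quadratic model")
    return (g @ g) / curvature

g = np.array([1.0, -2.0])                       # illustrative gradient
H = np.array([[2.0, 0.0], [0.0, 4.0]])          # illustrative Hessian
print("optimal step size:", optimal_step_size(g, H))
```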
Another 2nd Derivative Method
• Using Taylor's series of f(x) around the current x^(0):
  f(x) ≈ f(x^(0)) + (x − x^(0))^T ∇_x f(x^(0)) + (1/2) (x − x^(0))^T H(f)(x^(0)) (x − x^(0))
  – Solving for the critical point of this function gives
    x* = x^(0) − H(f)(x^(0))^{-1} ∇_x f(x^(0))
  – When f is a quadratic (positive definite) function, this solution jumps directly to the minimum of the function
  – When f is not quadratic, apply the solution iteratively
• Can reach critical point much faster than
gradient descent
– But useful only when nearby point is a minimum
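The sketch below applies this Newton-Raphson iteration to a simple non-quadratic function; the function, starting point, and iteration count are illustrative choices.

```python
import numpy as np

# Illustrative function f(x) = x0**4 + x1**2, with its gradient and Hessian
grad = lambda x: np.array([4 * x[0]**3, 2 * x[1]])
hess = lambda x: np.array([[12 * x[0]**2, 0.0],
                           [0.0,          2.0]])

x = np.array([2.0, 1.0])                        # starting point
for _ in range(10):
    x = x - np.linalg.solve(hess(x), grad(x))   # Newton step: x <- x - H^{-1} grad f
print("estimate of the critical point:", x)     # approaches the minimum at the origin
```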
Two applications of IRLS
• IRLS is applicable to both Linear Regression
and Logistic Regression
• We discuss both, for each we need
1. Model
   Linear Regression: y(x,w) = Σ_{j=0}^{M−1} w_j φ_j(x) = w^T φ(x)
   Logistic Regression: p(C1 | φ) = y(φ) = σ(w^T φ)
2. Objective function E(w)
   • Linear Regression: sum of squared errors E(w) = (1/2) Σ_{n=1}^{N} { t_n − w^T φ(x_n) }²
   • Logistic Regression: Bernoulli likelihood function p(t | w) = Π_{n=1}^{N} y_n^{t_n} (1 − y_n)^{1−t_n}
3. Gradient ∇E(w)
4. Hessian H = ∇∇E(w)
5. Newton-Raphson update w^(new) = w^(old) − H^{-1} ∇E(w)
IRLS for Linear Regression
1. Model:
   y(x,w) = Σ_{j=0}^{M−1} w_j φ_j(x) = w^T φ(x)
2. Error function (sum of squares):
   E(w) = (1/2) Σ_{n=1}^{N} { t_n − w^T φ(x_n) }²   for data set {x_n, t_n}, n = 1,..,N
3. Gradient of the error function:
   ∇E(w) = Σ_{n=1}^{N} (w^T φ_n − t_n) φ_n = Φ^T Φ w − Φ^T t
   where Φ is the N x M design matrix whose nth row is given by φ_n^T
4. Hessian:
   H = ∇∇E(w) = Σ_{n=1}^{N} φ_n φ_n^T = Φ^T Φ
5. Newton-Raphson update: w^(new) = w^(old) − H^{-1} ∇E(w)
   Substituting H = Φ^T Φ and ∇E(w) = Φ^T Φ w − Φ^T t:
   w^(new) = w^(old) − (Φ^T Φ)^{-1} { Φ^T Φ w^(old) − Φ^T t } = (Φ^T Φ)^{-1} Φ^T t
   which is the standard least squares solution
Since the Hessian is independent of w, Newton-Raphson gives the exact solution in a single step (verified in the sketch below)
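A numerical check of this one-step property; the synthetic data and the simple (1, x) basis are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 1, 30)
t = 1.0 + 2.0 * x + rng.normal(scale=0.1, size=x.shape)
Phi = np.column_stack([np.ones_like(x), x])    # design matrix with phi(x) = (1, x)

w = rng.normal(size=2)                         # arbitrary starting weight vector
grad = Phi.T @ Phi @ w - Phi.T @ t             # gradient of the sum-of-squares error
H = Phi.T @ Phi                                # Hessian (independent of w)
w_new = w - np.linalg.solve(H, grad)           # single Newton-Raphson step

w_ls = np.linalg.lstsq(Phi, t, rcond=None)[0]  # standard least-squares solution
print(np.allclose(w_new, w_ls))                # True: one step reaches the solution
```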
IRLS for Logistic Regression
1. Model: p(C1 | φ) = y(φ) = σ(w^T φ)
   Likelihood: p(t | w) = Π_{n=1}^{N} y_n^{t_n} (1 − y_n)^{1−t_n}
   for data set {φ_n, t_n}, t_n ∈ {0,1}, y_n = σ(w^T φ_n)
2. Objective function (negative log-likelihood):
   E(w) = − Σ_{n=1}^{N} { t_n ln y_n + (1 − t_n) ln(1 − y_n) }
   the cross-entropy error function
3. Gradient of the error function:
   ∇E(w) = Σ_{n=1}^{N} (y_n − t_n) φ_n = Φ^T (y − t)
   where Φ is the N x M design matrix whose nth row is given by φ_n^T
4. Hessian:
   H = ∇∇E(w) = Σ_{n=1}^{N} y_n (1 − y_n) φ_n φ_n^T = Φ^T R Φ
   where R is the N x N diagonal matrix with elements R_nn = y_n (1 − y_n) = σ(w^T φ_n)(1 − σ(w^T φ_n))
   The Hessian is not constant and depends on w through R. Since H is positive definite (i.e., for arbitrary u, u^T H u > 0), the error function is a convex function of w and so has a unique minimum
5. Newton-Raphson update: w^(new) = w^(old) − H^{-1} ∇E(w)
   Substituting H = Φ^T R Φ and ∇E(w) = Φ^T (y − t):
   w^(new) = w^(old) − (Φ^T R Φ)^{-1} Φ^T (y − t)
           = (Φ^T R Φ)^{-1} { Φ^T R Φ w^(old) − Φ^T (y − t) }
           = (Φ^T R Φ)^{-1} Φ^T R z
   where z is the N-dimensional vector with elements z = Φ w^(old) − R^{-1} (y − t)
   The update formula is a set of normal equations (for a weighted least-squares problem).
   Since the Hessian depends on w, the update is applied iteratively, each time using the new weight vector (a code sketch follows below)
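A minimal IRLS loop for two-class logistic regression following the update above; the synthetic data, tolerance, and iteration cap are illustrative choices, and the diagonal of R is handled elementwise and clipped for numerical safety.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def irls_logistic(Phi, t, n_iter=20, tol=1e-8):
    # IRLS update: w <- (Phi^T R Phi)^{-1} Phi^T R z,  z = Phi w - R^{-1}(y - t)
    w = np.zeros(Phi.shape[1])
    for _ in range(n_iter):
        y = sigmoid(Phi @ w)
        r = np.clip(y * (1 - y), 1e-10, None)      # diagonal of R, clipped away from 0
        z = Phi @ w - (y - t) / r                  # working response z
        A = Phi.T @ (Phi * r[:, None])             # Phi^T R Phi
        b = Phi.T @ (r * z)                        # Phi^T R z
        w_new = np.linalg.solve(A, b)
        if np.max(np.abs(w_new - w)) < tol:
            return w_new
        w = w_new
    return w

# Illustrative two-class data: 1-D inputs with a bias feature
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-1.0, 1.0, 50), rng.normal(2.0, 1.0, 50)])
t = np.concatenate([np.zeros(50), np.ones(50)])
Phi = np.column_stack([np.ones_like(x), x])
print("w =", irls_logistic(Phi, t))
```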
Initial Weight Vector, Gradient and
Hessian (2-class)
[Figure: numerical values of the initial weight vector, gradient, and Hessian for a two-class example]
Final Weight Vector, Gradient and
Hessian (2-class)
[Figure: numerical values of the final weight vector, gradient, and Hessian for the same example]
Number of iterations : 10
Error (Initial and Final): 15.0642, 1.0000e-009
Hessian in Deep Learning
• The second derivative can be used to
determine whether a critical point is a local
minimum, maximum or a saddle point
– At a critical point, f'(x) = 0
– When f''(x) > 0, the first derivative f'(x) increases as we move to the right and decreases as we move to the left
  • We conclude that x is a local minimum
– For a local maximum, f'(x) = 0 and f''(x) < 0
– When f''(x) = 0 the test is inconclusive: x may be a saddle point or part of a flat region
Multidimensional Second derivative test
• In multiple dimensions, we need to examine
second derivatives of all dimensions
• Eigendecomposition generalizes the test
• Test eigenvalues of Hessian to determine
whether critical point is a local maximum, local
minimum or saddle point
• When H is positive definite (all eigenvalues are
positive) the point is a local minimum
• Similarly negative definite implies a maximum
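A small sketch of this eigenvalue-based test; the Hessian value and the tolerance are illustrative.

```python
import numpy as np

def classify_critical_point(H, tol=1e-10):
    # Classify a critical point from the eigenvalues of its symmetric Hessian
    eig = np.linalg.eigvalsh(H)
    if np.all(eig > tol):
        return "local minimum"        # positive definite
    if np.all(eig < -tol):
        return "local maximum"        # negative definite
    if np.any(eig > tol) and np.any(eig < -tol):
        return "saddle point"         # mixed signs
    return "inconclusive"             # some eigenvalues (near) zero

H = np.array([[2.0, 0.5],
              [0.5, 1.0]])            # illustrative positive definite Hessian
print(classify_critical_point(H))     # -> local minimum
```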
Saddle point
• Contains both positive and negative curvature
• The function is f(x) = x1² − x2²
  – Along the x1 axis, the function curves upwards: this axis is an eigenvector of H with a positive eigenvalue
  – Along the x2 axis, the function curves downwards; this direction is an eigenvector of H with a negative eigenvalue
  – At a saddle point, the eigenvalues of H include both positive and negative values (see the sketch below)
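The sketch below writes down the (constant) Hessian of f(x) = x1² − x2² and confirms the mixed eigenvalue signs at the saddle.

```python
import numpy as np

# f(x) = x1^2 - x2^2 has the constant Hessian diag(2, -2)
H = np.array([[2.0,  0.0],
              [0.0, -2.0]])
eigvals, eigvecs = np.linalg.eigh(H)
print("eigenvalues:", eigvals)                                            # one negative, one positive
print("direction of upward curvature:", eigvecs[:, eigvals.argmax()])     # the x1 axis
print("direction of downward curvature:", eigvecs[:, eigvals.argmin()])   # the x2 axis
```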