Iterative Reweighted Least Squares
Sargur N. Srihari
University at Buffalo, State University of New York
USA
Topics in Linear Classification using
Probabilistic Discriminative Models
• Generative vs Discriminative
1. Fixed basis functions in linear classification
2. Logistic Regression (two-class)
3. Iterative Reweighted Least Squares (IRLS)
4. Multiclass Logistic Regression
5. Probit Regression
6. Canonical Link Functions
Topics in IRLS
• Linear and Logistic Regression
• Newton-Raphson Method
• Hessian
• IRLS for Linear Regression
• IRLS for Logistic Regression
• Numerical Example
• Hessian in Deep Learning
Recall Linear Regression
• Assuming Gaussian noise: p(t | x, w, β) = N(t | y(x,w), β^{-1})
• And a linear model: y(x,w) = Σ_{j=0}^{M−1} w_j φ_j(x) = w^T φ(x)
• Likelihood has the form
  p(t | X, w, β) = Π_{n=1}^{N} N(t_n | w^T φ(x_n), β^{-1})
• Log-likelihood becomes
  ln p(t | w, β) = Σ_{n=1}^{N} ln N(t_n | w^T φ(x_n), β^{-1}) = (N/2) ln β − (N/2) ln 2π − β E_D(w)
  where E_D(w) = (1/2) Σ_{n=1}^{N} { t_n − w^T φ(x_n) }²
• Which is quadratic in w
• Derivative has the form
  ∇ ln p(t | w, β) = β Σ_{n=1}^{N} { t_n − w^T φ(x_n) } φ(x_n)^T
• Setting equal to zero and solving we get
⎛ φ 0( x1 ) φ 1( x1 ) ... φ M −1( x1 ) ⎞
w ML = Φ t +
Φ + = (ΦT Φ) −1 ΦT ⎜
⎜ φ 0( x 2 )
⎟
⎟
Φ=⎜ ⎟
⎜ ⎟
⎜φ (x ) φ M −1( x N ) ⎟⎠
⎝ 0 N
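As a quick illustration of this closed-form solution, the sketch below computes w_ML from a design matrix using the pseudo-inverse. The polynomial basis and the synthetic data are illustrative assumptions, not taken from the slides.

```python
import numpy as np

# Synthetic 1-D regression data (illustrative)
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 50)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.shape)

# Design matrix with polynomial basis functions phi_j(x) = x**j, j = 0..M-1
M = 4
Phi = np.vander(x, M, increasing=True)        # shape (N, M); row n is phi(x_n)^T

# Maximum likelihood solution: w_ML = (Phi^T Phi)^{-1} Phi^T t = pinv(Phi) t
w_ml = np.linalg.pinv(Phi) @ t
print("w_ML =", w_ml)
```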
Linear and Logistic Regression
• In linear regression there is a closed-form maximum likelihood solution for determining w: w_ML = Φ^+ t
  • under the assumption of a Gaussian noise model
  • due to the quadratic dependence of the log-likelihood on w:
    E_D(w) = (1/2) Σ_{n=1}^{N} { t_n − w^T φ(x_n) }²
• For logistic regression there is no closed-form maximum likelihood solution
  • due to the nonlinearity of the logistic sigmoid:
    E(w) = − ln p(t | w) = − Σ_{n=1}^{N} { t_n ln y_n + (1 − t_n) ln(1 − y_n) },   y_n = σ(w^T φ_n)
• But the departure from a quadratic form is not substantial
• The error function is convex, i.e., it has a unique minimum (a short sketch of this error function follows below)
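To make the cross-entropy error concrete, here is a minimal sketch of the sigmoid and of E(w); the feature matrix Phi, the targets t, and the clipping constant are illustrative assumptions.

```python
import numpy as np

def sigmoid(a):
    # Logistic sigmoid: sigma(a) = 1 / (1 + exp(-a))
    return 1.0 / (1.0 + np.exp(-a))

def cross_entropy_error(w, Phi, t, eps=1e-12):
    # E(w) = -sum_n { t_n ln y_n + (1 - t_n) ln(1 - y_n) },  y_n = sigma(w^T phi_n)
    y = np.clip(sigmoid(Phi @ w), eps, 1 - eps)   # clip to avoid log(0)
    return -np.sum(t * np.log(y) + (1 - t) * np.log(1 - y))

# Tiny usage example with made-up features and labels
Phi = np.array([[1.0, 0.2], [1.0, 1.5], [1.0, -0.7]])
t = np.array([0.0, 1.0, 0.0])
print(cross_entropy_error(np.zeros(2), Phi, t))
```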
What is IRLS?
• An iterative method to find solution w*
– for linear regression and logistic regression
• assuming least squares objective
• While simple gradient descent has the form
  w^(new) = w^(old) − η ∇E(w)
• IRLS uses the second derivative and has the form
  w^(new) = w^(old) − H^{-1} ∇E(w)
• It is derived from Newton-Raphson method
• where H is the Hessian matrix whose elements are the
second derivatives of E(w) wrt w
Newton-Raphson Method (1-D)
• Based on second derivatives
– Derivative of a function at x is the slope of its
tangent at that point
• Illustration of 2nd and higher derivatives
  – [Figure: derivatives of a Gaussian p(x) ~ N(0, σ), marking the critical point and the current solution]
– Since we are solving for a zero of the derivative of E(w), we need the second derivative of E(w)
Second derivative measures curvature
• Derivative of a derivative
• Quadratic functions with different curvatures
  [Figure: three quadratics with negative, zero, and positive curvature; the dashed line is the value of the cost function predicted by the gradient alone. With negative curvature the decrease is faster than predicted by gradient descent, with zero curvature the gradient predicts the decrease correctly, and with positive curvature the decrease is slower than expected and the cost actually increases.]
Beyond Gradient: Jacobian and
Hessian matrices
• Sometimes we need to find all derivatives of a
function whose input and output are both
vectors
• If we have a function f: R^m → R^n
  – Then the matrix of partial derivatives is known as the Jacobian matrix J, defined as
    J_{i,j} = ∂f_i(x) / ∂x_j
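As an aside, the sketch below estimates the Jacobian by finite differences; the step size h and the example function are illustrative assumptions.

```python
import numpy as np

def jacobian_fd(f, x, h=1e-6):
    # Numerical Jacobian of f: R^m -> R^n at x, with J[i, j] = d f_i / d x_j
    fx = np.asarray(f(x))
    J = np.zeros((fx.size, x.size))
    for j in range(x.size):
        xp = x.copy()
        xp[j] += h
        J[:, j] = (np.asarray(f(xp)) - fx) / h
    return J

# Example: f(x0, x1) = (x0 * x1, sin(x0)) -- purely illustrative
f = lambda x: np.array([x[0] * x[1], np.sin(x[0])])
print(jacobian_fd(f, np.array([1.0, 2.0])))
```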
Second derivative
• Derivative of a derivative
• For a function f: R^n → R, the derivative with respect to x_i of the derivative of f with respect to x_j is denoted ∂²f / ∂x_i ∂x_j
• In a single dimension we can denote ∂²f / ∂x² by f''(x)
• Tells us how the first derivative will change as
we vary the input
• This is important because it tells us whether a gradient step will yield as much of an improvement as the gradient alone would predict
Hessian
• Second derivative with many dimensions
• The Hessian H(f)(x) is defined as H(f)(x)_{i,j} = ∂²f(x) / ∂x_i ∂x_j
• Hessian is the Jacobian of the gradient
• Hessian matrix is symmetric, i.e., H_{i,j} = H_{j,i}
  • anywhere that the second partial derivatives are continuous
– So the Hessian matrix can be decomposed into a
set of real eigenvalues and an orthogonal basis of
eigenvectors
• Eigenvalues of H are useful to determine learning rate as
seen in next two slides
Role of eigenvalues of Hessian
• Second derivative in direction d is dTHd
– If d is an eigenvector, second derivative in that
direction is given by its eigenvalue
– For other directions, it is a weighted average of the eigenvalues (weights between 0 and 1, with eigenvectors at a smaller angle to d receiving more weight)
• Maximum eigenvalue determines maximum
second derivative and minimum eigenvalue
determines minimum second derivative
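The sketch below checks these claims numerically for an arbitrary symmetric matrix standing in for the Hessian: along an eigenvector, d^T H d equals the corresponding eigenvalue, and for any unit direction it lies between the smallest and largest eigenvalues.

```python
import numpy as np

H = np.array([[3.0, 1.0],
              [1.0, 2.0]])                  # arbitrary symmetric "Hessian"
eigvals, eigvecs = np.linalg.eigh(H)

# Along an eigenvector, the directional second derivative equals the eigenvalue
d = eigvecs[:, 0]
print(d @ H @ d, "≈", eigvals[0])

# For any other unit direction, d^T H d lies between the min and max eigenvalues
d = np.array([1.0, 1.0]) / np.sqrt(2.0)
print(eigvals.min(), "<=", d @ H @ d, "<=", eigvals.max())
```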
Learning rate from Hessian
• Taylor's series of f(x) around the current point x^(0):
  f(x) ≈ f(x^(0)) + (x − x^(0))^T g + (1/2) (x − x^(0))^T H (x − x^(0))
  • where g is the gradient and H is the Hessian at x^(0)
– If we use learning rate ε, the new point x is given by x^(0) − εg. Thus we get
  f(x^(0) − εg) ≈ f(x^(0)) − ε g^T g + (1/2) ε² g^T H g
• There are three terms:
– original value of f,
– expected improvement due to slope, and
– correction to be applied due to curvature
• Solving for the step size that minimizes this approximation (when g^T H g > 0) gives
  ε* = g^T g / (g^T H g)   (see the sketch below)
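A minimal sketch of this optimal step size; the gradient g and Hessian H below are made-up values standing in for those at the current point.

```python
import numpy as np

def optimal_step_size(g, H):
    # epsilon* = g^T g / (g^T H g), valid when the curvature term is positive
    curvature = g @ H @ g
    if curvature <= 0:
        raise ValueError("no finite minimizing step under this quadratic model")
    return (g @ g) / curvature

g = np.array([1.0, -2.0])                       # illustrative gradient
H = np.array([[2.0, 0.0], [0.0, 4.0]])          # illustrative Hessian
print("optimal step size:", optimal_step_size(g, H))
```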
Another 2nd Derivative Method
• Using Taylor's series of f(x) around the current x^(0):
  f(x) ≈ f(x^(0)) + (x − x^(0))^T ∇_x f(x^(0)) + (1/2) (x − x^(0))^T H(f)(x^(0)) (x − x^(0))
  – Solving for the critical point of this function gives
    x* = x^(0) − H(f)(x^(0))^{-1} ∇_x f(x^(0))
  – When f is a quadratic (positive definite) function, this solution jumps directly to the minimum of the function
  – When f is not quadratic, apply the solution iteratively
• Can reach critical point much faster than
gradient descent
– But useful only when nearby point is a minimum
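The sketch below applies this Newton-Raphson iteration to a simple non-quadratic function; the function, starting point, and iteration count are illustrative choices.

```python
import numpy as np

# Illustrative function f(x) = x0**4 + x1**2, with its gradient and Hessian
grad = lambda x: np.array([4 * x[0]**3, 2 * x[1]])
hess = lambda x: np.array([[12 * x[0]**2, 0.0],
                           [0.0,          2.0]])

x = np.array([2.0, 1.0])                        # starting point
for _ in range(10):
    x = x - np.linalg.solve(hess(x), grad(x))   # Newton step: x <- x - H^{-1} grad f
print("estimate of the critical point:", x)     # approaches the minimum at the origin
```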
Two applications of IRLS
• IRLS is applicable to both Linear Regression
and Logistic Regression
• We discuss both, for each we need
1. Model
   Linear Regression: y(x,w) = Σ_{j=0}^{M−1} w_j φ_j(x) = w^T φ(x)
   Logistic Regression: p(C1 | φ) = y(φ) = σ(w^T φ)
2. Objective function E(w)
   • Linear Regression: sum of squared errors E(w) = (1/2) Σ_{n=1}^{N} { t_n − w^T φ(x_n) }²
   • Logistic Regression: Bernoulli likelihood function p(t | w) = Π_{n=1}^{N} y_n^{t_n} (1 − y_n)^{1−t_n}
3. Gradient ∇E(w)
4. Hessian H = ∇∇E(w)
5. Newton-Raphson update w^(new) = w^(old) − H^{-1} ∇E(w)
IRLS for Linear Regression
1. Model:
   y(x,w) = Σ_{j=0}^{M−1} w_j φ_j(x) = w^T φ(x)
2. Error function (sum of squares):
   E(w) = (1/2) Σ_{n=1}^{N} { t_n − w^T φ(x_n) }²   for data set {x_n, t_n}, n = 1,..,N
3. Gradient of the error function:
   ∇E(w) = Σ_{n=1}^{N} (w^T φ_n − t_n) φ_n = Φ^T Φ w − Φ^T t
   where Φ is the N x M design matrix whose nth row is given by φ_n^T
4. Hessian:
   H = ∇∇E(w) = Σ_{n=1}^{N} φ_n φ_n^T = Φ^T Φ
5. Newton-Raphson update: w^(new) = w^(old) − H^{-1} ∇E(w)
   Substituting H = Φ^T Φ and ∇E(w) = Φ^T Φ w − Φ^T t:
   w^(new) = w^(old) − (Φ^T Φ)^{-1} { Φ^T Φ w^(old) − Φ^T t } = (Φ^T Φ)^{-1} Φ^T t
   which is the standard least squares solution
Since the Hessian is independent of w, Newton-Raphson gives the exact solution in a single step (verified in the sketch below)
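A numerical check of this one-step property; the synthetic data and the simple (1, x) basis are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 1, 30)
t = 1.0 + 2.0 * x + rng.normal(scale=0.1, size=x.shape)
Phi = np.column_stack([np.ones_like(x), x])    # design matrix with phi(x) = (1, x)

w = rng.normal(size=2)                         # arbitrary starting weight vector
grad = Phi.T @ Phi @ w - Phi.T @ t             # gradient of the sum-of-squares error
H = Phi.T @ Phi                                # Hessian (independent of w)
w_new = w - np.linalg.solve(H, grad)           # single Newton-Raphson step

w_ls = np.linalg.lstsq(Phi, t, rcond=None)[0]  # standard least-squares solution
print(np.allclose(w_new, w_ls))                # True: one step reaches the solution
```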
IRLS for Logistic Regression
1. Model: p(C1 | φ) = y(φ) = σ(w^T φ)
   Likelihood: p(t | w) = Π_{n=1}^{N} y_n^{t_n} (1 − y_n)^{1−t_n}
   for data set {φ_n, t_n}, t_n ∈ {0,1}, y_n = σ(w^T φ_n)
2. Objective function (negative log-likelihood):
   E(w) = − Σ_{n=1}^{N} { t_n ln y_n + (1 − t_n) ln(1 − y_n) }
   the cross-entropy error function
3. Gradient of the error function:
   ∇E(w) = Σ_{n=1}^{N} (y_n − t_n) φ_n = Φ^T (y − t)
   where Φ is the N x M design matrix whose nth row is given by φ_n^T
4. Hessian:
   H = ∇∇E(w) = Σ_{n=1}^{N} y_n (1 − y_n) φ_n φ_n^T = Φ^T R Φ
   where R is the N x N diagonal matrix with elements R_nn = y_n (1 − y_n) = σ(w^T φ_n)(1 − σ(w^T φ_n))
   The Hessian is not constant and depends on w through R. Since H is positive definite (i.e., for arbitrary u, u^T H u > 0), the error function is a convex function of w and so has a unique minimum
5. Newton-Raphson update: w^(new) = w^(old) − H^{-1} ∇E(w)
   Substituting H = Φ^T R Φ and ∇E(w) = Φ^T (y − t):
   w^(new) = w^(old) − (Φ^T R Φ)^{-1} Φ^T (y − t)
           = (Φ^T R Φ)^{-1} { Φ^T R Φ w^(old) − Φ^T (y − t) }
           = (Φ^T R Φ)^{-1} Φ^T R z
   where z is the N-dimensional vector with elements z = Φ w^(old) − R^{-1} (y − t)
   The update formula is a set of normal equations (for a weighted least-squares problem).
   Since the Hessian depends on w, the update is applied iteratively, each time using the new weight vector (a code sketch follows below)
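A minimal IRLS loop for two-class logistic regression following the update above; the synthetic data, tolerance, and iteration cap are illustrative choices, and the diagonal of R is handled elementwise and clipped for numerical safety.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def irls_logistic(Phi, t, n_iter=20, tol=1e-8):
    # IRLS update: w <- (Phi^T R Phi)^{-1} Phi^T R z,  z = Phi w - R^{-1}(y - t)
    w = np.zeros(Phi.shape[1])
    for _ in range(n_iter):
        y = sigmoid(Phi @ w)
        r = np.clip(y * (1 - y), 1e-10, None)      # diagonal of R, clipped away from 0
        z = Phi @ w - (y - t) / r                  # working response z
        A = Phi.T @ (Phi * r[:, None])             # Phi^T R Phi
        b = Phi.T @ (r * z)                        # Phi^T R z
        w_new = np.linalg.solve(A, b)
        if np.max(np.abs(w_new - w)) < tol:
            return w_new
        w = w_new
    return w

# Illustrative two-class data: 1-D inputs with a bias feature
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-1.0, 1.0, 50), rng.normal(2.0, 1.0, 50)])
t = np.concatenate([np.zeros(50), np.ones(50)])
Phi = np.column_stack([np.ones_like(x), x])
print("w =", irls_logistic(Phi, t))
```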
Initial Weight Vector, Gradient and
Hessian (2-class)
[Figure: numerical values of the initial weight vector, gradient, and Hessian for a two-class example]
Final Weight Vector, Gradient and
Hessian (2-class)
[Figure: numerical values of the final weight vector, gradient, and Hessian for the same example]
Number of iterations : 10
Error (Initial and Final): 15.0642, 1.0000e-009
Hessian in Deep Learning
• The second derivative can be used to
determine whether a critical point is a local
minimum, maximum or a saddle point
– At a critical point, f'(x) = 0
– When f''(x) > 0, the first derivative f'(x) increases as we move to the right and decreases as we move to the left
  • We conclude that x is a local minimum
– For a local maximum, f'(x) = 0 and f''(x) < 0
– When f''(x) = 0 the test is inconclusive: x may be a saddle point or part of a flat region
Multidimensional Second derivative test
• In multiple dimensions, we need to examine
second derivatives of all dimensions
• Eigendecomposition generalizes the test
• Test eigenvalues of Hessian to determine
whether critical point is a local maximum, local
minimum or saddle point
• When H is positive definite (all eigenvalues are
positive) the point is a local minimum
• Similarly negative definite implies a maximum
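A small sketch of this eigenvalue-based test; the Hessian value and the tolerance are illustrative.

```python
import numpy as np

def classify_critical_point(H, tol=1e-10):
    # Classify a critical point from the eigenvalues of its symmetric Hessian
    eig = np.linalg.eigvalsh(H)
    if np.all(eig > tol):
        return "local minimum"        # positive definite
    if np.all(eig < -tol):
        return "local maximum"        # negative definite
    if np.any(eig > tol) and np.any(eig < -tol):
        return "saddle point"         # mixed signs
    return "inconclusive"             # some eigenvalues (near) zero

H = np.array([[2.0, 0.5],
              [0.5, 1.0]])            # illustrative positive definite Hessian
print(classify_critical_point(H))     # -> local minimum
```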
Saddle point
• Contains both positive and negative curvature
• The function is f(x) = x1² − x2²
  – Along the x1 axis, the function curves upwards: this axis is an eigenvector of H with a positive eigenvalue
  – Along the x2 axis, the function curves downwards; this direction is an eigenvector of H with a negative eigenvalue
  – At a saddle point, the eigenvalues of H include both positive and negative values (see the sketch below)
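The sketch below writes down the (constant) Hessian of f(x) = x1² − x2² and confirms the mixed eigenvalue signs at the saddle.

```python
import numpy as np

# f(x) = x1^2 - x2^2 has the constant Hessian diag(2, -2)
H = np.array([[2.0,  0.0],
              [0.0, -2.0]])
eigvals, eigvecs = np.linalg.eigh(H)
print("eigenvalues:", eigvals)                                            # one negative, one positive
print("direction of upward curvature:", eigvecs[:, eigvals.argmax()])     # the x1 axis
print("direction of downward curvature:", eigvecs[:, eigvals.argmin()])   # the x2 axis
```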