
Logistic Regression

Jia Li

Department of Statistics
The Pennsylvania State University

Email: jiali@stat.psu.edu
http://www.stat.psu.edu/~jiali


Logistic Regression
Logistic regression preserves linear classification boundaries.

▶ By the Bayes rule:
  \[ \hat{G}(x) = \arg\max_k \Pr(G = k \mid X = x) . \]
▶ The decision boundary between class k and class l is determined by the equation
  \[ \Pr(G = k \mid X = x) = \Pr(G = l \mid X = x) . \]
▶ Divide both sides by Pr(G = l | X = x) and take the log. The above equation is equivalent to
  \[ \log \frac{\Pr(G = k \mid X = x)}{\Pr(G = l \mid X = x)} = 0 . \]


▶ Since we enforce a linear boundary, we can assume
  \[ \log \frac{\Pr(G = k \mid X = x)}{\Pr(G = l \mid X = x)} = a_0^{(k,l)} + \sum_{j=1}^{p} a_j^{(k,l)} x_j . \]
▶ For logistic regression, there are restrictive relations between the a^{(k,l)} for different pairs (k, l).


Assumptions

\[ \log \frac{\Pr(G = 1 \mid X = x)}{\Pr(G = K \mid X = x)} = \beta_{10} + \beta_1^T x \]
\[ \log \frac{\Pr(G = 2 \mid X = x)}{\Pr(G = K \mid X = x)} = \beta_{20} + \beta_2^T x \]
\[ \qquad \vdots \]
\[ \log \frac{\Pr(G = K-1 \mid X = x)}{\Pr(G = K \mid X = x)} = \beta_{(K-1)0} + \beta_{K-1}^T x \]


▶ For any pair (k, l):
  \[ \log \frac{\Pr(G = k \mid X = x)}{\Pr(G = l \mid X = x)} = \beta_{k0} - \beta_{l0} + (\beta_k - \beta_l)^T x . \]
▶ Number of parameters: (K − 1)(p + 1).
▶ Denote the entire parameter set by
  \[ \theta = \{\beta_{10}, \beta_1, \beta_{20}, \beta_2, \ldots, \beta_{(K-1)0}, \beta_{K-1}\} . \]
▶ The log-ratios of the posterior probabilities are called log-odds or logit transformations.


▶ Under the assumptions, the posterior probabilities are given by
  \[ \Pr(G = k \mid X = x) = \frac{\exp(\beta_{k0} + \beta_k^T x)}{1 + \sum_{l=1}^{K-1} \exp(\beta_{l0} + \beta_l^T x)} \quad \text{for } k = 1, \ldots, K-1 , \]
  \[ \Pr(G = K \mid X = x) = \frac{1}{1 + \sum_{l=1}^{K-1} \exp(\beta_{l0} + \beta_l^T x)} . \]
▶ For Pr(G = k | X = x) given above, obviously:
  • Sum up to 1: \sum_{k=1}^{K} \Pr(G = k \mid X = x) = 1.
  • A simple calculation shows that the assumptions are satisfied (a small numerical sketch follows this list).
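
As a quick illustration, the two formulas above can be evaluated directly. The following is a minimal NumPy sketch; the function name and the example numbers are ours, not from the lecture:

```python
import numpy as np

def posterior_probs(x, intercepts, coefs):
    """Posterior class probabilities under the logistic model.

    intercepts : array of shape (K-1,) holding beta_{k0}.
    coefs      : array of shape (K-1, p) holding beta_k, one row per class 1..K-1.
    Returns K probabilities; the last one is for the reference class K.
    """
    logits = intercepts + coefs @ x            # beta_{k0} + beta_k^T x
    expz = np.exp(logits)
    denom = 1.0 + expz.sum()
    return np.append(expz, 1.0) / denom        # last entry: Pr(G = K | X = x)

# Illustrative numbers only: K = 3 classes, p = 2 inputs.
x = np.array([0.5, -1.0])
intercepts = np.array([0.2, -0.1])
coefs = np.array([[1.0, -0.5],
                  [0.3,  0.8]])
probs = posterior_probs(x, intercepts, coefs)
print(probs, probs.sum())                      # each in (0, 1), and they sum to 1
```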


Comparison with Linear Regression on Indicators

▶ Similarities:
  • Both attempt to estimate Pr(G = k | X = x).
  • Both have linear classification boundaries.
▶ Difference:
  • Linear regression on the indicator matrix approximates Pr(G = k | X = x) by a linear function of x; the estimate is not guaranteed to fall between 0 and 1 or to sum to 1.
  • In logistic regression, Pr(G = k | X = x) is a nonlinear function of x; it is guaranteed to range from 0 to 1 and to sum to 1.


Fitting Logistic Regression Models

▶ Criterion: find the parameters that maximize the conditional likelihood of G given X using the training data.
▶ Denote p_k(x_i; θ) = Pr(G = k | X = x_i; θ).
▶ Given the first input x_1, the posterior probability of its class being g_1 is Pr(G = g_1 | X = x_1).
▶ Since the samples in the training data set are independent, the posterior probability for the N samples each having class g_i, i = 1, 2, ..., N, given their inputs x_1, x_2, ..., x_N, is
  \[ \prod_{i=1}^{N} \Pr(G = g_i \mid X = x_i) . \]


▶ The conditional log-likelihood of the class labels in the training data set is
  \[ L(\theta) = \sum_{i=1}^{N} \log \Pr(G = g_i \mid X = x_i) = \sum_{i=1}^{N} \log p_{g_i}(x_i; \theta) . \]


Binary Classification

▶ For binary classification, if g_i = 1, denote y_i = 1; if g_i = 2, denote y_i = 0.
▶ Let p_1(x; θ) = p(x; θ); then
  \[ p_2(x; \theta) = 1 - p_1(x; \theta) = 1 - p(x; \theta) . \]
▶ Since K = 2, the parameter set is θ = {β_{10}, β_1}. We denote β = (β_{10}, β_1)^T.


▶ If y_i = 1, i.e., g_i = 1,
  \[ \log p_{g_i}(x; \beta) = \log p_1(x; \beta) = 1 \cdot \log p(x; \beta) = y_i \log p(x; \beta) . \]
  If y_i = 0, i.e., g_i = 2,
  \[ \log p_{g_i}(x; \beta) = \log p_2(x; \beta) = 1 \cdot \log(1 - p(x; \beta)) = (1 - y_i) \log(1 - p(x; \beta)) . \]
  Since either y_i = 0 or 1 − y_i = 0, we have
  \[ \log p_{g_i}(x; \beta) = y_i \log p(x; \beta) + (1 - y_i) \log(1 - p(x; \beta)) . \]


▶ The conditional log-likelihood is
  \[ L(\beta) = \sum_{i=1}^{N} \log p_{g_i}(x_i; \beta)
             = \sum_{i=1}^{N} \left[ y_i \log p(x_i; \beta) + (1 - y_i) \log(1 - p(x_i; \beta)) \right] . \]
▶ There are p + 1 parameters in β = (β_{10}, β_1)^T.
▶ Assume a column vector form for β:
  \[ \beta = (\beta_{10}, \beta_{11}, \beta_{12}, \ldots, \beta_{1,p})^T . \]


▶ Here we add the constant term 1 to x to accommodate the intercept:
  \[ x = (1, x_{,1}, x_{,2}, \ldots, x_{,p})^T . \]


▶ By the assumption of the logistic regression model:
  \[ p(x; \beta) = \Pr(G = 1 \mid X = x) = \frac{\exp(\beta^T x)}{1 + \exp(\beta^T x)} , \]
  \[ 1 - p(x; \beta) = \Pr(G = 2 \mid X = x) = \frac{1}{1 + \exp(\beta^T x)} . \]
▶ Substituting the above into L(β):
  \[ L(\beta) = \sum_{i=1}^{N} \left[ y_i \beta^T x_i - \log(1 + e^{\beta^T x_i}) \right] . \]
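
As a sanity check, this log-likelihood is easy to compute directly from a data matrix. The sketch below is ours; np.logaddexp(0, t) evaluates log(1 + e^t) in a numerically stable way:

```python
import numpy as np

def binary_log_likelihood(beta, X, y):
    """L(beta) = sum_i [ y_i beta^T x_i - log(1 + exp(beta^T x_i)) ].

    X : (N, p+1) array whose first column is all ones (the intercept term).
    y : (N,) array of 0/1 labels (1 for class 1, 0 for class 2).
    """
    eta = X @ beta                              # beta^T x_i for every sample
    return np.sum(y * eta - np.logaddexp(0.0, eta))

# Tiny illustrative example (numbers are made up)
X = np.array([[1.0, 0.5], [1.0, -1.2], [1.0, 2.0]])
y = np.array([1.0, 0.0, 1.0])
print(binary_log_likelihood(np.zeros(2), X, y))   # = 3 * log(1/2) at beta = 0
```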


▶ To maximize L(β), we set the first-order partial derivatives of L(β) to zero:
  \[ \frac{\partial L(\beta)}{\partial \beta_{1j}}
     = \sum_{i=1}^{N} y_i x_{ij} - \sum_{i=1}^{N} \frac{x_{ij} e^{\beta^T x_i}}{1 + e^{\beta^T x_i}}
     = \sum_{i=1}^{N} y_i x_{ij} - \sum_{i=1}^{N} p(x_i; \beta) x_{ij}
     = \sum_{i=1}^{N} x_{ij} (y_i - p(x_i; \beta)) \]
  for all j = 0, 1, ..., p.


▶ In matrix form, we write
  \[ \frac{\partial L(\beta)}{\partial \beta} = \sum_{i=1}^{N} x_i (y_i - p(x_i; \beta)) . \]
▶ To solve the set of p + 1 nonlinear equations ∂L(β)/∂β_{1j} = 0, j = 0, 1, ..., p, use the Newton-Raphson algorithm.
▶ The Newton-Raphson algorithm requires the second derivatives, i.e., the Hessian matrix:
  \[ \frac{\partial^2 L(\beta)}{\partial \beta \, \partial \beta^T} = -\sum_{i=1}^{N} x_i x_i^T \, p(x_i; \beta)(1 - p(x_i; \beta)) . \]
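
Both quantities translate into a few lines of NumPy. The following sketch (the function name is ours) assumes X already carries the leading column of ones:

```python
import numpy as np

def gradient_and_hessian(beta, X, y):
    """Gradient and Hessian of the binary log-likelihood at beta.

    X : (N, p+1) design matrix with a leading column of ones.
    y : (N,) array of 0/1 labels.
    """
    p = 1.0 / (1.0 + np.exp(-(X @ beta)))        # p(x_i; beta)
    grad = X.T @ (y - p)                         # sum_i x_i (y_i - p_i)
    w = p * (1.0 - p)                            # p_i (1 - p_i)
    hess = -(X * w[:, None]).T @ X               # -sum_i x_i x_i^T p_i (1 - p_i)
    return grad, hess
```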


▶ The element in the jth row and nth column is (counting from 0):
  \[ \frac{\partial^2 L(\beta)}{\partial \beta_{1j} \, \partial \beta_{1n}}
     = -\sum_{i=1}^{N} \frac{(1 + e^{\beta^T x_i}) e^{\beta^T x_i} x_{ij} x_{in} - (e^{\beta^T x_i})^2 x_{ij} x_{in}}{(1 + e^{\beta^T x_i})^2}
     = -\sum_{i=1}^{N} \left[ x_{ij} x_{in} p(x_i; \beta) - x_{ij} x_{in} p(x_i; \beta)^2 \right]
     = -\sum_{i=1}^{N} x_{ij} x_{in} \, p(x_i; \beta)(1 - p(x_i; \beta)) . \]


▶ Starting with β^{old}, a single Newton-Raphson update is
  \[ \beta^{new} = \beta^{old} - \left( \frac{\partial^2 L(\beta)}{\partial \beta \, \partial \beta^T} \right)^{-1} \frac{\partial L(\beta)}{\partial \beta} , \]
  where the derivatives are evaluated at β^{old}.


▶ The iteration can be expressed compactly in matrix form.
  • Let y be the column vector of the y_i.
  • Let X be the N × (p + 1) input matrix.
  • Let p be the N-vector of fitted probabilities with ith element p(x_i; β^{old}).
  • Let W be an N × N diagonal matrix of weights with ith diagonal element p(x_i; β^{old})(1 − p(x_i; β^{old})).
▶ Then
  \[ \frac{\partial L(\beta)}{\partial \beta} = \mathbf{X}^T (\mathbf{y} - \mathbf{p}) , \qquad
     \frac{\partial^2 L(\beta)}{\partial \beta \, \partial \beta^T} = -\mathbf{X}^T \mathbf{W} \mathbf{X} . \]


▶ The Newton-Raphson step is
  \[ \beta^{new} = \beta^{old} + (\mathbf{X}^T \mathbf{W} \mathbf{X})^{-1} \mathbf{X}^T (\mathbf{y} - \mathbf{p})
                 = (\mathbf{X}^T \mathbf{W} \mathbf{X})^{-1} \mathbf{X}^T \mathbf{W} \left( \mathbf{X} \beta^{old} + \mathbf{W}^{-1} (\mathbf{y} - \mathbf{p}) \right)
                 = (\mathbf{X}^T \mathbf{W} \mathbf{X})^{-1} \mathbf{X}^T \mathbf{W} \mathbf{z} , \]
  where z ≜ Xβ^{old} + W^{-1}(y − p).
▶ If z is viewed as a response and X as the input matrix, β^{new} is the solution to the weighted least squares problem
  \[ \beta^{new} \leftarrow \arg\min_{\beta} (\mathbf{z} - \mathbf{X}\beta)^T \mathbf{W} (\mathbf{z} - \mathbf{X}\beta) . \]
▶ Recall that linear regression by least squares solves
  \[ \arg\min_{\beta} (\mathbf{z} - \mathbf{X}\beta)^T (\mathbf{z} - \mathbf{X}\beta) . \]
▶ z is referred to as the adjusted response.
▶ The algorithm is referred to as iteratively reweighted least squares, or IRLS.

Pseudo Code

1. 0 → β
2. Compute y by setting its elements to
   \[ y_i = \begin{cases} 1 & \text{if } g_i = 1 \\ 0 & \text{if } g_i = 2 \end{cases} , \qquad i = 1, 2, \ldots, N. \]
3. Compute p by setting its elements to
   \[ p(x_i; \beta) = \frac{e^{\beta^T x_i}}{1 + e^{\beta^T x_i}} , \qquad i = 1, 2, \ldots, N. \]
4. Compute the diagonal matrix W. Its ith diagonal element is p(x_i; β)(1 − p(x_i; β)), i = 1, 2, ..., N.
5. z ← Xβ + W^{-1}(y − p).
6. β ← (X^T W X)^{-1} X^T W z.
7. If the stopping criterion is met, stop; otherwise go back to step 3.
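
The pseudo code above translates almost line by line into NumPy. Here is a minimal sketch of ours that follows it literally, including an explicit diagonal W; the convergence test on the change in β is our own choice, since the stopping criterion is left unspecified:

```python
import numpy as np

def irls_binary(X, g, max_iter=100, tol=1e-8):
    """Binary logistic regression fitted by IRLS, following the pseudo code.

    X : (N, p+1) design matrix with a leading column of ones.
    g : (N,) array of class labels in {1, 2}.
    """
    beta = np.zeros(X.shape[1])                    # step 1: 0 -> beta
    y = (g == 1).astype(float)                     # step 2: y_i = 1 if g_i = 1, else 0
    for _ in range(max_iter):
        eta = X @ beta
        p = 1.0 / (1.0 + np.exp(-eta))             # step 3: fitted probabilities
        W = np.diag(p * (1.0 - p))                 # step 4: N x N diagonal weight matrix
        z = eta + np.linalg.solve(W, y - p)        # step 5: z = X beta + W^{-1}(y - p)
        beta_new = np.linalg.solve(X.T @ W @ X, X.T @ W @ z)   # step 6
        if np.max(np.abs(beta_new - beta)) < tol:  # step 7: stopping criterion (assumed)
            return beta_new
        beta = beta_new
    return beta
```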

Computational Efficiency

▶ Since W is an N × N diagonal matrix, direct matrix operations with it may be very inefficient.
▶ A modified pseudo code is provided next.


1. 0 → β
2. Compute y by setting its elements to
   \[ y_i = \begin{cases} 1 & \text{if } g_i = 1 \\ 0 & \text{if } g_i = 2 \end{cases} , \qquad i = 1, 2, \ldots, N. \]
3. Compute p by setting its elements to
   \[ p(x_i; \beta) = \frac{e^{\beta^T x_i}}{1 + e^{\beta^T x_i}} , \qquad i = 1, 2, \ldots, N. \]
4. Compute the N × (p + 1) matrix X̃ by multiplying the ith row of X by p(x_i; β)(1 − p(x_i; β)), i = 1, 2, ..., N:
   \[ \mathbf{X} = \begin{pmatrix} x_1^T \\ x_2^T \\ \vdots \\ x_N^T \end{pmatrix} \qquad
      \tilde{\mathbf{X}} = \begin{pmatrix} p(x_1; \beta)(1 - p(x_1; \beta)) x_1^T \\ p(x_2; \beta)(1 - p(x_2; \beta)) x_2^T \\ \vdots \\ p(x_N; \beta)(1 - p(x_N; \beta)) x_N^T \end{pmatrix} \]
5. β ← β + (X^T X̃)^{-1} X^T (y − p).
6. If the stopping criterion is met, stop; otherwise go back to step 3.
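
The same loop can then be written without ever forming the N × N matrix W; only the elementwise weights p_i(1 − p_i) are needed. Again a sketch with our own names and stopping rule:

```python
import numpy as np

def irls_binary_fast(X, g, max_iter=100, tol=1e-8):
    """IRLS for binary logistic regression without forming the N x N matrix W."""
    beta = np.zeros(X.shape[1])                   # step 1
    y = (g == 1).astype(float)                    # step 2
    for _ in range(max_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))     # step 3
        Xt = X * (p * (1.0 - p))[:, None]         # step 4: X-tilde, rows of X scaled by p_i(1 - p_i)
        step = np.linalg.solve(X.T @ Xt, X.T @ (y - p))
        beta = beta + step                        # step 5: beta <- beta + (X^T X~)^{-1} X^T (y - p)
        if np.max(np.abs(step)) < tol:            # step 6: stopping criterion (assumed)
            break
    return beta
```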

Example

Diabetes data set

▶ The input X is two-dimensional. X1 and X2 are the two principal components of the original 8 variables.
▶ Class 1: without diabetes; Class 2: with diabetes.
▶ Applying logistic regression, we obtain
  \[ \beta = (0.7679, -0.6816, -0.3664)^T . \]


▶ The posterior probabilities are:
  \[ \Pr(G = 1 \mid X = x) = \frac{e^{0.7679 - 0.6816 X_1 - 0.3664 X_2}}{1 + e^{0.7679 - 0.6816 X_1 - 0.3664 X_2}} , \qquad
     \Pr(G = 2 \mid X = x) = \frac{1}{1 + e^{0.7679 - 0.6816 X_1 - 0.3664 X_2}} . \]
▶ The classification rule is:
  \[ \hat{G}(x) = \begin{cases} 1 & 0.7679 - 0.6816 X_1 - 0.3664 X_2 \ge 0 \\ 2 & 0.7679 - 0.6816 X_1 - 0.3664 X_2 < 0 \end{cases} \]
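
In code, the fitted rule is just a sign check on the linear function; the helper below and the query point are ours, for illustration only:

```python
import numpy as np

beta = np.array([0.7679, -0.6816, -0.3664])    # fitted (intercept, X1, X2) coefficients

def classify(x1, x2):
    """Return (predicted class, Pr(G = 1 | X = x)) for a point (x1, x2)."""
    eta = beta @ np.array([1.0, x1, x2])       # 0.7679 - 0.6816*x1 - 0.3664*x2
    p1 = 1.0 / (1.0 + np.exp(-eta))            # Pr(G = 1 | X = x)
    return (1 if eta >= 0 else 2), p1

print(classify(0.0, 0.0))                      # eta = 0.7679 >= 0, so class 1
```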


[Figure] Solid line: decision boundary obtained by logistic regression. Dashed line: decision boundary obtained by LDA.

▶ Within the training data set, the classification error rate is 28.12%.
▶ Sensitivity: 45.9%.
▶ Specificity: 85.8%.


Multiclass Case (K ≥ 3)

▶ When K ≥ 3, β is a (K − 1)(p + 1)-vector:
  \[ \beta = \begin{pmatrix} \beta_{10} \\ \beta_1 \\ \beta_{20} \\ \beta_2 \\ \vdots \\ \beta_{(K-1)0} \\ \beta_{K-1} \end{pmatrix}
           = \begin{pmatrix} \beta_{10} \\ \beta_{11} \\ \vdots \\ \beta_{1p} \\ \beta_{20} \\ \vdots \\ \beta_{2p} \\ \vdots \\ \beta_{(K-1)0} \\ \vdots \\ \beta_{(K-1)p} \end{pmatrix} \]


 
▶ Let \bar{\beta}_l = \begin{pmatrix} \beta_{l0} \\ \beta_l \end{pmatrix} (with \bar{\beta}_K \equiv 0 for the reference class K, consistent with the posterior probabilities given earlier).
▶ The log-likelihood function becomes
  \[ L(\beta) = \sum_{i=1}^{N} \log p_{g_i}(x_i; \beta)
             = \sum_{i=1}^{N} \log \left( \frac{e^{\bar{\beta}_{g_i}^T x_i}}{1 + \sum_{l=1}^{K-1} e^{\bar{\beta}_l^T x_i}} \right)
             = \sum_{i=1}^{N} \left[ \bar{\beta}_{g_i}^T x_i - \log \left( 1 + \sum_{l=1}^{K-1} e^{\bar{\beta}_l^T x_i} \right) \right] \]


▶ Note: the indicator function I(·) equals 1 when the argument is true and 0 otherwise.
▶ First-order derivatives:
  \[ \frac{\partial L(\beta)}{\partial \beta_{kj}}
     = \sum_{i=1}^{N} \left[ I(g_i = k) x_{ij} - \frac{e^{\bar{\beta}_k^T x_i} x_{ij}}{1 + \sum_{l=1}^{K-1} e^{\bar{\beta}_l^T x_i}} \right]
     = \sum_{i=1}^{N} x_{ij} \left( I(g_i = k) - p_k(x_i; \beta) \right) \]


▶ Second-order derivatives:
  \[ \frac{\partial^2 L(\beta)}{\partial \beta_{kj} \, \partial \beta_{mn}}
     = \sum_{i=1}^{N} x_{ij} \cdot \frac{1}{\left( 1 + \sum_{l=1}^{K-1} e^{\bar{\beta}_l^T x_i} \right)^2}
       \left[ -e^{\bar{\beta}_k^T x_i} I(k = m) x_{in} \left( 1 + \sum_{l=1}^{K-1} e^{\bar{\beta}_l^T x_i} \right) + e^{\bar{\beta}_k^T x_i} e^{\bar{\beta}_m^T x_i} x_{in} \right] \]
  \[ = \sum_{i=1}^{N} x_{ij} x_{in} \left( -p_k(x_i; \beta) I(k = m) + p_k(x_i; \beta) p_m(x_i; \beta) \right)
     = -\sum_{i=1}^{N} x_{ij} x_{in} \, p_k(x_i; \beta) \left[ I(k = m) - p_m(x_i; \beta) \right] . \]


▶ Matrix form:
  • y is the concatenated indicator vector of dimension N(K − 1):
    \[ \mathbf{y} = \begin{pmatrix} \mathbf{y}_1 \\ \mathbf{y}_2 \\ \vdots \\ \mathbf{y}_{K-1} \end{pmatrix} , \qquad
       \mathbf{y}_k = \begin{pmatrix} I(g_1 = k) \\ I(g_2 = k) \\ \vdots \\ I(g_N = k) \end{pmatrix} , \qquad 1 \le k \le K - 1 . \]
  • p is the concatenated vector of fitted probabilities of dimension N(K − 1):
    \[ \mathbf{p} = \begin{pmatrix} \mathbf{p}_1 \\ \mathbf{p}_2 \\ \vdots \\ \mathbf{p}_{K-1} \end{pmatrix} , \qquad
       \mathbf{p}_k = \begin{pmatrix} p_k(x_1; \beta) \\ p_k(x_2; \beta) \\ \vdots \\ p_k(x_N; \beta) \end{pmatrix} , \qquad 1 \le k \le K - 1 . \]

▶ X̃ is an N(K − 1) × (p + 1)(K − 1) block-diagonal matrix:
  \[ \tilde{\mathbf{X}} = \begin{pmatrix} \mathbf{X} & 0 & \cdots & 0 \\ 0 & \mathbf{X} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \mathbf{X} \end{pmatrix} \]


▶ The matrix W is an N(K − 1) × N(K − 1) square matrix:
  \[ \mathbf{W} = \begin{pmatrix} \mathbf{W}_{11} & \mathbf{W}_{12} & \cdots & \mathbf{W}_{1(K-1)} \\ \mathbf{W}_{21} & \mathbf{W}_{22} & \cdots & \mathbf{W}_{2(K-1)} \\ \vdots & \vdots & \ddots & \vdots \\ \mathbf{W}_{(K-1),1} & \mathbf{W}_{(K-1),2} & \cdots & \mathbf{W}_{(K-1),(K-1)} \end{pmatrix} \]
▶ Each submatrix W_{km}, 1 ≤ k, m ≤ K − 1, is an N × N diagonal matrix.
  • When k = m, the ith diagonal element of W_{kk} is p_k(x_i; β^{old})(1 − p_k(x_i; β^{old})).
  • When k ≠ m, the ith diagonal element of W_{km} is −p_k(x_i; β^{old}) p_m(x_i; β^{old}).


▶ As in binary classification,
  \[ \frac{\partial L(\beta)}{\partial \beta} = \tilde{\mathbf{X}}^T (\mathbf{y} - \mathbf{p}) , \qquad
     \frac{\partial^2 L(\beta)}{\partial \beta \, \partial \beta^T} = -\tilde{\mathbf{X}}^T \mathbf{W} \tilde{\mathbf{X}} . \]
▶ The formula for updating β^{new} in the binary classification case carries over to the multiclass case:
  \[ \beta^{new} = (\tilde{\mathbf{X}}^T \mathbf{W} \tilde{\mathbf{X}})^{-1} \tilde{\mathbf{X}}^T \mathbf{W} \mathbf{z} , \]
  where z ≜ X̃β^{old} + W^{-1}(y − p). Or simply:
  \[ \beta^{new} = \beta^{old} + (\tilde{\mathbf{X}}^T \mathbf{W} \tilde{\mathbf{X}})^{-1} \tilde{\mathbf{X}}^T (\mathbf{y} - \mathbf{p}) . \]
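
For concreteness, one deliberately unoptimized Newton step coded directly from these block definitions is sketched below; the function and variable names are ours, and the full W is assembled explicitly only to mirror the formulas (in practice one would exploit its diagonal-block structure):

```python
import numpy as np

def multiclass_newton_step(B, X, g, K):
    """One Newton-Raphson update beta_new = beta_old + (Xt^T W Xt)^{-1} Xt^T (y - p).

    B : (K-1, p+1) array; row k holds (beta_{k0}, beta_k) for class k+1.
    X : (N, p+1) design matrix with a leading column of ones.
    g : (N,) array of class labels in {1, ..., K}; class K is the reference.
    """
    N, d = X.shape
    Km1 = K - 1
    # Fitted probabilities p_k(x_i; beta), an (N, K-1) array.
    expeta = np.exp(X @ B.T)                                   # e^{beta_bar_k^T x_i}
    P = expeta / (1.0 + expeta.sum(axis=1, keepdims=True))
    # Stacked indicator vector y and probability vector p, each of length N(K-1).
    Y = np.stack([(g == k + 1).astype(float) for k in range(Km1)], axis=1)
    y = Y.T.reshape(-1)                                        # block order: class 1, ..., class K-1
    p = P.T.reshape(-1)
    # Block-diagonal X-tilde of size N(K-1) x (p+1)(K-1).
    Xt = np.kron(np.eye(Km1), X)
    # W: diagonal blocks p_k(1 - p_k), off-diagonal blocks -p_k p_m.
    W = np.zeros((N * Km1, N * Km1))
    for k in range(Km1):
        for m in range(Km1):
            wkm = P[:, k] * ((1.0 if k == m else 0.0) - P[:, m])
            W[k*N:(k+1)*N, m*N:(m+1)*N] = np.diag(wkm)
    step = np.linalg.solve(Xt.T @ W @ Xt, Xt.T @ (y - p))
    return B + step.reshape(Km1, d)
```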


Computation Issues

▶ Initialization: one option is to use β = 0.
▶ Convergence is not guaranteed, but it usually occurs.
▶ Usually the log-likelihood increases after each iteration, but overshooting can occur.
▶ In the rare cases where the log-likelihood decreases, cut the step size by half (a sketch of this safeguard follows below).
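
A minimal version of Newton-Raphson with this step-halving safeguard might look as follows (binary case; the names, the halving limit, and the iteration cap are our own choices, not part of the lecture):

```python
import numpy as np

def log_likelihood(beta, X, y):
    """L(beta) = sum_i [ y_i beta^T x_i - log(1 + e^{beta^T x_i}) ]."""
    eta = X @ beta
    return np.sum(y * eta - np.logaddexp(0.0, eta))

def newton_with_step_halving(X, g, max_iter=50, max_halvings=10):
    """Newton-Raphson that halves the step whenever the log-likelihood would decrease."""
    beta = np.zeros(X.shape[1])                       # initialization: beta = 0
    y = (g == 1).astype(float)
    for _ in range(max_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))
        grad = X.T @ (y - p)
        hess = -(X * (p * (1.0 - p))[:, None]).T @ X
        step = np.linalg.solve(-hess, grad)           # full Newton step
        t = 1.0
        for _ in range(max_halvings):                 # cut the step size by half if L decreases
            if log_likelihood(beta + t * step, X, y) >= log_likelihood(beta, X, y):
                break
            t *= 0.5
        beta = beta + t * step
    return beta
```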


Connection with LDA


▶ Under the model of LDA:
  \[ \log \frac{\Pr(G = k \mid X = x)}{\Pr(G = K \mid X = x)}
     = \log \frac{\pi_k}{\pi_K} - \frac{1}{2} (\mu_k + \mu_K)^T \Sigma^{-1} (\mu_k - \mu_K) + x^T \Sigma^{-1} (\mu_k - \mu_K)
     = a_{k0} + a_k^T x . \]
  (The mapping to (a_{k0}, a_k) is coded in the sketch after this list.)
▶ The model of LDA satisfies the assumption of the linear logistic model.
▶ The linear logistic model only specifies the conditional distribution Pr(G = k | X = x). No assumption is made about Pr(X).
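
The mapping from the LDA parameters (π_k, μ_k, Σ) to the logistic coefficients (a_{k0}, a_k) in the first bullet above is mechanical; a small sketch of ours:

```python
import numpy as np

def lda_logit_coefficients(pi_k, pi_K, mu_k, mu_K, Sigma):
    """Coefficients (a_k0, a_k) of the LDA log-odds log Pr(G=k|x)/Pr(G=K|x) = a_k0 + a_k^T x."""
    Sigma_inv = np.linalg.inv(Sigma)
    a_k = Sigma_inv @ (mu_k - mu_K)                                     # slope vector
    a_k0 = np.log(pi_k / pi_K) - 0.5 * (mu_k + mu_K) @ Sigma_inv @ (mu_k - mu_K)
    return a_k0, a_k
```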

▶ The LDA model specifies the joint distribution of X and G. Pr(X) is a mixture of Gaussians:
  \[ \Pr(X) = \sum_{k=1}^{K} \pi_k \, \phi(X; \mu_k, \Sigma) , \]
  where φ is the Gaussian density function.
▶ Linear logistic regression maximizes the conditional likelihood of G given X: Pr(G = k | X = x).
▶ LDA maximizes the joint likelihood of G and X: Pr(X = x, G = k).


▶ If the additional assumption made by LDA is appropriate, LDA tends to estimate the parameters more efficiently by using more information about the data.
▶ Samples without class labels can be used under the model of LDA.
▶ LDA is not robust to gross outliers.
▶ Because logistic regression relies on fewer assumptions, it seems to be more robust.
▶ In practice, logistic regression and LDA often give similar results.


Simulation

▶ Assume the input X is one-dimensional.
▶ The two classes have equal priors, and the class-conditional densities of X are shifted versions of each other.
▶ Each conditional density is a mixture of two normals:
  • Class 1 (red): 0.6 N(−2, 1/4) + 0.4 N(0, 1).
  • Class 2 (blue): 0.6 N(0, 1/4) + 0.4 N(2, 1).
▶ The class-conditional densities are shown below.

[Figure: the two class-conditional densities.]

LDA Result

▶ Training data set: 2000 samples for each class.
▶ Test data set: 1000 samples for each class.
▶ The estimates by LDA: μ̂1 = −1.1948, μ̂2 = 0.8224, σ̂² = 1.5268. The boundary value between the two classes is (μ̂1 + μ̂2)/2 = −0.1862.
▶ The classification error rate on the test data is 0.2315.
▶ Based on the true distribution, the Bayes (optimal) boundary value between the two classes is −0.7750, and the error rate is 0.1765.


Logistic Regression Result

▶ Linear logistic regression obtains
  \[ \beta = (-0.3288, -1.3275)^T . \]
  The boundary value satisfies −0.3288 − 1.3275 X = 0, hence equals −0.2477.
▶ The error rate on the test data set is 0.2205.
▶ The estimated posterior probability is:
  \[ \Pr(G = 1 \mid X = x) = \frac{e^{-0.3288 - 1.3275 x}}{1 + e^{-0.3288 - 1.3275 x}} . \]
  (A sketch reproducing this simulation appears below.)
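
The whole experiment is easy to rerun. The sketch below draws training and test sets from the stated mixtures, fits LDA (pooled variance, boundary at the midpoint of the class means) and logistic regression (by the IRLS loop from earlier), and prints the boundaries and test error rates. The exact numbers will differ somewhat from the slides, since they depend on the random draw:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_class(n, shift):
    """Draw n points from 0.6 N(shift - 2, 1/4) + 0.4 N(shift, 1)."""
    comp = rng.random(n) < 0.6
    return np.where(comp, rng.normal(shift - 2.0, 0.5, n), rng.normal(shift, 1.0, n))

def make_set(n_per_class):
    x = np.concatenate([sample_class(n_per_class, 0.0), sample_class(n_per_class, 2.0)])
    g = np.concatenate([np.ones(n_per_class), 2 * np.ones(n_per_class)])
    return x, g

x_tr, g_tr = make_set(2000)
x_te, g_te = make_set(1000)

# LDA in 1-D with equal priors: boundary at the midpoint of the class means.
m1, m2 = x_tr[g_tr == 1].mean(), x_tr[g_tr == 2].mean()
lda_boundary = 0.5 * (m1 + m2)
lda_err = np.mean(np.where(x_te < lda_boundary, 1, 2) != g_te)

# Logistic regression fitted by IRLS.
X_tr = np.column_stack([np.ones_like(x_tr), x_tr])
beta = np.zeros(2)
for _ in range(50):
    p = 1.0 / (1.0 + np.exp(-(X_tr @ beta)))
    w = p * (1.0 - p)
    beta += np.linalg.solve((X_tr * w[:, None]).T @ X_tr, X_tr.T @ ((g_tr == 1) - p))
logit_boundary = -beta[0] / beta[1]             # solves beta0 + beta1 * x = 0
pred = np.where(beta[0] + beta[1] * x_te >= 0, 1, 2)
logit_err = np.mean(pred != g_te)

print("LDA boundary %.4f, test error %.4f" % (lda_boundary, lda_err))
print("Logistic boundary %.4f, test error %.4f" % (logit_boundary, logit_err))
```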


The estimated posterior probability Pr(G = 1 | X = x) and its true value based on the true distribution are compared in the figure below.

[Figure: estimated vs. true posterior probability Pr(G = 1 | X = x).]

