
Logistic Regression

Jia Li

Department of Statistics
The Pennsylvania State University

Email: jiali@stat.psu.edu
http://www.stat.psu.edu/~jiali


Logistic Regression
Logistic regression preserves linear classification boundaries.

▶ By the Bayes rule:
  \[ \hat{G}(x) = \arg\max_k \Pr(G = k \mid X = x) . \]
▶ The decision boundary between class k and class l is determined by the equation
  \[ \Pr(G = k \mid X = x) = \Pr(G = l \mid X = x) . \]
▶ Divide both sides by Pr(G = l | X = x) and take the log. The above equation is equivalent to
  \[ \log \frac{\Pr(G = k \mid X = x)}{\Pr(G = l \mid X = x)} = 0 . \]


▶ Since we enforce a linear boundary, we can assume
  \[ \log \frac{\Pr(G = k \mid X = x)}{\Pr(G = l \mid X = x)} = a_0^{(k,l)} + \sum_{j=1}^{p} a_j^{(k,l)} x_j . \]
▶ For logistic regression, there are restrictive relations between the a^{(k,l)} for different pairs (k, l).


Assumptions

\[ \log \frac{\Pr(G = 1 \mid X = x)}{\Pr(G = K \mid X = x)} = \beta_{10} + \beta_1^T x \]
\[ \log \frac{\Pr(G = 2 \mid X = x)}{\Pr(G = K \mid X = x)} = \beta_{20} + \beta_2^T x \]
\[ \qquad \vdots \]
\[ \log \frac{\Pr(G = K-1 \mid X = x)}{\Pr(G = K \mid X = x)} = \beta_{(K-1)0} + \beta_{K-1}^T x \]


▶ For any pair (k, l):
  \[ \log \frac{\Pr(G = k \mid X = x)}{\Pr(G = l \mid X = x)} = \beta_{k0} - \beta_{l0} + (\beta_k - \beta_l)^T x . \]
▶ Number of parameters: (K − 1)(p + 1).
▶ Denote the entire parameter set by
  \[ \theta = \{\beta_{10}, \beta_1, \beta_{20}, \beta_2, \ldots, \beta_{(K-1)0}, \beta_{K-1}\} . \]
▶ The log-ratios of the posterior probabilities are called log-odds or logit transformations.


▶ Under the assumptions, the posterior probabilities are given by
  \[ \Pr(G = k \mid X = x) = \frac{\exp(\beta_{k0} + \beta_k^T x)}{1 + \sum_{l=1}^{K-1} \exp(\beta_{l0} + \beta_l^T x)} \quad \text{for } k = 1, \ldots, K-1 , \]
  \[ \Pr(G = K \mid X = x) = \frac{1}{1 + \sum_{l=1}^{K-1} \exp(\beta_{l0} + \beta_l^T x)} . \]
▶ For Pr(G = k | X = x) given above, obviously:
  • Sum up to 1: \sum_{k=1}^{K} \Pr(G = k \mid X = x) = 1.
  • A simple calculation shows that the assumptions are satisfied (a small numerical sketch follows this list).
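
As a quick illustration, the two formulas above can be evaluated directly. The following is a minimal NumPy sketch; the function name and the example numbers are ours, not from the lecture:

```python
import numpy as np

def posterior_probs(x, intercepts, coefs):
    """Posterior class probabilities under the logistic model.

    intercepts : array of shape (K-1,) holding beta_{k0}.
    coefs      : array of shape (K-1, p) holding beta_k, one row per class 1..K-1.
    Returns K probabilities; the last one is for the reference class K.
    """
    logits = intercepts + coefs @ x            # beta_{k0} + beta_k^T x
    expz = np.exp(logits)
    denom = 1.0 + expz.sum()
    return np.append(expz, 1.0) / denom        # last entry: Pr(G = K | X = x)

# Illustrative numbers only: K = 3 classes, p = 2 inputs.
x = np.array([0.5, -1.0])
intercepts = np.array([0.2, -0.1])
coefs = np.array([[1.0, -0.5],
                  [0.3,  0.8]])
probs = posterior_probs(x, intercepts, coefs)
print(probs, probs.sum())                      # each in (0, 1), and they sum to 1
```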


Comparison with Linear Regression on Indicators

▶ Similarities:
  • Both attempt to estimate Pr(G = k | X = x).
  • Both have linear classification boundaries.
▶ Difference:
  • Linear regression on the indicator matrix approximates Pr(G = k | X = x) by a linear function of x; the estimate is not guaranteed to fall between 0 and 1 or to sum to 1.
  • In logistic regression, Pr(G = k | X = x) is a nonlinear function of x; it is guaranteed to range from 0 to 1 and to sum to 1.


Fitting Logistic Regression Models

▶ Criterion: find the parameters that maximize the conditional likelihood of G given X using the training data.
▶ Denote p_k(x_i; θ) = Pr(G = k | X = x_i; θ).
▶ Given the first input x_1, the posterior probability of its class being g_1 is Pr(G = g_1 | X = x_1).
▶ Since the samples in the training data set are independent, the posterior probability for the N samples each having class g_i, i = 1, 2, ..., N, given their inputs x_1, x_2, ..., x_N, is
  \[ \prod_{i=1}^{N} \Pr(G = g_i \mid X = x_i) . \]


▶ The conditional log-likelihood of the class labels in the training data set is
  \[ L(\theta) = \sum_{i=1}^{N} \log \Pr(G = g_i \mid X = x_i) = \sum_{i=1}^{N} \log p_{g_i}(x_i; \theta) . \]


Binary Classification

▶ For binary classification, if g_i = 1, denote y_i = 1; if g_i = 2, denote y_i = 0.
▶ Let p_1(x; θ) = p(x; θ); then
  \[ p_2(x; \theta) = 1 - p_1(x; \theta) = 1 - p(x; \theta) . \]
▶ Since K = 2, the parameter set is θ = {β_{10}, β_1}. We denote β = (β_{10}, β_1)^T.


▶ If y_i = 1, i.e., g_i = 1,
  \[ \log p_{g_i}(x; \beta) = \log p_1(x; \beta) = 1 \cdot \log p(x; \beta) = y_i \log p(x; \beta) . \]
  If y_i = 0, i.e., g_i = 2,
  \[ \log p_{g_i}(x; \beta) = \log p_2(x; \beta) = 1 \cdot \log(1 - p(x; \beta)) = (1 - y_i) \log(1 - p(x; \beta)) . \]
  Since either y_i = 0 or 1 − y_i = 0, we have
  \[ \log p_{g_i}(x; \beta) = y_i \log p(x; \beta) + (1 - y_i) \log(1 - p(x; \beta)) . \]


▶ The conditional log-likelihood is
  \[ L(\beta) = \sum_{i=1}^{N} \log p_{g_i}(x_i; \beta)
             = \sum_{i=1}^{N} \left[ y_i \log p(x_i; \beta) + (1 - y_i) \log(1 - p(x_i; \beta)) \right] . \]
▶ There are p + 1 parameters in β = (β_{10}, β_1)^T.
▶ Assume a column vector form for β:
  \[ \beta = (\beta_{10}, \beta_{11}, \beta_{12}, \ldots, \beta_{1,p})^T . \]


▶ Here we add the constant term 1 to x to accommodate the intercept:
  \[ x = (1, x_{,1}, x_{,2}, \ldots, x_{,p})^T . \]


▶ By the assumption of the logistic regression model:
  \[ p(x; \beta) = \Pr(G = 1 \mid X = x) = \frac{\exp(\beta^T x)}{1 + \exp(\beta^T x)} , \]
  \[ 1 - p(x; \beta) = \Pr(G = 2 \mid X = x) = \frac{1}{1 + \exp(\beta^T x)} . \]
▶ Substituting the above into L(β):
  \[ L(\beta) = \sum_{i=1}^{N} \left[ y_i \beta^T x_i - \log(1 + e^{\beta^T x_i}) \right] . \]
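
As a sanity check, this log-likelihood is easy to compute directly from a data matrix. The sketch below is ours; np.logaddexp(0, t) evaluates log(1 + e^t) in a numerically stable way:

```python
import numpy as np

def binary_log_likelihood(beta, X, y):
    """L(beta) = sum_i [ y_i beta^T x_i - log(1 + exp(beta^T x_i)) ].

    X : (N, p+1) array whose first column is all ones (the intercept term).
    y : (N,) array of 0/1 labels (1 for class 1, 0 for class 2).
    """
    eta = X @ beta                              # beta^T x_i for every sample
    return np.sum(y * eta - np.logaddexp(0.0, eta))

# Tiny illustrative example (numbers are made up)
X = np.array([[1.0, 0.5], [1.0, -1.2], [1.0, 2.0]])
y = np.array([1.0, 0.0, 1.0])
print(binary_log_likelihood(np.zeros(2), X, y))   # = 3 * log(1/2) at beta = 0
```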


▶ To maximize L(β), we set the first-order partial derivatives of L(β) to zero:
  \[ \frac{\partial L(\beta)}{\partial \beta_{1j}}
     = \sum_{i=1}^{N} y_i x_{ij} - \sum_{i=1}^{N} \frac{x_{ij} e^{\beta^T x_i}}{1 + e^{\beta^T x_i}}
     = \sum_{i=1}^{N} y_i x_{ij} - \sum_{i=1}^{N} p(x_i; \beta) x_{ij}
     = \sum_{i=1}^{N} x_{ij} (y_i - p(x_i; \beta)) \]
  for all j = 0, 1, ..., p.


▶ In matrix form, we write
  \[ \frac{\partial L(\beta)}{\partial \beta} = \sum_{i=1}^{N} x_i (y_i - p(x_i; \beta)) . \]
▶ To solve the set of p + 1 nonlinear equations ∂L(β)/∂β_{1j} = 0, j = 0, 1, ..., p, use the Newton-Raphson algorithm.
▶ The Newton-Raphson algorithm requires the second derivatives, i.e., the Hessian matrix:
  \[ \frac{\partial^2 L(\beta)}{\partial \beta \, \partial \beta^T} = -\sum_{i=1}^{N} x_i x_i^T \, p(x_i; \beta)(1 - p(x_i; \beta)) . \]
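
Both quantities translate into a few lines of NumPy. The following sketch (the function name is ours) assumes X already carries the leading column of ones:

```python
import numpy as np

def gradient_and_hessian(beta, X, y):
    """Gradient and Hessian of the binary log-likelihood at beta.

    X : (N, p+1) design matrix with a leading column of ones.
    y : (N,) array of 0/1 labels.
    """
    p = 1.0 / (1.0 + np.exp(-(X @ beta)))        # p(x_i; beta)
    grad = X.T @ (y - p)                         # sum_i x_i (y_i - p_i)
    w = p * (1.0 - p)                            # p_i (1 - p_i)
    hess = -(X * w[:, None]).T @ X               # -sum_i x_i x_i^T p_i (1 - p_i)
    return grad, hess
```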


▶ The element in the jth row and nth column is (counting from 0):
  \[ \frac{\partial^2 L(\beta)}{\partial \beta_{1j} \, \partial \beta_{1n}}
     = -\sum_{i=1}^{N} \frac{(1 + e^{\beta^T x_i}) e^{\beta^T x_i} x_{ij} x_{in} - (e^{\beta^T x_i})^2 x_{ij} x_{in}}{(1 + e^{\beta^T x_i})^2}
     = -\sum_{i=1}^{N} \left[ x_{ij} x_{in} p(x_i; \beta) - x_{ij} x_{in} p(x_i; \beta)^2 \right]
     = -\sum_{i=1}^{N} x_{ij} x_{in} \, p(x_i; \beta)(1 - p(x_i; \beta)) . \]


▶ Starting with β^{old}, a single Newton-Raphson update is
  \[ \beta^{new} = \beta^{old} - \left( \frac{\partial^2 L(\beta)}{\partial \beta \, \partial \beta^T} \right)^{-1} \frac{\partial L(\beta)}{\partial \beta} , \]
  where the derivatives are evaluated at β^{old}.


▶ The iteration can be expressed compactly in matrix form.
  • Let y be the column vector of the y_i.
  • Let X be the N × (p + 1) input matrix.
  • Let p be the N-vector of fitted probabilities with ith element p(x_i; β^{old}).
  • Let W be an N × N diagonal matrix of weights with ith diagonal element p(x_i; β^{old})(1 − p(x_i; β^{old})).
▶ Then
  \[ \frac{\partial L(\beta)}{\partial \beta} = \mathbf{X}^T (\mathbf{y} - \mathbf{p}) , \qquad
     \frac{\partial^2 L(\beta)}{\partial \beta \, \partial \beta^T} = -\mathbf{X}^T \mathbf{W} \mathbf{X} . \]


▶ The Newton-Raphson step is
  \[ \beta^{new} = \beta^{old} + (\mathbf{X}^T \mathbf{W} \mathbf{X})^{-1} \mathbf{X}^T (\mathbf{y} - \mathbf{p})
                 = (\mathbf{X}^T \mathbf{W} \mathbf{X})^{-1} \mathbf{X}^T \mathbf{W} \left( \mathbf{X} \beta^{old} + \mathbf{W}^{-1} (\mathbf{y} - \mathbf{p}) \right)
                 = (\mathbf{X}^T \mathbf{W} \mathbf{X})^{-1} \mathbf{X}^T \mathbf{W} \mathbf{z} , \]
  where z ≜ Xβ^{old} + W^{-1}(y − p).
▶ If z is viewed as a response and X as the input matrix, β^{new} is the solution to the weighted least squares problem
  \[ \beta^{new} \leftarrow \arg\min_{\beta} (\mathbf{z} - \mathbf{X}\beta)^T \mathbf{W} (\mathbf{z} - \mathbf{X}\beta) . \]
▶ Recall that linear regression by least squares solves
  \[ \arg\min_{\beta} (\mathbf{z} - \mathbf{X}\beta)^T (\mathbf{z} - \mathbf{X}\beta) . \]
▶ z is referred to as the adjusted response.
▶ The algorithm is referred to as iteratively reweighted least squares, or IRLS.

Pseudo Code

1. 0 → β
2. Compute y by setting its elements to
   \[ y_i = \begin{cases} 1 & \text{if } g_i = 1 \\ 0 & \text{if } g_i = 2 \end{cases} , \qquad i = 1, 2, \ldots, N. \]
3. Compute p by setting its elements to
   \[ p(x_i; \beta) = \frac{e^{\beta^T x_i}}{1 + e^{\beta^T x_i}} , \qquad i = 1, 2, \ldots, N. \]
4. Compute the diagonal matrix W. Its ith diagonal element is p(x_i; β)(1 − p(x_i; β)), i = 1, 2, ..., N.
5. z ← Xβ + W^{-1}(y − p).
6. β ← (X^T W X)^{-1} X^T W z.
7. If the stopping criterion is met, stop; otherwise go back to step 3.
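
The pseudo code above translates almost line by line into NumPy. Here is a minimal sketch of ours that follows it literally, including an explicit diagonal W; the convergence test on the change in β is our own choice, since the stopping criterion is left unspecified:

```python
import numpy as np

def irls_binary(X, g, max_iter=100, tol=1e-8):
    """Binary logistic regression fitted by IRLS, following the pseudo code.

    X : (N, p+1) design matrix with a leading column of ones.
    g : (N,) array of class labels in {1, 2}.
    """
    beta = np.zeros(X.shape[1])                    # step 1: 0 -> beta
    y = (g == 1).astype(float)                     # step 2: y_i = 1 if g_i = 1, else 0
    for _ in range(max_iter):
        eta = X @ beta
        p = 1.0 / (1.0 + np.exp(-eta))             # step 3: fitted probabilities
        W = np.diag(p * (1.0 - p))                 # step 4: N x N diagonal weight matrix
        z = eta + np.linalg.solve(W, y - p)        # step 5: z = X beta + W^{-1}(y - p)
        beta_new = np.linalg.solve(X.T @ W @ X, X.T @ W @ z)   # step 6
        if np.max(np.abs(beta_new - beta)) < tol:  # step 7: stopping criterion (assumed)
            return beta_new
        beta = beta_new
    return beta
```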

Computational Efficiency

▶ Since W is an N × N diagonal matrix, direct matrix operations with it may be very inefficient.
▶ A modified pseudo code is provided next.


1. 0 → β
2. Compute y by setting its elements to
   \[ y_i = \begin{cases} 1 & \text{if } g_i = 1 \\ 0 & \text{if } g_i = 2 \end{cases} , \qquad i = 1, 2, \ldots, N. \]
3. Compute p by setting its elements to
   \[ p(x_i; \beta) = \frac{e^{\beta^T x_i}}{1 + e^{\beta^T x_i}} , \qquad i = 1, 2, \ldots, N. \]
4. Compute the N × (p + 1) matrix X̃ by multiplying the ith row of X by p(x_i; β)(1 − p(x_i; β)), i = 1, 2, ..., N:
   \[ \mathbf{X} = \begin{pmatrix} x_1^T \\ x_2^T \\ \vdots \\ x_N^T \end{pmatrix} \qquad
      \tilde{\mathbf{X}} = \begin{pmatrix} p(x_1; \beta)(1 - p(x_1; \beta)) x_1^T \\ p(x_2; \beta)(1 - p(x_2; \beta)) x_2^T \\ \vdots \\ p(x_N; \beta)(1 - p(x_N; \beta)) x_N^T \end{pmatrix} \]
5. β ← β + (X^T X̃)^{-1} X^T (y − p).
6. If the stopping criterion is met, stop; otherwise go back to step 3.
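
The same loop can then be written without ever forming the N × N matrix W; only the elementwise weights p_i(1 − p_i) are needed. Again a sketch with our own names and stopping rule:

```python
import numpy as np

def irls_binary_fast(X, g, max_iter=100, tol=1e-8):
    """IRLS for binary logistic regression without forming the N x N matrix W."""
    beta = np.zeros(X.shape[1])                   # step 1
    y = (g == 1).astype(float)                    # step 2
    for _ in range(max_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))     # step 3
        Xt = X * (p * (1.0 - p))[:, None]         # step 4: X-tilde, rows of X scaled by p_i(1 - p_i)
        step = np.linalg.solve(X.T @ Xt, X.T @ (y - p))
        beta = beta + step                        # step 5: beta <- beta + (X^T X~)^{-1} X^T (y - p)
        if np.max(np.abs(step)) < tol:            # step 6: stopping criterion (assumed)
            break
    return beta
```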

Example

Diabetes data set

▶ The input X is two-dimensional. X1 and X2 are the two principal components of the original 8 variables.
▶ Class 1: without diabetes; Class 2: with diabetes.
▶ Applying logistic regression, we obtain
  \[ \beta = (0.7679, -0.6816, -0.3664)^T . \]


▶ The posterior probabilities are:
  \[ \Pr(G = 1 \mid X = x) = \frac{e^{0.7679 - 0.6816 X_1 - 0.3664 X_2}}{1 + e^{0.7679 - 0.6816 X_1 - 0.3664 X_2}} , \qquad
     \Pr(G = 2 \mid X = x) = \frac{1}{1 + e^{0.7679 - 0.6816 X_1 - 0.3664 X_2}} . \]
▶ The classification rule is:
  \[ \hat{G}(x) = \begin{cases} 1 & 0.7679 - 0.6816 X_1 - 0.3664 X_2 \ge 0 \\ 2 & 0.7679 - 0.6816 X_1 - 0.3664 X_2 < 0 \end{cases} \]
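
In code, the fitted rule is just a sign check on the linear function; the helper below and the query point are ours, for illustration only:

```python
import numpy as np

beta = np.array([0.7679, -0.6816, -0.3664])    # fitted (intercept, X1, X2) coefficients

def classify(x1, x2):
    """Return (predicted class, Pr(G = 1 | X = x)) for a point (x1, x2)."""
    eta = beta @ np.array([1.0, x1, x2])       # 0.7679 - 0.6816*x1 - 0.3664*x2
    p1 = 1.0 / (1.0 + np.exp(-eta))            # Pr(G = 1 | X = x)
    return (1 if eta >= 0 else 2), p1

print(classify(0.0, 0.0))                      # eta = 0.7679 >= 0, so class 1
```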


[Figure] Solid line: decision boundary obtained by logistic regression. Dashed line: decision boundary obtained by LDA.

▶ Within the training data set, the classification error rate is 28.12%.
▶ Sensitivity: 45.9%.
▶ Specificity: 85.8%.


Multiclass Case (K ≥ 3)

▶ When K ≥ 3, β is a (K − 1)(p + 1)-vector:
  \[ \beta = \begin{pmatrix} \beta_{10} \\ \beta_1 \\ \beta_{20} \\ \beta_2 \\ \vdots \\ \beta_{(K-1)0} \\ \beta_{K-1} \end{pmatrix}
           = \begin{pmatrix} \beta_{10} \\ \beta_{11} \\ \vdots \\ \beta_{1p} \\ \beta_{20} \\ \vdots \\ \beta_{2p} \\ \vdots \\ \beta_{(K-1)0} \\ \vdots \\ \beta_{(K-1)p} \end{pmatrix} \]


 
▶ Let \bar{\beta}_l = \begin{pmatrix} \beta_{l0} \\ \beta_l \end{pmatrix} (with \bar{\beta}_K \equiv 0 for the reference class K, consistent with the posterior probabilities given earlier).
▶ The log-likelihood function becomes
  \[ L(\beta) = \sum_{i=1}^{N} \log p_{g_i}(x_i; \beta)
             = \sum_{i=1}^{N} \log \left( \frac{e^{\bar{\beta}_{g_i}^T x_i}}{1 + \sum_{l=1}^{K-1} e^{\bar{\beta}_l^T x_i}} \right)
             = \sum_{i=1}^{N} \left[ \bar{\beta}_{g_i}^T x_i - \log \left( 1 + \sum_{l=1}^{K-1} e^{\bar{\beta}_l^T x_i} \right) \right] \]


▶ Note: the indicator function I(·) equals 1 when the argument is true and 0 otherwise.
▶ First-order derivatives:
  \[ \frac{\partial L(\beta)}{\partial \beta_{kj}}
     = \sum_{i=1}^{N} \left[ I(g_i = k) x_{ij} - \frac{e^{\bar{\beta}_k^T x_i} x_{ij}}{1 + \sum_{l=1}^{K-1} e^{\bar{\beta}_l^T x_i}} \right]
     = \sum_{i=1}^{N} x_{ij} \left( I(g_i = k) - p_k(x_i; \beta) \right) \]


▶ Second-order derivatives:
  \[ \frac{\partial^2 L(\beta)}{\partial \beta_{kj} \, \partial \beta_{mn}}
     = \sum_{i=1}^{N} x_{ij} \cdot \frac{1}{\left( 1 + \sum_{l=1}^{K-1} e^{\bar{\beta}_l^T x_i} \right)^2}
       \left[ -e^{\bar{\beta}_k^T x_i} I(k = m) x_{in} \left( 1 + \sum_{l=1}^{K-1} e^{\bar{\beta}_l^T x_i} \right) + e^{\bar{\beta}_k^T x_i} e^{\bar{\beta}_m^T x_i} x_{in} \right] \]
  \[ = \sum_{i=1}^{N} x_{ij} x_{in} \left( -p_k(x_i; \beta) I(k = m) + p_k(x_i; \beta) p_m(x_i; \beta) \right)
     = -\sum_{i=1}^{N} x_{ij} x_{in} \, p_k(x_i; \beta) \left[ I(k = m) - p_m(x_i; \beta) \right] . \]


▶ Matrix form:
  • y is the concatenated indicator vector of dimension N(K − 1):
    \[ \mathbf{y} = \begin{pmatrix} \mathbf{y}_1 \\ \mathbf{y}_2 \\ \vdots \\ \mathbf{y}_{K-1} \end{pmatrix} , \qquad
       \mathbf{y}_k = \begin{pmatrix} I(g_1 = k) \\ I(g_2 = k) \\ \vdots \\ I(g_N = k) \end{pmatrix} , \qquad 1 \le k \le K - 1 . \]
  • p is the concatenated vector of fitted probabilities of dimension N(K − 1):
    \[ \mathbf{p} = \begin{pmatrix} \mathbf{p}_1 \\ \mathbf{p}_2 \\ \vdots \\ \mathbf{p}_{K-1} \end{pmatrix} , \qquad
       \mathbf{p}_k = \begin{pmatrix} p_k(x_1; \beta) \\ p_k(x_2; \beta) \\ \vdots \\ p_k(x_N; \beta) \end{pmatrix} , \qquad 1 \le k \le K - 1 . \]

▶ X̃ is an N(K − 1) × (p + 1)(K − 1) block-diagonal matrix:
  \[ \tilde{\mathbf{X}} = \begin{pmatrix} \mathbf{X} & 0 & \cdots & 0 \\ 0 & \mathbf{X} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \mathbf{X} \end{pmatrix} \]


▶ The matrix W is an N(K − 1) × N(K − 1) square matrix:
  \[ \mathbf{W} = \begin{pmatrix} \mathbf{W}_{11} & \mathbf{W}_{12} & \cdots & \mathbf{W}_{1(K-1)} \\ \mathbf{W}_{21} & \mathbf{W}_{22} & \cdots & \mathbf{W}_{2(K-1)} \\ \vdots & \vdots & \ddots & \vdots \\ \mathbf{W}_{(K-1),1} & \mathbf{W}_{(K-1),2} & \cdots & \mathbf{W}_{(K-1),(K-1)} \end{pmatrix} \]
▶ Each submatrix W_{km}, 1 ≤ k, m ≤ K − 1, is an N × N diagonal matrix.
  • When k = m, the ith diagonal element of W_{kk} is p_k(x_i; β^{old})(1 − p_k(x_i; β^{old})).
  • When k ≠ m, the ith diagonal element of W_{km} is −p_k(x_i; β^{old}) p_m(x_i; β^{old}).


▶ As in binary classification,
  \[ \frac{\partial L(\beta)}{\partial \beta} = \tilde{\mathbf{X}}^T (\mathbf{y} - \mathbf{p}) , \qquad
     \frac{\partial^2 L(\beta)}{\partial \beta \, \partial \beta^T} = -\tilde{\mathbf{X}}^T \mathbf{W} \tilde{\mathbf{X}} . \]
▶ The formula for updating β^{new} in the binary classification case carries over to the multiclass case:
  \[ \beta^{new} = (\tilde{\mathbf{X}}^T \mathbf{W} \tilde{\mathbf{X}})^{-1} \tilde{\mathbf{X}}^T \mathbf{W} \mathbf{z} , \]
  where z ≜ X̃β^{old} + W^{-1}(y − p). Or simply:
  \[ \beta^{new} = \beta^{old} + (\tilde{\mathbf{X}}^T \mathbf{W} \tilde{\mathbf{X}})^{-1} \tilde{\mathbf{X}}^T (\mathbf{y} - \mathbf{p}) . \]
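
For concreteness, one deliberately unoptimized Newton step coded directly from these block definitions is sketched below; the function and variable names are ours, and the full W is assembled explicitly only to mirror the formulas (in practice one would exploit its diagonal-block structure):

```python
import numpy as np

def multiclass_newton_step(B, X, g, K):
    """One Newton-Raphson update beta_new = beta_old + (Xt^T W Xt)^{-1} Xt^T (y - p).

    B : (K-1, p+1) array; row k holds (beta_{k0}, beta_k) for class k+1.
    X : (N, p+1) design matrix with a leading column of ones.
    g : (N,) array of class labels in {1, ..., K}; class K is the reference.
    """
    N, d = X.shape
    Km1 = K - 1
    # Fitted probabilities p_k(x_i; beta), an (N, K-1) array.
    expeta = np.exp(X @ B.T)                                   # e^{beta_bar_k^T x_i}
    P = expeta / (1.0 + expeta.sum(axis=1, keepdims=True))
    # Stacked indicator vector y and probability vector p, each of length N(K-1).
    Y = np.stack([(g == k + 1).astype(float) for k in range(Km1)], axis=1)
    y = Y.T.reshape(-1)                                        # block order: class 1, ..., class K-1
    p = P.T.reshape(-1)
    # Block-diagonal X-tilde of size N(K-1) x (p+1)(K-1).
    Xt = np.kron(np.eye(Km1), X)
    # W: diagonal blocks p_k(1 - p_k), off-diagonal blocks -p_k p_m.
    W = np.zeros((N * Km1, N * Km1))
    for k in range(Km1):
        for m in range(Km1):
            wkm = P[:, k] * ((1.0 if k == m else 0.0) - P[:, m])
            W[k*N:(k+1)*N, m*N:(m+1)*N] = np.diag(wkm)
    step = np.linalg.solve(Xt.T @ W @ Xt, Xt.T @ (y - p))
    return B + step.reshape(Km1, d)
```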


Computation Issues

▶ Initialization: one option is to use β = 0.
▶ Convergence is not guaranteed, but it usually occurs.
▶ Usually the log-likelihood increases after each iteration, but overshooting can occur.
▶ In the rare cases where the log-likelihood decreases, cut the step size by half (a sketch of this safeguard follows below).
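
A minimal version of Newton-Raphson with this step-halving safeguard might look as follows (binary case; the names, the halving limit, and the iteration cap are our own choices, not part of the lecture):

```python
import numpy as np

def log_likelihood(beta, X, y):
    """L(beta) = sum_i [ y_i beta^T x_i - log(1 + e^{beta^T x_i}) ]."""
    eta = X @ beta
    return np.sum(y * eta - np.logaddexp(0.0, eta))

def newton_with_step_halving(X, g, max_iter=50, max_halvings=10):
    """Newton-Raphson that halves the step whenever the log-likelihood would decrease."""
    beta = np.zeros(X.shape[1])                       # initialization: beta = 0
    y = (g == 1).astype(float)
    for _ in range(max_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))
        grad = X.T @ (y - p)
        hess = -(X * (p * (1.0 - p))[:, None]).T @ X
        step = np.linalg.solve(-hess, grad)           # full Newton step
        t = 1.0
        for _ in range(max_halvings):                 # cut the step size by half if L decreases
            if log_likelihood(beta + t * step, X, y) >= log_likelihood(beta, X, y):
                break
            t *= 0.5
        beta = beta + t * step
    return beta
```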


Connection with LDA


▶ Under the model of LDA:
  \[ \log \frac{\Pr(G = k \mid X = x)}{\Pr(G = K \mid X = x)}
     = \log \frac{\pi_k}{\pi_K} - \frac{1}{2} (\mu_k + \mu_K)^T \Sigma^{-1} (\mu_k - \mu_K) + x^T \Sigma^{-1} (\mu_k - \mu_K)
     = a_{k0} + a_k^T x . \]
  (The mapping to (a_{k0}, a_k) is coded in the sketch after this list.)
▶ The model of LDA satisfies the assumption of the linear logistic model.
▶ The linear logistic model only specifies the conditional distribution Pr(G = k | X = x). No assumption is made about Pr(X).
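
The mapping from the LDA parameters (π_k, μ_k, Σ) to the logistic coefficients (a_{k0}, a_k) in the first bullet above is mechanical; a small sketch of ours:

```python
import numpy as np

def lda_logit_coefficients(pi_k, pi_K, mu_k, mu_K, Sigma):
    """Coefficients (a_k0, a_k) of the LDA log-odds log Pr(G=k|x)/Pr(G=K|x) = a_k0 + a_k^T x."""
    Sigma_inv = np.linalg.inv(Sigma)
    a_k = Sigma_inv @ (mu_k - mu_K)                                     # slope vector
    a_k0 = np.log(pi_k / pi_K) - 0.5 * (mu_k + mu_K) @ Sigma_inv @ (mu_k - mu_K)
    return a_k0, a_k
```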

▶ The LDA model specifies the joint distribution of X and G. Pr(X) is a mixture of Gaussians:
  \[ \Pr(X) = \sum_{k=1}^{K} \pi_k \, \phi(X; \mu_k, \Sigma) , \]
  where φ is the Gaussian density function.
▶ Linear logistic regression maximizes the conditional likelihood of G given X: Pr(G = k | X = x).
▶ LDA maximizes the joint likelihood of G and X: Pr(X = x, G = k).


▶ If the additional assumption made by LDA is appropriate, LDA tends to estimate the parameters more efficiently by using more information about the data.
▶ Samples without class labels can be used under the model of LDA.
▶ LDA is not robust to gross outliers.
▶ Because logistic regression relies on fewer assumptions, it seems to be more robust.
▶ In practice, logistic regression and LDA often give similar results.


Simulation

▶ Assume the input X is one-dimensional.
▶ The two classes have equal priors, and the class-conditional densities of X are shifted versions of each other.
▶ Each conditional density is a mixture of two normals:
  • Class 1 (red): 0.6 N(−2, 1/4) + 0.4 N(0, 1).
  • Class 2 (blue): 0.6 N(0, 1/4) + 0.4 N(2, 1).
▶ The class-conditional densities are shown below.

[Figure: the two class-conditional densities.]

LDA Result

▶ Training data set: 2000 samples for each class.
▶ Test data set: 1000 samples for each class.
▶ The estimates by LDA: μ̂1 = −1.1948, μ̂2 = 0.8224, σ̂² = 1.5268. The boundary value between the two classes is (μ̂1 + μ̂2)/2 = −0.1862.
▶ The classification error rate on the test data is 0.2315.
▶ Based on the true distribution, the Bayes (optimal) boundary value between the two classes is −0.7750, and the error rate is 0.1765.


Logistic Regression Result

▶ Linear logistic regression obtains
  \[ \beta = (-0.3288, -1.3275)^T . \]
  The boundary value satisfies −0.3288 − 1.3275 X = 0, hence equals −0.2477.
▶ The error rate on the test data set is 0.2205.
▶ The estimated posterior probability is:
  \[ \Pr(G = 1 \mid X = x) = \frac{e^{-0.3288 - 1.3275 x}}{1 + e^{-0.3288 - 1.3275 x}} . \]
  (A sketch reproducing this simulation appears below.)
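
The whole experiment is easy to rerun. The sketch below draws training and test sets from the stated mixtures, fits LDA (pooled variance, boundary at the midpoint of the class means) and logistic regression (by the IRLS loop from earlier), and prints the boundaries and test error rates. The exact numbers will differ somewhat from the slides, since they depend on the random draw:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_class(n, shift):
    """Draw n points from 0.6 N(shift - 2, 1/4) + 0.4 N(shift, 1)."""
    comp = rng.random(n) < 0.6
    return np.where(comp, rng.normal(shift - 2.0, 0.5, n), rng.normal(shift, 1.0, n))

def make_set(n_per_class):
    x = np.concatenate([sample_class(n_per_class, 0.0), sample_class(n_per_class, 2.0)])
    g = np.concatenate([np.ones(n_per_class), 2 * np.ones(n_per_class)])
    return x, g

x_tr, g_tr = make_set(2000)
x_te, g_te = make_set(1000)

# LDA in 1-D with equal priors: boundary at the midpoint of the class means.
m1, m2 = x_tr[g_tr == 1].mean(), x_tr[g_tr == 2].mean()
lda_boundary = 0.5 * (m1 + m2)
lda_err = np.mean(np.where(x_te < lda_boundary, 1, 2) != g_te)

# Logistic regression fitted by IRLS.
X_tr = np.column_stack([np.ones_like(x_tr), x_tr])
beta = np.zeros(2)
for _ in range(50):
    p = 1.0 / (1.0 + np.exp(-(X_tr @ beta)))
    w = p * (1.0 - p)
    beta += np.linalg.solve((X_tr * w[:, None]).T @ X_tr, X_tr.T @ ((g_tr == 1) - p))
logit_boundary = -beta[0] / beta[1]             # solves beta0 + beta1 * x = 0
pred = np.where(beta[0] + beta[1] * x_te >= 0, 1, 2)
logit_err = np.mean(pred != g_te)

print("LDA boundary %.4f, test error %.4f" % (lda_boundary, lda_err))
print("Logistic boundary %.4f, test error %.4f" % (logit_boundary, logit_err))
```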


The estimated posterior probability Pr(G = 1 | X = x) and its true value based on the true distribution are compared in the figure below.

[Figure: estimated vs. true posterior probability Pr(G = 1 | X = x).]

