
Bayesian Logistic Regression


• Logistic regression is a discriminative probabilistic linear classifier: $p(C_1|\phi) = \sigma(w^T\phi)$ (a minimal sketch of this model follows after the list below)
• Exact Bayesian inference for logistic regression, $p(C_1|\phi) = \int \sigma(w^T\phi)\,p(w|t)\,dw$, is intractable, because:

1. Evaluation of the posterior distribution $p(w|t)$
   – Needs normalization of the prior $p(w) = N(w|m_0, S_0)$ times the likelihood (a product of sigmoids) $p(t|w) = \prod_{n=1}^{N} y_n^{t_n}(1-y_n)^{1-t_n}$
   • Solution: use the Laplace approximation to get a Gaussian $q(w)$
2. Evaluation of the predictive distribution $p(C_1|\phi) \approx \int \sigma(w^T\phi)\,q(w)\,dw$
   – A convolution of a sigmoid and a Gaussian
   • Solution: approximate the sigmoid by a probit
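A minimal sketch (not from the slides) of the point-estimate model $p(C_1|\phi) = \sigma(w^T\phi)$; the weight vector `w` and feature vector `phi` below are hypothetical values chosen only for illustration.

```python
import numpy as np

def sigmoid(a):
    """Logistic sigmoid: sigma(a) = 1 / (1 + exp(-a))."""
    return 1.0 / (1.0 + np.exp(-a))

# Hypothetical weights and features (with a bias feature phi_0 = 1)
w = np.array([0.5, -1.2, 2.0])
phi = np.array([1.0, 0.3, 0.7])

p_c1 = sigmoid(w @ phi)   # p(C1 | phi) for this single weight vector
print(p_c1)               # probability of class C1
```

The Bayesian treatment replaces the single `w` with an average over the posterior, which is what the rest of the slides develop.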
Laplace Approximation (summary)
• Need mode w0 of posterior distribution p(w|t)
– Done by a numerical optimization algorithm
• Fit a Gaussian centered at the mode:
  $q(w) = \frac{|A|^{1/2}}{(2\pi)^{M/2}} \exp\left\{-\frac{1}{2}(w - w_0)^T A\,(w - w_0)\right\} = N(w \mid w_0, A^{-1})$
  – Needs second derivatives of the log posterior: $A = -\nabla\nabla \ln f(w)\big|_{w=w_0}$
• Equivalent to finding the Hessian matrix
  $S_N^{-1} = -\nabla\nabla \ln p(w|t) = S_0^{-1} + \sum_{n=1}^{N} y_n(1 - y_n)\,\phi_n\phi_n^T$
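A sketch of the generic Laplace recipe under an assumed toy log density (the logistic-regression case is worked out on the following slides): find the mode numerically, then take $A$ as the negative Hessian of the log density at the mode, here estimated by finite differences.

```python
import numpy as np
from scipy.optimize import minimize

def log_f(w):
    # Toy unnormalized log density, a stand-in for ln p(w|t)
    return -0.5 * w @ w - 0.1 * np.sum(w ** 4)

# Step 1: find the mode w0 by numerical optimization
res = minimize(lambda w: -log_f(w), x0=np.ones(2))
w0 = res.x

# Step 2: A = -grad grad ln f(w) at w0, via central finite differences
eps, M = 1e-4, len(w0)
A = np.zeros((M, M))
I = np.eye(M)
for i in range(M):
    for j in range(M):
        hi, hj = eps * I[i], eps * I[j]
        A[i, j] = -(log_f(w0 + hi + hj) - log_f(w0 + hi - hj)
                    - log_f(w0 - hi + hj) + log_f(w0 - hi - hj)) / (4 * eps ** 2)

# Gaussian approximation q(w) = N(w | w0, A^{-1})
S = np.linalg.inv(A)
```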
Evaluation of Posterior Distribution
• Gaussian prior: $p(w) = N(w \mid m_0, S_0)$
  – where $m_0$ and $S_0$ are hyperparameters
• Posterior distribution: $p(w|t) \propto p(w)\,p(t|w)$, where $t = (t_1, \ldots, t_N)^T$
  – Substituting $p(t|w) = \prod_{n=1}^{N} y_n^{t_n}\{1 - y_n\}^{1 - t_n}$ gives
    $\ln p(w|t) = -\frac{1}{2}(w - m_0)^T S_0^{-1}(w - m_0) + \sum_{n=1}^{N}\{t_n \ln y_n + (1 - t_n)\ln(1 - y_n)\} + \text{const}$
  – where $y_n = \sigma(w^T\phi_n)$
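A sketch of this (unnormalized) log posterior in code, assuming a design matrix `Phi` of shape `(N, M)` whose rows are the $\phi_n$, binary targets `t` in $\{0, 1\}$, and prior mean and inverse covariance `m0` and `S0_inv`.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def log_posterior(w, t, Phi, m0, S0_inv):
    """ln p(w|t) up to an additive constant."""
    y = sigmoid(Phi @ w)                          # y_n = sigma(w^T phi_n)
    y = np.clip(y, 1e-12, 1.0 - 1e-12)            # guard against log(0)
    prior = -0.5 * (w - m0) @ S0_inv @ (w - m0)   # Gaussian prior term
    lik = np.sum(t * np.log(y) + (1 - t) * np.log(1 - y))
    return prior + lik
```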
Gaussian Approximation of Posterior
• Maximize the posterior $p(w|t)$ to give
  – the MAP solution $w_{MAP}$
    • done by numerical optimization
  – which defines the mean of the Gaussian
• Covariance given by
  – the inverse of the matrix of second derivatives of the negative log-likelihood:
    $S_N^{-1} = -\nabla\nabla \ln p(w|t) = S_0^{-1} + \sum_{n=1}^{N} y_n(1 - y_n)\,\phi_n\phi_n^T$
• Gaussian approximation to the posterior: $q(w) = N(w \mid w_{MAP}, S_N)$
• Need to marginalize with respect to this distribution to make predictions (a sketch of the full fit follows below)
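A sketch of the full Laplace fit for logistic regression, reusing `sigmoid` and `log_posterior` from the sketch above: $w_{MAP}$ is found by numerical optimization, and $S_N^{-1}$ is formed in closed form from the formula on this slide.

```python
import numpy as np
from scipy.optimize import minimize

def laplace_fit(t, Phi, m0, S0_inv):
    """Return (w_map, S_N) so that q(w) = N(w | w_map, S_N)."""
    neg_log_post = lambda w: -log_posterior(w, t, Phi, m0, S0_inv)
    w_map = minimize(neg_log_post, x0=np.zeros(Phi.shape[1])).x

    y = sigmoid(Phi @ w_map)
    # S_N^{-1} = S_0^{-1} + sum_n y_n (1 - y_n) phi_n phi_n^T
    SN_inv = S0_inv + (Phi * (y * (1 - y))[:, None]).T @ Phi
    return w_map, np.linalg.inv(SN_inv)
```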
Predictive Distribution
• Predictive distribution for class $C_1$, given a new feature vector $\phi(x)$
  – Obtained by marginalizing with respect to the posterior $p(w|t)$:
    $p(C_1 \mid \phi, t) = \int p(C_1, w \mid \phi, t)\,dw$   (sum rule)
    $= \int p(C_1 \mid \phi, t, w)\,p(w|t)\,dw$   (product rule)
    $= \int p(C_1 \mid \phi, w)\,p(w|t)\,dw$   (given $\phi$ and $w$, $C_1$ is independent of $t$)
  – Approximating $p(w|t)$ by the Gaussian $q(w)$:
    $p(C_1 \mid \phi, t) \approx \int \sigma(w^T\phi)\,q(w)\,dw$
• The corresponding probability for class $C_2$ is $p(C_2 \mid \phi, t) = 1 - p(C_1 \mid \phi, t)$
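Before deriving the analytic approximation, one way to sanity-check this integral (not from the slides) is plain Monte Carlo: draw samples from $q(w) = N(w \mid w_{MAP}, S_N)$ and average the sigmoid outputs. This reuses `sigmoid` from the sketches above.

```python
import numpy as np

def predictive_mc(phi, w_map, SN, n_samples=10_000, seed=0):
    """Monte Carlo estimate of E_q[sigma(w^T phi)]."""
    rng = np.random.default_rng(seed)
    W = rng.multivariate_normal(w_map, SN, size=n_samples)  # samples of w
    return sigmoid(W @ phi).mean()
```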
Predictive Distribution is a Convolution
$p(C_1 \mid \phi, t) \approx \int \sigma(w^T\phi)\,q(w)\,dw$
– The function $\sigma(w^T\phi)$ depends on $w$ only through its projection onto $\phi$
– Denoting $a = w^T\phi$, we have $\sigma(w^T\phi) = \int \delta(a - w^T\phi)\,\sigma(a)\,da$
  • where $\delta$ is the Dirac delta function
– Thus $\int \sigma(w^T\phi)\,q(w)\,dw = \int \sigma(a)\,p(a)\,da$, where $p(a) = \int \delta(a - w^T\phi)\,q(w)\,dw$
• Can evaluate $p(a)$ because
  – the delta function imposes a linear constraint on $w$
  – since $q(w)$ is Gaussian, this marginal is also Gaussian
• Evaluate its mean and variance (a short sketch follows below):
  $\mu_a = \mathbb{E}[a] = \int p(a)\,a\,da = \int q(w)\,w^T\phi\,dw = w_{MAP}^T\,\phi$
  $\sigma_a^2 = \mathrm{var}[a] = \int p(a)\,\{a^2 - \mathbb{E}[a]^2\}\,da = \int q(w)\,\{(w^T\phi)^2 - (m_N^T\phi)^2\}\,dw = \phi^T S_N\,\phi$
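In code, the marginal $p(a) = N(a \mid \mu_a, \sigma_a^2)$ reduces to two small computations; a sketch:

```python
import numpy as np

def marginal_a(phi, w_map, SN):
    """Moments of p(a), the marginal of a = w^T phi under q(w)."""
    mu_a = w_map @ phi       # mu_a = w_MAP^T phi
    var_a = phi @ SN @ phi   # sigma_a^2 = phi^T S_N phi
    return mu_a, var_a
```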
Variational Approximation to Predictive Distribution

• Predictive distribution is
  $p(C_1 \mid t) = \int \sigma(a)\,p(a)\,da = \int \sigma(a)\,N(a \mid \mu_a, \sigma_a^2)\,da$
• This convolution of a sigmoid and a Gaussian is intractable
• Use the probit instead of the logistic sigmoid
[Figure: logistic sigmoid plotted over $(-5, 5)$]
Approximation using Probit

$p(C_1 \mid t) = \int \sigma(a)\,N(a \mid \mu_a, \sigma_a^2)\,da$
• Use the probit function, which is similar in shape to the logistic sigmoid
  – Defined as $\Phi(a) = \int_{-\infty}^{a} N(\theta \mid 0, 1)\,d\theta$
• Approximate $\sigma(a)$ by $\Phi(\lambda a)$
• Find a suitable $\lambda$ by requiring that the two functions have the same slope at the origin, which yields $\lambda^2 = \pi/8$ (a quick numerical check follows below)
[Figure: logistic sigmoid $\sigma(a)$ and rescaled probit $\Phi(\lambda a)$ plotted over $(-5, 5)$]
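A quick numerical check of the slope-matching condition (not from the slides): $\sigma'(0) = \sigma(0)(1 - \sigma(0)) = 1/4$, while $\frac{d}{da}\Phi(\lambda a)\big|_{a=0} = \lambda\,N(0 \mid 0, 1) = \lambda/\sqrt{2\pi}$; equating the two gives $\lambda = \sqrt{\pi/8}$.

```python
import numpy as np

lam = np.sqrt(np.pi / 8.0)                 # lambda^2 = pi / 8
slope_sigmoid = 0.25                       # sigma'(0) = 1/4
slope_probit = lam / np.sqrt(2 * np.pi)    # slope of Phi(lambda * a) at 0
print(slope_sigmoid, slope_probit)         # both print 0.25
```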

• The convolution of a probit with a Gaussian is another probit:
  $\int \Phi(\lambda a)\,N(a \mid \mu, \sigma^2)\,da = \Phi\!\left(\frac{\mu}{(\lambda^{-2} + \sigma^2)^{1/2}}\right)$
  – Thus $p(C_1 \mid \phi, t) = \int \sigma(a)\,N(a \mid \mu_a, \sigma_a^2)\,da \approx \sigma\big(\kappa(\sigma_a^2)\,\mu_a\big)$
  – where $\kappa(\sigma^2) = (1 + \pi\sigma^2/8)^{-1/2}$
Probit Classification
Applying this to
  $p(C_1 \mid t) = \int \sigma(a)\,N(a \mid \mu_a, \sigma_a^2)\,da$
we have
  $p(C_1 \mid \phi, t) \approx \sigma\big(\kappa(\sigma_a^2)\,\mu_a\big)$
where $\mu_a = w_{MAP}^T\,\phi$ and $\sigma_a^2 = \phi^T S_N\,\phi$ (a sketch putting the pieces together follows below).
The decision boundary corresponding to $p(C_1 \mid \phi, t) = 0.5$ is given by $\mu_a = 0$, which is the same solution as $w_{MAP}^T\phi = 0$.
Thus marginalization has no effect on the decision boundary when minimizing the misclassification rate with equal prior probabilities.
For more complex decision criteria, however, the posterior uncertainty plays an important role.
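A sketch of the final closed-form prediction, reusing `sigmoid`, `laplace_fit`, and `marginal_a` from the sketches above:

```python
import numpy as np

def kappa(var_a):
    """kappa(sigma^2) = (1 + pi * sigma^2 / 8)^(-1/2)."""
    return 1.0 / np.sqrt(1.0 + np.pi * var_a / 8.0)

def predictive_probit(phi, w_map, SN):
    """p(C1 | phi, t) ~ sigma(kappa(sigma_a^2) * mu_a)."""
    mu_a, var_a = marginal_a(phi, w_map, SN)
    return sigmoid(kappa(var_a) * mu_a)
```

Since $\kappa(\sigma_a^2) > 0$, the moderated output crosses 0.5 exactly where $\mu_a = 0$, matching the boundary statement above; it can also be compared against `predictive_mc` as a check.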


Summary
• Logistic regression is a linear probabilistic discriminative model: $p(C_1 \mid x) = \sigma(w^T\phi)$
• Exact Bayesian logistic regression is intractable
• Using the Laplace approximation, the posterior parameter distribution $p(w|t)$ can be approximated as a Gaussian $q(w)$
• The predictive distribution is a convolution of a sigmoid and a Gaussian: $p(C_1 \mid \phi) \approx \int \sigma(w^T\phi)\,q(w)\,dw$
  – Approximating the sigmoid by a probit turns the convolution into another probit, giving a closed-form result
