Discriminative classifier and
logistic regression
Le Song
Machine Learning
CS 7641 / CSE/ISYE 6740, Fall 2015
Classification
Represent the data as feature vectors $x \in \mathbb{R}^d$.
A label is provided for each data point, e.g., $y \in \{-1, +1\}$.
Classifier: a function that maps a data point $x$ to a label $y$.
Boys vs. Girls (demo)
How to come up with a decision boundary?
Given the class conditional distributions $p(x \mid y=1)$ and $p(x \mid y=-1)$, and the class priors $p(y=1)$ and $p(y=-1)$, what are the posteriors $p(y=1 \mid x)$ and $p(y=-1 \mid x)$?
Use Bayes rule:
$$p(y \mid x) = \frac{p(x \mid y)\,p(y)}{p(x)}, \qquad \text{i.e.,} \qquad \text{posterior} = \frac{\text{likelihood} \times \text{prior}}{\text{normalization constant}}$$
Bayes Decision Rule
Learning: the prior $p(y)$ and the class conditional distribution $p(x \mid y)$.
The posterior probability of a test point: $q(x) := p(y=1 \mid x)$.
Bayes decision rule: if $q(x) > \frac{1}{2}$, then $y = 1$; otherwise $y = -1$.
Alternatively: if the ratio
$$\frac{p(x \mid y=1)\,p(y=1)}{p(x \mid y=-1)\,p(y=-1)} > 1,$$
then $y = 1$; otherwise $y = -1$.
Or look at the log-likelihood ratio:
$$h(x) = \ln \frac{p(y=1 \mid x)}{p(y=-1 \mid x)}$$
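As a concrete illustration (not from the slides), here is a minimal sketch of the Bayes decision rule when the two class conditionals are assumed to be known 1-D Gaussians; the means, variances, and prior below are made-up values.

```python
import numpy as np
from scipy.stats import norm

# Assumed (made-up) class conditionals and prior for illustration.
prior_pos = 0.4                          # p(y = +1)
cond_pos = norm(loc=1.0, scale=1.0)      # p(x | y = +1)
cond_neg = norm(loc=-1.0, scale=2.0)     # p(x | y = -1)

def bayes_decision(x):
    """Return +1 if p(y=+1|x) > 1/2, i.e. if the joint p(x|y)p(y) is larger."""
    joint_pos = cond_pos.pdf(x) * prior_pos
    joint_neg = cond_neg.pdf(x) * (1.0 - prior_pos)
    return 1 if joint_pos > joint_neg else -1

print(bayes_decision(0.5))   # classify a test point
```

Comparing the joints $p(x \mid y)\,p(y)$ avoids computing the normalization constant $p(x)$, which cancels in the ratio.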
More on the Bayes error of the Bayes rule
The Bayes error is the lower bound on the probability of classification error.
The Bayes decision rule is the theoretically best classifier: it minimizes the probability of classification error.
However, computing the Bayes error or the Bayes decision rule is in general a very complex problem. Why?
Need density estimation.
Need to evaluate integrals, e.g., the Bayes error is
$$\int \min\{\,p(x \mid y=1)\,p(y=1),\; p(x \mid y=-1)\,p(y=-1)\,\}\,dx$$
What do people do in practice?
Use simplifying assumptions for $p(x \mid y)$:
Assume $p(x \mid y=1)$ is Gaussian.
Assume $p(x \mid y=1)$ is fully factorized: $\prod_i p(x_i \mid y=1)$.
Use geometric intuitions:
k-nearest neighbor classifier
Support vector machine
Directly go for the decision boundary $h(x) = \ln \frac{p(y=1 \mid x)}{p(y=-1 \mid x)}$:
Logistic regression
Neural networks
Naïve Bayes Classifier
Use the Bayes decision rule for classification,
but assume $p(x \mid y=1)$ is fully factorized:
$$p(x \mid y=1) = \prod_{i=1}^{d} p(x_i \mid y=1)$$
In other words, the variables corresponding to each dimension of the data are independent given the label.
Naïve Bayes classifier is a generative model
Once you have the model, you can generate samples from it. For each data point:
Sample a label $y \in \{1, 2\}$ according to the class prior $p(y)$.
Sample the value of $x$ from the class conditional $p(x \mid y)$.
Naïve Bayes: conditioned on $y$, generate the first dimension $x_1$, the second dimension $x_2$, ..., independently.
(Figure: graphical model with the label on top and the data dimensions below.)
Difference from mixture of Gaussians models:
Purpose is different (density estimation vs. classification).
Data are different (without vs. with labels).
Learning is different (EM or not).
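To make the generative view concrete, here is a minimal sketch of a Gaussian naïve Bayes classifier, assuming continuous features with per-dimension Gaussian class conditionals fit by maximum likelihood; the toy data and the small variance floor are illustrative choices, not from the course.

```python
import numpy as np

def fit_gaussian_nb(X, y):
    """Estimate the class prior and per-class, per-dimension means/variances (MLE)."""
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        params[c] = (len(Xc) / len(X),          # class prior p(y = c)
                     Xc.mean(axis=0),            # mean of each dimension
                     Xc.var(axis=0) + 1e-9)      # variance, with a small floor
    return params

def predict_gaussian_nb(params, x):
    """Pick the class maximizing log p(y=c) + sum_i log p(x_i | y=c)."""
    def log_joint(c):
        prior, mu, var = params[c]
        log_cond = -0.5 * (np.log(2 * np.pi * var) + (x - mu) ** 2 / var)
        return np.log(prior) + log_cond.sum()   # factorized: sum over dimensions
    return max(params, key=log_joint)

# Tiny illustration with made-up data.
X = np.array([[1.0, 2.0], [1.2, 1.8], [-1.0, -2.0], [-0.8, -2.2]])
y = np.array([1, 1, -1, -1])
params = fit_gaussian_nb(X, y)
print(predict_gaussian_nb(params, np.array([0.9, 1.5])))  # -> 1
```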
K nearest neighbors
k-nearest neighbor classifier: assign a label by taking a majority vote over the $k$ training points closest to $x$.
For $k > 1$, the k-nearest neighbor rule generalizes the nearest neighbor rule.
To define this more mathematically, let $N_k(x)$ be the set of indices of the $k$ training points closest to $x$. If $y \in \{-1, +1\}$, then we can write the k-nearest neighbor classifier as:
$$f(x) = \mathrm{sign}\left(\sum_{i \in N_k(x)} y_i\right)$$
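The formula translates directly into code; a brute-force sketch, assuming Euclidean distance and labels in $\{-1, +1\}$ (the example data are made up).

```python
import numpy as np

def knn_classify(X_train, y_train, x, k=3):
    """k-NN classifier: sign of the sum of labels over the k closest points."""
    dists = np.linalg.norm(X_train - x, axis=1)   # Euclidean distance to each point
    nearest = np.argsort(dists)[:k]               # indices N_k(x)
    return int(np.sign(y_train[nearest].sum()))   # majority vote for y in {-1, +1}

X_train = np.array([[0.0, 0.0], [1.0, 1.0], [3.0, 3.0], [4.0, 4.0]])
y_train = np.array([-1, -1, 1, 1])
print(knn_classify(X_train, y_train, np.array([0.5, 0.5]), k=3))  # -> -1
```

An odd $k$ avoids ties in the binary case.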
Example (figures): k-NN decision boundaries for K = 1, 3, 5, 25, 51, and 101.
Computations in KNN
Similar to KDE, there is essentially no training or learning phase; the computation is needed when applying the classifier.
Memory: $O(N)$. Test computation: $O(N)$.
Finding the nearest neighbors out of a set of millions of examples is still pretty hard.
Use smart data structures and algorithms to index the training data (KD-tree, ball tree, cover tree):
Memory: $O(N)$. Training computation: $O(N \log N)$. Test computation: $O(\log N)$.
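A short sketch of the indexing idea using SciPy's KD-tree; note that in high dimensions KD-tree queries can degrade toward brute-force cost, so the $O(\log N)$ figure is best-case.

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
X_train = rng.normal(size=(100000, 3))         # 100k training points in 3-D
y_train = rng.choice([-1, 1], size=100000)

tree = cKDTree(X_train)                        # build the index: O(N log N)
dist, idx = tree.query(np.zeros(3), k=5)       # ~O(log N) query for 5 nearest
print(int(np.sign(y_train[idx].sum())))        # majority vote among the neighbors
```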
Discriminative classifier
Directly estimate the decision boundary $h(x) = \ln \frac{p(y=1 \mid x)}{p(y=-1 \mid x)}$ or the posterior distribution $p(y \mid x)$.
Examples: logistic regression, neural networks.
Do not estimate $p(x \mid y)$ and $p(y)$.
$h(x)$ or $p(y \mid x)$ is a function of $x$ and has no probabilistic meaning for $x$; hence it can not be used to sample data points.
Why discriminative classifiers?
Avoid the difficult density estimation problem.
Empirically achieve better classification results.
What is the logistic regression model?
Assume that the posterior distribution $p(y=1 \mid x)$ takes a particular form:
$$p(y=1 \mid x, \theta) = \frac{1}{1 + \exp(-\theta^\top x)}$$
This is the logistic function $\sigma(u) = \frac{1}{1 + \exp(-u)}$ applied to the linear score $\theta^\top x$.
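The model in code; a minimal sketch of the assumed posterior (the parameter values are made up for illustration).

```python
import numpy as np

def sigmoid(u):
    """Logistic function sigma(u) = 1 / (1 + exp(-u))."""
    return 1.0 / (1.0 + np.exp(-u))

def posterior_pos(theta, x):
    """p(y = 1 | x, theta) under the logistic regression model."""
    return sigmoid(theta @ x)

theta = np.array([2.0, -1.0])                        # made-up parameters
print(posterior_pos(theta, np.array([1.0, 0.5])))    # ~0.82
```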
Learning parameters in logistic regression
Find $\theta$ such that the conditional likelihood of the labels is maximized:
$$\max_\theta\; l(\theta) := \sum_{i=1}^{N} \log p(y_i \mid x_i, \theta)$$
Good news: $l(\theta)$ is a concave function of $\theta$, and there is a single global optimum.
Bad news: no closed-form solution (resort to numerical methods).
The objective function $l(\theta)$
The logistic regression model (with labels $y \in \{0, 1\}$):
$$p(y=1 \mid x, \theta) = \frac{1}{1 + \exp(-\theta^\top x)}$$
Note that
$$p(y=0 \mid x, \theta) = \frac{\exp(-\theta^\top x)}{1 + \exp(-\theta^\top x)}$$
Plug in:
$$l(\theta) := \sum_{i=1}^{N} \log p(y_i \mid x_i, \theta) = \sum_{i=1}^{N} \left[\, y_i\,\theta^\top x_i - \log\left(1 + \exp(\theta^\top x_i)\right) \right]$$
The gradient of $l(\theta)$
$$l(\theta) := \sum_{i=1}^{N} \left[\, y_i\,\theta^\top x_i - \log\left(1 + \exp(\theta^\top x_i)\right) \right]$$
Gradient:
$$\frac{\partial l(\theta)}{\partial \theta} = \sum_{i=1}^{N} \left[\, y_i - \frac{\exp(\theta^\top x_i)}{1 + \exp(\theta^\top x_i)} \right] x_i$$
Setting it to 0 does not lead to a closed-form solution.
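A sketch that evaluates $l(\theta)$ and its gradient on made-up data and sanity-checks the gradient against finite differences; labels are assumed to be in $\{0, 1\}$, matching the derivation above.

```python
import numpy as np

def log_likelihood(theta, X, y):
    """l(theta) = sum_i [ y_i theta^T x_i - log(1 + exp(theta^T x_i)) ]."""
    s = X @ theta
    return np.sum(y * s - np.logaddexp(0.0, s))   # log(1 + e^s), computed stably

def gradient(theta, X, y):
    """dl/dtheta = sum_i [ y_i - sigma(theta^T x_i) ] x_i."""
    p = 1.0 / (1.0 + np.exp(-(X @ theta)))
    return X.T @ (y - p)

# Finite-difference check on made-up data.
rng = np.random.default_rng(0)
X, y = rng.normal(size=(20, 3)), rng.integers(0, 2, size=20)
theta = rng.normal(size=3)
g = gradient(theta, X, y)
eps = 1e-6
for j in range(3):
    e = np.zeros(3); e[j] = eps
    fd = (log_likelihood(theta + e, X, y) - log_likelihood(theta - e, X, y)) / (2 * eps)
    print(g[j], fd)   # the two columns should agree closely
```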
Gradient descent/ascent
One way to solve an unconstrained optimization problem is gradient descent: given an initial guess, we iteratively refine the guess by taking the direction of the negative gradient.
Think about going down a hill by taking the steepest direction at each step.
Update rule:
$$x^{t+1} = x^t - \eta_t \nabla f(x^t)$$
$\eta_t$ is called the step size or learning rate.
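A toy illustration of the update rule, minimizing a simple quadratic with a fixed, made-up step size.

```python
# Minimize f(x) = (x - 3)^2, whose gradient is f'(x) = 2(x - 3).
x, eta = 0.0, 0.1               # initial guess and step size (made-up values)
for t in range(50):
    x = x - eta * 2 * (x - 3)   # x_{t+1} = x_t - eta * grad f(x_t)
print(x)                        # converges toward the minimizer x = 3
```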
Gradient Ascent/Descent algorithm
Initialize the parameter $\theta^0$.
Do:
$$\theta^{t+1} = \theta^t + \eta \sum_{i=1}^{N} \left[\, y_i - \frac{\exp({\theta^t}^\top x_i)}{1 + \exp({\theta^t}^\top x_i)} \right] x_i$$
While $\|\theta^{t+1} - \theta^t\| > \epsilon$.
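Putting the pieces together, a minimal gradient-ascent training loop; the learning rate, tolerance, and toy data are made-up illustration values (on separable data like this, the parameters grow large before the updates become small).

```python
import numpy as np

def train_logistic(X, y, eta=0.1, tol=1e-6, max_iter=10000):
    """Gradient ascent on the conditional log-likelihood (labels in {0, 1})."""
    theta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ theta)))    # p(y = 1 | x_i, theta)
        theta_new = theta + eta * X.T @ (y - p)   # ascent step along the gradient
        if np.linalg.norm(theta_new - theta) < tol:
            break
        theta = theta_new
    return theta

X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])  # first column: bias
y = np.array([0, 0, 1, 1])
print(train_logistic(X, y))
```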
Boys vs girls (demo)
Naïve Bayes vs. logistic regression
Consider $y \in \{-1, +1\}$ and $x = (x_1, \dots, x_n)$.
Number of parameters:
Naïve Bayes: $2n + 1$ when all random variables are binary; $4n + 1$ for Gaussians ($2n$ means, $2n$ variances, and 1 for the prior).
Logistic regression: $n + 1$: $\theta_0, \theta_1, \dots, \theta_n$.
Naïve Bayes vs. logistic regression II
When the model assumptions are correct:
Both Naïve Bayes and logistic regression produce good classifiers.
When the model assumptions are incorrect:
Logistic regression is less biased: it does not assume conditional independence.
Logistic regression has fewer parameters.
It is expected to outperform Naïve Bayes in practice.
Naïve Bayes vs. logistic regression III
Estimation method:
Naïve Bayes parameter estimates are decoupled (super easy).
Logistic regression parameter estimates are coupled (less easy).
How to estimate the parameters in logistic regression? Maximum likelihood estimation: more specifically, maximize the conditional likelihood of the labels.
Handwritten digits (demo)
Multiclass logistic regression
Assign input vectors $x_i$, $i = 1, \dots, N$, into one of $K$ classes $C_k$, $k = 1, \dots, K$.
Assume that the posterior distribution takes a particular form (the softmax function):
$$p(y = C_k \mid x, \theta_1, \dots, \theta_K) = \frac{\exp(\theta_k^\top x)}{\sum_{k'} \exp(\theta_{k'}^\top x)}$$
Now, let's introduce some notation:
$$\mu_k(x) := p(y = C_k \mid x, \theta_1, \dots, \theta_K), \qquad y_{ik} := \begin{cases} 1 & \text{if } y_i = C_k \\ 0 & \text{otherwise} \end{cases}$$
Learning parameters in multiclass logistic regression
Given all the input data $(x_1, y_1), (x_2, y_2), \dots, (x_N, y_N)$.
The log-likelihood can be written as:
$$l(\theta) = \log \prod_{i=1}^{N} \prod_{k=1}^{K} \mu_k(x_i)^{y_{ik}} = \sum_{i=1}^{N} \sum_{k=1}^{K} y_{ik} \log \mu_k(x_i) = \sum_{i=1}^{N} \sum_{k=1}^{K} y_{ik}\,\theta_k^\top x_i - \sum_{i=1}^{N} \log \sum_{k'=1}^{K} \exp(\theta_{k'}^\top x_i)$$
Learning parameters in multiclass logistic regression (continued)
Find $\theta$ such that the conditional likelihood of the labels is maximized.
$-l(\theta)$ is also known as the cross-entropy error function for the multiclass case.
Compute the gradient of $l(\theta)$ with respect to one parameter vector $\theta_k$:
$$\frac{\partial l(\theta)}{\partial \theta_k} = \sum_{i=1}^{N} \left( y_{ik} - \mu_k(x_i) \right) x_i$$
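A vectorized sketch of the multiclass gradient, assuming a one-hot label matrix $Y$ ($N \times K$) and a parameter matrix with one column per class; the data here are made up.

```python
import numpy as np

def softmax(S):
    """Row-wise softmax: mu_k(x_i) = exp(s_ik) / sum_k' exp(s_ik')."""
    S = S - S.max(axis=1, keepdims=True)      # shift rows for numerical stability
    E = np.exp(S)
    return E / E.sum(axis=1, keepdims=True)

def multiclass_gradient(Theta, X, Y):
    """dl/dTheta: column k is sum_i (y_ik - mu_k(x_i)) x_i."""
    Mu = softmax(X @ Theta)                   # N x K posteriors mu_k(x_i)
    return X.T @ (Y - Mu)                     # d x K gradient matrix

# Tiny made-up example: N = 4 points, d = 2 features, K = 3 classes.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 2))
Y = np.eye(3)[rng.integers(0, 3, size=4)]     # one-hot labels y_ik
Theta = np.zeros((2, 3))
print(multiclass_gradient(Theta, X, Y))
```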