
Discriminative classifier and logistic regression
Le Song

Machine Learning
CS 7641, CSE/ISYE 6740, Fall 2015

Classification
Represent the data

A label is provided for each data point, e.g., $y \in \{-1, +1\}$

Classifier

Boys vs. Girls (demo)

How to come up with decision boundary


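Given class conditional distribution: $P(x \mid y = 1)$, $P(x \mid y = -1)$, and class prior: $P(y = 1)$, $P(y = -1)$

For example, $P(x \mid y = 1) = \mathcal{N}(x; \mu_1, \Sigma_1)$ and $P(x \mid y = -1) = \mathcal{N}(x; \mu_2, \Sigma_2)$. Which label should a new point $x$ get?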

Use Bayes rule


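$$P(y \mid x) = \frac{P(x \mid y)\, P(y)}{P(x)}$$

posterior: $P(y \mid x)$
likelihood: $P(x \mid y)$
prior: $P(y)$
normalization constant: $P(x) = \sum_{y} P(x \mid y)\, P(y)$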
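
As a minimal sketch of Bayes rule in action, assume a hypothetical one-dimensional, two-class model with Gaussian class conditionals. The priors, means, standard deviations, and the helper names `gaussian_pdf` and `posterior` are illustrative choices, not taken from the slides.

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of a 1-D Gaussian N(mean, std^2) evaluated at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def posterior(x, priors, class_params):
    """Return P(y | x) for every class y via Bayes rule."""
    # Numerator of Bayes rule: likelihood times prior, one entry per class.
    joint = {y: gaussian_pdf(x, *class_params[y]) * priors[y] for y in priors}
    # Normalization constant P(x): sum of the numerators over all classes.
    evidence = sum(joint.values())
    return {y: joint[y] / evidence for y in joint}

# Hypothetical toy model: (mean, std) per class, equal priors.
priors = {1: 0.5, -1: 0.5}
class_params = {1: (2.0, 1.0), -1: (-2.0, 1.0)}

post_mid = posterior(0.0, priors, class_params)   # point halfway between the means
post_far = posterior(3.0, priors, class_params)   # point close to the class-1 mean
```

With equal priors and symmetric class conditionals, the midpoint gets posterior 1/2 for each class, while a point near one mean is assigned almost entirely to that class.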

Bayes Decision Rule


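Learning: prior: $P(y)$, class conditional distribution: $P(x \mid y)$

The posterior probability of a test point: $q_i(x) := P(y = i \mid x) = \frac{P(x \mid y = i)\, P(y = i)}{P(x)}$

Bayes decision rule:
If $q_i(x) > q_j(x)$, then $y = i$, otherwise $y = j$

Alternatively:
If ratio $\ell(x) = \frac{P(x \mid y = i)}{P(x \mid y = j)} > \frac{P(y = j)}{P(y = i)}$, then $y = i$, otherwise $y = j$

Or look at the log-likelihood ratio $h(x) = -\ln \frac{q_i(x)}{q_j(x)}$: then $y = i$ when $h(x) < 0$, otherwise $y = j$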
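
The three forms of the decision rule should pick the same class. Here is a small sketch that checks this, assuming a hypothetical two-class model with 1-D Gaussian class conditionals and the sign convention $h(x) = -\ln(q_i(x)/q_j(x))$. The constants `PRIOR` and `PARAMS` and all function names are made up for illustration.

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of a 1-D Gaussian N(mean, std^2) evaluated at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

# Hypothetical model for two classes named "i" and "j".
PRIOR = {"i": 0.7, "j": 0.3}
PARAMS = {"i": (0.0, 1.0), "j": (2.0, 1.0)}  # (mean, std)

def q(x, c):
    """P(x | y=c) P(y=c); the shared factor 1/P(x) cancels in comparisons."""
    return gaussian_pdf(x, *PARAMS[c]) * PRIOR[c]

def rule_posterior(x):
    """Pick the class with the larger posterior."""
    return "i" if q(x, "i") > q(x, "j") else "j"

def rule_likelihood_ratio(x):
    """Compare the likelihood ratio against the ratio of priors."""
    ratio = gaussian_pdf(x, *PARAMS["i"]) / gaussian_pdf(x, *PARAMS["j"])
    return "i" if ratio > PRIOR["j"] / PRIOR["i"] else "j"

def rule_log_ratio(x):
    """Use the sign of h(x) = -ln(q_i(x) / q_j(x))."""
    h = -math.log(q(x, "i") / q(x, "j"))
    return "i" if h < 0 else "j"

test_points = [-3.0 + 0.5 * n for n in range(17)]  # -3.0, -2.5, ..., 5.0
decisions = [(rule_posterior(x), rule_likelihood_ratio(x), rule_log_ratio(x))
             for x in test_points]
```

For this toy model the boundary sits near $x \approx 1.42$, shifted toward the class-j mean because class i has the larger prior.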

More on Bayes error of Bayes rule


Bayes error is the lower bound of the probability of classification error
Bayes decision rule is the theoretically best classifier: it minimizes the probability of classification error
However, computing the Bayes error or the Bayes decision rule is in general a very complex problem. Why?
Need density estimation
Need to do integrals, e.g. of the class conditional densities over the decision regions

What do people do in practice?


Use simplifying assumption for $P(x \mid y = 1)$
Assume $P(x \mid y = 1)$ is Gaussian, $\mathcal{N}(\mu_1, \Sigma_1)$
Assume $P(x \mid y = 1)$ is fully factorized

Use geometric intuitions
k-nearest neighbor classifier
Support vector machine

Directly go for the decision boundary $h(x) = -\ln \frac{q_i(x)}{q_j(x)}$

Logistic regression
Neural networks

Naive Bayes Classifier

Use Bayes decision rule for classification

But assume $P(x \mid y = 1)$ is fully factorized:
$$P(x \mid y = 1) = \prod_{j=1}^{n} P(x_j \mid y = 1)$$

Or the variables corresponding to each dimension of the data are independent given the label
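
To make the factorization concrete, here is a minimal sketch of a naive Bayes classifier. It assumes each $P(x_j \mid y)$ is a 1-D Gaussian, which is one possible choice and not something the slide specifies. The function names, the variance floor, and the toy data are all illustrative.

```python
import math
from collections import defaultdict

def fit_gaussian_nb(X, y):
    """Estimate the class prior and per-dimension Gaussian mean/variance."""
    by_class = defaultdict(list)
    for xi, yi in zip(X, y):
        by_class[yi].append(xi)
    model = {}
    for label, rows in by_class.items():
        count = len(rows)
        columns = list(zip(*rows))  # one tuple of values per dimension
        means = [sum(col) / count for col in columns]
        # Assumption of this sketch: floor the variance so it stays positive.
        variances = [max(sum((v - m) ** 2 for v in col) / count, 1e-9)
                     for col, m in zip(columns, means)]
        model[label] = (count / len(X), means, variances)
    return model

def log_joint(model, x, label):
    """log P(y) + sum_j log P(x_j | y), i.e. the factorized joint."""
    prior, means, variances = model[label]
    total = math.log(prior)
    for xj, m, v in zip(x, means, variances):
        total += -0.5 * math.log(2.0 * math.pi * v) - (xj - m) ** 2 / (2.0 * v)
    return total

def predict(model, x):
    """Bayes decision rule: pick the label with the largest joint."""
    return max(model, key=lambda label: log_joint(model, x, label))

# Hypothetical toy data: two well-separated classes in two dimensions.
X = [[1.0, 2.1], [1.2, 1.9], [0.8, 2.0], [3.0, 0.1], [3.2, -0.1], [2.9, 0.0]]
y = [1, 1, 1, -1, -1, -1]
model = fit_gaussian_nb(X, y)
```

Because every dimension is estimated separately within each class, fitting is just a handful of averages.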

Naive Bayes classifier is a generative model

Once you have the model, you can generate samples from it:
For each data point $x^i$:
Sample a label, $y^i \in \{1, 2\}$, according to the class prior $P(y)$
Sample the value of $x^i$ from the class conditional $P(x \mid y^i)$

Naive Bayes: conditioned on $y^i$, generate first dimension $x_1^i$, second dimension $x_2^i$, ..., independently
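
The sampling procedure can be sketched as follows, assuming Gaussian per-dimension class conditionals and hypothetical parameter values. The function name `sample_naive_bayes` and all numbers are illustrative.

```python
import random

def sample_naive_bayes(num_points, priors, means, stds, seed=0):
    """Draw (x, y) pairs: first the label, then each dimension independently."""
    rng = random.Random(seed)
    labels = list(priors)
    weights = [priors[c] for c in labels]
    data = []
    for _ in range(num_points):
        # Step 1: sample a label according to the class prior.
        y = rng.choices(labels, weights=weights, k=1)[0]
        # Step 2: conditioned on y, sample every dimension on its own.
        x = [rng.gauss(m, s) for m, s in zip(means[y], stds[y])]
        data.append((x, y))
    return data

# Hypothetical two-class, two-dimensional model.
priors = {1: 0.3, 2: 0.7}
means = {1: [0.0, 0.0], 2: [5.0, -5.0]}
stds = {1: [1.0, 1.0], 2: [1.0, 2.0]}
samples = sample_naive_bayes(1000, priors, means, stds)
```

The empirical label frequencies in `samples` should be close to the priors, which is a quick sanity check on the sampler.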

Difference from mixture of Gaussian models


Purpose is different (density estimation vs. classification)
Data different (with/without labels)
Learning different (EM or not)

K-nearest neighbors
k-nearest neighbor classifier: assign a label by taking a majority vote over the $k$ training points closest to $x$
For $k > 1$, the k-nearest neighbor rule generalizes the nearest neighbor rule
To define this more mathematically:
$I_k(x)$ := indices of the $k$ training points closest to $x$.
If $y^i \in \{-1, +1\}$, then we can write the $k$-nearest neighbor classifier as:
$$f_k(x) := \mathrm{sign}\left( \sum_{i \in I_k(x)} y^i \right)$$
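
A brute-force sketch of this classifier, assuming Euclidean distance, labels in $\{-1, +1\}$, and odd $k$ so the vote cannot tie. The function names and toy data are illustrative.

```python
import math

def knn_indices(train_X, x, k):
    """I_k(x): indices of the k training points closest to x."""
    order = sorted(range(len(train_X)), key=lambda i: math.dist(train_X[i], x))
    return order[:k]

def knn_classify(train_X, train_y, x, k):
    """f_k(x): sign of the summed labels of the k nearest neighbors."""
    vote = sum(train_y[i] for i in knn_indices(train_X, x, k))
    return 1 if vote > 0 else -1

# Hypothetical toy data: one cluster per class.
train_X = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
train_y = [-1, -1, -1, 1, 1, 1]

label_a = knn_classify(train_X, train_y, [0.5, 0.5], 3)
label_b = knn_classify(train_X, train_y, [5.5, 5.5], 3)
```

Sorting all distances costs more than needed when only the $k$ smallest are used, but it keeps the sketch short.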

[Figures: k-NN examples with K = 1, 3, 5, 25, 51, and 101]

Computations in K-NN
Similar to KDE, essentially no training or learning phase; computation is needed when applying the classifier
Memory: $O(md)$ (for $m$ training points in $d$ dimensions)
Finding the nearest neighbors out of a set of millions of examples is still pretty hard
Test computation: $O(md)$

Use smart data structures and algorithms to index training data
Memory: $O(m)$
Training computation: $O(m \log m)$
Test computation: $O(\log m)$
KD-tree, Ball tree, Cover tree
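
To illustrate the indexing idea, here is a compact KD-tree sketch with a single nearest-neighbor query. It is a simplified illustration, not a tuned implementation: re-sorting at every level makes the build slower than $O(m \log m)$, and the logarithmic query time only holds in favorable cases (low dimension, well-spread data). The function names and the point set are illustrative.

```python
import math

def build_kdtree(points, depth=0):
    """Recursively split on the median, cycling through the axes."""
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return {
        "point": points[mid],
        "axis": axis,
        "left": build_kdtree(points[:mid], depth + 1),
        "right": build_kdtree(points[mid + 1:], depth + 1),
    }

def nearest(node, query, best=None):
    """Return (distance, point) for the stored point closest to query."""
    if node is None:
        return best
    d = math.dist(node["point"], query)
    if best is None or d < best[0]:
        best = (d, node["point"])
    diff = query[node["axis"]] - node["point"][node["axis"]]
    if diff < 0:
        near, far = node["left"], node["right"]
    else:
        near, far = node["right"], node["left"]
    best = nearest(near, query, best)
    # Visit the far side only if the splitting plane is closer than the best match.
    if abs(diff) < best[0]:
        best = nearest(far, query, best)
    return best

# Hypothetical 2-D point set.
points = [(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)]
tree = build_kdtree(points)
dist, point = nearest(tree, (9, 2))
```

The pruning test is what saves work: whole subtrees are skipped when they cannot contain anything closer than the current best match.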

Discriminative classifier
Directly estimate decision boundary $h(x) = -\ln \frac{q_i(x)}{q_j(x)}$ or posterior distribution $p(y \mid x)$
Logistic regression, Neural networks
Do not estimate $p(x \mid y)$ and $p(y)$

$h(x)$ or $f(x) := p(y = 1 \mid x)$ is a function of $x$, and does not have probabilistic meaning for $x$, hence cannot be used to sample data points

Why discriminative classifier?


Avoid difficult density estimation problem
Empirically achieve better classification results

What is the logistic regression model

Assume that the posterior distribution $p(y = 1 \mid x)$ takes a particular form
$$p(y = 1 \mid x, \theta) = \frac{1}{1 + \exp(-\theta^\top x)}$$
Logistic function $f(u) = \frac{1}{1 + \exp(-u)}$
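
A minimal sketch of the logistic function and the resulting posterior. The function names are illustrative, and splitting on the sign of $u$ is a standard way to avoid overflow rather than something the slide prescribes.

```python
import math

def logistic(u):
    """f(u) = 1 / (1 + exp(-u)), evaluated without overflow for large |u|."""
    if u >= 0:
        return 1.0 / (1.0 + math.exp(-u))
    e = math.exp(u)  # u < 0, so exp(u) cannot overflow
    return e / (1.0 + e)

def predict_proba(theta, x):
    """p(y = 1 | x, theta) with theta^T x passed through the logistic function."""
    return logistic(sum(t * v for t, v in zip(theta, x)))

p_boundary = predict_proba([1.0, -1.0], [2.0, 2.0])  # theta^T x = 0
```

Points with $\theta^\top x = 0$ get probability 1/2, so the decision boundary of this model is the hyperplane $\theta^\top x = 0$.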

Learning parameters in logistic regression


Find $\theta$, such that the conditional likelihood of the labels is maximized
$$\max_{\theta}\; \ell(\theta) := \log \prod_{i=1}^{m} P(y^i \mid x^i, \theta)$$

Good news: $\ell(\theta)$ is a concave function of $\theta$, so every local optimum is a global optimum.

Bad news: no closed form solution (resort to numerical methods)

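The objective function $\ell(\theta)$

Logistic regression model:
$$p(y = 1 \mid x, \theta) = \frac{1}{1 + \exp(-\theta^\top x)}$$
Note that
$$p(y = 0 \mid x, \theta) = 1 - \frac{1}{1 + \exp(-\theta^\top x)} = \frac{\exp(-\theta^\top x)}{1 + \exp(-\theta^\top x)}$$
Plug in:
$$\ell(\theta) := \log \prod_{i=1}^{m} P(y^i \mid x^i, \theta) = \sum_{i} \left[ (y^i - 1)\, \theta^\top x^i - \log\left(1 + \exp(-\theta^\top x^i)\right) \right]$$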
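
The closed form of the objective can be sketched directly, assuming labels $y^i \in \{0, 1\}$. The function name and data are illustrative, and the `log1p` rewrite is a numerical-stability choice of this sketch.

```python
import math

def log_likelihood(theta, X, y):
    """l(theta) = sum_i (y_i - 1) * z_i - log(1 + exp(-z_i)), z_i = theta^T x_i."""
    total = 0.0
    for xi, yi in zip(X, y):
        z = sum(t * v for t, v in zip(theta, xi))
        # log(1 + exp(-z)), rewritten so exp never sees a large positive argument.
        log_term = max(-z, 0.0) + math.log1p(math.exp(-abs(z)))
        total += (yi - 1) * z - log_term
    return total

# Hypothetical data; the first feature is a constant 1 acting as an intercept.
X = [[1.0, 2.0], [1.0, -1.0], [1.0, 0.5]]
y = [1, 0, 1]
value_at_zero = log_likelihood([0.0, 0.0], X, y)
```

At $\theta = 0$ every point has probability 1/2, so the objective equals $-m \ln 2$, which gives a quick check of the formula.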

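The gradient of $\ell(\theta)$

$$\ell(\theta) := \log \prod_{i=1}^{m} P(y^i \mid x^i, \theta) = \sum_{i} \left[ (y^i - 1)\, \theta^\top x^i - \log\left(1 + \exp(-\theta^\top x^i)\right) \right]$$

Gradient
$$\frac{\partial \ell(\theta)}{\partial \theta} = \sum_{i} \left[ (y^i - 1)\, x^i + \frac{\exp(-\theta^\top x^i)\, x^i}{1 + \exp(-\theta^\top x^i)} \right]$$

Setting it to 0 does not lead to a closed form solution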
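
Each summand of the gradient simplifies to $(y^i - p(y = 1 \mid x^i, \theta))\, x^i$, since $(y - 1) + \frac{e^{-z}}{1 + e^{-z}} = y - \frac{1}{1 + e^{-z}}$. The sketch below uses that form and compares it against a central finite difference. It assumes labels in $\{0, 1\}$, and the names and data are illustrative.

```python
import math

def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def log_likelihood(theta, X, y):
    """l(theta) = sum_i log p(y_i | x_i, theta) for labels in {0, 1}."""
    total = 0.0
    for xi, yi in zip(X, y):
        p = sigmoid(sum(t * v for t, v in zip(theta, xi)))
        total += math.log(p if yi == 1 else 1.0 - p)
    return total

def gradient(theta, X, y):
    """dl/dtheta = sum_i (y_i - p(y=1 | x_i, theta)) * x_i."""
    grad = [0.0] * len(theta)
    for xi, yi in zip(X, y):
        p = sigmoid(sum(t * v for t, v in zip(theta, xi)))
        for j, v in enumerate(xi):
            grad[j] += (yi - p) * v
    return grad

# Hypothetical data and evaluation point.
X = [[1.0, 2.0], [1.0, -1.0], [1.0, 0.5]]
y = [1, 0, 1]
theta = [0.3, -0.7]
analytic = gradient(theta, X, y)

# Central finite differences as an independent check of the formula.
eps = 1e-6
numeric = []
for j in range(len(theta)):
    up = list(theta)
    up[j] += eps
    down = list(theta)
    down[j] -= eps
    numeric.append((log_likelihood(up, X, y) - log_likelihood(down, X, y)) / (2 * eps))
```

Agreement between `analytic` and `numeric` is a cheap way to catch sign errors before running an optimizer.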

Gradient descent/ascent
One way to solve an unconstrained optimization problem is gradient descent
Given an initial guess, we iteratively refine the guess by taking the direction of the negative gradient
Think about going down a hill by taking the steepest direction at each step
Update rule: $x_{k+1} = x_k - \eta_k \nabla f(x_k)$
$\eta_k$ is called the step size or learning rate

Gradient Ascent/Descent algorithm


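Initialize parameter $\theta^0$
Do
$$\theta^{t+1} \leftarrow \theta^{t} + \eta \sum_{i} \left[ (y^i - 1)\, x^i + \frac{\exp(-\theta^{t\top} x^i)\, x^i}{1 + \exp(-\theta^{t\top} x^i)} \right]$$
While $\|\theta^{t+1} - \theta^{t}\| > \epsilon$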
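
The loop can be sketched as below, assuming labels in $\{0, 1\}$, a fixed step size, and a cap on the number of iterations. The names `fit_logistic`, `step`, `tol`, and `max_iter` and the toy data are illustrative choices, not values from the slides.

```python
import math

def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def gradient(theta, X, y):
    """Gradient of the log-likelihood: sum_i (y_i - p_i) * x_i."""
    grad = [0.0] * len(theta)
    for xi, yi in zip(X, y):
        p = sigmoid(sum(t * v for t, v in zip(theta, xi)))
        for j, v in enumerate(xi):
            grad[j] += (yi - p) * v
    return grad

def fit_logistic(X, y, step=0.1, tol=1e-6, max_iter=10000):
    """Gradient ascent until the parameter change falls below tol."""
    theta = [0.0] * len(X[0])
    for _ in range(max_iter):
        grad = gradient(theta, X, y)
        new_theta = [t + step * g for t, g in zip(theta, grad)]
        change = math.sqrt(sum((a - b) ** 2 for a, b in zip(new_theta, theta)))
        theta = new_theta
        if change < tol:
            break
    return theta

# Hypothetical overlapping 1-D data; the first feature is a constant intercept.
X = [[1.0, -2.0], [1.0, -1.0], [1.0, -0.5], [1.0, 0.5], [1.0, 1.0], [1.0, 2.0]]
y = [0, 1, 0, 1, 0, 1]
theta_hat = fit_logistic(X, y)
```

The classes overlap on purpose: on perfectly separable data the likelihood has no finite maximizer and the parameters would keep growing.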

Boys vs girls (demo)


Naive Bayes vs. logistic regression

Consider $y \in \{1, -1\}$, $x \in \mathbb{R}^n$

Number of parameters
Naive Bayes: $2n + 1$, when all random variables are binary
$4n + 1$ for Gaussians: $2n$ mean, $2n$ variance, and 1 for prior

logistic regression: $n + 1$: $\theta_0, \theta_1, \theta_2, \ldots, \theta_n$
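
The counts above are simple enough to tabulate. The helper names below are made up for illustration, and the formulas are the ones stated on the slide for two classes and $n$ features.

```python
def naive_bayes_param_count(n, kind):
    """Two-class naive Bayes: per-class, per-feature parameters plus one prior."""
    if kind == "binary":
        return 2 * n + 1   # one Bernoulli parameter per feature and class
    if kind == "gaussian":
        return 4 * n + 1   # a mean and a variance per feature and class
    raise ValueError("kind must be 'binary' or 'gaussian'")

def logistic_regression_param_count(n):
    """One weight per feature plus the intercept theta_0."""
    return n + 1

counts = {
    "nb_binary": naive_bayes_param_count(10, "binary"),
    "nb_gaussian": naive_bayes_param_count(10, "gaussian"),
    "logistic": logistic_regression_param_count(10),
}
```

For $n = 10$ this gives 21, 41, and 11 parameters respectively.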

Naive Bayes vs logistic regression II

When model assumptions correct
Both Naive Bayes and logistic regression produce good classifiers

When model assumptions incorrect
logistic regression is less biased: it does not assume conditional independence
logistic regression has fewer parameters
expected to outperform Naive Bayes in practice

Naive Bayes vs logistic regression III

Estimation method:
Naive Bayes parameter estimates are decoupled (super easy)
Logistic regression parameter estimates are coupled (less easy)

How to estimate the parameters in logistic regression?
Maximum likelihood estimation
More specifically, maximize the conditional likelihood of the labels

Handwritten digits (demo)


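Multiclass logistic regression

Assign input vector $x^i$, $i = 1, \ldots, m$, into one of $K$ classes $C_k$, $k = 1, \ldots, K$

Assume that the posterior distribution takes a particular form:
$$P(y^i = C_k \mid x^i, \theta_1, \ldots, \theta_K) = \frac{\exp(\theta_k^\top x^i)}{\sum_{k'} \exp(\theta_{k'}^\top x^i)}$$

Now, let's introduce some notation:
$u_k^i := P(y^i = C_k \mid x^i, \theta_1, \ldots, \theta_K)$
$t_k^i := \delta(y^i, C_k)$, which is 1 if $y^i = C_k$ and 0 otherwise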
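
A minimal sketch of this posterior (the softmax of the scores $\theta_k^\top x$). The function name and parameters are illustrative. Subtracting the largest score before exponentiating is a numerical-stability choice of this sketch that leaves the ratio unchanged.

```python
import math

def class_posteriors(thetas, x):
    """P(y = C_k | x, theta_1..theta_K) for every class k."""
    scores = [sum(t * v for t, v in zip(theta, x)) for theta in thetas]
    top = max(scores)  # shift so the largest exponent is 0
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical parameters for K = 3 classes and 2-D inputs.
thetas = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
probs = class_posteriors(thetas, [2.0, 0.5])
uniform = class_posteriors([[0.3, 0.3]] * 3, [1.0, 1.0])
```

When all parameter vectors are equal, every class gets probability $1/K$, as in `uniform`.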

Learning parameters in multiclass logistic regression


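Given all the input data $D = \{(x^1, y^1), (x^2, y^2), \ldots, (x^m, y^m)\}$

The log-likelihood can be written as (with $u_k^i$ the model probability of class $C_k$ for $x^i$, and $t_k^i$ the 0/1 indicator of the true class):
$$\ell(\theta) = \log \prod_{i=1}^{m} \prod_{k=1}^{K} \left(u_k^i\right)^{t_k^i}
= \sum_{i=1}^{m} \sum_{k=1}^{K} t_k^i \log u_k^i
= \sum_{i=1}^{m} \left[ \sum_{k=1}^{K} t_k^i\, \theta_k^\top x^i - \log \sum_{k'=1}^{K} \exp(\theta_{k'}^\top x^i) \right]$$
The last step uses $\sum_k t_k^i = 1$.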

Learning parameters in multiclass logistic regression


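Find $\theta$ such that the conditional likelihood of the labels is maximized
$-\ell(\theta)$ is also known as the cross-entropy error function for multiclass
Compute the gradient of $\ell(\theta)$ with respect to one parameter vector $\theta_k$:
$$\frac{\partial \ell(\theta)}{\partial \theta_k} = \sum_{i=1}^{m} \left( t_k^i - u_k^i \right) x^i$$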
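
The gradient with respect to one parameter vector can be sketched as "indicator minus predicted probability, times the input", summed over the data. Classes are encoded as integers 0..K-1 here, which is an assumption of this sketch, as are the function names and the data.

```python
import math

def class_probs(thetas, x):
    """Softmax posterior over classes for one input x."""
    scores = [sum(t * v for t, v in zip(theta, x)) for theta in thetas]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def log_likelihood(thetas, X, labels):
    """Sum over the data of the log probability of the true class."""
    return sum(math.log(class_probs(thetas, x)[c]) for x, c in zip(X, labels))

def gradient_for_class(thetas, X, labels, k):
    """Gradient of the log-likelihood with respect to theta_k."""
    grad = [0.0] * len(X[0])
    for x, c in zip(X, labels):
        prob_k = class_probs(thetas, x)[k]
        indicator_k = 1.0 if c == k else 0.0
        for j, v in enumerate(x):
            grad[j] += (indicator_k - prob_k) * v
    return grad

# Hypothetical data: 4 points, 3 classes, constant first feature as intercept.
X = [[1.0, 0.5], [1.0, -1.5], [1.0, 2.0], [1.0, 0.0]]
labels = [0, 1, 2, 0]
thetas = [[0.1, 0.2], [-0.3, 0.4], [0.0, -0.5]]
grads = [gradient_for_class(thetas, X, labels, k) for k in range(3)]
```

Because the indicators and the probabilities each sum to one over the classes, the per-class gradients sum to the zero vector, which makes a handy sanity check.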