
# Discriminative classifier and logistic regression

Le Song

Machine Learning
CS 7641, CSE/ISYE 6740, Fall 2015

## Classification

Represent the data as pairs $(x^i, y^i)$, with features $x \in \mathbb{R}^n$ and labels $y \in \{-1, 1\}$.
A classifier is a function $h(x)$ that maps a feature vector to a label.
## How to come up with a decision boundary

Given the class conditional distributions $p(x \mid y = 1)$ and $p(x \mid y = -1)$, and the class priors $p(y = 1)$ and $p(y = -1)$, how do we decide the label of a new point $x$?

## Use Bayes rule

$$p(y \mid x) = \frac{p(x \mid y)\, p(y)}{p(x)}$$

posterior = likelihood $\times$ prior / normalization constant

Learning: estimate the prior $p(y)$ and the class conditional $p(x \mid y)$ from data.

Decision rule: if $p(y = 1 \mid x) > p(y = -1 \mid x)$, predict $y = 1$; otherwise predict $y = -1$.

Alternatively: if the ratio
$$\frac{p(x \mid y = 1)\, p(y = 1)}{p(x \mid y = -1)\, p(y = -1)} > 1,$$
predict $y = 1$; otherwise predict $y = -1$. Equivalently, predict $y = 1$ when
$$\ln \frac{p(x \mid y = 1)\, p(y = 1)}{p(x \mid y = -1)\, p(y = -1)} > 0.$$
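As a concrete illustration (not from the slides), the log-odds decision rule can be sketched for two 1-D Gaussian class conditionals; all parameter values below are made up for the example:

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of N(mean, std^2) at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2 * math.pi))

def bayes_classify(x, prior_pos=0.5,
                   mean_pos=2.0, std_pos=1.0,
                   mean_neg=-2.0, std_neg=1.0):
    """Predict +1 if the log posterior ratio ln[p(x|+1)p(+1)/p(x|-1)p(-1)] > 0."""
    log_ratio = (math.log(gaussian_pdf(x, mean_pos, std_pos))
                 + math.log(prior_pos)
                 - math.log(gaussian_pdf(x, mean_neg, std_neg))
                 - math.log(1 - prior_pos))
    return 1 if log_ratio > 0 else -1
```

With equal priors and equal variances, the boundary falls midway between the two means, as the rule predicts.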

## More on Bayes error of Bayes rule

The Bayes error is the lower bound on the probability of classification error.
The Bayes decision rule is the theoretically best classifier: it minimizes the probability of classification error.
However, computing the Bayes error or the Bayes decision rule is in general a very complex problem. Why?
Need density estimation.
Need to do integrals, e.g. $P(\text{error}) = \int \min\{p(x, y = 1), p(x, y = -1)\}\, dx$.

## What do people do in practice?

Use simplifying assumptions for $p(x \mid y)$:
Assume $p(x \mid y = 1)$ is Gaussian, $\mathcal{N}(\mu_1, \Sigma_1)$
Assume $p(x \mid y = 1)$ is fully factorized, $\prod_j p(x_j \mid y = 1)$

## Use geometric intuitions

k-nearest neighbor classifier
Support vector machine

## Directly go for the decision boundary $h(x)$

$$h(x) = \ln \frac{p(y = 1 \mid x)}{p(y = -1 \mid x)}$$

Logistic regression
Neural networks

## Naïve Bayes classifier

Use the Bayes decision rule for classification,
but assume $p(x \mid y = 1)$ is fully factorized:
$$p(x \mid y = 1) = \prod_{j=1}^{n} p(x_j \mid y = 1)$$
In other words, the variables corresponding to each dimension of the data are independent given the label.

## Naïve Bayes classifier is a generative model

Once you have the model, you can generate samples from it.
For each data point $i$:
Sample a label $y^i \in \{1, 2\}$ according to the class prior $p(y)$
Sample the value of $x^i$ from the class conditional $p(x \mid y^i)$
Naïve Bayes: conditioned on $y^i$, generate the first dimension $x_1$, the second dimension $x_2$, ..., up to $x_n$, independently.

Difference from mixture of Gaussian models:
Purpose is different (density estimation vs. classification)
Data are different (without/with labels)
Learning is different (EM or not)
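The generative procedure above can be sketched for a 2-D Gaussian naïve Bayes model; the priors, means, and standard deviations below are made-up illustrations:

```python
import random

CLASS_PRIOR = {1: 0.4, 2: 0.6}                 # p(y) (hypothetical values)
CLASS_MEANS = {1: [0.0, 0.0], 2: [3.0, 3.0]}   # per-dimension means
CLASS_STDS  = {1: [1.0, 1.0], 2: [0.5, 0.5]}   # per-dimension std devs

def sample_point(rng):
    """Sample (x, y): first the label from the class prior,
    then each dimension of x independently given that label."""
    y = 1 if rng.random() < CLASS_PRIOR[1] else 2
    x = [rng.gauss(m, s) for m, s in zip(CLASS_MEANS[y], CLASS_STDS[y])]
    return x, y

rng = random.Random(0)
data = [sample_point(rng) for _ in range(5)]
```

Each dimension of $x$ is drawn independently given $y$, which is exactly the naïve Bayes factorization assumption.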

## K-nearest neighbors

k-nearest neighbor classifier: assign a label to $x$ by taking a majority vote over the $k$ training points closest to $x$.
For $k > 1$, the k-nearest neighbor rule generalizes the nearest neighbor rule.
To define this more mathematically: let $N_k(x)$ denote the indices of the $k$ training points closest to $x$.
If $y^i \in \{-1, +1\}$, then we can write the k-nearest neighbor classifier as:
$$h(x) = \mathrm{sign}\left( \sum_{i \in N_k(x)} y^i \right)$$
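The majority-vote rule above can be sketched as a brute-force implementation; the training points and labels below are a made-up toy set:

```python
import math

def knn_classify(x, train_x, train_y, k=3):
    """Majority vote over the k training points closest to x
    (labels are -1/+1, so the sign of the vote sum decides)."""
    order = sorted(range(len(train_x)),
                   key=lambda i: math.dist(x, train_x[i]))
    votes = sum(train_y[i] for i in order[:k])
    return 1 if votes > 0 else -1

# Toy data: one cluster near the origin (-1), one near (5, 5) (+1).
train_x = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
train_y = [-1, -1, -1, 1, 1, 1]
```

Sorting all $m$ distances costs $O(m \log m)$ per query; the smart index structures discussed later avoid this.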

## Example

Decision boundaries of the k-nearest neighbor classifier for K = 1, 3, 5, 25, 51, and 101 (figures omitted).

## Computations in K-NN

Similar to KDE, there is essentially no training or learning phase; computation is needed when applying the classifier.
Memory: $O(mn)$ for $m$ training points in $n$ dimensions
Test computation: $O(mn)$ per query
Finding the nearest neighbors out of a set of millions of examples is still pretty hard.
Use smart data structures and algorithms to index the training data:
Memory: $O(m)$; training computation: $O(m \log m)$; test computation: $O(\log m)$
KD-tree, Ball tree, Cover tree

## Discriminative classifier

Directly estimate the decision boundary
$$h(x) = \ln \frac{p(y = 1 \mid x)}{p(y = -1 \mid x)},$$
or the posterior distribution $p(y \mid x)$.
Logistic regression, neural networks

Do not estimate $p(x \mid y)$ and $p(y)$.
$h(x)$ or $p(y \mid x)$ is a function of $x$, and does not have probabilistic meaning for $x$, hence cannot be used to sample data points.

## Why discriminative classifiers?

Avoid the difficult density estimation problem
Empirically achieve better classification results

## What is the logistic regression model

Assume that the posterior distribution takes a particular form:
$$p(y = 1 \mid x, w) = \frac{1}{1 + \exp(-w^\top x)}$$
Logistic function: $\sigma(a) = \dfrac{1}{1 + \exp(-a)}$
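The logistic function and the resulting posterior can be sketched directly from the definition above (the helper names here are illustrative, not from the slides):

```python
import math

def sigmoid(a):
    """Logistic function 1 / (1 + exp(-a))."""
    return 1.0 / (1.0 + math.exp(-a))

def posterior_pos(x, w):
    """p(y = 1 | x, w) = sigmoid(w^T x)."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)))
```

Note that $\sigma(0) = 0.5$: points on the hyperplane $w^\top x = 0$ are exactly on the decision boundary.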

## Learning parameters in logistic regression

Find $w$ such that the conditional likelihood of the labels is maximized:
$$\max_w \; l(w) = \log \prod_{i=1}^{m} p(y^i \mid x^i, w)$$
The objective is concave in $w$, so any local optimum is the global optimum.

## The objective function $l(w)$

Logistic regression model:
$$p(y = 1 \mid x, w) = \frac{1}{1 + \exp(-w^\top x)}$$
Note that
$$p(y = 0 \mid x, w) = \frac{\exp(-w^\top x)}{1 + \exp(-w^\top x)}$$
Plug in (with labels $y^i \in \{0, 1\}$):
$$l(w) = \sum_{i=1}^{m} \log p(y^i \mid x^i, w) = \sum_{i=1}^{m} \left( y^i\, w^\top x^i - \log\!\left(1 + \exp(w^\top x^i)\right) \right)$$

## The gradient of $l(w)$

$$l(w) = \sum_{i=1}^{m} \left( y^i\, w^\top x^i - \log\!\left(1 + \exp(w^\top x^i)\right) \right)$$
Take the gradient with respect to $w$:
$$\frac{\partial l(w)}{\partial w} = \sum_{i=1}^{m} x^i \left( y^i - \frac{\exp(w^\top x^i)}{1 + \exp(w^\top x^i)} \right) = \sum_{i=1}^{m} x^i \left( y^i - p(y = 1 \mid x^i, w) \right)$$
Setting it to 0 does not lead to a closed form solution.
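The gradient $\sum_i x^i (y^i - p(y = 1 \mid x^i, w))$ can be sketched as follows, using labels $y \in \{0, 1\}$ (function names are illustrative):

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def gradient(w, xs, ys):
    """Gradient of the conditional log-likelihood l(w):
    sum over examples of x * (y - p(y=1 | x, w))."""
    g = [0.0] * len(w)
    for x, y in zip(xs, ys):
        p = sigmoid(sum(wj * xj for wj, xj in zip(w, x)))
        for j, xj in enumerate(x):
            g[j] += xj * (y - p)
    return g
```

At $w = 0$ every prediction is $0.5$, so the gradient reduces to $\sum_i x^i (y^i - 0.5)$, which is easy to check by hand.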

## Gradient descent

One way to solve an unconstrained optimization problem is gradient descent.
Given an initial guess, we iteratively refine the guess by taking the direction of the negative gradient.
Think about going down a hill by taking the steepest direction at each step.
Update rule:
$$x^{t+1} = x^t - \eta_t \nabla f(x^t)$$
$\eta_t$ is called the step size or learning rate.

## Gradient ascent for logistic regression

Initialize parameter $w^0$.
Do:
$$w^{t+1} = w^t + \eta \sum_{i=1}^{m} x^i \left( y^i - \frac{\exp(w^{t\top} x^i)}{1 + \exp(w^{t\top} x^i)} \right)$$
While $\|w^{t+1} - w^t\| > \epsilon$.
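The full gradient-ascent loop can be sketched end to end; the toy data, step size, and tolerance below are made up for illustration:

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def train(xs, ys, eta=0.1, tol=1e-6, max_iter=10000):
    """Maximize the conditional log-likelihood by gradient ascent,
    stopping when ||w^{t+1} - w^t|| falls below tol."""
    w = [0.0] * len(xs[0])
    for _ in range(max_iter):
        g = [0.0] * len(w)
        for x, y in zip(xs, ys):
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, x)))
            for j, xj in enumerate(x):
                g[j] += xj * (y - p)
        w_new = [wj + eta * gj for wj, gj in zip(w, g)]
        if math.dist(w, w_new) < tol:
            return w_new
        w = w_new
    return w

# Toy 1-D data with an intercept feature appended: y = 1 when x > 0.
xs = [(-2.0, 1.0), (-1.0, 1.0), (1.0, 1.0), (2.0, 1.0)]
ys = [0, 0, 1, 1]
w = train(xs, ys)
```

On separable data like this the weights keep growing (the likelihood has no finite maximizer), which is why the loop also caps the iteration count.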


## Naïve Bayes vs. logistic regression

Consider $x \in \mathbb{R}^n$, $y \in \{-1, 1\}$.

Number of parameters:
Naïve Bayes:
2n + 1, when all random variables are binary
4n + 1 for Gaussians: 2n means, 2n variances, and 1 for the prior
Logistic regression:
n + 1: $w_0, w_1, w_2, \dots, w_n$

## Naïve Bayes vs. logistic regression II

When the model assumptions are correct:
Both Naïve Bayes and logistic regression produce good classifiers.

When the model assumptions are incorrect:
Logistic regression is less biased: it does not assume conditional independence.
Logistic regression has fewer parameters.
It is therefore expected to outperform Naïve Bayes in practice.

## Naïve Bayes vs. logistic regression III

Estimation method:
Naïve Bayes parameter estimates are decoupled (super easy)
Logistic regression parameter estimates are coupled (less easy)

How to estimate the parameters in logistic regression?
Maximum likelihood estimation
More specifically, maximize the conditional likelihood of the labels


## Multiclass logistic regression

For labels $y \in \{1, \dots, K\}$, assume that the posterior distribution takes a particular form (the softmax function):
$$p(y = k \mid x, w_1, \dots, w_K) = \frac{\exp(w_k^\top x)}{\sum_{k'} \exp(w_{k'}^\top x)}$$
Now, let's introduce some notation:
$$\pi_k^i = p(y^i = k \mid x^i, w_1, \dots, w_K), \qquad t_k^i = \mathbb{1}[y^i = k]$$
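The softmax posterior can be sketched directly from its definition (the max-subtraction trick is a standard numerical-stability detail, not from the slides):

```python
import math

def softmax_posterior(x, ws):
    """Return [p(y = k | x, w_1..w_K) for each class k],
    given one weight vector per class in ws."""
    scores = [sum(wj * xj for wj, xj in zip(w, x)) for w in ws]
    m = max(scores)                     # subtract max before exp for stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]
```

Subtracting the maximum score leaves the probabilities unchanged (the factor cancels in numerator and denominator) but avoids overflow for large $w_k^\top x$.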

## Learning parameters in multiclass logistic regression

Given all the input data $(x^1, y^1), \dots, (x^m, y^m)$, the conditional log-likelihood is
$$l(w_1, \dots, w_K) = \log \prod_{i=1}^{m} \prod_{k=1}^{K} (\pi_k^i)^{t_k^i} = \sum_{i=1}^{m} \sum_{k=1}^{K} t_k^i \log \pi_k^i = \sum_{i=1}^{m} \sum_{k=1}^{K} t_k^i \left( w_k^\top x^i - \log \sum_{k'} \exp(w_{k'}^\top x^i) \right)$$

## Learning parameters in multiclass logistic regression

Find $w_1, \dots, w_K$ such that the conditional likelihood of the labels is maximized.
The negative of $l(w_1, \dots, w_K)$ is also known as the cross-entropy error function for multiclass classification.
Compute the gradient of $l$ with respect to one parameter vector $w_k$:
$$\frac{\partial l}{\partial w_k} = \sum_{i=1}^{m} \left( t_k^i - \pi_k^i \right) x^i$$
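The multiclass gradient $\sum_i (t_k^i - \pi_k^i)\, x^i$ can be sketched as follows, reusing a softmax helper (function names are illustrative):

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def gradient_wk(k, ws, xs, ys):
    """Gradient of the log-likelihood with respect to w_k:
    sum over examples of (t_k - pi_k) * x, where t_k = 1[y == k]."""
    g = [0.0] * len(ws[0])
    for x, y in zip(xs, ys):
        pi = softmax([sum(wj * xj for wj, xj in zip(w, x)) for w in ws])
        t_k = 1.0 if y == k else 0.0
        for j, xj in enumerate(x):
            g[j] += (t_k - pi[k]) * xj
    return g
```

At $w_1 = \dots = w_K = 0$ every $\pi_k^i$ equals $1/K$, so the gradient reduces to $\sum_i (t_k^i - 1/K)\, x^i$, which is easy to verify by hand.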