
CIS 520: Machine Learning Spring 2021: Lecture 4

Discriminative Probabilistic Models for Classification

Lecturer: Shivani Agarwal

Disclaimer: These notes are designed to be a supplement to the lecture. They may or may not cover all the material discussed in the lecture (and vice versa).

Outline
• Introduction
• Logistic regression
• Probit regression and link functions
• Multiclass extensions
• Loss minimization view

1 Introduction

We have seen that, when classification performance is measured via a label-dependent loss (such as the 0-1
loss), the Bayes optimal classifier depends on only the conditional probabilities p(y|x) = P(Y = y|X = x).
Therefore, if the goal is simply to learn a good classifier, then one need not model the full joint distribution
D, which is what generative models effectively do; instead, it is sufficient to model only the conditional
probabilities p(y|x). Classification methods that take this route are referred to as discriminative probabilistic
models, since they model only the part of the joint distribution that is needed to discriminate between
different classes. Such methods often make better use of the limited training data that may be available,
since they do not attempt to model the distribution of instances x.
Below we will discuss one of the most widely used discriminative probabilistic models for classification, known
as logistic regression, and some of its variants. We will start by discussing the logistic regression method
for binary classification; we will then briefly mention some variants of the method, as well as extensions to
multiclass classification. Finally, we will give an alternative view of logistic regression that is not based on
modeling probability distributions at all, but rather, that is based on minimizing a ‘surrogate’ loss function.
This latter view will be helpful both in obtaining a better understanding of logistic regression, and in
seeing connections between logistic regression and other classification methods that we will meet later. The
loss minimization view has also proved to be immensely useful in recent years in understanding statistical
consistency properties of various classification methods.

2 Logistic Regression

Consider first binary classification, with instance space X and label space Y = {±1}. As discussed above,
if the goal is simply to learn an accurate classifier, then one need not model the full joint distribution D;
instead, it is sufficient to model only the conditional probabilities η(x) = P(Y = +1|X = x). One approach
for this is to assume a parametric form for η(x), and then estimate the parameters from the training sample,
e.g. using maximum likelihood estimation.
In its basic form, logistic regression applies to instances represented as real-valued feature vectors, with
instance space X = R^d. Logistic regression assumes a generalized linear model for the class probabilities η(x), with

η(x) = g(wᵀx)

for a suitable ‘squashing’ function g : R→[0, 1]; here w ∈ R^d is a parameter vector.[1] In particular, logistic
regression uses the logistic sigmoid function g given by

g(u) = 1/(1 + e^{-u}),

so that the class probabilities η(x) are effectively modeled as[2]

η(x) = 1/(1 + e^{-wᵀx}).
Note that this gives

1 − η(x) = 1 − 1/(1 + e^{-wᵀx}) = e^{-wᵀx}/(1 + e^{-wᵀx}) = 1/(1 + e^{wᵀx}),

so that we can succinctly write, for each y ∈ {±1}:

p(y | x; w) = 1/(1 + e^{-y wᵀx}).
Note also that with the above representation, the Bayes optimal classifier is given by
h*(x) = +1  ⟺  1/(1 + e^{-wᵀx}) > 1/2  ⟺  e^{-wᵀx} < 1  ⟺  wᵀx > 0,

so that we can write

h*(x) = sign(wᵀx).
In other words, under the logistic regression model, the Bayes optimal classifier is a linear classifier.
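
To make the model concrete, here is a minimal NumPy sketch of the modeled probability η(x) and the induced linear classifier; the particular vectors w and x are hypothetical, chosen only for illustration.

import numpy as np

def eta(x, w):
    """Modeled class probability P(Y = +1 | X = x) under logistic regression."""
    return 1.0 / (1.0 + np.exp(-np.dot(w, x)))

def classify(x, w):
    """Bayes optimal classifier under the model: predict +1 iff w^T x > 0."""
    return 1 if np.dot(w, x) > 0 else -1

# Hypothetical parameter vector and instance.
w = np.array([2.0, -1.0, 0.5])
x = np.array([1.0, 0.5, 2.0])
print(eta(x, w), classify(x, w))   # eta(x, w) = 1/(1 + e^{-w^T x})
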
Now, given a training sample S = ((x_1, y_1), …, (x_m, y_m)), the parameter vector w can be estimated from
S, e.g. using maximum likelihood estimation. Specifically, the (conditional) likelihood of w is given by

L(w) = p(y_1, …, y_m | x_1, …, x_m; w) = ∏_{i=1}^m p(y_i | x_i; w) = ∏_{i=1}^m 1/(1 + e^{-y_i wᵀx_i}),

and the log-likelihood is therefore given by

ln L(w) = -∑_{i=1}^m ln(1 + e^{-y_i wᵀx_i}).

Unfortunately, there is no closed-form expression for the parameters ŵ maximizing the above log-likelihood,
but these can be found using numerical optimization methods (such as Newton’s method). The resulting
plug-in classifier, given by

h_S(x) = sign(ŵᵀx),

is termed the (linear) logistic regression classifier.[3]

[1] Note that one can easily accommodate an affine function of the form wᵀx + b by augmenting the feature vector x with an additional constant component to create x′ = (x, 1) ∈ R^{d+1}, and taking w′ = (w, b) ∈ R^{d+1}, so that w′ᵀx′ = wᵀx + b.
[2] It can be verified that for the case of multivariate normal class-conditional densities on R^d with shared covariance matrix, as well as the case of conditionally independent features in {0, 1}^d, η(x) has this form with weights w (or rather, (w, b)) that depend on the distribution parameters. In fact this holds for a larger class of settings in which the class-conditional distributions come from a certain form of exponential family model.
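
Since the maximizer ŵ has no closed form, it is found numerically. The following is a minimal sketch, not the reference implementation from lecture, of fitting w by plain gradient ascent on the log-likelihood above (Newton’s method would be another choice); the synthetic data, step size, and iteration count are arbitrary assumptions for illustration.

import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def fit_logistic_regression(X, y, lr=0.1, n_iters=1000):
    """Maximize ln L(w) = -sum_i ln(1 + exp(-y_i w^T x_i)) by gradient ascent.

    X: (m, d) array of instances; y: (m,) array of labels in {+1, -1}.
    """
    m, d = X.shape
    w = np.zeros(d)
    for _ in range(n_iters):
        # Gradient of the log-likelihood: sum_i y_i x_i * sigmoid(-y_i w^T x_i)
        margins = y * (X @ w)
        grad = X.T @ (y * sigmoid(-margins))
        w += lr * grad / m          # scale by 1/m for a stable step size
    return w

# Usage sketch on synthetic data (labels in {+1, -1}).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = np.sign(X @ np.array([1.5, -2.0, 0.5]) + 0.1 * rng.normal(size=200))
w_hat = fit_logistic_regression(X, y)
h = lambda x: np.sign(w_hat @ x)    # plug-in classifier h_S(x) = sign(w_hat^T x)

Each update follows the gradient of the log-likelihood; in practice one would typically rely on a standard numerical optimization library instead.
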

3 Probit Regression and Link Functions

Above, we used the logistic sigmoid function to map real numbers to probabilities. One can in principle use
any other ‘squashing’ function g : R→[0, 1], such as the cumulative distribution function (CDF) of a suitable
random variable. For example, the use of the standard normal CDF, given by
g(u) = Φ(u) = (1/√(2π)) ∫_{-∞}^{u} e^{-t²/2} dt,
leads to what is termed probit regression.
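
For a quick comparison of the two squashing functions, the following minimal sketch evaluates the logistic sigmoid and the standard normal CDF (computed here via the identity Φ(u) = (1 + erf(u/√2))/2) at a few points; both map R to [0, 1], but with different tail behavior.

import math

def logistic(u):
    """Logistic sigmoid squashing function."""
    return 1.0 / (1.0 + math.exp(-u))

def probit(u):
    """Standard normal CDF, via Phi(u) = (1 + erf(u / sqrt(2))) / 2."""
    return 0.5 * (1.0 + math.erf(u / math.sqrt(2.0)))

for u in [-3.0, -1.0, 0.0, 1.0, 3.0]:
    print(f"u = {u:+.1f}   logistic = {logistic(u):.4f}   probit = {probit(u):.4f}")
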
Squashing functions are closely related to ‘link’ functions. Specifically, a link function in binary classification is any strictly increasing (and therefore invertible) function ψ : [0, 1]→R̄ that maps probabilities to real
numbers.[4] For example, logistic regression implicitly uses the logit[5] (or logistic) link function, which is
simply the inverse of the logistic sigmoid function: ψ(η) = ln(η/(1 − η)). Similarly, probit regression implicitly
uses the probit link function, which is simply the inverse of the standard normal CDF: ψ(η) = Φ⁻¹(η).
Other link functions have also been used as a basis for designing discriminative probabilistic models for
binary classification. Can you think of some reasons for why one might want to choose one link function (or
squashing function) over another?

[3] Often, in practice, one assumes a prior distribution over the parameters w, such as a multivariate normal or Laplace distribution, and finds a maximum a posteriori (MAP) estimate ŵ. We will look at the effect of using such MAP estimation in later lectures. A fully Bayesian treatment, where one uses the full posterior distribution over w for classification, leads to Bayesian logistic regression. If in addition one assumes a prior distribution over the parameters of the prior on w, this leads to a hierarchical Bayesian treatment of logistic regression.
[4] Here R̄ = [-∞, ∞] is the extended real line.
[5] Note that we overload notation here by using η to denote a number in [0, 1] rather than a function; the usage should be clear from context.

4 Multiclass Extensions

Consider now multiclass classification with K > 2 classes: Y = [K] = {1, …, K}. A discriminative probabilistic model in this case will attempt to model the conditional class probabilities η_y(x) = p(y|x), y ∈ [K].
Again, assume instances are real-valued feature vectors, with X = R^d. Multiclass logistic regression again
assumes a generalized linear model for the class probabilities η_y(x). In this case, the model is parametrized
by K weight vectors w_y ∈ R^d, one for each class y ∈ [K]; we will collect these into a parameter matrix
W ∈ R^{d×K} whose y-th column is w_y:

W = [ w_1  w_2  ···  w_K ].



Then, denoting by η(x) = (η_1(x), …, η_K(x))ᵀ the class probability vector for instance x, the model assumes

η(x) = g(Wᵀx),

where g : R^K→Δ_K maps a K-dimensional real vector to a K-dimensional probability vector.[6] In particular,
multiclass logistic regression uses the softmax function g given by

g_y(u) = exp(u_y) / ∑_{y'=1}^K exp(u_{y'}),

so that the class probabilities are effectively modeled as[7]

η_y(x) = exp(w_yᵀx) / ∑_{y'=1}^K exp(w_{y'}ᵀx).

Strictly speaking, since a K-dimensional probability vector has only K − 1 degrees of freedom, one needs
only K − 1 weight vectors w_y, for classes y = 1, …, K − 1; the weight vector for the K-th class can be fixed
to 0. However it is common to use the redundant representation above and include a parameter vector w_y
for each class y = 1, …, K.
In this case, the Bayes optimal classifier becomes

h*(x) ∈ arg max_{y∈[K]} w_yᵀx.

As before, given a training sample S = ((x_1, y_1), …, (x_m, y_m)), one can find a maximum likelihood (or other)
estimate Ŵ of the parameter matrix W from S; the resulting plug-in classifier

h_S(x) ∈ arg max_{y∈[K]} ŵ_yᵀx

is termed the (linear) multiclass logistic regression classifier.
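
As an illustration of the multiclass model, here is a minimal NumPy sketch of the softmax class probabilities and the induced arg-max classifier; the parameter matrix W and instance x below are hypothetical, for illustration only.

import numpy as np

def softmax(u):
    """Map a K-dimensional real vector to a probability vector in the simplex."""
    u = u - np.max(u)              # shift for numerical stability (does not change the result)
    e = np.exp(u)
    return e / e.sum()

def class_probabilities(x, W):
    """eta_y(x) = exp(w_y^T x) / sum_{y'} exp(w_{y'}^T x), with W of shape (d, K)."""
    return softmax(W.T @ x)

def classify(x, W):
    """h(x) in argmax_y w_y^T x; classes are returned as 1, ..., K."""
    return int(np.argmax(W.T @ x)) + 1

# Hypothetical parameters for d = 2 features and K = 3 classes.
W = np.array([[ 1.0, -0.5,  0.0],
              [ 0.5,  1.0, -1.0]])
x = np.array([1.0, 2.0])
print(class_probabilities(x, W), classify(x, W))
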

5 Loss Minimization View

Let us now briefly look at an alternative view of the linear logistic regression classifier described earlier for
binary classification, with X = R^d and Y = {±1}. Clearly, assuming classification performance is measured
via 0-1 loss, among all linear classifiers of the form h_w(x) = sign(wᵀx), the ideal classifier would be one
that minimizes the 0-1 generalization error w.r.t. D:

w* ∈ arg min_{w∈R^d} er^{0-1}_D[h_w].

Since D is unknown, one might look for a parameter vector that minimizes the empirical 0-1 error on the
training sample S = ((x_1, y_1), …, (x_m, y_m)) instead,[8] defined for any classifier h : X→{±1} as

êr^{0-1}_S[h] = (1/m) ∑_{i=1}^m 1(h(x_i) ≠ y_i).
[6] Here Δ_K = {p ∈ R_+^K : ∑_{y=1}^K p_y = 1}.
[7] Again, it can be verified that for the case of multivariate normal class-conditional densities on R^d with shared covariance matrix, as well as conditionally independent features in {0, 1}^d, η_y(x) has this form for some parameters W (or rather, (W, b), for some w_y, b_y); again, this holds for a wider class of class-conditional distributions that come from a certain form of exponential family model.
[8] Since S contains examples drawn i.i.d. from D, minimizing the error on S seems intuitively to be a reasonable thing to do; we will see formal reasons later for why this makes sense.

Figure 1: Logistic and 0-1 losses, as a function of the margin yf .

Minimizing this 0-1 empirical error (over linear classifiers) would give

ŵ ∈ arg min_{w∈R^d} êr^{0-1}_S[h_w] = arg min_{w∈R^d} (1/m) ∑_{i=1}^m 1(sign(wᵀx_i) ≠ y_i).

Unfortunately, this turns out to be a computationally difficult optimization problem due to the discrete
indicator function (in fact it is NP-hard to solve). Consequently, one often minimizes the empirical error on
S w.r.t. some other ‘surrogate’ loss function ℓ : {±1} × R→R_+ that is a continuous and sufficiently smooth
approximation to the 0-1 loss to allow for efficient minimization; often, it is desirable to have the loss be
convex in its second argument. In particular, define the logistic loss ℓ_log : {±1} × R→R_+ as

ℓ_log(y, f) = log₂(1 + e^{-yf}).

Note also that for real-valued predictions f ∈ R used for classification via sign(f), the 0-1 loss becomes
ℓ_{0-1} : {±1} × R→R_+, defined as

ℓ_{0-1}(y, f) = 1(sign(f) ≠ y) = 1(yf < 0).
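
As a small numerical illustration, the sketch below evaluates the logistic loss and the 0-1 loss at a few margin values yf; it simply tabulates the two curves plotted in Figure 1.

import math

def logistic_loss(y, f):
    """l_log(y, f) = log2(1 + exp(-y * f))."""
    return math.log2(1.0 + math.exp(-y * f))

def zero_one_loss(y, f):
    """l_0-1(y, f) = 1(y * f < 0), i.e. sign(f) disagrees with y."""
    return 1.0 if y * f < 0 else 0.0

for margin in [-2.0, -0.5, 0.0, 0.5, 2.0]:
    y, f = 1, margin               # the losses depend only on the product y * f
    print(f"yf = {margin:+.1f}   l_log = {logistic_loss(y, f):.3f}   l_01 = {zero_one_loss(y, f):.0f}")
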

Figure 1 shows plots of both the logistic loss and the 0-1 loss as a function of the margin yf. As can be
seen, the logistic loss is convex in its second argument and forms an upper bound on the 0-1 loss. We can
define the empirical logistic error of a real-valued function f : X→R on the training sample S as[9]

êr^{log}_S[f] = (1/m) ∑_{i=1}^m log₂(1 + e^{-y_i f(x_i)}).

Minimizing this empirical logistic error over all linear functions of the form f_w(x) = wᵀx then gives

ŵ ∈ arg min_{w∈R^d} êr^{log}_S[f_w] = arg min_{w∈R^d} (1/m) ∑_{i=1}^m log₂(1 + e^{-y_i wᵀx_i}).

This yields the same solution as the linear logistic regression classifier! This gives an alternative view of
logistic regression that makes no assumptions directly on the conditional probability function η, but rather
simply minimizes the empirical logistic error on the training sample over some class of functions (in this case
linear functions).[10] This view has been helpful in understanding statistical consistency properties of logistic
regression classifiers in recent years. We will see more examples of such loss minimization algorithms, often
called empirical (ℓ-)risk minimization (ERM or ℓ-ERM) algorithms (where ℓ denotes the loss being
minimized), in the coming lectures.

[9] Note that we overload notation here; f was earlier used to denote a real number, and is now used to denote a real-valued function. The usage should be clear from context.
[10] In practice, one often adds a regularizer to the objective to reduce overfitting; we will discuss this in the next lecture in the context of least squares regression.

Acknowledgments. Thanks to Harikrishna Narasimhan for help in preparing the plot in Figure 1.
