Professional Documents
Culture Documents
Lecture Notes Chapt13
Lecture Notes Chapt13
Logistic Regression
In the first part of this chapter, we look at logistic regression for binary classification,
i.e. two output classes. Afterwards, we look at logistic regression for multi-class classifi-
cation, i.e., multiple output classes. We will see in Chapter 14 how to perform multi-class
classification using any binary classifier. Throughout this chapter, we will denote the two
classes by 0 and 1, as we will use the sigmoid and softmax functions that were introduced
in Chapter 12. In future chapters, notably Chapter 16, it will be more convenient to
use ≠1 and +1 to represent the two output classes. This choice is just for mathematical
convenience as we can trivially map from 0 and 1 to ≠1 and +1 and vice versa, using the
mappings:
(y+1)/2 sign(y≠0.5)
(≠1, +1) ≠æ (0, 1) (0, 1) ≠æ (≠1, +1)
Consider a Bernoulli random variable X that takes a value in {0, 1}. In mathematical
notation we write X ≥ Bernoulli(◊), where ◊ œ [0, 1], to say that variable X follows a
Bernoulli distribution with probability of “success” ◊. More explicitly, we can also write
Y
]1 with probability ◊
X=
[0 with probability 1 ≠ ◊
185
1
0.8
Sigmoid
0.6
0.4
0.2
0
≠4 ≠2 0 2 4
t
Figure 13.1: Sigmoid, or logistic, function on the domain [≠5, 5].
or equivalently
p(X = 1 | ◊) = ◊
p(X = 0 | ◊) = 1 ≠ ◊.
p(X = x | ◊) = ◊x (1 ≠ ◊)1≠x
Given an input feature vector x, the models for binary classification with parameters
w return a scalar value f (x, w) œ [0, 1]. We can interpret the function value as the
parameter of a Bernoulli random variable, i.e., as the probability of X = 1 or “success”.
This means we model the binary class labels as:
Definition 13.1 (Logistic Regression). The logistic regression model for classification is
defined by:
p(y | w, x) = Bernoulli(sigmoid(w · x)),
where, without loss of generality, x0 = 1 so that we do not need to handle the bias term
w0 separately.
It thus models the conditional probability of the label y given the parameter vector w
and the input feature vector (new data point) x. It uses the non-linear sigmoid function
to map the prediction of the linear regression model w · x to the probability parameter
of a Bernoulli distribution that defines the said conditional probability.
Recall that the sigmoid function ‡ represented in Figure 13.1 is defined by:
1
‡(t) = ‡ : R æ (0, 1)
1 + e≠t
186
1
The sigmoid function takes values above 2
whenever its argument is positive:
1
t Ø 0 … ‡(t) Ø (13.1)
2
This implication will become important when making predictions using logistic regression.
Suppose we have estimated the model parameters w œ RD . We now look at how we can
make predictions on new unseen data points. For a new data point xnew , the model gives
us the probability
1
p(ynew = 1 | xnew , w) = ‡(w · xnew ) =
1 + exp(≠w · xnew )
In order to make a prediction, we can use a threshold at 12 . This separates all points
with probability < 12 from the points with probability Ø 12 . We use the indicator function
I to obtain a 1 if the probability is Ø 12 and obtain a 0 otherwise.
1
y‚new = I(‡(w · xnew )) Ø ) = I(w · xnew Ø 0)
2
The second equality follows from Equation 13.1. The second form makes it clear that the
class (or decision) boundary is linear; in many dimensions, the class boundary is thus a
hyperplane.
We now look in greater detail into how to compute a decision boundary and contour
lines for our model function. For this, we consider the question: What is the contour
line for p(y = 1 | x, w) = p0 ? In other words, which data points x satisfy this equality,
assuming that w is fixed, i.e. we already have our model parameters estimated.
1 p0
By definition: p(y = 1 | x, w) = =
1 + exp(≠x · w) 1
1 p0
Simplify: =
1 + exp(≠x · w) ≠ 1 1 ≠ p0
p0
Take the log on both sides: log 1 ≠ log exp(≠x · w) = log
1 ≠ p0
p0
We obtain the hyperplane: x · w = log (13.2)
1 ≠ p0
We start by applying the definition of the probability that y = 1 given x and w under the
logistic regression model. In the second step, we subtract on both sides the numerator
from the denominator. Doing so gets rid of the +1 in the denominator on the left-hand
187
(a) (b)
Figure 13.2: Two classes red and blue with 2D data points. (a) Gray lines represent
contour lines for different thresholds p0 . Red and blue stars depict misclassified data
points for the threshold p0 = 0.6. The datasset is not linearly separable, so there will
always be misclassified data points since the decision boundary of the logistic regression
is linear. (b) 3D depiction of the same dataset by adding a new irrelevant dimension z
for the third axis to each data point; z = 0.5 for blue data points and z = 0.3 for each
red data point). The surface represents the contour lines for various values of p0 œ [0, 1].
side. Finally, we take the logarithm on both sides of the equation and simplify the
expressions to arrive at the decision boundary. We finally end up with a linear function
in x. The term on the right-hand side of the decision boundary is also known as the
log-odds of p0 .
The points that lie on the decision boundary, i.e. the points that are equally likely to
belong to class 1 and 2, are the data points x that satisfy the following equality:
1
p(y = 1 | x, w) = p(y = 0 | x, w) = ∆ x·w=0
2
1
We can verify this implication by letting p0 = 2
and plugging it into Equation 13.2. This
1
then gives us x · w = log 1≠ 1 = log 1 = 0.
2
2
Let us now examine the decision boundary graphically. Figure 13.2 depicts the data
points of two classes, red and blue, with different means in the (a) 2D plane and (b) in
188
the 3D plane. The data for each class is drawn from a normal distribution per class.
As we can see in Figure 13.2 (a), the points are not linearly separable. Therefore we
will always misclassify some data points with the linear decision boundary of the logistic
regression. The grey lines represent the contour lines for different probability thresholds
p0 : from bottom left to top right these p0 values are: 0.15, 0.3, 0.45, 0.6 ,0.75 and 0.9,
respectively. Starred points represent misclassifications made by the logistic regression
classifier for p0 = 0.6. The different heights for the two classes in Figure 13.2 (b) are only
to ease the visual distinction between the two classes.
We next derive the parameters of a logistic regression classifier using the maximum like-
lihood principle.
Assume we are given a dataset D = ((x1 , y1 ), . . . , (xN , yN )), where the N data points
xi œ RD are i.i.d. and yi œ {0, 1}. The likelihood of observing these data points, given
the model parameters w, is given by the product of the likelihood of observing every
single data point individually:
N
Ÿ N
Ÿ
T T 1≠yi
p(y | X, w) = ‡(w xi ) · (1 ≠ ‡(w xi ))
yi
= yi
µi · (1 ≠ µi )1≠yi
i=1 i=1
In the above expression, we made use of the compact formula for the Bernoulli distribution
def
and also defined µi = ‡(wT xi ) to shorten notation.
Interestingly, the NLL(yi | xi , w) is the cross-entropy between yi and µi for yi œ {0, 1}.
The concept of cross-entropy comes from the area of information theory. There is a deep
connection between the area of information theory, where the goal is to describe the
amount of information available via different random variables, and the area of machine
learning, where the goal is to extract meaningful patterns from data. Later in this chapter,
we briefly discuss this concept to better understand the connection.
189
1
0.8
0.6
Entropy 0.4
0.2
0
0 0.2 0.4 0.6 0.8 1
◊
Figure 13.3: Entropy of a Bernoulli random variable for different values of ◊. The graph
reaches its maximum value for ◊ = 0.5, which corresponds to a uniform distribution where
both outcomes are equally likely.
The Hessian matrix (Section 2.2.5) of the second-order partial derivatives can be expressed
as
H = XT SX,
where S is a diagonal matrix with each entry Sii = µi (1 ≠ µi ). Since µi = ‡(wT xi ), the
term µi (1 ≠ µi ) is the first order derivative of µi with respect to wT xi . The Hessian of
NLL is positive definite everywhere, assuming X has full rank, otherwise it is only positive
semi-definite everywhere (easy exercise). This ensures that the NLL is convex, and we
can use convex optimisation methods to minimise it. Recall that for convex optimisation
methods a local optimum is also global optimum.
The full derivation of the gradient and Hessian of the above NLL is an exercise and
will be discussed in class.
The concepts of entropy and cross-entropy are relevant to machine learning. In short,
the entropy is used to quantify the information about a random variable.
Definition 13.2 (Entropy). Entropy H is a measure of uncertainty associated with a
random variable X: ÿ
H(X) = ≠ p(x) log p(x),
x
where we sum, over all possible outcomes x, the probability that x occurs times the log
of this probability.
190
Two important properties are:
• Minimum entropy is reached in case there is only one possible outcome of the
random variable.
For instance, for a Bernoulli random variable X with parameter ◊, the entropy is
given by:
H(X) = ≠◊ log2 (◊) ≠ (1 ≠ ◊) log2 (1 ≠ ◊)
The graph of this entropy function is depicted in Figure 13.3 as a function of ◊. The
graph achieves its maximum value 1 for ◊ = 0.5, which means a uniform distribution,
that is, the binary random variable X has both outcomes equally likely. For ◊ = 0 or
◊ = 1, the entropy is 0 (e.g., there is no uncertainty in case the variable has only one
outcome with probability 1) .
Now let p and q be two distributions and suppose the support of p is contained in
q. For instance, imagine we have two random variables, Xp and Xq , that follow the
distributions p and q respectively. Then the possible values (outcomes) of Xp are also
possible values of Xq .
In other words, for every x, we take the probability p of that value and the number
of bits needed to encode it using the encoding scheme of q. For the latter, we need
logarithmically many bits.
Cross-entropy is related to the negative log-likelihood given before for the logistic
regression classifier. Essentially, in our classification problem we estimate the probability
of different outcomes to occur. Imagine the estimated probability of outcome i is qi
and the frequency (empirical probability) of outcome i in the training set is pi , then the
negative log-likelihood of the training data is the cross-entropy H(p, q). Specifically, the
negative log-likelihood for data point (xi , yi )
is the cross-entropy between yi and µi , where yi is the observed class, that is supported
by the empirical probability of the outcomes in the training data set, i.e. it is supported
by pi , while µi is the estimated probability ‡(wT xi ), denoted by qi .
191
13.4 Newton Method for Optimising the Negative
Log-Likelihood for Logistic Regression
We next show how to optimise the negative log-likelihood using the Newton’s method
(Section 11.7.1). Recall that the Newton’s method is used if the number of dimensions D
is small, as it requires to take the inverse of the Hessian matrix with dimensions D ◊ D
and this takes time cubic in D. For small D, we can apply Newton’s method to estimate
w for logistic regression.
These minimisers allow us to move fast towards the minimum of the objective function.
We then update the parameters on each step and let wt be the parameters after t Newton
steps.
For the update, we need the gradient and Hessian matrix which are respectively given
by:
gt = XT (µt ≠ y) = ≠XT (y ≠ µt )
Ht = XT St X
Note that St depends on the parameter vector wt and therefore dependents on t: St,ii =
µt,i (1 ≠ µt,i ) and µt,i = ‡(wtT xi ). Similarly, the gradient is also dependent on t. We need
to compute the gradient and the Hessian at each iteration t. The Newton update rule is
as follows:
wt+1 = wt ≠ H≠1
t gt
= wt + (XT St X)≠1 XT (y ≠ µt )
= (XT St X)≠1 (XT St X)wt + (XT St X)≠1 XT (y ≠ µt )
= (XT St X)≠1 XT St (Xwt + S≠1
t (y ≠ µt ))
¸ ˚˙ ˝
zt
= (XT St X)≠1 XT St zt
We proceeded above as follows: We first plug in the definitions of the gradient and the
Hessian. Next, we multiply wt by ID = H≠1 t Ht = (X St X) (X St X). This does
T ≠1 T
not alter the equation but allows us to bring the expression in the desired final shape.
This step assumes that the Hessian is invertible, which has to be the case, because
we could otherwise not apply Newton’s method in the first place. Now we factor out
(XT St X)≠1 XT St and therefore add a correction term S≠1 to the second term (y ≠ µt ).
Lastly, we rename the expression in parentheses.
192
13.4.1 From Ordinary Least Squares to Weighted Least Squares
Looking closely, we can observe that the expression for wt+1 in Equation 13.3 looks similar
to the closed-form solution we obtained for the least-squares objective in Section 3.6.1.
We next derive a generalisation of ordinary least squares objective, called weighted least
squares objective, whose minimisation is given by this expression for wt+1 . In other words,
to compute the parameters at a particular Newton step t + 1, this form corresponds to
minimising the weighted least squares objective function.
Recall the ordinary least squares (OLS) objective function L(w) and the closed-form
solution for w.
ÿ
Objective function: L(w) = (xiT w ≠ yi )2
i
Closed-form solution: w = (XT X)≠1 XT y
We work our way from Equation 13.3 back to the objective function it solves, i.e. we try
to find L(w). Our strategy is to bring the expression into a form similar to the closed-
form solution for the ordinary least squares. This means we need to find a matrix X̃ such
that X̃T X̃ = XT St X, and we have to find a ỹ such that X̃T ỹ = XT St zt . We start with
the equation itself:
Since St is a diagonal matrix, we can factorise it into two diagonal matrices having
the square root of the entries. Recall that a diagonal matrix is symmetric and therefore
1/2 1/2 T
equal to its transpose, i.e. St = St . Therefore, we can rewrite the expression as
follows:
1/2 1/2 1/2 1/2 1/2 T 1/2 1/2 T 1/2
wt+1 = (XT St St X)≠1 XT St St zt = ((XT (St ) )St X)≠1 (XT (St ) )St zt
Renaming the individual terms, as shown in the above equation, yields a form that is
structurally identical to the closed-form solution of OLS, yet with different matrices and
vectors:
1/2 1/2 1/2 1/2
wt+1 = ((St X)T St X)≠1 (St X)T St zt
¸ ˚˙ ˝ ¸ ˚˙ ˝ ¸ ˚˙ ˝ ¸ ˚˙ ˝
X̃T X̃ X̃T ỹ
T T
= (X̃ X̃) X̃ ỹ ≠1
This however allows us to synthesise the objective function, where we write x̃iT instead
of xiT and ỹi instead of yi :
ÿ
L(wt+1 ) = (x̃iT wt+1 ≠ ỹi )2
i
Now, we can plug in the terms for x̃iT and ỹi , but we need to be careful with the indices,
as for the objective function we consider vectors, not matrices as for the derivation before:
ÿ 1/2 1/2
L(wt+1 ) = (St,ii xiT wt+1 ≠ St,ii zt,i )2
i
193
1/2
Lastly, we can factor out the term St,ii to arrive at the objective function of weighted
least squares:
ÿ
Objective function: L(wt+1 ) = St,ii (xiT wt+1 ≠ zt,i )2
i
Closed-form solution: wt+1 = (XT St X)≠1 XT St zt
The above result is similar to OLS but with weights St,ii œ R for each squared residual.
The above weighted least squares objective function is to be minimised at each iteration
step to obtain the minimiser wt+1 . For this reason, the overall approach to computing
the parameters for the logistic regression classifier is called Iteratively Re-Weighted Least
Squares (IRLS):
• Each Newton step requires re-weighting of the residual by a new diagonal matrix
St , because it depends on our prediction ‡(wT xi ).
• Each step uses a new vector zt , which depends on wt .
• We proceed iteratively, one Newton step after the other.
Consider now C > 2 classes: y œ {1, . . . , C}. A natural generalisation of the bi-
nary logistic regression to multiple classes would be to consider a vector of parameters
wc œ RD for every class c. This is in contrast to binary logistic regression, where we
considered a single parameter vector for both classes1 . The parameters form a matrix
W = [w1 , . . . , wC ] œ RD◊C . The multi-class logistic model is then given by:
exp(wcT x)
p(y = c | x, W) = qC
cÕ =1 exp(wcÕ x)
T
Recall from Chapter 12 that the softmax function maps the elements of its input vector
to values in the domain [0, 1] that sum up to 1. This allows a probabilistic interpretation
of this values, with the largest value in the input vector being assigned to the largest
probability amongst the probabilities assigned to the elements of the input vector.
Similarly to the binary case, we can show that the negative log-likelihood is convex
also for the multi-class logistic regression. We can therefore use convex optimisation
methods to estimate the parameters W.
1
We could in fact omit the parameter vector for one of the classes since we can deduct the probability
of this class from the other probabilities, as they sum up to 1. However, we will not do so here as it will
allow a neat treatment using the softmax function.
194
The multi-class logistic model can be expressed using softmax:
3Ë ÈT 4
p(y | x, W) = softmax w1T x, . . . , wC
T
x ,
where softmax takes as arguments a vector of the dot products of the individual parameter
vectors of each class wc and the input vector x.
The first step follows by definition of the softmax function. The second step is achieved
by dividing the numerator and the denominator by the numerator. This brings the
equation into the form of the sigmoid function, and it also motivates the fact that we do
not need to use a distinct parameter vector for each class in the binary case. It is enough
to use one parameter vector that captures the difference between the two-parameter
vectors, here w1 ≠ w0 .
Decision boundaries for multi-class logistic regression are linear. Figure 13.4 shows data
points of three classes (red, blue and green) that are generated according to the following
normal distributions with different means and identical variances.
Stars indicate misclassified data points. To mathematically show that the decision bound-
aries are linear, we equate the probability of the data points belonging to one class with
the probability of the data points belonging to another class and then obtain a linear
function in x.
13.6 Summary
195
Figure 13.4: Linear decision boundaries between three classes (red, blue and green) using
multi-class logistic regression. The points are generated according to three normal dis-
tributions with different means and identical variances. Stars indicate misclassified data
points.
• We can derive maximum likelihood estimates using convex optimisation. This led
us to the iteratively re-weighted least squares optimisation method, which is based
on Newton’s method.
• For more information on multi-class logistic regression, see Section 8.3 in “Machine
Learning: A Probabilistic Perspective” [Murphy, 2012].
In the context of logistic regression, we have not discussed basis expansion and regu-
larisation. These concepts can be also combined with logistic regression.
• In the class to this topic, we will see that regularisation may be necessary if data
is linearly separable.
196
• What if the classification boundaries are non-linear? In order to cope with that
and “lift” these relatively simple models to non-linearity, we can use:
197
Bibliography
[Boyd and Vandenberghe, 2004] Boyd, S. and Vandenberghe, L. (2004). Convex Opti-
mization. Cambridge University Press.
[Breiman, 2001] Breiman, L. (2001). Statistical modeling: The two cultures (with com-
ments and a rejoinder by the author). Statistical Science, 16(3).
[Collins, 2012] Collins, M. (2012). Convergence Proof for the Perceptron Algorithm.
http://www.cs.columbia.edu/ mcollins/courses/6998-2012/notes/perc.converge.pdf.
[Domingos, 2012] Domingos, P. (2012). A few useful things to know about machine
learning. Commun. ACM, 55(10):78–87.
[Fisher, 1936] Fisher, R. A. (1936). The use of multiple elements in taxonomic problems.
Annals of Eugenics, 7(2):179–188.
[Goldstein et al., 2014] Goldstein, T., Studer, C., and Baraniuk, R. G. (2014). A
field guide to forward-backward splitting with a FASTA implementation. CoRR,
abs/1411.3406.
[Goodfellow et al., 2016] Goodfellow, I., Bengio, Y., and Courville, A. (2016). Deep
Learning. MIT Press. http://www.deeplearningbook.org.
312
[Goodfellow et al., 2015] Goodfellow, I., Shlens, J., and Szegedy, C. (2015). Explaining
and harnessing adversarial examples. In International Conference on Learning Repre-
sentations.
[Mayer et al., 2010] Mayer, J., Khairy, K., and Howard, J. (2010). Drawing an elephant
with four complex parameters. American Journal of Physics, 78.
[Moritz Hardt, 2016] Moritz Hardt, Ben Recht, Y. S. (2016). Train faster, generalize
better: Stability of stochastic gradient descent. Journal of Machine Learning Research,
48:1225–1234.
[Schleich et al., 2016] Schleich, M., Olteanu, D., and Ciucanu, R. (2016). Learning linear
regression models over factorized joins. In SIGMOD, pages 3–18.
[Schölkopf and Smola, 2002] Schölkopf, B. and Smola, A. J. (2002). Support vector ma-
chines and kernel algorithms. In Arbib, M. A., editor, The Handbook of Brain Theory
and Neural Networks, pages 1119–1125. MIT Press.
[Srivastava et al., 2014] Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and
Salakhutdinov, R. (2014). Dropout: A simple way to prevent neural networks from
overfitting. J. Mach. Learn. Res., 15:1929–1958.
[Strang, 2005] Strang, G. (2005). Linear Algebra and its Applications. Cengage Learning.
[Szegedy et al., 2014] Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D.,
Goodfellow, I., and Fergus, R. (2014). Intriguing properties of neural networks. In
International Conference on Learning Representations.
[Tukey, 1962] Tukey, J. W. (1962). The future of data analysis. The Annals of Mathe-
matical Statistics, pages 1–67.
[Zeiler and Fergus, 2013] Zeiler, M. D. and Fergus, R. (2013). Visualizing and under-
standing convolutional networks. CoRR, abs/1311.2901.
313