You are on page 1of 15

Chapter 13

Logistic Regression

Logistic regression is a discriminative classification method. The name is deceiving:


Even though it is called “regression”, it is actually used for classification. Recall that
for classification methods, the output y is categorical, and that in the discriminative
setting we model the conditional distribution over the output y, given the input x and
parameters w:
p(y | w, x).

In the first part of this chapter, we look at logistic regression for binary classification,
i.e. two output classes. Afterwards, we look at logistic regression for multi-class classifi-
cation, i.e., multiple output classes. We will see in Chapter 14 how to perform multi-class
classification using any binary classifier. Throughout this chapter, we will denote the two
classes by 0 and 1, as we will use the sigmoid and softmax functions that were introduced
in Chapter 12. In future chapters, notably Chapter 16, it will be more convenient to
use ≠1 and +1 to represent the two output classes. This choice is just for mathematical
convenience as we can trivially map from 0 and 1 to ≠1 and +1 and vice versa, using the
mappings:
(y+1)/2 sign(y≠0.5)
(≠1, +1) ≠æ (0, 1) (0, 1) ≠æ (≠1, +1)

13.1 Models for Binary Classification

Consider a Bernoulli random variable X that takes a value in {0, 1}. In mathematical
notation we write X ≥ Bernoulli(◊), where ◊ œ [0, 1], to say that variable X follows a
Bernoulli distribution with probability of “success” ◊. More explicitly, we can also write
Y
]1 with probability ◊
X=
[0 with probability 1 ≠ ◊

185
1
0.8

Sigmoid
0.6
0.4
0.2
0
≠4 ≠2 0 2 4
t
Figure 13.1: Sigmoid, or logistic, function on the domain [≠5, 5].

or equivalently

p(X = 1 | ◊) = ◊
p(X = 0 | ◊) = 1 ≠ ◊.

Even more succinctly, we can write

p(X = x | ◊) = ◊x (1 ≠ ◊)1≠x

This is a convenient notation, as we can plug in x = 0 or x = 1 to obtain the respective


probabilities from above in one single equation.

Given an input feature vector x, the models for binary classification with parameters
w return a scalar value f (x, w) œ [0, 1]. We can interpret the function value as the
parameter of a Bernoulli random variable, i.e., as the probability of X = 1 or “success”.
This means we model the binary class labels as:

y ≥ Bernoulli(f (x, w))

Unsurprisingly, one possible function f is the sigmoid, or logistic, function depicted in


Figure 13.1. It takes as argument the dot product of x and w and maps it to a value in
the domain (0, 1). For more details on the logistic function, see Section 12.9.4.

Definition 13.1 (Logistic Regression). The logistic regression model for classification is
defined by:
p(y | w, x) = Bernoulli(sigmoid(w · x)),
where, without loss of generality, x0 = 1 so that we do not need to handle the bias term
w0 separately.

It thus models the conditional probability of the label y given the parameter vector w
and the input feature vector (new data point) x. It uses the non-linear sigmoid function
to map the prediction of the linear regression model w · x to the probability parameter
of a Bernoulli distribution that defines the said conditional probability.

Recall that the sigmoid function ‡ represented in Figure 13.1 is defined by:
1
‡(t) = ‡ : R æ (0, 1)
1 + e≠t

186
1
The sigmoid function takes values above 2
whenever its argument is positive:

1
t Ø 0 … ‡(t) Ø (13.1)
2
This implication will become important when making predictions using logistic regression.

13.2 Prediction Using Logistic Regression

Suppose we have estimated the model parameters w œ RD . We now look at how we can
make predictions on new unseen data points. For a new data point xnew , the model gives
us the probability
1
p(ynew = 1 | xnew , w) = ‡(w · xnew ) =
1 + exp(≠w · xnew )

In order to make a prediction, we can use a threshold at 12 . This separates all points
with probability < 12 from the points with probability Ø 12 . We use the indicator function
I to obtain a 1 if the probability is Ø 12 and obtain a 0 otherwise.

1
y‚new = I(‡(w · xnew )) Ø ) = I(w · xnew Ø 0)
2
The second equality follows from Equation 13.1. The second form makes it clear that the
class (or decision) boundary is linear; in many dimensions, the class boundary is thus a
hyperplane.

We now look in greater detail into how to compute a decision boundary and contour
lines for our model function. For this, we consider the question: What is the contour
line for p(y = 1 | x, w) = p0 ? In other words, which data points x satisfy this equality,
assuming that w is fixed, i.e. we already have our model parameters estimated.
1 p0
By definition: p(y = 1 | x, w) = =
1 + exp(≠x · w) 1

1 p0
Simplify: =
1 + exp(≠x · w) ≠ 1 1 ≠ p0
p0
Take the log on both sides: log 1 ≠ log exp(≠x · w) = log
1 ≠ p0
p0
We obtain the hyperplane: x · w = log (13.2)
1 ≠ p0
We start by applying the definition of the probability that y = 1 given x and w under the
logistic regression model. In the second step, we subtract on both sides the numerator
from the denominator. Doing so gets rid of the +1 in the denominator on the left-hand

187
(a) (b)

Figure 13.2: Two classes red and blue with 2D data points. (a) Gray lines represent
contour lines for different thresholds p0 . Red and blue stars depict misclassified data
points for the threshold p0 = 0.6. The datasset is not linearly separable, so there will
always be misclassified data points since the decision boundary of the logistic regression
is linear. (b) 3D depiction of the same dataset by adding a new irrelevant dimension z
for the third axis to each data point; z = 0.5 for blue data points and z = 0.3 for each
red data point). The surface represents the contour lines for various values of p0 œ [0, 1].

side. Finally, we take the logarithm on both sides of the equation and simplify the
expressions to arrive at the decision boundary. We finally end up with a linear function
in x. The term on the right-hand side of the decision boundary is also known as the
log-odds of p0 .

The points that lie on the decision boundary, i.e. the points that are equally likely to
belong to class 1 and 2, are the data points x that satisfy the following equality:
1
p(y = 1 | x, w) = p(y = 0 | x, w) = ∆ x·w=0
2
1
We can verify this implication by letting p0 = 2
and plugging it into Equation 13.2. This
1
then gives us x · w = log 1≠ 1 = log 1 = 0.
2
2

13.2.1 Contour Lines Represent Class Label Probabilities

Let us now examine the decision boundary graphically. Figure 13.2 depicts the data
points of two classes, red and blue, with different means in the (a) 2D plane and (b) in

188
the 3D plane. The data for each class is drawn from a normal distribution per class.
As we can see in Figure 13.2 (a), the points are not linearly separable. Therefore we
will always misclassify some data points with the linear decision boundary of the logistic
regression. The grey lines represent the contour lines for different probability thresholds
p0 : from bottom left to top right these p0 values are: 0.15, 0.3, 0.45, 0.6 ,0.75 and 0.9,
respectively. Starred points represent misclassifications made by the logistic regression
classifier for p0 = 0.6. The different heights for the two classes in Figure 13.2 (b) are only
to ease the visual distinction between the two classes.

13.3 Deriving the Parameters for Logistic Regres-


sion

We next derive the parameters of a logistic regression classifier using the maximum like-
lihood principle.

Assume we are given a dataset D = ((x1 , y1 ), . . . , (xN , yN )), where the N data points
xi œ RD are i.i.d. and yi œ {0, 1}. The likelihood of observing these data points, given
the model parameters w, is given by the product of the likelihood of observing every
single data point individually:
N
Ÿ N
Ÿ
T T 1≠yi
p(y | X, w) = ‡(w xi ) · (1 ≠ ‡(w xi ))
yi
= yi
µi · (1 ≠ µi )1≠yi
i=1 i=1

In the above expression, we made use of the compact formula for the Bernoulli distribution
def
and also defined µi = ‡(wT xi ) to shorten notation.

The negative log-likelihood is then given by:


N
ÿ
NLL(y | X, w) = ≠ (yi log µi + (1 ≠ yi ) log(1 ≠ µi ))
i=1

Interestingly, the NLL(yi | xi , w) is the cross-entropy between yi and µi for yi œ {0, 1}.
The concept of cross-entropy comes from the area of information theory. There is a deep
connection between the area of information theory, where the goal is to describe the
amount of information available via different random variables, and the area of machine
learning, where the goal is to extract meaningful patterns from data. Later in this chapter,
we briefly discuss this concept to better understand the connection.

Recall that µi = ‡(wT xi ) and the negative log-likelihood is


N
ÿ
NLL(y | X, w) = ≠ (yi log µi + (1 ≠ yi ) log(1 ≠ µi )).
i=1

Taking the gradient with respect to w yields the following expression:


N
ÿ
Òw NLL(y | X, w) = xi (µi ≠ yi ) = XT (µ ≠ y)
i=1

189
1

0.8

0.6

Entropy 0.4

0.2

0
0 0.2 0.4 0.6 0.8 1

Figure 13.3: Entropy of a Bernoulli random variable for different values of ◊. The graph
reaches its maximum value for ◊ = 0.5, which corresponds to a uniform distribution where
both outcomes are equally likely.

The Hessian matrix (Section 2.2.5) of the second-order partial derivatives can be expressed
as
H = XT SX,
where S is a diagonal matrix with each entry Sii = µi (1 ≠ µi ). Since µi = ‡(wT xi ), the
term µi (1 ≠ µi ) is the first order derivative of µi with respect to wT xi . The Hessian of
NLL is positive definite everywhere, assuming X has full rank, otherwise it is only positive
semi-definite everywhere (easy exercise). This ensures that the NLL is convex, and we
can use convex optimisation methods to minimise it. Recall that for convex optimisation
methods a local optimum is also global optimum.

The full derivation of the gradient and Hessian of the above NLL is an exercise and
will be discussed in class.

13.3.1 Side Note: Entropy and Cross-Entropy

The concepts of entropy and cross-entropy are relevant to machine learning. In short,
the entropy is used to quantify the information about a random variable.
Definition 13.2 (Entropy). Entropy H is a measure of uncertainty associated with a
random variable X: ÿ
H(X) = ≠ p(x) log p(x),
x
where we sum, over all possible outcomes x, the probability that x occurs times the log
of this probability.

190
Two important properties are:

• Maximum entropy is reached for uniform distributions.

• Minimum entropy is reached in case there is only one possible outcome of the
random variable.

For instance, for a Bernoulli random variable X with parameter ◊, the entropy is
given by:
H(X) = ≠◊ log2 (◊) ≠ (1 ≠ ◊) log2 (1 ≠ ◊)
The graph of this entropy function is depicted in Figure 13.3 as a function of ◊. The
graph achieves its maximum value 1 for ◊ = 0.5, which means a uniform distribution,
that is, the binary random variable X has both outcomes equally likely. For ◊ = 0 or
◊ = 1, the entropy is 0 (e.g., there is no uncertainty in case the variable has only one
outcome with probability 1) .

Now let p and q be two distributions and suppose the support of p is contained in
q. For instance, imagine we have two random variables, Xp and Xq , that follow the
distributions p and q respectively. Then the possible values (outcomes) of Xp are also
possible values of Xq .

Definition 13.3 (Cross-Entropy). Cross-entropy measures the expected number of bits


required to encode an observation from p, if the encoding scheme is based on q:
ÿ
H(p, q) = ≠ p(x) log q(x)
x

In other words, for every x, we take the probability p of that value and the number
of bits needed to encode it using the encoding scheme of q. For the latter, we need
logarithmically many bits.

Cross-entropy is related to the negative log-likelihood given before for the logistic
regression classifier. Essentially, in our classification problem we estimate the probability
of different outcomes to occur. Imagine the estimated probability of outcome i is qi
and the frequency (empirical probability) of outcome i in the training set is pi , then the
negative log-likelihood of the training data is the cross-entropy H(p, q). Specifically, the
negative log-likelihood for data point (xi , yi )

NLL(yi | xi , w) = ≠(yi log µi + (1 ≠ yi ) log(1 ≠ µi ))

is the cross-entropy between yi and µi , where yi is the observed class, that is supported
by the empirical probability of the outcomes in the training data set, i.e. it is supported
by pi , while µi is the estimated probability ‡(wT xi ), denoted by qi .

191
13.4 Newton Method for Optimising the Negative
Log-Likelihood for Logistic Regression

We next show how to optimise the negative log-likelihood using the Newton’s method
(Section 11.7.1). Recall that the Newton’s method is used if the number of dimensions D
is small, as it requires to take the inverse of the Hessian matrix with dimensions D ◊ D
and this takes time cubic in D. For small D, we can apply Newton’s method to estimate
w for logistic regression.

A quick reminder of how Newton’s method works. we construct a sequence of min-


imisers for the second-order approximation of the objective function NLL, starting with
some random minimiser w0 .

These minimisers allow us to move fast towards the minimum of the objective function.
We then update the parameters on each step and let wt be the parameters after t Newton
steps.

For the update, we need the gradient and Hessian matrix which are respectively given
by:

gt = XT (µt ≠ y) = ≠XT (y ≠ µt )
Ht = XT St X

Note that St depends on the parameter vector wt and therefore dependents on t: St,ii =
µt,i (1 ≠ µt,i ) and µt,i = ‡(wtT xi ). Similarly, the gradient is also dependent on t. We need
to compute the gradient and the Hessian at each iteration t. The Newton update rule is
as follows:

wt+1 = wt ≠ H≠1
t gt
= wt + (XT St X)≠1 XT (y ≠ µt )
= (XT St X)≠1 (XT St X)wt + (XT St X)≠1 XT (y ≠ µt )
= (XT St X)≠1 XT St (Xwt + S≠1
t (y ≠ µt ))
¸ ˚˙ ˝
zt

= (XT St X)≠1 XT St zt

We proceeded above as follows: We first plug in the definitions of the gradient and the
Hessian. Next, we multiply wt by ID = H≠1 t Ht = (X St X) (X St X). This does
T ≠1 T

not alter the equation but allows us to bring the expression in the desired final shape.
This step assumes that the Hessian is invertible, which has to be the case, because
we could otherwise not apply Newton’s method in the first place. Now we factor out
(XT St X)≠1 XT St and therefore add a correction term S≠1 to the second term (y ≠ µt ).
Lastly, we rename the expression in parentheses.

192
13.4.1 From Ordinary Least Squares to Weighted Least Squares

Looking closely, we can observe that the expression for wt+1 in Equation 13.3 looks similar
to the closed-form solution we obtained for the least-squares objective in Section 3.6.1.
We next derive a generalisation of ordinary least squares objective, called weighted least
squares objective, whose minimisation is given by this expression for wt+1 . In other words,
to compute the parameters at a particular Newton step t + 1, this form corresponds to
minimising the weighted least squares objective function.

Recall the ordinary least squares (OLS) objective function L(w) and the closed-form
solution for w.
ÿ
Objective function: L(w) = (xiT w ≠ yi )2
i
Closed-form solution: w = (XT X)≠1 XT y
We work our way from Equation 13.3 back to the objective function it solves, i.e. we try
to find L(w). Our strategy is to bring the expression into a form similar to the closed-
form solution for the ordinary least squares. This means we need to find a matrix X̃ such
that X̃T X̃ = XT St X, and we have to find a ỹ such that X̃T ỹ = XT St zt . We start with
the equation itself:

wt+1 = (XT St X)≠1 XT St zt (13.3)

Since St is a diagonal matrix, we can factorise it into two diagonal matrices having
the square root of the entries. Recall that a diagonal matrix is symmetric and therefore
1/2 1/2 T
equal to its transpose, i.e. St = St . Therefore, we can rewrite the expression as
follows:
1/2 1/2 1/2 1/2 1/2 T 1/2 1/2 T 1/2
wt+1 = (XT St St X)≠1 XT St St zt = ((XT (St ) )St X)≠1 (XT (St ) )St zt
Renaming the individual terms, as shown in the above equation, yields a form that is
structurally identical to the closed-form solution of OLS, yet with different matrices and
vectors:
1/2 1/2 1/2 1/2
wt+1 = ((St X)T St X)≠1 (St X)T St zt
¸ ˚˙ ˝ ¸ ˚˙ ˝ ¸ ˚˙ ˝ ¸ ˚˙ ˝
X̃T X̃ X̃T ỹ
T T
= (X̃ X̃) X̃ ỹ ≠1

This however allows us to synthesise the objective function, where we write x̃iT instead
of xiT and ỹi instead of yi :
ÿ
L(wt+1 ) = (x̃iT wt+1 ≠ ỹi )2
i

Now, we can plug in the terms for x̃iT and ỹi , but we need to be careful with the indices,
as for the objective function we consider vectors, not matrices as for the derivation before:
ÿ 1/2 1/2
L(wt+1 ) = (St,ii xiT wt+1 ≠ St,ii zt,i )2
i

193
1/2
Lastly, we can factor out the term St,ii to arrive at the objective function of weighted
least squares:
ÿ
Objective function: L(wt+1 ) = St,ii (xiT wt+1 ≠ zt,i )2
i
Closed-form solution: wt+1 = (XT St X)≠1 XT St zt
The above result is similar to OLS but with weights St,ii œ R for each squared residual.

13.4.2 Iteratively Re-Weighted Least Squares

The above weighted least squares objective function is to be minimised at each iteration
step to obtain the minimiser wt+1 . For this reason, the overall approach to computing
the parameters for the logistic regression classifier is called Iteratively Re-Weighted Least
Squares (IRLS):

• Each Newton step requires re-weighting of the residual by a new diagonal matrix
St , because it depends on our prediction ‡(wT xi ).
• Each step uses a new vector zt , which depends on wt .
• We proceed iteratively, one Newton step after the other.

13.5 Multi-Class Logistic Regression

Consider now C > 2 classes: y œ {1, . . . , C}. A natural generalisation of the bi-
nary logistic regression to multiple classes would be to consider a vector of parameters
wc œ RD for every class c. This is in contrast to binary logistic regression, where we
considered a single parameter vector for both classes1 . The parameters form a matrix
W = [w1 , . . . , wC ] œ RD◊C . The multi-class logistic model is then given by:
exp(wcT x)
p(y = c | x, W) = qC
cÕ =1 exp(wcÕ x)
T

Recall from Chapter 12 that the softmax function maps the elements of its input vector
to values in the domain [0, 1] that sum up to 1. This allows a probabilistic interpretation
of this values, with the largest value in the input vector being assigned to the largest
probability amongst the probabilities assigned to the elements of the input vector.

Similarly to the binary case, we can show that the negative log-likelihood is convex
also for the multi-class logistic regression. We can therefore use convex optimisation
methods to estimate the parameters W.
1
We could in fact omit the parameter vector for one of the classes since we can deduct the probability
of this class from the other probabilities, as they sum up to 1. However, we will not do so here as it will
allow a neat treatment using the softmax function.

194
The multi-class logistic model can be expressed using softmax:
3Ë ÈT 4
p(y | x, W) = softmax w1T x, . . . , wC
T
x ,

where softmax takes as arguments a vector of the dot products of the individual parameter
vectors of each class wc and the input vector x.

Two-class logistic regression is a special case of this generalisation, where W =


[w0 , w1 ].
3Ë ÈT 4 exp(w1T x)
softmax w1T x, w0T x =
1 exp(w1T x) + exp(w0T x)
1
=
1 + exp(≠(w1T ≠ w0T )x)
= ‡((w1 ≠ w0 )T x)

The first step follows by definition of the softmax function. The second step is achieved
by dividing the numerator and the denominator by the numerator. This brings the
equation into the form of the sigmoid function, and it also motivates the fact that we do
not need to use a distinct parameter vector for each class in the binary case. It is enough
to use one parameter vector that captures the difference between the two-parameter
vectors, here w1 ≠ w0 .

13.5.1 Decision Boundaries are Linear

Decision boundaries for multi-class logistic regression are linear. Figure 13.4 shows data
points of three classes (red, blue and green) that are generated according to the following
normal distributions with different means and identical variances.

• Class red: Data drawn from N (µ1 = (≠1, ≠1), ‡ 2 = 1)


• Class blue: Data drawn from N (µ2 = (1, 1), ‡ 2 = 1)
• Class green: Data drawn from N (µ3 = (2, ≠2), ‡ 2 = 1)

Stars indicate misclassified data points. To mathematically show that the decision bound-
aries are linear, we equate the probability of the data points belonging to one class with
the probability of the data points belonging to another class and then obtain a linear
function in x.

13.6 Summary

In this chapter we have covered the following topics:

195
Figure 13.4: Linear decision boundaries between three classes (red, blue and green) using
multi-class logistic regression. The points are generated according to three normal dis-
tributions with different means and identical variances. Stars indicate misclassified data
points.

• Logistic Regression is a (binary) discriminative classification model.

• Extension to multi-class logistic regression can easily be achieved by replacing sig-


moid with softmax.

• We can derive maximum likelihood estimates using convex optimisation. This led
us to the iteratively re-weighted least squares optimisation method, which is based
on Newton’s method.

• For more information on multi-class logistic regression, see Section 8.3 in “Machine
Learning: A Probabilistic Perspective” [Murphy, 2012].

• Practical 2 is on comparing experimentally generative models with discriminative


models for classification, in particular Naïve Bayes vs. logistic regression.

In the context of logistic regression, we have not discussed basis expansion and regu-
larisation. These concepts can be also combined with logistic regression.

• In the class to this topic, we will see that regularisation may be necessary if data
is linearly separable.

196
• What if the classification boundaries are non-linear? In order to cope with that
and “lift” these relatively simple models to non-linearity, we can use:

– Polynomial or kernel-based basis expansion;


– ¸1 /¸2 regularisation if there is a risk of overfitting.

197
Bibliography

[Alex Krizhevsky, 2012] Alex Krizhevsky, Ilya Sutskever, G. E. H. (2012). Imagenet


classification with deep convolutional neural networks. Communications of the ACM,
25.

[Bertsekas, 1999] Bertsekas, D. P. (1999). Nonlinear programming. Athena scientific


Belmont.

[Boyd and Vandenberghe, 2004] Boyd, S. and Vandenberghe, L. (2004). Convex Opti-
mization. Cambridge University Press.

[Breiman, 1996] Breiman, L. (1996). Bagging predictors. In Bagging Predictors, vol-


ume 24, pages 123–140.

[Breiman, 2001] Breiman, L. (2001). Statistical modeling: The two cultures (with com-
ments and a rejoinder by the author). Statistical Science, 16(3).

[Collins, 2012] Collins, M. (2012). Convergence Proof for the Perceptron Algorithm.
http://www.cs.columbia.edu/ mcollins/courses/6998-2012/notes/perc.converge.pdf.

[Domingos, 2012] Domingos, P. (2012). A few useful things to know about machine
learning. Commun. ACM, 55(10):78–87.

[Donoho, 2017] Donoho, D. (2017). 50 years of data science. Journal of Computational


and Graphical Statistics, 26(4):745–766.

[Efroymson, 1960] Efroymson, M. A. (1960). Multiple regression analysis. Wiley.

[Erin L. Allwein, 2000] Erin L. Allwein, Robert E. Schapire, Y. S. (2000). Reducing


multiclass to binary: A unifying approach for margin classifiers. Journal of Machine
Learning Research, 1(2):113–141.

[Fisher, 1936] Fisher, R. A. (1936). The use of multiple elements in taxonomic problems.
Annals of Eugenics, 7(2):179–188.

[Goldstein et al., 2014] Goldstein, T., Studer, C., and Baraniuk, R. G. (2014). A
field guide to forward-backward splitting with a FASTA implementation. CoRR,
abs/1411.3406.

[Goodfellow et al., 2016] Goodfellow, I., Bengio, Y., and Courville, A. (2016). Deep
Learning. MIT Press. http://www.deeplearningbook.org.

312
[Goodfellow et al., 2015] Goodfellow, I., Shlens, J., and Szegedy, C. (2015). Explaining
and harnessing adversarial examples. In International Conference on Learning Repre-
sentations.

[Mayer et al., 2010] Mayer, J., Khairy, K., and Howard, J. (2010). Drawing an elephant
with four complex parameters. American Journal of Physics, 78.

[Mitchell, 1997] Mitchell, T. (1997). Machine Learning. McGraw-Hill International Edi-


tions. McGraw-Hill.

[Moritz Hardt, 2016] Moritz Hardt, Ben Recht, Y. S. (2016). Train faster, generalize
better: Stability of stochastic gradient descent. Journal of Machine Learning Research,
48:1225–1234.

[Murphy, 2012] Murphy, K. P. (2012). Machine Learning: A Probabilistic Perspective.


MIT Press, Cambridge, MA.

[Nesterov, 2018] Nesterov, Y. E. (2018). Lectures on Convex Optimization. Optimization


and its Applications. Springer.

[Schleich et al., 2016] Schleich, M., Olteanu, D., and Ciucanu, R. (2016). Learning linear
regression models over factorized joins. In SIGMOD, pages 3–18.

[Schölkopf and Smola, 2002] Schölkopf, B. and Smola, A. J. (2002). Support vector ma-
chines and kernel algorithms. In Arbib, M. A., editor, The Handbook of Brain Theory
and Neural Networks, pages 1119–1125. MIT Press.

[Srivastava et al., 2014] Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and
Salakhutdinov, R. (2014). Dropout: A simple way to prevent neural networks from
overfitting. J. Mach. Learn. Res., 15:1929–1958.

[Strang, 2005] Strang, G. (2005). Linear Algebra and its Applications. Cengage Learning.

[Szegedy et al., 2014] Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D.,
Goodfellow, I., and Fergus, R. (2014). Intriguing properties of neural networks. In
International Conference on Learning Representations.

[Thomas, 2018] Thomas, G. (2018). Mathematics for Machine Learning.


https://gwthomas.github.io/docs/math4ml.pdf.

[Tukey, 1962] Tukey, J. W. (1962). The future of data analysis. The Annals of Mathe-
matical Statistics, pages 1–67.

[Turing, 1950] Turing, A. M. (1950). Computing machinery and intelligence. Mind,


LIX(236):433–460.

[Zeiler and Fergus, 2013] Zeiler, M. D. and Fergus, R. (2013). Visualizing and under-
standing convolutional networks. CoRR, abs/1311.2901.

313

You might also like