
CIS 520: Machine Learning Spring 2021: Lecture 8

Neural Networks / Deep Learning

Lecturer: Shivani Agarwal

Disclaimer: These notes are designed to be a supplement to the lecture. They may or may not cover all the material discussed in the lecture (and vice versa).

Outline
• Introduction
• Neural network models
• Training: Backpropagation
• Convolutional neural networks

1 Introduction
Neural networks enable learning of highly nonlinear models by automatically and jointly learning both
nonlinear ‘features’ from the raw input data and (generalized) linear models on top of the extracted features.
In this lecture we describe neural network models for both regression and classification problems, and discuss
their training via backpropagation; we also give some pointers related to convolutional neural networks, a
particular class of neural network models that are widely used in applications such as computer vision.

2 Neural Network Models


2.1 Neural Network Models for Regression
Let us start with regression. Given an input feature vector $\mathbf{x} \in \mathbb{R}^d$, the basic linear regression model simply combines the input features $x_1, \ldots, x_d$ via a linear function to produce a predicted output $\hat{y} = \mathbf{w}^\top \mathbf{x}$, where $\mathbf{w} \in \mathbb{R}^d$ is a weight vector.

We can construct a more complex model by first constructing some 'higher-level' or 'intermediate' nonlinear features from the input feature vector $\mathbf{x}$, and then linearly combining the new features to produce a prediction $\hat{y}$. One possibility is to use some fixed basis functions, and to learn a linear combination of these from the
training data. Instead, neural network models allow the feature or basis functions to themselves be chosen
adaptively based on the training data. Moreover, they allow extraction of increasingly complex features by
allowing intermediate feature extraction units to be stacked in layers, so that features produced by units in
one layer can be used as inputs to the feature extraction units in the next layer, and so on.
As an example, consider a model with one layer of intermediate feature extraction units (one hidden layer).
Each feature extraction unit, referred to as a hidden unit, extracts a nonlinear feature from the input vector
x by first linearly combining the input features, and then applying a nonlinear transformation to this linear
combination. Specifically, suppose there are d1 hidden units in the hidden layer. The j-th hidden unit
produces a feature or ‘activation’ value
$$a_j^{(1)} = g\big(\mathbf{w}_j^{(1)\top} \mathbf{x}\big)\,,$$
where $\mathbf{w}_j^{(1)} \in \mathbb{R}^d$ is a weight vector associated with the $j$-th unit and $g : \mathbb{R} \to \mathbb{R}$ is a nonlinear activation function. Here the '(1)' in the superscript denotes quantities associated with the 1st hidden layer (in this example, there is only one hidden layer, but this notation will be helpful when discussing deeper networks with more hidden layers below). Thus $\mathbf{a}^{(1)} = (a_1^{(1)}, \ldots, a_{d_1}^{(1)})^\top \in \mathbb{R}^{d_1}$ is a vector of nonlinear features produced by the hidden layer. The output unit then linearly combines these features to produce the final output
$$\hat{y} = \mathbf{w}_1^{(2)\top} \mathbf{a}^{(1)}\,,$$
where $\mathbf{w}_1^{(2)} \in \mathbb{R}^{d_1}$ is a weight vector associated with the 1st unit (in this case the only unit) in the 2nd layer, which in this case happens to be the output layer.^1

^1 The term neural networks comes from viewing the hidden units as implementing an operation similar to that of a neuron in the brain: a neuron takes inputs from neighboring connected neurons and combines these to produce an 'impulse' or 'activation' along its output axon. For this reason, hidden units are also sometimes referred to as neurons.

Model parameters and functional form. In the above model with a single hidden layer, there are two
sets of adaptive weights (parameters): the first set going from the input features to the hidden layer, and the
second set going from the hidden layer to the output unit. The first set contains d1 weight vectors associated
with the d1 hidden units, and can be put into the rows of a matrix W(1) ∈ Rd1 ×d :
$$W^{(1)} = \begin{pmatrix} -\ \mathbf{w}_1^{(1)\top}\ - \\ -\ \mathbf{w}_2^{(1)\top}\ - \\ \vdots \\ -\ \mathbf{w}_{d_1}^{(1)\top}\ - \end{pmatrix}$$

Since there is only one output unit, the second set of weights contains just one weight vector $\mathbf{w}_1^{(2)}$, which can be put into the row of a matrix $W^{(2)} \in \mathbb{R}^{1 \times d_1}$:
$$W^{(2)} = \begin{pmatrix} -\ \mathbf{w}_1^{(2)\top}\ - \end{pmatrix}$$

To understand the functional form of the model, for each hidden unit $j$, let us denote by $z_j^{(1)}$ the input to the unit, so that we have
$$z_j^{(1)} = \mathbf{w}_j^{(1)\top} \mathbf{x}\,, \qquad a_j^{(1)} = g(z_j^{(1)})\,.$$
Then, denoting by $\mathbf{z}^{(1)} = (z_1^{(1)}, \ldots, z_{d_1}^{(1)})^\top$ the vector of inputs to the $d_1$ hidden units, and abusing notation somewhat to allow the activation function $g$ to be applied element-wise to vectors, we can write
$$\mathbf{z}^{(1)} = W^{(1)} \mathbf{x}\,, \qquad \mathbf{a}^{(1)} = g(\mathbf{z}^{(1)})\,.$$

Thus, putting everything together, and denoting by $W = \{W^{(1)}, W^{(2)}\}$ the collection of all parameters in the model, the parametric function implemented by the nonlinear (one-hidden-layer) model above, which we denote by $f_W : \mathbb{R}^d \to \mathbb{R}$, can be written using matrix notation as follows:^2
$$\hat{y} = f_W(\mathbf{x}) := W^{(2)}\, \underbrace{g(W^{(1)} \mathbf{x})}_{\mathbf{a}^{(1)}}\,.$$
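As a concrete illustration, here is a minimal NumPy sketch of this forward computation (the dimensions $d = 4$ and $d_1 = 3$, the tanh activation, and the random initialization are arbitrary choices for the example, not something prescribed by the model):

```python
import numpy as np

rng = np.random.default_rng(0)
d, d1 = 4, 3                      # input dimension and number of hidden units (arbitrary)

W1 = rng.normal(0, 0.1, (d1, d))  # W^(1) in R^{d1 x d}
W2 = rng.normal(0, 0.1, (1, d1))  # W^(2) in R^{1 x d1}
x = rng.normal(size=d)            # an input feature vector

z1 = W1 @ x                       # z^(1) = W^(1) x, inputs to the hidden units
a1 = np.tanh(z1)                  # a^(1) = g(z^(1)), hidden-layer activations
y_hat = (W2 @ a1)[0]              # yhat = W^(2) a^(1), the network's prediction
```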

The above example had just one hidden layer; more generally, one can stack hidden units in multiple layers, each layer extracting higher-level features from those produced by the previous layer. In general, if there are $L$ hidden layers, containing $d_1, \ldots, d_L$ hidden units, respectively, then we have $L+1$ sets of adaptive weights (parameters), which can be represented by matrices $W^{(1)} \in \mathbb{R}^{d_1 \times d}$, $W^{(2)} \in \mathbb{R}^{d_2 \times d_1}$, $\ldots$, $W^{(L+1)} \in \mathbb{R}^{1 \times d_L}$, respectively. Then $w_{jk}^{(l)}$ denotes the weight from the $k$-th unit in the $(l-1)$-th layer to the $j$-th unit in the $l$-th layer (with the input features forming the 0-th layer and the output unit forming the $(L+1)$-th layer).
Activation functions. Each hidden unit is associated with a nonlinear activation function $g$. Including this nonlinearity is critical: without it, we would simply have a linear regression model again (since composing one linear function with another simply produces another linear function). Popular choices for the nonlinear activation functions include the following:

• Logistic sigmoid:
$$g_\sigma(z) = \frac{1}{1 + e^{-z}}$$

• tanh:
$$g_{\tanh}(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}}$$

• Rectified linear unit (ReLU):
$$g_{\mathrm{relu}}(z) = \max(0, z)$$

Note that $g_\sigma$ produces activations in $[0, 1]$, $g_{\tanh}$ in $[-1, 1]$, and $g_{\mathrm{relu}}$ in $\mathbb{R}_+$. Also, tanh is closely related to the logistic sigmoid; in particular, it can be verified that $g_{\tanh}(z) = 2 g_\sigma(2z) - 1$ (this identity is checked numerically in the sketch below).
^2 Here we assume all units in a hidden layer implement the same activation function. One can allow for more general settings where different units implement different activation functions; in that case, one can define a vector function $g^{(1)} : \mathbb{R}^{d_1} \to \mathbb{R}^{d_1}$ with different component activation functions in order to similarly write the functional form of the model in a compact manner.
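For concreteness, here is a small NumPy sketch of the three activation functions, with a numerical check of the tanh–sigmoid identity above (the test points are arbitrary):

```python
import numpy as np

def g_sigma(z):                   # logistic sigmoid: activations in [0, 1]
    return 1.0 / (1.0 + np.exp(-z))

def g_tanh(z):                    # tanh: activations in [-1, 1]
    return np.tanh(z)

def g_relu(z):                    # rectified linear unit: activations in R_+
    return np.maximum(0.0, z)

z = np.linspace(-5, 5, 11)        # arbitrary test points
assert np.allclose(g_tanh(z), 2 * g_sigma(2 * z) - 1)   # g_tanh(z) = 2 g_sigma(2z) - 1
```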

2.2 Neural Network Models for Classification

While our discussion above focused on neural networks for regression, a similar approach can be used to construct highly nonlinear models for classification as well. For example, for binary classification, the output is passed through a logistic sigmoid squashing function to produce an estimate of the probability of label +1 (similar to the linear logistic regression model, which passes a linear function through the logistic sigmoid).

In the case of multiclass classification with $K$ classes, the output layer contains $K$ output units (so that the last weight matrix becomes $W^{(L+1)} \in \mathbb{R}^{K \times d_L}$); the outputs of these units are then passed through a softmax squashing function to produce estimates of the probabilities of the $K$ classes (much as in multiclass linear logistic regression, where $K$ linear functions are passed through the softmax).
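As a sketch of the two squashing functions (the example inputs and the class count $K = 3$ are arbitrary; the max-subtraction in the softmax is a standard numerical-stability trick, not part of the model itself):

```python
import numpy as np

def sigmoid(f):
    # binary case: one real-valued output f_W(x) -> estimate of P(y = +1 | x)
    return 1.0 / (1.0 + np.exp(-f))

def softmax(f):
    # multiclass case: K real-valued outputs -> estimated probabilities of the K classes
    e = np.exp(f - np.max(f))     # subtracting the max is a standard stability trick
    return e / e.sum()

print(sigmoid(0.7))                          # ~0.668
print(softmax(np.array([2.0, 0.5, -1.0])))   # three probabilities summing to 1
```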

2.3 Neural Network Architectures

The structure of a neural network, which includes such things as the number and organization (layout)
of hidden units and the connections allowed between them, is generally referred to as the architecture
of the network. The network architecture determines the functional form of the model (and therefore also
its complexity). A popular and very broad class of neural network architectures, and the one we focus on
here, is that of feed-forward neural networks; these are neural networks that do not contain directed
cycles between units. In particular, we focused above on a specific type of feed-forward neural network in
which hidden units are organized in successive layers, with connections going from one layer to the next;
such networks are sometimes called multilayer perceptrons (MLPs). In recent years, there has been
increasing use of MLPs with many hidden layers; these are often referred to as deep neural networks,
and the training of such networks from data is often termed deep learning. Deep neural networks used
in practice are usually designed to have highly sparse, structured connections with some shared parameters
depending on the application; for example, in computer vision, a widely used class is that of convolutional
neural networks; in speech recognition and other domains with sequential data, one often uses recurrent
neural networks, and so on.
Once a neural network architecture has been selected, one needs to train the model by estimating the model
parameters (weights) from the given training data. We discuss this below in the context of MLPs (the
discussion is easily generalized to other feed-forward neural network architectures). In practice, one may
need to consider several different architectures (especially the numbers of hidden units in the different layers, etc.),
and select a suitable one based on performance on a hold-out validation set or via cross-validation on the
training data.

3 Training: Backpropagation

Given a training sample $S = ((\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_m, y_m))$ and a desired neural network architecture, training
the model (estimating the parameters/weights of the model) usually involves minimizing a suitable loss
function on the training data. If the activation functions associated with hidden units are differentiable
(or at least subdifferentiable), then the minimization can be done using gradient descent (in practice, for
problems involving large numbers of training examples, one often uses stochastic gradient descent, or
more commonly, mini-batch gradient descent; we will discuss these below).
Gradient descent based methods start by initializing the network parameters to some (usually randomly
chosen) values, and then proceed in iterations; each iteration requires computation of the gradient of the
error objective w.r.t. the current parameters. In the case of neural networks, we will make repeated use of
the chain rule of differentiation to compute these gradients; as we will see, this will amount to propagating
derivatives backward through the network from the output layer toward the input layer, and this process
has therefore come to be known as backpropagation. In most cases, the nonlinear function implemented
by the neural network is non-convex, and one can only hope to find a local minimum of the error objective;
in practice, it is therefore common to run the gradient descent procedure from multiple random starting
points, and keep the best result (the one that yields parameter estimates with minimal objective value).

3.1 Regression

To make things concrete, let us first consider a regression problem with labels $y_i \in \mathbb{R}$, and a one-hidden-layer MLP architecture as discussed above. Suppose the hidden layer has $d_1$ hidden units with nonlinear activation function $g$. Then, as discussed above, the parameters of the model are $W^{(1)} \in \mathbb{R}^{d_1 \times d}$ and $W^{(2)} \in \mathbb{R}^{1 \times d_1}$; denoting the parameters collectively as $W = \{W^{(1)}, W^{(2)}\}$, the parametric function implemented by the model, which we denote by $f_W : \mathbb{R}^d \to \mathbb{R}$, can be written as
$$f_W(\mathbf{x}) = W^{(2)}\, \underbrace{g(W^{(1)} \mathbf{x})}_{\mathbf{a}^{(1)}}\,.$$

Our goal is to find parameter estimates $\widehat{W}$ that (approximately) minimize a suitable loss function on the training sample. As we have discussed previously, for regression problems, a commonly used loss function is the squared loss $\ell_{\mathrm{sq}} : \mathbb{R} \times \mathbb{R} \to \mathbb{R}_+$ given by
$$\ell_{\mathrm{sq}}(y, \hat{y}) = (\hat{y} - y)^2\,.$$

Thus, we would like to find parameter estimates $\widehat{W}$ that (approximately) solve the following optimization problem:
$$\min_W\ \frac{1}{m} \sum_{i=1}^m \big(f_W(\mathbf{x}_i) - y_i\big)^2\,.$$
Equivalently, as a function of $W$, we wish to minimize the following objective function:
$$J(W) = \frac{1}{m} \sum_{i=1}^m \underbrace{\big(f_W(\mathbf{x}_i) - y_i\big)^2}_{J_i(W)}\,.$$

We will first consider minimizing this objective using basic (batch) gradient descent, which on each iteration
updates parameters using the entire batch of m training examples (making each iteration computationally
expensive). We will then discuss stochastic and mini-batch gradient descent approaches, which update
parameters using only a single example or a mini-batch of examples at a time (leading to faster convergence).

Basic (batch) gradient descent. Gradient descent starts with some initial parameter values $W$,^3 and then on each iteration, updates each parameter $w_{jk}^{(l)}$ as follows:
$$w_{jk}^{(l)} \leftarrow w_{jk}^{(l)} - \eta\, \frac{\partial J(W)}{\partial w_{jk}^{(l)}} = w_{jk}^{(l)} - \eta\, \frac{1}{m} \sum_{i=1}^m \frac{\partial J_i(W)}{\partial w_{jk}^{(l)}}\,,$$
where $\eta > 0$ is the step size or learning rate parameter.^4 This process is repeated until the parameter estimates converge (or for some suitably large number of iterations).
(l)
In order to implement gradient descent, we need to compute the derivatives ∂Ji (W)/∂wjk . This is where
backpropagation comes in; it is essentially an application of the chain rule of differentiation. Specifically,
continuing with the one-hidden-layer example above, for derivatives w.r.t. weights W(2) , we have for each
j ∈ {1, . . . , d1 }:
∂Ji (W) ∂Ji (W) ∂fw (xi )
(2)
= · (2)
∂w1j ∂fw (xi ) ∂w1j
 (1)
= 2 fW (xi ) − yi · aij ,

(1) (1) (1) >


where aij = g(zij ) = g(wj xi ) is the feature/activation value of the j-th hidden unit on example i (under
the current weights). Similarly, for derivatives w.r.t. weights W(1) , we have for each j ∈ {1, . . . , d1 } and
k ∈ {1, . . . , d}:
∂Ji (W) ∂Ji (W) ∂fw (xi )
(1)
= · (1)
∂wjk ∂fw (xi ) ∂wjk
(1) (1)
∂Ji (W) ∂fw (xi ) ∂aij ∂zij
= · (1)
· (1) · (1)
∂fw (xi ) ∂aij ∂zij ∂wjk
 (2) (1)
= 2 fW (xi ) − yi · w1j · g 0 (zij ) · xik .
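Here is a minimal sketch of the second of these chain-rule computations in NumPy, with the derivative for a single weight $w_{jk}^{(1)}$ verified against a central finite-difference approximation (the tanh activation and all dimensions are arbitrary choices for the example):

```python
import numpy as np

rng = np.random.default_rng(0)
d, d1 = 4, 3                                  # arbitrary dimensions
W1 = rng.normal(0, 0.1, (d1, d))
W2 = rng.normal(0, 0.1, (1, d1))
x, y = rng.normal(size=d), 0.5                # one (arbitrary) training example

def f_W(W1, W2):                              # network output on example x
    return (W2 @ np.tanh(W1 @ x))[0]

def J_i(W1, W2):                              # squared loss on example (x, y)
    return (f_W(W1, W2) - y) ** 2

# chain-rule derivative for a single first-layer weight w_{jk}^{(1)}
j, k = 1, 2
a1 = np.tanh(W1 @ x)                          # a^(1); for tanh, g'(z) = 1 - g(z)^2
grad = 2 * (f_W(W1, W2) - y) * W2[0, j] * (1 - a1[j] ** 2) * x[k]

# check against a central finite-difference approximation
eps = 1e-6
W1p, W1m = W1.copy(), W1.copy()
W1p[j, k] += eps
W1m[j, k] -= eps
assert np.isclose(grad, (J_i(W1p, W2) - J_i(W1m, W2)) / (2 * eps))
```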
For networks with larger numbers of hidden layers, derivative computations can be chained similarly as above. For the three types of activation functions discussed above, the derivative $g'(z)$ (or a subderivative in the case of ReLU) is easily calculated as follows (a numerical check appears in the sketch below):

• Logistic sigmoid:
$$g_\sigma'(z) = \frac{e^{-z}}{(1 + e^{-z})^2} = g_\sigma(z) \cdot \big(1 - g_\sigma(z)\big)$$

• tanh:
$$g_{\tanh}'(z) = \frac{(e^z + e^{-z})^2 - (e^z - e^{-z})^2}{(e^z + e^{-z})^2} = 1 - g_{\tanh}(z)^2$$

• ReLU:^5
$$g_{\mathrm{relu}}'(z) = \begin{cases} 1 & \text{if } z > 0 \\ 0 & \text{otherwise.} \end{cases}$$
^3 Typically, parameters are initialized to some small random values; e.g. each weight $w_{jk}^{(l)}$ could be initialized to some small random value drawn from $\mathcal{N}(0, 0.1)$ (more sophisticated forms of random initialization are also used).
^4 Here we assume a fixed step size $\eta$ for simplicity, but the step size can in principle be allowed to vary from iteration to iteration.
^5 Sometimes, a differentiable approximation to the ReLU activation function is used instead.
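These formulas can be coded up directly; the following sketch checks each one against central finite differences (the test points are arbitrary, chosen away from the ReLU kink at $z = 0$):

```python
import numpy as np

g_sigma = lambda z: 1.0 / (1.0 + np.exp(-z))

dg_sigma = lambda z: g_sigma(z) * (1 - g_sigma(z))   # sigma'(z) = sigma(z)(1 - sigma(z))
dg_tanh  = lambda z: 1 - np.tanh(z) ** 2             # tanh'(z) = 1 - tanh(z)^2
dg_relu  = lambda z: (z > 0).astype(float)           # ReLU subderivative: 1 if z > 0 else 0

z, eps = np.array([-2.0, -0.3, 0.4, 1.7]), 1e-6      # arbitrary points away from z = 0
for g, dg in [(g_sigma, dg_sigma),
              (np.tanh, dg_tanh),
              (lambda z: np.maximum(0.0, z), dg_relu)]:
    fd = (g(z + eps) - g(z - eps)) / (2 * eps)       # central finite differences
    assert np.allclose(dg(z), fd)
```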

Stochastic gradient descent. Batch gradient descent requires computing the derivatives $\partial J_i(W)/\partial w_{jk}^{(l)}$ for all training examples $i$ on each iteration. For large training sets, this is computationally expensive. An alternative is to use stochastic gradient descent (SGD), in which each iteration updates parameters based on the gradient for just one example $i$:
$$w_{jk}^{(l)} \leftarrow w_{jk}^{(l)} - \eta\, \frac{\partial J_i(W)}{\partial w_{jk}^{(l)}}\,.$$

Here, examples $i$ are either processed in sequence via repeated passes over the training data (possibly randomizing the order of the examples before each pass), or alternatively, on each iteration, an example $i$ is drawn at random from the training set and used to update parameters. With batch gradient descent, if the step size $\eta$ is suitably small, the objective value decreases on each iteration; SGD iterations do not guarantee this, but with suitably decreasing $\eta$, they do eventually converge to a local minimum. In practice, while the number of iterations needed for SGD to reach a solution of a given quality is larger than for batch GD, for large data sets (large number of training examples $m$), the computation time per iteration is much smaller, resulting in significantly faster running times for SGD.
Mini-batch gradient descent. While batch gradient descent is slow, SGD tends to be quite noisy. In practice, it is common to train neural networks using a compromise between the two, wherein on each iteration, one selects (randomly or in sequence) a small mini-batch of training examples, $B \subset [m]$, and updates parameters based on the gradients for just these examples:^6
$$w_{jk}^{(l)} \leftarrow w_{jk}^{(l)} - \eta\, \frac{1}{|B|} \sum_{i \in B} \frac{\partial J_i(W)}{\partial w_{jk}^{(l)}}\,.$$

^6 E.g. mini-batches of 50–100 examples are often used.
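The following sketch puts the pieces together into a mini-batch training loop for the one-hidden-layer regression network, using the vectorized gradient expressions derived in Section 3.3 below (the synthetic data, tanh activation, step size, and batch size are all arbitrary choices for the example; setting the batch size to 1 recovers SGD, and to $m$ recovers batch gradient descent):

```python
import numpy as np

rng = np.random.default_rng(0)
m, d, d1 = 200, 4, 3
X = rng.normal(size=(m, d))
y = np.sin(X @ rng.normal(size=d))               # synthetic regression targets

W1 = rng.normal(0, 0.1, (d1, d))
W2 = rng.normal(0, 0.1, (1, d1))
eta, batch_size = 0.05, 20                       # |B| = 1 gives SGD; |B| = m gives batch GD

for epoch in range(100):
    perm = rng.permutation(m)                    # randomize example order each pass
    for start in range(0, m, batch_size):
        B = perm[start:start + batch_size]       # indices of the current mini-batch
        A1 = np.tanh(X[B] @ W1.T)                # |B| x d1 hidden activations
        F = A1 @ W2.T                            # |B| x 1 network outputs
        R = 2 * (F - y[B][:, None])              # |B| x 1, dJ_i/df for the squared loss
        dW2 = R.T @ A1 / len(B)                  # 1 x d1, averaged over the mini-batch
        dW1 = ((R * W2) * (1 - A1 ** 2)).T @ X[B] / len(B)   # d1 x d
        W1 -= eta * dW1
        W2 -= eta * dW2
```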

3.2 Classification

If we have a binary classification problem with labels $y_i \in \{\pm 1\}$, then recall that the real-valued output of the network is passed through a logistic sigmoid function to transform it to a probability estimate (equivalently, we can think of the output unit in this case as being equipped with a sigmoid activation function):
$$\hat{\eta}(\mathbf{x}) = g_\sigma\big(\underbrace{W^{(2)} g(W^{(1)} \mathbf{x})}_{f_W(\mathbf{x})}\big)\,.$$

In this case, it is common to minimize the log loss $\ell_{\log} : \{\pm 1\} \times [0, 1] \to \mathbb{R}_+$ (also called the cross-entropy loss), given by
$$\ell_{\log}(y, \hat{\eta}) = \begin{cases} -\ln \hat{\eta} & \text{if } y = +1 \\ -\ln(1 - \hat{\eta}) & \text{if } y = -1. \end{cases}$$
Note that minimizing the log (cross-entropy) loss over class probability estimation functions of the form $\hat{\eta}(\mathbf{x}) = g_\sigma(f_W(\mathbf{x}))$ is equivalent to minimizing the logistic loss over real-valued functions of the form $f_W(\mathbf{x})$. It will be convenient to transform the labels from $\{\pm 1\}$ to $\{0, 1\}$ as follows:
$$\tilde{y}_i = \frac{y_i + 1}{2} = \begin{cases} 1 & \text{if } y_i = +1 \\ 0 & \text{if } y_i = -1. \end{cases}$$
Then the log (cross-entropy) loss $\ell_{\log} : \{0, 1\} \times [0, 1] \to \mathbb{R}_+$ becomes
$$\ell_{\log}(\tilde{y}, \hat{\eta}) = \begin{cases} -\ln \hat{\eta} & \text{if } \tilde{y} = 1 \\ -\ln(1 - \hat{\eta}) & \text{if } \tilde{y} = 0 \end{cases} \;=\; -\tilde{y} \ln \hat{\eta} - (1 - \tilde{y}) \ln(1 - \hat{\eta})\,.$$

The optimization problem here therefore becomes
$$\min_W\ \frac{1}{m} \sum_{i=1}^m \Big( -\tilde{y}_i \ln \hat{\eta}(\mathbf{x}_i) - (1 - \tilde{y}_i) \ln\big(1 - \hat{\eta}(\mathbf{x}_i)\big) \Big)\,.$$
Equivalently, as a function of $W$, we wish to minimize the following objective function:
$$J(W) = \frac{1}{m} \sum_{i=1}^m \underbrace{\Big( -\tilde{y}_i \ln \hat{\eta}(\mathbf{x}_i) - (1 - \tilde{y}_i) \ln\big(1 - \hat{\eta}(\mathbf{x}_i)\big) \Big)}_{J_i(W)}\,.$$

Again, one can use batch/stochastic/mini-batch gradient descent to find a local minimum of this objective. The derivative computations proceed via backpropagation as before; the only difference is that the term $\partial J_i(W)/\partial f_W(\mathbf{x}_i)$ is now given by
$$\frac{\partial J_i(W)}{\partial f_W(\mathbf{x}_i)} = \hat{\eta}(\mathbf{x}_i) - \tilde{y}_i\,.$$
Other than this difference, backpropagation here works similarly to the regression case.
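A quick numerical sanity check of this derivative (the values of $f_W(\mathbf{x}_i)$ and $\tilde{y}_i$ below are arbitrary):

```python
import numpy as np

sigmoid = lambda f: 1.0 / (1.0 + np.exp(-f))

def J_i(f, y_tilde):
    # log (cross-entropy) loss as a function of the network output f = f_W(x_i)
    eta = sigmoid(f)
    return -y_tilde * np.log(eta) - (1 - y_tilde) * np.log(1 - eta)

f, y_tilde, eps = 0.8, 1.0, 1e-6                 # arbitrary output and label
grad = sigmoid(f) - y_tilde                      # claimed derivative: eta_hat - y_tilde
grad_fd = (J_i(f + eps, y_tilde) - J_i(f - eps, y_tilde)) / (2 * eps)
assert np.isclose(grad, grad_fd)
```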

3.3 Some Practical Issues

Vectorization. It is important to note that in practice, code for neural network training is significantly sped up by employing matrix and vector operations to compute the derivatives above, rather than using for-loops to compute individual derivatives.^7 For example, for the one-hidden-layer regression case discussed above, the derivatives can be computed in vector/matrix form as follows:
$$\frac{\partial J_i(W)}{\partial W^{(2)}} = \underbrace{2\big(f_W(\mathbf{x}_i) - y_i\big)}_{\text{scalar}}\ \underbrace{\mathbf{a}_i^{(1)\top}}_{1 \times d_1}$$
$$\frac{\partial J_i(W)}{\partial W^{(1)}} = \underbrace{2\big(f_W(\mathbf{x}_i) - y_i\big)}_{\text{scalar}}\ \underbrace{\big(W^{(2)\top} \odot g'(\mathbf{z}_i^{(1)})\big)}_{d_1 \times 1}\ \underbrace{\mathbf{x}_i^\top}_{1 \times d}\,,$$
where $\odot$ denotes element-wise multiplication of entries of two vectors. We leave it as an exercise for the reader to verify the details (a code sketch of this check appears below).
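As a partial answer to the exercise, the following sketch computes the two vectorized gradients for a single example and confirms, entry by entry, that they agree with the per-weight chain-rule formulas from Section 3.1 (all dimensions and values are arbitrary choices for the example):

```python
import numpy as np

rng = np.random.default_rng(0)
d, d1 = 4, 3                                     # arbitrary dimensions
W1 = rng.normal(0, 0.1, (d1, d))
W2 = rng.normal(0, 0.1, (1, d1))
x, y = rng.normal(size=d), 0.5

a1 = np.tanh(W1 @ x)                             # a^(1); for tanh, g'(z) = 1 - g(z)^2
r = 2 * ((W2 @ a1)[0] - y)                       # the scalar 2(f_W(x_i) - y_i)

# vectorized forms from the text ('*' is element-wise, np.outer the outer product)
dW2 = r * a1[None, :]                            # 1 x d1
dW1 = r * np.outer(W2[0] * (1 - a1 ** 2), x)     # d1 x d

# entry-by-entry check against the per-weight chain-rule formulas
for j in range(d1):
    assert np.isclose(dW2[0, j], r * a1[j])
    for k in range(d):
        assert np.isclose(dW1[j, k], r * W2[0, j] * (1 - a1[j] ** 2) * x[k])
```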
Controlling complexity/avoiding overfitting. In addition to using some form of holdout- or cross-validation to select certain elements of the network architecture (such as the number of layers or numbers of hidden units in the layers), there are several other strategies used to control model complexity and avoid overfitting. These include,
for example, the following: regularization (e.g. L2 regularization on the weights); early stopping (not
running gradient descent till complete convergence but stopping earlier); dropout (on each iteration, some
units are randomly dropped and the remaining weights updated without them; at the end, all units are
used in the final model, with weights reduced by some amount – this can be viewed as approximating a
combination/ensemble of several neural network models); structured/sparse connections (not all units
in successive layers are connected to each other, thus reducing the number of parameters in the model; this
is done for example in convolutional neural networks and in recurrent neural networks); and parameter
sharing (some connections are constrained to share the same weights, thus again reducing the number of
parameters in the model; again, this is done for example in convolutional neural networks and in recurrent
neural networks).
^7 Indeed, vectorization has played an important role in the scaling up of neural network training; in particular, vectorized code is easily parallelized on graphics processing units (GPUs), which allows deep neural networks with many layers to be trained on large amounts of data relatively quickly.

4 Convolutional Neural Networks

Convolutional neural networks are widely used in computer vision as well as in other domains. For a
description of early work on convolutional neural networks, we refer the reader to the book chapter by
LeCun and Bengio below [1]. For a review of recent progress in convolutional neural networks and in deep
learning more broadly, we refer the reader to the 2015 Nature review article by LeCun et al. below [2].

References
[1] Yann LeCun and Yoshua Bengio. Convolutional networks for images, speech, and time-series. In M. A.
Arbib, editor, The Handbook of Brain Theory and Neural Networks. MIT Press, 1995.
[2] Yann LeCun, Yoshua Bengio, and Geoffrey E. Hinton. Deep learning. Nature, 521(7553):436–444, 2015.
