
Backpropagation


Backpropagation, short for "backward propagation of errors," is an algorithm for supervised learning of artificial neural networks using gradient descent. Given an artificial neural network and an error function, the method calculates the gradient of the error function with respect to the neural network's weights. It is a generalization of the delta rule for perceptrons to multilayer feedforward neural networks.

The "backwards" part of the name stems from the fact that calculation of the gradient
proceeds backwards through the network, with the gradient of the final layer of weights
being calculated first and the gradient of the first layer of weights being calculated last.
Partial computations of the gradient from one layer are reused in the computation of the
gradient for the previous layer. This backwards flow of the error information allows for
efficient computation of the gradient at each layer versus the naive approach of calculating
the gradient of each layer separately.

Backpropagation's popularity has experienced a recent resurgence given the widespread adoption of deep neural networks for image recognition and speech recognition. It is considered an efficient algorithm, and modern implementations take advantage of specialized GPUs to further improve performance.

Contents

 History

 Formal Definition

 Deriving the Gradients

 The Backpropagation Algorithm

History
Backpropagation was invented in the 1970s as a general optimization method for performing automatic differentiation of complex nested functions. However, it wasn't until 1986, with the publication of a paper by Rumelhart, Hinton, and Williams, titled "Learning Representations by Back-Propagating Errors," that the importance of the algorithm was appreciated by the machine learning community at large.

Researchers had long been interested in finding a way to train multilayer artificial neural
networks that could automatically discover good "internal representations," i.e. features that
make learning easier and more accurate. Features can be thought of as the stereotypical
input to a specific node that activates that node (i.e. causes it to output a positive value near
1). Since a node's activation is dependent on its incoming weights and bias, researchers
say a node has learned a feature if its weights and bias cause that node to activate when
the feature is present in its input.

By the 1980s, hand-engineering features had become the de facto standard in many fields,
especially in computer vision, since experts knew from experiments which features (e.g.
lines, circles, edges, blobs in computer vision) made learning simpler. However, hand-
engineering successful features requires a lot of knowledge and practice. More importantly,
since it is not automatic, it is usually very slow.

Backpropagation was one of the first methods able to demonstrate that artificial neural
networks could learn good internal representations, i.e. their hidden layers learned nontrivial
features. Experts examining multilayer feedforward networks trained using backpropagation
actually found that many nodes learned features similar to those designed by human
experts and those found by neuroscientists investigating biological neural networks in
mammalian brains (e.g. certain nodes learned to detect edges, while others computed
Gabor filters). Even more importantly, because of the efficiency of the algorithm and the fact
that domain experts were no longer required to discover appropriate features,
backpropagation allowed artificial neural networks to be applied to a much wider field of
problems that were previously off-limits due to time and cost constraints.

Formal Definition
Backpropagation is analogous to calculating the delta rule for a multilayer feedforward
network. Thus, like the delta rule, backpropagation requires three things:

1) A dataset consisting of input-output pairs $\big(\vec{x_i}, \vec{y_i}\big)$, where $\vec{x_i}$ is the input and $\vec{y_i}$ is the desired output of the network on input $\vec{x_i}$. The set of input-output pairs of size $N$ is denoted $X = \Big\{\big(\vec{x_1}, \vec{y_1}\big), \dots, \big(\vec{x_N}, \vec{y_N}\big)\Big\}$.

2) A feedforward neural network, as formally defined in the article concerning feedforward neural networks, whose parameters are collectively denoted $\theta$. In backpropagation, the parameters of primary interest are $w_{ij}^k$, the weight between node $j$ in layer $l_k$ and node $i$ in layer $l_{k-1}$, and $b_i^k$, the bias for node $i$ in layer $l_k$. There are no connections between nodes in the same layer and layers are fully connected.

3) An error function, $E(X, \theta)$, which defines the error between the desired output $\vec{y_i}$ and the calculated output $\hat{\vec{y_i}}$ of the neural network on input $\vec{x_i}$ for a set of input-output pairs $\big(\vec{x_i}, \vec{y_i}\big) \in X$ and a particular value of the parameters $\theta$.

Training a neural network with gradient descent requires the calculation of the gradient of the error function $E(X, \theta)$ with respect to the weights $w_{ij}^k$ and biases $b_i^k$. Then, according to the learning rate $\alpha$, each iteration of gradient descent updates the weights and biases (collectively denoted $\theta$) according to

$$\theta^{t+1} = \theta^{t} - \alpha \frac{\partial E(X, \theta^{t})}{\partial \theta},$$

where $\theta^{t}$ denotes the parameters of the neural network at iteration $t$ in gradient descent.
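In code, this update is a single line. The sketch below (an illustration, not the full algorithm) assumes a hypothetical function grad_E(X, theta) that returns $\frac{\partial E(X, \theta)}{\partial \theta}$ as a NumPy array:

import numpy as np

def gradient_descent_step(theta, grad_E, X, alpha):
    # theta^{t+1} = theta^t - alpha * dE(X, theta^t)/dtheta
    return theta - alpha * grad_E(X, theta)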

What's the Target?

As mentioned in the previous section, one major problem in training multilayer feedforward
neural networks is in deciding how to learn good internal representations, i.e. what the
weights and biases for hidden layer nodes should be. Unlike the perceptron, which has the
delta rule for approximating a well-defined target output, hidden layer nodes don't have a
target output since they are used as intermediate steps in the computation.

Since hidden layer nodes have no target output, one can't simply define an error function that is specific to that node. Instead, any error function for that node will be dependent on the values of the parameters in the previous layers (since previous layers determine the input for that node) and following layers (since the output of that node will affect the computation of the error function $E(X, \theta)$). This coupling of parameters between layers can make the math quite messy (primarily as a result of using the product rule, discussed below), and if not implemented cleverly, can make the final gradient descent calculations slow. Backpropagation addresses both of these issues by simplifying the mathematics of gradient descent, while also facilitating its efficient calculation.

Formal Definition

The formulation below is for a neural network with one output, but the algorithm can be applied to a network with any number of outputs by consistent application of the chain rule and power rule. Thus, for all the following examples, input-output pairs will be of the form $(\vec{x}, y)$, i.e. the target value $y$ is not a vector.

Remembering the general formulation for a feedforward neural network,

$w_{ij}^k$: weight for node $j$ in layer $l_k$ for incoming node $i$
$b_i^k$: bias for node $i$ in layer $l_k$
$a_i^k$: product sum plus bias (activation) for node $i$ in layer $l_k$
$o_i^k$: output for node $i$ in layer $l_k$
$r_k$: number of nodes in layer $l_k$
$g$: activation function for the hidden layer nodes
$g_o$: activation function for the output layer nodes

The error function in classic backpropagation is the mean squared error

$$E(X, \theta) = \frac{1}{2N} \sum_{i=1}^N \left(\hat{y_i} - y_i\right)^2,$$

where $y_i$ is the target value for input-output pair $(\vec{x_i}, y_i)$ and $\hat{y_i}$ is the computed output of the network on input $\vec{x_i}$. Again, other error functions can be used, but the mean squared error's historical association with backpropagation and its convenient mathematical properties make it a good choice for learning the method.
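As a quick sanity check of this formula, the error can be computed directly with NumPy; in this sketch, y_hat and y are assumed to hold the network's outputs and the targets:

import numpy as np

def mean_squared_error(y_hat, y):
    # E(X, theta) = (1/2N) * sum of (y_hat_i - y_i)^2
    return np.sum((y_hat - y) ** 2) / (2 * len(y))

print(mean_squared_error(np.array([0.9, 0.2]), np.array([1.0, 0.0])))  # 0.0125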

Deriving the Gradients


The derivation of the backpropagation algorithm is fairly straightforward. It follows from the use of the chain rule and product rule in differential calculus. Application of these rules is dependent on the differentiation of the activation function, which is one of the reasons the Heaviside step function is not used (being discontinuous, and thus non-differentiable).

Preliminaries

For the rest of this section, the derivative of a function $f(x)$ will be denoted $f^{\prime}(x)$, so that the sigmoid function's derivative is $\sigma^{\prime}(x)$.

To simplify the mathematics further, the bias $b_i^k$ for node $i$ in layer $k$ will be incorporated into the weights as $w_{0i}^k$ with a fixed output of $o_0^{k-1} = 1$ for node $0$ in layer $k-1$. Thus,

$$w_{0i}^k = b_i^k.$$

To see that this is equivalent to the original formulation, note that

$$a_i^k = b_i^k + \sum_{j = 1}^{r_{k-1}} w_{ji}^k o_j^{k-1} = \sum_{j = 0}^{r_{k-1}} w_{ji}^k o_j^{k-1},$$

where the left side is the original formulation and the right side is the new formulation.
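In code, this bias trick amounts to prepending a constant $1$ to each layer's output vector and a row of biases to each weight matrix, as in the sketch below (the weights here are made up for illustration; the NumPy example at the end of this article uses the same device):

import numpy as np

# outputs of layer k-1 for one input, with the fixed o_0^{k-1} = 1 prepended
o_prev = np.concatenate(([1.0], [0.5, 0.3]))

# weight matrix W[j, i] = w_{ji}^k: row 0 holds the biases b_i^k = w_{0i}^k
W = np.array([[ 0.1, -0.2],
              [ 0.4,  0.7],
              [-0.6,  0.5]])

# activations a_i^k = sum_j w_{ji}^k o_j^{k-1}, biases included automatically
a = o_prev @ W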

Using the notation above, backpropagation attempts to minimize the following error function
with respect to the neural network's weights:
$$E(X, \theta) = \frac{1}{2N}\sum_{i=1}^N\left( \hat{y_i} - y_i\right)^{2}$$

by calculating, for each weight $w_{ij}^k$, the value of $\frac{\partial E}{\partial w_{ij}^k}$. Since the error function can be decomposed into a sum over individual error terms for each individual input-output pair, the derivative can be calculated with respect to each input-output pair individually and then combined at the end (since the derivative of a sum of functions is the sum of the derivatives of each function):

$$\frac{\partial E(X, \theta)}{\partial w_{ij}^k} = \frac{1}{N}\sum_{d=1}^N\frac{\partial}{\partial w_{ij}^k}\left(\frac{1}{2}\left(\hat{y_d} - y_d\right)^{2}\right) = \frac{1}{N}\sum_{d=1}^N\frac{\partial E_d}{\partial w_{ij}^k}.$$

Thus, for the purposes of derivation, the backpropagation algorithm will concern itself with only one input-output pair. Once this is derived, the general form for all input-output pairs in $X$ can be generated by combining the individual gradients. Thus, the error function in question for derivation is

$$E = \frac{1}{2}\left( \hat{y} - y\right)^{2},$$

where the subscript $d$ in $E_d$, $\hat{y_d}$, and $y_d$ is omitted for simplification.

Error Function Derivatives

The derivation of the backpropagation algorithm begins by applying the chain rule to the
error function partial derivative

$$\frac{\partial E}{\partial w_{ij}^k} = \frac{\partial E}{\partial a_j^k}\frac{\partial a_j^k}{\partial w_{ij}^k},$$

where $a_j^k$ is the activation (product sum plus bias) of node $j$ in layer $k$ before it is passed to the nonlinear activation function (in this case, the sigmoid function) to generate the output. This decomposition of the partial derivative basically says that the change in the error function due to a weight is a product of the change in the error function $E$ due to the activation $a_j^k$ times the change in the activation $a_j^k$ due to the weight $w_{ij}^k$.

The first term is usually called the error, for reasons discussed below. It is denoted

$$\delta_j^k \equiv \frac{\partial E}{\partial a_j^k}.$$

The second term can be calculated from the equation for $a_j^k$ above:

$$\frac{\partial a_j^k}{\partial w_{ij}^k} = \frac{\partial}{\partial w_{ij}^k} \left(\sum_{l = 0}^{r_{k-1}} w_{lj}^k o_l^{k-1}\right) = o_i^{k-1}.$$

Thus, the partial derivative of the error function $E$ with respect to a weight $w_{ij}^k$ is

$$\frac{\partial E}{\partial w_{ij}^k} = \delta_j^k o_i^{k-1}.$$

Thus, the partial derivative of a weight is a product of the error term $\delta_j^k$ at node $j$ in layer $k$, and the output $o_i^{k-1}$ of node $i$ in layer $k-1$. This makes intuitive sense since the weight $w_{ij}^k$ connects the output of node $i$ in layer $k-1$ to the input of node $j$ in layer $k$ in the computation graph.

It is important to note that the above partial derivatives have all been calculated without any consideration of a particular error function or activation function. However, since the error term $\delta_j^k$ still needs to be calculated, and is dependent on the error function $E$, at this point it is necessary to introduce specific functions for both of these. As mentioned previously, classic backpropagation uses the mean squared error function (which is the squared error function for the single input-output pair case) and the sigmoid activation function.

The calculation of the error $\delta_j^{k}$ will be shown to be dependent on the values of error terms in the next layer. Thus, computation of the error terms will proceed backwards from the output layer down to the input layer. This is where backpropagation, or backwards propagation of errors, gets its name.

The Output Layer

Starting from the final layer, backpropagation attempts to define the value $\delta_1^m$, where $m$ is the final layer (the subscript is $1$ and not $j$ because this derivation concerns a one-output neural network, so there is only one output node $j = 1$). For example, a four-layer neural network will have $m=3$ for the final layer, $m=2$ for the second-to-last layer, and so on. Expressing the error function $E$ in terms of the value $a_1^m$ (since $\delta_1^m$ is a partial derivative with respect to $a_1^m$) gives

$$E = \frac{1}{2}\left( \hat{y} - y\right)^{2} = \frac{1}{2}\big(g_o(a_1^m) - y\big)^{2},$$

where $g_o(x)$ is the activation function for the output layer.

Thus, applying the partial derivative and using the chain rule gives

$$\delta_1^m = \left(g_o(a_1^m) - y\right)g_o^{\prime}(a_1^m) = \left(\hat{y}-y\right)g_o^{\prime}(a_1^m).$$

Putting it all together, the partial derivative of the error function $E$ with respect to a weight in the final layer $w_{i1}^m$ is

$$\frac{\partial E}{\partial w_{i1}^m} = \delta_1^m o_i^{m-1} = \left(\hat{y}-y\right)g_o^{\prime}(a_1^m)\, o_i^{m-1}.$$
The Hidden Layers

Now the question arises of how to calculate the partial derivatives of layers other than the output layer. Luckily, the chain rule for multivariate functions comes to the rescue again. Observe the following equation for the error term $\delta_j^k$ in layer $1 \le k < m$:

$$\delta_j^k = \frac{\partial E}{\partial a_j^k} = \sum_{l=1}^{r_{k+1}}\frac{\partial E}{\partial a_l^{k+1}}\frac{\partial a_l^{k+1}}{\partial a_j^k},$$

where $l$ ranges from $1$ to $r_{k+1}$ (the number of nodes in the next layer). Note that, because the bias input $o_0^k$ corresponding to $w_{0j}^{k+1}$ is fixed, its value is not dependent on the outputs of previous layers, and thus $l$ does not take on the value $0$.

Plugging in the error term $\delta_l^{k+1}$ gives the following equation:

$$\delta_j^k = \sum_{l=1}^{r_{k+1}}\delta_l^{k+1}\frac{\partial a_l^{k+1}}{\partial a_j^k}.$$

Remembering the definition of $a_l^{k+1}$,

$$a_l^{k+1} = \sum_{j=1}^{r_k}w_{jl}^{k+1}g\big(a_j^k\big),$$

where $g(x)$ is the activation function for the hidden layers,

$$\frac{\partial a_l^{k+1}}{\partial a_j^k} = w_{jl}^{k+1}g^{\prime}\big(a_j^k\big).$$

Plugging this into the above equation yields a final equation for the error term $\delta_j^k$ in the hidden layers, called the backpropagation formula:

$$\delta_j^k = \sum_{l=1}^{r_{k+1}}\delta_l^{k+1}w_{jl}^{k+1}g^{\prime}\big(a_j^k\big) = g^{\prime}\big(a_j^k\big)\sum_{l=1}^{r_{k+1}}w_{jl}^{k+1}\delta_l^{k+1}.$$

Putting it all together, the partial derivative of the error function $E$ with respect to a weight in the hidden layers $w_{ij}^k$ for $1 \le k < m$ is

$$\frac{\partial E}{\partial w_{ij}^k} = \delta_j^k o_i^{k-1} = g^{\prime}\big(a_j^k\big)\, o_i^{k-1}\sum_{l=1}^{r_{k+1}}w_{jl}^{k+1}\delta_l^{k+1}.$$
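To make the recurrence concrete, here is a minimal sketch of how the hidden-layer error terms could be computed with NumPy, assuming sigmoidal hidden units, that deltas_next already holds the error terms $\delta_l^{k+1}$, and that W_next[j, l] stores $w_{jl}^{k+1}$:

import numpy as np

def sigmoid_prime(a):
    # g'(a) = sigma(a) * (1 - sigma(a)) for sigmoidal hidden units
    s = 1 / (1 + np.exp(-a))
    return s * (1 - s)

def hidden_deltas(a_k, W_next, deltas_next):
    # delta_j^k = g'(a_j^k) * sum over l of w_{jl}^{k+1} * delta_l^{k+1}
    return sigmoid_prime(a_k) * (W_next @ deltas_next)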
Backpropagation as Backwards Computation

This equation is where backpropagation gets its name. Namely, the error $\delta_j^k$ at layer $k$ is dependent on the errors $\delta_l^{k+1}$ at the next layer $k+1$. Thus, errors flow backward, from the last layer to the first layer. All that is needed is to compute the first error terms based on the computed output $\hat{y} = g_o(a_1^m)$ and target output $y$. Then, the error terms for the previous layer are computed by performing a product sum (weighted by $w_{jl}^{k+1}$) of the error terms for the next layer and scaling it by $g^{\prime}\big(a_j^k\big)$, repeated until the input layer is reached.

This backwards propagation of errors is very similar to the forward computation that calculates the neural network's output. Thus, calculating the output is often called the forward phase, while calculating the error terms and derivatives is often called the backward phase. While going in the forward direction, the inputs are repeatedly recombined from the first layer to the last by product sums dependent on the weights $w_{ij}^k$ and transformed by nonlinear activation functions $g(x)$ and $g_o(x)$. In the backward direction, the "inputs" are the final layer's error terms, which are repeatedly recombined from the last layer to the first by product sums dependent on the weights $w_{jl}^{k+1}$ and transformed by nonlinear scaling factors $g_o^{\prime}\big(a_j^m\big)$ and $g^{\prime}\big(a_j^k\big)$.

Furthermore, because the computations for the backwards phase are dependent on the activations $a_j^k$ and outputs $o_j^k$ of the nodes in the previous layer (for the non-error term of all layers) and the next layer (for the error term of hidden layers), all of these values must be computed before the backwards phase can commence. Thus, the forward phase precedes the backward phase for every iteration of gradient descent. In the forward phase, activations $a_j^k$ and outputs $o_j^k$ will be remembered for use in the backwards phase. Once the backwards phase is completed and the partial derivatives are known, the weights (and associated biases $b_j^k = w_{0j}^k$) can be updated by gradient descent. This process is repeated until a local minimum is found or a convergence criterion is met.

The Backpropagation Algorithm


Using the terms defined in the section titled Formal Definition and the equations derived in
the section titled Deriving the Gradients, the backpropagation algorithm is dependent on the
following five equations:
For the partial derivatives,

$$\frac{\partial E_d}{\partial w_{ij}^k} = \delta_j^k o_i^{k-1}.$$

For the final layer's error term,

$$\delta_1^m = g_o^{\prime}(a_1^m)\left(\hat{y_d}-y_d\right).$$

For the hidden layers' error terms,

$$\delta_j^k = g^{\prime}\big(a_j^k\big)\sum_{l=1}^{r_{k+1}}w_{jl}^{k+1}\delta_l^{k+1}.$$

For combining the partial derivatives for each input-output pair,

$$\frac{\partial E(X, \theta)}{\partial w_{ij}^k} = \frac{1}{N}\sum_{d=1}^N\frac{\partial}{\partial w_{ij}^k}\left(\frac{1}{2}\left(\hat{y_d} - y_d\right)^{2}\right) = \frac{1}{N}\sum_{d=1}^N\frac{\partial E_d}{\partial w_{ij}^k}.$$

For updating the weights,

$$\Delta w_{ij}^k = - \alpha \frac{\partial E(X, \theta)}{\partial w_{ij}^k}.$$
The General Algorithm

The backpropagation algorithm proceeds in the following steps, assuming a suitable learning rate $\alpha$ and random initialization of the parameters $w_{ij}^k$:

1) Calculate the forward phase for each input-output pair $(\vec{x_d}, y_d)$ and store the results $\hat{y_d}$, $a_j^k$, and $o_j^k$ for each node $j$ in layer $k$ by proceeding from layer $0$, the input layer, to layer $m$, the output layer.

2) Calculate the backward phase for each input-output pair $(\vec{x_d}, y_d)$ and store the results $\frac{\partial E_d}{\partial w_{ij}^k}$ for each weight $w_{ij}^k$ connecting node $i$ in layer $k-1$ to node $j$ in layer $k$ by proceeding from layer $m$, the output layer, to layer $1$, the first layer of weights.

  a) Evaluate the error term for the final layer $\delta_1^m$ by using the second equation.
  b) Backpropagate the error terms for the hidden layers $\delta_j^k$, working backwards from the final hidden layer $k = m-1$, by repeatedly using the third equation.
  c) Evaluate the partial derivatives of the individual error $E_d$ with respect to $w_{ij}^k$ by using the first equation.

3) Combine the individual gradients for each input-output pair $\frac{\partial E_d}{\partial w_{ij}^k}$ to get the total gradient $\frac{\partial E(X, \theta)}{\partial w_{ij}^k}$ for the entire set of input-output pairs $X = \big\{(\vec{x_1}, y_1), \dots, (\vec{x_N}, y_N) \big\}$ by using the fourth equation (a simple average of the individual gradients).

4) Update the weights according to the learning rate $\alpha$ and total gradient $\frac{\partial E(X, \theta)}{\partial w_{ij}^k}$ by using the fifth equation (moving in the direction of the negative gradient).
Backpropagation In Sigmoidal Neural Networks

The classic backpropagation algorithm was designed for regression problems with sigmoidal activation units. While backpropagation can be applied to classification problems as well as to networks with non-sigmoidal activation functions, the sigmoid function has convenient mathematical properties which, when combined with an appropriate output activation function, greatly simplify the algorithm. Thus, in the classic formulation, the activation function for hidden nodes is sigmoidal $\big(g(x) = \sigma(x)\big)$ and the output activation function is the identity function $\big(g_o(x) = x\big)$ (the network output is just a weighted sum of its hidden layer, i.e. the activation).

Backpropagation is actually a major motivating factor in the historical use of sigmoid activation functions, due to the sigmoid's convenient derivative:

$$g^{\prime}(x) = \frac{\partial \sigma(x)}{\partial x} = \sigma(x)\big(1 - \sigma(x)\big).$$

Thus, calculating the derivative of the sigmoid function requires nothing more than remembering the output $\sigma(x)$ and plugging it into the equation above.
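For completeness, this identity follows in one line from the definition $\sigma(x) = \left(1 + e^{-x}\right)^{-1}$:

$$\sigma^{\prime}(x) = \frac{e^{-x}}{\left(1 + e^{-x}\right)^{2}} = \frac{1}{1 + e^{-x}} \cdot \frac{e^{-x}}{1 + e^{-x}} = \sigma(x)\big(1 - \sigma(x)\big),$$

since $1 - \sigma(x) = \frac{e^{-x}}{1 + e^{-x}}$.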

Furthermore, the derivative of the output activation function is also very simple:

$$g_o^{\prime}(x) = \frac{\partial g_o(x)}{\partial x} = \frac{\partial x}{\partial x} = 1.$$

Thus, using these two activation functions removes the need to remember the activation values $a_1^m$ and $a_j^k$ in addition to the output values $o_1^m$ and $o_j^k$, greatly reducing the memory footprint of the algorithm. This is because the derivative of the sigmoid activation function in the backwards phase only needs to recall the output of that function from the forward phase, and is not dependent on the actual activation value, as is the case in the more general formulation of backpropagation, where $g^{\prime}\big(a_j^k\big)$ must be calculated. Similarly, the derivative of the identity activation function does not depend on anything, since it is a constant.

Thus, for a feedforward neural network with sigmoidal hidden units and an identity output unit, the error term equations are as follows:

For the final layer's error term,

$$\delta_1^m = \hat{y_d}-y_d.$$

For the hidden layers' error terms,

$$\delta_j^k = o_j^k\big(1 - o_j^k\big)\sum_{l=1}^{r_{k+1}}w_{jl}^{k+1}\delta_l^{k+1}.$$
Code Example

The following code example is for a sigmoidal neural network as described in the previous subsection. It has one hidden layer and one output node in the output layer. The code is written in Python 3 and makes heavy use of the NumPy library for performing matrix math. Because the calculations of the gradient for individual input-output pairs $(\vec{x_d}, y_d)$ can be done in parallel, and many calculations are based on taking the dot product of two vectors, matrices are a natural way to represent the input data, output data, and layer weights. NumPy's efficient computation of matrix products and the ability to use modern GPUs (which are optimized for matrix operations) can give significant speedups in both the forward and backward phases of computation.

import numpy as np

# define the sigmoid function
def sigmoid(x, derivative=False):
    if derivative:
        return sigmoid(x) * (1 - sigmoid(x))
    else:
        return 1 / (1 + np.exp(-x))

# choose a random seed for reproducible results
np.random.seed(1)

# learning rate
alpha = .1

# number of nodes in the hidden layer
num_hidden = 3

# inputs
X = np.array([
    [0, 0, 1],
    [0, 1, 1],
    [1, 0, 0],
    [1, 1, 0],
    [1, 0, 1],
    [1, 1, 1],
])

# outputs
# x.T is the transpose of x, making this a column vector
y = np.array([[0, 1, 0, 1, 1, 0]]).T

# initialize weights randomly with mean 0 and range [-1, 1]
# the +1 in the 1st dimension of the weight matrices is for the bias weight
hidden_weights = 2*np.random.random((X.shape[1] + 1, num_hidden)) - 1
output_weights = 2*np.random.random((num_hidden + 1, y.shape[1])) - 1

# number of iterations of gradient descent
num_iterations = 10000

# for each iteration of gradient descent
for i in range(num_iterations):

    # forward phase
    # np.hstack((np.ones(...), X)) adds a fixed input of 1 for the bias weight
    input_layer_outputs = np.hstack((np.ones((X.shape[0], 1)), X))
    hidden_layer_outputs = np.hstack((np.ones((X.shape[0], 1)), sigmoid(np.dot(input_layer_outputs, hidden_weights))))
    output_layer_outputs = np.dot(hidden_layer_outputs, output_weights)

    # backward phase
    # output layer error term
    output_error = output_layer_outputs - y
    # hidden layer error term
    # [:, 1:] removes the bias term from the backpropagation
    hidden_error = hidden_layer_outputs[:, 1:] * (1 - hidden_layer_outputs[:, 1:]) * np.dot(output_error, output_weights.T[:, 1:])

    # partial derivatives
    hidden_pd = input_layer_outputs[:, :, np.newaxis] * hidden_error[:, np.newaxis, :]
    output_pd = hidden_layer_outputs[:, :, np.newaxis] * output_error[:, np.newaxis, :]

    # average for total gradients
    total_hidden_gradient = np.average(hidden_pd, axis=0)
    total_output_gradient = np.average(output_pd, axis=0)

    # update weights
    hidden_weights += -alpha * total_hidden_gradient
    output_weights += -alpha * total_output_gradient

# print the final outputs of the neural network on the inputs X
print("Output After Training: \n{}".format(output_layer_outputs))
The matrix X is the set of inputs $\vec{x}$ and the matrix y is the set of outputs $y$. The number of nodes in the hidden layer can be customized by setting the value of the variable num_hidden. The learning rate $\alpha$ is controlled by the variable alpha. The number of iterations of gradient descent is controlled by the variable num_iterations.

By changing these variables and comparing the output of the program to the target values y, one can see how these variables control how well backpropagation can learn the dataset X and y. For example, more nodes in the hidden layer and more iterations of gradient descent will generally improve the fit to the training dataset. However, using too large or too small a learning rate can cause the model to diverge or converge too slowly, respectively.


Artificial Neural Network



A simple artificial neural network. The first column of circles represents the ANN's inputs, the middle column represents computational units that act on that input, and the third column represents the ANN's output. Lines connecting circles indicate dependencies.[1]
Artificial neural networks (ANNs) are computational models inspired by the human brain. They are composed of a large number of connected nodes, each of which performs a simple mathematical operation. Each node's output is determined by this operation, as well as by a set of parameters that are specific to that node. By connecting these nodes together and carefully setting their parameters, very complex functions can be learned and calculated.

Artificial neural networks are responsible for many of the recent advances in artificial intelligence, including voice recognition, image recognition, and robotics. For example, ANNs can perform image recognition on hand-drawn digits.

Contents
 Online Learning

 Neurons

 Model Desiderata

 A Computational Model of the Neuron

 The Universal Approximation Theorem

 The Sigmoid Function

 Putting It All Together

 Training The Model

 References

Online Learning
With the advent of computers in the 1940s, computer scientists' attention turned towards
developing intelligent systems that could learn to perform prediction and decision making.
Of particular interest were algorithms that could perform online learning, which is a learning
method that can be applied to data points arriving sequentially. This is in opposition to batch
learning, which requires that all of the data be present at the time of training.

Online learning is especially useful in scenarios where training data is arriving sequentially
over time, such as speech data or the movement of stock prices. With a system capable of
online learning, one doesn't have to wait until the system has received a ton of data before
it can make a prediction or decision. If the human brain learned by batch learning, then
human children would take 10 years before they could learn to speak, mostly just to gather
enough speech data and grammatical rules to speak correctly. Instead, children learn to
speak by observing the speech patterns of those around them and gradually incorporating
that knowledge to improve their own speech, an example of online learning.

Given that the brain is such a powerful online learner, it is natural to try to emulate it
mathematically. ANNs are one attempt at a model with the bare minimum level of
complexity required to approximate the function of the human brain, and so are among the
most powerful machine learning methods discovered thus far.

Neurons
The human brain is primarily composed of neurons, small cells that learn to fire electrical and chemical signals based on some function. There are on the order of $10^{11}$ neurons in the human brain, about $15$ times the total number of people in the world. Each neuron is, on average, connected to $10000$ other neurons, so that there are a total of about $10^{15}$ connections between neurons.

Neurons and microglial cells stained red and green respectively.[2]
Since individual neurons aren't capable of very complicated calculations, it is thought that
the huge number of neurons and connections are what gives the brain its computational
power. While there are in fact thousands of different types of neurons in the human brain,
ANNs usually attempt to replicate only one type in an effort to simplify the model calculation
and analysis.

The electrical current for a neuron going from rest to firing to rest again.[3]
Neurons function by firing when they receive enough input from the other neurons to which
they're connected. Typically, the output function is modeled as an activation function,
where inputs below a certain threshold don't cause the neuron to fire, and those above the
threshold do. Thus, a neuron exhibits what is known as all-or-nothing firing, meaning it is
either firing, or it is completely off and no output is produced.

From the point of view of a particular neuron, its connections can generally be split into two
classes, incoming connections and outgoing connections. Incoming connections form the
input to the neuron, while the output of the neuron flows through the outgoing connections.
Thus, neurons whose incoming connections are the outgoing connections of other neurons
treat other neurons' outputs as inputs. The repeated transformation of outputs of some
neurons into inputs of other neurons gives rise to the power of the human brain, since
the composition of activation functions can create highly complex functions.

It turns out that incoming connections for a particular neuron are not considered equal.
Specifically, some incoming connections are stronger than others, and provide more input
to a neuron than weak connections. Since a neuron fires when it receives input above a
certain threshold, these strong incoming connections contribute more to neural firing.
Neurons actually learn to make some connections stronger than others, in a process
called long-term potentiation, allowing them to learn when to fire in response to the
activities of neurons they're connected to. Neurons can also make connections weaker
through an analogous process called long-term depression.

Model Desiderata
As discussed in the above sections, as well as the later section titled The Universal
Approximation Theorem, a good computational model of the brain will have three
characteristics:

Biologically-Inspired: The brain's computational power is derived from its neurons and the
connections between them. Thus, a good computational approximation of the brain will
have individual computational units (a la neurons), as well as ways for those neurons to
communicate (a la connections). Specifically, the outputs of some computational units will
be the inputs to other computational units. Furthermore, each computational unit should
calculate some function akin to the activation function of real neurons.

Flexible: The brain is flexible enough to learn seemingly endless types and forms of data.
For example, even though most teenagers under the age of 16 have never driven a car
before, most learn very quickly to drive upon receiving their driver's license. No person's
brain is preprogrammed to learn how to drive, and yet almost anyone can do it given a small
amount of training. The brain's ability to learn to solve new tasks that it has no prior
experience with is part of what makes it so powerful. Thus, a good computational
approximation of the brain should be able to learn many different types of functions without
knowing the forms those functions will take beforehand.

Capable of Online Learning: The brain doesn't need to learn everything at once, so neither
should a good model of it. Thus, a good computational approximation of the brain should be
able to improve by online learning, meaning it gradually improves over time as it learns to
correct past errors.
By the first desideratum, the model will consist of many computational units connected in
some way. Each computational unit will perform a simple computation whose output will be
passed as input to other units. This process will repeat itself some number of times, so that
outputs from some computational units are the inputs to others. With any luck, connecting
enough of these units together will give sufficient complexity to compute any function,
satisfying the second desideratum. However, what kind of function the model ends up
computing will depend on the data it is exposed to, as well as a learning algorithm that
determines how the model learns that data. Ideally, this algorithm will be able to perform
online learning, the third desideratum.

Thus, building a good computational approximation to the brain consists of three steps. The
first is to develop a computational model of the neuron and to connect those models
together to replicate the way the brain performs computations. This is covered in the
sections titled A Computational Model of the Neuron, The Sigmoid Function, and Putting It
All Together. The second is to prove that this model is sufficiently complex to calculate any
function and learn any type of data it is given, which is covered in the section titled The
Universal Approximation Theorem. The third is to develop a learning algorithm that can
learn to calculate a function, given a model and some data, in an online manner. This is
covered in the section titled Training The Model.

A Computational Model of the Neuron

The step function.[4]


As stated above, neurons fire above a certain threshold and do nothing below that
threshold, so a model of the neuron requires a function exhibiting the same properties. The
simplest function that does this is the step function.

The step function is defined as:

$$H(x) = \begin{cases} 1 & \text{if } x \ge 0, \\ 0 & \text{if } x < 0. \end{cases}$$
In this simple neuron model, the input is a single number that must exceed the activation threshold in order to trigger firing. However, neurons can (and should, if they're to do anything useful) have connections to multiple incoming neurons, so we need some way of "integrating" these incoming neurons' inputs into a single number. The most common way of doing this is to take a weighted sum of the neuron's incoming inputs, so that the neuron fires when the weighted sum exceeds the threshold. If the vector of outputs from the incoming neurons is represented by $\vec{x}$, then the weighted sum of $\vec{x}$ is the dot product $\vec{w} \cdot \vec{x}$, where $\vec{w}$ is called the weight vector.

To further improve the modeling capacity of the neuron, we want to be able to set the threshold arbitrarily. This can be achieved by adding a scalar (which may be positive or negative) to the weighted sum of the inputs. Adding a scalar of $-b$ will force the neuron's activation threshold to be set to $b$, since the argument of the new step function $H(x - b)$ equals $0$ at $x = b$, the natural threshold of the step function. The value $b$ is known as the bias since it biases the step function away from the natural threshold at $x = 0$.

Thus, calculating the output of our neuron model consists of two steps:

1) Calculate the integration. The integration, as defined above, is the sum $\vec{w} \cdot \vec{x} + b$ for vectors $\vec{w}$, $\vec{x}$ and scalar $b$.

2) Calculate the output. The output is the activation function applied to the result of step 1. Since the activation function in our model is the step function, the output of the neuron is $H(\vec{w} \cdot \vec{x} + b)$, which is $1$ when $\vec{w} \cdot \vec{x} + b \ge 0$ and $0$ otherwise.
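These two steps translate directly into code. The following sketch uses made-up weights and bias purely for illustration:

import numpy as np

def step_neuron(x, w, b):
    # step 1: integration (weighted sum of inputs plus bias)
    integration = np.dot(w, x) + b
    # step 2: activation (the step function H)
    return 1 if integration >= 0 else 0

# with assumed weights w = (1, -2) and bias b = 0.5:
print(step_neuron(np.array([3.0, 1.0]), np.array([1.0, -2.0]), 0.5))  # 1, since 1.5 >= 0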
A linear classifier, where squares evaluate to 1 and circles to 0.[5]
Following from the description of step 2, our neuron model defines a linear classifier, i.e. a function that splits the inputs into two regions with a linear boundary. In two dimensions, this is a line, while in higher dimensions the boundary is known as a hyperplane. The weight vector $\vec{w}$ defines the slope of the linear boundary while the bias $b$ defines the intercept of the linear boundary. The preceding diagram illustrates a neuron's output for two incoming connections (i.e. a two-dimensional input vector $\vec{x}$). Note that the neuron inputs are clearly separated into values of $0$ and $1$ by a line (defined by $\vec{w} \cdot \vec{x} + b = 0$).

By adjusting the values of $\vec{w}$ and $b$, the step function unit can adjust its linear boundary and learn to split its inputs into classes, $0$ and $1$, as shown in the previous image. As a corollary, different values of $\vec{w}$ and $b$ for multiple step function units will yield multiple different linear classifiers. Part of what makes ANNs so powerful is their ability to adjust $\vec{w}$ and $b$ for many units at the same time, effectively learning many linear classifiers simultaneously. This learning is discussed in more depth in the section titled Training the Model.

The Universal Approximation Theorem


Since the brain can calculate more than just linear functions by connecting many neurons
together, this suggests that connecting many linear classifiers together should produce a
nonlinear function. In fact, it is proven that for certain activation functions and a very large
number of neurons, ANNs can model any continuous, smooth function arbitrarily well, a
result known as the universal approximation theorem.

This is very convenient because, like the brain, an ANN should ideally be able to learn any
function handed to it. If ANNs could only learn one type of function (e.g. third
degree polynomials), this would severely limit the types of problems to which they could be
applied. Furthermore, learning often happens in an environment where the type of function
to be learned is not known beforehand, so it is advantageous to have a model that does not
depend on knowing a priori the form of the data it will be exposed to.
Unfortunately, since the step function can only output two different values, $0$ and $1$, an ANN of step function neurons cannot be a universal approximator (generally speaking, continuous functions take on more than two values). Luckily, there is a continuous function called the sigmoid function, described in the next section, that is very similar to the step function and can be used in universal approximators.

The Sigmoid Function

The sigmoid function.[6]


There is a continuous approximation of the step function called the logistic curve, or sigmoid function, denoted as $\sigma(x)$. This function's output ranges over all values between $0$ and $1$ and makes a transition from values near $0$ to values near $1$ at $x = 0$, similar to the step function $H(x)$.

The sigmoid function is defined as:

$$\sigma(x) = \frac{1}{1 + e^{-x}}.$$


So, for a computational unit that uses the sigmoid function, instead of outputting $0$ or $1$ like a step function unit, its output will be between $0$ and $1$, non-inclusive. This changes slightly the interpretation of this unit as a model of a neuron, since it no longer exhibits all-or-nothing behavior: it never takes on the value of $0$ (nothing) or $1$ (all). However, the sigmoid function is very close to $0$ for $x < 0$ and very close to $1$ for $x > 0$, so it can be interpreted as exhibiting practically all-or-nothing behavior on most inputs ($x \not\approx 0$).

The output for a sigmoidal unit with weight vector $\vec{w}$ and bias $b$ on input $\vec{x}$ is:

$$\sigma(\vec{w} \cdot \vec{x} + b) = \left(1+\exp\left(-(\vec{w} \cdot \vec{x} + b)\right)\right)^{-1}.$$

Thus, a sigmoid unit is like a linear classifier with a boundary defined at $\vec{w} \cdot \vec{x} + b = 0$. The value of the sigmoid function at the boundary is $\sigma(0) = .5$. Inputs $\vec{x}$ that are far from the linear boundary will produce outputs of approximately $0$ or $1$, while those very close to the boundary will produce outputs closer to $.5$.

The sigmoid function turns out to be a member of the class of activation functions for
universal approximators, so it imitates the behavior of real neurons (by approximating the
step function) while also permitting the possibility of arbitrary function approximation. These
happen to be exactly the first two desiderata specified for a good mathematical model of the
brain. In fact, some ANNs use activation functions that are different from the sigmoidal
function, because those functions are also proven to be in the class of functions for which
universal approximators can be built. Two well-known activation functions used in the same
manner as the sigmoidal function are the hyperbolic tangent and the rectifier. The proof that
these functions can be used to build ANN universal approximators is fairly advanced, so it is
not covered here.

Calculate the output of a sigmoidal neuron with weight vector $\vec{w} = (.25, .75)$ and bias $b = -.75$ for the following two input vectors:

$\vec{m} = (1, 2)$
$\vec{n} = (1, -.5)$

Recalling that the output of a sigmoidal neuron with input $\vec{x}$ is $\sigma(\vec{w} \cdot \vec{x} + b)$,

$$\begin{aligned} d &= \vec{w} \cdot \vec{m} + b \\ &= w_1 \cdot m_1 + w_2 \cdot m_2 + b \\ &= .25 \cdot 1 + .75 \cdot 2 - .75 \\ &= 1 \end{aligned}$$

$$\begin{aligned} s &= \sigma(d) \\ &= \frac{1}{1 + e^{-d}} \\ &= \frac{1}{1+e^{-1}} \\ &= .73105857863 \end{aligned}$$

Thus, the output on $\vec{m} = (1, 2)$ is $.73105857863$. The same reasoning applied to $\vec{n} = (1, -.5)$ yields $.29421497216$.
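These numbers are easy to verify with a few lines of Python:

import numpy as np

def sigmoid_output(x, w, b):
    # sigma(w . x + b)
    return 1 / (1 + np.exp(-(np.dot(w, x) + b)))

w, b = np.array([.25, .75]), -.75
print(sigmoid_output(np.array([1, 2]), w, b))    # 0.73105857863...
print(sigmoid_output(np.array([1, -.5]), w, b))  # 0.29421497216...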
Like the step function unit described above, the sigmoid function unit's linear boundary can be adjusted by changing the values of $\vec{w}$ and $b$. The weight vector defines the slope of the linear boundary while the bias defines the intercept of the linear boundary. Since, like the brain, the final model will include many individual computational units (a la neurons), a learning algorithm that can learn, or train, many $\vec{w}$ and $b$ values simultaneously is required. This algorithm is described in the section titled Training the Model.
Putting It All Together
Neurons are connected to one another, with each neuron's incoming connections made up
of the outgoing connections of other neurons. Thus, the ANN will need to connect the
outputs of sigmoidal units to the inputs of other sigmoidal units.

One Sigmoidal Unit


The diagram below shows a sigmoidal unit with three inputs $\vec{x} = (x_1, x_2, x_3)$, one output $y$, bias $b$, and weight vector $\vec{w} = (w_1, w_2, w_3)$. Each of the inputs $x_1, x_2, x_3$ can be the output of another sigmoidal unit (though it could also be raw input, analogous to unprocessed sense data in the brain, such as sound), and the unit's output $y$ can be the input to other sigmoidal units (though it could also be a final output, analogous to an action-associated neuron in the brain, such as one that bends your left elbow). Notice that each component $w_i$ of the weight vector corresponds to a component $x_i$ of the input vector. Thus, the summation of the products of the individual $w_i, x_i$ pairs is equivalent to the dot product, as discussed in the previous sections.

A sigmoidal unit with three inputs $\vec{x} = (x_1, x_2, x_3)$, weight vector $\vec{w}$, and bias $b$.[7]

ANNs as Graphs
Artificial neural networks are most easily visualized in terms of a directed graph. In the case of sigmoidal units, node $s$ represents sigmoidal unit $s$ (as in the diagram above) and directed edge $e = (u, v)$ indicates that one of sigmoidal unit $v$'s inputs is the output of sigmoidal unit $u$.

Thus, if the diagram above represents sigmoidal unit $s$ and inputs $x_1$, $x_2$, and $x_3$ are the outputs of sigmoidal units $a$, $b$, and $c$, respectively, then a graph representation of the above sigmoidal unit will have nodes $a$, $b$, $c$, and $s$ with directed edges $(a, s)$, $(b, s)$, and $(c, s)$. Furthermore, since each incoming directed edge is associated with a component of the weight vector for sigmoidal unit $s$, each incoming edge will be labeled with its corresponding weight component. Thus edge $(a, s)$ will have label $w_1$, $(b, s)$ will have label $w_2$, and $(c, s)$ will have label $w_3$. The corresponding graph is shown below, with the edges feeding into nodes $a$, $b$, and $c$ representing inputs to those nodes.

Directed graph representing an ANN with sigmoidal units $a$, $b$, $c$, and $s$. Unit $s$'s weight vector $\vec{w}$ is $(w_1, w_2, w_3)$.
While the above ANN is very simple, ANNs in general can have many more nodes (e.g. modern machine vision applications use ANNs with more than $10^6$ nodes) in very complicated connection patterns (see the wiki about convolutional neural networks).

The outputs of sigmoidal units are the inputs of other sigmoidal units, indicated by directed edges, so computation follows the edges in the graph representation of the ANN. Thus, in the example above, computation of $s$'s output is preceded by the computation of $a$, $b$, and $c$'s outputs. If the graph above were modified so that $s$'s output was an input of $a$, a directed edge passing from $s$ to $a$ would be added, creating what is known as a cycle. This would mean that $s$'s output is dependent on itself. Cyclic computation graphs greatly complicate computation and learning, so computation graphs are commonly restricted to be directed acyclic graphs (or DAGs), which have no cycles. ANNs with DAG computation graphs are known as feedforward neural networks, while ANNs with cycles are known as recurrent neural networks.

Ultimately, ANNs are used to compute and learn functions. This consists of giving the ANN a series of input-output pairs $(\vec{x_i}, \vec{y_i})$ and training the model to approximate the function $f$ such that $f(\vec{x_i}) = \vec{y_i}$ for all pairs. Thus, if $\vec{x}$ is $n$-dimensional and $\vec{y}$ is $m$-dimensional, the final sigmoidal ANN graph will consist of $n$ input nodes (i.e. raw input, not coming from other sigmoidal units) representing $\vec{x} = (x_1, \dots, x_n)$, $k$ sigmoidal units (some of which will be connected to the input nodes), and $m$ output nodes (i.e. final output, not fed into other sigmoidal units) representing $\vec{y} = (y_1, \dots, y_m)$.

Like sigmoidal units, output nodes have multiple incoming connections and output one value. This necessitates an integration scheme and an activation function, as defined in the section titled A Computational Model of the Neuron. Sometimes, output nodes use the same integration and activation as sigmoidal units, while other times they may use more complicated functions, such as the softmax function, which is heavily used in classification problems. Often, the choice of integration and activation functions is dependent on the form of the output. For example, since sigmoidal units can only output values in the range $(0, 1)$, they are ill-suited to problems where the expected value of $y$ lies outside that range.

An example graph for an ANN computing a two-dimensional output $\vec{y}$ on a three-dimensional input $\vec{x}$ using five sigmoidal units $s_1, \dots, s_5$ is shown below. An edge labeled with weight $w_{ab}$ represents the component of the weight vector for node $b$ that corresponds to the input coming from node $a$. Note that this graph, because it has no cycles, is a feedforward neural network.

ANN for three-dimensional input, two-dimensional output, and five sigmoidal units.
Layers
Thus, the above ANN would start by computing the outputs of nodes $s_1$ and $s_2$ given $x_1$, $x_2$, and $x_3$. Once that was complete, the ANN would next compute the outputs of nodes $s_3$, $s_4$, and $s_5$, dependent on the outputs of $s_1$ and $s_2$. Once that was complete, the ANN would do the final calculation of nodes $y_1$ and $y_2$, dependent on the outputs of nodes $s_3$, $s_4$, and $s_5$.
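This layer-by-layer computation is naturally expressed with matrices. The sketch below uses randomly initialized weight matrices (and, for brevity, omits the biases that a full implementation would include):

import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

rng = np.random.default_rng(0)
W1 = rng.standard_normal((3, 2))  # inputs x1, x2, x3 -> s1, s2
W2 = rng.standard_normal((2, 3))  # s1, s2 -> s3, s4, s5
W3 = rng.standard_normal((3, 2))  # s3, s4, s5 -> y1, y2

x = np.array([1.0, 0.5, -0.2])
h1 = sigmoid(x @ W1)   # outputs of s1 and s2
h2 = sigmoid(h1 @ W2)  # outputs of s3, s4, and s5
y = sigmoid(h2 @ W3)   # outputs of y1 and y2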

It is obvious from this computational flow that certain sets of nodes tend to be computed at the same time, since a different set of nodes uses their outputs as inputs. For example, the set $\{s_3, s_4, s_5\}$ depends on the set $\{s_1, s_2\}$. These sets of nodes that are computed together are known as layers, and ANNs are generally thought of as a series of such layers, with each layer $l_i$ dependent on the previous layer $l_{i-1}$. Thus, the above graph is composed of four layers. The first layer $l_0$ is called the input layer (which does not need to be computed, since it is given), while the final layer $l_3$ is called the output layer. The intermediate layers, known as hidden layers, which in this case are the layers $l_1 = \{s_1, s_2\}$ and $l_2 = \{s_3, s_4, s_5\}$, are usually numbered so that hidden layer $h_i$ corresponds to layer $l_i$. Thus, hidden layer $h_1 = \{s_1, s_2\}$ and hidden layer $h_2 = \{s_3, s_4, s_5\}$. The diagram below shows the example ANN with each node grouped into its appropriate layer.

The same ANN grouped into layers.

Training The Model


The ANN can now calculate some function $f_{\theta}(\vec{x})$ that depends on the values of the individual nodes' weight vectors and biases, which together are known as the ANN's parameters $\theta$. The logical next step is to determine how to alter those biases and weight vectors so that the ANN computes known values of the function. That is, given a series of input-output pairs $(\vec{x_i}, \vec{y_i})$, how can the weight vectors and biases be altered such that $f_{\theta}(\vec{x_i}) \approx \vec{y_i}$ for all $i$?

Choosing an Error Function


The typical way to do this is to define an error function $E$ over the set of pairs $X = \{(\vec{x_1}, \vec{y_1}), \dots, (\vec{x_N}, \vec{y_N})\}$ such that $E(X, \theta)$ is small when $f_{\theta}(\vec{x_i}) \approx \vec{y_i}$ for all $i$. Common choices for $E$ are the mean squared error (MSE) in the case of regression problems and the cross entropy in the case of classification problems. Thus, training the ANN reduces to minimizing the error $E(X, \theta)$ with respect to the parameters (since $X$ is fixed). For example, for the mean squared error function, given two input-output pairs $X = \{(\vec{x_1}, \vec{y_1}), (\vec{x_2}, \vec{y_2})\}$ and an ANN with parameters $\theta$ that outputs $f_{\theta}(\vec{x})$ for input $\vec{x}$, the error function $E(X, \theta)$ is

$$E(X, \theta) = \frac{\big(y_1 - f_{\theta}(\vec{x_1})\big)^2}{2} + \frac{\big(y_2 - f_{\theta}(\vec{x_2})\big)^2}{2}.$$
Gradient Descent
Since the error function $E(X, \theta)$ defines a fairly complex function (it is a function of the output of the ANN, which is a composition of many nonlinear functions), finding the minimum analytically is generally impossible. Luckily, there exists a general method for minimizing differentiable functions called gradient descent. Basically, gradient descent finds the gradient of a function $f$ at a particular value $x$ (for an ANN, that value will be the parameters $\theta$) and then updates that value by moving (or stepping) in the direction of the negative of the gradient. Generally speaking (it depends on the size of the step $\eta$), this will find a nearby value $x^{\prime} = x - \eta \nabla f(x)$ for which $f(x^{\prime}) < f(x)$. This process repeats until a local minimum is found, or the gradient sufficiently converges (i.e. becomes smaller than some threshold). Learning for an ANN typically starts with a random initialization of the parameters (the weight vectors and biases) followed by successive updates to those parameters based on gradient descent until the error function $E(X, \theta)$ converges.
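A minimal sketch of that loop, for a generic one-dimensional differentiable function (here $f(x) = x^2$ with gradient $2x$, chosen only for illustration):

def gradient_descent(grad_f, x, eta=0.1, tol=1e-8, max_steps=10000):
    # repeatedly step in the direction of the negative gradient
    for _ in range(max_steps):
        g = grad_f(x)
        if abs(g) < tol:  # gradient has sufficiently converged
            break
        x = x - eta * g
    return x

print(gradient_descent(lambda x: 2 * x, x=5.0))  # converges near the minimum at 0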

A major advantage of gradient descent is that it can be used for online learning, since the parameters are not solved in one calculation but are instead gradually improved by moving in the direction of the negative gradient. Thus, if input-output pairs are arriving in a sequential fashion, the ANN can perform gradient descent on one input-output pair for a certain number of steps, and then do the same once the next input-output pair arrives. For an appropriate choice of step size $\eta$, this approach can yield results similar to gradient descent on the entire dataset $X$ (known as batch learning).

Because gradient descent is a local method (the step direction is determined by the gradient at a single point), it can only find local minima. While this is generally a significant problem for most optimization applications, recent research has suggested that finding local minima is not actually an issue for ANNs, since the vast majority of local minima are evenly distributed and similar in magnitude for large ANNs.

Backpropagation
For a long time, calculating the gradient for ANNs was thought to be mathematically intractable, since ANNs can have large numbers of nodes and very many layers, making the error function $E(X, \theta)$ highly nonlinear. However, in the mid-1980s, computer scientists were able to derive a method for calculating the gradient with respect to an ANN's parameters, known as backpropagation, or "backward propagation of errors." The method works for both feedforward neural networks (for which it was originally designed) as well as for recurrent neural networks, in which case it is called backpropagation through time, or BPTT. The discovery of this method brought about a renaissance in artificial neural network research, as training non-trivial ANNs had finally become feasible.
Cite as: Artificial Neural Network. Brilliant.org. Retrieved 09:34, April 26, 2021, from https://brilliant.org/wiki/artificial-neural-network/

Feedforward Neural Networks


[Figure: A feedforward neural network with information flowing left to right]
Feedforward neural networks are artificial neural networks where the connections
between units do not form a cycle. Feedforward neural networks were the first type of
artificial neural network invented and are simpler than their counterpart, recurrent neural
networks. They are called feedforward because information only travels forward in the
network (no loops), first through the input nodes, then through the hidden nodes (if present),
and finally through the output nodes.

Feedforward neural networks are primarily used for supervised learning in cases where the data to be learned is neither sequential nor time-dependent. That is, feedforward neural networks compute a function \(f\) on a fixed-size input \(x\) such that \(f(x) \approx y\) for training pairs \((x, y)\). On the other hand, recurrent neural networks learn sequential data, computing \(g\) on variable-length input \(X_k = \{x_1, \dots, x_k\}\) such that \(g(X_k) \approx y_k\) for training pairs \((X_n, Y_n)\) for all \(1 \le k \le n\).
Contents

 Single-layer Perceptron

 Limitations

 Multi-layer Perceptron

 Formal Definition

Single-layer Perceptron
The simplest type of feedforward neural network is the perceptron, a feedforward neural network with no hidden units. Thus, a perceptron has only an input layer and an output layer. The output units are computed directly from the sum of the product of their weights with the corresponding input units, plus some bias.

Historically, the perceptron's output has been binary, meaning it outputs a value of \(0\) or \(1\). This is achieved by passing the aforementioned product sum into the step function \(H(x)\), defined as

\[H(x) = \begin{cases} 1 & \text{if } x \ge 0 \\ 0 & \text{if } x < 0. \end{cases}\]

For a binary perceptron with \(n\)-dimensional input \(\vec{x}\), \(n\)-dimensional weight vector \(\vec{w}\), and bias \(b\), the \(1\)-dimensional output \(o\) is

[Figure: A linear classifier, where squares evaluate to 1 and circles to 0]


\[o = \begin{cases} 1 & \text{if } \vec{w} \cdot \vec{x} + b \ge 0 \\ 0 & \text{if } \vec{w} \cdot \vec{x} + b < 0. \end{cases}\]
Since the perceptron divides the input space into two classes, \(0\) and \(1\), depending on the values of \(\vec{w}\) and \(b\), it is known as a linear classifier. The diagram to the right displays one such linear classifier. The line separating the two classes is known as the classification boundary or decision boundary. In the case of a two-dimensional input (as in the diagram) it is a line, while in higher dimensions this boundary is a hyperplane. The weight vector \(\vec{w}\) defines the slope of the classification boundary while the bias \(b\) defines the intercept.

More general single-layer perceptrons can use activation functions other than the step function \(H(x)\). Typical choices are the identity function \(f(x) = x\), the sigmoid function \(\sigma(x) = \left(1 + e^{-x}\right)^{-1}\), and the hyperbolic tangent \(\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}\). Use of any of these functions ensures the output is a continuous number (as opposed to binary), and thus not every activation function yields a linear classifier.

Generally speaking, a perceptron with activation function \(g(x)\) has output

\[o = g(\vec{w} \cdot \vec{x} + b).\]
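As a minimal illustration of this formula, the following Python sketch computes a perceptron's output for two common choices of \(g\); the weights, bias, and input are arbitrary example values, not taken from the text.

import numpy as np

def perceptron_output(w, x, b, g):
    # o = g(w . x + b)
    return g(np.dot(w, x) + b)

step = lambda z: 1 if z >= 0 else 0        # Heaviside step function H(x)
sigmoid = lambda z: 1 / (1 + np.exp(-z))   # sigmoid activation

w = np.array([0.5, -0.4])  # example weight vector
x = np.array([1.0, 2.0])   # example input
b = 0.1                    # example bias

print(perceptron_output(w, x, b, step))     # binary output: 0
print(perceptron_output(w, x, b, sigmoid))  # continuous output, roughly 0.45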


In order for a perceptron to learn to correctly classify a set of input-output pairs \((\vec{x}, y)\), it has to adjust the weights \(\vec{w}\) and bias \(b\) in order to learn a good classification boundary. The figure below shows many possible classification boundaries, the best of which is the boundary labeled \(H_2\). If the perceptron uses an activation function other than the step function (e.g. the sigmoid function), then the weights and bias should be adjusted so that the output \(o\) is close to the true label \(y\).

[Figure: An example of binary classified data and multiple classification boundaries]
Error Function
Typically, the learning process requires the definition of an error function \(E\) that quantifies the difference between the computed output of the perceptron \(o\) and the true value \(y\) for an input \(\vec{x}\) over a set of multiple input-output pairs \((\vec{x}, y)\). Historically, this error function is the mean squared error (MSE), defined for a set of \(N\) input-output pairs \(X = \{(\vec{x_1}, y_1), \dots, (\vec{x_N}, y_N)\}\) as

\[E(X) = \frac{1}{2N} \sum_{i=1}^N \left(o_i - y_i\right)^2 = \frac{1}{2N} \sum_{i=1}^N \left(g(\vec{w} \cdot \vec{x_i} + b) - y_i\right)^2,\]

where \(o_i\) denotes the output of the perceptron on input \(\vec{x_i}\) with activation function \(g\). The factor of \(\frac{1}{2}\) is included in order to simplify the calculation of the derivative later. Thus, \(E(X) = 0\) when \(o_i = y_i\) for all input-output pairs \((\vec{x_i}, y_i)\) in \(X\), so a natural objective is to attempt to change \(\vec{w}\) and \(b\) such that \(E(X)\) is as close to zero as possible. Thus, minimizing \(E(X)\) with respect to \(\vec{w}\) and \(b\) should yield a good classification boundary.

Delta Rule
\(E(X)\) is typically minimized using gradient descent, meaning the perceptron adjusts \(\vec{w}\) and \(b\) in the direction of the negative gradient of the error function. Gradient descent works for any error function, not just the mean squared error. This iterative process reduces the value of the error function until it converges on a value, usually a local minimum. The values of \(\vec{w}\) and \(b\) are typically set randomly and then updated using gradient descent. If the random initializations of \(\vec{w}\) and \(b\) are denoted \(\vec{w_0}\) and \(b_0\), respectively, then gradient descent updates \(\vec{w}\) and \(b\) according to the equations

\[\vec{w_{i+1}} = \vec{w_i} - \alpha \frac{\partial E(X)}{\partial \vec{w_i}}, \qquad b_{i+1} = b_i - \alpha \frac{\partial E(X)}{\partial b_i},\]

where \(\vec{w_i}\) and \(b_i\) are the values of \(\vec{w}\) and \(b\) after the \(i^\text{th}\) iteration of gradient descent, and \(\frac{\partial f}{\partial x}\) is the partial derivative of \(f\) with respect to \(x\). \(\alpha\) is known as the learning rate, which controls the step size gradient descent takes each iteration, and is typically chosen to be a small value, e.g. \(\alpha = 0.01\). Values of \(\alpha\) that are too large cause learning to be suboptimal (by failing to converge), while values of \(\alpha\) that are too small make learning slow (by taking too long to converge).

The weight delta \(\Delta \vec{w} = \vec{w_{i+1}} - \vec{w_{i}}\) and bias delta \(\Delta b = b_{i+1} - b_i\) are calculated using the delta rule. The delta rule is a special case of backpropagation for single-layer perceptrons, and is used to calculate the updates (or deltas) of the perceptron parameters. The delta rule can be derived by consistent application of the chain rule and power rule for calculating the partial derivatives \(\frac{\partial E(X)}{\partial \vec{w_i}}\) and \(\frac{\partial E(X)}{\partial b_i}\). For a perceptron with a mean squared error function and activation function \(g\), the delta rules for the weight vector \(\vec{w}\) and bias \(b\) are

\[\Delta \vec{w} = \frac{1}{N} \sum_{i=1}^N \alpha (y_i - o_i) g'(h_i) \vec{x_i}, \qquad \Delta b = \frac{1}{N} \sum_{i=1}^N \alpha (y_i - o_i) g'(h_i),\]

where \(o_i = g(\vec{w} \cdot \vec{x_i} + b)\) and \(h_i = \vec{w} \cdot \vec{x_i} + b\). Note that the presence of \(\vec{x_i}\) on the right side of the delta rule for \(\vec{w}\) implies that the weight vector's update delta is also a vector, which is expected.
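The delta rules translate directly into code. Below is a minimal sketch in Python with NumPy for a sigmoid perceptron; the dataset, learning rate, and iteration count are illustrative assumptions.

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# illustrative dataset: N input vectors x_i with target outputs y_i
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = np.array([0.0, 0.0, 0.0, 1.0])

w = np.zeros(2)  # weight vector
b = 0.0          # bias
alpha = 0.5      # learning rate

for _ in range(1000):
    h = X @ w + b         # forward phase: h_i = w . x_i + b
    o = sigmoid(h)        # forward phase: o_i = g(h_i)
    gprime = o * (1 - o)  # g'(h_i) for the sigmoid
    # backward phase: apply the delta rules, averaged over the N pairs
    w += (alpha / len(X)) * ((y - o) * gprime) @ X
    b += (alpha / len(X)) * np.sum((y - o) * gprime)

print(w, b)  # after training, sigmoid(X @ w + b) approximates y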
Training the Perceptron
Thus, given a set of \(N\) input-output pairs \(X = \left\{\big(\vec{x_1}, y_1\big), \ldots, \big(\vec{x_N}, y_N\big)\right\}\), learning consists of iteratively updating the values of \(\vec{w}\) and \(b\) according to the delta rules. This consists of two distinct computational phases:

1. Calculate the forward values. That is, for each pair in \(X\), calculate \(h_i\) and \(o_i\) for all \(\vec{x_i}\).

2. Calculate the backward values. Using the partial derivatives of the error function, update the weight vector and bias according to gradient descent. Specifically, use the delta rules and the values calculated in the forward phase to calculate the deltas for each.

The backward phase is so named because in multi-layer perceptrons there is no simple delta rule, so calculation of the partial derivatives proceeds backwards from the error between the target output and actual output \(y_i - o_i\) (this is where backpropagation gets its name). In a single-layer perceptron, there is no backward flow of information (there is only one "layer" of parameters), but the naming convention applies all the same.

Once the backward values are computed, they may be used to update the values of the weight vector \(\vec{w}\) and bias \(b\). The process repeats until the error function \(E(X)\) converges. Once the error function has converged, the weight vector \(\vec{w}\) and bias \(b\) can be fixed, and the forward phase used to calculate predicted values \(o\) of the true output \(y\) for any input \(x\). If the perceptron has learned the underlying function mapping inputs to outputs (i.e. not just memorized every pair \((\vec{x_i}, y_i)\)), it will even predict the correct values for input-output pairs it was not trained on, a property known as generalization. Ultimately, generalization is the primary goal of supervised learning, since it is desirable and a practical necessity to learn an unknown function based on a small sample of the set of all possible input-output pairs.

Limitations
It was mentioned earlier that single-layer perceptrons are linear classifiers. That is, they can only learn linearly separable patterns. Linearly separable patterns are datasets or functions that can be separated by a linear boundary (a line or hyperplane). Marvin Minsky and Seymour Papert showed in their seminal 1969 book Perceptrons that it was impossible for a perceptron to learn even simple non-linearly separable functions such as the XOR function. The XOR, or "exclusive or," function is a simple function on two binary inputs and is often found in bit twiddling hacks. A plot of, and truth table for, XOR is below.

[Figure: XOR function, with white dots and black dots representing outputs of \(0\) and \(1\), respectively]

x_1 | x_2 | XOR
----+-----+----
 0  |  0  |  0
 0  |  1  |  1
 1  |  0  |  1
 1  |  1  |  0
Notice that, for the plot of the XOR function, it is impossible to find a linear boundary that
separates the black and white inputs from one another. This is because XOR is not a
linearly separable function, and by extension, perceptrons cannot learn the XOR function.
Similar analogs exist in higher dimensions, i.e. more than two inputs.

Many other (indeed, most other) functions are not linearly separable, so what is needed is
an extension to the perceptron. The obvious extension is to add more layers of units so that
there are nonlinear computations in between the input and output. For a long time, it was
assumed by many in the field that adding more layers of units would fail to solve the linear
separability problem (even though Minsky and Papert knew that such an extension could
learn the XOR function), so research in the field of artificial neural networks stagnated for a
good decade. Indeed, this assumption turned out to be very wrong, as multi-layer
perceptrons, covered in the next section, can learn practically any function of interest.

Multi-layer Perceptron
The multi-layer perceptron (MLP) is an artificial neural network composed of many perceptrons. Unlike single-layer perceptrons, MLPs are capable of learning to compute non-linearly separable functions. Because they can learn nonlinear functions, they are one of the primary machine learning techniques for both regression and classification in supervised learning.

Layers
MLPs are usually organized into something called layers. As discussed in the sections
on neural networks as graphs and neural networks as layers, the generalized artificial
neural network consists of an input layer, some number (possibly zero) of hidden layers,
and an output layer. In the case of a single-layer perceptron, there are no hidden layers, so
the total number of layers is two. MLPs, on the other hand, have at least one hidden layer,
each composed of multiple perceptrons. An example of a feedforward neural network with
two hidden layers is below.

[Figure: A four-layer feedforward neural network]
It was mentioned in the introduction that feedforward neural networks have the property that
information (i.e. computation) flows forward through the network, i.e. there are no loops in
the computation graph (it is a directed acyclic graph, or DAG). Another way of saying this is
that the layers are connected in such a way that no layer's output depends on itself. In the
above figure, this is clear since each layer (other than the input layer) is only connected to
the layer directly to its left. Feedforward neural networks, by having DAGs for their
computation graphs, have a greatly simplified learning algorithm compared to recurrent
neural networks, which have cycles in their dependency graphs.

Generally speaking, if one is given a graph representing a feedforward network, it can always be grouped into layers such that each layer depends only on layers to its left. This can be done by performing a topological sort on the nodes and grouping nodes with the same depth into the same layer, where layer \(l_i\) consists of all nodes in the graph with depth \(i\) in the topological sort. Then, arranging layers \(l_0, \dots, l_k\) from left to right, each layer's nodes will only depend on nodes from layers to its left. A sketch of this layering procedure follows.
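The following Python sketch groups the nodes of a DAG into layers by their depth (here, the length of the longest incoming path), which is one way to realize the grouping described above; the encoding of the graph as a dictionary of predecessor lists is an assumption for illustration.

from collections import defaultdict

def group_into_layers(predecessors):
    # predecessors maps each node to the list of nodes it depends on;
    # input nodes have an empty list and end up in layer l_0.
    depth = {}

    def node_depth(node):
        if node not in depth:
            preds = predecessors[node]
            depth[node] = 0 if not preds else 1 + max(node_depth(p) for p in preds)
        return depth[node]

    layers = defaultdict(list)
    for node in predecessors:
        layers[node_depth(node)].append(node)
    # layers[i] contains the nodes of layer l_i
    return [layers[i] for i in sorted(layers)]

# example: x -> h1 -> o and x -> h2 -> o
print(group_into_layers({"x": [], "h1": ["x"], "h2": ["x"], "o": ["h1", "h2"]}))
# [['x'], ['h1', 'h2'], ['o']]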

Formal Definition
The following defines a prototypical \(m\)-layer (meaning \(m-2\) hidden layers) MLP that computes a one-dimensional output \(o\) on an \(n\)-dimensional input \(\vec{x} = \{x_1, \dots, x_n\}\).

Assume that

1. The output perceptron has an activation function \(g_o\) and hidden layer perceptrons have activation functions \(g\).

2. Every perceptron in layer \(l_i\) is connected to every perceptron in layer \(l_{i-1}\); layers are "fully connected." Thus, every perceptron depends on the outputs of all the perceptrons in the previous layer (this is without loss of generality since the weight connecting two perceptrons can still be zero, which is the same as no connection being present).

3. There are no connections between perceptrons in the same layer.

[Figure: A fully connected MLP on three inputs with two hidden layers, each with four perceptrons]

and use the following notation:

\(w_{ij}^k\): weight for perceptron \(j\) in layer \(l_k\) for incoming node \(i\) (a perceptron if \(k \ge 1\)) in layer \(l_{k-1}\)
\(b_i^k\): bias for perceptron \(i\) in layer \(l_k\)
\(h_i^k\): product sum plus bias for perceptron \(i\) in layer \(l_k\)
\(o_i^k\): output for node \(i\) in layer \(l_k\)
\(r_k\): number of nodes in layer \(l_k\)
\(\vec{w_i^k}\): weight vector for perceptron \(i\) in layer \(l_k\), i.e. \(\vec{w_i^k} = \big\{w_{1i}^k, \ldots, w_{r_{k-1}i}^k\big\}\)
\(\vec{o^k}\): output vector for layer \(l_k\), i.e. \(\vec{o^k} = \big\{o_1^k, \ldots, o_{r_k}^k\big\}\)
Then, computation of the MLP's output \(o\) proceeds according to the following steps:

1. Initialize the input layer \(l_0\).
   Set the values of the outputs \(o_i^0\) for nodes in the input layer \(l_0\) to their associated inputs in the vector \(\vec{x} = \{x_1, \dots, x_n\}\), i.e. \(o_i^0 = x_i\).

2. Calculate the product sums and outputs of each hidden layer in order from \(l_1\) to \(l_{m-1}\).
   For \(k\) from \(1\) to \(m-1\),
   a. compute \(h_i^k = \vec{w_i^k} \cdot \vec{o^{k-1}} + b_i^k = b_i^k + \sum_{j=1}^{r_{k-1}} w_{ji}^k o_j^{k-1}\) for \(i = 1, \ldots, r_k\);
   b. compute \(o_i^k = g\big(h_i^k\big)\) for \(i = 1, \ldots, r_k\).

3. Compute the output \(o\) for the output layer \(l_m\).
   a. Compute \(h_1^m = \vec{w_1^m} \cdot \vec{o^{m-1}} + b_1^m = b_1^m + \sum_{j=1}^{r_{m-1}} w_{j1}^m o_j^{m-1}\).
   b. Compute \(o = o_1^m = g_o(h_1^m)\).
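A compact, vectorized version of these three steps, written as a sketch in Python with NumPy; the layer sizes, random weights, and activation choices are illustrative assumptions.

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def mlp_forward(x, weights, biases, g=sigmoid, g_o=lambda z: z):
    # Steps 1-3 above: o^0 = x, then h^k = W^k o^{k-1} + b^k and o^k = g(h^k)
    o = x  # step 1: the input layer's outputs are the inputs themselves
    for W, b in zip(weights[:-1], biases[:-1]):  # step 2: hidden layers
        o = g(W @ o + b)
    W, b = weights[-1], biases[-1]               # step 3: output layer
    return g_o(W @ o + b)

# a 4-layer MLP on a 3-dimensional input: layer sizes 3 -> 4 -> 4 -> 1
rng = np.random.default_rng(0)
sizes = [3, 4, 4, 1]
weights = [rng.standard_normal((m, n)) for n, m in zip(sizes, sizes[1:])]
biases = [rng.standard_normal(m) for m in sizes[1:]]

print(mlp_forward(np.array([1.0, 0.5, -0.2]), weights, biases))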

Training MLPs
Like the single-layer perceptron, given a set of \(N\) input-output pairs \(X = \left\{\big(\vec{x_1}, y_1\big), \ldots, \big(\vec{x_N}, y_N\big)\right\}\), learning consists of iteratively updating the values of \(\vec{w_i^k}\) and \(b_i^k\) in order to minimize the mean squared error (MSE)

\[E(X) = \frac{1}{2N} \sum_{i=1}^N \left(o_i - y_i\right)^2,\]
where \(o_i\) denotes the output \(o\) (the result of step 3 in the computation above) of the MLP on input \(\vec{x_i}\). Analogous to the single-layer perceptron, minimizing \(E(X)\) with respect to all \(w_{ij}^k\) and \(b_i^k\) will yield a good classification boundary. Using gradient descent to adjust the parameters \(w_{ij}^k\) and \(b_i^k\) with a learning rate of \(\alpha\) yields the following delta equations for each iteration:

\[\Delta w_{ij}^k = -\alpha \frac{\partial E(X)}{\partial w_{ij}^k}, \qquad \Delta b_i^k = -\alpha \frac{\partial E(X)}{\partial b_i^k}.\]
The expansion of the right-hand side of the delta rules can be found using backpropagation, so called because the gradient information flows backwards through the network (i.e. in the direction opposite the output computation flow). This gradient flow originates in the final layer \(l_m\), proportional to the difference between the target output \(y\) and actual output \(o\).

Thus, one iteration of training for MLPs consists of two distinct computational phases:

1. Calculate the forward values. That is, for each pair in \(X = \left\{\big(\vec{x_1}, y_1\big), \ldots, \big(\vec{x_N}, y_N\big)\right\}\), calculate all \(h_i^k\) and \(o_i^k\) for all \(\vec{x_i}\).

2. Calculate the backward values. Using the partial derivatives of the error function, update the weights \(w_{ij}^k\) and biases \(b_i^k\) according to gradient descent. Specifically, use backpropagation and the values calculated in the forward phase to calculate the deltas for each.
Cite as: Feedforward Neural Networks. Brilliant.org. Retrieved 09:35, April 26, 2021, from https://brilliant.org/wiki/feedforward-neural-networks/

Backpropagation

Formal Definition
Backpropagation is analogous to calculating the delta rule for a multilayer feedforward network. Thus, like the delta rule, backpropagation requires three things:

1) A dataset consisting of input-output pairs \(\big(\vec{x_i}, \vec{y_i}\big)\), where \(\vec{x_i}\) is the input and \(\vec{y_i}\) is the desired output of the network on input \(\vec{x_i}\). The set of input-output pairs of size \(N\) is denoted \(X = \Big\{\big(\vec{x_1}, \vec{y_1}\big), \dots, \big(\vec{x_N}, \vec{y_N}\big)\Big\}\).

2) A feedforward neural network, as formally defined in the article concerning feedforward neural networks, whose parameters are collectively denoted \(\theta\). In backpropagation, the parameters of primary interest are \(w_{ij}^k\), the weight between node \(j\) in layer \(l_k\) and node \(i\) in layer \(l_{k-1}\), and \(b_i^k\), the bias for node \(i\) in layer \(l_k\). There are no connections between nodes in the same layer and layers are fully connected.

3) An error function, \(E(X, \theta)\), which defines the error between the desired output \(\vec{y_i}\) and the calculated output \(\hat{\vec{y_i}}\) of the neural network on input \(\vec{x_i}\) for a set of input-output pairs \(\big(\vec{x_i}, \vec{y_i}\big) \in X\) and a particular value of the parameters \(\theta\).

Training a neural network with gradient descent requires the calculation of the gradient of the error function \(E(X, \theta)\) with respect to the weights \(w_{ij}^k\) and biases \(b_i^k\). Then, according to the learning rate \(\alpha\), each iteration of gradient descent updates the weights and biases (collectively denoted \(\theta\)) according to

\[\theta^{t+1} = \theta^{t} - \alpha \frac{\partial E(X, \theta^{t})}{\partial \theta},\]

where \(\theta^{t}\) denotes the parameters of the neural network at iteration \(t\) of gradient descent.

What's the Target?

As mentioned in the previous section, one major problem in training multilayer feedforward neural networks is in deciding how to learn good internal representations, i.e. what the weights and biases for hidden layer nodes should be. Unlike the perceptron, which has the delta rule for approximating a well-defined target output, hidden layer nodes don't have a target output, since they are used as intermediate steps in the computation.

Since hidden layer nodes have no target output, one can't simply define an error function that is specific to that node. Instead, any error function for that node will be dependent on the values of the parameters in the previous layers (since previous layers determine the input for that node) and following layers (since the output of that node will affect the computation of the error function \(E(X, \theta)\)). This coupling of parameters between layers can make the math quite messy (primarily as a result of using the product rule, discussed below), and if not implemented cleverly, can make the final gradient descent calculations slow. Backpropagation addresses both of these issues by simplifying the mathematics of gradient descent, while also facilitating its efficient calculation.

Formal Definition

The formulation below is for a neural network with one output, but the algorithm can be applied to a network with any number of outputs by consistent application of the chain rule and power rule. Thus, for all the following examples, input-output pairs will be of the form \((\vec{x}, y)\), i.e. the target value \(y\) is not a vector.

Remembering the general formulation for a feedforward neural network,

\(w_{ij}^k\): weight for node \(j\) in layer \(l_k\) for incoming node \(i\)
\(b_i^k\): bias for node \(i\) in layer \(l_k\)
\(a_i^k\): product sum plus bias (activation) for node \(i\) in layer \(l_k\)
\(o_i^k\): output for node \(i\) in layer \(l_k\)
\(r_k\): number of nodes in layer \(l_k\)
\(g\): activation function for the hidden layer nodes
\(g_o\): activation function for the output layer nodes

The error function in classic backpropagation is the mean squared error

\[E(X, \theta) = \frac{1}{2N} \sum_{i=1}^N \left(\hat{y_i} - y_i\right)^2,\]

where \(y_i\) is the target value for input-output pair \((\vec{x_i}, y_i)\) and \(\hat{y_i}\) is the computed output of the network on input \(\vec{x_i}\). Again, other error functions can be used, but the mean squared error's historical association with backpropagation and its convenient mathematical properties make it a good choice for learning the method.

Deriving the Gradients

The derivation of the backpropagation algorithm is fairly straightforward. It follows from the use of the chain rule and product rule in differential calculus. Application of these rules is dependent on the differentiation of the activation function, one of the reasons the Heaviside step function is not used (it is discontinuous and therefore not differentiable).

Preliminaries

For the rest of this section, the derivative of a function \(f(x)\) will be denoted \(f^{\prime}(x)\), so that the sigmoid function's derivative is \(\sigma^{\prime}(x)\).

To simplify the mathematics further, the bias \(b_i^k\) for node \(i\) in layer \(k\) will be incorporated into the weights as \(w_{0i}^k\), with a fixed output of \(o_0^{k-1} = 1\) for node \(0\) in layer \(k-1\). Thus,

\[w_{0i}^k = b_i^k.\]

To see that this is equivalent to the original formulation, note that

\[a_i^k = b_i^k + \sum_{j=1}^{r_{k-1}} w_{ji}^k o_j^{k-1} = \sum_{j=0}^{r_{k-1}} w_{ji}^k o_j^{k-1},\]

where the left side is the original formulation and the right side is the new formulation.
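In code, this trick amounts to prepending a constant 1 to each layer's output vector and storing the biases as row 0 of the weight matrix. A minimal sketch (the shapes and values are illustrative):

import numpy as np

o_prev = np.array([0.2, 0.7, 0.1])  # outputs o^{k-1} of the previous layer
W = np.array([[0.5, -0.5],          # row 0 holds the biases w_{0i}^k = b_i^k
              [1.0,  0.3],
              [-0.2, 0.8],
              [0.4, -0.6]])         # remaining rows hold the ordinary weights w_{ji}^k

o_with_bias = np.concatenate(([1.0], o_prev))  # fixed output o_0^{k-1} = 1
a = o_with_bias @ W                            # a_i^k = sum over j >= 0 of w_{ji}^k o_j^{k-1}
print(a)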

Using the notation above, backpropagation attempts to minimize the following error function with respect to the neural network's weights:

\[E(X, \theta) = \frac{1}{2N} \sum_{i=1}^N \left(\hat{y_i} - y_i\right)^{2}\]

by calculating, for each weight \(w_{ij}^k\), the value of \(\frac{\partial E}{\partial w_{ij}^k}\). Since the error function can be decomposed into a sum over individual error terms for each individual input-output pair, the derivative can be calculated with respect to each input-output pair individually and then combined at the end (since the derivative of a sum of functions is the sum of the derivatives of each function):

\[\frac{\partial E(X, \theta)}{\partial w_{ij}^k} = \frac{1}{N} \sum_{d=1}^N \frac{\partial}{\partial w_{ij}^k} \left(\frac{1}{2}\left(\hat{y_d} - y_d\right)^{2}\right) = \frac{1}{N} \sum_{d=1}^N \frac{\partial E_d}{\partial w_{ij}^k}.\]

Thus, for the purposes of derivation, the backpropagation algorithm will concern itself with only one input-output pair. Once this is derived, the general form for all input-output pairs in \(X\) can be generated by combining the individual gradients. Thus, the error function in question for the derivation is

\[E = \frac{1}{2}\left(\hat{y} - y\right)^{2},\]

where the subscript \(d\) in \(E_d\), \(\hat{y_d}\), and \(y_d\) is omitted for simplification.

Error Function Derivatives

The derivation of the backpropagation algorithm begins by applying the chain rule to the error function partial derivative

\[\frac{\partial E}{\partial w_{ij}^k} = \frac{\partial E}{\partial a_j^k} \frac{\partial a_j^k}{\partial w_{ij}^k},\]

where \(a_j^k\) is the activation (product sum plus bias) of node \(j\) in layer \(k\) before it is passed to the nonlinear activation function (in this case, the sigmoid function) to generate the output. This decomposition of the partial derivative basically says that the change in the error function due to a weight is a product of the change in the error function \(E\) due to the activation \(a_j^k\) times the change in the activation \(a_j^k\) due to the weight \(w_{ij}^k\).

The first term is usually called the error, for reasons discussed below. It is denoted

\[\delta_j^k \equiv \frac{\partial E}{\partial a_j^k}.\]

The second term can be calculated from the equation for \(a_j^k\) above:

\[\frac{\partial a_j^k}{\partial w_{ij}^k} = \frac{\partial}{\partial w_{ij}^k} \left(\sum_{l=0}^{r_{k-1}} w_{lj}^k o_l^{k-1}\right) = o_i^{k-1}.\]

Thus, the partial derivative of the error function \(E\) with respect to a weight \(w_{ij}^k\) is

\[\frac{\partial E}{\partial w_{ij}^k} = \delta_j^k o_i^{k-1}.\]

Thus, the partial derivative of the error function with respect to a weight is a product of the error term \(\delta_j^k\) at node \(j\) in layer \(k\) and the output \(o_i^{k-1}\) of node \(i\) in layer \(k-1\). This makes intuitive sense, since the weight \(w_{ij}^k\) connects the output of node \(i\) in layer \(k-1\) to the input of node \(j\) in layer \(k\) in the computation graph.
It is important to note that the above partial derivatives have all been calculated without any consideration of a particular error function or activation function. However, since the error term \(\delta_j^k\) still needs to be calculated, and is dependent on the error function \(E\), at this point it is necessary to introduce specific functions for both of these. As mentioned previously, classic backpropagation uses the mean squared error function (which is the squared error function for the single input-output pair case) and the sigmoid activation function.

The calculation of the error \(\delta_j^{k}\) will be shown to be dependent on the values of the error terms in the next layer. Thus, computation of the error terms will proceed backwards from the output layer down to the input layer. This is where backpropagation, or backwards propagation of errors, gets its name.

The Output Layer

Starting from the final layer, backpropagation attempts to define the value \(\delta_1^m\), where \(m\) is the final layer (the subscript is \(1\) and not \(j\) because this derivation concerns a one-output neural network, so there is only one output node \(j = 1\)). For example, a four-layer neural network will have \(m = 3\) for the final layer, \(m = 2\) for the second-to-last layer, and so on. Expressing the error function \(E\) in terms of the value \(a_1^m\) (since \(\delta_1^m\) is a partial derivative with respect to \(a_1^m\)) gives

\[E = \frac{1}{2}\left(\hat{y} - y\right)^{2} = \frac{1}{2}\big(g_o(a_1^m) - y\big)^{2},\]

where \(g_o(x)\) is the activation function for the output layer.

Thus, applying the partial derivative and using the chain rule gives

\[\delta_1^m = \left(g_o(a_1^m) - y\right) g_o^{\prime}(a_1^m) = \left(\hat{y} - y\right) g_o^{\prime}(a_1^m).\]

Putting it all together, the partial derivative of the error function \(E\) with respect to a weight in the final layer \(w_{i1}^m\) is

\[\frac{\partial E}{\partial w_{i1}^m} = \delta_1^m o_i^{m-1} = \left(\hat{y} - y\right) g_o^{\prime}(a_1^m)\, o_i^{m-1}.\]
The Hidden Layers

Now the question arises of how to calculate the partial derivatives of layers other than the output layer. Luckily, the chain rule for multivariate functions comes to the rescue again. Observe the following equation for the error term \(\delta_j^k\) in layer \(1 \le k < m\):

\[\delta_j^k = \frac{\partial E}{\partial a_j^k} = \sum_{l=1}^{r_{k+1}} \frac{\partial E}{\partial a_l^{k+1}} \frac{\partial a_l^{k+1}}{\partial a_j^k},\]

where \(l\) ranges from \(1\) to \(r_{k+1}\) (the number of nodes in the next layer). Note that, because the bias input \(o_0^k\) corresponding to \(w_{0j}^{k+1}\) is fixed, its value is not dependent on the outputs of previous layers, and thus \(l\) does not take on the value \(0\).

Plugging in the error term \(\delta_l^{k+1}\) gives the following equation:

\[\delta_j^k = \sum_{l=1}^{r_{k+1}} \delta_l^{k+1} \frac{\partial a_l^{k+1}}{\partial a_j^k}.\]

Remembering the definition of \(a_l^{k+1}\),

\[a_l^{k+1} = \sum_{j=1}^{r_k} w_{jl}^{k+1} g\big(a_j^k\big),\]

where \(g(x)\) is the activation function for the hidden layers, we have

\[\frac{\partial a_l^{k+1}}{\partial a_j^k} = w_{jl}^{k+1} g^{\prime}\big(a_j^k\big).\]

Plugging this into the above equation yields a final equation for the error term \(\delta_j^k\) in the hidden layers, called the backpropagation formula:

\[\delta_j^k = \sum_{l=1}^{r_{k+1}} \delta_l^{k+1} w_{jl}^{k+1} g^{\prime}\big(a_j^k\big) = g^{\prime}\big(a_j^k\big) \sum_{l=1}^{r_{k+1}} w_{jl}^{k+1} \delta_l^{k+1}.\]
Putting it all together, the partial derivative of the error function \(E\) with respect to a weight in the hidden layers \(w_{ij}^k\) for \(1 \le k < m\) is

\[\frac{\partial E}{\partial w_{ij}^k} = \delta_j^k o_i^{k-1} = g^{\prime}\big(a_j^k\big)\, o_i^{k-1} \sum_{l=1}^{r_{k+1}} w_{jl}^{k+1} \delta_l^{k+1}.\]
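The backpropagation formula has a natural vectorized form: the error vector for layer \(k\) is the next layer's error vector pushed back through the weight matrix and scaled elementwise by \(g^{\prime}\). A minimal sketch in Python with NumPy (the shapes, names, and example values are illustrative assumptions):

import numpy as np

def backpropagate_errors(delta_next, W_next, a_k, g_prime):
    # delta^k_j = g'(a^k_j) * sum_l w_{jl}^{k+1} delta^{k+1}_l
    # delta_next: error terms of layer k+1, shape (r_{k+1},)
    # W_next:     weights w_{jl}^{k+1}, shape (r_k, r_{k+1})
    # a_k:        activations of layer k, shape (r_k,)
    return g_prime(a_k) * (W_next @ delta_next)

# example with sigmoid hidden units: g'(a) = sigmoid(a) * (1 - sigmoid(a))
sigmoid = lambda z: 1 / (1 + np.exp(-z))
g_prime = lambda a: sigmoid(a) * (1 - sigmoid(a))

delta_next = np.array([0.2])               # one output node
W_next = np.array([[0.5], [-0.3], [0.8]])  # 3 hidden nodes feeding 1 output node
a_k = np.array([0.1, -0.4, 0.7])
print(backpropagate_errors(delta_next, W_next, a_k, g_prime))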
Backpropagation as Backwards Computation

This equation is where backpropagation gets its name. Namely, the error \(\delta_j^k\) at layer \(k\) is dependent on the errors \(\delta_l^{k+1}\) at the next layer \(k+1\). Thus, errors flow backward, from the last layer to the first layer. All that is needed is to compute the first error terms based on the computed output \(\hat{y} = g_o(a_1^m)\) and target output \(y\). Then, the error terms for the previous layer are computed by performing a product sum (weighted by \(w_{jl}^{k+1}\)) of the error terms for the next layer and scaling it by \(g^{\prime}\big(a_j^k\big)\), repeated until the input layer is reached.

This backwards propagation of errors is very similar to the forward computation that calculates the neural network's output. Thus, calculating the output is often called the forward phase, while calculating the error terms and derivatives is often called the backward phase. In the forward direction, the inputs are repeatedly recombined from the first layer to the last by product sums dependent on the weights \(w_{ij}^k\) and transformed by nonlinear activation functions \(g(x)\) and \(g_o(x)\). In the backward direction, the "inputs" are the final layer's error terms, which are repeatedly recombined from the last layer to the first by product sums dependent on the weights \(w_{jl}^{k+1}\) and transformed by nonlinear scaling factors \(g_o^{\prime}\big(a_j^m\big)\) and \(g^{\prime}\big(a_j^k\big)\).

Furthermore, because the computations of the backward phase depend on the activations \(a_j^k\) and outputs \(o_j^k\) of the nodes in the previous layer (for the non-error term) and the next layer (for the error term of a hidden layer), all of these values must be computed before the backward phase can commence. Thus, the forward phase precedes the backward phase in every iteration of gradient descent. In the forward phase, the activations \(a_j^k\) and outputs \(o_j^k\) are stored for use in the backward phase. Once the backward phase is completed and the partial derivatives are known, the weights (and associated biases \(b_j^k = w_{0j}^k\)) can be updated by gradient descent. This process is repeated until a local minimum is found or a convergence criterion is met.

The Backpropagation Algorithm

Using the terms defined in the section titled Formal Definition and the equations derived in the section titled Deriving the Gradients, the backpropagation algorithm is dependent on the following five equations:

For the partial derivatives,

\[\frac{\partial E_d}{\partial w_{ij}^k} = \delta_j^k o_i^{k-1}.\]

For the final layer's error term,

\[\delta_1^m = g_o^{\prime}(a_1^m)\left(\hat{y_d} - y_d\right).\]

For the hidden layers' error terms,

\[\delta_j^k = g^{\prime}\big(a_j^k\big) \sum_{l=1}^{r_{k+1}} w_{jl}^{k+1} \delta_l^{k+1}.\]

For combining the partial derivatives for each input-output pair,

\[\frac{\partial E(X, \theta)}{\partial w_{ij}^k} = \frac{1}{N} \sum_{d=1}^N \frac{\partial}{\partial w_{ij}^k}\left(\frac{1}{2}\left(\hat{y_d} - y_d\right)^{2}\right) = \frac{1}{N} \sum_{d=1}^N \frac{\partial E_d}{\partial w_{ij}^k}.\]

For updating the weights,

\[\Delta w_{ij}^k = -\alpha \frac{\partial E(X, \theta)}{\partial w_{ij}^k}.\]
The General Algorithm

The backpropagation algorithm proceeds in the following steps, assuming a suitable learning rate \(\alpha\) and random initialization of the parameters \(w_{ij}^k\):

1) Calculate the forward phase for each input-output pair \((\vec{x_d}, y_d)\) and store the results \(\hat{y_d}\), \(a_j^k\), and \(o_j^k\) for each node \(j\) in layer \(k\) by proceeding from layer \(0\), the input layer, to layer \(m\), the output layer.

2) Calculate the backward phase for each input-output pair \((\vec{x_d}, y_d)\) and store the results \(\frac{\partial E_d}{\partial w_{ij}^k}\) for each weight \(w_{ij}^k\) connecting node \(i\) in layer \(k-1\) to node \(j\) in layer \(k\) by proceeding from layer \(m\), the output layer, to layer \(1\), the first hidden layer.

a) Evaluate the error term for the final layer, \(\delta_1^m\), by using the second equation.
b) Backpropagate the error terms for the hidden layers, \(\delta_j^k\), working backwards from the final hidden layer \(k = m-1\), by repeatedly using the third equation.
c) Evaluate the partial derivatives of the individual error \(E_d\) with respect to \(w_{ij}^k\) by using the first equation.

3) Combine the individual gradients \(\frac{\partial E_d}{\partial w_{ij}^k}\) for each input-output pair to get the total gradient \(\frac{\partial E(X, \theta)}{\partial w_{ij}^k}\) for the entire set of input-output pairs \(X = \big\{(\vec{x_1}, y_1), \dots, (\vec{x_N}, y_N)\big\}\) by using the fourth equation (a simple average of the individual gradients).

4) Update the weights according to the learning rate \(\alpha\) and total gradient \(\frac{\partial E(X, \theta)}{\partial w_{ij}^k}\) by using the fifth equation (moving in the direction of the negative gradient).
Backpropagation in Sigmoidal Neural Networks

The classic backpropagation algorithm was designed for regression problems with sigmoidal activation units. While backpropagation can be applied to classification problems as well as networks with non-sigmoidal activation functions, the sigmoid function has convenient mathematical properties which, when combined with an appropriate output activation function, greatly simplify understanding of the algorithm. Thus, in the classic formulation, the activation function for hidden nodes is sigmoidal \(\big(g(x) = \sigma(x)\big)\) and the output activation function is the identity function \(\big(g_o(x) = x\big)\) (the network output is just a weighted sum of its hidden layer, i.e. the activation).

Backpropagation is actually a major motivating factor in the historical use of sigmoid activation functions, due to the sigmoid's convenient derivative:

\[g^{\prime}(x) = \frac{\partial \sigma(x)}{\partial x} = \sigma(x)\big(1 - \sigma(x)\big).\]

Thus, calculating the derivative of the sigmoid function requires nothing more than remembering the output \(\sigma(x)\) and plugging it into the equation above.
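In code, this means the backward phase can reuse the stored forward output instead of recomputing the sigmoid; a short illustration (the activation values are arbitrary examples):

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

a = np.array([-1.0, 0.0, 2.0])  # activations from the forward phase
s = sigmoid(a)                  # forward phase stores the outputs sigma(a)
ds = s * (1 - s)                # backward phase: derivative from the stored output alone
print(ds)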

Furthermore, the derivative of the output activation function is also very simple:

\[g_o^{\prime}(x) = \frac{\partial g_o(x)}{\partial x} = \frac{\partial x}{\partial x} = 1.\]

Thus, using these two activation functions removes the need to remember the activation values \(a_1^m\) and \(a_j^k\) in addition to the output values \(o_1^m\) and \(o_j^k\), greatly reducing the memory footprint of the algorithm. This is because the derivative for the sigmoid activation function in the backward phase only needs to recall the output of that function in the forward phase, and is not dependent on the actual activation value, which is the case in the more general formulation of backpropagation, where \(g^{\prime}\big(a_j^k\big)\) must be calculated. Similarly, the derivative of the identity activation function does not depend on anything, since it is a constant.

Thus, for a feedforward neural network with sigmoidal hidden units and an identity output unit, the error term equations are as follows:

For the final layer's error term,

\[\delta_1^m = \hat{y_d} - y_d.\]

For the hidden layers' error terms,

\[\delta_j^k = o_j^k\big(1 - o_j^k\big) \sum_{l=1}^{r_{k+1}} w_{jl}^{k+1} \delta_l^{k+1}.\]
Code Example

The following code example is for a sigmoidal neural network as described in the previous subsection. It has one hidden layer and one output node in the output layer. The code is written in Python 3 and makes heavy use of the NumPy library for performing matrix math. Because the calculations of the gradient for individual input-output pairs \((\vec{x_d}, y_d)\) can be done in parallel, and many calculations are based on taking the dot product of two vectors, matrices are a natural way to represent the input data, output data, and layer weights. NumPy's efficient computation of matrix products and the ability to use modern GPUs (which are optimized for matrix operations) can give significant speedups in both the forward and backward phases of computation.

import numpy as np

# define the sigmoid function
def sigmoid(x, derivative=False):
    if derivative:
        return sigmoid(x) * (1 - sigmoid(x))
    else:
        return 1 / (1 + np.exp(-x))

# choose a random seed for reproducible results
np.random.seed(1)

# learning rate
alpha = .1

# number of nodes in the hidden layer
num_hidden = 3

# inputs
X = np.array([
    [0, 0, 1],
    [0, 1, 1],
    [1, 0, 0],
    [1, 1, 0],
    [1, 0, 1],
    [1, 1, 1],
])

# outputs
# x.T is the transpose of x, making this a column vector
y = np.array([[0, 1, 0, 1, 1, 0]]).T

# initialize weights randomly with mean 0 and range [-1, 1]
# the +1 in the 1st dimension of the weight matrices is for the bias weight
hidden_weights = 2 * np.random.random((X.shape[1] + 1, num_hidden)) - 1
output_weights = 2 * np.random.random((num_hidden + 1, y.shape[1])) - 1

# number of iterations of gradient descent
num_iterations = 10000

# for each iteration of gradient descent
for i in range(num_iterations):

    # forward phase
    # np.hstack((np.ones(...), X)) adds a fixed input of 1 for the bias weight
    input_layer_outputs = np.hstack((np.ones((X.shape[0], 1)), X))
    hidden_layer_outputs = np.hstack((np.ones((X.shape[0], 1)), sigmoid(np.dot(input_layer_outputs, hidden_weights))))
    output_layer_outputs = np.dot(hidden_layer_outputs, output_weights)

    # backward phase
    # output layer error term
    output_error = output_layer_outputs - y
    # hidden layer error term
    # [:, 1:] removes the bias term from the backpropagation
    hidden_error = hidden_layer_outputs[:, 1:] * (1 - hidden_layer_outputs[:, 1:]) * np.dot(output_error, output_weights.T[:, 1:])

    # partial derivatives
    hidden_pd = input_layer_outputs[:, :, np.newaxis] * hidden_error[:, np.newaxis, :]
    output_pd = hidden_layer_outputs[:, :, np.newaxis] * output_error[:, np.newaxis, :]

    # average for total gradients
    total_hidden_gradient = np.average(hidden_pd, axis=0)
    total_output_gradient = np.average(output_pd, axis=0)

    # update weights
    hidden_weights += -alpha * total_hidden_gradient
    output_weights += -alpha * total_output_gradient

# print the final outputs of the neural network on the inputs X
print("Output After Training: \n{}".format(output_layer_outputs))

The matrix X is the set of inputs \(\vec{x}\) and the matrix y is the set of outputs \(y\). The number of nodes in the hidden layer can be customized by setting the value of the variable num_hidden. The learning rate \(\alpha\) is controlled by the variable alpha. The number of iterations of gradient descent is controlled by the variable num_iterations.

By changing these variables and comparing the output of the program to the target values y, one can see how these variables control how well backpropagation can learn the dataset X and y. For example, more nodes in the hidden layer and more iterations of gradient descent will generally improve the fit to the training dataset. However, using too large or too small a learning rate can cause the model to diverge or converge too slowly, respectively.
Cite as: Backpropagation. Brilliant.org. Retrieved 09:36, April 26, 2021, from https://brilliant.org/wiki/backpropagation/

Perceptron

The perceptron is a machine learning algorithm used to determine whether an input belongs to one class or another. For example, the perceptron algorithm can determine the AND operator: given binary inputs \(x_1\) and \(x_2\), is (\(x_1\) AND \(x_2\)) equal to 0 or 1?

[Figure: The AND operation between two numbers. A red dot represents one class (\(x_1\) AND \(x_2\) = 0) and a blue dot represents the other class (\(x_1\) AND \(x_2\) = 1). The line is the result of the perceptron algorithm, which separates all data points of one class from those of the other.]

The perceptron algorithm was one of the first artificial neural networks to be produced and is the building block for one of the most commonly used neural networks, the multilayer perceptron.

Contents

 Properties

 Definition

 Supervised Learning

 Implementation

 Summary

 References

Properties
The perceptron algorithm is frequently used in supervised learning, which is a machine learning task that has the advantage of being trained on labeled data. This is contrasted with unsupervised learning, which is trained on unlabeled data. Specifically, the perceptron algorithm focuses on binary classified data, objects that are either members of one class or another. Additionally, it allows for online learning, which simply means that it processes elements in the training dataset one at a time (which can be useful for large datasets).

[Figure: An example of binary classified data and decision boundaries used by classifiers [1]]

Furthermore, the perceptron algorithm is a type of linear classifier, which classifies data points by using a linear combination of the variables used. As seen in the graph above, a linear classifier uses lines (e.g. \(H_1\), \(H_2\), or \(H_3\)) to classify data points: any object on one side of the line is part of one class and any object on the other side is part of the other class. In this example, a successful linear classifier could use \(H_1\) or \(H_2\) to discriminate between the two classes, whereas \(H_3\) would be a poor decision boundary.

An interesting consequence of the perceptron's properties is that it is unable to learn an


XOR function! As we see above, OR and AND functions are linearly separable, which
means that there exists a line that can separate all data points of one class from all data
points of the other. However, the XOR function is not linearly separable, and therefore the
perceptron algorithm (a linear classifier) cannot successfully learn the concept. This is a
principal reason why the perceptron algorithm by itself is not used for complex machine
learning tasks, but is rather a building block for a neural network that can handle linearly
inseparable classifications.

Definition
The perceptron is an algorithm used to produce a binary classifier. That is, the algorithm takes binary classified input data, along with their class membership, and outputs a line that attempts to separate data of one class from data of the other: data points on one side of the line are of one class and data points on the other side are of the other.

Specifically, given an input with \(k\) variables \(x_1, x_2, \ldots, x_k\), a line is a linear combination of these variables: \(w_1 x_1 + w_2 x_2 + \cdots + w_k x_k + b = 0\), where \(w_1, \ldots, w_k\) and \(b\) are constants. Note that this can also be written as \(\boldsymbol{w} \cdot \boldsymbol{x} + b = 0\), where \(\cdot\) is the dot product between the two vectors \(\boldsymbol{w}\) and \(\boldsymbol{x}\).

The perceptron algorithm returns values of \(w_1, \ldots, w_k\) and \(b\) such that data points on one side of the line are of one class and data points on the other side are of the other. Mathematically, the values of \(\boldsymbol{w}\) and \(b\) are used by the binary classifier in the following way: if \(\boldsymbol{w} \cdot \boldsymbol{x} + b > 0\), the classifier returns 1; otherwise, it returns 0. Note that 1 represents membership of one class and 0 represents membership of the other. This can be seen more clearly with the AND operator, replicated below for convenience.

[Figure: The AND operation between two numbers. A red dot represents one class (\(x_1\) AND \(x_2\) = 0) and a blue dot represents the other class (\(x_1\) AND \(x_2\) = 1). The line is the result of the perceptron algorithm, which separates all data points of one class from those of the other.]
So what do \(\boldsymbol{w}\) and \(b\) stand for? \(\boldsymbol{w}\) represents the weights of the \(k\) variables. Simply put, a variable's weight determines how steep the line is relative to that variable. A weight is needed for every variable; otherwise, the line would be flat relative to that variable, which may prevent the line from successfully classifying the data. Furthermore, \(b\) represents the bias of the data. Essentially, this prevents the line from being dependent on the origin (the point \((0,0)\)): the bias shifts the line up or down to better classify the data.

Supervised Learning
The perceptron algorithm learns to separate data by changing weights and bias over time, where time is denoted as the number of times the algorithm has been run. As such, \(\boldsymbol{w}(t)\) represents the value of the weights at time \(t\) and \(b(t)\) represents the value of the bias at time \(t\). Additionally, \(\alpha\) represents the learning rate, that is, how quickly the algorithm responds to changes. This value has the bound \(0 < \alpha \le 1\). \(\alpha\) cannot be 0, as this would mean that no learning occurs. If \(\alpha\) is a large value, the algorithm has a propensity to oscillate around the solution, as illustrated later.

To better elucidate these concepts, the formal steps of the perceptron algorithm are detailed below. In the following, \(d_i\) represents the correct output value for input \(\vec{x_i}\); one class is given \(d_i = 1\) if \(\vec{x_i}\) is a member of that class and \(d_i = 0\) otherwise.

1. Begin by setting \(\boldsymbol{w}(0)\), \(b(0)\), and \(t\) to 0.

2. For each input \(\boldsymbol{x_i}\), determine whether \(\boldsymbol{w}(t) \cdot \boldsymbol{x_i} + b > 0\). Let \(y_i\) be the output for input \(\boldsymbol{x_i}\) (1 if true, 0 if false).

3. The weights and bias are now updated for the next iteration of the algorithm: \(\boldsymbol{w}(t+1) = \boldsymbol{w}(t) + \alpha(d_i - y_i)\boldsymbol{x_i}\) and \(b(t+1) = b(t) + \alpha(d_i - y_i)\) for all inputs.

4. If the learning is offline (if the inputs can be scanned multiple times), steps 2 and 3 can be repeated until errors are minimized. Note: \(t\) is incremented on every iteration.
An example is as follows:

Suppose we are attempting to learn the AND operator for the following input-class pairs \(\big((x_1, x_2), d_i\big)\): \(\big((0, 0), 0\big)\), \(\big((0, 1), 0\big)\), \(\big((1, 0), 0\big)\), and \(\big((1, 1), 1\big)\). Let us use a learning rate of \(\alpha = 0.5\) and run through the algorithm until we can classify all four points correctly.

t | w(t), b(t)                     | outputs y        | updated parameters
1 | w(0) = [0, 0], b(0) = 0        | y = [0, 0, 0, 0] | w(1) = [0.5, 0.5], b(1) = 0.5
2 | w(1) = [0.5, 0.5], b(1) = 0.5  | y = [1, 1, 1, 1] | w(2) = [0, 0], b(2) = -1
3 | w(2) = [0, 0], b(2) = -1       | y = [0, 0, 0, 0] | w(3) = [0.5, 0.5], b(3) = -0.5
4 | w(3) = [0.5, 0.5], b(3) = -0.5 | y = [0, 0, 0, 1] | SUCCESS!

[Figure: The perceptron algorithm over time. The green line represents the result of the perceptron algorithm after the second iteration and the black line represents the final result of the perceptron algorithm (after iteration 4).]

In the previous example, the perceptron algorithm terminates at the correct values fairly quickly. One reason this occurs is a well-chosen learning rate \(\alpha\). With a smaller \(\alpha\), the algorithm would take more iterations to finish, whereas a larger \(\alpha\) could result in the algorithm oscillating forever.

Implementation
An implementation of the perceptron algorithm is provided below (in Python):
# Example of the AND operator, as described above
alpha = 0.5
input_data = [([0, 0], 0), ([0, 1], 0), ([1, 0], 0), ([1, 1], 1)]
weights = [0, 0]
bias = 0

# Begin algorithm
def perceptron():
    global weights, bias
    # Repeat until we minimize error
    while True:
        # Start with the weights and bias from time t
        new_weights = list(weights)
        new_bias = bias

        # For each input data point
        for list_of_vars, correct_value in input_data:
            # Add the bias (intercept) to the value of the line
            comparison = bias

            # For each variable, compute the value of the line
            for index in range(len(list_of_vars)):
                comparison += weights[index] * list_of_vars[index]

            # Obtain the classification of the algorithm
            classified_value = int(comparison > 0)

            # If the values are different, add the error to the weights and the bias
            if classified_value != correct_value:
                for index in range(len(list_of_vars)):
                    new_weights[index] += alpha * (correct_value - classified_value) * list_of_vars[index]
                new_bias += alpha * (correct_value - classified_value)

        # If there is no change in weights or bias, every point is classified correctly
        if new_weights == weights and new_bias == bias:
            return (new_weights, new_bias)

        # Otherwise, carry the updated weights and bias into iteration t + 1
        weights, bias = new_weights, new_bias
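
As a quick check, calling the function reproduces the final row of the table above; once the fourth pass produces no further changes, the learned weights and bias are returned:

print(perceptron())   # -> ([0.5, 0.5], -0.5), i.e. w(3) and b(3)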

Summary
The perceptron algorithm is one of the most commonly used machine learning algorithms for binary classification. Machine learning tasks that have used the perceptron include determining gender, classifying disease risk as low or high, and virus detection. In fact, any task that involves separating data into two linearly separable groups can use the perceptron! Furthermore, the multilayer perceptron combines many perceptron-like units to distinguish classes that are not linearly separable, which further increases the number of tasks in which perceptrons can be used!

Overall, the perceptron algorithm (and the ideas behind it) is one of the main building blocks of neural networks, and understanding it is crucial for the development of more complex networks.


Cite as: Perceptron. Brilliant.org. Retrieved 09:36, April 26, 2021, from https://brilliant.org/wiki/perceptron/

Autoencoder
Autoencoders are a type of artificial neural network which attempt to reconstruct data from a compressed representation. An autoencoder consists of an encoder, a bottleneck, and a decoder. The encoder receives an input and compresses it into a dense representation in the bottleneck layer, which has fewer neurons than the input. The decoder takes the information from the bottleneck and attempts to reconstruct the input.
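
A rough, untrained sketch of this encoder-bottleneck-decoder shape is below; the sizes (an 8-dimensional input compressed to a 3-neuron bottleneck) and the random weights are assumptions for illustration, and training would adjust the weights to minimize the reconstruction error.

import numpy as np

# Assumed sizes: 8-dimensional input, 3-neuron bottleneck
rng = np.random.default_rng(0)
W_enc = rng.normal(size=(3, 8))   # encoder weights (untrained)
W_dec = rng.normal(size=(8, 3))   # decoder weights (untrained)

def encode(x):
    return np.tanh(W_enc @ x)     # compress the input into the bottleneck

def decode(z):
    return W_dec @ z              # attempt to reconstruct the input

x = rng.normal(size=8)
reconstruction = decode(encode(x))
# Training would minimize the reconstruction error ||x - reconstruction||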

Recurrent Neural Network



A simple recurrent neural network


Recurrent neural networks are artificial neural networks where the computation graph
contains directed cycles. Unlike feedforward neural networks, where information flows
strictly in one direction from layer to layer, in recurrent neural networks (RNNs), information
travels in loops from layer to layer so that the state of the model is influenced by its previous
states. While feedforward neural networks can be thought of as stateless, RNNs have
a memory which allows the model to store information about its past computations. This
allows recurrent neural networks to exhibit dynamic temporal behavior and model
sequences of input-output pairs.

Because they can model temporal sequences of input-output pairs, recurrent neural
networks have found enormous success in natural language processing (NLP) applications.
This includes machine translation, speech recognition, and language modeling. RNNs have
also been used in reinforcement learning to solve very difficult problems at a level better
than humans. A recent example is AlphaGo, which beat world champion Go player Lee
Sedol in 2016. An interactive example of an RNN for generating handwriting samples can
be found here.

Contents

 Problems with Modeling Sequences

 Recurrent Neural Networks

 Unrolling RNNs

 Backpropagation through Time (BPTT)

 Vanishing/Exploding Gradients Problem

 Long Short-term Memory

 References

Problems with Modeling Sequences

Consider an application that needs to predict an output sequence \(y = (y_1, y_2, \dots, y_n)\) for a given input sequence \(x = (x_1, x_2, \dots, x_m)\). For example, in an application for translating English to Spanish, the input \(x\) might be the English sentence \(\text{"i like pizza"}\) and the associated output sequence \(y\) would be the Spanish sentence \(\text{"me gusta comer pizza"}\). Thus, if the sequence is broken up by character, then \(x_1 = \text{"i"}\), \(x_2 = \text{" "}\), \(x_3 = \text{"l"}\), \(x_4 = \text{"i"}\), \(x_5 = \text{"k"}\), all the way up to \(x_{12} = \text{"a"}\). Similarly, \(y_1 = \text{"m"}\), \(y_2 = \text{"e"}\), \(y_3 = \text{" "}\), \(y_4 = \text{"g"}\), all the way up to \(y_{20} = \text{"a"}\). Obviously, other input-output sentence pairs are possible, such as \((\text{"it is hot today"}, \text{"hoy hace calor"})\) and \((\text{"my dog is hungry"}, \text{"mi perro tiene hambre"})\).

It might be tempting to try to solve this problem using feedforward neural networks, but two problems become apparent upon investigation. The first issue is that the sizes of the input \(x\) and the output \(y\) are different for different input-output pairs. In the example above, the input-output pair \((\text{"it is hot today"}, \text{"hoy hace calor"})\) has an input of length \(15\) and an output of length \(14\), while the input-output pair \((\text{"my dog is hungry"}, \text{"mi perro tiene hambre"})\) has an input of length \(16\) and an output of length \(21\). Feedforward neural networks have fixed-size inputs and outputs, and thus cannot be automatically applied to temporal sequences of arbitrary length.

The second issue is a bit more subtle. One can imagine trying to circumvent the above issue by specifying a maximum input-output size and then padding inputs and outputs that are shorter than this maximum size with some special null character. Then, a feedforward neural network could be trained to produce \(y_i\) on input \(x_i\). Thus, in the example \((\text{"it is hot today"}, \text{"hoy hace calor"})\), the training pairs would be

\[\big\{(x_1 = \text{"i"}, y_1 = \text{"h"}), (x_2 = \text{"t"}, y_2 = \text{"o"}), \dots, (x_{14} = \text{"a"}, y_{14} = \text{"r"}), (x_{15} = \text{"y"}, y_{15} = \text{"*"})\big\},\]

where the maximum size is \(15\) and the padding character \(\text{"*"}\) is used to pad the output, which at length \(14\) is one short of the maximum length \(15\).
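
A minimal sketch of this padding scheme, using the maximum size of \(15\) from the example (the pad helper is hypothetical, written here only for illustration):

MAX_LEN = 15

# Pad a sequence to the fixed maximum length with the special "*" character
def pad(seq, max_len=MAX_LEN, pad_char="*"):
    return list(seq) + [pad_char] * (max_len - len(seq))

x = pad("it is hot today")   # already length 15, unchanged
y = pad("hoy hace calor")    # length 14, padded with one "*"
print(list(zip(x, y)))       # the per-position training pairs described above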

The problem with this approach is that there is no reason to believe that \(x_1\) has anything to do with \(y_1\). In many Spanish sentences, the order of the words (and thus characters) in the
English translation is different. Thus, if the first word in an English sentence is the last word
in the Spanish translation, it stands to reason that any network that hopes to perform the
translation will need to remember that first word (or some representation of it) until it outputs
the end of the Spanish sentence. Any neural network that computes sequences needs a
way to remember past inputs and computations, since they might be needed for computing
later parts of the sequence output. One might say that the neural network needs a way to
remember its context, i.e. the relation between its past and its present.

Recurrent Neural Networks


Both of the issues outlined in the above section can be solved by using recurrent neural networks. Recurrent neural networks, like feedforward neural networks, have hidden layers. However, unlike in feedforward neural networks, the hidden layers have connections back to themselves, allowing the states of the hidden layers at one time instant to be used as input to the hidden layers at the next time instant. This provides the aforementioned memory, which, if properly trained, allows hidden states to capture information about the temporal relation between input sequences and output sequences.
RNNs are called recurrent because they perform the same computation (determined by the
weights, biases, and activation functions) for every element in the input sequence. The
difference between the outputs for different elements of the input sequence comes from the
different hidden states, which are dependent on the current element in the input sequence
and the value of the hidden states at the last time step.

In simplest terms, the following equations define how an RNN evolves over time:

\[\begin{aligned} o^t &= f\big(h^t; \theta\big)\\ h^t &= g\big(h^{t-1}, x^t; \theta\big), \end{aligned}\]

where \(o^t\) is the output of the RNN at time \(t\), \(x^t\) is the input to the RNN at time \(t\), and \(h^t\) is the state of the hidden layer(s) at time \(t\). The image below outlines a simple graphical model to illustrate the relation between these three variables in an RNN's computation graph.

A graphical model for an RNN. The values \(\theta_i\), \(\theta_h\), and \(\theta_o\) represent the parameters associated with the inputs, previous hidden layer states, and outputs, respectively.
The first equation says that, given parameters \(\theta\) (which encapsulate the weights and biases for the network), the output at time \(t\) depends only on the state of the hidden layer at time \(t\), much like a feedforward neural network. The second equation says that, given the same parameters \(\theta\), the hidden layer at time \(t\) depends on the hidden layer at time \(t-1\) and the input at time \(t\). This second equation demonstrates that the RNN can remember its past by allowing past computations \(h^{t-1}\) to influence the present computations \(h^t\).
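
A minimal sketch of these two equations in code, assuming a vanilla RNN with a tanh hidden activation; the parameter names (W_h, W_x, W_o, b_h, b_o) are chosen here for illustration and together play the role of \(\theta\):

import numpy as np

def rnn_step(h_prev, x_t, W_h, W_x, W_o, b_h, b_o):
    # h^t = g(h^{t-1}, x^t; theta): mix the previous state with the new input
    h_t = np.tanh(W_h @ h_prev + W_x @ x_t + b_h)
    # o^t = f(h^t; theta): the output depends only on the current hidden state
    o_t = W_o @ h_t + b_o
    return h_t, o_t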

Thus, the goal of training the RNN is to get the sequence \(o^{t+\tau}\) to match the sequence \(y_t\), where \(\tau\) represents the time lag (it is possible that \(\tau = 0\)) between the first meaningful RNN output \(o^{\tau+1}\) and the first target output \(y_1\). A time lag is sometimes introduced to allow the RNN to reach an informative hidden state \(h^{\tau+1}\) before it starts producing elements of the output sequence. This is analogous to how humans translate English to Spanish, which often starts with reading the first few words in order to provide context for translating the rest of the sentence. A simple case in which the lag is actually required is when the last word in the input sequence corresponds to the first word in the output sequence; then it is necessary to delay the output sequence until the entire input sequence has been read.

Unrolling RNNs
RNNs can be difficult to understand because of the cyclic connections between layers. A common visualization method for RNNs is known as unrolling or unfolding. An RNN is unrolled by expanding its computation graph over time, effectively "removing" the cyclic connections. This is done by capturing the state of the entire RNN (called a slice) at each time instant \(t\) and treating it similarly to how layers are treated in feedforward neural networks. This turns the computation graph into a directed acyclic graph, with information flowing in one direction only. The catch is that, unlike a feedforward neural network, which has a fixed number of layers, an unfolded RNN has a size that depends on the sizes of its input and output sequences. This means that RNNs designed for very long sequences produce very long unrollings. The image below illustrates unrolling for the RNN model outlined in the image above at times \(t-1\), \(t\), and \(t+1\).

An unfolded RNN at time steps \(t-1\), \(t\), and \(t+1\).
One thing to keep in mind is that, unlike a feedforward neural network's layers, each of which has its own unique parameters (weights and biases), the slices in an unrolled RNN all have the same parameters \(\theta_i\), \(\theta_h\), and \(\theta_o\). This is because RNNs are recurrent, and thus the computation is the same for different elements of the input sequence. As mentioned earlier, the differences in the output sequence arise from the context preserved by the previous hidden layer state \(h^{t-1}\).

Furthermore, while each slice in the unrolling may appear to be similar to a layer in the computation graph of a feedforward neural network, in practice the variable \(h^t\) in an RNN can have many internal hidden layers. This allows the RNN to learn more hierarchical features, since one hidden layer's feature outputs can be another hidden layer's inputs. Thus, each variable \(h^t\) in the unrolling is more akin to the entirety of the hidden layers in a feedforward neural network. This allows RNNs to learn complex "static" relationships between the input and output sequences in addition to the temporal relationship captured by the cyclic connections.
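
Unrolling also has a direct computational reading: the forward pass is a loop that applies the same parameters at every slice. The sketch below assumes a vanilla tanh RNN with illustrative shapes (4 hidden units, 3-dimensional inputs, 2-dimensional outputs):

import numpy as np

# Forward pass over an unrolled RNN: one slice per input element, with the
# same parameters (W_h, W_x, W_o, b_h, b_o) shared by every slice
def run_rnn(xs, h0, W_h, W_x, W_o, b_h, b_o):
    h, outputs = h0, []
    for x_t in xs:
        h = np.tanh(W_h @ h + W_x @ x_t + b_h)   # only the hidden state differs per slice
        outputs.append(W_o @ h + b_o)
    return outputs

rng = np.random.default_rng(0)
W_h, W_x, W_o = rng.normal(size=(4, 4)), rng.normal(size=(4, 3)), rng.normal(size=(2, 4))
b_h, b_o = np.zeros(4), np.zeros(2)
xs = [rng.normal(size=3) for _ in range(5)]      # a length-5 input sequence
print(len(run_rnn(xs, np.zeros(4), W_h, W_x, W_o, b_h, b_o)))   # 5 outputs, one per slice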

Backpropagation through Time (BPTT)


Training recurrent neural networks is very similar to training feedforward neural networks. In fact, there is a variant of the backpropagation algorithm for feedforward neural networks that works for RNNs, called backpropagation through time (often abbreviated BPTT). As the name suggests, this is simply the backpropagation algorithm applied to the RNN backwards through time.

Backpropagation through time works by applying the backpropagation algorithm to the unrolled RNN. Since the unrolled RNN is akin to a feedforward neural network with all elements \(o^t\) as the output layer and all elements \(x^t\) from the input sequence \(x\) as the input layer, the entire input sequence \(x\) and output sequence \(o\) are needed at the time of training.

BPTT starts similarly to backpropagation, calculating the forward phase first to determine the values of \(o^t\), and then backpropagating (backwards in time) from \(o^t\) to \(o^1\) to determine the gradients of some error function with respect to the parameters \(\theta\). Since the parameters are replicated across slices in the unrolling, gradients are calculated for each parameter at each time slice \(t\). The final gradients output by BPTT are calculated by taking the average of the individual, slice-dependent gradients. This ensures that the effects of the gradient update on the outputs for each time slice are roughly balanced.
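
To make the two phases concrete, here is a toy BPTT sketch for a scalar RNN \(h_t = \tanh(w_h h_{t-1} + w_x x_t)\) with a squared-error loss; this tiny model and its variable names are assumptions for illustration, not the general formulation:

import math

def bptt(xs, ys, w_h, w_x):
    # Forward phase: record every hidden state, since the backward phase needs them all
    hs = [0.0]                       # h_0 = 0
    for x in xs:
        hs.append(math.tanh(w_h * hs[-1] + w_x * x))

    # Backward phase: walk from the last time slice back to the first
    grad_w_h, grad_w_x = 0.0, 0.0
    dh = 0.0                         # gradient flowing into h_t from future slices
    for t in reversed(range(len(xs))):
        h_t, h_prev = hs[t + 1], hs[t]
        dh += 2.0 * (h_t - ys[t])    # loss gradient at this slice plus future gradient
        dz = dh * (1.0 - h_t ** 2)   # backprop through tanh
        grad_w_h += dz * h_prev      # the same parameters appear in every slice,
        grad_w_x += dz * xs[t]       # so their gradients accumulate across slices
        dh = dz * w_h                # pass the gradient back to the previous state
    # Average the slice-dependent gradients, as described above
    n = len(xs)
    return grad_w_h / n, grad_w_x / n

print(bptt([0.5, -0.1, 0.3], [0.2, 0.0, 0.1], w_h=0.9, w_x=0.4))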

Vanishing/Exploding Gradients Problem


One issue with RNNs in general is known as the vanishing/exploding gradients problem: for long input-output sequences, RNNs have trouble modeling long-term dependencies, that is, relationships between elements of the sequence that are separated by large periods of time.

For example, in the sentence \(\text{"The quick brown fox jumped over the lazy dog"}\), the words \(\text{"fox"}\) and \(\text{"dog"}\) are separated by a large amount of space in the sequence. In the unrolling of an RNN for this sequence, this would be modeled by a large difference \(\Delta t\) between the time \(a\) of the input \(x_a\) at the start of the word \(\text{"fox"}\) and the time \(a + \Delta t\) of the input \(x_{a+\Delta t}\) at the end of the word \(\text{"dog"}\). Thus, if an RNN were attempting to learn how to identify subjects and objects in sentences, it would need to remember the word \(\text{"fox"}\) (or some hidden state representing it), the subject, up until it reads the word \(\text{"dog"}\), the object. Only then would the RNN be able to output the pair \((\text{"fox"}, \text{"dog"})\), having finally identified both a subject and an object.

This problem arises due to the use of the chain rule in the backpropagation algorithm. The actual proof is a bit messy, but the idea is that, because the unrolled RNN for long sequences is so deep and the chain rule for backpropagation involves products of partial derivatives, the gradient at early time slices is the product of many partial derivatives. In fact, the number of factors in the product for early slices is proportional to the length of the input-output sequence. This is a problem because, unless the partial derivatives are all close in value to \(1\), their product will either become very small, i.e. vanishing, when the partial derivatives are \(< 1\), or very large, i.e. exploding, when the partial derivatives are \(> 1\). This causes learning to become either very slow (in the vanishing case) or wildly unstable (in the exploding case).
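
A two-line illustration of how such products misbehave: with 100 chain-rule factors, per-step derivatives only slightly below or above \(1\) already yield a vanishing or exploding product.

print(0.9 ** 100)   # ~2.66e-05: the product vanishes
print(1.1 ** 100)   # ~13780.6: the product explodes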
Long Short-term Memory
Luckily, RNN variants such as LSTM (long short-term memory) networks are able to mitigate the vanishing/exploding gradients problem, so RNNs can be applied to much longer sequences, even ones containing millions of elements. In fact, LSTMs addressing the gradient problem have been largely responsible for the recent successes in very deep NLP applications such as speech recognition, language modeling, and machine translation.

LSTM RNNs work by allowing the input \(x_t\) at time \(t\) to influence the storing or overwriting of "memories" stored in something called the cell. This decision is made by two different functions: the input gate, for storing new memories, and the forget gate, for forgetting old memories. A final output gate determines when to output the value stored in the memory cell to the hidden layer. These gates are all controlled by the current values of the input \(x_t\) and the cell \(c_t\) at time \(t\), plus some gate-specific parameters. The image below illustrates the computation graph for the memory portion of an LSTM RNN (i.e. it does not include the hidden layer or output layer).

Computation graph for an LSTM RNN, with the cell denoted by \(c_t\). Note that, in this illustration, \(o_t\) is not the output of the RNN, but the output of the cell to the hidden layer \(h_t\).[1]
While the general RNN formulation can theoretically learn the same functions as an LSTM
RNN, by constraining the form that memories can take and how they are modified, LSTM
RNNs can learn long-term dependencies quickly and stably, and thus are much more useful
in practice.
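
A scalar sketch of the gate arithmetic described above is given below; real LSTM layers use vectors and weight matrices, and the parameter names in the dictionary p are assumptions chosen for illustration:

import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# One LSTM memory-cell update (scalar toy version)
def lstm_cell_step(x_t, c_prev, p):
    i = sigmoid(p["w_ix"] * x_t + p["w_ic"] * c_prev + p["b_i"])   # input gate
    f = sigmoid(p["w_fx"] * x_t + p["w_fc"] * c_prev + p["b_f"])   # forget gate
    o = sigmoid(p["w_ox"] * x_t + p["w_oc"] * c_prev + p["b_o"])   # output gate
    candidate = math.tanh(p["w_cx"] * x_t + p["b_c"])              # proposed new memory
    c_t = f * c_prev + i * candidate     # forget old memory and/or store the new one
    cell_out = o * math.tanh(c_t)        # what the cell exposes to the hidden layer
    return c_t, cell_out

p = {"w_ix": 0.5, "w_ic": 0.1, "b_i": 0.0,
     "w_fx": 0.5, "w_fc": 0.1, "b_f": 0.0,
     "w_ox": 0.5, "w_oc": 0.1, "b_o": 0.0,
     "w_cx": 1.0, "b_c": 0.0}
print(lstm_cell_step(x_t=1.0, c_prev=0.0, p=p))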

References
1. Long_Short_Term_Memory. Retrieved October 4, 2015, from https://commons.wikimedia.org/wiki/File:Long_Short_Term_Memory.png
Cite as: Recurrent Neural Network. Brilliant.org. Retrieved 09:37, April 26,
2021, from https://brilliant.org/wiki/recurrent-neural-network/

Convolutional Neural Network



Convolutional neural networks (convnets, CNNs) are a powerful type of neural network used primarily for image classification. CNNs were originally developed by Yann LeCun, one of the pioneers of machine learning. Their location invariance makes them ideal for detecting objects in various positions in images. Google, Facebook, Snapchat, and other companies that deal with images all use convolutional neural networks.

Convnets consist primarily of three different types of layers: convolutional layers, pooling layers, and one fully connected layer. In the convolutional layers, a matrix known as a kernel is passed over the input matrix to create a feature map for the next layer. The dimensions of the kernel can be adjusted to produce a different feature map, or to expand the data along one dimension while reducing its size along the other axes. Each value on the feature map is computed by taking the sum of an element-wise multiplication of the kernel and an appropriately sized section of the input matrix; equivalently, this is a dot product of the flattened kernel and the flattened input patch. Another technique to improve CNNs is to use multiple kernels in a given convolutional layer and concatenate the results to create the feature map. The fact that one kernel is used for the entire image makes convolutional neural networks very location-invariant and helps prevent them from overfitting. Here is an example of a convolution:

You can see how the filter maps a set of points from the input matrix to a single node in the next layer. Here is a lower-level diagram of a convolution:

This convolution takes the sum of the element-wise product of the filter and a chunk of the input. The filter used in the diagram could be used for sharpening an image because it boosts the value of pixels that differ from their neighbors. When training a CNN, the network may learn filters like this one to extract meaningful information from images.

Next, a pooling layer is applied to the feature map produced by the convolution. Max pooling, the most common type of pooling, simply means taking the maximum value from a given array of numbers. In this case, we split the feature map into a set of \(n \times n\) boxes and choose only the maximum value from each box. Here is what that looks like:
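
A hedged sketch of both operations is below, using a small sharpening-style kernel like the one described; the 6-by-6 example image is arbitrary, and real implementations add padding, strides, and vectorization:

import numpy as np

# 2D convolution (no padding, stride 1): each feature-map value is the sum of
# an element-wise product between the kernel and a patch of the input
def conv2d(image, kernel):
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    feature_map = np.zeros((out_h, out_w))
    for r in range(out_h):
        for c in range(out_w):
            patch = image[r:r + kh, c:c + kw]
            feature_map[r, c] = np.sum(patch * kernel)
    return feature_map

# Max pooling: take the maximum from each n-by-n box of the feature map
def max_pool(feature_map, n):
    h, w = feature_map.shape
    pooled = np.zeros((h // n, w // n))
    for r in range(0, h - h % n, n):
        for c in range(0, w - w % n, n):
            pooled[r // n, c // n] = feature_map[r:r + n, c:c + n].max()
    return pooled

# A sharpening-style kernel, as in the diagram described above
sharpen = np.array([[ 0, -1,  0],
                    [-1,  5, -1],
                    [ 0, -1,  0]])
image = np.arange(36, dtype=float).reshape(6, 6)
print(max_pool(conv2d(image, sharpen), 2))   # 4x4 feature map pooled to 2x2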

The final layer of a convolutional neural network is called the fully connected layer. This is a standard neural network layer in which some nonlinearity (ReLU, tanh, sigmoid, etc.) is applied to the dot product of an input and a matrix of weights. Then a softmax function can convert the output into a list of probabilities for classification.
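
For instance, a minimal fully connected layer followed by softmax might look like the sketch below; the weights, bias, and input shapes are made up for illustration:

import numpy as np

def dense_softmax(x, W, b):
    z = W @ x + b             # dot product of the weights and the input, plus bias
    e = np.exp(z - z.max())   # subtract the max for numerical stability
    return e / e.sum()        # probabilities that sum to 1

x = np.array([0.2, -1.0, 0.5])                       # flattened features
W = np.array([[0.1, 0.4, -0.2], [0.3, -0.1, 0.2]])   # 2 classes, 3 features
b = np.zeros(2)
print(dense_softmax(x, W, b))                        # two class probabilities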

Convolutional neural networks usually have far more than just three layers. Convolution and max-pooling layers can be stacked on top of each other many times, often with better results. Here is an image of a very deep convolutional neural network with many layers:

Convolutional neural networks are most commonly used for image classification. Another, less common use for CNNs is text classification: a list of Word2Vec or GloVe embeddings may be used as the input to a CNN, which can then be trained to recognize sentiment or some other classification.

Cite as: Convolutional Neural Network. Brilliant.org. Retrieved 09:37, April 26,
2021, from https://brilliant.org/wiki/convolutional-neural-network/
