MLP and Backpropagation
We will introduce the MLP and the backpropagation
algorithm which is used to train it
The term MLP is used to describe any general feedforward network (one with no
recurrent connections)
However, we will concentrate on networks with units
arranged in layers
[Figure: layered feedforward network with inputs x1, …, xn]
Different books refer to the above as either a 4-layer network (counting layers
of neurons) or a 3-layer network (counting layers of adaptive weights). We will
follow the latter convention.
1st question:
what do the extra layers gain you? Start by looking at
what a single layer can’t do
Perceptron Learning Theorem
• Recap: A perceptron (threshold unit) can
learn anything that it can represent (i.e.
anything separable with a hyperplane)
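As a reminder of what a single threshold unit computes, here is a minimal sketch in Python (NumPy assumed; the AND weights below are an illustrative choice, not taken from the slides): the unit fires when w·x + b crosses zero, so its decision boundary is a hyperplane.

import numpy as np

def perceptron(x, w, b):
    # Threshold unit: output 1 if the weighted sum crosses 0, else 0.
    # Points with np.dot(w, x) + b == 0 form a hyperplane, so the unit
    # can only separate classes that some hyperplane can separate.
    return 1 if np.dot(w, x) + b > 0 else 0

# AND of two binary inputs is linearly separable, so a perceptron can represent it.
w, b = np.array([1.0, 1.0]), -1.5
print([perceptron(np.array(x), w, b) for x in [(0, 0), (0, 1), (1, 0), (1, 1)]])
# -> [0, 0, 0, 1]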
The Exclusive OR problem
A perceptron cannot represent Exclusive OR,
since XOR is not linearly separable.
Minsky & Papert (1969) offered solution to XOR problem by
combining perceptron unit responses using a second layer of
Units. Piecewise linear classification using an MLP with
threshold (perceptron) units
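A hedged sketch of that construction in Python (NumPy assumed; the weight and threshold values are one illustrative choice, not the slides' values): two first-layer threshold units compute x1 OR x2 and NOT (x1 AND x2), and a second-layer threshold unit ANDs their outputs, which gives XOR.

import numpy as np

def step(z):
    # Threshold (perceptron) activation: 1 if z > 0, else 0.
    return int(z > 0)

def xor_mlp(x1, x2):
    x = np.array([x1, x2])
    # First layer: each unit draws one linear boundary.
    h1 = step(np.dot([1.0, 1.0], x) - 0.5)    # x1 OR x2
    h2 = step(np.dot([-1.0, -1.0], x) + 1.5)  # NOT (x1 AND x2)
    # Second layer: AND of the two half-planes -> piecewise-linear region.
    return step(h1 + h2 - 1.5)

for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(a, b, xor_mlp(a, b))   # last column prints 0, 1, 1, 0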
Three-layer networks
[Figure: three-layer network with inputs x1, x2, …, xn, hidden layers, and an output layer]
Properties of architecture
• No connections within a layer
• No direct connections between input and output layers
• Fully connected between layers
• Often more than 3 layers
• Number of output units need not equal number of input units
• Number of hidden units per layer can be more or less than
input or output units
Each unit is a perceptron:
y_j = f\left( \sum_{i=1}^{m} w_{ij} x_i + b_j \right)
The bias is often included as an extra weight connected to a constant +1 input
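A minimal sketch of one layer's forward pass in Python (NumPy assumed; the sizes and values are illustrative):

import numpy as np

def layer_forward(x, W, b, f):
    # Implements y_j = f( sum_i w_ij * x_i + b_j ) for every unit j in the layer.
    return f(W @ x + b)

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
x = np.array([0.5, -1.0, 2.0])        # m = 3 inputs
W = np.array([[0.2, -0.4, 0.1],       # one row of weights per unit
              [0.7, 0.3, -0.6]])
b = np.zeros(2)                       # one bias per unit
print(layer_forward(x, W, b, sigmoid))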
What does each layer do?
The 1st layer draws linear boundaries, the 2nd layer combines those
boundaries, and the 3rd layer can generate arbitrarily complex boundaries.
Backpropagation
note: in this example, the activation function is omitted to ease the calculation
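Since the slide's worked example did not survive extraction, here is a hedged sketch of the same idea in Python (the two-weight network and the numbers are illustrative assumptions, not the slide's values): with the activation function omitted the units are linear, so the chain rule gives the gradients directly.

# Tiny linear network: h = w1*x, y = w2*h, squared error E = 0.5*(y - t)^2.
x, t = 1.5, 1.0            # illustrative input and target
w1, w2 = 0.8, -0.5         # illustrative initial weights
lr = 0.1                   # learning rate

h = w1 * x                 # forward pass
y = w2 * h                 # output (no activation function)
dE_dy = y - t              # backward pass starts at the error
dE_dw2 = dE_dy * h         # chain rule: dE/dw2 = dE/dy * dy/dw2
dE_dw1 = dE_dy * w2 * x    # chain rule: dE/dw1 = dE/dy * dy/dh * dh/dw1
w1 -= lr * dE_dw1          # gradient-descent weight updates
w2 -= lr * dE_dw2
print(w1, w2)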
• Backpropagation, short for “backward propagation of errors”, is a
mechanism used to update the weights using gradient descent. It
calculates the gradient of the error function with respect to the neural
network’s weights. The calculation proceeds backwards through the
network.
• Gradient descent is an iterative optimization algorithm for finding the
minimum of a function; in our case we want to minimize the error
function. To find a local minimum of a function using gradient descent,
one takes steps proportional to the negative of the gradient of the
function at the current point.
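As a small illustration of that update rule (the function and step size below are arbitrary choices for the sketch, not tied to any network):

# Gradient descent on E(w) = (w - 3)^2, whose gradient is dE/dw = 2*(w - 3).
w = 0.0                  # arbitrary starting point
lr = 0.1                 # learning rate (step size)
for _ in range(50):
    grad = 2 * (w - 3)   # gradient at the current point
    w -= lr * grad       # step proportional to the negative gradient
print(w)                 # converges towards the minimum at w = 3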
Activation Function
Left: The sigmoid non-linearity squashes real numbers to the range [0, 1].
Right: The tanh non-linearity squashes real numbers to the range [-1, 1].
Left: The Rectified Linear Unit (ReLU) activation function, which is zero when x < 0 and linear with slope 1
when x > 0.
Right: A plot from the Krizhevsky et al. paper indicating the 6x improvement in convergence with the ReLU
unit compared to the tanh unit.
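A minimal sketch of those three activation functions in Python (NumPy assumed):

import numpy as np

def sigmoid(z):
    # Squashes real numbers to values between 0 and 1.
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    # Squashes real numbers to values between -1 and 1.
    return np.tanh(z)

def relu(z):
    # Zero for z < 0, linear with slope 1 for z > 0.
    return np.maximum(0.0, z)

z = np.array([-2.0, 0.0, 2.0])
print(sigmoid(z), tanh(z), relu(z))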