
Neural Networks

 A neural network is a machine that is designed to model the way in which the brain performs a particular task or function of interest.
 To achieve good performance, neural
networks employ a massive
interconnection of simple computing cells
referred to as "neurons" or "processing
units."
 The brain is a highly complex, nonlinear, and parallel computer. It has the capability to organize its neurons so as to perform certain computations (e.g., pattern recognition, perception, and motor control) many times faster than the fastest digital computer in existence today.
[Figure: anatomy of a biological neuron, showing dendrites, soma (cell body), axon, and synapses. Image courtesy: SimpliLearn]
Biological Neuron        Artificial Neuron
Soma / cell nucleus      Node / neuron
Dendrites                Input
Axon                     Output
Synapse                  Weight or interconnections


 A neuron is a mathematical function modeled on the working of biological neurons.
 It is an elementary unit in an artificial neural network.
 One or more inputs are separately weighted.
 The weighted inputs are summed and passed through a nonlinear function to produce the output.
 Every neuron holds an internal state called the activation signal.
 Each connection link carries information about the input signal.
 Every neuron is connected to other neurons via connection links.
 If we have two groups of objects, one group of several written A's and the other of B's, we may want our neuron to tell the A's from the B's, as in the figure.
 We want it to output a 1 when an A is presented and a 0 when it sees a B.
Threshold function (also called a step function or hard limiter)

Single Layer Perceptron


 The simplest kind of neural network is a single-layer
perceptron network, which consists of a single layer of
output nodes; the inputs are fed directly to the
outputs via a series of weights. The sum of the
products of the weights and the inputs is calculated in
each node, and if the value is above some threshold
the neuron fires and takes the activated value;
otherwise it takes the deactivated value.
 Neurons with this kind of activation function are also
called artificial neurons or linear threshold logic units.
 In the literature the term perceptron often refers to
networks consisting of just one of these units.
 Perceptron learning is an algorithm for learning a binary classifier called a threshold function: a function that maps its input x (a real-valued vector) to an output value f(x) (a single binary value):

f(x) = 1 if w . x + b > 0, and 0 otherwise

 where w is a vector of real-valued weights, w . x is the dot product sum_{i=1..m} w_i x_i, where m is the number of inputs to the perceptron, and b is the bias.
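As an illustration, here is a minimal Python sketch of this threshold function (the names predict, w, and b are ours, chosen for the example):

    import numpy as np

    def predict(x, w, b):
        # f(x) = 1 if w . x + b > 0, else 0
        return 1 if np.dot(w, x) + b > 0 else 0

    # Example: weights and bias realizing logical AND on two inputs
    w = np.array([1.0, 1.0])
    b = -1.5
    print(predict(np.array([1, 1]), w, b))  # 1
    print(predict(np.array([0, 1]), w, b))  # 0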
 Perceptrons can be trained by a simple
learning algorithm that is usually called
the delta rule. It calculates the errors between
calculated output and sample output data,
and uses this to create an adjustment to the
weights, thus implementing a form
of gradient descent.
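A minimal sketch of such a training loop (all names here are illustrative, and the learning rate eta is an assumed parameter). Strictly speaking, the update below applies the error-times-input adjustment to the thresholded output, as in the classic perceptron rule; the delta rule proper applies the same adjustment to the unthresholded linear output:

    import numpy as np

    def train_perceptron(X, y, eta=0.1, epochs=50):
        w = np.zeros(X.shape[1])
        b = 0.0
        for _ in range(epochs):
            for x, target in zip(X, y):
                output = 1 if np.dot(w, x) + b > 0 else 0
                error = target - output     # error between desired and actual output
                w += eta * error * x        # adjust weights in proportion to the error
                b += eta * error
        return w, b

    # Learn logical AND, a linearly separable pattern
    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
    y = np.array([0, 0, 0, 1])
    w, b = train_perceptron(X, y)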
 Single-layer perceptrons are only capable of learning linearly separable patterns.
 It is impossible for a single-layer perceptron network to learn an XOR function.
 A single-layer neural network can compute a continuous output instead of a step function. A common choice is the so-called logistic function:

f(x) = 1 / (1 + e^-x)

 The logistic function is one of the family of functions called sigmoid functions. It has a continuous derivative, which allows it to be used in backpropagation. This function is also preferred because its derivative is easily calculated:

f'(x) = f(x)(1 - f(x))
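A short Python sketch of the logistic function and its derivative identity:

    import numpy as np

    def logistic(x):
        return 1.0 / (1.0 + np.exp(-x))

    def logistic_derivative(x):
        f = logistic(x)
        return f * (1.0 - f)             # f'(x) = f(x)(1 - f(x))

    print(logistic(0.0))                 # 0.5
    print(logistic_derivative(0.0))      # 0.25, the maximum of the derivative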
 An MLP is a class of feedforward (acyclic) artificial neural network (ANN).
 Each neuron in one layer has directed connections to the neurons of the subsequent layer. In many applications the units of these networks apply a sigmoid function as an activation function.
 MLP models are the most basic deep neural networks, composed of a series of fully connected layers.
 Each new layer is a set of nonlinear functions of a weighted sum of all outputs (fully connected) from the prior one.
 Multilayer feed-forward networks, given enough hidden units and enough training samples, can closely approximate any function.
 MLP with one hidden layer
[Figure: a multilayer perceptron with an input layer (x1, x2, x3), one hidden layer, and an output layer of processing elements (PEs); each PE computes a weighted sum (S) followed by a transfer function (f) to produce the output Y1.]
(a) Single neuron: Y = X1W1 + X2W2

(b) Multiple neurons:
Y1 = X1W11 + X2W21
Y2 = X1W12 + X2W22
Y3 = X2W23

PE: processing element (or neuron)
Example: a processing element (PE) with inputs X1 = 3, X2 = 1, X3 = 2 and weights W1 = 0.2, W2 = 0.4, W3 = 0.1:
Summation function: Y = 3(0.2) + 1(0.4) + 2(0.1) = 1.2
Transfer function: YT = 1/(1 + e^-1.2) = 0.77
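The same arithmetic in a few lines of Python, reproducing the numbers above:

    import math

    X = [3, 1, 2]          # inputs X1, X2, X3
    W = [0.2, 0.4, 0.1]    # weights W1, W2, W3

    # Summation function: Y = 3(0.2) + 1(0.4) + 2(0.1) = 1.2
    Y = sum(x * w for x, w in zip(X, W))

    # Transfer function: YT = 1/(1 + e^-1.2) = 0.77
    YT = 1.0 / (1.0 + math.exp(-Y))
    print(Y, round(YT, 2))  # 1.2 0.77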
 Before training can begin, the user must decide on the
network topology by specifying:
 the number of units in the input layer,
 the number of hidden layers (if more than one) and the number of units in each hidden layer, and
 the number of units in the output layer.
 Normalizing the input values (between 0.0 and 1.0) for
each attribute measured in the training tuples will
help speed up the learning phase and prevent the
exploding gradient problem.
 Discrete-valued attributes may be encoded such that
there is one input unit per domain value.
 Choice of the transfer function:
 Linear function
 Sigmoid (logistic activation) function, range [0, 1]
 Hyperbolic tangent function, range [-1, 1]
 Neural networks can be used for both classification (to predict the class label of a given tuple) and numeric prediction (to predict a continuous-valued output).
 For classification, one output unit may be used
to represent two classes (where the value 1
represents one class, and the value 0 represents
the other).
 If there are more than two classes, then one
output unit per class is used.
 There are no clear rules as to the “best” number
of hidden layer units.
 Network design is a trial-and-error process and may affect the accuracy of the resulting trained network.
 The initial values of the weights may also affect
the resulting accuracy.
 If a trained network's accuracy is not considered acceptable, it is common to repeat the training process with
 a different network topology or
 a different set of initial weights.
 Backpropagation adjusts the weights of the network in order to minimize the average squared error.
 The learning algorithm procedure (a runnable sketch follows the list):
1. Initialize weights with random values and set other network parameters.
2. Read in the inputs and the desired outputs.
3. Compute the actual output (by working forward through the layers).
4. Compute the error (the difference between the actual and desired output).
5. Change the weights by working backward through the hidden layers.
6. Repeat steps 2-5 until the weights stabilize.
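As a runnable sketch of this procedure, the loop below trains a tiny 2-2-1 network on XOR; the layer sizes, learning rate, and epoch count are illustrative choices, not prescribed by the slides:

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    rng = np.random.default_rng(0)

    # Step 1: initialize weights and biases with small random values
    W1, b1 = rng.uniform(-0.5, 0.5, (2, 2)), rng.uniform(-0.5, 0.5, 2)
    W2, b2 = rng.uniform(-0.5, 0.5, (2, 1)), rng.uniform(-0.5, 0.5, 1)

    # Step 2: read in the inputs and the desired outputs (XOR)
    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
    T = np.array([[0], [1], [1], [0]], dtype=float)

    eta = 0.5
    for epoch in range(10000):                    # step 6: repeat steps 2-5
        H = sigmoid(X @ W1 + b1)                  # step 3: forward through the layers
        O = sigmoid(H @ W2 + b2)
        err_out = (T - O) * O * (1 - O)           # step 4: error at the output
        err_hid = (err_out @ W2.T) * H * (1 - H)  # step 5: work backward through the hidden layer
        W2 += eta * H.T @ err_out
        b2 += eta * err_out.sum(axis=0)
        W1 += eta * X.T @ err_hid
        b1 += eta * err_hid.sum(axis=0)

    print(O.round(2))  # typically approaches the XOR targets 0, 1, 1, 0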
 Backpropagation learns by iteratively processing
a data set of training tuples, comparing the
network’s prediction for each tuple with the
actual known target value.
 The target value may be the known class label of
the training tuple (for classification problems) or
a continuous value (for numeric prediction).
 For each training tuple, the weights are modified so as to minimize the mean squared error between the network's prediction and the actual target value.
 These modifications are made in the
“backwards” direction (i.e., from the output
layer) through each hidden layer down to the
first hidden layer (hence the name
backpropagation).
 Although it is not guaranteed, in general the
weights will eventually converge, and the
learning process stops.
 Architecture of a neural network is driven by the
task it is intended to address
 Classification, regression, clustering, general
optimization, association, ….
 Most popular architecture: Feedforward multi-
layered perceptron with backpropagation learning
algorithm
 Used for both classification and regression type
problems
 Others – Recurrent, self-organizing feature maps,
Hopfield networks, …
 Multi-layer networks use a variety of learning techniques, the most
popular being back-propagation.
 The output values are compared with the correct answer to compute the
value of some predefined error-function. By various techniques, the error
is then fed back through the network.
 The algorithm adjusts the weights of each connection in order to reduce
the value of the error function by some small amount.
 After repeating this process for a sufficiently large number of training
cycles, the network will usually converge to some state where the error
of the calculations is small.
 In this case, one would say that the network has learned a certain target
function. To adjust weights properly, one applies a general method for
non-linear optimization that is called gradient descent. For this, the
network calculates the derivative of the error function with respect to the
network weights, and changes the weights such that the error decreases
(thus going downhill on the surface of the error function).
 For this reason, back-propagation can only be applied on networks with
differentiable activation functions.
 The weights in the network are initialized to small random numbers (e.g., ranging from -1.0 to 1.0, or -0.5 to 0.5).
 Each unit has a bias associated with it, as
explained later.
 The biases are similarly initialized to small
random numbers.
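For instance, a sketch of this initialization for a layer with 3 inputs and 2 units (the shapes are illustrative):

    import numpy as np

    rng = np.random.default_rng()
    W = rng.uniform(-0.5, 0.5, size=(3, 2))   # small random weights
    theta = rng.uniform(-0.5, 0.5, size=2)    # small random biases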
 Each training tuple, X, is processed by the
following steps.
 First, the training tuple is fed to the network’s
input layer.
 The inputs pass through the input units,
unchanged.
 That is, for an input unit, j, its output, Oj, is equal
to its input value, Ij.
 Next, the net input and output of each unit in the
hidden and output layers are computed.
 The net input to a unit in the hidden or output
layers is computed as a linear combination of its
inputs.
 Propagate the inputs forward:
 Each hidden layer or output layer unit has a number
of inputs to it that are, in fact, the outputs of the
units connected to it in the previous layer.
 To compute the net input to the unit, each input
connected to the unit is multiplied by its
corresponding weight, and this is summed.
 Given a unit j in a hidden or output layer, the net input, Ij, to unit j is

Ij = sum_i (wij Oi) + θj

 where wij is the weight of the connection from unit i in the previous layer to unit j; Oi is the output of unit i from the previous layer; and θj is the bias of unit j.
 The bias acts as a threshold in that it serves to vary the activity of the unit.
 Each unit in the hidden and output layers takes its net
input and then applies an activation function to it.
 The function symbolizes the activation of the neuron
represented by the unit.
 The logistic, or sigmoid, function is used. Given the net input Ij to unit j, then Oj, the output of unit j, is computed as

Oj = 1 / (1 + e^-Ij)

 The logistic function is nonlinear and differentiable, allowing the backpropagation algorithm to model classification problems that are linearly inseparable.
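Both steps for a single unit, sketched in Python (the helper name unit_output is ours); the sample values reproduce unit 4 of the worked example later in this section:

    import math

    def unit_output(inputs, weights, bias):
        # Net input: Ij = sum_i (wij * Oi) + theta_j
        I_j = sum(w * o for w, o in zip(weights, inputs)) + bias
        # Activation: Oj = 1 / (1 + e^-Ij)
        return 1.0 / (1.0 + math.exp(-I_j))

    print(round(unit_output([1, 0, 1], [0.2, 0.4, -0.5], -0.4), 3))  # 0.332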
 We compute the output values, Oj, for each
hidden layer, up to and including the output
layer, which gives the network’s prediction.
 In practice, it is a good idea to cache (i.e.,
save) the intermediate output values at each
unit as they are required again later when
back propagating the error.
 This trick can substantially reduce the
amount of computation required.
 The error is propagated backward by updating the weights and biases to reflect the error of the network's prediction. For a unit j in the output layer, the error Errj is computed by

Errj = Oj (1 - Oj)(Tj - Oj)

 where Oj is the actual output of unit j, and Tj is the known target value of the given training tuple.
 Note that Oj(1 - Oj) is the derivative of the logistic function.
 To compute the error of a hidden layer unit j, the weighted sum of the errors of the units connected to unit j in the next layer is considered.
 The error of a hidden layer unit j is

Errj = Oj (1 - Oj) sum_k (Errk wjk)

 where wjk is the weight of the connection from unit j to a unit k in the next higher layer, and Errk is the error of unit k.
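Both error formulas as small Python helpers (names ours):

    def output_error(O_j, T_j):
        # Errj = Oj (1 - Oj)(Tj - Oj)
        return O_j * (1 - O_j) * (T_j - O_j)

    def hidden_error(O_j, next_errors, next_weights):
        # Errj = Oj (1 - Oj) * sum_k (Errk * wjk)
        return O_j * (1 - O_j) * sum(e * w for e, w in zip(next_errors, next_weights))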
 The weights and biases are updated to reflect the propagated errors.
 Weights are updated by the following equations, where delta(wij) is the change in weight wij:

delta(wij) = (l) Errj Oi
wij = wij + delta(wij)
 The variable l is the learning rate, a constant typically having a value between 0.0 and 1.0.
 The learning rate helps avoid getting stuck at a local minimum in decision space. If the learning rate is too small, learning will occur at a very slow pace; if it is too large, oscillation between inadequate solutions may occur.
 Biases are updated by the following equations, where delta(θj) is the change in bias θj:

delta(θj) = (l) Errj
θj = θj + delta(θj)
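The update equations in code, continuing the sketch above (l is the learning rate):

    def update_weight(w_ij, l, err_j, O_i):
        # delta(wij) = l * Errj * Oi ;  wij = wij + delta(wij)
        return w_ij + l * err_j * O_i

    def update_bias(theta_j, l, err_j):
        # delta(theta_j) = l * Errj ;  theta_j = theta_j + delta(theta_j)
        return theta_j + l * err_j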

 Updating the weights and biases after the presentation of each tuple is referred to as case updating.
 Alternatively, the weight and bias increments could be accumulated in variables, so that the weights and biases are updated after all the tuples in the training set have been presented (called epoch updating).
 Batch/mini-batch updating: weights and biases are updated after several samples.
 One iteration through the training set is an epoch.
 Training stops when:
 All delta(wij) in the previous epoch are so small as
to be below some specified threshold, or
 The percentage of tuples misclassified in the
previous epoch is below some threshold, or
 A pre-specified number of epochs has expired.
 In practice, several hundreds of thousands of
epochs may be required before the weights
will converge.
Example: a network with 3 input units, 2 hidden units, and 1 output unit.
Initialize the weights with random numbers from -1.0 to 1.0.

Initial input and weights:

x1   x2   x3   w14   w15   w24   w25   w34   w35   w46   w56
1    0    1    0.2   -0.3  0.4   0.1   -0.5  0.2   -0.3  -0.2

A bias is added to the hidden and output nodes and initialized to random values from -1.0 to 1.0:

θ4     θ5    θ6
-0.4   0.2   0.1


Unit j   Net input Ij                                   Output Oj
4        0.2 + 0 - 0.5 - 0.4 = -0.7                     1/(1 + e^0.7) = 0.332
5        -0.3 + 0 + 0.2 + 0.2 = 0.1                     1/(1 + e^-0.1) = 0.525
6        (-0.3)(0.332) + (-0.2)(0.525) + 0.1 = -0.105   1/(1 + e^0.105) = 0.474
Unit j   Err j
6        0.474 x (1 - 0.474) x (1 - 0.474) = 0.1311   (assuming target T6 = 1)
5        0.525 x (1 - 0.525) x 0.1311 x (-0.2) = -0.0065
4        0.332 x (1 - 0.332) x 0.1311 x (-0.3) = -0.0087
Learning rate l = 0.9

Weight/bias   New value
w46           -0.3 + 0.9 x 0.1311 x 0.332 = -0.261
w56           -0.2 + 0.9 x 0.1311 x 0.525 = -0.138
w14           0.2 + 0.9 x (-0.0087) x 1 = 0.192
w15           -0.3 + 0.9 x (-0.0065) x 1 = -0.306
...           similarly for the remaining weights
θ6            0.1 + 0.9 x 0.1311 = 0.218
...           similarly for the remaining biases
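The whole worked example can be checked with a few lines of Python; running this reproduces the tables above:

    import math

    def sigmoid(x):
        return 1.0 / (1.0 + math.exp(-x))

    # Inputs, weights, and biases from the tables above
    x1, x2, x3 = 1, 0, 1
    w14, w15, w24, w25 = 0.2, -0.3, 0.4, 0.1
    w34, w35, w46, w56 = -0.5, 0.2, -0.3, -0.2
    t4, t5, t6 = -0.4, 0.2, 0.1
    T6, l = 1, 0.9                                # target and learning rate

    # Forward pass
    O4 = sigmoid(x1*w14 + x2*w24 + x3*w34 + t4)   # I4 = -0.7,   O4 = 0.332
    O5 = sigmoid(x1*w15 + x2*w25 + x3*w35 + t5)   # I5 =  0.1,   O5 = 0.525
    O6 = sigmoid(O4*w46 + O5*w56 + t6)            # I6 = -0.105, O6 = 0.474

    # Backpropagation of the error
    Err6 = O6*(1 - O6)*(T6 - O6)                  #  0.1311
    Err5 = O5*(1 - O5)*Err6*w56                   # -0.0065
    Err4 = O4*(1 - O4)*Err6*w46                   # -0.0087

    # Weight and bias updates
    w46 += l*Err6*O4                              # -0.261
    w56 += l*Err6*O5                              # -0.138
    w14 += l*Err4*x1                              #  0.192
    w15 += l*Err5*x1                              # -0.306
    t6  += l*Err6                                 #  0.218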
 Rule of thumb: the number of training samples should be at least 5 to 10 times the number of weights in the network.
 Otherwise, the network is prone to overfitting.
 A common criticism for ANN: The lack of
transparency/explainability
 Answer: sensitivity analysis
 Conducted on a trained ANN
 The inputs are perturbed while the relative
change on the output is measured/recorded
 Results illustrate the relative importance of input
variables
 In machine learning, the vanishing gradient problem is encountered when training artificial neural networks with gradient-based learning methods and backpropagation. In such methods, during each iteration of training, each of the neural network's weights receives an update proportional to the partial derivative of the error function with respect to the current weight. The problem is that in some cases the gradient will be vanishingly small, effectively preventing the weight from changing its value. In the worst case, this may completely stop the neural network from further training. As one example of the cause of the problem, traditional activation functions such as the hyperbolic tangent have derivatives in the range (0, 1], and backpropagation computes gradients by the chain rule. This has the effect of multiplying n of these small numbers to compute gradients of the early layers in an n-layer network, meaning that the gradient (error signal) decreases exponentially with n, so the early layers train very slowly.
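A quick numerical illustration (using the logistic derivative, which is bounded by 0.25, rather than tanh): multiplying one such factor per layer makes the early-layer gradient shrink exponentially with depth:

    import numpy as np

    def sigmoid_grad(x):
        s = 1.0 / (1.0 + np.exp(-x))
        return s * (1.0 - s)              # at most 0.25 (attained at x = 0)

    grad = 1.0
    for n in range(1, 31):
        grad *= sigmoid_grad(0.0)         # best case: the largest possible factor
        if n in (5, 10, 20, 30):
            print(n, grad)
    # 5 ~9.8e-4, 10 ~9.5e-7, 20 ~9.1e-13, 30 ~8.7e-19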
