[Figure: a biological neuron, showing dendrites, soma, and axon. Image courtesy: SimpliLearn]

Biological neuron → Artificial neuron
Dendrites → Input
Axon → Output
[Figure: (a) a single neuron, a processing element (PE), applies a weighted sum function (S) and a transfer function (f) to inputs x1, x2, x3 to produce output Y1; (b) multiple neurons arranged in an input layer, a hidden layer, and an output layer.]
(a) Single neuron: Y = X1·W1 + X2·W2
(b) Multiple neurons (PE: processing element, or neuron):
Y1 = X1·W11 + X2·W21
Y2 = X1·W12 + X2·W22
Y3 = X2·W23

Example (inputs X1 = 3, X2 = 1, X3 = 2 with weights W1 = 0.2, W2 = 0.4, W3 = 0.1):
Summation function: Y = 3(0.2) + 1(0.4) + 2(0.1) = 1.2
Transfer function: YT = 1/(1 + e^(-1.2)) = 0.77
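The worked example above can be checked with a few lines of code (a minimal sketch; the variable names mirror the slide's notation):

```python
import math

def neuron_output(inputs, weights):
    """Weighted sum (summation function) followed by a sigmoid transfer function."""
    s = sum(x * w for x, w in zip(inputs, weights))  # summation function S
    return s, 1.0 / (1.0 + math.exp(-s))             # transfer function f

Y, YT = neuron_output([3, 1, 2], [0.2, 0.4, 0.1])
print(round(Y, 2), round(YT, 2))  # 1.2 0.77
```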
Before training can begin, the user must decide on the
network topology by specifying:
the number of units in the input layer,
the number of hidden layers (if more than one), the
number of units in each hidden layer, and
the number of units in the output layer.
Normalizing the input values (between 0.0 and 1.0) for
each attribute measured in the training tuples will
help speed up the learning phase and prevent the
exploding gradient problem.
Discrete-valued attributes may be encoded such that
there is one input unit per domain value.
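Both preprocessing steps can be sketched as follows (an illustration; the example values and domain are invented):

```python
def min_max_normalize(values):
    """Scale numeric attribute values into [0.0, 1.0]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def one_hot(value, domain):
    """Encode a discrete attribute with one input unit per domain value."""
    return [1 if value == d else 0 for d in domain]

print(min_max_normalize([0, 25, 100]))           # [0.0, 0.25, 1.0]
print(one_hot("red", ["red", "green", "blue"]))  # [1, 0, 0]
```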
Choice of the transfer function:
Linear function
Sigmoid (logistic activation) function, range [0, 1]
Hyperbolic tangent function, range [-1, 1]
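The three transfer functions can be written down directly (a small sketch; the sample input 1.2 is arbitrary):

```python
import math

def linear(x):
    return x

def sigmoid(x):
    """Logistic activation, output range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def tanh(x):
    """Hyperbolic tangent, output range (-1, 1)."""
    return math.tanh(x)

for f in (linear, sigmoid, tanh):
    print(f.__name__, round(f(1.2), 2))
```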
Neural networks can be used for both
classification (to predict the class label of a given
tuple) and numeric prediction (to predict a
continuous-valued output).
For classification, one output unit may be used
to represent two classes (where the value 1
represents one class, and the value 0 represents
the other).
If there are more than two classes, then one
output unit per class is used.
There are no clear rules as to the “best” number
of hidden layer units.
Network design is a trial-and-error process and
may affect the accuracy of the resulting trained
network.
The initial values of the weights may also affect
the resulting accuracy.
Once a network has been trained and its
accuracy is not considered acceptable, it is
common to repeat the training process with
a different network topology or
a different set of initial weights.
Backpropagation adjusts the weights of the network
in order to minimize the average squared error.
The learning algorithm procedure:
1. Initialize the weights with random values and set other
network parameters.
2. Read in the inputs and the desired outputs.
3. Compute the actual output (by working forward
through the layers).
4. Compute the error (the difference between the actual
and desired output).
5. Change the weights by working backward through
the hidden layers.
6. Repeat steps 2-5 until the weights stabilize.
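The steps above can be sketched as a minimal training loop (a toy 2-2-1 network learning the OR function; the topology, learning rate, and epoch count are arbitrary choices for the illustration):

```python
import math
import random

random.seed(0)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Step 1: initialize weights and biases with small random values (2-2-1 topology)
w_h = [[random.uniform(-0.5, 0.5) for _ in range(2)] for _ in range(2)]
b_h = [random.uniform(-0.5, 0.5) for _ in range(2)]
w_o = [random.uniform(-0.5, 0.5) for _ in range(2)]
b_o = random.uniform(-0.5, 0.5)
rate = 0.5

data = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 1)]  # OR function

for epoch in range(2000):                      # repeat steps 2-5 until stable
    for x, target in data:                     # step 2: inputs and desired output
        # step 3: compute the actual output, working forward through the layers
        h = [sigmoid(sum(w_h[j][i] * x[i] for i in range(2)) + b_h[j]) for j in range(2)]
        o = sigmoid(sum(w_o[j] * h[j] for j in range(2)) + b_o)
        # step 4: compute the error at the output
        err_o = o * (1 - o) * (target - o)
        # step 5: change the weights, working backward through the hidden layer
        err_h = [h[j] * (1 - h[j]) * err_o * w_o[j] for j in range(2)]
        for j in range(2):
            w_o[j] += rate * err_o * h[j]
            b_h[j] += rate * err_h[j]
            for i in range(2):
                w_h[j][i] += rate * err_h[j] * x[i]
        b_o += rate * err_o

for x, target in data:
    h = [sigmoid(sum(w_h[j][i] * x[i] for i in range(2)) + b_h[j]) for j in range(2)]
    o = sigmoid(sum(w_o[j] * h[j] for j in range(2)) + b_o)
    print(x, round(o, 2), "target:", target)
```

After training, the network's outputs fall on the correct side of 0.5 for all four inputs.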
Backpropagation learns by iteratively processing
a data set of training tuples, comparing the
network’s prediction for each tuple with the
actual known target value.
The target value may be the known class label of
the training tuple (for classification problems) or
a continuous value (for numeric prediction).
For each training tuple, the weights are modified
so as to minimize the mean-squared error
between the network’s prediction and the actual
target value.
These modifications are made in the
“backwards” direction (i.e., from the output
layer) through each hidden layer down to the
first hidden layer (hence the name
backpropagation).
Although it is not guaranteed, in general the
weights will eventually converge, and the
learning process stops.
The architecture of a neural network is driven by the
task it is intended to address:
Classification, regression, clustering, general
optimization, association, ….
Most popular architecture: the feedforward multilayer
perceptron with the backpropagation learning
algorithm
Used for both classification and regression type
problems
Others – Recurrent, self-organizing feature maps,
Hopfield networks, …
Multi-layer networks use a variety of learning techniques, the most
popular being back-propagation.
The output values are compared with the correct answer to compute the
value of some predefined error-function. By various techniques, the error
is then fed back through the network.
The algorithm adjusts the weights of each connection in order to reduce
the value of the error function by some small amount.
After repeating this process for a sufficiently large number of training
cycles, the network will usually converge to some state where the error
of the calculations is small.
In this case, one would say that the network has learned a certain target
function. To adjust weights properly, one applies a general method for
non-linear optimization that is called gradient descent. For this, the
network calculates the derivative of the error function with respect to the
network weights, and changes the weights such that the error decreases
(thus going downhill on the surface of the error function).
For this reason, back-propagation can only be applied on networks with
differentiable activation functions.
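For a single sigmoid unit with squared error E = (t - o)^2 / 2, the analytic gradient obtained by the chain rule can be checked numerically (a minimal sketch; the input, weight, and target values are invented):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

x, w, t = 2.0, 0.3, 1.0                  # input, weight, target
o = sigmoid(w * x)
analytic = -(t - o) * o * (1 - o) * x    # dE/dw by the chain rule

def error(w):
    o = sigmoid(w * x)
    return 0.5 * (t - o) ** 2

eps = 1e-6
numeric = (error(w + eps) - error(w - eps)) / (2 * eps)  # central difference

print(abs(analytic - numeric) < 1e-6)  # True
```

The numeric check only works because the sigmoid is differentiable everywhere, which is the requirement the paragraph above states.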
The weights in the network are initialized to
small random numbers (e.g., ranging
from -1.0 to 1.0, or from -0.5 to 0.5).
Each unit has a bias associated with it, as
explained later.
The biases are similarly initialized to small
random numbers.
Each training tuple, X, is processed by the
following steps.
First, the training tuple is fed to the network’s
input layer.
The inputs pass through the input units,
unchanged.
That is, for an input unit, j, its output, Oj, is equal
to its input value, Ij.
Next, the net input and output of each unit in the
hidden and output layers are computed.
The net input to a unit in the hidden or output
layers is computed as a linear combination of its
inputs.
Propagate the inputs forward:
Each hidden layer or output layer unit has a number
of inputs to it that are, in fact, the outputs of the
units connected to it in the previous layer.
To compute the net input to the unit, each input
connected to the unit is multiplied by its
corresponding weight, and this is summed.
Given a unit j in a hidden or output layer, the net
input, Ij, to unit j is

Ij = Σi wij·Oi + θj

where wij is the weight of the connection from unit i
in the previous layer to unit j; Oi is the output of
unit i; and θj is the bias of unit j (initialized to a
small random value). The net inputs of the remaining
units are computed similarly.
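The net input Ij = Σi wij·Oi + θj translates directly to code (a sketch; the example outputs, weights, and bias are invented):

```python
def net_input(outputs_prev, weights, bias):
    """I_j = sum_i(w_ij * O_i) + theta_j for one hidden/output unit."""
    return sum(w * o for w, o in zip(weights, outputs_prev)) + bias

# hypothetical unit j with two incoming connections
print(round(net_input([1.0, 0.5], [0.2, -0.3], 0.4), 2))  # 0.45
```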
Rule of thumb: The number of training
samples should be at least 5 to 10 times the
number of weights in the network.
Otherwise, the network is prone to overfitting.
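The rule of thumb translates to a quick check (a sketch; the 4-5-3 topology and the helper name are chosen for the example):

```python
def num_weights(layers):
    """Count weights (plus one bias per non-input unit) in a fully connected net."""
    return sum((layers[i] + 1) * layers[i + 1] for i in range(len(layers) - 1))

w = num_weights([4, 5, 3])   # 4 input, 5 hidden, 3 output units
print(w, 5 * w, 10 * w)      # 43 weights -> 215 to 430 training samples
```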
A common criticism of ANNs: the lack of
transparency/explainability
Answer: sensitivity analysis
Conducted on a trained ANN
The inputs are perturbed while the relative
change on the output is measured/recorded
Results illustrate the relative importance of input
variables
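Sensitivity analysis can be sketched by perturbing one input at a time and recording the relative change in the output (the `model` function here is a made-up stand-in for a trained ANN):

```python
def model(x):
    # stand-in for a trained network: depends strongly on x[0], weakly on x[1]
    return 2.0 * x[0] + 0.1 * x[1]

def sensitivity(model, x, delta=0.01):
    """Perturb each input in turn; larger score = more important variable."""
    base = model(x)
    scores = []
    for i in range(len(x)):
        perturbed = list(x)
        perturbed[i] += delta
        scores.append(abs(model(perturbed) - base) / delta)
    return scores

print(sensitivity(model, [1.0, 1.0]))
```

With this stand-in, the first input's score is far larger than the second's, mirroring its larger weight in the model.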
In machine learning, the vanishing gradient problem is encountered
when training artificial neural networks with gradient-based learning
methods and backpropagation. In such methods, during each iteration of
training each of the neural network's weights receives an update
proportional to the partial derivative of the error function with respect to
the current weight. The problem is that in some cases, the gradient will
be vanishingly small, effectively preventing the weight from changing its
value. In the worst case, this may completely stop the neural network
from further training. As one example of the problem cause,
traditional activation functions such as the hyperbolic tangent function
have gradients in the range (0,1], and backpropagation computes
gradients by the chain rule. This has the effect of multiplying n of these
small numbers to compute gradients of the early layers in an n-layer
network, meaning that the gradient (error signal) decreases
exponentially with n while the early layers train very slowly.
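The exponential shrinkage can be seen by multiplying per-layer gradient factors (a toy illustration, not a real network; the pre-activation value 1.5 is arbitrary):

```python
import math

def tanh_grad(x):
    """Derivative of tanh, which lies in (0, 1]."""
    return 1.0 - math.tanh(x) ** 2

g = tanh_grad(1.5)            # local gradient at a typical pre-activation
for n in (1, 5, 10, 20):
    print(n, g ** n)          # the product shrinks exponentially with depth n
```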