
ANN

Prepared by MJJ
Introduction

 Artificial neural networks (ANNs) provide a general, practical method for
learning real-valued, discrete-valued, and vector-valued functions from
examples.
 Algorithms such as BACKPROPAGATION use gradient descent to tune network
parameters to best fit a training set of input-output pairs.
 ANN learning is robust to errors in the training data and has been successfully
applied to problems such as
 interpreting visual scenes,
 speech recognition, and
 learning robot control strategies.
Introduction
 Inspired by the human brain.
 Some NNs are models of biological neural
networks
 Human brain contains a massively
interconnected net of 10^10–10^11 (10 billion or more)
neurons (cortical cells)
 Massive parallelism – large number of simple processing units
 Connectionism – highly interconnected
 Associative distributed memory
 Pattern and strength of synaptic connections
Neuron

Neural Unit
ANNs

 ANNs incorporate the two fundamental components of
biological neural nets:
1. Nodes - Neurons
2. Weights - Synapses
Properties
When to consider NN?

 Instances are represented by many attribute-value
pairs.
 The target function output may be discrete-valued,
real-valued, or a vector of several real- or
discrete-valued attributes.
 The training examples may contain errors.
 Long training times are acceptable.
 Fast evaluation of the learned target function may be
required.
Perceptron
 Basic unit in a neural network: a linear separator
 N inputs, x1 ... xn
 Weights for each input, w1 ... wn
 A bias input x0 (constant 1) and associated weight w0
 Weighted sum of inputs: y = Σ wi xi
 A threshold function φ: output +1 if y > 0, -1 if y <= 0
[Diagram: inputs x0, x1, ..., xn with weights w0, w1, ..., wn feeding a summation unit Σ followed by the threshold φ]
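The unit above can be sketched directly in code (a minimal sketch; the weights and inputs here are illustrative):

```python
def perceptron(x, w):
    """Threshold unit: +1 if the weighted sum exceeds 0, else -1.
    x and w include the bias term: x[0] is the constant 1, w[0] the bias weight."""
    y = sum(wi * xi for wi, xi in zip(w, x))
    return 1 if y > 0 else -1

# Example: two inputs plus the bias input x0 = 1 (illustrative weights)
x = [1, 0.5, -0.2]
w = [-0.1, 0.7, 0.3]
print(perceptron(x, w))  # weighted sum = -0.1 + 0.35 - 0.06 = 0.19 > 0, so output 1
```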
Conti..

 A single perceptron can be used to represent many boolean functions.
 In fact, AND and OR can be viewed as special cases of m-of-n functions: that
is, functions where at least m of the n inputs to the perceptron must be true.
 The OR function corresponds to m = 1 and the AND function to m = n.
 Any m-of-n function is easily represented using a perceptron by setting all
input weights to the same value (e.g., 0.5) and then setting the threshold w0
accordingly.
 Perceptrons can represent all of the primitive boolean functions AND, OR,
NAND (¬AND), and NOR (¬OR).
 Unfortunately, however, some boolean functions cannot be represented by a
single perceptron, such as the XOR function
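The m-of-n construction can be checked concretely: with all input weights fixed at 0.5, only the threshold weight w0 differs between OR and AND (a sketch; outputs here are 0/1 rather than the ±1 used elsewhere, and the specific w0 values are illustrative):

```python
def perceptron(x, w0, w=0.5):
    # All input weights equal (m-of-n construction); w0 is the bias/threshold weight
    return 1 if w0 + sum(w * xi for xi in x) > 0 else 0

# OR: m = 1 -> fire when at least one input is 1, so -0.5 < w0 < 0
OR = lambda a, b: perceptron([a, b], w0=-0.3)
# AND: m = n = 2 -> fire only when both inputs are 1, so -1.0 < w0 < -0.5
AND = lambda a, b: perceptron([a, b], w0=-0.8)

print([OR(a, b) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]])   # [0, 1, 1, 1]
print([AND(a, b) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]])  # [0, 0, 0, 1]
```

No single choice of weights and threshold reproduces XOR's output pattern, which is why a single perceptron cannot represent it.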
Example
Decision surface of a perceptron
How to learn the weights for a single
perceptron
 Here the precise learning problem is to determine a weight vector that causes
the perceptron to produce the correct +1 or -1 output for each of the given
training examples.
 Several algorithms are known to solve this learning problem.
 Here we consider two: the perceptron rule and the delta rule
 These two algorithms are guaranteed to converge to somewhat different
acceptable hypotheses, under somewhat different conditions.
 They are important to ANNs because they provide the basis for learning
networks of many units.
 One way to learn an acceptable weight vector is to begin with random
weights, then iteratively apply the perceptron to each training example,
modifying the perceptron weights whenever it misclassifies an example.
 This process is repeated, iterating through the training examples as many
times as needed until the perceptron classifies all training examples correctly.
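The iterative process described above can be sketched with the perceptron training rule wi ← wi + η(t − o)xi (a minimal sketch; the learning rate and the OR training data are illustrative):

```python
def train_perceptron(examples, eta=0.1, max_epochs=100):
    """Perceptron training rule: w_i <- w_i + eta * (t - o) * x_i.
    examples: list of (x, t) with x including the bias input x[0] = 1, t in {+1, -1}."""
    w = [0.0] * len(examples[0][0])    # initial weights (random also works)
    for _ in range(max_epochs):
        errors = 0
        for x, t in examples:
            o = 1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else -1
            if o != t:                 # update only on a misclassified example
                w = [wi + eta * (t - o) * xi for wi, xi in zip(w, x)]
                errors += 1
        if errors == 0:                # all training examples classified correctly
            return w
    return w

# Learn OR with +1/-1 targets (linearly separable, so the loop converges)
data = [([1, 0, 0], -1), ([1, 0, 1], 1), ([1, 1, 0], 1), ([1, 1, 1], 1)]
w = train_perceptron(data)
print(w)
```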
Perceptron learning rule
Perceptron training rule

 If the data is linearly separable and the learning rate is sufficiently small,
the rule will converge to a hypothesis that classifies all training data
correctly in a finite number of iterations
Gradient Descent and the Delta Rule
 Although the perceptron rule finds a successful weight vector when the
training examples are linearly separable, it can fail to converge if the
examples are not linearly separable.
 A second training rule, called the delta rule, is designed to overcome this
difficulty. If the training examples are not linearly separable, the delta
rule converges toward a best-fit approximation to the target concept.
 The key idea behind the delta rule is to use gradient descent to search
the hypothesis space of possible weight vectors to find the weights that
best fit the training examples.
 This rule is important because gradient descent provides the basis for the
BACKPROPAGATION algorithm, which can learn networks with many
interconnected units.
 It is also important because gradient descent can serve as the basis for
learning algorithms that must search through hypothesis spaces containing
many different types of continuously parameterized hypotheses.
Conti..

The training error is measured as

E(w) = 1/2 Σ_{d∈D} (t_d − o_d)²

where D is the set of training examples, t_d is the target output for training example
d, and o_d is the output of the linear unit for training example d.
DERIVATION OF THE GRADIENT
DESCENT RULE
 This vector derivative is called the gradient of E with respect to w, written ∇E(w)
 Notice ∇E(w) is itself a vector, whose components are the partial derivatives of E
with respect to each of the wi.
 When interpreted as a vector in weight space, the gradient specifies the
direction that produces the steepest increase in E.
 The negative of this vector therefore gives the direction of steepest decrease.
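Written out for the linear unit o_d = w · x_d and the squared-error E defined earlier, the standard derivation is:

```latex
\nabla E(\vec{w}) \equiv
\left[ \frac{\partial E}{\partial w_0}, \frac{\partial E}{\partial w_1},
\dots, \frac{\partial E}{\partial w_n} \right]

\frac{\partial E}{\partial w_i}
  = \frac{\partial}{\partial w_i}\, \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2
  = \sum_{d \in D} (t_d - o_d)\,
    \frac{\partial}{\partial w_i}\bigl(t_d - \vec{w}\cdot\vec{x}_d\bigr)
  = \sum_{d \in D} (t_d - o_d)\,(-x_{id})

\Delta w_i = -\eta\, \frac{\partial E}{\partial w_i}
           = \eta \sum_{d \in D} (t_d - o_d)\, x_{id}
```

The minus sign in the update rule is what turns the direction of steepest increase into the direction of steepest decrease.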
Example
Steps

 Step 1: Initialize the weights (a & b) with random values and calculate
Error (SSE)
 Step 2: Calculate the gradient, i.e., the change in SSE when the weights (a & b)
are changed by a very small amount from their original randomly initialized
values. This helps us move the values of a & b in the direction in which SSE
is minimized.
 Step 3: Adjust the weights with the gradients to reach the optimal values
where SSE is minimized
 Step 4: Use the new weights for prediction and to calculate the new SSE
 Step 5: Repeat steps 2 and 3 until further adjustments to the weights no
longer significantly reduce the error
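The five steps above can be sketched end to end by fitting y = a + b·x to data (a minimal sketch; the data, learning rate, and iteration count are illustrative):

```python
import random

# Fit y = a + b*x to data by gradient descent on SSE (steps 1-5 above).
data = [(x, 2.0 + 3.0 * x) for x in [0.0, 1.0, 2.0, 3.0, 4.0]]  # true a=2, b=3

random.seed(0)
a, b = random.random(), random.random()        # Step 1: random initial weights
eta = 0.01                                     # learning rate (assumed value)

for _ in range(5000):                          # Step 5: repeat until converged
    # Step 2: gradient of SSE = sum((y - (a + b*x))^2) w.r.t. a and b
    grad_a = sum(-2 * (y - (a + b * x)) for x, y in data)
    grad_b = sum(-2 * (y - (a + b * x)) * x for x, y in data)
    # Step 3: move the weights against the gradient
    a -= eta * grad_a
    b -= eta * grad_b

# Step 4: the new weights should now give a small SSE
sse = sum((y - (a + b * x)) ** 2 for x, y in data)
print(round(a, 2), round(b, 2))
```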
STOCHASTIC APPROXIMATION TO
GRADIENT DESCENT
Conti..

 One common variation on gradient descent intended to alleviate these
difficulties is called incremental gradient descent, or alternatively
stochastic gradient descent.
The key differences between
standard gradient descent and stochastic gradient descent
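The distinction can be sketched for a linear unit o = w·x: standard (batch) gradient descent sums the error gradient over all of D before updating once, while stochastic gradient descent updates after each example, Δwi = η(t − o)xi (illustrative data and learning rate):

```python
def batch_epoch(data, w, eta):
    # Standard gradient descent: accumulate the gradient over all of D,
    # then apply a single weight update per epoch.
    delta = [0.0] * len(w)
    for x, t in data:
        o = sum(wi * xi for wi, xi in zip(w, x))
        delta = [dw + eta * (t - o) * xi for dw, xi in zip(delta, x)]
    return [wi + dw for wi, dw in zip(w, delta)]

def stochastic_epoch(data, w, eta):
    # Stochastic gradient descent: update the weights after each example.
    for x, t in data:
        o = sum(wi * xi for wi, xi in zip(w, x))
        w = [wi + eta * (t - o) * xi for wi, xi in zip(w, x)]
    return w

data = [([1.0, x], 1.0 + 2.0 * x) for x in [0.0, 1.0, 2.0]]  # target: w = [1, 2]
wb = ws = [0.0, 0.0]
for _ in range(2000):
    wb = batch_epoch(data, wb, eta=0.05)
    ws = stochastic_epoch(data, ws, eta=0.05)
print([round(v, 2) for v in wb], [round(v, 2) for v in ws])
```

On this separable toy problem both reach the same weights; they differ in how often the weights move and in how exactly each step follows the true gradient.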
Activation functions

 Why is an activation function used in a neural network?
 The purpose of the activation function is to introduce non-linearity into the output of
a neuron. A neural network's neurons operate on their inputs through weights, a bias,
and their respective activation function.
 How does neural network activation function work?
 Activation functions are mathematical equations that determine the output of a neural
network. The function is attached to each neuron in the network, and determines
whether it should be activated (“fired”) or not, based on whether each neuron's input is
relevant for the model's prediction.
 Can we do without an activation function?
 A neural network without an activation function is essentially just a linear regression
model.
 Such a network would be less powerful and would not be able to learn complex patterns
from the data.
Conti..

 Popular types of activation functions and when to use them:
 Linear
 Sigmoid
 Tanh
 ReLU (rectified linear unit)
 Leaky ReLU
 Parameterised ReLU
 Softmax
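The functions in the list above are short enough to write out directly (scalar sketches; softmax acts on a vector, and the PReLU slope a would be learned in practice):

```python
import math

def linear(x):      return x
def sigmoid(x):     return 1.0 / (1.0 + math.exp(-x))   # output in (0, 1)
def tanh(x):        return math.tanh(x)                 # output in (-1, 1)
def relu(x):        return max(0.0, x)
def leaky_relu(x, slope=0.01):  return x if x > 0 else slope * x
def prelu(x, a):    return x if x > 0 else a * x        # slope a is learned

def softmax(xs):
    # Subtract the max for numerical stability; the outputs sum to 1.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

print(sigmoid(0.0))                 # 0.5
print(relu(-3.0), relu(3.0))        # 0.0 3.0
print([round(p, 2) for p in softmax([1.0, 2.0, 3.0])])
```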
Linear

 It takes the inputs, multiplied by the weights for each neuron, and
creates an output signal proportional to the input. In one sense, a linear
function is better than a step function because it allows multiple outputs,
not just yes and no.
 However, a linear activation function has two major problems:
 Not possible to use backpropagation (gradient descent) to train the model—the derivative of the
function is a constant, and has no relation to the input, X. So it’s not possible to go back and
understand which weights in the input neurons can provide a better prediction.
 All layers of the neural network collapse into one—with linear activation functions, no matter
how many layers in the neural network, the last layer will be a linear function of the first layer
(because a linear combination of linear functions is still a linear function). So a linear activation
function turns the neural network into just one layer.
A neural network with a linear activation function is simply a linear regression model. It
has limited power and cannot handle the complex, varying patterns in input data.
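The "layers collapse into one" claim can be verified directly: composing two linear layers gives exactly the same outputs as the single layer whose weight matrix is their product (a sketch with illustrative weight matrices):

```python
# Two stacked linear "layers" (no activation) collapse into one linear map:
# y = W2 (W1 x) = (W2 W1) x, so extra depth adds no expressive power.

def matvec(W, x):
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in W]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

W1 = [[1.0, 2.0], [3.0, 4.0]]    # first layer weights (illustrative)
W2 = [[0.5, -1.0], [2.0, 0.0]]   # second layer weights (illustrative)
x = [1.0, -1.0]

two_layers = matvec(W2, matvec(W1, x))   # pass x through both layers
one_layer = matvec(matmul(W2, W1), x)    # single layer with weights W2 W1
print(two_layers, one_layer)             # identical outputs
```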
Sigmoid
Tanh
ReLU
Choosing the right Activation Function

 Depending upon the properties of the problem we might be able to make a
better choice for easy and quicker convergence of the network.
 Sigmoid functions and their combinations generally work better in the case of
classifiers
 ReLU function is a general activation function and is used in most cases these days
 If we encounter a case of dead neurons in our networks the leaky ReLU function is
the best choice
 Always keep in mind that ReLU function should only be used in the hidden layers
 As a rule of thumb, you can begin with the ReLU function and then move to
other activation functions if ReLU doesn't provide optimal results
Multilayer perceptron (MLP)
Example
Back propagation
Algorithm
Introduction

 Back-propagation is the essence of neural net training.
 It is the method of fine-tuning the weights of a neural net based on the error
rate obtained in the previous epoch (i.e., iteration).
 Proper tuning of the weights allows you to reduce error rates and to make the
model reliable by increasing its generalization.
Conti..

 The weight-update loop in BACKPROPAGATION may be iterated thousands of
times in a typical application.
 A variety of termination conditions can be used to halt the procedure.
 One may choose to halt after a fixed number of iterations through the loop,
or once the error on the training examples falls below some threshold, or
once the error on a separate validation set of examples meets some criterion.
 The choice of termination criterion is an important one, because too few
iterations can fail to reduce error sufficiently, and too many can lead to
overfitting the training data.
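The three termination conditions above can be combined in one halting loop. This is only a sketch of the control flow: `train_epoch`, `train_error`, and `val_error` are hypothetical stand-ins for a real training procedure, and the toy run below fakes a shrinking training error with a U-shaped validation error to trigger the overfitting case:

```python
def run_training(train_epoch, train_error, val_error,
                 max_iters=10000, error_threshold=0.01, patience=5):
    best_val, since_best = float("inf"), 0
    for it in range(max_iters):              # condition 1: fixed iteration budget
        train_epoch()
        if train_error() < error_threshold:  # condition 2: training error below threshold
            return f"converged at iteration {it}"
        v = val_error()                      # condition 3: validation-set criterion
        if v < best_val:
            best_val, since_best = v, 0
        else:
            since_best += 1
            if since_best >= patience:       # validation error stopped improving
                return f"early stop at iteration {it}"
    return "iteration budget exhausted"

# Toy run: training error keeps shrinking while validation error turns upward.
state = {"t": 1.0, "it": 0}
def step():    state["t"] *= 0.9; state["it"] += 1
def tr_err():  return state["t"]
def va_err():  return state["t"] + 0.002 * state["it"] ** 2   # U-shaped curve

result = run_training(step, tr_err, va_err, error_threshold=1e-6)
print(result)
```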
MOMENTUM
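The momentum variant modifies gradient descent by keeping a fraction α of the previous weight update, Δw(t) = −η ∂E/∂w + α Δw(t−1), which smooths the search trajectory and can speed convergence. A minimal 1-D sketch on a quadratic error surface (the error function, η, and α here are illustrative):

```python
# Gradient descent with momentum on E(w) = (w - 3)^2, minimized at w = 3.
#   delta_w(t) = -eta * dE/dw + alpha * delta_w(t-1)

def minimize(eta=0.05, alpha=0.9, steps=500):
    w, delta_w = 0.0, 0.0
    for _ in range(steps):
        grad = 2.0 * (w - 3.0)                    # dE/dw
        delta_w = -eta * grad + alpha * delta_w   # momentum term carries over
        w += delta_w
    return w

print(round(minimize(), 4))   # converges to the minimum at w = 3.0
```

With α = 0 this reduces to plain gradient descent; larger α lets the update "roll through" small local irregularities in the error surface.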