
Perceptron

The original Perceptron was designed to take a number of binary inputs, and
produce one binary output (0 or 1).

The idea was to use different weights to represent the importance of each input,
and to require that the weighted sum of the inputs exceed a threshold value before
the output switches from no (false, 0) to yes (true, 1).

How does Perceptron work?

In Machine Learning, the Perceptron is considered a single-layer neural network
that consists of four main parts: input values (input nodes), weights
and bias, a net sum, and an activation function. The perceptron model begins by
multiplying each input value by its weight, then adds these products
together to create the weighted sum. This weighted sum is then passed to the
activation function 'f' to obtain the output. In the original perceptron, this
activation function is a step function.
The step (activation) function ensures that the output
is mapped to the required values, (0, 1) or (-1, 1). The
weight of an input indicates how strongly that input influences the output. Similarly, the bias
value makes it possible to shift the threshold of the activation function.

Perceptron model works in two important steps as follows:

Step-1

First, multiply each input value by its corresponding weight
and add the products to obtain the weighted sum. Mathematically, the
weighted sum is:

$$\sum_{i=1}^{n} w_i x_i = w_1 x_1 + w_2 x_2 + \dots + w_n x_n$$

Add a special term called the bias 'b' to this weighted sum. It lets the model shift
its decision threshold, which improves performance:

$$\sum_{i=1}^{n} w_i x_i + b$$

Step-2

In the second step, an activation function is applied to the above-mentioned
weighted sum, which gives an output either in binary form or as a continuous value:

$$Y = f\left(\sum_{i=1}^{n} w_i x_i + b\right)$$
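The two steps above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names, the example inputs, and the choice of a 0/1 step activation with threshold at zero are all assumptions made for the example.

```python
def step(z):
    """Assumed threshold activation: 1 if z is positive, otherwise 0."""
    return 1 if z > 0 else 0

def perceptron_output(x, w, b):
    """Step 1: weighted sum plus bias. Step 2: apply the activation."""
    weighted_sum = sum(wi * xi for wi, xi in zip(w, x)) + b
    return step(weighted_sum)

# Hypothetical inputs, weights and bias chosen only for illustration.
x = [1, 0, 1]
w = [0.5, -0.6, 0.2]
b = -0.4
y = perceptron_output(x, w, b)   # weighted sum is about 0.3, so the unit fires
```

Changing the bias moves the point at which the weighted sum crosses the threshold, which is why it is treated as a separate term.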

Types of Perceptron Models

Based on the layers, Perceptron models are divided into two types. These are as
follows:

1. Single-layer Perceptron model
2. Multi-layer Perceptron model

Single Layer Perceptron Model:

This is one of the simplest types of artificial neural network (ANN). A single-layer
perceptron model consists of a feed-forward network and includes a threshold
transfer function. The main objective of the single-layer
perceptron model is to classify linearly separable objects with binary outcomes.

A single-layer perceptron has no stored prior knowledge, so
it begins with randomly assigned weights. It then sums
all the weighted inputs. If the total is more
than a pre-determined threshold value, the model is activated and outputs
+1.

If the output matches the desired value, the performance
of the model is considered satisfactory and the weights do not change. However,
for some inputs the output may differ from the desired value. In that case,
the weights must be adjusted to obtain the desired output and minimize the error.

"Single-layer perceptron can learn only linearly separable patterns."

A perceptron is a neural network unit that performs a computation to detect
features in the input data. It is mainly used to classify data into two
classes, and is therefore also known as a linear binary classifier.

Multi-Layered Perceptron Model:

Like a single-layer perceptron model, a multi-layer perceptron model has the
same basic structure but adds one or more hidden layers.

The multi-layer perceptron model is typically trained with the backpropagation algorithm,
which executes in two stages as follows:

o Forward Stage: Activations are computed starting from the input layer and
propagate forward until they reach the output layer.
o Backward Stage: Weight and bias values are
modified as per the model's requirement. In this stage, the error between the
actual output and the desired output is propagated backward, starting at the output layer and
ending at the input layer.

Hence, a multi-layer perceptron can be regarded as several layers of artificial
neurons in which the activation function does not have to be a linear or
threshold function as in a single-layer perceptron. Instead, the activation
function can be sigmoid, tanh, ReLU, etc.
A multi-layer perceptron model has greater processing power and can process
linear and non-linear patterns. It can also implement logic gates such as
AND, OR, XOR, NAND, NOT, XNOR, and NOR.
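As an illustration of the claim that a multi-layer model can implement XOR, the sketch below wires three threshold units into two layers. The weights are hand-picked for this example (they are not learned and do not come from the text), and the helper names are hypothetical.

```python
def step(z):
    """Threshold activation: 1 if z is positive, otherwise 0."""
    return 1 if z > 0 else 0

def neuron(inputs, weights, bias):
    """A single threshold unit: weighted sum plus bias, then step."""
    return step(sum(w * x for w, x in zip(weights, inputs)) + bias)

def xor_mlp(x1, x2):
    # Hidden layer: one unit behaves like OR, the other like NAND.
    h_or = neuron([x1, x2], [1, 1], -0.5)
    h_nand = neuron([x1, x2], [-1, -1], 1.5)
    # Output layer: AND of the two hidden units gives XOR.
    return neuron([h_or, h_nand], [1, 1], -1.5)

truth_table = {(a, b): xor_mlp(a, b) for a in (0, 1) for b in (0, 1)}
```

No single threshold unit can produce this truth table, because XOR is not linearly separable; the hidden layer is what makes it possible.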

The pictorial representation of multi-layer perceptron learning is as shown below-

MLP networks are used for supervised learning. The typical learning
algorithm for MLP networks is the backpropagation algorithm.

Advantages of Multi-Layer Perceptron:

o A multi-layered perceptron model can be used to solve complex non-linear
problems.
o It works well with both small and large input data.
o It gives quick predictions after training.
o It helps to obtain the same accuracy ratio with large as well as small data.

Disadvantages of Multi-Layer Perceptron:

o In a multi-layer perceptron, computations are difficult and time-consuming.
o In a multi-layer perceptron, it is difficult to predict how much each independent
variable affects the dependent variable.
o The model's functioning depends on the quality of the training.

Linear Separability:

A decision line is drawn to separate positive and negative responses. The decision
line may also be called the decision-making line, the decision-support line, or the
linearly separable line. The concept of linear separability was introduced to
classify patterns based on their output responses.
Linear separability refers to data points in binary classification problems
that can be separated using a linear decision boundary. If the data points can be
separated using a line, linear function, or flat hyperplane, they are considered linearly
separable.
• Linear separability is an important concept in neural networks. If the
points of the two classes in n-dimensional space can be separated by a hyperplane of the form
$w_1 x_1 + w_2 x_2 + \dots + w_n x_n + b = 0$, they are said to be
linearly separable.
• For two-dimensional inputs, if there exists a line (whose equation is
$w_1 x_1 + w_2 x_2 + b = 0$) that separates all samples of one class from
the other class, then an appropriate perceptron can be derived from the
equation of the separating line. Such classification problems are called "linearly
separable", i.e., separable by a linear combination of the inputs.
• The logical AND gate is a two-dimensional
example of a linearly separable problem.
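To make the AND-gate case concrete, the sketch below checks that one particular line separates the four input points. The line x1 + x2 - 1.5 = 0 is an assumed choice for illustration; many other lines work equally well.

```python
# AND-gate truth table: inputs (x1, x2) paired with their class labels.
samples = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]

# Hypothetical separating line: x1 + x2 - 1.5 = 0, i.e. w1 = w2 = 1, b = -1.5.
w1, w2, b = 1.0, 1.0, -1.5

def side(x1, x2):
    """Return 1 if the point lies on the positive side of the line, else 0."""
    return 1 if w1 * x1 + w2 * x2 + b > 0 else 0

predictions = [side(x1, x2) for (x1, x2), _ in samples]
separates = all(side(x1, x2) == label for (x1, x2), label in samples)
```

Only the point (1, 1) falls on the positive side, so the same weights and bias define a perceptron that computes AND.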
Perceptron Learning Rule

The Perceptron Learning Rule states that the algorithm automatically learns the
optimal weight coefficients. The input features are then multiplied by these
weights to determine whether a neuron fires or not.

The Perceptron receives multiple input signals. If the weighted sum of the input signals
exceeds a certain threshold, it outputs a signal; otherwise it does not.
In the context of supervised learning and classification, this can then be used to
predict the class of a sample.

Next up, let us focus on the perceptron function.

Perceptron Function

A Perceptron is a function that maps its input “x”, multiplied by the
learned weight coefficients, to an output value “f(x)”:

$$f(x) = \begin{cases} 1 & \text{if } w \cdot x + b > 0 \\ 0 & \text{otherwise} \end{cases}$$

In the equation given above:

• “w” = vector of real-valued weights
• “b” = bias (an element that adjusts the boundary away from the origin without any
dependence on the input value)
• “x” = vector of input x values
• $w \cdot x = \sum_{i=1}^{m} w_i x_i$
• “m” = number of inputs to the Perceptron

The output can be represented as “1” or “0”. It can also be represented as “1” or “-1”,
depending on which activation function is used.


Unconstrained Optimization
A problem devoid of constraints is, well, an unconstrained optimization problem.
Much of modern machine learning and deep learning depends on formulating and
solving an unconstrained optimization problem, by incorporating constraints as
additional elements of the loss with suitable penalties.

Gauss-Newton Method:

The Gauss-Newton method is an iterative algorithm to solve nonlinear least
squares problems. “Iterative” means it uses a series of calculations (based on
guesses for x-values) to find the solution. It is a modification of Newton’s method,
which in optimization finds points where the gradient is zero (candidate minima). Gauss-Newton is usually
used to find the best-fit theoretical model, although it could also be used to locate
a single point.

This algorithm is probably the most popular method for nonlinear least squares. It
does, however, have a few pitfalls:

• If you don’t make a good initial guess, it will be very slow to find a
solution and may not find one at all.
• The procedure is not suited for design matrices that are ill-conditioned or deficient in rank.
• If relative residuals are very big, the procedure will lose a large
amount of information.

The basic steps that the software will perform (steps 3 to 7 make up a single iteration):

1. Make an initial guess x0 for x.
2. Set k = 0.
3. Create a vector f_k with elements f_i(x_k).
4. Create the Jacobian matrix J_k of f at x_k.
5. Solve $J_k^T J_k \, p_k = -J_k^T f_k$. This gives the search direction p_k.
6. Find a step length s such that F(x_k + s p_k) satisfies the Wolfe conditions (these ensure
sufficient decrease along p_k).
7. Set x_{k+1} = x_k + s p_k.
8. Increment k and repeat steps 3 to 7 until convergence.
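The iteration described above can be sketched with NumPy as follows. This is a simplified illustration under stated assumptions: the step length is chosen by plain backtracking (halving s until the sum of squares stops increasing) rather than a full Wolfe line search, and the exponential model, synthetic data, and starting point are invented for the example.

```python
import numpy as np

def gauss_newton(residual, jacobian, x0, tol=1e-10, max_iter=50):
    """Minimise the sum of squared residuals with damped Gauss-Newton steps."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        f = residual(x)
        J = jacobian(x)
        # Solve the normal equations J^T J p = -J^T f for the search direction p.
        p = np.linalg.solve(J.T @ J, -J.T @ f)
        # Simplified step-length rule (assumption): halve s while the
        # sum of squares would increase.
        s = 1.0
        while np.sum(residual(x + s * p) ** 2) > np.sum(f ** 2) and s > 1e-8:
            s *= 0.5
        x = x + s * p
        if np.linalg.norm(s * p) < tol:
            break
    return x

# Synthetic, noise-free data from the hypothetical model y = 2 * exp(0.5 * t).
t = np.linspace(0.0, 2.0, 10)
y = 2.0 * np.exp(0.5 * t)

def residual(x):
    a, b = x
    return a * np.exp(b * t) - y

def jacobian(x):
    a, b = x
    # Columns are the partial derivatives of the residual w.r.t. a and b.
    return np.column_stack([np.exp(b * t), a * t * np.exp(b * t)])

solution = gauss_newton(residual, jacobian, x0=[1.0, 0.1])
```

Solving the normal equations directly is the simplest form of step 5; it is also where the ill-conditioning pitfall shows up, since J^T J squares the condition number of J.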

Linear Least Square:

Given a set of coordinates in the form (X, Y), the task is to find the least-squares
regression line that fits them.

In statistics, linear regression is a linear approach to modelling the relationship
between a scalar response (or dependent variable), say Y, and one or more
explanatory variables (or independent variables), say X.
Regression Line: If our data shows a linear relationship between X and Y, then
the straight line which best describes the relationship is the regression line. It is
the straight line that minimizes the sum of squared vertical distances to the data points.

Examples:

Input: X = [95, 85, 80, 70, 60]
       Y = [90, 80, 70, 65, 60]
Output: Y = 5.685 + 0.863*X
Explanation: For the data above, the regression line obtained is Y = 5.685 + 0.863*X.
Plotted against the data, it is the line that lies closest to the points in the
least-squares sense.

Input: X = [100, 95, 85, 80, 70, 60]
       Y = [90, 95, 80, 70, 65, 60]
Output: Y = 4.007 + 0.89*X

Approach:

A regression line is given as Y = a + b*X, where b and a are given by:

$$b = \frac{n \sum x_i y_i - \sum x_i \sum y_i}{n \sum x_i^2 - \left(\sum x_i\right)^2}$$

$$a = \bar{y} - b\,\bar{x}$$

where $\bar{x}$ and $\bar{y}$ are the means of x and y respectively.
1. To find the regression line, we need to find a and b.
2. Calculate b using the first formula.
3. Calculate a using the second formula and the value of b.
4. Put the values of a and b into the equation of the regression line.
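The approach above translates directly into code. The sketch below is a plain-Python illustration of the closed-form formulas; the function name is hypothetical, and the data is the first example from this section.

```python
def regression_line(xs, ys):
    """Return (a, b) for the least-squares line Y = a + b*X."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    # Slope first, because the intercept depends on it.
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    a = sum_y / n - b * sum_x / n
    return a, b

a, b = regression_line([95, 85, 80, 70, 60], [90, 80, 70, 65, 60])
print(f"Y = {a:.3f} + {b:.3f}*X")   # prints: Y = 5.685 + 0.863*X
```

The denominator is zero when all X values are equal, in which case no unique regression line exists; the sketch does not guard against that.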

Least Mean Square:

• LMS = least mean squares
• LMS is a learning algorithm
• The LMS algorithm was developed by Widrow and Hoff in 1960
• The LMS algorithm is used in various applications of adaptive signal processing,
including:
o Adaptive equalization of communication channels
o Echo cancellation on phone lines
o Adaptive signal detection in the presence of noise

Optimal filtering problem

• Assume there are p sensors located in space
• Let x1, x2, …, xp be signals acquired by the sensors and multiplied by
weights w1, w2, …, wp
• We need to determine w1, w2, …, wp to minimize the difference between the
obtained response y and the desired response d in the sense of the mean square
error
• The error signal is defined as: e = d - y
• Let d be a random variable
• Let the input values xk be random variables; such a sequence of random
variables is a random process
• In that case y and e are also random variables.
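The filtering setup above can be illustrated with the standard Widrow-Hoff update, w ← w + η·e·x, which is not spelled out in these notes and is added here as background. Everything else in the sketch is an assumption for illustration: three sensors, Gaussian input signals, a noise-free desired response generated from hypothetical "true" weights, and a step size of 0.05.

```python
import numpy as np

rng = np.random.default_rng(0)
true_w = np.array([0.5, -1.0, 2.0])   # hypothetical weights to be recovered
X = rng.normal(size=(2000, 3))        # each row: p = 3 sensor signals at one step
d = X @ true_w                        # desired response (noise-free here)

eta = 0.05                            # assumed step size
w = np.zeros(3)
for x_n, d_n in zip(X, d):
    y_n = w @ x_n                     # obtained response y
    e_n = d_n - y_n                   # error signal e = d - y
    w = w + eta * e_n * x_n           # LMS (Widrow-Hoff) weight update
```

Each update uses only the current sample, which is what makes LMS suitable for the adaptive signal-processing applications listed above.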

Perceptron Convergence Theorem:

• In the classification of linearly separable patterns belonging to two classes
only, the training task for the classifier is to find a weight vector w that classifies
every training pattern correctly.
• Completion of training with the fixed correction training rule for any initial
weight vector and any correction increment constant leads to the following
weights:
$$w^* = w^{k_0} = w^{k_0+1} = w^{k_0+2} = \dots$$
with $w^*$ as the solution vector.
• The integer $k_0$ is the training step number starting at which no more
misclassification occurs, and thus no weight adjustments take place for
$k \ge k_0$.
• This theorem is called the "Perceptron Convergence Theorem".
• The Perceptron Convergence Theorem states that a classifier for two linearly
separable classes of patterns is always trainable in a finite number of training
steps.
• In summary, the training of a single discrete perceptron two-class classifier
requires a change of weights if and only if a misclassification occurs.
• If the reason for misclassification is $w^T x < 0$, then all weights are increased
in proportion to $x_i$. If $w^T x > 0$, then all weights are decreased in
proportion to $x_i$.

Summary of the Perceptron Convergence Algorithm:

Variables and Parameters:
• $x(n)$ = (m+1)-by-1 input vector $= [+1, x_1(n), x_2(n), \dots, x_m(n)]^T$
• $w(n)$ = (m+1)-by-1 weight vector $= [b(n), w_1(n), w_2(n), \dots, w_m(n)]^T$
• b(n) = bias
• y(n) = actual response
• d(n) = desired response
• η = learning rate parameter, a positive constant less than unity

1. Initialization: Set w(0) = 0, then perform the following computations for time
steps n = 1, 2, ...
2. Activation: At time step n, activate the perceptron by applying input vector
x(n) and desired response d(n).
3. Computation of actual response: Compute the actual response of the
perceptron:
$$y(n) = \operatorname{sgn}\left[w^T(n)\, x(n)\right]$$
4. Adaptation of weight vector: Update the weight vector of the perceptron:
$$w(n+1) = w(n) + \eta\,[d(n) - y(n)]\,x(n)$$
5. Continuation: Increment time step n by 1 and go back to step 2.
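The algorithm summarized above can be sketched on the AND gate with targets coded as +1 and -1. This is an illustrative sketch, not a definitive implementation: the learning rate of 0.5, the epoch limit, and the convention that sgn(0) returns -1 are assumptions made for the example.

```python
import numpy as np

def sgn(v):
    """Signum with the assumed convention sgn(0) = -1."""
    return 1 if v > 0 else -1

# AND gate; the leading +1 in each input vector carries the bias b(n).
X = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]], dtype=float)
d = np.array([-1, -1, -1, 1])

eta = 0.5                       # assumed learning rate, less than unity
w = np.zeros(3)                 # w = [b, w1, w2], initialised to zero
for epoch in range(100):        # assumed upper bound on passes over the data
    errors = 0
    for x_n, d_n in zip(X, d):
        y_n = sgn(w @ x_n)                        # actual response
        if y_n != d_n:
            w = w + eta * (d_n - y_n) * x_n       # adaptation of the weight vector
            errors += 1
    if errors == 0:             # no misclassification in a full pass: converged
        break

predictions = [sgn(w @ x_n) for x_n in X]
```

Because the update term is zero whenever y(n) equals d(n), the weights change only on misclassified samples, matching the statement in the theorem.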
Backpropagation:
Backpropagation is the essence of neural network training. It is the method of
fine-tuning the weights of a neural network based on the error rate obtained in the
previous epoch (i.e., iteration). Proper tuning of the weights allows you to reduce
error rates and make the model reliable by increasing its generalization.
Backpropagation in neural network is a short form for “backward propagation of
errors.” It is a standard method of training artificial neural networks. This method
helps calculate the gradient of a loss function with respect to all the weights in the
network.

How Backpropagation Algorithm Works


The backpropagation algorithm computes the gradient of the
loss function with respect to each weight using the chain rule. It computes the gradient
efficiently, one layer at a time, unlike a naive direct computation. It computes the gradient, but it does
not define how the gradient is used. It generalizes the computation in the delta rule.

Consider the following backpropagation neural network example to understand how the
algorithm works:

1. Inputs X arrive through the preconnected path.
2. The input is modeled using real weights W. The weights are usually randomly
selected.
3. Calculate the output for every neuron from the input layer, to the hidden
layers, to the output layer.
4. Calculate the error in the outputs:

Error = Actual Output - Desired Output

5. Travel back from the output layer to the hidden layer to adjust the weights
such that the error is decreased.

Keep repeating the process until the desired output is achieved.
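The five steps above can be sketched for a small network with one hidden layer. The sketch is illustrative only and makes several assumptions not stated in the text: sigmoid activations, a squared-error loss averaged over the samples, three hidden units, XOR as the training data, and a plain gradient-descent update with step size 0.5.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(params, X):
    """Step 3: compute outputs layer by layer."""
    W1, b1, W2, b2 = params
    H = sigmoid(X @ W1 + b1)      # hidden-layer outputs
    Y = sigmoid(H @ W2 + b2)      # output-layer outputs
    return H, Y

def loss(params, X, D):
    _, Y = forward(params, X)
    return 0.5 * np.mean(np.sum((Y - D) ** 2, axis=1))

def backward(params, X, D):
    """Steps 4-5: gradients of the loss for every weight, via the chain rule."""
    W1, b1, W2, b2 = params
    H, Y = forward(params, X)
    n = X.shape[0]
    delta2 = (Y - D) * Y * (1 - Y) / n        # error signal at the output layer
    delta1 = (delta2 @ W2.T) * H * (1 - H)    # error propagated back to the hidden layer
    return [X.T @ delta1, delta1.sum(axis=0), H.T @ delta2, delta2.sum(axis=0)]

# XOR inputs and desired outputs.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
D = np.array([[0], [1], [1], [0]], dtype=float)

rng = np.random.default_rng(0)    # step 2: randomly selected weights
params = [rng.normal(size=(2, 3)), np.zeros(3), rng.normal(size=(3, 1)), np.zeros(1)]

initial_loss = loss(params, X, D)
eta = 0.5
for _ in range(5000):
    grads = backward(params, X, D)
    params = [p - eta * g for p, g in zip(params, grads)]
final_loss = loss(params, X, D)
```

Note that `backward` only produces gradients; the update rule in the training loop is a separate choice, which reflects the point that backpropagation does not define how the gradient is used.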

Why We Need Backpropagation?


The most prominent advantages of backpropagation are:

• Backpropagation is fast, simple, and easy to program
• It has no parameters to tune apart from the number of inputs
• It is a flexible method as it does not require prior knowledge about the
network
• It is a standard method that generally works well
• It does not need any special mention of the features of the function to be
learned.

What is the backpropagation algorithm?
Backpropagation, or backward propagation of errors, is an algorithm that is
designed to test for errors working back from output nodes to input nodes. It is an
important mathematical tool for improving the accuracy of predictions in data
mining and machine learning. Essentially, backpropagation is an algorithm used to
calculate derivatives quickly.

There are two leading types of backpropagation networks:

1. Static backpropagation. Static backpropagation is a network developed
to map static inputs to static outputs. Static backpropagation networks
can solve static classification problems, such as optical character
recognition (OCR).
2. Recurrent backpropagation. The recurrent backpropagation network is
used for fixed-point learning. Recurrent backpropagation activation feeds
forward until it reaches a fixed value.

The key difference here is that static backpropagation offers instant mapping and
recurrent backpropagation does not.

Artificial neural networks use backpropagation as a learning algorithm to compute
the gradient of the error with respect to the weight values; gradient descent then
uses this gradient to update the weights. By
comparing desired outputs to achieved system outputs, the systems are tuned by
adjusting connection weights to narrow the difference between the two as much as
possible.

The algorithm gets its name because the weights are updated backward, from
output to input.

The advantages of using a backpropagation algorithm are as follows:

• It does not have any parameters to tune except for the number of inputs.
• It is highly adaptable and efficient and does not require any prior
knowledge about the network.
• It is a standard process that usually works well.
• It is user-friendly, fast and easy to program.
• Users do not need to learn any special functions.

The disadvantages of using a backpropagation algorithm are as follows:

• It prefers a matrix-based approach over a mini-batch approach.
• It is sensitive to noise and irregularities in the data.
• Performance is highly dependent on input data.
• Training is time- and resource-intensive.

TRAINING THE PERCEPTRON

2.1. Initialize the Weights and Calculate the Actual Output

Let’s look at the perceptron again:

A more generalized diagram of the perceptron model.

The input x_i is multiplied by a randomly initialized weight w_ij and is fed into the
perceptron along with a bias and all other weighted inputs. Inside the perceptron the
activity function is applied to this weighted sum of inputs plus a bias and its value is
then fed as an argument into the activation function f. The value of the activation
function produces the output y_j of perceptron j.
2.2. Define and Calculate the Error

Before we can calculate the error, we first need to define the error. Remember that I
mentioned that the error function has to be dependent on the weights and it needs to
relate actual output y_j to desired output d_j. So let’s define it as this:

$$e_j = d_j - y_j$$

A prerequisite to the error function of a perceptron

This function e_j does relate d_j to y_j and is in fact dependent on the weights
because we know the y_j term itself is dependent on the weights. We know that we
want to minimize e_j. But depending on the values of d_j and y_j, we might get a
negative value for the error e_j. So let’s square it to ensure that the error is never
negative and define it as

$$E_j = \frac{1}{2}\, e_j^2 = \frac{1}{2}\,(d_j - y_j)^2$$

Error function of a perceptron

So then, we’ve redefined the error function e_j as E_j to ensure that it is never
negative. But I’ve also included a factor of 1/2. As one of my favorite professors
used to say, ‘we do what is convenient in mathematics’. The factor of 1/2 is there to
make things more convenient later. I will come back to this shortly.

2.3. Gradient Descent — Updating the Weights to Further Reduce Error

Now that we have an error function, we can use it to help us determine how to
update w_ij so we can reduce the error further. Let’s define a mathematical
expression for what an updated weight looks like, based on the current weight and
the error function E_j:

$$w_{ij}(k+1) = w_{ij}(k) - \eta\, \frac{\partial E_j}{\partial w_{ij}}$$

The equation above is an expression for finding an updated weight
w_ij(k+1) using

1. the current weight w_ij(k)
2. some step size given by the Greek letter eta (η)
3. the derivative of the error function E_j with respect to w_ij.
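To show what one such update looks like numerically, here is a small sketch for a single perceptron. The text does not fix the activation function f, so the logistic sigmoid is assumed here; the starting weights, input, target, and step size are hypothetical, and the bias is updated with the same rule as the weights, which is also an assumption.

```python
import math

def f(v):
    """Assumed activation function: the logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-v))

def output(w, b, x):
    return f(sum(wi * xi for wi, xi in zip(w, x)) + b)

def error(w, b, x, d):
    """E = 1/2 * (d - y)^2"""
    return 0.5 * (d - output(w, b, x)) ** 2

def update(w, b, x, d, eta):
    """One gradient-descent step: w(k+1) = w(k) - eta * dE/dw."""
    y = output(w, b, x)
    # Chain rule: dE/dw_i = -(d - y) * f'(v) * x_i, where f'(v) = y * (1 - y).
    common = -(d - y) * y * (1.0 - y)
    new_w = [wi - eta * common * xi for wi, xi in zip(w, x)]
    new_b = b - eta * common
    return new_w, new_b

# Hypothetical starting point, input, target and step size.
w, b = [0.1, -0.2], 0.0
x, d, eta = [1.0, 2.0], 1.0, 0.5
before = error(w, b, x, d)
w, b = update(w, b, x, d, eta)
after = error(w, b, x, d)
```

Differentiating E brings down a factor of 2 from the square, which the factor of 1/2 cancels; that is why no stray constant appears in `common`.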
