
Machine Learning Notes – 6th Sem CSE Elective 2019-20, Sujata Joshi/Assoc Prof/CSE

UNIT 4 - NEURAL NETWORKS

Neural Network Representation, Problems, Perceptrons, Multilayer Networks and Back Propagation
Algorithms.

NEURAL NETWORKS

An artificial neural network (ANN) is a computational model based on the structure and functions
of biological neural networks. Information that flows through the network shapes the ANN, because
the network adjusts its weights (learns) from the inputs it receives and the outputs it should produce.
It provides a general, practical method for learning real-valued, discrete-valued, and vector-valued
functions from examples. ANNs are considered nonlinear statistical data modeling tools in which
complex relationships between inputs and outputs are modeled and patterns are found.
An ANN is
• an information processing architecture loosely modelled on the brain
• made up of a large number of interconnected processing units (neurons)
• able to work in parallel to accomplish a global task
• generally used to model relationships between inputs and outputs or to find patterns in data
An ANN consists of an input layer, one or more hidden layers, and an output layer, as shown in the
figure below.

Artificial neural network with 1 hidden layer


Characteristics of Artificial Neural Network
• It is a neurally implemented mathematical model.
• It contains a large number of interconnected processing elements, called neurons, that carry out
all operations.
• The information stored in the network is held in the weighted links between neurons.
• The input signals arrive at the processing elements through connections and connecting
weights.
• It has the ability to learn, recall, and generalize from the given data by suitable assignment and
adjustment of weights.
• The collective behavior of the neurons determines its computational power, and no single
neuron carries specific information.


Some of the learning problems which can be solved with ANN are:
1. Learning to interpret complex real world sensor data
2. Learning to recognize handwritten characters
3. Learning to recognize spoken words
4. Learning to recognize faces

APPROPRIATE PROBLEMS FOR NEURAL NETWORK LEARNING


ANN learning is well-suited to problems in which the training data corresponds to noisy, complex
sensor data, such as inputs from cameras and microphones. The BACKPROPAGATION algorithm
is the most commonly used ANN learning technique. It is appropriate for problems with the
following characteristics:

• Instances are represented by many attribute-value pairs. The target function to be learned is
defined over instances that can be described by a vector of predefined features, such as pixel values.
These input attributes may be highly correlated or independent of one another. Input values can
be any real values.
• The target function output may be discrete-valued, real-valued, or a vector of several real- or
discrete-valued attributes.
• The training examples may contain errors. ANN learning methods are quite robust to noise in
the training data.
• Long training times are acceptable. Network training algorithms typically require longer training
times than, say, decision tree learning algorithms. Training times can range from a few seconds to
many hours, depending on factors such as the number of weights in the network, the number of
training examples considered, and the settings of various learning algorithm parameters.
• Fast evaluation of the learned target function may be required. Although ANN learning times are
relatively long, evaluating the learned network, in order to apply it to a subsequent instance, is
typically very fast.
• The ability of humans to understand the learned target function is not important. The weights
learned by neural networks are often difficult for humans to interpret. Learned neural networks
are less easily communicated to humans than learned rules.


PERCEPTRONS
One type of ANN system is based on a unit called a perceptron, illustrated in Figure 4.2. A perceptron
applies a step function to a linear combination of its real-valued inputs: it takes a vector of real-
valued inputs, calculates a linear combination of these inputs, then outputs 1 if the result is greater
than some threshold and -1 otherwise. More precisely, given inputs x1 through xn, the output
o(x1, ..., xn) computed by the perceptron is

$$o(x_1, \ldots, x_n) = \begin{cases} 1 & \text{if } w_0 + w_1 x_1 + \cdots + w_n x_n > 0 \\ -1 & \text{otherwise} \end{cases}$$

where each wi is a real-valued constant, or weight, that determines the contribution of input xi to the
perceptron output. Notice that the quantity (-w0) is a threshold that the weighted combination of inputs
w1x1 + ... + wnxn must surpass in order for the perceptron to output a 1.

Representation of AND function


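A single perceptron can be used to represent many boolean functions. For example, if we encode the
inputs as 1 (true) and 0 (false) and read an output of 1 as true and -1 as false, a two-input
perceptron can implement the AND function by setting the weights w0 = -0.8 and w1 = w2 = 0.5.

This perceptron can be made to represent the OR function instead by altering the threshold to
w0 = -0.3. (If the inputs are instead encoded as 1 and -1, the AND weights still work, but OR
requires w0 = +0.3, since with w0 = -0.3 the input (1, -1) gives a sum of -0.3 and is rejected.)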
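The weight settings quoted for AND and OR can be checked directly. The following is a minimal sketch, not part of the original notes: the function name `perceptron_output` is hypothetical, and it assumes inputs encoded as 1 (true) and 0 (false), with an output of 1 read as true and -1 as false.

```python
def perceptron_output(weights, inputs):
    """Return 1 if w0 + w1*x1 + ... + wn*xn > 0, else -1."""
    w0, rest = weights[0], weights[1:]
    total = w0 + sum(w * x for w, x in zip(rest, inputs))
    return 1 if total > 0 else -1

# Weights quoted in the text: [w0, w1, w2]
AND_WEIGHTS = [-0.8, 0.5, 0.5]
OR_WEIGHTS = [-0.3, 0.5, 0.5]

# Assumed encoding: inputs are 1 (true) or 0 (false); output 1 means true.
for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2,
              "AND:", perceptron_output(AND_WEIGHTS, (x1, x2)),
              "OR:", perceptron_output(OR_WEIGHTS, (x1, x2)))
```

Only the threshold term w0 differs between the two functions; the input weights are shared.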


A perceptron draws a hyperplane as the decision boundary over the (n-dimensional) input space.

A perceptron can learn only sets of examples that are “linearly separable”, that is, examples that
can be perfectly separated by a hyperplane.

Perceptrons can learn many boolean functions, such as AND, OR, NAND, and NOR, but not XOR, which is
not linearly separable.

Perceptron Learning
Learning a perceptron means finding the right values for W. The hypothesis space of a perceptron is
the space of all weight vectors. The perceptron learning algorithm can be stated as below.

1. Assign random values to the weight vector


2. Apply the weight update rule to every training example
3. Are all training examples correctly classified?
a. Yes. Quit
b. No. Go back to Step 2.

There are two popular weight update rules.


i) The perceptron rule, and
ii) Delta rule

The Perceptron Rule


For a new training example X = (x1, x2, …, xn), update each weight according to this rule:
wi ← wi + Δwi
where Δwi = η (t - o) xi
t: target output
o: output generated by the perceptron
η: positive constant called the learning rate (e.g., 0.1)


Comments about the perceptron training rule:


• If the example is correctly classified, the term (t - o) equals zero and no weight is updated.
• If the perceptron outputs -1 and the target is 1, then (t - o) = 2 and each weight moves in the
direction of its input xi, so weights on positive inputs are increased.
• If the perceptron outputs 1 and the target is -1, then (t - o) = -2 and weights on positive inputs
are decreased.
• Provided the examples are linearly separable and a sufficiently small value of η is used, the rule
is guaranteed to converge to weights that classify all training examples correctly (i.e., that are
consistent with the training data).
Strength:
If the data is linearly separable and η is set to a sufficiently small value, the rule converges in a
finite number of iterations to a hypothesis that classifies all training data correctly.
Weakness:
If the data is not linearly separable, the rule does not converge.
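The learning loop and the perceptron rule can be sketched together as follows. This is a minimal illustration under stated assumptions, not the notes' own implementation: the names `predict` and `train_perceptron`, the 0/1 input encoding, the fixed random seed, and the `max_epochs` cap (needed so the loop stops on data that is not linearly separable) are all hypothetical choices.

```python
import random

def predict(weights, x):
    """Thresholded output: 1 if w0 + sum(wi * xi) > 0, else -1."""
    s = weights[0] + sum(w * xi for w, xi in zip(weights[1:], x))
    return 1 if s > 0 else -1

def train_perceptron(examples, eta=0.1, max_epochs=1000, seed=0):
    """Apply wi <- wi + eta * (t - o) * xi until every example is classified correctly."""
    rng = random.Random(seed)
    n = len(examples[0][0])
    weights = [rng.uniform(-0.5, 0.5) for _ in range(n + 1)]  # index 0 is w0
    for _ in range(max_epochs):
        mistakes = 0
        for x, t in examples:
            o = predict(weights, x)
            if o != t:
                mistakes += 1
                weights[0] += eta * (t - o)          # x0 is fixed at 1
                for i, xi in enumerate(x, start=1):
                    weights[i] += eta * (t - o) * xi
        if mistakes == 0:                            # a full pass with no errors
            return weights, True
    return weights, False                            # gave up: epoch cap reached

AND_EXAMPLES = [((0, 0), -1), ((0, 1), -1), ((1, 0), -1), ((1, 1), 1)]
XOR_EXAMPLES = [((0, 0), -1), ((0, 1), 1), ((1, 0), 1), ((1, 1), -1)]

and_weights, and_converged = train_perceptron(AND_EXAMPLES)
xor_weights, xor_converged = train_perceptron(XOR_EXAMPLES)
print("AND converged:", and_converged)
print("XOR converged:", xor_converged)
```

AND is linearly separable, so the loop ends with a pass that makes no mistakes. XOR is not, so every pass contains at least one misclassified example and training only stops at the epoch cap.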

Example : Consider the following set of points


Repeat the process for all training examples.


Gradient Descent
Suppose you are at the top of a mountain and have to reach a lake at the lowest point of the
mountain. The twist is that you are blindfolded, so you cannot see where you are headed. What
approach will you take to reach the lake? The best way is to check the ground near you and observe
where the land descends most steeply. This tells you in which direction to take your first step. If
you keep following the descending path, it is very likely that you will reach the lake. Gradient
descent does the same with an error function: at each step it moves the weights a small distance in
the direction in which the error falls fastest.
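The analogy can be made concrete with a one-dimensional example. This is a minimal sketch; the "mountain" f(x) = (x - 3)^2, the starting point, and the step size are made-up values chosen only for illustration.

```python
def gradient_descent(grad, start, step_size=0.1, steps=100):
    """Repeatedly step against the gradient, i.e. downhill."""
    x = start
    for _ in range(steps):
        x -= step_size * grad(x)
    return x

# Hypothetical "mountain": height f(x) = (x - 3)**2, lowest point (the lake) at x = 3.
def slope(x):
    return 2 * (x - 3)

lake = gradient_descent(slope, start=10.0)
print(lake)
```

Each step uses only the local slope, just as the blindfolded walker uses only the ground underfoot.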

The Delta Rule


Although the perceptron rule finds a successful weight vector when the training examples are
linearly separable, it can fail to converge if the examples are not linearly separable. A second
training rule, called the delta rule, is designed to overcome this difficulty. If the training examples
are not linearly separable, the delta rule converges toward a best-fit approximation to the target
concept. The key idea behind the delta rule is to use gradient descent to search the hypothesis space
of possible weight vectors to find the weights that best fit the training examples. The delta rule is
stated in terms of a linear unit, described next.

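Linear unit:
• A linear unit can be thought of as an unthresholded perceptron.
• The output of a k-input linear unit, with the input x0 fixed at 1, is

$$o = \sum_{i=0}^{k-1} w_i x_i$$

• It is not reasonable to use a boolean notion of error for linear units, so we need to use
something else.
• We will use a sum-of-squares measure of error E, under hypothesis (weights) (w0, ..., wk-1)
and training set D:

$$E(w_0, \ldots, w_{k-1}) = \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2$$

where td is training example d's target output value and od is the output of the linear unit for d's
inputs, i.e.

$$E(w_0, \ldots, w_{k-1}) = \frac{1}{2} \sum_{d \in D} \Bigl(t_d - \sum_{i=0}^{k-1} w_i x_{id}\Bigr)^2$$

• As a function of the weights, E is a paraboloid (bowl-shaped) and has a single global minimum.
• Gradient descent aims to find the minimum by repeatedly taking a small step in the direction
of the negative gradient, i.e. the direction of steepest descent.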
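Training a linear unit by gradient descent on the sum-of-squares error can be sketched as below. This is a minimal illustration under assumptions: the batch update wi ← wi + η Σd (td − od) xid is the standard gradient-descent rule for this E and is assumed here rather than quoted from the notes; the function names, the data set, the learning rate, and the epoch count are hypothetical.

```python
def linear_output(weights, x):
    """Unthresholded output o = w0 + w1*x1 + ... ."""
    return weights[0] + sum(w * xi for w, xi in zip(weights[1:], x))

def squared_error(weights, data):
    """E = 1/2 * sum over examples of (t - o)^2."""
    return 0.5 * sum((t - linear_output(weights, x)) ** 2 for x, t in data)

def delta_rule(data, eta=0.05, epochs=500):
    """Batch gradient descent: wi <- wi + eta * sum_d (t_d - o_d) * x_id."""
    n = len(data[0][0])
    weights = [0.0] * (n + 1)
    for _ in range(epochs):
        delta = [0.0] * (n + 1)
        for x, t in data:
            err = t - linear_output(weights, x)
            delta[0] += eta * err                    # x0 is fixed at 1
            for i, xi in enumerate(x, start=1):
                delta[i] += eta * err * xi
        weights = [w + d for w, d in zip(weights, delta)]
    return weights

# Hypothetical data generated from t = 1 + 2*x
DATA = [((0.0,), 1.0), ((1.0,), 3.0), ((2.0,), 5.0), ((3.0,), 7.0)]
w = delta_rule(DATA)
print(w, squared_error(w, DATA))
```

Because the error surface of a linear unit has a single minimum, a sufficiently small learning rate lets the weights settle at the least-squares solution. A rate that is too large makes the updates diverge.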


Strengths:
• Converges toward the least-squares-error weights for the training data
• The data does not need to be linearly separable
• Can be extended to multilayer ANNs
Weakness:
• Does not necessarily converge to a perfect hypothesis on linearly separable data

There are two differences between the perceptron rule and the delta rule. First, the perceptron rule
updates weights based on the error in the thresholded (step function) output, whereas the delta rule uses
the error in the unthresholded linear combination of inputs. Second, the perceptron rule is guaranteed to
converge to a consistent hypothesis in a finite number of steps, assuming the data is linearly separable,
whereas the delta rule converges only in the limit toward the minimum-error hypothesis, but does so
whether or not the data is linearly separable.
There are two main difficulties with the gradient descent method:
1. Convergence to a minimum may take a long time.

2. If the error surface has several local minima, there is no guarantee that the global minimum will be
found.


DERIVATION OF THE GRADIENT DESCENT RULE
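The derivation itself is not reproduced in these notes. As a hedged reconstruction, assuming the linear unit and the sum-of-squares error E defined in the delta rule section, and writing x_{id} for input i of training example d (a symbol introduced here), the standard argument differentiates E with respect to each weight:

$$
\begin{aligned}
\frac{\partial E}{\partial w_i}
  &= \frac{\partial}{\partial w_i}\,\frac{1}{2}\sum_{d \in D}(t_d - o_d)^2 \\
  &= \frac{1}{2}\sum_{d \in D} 2\,(t_d - o_d)\,\frac{\partial}{\partial w_i}(t_d - o_d) \\
  &= \sum_{d \in D}(t_d - o_d)\,\frac{\partial}{\partial w_i}\Bigl(t_d - \sum_{j} w_j x_{jd}\Bigr) \\
  &= -\sum_{d \in D}(t_d - o_d)\,x_{id}
\end{aligned}
$$

Stepping against the gradient with learning rate η then gives the weight update

$$\Delta w_i = -\eta\,\frac{\partial E}{\partial w_i} = \eta \sum_{d \in D}(t_d - o_d)\,x_{id}$$

which has the same form as the perceptron rule, except that o_d is the unthresholded output and the sum runs over all training examples.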


MULTILAYER NETWORKS AND BACKPROPAGATION ALGORITHM

As seen earlier, single perceptrons can only express linear decision surfaces. In contrast, the kind of
multilayer networks learned by the BACKPROPAGATION algorithm are capable of expressing a
rich variety of nonlinear decision surfaces.
A Multilayer Feed-Forward Neural Network
The backpropagation algorithm performs learning on a multilayer feed-forward neural network. It
iteratively learns a set of weights for prediction of the class label of tuples. A multilayer feed-
forward neural network consists of an input layer, one or more hidden layers, and an output layer.
An example of a multilayer feed-forward network is shown in Figure 9.2.

Each layer is made up of units. The inputs to the network correspond to the attributes measured for
each training tuple. The inputs are fed simultaneously into the units making up the input layer.
These inputs pass through the input layer and are then weighted and fed simultaneously to a second
layer of “neuronlike” units, known as a hidden layer. The outputs of the hidden layer units can be
input to another hidden layer, and so on. The number of hidden layers is arbitrary, although in
practice, usually only one is used. The weighted outputs of the last hidden layer are input to units
making up the output layer, which emits the network’s prediction for given tuples. The units in the
input layer are called input units. The units in the hidden layers and output layer are sometimes
referred to as neurodes, due to their symbolic biological basis, or as output units. The multilayer
neural network shown in Figure 9.2 has two layers of output units. Therefore, we say that it is a two-
layer neural network. (The input layer is not counted because it serves only to pass the input values
to the next layer.) Similarly, a network containing two hidden layers is called a three-layer neural
network, and so on. It is a feed-forward network since none of the weights cycles back to an input
unit or to a previous layer’s output unit. It is fully connected in that each unit provides input to each
unit in the next forward layer. Each output unit takes, as input, a weighted sum of the outputs from
units in the previous layer (see Figure 9.4 later). It applies a nonlinear (activation) function to the
weighted input. Multilayer feed-forward neural networks are able to model the class prediction as a
nonlinear combination of the inputs. From a statistical point of view, they perform nonlinear
regression. Multilayer feed-forward networks, given enough hidden units and enough training
samples, can closely approximate any function.

Backpropagation
“How does backpropagation work?” Backpropagation learns by iteratively processing a data set of
training tuples, comparing the network’s prediction for each tuple with the actual known target
value. The target value may be the known class label of the training tuple (for classification
problems) or a continuous value (for numeric prediction). For each training tuple, the weights are
modified so as to minimize the mean-squared error between the network’s prediction and the actual
target value. These modifications are made in the “backwards” direction (i.e., from the output layer)
through each hidden layer down to the first hidden layer (hence the name backpropagation).
Although it is not guaranteed, in general the weights will eventually converge, and the learning
process stops. The algorithm is summarized in Figure 9.3. The steps involved are expressed in terms
of inputs, outputs, and errors.


Initialize the weights: The weights in the network are initialized to small random numbers (e.g.,
ranging from -1.0 to 1.0, or -0.5 to 0.5). Each unit has a bias associated with it, as explained later.
The biases are similarly initialized to small random numbers. Each training tuple, X, is processed by
the following steps.

Propagate the inputs forward: First, the training tuple is fed to the network’s input layer. The
inputs pass through the input units unchanged. That is, for an input unit j, its output, Oj, is equal to
its input value, Ij. Next, the net input and output of each unit in the hidden and output layers are
computed. The net input to a unit in the hidden or output layers is computed as a linear combination
of its inputs. To help illustrate this point, a hidden layer or output layer unit is shown in Figure 9.4.
Each such unit has a number of inputs to it that are, in fact, the outputs of the units connected to it in
the previous layer. Each connection has a weight. To compute the net input to the unit, each input
connected to the unit is multiplied by its corresponding weight, and the products are summed. Given a
unit j in a hidden or output layer, the net input, Ij, to unit j is

$$I_j = \sum_i w_{ij} O_i + \theta_j$$

where wij is the weight of the connection from unit i in the previous layer to unit j; Oi is the output
of unit i from the previous layer; and θj is the bias of the unit. The bias acts as a threshold in that it
serves to vary the activity of the unit. Each unit in the hidden and output layers takes its net input
and then applies an activation function to it, as illustrated in Figure 9.4. The function symbolizes the
activation of the neuron represented by the unit. The logistic, or sigmoid, function is used. Given the
net input Ij to unit j, then Oj, the output of unit j, is computed as

$$O_j = \frac{1}{1 + e^{-I_j}}$$

This function is also referred to as a squashing function, because it maps a large input domain onto
the smaller range of 0 to 1. The logistic function is nonlinear and differentiable, allowing the
backpropagation algorithm to model classification problems that are linearly inseparable. We
compute the output values, Oj , for each hidden layer, up to and including the output layer, which
gives the network’s prediction. In practice, it is a good idea to cache (i.e., save) the intermediate
output values at each unit as they are required again later when backpropagating the error. This trick
can substantially reduce the amount of computation required.
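The forward pass, including the caching of intermediate outputs, can be sketched as follows. This is a minimal illustration, assuming a fully connected network stored as a list of (weights, biases) pairs; the function names, the storage layout `weights[i][j]`, and the 2-2-1 network with made-up weights are hypothetical.

```python
import math

def sigmoid(net_input):
    """Logistic function: squashes any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-net_input))

def layer_outputs(inputs, weights, biases):
    """Compute O_j = sigmoid(sum_i w_ij * O_i + theta_j) for every unit j in a layer.

    weights[i][j] is the weight from unit i in the previous layer to unit j.
    """
    outputs = []
    for j, theta_j in enumerate(biases):
        net_j = sum(o_i * weights[i][j] for i, o_i in enumerate(inputs)) + theta_j
        outputs.append(sigmoid(net_j))
    return outputs

def forward(x, layers):
    """Propagate x through the layers, caching every layer's outputs."""
    cache = [list(x)]            # input units pass their values through unchanged
    for weights, biases in layers:
        cache.append(layer_outputs(cache[-1], weights, biases))
    return cache

# Hypothetical 2-2-1 network with made-up weights and biases
LAYERS = [
    ([[0.1, -0.2], [0.4, 0.3]], [0.0, 0.1]),   # input -> hidden
    ([[0.5], [-0.5]], [0.2]),                  # hidden -> output
]
cache = forward([1.0, 0.0], LAYERS)
print(cache[-1])   # the network's prediction
```

The returned `cache` holds the outputs of every layer, which is what the backward pass needs when it computes the error terms.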

Backpropagate the error: The error is propagated backward by updating the weights and biases to
reflect the error of the network’s prediction. For a unit j in the output layer, the error Errj is
computed by

$$Err_j = O_j (1 - O_j)(T_j - O_j)$$

where Oj is the actual output of unit j, and Tj is the known target value of the given training tuple.
Note that Oj(1 - Oj) is the derivative of the logistic function. To compute the error of a hidden layer
unit j, the weighted sum of the errors of the units connected to unit j in the next layer is considered.
The error of a hidden layer unit j is

$$Err_j = O_j (1 - O_j) \sum_k Err_k \, w_{jk}$$

where wjk is the weight of the connection from unit j to a unit k in the next higher layer, and Errk is
the error of unit k. The weights and biases are updated to reflect the propagated errors. Weights are
updated by the following equations, where Δwij is the change in weight wij:

$$\Delta w_{ij} = (l) \, Err_j \, O_i$$
$$w_{ij} = w_{ij} + \Delta w_{ij}$$


The variable l is the learning rate, a constant typically having a value between 0.0 and 1.0.
Backpropagation learns using a gradient descent method to search for a set of weights that fits the
training data so as to minimize the mean squared distance between the network’s class prediction
and the known target value of the tuples. The learning rate helps avoid getting stuck at a local
minimum in decision space (i.e., where the weights appear to converge, but are not the optimum
solution) and encourages finding the global minimum. If the learning rate is too small, then learning
will occur at a very slow pace. If the learning rate is too large, then oscillation between inadequate
solutions may occur. A rule of thumb is to set the learning rate to 1/t, where t is the number of
iterations through the training set so far. Biases are updated by the following equations, where Δθj is
the change in bias θj:

$$\Delta \theta_j = (l) \, Err_j$$
$$\theta_j = \theta_j + \Delta \theta_j$$

Note that here we are updating the weights and biases after the presentation of each tuple. This is
referred to as case updating. Alternatively, the weight and bias increments could be accumulated in
variables, so that the weights and biases are updated after all the tuples in the training set have been
presented. This latter strategy is called epoch updating, where one iteration through the training set
is an epoch. In theory, the mathematical derivation of backpropagation employs epoch updating, yet
in practice, case updating is more common because it tends to yield more accurate results.
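One case update (forward pass, error computation, then weight and bias changes for a single tuple) can be sketched as follows. This is a simplified illustration, not a general implementation: it assumes one hidden layer and a single output unit, and the function name `backprop_step`, the storage layout, and the starting weights are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def backprop_step(x, target, w_hidden, b_hidden, w_out, b_out, rate):
    """One case update for a network with one hidden layer and one output unit.

    w_hidden[i][j]: weight from input i to hidden unit j
    w_out[j]:       weight from hidden unit j to the output unit
    The lists are updated in place; returns (output before update, new output bias).
    """
    # Forward pass (outputs are kept for the backward pass)
    hidden = [sigmoid(sum(x[i] * w_hidden[i][j] for i in range(len(x))) + b_hidden[j])
              for j in range(len(b_hidden))]
    out = sigmoid(sum(h * w for h, w in zip(hidden, w_out)) + b_out)

    # Errors: output unit first, then hidden units using the not-yet-updated w_out
    err_out = out * (1 - out) * (target - out)
    err_hidden = [h * (1 - h) * err_out * w_out[j] for j, h in enumerate(hidden)]

    # Updates: w_ij += rate * Err_j * O_i and theta_j += rate * Err_j
    for j, h in enumerate(hidden):
        w_out[j] += rate * err_out * h
    b_out += rate * err_out
    for j, e in enumerate(err_hidden):
        for i, xi in enumerate(x):
            w_hidden[i][j] += rate * e * xi
        b_hidden[j] += rate * e
    return out, b_out

# Hypothetical starting weights for a 2-2-1 network
w_h = [[0.2, -0.3], [0.4, 0.1]]
b_h = [-0.4, 0.2]
w_o = [-0.3, -0.2]
b_o = 0.1
first, b_o = backprop_step([1.0, 0.0], 1.0, w_h, b_h, w_o, b_o, rate=0.5)
second, b_o = backprop_step([1.0, 0.0], 1.0, w_h, b_h, w_o, b_o, rate=0.5)
print(first, second)
```

Presenting the same tuple twice shows the effect of the update: the second output is closer to the target than the first. For epoch updating, the increments would instead be accumulated over all tuples and applied once per pass.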

Terminating condition: Training stops when
• all Δwij in the previous epoch are so small as to be below some specified threshold, or
• the percentage of tuples misclassified in the previous epoch is below some threshold, or
• a prespecified number of epochs has expired.
In practice, several hundreds of thousands of epochs may be required before the weights will converge.
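The three stopping tests can be combined in a small helper. This is a sketch only: the function name `should_stop` and all three default thresholds are made-up values, since the notes do not specify any.

```python
def should_stop(weight_changes, misclassified_fraction, epoch,
                change_threshold=1e-4, error_threshold=0.01, max_epochs=10000):
    """Return True if any of the three terminating conditions holds.

    weight_changes:         all weight changes from the previous epoch
    misclassified_fraction: share of tuples misclassified in the previous epoch
    epoch:                  number of epochs completed so far
    """
    small_changes = all(abs(dw) < change_threshold for dw in weight_changes)
    few_errors = misclassified_fraction < error_threshold
    out_of_epochs = epoch >= max_epochs
    return small_changes or few_errors or out_of_epochs

print(should_stop([0.1, -0.02], misclassified_fraction=0.2, epoch=3))
```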

“How efficient is backpropagation?” The computational efficiency depends on the time spent
training the network. Given |D| tuples and w weights, each epoch requires O(|D| × w) time. However,
in the worst-case scenario, the number of epochs can be exponential in n, the number of inputs. In
practice, the time required for the networks to converge is highly variable. A number of techniques
exist that help speed up the training time. For example, a technique known as simulated annealing
can be used, which also helps the search escape local minima and, given a sufficiently slow cooling
schedule, converge to a global optimum.


Sample calculations for learning by the backpropagation algorithm.


Figure 9.5 shows a multilayer feed-forward neural network. Let the learning rate be 0.9. The initial
weight and bias values of the network are given in Table 9.1, along with the first training tuple,
X = (1, 0, 1), with a class label of 1. This example shows the calculations for backpropagation, given the
first training tuple, X. The tuple is fed into the network, and the net input and output of each unit are
computed. These values are shown in Table 9.2. The error of each unit is computed and propagated
backward. The error values are shown in Table 9.3. The weight and bias updates are shown in Table
9.4.
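The calculation can be followed step by step in code. Tables 9.1 to 9.4 are not reproduced in these notes, so the initial weights and biases below are illustrative assumptions, as is the network layout (input units 1-3, hidden units 4-5, output unit 6). Only the learning rate 0.9, the tuple X = (1, 0, 1), and the class label 1 come from the text.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# From the text
x1, x2, x3 = 1, 0, 1          # training tuple X = (1, 0, 1)
target = 1                    # class label
rate = 0.9                    # learning rate

# Illustrative initial weights and biases (assumed, not taken from Table 9.1)
w14, w15 = 0.2, -0.3
w24, w25 = 0.4, 0.1
w34, w35 = -0.5, 0.2
w46, w56 = -0.3, -0.2
theta4, theta5, theta6 = -0.4, 0.2, 0.1

# Net input and output of each unit
o4 = sigmoid(w14 * x1 + w24 * x2 + w34 * x3 + theta4)
o5 = sigmoid(w15 * x1 + w25 * x2 + w35 * x3 + theta5)
o6 = sigmoid(w46 * o4 + w56 * o5 + theta6)

# Error of each unit (output unit first, then hidden units)
err6 = o6 * (1 - o6) * (target - o6)
err5 = o5 * (1 - o5) * err6 * w56
err4 = o4 * (1 - o4) * err6 * w46

# A few of the weight and bias updates
new_w46 = w46 + rate * err6 * o4
new_w56 = w56 + rate * err6 * o5
new_w14 = w14 + rate * err4 * x1
new_theta6 = theta6 + rate * err6

print(f"O4={o4:.3f} O5={o5:.3f} O6={o6:.3f}")
print(f"Err6={err6:.4f} Err5={err5:.4f} Err4={err4:.4f}")
print(f"w46={new_w46:.3f} w56={new_w56:.3f} w14={new_w14:.3f} theta6={new_theta6:.3f}")
```

The remaining weights and biases are updated in the same way, each using the error of the unit the connection feeds into and the output of the unit it comes from.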


“How can we classify an unknown tuple using a trained network?” To classify an unknown tuple, X,
the tuple is input to the trained network, and the net input and output of each unit are computed.
(There is no need for computation and/or backpropagation of the error.) If there is one output node
per class, then the output node with the highest value determines the predicted class label for X. If
there is only one output node, then output values greater than or equal to 0.5 may be considered as
belonging to the positive class, while values less than 0.5 may be considered negative. Several
variations and alternatives to the backpropagation algorithm have been proposed for classification in
neural networks. These may involve the dynamic adjustment of the network topology and of the
learning rate or other parameters, or the use of different error functions.
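The two decision rules described above (highest-valued output node, or a 0.5 threshold on a single output node) can be sketched as follows. The function name `predict_class`, the label strings, and the example output values are hypothetical.

```python
def predict_class(outputs, class_labels=None, threshold=0.5):
    """Map the output-layer values of a trained network to a class label.

    One output node per class: pick the label of the largest output.
    Single output node: positive if output >= threshold, else negative.
    """
    if len(outputs) == 1:
        return "positive" if outputs[0] >= threshold else "negative"
    best = max(range(len(outputs)), key=lambda k: outputs[k])
    return class_labels[best] if class_labels else best

print(predict_class([0.12, 0.81, 0.33], ["cat", "dog", "bird"]))
print(predict_class([0.5]))
```

Only a forward pass is needed to obtain `outputs`; no error computation or weight update takes place at classification time.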


Exercises

1. Design a two-input perceptron that implements the boolean function A ∧ ¬B. Design a two-
layer network of perceptrons that implements A XOR B.
2. Consider two perceptrons defined by the threshold expression w0 + w1x1 + w2x2 > 0.
Perceptron A has weight values

Perceptron B has weight values

True or false? Perceptron A is more-general than perceptron B


The BACKPROPAGATION Algorithm

The BACKPROPAGATION algorithm learns the weights for a multilayer network, given a network
with a fixed set of units and interconnections. It employs gradient descent to attempt to minimize the
squared error between the network output values and the target values for these outputs. Because we
are considering networks with multiple output units rather than single units as before, we begin by
redefining E to sum the errors over all of the network output units:

$$E(\vec{w}) = \frac{1}{2} \sum_{d \in D} \sum_{k \in outputs} (t_{kd} - o_{kd})^2$$

where outputs is the set of output units in the network, and tkd and okd are the target and output
values associated with the kth output unit and training example d. The learning problem faced by
BACKPROPAGATION is to search a large hypothesis space defined by all possible weight
values for all the units in the network.
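The redefined error, summed over every training example and every output unit, can be computed as in this minimal sketch. The function name `network_error` and the small target and output tables are made up for illustration.

```python
def network_error(targets, outputs):
    """E = 1/2 * sum over examples d and output units k of (t_kd - o_kd)^2.

    targets[d][k] and outputs[d][k] hold the values for output unit k on example d.
    """
    return 0.5 * sum(
        (t_kd - o_kd) ** 2
        for t_d, o_d in zip(targets, outputs)
        for t_kd, o_kd in zip(t_d, o_d)
    )

# Hypothetical targets and network outputs: 2 training examples, 2 output units
T = [[1.0, 0.0], [0.0, 1.0]]
O = [[0.8, 0.1], [0.3, 0.6]]
print(network_error(T, O))
```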

The backpropagation algorithm is presented below:
