
Day & Time: Monday (10am-11am & 3pm-4pm)

Tuesday (10am-11am)
Wednesday (10am-11am & 3pm-4pm)
Friday (9am-10am, 11am-12pm, 2pm-3pm)
Dr. Srinivasa L. Chakravarthy
&
Smt. Jyotsna Rani Thota
Department of CSE
GITAM Institute of Technology (GIT)
Visakhapatnam – 530045
Email: slade@gitam.edu & jthota@gitam.edu
Course objectives
• Explore the various disciplines connected with ML.
• Explore the efficiency of learning with inductive bias.
• Explore ML algorithms such as decision tree learning.
• Explore algorithms such as artificial neural networks, genetic programming, Bayesian learning, the nearest neighbour algorithm, and hidden Markov models.


Learning Outcomes
• Identify the various applications connected with ML.
• Classify the efficiency of ML algorithms with the inductive bias technique.
• Distinguish the purpose of each ML algorithm.
• Analyze an application and correlate it with the available ML algorithms.
• Choose an ML algorithm to develop a project.


Syllabus



Reference book 1
Title: Machine Learning
Author: Tom M. Mitchell


Reference book 2
Title: Introduction to Machine Learning
Author: Ethem Alpaydin


Module 2
It includes:

Chapter 4: Artificial Neural Networks
• Neural network representation
• Appropriate problems for neural network learning
• Perceptrons
• Multilayer networks and the backpropagation algorithm
• Advanced topics in neural networks
&
Chapter 9: Genetic Algorithms
Chapter 4

Motivation for Artificial Neural Networks (ANNs)

1. ANNs are inspired by biological learning systems, which are built from complex webs of interconnected neurons.

2. An ANN is built from an interconnected set of simple units, where each unit takes a number of real-valued inputs and produces a real-valued output.

(These inputs may be the outputs of other units, and these outputs may in turn be the inputs to other units.)
Introduction
• ANNs provide a method for learning
  real-valued,
  discrete-valued, and
  vector-valued functions from examples.

• ANNs are robust to errors in the training data.

• ANNs are applicable to problems such as
  interpreting visual scenes,
  speech recognition, and
  learning robot control strategies.
ANN algorithms in this chapter

• The backpropagation algorithm is the ANN learning algorithm discussed in this chapter.

• Backpropagation uses the gradient descent technique to tune the network parameters so that they best fit a training set of input-output pairs.

• In general, ANNs can be based on graphs of many kinds: acyclic or cyclic, directed or undirected.

• This chapter focuses on backpropagation for networks with a fixed structure corresponding to directed graphs, possibly containing cycles.
Consider the neural system of the human brain:

Number of neurons ≈ 10^10
Connections per neuron ≈ 10^4
Scene recognition time ≈ 10^-1 seconds
Neuron switching time ≈ 10^-3 seconds

Neuron switching is slow compared to artificial computing units (switching times around 10^-10 seconds), yet humans can still make complex decisions quickly.
Comparison

From this discussion:

- A biological neural system relies on highly parallel processes distributed over many neurons.

- Many of the complexities of biological neural systems are not modeled in artificial neural networks.
Two groups of research are active in ANNs:

1. Using ANNs to study and model biological learning processes.

2. Obtaining highly effective ML algorithms, independent of whether these algorithms mirror biological processes.
Neural network representation
For example,
ALVINN: a system that uses a learned ANN to steer an autonomous vehicle driving at normal speeds on public highways.

Input = a 30 x 32 grid of pixel intensities obtained from a forward-pointed camera mounted on the vehicle
(these 960 inputs feed a layer of hidden units, each of which forms a weighted combination of the inputs).

Output = the direction in which the vehicle is steered, encoded by 30 output units.
In this picture:
- The large matrix of black and white boxes shows the weights from the 30 x 32 pixel inputs to one hidden unit.
- A small white box indicates a positive weight.
- A small black box indicates a negative weight.
- The size of a box indicates the weight magnitude.
- The smaller rectangle on top shows the weights from that hidden unit to each of the 30 output units.
Problems suitable for ANNs

ANNs are well suited to
- noisy data and
- complex sensor data
(i.e., data from cameras and microphones).
Problems suitable for ANNs (cont.)
ANNs can deal with problems where:
• Instances are represented by many attributes, and the input attributes may be highly correlated or independent of one another.

• The target function output may be discrete-valued, real-valued, or a vector of such attributes.

• Long training times are acceptable.

• Fast evaluation of the learned target function is required.

• The ability of humans to understand the learned target function is not important.
Note:
To train a single unit, an ANN is built from primitive units such as
- perceptrons,
- linear units, and
- sigmoid units,
each with its own learning algorithm.

To train multilayer networks of such units, the backpropagation algorithm is used.
Perceptron

It takes
- a vector of real-valued inputs and
- calculates a linear combination of these inputs.

It outputs
- the value 1 if the result is greater than some threshold, and
- the value -1 otherwise.
Perceptron representation

o(x_1, ..., x_n) = 1 if w_0 + w_1 x_1 + w_2 x_2 + ... + w_n x_n > 0
                   -1 otherwise

(In vector form, the condition sum_i w_i x_i > 0 can also be written as w . x > 0.)

Here w is the weight vector and x is the input vector; a short sketch of this decision rule follows.
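A minimal Python sketch of the decision rule above (the function name, bias value, and example weights are illustrative, not taken from the slides):

```python
def perceptron_output(weights, bias, inputs):
    """Return 1 if the weighted sum exceeds the threshold (here 0), else -1."""
    activation = bias + sum(w * x for w, x in zip(weights, inputs))
    return 1 if activation > 0 else -1

# Two-input example: w = (0.5, 0.5), bias weight w0 = -0.3
print(perceptron_output([0.5, 0.5], -0.3, [1, 1]))   # 0.5 + 0.5 - 0.3 = 0.7 > 0  -> 1
print(perceptron_output([0.5, 0.5], -0.3, [0, 0]))   # -0.3 <= 0                  -> -1
```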
Representational power of perceptrons

- A single perceptron can represent many boolean functions.

- Examples of such boolean functions are AND, OR, NAND, and NOR (see the worked example below).

- Every boolean function can be represented by some network of perceptrons.

- In such a network, the inputs are fed into multiple units, and the outputs of these units are then fed as inputs to a second, final stage.
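As a worked illustration, with inputs in {0, 1}, the weight values below (one standard choice, following Mitchell, not the only possibility) make a single perceptron compute AND:

```latex
o(x_1, x_2) =
\begin{cases}
 1 & \text{if } -0.8 + 0.5\,x_1 + 0.5\,x_2 > 0 \\
-1 & \text{otherwise}
\end{cases}
% Only the input (1, 1) gives -0.8 + 0.5 + 0.5 = 0.2 > 0, so only AND(1, 1) outputs 1.
```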
Calculating the weight vector

The learning problem is to determine a weight vector that causes the perceptron to produce the correct output for every training example.

Many algorithms exist for this; in this chapter we discuss two:
1. The perceptron rule
2. The delta rule (a variant of the LMS rule, which tunes the weights iteratively)

These two algorithms are guaranteed to find acceptable hypotheses under somewhat different conditions.
Calculating the weight vector (cont.)

To find an acceptable weight vector:

- Begin with random weights.

- Apply the perceptron to each training example in turn.

- Modify the perceptron weights whenever an example is misclassified.

- Repeat these steps until the perceptron classifies all training examples correctly.
Calculating the weight vector (cont.)
Perceptron rule

w_i <- w_i + Δw_i,  where  Δw_i = η (t - o) x_i

Here t is the target output, o is the perceptron output, and η is the learning rate.

Note: the role of the learning rate is to moderate the degree to which the weights are changed at each step.
Weight vector updates in different cases

Case 1: If a training example is correctly classified, then (t - o) = 0, so no weights are updated.

Case 2: If the perceptron output is -1 and the target output is +1, then the weights must be altered to increase the weighted sum.

Case 3: If the input x_i > 0, then increasing w_i will move the perceptron toward classifying the example correctly.

Case 4: If x_i = 0.8, learning rate η = 0.1, t = 1, and o = -1, then the weight is increased; if instead t = -1 and o = 1, the weight is decreased rather than increased (see the calculation below).
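As a check on case 4, substituting those values into the perceptron training rule gives:

```latex
\Delta w_i = \eta\,(t - o)\,x_i = 0.1 \times \bigl(1 - (-1)\bigr) \times 0.8 = 0.16,
\qquad w_i \leftarrow w_i + 0.16 .
% If instead t = -1 and o = 1, then \Delta w_i = 0.1 \times (-2) \times 0.8 = -0.16,
% so the weight is decreased rather than increased.
```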
Weight vector updates in different cases (cont.)
The above perceptron training procedure converges to a weight vector that correctly classifies all training examples,

• provided the training examples are linearly separable (i.e., the positive and negative examples can be separated by a hyperplane), and

• provided a sufficiently small learning rate is used in the weight update formula.
Delta rule

With the perceptron rule, finding a correct weight vector is guaranteed only when the training examples are linearly separable.

The delta rule is designed to overcome this difficulty.

The delta rule uses gradient descent to search the hypothesis space of possible weight vectors for the weights that best fit the training examples.

Note: gradient descent is the basis for the backpropagation algorithm.

Delta rule (cont.)

Note: the delta rule and the perceptron rule look similar, but in the perceptron rule
o refers to the thresholded output,
whereas in the delta rule
o refers to the linear unit output.
Gradient descent

To derive a weight learning rule, we measure the training error of a hypothesis (weight vector) using

E(w) = (1/2) * sum over d in D of (t_d - o_d)^2

where D is the set of training examples, t_d is the target output for example d, and o_d is the output of the linear unit for example d.

Note: the error is defined as half the squared difference between the target output and the linear unit output, summed over all training examples.
To understand gradient descent, it helps to visualize the hypothesis space.

The horizontal axes w0 and w1 represent the two weights of a simple linear unit.

The vertical axis indicates the error relative to the training examples.

The arrow shows the negated gradient at a particular point, giving the direction of steepest descent along the error surface.
To understand gradient descent (cont.)

The gradient descent search:

- determines a weight vector that minimizes the error,
- by repeatedly modifying the weights in small steps;
- at each step, the weight vector is altered in the direction that produces the steepest descent along the error surface (as shown in the figure);
- the process continues until a minimum of the error is reached.
Representation of gradient descent derivation
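The derivation itself appears as a figure on the slide; the standard result (following Mitchell's notation, where x_id is the i-th input component of training example d) is sketched below:

```latex
\frac{\partial E}{\partial w_i}
  = \frac{\partial}{\partial w_i}\,\frac{1}{2}\sum_{d \in D}(t_d - o_d)^2
  = \sum_{d \in D}(t_d - o_d)\,\frac{\partial}{\partial w_i}\bigl(t_d - \vec{w}\cdot\vec{x}_d\bigr)
  = \sum_{d \in D}(t_d - o_d)\,(-x_{id})
\qquad\Longrightarrow\qquad
\Delta w_i = -\eta\,\frac{\partial E}{\partial w_i} = \eta \sum_{d \in D}(t_d - o_d)\,x_{id}
```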
Gradient descent algorithm-
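The algorithm itself is shown as a figure on the slide; below is a minimal Python sketch of batch gradient descent for a single linear unit, under the error definition above (the function and variable names, and the small example data set, are illustrative, not from the slides):

```python
def gradient_descent(examples, eta=0.1, epochs=200):
    """Batch gradient descent for a linear unit o = w . x.

    examples: list of (x, t) pairs, where x is a list of input values
              (include a constant 1.0 as x[0] so that w[0] acts as the bias)
              and t is the real-valued target output.
    """
    n = len(examples[0][0])
    w = [0.0] * n                                      # initial weights (could also be small random values)
    for _ in range(epochs):
        delta = [0.0] * n                              # accumulated updates for this pass over D
        for x, t in examples:
            o = sum(wi * xi for wi, xi in zip(w, x))   # linear unit output
            for i in range(n):
                delta[i] += eta * (t - o) * x[i]       # delta-rule contribution of this example
        w = [wi + di for wi, di in zip(w, delta)]      # update only after seeing all examples
    return w

# Example: learn the target t = 2 * x (x[0] is the constant bias input)
data = [([1.0, 0.0], 0.0), ([1.0, 1.0], 2.0), ([1.0, 2.0], 4.0)]
print(gradient_descent(data))                          # weights approach [0.0, 2.0]
```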
Difficulties with gradient descent

1. Converging to a minimum can require many gradient descent steps, so the procedure can be quite slow.

2. If there are multiple local minima on the error surface, there is no guarantee that the procedure will find the global minimum.

To alleviate these difficulties, gradient descent is modified into stochastic (or incremental) gradient descent, in which the error is measured for each training example separately:

E_d(w) = (1/2) * (t_d - o_d)^2
Observed differences

Standard gradient descent:
- The error is summed over all training examples before the weights are updated.
- It uses a larger step size per weight update.
- It requires more computation per weight-update step.

Stochastic gradient descent:
- The weights are updated after each training example.
- It uses a smaller step size per weight update than the standard version.
- It requires less computation per weight-update step than the standard version.
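For contrast with the batch sketch above, a minimal sketch of the stochastic (incremental) variant, under the same illustrative conventions:

```python
def stochastic_gradient_descent(examples, eta=0.1, epochs=200):
    """Incremental (stochastic) delta rule: update the weights after each example."""
    n = len(examples[0][0])
    w = [0.0] * n
    for _ in range(epochs):
        for x, t in examples:
            o = sum(wi * xi for wi, xi in zip(w, x))               # linear unit output
            w = [wi + eta * (t - o) * xi for wi, xi in zip(w, x)]  # per-example update
    return w
```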
Linear programming
• Besides the perceptron rule and the delta rule, there is another approach to computing the weight vector: linear programming.

• Like the perceptron rule, linear programming deals with linearly separable training examples.

• However, linear programming does not scale to multilayer networks, whereas the gradient descent approach underlying the delta rule does extend to multilayer networks.
Linear decision surfaces
So far, the discussion has been about single perceptrons, which can represent only linear decision surfaces (as in the figure below).

(a) A set of training examples and the decision surface of a perceptron that classifies them correctly.
(b) A set of training examples that is not linearly separable.
Multilayer networks
A single perceptron can express only a linear decision surface.

In contrast, multilayer networks learned by the backpropagation algorithm are capable of expressing a wide variety of nonlinear decision surfaces.

As an example, consider a speech recognition task that involves distinguishing among 10 possible spoken vowels occurring in the context "h_d" (i.e., hid, had, head, hood, etc.).

The decision surface for this task is represented by a multilayer network and is highly nonlinear.
Multilayer networks (cont.)
Figure (a) shows the multilayer network, which consists of:
1. two input parameters, F1 and F2, and
2. ten output units, one for each possible vowel sound in "h_d".

Figure (b) shows the resulting highly nonlinear decision surface.
Multilayer networks (cont.)
When constructing a multilayer network, what type of unit should be used for each layer:
a linear unit or a threshold unit?

Multiple layers of cascaded linear units still produce only linear functions,
but we would like networks that can also represent nonlinear functions.

How can this be achieved?
A differentiable threshold unit
We need a unit whose output is a nonlinear function of its inputs.

One such unit is the sigmoid unit.

The sigmoid unit is similar to the perceptron, but it is based on a smooth, differentiable threshold function.
Representation of the sigmoid unit

• Like the perceptron, the sigmoid unit first computes a linear combination of its inputs and then applies a threshold to the result; for the sigmoid unit, however, the threshold output is a continuous function of its input.

• The sigmoid function is also called the logistic function.

• Because it maps a very large input domain onto a small range of outputs, it is often referred to as a squashing function.
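In the usual notation (as in Mitchell), the sigmoid unit computes:

```latex
o = \sigma(\vec{w}\cdot\vec{x}),
\qquad
\sigma(y) = \frac{1}{1 + e^{-y}},
\qquad
\frac{d\,\sigma(y)}{dy} = \sigma(y)\,\bigl(1 - \sigma(y)\bigr)
```

The simple form of the derivative is what makes the sigmoid convenient for the gradient descent weight updates used by backpropagation.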


Backpropagation algorithm
• The backpropagation algorithm learns the weights for a multilayer network.

• It uses gradient descent to minimize the squared error between the network output values and the target values.

• Because we now consider networks with multiple output units rather than a single unit, the error E is redefined as

  E(w) = (1/2) * sum over d in D, sum over k in outputs of (t_kd - o_kd)^2

  where outputs is the set of output units in the network,
  t_kd and o_kd are the target and output values associated with the k-th output unit and training example d, and
  D is the set of training examples.
Backpropagation algorithm (cont.)
• The learning problem faced by backpropagation is to search a large hypothesis space defined by all possible weight values in the network.

• This search can be characterized in terms of the error surface E.

• In the case of training a single unit, gradient descent is used to find a hypothesis that minimizes E; in a multilayer network, however, the error surface can have multiple local minima.

• Consequently, gradient descent is guaranteed only to converge to some local minimum, not necessarily the global minimum error.
Backpropagation algorithm (cont.)

Despite this, backpropagation has been found to produce excellent results in many practical applications.

The stochastic gradient descent version of backpropagation presented here applies to layered feedforward networks containing two layers of sigmoid units.
Backpropagation algorithm for a feedforward network
Create a feedforward network with n_in input units, n_hidden hidden units, and n_out output units.
Backpropagation algorithm for a feedforward network (cont.)
• The algorithm shown applies to a feedforward network with two layers of sigmoid units.

• It is the stochastic gradient descent version of backpropagation: it computes the gradient of the error for each training example and then updates all the weights in the network.

• Notation used in the algorithm:
  - An index is assigned to each node, where a "node" is either an input to the network or the output of some unit in the network.
  - x_ji denotes the input from node i to unit j, and w_ji denotes the corresponding weight.
  - A sketch of these updates in code follows.
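The full algorithm appears as a figure on the slides; below is a minimal Python sketch of the stochastic-gradient version for one layer of hidden sigmoid units and one layer of output sigmoid units (the function names, list-based weight representation, and the XOR usage example are illustrative assumptions, not taken from the slides):

```python
import math
import random

def sigmoid(y):
    return 1.0 / (1.0 + math.exp(-y))

def backpropagation(examples, n_in, n_hidden, n_out, eta=0.1, epochs=1000):
    """Stochastic-gradient backpropagation for a two-layer feedforward network
    of sigmoid units.  examples: list of (x, t) pairs with len(x) == n_in and
    len(t) == n_out.  Weight row w[j] holds the weights into unit j; slot 0 of
    each row is the bias (threshold) weight."""
    w_hidden = [[random.uniform(-0.05, 0.05) for _ in range(n_in + 1)]
                for _ in range(n_hidden)]
    w_out = [[random.uniform(-0.05, 0.05) for _ in range(n_hidden + 1)]
             for _ in range(n_out)]

    for _ in range(epochs):
        for x, t in examples:
            # 1. Propagate the input forward through the network.
            xb = [1.0] + list(x)
            h = [sigmoid(sum(wi * xi for wi, xi in zip(w_hidden[j], xb)))
                 for j in range(n_hidden)]
            hb = [1.0] + h
            o = [sigmoid(sum(wi * hi for wi, hi in zip(w_out[k], hb)))
                 for k in range(n_out)]

            # 2. Error term for each output unit k: delta_k = o_k (1 - o_k)(t_k - o_k).
            delta_o = [o[k] * (1 - o[k]) * (t[k] - o[k]) for k in range(n_out)]

            # 3. Error term for each hidden unit j:
            #    delta_j = h_j (1 - h_j) * sum_k w_kj * delta_k.
            delta_h = [h[j] * (1 - h[j]) *
                       sum(w_out[k][j + 1] * delta_o[k] for k in range(n_out))
                       for j in range(n_hidden)]

            # 4. Update every weight: w_ji <- w_ji + eta * delta_j * x_ji.
            for k in range(n_out):
                for i in range(n_hidden + 1):
                    w_out[k][i] += eta * delta_o[k] * hb[i]
            for j in range(n_hidden):
                for i in range(n_in + 1):
                    w_hidden[j][i] += eta * delta_h[j] * xb[i]
    return w_hidden, w_out

# Illustrative usage: the non-linearly-separable XOR function.  Convergence
# depends on the random initial weights; more epochs or restarts may be needed.
xor = [([0, 0], [0]), ([0, 1], [1]), ([1, 0], [1]), ([1, 1], [0])]
w_h, w_o = backpropagation(xor, n_in=2, n_hidden=3, n_out=1, eta=0.3, epochs=5000)
```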
Backpropagation algorithm for a feedforward network (cont.)

• The weight-update loop in backpropagation may be iterated thousands of times in a typical application.

• A common modification is to add momentum: like a ball rolling across the error surface, the update keeps moving in the same direction from one iteration to the next rather than stopping as soon as it reaches a flat region.

• The weight-update rule is altered so that the update on the n-th iteration depends partially on the update that occurred during the (n-1)-th iteration:

  Δw_ji(n) = η δ_j x_ji + α Δw_ji(n-1)

  where α is a constant (0 < α < 1) called the momentum.
Advanced topics in ANNs

Alternative error functions

So far, the backpropagation algorithm has defined E as the sum of squared errors over the network outputs.

One alternative is to add a penalty term for the weight magnitudes, so that the gradient descent search seeks weight vectors with small magnitudes, thereby reducing the risk of overfitting (see the penalised error below).
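One common form of such a penalised error (the weight-decay form given in Mitchell, with γ a small constant controlling the strength of the penalty) is:

```latex
E(\vec{w}) =
\frac{1}{2}\sum_{d \in D}\sum_{k \in \mathrm{outputs}} (t_{kd} - o_{kd})^2
\;+\; \gamma \sum_{i,j} w_{ji}^{2}
```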
Advanced topics in ANNs (cont.)

Alternative error minimization procedures

Gradient descent is one of the most general search methods for finding a hypothesis that minimizes the error function, but it is not always the most efficient.

A number of alternative weight-optimization methods have therefore been proposed. They differ in decisions such as:
- how to choose a direction in which to alter the current weight vector, and
- how to choose the distance to move.

Examples of such optimization methods include:
1. line search, and
2. the conjugate gradient method.
Advanced topics in ANNs (cont.)
Recurrent networks

• Recurrent networks are ANNs that apply to time-series data and that use the outputs of network units at time t as inputs to other units at time t+1.

• In this way they form directed cycles in the network.

• They are more difficult to train than networks without loops, but they remain important because of their greater representational power.
Advanced topics in ANNs (cont.)
Dynamically modifying the network structure

So far, neural network learning has been treated as a problem of adjusting the weights within a fixed graph structure.

A variety of methods have been proposed to dynamically grow or shrink the number of network units and connections, in order to improve accuracy and training efficiency. For example:

1. Cascade-correlation algorithm: starts with a network containing no hidden units, then grows the network by adding hidden units whenever they are needed to reduce the training error.

2. Optimal brain damage approach: removes the least useful connections, reducing the number of weights in a large network while improving accuracy and training efficiency.
END OF CHAPTER-4
