
Machine Learning

(CS4613)

Department of Computer Science


Capital University of Science and Technology (CUST)
Course Outline
Topic | Weeks | Reference
Introduction | Week 1 | Hands-On Machine Learning, Ch. 1
Hypothesis Learning | Week 2 | Tom Mitchell, Ch. 2
Model Evaluation | Week 3 | Fundamentals of Machine Learning for Predictive Data Analytics, Ch. 8
Classification:
  Decision Trees | Weeks 4, 5 | Fundamentals of Machine Learning for Predictive Data Analytics, Ch. 4
  Bayesian Inference, Naïve Bayes | Weeks 6, 7 | Fundamentals of Machine Learning for Predictive Data Analytics, Ch. 6
PCA | Week 8 | Hands-On Machine Learning, Ch. 8
Linear Regression | Weeks 9, 10 | Fundamentals of Machine Learning for Predictive Data Analytics, Ch. 7
SVM | Weeks 11, 12 | Fundamentals of Machine Learning for Predictive Data Analytics, Ch. 7
ANN | Weeks 13, 14 | Neural Networks: A Systematic Introduction, Ch. 1, 2, 3, 4, 7 (selected topics); Hands-On Machine Learning, Ch. 10
K-Nearest Neighbor | Week 15 | Fundamentals of Machine Learning for Predictive Data Analytics, Ch. 5; Master Machine Learning Algorithms, Ch. 22, 23
K-Means Clustering | Week 16 | Data Mining: The Textbook, Ch. 6

2
Outline Week 13, 14
• Introduction to Artificial Neural Networks (ANNs)
• Introduction to Perceptron
• Perceptron learning rule
• Activation Functions
• Multi-layer ANNs
• Learning in ANN
• Example
• Implementing ANN using Keras

3
Introduction
• It seems logical to look at the brain’s architecture for
inspiration on how to build an intelligent machine. This
is the key idea that inspired artificial neural networks
(ANNs).
• However, although planes were inspired by birds, they
don’t have to flap their wings. Similarly, ANNs have
gradually become quite different from their biological
cousins.
• Some researchers even argue that we should drop the biological analogy altogether.
• ANNs are at the very core of Deep Learning. They are
versatile, powerful, and scalable, making them ideal to
tackle large and highly complex Machine Learning tasks.
4
History of ANNs
• ANNs have been around for quite a while: they were first introduced back in 1943
by the neurophysiologist Warren McCulloch and the mathematician Walter Pitts.
• In their landmark paper, “A Logical Calculus of Ideas Immanent in Nervous Activity,”
McCulloch and Pitts presented a simplified computational model of how biological
neurons might work together in animal brains to perform complex computations
using propositional logic. This was the first artificial neural network architecture.
• Since then many other architectures have been invented.
• The early successes of ANNs until the 1960s led to the widespread belief that we
would soon be conversing with truly intelligent machines.
• When it became clear that this promise would go unfulfilled (at least for quite a
while), funding flew elsewhere and ANNs entered a long dark era.
• In the early 1980s there was a revival of interest in ANNs as new network
architectures were invented and better training techniques were developed.
• But by the 1990s, powerful alternative Machine Learning techniques such as
Support Vector Machines were favored by most researchers, as they seemed to
offer better results and stronger theoretical foundations.

5
History of ANNs Contd..
• We are now witnessing yet another wave of interest in ANNs. Will
this wave die out like the previous ones did?
• There are a few good reasons to believe that this one is different and
will have a much more profound impact on our lives:
• There is now a huge quantity of data available to train neural
networks, and ANNs frequently outperform other ML techniques on
very large and complex problems.
• The tremendous increase in computing power since the 1990s now
makes it possible to train large neural networks in a reasonable
amount of time.
• Amazing products based on ANNs regularly make the headline news,
which pulls more and more attention and funding toward them,
resulting in more and more progress, and even more amazing
products.
6
Biological Neurons
• It is an unusual-looking cell composed of a cell body containing the nucleus
and most of the cell’s complex components, and many branching extensions
called dendrites, plus one very long extension called the axon.
• Near its extremity the axon splits off into many branches, and at the tip of
these branches are minuscule structures called synapses, which are
connected to the dendrites (or directly to the cell body) of other neurons.
• Biological neurons receive short electrical impulses called signals from other
neurons via these synapses.
• When a neuron receives a sufficient number of signals from other neurons
within a few milliseconds, it fires its own signals.
• Individual biological neurons seem to behave in a rather simple way, but they
are organized in a vast network of billions of neurons, each neuron typically
connected to thousands of other neurons.
• Highly complex computations can be performed by a vast network of fairly
simple neurons.

7
8
The Perceptron

9
The Perceptron
• Invented in 1957 by Frank Rosenblatt.
• The inputs and output are numbers, and each input
connection is associated with a weight.
• It computes a weighted sum of its inputs: z = w1·x1 + w2·x2 + ⋯ + wn·xn = wᵀ·x
• It then applies a step function to that sum and
outputs the result.
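As an illustration, here is a minimal NumPy sketch of this computation (weighted sum followed by a step function); the input and weight values are made up for the example.

```python
import numpy as np

def step(z):
    """Heaviside step function: 1 if z >= 0, else 0."""
    return np.where(z >= 0, 1, 0)

def perceptron_output(x, w, b=0.0):
    """Compute z = w^T x (plus an optional bias) and apply the step function."""
    z = np.dot(w, x) + b
    return step(z)

# Example with made-up numbers: two inputs, two weights
x = np.array([1.0, 2.0])
w = np.array([0.5, -0.25])
print(perceptron_output(x, w))   # step(0.5*1 - 0.25*2) = step(0.0) = 1
```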

10
11
Commonly Used Step Function

12
Perceptron Learning Rule

• The rule updates each weight as:  wi ← wi + η (t − o) xi

• wi is the i-th weight
• xi is the i-th input feature’s value
• t is the target output
• o is the output generated by the perceptron
• η is the learning rate. Its role is to control the amount of weight change in each step. Usually, it is set to a small value.
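A minimal sketch of a single application of this rule, assuming the standard form wi ← wi + η (t − o) xi given above; the numbers are only illustrative.

```python
import numpy as np

def perceptron_update(w, x, t, o, eta=0.1):
    """One application of the perceptron learning rule:
    w_i <- w_i + eta * (t - o) * x_i for every weight."""
    return w + eta * (t - o) * x

# Illustrative values (not from the slides)
w = np.array([0.2, -0.4])
x = np.array([1.0, 1.0])
t, o = 1, 0                            # target was 1 but the perceptron output 0
print(perceptron_update(w, x, t, o))   # weights move toward firing: [0.3, -0.3]
```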

13
Note
• Perceptron Learning Rule is similar to the weight
update rule we derived previously (excluding the
summation over n training instances)
• See “Gradient Descent for Finding Optimal Weights”, Week 9, 10 (Slide 42).

14
Threshold Functions

15
Threshold Functions
• Instead of using the hard step function, we normally apply a smoother threshold (activation) function to the output.
• This adds the ability to learn complex, arbitrary, non-linear mappings between inputs and outputs.
• A Perceptron with the logistic function as its activation function works in the same way as logistic regression.

16
Example of Data that is Linearly
Separable

17
Not linearly separable

18
The effect of non-linear activation

19
Commonly Used Activation
Functions
• Linear
• Logistic
• ReLU
• Tanh
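A small NumPy sketch of these four activation functions, assuming their standard definitions:

```python
import numpy as np

def linear(z):
    return z                          # identity: output equals the net input

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))   # squashes z into (0, 1)

def relu(z):
    return np.maximum(0.0, z)         # 0 for negative inputs, z otherwise

def tanh(z):
    return np.tanh(z)                 # squashes z into (-1, 1)

z = np.array([-2.0, 0.0, 2.0])
for f in (linear, logistic, relu, tanh):
    print(f.__name__, f(z))
```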

20
Linear Function

21
Logistic Function

22
ReLU

23
Tanh
The advantage is that negative inputs are mapped strongly negative and inputs near zero are mapped near zero on the tanh curve.

24
Artificial Neural
Networks

25
Artificial Neural Networks
(ANN)
• Also called a Multi-Layer Perceptron (MLP)
• Networks of inter-connected artificial neurons
• One (passthrough) input layer
• One or more layers of nodes, called hidden layers
• One layer of nodes, called the output layer
• Every layer except the output layer can include a bias neuron
• When an ANN has two or more hidden layers, it is
called a deep neural network (DNN).

26
27
Nodes in input layer
• Nodes in input layer take the input and pass it on to
the next layer.
• These nodes do not do any computation.

28
Node in other Layers
• Such a node computes the weighted sum of its inputs and then applies
an activation function to it.
• Weights are just real numbers, typically initialized to small random values (for example between 0 and 1, or between -1 and 1) and then learned during training.
• These nodes are loosely designed based on a neuron in the human
brain, which fires when it encounters sufficient stimuli.
• Weights play a role analogous to synapses in real neurons.

29
Example: Handwritten
Character Recognition

30
Handwritten Character
Recognition
• Represent examples as 28 × 28 pixel images
• Input all the examples to the NN one by one, along with the known output.
• Input layer has 28 × 28 = 784 neurons
• Output layer has 10 neurons (digits 0 to 9)
• There can be one or more hidden layers
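A minimal Keras sketch of this architecture; the 784-input and 10-output layer sizes come from the slide, while the single 64-unit hidden layer and the choice of activations and optimizer are assumptions for illustration.

```python
from tensorflow import keras

model = keras.Sequential([
    keras.Input(shape=(784,)),                     # 28 x 28 = 784 pixel inputs
    keras.layers.Dense(64, activation="relu"),     # one hidden layer (size assumed)
    keras.layers.Dense(10, activation="softmax"),  # one output neuron per digit 0-9
])
model.compile(optimizer="sgd",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```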

31
Digit “3” represented as an image

32
[Figure: inputs feeding an input layer connected to an output layer]

33
Knowledge Stored as Weights
• This knowledge about mapping is stored in the
form of weights in the ANN.

34
Making Predictions
• Once the NN has learnt the correct mapping, it can
be used on new unknown examples.
• Example: Recognize “2”

35
Learning in ANN
Gradient Descent with Back Propagation

36
Learning
• For each training instance, a prediction is made, which represents the output of the network (Forward Pass or Forward Propagation).
• We then measure the error, which is the difference between the predicted output and the actual output (Measure Error).
• We then go through each layer in reverse to measure the error contribution from each connection, and tweak the connection weights to reduce the error (Backward Pass or Backward Propagation).
• The above process is repeated until the error reaches a
small value.
37
38
Forward Pass (Forward
Propagation)
• Step 1: Each input is multiplied by a weight as it travels over
an edge (connecting line) to a node in the following layer
• Step 2: All inputs to a node, including the bias, are summed
using the summation operator. The result is called the total
net input.
• Step 3: The total net input is then fed into an activation
function, which transforms the net input into a new output.

• This new output is then sent out over one or more edges
and multiplied by a weight, and the cycle continues until
the output layer calculations are completed.
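A minimal NumPy sketch of these steps for one fully connected layer, using the logistic (sigmoid) activation as an example; the tiny two-layer network and its numbers are made up for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def layer_forward(inputs, weights, bias, activation=sigmoid):
    """One layer of the forward pass:
    Steps 1-2: weighted sum of the inputs plus bias (the total net input).
    Step 3:    activation function applied to the net input."""
    net = weights @ inputs + bias
    return activation(net)

# Tiny illustrative network: 2 inputs -> 2 hidden units -> 1 output
x  = np.array([0.5, 1.0])
W1 = np.array([[0.1, 0.2],
               [0.3, 0.4]])
b1 = np.array([0.1, 0.1])
W2 = np.array([[0.5, 0.6]])
b2 = np.array([0.2])

hidden = layer_forward(x, W1, b1)
output = layer_forward(hidden, W2, b2)
print(output)
```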

39
Forward Pass Steps

40
Measuring Error
• In order for a neural network to train successfully, it must minimize the difference between its actual output and its target output.
• This difference is the total error, which essentially tells us how wrong the network is.
• A cost function provides the total error, i.e., the difference between the target output and the actual output.

41
Measuring Error Contd.
• Mean Squared Error (MSE)

• Squared Error (SE)
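For reference, standard definitions consistent with these names are given below; the 1/n average for MSE and the ½ convention for SE are assumptions (the latter matches the weight-update rules used in the worked example later in these slides).

```latex
\text{MSE} = \frac{1}{n}\sum_{i=1}^{n}\bigl(\text{prediction}_i - \text{actual}_i\bigr)^2,
\qquad
\text{SE} = \tfrac{1}{2}\bigl(\text{prediction} - \text{actual}\bigr)^2
```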

42
Backward Pass or Back
Propagation
• In the backward pass weights are adjusted in a way
that reduces the error of the network.
• Back-propagation is an automatic differentiation
algorithm that can be used to calculate the gradients
for the parameters in neural networks.
• The result of applying back-propagation is that we get a
formula for updating each weight in the network.
• We will discuss the weight update formulae in the following
example.
• See Slide 37, Week 9, 10 for the weight update rule derived for multivariable linear regression.
43
Gradient Descent Algorithm
• See Slide 29, Week 9,10
• Gradient Descent or Stochastic Gradient Descent is an
optimization algorithm that can be used to train neural
network models.
• The Stochastic Gradient Descent algorithm requires gradients
to be calculated for each variable in the model so that new
values for the variables can be calculated. Back Propagation
gives those gradients.
• Together, the back-propagation algorithm and Stochastic
Gradient Descent algorithm are used to train a neural network.
• We might call this “Stochastic Gradient Descent with Back-
propagation.”
44
Common Mistake
• The term back-propagation is often misunderstood
as meaning the whole learning algorithm for multi-
layer neural networks.
• Actually, back-propagation refers only to the
method for computing the gradient, while another
algorithm, such as stochastic gradient descent, is
used to perform learning using this gradient.

https://machinelearningmastery.com/difference-between-backpropagation-and-stochastic-gradient-descent/
45
Complete Example
Taken From
https://hmkcode.com/ai/backpropagation-step-by-step/

46
Network Structure

47
Training (Learning)
• Neural network training is about finding weights
that minimize prediction error.
• We start our training with a set of randomly
generated weights.
• Then, backpropagation is used to update the
weights in an attempt to correctly map inputs to
outputs.

48
Initial Weights

49
Data

50
We are not using any activation function here (i.e., a linear/identity activation).

51
52
Network Structure

53
Formulae for Updating Weights in
the output layer
• w6* = w6 − η × (prediction − actual) × h2
• w5* = w5 − η × (prediction − actual) × h1
• η is the learning rate; * means updated weight.

• This is very similar to the rule we used for perceptron learning.

• Also this is very similar to the rule we found for multivariable linear regression
(Slide 42, Week 9, 10)

• The complete derivation of weight update rule can be found on the following
link.
• https://en.wikipedia.org/wiki/Delta_rule
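As a sketch of where these rules come from (the full derivation is on the Wikipedia page above), assume the squared-error cost E = ½(prediction − actual)² and the linear output prediction = h1·w5 + h2·w6 used in this example; the chain rule then gives

```latex
\frac{\partial E}{\partial w_5}
  = (\text{prediction} - \text{actual}) \cdot \frac{\partial\,\text{prediction}}{\partial w_5}
  = (\text{prediction} - \text{actual}) \cdot h_1,
\qquad
w_5^{*} = w_5 - \eta\,\frac{\partial E}{\partial w_5}
```

The rule for w6 is identical with h2 in place of h1.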
54
Formulae for Updating Weights in
the hidden layer
• w4* = w4 − η × (prediction − actual) × i2 × w6
• w3* = w3 − η × (prediction − actual) × i1 × w6
• w2* = w2 − η × (prediction − actual) × i2 × w5
• w1* = w1 − η × (prediction − actual) × i1 × w5

• We will not discuss the derivation of the above rules.


• The rules are very similar to the previous ones except for the additional
weight at the end.

55
Note
• In this example we have assumed the simple case of linear activation, which gives the weight update rules shown above.
• The weight update rules would be different if a different activation function were used.

56
Network Structure

57
58
Backward Pass
Given values:
• prediction − actual = 0.191 − 1 = −0.809
• η = 0.05 (assumed)
• h1 = 0.85, h2 = 0.48 (see forward pass)
• previous weights: w1 = 0.11, w2 = 0.21, w3 = 0.12, w4 = 0.08, w5 = 0.14, w6 = 0.15
• inputs: i1 = 2, i2 = 3

Weight updates:
• w6* = w6 − η × (prediction − actual) × h2 = 0.15 − 0.05 × (−0.809) × 0.48 ≈ 0.17
• w5* = w5 − η × (prediction − actual) × h1 = 0.14 − 0.05 × (−0.809) × 0.85 ≈ 0.17
• w4* = w4 − η × (prediction − actual) × i2 × w6 = 0.08 − 0.05 × (−0.809) × 3 × 0.15 ≈ 0.10
• w3* = w3 − η × (prediction − actual) × i1 × w6 = 0.12 − 0.05 × (−0.809) × 2 × 0.15 ≈ 0.13
• w2* = w2 − η × (prediction − actual) × i2 × w5 = 0.21 − 0.05 × (−0.809) × 3 × 0.14 ≈ 0.23
• w1* = w1 − η × (prediction − actual) × i1 × w5 = 0.11 − 0.05 × (−0.809) × 2 × 0.14 ≈ 0.12
59
Repeat the forward pass with
Updated Weights

60
Analysis
• We can see that the new prediction is closer to the
actual value.
• We can repeat the forward pass and backward pass
until the error is below a certain value (or a certain
number of repetitions have been completed).
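Below is a short Python sketch that reproduces this worked example end to end, using the values from the slides (inputs 2 and 3, target 1, η = 0.05, linear activation); only the rounding differs, since full precision is kept here. Wrapping the last few steps in a loop would repeat the process as described above.

```python
# Reproduces the worked example from the slides (hmkcode.com):
# linear activation, inputs (2, 3), target 1, learning rate 0.05.
i1, i2 = 2.0, 3.0
target = 1.0
eta = 0.05
w1, w2, w3, w4, w5, w6 = 0.11, 0.21, 0.12, 0.08, 0.14, 0.15

# Forward pass (linear activation: output = net input)
h1 = i1 * w1 + i2 * w2             # 0.85
h2 = i1 * w3 + i2 * w4             # 0.48
prediction = h1 * w5 + h2 * w6     # 0.191
delta = prediction - target        # -0.809

# Backward pass: output-layer weights
w6_new = w6 - eta * delta * h2     # ~0.17
w5_new = w5 - eta * delta * h1     # ~0.17

# Backward pass: hidden-layer weights
w4_new = w4 - eta * delta * i2 * w6    # ~0.10
w3_new = w3 - eta * delta * i1 * w6    # ~0.13
w2_new = w2 - eta * delta * i2 * w5    # ~0.23
w1_new = w1 - eta * delta * i1 * w5    # ~0.12

# Forward pass again with the updated weights
h1_new = i1 * w1_new + i2 * w2_new
h2_new = i1 * w3_new + i2 * w4_new
new_prediction = h1_new * w5_new + h2_new * w6_new
print(prediction, new_prediction)  # the new prediction is closer to the target 1
```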

61
Neural Network
Simulator
https://playground.tensorflow.org

62
Implementing ANN
Using scikit-learn
https://colab.research.google.com/drive/10H7UG4Ty4DyPXmtWqv839GrLOAUMjd47?usp=sharing

63
Implementing ANN
Using Keras
https://colab.research.google.com/drive/1uSh7S97AtC9IGpNNvNyxiuHsfNkOYhm8?usp=sharing

https://colab.research.google.com/drive/1sm1c68zHZ49d3HK2Ftct3LBcx1SIsSQg?usp=sharing

64
That is all for Week 13
and 14

65
