UNIT 2
COURSE B.TECH
SEMESTER 5
Version
TABLE OF CONTENTS – UNIT 2

1 COURSE OBJECTIVES
2 PREREQUISITES
3 SYLLABUS
4 COURSE OUTCOMES
5 CO-PO/PSO MAPPING
6 LESSON PLAN
7 ACTIVITY BASED LEARNING
8 LECTURE NOTES
   2.1 INTRODUCTION TO MACHINE LEARNING
   2.2 BASICS AND UNDERFITTING
   2.3 HYPERPARAMETERS AND VALIDATION SETS
   2.4 ESTIMATORS
   2.5 BIAS AND VARIANCE
   2.6 MAXIMUM LIKELIHOOD
   2.7 BAYESIAN STATISTICS
   2.8 SUPERVISED AND UNSUPERVISED LEARNING
   2.9 STOCHASTIC GRADIENT DESCENT
   2.10 CHALLENGES MOTIVATING DEEP LEARNING
   2.11 DEEP FEEDFORWARD NETWORKS: LEARNING XOR
   2.12 GRADIENT-BASED LEARNING
14 REAL TIME APPLICATIONS
15 CONTENTS BEYOND THE SYLLABUS
16 PRESCRIBED TEXT BOOKS & REFERENCE BOOKS
17 MINI PROJECT SUGGESTION
1. Course Objectives
The objectives of this course are to:
1. Demonstrate the major technology trends driving Deep Learning.
2. Build, train and apply fully connected neural networks.
3. Implement efficient neural networks.
4. Analyze the key parameters and hyperparameters in a neural network's architecture.
5. Apply concepts of Deep Learning to solve real-world problems.
2. Prerequisites
This course is intended for senior undergraduate and junior graduate students
who have a proper understanding of
Python Programming Language
Calculus
Linear Algebra
Probability Theory
Although it would be helpful, knowledge about classical machine learning is
NOT required.
3. Syllabus
UNIT II
Machine Learning: Basics and Underfitting, Hyperparameters and Validation Sets, Estimators, Bias and Variance, Maximum Likelihood, Bayesian Statistics, Supervised and Unsupervised Learning, Stochastic Gradient Descent, Challenges Motivating Deep Learning.
Deep Feedforward Networks: Learning XOR, Gradient-Based Learning, Hidden Units, Architecture Design, Back-Propagation and Other Differentiation Algorithms.
4. Course Outcomes
1. Demonstrate the mathematical foundations of neural networks.
2. Describe the machine learning basics.
3. Differentiate architecture of deep neural network.
4. Build convolutional neural networks.
5. Build and train RNNs and LSTMs.
5. CO-PO/PSO Mapping
[CO–PO/PSO mapping table: CO2–CO5 mapped against programme outcomes at correlation levels 2–3]
6. Lesson Plan
7. Activity Based Learning
2. You will work on case studies from healthcare, autonomous driving, sign
language reading, music generation, and natural language processing.
You will master not only the theory, but also see how it is applied in industry.
8. Lecture Notes
2.1 INTRODUCTION TO MACHINE LEARNING
Introduction: Machine learning is essentially a form of applied statistics
with increased emphasis on the use of computers to statistically estimate
complicated functions and a decreased emphasis on proving confidence
intervals around these functions; we therefore present the two central
approaches to statistics: frequentist estimators and Bayesian inference. Most
machine learning algorithms can be divided into the categories of
supervised learning and unsupervised learning; we describe these
categories and give some examples of simple learning algorithms from
each category. Most deep learning algorithms are based on an optimization
algorithm called stochastic gradient descent.
2.2 BASICS AND UNDERFITTING
Underfitting occurs when the model is not able to obtain a sufficiently low error value on the training set. Overfitting occurs when the gap between the training error and test error is too large. We can control whether a model is more likely to overfit or underfit by altering its capacity.
2.3 HYPERPARAMETERS AND VALIDATION SETS
Validation Set
Cross-Validation
When the data set is too small, dividing it into a fixed training set and a fixed test set is problematic: if it results in a small test set, we cannot claim that algorithm A works better than algorithm B for a given task.
k-fold cross-validation: the data is split into k non-overlapping subsets. On trial i, the i-th subset is used as the test set and the rest of the data is used as the training set. The test errors from the k trials can be used to compute a confidence interval around the mean error.
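To make the procedure concrete, here is a minimal from-scratch sketch of k-fold cross-validation in Python. The helper name and the toy mean-predictor model are illustrative assumptions, not taken from the text:

```python
import numpy as np

def k_fold_cross_validation(X, y, k, train_and_eval):
    """Split (X, y) into k folds; each fold serves once as the test set.

    train_and_eval(X_tr, y_tr, X_te, y_te) should return a test error.
    The returned per-fold errors can be used to compute a confidence
    interval around the mean error, as described above.
    """
    indices = np.random.permutation(len(X))
    folds = np.array_split(indices, k)
    errors = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        errors.append(train_and_eval(X[train_idx], y[train_idx],
                                     X[test_idx], y[test_idx]))
    return np.array(errors)

# Example with a trivial "predict the training mean" model:
X = np.random.randn(100, 3)
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * np.random.randn(100)
mse = lambda X_tr, y_tr, X_te, y_te: np.mean((y_te - y_tr.mean()) ** 2)
errors = k_fold_cross_validation(X, y, k=5, train_and_eval=mse)
print(errors.mean(), errors.std())
```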
2.4 ESTIMATORS
An estimator may target:
A single parameter
A vector of parameters — e.g., weights in linear regression
A whole function
Point Estimation
To distinguish estimates of parameters from their true value, a point estimate of a parameter θ is represented by θ̂. Let {x(1), x(2), ..., x(m)} be m independent and identically distributed data points. Then a point estimator is any function of the data:
θ̂m = g(x(1), x(2), ..., x(m))
Point estimation can also refer to estimation of the relationship between input and target variables, referred to as function estimation.
Function Estimation:
Here we are trying to predict a variable y given an input vector x. We assume that there is a function f(x) that describes the approximate relationship between y and x. For example, we may assume that y = f(x) + ε, where ε stands for the part of y that is not predictable from x. In function estimation, we are interested in approximating f with a model or estimate f̂. Function estimation is really just the same as estimating a parameter θ; the function estimator f̂ is simply a point estimator in function space. For example, in polynomial regression we are either estimating a parameter w or estimating a function mapping from x to y.
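As a small illustration of both ideas, the sketch below computes a point estimate of a distribution's mean and then fits a degree-1 polynomial as a function estimator. All names and numbers are illustrative choices, not from the text:

```python
import numpy as np

# Draw m i.i.d. samples from a data distribution (here Gaussian, mean 2.0).
m = 1000
x = np.random.normal(loc=2.0, scale=1.5, size=m)

# A point estimator is any function of the data; the sample mean is a
# point estimator of the true mean theta = 2.0.
theta_hat = x.mean()
print("point estimate of the mean:", theta_hat)

# Function estimation: fit a degree-1 polynomial f_hat approximating
# y = f(x) + eps, i.e. estimating a point in function space.
xs = np.linspace(0, 1, 200)
ys = 3.0 * xs + 0.5 + 0.1 * np.random.randn(200)   # true f(x) = 3x + 0.5
w, b = np.polyfit(xs, ys, deg=1)
print("estimated function: f_hat(x) = %.2f x + %.2f" % (w, b))
```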
2.5 BIAS AND VARIANCE
Bias and variance measure two different sources of error in an estimator. Bias measures the expected deviation from the true value of the function or parameter. Variance, on the other hand, provides a measure of the deviation from the expected estimator value that any particular sampling of the data is likely to cause.
Bias
bias(θ̂m) = E(θ̂m) − θ
where the expectation is over the data (seen as samples from a random variable) and θ is the true underlying value of θ used to define the data generating distribution. An estimator is said to be unbiased if bias(θ̂m) = 0, which implies that E(θ̂m) = θ.
Just as we might like an estimator to exhibit low bias, we would also like it to have relatively low variance.
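The sketch below measures bias and variance empirically by simulating many datasets. The choice of the two variance estimators (dividing by m versus m − 1) is a standard textbook example, assumed here for illustration:

```python
import numpy as np

# Empirically measure bias and variance of two estimators of the variance
# of a Gaussian, using many simulated datasets of size m.
true_var = 4.0
m, trials = 10, 100000
samples = np.random.normal(0.0, np.sqrt(true_var), size=(trials, m))

biased = samples.var(axis=1, ddof=0)    # divides by m   (biased estimator)
unbiased = samples.var(axis=1, ddof=1)  # divides by m-1 (unbiased estimator)

# bias(theta_hat) = E(theta_hat) - theta, approximated by averaging trials
print("biased:   E[theta_hat] - theta =", biased.mean() - true_var)
print("unbiased: E[theta_hat] - theta =", unbiased.mean() - true_var)
print("variance of each estimator:", biased.var(), unbiased.var())
```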
2.6 MAXIMUM LIKELIHOOD
Consider a set of m examples X = {x(1), ..., x(m)} drawn independently from the true but unknown data generating distribution pdata(x). Let pmodel(x; θ) be a parametric family of probability distributions over the same space indexed by θ. In other words, pmodel(x; θ) maps any configuration x to a real number estimating the true probability pdata(x). The maximum likelihood estimator for θ is then
θML = argmaxθ pmodel(X; θ) = argmaxθ Πi pmodel(x(i); θ)
Since we have a product of many terms here, computing gradients directly is quite cumbersome. To obtain a more convenient but equivalent optimization problem, we observe that taking the logarithm of the likelihood does not change its arg max but conveniently transforms the product into a sum; since log is a strictly increasing function (the natural log is a monotone transformation), it does not affect the resulting value of θ.
So we have:
θML = argmaxθ Σi log pmodel(x(i); θ)
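A minimal numerical illustration of this, assuming a Gaussian model with known σ; the grid search stands in for a proper optimizer, and all values are illustrative:

```python
import numpy as np

# Maximum likelihood for a Gaussian mean: theta_ML maximizes the sum of
# log p_model(x_i; theta) over the data.
x = np.random.normal(loc=5.0, scale=2.0, size=500)

def log_likelihood(mu, sigma, data):
    # Sum of log N(x_i; mu, sigma^2): the log turns the product of
    # densities into a sum, exactly as described above.
    return np.sum(-0.5 * np.log(2 * np.pi * sigma**2)
                  - (data - mu) ** 2 / (2 * sigma**2))

# Grid search over candidate means (the closed form would give x.mean()).
candidates = np.linspace(0, 10, 1001)
lls = [log_likelihood(mu, 2.0, x) for mu in candidates]
mu_ml = candidates[int(np.argmax(lls))]
print("theta_ML ~", mu_ml, " sample mean:", x.mean())
```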
2.7 BAYESIAN STATISTICS
Bayesian statistics supports several kinds of tasks:
Statistical Inference
Statistical Modeling
- Bayesian statistics helps some models by classifying and specifying the prior distributions of any unknown parameters.
Experiment Design
While most machine learning models try to predict outcomes from large datasets, the Bayesian approach is helpful for several classes of problems that aren't easily solved with other probability models. In particular:
- When a model generates a null hypothesis but it is necessary to claim something about the likelihood of the alternative hypothesis
Compared with the frequentist approach: Bayesian statistics relies on the prior and the likelihood of observed data, while the frequentist approach relies only on the likelihood, for both observed and unobserved data.
2.8 SUPERVISED AND UNSUPERVISED LEARNING
Linear classifiers, support vector machines, decision trees and random forests are all common types of classification algorithms.
UNSUPERVISED LEARNING
data integrity. Often, this technique is used in the data preprocessing stage, such as when autoencoders remove noise from visual data to improve picture quality.
The main distinction between the two approaches is the use of labeled
datasets. To put it simply, supervised learning uses labeled input and
output data, while an unsupervised learning algorithm does not.
• Applications: Supervised learning models are ideal for spam detection,
sentiment analysis, weather forecasting and pricing predictions, among
other things. In contrast, unsupervised learning is a great fit for
anomaly detection, recommendation engines, customer personas and
medical imaging.
2.9 STOCHASTIC GRADIENT DESCENT
Gradient descent comes in three main variants:
1. Batch Gradient Descent
2. Stochastic Gradient Descent
3. Mini-batch Gradient Descent
Suppose you have a million samples in your dataset. If you use the typical (batch) Gradient Descent optimization technique, you will have to use all one million samples to complete a single iteration, and this has to be done for every iteration until the minimum is reached. Hence, it becomes computationally very expensive to perform.
SGD algorithm:
So, in SGD, we find the gradient of the cost function for a single example at each iteration instead of the sum of the gradients of the cost function over all the examples.
* In SGD, since only one sample from the dataset is chosen at random for each iteration, the path taken by the algorithm to reach the minimum is usually noisier than in typical Gradient Descent. But that does not matter much, because the path taken by the algorithm does not matter as long as we reach the minimum, and with a significantly shorter training time.
* One thing to be noted is that, as SGD is generally noisier than typical Gradient Descent, it usually takes a higher number of iterations to reach the minimum because of the randomness in its descent. Even though it requires more iterations than typical Gradient Descent, each iteration is computationally much less expensive. Hence, in most scenarios, SGD is preferred over Batch Gradient Descent for optimizing a learning algorithm.
2.10 CHALLENGES MOTIVATING DEEP LEARNING
Shortcomings of conventional ML
Curse of dimensionality
Prior beliefs
These biases may not be expressed in terms of a probability distribution. The most widely used prior is the smoothness prior, also called the local constancy prior. It states that the function we learn should not change very much within a small region.
Local Constancy
Manifold Learning
2.11 DEEP FEEDFORWARD NETWORKS: LEARNING XOR
Such a network is called a feedforward network because the input is only processed in one direction: the data always flows forward and never backwards, as shown in Figure 1.
The activation function squashes the sum of the input values to a 1 or a 0 (or a value very close to 1 or 0) in order to represent activation or lack thereof. Another form of unit, known as a bias unit, always activates, typically sending a hard-coded 1 to all units to which it is connected.
It is the setting of the weight variables that gives the network's author control over the process of converting input values to an output value. It is the weights that determine where the classification line, the line that separates data points into classification groups, is drawn. All data points on one side of a classification line are assigned the class of 0; all others are classified as 1.
Multilayer Perceptrons
The solution to this problem is to expand beyond the single-layer architecture by adding an additional layer of units without any direct access to the outside world, known as a hidden layer. This kind of architecture, shown in Figure 4, is another feedforward network known as a multilayer perceptron (MLP).
It is worth noting that an MLP can have any number of units in its input, hidden and output layers. There can also be any number of hidden layers. The architecture used here is designed specifically for the XOR problem.
The products of the input layer values and their respective weights are parsed as input to the non-bias units in the hidden layer. Each non-bias hidden unit invokes an activation function, usually the classic sigmoid function in the case of the XOR problem, to squash the sum of its input values down to a value that falls between 0 and 1 (usually a value very close to either 0 or 1).
The outputs of each hidden layer unit, including the bias unit, are then multiplied by another set of respective weights and parsed to an output unit. The output unit also parses the sum of its input values through an activation function (again, the sigmoid function is appropriate here) to return an output value falling between 0 and 1. This is the predicted output.
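The sketch below shows one possible forward pass for such a 2-2-1 XOR MLP. The specific weight values are hand-picked for illustration (one of many solutions) and are not taken from the text:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hand-picked weights for a 2-2-1 MLP that solves XOR. The two hidden
# units approximate OR and AND; the output combines them as OR AND (NOT AND).
W1 = np.array([[20.0, 20.0], [20.0, 20.0]])   # input -> hidden weights
b1 = np.array([-10.0, -30.0])                 # hidden biases (OR, AND)
W2 = np.array([20.0, -20.0])                  # hidden -> output weights
b2 = -10.0                                    # output bias

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    h = sigmoid(np.array(x) @ W1 + b1)   # hidden activations squash to ~0/1
    y = sigmoid(h @ W2 + b2)             # output ~1 only for (0,1) and (1,0)
    print(x, "->", round(float(y), 3))
```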
2.12 GRADIENT-BASED LEARNING
Note: if b == m, then mini-batch gradient descent will behave exactly like batch gradient descent.
Let (x(i), y(i)) be the i-th training example, where xj(i) represents the j-th feature of the i-th training example. The cost of the hypothesis hθ on a single example, and the training objective, are:
Cost(θ, (x(i), y(i))) = (1/2)(hθ(x(i)) − y(i))²
Jtrain(θ) = (1/m) Σi Cost(θ, (x(i), y(i)))
Randomly shuffle the dataset, then:
Repeat {
    For i = 1 to m {
        θj := θj − (learning rate) · (hθ(x(i)) − y(i)) · xj(i)    (for every j = 0, ..., n)
    }
}
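A minimal Python rendering of this update rule for linear regression; the data, learning rate, and epoch count are illustrative assumptions:

```python
import numpy as np

# SGD for linear regression, implementing the update above:
# theta_j := theta_j - alpha * (h_theta(x_i) - y_i) * x_ij
rng = np.random.default_rng(0)
m, n = 1000, 3
X = np.c_[np.ones(m), rng.normal(size=(m, n))]   # x_0 = 1 is the bias term
true_theta = np.array([4.0, 2.0, -1.0, 0.5])
y = X @ true_theta + 0.1 * rng.normal(size=m)

theta = np.zeros(n + 1)
alpha = 0.01
for epoch in range(5):
    for i in rng.permutation(m):                 # shuffle on each pass
        grad = (X[i] @ theta - y[i]) * X[i]      # gradient from ONE example
        theta -= alpha * grad
print(theta)                                     # close to true_theta
```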
2.13 HIDDEN UNITS
Rectified Linear Units and Their Generalizations
Rectified linear units use the activation function g(z) = max{0, z}.
Rectified linear units are easy to optimize due to their similarity with linear units. The only difference from linear units is that they output 0 across half of their domain; the derivative is 1 everywhere the unit is active.
One drawback of rectified linear units is that they cannot learn via gradient-based methods on examples for which their activation is zero.
Three generalizations of rectified linear units are based on using a non-zero slope αi when zi < 0:
hi = g(z, α)i = max(0, zi) + αi min(0, zi)
Absolute value rectification fixes αi = −1 to obtain g(z) = |z|. It is used for object recognition from images.
A leaky ReLU fixes αi to a small value like 0.01, while a parametric ReLU (PReLU) treats αi as a learnable parameter.
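A small NumPy sketch of these activation functions (the input values are illustrative):

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)                   # g(z) = max{0, z}

def generalized_relu(z, alpha):
    # h_i = max(0, z_i) + alpha_i * min(0, z_i), the formula above
    return np.maximum(0.0, z) + alpha * np.minimum(0.0, z)

z = np.array([-2.0, -0.5, 0.0, 1.5])
print(relu(z))                                  # [0.  0.  0.  1.5]
print(generalized_relu(z, alpha=-1.0))          # absolute value: |z|
print(generalized_relu(z, alpha=0.01))          # leaky ReLU
```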
2.14 ARCHITECTURE DESIGN
Most neural networks are organized into groups of units called layers, arranged in a chain structure. The first layer is given by
h(1) = g(1)(W(1)ᵀx + b(1))
and the second layer by
h(2) = g(2)(W(2)ᵀh(1) + b(2))
The training algorithm might choose the wrong function due to overfitting.
2.15 BACK-PROPAGATION AND OTHER DIFFERENTIATION ALGORITHMS
Backpropagation computes the gradient of the loss function with respect to a single weight by the chain rule. It efficiently computes one layer at a time, unlike a naive direct computation. It computes the gradient, but it does not define how the gradient is used. It generalizes the computation in the delta rule.
1. Inputs X arrive through the preconnected path.
2. The input is modeled using real weights W. The weights are usually randomly selected.
3. Calculate the output for every neuron from the input layer, through the hidden layers, to the output layer.
4. Calculate the error in the outputs: Error = Actual Output − Desired Output.
5. Travel back from the output layer to the hidden layer to adjust the weights such that the error is decreased.
6. Keep repeating the process until the desired output is achieved.
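As an illustration of steps 1–6, this compact sketch trains a 2-2-1 sigmoid network on XOR with hand-written backpropagation. The learning rate, seed, and iteration count are illustrative, and convergence can depend on the random initialization:

```python
import numpy as np

# Steps 1-6 above: forward pass, output error, backward pass via the
# chain rule, weight update, repeated until the outputs approximate XOR.
rng = np.random.default_rng(1)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)   # step 1: inputs
y = np.array([[0], [1], [1], [0]], dtype=float)

W1, b1 = rng.normal(size=(2, 2)), np.zeros((1, 2))   # step 2: random weights
W2, b2 = rng.normal(size=(2, 1)), np.zeros((1, 1))
sigmoid = lambda z: 1 / (1 + np.exp(-z))
lr = 0.5

for step in range(20000):
    h = sigmoid(X @ W1 + b1)          # step 3: forward pass, hidden layer
    out = sigmoid(h @ W2 + b2)        #         forward pass, output layer
    err = out - y                     # step 4: error at the output

    # step 5: propagate the error backwards through the chain rule
    d_out = err * out * (1 - out)              # sigmoid'(z) = s(1 - s)
    d_h = (d_out @ W2.T) * h * (1 - h)
    W2 -= lr * h.T @ d_out; b2 -= lr * d_out.sum(axis=0)
    W1 -= lr * X.T @ d_h;   b1 -= lr * d_h.sum(axis=0)

print(out.round(3))                   # step 6: repeat until outputs ~ XOR
```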
Types of Backpropagation Networks
Two Types of Backpropagation Networks are:
Static Back-propagation
Recurrent Backpropagation
Static back-propagation:
It is a kind of backpropagation network which produces a mapping of a static input to a static output. It is useful for solving static classification problems like optical character recognition.
Recurrent Backpropagation:
In recurrent backpropagation, the activations are fed forward until a fixed value is achieved. After that, the error is computed and propagated backward.
The main difference between the two methods is that the mapping is immediate in static back-propagation, while it is not in recurrent backpropagation.
History of Backpropagation
In 1961, the basic concept of continuous backpropagation was derived in the context of control theory by J. Kelly, Henry Arthur, and E. Bryson.
In 1969, Bryson and Ho gave a multi-stage dynamic system optimization
method.
In 1974, Werbos stated the possibility of applying this principle in an
artificial neural network.
In 1982, Hopfield brought his idea of a neural network.
In 1986, by the effort of David E. Rumelhart, Geoffrey E. Hinton, Ronald
J. Williams, backpropagation gained recognition.
In 1993, Wan was the first person to win an international pattern recognition contest with the help of the backpropagation method.
Computational Complexity
In general, determining the order of evaluation that results in the lowest computational cost is a difficult problem. Finding the optimal sequence of operations to compute the gradient is NP-complete (Naumann, 2008), in the sense that it may require simplifying algebraic expressions into their least expensive form.
9. Practice Quiz
1. Which of the following CANNOT be achieved by using machine learning?
a) forecast the outcome variable into the future
b) accurately predict the outcome using supervised learning algorithms
c) proving causal relationships between variables
d) classify respondents into groups based on their response pattern
a) Procedure-oriented
b) Object-oriented
c) Logic-oriented
d) Rule-oriented
a) Deep Learning
b) Machine Learning
c) Artificial Intelligence
d) None of the above
7. What are the three types of Machine Learning?
a) Supervised Learning
b) Unsupervised learning
c) Reinforcement Learning
d) All of the above
8. Which of the following is not a supervised learning algorithm?
a) PCA
b) Naive Bayesian
c) Linear Regression
d) Decision Tree
10. Neural Networks consist of artificial neurons that are similar to the
biological model of neurons.
a. True
b. False
10. Assignments
S.No  Question                                                     BL  CO
1     Discuss the supervised and unsupervised learning             6   2
2     Write and explain in detail about Gradient-based learning
      with examples                                                5   1
3     Explain in detail about the Gradient descent                 5   1
4     Compare hyperparameters and validation sets                  2   2
5     Discuss in detail about the Bayesian statistics              6   2
Examples of hyperparameters include:
Model architecture.
Learning rate.
Number of epochs.
Number of branches in a decision tree.
Number of clusters in a clustering algorithm.
Ans: Backpropagation is an algorithm that propagates the errors from the output nodes back to the input nodes. Therefore, it is simply referred to as the backward propagation of errors. It is used in many applications of neural networks in data mining, such as character recognition and signature verification.
S.No  Question                                       BL  CO
1     Explain in detail about the Gradient descent   2   1
2     Compare hyperparameters and validation sets    2   1
6. TensorFlow for Deep Learning by Dr Kevin Webster, offered on Coursera – 6 months
7. Deep Learning NPTEL course by Prof. Sudarshan Iyengar and Prof. Sanatan Sukhija, IIT Ropar – 10 weeks
S.No Application CO
1 Virtual Assistants 1
Virtual Assistants are cloud-based applications that understand
natural language voice commands and complete tasks for the user.
Amazon Alexa, Cortana, Siri, and Google Assistant are typical examples
of virtual assistants. They need internet-connected devices to work
with their full capabilities. Each time a command is fed to the assistant, they tend to provide a better user experience based on past experiences, using Deep Learning algorithms.
2 Chatbots 1
Chatbots can solve customer problems in seconds. A chatbot is an AI
application to chat online via text or text-to-speech. It is capable of
communicating and performing actions similar to a human. Chatbots are used a lot in customer interaction, marketing on social network sites, and instant messaging with clients. They deliver automated responses to user inputs and use machine learning and deep learning algorithms to generate different types of responses.
The next important deep learning application is related to Healthcare.
3 Healthcare 1
Deep Learning has found its application in the Healthcare sector.
Computer-aided disease detection and computer-aided diagnosis have
been possible using Deep Learning. It is widely used for medical research,
drug discovery, and diagnosis of life-threatening diseases such as cancer
and diabetic retinopathy through the process of medical imaging.
4 Entertainment 1
Companies such as Netflix, Amazon, YouTube, and Spotify give relevant
movies, songs, and video recommendations to enhance their customer
experience. This is all thanks to Deep Learning. Based on a person’s
browsing history, interest, and behavior, online streaming companies give
suggestions to help them make product and service choices. Deep
learning techniques are also used to add sound to silent movies and
generate subtitles automatically.
5 News Aggregation and Fake News Detection 1
Deep Learning allows you to customize news depending on the reader's persona. You can aggregate and filter out news information as per the preferences of a reader. Neural Networks help develop classifiers that can detect fake and biased news and remove it from your feed. They also warn you of possible privacy breaches.
6 Image Coloring 1
Image colorization has seen significant advancements using Deep
Learning. Image colorization is taking an input of a grayscale image
and then producing an output of a colorized image. ChromaGAN is an
example of a picture colorization model. A generative network is framed
in an adversarial model that learns to colorize by incorporating a
perceptual and semantic understanding of both class distributions and
color.
17. Mini Project Suggestion
To start with deep learning, the very basic project that you can build is to predict the next digit in a sequence. Create a sequence like a list of odd numbers, then build a model and train it to predict the next digit in the sequence. A simple neural network with 2 layers would be sufficient to build the model.
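One possible starting point, sketched with tf.keras; the window size, layer widths, and input scaling are illustrative choices, not prescribed by the text:

```python
import numpy as np
import tensorflow as tf

# Build (window -> next value) training pairs from a sequence of odd numbers.
seq = np.arange(1, 200, 2, dtype=np.float32)      # 1, 3, 5, ..., 199
window = 3
X = np.array([seq[i:i + window] for i in range(len(seq) - window)])
y = seq[window:]

scale = 100.0                                      # crude normalization
model = tf.keras.Sequential([
    tf.keras.layers.Dense(8, activation="relu", input_shape=(window,)),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
model.fit(X / scale, y / scale, epochs=500, verbose=0)

# Predict the value after 201, 203, 205 (expected: about 207). These inputs
# lie just outside the training range, so the prediction is approximate.
nxt = model.predict(np.array([[201, 203, 205]], dtype=np.float32) / scale)
print(nxt * scale)
```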
Face detection took a major leap with deep learning techniques. We can build models with high accuracy in detecting the bounding boxes of the human face. This project will get you started with object detection, and you will learn how to detect any object in an image.
How often do you get stuck thinking about the name of a dog’s breed?
There are many dog breeds and most of them are similar to each other. We
can use the dog breeds dataset and build a model that will classify different
dog breeds from an image. This project will be useful for a lot of people.