
COURSE MATERIAL

SUBJECT: DEEP LEARNING

UNIT: 2

COURSE: B.TECH

DEPARTMENT: COMPUTER SCIENCE & ENGINEERING (AI&ML)

SEMESTER: 5

VERSION:

PREPARED / REVISED DATE: 23-08-2023

TABLE OF CONTENTS – UNIT 2

S. NO   CONTENTS
1       COURSE OBJECTIVES
2       PREREQUISITES
3       SYLLABUS
4       COURSE OUTCOMES
5       CO - PO/PSO MAPPING
6       LESSON PLAN
7       ACTIVITY BASED LEARNING
8       LECTURE NOTES
        2.1  INTRODUCTION TO MACHINE LEARNING
        2.2  BASICS AND UNDERFITTING
        2.3  HYPERPARAMETERS AND VALIDATION SETS
        2.4  ESTIMATORS
        2.5  BIAS AND VARIANCE
        2.6  MAXIMUM LIKELIHOOD
        2.7  BAYESIAN STATISTICS
        2.8  SUPERVISED AND UNSUPERVISED LEARNING
        2.9  STOCHASTIC GRADIENT DESCENT
        2.10 CHALLENGES MOTIVATING DEEP LEARNING
        2.11 DEEP FEEDFORWARD NETWORKS: LEARNING XOR
        2.12 GRADIENT-BASED LEARNING
        2.13 HIDDEN UNITS
        2.14 ARCHITECTURE DESIGN
        2.15 BACK-PROPAGATION AND OTHER DIFFERENTIATION ALGORITHMS
9       PRACTICE QUIZ
10      ASSIGNMENTS
11      PART A QUESTIONS & ANSWERS (2 MARKS QUESTIONS)
12      PART B QUESTIONS
13      SUPPORTIVE ONLINE CERTIFICATION COURSES
14      REAL TIME APPLICATIONS
15      CONTENTS BEYOND THE SYLLABUS
16      PRESCRIBED TEXT BOOKS & REFERENCE BOOKS
17      MINI PROJECT SUGGESTION

1. Course Objectives
The objectives of this course are to:
1. Demonstrate the major technology trends driving Deep Learning.
2. Build, train, and apply fully connected neural networks.
3. Implement efficient neural networks.
4. Analyze the key parameters and hyperparameters in a neural
network's architecture.
5. Apply concepts of Deep Learning to solve real-world problems.

2. Prerequisites
This course is intended for senior undergraduate and junior graduate students
who have a proper understanding of
 Python Programming Language
 Calculus
 Linear Algebra
 Probability Theory
Although it would be helpful, knowledge about classical machine learning is
NOT required.
3. Syllabus
UNIT II
Machine Learning: Basics and Underfitting, Hyperparameters and Validation
Sets, Estimators, Bias and Variance, Maximum Likelihood, Bayesian Statistics,
Supervised and Unsupervised Learning, Stochastic Gradient Descent,
Challenges Motivating Deep Learning.
Deep Feedforward Networks: Learning XOR, Gradient-Based Learning,
Hidden Units, Architecture Design, Back-Propagation and Other Differentiation
Algorithms.
4. Course Outcomes
1. Demonstrate the mathematical foundations of neural networks.
2. Describe the machine learning basics.
3. Differentiate architectures of deep neural networks.
4. Build convolutional neural networks.
5. Build and train RNNs and LSTMs.

5. Co-PO / PSO Mapping


        PO1  PO2  PO3  PO4  PO5  PO6  PO7  PO8  PO9  PO10  PO11  PO12  PSO1  PSO2
CO1 3 2

CO2 3 2

CO3 3 3 2 2 3 2 2

CO4 3 3 2 2 3 2 2

CO5

6. Lesson Plan

Lecture No.  Week  Topics to be covered                                     References

1            1     Machine Learning: Basics and Underfitting                T1
2            1     Hyperparameters and Validation Sets                      T1, R1
3            1     Estimators, Bias and Variance                            T1, R1
4            1     Maximum Likelihood, Bayesian Statistics                  T1, R1
5            2     Supervised and Unsupervised Learning                     T1, R1
6            2     Stochastic Gradient Descent                              T1, R1
7            2     Challenges Motivating Deep Learning                      T1, R1
8            2     Deep Feedforward Networks: Learning XOR                  T1, R1
9            3     Gradient-Based Learning, Hidden Units                    T1, R1
10           3     Architecture Design                                      T1, R1
11           3     Back-Propagation and Other Differentiation Algorithms    T1, R1

7. Activity Based Learning

1. The DL course is associated with a laboratory; different open-ended problem
statements are given to each student to carry out experiments using the
Google Colab tool. You will study the foundations of Deep Learning,
understand how to build neural networks, and learn how to lead successful
machine learning projects. You will learn about convolutional networks,
RNNs, LSTMs, etc.
2. You will work on case studies from healthcare, autonomous driving, sign
language reading, music generation, and natural language processing.
You will master not only the theory, but also see how it is applied in industry.
8. Lecture Notes
2.1 INTRODUCTION TO MACHINE LEARNING
Introduction: Machine learning is essentially a form of applied statistics
with increased emphasis on the use of computers to statistically estimate
complicated functions and a decreased emphasis on proving confidence
intervals around these functions; we therefore present the two central
approaches to statistics: frequentist estimators and Bayesian inference. Most
machine learning algorithms can be divided into the categories of
supervised learning and unsupervised learning; we describe these
categories and give some examples of simple learning algorithms from
each category. Most deep learning algorithms are based on an optimization
algorithm called stochastic gradient descent.

2.2 BASICS AND UNDERFITTING

The central challenge in machine learning is that we must perform well on
new, previously unseen inputs — not just those on which our model was trained.
The ability to perform well on previously unobserved inputs is called
generalization.
Typically, when training a machine learning model, we have access to a
training set, we can compute some error measure on the training set called the
training error, and we reduce this training error. So far, what we have
described is simply an optimization problem. What separates machine learning
from optimization is that we want the generalization error, also called the
test error, to be low as well.
The factors determining how well a machine learning algorithm will perform are
its ability to:
1. Make the training error small.
2. Make the gap between training and test error small.
These two factors correspond to the two central challenges in machine
learning: underfitting and overfitting. Underfitting occurs when the model is
not able to obtain a sufficiently low error value on the training set.
Overfitting
occurs when the gap between the training error and test error is too large. We
can control whether a model is more likely to overfit or underfit by altering
its capacity.
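A minimal sketch (numpy only, with illustrative degrees and noise levels) of the two failure modes just described: polynomials of increasing degree are fit to noisy data, and training error is compared with error on held-out points. A low-degree fit underfits (high training error); a very high-degree fit overfits (low training error, large gap to test error).

import numpy as np

rng = np.random.default_rng(0)
x_train = rng.uniform(-1, 1, 20)
x_test = rng.uniform(-1, 1, 20)
true_f = lambda x: np.sin(3 * x)
y_train = true_f(x_train) + rng.normal(0, 0.1, 20)
y_test = true_f(x_test) + rng.normal(0, 0.1, 20)

for degree in (1, 4, 15):                      # underfit, reasonable fit, overfit
    coeffs = np.polyfit(x_train, y_train, degree)
    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree:2d}: train MSE {train_err:.4f}, test MSE {test_err:.4f}")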

2.3 HYPERPARAMETERS AND VALIDATION SETS

Most machine learning algorithms have several settings that we
can use to control the behavior of the learning algorithm. These
settings are called hyperparameters. The values of
hyperparameters are not adapted by the learning algorithm itself
(though we can design a nested learning procedure where one
learning algorithm learns the best hyperparameters for another
learning algorithm).

Reasons for hyperparameters

 Sometimes a setting is chosen as a hyperparameter because it is too
difficult to optimize.
 More frequently, a setting is a hyperparameter because it is
not appropriate to learn that hyperparameter on the training set.

Validation Set

 To solve this problem we use a validation set – examples that the
training algorithm does not observe.
 Test examples should not be used to make choices about the model
hyperparameters.
 Training data is split into two disjoint parts – the first is used to learn the
parameters.
 The other is the validation set, used to estimate the generalization error
during or after training, allowing the hyperparameters to be updated –
typically 80% of the training data for training and 20% for validation.
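A minimal sketch of the 80/20 split described above; the data here is randomly generated purely for illustration.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))          # hypothetical training data
y = rng.normal(size=1000)

perm = rng.permutation(len(X))          # shuffle before splitting
split = int(0.8 * len(X))               # 80% used to learn the parameters
X_train, y_train = X[perm[:split]], y[perm[:split]]
X_val, y_val = X[perm[split:]], y[perm[split:]]   # 20% to estimate generalization error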

Cross-Validation

 When the data set is too small, dividing it into a fixed training set and a
fixed test set is problematic if it results in a small test set.
 A small test set implies statistical uncertainty around the estimated
average test error.
 We then cannot claim that algorithm A works better than algorithm B for a
given task.

k-fold cross-validation

 Partition the data into k non-overlapping subsets.
 On trial i, the i-th subset of the data is used as the test set.
 The rest of the data is used as the training set.

k-fold cross-validation is used when the supply of data is limited:

 All available data is partitioned into k groups (folds).
 k−1 groups are used to train, and the model is evaluated on the remaining
group.
 Repeat for all k choices of the held-out group.
 Performance scores from the k runs are averaged.

Cross-validation confidence

 The cross-validation algorithm returns a vector of errors e for the examples
in D, whose mean is the estimated generalization error.
 The errors can be used to compute a confidence interval around the mean:
the 95% confidence interval centered on the mean µ̂m is
(µ̂m − 1.96 SE(µ̂m), µ̂m + 1.96 SE(µ̂m)).
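A minimal sketch of k-fold cross-validation together with the confidence interval above. The train and evaluate arguments are hypothetical stand-ins for a real learning algorithm and error measure.

import numpy as np

def k_fold_cv(X, y, k, train, evaluate, seed=0):
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), k)
    errors = []
    for i in range(k):                   # on trial i, fold i is the held-out test set
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        model = train(X[train_idx], y[train_idx])
        errors.append(evaluate(model, X[test_idx], y[test_idx]))
    errors = np.asarray(errors)
    mean = errors.mean()                 # estimated generalization error
    se = errors.std(ddof=1) / np.sqrt(k) # standard error of the mean
    return mean, (mean - 1.96 * se, mean + 1.96 * se)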

2.4 Estimators

Estimation is a statistical term for finding some estimate of an unknown
parameter, given some data. Point estimation is the attempt to provide the
single best prediction of some quantity of interest.
Quantity of interest can be:

 A single parameter
 A vector of parameters — e.g., weights in linear regression
 A whole function

Point Estimation
To distinguish estimates of parameters from their true value, a point estimate
of a parameter θ is represented by θ̂. Let {x(1), x(2), ..., x(m)} be m
independent and identically distributed data points. Then a point estimator is
any function of the data:

θ̂m = g(x(1), ..., x(m)).
Point estimation can also refer to the estimation of the relationship between
input and target variables, referred to as function estimation.
Function Estimation:
Here we are trying to predict a variable y given an input vector x. We
assume that there is a function f(x) that describes the approximate
relationship between y and x. For example, we may assume that
y = f(x) + ε, where ε stands for the part of y that is not predictable from x.
In function estimation, we are interested in approximating f with a model or
estimate f̂. Function estimation is really just the same as estimating a
parameter θ; the function estimator f̂ is simply a point estimator in
function space. For example, in polynomial regression we are either
estimating a parameter w or estimating a function mapping from x to y.
2.5 BIAS AND VARIANCE

Bias and variance measure two different sources of error in an estimator. Bias
measures the expected deviation from the true value of the function or
parameter. Variance, on the other hand, provides a measure of the deviation
from the expected estimator value that any particular sampling of the data
is likely to cause.

Bias

The bias of an estimator is defined as:

bias(θ̂m) = E(θ̂m) − θ,

where the expectation is over the data (seen as samples from a random
variable) and θ is the true underlying value used to define the data-
generating distribution.

An estimator θ̂m is said to be unbiased if bias(θ̂m) = 0, which implies
that E(θ̂m) = θ.

Variance and Standard Error

The variance of an estimator is simply Var(θ̂), where the random variable is
the training set. Alternately, the square root of the variance is called the
standard error, denoted SE(θ̂). The variance or the standard error of an
estimator provides a measure of how we would expect the estimate we
compute from data to vary as we independently re-sample the dataset
from the underlying data-generating process.

Just as we might like an estimator to exhibit low bias, we would also like it
to have relatively low variance.
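A minimal simulation of these definitions: datasets are repeatedly re-sampled from a known Gaussian, and the biased (divide by m) and unbiased (divide by m − 1) variance estimators are compared. The true parameters and sample sizes are illustrative choices.

import numpy as np

rng = np.random.default_rng(0)
true_var, m, trials = 4.0, 10, 100_000

biased = np.empty(trials)
unbiased = np.empty(trials)
for t in range(trials):
    x = rng.normal(0.0, np.sqrt(true_var), m)   # one sampled training set
    biased[t] = x.var(ddof=0)                   # divides by m
    unbiased[t] = x.var(ddof=1)                 # divides by m - 1

print("bias (divide by m):    ", biased.mean() - true_var)    # about -true_var/m
print("bias (divide by m - 1):", unbiased.mean() - true_var)  # about 0
print("variance of estimator: ", unbiased.var())              # spread across datasets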

2.6 MAXIMUM LIKELIHOOD

Having discussed the definition of an estimator, let us now discuss some
commonly used estimators.

Maximum likelihood estimation can be defined as a method for estimating
parameters (such as the mean or variance) from sample data such that
the probability (likelihood) of obtaining the observed data is maximized.

Consider a set of m examples X = {x(1), . . . , x(m)} drawn independently
from the true but unknown data-generating distribution p_data(x). Let
p_model(x; θ) be a parametric family of probability distributions over the
same space indexed by θ. In other words, p_model(x; θ) maps any
configuration x to a real number estimating the true probability p_data(x).

The maximum likelihood estimator for θ is then defined as:

θ_ML = argmax_θ p_model(X; θ)

Since we assumed the examples to be i.i.d., the above equation can be
written in the product form as:

θ_ML = argmax_θ Π(i=1 to m) p_model(x(i); θ)

This product over many probabilities can be inconvenient for a variety of
reasons. For example, it is prone to numerical underflow. Also, to find the
maxima/minima of this function, we can take the derivative of this function
w.r.t. θ and equate it to 0.

Since we have terms in a product here, we would need to apply the product
rule, which is quite cumbersome with many factors. To obtain a more
convenient but equivalent optimization problem, we observe that taking the
logarithm of the likelihood does not change its arg max but does conveniently
transform a product into a sum, and since log is a strictly increasing
function (the natural log function is a monotone transformation), it does not
impact the resulting value of θ.

So we have:

θ_ML = argmax_θ Σ(i=1 to m) log p_model(x(i); θ)
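A minimal numerical check of this idea for a Gaussian model (the data-generating parameters below are illustrative): the negative log-likelihood is smallest at the closed-form maximum likelihood estimates, the sample mean and the (biased) sample variance.

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=1000)    # samples from p_data

def neg_log_likelihood(mu, sigma2, x):
    # minus the sum over i of log p_model(x(i); mu, sigma2) for a Gaussian
    return 0.5 * np.sum(np.log(2 * np.pi * sigma2) + (x - mu) ** 2 / sigma2)

mu_ml, sigma2_ml = x.mean(), x.var()             # closed-form ML estimates
print(neg_log_likelihood(mu_ml, sigma2_ml, x))   # lower than any nearby setting
print(neg_log_likelihood(mu_ml + 0.5, sigma2_ml, x))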

2.7 BAYESIAN STATISTICS

Bayesian statistics is a technique that assigns "degrees of belief," or
Bayesian probabilities, to traditional statistical modeling. In this interpretation
of statistics, probability is calculated as the reasonable expectation of an
event occurring based upon currently known triggers. In other words,
probability is a dynamic quantity that can change as new information is
gathered, rather than a fixed value based upon frequency or propensity.

While not applicable to every deep learning technique, this statistical
approach affects three key fields of machine learning:

Statistical Inference

- Bayesian inference uses Bayesian probability to summarize evidence for the
likelihood of a prediction.

Statistical Modeling

- Bayesian statistics helps some models by classifying and specifying the prior
distributions of any unknown parameters.

Experiment Design

– By including the concept of "prior belief influence," this technique uses
sequential analysis to factor in the outcome of earlier experiments when
designing new ones. These "beliefs" are updated by prior and posterior
distributions.

While most machine learning models try to predict outcomes from large
datasets, the Bayesian approach is helpful for several classes of problems that
aren’t easily solved with other probability models. In particular:

 Databases with few data points for reference
 Models with strong prior intuitions from pre-existing observations
 Data with high levels of uncertainty, or when it's necessary to quantify
the level of uncertainty across an entire model or compare different
models
 When a model generates a null hypothesis but it's necessary to claim
something about the likelihood of the alternative hypothesis

Frequentist Statistics vs Bayesian Statistics

S.NO  Bayesian inference                      Frequentist inference

1     It uses probabilities for both          It doesn't use or render probabilities
      hypotheses and data.                    of a hypothesis, i.e., no prior or
                                              posterior.

2     It relies on the prior and the          It counts only on the likelihood for
      likelihood of observed data.            both observed and unobserved data.

3     It requires an individual to learn      It never seeks a prior.
      or make a subjective prior.

4     It dominated statistical practice       It dominated statistical practice
      before the 20th century.                during the 20th century.
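A minimal sketch of the Bayesian update of "degrees of belief" described above: a Beta prior over a coin's probability of heads is combined with observed flips to give a posterior. The prior parameters and data are illustrative assumptions.

heads, tails = 7, 3        # hypothetical observed data
a, b = 2, 2                # Beta(2, 2) prior: a mild belief that the coin is fair

# The Beta prior is conjugate to the binomial likelihood, so the posterior
# is again a Beta distribution with the observed counts added in.
a_post, b_post = a + heads, b + tails
posterior_mean = a_post / (a_post + b_post)
print(posterior_mean)      # 9/14 ≈ 0.64, pulled toward the prior's 0.5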

2.8 SUPERVISED AND UNSUPERVISED LEARNING

Supervised learning is a machine learning approach that's defined by its
use of labeled datasets. These datasets are designed to train or
"supervise" algorithms into classifying data or predicting outcomes
accurately. Using labeled inputs and outputs, the model can measure its
accuracy and learn over time.

Supervised learning can be separated into two types of problems when
data mining: classification and regression.

• Classification problems use an algorithm to accurately assign test data
into specific categories, such as separating apples from oranges. Or, in the
real world, supervised learning algorithms can be used to classify spam in
a separate folder from your inbox. Linear classifiers, support vector
machines, decision trees, and random forests are all common types of
classification algorithms.

• Regression is another type of supervised learning method that uses an
algorithm to understand the relationship between dependent and
independent variables. Regression models are helpful for predicting
numerical values based on different data points, such as sales revenue
projections for a given business. Some popular regression algorithms are
linear regression, logistic regression, and polynomial regression.

UNSUPERVISED LEARNING

Unsupervised learning uses machine learning algorithms to analyze and
cluster unlabeled data sets. These algorithms discover hidden patterns in data
without the need for human intervention (hence, they are "unsupervised").

Unsupervised learning models are used for three main tasks:
clustering, association, and dimensionality reduction.

• Clustering is a data mining technique for grouping unlabeled data
based on similarities or differences. For example, K-means
clustering algorithms assign similar data points into groups, where the
K value represents the number of groups and thus the granularity. This
technique is helpful for market segmentation, image compression, etc.

• Association is another type of unsupervised learning method that uses
different rules to find relationships between variables in a given
dataset. These methods are frequently used for market basket analysis
and recommendation engines, along the lines of "Customers Who Bought
This Item Also Bought" recommendations.

• Dimensionality reduction is a learning technique used when the number of
features (or dimensions) in a given dataset is too high. It reduces the
number of data inputs to a manageable size while also preserving the
data integrity. Often, this technique is used in the data preprocessing
stage, such as when autoencoders remove noise from visual data to
improve picture quality.

The main difference between supervised and unsupervised learning: labeled
data

The main distinction between the two approaches is the use of labeled
datasets. To put it simply, supervised learning uses labeled input and
output data, while an unsupervised learning algorithm does not.

In supervised learning, the algorithm "learns" from the training dataset by
iteratively making predictions on the data and adjusting for the correct
answer. While supervised learning models tend to be more accurate than
unsupervised learning models, they require upfront human intervention to
label the data appropriately. For example, a supervised learning model
can predict how long your commute will be based on the time of day,
weather conditions and so on. But first, you’ll have to train it to know that
rainy weather extends the driving time.

Unsupervised learning models, in contrast, work on their own to discover the
inherent structure of unlabeled data. Note that they still require some human
intervention for validating output variables. For example, an unsupervised
learning model can identify that online shoppers often purchase groups of
products at the same time. However, a data analyst would need to
validate that it makes sense for a recommendation engine to group baby
clothes with an order of diapers, applesauce and sippy cups.

Other key differences between supervised and unsupervised learning

• Goals: In supervised learning, the goal is to predict outcomes for new
data. You know up front the type of results to expect. With an
unsupervised learning algorithm, the goal is to get insights from large
volumes of new data. The machine learning itself determines what is
different or interesting in the dataset.

• Applications: Supervised learning models are ideal for spam detection,
sentiment analysis, weather forecasting and pricing predictions, among
other things. In contrast, unsupervised learning is a great fit for
anomaly detection, recommendation engines, customer personas and
medical imaging.

• Complexity: Supervised learning is a simple method for machine learning,
typically carried out through the use of programs like R or Python. In
unsupervised learning, you need powerful tools for working with large
amounts of unclassified data. Unsupervised learning models are
computationally complex because they need a large training set to
produce intended outcomes.

• Drawbacks: Supervised learning models can be time-consuming to
train, and the labels for input and output variables require expertise.
Meanwhile, unsupervised learning methods can have wildly inaccurate
results unless you have human intervention to validate the output
variables.

2.9 STOCHASTIC GRADIENT DESCENT

Gradient Descent in Brief

 Gradient Descent is a generic optimization algorithm capable of finding
optimal solutions to a wide range of problems.
 The general idea is to tweak parameters iteratively in order to minimize the
cost function.
 An important parameter of Gradient Descent (GD) is the size of the
steps, determined by the learning rate hyperparameter. If the
learning rate is too small, then the algorithm will have to go
through many iterations to converge, which will take a long time; and
if it is too high, we may jump past the optimal value.

Types of Gradient Descent:

Typically, there are three types of Gradient Descent:

1. Batch Gradient Descent
2. Stochastic Gradient Descent
3. Mini-batch Gradient Descent

Here we will be discussing Stochastic Gradient Descent (SGD).

Stochastic Gradient Descent (SGD):

The word 'stochastic' means a system or process linked with a random
probability. Hence, in Stochastic Gradient Descent, a few samples are
selected randomly instead of the whole data set for each iteration. In
Gradient Descent, there is a term called "batch" which denotes the total
number of samples from a dataset that is used for calculating the
gradient in each iteration. In typical Gradient Descent optimization, like
Batch Gradient Descent, the batch is taken to be the whole dataset.
Although using the whole dataset is really useful for getting to the minima
in a less noisy and less random manner, the problem arises when our
dataset gets big.

Suppose you have a million samples in your dataset; if you use a typical
Gradient Descent optimization technique, you will have to use all of the
one million samples to complete one iteration, and this has to be
done for every iteration until the minima are reached. Hence, it becomes
computationally very expensive to perform.

This problem is solved by Stochastic Gradient Descent. SGD uses only
a single sample, i.e., a batch size of one, to perform each iteration. The
data is randomly shuffled and a sample is selected for performing the
iteration.

SGD algorithm:

So, in SGD, we find out the gradient of the cost function of a single
example at each iteration instead of the sum of the gradient of the cost
function of all the examples.
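A minimal runnable sketch of this per-example update for linear regression; the data, learning rate, and epoch count are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)
X = np.c_[np.ones(100), rng.normal(size=100)]   # bias column plus one feature
y = 4.0 + 3.0 * X[:, 1] + rng.normal(0, 0.5, 100)

theta, lr = np.zeros(2), 0.05
for epoch in range(50):
    for i in rng.permutation(len(X)):           # shuffle, then one sample per step
        grad = (X[i] @ theta - y[i]) * X[i]     # gradient of one example's cost
        theta -= lr * grad
print(theta)                                    # approximately [4.0, 3.0]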

* In SGD, since only one sample from the dataset is chosen at random for
each iteration, the path taken by the algorithm to reach the minima is usually
noisier than for the typical Gradient Descent algorithm. But that doesn't
matter much, because the path taken by the algorithm does not matter,
as long as we reach the minima with a significantly shorter training time.

* The path taken by Batch Gradient Descent is smooth and direct, while the
path taken by Stochastic Gradient Descent is noisier.

* One thing to note is that, as SGD is generally noisier than typical Gradient
Descent, it usually takes a higher number of iterations to reach the minima,
because of the randomness in its descent. Even though it requires a higher
number of iterations to reach the minima than typical Gradient Descent, it
is still computationally much less expensive than typical Gradient Descent.
Hence, in most scenarios, SGD is preferred over Batch Gradient Descent for
optimizing a learning algorithm.

2.10 CHALLENGES MOTIVATING DEEP LEARNING

Shortcomings of conventional ML

 The curse of dimensionality
 Local constancy and smoothness regularization
 Manifold learning

Curse of dimensionality

 The number of possible distinct configurations of a set of variables
increases exponentially with the number of variables, which poses a
statistical challenge.
 Example: with one variable we may need to track 10 regions of interest;
with two variables, 100 regions; and with three variables, 1000 regions
(see the snippet below).
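A tiny illustration of this growth, with 10 distinguishable regions per variable:

for d in (1, 2, 3, 10):
    print(d, 10 ** d)      # 10, 100, 1000, 10000000000 configurations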

Local Constancy & Smoothness Regularization

Prior beliefs

 To generalize well, ML algorithms need prior beliefs.
 These can take the form of probability distributions over parameters, or
influence the function itself, while the parameters are influenced only
indirectly.
 Algorithms can also be biased towards preferring a class of functions;
these biases may not be expressed in terms of a probability distribution.
 The most widely used prior is the smoothness prior, also called the local
constancy prior. It states that the function we learn should not change
very much within a small region.

Local Constancy

Manifold Learning

 An important idea underlying many ideas in machine learning.
 A manifold is a connected region.
 Mathematically, it is a set of points in a neighborhood.
 It appears locally to be a Euclidean space.
 E.g., we experience the world as a 2-D plane, while it is actually a
spherical manifold in 3-D space.

2.11 DEEP FEEDFORWARD NETWORKS: LEARNING XOR

A feed-forward neural network is an artificial neural network in which the
nodes never form a cycle. A recurrent neural network, in which some
routes are cycled, is the polar opposite of a feed-forward neural network.
The feed-forward model is the basic type of neural network
because the input is only processed in one direction. The data always
flows in one direction and never backwards.

The XOR Problem

The XOR, or "exclusive or", problem is a classic problem in ANN
research. It is the problem of using a neural network to predict the
outputs of XOR logic gates given two binary inputs. An XOR
function should return a true value if the two inputs are not equal
and a false value if they are equal. All possible inputs and
predicted outputs are shown in figure 1.

XOR is a classification problem and one for which the expected
outputs are known in advance. It is therefore appropriate to use a
supervised learning approach.

On the surface, XOR appears to be a very simple problem;
however, Minsky and Papert (1969) showed that this was a big
problem for the neural network architectures of the 1960s, known as
perceptrons.
Perceptrons
Like all ANNs, the perceptron is composed of a network of *units*,
which are analogous to biological neurons. A unit can receive an
input from other units. On doing so, it takes the sum of all values
received and decides whether it is going to forward a signal on to
other units to which it is connected. This is called activation. The
activation function uses some means or other to reduce the sum of
input values to a 1 or a 0 (or a value very close to a 1 or 0) in order to
represent activation or lack thereof. Another form of unit, known as a
bias unit, always activates, typically sending a hard coded 1 to all
units to which it is connected.

Perceptrons include a single layer of input units — including one
bias unit — and a single output unit (see figure 2). Here a bias unit
is depicted by a dashed circle, while other units are shown as blue
circles. There are two non-bias input units representing the two
binary input values for XOR. Any number of input units can be included.

The perceptron is a type of feed-forward network, which means
the process of generating an output — known as forward
propagation — flows in one direction from the input layer to the
output layer. There are no connections between units in the input
layer. Instead, all units in the input layer are connected directly to the
output unit.

A simplified explanation of the forward propagation process is that
the input values X1 and X2, along with the bias value of 1, are
multiplied by their respective weights W0..W2, and passed to the
output unit. The output unit takes the sum of those values and
employs an activation function — typically the Heaviside step function
— to convert the resulting value to a 0 or 1, thus classifying the input
values as 0 or 1.

It is the setting of the weight variables that gives the network’s
author control over the process of converting input values to an
output value. It is the weights that determine where the classification
line, the line that separates data points into classification groups, is
drawn. If all data points on one side of a classification line are
assigned the class of 0, all others are classified as 1.

A limitation of this architecture is that it is only capable of separating
data points with a single line. This is unfortunate because the XOR
inputs are not linearly separable. This is particularly visible if you plot
the XOR input values on a graph. As shown in figure 3, there is no way
to separate the 1 and 0 predictions with a single classification line.

Multilayer Perceptrons

The solution to this problem is to expand beyond the single-layer
architecture by adding an additional layer of units without
any direct access to the outside world, known as a hidden layer.
This kind of architecture — shown in figure 4 — is another feed-
forward network known as a multilayer perceptron (MLP).

It is worth noting that an MLP can have any number of units in its
input, hidden and output layers. There can also be any number of
hidden layers. The architecture used here is designed specifically
for the XOR problem.

Similar to the classic perceptron, forward propagation begins with the
input values and bias unit from the input layer being multiplied by
their respective weights; however, in this case there is a weight
for each combination of input (including the input layer's bias unit)
and hidden unit (excluding the hidden layer's bias unit).

The products of the input layer values and their respective weights
are passed as input to the non-bias units in the hidden layer. Each
non-bias hidden unit invokes an activation function — usually the
classic sigmoid function in the case of the XOR problem — to squash
the sum of their input values down to a value that falls between 0 and
1 (usually a value very close to either 0 or 1).

The outputs of each hidden layer unit, including the bias unit, are then
multiplied by another set of respective weights and passed to an
output unit. The output unit also passes the sum of its input values
through an activation function — again, the sigmoid function is
appropriate here — to return an output value falling between 0 and 1.
This is the predicted output.

This architecture, while more complex than that of the classic
perceptron network, is capable of achieving non-linear separation.
Thus, with the right set of weight values, it can provide the necessary
separation to accurately classify the XOR inputs.
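A minimal sketch of such a network solving XOR exactly. The text above uses sigmoid hidden units trained by a learning algorithm; here, purely for illustration, the weights are set by hand using the well-known two-ReLU-hidden-unit construction from Goodfellow et al., which makes the forward pass easy to verify.

import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])   # all four XOR inputs

W = np.array([[1, 1], [1, 1]])   # input -> hidden weights
c = np.array([0, -1])            # hidden-unit biases
w = np.array([1, -2])            # hidden -> output weights
b = 0                            # output bias

h = np.maximum(0, X @ W + c)     # forward propagation through the hidden layer
y = h @ w + b
print(y)                         # [0 1 1 0] -- the XOR outputs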

2.12 GRADIENT-BASED LEARNING

Gradient descent is an optimization algorithm used for minimizing
the cost function in various machine learning algorithms. It is basically
used for updating the parameters of the learning model.

Types of gradient descent:

Batch Gradient Descent: This is a type of gradient descent which
processes all the training examples for each iteration of gradient
descent. But if the number of training examples is large, then batch
gradient descent is computationally very expensive; hence, in that case,
batch gradient descent is not preferred. Instead, we prefer to use
stochastic gradient descent or mini-batch gradient descent.

Stochastic Gradient Descent: This is a type of gradient descent which
processes 1 training example per iteration. Hence, the parameters
are updated after every single iteration, in which only one
example has been processed. This makes it quite a bit faster than batch
gradient descent. But again, when the number of training examples
is large, it still processes only one example at a time, which can be
additional overhead for the system, as the number of iterations will be
quite large.

Mini-batch Gradient Descent: This is a type of gradient descent which
works faster than both batch gradient descent and stochastic
gradient descent. Here b examples, where b < m, are processed per
iteration. So even if the number of training examples is large, it is
processed in batches of b training examples in one go. Thus, it works
for larger training sets, and with a smaller number of iterations.

Variables used:
Let m be the number of training examples. Let n be the number of
features.

Note: if b == m, then mini-batch gradient descent behaves exactly like
batch gradient descent.

Algorithm for batch gradient descent:

Let hθ(x) be the hypothesis for linear regression, and let Σ represent the
sum over all training examples from i = 1 to m. Then, the cost function is
given by:

Jtrain(θ) = (1/2m) Σ (hθ(x(i)) − y(i))²

Repeat {
    θj = θj − (learning rate/m) Σ (hθ(x(i)) − y(i)) xj(i)   (for every j = 0 … n)
}

where xj(i) represents the jth feature of the ith training example.

So if m is very large (e.g., 5 million training samples), then it takes hours
or even days to converge to the global minimum. That's why for
large datasets it is not recommended to use batch gradient descent,
as it slows down the learning.
Algorithm for stochastic gradient descent:
1. Randomly shuffle the data set so that the parameters can be
trained evenly for each type of data.
2. As mentioned above, it takes into consideration one example
per iteration.

Hence,
Let (x(i), y(i)) be a training example.

Cost(θ, (x(i), y(i))) = (1/2) (hθ(x(i)) − y(i))²
Jtrain(θ) = (1/m) Σ Cost(θ, (x(i), y(i)))

Repeat {
    For i = 1 to m {
        θj = θj − (learning rate) (hθ(x(i)) − y(i)) xj(i)   (for every j = 0 … n)
    }
}
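A minimal runnable version of the batch gradient descent pseudocode above, for linear regression; the data, learning rate, and iteration count are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)
m = 200
X = np.c_[np.ones(m), rng.uniform(-1, 1, m)]   # x0 = 1 plus one feature
y = 1.0 + 2.0 * X[:, 1] + rng.normal(0, 0.1, m)

theta, lr = np.zeros(2), 0.5
for _ in range(200):
    grad = X.T @ (X @ theta - y) / m           # (1/m) Σ (hθ(x(i)) − y(i)) x(i)
    theta -= lr * grad                         # simultaneous update of every θj
print(theta)                                   # approximately [1.0, 2.0]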

2.13 HIDDEN UNITS

The design of hidden units is an extremely active area of research
and does not yet have many definitive guiding theoretical principles.
Rectified linear units are an excellent default choice of hidden unit.
Here we discuss the motivations behind the choice of hidden unit. It is
usually impossible to predict in advance which will work best. The design
process consists of trial and error: intuiting that a kind of hidden unit
may work well, and evaluating its performance on a validation set.

Some hidden units are not differentiable at all input points.
For example, the rectified linear function g(z) = max{0, z} is
not differentiable at z = 0. This may seem like it invalidates g for use
with a gradient-based learning algorithm. In practice, gradient
descent still performs well enough for these models to be used for
machine learning tasks.

Most hidden units can be described as accepting a vector of inputs x,
computing an affine transformation z = Wᵀx + b, and then
applying an element-wise nonlinear function g(z). Most hidden units
are distinguished from each other only by the choice of the form of
the activation function g(z).

Rectified Linear Units and Their Generalizations
Rectified linear units use the activation function g(z) = max{0, z}.
Rectified linear units are easy to optimize due to their similarity with
linear units. The only difference from linear units is that they output 0
across half their domain. The derivative is 1 everywhere that the unit
is active.

Thus the gradient direction is far more useful than with activation
functions that have second-order effects.

Rectified linear units are typically used on top of an affine
transformation:

h = g(Wᵀx + b).

It is good practice to set all elements of b to a small value such as 0.1.
This makes it likely that ReLU will be initially active for most training
samples and allow derivatives to pass through.

ReLU vs other activations:

Sigmoid and tanh activation functions cannot be used with many layers
due to the vanishing gradient problem.
ReLU overcomes the vanishing gradient problem, allowing models to
learn faster and perform better.
ReLU is the default activation function for MLPs and CNNs.

One drawback of rectified linear units is that they cannot learn via
gradient-based methods on examples for which their activation is
zero.

Three generalizations of rectified linear units are based on using a
non-zero slope αi when zi < 0:

hi = g(z, α)i = max(0, zi) + αi min(0, zi).

Absolute value rectification fixes αi = −1 to obtain g(z) = |z|. It is
used for object recognition from images.
A leaky ReLU fixes αi to a small value like 0.01, while a parametric ReLU
treats αi as a learnable parameter (see the sketch below).
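A minimal sketch of the activations discussed in this section, written as plain numpy functions:

import numpy as np

def generalized_relu(z, alpha):       # max(0, z) + alpha * min(0, z)
    return np.maximum(0, z) + alpha * np.minimum(0, z)

def relu(z):                          # g(z) = max{0, z}, i.e. alpha = 0
    return generalized_relu(z, 0.0)

def abs_rectify(z):                   # alpha = -1 gives g(z) = |z|
    return generalized_relu(z, -1.0)

def leaky_relu(z, alpha=0.01):        # small fixed slope for z < 0
    return generalized_relu(z, alpha)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-2.0, 0.0, 2.0])
print(relu(z), abs_rectify(z), leaky_relu(z), sigmoid(z), np.tanh(z))
print(np.allclose(np.tanh(z), 2 * sigmoid(2 * z) - 1))   # tanh(z) = 2σ(2z) − 1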

Logistic Sigmoid and Hyperbolic Tangent

Prior to rectified linear units, most neural networks used the logistic
sigmoid activation function, g(z) = σ(z), or the hyperbolic tangent
activation function, g(z) = tanh(z).
These activation functions are closely related because tanh(z) = 2σ(2z) − 1.
We have already seen sigmoid units as output units, used to predict the
probability that a binary variable is 1.

Sigmoidal units saturate across most of their domain: they saturate to 1
when z is very positive and to 0 when z is very negative, and are strongly
sensitive to their input only when z is near 0. Saturation makes gradient-
based learning difficult.

The hyperbolic tangent typically performs better than the logistic sigmoid.
It resembles the identity function more closely. Because tanh is similar to
the identity function near 0, training a deep neural network

ŷ = wᵀ tanh(Uᵀ tanh(Vᵀx))

resembles training a linear model

ŷ = wᵀUᵀVᵀx,

so long as the activations of the network can be kept small.

2.14 ARCHITECTURE DESIGN

The word architecture refers to the overall structure of the network:
how many units it should have and how these units should be
connected to each other.

Generic Neural Architecture

Most neural networks are organized into groups of units called
layers. Most neural network architectures arrange these layers in a
chain structure, with each layer being a function of the layer that
preceded it.

In this structure, the first layer is given by

h(1) = g(1)(W(1)ᵀx + b(1)),

and the second layer is given by

h(2) = g(2)(W(2)ᵀh(1) + b(2)).

In these chain-based architectures, the main architectural
considerations are to choose the depth of the network and the
width of each layer (see the sketch below).
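A minimal sketch of this two-layer chain as code; the layer widths, random weights, and ReLU choice of g are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=3)                            # input vector

W1, b1 = rng.normal(size=(3, 4)), np.zeros(4)     # first layer: width 4
W2, b2 = rng.normal(size=(4, 2)), np.zeros(2)     # second layer: width 2

h1 = np.maximum(0, W1.T @ x + b1)                 # h(1) = g(1)(W(1)ᵀx + b(1))
h2 = np.maximum(0, W2.T @ h1 + b2)                # h(2) = g(2)(W(2)ᵀh(1) + b(2))
print(h2)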

Universal Approximation Properties and Depth

A feed-forward network with a single hidden layer containing a finite
number of neurons can approximate continuous functions on
compact subsets of ℝⁿ, under mild assumptions on the activation
function.

2.15 BACK-PROPAGATION AND OTHER DIFFERENTIATION ALGORITHMS

Simple neural networks can represent a wide variety of interesting
functions when given appropriate parameters. However, this says
nothing about the algorithmic learnability of those parameters.

The universal approximation theorem means that regardless of what
function we are trying to learn, we know that a large MLP will be able to
represent this function. However, we are not guaranteed that the
training algorithm will be able to learn that function. Even if the MLP
is able to represent the function, learning can fail for two different
reasons:

1. The optimization algorithm may not be able to find the values of the
parameters that correspond to the desired function.
2. The training algorithm might choose the wrong function due to
overfitting.

The universal approximation theorem says that there exists a network
large enough to achieve any degree of accuracy we desire, but the
theorem does not say how large this network will be. Barron (1993)
provides some bounds on the size of a single-layer network needed to
approximate a broad class of functions.
Unfortunately, in the worst case, an exponential number of hidden units
may be required. This is easiest to see in the binary case: the number of
possible binary functions on vectors v ∈ {0, 1}ⁿ is 2^(2ⁿ), and
selecting one such function requires 2ⁿ bits, which will in general
require O(2ⁿ) degrees of freedom.

A feedforward network with a single layer is sufficient to represent
any function, but the layer may be infeasibly large and may fail to
generalize correctly. Using deeper models can reduce the number of
units required and reduce the generalization error.

BACKPROPAGATION AND OTHER DIFFERENTIATION ALGORITHMS

Backpropagation is the essence of neural network training. It is the method
of fine-tuning the weights of a neural network based on the error rate
obtained in the previous epoch (i.e., iteration). Proper tuning of the weights
reduces error rates and makes the model reliable by increasing its
generalization.
Backpropagation, in the neural network context, is short for "backward
propagation of errors." It is a standard method of training artificial neural
networks. This method helps calculate the gradient of a loss function with
respect to all the weights in the network.

The backpropagation algorithm computes the gradient of
the loss function for a single weight by the chain rule. It efficiently
computes one layer at a time, unlike a naive direct computation. It
computes the gradient, but it does not define how the gradient is used. It
generalizes the computation in the delta rule.
1. Inputs X arrive through the preconnected path.
2. Input is modeled using real weights W. The weights are usually
randomly selected.
3. Calculate the output for every neuron from the input layer, through
the hidden layers, to the output layer.
4. Calculate the error in the outputs: Error = Actual Output − Desired
Output.
5. Travel back from the output layer to the hidden layer to adjust the
weights such that the error is decreased.
6. Keep repeating the process until the desired output is achieved.
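A minimal sketch of these steps for a tiny 2-3-1 sigmoid network trained on XOR; the layer sizes, learning rate, and epoch count are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([[0.], [1.], [1.], [0.]])

sigmoid = lambda z: 1 / (1 + np.exp(-z))
W1, b1 = rng.normal(size=(2, 3)), np.zeros(3)    # step 2: random initial weights
W2, b2 = rng.normal(size=(3, 1)), np.zeros(1)

for epoch in range(10000):                       # step 6: repeat until trained
    h = sigmoid(X @ W1 + b1)                     # step 3: forward pass
    out = sigmoid(h @ W2 + b2)
    err = out - y                                # step 4: error in the outputs
    # step 5: travel back through the network (chain rule) and adjust weights
    d_out = err * out * (1 - out)
    d_h = (d_out @ W2.T) * h * (1 - h)
    W2 -= 0.5 * h.T @ d_out
    b2 -= 0.5 * d_out.sum(axis=0)
    W1 -= 0.5 * X.T @ d_h
    b1 -= 0.5 * d_h.sum(axis=0)

print(out.round(2).ravel())                      # typically close to [0, 1, 1, 0]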

The most prominent advantages of backpropagation are:

 Backpropagation is fast, simple and easy to program.
 It has no parameters to tune apart from the number of inputs.
 It is a flexible method, as it does not require prior knowledge about the
network.
 It is a standard method that generally works well.
 It does not need any special mention of the features of the function to
be learned.
Types of Backpropagation Networks
Two Types of Backpropagation Networks are:
 Static Back-propagation
 Recurrent Backpropagation

Static back-propagation:
It is one kind of backpropagation network which produces a mapping of a
static input for static output. It is useful to solve static classification issues like
optical character recognition.

Recurrent Backpropagation:
In recurrent backpropagation, activations are fed forward until a fixed
value is achieved. After that, the error is computed and propagated
backward.

The main difference between the two methods is that the mapping is
immediate in static back-propagation, while it is not in recurrent
backpropagation.

History of Backpropagation
 In 1961, the basic concepts of continuous backpropagation were
derived in the context of control theory by J. Kelly, Henry Arthur, and E.
Bryson.
 In 1969, Bryson and Ho gave a multi-stage dynamic system optimization
method.
 In 1974, Werbos stated the possibility of applying this principle in an
artificial neural network.
 In 1982, Hopfield brought his idea of a neural network.
 In 1986, by the effort of David E. Rumelhart, Geoffrey E. Hinton, and
Ronald J. Williams, backpropagation gained recognition.
 In 1993, Wan was the first person to win an international pattern
recognition contest with the help of the backpropagation method.

OTHER DIFFERENTIATION ALGORITHMS

Automatic Differentiation
 The deep learning community has been somewhat isolated from the
broader CS community dealing with automatic differentiation.
 The back-propagation algorithm is only one approach to automatic
differentiation.
 It is a special case of a broader class of techniques called reverse-mode
accumulation.

Computational Complexity
 In general, determining the order of evaluation that results in the lowest
computational cost is a difficult problem.
 Finding the optimal sequence of operations to compute the gradient is
NP-complete (Naumann, 2008),
 in the sense that it may require simplifying algebraic expressions into
their least expensive form.

Future differentiation technology

 Backprop is not the only, or optimal, way of computing the gradient, but
it is a practical method for deep learning.
 In the future, differentiation technology for deep networks may improve
with advances in the broader field of automatic differentiation.

9. Practice Quiz
1. Which of the following CANNOT be achieved by using machine
learning?
a) forecast the outcome variable into the future
b) accurately predict the outcome using supervised learning
algorithms
c) proving causal relationships between variables
d) classify respondents into groups based on their response pattern

2. Algorithms are ______-oriented elements of the object model.

a) Procedure-oriented
b) Object-oriented
c) Logic-oriented
d) Rule-oriented

3. Machine Learning is a field of AI consisting of learning algorithms that:

a) At executing some task
b) Over time with experience
c) Improve their performance
d) All of the above

4. Machine learning algorithms build a model based on sample
data, known as:
a. Training Data
b. Transfer Data
c. Data Training
d. None of the above

5. Machine learning is a subset of ................
a) Deep Learning
b) Artificial Intelligence
c) Data Learning
d) None of the above

6. ............... algorithms enable the computers to learn from data, and
even improve themselves, without being explicitly programmed.

a) Deep Learning
b) Machine Learning
c) Artificial Intelligence
d) None of the above
7. What are the three types of Machine Learning?
a) Supervised Learning
b) Unsupervised learning
c) Reinforcement Learning
d) All of the above
8. Which of the following is not a supervised learning?

a) PCA
b) Naive Bayesian
c) Linear Regression
d) Decision Tree

9. Which is true for neural networks?

a) It is a set of nodes and connections
b) Each node computes its weighted input
c) A node could be in an excited state or a non-excited state
d) All of the above

10. Neural Networks consist of artificial neurons that are similar to the
biological model of neurons.

a. True
b. False

10. Assignments

S.No  Question                                                         BL  CO
1     Discuss supervised and unsupervised learning.                    6   2
2     Write and explain in detail about gradient-based learning,       5   1
      with examples.
3     Explain in detail about gradient descent.                        5   1
4     Compare hyperparameters and validation sets.                     2   2
5     Discuss in detail the Bayesian statistics.                       6   2

11. Part A - Questions & Answers

S.No  Question & Answers                                               BL  CO

1     What is a hyperparameter in machine learning?                    1   1
      Ans. In machine learning, a hyperparameter is a parameter
      whose value is used to control the learning process. By
      contrast, the values of other parameters (typically node
      weights) are derived via training.
      Examples of hyperparameters in machine learning include:
       Model architecture
       Learning rate
       Number of epochs
       Number of branches in a decision tree
       Number of clusters in a clustering algorithm

2     What is the difference between train, test, and validation       1   1
      sets?
      Ans. To summarise, the training set is typically the largest
      subset created out of the original dataset and is used to fit
      the models. The validation set is then used to evaluate the
      models in order to perform model selection.

3     Why is Bayes' rule used?                                         1   1
      Ans. Bayes' rule lets you calculate the posterior (or
      "updated") probability. This is a conditional probability:
      the probability of the hypothesis being true, if the
      evidence is present. Think of the prior (or "previous")
      probability as your belief in the hypothesis before
      seeing the new evidence.

4     Why do we need Stochastic Gradient Descent (SGD)?                1   1
      Ans. The benefits of stochastic gradient descent:
      SGD can be used to train online learning models that can
      update the model parameters as new data comes in. SGD
      can escape from local minima in the cost function more
      easily than other optimization methods.

5     What are the challenges for deep learning models?                1   1
      Ans. We explore 4 major challenges of deep learning
      applications and how you can overcome them:
       Ensure you have enough and relevant training data.
       Optimize computing costs depending on the number
         and size of your DL models.
       Give traditional interpretable models priority over DL.
       Use privacy-protecting data security techniques.

6     What is backpropagation in deep learning?                        1   1
      Ans. Backpropagation is an algorithm that back-propagates
      the errors from the output nodes to the input nodes;
      therefore, it is simply referred to as backward propagation
      of errors. It is used in a vast range of neural network
      applications in data mining, like character recognition,
      signature verification, etc.

12. Part B - Questions

S.No  Question                                                         BL  CO
1     Explain in detail about gradient descent.                        2   1
2     Compare hyperparameters and validation sets.                     2   1
3     Discuss in detail the Bayesian statistics.                       2   4
4     Discuss in detail the gradient descent and stochastic            6   2
      gradient-based algorithms.

13. Supportive Online Certification Courses

1. CS231n: Convolutional Neural Networks for Visual Recognition, Stanford
2. CS224d: Deep Learning for Natural Language Processing, Stanford
3. CS285: Deep Reinforcement Learning, Berkeley
4. MIT 6.S094: Deep Learning for Self-Driving Cars, MIT
5. Neural Networks and Deep Learning by Andrew Ng, conducted by
Coursera – 4 weeks
6. TensorFlow for Deep Learning by Dr Kevin Webster, conducted by
Coursera – 6 months
7. Deep Learning NPTEL course by Prof. Sudarshan Iyengar and Prof.
Sanatan Sukhija, IIT Ropar – 10 weeks

14. Real Time Applications

S.No Application CO

1 Virtual Assistants 1
Virtual Assistants are cloud-based applications that understand
natural language voice commands and complete tasks for the user.
Amazon Alexa, Cortana, Siri, and Google Assistant are typical examples
of virtual assistants. They need internet-connected devices to work
with their full capabilities. Each time a command is fed to the
assistant, they tend to provide a better user experience based on
past experiences, using Deep Learning algorithms.
2 Chatbots 1
Chatbots can solve customer problems in seconds. A chatbot is an AI
application to chat online via text or text-to-speech. It is capable of
communicating and performing actions similar to a human. Chatbots are
used a lot in customer interaction, marketing on social network sites, and
instant messaging the client. It delivers automated responses to user
inputs. It uses machine learning and deep learning algorithms to
generate different types of reactions.
The next important deep learning application is related to Healthcare.
3 Healthcare 1
Deep Learning has found its application in the Healthcare sector.
Computer-aided disease detection and computer-aided diagnosis have
been possible using Deep Learning. It is widely used for medical research,
drug discovery, and diagnosis of life-threatening diseases such as cancer
and diabetic retinopathy through the process of medical imaging.
4 Entertainment 1
Companies such as Netflix, Amazon, YouTube, and Spotify give relevant
movies, songs, and video recommendations to enhance their customer
experience. This is all thanks to Deep Learning. Based on a person’s
browsing history, interest, and behavior, online streaming companies give
suggestions to help them make product and service choices. Deep
learning techniques are also used to add sound to silent movies and
generate subtitles automatically.
5 News Aggregation and Fake News Detection 1
Deep Learning allows you to customize news depending on the
readers’
persona. You can aggregate and filter out news information as per
preferences of a reader. Neural Networks help develop classifiers that
can detect fake and biased news and remove it from your feed. They
also warn you of possible privacy breaches.
6 Image Coloring 1
Image colorization has seen significant advancements using Deep
Learning. Image colorization is taking an input of a grayscale image
and then producing an output of a colorized image. ChromaGAN is an
example of a picture colorization model. A generative network is framed
in an adversarial model that learns to colorize by incorporating a
perceptual and semantic understanding of both class distributions and
color.

15. Contents Beyond the Syllabus

1. Building Generative Adversarial Networks


Become familiar with generative adversarial networks (GANs) by learning
how to build and train different GANs architectures to generate new
images. Discover, build, and train architectures such as DCGAN,
CycleGAN, ProGAN, and StyleGAN on diverse datasets including the
MNIST dataset, Summer2Winter Yosemite dataset, or CelebA dataset.

16. Prescribed Text Books & Reference Books

Text Books:
1. Ian Goodfellow, Yoshua Bengio, Aaron Courville, "Deep Learning",
MIT Press, 2016.
2. Josh Patterson and Adam Gibson, "Deep Learning: A Practitioner's
Approach", O'Reilly Media, first edition, 2017.
References:
1. "Fundamentals of Deep Learning: Designing Next Generation Machine
Intelligence Algorithms", Nikhil Buduma, O'Reilly, Shroff Publishers, 2019.
2. "Deep Learning Cookbook: Practical Recipes to Get Started Quickly",
Douwe Osinga, O'Reilly, Shroff Publishers, 2019.

17. Mini Project Suggestion

1. Predict Next Sequence

To start with deep learning, the very basic project that you can build is to
predict the next digit in a sequence. Create a sequence like a list of odd
numbers and then build a model and train it to predict the next digit in the
sequence. A simple neural network with 2 layers would be sufficient to
build the model.
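A minimal sketch of this project (assuming TensorFlow/Keras is installed); the input scaling, layer width, and epoch count are illustrative choices.

import numpy as np
from tensorflow import keras

odds = np.arange(1, 201, 2, dtype="float32")        # 1, 3, 5, ..., 199
X = (odds[:-1] / 100.0).reshape(-1, 1)              # scale inputs for stable training
y = odds[1:] / 100.0                                # target: the next odd number

model = keras.Sequential([
    keras.layers.Dense(8, activation="relu", input_shape=(1,)),
    keras.layers.Dense(1),                          # the simple 2-layer model from the text
])
model.compile(optimizer="adam", loss="mse")
model.fit(X, y, epochs=300, verbose=0)

print(100 * model.predict(np.array([[2.01]])))      # should be close to 203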

2. Human Face Detection

The face detection took a major leap with deep learning techniques. We
can build models with high accuracy in detecting the bounding boxes of
the human face. This project will get you started with object detection and
you will learn how to detect any object in an image.

3. Dog’s Breed Identification

How often do you get stuck thinking about the name of a dog’s breed?
There are many dog breeds and most of them are similar to each other. We
can use the dog breeds dataset and build a model that will classify different
dog breeds from an image. This project will be useful for a lot of people.

