
Geethanjali College of Engineering and Technology

(Autonomous)
Cheeryal (V), Keesara (M), Medchal District – 501 301 (T.S.)

(18CS4108) – SIMULATION AND MODELLING


COURSE FILE

DEPARTMENT OF
COMPUTER SCIENCE AND ENGINEERING
(2021-2022)

Faculty In charge HOD-CSE

Mr. M Santhosh kumar Dr. A. Sree Lakshmi

CONTENTS
S.No  Topic
1   Cover Page
2   Syllabus copy
3   Vision of the Department
4   Mission of the Department
5   PEOs, POs and PSOs
6   Course objectives and outcomes
7   Course mapping with POs
8   Brief notes on the importance of the course and how it fits into the curriculum
9   Prerequisites, if any
10  Instructional Learning Outcomes
11  Class Time Table
12  Individual Time Table
13  Lecture schedule with methodology being used/adopted
14  Detailed notes
15  Additional topics
16  University Question papers of previous years
17  Question Bank
18  Assignment Questions
19  Unit-wise quiz questions and long answer questions
20  Tutorial problems
21  Known gaps, if any, and inclusion of the same in lecture schedule
22  Discussion topics, if any
23  References, journals, websites and E-links, if any
24  Quality Measurement Sheets
    A  Course End Survey
    B  Teaching Evaluation
25  Student List
26  Group-wise students list for discussion topic


Geethanjali College of Engineering and Technology
(Autonomous)
DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING
Name of the Subject: SIMULATION AND MODELLING
SUBJECT CODE: 18CS4108 Programme: UG

Branch: COMPUTER SCIENCE & ENGINEERING Version No : 01

Year: IV Updated on : 22-08-21

Semester: I No. of pages : 90

Classification status (Unrestricted / Restricted)

Distribution List :

Prepared by : Updated by :
1) Name : M SANTHOSH KUMAR 1) Name :
2) Sign : 2) Sign :
3) Design : Assistant Professor 3) Design :
4) Date : 19-08-21 4) Date :
Verified by : *For Q.C only

1) Name : 1) Name :

2) Sign : 2) Sign :

3) Design : 3) Design :

4) Date : 4) Date :

Approved by : (HOD ) 1) Name : Dr. A. Sree Lakshmi

2) Sign :

3) Date :

Course coordinator Program Coordinator HOD


2. SYLLABUS
18CS4108– SIMULATION AND MODELLING
UNIT I
Introduction: Concepts of Simulation, Advantages and disadvantages of simulation, Areas of
application, Recent applications of simulation, Discrete and Continuous Systems, System
Modeling, Types of Models, Steps in simulation study.
UNIT II
Random Number Generation: Properties, Generation of Pseudo-Random Numbers, Techniques
of generating random numbers, tests for random numbers.
Random-Variate Generation: Inverse-Transform Technique, Acceptance-Rejection Technique,
Special Properties.
UNIT III: SIMULATION OF CONTINUOUS AND DISCRETE SYSTEMS
Simulation of Continuous Systems: A chemical reactor. Numerical integration vs. continuous
system simulation. Selection of an integration formula, Runge-Kutta integration formulas.
Simulation of a servo system, Simulation of a water reservoir system, Analog vs. digital
simulation.
Discrete System Simulation: Fixed time-step vs. event-to-event model, On simulating
randomness, generation of non-uniformly distributed random numbers, Monte-Carlo computation
vs. stochastic simulation.
UNIT IV: SYSTEM SIMULATION
Simulation of Queuing Systems: Rudiments of queuing theory, Simulation of a single-server
queue, Simulation of a two-server queue, Simulation of more general queues.
Simulation of a Pert Network: Network model of a project, Analysis of activity network,
Critical path computation, Uncertainties in activity durations, Simulation of activity network,
Computer program for simulation, Resource allocation and cost considerations.
UNIT V: SIMULATION EXPERIMENTATION
Design and Evaluation of Simulation Experiments: Length of simulation runs, Variance
reduction techniques, Experimental layout, Validation. Simulation Languages: Continuous and
discrete simulation languages, Continuous simulation languages, Block-structured continuous
simulation languages, Expression-based languages, Discrete-system simulation languages, GPSS.
TEXT BOOK(S)
1. Discrete-Event System Simulation, Jerry Banks, John S. Carson II, Barry L. Nelson,
David M. Nicol, Pearson, Fifth Edition. (Units I & II)
2. System Simulation with Digital Computer, Narsingh Deo, Prentice-Hall of India Private
Limited. (Units III, IV & V)

REFERENCE BOOK(S)
1. System Modeling and Simulation: An Introduction, Frank L. Severance, Wiley Publisher.
2. System Simulation, Geoffrey Gordon, Prentice-Hall of India Private Limited, Second Edition.

3. Vision of the Department


To produce globally competent and socially responsible computer science engineers contributing
to the advancement of engineering and technology which involves creativity and innovation by
providing excellent learning environment with world class facilities.

4. Mission of the Department


1. To be a center of excellence in instruction, innovation in research and scholarship, and
service to the stake holders, the profession, and the public.

2. To prepare graduates to enter a rapidly changing field as competent computer science
engineers.

3. To prepare graduates capable in all phases of software development, who possess a firm
understanding of hardware technologies, have the strong mathematical background necessary
for scientific computing, and are sufficiently well versed in general theory to allow growth
within the discipline as it advances.

4. To prepare graduates to assume leadership roles by possessing good communication skills,


the ability to work effectively as team members, and an appreciation for their social and
ethical responsibility in a global setting.

5. PEOs and POs


PROGRAM EDUCATIONAL OBJECTIVES
PEO1: To provide graduates with a good foundation in mathematics, sciences and engineering
fundamentals required to solve engineering problems that will facilitate them to find employment
in industry and / or to pursue postgraduate studies with an appreciation for lifelong learning.
PEO2: To provide graduates with analytical and problem-solving skills to design algorithms and
other hardware / software systems, and to inculcate professional ethics and inter-personal skills
to work in a multi-cultural team.
PEO3: To facilitate graduates to get familiarized with state-of-the-art software / hardware tools,
imbibing creativity and innovation that would enable them to develop cutting-edge technologies
of a multi-disciplinary nature for societal development.

PROGRAM OUTCOMES
PO1.Engineering knowledge: Apply the knowledge of mathematics, science, engineering
fundamentals, and an engineering specialization to the solution of complex engineering problems.
PO2.Problem analysis: Identify, formulate, review research literature, and analyze complex
engineering problems reaching substantiated conclusions using first principles of
mathematics, natural sciences, and engineering sciences.
PO3.Design/development of solutions: Design solutions for complex engineering problems
and design system components or processes that meet the specified needs with appropriate
consideration for the public health and safety, and the cultural, societal, and environmental
considerations.
PO4.Conduct investigations of complex problems: Use research-based knowledge and
research methods including design of experiments, analysis and interpretation of data, and
synthesis of the information to provide valid conclusions.
PO5.Modern tool usage: Create, select, and apply appropriate techniques, resources, and
modern engineering and IT tools including prediction and modeling to complex engineering
activities with an understanding of the limitations.
PO6.The engineer and society: Apply reasoning informed by the contextual knowledge to
assess societal, health, safety, legal and cultural issues and the consequent responsibilities relevant
to the professional engineering practice.
PO7.Environment and sustainability: Understand the impact of the professional engineering
solutions in societal and environmental contexts, and demonstrate the knowledge of, and
need for sustainable development.
PO8.Ethics: Apply ethical principles and commit to professional ethics and responsibilities and
norms of the engineering practice.
PO9.Individual and team work: Function effectively as an individual, and as a member or
leader in diverse teams, and in multidisciplinary settings.
PO10.Communication: Communicate effectively on complex engineering activities with the
engineering community and with society at large, such as, being able to comprehend and write
effective reports and design documentation, make effective presentations, and give and receive
clear instructions.
PO11.Project management and finance: Demonstrate knowledge and understanding of the
engineering and management principles and apply these to one’s own work, as a member
and leader in a team, to manage projects and in multidisciplinary environments.
PO12.Life-long learning: Recognize the need for, and have the preparation and ability to
engage in independent and life-long learning in the broadest context of technological change.

PSO (Program Specific Outcome):


PSO1: To identify and define the computing requirements for its solution under given
constraints.

PSO2: To follow best practices, namely SEI-CMM levels and Six Sigma, which vary from time
to time, for software development projects using open-ended programming environments to
produce software deliverables as per customer needs.

6. Course Objectives and Outcomes


Course Objectives
Develop ability to
1. Understand simulation and system studies.
2. Explain techniques of random number generation and random variate generation.
3. Distinguish simulation of continuous and discrete Systems.
4. Describe simulation of queuing systems and Pert-network.
5. Design and evaluate simulation experiments and use simulation languages.
Course Outcomes
At the end of the course, the student would be able to:
18CS4108.CO1: Explain the need of simulation and the steps in simulation.
18CS4108.CO2: Generate random numbers and random variates employing various techniques.
18CS4108.CO3: Compare simulation of continuous and discrete systems.
18CS4108.CO4: Analyze the simulation of queuing systems and apply PERT network models.
18CS4108.CO5: Design and evaluate simulation experiments and acquire knowledge of
simulation languages.

7. Course mapping with POs


Mapping of Course with Programme Educational Objectives

S.No  Course Code  Course                    Course component       Semester  PEO 1  PEO 2  PEO 3
1     18CS4108     Simulation and Modelling  Professional Elective  1         3      2      1

Mapping of Course outcomes to Program Outcomes


Course Outcomes - Program Outcomes and Program Specific Outcomes
Simulation and Modelling (18CS4108)

CO1: Explain the need of simulation and steps in simulation.
CO2: Generate random numbers and random variates employing various techniques.
CO3: Compare simulation of continuous and discrete systems.
CO4: Analyze the simulation of queuing systems and apply PERT network models.
CO5: Design and evaluate simulation experiments and acquire knowledge of simulation languages.

      PO1  PO2  PO3  PO4  PO5  PO6  PO7  PO8  PO9  PO10  PO11  PO12  PSO1  PSO2
CO1    2    1    -    -    -    -    -    -    -    -     -     -     1     -
CO2    3    1    -    -    -    -    -    -    -    -     -     -     -     -
CO3    1    -    -    -    -    -    -    -    -    -     -     -     -     -
CO4    1    2    2    1    -    -    -    -    -    -     -     -     1     -
CO5    2    2    3    1    -    -    -    -    -    -     -     -     1     -

8. Brief Importance of the Course and how it fits into the curriculum
a. What role does this course play within the Program?
A simulation is the imitation of the operation of a real-world process or system over time.
Simulations require the use of models; the model represents the key characteristics or behaviors
of the selected system or process, whereas the simulation represents the evolution of the model
over time. Often, computers are used to execute the simulation.
b. How is the course unique or different from other courses of the Program?
Simulation is a technique for practice and learning that can be applied to many different disciplines
and trainees. It is a technique (not a technology) to replace and amplify real experiences with guided
ones, often “immersive” in nature, that evoke or replicate substantial aspects of the real world in a fully
interactive fashion.
c. What essential knowledge or skills should they gain from this experience?
 Practicing in a safe environment
 Understanding human behavior
 Improving teamwork
 Providing confidence
 Giving insight into trainees’ own behavior
d. What knowledge or skills from this course will students need to have mastered to
perform well in future classes or later (Higher Education / Jobs)?
As robots, automation and artificial intelligence perform more tasks and there is massive
disruption of jobs, experts say a wider array of education and skills-building programs will be created
to meet new demands. There are two uncertainties: Will well-prepared workers be able to keep up in
the race with AI tools? And will market capitalism survive?
e. Why is this course important for students to take?
Well-designed simulations and games have been shown to improve decision-making and critical
thinking skills as well as teaching discipline-specific concepts. Active learning also helps students
develop interpersonal and communications skills.
f. What is/are the prerequisite(s) for this course?

o Programming for Problem Solving


o Object Oriented Programming using Java
o Probability and Statistics
g. When students complete this course, what do they need know or be able to do?
Simulation allows students to change parameter values and see what happens. Students
develop a feel for which variables are important and for the significance of magnitude changes
in parameters, as well as for data issues. Simulations also help students understand probability
and sampling theory.
h. Is there specific knowledge that the students will need to know in the future?
Simulation allows you to explore 'what if' questions and scenarios without having to experiment on
the system itself. It helps you to identify bottlenecks in material, information and product flows. It
helps you to gain insight into which variables are most important to system performance.
i. Are there certain practical or professional skills that students will need to apply in the
future?
 YES. Most of the mini and major projects are generally based on Simulation
j. Five years from now, what do you hope students will remember from this course?
 As the Internet of Things grows, so will simulation; since IoT and simulation are emerging
technologies in today’s world, this is clearly an important course for the future.
k. What is it about this course that makes it unique or special?
 It is the only fundamental course that facilitates students in the attainment of all levels of
Bloom's taxonomy.
l. Why does the program offer this course?
 This is the basic course in Simulation field. Without this course, students cannot get the
basic idea about Simulation.
m. Why can’t this course be “covered” as a sub-section of another course?
 It is not possible, as this course covers many topics such as different applications of
simulation, their architectures, functioning, and techniques; if one tried to cover these as part
of another course, it would be too heavy to teach in one semester.

n. What unique contributions to students’ learning experience does this course make?
 It helps in executing mini and major projects that involve Simulation during the later years
of the program.
o. What is the value of taking this course? How exactly does it enrich the program?
 Teaching employees new skills, techniques or processes can be challenging for many companies.
Often training depends on when another employee’s schedule is free or taking that person away
from production or billable work. If machines or equipment are needed for training, then those
machines are not in production during the training time, as well as the person operating the
equipment.

p. What are the major career options that require this course
Specific occupations that employ Simulation & Modeling include:

 Simulation developer
 Computer game simulation engineer
 Simulation technical support

9. Prerequisites
 Programming for Problem Solving
 Object Oriented Programming using Java
 Probability and Statistics
10. Instructional Learning Outcomes

Upon completing this course, it is expected that a student will be able to do the following:

UNIT-I
1. Understand different concepts of simulation
2. Understand the Advantages and disadvantages of simulation
3. Understand the Recent applications of simulation
4. Understand Discrete and Continuous Systems
5. Understand System Modeling and the Types of Models
UNIT-II
1. Understand the random number generation
2. Understand the Generation of Pseudo-Random Numbers
3. Understand the Techniques of generating random numbers
4. Understand the tests for random numbers
5. Understand the Inverse-Transform Technique
6. Understand the Acceptance-Rejection Technique

UNIT-III
1. Understand the Numerical integration vs. continuous system simulation
2. Understand the Selection of an integration formula
3. Understand the Runge-Kutta integration formulas
4. Understand the Simulation of a water reservoir system
5. Understand the Fixed time-step vs. event-to-event model
6. Understand the generation of non-uniformly distributed random numbers

UNIT-IV
1. Understand Rudiments of queuing theory
2. Understand Simulation of a single-server queue
3. Understand Simulation of a two-server queue
4. Understand Simulation of more general queues
5. Understand Analysis of activity network

UNIT-V
1. Understand Length of simulation runs
2. Understand Variance reduction techniques
3. Understand Experimental layout, Validation
4. Understand Continuous and discrete simulation languages
5. Understand Continuous simulation languages
6. Understand Block-structured continuous simulation languages
11. Class Time Table

12. Individual Time Table

13. Lecture Schedule with methodology being used/adopted

S.No  Unit No  Class  Topic  Teaching Aids (BB/OHP/LCD)
1. I Day1 Introduction to Deep Learning BB/LCD

2. Day2 Linear Algebra introduction BB/LCD

3. Day3 Linear dependency, Euclidean distance BB/LCD

4. Day4 Eigen Decomposition BB/LCD

5. Day5 Singular Value Decomposition BB/LCD

6. Day6 Moore-Penrose Pseudo-Inverse BB/LCD

7. Day7 Principal Component Analysis BB/LCD

8. Day8 Random Variable, Probability Distributions BB/LCD

9. Day9 Marginal Probability, Conditional Probability BB/LCD

10. Day10 Baye’s Rule, Structural Probabilistic Models BB/LCD

11. Day11 Generation of Pseudo-Random Numbers BB/LCD

12. Day12 Variance and Covariance BB/LCD

13. II Day13 Numerical Computation: Overflow and Underflow, BB/LCD


Poor Conditioning

14. Day14 Gradient-Based Optimization BB/LCD

15. Day15 Constraint Optimization BB/LCD

16. Day16 Linear Least Squares BB/LCD

17. Day17 ML Basics, Learning Algorithms BB/LCD

18. Day18 Linear Regression, Capacity, Regularization BB/LCD

19. Day19 Overfitting, Underfitting, Hyper Parameters, LCD


Validation sets, Bias

20. Day20 Trading off Bias and Variance to Minimum Mean LCD
Square Error

21. Day21 Maximum Likelihood Estimation, Bayesian Statistics BB/LCD

22. Day22 Supervised and Unsupervised Learning Algorithms LCD

23. Day23 Stochastic Gradient Descent LCD

24. Day24 Building a Machine Learning Application, LCD


Challenges, motivating DL

25. III Day25 Deep Forward Networks: Learning XOR BB/LCD

26. Day26 Gradient-Based Learning BB/LCD


27. Day27 Conditional Distributions with Maximum Likelihood, BB/LCD
Conditional Statistics, Gaussian Output Distributions

28. Day28 Linear Units for Gaussian Output Distributions, BB/LCD


Sigmoid Units for Bernoulli Output Distribution,
Softmax Units for Multinouli Output Distribution

29. Day29 Hidden Units, Rectified Linear Units and their LCD
Generalizations, Logistic Sigmoid and Hyperbolic
Tangent

30. Day30 Architecture Design, Back Propagation and other LCD


Differentiation Algorithms

31. Day31 Regularization for Deep Learning: Parameter Norm BB/LCD


Penalities

32. Day32 Norm Penalities as Constrained optimization, BB/LCD


Regularization and Under-Constrained Problems

33. Day33 Dataset Augmentation, Noise Robustness BB/LCD

34. Day34 Challenges in Neural Network Optimization BB/LCD

35. Day35 Basic Algorithms, Parameter Initialization Strategies LCD

36. Day36 Algorithms with Adaptive Learning rates, LCD


Approximate Second-Order Methods,
Optimization Strategies

37. IV Day37 Convolutional Networks: The Convolution LCD/ BB


Operation, Motivation

38. Day38 Pooling, Variants of basic Convolution function LCD/ BB

39. Day39 Structured Outputs, Data Types, Efficient Convolution LCD/ BB


Algorithms, Random or Unsupervised Features

40. Day40 The Neuro Scientific basics for CNs LCD/ BB

41. Day41 Sequence Modelling: Recurrent Neural Networks, LCD/ BB


Unfolding Computational Graphs,

42. Day42 Recurrent Neural Networks, Modeling Sequences BB/LCD


Conditioned on Context with RNNs, Bidirectional
RNNs

43. Day43 Encoder-Decoder Sequence-to-Sequence Architecture, BB/LCD

44. Day44 Deep Recurrent Networks , Recursive Neural LCD


Networks,

45. Day45 Performance Metrics, Default Baseline models, LCD


Determining whether to gather more data
46. Day46 Selecting Hyper parameters LCD

47. Day47 Debugging strategies LCD

48. V Day48 Applications: Large Scale Deep Learning LCD/ BB

49. Day49 Computer Vision LCD/ BB

50. Day50 Speech Recognition LCD/ BB

51. Day51 Natural Language Processing LCD/ BB

52. Day52 Deep Learning Research LCD/ BB

53. Day53 Linear Factor Models BB

54. Day54 Auto Encoders LCD/ BB

55. Day55 Representation Learning LCD/ BB

56. Day56 Structured Probabilistic Models for Deep Learning LCD/ BB

57. Day57 Deep Generative Models: Boltzmann Machines, Deep LCD/ BB


Belief Networks

58. Day58 Deep Boltzmann Machines, Boltzmann Machines for LCD/ BB


Real Valued Data

59. Day59 Convolutional Boltzmann Machines LCD/ BB

60. Day60 Back Propagation through Random Operations LCD/ BB

61. Day61 Drawing Samples from Autoencoders LCD/ BB

62. Day62 Generative Stochastic Networks LCD/ BB

Total No of classes:62

14. Detailed Notes
UNIT-I
Linear Algebra:
 Scalar:
 Scalar is a single number.
 Lower case variable is used to represent a scalar.
 Ex: Let s∈R be the slope of the line
Let n∈N be the number of units.
 Vector:
 Vector is an array of numbers.
 The numbers are arranged in an order.
 Each individual number is identified using an index.
 Lower case and bold variable is used to represent a vector.
 If each element is in R, and the vector has n elements, then the
vector lies in the set formed by taking the Cartesian product of R n
times, denoted as Rn.
Ex: x = [x1, x2, …, xn]T (a column vector)
 Matrices:
 A matrix is a 2-D array of numbers, so each element is identified by two indices
instead of one.
 A ∈ Rm×n means A is a real-valued matrix of height m and width n.
 A1,1 is the upper left entry of A
 Am,n is the bottom right entry of A
 Ai,: is the i-th row of A
 A:,i is the i-th column of A
 f(A)i,j gives element (i,j) of the matrix computed by applying the function f
to A.
 Tensors:
 An array of numbers arranged on a regular grid with a
variable number of axes is known as a tensor.
 Ai,j,k

Transpose:

(AT)i,j = Aj,i

Vectors are matrices that contain only one column.

The transpose of a vector is a matrix with only one row.

A scalar can be thought of as a matrix with a single element.

The transpose of a scalar is itself: a = aT.

We can add two matrices:

C = A + B, where Ci,j = Ai,j + Bi,j

We can add a scalar to, or multiply a scalar by, a matrix:

D = a·B + c, where Di,j = a·Bi,j + c

Multiplying matrices and vectors:

 C = AB, where Ci,j = ∑k Ai,k Bk,j
 A(B + C) = AB + AC (distributivity)
 A(BC) = (AB)C (associativity)
 AB ≠ BA in general (matrix multiplication is not commutative)
 The dot product between two vectors is commutative: xTy = yTx
 (AB)T = BTAT
 The system Ax = b can be written as
A1,:x = b1, A2,:x = b2, …, Am,:x = bm, or

A1,1x1 + A1,2x2 + … + A1,nxn = b1
…
Am,1x1 + Am,2x2 + … + Am,nxn = bm
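These identities can be checked numerically. A minimal pure-Python sketch (the 2×2 matrices A and B below are arbitrary illustrative values):

```python
def matmul(A, B):
    # C[i][j] = sum_k A[i][k] * B[k][j]
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def transpose(A):
    # (A^T)[i][j] = A[j][i]
    return [list(row) for row in zip(*A)]

A = [[1, 2], [3, 4]]
B = [[0, 1], [1, 0]]

# Matrix multiplication is not commutative for these matrices: AB != BA
print(matmul(A, B) != matmul(B, A))                                   # True
# (AB)^T = B^T A^T
print(transpose(matmul(A, B)) == matmul(transpose(B), transpose(A)))  # True
```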

Probability and Information Theory

 The domain of P must be the set of all possible states of x.


 Ɐx ϵ X, 0 ≤ P(x)≤ 1. An impossible event has probability 0 and no state can be less
probable than that. Likewise, an event that is guaranteed to happen has probability 1, and
no state can have a greater chance of occurring.
 ∑xϵX P(x)=1. We refer to this property as being normalized. Without this property, we
could obtain probabilities greater than one by computing the probability of one of many
events occurring

Example: uniform distribution: P(x = xi) = 1/k

Probability Density Function

 The domain of p must be the set of all possible states of x.


 Ɐx ϵ X, p(x) ≥0. Note that we do not require p(x) ≤ 1.
 ∫ p(x)dx = 1.
 Example: uniform distribution: u(x; a, b) = 1 / (b − a)

Computing Marginal Probability with the Sum Rule

Ɐx ϵ X, P(x = x) = ∑y P(x = x, y = y)

p(x) = ∫ p(x, y)dy

Conditional Probability

P(y = y | x = x) = P(y = y, x = x) / P(x = x)
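The sum rule and the conditional probability formula can be illustrated with a small joint distribution (the numeric values below are arbitrary illustrative choices):

```python
# Joint distribution P(x, y) over two binary variables; the entries sum to 1.
P_xy = {(0, 0): 0.1, (0, 1): 0.3, (1, 0): 0.2, (1, 1): 0.4}

def marginal_x(P):
    # Sum rule: P(x) = sum_y P(x, y)
    Px = {}
    for (x, y), p in P.items():
        Px[x] = Px.get(x, 0.0) + p
    return Px

def conditional_y_given_x(P, x):
    # Conditional: P(y | x) = P(x, y) / P(x)
    Px = marginal_x(P)[x]
    return {y: p / Px for (xi, y), p in P.items() if xi == x}

print(marginal_x(P_xy))
print(conditional_y_given_x(P_xy, 0))
```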

Chain Rule of Probability

P(x(1), ..., x(n)) = P(x(1)) ∏i=2…n P(x(i) | x(1), ..., x(i−1))

Independence

Ɐx ϵ X, yϵ y, p(x = x, y = y) = p(x = x)p(y = y)


Conditional Independence

Ɐx ϵ X, yϵ y, z ϵ Z p(x = x, y = y | z = z) = p(x = x | z = z)p(y = y | z = z)

Expectation

Ex~P [f(x)] = ∑x P(x) f(x) (discrete case)

Ex~p [f(x)] = ∫ p(x) f(x) dx (continuous case)

Linearity of Expectations:

Ex [αf(x) + βg(x)] = α Ex [f(x)] + β Ex[g(x)]
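A small numeric check of the expectation definition and of linearity, using a fair six-sided die as an illustrative distribution:

```python
# Discrete distribution P: a fair six-sided die (illustrative choice).
P = {x: 1.0 / 6.0 for x in range(1, 7)}

def E(f):
    # E_{x~P}[f(x)] = sum_x P(x) f(x)
    return sum(P[x] * f(x) for x in P)

f = lambda x: x          # identity
g = lambda x: x * x      # square
alpha, beta = 2.0, 3.0

# Linearity: E[alpha*f + beta*g] = alpha*E[f] + beta*E[g]
lhs = E(lambda x: alpha * f(x) + beta * g(x))
rhs = alpha * E(f) + beta * E(g)
print(abs(lhs - rhs) < 1e-9)   # True
```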

Variance and Covariance

Covariance matrix:

Bernoulli Distribution

Gaussian Distribution

Gaussian Distribution
Multivariate Gaussian

Empirical Distribution

Bayes’ Rule

P(y | x) = P(x | y) P(y) / P(x)

Change of Variables
UNIT-2

Gradient Descent and the Structure of Neural Network Cost Functions


Given neural network parameters θ, find the value of θ that minimizes cost function J(θ)
Derivatives and Second Derivatives

Directional Curvature
Taylor series approximation

How much does a gradient step reduce the cost?

Critical points:
All positive eigenvalues All negative eigenvalues Some positive and some negative

Newton’s method

Newton’s method’s failure mode

The old view of SGD as difficult


- SGD usually moves downhill
- SGD eventually encounters a critical point
- Usually this is a minimum
- However, it is a local minimum
- J has a high value at this critical point
- Some global minimum is the real target, and has a
much lower value of J

The new view: does SGD get stuck on saddle points?


- SGD usually moves downhill
- SGD eventually encounters a critical point
- Usually this is a saddle point
- SGD is stuck, and the main reason it is stuck is that it fails to exploit negative curvature
(as we will see, this happens to Newton’s method, but not very much to SGD)

Gradient descent flees saddle points

Poor conditioning

Poor Conditioning
Why convergence may not happen?

Never stop if function doesn’t have a local minimum


- Get “stuck,” possibly still moving but not improving
- Too bad of conditioning
- Too much gradient noise
- Overfitting
- Usually we get “stuck” before finding a critical point
- Only Newton’s method and related techniques are attracted to saddle points
Are saddle points or local minima more common?
- Imagine for each eigenvalue, you flip a coin
- If heads, the eigenvalue is positive, if tails, negative
- Need to get all heads to have a minimum
- Higher dimensions -> exponentially less likely to get all heads
- Random matrix theory:
- The coin is weighted; the lower J is, the more likely to be heads
- So most local minima have low J!
- Most critical points with high J are saddle points!
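The coin-flip argument above can be checked with a short Monte Carlo sketch (a fair, unweighted coin is assumed here, and the trial count is an arbitrary choice):

```python
import random
random.seed(42)

def prob_minimum(n_dims, trials=200_000):
    # A critical point is a minimum only if all n eigenvalues are positive.
    # With one fair coin per eigenvalue, this happens with probability 0.5**n,
    # so minima become exponentially rarer as dimension grows.
    hits = 0
    for _ in range(trials):
        if all(random.random() < 0.5 for _ in range(n_dims)):
            hits += 1
    return hits / trials

for n in (1, 2, 10):
    print(n, prob_minimum(n))   # estimates approach 0.5**n
```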
The state of modern optimization
- We can optimize most classifiers, autoencoders, or
recurrent nets if they are based on linear layers
- Especially true of LSTM, ReLU, maxout
Why is optimization so slow?
We can fail to compute good local updates (get “stuck”). Or local information can disagree with
global information, even when there are no non-global minima, and even when there are no
minima of any kind.

• For most problems, there exists a linear subspace of monotonically decreasing values
• For some problems, there are obstacles between this subspace and the SGD path
• Factored linear models capture many qualitative aspects of deep network training

UNIT-3

Neural Networks:

● Artificial neural network (ANN) is a machine learning approach that models human brain
and consists of a number of artificial neurons.

● Neurons in ANNs tend to have fewer connections than biological neurons.

● Each neuron in ANN receives a number of inputs.

● An activation function is applied to these inputs which results in activation level of neuron
(output value of the neuron).

● Knowledge about the learning task is given in the form of examples called training
examples.

● An Artificial Neural Network is specified by:

● neuron model: the information processing unit of the NN,

● an architecture: a set of neurons and links connecting neurons. Each link has a
weight,

● a learning algorithm: used for training the NN by modifying the weights in order
to model a particular learning task correctly on the training examples.

● The aim is to obtain a NN that is trained and generalizes well.

● It should behave correctly on new instances of the learning task.

Neuron

● The neuron is the basic information processing unit of a NN. It consists of:

1. A set of links, describing the neuron inputs, with weights W1, W2, …, Wm

2. An adder function (linear combiner) for computing the weighted sum of the inputs
(real numbers)

3. An activation function for limiting the amplitude of the neuron output. Here ‘b’
denotes bias.

The Neuron Diagram


Bias of a Neuron

● The bias b has the effect of applying a transformation to the weighted sum u
v=u+b
● The bias is an external parameter of the neuron. It can be modeled by adding an extra
input.
● v is called induced field of the neuron

Neuron Models
The choice of activation function determines the neuron model.
Examples:
step function
ramp function
sigmoid function
Gaussian function

The Gaussian function is the probability density function of the normal distribution, sometimes
also called the frequency curve.
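These four activation functions can be written out as follows (a minimal sketch; the ramp's linear region and the Gaussian's mean and width are illustrative defaults, since the text does not fix them):

```python
import math

def step(v):
    # 1 above the threshold, 0 below
    return 1.0 if v >= 0 else 0.0

def ramp(v, lo=0.0, hi=1.0):
    # 0 below lo, 1 above hi, linear in between
    return max(0.0, min(1.0, (v - lo) / (hi - lo)))

def sigmoid(v):
    # smooth S-shaped curve from 0 to 1
    return 1.0 / (1.0 + math.exp(-v))

def gaussian(v, mu=0.0, sigma=1.0):
    # bell curve centered at mu with width sigma
    return math.exp(-((v - mu) ** 2) / (2.0 * sigma ** 2))

print(step(-0.3), ramp(0.5), sigmoid(0.0), gaussian(0.0))
# 0.0 0.5 0.5 1.0
```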

Network Architectures:
● Three different classes of network architectures
− single-layer feed-forward
− multi-layer feed-forward
− recurrent
− The architecture of a neural network is linked with the learning algorithm used to
train
Single Layer Feed-forward:

Input layer of source nodes connected directly to an output layer of neurons.

Perceptron: Neuron Model

(Special form of single layer feed-forward)
− The perceptron, first proposed by Rosenblatt (1958), is a simple neuron that is used to
classify its input into one of two categories.
− A perceptron uses a step function that returns +1 if the weighted sum of its inputs is ≥ 0
and −1 otherwise.
(Figure: inputs x1, x2, …, xn with weights w1, w2, …, wn and bias b feed the induced field v,
to which the step activation φ(v) is applied to produce the output y.)
Perceptron for Classification
● The perceptron is used for binary classification.
● First train a perceptron for a classification task.
− Find suitable weights in such a way that the training examples are correctly
classified.
− Geometrically try to find a hyper-plane that separates the examples of the two
classes.
● The perceptron can only model linearly separable classes.
● When the two classes are not linearly separable, it may be desirable to obtain a linear
separator that minimizes the mean squared error.
● Given training examples of classes C1, C2 train the perceptron in such a way that :
− If the output of the perceptron is +1 then the input is assigned to class C1
− If the output is -1 then the input is assigned to C2
Boolean function OR – Linearly separable

Learning Process for Perceptron


● Initially assign random weights to inputs between -0.5 and +0.5
● Training data is presented to perceptron and its output is observed.
● If the output is incorrect, the weights are adjusted using the following formula:
wi ← wi + (a · xi · e), where ‘e’ is the error produced
and ‘a’ (−1 ≤ a ≤ 1) is the learning rate
− ‘e’ is 0 if the output is correct; it is positive if the output is too low and negative if
the output is too high.
− Once the modification to weights has taken place, the next piece of training data is
used in the same way.
− Once all the training data have been applied, the process starts again until all the
weights are correct and all errors are zero.
Each iteration of this process is known as an epoch.
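The learning loop above can be sketched in pure Python. This is a minimal illustration, not part of the course notes: the OR training set, the ±1 target encoding, and the learning rate a = 0.1 are assumed for the demo.

```python
import random

def step(v):
    # step activation: +1 if the weighted sum >= 0, -1 otherwise
    return 1 if v >= 0 else -1

def train_perceptron(data, lr=0.1, max_epochs=100):
    random.seed(0)
    # initial random weights in [-0.5, +0.5]; the third weight is the bias
    w = [random.uniform(-0.5, 0.5) for _ in range(3)]
    for _ in range(max_epochs):
        errors = 0
        for (x1, x2), target in data:
            xs = (x1, x2, 1)                # constant 1 input feeds the bias weight
            y = step(sum(wi * xi for wi, xi in zip(w, xs)))
            e = target - y                  # 0 if correct, +/-2 otherwise
            if e != 0:
                errors += 1
                w = [wi + lr * xi * e for wi, xi in zip(w, xs)]
        if errors == 0:                     # one error-free epoch: training done
            break
    return w

# Boolean OR encoded with -1 = false, +1 = true (linearly separable)
or_data = [((0, 0), -1), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
w = train_perceptron(or_data)
for (x1, x2), target in or_data:
    assert step(w[0] * x1 + w[1] * x2 + w[2]) == target
```

Since OR is linearly separable, the perceptron convergence theorem guarantees that this loop terminates with all examples classified correctly.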
Perceptron: Limitations
● The perceptron can only model linearly separable functions,
− i.e. functions whose values, plotted in a 2-dimensional graph, can be separated
into two parts by a single straight line.
● Boolean functions given below are linearly separable:
− AND
− OR
− COMPLEMENT
● It cannot model XOR function as it is non linearly separable.
− When the two classes are not linearly separable, it may be desirable to obtain a
linear separator that minimizes the mean squared error.
XOR – Non linearly separable function
● A typical example of a non-linearly separable function is XOR, which computes the logical
exclusive or.
● This function takes two input arguments with values in {0,1} and returns one output in {0,1}.
● Here 0 and 1 are encodings of the truth values false and true.
● The output is true if and only if the two inputs have different truth values.
● XOR is a non-linearly separable function which cannot be modeled by a perceptron.
● For such functions we have to use a multi-layer feed-forward network.

These two classes (true and false) cannot be separated using a line. Hence XOR is non linearly
separable.
Multi layer feed-forward NN (FFNN)
● FFNN is a more general network architecture, where there are hidden layers between input
and output layers.
● Hidden nodes do not directly receive inputs nor send outputs to the external environment.
● FFNNs overcome the limitation of single-layer NN.
● They can handle non-linearly separable learning tasks.

[Figure: multi-layer feed-forward network: input layer, hidden layer, output layer]
FFNN for XOR
● The ANN for XOR has two hidden nodes that realize this non-linear separation and use
the sign (step) activation function.
● Arrows from input nodes to two hidden nodes indicate the directions of the weight vectors
(1,-1) and (-1,1).
● The output node is used to combine the outputs of the two hidden nodes.

Since we are representing two states by 0 (false) and 1 (true), we will map negative outputs (–1, –
0.5) of hidden and output layers to 0 and positive output (0.5) to 1.
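The separation can be verified with a tiny sketch using step units. Note this uses an equivalent but different weight choice than the figure's (1,−1)/(−1,1) vectors; the thresholds below are assumptions for the illustration.

```python
def step(v, threshold):
    # returns 1 if v exceeds the threshold, else 0
    return 1 if v > threshold else 0

def xor_net(x1, x2):
    h1 = step(x1 + x2, 0.5)    # hidden unit 1: fires like OR
    h2 = step(x1 + x2, 1.5)    # hidden unit 2: fires like AND
    return step(h1 - h2, 0.5)  # output combines them: OR and not AND = XOR

assert [xor_net(a, b) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]] == [0, 1, 1, 0]
```

No single straight line separates these four points, but the two hidden units together carve out the required region.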
FFNN NEURON MODEL
● The classical learning algorithm of FFNN is based on the gradient descent method.
● For this reason the activation function used in FFNN are continuous functions of the
weights, differentiable everywhere.
● The activation function for node i may be defined as a simple form of the sigmoid
function:
φ(Vi) = 1 / (1 + e^(−A·Vi)),
where A > 0, Vi = Σj Wij · Yj, such that Wij is the weight of the link from node i to node j
and Yj is the output of node j.
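A minimal sketch of this activation, assuming the standard sigmoid form 1/(1 + e^(−A·v)):

```python
import math

def sigmoid(v, A=1.0):
    # continuous and differentiable everywhere, as gradient descent requires;
    # A > 0 controls the slope around v = 0
    return 1.0 / (1.0 + math.exp(-A * v))

assert sigmoid(0) == 0.5                       # midpoint
assert 0 < sigmoid(-10) < 0.001                # saturates towards 0
assert 0.999 < sigmoid(10) < 1                 # saturates towards 1
```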
Training Algorithm: Backpropagation
● The Backpropagation algorithm learns in the same way as single perceptron.
● It searches for weight values that minimize the total error of the network over the set of
training examples (training set).
● Backpropagation consists of the repeated application of the following two passes:
− Forward pass: In this step, the network is activated on one example and the error of
(each neuron of) the output layer is computed.
− Backward pass: in this step the network error is used for updating the weights. The
error is propagated backwards from the output layer through the network layer by
layer. This is done by recursively computing the local gradient of each neuron.
Backpropagation
● Back-propagation training algorithm

● Consider a network of three layers.


● Let us use i to represent nodes in input layer, j to represent nodes in hidden layer and k
represent nodes in output layer.
● wij refers to weight of connection between a node in input layer and node in hidden layer.
● The following equation is used to derive the output value Yj of node j:
Yj = 1 / (1 + e^(−Xj)),
where Xj = Σi xi · wij − θj, 1 ≤ i ≤ n; n is the number of inputs to node j, and θj is the threshold
for node j
Total Mean Squared Error
● The error of output neuron k after the activation of the network on the n-th training
example (x(n), d(n)) is:
ek(n) = dk(n) – yk(n)
● The network error is the sum of the squared errors of the output neurons:
E(n) = ½ Σk ek(n)²
● The total mean squared error is the average of the network errors of the training examples.
Weight Update Rule
● The Backprop weight update rule is based on the gradient descent method:
− It takes a step in the direction yielding the maximum decrease of the network error
E.
− This direction is the opposite of the gradient of E.
● Iteration of the Backprop algorithm is usually terminated when the sum of squares of
errors of the output values for all training data in an epoch is less than some threshold such
as 0.01

Backpropagation learning algorithm (incremental-mode)


n=1;
initialize weights randomly;
while (stopping criterion not satisfied or n <max_iterations)
for each example (x,d)
- run the network with input x and compute the output y
- update the weights in backward order, starting from those of the output layer:
w ← w + Δw, with Δw computed using the (generalized) Delta rule


end-for
n = n+1;
end-while;
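The incremental-mode loop above can be sketched on the XOR task. This is a hedged illustration: the network size (3 hidden sigmoid units), learning rate, and epoch count are arbitrary choices, and the check only asserts that the network error decreased, since convergence to zero error is not guaranteed by backpropagation.

```python
import math, random

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

random.seed(1)
# 2 inputs -> 3 hidden -> 1 output; each row/vector includes a bias weight
W1 = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(3)]
W2 = [random.uniform(-1, 1) for _ in range(4)]

data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]  # XOR
lr = 0.5

def forward(x):
    xs = list(x) + [1]                                       # append bias input
    h = [sigmoid(sum(w * xi for w, xi in zip(row, xs))) for row in W1]
    hs = h + [1]
    y = sigmoid(sum(w * hi for w, hi in zip(W2, hs)))
    return xs, h, hs, y

def epoch_error():
    return sum((d - forward(x)[3]) ** 2 for x, d in data)

err_before = epoch_error()
for _ in range(2000):
    for x, d in data:                                        # forward pass
        xs, h, hs, y = forward(x)
        # backward pass: local gradients via the generalized Delta rule
        delta_out = (d - y) * y * (1 - y)
        delta_h = [hj * (1 - hj) * delta_out * W2[j] for j, hj in enumerate(h)]
        W2 = [w + lr * delta_out * hi for w, hi in zip(W2, hs)]
        W1 = [[w + lr * delta_h[j] * xi for w, xi in zip(W1[j], xs)]
              for j in range(len(W1))]
err_after = epoch_error()
assert err_after < err_before   # the total error over the epoch decreased
```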
Stopping criterions
● Total mean squared error change:
− Back-prop is considered to have converged when the absolute rate of change in the
average squared error per epoch is sufficiently small (in the range [0.01, 0.1]).
● Generalization based criterion:
− After each epoch, the NN is tested for generalization.
− If the generalization performance is adequate then stop.
− If this stopping criterion is used, then the part of the training set used for testing the
network generalization will not be used for updating the weights.
Recurrent Network
● FFNN is acyclic where data passes from input to the output nodes and not vice versa.
− Once the FFNN is trained, its state is fixed and does not alter as new data is
presented to it. It does not have memory.
● Recurrent network can have connections that go backward from output to input nodes
and models dynamic systems.
− In this way, a recurrent network’s internal state can be altered as sets of input data
are presented. It can be said to have memory.
− It is useful in solving problems where the solution depends not just on the current
inputs but on all previous inputs.
● Applications
− predict stock market price,
− weather forecasting
UNIT IV:
Convolution operation:

Convolutional layers are the major building blocks used in convolutional neural networks.

A convolution is the simple application of a filter to an input that results in an activation.


Repeated application of the same filter to an input results in a map of activations called a
feature map, indicating the locations and strength of a detected feature in an input, such as an
image.

The innovation of convolutional neural networks is the ability to automatically learn a large
number of filters in parallel specific to a training dataset under the constraints of a specific
predictive modeling problem, such as image classification. The result is highly specific
features that can be detected anywhere on input images.
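The filter application can be sketched in pure Python. The 1x2 edge-detecting filter and the tiny image below are made-up examples; note that what CNN libraries call "convolution" is actually cross-correlation, which is what this sketch computes.

```python
def conv2d(image, kernel):
    # "valid" convolution: slide the kernel over the image and
    # sum the elementwise products at each position
    kh, kw = len(kernel), len(kernel[0])
    oh, ow = len(image) - kh + 1, len(image[0]) - kw + 1
    out = [[0] * ow for _ in range(oh)]
    for i in range(oh):
        for j in range(ow):
            out[i][j] = sum(image[i + m][j + n] * kernel[m][n]
                            for m in range(kh) for n in range(kw))
    return out

# an image with a vertical edge down the middle
image = [[0, 0, 1, 1],
         [0, 0, 1, 1],
         [0, 0, 1, 1]]
kernel = [[1, -1]]                 # 1x2 filter: difference of horizontal neighbours
fmap = conv2d(image, kernel)
# the feature map lights up (non-zero) exactly where the edge is detected
assert fmap == [[0, -1, 0], [0, -1, 0], [0, -1, 0]]
```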
convolution motivation and pooling:

1. Dimension Reduction: In deep learning, when we train a model, the model can take a
huge amount of time to train because of excessive data size. Now consider the use of
max pooling of size 5x5 with stride 1. It reduces each successive 5x5 region of the
given image to a 1x1 region holding the max value of that region. Here pooling
reduces 25 (5x5) pixels to a single pixel (1x1) to help avoid the curse of
dimensionality.
2. Rotational/Positional Invariance Feature Extraction: Pooling can also be used to
extract rotation- and position-invariant features. Consider the same example of
using pooling of size 5x5. Pooling extracts the max value from the given 5x5 region;
basically, it extracts the dominant feature value (max value) from the given region
irrespective of the position of that value. The max value could come from any
position inside the region. Because pooling does not capture the position of the max
value, it provides rotation/position-invariant feature extraction.
3. Convolutional neural networks are typically used for image classification. However,
images are high-dimensional data - so we would prefer to reduce the dimensionality to
minimize the possibility of overfitting.
4. Pooling generally serves three aims:
(1) it generally acts as a noise suppressant
(2) makes it invariant to translation movement for image classification
(3) helps capture essential structural features of the represented images without being
bogged down by the fine details.
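The dimension-reduction effect can be sketched with a 2x2 max pool (the feature-map values below are made up for the demo):

```python
def max_pool(fmap, size=2, stride=2):
    # reduce each size x size window to its single maximum value
    oh = (len(fmap) - size) // stride + 1
    ow = (len(fmap[0]) - size) // stride + 1
    return [[max(fmap[i * stride + m][j * stride + n]
                 for m in range(size) for n in range(size))
             for j in range(ow)]
            for i in range(oh)]

fmap = [[1, 3, 2, 0],
        [4, 6, 5, 1],
        [7, 2, 9, 8],
        [0, 1, 3, 4]]
pooled = max_pool(fmap)        # 4x4 -> 2x2: 16 values reduced to 4
assert pooled == [[6, 5], [7, 9]]
# moving the dominant value within its 2x2 window would leave the output
# unchanged, which is the positional-invariance property described above
```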
Variants of the Basic Convolution Function
Convolution in the context of neural networks means an operation that consists of many applications of
convolution in parallel.
 Kernel K with element Ki,j,k,l giving the connection strength between a unit in channel
i of the output and a unit in channel j of the input, with an offset of k rows and l columns between
the output unit and the input unit.
Input: Vi,j,k with channel i, row j and column k.
Output Z has the same format as V.
Indices start at 1.
 Full convolution (0 padding, 1 stride):

Zi,j,k = Σl,m,n Vl,j+m−1,k+n−1 · Ki,l,m,n

 0 padding, s stride:

Zi,j,k = c(K,V,s)i,j,k = Σl,m,n [ Vl,s·(j−1)+m,s·(k−1)+n · Ki,l,m,n ]

 Convolution with a stride greater than 1 pixel is equivalent to convolution with 1 stride followed by
downsampling.


Some 0 Padding and 1 Stride
Without 0 padding, the width of the representation shrinks by one pixel less than the kernel width at
each layer. We are forced to choose between shrinking the spatial extent of the network rapidly and
using small kernels. 0 padding allows us to control the kernel width and the size of the output
independently.

Special cases of 0 padding:

 Valid: no 0 padding is used; for input width m and kernel width k the output width is
m − k + 1. This limits the number of layers.
 Same: enough 0 padding is added to keep the size of the output equal to the size of the input, so
the number of layers is unlimited. Pixels near the border influence fewer output pixels than
pixels near the center.
 Full: enough zeros are added for every pixel to be visited k (kernel width) times in each
direction, resulting in an output of width m + k − 1. It is difficult to learn a single kernel that
performs well at all positions in the convolutional feature map.
 Usually the optimal amount of 0 padding lies somewhere between ‘Valid’ and ‘Same’.
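The three padding modes give the output widths stated above; a small sketch (1-D widths, stride 1 assumed):

```python
def output_width(m, k, padding):
    # output width for input width m and kernel width k, stride 1
    if padding == "valid":
        return m - k + 1          # no zero padding: shrinks by k - 1 per layer
    if padding == "same":
        return m                  # padded so the output matches the input width
    if padding == "full":
        return m + k - 1          # every pixel visited k times in each direction
    raise ValueError(padding)

assert output_width(32, 5, "valid") == 28
assert output_width(32, 5, "same") == 32
assert output_width(32, 5, "full") == 36
```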
 Unshared Convolution
 In some cases we do not want to use convolution but rather a locally connected
layer. We then use unshared convolution, indexing into a weight tensor W:

Zi,j,k = Σl,m,n [ Vl,j+m−1,k+n−1 · Wi,j,k,l,m,n ]

Comparing local connections, convolution and full connection: local connections are useful when we know that
each feature should be a function of a small part of space, but there is no reason to think that the same
feature should occur across all of space, e.g. looking for a mouth only in the bottom half of the
image.

It can also be useful to make versions of convolution or locally connected layers in which the
connectivity is further restricted, e.g. constraining each output channel i to be a function of only a
subset of the input channels.

Tiled Convolution:

Learn a set of kernels that we rotate through as we move through space. Immediately
neighboring locations will have different filters, but the memory requirement for storing the
parameters will increase by a factor of the size of this set of kernels. Comparison on locally
connected layers, tiled convolution and standard convolution:

Locally connected layers and tiled convolutional layers with max pooling: the detector units of
these layers are driven by different filters. If the filters learn to detect different transformed
versions of the same underlying feature, then the max-pooled units become invariant to the
learned transformation.
Structured outputs:

Even if we understand the Convolutional Neural Network theoretically, many of us still get
confused about its input and output shapes while fitting data to the network. This guide
will help you understand the input and output shapes for the Convolutional Neural Network.

ConvNet Input Shape

Input Shape

You always have to give a 4D array as input to the CNN. So input data has a shape
of (batch_size, height, width, depth), where the first dimension represents the batch size of
the image and the other three dimensions represent dimensions of the image which are height,
width, and depth. For some of you who are wondering what is the depth of the image, it’s
nothing but the number of color channels. For example, an RGB image would have a depth
of 3, and the greyscale image would have a depth of 1.
Output Shape

The output of the CNN is also a 4D array. Where batch size would be the same as input
batch size but the other 3 dimensions of the image might change depending upon the values
of filter, kernel size, and padding we use.
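The output spatial dimensions can be computed from the filter count, kernel size, stride and padding, assuming the standard formula floor((h + 2p − k)/s) + 1:

```python
def conv_output_shape(input_shape, num_filters, kernel, stride=1, padding=0):
    # (batch, height, width, depth) -> shape after one convolutional layer
    batch, h, w, _ = input_shape
    oh = (h + 2 * padding - kernel) // stride + 1
    ow = (w + 2 * padding - kernel) // stride + 1
    return (batch, oh, ow, num_filters)   # depth becomes the number of filters

# e.g. a batch of 8 RGB (depth-3) images, 3x3 kernel, stride 1, no padding
assert conv_output_shape((8, 28, 28, 3), num_filters=16, kernel=3) == (8, 26, 26, 16)
```

The batch size passes through unchanged, matching the description above.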

Data types:
different types of convolutional neural networks:
Advancements in computer vision with deep learning have been constructed and
perfected over time, primarily around one particular algorithm: the Convolutional Neural
Network.

Neuroscientific basis for convolutional neural network:

While convolutional neural networks (CNNs) have dominated the field of object recognition,
they can easily be deceived by small perturbations, also known as adversarial attacks.
This can lead to the failure of computer vision models and make them susceptible to
cyberattacks. CNNs' vulnerability to image perturbations has become a pressing concern for
the machine learning community, while researchers and scientists are working towards
building computer vision models that generalise images like humans.

To address this vulnerability, researchers from MIT, Harvard University and MIT-IBM
Watson AI Lab have proposed VOneNets — a new class of hybrid CNN vision models — in
a recent paper. According to the researchers, this novel architecture leverages “biologically-
constrained neural networks along with deep learning techniques” to create more model
robustness against white-box adversarial attacks.

Unfolding Computational Graphs

A computational graph is a way to formalize the structure of a set of computations, such as


those involved in mapping inputs and parameters to outputs and loss. Please refer to Sec
6.5.1 for a general introduction. In this section we explain the idea of unfolding a recursive or
recurrent computation into a computational graph that has a repetitive structure, typically
corresponding to a chain of events. Unfolding this graph results in the sharing of parameters
across a deep network structure.

Recurrent neural networks can be built in many different ways. Much as almost any
function can be considered a feedforward neural network, essentially any function
involving recurrence can be considered a recurrent neural network.
Many recurrent neural networks use Eq. 10.5 or a similar equation to define the values of
their hidden units. To indicate that the state is the hidden units of the network, we now
rewrite Eq. 10.4 using the variable h to represent the state:

h(t) = f(h(t−1), x(t); θ)
typical RNNs will add extra architectural features such as output layers that read
information out of the state to make predictions.

When the recurrent network is trained to perform a task that requires predicting the future
from the past, the network typically learns to use h(t) as a kind of lossy summary of the
task-relevant aspects of the past sequence of inputs up to t. This summary is in general
necessarily lossy, since it maps an arbitrary-length sequence (x(t), x(t−1), x(t−2), ..., x(2), x(1))
to a fixed-length vector h(t). Depending on the training criterion, this summary
might selectively keep some aspects of the past sequence with more precision than other
aspects. For example, if the RNN is used in statistical language modeling, typically to predict
the next word given previous words, it may not be necessary to store all of the information in
the input sequence up to time t, but rather only enough information to predict the rest of the
sentence. The most demanding situation is when we ask h(t) to be rich enough to allow one
to approximately recover the input sequence, as in autoencoder frameworks.
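The fixed-size lossy summary h(t) can be sketched with a scalar state; the recurrence weights w_h and w_x below are arbitrary made-up values, and tanh stands in for the state-update function f:

```python
import math

def rnn_step(h_prev, x, w_h=0.5, w_x=1.0, b=0.0):
    # h(t) = f(h(t-1), x(t); theta), here f = tanh of an affine map
    return math.tanh(w_h * h_prev + w_x * x + b)

def summarize(sequence):
    # fold an arbitrary-length sequence into a single fixed-size state h(t)
    h = 0.0
    for x in sequence:
        h = rnn_step(h, x)
    return h

h_short = summarize([1.0, 0.0])
h_long = summarize([0.5, 1.0, 0.0, 1.0, 0.0])
# sequences of different lengths both compress into one bounded scalar state:
# the summary is necessarily lossy
assert -1.0 < h_short < 1.0 and -1.0 < h_long < 1.0
```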

Recurrent Neural Network (RNN):

A recurrent neural network (RNN) is a type of artificial neural network which uses sequential
data or time series data. These deep learning algorithms are commonly used for ordinal or
temporal problems, such as language translation, natural language processing (nlp), speech
recognition, and image captioning; they are incorporated into popular applications such as
Siri, voice search, and Google Translate. Like feedforward and convolutional neural networks
(CNNs), recurrent neural networks utilize training data to learn. They are distinguished by
their “memory” as they take information from prior inputs to influence the current input and
output. While traditional deep neural networks assume that inputs and outputs are
independent of each other, the output of recurrent neural networks depend on the prior
elements within the sequence. While future events would also be helpful in determining the
output of a given sequence, unidirectional recurrent neural networks cannot account for these
events in their predictions.
Recurrent Neural Network vs. Feedforward Neural Network
Let’s take an idiom, such as “feeling under the weather”, which is commonly used when
someone is ill, to aid us in the explanation of RNNs. In order for the idiom to make sense, it
needs to be expressed in that specific order. As a result, recurrent networks need to account
for the position of each word in the idiom and they use that information to predict the next
word in the sequence.

Comparison of Recurrent Neural Networks (on the left) and Feedforward Neural Networks
(on the right)
Looking at the visual below, the “rolled” visual of the RNN represents the whole neural
network, or rather the entire predicted phrase, like “feeling under the weather.” The
“unrolled” visual represents the individual layers, or time steps, of the neural network. Each
layer maps to a single word in that phrase, such as “weather”. Prior inputs, such as “feeling”
and “under”, would be represented as a hidden state in the third timestep to predict the output
in the sequence, “the”.

Another distinguishing characteristic of recurrent networks is that they share parameters


across each layer of the network. While feedforward networks have different weights across
each node, recurrent neural networks share the same weight parameter within each layer of
the network. That said, these weights are still adjusted through the processes of
backpropagation and gradient descent to facilitate learning.

Recurrent neural networks leverage backpropagation through time (BPTT) algorithm to


determine the gradients, which is slightly different from traditional backpropagation as it is
specific to sequence data. The principles of BPTT are the same as traditional
backpropagation, where the model trains itself by calculating errors from its output layer to
its input layer. These calculations allow us to adjust and fit the parameters of the model
appropriately. BPTT differs from the traditional approach in that BPTT sums errors at each
time step whereas feedforward networks do not need to sum errors as they do not share
parameters across each layer.

Through this process, RNNs tend to run into two problems, known as exploding gradients
and vanishing gradients. These issues are defined by the size of the gradient, which is the
slope of the loss function along the error curve. When the gradient is too small, it continues to
become smaller, updating the weight parameters until they become insignificant, i.e. effectively zero.
When that occurs, the algorithm is no longer learning. Exploding gradients occur when the
gradient is too large, creating an unstable model. In this case, the model weights will grow
too large, and they will eventually be represented as NaN. One solution to these issues is to
reduce the number of hidden layers within the neural network, eliminating some of the
complexity in the RNN model.

bi-directional recurrent neural network:

Bidirectional recurrent neural networks(RNN) are really just putting two independent RNNs
together. The input sequence is fed in normal time order for one network, and in reverse time
order for another. The outputs of the two networks are usually concatenated at each time step,
though there are other options, e.g. summation.

This structure allows the networks to have both backward and forward information about the
sequence at every time step. The concept seems easy enough. But when it comes to actually
implementing a neural network which utilizes bidirectional structure, confusion arises…
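The two-directions-then-concatenate idea can be sketched with a minimal scalar RNN (the weights are arbitrary made-up values, and concatenation is represented as a pair per time step):

```python
import math

def run_rnn(seq, w_h=0.5, w_x=1.0):
    # return the hidden state after every time step
    h, states = 0.0, []
    for x in seq:
        h = math.tanh(w_h * h + w_x * x)
        states.append(h)
    return states

def bidirectional(seq):
    fwd = run_rnn(seq)                  # one RNN reads in normal time order
    bwd = run_rnn(seq[::-1])[::-1]      # the other reads in reverse, re-aligned
    # concatenate the two states at each time step
    return [(f, b) for f, b in zip(fwd, bwd)]

out = bidirectional([1.0, 0.0, 1.0])
assert len(out) == 3 and all(len(pair) == 2 for pair in out)
# each step's output now carries both backward (past) and forward (future) context
```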

The Confusion

The first confusion is about the way to forward the outputs of a bidirectional RNN to a
dense neural network. For normal RNNs we could just forward the outputs at the last time
step, and the following picture (found via Google) shows a similar technique applied to a
bidirectional RNN.
A Confusing formulation

If we pick the output at the last time step, the reverse RNN will have only seen the last input
(x_3 in the picture). It will hardly provide any predictive power.

The second confusion is about the returned hidden states. In seq2seq models, we’ll want
hidden states from the encoder to initialize the hidden states of the decoder. Intuitively, if we
can only choose hidden states at one time step(as in PyTorch), we’d want the one at which
the RNN just consumed the last input in the sequence. But if the hidden states of time step n
(the last one) are returned, as before, we’ll have the hidden states of the reversed RNN with
only one step of inputs seen.

1. Encoder

 A stack of several recurrent units (LSTM or GRU cells for better performance)
where each accepts a single element of the input sequence, collects information
for that element and propagates it forward.

 In question-answering problem, the input sequence is a collection of all words


from the question. Each word is represented as x_i where i is the order of that
word.

 The hidden states h_i are computed using the formula:


This simple formula represents the result of an ordinary recurrent neural network. As you can
see, we just apply the appropriate weights to the previous hidden state h_(t-1) and the input
vector x_t.

Encoder Vector

 This is the final hidden state produced from the encoder part of the model. It is
calculated using the formula above.

 This vector aims to encapsulate the information for all input elements in order to
help the decoder make accurate predictions.

 It acts as the initial hidden state of the decoder part of the model.

Decoder

 A stack of several recurrent units where each predicts an output y_t at a time step t.

 Each recurrent unit accepts a hidden state from the previous unit and produces
and output as well as its own hidden state.

 In the question-answering problem, the output sequence is a collection of all


words from the answer. Each word is represented as y_i where i is the order of
that word.

 Any hidden state h_i is computed using the formula:

we are just using the previous hidden state to compute the next one.
 The output y_t at time step t is computed using the formula:

We calculate the outputs using the hidden state at the current time step together with the
respective weight W(S). Softmax is used to create a probability vector which will help us
determine the final output (e.g. word in the question-answering problem).
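The softmax step that turns the decoder's raw scores into a probability vector can be sketched as follows (the scores are made-up values; subtracting the max is a standard numerical-stability trick):

```python
import math

def softmax(scores):
    # convert raw scores into a probability vector
    m = max(scores)                          # subtract the max for stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
assert abs(sum(probs) - 1.0) < 1e-9          # probabilities sum to one
assert probs.index(max(probs)) == 0          # highest score -> highest probability
```

The decoder would pick the output word (e.g. in the question-answering problem) according to this probability vector.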

The power of this model lies in the fact that it can map sequences of different lengths
to each other. As you can see the inputs and outputs are not correlated and their lengths
can differ. This opens a whole new range of problems which can now be solved using such
architecture.

we could stack multiple layers of RNNs on top of each other. This results in a flexible
mechanism, due to the combination of several simple layers. In particular, data might be
relevant at different levels of the stack. For instance, we might want to keep high-level
data about financial market conditions (bear or bull market) available, whereas at a lower
level we only record shorter-term temporal dynamics.

Recursive Neural Networks :

In simple words, we can say that a recursive neural network is a member of the deep neural
network family. If the same set of weights is recursively applied over a structured input,
a recursive neural network results, and this repeats for all the nodes, as explained above.
Recursive neural networks form an architectural class that operates mainly on structured
inputs, particularly directed acyclic graphs.
They have a deep tree-like topology. When a complete sentence needs to be parsed,
recursive neural networks are used. They allow branching of connections and hierarchical structures.
They are mainly used for the prediction of structured outputs over variable-sized input
structures, traversing a given structure in topological order; they are also used for scalar
predictions. The point to note is that a recursive neural network does not just respond to
structured inputs; it also works in context.

Each time series is processed separately. A very interesting point to ponder is that the first
introduction of RNN happened when a need arose to learn distributed data representations of
various structural networks.

A Recursive Neural Network is a type of deep neural network. So, with this, you can expect
& get a structured prediction by applying the same number of sets of weights on structured
inputs. With this type of processing, you get a typical deep neural network known as
a recursive neural network. These networks are non-linear in nature.
The recursive networks are adaptive models that are capable of learning deep structured
information. Therefore, you may say that recursive neural networks involve complex
inherent chains. Let’s discuss their connection with deep learning concepts.

Performance metrics:

Performance metrics can vary considerably when viewed through different


industries. Performance metrics are integral to an organization's success.

Performance measurement is the process of collecting, analyzing and/or reporting


information regarding the performance of an individual, group, organization, system or
component. Definitions of performance measurement tend to be predicated upon an
assumption about why the performance is being measured.
Performance metrics are used to measure the behavior, activities, and performance of a
business. This should be in the form of data that measures required data within a range,
allowing a basis to be formed supporting the achievement of overall business goals.
Measuring performance through metrics is key to seeing how employees are working, and
whether targets are being met.

default baseline models:

It is very tempting to jump right into research and to implement a cutting edge deep learning
solution. However, that is the point where I usually tell myself to stay pragmatic and build a
decent baseline first. Many data scientists underestimate the importance of having a baseline.
I love baseline models for their ability to deliver 90% of value for 10% of the effort. An 80%
accurate model in 2 days is better than an 81.5% accurate model in 4 weeks and that is what
is important when working with clients. The beauty of a decent baseline model is that it is
very hard to beat and the cutting edge models will achieve just a marginal improvement over
it. There are few requirements for a good baseline model:

1. Baseline model should be simple. Simple models are less likely to overfit. If you
see that your baseline is already overfitting, it makes no sense to go for more
complex modelling, as the complexity will kill the performance.

2. Baseline model should be interpretable. Explainability will help you to get a


better understanding of your data and will show you a direction for the feature
engineering.
These two reasons lead us to my favourite choice of baseline models, which are the models
from the decision tree family. Another amazing fact about trees is that tree-based models
are non-parametric and do not require the data to be normally distributed.

selecting hyper parameters:

In machine learning, hyperparameter optimization or tuning is the problem of choosing a set


of optimal hyperparameters for a learning algorithm. A hyperparameter is a parameter whose
value is used to control the learning process. By contrast, the values of other parameters
(typically node weights) are learned.

When creating a machine learning model, you'll be presented with design choices as to how
to define your model architecture. Often times, we don't immediately know what the optimal
model architecture should be for a given model, and thus we'd like to be able to explore a
range of possibilities. In true machine learning fashion, we'll ideally ask the machine to
perform this exploration and select the optimal model architecture automatically. Parameters
which define the model architecture are referred to as hyperparameters and thus this process
of searching for the ideal model architecture is referred to as hyperparameter tuning.
These hyperparameters might address model design questions such as:

 What degree of polynomial features should I use for my linear model?


 What should be the maximum depth allowed for my decision tree?

 What should be the minimum number of samples required at a leaf node in my


decision tree?

 How many trees should I include in my random forest?


 How many neurons should I have in my neural network layer?

 How many layers should I have in my neural network?


 What should I set my learning rate to for gradient descent?

I want to be absolutely clear: hyperparameters are not model parameters, and they cannot be
directly trained from the data. Model parameters are learned during training when we
optimize a loss function using something like gradient descent. The process for learning
parameter values is shown generally below.
Whereas the model parameters specify how to transform the input data into the desired
output, the hyperparameters define how our model is actually structured. Unfortunately,
there's no way to calculate “which way should I update my hyperparameter to reduce the
loss?” (ie. gradients) in order to find the optimal model architecture; thus, we generally resort
to experimentation to figure out what works best.
In general, this process includes:

1. Define a model

2. Define the range of possible values for all hyperparameters

3. Define a method for sampling hyperparameter values

4. Define an evaluative criteria to judge the model

5. Define a cross-validation method


Specifically, the various hyperparameter tuning methods I'll discuss in this post offer various
approaches to Step 3.
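Steps 1-5 can be sketched with an exhaustive grid search, the simplest sampling method for Step 3. Everything here is a made-up illustration: `evaluate` stands in for a real cross-validated score, and the grid values are arbitrary.

```python
import itertools, random

random.seed(0)

def evaluate(max_depth, learning_rate):
    # stand-in for step 4's evaluative criterion (e.g. cross-validated accuracy);
    # pretends the optimum is near depth 5 and learning rate 0.1, plus noise
    return -abs(max_depth - 5) - abs(learning_rate - 0.1) + random.uniform(0, 0.01)

# steps 2-3: define the range of each hyperparameter and sample exhaustively
grid = {"max_depth": [3, 5, 7], "learning_rate": [0.01, 0.1, 1.0]}
best_score, best_params = float("-inf"), None
for depth, lr in itertools.product(grid["max_depth"], grid["learning_rate"]):
    score = evaluate(depth, lr)
    if score > best_score:
        best_score, best_params = score, (depth, lr)

assert best_params == (5, 0.1)   # the grid point closest to the assumed optimum
```

Random search and Bayesian optimization differ only in how Step 3 samples the grid; the surrounding loop stays the same.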
2. Debugging strategies:

 Incremental and bottom-up program development. ...


 Instrument program to log information. ...
 Instrument program with assertions. ...
 Use debuggers. ...
 Backtracking. ...
 Binary search. ...
 Problem simplification. ...
 A scientific method: form hypotheses.
In the context of software engineering, debugging is the process of fixing a bug in the
software. In other words, it refers to identifying, analyzing and removing errors. This activity
begins after the software fails to execute properly and concludes by solving the problem and
successfully testing the software. It is considered to be an extremely complex and tedious
task because errors need to be resolved at all stages of debugging.

Debugging Strategies:

1. Study the system long enough to understand it. This helps the debugger
construct different representations of the system being debugged, depending
on the need. The system is also studied actively to find recent changes
made to the software.
2. Backward analysis of the problem involves tracing the program backward
from the location of the failure message in order to identify the region of faulty
code. A detailed study of the region is conducted to find the cause of the defects.
3. Forward analysis of the program involves tracing the program forward using
breakpoints or print statements at different points in the program and studying the
results. The region where the wrong outputs are obtained is the region that needs
to be examined to find the defect.
4. Use past experience of debugging software with problems similar in nature.
The success of this approach depends on the expertise of the debugger.
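Two of the strategies listed above, logging instrumentation and assertions, can be shown in a few lines. The `normalize` function here is a made-up example, not part of any library; the point is the pattern of logging intermediate state and asserting pre- and post-conditions.

```python
import logging

logging.basicConfig(level=logging.DEBUG, format="%(levelname)s: %(message)s")

def normalize(values):
    """Scale values to sum to 1, instrumented with logging and assertions."""
    assert values, "precondition: input must be non-empty"
    total = sum(values)
    logging.debug("normalize: n=%d total=%s", len(values), total)  # log info
    assert total != 0, "precondition: values must not sum to zero"
    result = [v / total for v in values]
    # postcondition: the output really is a probability distribution
    assert abs(sum(result) - 1.0) < 1e-9
    return result

print(normalize([1, 2, 7]))  # [0.1, 0.2, 0.7]
```

When a bug is introduced, the assertion that fires localizes the fault, and the log line shows the state just before the failure, which is what forward analysis with print statements does more systematically.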

UNIT-V
Large-Scale Deep Learning
 The number of neurons must be large.
 Requires high-performance hardware and software infrastructure.
 Use GPU computing or the CPUs of many machines networked together.
 Careful specialization of numerical computation routines can yield a large payoff.
Other strategies, besides choosing whether to use fixed or floating point, include
optimizing data structures to avoid cache misses and using vector instructions.
 Graphics processing units (GPUs) are specialized hardware components
that were originally developed for graphics applications.
 Graphics cards are designed to have a high degree of parallelism and high
memory bandwidth, at the cost of a lower clock speed and less branching
capability relative to traditional CPUs.
 Neural networks usually involve large and numerous buffers of parameters,
activation values, and gradient values, each of which must be completely updated
during every step of training.
 memory operations are faster if they can be coalesced. Coalesced reads or writes
occur when several threads can each read or write a value that they need
simultaneously, as part of a single memory transaction. Different models of GPUs
are able to coalesce different kinds of read or write patterns. Typically, memory
operations are easier to coalesce if among n threads, thread i accesses byte i + j of
memory, and j is a multiple of some power of 2.
 Threads are divided into small groups called warps, and each thread in a warp
executes the same instruction during each cycle. This means that branching can
be difficult on a GPU: if different threads within the same warp need to execute
different code paths, these code paths must be traversed sequentially rather
than in parallel.
 A key strategy for reducing the cost of inference is model compression
 Another strategy is to use dynamic structure in the graph describing the
computation needed to process an input.
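The payoff from specialized numerical routines and vector instructions can be seen even from Python: a single vectorized NumPy call replaces an interpreted per-element loop and dispatches to optimized (SIMD-capable) code. This is a toy illustration; actual speedups depend on hardware and array sizes.

```python
import numpy as np

x = np.arange(1_000_000, dtype=np.float64)

def dot_loop(a, b):
    # Interpreted loop: one Python-level multiply-add per element.
    total = 0.0
    for ai, bi in zip(a, b):
        total += ai * bi
    return total

def dot_vec(a, b):
    # One vectorized call: contiguous memory access, vector instructions.
    return float(np.dot(a, b))

# Both compute the same value; the vectorized version is typically
# orders of magnitude faster on large arrays.
assert dot_loop(x[:1000], x[:1000]) == dot_vec(x[:1000], x[:1000])
```

The same principle, contiguous layouts plus fused vectorized kernels, is what GPU libraries apply at much larger scale.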
Natural Language Processing
 A language model defines a probability distribution over sequences of tokens
in a natural language. Depending on how the model is designed, a token may
be a word, a character, or even a byte.
 Training n-gram models is straightforward because the maximum likelihood
estimate can be computed simply by counting how many times each possible
n-gram occurs in the training set.
 When Pn−1 is non-zero but Pn is zero, the test log-likelihood is −∞. To avoid such
catastrophic outcomes, most n-gram models employ some form of smoothing.
Smoothing techniques
 One basic technique consists of adding non-zero probability mass to all of the
possible next symbol values. This method can be justified as Bayesian inference
with a uniform or Dirichlet prior over the count parameters.
 Classical n-gram models are particularly vulnerable to the curse of dimensionality.
There are |V|^n possible n-grams and |V| is often very large. Even with a massive
training set and modest n, most n-grams will not occur in the training set. One way
to view a classical n-gram model is that it is performing nearest-neighbor lookup.
In other words, it can be viewed as a local non-parametric predictor, similar to
k-nearest neighbors.
 To improve the statistical efficiency of n-gram models, class-based language
models (Brown et al., 1992; Ney and Kneser, 1993; Niesler et al., 1998) introduce
the notion of word categories and then share statistical strength between words that
are in the same category. The idea is to use a clustering algorithm to partition the
set of words into clusters or classes, based on their co-occurrence frequencies with
other words.
 Neural language models or NLMs are a class of language model designed
to overcome the curse of dimensionality problem for modeling natural language
sequences by using a distributed representation of words (Bengio et al., 2001).
Unlike class-based n-gram models, neural language models are able to recognize
that two words are similar without losing the ability to encode each word as distinct
from the other. Neural language models share statistical strength between one
word (and its context) and other similar words and contexts. The distributed
representation the model learns for each word enables this sharing by allowing the
model to treat words that have features in common similarly.
 Word embeddings: in this interpretation, we view the raw symbols as points in a
space of dimension equal to the vocabulary size. The word representations embed
those points in a feature space of lower dimension. In the original space, every
word is represented by a one-hot vector, so every pair of words is at Euclidean
distance √2 from each other.
 In many applications, V contains hundreds of thousands of words. The naive
approach to representing such a distribution is to apply an affine transformation
from a hidden representation to the output space, then apply the softmax function.
Suppose we have a vocabulary V with size |V|. The weight matrix describing the
linear component of this affine transformation is very large, because its output
dimension is |V|.
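The maximum-likelihood counting and additive smoothing described above can be sketched for a bigram (n = 2) model. The tiny corpus is illustrative; with add-alpha (Laplace) smoothing, an unseen bigram gets non-zero probability instead of producing a −∞ test log-likelihood.

```python
from collections import Counter

corpus = "the cat sat on the mat the cat ran".split()
vocab = set(corpus)

unigrams = Counter(corpus)                      # counts for the context word
bigrams = Counter(zip(corpus, corpus[1:]))      # counts for each bigram

def p_mle(w, prev):
    """Maximum-likelihood bigram estimate: zero for unseen bigrams."""
    return bigrams[(prev, w)] / unigrams[prev]

def p_laplace(w, prev, alpha=1.0):
    """Add-alpha smoothing: non-zero mass for every possible next word,
    justifiable as Bayesian inference with a Dirichlet prior on the counts."""
    return (bigrams[(prev, w)] + alpha) / (unigrams[prev] + alpha * len(vocab))

print(p_mle("sat", "cat"))      # seen bigram: 0.5
print(p_mle("ran", "mat"))      # unseen bigram: 0.0
print(p_laplace("ran", "mat"))  # smoothed: 1/7, non-zero
```

Smoothing trades a little probability mass away from observed n-grams to cover the |V|^n − (observed) events the training set never showed.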

Boltzmann Machine

A Boltzmann Machine is a kind of recurrent neural network in which the nodes make binary
decisions and carry certain biases. Several Boltzmann machines can be combined to build even
more sophisticated systems such as a deep belief network. The model is named after the famous
Austrian physicist Ludwig Boltzmann, whose 19th-century work on the Boltzmann distribution
provides its foundation; this type of network was further developed by Geoffrey Hinton. It
borrows ideas from thermodynamics to work toward desired states, and consists of a network of
symmetrically connected, neuron-like units that make stochastic decisions about whether to be
active or not.

Boltzmann Machines come with a learning algorithm that helps them discover interesting
features in datasets composed of binary vectors. The learning algorithm is generally slow in
networks with many layers of feature detectors but can be made faster by learning one layer
of feature detectors at a time. Boltzmann machines are typically used to solve two kinds of
computational problem. For a search problem, the weights on the connections are fixed and are
used to represent the cost function of an optimization problem. For a learning problem, the
Boltzmann machine is presented with a set of binary data vectors and must find weights on the
connections so that the data vectors are good solutions to the optimization problem defined by
those weights. To solve a learning problem, Boltzmann machines make many small updates to
their weights, and each update requires solving many different search problems.
Uses Of Boltzmann Machine

The main purpose of the Boltzmann Machine is to optimize the solution of a problem: it
optimizes the weights and quantities related to the particular problem assigned to it. This
method is used when the main objective is to create a mapping and learn from the attributes
and target variables in the data. When the objective is to identify an underlying structure or
pattern within the data, unsupervised learning methods for this model are considered more
useful. Some of the most popular unsupervised learning methods are clustering,
dimensionality reduction, anomaly detection and creating generative models.

 Each of these techniques has a different objective of detecting patterns such as identifying latent
grouping, finding irregularities in the data, or generating new samples from the available data.
These networks can also be stacked layer-wise to build deep neural networks that capture highly
complicated statistics. The use of Restricted Boltzmann Machines has gained popularity in the
domain of imaging and image processing as well since they are capable of modelling continuous
data that are common to natural images. They also are being used to solve complicated quantum
mechanical many-particle problems or classical statistical physics problems like the Ising and
Potts classes of models.

Components Of Boltzmann Machine

The architecture of the Boltzmann Machine comprises a shallow, two-layer neural network that
also constitutes the building block of deep networks. The first layer of this model is called the
visible or input layer and the second is the hidden layer. The layers consist of neuron-like units
called nodes, and the nodes are where calculations take place. Nodes are interconnected across
layers, but no two nodes of the same layer are linked; there is therefore no intra-layer
communication, which is the restriction that gives the restricted Boltzmann machine its name.
Each node processes its input and makes a stochastic decision about whether to transmit it or
not. When data is fed as input, the nodes learn the parameters, their patterns and the
correlations between them on their own, forming an efficient system. Hence a Boltzmann
Machine is also termed an unsupervised deep learning model.

This model can then be trained to monitor and study abnormal behaviour based on what it has
learnt. The coefficients that modify the inputs are randomly initialized. Each visible node takes
a low-level feature from an item in the dataset to be learned, multiplies it by a weight, and adds
a bias; the result of these two operations is fed into an activation function, which produces the
node's output, also known as the strength of the signal passing through it. The outputs of the
first hidden layer are then passed as inputs to the second hidden layer, and so on through as
many hidden layers as are created, until they reach a final classifying layer. For simple
feed-forward movements, the nodes function as an autoencoder. Learning is typically very
slow in Boltzmann machines with many hidden layers, as large networks may take a long time
to approach their equilibrium distribution, especially when the weights are large and the
distribution is highly multimodal.
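The visible/hidden structure and stochastic binary activations described above can be sketched for a restricted Boltzmann machine. Everything here is illustrative: the layer sizes, random weights, and example vector are assumptions, and training (e.g. contrastive divergence) is omitted; the sketch only shows one Gibbs sampling step of hidden activation and visible reconstruction through the shared, symmetric weight matrix.

```python
import numpy as np

rng = np.random.default_rng(0)

n_visible, n_hidden = 6, 3
W = rng.normal(0, 0.1, size=(n_visible, n_hidden))  # symmetric couplings
b_v = np.zeros(n_visible)                           # visible biases
b_h = np.zeros(n_hidden)                            # hidden biases

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_hidden(v):
    """Stochastic binary decision for each hidden unit given the visibles."""
    p = sigmoid(v @ W + b_h)                        # activation probability
    return (rng.random(n_hidden) < p).astype(float), p

def sample_visible(h):
    """Symmetric connections: the same W reconstructs the visible layer."""
    p = sigmoid(h @ W.T + b_v)
    return (rng.random(n_visible) < p).astype(float), p

v0 = np.array([1.0, 0.0, 1.0, 1.0, 0.0, 0.0])       # one binary data vector
h, p_h = sample_hidden(v0)                          # visible -> hidden
v1, p_v = sample_visible(h)                         # hidden -> visible
print(h, v1)
```

Because no two nodes within a layer are connected, each layer's units are conditionally independent given the other layer, which is what makes this block-wise sampling possible.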
15.Additional Topics:

16.Question Papers
VASAVI COLLEGE OF ENGINEERING(AUTONOMOUS), IBRAHIMBAGH, HYDERABAD-500031
Department of Information Technology
BE(CBCS) VIII Semester (2019-2020) – I – Internal Examination
Title of the course: Deep Learning (Professional Elective-VI) (PE850IT)
Maximum Marks: 20 Duration: 60 min.
Date: 23-02-2020 Time: 02:30 PM to 03:30 PM

Q.No.  Description of the question  Marks  BTL (1/2/3/4/5/6)  Mapped CO  Mapped PO

Part-A (6 x 1 Mark = 06 Marks): answer all questions

1.
Define a Neural Network. 1 1 1 1

2.
State the drawback of a single-layer Perceptron. 1 1 1 1

3.
Mention some applications of Deep Learning. 1 1 1 1

4.
State the importance of Early Stopping. 1 1 2 1

5.
Give the equation for the expected squared error of 1 1 2 1
the ensemble predictor.
6.
Differentiate between L1 and L2 regularizations. 1 2 2 1

Part-B (02 x 07 Marks: 14 Marks) Answer any two questions

7. a) Explain Sigmoid Neuron, using an example. 5 1 1 1
   b) Discuss the advantages of Multilayer Neural Networks. 2 1 1 1

8. a) Write an algorithm by applying Nesterov Momentum to RMSProp. 5 3 2 2
   b) Discuss the importance of Adversarial Training. 2 1 2 1

9. a) Explain Gradient Descent, using an example. 4 1 1 1
   b) Write an Early Stopping meta-algorithm to determine at what objective
      value we start to overfit, then continue training until that value is
      reached. 3 3 2 2

Summary of the percentage for each of the criteria BTL (Blooms Taxonomy Level) from the questions
framed.
1. Fundamental knowledge from Level-1 (Recall) & 2 (understand) : 60 %
2. Knowledge on application and analysis from Level-3(Apply) & 4 (Analyze) : 40 %
3. Critical thinking and ability to design from Level-5 (Estimate) & 6 (Create or Design): 00%

VASAVI COLLEGE OF ENGINEERING(AUTONOMOUS), IBRAHIMBAGH, HYDERABAD-500031
Department of Information Technology
BE(CBCS) VIII Semester (2019-2020) – I – Internal Examination
Title of the course: Deep Learning (Professional Elective-VI) (PE850IT)
Maximum Marks: 20 Duration: 60 min.
Date: 23-02-2020 Time: 02:30 PM to 03:30 PM

Q.No.  Description of the question  Marks  BTL (1/2/3/4/5/6)  Mapped CO  Mapped PO

Part-A (6 x 1 Mark = 06 Marks): answer all questions

10.
Define a Perceptron. 1 1 1 1

11.
Mention any two applications of Multilayer Neural 1 1 1 1
Networks.
12.
What is Representation Learning? 1 1 1 1

13.
State the importance of Regularization. 1 1 2 1

14.
How does Early Stopping act as a regularizer? 1 1 2 1

15.
How is Data Augmentation effective for Object 1 2 2 1
Recognition?

Part-B (02 x 07 Marks: 14 Marks) Answer any two questions

16. a) Discuss in detail about Backpropagation. 5 1 1 1
    b) Explain Representation Learning. 2 1 1 1

17. a) Write an Early Stopping meta-algorithm for determining the best
       amount of time to train. 5 3 2 2
    b) Discuss briefly about Bagging. 2 1 2 1

18. a) Design a Perceptron for the AND Boolean function. 3 3 1 2
    b) Discuss the advantages of Parameter Initialization Strategies. 4 1 2 1

VASAVI COLLEGE OF ENGINEERING(AUTONOMOUS), IBRAHIMBAGH, HYDERABAD-500031
Department of Information Technology
BE(CBCS) VIII Semester (2019-2020) – II – Internal Examination
Title of the course: Deep Learning (Professional Elective-VI) (PE850IT)
Maximum Marks: 20 Duration: 30 min.
Date: 27-06-2020(FN) Time: 11:00 AM to 11:30 AM

Q.No.  Description of the question  Marks  BTL (1/2/3/4/5/6)  Mapped CO  Mapped PO
19.
48 filters of size 21 x 21 are applied to an image of size 327 x 327, with 1 2 3 1
zero padding and a stride of 3. The image is an RGB image, and the depth of
the filter is the same as the depth of the image. What will be the volume of the
final image?
a) 103 X 103 X 3 b) 103 X 103 X 48 c)  327 X 327 X 3
d) 327 X 327 X 48
20.
Consider the following: 1 2 3 1
W= [0.2, 0.7, 0.05, 0.75, 0.86, 0.21]
X= [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
 What output St is obtained by sliding the filter Wt over the input Xt ?
a)  [1.306, 1.421, 1.567, 2.002] b)  [1.031, 1.308, 1.585, 1.862]
c) [2.345, 2.121, 3.547, 2.409] d) [2.127, 2.229, 3.212, 3.421]
21.
 What will be the output of the following convolution operation? 1 3 3 2

a)

b)

c)

d)
22.
What will be the output of the following matrix after applying average 1 3 3 2
pooling operation, 2x2 and stride=2?

a) b)

b) d)
23.
What is the number of layers in the AlexNet, VGG, GoogLeNet and ResNet models? 1 2 3 1

24.
What is the number of parameters in a max-pooling layer? 1 1 3 1
a) Number of filters times dimension of each filter
b) Number of filters
c) One d) zero

25.
What is the output dimension of the resulting image when a 7 × 7 1 2 3 2
kernel is applied to a 9 × 9 image?
a) 3 X 3 b) 4 X 4 c) 5 X 5 d) 2 X 2
26.
What is the difference between back-propagation algorithm and back- 1 2 4 1
propagation through time (BPTT) algorithm?
a)  Unlike back-propagation, in BPTT we add the gradients for
corresponding weight for each time step
b)  Unlike back-propagation, in BPTT we subtract the gradients
for corresponding weight for each time step
c) No difference
d) None of the above
27.
What technique is followed to deal with the problem of Exploding 1 1 4 1
Gradients in Recurrent Neural Networks (RNN)?
a) Parameter Tying b) Gradient clipping
c) Using modified architectures like LSTMs and GRUs d) Using dropout
28. LSTMs provide more controllability and better results compared to
RNNs, but also come with more complexity and operating cost. 1 1 4 1
a) True b) False
29.
What is the number of zeros in the derivative of h_t w.r.t. s_t, where 1 2 4 1
h_t, s_t ∈ R^n?
a) n-1 b) n c) n^2 - n d) 0
30.
In LSTM, during forward propagation, the gates control the flow of 1 1 4 1
information.
a) True b) False
31.
Why is an RNN (Recurrent Neural Network) used for machine 1 1 4 1
translation, say translating English to French?
a) It is applicable when the input/output is a sequence (e.g., a
sequence of words)
b) It is strictly more powerful than a Convolutional Neural
Network (CNN)
c) RNNs do not have the problem of vanishing gradients
d) None of the above
32.
RNNs can be used with convolutional layers to extend the effective 1 1 4 1
pixel neighborhood
a) True b) False
33. Given below is the representation of LSTM. What is the number of operations that take
place at a given timestep, t?
1 2 4 1

a) 4 b) 3 c) 6 d) 2
34.
Keras is a deep learning framework on which tool? 1 1 5 1
a) R b) Tensorflow c) SAS d) Azure
35.
How do calculations work in TensorFlow? 1 1 5 1
a) Through vector multiplications b) Through RDDs
c) Through Computational Graphs d) Through map reduce tasks
36.
Why does Tensorflow use Computational Graphs? 1 1 5 1
a) Tensors are nothing but Computational Graphs
b) Graphs are easy to plot
c) There is no such concept of Computational Graphs in Tensorflow
d) Calculations can be done in parallel
37.
Which tool is a Deep Learning Wrapper on Tensorflow? 1 1 5 1
a) Python b) Keras c) PyTorch d) Azure
38.
How do we perform calculations in Tensorflow? 1 1 5 1
a) We launch the computational graph in TensorFlow.
b) We launch the session inside a Computational Graph
c) By creating multiple Tensors
d) By creating data Frames.

Code No: 138DU R16
JAWAHARLAL NEHRU TECHNOLOGICAL UNIVERSITY HYDERABAD
B. Tech IV Year II Semester Examinations, September - 2020
NEURAL NETWORKS AND DEEP
LEARNING
(Common to CSE, IT)

Time: 2 Hours Max. Marks: 75

Answer any Five Questions


All Questions Carry Equal Marks

---

1. List and explain the various activation functions used in modeling of artificial neuron.
Also explain their suitability with respect to applications. [15]
2. Describe the Characteristics of Continuous Hopfield memory and discuss how it can be
used to solve Traveling salesman Problem. [15]
3. Explain the architecture and algorithm of full CPN with diagram. [15]
4. Give the architecture of the Kohonen self-organizing map and explain how it is used to
cluster the input vectors. [15]
5. Give an example of learning XOR function to explain a fully functioning feed
forward network. [15]
6. Explain in detail about the concept of gradient based learning. [15]
7. Write an early stopping meta-algorithm for determining the best amount of time to train.
[15]
8. Discuss the application of second-order methods to the training of deep networks. [15]
---ooOoo---

QUESTION BANK

UNIT-I

CO1: Describe various deep learning algorithms used across various domains.

DESCRIPTIVE QUESTIONS
1. List the historical developments in deep learning.

2. Find the singular valued decomposition of the matrix

3. Find the mean and variance of exponential distribution.


4. The probability that a person has a certain disease is 0.03. Medical diagnostic tests are
available to determine whether the person actually has the disease. If the disease is
actually present, the probability that the medical diagnostic test will give a positive result
(indicating that the disease is present) is 0.90. If the disease is not actually present, the
probability of a positive test result (indicating that the disease is present) is 0.02. Suppose
that the medical diagnostic test has given a positive result (indicating that the disease is
present). What is the probability that the disease is actually present? What is the
probability of a positive test result?
5. Using Moore-Penrose Pseudo inverse method, Find the inverse of the matrix

6. Find the Eigen Values and Eigen Vectors of the matrix

7. Suppose 30% of the women in a class received an A on the test and 25% of the men received an
A. The class is 60% women. Given that a person chosen at random received an A, what is the
probability this person is a woman?
8. Find the mean and variance of Uniform distribution?
9. Explain about structural probabilistic models along with an example.
10. Write a short note on Expectations, Mean and Variance.
11. What is the expected value, mean and variance of the sum of three dice thrown together?

UNIT-II

CO2: Design the feed forward neural network using appropriate techniques.

DESCRIPTIVE QUESTIONS
1. Explain briefly about gradient descent optimization along with an example
2. Illustrate constrained optimization?
3. Using Least squares method, Find the regression line of the following data.
X Y

0 1

1 2

2 4

3 3.5

4 5

5 4

6 7

7 9

8 12

9 17

4. Illustrate the mechanism of Bias-Variance trade-off


5. Suppose our data x1, . . . xn are independently drawn from a uniform distribution U(a, b).
Find the MLE estimate for a and b.
6. Illustrate linear regression using Gradient-Descent method?
7. How can overfitting and underfitting affect model generalization?

8. Suppose the data x1, x2, . . . , xn is drawn from a N(µ, σ2 ) distribution(Normal
distribution), where µ and σ are unknown. Find the maximum likelihood estimate
for the pair (µ, σ2 ).
9. How does cross validation reduce bias and variance?

UNIT-III

CO3: Develop the conditional random fields and its use in designing the deep neural network.

DESCRIPTIVE QUESTIONS
1. How do Neural Networks solve the XOR problem using the Backpropagation method?
2. What is a training set and how is it used to train neural networks?
3.

4. You are given the following neural networks which take two binary valued inputs x 1, x2 ∈ {0, 1}
and the activation function is the threshold function(h(x) = 1 if x > 0; 0 otherwise). Which of the
following logical functions does it compute?

5. Describe the procedure of obtaining local minima using Gradient Descent method?

6. The XOR function (exclusive or) returns true only when one of the arguments is true and
the other is false; otherwise, it always returns false. Do you think it is possible to implement
this function using a single unit? A neural network of several units? Explain.

7. Illustrate how data augmentation will improve the performance of deep learning model?
8. Illustrate how Bayesian inference over the weights will resolve Noise Robustness of a
neural network model?

9. Explain L1 parameter regularization of norm penalties under constrained


optimization

10. Explain L2 parameter regularization of norm penalties under constrained optimization.

11. Explain Newton's second-order derivative method for arriving at the optimal point of a loss function.

12. How does sampling the weights of each fully connected layer give an appropriate
initialization of the parameters of a neural network model?
13. Illustrate how Batch Normalization has stabilized the learning process for faster
convergence rates?

UNIT-IV

CO4: Perform research on various challenges in deep neural networks.

DESCRIPTIVE QUESTIONS
1. Discuss the Variants of the basic convolution function?
2. Suppose you have a convolutional network with the following architecture:
• The input is an RGB image of size 256 × 256.
• The first layer is a convolution layer with 32 feature maps and filters of size 3 ×3. It
uses a stride of 1, so it has the same width and height as the original image.
• The next layer is a pooling layer with a stride of 2 (so it reduces the size of each
dimension by a factor of 2) and pooling groups of size 3 × 3.
Determine the size of the receptive field for a single unit in the pooling layer.
3. An input image has been converted into a matrix of size 12 X 12 along with a filter of
size 3 X 3 with a Stride of 1. Determine the size of the convoluted matrix?
4. Illustrate hyper-parameter tuning to minimize the generalization error of a neural network
model?
5. Illustrate Sequence-to-Sequence Recurrent Neural Network architectures?

6. Determine what the following recurrent neural network computes. More precisely,
determine the function computed by the output unit at the final time step; the other
outputs are not important. All of the biases are 0. You may assume the inputs are integer
valued and the length of the input sequence is even?

7. Draw the unfolding graph for the Recurrent Neural Network model.

8. Illustrate how autoencoders are trained to reconstruct the input data in representation
learning mechanism?

9. Find the convolution of following signals?

i) = ii) = iii) iv)

10. Analyze the equation using Newton’s method to find the roots of the equation

for three decimal places.

11. Use Steepest Descent method for f( )= starting from the

point

12. Manufacturing companies X and Y produce mobile phones. The manufacturing cost is

C(x,y)= .

If the companies' objective is to produce 1900 units per month while minimizing the total
monthly cost of production, how many units should be produced at each factory?
13. Explain bi-directional RNNs.

UNIT-V

CO5: Optimize the deep neural network and to experiment various tools.

DESCRIPTIVE QUESTIONS
1. Explain in detail about Large Scale Deep Learning
2. What is convolutional Boltzmann machine?
3. What is linear factor model?
4. Illustrate reparameterization in Variational Autoencoders to achieve back
propagation through random operations?
5. Discuss the applications of Deep Learning in Computer Vision
6. Illustrate how autoencoders are trained to reconstruct the input data in
representation learning mechanism?
7. Explain how Deep Belief Network will be formed through the process of
gradient descent and back propagation of Restricted Boltzmann machines?
8. Discuss the important debugging tests in deep learning
9. What are encoders and decoders?

18. Assignment Questions


Assignment-I

1. Decompose the given matrix using eigendecomposition

A=

2. Decompose the given matrix using SVD

A=

3. Compute the Moore-Penrose pseudoinverse of the given matrix

A=

Assignment-II
4. Explain structured probabilistic models, along with an example.
5. Write short notes on Expectation, Mean, and Variance.
Ex: What is the expected value, mean and variance of the sum of three dice thrown together?
Assignment-III
6. Explain briefly about gradient descent optimization along with an example?
7. Explain L1 and L2 parameter regularization of norm penalties.
Assignment-IV
8. Discuss the role of hyper parameters in a deep learning application.
9. Discuss the variants of the basic convolution function
Assignment V

10. Illustrate Sequence to Sequence RNN architecture.


11. Explain about Deep Belief Networks

19. Unit wise Quiz Questions


Unit-1
1. A linear equation in three variables represents a?
A. flat objects B. line C. Planes D. Both A and C
2. Which of the following is correct method to solve matrix
equations?
A. Row Echelon Form B. Inverse of a Matrix
C. Both A and B D. None Of the above

3. The concept of Eigen values and vectors is applicable to?
A. Scalar matrix B. Identity matrix
C. Upper triangular matrix D. Square matrix
4. The rank of a 3 x 3 matrix C (= AB), found by multiplying a non-zero
column matrix A of size 3 x 1 and a non-zero row matrix B of size 1 x 3, is
A. 0 B. 1 C. 2 D. 3
5. How many of the following matrices have an eigenvalue 1?

A.1 B. 2 C. 3 D. 4
6. For any square matrix A, AAT is a
a) Unit matrix b) Symmetric matrix
c) Skew symmetric matrix d) Diagonal matrix
7. The eigenvalues of a 4 4 matrix A are given as 2,3,13, and 7. The detA is
a) 546 b) 19 c) 25 d) cannot be determined
8. The eigen vector (s) of the matrix is (are)

a) b) c) d)
9. Let A be a matrix such that A^k = 0. What is the inverse of I - A?
a) 0 b) 1 c) A d)
e) Inverse is not guaranteed to exist
10. Let A and B be real symmetric matrices of size . Then which one of the
following is true?
a) b) c)
d)
11. If M is a square matrix with a zero determinant, which of the following assertion
(s) is (are) correct?
S1: Each row of M can be represented as a linear combination of the other rows
S2: Each column of M can be represented as a linear combination of the other
columns
S3: MX = 0 has a nontrivial solution
S4: M has an inverse
a) S3 and S2 b) S1 and S4 c) S1 and S3 d) S1, S2 and S3

12. Let A be a n×n matrix. Which of the following properties would necessarily imply that
A is singular?
I. The columns of A are linearly dependent.
II. A has a singular value that is 0.
III. Az = 0, for some z ≠ 0.
a) II only b) I and II only c) I and III only d) II and III only
e) I, II and III
13. Every m x n matrix has a singular value decomposition.
a) True b) False c) Invalid statement

14. Which of the following are true about principal components analysis (PCA)?
a) The principal components are eigenvectors of the centered data matrix.
b) The principal components are eigenvectors of the sample covariance matrix.
c) The principal components are right singular vectors of the centered data matrix.
d) The principal components are right singular vectors of the sample covariance matrix.
15. A man buys 10 bulbs, each with independent exponentially distributed lifetimes with the
same mean, with the intention of using one bulb at a time and replacing it with another as soon as
it fails. The distribution of the total duration of the 10 bulbs taken together is
a) Exponential b) Normal c) Uniform d) Bernoulli
16. Let A and B be events on the same sample space, with P (A) = 0.6 and P (B) = 0.7. Can these
two events be disjoint?
a) Yes b) No

17. Alice has 2 kids and one of them is a girl. What is the probability that the other child is
also a girl? You can assume that there are an equal number of males and females in the world.
a) 0.5 b) 0.25 c) 0.333 d) 0.75

18. Given two Boolean random variables, A and B, where P(A) = ½, P(B) = 1/3, and P(A | ¬B)
= ¼, what is P(A | B)?
a) 1/6 b) ¼ c) ¾ d) 1

Unit-2
1. What does it mean if your model has overfit the data?
a) It has memorized the correct answers to the test data.
b) It hasn't captured enough details.
c) It has captured details in the training data that are irrelevant to the question.
2. How might a learning algorithm find a best line? (more than one option can be correct)
a) Use an iterative method like gradient descent
b) Trial and error.
c) Plot all possible lines and pick the one that looks best.
d) Set the derivative of the loss function equal to 0 and solve.
e) Brute Force search.
3. Why is Gradient Descent considered an iterative approach?
a) Because we are using continuous updates to converge to a minimum.
b) Because we are using step-wise updates to converge on a minimum

Unit-3
1. Which of the following guidelines is applicable to initialization of the weight vector in a
fully connected neural network.

a) Should not set it to zero since otherwise it will cause overfitting


b) Should not set it to zero since otherwise (stochastic) gradient descent will explore a very small
space
c) Should set it to zero since otherwise it causes a bias
d) Should set it to zero in order to preserve symmetry across all neurons
2. For a neural network, which one of these structural assumptions is the one that most
affects the trade-off between underfitting (i.e. a high bias model) and overfitting
(i.e. a high variance model):
a) The number of hidden nodes b) The learning rate
c) The initial choice of weights d) The use of a constant-term unit input

3. ___________ refers to a model that can neither model the training data nor generalize to new
data.
a) good fitting b) overfitting c) underfitting d) all of the above
4. Underdetermined problems are those problems that have  [ ]
a. infinitely many solutions b. finite solutions
c. unique solution d. None of the above

5. Dropout is a [ ]
a. optimization technique b. regularization technique
c. adversarial technique d. None of the above
6. RMSProp addresses the problem caused by accumulated gradients in [ ]
a. Adam b. Adadelta
c. Momentum d. AdaGrad
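The distinction behind question 6 can be made concrete: AdaGrad accumulates a monotonically growing sum of squared gradients, so its effective learning rate can only shrink, whereas RMSProp replaces the sum with an exponentially decaying average. A sketch in plain Python (illustrative, our own function and variable names):

```python
import math

def adagrad_step(w, grad, accum, lr=0.1, eps=1e-8):
    """AdaGrad: accum is a monotonically growing sum of squared gradients."""
    accum += grad ** 2
    w -= lr * grad / (math.sqrt(accum) + eps)
    return w, accum

def rmsprop_step(w, grad, avg, lr=0.1, rho=0.9, eps=1e-8):
    """RMSProp: avg is an exponentially decaying average, so old gradients fade."""
    avg = rho * avg + (1 - rho) * grad ** 2
    w -= lr * grad / (math.sqrt(avg) + eps)
    return w, avg

# After many identical gradients, AdaGrad's accumulator keeps growing
# (ever-smaller steps), while RMSProp's average settles near grad**2.
accum, avg = 0.0, 0.0
w1 = w2 = 1.0
for _ in range(1000):
    w1, accum = adagrad_step(w1, 0.5, accum)
    w2, avg = rmsprop_step(w2, 0.5, avg)
print(accum > 100, abs(avg - 0.25) < 1e-3)  # True True
```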
7. The sparsity property induced by L1 regularization has been used extensively
as a ---------------------- mechanism.
8. Dataset augmentation has been a particularly effective technique for ----------.
9. Optimization algorithms that use the entire training set are called ------------ gradient
methods.
10. --------------------------- and its variants are probably the most used optimization
algorithms for deep learning in particular.

Unit-4
1. Pooling layers are used to accomplish which of the following? [ ]
1. To progressively reduce the spatial size of the representation
2. To reduce the amount of parameters and computation in the network
3. To select the maximum value over the pooling region always
4. Pooling layer operates on each feature map independently
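The spatial-reduction role of pooling asked about above can be made concrete with a pure-Python 2×2 max-pooling sketch (a hypothetical helper, not taken from any library):

```python
def max_pool_2x2(feature_map):
    """2x2 max pooling with stride 2 on one feature map (list of lists)."""
    h, w = len(feature_map), len(feature_map[0])
    return [
        [max(feature_map[i][j], feature_map[i][j + 1],
             feature_map[i + 1][j], feature_map[i + 1][j + 1])
         for j in range(0, w, 2)]
        for i in range(0, h, 2)
    ]

fmap = [
    [1, 3, 2, 0],
    [4, 2, 1, 1],
    [0, 1, 5, 6],
    [2, 2, 7, 8],
]
pooled = max_pool_2x2(fmap)
print(pooled)  # [[4, 2], [2, 8]] -- 4x4 input reduced to 2x2
```

Note that the operation has no learned parameters and acts on each feature map independently, which is why spatial size shrinks while parameter count does not grow.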
2. Which of the following is true for most CNN architectures? [ ]
1. Size of input decreases, while depth increases
2. Multiple convolutional layers followed by pooling layers
3. Fully connected layers in the first few layers
4. Back propagation can be applied when using pooling layers
3. We cannot use Recurrent Neural Networks for [ ]
1. Image classification
2. Planet movement understanding in any planetary system
3. Stock market trend prediction
4. We can use Recurrent Neural Networks in all the above scenarios

4. Which of the following does not suffer from the vanishing gradient problem? [ ]
1. 1-layer feed forward networks
2. Very deep feed forward networks
3. Recurrent neural networks
4. Convolutional neural networks
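The vanishing-gradient effect asked about above can be sketched numerically: backpropagation multiplies one derivative factor per layer, and the sigmoid derivative is at most 0.25, so the product shrinks geometrically with depth. A toy calculation under the simplifying assumption of unit weights and pre-activations of zero:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_deriv(x):
    s = sigmoid(x)
    return s * (1.0 - s)   # maximum value 0.25, attained at x = 0

# Gradient magnitude through n sigmoid layers, evaluated at the most
# favourable point (pre-activation 0, where the derivative peaks).
for n in (1, 10, 50):
    grad = sigmoid_deriv(0.0) ** n
    print(n, grad)

# 1 layer:   0.25       -- no problem
# 10 layers: ~9.5e-07   -- already tiny
# 50 layers: ~7.9e-31   -- effectively vanished
```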
5. ------------------------------- provide a way to specialize neural networks to work with data that has a clear grid-structured topology and to scale such models to very large size.
6. --------------------- are a family of neural networks for processing sequential data.
7. The two basic approaches in choosing hyperparameters are --------------------------.

Unit-5

1. What does a Boltzmann machine consist of? [ ]
a. Fully connected network with both hidden and visible units
b. Asynchronous operation
c. Stochastic update
d. All the above

2. What should be the aim of the training procedure in a Boltzmann machine of feedback networks? [ ]
a. to capture inputs
b. to feedback the captured outputs
c. to capture the behaviour of the system
d. none of the above

3. A Deep Belief Network is a stack of Restricted Boltzmann Machines. [ ]
a. True b. False
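The Restricted Boltzmann Machines referred to above are energy-based models with energy E(v, h) = −vᵀWh − bᵀv − cᵀh and p(v, h) ∝ exp(−E). A tiny hand-parameterized sketch (untrained, illustrative values only; the exhaustive partition function is tractable only for toy sizes like this):

```python
import math
from itertools import product

def rbm_energy(v, h, W, b, c):
    """Energy of one Restricted Boltzmann Machine configuration (v, h)."""
    inter = sum(v[i] * W[i][j] * h[j]
                for i in range(len(v)) for j in range(len(h)))
    return (-inter
            - sum(bi * vi for bi, vi in zip(b, v))
            - sum(cj * hj for cj, hj in zip(c, h)))

# Tiny RBM: 2 visible and 2 hidden binary units, hand-picked parameters.
W = [[1.0, -0.5], [0.5, 1.0]]
b = [0.1, -0.1]
c = [0.0, 0.2]

# Partition function Z sums exp(-E) over all 16 binary configurations.
Z = sum(math.exp(-rbm_energy(v, h, W, b, c))
        for v in product([0, 1], repeat=2)
        for h in product([0, 1], repeat=2))
p = math.exp(-rbm_energy((1, 1), (1, 1), W, b, c)) / Z
print(round(p, 4))  # probability of the all-ones configuration
```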

4. The task of ---------------------- is to map an acoustic signal containing a spoken natural language utterance into the corresponding sequence of words intended by the speaker.

5. The ------------------------ model takes advantage of the observation that most variations in the
data can be captured by the latent variables, up to some small residual reconstruction error.
6. The ------------------- is an autoencoder that receives a corrupted data point as input and is trained to predict the original, uncorrupted data point as its output.

20. Tutorial Problems
NA

Mapping of Assignment questions and Tutorials with Bloom’s taxonomy


Course Name: Deep Learning Course Code: 18CS4111
Class: B.Tech IV Year CSE – I SEM - Section: A
Academic Year: 2021-22
Level-1: Remembering Level-2: Understanding Level-3: Applying
Level-4: Analyzing Level-5: Evaluating Level-6: Creating

Assignment      Question no.   Level 1   Level 2   Level 3   Level 4   Level 5   Level 6
Assignment-1    1
                2
                3
                4
                5
Assignment-2    1
                2
                3
                4
                5
Assignment-3    1
                2
                3
                4
                5
Assignment-4    1
                2
                3
                4
                5
Assignment-5    1
                2
                3
                4
                5

21. Known gaps
-No-

22. Discussion Topics

23. References, Journals, websites and E-links
REFERENCE BOOKS:

Deep Learning, Goodfellow, I., Bengio, Y., and Courville, A., MIT Press, 2016. (Units I–V)

JOURNALS, WEBSITES

1. https://www.deeplearningbook.org/contents/TOC.html
2. https://analyticsindiamag.com/
3. https://onlinecourses.nptel.ac.in/noc22_cs35

24. Quality Measurement Sheets

Course End Survey


NA

Teaching Evaluation

NA

25. Student List
Section-A

S.No Roll No Name
1 18R11A0501 Adavikolanu Swapna
2 18R11A0502 Andugula Shashaank
3 18R11A0503 Awari Deekshitha
4 18R11A0504 B Deevena Angeline Sunayana
5 18R11A0505 Bhamidipati Shiridi Prasad Revanth
6 18R11A0506 Ch Siri Sowmya
7 18R11A0507 Cheripalli Sreeja
8 18R11A0509 Errabelli Rushyanth
9 18R11A0510 G N Harshita
10 18R11A0511 Gajji Varun Kumar
11 18R11A0512 Sri Sai Pranavi Ganti
12 18R11A0513 H S Shreya
13 18R11A0514 Jangam Nagarjuna Goud
14 18R11A0515 Kanne Nithesh Sai
15 18R11A0516 Kodi Akhil Yadav
16 18R11A0517 Kola Snehitha
17 18R11A0518 Komuravelli Karthik
18 18R11A0519 Korada Santosh Kumar
19 18R11A0520 Kunchala Sairam
20 18R11A0521 L A Prithviraj Kumar
21 18R11A0522 Lahari Basavaraju
22 18R11A0523 Linga Jaya Krishna
23 18R11A0524 M Sree Charan Reddy
24 18R11A0525 Mambeti Sairam
25 18R11A0526 Mamilla Ramya
26 18R11A0527 Mohammad Afroz Khan
27 18R11A0528 Mohammed Abdul Ameen Siddiqui
28 18R11A0529 Muddula Anusha
29 18R11A0530 Musale Aashish
30 18R11A0531 Mutyala Santosh
31 18R11A0532 Pariti Divya
32 18R11A0533 Paruchuri Harsha Vardhan
33 18R11A0534 Patri Sai Sindhura
34 18R11A0535 Pinnem Tarun Kumar
35 18R11A0536 Pirangi Nithin Kalyan
36 18R11A0537 Poojaboina Preethi
37 18R11A0538 Puranam Satya Sai Rama Tarun
38 18R11A0539 S Guna Sindhuja
39 18R11A0540 Sangaraju Greeshma
40 18R11A0541 Syed Zainuddin

41 18R11A0542 Telukuntla Rajkumar
42 18R11A0543 Thorupunuri Jancy
43 18R11A0544 Thumu Ram Sai Teja Reddy
44 18R11A0545 Vadakattu Harish
45 18R11A0546 Vaishnavi Sabna
46 18R11A0547 Vemuri Madhu Venkata Sai
47 18R11A0548 Yarram Reddy Venkata Srivani Reddy
48 19R15A0501 Bhulaxmi Kalpana
49 19R15A0502 Challa Divya Reddy
50 19R15A0503 Adla Likitha
51 19R15A0504 Gopaladas Vinayalatha
52 19R15A0505 Ganji Charan Kumar


26. Group-wise Students List: Section-D

Batches 1–3 (AdmnNo, StudentName):
18R11A05E5 AKSHITA YERRAM
18R11A05E6 ARYASOMAYAJULA VISHAL BHASKAR
18R11A05E7 BALANNAGARI DEEPAK REDDY
18R11A05E8 BATHRAJ HARINI
18R11A05E9 BHALLAMUDI LAKSHMI PRIYANKA
18R11A05F0 BODA AKHILA
18R11A05F1 BODAGAM DEEKSHITHA REDDY
18R11A05F2 BOGURAMPETA SUNIL REDDY
18R11A05F3 BORRA YASWANTH KUMAR
18R11A05F4 CHINTAMANENI MEGHANA
18R11A05F5 DINDU SANDEEP
18R11A05F6 DINTYALA NAVYA SREE
18R11A05F7 DONDAPATI MITHUN
18R11A05F8 DONKENA THARUN KUMAR
18R11A05F9 G BHUMIKA
18R11A05G0 GAJJALA TEJANARAYANA GOUD
18R11A05G1 GARUGULA VIDYA SAGAR
18R11A05G2 GATTU BHARGAVI
18R11A05G3 GOWLUGARI ALEKHYA REDDY
18R11A05G4 INJEY DIVYA
18R11A05G5 JYOTI GOUDA
18R11A05G7 KOMMERA VAMSHI KRISHNA REDDY
18R11A05G8 KONAKANCHI MAHALAKSHMI
18R11A05G9 KORUKOPPULA SAI KRISHNA
18R11A05H0 KOTTAM CHANDRA SHEKAR
18R11A05H1 MADHAVI YADAV

Batches 4–6 (AdmnNo, StudentName):
18R11A05H2 NEELA PAVAN
18R11A05H3 NEELAPALA TEJA SHREE
18R11A05H4 NEELAYAVALASA MEGHNA PATNAIK
18R11A05H5 NEMANA PRANAMYA
18R11A05H6 PAPAIAHGARI SAI PRIYA
18R11A05H7 PENUMARTHI KRISHNA BHARADWAJ
18R11A05H8 SAI NEHA MANDA
18R11A05H9 SAI PRAVALIKA PERIKA
18R11A05J0 SALLA ANUSHA
18R11A05J1 SANDU JAI VENKATESH
18R11A05J2 SANKU RAJSHREE RAO
18R11A05J3 SEELAM SANJANA
18R11A05J4 SOMI SETTY SAI NEELESH
18R11A05J5 TADEPALLI SAI NANDINI
18R11A05J6 THARA REKHA KAKARAPARTHI
18R11A05J7 TUMMALA VARSHITH
18R11A05J8 V SATYA NAGA SAI SRILEKHA
18R11A05J9 VADDE NITHISH
18R11A05K0 VARIKUTI LAKSHMI TEJA
18R11A05K1 VIPRAGHNA VISHWANATH SRIKAKULAPU
18R11A05K2 YALALA SHALINI
19R15A0516 KOLANUCHELIMI SAI CHARAN
19R15A0517 CH NIKHIL
19R15A0518 KANDI PAVAN
19R15A0519 CHITYALA SIRISHA
19R15A0520 VAGALDAS ARAVIND
