VANDANA KATE
• Ever since the technical revolution, we’ve been generating an immeasurable amount of data. As
per research, we generate around 2.5 quintillion bytes of data every single day! It is estimated
that by 2020, 1.7MB of data will be created every second for every person on earth.
• With the availability of so much data, it is finally possible to build predictive models that can
study and analyze complex data to find useful insights and deliver more accurate results.
• Top Tier companies such as Netflix and Amazon build such Machine Learning models by
using tons of data in order to identify profitable opportunities and avoid unwanted risks.
• Increase in Data Generation: Due to excessive production of data, we need a method that can
be used to structure, analyze and draw useful insights from data. This is where Machine
Learning comes in. It uses data to solve problems and find solutions to the most complex tasks
faced by organizations.
• Improve Decision Making: By making use of various algorithms, Machine Learning can be
used to make better business decisions. For example, Machine Learning is used to forecast
sales, predict downfalls in the stock market, identify risks and anomalies, etc.
DEEP LEARNING VIVA PREPARATION: BY PROF. VANDANA KATE
• Uncover patterns & trends in data: Finding hidden patterns and extracting key insights from
data is the most essential part of Machine Learning. By building predictive models and using
statistical techniques, Machine Learning allows you to dig beneath the surface and explore the
data at a minute scale. Understanding data and extracting patterns manually will take days,
whereas Machine Learning algorithms can perform such computations in less than a second.
• Solve complex problems: From detecting the genes linked to the deadly ALS disease to
building self-driving cars, Machine Learning can be used to solve the most complex problems.
To give you a better understanding of how important Machine Learning is, let’s list down a couple of
Machine Learning Applications:
• Netflix’s Recommendation Engine: The core of Netflix is its famous recommendation engine.
Over 75% of what you watch is recommended by Netflix and these recommendations are made
by implementing Machine Learning.
• Facebook’s Auto-tagging feature: The logic behind Facebook’s DeepFace face verification
system is Machine Learning and Neural Networks. DeepFace studies the facial features in an
image to tag your friends and family.
Automatic friend-tagging suggestions on Facebook or any other social media platform work the same way.
Facebook uses face detection and image recognition to find the face of the
person that matches its database and then suggests tagging that person based on
DeepFace.
Facebook’s Deep Learning project DeepFace is responsible for recognizing faces and
identifying which person is in the picture. Facebook also provides alt tags (alternative text) for
images already uploaded to Facebook. For example, if we inspect an image on Facebook,
the alt tag contains a description.
• Amazon’s Alexa: The well-known Alexa, which is based on Natural Language Processing and
Machine Learning, is an advanced Virtual Assistant that does more than just play songs on
your playlist. It can book you an Uber, connect with the other IoT devices at home, track your
health, etc.
• Speech Recognition
• Speech to Text Conversion
• Natural Language Processing
• Text to Speech Conversion
• Google’s Spam Filter: Gmail makes use of Machine Learning to filter out spam messages. It
uses Machine Learning algorithms and Natural Language Processing to analyze emails in real-
time and classify them as either spam or non-spam.
• Traffic Alert:
Google Maps may tell you, “Despite the heavy traffic, you are on the fastest route.” But how does it know that?
It combines historic data for that route, collected over time, with a few techniques acquired from other companies.
Everyone using Maps provides their location, average speed, and the route on which they are
traveling, which in turn helps Google collect massive amounts of traffic data, predict
the upcoming traffic, and adjust your route accordingly.
• Products Recommendations
Suppose you check an item on Amazon, but you do not buy it then and there. But the next day, you’re
watching videos on YouTube and suddenly you see an ad for the same item. You switch to Facebook,
there also you see the same ad. So how does this happen?
Well, this happens because Google tracks your search history, and recommends ads based on your
search history. This is one of the coolest applications of Machine Learning. In fact, 35% of
Amazon’s revenue is generated by Product Recommendations.
NVIDIA stated that they didn’t train their model to detect people or any object as such. The model
works on Deep Learning and it crowdsources data from all of its vehicles and its drivers. It uses
internal and external sensors which are a part of IoT. According to data gathered by McKinsey,
automotive data will hold a tremendous value of $750 billion.
The term Machine Learning was first coined by Arthur Samuel in the year 1959. Looking back, that
year was probably the most significant in terms of technological advancements.
If you browse through the net about ‘what is Machine Learning’, you’ll get at least 100 different
definitions. However, the very first formal definition was given by Tom M. Mitchell:
“A computer program is said to learn from experience E with respect to some class of tasks T and
performance measure P if its performance at tasks in T, as measured by P, improves with experience
E.”
In simple terms, Machine Learning is a subset of Artificial Intelligence (AI) which provides machines
the ability to learn automatically and improve from experience without being explicitly programmed to
do so. In a sense, it is the practice of getting machines to solve problems by gaining the ability to
think.
Machine Learning is the most popular technique of predicting the future or classifying information to
help people in making necessary decisions. Machine Learning algorithms are trained over instances
or examples through which they learn from past experiences and also analyze the historical data.
Therefore, as it trains over the examples, again and again, it is able to identify patterns in order to
make predictions about the future.
Model: A model is the main component of Machine Learning. A model is trained by using a
Machine Learning Algorithm. An algorithm maps all the decisions that a model is supposed to take
based on the given input, in order to get the correct output.
Predictor Variable: It is a feature(s) of the data that can be used to predict the output.
Response Variable: It is the feature or the output variable that needs to be predicted by using the
predictor variable(s).
Training Data: The Machine Learning model is built using the training data. The training data helps
the model to identify key trends and patterns essential to predict the output.
Testing Data: After the model is trained, it must be tested to evaluate how accurately it can predict
an outcome. This is done by the testing data set.
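The split into training data and testing data described above can be sketched in a few lines. This is only an illustration: the helper name `train_test_split`, the 80/20 ratio, and the made-up humidity rows are assumptions, not part of the original notes.

```python
import random

def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle the rows and return (training_rows, testing_rows)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)      # fixed seed keeps the split repeatable
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical data set: (humidity, rained) pairs, invented for illustration.
data = [(h, h > 70) for h in range(40, 100, 3)]
train, test = train_test_split(data)
print(len(train), "training rows,", len(test), "testing rows")
```

The model only ever sees `train` while it is being built; `test` is held back so the accuracy estimate is not flattered by rows the model has already seen.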
To sum it up, take a look at the above figure. A Machine Learning process begins by feeding the
machine lots of data, by using this data the machine is trained to detect hidden insights and trends.
These insights are then used to build a Machine Learning Model by using an algorithm in order to
solve a problem.
The next topic is the Machine Learning process.
The problem is to predict the occurrence of rain in your local area by using Machine Learning.
At this step, we must understand what exactly needs to be predicted. In our case, the objective is to
predict the possibility of rain by studying weather conditions. At this stage, it is also essential to take
mental notes on what kind of data can be used to solve this problem or the type of approach you must
follow to get to the solution.
Once you know the types of data that are required, you must understand how you can obtain this data.
Data collection can be done manually or by web scraping. However, if you’re a beginner and you’re
just looking to learn Machine Learning, you don’t have to worry about getting the data. There are
thousands of data resources on the web; you can just download a data set and get going.
Coming back to the problem at hand, the data needed for weather forecasting includes measures such
as humidity level, temperature, pressure, locality, whether or not you live in a hill station, etc. Such
data must be collected and stored for analysis.
The data you collect is almost never in the right format. You will encounter a lot of inconsistencies
in the data set, such as missing values, redundant variables, duplicate values, etc. Removing such
inconsistencies is essential because they might lead to wrong computations and predictions.
Therefore, at this stage, you scan the data set for any inconsistencies and fix them then and there.
Grab your detective glasses because this stage is all about diving deep into data and finding all the
hidden data mysteries. EDA or Exploratory Data Analysis is the brainstorming stage of Machine
Learning. Data Exploration involves understanding the patterns and trends in the data. At this stage,
all the useful insights are drawn and correlations between the variables are understood.
For example, in the case of predicting rainfall, we know that there is a strong possibility of rain if the
temperature has fallen low. Such correlations must be understood and mapped at this stage.
All the insights and patterns derived during Data Exploration are used to build the Machine Learning
Model. This stage always begins by splitting the data set into two parts, training data, and testing
data. The training data will be used to build and analyze the model. The logic of the model is based
on the Machine Learning Algorithm that is being implemented.
In the case of predicting rainfall, since the output will be in the form of True (if it will rain tomorrow)
or False (no rain tomorrow), we can use a Classification Algorithm such as Logistic Regression.
Choosing the right algorithm depends on the type of problem you’re trying to solve, the data set and
the level of complexity of the problem. In the upcoming sections, we will discuss the different types
of problems that can be solved by using Machine Learning.
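As a rough sketch of the classification step, the code below trains a one-feature Logistic Regression with plain gradient descent. The humidity values, the labelling rule, the learning rate, and the function name `train_logistic_regression` are all invented for illustration; a real project would more likely use an existing library.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic_regression(xs, ys, learning_rate=1.0, epochs=5000):
    """Fit weight w and bias b by batch gradient descent on the log loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, y in zip(xs, ys):
            error = sigmoid(w * x + b) - y     # predicted probability minus label
            grad_w += error * x
            grad_b += error
        w -= learning_rate * grad_w / n
        b -= learning_rate * grad_b / n
    return w, b

# Made-up data: humidity scaled to [0, 1]; label 1 means "rain tomorrow".
humidity = [i / 20 for i in range(21)]
rained = [1 if h > 0.5 else 0 for h in humidity]

w, b = train_logistic_regression(humidity, rained)
predictions = [sigmoid(w * h + b) >= 0.5 for h in humidity]
accuracy = sum(p == bool(y) for p, y in zip(predictions, rained)) / len(rained)
print("accuracy on the training data:", round(accuracy, 2))
```

The output of `sigmoid` is a probability, and thresholding it at 0.5 turns it into the True/False answer the rainfall problem asks for.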
After building a model by using the training data set, it is finally time to put the model to a test. The
testing data set is used to check the efficiency of the model and how accurately it can predict the
outcome. Once the accuracy is calculated, any further improvements in the model can be
implemented at this stage. Methods like parameter tuning and cross-validation can be used to
improve the performance of the model.
Step 7: Predictions
Once the model is evaluated and improved, it is finally used to make predictions. The final output can
be a Categorical variable (eg. True or False) or it can be a Continuous Quantity (eg. the predicted
value of a stock).
In our case, for predicting the occurrence of rainfall, the output will be a categorical variable.
So that was the entire Machine Learning process. Now it’s time to learn about the different ways in
which Machines can learn.
Deep Learning involves taking large volumes of structured or unstructured data and using complex
algorithms to train neural networks. It performs complex operations to extract hidden patterns and
features (for instance, distinguishing the image of a cat from that of a dog).
Neural Networks replicate the way humans learn, inspired by how the neurons in our brains fire, only
much simpler.
1. An input layer
2. A hidden layer (this is the most important layer where feature extraction takes place, and
adjustments are made to train faster and function better)
3. An output layer
Each layer contains neurons called “nodes,” performing various operations. Neural Networks are
used in deep learning algorithms like CNN, RNN, GAN, etc.
As in other Neural Networks, MLPs have an input layer, one or more hidden layers, and an output layer. An MLP has the
same structure as a single-layer perceptron with one or more hidden layers added. A single-layer perceptron
can classify only linearly separable classes with binary output (0, 1), but an MLP can classify nonlinear
classes.
Except for the input layer, each node in the other layers uses a nonlinear activation function. This
means that each node multiplies the incoming data by its weights, adds the results together, and passes
the sum through the activation function to produce its output. MLP uses a supervised learning method called
“backpropagation.” In backpropagation, the neural network calculates the error with the help of a cost
function. It propagates this error backward from the output layer and adjusts the weights to train the model
more accurately.
The process of standardizing and reformatting data is called “Data Normalization.” It’s a pre-processing
step that puts values on a common scale. Often, data comes in, and you get the same kind of information in
different formats or ranges. In these cases, you should rescale values to fit into a particular range, which helps
achieve better convergence.
One of the most basic Deep Learning models is a Boltzmann Machine, resembling a simplified
version of the Multi-Layer Perceptron. This model features a visible input layer and a hidden layer:
just a two-layer neural net that makes stochastic decisions as to whether a neuron should be on or off.
In the restricted form (the Restricted Boltzmann Machine), nodes are connected across layers, but no two nodes of the same layer are connected.
At the most basic level, an activation function decides whether a neuron should be fired or not. It
takes the weighted sum of the inputs plus the bias as its input. Step function,
Sigmoid, ReLU, Tanh, and Softmax are examples of activation functions.
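A minimal sketch of a single neuron with interchangeable activation functions follows. The input values, weights, and bias are arbitrary numbers picked for illustration. Softmax is left out here because it acts on a whole vector of values rather than on one weighted sum.

```python
import math

def step(z):
    return 1.0 if z >= 0 else 0.0

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def relu(z):
    return max(0.0, z)

def tanh(z):
    return math.tanh(z)

def neuron(inputs, weights, bias, activation):
    """Weighted sum of the inputs plus the bias, passed through the activation."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(z)

# Arbitrary example values: z = 0.5*1.0 + (-1.0)*2.0 + 0.25 = -1.25
for name, fn in [("step", step), ("sigmoid", sigmoid), ("relu", relu), ("tanh", tanh)]:
    print(name, neuron([1.0, 2.0], [0.5, -1.0], 0.25, fn))
```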
Also referred to as “loss” or “error,” the cost function is a measure of how good your model’s
performance is. It’s used to compute the error of the output layer during backpropagation. We push
that error backward through the neural network and use it to update the weights during training.
Gradient Descent is an optimization algorithm used to minimize the cost function, that is, to minimize an error. The
aim is to find a local or global minimum of a function. The gradient determines the direction the model should
take to reduce the error.
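To make the idea concrete, here is a small sketch of gradient descent on a one-variable function. The function f(x) = (x - 3)^2, the starting point, and the learning rate are assumptions chosen so that the minimum (x = 3) is known in advance.

```python
def gradient_descent(grad, start, learning_rate=0.1, steps=100):
    """Repeatedly step against the gradient, starting from `start`."""
    x = start
    for _ in range(steps):
        x -= learning_rate * grad(x)   # move in the direction that reduces the error
    return x

# f(x) = (x - 3)^2 has gradient 2 * (x - 3) and its minimum at x = 3.
minimum = gradient_descent(lambda x: 2 * (x - 3), start=0.0)
print(round(minimum, 4))
```

In a neural network the same update is applied to every weight, with the gradient of the cost function supplied by backpropagation.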
10. What Is the Difference Between a Feedforward Neural Network and Recurrent Neural Network?
In a Feedforward Neural Network, signals travel in one direction, from input to output. There are no
feedback loops; the network considers only the current input. It cannot memorize previous inputs.
A CNN is one example.
A Recurrent Neural Network’s signals travel in both directions, creating a looped network. It
considers the current input with the previously received inputs for generating the output of a layer
and can memorize past data due to its internal memory.
The RNN can be used for sentiment analysis, text mining, and image captioning. Recurrent Neural
Networks can also address time series problems such as predicting the prices of stocks in a month or
quarter.
Softmax is an activation function that generates outputs between zero and one. It exponentiates each
output and divides it by the sum of all the exponentiated outputs, such that the total sum of the outputs is equal to one. Softmax is often used for output layers.
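A short sketch of softmax is shown below. Subtracting the largest score before exponentiating is a common numerical-stability trick added here as an assumption; it does not change the result. The example scores are arbitrary.

```python
import math

def softmax(scores):
    """Turn a list of raw scores into probabilities that sum to one."""
    shift = max(scores)                          # stability trick; result is unchanged
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print([round(p, 3) for p in probs], "sum =", round(sum(probs), 6))
```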
ReLU (or Rectified Linear Unit) is the most widely used activation function. It gives an output of X if
X is positive and zero otherwise. ReLU is often used for hidden layers.
With neural networks, you’re usually working with hyperparameters once the data is formatted
correctly. A hyperparameter is a parameter whose value is set before the learning process begins. It
determines how a network is trained and the structure of the network (such as the number of hidden
units, the learning rate, epochs, etc.).
14. What Will Happen If the Learning Rate Is Set Too Low or Too High?
When your learning rate is too low, training of the model will progress very slowly as we are making
minimal updates to the weights. It will take many updates before reaching the minimum point.
If the learning rate is set too high, drastic updates to the weights cause undesirable divergent behavior in the loss
function. The model may fail to converge (it never settles on a good output) or even diverge
(the loss grows and the updates become too chaotic for the network to train).
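The effect can be seen on a toy problem. The sketch below runs gradient descent on f(x) = x^2 with three learning rates; the function, the 20-step budget, and the three rates are assumptions chosen purely to show slow progress, convergence, and divergence.

```python
def run(learning_rate, steps=20, start=1.0):
    """Gradient descent on f(x) = x^2 (gradient 2x); returns the final |x|."""
    x = start
    for _ in range(steps):
        x -= learning_rate * 2 * x
    return abs(x)

too_low = run(0.001)     # barely moves away from the starting point
reasonable = run(0.1)    # gets close to the minimum at 0
too_high = run(1.1)      # overshoots on every step and moves further away

print("too low:", too_low)
print("reasonable:", reasonable)
print("too high:", too_high)
```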
Dropout is a technique of randomly dropping out hidden and visible units of a network to prevent
overfitting (typically dropping 20 percent of the nodes). It roughly doubles the number of iterations
needed for the network to converge.
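A rough sketch of dropout on a list of activations is given below. It uses the "inverted dropout" variant, which rescales the surviving units during training so nothing needs to change at prediction time; that variant, the function name, and the seeded generator are assumptions made for this illustration.

```python
import random

def dropout(activations, rate=0.2, training=True, rng=random):
    """Inverted dropout: zero each unit with probability `rate`, rescale the rest."""
    if not training or rate == 0.0:
        return list(activations)          # no units are dropped at prediction time
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(0)                    # seeded so the example is repeatable
hidden = [1.0] * 10
dropped = dropout(hidden, rate=0.2, rng=rng)
print(dropped)
```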
Batch normalization is the technique to improve the performance and stability of neural networks by
normalizing the inputs in every layer so that they have mean output activation of zero and standard
deviation of one.
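The normalization step itself can be sketched as follows for one feature across a batch. The small constant `eps` and the example values are assumptions; the learnable scale and shift parameters that batch normalization layers normally apply afterwards are omitted to keep the sketch short.

```python
import math

def batch_normalize(values, eps=1e-5):
    """Shift and scale a batch so it has mean ~0 and standard deviation ~1."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(variance + eps) for v in values]

normalized = batch_normalize([2.0, 4.0, 6.0, 8.0])
new_mean = sum(normalized) / len(normalized)
new_std = math.sqrt(sum((v - new_mean) ** 2 for v in normalized) / len(normalized))
print(normalized, new_mean, new_std)
```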
16. What Is the Difference Between Batch Gradient Descent and Stochastic Gradient Descent?
Overfitting occurs when the model learns the details and noise in the training data to the degree that it
adversely impacts the performance of the model on new data. It is more likely to occur with
nonlinear models that have more flexibility when learning a target function. An example would be if
a model is looking at cars and trucks, but only recognizes trucks that have a specific box shape. It
might not be able to notice a flatbed truck because there's only a particular kind of truck it saw in
training. The model performs well on training data, but not in the real world.
Underfitting refers to a model that neither fits the training data well nor generalizes to new
information. This usually happens when there is too little data, or incorrect data, to train a model. An underfit
model has poor performance and accuracy on both training data and new data.
To combat overfitting and underfitting, you can resample the data to estimate the model accuracy (k-fold
cross-validation) and keep a validation dataset to evaluate the model.
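As an illustration of k-fold cross-validation, the sketch below only produces the index splits; training and scoring a model on each split is left out. The helper name `k_fold_indices` and the choice of 10 samples and 5 folds are assumptions.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size

folds = list(k_fold_indices(10, 5))
for train_idx, val_idx in folds:
    print("validate on", val_idx, "train on", train_idx)
```

Every sample is used for validation exactly once, and the model accuracy is usually reported as the average over the k folds.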
There are two methods here: we can either initialize the weights to zero or assign them randomly.
Initializing all weights to 0: This makes your model similar to a linear model. All the neurons in
every layer perform the same operation, giving the same output and making the deep net useless.
Initializing all weights randomly: Here, the weights are assigned randomly by initializing them very
close to 0. It gives better accuracy to the model since every neuron performs different computations.
This is the most commonly used method.
1. Convolutional Layer - the layer that performs a convolutional operation, creating several smaller
picture windows to go over the data.
2. ReLU Layer - it brings non-linearity to the network and converts all the negative pixels to zero.
The output is a rectified feature map.
3. Pooling Layer - pooling is a down-sampling operation that reduces the dimensionality of the
feature map.
4. Fully Connected Layer - this layer recognizes and classifies the objects in the image.
Pooling is used to reduce the spatial dimensions of a CNN. It performs down-sampling operations to
reduce the dimensionality and creates a pooled feature map by sliding a filter matrix over the input
matrix.
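A small sketch of one common form of pooling, 2x2 max pooling with stride 2, is given below. The choice of max pooling, the window size, and the 4x4 example feature map are assumptions for illustration.

```python
def max_pool_2x2(feature_map):
    """2x2 max pooling with stride 2 over a list-of-lists feature map."""
    rows, cols = len(feature_map), len(feature_map[0])
    pooled = []
    for r in range(0, rows - 1, 2):
        pooled_row = []
        for c in range(0, cols - 1, 2):
            window = [
                feature_map[r][c], feature_map[r][c + 1],
                feature_map[r + 1][c], feature_map[r + 1][c + 1],
            ]
            pooled_row.append(max(window))   # keep the strongest response in the window
        pooled.append(pooled_row)
    return pooled

feature_map = [
    [1, 3, 2, 4],
    [5, 6, 1, 2],
    [7, 2, 9, 0],
    [1, 4, 3, 8],
]
pooled = max_pool_2x2(feature_map)
print(pooled)  # [[6, 4], [7, 9]]
```

The 4x4 input becomes a 2x2 pooled feature map, which is the reduction in spatial dimensions the paragraph describes.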
While training an RNN, your slope can become either too small or too large; this makes the training
difficult. When the slope is too small, the problem is known as a “Vanishing Gradient.” When the
slope tends to grow exponentially instead of decaying, it’s referred to as an “Exploding Gradient.”
Gradient problems lead to long training times, poor performance, and low accuracy.
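A simplified way to see both problems: during backpropagation through time, the gradient is multiplied by a similar factor at every time step. The sketch below treats that factor as a single constant, which is an assumption made to keep the arithmetic visible; in a real RNN it depends on the weights and activations.

```python
def backpropagated_gradient(per_step_factor, time_steps):
    """Multiply a starting gradient of 1.0 by the same factor at every time step."""
    gradient = 1.0
    for _ in range(time_steps):
        gradient *= per_step_factor
    return gradient

vanishing = backpropagated_gradient(0.5, 50)   # factor below 1 shrinks toward zero
exploding = backpropagated_gradient(1.5, 50)   # factor above 1 grows exponentially

print("vanishing:", vanishing)
print("exploding:", exploding)
```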
23. What Is the Difference Between Epoch, Batch, and Iteration in Deep Learning?
• Epoch - Represents one iteration over the entire dataset (everything put into the training model).
• Batch - Refers to when we cannot pass the entire dataset into the neural network at once, so we
divide the dataset into several batches.
• Iteration - If we have 10,000 images as data and a batch size of 200, then an epoch should run 50
iterations (10,000 divided by 200).
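The relationship between dataset size, batch size, and iterations per epoch is simple arithmetic. The helper name below is hypothetical, and rounding up for a final partial batch is an assumption about how leftover samples are handled.

```python
import math

def iterations_per_epoch(dataset_size, batch_size):
    """Number of batches needed to pass over the whole dataset once."""
    return math.ceil(dataset_size / batch_size)

# The example from the text: 10,000 images with a batch size of 200.
print(iterations_per_epoch(10_000, 200))  # 50
```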
TensorFlow provides both C++ and Python APIs, making it easier to work with, and is often described as having a faster
compilation time than other Deep Learning libraries such as Keras and Torch. TensorFlow
supports both CPU and GPU computing devices.
A tensor is a mathematical object represented as arrays of higher dimensions. These arrays of data
with different dimensions and ranks fed as input to the neural network are called “Tensors.”
Suppose there is a wine shop purchasing wine from dealers, which they resell later. But some dealers
sell fake wine. In this case, the shop owner should be able to distinguish between fake and authentic
wine.
The forger will try different techniques to sell fake wine and make sure specific techniques go past
the shop owner’s check. The shop owner would probably get some feedback from wine experts that
some of the wine is not original. The owner would have to improve how he determines whether a
wine is fake or authentic.
The forger’s goal is to create wines that are indistinguishable from the authentic ones while the shop
owner intends to tell if the wine is real or not accurately.
Let us understand this example with the help of an image shown above.
There is a noise vector coming into the forger who is generating fake wine.
The Discriminator gets two inputs; one is the fake wine, while the other is the real authentic wine.
The shop owner has to figure out whether it is real or fake.
So, there are two primary components of Generative Adversarial Network (GAN) named:
1. Generator
2. Discriminator
The generator is a CNN that keeps producing images that are closer and closer in appearance to the real
images, while the discriminator tries to determine the difference between real and fake images. The
ultimate aim is to make the discriminator learn to identify real and fake images.
This Neural Network has three layers, in which the number of input neurons is equal to the number of output neurons. The
network's target output is the same as the input. It uses dimensionality reduction to restructure the
input. It works by compressing the image input to a latent space representation and then reconstructing
the output from this representation.
Bagging and Boosting are ensemble techniques that train multiple models using the same learning
algorithm and then combine their predictions.
With Bagging, we take a dataset and split it into training data and test data. Then we randomly select
training data to place into the bags and train a separate model on each bag.
With Boosting, the emphasis is on selecting data points which give wrong output to improve the
accuracy.
The three most important types of neural networks are Artificial Neural Networks (ANN),
Convolutional Neural Networks (CNN), and Recurrent Neural Networks (RNN).
Neural Networks are artificial networks used in Machine Learning that work in a similar fashion to
the human nervous system. Many things are connected in various ways for a neural network to mimic
and work like the human brain. Neural networks are basically used in computational models.
A deep neural network (DNN) is an artificial neural network (ANN) with multiple layers between the
input and output layers. They can model complex non-linear relationships. Convolutional Neural
Networks (CNN) are an alternative type of DNN that allow modelling both time and space
correlations in multivariate signals.
A CNN is a specific kind of ANN that has one or more layers of convolutional units. The class
of ANN covers several architectures, including Convolutional Neural Networks (CNN), Recurrent
Neural Networks (RNN) such as LSTM and GRU, Autoencoders, and Deep Belief Networks.
A Multilayer Perceptron (MLP) works well for MNIST, as it is a simpler and more straightforward dataset,
but it lags behind CNNs in real-world computer vision applications, specifically image
classification.
Long-Short-Term Memory (LSTM) is a special kind of recurrent neural network capable of learning
long-term dependencies, remembering information for long periods as its default behavior. There are
three steps in an LSTM network:
• Image tagger
• Sentiment Analysis
• Translation
A Recurrent Neural Network is designed to save the output of a layer and feed it back to the input to
help in predicting the outcome of the layer. The first layer is typically a feed forward neural
network, followed by a recurrent neural network layer where some information it had in the
previous time-step is remembered by a memory function. Forward propagation is
implemented in this case. It stores information required for its future use. If the prediction is
wrong, the learning rate is employed to make small changes to the weights during backpropagation,
so the network gradually moves towards making the right prediction.
1. The ability to model sequential data, where each sample can be assumed to be dependent on historical ones,
is one of the advantages.
2. RNNs can be used with convolution layers to extend the effective pixel neighborhood.
LSTM networks are a type of RNN that uses special units in addition to standard units. LSTM units
include a ‘memory cell’ that can maintain information in memory for long periods of time. A set of
gates is used to control when information enters the memory, when it’s output, and when it’s
forgotten. There are three types of gates: the input gate, the output gate, and the forget gate. The input gate
decides how much information from the last sample will be kept in memory; the output gate regulates
the amount of data passed to the next layer; and the forget gate controls the rate at which stored memory
is discarded. This architecture lets them learn longer-term dependencies.
A sequence to sequence model consists of two Recurrent Neural Networks. Here, there exists an
encoder that processes the input and a decoder that processes the output. The encoder and decoder
work together, either using the same parameters or different ones. This model, in contrast to
a plain RNN, is particularly applicable in those cases where the length of the input data differs from
the length of the output data. While they share the benefits and limitations of the RNN, these
models are applied mainly in chatbots, machine translation, and question answering systems.
• Image processing
• Computer Vision
• Speech Recognition
• Machine translation
In a CNN, propagation is uni-directional: the network contains one or more convolutional layers followed by
pooling, and the output of the convolution layers goes to a fully connected neural
network for classifying the images, as shown in the above diagram. Filters are used to extract certain
parts of the image. In an MLP, the inputs are multiplied with weights and fed to the activation function.
Convolution layers use ReLU, and an MLP uses a nonlinear activation function followed by softmax.
Convolutional neural networks show very effective results in image and video recognition, semantic
parsing, and paraphrase detection.
A Multi-Layer Perceptron (MLP) is one of the most basic neural networks that we use for
classification. For a binary classification problem, we know that the output can be either 0 or 1. This
is just like our simple logistic regression, where we use a logistic (sigmoid) function to generate a probability
between 0 and 1.
Simply put, it is just the difference in the threshold function! When we restrict the logistic regression
model to give us either exactly 1 or exactly 0, we get a Perceptron model:
42. Can we have the same bias for all neurons of a hidden layer?
Essentially, you can have a different bias value at each layer or at each neuron as well. However, it is
best if we have a bias matrix for all the neurons in the hidden layers as well.
A point to note is that both these strategies would give you very different results.
The main aim of this question is to understand why we need activation functions in a neural network.
You can start off by giving a simple explanation of how neural networks are built:
Step 1: Calculate the sum of all the inputs (X) according to their weights and include the bias term:
Z = (weights * X) + bias
Step 2: Apply an activation function to Z to calculate the output:
Y = Activation(Z)
Steps 1 and 2 are performed at each layer. If you recollect, this is nothing but forward propagation!
Now, what if there is no activation function?
Y = Z = (weights * X) + bias
Wait – isn’t this just a simple linear equation? Yes – and that is why we need activation functions. A
linear equation will not be able to capture the complex patterns in the data – this is even more evident
in the case of deep learning problems.
In order to capture non-linear relationships, we use activation functions, and that is why a neural
network without an activation function is just a linear regression model.
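The point can be checked numerically. In the sketch below, two stacked linear layers with no activation give exactly the same outputs as one linear equation, while inserting ReLU between them does not. The weights and biases are arbitrary values chosen for illustration.

```python
def linear_layer(weight, bias):
    """Return a function computing Z = weight * x + bias."""
    return lambda x: weight * x + bias

def relu(z):
    return max(0.0, z)

layer1 = linear_layer(2.0, 1.0)
layer2 = linear_layer(-3.0, 0.5)

def stacked_linear(x):
    return layer2(layer1(x))             # two layers, no activation

def single_linear(x):
    return -6.0 * x - 2.5                # -3 * (2x + 1) + 0.5 written as one line

def stacked_with_relu(x):
    return layer2(relu(layer1(x)))       # same layers with ReLU in between

for x in [-2.0, -1.0, 0.0, 1.0, 2.0]:
    print(x, stacked_linear(x), single_linear(x), stacked_with_relu(x))
```

For negative inputs such as x = -2.0 the ReLU version departs from the single straight line, which is the non-linearity the activation function contributes.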
44. In a neural network, what if all the weights are initialized with the same value?
In simplest terms, if all the neurons have the same value of weights, each hidden unit will get exactly
the same signal. While this might work during forward propagation, the derivative of the cost
function during backward propagation would be the same every time.
In short, there is no learning happening by the network! What do you call the phenomenon of the
model being unable to learn any patterns from the data? Yes, underfitting.
Therefore, if all weights have the same initial value, this would lead to underfitting.
Note: This question might further lead to questions on exploding and vanishing gradients, which I
have covered below.
Now, this can be one tricky question. There might be a misconception that deep learning can only
solve unsupervised learning problems. This is not the case. Some examples of supervised deep
learning include:
• Image classification
• Text classification
• Sequence tagging
On the other hand, there are some unsupervised deep learning techniques as well:
• Word embeddings (like Skip-gram and Continuous Bag of Words): Understanding Word
Embeddings: From Word2Vec to Count Vectors
• Autoencoders: Learn How to Enhance a Blurred Image using an Autoencoder!
This is a question best explained with a real-life example. Consider that you want to go out today to
play a cricket match with your friends. Now, a number of factors can affect your decision-making,
like:
And so on. These factors can change your decision greatly or not too much. For example, if it is
raining outside, then you cannot go out to play at all. Or if you have only one bat, you can share it
while playing as well. The magnitude by which these factors can affect the game is called the weight
of that factor.
Factors like the weather or temperature might have a higher weight, and other factors like equipment
would have a lower weight.
However, does this mean that we can play a cricket match with only one bat? No – we would need 1
ball and 6 wickets as well. This is where bias comes into the picture. Bias lets you assign some
threshold which helps you activate a decision-point (or a neuron) only when that threshold is crossed.
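The cricket example can be written as a tiny decision rule. This is only a sketch: the factor scores, the weights and the bias value are hypothetical numbers chosen for illustration.

```python
def decide_to_play(factors, weights, bias):
    """Return True only when the weighted evidence plus the bias is positive."""
    z = sum(w * f for w, f in zip(weights, factors)) + bias
    return z > 0


# Factors scored between 0 (bad) and 1 (good): [weather, equipment]
weights = [0.8, 0.2]  # hypothetical: weather matters more than equipment
bias = -0.5           # hypothetical threshold: the evidence must exceed 0.5

print(decide_to_play([1.0, 0.5], weights, bias))  # good weather, only one bat
print(decide_to_play([0.0, 1.0], weights, bias))  # raining, full kit
```

A more negative bias raises the threshold, so the "neuron" fires less easily even when the weights are unchanged.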
47. How do forward propagation and backpropagation work in deep learning?
Now, this can be answered in two ways. If you are on a phone interview, you cannot perform all the
calculus in writing and show the interviewer. In such cases, it is best to explain it as such:
• Forward propagation: The inputs, along with their weights, are provided to the hidden layer. At
each hidden layer, we calculate the output of the activation at each node and this further
propagates to the next layer till the final output layer is reached. Since we start from the inputs
and move to the final output layer, we move forward and it is called forward propagation.
• Backpropagation: We minimize the cost function by understanding how it changes as the weights
and biases in the neural network change. This change is obtained by calculating the gradient at
each hidden layer (using the chain rule). Since we start from the final cost function and go back
through each hidden layer, we move backward and thus it is called backward propagation.
For an in-person interview, it is best to take up the marker, create a simple neural network with 2
inputs, a hidden layer, and an output layer, and explain it.
Forward propagation:
Backpropagation:
You need not explain with respect to the bias term as well, though you might need to expand the
above equations substituting the actual derivatives.
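The whiteboard network described above (2 inputs, one hidden layer, one output) can also be sketched in code. Note the assumptions, none of which come from the original figures: 2 hidden units, sigmoid activations everywhere, a squared-error cost, no bias terms, and made-up weights.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, W1, w2):
    """x: 2 inputs, W1: 2x2 hidden weights, w2: 2 output weights (no biases)."""
    h = [sigmoid(W1[j][0] * x[0] + W1[j][1] * x[1]) for j in range(2)]
    y_hat = sigmoid(w2[0] * h[0] + w2[1] * h[1])
    return h, y_hat


def loss(x, y, W1, w2):
    _, y_hat = forward(x, W1, w2)
    return 0.5 * (y_hat - y) ** 2


def backward(x, y, W1, w2):
    """Return (dL/dW1, dL/dw2), obtained with the chain rule."""
    h, y_hat = forward(x, W1, w2)
    # Derivative of the cost w.r.t. the output unit's pre-activation.
    delta_out = (y_hat - y) * y_hat * (1.0 - y_hat)
    grad_w2 = [delta_out * h[j] for j in range(2)]
    # Push the error back through the output weights and the hidden sigmoids.
    delta_hidden = [delta_out * w2[j] * h[j] * (1.0 - h[j]) for j in range(2)]
    grad_W1 = [[delta_hidden[j] * x[i] for i in range(2)] for j in range(2)]
    return grad_W1, grad_w2


x, y = [0.5, -1.0], 1.0          # made-up training example
W1 = [[0.1, 0.4], [-0.3, 0.2]]   # made-up hidden weights
w2 = [0.7, -0.5]                 # made-up output weights

grad_W1, grad_w2 = backward(x, y, W1, w2)
print(grad_W1)
print(grad_w2)
```

A quick way to check such hand-derived gradients is to nudge one weight by a tiny amount and compare the change in `loss` with the corresponding gradient entry.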
Batch Normalization is one of the techniques used for reducing the training time of our deep learning
algorithm. Just like normalizing our input helps improve our logistic regression model, we can
normalize the activations of the hidden layers in our deep learning model as well:
We basically normalize a[1] and a[2] here. This means we normalize the inputs to the layer, and then
apply the activation functions to the normalized inputs.
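As a rough sketch of this idea, the snippet below normalizes the pre-activations of one hidden unit across a mini-batch and only then applies the activation. The `gamma` and `beta` parameters (the learnable scale and shift), the small `eps` constant, the choice of ReLU and the batch values are assumptions added for illustration.

```python
import math


def batch_norm(values, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one unit's pre-activations across a mini-batch."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    # Zero mean and unit variance, then an optional scale and shift.
    return [gamma * (v - mean) / math.sqrt(var + eps) + beta for v in values]


def relu(z):
    return max(0.0, z)


z_batch = [2.0, 4.0, 6.0, 8.0]       # made-up pre-activations for one hidden unit
z_norm = batch_norm(z_batch)
a_batch = [relu(z) for z in z_norm]  # activation applied after normalization

print(z_norm)
print(a_batch)
```

Whatever the original scale of `z_batch`, the normalized values end up centred on zero with a spread close to one.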
Here is an article that explains Batch Normalization and other techniques for improving Neural
Networks: Neural Networks – Hyperparameter Tuning, Regularization & Optimization.
49. List the activation functions you have used so far in your projects and how you would
choose one.
• Sigmoid
• Tanh
• ReLU
• Softmax
While it is not important to know all the activation functions, you can always score points by
knowing the range of these functions and how they are used. Here is a handy table for you to follow:
Here is a great guide on how to use these and other activations functions: Fundamentals of Deep
Learning – Activation Functions and When to Use Them?.
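If it helps to remember the ranges, the four functions listed above can be sketched in plain Python. The output ranges in the comments are standard properties of these functions; the max-subtraction trick in softmax is a common numerical-stability choice added here, not something from the original notes.

```python
import math


def sigmoid(z):   # output in (0, 1); often used for binary outputs
    return 1.0 / (1.0 + math.exp(-z))


def tanh(z):      # output in (-1, 1); zero-centred
    return math.tanh(z)


def relu(z):      # output in [0, inf); common choice for hidden layers
    return max(0.0, z)


def softmax(zs):  # outputs in (0, 1) that sum to 1; multi-class outputs
    shift = max(zs)  # subtracting the max avoids overflow in exp
    exps = [math.exp(z - shift) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]


print(sigmoid(0.0), tanh(0.0), relu(-3.0), relu(2.0))
print(softmax([1.0, 2.0, 3.0]))
```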
50. Why does a Convolutional Neural Network (CNN) work better with image data?
The key to this question lies in the Convolution operation. Unlike humans, the machine sees the
image as a matrix of pixel values. Instead of interpreting a shape like a petal or an ear, it just
identifies curves and edges.
Thus, instead of looking at the entire image, it helps to just read the image in parts. Doing this for a
300 x 300 pixel image would mean dividing the matrix into smaller 3 x 3 matrices and dealing with
them one by one. This is convolution.
Mathematically, we just perform a small operation on the matrix to help us detect features in the
image – like boundaries, colors, etc.
Z=X*f
Here, we are convolving (* operation – not multiplication) the input matrix X with another small
matrix f, called the kernel/filter to create a new matrix Z. This matrix is then passed on to the other
layers.
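Here is a minimal sketch of Z = X * f for a 3 X 3 input. The 2 X 2 kernel values are made up, the stride is 1 and there is no padding. As in most deep learning libraries, the kernel is not flipped, so strictly speaking this is a cross-correlation.

```python
def convolve2d_valid(X, f):
    """Slide kernel f over X with stride 1 and no padding."""
    n_rows, n_cols = len(X), len(X[0])
    f_rows, f_cols = len(f), len(f[0])
    Z = []
    for i in range(n_rows - f_rows + 1):
        row = []
        for j in range(n_cols - f_cols + 1):
            # Element-wise product of the kernel with the patch at (i, j), summed.
            row.append(sum(X[i + a][j + b] * f[a][b]
                           for a in range(f_rows) for b in range(f_cols)))
        Z.append(row)
    return Z


X = [[1, 2, 3],
     [4, 5, 6],
     [7, 8, 9]]
f = [[1, 0],
     [0, -1]]  # made-up kernel: top-left pixel minus bottom-right pixel

Z = convolve2d_valid(X, f)
print(Z)  # a 2 X 2 matrix
```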
If you have a board/screen in front of you, you can always illustrate this with a simple example:
The main component that differentiates Recurrent Neural Networks (RNN) from the other models is
the addition of a loop at each node. This loop brings the recurrence mechanism in RNNs. In a basic
Artificial Neural Network (ANN), each input is given the same weight and fed to the network at the
same time. So, for a sentence like “I saw the movie and hated it”, it would be difficult to capture the
information which associates “it” with the “movie”.
The addition of a loop is to denote preserving the previous node’s information for the next node, and
so on. This is why RNNs are much better for sequential data, and since text data also is sequential in
nature, they are an improvement over ANNs.
52. In a CNN, if the input size is 5 X 5 and the filter size is 7 X 7, then what would be the size of
the output?
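This is a pretty intuitive answer. As we saw above, we perform the convolution on ‘x’ one step at a
time, to the right, and in the end, we got Z with dimensions 2 X 2, for X with dimensions 3 X 3.
Here, however, the 7 X 7 filter is larger than the 5 X 5 input, so it cannot be slid over the input as it
is. Thus, to make the input size similar to the filter size, we make use of padding – adding 0s to the
input matrix such that its new size becomes at least 7 X 7. With an input of size n, a filter of size f
and a padding of p on every side, the output size is (n + 2p – f + 1) X (n + 2p – f + 1). For n = 5,
f = 7 and p = 1, this gives an output of 1 X 1.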
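The usual output-size formula for a square input is (n + 2p – f) / s + 1, where n is the input size, f the filter size, p the padding on each side and s the stride. The stride argument is an addition here; the question implicitly uses s = 1. A small helper, offered as a sketch, makes the arithmetic easy to check:

```python
def conv_output_size(n, f, p=0, s=1):
    """Output width for an n x n input, f x f filter, padding p and stride s."""
    if n + 2 * p < f:
        raise ValueError("filter is larger than the padded input")
    return (n + 2 * p - f) // s + 1


# The earlier 3 X 3 example, assuming a 2 X 2 filter.
print(conv_output_size(3, 2))
# The 5 X 5 input padded by 1 on every side (to 7 X 7) with a 7 X 7 filter.
print(conv_output_size(5, 7, p=1))
```

Calling `conv_output_size(5, 7)` without padding raises an error, which mirrors why padding is needed in this question.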
53. What’s the difference between valid and same padding in a CNN?
This question has more chances of being a follow-up question to the previous one. Or if you have
explained how you used CNNs in a computer vision task, the interviewer might ask this question
along with the details of the padding parameters.
• Valid Padding: When we do not use any padding. The resultant matrix after convolution will
have dimensions (n – f + 1) X (n – f + 1)
• Same padding: Adding padded elements all around the edges such that the output matrix will
have the same dimensions as that of the input matrix
The key here is to make the explanation as simple as possible. As we know, the gradient descent
algorithm tries to minimize the error by taking small steps towards the minimum value. These steps
are used to update the weights and biases in a neural network.
However, at times, the steps become too large and this results in larger updates to weights and bias
terms – so much so as to cause an overflow (or a NaN) value in the weights. This leads to an unstable
algorithm and is called an exploding gradient.
On the other hand, at times the steps are too small, and this leads to minimal changes in the weights
and bias terms – even negligible changes at times. We thus might end up training a deep learning
model with almost the same weights and biases each time and never reach the minimum of the error
function. This is called the vanishing gradient.
A point to note is that both these issues are especially evident in Recurrent Neural Networks – so be
prepared for follow-up questions on RNN!
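A toy calculation shows why depth (or many time steps in an RNN) makes this worse. The sketch assumes, purely for illustration, that backpropagation multiplies the gradient by the same per-layer factor at every one of 50 layers.

```python
def backpropagated_gradient(local_derivative, n_layers, start=1.0):
    """Multiply the same per-layer derivative n_layers times (chain rule)."""
    grad = start
    for _ in range(n_layers):
        grad *= local_derivative
    return grad


print(backpropagated_gradient(0.5, 50))  # shrinks towards zero: vanishing
print(backpropagated_gradient(1.5, 50))  # grows very large: exploding
```

A factor only slightly below or above 1 is enough to make the gradient negligible or huge once it is repeated across many layers.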
I am sure you would have a doubt as to why a relatively simple question was included in the
Intermediate Level. The reason is the sheer volume of subsequent questions it can generate!
The use of transfer learning has been one of the key milestones in deep learning. Training a large
model on a huge dataset, and then using the final parameters on smaller simpler datasets has led to
defining breakthroughs in the form of Pretrained Models. Be it Computer Vision or NLP, pretrained
models have become the norm in research and in the industry.
This loop essentially includes a time component in the network as well. This helps in capturing
sequential information from the data, which would not be possible in a generic artificial neural
network.
You can find a detailed explanation of RNNs here: Fundamentals of Deep Learning – Introduction to
Recurrent Neural Networks.
The LSTM model is considered a special case of RNNs. The problems of vanishing gradients and
exploding gradients we saw earlier are a disadvantage while using the plain RNN model.
In LSTMs, we add a forget gate, which basically works with a memory unit (the cell state) to retain
the information that is needed across timesteps and discard the information that is not needed. This
also requires input and output gates that take the result of the forget gate into account.
As you can see, the LSTM model can become quite complex. In order to retain information across
time without making the model too complex, we use GRUs.
Basically, in GRUs, instead of having an additional Forget gate, we combine the input and Forget
gates into a single Update Gate:
It is this reduction in the number of gates that makes GRU less complex and faster than LSTM. You
can learn about GRUs, LSTMs and other sequence models in detail here: Must-Read Tutorial to
Learn Sequence Modeling & Attention Models.
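To make the single update gate concrete, here is a sketch of one GRU time step for a single scalar unit. The weight names and values are made up. Note also that some references swap the roles of z and (1 - z) in the last line, so treat the convention below as an assumption.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def gru_step(x, h_prev, p):
    """One GRU time step for a single scalar unit; p holds the weights."""
    z = sigmoid(p["wz"] * x + p["uz"] * h_prev + p["bz"])  # update gate
    r = sigmoid(p["wr"] * x + p["ur"] * h_prev + p["br"])  # reset gate
    h_cand = math.tanh(p["wh"] * x + p["uh"] * (r * h_prev) + p["bh"])
    # One gate does both jobs: (1 - z) keeps old state, z admits new state.
    return (1.0 - z) * h_prev + z * h_cand


params = {"wz": 0.5, "uz": 0.1, "bz": 0.0,
          "wr": 0.4, "ur": 0.2, "br": 0.0,
          "wh": 0.9, "uh": 0.3, "bh": 0.0}  # made-up values

h = 0.0
for x in [1.0, -0.5, 0.25]:  # a short made-up input sequence
    h = gru_step(x, h, params)
    print(h)
```

When the update gate is close to 0, the unit simply copies its previous state forward, which is how information is carried across many time steps.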
Deep reinforcement learning is the combination of reinforcement learning (RL) and deep learning. It
combines artificial neural networks with a reinforcement learning architecture that enables
software-defined agents to learn the best actions possible in a virtual environment in order to attain
their goals. This field of research has been able to solve a wide range of complex decision-making
tasks that were previously out of reach for a machine. Thus, deep RL opens up many new
applications in domains such as healthcare, robotics, smart grids, finance, and many more.
An autoencoder is a neural network architecture capable of discovering structure within data in order
to develop a compressed representation of the input. Autoencoders learn how to compress the data
based on its attributes. They are an unsupervised learning technique in which we leverage neural
networks for the task of representation learning. Specifically, we design a neural network
architecture such that we impose a bottleneck in the network which forces a compressed knowledge
representation of the original input. If the input features were each independent of one another, this
compression and subsequent reconstruction would be a very difficult task. However, if some sort of
structure exists in the data (i.e. correlations between input features), this structure can be learned
and consequently leveraged when forcing the input through the network's bottleneck.
61 In Reinforcement Learning (RL), agents are trained on a reward and punishment mechanism.
The agent is rewarded for correct moves and punished for the wrong ones. In doing so, the agent tries
to minimize wrong moves and maximize the right ones.
In machine learning, ‘overfitting’ occurs when a statistical model describes random error or
noise instead of the underlying relationship. Overfitting is normally observed when a model is
excessively complex, because it has too many parameters with respect to the number of
training data points. A model that has been overfit exhibits poor performance on new data.
The possibility of overfitting exists because the criteria used for training the model are not
the same as the criteria used to judge the efficacy of the model.
A Bayesian Network is a graphical model that represents the probabilistic relationships
among a set of variables.
65 Types of RNN
The main reason that recurrent nets are more exciting is that they allow us to operate
over sequences of vectors: sequences in the input, the output, or in the most general case,
both. A few examples may make this more concrete:
Each rectangle in the above image represents a vector, and arrows represent functions.
Input vectors are red, output vectors are blue, and green vectors hold the RNN's state.
One-to-one:
This is also called a plain neural network. It maps a fixed-size input to a fixed-size
output, and the output is independent of any previous information/output.
One-to-Many:
It deals with a fixed size of information as input that gives a sequence of data as output.
Example: Image Captioning takes the image as input and outputs a sentence of words.
Many-to-One:
It takes a sequence of information as input and outputs a fixed size of the output.
Example: sentiment analysis where any sentence is classified as expressing the positive or
negative sentiment.
Many-to-Many:
It takes a sequence of information as input and outputs a sequence of data.
Example: Machine Translation, where the RNN reads any sentence in English and then
outputs the sentence in French.
Bidirectional Many-to-Many:
Synced sequence input and output. Notice that in every case there are no pre-specified
constraints on the lengths of the sequences, because the recurrent transformation (green) is
fixed and can be applied as many times as we like.
Example: Video classification where we wish to label every frame of the video.
Dropout is a cheap regularization technique used for reducing overfitting in neural networks. We
randomly drop out a set of nodes at each training step. As a result, we create a different
model for each training case, and all of these models share weights. It's a form of model
averaging.
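A minimal sketch of the idea follows. It uses the "inverted dropout" convention, where the surviving activations are scaled up by 1 / (1 - rate) during training; that scaling, the seed and the example activations are assumptions added for illustration.

```python
import random


def dropout(activations, rate, rng):
    """Inverted dropout: zero each unit with probability `rate` and
    scale the survivors by 1 / (1 - rate)."""
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
layer_output = [0.2, 1.5, 0.7, 0.9, 1.1, 0.4]  # made-up activations

# A different random subset of nodes is dropped at each training step.
for step in range(3):
    print(dropout(layer_output, rate=0.5, rng=rng))
```

At prediction time the function is simply not applied, so the full network (the "average" of all the thinned models) is used.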
A Boltzmann machine (also known as stochastic Hopfield network with hidden units) is a
type of recurrent neural network. In a Boltzmann machine, nodes make binary decisions
with some bias. Boltzmann machines can be strung together to create more sophisticated
systems such as deep belief networks. Boltzmann Machines can be used to optimize the
solution to a problem.
Where,
α - learning rate,
In machine learning, it is used to update the parameters of our model. Parameters represent
the coefficients in linear regression and weights in neural networks.
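The standard gradient descent update is θ = θ - α * ∇J(θ), with α the learning rate. A minimal sketch on a hypothetical one-parameter cost function (not from the original notes) looks like this:

```python
def gradient_descent(grad, theta, alpha, n_steps):
    """Repeat the update theta <- theta - alpha * grad(theta)."""
    for _ in range(n_steps):
        theta = theta - alpha * grad(theta)
    return theta


# Hypothetical cost J(theta) = (theta - 3)**2, whose gradient is 2 * (theta - 3).
theta = gradient_descent(lambda t: 2.0 * (t - 3.0), theta=0.0, alpha=0.1, n_steps=100)
print(theta)  # close to the minimum at theta = 3
```

With a much larger α the steps overshoot the minimum, and with a much smaller α the same number of steps does not get close to it.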
69. Explain the following variants of Gradient Descent: Stochastic, Batch, and Mini-batch.
• Convolution
This layer comprises a set of independent filters. All these filters are initialized
randomly. These filters then become our parameters which will be learned by the
network subsequently.
• ReLU
The ReLU layer is used with the convolutional layer.
• Pooling
It reduces the spatial size of the representation to lower the number of parameters and
computation in the network. This layer operates on each feature map independently.
• Full Connectedness
Neurons in a fully connected layer have full connections to all activations
in the previous layer, as seen in regular Neural Networks. Their activations can be
easily computed with a matrix multiplication followed by a bias offset.
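The pooling layer from the list above is easy to sketch. The snippet assumes max pooling with a 2 X 2 window and a stride of 2, which is a common choice but is not stated in the original notes; the feature map values are made up.

```python
def max_pool2d(feature_map, size=2, stride=2):
    """Max pooling over one feature map (a list of equal-length rows)."""
    n_rows, n_cols = len(feature_map), len(feature_map[0])
    pooled = []
    for i in range(0, n_rows - size + 1, stride):
        row = []
        for j in range(0, n_cols - size + 1, stride):
            # Keep only the largest value in each size x size window.
            row.append(max(feature_map[i + a][j + b]
                           for a in range(size) for b in range(size)))
        pooled.append(row)
    return pooled


fm = [[1, 3, 2, 4],
      [5, 6, 1, 2],
      [7, 2, 9, 0],
      [1, 8, 3, 4]]

print(max_pool2d(fm))  # the 4 X 4 map shrinks to 2 X 2
```

Each feature map is pooled on its own, so the number of maps stays the same while their spatial size drops.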
o Encoder
The encoder is used to compress the input into a latent space representation. It
encodes the input image as a compressed representation in a reduced dimension.
The compressed image is a distorted version of the original image.
o Code
The code layer is used to represent the compressed input which is fed to the decoder.
o Decoder
The decoder layer decodes the encoded image back to its original dimension. The
decoded image is a lossy reconstruction of the original image. It is automatically
reconstructed from the latent space representation.
No. | CNN | RNN
4 | It is suitable for spatial data like images. | RNN is used for temporal data, also called sequential data.
5 | The network takes fixed-size inputs and generates fixed-size outputs. | RNN can handle arbitrary input/output lengths.
6 | CNN is a type of feed-forward artificial neural network with variations of multilayer perceptrons designed to use minimal amounts of preprocessing. | RNN, unlike feed-forward neural networks, can use its internal memory to process arbitrary sequences of inputs.
7 | CNNs use connectivity patterns between the neurons. CNN is inspired by the organization of the animal visual cortex, whose individual neurons are arranged in such a way that they can respond to overlapping regions in the visual field. | Recurrent neural networks use time-series information: what a user spoke last would impact what he will speak next.
The following diagrams show the schematic representation of CNN and RNN.
73 Choice of optimizer
Momentum:
I guess almost all of us are familiar with the word ‘momentum’. As gradient descent is
comparable with finding a valley, momentum can be compared to a ball rolling downhill.
Momentum helps us to accelerate Gradient Descent (GD) when we have surfaces that curve
more steeply in one direction than in another direction. It also dampens the oscillation as
shown below. For updating the weights it takes the gradient of the current step as well as
the gradients of the previous time steps. Momentum speeds up gradient descent by
converging faster.
Thus, we observe that the weight parameters are updated using the gradients of the previous
steps as well as the current one.
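One common way to write the momentum update is v = γ * v + η * gradient followed by θ = θ - v, where γ is the momentum term and η the learning rate. The sketch below uses that form on a hypothetical cost J(θ) = θ²; the values 0.9 and 0.1 are illustrative defaults, not values from the original notes.

```python
def momentum_step(theta, velocity, grad, lr=0.1, gamma=0.9):
    """v <- gamma * v + lr * grad, then theta <- theta - v."""
    # The velocity is a decaying sum of the current and all previous gradients.
    velocity = gamma * velocity + lr * grad
    return theta - velocity, velocity


theta, v = 5.0, 0.0
for _ in range(200):
    theta, v = momentum_step(theta, v, grad=2.0 * theta)  # J(theta) = theta**2

print(theta)  # close to the minimum at 0
```

Because the velocity remembers earlier gradients, steps in a consistent direction build up speed, while gradients that keep flipping sign partly cancel out.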
Adagrad is an adaptive algorithm for gradient-based optimization that performs smaller
updates (i.e. low learning rates) for parameters associated with frequently occurring features,
and larger updates (i.e. high learning rates) for parameters associated with infrequent features.
For this reason, it is well-suited for dealing with sparse data.
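Previously, we performed updates on the weights with the same learning rate for every
weight. But Adagrad adapts the learning rate for every parameter $\theta_i$ at every time step $t$:
$$\theta_{t+1,i} = \theta_{t,i} - \frac{\eta}{\sqrt{G_{t,ii} + \epsilon}} \cdot g_{t,i}$$
$g_{t,i}$ is the partial derivative of the cost function w.r.t. the parameter $\theta_i$ at the time step $t$.
$G_t$ contains the sum of the squares of the past gradients w.r.t. all parameters θ along its
diagonal. We can now vectorize our implementation by performing an element-wise matrix-vector
product ⊙ between $G_t$ and $g_t$:
$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{G_t + \epsilon}} \odot g_t$$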
One of Adagrad’s main benefits is that it eliminates the need to manually tune the learning
rate. Most implementations use a default value of 0.01 and leave it at that.
Adagrad’s main weakness is its accumulation of the squared gradients in the denominator:
Since every added term is positive, the accumulated sum keeps growing during training.
This in turn causes the learning rate to shrink and eventually become infinitesimally small, at
which point the algorithm is no longer able to acquire additional knowledge. The following
algorithms aim to resolve this flaw.
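A short sketch shows both the per-parameter learning rate and the shrinking effect described above. It uses the commonly cited Adagrad form, where each parameter's step is divided by the square root of its accumulated squared gradients; the gradient values are made up.

```python
import math


def adagrad_step(theta, G, grad, lr=0.01, eps=1e-8):
    """Per-parameter Adagrad update; G accumulates squared gradients."""
    G = [g_acc + g * g for g_acc, g in zip(G, grad)]
    theta = [t - lr / math.sqrt(g_acc + eps) * g
             for t, g_acc, g in zip(theta, G, grad)]
    return theta, G


theta, G = [1.0, 1.0], [0.0, 0.0]
# Made-up gradients: parameter 0 gets a gradient every step, parameter 1 only once.
for grad in ([1.0, 0.0], [1.0, 0.0], [1.0, 1.0]):
    theta, G = adagrad_step(theta, G, grad)

effective_lr = [0.01 / math.sqrt(g + 1e-8) for g in G]
print(G)
print(effective_lr)  # the frequently updated parameter has the smaller rate
```

Since `G` can only grow, the effective learning rate of every parameter can only go down as training continues.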
RMSProp:
RMSProp, Root Mean Square Propagation, was devised by Geoffrey Hinton in Lecture 6e
of his Coursera class. RMSProp was developed to overcome the disadvantages of Adagrad. In
RMSProp the learning rate gets adjusted automatically, and it chooses a different learning rate
for each parameter.
RMSprop also divides the learning rate by an exponentially decaying average of squared
gradients. Hinton suggests γ to be set to 0.9, while a good default value for the learning
rate η is 0.001.
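In code, the only change from Adagrad is that the growing sum of squared gradients is replaced by a decaying average. The sketch below uses the defaults mentioned above (γ = 0.9, η = 0.001) on a hypothetical cost J(θ) = θ²; the small `eps` constant is an assumption added to avoid division by zero.

```python
import math


def rmsprop_step(theta, avg_sq, grad, lr=0.001, gamma=0.9, eps=1e-8):
    """E[g^2] <- gamma * E[g^2] + (1 - gamma) * g^2, then a scaled step."""
    avg_sq = gamma * avg_sq + (1.0 - gamma) * grad * grad
    theta = theta - lr / math.sqrt(avg_sq + eps) * grad
    return theta, avg_sq


theta, avg_sq = 1.0, 0.0
for _ in range(1000):
    theta, avg_sq = rmsprop_step(theta, avg_sq, grad=2.0 * theta)  # J = theta**2

print(theta, avg_sq)
```

Because old squared gradients fade out of the average, the denominator does not keep growing, so the learning rate does not shrink towards zero the way it does in Adagrad.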
Adam:
$$m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t$$
$$v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$$
$\beta_1$ and $\beta_2$ are the hyperparameters. $m_t$ and $v_t$ are estimates of the first moment
(the mean) and the second moment (the uncentered variance) of the gradients respectively, hence
the name of the method. As $m_t$ and $v_t$ are initialized as vectors of 0’s, the authors of Adam
observe that they are biased towards zero, especially during the initial time steps, and
especially when the decay rates are small (i.e. β1 and β2 are close to 1).
They counteract these biases by computing bias-corrected first and second moment
estimates:
$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}$$
$$\hat{v}_t = \frac{v_t}{1 - \beta_2^t}$$
They then use these to update the parameters just as we have seen in Adadelta and
RMSprop, which yields the Adam update rule:
$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \hat{m}_t$$
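Putting the pieces together, one Adam step for a single scalar parameter can be sketched as below. The default values (η = 0.001, β1 = 0.9, β2 = 0.999, ε = 1e-8) are the commonly quoted ones and the cost J(θ) = θ² is hypothetical; treat both as assumptions.

```python
import math


def adam_step(theta, m, v, grad, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter; t is the 1-based step count."""
    m = beta1 * m + (1.0 - beta1) * grad         # first-moment estimate
    v = beta2 * v + (1.0 - beta2) * grad * grad  # second-moment estimate
    m_hat = m / (1.0 - beta1 ** t)               # bias-corrected estimates
    v_hat = v / (1.0 - beta2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v


theta, m, v = 1.0, 0.0, 0.0
for t in range(1, 2001):
    theta, m, v = adam_step(theta, m, v, grad=2.0 * theta, t=t)  # J = theta**2

print(theta)  # has moved from 1.0 towards the minimum at 0
```

On the very first step the bias correction turns the tiny raw estimates back into the actual gradient and its square, so the first update has a size of about η rather than being close to zero.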