
Deep Learning in NLP
Deep NLP: Theory + Practical

FAHAD HUSSAIN
MSCS, MCS, DAE(CIT)
Computer Science Instructor at a well-known international center
Also a Machine Learning and Deep Learning practitioner

For further assistance, code and slide https://fahadhussaincs.blogspot.com/


YouTube Channel : https://www.youtube.com/fahadhussaintutorial
Deep Learning
Deep learning (also known as deep structured learning) is part of a
broader family of machine learning methods based on artificial
neural networks with representation learning. Learning can be
supervised, semi-supervised or unsupervised.

Deep learning is a machine learning technique that learns features and tasks directly from data, where the data may be images, text, or sound!

ML vs DL

What is a Neuron?

How Neurons Work

Perceptron in deep learning

Artificial Neural Network
(Diagram: normalize/standardize the inputs, then apply the activation function.)

What is an Activation Function?
Activation functions are an extremely important feature of artificial neural networks. They basically decide whether a neuron should be activated or not: whether the information the neuron is receiving is relevant for the given input or should be ignored.

The activation function is the non-linear transformation that we apply to the input signal. This transformed output is then sent to the next layer of neurons as input.

• Linear Activation Function


• Non Linear Activation Function
What is an Activation Function?
Linear Function
The function is a line, i.e. linear. Therefore, the output of the function is not confined to any range.

Non Linear Function
Non-linear functions make it easy for the model to generalize or adapt to a variety of data and to differentiate between outputs. The non-linear activation functions are mainly divided on the basis of their range or curves:
1. Threshold
2. Sigmoid
3. Tanh
4. ReLU
5. Leaky ReLU
6. Softmax
A minimal sketch of these functions follows.
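Below is a minimal NumPy sketch of the first five functions (softmax is shown later with its own slide); the sample input values are purely illustrative.

```python
import numpy as np

def threshold(x):          # step function: 1 if x >= 0 else 0
    return (x >= 0).astype(float)

def sigmoid(x):            # squashes input to (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):               # squashes input to (-1, 1)
    return np.tanh(x)

def relu(x):               # 0 for negative input, identity otherwise
    return np.maximum(0.0, x)

def leaky_relu(x, a=0.01): # small slope a instead of 0 for x < 0
    return np.where(x >= 0, x, a * x)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x))   # [0. 0. 0. 0.5 2.]
```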
Threshold Function?

Sigmoid Function?
The Sigmoid Function curve looks like an S-shape.
This function reduces extreme values or outliers in data without removing them.
It converts independent variables of near-infinite range into simple probabilities between 0 and 1, and most of its output will be very close to 0 or 1.

Rectifier (ReLU) Function?
ReLU is the most widely used activation function when designing networks today. First things first: the ReLU function is non-linear, which means we can easily backpropagate the errors and have multiple layers of neurons being activated by the ReLU function.

Leaky ReLU Function?
The Leaky ReLU function is nothing but an improved version of the ReLU function. As we saw, for the ReLU function the gradient is 0 for x < 0, which makes the neurons die for activations in that region. Leaky ReLU is defined to address this problem. Instead of defining the ReLU function as 0 for x less than 0, we define it as a small linear component of x.

What we have done here is simply replace the horizontal line with a non-zero, non-horizontal line, f(x) = a·x for x < 0. Here a is a small value like 0.01 or so.
Tanh Function?
Pronounced “tanch,” tanh is a hyperbolic trigonometric function.
Just as the tangent represents the ratio between the opposite and adjacent sides of a right triangle, tanh represents the ratio of the hyperbolic sine to the hyperbolic cosine: tanh(x) = sinh(x) / cosh(x).
Unlike the Sigmoid function, the normalized range of tanh is –1 to 1. The advantage of tanh is that it can deal more easily with negative numbers.

Softmax Function (for Multiple Classification)?
The Softmax function calculates the probability distribution of an event over ‘n’ different events. In general terms, this function calculates the probability of each target class over all possible target classes. The calculated probabilities are then helpful for determining the target class for the given inputs.

The main advantage of Softmax is the output probability range: the range is 0 to 1, and the sum of all the probabilities equals one. If the softmax function is used in a multi-classification model, it returns the probability of each class, and the target class will have the highest probability.

The formula computes the exponential (e-power) of the given input value and the sum of the exponential values of all the inputs. The ratio of the exponential of the input value to the sum of exponential values is the output of the softmax function.
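A small NumPy sketch of that ratio; the raw class scores here are illustrative, not from the slides.

```python
import numpy as np

def softmax(z):
    # subtract the max for numerical stability; result sums to 1
    e = np.exp(z - np.max(z))
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1])   # raw class scores
probs = softmax(scores)
print(probs)          # ~[0.659 0.242 0.099]
print(probs.sum())    # 1.0
```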

Activation Function Example

Artificial Neural Networks (ANNs)
Artificial neural networks (ANNs), usually simply called neural networks (NNs), are computing systems
vaguely inspired by the biological neural networks that constitute animal brains.

An ANN is based on a collection of connected units or nodes called artificial neurons, which loosely model
the neurons in a biological brain. Each connection, like the synapses in a biological brain, can transmit a
signal to other neurons. An artificial neuron that receives a signal then processes it and can signal neurons
connected to it. The "signal" at a connection is a real number, and the output of each neuron is computed
by some non-linear function of the sum of its inputs. The connections are called edges. Neurons and edges
typically have a weight that adjusts as learning proceeds. The weight increases or decreases the strength
of the signal at a connection. Neurons may have a threshold such that a signal is sent only if the aggregate
signal crosses that threshold. Typically, neurons are aggregated into layers. Different layers may perform
different transformations on their inputs. Signals travel from the first layer (the input layer), to the last
layer (the output layer), possibly after traversing the layers multiple times.

Artificial Neural Networks (ANNs)

Thank You!
How Neural Networks Work
and
Back Propagation in deep learning

How a Neural Network Works with many neurons

Back Propagation in deep learning
Back-propagation is the essence of neural net training. It is the method of fine-tuning the weights of a neural net based on the error rate obtained in the previous epoch (i.e., iteration). Proper tuning of the weights allows you to reduce error rates and to make the model reliable by increasing its generalization.

Backpropagation is short for "backward propagation of errors." It is a standard method of training artificial neural networks. This method helps calculate the gradient of a loss function with respect to all the weights in the network.

Back Propagation in deep learning (epoch)

For further assistance
Visit Stack Exchange
https://stats.stackexchange.com/questions/154879/a-list-of-cost-functions-used-in-neural-networks-alongside-applications

Thank You!
Back Propagation in deep learning

What is Bias
Back Propagation (reducing the cost function)
(Batch Gradient Descent, Stochastic Gradient Descent, Mini-batch Gradient Descent)
What is Bias
Bias is just like an intercept added in a linear equation. It is an additional
parameter in the Neural Network which is used to adjust the output along with
the weighted sum of the inputs to the neuron. Moreover, bias value allows you to
shift the activation function to either right or left.
output = sum (weights * inputs) + bias
The output is calculated by multiplying the inputs with their weights and then passing the result through an activation function like the Sigmoid function. Here, bias acts like a constant which helps the model fit the given data. The steepness of the Sigmoid depends on the weights of the inputs.
A simpler way to understand bias is through the constant c of a linear function:
y = mx + c
A minimal sketch of this output rule follows.
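A hedged NumPy sketch of output = sum(weights * inputs) + bias; the input values, weights, and bias here are illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

inputs  = np.array([0.5, 0.3])   # two input features
weights = np.array([0.4, 0.6])   # one weight per input
bias    = -0.1                   # shifts the activation left or right

output = sigmoid(np.dot(weights, inputs) + bias)  # f(sum(w*x) + b)
print(output)   # ~0.57
```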
What is Bias

output = W * x + b

What is Gradient Descent (BGD)
Gradient Descent is an optimization technique that is used to improve deep
learning and neural network-based models by minimizing the cost function.
Gradient Descent is a process that occurs in the backpropagation phase where the
goal is to continuously resample the gradient of the model’s parameter in the
opposite direction based on the weight w, updating consistently until we reach
the global minimum of function J(w).

More precisely, gradient descent is an algorithm used to iterate through different combinations of weights in an optimal way, to find the combination of weights that has the minimum error. A minimal sketch follows.
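A minimal sketch of gradient descent on a toy cost function (the cost and learning rate are illustrative, not from the slides).

```python
# Gradient descent on J(w) = (w - 3)^2, whose minimum is at w = 3.
# dJ/dw = 2 * (w - 3)
w = 0.0
learning_rate = 0.1
for step in range(100):
    grad = 2 * (w - 3)          # gradient of the cost at the current w
    w -= learning_rate * grad   # move opposite to the gradient
print(w)  # ~3.0
```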
Brute force algorithm

Curse of dimensionality

Brute Force Algorithms refers to a programming style that does not include any shortcuts to
improve performance, but instead relies on sheer computing power to try all possibilities until the
solution to a problem is found. A classic example is the traveling salesman problem (TSP).

Useful link

https://towardsdatascience.com/understanding-the-mathematics-behind-gradient-descent-dde5dc9be06e

Thank You!
Stochastic gradient descent
The word ‘stochastic’ means a system or process that is linked with random probability. Hence, in Stochastic Gradient Descent, a few samples are selected randomly instead of the whole data set for each iteration. In Gradient Descent there is a term called “batch,” which denotes the total number of samples from a dataset used to calculate the gradient for each iteration. In typical Gradient Descent optimization, like Batch Gradient Descent, the batch is taken to be the whole dataset. Using the whole dataset is really useful for getting to the minima in a less noisy and less random manner, but the problem arises when our datasets get really huge.

Stochastic gradient descent
Stochastic gradient descent (often abbreviated SGD) is an iterative method for optimizing an objective function with suitable smoothness properties (e.g. differentiable or subdifferentiable), such as a convex loss function.

Thank You!
Mini Batch gradient descent
Mini-batch gradient descent is a variation of the gradient descent
algorithm that splits the training dataset into small batches that are
used to calculate model error and update model coefficients.
Implementations may choose to sum the gradient over the
mini-batch which further reduces the variance of the gradient.

Mini-batch gradient descent seeks to find a balance between the


robustness of stochastic gradient descent and the efficiency of batch
gradient descent. It is the most common implementation of gradient
descent used in the field of deep learning.
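A hedged NumPy sketch of mini-batch gradient descent for a toy linear-regression problem; the data, batch size, and learning rate are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))            # toy features
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + rng.normal(scale=0.1, size=1000)

w = np.zeros(3)
lr, batch_size = 0.1, 32
for epoch in range(20):
    idx = rng.permutation(len(X))          # shuffle each epoch
    for start in range(0, len(X), batch_size):
        b = idx[start:start + batch_size]  # one mini-batch of indices
        grad = 2 * X[b].T @ (X[b] @ w - y[b]) / len(b)
        w -= lr * grad                     # update once per mini-batch
print(w)  # close to [1.0, -2.0, 0.5]
```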

Mini Batch gradient descent
(Comparison diagram: BGD vs SGD vs MBGD)

Thank You!
Different types of Neural Network
• Perceptron (Multilayer Perceptron) & ANN
• Feedforward Neural Network – Artificial Neuron
• Radial Basis Function Neural Network
• Convolutional Neural Network
• Recurrent Neural Network (RNN) – Long Short Term Memory

ANN
A perceptron is a network with two layers, one input and one output. … An artificial neural network which has an input layer, an output layer, and two or more trainable weight layers (consisting of perceptrons) is called a multilayer perceptron, or MLP.

Feedforward Neural Network
It is one of the simplest types of artificial neural networks. In a feedforward neural network, the data passes through the different input nodes until it reaches the output node. In other words, data moves in only one direction, from the first layer until it reaches the output node. It is also known as a front-propagating wave, which is usually obtained using a graded activation function. Unlike in more complex types of neural networks, there is no backpropagation; data moves in only one direction. A feedforward neural network may consist of a single layer or may contain hidden layers. In a feedforward neural network, the products of the inputs and their weights are calculated. This is then fed to the output.

Whereas backpropagation is a training algorithm consisting of 2 steps:
• Feedforward the values.
• Calculate the error and propagate it back to the earlier layers.
Radial Basis Function Neural Network
A radial basis function (RBF) is a function that assigns a real value to each input from its domain (it is a real-valued function), and the value produced by the RBF is always an absolute value; i.e. it is a measure of distance and cannot be negative.
f(x) = f(||x||)

Euclidean distance, the straight-line distance between two points in Euclidean space, is typically used. Radial basis functions are used to approximate functions, much as neural networks act as function approximators. An RBF network is a radial basis function network: the radial basis functions act as the activation functions. The approximant f(x) is differentiable with respect to the weights W, which are learned using iterative update methods common among neural networks.

Radial basis function neural networks are extensively applied in power restoration systems. In recent decades,
power systems have become larger and more complex.
This increases the risk of blackout. This neural network is used in power restoration systems to restore power in
the least amount of time.
Convolutional Neural Network
Convolutional Neural Networks (CNNs) are one of the variants of neural networks used heavily in the field of Computer Vision. A CNN derives its name from the type of hidden layers it consists of. The hidden layers of a CNN typically consist of convolutional layers, pooling layers, fully connected layers, and normalization layers. This simply means that in addition to the normal activation functions defined above, convolution and pooling operations are used in the hidden layers.

Recurrent Neural Network
Recurrent Neural Networks (RNNs) are a type of neural network where the output from the previous step is fed as input to the current step. In traditional neural networks, all the inputs and outputs are independent of each other, but in cases like predicting the next word of a sentence, the previous words are required, and hence there is a need to remember them. Thus the RNN came into existence, which solved this issue with the help of a hidden layer. The main and most important feature of an RNN is its hidden state, which remembers some information about a sequence.

Thank You!
Principle

Steps
STEP 1: Randomly initialize the weights to small numbers close to 0 (but not 0)

STEP 2: Input the first observation of your dataset in the input layer, each feature in one input node.

STEP 3: Forward-Propagation: from left to right, the neurons are activated in a way that the impact of each
neuron's activation is limited by the weights. Propagate the activations until getting the predicted result y.

STEP 4: Compare the predicted result to the actual result. Measure the generated error.

STEP 5: Back-Propagation: from right to left, the error is back-propagated.


Update the weights according to how much they are responsible for the error. The learning rate decides by
how much we update the weights.

STEP 6: Repeat Steps 1 to 5 and update the weights after each observation (Reinforcement Learning).
Or: Repeat Steps 1 to 5 but update the weights only after a batch of observations (Batch Learning).

Keras
Keras is a high-level neural networks API, written in Python and capable of running on top of TensorFlow, CNTK, or Theano. It was developed with a focus on enabling fast experimentation. Being able to go from idea to result with the least possible delay is key to doing good research.
Use Keras if you need a deep learning library that:
• Allows for easy and fast prototyping (through user friendliness, modularity, and extensibility).
• Supports both convolutional networks and recurrent networks, as well as combinations of the two.
• Runs seamlessly on CPU, GPU and TPU.
A minimal model sketch follows.
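A hedged Keras sketch of a small ANN; the layer sizes and the assumed 11 input features are illustrative, not from the slides.

```python
from keras.models import Sequential
from keras.layers import Dense

model = Sequential()
model.add(Dense(units=6, activation='relu', input_dim=11))  # hidden layer
model.add(Dense(units=1, activation='sigmoid'))             # binary output

model.compile(optimizer='adam', loss='binary_crossentropy',
              metrics=['accuracy'])
# model.fit(X_train, y_train, batch_size=32, epochs=100)
```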
TensorFlow
TensorFlow is a free and open-source software library for dataflow and differentiable programming across a range of tasks. It is a symbolic math library, and is also used for machine learning applications such as neural networks. (Wikipedia)

Now it's time for practical work on Google Colab.

Thank You!
To Handle the Overfitting Problem

• Bias and Variance Problem (tradeoff)
• Irreducible Error
• Regularization (L1 Lasso Regression, L2 Ridge Regression)
• Dropout Method (Regularization)

What is Bias and Variance

Underfitting and Overfitting (Bias and Variance)

So, what will be the best model?

Low Bias, Low Variance

So, what about the irreducible error? It is actually error in the data collected at the source: noisiness and outliers introduced at the source level of the collected data.

Thanks
Helpful link:
https://towardsdatascience.com/holy-grail-for-bias-variance-tradeoff-overfitting-underfitting-7fad64ab5d76

First we create the K-fold on the Breast Cancer dataset; then we understand the other regularization methods; finally we move towards parameter tuning.

Regularization:
(L2) Ridge Regression
(L1) Lasso Regression
Regularization is a process of introducing additional information in order to prevent overfitting.

Thanks

Dropout Method (Regularization)

Dropout ratio p: 0 <= p <= 1

E.g.: p = 0.2
A hedged Keras sketch follows.
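A hedged sketch of adding Dropout regularization to a Keras model; p = 0.2 means 20% of that layer's activations are randomly dropped during each training update (layer sizes are illustrative).

```python
from keras.models import Sequential
from keras.layers import Dense, Dropout

model = Sequential()
model.add(Dense(units=6, activation='relu', input_dim=11))
model.add(Dropout(0.2))   # drop 20% of activations while training
model.add(Dense(units=1, activation='sigmoid'))
model.compile(optimizer='adam', loss='binary_crossentropy',
              metrics=['accuracy'])
```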

Thanks

Parameters Tuning in ANN
Parameter VS Hyper-parameters
Model parameters are something that a model learns on its own. For example:
1) Weights or coefficients of independent variables in a linear regression model.
2) Weights or coefficients of independent variables in an SVM.
3) Split points in a decision tree.

Model hyper-parameters are used to optimize the model's performance. For example:
1) Kernel and slack in SVM
2) Value of K in KNN
3) Depth of tree in decision trees
Parameters Tuning in ANN
Parameter VS Hyper-parameters
Let's understand how to find the optimal hyper-parameter sizes and quantities in an ANN using parameter tuning!

Parameters Tuning in ANN
Parameter VS Hyper-parameters

Grid searching of hyperparameters


Grid search is an approach to hyperparameter tuning that will methodically build and evaluate a model for each combination of algorithm parameters specified in a grid. A hedged sketch follows.
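A hedged sketch of grid-searching ANN hyper-parameters with the scikit-learn wrapper shipped with older Keras versions; X_train, y_train, and the layer sizes are assumptions for illustration.

```python
from keras.models import Sequential
from keras.layers import Dense
from keras.wrappers.scikit_learn import KerasClassifier
from sklearn.model_selection import GridSearchCV

def build_model(optimizer='adam'):
    model = Sequential()
    model.add(Dense(units=6, activation='relu', input_dim=11))
    model.add(Dense(units=1, activation='sigmoid'))
    model.compile(optimizer=optimizer, loss='binary_crossentropy',
                  metrics=['accuracy'])
    return model

classifier = KerasClassifier(build_fn=build_model)
param_grid = {'batch_size': [25, 32],        # every combination in this
              'epochs': [100, 500],          # grid is built and scored
              'optimizer': ['adam', 'rmsprop']}
grid = GridSearchCV(estimator=classifier, param_grid=param_grid,
                    scoring='accuracy', cv=10)
# grid_result = grid.fit(X_train, y_train)
# print(grid_result.best_params_, grid_result.best_score_)
```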

Thanks

What is an Optimizer, and what are its different types?
Optimizers are algorithms or methods used to change the attributes of your neural network, such as weights and learning rate, in order to reduce the losses. Optimization algorithms or strategies are responsible for reducing the losses and providing the most accurate results possible.

Role of an optimizer
Optimizers update the weight parameters to minimize the loss function. The loss function acts as a guide to the terrain, telling the optimizer whether it is moving in the right direction to reach the bottom of the valley, the global minimum.

What is an Optimizer, and what are its different types?
Types of Gradient Descent:

• Batch Gradient Descent or Vanilla Gradient Descent
• Stochastic Gradient Descent
• Mini-batch Gradient Descent

Gradient Descent with Momentum
Mini-batch gradient descent makes a parameter update with just a subset of examples, so the direction of the update has some variance, and the path taken by mini-batch gradient descent “oscillates” toward convergence. Gradient Descent with Momentum considers the past gradients to smooth out the update: it computes an exponentially weighted average of your gradients, and then uses that average to update your weights instead. It works faster than the standard gradient descent algorithm.
Gradient Descent with Momentum
During backward propagation, we use dW and db to update our parameters W and b as follows:

W = W – learning_rate * dW
b = b – learning_rate * db

In momentum, instead of using dW and db independently for each epoch, we take the exponentially weighted averages of dW and db:

VdW = β * VdW + (1 – β) * dW
Vdb = β * Vdb + (1 – β) * db

where beta ‘β’ is another hyperparameter, called momentum, ranging from 0 to 1. It sets the weight between the average of previous values and the current value when calculating the new weighted average.

After calculating the exponentially weighted averages, we update our parameters:

W = W – learning_rate * VdW
b = b – learning_rate * Vdb
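A hedged NumPy sketch of this update for one parameter array W; the stand-in random gradient replaces what backpropagation would actually supply.

```python
import numpy as np

beta, lr = 0.9, 0.01
W = np.zeros(3)
VdW = np.zeros_like(W)

for epoch in range(100):
    dW = np.random.default_rng(epoch).normal(size=3)  # stand-in gradient
    VdW = beta * VdW + (1 - beta) * dW   # exponentially weighted average
    W -= lr * VdW                        # update with the smoothed gradient
```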
Link:
https://medium.com/datadriveninvestor/overview-of-different-optimizers-for-neural-networks-e0ed119440c3

Research Paper:
A Survey of Optimization Methods from a Machine Learning Perspective

Thanks
Adagrad Optimizer
Adagrad is an optimizer with parameter-specific learning rates, which are adapted relative to how frequently a parameter gets updated during training. The more updates a parameter receives, the smaller its learning rate.

AdaGrad, or adaptive gradient, allows the learning rate to adapt based on the parameters. It performs larger updates for infrequent parameters and smaller updates for frequent ones. Because of this it is well suited for sparse data (NLP or image recognition). Another advantage is that it basically eliminates the need to tune the learning rate. Each parameter has its own learning rate, and due to the peculiarities of the algorithm the learning rate is monotonically decreasing. This causes the biggest problem: at some point the learning rate is so small that the system stops learning.

Two kinds of feature matrices in the learning stage of a NN:
Dense (matrix): most of the values in the matrix are NOT zero (0)
Sparse (matrix): most of the values in the matrix are zero (0)

AdaGrad works by taking the cumulative sum of squared gradients; a hedged sketch follows.
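A hedged NumPy sketch of the AdaGrad update: the cumulative sum of squared gradients shrinks each parameter's effective learning rate over time (the stand-in gradient is illustrative).

```python
import numpy as np

lr, eps = 0.1, 1e-8
W = np.zeros(3)
G = np.zeros_like(W)                      # running sum of squared gradients

for step in range(100):
    dW = np.random.default_rng(step).normal(size=3)  # stand-in gradient
    G += dW ** 2
    W -= lr * dW / (np.sqrt(G) + eps)     # per-parameter adaptive step
```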


Link:
https://medium.com/datadriveninvestor/overview-of-different-optimizers-for-neural-networks-e0ed119440c3

Research Paper:
A Survey of Optimization Methods from a Machine Learning Perspective

Thanks
Adadelta & RMSProp Optimizer
RMSprop
Root mean square prop, or RMSprop, is another adaptive learning rate method that improves on AdaGrad. Instead of taking the cumulative sum of squared gradients as in AdaGrad, we take the exponential moving average of these gradients.
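A hedged NumPy sketch of the RMSprop update, replacing AdaGrad's cumulative sum with an exponential moving average of squared gradients (hyper-parameters illustrative).

```python
import numpy as np

lr, beta, eps = 0.001, 0.9, 1e-8
W = np.zeros(3)
S = np.zeros_like(W)                      # EMA of squared gradients

for step in range(100):
    dW = np.random.default_rng(step).normal(size=3)  # stand-in gradient
    S = beta * S + (1 - beta) * dW ** 2
    W -= lr * dW / (np.sqrt(S) + eps)
```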

Like RMSprop, Adadelta (2012) is also another improvement from AdaGrad, focusing on the learning rate
component. Adadelta is probably short for ‘adaptive delta’, where delta here refers to the difference between the
current weight and the newly updated weight.
The difference between Adadelta and RMSprop is that Adadelta removes the use of the learning rate
parameter completely by replacing it with D, the exponential moving average of squared deltas.

Link:
https://medium.com/datadriveninvestor/overview-of-different-optimizers-for-neural-networks-e0ed119440c3

Research Paper:
A Survey of Optimization Methods from a Machine Learning Perspective

Thanks
Adaptive Moment Estimation, or Adam Optimizer

Adaptive moment estimation, or Adam (2014), is a combination of momentum and RMSprop. It acts upon:

(i) the gradient component, by using V, the exponential moving average of gradients (as in momentum), and

(ii) the learning rate component, by dividing the learning rate α by the square root of S, the exponential moving average of squared gradients (as in RMSprop).

A hedged sketch follows.
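A hedged NumPy sketch of Adam, combining momentum's V and RMSprop's S with the usual bias correction (hyper-parameters are the common defaults, used here for illustration).

```python
import numpy as np

lr, b1, b2, eps = 0.001, 0.9, 0.999, 1e-8
W = np.zeros(3)
V = np.zeros_like(W)                                 # EMA of gradients
S = np.zeros_like(W)                                 # EMA of squared gradients

for t in range(1, 101):
    dW = np.random.default_rng(t).normal(size=3)     # stand-in gradient
    V = b1 * V + (1 - b1) * dW                       # first moment
    S = b2 * S + (1 - b2) * dW ** 2                  # second moment
    V_hat = V / (1 - b1 ** t)                        # bias correction
    S_hat = S / (1 - b2 ** t)
    W -= lr * V_hat / (np.sqrt(S_hat) + eps)
```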

Recap of the building blocks Adam combines:
W = W – learning_rate * dW
VdW = β * VdW + (1 – β) * dW
Vdb = β * Vdb + (1 – β) * db

Thanks

Convolutional Neural Network (CNN)
Before understanding the CNN, let's first understand these two core subjects.

Image processing is a method to perform some operations on an image, in order to get an enhanced
image or to extract some useful information from it. It is a type of signal processing in which input is an image
and output may be image or characteristics/features associated with that image.

Computer vision is a field of computer science that works on enabling computers to see, identify and
process images in the same way that human vision does, and then provide appropriate output. It is like
imparting human intelligence and instincts to a computer. In reality though, it is a difficult task to enable
computers to recognize images of different objects.

But the question arises: how does learning happen at this stage? Let's first understand how kids learn!

So, what about the computer? The CNN learns by image features, from a grayscale image or an RGB image.

So, what about the computer? CNN learning…

(Example: classifying an image of an X vs. an image of an O.) Here the CNN works like a black box, so what is inside the black box?
Steps in CNN

Thanks

Steps in CNN

1. Convolutional

1. Convolutional (of a Smiling Face)

1. Convolutional (of a Smiling Face)

Different kinds of filters/kernels in image processing:

http://setosa.io/ev/image-kernels/

1. Convolutional (of a Smiling Face)

Applying the ReLU activation function to decrease the linearity in the image, because the image is originally non-linear!

Thanks

2. Pooling
A pooling layer is another building block of a CNN. Its function is to progressively reduce the
spatial size of the representation to reduce the amount of parameters and computation in the
network. Pooling layer operates on each feature map independently. The most common
approach used in pooling is max pooling.
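A hedged NumPy sketch of 2x2 max pooling with stride 2 on a single feature map; the toy feature map is illustrative.

```python
import numpy as np

def max_pool2x2(fmap):
    h, w = fmap.shape
    out = np.zeros((h // 2, w // 2))
    for i in range(0, h - 1, 2):
        for j in range(0, w - 1, 2):
            out[i // 2, j // 2] = fmap[i:i+2, j:j+2].max()  # keep the max
    return out

fmap = np.array([[1, 3, 2, 0],
                 [4, 2, 1, 5],
                 [0, 1, 3, 2],
                 [2, 4, 1, 0]])
print(max_pool2x2(fmap))
# [[4. 5.]
#  [4. 3.]]
```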

Max / Avg. Pooling

3. Flattening
Flattening is converting the data into a 1-dimensional array for input to the next layer. We flatten the output of the convolutional layers to create a single long feature vector. This vector is connected to the final classification model, which is called a fully-connected layer.
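A tiny sketch of the same idea on the pooled map from the previous example.

```python
import numpy as np

# Flattening: the pooled feature map becomes one long 1-D vector that
# feeds the fully-connected layer.
pooled = np.array([[4., 5.],
                   [4., 3.]])
flat = pooled.flatten()      # or pooled.reshape(-1)
print(flat)                  # [4. 5. 4. 3.]
```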

4. Full Connection

Complete CNN in one View

Softmax Activation
In mathematics, the softmax function, also known as softargmax or the normalized exponential function, is a function that takes as input a vector of K real numbers and normalizes it into a probability distribution consisting of K probabilities proportional to the exponentials of the input numbers.

Loss function
Cross-entropy is commonly used in machine learning as a loss function. Cross-entropy is a measure from the field of information theory, building upon entropy, and generally calculating the difference between two probability distributions.

Example from the slide: predicted probabilities (0.9, 0.1) against actual labels (1, 0). A hedged sketch follows.
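A small NumPy sketch computing binary cross-entropy for that example.

```python
import numpy as np

def binary_cross_entropy(y_true, y_pred):
    y_pred = np.clip(y_pred, 1e-12, 1 - 1e-12)  # avoid log(0)
    return -np.mean(y_true * np.log(y_pred)
                    + (1 - y_true) * np.log(1 - y_pred))

y_true = np.array([1, 0])      # actual labels from the slide
y_pred = np.array([0.9, 0.1])  # predicted probabilities
print(binary_cross_entropy(y_true, y_pred))  # ~0.105
```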
Summarize …
• Classification error
• RMS
• Cross Entropy

Thanks
https://en.wikipedia.org/wiki/Cross_entropy

https://towardsdatascience.com/understanding-binary-cross-entropy-log-loss-a-visual-explanation-a3ac6025181a

CNN for Categorical Variables (output)
So far, we have discussed the CNN using a binary outcome (0, 1). So what about a categorical variable result, meaning more than just 0 and 1? For that, we are going to understand the CNN using the MNIST dataset.
So, what is the MNIST dataset?
It is a dataset consisting of images of handwritten digits from 0 to 9. Each image is monochrome, 28 x 28 pixels.
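A hedged Keras sketch for MNIST digit classification (10 classes, hence a softmax output); the layer sizes and hyper-parameters are illustrative.

```python
from keras.datasets import mnist
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense
from keras.utils import to_categorical

(X_train, y_train), (X_test, y_test) = mnist.load_data()
X_train = X_train.reshape(-1, 28, 28, 1).astype('float32') / 255.0
y_train = to_categorical(y_train, 10)      # one-hot encode the 10 digits

model = Sequential([
    Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)),
    MaxPooling2D((2, 2)),
    Flatten(),
    Dense(128, activation='relu'),
    Dense(10, activation='softmax'),       # one probability per digit
])
model.compile(optimizer='adam', loss='categorical_crossentropy',
              metrics=['accuracy'])
# model.fit(X_train, y_train, epochs=5, batch_size=128)
```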

MNIST DATASET History

The 1980s were an era of improving image and signal processing, yet the problem of reading handwritten zip codes remained a big challenge!

In 1989, Yann LeCun solved this problem by applying convolutional layers in a neural network (a CNN).

MNIST CNN

Useful links:
https://en.wikipedia.org/wiki/MNIST_database

http://yann.lecun.com/exdb/mnist/

http://colah.github.io/posts/2014-10-Visualizing-MNIST/

Let's move towards the CODE!


Recurrent Neural Network
Examples where history matters: food history, sentence history.

"… working with her." vs. "… working with him."
Recurrent Neural Network (RNN)
A Recurrent Neural Network is a generalization of the feedforward neural network that has an internal memory. An RNN is recurrent in nature, as it performs the same function for every input of data, while the output for the current input depends on the past computation. After producing the output, it is copied and sent back into the recurrent network. For making a decision, it considers the current input and the output it has learned from the previous input.

Unlike feedforward neural networks, RNNs can use their internal state (memory) to process sequences of inputs. This makes them applicable to tasks such as unsegmented, connected handwriting recognition or speech recognition. In other neural networks, all the inputs are independent of each other. But in an RNN, all the inputs are related to each other.
code and slide https://fahadhussaincs.blogspot.com/
YouTube Channel : https://www.youtube.com/fahadhussaintutorial
RNN Architecture and Working…

Unfold the RNN Layers

Examples of RNN with respect to the relationships

Basic LSTM
The Long Short-Term Memory network was first introduced in 1997 by Sepp Hochreiter and his Ph.D. supervisor Jürgen Schmidhuber.
LSTM is a special kind of RNN, capable of learning long-term dependencies. Remembering information for long periods of time is its default behaviour.
The Long Short-Term Memory (LSTM) network is the most popular solution to the vanishing gradient problem.

First, Understand How the RNN Works

This is a cat, and _____ is a good pet animal

Looking More Clearly

LSTM’s and GRU’s as a solution
LSTM’s and GRU’s were created as the solution to short-term memory. They have internal mechanisms, called gates, that can regulate the flow of information.
LSTM’s as a solution

LSTM’s as a solution (steps)
1. First, the previous hidden state and the current input get concatenated. We’ll call it combine.

2. Combine gets fed into the forget layer. This layer removes non-relevant data.

3. A candidate layer is created using combine. The candidate holds possible values to add to the cell state.

4. Combine also gets fed into the input layer. This layer decides what data from the candidate should be added to the new cell state.

5. After computing the forget layer, candidate layer, and input layer, the cell state is calculated using those vectors and the previous cell state.

6. The output is then computed.

7. Pointwise multiplying the output and the new cell state gives us the new hidden state.

A minimal Keras usage sketch follows.
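A hedged sketch using Keras' built-in LSTM layer, which handles all of the gate computations above internally; the shapes (60 timesteps, 1 feature) are illustrative.

```python
from keras.models import Sequential
from keras.layers import LSTM, Dense

model = Sequential()
model.add(LSTM(units=50, input_shape=(60, 1)))  # gates handled internally
model.add(Dense(1))                             # e.g. next-value regression
model.compile(optimizer='adam', loss='mean_squared_error')
# Swapping LSTM for GRU only changes the import and the layer name.
```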
GRU’s (Gated Recurrent Unit) as a solution
Now that we know how an LSTM works, let's briefly look at the GRU. The GRU is the newer generation of recurrent neural networks and is pretty similar to an LSTM. GRU’s got rid of the cell state and use the hidden state to transfer information. A GRU also has only two gates: a reset gate and an update gate.

RNN vs LSTM vs GRU

RNN vs LSTM vs GRU

The key difference between a GRU and an LSTM is that a GRU has two gates (reset and update
gates) whereas an LSTM has three gates (namely input, output and forget gates).

GRUs train faster and perform better than LSTMs on less training data if you are doing language
modeling (not sure about other tasks).

GRUs are simpler and thus easier to modify, for example adding new gates in case of additional
input to the network. It's just less code in general.

LSTMs should in theory remember longer sequences than GRUs and outperform them in tasks
requiring modeling long-distance relations.
Blog:

http://colah.github.io/posts/2015-08-Understanding-LSTMs/

Research Papers:
LSTM: A Search Space Odyssey

Deep Learning for Solar Power Forecasting – An Approach Using Autoencoder and LSTM Neural Networks

Thanks
Thanks

Why RNN, and what is the difference between ANN & RNN?

This is a cat, and _____ is a good pet animal

Vanishing gradient problem
The vanishing gradient makes the gradient very close to zero, so it's difficult to know
where to move in the state space; the exploding gradient makes the gradient a very
large value, so it makes learning unstable. This problem is more pronounced in
recurrent networks since they use the same matrix at each time step.

Vanishing Gradient: When making use of back-propagation, the goal is to calculate the error, which is found by taking the difference between the actual output and the model output. Notice the small change in the weights.

Exploding Gradient: The working of the exploding gradient is similar, but the weights here change drastically instead of by a negligible amount.

Do Subscribe the channel for further updates!

Thanks

What is Time Series Analysis, and how does it relate to RNN?
A time series is a series of data points indexed in time order. Most commonly, a time series is a sequence taken at successive, equally spaced points in time. Thus it is a sequence of discrete-time data.

A time series model is purely dependent on the idea that past behavior and price patterns can be used to predict future price behavior. A hedged windowing sketch follows.
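A hedged NumPy sketch of turning a 1-D time series into (samples, timesteps, 1) windows, so past values can be fed to an RNN to predict the next one; the toy sine series and window length are illustrative.

```python
import numpy as np

series = np.sin(np.linspace(0, 20, 200))   # toy time series
window = 10

X, y = [], []
for i in range(len(series) - window):
    X.append(series[i:i + window])         # past `window` values
    y.append(series[i + window])           # value to predict next
X = np.array(X).reshape(-1, window, 1)     # RNN-ready shape
y = np.array(y)
print(X.shape, y.shape)                    # (190, 10, 1) (190,)
```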

Deep Learning from Scratch
So far we have discussed (supervised learning):
• Artificial neural network
• Convolutional neural network
• Recurrent neural network (LSTM, GRU)

Coming up (unsupervised learning):
• Self Organizing Map
• Boltzmann Machine
• AutoEncoders

Self Organization Map (Kohonen Self-Organizing Maps)
A self-organizing map (SOM) or self-organizing feature map (SOFM) is a type of artificial neural network (ANN) that is trained using unsupervised learning to produce a low-dimensional (typically two-dimensional), discretized representation of the input space of the training samples, called a map, and is therefore a method of dimensionality reduction. Self-organizing maps differ from other artificial neural networks in that they apply competitive learning as opposed to error-correction learning (such as backpropagation with gradient descent), and in the sense that they use a neighborhood function to preserve the topological properties of the input space.

It is used mainly for dimensionality reduction, feature detection, and clustering. A hedged sketch using the MiniSom library follows.
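A hedged sketch using the third-party MiniSom library (pip install minisom); the stand-in data X is assumed to be row vectors scaled to [0, 1].

```python
import numpy as np
from minisom import MiniSom

X = np.random.default_rng(0).random((100, 4))     # stand-in data

som = MiniSom(x=10, y=10, input_len=4, sigma=1.0, learning_rate=0.5)
som.random_weights_init(X)
som.train_random(data=X, num_iteration=100)       # competitive learning

winner = som.winner(X[0])   # best-matching unit for one sample
print(winner)               # (row, col) on the 10x10 map
```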

Thanks
Boltzmann Machine or Boltzmann distribution
(Recap so far: ANN, CNN, RNN, SOM)

Boltzmann Machine
A Boltzmann Machine is a network of symmetrically connected, neuron-like units that make stochastic decisions about whether to be on or off. Boltzmann machines have a simple learning algorithm that allows them to discover interesting features in datasets composed of binary vectors.

Boltzmann distribution and Factor
It is a probability measure that gives the probability that a system will be in a certain state as a function of that state's energy and the temperature of the system.

Restricted Boltzmann Machine
A restricted Boltzmann machine (RBM) is a generative stochastic artificial neural network that can learn a probability distribution over its set of inputs.

The Restricted Boltzmann Machine is an undirected graphical model that plays a major role in deep learning frameworks in recent times. It was initially introduced as Harmonium in 1986, and it gained great popularity in recent years in the context of the Netflix Prize, where Restricted Boltzmann Machines achieved state-of-the-art performance in collaborative filtering and beat most of the competition. It is an algorithm useful for dimensionality reduction, classification, regression, collaborative filtering, feature learning, and topic modeling. A hedged scikit-learn sketch follows.
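A hedged sketch with scikit-learn's BernoulliRBM; X is an assumed binary (0/1) user-movie matrix in the spirit of the slide's example.

```python
import numpy as np
from sklearn.neural_network import BernoulliRBM

X = np.array([[1, 1, 0, 0, 1, 0],    # each row: one user's movie likes
              [0, 0, 1, 1, 0, 1],
              [1, 1, 0, 0, 1, 1]])

rbm = BernoulliRBM(n_components=5,   # 5 hidden feature units
                   learning_rate=0.1, n_iter=20, random_state=0)
rbm.fit(X)                           # trained via (persistent) contrastive divergence
hidden = rbm.transform(X)            # hidden-unit activation probabilities
print(hidden.shape)                  # (3, 5)
```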

Restricted Boltzmann Machine
Working example as a recommendation system

(Diagram: hidden feature units Drama, Love Story, Action, Award Winning, Khan_Movie; visible units Movie1 through Movie6)

Thanks
Restricted Boltzmann Machine

Gibbs sampling and contrastive divergence

Gibbs sampling and contrastive divergence
In terms of the curve

Advanced topics related to RBM:
Deep Boltzmann Machines
and
Deep Belief Networks
Thanks
Additional links:
• http://deeplearning.net/tutorial/rbm.html#rbm
• A fast learning algorithm for deep belief nets (paper)
• An Introduction to Restricted Boltzmann Machines (article)
