
UNIT-4 Part-II

What is deep learning?

Deep learning is a type of machine learning and artificial intelligence (AI) that imitates the way
humans gain certain types of knowledge. Deep learning is an important element of data science,
which includes statistics and predictive modeling. It is extremely beneficial to data scientists
who are tasked with collecting, analyzing and interpreting large amounts of data; deep learning
makes this process faster and easier.

At its simplest, deep learning can be thought of as a way to automate predictive analytics. While
traditional machine learning algorithms are linear, deep learning algorithms are stacked in
a hierarchy of increasing complexity and abstraction.

Deep learning methods

Various methods can be used to create strong deep learning models. These techniques include
learning rate decay, transfer learning, training from scratch and dropout.

Learning rate decay. The learning rate is a hyperparameter -- a value set before training that conditions how the system learns -- that controls how much the model changes in response to the estimated error each time the model weights are updated. Learning rates that are too high may result in unstable training or the learning of a suboptimal set of weights. Learning rates that are too small may produce a lengthy training process that has the potential to get stuck.

The learning rate decay method -- also called learning rate annealing or adaptive learning
rates -- is the process of adapting the learning rate to increase performance and reduce training
time. The easiest and most common adaptations of learning rate during training include
techniques to reduce the learning rate over time.
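As an illustration, here is a minimal Keras sketch of one simple form of learning rate decay, using a callback that shrinks the rate each epoch; the toy data, model and decay factor are placeholder assumptions, not values from the text.

import numpy as np
import keras
from keras.layers import Dense

# Toy data purely for illustration: 100 samples, 10 features, binary labels.
x = np.random.rand(100, 10)
y = np.random.randint(0, 2, size=(100, 1))

model = keras.models.Sequential()
model.add(Dense(16, activation="relu", input_shape=(10,)))
model.add(Dense(1, activation="sigmoid"))
model.compile(optimizer="sgd", loss="binary_crossentropy")

# Start at 0.1 and multiply the learning rate by 0.9 after every epoch.
def decay_schedule(epoch):
    return 0.1 * (0.9 ** epoch)

lr_callback = keras.callbacks.LearningRateScheduler(decay_schedule)
model.fit(x, y, epochs=5, callbacks=[lr_callback])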

Transfer learning. This process involves adapting a previously trained model; it requires access to the internals of a preexisting network. First, users feed the existing network new data containing previously unknown classifications. Once adjustments are made to the network, new tasks can be performed with more specific categorizing abilities. This method has the advantage of requiring much less data than other methods, thus reducing computation time to minutes or hours.
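A minimal Keras sketch of this idea follows; MobileNetV2 pretrained on ImageNet stands in for the preexisting network, and the 5-class head is a hypothetical target task, neither of which is specified in the text. Freezing the base and training only the new head is what keeps the data and compute requirements low.

import keras
from keras.layers import GlobalAveragePooling2D, Dense

# Load a network pretrained on ImageNet, without its original classifier head.
base = keras.applications.MobileNetV2(weights="imagenet", include_top=False,
                                      input_shape=(128, 128, 3))
base.trainable = False  # freeze the pretrained weights

# Attach a new head for the hypothetical 5-class task and train only this part.
model = keras.models.Sequential()
model.add(base)
model.add(GlobalAveragePooling2D())
model.add(Dense(5, activation="softmax"))
model.compile(optimizer="adam", loss="categorical_crossentropy")
model.summary()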
Training from scratch. This method requires a developer to collect a large labeled data set
and configure a network architecture that can learn the features and model. This technique is
especially useful for new applications, as well as applications with a large number of output
categories. However, overall, it is a less common approach, as it requires inordinate amounts
of data, causing training to take days or weeks.

Dropout. This method attempts to solve the problem of overfitting in networks with large
amounts of parameters by randomly dropping units and their connections from the neural
network during training. It has been proven that the dropout method can improve the
performance of neural networks on supervised learning tasks in areas such as speech
recognition, document classification and computational biology.
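For illustration, a minimal Keras sketch showing dropout inserted between fully connected layers; the 0.5 rate and layer sizes are arbitrary choices, not values from the text.

import keras
from keras.layers import Dense, Dropout

model = keras.models.Sequential()
model.add(Dense(128, activation="relu", input_shape=(20,)))
model.add(Dropout(0.5))  # randomly drop 50% of these units during each training step
model.add(Dense(64, activation="relu"))
model.add(Dropout(0.5))
model.add(Dense(1, activation="sigmoid"))
model.summary()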

What are deep learning neural networks?

A type of advanced machine learning algorithm, known as an artificial neural network, underpins most deep learning models. As a result, deep learning may sometimes be referred to as deep neural learning or deep neural networking.

Neural networks come in several different forms, including recurrent neural networks,
convolutional neural networks, artificial neural networks and feedforward neural networks, and
each has benefits for specific use cases. However, they all function in somewhat similar ways
-- by feeding data in and letting the model figure out for itself whether it has made the right
interpretation or decision about a given data element.

Neural networks involve a trial-and-error process, so they need massive amounts of data on
which to train. It's no coincidence neural networks became popular only after most enterprises
embraced big data analytics and accumulated large stores of data. Because the model's first few
iterations involve somewhat educated guesses on the contents of an image or parts of speech,
the data used during the training stage must be labeled so the model can see if its guess was
accurate. This means that, though many enterprises that use big data have large amounts of data, unstructured data is less helpful: a deep learning model can analyze unstructured data only once it has been trained to an acceptable level of accuracy, but it can't train on unstructured data itself.
Use cases today for deep learning include all types of big data analytics applications, especially
those focused on NLP, language translation, medical diagnosis, stock market trading signals,
network security and image recognition.

Specific fields in which deep learning is currently being used include the following:

• Customer experience (CX). Deep learning models are already being used for chatbots.
And, as it continues to mature, deep learning is expected to be implemented in various
businesses to improve CX and increase customer satisfaction.

• Text generation. Machines are being taught the grammar and style of a piece of text and
are then using this model to automatically create a completely new text matching the proper
spelling, grammar and style of the original text.

• Aerospace and military. Deep learning is being used to detect objects from satellites that
identify areas of interest, as well as safe or unsafe zones for troops.

• Industrial automation. Deep learning is improving worker safety in environments like factories and warehouses by providing services that automatically detect when a worker or object is getting too close to a machine.

• Adding color. Color can be added to black-and-white photos and videos using deep learning
models. In the past, this was an extremely time-consuming, manual process.

• Medical research. Cancer researchers have started implementing deep learning into their
practice as a way to automatically detect cancer cells.

• Computer vision. Deep learning has greatly enhanced computer vision, providing
computers with extreme accuracy for object detection and image classification, restoration
and segmentation.

Deep learning also has limitations and challenges, including the following:

• Deep learning requires large amounts of data. Furthermore, the more powerful and
accurate models will need more parameters, which, in turn, require more data.

• Once trained, deep learning models become inflexible and cannot handle multitasking.
They can deliver efficient and accurate solutions but only to one specific problem. Even
solving a similar problem would require retraining the system.
• Any application that requires reasoning -- such as programming or applying the scientific method -- long-term planning, or algorithm-like data manipulation is completely beyond what current deep learning techniques can do, even with large amounts of data.

Deep learning vs. machine learning

Deep learning is a subset of machine learning that differentiates itself through the way it solves
problems. Machine learning requires a domain expert to identify most of the applied features. Deep learning, on the other hand, learns features incrementally, thus eliminating the need for domain expertise. This makes deep learning algorithms take much longer to train than machine
learning algorithms, which only need a few seconds to a few hours. However, the reverse is
true during testing. Deep learning algorithms take much less time to run tests than machine
learning algorithms, whose test time increases along with the size of the data.

Furthermore, machine learning does not require the same costly, high-end machines and high-
performing GPUs that deep learning does.

In the end, many data scientists choose traditional machine learning over deep learning due to
its superior interpretability, or the ability to make sense of the solutions. Machine learning
algorithms are also preferred when the data is small.

Introduction to Convolution Neural Network


It is assumed that the reader knows the concept of Neural Networks. When it comes to Machine Learning, Artificial Neural Networks perform really well. Artificial Neural Networks are used in various classification tasks involving images, audio and text. Different types of Neural Networks are used for different purposes: for example, for predicting a sequence of words we use Recurrent Neural Networks, more precisely an LSTM; similarly, for image classification we use Convolution Neural Networks. In this section, we are going to build a basic building block for a CNN.
Before diving into the Convolution Neural Network, let us first revisit some concepts of Neural Networks. In a regular Neural Network there are three types of layers:

1. Input Layers: It’s the layer in which we give input to our model. The number of
neurons in this layer is equal to the total number of features in our data (number of pixels
in the case of an image).
2. Hidden Layer: The input from the Input layer is then fed into the hidden layer. There can be many hidden layers depending upon our model and data size. Each hidden layer can have a different number of neurons, which is generally greater than the number of features. The output of each layer is computed by matrix multiplication of the output of the previous layer with the learnable weights of that layer, followed by the addition of learnable biases and then an activation function, which makes the network nonlinear.
3. Output Layer: The output from the hidden layer is then fed into a logistic function
like sigmoid or softmax which converts the output of each class into the probability score
of each class.
The data is then fed into the model and the output of each layer is obtained; this step is called feedforward. We then calculate the error using an error function; some common error functions are cross-entropy, squared error, etc. After that, we backpropagate through the model by calculating the derivatives. This step is called backpropagation and is used to minimize the loss.
Here’s the basic python code for a neural network with random inputs and two hidden layers.
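What follows is a minimal sketch in Keras rather than the original snippet; the layer sizes, activations and randomly generated data are illustrative assumptions.

import numpy as np
import keras
from keras.layers import Dense

# Random inputs: 500 samples with 10 features each, and random binary labels.
x = np.random.rand(500, 10)
y = np.random.randint(0, 2, size=(500, 1))

model = keras.models.Sequential()
model.add(Dense(32, activation="relu", input_shape=(10,)))  # first hidden layer
model.add(Dense(16, activation="relu"))                     # second hidden layer
model.add(Dense(1, activation="sigmoid"))                   # output layer

# Feedforward + backpropagation: cross-entropy error minimized by gradient descent.
model.compile(optimizer="sgd", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(x, y, epochs=5, batch_size=32)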

Convolution Neural Network


Convolution Neural Networks, or ConvNets, are neural networks that share their parameters. Imagine you have an image. It can be represented as a cuboid having width and height (the dimensions of the image) and depth (as images generally have red, green, and blue channels).
Now imagine taking a small patch of this image and running a small neural network on it, with, say, k outputs, and representing them vertically. Now slide that neural network across the whole image; as a result, we will get another image with different width, height, and depth. Instead of just the R, G, and B channels, we now have more channels but smaller width and height. This operation is called Convolution. If the patch size were the same as that of the image, it would be a regular neural network. Because of this small patch, we have fewer weights.

[Image: the convolution operation. Source: Deep Learning Udacity]

Now let’s talk about a bit of mathematics that is involved in the whole convolution process.

• Convolution layers consist of a set of learnable filters (a patch in the above image). Every
filter has small width and height and the same depth as that of input volume (3 if the
input layer is image input).
• For example, if we have to run convolution on an image with dimensions 34 x 34 x 3, the possible size of the filters can be a x a x 3, where 'a' can be 3, 5, 7, etc., but small compared to the image dimension.
• During forward pass, we slide each filter across the whole input volume step by step
where each step is called stride (which can have value 2 or 3 or even 4 for high
dimensional images) and compute the dot product between the weights of filters and
patch from input volume.
• As we slide our filters we'll get a 2-D output for each filter, and we'll stack them together; as a result, we'll get an output volume having a depth equal to the number of filters. The network will learn all the filters. (The spatial size of each 2-D output can be checked with the small helper sketched after this list.)
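As a quick check of the arithmetic above, here is a small helper (an illustrative sketch, not from the original text) that computes the spatial size of the output for input width W, filter size F, padding P and stride S, using the standard formula (W - F + 2P) / S + 1.

def conv_output_size(w, f, p=0, s=1):
    """Spatial size of a convolution output: (W - F + 2P) / S + 1."""
    return (w - f + 2 * p) // s + 1

# A 34 x 34 x 3 image with a 3 x 3 x 3 filter, no padding, stride 1 -> 32 x 32 per filter.
print(conv_output_size(34, 3))  # 32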
Layers used to build ConvNets
A ConvNet is a sequence of layers, and every layer transforms one volume to another through a differentiable function.
Types of layers:
Let's take an example by running a ConvNet on an image of dimension 32 x 32 x 3.

1. Input Layer: This layer holds the raw input of the image with width 32, height 32, and
depth 3.
2. Convolution Layer: This layer computes the output volume by computing the dot product between all filters and image patches. Suppose we use a total of 12 filters for this layer; we'll then get an output volume of dimension 32 x 32 x 12.
3. Activation Function Layer: This layer will apply an element-wise activation function
to the output of the convolution layer. Some common activation functions are RELU:
max(0, x), Sigmoid: 1/(1+e^-x), Tanh, Leaky RELU, etc. The volume remains
unchanged hence output volume will have dimension 32 x 32 x 12.
4. Pool Layer: This layer is periodically inserted in the ConvNet and its main function is to reduce the size of the volume, which makes the computation faster, reduces memory use, and also helps prevent overfitting. Two common types of pooling layers are max pooling and average pooling. If we use a max pool with 2 x 2 filters and stride 2, the resultant volume will be of dimension 16 x 16 x 12.

5. Fully-Connected Layer: This layer is a regular neural network layer that takes input from the previous layer, computes the class scores, and outputs a 1-D array of size equal to the number of classes, as sketched in the code below.
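Putting the layers above together, here is a minimal Keras sketch of the 32 x 32 x 3 example; the 10-class output and the "same" padding (needed to keep the 32 x 32 spatial size) are illustrative assumptions.

import keras
from keras.layers import Conv2D, Activation, MaxPooling2D, Flatten, Dense

model = keras.models.Sequential()
model.add(Conv2D(12, kernel_size=(3, 3), padding="same", input_shape=(32, 32, 3)))  # 32 x 32 x 12
model.add(Activation("relu"))                                                       # 32 x 32 x 12
model.add(MaxPooling2D(pool_size=(2, 2), strides=2))                                # 16 x 16 x 12
model.add(Flatten())
model.add(Dense(10, activation="softmax"))  # class scores for a hypothetical 10-class problem
model.summary()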

Understanding 1D and 3D Convolution Neural Network | Keras

When we say Convolution Neural Network (CNN), generally we refer to a 2-dimensional CNN, which is used for image classification. But there are two other types of Convolution Neural Networks used in the real world, which are 1-dimensional and 3-dimensional CNNs. In this guide, we are going to cover 1D and 3D CNNs and their applications in the real world.
2 dimensional CNN | Conv2D

This is the standard Convolution Neural Network, which was first introduced in the LeNet-5 architecture. Conv2D is generally used on image data. It is called a 2-dimensional CNN because the kernel slides along 2 dimensions of the data, as shown in the following image.
[Image: kernel sliding over the image]

The whole advantage of using a CNN is that it can extract spatial features from the data using its kernel, which other networks are unable to do. For example, a CNN can detect edges, the distribution of colours, etc. in the image, which makes these networks very robust for image classification and other data that contain spatial properties.

Following is the code to add a Conv2D layer in keras.


import keras
from keras.layers import Conv2D

model = keras.models.Sequential()
model.add(Conv2D(1, kernel_size=(3, 3), input_shape=(128, 128, 3)))
model.summary()
Argument input_shape (128, 128, 3) represents the (height, width, depth) of the image. Argument kernel_size (3, 3) represents the (height, width) of the kernel, and the kernel depth will be the same as the depth of the image.

1 dimensional CNN | Conv1D

Before going through Conv1D, let me give you a hint. In Conv1D, the kernel slides along one dimension. Now let's pause and think: which type of data requires the kernel to slide in only one dimension and has spatial properties? The answer is time-series data. Let's look at the following data.

[Image: time series data from an accelerometer]

This data is collected from an accelerometer which a person is wearing on his arm. The data represent the acceleration along all 3 axes. A 1D CNN can perform activity recognition tasks from accelerometer data, such as whether the person is standing, walking, jumping, etc. This data has 2 dimensions: the first dimension is time-steps and the other is the values of the acceleration in the 3 axes.
The following plot illustrates how the kernel will move over the accelerometer data. Each row represents the time series acceleration for one axis. The kernel can only move in one dimension, along the axis of time.

[Image: kernel sliding over accelerometer data]

Following is the code to add a Conv1D layer in keras.


import keras
from keras.layers import Conv1D

model = keras.models.Sequential()
model.add(Conv1D(1, kernel_size=5, input_shape=(120, 3)))
model.summary()

Argument input_shape (120, 3) represents 120 time-steps with 3 data points in each time step. These 3 data points are the acceleration for the x, y and z axes. Argument kernel_size is 5, representing the width of the kernel, and the kernel height will be the same as the number of data points in each time step.

Similarly, 1D CNNs are also used on audio and text data, since we can also represent sound and text as time series data. Please refer to the images below.
[Image: text data as a time series]
Conv1D is widely applied to sensory data, and accelerometer data is one example of it.
3 dimensional CNN | Conv3D

In Conv3D, the kernel slides in 3 dimensions, as shown below. Let's think again: which data type requires the kernel to move across 3 dimensions?
[Image: kernel sliding on 3D data]

Conv3D is mostly used with 3D image data, such as Magnetic Resonance Imaging (MRI) data. MRI data is widely used for examining the brain, spinal cord, internal organs and much more. A Computerized Tomography (CT) scan is also an example of 3D data, which is created by combining a series of X-ray images taken from different angles around the body. We can use Conv3D to classify this medical data or extract features from it.
[Image: cross sections of 3D CT scan and MRI images]
One more example of 3D data is video. A video is nothing but a sequence of image frames together. We can apply Conv3D on video as well since it has spatial features.

Following is the code to add the Conv3D layer in keras.


import keras
from keras.layers import Conv3D

model = keras.models.Sequential()
model.add(Conv3D(1, kernel_size=(3, 3, 3), input_shape=(128, 128, 128, 3)))
model.summary()

Here, argument input_shape (128, 128, 128, 3) has 4 dimensions. A 3D image is 4-dimensional data, where the fourth dimension represents the number of colour channels, just like a flat 2D image is 3-dimensional data, where the 3rd dimension represents the colour channels. Argument kernel_size (3, 3, 3) represents the (height, width, depth) of the kernel, and the 4th dimension of the kernel will be the same as the colour channels.


Summary

• In a 1D CNN, the kernel moves in 1 direction. Input and output data of a 1D CNN is 2-dimensional. Mostly used on time-series data.

• In a 2D CNN, the kernel moves in 2 directions. Input and output data of a 2D CNN is 3-dimensional. Mostly used on image data.

• In a 3D CNN, the kernel moves in 3 directions. Input and output data of a 3D CNN is 4-dimensional. Mostly used on 3D image data (MRI, CT scans, video).

Some important algorithms


The Monty Hall Problem: Naive Bayes explained!

Examining the solution to the Monty Hall Problem, investigating the Naive Bayes Classifier,
and understanding the applications of Bayes Theorem.

“Let’s make a deal!” If you are interested in machine learning, then it is very plausible that you
have heard of Bayes Theorem and the Naïve Bayes classifier. It is also plausible that you have
heard of the Monty Hall Problem. If you have not heard of the Monty Hall Problem, then it is a
famous mathematical problem that came about from the game show, Let’s Make A Deal when
it was hosted by Monty Hall. Contestants on this show will be told to pick between 3 doors,
where 2 of the doors will have an unfavourable outcome. After the selection is made, Monty
will reveal what was behind one of the 2 unfavourable doors and then ask whether the contestant
would like to stick with their initial selection or switch to the remaining closed door. Because
one of the doors is now open, the contestant knows that the prize must be behind one of the two
closed doors. Thus, there is a 50/50 chance that they will get the prize by sticking with their
initial choice and it will not matter if they switch or not, right? Wrong. The mathematics shows
that if the contestant were to switch their choice, the probability of winning would increase to
2/3; and therefore, you should always switch if given the option. Okay, but why is this true?
And how does it relate to the Naïve Bayes? Well, both the Monty Hall Problem and the Naïve
Bayes classifier are rooted in Bayes Theorem and they highly depend on the likelihood of an
event occurring. So, the deal is that we will explore both topics throughout this article!

Suppose that instead of opening doors, we are blindly choosing marbles from a bag. Assume
that the bag has 1 red, 1 green and 1 blue, where the blue marble is the favourable outcome.
Each marble has an equally likely chance of being selected and the contestant picks one at
random but cannot know the colour that they chose. The host then investigates the bag and
shows an unfavourable outcome. There are now three possible scenarios:

1. The contestant picked the blue marble. The host can now either show the red or the green
since they are both unfavourable outcomes, and the probability that either colour is shown
is 50%. Despite what the host shows, the contestant wins if they choose not to switch.

2. The contestant picked the green marble. The host can now only show the red marble
because it is the only unfavourable outcome. This happens with certainty, and the contestant
can only win if they choose to switch.
3. The contestant picked the red marble. The host can now only show the green marble
because it is the only unfavourable outcome. This happens with certainty, and the contestant
can only win if they choose to switch.
The outcomes when the contestant chooses to switch are as follows:
· Blue = 1/3 + 1/3 = 2/3
· Green = 1/6
· Red = 1/6
This shows that the probability of winning when switching is 2/3 while losing is 1/3.
The outcomes when the contestant chooses not to switch are as follows:
· Blue = 1/6 + 1/6 = 1/3
· Green = 1/3
· Red = 1/3
This shows that the probability of winning when not switching is 1/3 while losing is 2/3.
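To make the argument concrete, here is a small Python simulation of the marble version (an illustrative sketch, with blue fixed as the favourable marble); over many trials, switching wins about 2/3 of the time and staying about 1/3.

import random

def play(switch, trials=100_000):
    wins = 0
    for _ in range(trials):
        marbles = ["blue", "green", "red"]  # blue is the favourable outcome
        pick = random.choice(marbles)
        # The host shows an unfavourable marble that the contestant did not pick.
        shown = random.choice([m for m in marbles if m != pick and m != "blue"])
        if switch:
            pick = next(m for m in marbles if m not in (pick, shown))
        wins += pick == "blue"
    return wins / trials

print("switch:", play(True))   # ~0.67
print("stay:  ", play(False))  # ~0.33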

The solution to the Monty Hall Problem works because initially, there are more ways of
selecting an unfavourable outcome than there are of favourable ones. This is the essence of
Bayes Theorem; new information should not determine beliefs in isolation. Rather, new
information should be used to update prior information. Bayes Theorem is applicable whenever
there exists a hypothesis, evidence relating to the hypothesis, and the question being asked is
“what is the probability of this hypothesis, given that the evidence is true”. Let us look at the
formula. The goal is to find this updated information, or the posterior, which is written
as Pr(Hypothesis | Evidence).

We need to determine the probability of the hypothesis before any evidence is considered. This is the prior information and is written as Pr(Hypothesis). In the case of the Monty Hall Problem, the prior is the fact that each marble has a 1/3 chance of initially being selected.
After this, we need to consider the probability that the evidence can be seen, given that the hypothesis is true: Pr(Evidence | Hypothesis). This is often referred to as the likelihood, and it intuitively represents the number of ways that an event can occur, given that a prior event is true. In the case of Monty Hall, it asks how many ways the host can show an unfavourable outcome, given the contestant's initial selection.

These numbers are multiplied because they represent the prior belief of the hypothesis and the
probability that the belief fits the evidence. Together, they provide the information necessary to
determine the answer to our question.

Now, it must be noted that the evidence can also occur without the hypothesis being true: Pr(Evidence | ~Hypothesis). When the two conditional probabilities are weighted by Pr(~Hypothesis) and Pr(Hypothesis) respectively and added, we obtain the total probability of the evidence being true: Pr(Evidence).

Lastly, we must divide the information we have by the total chance of the evidence to determine
the probability of the hypothesis, given that the evidence is true.
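Putting these pieces together, Bayes Theorem can be written in the notation used here as: Pr(Hypothesis | Evidence) = Pr(Evidence | Hypothesis) × Pr(Hypothesis) / Pr(Evidence), where Pr(Evidence) = Pr(Evidence | Hypothesis) × Pr(Hypothesis) + Pr(Evidence | ~Hypothesis) × Pr(~Hypothesis).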

The mechanism behind a Naïve Bayes classifier is based on this principle. Naïve Bayes is a
probabilistic classifier algorithm that assigns labels to instances based on probabilities. The term
naïve comes from the fact that this algorithm assumes strong independence between features. A
common application of Naïve Bayes is email classification, spam vs not spam. Let’s work
through an example:

· For simplicity, let’s assume that there are 4 words; “you”, “need”, “money” and “now”.
· Suppose for non-spam emails, Pr(word = you | email = non-spam) = 0.40; Pr(need | non-
spam) = 0.25; Pr(money | non-spam) = 0.20; and Pr(now | non-spam) =0.15

· Suppose for spam emails, Pr(word = you | email = spam) = 0.30; Pr(need | spam) = 0.10;
Pr(money | spam) = 0.45; and Pr(now | spam) =0.15

· Assume that Pr(email = spam) = 0.80, and Pr(non-spam) = 0.20

· Suppose that there is an email message “you need money money money now”, would it be
classified as spam or not spam?

· Using Bayes’ formula, Pr(spam | email) = ~93% and Pr(non-spam | email) = ~7%.

· It must be noted that Naïve Bayes does not calculate the actual probability. Instead, the formula
gives the email a score by determining the total information (and not dividing by the total
evidence) and then classifying the email into the class that had a higher score. The actual
probability was calculated for example purposes, but the same principles apply.

Therefore, since the posterior probability of the email being spam is higher than that of it being
non-spam, the email would be classified as spam. Notice, however, that if at least one of the
probabilities for the words in a spam email were 0, the email would have been classified as non-
spam instead. This fact can easily produce incorrect classifications when building models.
Because of this, a pseudo-count (an added count of 1) is usually added to each word within a set to ensure that 0 is never an output. The pseudo-count will not change the prior; it primarily ensures that the posterior is nonzero.
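A minimal Python sketch of this scoring procedure, using the word probabilities assumed in the example above; it reproduces the roughly 93% / 7% split mentioned earlier, and the comment indicates where the pseudo-count from the previous paragraph would apply.

# Word likelihoods and priors from the example above.
p_spam_words = {"you": 0.30, "need": 0.10, "money": 0.45, "now": 0.15}
p_ham_words = {"you": 0.40, "need": 0.25, "money": 0.20, "now": 0.15}
p_spam, p_ham = 0.80, 0.20

email = "you need money money money now".split()

def score(words, word_probs, prior):
    s = prior
    for w in words:
        # In practice a pseudo-count would replace the 0.0 used for unseen words.
        s *= word_probs.get(w, 0.0)
    return s

spam_score = score(email, p_spam_words, p_spam)
ham_score = score(email, p_ham_words, p_ham)

# Normalizing the two scores gives the posterior probabilities (~0.93 vs ~0.07).
total = spam_score + ham_score
print("Pr(spam | email) =", round(spam_score / total, 2))
print("Pr(non-spam | email) =", round(ham_score / total, 2))
print("classified as:", "spam" if spam_score > ham_score else "non-spam")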

So, now you are familiar with Bayes Theorem, you understand simple email classification, and
you know how to win on average at Let’s Make A Deal! But the main takeaway from this article
is understanding that information can be used to update prior beliefs to then make data-driven
decisions. Bayes Theorem is applicable in machine learning and can be used for various
processes, and the key to developing an appropriate model of this type is to understand the
theory from which it is derived.
The Ugly Duckling Theorem

The Space of All Possible Ducks

A common task in zoology consists of the creation of taxonomies for animal species. In
machine learning, we’re used to thinking of taxonomy in terms of classification problems
for supervised learning. In zoology, instead, we think of taxonomy as the problem of grouping
animals according to their similarity.
The underlying idea is that if two animals present a sufficient number of similar features,
then the two animals belong to the same taxonomic group. For example, we may want to
study the pair-wise similarity between an ugly duckling and two beautiful swans:

The intuitive idea that we have is that, of course, the two swans are very similar to one another,
while the ugly duckling is the odd one out. The ugly duckling theorem tells us however, that
this isn’t necessarily the case.
The Abstract Notion of a Duck
Ducks live in a world of abstractions but can also be treated as abstract concepts
themselves. Let’s imagine we’re coding an abstract class called “duck”, to which we associate
a finite number of boolean features:

These features can represent, for example, the color, the size, or the beak of a duck, or any
other physical, behavioral, or psychological characteristics of the waterfowl.
Let’s now say that we want to group some ducks in classes, but we have no preconceived idea
as to which features are more important than others. If this is the case, we can list all possible
combinations of abstract features and the values potentially associated with them. We can then
pick any arbitrary feature and sort lexicographically the string of bits that describe that
feature first and all other features second.
Let's use for simplicity a value n = 2, and list all possible 2^n = 4 combinations: 00, 01, 10, 11 (ducks number 1 through 4).
For a given duck in the space of all possible ducks, we can then ask ourselves the question:
what ducks are most dissimilar to it? We can answer this question by computing
the Hamming distance between all pairs of strings. If we start from, say, duck number 1, we
can see that duck number 4 is the one with the most dissimilar digits, and specifically two.
This graph represents the distance between the abstract ducks in duck-space, according to their
Hamming distance:

Starting from any given duckling, the ugly duckling is the one that’s most dissimilar to
the first, as calculated in this manner. In the case above, the ugly duckling in relation to the
first abstract duck is duck number 4.
The Similarity of Concrete Ducks
The restriction of comparing only individual features, however, is largely arbitrary. It might
have, in fact, been more informative to select multiple features out of the available. or, indeed
a boolean function that combines them all. Because we don’t have a criterion for selecting
which boolean function to use, however, the only unbiased approach is for us to select all
of them.
There are 2^(2^n) boolean functions for a feature vector of size n. In this context, we can also think of n as the arity of the boolean functions that we consider. If we start with the two features that we saw earlier, then we can generate, for each duck, a binary string of size 2^(2^2) = 16 that corresponds to the output of all boolean functions.
Let's take an example to clarify this further. We can encode the first feature with the proposition "it's smiling", and the second with the proposition "it's wearing a top-hat", and measure these features in the three ducks we represented above.
This is the representation of all possible boolean functions comprising the three ducks:
We can now notice how each duck has exactly 2^(2^n - 1) bits in common with any other duck, and 2^(2^n - 1) bits that differ from it. In other words, we can see that the pair-wise Hamming distance between the vectors associated with all possible boolean functions is always equal to 2^(2^n - 1). That is to say, all ducks are equally similar, or equally dissimilar, to any other.
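A small Python sketch (illustrative, for n = 2 features) that enumerates all 2^(2^n) = 16 boolean functions and confirms that every pair of distinct ducks agrees on exactly 2^(2^n - 1) = 8 of them:

from itertools import product

n = 2
inputs = list(product([0, 1], repeat=n))               # the 2^n = 4 possible feature vectors
functions = list(product([0, 1], repeat=len(inputs)))  # the 2^(2^n) = 16 boolean functions

ducks = [(0, 0), (0, 1), (1, 1)]  # three distinct ducks, each described by its feature vector

def signature(duck):
    # Output of every boolean function on this duck, as a vector of 16 bits.
    idx = inputs.index(duck)
    return [f[idx] for f in functions]

for i in range(len(ducks)):
    for j in range(i + 1, len(ducks)):
        a, b = signature(ducks[i]), signature(ducks[j])
        agree = sum(x == y for x, y in zip(a, b))
        print(ducks[i], ducks[j], "agree on", agree, "of", len(functions), "functions")
# Every pair agrees on exactly 8 of the 16 functions: all ducks are equally similar.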

This property is independent of the specific value that n assumes. Increasing the number of features we select as meaningful increases the number of boolean functions accordingly. The ratio between the number of similar and dissimilar bits, however, remains equal to 1.

Notice that this is also agnostic with regard to the specific features that we select for our measurements. The only condition we have to respect is that no two ducks have the same feature vector, but this condition is always satisfied. It is, in fact, always true if all ducks are discernible as distinct objects.

All Ducks Are Beautiful

The conclusion of the argument we studied here is that, of course, there are no ugly ducklings
at all. All ducklings are beautiful on their own, and it’s their individuality that makes
them so.
If we bring this back to the problem of classification and bias in machine learning, we can now
understand why a classification without some kind of bias is impossible. We need a rule or an
indication of some kind that tells us what features to favor over what others. In the absence of
that, we’re bound to consider any two observations as equally similar or dissimilar to any
other.
