
Deep Learning for Beginners

An Easy Guide to Go Through the Artificial Intelligence Revolution that Is Changing the Game, Using Neural Networks with Python, Keras and TensorFlow
To Enza and Angelomaria
© Copyright 2019 - All rights reserved.
The content contained within this book may not be reproduced, duplicated or transmitted without
direct written permission from the author or the publisher.
Under no circumstances will any blame or legal responsibility be held against the publisher or author for any damages, reparation, or monetary loss due to the information contained within this book, either directly or indirectly.
Legal Notice:
This book is copyright protected. This book is only for personal use. You cannot amend, distribute,
sell, use, quote or paraphrase any part, or the content within this book, without the consent of the
author or publisher.
Disclaimer Notice:
Please note that the information contained within this document is for educational and entertainment purposes only. Every effort has been made to present accurate, up-to-date, reliable, and complete information. No warranties of any kind are declared or implied. Readers acknowledge that the author is not engaging in the rendering of legal, financial, medical, or professional advice. The content within this book has been derived from various sources. Please consult a licensed professional before attempting any techniques outlined in this book.
By reading this document, the reader agrees that under no circumstances is the author responsible for
any losses, direct or indirect, which are incurred as a result of the use of information contained within
this document, including, but not limited to, — errors, omissions, or inaccuracies.

Table of Contents
Introduction to Deep Learning
Chapter 1. How to Create and Train Deep Learning Models

Chapter 2. Deep Learning Uses

Chapter 3. How to Train Deep Neural Nets

Chapter 4. Deep Neural Networks

Chapter 5. The Basics of Using the TensorFlow Libraries

Chapter 6. Deep Learning with TensorFlow

Chapter 7. Linear Regression with Python

Chapter 8. Decision Trees to Handle Your Regression Problems

Chapter 9. Self-Organizing Maps

Chapter 10. Presentation of Deep Learning

Chapter 11. Deep Q-Learning

Chapter 12. Recursive Neural Tensor Networks

Chapter 13. Optimizers

Chapter 14. The Future of Deep Learning

Conclusion
Introduction to Deep Learning

Deep learning is a complex subject, and it can be difficult to understand at first. If that describes you, don't feel discouraged. Keep this book close by and read it in small bites. Eventually, the ideas will fall into place, and if you're seriously interested, don't be afraid to reach out for more knowledge. Thank you for walking with us into the world of machine learning, artificial intelligence, and deep learning.
Companies such as Google and Facebook are trying to identify spoken words in order to categorize them. They are also trying to help machines identify the relationships between the objects in a training data set and assess the relationships between different variables or data points.
For example, if you want a computer to interpret an image as "this is an elephant" rather than "this is a collection of pixels", you must find a way to map simple features of the elephant to more complex ones. Lines, curves, pixels, the sounds of letters, and much more can all be handled this way if you know how to transform the features of an entity into ones the machine can recognize. The machine can then use indexing or inference to predict the output. This type of learning is called "deep learning".
Deep learning models use neural networks to produce their outputs. In this type of learning, nodes in different layers act as inputs, and a signal from the input layer is sent to the hidden layers in the network. The hidden layers then use that input to calculate or derive the output. Deep learning is inspired by how the human mind learns; it also draws on how calculations and computations take place in the cerebral cortex of the human brain.
Every connection in the model is assigned a weight. For instance, if you are using the model to identify or classify images, a weight is applied to every pixel of the image used as input. You should also include, in the training data set, the output value you want the machine to produce. If the output is not the same as the one in the training data set, an error signal is propagated back toward the input layer. This means that the weights assigned to the connections will need to be updated.
The changes in these weights will help the user steer the network towards
the right output. The signals sent from one side of the neural network to the
other help the machine determine the correct values that must be provided
as the output. A system can use deep learning either in a supervised or
unsupervised mode.
Supervised Modes

The neural network is taught with a labeled training data set. In this type of learning, you provide the output layer with the values (labels) associated with each input category. When a similar data set is later used as the test data set, the network applies what it has learned at the output layer to provide the user with the desired output.
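To make the supervised mode concrete, here is a minimal, hypothetical sketch using Keras (the library used later in this book): a tiny classifier trained on labeled inputs. The layer sizes and the randomly generated data are illustrative assumptions, not something taken from this chapter.

import numpy as np
from keras.models import Sequential
from keras.layers import Dense

# Illustrative labeled data: 100 samples with 4 features and a binary label.
X_train = np.random.rand(100, 4)
y_train = np.random.randint(0, 2, size=(100, 1))

# A small supervised network: the output layer is trained against the labels.
model = Sequential()
model.add(Dense(8, activation='relu', input_shape=(4,)))
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# The labels (y_train) supervise the learning of the weights.
model.fit(X_train, y_train, epochs=5, verbose=0)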
Unsupervised Modes

In this mode, both the input and the output layers of the network are given the same examples that you are processing. Because the inner, or hidden, layers of the network compress the data, only the most important features of the data set are retained. The network then uses the values produced by the inner layer as its output.
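A simple way to picture the unsupervised mode is an autoencoder, where the network is asked to reproduce its own input through a narrow hidden layer. The Keras sketch below is an illustrative assumption of what such a setup might look like; the dimensions and random data are arbitrary.

import numpy as np
from keras.models import Sequential
from keras.layers import Dense

# Unlabeled data: the input is also used as the target.
X = np.random.rand(200, 16)

autoencoder = Sequential()
autoencoder.add(Dense(4, activation='relu', input_shape=(16,)))   # compressed inner layer
autoencoder.add(Dense(16, activation='sigmoid'))                  # reconstruction
autoencoder.compile(optimizer='adam', loss='mse')

# The same examples are fed to both the input and the output side.
autoencoder.fit(X, X, epochs=5, verbose=0)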
Scientists and engineers are now spending the time to understand deep
learning systems since it helps them identify the features that a network can
support. This also helps the engineer or programmer understand how the
different features in the data can be combined to derive the necessary
output.
The disadvantage of these techniques is that they can be opaque. It is hard for most systems to identify and report which features of the data set they relied on. This makes it extremely difficult for the model to explain the method it used, an ability such systems ideally should have. As a result, the model may present you with outputs or inferences that it cannot explain. In that case, you will need to dig deeper into the model to understand how it derived a specific output.
Most engineers and developers help machines learn using different machine
learning techniques. Deep learning is one of the many techniques used to
help a machine do this by example. It is one of the key technologies used to
help a car navigate through the streets without a driver; cars can now
identify stop signs, pedestrians, and lampposts. Deep learning is also used
in voice control devices like phones, tablets, televisions, and other hands-
free devices, like Alexa. Through deep learning, the technology industry
can achieve results that were never possible before.
Computer models using deep learning techniques learn to perform classification tasks on text, sound, or images. A deep learning model can sometimes reach a level of accuracy that matches or exceeds human performance. Engineers train these models using large labeled data sets, known as training data, together with neural network architectures.
How Does Deep Learning Work?

The neural network architecture is used in most deep learning techniques, and it is for this reason that deep learning models are known as deep neural networks. The term "deep" refers to the many hidden layers in the network.
A traditional neural network typically has only two or three hidden layers, but a deep neural network can have 150 or more. Deep learning models use large training data sets and neural networks to learn features directly from the data, so the engineer does not need to manually extract features to train the machine.
One of the most common types of deep networks is called the convolutional
neural network, or CNN. This type of network is well suited to process two-
dimensional data, such as images. A CNN convolves learned features with the input data and uses two-dimensional convolutional layers to process the information.
A CNN eliminates the need to extract features from data manually. This
means that the engineer does not need to classify the features or identify the
images to assist the network in categorizing them. The CNN extracts the
features directly from the images or input data. The engineer does not train
the data to choose some features from the information provided to it. The
CNN learns what features it needs to look for when the engineer feeds it the
training data set. It is for this reason that computers or models with CNN
are used to classify objects.
A CNN learns to identify the different characteristics and structures of an
image. It does this by using the many hidden layers within the network.
Every hidden layer identifies features of the image, and the complexity of those features increases as the hidden layers increase. For example, an early hidden layer might detect simple features such as edges or colors, while the final layers can identify entire objects.
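As an illustration of this layered structure, here is a minimal Keras sketch of a small CNN for two-dimensional image data. The specific layer sizes and the 28×28 input shape are assumptions made for the example, not something prescribed by the text.

from keras.models import Sequential
from keras.layers import Dense, Flatten, Conv2D, MaxPooling2D

model = Sequential()
# Early convolutional layers pick up simple features such as edges.
model.add(Conv2D(16, (3, 3), activation='relu', input_shape=(28, 28, 1)))
model.add(MaxPooling2D(pool_size=(2, 2)))
# Deeper layers combine those features into more complex patterns.
model.add(Conv2D(32, (3, 3), activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Flatten())
model.add(Dense(10, activation='softmax'))  # one output per object class
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])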
Why Deep Learning Is Better than Traditional Learning
Methods

Deep learning is a type of machine learning. Let us consider an instance where an engineer is training a machine to categorize images, once with a classic machine learning model and once with a deep learning model. In classic machine learning, the engineer trains the model by extracting the relevant features from the training data set and providing that information to the machine. The machine then uses these features to categorize the objects present in the images. Deep learning, by contrast, performs "end-to-end" learning: the network is given the training data set and is asked to perform a task, such as classification, and it learns to do this without the help of the engineer.
Another difference is that shallow learning algorithms plateau as more data is added, while a deep learning algorithm continues to scale with data. In
other words, the hidden layers in the deep neural network continue to learn
and improve in their functioning as the size of the data set increases.
Shallow learning algorithms refer to those machine-learning algorithms that
plateau at a specific performance level when the engineer adds more
training data or examples to the network.
Therefore, in classic machine learning, the engineer must provide the machine with hand-picked features and a classifier to sort images, while with deep learning, the machine learns to perform these functions by itself.
Choosing Between Deep Learning and Machine Learning

Machine learning algorithms offer different models and techniques that the engineer can choose from depending on the application, the type of problem the machine should solve, and the data being processed. Deep neural networks require a huge volume of data to train the model, as well as graphics processing units, or GPUs, which process the data quickly.
When choosing between deep neural networks and classic machine learning, consider whether you have a large amount of labeled data and a high-performance GPU. If you do not have both, you should stick to classic machine learning algorithms. Since deep learning is more complex, you'll need at least a few thousand labeled examples to ensure that the algorithm produces the necessary output. If you have a high-performance GPU, the model will be able to analyze the data quickly.
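If you want to check whether TensorFlow can see a usable GPU before committing to a deep learning approach, a quick check like the one below works with the TensorFlow 1.x-style API used later in this book; treat it as an optional convenience, not part of the decision rules above.

import tensorflow as tf

# Returns True if TensorFlow can find a GPU it is able to use.
print("GPU available:", tf.test.is_gpu_available())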
Chapter 1.
How to Create and Train Deep Learning Models

This chapter provides information on the three ways an engineer can train a
deep learning model to classify objects, including training from scratch,
transfer learning, and feature extraction.
Training from Scratch

If you want to train the deep neural network from scratch, you should
collate a large volume of labeled data. You must then design the network to
ensure that it will learn all the features in the data set. This is a good
practice for new applications or for ones that have multiple output
categories. Most engineers do not use this approach, since the network can take days or weeks to train due to the large volume of training data.
Transfer Learning

Most engineers train deep learning networks using the transfer learning
approach. In this process, a pre-existing or pre-trained model is fine-tuned.
You can start with networks like GoogLeNet and AlexNet and feed them new data containing previously unseen classes. You're also able
to make a few tweaks to the network, which will allow the network to
perform new tasks. Therefore, you will not require large volumes of data to
train the network, in turn reducing the computation time to a few hours or
minutes.
If you want to use transfer learning to train a model, you will need an
interface that allows you to connect to the pre-existing network. This will
help you enhance and modify the network to perform the task. Software like
MATLAB has functions and tools that you can use for this very purpose.
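The text mentions MATLAB, but the same workflow is available in Python. The sketch below is a hypothetical Keras example of transfer learning: it loads a pre-trained image network (VGG16 from keras.applications, used here purely as an illustration), freezes its learned layers, and adds a new classifier head for your own classes.

from keras.applications import VGG16
from keras.models import Model
from keras.layers import Dense, Flatten

# Load a network pre-trained on ImageNet, without its original classifier.
base = VGG16(weights='imagenet', include_top=False, input_shape=(224, 224, 3))

# Freeze the pre-trained layers so only the new layers are trained.
for layer in base.layers:
    layer.trainable = False

# Add a small new head for, say, 5 new classes.
x = Flatten()(base.output)
x = Dense(64, activation='relu')(x)
predictions = Dense(5, activation='softmax')(x)

model = Model(inputs=base.input, outputs=predictions)
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
# model.fit(...) would then be called on your new, smaller data set.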
Feature Extraction

This is a specialized and slightly less common approach to training a deep neural network. Each layer in the network learns to identify different features in the large data set. You can extract these features from the network at any point during the training process and use them to train a support vector machine or another machine learning model.
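As a hedged illustration of this idea, the sketch below pulls activations from an intermediate layer of a pre-trained Keras network and feeds them to a scikit-learn support vector machine. The choice of VGG16, the layer name 'block5_pool', the random stand-in data, and the use of scikit-learn are all assumptions made for the example.

import numpy as np
from keras.applications import VGG16
from keras.models import Model
from sklearn.svm import SVC

# Use an intermediate layer of a pre-trained network as a feature extractor.
base = VGG16(weights='imagenet', include_top=False, input_shape=(224, 224, 3))
extractor = Model(inputs=base.input, outputs=base.get_layer('block5_pool').output)

# Illustrative images and labels (in practice these come from your data set).
images = np.random.rand(20, 224, 224, 3)
labels = np.random.randint(0, 2, size=20)

# Extract the features, flatten them, and train a classic SVM on top.
features = extractor.predict(images).reshape(20, -1)
svm = SVC(kernel='linear')
svm.fit(features, labels)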
Chapter 1 Summary

● A massive quantity of labeled data forms an effective


starting point to train deep neural networks from
scratch.
● Transfer learning offers a cost-effective way of training deep learning networks by taking pre-trained, pre-existing models and fine-tuning them to meet the targeted goals.
● Although not widely employed, feature extraction offers
a specialized training approach to deep neural networks.
Chapter 2.
Deep Learning Uses

If neural networks approach the way that humans think, deep learning takes the idea a step further. Neural networks and artificial neurons have a long history dating back to the 1950s. But when they were first introduced, computing power was limited, and ANNs were seen as research toys rather than serious, business-facing algorithms. When computing power improved, ANNs received renewed interest from AI researchers as well as large internet companies. We are only now emerging from the AI research winter that persisted through the 1980s and 90s. Part of this is due to computing power, and the other part is due to the introduction of the web and the massive amounts of data it generates. Today, deep learning is at the forefront of AI research and continues to make progress, helping the field thaw from the cold.
Two things limit an artificial neural network. The first is the computing power necessary to simulate layers of artificial neurons, and the second is a combination of available data and feature selection. Even the most powerful ANN clusters today are still orders of magnitude behind the raw computing power of the brain. The introduction of powerful graphics processing units (GPUs) for machine learning greatly increased the available computing power across the board. GPUs are faster than CPUs for this kind of work because they favor many smaller, more efficient cores over the CPU's powerful but bulky ones. They also allow computational tasks to run in parallel across many threads and are well suited to floating-point arithmetic (decimal numbers). Though GPUs were intended for rendering 3D graphics at hundreds of frames per second, they have been adopted by the AI community for processing large deep learning projects.
So what exactly is deep learning? As the name implies, it involves creating additional layers of depth beyond what traditional ANNs have. The argument goes that if the brain is made up of layers upon layers of neurons, how could flimsy ANNs with a single hidden layer be capable of simulating intelligence? Such networks are sometimes called "shallow" neural networks to differentiate them from those with multiple layers. Now that GPUs are extremely fast and getting better every year, deep learning no longer requires entire data centers or neural net clusters to train models.
Returning to the human brain motif, deep learning takes after the tendency
for complex ideas to fire deep in the folds of the brain, rather than at
superficial levels. Recognizing edges of pictures and tiny details fires
neurons closer to the surface of the brain, while recognizing larger
constructs like a person's face fires deeper. More layers generally mean better, more intelligent systems. Information passes from the input neurons to additional hidden layers, which also pass their outputs on to one another. The more of these hidden layers there are, the better the results of machine learning tend to be. This is why deep learning is capable of tackling problems in artificial intelligence that shallow learning has traditionally struggled with, including computer vision, voice recognition, and language processing.
The true power of deep learning comes from its non-linear processing of features. Traditional machine learning techniques mostly use linear models and are burdened by the feature engineering phase. With deep learning, features don't need to be picked out by a field expert. Instead, many different features are learned per model, contributing to the overall complexity of the neural net. A traditional classification task may have used only two or
three features, but the deep learning equivalent is to use as many as the data
affords. For example, to detect whether an object on the road should be
considered a vehicle obstacle, a shallow machine learning system can use
the shape of the object and its speed as factors. Such a system may perform
well in the short run, successfully identifying different makes and models of
cars whether in motion or parked. However, the system may encounter
some unspecified behavior like a large carnival float moving relatively
slowly around many pedestrians. Using shape and speed alone would not be
enough to classify it. In contrast, a deep learning system may use several
different factors in addition to shape and speed. As baseline inputs, shape
and speed get passed to the deeper layers where they may also compare
proximity to a road, the presence of pedestrians, orientation, distance from
the camera, and so on. These additional factors make the model take longer to train and make it more computationally expensive, but it will be better at identifying vehicles in the long run.
Consequently, the scope of traditional machine learning has its limits. There
is a point where introducing more labeled data doesn’t result in better
performance of the system. With deep learning, though, adding more data
directly leads to better performance. It is mainly an issue of scale: one can scale well to a large number of inputs, and the other cannot. It is no wonder that data-rich companies like Facebook and Google use it. Their main value as companies comes from the data they acquire, and deep learning allows them to exploit it, gain insights, and ultimately profit from it. The reason deep learning algorithms scale so well is that they are well suited to analog-like data spanning many features. Data like images, audio recordings, unlabeled text, and video footage are very different to work with than neat tabular data. Deep learning algorithms are particularly good at forming hierarchical representations of features from these types of data. Since the features don't
need to be chosen beforehand, deep learning algorithms can learn to form
classes of features on their own. Higher level learned features will be
defined in terms of the lower level learned features. Returning to the vehicle
identification example, a low-level feature may be some small defining
aspect of the car like the rear windshield. A higher-level feature is a
collection of these, like bumper size, indicator light positions, and license
plate area, used to identify different makes. A car with a higher windshield
and blocky appearance may be an SUV class, whereas something that is
close to the ground may be a sedan class.
Deep learning neural networks work a bit differently from regular ANNs.
One class of these networks is called convolutional neural nets (CNNs); they are used primarily for image recognition. Like other neural networks, they
are designed after biological processes in the brain. Animals use their visual
cortex to perceive light through individual cortical neurons. Each of these
neurons corresponds to receptive fields that overlap in the retina. CNNs
work in a similar fashion. They consist of an input and output layer plus
additional hidden layers in between. The hidden layers use something called
convolution to process their inputs. Put simply, convolution combines two distinct functions to create a third function that expresses how one modifies the other. Convolution is used to group pixels together from the
beginning so that the net already has an idea of how the big picture fits
together. It is easier to form a hierarchical structure of features as well.
These networks first recognize small edges of the image as the smallest
possible feature. Each layer progressively adds another edge or midsection
to the hierarchical data representation until the whole image is learned.
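To make the idea of convolution a little more tangible, here is a small, illustrative Python sketch that slides a 3×3 edge-detection kernel over a tiny grayscale image using plain NumPy. The image and kernel values are invented for the example; real CNN layers learn their kernels during training.

import numpy as np

# A tiny 5x5 "image" with a vertical edge down the middle.
image = np.array([[0, 0, 0, 1, 1],
                  [0, 0, 0, 1, 1],
                  [0, 0, 0, 1, 1],
                  [0, 0, 0, 1, 1],
                  [0, 0, 0, 1, 1]], dtype=float)

# A 3x3 kernel that responds strongly to vertical edges.
kernel = np.array([[-1, 0, 1],
                   [-1, 0, 1],
                   [-1, 0, 1]], dtype=float)

# Slide the kernel over the image and sum the element-wise products.
output = np.zeros((3, 3))
for i in range(3):
    for j in range(3):
        output[i, j] = np.sum(image[i:i+3, j:j+3] * kernel)

print(output)  # large values mark where the edge is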
Despite their increased accuracy, deep neural networks suffer from a
number of drawbacks. Just because deep learning is state of the art, it
doesn’t mean it should be generalized to every conceivable machine
learning problem. For many tasks, shallow neural networks are the
preferred option. However, large companies like Facebook regularly
employ deep learning because they have the requirement, data, and
computational resources to perform it. Facebook has said that it uses around a billion digital images to teach its deep learning systems. Smaller
players in the AI scene do not have the processing power to work on that
scale. But then again, Facebook is one of the biggest companies out there.
Google demoed how powerful some of its systems are a few years ago.
Their system purportedly consisted of one billion connections. It was
trained using YouTube data and could accurately recognize cats in the
videos, yellow flowers, and other images. It is interesting to note that none
of these features were selected or programmed outright. Their neural
networks identified them through the hierarchical data representation.
Furthermore, it could distinguish between 22,000 different categories of images with roughly 17% accuracy. That level of accuracy is quite astounding
once you consider the number of categories and how the system learned
without any human first labeling the data. This accuracy could be increased
to 50% if the number of categories was lowered to 1,000.
Today, if some artificial intelligence system is at the bleeding edge, it is
probably using deep learning. Virtually all of the big Silicon Valley tech
companies are using it. When you use Google Translate on some arbitrary
string, you are using a deep learning system. Every time you fire up your
Amazon Echo to speak with Alexa, you are using deep learning. Google
uses it to tailor your search experience to fit your personal interests. Over
time, it has developed a database of knowledge dubbed the “Knowledge
Graph” that contains some 570 million different entities and 70 billion facts.
It is used along with Google Search to more accurately represent the data
that a user may be looking for through their queries. For example, if you
look up the name of a past US president, Knowledge Graph accumulates the
relevant data and displays it in the sidebar. These small snippets of relevant
data are gathered from sources across the web like Wikipedia and the CIA
Factbook. Google says the information provided through the Knowledge
Graph is capable of answering one-third of its 100 billion monthly user
queries. And if you are ever lucky enough to ride in a driverless car – yep,
it’s also thanks to deep learning technology.
In summary, here are some of the main applications of deep learning.
Automatic Colorization of Images

You can now use deep learning networks to automatically add color to black-and-white photographs. A deep learning network will identify the objects in the image and their context within the photograph, then add color to the image using that information. This is a highly impressive feat. The capability relies on very large, high-quality convolutional neural networks trained on data sets such as ImageNet, and the approach uses supervised layers and CNNs to recreate an image by adding color to it.
Adding Sound to Silent Movies

In this task, the deep neural network must develop or recreate sounds that
will match a silent video. Let’s look at the following example. We need the
network to add sounds to a video where people are playing drums. The
engineer will provide the network with close to 1,000 videos with the sound
of the drum striking many surfaces. The network will identify the different
sounds and associate the video frames from the silent video or movie with
the pre-recorded sounds and then select the sound from the database that
matches the video in the scene. The system is then evaluated with a Turing-style test in which human beings are asked to differentiate between the real and the synthesized audio. Both CNN and LSTM neural networks are used to perform this application.
Automatic Translation

Deep neural networks can translate words, phrases, or sentences from one
language to another automatically. This application has existed for a long time, but the introduction of deep neural networks has helped it achieve great results in certain areas, namely the translation of text and of images that contain text.
For the translation of text, the engineer does not have to run the input through a special pre-processing sequence. This allows the algorithm to identify the dependencies between the words in a sentence and map them to a new language. The stacked networks in a large LSTM recurrent neural network are used for this purpose.
For images, once the network identifies the letters they contain, it can transform them into text, translate the text into a different language, and recreate the image with the translated text. This process is known as instant visual translation.
Object Detection and Classification in Images

In this application, the deep neural network identifies and organizes the
objects in an image by classifying the images into a set of previously known
objects. Very large CNNs have been used to achieve accurate results when
compared to the benchmark examples of the problem.
Automatic Handwriting Generation

For this application, the engineer must feed the deep neural network with a
few handwriting examples. This helps the network generate new handwriting for a given word, phrase, or sentence. The data set that the
engineer feeds the network should provide a sequence of coordinates that
the writer uses when writing with a pen. From this data set, the network
identifies and establishes a relationship between the movement of the pen
and the letters in the data set. The network can then generate new
handwriting examples. What is fascinating is that the network can learn
different styles and mimic them whenever necessary.
Automatic Text Generation

For automatic text generation, the engineer will feed the network with a
data set that only includes text. The network learns it and can generate new
text character-by-character or word-by-word. The network can learn how to
punctuate, spell, form sentences, differentiate between paragraphs, and
capture the style of the text from the data set.
An engineer will use large recurrent neural networks to perform this task.
The network establishes the relationship between the items in the many
sequences in the data set and then generates new text. Most engineers
choose the LSTM recurrent neural network to generate text since these
networks use a character-based model and generate only one character at a
time.
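As a hedged sketch of what such a character-based model can look like in Keras, the snippet below builds a tiny LSTM that predicts the next character from a fixed-length window of previous characters. The window length, layer sizes, and toy text are assumptions chosen only for illustration.

import numpy as np
from keras.models import Sequential
from keras.layers import LSTM, Dense

text = "deep learning generates text one character at a time. "
chars = sorted(set(text))
char_to_idx = {c: i for i, c in enumerate(chars)}

window = 10  # number of previous characters used to predict the next one
X, y = [], []
for i in range(len(text) - window):
    X.append([char_to_idx[c] for c in text[i:i + window]])
    y.append(char_to_idx[text[i + window]])

# One-hot encode inputs and targets: X has shape (samples, window, vocabulary).
X = np.eye(len(chars))[np.array(X)]
y = np.eye(len(chars))[np.array(y)]

model = Sequential()
model.add(LSTM(64, input_shape=(window, len(chars))))
model.add(Dense(len(chars), activation='softmax'))
model.compile(optimizer='adam', loss='categorical_crossentropy')
model.fit(X, y, epochs=5, verbose=0)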
Automatic Image Caption Generation

As the name suggests, when given an image, the model must describe the
contents of the image and generate a caption. In 2014, many deep-learning
algorithms used the models for object detection and object classification in
images to generate a caption. Once the model detects objects in the image
and categorizes them, it will need to label those objects and form a coherent
sentence. This is an impressive application of deep learning. The models
used for this application utilize large CNNs to detect the objects in the
images and a recurrent neural network like the LSTM to generate a
coherent sentence using the labels.
Automatically Playing a Game

This is an application where a machine learns how it can play a game using
only the pixels on the screen. Engineers use deep reinforcement models to
train a machine to play a computer game. DeepMind, which is now a part of
Google, works primarily on this application.
Chapter 2 Summary

● Deep learning differs from traditional machine learning in its non-linear approach, where many features are learned in every model, contributing to the complexity of the resulting neural net.
● Deep learning seeks to push through the limitations of traditional machine learning.
● Convolutional neural nets are a core example of deep learning approaches.
● Deep learning is applied to adding sound to silent films, automatic colorization of images, automatic translation, image caption generation, game playing, text generation, and more.
Chapter 3.
How to Train Deep Neural Nets

So far, we haven't worked with a very deep DNN, only a shallow one really, with just two hidden layers. So, what would you do if you had a really complex task? Perhaps you need to identify multiple objects in multiple high-resolution images. In this case, we need a DNN that goes much deeper, maybe one with 10 layers or more. Each layer would have many hundreds of neurons, and each neuron would have thousands of connections. This is not going to be easy.
The first thing you are going to find is the problem of vanishing gradients
or you may even face the problem of exploding gradients. Both of these
have a profound effect on deep neural networks and they make it very hard
to train the lower layers.
Secondly, pretty obvious really, training would be excruciatingly slow on
such a large network.
Third, the number of parameters in such a network would reach the millions, and we would run a very high risk of overfitting.
How do we get over all of this? Well, that’s what we are going to discuss in
this chapter; we’ll go through each of the problems, one at a time, and look
at some techniques that could help solve them.
We’ll start by looking at the issue of vanishing gradients and consider some
solutions. We’ll move on to some of the optimizers that we could use to
speed the training up in comparison to the vanilla Gradient Descent. Lastly,
we will look at a few of the more commonly used regularization techniques
for the larger neural networks. These tools will provide everything you need
to start training the very deep nets. Welcome to the world of Deep Learning.
The Problems of Vanishing and Exploding Gradients

Basically, the error gradient is propagated by the algorithm as it goes from


the output to the input layer. When the cost function gradient has been
computed for each of the network parameters, those gradients are used to
apply a Gradient Descent step to update the parameters.
The problem we have is that, as the algorithm moves through the layers
down to the lower ones, the gradients become smaller and the Gradient
Descent update may not change the connection weights on the lower levels.
As a result, the training can never converge to a suitable solution. This is the vanishing gradients problem, but it can also work the other way around: the gradients may grow in size, giving many of the layers extremely large weight updates and causing the algorithm to diverge (the exploding gradients problem).
More often than not, the DNNs suffer more from instability in the gradients.
What this means is that the layers may each learn at completely different
speeds. This is one of the reasons why work on DNNs was left untouched
for so long, but from around 2010 onwards, progress on understanding it
started to move. Xavier Glorot and Yoshua Bengio published a paper called
“Understanding the Difficulty of Training Deep Feedforward Neural
Networks” and in this paper, they detailed a number of suspects that they
had found. One of those suspects was the combination of two popular techniques: the logistic sigmoid activation function and the then-standard weight initialization technique, namely random initialization drawn from a normal distribution with a mean of 0 and a standard deviation of 1.
What they showed was that by using this combination of initialization and
activation there is a greater output variance of each layer than there is input
variance. As you move forward through the network, that variance
continues to increase, layer by layer, until saturation of the activation
function occurs at the top layers. What makes this worse is that the logistic
function does NOT have a mean of 0; it is 0.5. The hyperbolic tangent
function is the one with a mean of 0 and that behaves in the deep networks
just a little better than the logistic function does.
If you look closer at the logistic activation function, you would see that, as
the inputs become larger, be they positive or negative, the activation
function will saturate at either 1 or 0 and the derivative is very close to 0.
As such, when backpropagation starts, it doesn't have much of a gradient to propagate back, and what little gradient there is continues to dilute as backpropagation moves down through the upper layers; by the time it reaches the bottom layers, there is almost nothing left.

Xavier and He Initialization

What did Glorot and Bengio propose to solve the problem? What we need
is the signal flowing smoothly in both directions – forward, when the
predictions are made, and backward for backpropagation. This signal
shouldn’t vanish, nor should it explode to the point of saturation. For this
signal to flow smoothly, Bengio and Glorot argued that the variance of the
layer outputs needs to be equal to the variance of the layer inputs. At the
same time, there should be equal variance on the gradients before and after
it passes backward through a layer. For that to be guaranteed, the layer would need to have the same number of input and output connections. However, a compromise was suggested, one that has proven itself in practice: we randomly initialize the connection weights as shown in the following equation, where n_inputs denotes how many input connections the layer being initialized has, and n_outputs is the number of output connections for the same layer. This is often called Xavier Initialization or Glorot Initialization, named after the author.

Equation – Xavier initialization (for the logistic activation function):
Normal distribution with mean 0 and standard deviation
σ = √( 2 / (n_inputs + n_outputs) )
or a uniform distribution between −r and +r, with
r = √( 6 / (n_inputs + n_outputs) )

This has been shown to speed the training significantly and the success that
is seen with Deep Learning today can be, in part, attributed to this. There
have been other papers published recently providing strategies that are
similar but for other activation functions. For example, the ReLU activation
function initialization strategy including the ELU activation and any other
variant is often termed the He Initialization, taking the surname of the
author.
By default, the fully_connected function uses Xavier initialization, but you can use the variance_scaling_initializer() function to change it to He Initialization, as follows:
he_init = tf.contrib.layers.variance_scaling_initializer()
hidden1 = fully_connected(X, n_hidden1, weights_initializer=he_init,
scope="h1")

Note
Unlike Xavier initialization, He Initialization takes only the fan-in into account, whereas Xavier uses the average of the fan-in and fan-out. Using only the fan-in is also the default for variance_scaling_initializer(), but you can change this by setting the argument mode="FAN_AVG".

Activation Functions That Don’t Saturate

Another insight from the Glorot and Bengio paper was that the problems
with vanishing and exploding gradients were partly due to choosing the
wrong type of activation function. Until then, it had been assumed that because biological neurons appear to use roughly sigmoid-shaped activation functions, these must be the best choice. However, it has since come to
light that there are many other activation functions that work in the DNNs a
lot better, the ReLU activation function in particular. Much of this is down
to the fact that the ReLU function doesn’t saturate for the positive values
and because it computes much faster.
That said, the ReLU activation function isn't perfect. It has its own problems, in particular the dying ReLUs problem: some of the neurons die during training, which means they output nothing but 0. Sometimes as many as 50% of the neurons may die, especially if a large learning rate has been used. If a neuron's weights are updated during training in such a way that its weighted input sum becomes negative, it will begin to output 0. Once this happens, it is unlikely that the neuron can be revived, because when the ReLU function's input is negative, its gradient is 0.
This problem can be solved by using a ReLU variant, such as the leaky ReLU. The definition of this function is LeakyReLU_a(z) = max(az, z). The rate at which the function "leaks" is defined by the hyperparameter a: it is the slope of the function for z < 0 and is typically set to 0.01. Because the slope is never exactly zero, leaky ReLUs can't die; they may go to sleep for a while, but they will wake up eventually.
More recently, another paper compared several ReLU activation function variants and concluded that the strict ReLU activation function is always outperformed by the leaky variants. Setting a = 0.2, a massive leak, gave much better performance than the smaller a = 0.01 leak. The randomized leaky ReLU (RReLU) was also evaluated; during training, a is randomly chosen from a provided range and then fixed to an average value for testing. This also performed well and appeared to reduce the overfitting risk, acting something like a regularizer. Lastly, the parametric leaky ReLU (PReLU) was evaluated: here, rather than being a hyperparameter, a is learned during training, making it a parameter that backpropagation can modify just like any other. PReLU was found to outperform ReLU significantly on large image datasets but, on small datasets, it was more likely to overfit.

Last but by no means least, a paper published in 2015 by Djork-Arné Clevert et al. introduced a new activation function called ELU, or exponential linear unit. In the authors' experiments, the ELU outperformed every ReLU variant: training was much faster, and the neural network produced better test set performance. The definition of ELU can be seen in the following equation:

Equation – ELU activation function:
ELU_a(z) = a (exp(z) − 1)  if z < 0
ELU_a(z) = z               if z ≥ 0

Now this might look, at first glance, to be a lot like ReLU but there are
some significant differences:
● When z < 0, ELU will take negative values. This means the unit can
have an output average that is nearer to 0, which mitigates some of the
problems with vanishing gradients. When z is a big negative number, the
value approached by the ELU function is defined by hyperparameter a.
Normally it is set to 1, but like any of the hyperparameters, it can be
tweaked.
● The gradient for z < 0 is nonzero, thus eliminating the problem of dying
units.
● There are no bumps in this function; it runs smoothly all the way and
that includes around z = 0. This helps Gradient Descent to go much
faster because there is less bouncing to the left and the right of z = 0.
● The ELU function does have one main drawback: it is slower to compute than ReLU and its variants because it uses the exponential function. This is compensated during training by a much faster rate of convergence. However, at test time, ReLU networks are faster than ELU networks.
So, which one should you use in the DNN hidden layers? Generally, the order of preference is:
● ELU > leaky ReLU (and its variants) > ReLU > tanh > logistic. If runtime performance matters most, lean toward the leaky ReLU rather than ELU. If you have sufficient computing power and some free time, use cross-validation to evaluate several activation functions, for example RReLU if your network is overfitting, or PReLU if you have a very large training set.
You can build a neural network using the elu() function in TensorFlow. All
you do, when you call fully_connected, is set the argument activation_fn:
hidden1 = fully_connected(X, n_hidden1, activation_fn=tf.nn.elu)
There is no predefined leaky ReLU function in TensorFlow but you can
easily define your own:
def leaky_relu(z, name=None):
    return tf.maximum(0.01 * z, z, name=name)
hidden1 = fully_connected(X, n_hidden1, activation_fn=leaky_relu)
Chapter 3 Summary

● Training DNNs faces the challenge of vanishing and exploding gradients.
● This problem arises as the algorithm moves through the layers toward the lower ones, with the gradients becoming smaller and smaller.
● The resulting gradient instability makes DNNs much harder to train, because each layer may learn at a different speed.
● Solutions to the vanishing and exploding gradient problem include Xavier and He initialization, along with non-saturating activation functions such as ELU.
Chapter 4.
Deep Neural Networks

Back to our deep neural networks: these are really nothing more than ANNs with several layers between the input and output layers. Networks like this pass data through those layers, calculating the probability of each output, and a DNN can model complex relationships that aren't linear.

DNN Structure

DNNs are normally feedforward networks, which means that data flows from the input layer to the output layer with no loopback. A network like this with one hidden layer is called a shallow feedforward neural network, but a deep neural network could have, for example, 1,000 or more hidden layers. Whatever the number, to be considered a DNN it must have more than two.
A DNN creates a map of virtual neurons and assigns random weights to the connections between them. The weights are multiplied by the inputs, and the result is an output somewhere between 0 and 1. If the DNN does not find the correct pattern, an algorithm adjusts the weights.

Different Types of DNN with Python


In broad terms, Python deep neural networks can be classified into two
categories and we’ll discuss them now.
RNNs – Recurrent Neural Networks

RNNs are a type of ANN in which the node connections form a graph directed along a sequence. RNNs can make use of their own internal memory, or state, to process input sequences. As such, an RNN can be used for several different tasks, such as speech recognition and unsegmented, connected handwriting recognition. There are two main types of RNN:
Infinite Impulse RNN – a directed cyclic graph that cannot be unrolled
Finite Impulse RNN – a directed acyclic graph that can be unrolled and replaced by a strictly feedforward network
A simple RNN is just a neural network whose neurons are arranged in layers. Each node in a layer connects directly, in one direction, to each node in the next layer. An RNN also has recurrent connections that carry information forward in time, and we can use LSTM (Long Short-Term Memory) cells; RNNs are used in many different applications, including language modeling.
Before we look at how to implement them, let’s take a deeper look. At its
highest level, an RNN can process sequences – this could be sentences,
daily stock prices, sensor measurements, and so on – one individual element
at a time, all the while retaining the state or memory of what came before in
that sequence.
The ‘Recurrent’ in Recurrent Neural Network indicates that the current time
step output will become the input to the next one. At each sequence
element, the model will consider both the current input and everything it
can remember about the elements that came before, known as state or
memory.
This memory is how the network can learn long-term dependencies in sequences, meaning the whole context can be considered when a prediction is made. RNNs are designed to mimic the way humans process sequences: when a human forms a response, they consider the whole context, taking an entire sentence into account rather than individual words. Look at the following sentence:
"For the first 15 minutes, while the band was warming up, the concert was very boring, but then things got exciting."
A machine learning model that looks at each word individually, unconnected to all the others, would likely classify this as a negative sentence. An RNN, however, would be able to see the words "but" and "exciting" and recognize that the sentence is actually positive. This is because it reads the whole sequence, which provides the context needed to process the meaning; recurrent neural networks have this concept built into them.
A layer consisting of memory cells lies at the heart of every RNN. The
LSTM is the 'cell of the moment': it maintains both a cell state and a carry, which ensure that the signal (the gradient carrying the information) isn't lost while the sequence is being processed. We will spend more time on the LSTM shortly but, for now, note that the LSTM considers three things at every time step: the current word, the carry, and the cell state.
LSTMs have three gates and weight vectors:
● Forget gate – this discards any information that is irrelevant
● Input gate – this handles current input
● Output gate – this produces the predictions at each of the time
steps
Each cell element function will be decided by the weights or parameters
that were learned in the training phase. Each cell part could be labeled if
you wanted but it really isn’t necessary; don’t forget that the RNN will
retain a memory of the whole sequence so that previous information cannot
be lost.

Problem Formulation

When it comes to training an RNN so it writes text, we have a choice of


methods. For this example, training the RNN to write patent abstracts, we
are going to train it as a ‘many-to-one’ mapper. In other words, a sequence
of words is input and the model is trained to predict the next word in the
sequence. The words are mapped, first to integers and then vectors and this
is done with an embedding matrix – it could be a pre-trained one or a
trainable one. Lastly, they go to the LSTM layer.
For writing a new patent, we begin by passing a sequence of words in.
Next, a prediction for the next word is made, the input sequence is updated,
another prediction made, and a new word added to our starting sequence.
This continues as many times as needed for the number of words we want
to be generated.
The approach steps are outlined below:
1. The abstracts are converted from a list of strings to a list of
integers, which is the sequence
2. The features and the labels are created from the sequences
3. The LSTM model is built using Embeddings, LSTM and Dense
layers
4. Pre-trained embeddings are loaded in
5. The model is trained to predict the next word in the given
sequence
6. The starting sequence is passed on so the predictions can be made

Bear in mind that this is just a single formulation of our problem. We could
also go down the route of using a character-level model or we could make a
prediction for every word in the sequence. Like most things in both
machine and deep learning, there is always more than one answer. In
practice though, the approach outlined above works well.
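A hedged sketch of step 3 above, building the LSTM model with Embedding, LSTM, and Dense layers in Keras, might look like the following. The vocabulary size, sequence length, and layer sizes are illustrative assumptions rather than values taken from the patent data set.

from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense, Dropout

vocab_size = 10000   # number of distinct words (assumed for illustration)
seq_length = 50      # length of each input word sequence

model = Sequential()
# Map word indices to dense vectors (pre-trained or trainable embeddings).
model.add(Embedding(input_dim=vocab_size, output_dim=100, input_length=seq_length))
# The LSTM layer keeps a memory of the sequence seen so far.
model.add(LSTM(64))
model.add(Dropout(0.5))
# Predict the next word as a probability distribution over the vocabulary.
model.add(Dense(vocab_size, activation='softmax'))
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])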

Preparing the Data

A neural network may have very powerful representational capabilities, but the most important thing is a clean, high-quality dataset. You can download the raw data for the project from HTTP://WWW.PATENTSVIEW.ORG/QUERYDEV/ – simply type in the search term 'neural network' and download the patent abstracts that result (there should be around 3,500).
Chapter 4 Summary
● The major types of RNNs are the infinite impulse RNN and the finite impulse RNN. An RNN is a neural network made up of layers of neurons.
● The infinite impulse RNN is a directed cyclic graph that cannot be unrolled.
● A finite impulse RNN is a directed acyclic graph that can be unrolled and replaced by a strictly feed-forward network.
● RNNs are used in language modeling and in processing sequences.
● A neural network has a powerful capacity for representation, but it still needs clean, high-quality data.
Chapter 5.
The Basics of Using the TensorFlow Libraries

Now that we have had some time to look over TensorFlow and have it downloaded onto our system of choice, it is time to look at some of the basics of this library. There are a lot of different parts to it, and while we summarized it before, we need to go a bit more in-depth to make sure that we really understand how it works and what we are able to do with it. Before we dive into the deep learning code itself, we will first look at some of the building blocks that come with TensorFlow to help us get prepared. The basic parts of the TensorFlow library that you need to know about include the following:

DataFlow Graphs

The first thing we are going to look at is dataflow graphs. When you are working in TensorFlow, all of the computation is based on graphs. These graphs are important because they give us a way to express and solve many mathematical problems in the system. Let's look at the expression below:
x = (y + z) * (z + 4)
It is also possible to take the expression above and write it another way, like this:
p = y + z
q = z + 4
x = p * q
Represented the second way, the expression is much easier to show in graph form. In the first form, we had a single expression to work with, but when we split it up, we end up with two intermediate expressions, p and q, that can be computed in parallel. We gain from this in terms of computation time. Such gains are important in deep learning and its applications, especially when we are talking about Recurrent Neural Networks (RNNs) and Convolutional Neural Networks (CNNs). These two architectures are more complicated, which is why we need to make sure that we work with graphs in the proper manner.
The goal of the TensorFlow library is to implement such graphs and to make sure that operations can be computed in parallel, which leads to gains in efficiency. In this library, the data that flows between the graph nodes is known as a tensor, which is basically just a multidimensional data array.
A typical graph starts at the input layer, where we expect to find the input tensor. After the input layer comes the hidden layer, which uses the rectified linear unit (ReLU) as its activation function.

Constants

Next on the list are constants. In TensorFlow, we create these constants using the constant function, which has the signature given here:
constant(value, dtype=None, shape=None, name='Const', verify_shape=False)
Let's take a look at this signature. value is the actual constant value to be used in later computation; dtype is a data type parameter such as int8, int16, float32, or float64; shape lets you specify dimensions if needed; name is optional and lets you give the tensor a name; and the last parameter, verify_shape, is a Boolean indicating whether the shape of the values should be verified.
It is quite possible that you will need constants that hold specific values. If that is the case in your own training model, create a constant object using the signature above, filling in the names and values that work best for your code.
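A short, hedged example of creating constants with explicit dtype, shape, and name arguments (the values themselves are arbitrary):

import tensorflow as tf

# A scalar constant and a 2x2 constant with an explicit data type and name.
a = tf.constant(5.0, dtype=tf.float32, name='a')
b = tf.constant([[1, 2], [3, 4]], dtype=tf.int32, shape=[2, 2], name='b')

with tf.Session() as sess:
    print(sess.run(a))  # 5.0
    print(sess.run(b))  # [[1 2] [3 4]]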

Variables

Next, the TensorFlow library gives us variables. Variables refer to in-memory buffers holding tensors that must be explicitly initialized and are used in the graph to maintain state across the whole session. When you call the variable constructor, the variable is added to the computational graph.
Variables are mostly used when you are training a model; they hold and update the parameters. The initial value you pass as the argument to the constructor is either a tensor or an object that can be converted into one. In practice, this means we fill any variable we want to work with using either a random or a predefined value, which can then be used later in the training process and updated over the iterations. A good way for us to define this would be:
m = tf.Variable(tf.zeros([1]), name="m")
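As a hedged reminder that variables must be explicitly initialized before use, a typical pattern looks like this (the variable names and initial values are arbitrary):

import tensorflow as tf

m = tf.Variable(tf.zeros([1]), name="m")
b = tf.Variable(tf.random_normal([1]), name="b")

# Variables must be initialized inside a session before they can be read.
init = tf.global_variables_initializer()
with tf.Session() as sess:
    sess.run(init)
    print(sess.run([m, b]))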

Sessions

The next topic that we need to take a look at is going to be the sessions. In
order for us to evaluate the nodes, we have to make sure that the
computational graph that we are using is able to run within the current
session. Remember here that the purpose of the session is going to be to
encapsulate the state and the control of the TensorFlow runtime.
If you create a new session with no parameters, it will use the default graph created in the current session. If you do pass in parameters, the session class will accept the graph you specify, and that graph is what gets executed when you run the session.
To get a better idea of how this works, let's look at the classic "hello" code to see how a TensorFlow session operates:
import tensorflow as tf
h = tf.constant("Hello TensorFlow!")
s = tf.Session()
print(s.run(h))
The output you get when you run this is Hello TensorFlow! This may seem pretty simple, but it is good practice when you are working in Python, and it shows you a little of what you can do with code in the TensorFlow library.

Placeholders

As you write code in TensorFlow, there will be times when you do not know the value of an array, say y, during the initial declaration phase of your TensorFlow problem, that is, before the with tf.Session() as sess: stage. When this happens, TensorFlow expects you to declare the basic structure of the data with a tf.placeholder declaration. This ensures that something is present and gives the program a chance to learn, without the code raising an error because nothing was there to start with.
We can apply this idea to y using the code below:
# creating a TensorFlow placeholder
y = tf.placeholder(tf.float32, [None, 1], name='y')
Since we are not providing an initial value in the declaration, we need to tell TensorFlow the data type of the elements that will go into the tensor; here we use tf.float32. The second argument denotes the shape of the data that will be injected into the placeholder, and using None for the first dimension means any number of samples can be fed in.
Chapter 5 Summary:

● Graphs play a crucial role in establishing the platform upon which TensorFlow's computation is based.
● Two widely used types of neural networks are RNNs and CNNs.
● Functionally, the TensorFlow library allows computations to run in parallel.
● Coding in TensorFlow poses a few challenges, such as not knowing the value of an array at the initial declaration phase; placeholders address this.
Chapter 6.
Deep Learning with TensorFlow

Now that we have gotten a chance to look a bit more at TensorFlow and
some of the different things that you are able to do with this library, it is
time to take a look at what you are able to do with deep learning and this
library. This is the meat and potatoes of the guidebook, the good stuff that
we have been working toward. So, let's break it down and see some of the
different things that we will be able to do when we are working with this
library.
As we go through this chapter, we are going to be demonstrating all of the
processes that are needed to train a model of neural networks inside of the
TensorFlow library. This is going to be done with the help of the API’s
estimator known as DNNClassifier.
Our goal with this kind of neural network is to train it with the help of the
MNIST dataset. This dataset is often treated as the "hello world" of any
deep learning project. You will be able to find this dataset inside the
package for TensorFlow, so it should already be set up for you. The dataset
consists of 28x28 grayscale images of handwritten digits. It is a larger
dataset to work with since it contains 55,000 training rows, 5,000
validation rows, and 10,000 testing rows.

Importing the Data That You Need

With that in mind, we need to move on to the first step here. We need to
make sure that we have all of the necessary data in place to help us write
some of the codes that are needed in this guidebook. First, it is time to
import the libraries that we need to use, including the following:
import tensorflow as tf
import numpy as np
from keras.datasets import mnist
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Dropout
from keras.layers import Flatten
from keras.layers.convolutional import Conv2D
from keras.layers.convolutional import MaxPooling2D
from keras.optimizers import Adam
from keras.utils import np_utils
from PIL import Image
import os
There are a lot of different things that need to be on our computer before
we are able to work with the program we will write next. This may seem like
a lot, but it is necessary to get all of the parts ready and working
together. As you may have noticed, we are going to import a few things from
the Keras library as well, so if you have not already downloaded and
installed that library on your computer, it is critical that you do so
before we get going on this journey.
We are going to use the following code to help us load out the set of data
that we are going to use here. The data is importable by the use of the Keras
library that we talked about before. You will see the progress of the data
download as you do this. The code that we need to make this happen
includes:
(X_train, y_train), (X_test, y_test) = mnist.load_data()
Next, it is time for us to take a moment to change up the data that we need,
making sure to reshape it the way that we would like. Since we are working
with a convolutional neural network, we need to make sure that our data is
reshaped into the (batch, width, height, channels) format; a sketch of this
reshaping is shown below.
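As a rough illustration, and assuming the arrays loaded above, the reshaping might look something like this; the variable names simply follow the code in this chapter:
# reshape to (batch, height, width, channels); MNIST images are 28x28 with 1 channel
X_train = X_train.reshape(X_train.shape[0], 28, 28, 1)
X_test = X_test.reshape(X_test.shape[0], 28, 28, 1)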
It is also possible for us to take a moment to add in some of the images that
we want to this thing. We can do it both in the training data, as well as
inside the test data. The code that we are able to use in order to get this
done includes:
def load_images(image_label, image_directory, features_data, label_data):
    files_list = os.listdir(image_directory)
    for file in files_list:
        image_file_name = os.path.join(image_directory, file)
        if '.png' in image_file_name:
            # open the image, convert it to grayscale and resize it to 28x28
            img = Image.open(image_file_name).convert("L")
            img = np.resize(img, (28, 28, 1))
            im2arr = np.array(img)
            im2arr = im2arr.reshape(1, 28, 28, 1)
            features_data = np.append(features_data, im2arr, axis=0)
            label_data = np.append(label_data, [image_label], axis=0)
    return features_data, label_data
This code that we just went through is going to help us load up the features
and all of the labels that we need. Keep in mind here that we have just
defined our own function, named load_images(), which takes in four
parameters. It lists out all of the files that are available in the image
directory. The function then checks the format of each file and only
processes the images that are in .png format; if your images are stored in
another format such as .jpg, you would need to adjust that check
accordingly. The images, at this point, are loaded into the system and
converted into an array with the same shape as the features data, and that
image array is appended to it. The function also takes the image label and
appends it to the label_data part of all this.
Once we have been able to get all of the images set up and ready, and we
know that the right folder or directory (depending on what you have
chosen) is holding them, then the current set of data that we are in will
return these images back.
Now, it is time to move on to the next step. We are now going to need to
give the images their own directories to make sure that they are properly
loaded onto the existing set of data that we want. This means that we will
simply need to load the images into the training and the test sets. To make
this happen, we will need to use the code below to help us:
X_train, y_train = load_images(‘1’, ‘F:/mnist’, X_train, y_train)
X_test, y_test = load_images(‘1’, ‘F:/mnist’, X_test, y_test)
From here, we need to take a moment to normalize the data. The inputs, which
start out in the range 0-255, are scaled down to the range 0-1:
# convert to floats and scale the pixel values down to the 0-1 range
X_train = X_train.astype('float32') / 255
X_test = X_test.astype('float32') / 255
We have the labels here, but they have not had a chance to be categorized. It
is now time for us to start to categorize all of these by using the code that
we have below:
total_classes = 10
y_train = np_utils.to_categorical(y_train, total_classes)
y_test = np_utils.to_categorical(y_test, total_classes)
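The imports at the start of this chapter bring in Sequential, Conv2D, MaxPooling2D, Flatten, Dense, Dropout, and Adam, so here is a minimal sketch of the kind of convolutional model those pieces could be assembled into; the layer sizes and the number of epochs are illustrative choices, not values prescribed by this chapter:
# a small convolutional network for the 28x28x1 MNIST images
model = Sequential()
model.add(Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.25))
model.add(Flatten())
model.add(Dense(128, activation='relu'))
model.add(Dense(total_classes, activation='softmax'))

# compile with the Adam optimizer and train on the prepared data
model.compile(loss='categorical_crossentropy', optimizer=Adam(), metrics=['accuracy'])
model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=5, batch_size=128)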
Chapter 6 Summary
● The DNNClassifier estimator is one way to train a neural network model inside TensorFlow.
● The goal of the neural network in this chapter is to be trained on the MNIST dataset.
● The MNIST dataset plays the role of the "hello world" example in a deep learning project.
● Deep learning that taps into the MNIST dataset and a classifier establishes a strong base for efficient and constant learning, adaptable and flexible enough to perform varying functions.
● This is especially the case when importing data.
Chapter 7.
Linear Regression with Python

Linear regression when we just have one variable

The first part of linear regression that we are going to focus on is when we
just have one variable. This is going to make things a bit easier to work
with and will ensure that we can get some of the basics down before we try
some of the things that are a bit harder. We are going to focus on problems
that have just one independent and one dependent variable on them.
To help us get started with this one, we are going to use the car_price.csv
set of data so that we can learn what the price of a car is going to be. We
will have the price of the car be our dependent variable, and the year of
the car is going to be the independent variable. You are able to find this
information in the Datasets folder that we talked about before. To help us
make a good prediction on the price of the cars, we will use the
Scikit-Learn library from Python to get the right algorithm for linear
regression. When we have all of this set up, we need to use the following
steps to help out.

Importing the right libraries

First, we need to make sure that we have the right libraries to get this going.
The codes that you need to get the libraries for this section include:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
You can run this script in a Jupyter notebook. The final line needs to be
there if you are using a Jupyter notebook, but if you are using Spyder, you
can remove it because Spyder will display the plots without that directive.

Importing the Dataset

Once the libraries have been imported using the codes that you had before,
the next step is going to be importing the data sets that you want to use for
this training algorithm. We are going to work with the “car_price.csv”
dataset. You can execute the following script to help you get the data set in
the right place:
car_data = pd.read_csv(‘D:\Datasets\car_price.csv’)

Analyzing the data

Before you use the data to help with training, it is always best to practice
and analyze the data for any scaling or any values that are missing. First, we
need to take a look at the data. The head function is going to return the first
five rows of the data set you want to bring up. You can use the following
script to help make this one work:
car_data.head()
In addition, the describe function can be used to return the statistical
details of the dataset:
car_data.describe()
Finally, let's take a look to see if the linear regression algorithm is actually
going to be suitable for this kind of task. We are going to take the data
points and plot them on the graph. This will help us to see if there is a
relationship between the year and the price. To see if this will work out, use
the following script:
plt.scatter(car_data[‘Year’], car_data[‘Price’])
plt.title(“Year vs Price”)
plt.xlabel(“Year”)
plt.ylabel(“Price”)
plt.show()
The script above creates a scatter plot using the Matplotlib library. This
is useful because the scatter plot has the year on the x-axis and the price
on the y-axis. From the output figure, we can see that as the year
increases, the price of the car goes up as well. This shows the linear
relationship that is present between the year and the price, and it is a
good sign that this kind of algorithm can be used to solve the problem.

Going back to data pre-processing


This is done in order to divide the data into features and labels and then
build the training and test sets that we need. To divide the data into
features and labels, you will need to use the script below to get started:
features = car_data.iloc[:, 0:1].values
labels = car_data.iloc[:, 1].values
Since we only have two columns here, the 0th column is going to contain
the feature set and then the first column is going to contain the label. We
will then be able to divide up the data so that there are 20 percent to the test
set and 80 percent to the training. Use the following scripts to help you get
this done:
from sklearn.model_selection import train_test_split
train_features, test_features, train_labels, test_labels = train_test_split
(features, labels, test_size = 0.2, random_state = 0)
From this part, we are able to go back and look at the set of data again. And
when we do this, it is easy to see that there is not going to be a huge
difference between the values of the years and the values of the prices. Both
of these will end up being in the thousands each. What this means is that it
is not really necessary for you to do any scaling because you can just use
the data as you have it here. That saves you some time and effort in the long
run.

How to train the algorithm and get it to make some predictions

Now it is time to do a bit of training with the algorithm and ensure that it is
able to make the right predictions for you. This is where the
LinearRegression class is going to be helpful because it has all of the labels
and other training features that you need to input and train your models.
This is simple to do and you just need to work with the script below to help
you get started:
from sklearn.linear_model import LinearRegression
lin_reg = LinearRegression()
lin_reg.fit(train_features, train_labels)
Using the same example of the car prices and the years from before, we are
going to look and see what the coefficient is for only the independent
variable. We need to use the following script to help us do that:
print(lin_reg.coef_)
The result of this process is going to be 204.815. This shows that for each
unit change in the year, the car price is going to increase by 204.815 (at
least in this example).
Once you have taken the time to train this model, the final step to use is to
predict the new instance that you are going to work with. The predict
method is going to be used with this kind of class to help see this happen.
The method is going to take the test features that you choose and add them
in as the input, and then it can predict the output that would correspond with
it the best. The script that you are able to use to make this happen will be
the following:
predictions = lin_reg.predict(test_features)
When you use this script, you will find that it gives us a good prediction
of what we are going to see in the future. Basically, we are able to
estimate how much a car is going to be worth based on the year it is
produced, going off the information that we have right now. There are things
that could change in the future, and the price also depends on the features
that come with the car. Sometimes the price is going to be lower, and
sometimes higher. But this at least gives you a good way to get an average
of what cars cost each year, and how much they will cost in the future.
So, let's see how this would work. We now want to use this linear regression
to figure out how much a car is going to cost in the year 2025. Maybe you
would like to save up for a vehicle and you want to estimate how much it is
going to cost you by the time you have saved that money. You would be able
to take the trained model, feed in the new year that you want the prediction
based on, and get back an estimated value for a car in that year, as
sketched below.
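Under the assumption that the model was trained exactly as above, a single-year prediction might look like this; the year 2025 is just the example value from the text:
import numpy as np

# predict the average car price for the year 2025 using the trained model
future_year = np.array([[2025]])
predicted_price = lin_reg.predict(future_year)
print(predicted_price)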
Of course, remember that this is not going to be 100 percent accurate.
Inflation could change prices, the manufacturer may change some things
up, and more. Sometimes the price is going to be lower, and sometimes
higher. But it at least gives you a good way to predict the price of the
vehicle that you have and how much it is going to cost you in the future.
This chapter spent some time looking at an example of how the linear
regression algorithm is going to work if you are just working with one
dependent and one independent variable. You can take this out and add in
more variables if you want, using the same kinds of ideas that we discussed
in this chapter as well.
Chapter 7 Summary

● Working with single-variable linear regression offers an easier path to analyzing data for machine/deep learning with Python.
● Importing the right libraries is central to a high-quality linear regression and the accompanying deep/machine learning. The same applies to importing the right dataset.
● Before the data is used, analysis is crucial for establishing the underlying patterns and trends that aid the subsequent modeling.
● The pre-processing stage offers ample opportunity to prepare the data before training the algorithm to make predictions using pre-existing data and knowledge.
Chapter 8.
Decision Trees to Handle Your Regression
Problems

And now that we have a better understanding of this, it is time to take a
look at decision trees and see how we are able to use these for our benefit.
You will find that when you bring up the idea of machine learning, you will
not have to look far in order to hear about decision trees. These decision
trees are going to be a very important part of the machine learning process.
Each feature in the set of data that you are working with will be treated
just like a new node in the tree. And with each node, a decision has to be
made to determine which path is the best one for you.
Each time that you use this decision tree, you are going to be in a different
kind of situation, and this means that you need to weigh the decisions of
each, and then decide which path is the right one for you. The process will
keep going on from there, helping you to make new decisions, until you get
to the leaf node and you are to the final decision, the one you will choose to
go with.
This may seem like a lot of steps, and something that is a bit complicated to
work with at first, but you may be surprised to find out that we have
actually worked with decision trees and some of the ideas that come with
them, for our entire lives. For example, if you have ever tried to get a
loan from a bank, it is likely they used a form of decision tree to figure
out whether or not to give you the loan.
When the bank uses this kind of decision tree, they are going to take a look
at a ton of data that they can gather on the customer to help them make the
decision. The information they may look at includes their age, gender, job
history, credit history, their salary, and more. As the bank is looking through
this information, they will be able to use it to figure out whether they want
to give the customer the loan, or if they think the customer is too much of a
risk for this.
Of course, each bank is going to be different. Some banks will turn down an
application quickly, and others are more than happy to work with customers
even when they are turned down. But no matter which bank you go with,
they are going to sit down and define the criteria they would like to meet
with the customer before they give out the loan. These new criteria are
going to be the set of rules that are used to help the bank figure out who is
going to get a loan or not. Some examples of the criteria that could be used
will include:
The bank may decide that if the applicant is at least 25 years old, and under
60, then they are able to move to the next step of the decision tree. If the
applicant is younger or older than these two ages, then the loan is rejected.
If the applicant has been able to meet with the first criteria, then the bank is
going to check and see whether that applicant has a salary at all. If they do
have a salary and it is steady, then the bank would move on to step three. If
the applicant is jobless or doesn’t make an income at all, then the bank is
going to reject that application.
If the applicant is male and has a salary, then they move on to the fourth
step. If the applicant is female and has a salary, then the bank goes to the
fifth step.
If the applicant has a salary that is more than $35,000 a year, then they are
able to get the loan. If their income is less than this, then they will not be
accepted.
If the applicant during this step earns more than $45,000 a year, they will be
able to get the loan. But if their income is less than this amount, then the
loan is rejected.
This is a basic way to look at the decision tree. With loan applications,
and with many other uses of decision trees, there are going to be a lot more
steps and complexity involved. There are even times when we will need to
bring in a statistical measure, such as entropy, to help us choose the nodes
so that the impurity of the classes in the labeled data is handled properly.
To help make sure that this process is kept as simple as possible, we want to
work with the features that have a minimum amount of entropy and then
this part is going to be set up as the root node. This is going to help anyone
who is using the decision tree, including the bank from before, a starting
point they can work with to help them pick out the perfect applicants to
give the loan to.

Are there benefits to using a decision tree?

There are a lot of times when you will want to work with these decision
trees. These are a good option because they are simple and help you to see
the steps that are needed in order to make a decision. Some of the many
benefits that you are going to notice when you decide to work with the
decision tree algorithms include:
You will be able to use these decision trees to help out with a few different
problems. These are going to work well for regression and classification
tasks.
You can use these decision trees to help with the linear data and the non-
linear data that you would like to classify.
When you look at some of the other algorithms that you are able to use with
machine learning, you will find that decision trees are going to be fast to
train.

Implementing your own decision tree

Now that we know a bit more about these decision trees and why they are
such a great thing to use for your own decision making in data science, it is
time to learn how to work with Python in order to make one of your own
decision trees. To work with this, we are going to use an example of trying
to predict petrol consumption (in millions) in the United States based on a
few different features.
The features that we choose to use are going to include the ratio of people
who have a driving license, the per capita income, the tax on petrol (in
cents), and how many miles of paved highway there are. Let's take
some time to look through the various steps that are needed to help us make
this kind of decision tree.
The first step that we need to take a look at is to import both the libraries
and the data set that is needed. The libraries that you need to make this
happen include:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
When you are ready to import the right data set to use here, you need to use
the following command:
petrol_data = pd.read_csv('D:\Datasets\petrol_data.csv')
From this point, we need to look at the data a bit and see what shows up
there. To look at the data in the proper manner, you just need to use the
code petrol_data.head() to get this started. When you type this in, a table
of the first rows, with the columns we are interested in, will show up on
your screen.
We also need to stop for a moment here and make sure that we use data
preprocessing in the proper manner. This helps us to get all of the data
organized the way that we want. To make this happen for this set of data,
we need to use the following information to help us get started:
features = petrol_data.iloc[:, 0:4].values
labels = petrol_data.iloc[:,4].values
Then you can take this information and divide it up so that eighty percent
goes to training and then the other twenty percent goes to a test set. Use the
following script to get this to happen.
from sklearn.model_selection import train_test_split
train_features, test_features, train_labels, test_labels = train_test_split
(features, labels, test_size = 0.2, random_state = 0)
In this example, we need to take some time to do a bit of data scaling. We
have miles of highway, petrol tax in cents, and the consumption of petrol in
millions, so the columns are on very different scales. Because of these
differences, we need to scale this information so that the features compare
well to one another. The steps that you need to use in order to scale the
data will include the following:
from sklearn.preprocessing import StandardScaler
feature_scaler = StandardScaler()
train_features_scaled = feature_scaler.fit_transform(train_features)
test_features_scaled = feature_scaler.transform(test_features)

Training the algorithm

Now that the features are scaled down, it is time to train the algorithm
that we are going to use. Since the petrol consumption we are predicting is
a continuous value, this is a regression task, so we will work with the
decision tree regressor from the sklearn.tree library. The following script
will make sure that the right labels and features are passed on to the
decision tree:
from sklearn.tree import DecisionTreeRegressor
dt_reg = DecisionTreeRegressor()
dt_reg.fit(train_features, train_labels)
And the final thing that we need to do in order to get started with this part
of the plan is to make predictions. We have all of the information that we
need at this point, and it is time to go through and use the prediction
method. With the algorithms and codes that we have been using so far, you
are going to be able to make some predictions based on the data that you
have, helping you to get the best information to make decisions. The code
that you need to use for this will include:
predictions = dt_reg.predict(test_features)
At this point, a decision tree, with the help of Python, is going to be created
for you. This is going to ensure that you will be able to see which prediction
is right for you based on the information that you put in. Businesses and
other companies are often going to use this to help them pick the
information and the choices that are right for them. This method is faster
and more accurate than what a single person is able to do, which is why
they are so valuable.
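To get a sense of how good these predictions are, one common step (not shown in this chapter, so treat it as an optional sketch) is to compare them against the test labels with the error metrics from sklearn.metrics:
from sklearn import metrics

# compare the predicted petrol consumption with the true test labels
print('Mean Absolute Error:', metrics.mean_absolute_error(test_labels, predictions))
print('Mean Squared Error:', metrics.mean_squared_error(test_labels, predictions))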
Chapter 8 Summary

● Decision trees are a crucial part of the machine learning process.
● Decision trees are crucial in breaking down tasks into small, manageable steps, allowing the subsequent steps to be carried out effectively.
● In the machine/deep learning process, every feature is treated like a new node of a tree.
● For each node, a decision has to be made to determine the next path to take.
● A decision tree regressor (or classifier) from the sklearn.tree library is an effective tool for this.
Chapter 9.
Self Organizing Maps

Self Organizing Maps are an unsupervised deep learning technique, and we
will discuss both the theory and a practical implementation from scratch. In
this chapter we cover:
1. What is a Self Organizing Map?
2. The K-Means clustering technique.
3. The SOM network architecture.
4. How Self Organizing Maps work.
5. A practical implementation of SOMs.

What is a Self Organizing Map?


The Self Organizing Map (SOM) is one of the most popular neural models. It
belongs to the category of competitive learning networks. The SOM is based
on unsupervised learning, which means that no human intervention is needed
during training and that little needs to be known about the characteristics
of the input data. We could, for example, use the SOM to cluster the input
data without knowing the class memberships in advance. The SOM can be used
to detect features inherent to the problem, and has therefore also been
called the SOFM, the Self Organizing Feature Map.
The Self Organizing Map was developed by Professor Teuvo Kohonen and is used
in many applications.
Basically, the purpose of the SOM is to provide a data visualization
technique that helps us understand high dimensional data by reducing its
dimensions to a map. The SOM also represents the clustering concept by
grouping similar data together. Therefore, it can be said that a Self
Organizing Map reduces data dimensions and displays similarities among the
data.
A Self Organizing Map is formed from a grid of nodes, or units, to which the
input data are presented. Every node is connected to the input in the same
way, and no nodes are connected to each other.
The Self Organizing Map is a topology preserving technique and keeps the
neighborhood relations in its mapping presentation.
Dataset Description:
This dataset has three attributes: the first is the item, which is our
target for making clusters of similar items; the second and third attributes
are the informative values of that item.
Now, in the first step, take any two random rows as the initial centroids;
let's suppose we take row 1 and row 3.
Now compare these centroid values with the observed values of each row of
our data by using the Euclidean distance formula.
Now let's solve them one by one.
Row 1 (A)
C1= √((1-1)² + (1-1)²) = 0
C2= √((1-0)² + (1-2)²) = 1.4
Row 2 (B)
C1= √((1-1)² + (0-1)²) = 1
C2= √((1-0)² + (0-2)²) = 2.2
Row 3 (C)
C1= √((0-1)² + (2-1)²) = 1.4
C2= √((0-0)² + (2-2)²) = 0
Row 4 (D)
C1= √((2-1)² + (4-1)²) = 3.2
C2= √((2-0)² + (4-2)²) = 2.8
Row 5 (E)
C1= √((3-1)² + (5-1)²) = 4.5
C2= √((3-0)² + (5-2)²) = 4.2
Let's say A and B belong to Cluster 1, and C, D, and E belong to Cluster 2.
Now calculate the centroids of clusters 1 and 2 respectively, reassign each
point to the closest centroid, and repeat this until the centroids stop
changing from one iteration to the next.
Now find the centroids of Cluster 1 and Cluster 2 respectively:
X1 = ((1+1)/2, (1+0)/2) = (1, 0.5)
X2 = ((0+2+3)/3, (2+4+5)/3) = (1.7, 3.7)
New Centroid
X1 = (1, 0.5)
X2 = (1.7,3.7)
Previous Centroid
X1 = (1, 1)
X2 = (0, 2)
If the new centroid values are equal to the previous centroid values, then
our clusters are final; otherwise, we repeat the step until the new centroid
values equal the previous ones. In our case, the new centroid values are not
equal to the previous centroids, so we recalculate the clusters using the
closest mean.
So
X1 = (1, 0.5)
X2 = (1.7,3.7)
Following the same procedure as above, on the basis of the closest distance,
A, B, and C belong to cluster 1, and D and E belong to cluster 2.
So the means of Cluster 1 and Cluster 2 are:
X1 (Cluster 1) = (0.7, 1)
X2 (Cluster 2) = (2.5, 4.5)
New Centroid
X1 = (0.7 , 1)
X2 = (2.5 , 4.5)
Previous Centroid
X1 = (1, 0.5)
X2 = (1.7,3.7)
If the new centroid values are equal to the previous centroid values, then
our clusters are final; otherwise, we repeat the step until they match. In
our case, the new centroid values are again not equal to the previous
centroids.
So we recalculate the clusters using the closest mean one more time. On the
basis of the closest distance, A, B, and C still belong to cluster 1, and D
and E still belong to cluster 2.
So the means of Cluster 1 and Cluster 2 are:
X1 (Cluster 1) = (0.7, 1)
X2 (Cluster 2) = (2.5, 4.5)
New Centroid
X1 = (0.7, 1)
X2 = (2.5, 4.5)
Previous Centroid
X1 = (0.7, 1)
X2 = (2.5, 4.5)
So here the new centroid values are equal to the previous values, and hence
our clusters are final. A, B, and C belong to cluster 1, and D and E belong
to Cluster 2.
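To tie this worked example to code, here is a small sketch that runs the same clustering with scikit-learn; the five rows are the A-E points used above, and printing the labels and centers should reproduce the result we reached by hand (up to cluster numbering).
import numpy as np
from sklearn.cluster import KMeans

# the five observations A, B, C, D, E from the worked example
data = np.array([[1, 1], [1, 0], [0, 2], [2, 4], [3, 5]])

# ask for two clusters, matching the two centroids chosen above
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(data)

print(kmeans.labels_)           # which cluster each point ends up in
print(kmeans.cluster_centers_)  # final centroids, close to (0.7, 1) and (2.5, 4.5)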

Self Organizing Maps Network Architecture

For our purposes, we'll be discussing a two-dimensional SOM. The network
is created from a 2D lattice of 'nodes', each of which is fully connected to
the input layer.

In a Kohonen network, each node has a specific topological position (an x, y coordinate in the lattice) and contains a vector of weights of the same
dimension as the input vectors. That is to say, if the training data consists of
vectors, V, of n dimensions:
V1, V2, V3...Vn
Then each node will contain a corresponding weight vector W, of n
dimensions:
W1, W2, W3...Wn
The lines connecting the nodes are only there to represent adjacency and do
not signify a connection as normally indicated when discussing a neural
network. There are no lateral connections between nodes within the lattice.
A SOM does not need a target output to be specified, unlike many other
types of network. Instead, where the node weights match the input vector,
that area of the lattice is selectively optimized to more closely resemble the
data for the class the input vector is a member of. From an initial
distribution of random weights, and over many iterations, the SOM
eventually settles into a map of stable zones. Each zone is effectively a
feature classifier, so you can think of the graphical output as a type of
feature map of the input space.
Training occurs in several steps and over many iterations:
1. Each node's weights are initialized.
2. A vector is chosen at random from the set of training data and presented
to the lattice.
3. Every node is examined to calculate which one's weights are most like
the input vector. The winning node is commonly known as the Best
Matching Unit (BMU).
4. The radius of the neighborhood of the BMU is now calculated. This is a
value that starts large, typically set to the 'radius' of the lattice, but
diminishes each time-step. Any nodes found within this radius are deemed
to be inside the BMU's neighborhood.
5. Each neighboring node's (the nodes found in step 4) weights are adjusted
to make them more like the input vector. The closer a node is to the BMU;
the more its weights get altered.
6. Repeat step 2 for N iterations.
Now it's time for us to learn how SOMs learn. Are you ready? Let's begin.
Right here we have a very basic self-organizing map.
Our input vector amounts to three features, and we have nine output nodes.
That being said, it might confuse you to see how this example shows three
input nodes producing nine output nodes. Don't get puzzled by that. The
three input nodes represent three columns (dimensions) in the dataset, but
each of these columns can contain thousands of rows. The output nodes in an
SOM are arranged in a two-dimensional grid.
Consider the structure of a Self Organizing Map that has 3 visible input
nodes and 9 output nodes, each connected directly to the input.
Our input node values are:
X_1= 0.7
X_2= 0.6
X_3= 0.9
Now let's take a look at each step in detail.

Step 1: Initializing the Weights

Now, let's take the topmost output node and focus on its connections with
the input nodes. As you can see, there is a weight assigned to each of these
connections.
Again, the word "weight" here carries a whole other meaning than it did
with artificial and convolutional neural networks. For instance, with
artificial neural networks we multiplied the input node's value by the weight
and, finally, applied an activation function. With SOMs, on the other hand,
there is no activation function.
Weights are not separate from the nodes here. In an SOM, the weights
belong to the output node itself. Instead of being the result of adding up the
weights, the output node in an SOM contains the weights as its coordinates.
Carrying these weights, it sneakily tries to find its way into the input space.
In this example, we have a 3D dataset, and each of the input nodes
represents one coordinate of the input space. The SOM would compress these
into a single output node that carries three weights. If we happen to deal
with a 20-dimensional dataset, the output node in this case would carry 20
weight coordinates.
Each of these output nodes does not exactly become part of the input space,
but tries to integrate into it nevertheless, carving out an imaginary place
for itself.
We have randomly initialized the weight values (close to 0, but not 0).

Step 2: Calculating the Best Matching Unit

The next step is to go through our dataset. For each of the rows in our
dataset, we'll try to find the node closest to it.
Say we take row number 1, and we extract its value for each of the three
columns we have. We'll then want to find which of our output nodes is
closest to that row.
To determine the best matching unit, one method is to iterate through all the
nodes and calculate the Euclidean distance between each node's weight
vector and the current input vector. The node with a weight vector closest to
the input vector is tagged as the BMU.
The Euclidean distance is given as:
Distance = sqrt( (X1 - W1)² + (X2 - W2)² + ... + (Xn - Wn)² )
where X is the current input vector and W is the node's weight vector.
Let's calculate the Best Matching Unit using the distance formula. We
compute this distance for the 1st node, the 2nd node, the 3rd node, and in
the same way for all of the remaining nodes (the individual weight values
are shown in the original figure). Once all of the distances have been
calculated, we can pick the Best Matching Unit: node number 3 is the
closest, with a distance of 0.4. We will call this node our BMU
(best-matching unit).
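Here is a tiny sketch of that BMU search in code; the nine weight vectors are randomly initialized because the actual values from the book's figure are not reproduced in the text, so the winning node will differ from the worked example.
import numpy as np

np.random.seed(0)

# the input row from the example and nine randomly initialized 3-weight nodes
x = np.array([0.7, 0.6, 0.9])
weights = np.random.rand(9, 3)

# Euclidean distance from the input to every node's weight vector
distances = np.sqrt(((weights - x) ** 2).sum(axis=1))

bmu_index = np.argmin(distances)
print("BMU is node", bmu_index, "at distance", distances[bmu_index])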
What happens next?
To understand this next part, we'll need to use a larger SOM.
Presumably, you now understand the difference between weights in the SOM
context and the weights we were used to when dealing with supervised machine
learning.
The new SOM will have to update its weights so that it is even closer to our
dataset's first row. The reason we need this is that our input nodes cannot be
updated, whereas we have control over our output nodes.
In simple terms, our SOM is drawing closer to the data point by stretching
the BMU towards it.

Step 3: Calculating the size of the neighborhood around the BMU

This is where things start to get more interesting! Each iteration, after the
BMU has been determined, the next step is to calculate which of the other
nodes are within the BMU's neighborhood. All these nodes will have their
weight vectors altered in the next step. So how do we do that? Well it's not
too difficult... first you calculate what the radius of the neighborhood should
be and then it's a simple application of good ol' Pythagoras to determine if
each node is within the radial distance or not.
The neighborhood is centered around the BMU and, at first, encompasses most
of the other nodes; a circle of the chosen radius shows its extent. The size
of the neighborhood around the BMU decreases with an exponential decay
function: it shrinks on each iteration until, over time, the neighborhood is
the size of just one node, the BMU itself.
Now we know the radius, it's a simple matter to iterate through all the nodes
in the lattice to determine if they lay within the radius or not. If a node is
found to be within the neighborhood then its weight vector is adjusted as
follows in Step 4.
How to set the radius value in self organizing map?
It depends on the range and scale of your input data. If you are mean-zero
standardizing your feature values, then try σ=4. If you are normalizing
feature values to a range of [0, 1] then you can still try σ=4, but a value of
σ=1 might be better. Remember, you have to decrease the learning rate α
and the size of the neighborhood function with increasing iterations, as
none of the metrics stay constant throughout the iterations in SOM.
It also depends on how large your SOM is. If it's a 10 by 10, then use for
example σ=5. Otherwise, if it’s a 100 by 100 map, use σ=50.
In unsupervised classification, σ is sometimes based on the Euclidean
distance between the centroids of the first and second closest clusters.

Step 4: Adjusting the Weights

Every node within the BMU's neighborhood (including the BMU) has its
weight vector adjusted according to the following equation:
New Weights = Old Weights + Learning Rate (Input Vector - Old Weights)
W(t+1) = W(t) + L(t) ( V(t) – W(t) )
Where t represents the time-step and L is a small variable called the
learning rate, which decreases with time. Basically, what this equation is
saying is that the new adjusted weight for the node is equal to the old
weight (W), plus a fraction of the difference (L) between the old weight and
the input vector (V).
So, according to our example, the BMU found in step 2 and its neighbors have
their weights updated in exactly this way; the before-and-after weight
values were shown in the original figure.
The decay of the learning rate is calculated each iteration using an
exponential decay of the form:
L(t) = L0 * exp(-t / λ)
where L0 is the initial learning rate and λ is a time constant.
As training goes on, the neighborhood gradually shrinks. At the end of
training, the neighborhood has shrunk to just the BMU.
The influence rate shows the amount of influence a node's distance from the
BMU has on its learning. In the simplest form, the influence rate is equal
to 1 for all the nodes close to the BMU and zero for the others, but a
Gaussian function is common too. Finally, from a random distribution of
weights and through many iterations, the SOM is able to arrive at a map of
stable zones. At the end, interpretation of the data still has to be done by
humans, but the SOM is a great technique for presenting the otherwise
invisible patterns in the data.
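As a rough sketch of the update rule from this step, here is a toy version in code; the Gaussian influence over node indices, the decay constants, and the 9-node setup are illustrative simplifications rather than the book's full 2D-lattice implementation.
import numpy as np

np.random.seed(0)
x = np.array([0.7, 0.6, 0.9])    # the example input vector
weights = np.random.rand(9, 3)   # nine nodes, three weights each

learning_rate_0, sigma_0, lam = 0.5, 2.0, 100.0

for t in range(100):
    # decay the learning rate and neighborhood radius over time
    learning_rate = learning_rate_0 * np.exp(-t / lam)
    sigma = sigma_0 * np.exp(-t / lam)

    # find the BMU for this input
    distances = np.sqrt(((weights - x) ** 2).sum(axis=1))
    bmu_index = np.argmin(distances)

    # Gaussian influence based on how far each node index is from the BMU
    # (a stand-in for the true lattice distance in a full 2D implementation)
    node_positions = np.arange(len(weights))
    influence = np.exp(-((node_positions - bmu_index) ** 2) / (2 * sigma ** 2))

    # W(t+1) = W(t) + influence * L(t) * (V - W(t))
    weights += (influence[:, None] * learning_rate) * (x - weights)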

Practical Implementation of SOMs

Fraud Detection

According to a recent report published by Markets & Markets, the Fraud
Detection and Prevention Market is going to be worth $33.19 Billion USD by
2021. This is a huge industry, and the demand for advanced deep learning
skills is only going to grow. That's why we have included this case study in
this chapter.
The business challenge here is about detecting fraud in credit card
applications. We will be creating a deep learning model for a bank, given a
dataset that contains information on customers applying for an advanced
credit card.
This is the data that customers provided when filling out the application
form. Our task is to detect potential fraud within these applications. That
means that by the end of the challenge, we will literally come up with an
explicit list of customers who potentially cheated on their applications.

Dataset

Data Set Information:

This file concerns credit card applications. All attribute names and values
have been changed to meaningless symbols to protect the confidentiality of
the data.
This dataset is interesting because there is a good mix of attributes --
continuous, nominal with small numbers of values, and nominal with larger
numbers of values. There are also a few missing values.
Attribute Information:

There are 6 numerical and 8 categorical attributes. The labels have been
changed for the convenience of the statistical algorithms. For example,
attribute 4 originally had 3 labels p,g,gg and these have been changed to
labels 1,2,3.

A1: 0,1 CATEGORICAL (formerly: a,b)
A2: continuous.
A3: continuous.
A4: 1,2,3 CATEGORICAL (formerly: p,g,gg)
A5: 1, 2,3,4,5,6,7,8,9,10,11,12,13,14 CATEGORICAL (formerly:
ff,d,i,k,j,aa,m,c,w, e, q, r,cc, x)
A6: 1, 2,3, 4,5,6,7,8,9 CATEGORICAL (formerly: ff,dd,j,bb,v,n,o,h,z)
A7: continuous.
A8: 1, 0 CATEGORICAL (formerly: t, f)
A9: 1, 0 CATEGORICAL (formerly: t, f)
A10: continuous.
A11: 1, 0 CATEGORICAL (formerly t, f)
A12: 1, 2, 3 CATEGORICAL (formerly: s, g, p)
A13: continuous.
A14: continuous.
A15: 1,2 class attribute (formerly: +,-)
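The chapter describes the fraud-detection case study but does not list the code, so here is a minimal sketch of how such a SOM could be trained with the third-party MiniSom library; the file name Credit_Card_Applications.csv, the 10x10 grid, and the hyperparameters are all assumptions for illustration rather than values given in this book.
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from minisom import MiniSom  # assumes the minisom package is installed

# load the (hypothetical) credit card application file described above
dataset = pd.read_csv('Credit_Card_Applications.csv')
X = dataset.iloc[:, :-1].values   # customer attributes A1-A14
y = dataset.iloc[:, -1].values    # class attribute A15

# scale every feature to the 0-1 range before training the SOM
scaler = MinMaxScaler(feature_range=(0, 1))
X = scaler.fit_transform(X)

# a 10x10 map; input_len must match the number of columns in X
som = MiniSom(x=10, y=10, input_len=X.shape[1], sigma=1.0, learning_rate=0.5)
som.random_weights_init(X)
som.train_random(data=X, num_iteration=100)

# customers mapped to cells with unusually large inter-neuron distance
# are candidates for a closer fraud inspection
print(som.distance_map())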
Chapter 9 Summary

● The Self Organizing Map is an important neural model in the competitive learning category.
● Each node on the map is connected to the input layer, and no nodes are connected to each other.
● The Self Organizing Map preserves topology, keeping the neighborhood relations of the input data in its mapping.
Chapter 10.
Presentation of Deep Learning

While deep learning is primarily applied to ANNs, its development has many
more applications where it can be used. Since scientists first realized how
many different ways these neural networks can be applied, research has begun
in a wide range of areas, creating systems that go far beyond the
artificial.
These new and innovative machines represent the future of deep learning
and are expected to eventually replace the artificial neural networks in use
today.
This is a phenomenal accomplishment in the history of mankind, however,
as advanced as they are, they still have many limitations; one being their
immense size. They require many systems in order for them to function
properly but their list of tasks they can perform is very small. Think of it in
terms of the first computers invented. They were large in size (some taking
up an entire room) but the number of functions they could perform was
actually quite limited.
The same could be said for these deep learning machines today. The goal
for the future is to create a type of artificial intelligence that requires only
minimal human input and can perform a vast number of functions. This was
the underlying purpose of deep learning.
The deep neural network is the next evolutionary step of the ANN. The
machines used are considerably smaller but with the capability of
performing even more complex functions that go far beyond the capabilities
of the ANNs. Deep neural networks are targeted to have a more usable
program that will allow the machines to work effectively and efficiently
without consuming so much space and energy.
To achieve this deep neural network of the future, several things had to
change.
GPUs/CPUs

Most of us are familiar with CPUs (central processing units). We may not
know exactly what it is, but we know it represents the brains of our personal
computer systems. For years, the CPU was both the heart and the brain of
the computer.
In time, however, the CPU was improved upon with the aid of another
computer part that was not so familiar. The graphics processing unit or the
GPU. In every computer, there are chips responsible for displaying images
on the computer monitor and a GPU is one of the most powerful chips you
have. While these chips have the same function, some are not as effective as
others. Some will provide only the most basic of graphics and others will
function on a much higher level.
The GPU is one of those components of your computer which goes much
further than displaying a clear picture of a computer game. Their role does
not stop at displaying graphics, they can be programmed and perform
computations separate from the CPU in any system. Deep learning uses
GPUs to address many of the limitations that the ANNs have faced since
their inception.
For the most part, GPUs were initially designed for use in computer games
but because of their immense power they have far exceeded expectations
and have quickly been adapted for other uses that have been applied in a
number of ways.

How is the GPU Designed?


To get better graphics on your computer, you need a better graphics card, a
basic truth that most people can readily understand. At its most basic
level, a GPU differs from a CPU in that the two perform basically the same
kind of work but with entirely different architectures. Both will receive a
problem in the form of zeros and ones (binary code), and both will solve the
problem quickly.
However, the way they are designed is where they part ways. CPUs are built
with a handful of powerful, complex cores, while GPUs have thousands of
simpler ones. When you compare real hardware, you can see this more clearly:
the top of the line Mac Pro of the time has a six-core processor, while the
NVidia GTX 980 graphics card has more than 2,000 cores. This allows for
clearer pictures, better resolution, and a host of other benefits.
There are other differences beyond having more cores. You can think of the
CPU as a device that handles a wide variety of sequential tasks quickly and
efficiently; on a computer system, it can solve a geometrical equation or
coordinate shading a picture. The GPU, on the other hand, is better suited
to performing huge numbers of simple calculations in parallel, which is
exactly the kind of workload neural networks produce. This is why it is so
much more practical to use with artificial intelligence.
Today’s advanced neural networks make use of a number of systems to
keep them running including algorithms and GPUs. As a result, deep
learning can be adapted to all sorts of industries to help solve many
problems that may or may not apply to artificial intelligence. It is already
being used in speech recognition, language processing, and computer
vision.
Because GPUs are a major component of machines, it can be adapted to a
wide variety of situations using many layers of data to solve a host of
problems. Depending on the different designs and strategies used, GPUs
can help in three different classifications of deep learning.
● Unsupervised or Generative Learning
This type of learning is meant to capture the underlying structure of the
observed data for pattern analysis at times when no target labels are
available.
● Supervised Deep Learning
These are used to discern and classify different patterns.
● Hybrid Deep Networks
These are designed to distinguish between different elements in the data.
The objective of hybrid deep networks can be improved when they are combined
with supervised learning, and they are primarily used to analyze different
parameters in the input data.
Deep Learning Methods

Another aspect of deep learning is something called Dynamic


Programming. This allows the machine to tackle certain problems with the
use of algorithms based on a recurrent formula and a starting state. A
“state” is a way to describe a particular situation or problem the machine
must solve. A sub-solution is then constructed from any previously found
ones in the system. DP algorithms are a fundamental part of the framework
in neural networks and graphical models.

● Unsupervised Learning with SL & RL

This deep learning method is regularly used when encoding input data is
needed. Data streams like video or speech need to be encoded into a form
that is better geared towards machine learning. These codes describe the
initial data in a manner that reduces redundancy, so it can be fed into SL or
RL machines. These machines usually have a much smaller search space
and cannot manage such large quantities of raw data.

● Backpropagation

This is simply a method that allows the system to compute the gradient, that
is, the partial derivatives, of a function. When the machine solves an
optimization problem with a gradient-based method, it must also compute the
function's gradient each time it iterates. To compute the gradient, the
machine can either use analytic differentiation, where it knows the form of
the function and simply needs to compute the derivatives, or approximate
differentiation using finite differences.

● Stochastic Gradient Descent

This is a very intuitive way to picture gradient descent. Imagine looking at
a river as it flows from the top of a mountain. The goal of the machine is to
determine exactly where the river is going and the path it is going to take.
To accomplish this the machine needs to know certain elements of the
problem: the terrain of the mountain, the lowest point of the foothill, the
curvature of the hills. In machine learning, the input point (the very top of
the hill) may be the only input it has to solve the problem. As it tries to
work out the solution, it will label dips and valleys as local minima
solutions, which it will have to navigate to get around. The output could be
any number of possible paths the river might take. Each time it addresses
this problem it may reach its final destination in a completely different
manner each time.

● Learning Rate Decay

To improve on Stochastic Gradient Descent, there is the Learning Rate
Decay method. This method is often used to reduce the learning rate over a
period of time. It allows the machine to make large changes at the start of a
training test by using larger learning rate values and decreasing them
according to the weights assigned in the training procedure. It may lower
the learning rate based on the epoch or by using punctuated large drops at
specific epochs.
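As a small illustration of the epoch-based and step-drop schedules mentioned here, a schedule like the following could be passed to Keras through its LearningRateScheduler callback; the starting rate, drop factor, and epoch interval are illustrative values only:
import math
from keras.callbacks import LearningRateScheduler

def step_decay(epoch):
    # start at 0.1 and halve the learning rate every 10 epochs
    initial_rate = 0.1
    drop = 0.5
    epochs_per_drop = 10
    return initial_rate * math.pow(drop, math.floor(epoch / epochs_per_drop))

# pass this callback to model.fit(..., callbacks=[lr_schedule])
lr_schedule = LearningRateScheduler(step_decay)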

● Occam’s Razor

This method favors simpler explanations over complex ones. The concept is
that, among the possible solutions, the machine will select the one with the
simplest explanation available. Given the input, the machine will determine
a list of possible solutions to the problem. A good way to think of it is to
imagine a patient going to the doctor complaining of a headache and a sore
throat. There are several medical conditions that can explain a headache: a
brain tumor, an aneurysm, a stroke. There are also several possibilities
that can explain a sore throat: an infection or a virus. The machine would
compare all of these solutions and filter out those that explain only one
symptom and not both. Then it will finally filter out the conditions that
are extreme and narrow the list down to the simplest one. A cold or the flu
would explain both symptoms.

There are many more methods that can be applied to deep learning. Because of
their fast speeds and impressive computational power, GPUs have been at the
heart of machine learning for years. With these additions, training can
often be accelerated many times over, sometimes 50x or more.
Chapter 10 Summary

● The evolution of the ANN naturally leads to the deep neural network.
● Algorithms alongside Dynamic Programming can be utilized to solve targeted challenges based on a recurrent formula as well as a starting state.
● Advanced neural networks are dependent on several systems, such as GPUs and algorithms, to sustain their running.
● GPUs form part of these machines; they can be adjusted to a broad range of situations, using several data layers to address many problems.
Chapter 11.
Deep Q-Learning in this book

Bipedal robot

A bipedal robot will never learn to walk bipedally by moving randomly. It
will fall quickly.

Therefore, if the robot starts walking with Deep Q-Learning, you can
clearly see the effect of Deep Q-Learning.

That is the main reason for the bipedal robot.

Another reason is that it is simply interesting to see the robot walking on
two legs.
The learning process is also interesting. The motion changes during the
learning process.

With the method of this book, the robot will rise after falling down.

In the middle of learning, it tries to get up after falling. After it has
learned enough actions, when it is about to fall, it will try to reposition
itself before it falls.

Finally, it will start to run.

Continuous actions

Reinforcement learning with continuous actions is generally a difficult
problem because there are a myriad of action candidates.
In particular, bipedal robots have many degrees of freedom in their joints,
so it is especially difficult to find appropriate actions among that myriad
of candidates.
Basic way of thinking

This section explains the basic concept of the Deep Q-Learning technology
used in this book.

3 Steps

In this book, Deep Q-Learning is based on three steps. They are as follows:
1. Approximate the Q function accurately.
2. Find exactly the maximum value of the Q function for each situation.
3. Approximate the policy accurately.
Based on this principle, you can think about the direction of improvement
even when learning is not successful.

Approximate the Q function accurately

In implementing Deep Q-Learning, the most important thing is to approximate
the Q function as accurately as possible.

The reason is that, when a Deep Q-Learning program doesn't work, the value
of the Q function is often far from the value of the target (teacher) data.
For example, if the model has no upper limit on the value that the Q
function can take, the value
max_a' Q(s', a')
may be much larger than the true value due to the approximation error of
Q(s', a'). Then, because of the update
Q(s, a) = r(s, a) + g * max_a' Q(s', a'),
errors accumulate and Q(s, a) takes a large value that is theoretically
impossible.

This is a common mistake when people start learning Deep Q-Learning.

Therefore, we use a slightly more careful approximation. We design the
rewards so that the maximum and minimum values of the Q function fall within
a certain range. Then, the value is expressed using a softmax function, so
that there is no significant error.

Find exactly the maximum value of the Q function for each situation

Q(s, a) = r(s, a) + g * max_a' Q(s', a')
So it is important to find max_a' Q(s', a') accurately. Therefore, in this
book, for each situation s', we find the maximum value of Q(s', a').

At this time, the policy P is irrelevant. Policy Gradient methods first
consider a parameterized function P_theta and look for the theta that
maximizes Q(s', P_theta(s')) on average. The method of this book is
finer-grained than Policy Gradient because it moves a' for each s' to find
the maximum value Q(s', a').

Finding max_a' Q(s', a') is difficult because there are infinitely many a'.
Therefore, in this book, the maximum of the 4 values found by the following
4 methods is used as max_a' Q(s', a'):
Baseline: Q(s', a0), where a0 is the optimal action of the previous step
Global: Q(s', a1), where a1 is a random action
Mid-range: Q(s', a2), where a2 = P(s') and P is the learned policy
Local: Q(s', a3), where a3 is selected as follows. Let a be the most
promising of a0, a1, a2. Apply the gradient method to Q(s', a) and find a3
such that Q(s', a3) is larger than Q(s', a).

You can calculate max_a' Q(s', a') approximately in hundreds of steps.
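Here is a rough sketch of that four-candidate search, written around a generic q_func(s, a) and policy(s); the step size, step count, and finite-difference gradient are illustrative choices rather than the book's actual implementation.
import numpy as np

def approx_max_q(q_func, policy, s, a_prev, action_dim, lr=0.05, steps=20):
    # Approximate max_a' Q(s, a') from the baseline, global, mid-range and local candidates.
    candidates = [np.array(a_prev, dtype=float),      # baseline: optimal action of the previous step
                  np.random.rand(action_dim),         # global: a random action from the whole space
                  np.array(policy(s), dtype=float)]   # mid-range: the learned policy's action

    # local: start from the best of the three and take a few gradient-ascent steps on Q(s, a),
    # using a finite-difference gradient as a stand-in for the book's gradient method
    a = max(candidates, key=lambda act: q_func(s, act)).copy()
    eps = 1e-3
    for _ in range(steps):
        grad = np.zeros(action_dim)
        for i in range(action_dim):
            a_plus = a.copy()
            a_plus[i] += eps
            grad[i] = (q_func(s, a_plus) - q_func(s, a)) / eps
        a = np.clip(a + lr * grad, 0.0, 1.0)   # actions are normalized to the 0-1 range
    candidates.append(a)

    best = max(candidates, key=lambda act: q_func(s, act))
    return q_func(s, best), best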

Approximate the policy accurately

For each s, if max_a Q(s, a) can be calculated as above, we know the a4 such
that Q(s, a4) = max_a Q(s, a). So, a policy P with
P(s) = a4
can be defined. Then, we can learn the policy s --> P(s) with deep learning.
Exploration

For bipedal robots, it is necessary to search for good actions. At first, we
let the bipedal robot act randomly. Once we have a certain number of logs,
it can learn to stand up from those logs. However, its walking speed is
slow. The reason is that the random action log contains almost no quick
walking actions.

So we will explore better actions using the ε-greedy method. In other words,
we generally use the learned policy and sometimes mix in random actions.
Then, we can take a log of actions with a slightly faster walking speed.
With this log, we can learn faster walking. By repeating this, we can
gradually learn much faster bipedal walking. All logs should be used for
learning.


Neural network of the Q function

Here, we use a distributional Q function as the Q function model. You can
understand the distributional Q function by reading the following papers and
articles:
Distributional Bellman and the C51 Algorithm
https://flyyufelix.github.io/2017/10/24/distributional-bellman.html
Distributed Distributional Deterministic Policy Gradients
https://arxiv.org/abs/1804.08617
A Distributional Perspective on Reinforcement Learning
https://arxiv.org/abs/1707.06887

Although there are various explanations, it can be said that it is devised so that the Q function can be recorded accurately.

The method using the distributional Q function is also called C51. This is
derived from the original paper which divides the values into 51 categories.

Distributional Q function

The distributional Q function is a function which divides the value of the Q function into categories, stores a distribution over those categories, and returns the average value of the distribution as the value of the Q function.
There are three points in the distributional Q function.

Categories of values
Updating of values
Recording of values
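As a rough sketch of the idea, assuming the value range is split into 51 fixed categories (atoms), as in C51, and that the network's softmax output is the recorded distribution, the scalar Q value is simply the expectation over those categories:

import numpy as np

N_ATOMS = 51                 # number of value categories (as in C51)
V_MIN, V_MAX = 0.0, 10.0     # assumed value range, e.g. rewards in [0, 1] with g = 0.9
atoms = np.linspace(V_MIN, V_MAX, N_ATOMS)   # the support of the distribution

def q_value(prob):
    """prob: softmax output of the network, shape (N_ATOMS,). Returns E[Z] = Q(s, a)."""
    return np.dot(prob, atoms)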

Reason of using the distributional Q function

The reason for using a distributional Q function is that, as far as we know now, it offers the best balance of accuracy and calculation speed.

Reward standardization

We recommend standardizing rewards between 0 and 1.

It is good to use g = 0.9 or 0.95. In this case, 0 ≤ Q(s, a) ≤ 1 / (1 - g) = 10 or 20. This works even when the number of steps is unlimited.

Neural network structure

The structure of the neural network is a feed-forward network with Batch Normalization.

The number of dense layers is 15.

The width of the hidden layers is 192.

The output layer is a softmax.

It seems that the Q function cannot be approximated well if the layers are
too shallow.

It means that the robot does not walk on two legs no matter how much it
learns.

For the number of layers and widths, find appropriate values with regards to
learning results and calculation speed.
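As a rough Keras sketch of a network with that shape; the input size, the ReLU activations between layers, and the 51 output categories are assumptions on my part rather than values fixed by the chapter:

from tensorflow import keras
from tensorflow.keras import layers

def build_q_network(input_dim, n_atoms=51, width=192, n_dense=15):
    """Feed-forward Q network: dense layers with Batch Normalization, softmax output."""
    inputs = keras.Input(shape=(input_dim,))
    x = inputs
    for _ in range(n_dense):
        x = layers.Dense(width)(x)
        x = layers.BatchNormalization()(x)
        x = layers.Activation("relu")(x)
    outputs = layers.Dense(n_atoms, activation="softmax")(x)  # distribution over value categories
    return keras.Model(inputs, outputs)

In this sketch the input would be the standardized situation s concatenated with a candidate action a; how the pair is actually fed to the network is not specified here, so treat that as an assumption.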
Neural network of the policy

Neural network structure

The structure of the neural network of the policy is also a feed-forward network with Batch Normalization.

The number of dense layers is 15.

The width of the hidden layers is 192.

The output layer is a sigmoid. Each dimension of the output represents one dimension of the action.

Each dimension of the policy output is normalized to 0 to 1. It is converted to an appropriate value on the Unity side of the program and used there.

It seems that if the layers are too shallow, the policy cannot be
approximated well.
For the number of layers and widths, similarly, find appropriate values with
regards to learning results and calculation speed.

In this case, the error of the learned policy is approximately 10%.

It is good to make it smaller if possible.

On the Unity side, inference is run on the policy neural network. Therefore, the speed of the policy calculations is important.

For the policy, we need a neural network that is faster and more accurate.
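A matching sketch for the policy network, which differs from the Q network sketch above mainly in its sigmoid output, one unit per action dimension; the ReLU activations and input size are again assumptions:

from tensorflow import keras
from tensorflow.keras import layers

def build_policy_network(state_dim, action_dim, width=192, n_dense=15):
    """Feed-forward policy network with Batch Normalization and a sigmoid output."""
    inputs = keras.Input(shape=(state_dim,))
    x = inputs
    for _ in range(n_dense):
        x = layers.Dense(width)(x)
        x = layers.BatchNormalization()(x)
        x = layers.Activation("relu")(x)
    outputs = layers.Dense(action_dim, activation="sigmoid")(x)  # each action dim in [0, 1]
    return keras.Model(inputs, outputs)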
maxQ and Q gradient for continuous actions

Maximum value of the Q function max_a Q(s, a)

The maximum value of the Q function max_a Q(s, a) exists for each
situation s. But it is hard to calculate and find a4 = argmax_a Q(s, a) where
Q(s, a4) = max_a Q(s, a).

Therefore, in this document, we assume 4 candidates and use the maximum value among them as max_a Q(s, a).

4 candidates

The 4 candidates are argmax_a Q(s, a) from one step before (the baseline), the global candidate, the mid-range candidate, and the local candidate.

The baseline is the optimal action of the previous step.

Global candidates are used because they are searched from the entire action
space.

Mid-range candidates are P(s), where s is the situation and P is the learned policy from the previous step. They are used because they may be good candidates.

Local candidates are used because, unless an action is already a local maximum, there are better actions near it.
Baseline candidates: argmax_a Q(s, a) of the previous step

The optimal action a0 of the previous step is a natural baseline candidate for the optimal action because, in many cases, actions converge toward the best action.

Global candidates: random

Random actions are good as global candidates.


The reason is that random actions may be better actions with a certain
probability.

Mid-range candidates: learned actions

Mid-range candidates are learned actions.

The learned actions are reasonable candidates because they are made from
the previous best actions.

Local candidates: Q gradient


Suppose that there is a candidate a1 for the optimal action at that time for
situation s. Consider the neighborhood of a1 for each s. If a1 is not a local
maximum point of Q(s, a), there will always be a better action than a1.

The method of searching the neighborhood of a1 for every s is called the Q gradient.

Therefore, Q gradient can be used to find local candidates for optimal actions.
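A minimal sketch of that local search with TensorFlow's automatic differentiation, assuming a hypothetical q_model that maps a concatenated (s, a) vector to a scalar Q value:

import tensorflow as tf

def q_gradient_ascent(q_model, s, a, steps=10, lr=0.01):
    """Nudge action a uphill on Q(s, a) to obtain a better local candidate a3."""
    a = tf.Variable(a, dtype=tf.float32)
    s = tf.constant(s, dtype=tf.float32)
    for _ in range(steps):
        with tf.GradientTape() as tape:
            q = q_model(tf.concat([s, a], axis=-1)[tf.newaxis, :])  # Q for this (s, a)
        grad = tape.gradient(q, a)
        a.assign(tf.clip_by_value(a + lr * grad, 0.0, 1.0))  # keep the action in [0, 1]
    return a.numpy()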

maxQ

For each situation s, among these 4 candidates, the a that maximizes Q(s, a) can be found by calculating and comparing Q(s, a). This method is called maxQ in continuous action space.

maxQ can be used to approximate s --> argmax_a Q(s, a).


Others

There are some other points to go through, such as:

Data standardization

Standardization of data, that is, standardization of the situations s, often makes the convergence of Deep Learning much faster, so it is recommended. With this standardization, the mean of each situation component becomes 0 and its variance becomes 1.
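A minimal sketch of that standardization over a log of situations stored as a NumPy array:

import numpy as np

def standardize_states(states, eps=1e-8):
    """states: (N, state_dim). Returns zero-mean, unit-variance states plus the stats."""
    mean = states.mean(axis=0)
    std = states.std(axis=0) + eps   # eps avoids division by zero for constant features
    return (states - mean) / std, mean, std

The same mean and standard deviation must then be reused when standardizing situations at inference time.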

Action standardization

Actions a are recommended to be standardized between 0 and 1 in the logs. Moreover, the values of the policy function P(s) are better kept between 0 and 1 as well. This makes implementation and maintenance of the programs easier.

Using float32

It is recommended to use the type float32 explicitly in TensorFlow and NumPy. The reason is that float32 computations are faster than float64, and in Deep Learning the precision of float32 is sufficient in many cases.
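For example, being explicit about the dtype might look like this; the array shape is a placeholder:

import numpy as np
import tensorflow as tf

states = np.zeros((1024, 24), dtype=np.float32)              # NumPy: request float32 explicitly
states_tf = tf.convert_to_tensor(states, dtype=tf.float32)   # TensorFlow: float32 tensor
tf.keras.backend.set_floatx("float32")                       # Keras layers default to float32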

Reward

Rewards are output to log.

When designing the log, we may not yet know what an appropriate reward for reinforcement learning is.

Since millions of logs are output, it takes too much time to re-output by trial
and error.

So, all the reward candidates are output to the log. Then, in the Deep Q-Learning program, a more appropriate reward is assembled by taking a linear combination of the candidates.
We examine better mixes of rewards by checking the learning results, for example, the movement of the bipedal robot.

For instance, if the walking speed is slow, you can increase the walking
speed reward percentage, and if it is easy to fall, you can increase the
upright reward percentage.
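A minimal sketch of that linear combination; the candidate rewards and weights below are invented for illustration:

import numpy as np

# Each column is one logged reward candidate per step, e.g. speed, uprightness, energy.
reward_candidates = np.array([[0.3, 0.9, 0.1],
                              [0.5, 0.8, 0.2]])   # shape (n_steps, n_candidates)

weights = np.array([0.6, 0.3, 0.1])    # tuned by inspecting the robot's movement
rewards = reward_candidates @ weights  # one combined reward per step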
Chapter 11 summary

● The driving rationale behind the bipedal robot project is that it is interesting to see a robot walking on two legs.
● The learning process is also interesting, especially the motion changes during learning, which highlight the appeal of the bipedal robot.
● A successful Deep Q-Learning implementation depends on approximating the Q function as accurately as possible.
Chapter 12.
Recursive Neural Tensor Networks

In this chapter we are going to extend the recursive neural network to get an
even more expressive recursive neural tensor network or RNTN.
It’s a very fancy name but as you’ll see it’s a very simple concept.
Note that everything in this lecture assumes we’re working with binary
trees.
So you already know how a recursive neural network is built.
Let’s look at an alternative view of the recursive neural network, since this was how it was represented in the paper that first introduced this model.
To get the inner node value, we concatenate the values from the left and
right child, call this x.
Then the value at this node is h = f(W^T x + b)
Note that this requires W to be of size 2DxD, because all the inner node
values must be of size D, and after concatenation, x is of size 2D.
Note that we still have the same number of parameters as before, because
before we had W_left and W_right which were of size DxD.
What is a natural way to extend this model?
Well, we can add a quadratic term.
h(j) = f(x^T A(j) x + W(j)^T x + b(j))
What should the size of A be?
Notice I’ve only shown one component of h here. It’s written such that each
term is a scalar, since it’s difficult to represent tensor multiplication in terms
of matrices.
In order for the quadratic term to be a valid multiplication, A(j) needs to be
a matrix.
It needs to be a matrix of size 2Dx2D since x is of size 2D.
Since there are D components of h (j goes from 1..D), that means A is of size Dx2Dx2D.
The functional form should remind you a lot of something we’ve seen in the
past.
Remember that if both Gaussian distributions had the same covariance, then
the separating hypersurface was linear.
This is also called linear discriminant analysis.
If they had a different covariance, then the hypersurface was quadratic,
since the covariance terms did not cancel out.
This is called quadratic discriminant analysis.
What’s another way to think about this?
Let’s break down the individual components of X, let’s say D = 2.
Then we have x_left_1, x_left_2, x_right_1, and x_right_2.
The result is we’re now getting terms like param*x_left_1*x_right_1, and
param*x_left_1*x_left_2
We saw these in my linear regression course, where we called them
“interaction terms”.
So that’s a pretty simple addition. Just adding some quadratic terms by
making all the input variables interact via multiplication.
Now how do we implement this?
Rather than concatenate the left child’s value and the right child’s value, we
are going to leave them separate, and implement an equivalent formulation.
Hopefully it will help you more easily picture what’s happening.
So let’s start again with the plain recursive neural network:
h = f(W_L^T x_L + W_R^T x_R + b)
We can extend this by adding 3 quadratic terms:
h = f(x_L^T A_LL x_L + x_L^T A_LR x_R + x_R^T A_RR x_R + W_L^T x_L + W_R^T x_R + b)
Of course, each A must be of size DxDxD.
How many weights is that in total? 3D^3.
Note that this is NOT the same as the original formulation, which had 4D^3
terms.
The question you want to ask yourself is - are we missing any interaction
terms here?
I would recommend trying both and comparing the results.
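Before moving to Theano, a small NumPy sketch may make the tensor term concrete; the dimension D, the random values, and the tanh nonlinearity are placeholders consistent with the text rather than the book's actual code:

import numpy as np

D = 4
x_left, x_right = np.random.randn(D), np.random.randn(D)
A_LL, A_LR, A_RR = (np.random.randn(D, D, D) for _ in range(3))  # each of size D x D x D
W_L, W_R, b = np.random.randn(D, D), np.random.randn(D, D), np.zeros(D)

# x.dot(A).dot(y) with a 3-D A, written explicitly as an einsum over the middle index
quadratic = (np.einsum('i,ijk,k->j', x_left,  A_LL, x_left) +
             np.einsum('i,ijk,k->j', x_left,  A_LR, x_right) +
             np.einsum('i,ijk,k->j', x_right, A_RR, x_right))
h = np.tanh(quadratic + x_left @ W_L + x_right @ W_R + b)   # node value, shape (D,)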
Now let’s talk about how we’re going to do this in Theano.
It should be a small change to the plain recursive neural network.
The only part that’s different is that we are now going to represent our trees
as a different set of lists.
In particular, we’ll get rid of the notion of relations and parents.
Instead, we’ll have 3 lists called words, left_children, and right_children.
As the names suggest, the words list will contain the word index if that
node is a word, otherwise it will be -1 as usual.
Left children will contain the index of the left child if the node has a left
child, and the same goes for the right.
Note that any node with a left child will also have a right child, since all
nodes have either 0 or 2 children.
Why are we doing this?
Recall that in our previous formulation, when we calculated the value at the
current node, we would feed its value into the parent.
But now, we need to not only multiply the current node by a weight, but we
need to multiply it by its sibling node’s value as well, which we currently
don’t have access to.
So using lists for left_children and right_children makes this task easier.
Let’s go through the main code changes.
1) Of course, we’ll be initializing more weights. In particular, W11, W12,
and W22.
2) The 4 Theano inputs we initialize will now be words, left_children,
right_children, and labels.
3) The recurrence will now be defined in terms of the formulation I
described previously.
If it’s a word, return the word vector.
If it’s not a word, calculate this node’s value in terms of its children.
You can see the full set of changes in the class repo, in the file
rntn_theano.py. I’ve bolded the main changes.
import sys
import numpy as np
import matplotlib.pyplot as plt
import theano
import theano.tensor as T
from sklearn.utils import shuffle
from util import init_weight, get_ptb_data, display_tree
from datetime import datetime
from sklearn.metrics import f1_score
class RecursiveNN:
def __init__(self, V, D, K):
self.V = V
self.D = D
self.K = K
def fit(self, trees, learning_rate=10e-4, mu=0.5, reg=10e-3, eps=10e-3,
epochs=20, activation=T.tanh, train_inner_nodes=False):
D = self.D
V = self.V
K = self.K
self.f = activation
N = len(trees)
We = init_weight(V, D)
W11 = np.random.randn(D, D, D) / np.sqrt(3*D)
W22 = np.random.randn(D, D, D) / np.sqrt(3*D)
W12 = np.random.randn(D, D, D) / np.sqrt(3*D)
W1 = init_weight(D, D)
W2 = init_weight(D, D)
bh = np.zeros(D)
Wo = init_weight(D, K)
bo = np.zeros(K)
self.We = theano.shared(We)
self.W11 = theano.shared(W11)
self.W22 = theano.shared(W22)
self.W12 = theano.shared(W12)
self.W1 = theano.shared(W1)
self.W2 = theano.shared(W2)
self.bh = theano.shared(bh)
self.Wo = theano.shared(Wo)
self.bo = theano.shared(bo)
self.params = [self.We, self.W11, self.W22, self.W12, self.W1, self.W2,
self.bh, self.Wo, self.bo]
words = T.ivector('words')
left_children = T.ivector('left_children')
right_children = T.ivector('right_children')
labels = T.ivector('labels')
def recurrence(n, hiddens, words, left, right):
w = words[n]
# any non-word will have index -1
hiddens = T.switch(
T.ge(w, 0),
T.set_subtensor(hiddens[n], self.We[w]),
T.set_subtensor(hiddens[n],
self.f(
hiddens[left[n]].dot(self.W11).dot(hiddens[left[n]]) +
hiddens[right[n]].dot(self.W22).dot(hiddens[right[n]]) +
hiddens[left[n]].dot(self.W12).dot(hiddens[right[n]]) +
hiddens[left[n]].dot(self.W1) +
hiddens[right[n]].dot(self.W2) +
self.bh
)
)
)
return hiddens
hiddens = T.zeros((words.shape[0], D))
h, _ = theano.scan(
fn=recurrence,
outputs_info=[hiddens],
n_steps=words.shape[0],
sequences=T.arange(words.shape[0]),
non_sequences=[words, left_children, right_children],
)
py_x = T.nnet.softmax(h[-1].dot(self.Wo) + self.bo)
prediction = T.argmax(py_x, axis=1)
rcost = reg*T.mean([(p*p).sum() for p in self.params])
if train_inner_nodes:
cost = -T.mean(T.log(py_x[T.arange(labels.shape[0]), labels])) + rcost
else:
cost = -T.mean(T.log(py_x[-1, labels[-1]])) + rcost
grads = T.grad(cost, self.params)
cache = [theano.shared(p.get_value()*0) for p in self.params]
updates = [
(c, c + g*g) for c, g in zip(cache, grads)
]+[
(p, p - learning_rate*g / T.sqrt(c + eps)) for p, c, g in zip(self.params,
cache, grads)
]
self.cost_predict_op = theano.function(
inputs=[words, left_children, right_children, labels],
outputs=[cost, prediction],
allow_input_downcast=True,
)
self.train_op = theano.function(
inputs=[words, left_children, right_children, labels],
outputs=[cost, prediction],
updates=updates
)
costs = []
sequence_indexes = range(N)
if train_inner_nodes:
n_total = sum(len(words) for words, _, _, _ in trees)
else:
n_total = N
for i in xrange(epochs):
t0 = datetime.now()
sequence_indexes = shuffle(sequence_indexes)
n_correct = 0
cost = 0
it = 0
for j in sequence_indexes:
words, left, right, lab = trees[j]
c, p = self.train_op(words, left, right, lab)
if np.isnan(c):
print "Cost is nan! Let's stop here. Why don't you try decreasing the
learning rate?"
exit()
cost += c
if train_inner_nodes:
n_correct += np.sum(p == lab)
else:
n_correct += (p[-1] == lab[-1])
it += 1
if it % 1 == 0:
sys.stdout.write("j/N: %d/%d correct rate so far: %f, cost so far:
%f\r" % (it, N, float(n_correct)/n_total, cost))
sys.stdout.flush()
print "i:", i, "cost:", cost, "correct rate:", (float(n_correct)/n_total), "time
for epoch:", (datetime.now() - t0)
costs.append(cost)
plt.plot(costs)
plt.show()
def score(self, trees):
n_total = len(trees)
n_correct = 0
for words, left, right, lab in trees:
_, p = self.cost_predict_op(words, left, right, lab)
n_correct += (p[-1] == lab[-1])
return float(n_correct) / n_total
def f1_score(self, trees):
Y = []
P = []
for words, left, right, lab in trees:
_, p = self.cost_predict_op(words, left, right, lab)
Y.append(lab[-1])
P.append(p[-1])
return f1_score(Y, P, average=None).mean()
def add_idx_to_tree(tree, current_idx):
# post-order labeling of tree nodes
if tree is None:
return current_idx
current_idx = add_idx_to_tree(tree.left, current_idx)
current_idx = add_idx_to_tree(tree.right, current_idx)
tree.idx = current_idx
current_idx += 1
return current_idx
def tree2list(tree, parent_idx, is_binary=False):
if tree is None:
return [], [], [], []
words_left, left_child_left, right_child_left, labels_left = tree2list(tree.left,
tree.idx, is_binary)
words_right, left_child_right, right_child_right, labels_right =
tree2list(tree.right, tree.idx, is_binary)
if tree.word is None:
w = -1
left = tree.left.idx
right = tree.right.idx
else:
w = tree.word
left = -1
right = -1
words = words_left + words_right + [w]
left_child = left_child_left + left_child_right + [left]
right_child = right_child_left + right_child_right + [right]
if is_binary:
if tree.label > 2:
label = 1
elif tree.label < 2:
label = 0
else:
label = -1 # we will eventually filter these out
else:
label = tree.label
labels = labels_left + labels_right + [label]
return words, left_child, right_child, labels
def main(is_binary=True):
train, test, word2idx = get_ptb_data()
for t in train:
add_idx_to_tree(t, 0)
train = [tree2list(t, -1, is_binary) for t in train]
if is_binary:
train = [t for t in train if t[3][-1] >= 0] # for filtering binary labels
for t in test:
add_idx_to_tree(t, 0)
test = [tree2list(t, -1, is_binary) for t in test]
if is_binary:
test = [t for t in test if t[3][-1] >= 0] # for filtering binary labels
train = shuffle(train)
train = train[:5000]
test = shuffle(test)
test = test[:1000]
V = len(word2idx)
print "vocab size:", V
D = 20
K = 2 if is_binary else 5
model = RecursiveNN(V, D, K)
model.fit(train)
print "train accuracy:", model.score(train)
print "test accuracy:", model.score(test)
print "train f1:", model.f1_score(train)
print "test f1:", model.f1_score(test)
if __name__ == '__main__':
main()
Chapter 12 Summary

● A recursive neural network can be extended to establish an expressive recursive neural tensor network (RNTN).
● The resulting network has a well-defined yet flexible structure that bolsters its ability to learn and apply data in a targeted way.
● The underlying parameters help in guiding the operation of the machine/deep learning process.
● The resulting formulation offers a diverse and flexible basis, in its code, for machine learning and the subsequent output as well.
Chapter 13.
Optimizers

How, you may ask, do you turn your classifier into a deep neural network?
Optimizers!
An optimizer is nothing more than an algorithm designed to decrease the
loss function of a classifier. If there’s a big difference between the value of
your network’s prediction and the value of the correct answer, the optimizer
is the function that is going to calculate the difference and perform the
backpropagation. Optimizers are another one of those nifty lines of code
available online for you to cut and paste and be on your way.
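For instance, in Keras the optimizer really is one line at compile time; the model, layer sizes, and loss below are placeholders:

from tensorflow import keras

model = keras.Sequential([
    keras.Input(shape=(10,)),                      # 10 input features (placeholder)
    keras.layers.Dense(64, activation="relu"),
    keras.layers.Dense(2, activation="softmax"),   # two classes (placeholder)
])
# The optimizer is the piece that uses the loss to drive the weight updates.
model.compile(optimizer=keras.optimizers.Adam(learning_rate=1e-3),
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])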
While our simple example used binary as a sort of an on/off switch,
channeling information through the nodes to clearly defined classes, neural
networks more often use probabilities to determine which outcome is most
likely the correct answer. While it seems like this would be less predictable
than a simple yes or no answer, this element of uncertainty actually allows
neural networks the flexibility to handle complex tasks, without having to
change an entire class of animals into inappropriately lactating alligators.
Instead, these machines can learn from a vast array of samples, associating
images and data sets over and over again until they can determine the
difference between insects and reptiles through experience.
Before we go any further, I just want to remind you…
ANNs are messy.
DNNs are even messier.
Let’s start with happy input node number 1. This little guy is going to
represent a feature, and he is going to connect to all the nodes in our first
layer. Each of the edges, the connections between the input nodes and the
nodes in our first layer, are going to be assigned a weight. All of our
weights are going to fall between 1 and -1. For simplicity's sake, think of -1
as the new zero, a definite nope. We don’t want the limitations of the binary
system. We don’t want our machine to be certain of anything right now
because it shouldn’t be. It hasn’t learned anything yet. Our ANN needs to
be flexible. This is why we are not going to be using whole numbers. -1 =
certainly not and 1= certainly so, and our ANN is only guessing. These
weights are going to be assigned randomly, they don’t have to mean
anything. Their job is to make a guess and get corrected later on.
You may notice the different widths of the lines leading to and from our
nodes. This is a visual aid meant to illustrate the fact that they all carry
unique weights. Even though the weights are labeled here, we don’t need to
calculate their exact values, and you will probably not need them when you
begin programming your own neural networks. Just one more feature
available online: random weight generators. After we get all our nodes
connected, we’re going to drop in a net input function. This function is
going to generate a net sum of all the weighted input values and pass that
net sum through an activation function. We will use the ReLU function in
this example. The ReLU function, as stated above, way above, back with all
the graphs, will transform the weighted sum into an output signal, which
will then travel to the second hidden layer of our ANN. It will do so better
than all the other activation functions because ReLU can’t help that it’s just
so awesome, okay?
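Stripped of the storytelling, that net input function plus ReLU is just this; the inputs, weights, and bias are made-up numbers:

import numpy as np

inputs  = np.array([0.2, 0.7, 0.1])          # values arriving from the previous layer
weights = np.array([0.5, -0.3, 0.8])         # random weights between -1 and 1
bias = 0.1

net_sum = np.dot(inputs, weights) + bias     # the net input function
output = max(0.0, net_sum)                   # ReLU: pass positives, zero out negatives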
In addition to the random weights generators, the online platforms available
for people who are just wading into the waters of ANNs have ready to use
optimizers in their toolkits. It is not quite as simple as cutting and pasting
ideas together, however. You will need to know which tools do what and
how to choose the right options. And code. There are options here as well.
Many popular platforms work with Python, but you have options.
Now that the output has reached the second hidden layer, it will be
multiplied by the new weights and ready to pass through yet another net
input function and back again through, you guessed it, the ReLU.

And now you’re done!


Just kidding!
That’s right, get back to work, ANN.
Because this first run is just the beginning of the training. The next step is a
process called backpropagation, which involves taking the outcomes, or
predictions, generated by the neural network, passing them through a
function that determines just how wrong they were, and passing that
knowledge back through the layers, using math to adjust the weights in the
hopes that the next outcomes will be closer to the truth. We can make these
adjustments because all of our training data was clearly labeled, and we
knew the correct answers the entire time.
The weights are the key to a neural network’s ability to learn. Back
propagation is the process of calculating the gradient of the loss, which will
tell you whether the value of a weight needs to be increased or decreased.
We are not looking to calculate an exact answer to give us the perfect
weight, only an indication of which direction we should nudge our simple
ANN. To adjust the weights and decrease error, you will need to calculate the gradient, or slope. The slope equals the change in error over the change in weight. OH GOD, IT'S CALCULUS!
Before we panic, let us remember that these calculations are readily
available to us as neatly packaged lines of code, and that as long as we
know which kind of task we are performing, i.e., classification vs regression, and what kind of network we have on our hands, i.e., convolutional vs recurrent, it is only a matter of matching our needs to the online toolkits. Gradient descent, simply put, follows the slope of the loss, the gap between the true answer and the predicted answer, downhill. A good algorithm will not simply jump the weights to what it thinks are the correct numbers, but will correct them little by little, working from randomly chosen samples, which is why it is called Stochastic Gradient Descent. It is the process of making small, incremental changes. The general rule of thumb when training a neural
network is the slower, the better. It turns out our simple ANN is rather
touchy, and if you push it too hard in any direction, you are liable to see
some very unpredictable behavior from your network.
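The core of that stochastic gradient descent update is a one-line nudge against the gradient; a bare-bones sketch with a made-up one-dimensional loss:

import numpy as np

def sgd_step(weights, grad_loss, learning_rate):
    """Move the weights a small step against the gradient of the loss."""
    return weights - learning_rate * grad_loss(weights)

# Example: minimize loss(w) = (w - 3)^2, whose gradient is 2 * (w - 3).
w = np.array([0.0])
for _ in range(100):
    w = sgd_step(w, lambda w: 2 * (w - 3), learning_rate=0.1)
# w is now very close to 3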
Neural networks can classify data extremely well. Sometimes too well.
Every once in awhile, the classifier will start to take into account one-off
circumstances and small deviations in the norm will throw your machine for
a loop as it struggles to let go of the one little detail that keeps a sample
from fitting neatly into a given class. Sometimes, the ANN overthinks the
situation.
This is called overfitting.
When this happens, there are a few things we can do to try to fix the
situation. We can apply regularization techniques. L1 and L2 regularization add a penalty for large weights on top of the loss, the difference between the true value and the predicted value. Think of them as a mathematical way of telling your
ANN to chill out and stop taking its job so seriously. These mathematical
chill pills can help your network stabilize its decision making. Other times,
more drastic measures are necessary.
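Before we get to those drastic measures, here is a minimal sketch of what applying L1 or L2 regularization looks like in Keras; the layer size and penalty strength are placeholders:

from tensorflow import keras
from tensorflow.keras import regularizers

layer = keras.layers.Dense(
    64,
    activation="relu",
    kernel_regularizer=regularizers.l2(0.01),    # L2: penalize large squared weights
    # kernel_regularizer=regularizers.l1(0.01),  # L1: penalize large absolute weights
)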
Sometimes, you have to kill some neurons.
When your network is just a little too efficient for its own good, you have to
take matters into your own hands with a method known as dropout. The dropout function replaces a fraction of the values in your neural network, commonly around 25%, with 0, essentially killing them off. “Why?” you ask. Why, after you’ve built this
machine, fed it all the data you could provide, spent the time and energy
training it to be this remarkable problem-solving wizard-machine, why
would you then turn around and give your computer brain damage?
So that it stops overthinking the problem. We’ve actually seen a loose
analogy for dropout in our animal classification example when the Eggs
node was dropped from the tree. In practice, dropout takes out neurons at
random, but it is, in essence, trying to prune redundant or superfluous nodes
from the structure. Dropout, rather than damaging the network as a whole,
actually boosts the remaining values in the vector to maintain the average.
Don’t worry about your ANN. This is for its own good.
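In Keras, that executioner is a single layer; the 25% rate below matches the figure used above, but it is an ordinary tunable parameter, and the layer sizes are placeholders:

from tensorflow import keras

model = keras.Sequential([
    keras.Input(shape=(20,)),
    keras.layers.Dense(64, activation="relu"),
    keras.layers.Dropout(0.25),   # randomly zero out 25% of activations during training
    keras.layers.Dense(64, activation="relu"),
    keras.layers.Dropout(0.25),
    keras.layers.Dense(2, activation="softmax"),
])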
You may not need to resort to such violence. If your neural network is
overfitting, you may want to take another look at your data. Overfitting can
be caused by either an insufficient supply of sample instances or too many
attributes. Consider our animal classification example again. We kicked the
Legs attribute out of our data set for being utterly useless to us as a
classifying attribute. Maybe your ANN will not have a feature so obviously
redundant, but it’s worth taking a look if your network seems to be
struggling. Remember one of the keys to a good learning machine is its
ability to generalize. If it can’t make assumptions, it is not going to make
very accurate predictions when presented with a subject from outside its test
data set.
Cross validation is a simple, easy way to gauge how your ANN’s training is
coming along. When you test your neural network, you will have to use
samples that your ANN has never encountered before, otherwise, you will
just be generating so-called memorized answers. Your ANN will be
cheating, in essence, because it doesn’t have to figure anything out if it’s
already solved that particular problem. You could go find a completely new
data set strictly for training purposes, or you could follow the common
practice of reserving a portion of your training set for testing. Selecting a
random 20 to 40 percent of your samples at the onset of training will give
you a good blind test set for your network after all those grueling test runs.
This is the only way to know if your ANN is ready for prime time. Think of
it as taking off the training wheels. After all the weights have been adjusted
and readjusted, you may think your artificial neural network is running like
a well-oiled machine. Make no mistake, it has been improving, but only
with instances it has seen before. Throwing new information into the works
will give you a realistic view of your network’s actual predictive
capabilities, and allow you to modify your training again if need be.
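As a minimal sketch, holding out such a blind test set with scikit-learn could look like this; the arrays X and y are placeholders for your samples and labels:

import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.randn(1000, 10)           # placeholder samples
y = np.random.randint(0, 2, size=1000)  # placeholder labels

# Reserve 30% of the data that the network will never see during training.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)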
Maybe you’ve done the training, primed your network, and then bombed on
the test sets. Then you’ve trained again and bombed again. You may be
tempted to change algorithms like shoes, trying on pair after pair to find the
right fit. Although algorithms are important, the key to training a neural
network really is the data. Your ANN is a hungry, hungry machine, and you
are going to need to feed it a lot of information in order to get it to perform.
Yes, neural networks are needy, greedy beasts, but they come with a huge
benefit. The programmer does not necessarily need to understand every step
the network is taking, or even exactly what each neuron is looking for. The
ability of the machine to make adjustments after each iteration allows the
algorithms to write themselves, which is pretty cool when you think about
it.
Chapter 13 Summary

● An optimizer is an algorithm designed to decrease the loss function of a classifier while bolstering its effectiveness in the process.
● The nodes are the platforms through which information
is channeled into defined classes.
● Neural networks typically utilize probabilities to
establish the preferred outcomes and what is likely to
happen.
● ANNs can be gauged using cross validation to establish
the progression of the training.
Chapter 14.
The Future of Deep Learning

In the paper “Deep learning, deep change?”, the authors used a neural network to sift through 1.3 million documents on arXiv and incrementally narrowed down the search parameters by analyzing titles, keywords, locations, etc., before comparing the list with the Crunchbase registry, analyzed the same way.
As expected, the deep learning papers covered topics of computer vision,
computer learning, machine learning, AI and neural networks, with the US
producing some 30% of all deep learning research papers and 30% of all
other unrelated research papers. China was overrepresented in the deep
learning section, producing three deep learning papers for every one
unrelated to deep learning. Computer vision and computer learning were the
most common topics, jointly encompassing some 70% of all deep learning
papers on Arxiv. Texas ranked highly as well, which is explainable by the
fact most disillusioned Californians named the barbecue state as the most
likely US resettlement destination in the 2018 Bay Area Council Poll.
The analysis showed that China has the fastest rise in deep learning-related
business ideas, with European countries falling behind and France being the
very worst. The explanation for this effect is that Chinese business,
research, and manufacturing sectors exist in tightly clustered regions, with
the Chinese government having lax regulations on any research that
promotes business growth and advances Chinese supremacy on the global
market of ideas. This kind of industriousness does tend to produce items of
subpar quality but fosters innovation, cost cutting, and quick turnaround.
The paper goes on to compare deep learning to seminal inventions such as
the steam engine, electricity and the free exchange of information known as
the internet, noting that each of these led to the rise of an empire: the UK
conquered half the known world thanks to the steam engine, the US
boomed because of electricity and Silicon Valley would command nothing
without the internet. This would make us believe that the research in deep
learning and AI-related business technologies will boost China to the
position of a world superpower; any country that doesn’t have an economic
strategy focused on deep learning and closely mimicking China’s is bound
to fall behind.

Rise of a new empire

For an invention to be of such earth-shattering magnitude as electricity, it


should have three distinct qualities: rapid growth, diffusion into new areas,
and a high degree of impact in those fields. Neural networks that use deep
learning practically teach and grow themselves, with the added benefit of
their owners being able to pit them against one another and see what comes
up. We also find that deep learning is getting more and more practical
applications with each passing day as some long-standing problems in
various industries that were simply too costly to do any other way are now
being reexamined. Finally, neural networks and deep learning can provide
genuinely novel insights where applied and are able to boost productivity
beyond human capabilities.
The steam engine was what started the industrial revolution in the 19th
century UK, with raw muscle strength being replaced by steam pressure,
but it was electricity that helped in miniaturizing every aspect of factories in
the 20th century and the internet that provided an instantaneous information
flow that will make deep learning a transformative force for the 21st
century. Each new quantum leap was always marked by the discovery of
brand-new materials and ways to extract even more resources from old ones
for less cost. If we now look back at how long it took for an industry to
adopt one such revolutionary invention, we can note the industrial giants
were set in their ways and incapable of adapting for several decades; it was
always the small, nimble competition that seized the first-mover advantage
in an environment of legal uncertainty that allowed unbridled
experimentation.
In a well-set industry, such as cellulose production through pulp-and-paper
mills, producers have razor-thin profit margins due to just how much
legislation there is, with governments regularly adding even more. For
example, some cellulose waste chemicals are known to be toxic if released
into the water, but for others, there’s only suspicion and no concrete proof; a
pulp-and-paper mill owner would be tempted to use new and cheaper but
potentially toxic chemicals as much as possible before the government
outlaws them. Once out in the wild, these chemicals have unknown effects
on plant life and animal health, which is even worse than if they were
poisons – since poisons have known effects and treatments. There is no
clear solution to this dilemma since we do need paper but can’t help
polluting when producing it. This leads us to the tragedy of the commons,
the unavoidable outcome of such business mindset on the environment and
resources we share.
The air we breathe, the water we drink and the very soil we live on are
considered joint properties, a shared resource we all need and compete for
but can’t really affect them much; it’s the businesses set to exploit as much
as possible before anyone else does that pollute and ravage the environment
in their mindless pursuit of profits. In 2010, “Deepwater Horizon”, a Gulf of Mexico oil rig operated for BP Exploration & Production, exploded, killing 11 workers and releasing some 4 million barrels of oil into the ocean over 87 days until the leaking well was plugged. BP Exploration & Production was eventually sued by a party of litigants claiming damages and had to pay out over $20bn, on top of the US government issuing a $5.5bn penalty for water pollution and $8.8bn for damaging natural resources. So what caused the Deepwater Horizon catastrophe?
Executives at any company have two guiding principles: duty of care and
duty of value. The former mandates that they do their due diligence before
undertaking any projects to make sure their company doesn’t harm the
environment and to constructively contribute to a better society for
everyone. All these notions of care are idealistic, but more importantly,
they’re impossible to quantify. On the other hand, we have the latter
principle that states an executive must do whatever it takes to increase
company value, determined by comparing revenue numbers, which are
quantifiable. These two principles are meant to balance out, but that’s never
the case and all companies that survive for decades gradually become more
and more exploitative, ruthless and manipulative to squeeze out that extra
0.1% profit that makes the next quarterly balance sheet green and provides
the CEO with a fat bonus.
If we now look back at the consumer market over the decades, we’ll easily
notice companies, such as telecom operators, that experienced this
inevitable transformation and became monsters that overcharge and ignore
customers to the point their contractual obligations border on a scam. It’s
not that executives gleefully enjoy causing distress but simply that any
company that wants to survive has to increasingly strangle existing revenue
streams without investing anything more or adding any value to the
customers. Even Google learned that lesson as they dropped the “Don’t Be
Evil” motto; moral values are antithetical to profits, and companies that
want to make money must be willing to consider treading the line between
good and evil, if not outright dashing over it before someone else does.
In a fresh market, customers would flock to a competitor, but in a highly-
regulated market, that company has a monopoly, and there are no
alternatives. We only have to look at Facebook’s acquisitions to see how
this plays out. In 2012, Facebook bought Instagram, a popular image-
sharing social network, for $1bn and thus got its technology, brand, user
base and all the users’ private data; even when someone does make a viable
alternative, the tech giants swoop in and devour the competition just like a
cheetah does to an antelope. For Instagram creators, this was a dream come
true, and they’re set for life, but for Instagram users, it’s back to the old
crummy Facebook paddock.
The thing is that the internet is another one of those shared resources, with
the only difference being that it’s not physical, but we sorely need it
nonetheless. However, there is barely any legislation that would prevent
pollution of the internet or enshrine the rights of the ordinary users when it
comes to sharing information and original content. Just like with soil, water
and air before the industrial revolution, any company can do as it pleases
online with abandon. This is the future of the technology market, and with
the arrival of deep learning, things will only get worse as a company
headquartered in India, China or Texas can create a worldwide product
powered by deep learning; when the product starts falling apart at the
seams, due to being pushed beyond its theoretical limits in pursuit of
profits, the victims will have absolutely no recourse.
The Deepwater Horizon disaster thus shows the inevitable conclusion of a profit-
driven corporate system where environmental catastrophes must necessarily
happen because the reward is simply too great to ignore. Naturally, the
corporations fostering the development and deployment of smart machines
into the general public don’t care about any long-term consequences; why
would they? Whatever brings the largest profit right now while barely
remaining within the legal bounds is all that matters, and once that had
become the corporate norm, it's only a matter of time before the entire
humanity gets to foot the bill of wanton smart-machines research.
This would also imply that BP Exploration & Production actually got off
cheap as they most likely caused thousands of oil leaks in their reckless
drilling expeditions – it’s just that the Deepwater Horizon leak was too big to
ignore and they paid the price. When this logic is applied to deep learning
and neural networks we get a bleak picture of rampant experimentation that
populates our digital environment with all sorts of shabbily made assistants,
such as Google Translate, probably the most famous translator.

Shenzhen, the powerhouse of China

Located just north of Hong Kong, the fifth busiest port on the planet, sits what was once the small market town of Shenzhen. During the 1990s, Shenzhen experienced tremendous growth, becoming a sprawling center of Chinese
research, development, and manufacturing that now covers 750 square
miles and includes literally hundreds of factories. Got a cool idea? Hong
Kong has smart business people who are willing to listen and have the
means to order a prototype from a Shenzhen factory by the end of the day.
If it works, millions of copies can be made by the end of the week, carted to
Hong Kong and exported to the entire world.
By being joined at the hip, Hong Kong and Shenzhen cover all the basics
and comprise a powerhouse of China that’s been granted special
exemptions from Chinese government regulation and taxes. There is
nothing quite like it in the world, and unless other countries, in particular,
the English-speaking ones, step up to the plate, they’ll be outnumbered and
outgunned. There is one downside of such explosive growth and blazing
turnaround – the notion of lower quality products and services churned out
by the millions with little consideration for standards, at times literally
lighting a fire under our feet.
In 2015, hoverboards were the coolest fad, with everyone of any import
gliding on one. Shenzhen alone had 300 factories churning out hoverboards
24/7, amounting to over a million units in just October 2015, but their Li-
ion battery packs were incompatible with voltages around the world,
resulting in overnight fires and explosions; let’s just say customers didn’t
warm up to the idea of Chinese hoverboards. UK, US, and even Chinese
customers reported hoverboards going up in flames, and when the UK
National Trading Standards put 15,000 Shenzhen hoverboards to the test,
over 90% of them had a subpar electrical system or battery.
Fire departments across the world declared hoverboards a fire hazard. The
media could barely contain the glee when showing the most dramatic
hoverboard explosion footage, and retailers were suddenly stuck with
thousands of hoverboards that couldn’t be moved. Ironically, Shenzhen
factories were stuck with warehouses full of working hoverboards that
couldn’t be sold simply because of a bad reputation and lack of electrical
standards, but matters weren’t helped by the fact Chinese factories
generally engage in cutthroat business practices to lock out competition;
this time around, they were all locked out of the international market due to
the lack of an overall manufacturing strategy.
In March 2016, Chinese hoverboard manufacturers banded together to
create a Hoverboard Industry Alliance that standardized manufacturing
practices in accordance with US and UK electrical standards and asked for
a set of hoverboard battery manufacturing regulations, which they got in
May 2016 by UL, a safety company based in the US. The lesson learned
here is that Chinese manufacturers do exhibit a casual indifference when
making products aimed at international markets, but they’re willing to make
an about-face, cooperate and hold themselves to a higher standard when
profits are threatened, making them a competitor to be reckoned with. This
also implies we need a robust set of legislative limitations on deep learning,
neural networks, and AI before the businesses start clamoring for it.
Normally the way regulations work is quite slow – new technology is
introduced, and the legality or manner of its use is uncertain, there is some
damage or death, the public demands someone thinks of the children and
the government comes in with its habitual heavy-handed attitude.
Investigations are started, years pass, committees are formed, and laws are
made over the course of decades; it’s simply how things have to work to
maintain the integrity of the legal system.
It took the automotive industry years to come to grips with the fact that
seatbelts do save life and limb. Despite numbers being unequivocally clear
on the matter, the car manufacturers fought tooth and nail not to have
seatbelts while people died. This timeframe can’t be applied to the
development of AI since it will evolve on its own and jeopardize everyone
while legislators twiddle their thumbs. If a significant portion of humans
decide along the lines of “if we can’t beat them, we’ll join them” and
implant an electronic device into their brain to have a direct uplink with the
AI, the vanilla humans might end up just like chimps, causing a further
divide between the rich and the poor.
Chapter 14 Summary

● Deep learning overlaps aspects and core issues in


computer learning, computer vision, AI, machine
learning, and neural networks.
● Geography-based differences in deep learning will characterize the world, with China showing the fastest growth in deep learning-related business ideas.
● The breakthrough ideas in deep learning will depend on
their ability to generate rapid growth, high degree
impact in a range of fields, and diffusion into new areas
and frontiers.
● China is poised to play a pivotal role in the future of
deep learning encapsulated by the Shenzhen
transformation.
