
UNIT-4

Introduction
• Deep learning has achieved many milestones over
the past few years, from defeating professionals
at poker to autonomous driving.
• Accomplishing such tasks requires complex
methodologies, which in turn result in complex systems.
• Even now, there are many cases where the researchers
are applying a trial-and-error method to build certain
models. 
• ParaDnn, a parameterised deep learning benchmark
suite, was introduced by researchers at Harvard
University.
• This suite generates end-to-end models for fully
connected (FC), convolutional neural network
(CNN), and recurrent neural networks (RNN).
• In ParaDnn, the model types cover 95% of Google’s
TPU workloads, all of Facebook’s deep learning
models, and eight out of nine MLPerf models.
• The image classification/detection and sentiment
analysis models are CNNs; the recommendation
and translation models are FCs; the RNN translator
and another version of sentiment analysis are the
RNNs.
• In addition to ParaDnn, the researchers also included
two workloads written in TensorFlow from MLPerf
and they are transformer and ResNet-50. 
TPU vs GPU vs CPU: A Cross-Platform
Comparison
• The researchers made a cross-platform comparison in
order to choose the most suitable platform based on
the models of interest. The key takeaway is that no
single platform is the best for all scenarios. The
findings are summarised below.

• TPU: Tensor Processing Unit is highly optimised for
large batches and CNNs and has the highest training
throughput.
• GPU: Graphics Processing Unit shows better
flexibility and programmability for irregular
computations, such as small batches and
non-MatMul computations.
• CPU: Central Processing Unit achieves the
highest FLOPS utilisation for RNNs and
supports the largest model because of large
memory capacity.
Getting Started with PyTorch
• Deep Learning is a branch of Machine Learning
where algorithms are written which mimic the
functioning of a human brain.
• The most commonly used libraries in deep learning
are TensorFlow and PyTorch.
• As there are various deep learning frameworks
available, one might wonder when to use PyTorch.
Here are reasons why one might prefer PyTorch
for specific tasks.
• PyTorch is an open-source deep learning framework
available with a Python and C++ interface.
• PyTorch resides inside the torch module. In PyTorch,
the data that has to be processed is input in the form
of a tensor.
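• For illustration, a minimal sketch of creating and operating on a tensor (the toy values are illustrative, not from the source):

    import torch

    x = torch.tensor([[1.0, 2.0], [3.0, 4.0]])  # 2x2 tensor built from a list
    print(x.shape)   # torch.Size([2, 2])
    print(x * 2)     # elementwise operations act directly on tensors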
Introduction to TensorFlow
• TensorFlow is an open-source software library.
• TensorFlow is basically a software library for
numerical computation using data flow
graphs where:
• nodes in the graph represent mathematical
operations.
• edges in the graph represent the multidimensional
data arrays (called tensors) communicated between
them. (Please note that tensor is the central unit of
data in TensorFlow).
TensorFlow APIs
• TensorFlow provides multiple APIs (Application
Programming Interfaces). These can be classified into
2 major categories:
• Low level API:
– complete programming control
– recommended for machine learning researchers
– provides fine levels of control over the models
– TensorFlow Core is the low level API of
TensorFlow.
• High level API:
– built on top of TensorFlow Core
– easier to learn and use than TensorFlow Core
– make repetitive tasks easier and more consistent
between different users
– tf.contrib.learn is an example of a high level API.
TensorFlow Core
Installing TensorFlow
Once installed, you can ensure a successful installation by
running this command in the Python interpreter:
• import tensorflow as tf
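• A quick sanity check (printing the installed version; assumes the import above succeeded):

    import tensorflow as tf
    print(tf.__version__)  # prints the installed TensorFlow version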
The Computational Graph
• Any TensorFlow Core program can be divided into two
discrete sections:
• Building the computational graph. A computational graph is
nothing but a series of TensorFlow operations arranged into a
graph of nodes.
• Running the computational graph. To actually evaluate the
nodes, we must run the computational graph within a session.
A session encapsulates the control and state of the TensorFlow
runtime.
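• A minimal sketch of the two phases (assuming the TensorFlow 1.x API, matching the session-based description above):

    import tensorflow as tf

    # Phase 1: build the computational graph.
    node1 = tf.constant(3.0)
    node2 = tf.constant(4.0)
    node3 = tf.add(node1, node2)

    # Phase 2: run the graph within a session.
    with tf.Session() as sess:
        print(sess.run(node3))  # 7.0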
Dynamic vs Static Computational
Graphs – PyTorch and TensorFlow
• Though both libraries employ a directed acyclic graph (or DAG) for
representing their machine learning and deep learning models, there is
still a big difference in how they let their data and calculations
flow through the graph.
• The subtle difference between the two libraries is that while
TensorFlow (v < 2.0) allows static graph computations, PyTorch allows
dynamic graph computations.
• This article will cover these differences in a visual manner with code
examples.
• The article assumes a working knowledge of computation graphs and a
basic understanding of the TensorFlow and PyTorch modules.
Static Computation Graph in TensorFlow
• Properties of nodes & edges: The nodes represent the operations
that are applied directly on the data flowing in and out through the
edges. For the above set of equations, we can keep the following
things in mind while implementing them in TensorFlow:
• Since the inputs act as the edges of the graph, we can use
the tf.placeholder() object, which can take any input of the desired
datatype.
• For calculating the output ‘c’, we define a simple multiplication
operation and start a TensorFlow session where we pass in the
required input values through the feed_dict attribute of
the session.run() method for calculating the outputs and the
gradients.
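• A minimal sketch of this static-graph workflow (assuming TensorFlow 1.x and, as implied above, the equation c = a * b):

    import tensorflow as tf

    a = tf.placeholder(tf.float32)
    b = tf.placeholder(tf.float32)
    c = a * b                        # node for the multiplication operation
    grads = tf.gradients(c, [a, b])  # symbolic gradients dc/da and dc/db

    with tf.Session() as sess:
        # Inputs are supplied through feed_dict when the graph is run.
        output, gradients = sess.run([c, grads], feed_dict={a: 3.0, b: 4.0})
        print(output)     # 12.0
        print(gradients)  # [4.0, 3.0]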
Dynamic Computation Graph in PyTorch
• Properties of nodes & edges: The nodes represent the data (in the
form of tensors) and the edges represent the operations applied to
the input data.
• For the equations given in the Introduction, we can keep the
following things in mind while implementing them in PyTorch:
• Since everything in PyTorch is created dynamically, we don’t
need any placeholders and can define our inputs and operations
on the fly.
• After defining the inputs and computing the output ‘c’, we call
the backward() method, which calculates the corresponding
partial derivatives with respect to the two inputs, accessible
through the .grad attribute.
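• A minimal sketch of the same computation in PyTorch (again assuming c = a * b):

    import torch

    a = torch.tensor(3.0, requires_grad=True)
    b = torch.tensor(4.0, requires_grad=True)

    c = a * b        # the graph is built on the fly as the operation runs
    c.backward()     # computes the partial derivatives dc/da and dc/db

    print(c.item())  # 12.0
    print(a.grad)    # tensor(4.)  -> dc/da = b
    print(b.grad)    # tensor(3.)  -> dc/db = a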
Data Preprocessing and Its Types
• Preprocessing simply refers to performing a series of operations to
transform or change data. It is a transformation applied to our data
before feeding it to the algorithm.
• Data processing refers to performing operations on data to retrieve,
transform, or change it, especially by computer. It is a technique
used to convert raw data into a clean data set.
• In other words, whenever data is gathered from different sources,
it is collected in a raw format, which is not feasible for analysis.
It is then converted from the raw format into a readable format (graphs,
documents, etc.), so that it can be interpreted by computers and
utilized by employees throughout an organization.
Types of Data Preprocessing Techniques

1. Rescale Data
• When our data consists of attributes with different
scales, many ML algorithms can benefit from
rescaling the attributes.
• It means that all attributes of the dataset have the same
scale, so that the measured parameters of the dataset
maintain uniformity.
• This is also useful for optimization algorithms, which
benefit from a uniform data set.
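• A minimal sketch of rescaling attributes to a common [0, 1] scale (assuming scikit-learn is available; the data values are made up):

    import numpy as np
    from sklearn.preprocessing import MinMaxScaler

    X = np.array([[1.0, 200.0], [2.0, 400.0], [3.0, 600.0]])
    scaler = MinMaxScaler(feature_range=(0, 1))
    X_scaled = scaler.fit_transform(X)  # every attribute now lies in [0, 1]
    print(X_scaled)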
2. Binarize Data
• Binarization is a process used to transform the data
features of an entity into binary numbers, so that
classification algorithms can work more efficiently.
• To convert into binary, we can transform the data using
a binary threshold: all values above the threshold are marked
as 1, and all values equal to or below the
threshold are marked as 0.
• This is called binarizing your data. It can be helpful
when you have a value that you want to turn into a crisp (binary) value.
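• A minimal sketch of thresholding with scikit-learn’s Binarizer (assumed available; values above 0.5 become 1, the rest 0):

    import numpy as np
    from sklearn.preprocessing import Binarizer

    X = np.array([[0.2, 1.5], [3.0, 0.0], [0.7, 2.4]])
    binarizer = Binarizer(threshold=0.5)
    print(binarizer.fit_transform(X))  # [[0. 1.] [1. 0.] [1. 1.]]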
3. Data Augmentation
• Data augmentation is a strategy that allows practitioners and
scientists to increase the diversity of the data available for training
models, without collecting or gathering new data.
There are various types of data augmentation given below:
• Flip:
We can flip images horizontally or vertically. Some frameworks
do not provide a function for vertical flips, but we can
perform a vertical flip by rotating an image 180 degrees
and then performing a horizontal flip.
• Scale:
The image can be scaled outward or inward. While
scaling outward, the final image size is larger than the original
one; while scaling inward, the final image size is smaller than
the original image.
• Crop:
Unlike scaling, we just randomly select a section from the original
image. After that, we resize the selected section to the original
image size. This method is also called random cropping.
• Translation:
It just involves moving the image along the x-axis or y-axis, or both.
This method of augmentation is very useful because
objects can be located almost anywhere in the image.
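• A minimal sketch of these augmentations using torchvision.transforms (assumed available; the pipeline is applied to PIL images during training):

    from torchvision import transforms

    augment = transforms.Compose([
        transforms.RandomHorizontalFlip(p=0.5),        # flip
        transforms.RandomResizedCrop(size=224),        # scale + crop
        transforms.RandomAffine(degrees=0,
                                translate=(0.1, 0.1)), # translation
    ])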
Introduction to Batch Normalization

Objective
• Learn how to improve the neural network
with the process of Batch Normalization.
• Understand the advantages batch
normalization offers.
What is Batch Normalization?
• Normalization is a data pre-processing tool used to bring the
numerical data to a common scale without distorting its shape.
• Generally, when we input data to a machine or deep learning
algorithm, we tend to change the values to a balanced scale.
• The reason we normalize is partly to ensure that our model can
generalize appropriately.
• Now coming back to Batch normalization, it is a process to make
neural networks faster and more stable through adding extra
layers in a deep neural network.
• The new layer performs the standardizing and normalizing
operations on the input of a layer coming from a previous layer.
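• A minimal sketch of adding a batch-normalization layer between two layers in PyTorch (an illustrative toy network, not from the source):

    import torch
    import torch.nn as nn

    model = nn.Sequential(
        nn.Linear(20, 64),
        nn.BatchNorm1d(64),  # standardizes the output of the previous layer
        nn.ReLU(),
        nn.Linear(64, 10),
    )

    x = torch.randn(32, 20)  # a mini-batch of 32 samples
    out = model(x)           # activations are normalized per mini-batch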
Deep Learning using Transfer Learning
• The motivation for Transfer Learning in Machine
Learning and Deep Learning is based on the fact that
people can intelligently apply knowledge learned
previously in a different task or domain to solve
new problems faster or with better solutions.
What are the key considerations for
Transfer Learning?
• To effectively apply Transfer Learning we need to answer
three main questions
• What to transfer
• When to transfer
• How to transfer

• What to transfer — We need to understand what knowledge
is common between the source and target task, and what
knowledge can be transferred from the source task to the target task
to help improve the performance of the target task.
• When to transfer (or when not to transfer) — When the source
and target domains are not related at all, we should not try to
apply transfer learning; in such a scenario the performance
will suffer. This type of transfer is called Negative Transfer.
We should apply transfer learning only when the source and
target domains/tasks are related.

• How to transfer — Identifying the different techniques to apply
transfer learning when the source and target domain/task are
related. We can use inductive transfer learning, transductive
transfer learning, or unsupervised transfer learning.
Types of Transfer Learning
Pre-trained deep neural networks for
Computer Vision

• VGG-16
• VGG-19
• Inception V3
• ResNet-50
• Xception
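• A minimal sketch of reusing one of these pre-trained networks (ResNet-50 via torchvision, assumed available; the 10-class head is a hypothetical example):

    import torch.nn as nn
    from torchvision import models

    model = models.resnet50(pretrained=True)  # ImageNet weights
    for param in model.parameters():
        param.requires_grad = False           # freeze the pre-trained backbone
    model.fc = nn.Linear(model.fc.in_features, 10)  # new head for 10 classes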
Hyper-parameter Tuning Techniques in
Deep Learning
• Deep learning models are full of hyper-parameters,
and finding the best configuration
for these parameters in such a high-dimensional
space is not a trivial challenge.
Approaches of searching for the best
configuration: Grid Search & Random
Search
• Grid Search
• In grid search, we try every possible configuration of the
parameters.
• Steps:
• Define a grid on n dimensions, where each dimension maps to
a hyper-parameter, e.g. n = (learning_rate, batch_size).
• For each dimension, define the range of possible values, e.g.
batch_size = [4, 8, 16, 32], learning_rate = [0.1, 0.01, 0.0001].
• Search over all the possible configurations and wait for the
results to establish the best one, e.g. C1 = (0.1, 4) -> acc =
92%, C2 = (0.01, 4) -> acc = 92.3%, etc.
• As we can see, with more dimensions the search
explodes in time complexity.
• It’s common to use this approach when there are at
most 4 dimensions. Though it guarantees to find the
best configuration at the end, it’s still not preferable;
a sketch of the procedure is given below.
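• A minimal sketch of grid search over the two hyper-parameters above (train_and_evaluate is a hypothetical stand-in for a real training run):

    from itertools import product

    def train_and_evaluate(lr, bs):
        # Hypothetical stand-in: train a model with (lr, bs), return accuracy.
        return 0.9  # dummy value so the sketch runs

    learning_rates = [0.1, 0.01, 0.0001]
    batch_sizes = [4, 8, 16, 32]

    best = None
    for lr, bs in product(learning_rates, batch_sizes):  # every configuration
        acc = train_and_evaluate(lr, bs)
        if best is None or acc > best[0]:
            best = (acc, lr, bs)
    print(best)  # the best configuration found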
• Instead, it’s better to use Random Search
Random Search
• Random Search replaces the exhaustive enumeration in
step 1 above by picking each point randomly from the
configuration space.
• The intuition of how it works better is that we can
explore the hyper-parameters space more widely with
Random Search (especially for the more important
variables).
• This will help us to find the best configuration in fewer
iterations. For example, compare the grid and random layouts
described below:
• In the Grid Layout, it’s easy to notice that, even if we
have trained 9 (n=3) models, we have used only 3
values per variable.
• Whereas, with the Random Layout, it’s extremely
unlikely that we will select the same variables more
than once.
• It ends up that, with the second approach, we will
have trained 9 models using 9 different values for each
variable.
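• A minimal sketch of random search (using the same hypothetical train_and_evaluate stand-in as above):

    import random

    def train_and_evaluate(lr, bs):
        # Hypothetical stand-in: train a model with (lr, bs), return accuracy.
        return random.random()  # dummy value so the sketch runs

    best = None
    for _ in range(9):  # 9 trials, matching the example above
        lr = 10 ** random.uniform(-4, -1)   # log-uniform learning rate
        bs = random.choice([4, 8, 16, 32])
        acc = train_and_evaluate(lr, bs)
        if best is None or acc > best[0]:
            best = (acc, lr, bs)
    print(best)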
Learning Rate Scheduling: Optimization
Algorithm: Mini-batch Stochastic Gradient
Descent (SGD)
• We will be using mini-batch gradient descent in all our examples
here when scheduling our learning rate
• Combination of batch gradient descent & stochastic gradient
descent
– θ = θ − η⋅∇J(θ, x_{i:i+n}, y_{i:i+n})
• Characteristics
– Compute the gradient of the loss function w.r.t. the parameters for n sets of
training samples (n inputs and n labels): ∇J(θ, x_{i:i+n}, y_{i:i+n})
– Use this to update our parameters at every iteration
• Typically in deep learning, some variation of mini-batch gradient is
used where the batch size is a hyperparameter to be determined
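• A minimal sketch of mini-batch SGD with a learning-rate schedule in PyTorch (the toy model and data are illustrative placeholders, not from the source):

    import torch
    import torch.nn as nn
    from torch.utils.data import DataLoader, TensorDataset

    # Toy model and data; stand-ins for a real task.
    model = nn.Linear(10, 1)
    loss_fn = nn.MSELoss()
    dataset = TensorDataset(torch.randn(256, 10), torch.randn(256, 1))
    loader = DataLoader(dataset, batch_size=32, shuffle=True)  # mini-batches, n = 32

    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.5)

    for epoch in range(30):
        for x, y in loader:
            optimizer.zero_grad()
            loss = loss_fn(model(x), y)  # J(θ, x_{i:i+n}, y_{i:i+n})
            loss.backward()              # ∇J w.r.t. the parameters
            optimizer.step()             # θ ← θ − η⋅∇J
        scheduler.step()                 # halve η every 10 epochs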
Need for Learning Rate Schedules
Benefits
1. Converge faster
2. Higher accuracy
