
Deep Learning Lab

how to train your first neural network


Teaching Assistant

Subhankar Roy
email: subhankar.roy@unitn.it
where: Open Space 5, Povo 1

- PhD student (University of Trento and FBK)


- Working on Transfer Learning, Unsupervised Domain Adaptation, Image
Generation.
Goal of Labs
- Gaining practical experience with the theory
- Learning to use a deep learning framework, i.e. PyTorch
- Understanding how to set up and train a deep neural network for various
tasks/settings
Outline
- Google CoLab
- Overview of deep learning frameworks
- How to train my first neural network
CoLab (https://colab.research.google.com)
- Jupyter notebook environment hosted by Google

- No setup required (basically)

- Allows running code on GPU (12 hour maximum of GPU runtime)
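
Once a GPU runtime is selected (Runtime -> Change runtime type -> GPU), a minimal check, assuming PyTorch is installed in the Colab environment, could be:

    import torch

    # True if Colab gave us a GPU runtime
    print(torch.cuda.is_available())
    # Name of the device, if any
    if torch.cuda.is_available():
        print(torch.cuda.get_device_name(0))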


CoLab
- Understanding how to set up and train a neural network

- Practising with a deep learning framework


Session 1
Let’s try CoLab together
Deep Learning Frameworks
Deep Learning Frameworks over Time
- Imperative: Imperative-style programs perform computation as you run them
- Symbolic: define the functions first, then compile them (see the sketch below contrasting the two styles)

https://web.cs.ucdavis.edu/~yjlee/teaching/ecs289g-winter2018/Pytorch_Tutorial.pdf
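
A minimal sketch of the two styles (illustrative only; the symbolic part assumes the old TensorFlow 1.x API):

    import numpy as np

    # Imperative style: every line is executed immediately
    a = np.ones(3)
    b = a * 2          # b is computed right here
    print(b)

    # Symbolic style (TensorFlow 1.x, shown as comments):
    # a = tf.ones([3])
    # b = a * 2                      # only a node in the graph, nothing computed yet
    # with tf.Session() as sess:
    #     print(sess.run(b))         # the computation happens here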
Caffe

- Protobuf as the interface


- Protobuf is not easy to write or read
Tensorflow

- Rich set of operators


- Code is often difficult to read
Keras

- High level wrapper


- Simple and easy to use
- Difficult to customize and to write complex algorithms
Pytorch

- Flexible and easy to write


Why PyTorch?
- Python based

- Fast

- Amazingly flexible and easy to learn

- Automatic differentiation

- Dynamic graph computation


PyTorch vs TensorFlow

- Biggest difference: Static vs. dynamic computation graphs


- With a dynamic framework such as PyTorch, creating a static graph beforehand is unnecessary
Example: Linear Regression
- Tensorflow: Create optimizer before feeding data
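
The original slide code is not reproduced here; a rough TensorFlow 1.x-style sketch of the idea (graph and optimizer defined before any data is seen) might look like this:

    import numpy as np
    import tensorflow as tf   # assumes TensorFlow 1.x

    x = tf.placeholder(tf.float32, shape=[None, 1])
    y = tf.placeholder(tf.float32, shape=[None, 1])
    w = tf.Variable(tf.zeros([1, 1]))
    b = tf.Variable(tf.zeros([1]))
    loss = tf.reduce_mean(tf.square(tf.matmul(x, w) + b - y))

    # The optimizer is part of the static graph, created before feeding data
    train_op = tf.train.GradientDescentOptimizer(0.01).minimize(loss)

    x_np = np.random.rand(100, 1).astype(np.float32)
    y_np = 3 * x_np + 2

    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        for _ in range(100):
            sess.run(train_op, feed_dict={x: x_np, y: y_np})   # data is fed only now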
Example: Linear Regression
- PyTorch: Create optimizer while feeding data
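
Again not the original slide code, but a minimal PyTorch sketch of the same task: the graph is built dynamically at each forward pass, while the data flows through it:

    import torch
    import torch.nn as nn

    x = torch.rand(100, 1)
    y = 3 * x + 2

    model = nn.Linear(1, 1)
    criterion = nn.MSELoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    for _ in range(100):
        pred = model(x)              # the graph is created here, on the fly
        loss = criterion(pred, y)
        optimizer.zero_grad()
        loss.backward()              # gradients computed by autograd
        optimizer.step()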
What is PyTorch?
Think of PyTorch as a deep-learning-oriented upgrade of NumPy:
- Allows operations on GPU(s)
- Contains everything you need to set up and train a network

It is based on the concept of a Tensor. A Tensor is a version of NumPy's n-dimensional array which can be stored on both CPU and GPU.

Training/deploying a network is carried out as operations among tensors.


Forward and backward pass of a NN
- Input
- Output
- Fully connected params
- Activation function … and its gradient
- Target
- Loss
Forward and backward pass of a NN
Forward pass and backward pass:
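
The original equations are not in these notes; as an illustrative (not the slides') instance, take a one-hidden-layer network with activation σ and an L2 loss:

    Forward pass:
        h = \sigma(W_1 x + b_1), \quad \hat{y} = W_2 h + b_2, \quad
        L = \tfrac{1}{2} \lVert \hat{y} - y \rVert^2

    Backward pass (chain rule):
        \frac{\partial L}{\partial \hat{y}} = \hat{y} - y, \quad
        \frac{\partial L}{\partial W_2} = \frac{\partial L}{\partial \hat{y}}\, h^\top, \quad
        \frac{\partial L}{\partial h} = W_2^\top \frac{\partial L}{\partial \hat{y}}, \quad
        \frac{\partial L}{\partial W_1} = \Big(\frac{\partial L}{\partial h} \odot \sigma'(W_1 x + b_1)\Big)\, x^\top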
Forward and backward pass of a NN (NumPy)
Forward pass and backward pass:
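
A minimal NumPy sketch of such a forward/backward pass (hand-derived gradients; the layer sizes and the sigmoid activation are illustrative choices, not the slides' original code):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    np.random.seed(0)
    x, y = np.random.randn(64, 100), np.random.randn(64, 10)    # inputs and targets
    w1, w2 = np.random.randn(100, 50), np.random.randn(50, 10)  # fully connected params

    for step in range(200):
        # Forward pass
        h = sigmoid(x.dot(w1))
        y_pred = h.dot(w2)
        loss = 0.5 * np.square(y_pred - y).sum()

        # Backward pass: chain rule written out by hand
        grad_y_pred = y_pred - y
        grad_w2 = h.T.dot(grad_y_pred)
        grad_h = grad_y_pred.dot(w2.T)
        grad_w1 = x.T.dot(grad_h * h * (1 - h))   # sigmoid'(z) = h * (1 - h)

        # Gradient descent update
        w1 -= 1e-4 * grad_w1
        w2 -= 1e-4 * grad_w2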
Forward and backward pass of a NN (PyTorch)
Forward pass and backward pass:
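
The same network as a PyTorch sketch (again illustrative): the forward pass is written explicitly, while the backward pass comes for free from autograd via loss.backward():

    import torch

    x, y = torch.randn(64, 100), torch.randn(64, 10)
    w1 = torch.randn(100, 50, requires_grad=True)
    w2 = torch.randn(50, 10, requires_grad=True)

    for step in range(200):
        # Forward pass
        h = torch.sigmoid(x.mm(w1))
        y_pred = h.mm(w2)
        loss = 0.5 * (y_pred - y).pow(2).sum()

        # Backward pass: autograd fills w1.grad and w2.grad
        loss.backward()

        with torch.no_grad():             # plain SGD update, kept out of the graph
            w1 -= 1e-4 * w1.grad
            w2 -= 1e-4 * w2.grad
            w1.grad.zero_()
            w2.grad.zero_()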
Computational Graphs

http://cs231n.stanford.edu/slides/2017/cs231n_2017_lecture8.pdf
Computational Graphs

[Figure: a large computational graph mapping the input image, through the network, to the loss]
http://cs231n.stanford.edu/slides/2017/cs231n_2017_lecture8.pdf
Deep Learning Frameworks
They need to make it possible to:

(1) Easily build big computational graphs

(2) Easily compute gradients in computational graphs

(3) Run it all efficiently on GPU (wrap cuDNN, cuBLAS, etc)


CPU vs GPU

http://cs231n.stanford.edu/slides/2017/cs231n_2017_lecture8.pdf
Computational Graphs

http://cs231n.stanford.edu/slides/2017/cs231n_2017_lecture8.pdf
Computational Graphs

Problems:

- Can’t run on GPU

- Have to compute our own gradients

http://cs231n.stanford.edu/slides/2017/cs231n_2017_lecture8.pdf
Computational Graphs

We have:

- Computational Graph creation

- Automatic gradient computation

http://cs231n.stanford.edu/slides/2017/cs231n_2017_lecture8.pdf
Computational Graphs

We can ask TF to run on GPU

http://cs231n.stanford.edu/slides/2017/cs231n_2017_lecture8.pdf
Computational Graphs
We have:

- Variable definition for building CG

- Forward and Backward pass.

http://cs231n.stanford.edu/slides/2017/cs231n_2017_lecture8.pdf
Computational Graphs

We can ask PyTorch to run on GPU

http://cs231n.stanford.edu/slides/2017/cs231n_2017_lecture8.pdf
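
A minimal sketch of what asking PyTorch to run on GPU looks like (assuming a CUDA device is available, e.g. on Colab):

    import torch
    import torch.nn as nn

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    x = torch.randn(64, 100).to(device)       # move a tensor to the GPU
    model = nn.Linear(100, 10).to(device)     # move the model parameters as well
    out = model(x)                            # the computation now runs on the GPU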
Summary

http://cs231n.stanford.edu/slides/2017/cs231n_2017_lecture8.pdf
Back to Tensors
Back to Tensors
Think of PyTorch as a deep-learning-oriented upgrade of NumPy:
- Allows operations on GPU(s)
- Contains everything you need to set up and train a network

It is based on the concept of a Tensor. A Tensor is a version of NumPy's n-dimensional array which can be stored on both CPU and GPU.

Training/deploying a network is carried out as operations among tensors.


Tensors
A multi-dimensional matrix (of Float/Byte/Long/... values).

Can be initialized from, and converted to, NumPy arrays.
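
A small illustrative sketch of these conversions:

    import numpy as np
    import torch

    a = torch.zeros(2, 3)                  # FloatTensor by default
    b = torch.tensor([1, 2, 3], dtype=torch.long)
    c = torch.from_numpy(np.eye(3))        # NumPy array -> Tensor (shares memory)
    d = c.numpy()                          # Tensor -> NumPy array (CPU tensors only)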


Tensors
Tensors are objects used to instantiate both variables and parameters
(torch.nn.Parameter).
Tensors
Tensors are objects used to instantiate both variables and parameters
(torch.nn.Parameter). In both cases they expose several fields, among which:

- .data, storing the numerical values of a Tensor


- .requires_grad, indicating whether the Tensor requires gradient computation (it can be set)
- .grad, which stores the gradient and lets you retrieve it (when the Tensor requires gradients)

In particular, .grad is a useful tool for checking the gradient flow.
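
A tiny illustrative example of these fields:

    import torch

    w = torch.randn(3, requires_grad=True)
    x = torch.ones(3)
    loss = (w * x).sum()
    loss.backward()

    print(w.data)            # numerical values of the Tensor
    print(w.requires_grad)   # True
    print(w.grad)            # d(loss)/dw = x = tensor([1., 1., 1.])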


Session 2
Tensors
Train a deep network
What do we need to train a network?
- Data
What do we need to train a network?
- Data

- Network
What do we need to train a network?
- Data

- Network

- Cost function
What do we need to train a network?
- Data

- Network

- Cost function

- Update rule
What do we need to train a network? (PyTorch)
- torchvision.datasets + torch.utils.data (.DataLoader)

- torch.nn.Module

- torch.nn.*Loss

- torch.optim
What do we need to train a network? (PyTorch)
- torchvision.datasets + torch.utils.data (.DataLoader)

- torch.nn.Module

- torch.nn.*Loss

- torch.optim

Everything is customizable:
you can create your own version of each component
Let’s train!
Your task is to classify digits (MNIST dataset):

- Instantiate a dataloader
  - MNIST is already in torchvision.datasets
- Create a simple MLP
  - input-to-hidden and hidden-to-output fully connected layers (torch.nn.Linear)
  - Do not forget about activation(s)
- Instantiate an optimizer
  - torch.optim is the guide
- Instantiate a loss/cost function
  - It is a classification task with 10 classes... what about cross entropy?
- Put things together to implement the training and test procedure
  - The evaluation metric is obviously accuracy (= correct_predictions / number_of_samples)
Setting up a dataset
A dataset is defined through torchvision.datasets.
Various datasets are already available there (e.g. MNIST, CIFAR, ImageNet, COCO, ...).

E.g. Initialize MNIST:

Initialize your custom dataset:
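
The original slide code is not in these notes; a minimal sketch of both cases, under the standard torchvision/torch APIs, could be:

    import torch
    from torchvision import datasets, transforms

    # Built-in dataset: MNIST
    mnist_train = datasets.MNIST(root="./data", train=True, download=True,
                                 transform=transforms.ToTensor())

    # Custom dataset: subclass torch.utils.data.Dataset and define __len__ and __getitem__
    class MyDataset(torch.utils.data.Dataset):
        def __init__(self, data, labels):
            self.data, self.labels = data, labels

        def __len__(self):
            return len(self.data)

        def __getitem__(self, idx):
            return self.data[idx], self.labels[idx]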


Setting up a dataloader
A dataloader is a wrapper around the dataset that allows you to iterate over the data. It can be
defined as follows:

Then we can obtain data and labels by just retrieving its elements:
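
A sketch (not the original slide code), reusing the mnist_train dataset from the previous sketch:

    from torch.utils.data import DataLoader

    train_loader = DataLoader(mnist_train, batch_size=64, shuffle=True)

    for images, labels in train_loader:
        # e.g. torch.Size([64, 1, 28, 28]) and torch.Size([64])
        print(images.shape, labels.shape)
        break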

Note:
- There exist different types of loaders and different samplers
Setting up a network
In PyTorch, a network is defined as a subclass of torch.nn.Module. We need to
define:
- The initialization of the network (i.e. layers, initial values of the parameters, etc.)
- The forward pass

No definition of the backward pass is needed (thanks to automatic differentiation).


Setting up a network - example
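
The slide's example is not reproduced here; a minimal MLP sketch for MNIST, with illustrative layer sizes, could be:

    import torch
    import torch.nn as nn

    class MLP(nn.Module):
        def __init__(self, in_dim=28 * 28, hidden_dim=256, n_classes=10):
            super(MLP, self).__init__()
            # Initialization: define the layers (parameters get default initial values)
            self.fc1 = nn.Linear(in_dim, hidden_dim)
            self.fc2 = nn.Linear(hidden_dim, n_classes)

        def forward(self, x):
            # Forward pass: flatten the image, hidden layer + activation, output layer
            x = x.view(x.size(0), -1)
            x = torch.relu(self.fc1(x))
            return self.fc2(x)            # raw logits (no softmax, see the loss slide)

    model = MLP()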
torch.nn
torch.nn defines all the basic components needed to build a network. For
instance, there you can find:

- Layers (nn.Linear, nn.Conv2d, nn.Dropout, nn.BatchNorm2d, ...)

- Activation functions (nn.ReLU, nn.Sigmoid, ...)

- Loss functions (nn.CrossEntropyLoss, nn.MSELoss, ...)


Setting up a cost function
From what we said before, it is pretty easy: have a look at torch.nn:

E.g. :

Note:
- Most of these loss functions already include the proper activation
(e.g. CrossEntropyLoss applies a log-softmax internally, so the network should output raw logits)
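
An illustrative example with cross entropy on raw logits:

    import torch
    import torch.nn as nn

    criterion = nn.CrossEntropyLoss()
    logits = torch.randn(4, 10)             # network outputs, no softmax applied
    targets = torch.tensor([1, 0, 4, 9])    # ground-truth class indices
    loss = criterion(logits, targets)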
Setting up an optimizer
The optimizers are defined in torch.optim. The standard template for initializing an
optimizer is:

E.g.

Note:
- Different optimizers need different hyperparameters
- You can filter the parameters given to the optimizer
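
A sketch of the template (the model comes from the earlier MLP sketch; learning rates and the choice of layer to filter are illustrative):

    import torch

    # Standard template: torch.optim.<Optimizer>(parameters, **hyperparameters)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

    # A different optimizer needs different hyperparameters, e.g.:
    # optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

    # Filtering the parameters given to the optimizer, e.g. only the last layer:
    # optimizer = torch.optim.SGD(model.fc2.parameters(), lr=0.01)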
Setting up an optimizer - operations
The optimizer controls how the parameters are updated after each iteration
(iteration = 1 forward pass + 1 backward pass) with respect to their gradients. To
update the weights after the backward call, use optimizer.step().

To avoid accumulating gradients, we must free the .grad component of each Tensor in
the graph at each iteration. This can be achieved by simply calling optimizer.zero_grad().
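
Putting the pieces together, one training iteration could look like this sketch (model, criterion, optimizer and train_loader come from the sketches above):

    for images, labels in train_loader:
        optimizer.zero_grad()             # free the .grad buffers of all parameters
        logits = model(images)            # forward pass
        loss = criterion(logits, labels)
        loss.backward()                   # backward pass fills the .grad fields
        optimizer.step()                  # update the parameters using their gradients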
Visualizing the results
Babysitting the training procedure by just looking at printed text is objectively boring.
We can exploit tensorboardcolab to get a nice visualization of the training curves, like this:
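
The slides use tensorboardcolab; as a hedged alternative (not the slides' code), PyTorch's own SummaryWriter produces the same kind of curves:

    from torch.utils.tensorboard import SummaryWriter

    writer = SummaryWriter("runs/mnist_mlp")   # log directory (illustrative name)

    # Inside the training/test loops:
    # writer.add_scalar("train/loss", loss.item(), global_step)
    # writer.add_scalar("test/accuracy", accuracy, epoch)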
Visualizing the results
Session 3
How to train your first neural network
Useful links
- Colab: https://colab.research.google.com

- PyTorch: https://pytorch.org/

- PyTorch doc: https://pytorch.org/docs/stable/index.html

- How to build an MLP with NumPy: https://github.com/Trion129/Neural-Network-using-numpy/blob/master/neuralnetwork.py
Let’s train! (by yourself)

Follow the steps we discussed before for classifying digits in the SVHN dataset
(already included in torchvision). Try to do this from scratch, starting from the rough
template and not directly from the previous solution.
Other useful tasks you may want to try
- Let/make the network overfit
  - How/why does it happen?

- Increase the performance of the network
  - How can you do that? Is 99% accuracy achievable?

- How does the gradient flow change with the optimizer?
  - Each parameter of a network has the field .grad, e.g. layer1.weight.grad
What happens if...
- I change some of the hyperparameters? (e.g. learning rate, weight decay, etc.)

- I change the optimizer? (Adam, RMSProp, etc. ...)

- I change the number of parameters? (e.g. I increase/decrease the hidden state dimension)

- I add more layers?

- I add Dropout? (torch.nn.Dropout)

Tip: Save the logs and visualize (in TensorBoard) the effect of the above changes on the classification
accuracy and loss curves.
