
Implement Neural Networks using Keras and PyTorch

Liang Liang
Mini-batch Stochastic Gradient Descent (SGD)
• Initialization: initialize the network parameters using random numbers
• One epoch (the network sees all of the training data points):
  • Randomly shuffle the data points in the training set and divide them into mini-batches
    (a mini-batch is a subset of the training set; assume there are M mini-batches)
  • for m in [1, 2, …, M]:   (each pass through this loop is one iteration)
    • assume the data points in the current mini-batch are (x_i, y_i), i = 1, …, N_m; N_m is called batch_size
    • Forward Pass: compute the output ŷ_i for every data point in this mini-batch
    • Backward Pass: compute the derivatives
    • Update every parameter: w ← w − η·∂L/∂w, where L = (1/N_m) Σ_{i=1}^{N_m} L(ŷ_i, y_i); η is called the learning rate
• Run many epochs until convergence
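As a concrete illustration of the loop above, here is a minimal PyTorch sketch of mini-batch SGD; the toy data, model, loss function, batch_size, and lr values are placeholders, not taken from the slides:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# toy data and model (placeholders for illustration only)
x = torch.randn(1000, 10)
y = torch.randn(1000, 1)
loader = DataLoader(TensorDataset(x, y), batch_size=32, shuffle=True)  # shuffle + split into mini-batches

model = torch.nn.Linear(10, 1)                            # parameters are randomly initialized
loss_fn = torch.nn.MSELoss()                              # L(y_hat, y), averaged over the mini-batch
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)  # eta = 0.01

for epoch in range(100):                 # run many epochs
    for x_batch, y_batch in loader:      # one iteration per mini-batch
        y_hat = model(x_batch)           # forward pass
        loss = loss_fn(y_hat, y_batch)
        optimizer.zero_grad()
        loss.backward()                  # backward pass: compute dL/dw
        optimizer.step()                 # update: w <- w - eta * dL/dw
```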


Stochastic gradient descent (SGD) with momentum

initialization: V = 0, W = random

define d_t = ∂L/∂W at iteration t

in iteration t:
    V ← d_t + ρV
    W ← W − ηV

η: learning rate (lr);  ρ: momentum;  V: velocity (a smoothed gradient)

Unrolling the update, V is an exponentially weighted sum of the past gradients:
    V = d_t + ρ·d_{t−1} + ρ²·d_{t−2} + ρ³·d_{t−3} + ⋯


https://pytorch.org/docs/stable/_modules/torch/optim/sgd.html#SGD
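Below is a minimal sketch of the same update rule, first written out by hand for a single parameter tensor and then using torch.optim.SGD; the toy loss, lr, and momentum values are assumptions for illustration:

```python
import torch

# manual momentum update for one parameter tensor W (illustrative toy loss)
W = torch.randn(5, requires_grad=True)   # W = random
V = torch.zeros_like(W)                  # V = 0 (velocity)
lr, rho = 0.01, 0.9                      # eta (learning rate) and rho (momentum)

for t in range(100):
    loss = (W ** 2).sum()                # placeholder loss L(W)
    loss.backward()                      # d_t = dL/dW
    with torch.no_grad():
        V = W.grad + rho * V             # V <- d_t + rho * V
        W -= lr * V                      # W <- W - eta * V
    W.grad.zero_()

# the built-in equivalent for a whole model:
model = torch.nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
```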
Stochastic gradient descent (SGD) with weight decay

initialization: W = random

in one iteration:
    W ← W − η (∂L/∂W + βW)        β is weight_decay

Weight decay is equivalent to adding an L2 penalty term to the loss; it controls model
complexity, which may help reduce the chance of overfitting:

    L_new = L + (β/2)·w²,        ∂L_new/∂w = ∂L/∂w + βw
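In torch.optim.SGD, β corresponds to the weight_decay argument; a minimal sketch (the model and the hyper-parameter values are placeholders):

```python
import torch

model = torch.nn.Linear(10, 1)  # placeholder model
# lr = eta, weight_decay = beta
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, weight_decay=1e-4)
```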
SGD with momentum and weight decay

initialization: V = 0, W = random

in one iteration:
    V ← ∂L/∂W + ρV
    W ← W − η (V + βW)

η: learning rate (lr);  ρ: momentum;  V: velocity (a smoothed gradient);  β: weight decay
SGD with Nesterov momentum and weight decay
initialization: V = 0, W = random

in one iteration:
    V ← (∂L/∂W + βW) + ρV
    W ← W − η (∂L/∂W + βW + ρV)

η: learning rate (lr);  ρ: momentum;  V: velocity (a smoothed gradient);  β: weight decay
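All of these options (momentum, weight decay, Nesterov momentum) map to arguments of torch.optim.SGD; a minimal sketch with placeholder values:

```python
import torch

model = torch.nn.Linear(10, 1)  # placeholder model
optimizer = torch.optim.SGD(model.parameters(),
                            lr=0.01,            # eta: learning rate
                            momentum=0.9,       # rho: momentum
                            weight_decay=1e-4,  # beta: weight decay
                            nesterov=True)      # use Nesterov momentum
```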
Adam (adaptive moment estimation): "adaptive" version of SGD
https://arxiv.org/pdf/1412.6980.pdf

Problem of SGD: you may need to manually adjust/reduce the learning rate after many epochs.

Adam: auto-adjusts the learning rate; you only need to specify the "initial" learning rate (lr).
Usually, we do not need to change the other parameters.

Adamax: makes more stable adjustments (e.g., when the mini-batch size is very small).
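A minimal sketch of the corresponding PyTorch optimizers; the model is a placeholder and the learning rates shown are the PyTorch defaults:

```python
import torch

model = torch.nn.Linear(10, 1)  # placeholder model

# Adam: specify only the initial learning rate; other parameters keep their defaults
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Adamax: a variant of Adam (based on the infinity norm)
optimizer = torch.optim.Adamax(model.parameters(), lr=2e-3)
```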
Which optimizer to use?
• https://github.com/ilguyi/optimizers.numpy
• https://ruder.io/optimizing-gradient-descent/

• Whether an optimizer will or will not work for your data depends on the "shape" of the loss function
• The "shape" of the loss function is determined by the type of loss function and the data distribution
• In general, the loss function of a DNN is NOT convex: there are many local minima
• The only way to know is to try the optimizers on your data
[Diagram] x → f(Wᵀx + b) → y → f(Wᵀy + b) → y → f(Wᵀy + b) → y
(Node-1: 1st hidden layer,  Node-2: 2nd hidden layer,  Node-3: output layer)
A neural network is a computational graph
• A layer is a computation node in the graph
• A computation node is a function
• A tensor (vector/matrix) is an input or output of a function
[Diagram] the same chain of nodes (node-1, node-2, node-3), with the parameters W and b as additional inputs to each node.

W is a leaf of the graph (it is not the output of a function).


[Diagram] forward pass through a chain of computation nodes (arbitrary functions):
    x → f(x; W) → y = x → f(x; W) → y = x → (many more nodes) → loss L
[Diagram] backward pass: ∂L/∂y flows back through the chain as the ∂L/∂x of each node, and ∂L/∂vec(W) is computed at each node.

The computational graph for backpropagation (backward pass):

    ∂L/∂vec(W) = (∂L/∂y) · (∂f/∂vec(W)),        ∂L/∂x = (∂L/∂y) · (∂f/∂x)
Forward and Backward inside a Node (Function)
[Diagram] a node computes y = f(x; W): the forward pass takes x (and the parameter W) and produces y; the backward pass takes ∂L/∂y and produces ∂L/∂x and ∂L/∂W.

In the two equations below, we assume x, y, and W have been vectorized.

    ∂L/∂x = (∂L/∂y)·(∂y/∂x) = (∂L/∂y)·(∂f/∂x)
    ∂L/∂W = (∂L/∂y)·(∂y/∂W) = (∂L/∂y)·(∂f/∂W)
https://pytorch.org/docs/master/notes/extending.html
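The PyTorch note linked above shows how to implement such a node as a custom torch.autograd.Function with its own forward and backward; here is a minimal sketch for a linear node y = xWᵀ + b (shapes and names are illustrative):

```python
import torch

class LinearNode(torch.autograd.Function):
    # a node that computes y = x @ W.T + b

    @staticmethod
    def forward(ctx, x, W, b):
        ctx.save_for_backward(x, W)      # keep what the backward pass needs
        return x @ W.t() + b             # forward: x (and W, b) -> y

    @staticmethod
    def backward(ctx, grad_y):           # grad_y is dL/dy
        x, W = ctx.saved_tensors
        grad_x = grad_y @ W              # dL/dx = (dL/dy) * (df/dx)
        grad_W = grad_y.t() @ x          # dL/dW = (dL/dy) * (df/dW)
        grad_b = grad_y.sum(dim=0)       # dL/db
        return grad_x, grad_W, grad_b

# usage sketch
x = torch.randn(4, 3)
W = torch.randn(2, 3, requires_grad=True)
b = torch.randn(2, requires_grad=True)
y = LinearNode.apply(x, W, b)
y.sum().backward()                       # fills W.grad and b.grad
```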
Forward and Backward of two connected nodes

[Diagram] two connected nodes: the output y of the first node becomes the input x of the second node.
In the forward pass, data flows from the first node to the second; in the backward pass, the ∂L/∂x computed by the second node is the ∂L/∂y received by the first node, and each node computes ∂L/∂W for its own parameters.

https://pytorch.org/docs/master/notes/extending.html
In general, W could be a high-dimensional tensor.
The shape of ∂L/∂W is the same as the shape of W:

    (∂L/∂W)_{i0,i1,i2,…} = ∂L/∂W_{i0,i1,i2,…}        (an element of the tensor)

vec(W) converts W to a vector.
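A quick check of this in PyTorch (the tensor shape and the toy loss are arbitrary):

```python
import torch

W = torch.randn(3, 4, 5, requires_grad=True)  # a rank-3 parameter tensor
loss = (W ** 2).sum()                         # placeholder scalar loss
loss.backward()
print(W.grad.shape)         # torch.Size([3, 4, 5]) -- same shape as W
print(W.reshape(-1).shape)  # vec(W): torch.Size([60])
```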


Automatic Differentiation in PyTorch
- automatically calculates the derivatives

[Diagram] the graph with two computation nodes:
    x1, x2 → z = x1 * x2 → y = sum(z)
Note: y must be a scalar (y.backward() can be called without arguments only when y is a scalar).
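A minimal sketch of this two-node graph (the tensor values are illustrative):

```python
import torch

# leaf tensors that require gradients
x1 = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
x2 = torch.tensor([4.0, 5.0, 6.0], requires_grad=True)

z = x1 * x2        # node 1: elementwise product
y = torch.sum(z)   # node 2: y is a scalar

y.backward()       # automatic differentiation fills x1.grad and x2.grad
print(x1.grad)     # dy/dx1 = x2 -> tensor([4., 5., 6.])
print(x2.grad)     # dy/dx2 = x1 -> tensor([1., 2., 3.])
```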
[Diagram] Graph-1 → x,  y = x,  y → Graph-2
If we simply set y = x, Graph-2 and Graph-1 are connected (they form one graph, and gradients can flow from Graph-2 back into Graph-1).

[Diagram] Graph-1 → x,  y = x.detach() in PyTorch,  y → Graph-2
After y = x.detach(), Graph-2 will not be connected to Graph-1 (no gradients flow from Graph-2 back into Graph-1).

Convert a rank-0 tensor to a number using .item()
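A minimal sketch of .detach() and .item() (the tensor values are illustrative):

```python
import torch

x = torch.tensor([2.0, 3.0], requires_grad=True)
a = (x ** 2).sum()        # Graph-1: a is connected to x

b = a * 2                 # still connected to Graph-1: gradients can flow back to x
c = a.detach() * 2        # a.detach() cuts the connection: no gradients flow back to x

print(a.item())           # convert the rank-0 (scalar) tensor to a Python number: 13.0
```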
Implement a Neural Network for Linear Regression

[Diagram] x → f(x) = Wᵀx + b → y
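A minimal sketch of this model in PyTorch (the input/output sizes are placeholders):

```python
import torch

# a single linear layer implements f(x) = W^T x + b
model = torch.nn.Linear(in_features=5, out_features=1)   # sizes are placeholders

x = torch.randn(8, 5)   # a mini-batch of 8 data points
y = model(x)            # shape (8, 1)
```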
Define a function to train the neural network model in one epoch
If we write loss_train += loss:

[Diagram]
    minibatch-1:   x → Graph-1   → y_p → L_1
    minibatch-2:   x → Graph-2   → y_p → L_2
    …
    minibatch-100: x → Graph-100 → y_p → L_100
    loss_train accumulates Σ_i L_i as a tensor, which keeps a reference to every graph.

All of the computational graphs will be kept in computer memory.
The computer will run out of memory and freeze.
If we write loss_train += loss.item():

[Diagram]
    minibatch-1: x → Graph-1 → y_p → L_1   (Graph-1 is deleted after the 1st iteration of the for loop)
    minibatch-2: x → Graph-2 → y_p → L_2   (Graph-2 is deleted after the 2nd iteration of the for loop)
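A minimal sketch of such a train-one-epoch function using loss.item(); the model, optimizer, loss_fn, and dataloader names are placeholders:

```python
def train_one_epoch(model, optimizer, loss_fn, dataloader):
    model.train()
    loss_train = 0.0
    for x_batch, y_batch in dataloader:
        y_pred = model(x_batch)            # forward pass
        loss = loss_fn(y_pred, y_batch)
        optimizer.zero_grad()
        loss.backward()                    # backward pass
        optimizer.step()                   # update the parameters
        loss_train += loss.item()          # .item() returns a number, so the graph can be freed
    return loss_train / len(dataloader)    # average loss over the mini-batches
```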
Define a function to evaluate the neural network model
on the validation set or the test set.
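A minimal sketch of such an evaluation function (the model, loss_fn, and dataloader names are placeholders):

```python
import torch

def evaluate(model, loss_fn, dataloader):
    model.eval()
    loss_eval = 0.0
    with torch.no_grad():                          # no computational graphs are built
        for x_batch, y_batch in dataloader:
            y_pred = model(x_batch)
            loss_eval += loss_fn(y_pred, y_batch).item()
    return loss_eval / len(dataloader)             # average loss on the validation/test set
```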
We may need more than 10 epochs.
Plot the loss values over the epochs to see if the loss converges or overfitting occurs.
After training, we evaluate the model on the test set.
[Plot: predicted vs. true values on the test set; the blue dots are near the 45-degree line, which means the predictions are very good.]
Implement a Neural Network for Nonlinear Regression
[Diagram] x → layer1: Wᵀx + b → layer2: f(Wᵀx + b) → layer3: f(Wᵀx + b) → y

f(x) = log(1 + eˣ); this is the softplus activation.

Softplus: f(x) = log(1 + eˣ)
Rectified Linear Unit (ReLU): f(x) = max(0, x)
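A minimal sketch of a small multilayer network with the Softplus activation (the layer sizes and the placement of the activations are assumptions; the demo's exact architecture may differ):

```python
import torch

model = torch.nn.Sequential(
    torch.nn.Linear(1, 32),     # layer1: W^T x + b
    torch.nn.Softplus(),        # f(x) = log(1 + e^x)
    torch.nn.Linear(32, 32),    # layer2
    torch.nn.Softplus(),
    torch.nn.Linear(32, 1),     # layer3: output
)

x = torch.randn(8, 1)
y = model(x)                    # shape (8, 1)
```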
read and run these demos
