Liang Liang
Mini-batch Stochastic Gradient Descent (SGD)
• Initialization: initialize the network parameters using random numbers
• In one epoch (the network sees all of the training data points):
  • Randomly shuffle the data points in the training set and divide them into mini-batches
    (a mini-batch is a subset of the training set; assume there are M mini-batches)
  • for m in [1, 2, …, M]:
    • assume the data points in the current mini-batch are (x_i, y_i), i = 1, …, N_m
      (N_m is called batch_size)
    • Forward Pass: compute the output ŷ_i for every data point in this mini-batch
    • Backward Pass: compute the derivatives
    • Update every parameter: w ← w − η·∂L/∂w, where L = (1/N_m) Σ_{i=1}^{N_m} L(ŷ_i, y_i)
• η is called the learning rate
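The loop above can be sketched in plain Python for a 1-parameter model (the model y = w·x, the squared loss, the toy data, and the hyperparameter values are illustrative assumptions, not from the slides):

```python
import random

# Minimal sketch of one epoch of mini-batch SGD for a 1-parameter model
# y_hat = w * x with squared loss L = (y_hat - y)^2.

def sgd_epoch(w, data, batch_size, lr):
    random.shuffle(data)                        # randomly shuffle the training set
    for start in range(0, len(data), batch_size):
        batch = data[start:start + batch_size]  # one mini-batch of N_m points
        n_m = len(batch)
        # backward pass: dL/dw averaged over the mini-batch,
        # with dL/dw = 2 * (w*x - y) * x for one data point
        grad = sum(2.0 * (w * x - y) * x for x, y in batch) / n_m
        w = w - lr * grad                       # update: w <- w - eta * dL/dw
    return w

# toy data generated from y = 3x
data = [(x, 3.0 * x) for x in [0.5, 1.0, 1.5, 2.0, 2.5, 3.0]]
w = 0.0
for epoch in range(50):
    w = sgd_epoch(w, data, batch_size=2, lr=0.05)
# w converges toward 3.0
```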
SGD with momentum
initialization: V = 0, W = random
define d_t = ∂L/∂W at iteration t
in iteration t:
    V ← d_t + ρV
    W ← W − ηV
η: learning rate (lr); ρ: momentum; V: velocity (smoothed gradient)
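A minimal sketch of this update for a single scalar parameter (the numeric values of lr, ρ, and d_t are illustrative):

```python
# Momentum update for a scalar parameter; d_t plays the role of dL/dW
# at iteration t.

def momentum_step(w, v, d_t, lr=0.1, rho=0.9):
    v = d_t + rho * v      # V <- d_t + rho*V  (velocity: smoothed gradient)
    w = w - lr * v         # W <- W - eta*V
    return w, v

w, v = 1.0, 0.0
w, v = momentum_step(w, v, d_t=2.0)   # v = 2.0,          w = 1.0 - 0.2  = 0.8
w, v = momentum_step(w, v, d_t=2.0)   # v = 2.0 + 1.8 = 3.8, w = 0.8 - 0.38 = 0.42
```

Note how the velocity grows when successive gradients point in the same direction.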
SGD with weight decay
initialization: W = random
in one iteration:
    W ← W − η(∂L/∂W + βW)        β is weight_decay
Adding a new term to the loss gives the same update: weight decay controls model complexity, which may help reduce the chance of overfitting:
    L_new = L + (β/2)·w²,   so   ∂L_new/∂w = ∂L/∂w + βw
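A quick numerical check of the identity above, on a scalar toy loss (L(w) = (w − 1)² is an illustrative choice, not from the slides):

```python
# Check that adding (beta/2) * w^2 to the loss changes the gradient by beta*w.
beta = 0.1

def L(w):
    return (w - 1.0) ** 2

def L_new(w):
    return L(w) + 0.5 * beta * w ** 2      # L_new = L + (beta/2) * w^2

def num_grad(f, w, h=1e-6):
    return (f(w + h) - f(w - h)) / (2 * h) # central finite difference

w = 0.7
g, g_new = num_grad(L, w), num_grad(L_new, w)
diff = abs(g_new - (g + beta * w))         # gradient gains the beta*w term
```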
SGD with momentum and weight decay
initialization: V = 0, W = random
in one iteration:
    V ← ∂L/∂W + ρV
    W ← W − η(V + βW)
η: learning rate (lr); ρ: momentum; V: velocity (smoothed gradient); β: weight decay
SGD with Nesterov momentum and weight decay
initialization: V = 0, W = random
in one iteration:
    V ← ∂L/∂W + βW + ρV
    W ← W − η(∂L/∂W + βW + ρV)
η: learning rate (lr); ρ: momentum; V: velocity (smoothed gradient); β: weight decay
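The two variants above can be sketched for a scalar parameter, written exactly as the slide formulas (which may differ in detail from PyTorch's internal implementation; the numeric values are illustrative):

```python
# Slide-formula update rules for a scalar parameter.

def step_momentum_wd(w, v, grad, lr, rho, beta):
    v = grad + rho * v                   # V <- dL/dW + rho*V
    w = w - lr * (v + beta * w)          # W <- W - eta*(V + beta*W)
    return w, v

def step_nesterov_wd(w, v, grad, lr, rho, beta):
    v_new = grad + beta * w + rho * v    # V <- dL/dW + beta*W + rho*V
    w = w - lr * (grad + beta * w + rho * v_new)  # W <- W - eta*(dL/dW + beta*W + rho*V)
    return w, v_new

w1, v1 = step_momentum_wd(1.0, 0.0, grad=2.0, lr=0.1, rho=0.9, beta=0.0)
# v1 = 2.0, w1 = 1.0 - 0.1*2.0 = 0.8
w2, v2 = step_nesterov_wd(1.0, 0.0, grad=2.0, lr=0.1, rho=0.9, beta=0.0)
# v2 = 2.0, w2 = 1.0 - 0.1*(2.0 + 0.9*2.0) = 0.62
```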
Adam (adaptive moment estimation): "adaptive" version of SGD
https://arxiv.org/pdf/1412.6980.pdf
Problem of SGD: you may need to manually adjust/reduce the learning rate after many epochs
Adam: auto-adjusts the learning rate; you only need to specify the "initial" learning rate (lr)
Adamax: makes more stable adjustments (e.g., when the mini-batch size is very small)
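A compact sketch of the Adam update for a single scalar parameter, following the update rule in the paper linked above (defaults β₁ = 0.9, β₂ = 0.999, ε = 1e-8; the toy loss and lr value are illustrative):

```python
import math

def adam_step(w, m, v, grad, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad       # 1st moment: mean of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2  # 2nd moment: uncentered variance
    m_hat = m / (1 - beta1 ** t)             # bias correction (t starts at 1)
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)  # adaptive step size
    return w, m, v

# minimize L(w) = w^2 (dL/dw = 2w); lr is the only hyperparameter we set
w, m, v = 5.0, 0.0, 0.0
for t in range(1, 2001):
    w, m, v = adam_step(w, m, v, grad=2 * w, t=t, lr=0.05)
# w has moved from 5.0 toward the minimum at 0
```

Notice that the effective step size is roughly lr regardless of the gradient's magnitude, which is what "adaptive" refers to.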
Which optimizer to use?
• https://github.com/ilguyi/optimizers.numpy
• https://ruder.io/optimizing-gradient-descent/
• Whether an optimizer will or will not work for your data depends on the "shape" of the loss function
• The "shape" of the loss function is determined by the type of loss function and the data distribution
• In general, the loss function of a DNN is NOT convex: there are many local minima
• The only way to find out is to try the optimizer on your data
A neural network is a computational graph
• A layer is a computation node in the graph
• A computation node is a function
• A tensor (vector/matrix) is the input or output of a function
Example: a network with three nodes (1st hidden layer, 2nd hidden layer, output layer), where each node computes y = f(Wᵀx + b) with its own parameters W and b, and the output y of one node is the input of the next:
    x → Node-1: f(Wᵀx + b) → Node-2: f(Wᵀy + b) → Node-3: f(Wᵀy + b) → y
The loss gradient flows backward through the (possibly many) nodes of the graph: each node receives ∂L/∂y from the node after it, and computes ∂L/∂x (passed back to the previous node) and ∂L/∂vec(W) (used to update its own parameters).
For one node:
    ∂L/∂W = (∂L/∂y)(∂y/∂W) = (∂L/∂y)(∂f/∂W)
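The per-node rule above can be sketched with scalar nodes computing y = w·x (an illustrative choice of f; real layers use tensors):

```python
# A computation node: forward caches the input, backward receives dL/dy
# and produces dL/dx (sent to the previous node) and dL/dw (for the update).

class MulNode:
    def __init__(self, w):
        self.w = w

    def forward(self, x):
        self.x = x                    # cache input for the backward pass
        return self.w * x             # y = f(x; w) = w * x

    def backward(self, dL_dy):
        dL_dx = dL_dy * self.w        # dL/dx = dL/dy * dy/dx
        self.dL_dw = dL_dy * self.x   # dL/dw = dL/dy * dy/dw
        return dL_dx                  # passed back to the previous node

# two connected nodes: the output of node1 becomes the input of node2
node1, node2 = MulNode(2.0), MulNode(3.0)
y = node2.forward(node1.forward(4.0))        # y = 3 * (2 * 4) = 24
dL_dx = node1.backward(node2.backward(1.0))  # start the chain from dL/dy = 1
# node2.dL_dw = 8, node1.dL_dw = 12, dL_dx = 6  (y = 6x, dy/dx = 6)
```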
https://pytorch.org/docs/master/notes/extending.html
Forward and Backward of two connected nodes
In the forward pass, the output y of the first node becomes the input x of the second node. In the backward pass, each node computes ∂L/∂W for its own parameters W and passes ∂L/∂x back to the node before it.
In general, W could be a high-dimensional tensor.
The shape of ∂L/∂W is the same as the shape of W:
    (∂L/∂W)_{i0,i1,i2,…} = ∂L/∂W_{i0,i1,i2,…}    (an element of the tensor)
Automatic Differentiation in PyTorch
- automatically calculates the derivatives
- y must be a scalar
The graph with two computation nodes:
    x1, x2 → z = x1 ∗ x2 → y = sum(z)
Graph-1: y = x  (y stays in the graph, so gradients flow back to x)
Graph-2: y = x.detach() in PyTorch  (detach() cuts y out of the graph, so no gradient flows back to x)
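A small PyTorch demo of the Graph-1 vs Graph-2 difference (the tensor values are illustrative):

```python
import torch

x = torch.tensor([1.0, 2.0], requires_grad=True)

y1 = x * 3           # Graph-1: y1 is connected to x
y2 = x.detach() * 3  # Graph-2: gradient will NOT flow back to x through y2

(y1.sum() + y2.sum()).backward()
print(x.grad)        # only the y1 branch contributes: tensor([3., 3.])
```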
A single computation node: y = f(x) = Wᵀx + b
Define a function to train the neural network model in one epoch
If we write loss_train += loss:
    minibatch-1:   x → Graph-1   → y_p → L_1
    minibatch-2:   x → Graph-2   → y_p → L_2
    …
    minibatch-100: x → Graph-100 → y_p → L_100
loss_train = Σ_i L_i then keeps a reference to every graph, so all of the computational graphs will be kept in computer memory. The computer will run out of memory and freeze.
If we write loss_train += loss.item() instead:
    minibatch-1: x → Graph-1 → y_p → L_1    (Graph-1 is deleted after the 1st iteration of the for loop)
    minibatch-2: x → Graph-2 → y_p → L_2    (Graph-2 is deleted after the 2nd iteration of the for loop)
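Putting this together, a training-epoch function might look like the sketch below (the model, loss function, optimizer, and toy data are illustrative stand-ins, not the course's exact code):

```python
import torch

def train_one_epoch(model, optimizer, loss_fn, dataloader):
    model.train()
    loss_train = 0.0
    for x, y in dataloader:
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)   # forward pass builds a new graph
        loss.backward()               # backward pass
        optimizer.step()
        loss_train += loss.item()     # .item() -> plain float, graph can be freed
    return loss_train / len(dataloader)

# toy usage: a linear model on random data
model = torch.nn.Linear(2, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.01)
data = [(torch.randn(4, 2), torch.randn(4, 1)) for _ in range(3)]
avg_loss = train_one_epoch(model, opt, torch.nn.MSELoss(), data)
```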
define a function to evaluate the neural network model
on the validation set or the test set
we may need more than 10 epochs
Plot the loss values over the epochs to see if the loss converges or overfitting occurs.
After training, we evaluate the model on the test set. In the prediction-vs-target scatter plot, the blue dots are near the 45-degree line, which means very good prediction.
Implement a Neural Network for Nonlinear Regression
    x → layer1: Wᵀx + b → f → layer2: Wᵀx + b → f → layer3: Wᵀx + b → y
Here f is the softplus activation.
Softplus: f(x) = log(1 + e^x)
Rectified Linear Unit (ReLU): f(x) = max(0, x)
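Both activations can be written out directly (a plain-Python sketch for scalars):

```python
import math

def softplus(x):
    return math.log(1.0 + math.exp(x))   # f(x) = log(1 + e^x)

def relu(x):
    return max(0.0, x)                   # f(x) = max(0, x)

# softplus is a smooth approximation of ReLU:
# softplus(0) = log(2) ~ 0.693, while relu(0) = 0
values = [softplus(0.0), relu(-2.0), relu(2.0)]
```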
Read and run these demos.