Spring 2022
Recap: Multi-Layer Networks
[Figure: a multi-layer network. The inputs $U_i = Y_i^{(0)}$ enter the input layer (layer 0); each layer $l$ forms $X_i^{(l)}$ from the previous layer's outputs $Y_j^{(l-1)}$ through the weights $w_{ij}^{(l-1)}$, and applies the activation $f$ to obtain $Y_i^{(l)}$, up to the output layer (layer $L$) with outputs $Y_i^{(L)}$.]
[Figure: the error curve $E(W)$ around its minimum $W^*$. Where $\frac{\partial E}{\partial W}$ is negative, $W < W^*$ and the steepest-descent update increases $W$; where it is positive, $W > W^*$ and the update decreases $W$.]
Backpropagation Algorithm
• It is an algorithm based on the steepest-descent (gradient descent) concept
Backpropagation Algorithm
• $X_i^{(l)} = \sum_{j=1}^{N(l-1)} w_{ij}^{(l-1)} Y_j^{(l-1)}$ — the output of a hidden node before applying the activation function
• $Y_i^{(l)} = f\left( X_i^{(l)} \right)$ — the output of layer $l$
Backpropagation Algorithm
• Error, i.e., the cost function: $E = \frac{1}{M} \sum_{m=1}^{M} E_m$
• Loss per example (here, for regression): $E_m = \sum_{i=1}^{N(L)} \left( Y_i^{(L)}(m) - d_i(m) \right)^2$
• We need to compute the gradient, i.e., $\frac{\partial E}{\partial w_{ij}^{(l)}}$
Chain Rule
• $Y = f(y_1, y_2, y_3)$, where each $y_k = g_k(Z)$
[Figure: $Z$ feeds $g_1, g_2, g_3$, producing $y_1, y_2, y_3$, which feed $f$ to produce $Y$.]
• By the chain rule: $\dfrac{dY}{dZ} = \sum_{k=1}^{3} \dfrac{\partial f}{\partial y_k} \dfrac{dy_k}{dZ}$
Backpropagation Algorithm
• $E_m = \sum_{i=1}^{N(L)} \left( Y_i^{(L)}(m) - d_i(m) \right)^2$
• $Y_i^{(L)} = f\left( X_i^{(L)} \right)$
• $X_i^{(L)} = \sum_{j=1}^{N(L-1)} w_{ij}^{(L-1)} Y_j^{(L-1)}$
• Chain rule!
$$\frac{\partial E_m}{\partial w_{IJ}^{(L-1)}} = \frac{\partial E_m}{\partial Y_I^{(L)}} \cdot \frac{\partial Y_I^{(L)}}{\partial X_I^{(L)}} \cdot \frac{\partial X_I^{(L)}}{\partial w_{IJ}^{(L-1)}} = \frac{\partial E_m}{\partial Y_I^{(L)}} \cdot \frac{\partial Y_I^{(L)}}{\partial X_I^{(L)}} \cdot \frac{\partial}{\partial w_{IJ}^{(L-1)}} \sum_{j=1}^{N(L-1)} w_{Ij}^{(L-1)} Y_j^{(L-1)} = \frac{\partial E_m}{\partial Y_I^{(L)}} \cdot \frac{\partial Y_I^{(L)}}{\partial X_I^{(L)}} \cdot Y_J^{(L-1)}$$
• Note that the other $X_i^{(L)}$'s are not taken into account, because they do not depend on $w_{IJ}^{(L-1)}$ at all
Backpropagation Algorithm
• To get $\frac{\partial E_m}{\partial w_{ij}^{(l)}}$ for any general layer $l$:
$$\frac{\partial E_m}{\partial w_{IJ}^{(l)}} = \frac{\partial E_m}{\partial X_I^{(l+1)}} \cdot \frac{\partial X_I^{(l+1)}}{\partial w_{IJ}^{(l)}} = \frac{\partial E_m}{\partial X_I^{(l+1)}} \cdot Y_J^{(l)}$$
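To make the recursion concrete, here is a minimal NumPy sketch of these equations for a two-layer network. The sigmoid activation, the layer sizes, and the variable names (W0, W1, dE_dX2, …) are illustrative assumptions, not from the slides.

```python
import numpy as np

def f(x):           # sigmoid activation
    return 1.0 / (1.0 + np.exp(-x))

def f_prime(x):     # derivative of the sigmoid
    s = f(x)
    return s * (1.0 - s)

rng = np.random.default_rng(0)
Y0 = rng.normal(size=3)            # Y^(0): network input
d  = np.array([1.0, 0.0])          # desired output
W0 = rng.normal(size=(4, 3))       # w^(0): layer-0 -> layer-1 weights
W1 = rng.normal(size=(2, 4))       # w^(1): layer-1 -> layer-2 weights

# Forward pass: X^(l) = W^(l-1) Y^(l-1), then Y^(l) = f(X^(l))
X1 = W0 @ Y0; Y1 = f(X1)
X2 = W1 @ Y1; Y2 = f(X2)

# Backward pass for E_m = sum_i (Y_i^(L) - d_i)^2
dE_dX2  = 2.0 * (Y2 - d) * f_prime(X2)    # dE/dX^(L)
grad_W1 = np.outer(dE_dX2, Y1)            # dE/dw^(L-1) = dE/dX^(L) * Y^(L-1)
dE_dX1  = (W1.T @ dE_dX2) * f_prime(X1)   # propagate dE/dX back one layer
grad_W0 = np.outer(dE_dX1, Y0)            # dE/dw^(0)  = dE/dX^(1) * Y^(0)
```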
Disadvantages of Backpropagation
Types of Weight Update
• Batch or epoch update, i.e., Gradient Descent:
– Present the full set of training examples (a batch of examples)
– Compute the error of each example
– Compute the gradient of the batch (based on the cost function of the whole batch)
– Update weights based on this batch gradient
– Do another iteration … and so on
– Advantages:
• Optimization is more consistent
– Disadvantages:
• Slow (too long per iteration)
Types of Weight Update
• Sequential update, i.e., Stochastic Gradient Descent:
– Present a training pattern, then update the weights (according to $\frac{\partial E_m}{\partial w}$), then present the next one … and so on
– Advantages:
• Faster compared to gradient descent, i.e., the full batch
– Disadvantages:
• Hard to converge: “stochastic” since it depends on every single example; however, in practice, being close to the minimum is reasonably good
• Loses the speedup from vectorization
Types of Weight Update
• Mini-batch update:
– Present a subset of training examples (a mini-batch of examples)
– Compute the error of each example
– Compute the gradient of the mini-batch (based on the cost function of this mini-batch)
– Update weights based on this mini-batch gradient
– Move to another mini-batch, and after finishing all mini-batches do another iteration … and so on
– Advantages:
• Fast (a sketch contrasting the three update modes follows)
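A minimal sketch contrasting the three update modes. The linear least-squares model, the data, and the batch size of 20 are illustrative assumptions; only the amount of data per weight update differs between the modes.

```python
import numpy as np

def grad(w, X, y):
    """Gradient of the mean squared error for a linear model y ~ X w."""
    return 2.0 / len(X) * X.T @ (X @ w - y)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)
w = np.zeros(3)
alpha = 0.1

for epoch in range(50):
    # Batch gradient descent: one update per epoch, using all examples.
    # w -= alpha * grad(w, X, y)

    # Stochastic gradient descent: one update per single example.
    # for i in rng.permutation(len(X)):
    #     w -= alpha * grad(w, X[i:i+1], y[i:i+1])

    # Mini-batch: one update per small batch (here, 20 examples).
    for start in range(0, len(X), 20):
        sl = slice(start, start + 20)
        w -= alpha * grad(w, X[sl], y[sl])
```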
Other Optimization Algorithms
• Gradient descent with momentum
– Smooths out the steps of gradient descent using a moving average of the derivatives
– Gives faster learning in the intended direction & avoids oscillations
[Figure: an oscillating descent path; averaging gives faster learning along the consistent direction and slower learning along the oscillating one.]
$$D_W = \beta D_W + (1 - \beta) \frac{\partial E}{\partial W}, \qquad W = W - \alpha D_W \qquad (D_W = 0 \text{ initially})$$
Other Optimization Algorithms
• RMSProp
– Slows down learning in unintended directions
– Avoids oscillations
[Figure: contours over two weights, $w_i$ (the intended direction, small $\frac{\partial E}{\partial w_i}$) and $w_j$ (the oscillating direction, large $\frac{\partial E}{\partial w_j}$); RMSProp gives faster learning along $w_i$ and slower learning along $w_j$.]
$$S_{w_i} = \beta S_{w_i} + (1 - \beta) \left( \frac{\partial E}{\partial w_i} \right)^2, \qquad w_i = w_i - \alpha \frac{\partial E / \partial w_i}{\sqrt{S_{w_i}} + \varepsilon}$$
$$S_{w_j} = \beta S_{w_j} + (1 - \beta) \left( \frac{\partial E}{\partial w_j} \right)^2, \qquad w_j = w_j - \alpha \frac{\partial E / \partial w_j}{\sqrt{S_{w_j}} + \varepsilon}$$
– Since $\left| \frac{\partial E}{\partial w_i} \right| < \left| \frac{\partial E}{\partial w_j} \right|$, the division by $\sqrt{S}$ makes the step along $w_j$ relatively smaller
Other Optimization Algorithms
• Adam (combines both RMSProp & momentum)
$$D_{w_i} = \beta_1 D_{w_i} + (1 - \beta_1) \frac{\partial E}{\partial w_i}, \qquad S_{w_i} = \beta_2 S_{w_i} + (1 - \beta_2) \left( \frac{\partial E}{\partial w_i} \right)^2$$
$$w_i = w_i - \alpha \frac{D_{w_i}}{\sqrt{S_{w_i}} + \varepsilon}$$
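A sketch of the three update rules above as NumPy helper functions; the hyper-parameter defaults are illustrative assumptions. Note that the published Adam method additionally divides $D$ and $S$ by bias-correction factors $(1 - \beta_1^t)$ and $(1 - \beta_2^t)$, which these slides omit.

```python
import numpy as np

def momentum_step(w, g, D, alpha=0.01, beta=0.9):
    D = beta * D + (1 - beta) * g          # moving average of gradients
    return w - alpha * D, D

def rmsprop_step(w, g, S, alpha=0.01, beta=0.9, eps=1e-8):
    S = beta * S + (1 - beta) * g**2       # moving average of squared gradients
    return w - alpha * g / (np.sqrt(S) + eps), S

def adam_step(w, g, D, S, alpha=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    D = beta1 * D + (1 - beta1) * g        # momentum term
    S = beta2 * S + (1 - beta2) * g**2     # RMSProp term
    return w - alpha * D / (np.sqrt(S) + eps), D, S
```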
Regularization
• Used to prevent overfitting
– Intuition: drive the weights of some hidden nodes toward zero to simplify the network, i.e., a smaller effective network
• L2 regularization (aka weight decay): $J = \frac{1}{M} \sum_{m=1}^{M} E_m + \frac{\lambda}{2M} \left\| W \right\|_2^2$
– $\left\| W \right\|_2^2 = \sum_j w_j^2 = W^T W$
• L1 regularization: $J = \frac{1}{M} \sum_{m=1}^{M} E_m + \frac{\lambda}{2M} \left\| W \right\|_1$
– $\left\| W \right\|_1 = \sum_j |w_j|$
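A small sketch of the regularized cost $J$ as written above; the function name and interface are assumptions for illustration.

```python
import numpy as np

def regularized_cost(E_m, W, lam, M, kind="l2"):
    """Cost J = mean of per-example losses plus a weight penalty."""
    data_term = np.mean(E_m)               # (1/M) sum of E_m
    if kind == "l2":
        return data_term + lam / (2 * M) * np.sum(W**2)    # (lambda/2M)||W||_2^2
    return data_term + lam / (2 * M) * np.sum(np.abs(W))   # (lambda/2M)||W||_1
```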
Dropout Regularization
• Used to prevent overfitting
• Intuitions:
– Eliminate some nodes, based on some probability, to simplify the network, i.e., a smaller network
– As if you train smaller networks on individual training examples
– The network cannot rely on any one feature, so it spreads out the weights
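A sketch of one common variant, inverted dropout, which rescales the surviving activations by 1/keep_prob so their expected value is unchanged; the slides do not specify a variant, so this choice is an assumption.

```python
import numpy as np

def dropout_forward(Y, keep_prob, rng):
    """Inverted dropout: zero each activation with probability 1 - keep_prob,
    then rescale the survivors so the expected activation is unchanged."""
    mask = rng.random(Y.shape) < keep_prob
    return Y * mask / keep_prob

rng = np.random.default_rng(0)
Y = rng.normal(size=(4, 5))                       # activations of a hidden layer
Y_train = dropout_forward(Y, keep_prob=0.8, rng=rng)
# At test time, no dropout is applied: use Y as-is.
```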
Guidelines for Training
• Learning rate $\alpha$:
– Too small: convergence will be slow.
– Too large: we will oscillate around the minimum.
– Some methods propose a varying rate, i.e., learning rate decay.
– When learning does not go well, consider using a smaller learning rate.
Learning Rate Decay
• Gradient descent with a small mini-batch size
[Figure (Source: Andrew Ng): with a fixed learning rate the mini-batch steps keep wandering around the minimum, while a decaying rate lets them settle closer to it.]
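The slide does not give a formula, so as a hedged illustration, one commonly used decay schedule is the following (the function name is an assumption):

```python
def decayed_rate(alpha0, decay_rate, epoch):
    """One common schedule: alpha = alpha0 / (1 + decay_rate * epoch)."""
    return alpha0 / (1 + decay_rate * epoch)
```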
Input and Output Normalization
• Input and output normalization
– Inputs have to be approximately in the range 0 to 1 or -1 to 1
[Figure: two example input ranges, -1 to 1 and -2 to 2.]
– Min-max scaling: $x = (u - u_{min}) / (u_{max} - u_{min})$
– Standardization: $x = (u - \text{Mean}(u)) / \text{StdDev}(u)$
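A sketch of the two normalization formulas above:

```python
import numpy as np

def minmax_normalize(u):
    """x = (u - u_min) / (u_max - u_min): maps inputs into [0, 1]."""
    return (u - u.min()) / (u.max() - u.min())

def standardize(u):
    """x = (u - mean(u)) / std(u): zero mean, unit standard deviation."""
    return (u - u.mean()) / u.std()
```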
Train/Dev/Test Partition
• Best practice:
– Training: 60%, Validation (Dev): 20% & Test: 20%
– In case of big data, e.g., $10^6$ examples, then 98%, 1% & 1%
Machine Learning Recipe
• Train the network and evaluate first on the training data
– If bias is high, i.e., underfitting (performance is bad on the training set itself), then:
• Bigger network (more hidden nodes or more hidden layers) → works most of the time
• Train longer → works sometimes
– Check for bias again and keep changing until a good bias is reached
Hyper-Parameters Tuning
• Learning rate $\alpha$ (1st in importance)
• Mini-batch size
Tuning Process
• Try random values: don’t use a grid
– Better exploration of important parameters
– Consider the example on the board
K-Fold Cross Validation
• For parameter tuning over the training data
• Apply K-fold validation to the training set (usually K = 5)
[Figure: the training set is split into folds; each fold serves once as the validation set while the others train, yielding errors $E_1, E_2, E_3, E_4, \ldots$ that are averaged.]
K-Fold Cross Validation
• Better than the conventional train-validation-test split
– Just split into training and test (no need for a separate validation set)
– Not biased to the particular nature of the split between training and validation
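A sketch of K-fold validation as described; the `train_and_eval` callback, which should fit a model on the training folds and return the validation error, is a hypothetical placeholder.

```python
import numpy as np

def k_fold_errors(X, y, train_and_eval, k=5, seed=0):
    """Split the training data into k folds; each fold serves once as the
    validation set, giving errors E_1 .. E_k that are averaged."""
    idx = np.random.default_rng(seed).permutation(len(X))
    folds = np.array_split(idx, k)
    errors = []
    for i in range(k):
        val = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        errors.append(train_and_eval(X[train], y[train], X[val], y[val]))
    return np.mean(errors)
```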
Multi-Class Classification
• E.g., an image is either of a cat, or dog, or duck, or otherwise
[Figure: a network with inputs $X_1, \ldots, X_N$ and an output layer with one node per class: $P(\text{cat}|X)$, $P(\text{dog}|X)$, …, $P(\text{other}|X)$.]
Multi-Class Classification
• Use the sigmoid activation function in the output only in the case of binary classification, i.e., two classes
[Figure: an output layer with pre-activations $Z_i^{[L]}$ feeding outputs such as $P(\text{dog}|X)$ and $P(\text{other}|X)$.]
Multi-Class Classification
• For multi-class classification, use soft-max regression:
[Figure: an output layer with pre-activations $Z_1^{[L]}, Z_2^{[L]}, \ldots$ and soft-max outputs $y_1 = P(\text{cat}|X)$, $y_2 = P(\text{dog}|X)$, …, $P(\text{other}|X)$.]
• $z_i^{[L]}$ is the output of node $i$ at the output layer before applying any activation function
• $y_i$ is the output after applying soft-max: $y_i = \dfrac{e^{z_i^{[L]}}}{\sum_j e^{z_j^{[L]}}}$
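A sketch of the soft-max formula; subtracting max(z) before exponentiating is a standard numerical-stability trick not shown on the slide, and it does not change the result.

```python
import numpy as np

def softmax(z):
    """y_i = exp(z_i) / sum_j exp(z_j), computed stably."""
    e = np.exp(z - np.max(z))   # shift by max(z) to avoid overflow
    return e / e.sum()

print(softmax(np.array([2.0, 1.0, 0.1])))  # probabilities summing to 1.0
```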
Convolutional Neural Networks (CNN)
Vertical Edge Detection
A $6 \times 6$ input convolved with a $3 \times 3$ vertical-edge filter gives a $4 \times 4$ output:

    10 10 10  0  0  0
    10 10 10  0  0  0        1  0 -1        0 30 30  0
    10 10 10  0  0  0    *   1  0 -1    =   0 30 30  0
    10 10 10  0  0  0        1  0 -1        0 30 30  0
    10 10 10  0  0  0                       0 30 30  0
    10 10 10  0  0  0
               (convolution)
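The example above, reproduced with a small NumPy sketch: a "valid" convolution (no padding, stride 1), implemented as cross-correlation per the usual deep-learning convention.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """'Valid' 2-D convolution: slide the kernel over the image with no
    padding and stride 1, summing the elementwise products."""
    H, W = image.shape
    h, w = kernel.shape
    out = np.zeros((H - h + 1, W - w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i+h, j:j+w] * kernel)
    return out

image = np.array([[10, 10, 10, 0, 0, 0]] * 6)
kernel = np.array([[1, 0, -1]] * 3)      # vertical edge detector
print(conv2d_valid(image, kernel))       # 4x4 result: rows of [0, 30, 30, 0]
```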
Convolutional Neural Networks (CNN)
[Figure: convolution with multiple input channels — each filter spans all input channels and produces one output channel, so the number of filters equals the number of output channels.]
Max Pooling
$2 \times 2$ max pooling (stride 2) on a $6 \times 6$ input gives a $3 \times 3$ output:

    10 10 10  0  0  0
    10 10 10  0  0  0        10 10  0
    10 10 10  0  0  0   -->  10 10  0
    10 10 10  0  0  0        10 10  0
    10 10 10  0  0  0
    10 10 10  0  0  0
Global Max Pooling
[Figure: global max pooling reduces each feature map to a single value, its maximum; e.g., a $6 \times 6$ map whose largest entry is 10 becomes 10, and one whose largest entry is 30 becomes 30.]
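A sketch of both pooling operations on the slide's example; the $2 \times 2$ window and stride 2 are inferred from the $6 \times 6 \to 3 \times 3$ shapes, so treat them as assumptions.

```python
import numpy as np

def max_pool(x, size=2, stride=2):
    """Max pooling: take the maximum over each size x size window."""
    H, W = x.shape
    out_h, out_w = (H - size) // stride + 1, (W - size) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = x[i*stride:i*stride+size, j*stride:j*stride+size].max()
    return out

x = np.array([[10, 10, 10, 0, 0, 0]] * 6)
print(max_pool(x))   # 3x3 result: rows of [10, 10, 0]
print(x.max())       # global max pooling: one value per feature map -> 10
```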
Convolutional Neural Networks (CNN)
Recurrent Neural Networks (RNN)
[Figure: an RNN unrolled over time. The initial activation $a^{<0>}$ is a vector of zeros; at each step $t$ the same network takes the input $x^{<t>}$ and the previous activation $a^{<t-1>}$, producing the output $y^{<t>}$ and the activation $a^{<t>}$ passed to the next step, up to $t = T$.]
• $y^{<t>}$ is the output at one time step from a NN
• $a^{<t>}$ is an activation passed from one step to another, also from a NN
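A sketch of one unrolled recurrence using a common cell form; the tanh activation and the weight names (Waa, Wax, Wya) are assumptions, since the slide only states that $a^{<0>}$ is zeros and that $y^{<t>}$ and $a^{<t>}$ each come from a NN.

```python
import numpy as np

def rnn_step(a_prev, x_t, Waa, Wax, Wya, ba, by):
    """One recurrence step: the same weights are reused at every t."""
    a_t = np.tanh(Waa @ a_prev + Wax @ x_t + ba)   # activation passed onward
    y_t = Wya @ a_t + by                           # output at this time step
    return a_t, y_t

rng = np.random.default_rng(0)
n_a, n_x, n_y, T = 4, 3, 2, 5
Waa, Wax = rng.normal(size=(n_a, n_a)), rng.normal(size=(n_a, n_x))
Wya = rng.normal(size=(n_y, n_a))
ba, by = np.zeros(n_a), np.zeros(n_y)

a = np.zeros(n_a)                                  # a<0> = zeros, as on the slide
for t in range(T):
    a, y = rnn_step(a, rng.normal(size=n_x), Waa, Wax, Wya, ba, by)
```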
Autoencoder Network
• Unsupervised network
• Gives an embedding
– Better embeddings can be obtained using supervised training
Generative Adversarial Network (GAN)
Siamese Networks
Deep Learning
• Subset of machine learning
Machine Learning vs Deep Learning
When to use Deep Learning?
• Big amount of data (expensive!)
• Availability of high computational power (expensive!)
• Lack of domain understanding
Scalability with Data Amount
Potentials of AI
Limitations of Deep Learning
• Lots of achievements in the vision field
Limitations of Deep Learning
[Figure omitted; Source: Google]
– Expensive!
Limitations of Deep Learning
• Datasets may be biased
– Deep networks become biased against rare patterns
Limitations of Deep Learning
• Sensitive to standard adversarial attacks
– Datasets are finite and represent just a fraction of all possible images
Limitations of Deep Learning
• Over-sensitive to changes in context
– Limited number of contexts in the dataset, e.g., a monkey appearing only in a jungle
– Combinatorial explosion!
Limitations of Deep Learning
• Combinatorial explosion
– Real-world images are combinatorially large in number
Limitations of Deep Learning
• Visual understanding is tricky
– Mirrors
– Sparse Information
– Physics
– Humor
Machine Learning Model Selection
1. Categorize the problem:
Machine Learning Model Selection
2. Understand your data:
a) Analyze the data:
• Descriptive statistics
• Data visualization
c) Feature Engineering
Machine Learning Model Selection
3. Determine the possible algorithms:
– Based on categorization & data understanding
Machine Learning Model Selection
4. Implement machine learning algorithms:
– Set up a pipeline
– Compare the algorithms
– Select an evaluation criterion
5. Tune hyperparameters
Time Series Prediction
• A time series contains:
– Trend
– Seasonality
• De-seasonalization:
– Remove the seasonal periodicities
[Figure: pipeline — deseasonalize the TS, predict on the deseasonalized series, then add back the seasonal components.]
How to deseasonalize?
• Removing the seasonal periodicities
[Figure: a monthly series $x(t)$ plotted against $t$, with a 12-month seasonal window marked.]
How to deseasonalize?
• Obtain the average of the TS values over this window:
$$a(\text{year}) = \frac{1}{12} \sum_{\text{window}} x(t)$$
• Normalization step: $Z(i) = \dfrac{x(i)}{a(\text{year})}$
[Figure: the monthly series with one 12-month window highlighted.]
How to deseasonalize?
• Seasonal average ≡ the average of the $Z(i)$'s of the different years for month $i$:
$$u(i) = \frac{\sum_{j=1}^{\#\text{years}} Z_j(i)}{\#\text{years}}$$
[Figure: $u(i)$ plotted for months $i = 1, \ldots, 12$.]
Deseasonalization Step
• Divide the time series value by the corresponding seasonal average:
$$x_{\text{deseasonal}}(t) = \frac{x(t)}{u(\text{month}(t))}$$
Recover Seasonality
• After trend prediction, seasonality can be
recovered via multiplication by the
corresponding seasonal average
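A sketch implementing the deseasonalization pipeline from the last few slides, assuming monthly data with a 12-month period; the function names and interface are illustrative.

```python
import numpy as np

def deseasonalize(x, period=12):
    """Follows the slides: yearly window averages a(year), normalized values
    Z(i), per-month seasonal averages u(i), then x / u(month)."""
    n_years = len(x) // period
    x = np.asarray(x[:n_years * period], dtype=float).reshape(n_years, period)
    a = x.mean(axis=1, keepdims=True)   # a(year): average over each 12-month window
    Z = x / a                           # Z(i) = x(i) / a(year)
    u = Z.mean(axis=0)                  # u(i): seasonal average for month i
    return (x / u).ravel(), u           # divide by the corresponding u(month)

def reseasonalize(x_pred, u, start_month=0):
    """Recover seasonality: multiply by the corresponding seasonal average."""
    months = (np.arange(len(x_pred)) + start_month) % len(u)
    return x_pred * u[months]
```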
Acknowledgment
• These slides have been created relying on
lecture notes of Prof. Dr. Amir Atiya