Applied Neural Networks
Unit 6

Dr. Muhammad Usman Arif

Lecture Outline
▪ Deep Neural Networks
▪ Vectorized Implementation
▪ DNN Building Blocks
▪ Empirical Learning
▪ Bias and Variance
▪ Regularization to Handle Bias and Variance
▪ Vanishing and Exploding Gradients

What are Deep Neural Networks

(Figure: networks of increasing depth: logistic regression with no hidden layer, 1 hidden layer, 2 hidden layers, 5 hidden layers.)


Deep Neural Network Notation

(Figure: a 4-layer example network with n[0] = nx = 3 inputs, n[1] = 5, n[2] = 5, n[3] = 3, n[4] = n[L] = 1 output.)

▪ L = 4 = number of layers (the input layer is not counted as a layer)
▪ n[l] = number of units in layer l
▪ a[l] = activations in layer l, with a[l] = g[l](z[l])
▪ w[l] = weights used to compute z[l]
▪ b[l] = bias used to compute z[l]
▪ x = a[0], and the output ŷ = a[L]


Forward Propagation in a Deep Network


▪ For a single example x = a[0]:
▪ z[1] = w[1] x + b[1], a[1] = g[1](z[1])
▪ z[2] = w[2] a[1] + b[2], a[2] = g[2](z[2])
▪ ……
▪ z[4] = w[4] a[3] + b[4], a[4] = g[4](z[4]) = ŷ
▪ More generically: z[l] = w[l] a[l-1] + b[l], a[l] = g[l](z[l])
▪ Vectorizing over all m examples, with Z[1] = [z[1](1), z[1](2), …, z[1](m)]:
▪ Z[1] = W[1] A[0] + b[1], A[1] = g[1](Z[1])
▪ Z[2] = W[2] A[1] + b[2], A[2] = g[2](Z[2])
▪ ……
▪ Ŷ = A[4] = g[4](Z[4])
▪ General rule: Z[l] = W[l] A[l-1] + b[l], A[l] = g[l](Z[l])
▪ An explicit for loop is still required to iterate over the layers l = 1, …, L; there is no way to vectorize across layers.
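A minimal NumPy sketch of this vectorized forward pass with the explicit loop over layers; the parameter-dictionary layout and the choice of ReLU for the hidden layers and sigmoid for the output are assumptions made for illustration, not part of the slide.

import numpy as np

def relu(Z):
    return np.maximum(0, Z)

def sigmoid(Z):
    return 1 / (1 + np.exp(-Z))

def forward_pass(X, params, L):
    # X: (n[0], m); params["W1"]: (n[1], n[0]), params["b1"]: (n[1], 1), ...
    A = X                                       # A[0] = X
    caches = []
    for l in range(1, L + 1):                   # explicit loop over the layers
        W, b = params["W" + str(l)], params["b" + str(l)]
        Z = W @ A + b                           # Z[l] = W[l] A[l-1] + b[l] (b broadcasts to (n[l], m))
        A = sigmoid(Z) if l == L else relu(Z)   # A[l] = g[l](Z[l])
        caches.append((Z, A))                   # cache Z[l] for backpropagation
    return A, caches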

Parameters W[l] and b[l] Dimensions


(Example network: n[0] = nx = 2, n[1] = 3, n[2] = 5, n[3] = 4, n[4] = 2, n[5] = n[L] = 1.)

▪ For z[1] = w[1] x + b[1]:
▪ z[1] is (3,1), i.e. (n[1],1), and x is (2,1), i.e. (n[0],1)
▪ so w[1] must be (3,2), i.e. (n[1], n[0])
▪ Similarly:
▪ w[2] = (5,3) or (n[2], n[1])
▪ w[3] = (4,5) or (n[3], n[2])
▪ w[4] = (2,4) or (n[4], n[3])
▪ In general w[l] = (n[l], n[l-1]), and likewise dw[l] = (n[l], n[l-1])
▪ The b vector of every layer has the same dimension as z, since it is added to w[l] a[l-1] and must match that shape.
▪ So b[1] = (n[1],1), b[2] = (n[2],1), b[3] = (n[3],1), …, b[l] = (n[l],1), and db[l] = (n[l],1)
▪ a[l] has the same dimension as z[l]: (n[l],1)
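A small sketch of initializing parameters with exactly these shapes from a list of layer sizes; the layer_dims list and the 0.01 scaling factor are illustrative assumptions.

import numpy as np

def init_params(layer_dims):
    # layer_dims = [n[0], n[1], ..., n[L]]
    params = {}
    for l in range(1, len(layer_dims)):
        params["W" + str(l)] = np.random.randn(layer_dims[l], layer_dims[l - 1]) * 0.01  # (n[l], n[l-1])
        params["b" + str(l)] = np.zeros((layer_dims[l], 1))                              # (n[l], 1)
    return params

params = init_params([2, 3, 5, 4, 2, 1])   # the example network above
assert params["W3"].shape == (4, 5) and params["b3"].shape == (4, 1)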


Vectorized Implementation (Multiple Examples)


▪ Z[1] = W[1] X + b[1]
▪ Where Z[1] = [z[1](1), z[1](2), z[1](3), …, z[1](m)]
▪ So Z[1] = (n[1], m)
▪ Similarly X = (n[0], m) and W[1] = (n[1], n[0])
▪ b[1] is still (n[1],1), but when added to the product of W and X it is broadcast to (n[1], m)
▪ For a single example: z[l], a[l] = (n[l], 1)
▪ Vectorized over m examples: Z[l], A[l] = (n[l], m)
▪ Special case: l = 0, A[0] = X = (n[0], m)
▪ Also: dZ[l], dA[l] = (n[l], m)
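A quick NumPy check of the broadcasting behaviour described above; the sizes n0 = 2, n1 = 3 and m = 4 are arbitrary illustrative values.

import numpy as np

n0, n1, m = 2, 3, 4
W1 = np.random.randn(n1, n0)     # (n[1], n[0])
X = np.random.randn(n0, m)       # (n[0], m)
b1 = np.zeros((n1, 1))           # (n[1], 1)
Z1 = W1 @ X + b1                 # b1 is broadcast across the m columns
assert Z1.shape == (n1, m)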


Why Deep Networks

(Figure: deep networks build up a hierarchy of features, e.g. for audio: waveforms → frequency features → phonemes such as C, A, T → words → sentences.)

Building Blocks of DNN

(Figure: the forward and backward blocks of layer l, with parameters w[l], b[l]; the cached z[l] is passed from the forward block to the backward block.)

For layer l, with parameters w[l], b[l]:
▪ Forward: Input a[l-1], Output a[l]
▪ z[l] = w[l] a[l-1] + b[l] (cache z[l]), a[l] = g[l](z[l])
▪ Backward: Input da[l] and the cached z[l]; Output da[l-1], dw[l], db[l]


Forward and Backward Functions


(Figure: the per-layer blocks chained together. Forward pass: a[0] → [layer 1: w[1], b[1]] → a[1] → [layer 2: w[2], b[2]] → a[2] → … → a[l-1] → [layer l: w[l], b[l]] → a[l] = ŷ, caching z[1], z[2], …, z[l]. Backward pass: da[l] → … → da[2] → da[1], with each backward block also producing dz[l], dw[l], db[l].)


Forward Propagation of Layer l


▪ Input: a[l-1]
▪ Output: a[l], Cache z[l]
▪ z [l] = w [l] a [l-1] + b [l]
▪ a[l] = g[l](z[l])
▪ Vectorized:
▪ Z [l] = W [l] A [l-1] + b [l]
▪ A[l] = g[l](Z[l])

Backward Propagation
▪ Input: da[l]
▪ Output: da[l-1], dW [l], db[l]
▪ Per example:
▪ dz[l] = da[l] * g[l]'(z[l])   (equivalently dz[l] = w[l+1]T dz[l+1] * g[l]'(z[l]))
▪ dW[l] = dz[l] . a[l-1]T
▪ db[l] = dz[l]
▪ da[l-1] = w[l]T . dz[l]
▪ Vectorized:
▪ dZ[l] = dA[l] * g[l]'(Z[l])
▪ dW[l] = 1/m (dZ[l] . A[l-1]T)
▪ db[l] = 1/m np.sum(dZ[l], axis = 1, keepdims = True)
▪ dA[l-1] = W[l]T . dZ[l]
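A minimal sketch of one vectorized backward step for a ReLU layer, assuming the forward step cached Z[l] and A[l-1]; the function and variable names are illustrative, not from the slides.

import numpy as np

def relu_backward_step(dA_l, Z_l, A_prev, W_l):
    m = A_prev.shape[1]
    dZ_l = dA_l * (Z_l > 0)                            # dZ[l] = dA[l] * g[l]'(Z[l]) for ReLU
    dW_l = (dZ_l @ A_prev.T) / m                       # dW[l] = 1/m (dZ[l] . A[l-1]^T)
    db_l = np.sum(dZ_l, axis=1, keepdims=True) / m     # db[l] = 1/m sum over the examples
    dA_prev = W_l.T @ dZ_l                             # dA[l-1] = W[l]^T . dZ[l]
    return dA_prev, dW_l, db_l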


Summary

(Figure: a 3-layer network X → ReLU → ReLU → Sigmoid. The forward pass computes and caches z[1], z[2], z[3]; the backward pass starts from da[3] = -(y/a) + (1-y)/(1-a) and produces dw[3], db[3], da[2], dw[2], db[2], da[1], dw[1], db[1].)


Parameters and Hyperparameters


▪ Parameters: W [1], b [1], W [2], b [2], W [3], b [3], ……
▪ Hyperparameters:
▪Learning Rate α
▪Number of iterations
▪Number of hidden layers L
▪Number of hidden units n[1], n[2], …
▪Activation Functions


Empirical Learning
Applied deep learning is an empirical process.


The Iterative Process


Design choices iterated over in this process include:
▪ # hidden layers
▪ # hidden units
▪ Learning rates
▪ Activation functions
▪ …..


Dataset Distribution (Train/Dev/Test Sets)


Mismatched Train/Test Distribution


▪ Training set:
▪ Cat pictures from web pages
▪ Dev/test set:
▪ Cat pictures from users using your app.
▪ Need to ensure that the dev and test sets come from the same distribution
▪ Not having a test set is completely OK (dev set only)


Bias and Variance


Bias and Variance

(Figure: cat classification, y = 1 for cat, y = 0 for non-cat.)

▪ Train set error 1%, dev set error 11% → high variance
▪ Train set error 15%, dev set error 16% → high bias
▪ Train set error 15%, dev set error 30% → high bias and high variance
▪ Train set error 0.5%, dev set error 1% → low bias and low variance

(Assuming human error ≈ 0% and optimal (Bayes) error ≈ 0%.)


High Bias and High Variance


Basic Recipe for Machine Learning


▪ High bias? (judge by training data performance)
▪ Yes → try a bigger network, train longer, (NN architecture search); repeat until the bias is acceptable.
▪ No → check the variance.
▪ High variance? (judge by dev set performance)
▪ Yes → get more data, add regularization, (NN architecture search); then re-check bias and variance.
▪ No → done.


Regularization


Logistic Regression
min J(w, b) over w ∈ R^nx, b ∈ R

J(w, b) = (1/m) Σ_{i=1..m} L(ŷ(i), y(i)) + (λ/2m) ||w||₂²

L2 regularization: ||w||₂² = Σ_{j=1..nx} wj² = wT w

L1 regularization: add (λ/m) Σ_{j=1..nx} |wj| = (λ/m) ||w||₁ instead


Neural Network

J(W[1], b[1], …, W[L], b[L]) = (1/m) Σ_{i=1..m} L(ŷ(i), y(i)) + (λ/2m) Σ_{l=1..L} ||W[l]||²_F

||W[l]||²_F = Σ_{i=1..n[l]} Σ_{j=1..n[l-1]} (w_ij[l])²,   since W[l] has shape (n[l], n[l-1])

This sum of the squares of all the entries is called the Frobenius norm, written ||.||²_F rather than ||.||²₂.

With the penalty term, backpropagation gives dW[l] = (from backpropagation) + (λ/m) W[l],
and the update is still W[l] := W[l] - α dW[l].
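A small sketch of adding the L2 (Frobenius) penalty to the cost and to the per-layer update; cross_entropy_cost, params and grads are assumed to come from an unregularized forward/backward pass and are not defined on the slides.

import numpy as np

def l2_regularized_cost(cross_entropy_cost, params, lambd, m, L):
    frob = sum(np.sum(np.square(params["W" + str(l)])) for l in range(1, L + 1))
    return cross_entropy_cost + (lambd / (2 * m)) * frob                 # + (λ/2m) Σ ||W[l]||²_F

def update_with_l2(params, grads, lambd, m, alpha, L):
    for l in range(1, L + 1):
        dW = grads["dW" + str(l)] + (lambd / m) * params["W" + str(l)]   # add (λ/m) W[l]
        params["W" + str(l)] -= alpha * dW
        params["b" + str(l)] -= alpha * grads["db" + str(l)]             # b is usually left unregularized
    return params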


Weight Decay

W[l] := W[l] - α dW[l]

W[l] := W[l] - α [(from backprop) + (λ/m) W[l]]
      = W[l] - (αλ/m) W[l] - α (from backprop)
      = (1 - αλ/m) W[l] - α (from backprop)

So L2 regularization multiplies W[l] by a factor (1 - αλ/m) slightly less than 1 on every update, which is why it is also called weight decay.


How does regularization prevent overfitting?

J(W[1], b[1], …, W[L], b[L]) = (1/m) Σ_{i=1..m} L(ŷ(i), y(i)) + (λ/2m) Σ_{l=1..L} ||W[l]||²_F

A large λ penalizes large weights, pushing many entries of W[l] toward zero and effectively simplifying the network.


How does regularization prevent overfitting?

As λ ↑, W[l] ↓, so z[l] = W[l] a[l-1] + b[l] stays small. With a tanh activation, small z[l] falls in the roughly linear region of g, so every layer ≈ linear and the network behaves like a much simpler, nearly linear model that is less able to overfit.

J(W, b) = (1/m) Σ_{i=1..m} L(ŷ(i), y(i)) + (λ/2m) Σ_{l=1..L} ||W[l]||²_F


Dropout Regularization

(Figure: each unit in each layer is kept with probability 0.5 and dropped otherwise, giving a smaller, randomly thinned network for every training example.)


Implementing Dropout
Illustration with layer l = 3, keep_prob = 0.8:

d3 = np.random.rand(a3.shape[0], a3.shape[1]) < keep_prob   # binary mask, True with probability keep_prob
a3 = np.multiply(a3, d3)   # a3 *= d3: shut off the dropped units
a3 /= keep_prob            # inverted dropout: scale up so the expected value of a3 is unchanged

With 50 units and keep_prob = 0.8, about 10 units are shut off, so a[3] is reduced by about 20%; dividing by keep_prob compensates for this, so z[4] = w[4] a[3] + b[4] keeps the same expected value.


While Making Predictions


a [0] = X
At test time, no dropout is used (and no scaling by keep_prob is needed, thanks to the inverted-dropout scaling during training):
z [1] = w [1] a [0] + b [1]
a [1] = g [1](z [1])
z [2] = w [2] a [1] + b [2]
a [2] = g [2](z [2])
……
𝑦ො

Why does drop-out work?


Intuition: Can’t rely on any one feature, so have to spread out weights.
(Figure: keep_prob can be set per layer, e.g. 1.0 for the input and for small layers, and 0.5 or 0.7 for the large hidden layers with many parameters, which are the most likely to overfit.)


Other Regularization Methods


Data Augmentation


Early Stopping

Orthogonalization: treat the two goals separately:
▪ Optimize the cost function J (gradient descent, …)
▪ Not overfit (regularization, ….)

Early stopping couples the two: training is stopped where the dev set error is lowest. As training proceeds, ||W||²_F grows from W ≈ 0 (small random initialization) through mid-size values to large ones, so stopping early leaves W at a mid-size value, which acts like regularization.


Vanishing/Exploding Gradients

Consider a very deep network with weights w[1], w[2], w[3], …, w[L], and suppose g(z) = z and b[l] = 0. Then:

z[1] = w[1] x, a[1] = g(z[1]) = z[1]
a[2] = g(z[2]) = g(w[2] a[1])
……
ŷ = w[L] w[L-1] w[L-2] …… w[3] w[2] w[1] x

If every W[l] = [[1.5, 0], [0, 1.5]], a little larger than the identity, then ŷ = W[L] [[1.5, 0], [0, 1.5]]^(L-1) x and the activations (and gradients) explode exponentially with depth.
If every W[l] = [[0.5, 0], [0, 0.5]], a little smaller than the identity, then ŷ = W[L] [[0.5, 0], [0, 0.5]]^(L-1) x and they vanish exponentially.

In short: W[l] > I → exploding, W[l] < I → vanishing.
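A quick numerical illustration of this effect, using the linear activations and diagonal weight matrices from the example above; the depth of 50 layers is an arbitrary choice.

import numpy as np

L = 50
x = np.ones((2, 1))
for scale in (1.5, 0.5):
    W = np.array([[scale, 0.0], [0.0, scale]])   # the same weight matrix in every layer
    a = x
    for _ in range(L):
        a = W @ a                                # linear activation: a[l] = W a[l-1]
    print(scale, a.ravel())                      # ~1.5**50 explodes, ~0.5**50 vanishes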

Single Neuron Example

z = w1 x1 + w2 x2 + ⋯ + wn xn + b

The larger n is, the smaller each wi should be, so choose Var(wi) = 1/n (n = number of inputs to the neuron):

W[l] = np.random.randn(shape) * np.sqrt(1/n[l-1])

For ReLU activations, g[l](z) = ReLU(z), a variance of 2/n works better (He initialization):

W[l] = np.random.randn(shape) * np.sqrt(2/n[l-1])

Other variants: for tanh, Xavier initialization uses np.sqrt(1/n[l-1]); another variant uses np.sqrt(2/(n[l-1] + n[l])).



Improving Training Speed


Normalizing Training Sets

Subtract the mean:
μ = (1/m) Σ_{i=1..m} x(i)
x := x - μ

Normalize the variance:
σ² = (1/m) Σ_{i=1..m} (x(i))²   (element-wise squaring, computed after subtracting the mean)
x /= σ

Use the same μ and σ to normalize the test set.
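A short NumPy sketch of this normalization, with X laid out as (n_features, m) as elsewhere in these notes; the small epsilon guarding against division by zero is an added assumption.

import numpy as np

def normalize(X, eps=1e-8):
    # X: (n_features, m)
    mu = np.mean(X, axis=1, keepdims=True)                     # per-feature mean
    X = X - mu                                                 # subtract the mean
    sigma = np.sqrt(np.mean(X ** 2, axis=1, keepdims=True))    # per-feature std after centering
    X = X / (sigma + eps)                                      # normalize the variance
    return X, mu, sigma

# At test time, reuse the training-set mu and sigma:
# X_test = (X_test - mu) / (sigma + eps)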

Why normalize inputs?

If the input features have very different ranges, e.g. x1 ∈ [1, 1000] and x2 ∈ [0, 1], the corresponding weights w1 and w2 end up on very different scales and the cost surface becomes elongated, so gradient descent needs a small learning rate and many steps. With normalized inputs the cost surface is more symmetric, and gradient descent can take larger steps and converge faster.


Optimization Algorithms


Batch vs. Mini-Batch Gradient Descent


▪ Vectorization allows you to efficiently compute on m examples

▪ X (nx, m) = [x(1), x(2), x(3), …, x(1000) | x(1001), …, x(2000) | … | …, x(m)]
▪ Y (1, m) = [y(1), y(2), y(3), …, y(1000) | y(1001), …, y(2000) | … | …, y(m)]
▪ The first 1,000 columns form X{1}, Y{1}, the next 1,000 form X{2}, Y{2}, and so on.
▪ However, for big data analytics m can be huge, e.g. m = 5,000,000, so a single gradient step over the full batch is slow.
▪ So we form mini-batches of 1,000 examples each, giving 5,000 mini-batches; mini-batch t is the pair (X{t}, Y{t}).
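A minimal sketch of splitting (X, Y) into mini-batches; the shuffling step is a common addition that is not shown on the slide.

import numpy as np

def make_mini_batches(X, Y, batch_size=1000, seed=0):
    # X: (n_x, m), Y: (1, m)
    m = X.shape[1]
    rng = np.random.default_rng(seed)
    perm = rng.permutation(m)                 # shuffle the examples
    X, Y = X[:, perm], Y[:, perm]
    batches = []
    for t in range(0, m, batch_size):
        batches.append((X[:, t:t + batch_size], Y[:, t:t + batch_size]))   # (X{t}, Y{t})
    return batches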

Mini-batch Gradient Descent


Repeat {
  for t = 1, …, 5000 {
    Forward prop on X{t} (vectorized over the 1,000 examples in the mini-batch):
      Z[1] = W[1] X{t} + b[1]
      A[1] = g[1](Z[1])
      ……
      A[L] = g[L](Z[L])
    Compute the cost J{t} = (1/1000) Σ_{i=1..1000} L(ŷ(i), y(i)) + (λ/(2·1000)) Σ_{l=1..L} ||W[l]||²_F
    Backprop to compute gradients w.r.t. J{t} (using (X{t}, Y{t}))
    W[l] := W[l] - α dW[l],  b[l] := b[l] - α db[l]
  }   ← one pass through the training set = 1 epoch
}


Mini-batch Gradient Descent



Choosing your Batch Size


▪ If you have a small training set (m ≤ 2000), just use batch gradient descent.
▪ Otherwise, typical mini-batch sizes are powers of two: 64, 128, 256, 512, …..
▪ Make sure each mini-batch (X{t}, Y{t}) fits in CPU/GPU memory.


Exponential Weighted (Moving) Averages

V0 = 0
V1 = 0.9 V0 + 0.1 θ1
V2 = 0.9 V1 + 0.1 θ2
V3 = 0.9 V2 + 0.1 θ3
.
.
.
Vt = 0.9 Vt-1 + 0.1 θt

Exponential Weighted (Moving) Average


▪ Vt = β Vt-1 + (1-β) θt
▪ Vt is approximately an average over the last ≈ 1/(1-β) days' temperature
▪ β = 0.9 → roughly the last 10 days' temperature
▪ β = 0.98 → roughly the last 50 days' temperature
▪ β = 0.5 → roughly the last 2 days' temperature


Exponential Weight Intuition


▪ Vt = β Vt-1 + (1-β) θt
▪ V100 = β V99 + (1-β) θ100
▪ V99 = β V98 + (1-β) θ99
▪ V98 = β V97 + (1-β) θ98
▪ Replacing values
▪ V100 = 0.9 (0.9 V98 + 0.1 θ99) + 0.1 θ100
▪ V100 = 0.9 (0.9 (0.9 V97 + 0.1 θ98) + 0.1 θ99) + 0.1 θ100
▪ V100 = 0.1 θ100 + 0.1 × 0.9 θ99 + 0.1 × (0.9)² θ98 + 0.1 × (0.9)³ θ97 + 0.1 × (0.9)⁴ θ96 + …


Implementing Exponential Weighted Average


▪ V0 = 0
▪ V1 = β V0 + (1- β) θ1
▪ V2 = β V1 + (1- β) θ2
▪ V3 = β V2 + (1- β) θ3

Vθ = 0
Repeat {
Get next θt
Vθ := β V θ + (1- β) θt
}
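A runnable version of this loop; the synthetic daily-temperature series and β = 0.9 are illustrative assumptions.

import numpy as np

def ewma(thetas, beta=0.9):
    v = 0.0                                   # Vθ = 0
    values = []
    for theta in thetas:                      # get next θt
        v = beta * v + (1 - beta) * theta     # Vθ := β Vθ + (1-β) θt
        values.append(v)
    return np.array(values)

temps = 20 + 5 * np.sin(np.linspace(0, 6, 365)) + np.random.randn(365)  # fake daily temperatures
smoothed = ewma(temps, beta=0.9)              # ≈ average over the last 10 days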


Gradient Descent with Momentum

(Figure: plain gradient descent oscillates across the short axis of the cost contours (slower learning) while moving along the long axis; momentum damps the oscillations, allowing faster learning.)

Momentum, on iteration t:
  Compute dW, db on the current mini-batch
  VdW = β VdW + (1-β) dW      (the same update rule as Vθ := β Vθ + (1-β) θt)
  Vdb = β Vdb + (1-β) db
  W := W - α VdW,  b := b - α Vdb
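A small sketch of the momentum update applied per layer, assuming grads holds dW/db for the current mini-batch and the velocity dictionary is pre-initialized with zero arrays of matching shapes; the names are illustrative.

def momentum_update(params, grads, velocities, L, alpha=0.01, beta=0.9):
    for l in range(1, L + 1):
        velocities["dW" + str(l)] = beta * velocities["dW" + str(l)] + (1 - beta) * grads["dW" + str(l)]
        velocities["db" + str(l)] = beta * velocities["db" + str(l)] + (1 - beta) * grads["db" + str(l)]
        params["W" + str(l)] -= alpha * velocities["dW" + str(l)]   # W := W - α VdW
        params["b" + str(l)] -= alpha * velocities["db" + str(l)]   # b := b - α Vdb
    return params, velocities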


Implementation Details


Learning Rate Decay


Implementation of Learning Rate Decay


▪ 1 epoch = 1 pass over the data

α = α₀ / (1 + decay_rate × epoch_num)

With α₀ = 0.2 and decay_rate = 1:

Epoch   α
1       0.1
2       0.067
3       0.05
4       0.04
….      …
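A one-line sketch of this schedule that reproduces the table above; the function name is illustrative.

def decayed_lr(alpha0, decay_rate, epoch_num):
    return alpha0 / (1 + decay_rate * epoch_num)

for epoch in range(1, 5):
    print(epoch, round(decayed_lr(0.2, 1.0, epoch), 3))   # 0.1, 0.067, 0.05, 0.04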


Other Decay Methods


Local Optima in Neural Networks

(Figure: the cost surface J plotted over two parameters, w1 and w2.)


Problem with Plateaus
