Applied Neural Networks
Unit 6

Dr. Muhammad Usman Arif

Lecture Outline
▪ Deep Neural Networks
▪ Vectorized Implementation
▪ DNN Building Blocks
▪ Empirical Learning
▪ Bias and Variance
▪ Regularization to Handle Bias and Variance
▪ Vanishing and Exploding Gradients

What are Deep Neural Networks

(Figure: networks of increasing depth: logistic regression with no hidden layer, 1 hidden layer, 2 hidden layers, 5 hidden layers.)


Deep Neural Network Notation

(Figure: a 4-layer example network with n[0] = nx = 3 inputs, n[1] = 5, n[2] = 5, n[3] = 3, n[4] = n[L] = 1 output.)

▪ L = 4 = number of layers (the input layer is not counted as a layer)
▪ n[l] = number of units in layer l
▪ a[l] = activations in layer l, with a[l] = g[l](z[l])
▪ w[l] = weights used to compute z[l]
▪ b[l] = bias used to compute z[l]
▪ x = a[0], and the output ŷ = a[L]


Forward Propagation in a Deep Network


▪ For a single example x = a[0]:
▪ z[1] = w[1] x + b[1], a[1] = g[1](z[1])
▪ z[2] = w[2] a[1] + b[2], a[2] = g[2](z[2])
▪ ……
▪ z[4] = w[4] a[3] + b[4], a[4] = g[4](z[4]) = ŷ
▪ More generically: z[l] = w[l] a[l-1] + b[l], a[l] = g[l](z[l])
▪ Vectorizing over all m examples, with Z[1] = [z[1](1), z[1](2), …, z[1](m)]:
▪ Z[1] = W[1] A[0] + b[1], A[1] = g[1](Z[1])
▪ Z[2] = W[2] A[1] + b[2], A[2] = g[2](Z[2])
▪ ……
▪ Ŷ = A[4] = g[4](Z[4])
▪ General rule: Z[l] = W[l] A[l-1] + b[l], A[l] = g[l](Z[l])
▪ An explicit for loop is still required to iterate over the layers l = 1, …, L; there is no way to vectorize across layers.
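A minimal NumPy sketch of this vectorized forward pass with the explicit loop over layers; the parameter-dictionary layout and the choice of ReLU for the hidden layers and sigmoid for the output are assumptions made for illustration, not part of the slide.

import numpy as np

def relu(Z):
    return np.maximum(0, Z)

def sigmoid(Z):
    return 1 / (1 + np.exp(-Z))

def forward_pass(X, params, L):
    # X: (n[0], m); params["W1"]: (n[1], n[0]), params["b1"]: (n[1], 1), ...
    A = X                                       # A[0] = X
    caches = []
    for l in range(1, L + 1):                   # explicit loop over the layers
        W, b = params["W" + str(l)], params["b" + str(l)]
        Z = W @ A + b                           # Z[l] = W[l] A[l-1] + b[l] (b broadcasts to (n[l], m))
        A = sigmoid(Z) if l == L else relu(Z)   # A[l] = g[l](Z[l])
        caches.append((Z, A))                   # cache Z[l] for backpropagation
    return A, caches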

Parameters W[l] and b[l] Dimensions


(Example network: n[0] = nx = 2, n[1] = 3, n[2] = 5, n[3] = 4, n[4] = 2, n[5] = n[L] = 1.)

▪ For z[1] = w[1] x + b[1]:
▪ z[1] is (3,1), i.e. (n[1],1), and x is (2,1), i.e. (n[0],1)
▪ so w[1] must be (3,2), i.e. (n[1], n[0])
▪ Similarly:
▪ w[2] = (5,3) or (n[2], n[1])
▪ w[3] = (4,5) or (n[3], n[2])
▪ w[4] = (2,4) or (n[4], n[3])
▪ In general w[l] = (n[l], n[l-1]), and likewise dw[l] = (n[l], n[l-1])
▪ The b vector of every layer has the same dimension as z, since it is added to w[l] a[l-1] and must match that shape.
▪ So b[1] = (n[1],1), b[2] = (n[2],1), b[3] = (n[3],1), …, b[l] = (n[l],1), and db[l] = (n[l],1)
▪ a[l] has the same dimension as z[l]: (n[l],1)
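A small sketch of initializing parameters with exactly these shapes from a list of layer sizes; the layer_dims list and the 0.01 scaling factor are illustrative assumptions.

import numpy as np

def init_params(layer_dims):
    # layer_dims = [n[0], n[1], ..., n[L]]
    params = {}
    for l in range(1, len(layer_dims)):
        params["W" + str(l)] = np.random.randn(layer_dims[l], layer_dims[l - 1]) * 0.01  # (n[l], n[l-1])
        params["b" + str(l)] = np.zeros((layer_dims[l], 1))                              # (n[l], 1)
    return params

params = init_params([2, 3, 5, 4, 2, 1])   # the example network above
assert params["W3"].shape == (4, 5) and params["b3"].shape == (4, 1)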


Vectorized Implementation (Multiple Examples)


▪ Z[1] = W[1] X + b[1]
▪ Where Z[1] = [z[1](1), z[1](2), z[1](3), …, z[1](m)]
▪ So Z[1] = (n[1], m)
▪ Similarly X = (n[0], m) and W[1] = (n[1], n[0])
▪ b[1] is still (n[1],1), but when added to the product of W and X it is broadcast to (n[1], m)
▪ For a single example: z[l], a[l] = (n[l], 1)
▪ Vectorized over m examples: Z[l], A[l] = (n[l], m)
▪ Special case: l = 0, A[0] = X = (n[0], m)
▪ Also: dZ[l], dA[l] = (n[l], m)
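A quick NumPy check of the broadcasting behaviour described above; the sizes n0 = 2, n1 = 3 and m = 4 are arbitrary illustrative values.

import numpy as np

n0, n1, m = 2, 3, 4
W1 = np.random.randn(n1, n0)     # (n[1], n[0])
X = np.random.randn(n0, m)       # (n[0], m)
b1 = np.zeros((n1, 1))           # (n[1], 1)
Z1 = W1 @ X + b1                 # b1 is broadcast across the m columns
assert Z1.shape == (n1, m)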


Why Deep Networks

(Figure: deep networks build up a hierarchy of features, e.g. for audio: waveforms → frequency features → phonemes such as C, A, T → words → sentences.)

Building Blocks of DNN

(Figure: the forward and backward blocks of layer l, with parameters w[l], b[l]; the cached z[l] is passed from the forward block to the backward block.)

For layer l, with parameters w[l], b[l]:
▪ Forward: Input a[l-1], Output a[l]
▪ z[l] = w[l] a[l-1] + b[l] (cache z[l]), a[l] = g[l](z[l])
▪ Backward: Input da[l] and the cached z[l]; Output da[l-1], dw[l], db[l]


Forward and Backward Functions


(Figure: the per-layer blocks chained together. Forward pass: a[0] → [layer 1: w[1], b[1]] → a[1] → [layer 2: w[2], b[2]] → a[2] → … → a[l-1] → [layer l: w[l], b[l]] → a[l] = ŷ, caching z[1], z[2], …, z[l]. Backward pass: da[l] → … → da[2] → da[1], with each backward block also producing dz[l], dw[l], db[l].)


Forward Propagation of Layer l


▪ Input: a[l-1]
▪ Output: a[l], Cache z[l]
▪ z [l] = w [l] a [l-1] + b [l]
▪ a[l] = g[l](z[l])
▪ Vectorized:
▪ Z [l] = W [l] A [l-1] + b [l]
▪ A[l] = g[l](Z[l])

Backward Propagation
▪ Input: da[l]
▪ Output: da[l-1], dW [l], db[l]
▪ Per example:
▪ dz[l] = da[l] * g[l]'(z[l])   (equivalently dz[l] = w[l+1]T dz[l+1] * g[l]'(z[l]))
▪ dW[l] = dz[l] . a[l-1]T
▪ db[l] = dz[l]
▪ da[l-1] = w[l]T . dz[l]
▪ Vectorized:
▪ dZ[l] = dA[l] * g[l]'(Z[l])
▪ dW[l] = 1/m (dZ[l] . A[l-1]T)
▪ db[l] = 1/m np.sum(dZ[l], axis = 1, keepdims = True)
▪ dA[l-1] = W[l]T . dZ[l]
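A minimal sketch of one vectorized backward step for a ReLU layer, assuming the forward step cached Z[l] and A[l-1]; the function and variable names are illustrative, not from the slides.

import numpy as np

def relu_backward_step(dA_l, Z_l, A_prev, W_l):
    m = A_prev.shape[1]
    dZ_l = dA_l * (Z_l > 0)                            # dZ[l] = dA[l] * g[l]'(Z[l]) for ReLU
    dW_l = (dZ_l @ A_prev.T) / m                       # dW[l] = 1/m (dZ[l] . A[l-1]^T)
    db_l = np.sum(dZ_l, axis=1, keepdims=True) / m     # db[l] = 1/m sum over the examples
    dA_prev = W_l.T @ dZ_l                             # dA[l-1] = W[l]^T . dZ[l]
    return dA_prev, dW_l, db_l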


Summary

(Figure: a 3-layer network X → ReLU → ReLU → Sigmoid. The forward pass computes and caches z[1], z[2], z[3]; the backward pass starts from da[3] = -(y/a) + (1-y)/(1-a) and produces dw[3], db[3], da[2], dw[2], db[2], da[1], dw[1], db[1].)


Parameters and Hyperparameters


▪ Parameters: W [1], b [1], W [2], b [2], W [3], b [3], ……
▪ Hyperparameters:
▪Learning Rate α
▪Number of iterations
▪Number of hidden layers L
▪Number of hidden units n[1], n[2], …
▪Activation Functions


Empirical Learning
Applied deep learning is an empirical process.


The Iterative Process


Design choices iterated over in this process include:
▪ # hidden layers
▪ # hidden units
▪ Learning rates
▪ Activation functions
▪ …..


Dataset Distribution (Train/Dev/Test Sets)


Mismatched Train/Test Distribution


▪ Training set:
▪ Cat pictures from web pages
▪ Dev/test set:
▪ Cat pictures from users using your app.
▪ Need to ensure that the dev and test sets come from the same distribution
▪ Not having a test set is completely OK (dev set only)


Bias and Variance


Bias and Variance

(Figure: cat classification, y = 1 for cat, y = 0 for non-cat.)

▪ Train set error 1%, dev set error 11% → high variance
▪ Train set error 15%, dev set error 16% → high bias
▪ Train set error 15%, dev set error 30% → high bias and high variance
▪ Train set error 0.5%, dev set error 1% → low bias and low variance

(Assuming human error ≈ 0% and optimal (Bayes) error ≈ 0%.)


High Bias and High Variance


Basic Recipe for Machine Learning


▪ High bias? (judge by training data performance)
▪ Yes → try a bigger network, train longer, (NN architecture search); repeat until the bias is acceptable.
▪ No → check the variance.
▪ High variance? (judge by dev set performance)
▪ Yes → get more data, add regularization, (NN architecture search); then re-check bias and variance.
▪ No → done.


Regularization


Logistic Regression
min J(w, b) over w ∈ R^nx, b ∈ R

J(w, b) = (1/m) Σ_{i=1..m} L(ŷ(i), y(i)) + (λ/2m) ||w||₂²

L2 regularization: ||w||₂² = Σ_{j=1..nx} wj² = wT w

L1 regularization: add (λ/m) Σ_{j=1..nx} |wj| = (λ/m) ||w||₁ instead


Neural Network

J(W[1], b[1], …, W[L], b[L]) = (1/m) Σ_{i=1..m} L(ŷ(i), y(i)) + (λ/2m) Σ_{l=1..L} ||W[l]||²_F

||W[l]||²_F = Σ_{i=1..n[l]} Σ_{j=1..n[l-1]} (w_ij[l])²,   since W[l] has shape (n[l], n[l-1])

This sum of the squares of all the entries is called the Frobenius norm, written ||.||²_F rather than ||.||²₂.

With the penalty term, backpropagation gives dW[l] = (from backpropagation) + (λ/m) W[l],
and the update is still W[l] := W[l] - α dW[l].
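A small sketch of adding the L2 (Frobenius) penalty to the cost and to the per-layer update; cross_entropy_cost, params and grads are assumed to come from an unregularized forward/backward pass and are not defined on the slides.

import numpy as np

def l2_regularized_cost(cross_entropy_cost, params, lambd, m, L):
    frob = sum(np.sum(np.square(params["W" + str(l)])) for l in range(1, L + 1))
    return cross_entropy_cost + (lambd / (2 * m)) * frob                 # + (λ/2m) Σ ||W[l]||²_F

def update_with_l2(params, grads, lambd, m, alpha, L):
    for l in range(1, L + 1):
        dW = grads["dW" + str(l)] + (lambd / m) * params["W" + str(l)]   # add (λ/m) W[l]
        params["W" + str(l)] -= alpha * dW
        params["b" + str(l)] -= alpha * grads["db" + str(l)]             # b is usually left unregularized
    return params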


Weight Decay

W[l] := W[l] - α dW[l]

W[l] := W[l] - α [(from backprop) + (λ/m) W[l]]
      = W[l] - (αλ/m) W[l] - α (from backprop)
      = (1 - αλ/m) W[l] - α (from backprop)

So L2 regularization multiplies W[l] by a factor (1 - αλ/m) slightly less than 1 on every update, which is why it is also called weight decay.


How does regularization prevent overfitting?

J(W[1], b[1], …, W[L], b[L]) = (1/m) Σ_{i=1..m} L(ŷ(i), y(i)) + (λ/2m) Σ_{l=1..L} ||W[l]||²_F

A large λ penalizes large weights, pushing many entries of W[l] toward zero and effectively simplifying the network.


How does regularization prevent overfitting?

As λ ↑, W[l] ↓, so z[l] = W[l] a[l-1] + b[l] stays small. With a tanh activation, small z[l] falls in the roughly linear region of g, so every layer ≈ linear and the network behaves like a much simpler, nearly linear model that is less able to overfit.

J(W, b) = (1/m) Σ_{i=1..m} L(ŷ(i), y(i)) + (λ/2m) Σ_{l=1..L} ||W[l]||²_F


Dropout Regularization

(Figure: each unit in each layer is kept with probability 0.5 and dropped otherwise, giving a smaller, randomly thinned network for every training example.)


Implementing Dropout
Illustration with layer l = 3, keep_prob = 0.8:

d3 = np.random.rand(a3.shape[0], a3.shape[1]) < keep_prob   # binary mask, True with probability keep_prob
a3 = np.multiply(a3, d3)   # a3 *= d3: shut off the dropped units
a3 /= keep_prob            # inverted dropout: scale up so the expected value of a3 is unchanged

With 50 units and keep_prob = 0.8, about 10 units are shut off, so a[3] is reduced by about 20%; dividing by keep_prob compensates for this, so z[4] = w[4] a[3] + b[4] keeps the same expected value.


While Making Predictions


a [0] = X
At test time, no dropout is used (and no scaling by keep_prob is needed, thanks to the inverted-dropout scaling during training):
z [1] = w [1] a [0] + b [1]
a [1] = g [1](z [1])
z [2] = w [2] a [1] + b [2]
a [2] = g [2](z [2])
……
𝑦ො

Why does drop-out work?


Intuition: Can’t rely on any one feature, so have to spread out weights.
(Figure: keep_prob can be set per layer, e.g. 1.0 for the input and for small layers, and 0.5 or 0.7 for the large hidden layers with many parameters, which are the most likely to overfit.)


Other Regularization Methods


Data Augmentation


Early Stopping

Orthogonalization: treat the two goals separately:
▪ Optimize the cost function J (gradient descent, …)
▪ Not overfit (regularization, ….)

Early stopping couples the two: training is stopped where the dev set error is lowest. As training proceeds, ||W||²_F grows from W ≈ 0 (small random initialization) through mid-size values to large ones, so stopping early leaves W at a mid-size value, which acts like regularization.


Vanishing/Exploding Gradients

Consider a very deep network with weights w[1], w[2], w[3], …, w[L], and suppose g(z) = z and b[l] = 0. Then:

z[1] = w[1] x, a[1] = g(z[1]) = z[1]
a[2] = g(z[2]) = g(w[2] a[1])
……
ŷ = w[L] w[L-1] w[L-2] …… w[3] w[2] w[1] x

If every W[l] = [[1.5, 0], [0, 1.5]], a little larger than the identity, then ŷ = W[L] [[1.5, 0], [0, 1.5]]^(L-1) x and the activations (and gradients) explode exponentially with depth.
If every W[l] = [[0.5, 0], [0, 0.5]], a little smaller than the identity, then ŷ = W[L] [[0.5, 0], [0, 0.5]]^(L-1) x and they vanish exponentially.

In short: W[l] > I → exploding, W[l] < I → vanishing.
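A quick numerical illustration of this effect, using the linear activations and diagonal weight matrices from the example above; the depth of 50 layers is an arbitrary choice.

import numpy as np

L = 50
x = np.ones((2, 1))
for scale in (1.5, 0.5):
    W = np.array([[scale, 0.0], [0.0, scale]])   # the same weight matrix in every layer
    a = x
    for _ in range(L):
        a = W @ a                                # linear activation: a[l] = W a[l-1]
    print(scale, a.ravel())                      # ~1.5**50 explodes, ~0.5**50 vanishes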

Single Neuron Example

z = w1 x1 + w2 x2 + ⋯ + wn xn + b

The larger n is, the smaller each wi should be, so choose Var(wi) = 1/n (n = number of inputs to the neuron):

W[l] = np.random.randn(shape) * np.sqrt(1/n[l-1])

For ReLU activations, g[l](z) = ReLU(z), a variance of 2/n works better (He initialization):

W[l] = np.random.randn(shape) * np.sqrt(2/n[l-1])

Other variants: for tanh, Xavier initialization uses np.sqrt(1/n[l-1]); another variant uses np.sqrt(2/(n[l-1] + n[l])).



Improving Training Speed


Normalizing Training Sets

Subtract the mean:
μ = (1/m) Σ_{i=1..m} x(i)
x := x - μ

Normalize the variance:
σ² = (1/m) Σ_{i=1..m} (x(i))²   (element-wise squaring, computed after subtracting the mean)
x /= σ

Use the same μ and σ to normalize the test set.
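A short NumPy sketch of this normalization, with X laid out as (n_features, m) as elsewhere in these notes; the small epsilon guarding against division by zero is an added assumption.

import numpy as np

def normalize(X, eps=1e-8):
    # X: (n_features, m)
    mu = np.mean(X, axis=1, keepdims=True)                     # per-feature mean
    X = X - mu                                                 # subtract the mean
    sigma = np.sqrt(np.mean(X ** 2, axis=1, keepdims=True))    # per-feature std after centering
    X = X / (sigma + eps)                                      # normalize the variance
    return X, mu, sigma

# At test time, reuse the training-set mu and sigma:
# X_test = (X_test - mu) / (sigma + eps)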

Why normalize inputs?

If the input features have very different ranges, e.g. x1 ∈ [1, 1000] and x2 ∈ [0, 1], the corresponding weights w1 and w2 end up on very different scales and the cost surface becomes elongated, so gradient descent needs a small learning rate and many steps. With normalized inputs the cost surface is more symmetric, and gradient descent can take larger steps and converge faster.


Optimization Algorithms


Batch vs. Mini-Batch Gradient Descent


▪ Vectorization allows you to efficiently compute on m examples

▪ X (nx, m) = [x(1), x(2), x(3), …, x(1000) | x(1001), …, x(2000) | … | …, x(m)]
▪ Y (1, m) = [y(1), y(2), y(3), …, y(1000) | y(1001), …, y(2000) | … | …, y(m)]
▪ The first 1,000 columns form X{1}, Y{1}, the next 1,000 form X{2}, Y{2}, and so on.
▪ However, for big data analytics m can be huge, e.g. m = 5,000,000, so a single gradient step over the full batch is slow.
▪ So we form mini-batches of 1,000 examples each, giving 5,000 mini-batches; mini-batch t is the pair (X{t}, Y{t}).
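A minimal sketch of splitting (X, Y) into mini-batches; the shuffling step is a common addition that is not shown on the slide.

import numpy as np

def make_mini_batches(X, Y, batch_size=1000, seed=0):
    # X: (n_x, m), Y: (1, m)
    m = X.shape[1]
    rng = np.random.default_rng(seed)
    perm = rng.permutation(m)                 # shuffle the examples
    X, Y = X[:, perm], Y[:, perm]
    batches = []
    for t in range(0, m, batch_size):
        batches.append((X[:, t:t + batch_size], Y[:, t:t + batch_size]))   # (X{t}, Y{t})
    return batches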

Mini-batch Gradient Descent


Repeat {
  for t = 1, …, 5000 {
    Forward prop on X{t} (vectorized over the 1,000 examples in the mini-batch):
      Z[1] = W[1] X{t} + b[1]
      A[1] = g[1](Z[1])
      ……
      A[L] = g[L](Z[L])
    Compute the cost J{t} = (1/1000) Σ_{i=1..1000} L(ŷ(i), y(i)) + (λ/(2·1000)) Σ_{l=1..L} ||W[l]||²_F
    Backprop to compute gradients w.r.t. J{t} (using (X{t}, Y{t}))
    W[l] := W[l] - α dW[l],  b[l] := b[l] - α db[l]
  }   ← one pass through the training set = 1 epoch
}


Mini-batch Gradient Descent



Choosing your Batch Size


▪ If you have a small training set (m ≤ 2000), just use batch gradient descent.
▪ Otherwise, typical mini-batch sizes are powers of two: 64, 128, 256, 512, …..
▪ Make sure each mini-batch (X{t}, Y{t}) fits in CPU/GPU memory.


Exponential Weighted (Moving) Averages

V0 = 0
V1 = 0.9 V0 + 0.1 θ1
V2 = 0.9 V1 + 0.1 θ2
V3 = 0.9 V2 + 0.1 θ3
.
.
.
Vt = 0.9 Vt-1 + 0.1 θt

Exponential Weighted (Moving) Average


▪ Vt = β Vt-1 + (1-β) θt
▪ Vt is approximately an average over the last ≈ 1/(1-β) days' temperature
▪ β = 0.9 → roughly the last 10 days' temperature
▪ β = 0.98 → roughly the last 50 days' temperature
▪ β = 0.5 → roughly the last 2 days' temperature


Exponential Weight Intuition


▪ Vt = β Vt-1 + (1-β) θt
▪ V100 = β V99 + (1-β) θ100
▪ V99 = β V98 + (1-β) θ99
▪ V98 = β V97 + (1-β) θ98
▪ Replacing values
▪ V100 = 0.9 (0.9 V98 + 0.1 θ99) + 0.1 θ100
▪ V100 = 0.9 (0.9 (0.9 V97 + 0.1 θ98) + 0.1 θ99) + 0.1 θ100
▪ V100 = 0.1 θ100 + 0.1 × 0.9 θ99 + 0.1 × (0.9)² θ98 + 0.1 × (0.9)³ θ97 + 0.1 × (0.9)⁴ θ96 + …


Implementing Exponential Weighted Average


▪ V0 = 0
▪ V1 = β V0 + (1- β) θ1
▪ V2 = β V1 + (1- β) θ2
▪ V3 = β V2 + (1- β) θ3

Vθ = 0
Repeat {
Get next θt
Vθ := β V θ + (1- β) θt
}
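A runnable version of this loop; the synthetic daily-temperature series and β = 0.9 are illustrative assumptions.

import numpy as np

def ewma(thetas, beta=0.9):
    v = 0.0                                   # Vθ = 0
    values = []
    for theta in thetas:                      # get next θt
        v = beta * v + (1 - beta) * theta     # Vθ := β Vθ + (1-β) θt
        values.append(v)
    return np.array(values)

temps = 20 + 5 * np.sin(np.linspace(0, 6, 365)) + np.random.randn(365)  # fake daily temperatures
smoothed = ewma(temps, beta=0.9)              # ≈ average over the last 10 days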


Gradient Descent with Momentum

(Figure: plain gradient descent oscillates across the short axis of the cost contours (slower learning) while moving along the long axis; momentum damps the oscillations, allowing faster learning.)

Momentum, on iteration t:
  Compute dW, db on the current mini-batch
  VdW = β VdW + (1-β) dW      (the same update rule as Vθ := β Vθ + (1-β) θt)
  Vdb = β Vdb + (1-β) db
  W := W - α VdW,  b := b - α Vdb
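A small sketch of the momentum update applied per layer, assuming grads holds dW/db for the current mini-batch and the velocity dictionary is pre-initialized with zero arrays of matching shapes; the names are illustrative.

def momentum_update(params, grads, velocities, L, alpha=0.01, beta=0.9):
    for l in range(1, L + 1):
        velocities["dW" + str(l)] = beta * velocities["dW" + str(l)] + (1 - beta) * grads["dW" + str(l)]
        velocities["db" + str(l)] = beta * velocities["db" + str(l)] + (1 - beta) * grads["db" + str(l)]
        params["W" + str(l)] -= alpha * velocities["dW" + str(l)]   # W := W - α VdW
        params["b" + str(l)] -= alpha * velocities["db" + str(l)]   # b := b - α Vdb
    return params, velocities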


Implementation Details


Learning Rate Decay


Implementation of Learning Rate Decay


▪ 1 epoch = 1 pass over the data

α = α₀ / (1 + decay_rate × epoch_num)

With α₀ = 0.2 and decay_rate = 1:

Epoch   α
1       0.1
2       0.067
3       0.05
4       0.04
….      …
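A one-line sketch of this schedule that reproduces the table above; the function name is illustrative.

def decayed_lr(alpha0, decay_rate, epoch_num):
    return alpha0 / (1 + decay_rate * epoch_num)

for epoch in range(1, 5):
    print(epoch, round(decayed_lr(0.2, 1.0, epoch), 3))   # 0.1, 0.067, 0.05, 0.04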


Other Decay Methods


Local Optima in Neural Networks

(Figure: the cost surface J plotted over two parameters, w1 and w2.)


Problem with Plateaus
