
Deep Neural Networks

Arles Rodríguez
aerodriguezp@unal.edu.co

Facultad de Ciencias
Departamento de Matemáticas
Universidad Nacional de Colombia
Deep Neural Networks

[Figure: networks of increasing depth, from inputs x1, x2, x3 to the output ŷ = a: 1-layer NN, 1-hidden-layer NN, 2-hidden-layer NN, 3-hidden-layer NN]


Deep learning networks
• How deep a neural network is depends on its number of hidden layers.
• There are problems that shallow networks cannot solve but deep neural networks can.
• It may be hard to predict in advance how deep a network you will need.
• It is reasonable to start with a logistic regression model and increase the number of layers from there.
Notation
• This is a 4-layer network (L = 4) with 3 hidden layers.
  [Figure: Layer 0 (input) -> Layer 1 -> Layer 2 -> Layer 3 -> Layer 4 (output)]
• n^[l] = number of units in layer l; in the figure n^[1] = 5, n^[2] = 5, n^[3] = 3, n^[4] = 1.
• a^[l] = activations in layer l: a^[l] = g^[l](z^[l]), with a^[0] = x and a^[L] = ŷ.
• W^[l], b^[l] = weights and bias in layer l.
Forward propagation in deep network

Given a single training sample x:

Layer 1: z^[1] = W^[1] x + b^[1]
Layer 2: z^[2] = W^[2] a^[1] + b^[2]
…
Layer 4: z^[4] = W^[4] a^[3] + b^[4]

with a^[l] = g^[l](z^[l]) computed at each layer.
Forward propagation in deep network
Given a single training sample x, with a^[0] = x, do you see any pattern?

Layer 1: z^[1] = W^[1] a^[0] + b^[1]
Layer 2: z^[2] = W^[2] a^[1] + b^[2]
…
Layer 4: z^[4] = W^[4] a^[3] + b^[4]

In general, for a given layer l:

z^[l] = W^[l] a^[l-1] + b^[l]
a^[l] = g^[l](z^[l])
Vectorizing…
In general, for a given layer l:

One sample:              z^[l] = W^[l] a^[l-1] + b^[l]
Vectorized (m samples):  Z^[l] = W^[l] A^[l-1] + b^[l]

Z^[l] = [ z^[l](1)  z^[l](2)  …  z^[l](m) ]

Notation: z^[l](i) is the output of layer l for sample i.

A^[l] = [ a^[l](1)  a^[l](2)  …  a^[l](m) ]

Each column of A^[l] is one sample; each row corresponds to one unit in layer l.
Forward propagation loop
[Figure: Layer 0 -> Layer 1 -> Layer 2 -> Layer 3 -> Layer 4]

Z^[l] = W^[l] A^[l-1] + b^[l]

Z^[1] = W^[1] A^[0] + b^[1]
Z^[2] = W^[2] A^[1] + b^[2]
Z^[3] = W^[3] A^[2] + b^[3]
Z^[4] = W^[4] A^[3] + b^[4]

Forward propagation can be implemented as a for loop:

for l = 1 to L:
    Z^[l] = W^[l] A^[l-1] + b^[l]
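A minimal NumPy sketch of this loop (an illustration, not the course's reference code; it assumes parameters are stored in a dict keyed "W1", "b1", …, with ReLU hidden activations and a sigmoid output):

import numpy as np

def relu(z):
    return np.maximum(0, z)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def forward_propagation(X, parameters, L):
    """Apply Z[l] = W[l] A[l-1] + b[l] and A[l] = g[l](Z[l]) for l = 1..L."""
    A = X  # A[0] = X
    caches = []
    for l in range(1, L + 1):
        W, b = parameters["W" + str(l)], parameters["b" + str(l)]
        Z = W @ A + b                                # Z[l] = W[l] A[l-1] + b[l]
        A_prev, A = A, (sigmoid(Z) if l == L else relu(Z))   # A[l] = g[l](Z[l])
        caches.append((A_prev, W, b, Z))             # saved for backward propagation
    return A, caches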
Getting matrix dimensions right
[Figure: network with inputs x1, x2 and output ŷ]
L = 5,  n^[0] = 2,  n^[1] = 3,  n^[2] = 5,  n^[3] = 4,  n^[4] = 2,  n^[5] = 1

What are the sizes of W^[l] and b^[l]?

z^[1] = W^[1] a^[0] + b^[1]
Shapes: (3,1) = (3,2)(2,1) + (3,1), i.e. (n^[1],1) = (n^[1],n^[0])(n^[0],1) + (n^[1],1)

z^[2] = W^[2] a^[1] + b^[2]
Shapes: (5,1) = (5,3)(3,1) + (5,1), i.e. (n^[2],1) = (n^[2],n^[1])(n^[1],1) + (n^[2],1)
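One way to make these shapes concrete is to build the parameters directly from the list of layer sizes. A small sketch (the 0.01 scaling is just one common initialization choice, not the only one):

import numpy as np

layer_sizes = [2, 3, 5, 4, 2, 1]   # n[0], …, n[5] from the network above

def initialize_parameters(layer_sizes):
    parameters = {}
    for l in range(1, len(layer_sizes)):
        # W[l]: (n[l], n[l-1]),  b[l]: (n[l], 1)
        parameters["W" + str(l)] = np.random.randn(layer_sizes[l], layer_sizes[l - 1]) * 0.01
        parameters["b" + str(l)] = np.zeros((layer_sizes[l], 1))
    return parameters

params = initialize_parameters(layer_sizes)
print(params["W1"].shape, params["b1"].shape)   # (3, 2) (3, 1)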
Getting matrix dimensions right
L = 5,  n^[0] = 2,  n^[1] = 3,  n^[2] = 5,  n^[3] = 4,  n^[4] = 2,  n^[5] = 1

What are the sizes of W^[l] and b^[l] in the remaining layers?

z^[3] = W^[3] a^[2] + b^[3]
Shapes: (4,1) = (4,5)(5,1) + (4,1), i.e. (n^[3],1) = (n^[3],n^[2])(n^[2],1) + (n^[3],1)

z^[4] = W^[4] a^[3] + b^[4]
Shapes: (2,1) = (2,4)(4,1) + (2,1), i.e. (n^[4],1) = (n^[4],n^[3])(n^[3],1) + (n^[4],1)

z^[5] = W^[5] a^[4] + b^[5]
Shapes: (1,1) = (1,2)(2,1) + (1,1), i.e. (n^[5],1) = (n^[5],n^[4])(n^[4],1) + (n^[5],1)
Getting matrix dimensions right
General case (shapes):

W^[l]: (n^[l], n^[l-1])    dW^[l]: (n^[l], n^[l-1])
b^[l]: (n^[l], 1)          db^[l]: (n^[l], 1)

What about shapes with multiple samples at a time?

• The dimensions of W^[l] and b^[l] stay the same.
• The dimensions of z^[l] and a^[l] will change.
Getting matrix dimensions right
z^[1] = W^[1] x + b^[1]
For one sample, sizes are: (n^[1],1) = (n^[1],n^[0])(n^[0],1) + (n^[1],1)

Vectorized:

Z^[1] = [ z^[1](1)  z^[1](2)  …  z^[1](m) ]

Z^[1] = W^[1] X + b^[1]
For multiple samples, sizes are: (n^[1],m) = (n^[1],n^[0])(n^[0],m) + (n^[1],1)
b^[1] keeps shape (n^[1],1) and is broadcast across the m columns by element-wise sum.

• The dimensions of z and a change:
  z^[l], a^[l]: (n^[l], 1)        For layer 0: X = A^[0]: (n^[0], m)
  Z^[l], A^[l]: (n^[l], m)
  dZ^[l], dA^[l]: (n^[l], m)
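A quick way to see the broadcasting: b^[l] has shape (n^[l], 1) and is added column-wise to W^[l] A^[l-1], which has shape (n^[l], m). A small check with made-up layer sizes:

import numpy as np

n_prev, n_l, m = 3, 5, 10            # n[l-1], n[l], number of samples
W = np.random.randn(n_l, n_prev)     # (n[l], n[l-1])
b = np.random.randn(n_l, 1)          # (n[l], 1)
A_prev = np.random.randn(n_prev, m)  # (n[l-1], m)

Z = W @ A_prev + b                   # b is broadcast across the m columns
print(Z.shape)                       # (5, 10) = (n[l], m)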
Why deep representations?
[Figure: face-recognition network; features go from simple to complex across layers]

• The first layers detect simple features such as edges and borders.
• Later layers detect parts of faces, and then parts of faces composed together into whole faces.

Lee, H., Grosse, R., Ranganath, R., & Ng, A. Y. (2009). Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. Proceedings of the 26th International Conference on Machine Learning, ICML 2009, 609–616. https://doi.org/10.1145/1553374.1553453
When is deep better than shallow
• There are functions you can compute with a small L-layer deep neural network for which shallower networks require exponentially more hidden units.

[Figure: computing a function of x1, …, x8 with a tree-shaped deep network versus a single hidden layer over the same inputs]

• With the tree, the depth of the network grows only logarithmically with the number of inputs; the shallow network must enumerate the computations of the input, which requires exponentially many hidden units.
When is deep better than shallow
• A hierarchical network can approximate a higher-degree polynomial via composition.

[Figure: deep hierarchical network over x1, …, x8 versus a shallow network over the same inputs]

• A hierarchical network with 11 layers and 39 units can approximate the target function arbitrarily well; a shallow network needs many more units to approximate it arbitrarily well.

Mhaskar, H., Liao, Q., & Poggio, T. (2016). Learning Functions: When Is Deep Better Than Shallow, (045), 1–12. Retrieved from http://arxiv.org/abs/1603.00988
Practical advice
• Start with logistic regression.
• Start with 1–2 layers.
• Increase the number of layers iteratively.
Forward and backward functions

for l = 1 to L:
    parameters: W^[l], b^[l]

Forward propagation for layer l: input a^[l-1], output a^[l]
    z^[l] = W^[l] a^[l-1] + b^[l]
    cache: z^[l]

Backward propagation for layer l: input da^[l] (plus the cache z^[l]), outputs da^[l-1], dW^[l], db^[l]
Forward and backward functions

Forward:  x = a^[0] → [W^[1], b^[1]] → a^[1] → [W^[2], b^[2]] → a^[2] → … → [W^[L], b^[L]] → a^[L] = ŷ
          (each layer stores a cache)

Backward: da^[L] → … → da^[2] → da^[1], producing dW^[l] and db^[l] at every layer

Update:
W^[l] = W^[l] − α dW^[l]
b^[l] = b^[l] − α db^[l]
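The update step at the bottom of the diagram translates directly into code. A short sketch, assuming the same parameter-dict layout as above and a hypothetical grads dict holding "dW1", "db1", …:

def update_parameters(parameters, grads, alpha, L):
    """Gradient descent step: W[l] = W[l] - alpha*dW[l], b[l] = b[l] - alpha*db[l]."""
    for l in range(1, L + 1):
        parameters["W" + str(l)] -= alpha * grads["dW" + str(l)]
        parameters["b" + str(l)] -= alpha * grads["db" + str(l)]
    return parameters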
Forward and backward implementation
Having A^[0] = X,

for l = 1 to L:
    parameters: W^[l], b^[l]

Forward propagation (input A^[l-1], cache Z^[l]):

    One sample                           Vectorized
    z^[l] = W^[l] a^[l-1] + b^[l]        Z^[l] = W^[l] A^[l-1] + b^[l]
    a^[l] = g^[l](z^[l])                 A^[l] = g^[l](Z^[l])

Backward propagation (input dA^[l], outputs dA^[l-1], dW^[l], db^[l]):

    One sample                           Vectorized
                                         dZ^[L] = A^[L] − Y
    dz^[l] = da^[l] * g^[l]'(z^[l])      dZ^[l] = dA^[l] * g^[l]'(Z^[l])
    dW^[l] = dz^[l] a^[l-1]T             dW^[l] = (1/m) dZ^[l] A^[l-1]T
    db^[l] = dz^[l]                      db^[l] = (1/m) np.sum(dZ^[l], axis=1, keepdims=True)
    da^[l-1] = W^[l]T dz^[l]             dA^[l-1] = W^[l]T dZ^[l]
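The vectorized equations translate almost line for line into NumPy. A sketch of one backward step for a hidden layer l, assuming the cache layout of the forward sketch above and ReLU hidden activations (for the output layer you would instead start from dZ^[L] = A^[L] − Y); this is an illustration, not a complete implementation:

import numpy as np

def backward_step(dA, cache, m):
    """One backward step: dZ[l] = dA[l] * g'(Z[l]), then dW[l], db[l], dA[l-1]."""
    A_prev, W, b, Z = cache                       # stored during forward propagation
    dZ = dA * (Z > 0)                             # ReLU: g'(Z) = 1 where Z > 0, else 0
    dW = (dZ @ A_prev.T) / m                      # dW[l] = (1/m) dZ[l] A[l-1]^T
    db = np.sum(dZ, axis=1, keepdims=True) / m    # db[l] = (1/m) sum over the m samples
    dA_prev = W.T @ dZ                            # dA[l-1] = W[l]^T dZ[l]
    return dA_prev, dW, db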
Example

x → [Layer 1: ReLU] → [Layer 2: ReLU] → [Layer 3: sigmoid] → ŷ → L(ŷ, y)
     z^[1]             z^[2]             z^[3]

Backward pass: start from da^[L] = − y/a + (1 − y)/(1 − a), then compute da^[2], da^[1], obtaining dW^[1], db^[1], dW^[2], db^[2], dW^[3], db^[3] along the way.
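For the sigmoid output with the cross-entropy loss, combining da^[L] = − y/a + (1 − y)/(1 − a) with sigmoid'(z) = a(1 − a) gives dZ^[L] = A^[L] − Y. A quick numerical check with made-up values:

import numpy as np

A = np.array([[0.8, 0.3, 0.6]])   # a[L] = y_hat for 3 samples (made-up)
Y = np.array([[1.0, 0.0, 1.0]])   # labels

dA = -Y / A + (1 - Y) / (1 - A)   # dL/da for the cross-entropy loss
dZ = dA * A * (1 - A)             # chain rule with sigmoid'(z) = a(1 - a)
print(np.allclose(dZ, A - Y))     # True: dZ[L] = A[L] - Y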
Parameters and hyperparameters
• Parameters: W^[l], b^[l].
• Hyperparameters (they control W and b):
  – Learning rate α
  – # iterations
  – # hidden layers
  – # hidden units
  – Choice of activation functions
  – Others (later): momentum, minibatch size, regularization parameters, …
Applied deep learning process
• It is iterative and empirical: 1. idea → 2. code → 3. experiment → back to 1.
• It requires trying out many values.
• The best choices depend on the application: vision, NLP, online ads, web search, recommendation.
• Use a systematic process to explore hyperparameters.
References
• Ng, A. (2022). Deep Learning Specialization. https://www.deeplearning.ai/courses/deep-learning-specialization/
• Lalin, J. (2021, December 10). Feedforward neural networks in depth, part 1: Forward and backward propagations. I, Deep Learning. Retrieved February 28, 2023, from https://jonaslalin.com/2021/12/10/feedforward-neural-networks-part-1/
• Lalin, J. (2021, December 21). Feedforward neural networks in depth, part 2: Activation functions. I, Deep Learning. Retrieved February 28, 2023, from https://jonaslalin.com/2021/12/21/feedforward-neural-networks-part-2/
• Lalin, J. (2021, December 22). Feedforward neural networks in depth, part 3: Cost functions. I, Deep Learning. Retrieved February 28, 2023, from https://jonaslalin.com/2021/12/22/feedforward-neural-networks-part-3/
• Mhaskar, H., Liao, Q., & Poggio, T. (2016). Learning Functions: When Is Deep Better Than Shallow, (045), 1–12. Retrieved from http://arxiv.org/abs/1603.00988
• Lee, H., Grosse, R., Ranganath, R., & Ng, A. Y. (2009). Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. Proceedings of the 26th International Conference on Machine Learning, ICML 2009, 609–616. https://doi.org/10.1145/1553374.1553453
Thank you!
