
Deep Neural Networks

Arles Rodríguez
aerodriguezp@unal.edu.co

Facultad de Ciencias
Departamento de Matemáticas
Universidad Nacional de Colombia
Deep Neural Networks

[Figure: networks of increasing depth, from inputs x1, x2, x3 to the output ŷ = a: 1-layer NN, 1-hidden-layer NN, 2-hidden-layer NN, 3-hidden-layer NN]


Deep learning networks
• How deep a neural network is depends on its number of hidden layers.
• There are problems that shallow networks cannot solve but deep neural networks can.
• It may be hard to predict in advance how deep a network you will need.
• It is reasonable to start with a logistic regression model and increase the number of layers from there.
Notation
• This is a 4-layer network (L = 4) with 3 hidden layers.
  [Figure: Layer 0 (input) -> Layer 1 -> Layer 2 -> Layer 3 -> Layer 4 (output)]
• n^[l] = number of units in layer l; in the figure n^[1] = 5, n^[2] = 5, n^[3] = 3, n^[4] = 1.
• a^[l] = activations in layer l: a^[l] = g^[l](z^[l]), with a^[0] = x and a^[L] = ŷ.
• W^[l], b^[l] = weights and bias in layer l.
Forward propagation in deep network

Given a single training sample x:

Layer 1: z^[1] = W^[1] x + b^[1]
Layer 2: z^[2] = W^[2] a^[1] + b^[2]
…
Layer 4: z^[4] = W^[4] a^[3] + b^[4]

with a^[l] = g^[l](z^[l]) computed at each layer.
Forward propagation in deep network
Given a single training sample x, with a^[0] = x, do you see any pattern?

Layer 1: z^[1] = W^[1] a^[0] + b^[1]
Layer 2: z^[2] = W^[2] a^[1] + b^[2]
…
Layer 4: z^[4] = W^[4] a^[3] + b^[4]

In general, for a given layer l:

z^[l] = W^[l] a^[l-1] + b^[l]
a^[l] = g^[l](z^[l])
Vectorizing…
In general, for a given layer l:

One sample:              z^[l] = W^[l] a^[l-1] + b^[l]
Vectorized (m samples):  Z^[l] = W^[l] A^[l-1] + b^[l]

Z^[l] = [ z^[l](1)  z^[l](2)  …  z^[l](m) ]

Notation: z^[l](i) is the output of layer l for sample i.

A^[l] = [ a^[l](1)  a^[l](2)  …  a^[l](m) ]

Each column of A^[l] is one sample; each row corresponds to one unit in layer l.
Forward propagation loop
[Figure: Layer 0 -> Layer 1 -> Layer 2 -> Layer 3 -> Layer 4]

Z^[l] = W^[l] A^[l-1] + b^[l]

Z^[1] = W^[1] A^[0] + b^[1]
Z^[2] = W^[2] A^[1] + b^[2]
Z^[3] = W^[3] A^[2] + b^[3]
Z^[4] = W^[4] A^[3] + b^[4]

Forward propagation can be implemented as a for loop:

for l = 1 to L:
    Z^[l] = W^[l] A^[l-1] + b^[l]
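A minimal NumPy sketch of this loop (an illustration, not the course's reference code; it assumes parameters are stored in a dict keyed "W1", "b1", …, with ReLU hidden activations and a sigmoid output):

import numpy as np

def relu(z):
    return np.maximum(0, z)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def forward_propagation(X, parameters, L):
    """Apply Z[l] = W[l] A[l-1] + b[l] and A[l] = g[l](Z[l]) for l = 1..L."""
    A = X  # A[0] = X
    caches = []
    for l in range(1, L + 1):
        W, b = parameters["W" + str(l)], parameters["b" + str(l)]
        Z = W @ A + b                                # Z[l] = W[l] A[l-1] + b[l]
        A_prev, A = A, (sigmoid(Z) if l == L else relu(Z))   # A[l] = g[l](Z[l])
        caches.append((A_prev, W, b, Z))             # saved for backward propagation
    return A, caches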
Getting matrix dimensions right
[Figure: network with inputs x1, x2 and output ŷ]
L = 5,  n^[0] = 2,  n^[1] = 3,  n^[2] = 5,  n^[3] = 4,  n^[4] = 2,  n^[5] = 1

What are the sizes of W^[l] and b^[l]?

z^[1] = W^[1] a^[0] + b^[1]
Shapes: (3,1) = (3,2)(2,1) + (3,1), i.e. (n^[1],1) = (n^[1],n^[0])(n^[0],1) + (n^[1],1)

z^[2] = W^[2] a^[1] + b^[2]
Shapes: (5,1) = (5,3)(3,1) + (5,1), i.e. (n^[2],1) = (n^[2],n^[1])(n^[1],1) + (n^[2],1)
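One way to make these shapes concrete is to build the parameters directly from the list of layer sizes. A small sketch (the 0.01 scaling is just one common initialization choice, not the only one):

import numpy as np

layer_sizes = [2, 3, 5, 4, 2, 1]   # n[0], …, n[5] from the network above

def initialize_parameters(layer_sizes):
    parameters = {}
    for l in range(1, len(layer_sizes)):
        # W[l]: (n[l], n[l-1]),  b[l]: (n[l], 1)
        parameters["W" + str(l)] = np.random.randn(layer_sizes[l], layer_sizes[l - 1]) * 0.01
        parameters["b" + str(l)] = np.zeros((layer_sizes[l], 1))
    return parameters

params = initialize_parameters(layer_sizes)
print(params["W1"].shape, params["b1"].shape)   # (3, 2) (3, 1)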
Getting matrix dimensions right
L = 5,  n^[0] = 2,  n^[1] = 3,  n^[2] = 5,  n^[3] = 4,  n^[4] = 2,  n^[5] = 1

What are the sizes of W^[l] and b^[l] in the remaining layers?

z^[3] = W^[3] a^[2] + b^[3]
Shapes: (4,1) = (4,5)(5,1) + (4,1), i.e. (n^[3],1) = (n^[3],n^[2])(n^[2],1) + (n^[3],1)

z^[4] = W^[4] a^[3] + b^[4]
Shapes: (2,1) = (2,4)(4,1) + (2,1), i.e. (n^[4],1) = (n^[4],n^[3])(n^[3],1) + (n^[4],1)

z^[5] = W^[5] a^[4] + b^[5]
Shapes: (1,1) = (1,2)(2,1) + (1,1), i.e. (n^[5],1) = (n^[5],n^[4])(n^[4],1) + (n^[5],1)
Getting matrix dimensions right
General case (shapes):

W^[l]: (n^[l], n^[l-1])    dW^[l]: (n^[l], n^[l-1])
b^[l]: (n^[l], 1)          db^[l]: (n^[l], 1)

What about shapes with multiple samples at a time?

• The dimensions of W^[l] and b^[l] stay the same.
• The dimensions of z^[l] and a^[l] will change.
Getting matrix dimensions right
z^[1] = W^[1] x + b^[1]
For one sample, sizes are: (n^[1],1) = (n^[1],n^[0])(n^[0],1) + (n^[1],1)

Vectorized:

Z^[1] = [ z^[1](1)  z^[1](2)  …  z^[1](m) ]

Z^[1] = W^[1] X + b^[1]
For multiple samples, sizes are: (n^[1],m) = (n^[1],n^[0])(n^[0],m) + (n^[1],1)
b^[1] keeps shape (n^[1],1) and is broadcast across the m columns by element-wise sum.

• The dimensions of z and a change:
  z^[l], a^[l]: (n^[l], 1)        For layer 0: X = A^[0]: (n^[0], m)
  Z^[l], A^[l]: (n^[l], m)
  dZ^[l], dA^[l]: (n^[l], m)
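A quick way to see the broadcasting: b^[l] has shape (n^[l], 1) and is added column-wise to W^[l] A^[l-1], which has shape (n^[l], m). A small check with made-up layer sizes:

import numpy as np

n_prev, n_l, m = 3, 5, 10            # n[l-1], n[l], number of samples
W = np.random.randn(n_l, n_prev)     # (n[l], n[l-1])
b = np.random.randn(n_l, 1)          # (n[l], 1)
A_prev = np.random.randn(n_prev, m)  # (n[l-1], m)

Z = W @ A_prev + b                   # b is broadcast across the m columns
print(Z.shape)                       # (5, 10) = (n[l], m)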
Why deep representations?
[Figure: face-recognition network; features go from simple to complex across layers]

• The first layers detect simple features such as edges and borders.
• Later layers detect parts of faces, and then parts of faces composed together into whole faces.

Lee, H., Grosse, R., Ranganath, R., & Ng, A. Y. (2009). Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. Proceedings of the 26th International Conference on Machine Learning, ICML 2009, 609–616. https://doi.org/10.1145/1553374.1553453
When is deep better than shallow
• There are functions you can compute with a small L-layer deep neural network for which shallower networks require exponentially more hidden units.

[Figure: computing a function of x1, …, x8 with a tree-shaped deep network versus a single hidden layer over the same inputs]

• With the tree, the depth of the network grows only logarithmically with the number of inputs; the shallow network must enumerate the computations of the input, which requires exponentially many hidden units.
When is deep better than shallow
• A hierarchical network can approximate a higher-degree polynomial via composition.

[Figure: deep hierarchical network over x1, …, x8 versus a shallow network over the same inputs]

• A hierarchical network with 11 layers and 39 units can approximate the target function arbitrarily well; a shallow network needs many more units to approximate it arbitrarily well.

Mhaskar, H., Liao, Q., & Poggio, T. (2016). Learning Functions: When Is Deep Better Than Shallow, (045), 1–12. Retrieved from http://arxiv.org/abs/1603.00988
Practical advice
• Start with logistic regression.
• Start with 1–2 layers.
• Increase the number of layers iteratively.
Forward and backward functions

for l = 1 to L:
    parameters: W^[l], b^[l]

Forward propagation for layer l: input a^[l-1], output a^[l]
    z^[l] = W^[l] a^[l-1] + b^[l]
    cache: z^[l]

Backward propagation for layer l: input da^[l] (plus the cache z^[l]), outputs da^[l-1], dW^[l], db^[l]
Forward and backward functions

Forward:  x = a^[0] → [W^[1], b^[1]] → a^[1] → [W^[2], b^[2]] → a^[2] → … → [W^[L], b^[L]] → a^[L] = ŷ
          (each layer stores a cache)

Backward: da^[L] → … → da^[2] → da^[1], producing dW^[l] and db^[l] at every layer

Update:
W^[l] = W^[l] − α dW^[l]
b^[l] = b^[l] − α db^[l]
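The update step at the bottom of the diagram translates directly into code. A short sketch, assuming the same parameter-dict layout as above and a hypothetical grads dict holding "dW1", "db1", …:

def update_parameters(parameters, grads, alpha, L):
    """Gradient descent step: W[l] = W[l] - alpha*dW[l], b[l] = b[l] - alpha*db[l]."""
    for l in range(1, L + 1):
        parameters["W" + str(l)] -= alpha * grads["dW" + str(l)]
        parameters["b" + str(l)] -= alpha * grads["db" + str(l)]
    return parameters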
Forward and backward implementation
Having A^[0] = X,

for l = 1 to L:
    parameters: W^[l], b^[l]

Forward propagation (input A^[l-1], cache Z^[l]):

    One sample                           Vectorized
    z^[l] = W^[l] a^[l-1] + b^[l]        Z^[l] = W^[l] A^[l-1] + b^[l]
    a^[l] = g^[l](z^[l])                 A^[l] = g^[l](Z^[l])

Backward propagation (input dA^[l], outputs dA^[l-1], dW^[l], db^[l]):

    One sample                           Vectorized
                                         dZ^[L] = A^[L] − Y
    dz^[l] = da^[l] * g^[l]'(z^[l])      dZ^[l] = dA^[l] * g^[l]'(Z^[l])
    dW^[l] = dz^[l] a^[l-1]T             dW^[l] = (1/m) dZ^[l] A^[l-1]T
    db^[l] = dz^[l]                      db^[l] = (1/m) np.sum(dZ^[l], axis=1, keepdims=True)
    da^[l-1] = W^[l]T dz^[l]             dA^[l-1] = W^[l]T dZ^[l]
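The vectorized equations translate almost line for line into NumPy. A sketch of one backward step for a hidden layer l, assuming the cache layout of the forward sketch above and ReLU hidden activations (for the output layer you would instead start from dZ^[L] = A^[L] − Y); this is an illustration, not a complete implementation:

import numpy as np

def backward_step(dA, cache, m):
    """One backward step: dZ[l] = dA[l] * g'(Z[l]), then dW[l], db[l], dA[l-1]."""
    A_prev, W, b, Z = cache                       # stored during forward propagation
    dZ = dA * (Z > 0)                             # ReLU: g'(Z) = 1 where Z > 0, else 0
    dW = (dZ @ A_prev.T) / m                      # dW[l] = (1/m) dZ[l] A[l-1]^T
    db = np.sum(dZ, axis=1, keepdims=True) / m    # db[l] = (1/m) sum over the m samples
    dA_prev = W.T @ dZ                            # dA[l-1] = W[l]^T dZ[l]
    return dA_prev, dW, db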
Example

x → [Layer 1: ReLU] → [Layer 2: ReLU] → [Layer 3: sigmoid] → ŷ → L(ŷ, y)
     z^[1]             z^[2]             z^[3]

Backward pass: start from da^[L] = − y/a + (1 − y)/(1 − a), then compute da^[2], da^[1], obtaining dW^[1], db^[1], dW^[2], db^[2], dW^[3], db^[3] along the way.
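For the sigmoid output with the cross-entropy loss, combining da^[L] = − y/a + (1 − y)/(1 − a) with sigmoid'(z) = a(1 − a) gives dZ^[L] = A^[L] − Y. A quick numerical check with made-up values:

import numpy as np

A = np.array([[0.8, 0.3, 0.6]])   # a[L] = y_hat for 3 samples (made-up)
Y = np.array([[1.0, 0.0, 1.0]])   # labels

dA = -Y / A + (1 - Y) / (1 - A)   # dL/da for the cross-entropy loss
dZ = dA * A * (1 - A)             # chain rule with sigmoid'(z) = a(1 - a)
print(np.allclose(dZ, A - Y))     # True: dZ[L] = A[L] - Y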
Parameters and hyperparameters
• Parameters: W^[l], b^[l].
• Hyperparameters (they control W and b):
  – Learning rate α
  – # iterations
  – # hidden layers
  – # hidden units
  – Choice of activation functions
  – Others (later): momentum, minibatch size, regularization parameters, …
Applied deep learning process
• It is iterative and empirical: 1. idea → 2. code → 3. experiment → back to 1.
• It requires trying out many values.
• The best choices depend on the application: vision, NLP, online ads, web search, recommendation.
• Use a systematic process to explore hyperparameters.
References
• Ng, A. (2022). Deep Learning Specialization. https://www.deeplearning.ai/courses/deep-learning-specialization/
• Lalin, J. (2021, December 10). Feedforward neural networks in depth, part 1: Forward and backward propagations. I, Deep Learning. Retrieved February 28, 2023, from https://jonaslalin.com/2021/12/10/feedforward-neural-networks-part-1/
• Lalin, J. (2021, December 21). Feedforward neural networks in depth, part 2: Activation functions. I, Deep Learning. Retrieved February 28, 2023, from https://jonaslalin.com/2021/12/21/feedforward-neural-networks-part-2/
• Lalin, J. (2021, December 22). Feedforward neural networks in depth, part 3: Cost functions. I, Deep Learning. Retrieved February 28, 2023, from https://jonaslalin.com/2021/12/22/feedforward-neural-networks-part-3/
• Mhaskar, H., Liao, Q., & Poggio, T. (2016). Learning Functions: When Is Deep Better Than Shallow, (045), 1–12. Retrieved from http://arxiv.org/abs/1603.00988
• Lee, H., Grosse, R., Ranganath, R., & Ng, A. Y. (2009). Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. Proceedings of the 26th International Conference on Machine Learning, ICML 2009, 609–616. https://doi.org/10.1145/1553374.1553453
Thank you!
