
Practical aspects of Deep Learning Part III

Arles Rodríguez
aerodriguezp@unal.edu.co

Facultad de Ciencias
Departamento de Matemáticas
Universidad Nacional de Colombia
Motivation
• Backpropagation is the most difficult part of a neural network to implement.
• There is a mathematical way to debug the gradient computation, called gradient checking.
• It is based on a numerical approximation of the gradients.
Numerical approximation of gradients
Let's say f(θ) = θ³ and ε = 0.01.

The two-sided difference approximation is

    f′(θ) ≈ (f(θ + ε) − f(θ − ε)) / (2ε)

Given θ = 1:

    f′(θ) ≈ (f(1.01) − f(0.99)) / (2 · 0.01) = ((1.01)³ − (0.99)³) / (2 · 0.01) ≈ 3.0001

The true derivative is f′(θ) = 3θ² = 3, so the approximation error is about 0.0001.

(Figure: the curve of f around θ = 1, with the points θ − ε and θ + ε marked.)
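This calculation can be reproduced in a few lines of Python (a minimal sketch of the example above, using the example's f(θ) = θ³):

```python
# Two-sided numerical approximation of f'(theta) for f(theta) = theta**3 at theta = 1.
def f(theta):
    return theta ** 3

theta, eps = 1.0, 0.01
approx = (f(theta + eps) - f(theta - eps)) / (2 * eps)  # two-sided difference
exact = 3 * theta ** 2                                  # analytic derivative 3*theta^2

print(approx)               # ~3.0001
print(abs(approx - exact))  # ~0.0001, the approximation error
```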
Some math details
The derivative is defined as

    f′(θ) = lim_{ε→0} (f(θ + ε) − f(θ − ε)) / (2ε)

This follows from the second-order Taylor expansion of f (Bengio, 2012).

The error of this approximation is on the order of ε² (i.e., O(ε²)).

If ε = 0.01, the error is on the order of 0.0001.

This two-sided difference formula is therefore more accurate than the one-sided formula (f(θ + ε) − f(θ)) / ε, whose error is only on the order of ε.

(Figure: f evaluated at θ − ε and θ + ε around θ = 1.)
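A quick numerical comparison (a sketch, not part of the original slides) shows this in practice: the two-sided error shrinks like ε², while the one-sided error only shrinks like ε.

```python
# Compare one-sided and two-sided difference errors for f(theta) = theta**3 at theta = 1.
def f(theta):
    return theta ** 3

theta, exact = 1.0, 3.0  # f'(1) = 3
for eps in (0.1, 0.01, 0.001):
    one_sided = (f(theta + eps) - f(theta)) / eps              # error ~ O(eps)
    two_sided = (f(theta + eps) - f(theta - eps)) / (2 * eps)  # error ~ O(eps^2)
    print(eps, abs(one_sided - exact), abs(two_sided - exact))
```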
Gradient checking
• Take the parameters W[1], b[1], …, W[L], b[L], reshape them, and concatenate them into a big vector θ.
• Take the gradients dW[1], db[1], …, dW[L], db[L] and concatenate them into a big vector dθ (a sketch of this flattening follows below).
• How can we validate that dθ is the gradient of the cost J(θ)?
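A minimal NumPy sketch of this flattening step (the layer names and shapes below are illustrative, not from the slides):

```python
import numpy as np

def to_vector(d, keys):
    # Flatten each array and concatenate them in a fixed order.
    return np.concatenate([d[k].reshape(-1) for k in keys])

rng = np.random.default_rng(0)
parameters = {"W1": rng.standard_normal((4, 3)), "b1": np.zeros((4, 1)),
              "W2": rng.standard_normal((1, 4)), "b2": np.zeros((1, 1))}
gradients = {k: np.zeros_like(v) for k, v in parameters.items()}  # placeholders for dW[l], db[l]

keys = ["W1", "b1", "W2", "b2"]
theta = to_vector(parameters, keys)   # big vector with all W[l], b[l]
dtheta = to_vector(gradients, keys)   # big vector with all dW[l], db[l], in the same order
```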
Gradient checking implementation

For each i:

    dθ_approx[i] = (J(θ₁, …, θᵢ + ε, …) − J(θ₁, …, θᵢ − ε, …)) / (2ε)

• To check whether dθ_approx ≈ dθ, compute the relative distance between the two vectors:

    ‖dθ_approx − dθ‖₂ / (‖dθ_approx‖₂ + ‖dθ‖₂)
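A minimal sketch of this procedure, assuming a cost function J that takes the flattened parameter vector (the names J, theta and dtheta are illustrative):

```python
import numpy as np

def gradient_check(J, theta, dtheta, eps=1e-7):
    # Approximate each partial derivative with the two-sided difference.
    dtheta_approx = np.zeros_like(theta)
    for i in range(theta.size):
        theta_plus, theta_minus = theta.copy(), theta.copy()
        theta_plus[i] += eps    # J(theta_1, ..., theta_i + eps, ...)
        theta_minus[i] -= eps   # J(theta_1, ..., theta_i - eps, ...)
        dtheta_approx[i] = (J(theta_plus) - J(theta_minus)) / (2 * eps)
    # Relative distance between the approximated and the backprop gradients.
    return np.linalg.norm(dtheta_approx - dtheta) / (
        np.linalg.norm(dtheta_approx) + np.linalg.norm(dtheta))

# Toy usage: J(theta) = sum(theta^2), whose gradient is 2*theta.
theta = np.array([1.0, -2.0, 0.5])
print(gradient_check(lambda t: np.sum(t ** 2), theta, 2 * theta))  # on the order of 1e-10
```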
Interpretation of the relative distance
• If it is around 10⁻⁷, the derivative approximation is OK.
• If it is around 10⁻⁵, take a careful look.
• If it is around 10⁻³, there might be a bug. Review the components of dθ_approx vs. dθ and make sure that no individual difference is too large.
Tips to implement gradient check
• Don't use gradient checking in training; it must be used only for debugging.
• When doing gradient checking, remember the regularization term: the cost J must include the regularization term.
• Gradient checking does not work with dropout, because dropout randomly eliminates subsets of hidden units.
• Turn off dropout to check gradients (keep_prob = 1) and then turn dropout back on (a sketch follows below).
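A small self-contained sketch of the regularization tip, using a toy linear model with a squared-error cost (not the slides' network): backprop produces dW = dW_data + (λ/m)·W, so the cost evaluated during gradient checking must include the matching (λ/(2m))·‖W‖² term.

```python
import numpy as np

rng = np.random.default_rng(0)
X, Y = rng.standard_normal((3, 10)), rng.standard_normal((1, 10))
W, lambd, m = rng.standard_normal((1, 3)), 0.7, X.shape[1]

def cost(W, include_reg=True):
    data_term = np.sum((W @ X - Y) ** 2) / (2 * m)                     # squared-error part
    reg_term = (lambd / (2 * m)) * np.sum(W ** 2) if include_reg else 0.0
    return data_term + reg_term

# Analytic gradient, including the regularization term that backprop would add.
dW = ((W @ X - Y) @ X.T) / m + (lambd / m) * W

eps = 1e-7
dW_approx = np.zeros_like(W)
for i in range(W.size):
    Wp, Wm = W.copy(), W.copy()
    Wp.flat[i] += eps
    Wm.flat[i] -= eps
    dW_approx.flat[i] = (cost(Wp) - cost(Wm)) / (2 * eps)  # cost WITH the regularization term

diff = np.linalg.norm(dW_approx - dW) / (np.linalg.norm(dW_approx) + np.linalg.norm(dW))
print(diff)  # tiny (~1e-10); it becomes much larger if the reg term is left out of cost
```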
Tips to implement gradient check

• Run gradient checking at random initialization (when W and b are still approximately zero) and again after some training iterations.
• This is useful because the derivatives can look fine for small values of W and b, but as these values grow during training, a bug can produce large differences in the gradients.
References
• Ng, A. (2022). Deep Learning Specialization.
  https://www.deeplearning.ai/courses/deep-learning-specialization/
• Bengio, Y. (2012). Practical Recommendations for Gradient-Based Training of Deep Architectures. In Lecture Notes in Computer Science (Vol. 7700, pp. 437–478).
  https://doi.org/10.1007/978-3-642-35289-8_26
Thank you!
