
6.

Deep Learning Overview


COMPUTER VISION – TK44019
The course material for this chapter follows the HCIA - Artificial Intelligence module (the material is copyrighted by Huawei).
Deep Learning Overview
Foreword

• The chapter describes the basic knowledge of deep learning, including the development history of deep learning, components and types of deep learning neural networks, and common problems in deep learning projects.
Objectives

On completion of this course, you will be able to:

- Describe the definition and development of neural networks.
- Learn about the important components of deep learning neural networks.
- Understand training and optimization of neural networks.
- Describe common problems in deep learning.
Contents

1. Deep Learning Summary

2. Training Rules

3. Activation Function

4. Normalizer

5. Optimizer

6. Types of Neural Networks

7. Common Problems

Traditional Machine Learning and Deep Learning

• As a model based on unsupervised feature learning and feature hierarchy learning, deep learning has great advantages in fields such as computer vision, speech recognition, and natural language processing.

Comparison (traditional machine learning vs. deep learning):

- Hardware requirements: Traditional machine learning has low hardware requirements on the computer; given the limited computing amount, a GPU for parallel computing is generally not needed. Deep learning has higher hardware requirements; to execute matrix operations on massive data, the computer needs a GPU to perform parallel computing.
- Data amount: Traditional machine learning is applicable to training with a small data amount, and its performance cannot be improved continuously as the data amount increases. Deep learning performance can be high when high-dimensional weight parameters and massive training data are provided.
- Workflow: Traditional machine learning breaks a problem down level by level; deep learning uses end-to-end (E2E) learning.
- Feature engineering: Traditional machine learning relies on manual feature selection; deep learning uses algorithm-based automatic feature extraction.
- Interpretability: Traditional machine learning features are easy to explain; deep learning features are hard to explain.
Traditional Machine Learning

Typical workflow: issue analysis and problem locating → data cleansing → feature extraction → feature selection → model training → inference, prediction, and identification.

Question: Can we use an algorithm to automatically execute the procedure?
Deep Learning

• Generally, the deep learning architecture is a deep neural network. "Deep" in "deep learning" refers to the number of layers of the neural network.

Figure: a human neural network (dendrite, nucleus, axon, synapse), a perceptron (input layer, output layer), and a deep neural network (input layer, hidden layers, output layer).
Neural Network

• Currently, the definition of the neural network has not been determined yet. Hecht-Nielsen, a neural network researcher in the U.S., defines a neural network as a computer system composed of simple and highly interconnected processing elements, which process information by dynamic response to external inputs.

• A neural network can be simply expressed as an information processing system designed to imitate the human brain structure and functions based on its source, features, and explanations.

• Artificial neural network (neural network): Formed by artificial neurons connected to each other, the neural network extracts and simplifies the human brain's microstructure and functions. It is an important approach to simulate human intelligence and reflects several basic features of human brain functions, such as concurrent information processing, learning, association, model classification, and memory.
Development History of Neural Networks

Timeline: perceptron (1958) → XOR problem (1970) → MLP (1986) → SVM (1995) → deep network (2006), spanning the golden age and the AI winter of neural network research.
Single-Layer Perceptron

• Input vector: X = [x_0, x_1, …, x_n]^T
• Weight: W = [ω_0, ω_1, …, ω_n]^T, in which ω_0 is the offset (bias).
• Activation function:

  O = sign(net) = 1 if net > 0, −1 otherwise, where net = Σ_{i=0}^{n} ω_i x_i = W^T X.

• The preceding perceptron is equivalent to a classifier. It uses the high-dimensional vector X as the input and performs binary classification on input samples in the high-dimensional space. When W^T X > 0, O = 1 and the samples are classified into one type; otherwise, O = −1 and the samples are classified into the other type. The boundary between the two types is W^T X = 0, which is a high-dimensional hyperplane.

Classification point: Ax + B = 0 | Classification line: Ax + By + C = 0 | Classification plane: Ax + By + Cz + D = 0 | Classification hyperplane: W^T X + b = 0
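As a minimal illustration of the sign-activation perceptron described above, the following NumPy sketch classifies samples with a fixed weight vector. The function name `perceptron_predict` and the toy weights/data are illustrative assumptions, not part of the original material.

```python
import numpy as np

def perceptron_predict(X, w):
    """Single-layer perceptron: O = sign(W^T X), with w[0] acting as the offset (bias)."""
    X_aug = np.hstack([np.ones((X.shape[0], 1)), X])  # prepend x0 = 1 for the bias term
    net = X_aug @ w                                   # net = W^T X for every sample
    return np.where(net > 0, 1, -1)                   # sign activation: +1 / -1

# Toy example: the hyperplane x1 + x2 - 1 = 0 separates the two classes
w = np.array([-1.0, 1.0, 1.0])
X = np.array([[0.0, 0.0], [2.0, 1.0], [0.2, 0.3], [1.5, 0.0]])
print(perceptron_predict(X, w))   # -> [-1  1 -1  1]
```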
XOR Problem

• In 1969, Minsky, an American mathematician and AI pioneer, proved that a perceptron is essentially a linear model that can only deal with linear classification problems, but cannot process non-linear data.

Figure: truth-table plots of AND, OR, and XOR.
Feedforward Neural Network

Figure: input layer → hidden layer 1 → hidden layer 2 → output layer.
Solution of XOR

Figure: a feedforward network with one hidden layer (weights w0–w5) whose two hidden units are combined to compute XOR.
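To make the figure concrete, here is a small hand-weighted sketch: a 2-2-1 network of threshold units in which one hidden unit acts like OR, the other like AND, and the output unit combines them into XOR. The specific weights and the helper name `step` are illustrative assumptions, not taken from the slide.

```python
import numpy as np

def step(x):
    """Threshold activation: 1 if x > 0 else 0."""
    return (x > 0).astype(float)

def xor_net(x1, x2):
    h1 = step(x1 + x2 - 0.5)      # hidden unit 1 ~ OR(x1, x2)
    h2 = step(x1 + x2 - 1.5)      # hidden unit 2 ~ AND(x1, x2)
    return step(h1 - h2 - 0.5)    # output ~ "OR but not AND" = XOR

x1, x2 = np.array([0, 0, 1, 1]), np.array([0, 1, 0, 1])
print(xor_net(x1, x2))            # -> [0. 1. 1. 0.]
```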
Impacts of Hidden Layers on a Neural Network

Figure: decision boundaries learned with 0 hidden layers, 3 hidden layers, and 20 hidden layers.
Contents

1. Deep Learning Summary

2. Training Rules

3. Activation Function

4. Normalizer

5. Optimizer

6. Types of Neural Networks

7. Common Problems

Gradient Descent and Loss Function

• The gradient of the multivariate function o = f(x) = f(x_0, x_1, …, x_n) at X′ = [x_0′, x_1′, …, x_n′]^T is:

  ∇f(x_0′, x_1′, …, x_n′) = [∂f/∂x_0, ∂f/∂x_1, …, ∂f/∂x_n]^T |_{X=X′}

  The direction of the gradient vector is the direction in which the function grows fastest. As a result, the direction of the negative gradient vector −∇f is the direction in which the function descends fastest.

• During the training of a deep learning network, target classification errors must be parameterized. A loss function (error function) is used, which reflects the error between the target output and the actual output of the perceptron. For a single training sample x, the most common error function is the quadratic cost function:

  E(w) = 1/2 Σ_{d∈D} (t_d − o_d)^2,

  where d is one neuron in the output layer, D is the set of all neurons in the output layer, t_d is the target output, and o_d is the actual output.

• The gradient descent method searches along the negative gradient direction and updates the parameters iteratively, finally minimizing the loss function.
Extrema of the Loss Function

• Purpose: The loss function E(W) is defined on the weight space. The objective is to search for the weight vector W that minimizes E(W).

• Limitation: No effective mathematical method can solve for the extremum on the complex high-dimensional surface of E(W) = 1/2 Σ_{d∈D} (t_d − o_d)^2.

Figure: example of gradient descent on a paraboloid in two variables.
Common Loss Functions in Deep Learning

• Quadratic cost function:

  E(W) = 1/2 Σ_{d∈D} (t_d − o_d)^2

• Cross-entropy error function:

  E(W) = −(1/n) Σ_x Σ_{d∈D} [t_d ln(o_d) + (1 − t_d) ln(1 − o_d)]

• The cross-entropy error function depicts the distance between two probability distributions, and is a widely used loss function for classification problems.

• Generally, the mean square error function is used to solve the regression problem, while the cross-entropy error function is used to solve the classification problem.
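A minimal NumPy sketch of the two loss functions above. The function names are illustrative, and the cross-entropy form assumes outputs in (0, 1) as on the slide; the small clipping constant is a common numerical safeguard rather than something stated in the material.

```python
import numpy as np

def quadratic_cost(t, o):
    """E = 1/2 * sum_d (t_d - o_d)^2 for one sample."""
    return 0.5 * np.sum((t - o) ** 2)

def cross_entropy(t, o, eps=1e-12):
    """E = -1/n * sum over samples of sum_d [t ln o + (1 - t) ln(1 - o)]."""
    o = np.clip(o, eps, 1 - eps)          # avoid log(0)
    per_sample = -np.sum(t * np.log(o) + (1 - t) * np.log(1 - o), axis=1)
    return np.mean(per_sample)

t = np.array([[1.0, 0.0], [0.0, 1.0]])    # target outputs
o = np.array([[0.9, 0.2], [0.3, 0.8]])    # actual outputs
print(quadratic_cost(t[0], o[0]))         # 0.5 * (0.1^2 + 0.2^2) = 0.025
print(cross_entropy(t, o))
```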
Batch Gradient Descent (BGD) Algorithm

• In the training sample set D, each sample is recorded as <X, t>, in which X is the input vector, t the target output, o the actual output, and η the learning rate.

- Initialize each w_i to a random value with a small absolute value.
- Until the end condition is met:
  - Initialize each Δw_i to zero.
  - For each <X, t> in D:
    - Input X to this unit and calculate the output o.
    - For each w_i in this unit: Δw_i += −η · (1/n) · Σ_{d∈D} ∂C(t_d, o_d)/∂w_i
  - For each w_i in this unit: w_i += Δw_i.

• The gradient descent algorithm of this version is not commonly used because the convergence process is very slow, as all training samples need to be calculated every time the weight is updated.
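A runnable sketch of batch gradient descent for a single linear unit with the quadratic cost: one weight update per full pass over the data. The data, learning rate, and the function name `batch_gradient_descent` are illustrative assumptions.

```python
import numpy as np

def batch_gradient_descent(X, t, eta=0.1, epochs=100):
    """Batch GD for a linear unit o = X @ w with quadratic cost E = 1/(2n) * sum (t - o)^2."""
    n, d = X.shape
    w = np.random.uniform(-0.05, 0.05, size=d)   # small random initial weights
    for _ in range(epochs):
        o = X @ w                                # outputs for ALL training samples
        grad = -(X.T @ (t - o)) / n              # dE/dw averaged over the whole set
        w -= eta * grad                          # one weight update per full pass
    return w

rng = np.random.default_rng(0)
X = np.hstack([np.ones((50, 1)), rng.normal(size=(50, 1))])   # bias column + one feature
t = 2.0 + 3.0 * X[:, 1] + 0.1 * rng.normal(size=50)           # noisy line t = 2 + 3x
print(batch_gradient_descent(X, t))   # approximately [2.0, 3.0]
```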
Stochastic Gradient Descent (SGD) Algorithm

• To address the BGD algorithm defect, a common variant called the incremental gradient descent algorithm is used, which is also called the stochastic gradient descent (SGD) algorithm. One implementation is called online learning, which updates the gradient based on each sample:

  Δw_i = −η · (1/n) · Σ_{d∈D} ∂C(t_d, o_d)/∂w_i   ⟹   Δw_i = −η · Σ_{d∈D} ∂C(t_d, o_d)/∂w_i

• ONLINE-GRADIENT-DESCENT
- Initialize each w_i to a random value with a small absolute value.
- Until the end condition is met:
  - Generate a random <X, t> from D and do the following calculation:
    - Input X to this unit and calculate the output o.
    - For each w_i in this unit: w_i += −η · Σ_{d∈D} ∂C(t_d, o_d)/∂w_i
Mini-Batch Gradient Descent (MBGD) Algorithm

• To address the defects of the previous two gradient descent algorithms, the mini-batch gradient descent (MBGD) algorithm was proposed and has been most widely used. A small number of batch size (BS) samples are used at a time to calculate Δw_i, and then the weight is updated accordingly.

• Mini-batch gradient descent
- Initialize each w_i to a random value with a small absolute value.
- Until the end condition is met:
  - Initialize each Δw_i to zero.
  - For each <X, t> in the BS samples of the next batch in D:
    - Input X to this unit and calculate the output o.
    - For each w_i in this unit: Δw_i += −η · (1/n) · Σ_{d∈D} ∂C(t_d, o_d)/∂w_i
  - For each w_i in this unit: w_i += Δw_i.
  - After the last batch, the training samples are shuffled into a random order.
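A sketch of the mini-batch variant for the same linear unit. The batch size, epochs, and names are illustrative; with BS equal to the dataset size this reduces to BGD, and with BS = 1 it behaves like SGD.

```python
import numpy as np

def minibatch_gradient_descent(X, t, eta=0.1, batch_size=8, epochs=50, seed=0):
    """Mini-batch GD for a linear unit o = X @ w with quadratic cost."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = rng.uniform(-0.05, 0.05, size=d)
    for _ in range(epochs):
        order = rng.permutation(n)                   # shuffle the samples each epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]    # indices of the current mini-batch
            Xb, tb = X[idx], t[idx]
            o = Xb @ w
            grad = -(Xb.T @ (tb - o)) / len(idx)     # gradient averaged over the batch
            w -= eta * grad                          # update after every mini-batch
    return w

rng = np.random.default_rng(1)
X = np.hstack([np.ones((64, 1)), rng.normal(size=(64, 1))])
t = 2.0 + 3.0 * X[:, 1] + 0.1 * rng.normal(size=64)
print(minibatch_gradient_descent(X, t))   # approximately [2.0, 3.0]
```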
Backpropagation Algorithm (1)

• Signals are propagated in the forward direction, and errors are propagated in the backward direction.

• In the training sample set D, each sample is recorded as <X, t>, in which X is the input vector, t the target output, o the actual output, and w the weight coefficient.

• Loss function:

  E(w) = 1/2 Σ_{d∈D} (t_d − o_d)^2

Figure: a network with an input layer (x_1, x_2, x_3), a hidden layer, and an output layer (o_1, o_2); forward propagation runs from input to output, and backpropagation runs from output to input.
Backpropagation Algorithm (2)

• According to the following formulas, errors in the input, hidden, and output layers are accumulated to generate the error in the loss function.

• w_c is the weight coefficient between the hidden layer and the output layer, while w_b is the weight coefficient between the input layer and the hidden layer. f is the activation function, D is the output layer set, and C and B are the hidden layer set and input layer set respectively. Assume that the loss function is a quadratic cost function:

- Output layer error:

  E = 1/2 Σ_{d∈D} (t_d − o_d)^2

- Error expanded to the hidden layer:

  E = 1/2 Σ_{d∈D} (t_d − f(net_d))^2 = 1/2 Σ_{d∈D} (t_d − f(Σ_{c∈C} w_c y_c))^2

- Error expanded to the input layer:

  E = 1/2 Σ_{d∈D} (t_d − f(Σ_{c∈C} w_c f(net_c)))^2 = 1/2 Σ_{d∈D} (t_d − f(Σ_{c∈C} w_c f(Σ_{b∈B} w_b x_b)))^2
Backpropagation Algorithm (3)

• To minimize error E, the gradient descent iterative calculation can be used to solve w_c and w_b, that is, calculating the w_c and w_b that minimize error E.

• Formula:

  Δw_c = −η ∂E/∂w_c,  c ∈ C
  Δw_b = −η ∂E/∂w_b,  b ∈ B

• If there are multiple hidden layers, the chain rule is used to take a derivative for each layer to obtain the optimized parameters by iteration.
Backpropagation Algorithm (4)

• For a neural network with any number of layers, the arranged formula for training is as follows:

  Δw^l_jk = −η δ^{l+1}_k f(z^l_j)

  δ^l_j = f′(z^l_j)(t_j − f(z^l_j)),        if l ∈ outputs   (1)
  δ^l_j = Σ_k δ^{l+1}_k w^l_jk f′(z^l_j),   otherwise        (2)

• The BP algorithm is used to train the network as follows:

- Take out the next training sample <X, t>, input X to the network, and obtain the actual output o.
- Calculate the output layer δ according to the output layer error formula (1).
- Calculate δ of each hidden layer from output to input by iteration according to the hidden layer error propagation formula (2).
- According to the δ of each layer, update the weight values of all the layers.
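The following self-contained sketch trains a one-hidden-layer sigmoid network on XOR with the quadratic cost, following the forward/backward pattern described above. The layer sizes, learning rate, epoch count, and variable names are illustrative assumptions, and the delta terms are written in the usual dE/dz form (the sign convention of formulas (1) and (2) is folded into the subtraction in the update).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
t = np.array([[0], [1], [1], [0]], dtype=float)          # XOR targets

W1 = rng.uniform(-1, 1, size=(2, 4))                     # input -> hidden weights
b1 = np.zeros(4)
W2 = rng.uniform(-1, 1, size=(4, 1))                     # hidden -> output weights
b2 = np.zeros(1)
eta = 0.5

for epoch in range(10000):
    # Forward propagation
    z1 = X @ W1 + b1
    h = sigmoid(z1)                                      # hidden activations
    z2 = h @ W2 + b2
    o = sigmoid(z2)                                      # network output

    # Backpropagation of the quadratic cost E = 1/2 * sum (t - o)^2
    delta2 = (o - t) * o * (1 - o)                       # output-layer error term (cf. formula (1))
    delta1 = (delta2 @ W2.T) * h * (1 - h)               # hidden-layer error term (cf. formula (2))

    # Gradient descent weight updates
    W2 -= eta * h.T @ delta2
    b2 -= eta * delta2.sum(axis=0)
    W1 -= eta * X.T @ delta1
    b1 -= eta * delta1.sum(axis=0)

# Typically converges to outputs close to [[0], [1], [1], [0]]
print(np.round(sigmoid(sigmoid(X @ W1 + b1) @ W2 + b2), 2))
```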
Contents

1. Deep Learning Summary

2. Training Rules

3. Activation Function

4. Normalizer

5. Optimizer

6. Types of Neural Networks

7. Common Problems

Activation Function

• Activation functions are important for the neural network model to learn and understand complex non-linear functions. They allow the introduction of non-linear features to the network.

• Without activation functions, output signals are only simple linear functions. The complexity of linear functions is limited, and the capability of learning complex function mappings from data is low.

Activation function:

  output = f(w_1 x_1 + w_2 x_2 + w_3 x_3) = f(W · X)
Sigmoid

  f(x) = 1 / (1 + e^(−x))
Tanh

  tanh(x) = (e^x − e^(−x)) / (e^x + e^(−x))
Softsign

  f(x) = x / (|x| + 1)
Rectified Linear Unit (ReLU)

  y = x if x ≥ 0; y = 0 if x < 0
Softplus

  f(x) = ln(e^x + 1)
Softmax

• Softmax function:

  σ(z)_j = e^{z_j} / Σ_{k=1}^{K} e^{z_k}

• The Softmax function is used to map a K-dimensional vector of arbitrary real values to another K-dimensional vector of real values, where each vector element is in the interval (0, 1). All the elements add up to 1.

• The Softmax function is often used as the output layer of a multiclass classification task.
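A compact NumPy sketch of the activation functions listed in this section. The max-subtraction inside the softmax is a standard numerical-stability trick, not something stated on the slides.

```python
import numpy as np

def sigmoid(x):   return 1.0 / (1.0 + np.exp(-x))
def tanh(x):      return np.tanh(x)                   # (e^x - e^-x) / (e^x + e^-x)
def softsign(x):  return x / (np.abs(x) + 1.0)
def relu(x):      return np.maximum(0.0, x)
def softplus(x):  return np.log(np.exp(x) + 1.0)

def softmax(z):
    """Map a K-dimensional vector to probabilities in (0, 1) that sum to 1."""
    e = np.exp(z - np.max(z))                         # subtract the max for numerical stability
    return e / np.sum(e)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
for f in (sigmoid, tanh, softsign, relu, softplus):
    print(f.__name__, np.round(f(x), 3))
print("softmax", np.round(softmax(x), 3), "sum =", softmax(x).sum())
```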
Contents

1. Deep Learning Summary

2. Training Rules

3. Activation Function

4. Normalizer

5. Optimizer

6. Types of Neural Networks

7. Common Problems

Normalizer

• Regularization is an important and effective technology to reduce generalization errors in machine learning. It is especially useful for deep learning models, which tend to be over-fit due to a large number of parameters. Therefore, researchers have proposed many effective technologies to prevent over-fitting, including:

- Adding constraints to parameters, such as L1 and L2 norms
- Expanding the training set, such as adding noise and transforming data
- Dropout
- Early stopping
Penalty Parameters

• Many regularization methods restrict the learning capability of models by adding a penalty parameter Ω(θ) to the objective function J. Assume that the target function after regularization is J̃:

  J̃(θ; X, y) = J(θ; X, y) + αΩ(θ),

• where α ∈ [0, ∞) is a hyperparameter that weights the relative contribution of the norm penalty term Ω and the standard objective function J(X; θ). If α is set to 0, no regularization is performed. The penalty in regularization increases with α.
L1 Regularization

• Add an L1 norm constraint to model parameters, that is,

  J̃(w; X, y) = J(w; X, y) + α ||w||_1,

• If a gradient method is used to solve the problem, the parameter gradient is

  ∇J̃(w) = α · sign(w) + ∇J(w).
L2 Regularization

• Add an L2 norm penalty term to prevent overfitting:

  J̃(w; X, y) = J(w; X, y) + (α/2) ||w||_2^2,

• A parameter optimization method can be inferred using an optimization technology (such as a gradient method):

  w = (1 − εα) w − ε ∇J(w),

• where ε is the learning rate. Compared with a common gradient optimization formula, this formula multiplies the parameter by a reduction factor.
L1 vs. L2

• The major differences between L2 and L1:

- According to the preceding analysis, L1 can generate a sparser model than L2. When the value of parameter w is small, L1 regularization can directly reduce the parameter value to 0, which can be used for feature selection.
- From the perspective of probability, many norm constraints are equivalent to adding a prior probability distribution to the parameters. In L2 regularization, the parameter value complies with the Gaussian distribution; in L1 regularization, the parameter value complies with the Laplace distribution.

Figure: contours of the L1 and L2 penalties.
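A small sketch of how the two penalties change a gradient update. The toy gradient and the name `l1_l2_update` are illustrative assumptions.

```python
import numpy as np

def l1_l2_update(w, grad_J, eps=0.1, alpha=0.01, mode="l2"):
    """One gradient step on the regularized objective J~(w) = J(w) + penalty.

    L1: J~ = J + alpha * ||w||_1      -> gradient  alpha * sign(w) + grad_J
    L2: J~ = J + alpha/2 * ||w||_2^2  -> update    w = (1 - eps*alpha) * w - eps * grad_J
    """
    if mode == "l1":
        return w - eps * (alpha * np.sign(w) + grad_J)
    return (1.0 - eps * alpha) * w - eps * grad_J     # L2 "weight decay" form

w = np.array([0.5, -0.2, 0.0])
grad_J = np.array([0.1, -0.3, 0.2])                   # gradient of the unregularized loss
print(l1_l2_update(w, grad_J, mode="l1"))
print(l1_l2_update(w, grad_J, mode="l2"))
```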
Dataset Expansion

• The most effective way to prevent over-fitting is to expand the training set. A larger training set has a smaller over-fitting probability. Dataset expansion is a time-saving method, but it varies in different fields.

- A common method in the object recognition field is to rotate or scale images. (The prerequisite for image transformation is that the type of the image cannot be changed through transformation. For example, in handwritten digit recognition, the digits 6 and 9 can easily be confused after rotation.)
- Random noise is added to the input data in speech recognition.
- A common practice in natural language processing (NLP) is replacing words with their synonyms.
- Noise injection can add noise to the input, the hidden layer, or the output layer. For example, for Softmax classification, noise can be added using the label smoothing technology. If noise ε is added to categories 0 and 1, the corresponding probabilities are changed to ε/k and 1 − (k−1)ε/k respectively.
Dropout

• Dropout is a common and simple regularization method, which has been widely used since 2014. Simply put, Dropout randomly discards some inputs during the training process. In this case, the parameters corresponding to the discarded inputs are not updated. As an ensemble-like method, Dropout combines the results of all sub-networks, obtaining the sub-networks by randomly dropping inputs.

Figure: a fully connected network and the sub-networks obtained by randomly dropping units.
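A minimal sketch of (inverted) dropout applied to a layer's activations during training. The scaling by 1/(1 − p) keeps the expected activation unchanged; this is a common convention rather than something stated on the slide, and the function name is illustrative.

```python
import numpy as np

def dropout(h, drop_prob=0.5, training=True, rng=np.random.default_rng(0)):
    """Randomly zero activations with probability drop_prob during training."""
    if not training or drop_prob == 0.0:
        return h                                     # at test time, use all units
    mask = rng.random(h.shape) >= drop_prob          # 1 = keep, 0 = drop
    return h * mask / (1.0 - drop_prob)              # inverted-dropout scaling

h = np.array([[1.0, 2.0, 3.0, 4.0],
              [5.0, 6.0, 7.0, 8.0]])
print(dropout(h, drop_prob=0.5))
```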
Early Stopping

• A test on the validation set can be inserted during training. When the loss on the validation set increases, perform early stopping.

Figure: the early stopping point, where the validation loss starts to increase while the training loss keeps decreasing.
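A sketch of a patience-based early-stopping loop. The callables `train_one_epoch` and `validation_loss`, and the patience value, are hypothetical placeholders for whatever training code is used.

```python
def train_with_early_stopping(train_one_epoch, validation_loss, max_epochs=100, patience=5):
    """Stop training when the validation loss has not improved for `patience` epochs."""
    best_loss = float("inf")
    epochs_without_improvement = 0
    for epoch in range(max_epochs):
        train_one_epoch()                    # one pass over the training set
        loss = validation_loss()             # evaluate on the held-out validation set
        if loss < best_loss:
            best_loss = loss                 # validation loss improved: keep training
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                print(f"Early stopping at epoch {epoch}, best validation loss {best_loss:.4f}")
                break
    return best_loss
```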
Contents

1. Deep Learning Summary

2. Training Rules

3. Activation Function

4. Normalizer

5. Optimizer

6. Types of Neural Networks

7. Common Problems

Optimizer

• There are various optimized versions of gradient descent algorithms. In object-oriented language implementations, different gradient descent algorithms are often encapsulated into objects called optimizers.

• Purposes of the algorithm optimization include but are not limited to:
- Accelerating algorithm convergence.
- Preventing or jumping out of local extreme values.
- Simplifying manual parameter setting, especially the learning rate (LR).

• Common optimizers: common GD optimizer, momentum optimizer, Nesterov, AdaGrad, AdaDelta, RMSProp, Adam, AdaMax, and Nadam.
Momentum Optimizer

• A most basic improvement is to add a momentum term to Δw_ji. Assume that the weight correction of the n-th iteration is Δw_ji(n). The weight correction rule is:

  Δw^l_ji(n) = −η δ^{l+1}_i x^l_j(n) + α Δw^l_ji(n − 1)

• where α is a constant (0 ≤ α < 1) called the momentum coefficient and α Δw_ji(n − 1) is the momentum term.

• Imagine a small ball rolling down from a random point on the error surface. The introduction of the momentum term is equivalent to giving the small ball inertia.

Figure: the gradient step −η δ^{l+1}_i x^l_j(n) combined with the momentum term on the error surface.
Advantages and Disadvantages of Momentum Optimizer

• Advantages:
- Enhances the stability of the gradient correction direction and reduces mutations.
- In areas where the gradient direction is stable, the ball rolls faster and faster (there is a speed upper limit because α < 1), which helps the ball quickly cross the flat area and accelerates convergence.
- A small ball with inertia is more likely to roll over some narrow local extrema.

• Disadvantages:
- The learning rate η and momentum α need to be manually set, which often requires more experiments to determine the appropriate values.
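A sketch of the momentum update written in a generic parameter/gradient form, where `grad` plays the role of δ^{l+1} x^l in the formula above. The function name and the toy loss are illustrative assumptions.

```python
import numpy as np

def momentum_step(w, grad, velocity, eta=0.01, alpha=0.9):
    """Momentum update: v <- -eta*grad + alpha*v ; w <- w + v."""
    velocity = -eta * grad + alpha * velocity     # current step plus inertia from the last step
    return w + velocity, velocity

w = np.array([1.0, -2.0])
v = np.zeros_like(w)
for _ in range(3):
    grad = 2.0 * w                                # gradient of the toy loss ||w||^2
    w, v = momentum_step(w, grad, v)
    print(np.round(w, 4))
```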
AdaGrad Optimizer (1)

• The common feature of the stochastic gradient descent (SGD) algorithm, the mini-batch gradient descent (MBGD) algorithm, and the momentum optimizer is that each parameter is updated with the same learning rate.

• According to the approach of AdaGrad, different learning rates need to be set for different parameters:

  g_t = ∂C(t, o)/∂w_t                (gradient calculation)
  r_t = r_{t−1} + g_t^2              (square gradient accumulation)
  Δw_t = −(η / (ε + √r_t)) · g_t     (computing the update)
  w_{t+1} = w_t + Δw_t               (applying the update)

• g_t indicates the gradient at step t; r is a gradient accumulation variable whose initial value is 0 and which increases continuously; η indicates the global learning rate, which needs to be set manually; ε is a small constant (about 10^−7) added for numerical stability.
AdaGrad Optimizer (2)

• The AdaGrad optimization algorithm shows that r continues increasing while the overall learning rate keeps decreasing as the algorithm iterates. This is because we want the learning rate to decrease as the number of updates increases. In the initial learning phase, we are far away from the optimal solution to the loss function. As the number of updates increases, we get closer to the optimal solution, and therefore the learning rate can decrease.

• Pros:
- The learning rate is automatically updated. As the number of updates increases, the learning rate decreases.

• Cons:
- The denominator keeps accumulating, so the learning rate will eventually become very small and the algorithm will become ineffective.
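A sketch of the AdaGrad update following the four formulas on the earlier slide; the names and toy gradient are illustrative assumptions.

```python
import numpy as np

def adagrad_step(w, grad, r, eta=0.1, eps=1e-7):
    """AdaGrad: accumulate squared gradients and scale each parameter's step by them."""
    r = r + grad ** 2                              # r_t = r_{t-1} + g_t^2 (per parameter)
    w = w - eta / (eps + np.sqrt(r)) * grad        # larger accumulated r -> smaller step
    return w, r

w = np.array([1.0, -2.0])
r = np.zeros_like(w)
for _ in range(3):
    grad = 2.0 * w
    w, r = adagrad_step(w, grad, r)
    print(np.round(w, 4))
```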
RMSProp Optimizer

• The RMSProp optimizer is an improved AdaGrad optimizer. It introduces an attenuation coefficient to ensure a certain attenuation ratio for r in each round.

• The RMSProp optimizer solves the problem that the AdaGrad optimizer ends the optimization process too early. It is suitable for handling non-stationary targets and works well for the RNN.

  g_t = ∂C(t, o)/∂w_t                (gradient calculation)
  r_t = β r_{t−1} + (1 − β) g_t^2    (square gradient accumulation)
  Δw_t = −(η / (ε + √r_t)) · g_t     (computing the update)
  w_{t+1} = w_t + Δw_t               (applying the update)

• g_t indicates the gradient at step t; r is a gradient accumulation variable whose initial value is 0 and which does not necessarily keep increasing, being adjusted by the attenuation factor β; η indicates the global learning rate, which needs to be set manually; ε is a small constant (about 10^−7) added for numerical stability.
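The same toy example with the RMSProp update, where the squared-gradient accumulator decays with β instead of growing without bound; names and values are illustrative.

```python
import numpy as np

def rmsprop_step(w, grad, r, eta=0.01, beta=0.9, eps=1e-7):
    """RMSProp: exponentially decayed accumulation of squared gradients."""
    r = beta * r + (1.0 - beta) * grad ** 2        # r_t = beta*r_{t-1} + (1-beta)*g_t^2
    w = w - eta / (eps + np.sqrt(r)) * grad
    return w, r

w = np.array([1.0, -2.0])
r = np.zeros_like(w)
for _ in range(3):
    grad = 2.0 * w
    w, r = rmsprop_step(w, grad, r)
    print(np.round(w, 4))
```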
Adam Optimizer (1)

• Adaptive Moment Estimation (Adam): Developed based on AdaGrad and AdaDelta, Adam maintains two additional variables m_t and v_t for each variable to be trained:

  m_t = β_1 m_{t−1} + (1 − β_1) g_t
  v_t = β_2 v_{t−1} + (1 − β_2) g_t^2

• where t represents the t-th iteration and g_t is the calculated gradient. m_t and v_t are moving averages of the gradient and the squared gradient. From the statistical perspective, m_t and v_t are estimates of the first moment (the average value) and the second moment (the uncentered variance) of the gradients respectively, which also explains why the method is so named.
Adam Optimizer (2)

• If m_t and v_t are initialized using the zero vector, they are close to 0 during the initial iterations, especially when β_1 and β_2 are close to 1. To solve this problem, we use the bias-corrected estimates m̂_t and v̂_t:

  m̂_t = m_t / (1 − β_1^t)
  v̂_t = v_t / (1 − β_2^t)

• The weight update rule of Adam is as follows:

  w_{t+1} = w_t − (η / (√v̂_t + ε)) · m̂_t

• Although the rule involves manual setting of η, β_1, and β_2, the setting is much simpler. According to experiments, the default settings are β_1 = 0.9, β_2 = 0.999, ε = 10^−8, and η = 0.001. In practice, Adam converges quickly. When convergence saturation is reached, the learning rate η can be reduced. After several reductions, a satisfying local extremum will be obtained. Other parameters do not need to be adjusted.
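A sketch of the Adam update with bias correction, using the default hyperparameters quoted above; the function and variable names are illustrative assumptions.

```python
import numpy as np

def adam_step(w, grad, m, v, t, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update (t is the 1-based iteration counter)."""
    m = beta1 * m + (1 - beta1) * grad             # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad ** 2        # second-moment estimate
    m_hat = m / (1 - beta1 ** t)                   # bias correction
    v_hat = v / (1 - beta2 ** t)
    w = w - eta * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

w = np.array([1.0, -2.0])
m = np.zeros_like(w)
v = np.zeros_like(w)
for t in range(1, 4):
    grad = 2.0 * w
    w, m, v = adam_step(w, grad, m, v, t)
    print(np.round(w, 4))
```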
Optimizer Performance Comparison

Figures: comparison of optimization algorithms in contour maps of loss functions, and comparison of optimization algorithms at the saddle point.
Contents

1. Deep Learning Summary

2. Training Rules

3. Activation Function

4. Normalizer

5. Optimizer

6. Types of Neural Networks

7. Common Problems

Convolutional Neural Network

• A convolutional neural network (CNN) is a feedforward neural network. Its artificial neurons may respond to surrounding units within the coverage range. CNN excels at image processing. It includes a convolutional layer, a pooling layer, and a fully connected layer.

• In the 1960s, Hubel and Wiesel studied cats' cortex neurons used for local sensitivity and direction selection and found that their unique network structure could simplify feedback neural networks. They then proposed the CNN.

• Now, CNN has become one of the research hotspots in many scientific fields, especially in the pattern classification field. The network is widely used because it can avoid complex pre-processing of images and directly take original images as input.
Main Concepts of CNN

• Local receptive field: It is generally considered that human perception of the outside world goes from local to global. Spatial correlations among local pixels of an image are closer than those among distant pixels. Therefore, each neuron does not need to know the global image; it only needs to know the local image. The local information is combined at a higher level to generate global information.

• Parameter sharing: One or more filters/kernels may be used to scan input images. Parameters carried by the filters are weights. In a layer scanned by filters, each filter uses the same parameters during weighted computation. Weight sharing means that when each filter scans an entire image, the parameters of the filter are fixed.
Architecture of Convolutional Neural Network

Figure: input image → convolutional layer (convolution + nonlinearity) → pooling layer (max pooling) → convolutional layer → pooling layer → vectorization → fully connected layer → output layer with multi-category probabilities (e.g., P_bird, P_sunset, P_dog, P_cat).

• Convolution layers + pooling layers extract features; the fully connected layer performs the multi-category classification.
Single-Filter Calculation (1)

• Description of the convolution calculation.
Single-Filter Calculation (2)

• Demonstration of the convolution calculation.

Figure source: Han Bingtao, 2017, Convolutional Neural Network.
Convolutional Layer

• The basic architecture of a CNN is multi-channel convolution consisting of multiple single convolutions. The output of the previous layer (or the original image at the first layer) is used as the input of the current layer. It is then convolved with the filters in the layer and serves as the output of this layer. The convolution kernel of each layer is the weight to be learned. Similar to an FCN, after the convolution is complete, the result should be biased and activated through activation functions before being input to the next layer.

Figure: an input tensor convolved with kernels W1…Wn, plus biases b1…bn, followed by activation, producing the output tensor (feature maps F1…Fn).
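A minimal NumPy sketch of a single-filter 2-D convolution (technically cross-correlation, as commonly implemented in CNNs) with a bias and ReLU activation. Stride 1, no padding, and the particular kernel are simplifying assumptions.

```python
import numpy as np

def conv2d_single_filter(image, kernel, bias=0.0):
    """Valid cross-correlation of a 2-D image with one 2-D kernel, plus bias and ReLU."""
    H, W = image.shape
    kh, kw = kernel.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = image[i:i + kh, j:j + kw]          # local receptive field
            out[i, j] = np.sum(patch * kernel) + bias  # shared weights at every position
    return np.maximum(out, 0.0)                        # activation (ReLU)

image = np.arange(25, dtype=float).reshape(5, 5)
kernel = np.array([[-1.0, 0.0], [0.0, 1.0]])           # diagonal-difference filter
print(conv2d_single_filter(image, kernel))             # every output equals 6 for this pair
```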
Pooling Layer

• Pooling combines nearby units to reduce the size of the input on the next layer, reducing dimensions. Common pooling includes max pooling and average pooling. When max pooling is used, the maximum value in a small square area is selected as the representative of this area, while the mean value is selected as the representative when average pooling is used. The side of this small area is the pooling window size. The following figure shows the max pooling operation whose pooling window size is 2.

Figure: a 2×2 max pooling window sliding over the feature map.
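A small sketch of non-overlapping 2×2 max pooling; the window size and stride of 2 match the figure description, and the feature-map values are an illustrative example.

```python
import numpy as np

def max_pool_2x2(x):
    """Non-overlapping 2x2 max pooling over a 2-D feature map (height and width must be even)."""
    H, W = x.shape
    x = x.reshape(H // 2, 2, W // 2, 2)        # split the map into 2x2 blocks
    return x.max(axis=(1, 3))                  # take the maximum of each block

feature_map = np.array([[1., 3., 2., 4.],
                        [5., 6., 7., 8.],
                        [3., 2., 1., 0.],
                        [1., 2., 3., 4.]])
print(max_pool_2x2(feature_map))               # [[6. 8.] [3. 4.]]
```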
Fully Connected Layer

• The fully connected layer is essentially a classifier. The features extracted by the convolutional layers and pooling layers are flattened and fed to the fully connected layer to output and classify results.

• Generally, the Softmax function is used as the activation function of the final fully connected output layer to combine all local features into global features and calculate the score of each type:

  σ(z)_j = e^{z_j} / Σ_{k=1}^{K} e^{z_k}
Recurrent Neural Network

• The recurrent neural network (RNN) is a neural network that captures dynamic information in sequential data through periodical connections of hidden layer nodes. It can classify sequential data.

• Unlike other forward neural networks, the RNN can keep a context state and even store, learn, and express related information in context windows of any length. Different from traditional neural networks, it is not limited to the space boundary, but also supports time sequences. In other words, there is an edge between the hidden layer of the current moment and the hidden layer of the next moment.

• The RNN is widely used in scenarios related to sequences, such as videos consisting of image frames, audio consisting of clips, and sentences consisting of words.
Recurrent Neural Network Architecture (1)

• X_t is the input of the input sequence at time t.
• S_t is the memory unit of the sequence at time t and caches previous information:

  S_t = tanh(U X_t + W S_{t−1})

• O_t is the output of the hidden layer of the sequence at time t:

  O_t = tanh(V S_t)

• After O_t passes through multiple hidden layers, the final output of the sequence at time t can be obtained.
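A sketch of the recurrence above unrolled over a short sequence. The layer sizes and random initialization are illustrative assumptions, and the output non-linearity follows the slide's formula O_t = tanh(V S_t).

```python
import numpy as np

rng = np.random.default_rng(0)
input_size, hidden_size, output_size = 3, 4, 2

U = rng.normal(scale=0.1, size=(hidden_size, input_size))   # input -> hidden
W = rng.normal(scale=0.1, size=(hidden_size, hidden_size))  # hidden -> hidden (recurrent)
V = rng.normal(scale=0.1, size=(output_size, hidden_size))  # hidden -> output

def rnn_forward(inputs):
    """Run the simple RNN over a list of input vectors and return the outputs O_t."""
    S = np.zeros(hidden_size)                    # initial memory S_0
    outputs = []
    for X_t in inputs:
        S = np.tanh(U @ X_t + W @ S)             # S_t = tanh(U X_t + W S_{t-1})
        outputs.append(np.tanh(V @ S))           # O_t = tanh(V S_t), as on the slide
    return outputs

sequence = [rng.normal(size=input_size) for _ in range(5)]
for t, O_t in enumerate(rnn_forward(sequence), start=1):
    print(f"O_{t} =", np.round(O_t, 3))
```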
Recurrent Neural Network Architecture (2)

Figure source: LeCun, Bengio, and G. Hinton, 2015, a recurrent neural network and the unfolding in time of the computation involved in its forward computation.
Types of Recurrent Neural Networks

Figure source: Andrej Karpathy, 2015, The Unreasonable Effectiveness of Recurrent Neural Networks.
Backpropagation Through Time (BPTT)

• BPTT:
- It is the extension of traditional backpropagation over the time sequence.
- There are two sources of error for the memory unit at time t: the first is the error from the hidden layer output at time step t; the second is the error from the memory cell at the next time step t + 1.
- The longer the time sequence, the more likely it is that the loss at the last time step causes a vanishing gradient or exploding gradient problem for the gradient of w at the first time step.
- The total gradient of weight w is the accumulation of the gradients of the weight at all time steps.

• Three steps of BPTT:
- Compute the output value of each neuron through forward propagation.
- Compute the error value δ_j of each neuron through backpropagation.
- Compute the gradient of each weight.

• Weights are then updated using the SGD algorithm.
Recurrent Neural Network Problem

• S_t = σ(U X_t + W S_{t−1}) is expanded over the time sequence:

  S_t = σ(U X_t + W σ(U X_{t−1} + W σ(U X_{t−2} + W …)))

• Although the standard RNN structure solves the problem of information memory, the information attenuates during long-term memory.

• Information needs to be saved for a long time in many tasks. For example, a hint at the beginning of a speculative fiction may not be answered until the end.

• The RNN may not be able to save information for long due to the limited memory unit capacity.

• We expect that memory units can remember key information.
Long Short-Term Memory Network

Figure source: Colah, 2015, Understanding LSTM Networks.
Gated Recurrent Unit (GRU)
Generative Adversarial Network (GAN)

• A generative adversarial network (GAN) is a framework that trains generator G and discriminator D through an adversarial process. Through the adversarial process, the discriminator learns to tell whether a sample from the generator is fake or real. GAN adopts a mature BP algorithm.

• (1) Generator G: The input is noise z, which complies with a manually selected prior probability distribution, such as a uniform distribution or a Gaussian distribution. The generator adopts the network structure of the multilayer perceptron (MLP), uses maximum likelihood estimation (MLE) parameters to represent the differentiable mapping G(z), and maps the input space to the sample space.

• (2) Discriminator D: The input is the real sample x and the fake sample G(z), which are tagged as real and fake respectively. The network of the discriminator can use an MLP carrying parameters. The output is the probability D(G(z)) that determines whether the sample is real or fake.

• GAN can be applied to scenarios such as image generation, text generation, speech enhancement, and image super-resolution.
GAN Architecture

Figure: generator/discriminator architecture.
Generative Model and Discriminative Model

• Generative network:
- Generates sample data.
- Input: Gaussian white noise vector z.
- Output: sample data vector x = G(z; θ_G).

• Discriminator network:
- Determines whether sample data is real.
- Input: real sample data x_real and generated sample data x = G(z).
- Output: probability y = D(x; θ_D) that determines whether the sample is real.
Training Rules of GAN

• Optimization objective:
- Value function:

  min_G max_D V(D, G) = E_{x∼p_data(x)}[log D(x)] + E_{z∼p_z(z)}[log(1 − D(G(z)))]

- In the early training stage, when the outcome of G is very poor, D determines that the generated sample is fake with high confidence, because the sample is obviously different from the training data. In this case, log(1 − D(G(z))) is saturated (the gradient is 0, and iteration cannot be performed). Therefore, we choose to train G by minimizing −log(D(G(z))) instead.
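A tiny numeric sketch of the value function and the two generator losses discussed above, using hypothetical discriminator outputs rather than a real trained network; the function names are illustrative.

```python
import numpy as np

def gan_value(d_real, d_fake):
    """V(D, G) = E[log D(x)] + E[log(1 - D(G(z)))] estimated from discriminator outputs."""
    return np.mean(np.log(d_real)) + np.mean(np.log(1.0 - d_fake))

def generator_loss_saturating(d_fake):
    """Original generator objective: minimize E[log(1 - D(G(z)))]."""
    return np.mean(np.log(1.0 - d_fake))

def generator_loss_nonsaturating(d_fake):
    """Alternative used in practice: minimize E[-log D(G(z))], which keeps gradients early on."""
    return np.mean(-np.log(d_fake))

d_real = np.array([0.90, 0.85, 0.95])       # D's outputs on real samples (assumed values)
d_fake = np.array([0.02, 0.01, 0.03])       # D's outputs on poor early generator samples (assumed)
print(gan_value(d_real, d_fake))
print(generator_loss_saturating(d_fake))    # near 0: almost no gradient signal for G
print(generator_loss_nonsaturating(d_fake)) # large: a much stronger training signal for G
```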
Contents

1. Deep Learning Summary

2. Training Rules

3. Activation Function

4. Normalizer

5. Optimizer

6. Types of Neural Networks

7. Common Problems

Data Imbalance (1)

• Problem description: In a dataset consisting of various task categories, the number of samples varies greatly from one category to another. One or more of the predicted categories contain very few samples.

• For example, in an image recognition experiment, more than 2,000 categories among a total of 4,251 training images contain just one image each. Some of the others have 2-5 images.

• Impacts:
- Due to the unbalanced number of samples, we cannot get the optimal real-time result, because the model/algorithm never examines categories with very few samples adequately.
- Since few observation objects may not be representative for a class, we may fail to obtain adequate samples for verification and test.
Data Imbalance (2)

Common countermeasures:

- Random undersampling: deleting redundant samples in a category.
- Random oversampling: copying samples.
- Synthetic Minority Oversampling Technique (SMOTE): sampling and merging samples.
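A small sketch of random over- and under-sampling of a labeled dataset (purely illustrative; SMOTE-style synthetic interpolation is not shown, and the function name `rebalance` is an assumption).

```python
import numpy as np

def rebalance(X, y, target_count, rng=np.random.default_rng(0)):
    """Randomly over- or under-sample each class to `target_count` samples."""
    X_parts, y_parts = [], []
    for cls in np.unique(y):
        idx = np.where(y == cls)[0]
        replace = len(idx) < target_count          # oversample small classes with replacement
        chosen = rng.choice(idx, size=target_count, replace=replace)
        X_parts.append(X[chosen])
        y_parts.append(y[chosen])
    return np.concatenate(X_parts), np.concatenate(y_parts)

X = np.arange(12).reshape(6, 2)
y = np.array([0, 0, 0, 0, 0, 1])                   # imbalanced: five '0's, one '1'
Xb, yb = rebalance(X, y, target_count=3)
print(yb)                                          # three samples per class
```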
Vanishing Gradient and Exploding Gradient Problem (1)

• Vanishing gradient: As network layers increase, the derivative value of backpropagation decreases, which causes a vanishing gradient problem.

• Exploding gradient: As network layers increase, the derivative value of backpropagation increases, which causes an exploding gradient problem.

• Cause: y_i = σ(z_i) = σ(w_i x_i + b_i), where σ is the sigmoid function.

Figure: a chain of neurons with weights w_2, w_3, w_4 and biases b_1, b_2, b_3 feeding into the cost C.

• Backpropagation can be deduced as follows:

  ∂C/∂b_1 = (∂C/∂y_4)(∂y_4/∂z_4)(∂z_4/∂x_4)(∂x_4/∂z_3)(∂z_3/∂x_3)(∂x_3/∂z_2)(∂z_2/∂x_2)(∂x_2/∂z_1)(∂z_1/∂b_1)
          = (∂C/∂y_4) · σ′(z_4) w_4 · σ′(z_3) w_3 · σ′(z_2) w_2 · σ′(z_1)
Vanishing Gradient and Exploding Gradient Problem (2)

• The maximum value of σ′(x) is 1/4.

• However, the network weight w is usually smaller than 1; therefore, |σ′(z) w| ≤ 1/4. According to the chain rule, as layers increase, the derivation result ∂C/∂b_1 decreases, resulting in the vanishing gradient problem.

• When the network weight w is large, so that |σ′(z) w| > 1, the exploding gradient problem occurs.

• Solution: For example, gradient clipping is used to alleviate the exploding gradient problem, and the ReLU activation function and LSTM are used to alleviate the vanishing gradient problem.
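As one of the mitigations mentioned above, here is a sketch of gradient clipping by global norm; the threshold value and names are arbitrary illustrations.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm=5.0):
    """Rescale a list of gradient arrays so that their combined L2 norm is at most max_norm."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total_norm <= max_norm:
        return grads                                   # already small enough, leave unchanged
    scale = max_norm / total_norm                      # shrink every gradient by the same factor
    return [g * scale for g in grads]

grads = [np.array([30.0, -40.0]), np.array([0.5])]     # exploding-ish gradients
clipped = clip_by_global_norm(grads, max_norm=5.0)
print(clipped, np.sqrt(sum(np.sum(g ** 2) for g in clipped)))   # combined norm is now 5.0
```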
Overfitting

• Problem description: The model performs well on the training set, but badly on the test set.

• Root cause: There are too many feature dimensions, model assumptions, and parameters, too much noise, but very little training data. As a result, the fitting function perfectly predicts the training set, while the prediction result on the test set of new data is poor. The training data is over-fitted without considering generalization capabilities.

• Solution: For example, data augmentation, regularization, early stopping, and dropout.
Summary

• This chapter describes the definition and development of neural networks, perceptrons and their training rules, common types of neural networks (CNN, RNN, and GAN), and the common problems of neural networks in AI engineering and their solutions.
Quiz

1. (True or false) Compared with the recurrent neural network, the convolutional neural network is more suitable for image recognition. ( )

A. True

B. False

2. (True or false) GAN is a deep learning model, which is one of the most promising methods for unsupervised learning of complex distributions in recent years. ( )

A. True

B. False
Quiz

3. (Single-choice) There are many types of deep learning neural networks. Which of the following is not a deep learning neural network? ( )

A. CNN

B. RNN

C. LSTM

D. Logistic

4. (Multi-choice) There are many important "components" in the convolutional neural network architecture. Which of the following are convolutional neural network "components"? ( )

A. Activation function

B. Convolutional kernel

C. Pooling

D. Fully connected layer
Recommendations

• Online learning website:
- https://e.huawei.com/cn/talent/#/home

• Huawei Knowledge Base:
- https://support.huawei.com/enterprise/servicecenter?lang=zh
Thank you.

Bring digital to every person, home, and
organization for a fully connected,
intelligent world.

Copyright©2020 Huawei Technologies Co., Ltd.


All Rights Reserved.

The information in this document may contain predictive


statements including, without limitation, statements regarding
the future financial and operating results, future product
portfolio, new technology, etc. There are a number of factors that
could cause actual results and developments to differ materially
from those expressed or implied in the predictive statements.
Therefore, such information is provided for reference purpose
only and constitutes neither an offer nor an acceptance. Huawei
may change the information at any time without notice.
