
6.

Deep Learning Overview


COMPUTER VISION – TK44019
The course material for this chapter follows the HCIA - Artificial Intelligence module (the material is copyrighted by Huawei).
Deep Learning Overview
Foreword

• The chapter describes the basic knowledge of deep learning, including the development history of deep learning, components and types of deep learning neural networks, and common problems in deep learning projects.
Objectives

On completion of this course, you will be able to:

- Describe the definition and development of neural networks.
- Learn about the important components of deep learning neural networks.
- Understand training and optimization of neural networks.
- Describe common problems in deep learning.
Contents

1. Deep Learning Summary

2. Training Rules

3. Activation Function

4. Normalizer

5. Optimizer

6. Types of Neural Networks

7. Common Problems

Traditional Machine Learning and Deep Learning

• As a model based on unsupervised feature learning and feature hierarchy learning, deep learning has great advantages in fields such as computer vision, speech recognition, and natural language processing.

Comparison (traditional machine learning vs. deep learning):

- Hardware requirements: Traditional machine learning has low hardware requirements on the computer; given the limited computing amount, a GPU for parallel computing is generally not needed. Deep learning has higher hardware requirements; to execute matrix operations on massive data, the computer needs a GPU to perform parallel computing.
- Data amount: Traditional machine learning is applicable to training with a small data amount, and its performance cannot be improved continuously as the data amount increases. Deep learning performance can be high when high-dimensional weight parameters and massive training data are provided.
- Workflow: Traditional machine learning breaks a problem down level by level; deep learning uses end-to-end (E2E) learning.
- Feature engineering: Traditional machine learning relies on manual feature selection; deep learning uses algorithm-based automatic feature extraction.
- Interpretability: Traditional machine learning features are easy to explain; deep learning features are hard to explain.
Traditional Machine Learning

Typical workflow: issue analysis and problem locating → data cleansing → feature extraction → feature selection → model training → inference, prediction, and identification.

Question: Can we use an algorithm to automatically execute the procedure?
Deep Learning

• Generally, the deep learning architecture is a deep neural network. "Deep" in "deep learning" refers to the number of layers of the neural network.

Figure: a human neural network (dendrite, nucleus, axon, synapse), a perceptron (input layer, output layer), and a deep neural network (input layer, hidden layers, output layer).
Neural Network

• Currently, the definition of the neural network has not been determined yet. Hecht-Nielsen, a neural network researcher in the U.S., defines a neural network as a computer system composed of simple and highly interconnected processing elements, which process information by dynamic response to external inputs.

• A neural network can be simply expressed as an information processing system designed to imitate the human brain structure and functions based on its source, features, and explanations.

• Artificial neural network (neural network): Formed by artificial neurons connected to each other, the neural network extracts and simplifies the human brain's microstructure and functions. It is an important approach to simulate human intelligence and reflects several basic features of human brain functions, such as concurrent information processing, learning, association, model classification, and memory.
Development History of Neural Networks

Timeline: perceptron (1958) → XOR problem (1970) → MLP (1986) → SVM (1995) → deep network (2006), spanning the golden age and the AI winter of neural network research.
Single-Layer Perceptron

• Input vector: X = [x_0, x_1, …, x_n]^T
• Weight: W = [ω_0, ω_1, …, ω_n]^T, in which ω_0 is the offset (bias).
• Activation function:

  O = sign(net) = 1 if net > 0, −1 otherwise, where net = Σ_{i=0}^{n} ω_i x_i = W^T X.

• The preceding perceptron is equivalent to a classifier. It uses the high-dimensional vector X as the input and performs binary classification on input samples in the high-dimensional space. When W^T X > 0, O = 1 and the samples are classified into one type; otherwise, O = −1 and the samples are classified into the other type. The boundary between the two types is W^T X = 0, which is a high-dimensional hyperplane.

Classification point: Ax + B = 0 | Classification line: Ax + By + C = 0 | Classification plane: Ax + By + Cz + D = 0 | Classification hyperplane: W^T X + b = 0
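As a minimal illustration of the sign-activation perceptron described above, the following NumPy sketch classifies samples with a fixed weight vector. The function name `perceptron_predict` and the toy weights/data are illustrative assumptions, not part of the original material.

```python
import numpy as np

def perceptron_predict(X, w):
    """Single-layer perceptron: O = sign(W^T X), with w[0] acting as the offset (bias)."""
    X_aug = np.hstack([np.ones((X.shape[0], 1)), X])  # prepend x0 = 1 for the bias term
    net = X_aug @ w                                   # net = W^T X for every sample
    return np.where(net > 0, 1, -1)                   # sign activation: +1 / -1

# Toy example: the hyperplane x1 + x2 - 1 = 0 separates the two classes
w = np.array([-1.0, 1.0, 1.0])
X = np.array([[0.0, 0.0], [2.0, 1.0], [0.2, 0.3], [1.5, 0.0]])
print(perceptron_predict(X, w))   # -> [-1  1 -1  1]
```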
XOR Problem

• In 1969, Minsky, an American mathematician and AI pioneer, proved that a perceptron is essentially a linear model that can only deal with linear classification problems, but cannot process non-linear data.

Figure: truth-table plots of AND, OR, and XOR.
Feedforward Neural Network

Figure: input layer → hidden layer 1 → hidden layer 2 → output layer.
Solution of XOR

Figure: a feedforward network with one hidden layer (weights w0–w5) whose two hidden units are combined to compute XOR.
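To make the figure concrete, here is a small hand-weighted sketch: a 2-2-1 network of threshold units in which one hidden unit acts like OR, the other like AND, and the output unit combines them into XOR. The specific weights and the helper name `step` are illustrative assumptions, not taken from the slide.

```python
import numpy as np

def step(x):
    """Threshold activation: 1 if x > 0 else 0."""
    return (x > 0).astype(float)

def xor_net(x1, x2):
    h1 = step(x1 + x2 - 0.5)      # hidden unit 1 ~ OR(x1, x2)
    h2 = step(x1 + x2 - 1.5)      # hidden unit 2 ~ AND(x1, x2)
    return step(h1 - h2 - 0.5)    # output ~ "OR but not AND" = XOR

x1, x2 = np.array([0, 0, 1, 1]), np.array([0, 1, 0, 1])
print(xor_net(x1, x2))            # -> [0. 1. 1. 0.]
```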
Impacts of Hidden Layers on a Neural Network

Figure: decision boundaries learned with 0 hidden layers, 3 hidden layers, and 20 hidden layers.
Contents

1. Deep Learning Summary

2. Training Rules

3. Activation Function

4. Normalizer

5. Optimizer

6. Types of Neural Networks

7. Common Problems

Gradient Descent and Loss Function

• The gradient of the multivariate function o = f(x) = f(x_0, x_1, …, x_n) at X′ = [x_0′, x_1′, …, x_n′]^T is:

  ∇f(x_0′, x_1′, …, x_n′) = [∂f/∂x_0, ∂f/∂x_1, …, ∂f/∂x_n]^T |_{X=X′}

  The direction of the gradient vector is the direction in which the function grows fastest. As a result, the direction of the negative gradient vector −∇f is the direction in which the function descends fastest.

• During the training of a deep learning network, target classification errors must be parameterized. A loss function (error function) is used, which reflects the error between the target output and the actual output of the perceptron. For a single training sample x, the most common error function is the quadratic cost function:

  E(w) = 1/2 Σ_{d∈D} (t_d − o_d)^2,

  where d is one neuron in the output layer, D is the set of all neurons in the output layer, t_d is the target output, and o_d is the actual output.

• The gradient descent method searches along the negative gradient direction and updates the parameters iteratively, finally minimizing the loss function.
Extrema of the Loss Function

• Purpose: The loss function E(W) is defined on the weight space. The objective is to search for the weight vector W that minimizes E(W).

• Limitation: No effective mathematical method can solve for the extremum on the complex high-dimensional surface of E(W) = 1/2 Σ_{d∈D} (t_d − o_d)^2.

Figure: example of gradient descent on a paraboloid in two variables.
Common Loss Functions in Deep Learning

• Quadratic cost function:

  E(W) = 1/2 Σ_{d∈D} (t_d − o_d)^2

• Cross-entropy error function:

  E(W) = −(1/n) Σ_x Σ_{d∈D} [t_d ln(o_d) + (1 − t_d) ln(1 − o_d)]

• The cross-entropy error function depicts the distance between two probability distributions, and is a widely used loss function for classification problems.

• Generally, the mean square error function is used to solve the regression problem, while the cross-entropy error function is used to solve the classification problem.
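A minimal NumPy sketch of the two loss functions above. The function names are illustrative, and the cross-entropy form assumes outputs in (0, 1) as on the slide; the small clipping constant is a common numerical safeguard rather than something stated in the material.

```python
import numpy as np

def quadratic_cost(t, o):
    """E = 1/2 * sum_d (t_d - o_d)^2 for one sample."""
    return 0.5 * np.sum((t - o) ** 2)

def cross_entropy(t, o, eps=1e-12):
    """E = -1/n * sum over samples of sum_d [t ln o + (1 - t) ln(1 - o)]."""
    o = np.clip(o, eps, 1 - eps)          # avoid log(0)
    per_sample = -np.sum(t * np.log(o) + (1 - t) * np.log(1 - o), axis=1)
    return np.mean(per_sample)

t = np.array([[1.0, 0.0], [0.0, 1.0]])    # target outputs
o = np.array([[0.9, 0.2], [0.3, 0.8]])    # actual outputs
print(quadratic_cost(t[0], o[0]))         # 0.5 * (0.1^2 + 0.2^2) = 0.025
print(cross_entropy(t, o))
```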
Batch Gradient Descent (BGD) Algorithm

• In the training sample set D, each sample is recorded as <X, t>, in which X is the input vector, t the target output, o the actual output, and η the learning rate.

- Initialize each w_i to a random value with a small absolute value.
- Until the end condition is met:
  - Initialize each Δw_i to zero.
  - For each <X, t> in D:
    - Input X to this unit and calculate the output o.
    - For each w_i in this unit: Δw_i += −η · (1/n) · Σ_{d∈D} ∂C(t_d, o_d)/∂w_i
  - For each w_i in this unit: w_i += Δw_i.

• The gradient descent algorithm of this version is not commonly used because the convergence process is very slow, as all training samples need to be calculated every time the weight is updated.
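A runnable sketch of batch gradient descent for a single linear unit with the quadratic cost: one weight update per full pass over the data. The data, learning rate, and the function name `batch_gradient_descent` are illustrative assumptions.

```python
import numpy as np

def batch_gradient_descent(X, t, eta=0.1, epochs=100):
    """Batch GD for a linear unit o = X @ w with quadratic cost E = 1/(2n) * sum (t - o)^2."""
    n, d = X.shape
    w = np.random.uniform(-0.05, 0.05, size=d)   # small random initial weights
    for _ in range(epochs):
        o = X @ w                                # outputs for ALL training samples
        grad = -(X.T @ (t - o)) / n              # dE/dw averaged over the whole set
        w -= eta * grad                          # one weight update per full pass
    return w

rng = np.random.default_rng(0)
X = np.hstack([np.ones((50, 1)), rng.normal(size=(50, 1))])   # bias column + one feature
t = 2.0 + 3.0 * X[:, 1] + 0.1 * rng.normal(size=50)           # noisy line t = 2 + 3x
print(batch_gradient_descent(X, t))   # approximately [2.0, 3.0]
```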
Stochastic Gradient Descent (SGD) Algorithm

• To address the BGD algorithm defect, a common variant called the incremental gradient descent algorithm is used, which is also called the stochastic gradient descent (SGD) algorithm. One implementation is called online learning, which updates the gradient based on each sample:

  Δw_i = −η · (1/n) · Σ_{d∈D} ∂C(t_d, o_d)/∂w_i   ⟹   Δw_i = −η · Σ_{d∈D} ∂C(t_d, o_d)/∂w_i

• ONLINE-GRADIENT-DESCENT
- Initialize each w_i to a random value with a small absolute value.
- Until the end condition is met:
  - Generate a random <X, t> from D and do the following calculation:
    - Input X to this unit and calculate the output o.
    - For each w_i in this unit: w_i += −η · Σ_{d∈D} ∂C(t_d, o_d)/∂w_i
Mini-Batch Gradient Descent (MBGD) Algorithm

• To address the defects of the previous two gradient descent algorithms, the mini-batch gradient descent (MBGD) algorithm was proposed and has been most widely used. A small number of batch size (BS) samples are used at a time to calculate Δw_i, and then the weight is updated accordingly.

• Mini-batch gradient descent
- Initialize each w_i to a random value with a small absolute value.
- Until the end condition is met:
  - Initialize each Δw_i to zero.
  - For each <X, t> in the BS samples of the next batch in D:
    - Input X to this unit and calculate the output o.
    - For each w_i in this unit: Δw_i += −η · (1/n) · Σ_{d∈D} ∂C(t_d, o_d)/∂w_i
  - For each w_i in this unit: w_i += Δw_i.
  - After the last batch, the training samples are shuffled into a random order.
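A sketch of the mini-batch variant for the same linear unit. The batch size, epochs, and names are illustrative; with BS equal to the dataset size this reduces to BGD, and with BS = 1 it behaves like SGD.

```python
import numpy as np

def minibatch_gradient_descent(X, t, eta=0.1, batch_size=8, epochs=50, seed=0):
    """Mini-batch GD for a linear unit o = X @ w with quadratic cost."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = rng.uniform(-0.05, 0.05, size=d)
    for _ in range(epochs):
        order = rng.permutation(n)                   # shuffle the samples each epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]    # indices of the current mini-batch
            Xb, tb = X[idx], t[idx]
            o = Xb @ w
            grad = -(Xb.T @ (tb - o)) / len(idx)     # gradient averaged over the batch
            w -= eta * grad                          # update after every mini-batch
    return w

rng = np.random.default_rng(1)
X = np.hstack([np.ones((64, 1)), rng.normal(size=(64, 1))])
t = 2.0 + 3.0 * X[:, 1] + 0.1 * rng.normal(size=64)
print(minibatch_gradient_descent(X, t))   # approximately [2.0, 3.0]
```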
Backpropagation Algorithm (1)

• Signals are propagated in the forward direction, and errors are propagated in the backward direction.

• In the training sample set D, each sample is recorded as <X, t>, in which X is the input vector, t the target output, o the actual output, and w the weight coefficient.

• Loss function:

  E(w) = 1/2 Σ_{d∈D} (t_d − o_d)^2

Figure: a network with an input layer (x_1, x_2, x_3), a hidden layer, and an output layer (o_1, o_2); forward propagation runs from input to output, and backpropagation runs from output to input.
Backpropagation Algorithm (2)

• According to the following formulas, errors in the input, hidden, and output layers are accumulated to generate the error in the loss function.

• w_c is the weight coefficient between the hidden layer and the output layer, while w_b is the weight coefficient between the input layer and the hidden layer. f is the activation function, D is the output layer set, and C and B are the hidden layer set and input layer set respectively. Assume that the loss function is a quadratic cost function:

- Output layer error:

  E = 1/2 Σ_{d∈D} (t_d − o_d)^2

- Error expanded to the hidden layer:

  E = 1/2 Σ_{d∈D} (t_d − f(net_d))^2 = 1/2 Σ_{d∈D} (t_d − f(Σ_{c∈C} w_c y_c))^2

- Error expanded to the input layer:

  E = 1/2 Σ_{d∈D} (t_d − f(Σ_{c∈C} w_c f(net_c)))^2 = 1/2 Σ_{d∈D} (t_d − f(Σ_{c∈C} w_c f(Σ_{b∈B} w_b x_b)))^2
Backpropagation Algorithm (3)

• To minimize error E, the gradient descent iterative calculation can be used to solve w_c and w_b, that is, calculating the w_c and w_b that minimize error E.

• Formula:

  Δw_c = −η ∂E/∂w_c,  c ∈ C
  Δw_b = −η ∂E/∂w_b,  b ∈ B

• If there are multiple hidden layers, the chain rule is used to take a derivative for each layer to obtain the optimized parameters by iteration.
Backpropagation Algorithm (4)

• For a neural network with any number of layers, the arranged formula for training is as follows:

  Δw^l_jk = −η δ^{l+1}_k f(z^l_j)

  δ^l_j = f′(z^l_j)(t_j − f(z^l_j)),        if l ∈ outputs   (1)
  δ^l_j = Σ_k δ^{l+1}_k w^l_jk f′(z^l_j),   otherwise        (2)

• The BP algorithm is used to train the network as follows:

- Take out the next training sample <X, t>, input X to the network, and obtain the actual output o.
- Calculate the output layer δ according to the output layer error formula (1).
- Calculate δ of each hidden layer from output to input by iteration according to the hidden layer error propagation formula (2).
- According to the δ of each layer, update the weight values of all the layers.
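The following self-contained sketch trains a one-hidden-layer sigmoid network on XOR with the quadratic cost, following the forward/backward pattern described above. The layer sizes, learning rate, epoch count, and variable names are illustrative assumptions, and the delta terms are written in the usual dE/dz form (the sign convention of formulas (1) and (2) is folded into the subtraction in the update).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
t = np.array([[0], [1], [1], [0]], dtype=float)          # XOR targets

W1 = rng.uniform(-1, 1, size=(2, 4))                     # input -> hidden weights
b1 = np.zeros(4)
W2 = rng.uniform(-1, 1, size=(4, 1))                     # hidden -> output weights
b2 = np.zeros(1)
eta = 0.5

for epoch in range(10000):
    # Forward propagation
    z1 = X @ W1 + b1
    h = sigmoid(z1)                                      # hidden activations
    z2 = h @ W2 + b2
    o = sigmoid(z2)                                      # network output

    # Backpropagation of the quadratic cost E = 1/2 * sum (t - o)^2
    delta2 = (o - t) * o * (1 - o)                       # output-layer error term (cf. formula (1))
    delta1 = (delta2 @ W2.T) * h * (1 - h)               # hidden-layer error term (cf. formula (2))

    # Gradient descent weight updates
    W2 -= eta * h.T @ delta2
    b2 -= eta * delta2.sum(axis=0)
    W1 -= eta * X.T @ delta1
    b1 -= eta * delta1.sum(axis=0)

# Typically converges to outputs close to [[0], [1], [1], [0]]
print(np.round(sigmoid(sigmoid(X @ W1 + b1) @ W2 + b2), 2))
```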
Contents

1. Deep Learning Summary

2. Training Rules

3. Activation Function

4. Normalizer

5. Optimizer

6. Types of Neural Networks

7. Common Problems

Activation Function

• Activation functions are important for the neural network model to learn and understand complex non-linear functions. They allow the introduction of non-linear features to the network.

• Without activation functions, output signals are only simple linear functions. The complexity of linear functions is limited, and the capability of learning complex function mappings from data is low.

Activation function:

  output = f(w_1 x_1 + w_2 x_2 + w_3 x_3) = f(W · X)
Sigmoid

  f(x) = 1 / (1 + e^(−x))
Tanh

  tanh(x) = (e^x − e^(−x)) / (e^x + e^(−x))
Softsign

  f(x) = x / (|x| + 1)
Rectified Linear Unit (ReLU)

  y = x if x ≥ 0; y = 0 if x < 0
Softplus

  f(x) = ln(e^x + 1)
Softmax

• Softmax function:

  σ(z)_j = e^{z_j} / Σ_{k=1}^{K} e^{z_k}

• The Softmax function is used to map a K-dimensional vector of arbitrary real values to another K-dimensional vector of real values, where each vector element is in the interval (0, 1). All the elements add up to 1.

• The Softmax function is often used as the output layer of a multiclass classification task.
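A compact NumPy sketch of the activation functions listed in this section. The max-subtraction inside the softmax is a standard numerical-stability trick, not something stated on the slides.

```python
import numpy as np

def sigmoid(x):   return 1.0 / (1.0 + np.exp(-x))
def tanh(x):      return np.tanh(x)                   # (e^x - e^-x) / (e^x + e^-x)
def softsign(x):  return x / (np.abs(x) + 1.0)
def relu(x):      return np.maximum(0.0, x)
def softplus(x):  return np.log(np.exp(x) + 1.0)

def softmax(z):
    """Map a K-dimensional vector to probabilities in (0, 1) that sum to 1."""
    e = np.exp(z - np.max(z))                         # subtract the max for numerical stability
    return e / np.sum(e)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
for f in (sigmoid, tanh, softsign, relu, softplus):
    print(f.__name__, np.round(f(x), 3))
print("softmax", np.round(softmax(x), 3), "sum =", softmax(x).sum())
```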
Contents

1. Deep Learning Summary

2. Training Rules

3. Activation Function

4. Normalizer

5. Optimizer

6. Types of Neural Networks

7. Common Problems

Normalizer

• Regularization is an important and effective technology to reduce generalization errors in machine learning. It is especially useful for deep learning models, which tend to be over-fit due to a large number of parameters. Therefore, researchers have proposed many effective technologies to prevent over-fitting, including:

- Adding constraints to parameters, such as L1 and L2 norms
- Expanding the training set, such as adding noise and transforming data
- Dropout
- Early stopping
Penalty Parameters

• Many regularization methods restrict the learning capability of models by adding a penalty parameter Ω(θ) to the objective function J. Assume that the target function after regularization is J̃:

  J̃(θ; X, y) = J(θ; X, y) + αΩ(θ),

• where α ∈ [0, ∞) is a hyperparameter that weights the relative contribution of the norm penalty term Ω and the standard objective function J(X; θ). If α is set to 0, no regularization is performed. The penalty in regularization increases with α.
L1 Regularization

• Add an L1 norm constraint to model parameters, that is,

  J̃(w; X, y) = J(w; X, y) + α ||w||_1,

• If a gradient method is used to solve the problem, the parameter gradient is

  ∇J̃(w) = α · sign(w) + ∇J(w).
L2 Regularization

• Add an L2 norm penalty term to prevent overfitting:

  J̃(w; X, y) = J(w; X, y) + (α/2) ||w||_2^2,

• A parameter optimization method can be inferred using an optimization technology (such as a gradient method):

  w = (1 − εα) w − ε ∇J(w),

• where ε is the learning rate. Compared with a common gradient optimization formula, this formula multiplies the parameter by a reduction factor.
L1 vs. L2

• The major differences between L2 and L1:

- According to the preceding analysis, L1 can generate a sparser model than L2. When the value of parameter w is small, L1 regularization can directly reduce the parameter value to 0, which can be used for feature selection.
- From the perspective of probability, many norm constraints are equivalent to adding a prior probability distribution to the parameters. In L2 regularization, the parameter value complies with the Gaussian distribution; in L1 regularization, the parameter value complies with the Laplace distribution.

Figure: contours of the L1 and L2 penalties.
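A small sketch of how the two penalties change a gradient update. The toy gradient and the name `l1_l2_update` are illustrative assumptions.

```python
import numpy as np

def l1_l2_update(w, grad_J, eps=0.1, alpha=0.01, mode="l2"):
    """One gradient step on the regularized objective J~(w) = J(w) + penalty.

    L1: J~ = J + alpha * ||w||_1      -> gradient  alpha * sign(w) + grad_J
    L2: J~ = J + alpha/2 * ||w||_2^2  -> update    w = (1 - eps*alpha) * w - eps * grad_J
    """
    if mode == "l1":
        return w - eps * (alpha * np.sign(w) + grad_J)
    return (1.0 - eps * alpha) * w - eps * grad_J     # L2 "weight decay" form

w = np.array([0.5, -0.2, 0.0])
grad_J = np.array([0.1, -0.3, 0.2])                   # gradient of the unregularized loss
print(l1_l2_update(w, grad_J, mode="l1"))
print(l1_l2_update(w, grad_J, mode="l2"))
```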
Dataset Expansion

• The most effective way to prevent over-fitting is to expand the training set. A larger training set has a smaller over-fitting probability. Dataset expansion is a time-saving method, but it varies in different fields.

- A common method in the object recognition field is to rotate or scale images. (The prerequisite for image transformation is that the type of the image cannot be changed through transformation. For example, in handwritten digit recognition, the digits 6 and 9 can easily be confused after rotation.)
- Random noise is added to the input data in speech recognition.
- A common practice in natural language processing (NLP) is replacing words with their synonyms.
- Noise injection can add noise to the input, the hidden layer, or the output layer. For example, for Softmax classification, noise can be added using the label smoothing technology. If noise ε is added to categories 0 and 1, the corresponding probabilities are changed to ε/k and 1 − (k−1)ε/k respectively.
Dropout

• Dropout is a common and simple regularization method, which has been widely used since 2014. Simply put, Dropout randomly discards some inputs during the training process. In this case, the parameters corresponding to the discarded inputs are not updated. As an ensemble-like method, Dropout combines the results of all sub-networks, obtaining the sub-networks by randomly dropping inputs.

Figure: a fully connected network and the sub-networks obtained by randomly dropping units.
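A minimal sketch of (inverted) dropout applied to a layer's activations during training. The scaling by 1/(1 − p) keeps the expected activation unchanged; this is a common convention rather than something stated on the slide, and the function name is illustrative.

```python
import numpy as np

def dropout(h, drop_prob=0.5, training=True, rng=np.random.default_rng(0)):
    """Randomly zero activations with probability drop_prob during training."""
    if not training or drop_prob == 0.0:
        return h                                     # at test time, use all units
    mask = rng.random(h.shape) >= drop_prob          # 1 = keep, 0 = drop
    return h * mask / (1.0 - drop_prob)              # inverted-dropout scaling

h = np.array([[1.0, 2.0, 3.0, 4.0],
              [5.0, 6.0, 7.0, 8.0]])
print(dropout(h, drop_prob=0.5))
```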
Early Stopping

• A test on the validation set can be inserted during training. When the loss on the validation set increases, perform early stopping.

Figure: the early stopping point, where the validation loss starts to increase while the training loss keeps decreasing.
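A sketch of a patience-based early-stopping loop. The callables `train_one_epoch` and `validation_loss`, and the patience value, are hypothetical placeholders for whatever training code is used.

```python
def train_with_early_stopping(train_one_epoch, validation_loss, max_epochs=100, patience=5):
    """Stop training when the validation loss has not improved for `patience` epochs."""
    best_loss = float("inf")
    epochs_without_improvement = 0
    for epoch in range(max_epochs):
        train_one_epoch()                    # one pass over the training set
        loss = validation_loss()             # evaluate on the held-out validation set
        if loss < best_loss:
            best_loss = loss                 # validation loss improved: keep training
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                print(f"Early stopping at epoch {epoch}, best validation loss {best_loss:.4f}")
                break
    return best_loss
```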
Contents

1. Deep Learning Summary

2. Training Rules

3. Activation Function

4. Normalizer

5. Optimizer

6. Types of Neural Networks

7. Common Problems

Optimizer

• There are various optimized versions of gradient descent algorithms. In object-oriented language implementations, different gradient descent algorithms are often encapsulated into objects called optimizers.

• Purposes of the algorithm optimization include but are not limited to:
- Accelerating algorithm convergence.
- Preventing or jumping out of local extreme values.
- Simplifying manual parameter setting, especially the learning rate (LR).

• Common optimizers: common GD optimizer, momentum optimizer, Nesterov, AdaGrad, AdaDelta, RMSProp, Adam, AdaMax, and Nadam.
Momentum Optimizer

• A most basic improvement is to add a momentum term to Δw_ji. Assume that the weight correction of the n-th iteration is Δw_ji(n). The weight correction rule is:

  Δw^l_ji(n) = −η δ^{l+1}_i x^l_j(n) + α Δw^l_ji(n − 1)

• where α is a constant (0 ≤ α < 1) called the momentum coefficient and α Δw_ji(n − 1) is the momentum term.

• Imagine a small ball rolling down from a random point on the error surface. The introduction of the momentum term is equivalent to giving the small ball inertia.

Figure: the gradient step −η δ^{l+1}_i x^l_j(n) combined with the momentum term on the error surface.
Advantages and Disadvantages of Momentum Optimizer

• Advantages:
- Enhances the stability of the gradient correction direction and reduces mutations.
- In areas where the gradient direction is stable, the ball rolls faster and faster (there is a speed upper limit because α < 1), which helps the ball quickly cross the flat area and accelerates convergence.
- A small ball with inertia is more likely to roll over some narrow local extrema.

• Disadvantages:
- The learning rate η and momentum α need to be manually set, which often requires more experiments to determine the appropriate values.
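A sketch of the momentum update written in a generic parameter/gradient form, where `grad` plays the role of δ^{l+1} x^l in the formula above. The function name and the toy loss are illustrative assumptions.

```python
import numpy as np

def momentum_step(w, grad, velocity, eta=0.01, alpha=0.9):
    """Momentum update: v <- -eta*grad + alpha*v ; w <- w + v."""
    velocity = -eta * grad + alpha * velocity     # current step plus inertia from the last step
    return w + velocity, velocity

w = np.array([1.0, -2.0])
v = np.zeros_like(w)
for _ in range(3):
    grad = 2.0 * w                                # gradient of the toy loss ||w||^2
    w, v = momentum_step(w, grad, v)
    print(np.round(w, 4))
```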
AdaGrad Optimizer (1)

• The common feature of the stochastic gradient descent (SGD) algorithm, the mini-batch gradient descent (MBGD) algorithm, and the momentum optimizer is that each parameter is updated with the same learning rate.

• According to the approach of AdaGrad, different learning rates need to be set for different parameters:

  g_t = ∂C(t, o)/∂w_t                (gradient calculation)
  r_t = r_{t−1} + g_t^2              (square gradient accumulation)
  Δw_t = −(η / (ε + √r_t)) · g_t     (computing the update)
  w_{t+1} = w_t + Δw_t               (applying the update)

• g_t indicates the gradient at step t; r is a gradient accumulation variable whose initial value is 0 and which increases continuously; η indicates the global learning rate, which needs to be set manually; ε is a small constant (about 10^−7) added for numerical stability.
AdaGrad Optimizer (2)

• The AdaGrad optimization algorithm shows that r continues increasing while the overall learning rate keeps decreasing as the algorithm iterates. This is because we want the learning rate to decrease as the number of updates increases. In the initial learning phase, we are far away from the optimal solution to the loss function. As the number of updates increases, we get closer to the optimal solution, and therefore the learning rate can decrease.

• Pros:
- The learning rate is automatically updated. As the number of updates increases, the learning rate decreases.

• Cons:
- The denominator keeps accumulating, so the learning rate will eventually become very small and the algorithm will become ineffective.
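A sketch of the AdaGrad update following the four formulas on the earlier slide; the names and toy gradient are illustrative assumptions.

```python
import numpy as np

def adagrad_step(w, grad, r, eta=0.1, eps=1e-7):
    """AdaGrad: accumulate squared gradients and scale each parameter's step by them."""
    r = r + grad ** 2                              # r_t = r_{t-1} + g_t^2 (per parameter)
    w = w - eta / (eps + np.sqrt(r)) * grad        # larger accumulated r -> smaller step
    return w, r

w = np.array([1.0, -2.0])
r = np.zeros_like(w)
for _ in range(3):
    grad = 2.0 * w
    w, r = adagrad_step(w, grad, r)
    print(np.round(w, 4))
```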
RMSProp Optimizer

• The RMSProp optimizer is an improved AdaGrad optimizer. It introduces an attenuation coefficient to ensure a certain attenuation ratio for r in each round.

• The RMSProp optimizer solves the problem that the AdaGrad optimizer ends the optimization process too early. It is suitable for handling non-stationary targets and works well for the RNN.

  g_t = ∂C(t, o)/∂w_t                (gradient calculation)
  r_t = β r_{t−1} + (1 − β) g_t^2    (square gradient accumulation)
  Δw_t = −(η / (ε + √r_t)) · g_t     (computing the update)
  w_{t+1} = w_t + Δw_t               (applying the update)

• g_t indicates the gradient at step t; r is a gradient accumulation variable whose initial value is 0 and which does not necessarily keep increasing, being adjusted by the attenuation factor β; η indicates the global learning rate, which needs to be set manually; ε is a small constant (about 10^−7) added for numerical stability.
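The same toy example with the RMSProp update, where the squared-gradient accumulator decays with β instead of growing without bound; names and values are illustrative.

```python
import numpy as np

def rmsprop_step(w, grad, r, eta=0.01, beta=0.9, eps=1e-7):
    """RMSProp: exponentially decayed accumulation of squared gradients."""
    r = beta * r + (1.0 - beta) * grad ** 2        # r_t = beta*r_{t-1} + (1-beta)*g_t^2
    w = w - eta / (eps + np.sqrt(r)) * grad
    return w, r

w = np.array([1.0, -2.0])
r = np.zeros_like(w)
for _ in range(3):
    grad = 2.0 * w
    w, r = rmsprop_step(w, grad, r)
    print(np.round(w, 4))
```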
Adam Optimizer (1)

• Adaptive Moment Estimation (Adam): Developed based on AdaGrad and AdaDelta, Adam maintains two additional variables m_t and v_t for each variable to be trained:

  m_t = β_1 m_{t−1} + (1 − β_1) g_t
  v_t = β_2 v_{t−1} + (1 − β_2) g_t^2

• where t represents the t-th iteration and g_t is the calculated gradient. m_t and v_t are moving averages of the gradient and the squared gradient. From the statistical perspective, m_t and v_t are estimates of the first moment (the average value) and the second moment (the uncentered variance) of the gradients respectively, which also explains why the method is so named.
Adam Optimizer (2)

• If m_t and v_t are initialized using the zero vector, they are close to 0 during the initial iterations, especially when β_1 and β_2 are close to 1. To solve this problem, we use the bias-corrected estimates m̂_t and v̂_t:

  m̂_t = m_t / (1 − β_1^t)
  v̂_t = v_t / (1 − β_2^t)

• The weight update rule of Adam is as follows:

  w_{t+1} = w_t − (η / (√v̂_t + ε)) · m̂_t

• Although the rule involves manual setting of η, β_1, and β_2, the setting is much simpler. According to experiments, the default settings are β_1 = 0.9, β_2 = 0.999, ε = 10^−8, and η = 0.001. In practice, Adam converges quickly. When convergence saturation is reached, the learning rate η can be reduced. After several reductions, a satisfying local extremum will be obtained. Other parameters do not need to be adjusted.
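A sketch of the Adam update with bias correction, using the default hyperparameters quoted above; the function and variable names are illustrative assumptions.

```python
import numpy as np

def adam_step(w, grad, m, v, t, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update (t is the 1-based iteration counter)."""
    m = beta1 * m + (1 - beta1) * grad             # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad ** 2        # second-moment estimate
    m_hat = m / (1 - beta1 ** t)                   # bias correction
    v_hat = v / (1 - beta2 ** t)
    w = w - eta * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

w = np.array([1.0, -2.0])
m = np.zeros_like(w)
v = np.zeros_like(w)
for t in range(1, 4):
    grad = 2.0 * w
    w, m, v = adam_step(w, grad, m, v, t)
    print(np.round(w, 4))
```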
Optimizer Performance Comparison

Figures: comparison of optimization algorithms in contour maps of loss functions, and comparison of optimization algorithms at the saddle point.
Contents

1. Deep Learning Summary

2. Training Rules

3. Activation Function

4. Normalizer

5. Optimizer

6. Types of Neural Networks

7. Common Problems

Convolutional Neural Network

• A convolutional neural network (CNN) is a feedforward neural network. Its artificial neurons may respond to surrounding units within the coverage range. CNN excels at image processing. It includes a convolutional layer, a pooling layer, and a fully connected layer.

• In the 1960s, Hubel and Wiesel studied cats' cortex neurons used for local sensitivity and direction selection and found that their unique network structure could simplify feedback neural networks. They then proposed the CNN.

• Now, CNN has become one of the research hotspots in many scientific fields, especially in the pattern classification field. The network is widely used because it can avoid complex pre-processing of images and directly take original images as input.
Main Concepts of CNN

• Local receptive field: It is generally considered that human perception of the outside world goes from local to global. Spatial correlations among local pixels of an image are closer than those among distant pixels. Therefore, each neuron does not need to know the global image; it only needs to know the local image. The local information is combined at a higher level to generate global information.

• Parameter sharing: One or more filters/kernels may be used to scan input images. Parameters carried by the filters are weights. In a layer scanned by filters, each filter uses the same parameters during weighted computation. Weight sharing means that when each filter scans an entire image, the parameters of the filter are fixed.
Architecture of Convolutional Neural Network

Figure: input image → convolutional layer (convolution + nonlinearity) → pooling layer (max pooling) → convolutional layer → pooling layer → vectorization → fully connected layer → output layer with multi-category probabilities (e.g., P_bird, P_sunset, P_dog, P_cat).

• Convolution layers + pooling layers extract features; the fully connected layer performs the multi-category classification.
Single-Filter Calculation (1)

• Description of the convolution calculation.
Single-Filter Calculation (2)

• Demonstration of the convolution calculation.

Figure source: Han Bingtao, 2017, Convolutional Neural Network.
Convolutional Layer

• The basic architecture of a CNN is multi-channel convolution consisting of multiple single convolutions. The output of the previous layer (or the original image at the first layer) is used as the input of the current layer. It is then convolved with the filters in the layer and serves as the output of this layer. The convolution kernel of each layer is the weight to be learned. Similar to an FCN, after the convolution is complete, the result should be biased and activated through activation functions before being input to the next layer.

Figure: an input tensor convolved with kernels W1…Wn, plus biases b1…bn, followed by activation, producing the output tensor (feature maps F1…Fn).
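A minimal NumPy sketch of a single-filter 2-D convolution (technically cross-correlation, as commonly implemented in CNNs) with a bias and ReLU activation. Stride 1, no padding, and the particular kernel are simplifying assumptions.

```python
import numpy as np

def conv2d_single_filter(image, kernel, bias=0.0):
    """Valid cross-correlation of a 2-D image with one 2-D kernel, plus bias and ReLU."""
    H, W = image.shape
    kh, kw = kernel.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = image[i:i + kh, j:j + kw]          # local receptive field
            out[i, j] = np.sum(patch * kernel) + bias  # shared weights at every position
    return np.maximum(out, 0.0)                        # activation (ReLU)

image = np.arange(25, dtype=float).reshape(5, 5)
kernel = np.array([[-1.0, 0.0], [0.0, 1.0]])           # diagonal-difference filter
print(conv2d_single_filter(image, kernel))             # every output equals 6 for this pair
```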
Pooling Layer

• Pooling combines nearby units to reduce the size of the input on the next layer, reducing dimensions. Common pooling includes max pooling and average pooling. When max pooling is used, the maximum value in a small square area is selected as the representative of this area, while the mean value is selected as the representative when average pooling is used. The side of this small area is the pooling window size. The following figure shows the max pooling operation whose pooling window size is 2.

Figure: a 2×2 max pooling window sliding over the feature map.
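A small sketch of non-overlapping 2×2 max pooling; the window size and stride of 2 match the figure description, and the feature-map values are an illustrative example.

```python
import numpy as np

def max_pool_2x2(x):
    """Non-overlapping 2x2 max pooling over a 2-D feature map (height and width must be even)."""
    H, W = x.shape
    x = x.reshape(H // 2, 2, W // 2, 2)        # split the map into 2x2 blocks
    return x.max(axis=(1, 3))                  # take the maximum of each block

feature_map = np.array([[1., 3., 2., 4.],
                        [5., 6., 7., 8.],
                        [3., 2., 1., 0.],
                        [1., 2., 3., 4.]])
print(max_pool_2x2(feature_map))               # [[6. 8.] [3. 4.]]
```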
Fully Connected Layer

• The fully connected layer is essentially a classifier. The features extracted by the convolutional layers and pooling layers are flattened and fed to the fully connected layer to output and classify results.

• Generally, the Softmax function is used as the activation function of the final fully connected output layer to combine all local features into global features and calculate the score of each type:

  σ(z)_j = e^{z_j} / Σ_{k=1}^{K} e^{z_k}
Recurrent Neural Network

• The recurrent neural network (RNN) is a neural network that captures dynamic information in sequential data through periodical connections of hidden layer nodes. It can classify sequential data.

• Unlike other forward neural networks, the RNN can keep a context state and even store, learn, and express related information in context windows of any length. Different from traditional neural networks, it is not limited to the space boundary, but also supports time sequences. In other words, there is an edge between the hidden layer of the current moment and the hidden layer of the next moment.

• The RNN is widely used in scenarios related to sequences, such as videos consisting of image frames, audio consisting of clips, and sentences consisting of words.
Recurrent Neural Network Architecture (1)

• X_t is the input of the input sequence at time t.
• S_t is the memory unit of the sequence at time t and caches previous information:

  S_t = tanh(U X_t + W S_{t−1})

• O_t is the output of the hidden layer of the sequence at time t:

  O_t = tanh(V S_t)

• After O_t passes through multiple hidden layers, the final output of the sequence at time t can be obtained.
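A sketch of the recurrence above unrolled over a short sequence. The layer sizes and random initialization are illustrative assumptions, and the output non-linearity follows the slide's formula O_t = tanh(V S_t).

```python
import numpy as np

rng = np.random.default_rng(0)
input_size, hidden_size, output_size = 3, 4, 2

U = rng.normal(scale=0.1, size=(hidden_size, input_size))   # input -> hidden
W = rng.normal(scale=0.1, size=(hidden_size, hidden_size))  # hidden -> hidden (recurrent)
V = rng.normal(scale=0.1, size=(output_size, hidden_size))  # hidden -> output

def rnn_forward(inputs):
    """Run the simple RNN over a list of input vectors and return the outputs O_t."""
    S = np.zeros(hidden_size)                    # initial memory S_0
    outputs = []
    for X_t in inputs:
        S = np.tanh(U @ X_t + W @ S)             # S_t = tanh(U X_t + W S_{t-1})
        outputs.append(np.tanh(V @ S))           # O_t = tanh(V S_t), as on the slide
    return outputs

sequence = [rng.normal(size=input_size) for _ in range(5)]
for t, O_t in enumerate(rnn_forward(sequence), start=1):
    print(f"O_{t} =", np.round(O_t, 3))
```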
Recurrent Neural Network Architecture (2)

Figure source: LeCun, Bengio, and G. Hinton, 2015, a recurrent neural network and the unfolding in time of the computation involved in its forward computation.
Types of Recurrent Neural Networks

Figure source: Andrej Karpathy, 2015, The Unreasonable Effectiveness of Recurrent Neural Networks.
Backpropagation Through Time (BPTT)

• BPTT:
- It is the extension of traditional backpropagation over the time sequence.
- There are two sources of error for the memory unit at time t: the first is the error from the hidden layer output at time step t; the second is the error from the memory cell at the next time step t + 1.
- The longer the time sequence, the more likely it is that the loss at the last time step causes a vanishing gradient or exploding gradient problem for the gradient of w at the first time step.
- The total gradient of weight w is the accumulation of the gradients of the weight at all time steps.

• Three steps of BPTT:
- Compute the output value of each neuron through forward propagation.
- Compute the error value δ_j of each neuron through backpropagation.
- Compute the gradient of each weight.

• Weights are then updated using the SGD algorithm.
Recurrent Neural Network Problem

• S_t = σ(U X_t + W S_{t−1}) is expanded over the time sequence:

  S_t = σ(U X_t + W σ(U X_{t−1} + W σ(U X_{t−2} + W …)))

• Although the standard RNN structure solves the problem of information memory, the information attenuates during long-term memory.

• Information needs to be saved for a long time in many tasks. For example, a hint at the beginning of a speculative fiction may not be answered until the end.

• The RNN may not be able to save information for long due to the limited memory unit capacity.

• We expect that memory units can remember key information.
Long Short-Term Memory Network

Figure source: Colah, 2015, Understanding LSTM Networks.
Gated Recurrent Unit (GRU)
Generative Adversarial Network (GAN)

• A generative adversarial network (GAN) is a framework that trains generator G and discriminator D through an adversarial process. Through the adversarial process, the discriminator learns to tell whether a sample from the generator is fake or real. GAN adopts a mature BP algorithm.

• (1) Generator G: The input is noise z, which complies with a manually selected prior probability distribution, such as a uniform distribution or a Gaussian distribution. The generator adopts the network structure of the multilayer perceptron (MLP), uses maximum likelihood estimation (MLE) parameters to represent the differentiable mapping G(z), and maps the input space to the sample space.

• (2) Discriminator D: The input is the real sample x and the fake sample G(z), which are tagged as real and fake respectively. The network of the discriminator can use an MLP carrying parameters. The output is the probability D(G(z)) that determines whether the sample is real or fake.

• GAN can be applied to scenarios such as image generation, text generation, speech enhancement, and image super-resolution.
GAN Architecture

Figure: generator/discriminator architecture.
Generative Model and Discriminative Model

• Generative network:
- Generates sample data.
- Input: Gaussian white noise vector z.
- Output: sample data vector x = G(z; θ_G).

• Discriminator network:
- Determines whether sample data is real.
- Input: real sample data x_real and generated sample data x = G(z).
- Output: probability y = D(x; θ_D) that determines whether the sample is real.
Training Rules of GAN

• Optimization objective:
- Value function:

  min_G max_D V(D, G) = E_{x∼p_data(x)}[log D(x)] + E_{z∼p_z(z)}[log(1 − D(G(z)))]

- In the early training stage, when the outcome of G is very poor, D determines that the generated sample is fake with high confidence, because the sample is obviously different from the training data. In this case, log(1 − D(G(z))) is saturated (the gradient is 0, and iteration cannot be performed). Therefore, we choose to train G by minimizing −log(D(G(z))) instead.
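A tiny numeric sketch of the value function and the two generator losses discussed above, using hypothetical discriminator outputs rather than a real trained network; the function names are illustrative.

```python
import numpy as np

def gan_value(d_real, d_fake):
    """V(D, G) = E[log D(x)] + E[log(1 - D(G(z)))] estimated from discriminator outputs."""
    return np.mean(np.log(d_real)) + np.mean(np.log(1.0 - d_fake))

def generator_loss_saturating(d_fake):
    """Original generator objective: minimize E[log(1 - D(G(z)))]."""
    return np.mean(np.log(1.0 - d_fake))

def generator_loss_nonsaturating(d_fake):
    """Alternative used in practice: minimize E[-log D(G(z))], which keeps gradients early on."""
    return np.mean(-np.log(d_fake))

d_real = np.array([0.90, 0.85, 0.95])       # D's outputs on real samples (assumed values)
d_fake = np.array([0.02, 0.01, 0.03])       # D's outputs on poor early generator samples (assumed)
print(gan_value(d_real, d_fake))
print(generator_loss_saturating(d_fake))    # near 0: almost no gradient signal for G
print(generator_loss_nonsaturating(d_fake)) # large: a much stronger training signal for G
```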
Contents

1. Deep Learning Summary

2. Training Rules

3. Activation Function

4. Normalizer

5. Optimizer

6. Types of Neural Networks

7. Common Problems

Data Imbalance (1)

• Problem description: In a dataset consisting of various task categories, the number of samples varies greatly from one category to another. One or more of the predicted categories contain very few samples.

• For example, in an image recognition experiment, more than 2,000 categories among a total of 4,251 training images contain just one image each. Some of the others have 2-5 images.

• Impacts:
- Due to the unbalanced number of samples, we cannot get the optimal real-time result, because the model/algorithm never examines categories with very few samples adequately.
- Since few observation objects may not be representative for a class, we may fail to obtain adequate samples for verification and test.
Data Imbalance (2)

Common countermeasures:

- Random undersampling: deleting redundant samples in a category.
- Random oversampling: copying samples.
- Synthetic Minority Oversampling Technique (SMOTE): sampling and merging samples.
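A small sketch of random over- and under-sampling of a labeled dataset (purely illustrative; SMOTE-style synthetic interpolation is not shown, and the function name `rebalance` is an assumption).

```python
import numpy as np

def rebalance(X, y, target_count, rng=np.random.default_rng(0)):
    """Randomly over- or under-sample each class to `target_count` samples."""
    X_parts, y_parts = [], []
    for cls in np.unique(y):
        idx = np.where(y == cls)[0]
        replace = len(idx) < target_count          # oversample small classes with replacement
        chosen = rng.choice(idx, size=target_count, replace=replace)
        X_parts.append(X[chosen])
        y_parts.append(y[chosen])
    return np.concatenate(X_parts), np.concatenate(y_parts)

X = np.arange(12).reshape(6, 2)
y = np.array([0, 0, 0, 0, 0, 1])                   # imbalanced: five '0's, one '1'
Xb, yb = rebalance(X, y, target_count=3)
print(yb)                                          # three samples per class
```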
Vanishing Gradient and Exploding Gradient Problem (1)

• Vanishing gradient: As network layers increase, the derivative value of backpropagation decreases, which causes a vanishing gradient problem.

• Exploding gradient: As network layers increase, the derivative value of backpropagation increases, which causes an exploding gradient problem.

• Cause: y_i = σ(z_i) = σ(w_i x_i + b_i), where σ is the sigmoid function.

Figure: a chain of neurons with weights w_2, w_3, w_4 and biases b_1, b_2, b_3 feeding into the cost C.

• Backpropagation can be deduced as follows:

  ∂C/∂b_1 = (∂C/∂y_4)(∂y_4/∂z_4)(∂z_4/∂x_4)(∂x_4/∂z_3)(∂z_3/∂x_3)(∂x_3/∂z_2)(∂z_2/∂x_2)(∂x_2/∂z_1)(∂z_1/∂b_1)
          = (∂C/∂y_4) · σ′(z_4) w_4 · σ′(z_3) w_3 · σ′(z_2) w_2 · σ′(z_1)
Vanishing Gradient and Exploding Gradient Problem (2)

• The maximum value of σ′(x) is 1/4.

• However, the network weight w is usually smaller than 1; therefore, |σ′(z) w| ≤ 1/4. According to the chain rule, as layers increase, the derivation result ∂C/∂b_1 decreases, resulting in the vanishing gradient problem.

• When the network weight w is large, so that |σ′(z) w| > 1, the exploding gradient problem occurs.

• Solution: For example, gradient clipping is used to alleviate the exploding gradient problem, and the ReLU activation function and LSTM are used to alleviate the vanishing gradient problem.
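As one of the mitigations mentioned above, here is a sketch of gradient clipping by global norm; the threshold value and names are arbitrary illustrations.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm=5.0):
    """Rescale a list of gradient arrays so that their combined L2 norm is at most max_norm."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total_norm <= max_norm:
        return grads                                   # already small enough, leave unchanged
    scale = max_norm / total_norm                      # shrink every gradient by the same factor
    return [g * scale for g in grads]

grads = [np.array([30.0, -40.0]), np.array([0.5])]     # exploding-ish gradients
clipped = clip_by_global_norm(grads, max_norm=5.0)
print(clipped, np.sqrt(sum(np.sum(g ** 2) for g in clipped)))   # combined norm is now 5.0
```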
Overfitting

• Problem description: The model performs well on the training set, but badly on the test set.

• Root cause: There are too many feature dimensions, model assumptions, and parameters, too much noise, but very little training data. As a result, the fitting function perfectly predicts the training set, while the prediction result on the test set of new data is poor. The training data is over-fitted without considering generalization capabilities.

• Solution: For example, data augmentation, regularization, early stopping, and dropout.
Summary

• This chapter describes the definition and development of neural networks, perceptrons and their training rules, common types of neural networks (CNN, RNN, and GAN), and the common problems of neural networks in AI engineering and their solutions.
Quiz

1. (True or false) Compared with the recurrent neural network, the convolutional neural network is more suitable for image recognition. ( )

A. True

B. False

2. (True or false) GAN is a deep learning model, which is one of the most promising methods for unsupervised learning of complex distributions in recent years. ( )

A. True

B. False
Quiz

3. (Single-choice) There are many types of deep learning neural networks. Which of the following is not a deep learning neural network? ( )

A. CNN

B. RNN

C. LSTM

D. Logistic

4. (Multi-choice) There are many important "components" in the convolutional neural network architecture. Which of the following are convolutional neural network "components"? ( )

A. Activation function

B. Convolutional kernel

C. Pooling

D. Fully connected layer
Recommendations

• Online learning website:
- https://e.huawei.com/cn/talent/#/home

• Huawei Knowledge Base:
- https://support.huawei.com/enterprise/servicecenter?lang=zh
Thank you.

Bring digital to every person, home, and
organization for a fully connected,
intelligent world.

Copyright©2020 Huawei Technologies Co., Ltd.


All Rights Reserved.

The information in this document may contain predictive


statements including, without limitation, statements regarding
the future financial and operating results, future product
portfolio, new technology, etc. There are a number of factors that
could cause actual results and developments to differ materially
from those expressed or implied in the predictive statements.
Therefore, such information is provided for reference purpose
only and constitutes neither an offer nor an acceptance. Huawei
may change the information at any time without notice.
