
Artificial Neural Network

Neuron in the Brain

Source: Michael Negnevitsky, Artificial Intelligence: A Guide to Intelligent Systems, Second Edition, Addison-Wesley, 2005

• A neuron consists of a cell body, or soma, a number of fibres called dendrites, and a single long fibre called the axon.

• While dendrites branch into a network around the soma, the axon
stretches out to the dendrites and somas of other neurons.
The Artificial Neurons

Source: Michael Negnevitsky, Artificial Intelligence: A Guide to Intelligent Systems, Second Edition, Addison-Wesley, 2005
Proposed by McCulloch and Pitts [1943]; called M-P neurons.
• $w_{ij}$ positive — excitatory; negative — inhibitory; zero — no connection
• $t_i$ — threshold

$$y_i(t+1) = a(f_i), \qquad f_i = \sum_{j=1}^{m} w_{ij} x_j - t_i, \qquad a(f) = \begin{cases} 1 & f \ge 0 \\ 0 & \text{otherwise} \end{cases}$$

(Figure: inputs $x_1, x_2, \ldots$ with weights $w_{i1}, w_{i2}, \ldots$ feeding the summation $f(\cdot)$, threshold $t_i$, and activation $a(\cdot)$ producing $y_i$.)
Neuron
● The neuron is the basic information processing
unit of a Neural Network. It consists of:
1. A set of links, describing the neuron inputs, with weights $w_1, w_2, \ldots, w_m$
2. An adder function (linear combiner) for computing the weighted sum of the inputs (real numbers): $f = \sum_{i=1}^{m} w_i x_i$
3. An activation function $a(\cdot)$ for limiting the amplitude of the neuron output: $y = a(f)$
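As a concrete illustration of these three parts, here is a minimal sketch of a single neuron in Python/NumPy; the input, weight, and threshold values are made up for the example, and the step activation is just one of the choices listed on the next slide.

```python
import numpy as np

def neuron_output(x, w, t):
    """Single neuron: weighted sum of inputs minus threshold, then a step activation."""
    f = np.dot(w, x) - t              # adder (linear combiner) with threshold t
    return 1.0 if f >= 0 else 0.0     # activation a(f): step function

# Example with made-up numbers: two inputs, threshold 0.5
x = np.array([1.0, 0.0])
w = np.array([0.7, 0.2])
print(neuron_output(x, w, 0.5))       # 0.7 - 0.5 = 0.2 >= 0, so the neuron fires: 1.0
```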
Activation Function a(.)
• Step function: $a(f) = \begin{cases} 1 & \text{if } f \ge t \\ 0 & \text{otherwise} \end{cases}$

• Sigmoid function: $a(f) = \dfrac{1}{1 + \exp(-f)}$

• Gaussian function: $a(f) = \dfrac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left( -\dfrac{1}{2} \left( \dfrac{f - \mu}{\sigma} \right)^{2} \right)$
Activation Function a(.)
• Tanh: $\tanh x = \dfrac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$

• Rectified Linear Unit (ReLU): $\mathrm{ReLU}(x) = \max(0, x)$
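These activation functions can be written directly in NumPy. The sketch below follows the formulas above; treating the Gaussian's mean mu and spread sigma as free parameters is an assumption about the slide's notation.

```python
import numpy as np

def step(f, t=0.0):
    return np.where(f >= t, 1.0, 0.0)

def sigmoid(f):
    return 1.0 / (1.0 + np.exp(-f))

def gaussian(f, mu=0.0, sigma=1.0):
    return np.exp(-0.5 * ((f - mu) / sigma) ** 2) / (np.sqrt(2.0 * np.pi) * sigma)

def tanh(f):
    return (np.exp(f) - np.exp(-f)) / (np.exp(f) + np.exp(-f))   # equivalent to np.tanh(f)

def relu(f):
    return np.maximum(0.0, f)
```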


What Can a Neuron Do?
• A hard limiter.
• A binary threshold unit.
• Hyperspace separation.
$$f = w_1 x_1 + w_2 x_2 - t, \qquad y = \begin{cases} 1 & \text{if } f \ge 0 \\ 0 & \text{otherwise} \end{cases}$$

(Figure: a two-input threshold unit with weights $w_1, w_2$ and threshold $t$; the line $w_1 x_1 + w_2 x_2 - t = 0$ separates the $(x_1, x_2)$ plane.)
Artificial Neural Networks (ANN)

X1 X2 X3 | Y
 1  0  0 | 0
 1  0  1 | 1
 1  1  0 | 1
 1  1  1 | 1
 0  0  1 | 0
 0  1  0 | 0
 0  1  1 | 1
 0  0  0 | 0

(Figure: a black box with inputs X1, X2, X3 and output Y.)
Output Y is 1 if at least two of the three inputs are equal to 1.


Artificial Neural Networks (ANN)
Input nodes X1, X2, X3 each connect with weight 0.3 to an output node with threshold t = 0.4:

X1 X2 X3 | Y
 1  0  0 | 0
 1  0  1 | 1
 1  1  0 | 1
 1  1  1 | 1
 0  0  1 | 0
 0  1  0 | 0
 0  1  1 | 1
 0  0  0 | 0

$$Y = a(0.3 X_1 + 0.3 X_2 + 0.3 X_3 - 0.4 > 0), \qquad a(f) = \begin{cases} 1 & \text{if } f \text{ is true} \\ 0 & \text{otherwise} \end{cases}$$
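The weights of 0.3 and the threshold of 0.4 from this slide can be checked against the truth table with a few lines of Python; this is only a verification sketch, not part of the original slides.

```python
def majority_unit(x1, x2, x3):
    # Y = a(0.3*X1 + 0.3*X2 + 0.3*X3 - 0.4 > 0)
    return int(0.3 * x1 + 0.3 * x2 + 0.3 * x3 - 0.4 > 0)

# Y should be 1 exactly when at least two of the three inputs are 1
for x1 in (0, 1):
    for x2 in (0, 1):
        for x3 in (0, 1):
            assert majority_unit(x1, x2, x3) == int(x1 + x2 + x3 >= 2)
print("truth table reproduced")
```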
Algorithm for learning ANN
• Initialize the weights (w0, w1, …, wk)

• Adjust the weights in such a way that the output of the ANN is consistent with the class labels of the training examples
  – Objective function: $E = \dfrac{1}{2} \sum_i \left[ Y_i - f(w_i, X_i) \right]^2$
  – Find the weights $w_i$ that minimize the above objective function
• e.g., backpropagation algorithm
Learning Algorithms

• To design a learning algorithm, we face the following problems:

1. When to stop: is the stopping criterion satisfied?
2. In what direction to proceed? e.g., gradient descent
3. How long a step to take? This is set by the learning rate.
Gradient Descent

Assume there are only two parameters $w_1$ and $w_2$ in the network, $\theta = (w_1, w_2)$, and consider the error surface over them (the colors in the figure represent the value of the cost $C$).

• Randomly pick a starting point $\theta^0$.
• Compute the negative gradient at $\theta^0$: $-\nabla C(\theta^0)$, where $\nabla C(\theta^0) = \left( \partial C(\theta^0)/\partial w_1, \; \partial C(\theta^0)/\partial w_2 \right)^{\mathsf T}$.
• Multiply it by the learning rate $\eta$ to obtain the step $-\eta \nabla C(\theta^0)$.

Source: Hung-yi Lee, Deep Learning Tutorial
Gradient Descent

Repeat the update from each new point: $\theta^1 = \theta^0 - \eta \nabla C(\theta^0)$, $\theta^2 = \theta^1 - \eta \nabla C(\theta^1)$, and so on. Eventually we reach a minimum.

(Figure: the successive steps $-\eta \nabla C(\theta^0)$, $-\eta \nabla C(\theta^1)$, $-\eta \nabla C(\theta^2)$ descending the error surface over $(w_1, w_2)$.)

Source: Hung-yi Lee, Deep Learning Tutorial
Local Minima

• Gradient descent never guarantees a global minimum.

(Figure: a non-convex cost $C$ over $(w_1, w_2)$; different initial points $\theta^0$ reach different minima, so they give different results.)

Who is Afraid of Non-Convex Loss Functions? http://videolectures.net/eml07_lecun_wia/

Source: Hung-yi Lee, Deep Learning Tutorial
Besides local minima ……

• Very slow at a plateau, where $\nabla C(\theta) \approx 0$
• Stuck at a saddle point, where $\nabla C(\theta) = 0$
• Stuck at a local minimum, where $\nabla C(\theta) = 0$

(Figure: the cost plotted against the parameter space, showing a plateau, a saddle point, and a local minimum.)

Source: Hung-yi Lee, Deep Learning Tutorial
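A minimal sketch of the update rule $\theta \leftarrow \theta - \eta \nabla C(\theta)$ for a two-parameter cost; the quadratic cost used here is an arbitrary stand-in for the error surface (a real surface may have the plateaus, saddle points, and local minima just described).

```python
import numpy as np

def grad_C(theta):
    """Gradient of a stand-in cost C(w1, w2) = (w1 - 1)^2 + 2*(w2 + 0.5)^2."""
    w1, w2 = theta
    return np.array([2.0 * (w1 - 1.0), 4.0 * (w2 + 0.5)])

eta = 0.1                          # learning rate
theta = np.array([3.0, 2.0])       # randomly chosen starting point theta^0
for k in range(100):
    theta = theta - eta * grad_C(theta)   # theta^(k+1) = theta^k - eta * grad C(theta^k)

print(theta)                       # approaches the minimum at (1.0, -0.5)
```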
Back propagation algorithm
for a Single-Layer Perceptron
• Step 0 – Initialize the weights (w0, w1, …, wn), set m = 0, the learning rate $\eta$, and the threshold t
• Step 1 – Set m = m + 1
• Step 2 – Select pattern $X_m$
• Step 3 – Calculate the output: $f(w, X_m) = \sum_i w_i X_{m,i} - t$, $\quad o = a(f)$
• Step 4 – Calculate the error (delta): $\delta = d - o$
• Step 5 – Update the weights: $w_i(\text{new}) = w_i(\text{old}) + \eta \, \delta \, X_{m,i}$
• Step 6 – Repeat until the weights converge
• Step 7 – Return w
Source: Michael Negnevitsky, Artificial Intelligence: A Guide to Intelligent Systems, Second Edition, Addison-Wesley, 2005
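A sketch of these steps in Python, trained on the majority-function data from the earlier slide; the zero initialization, fixed threshold of 0.4, learning rate of 0.1, and epoch limit are illustrative choices rather than values from the source.

```python
import numpy as np

# Majority-function training patterns X and desired outputs d (from the earlier slide)
X = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1],
              [0, 0, 1], [0, 1, 0], [0, 1, 1], [0, 0, 0]], dtype=float)
d = np.array([0, 1, 1, 1, 0, 0, 1, 0], dtype=float)

w = np.zeros(3)      # Step 0: initialize the weights (zeros, for simplicity)
t = 0.4              # fixed threshold
eta = 0.1            # learning rate

for epoch in range(100):               # Step 6: repeat until the weights converge
    mistakes = 0
    for x, target in zip(X, d):        # Steps 1 and 2: select the next pattern
        o = 1.0 if np.dot(w, x) - t >= 0 else 0.0   # Step 3: output o = a(f)
        delta = target - o                          # Step 4: error (delta)
        w = w + eta * delta * x                     # Step 5: update the weights
        mistakes += int(delta != 0)
    if mistakes == 0:
        break

print(w)             # Step 7: return the learned weights (approximately [0.2, 0.2, 0.2] here)
```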
General Structure of ANN

(Figure: a network with an input layer taking $x_1, \ldots, x_5$, a hidden layer, and an output layer producing $y$. The inset shows one neuron $i$: inputs $I_1, I_2, I_3$ with weights $w_{i1}, w_{i2}, w_{i3}$ are summed into $S_i$, passed through the activation function $g(S_i)$ with threshold $t$, and produce the output $O_i$.)

Training an ANN means learning the weights of the neurons.
Multilayer Perceptron

(Figure: an input layer $x_1, x_2, \ldots, x_m$, one or more hidden layers, and an output layer producing $y_1, y_2, \ldots, y_n$.)
How an MLP Works?

Example: XOR
• Not linearly separable.
• Is a single-layer perceptron workable?

(Figure: the four XOR points plotted in the $(x_1, x_2)$ plane; no single line separates the two classes.)
How an MLP Works?

Example: XOR
(Figure: two lines $L_1$ and $L_2$ drawn in the $(x_1, x_2)$ plane around the XOR points. A hidden layer with two threshold units $y_1$ and $y_2$, fed by $x_1$, $x_2$, and a bias input $x_3 = 1$, implements $L_1$ and $L_2$.)
How an MLP Works?

Example: XOR
(Figure: the hidden outputs $(y_1, y_2)$ map the four input points into a new plane, where a single line $L_3$ now separates the two XOR classes.)
How an MLP Works?

Example: XOR
(Figure: an output unit $z$ implements $L_3$ over $y_1$, $y_2$, and a bias $y_3 = 1$, on top of the hidden units for $L_1$ and $L_2$ fed by $x_1$, $x_2$, $x_3 = 1$.)
In the hidden-layer space $(y_1, y_2)$, is the problem linearly separable?
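A sketch of this construction with hand-picked weights; the particular thresholds below are illustrative choices and not necessarily the exact lines drawn on the slides. The two hidden units play the role of L1 and L2, and the output unit plays the role of L3 in the (y1, y2) plane.

```python
def step(f):
    return 1.0 if f >= 0 else 0.0

def xor_mlp(x1, x2):
    # Hidden layer: two threshold units standing in for the lines L1 and L2
    y1 = step(x1 + x2 - 0.5)     # fires if at least one input is 1
    y2 = step(x1 + x2 - 1.5)     # fires only if both inputs are 1
    # Output layer: a single line L3 in the (y1, y2) plane
    return step(y1 - y2 - 0.5)

for a in (0, 1):
    for b in (0, 1):
        print(a, b, "->", xor_mlp(a, b))   # reproduces XOR: 0, 1, 1, 0
```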
Parity Problem

x1 x2 x3 | y
 0  0  0 | 0
 0  0  1 | 1
 0  1  0 | 1
 0  1  1 | 0
 1  0  0 | 1
 1  0  1 | 0
 1  1  0 | 0
 1  1  1 | 1

(Figure: the eight input patterns as the corners of the unit cube in $(x_1, x_2, x_3)$ space.)
Parity Problem

(Figure sequence: three planes $P_1$, $P_2$, $P_3$ cut the input cube between the corners with different numbers of 1s, e.g. between 000, 001, 011, and 111.)

(Figure: a hidden layer of three threshold units $y_1$, $y_2$, $y_3$, fed by $x_1$, $x_2$, $x_3$, implements $P_1$, $P_2$, $P_3$.)

(Figure: in the hidden space $(y_1, y_2, y_3)$ the patterns collapse onto a few points, and a single plane $P_4$ separates odd parity from even parity; an output unit $z$ implements $P_4$ on top of $y_1$, $y_2$, $y_3$.)
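One way to realize this construction with hand-picked weights (an illustrative choice, not necessarily the exact planes P1 to P4 drawn on the slides): the three hidden units test whether at least one, two, or three inputs are on, and the output unit combines them so the result matches the parity table.

```python
def step(f):
    return 1 if f >= 0 else 0

def parity_mlp(x1, x2, x3):
    s = x1 + x2 + x3
    # Hidden layer: three planes through the input cube (P1, P2, P3)
    y1 = step(s - 0.5)    # at least one input is 1
    y2 = step(s - 1.5)    # at least two inputs are 1
    y3 = step(s - 2.5)    # all three inputs are 1
    # Output layer: a single plane P4 in the (y1, y2, y3) space
    return step(y1 - y2 + y3 - 0.5)

for x1 in (0, 1):
    for x2 in (0, 1):
        for x3 in (0, 1):
            assert parity_mlp(x1, x2, x3) == (x1 + x2 + x3) % 2
print("parity table reproduced")
```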
General Problem
Hyperspace Partition

(Figure: three lines $L_1$, $L_2$, $L_3$ partition the input plane into regions.)

Region Encoding

(Figure: each region is labeled by a 3-bit code giving its side of $L_1$, $L_2$, $L_3$: 000, 001, 010, 100, 101, 110, 111.)
Hyperspace Partition & Region Encoding Layer

(Figure: a first layer of three threshold units, one per line $L_1$, $L_2$, $L_3$, fed by the inputs $x_1$, $x_2$, $x_3$; its outputs form the region code of the input point.)
Region Identification Layer

(Figure sequence: on top of the encoding layer $L_1$, $L_2$, $L_3$, a second layer contains one unit per region; each unit fires only when the encoding-layer outputs match its region code. The slides step through the regions 101, 001, 000, 110, 010, 100, and 111 in turn.)
Classification

(Figure: each region-identification unit is assigned a class label (0 or 1), and an output unit combines the units whose regions belong to class 1, producing the final classification.)
Feed-Forward Neural Networks

Back Propagation Learning Algorithm

Supervised Learning
Training set: $T = \{ (\mathbf{x}^{(1)}, \mathbf{d}^{(1)}), (\mathbf{x}^{(2)}, \mathbf{d}^{(2)}), \ldots, (\mathbf{x}^{(p)}, \mathbf{d}^{(p)}) \}$

(Figure: a multilayer network with input layer $x_1, \ldots, x_m$, hidden layers, and an output layer producing $o_1, \ldots, o_n$, compared against the desired outputs $d_1, \ldots, d_n$.)
Supervised Learning

Training set: $T = \{ (\mathbf{x}^{(1)}, \mathbf{d}^{(1)}), (\mathbf{x}^{(2)}, \mathbf{d}^{(2)}), \ldots, (\mathbf{x}^{(p)}, \mathbf{d}^{(p)}) \}$

Sum of squared errors for pattern $l$:
$$E^{(l)} = \frac{1}{2} \sum_{j=1}^{n} \left( d_j^{(l)} - o_j^{(l)} \right)^2$$

Goal: minimize $E = \sum_{l=1}^{p} E^{(l)}$.
Back Propagation Learning Algorithm

$$E^{(l)} = \frac{1}{2} \sum_{j=1}^{n} \left( d_j^{(l)} - o_j^{(l)} \right)^2, \qquad E = \sum_{l=1}^{p} E^{(l)}$$

The weight updates are derived in two parts:
• Learning on output neurons
• Learning on hidden neurons
Learning on Output Neurons

For an output neuron $j$ with incoming weight $w_{ji}$ from neuron $i$:
$$o_j^{(l)} = a(net_j^{(l)}), \qquad net_j^{(l)} = \sum_i w_{ji} \, o_i^{(l)}$$

$$\frac{\partial E}{\partial w_{ji}} = \sum_{l=1}^{p} \frac{\partial E^{(l)}}{\partial w_{ji}}, \qquad \frac{\partial E^{(l)}}{\partial w_{ji}} = \frac{\partial E^{(l)}}{\partial net_j^{(l)}} \cdot \frac{\partial net_j^{(l)}}{\partial w_{ji}}$$

$$\frac{\partial E^{(l)}}{\partial net_j^{(l)}} = \frac{\partial E^{(l)}}{\partial o_j^{(l)}} \cdot \frac{\partial o_j^{(l)}}{\partial net_j^{(l)}}, \qquad \frac{\partial E^{(l)}}{\partial o_j^{(l)}} = -\left( d_j^{(l)} - o_j^{(l)} \right)$$

The remaining factor $\partial o_j^{(l)} / \partial net_j^{(l)}$ depends on the activation function.
Activation Function — Sigmoid

$$y = a(net) = \frac{1}{1 + e^{-net}}$$

$$a'(net) = \left( \frac{1}{1 + e^{-net}} \right)' = \frac{e^{-net}}{\left( 1 + e^{-net} \right)^{2}} = y \, (1 - y)$$

Remember this.
Learning on Output Neurons

Using the sigmoid, $\dfrac{\partial o_j^{(l)}}{\partial net_j^{(l)}} = o_j^{(l)} \left( 1 - o_j^{(l)} \right)$, so

$$\frac{\partial E^{(l)}}{\partial net_j^{(l)}} = -\left( d_j^{(l)} - o_j^{(l)} \right) o_j^{(l)} \left( 1 - o_j^{(l)} \right)$$

Define the output delta
$$\delta_j^{(l)} = -\frac{\partial E^{(l)}}{\partial net_j^{(l)}} = \left( d_j^{(l)} - o_j^{(l)} \right) o_j^{(l)} \left( 1 - o_j^{(l)} \right)$$
Learning on Output Neurons

Since $net_j^{(l)} = \sum_i w_{ji} o_i^{(l)}$, we have $\dfrac{\partial net_j^{(l)}}{\partial w_{ji}} = o_i^{(l)}$, and therefore

$$\frac{\partial E^{(l)}}{\partial w_{ji}} = -\delta_j^{(l)} o_i^{(l)}, \qquad \frac{\partial E}{\partial w_{ji}} = -\sum_{l=1}^{p} \delta_j^{(l)} o_i^{(l)}$$

The gradient-descent update for the weights connecting to the output neurons is

$$\Delta w_{ji} = \eta \sum_{l=1}^{p} \delta_j^{(l)} o_i^{(l)}$$
Learning on Hidden Neurons

For a hidden neuron $i$ with incoming weight $w_{ik}$ from neuron $k$:

$$\frac{\partial E}{\partial w_{ik}} = \sum_{l=1}^{p} \frac{\partial E^{(l)}}{\partial w_{ik}}, \qquad \frac{\partial E^{(l)}}{\partial w_{ik}} = \frac{\partial E^{(l)}}{\partial net_i^{(l)}} \cdot \frac{\partial net_i^{(l)}}{\partial w_{ik}}, \qquad \frac{\partial net_i^{(l)}}{\partial w_{ik}} = o_k^{(l)}$$

Define the hidden delta $\delta_i^{(l)} = -\dfrac{\partial E^{(l)}}{\partial net_i^{(l)}}$. As before,

$$\frac{\partial E^{(l)}}{\partial net_i^{(l)}} = \frac{\partial E^{(l)}}{\partial o_i^{(l)}} \cdot \frac{\partial o_i^{(l)}}{\partial net_i^{(l)}}, \qquad \frac{\partial o_i^{(l)}}{\partial net_i^{(l)}} = o_i^{(l)} \left( 1 - o_i^{(l)} \right) \text{ (sigmoid)}$$

The error reaches hidden neuron $i$ only through the neurons $j$ it feeds:

$$\frac{\partial E^{(l)}}{\partial o_i^{(l)}} = \sum_j \frac{\partial E^{(l)}}{\partial net_j^{(l)}} \cdot \frac{\partial net_j^{(l)}}{\partial o_i^{(l)}} = -\sum_j \delta_j^{(l)} w_{ji}$$

Hence

$$\delta_i^{(l)} = o_i^{(l)} \left( 1 - o_i^{(l)} \right) \sum_j w_{ji} \delta_j^{(l)}, \qquad \frac{\partial E}{\partial w_{ik}} = -\sum_{l=1}^{p} \delta_i^{(l)} o_k^{(l)}, \qquad \Delta w_{ik} = \eta \sum_{l=1}^{p} \delta_i^{(l)} o_k^{(l)}$$
Back Propagation

(Figure: the full network with inputs $x_1, \ldots, x_m$, hidden neurons $k$ and $i$, and output neurons $j$ producing $o_1, \ldots, o_n$, compared with the targets $d_1, \ldots, d_n$. The deltas are computed at the output layer and propagated backward.)

Output neurons:
$$\delta_j^{(l)} = -\frac{\partial E^{(l)}}{\partial net_j^{(l)}} = \left( d_j^{(l)} - o_j^{(l)} \right) o_j^{(l)} \left( 1 - o_j^{(l)} \right), \qquad \Delta w_{ji} = \eta \sum_{l=1}^{p} \delta_j^{(l)} o_i^{(l)}$$

Hidden neurons:
$$\delta_i^{(l)} = -\frac{\partial E^{(l)}}{\partial net_i^{(l)}} = o_i^{(l)} \left( 1 - o_i^{(l)} \right) \sum_j w_{ji} \delta_j^{(l)}, \qquad \Delta w_{ik} = \eta \sum_{l=1}^{p} \delta_i^{(l)} o_k^{(l)}$$
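A quick way to sanity-check these formulas is to compare the backpropagated gradient with a finite-difference estimate of the same derivative; a minimal sketch for a single sigmoid output neuron, where the inputs, weights, and target are arbitrary made-up numbers.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

o_i = np.array([0.2, 0.7, 0.1])     # outputs of the previous layer (arbitrary)
w_ji = np.array([0.5, -0.3, 0.8])   # weights into output neuron j (arbitrary)
d_j = 1.0                           # target

def E(w):
    o_j = sigmoid(np.dot(w, o_i))
    return 0.5 * (d_j - o_j) ** 2

# Backpropagation: dE/dw_ji = -(d_j - o_j) * o_j * (1 - o_j) * o_i = -delta_j * o_i
o_j = sigmoid(np.dot(w_ji, o_i))
grad_bp = -(d_j - o_j) * o_j * (1 - o_j) * o_i

# Finite-difference estimate of the same gradient
eps = 1e-6
grad_fd = np.array([(E(w_ji + eps * e) - E(w_ji - eps * e)) / (2 * eps) for e in np.eye(3)])

print(np.allclose(grad_bp, grad_fd))   # True: the two gradients agree
```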
Multilayer Neural Network

(Figure: input $x_i$ feeds hidden neuron $j$ (output $o_j$) through weight $w_{ji}$, and hidden neuron $j$ feeds output neuron $k$ (output $o_k$) through weight $w_{kj}$.)

$$\delta_k = (d_k - o_k) \, o_k (1 - o_k), \qquad \delta_j = o_j (1 - o_j) \sum_k w_{kj} \, \delta_k$$
Backpropagation algorithm
for a Multilayer Perceptron
• Step 0 – Initialize the weights (w0, w1, …, wn), set m = 0, the learning rate $\eta$, the threshold t, and the sigmoid parameter
• Step 1 – Set m = m + 1
• Step 2 – Select pattern $X_m$
• Step 3 – Calculate the outputs $o_j$ (hidden) and $o_k$ (output), with $o = a(f)$
• Step 4 – Calculate the deltas $\delta_k$ and $\delta_j$
• Step 5 – Update the weights:
  $w_{kj}(\text{new}) = w_{kj}(\text{old}) + \eta \, \delta_k \, o_j$
  $w_{ij}(\text{new}) = w_{ij}(\text{old}) + \eta \, \delta_j \, X_i$
• Step 6 – Repeat until the weights converge
• Step 7 – Return w
https://machinelearningmastery.com/implement-backpropagation-algorithm-scratch-python/
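Putting the update rules together, here is a small from-scratch sketch with one hidden layer, sigmoid activations, and per-pattern updates; the XOR data, network size, learning rate, epoch count, and the use of trainable biases in place of fixed thresholds are illustrative choices, and this is not the code from the linked tutorial.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# XOR training set (inputs X, desired outputs d)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
d = np.array([[0], [1], [1], [0]], dtype=float)

rng = np.random.default_rng(1)
W1 = rng.normal(0.0, 1.0, (2, 3))   # input -> hidden weights
b1 = np.zeros(3)                    # hidden biases (trainable thresholds)
W2 = rng.normal(0.0, 1.0, (3, 1))   # hidden -> output weights
b2 = np.zeros(1)                    # output bias
eta = 0.5                           # learning rate

for epoch in range(10000):
    for x, target in zip(X, d):
        # Forward pass
        o_hidden = sigmoid(x @ W1 + b1)        # hidden outputs o_j
        o_out = sigmoid(o_hidden @ W2 + b2)    # network outputs o_k

        # Backward pass: deltas from the derivation above
        delta_out = (target - o_out) * o_out * (1.0 - o_out)        # output delta
        delta_hid = o_hidden * (1.0 - o_hidden) * (W2 @ delta_out)  # hidden delta

        # Weight updates: Delta w = eta * delta * (input feeding that weight)
        W2 += eta * np.outer(o_hidden, delta_out)
        b2 += eta * delta_out
        W1 += eta * np.outer(x, delta_hid)
        b1 += eta * delta_hid

print(sigmoid(sigmoid(X @ W1 + b1) @ W2 + b2))   # should approach [0, 1, 1, 0] after training
```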
Learning Factors
• Initial Weights
• Learning Rate ($\eta$)
• Cost Functions
• Momentum
• Update Rules
• Training Data and Generalization
• Number of Layers
• Number of Hidden Nodes
Number of Hidden Layers
• In fact, for many practical problems, there is no
reason to use more than one hidden layer.
No. of Hidden Layers | Result
none | Only capable of representing linearly separable functions or decisions
1 | Can approximate any function that contains a continuous mapping from one finite space to another
2 | Can represent an arbitrary decision boundary to arbitrary accuracy with rational activation functions, and can approximate any smooth mapping to any accuracy
Number of Neurons in the Hidden
Layer
There are many rule-of-thumb methods:
• Between the size of the input layer and the
size of output layer
• 2/3 the size of the input layer, plus the size of
the output layer
• Less than twice the size of the input layer
Number of Neurons in Output Layer
• Regression
– One neuron
• Classification
– Binary class: one neuron
– Multi-class: more than one neuron
Material Resources
• 虞台文, Feed-Forward Neural Networks, Course
slides presentation
• Andrew Ng, Machine Learning, Course slides
presentation
• Michael Negnevitsky, Artificial Intelligence: A Guide to Intelligent Systems, Second Edition, Addison-Wesley, 2005.
• Richard O. Duda, et al., Pattern Classification, 2nd Edition, John Wiley & Sons, Inc., 2001.
• Hung-yi Lee, Deep Learning Tutorial
