Lição 20
Perceptron
Gradient Descent
Multi-layered neural network
Back-Propagation
More on Back-Propagation
Examples
Inner product

$\text{net} = \langle \vec{w}, \vec{x} \rangle = \|\vec{w}\| \cdot \|\vec{x}\| \cdot \cos(\theta)$

$\text{net} = \sum_{i=1}^{n} w_i \cdot x_i$

A measure of the projection of one vector onto another.
Activation function

$o = f(\text{net}) = f\left(\sum_{i=1}^{n} w_i \cdot x_i\right)$

$f(x) := \operatorname{sgn}(x) = \begin{cases} 1 & \text{if } x \ge 0 \\ -1 & \text{if } x < 0 \end{cases}$
Threshold (step) function:

$f(x) := \begin{cases} 1 & \text{if } x \ge 0 \\ 0 & \text{if } x < 0 \end{cases}$

Piecewise-linear function:

$f(x) := \begin{cases} 1 & \text{if } x \ge 0.5 \\ x & \text{if } 0.5 > x > -0.5 \\ 0 & \text{if } x \le -0.5 \end{cases}$
Sigmoid function

$f(x) := \sigma(x) = \frac{1}{1 + e^{-a x}}$
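To make these definitions concrete, here is a minimal Python sketch of the activation functions above applied to a perceptron's net input; the function names, the example weights, and the slope parameter `a` are my own choices, not from the slides:

```python
import numpy as np

def sgn(x):
    # Sign activation: +1 for x >= 0, -1 otherwise
    return np.where(x >= 0, 1.0, -1.0)

def step(x):
    # Threshold activation: 1 for x >= 0, 0 otherwise
    return np.where(x >= 0, 1.0, 0.0)

def piecewise_linear(x):
    # 1 above 0.5, 0 below -0.5, identity in between (as defined on the slide)
    return np.where(x >= 0.5, 1.0, np.where(x <= -0.5, 0.0, x))

def sigmoid(x, a=1.0):
    # Smooth, differentiable squashing function with slope parameter a
    return 1.0 / (1.0 + np.exp(-a * x))

# Perceptron output o = f(<w, x>) for one example
w = np.array([0.5, -0.3, 0.8])
x = np.array([1.0, 2.0, -1.0])
net = np.dot(w, x)                     # inner product <w, x> = -0.9
print(sgn(net), step(net), sigmoid(net))
```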
Gradient Descent

To understand gradient descent, consider a simpler linear unit, where

$o = \sum_{i=0}^{n} w_i \cdot x_i$
Error for different hypotheses of $w_0$ and $w_1$ (2 dimensions), where the training error is $E[\vec{w}] = \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2$.

$w_i = w_i + \Delta w_i$, i.e. $\vec{w} = \vec{w} + \Delta \vec{w}$
Differentiating E

$\Delta w_i = \eta \sum_{d \in D} (t_d - o_d) \, x_{id}$
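A minimal sketch of this batch update for a linear unit, assuming numpy and a learning rate `eta` of my own choosing:

```python
import numpy as np

def batch_gradient_step(w, X, t, eta=0.1):
    """One batch gradient-descent step for a linear unit o = w . x.

    X: (num_examples, num_inputs) matrix, t: (num_examples,) target vector.
    """
    o = X @ w                          # outputs o_d for every training example
    return w + eta * X.T @ (t - o)     # w_i += eta * sum_d (t_d - o_d) * x_id
```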
Stochastic Approximation to Gradient Descent

$\Delta w_i = \eta (t - o) \, x_i$

The gradient descent training rule updates the weights by summing over all the training examples in D.
Stochastic gradient descent approximates gradient descent by updating the weights incrementally:
calculate the error for each example and update immediately.
Known as the delta rule or LMS (least mean-square) weight update.
Adaline rule, used for adaptive filters; Widrow and Hoff (1960).
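For comparison, an incremental (per-example) sketch of the delta/LMS rule, under the same assumptions as the batch sketch above:

```python
import numpy as np

def lms_epoch(w, X, t, eta=0.1):
    # Delta/LMS rule: update after each example rather than after the whole set
    for x_d, t_d in zip(X, t):
        o_d = np.dot(w, x_d)
        w = w + eta * (t_d - o_d) * x_d
    return w
```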
XOR problem and Perceptron
Shown by Minsky and Papert in the mid-1960s: XOR is not linearly separable, so a single perceptron cannot represent it.
Multi-layer Networks

The limitations of the simple perceptron do not apply to feed-forward networks with intermediate or "hidden" nonlinear units.
A network with just one hidden layer can represent any Boolean function.
The great power of multi-layer networks was realized long ago, but it was only in the eighties that it was shown how to make them learn.
XOR example
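Since the XOR figure itself is not reproduced here, a small hand-wired sketch may help: a 2-2-1 network of threshold units whose weights I chose to realize XOR (the specific weights are illustrative, not from the slides):

```python
import numpy as np

def step(x):
    return (x >= 0).astype(float)

# Hidden layer: unit 1 fires for x1 OR x2, unit 2 fires for x1 AND x2
W_hidden = np.array([[1.0, 1.0],    # OR-like unit
                     [1.0, 1.0]])   # AND-like unit
b_hidden = np.array([-0.5, -1.5])

# Output: OR minus AND gives XOR
W_out = np.array([1.0, -1.0])
b_out = -0.5

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    h = step(W_hidden @ np.array(x, dtype=float) + b_hidden)
    y = step(W_out @ h + b_out)
    print(x, int(y))   # prints 0, 1, 1, 0
```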
Back-propagation is a learning algorithm for
multi-layer neural networks
It was invented independently several times
Bryson and Ho [1969]
Werbos [1974]
Parker [1985]
Rumelhart et al. [1986]
Back-propagation
The algorithm gives a prescription for changing the weights $w_{ij}$ in any feed-forward network to learn a training set of input-output pairs $\{x^d, t^d\}$.
(Feed-forward network with input units $x_1, \dots, x_5$)
Given the pattern $x^d$, the hidden unit $j$ receives a net input

$\text{net}_j^d = \sum_{k=1}^{5} w_{jk} x_k^d$

and produces the output

$V_j^d = f(\text{net}_j^d) = f\left(\sum_{k=1}^{5} w_{jk} x_k^d\right)$
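A small numpy sketch of this forward pass through the hidden layer; the sigmoid choice and the weight-matrix shape are my assumptions:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def hidden_forward(w, x_d):
    """net_j^d = sum_k w_jk * x_k^d ;  V_j^d = f(net_j^d).

    w: (num_hidden, num_inputs) weight matrix, x_d: one input pattern.
    """
    net = w @ x_d
    return sigmoid(net)
```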
Our usual error function is the sum of squared errors over all patterns and output units:

$E[\vec{w}] = \frac{1}{2} \sum_{d=1}^{m} \sum_{i} \left(t_i^d - o_i^d\right)^2$
For hidden-to-output connections the gradient descent rule gives:

$\Delta W_{ij} = -\eta \frac{\partial E}{\partial W_{ij}} = -\eta \sum_{d=1}^{m} \left(t_i^d - o_i^d\right) f'(\text{net}_i^d) \cdot \left(-V_j^d\right)$

$\Delta W_{ij} = \eta \sum_{d=1}^{m} \left(t_i^d - o_i^d\right) f'(\text{net}_i^d) \cdot V_j^d$

For input-to-hidden connections we apply the chain rule:

$\Delta w_{jk} = -\eta \frac{\partial E}{\partial w_{jk}} = -\eta \sum_{d=1}^{m} \frac{\partial E}{\partial V_j^d} \cdot \frac{\partial V_j^d}{\partial w_{jk}}$

$\Delta w_{jk} = \eta \sum_{d=1}^{m} \sum_{i=1}^{2} \left(t_i^d - o_i^d\right) f'(\text{net}_i^d) \, W_{ij} \, f'(\text{net}_j^d) \cdot x_k^d$

Writing $\delta_i^d = (t_i^d - o_i^d) f'(\text{net}_i^d)$ for the output units and $\delta_j^d = f'(\text{net}_j^d) \sum_i W_{ij} \delta_i^d$ for the hidden units, both rules take the same form:

$\Delta W_{ij} = \eta \sum_{d=1}^{m} \delta_i^d \, V_j^d$

$\Delta w_{jk} = \eta \sum_{d=1}^{m} \delta_j^d \cdot x_k^d$
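As a sanity check on these formulas, here is a small sketch (my own, not from the slides) that computes the $\delta$ terms for a single pattern with sigmoid units and compares the analytic gradient with a finite-difference estimate:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(w, W, x):
    V = sigmoid(w @ x)          # hidden outputs V_j
    o = sigmoid(W @ V)          # network outputs o_i
    return V, o

def analytic_grads(w, W, x, t):
    V, o = forward(w, W, x)
    delta_out = (t - o) * o * (1 - o)              # delta_i = (t_i - o_i) f'(net_i)
    delta_hid = (W.T @ delta_out) * V * (1 - V)    # delta_j = f'(net_j) sum_i W_ij delta_i
    dW = np.outer(delta_out, V)                    # Delta W_ij proportional to delta_i * V_j
    dw = np.outer(delta_hid, x)                    # Delta w_jk proportional to delta_j * x_k
    return dW, dw

def numeric_grad_W(w, W, x, t, i, j, eps=1e-5):
    # Finite-difference estimate of -dE/dW_ij, where E = 0.5 * sum (t - o)^2
    def E(Wp):
        _, o = forward(w, Wp, x)
        return 0.5 * np.sum((t - o) ** 2)
    Wp, Wm = W.copy(), W.copy()
    Wp[i, j] += eps
    Wm[i, j] -= eps
    return -(E(Wp) - E(Wm)) / (2 * eps)

rng = np.random.default_rng(0)
w = rng.normal(scale=0.5, size=(3, 5))   # input-to-hidden weights
W = rng.normal(scale=0.5, size=(2, 3))   # hidden-to-output weights
x = rng.normal(size=5)
t = np.array([0.0, 1.0])

dW, _ = analytic_grads(w, W, x, t)
print(dW[0, 1], numeric_grad_W(w, W, x, t, 0, 1))  # should be nearly equal
```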
In general, with an arbitrary number of layers, the back-propagation update rule always has the form

$\Delta w_{ij} = \eta \sum_{d=1}^{m} \delta_{\text{output}} \cdot V_{\text{input}}$
We have to use a nonlinear, differentiable activation function. Examples:

$f(x) = \sigma(x) = \frac{1}{1 + e^{-\beta x}}$, with $f'(x) = \sigma'(x) = \beta \cdot \sigma(x) \cdot (1 - \sigma(x))$

$f(x) = \tanh(\beta x)$, with $f'(x) = \beta \cdot (1 - f(x)^2)$
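A quick numerical check of these two derivative identities (a sketch I added; $\beta = 1.3$ and $x = 0.7$ are arbitrary):

```python
import numpy as np

beta, x, eps = 1.3, 0.7, 1e-6

sig = lambda z: 1.0 / (1.0 + np.exp(-beta * z))
tanh_f = lambda z: np.tanh(beta * z)

# Finite-difference derivatives vs. the closed forms above
print((sig(x + eps) - sig(x - eps)) / (2 * eps), beta * sig(x) * (1 - sig(x)))
print((tanh_f(x + eps) - tanh_f(x - eps)) / (2 * eps), beta * (1 - tanh_f(x) ** 2))
```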
Consider a network with M layers, $m = 1, 2, \dots, M$.

$V_i^m$ denotes the output of the $i$-th unit of the $m$-th layer.
$V_i^0$ is a synonym for $x_i$, the $i$-th input.
The superscript $m$ refers to layers, not to patterns.
$w_{ij}^m$ means the connection from $V_j^{m-1}$ to $V_i^m$.
Stochastic Back-Propagation Algorithm (mostly used)

1. Initialize the weights to small random values.
2. Choose a pattern $x_k^d$ and apply it to the input layer: $V_k^0 = x_k^d$ for all $k$.
3. Propagate the signal through the network:
   $V_i^m = f(\text{net}_i^m) = f\left(\sum_j w_{ij}^m V_j^{m-1}\right)$
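The slide stops at step 3; the sketch below (my own completion, using sigmoid units and a single hidden layer) carries out the remaining steps, i.e. compute the output deltas, back-propagate them, update the weights incrementally, and repeat, here on the XOR patterns:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(1)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([0.0, 1.0, 1.0, 0.0])          # XOR targets
eta = 0.5

# Step 1: small random weights (last column acts as a bias weight)
w = rng.normal(scale=0.1, size=(2, 3))      # input -> hidden
W = rng.normal(scale=0.1, size=(1, 3))      # hidden -> output

for _ in range(50_000):
    d = rng.integers(len(X))                # Step 2: pick a pattern
    x = np.append(X[d], 1.0)                # append constant bias input
    # Step 3: forward propagation
    V = np.append(sigmoid(w @ x), 1.0)      # hidden outputs + bias
    o = sigmoid(W @ V)
    # Step 4: output delta; Step 5: back-propagated hidden delta
    delta_o = (T[d] - o) * o * (1 - o)
    delta_h = (W[:, :2].T @ delta_o) * V[:2] * (1 - V[:2])
    # Step 6: incremental weight updates
    W += eta * np.outer(delta_o, V)
    w += eta * np.outer(delta_h, x)

for x_d in X:
    V = np.append(sigmoid(w @ np.append(x_d, 1.0)), 1.0)
    print(x_d, float(sigmoid(W @ V)))       # should approach 0, 1, 1, 0
```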
More on Back-Propagation
Gradient descent over entire network
weight vector
Easily generalized to arbitrary directed
graphs
Will find a local, not necessarily global
error minimum
In practice, often works well (can run
multiple times)
$\Delta w_{pq}(t + 1) = -\eta \frac{\partial E}{\partial w_{pq}} + \alpha \cdot \Delta w_{pq}(t)$
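A few illustrative lines showing this momentum term in code; `eta` and `alpha` follow the formula, and `grad_E` stands for the gradient assumed to come from the back-propagation pass above:

```python
import numpy as np

def momentum_step(w, grad_E, prev_delta_w, eta=0.1, alpha=0.9):
    # delta_w(t+1) = -eta * dE/dw + alpha * delta_w(t)
    delta_w = -eta * grad_E + alpha * prev_delta_w
    return w + delta_w, delta_w

# Usage: start with prev_delta_w = np.zeros_like(w) and feed each step's
# delta_w back in as prev_delta_w on the next iteration.
```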
Minimizes error over the training examples; will it generalize well?
Training can take thousands of iterations; it is slow!
Convergence of Back-propagation
Gradient descent to some local minimum
Perhaps not global minimum...
Add momentum
Stochastic gradient descent
Train multiple nets with different initial weights
Nature of convergence
Initialize weights near zero
Therefore, initial networks near-linear
Increasingly non-linear functions possible as training
progresses
Expressive Capabilities of ANNs
Boolean functions:
Every Boolean function can be represented by a network with a single hidden layer,
but it might require a number of hidden units exponential in the number of inputs.
Continuous functions:
Every bounded continuous function can be approximated with arbitrarily small error
by a network with one hidden layer [Cybenko 1989; Hornik et al. 1989].
Any function can be approximated to arbitrary accuracy by a network with two
hidden layers [Cybenko 1988].
Prediction
Perceptron
Gradient Descent
Multi-layered neural network
Back-Propagation
More on Back-Propagation
Examples
RBF Networks, Support Vector Machines