Neural Networks and Deep Learning
www.cs.wisc.edu/~dpage/cs760/
Goals for the lecture
you should understand the following concepts
• perceptrons
• the perceptron training rule
• linear separability
• hidden units
• multilayer neural networks
• gradient descent
• stochastic (online) gradient descent
• sigmoid function
• gradient descent with a linear output unit
• gradient descent with a sigmoid output unit
• backpropagation
Goals for the lecture
you should understand the following concepts
• weight initialization
• early stopping
• the role of hidden units
• input encodings for neural networks
• output encodings
• recurrent neural networks
• autoencoders
• stacked autoencoders
Neural networks
• a.k.a. artificial neural networks, connectionist models
• inspired by interconnected neurons in biological systems
• simple processing units
• each unit receives a number of real-valued inputs
• each unit produces a single real-valued output
Perceptrons
[McCulloch & Pitts, 1943; Rosenblatt, 1959; Widrow & Hoff, 1960]
[figure: a perceptron with inputs x_1, ..., x_n, weights w_1, ..., w_n, a constant input 1 with weight w_0, and output o]

$$o = \begin{cases} 1 & \text{if } w_0 + \sum_{i=1}^{n} w_i x_i > 0 \\ 0 & \text{otherwise} \end{cases}$$
" n
2a. calculate the output $ 1 if w0 + ∑ wi xi > 0
for the given instance o=# i=1
$ 0 otherwise
%
Δwi = η ( y − o ) xi
η is learning rate;
set to value << 1 wi ← wi + Δwi
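As a concrete illustration, here is a minimal NumPy sketch of the perceptron training rule above; the function names and the choice of training data (the AND function from a later slide) are my own.

```python
import numpy as np

def perceptron_output(w, x):
    # threshold unit: 1 if w0 + sum_i w_i x_i > 0, else 0
    return 1 if np.dot(w, x) > 0 else 0

def train_perceptron(X, y, eta=0.1, epochs=20):
    X = np.column_stack([np.ones(len(X)), X])  # prepend the constant input 1 (for w0)
    w = np.zeros(X.shape[1])                   # could also be small random values
    for _ in range(epochs):
        for x_d, y_d in zip(X, y):
            o = perceptron_output(w, x_d)      # step a: calculate the output
            w += eta * (y_d - o) * x_d         # step b: delta w_i = eta (y - o) x_i
    return w

# example: learning the (linearly separable) AND function
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 0, 1])
w = train_perceptron(X, y)
print([perceptron_output(w, np.concatenate(([1.0], x))) for x in X])  # [0, 0, 0, 1]
```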
Representational power of perceptrons
perceptrons can represent only linearly separable concepts
" n
$ 1 if w0 + ∑ wi xi > 0
o=# i=1
$ 0 otherwise
%
w
decision boundary given by:
x2
+ +
1 if w0 + w1 x1 + w2 x2 > 0 + + +
+ -
+ - -
also write as: wx > 0 + +
+ + -
+ + - -
w1 x1 + w2 x2 = −w0 -
+ + -
-
w1 w + - - -
x2 = − x1 − 0 x1
w2 w2 - -
Some linearly separable functions
AND
      x1   x2   y
  a    0    0   0
  b    0    1   0
  c    1    0   0
  d    1    1   1

[figure: points a-d plotted in the (x1, x2) plane; a single line separates d (y = 1) from the rest]

OR
      x1   x2   y
  a    0    0   0
  b    0    1   1
  c    1    0   1
  d    1    1   1

[figure: points a-d plotted in the (x1, x2) plane; a single line separates a (y = 0) from the rest]
XOR is not linearly separable
      x1   x2   y
  a    0    0   0
  b    0    1   1
  c    1    0   1
  d    1    1   0

[figure: points a-d plotted in the (x1, x2) plane; no single line can separate b and c (y = 1) from a and d (y = 0)]
a multilayer perceptron can represent XOR

[figure: a two-layer network of threshold units; the two hidden units receive (x1, x2) with weights (1, −1) and (−1, 1), and both feed the output unit with weight 1; assume w0 = 0 for all nodes]
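As a quick check of this construction (a minimal sketch; the weight assignment follows the reconstructed figure, with w0 = 0 everywhere and a strict > 0 threshold):

```python
def step(z):
    # threshold unit with w0 = 0: output 1 if the weighted sum is > 0, else 0
    return int(z > 0)

def xor_mlp(x1, x2):
    h1 = step(x1 - x2)     # hidden unit with weights (1, -1)
    h2 = step(-x1 + x2)    # hidden unit with weights (-1, 1)
    return step(h1 + h2)   # output unit with weights (1, 1)

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, xor_mlp(x1, x2))   # reproduces the XOR truth table
```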
Example multilayer neural network
[figure: a feed-forward network with input units x1 and x2, a layer of hidden units, and output units]

how do we determine the error signal for the hidden units?
$$E(\mathbf{w}) = \frac{1}{2} \sum_{d \in D} \left( y^{(d)} - o^{(d)} \right)^2$$

This error measure defines a surface over the hypothesis (i.e. weight) space

[figure: the error surface plotted over two weights, w1 and w2]
Gradient descent in weight space
gradient descent is an iterative process aimed at finding a minimum in the error surface

on each iteration
• the current weights define a point in this space
• find the direction in which the error surface descends most steeply
• take a step (i.e. update the weights) in that direction

[figure: the error surface over weights w1 and w2, with a step taken in the direction of steepest descent]
Gradient descent in weight space
calculate the gradient of E:
$$\nabla E(\mathbf{w}) = \left[ \frac{\partial E}{\partial w_0}, \frac{\partial E}{\partial w_1}, \ldots, \frac{\partial E}{\partial w_n} \right]$$

update each weight:
$$\Delta w_i = -\eta \frac{\partial E}{\partial w_i}$$

[figure: the step components $-\frac{\partial E}{\partial w_1}$ and $-\frac{\partial E}{\partial w_2}$ shown in the (w1, w2) plane]
The sigmoid function
• to be able to differentiate E with respect to w_i, our network must represent a continuous function
• to do this, we use sigmoid functions instead of threshold functions in our hidden and output units

$$f(x) = \frac{1}{1 + e^{-x}}$$

[figure: the sigmoid function plotted against x]
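A small sketch of the sigmoid and of the derivative property used later on (f′(x) = f(x)(1 − f(x))); the function names are illustrative.

```python
import numpy as np

def sigmoid(x):
    # f(x) = 1 / (1 + e^(-x)), a smooth, differentiable squashing function
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_derivative(x):
    # the derivative can be written in terms of the output itself: f'(x) = f(x) (1 - f(x))
    fx = sigmoid(x)
    return fx * (1.0 - fx)

print(sigmoid(0.0))             # 0.5
print(sigmoid_derivative(0.0))  # 0.25, the largest the derivative ever gets
```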
The sigmoid function
for the case of a single-layer network
$$f(\mathbf{x}) = \frac{1}{1 + e^{-\left( w_0 + \sum_{i=1}^{n} w_i x_i \right)}}$$

[figure: the sigmoid output plotted against the net input $w_0 + \sum_{i=1}^{n} w_i x_i$]
Batch neural network training
given: network structure and a training set $D = \{ (\mathbf{x}^{(1)}, y^{(1)}), \ldots, (\mathbf{x}^{(m)}, y^{(m)}) \}$

initialize all weights in w to small random numbers
until stopping criteria met do
    initialize the error E(w) = 0
    for each (x^(d), y^(d)) in the training set
        input x^(d) to the network and compute output o^(d)
        increment the error: $E(\mathbf{w}) = E(\mathbf{w}) + \frac{1}{2} \left( y^{(d)} - o^{(d)} \right)^2$
    calculate the gradient
        $$\nabla E(\mathbf{w}) = \left[ \frac{\partial E}{\partial w_0}, \frac{\partial E}{\partial w_1}, \ldots, \frac{\partial E}{\partial w_n} \right]$$
    update the weights: $\Delta \mathbf{w} = -\eta \, \nabla E(\mathbf{w})$
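A minimal sketch of this batch procedure for the simple case used on the following slides (one linear output unit, no hidden units); the fixed epoch count stands in for "until stopping criteria met", and all names are illustrative.

```python
import numpy as np

def batch_train_linear_unit(X, y, eta=0.01, epochs=100):
    """Batch gradient descent for a single linear unit o = w0 + sum_i w_i x_i."""
    X = np.column_stack([np.ones(len(X)), X])        # constant input 1 paired with w0
    w = np.random.uniform(-0.05, 0.05, X.shape[1])   # small random initial weights
    for _ in range(epochs):                          # stand-in for the stopping criteria
        o = X @ w                                    # outputs for every training instance
        error = 0.5 * np.sum((y - o) ** 2)           # E(w); could be monitored as a stopping criterion
        grad = -(y - o) @ X                          # dE/dw_i = -sum_d (y^(d) - o^(d)) x_i^(d)
        w += -eta * grad                             # delta w = -eta * grad E(w)
    return w
```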
Online neural network training
(stochastic gradient descent)
given: network structure and a training set $D = \{ (\mathbf{x}^{(1)}, y^{(1)}), \ldots, (\mathbf{x}^{(m)}, y^{(m)}) \}$

initialize all weights in w to small random numbers
until stopping criteria met do
    for each (x^(d), y^(d)) in the training set
        input x^(d) to the network and compute output o^(d)
        calculate the error: $E(\mathbf{w}) = \frac{1}{2} \left( y^{(d)} - o^{(d)} \right)^2$
        calculate the gradient
            $$\nabla E(\mathbf{w}) = \left[ \frac{\partial E}{\partial w_0}, \frac{\partial E}{\partial w_1}, \ldots, \frac{\partial E}{\partial w_n} \right]$$
        update the weights: $\Delta \mathbf{w} = -\eta \, \nabla E(\mathbf{w})$
recall the chain rule from calculus: if y = f(u) and u = g(x), then
$$\frac{\partial y}{\partial x} = \frac{\partial y}{\partial u} \frac{\partial u}{\partial x}$$
Gradient descent: simple case
Consider a simple case of a network with one linear output unit
and no hidden units:
[figure: a single linear unit with inputs x_1, ..., x_n, a constant input 1, and weights w_0, w_1, ..., w_n]

$$o^{(d)} = w_0 + \sum_{i=1}^{n} w_i x_i^{(d)}$$

batch gradient:
$$\frac{\partial E}{\partial w_i} = \frac{\partial}{\partial w_i} \frac{1}{2} \sum_{d \in D} \left( y^{(d)} - o^{(d)} \right)^2$$

per-instance (online) gradient:
$$\frac{\partial E^{(d)}}{\partial w_i} = \frac{\partial}{\partial w_i} \frac{1}{2} \left( y^{(d)} - o^{(d)} \right)^2$$
Stochastic gradient descent: simple case
let’s focus on the online case (stochastic gradient descent):
$$\frac{\partial E^{(d)}}{\partial w_i} = \frac{\partial}{\partial w_i} \frac{1}{2} \left( y^{(d)} - o^{(d)} \right)^2$$
$$= \left( y^{(d)} - o^{(d)} \right) \frac{\partial}{\partial w_i} \left( y^{(d)} - o^{(d)} \right)$$
$$= \left( y^{(d)} - o^{(d)} \right) \left( -\frac{\partial o^{(d)}}{\partial w_i} \right)$$
$$= -\left( y^{(d)} - o^{(d)} \right) \frac{\partial o^{(d)}}{\partial net^{(d)}} \frac{\partial net^{(d)}}{\partial w_i} = -\left( y^{(d)} - o^{(d)} \right) \frac{\partial net^{(d)}}{\partial w_i}$$
$$= -\left( y^{(d)} - o^{(d)} \right) x_i^{(d)}$$
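Putting the derivation together, a minimal sketch of one stochastic update for the linear unit (variable names are illustrative):

```python
import numpy as np

def sgd_step_linear_unit(w, x_d, y_d, eta=0.01):
    """One stochastic gradient descent update for a single linear unit.
    x_d is assumed to already include the constant input 1 that pairs with w0."""
    o_d = np.dot(w, x_d)        # o = w0 + sum_i w_i x_i
    grad = -(y_d - o_d) * x_d   # dE(d)/dw_i = -(y - o) x_i
    return w - eta * grad       # w <- w + eta (y - o) x
```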
Gradient descent with a sigmoid
Now let’s consider the case in which we have a sigmoid output
unit and no hidden units:
[figure: a single sigmoid unit with inputs x_1, ..., x_n, a constant input 1, and weights w_0, w_1, ..., w_n]

$$net^{(d)} = w_0 + \sum_{i=1}^{n} w_i x_i^{(d)}$$
$$o^{(d)} = \frac{1}{1 + e^{-net^{(d)}}}$$

useful property:
$$\frac{\partial o^{(d)}}{\partial net^{(d)}} = o^{(d)} \left( 1 - o^{(d)} \right)$$
Stochastic GD with sigmoid output unit
$$\frac{\partial E^{(d)}}{\partial w_i} = \frac{\partial}{\partial w_i} \frac{1}{2} \left( y^{(d)} - o^{(d)} \right)^2$$
$$= \left( y^{(d)} - o^{(d)} \right) \frac{\partial}{\partial w_i} \left( y^{(d)} - o^{(d)} \right)$$
$$= \left( y^{(d)} - o^{(d)} \right) \left( -\frac{\partial o^{(d)}}{\partial w_i} \right)$$
$$= -\left( y^{(d)} - o^{(d)} \right) \frac{\partial o^{(d)}}{\partial net^{(d)}} \frac{\partial net^{(d)}}{\partial w_i}$$
$$= -\left( y^{(d)} - o^{(d)} \right) o^{(d)} \left( 1 - o^{(d)} \right) \frac{\partial net^{(d)}}{\partial w_i}$$
$$= -\left( y^{(d)} - o^{(d)} \right) o^{(d)} \left( 1 - o^{(d)} \right) x_i^{(d)}$$
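The corresponding stochastic update for the sigmoid unit, again as a minimal sketch with illustrative names:

```python
import numpy as np

def sgd_step_sigmoid_unit(w, x_d, y_d, eta=0.1):
    """One stochastic gradient descent update for a single sigmoid unit.
    x_d is assumed to already include the constant input 1 that pairs with w0."""
    net_d = np.dot(w, x_d)                           # net = w0 + sum_i w_i x_i
    o_d = 1.0 / (1.0 + np.exp(-net_d))               # o = sigmoid(net)
    grad = -(y_d - o_d) * o_d * (1.0 - o_d) * x_d    # -(y - o) o (1 - o) x_i
    return w - eta * grad
```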
Backpropagation
• how can we calculate $\frac{\partial E}{\partial w_i}$ for every weight in a multilayer network?
Backpropagation notation
let’s consider the online case, but drop the (d) superscripts for simplicity
we’ll use
• subscripts on y, o, net to indicate which unit they refer to
• subscripts to indicate the unit a weight emanates from and goes to
[figure: weight $w_{ji}$ goes from unit i to unit j; unit j produces output $o_j$]
29
Backpropagation
$$\Delta w_{ji} = -\eta \frac{\partial E}{\partial net_j} \frac{\partial net_j}{\partial w_{ji}} = \eta \, \delta_j \, o_i$$

where
$$\delta_j = -\frac{\partial E}{\partial net_j}$$
and $o_i = x_i$ if i is an input unit
Backpropagation
where $\delta_j = -\frac{\partial E}{\partial net_j}$

$$\delta_j = o_j (1 - o_j)(y_j - o_j) \qquad \text{if } j \text{ is an output unit}$$
(same as the single-layer net with sigmoid output)

$$\delta_j = o_j (1 - o_j) \sum_k \delta_k w_{kj} \qquad \text{if } j \text{ is a hidden unit}$$
(the sum collects the backpropagated contributions to the error from the units k that j feeds into)
Backpropagation illustrated
1. calculate the error of the output units:
$$\delta_j = o_j (1 - o_j)(y_j - o_j)$$

2. determine the updates for weights going to the output units:
$$\Delta w_{ji} = \eta \, \delta_j \, o_i$$
Backpropagation illustrated
3. calculate the error for the hidden units:
$$\delta_j = o_j (1 - o_j) \sum_k \delta_k w_{kj}$$

4. determine the updates for weights going to the hidden units, using the hidden-unit errors:
$$\Delta w_{ji} = \eta \, \delta_j \, o_i$$
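To make steps 1 through 4 concrete, here is a minimal NumPy sketch of one online backpropagation update for a network with one hidden layer of sigmoid units and sigmoid output units; all function and variable names are my own, not the lecture's.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_step(W_hid, W_out, x, y, eta=0.1):
    """One online backprop update. A constant 1 is prepended to x and to the
    hidden outputs, so the first column of each weight matrix acts as w0."""
    # forward pass
    x = np.concatenate(([1.0], x))
    h = sigmoid(W_hid @ x)                  # hidden-unit outputs
    h1 = np.concatenate(([1.0], h))
    o = sigmoid(W_out @ h1)                 # output-unit outputs

    # 1. error (delta) of the output units
    delta_out = o * (1 - o) * (y - o)
    # 2. updates for weights going to the output units
    W_out_new = W_out + eta * np.outer(delta_out, h1)
    # 3. error of the hidden units: sum of backpropagated contributions,
    #    skipping the column of weights attached to the constant input
    delta_hid = h * (1 - h) * (W_out[:, 1:].T @ delta_out)
    # 4. updates for weights going to the hidden units
    W_hid_new = W_hid + eta * np.outer(delta_hid, x)
    return W_hid_new, W_out_new
```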
Neural network jargon
• epoch: one pass through the training instances during gradient descent
Initializing weights
• weights should be initialized to small random values, so that the sigmoid activations are in the range where the derivative is large (learning will be quicker)
[figure: the error E plotted against a single weight $w_{ij}$, with the gradient step $-\frac{\partial E}{\partial w_{ij}}$ shown]

[figure: error versus training iterations; if the learning rate η is set too small, the error goes down only a little on each iteration]
Input (feature) encoding for neural networks
nominal features are usually represented using a 1-of-k encoding
for example, the four DNA bases can be encoded with four binary inputs:
A = [1, 0, 0, 0]   C = [0, 1, 0, 0]   G = [0, 0, 1, 0]   T = [0, 0, 0, 1]

a real-valued feature is typically represented directly by a single input, e.g.
precipitation = [ 0.68 ]
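A small sketch of a 1-of-k encoding for a nominal feature, using the DNA alphabet from the example (the function name is illustrative):

```python
import numpy as np

def one_of_k(value, categories=("A", "C", "G", "T")):
    """Encode a nominal feature value as a 1-of-k binary vector."""
    vec = np.zeros(len(categories))
    vec[categories.index(value)] = 1.0
    return vec

print(one_of_k("G"))   # [0. 0. 1. 0.]
```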
Output encoding for neural networks
regression tasks usually use output units with linear transfer functions
classification tasks typically use one output unit per class, with a softmax function relating the outputs:
$$o_i = \frac{e^{net_i}}{\sum_{j \in outputs} e^{net_j}}$$
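A minimal softmax sketch (the max-subtraction is a standard numerical-stability trick, not something from the slide):

```python
import numpy as np

def softmax(net):
    """o_i = exp(net_i) / sum_j exp(net_j), taken over the output units."""
    exps = np.exp(net - np.max(net))   # subtracting the max avoids overflow
    return exps / np.sum(exps)

print(softmax(np.array([2.0, 1.0, 0.1])))   # three positive outputs that sum to 1
```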
Recurrent neural networks
recurrent networks are sometimes used for tasks that involve making
sequences of predictions
• Elman networks: recurrent connections go from hidden units to inputs
• Jordan networks: recurrent connections go from output units to inputs
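As an illustration of the Elman-style architecture, a minimal sketch of one time step in which the previous hidden activations are fed back as additional (context) inputs; all names are my own.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def elman_step(W_in, W_context, W_out, x_t, h_prev):
    """One time step: the previous hidden state acts as extra context inputs."""
    h_t = sigmoid(W_in @ x_t + W_context @ h_prev)   # hidden units see x_t and h_{t-1}
    o_t = sigmoid(W_out @ h_t)                       # prediction for this step
    return o_t, h_t
```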
Alternative approach to training deep networks
• use unsupervised learning to find useful hidden unit representations
Learning representations
• the feature representation provided is often the most
significant factor in how well a learning system works
The role of hidden units
• hidden units can learn a compressed numerical coding of the inputs/outputs
How many hidden units should be used?
• conventional wisdom in the early days of neural nets: prefer small
networks because fewer parameters (i.e. weights & biases) will be
less likely to overfit
• somewhat more recent wisdom: if early stopping is used, larger
networks often behave as if they have fewer “effective” hidden
units, and find better solutions
[figure: test-set error during training for a network with 4 hidden units and one with 15 hidden units]
Autoencoders
• one approach: use autoencoders to learn hidden-unit representations
• in an autoencoder, the network is trained to reconstruct the inputs
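A minimal sketch of a single-hidden-layer autoencoder's forward pass and reconstruction objective (names are illustrative, and this is only one simple variant):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def autoencoder_forward(W_enc, W_dec, x):
    """Encode the input into a (typically smaller) hidden layer, then decode it."""
    h = sigmoid(W_enc @ x)        # hidden-unit representation (the learned encoding)
    x_hat = sigmoid(W_dec @ h)    # reconstruction of the input
    return x_hat, h

def reconstruction_error(x, x_hat):
    # the network is trained (e.g. by backprop) to make x_hat match x
    return 0.5 * np.sum((x - x_hat) ** 2)
```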
Autoencoder variants
• how to encourage the autoencoder to generalize
Stacking Autoencoders
• autoencoders can be stacked to form highly nonlinear representations
[Bengio et al. NIPS 2006]
Deep learning not limited to neural networks
• First developed by Geoff Hinton and colleagues for deep belief networks, a kind of hybrid between neural nets and Bayes nets
• Hinton motivates the unsupervised deep learning training process by the credit assignment problem, which appears in belief nets, Bayes nets, neural nets, restricted Boltzmann machines, etc.
  • d-separation: the problem of evidence at a converging connection creating competing explanations
  • backpropagation: can't choose which neighbors get the blame for an error at this node
Room for Debate
• many now arguing that the unsupervised pre-training phase is not really needed…
• backprop is sufficient if done better
  – wider diversity in initial weights: try many initial settings until you get learning
  – don't worry much about the exact learning rate, but add momentum: if moving fast in a given direction, keep it up for a while
  – need a lot of data for deep net backprop
Dropout training
• on each training iteration, randomly "drop out" (ignore) a subset of the units and their weights, e.g. 90% of the units (or some other percentage)
• do forward and backprop on the remaining network
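A minimal sketch of applying a dropout mask to a layer's activations during a training iteration (the drop probability is a parameter here; this is illustrative, not the lecture's exact procedure):

```python
import numpy as np

def dropout(activations, drop_prob=0.5, rng=None):
    """Randomly zero out a subset of the units for this training iteration."""
    if rng is None:
        rng = np.random.default_rng()
    mask = rng.random(activations.shape) >= drop_prob   # True for the units we keep
    return activations * mask                            # dropped units contribute nothing

# a fresh mask is drawn on every training iteration; forward and backprop
# then use only the surviving units and their weights
```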
Some Deep Learning Resources
• Nature, Jan 8, 2014:
  http://www.nature.com/news/computer-science-the-learning-machines-1.14481
• Ng Tutorial:
  http://deeplearning.stanford.edu/wiki/index.php/UFLDL_Tutorial
• Hinton Tutorial:
  http://videolectures.net/jul09_hinton_deeplearn/
• backpropagation generalizes to
• arbitrary numbers of output and hidden units
• arbitrary layers of hidden units (in theory)
• arbitrary connection patterns
• other transfer (i.e. output) functions
• other error measures
• backprop doesn’t usually work well for networks with multiple layers of
hidden units; recent work in deep networks addresses this limitation