
Artificial Neural Networks
Lecture slides for textbook Machine Learning, T. Mitchell, McGraw Hill, 1997

[Read Ch. 4]
[Recommended exercises 4.1, 4.2, 4.5, 4.9, 4.11]
- Threshold units
- Gradient descent
- Multilayer networks
- Backpropagation
- Hidden layer representations
- Example: Face Recognition
- Advanced topics


Connectionist Models
Consider humans:
- Neuron switching time ~ 0.001 second
- Number of neurons ~ 10^10
- Connections per neuron ~ 10^4-10^5
- Scene recognition time ~ 0.1 second
- 100 inference steps doesn't seem like enough
→ much parallel computation
Properties of artificial neural nets (ANNs):
- Many neuron-like threshold switching units
- Many weighted interconnections among units
- Highly parallel, distributed process
- Emphasis on tuning weights automatically


When to Consider Neural Networks
- Input is high-dimensional discrete or real-valued (e.g., raw sensor input)
- Output is discrete or real-valued
- Output is a vector of values
- Possibly noisy data
- Form of target function is unknown
- Human readability of result is unimportant
Examples:
- Speech phoneme recognition [Waibel]
- Image classification [Kanade, Baluja, Rowley]
- Financial prediction


ALVINN drives 70 mph on highways

[Figure: the ALVINN network. A 30x32 sensor input retina feeds 4 hidden units, which feed 30 output units covering steering directions from Sharp Left through Straight Ahead to Sharp Right.]


Perceptron
[Figure: the perceptron. Inputs x1, ..., xn with weights w1, ..., wn, plus a constant input x0 = 1 with weight w0, feed the linear sum Σ_{i=0}^{n} w_i x_i, followed by a threshold whose output o is 1 if the sum is positive and -1 otherwise.]

o(x_1, ..., x_n) = 1 if w_0 + w_1 x_1 + ... + w_n x_n > 0, and -1 otherwise.

Sometimes we'll use simpler vector notation:

o(x) = 1 if w · x > 0, and -1 otherwise

(with w and x understood as the weight and input vectors).
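As a concrete illustration (not from the original slides), here is a minimal sketch of the thresholded perceptron output in Python, with the bias folded in as w0 and x0 = 1; the sample weights are one illustrative choice that realizes AND(x1, x2) over 0/1 inputs:

    import numpy as np

    def perceptron_output(w, x):
        """Threshold unit: returns 1 if w . x > 0, else -1.

        w and x are 1-D arrays of equal length; w[0] is the bias weight
        and x[0] is assumed to be the constant input 1.
        """
        return 1 if np.dot(w, x) > 0 else -1

    # Example: weights representing AND(x1, x2) over inputs in {0, 1}
    w = np.array([-0.8, 0.5, 0.5])                      # w0 (bias), w1, w2
    print(perceptron_output(w, np.array([1, 1, 1])))    # -> 1
    print(perceptron_output(w, np.array([1, 1, 0])))    # -> -1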


Decision Surface of a Perceptron


[Figure: the decision surface of a two-input perceptron in the (x1, x2) plane. (a) A set of positive and negative examples separated by the perceptron's linear decision boundary; (b) a set of examples that is not linearly separable.]
Represents some useful functions
- What weights represent g(x_1, x_2) = AND(x_1, x_2)?

But some functions not representable
- e.g., not linearly separable
- Therefore, we'll want networks of these...


Perceptron training rule

w_i ← w_i + Δw_i

where

Δw_i = η (t - o) x_i

and:
- t = c(x) is the target value
- o is the perceptron output
- η is a small constant (e.g., 0.1) called the learning rate
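A hedged sketch of this update rule in Python (names such as perceptron_train and the epoch count are illustrative, not from the slides):

    import numpy as np

    def perceptron_train(X, t, eta=0.1, epochs=50):
        """Perceptron training rule: w_i <- w_i + eta * (t - o) * x_i.

        X: array of shape (m, n+1) whose first column is the constant 1.
        t: array of m targets in {-1, +1}.
        Returns the learned weights (assumes the data is linearly separable
        and eta is small enough for convergence).
        """
        w = np.zeros(X.shape[1])
        for _ in range(epochs):
            for x_d, t_d in zip(X, t):
                o = 1 if np.dot(w, x_d) > 0 else -1
                w += eta * (t_d - o) * x_d
        return w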


Perceptron training rule


Can prove it will converge
- If training data is linearly separable
- and η sufficiently small


Gradient Descent
To understand, consider simpler linear unit, where
o = w_0 + w_1 x_1 + ... + w_n x_n

Let's learn w_i's that minimize the squared error

E[w] ≡ (1/2) Σ_{d∈D} (t_d - o_d)^2

where D is the set of training examples.
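A minimal sketch of this linear unit and its squared error in Python (function names are illustrative; X is assumed to carry a leading column of 1s for the bias):

    import numpy as np

    def linear_output(w, x):
        """o = w0 + w1*x1 + ... + wn*xn, with x[0] == 1 supplying the bias."""
        return np.dot(w, x)

    def squared_error(w, X, t):
        """E[w] = 1/2 * sum over training examples d of (t_d - o_d)^2."""
        o = X @ w                      # outputs for all examples at once
        return 0.5 * np.sum((t - o) ** 2)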


Gradient Descent

[Figure: the error surface E[w] plotted over weights w0 and w1; for a linear unit this surface is a parabolic bowl with a single global minimum, and gradient descent steps move downhill on it.]

Gradient

∇E[w] ≡ [ ∂E/∂w_0, ∂E/∂w_1, ..., ∂E/∂w_n ]

Training rule:

Δw = -η ∇E[w]

i.e.,

Δw_i = -η ∂E/∂w_i

Gradient Descent
∂E/∂w_i = ∂/∂w_i (1/2) Σ_d (t_d - o_d)^2

        = (1/2) Σ_d ∂/∂w_i (t_d - o_d)^2

        = (1/2) Σ_d 2 (t_d - o_d) ∂/∂w_i (t_d - o_d)

        = Σ_d (t_d - o_d) ∂/∂w_i (t_d - w · x_d)

∂E/∂w_i = Σ_d (t_d - o_d) (-x_{i,d})
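To make the result concrete, here is a small sketch (not from the slides) that compares this analytic gradient with a finite-difference estimate; the random data and the epsilon value are arbitrary illustrations:

    import numpy as np

    def analytic_gradient(w, X, t):
        """dE/dw_i = sum_d (t_d - o_d) * (-x_{i,d}) for a linear unit."""
        o = X @ w
        return -(t - o) @ X              # vector of partial derivatives

    def numeric_gradient(w, X, t, eps=1e-6):
        """Central finite-difference estimate of the same gradient."""
        def E(w):
            return 0.5 * np.sum((t - X @ w) ** 2)
        grad = np.zeros_like(w)
        for i in range(len(w)):
            d = np.zeros_like(w)
            d[i] = eps
            grad[i] = (E(w + d) - E(w - d)) / (2 * eps)
        return grad

    rng = np.random.default_rng(0)
    X = rng.normal(size=(5, 3)); t = rng.normal(size=5); w = rng.normal(size=3)
    print(np.allclose(analytic_gradient(w, X, t), numeric_gradient(w, X, t)))  # True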


Gradient Descent
Gradient-Descent(training_examples, η)

Each training example is a pair of the form ⟨x, t⟩, where x is the vector of input values, and t is the target output value. η is the learning rate (e.g., 0.05).

- Initialize each w_i to some small random value
- Until the termination condition is met, Do
  - Initialize each Δw_i to zero.
  - For each ⟨x, t⟩ in training_examples, Do
    - Input the instance x to the unit and compute the output o
    - For each linear unit weight w_i, Do
        Δw_i ← Δw_i + η (t - o) x_i
  - For each linear unit weight w_i, Do
        w_i ← w_i + Δw_i
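A runnable sketch of this batch procedure for a linear unit; a fixed number of epochs stands in for the unspecified termination condition, and the initialization range is an assumption:

    import numpy as np

    def gradient_descent(X, t, eta=0.05, epochs=100):
        """Batch gradient descent for a linear unit o = w . x.

        X: (m, n+1) array of training inputs, first column all 1s (bias).
        t: (m,) array of target outputs.
        """
        rng = np.random.default_rng(0)
        w = rng.uniform(-0.05, 0.05, size=X.shape[1])    # small random init
        for _ in range(epochs):
            delta_w = np.zeros_like(w)
            for x_d, t_d in zip(X, t):
                o = np.dot(w, x_d)
                delta_w += eta * (t_d - o) * x_d          # accumulate updates
            w += delta_w                                  # apply after full pass
        return w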


Summary
Perceptron training rule guaranteed to succeed if
- Training examples are linearly separable
- Sufficiently small learning rate η

Linear unit training rule uses gradient descent
- Guaranteed to converge to hypothesis with minimum squared error
- Given sufficiently small learning rate η
- Even when training data contains noise
- Even when training data not separable by H


Incremental (Stochastic) Gradient Descent


Batch mode Gradient Descent:

Do until satisfied
  1. Compute the gradient ∇E_D[w]
  2. w ← w - η ∇E_D[w]

Incremental mode Gradient Descent:

Do until satisfied
  - For each training example d in D
    1. Compute the gradient ∇E_d[w]
    2. w ← w - η ∇E_d[w]

E_D[w] ≡ (1/2) Σ_{d∈D} (t_d - o_d)^2

E_d[w] ≡ (1/2) (t_d - o_d)^2

Incremental Gradient Descent can approximate Batch Gradient Descent arbitrarily closely if η made small enough.
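A sketch of the incremental (stochastic) variant for the linear unit, nudging the weights along -∇E_d[w] immediately after each example rather than after a full pass (names and constants are illustrative):

    import numpy as np

    def incremental_gradient_descent(X, t, eta=0.05, epochs=100):
        """Stochastic gradient descent for a linear unit o = w . x."""
        rng = np.random.default_rng(0)
        w = rng.uniform(-0.05, 0.05, size=X.shape[1])
        for _ in range(epochs):
            for x_d, t_d in zip(X, t):
                o = np.dot(w, x_d)
                w += eta * (t_d - o) * x_d     # per-example update
        return w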


Multilayer Networks of Sigmoid Units

[Figure: a multilayer network of sigmoid units from a speech recognition task; the inputs are the formant values F1 and F2, and the output units distinguish the words "head", "hid", "who'd", and "hood".]


Sigmoid Unit
[Figure: the sigmoid unit. Inputs x1, ..., xn with weights w1, ..., wn, plus a constant input x0 = 1 with weight w0, feed net = Σ_{i=0}^{n} w_i x_i; the unit's output is o = σ(net) = 1 / (1 + e^(-net)).]

σ(x) is the sigmoid function

σ(x) = 1 / (1 + e^(-x))

Nice property: dσ(x)/dx = σ(x) (1 - σ(x))

We can derive gradient descent rules to train
- One sigmoid unit
- Multilayer networks of sigmoid units → Backpropagation
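A small sketch of the sigmoid unit and a numeric check of the derivative identity above (illustrative code, not from the slides):

    import numpy as np

    def sigmoid(x):
        """sigma(x) = 1 / (1 + e^(-x))"""
        return 1.0 / (1.0 + np.exp(-x))

    def sigmoid_unit_output(w, x):
        """o = sigma(net), where net = w . x and x[0] == 1 supplies the bias."""
        return sigmoid(np.dot(w, x))

    # The 'nice property': the derivative can be written from the output alone.
    s = sigmoid(0.3)
    finite_diff = (sigmoid(0.3 + 1e-6) - sigmoid(0.3 - 1e-6)) / 2e-6
    assert np.isclose(s * (1 - s), finite_diff)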


Error Gradient for a Sigmoid Unit


∂E/∂w_i = ∂/∂w_i (1/2) Σ_{d∈D} (t_d - o_d)^2

        = (1/2) Σ_d ∂/∂w_i (t_d - o_d)^2

        = (1/2) Σ_d 2 (t_d - o_d) ∂/∂w_i (t_d - o_d)

        = Σ_d (t_d - o_d) (-∂o_d/∂w_i)

        = -Σ_d (t_d - o_d) (∂o_d/∂net_d) (∂net_d/∂w_i)

But we know:

∂o_d/∂net_d = ∂σ(net_d)/∂net_d = o_d (1 - o_d)

∂net_d/∂w_i = ∂(w · x_d)/∂w_i = x_{i,d}

So:

∂E/∂w_i = -Σ_{d∈D} (t_d - o_d) o_d (1 - o_d) x_{i,d}
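As an illustrative sketch, this is the resulting batch gradient step for a single sigmoid unit (the function names and learning rate are assumptions):

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def sigmoid_unit_step(w, X, t, eta=0.1):
        """One batch gradient step: w <- w - eta * dE/dw, using
        dE/dw_i = -sum_d (t_d - o_d) * o_d * (1 - o_d) * x_{i,d}."""
        o = sigmoid(X @ w)
        grad = -((t - o) * o * (1 - o)) @ X
        return w - eta * grad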


Backpropagation Algorithm
Initialize all weights to small random numbers.
Until satisfied, Do
- For each training example, Do
  1. Input the training example to the network and compute the network outputs
  2. For each output unit k
       δ_k ← o_k (1 - o_k) (t_k - o_k)
  3. For each hidden unit h
       δ_h ← o_h (1 - o_h) Σ_{k∈outputs} w_{h,k} δ_k
  4. Update each network weight w_{i,j}
       w_{i,j} ← w_{i,j} + Δw_{i,j}
     where
       Δw_{i,j} = η δ_j x_{i,j}
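A compact, self-contained sketch of this algorithm for one hidden layer, using weight matrices instead of per-weight loops; the layer shapes, learning rate, and names are illustrative assumptions:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def backprop_epoch(W_hid, W_out, X, T, eta=0.3):
        """One pass of stochastic backpropagation over the training set.

        W_hid: (n_hidden, n_in) weights into the hidden layer.
        W_out: (n_out, n_hidden) weights into the output layer.
        X: (m, n_in) inputs (include a bias column of 1s if desired).
        T: (m, n_out) target outputs.
        """
        for x, t in zip(X, T):
            # Forward pass
            h = sigmoid(W_hid @ x)                            # hidden outputs o_h
            o = sigmoid(W_out @ h)                            # output unit values o_k
            # Error terms
            delta_out = o * (1 - o) * (t - o)                 # delta_k
            delta_hid = h * (1 - h) * (W_out.T @ delta_out)   # delta_h
            # Weight updates: Delta w_{i,j} = eta * delta_j * x_{i,j}
            W_out += eta * np.outer(delta_out, h)
            W_hid += eta * np.outer(delta_hid, x)
        return W_hid, W_out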


More on Backpropagation
- Gradient descent over entire network weight vector
- Easily generalized to arbitrary directed graphs
- Will find a local, not necessarily global error minimum
  - In practice, often works well (can run multiple times)
- Often include weight momentum α (see the sketch after this list)
    Δw_{i,j}(n) = η δ_j x_{i,j} + α Δw_{i,j}(n - 1)
- Minimizes error over training examples
  - Will it generalize well to subsequent examples?
- Training can take thousands of iterations → slow!
- Using network after training is very fast
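A hedged sketch of the momentum term: each weight change blends the current gradient step with the previous change, scaled by a momentum constant α (here alpha; the arrays and the value 0.9 are illustrative assumptions):

    def momentum_update(W, grad_step, prev_delta, alpha=0.9):
        """Delta w(n) = grad_step + alpha * Delta w(n-1).

        W, grad_step, prev_delta: arrays of the same shape.
        grad_step is the plain backprop change (eta * delta_j * x_{i,j});
        prev_delta is the change applied on the previous iteration.
        Returns the updated weights and the new Delta w to remember.
        """
        delta = grad_step + alpha * prev_delta
        return W + delta, delta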


Learning Hidden Layer Representations


[Figure: a network with eight input units, a small hidden layer, and eight output units.]

A target function:

Input      →  Output
10000000   →  10000000
01000000   →  01000000
00100000   →  00100000
00010000   →  00010000
00001000   →  00001000
00000100   →  00000100
00000010   →  00000010
00000001   →  00000001

Can this be learned??


Learning Hidden Layer Representations


A network:

[Figure: the 8 x 3 x 8 network: eight inputs, three hidden units, eight outputs.]

Learned hidden layer representation:

Input      →  Hidden Values    →  Output
10000000   →  .89 .04 .08      →  10000000
01000000   →  .01 .11 .88      →  01000000
00100000   →  .01 .97 .27      →  00100000
00010000   →  .99 .97 .71      →  00010000
00001000   →  .03 .05 .02      →  00001000
00000100   →  .22 .99 .99      →  00000100
00000010   →  .80 .01 .98      →  00000010
00000001   →  .60 .94 .01      →  00000001
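For illustration, here is a self-contained sketch that trains such an 8 x 3 x 8 network with the backpropagation update from the earlier slide; the learning rate, momentum-free update, and iteration count are assumptions, and the learned hidden values will differ from the table above in detail:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    rng = np.random.default_rng(0)
    X = np.eye(8)                                   # the eight one-hot input patterns
    T = np.eye(8)                                   # identity target function
    W_hid = rng.uniform(-0.1, 0.1, size=(3, 9))     # 8 inputs + 1 bias -> 3 hidden
    W_out = rng.uniform(-0.1, 0.1, size=(8, 4))     # 3 hidden + 1 bias -> 8 outputs

    eta = 0.3
    for _ in range(5000):                           # thousands of iterations, as noted above
        for x, t in zip(X, T):
            x_b = np.append(x, 1.0)                 # add bias input
            h = sigmoid(W_hid @ x_b)
            h_b = np.append(h, 1.0)
            o = sigmoid(W_out @ h_b)
            delta_out = o * (1 - o) * (t - o)
            delta_hid = h * (1 - h) * (W_out[:, :3].T @ delta_out)
            W_out += eta * np.outer(delta_out, h_b)
            W_hid += eta * np.outer(delta_hid, x_b)

    # Inspect the learned hidden encoding for each input pattern
    for x in X:
        print(np.round(sigmoid(W_hid @ np.append(x, 1.0)), 2))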


Training
Sum of squared errors for each output unit

[Figure: eight error curves, one per output unit, plotted against training iterations (0 to 2500), falling toward zero as training proceeds.]

Training

Hidden unit encoding for input 01000000

[Figure: the three hidden unit values for this input plotted against training iterations (0 to 2500), settling toward the encoding shown in the earlier table.]

Training
Weights from inputs to one hidden unit
[Figure: the weights from the eight inputs to one hidden unit plotted against training iterations (0 to 2500); the weights spread out over roughly -5 to 4 as training proceeds.]

Convergence of Backpropagation
Gradient descent to some local minimum
- Perhaps not global minimum...
- Add momentum
- Stochastic gradient descent
- Train multiple nets with different initial weights

Nature of convergence
- Initialize weights near zero
- Therefore, initial networks near-linear
- Increasingly non-linear functions possible as training progresses


Expressive Capabilities of ANNs


Boolean functions:
- Every boolean function can be represented by a network with a single hidden layer
- but might require exponential (in number of inputs) hidden units

Continuous functions:
- Every bounded continuous function can be approximated with arbitrarily small error by a network with one hidden layer [Cybenko 1989; Hornik et al. 1989]
- Any function can be approximated to arbitrary accuracy by a network with two hidden layers [Cybenko 1988]


Overfitting in ANNs


[Figure: error versus weight updates (examples 1 and 2). Each plot shows the training set error and the validation set error as functions of the number of weight updates; the training error keeps decreasing, while the validation error eventually levels off or begins to rise.]
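These curves motivate holding out a validation set and keeping the weights that do best on it rather than training to convergence. A hedged sketch of that idea, assuming externally supplied train_one_epoch and validation_error functions (both names are placeholders, not from the slides):

    import copy

    def train_with_early_stopping(weights, train_one_epoch, validation_error,
                                  max_epochs=10000, patience=200):
        """Keep the weights that minimize validation-set error.

        train_one_epoch(weights) -> weights after one more pass of training.
        validation_error(weights) -> error on the held-out validation set.
        patience: how many epochs to continue past the best point seen so far.
        """
        best_weights = copy.deepcopy(weights)
        best_error = validation_error(weights)
        epochs_since_best = 0
        for _ in range(max_epochs):
            weights = train_one_epoch(weights)
            err = validation_error(weights)
            if err < best_error:
                best_error, best_weights = err, copy.deepcopy(weights)
                epochs_since_best = 0
            else:
                epochs_since_best += 1
                if epochs_since_best >= patience:
                    break
        return best_weights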

Neural Nets for Face Recognition


[Figure: the face recognition network. 30x32 pixel inputs feed a hidden layer, which feeds four output units for head pose: left, straight, right, up. Typical input images are shown alongside.]

90% accurate learning head pose, and recognizing 1-of-20 faces


Learned Hidden Unit Weights


[Figure: the same network, with the learned weights from the 30x32 pixel inputs into each hidden unit displayed as images; outputs are left, straight, right, up. Typical input images are shown alongside.]

http://www.cs.cmu.edu/~tom/faces.html


Alternative Error Functions


Penalize large weights:

E(w) ≡ (1/2) Σ_{d∈D} Σ_{k∈outputs} (t_kd - o_kd)^2 + γ Σ_{i,j} w_{ji}^2

Train on target slopes as well as values:

E(w) ≡ (1/2) Σ_{d∈D} Σ_{k∈outputs} [ (t_kd - o_kd)^2 + μ Σ_{j∈inputs} ( ∂t_kd/∂x_d^j - ∂o_kd/∂x_d^j )^2 ]

Tie together weights:
- e.g., in phoneme recognition network
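For the first variant, here is a sketch of how the weight penalty changes the gradient update (the penalty constant gamma, the learning rate, and the names are illustrative assumptions):

    def weight_decay_step(W, error_gradient, eta=0.1, gamma=0.001):
        """Gradient step for E = squared error + gamma * sum of squared weights.

        W, error_gradient: arrays of the same shape; error_gradient is
        dE/dW for the plain squared-error term. The penalty contributes
        2 * gamma * W to the gradient, shrinking every weight a little
        on each update.
        """
        return W - eta * (error_gradient + 2.0 * gamma * W)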


Recurrent Networks
[Figure: (a) a feedforward network computing y(t + 1) from input x(t); (b) a recurrent network computing y(t + 1) from x(t) and context units c(t) that carry state from the previous time step; (c) the recurrent network unfolded in time, with copies for x(t), c(t) → y(t + 1), x(t - 1), c(t - 1) → y(t), and x(t - 2), c(t - 2) → y(t - 1).]

