
Neural networks

Chapter 20, Section 5


Outline

♦ Brains
♦ Neural networks
♦ Perceptrons
♦ Multilayer perceptrons
♦ Applications of neural networks

Brains

10^11 neurons of > 20 types, 10^14 synapses, 1ms–10ms cycle time
Signals are noisy “spike trains” of electrical potential

[Figure: a real neuron, showing the cell body or soma, nucleus, dendrites, axon, axonal arborization, synapses, and the axon from another cell]


McCulloch–Pitts “unit”

Output is a “squashed” linear function of the inputs:

    a_i ← g(in_i) = g(Σ_j W_j,i a_j)

[Figure: a single unit; input links a_j with weights W_j,i and a bias input a_0 = −1 with bias weight W_0,i feed the input function Σ; its result in_i passes through the activation function g to give the output a_i = g(in_i), which is sent along the output links]

A gross oversimplification of real neurons, but its purpose is
to develop understanding of what networks of simple units can do
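
Not part of the original slides: a minimal Python sketch of the unit above, using the two activation functions introduced on the next slide; the example weights are assumptions chosen for illustration.

    # A single McCulloch-Pitts-style unit: a_i = g(sum_j W_j,i * a_j),
    # with the bias handled as a fixed input a_0 = -1 weighted by W_0,i.
    import math

    def step(in_i):
        # threshold/step activation: fires iff the weighted sum is non-negative
        return 1.0 if in_i >= 0 else 0.0

    def sigmoid(in_i):
        # smooth "squashing" activation 1 / (1 + e^(-x))
        return 1.0 / (1.0 + math.exp(-in_i))

    def unit_output(weights, inputs, g):
        # weights[0] is the bias weight W_0,i, paired with the fixed input a_0 = -1
        activations = [-1.0] + list(inputs)
        in_i = sum(w * a for w, a in zip(weights, activations))
        return g(in_i)

    # Illustrative weights (assumed, not from the slides).
    print(unit_output([0.5, 1.0, 1.0], [1, 0], step))     # 1.0
    print(unit_output([0.5, 1.0, 1.0], [1, 0], sigmoid))  # about 0.62
    print(unit_output([1.5, 1.0, 1.0], [1, 0], step))     # 0.0: a larger bias weight raises the threshold
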

Activation functions

[Figure: two activation functions g(in_i) plotted against in_i: (a) a step function; (b) a sigmoid]

(a) is a step function or threshold function

(b) is a sigmoid function 1/(1 + e^−x)

Changing the bias weight W_0,i moves the threshold location


Implementing logical functions

[Figure: three threshold units with bias input a_0 = −1: AND (W_0 = 1.5, W_1 = 1, W_2 = 1), OR (W_0 = 0.5, W_1 = 1, W_2 = 1), NOT (W_0 = −0.5, W_1 = −1)]

McCulloch and Pitts: every Boolean function can be implemented
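
Not part of the original slides: a short Python check of the weights shown above, evaluating each threshold unit over its truth table.

    # Threshold units for AND, OR, NOT using the weights from the slide above.
    # Each unit computes step(sum_j W_j * a_j) with a fixed bias input a_0 = -1.

    def threshold_unit(weights, inputs):
        activations = [-1.0] + list(inputs)            # prepend the bias input a_0 = -1
        in_ = sum(w * a for w, a in zip(weights, activations))
        return 1 if in_ >= 0 else 0

    AND = [1.5, 1.0, 1.0]    # W_0 = 1.5, W_1 = W_2 = 1
    OR  = [0.5, 1.0, 1.0]    # W_0 = 0.5, W_1 = W_2 = 1
    NOT = [-0.5, -1.0]       # W_0 = -0.5, W_1 = -1

    for x1 in (0, 1):
        for x2 in (0, 1):
            print(x1, x2,
                  threshold_unit(AND, [x1, x2]),   # matches x1 and x2
                  threshold_unit(OR,  [x1, x2]))   # matches x1 or x2
    for x1 in (0, 1):
        print(x1, threshold_unit(NOT, [x1]))       # matches not x1
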


Network structures

Feed-forward networks:
– single-layer perceptrons
– multi-layer perceptrons

Feed-forward networks implement functions, have no internal state

Recurrent networks:
– Hopfield networks have symmetric weights (W_i,j = W_j,i)
  g(x) = sign(x), a_i = ±1; holographic associative memory
– Boltzmann machines use stochastic activation functions,
  ≈ MCMC in Bayes nets
– recurrent neural nets have directed cycles with delays
  ⇒ have internal state (like flip-flops), can oscillate etc.


Feed-forward example

[Figure: a feed-forward network with input units 1 and 2, hidden units 3 and 4, output unit 5, and weights W_1,3, W_1,4, W_2,3, W_2,4, W_3,5, W_4,5]

Feed-forward network = a parameterized family of nonlinear functions:

    a_5 = g(W_3,5 · a_3 + W_4,5 · a_4)
        = g(W_3,5 · g(W_1,3 · a_1 + W_2,3 · a_2) + W_4,5 · g(W_1,4 · a_1 + W_2,4 · a_2))

Adjusting weights changes the function: do learning this way!
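
Not part of the original slides: a small Python sketch that evaluates the nested expression for a_5 above; the sigmoid activation and the particular weight values are assumptions for illustration.

    import math

    def g(x):
        # sigmoid activation, one of the choices from the earlier slide
        return 1.0 / (1.0 + math.exp(-x))

    def forward(a1, a2, W):
        # W is a dict of the six weights named in the figure, e.g. W["1,3"] = W_1,3
        a3 = g(W["1,3"] * a1 + W["2,3"] * a2)
        a4 = g(W["1,4"] * a1 + W["2,4"] * a2)
        a5 = g(W["3,5"] * a3 + W["4,5"] * a4)
        return a5

    # Illustrative weights (assumed): changing any of them changes the function.
    W = {"1,3": 0.5, "2,3": -1.0, "1,4": 1.5, "2,4": 0.25, "3,5": 2.0, "4,5": -0.5}
    print(forward(1.0, 0.0, W))
    print(forward(0.0, 1.0, W))   # a different input gives a different a_5
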

Single-layer perceptrons

[Figure: a single layer of output units, each connected directly to all of the input units by weights W_j,i; alongside, a plot of a perceptron’s output as a function of two inputs x_1 and x_2]

Output units all operate separately—no shared weights

Adjusting weights moves the location, orientation, and steepness of cliff


Expressiveness of perceptrons

Consider a perceptron with g = step function (Rosenblatt, 1957, 1960)

Can represent AND, OR, NOT, majority, etc., but not XOR

Represents a linear separator in input space:

    Σ_j W_j x_j > 0    or    W · x > 0

[Figure: the points of the (x_1, x_2) truth tables for (a) x_1 and x_2, (b) x_1 or x_2, (c) x_1 xor x_2; the first two are linearly separable, the third is not]

Minsky & Papert (1969) pricked the neural network balloon
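
Not part of the original slides, and an empirical illustration rather than a proof: a Python sketch that scans a coarse grid of weight vectors for a step-function perceptron; some settings on the grid reproduce AND and OR, but none reproduces XOR.

    # Linear-separator view of a step-function perceptron with bias input a_0 = -1:
    # output is 1 iff W_1*x_1 + W_2*x_2 >= W_0, i.e. the slide's test W . x > 0
    # with >= used at the boundary.
    from itertools import product

    def perceptron(w0, w1, w2, x1, x2):
        return 1 if (-w0 + w1 * x1 + w2 * x2) >= 0 else 0

    INPUTS = [(0, 0), (0, 1), (1, 0), (1, 1)]
    AND_T = [0, 0, 0, 1]
    OR_T  = [0, 1, 1, 1]
    XOR_T = [0, 1, 1, 0]

    def representable(targets, grid):
        # brute-force scan: is there any (w0, w1, w2) on the grid that fits all 4 rows?
        return any(all(perceptron(w0, w1, w2, x1, x2) == t
                       for (x1, x2), t in zip(INPUTS, targets))
                   for w0, w1, w2 in product(grid, repeat=3))

    grid = [i / 2 for i in range(-6, 7)]        # -3.0, -2.5, ..., 3.0
    print(representable(AND_T, grid))           # True
    print(representable(OR_T, grid))            # True
    print(representable(XOR_T, grid))           # False: no linear separator found
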

Perceptron learning

Learn by adjusting weights to reduce error on training set

The squared error for an example with input x and true output y is

    E = ½ Err² ≡ ½ (y − h_W(x))²

Perform optimization search by gradient descent:

    ∂E/∂W_j = Err × ∂Err/∂W_j = Err × ∂/∂W_j (y − g(Σ_{j=0..n} W_j x_j))
            = −Err × g′(in) × x_j

Simple weight update rule:

    W_j ← W_j + α × Err × g′(in) × x_j

E.g., +ve error ⇒ increase network output
⇒ increase weights on +ve inputs, decrease on −ve inputs
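
Not part of the original slides: a minimal Python sketch of the update rule above for a sigmoid perceptron, trained on the 3-input majority function as an assumed example.

    import math, random
    from itertools import product

    def g(x):
        return 1.0 / (1.0 + math.exp(-x))

    def g_prime(x):
        s = g(x)
        return s * (1.0 - s)                               # derivative of the sigmoid

    # Training set: 3-input majority, with a fixed bias input x_0 = -1 in front.
    examples = [([-1.0, x1, x2, x3], 1.0 if x1 + x2 + x3 >= 2 else 0.0)
                for x1, x2, x3 in product((0, 1), repeat=3)]

    random.seed(0)
    W = [random.uniform(-0.5, 0.5) for _ in range(4)]      # W_0 (bias weight) .. W_3
    alpha = 0.5

    for epoch in range(2000):
        for x, y in examples:
            in_ = sum(wj * xj for wj, xj in zip(W, x))
            err = y - g(in_)                               # Err = y - h_W(x)
            W = [wj + alpha * err * g_prime(in_) * xj      # W_j <- W_j + alpha*Err*g'(in)*x_j
                 for wj, xj in zip(W, x)]

    for x, y in examples:
        print(y, round(g(sum(wj * xj for wj, xj in zip(W, x))), 2))  # predictions should approach the targets
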
Perceptron learning contd.

Perceptron learning rule converges to a consistent function
for any linearly separable data set

[Figure: learning curves (proportion correct on test set vs. training set size) for the perceptron and decision-tree learning, on MAJORITY with 11 inputs and on the RESTAURANT data]

Perceptron learns majority function easily, DTL is hopeless

DTL learns restaurant function easily, perceptron cannot represent it


Multilayer perceptrons

Layers are usually fully connected;
numbers of hidden units typically chosen by hand

[Figure: output units a_i, connected by weights W_j,i to hidden units a_j, which are connected by weights W_k,j to input units a_k]


Expressiveness of MLPs

All continuous functions w/ 2 layers, all functions w/ 3 layers

[Figure: two surface plots of h_W(x_1, x_2): a ridge and a bump]

Combine two opposite-facing threshold functions to make a ridge

Combine two perpendicular ridges to make a bump

Add bumps of various sizes and locations to fit any surface

Proof requires exponentially many hidden units (cf DTL proof)
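
Not part of the original slides: a numeric sketch of the constructions listed above, with assumed steepness and offsets; two opposite-facing sigmoids make a ridge, and two perpendicular ridges fed through one more soft threshold make a bump.

    import math

    def sig(x):
        return 1.0 / (1.0 + math.exp(-x))

    def ridge(x):
        # two opposite-facing soft thresholds: high for roughly -2 < x < 2, low outside
        return sig(5 * (x + 2)) + sig(-5 * (x - 2)) - 1.0

    def bump(x, y):
        # two perpendicular ridges fed through one more soft threshold: a localized bump
        return sig(5 * (ridge(x) + ridge(y) - 1.5))

    for x in (-4, 0, 4):
        # ridge is high only for |x| < 2; bump is high only when both coordinates are inside
        print(round(ridge(x), 2), round(bump(x, 0), 2), round(bump(x, 4), 2))
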

Back-propagation learning

Output layer: same as for single-layer perceptron,

    W_j,i ← W_j,i + α × a_j × Δ_i

where Δ_i = Err_i × g′(in_i)

Hidden layer: back-propagate the error from the output layer:

    Δ_j = g′(in_j) Σ_i W_j,i Δ_i

Update rule for weights in hidden layer:

    W_k,j ← W_k,j + α × a_k × Δ_j

(Most neuroscientists deny that back-propagation occurs in the brain)
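
Not part of the original slides: a compact Python sketch of one back-propagation step on the 2-2-1 network from the feed-forward example, applying the output-layer rule, the back-propagated Δ_j, and the hidden-layer rule above; the starting weights, input, and target are assumptions.

    import math

    def g(x):
        return 1.0 / (1.0 + math.exp(-x))

    def g_prime(x):
        s = g(x)
        return s * (1.0 - s)

    # One gradient step on the 2-2-1 network of the feed-forward example.
    # Assumed starting weights (names match the figure: W13 is W_1,3, etc.).
    W13, W14, W23, W24, W35, W45 = 0.4, -0.3, 0.8, 0.2, 1.0, -0.6
    a1, a2, y5 = 1.0, 0.0, 1.0        # assumed input and target output
    alpha = 0.1

    # Forward pass.
    in3, in4 = W13 * a1 + W23 * a2, W14 * a1 + W24 * a2
    a3, a4 = g(in3), g(in4)
    in5 = W35 * a3 + W45 * a4
    a5 = g(in5)

    # Output layer: Delta_5 = Err_5 * g'(in_5).
    d5 = (y5 - a5) * g_prime(in5)
    # Hidden layer: Delta_j = g'(in_j) * sum_i W_j,i Delta_i (the only output is unit 5).
    d3 = g_prime(in3) * W35 * d5
    d4 = g_prime(in4) * W45 * d5

    # Weight updates: W_j,i += alpha * a_j * Delta_i and W_k,j += alpha * a_k * Delta_j.
    W35 += alpha * a3 * d5
    W45 += alpha * a4 * d5
    W13 += alpha * a1 * d3
    W23 += alpha * a2 * d3
    W14 += alpha * a1 * d4
    W24 += alpha * a2 * d4

    print(a5, d5, d3, d4)             # the error signal propagated back through the net
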
Back-propagation derivation

The squared error on a single example is defined as

    E = ½ Σ_i (y_i − a_i)²,

where the sum is over the nodes in the output layer.

    ∂E/∂W_j,i = −(y_i − a_i) ∂a_i/∂W_j,i = −(y_i − a_i) ∂g(in_i)/∂W_j,i
              = −(y_i − a_i) g′(in_i) ∂in_i/∂W_j,i = −(y_i − a_i) g′(in_i) ∂/∂W_j,i (Σ_j W_j,i a_j)
              = −(y_i − a_i) g′(in_i) a_j = −a_j Δ_i

Back-propagation derivation contd.

    ∂E/∂W_k,j = −Σ_i (y_i − a_i) ∂a_i/∂W_k,j = −Σ_i (y_i − a_i) ∂g(in_i)/∂W_k,j
              = −Σ_i (y_i − a_i) g′(in_i) ∂in_i/∂W_k,j = −Σ_i Δ_i ∂/∂W_k,j (Σ_j W_j,i a_j)
              = −Σ_i Δ_i W_j,i ∂a_j/∂W_k,j = −Σ_i Δ_i W_j,i ∂g(in_j)/∂W_k,j
              = −Σ_i Δ_i W_j,i g′(in_j) ∂in_j/∂W_k,j
              = −Σ_i Δ_i W_j,i g′(in_j) ∂/∂W_k,j (Σ_k W_k,j a_k)
              = −Σ_i Δ_i W_j,i g′(in_j) a_k = −a_k Δ_j
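
Not part of the original slides: a numerical sanity check of the derivation above, comparing ∂E/∂W_k,j = −a_k Δ_j against a central-difference estimate on a tiny 2-2-1 sigmoid network with assumed weights, input, and target.

    import math

    def g(x):
        return 1.0 / (1.0 + math.exp(-x))

    def error(W, x, y):
        # E = 1/2 (y - a_5)^2 for the 2-2-1 network; W maps "k,j" -> W_k,j.
        a3 = g(W["1,3"] * x[0] + W["2,3"] * x[1])
        a4 = g(W["1,4"] * x[0] + W["2,4"] * x[1])
        a5 = g(W["3,5"] * a3 + W["4,5"] * a4)
        return 0.5 * (y - a5) ** 2

    # Assumed weights, input, and target.
    W = {"1,3": 0.4, "2,3": 0.8, "1,4": -0.3, "2,4": 0.2, "3,5": 1.0, "4,5": -0.6}
    x, y = (1.0, 0.5), 1.0

    # Analytic gradient for W_1,3 from the derivation: dE/dW_k,j = -a_k * Delta_j.
    in3 = W["1,3"] * x[0] + W["2,3"] * x[1]
    in4 = W["1,4"] * x[0] + W["2,4"] * x[1]
    a3, a4 = g(in3), g(in4)
    in5 = W["3,5"] * a3 + W["4,5"] * a4
    a5 = g(in5)
    delta5 = (y - a5) * a5 * (1 - a5)              # Delta_i for the output unit; g' of a sigmoid is a(1 - a)
    delta3 = a3 * (1 - a3) * W["3,5"] * delta5     # Delta_j = g'(in_j) * W_j,i * Delta_i
    analytic = -x[0] * delta3                      # dE/dW_1,3 = -a_k * Delta_j with a_k = x_1

    # Numerical estimate by central differences.
    eps = 1e-6
    Wp, Wm = dict(W), dict(W)
    Wp["1,3"] += eps
    Wm["1,3"] -= eps
    numeric = (error(Wp, x, y) - error(Wm, x, y)) / (2 * eps)

    print(analytic, numeric)                       # the two values should agree closely
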
Back-propagation learning contd.

At each epoch, sum gradient updates for all examples and apply

Training curve for 100 restaurant examples: finds exact fit

[Figure: total error on the training set vs. number of epochs, falling from about 14 to 0 over roughly 400 epochs]

Typical problems: slow convergence, local minima
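
Not part of the original slides: a sketch of epoch-level batch training as described above, on a 2-2-1 sigmoid network with bias weights added (an assumption) and XOR as an assumed target; the summed error typically falls, though as noted above it can also stall in a local minimum.

    import math, random

    def g(x):
        return 1.0 / (1.0 + math.exp(-x))

    # Batch (epoch-summed) back-propagation on a 2-2-1 network with a bias input
    # a_0 = -1 added to each layer (an assumption not drawn in the slides).
    examples = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]     # XOR
    random.seed(1)
    Wh = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(2)]  # hidden units 3,4: [bias, x1, x2] weights
    Wo = [random.uniform(-1, 1) for _ in range(3)]                      # output unit: [bias, a3, a4] weights
    alpha = 0.5

    for epoch in range(5000):
        dWh = [[0.0] * 3 for _ in range(2)]
        dWo = [0.0] * 3
        total_error = 0.0
        for (x1, x2), y in examples:
            xs = [-1.0, x1, x2]
            in_h = [sum(w * x for w, x in zip(Wh[j], xs)) for j in range(2)]
            a_h = [-1.0] + [g(v) for v in in_h]            # bias plus a_3, a_4
            a_o = g(sum(w * a for w, a in zip(Wo, a_h)))
            total_error += 0.5 * (y - a_o) ** 2
            d_o = (y - a_o) * a_o * (1 - a_o)              # Delta for the output unit
            d_h = [a_h[j + 1] * (1 - a_h[j + 1]) * Wo[j + 1] * d_o for j in range(2)]
            # Accumulate this example's gradient updates.
            dWo = [dw + alpha * a * d_o for dw, a in zip(dWo, a_h)]
            for j in range(2):
                dWh[j] = [dw + alpha * x * d_h[j] for dw, x in zip(dWh[j], xs)]
        # Apply the summed updates once per epoch.
        Wo = [w + dw for w, dw in zip(Wo, dWo)]
        Wh = [[w + dw for w, dw in zip(Wh[j], dWh[j])] for j in range(2)]
        if epoch % 1000 == 0:
            print(epoch, round(total_error, 4))   # summed error usually falls, but can stall in a local minimum
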


Back-propagation learning contd.

Learning curve for MLP with 4 hidden units:

[Figure: proportion correct on test set vs. training set size on the RESTAURANT data, comparing the multilayer network with decision-tree learning]

MLPs are quite good for complex pattern recognition tasks,
but resulting hypotheses cannot be understood easily

Handwritten digit recognition

3-nearest-neighbor = 2.4% error


400–300–10 unit MLP = 1.6% error
LeNet: 768–192–30–10 unit MLP = 0.9% error
Current best (kernel machines, vision algorithms) ≈ 0.6% error


Summary
Most brains have lots of neurons; each neuron ≈ linear–threshold unit (?)
Perceptrons (one-layer networks) insufficiently expressive
Multi-layer networks are sufficiently expressive; can be trained by gradient
descent, i.e., error back-propagation
Many applications: speech, driving, handwriting, fraud detection, etc.
Engineering, cognitive modelling, and neural system modelling
subfields have largely diverged

