
Slide03: Haykin Chapter 3 (Chap. 1, 3, 3rd Ed): Single-Layer Perceptrons

CPSC 636-600
Instructor: Yoonsuck Choe

Historical Overview

• McCulloch and Pitts (1943): neural networks as computing machines.

• Hebb (1949): postulated the first rule for self-organizing learning.

• Rosenblatt (1958): perceptron as a first model of supervised learning.

• Widrow and Hoff (1960): adaptive filters using the least-mean-square (LMS) algorithm (delta rule).

Multiple Faces of a Single Neuron

What a single neuron does can be viewed from different perspectives:

• Adaptive filter: as in signal processing.

• Classifier: as in the perceptron.

The two aspects will be reviewed, in the above order.

Part I: Adaptive Filter
Adaptive Filtering Problem

• Consider an unknown dynamical system that takes m inputs and generates one output.

• The behavior of the system is described by its input/output pairs:

  T : {x(i), d(i); i = 1, 2, ..., n, ...}

  where x(i) = [x1(i), x2(i), ..., xm(i)]ᵀ is the input and d(i) the desired response (or target signal).

• The input vector can be either a spatial snapshot or a temporal sequence uniformly spaced in time.

• There are two important processes in adaptive filtering:

  – Filtering process: generation of output based on the input: y(i) = xᵀ(i)w(i).

  – Adaptive process: automatic adjustment of weights to reduce error: e(i) = d(i) − y(i).

Unconstrained Optimization Techniques

• How can we adjust w(i) to gradually minimize e(i)? Note that e(i) = d(i) − y(i) = d(i) − xᵀ(i)w(i). Since d(i) and x(i) are fixed, only a change in w(i) can change e(i).

• In other words, we want to minimize the cost function E(w) with respect to the weight vector w: find the optimal solution w*.

• The necessary condition for optimality is

  ∇E(w*) = 0,

  where the gradient operator is defined as

  ∇ = [∂/∂w1, ∂/∂w2, ..., ∂/∂wm]ᵀ.

  With this, we get

  ∇E(w*) = [∂E/∂w1, ∂E/∂w2, ..., ∂E/∂wm]ᵀ.

Steepest Descent

• We want the iterative update algorithm to have the following property:

  E(w(n + 1)) < E(w(n)).

• Define the gradient vector ∇E(w) as g.

• The iterative weight update rule then becomes:

  w(n + 1) = w(n) − ηg(n)

  where η is a small learning-rate parameter. So,

  ∆w(n) = w(n + 1) − w(n) = −ηg(n).

Steepest Descent (cont’d)

We now check if E(w(n + 1)) < E(w(n)). Using a first-order Taylor expansion† of E(·) near w(n),

  E(w(n + 1)) ≈ E(w(n)) + gᵀ(n)∆w(n),

and ∆w(n) = −ηg(n), we get

  E(w(n + 1)) ≈ E(w(n)) − ηgᵀ(n)g(n)
             = E(w(n)) − η‖g(n)‖²,

where η‖g(n)‖² is positive. So it is indeed the case (for small η) that

  E(w(n + 1)) < E(w(n)).

† Taylor series: f(x) = f(a) + f′(a)(x − a) + f″(a)(x − a)²/2! + ....
Steepest Descent: Example

[Figure: surface plot of x*x + y*y over x, y ∈ [−10, 10].]

• Convergence to the optimal w is very slow.

• Small η: overdamped, smooth trajectory.

• Large η: underdamped, jagged trajectory.

• η too large: algorithm becomes unstable.

Steepest Descent: Another Example

[Figure: gradient vector field of x*x + y*y over x, y ∈ [−7, 7].]

For f(x, y) = x² + y²,

  ∇f(x, y) = [∂f/∂x, ∂f/∂y]ᵀ = [2x, 2y]ᵀ.

Note that (1) the gradient vectors point upward, away from the origin, and (2) the lengths of the vectors are shorter near the origin. If you follow −∇f(x, y), you will end up at the origin. We can see that the gradient vectors are perpendicular to the level curves.

* The vector lengths were scaled down by a factor of 10 to avoid clutter.
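A minimal Octave sketch of the update w(n + 1) = w(n) − ηg(n) applied to f(x, y) = x² + y² (the variable names, starting point, and η values are illustrative, not from the slides); changing eta reproduces the overdamped/underdamped/unstable behavior listed above:

% Steepest descent on f(x,y) = x^2 + y^2, whose gradient is [2x; 2y].
eta = 0.1;          % try 0.75 (jagged, underdamped) or 1.1 (unstable)
w = [10; -5];       % initial point (x; y)
for n = 1:50
  g = 2 * w;        % gradient of f at the current point
  w = w - eta * g;  % steepest descent step
end
disp(w)             % approaches the optimum [0; 0] for 0 < eta < 1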

Newton’s Method

• Newton’s method is an extension of steepest descent, where the second-order term in the Taylor series expansion is used.

• It is generally faster and shows less erratic meandering compared to the steepest descent method.

• There are certain conditions to be met though, such as the Hessian matrix ∇²E(w) being positive definite (for an arbitrary x, xᵀHx > 0).

Gauss-Newton Method

• Applicable for cost functions expressed as a sum of error squares:

  E(w) = (1/2) Σⁿᵢ₌₁ eᵢ(w)²,

  where eᵢ(w) is the error in the i-th trial, with the weight w.

• Recalling the Taylor series f(x) = f(a) + f′(a)(x − a) + ..., we can express eᵢ(w) evaluated near eᵢ(wₖ) as

  eᵢ(w) = eᵢ(wₖ) + [∂eᵢ/∂w]ᵀ|w=wₖ (w − wₖ).

• In matrix notation, we get:

  e(w) = e(wₖ) + Jₑ(wₖ)(w − wₖ).

* We will use a slightly different notation than the textbook, for clarity.
Gauss-Newton Method (cont’d)

• Jₑ(w) is the Jacobian matrix, where each row is the gradient of eᵢ(w):

          [ ∂e1/∂w1  ∂e1/∂w2  ...  ∂e1/∂wn ]   [ (∇e1(w))ᵀ ]
          [ ∂e2/∂w1  ∂e2/∂w2  ...  ∂e2/∂wn ]   [ (∇e2(w))ᵀ ]
  Jₑ(w) = [    :        :             :    ] = [     :     ]
          [ ∂en/∂w1  ∂en/∂w2  ...  ∂en/∂wn ]   [ (∇en(w))ᵀ ]

• We can then evaluate Jₑ(wₖ) by plugging actual values of wₖ into the Jacobian matrix above.

Quick Example: Jacobian Matrix

• Given

  e(x, y) = [ e1(x, y) ] = [ x² + y²         ]
            [ e2(x, y) ]   [ cos(x) + sin(y) ]

• The Jacobian of e(x, y) becomes

  Jₑ(x, y) = [ ∂e1(x,y)/∂x  ∂e1(x,y)/∂y ] = [   2x       2y   ]
             [ ∂e2(x,y)/∂x  ∂e2(x,y)/∂y ]   [ −sin(x)  cos(y) ]

• For (x, y) = (0.5π, π), we get

  Jₑ(0.5π, π) = [ 2(0.5π)      2π    ] = [  π   2π ]
                [ −sin(0.5π)  cos(π) ]   [ −1   −1 ]
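The example can be double-checked numerically with finite differences; a small Octave sketch (the function handles and step size h are my choices):

% e(x,y) = [x^2 + y^2; cos(x) + sin(y)] and its analytic Jacobian.
e  = @(p) [p(1)^2 + p(2)^2; cos(p(1)) + sin(p(2))];
Je = @(p) [2*p(1), 2*p(2); -sin(p(1)), cos(p(2))];

p = [0.5*pi; pi];
Je(p)                        % [pi 2*pi; -1 -1], as computed above

h = 1e-6;                    % finite-difference check of the first column
(e(p + [h; 0]) - e(p)) / h   % approximately [pi; -1]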

Gauss-Newton Method (cont’d)

• Again, starting with

  e(w) = e(wₖ) + Jₑ(wₖ)(w − wₖ),

  what we want is to set w so that the error approaches 0.

• That is, we want to minimize the norm of e(w):

  ‖e(w)‖² = ‖e(wₖ)‖² + 2e(wₖ)ᵀJₑ(wₖ)(w − wₖ)
            + (w − wₖ)ᵀJₑᵀ(wₖ)Jₑ(wₖ)(w − wₖ).

• Differentiating the above wrt w and setting the result to 0, we get
  Jₑᵀ(wₖ)e(wₖ) + Jₑᵀ(wₖ)Jₑ(wₖ)(w − wₖ) = 0, from which we get

  w = wₖ − (Jₑᵀ(wₖ)Jₑ(wₖ))⁻¹ Jₑᵀ(wₖ) e(wₖ).

* Jₑᵀ(wₖ)Jₑ(wₖ) needs to be nonsingular (its inverse is needed).

Linear Least-Square Filter

• Given an m-input, 1-output function y(i) = φ(xᵢᵀw) where φ(x) = x, i.e., it is linear, and a set of training samples {xᵢ, dᵢ}ⁿᵢ₌₁, we can define the error vector for an arbitrary weight w as

  e(w) = d − [x1, x2, ..., xn]ᵀ w,

  where d = [d1, d2, ..., dn]ᵀ. Setting X = [x1, x2, ..., xn]ᵀ, we get e(w) = d − Xw.

• Differentiating the above wrt w, we get ∇e(w) = −Xᵀ. So, the Jacobian becomes Jₑ(w) = (∇e(w))ᵀ = −X.

• Plugging this into the Gauss-Newton equation, we finally get:

  w = wₖ + (XᵀX)⁻¹Xᵀ(d − Xwₖ)
    = wₖ + (XᵀX)⁻¹Xᵀd − (XᵀX)⁻¹XᵀX wₖ     [(XᵀX)⁻¹XᵀX = I, so the last term is wₖ]
    = (XᵀX)⁻¹Xᵀd.
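A hedged Octave sketch of iterating the Gauss-Newton update w = wₖ − (JₑᵀJₑ)⁻¹Jₑᵀe on a small nonlinear system (the system, starting point, and iteration count are mine, for illustration only):

% Gauss-Newton: minimize ||e(p)||^2 for e(p) = [p1^2 + p2^2 - 1; p1 - p2].
e  = @(p) [p(1)^2 + p(2)^2 - 1; p(1) - p(2)];
Je = @(p) [2*p(1), 2*p(2); 1, -1];

p = [2; 0.5];                      % initial guess
for k = 1:10
  J = Je(p);
  p = p - (J' * J) \ (J' * e(p));  % Gauss-Newton step
end
disp(p)                            % converges to [1/sqrt(2); 1/sqrt(2)]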
Linear Least-Square Filter (cont’d)

Points worth noting:

• X does not need to be a square matrix!

• We get w = (XᵀX)⁻¹Xᵀd off the bat partly because the output is linear (otherwise, the formula would be more complex).

• The Jacobian of the error function only depends on the input, and is invariant wrt the weight w.

• The factor (XᵀX)⁻¹Xᵀ (let’s call it X⁺) is like an inverse (the pseudoinverse). Multiply X⁺ on both sides of

  d = Xw

  and we get:

  X⁺d = X⁺X w = w,

  since X⁺X = (XᵀX)⁻¹XᵀX = I.

Linear Least-Square Filter: Example

See src/pseudoinv.m:

X = ceil(rand(4,2)*10), wtrue = rand(2,1)*10, d = X*wtrue, w = inv(X'*X)*X'*d

X =
   10    7
    3    7
    3    6
    5    4

wtrue =
   0.56644
   4.99120

d =
   40.603
   36.638
   31.647
   22.797

w =
   0.56644
   4.99120
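As a side note, Octave also provides the pseudoinverse directly via pinv (computed from the SVD, which is better behaved when XᵀX is nearly singular), so the last step above could equivalently be written as:

w = pinv(X) * d     % same result as inv(X'*X)*X'*d when X has full column rank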

Least-Mean-Square Algorithm

• The cost function is based on instantaneous values:

  E(w) = (1/2) e²(w)

• Differentiating the above wrt w, we get

  ∂E(w)/∂w = e(w) ∂e(w)/∂w.

• Plugging in e(w) = d − xᵀw,

  ∂e(w)/∂w = −x, and hence ∂E(w)/∂w = −x e(w).

• Using this in the steepest descent rule, we get the LMS algorithm:

  ŵ(n + 1) = ŵ(n) + η x(n) e(n).

• Note that this weight update is done with only one (xᵢ, dᵢ) pair!

Least-Mean-Square Algorithm: Evaluation

• The LMS algorithm behaves like a low-pass filter.

• The LMS algorithm is simple, model-independent, and thus robust.

• LMS does not follow the direction of steepest descent: instead, it follows it stochastically (stochastic gradient descent).

• Slow convergence is an issue.

• LMS is sensitive to the input correlation matrix’s condition number (the ratio between the largest and smallest eigenvalues of the correlation matrix).

• LMS can be shown to converge if the learning rate has the following property:

  0 < η < 2/λmax

  where λmax is the largest eigenvalue of the correlation matrix.
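A minimal Octave sketch of the LMS loop, one (x, d) pair per update (the synthetic data and η are my choices; η = 0.05 satisfies 0 < η < 2/λmax here, since unit-variance random inputs give correlation eigenvalues near 1):

% LMS on synthetic data: d = x'*wtrue + noise, one sample per update.
wtrue = [0.5; 3.0];
w = zeros(2, 1);  eta = 0.05;
for n = 1:2000
  x = randn(2, 1);                 % one input sample
  d = x' * wtrue + 0.01 * randn;   % desired response (slightly noisy)
  e = d - x' * w;                  % instantaneous error
  w = w + eta * x * e;             % LMS update
end
disp(w')                           % close to wtrue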
Improving Convergence in LMS

• The main problem arises because of the fixed η.

• One solution: use a time-varying learning rate: η(n) = c/n, as in stochastic optimization theory.

• A better alternative: use a hybrid method called search-then-converge:

  η(n) = η₀ / (1 + (n/τ))

  When n < τ, performance is similar to standard LMS. When n > τ, it behaves like stochastic optimization.

Search-Then-Converge in LMS

[Figure: learning-rate schedules η(n) = η₀/n vs. η(n) = η₀/(1 + (n/τ)).]
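The two schedules are easy to compare directly; a quick Octave sketch (the η₀ and τ values are illustrative):

% Stochastic-approximation schedule vs. search-then-converge.
n = 1:1000;  eta0 = 0.1;  tau = 100;
eta_stoch = eta0 ./ n;              % decays quickly from the start
eta_stc   = eta0 ./ (1 + n/tau);    % stays near eta0 while n < tau
semilogy(n, eta_stoch, n, eta_stc);
legend('eta_0/n', 'eta_0/(1 + n/tau)');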

Part II: Perceptron

The Perceptron Model

• The perceptron uses a non-linear neuron model (the McCulloch-Pitts model):

  v = Σᵐᵢ₌₁ wᵢxᵢ + b,    y = φ(v) = { 1 if v > 0
                                     { 0 if v ≤ 0

• Goal: classify input vectors into two classes.
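In Octave, this unit is a one-liner; a minimal sketch (the names and sample weights are mine):

% Perceptron output: y = 1 if w'*x + b > 0, else 0.
perceptron = @(w, b, x) (w' * x + b) > 0;
perceptron([1; 1], -1.5, [1; 1])    % example call: returns 1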
Boolean Logic Gates with Perceptron Units

[Figure: perceptron units implementing AND (W1 = W2 = 1, t = 1.5), OR (W1 = W2 = 1, t = 0.5), and NOT (W1 = −1, t = −0.5); after Russell & Norvig.]

• Perceptrons can represent basic boolean functions.

• Thus, a network of perceptron units can compute any Boolean function.

• What about XOR or EQUIV?

What Perceptrons Can Represent

[Figure: a perceptron with weights W0, W1 and threshold t, and its decision line in the (I0, I1) plane with slope −W0/W1; output 1 above the line, 0 below.]

Perceptrons can only represent linearly separable functions.

• Output of the perceptron:

  W0 × I0 + W1 × I1 − t > 0, then output is 1
  W0 × I0 + W1 × I1 − t ≤ 0, then output is 0
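The gate weights and thresholds from the figure can be verified exhaustively; a small Octave sketch (treating the threshold t as a bias of −t):

% Check the AND, OR, NOT perceptrons (weights/thresholds from the figure).
step = @(v) v > 0;
AND = @(a, b) step(a + b - 1.5);    % W1 = W2 = 1, t = 1.5
OR  = @(a, b) step(a + b - 0.5);    % W1 = W2 = 1, t = 0.5
NOT = @(a)    step(-a + 0.5);       % W1 = -1,    t = -0.5
for a = 0:1
  for b = 0:1
    printf('%d %d : AND=%d OR=%d\n', a, b, AND(a, b), OR(a, b));
  end
end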

Geometric Interpretation

[Figure: a perceptron with weights W0, W1 and threshold t, and the corresponding decision line in the (I0, I1) plane.]

• Rearranging

  W0 × I0 + W1 × I1 − t > 0, then output is 1,

  we get (if W1 > 0)

  I1 > (−W0/W1) × I0 + t/W1,

  where for points above the line the output is 1, and 0 for those below the line. Compare with

  y = (−W0/W1) × x + t/W1.

The Role of the Bias

[Figure: three separating lines through the origin, each with slope −W0/W1, for t = 0.]

• Without the bias (t = 0), learning is limited to adjustment of the slope of the separating line passing through the origin.

• Three example lines with different weights are shown.
Limitation of Perceptrons

[Figure: a perceptron with weights W0, W1 and threshold t, and its decision line in the (I0, I1) plane.]

• Only functions where the 0 points and 1 points are clearly linearly separable can be represented by perceptrons.

• The geometric interpretation is generalizable to functions of n arguments, i.e., a perceptron with n inputs plus one threshold (or bias) unit.

Generalizing to n-Dimensions

[Figure: a plane in 3D with normal vector n = [a b c]ᵀ passing through the point (x0, y0, z0); from http://mathworld.wolfram.com/Plane.html]

• n⃗ = (a, b, c), x⃗ = (x, y, z), x⃗₀ = (x0, y0, z0).

• Equation of a plane: n⃗ · (x⃗ − x⃗₀) = 0.

• In short, ax + by + cz + d = 0, where a, b, c can serve as the weights, and d = −n⃗ · x⃗₀ as the bias.

• For an n-D input space, the decision boundary becomes an (n − 1)-D hyperplane (1-D less than the input space).

Linear Separability

[Figure: three scatter plots of two classes: one linearly separable, two not linearly separable.]

• For functions that take integer or real values as arguments and output either 0 or 1.

• Left: linearly separable (i.e., a straight line can be drawn between the classes).

• Right: not linearly separable (i.e., perceptrons cannot represent such a function).

Linear Separability (cont’d)

[Figure: AND, OR, and XOR plotted in the (I0, I1) plane; for XOR, no single line separates the 1s at (0, 1) and (1, 0) from the 0s at (0, 0) and (1, 1).]

• Perceptrons cannot represent XOR!

• Minsky and Papert (1969)
XOR in Detail

  #   I0   I1   XOR
  1    0    0    0
  2    0    1    1
  3    1    0    1
  4    1    1    0

W0 × I0 + W1 × I1 − t > 0, then output is 1:

  1:  −t ≤ 0           →  t ≥ 0
  2:  W1 − t > 0       →  W1 > t
  3:  W0 − t > 0       →  W0 > t
  4:  W0 + W1 − t ≤ 0  →  W0 + W1 ≤ t

From 2, 3, and 4, we get 2t < W0 + W1 ≤ t, but t ≥ 0 (from 1): a contradiction.

Perceptrons: A Different Perspective

[Figure: input vector x at angle θ from the weight vector w; d = ‖x‖ cos θ is the projection of x onto w.]

  wᵀx > b, then output is 1
  wᵀx = ‖w‖‖x‖ cos θ > b, then output is 1
  ‖x‖ cos θ > b/‖w‖, then output is 1

So, if d = ‖x‖ cos θ in the figure above is greater than b/‖w‖, then output = 1.

Adjusting w changes the tilt of the decision boundary, and adjusting the bias b (and ‖w‖) moves the decision boundary closer to or farther from the origin.

Perceptron Learning Rule

• Given a linearly separable set of inputs that can belong to class C1 or C2, the goal of perceptron learning is to have

  wᵀx > 0 for all inputs in class C1
  wᵀx ≤ 0 for all inputs in class C2

• If all inputs are correctly classified with the current weights w(n), i.e.,

  w(n)ᵀx > 0 for all inputs in class C1, and
  w(n)ᵀx ≤ 0 for all inputs in class C2,

  then w(n + 1) = w(n) (no change).

• Otherwise, adjust the weights.

Perceptron Learning Rule (cont’d)

For misclassified inputs (η(n) is the learning rate):

• w(n + 1) = w(n) − η(n)x(n) if wᵀx > 0 and x ∈ C2.

• w(n + 1) = w(n) + η(n)x(n) if wᵀx ≤ 0 and x ∈ C1.

Or, simply w(n + 1) = w(n) + η(n)e(n)x(n), where e(n) = d(n) − y(n) (the error).
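A minimal Octave sketch of the combined rule w(n + 1) = w(n) + η(n)e(n)x(n), trained here on the AND function with the threshold folded in as a fixed −1 input (the encoding, η, and epoch count are my choices):

% Perceptron learning on AND; inputs augmented with -1 for the threshold.
X = [0 0; 0 1; 1 0; 1 1];
D = [0; 0; 0; 1];                    % AND targets
Xa = [X, -ones(4, 1)];               % third weight acts as the threshold t
w = zeros(3, 1);  eta = 0.5;
for epoch = 1:20
  for i = 1:4
    x = Xa(i, :)';
    y = (w' * x) > 0;                % current output
    w = w + eta * (D(i) - y) * x;    % no change if correctly classified
  end
end
disp(w')                             % a separating [W0, W1, t]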
Learning in Perceptron: Another Look

[Figure: the weight vector w and its decision boundary; adding x to w tilts w toward a misclassified positive example, and subtracting x tilts w away from a misclassified negative example.]

• When a positive example (C1) is misclassified, w(n + 1) = w(n) + η(n)x(n).

• When a negative example (C2) is misclassified, w(n + 1) = w(n) − η(n)x(n).

• Note the tilt in the weight vector, and observe how it would change the decision boundary.

Perceptron Convergence Theorem

• Given a set of linearly separable inputs, without loss of generality, assume η = 1 and w(0) = 0.

• Assume the first n examples ∈ C1 are all misclassified.

• Then, using w(n + 1) = w(n) + x(n), we get

  w(n + 1) = x(1) + x(2) + ... + x(n).   (1)

• Since the input set is linearly separable, there is at least one solution w₀ such that w₀ᵀx(n) > 0 for all inputs in C1.

  – Define α = min over x(n) ∈ C1 of w₀ᵀx(n) > 0.

  – Multiplying both sides of eq. 1 by w₀ᵀ, we get:

    w₀ᵀw(n + 1) = w₀ᵀx(1) + w₀ᵀx(2) + ... + w₀ᵀx(n).   (2)

  – From the two steps above, we get:

    w₀ᵀw(n + 1) > nα   (3)

Perceptron Convergence Theorem (cont’d)

• Using the Cauchy-Schwarz inequality,

  ‖w₀‖² ‖w(n + 1)‖² ≥ [w₀ᵀw(n + 1)]²

• From the above and w₀ᵀw(n + 1) > nα,

  ‖w₀‖² ‖w(n + 1)‖² ≥ n²α²

• So, finally, we get the first main result:

  ‖w(n + 1)‖² ≥ n²α² / ‖w₀‖²   (4)

Perceptron Convergence Theorem (cont’d)

• Taking the squared Euclidean norm of w(k + 1) = w(k) + x(k),

  ‖w(k + 1)‖² = ‖w(k)‖² + 2wᵀ(k)x(k) + ‖x(k)‖²

• Since all n inputs in C1 are misclassified, wᵀ(k)x(k) ≤ 0 for k = 1, 2, ..., n, so

  ‖w(k + 1)‖² − ‖w(k)‖² − ‖x(k)‖² = 2wᵀ(k)x(k) ≤ 0,
  ‖w(k + 1)‖² ≤ ‖w(k)‖² + ‖x(k)‖²
  ‖w(k + 1)‖² − ‖w(k)‖² ≤ ‖x(k)‖²

• Summing up these inequalities for k = 1, 2, ..., n, with w(0) = 0, we get

  ‖w(n + 1)‖² ≤ Σⁿₖ₌₁ ‖x(k)‖² ≤ nβ,   (5)

  where β = max over x(k) ∈ C1 of ‖x(k)‖².
Perceptron Convergence Theorem (cont’d)

• From eq. 4 and eq. 5,

  n²α² / ‖w₀‖² ≤ ‖w(n + 1)‖² ≤ nβ

• Here, α is a constant, depending on the fixed input set and the fixed solution w₀ (so ‖w₀‖ is also a constant), and β is also a constant since it depends only on the fixed input set.

• In this case, if n grows to a large value, the above inequality will become invalid (n is a positive integer).

• Thus, n cannot grow beyond a certain nmax, where

  n²max α² / ‖w₀‖² = nmax β,

  nmax = β‖w₀‖² / α²,

  and when n = nmax, all inputs will be correctly classified.

Fixed-Increment Convergence Theorem

Let the subsets of training vectors C1 and C2 be linearly separable. Let the inputs presented to the perceptron originate from these two subsets. The perceptron converges after some n₀ iterations, in the sense that

  w(n₀) = w(n₀ + 1) = w(n₀ + 2) = ...

is a solution vector for n₀ ≤ nmax.

Summary

• Adaptive filters using the LMS algorithm and perceptrons are closely related (the learning rule is almost identical).

• LMS and perceptrons are different, however, since one uses a linear activation and the other a hard limiter.

• LMS is used in continuous learning, while perceptrons are trained for only a finite number of steps.

• Single-neuron or single-layer networks have severe limits: how can multiple layers help?
XOR with Multilayer Perceptrons

[Figure: XOR decomposed into two linearly separable problems, one line per hidden unit, combined by an AND unit in a second layer.]

Note: the bias units are not shown in the network on the right, but they are needed.

• Only three perceptron units are needed to implement XOR.

• However, you need two layers to achieve this.
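One concrete version of the three-unit construction is XOR(a, b) = AND(OR(a, b), NAND(a, b)); a hedged Octave sketch (this particular choice of hidden units and weights is a standard one, not necessarily the one drawn on the slide):

% XOR from three perceptron units in two layers.
step = @(v) v > 0;
OR   = @(a, b) step(a + b - 0.5);
NAND = @(a, b) step(-a - b + 1.5);
XOR  = @(a, b) step(OR(a, b) + NAND(a, b) - 1.5);  % second-layer AND
for a = 0:1
  for b = 0:1
    printf('%d XOR %d = %d\n', a, b, XOR(a, b));
  end
end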
