Haykin Chapter 3 (Chap 1, 3, 3rd Ed): Single-Layer Perceptrons

• McCulloch and Pitts (1943): neural networks as computing machines.
• Hebb (1949): postulated the first rule for self-organizing learning.
• Classifier: as in perceptron

The two aspects will be reviewed, in the above order.

Part I: Adaptive Filter
Adaptive Filtering Problem

• Consider an unknown dynamical system that takes m inputs and generates one output.
• The behavior of the system is described by its input/output pairs:
  T : {x(i), d(i); i = 1, 2, ..., n, ...}
  where x(i) = [x1(i), x2(i), ..., xm(i)]^T is the input and d(i) is the desired response (or target signal).
• The input vector can be either a spatial snapshot or a temporal sequence uniformly spaced in time.

Unconstrained Optimization Techniques

• How can we adjust w(i) to gradually minimize e(i)? Note that e(i) = d(i) − y(i) = d(i) − x^T(i) w(i). Since d(i) and x(i) are fixed, only a change in w(i) can change e(i).
• In other words, we want to minimize the cost function E(w) with respect to the weight vector w: find the optimal solution w*.
• The necessary condition for optimality is
  ∇E(w*) = 0,
  where the gradient operator is defined as ∇ = [∂/∂w1, ∂/∂w2, ..., ∂/∂wm]^T.
• The iterative weight update rule then becomes
  w(n + 1) = w(n) − η g(n),
  where η is a small learning-rate parameter. So we can say
  ∆w(n) = w(n + 1) − w(n) = −η g(n).

• Expanding E to first order around w(n)†,
  E(w(n + 1)) ≈ E(w(n)) − η g^T(n) g(n) = E(w(n)) − η ‖g(n)‖²,
  where the η ‖g(n)‖² term is positive. So, it is indeed (for small η):
  E(w(n + 1)) < E(w(n)).

† Taylor series: f(x) = f(a) + f'(a)(x − a) + f''(a)(x − a)²/2! + ....
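As a concrete illustration (a minimal Octave/MATLAB sketch, not from the slides; the starting point and η are arbitrary choices), steepest descent on the quadratic E(w) = w1² + w2² used in the example plots below:

    % Steepest descent on E(w) = w(1)^2 + w(2)^2, whose gradient is g = 2*w.
    % Starting point and learning rate are arbitrary illustrative choices.
    w   = [8; -6];           % initial weight vector
    eta = 0.1;               % small learning-rate parameter
    for n = 1:50
      g = 2 * w;             % gradient g(n) at the current w(n)
      w = w - eta * g;       % w(n+1) = w(n) - eta * g(n)
    end
    disp(w)                  % close to the optimum w* = [0; 0]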
Steepest Descent: Example

[Figure: surface plot of E = x*x + y*y over x, y ∈ [−10, 10].]

• Convergence to the optimal w is very slow.

Steepest Descent: Another Example

[Figure: gradient field of x*x + y*y over x, y ∈ [−7, 7].]
* We will use a slightly different notation than the textbook, for clarity.
Gauss-Newton Method (cont'd)

• Je(w) is the Jacobian matrix, where each row is the gradient of ei(w):

  Je(w) = [ ∂e1/∂w1  ∂e1/∂w2  ...  ∂e1/∂wn ]     [ (∇e1(w))^T ]
          [ ∂e2/∂w1  ∂e2/∂w2  ...  ∂e2/∂wn ]  =  [ (∇e2(w))^T ]
          [    :         :             :   ]     [      :     ]
          [ ∂en/∂w1  ∂en/∂w2  ...  ∂en/∂wn ]     [ (∇en(w))^T ]

Quick Example: Jacobian Matrix

• Given

  e(x, y) = [ e1(x, y) ]  =  [ x² + y²         ]
            [ e2(x, y) ]     [ cos(x) + sin(y) ],

• the Jacobian of e(x, y) becomes

  Je(x, y) = [ ∂e1/∂x  ∂e1/∂y ]  =  [  2x       2y    ]
             [ ∂e2/∂x  ∂e2/∂y ]     [ −sin(x)  cos(y) ].
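The analytic Jacobian above can be sanity-checked against a finite-difference approximation; a minimal Octave/MATLAB sketch (the test point and step size h are arbitrary choices):

    % Jacobian of e(x,y) = [x^2 + y^2; cos(x) + sin(y)], compared with a
    % forward finite-difference approximation at an arbitrary test point.
    e  = @(x, y) [x.^2 + y.^2; cos(x) + sin(y)];
    Je = @(x, y) [2*x, 2*y; -sin(x), cos(y)];

    x = 1.0;  y = 2.0;  h = 1e-6;
    J_analytic = Je(x, y)
    J_numeric  = [(e(x+h, y) - e(x, y)) / h, (e(x, y+h) - e(x, y)) / h]
    % The two matrices should agree to several decimal places.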
Gauss-Newton Method (cont'd)

  ... + (w − wk)^T Je^T(wk) Je(wk) (w − wk).

• Differentiating the above wrt w and setting the result to 0, we get
  Je^T(wk) e(wk) + Je^T(wk) Je(wk) (w − wk) = 0,
  from which we get
  w = wk − (Je^T(wk) Je(wk))^−1 Je^T(wk) e(wk).

* Je^T(wk) Je(wk) needs to be nonsingular (the inverse is needed).

Linear Least-Square Filter

• Differentiating e(w) = d − Xw wrt w, we get ∇e(w) = −X^T. So, the Jacobian becomes Je(w) = (∇e(w))^T = −X.
• Plugging this into the Gauss-Newton equation, we finally get:
  w = wk + (X^T X)^−1 X^T (d − X wk)
    = wk + (X^T X)^−1 X^T d − (X^T X)^−1 X^T X wk,
  where the last term is I wk = wk, so
  w = (X^T X)^−1 X^T d.
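For comparison with the linear case above, here is the iterative behavior on a small nonlinear problem (an illustrative Octave/MATLAB sketch; the model, data, and starting point are assumptions, not from the lecture):

    % Gauss-Newton iteration a_{k+1} = a_k - (Je'*Je) \ (Je'*e(a_k)) applied to a
    % one-parameter nonlinear fit d_i ~ exp(a*t_i). Data and starting point are
    % illustrative assumptions only.
    t  = (0:0.5:3)';
    d  = exp(0.7 * t);              % synthetic targets generated with a_true = 0.7
    e  = @(a) d - exp(a * t);       % error vector e(a)
    Je = @(a) -t .* exp(a * t);     % Jacobian: row i is de_i/da

    a = 0.5;                        % initial guess
    for k = 1:10
      J = Je(a);
      a = a - (J' * J) \ (J' * e(a));   % Gauss-Newton step
    end
    disp(a)                         % converges to approximately 0.7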
Linear Least-Square Filter (cont'd)

Points worth noting:

• X does not need to be a square matrix!
• We get w = (X^T X)^−1 X^T d off the bat partly because the output is linear (otherwise, the formula would be more complex).
• The Jacobian of the error function only depends on the input, and is invariant wrt the weight w.
• With the pseudoinverse X⁺ = (X^T X)^−1 X^T,
  w = X⁺ d = X⁺ X w = w,   since X⁺ X = I.

Linear Least-Square Filter: Example

See src/pseudoinv.m.

X = ceil(rand(4,2)*10), wtrue = rand(2,1)*10, d = X*wtrue, w = inv(X'*X)*X'*d

X =
   10    7
    3    7
    3    6
    5    4

wtrue =
   0.56644
   4.99120

then we get:

w =
   0.56644
   4.99120
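Because the error is linear in w, a single Gauss-Newton step from any starting point wk already lands on (X^T X)^−1 X^T d; a small Octave/MATLAB sketch (reusing X and wtrue from the example above, with an arbitrary wk) checks this:

    % For linear e(w) = d - X*w, a single Gauss-Newton step from any wk lands on
    % w = (X'X)^-1 X'd. Reuses X and wtrue from the example above.
    X     = [10 7; 3 7; 3 6; 5 4];
    wtrue = [0.56644; 4.99120];
    d     = X * wtrue;
    wk    = randn(2, 1);                           % arbitrary starting point
    Je    = -X;                                    % Jacobian of e(w) = d - X*w
    w     = wk - (Je' * Je) \ (Je' * (d - X*wk))   % one Gauss-Newton step
    % w matches (X'*X) \ (X'*d) = wtrue, regardless of wk.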
• Plugging in e(w) = d − x^T w,
  ∂e(w)/∂w = −x,  and hence  ∂E(w)/∂w = −x e(w).
• Using this in the steepest descent rule, we get the LMS algorithm:
  ŵ(n + 1) = ŵ(n) + η x(n) e(n).
• Note that this weight update is done with only one (xi, di) pair!

• LMS is sensitive to the input correlation matrix's condition number (the ratio between the largest and smallest eigenvalues of the correlation matrix).
• LMS can be shown to converge if the learning rate satisfies
  0 < η < 2 / λmax,
  where λmax is the largest eigenvalue of the correlation matrix.
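A minimal Octave/MATLAB sketch of the LMS loop (the synthetic data, learning rate, and number of passes are illustrative assumptions; η is kept well below the 2/λmax bound):

    % LMS: w(n+1) = w(n) + eta * x(n) * e(n), updating with one (x, d) pair at a time.
    % Synthetic, noise-free data; eta is chosen well below the 2/lambda_max bound.
    m = 2;  N = 200;
    X = randn(N, m);                     % one input vector per row
    wtrue = [0.5; -1.5];
    d = X * wtrue;                       % desired responses
    R = (X' * X) / N;                    % estimate of the input correlation matrix
    fprintf('upper bound on eta: %g\n', 2 / max(eig(R)));
    eta = 0.05;                          % small learning rate, well below the bound
    w = zeros(m, 1);
    for pass = 1:10
      for n = 1:N
        x = X(n, :)';
        e = d(n) - x' * w;               % error for this sample
        w = w + eta * x * e;             % LMS weight update
      end
    end
    disp(w)                              % approaches wtrue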
Improving Convergence in LMS

Search-Then-Converge in LMS
Boolean Logic Gates with Perceptron Units

[Figure: perceptron units implementing AND (W1 = 1, W2 = 1, t = 1.5), OR (W1 = 1, W2 = 1, t = 0.5), and NOT (W1 = −1, t = −0.5), each with a −1 bias input. From Russell & Norvig.]

• Perceptrons can represent basic boolean functions.

What Perceptrons Can Represent

[Figure: decision boundary in the (I0, I1) plane with slope −W0/W1; Output = 1 above the line, Output = 0 below.]

• If W0 × I0 + W1 × I1 − t ≤ 0, then the output is 0.
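Both points can be checked with a threshold unit that outputs 1 when W0·I0 + W1·I1 − t > 0; a small Octave/MATLAB sketch using the weights and thresholds from the gate figure:

    % Threshold unit: output = 1 if W0*I0 + W1*I1 - t > 0, else 0.
    % Weights/thresholds from the figure: AND (1,1,t=1.5), OR (1,1,t=0.5), NOT (-1,t=-0.5).
    step = @(s) double(s > 0);
    I = [0 0; 0 1; 1 0; 1 1];                  % all input combinations
    AND = step(I(:,1)*1 + I(:,2)*1 - 1.5);
    OR  = step(I(:,1)*1 + I(:,2)*1 - 0.5);
    NOT = step(I(:,1)*(-1) - (-0.5));          % single-input gate, uses I0 only
    disp([I, AND, OR, NOT])                    % columns: I0, I1, AND, OR, NOT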
• Rearranging
  W0 × I0 + W1 × I1 − t > 0  (then the output is 1),
  we get (if W1 > 0)
  I1 > (−W0/W1) × I0 + t/W1,
  where for points above the line the output is 1, and 0 for those below the line. Compare with
  y = (−W0/W1) × x + t/W1.

• Without the bias (t = 0), learning is limited to adjustment of the slope of the separating line passing through the origin.
• Three example lines with different weights are shown.
Limitation of Perceptrons

[Figure: a perceptron with weights w0, w1 and threshold t, and its decision boundary in the (I0, I1) plane with slope −W0/W1; Output = 1 above the line, Output = 0 below.]

• Only functions where the 0 points and 1 points are clearly linearly separable can be represented by perceptrons.
• The geometric interpretation is generalizable to functions of n arguments, i.e. a perceptron with n inputs plus one threshold (or bias) unit.

Generalizing to n-Dimensions

[Figure: a plane in 3-D space with normal vector n = [a b c]^T passing through the point (x0, y0, z0). From http://mathworld.wolfram.com/Plane.html]

• The normal vector is n = (a, b, c), a point on the plane is x = (x, y, z), and a fixed point is x0 = (x0, y0, z0).
• Equation of a plane: n · (x − x0) = 0.
• In short, ax + by + cz + d = 0, where a, b, c can serve as the weights, and d = −n · x0 as the bias.
• For an n-D input space, the decision boundary becomes an (n − 1)-D hyperplane (one dimension less than the input space).
[Figure: outputs of three boolean functions plotted in the (I0, I1) plane; the first two (AND, OR) are linearly separable, but no single line separates the XOR outputs (marked "?").]
XOR in Detail

  #   I0   I1   XOR
  1    0    0    0
  2    0    1    1
  3    1    0    1
  4    1    1    0

The four cases impose constraints on the weights that cannot all be satisfied: a contradiction.

Perceptrons: A Different Perspective

[Figure: a perceptron with inputs xi, weights w, threshold θ, and output d, together with its decision boundary.]

Adjusting w changes the tilt of the decision boundary, and adjusting the bias b (and ‖w‖) moves the decision boundary closer to or farther from the origin.
• Given a linearly separable set of inputs that can belong to class C1 or C2,
• the goal of perceptron learning is to have
  w^T x > 0 for all inputs in class C1,
  w^T x ≤ 0 for all inputs in class C2.
• If all inputs are correctly classified with the current weights w(n), i.e.
  w(n)^T x > 0 for all inputs in class C1, and
  w(n)^T x ≤ 0 for all inputs in class C2,
  then w(n + 1) = w(n) (no change).
• Otherwise, adjust the weights.

For misclassified inputs (η(n) is the learning rate):

• w(n + 1) = w(n) − η(n) x(n) if w^T x > 0 and x ∈ C2.
• w(n + 1) = w(n) + η(n) x(n) if w^T x ≤ 0 and x ∈ C1.

Or, simply w(n + 1) = w(n) + η(n) e(n) x(n), where e(n) = d(n) − y(n) (the error).
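A compact Octave/MATLAB sketch of this rule in the error form w(n + 1) = w(n) + η e(n) x(n), with the bias folded in as a fixed −1 input (the toy data set, learning rate, and epoch count are illustrative assumptions):

    % Perceptron learning on a linearly separable 2-D toy problem (the AND function).
    % Each input is augmented with a -1 component so the bias is learned as a weight.
    X  = [0 0; 0 1; 1 0; 1 1];          % inputs
    d  = [0; 0; 0; 1];                  % desired outputs (AND)
    Xa = [X, -ones(4, 1)];              % augmented inputs [I0, I1, -1]
    w  = zeros(3, 1);                   % weights [W0; W1; t]
    eta = 1;
    for epoch = 1:20
      for n = 1:4
        x = Xa(n, :)';
        y = double(w' * x > 0);         % perceptron output
        e = d(n) - y;                   % error: 0, +1, or -1
        w = w + eta * e * x;            % w(n+1) = w(n) + eta * e(n) * x(n)
      end
    end
    disp(w')                            % a weight vector (W0, W1, t) that realizes AND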
Learning in Perceptron: Another Look

[Figure: inputs from two classes (+ and −) and the weight vector w; adding a misclassified input x to w (w + x), or subtracting it (w − x), rotates the decision boundary toward classifying x correctly.]

Perceptron Convergence Theorem

• Given a set of linearly separable inputs, without loss of generality assume η = 1 and w(0) = 0.
• Assume the first n examples ∈ C1 are all misclassified.
• Then, using w(n + 1) = w(n) + x(n), we get
  w(n + 1) = x(1) + x(2) + ... + x(n).   (1)
• Here, α is a constant, depending on the fixed input set and the fixed solution w0 (so ‖w0‖ is also a constant), and β is also a constant since it depends only on the fixed input set.
• In this case, if n grows to a large value, the above inequality will become invalid (n is a positive integer). Setting
  nmax² α² / ‖w0‖² = nmax β
  gives
  nmax = β ‖w0‖² / α²,
  and when n = nmax, all inputs will be correctly classified.

The perceptron converges after some n0 iterations, in the sense that
  w(n0) = w(n0 + 1) = w(n0 + 2) = ...
is a solution vector, for some n0 ≤ nmax.
Summary
XOR with Multilayer Perceptrons

[Figure: XOR and AND output values plotted in the input plane (left), and a multilayer perceptron network that computes XOR (right).]

Note: the bias units are not shown in the network on the right, but they are needed.
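One possible hand-wired choice (a sketch; the figure may use a different decomposition): XOR(I0, I1) = OR(I0, I1) AND NOT(AND(I0, I1)), built from the threshold units shown earlier, with the biases included:

    % XOR from two layers of threshold units: XOR = (I0 OR I1) AND NOT(I0 AND I1).
    % One possible choice of weights/thresholds; biases handled via the -1 inputs.
    step = @(s) double(s > 0);
    xor_mlp = @(I0, I1) step( ...
        1 * step(I0 + I1 - 0.5) ...        % hidden unit 1: OR
      - 1 * step(I0 + I1 - 1.5) ...        % hidden unit 2: AND (negative weight acts as NOT)
      - 0.5);                              % output threshold
    for I = [0 0; 0 1; 1 0; 1 1]'
      fprintf('%d XOR %d = %d\n', I(1), I(2), xor_mlp(I(1), I(2)));
    end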