ELEC 6240: Neural Networks
Revision: 2003.10
Contents
1 Course overview 7
9 2003 09 10 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
9.1 Learning tasks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
9.2 Memory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
References 250
Index 251
1 Course overview
Instructor A. S. Hodel, hodelas@eng.auburn.edu, AIM screen name ELEC 6240.¹
Web page http://www.eng.auburn.edu/users/hodelas. Office hours: MW 3-4pm
or by appt.
Grades Grades are assigned on a 10% scale. You may earn points from 2 Hour exams (50
pts ea), 1 Course project (50 pts), Homework (50 pts), 1 Final exam (100 pts), for a
total of 300 points.
Special needs Students who need special accommodations should make an appointment
to discuss their needs as soon as possible.
Class resources Other class resources (notes, m-files, etc.) will be made available at
ftp://ftp.eng.auburn.edu/pub/hodel/6240
Projects Will be described later in the semester. Project final reports will be due during
the final week of the semester; a precise due date will be announced later. Oral
presentations will be made to the class during the final two weeks of the course.
1. Learning processes
2. Perceptrons: single layer and multi-layer
3. Radial basis function neural networks
4. Principal components analysis
5. Self-organizing maps
6. Neurodynamics
7. Neural network applications
Resources MATLAB is available on the engineering network by either (1) Use of Windows
PC labs in Broun 128, etc., (2) Sun workstations, or (3) remote log-in (ssh) to
gate.eng.auburn.edu
¹ Please identify yourself by name when you message me.
from off-campus. Several tutorials for MATLAB are available on the net.² A brief
review of MATLAB will be given in this course, but students are expected to be
familiar with MATLAB from the prerequisite course.
Homework software Grading of software will be done in a batch-run fashion. This will
require that students set up a folder on their engineering account that the instructor
can access (read/execute privileges) from the engineering network (sun workstation).
Evidence of copying of software will be grounds for a zero grade on the homework
assignment in question.
Remark This is a 6000 level course (senior undergraduate/graduate level course). Students
will be expected to have a corresponding level of mathematical/conceptual maturity.
C-language programming (COMP1200) will be essential. The instructor makes no
commitment to provide support for other compiled languages in this course.
Note Homework assignments in this class for Fall 2003 should be done using MATLAB 6.5
(Release 13). For historical reasons, many software examples in these notes were done using
octave,³ a program similar to (but not identical to) MATLAB that is available at no cost.
Octave is not currently available on the Engineering network; if anyone wants to volunteer
to help get octave installed, please let me know.
² See, e.g., p. 37 of the ELEC 2020 manual, ftp://ftp.eng.auburn.edu/pub/hodel/2020
³ http://www.octave.org
• Nonlinear
More flexibility in representation of systems than in, e.g., transfer functions (LTI).
• input-output mapping
Does not require a “first principles” physical model of behavior/system being learned.
• adaptivity (retraining)
Can adjust its “synaptic weights” in response to changes in operating environment.
• VLSI implementation
• neurobiological analogy
Retina: preprocessing (compression) of visual information before it is sent to brain for
processing.
Homework 1
Read Simon Haykin. Neural Networks: A Comprehensive Foundation. Prentice Hall,
2nd edition, 1999 [Hay99] Chapter 1.
4. Design a 2-layer neural network with a threshold activation function (eqn (1.8), p. 12
in [Hay99]) that identifies the region (x, y) ∈ [0, 1] × [0, 1].
Approach: hidden-layer neurons are "feature detectors," but each can only split the space
into two halves along a hyperplane (see the figures that follow showing hidden-layer values
as a function of x and y). Use one hidden-layer neuron for each of the sides of the
square, then combine them. We can't do an "and" operation naturally, so we have to do
it as a sum with a threshold, with a bias just below the sum you'd get with all hidden
neurons firing.
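As a concrete sketch, the construction above can be written in a few lines of pure Python (the course work itself uses MATLAB; the step-function name and the 3.5 output bias, "just below 4," are my own choices):

```python
def step(v):
    """McCullough-Pitts threshold activation."""
    return 1 if v >= 0 else 0

# Hidden layer: one neuron per side of the unit square.  Each row is
# [bias weight, x weight, y weight].
W1 = [[0,  1,  0],   # fires when x >= 0 (right of the y axis)
      [0,  0,  1],   # fires when y >= 0 (above the x axis)
      [1, -1,  0],   # fires when x <= 1 (left of the line x = 1)
      [1,  0, -1]]   # fires when y <= 1 (below the line y = 1)

def net(x, y):
    xbar = [1, x, y]
    h = [step(sum(w * u for w, u in zip(row, xbar))) for row in W1]
    # Output neuron: an "and" of the four detectors, done as a sum
    # thresholded just below 4 (all four hidden neurons firing).
    return step(sum(h) - 3.5)
```

Points inside the square give net(x, y) = 1; points outside any of the four half-planes give 0.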
Write an m-file main.m that plots the output of your ANN over the above domain. Put
this in a folder called 6240H1 in your home directory on the engineering network (the H:
drive for Windoze enthusiasts) and set permissions so that anyone can read the file.⁴
Solution
1. Sigmoid derivative
% homework 1 check.
xx = linspace(-5,5,1000);
a = 2;
eav = exp(-a*xx);
pp = (1 - eav) ./ (1 + eav);
dp = (a/2)*(1 - pp .^ 2);
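A quick numerical spot check of the identity the m-file relies on: with a = 2, pp = (1 − e^{−ax})/(1 + e^{−ax}) = tanh(ax/2), and its derivative is (a/2)(1 − pp²). A pure-Python restatement (variable names are mine):

```python
import math

a = 2.0
phi  = lambda x: math.tanh(a * x / 2)          # same as pp in the m-file
dphi = lambda x: (a / 2) * (1 - phi(x) ** 2)   # same as dp in the m-file

h = 1e-6
for x in (-3.0, 0.0, 1.5):
    fd = (phi(x + h) - phi(x - h)) / (2 * h)   # centered finite difference
    assert abs(fd - dphi(x)) < 1e-6
```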
3. ⟨w, x⟩ = 10 · 0.8 − 20 · 0.2 + 4 · (−1) − 2 · (0.9) = −1.8. (a) The linear output is −1.8 (or
a · (−1.8) if the activation function has a linear slope parameter a). (b) 0.
xx = linspace(-1,2,30); yy = linspace(-1,2,30);
% format of rows 1st weight matrix:
% bias weight, x weight, y weight
W1 = [0 1 0; ... % right of y axis (x = 0)
0 0 1; ... % above x axis (y = 0)
1 -1 0; ... % left of line x = 1
1 0 -1]; % below line y = 1;
end
end
fn = 0;
fn = fn+1; figure(fn);
mesh(yy,xx,h1);
title(’Hidden layer node 1 values’);
xlabel(’y’); ylabel(’x’);
eval(sprintf(’print -depsc square1_%d.eps’,fn));
fn=fn+1; figure(fn);
mesh(yy,xx,h2);
title(’Hidden layer node 2 values’);
xlabel(’y’); ylabel(’x’);
eval(sprintf(’print -depsc square1_%d.eps’,fn));
fn=fn+1; figure(fn);
mesh(yy,xx,h3);
title(’Hidden layer node 3 values’);
xlabel(’y’); ylabel(’x’);
eval(sprintf(’print -depsc square1_%d.eps’,fn));
fn=fn+1; figure(fn);
mesh(yy,xx,h4);
[Figure: mesh plot of hidden layer node 4 values as a function of x and y.]
fn=fn+1; figure(fn);
mesh(yy,xx,zprime);
title(’Neural network (no threshold on output)’);
xlabel(’y’); ylabel(’x’);
eval(sprintf(’print -depsc square1_%d.eps’,fn));
fn=fn+1; figure(fn);
mesh(yy,xx,zz);
title(’Neural network output’);
xlabel(’y’); ylabel(’x’);
eval(sprintf(’print -depsc square1_%d.eps’,fn));
[Figures: mesh plots of hidden layer node values as functions of x and y.]
[Figure: mesh plot of the second-layer linear output over x and y.]
Figure 5: second layer output: value is positive only for (x, y) ∈ [0, 1] × [0, 1].
[Figure: mesh plot of the thresholded network output over x and y.]
y = φ(wᵀx), where

wᵀx = w₁x₁ + · · · + wₘxₘ = ⟨w, x⟩,
x = [x₁ · · · xₘ]ᵀ ∈ ℝᵐ,
w = [w₁ · · · wₘ]ᵀ.

I prefer to write wᵀx so that w and x are both column vectors. y ∈ ℝ is a scalar (for now;
this will change to y ∈ ℝᵖ).

• wᵀx is positive and large if x points in "the same direction" as w; it is negative and large
in magnitude if x points in the opposite direction (think of the correlation coefficient of
random variables).
1. summation
3. activation function
   (a) Threshold, a.k.a. "McCullough-Pitts" (theoretical limits discussed in [MP88]⁵):
       φ(v) = 1 if v ≥ 0, 0 if v < 0. Also vector (output) form.
   (b) Piecewise linear: φ(v) = 1 if v ≥ 1/2; v + 1/2 if |v| < 1/2; 0 if v ≤ −1/2
       (notice misprint from text).
   (c) Sigmoid: φ(v) = 1/(1 + e^{−av}); approaches the threshold function as a → ∞.
   (d) Stochastic (not much done in this course due to prerequisites): a "random process"
       with P(x = 1) = P(v) = 1/(1 + e^{−v/T}), where T is a pseudo-temperature, and
       P(x = 0) = 1 − P(v).

⁵ M. Minsky and S. Papert. Perceptrons: An Introduction to Computational Geometry. MIT Press, 1988. Expanded edition.
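The four activation functions in (a)-(d) might be sketched in pure Python as follows (function names are mine; the stochastic unit returns 1 with probability 1/(1 + e^{−v/T})):

```python
import math, random

def threshold(v):                 # (a) McCullough-Pitts
    return 1 if v >= 0 else 0

def piecewise(v):                 # (b) piecewise linear
    if v >= 0.5:
        return 1.0
    if v <= -0.5:
        return 0.0
    return v + 0.5

def sigmoid(v, a=1.0):            # (c) approaches threshold as a -> infinity
    return 1.0 / (1.0 + math.exp(-a * v))

def stochastic(v, T=1.0):         # (d) sample 1 with probability sigmoid(v/T)
    return 1 if random.random() < sigmoid(v / T) else 0
```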
Example 3.1 Activation function: single neuron in MATLAB. See Figure 7. Source code:
[Figure 7: single-neuron activation function output vs. neuron input.]
[Signal-flow graph: a single neuron; inputs x₀, x₁, ..., xₘ with weights w_k0, w_k1, ..., w_km
feed a summation and the activation φ, so that

y_k = φ( Σ_{j=0}^m w_kj xⱼ ),   i.e.,   y = φ(wᵀx).]
• mex-files are slower to write, more difficult to debug, but are 5-10x faster.
3.3 Feedback
Read §1.5
• operator loops
[Signal-flow graph: x′ⱼ(n) enters a loop with forward operator A and feedback operator B,
producing y_k(n).]

y_k(n) = ( A / (1 − AB) ) x′ⱼ(n)
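One way to see the closed-loop gain A/(1 − AB) is to iterate the loop equation y ← A(x′ + By), which converges to that fixed point whenever |AB| < 1. A pure-Python sketch with arbitrary illustrative values:

```python
A, B, xp = 0.5, 0.8, 1.0   # forward gain, feedback gain, input x'
y = 0.0
for _ in range(200):       # iterate the loop; contraction factor is |AB| = 0.4
    y = A * (xp + B * y)

closed_loop = A * xp / (1 - A * B)   # predicted closed-loop gain times x'
assert abs(y - closed_loop) < 1e-9
```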
• IIR filters
1st order signal flow graph
[Signal-flow graph: first-order IIR filter; x′ⱼ(n) is scaled by w, with a unit delay z⁻¹ in
the feedback path, to produce y_k(n).]
Example 4.1 Single layer network with three neurons (outputs) and two inputs.
y = [ y₁ ; y₂ ; y₃ ] = φ( W [1; x] ) = [ φ₁(w₁ᵀ[1; x]) ; φ₂(w₂ᵀ[1; x]) ; φ₃(w₃ᵀ[1; x]) ]
Could also write as y = φ(W̄₂ φ(W̄₁x + b₁) + b₂) for bias vectors b₁, b₂. We will select
w₁ = w₂ = w₃ = [b  w₁  w₂]ᵀ = [1  1  1]ᵀ.
n1 = 40; x1 = linspace(-2,2,n1);
n2 = 45; x2 = linspace(-2,2,n2);
subplot(2,2,1); meshc(x1,x2,y0’);
xlabel(’input x_1’); ylabel(’input x_2’); zlabel(’NN output’);
title(’McCullough-Pitts activation function’);
grid on
subplot(2,2,2); meshc(x1,x2,y1’);
xlabel(’input x_1’); ylabel(’input x_2’); zlabel(’NN output’);
title(’sigmoid activation function’);
grid on
subplot(2,2,3); meshc(x1,x2,y2’);
xlabel(’input x_1’); ylabel(’input x_2’); zlabel(’NN output’);
title(’tanh activation function’);
grid on
orient tall
print -depsc neuronEx1.eps
[Figure: network outputs for the McCullough-Pitts, sigmoid, and tanh activation functions
vs. inputs x₁ and x₂.]
Example 4.2 Two-layer network with all activation functions φ = threshold, two in-
puts, one output, and three "hidden" nodes:

y = φ( W₂ [ 1 ; φ(W₁x̄) ] ) = φ( W₂ φ₁(W₁x̄) )
subplot(2,2,1); meshc(x1,x2,h1’);
xlabel(’input x_1’); ylabel(’input x_2’); zlabel(’h1 output’);
title(’Hidden Neuron 1 output’);
grid on
subplot(2,2,2); meshc(x1,x2,h2’);
xlabel(’input x_1’); ylabel(’input x_2’); zlabel(’h2 output’);
title(’Hidden Neuron 2 output’);
grid on
subplot(2,2,3); meshc(x1,x2,h3’);
xlabel(’input x_1’); ylabel(’input x_2’); zlabel(’h3 output’);
title(’Hidden Neuron 3 output’);
grid on
subplot(2,2,4); meshc(x1,x2,yy’);
xlabel(’input x_1’); ylabel(’input x_2’); zlabel(’NN output’);
title(’Two-layer ANN example output’);
grid on
orient tall
print -depsc neuronEx2.eps
[Figure: hidden neuron 1-3 outputs and the two-layer ANN example output vs. inputs
x₁ and x₂.]
Definition 5.1 (Fischler & Firschein, 1987) Knowledge refers to stored information or mod-
els used by a person or machine to interpret, predict, and appropriately respond to the outside
world.
Knowledge representation:
• how is it encoded?
We often ask an ANN to “learn” an environment. Must have (or present) lots of data (“prior
information”).
• labeled: (x0 , d0 ).
Process:
Remark 5.1 Training allows the data (and the training algorithm) to organize and represent
input data. Other forms of pattern classification often require the designer to organize the
data for classification and representation.
• vectors
• dot products
• norms (vector/error)
For unit vectors, the dot product itself measures similarity.

• covariance Σ = E[ (xᵢ − µᵢ)(xᵢ − µᵢ)ᵀ ]

Assume xᵢ, xⱼ have identical covariance. Then: dot product/recognition.
Ideas: (Rules)
3. Important features should have many neurons devoted to them - high probability of
detection, low probability of false alarm.
Neyman-Pearson: max Prob(detect) subject to Prob(false alarm) < γ
/*=================================================================
* Example Hyperbolic Tangent MEX function
* Adam Simmons, GTA, Neural Networks
*=================================================================*/
#include <math.h>
#include "mex.h"
/* If you have a C file written outside of MATLAB’s MEX-file format, you can
 * place it where tanh1 is and write a wrapper for it below (mexFunction)
 */
static void
tanh1 (double yout[], double xin[])
{
yout[0] = tanh (xin[0]);
return;
}
#include "mex.h"
#include "math.h"
Homework 2 Mex file example Handed out by Simmons in class; solutions shown in lecture
notes. Everyone did well.
% mexExV.m: call mex file with a vector input, get a vector output
if( exist(’phiExV’) ~= 3 ) % see if the mex file is there
mex phiExV.c
end
xx = (-10:0.1:10)’;
yy = phiExV(xx);
plot(xx,yy);
grid on
xlabel(’activation function input’);
ylabel(’activation function output’);
print -depsc mexExV.eps
Results:
[Figure: tanh activation function output vs. activation function input over [−10, 10].]
>> mexPrintMatEx
/Applications/MATLAB6p5
-L/Applications/MATLAB6p5/bin/Undetermined -lmx -lmex -lmat
/Applications/MATLAB6p5
-L/Applications/MATLAB6p5/bin/mac -lmx -lmex -lmat
mex link phase: cc -O -bundle -Wl,-flat_namespace -undefined suppress
-o mexPrintMat.mexmac mexPrintMat.o mexversion.o
-L/Applications/MATLAB6p5/bin/mac -lmx -lmex -lmat
(3 x 3) =
1.0000e+00 2.0000e+00 3.0000e+00
4.0000e+00 5.0000e+00 6.0000e+00
7.0000e+00 8.0000e+00 1.0000e+01
if (nrhs != 1)
{
sprintf(errMsg,"Received %d args, need 1", nrhs);
mexErrMsgTxt(errMsg);
return;
}
else if (nlhs > 0)
{
sprintf(errMsg,"%d outputs requested, max is 0", nlhs);
mexErrMsgTxt(errMsg);
return;
}
nrows = mxGetM(prhs[0]);
ncols = mxGetN(prhs[0]);
xx = mxGetPr(prhs[0]);
mexPrintf("(%d x %d) = \n",nrows,ncols);
for( ii = 0 ; ii < nrows ; ii++)
{
for ( jj = 0 ; jj < ncols ; jj++ )
mexPrintf("%12.4e ",xx[ii + jj*nrows]);
mexPrintf("\n");
}
}
Remark 7.1 Danger! Stray pointers and bad subscripts are great at making mex files
crash.
Stored: all (or most) xᵢ (now vectors) stored with desired output dᵢ.

element   1      2      3      4      5
xᵢ        [0;0]  [0;1]  [1;0]  [1;1]  [0.9;0.3]
dᵢ        0      1      1      0      1
Question 8.1 When do we add a new data point? How close is “close enough?” (related
to Radial Basis Function networks ...)
Can also develop methods by which connections are weakened (“forgetting” methods).
Example 8.3 Average values give a threshold for the "sign" of the correction so that we
can both strengthen and weaken connections.
Idea

Δw_kj = η(xⱼ − w_kj) if neuron k "wins", 0 else.

Adjust weights to enforce Σⱼ w_kj = 1 for all k, or Σⱼ w_kj² = 1 for all k.
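A minimal pure-Python sketch of one step of this winner-take-all rule, followed by renormalization to Σⱼ w_kj² = 1 (the data and η are illustrative choices):

```python
import math

def competitive_step(W, x, eta=0.5):
    """One winner-take-all update: move the winning row toward x, renormalize."""
    # Winner: the neuron whose weight vector has the largest dot product with x.
    k = max(range(len(W)), key=lambda i: sum(w * u for w, u in zip(W[i], x)))
    W[k] = [w + eta * (u - w) for w, u in zip(W[k], x)]     # Delta w_kj rule
    nrm = math.sqrt(sum(w * w for w in W[k]))               # enforce sum w^2 = 1
    W[k] = [w / nrm for w in W[k]]
    return k

W = [[1.0, 0.0], [0.0, 1.0]]        # two neurons, unit-length weight rows
winner = competitive_step(W, [0.9, 0.1])
```

Only the winning row moves; the other neuron's weights are untouched.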
Similar to tagged (labelled) data learning, except that we have a “teacher” instead of a
large database.
[Diagram: the Environment supplies a state vector to the Critic and the Learning System;
the Critic converts primary reinforcement into heuristic reinforcement for the Learning
System, whose actions feed back to the Environment.]
All of these require design of some mechanism to steer toward a “learned” outcome.
9 2003 09 10
We remember ...
“I wish that it need not have happened in my time,” said Frodo.
“So do I,” said Gandalf, “ and so do all who live to see such times.
But that is not for them to decide. All we have to decide is what
to do with the time given to us.”
J. R. R. Tolkien, The Fellowship of the Ring, p. 76
“We don’t want the bad guys to win!”
Fozzie Bear, The Great Muppet Caper.
pattern recognition/classification
Example 9.1 Homework 2 problem: classification can be used to clean up input waveforms:
[Figures: noisy sine wave and cleaned sine wave vs. time (s).]
function approximation: e.g. system i.d., inverse dynamics, control, filtering (smoothing,
prediction, extraction)
Example 9.2 system identification. Try to mimic an unknown system: the neural net must
"remember" previous outputs and inputs and try to predict the next output. [Diagram:
stored data u(k), u(k − 1), ..., u(k − N) and y(k − 1), ..., y(k − N) feed the neural
network.]
9.2 Memory
Read §2.11
Define

M₀ = 0,   M_k = M_{k−1} + W(k),   so that   M = M_q = Σ_{k=1}^q W(k)

cos(x(k), x(j)) = x(k)ᵀ x(j) / ( ‖x(k)‖ ‖x(j)‖ )

If the x(k)'s are unit length then cos(x(k), x(j)) = x(k)ᵀ x(j). For recognition, want
x(k)ᵀ x(j) = 0 (orthogonal).
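A pure-Python sketch of a correlation memory built this way, taking W(k) as the outer product y(k)x(k)ᵀ (an assumption consistent with the recall property above), with orthonormal keys so that recall M x(j) = y(j) is exact (the stored pairs are illustrative):

```python
keys = [[1, 0, 0], [0, 1, 0]]    # orthonormal key vectors x(k)
vals = [[2, 5], [7, 1]]          # stored responses y(k)

# M = sum_k y(k) x(k)^T, i.e. M[i][j] = sum_k vals[k][i] * keys[k][j]
M = [[sum(v[i] * x[j] for v, x in zip(vals, keys)) for j in range(3)]
     for i in range(2)]

def recall(x):
    """Matrix-vector product M x."""
    return [sum(M[i][j] * x[j] for j in range(3)) for i in range(2)]

assert recall([1, 0, 0]) == [2, 5]   # exact recall with orthonormal keys
assert recall([0, 1, 0]) == [7, 1]
```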
tt = (0:0.01:5)’;
sinewave=sin(pi*tt);
sawtooth = 2*abs(tt - floor(tt))-1;
square = 2*double ( floor(tt) == 2*floor(tt/2) )-1;
figure(1);
plot(tt,sinewave,’-’, tt, sawtooth,’-’, tt, square,’-’);
legend(’sine’,’saw’,’square’);
xlabel(’time (s)’)
grid on
print -depsc corrMemEx1.eps
M = zeros(3,length(tt));
for kk=1:3
M = M + Yk(:,kk)*Xk(:,kk)’;
end
x1 = (M*sinewave)’
x2 = (M*sawtooth)’
x3 = (M*square)’
figure(2)
plot(tt,dirtySine);
legend(’noisy sine wave’);
xlabel(’time (s)’)
grid on
print -depsc corrMemEx2.eps
[Figure: sine, sawtooth, and square waveforms vs. time (s).]
[Figures: noisy sine wave (shown twice) and cleaned sine wave vs. time (s).]
Q If the key vectors X are orthonormal, what is the storage capacity of the network? - m,
the rank of M̂.
Q Classification accuracy: lower bound on error x(k)ᵀx(j) ≥ γ for all k ≠ j. If γ is big
enough, can get classification errors. (Get an upper bound instead?)
Homework 3
Read Simon Haykin. Neural Networks: A Comprehensive Foundation. Prentice Hall,
2nd edition, 1999 [Hay99] Chapter 2.
tt = (0:0.01:5)’;
sinewave=sin(pi*tt);
sawtooth = 2*abs(tt - floor(tt))-1;
square = 2*double ( floor(tt) == 2*floor(tt/2) )-1;
figure(1);
plot(tt,sinewave,’-’, tt, sawtooth,’-’, tt, square,’-’);
legend(’sine’,’saw’,’square’);
xlabel(’time (s)’)
grid on
print -depsc learnTaskEx1a.eps
This m-file implements a single-layer neural network that has 101 inputs (length of the
input vector) and 3 outputs. Output 1 should be a “1” when the input is a sinewave,
output 2 should be a “1” when the input is a sawtooth wave, and output 3 should be a
“1” when the input is a square wave, with the other outputs being zero. Each of these
waveforms is defined in the first four lines of the m-file.
Your job is to select what to fill in for A so that the neural network gives the desired
output for perfect (uncorrupted) inputs. The last few lines of the network demonstrate
what happens when you apply a corrupted sine wave.
The output for my solution is shown below:
>> learnTaskEx1
x1 = 1.0000 0.0000 -0.0000
x2 = 0.0000 1.0000 -0.0000
x3 = 0.0000 0.0000 1.0000
x4 = 1.0405 0.0530 -0.0658
[Figure: noisy sine wave vs. time (s).]
Solution
1. M-file:
fprintf(’(a)\n’);
M = y1*x1’ + y2*x2’ + y3*x3’
fprintf(’(b)\n’);
error1 = M*x1 - y1
error2 = M*x2 - y2
error3 = M*x3 - y3
Output:
hwk0218.m Output
(a)
M =
5 -2 -2 0
1 1 4 0
0 6 3 0
(b)
error1 =
0
0
0
error2 =
0
0
0
error3 =
0
0
0
M-file:
Output:
hwk0220.m Output
-0.0012636
-0.8199219
0.6388963
yerr =
0.498736
-0.069922
0.205884
-0.0043773
-2.8402926
2.2132017
yerr =
0.49562
-2.09029
1.78019
hwk0220a.m Output
1.9456e-17
-8.6603e-01
5.0000e-01
yerr =
0.500000
-0.116025
0.066987
The matrix B = [b₁ · · · b_nf] acts as a feature extractor when we perform dot products
⟨bᵢ, x⟩. The matrix A is then used to construct the associated waveform y as a combination
of the columns of A:

y ≜ Az = A [ ⟨b₁, x⟩ ; · · · ; ⟨b_nf, x⟩ ] = a₁⟨b₁, x⟩ + · · · + a_nf⟨b_nf, x⟩

[Diagram: x passes through Bᵀ to give the feature vector z = [⟨b₁, x⟩; ⟨b₂, x⟩; · · ·], which
A maps through its columns a₁, a₂, a₃ to produce y.]
Question 10.1 How many patterns can be stored in an N × N linear associative memory?
10.2 Adaptation
Read [Hay99] §2.12
• The environment “changes:” a good decision now may not be a good decision later.
T = { (xᵢ, dᵢ) }ᵢ₌₁ᴺ

d = f(x) + ε

1. E[ε|x] = 0: ε is a zero-mean random variable. Implies E[d|x] = f(x), which is what the
neural net is trying to match.

2. E[f(x)εᵀ] = 0 (consistent with the conditional expectation above). Says that the function
f gives us all available information about d that we can get from x.
What does this sort of modeling assumption tell us about what neural networks can do?
Short summary of §2.13 Need to approximate f (x) with an ANN F (x; W ). Things to
reduce: mean value of error (bias) and variance of error (standard deviation).
Contributors:
1. W. S. McCullough and W. Pitts. A logical calculus of the ideas immanent in nervous
activity. Bulletin of Mathematical Biophysics, 5:115–133, 1943. McCullough & Pitts
(1943): use of NN as computational tool.
3. F. Rosenblatt. The perceptron: A probabilistic model for information storage and orga-
nization in the brain. Psychological Review, 65:386–408, 1958 for perceptron (learning
with a teacher)
4. B. Widrow and M. E. Hoff, Jr. Adaptive switching circuits. In IRE WESCON Con-
vention Record, pages 96–104 Widrow-Hoff delta rule (least mean square)
Perceptron: linearly separable; “perceptron convergence theorem.” Single neuron can
be viewed as an “adaptive filter.”
[Signal-flow graph: inputs x₀, x₁, ..., xₘ with weights w_i0, w_i1, ..., w_im are summed to
form v(i) and output y(i); y(i) is compared with the desired output d(i) to form the error
eᵢ. Linear adaptive filter.]
y = φ(W̄x + w₀) = φ( W [1; x] ) ≜ φ(W x̄)

where x(i) is a vector of length m (input dimensionality). x(i) is called a stimulus vector.
How do we get x?
Learning assumptions:
• Adjustments of weights are made continuously⁶ (time is a part of the learning algo-
rithm)

Linear neuron:

y(i) = v(i) = Σ_{k=1}^m w_k(i) x_k(i) = w(i)ᵀ x(i)

⁶ Not as a differential equation, usually, but as a difference (discrete-time) equation.
Remark 11.1 Define P = AA+ . Then P is a projection matrix, which means that
P 2 = P . Further, it is easy to show that P A = A. Multiplication by P extracts the
part of a vector (or matrix) that is in span(A).
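A numerical spot check of Remark 11.1 in pure Python, for a full-column-rank A where the pseudoinverse is A⁺ = (AᵀA)⁻¹Aᵀ (the particular A is an arbitrary example):

```python
A = [[1.0], [2.0], [2.0]]                 # 3x1, full column rank
ata = sum(r[0] * r[0] for r in A)         # A^T A (a scalar here)
Aplus = [[r[0] / ata for r in A]]         # 1x3 pseudoinverse A+
P = [[A[i][0] * Aplus[0][j] for j in range(3)] for i in range(3)]   # P = A A+

def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

P2 = matmul(P, P)   # should equal P   (P is a projection)
PA = matmul(P, A)   # should equal A   (P extracts the part in span(A))
assert all(abs(P2[i][j] - P[i][j]) < 1e-12 for i in range(3) for j in range(3))
assert all(abs(PA[i][0] - A[i][0]) < 1e-12 for i in range(3))
```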
E = (1/N) Σ_{n=1}^N ½ (d(n) − y(n))ᵀ (d(n) − y(n))
Want to find optimal w ∗ such that E(w ∗ ) ≤ E(w) for all possible w, i.e.
Lemma 11.1 (matrix calculus identities) Let f(x) be a scalar function of a vector x ∈ ℝⁿ
and let g(W) be a scalar function of a matrix W ∈ ℝ^{m×n}. Define their respective partial
derivatives as

∂f/∂x ≜ [ ∂f/∂x₁ · · · ∂f/∂xₙ ]ᵀ   and   ∂g/∂W ≜ the m × n matrix with (i, j) entry ∂g/∂w_ij.

Then
1. f(x) = cᵀx ⟹ ∂f/∂x = c
2. f(x) = ½ xᵀQx (Q symmetric) ⟹ ∂f/∂x = Qx
3. g(W) = xᵀWy ⟹ ∂g/∂W = xyᵀ
4. g(W) = ½ xᵀWᵀWx ⟹ ∂g/∂W = Wxxᵀ
Proof: Left as an exercise for the reader. This would be a good exam question, eh? □
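Since the proof is left as an exercise, here is at least a finite-difference spot check of identity 2 in pure Python (Q symmetric; Q and x are arbitrary test values):

```python
Q = [[2.0, 1.0], [1.0, 3.0]]             # symmetric example matrix
x = [0.4, -1.2]

def f(x):
    """f(x) = (1/2) x^T Q x."""
    return 0.5 * sum(x[i] * Q[i][j] * x[j] for i in range(2) for j in range(2))

grad = [sum(Q[i][j] * x[j] for j in range(2)) for i in range(2)]   # claim: Qx
h = 1e-6
for i in range(2):
    xp = list(x); xp[i] += h
    xm = list(x); xm[i] -= h
    assert abs((f(xp) - f(xm)) / (2 * h) - grad[i]) < 1e-6
```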
we can write

∂E/∂W = (1/N) Σₙ ( W [1; x(n)] − d(n) ) [1; x(n)]ᵀ
      = (1/N) Σₙ ( y(n) x̄(n)ᵀ − d(n) x̄(n)ᵀ )
      = (1/N) Σₙ ( y(n) − d(n) ) x̄(n)ᵀ

where x̄(n) ≜ [1; x(n)].
Question 12.1 How can we use this information to “train” a neural network?
1. Requires knowledge of ∇E
Example 12.1 Problem 3.2 in [Hay99]. Minimize f(x) = ½ xᵀRₓx − r_xdᵀx.⁷

⁷ See also GradientDescent.mov at http://www.eng.auburn.edu/users/hodelas/teaching/6240.
w_opt = Rx\rxd;
fprintf(’(a): w_opt = %12.4e %12.4e\n’,w_opt(1), w_opt(2));
if(nn == 1)
wn = [0;0];
gn = Rx*wn - rxd;
else
wn1 = wdata(:,nn-1); % get w(nn-1)
gn = Rx*wn1 - rxd; % here’s my gradient
wn = wn1 - eta*gn; % get next weights
wdata(:,nn) = wn; % save the weights for plotting
end
gradNorm(nn) = norm(gn);
end
fignum = fignum+1; figure(fignum);
subplot(2,1,1);
plot(wdata(1,:), wdata(2,:),’-’, w_opt(1), w_opt(2), ’x’);
grid on ;
title(sprintf(’Steepest descent with eta=%f’,eta));
legend(’iterate values’, ’optimal value’);
xlabel(’w_1(n)’); ylabel(’w_2(n)’);
subplot(2,1,2);
plot(1:nstps,errv,’-’,1:nstps,gradNorm,’-’);
xlabel(’iteration number’);
legend(’cost function value’,’|| gradient ||’);
grid on;
eval (sprintf( ’print -depsc hwk0302_%d.eps’,fignum));
end
figure(fignum);
meshc(xx,yy,cf’);
xlabel(’x value’);
ylabel(’y value’);
grid on
hwk0302.m Output
[Figure: steepest-descent iterates w(n) with the optimal value marked, and cost function
value / gradient norm vs. iteration number.]
Case 1: η = 0.3
[Figure: steepest-descent iterates w(n) with the optimal value marked, and cost function
value / gradient norm vs. iteration number.]
Case 2: η = 1.0
2. Requires knowledge of g, H.
[Figure: iterates x(n) and cost function value / gradient norm vs. iteration number.]
(a) (mfile) write an m-file that uses the method of steepest descent to find the value
of x that minimizes f (x).
Solution
Note Misprint in assignment; I meant to write f(x) = ½ xᵀ [4 2; 2 4] x + [1 1] x.
I'm surprised no one asked about that. M-file is nearly identical to M-file 12.1
(hwk0302.m) (see listing at the end of this solution). Plots are in Figure 8.
(b) (written) Use ∇x f = 0 to solve the above minimization by hand. Compare your
theoretical answer to the result from your m-file. Explain any differences you
observe.
hwk304.m Output
2(b): difference is 2.7756e-17 0.0000e+00
>>
Optimal value is at [4 2; 2 4] x + [1; 1] = 0, i.e., x = −[1/6; 1/6]. Differences (see
m-file output) are due to double-precision arithmetic roundoff.
3. (mfile) Use the method of steepest descent to train a linear neural network (LNN)
y = W x̄ to mimic the logic gates indicated below. (written) Discuss the quality of the
output of your LNN: why does it work (or not work)?
You should run your iteration for 100 steps. Your plots should include:
Solution Recall that ∂/∂x ( Σᵢ fᵢ(x) ) = Σᵢ ∂fᵢ(x)/∂x. Thus we have
E = ½ Σ_{n=1}^4 (d(n) − Wx(n))ᵀ (d(n) − Wx(n))
  = Σ_{n=1}^4 [ d(n)ᵀd(n)/2 − d(n)ᵀWx(n) + x(n)ᵀWᵀWx(n)/2 ]

J = ∂E/∂W = Σ_{n=1}^4 [ −d(n)x(n)ᵀ + Wx(n)x(n)ᵀ ]
E ≜ ½ Σₙ (d(n) − Wx̄(n))ᵀ (d(n) − Wx̄(n)) = eᵀe/2 = Σᵢ eᵢ²/2.

If we define

e = [ d(1) − Wx̄(1) ; · · · ; d(4) − Wx̄(4) ]

it is easy to verify that E = eᵀe/2. With this definition,
J = [ ∂e₁/∂w₀ · · · ∂e₁/∂w₂ ; ⋮ ; ∂e₄/∂w₀ · · · ∂e₄/∂w₂ ] = − [ x̄(1)ᵀ ; ⋮ ; x̄(4)ᵀ ]
Note Clearly label all plots and turn in printed copies of your plots with your written
homework.
%----------------------------------------------
% Problem 2a : min x’ Q x/2 + c’ x
%----------------------------------------------
%----------------------------------------------
% Problem 2(b): steepest descent
%----------------------------------------------
nstps = 200; fignum = 0;
xdata = zeros(2,nstps); % array in which to save iterative solution values
eta = 0.1;
errv = zeros(1,nstps);
gradNorm = zeros(1,nstps);
for nn=1:nstps
if(nn == 1)
xn1 = [0;0]; % 1st step: initialize x(n-1) to zero
else
xn1 = xdata(:,nn-1); % xn1 = x(n-1)
end
gn = QQ*xn1 + cc; % gradient at x(n-1)
xn = xn1 - eta*gn; % compute next x(n)
xdata(:,nn) = xn; % and store it
errv(nn) = xn’*QQ*xn/2 + cc’*xn;
gradNorm(nn) = norm(gn);
end
xmin = xn; % save in variable for Simmons to grade
fprintf(’2(b): difference is %12.4e %12.4e\n’, xmin(1) - xopt(1), ...
xmin(2) - xopt(2))
%----------------------------------------------
% Problems 3,4 , AND,XOR,OR gates, Steepest Descent
%----------------------------------------------
xn = [0 0; 0 1; 1 0; 1 1]’; % x(i) is in col i of xn
x1 = linspace(0,1,25);
x2 = linspace(0,1,27);
ee = zeros(4,1);
JJ = zeros(4,3);
for ii=1:4
xbar = [1; xn(:,ii)];
ee(ii) = (dd(ii,jd) - Wn*xbar);
JJ(ii,:) = -xbar’;
end
wn = wn - ( JJ’*JJ + delta*eye(3))\JJ’*ee;
Wn = wn’;
end
wdata(:,nn) = Wn’;
% compute error for this function (jd =1,2,3 -> AND, OR, XOR)
err(jd,nn) = 0;
for ii = 1:4
xbar = [1;xn(:,ii)];
err(jd,nn) = err(jd,nn) + (dd(ii,jd) - Wn*xbar)^2/2;
end
gradNorm(jd,nn) = norm(gn);
end
titles = {’AND’,’OR’,’XOR’};
subplot(2,2,jd) % mesh plots
warning(’off’); % avoid pesky messages on xor plot
meshPex(Wn,x1,x2,’input a’,’input b’,titles{jd});
end
% print meshplots of the 3 network outputs
eval(sprintf(’print -depsc hwk304%.2d.eps’,fignum));
if(probNum == 3)
eta3 = eta*ones(1,3); % used same eta for all three
wn3 = wdata;
error3 = err;
nmgrad3 = gradNorm;
else
eta4 = delta*ones(1,3); % used same eta for all three
wn4 = wdata;
error4 = err;
nmgrad4 = gradNorm;
end
end
[Figure: AND, OR, and XOR training error and gradient norm vs. iteration number.]
[Figure: AND, OR, and XOR network outputs vs. inputs a and b; the XOR output is
essentially flat near 0.5.]
[Figure: AND, OR, and XOR training error and gradient norm vs. iteration number
(second case).]
[Figure: AND, OR, and XOR network outputs vs. inputs a and b (second case); the XOR
output is flat at 0.5.]
xx = linspace(-1,2,11);
yy = linspace(-1,2,13);
W = [1,2,3];
yvals = meshPex(W,xx,yy,’x_1’,’x_2’,’Example plot’);
print -depsc meshPexPlot.eps
if(doMeshPlot)
mesh(x1,x2,Yvals);
xlabel(xstr);
ylabel(ystr);
title(tstr);
grid on;
end
[Figure: example meshPex plot for W = [1, 2, 3] over x₁, x₂ ∈ [−1, 2].]
Suppose

E(w) ≜ ½ Σ_{i=1}^n eᵢ(w)² = ½ e(w)ᵀe(w)

(mean squared error). 1st-order Taylor series for the error e about w(n) (to select a new w):

e(w(n+1)) ≈ e(w(n)) + ∇_w e|_{w(n)}ᵀ (w(n+1) − w(n)) ≜ e(w(n)) + J(n)(w(n+1) − w(n))

w ≜ arg min_w ½ ‖e(w(n+1))‖²
  = arg min_w [ ½ ‖e(n)‖² + e(n)ᵀJ(n)(w − w(n)) + ½ (w − w(n))ᵀJ(n)ᵀJ(n)(w − w(n)) ]
Differentiate w.r.t. w, set to 0 to obtain
13.3 Perceptrons
Read §3.8-3.9
[Diagram: weight vector w defines a separating hyperplane between classes C₁ and C₂.]
So
v(n) = w(n)T x(n)
x ∈ C1 =⇒ w T x > 0.
Inputs Sequence of inputs x(n) and desired outputs d(n), learning parameter η (or se-
quence of learning parameters η(n)).
for n = 1, 2, ...
endfor
perEx.m Output
W = perceptron2Dlearn(x,d,nIter);
train a threshold activation function perceptron to learn a data set
(if possible)
inputs:
x, d: x(Nx2), d(N,1): input, desired output pairs
d(nn) should be either 1 or -1.
nIter: max number of iterations to run
outputs:
W: phi( W* xbar ) should match data (if possible in given # iterations)
iNum: number of passes through data to classify
xx =
0 1 0 1
0 0 1 1
dd =
1 1 1 -1
Converged in 9 iterations to
WW =
4 -2 -3
>>
[Figure: perceptron output over NAND inputs 1 and 2.]
iNum = 0;
done = 0;
WW = zeros(1,3);
iNum = iNum + 1;
if( iNum >= nIter ) % too many iterations, quit.
done = 1;
end
end
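The same learning loop can be sketched in pure Python on the NAND data from the example output above (this uses the standard perceptron correction w ← w + η(d − y)x̄; the converged weights need not match the m-file's WW exactly):

```python
def phi(v):
    """Sign threshold: outputs are coded +/-1."""
    return 1 if v >= 0 else -1

X = [[0, 0], [1, 0], [0, 1], [1, 1]]
d = [1, 1, 1, -1]                        # NAND truth table, coded +/-1
w = [0.0, 0.0, 0.0]                      # [bias, w1, w2]
eta = 1.0

for _ in range(100):                     # passes through the data
    errors = 0
    for x, t in zip(X, d):
        xbar = [1] + x
        y = phi(sum(wi * xi for wi, xi in zip(w, xbar)))
        if y != t:                       # misclassified: correct the weights
            w = [wi + eta * (t - y) * xi for wi, xi in zip(w, xbar)]
            errors += 1
    if errors == 0:                      # one full error-free pass: converged
        break
```

After convergence, phi(wᵀx̄) matches every (x, d) pair, as the perceptron convergence theorem guarantees for this linearly separable data.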
Inputs Sequence of inputs x(n) and desired outputs d(n), learning parameter η (or se-
quence of learning parameters η(n)).
for n = 1, 2, ...
endfor
Theorem 14.2 Suppose input vectors x(n) in Algorithm 13.1 are drawn from subsets Xi
of Ci , i = 1, 2. Assume that sets C1 and C2 are linearly separable. Then Algorithm 13.1
converges with w(0) = 0 and η = 1.
Proof: Since C1 and C2 are linearly separable it follows that there exists a vector w0 such
that w0 T x(n) > 0 for all x(n) ∈ X1 and w0 T x(n) ≤ 0 for all x(n) ∈ X2 . Define
Suppose all input vectors x(n) are drawn from X1 and that w(n)T x(n) ≤ 0 (the vectors
are incorrectly classified). Then for w(0) = 0 and η = 1
w(n + 1) = Σ_{i=1}^n x(i)
Define

β = max_{x(k) ∈ X₁} ‖x(k)‖².
By assumption the vectors x(k) are incorrectly classified, and so w(k)ᵀx(k) < 0, which
implies

‖w(k + 1)‖² − ‖w(k)‖² ≤ ‖x(k)‖².    (14.4)
Sum (14.4) over k = 1, ..., n and recall w(0) = 0 to get

‖w(n + 1)‖² ≤ Σ_{k=1}^n ‖x(k)‖² ≤ nβ    (14.5)
Equation (14.3) gives a quadratically growing lower bound on ‖w(n + 1)‖². Conversely,
equation (14.5) gives a linearly growing upper bound on ‖w(n + 1)‖². We conclude that
there exists some number nmax such that a correct classification must occur for n ≥ nmax. □
Remark 14.1 The above theorem states nothing about converging to a vector that correctly
differentiates between C1 and C2 . It merely states that if you present vectors from C1 long
enough it will eventually correctly classify those vectors. Nevertheless, the stronger result
can also be shown to be true: if a solution exists, the above procedure will converge so that
w(n0 ) = w(n0 + 1) = · · · for some n0 ≤ nmax .
Can also use an adaptive error correction model: let η(n) be the smallest integer such
that
η(n)x(n)T x(n) ≥ w(n)T x(n)
Question 14.1 Why is it o.k. to use integers? Shouldn’t we have to use small numbers?
yᵢ = φ( w_{i,·}ᵀ x̄ ) = φ( Σⱼ w_ij x̄ⱼ )

∂yᵢ/∂w_{i,·} = φ′(w_{i,·}ᵀ x̄) x̄

Question 14.3 How can we use this to design a training algorithm using continuous-valued
φ?

We need to determine

∂E/∂W = ∂/∂W [ ½ Σₙ (d(n) − φ(Wx̄))ᵀ (d(n) − φ(Wx̄)) ]
function y = phiT(x,a,b)
% function y = phiT(x,a,b)
% hyperbolic tangent activation function a*tanh(b*x)
y = a*tanh(b*x);

function dy = dphi(x,a,b)
% derivative of the hyperbolic tangent activation function a*tanh(b*x)
y = phiT(x,a,b);
dy = (b/a)*(a - y) .* (a + y);
Neuron models mathematical formulae, signal flow graphs, capabilities (separating hyper-
planes, normal vectors, dot products)
15.1 Introduction
Read §4.1-4.2
1. First pass: present data, get outputs and intermediate (local field) values.
1. Computationally efficient
3. High connectivity.
Vocabulary
Notation used in text is bad; uses indices i, j, and k to denote both a layer number and
to designate a neuron within a specified layer. We’ll use this instead:
[Diagram: two-layer network with bias inputs +1 and weight matrices W⁽¹⁾ and W⁽²⁾.]
Need not use the same activation function φ at all neurons, but I will follow
that format here.
Desired output Associated with x(n) is desired output vector d(n). Define
error vector e(n) = d(n) − y (2) (n). If there are more than two layers, then
set kmax = maximum layer number (2 in the diagram above) and set e(n) =
d(n) − y (kmax ) (n).
Error function E(n) ≜ ½ e(n)ᵀe(n) = ½ Σⱼ eⱼ(n)².

Define the average error function Eav: Eav(N) ≜ (1/N) Σ_{n=1}^N E(n)
Basis of back-propagation: chain rule of differentiation.
y_i^(k)(n) = φ(v_i^(k)(n)) = φ( Σ_{j=0}^m w_ij^(k)(n) y_j^(k−1)(n) )    (15.1)
Let k = kmax (looking at output neurons). From equation (15.1) and ej = dj − yj we have
∂E(n)/∂e_i(n) = e_i(n)  →  ∇_{e(n)} E(n) = e(n)

∂e_i(n)/∂y_i^(k)(n) = −1

∂y_i^(k)(n)/∂v_i^(k)(n) = φ'(v_i^(k)(n))

∂v_i^(k)(n)/∂w_ij^(k)(n) = y_j^(k−1)(n)

and so

∂E(n)/∂w_ij^(k)(n) = −e_i(n) φ'(v_i^(k)(n)) y_j^(k−1)(n)
Remark 15.2 Notice that the above update formulas can be executed in a decentralized computing environment. That is, all of the information (e_i, φ', v_i^(k), y_j^(k−1)) each neuron needs to update its weights is available locally.
J = − [ x̄(1)^T ; x̄(2)^T ; x̄(3)^T ; x̄(4)^T ] = − [ 1 0 0 ; 1 0 1 ; 1 1 0 ; 1 1 1 ]

which is a constant matrix. That means that I can calculate J (for this case) before I write the rest of the code.
Question 16.1 Would this be true if I wrote y = φ(W x̄)? (we’ll talk about that on
Monday)
Question 16.2 Suppose again that I wrote y = φ(W x̄). How would that change the
invertibility of J?
18 2003 10 01 Exam 1
Scores
1. (20)
2. (20)
3. (20)
4. (20)
5. (20)
Lemma 18.1 (Copied from 11.1) Let f(x) be a scalar function of a vector x ∈ IR^n and let g(W) be a scalar function of a matrix W ∈ IR^{m×n}. Define their respective partial derivatives as

∂f/∂x ≜ [ ∂f/∂x_1 ; ... ; ∂f/∂x_n ]

and

∂g/∂W ≜ [ ∂g/∂w_11 ... ∂g/∂w_1n ; ... ; ∂g/∂w_m1 ... ∂g/∂w_mn ]

Then
1. f(x) = c^T x  ⟹  ∂f/∂x = c

2. f(x) = (1/2) x^T Q x  ⟹  ∂f/∂x = Qx

3. g(W) = x^T W y  ⟹  ∂g/∂W = x y^T

4. g(W) = (1/2) x^T W^T W x  ⟹  ∂g/∂W = W x x^T
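Identities 1 and 2 can be spot-checked against finite differences. A small NumPy sketch (function names are mine), assuming Q symmetric as in the lemma:

```python
import numpy as np

def quad_grad(Q, c, x):
    """Gradient of f(x) = 0.5*x'Qx + c'x via identities 1 and 2
    of the lemma (Q symmetric): grad f = Qx + c."""
    return Q @ x + c

def numeric_grad(f, x, h=1e-6):
    """Central-difference estimate of the gradient of a scalar f."""
    g = np.zeros_like(x, dtype=float)
    for i in range(len(x)):
        e = np.zeros_like(x, dtype=float)
        e[i] = h
        g[i] = (f(x + e) - f(x - e)) / (2 * h)
    return g
```

For a quadratic, the central difference is exact up to rounding, so the two agree to many digits.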
[Scatter plot: 40 data points in the (x1(n), x2(n)) plane, each marked class 0 or class 1, with two candidate separating lines labelled division 1 and division 2; x1 spans roughly −6 to 6, x2 roughly −4 to 5.]
The plot above divides the 40 data points {x(n)}40 n=1 shown in the plot into two classes.
The division is based on the separating lines defined by the linear neural network weights

W = [ w1^T ; w2^T ] = [ −1 1 1 ; −0.5 −1 1 ];

that is, a data point x(n) is in class 1 if w1^T x̄(n) > 0 and w2^T x̄(n) > 0, otherwise it’s in class 0. Indicate which division line (division 1 or division 2) corresponds to the weight vector w1^T x̄ = 0. Show your work, either in mathematics or by labelling the diagram above.
Division number =
2. Can the 40 data points above be correctly classified using a single neuron (perhaps one
that uses a nonlinear activation function)? Explain why or why not.
If they can be correctly classified by a single neuron, give the corresponding activation
function and weights below.
Can it be done? =
Neuron function =
Explain.
3. Consider the sigmoid activation function φ(v) = 1/(1 + e^{−av}) for some a > 0. Show that dφ/dv = aφ(v)(1 − φ(v)).
4. Consider a neuron y = φ(W x̄) where W = [ 1 2 3 ], where φ(v) is the sigmoid function shown above and x̄ = [ 1 x1 x2 ]^T. Define v = W x̄.
Fill in the boxes below with the correct expressions or numerical values. (I will accept
either.)
∂v/∂W =

∂y/∂v =

∂y/∂W =
5. Let f(x) = (1/2) x^T Q x + c^T x for Q = [ 2 1 ; 1 2 ] and c = [ 4 5 ]^T.
(a) Find the errors in the MATLAB code below that attempts to implement a steepest
descent iteration to find x minimizing f (x).
M-file 18.1 exam2003BrainDead.m
% m-file example (with errors) of
% steepest descent iteration
% to minimize
% f(x) = x’*[2, 1, 1, 2]*x/2 + [4;5]*x
eta = 10;
grad_x = x*Q - c;
for ii=1:100
x = x + eta*grad_x;
end
(b) Find a vector x∗ such that f(x∗) = min_x f(x). Show that ∇_x f |_{x∗} = 0.
x∗ =
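For reference, a working steepest-descent iteration for this f, written as a NumPy sketch rather than MATLAB (contrast with the deliberately broken exam m-file, which steps uphill with a huge η and never recomputes the gradient inside the loop):

```python
import numpy as np

def steepest_descent(Q, c, eta=0.1, iters=200):
    """Minimize f(x) = 0.5*x'Qx + c'x by x <- x - eta*grad, where
    grad f = Qx + c.  Note the MINUS sign, the small step size, and
    that the gradient is recomputed every iteration."""
    x = np.zeros(len(c))
    for _ in range(iters):
        x = x - eta * (Q @ x + c)
    return x
```

With Q = [2 1; 1 2] and c = [4 5]^T the iteration settles at x∗ = −Q⁻¹c = [−1, −2]^T, where the gradient vanishes.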
∂E(n)/∂w_ij^(1)(n) = ( ∂E(n)/∂y_i^(1) ) ( ∂y_i^(1)/∂v_i^(1) ) ( ∂v_i^(1)/∂w_ij^(1) )

= [ Σ_k ( ∂E(n)/∂e_k^(2) ) ( ∂e_k^(2)/∂y_k^(2) ) ( ∂y_k^(2)/∂v_k^(2) ) ( ∂v_k^(2)/∂y_i^(1) ) ] ( ∂y_i^(1)/∂v_i^(1) ) ( ∂v_i^(1)/∂w_ij^(1) )

= ( Σ_k e_k^(2) (−1) φ'(v_k^(2)) w_ki^(2) ) φ'(v_i^(1)) y_j^(0)

= −( Σ_k δ_k^(2) w_ki^(2) ) φ'(v_i^(1)) y_j^(0)

≜ −δ_i^(1) y_j^(0)
where

δ^(k)(n) = { e^(k)(n) .∗ φ'(v^(k)(n)),                 k = output layer
           { ( W^(k+1) )^T δ^(k+1) .∗ φ'(v^(k)(n)),    k = hidden layer
Question 19.2 The perceptron convergence theorem states that the perceptron training algorithm 13.1 will converge with initial weights W = 0. What happens to the back-propagation algorithm if we initialize W(0) = 0 and W(1) = 0?
φ(v) = 1/(1 + e^{−av}),  a > 0, v ∈ IR

φ'(v) = a e^{−av}/(1 + e^{−av})^2 = a · [ 1/(1 + e^{−av}) ] · [ e^{−av}/(1 + e^{−av}) ]
      = aφ(v)(1 − φ(v))
δ(n) = e(n) .∗ φ'(v^(kmax)(n)) = a ( d(n) − y^(kmax)(n) ) .∗ y^(kmax)(n) .∗ ( 1 − y^(kmax)(n) )
function y = phi(x,a)
% sigmoid function with parameter a
y = 1 ./ (1 + exp(-a*x) );
function dy = dphi(x,a)
% derivative of sigmoid function with parameter a
y = phi(x,a);
dy = a * y .* (1 - y);
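The identity φ'(v) = aφ(v)(1 − φ(v)) can also be checked against a central-difference estimate, mirroring the symbolic-vs-numerical derivative comparison plotted in these notes. A small Python sketch (function names are mine):

```python
import math

def phi(v, a):
    """Logistic sigmoid with slope parameter a."""
    return 1.0 / (1.0 + math.exp(-a * v))

def dphi_identity(v, a):
    """Closed-form derivative: a*phi*(1 - phi)."""
    y = phi(v, a)
    return a * y * (1.0 - y)

def dphi_numeric(v, a, h=1e-6):
    """Central-difference estimate of phi'(v)."""
    return (phi(v + h, a) - phi(v - h, a)) / (2 * h)
```

The two agree to roughly eight digits for moderate v, which is the expected accuracy of a central difference with h = 1e-6.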
figure(1);
grid(); title(’Exponential sigmoid with a=1’);
plot(xx,phix, "-;phi;", xx, dphif, "-;symbolic dphi;", ...
xx1, dphiN,"-;numerical dphi;");
printeps("derivS.eps");
[Figure: exponential sigmoid with a = 1; φ plotted together with its symbolic derivative and a numerical derivative estimate over v ∈ [−6, 6].]
φ(v) = a tanh(bv)
⟹ φ'(v) = ab(1 − tanh^2(bv)) = (b/a)(a − y)(a + y),  where y = φ(v)
function y = phiT(x,a,b)
% function y = phiT(x,a,b)
% hyperbolic tangent activation function a*tanh(b*x)
y = a*tanh(b*x);
function dy = dphi(x,a,b)
% derivative of hyperbolic tangent function
y = phiT(x,a,b);
dy = (b/a)*(a - y ) .* ( a + y );
20.1.2 Assumption
There is a mathematical model f that will accurately predict the motion of the blimp given sufficient information. We will attempt to approximate f with a function of the last two values of p and u:

p̂(n + 2) = a0 p(n + 1) + a1 p(n) + b0 u(n + 1) + b1 u(n)

where p̂(n) is an estimate of p(n) based on the other measurements. In matrix form we can write
[ p̂(3) · · · p̂(12) ] = [ a0 a1 b0 b1 ] [ p(2) · · · p(11) ; p(1) · · · p(10) ; u(2) · · · u(11) ; u(1) · · · u(10) ]
Define d(n) = p(n + 2) and d = [ d(1) · · · d(10) ]. Similarly define

X = [ x(1) · · · x(10) ] = [ p(2) · · · p(11) ; p(1) · · · p(10) ; u(2) · · · u(11) ; u(1) · · · u(10) ]

and define

W = [ a0 a1 b0 b1 ].
Then we want to minimize

E = (1/2) ||d − W X||^2 = (1/2) (d − W X)(d − W X)^T
The matrices d and X are obtained with m-file M-file 20.1 (sysIdEx1GetData.m) in Figure
11. Input data is plotted in Figure 12.
u = [0; 1; 1; 2; 2; 3; 2; 1; 0; 0; 0; 0];
p = [0; 0.1; 0.5; 0.8; 1.2; 2.0; 3.0; 3.8; 4.2; 4.6; 4.9; 5.1];
t = ((1:12)-1)*0.5;
for n=2:11
d(n) = p(n+1);
X(:,n) = [p(n);p(n-1);u(n);u(n-1)];
end
Figure 11: M-file to construct input data matrices for blimp sys id experiment.
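Setting ∂E/∂W = −(d − W X) X^T to zero gives normal equations that can be solved directly; this closed-form fit is the target the iterative runs below should approach. A NumPy sketch (function names are mine):

```python
import numpy as np

def fit_linear_model(d, X):
    """Least-squares weights for E = 0.5*||d - W X||^2.
    Setting dE/dW = -(d - W X) X' = 0 gives the normal equations
    W (X X') = d X', i.e. W = d X' (X X')^{-1}."""
    return d @ X.T @ np.linalg.inv(X @ X.T)

def sysid_error(d, X, W):
    """The error function E = 0.5*||d - W X||^2."""
    e = d - W @ X
    return 0.5 * float(e @ e)
```

If the data really are generated by a linear model W0, the fit recovers W0 exactly (up to rounding) and drives E to zero.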
[Figure 12: voltage input u(t) and blimp position p(t) plotted versus time over 0–6 seconds.]
Remark 20.1 This approach will not work if we model p̂(n + 1) = φ(W x(n)) due to the
nonlinearity of the activation function.
kk = 1:maxIter;
fn = figure; subplot(2,1,1); plot(kk,Wmat,’-’);
xlabel(’iteration’); legend(’a_0’,’a_1’,’u_0’,’u_1’);
title(’steepest descent parameters’);
grid on
subplot(2,1,2); plot(kk,ErrV,’-’); xlabel(’iteration’);
ylabel(’Error’); grid on;
eval(sprintf(’print -depsc sysIdEx1%.4d.eps’,fn));
return
Figure 16: Steepest descent parameter values during iteration and resulting output error.
we apply an update for each point individually and use a very small step size η so that after
N = 10 steps we approximate the above gradient. M-file is in Figure 17. Results are in
Figure 18.
Figure 17: Approximate steepest descent (backpropagation) m-file. Notice that only one
data point is used to compute gn, which means that gn is no longer the gradient.
[Figure 18: backpropagation parameter values (a0, a1, u0, u1) and error over 1000 iterations, with the resulting output error per sample number 1–11.]
y_i^(k)(n) = φ(v_i^(k)(n)) = φ( Σ_{j=0}^m w_ij^(k)(n) y_j^(k−1)(n) )    (15.1)
Let k = kmax (looking at output neurons). From equation (15.1) and ej = dj − yj we have
∂E(n)/∂e_i(n) = e_i(n)  →  ∇_{e(n)} E(n) = e(n)

∂e_i(n)/∂y_i^(k)(n) = −1

∂y_i^(k)(n)/∂v_i^(k)(n) = φ'(v_i^(k)(n))

∂v_i^(k)(n)/∂w_ij^(k)(n) = y_j^(k−1)(n)

and so

∂E(n)/∂w_ij^(k)(n) = −e_i(n) φ'(v_i^(k)(n)) y_j^(k−1)(n)
Remark 21.1 Notice that the above update formulas can be executed in a decentralized computing environment. That is, all of the information (e_i, φ', v_i^(k), y_j^(k−1)) each neuron needs to update its weights is available locally.
∂E(n)/∂w_ij^(1)(n) = ( ∂E(n)/∂y_i^(1) ) ( ∂y_i^(1)/∂v_i^(1) ) ( ∂v_i^(1)/∂w_ij^(1) )

= [ Σ_k ( ∂E(n)/∂e_k^(2) ) ( ∂e_k^(2)/∂y_k^(2) ) ( ∂y_k^(2)/∂v_k^(2) ) ( ∂v_k^(2)/∂y_i^(1) ) ] ( ∂y_i^(1)/∂v_i^(1) ) ( ∂v_i^(1)/∂w_ij^(1) )

= ( Σ_k e_k^(2) (−1) φ'(v_k^(2)) w_ki^(2) ) φ'(v_i^(1)) y_j^(0)

= −( Σ_k δ_k^(2) w_ki^(2) ) φ'(v_i^(1)) y_j^(0)

≜ −δ_i^(1) y_j^(0)
where

δ^(k)(n) = { e^(k)(n) .∗ φ'(v^(k)(n)),                 k = output layer
           { ( W^(k+1) )^T δ^(k+1) .∗ φ'(v^(k)(n)),    k = hidden layer
Question 21.2 The perceptron convergence theorem states that the perceptron training algorithm 13.1 will converge with initial weights W = 0. What happens to the back-propagation algorithm if we initialize W(0) = 0 and W(1) = 0?
[Figure: desired output surface over the square [−1, 2] × [−1, 2], values ranging roughly 0.1–0.9.]
y = phi(v2,a);
om = zeros(size(dm));
em = dm;
nx = length(xx);
ny = length(yy);
for ii=1:nx
for jj=1:ny
y0 = [xx(ii);yy(jj)];
om(jj,ii) = nnsig(y0,W1, W2,a);
end
end
em = dm - om;
/* sigmoid function */
yy[ii] = 1.0/( 1.0 + exp(-a*yy[ii]));
}
}
if(nx != 1)
{
mexPrintf("x (%d x %d) must be a column vector",mx,nx);
mexErrMsgTxt("\nError");
return;
}
if( mx != nw1-1 )
{
mexPrintf("x (%d) w1 (%d x %d) incompatible",
mx, mw1, nw1);
mexErrMsgTxt("\nError");
return;
}
if( mw1 != nw2-1 )
{
mexPrintf("w1 (%d x %d), w2 (%d x %d) incompatible",
mw1, nw1, mw2, nw2);
mexErrMsgTxt("\nError");
return;
}
if(mw2 < 1 )
{
mexPrintf("w2 (%d x %d) has no outputs", mw2, nw2);
mexErrMsgTxt("\nError");
return;
}
}
}
om = mxGetPr(plhs[0]);
em = mxGetPr(plhs[1]);
See also backPropSig.mov at http://www.eng.auburn.edu/users/hodelas/teaching.
figure(4);
mesh(xx,yy,em);
title(’Initial error surface’);
print -depsc backPropSig_4.eps
e = dm(jj,ii) - y2;
d2 = -e .* dphi(v2,a);
DW2 = d2*[1;y1]’;
W2 = W2 - eta*DW2;
W1 = W1 - eta*DW1;
figure(8);
plot( ...
xp,yp(:,1),pcolor{1}, xp,yp(:,2),pcolor{2}, ...
xp,yp(:,3),pcolor{3}, xp,yp(:,4),pcolor{4}, ...
xp,yp(:,5),pcolor{5}, xp,yp(:,6),pcolor{6}, ...
xp,yp(:,7),pcolor{7}, xp,yp(:,8),pcolor{8}, ...
xp,yp(:,9),pcolor{9}, ...
-5:4,W2,’o’);
text(-5,4.0,sprintf(’iteration %d’,iter));
text(-5,3.6,sprintf(’Blue : top border’));
text(-5,3.2,sprintf(’Green: left border’));
text(-5,2.8,sprintf(’Red : right border’));
text(-5,2.4,sprintf(’Cyan : bottom border’));
for ip = 1:nh
% label line at far left side
idx = max(find(abs(yp(:,ip)) < 4.8) );
text(xp(idx),yp(idx,ip),sprintf(’line %d’,ip));
end
for ip = 0:nh
text(ip-5,W2(ip+1)+0.2,sprintf(’W2(%d)’,ip));
end
% x-axis labels
for lp = -6:2:4
text(lp,-4.5,sprintf(’%d’,lp));
end
% y-axis labels
for lp = -5:1:5
text(-6.1,lp,sprintf(’%d’,lp));
end
xlabel(’x1 value’);
ylabel(’x2 value/W2’);
title(’Layer 1 linear separation boundaries’);
grid on;
axis([-5,5,-5,5]);
axis(’equal’);
print -depsc backPropSig_8.eps
figure(7); plot(Errv); grid on;
title(sprintf(’Error function per backprop step’))
print -depsc backPropSig_7.eps
Figure 19: Simple backpropagation example: initial neural network output surface.
Figure 20: Simple backpropagation example: initial neural network output surface error
d(x1 , x2 ) − y(x1 , x2 )
Figure 21: Simple backpropagation example: output surface after 200 passes through all
data points.
Figure 22: Simple backpropagation example: linear region separating boundaries after 200 iterations.
Figure 23: Simple backpropagation example: error function over all iterations
1b Compare these to C-file 6.1 (mextanh.c) , C-file 6.2 (simpletanh.c) and C-file 7.3
(phiExV.c) . My solutions allow the input x to be a vector. I will accept solutions
where your code can only handle scalars (as done in C-file 6.1 (mextanh.c) and
C-file 6.2 (simpletanh.c) ). Source code for this problem is listed at the end of these
solutions.
2a You can use either steepest descent or backpropagation. My solutions show both; they are
listed at the end of the solutions. AND gate output in Figure 24.
hwk502.m Output
steepest descent:
tanh parameters: a= 1.5000e+00 b= 1.0000e+00
network weights: [ w0 w1 w2 ] = -3.8891e+00 2.5818e+00 2.5818e+00
[Figure 24: AND output surface plotted over input 1 and input 2 in [0, 1]².]
backprop:
tanh parameters: a= 1.5000e+00 b= 1.0000e+00
network weights: [ w0 w1 w2 ] = -4.1842e+00 2.7874e+00 2.7875e+00
>>
2b The hyperbolic tangent function allows the training function to “bend” the plane from
Homework 4 to match the shape of the AND gate data.
3 Source code is at the end of these solutions. Of course my solution matched MATLAB’s
output perfectly.
#include <math.h>
#include "mex.h"
static void phiT (double *yy, const double *xx, const double aa,
const double bb, int len)
{
int ii;
for ( ii = 0 ; ii < len ; ii++)
yy[ii] = aa * tanh (bb*xx[ii]);
}
static void dphiT (double *yy, const double *xx, const double aa,
const double bb, int len)
{
int ii;
phiT(yy,xx,aa,bb,len);
for ( ii = 0 ; ii < len ; ii++)
yy[ii] = (bb/aa) * (aa - yy[ii]) * ( aa + yy[ii]);
}
void mexFunction (int nlhs, mxArray * plhs[], int nrhs,
const mxArray * prhs[])
{
double *yy, *xx, aa, bb;
int len, mm, nn;
if (nrhs != 3) /* Check for proper number of arguments */
{
mexPrintf("dphiT: Got %d inputs\n", nrhs);
mexErrMsgTxt ("Need 3 input arguments");
}
else if (nlhs > 1)
{
mexPrintf("Got %d outputs\n", nlhs);
mexErrMsgTxt ("Need one.");
}
xx = mxGetPr (prhs[0]);
mm = mxGetM ( prhs[0]);
nn = mxGetN ( prhs[0]);
if(mm != 1 && nn != 1)
{
mexPrintf("xx(%d x %d) must be a vector\n",mm,nn);
mexErrMsgTxt ("dphiT: error");
}
plhs[0] = mxCreateDoubleMatrix (mm, nn, mxREAL);
yy = mxGetPr (plhs[0]);
aa = *mxGetPr(prhs[1]);
bb = *mxGetPr(prhs[2]);
dphiT (yy, xx, aa, bb, mm*nn);
}
figure(1)
meshPexT(W,x1, x2, a,b,’input 1’,’input 2’,’AND output (steepest)’);
end
% do backpropagation "gradient" step
xn = X(:,nn);
vv = W*xn;
gn = dphiT(vv,a,b)*((W*xn)*xn’ - d(nn)*xn’);
W = W - eta*gn;
maxErr = max(abs(d - phiT(W*X,a,b)));
end
figure(2);
fprintf(’backprop:\ntanh parameters: a=%12.4e b=%12.4e\n’,a,b);
fprintf(’network weights: [ w0 w1 w2 ] = %12.4e %12.4e %12.4e\n’, ...
W(1), W(2), W(3));
meshPexT(W,x1, x2, a,b,’input 1’,’input 2’,’AND output’);
print -depsc hwk2003_0502.eps
#include <math.h>
#include "mex.h"
/* aa (mm x pp ), bb (pp x nn ), -> cc (mm x nn ) */
static void
matmul (double *cc, const double *aa, const double *bb,
const int mm, const int nn, const int pp)
{
int ii, jj, kk;
for ( ii = 0 ; ii < mm ; ii++)
{
for ( jj = 0 ; jj < nn ; jj++ )
{
cc[ii + mm*jj] = 0.0;
for ( kk = 0 ; kk < pp ; kk++)
{
cc[ ii + mm*jj ] += aa[ii + mm*kk] * bb [ kk + jj*pp];
}
}
}
}
void
mexFunction (int nlhs, mxArray * plhs[], int nrhs,
const mxArray * prhs[])
{
double *cc, *aa, *bb;
int mm, pp, nn;
mex matmul.c
a = [1 2 3 ; 4 5 6 ; 7 8 10];
b = [5 6 ; 7 8 ; 1 2 ];
c = matmul(a,b)
chk = a*b
err = c - chk
backPropRand.m Output
E_av = (1/N) Σ_{ℓ=1}^N E(x(ℓ))

ΔW^(k)(n) = (η/N) Σ_{ℓ=1}^N ΔW^(k)(n; x_ℓ)

where ΔW^(k)(n; x_ℓ) is the gradient due to x_ℓ with the weight set W^(k)(n).
As an alternative to batch processing, select

ΔW^(k)(n) = α ΔW^(k)(n − 1) + η δ^(k)(n) y^(k−1)(n)^T

Result:

ΔW^(k)(n) = η Σ_{ℓ=1}^n α^{n−ℓ} δ^(k)(ℓ) y^(k−1)(ℓ)^T
Selection of parameters:

Idea Select α so that α^N is “significant”, but so that α^{2N} is small (e.g., α^N = 0.2 or smaller). This results in behavior similar to “batch” processing (see text).

Selection of η: suppose that δ^(k)(n) y^(k−1)(n)^T were constant. Then we’d like ΔW^(k)(n) → δ^(k)(n) y^(k−1)(n)^T, ⟹ η = 1 − α. (Consider the D.C. gain of the transfer function η/(z − α).)
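The D.C.-gain argument can be checked directly by iterating the momentum filter with a constant gradient term; a small Python sketch (names are mine):

```python
def momentum_response(alpha, g, steps=200):
    """Iterate dW(n) = alpha*dW(n-1) + eta*g with eta = 1 - alpha and a
    constant 'gradient' term g.  The D.C. gain of eta/(z - alpha) at
    z = 1 is eta/(1 - alpha) = 1, so dW(n) converges to g."""
    eta = 1.0 - alpha
    dW = 0.0
    for _ in range(steps):
        dW = alpha * dW + eta * g
    return dW
```

In closed form dW(n) = g(1 − α^n), so with α = 0.9 the response is within 10⁻⁹ of g after 200 steps.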
(a) Obtain the all files from the class ftp site
ftp://ftp.eng.auburn.edu/pub/hodel/6240
in directory hwk6. Run the m-file makeMex in MATLAB. This will compile a
number of C-language functions for you. Equivalent m-file functions are included
in hwk6 so that you can see what the C-functions do. Relevant functions are
M-file 19.4 (phiT.m), M-file 19.5 (dphiT.m) (mex) solutions to your last home-
work; M-file 4.3 (mlpT.m) (mex) Compute the output of a two layer perceptron
with hyperbolic tangent activation functions; M-file 4.2 (mlpEvalT.m) (mex) eval-
uate a two layer perceptron with user specified weights W1 and W2 over all data
points in a given data set, returns the network outputs and the corresponding errors; M-file 25.3 (backPropStep.m) executes one back propagation step; M-file 25.2
(mlpTrain.m) (m-file) Repeatedly calls the above functions to train a two layer
MLP to match a given data set; M-file 25.4 (mlpNormalize.m) (m-file) Perform
statistical normalization on training data; M-file 9.2 (learnTaskEx1s.m) Example
of a linear classifier network worked earlier this semester in class (sine, sawtooth,
and square wave); hwk6P1.m: template for the solution to problem 1 of this home-
work. hwk6P2.m: m-file for analysis in problem 2 of this homework.
(b) Modify the code in hwk6P1.m to train the neural network as a pattern classi-
fier as was done in M-file 9.2 (learnTaskEx1s.m). Email your completed m-file
hwk6P1.m to Mr. Simmons.
2. Recall that we normalize input data by computing the mean x̄ = (1/N) Σ_{n=1}^N x(n) ≜ E[x(n)] and covariance Σ_x = (1/N) Σ_{n=1}^N (x(n) − x̄)(x(n) − x̄)^T ≜ E[(x(n) − x̄)(x(n) − x̄)^T] of a data set {x(n)}_{n=1}^N. The random number generator randn in MATLAB generates independent, identically distributed pseudo-random Gaussian variables with mean 0 and variance 1.
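A sketch of this normalization in NumPy, assuming (as the sigY = I output below suggests) that sigX acts as a square-root factor of the covariance — here a Cholesky factor; the function name is mine, not from mlpNormalize.m:

```python
import numpy as np

def normalize(X):
    """Statistical normalization of a data set whose COLUMNS are samples:
    subtract the sample mean, then whiten with a Cholesky factor of the
    sample covariance, so the result has mean 0 and identity covariance
    (cf. YY = sigX \ (XX - mX*ones(...)) in the m-file)."""
    N = X.shape[1]
    m = X.mean(axis=1, keepdims=True)
    Xc = X - m                        # remove the mean
    Sigma = Xc @ Xc.T / N             # sample covariance
    L = np.linalg.cholesky(Sigma)     # Sigma = L L'
    Y = np.linalg.solve(L, Xc)        # Y = L^{-1} Xc  =>  cov(Y) = I
    return Y, m, Sigma
```

Since cov(Y) = L⁻¹ Σ L⁻ᵀ = I exactly, the whitened data reproduces the sigY ≈ I behavior shown in the transcript below.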
YY = sigX\( XX - mX*ones(1,size(XX,2)) );
sigY = YY * YY’ / N;
fprintf(’sigY: %12.4e %12.4e\n’,sigY(1,1), sigY(1,2));
fprintf(’ %12.4e %12.4e\n’,sigY(2,1), sigY(2,2));
backPropNormTest.m Output
63 data points
sigX: 1.0000e+00 0.0000e+00
0.0000e+00 9.6825e-01
mX: 5.0000e-01
5.0000e-01
sigY: 1.0000e+00 -5.2868e-18
-5.2868e-18 1.0000e+00
Drop superscripts for now since they’re clear by context. Suppose we select weights W so that

E[w_ij] = 0

and

E[w_ij w_kℓ] = { σ_w^2,  i = k, j = ℓ
              { 0,       else
What does this mean about E[v (1) ]? Suppose that W and y are statistically independent.
Then
Techniques used:
8. Select equal numbers from each class so that the output variable is also uniformly
distributed (alternatively: scale outputs as described above)
backPropEx2.m Output
Found /Users/hodelas/aub/6240/dphiT.mexmac.
Found /Users/hodelas/aub/6240/mexPrintMat.mexmac.
Found /Users/hodelas/aub/6240/mlpEvalT.mexmac.
Found /Users/hodelas/aub/6240/mlpT.mexmac.
Found /Users/hodelas/aub/6240/phiT.mexmac.
ans =
success
sigY =
mY =
6.3050e-17
W1 =
W2 =
Columns 1 through 7
Columns 8 through 11
>>
ni = 2; % network dimensions:
nh = 10; % use 10 hidden nodes and one output node
no = 1;
startTime = clock;
[W1, W2, Yv, Errv, ErrHist] = mlpTrain(inData, ni, nh, no, eta, ...
alpha, a, b, maxIter);
trainingTimeSeconds = etime(clock,startTime);
• Haykin’s derivation of Neural Network performance assumes that inputs and outputs
are uniformly distributed. Our sample space is strongly biased toward points outside
the box. While the training algorithm works with all data points included in the sample
set, it works pretty well by keeping only selected points in the training data set. The
randomly chosen data values from outside the square are shown in Figure 28.
• The sum of the squared errors in each iteration (epoch in Haykin’s book) is shown in
Figure 29.
An epoch involves presentation of all “kept” data points to the neural network and
their corresponding backpropagation steps. Convergence is pretty much done by 150-
200 iterations (of 162 backpropagation steps each).
• The resulting fit to the data is much more pleasing in this case than in the original;
see Figures 30 – 31.
Figure 26: Activation function used in Example 25.1. Training was also done with φ(x) = √3 tanh(6x). Steep slope permits a better potential fit to the discontinuous underlying function, but the iteration did not converge after 20,000 iterations.
Figure 28: Randomly chosen data values from outside the unit square used for training the
neural network in Example 25.1.
Figure 29: Squared error sum in training iteration for example 25.1.
[Figures 30–31: resulting neural network output surfaces over input 1 and input 2 after training on the reduced data set.]
The subroutines used by this task are at the class ftp site.
function [W1, W2, Yv, ErrV, ErrHist ] = mlpTrain(inData, nx, nh, no, eta, alpha, a, b, maxIter)
% function [W1, W2, Yv, ErrV, ErrHist ] =
% mlpTrain(inData, nx, nh, no, eta, alpha, a, b, maxIter)
% perform backpropagation training on a two layer MLP
e = dn - y2;
d2 = e .* dphiT(v2,a,b);
DW2 = d2*[1;y1]’;
DW2 *= max(1, 0.01*norm(e)/(norm(DW2)+1e-3));
W2 = W2 + DeltaW2;
W1 = W1 + DeltaW1;
xData = mlpData(:,1:nx);
yData = mlpData(:,nx+(1:ny));
N = size(mlpData,1);
% mean values
xm = mean(xData);
ym = mean(yData);
xData = xData - ones(N,1)*xm;
yData = yData - ones(N,1)*ym;
% covariance matrices
if(N > nx)
sigX = xData’*xData/N;
xData = xData/sigX;
else
sigX = 1;
end
if(N > ny)
sigY = yData'*yData/N;
yData = yData/sigY;
else
sigY = eye(ny);
end
% normalized data
inData = [xData,yData];
[Figure 32: desired output surface for example 25.2.]
backPropEx3.m Output
trainingTimeSeconds =
287.0925
sigX =
3.5249 -0.0000
-0.0000 0.3590
mX =
1.0e-17 *
0.4537 -0.0851
sigY =
mY =
-1.1627e-17
W1 =
W2 =
Columns 1 through 7
Columns 8 through 11
>>
fn = 0;
fn = fn+1; figure(fn);
plot(xx,phiT(xx,a,b),’-’);
legend(sprintf(’-;phiT(x,%f,%f);’,a,b));
title(’activation function used’)
grid on;
eval(sprintf(’print -depsc backPropEx3_%d.eps’,fn));
fn = fn+1; figure(fn);
title(’Desired output’);
mesh(xx,yy,dm);
eval(sprintf(’print -depsc backPropEx3_%d.eps’,fn));
myData= mlpData;
fn = fn+1; figure(fn);
plot(ErrHist);
title(’Error history’);
eval(sprintf(’print -depsc backPropEx3_%d.eps’,fn));
fn = fn+1; figure(fn);
mesh(xx,yy,(dm - zm))
title(’NN error surface - 200 iterations’)
xlabel(’input 1’);
ylabel(’input 2’);
[Figure: error history, decaying over about 700 iterations.]
Output surface is in Figure 34. Notice that the choice of a in M-file 25.5 (backPropEx3.m)
results in the inability to match the data at the extreme upper and lower bounds on y. The
error surface is plotted in Figure 35.
Figure 34: Output surface for example 25.2. Compare to Figure 32.
Classification: how to interpret network output: Assign class i if y_i^(k̄)(n) > y_j^(k̄)(n) for all j ≠ i. Confidence? How close are the competitors? How close is y_i^(k̄)(n) to its quantized value?
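That interpretation is a one-liner; a small NumPy sketch (names are mine) that also reports the margin to the runner-up as a crude confidence measure:

```python
import numpy as np

def classify(y):
    """Interpret network outputs y as a class decision: pick the index i
    with the largest y_i, and report the margin y_i - y_runner_up as a
    crude confidence (small margin = close competitors)."""
    order = np.argsort(y)[::-1]       # indices sorted by descending output
    best, second = order[0], order[1]
    return int(best), float(y[best] - y[second])
```

For example, outputs (0.1, 0.9, 0.3) give class 1 with margin 0.6.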
[Figure 35: error surface for example 25.2, values roughly ±0.4 over input 1 and input 2.]
26.1.1 Undergraduates
Undergraduate projects may be of the following two types:
1. Write a technical review of a technical paper on neural networks. The article must be approved by Dr. Hodel, and should come from an IEEE journal (Transactions on Neural Networks, Transactions on Automatic Control, Control Systems Magazine, other IEEE conferences or journals) or some other professional society journal/conference (e.g., ASME, AIAA, etc.). Your review should either include
2. Update the neural network library functions in nnLib.c and nnLib.h. Possible ideas here are to update backPropStep so that it will work for any number of layers (not just two), updating and testing the radial basis function codes, etc. This kind of project should include
26.1.2 Graduates
Graduate student projects will involve at least the following three elements: (1) A written
manual/report. (2) Software implementations relevant to the project. (3) Test data used for
training and software verification.
Project subjects may be selected related to a student’s thesis research; discuss this
opportunity with your advisor. Otherwise, your project should involve some level of
complexity comparable to the example ideas listed in the next subsection. I will be very
flexible on the nature of the project, but it must include (1) a general problem statement,
(2) a mathematical discussion of the solution technique, (3) a software solution of the general
problem, and (4) an evaluation of the quality of solution.
Success is not required; what is required is a thorough discussion and understanding of
the techniques and results.
Notice: Exam 2 Will be on Monday Nov 3. Same rules as on the last exam, except that
you may bring a ruler so that you can draw a straight line.
2. written Define b(1) as the bias vector in layer 1 of a multilayer perceptron so that
v (1) = W (1) y (0) + b(1) . Consider the effect of data normalization on the output y (1) of
the hidden layer:
26.2 Separability
Read §5.1-5.2 in Simon Haykin. Neural Networks: A Comprehensive Foundation. Prentice
Hall, 2nd edition, 1999
Definition 26.1 Given a vector valued function φ : IRm → IRn and a vector w ∈ IRn , the
corresponding separating surface X = X (φ, w) is
X (φ, w) = {x : w T φ(x) = 0}
Definition 26.2 Classes C1, C2 are φ-separable if there exists a vector w such that x ∈ C1 ⟹ w^T φ(x) < 0 and x ∈ C2 ⟹ w^T φ(x) > 0.
Idea Map inputs nonlinearly into hidden layer, then linearly to output layer.
3. hyperspheres
Fact 26.1 The more hidden nodes you have, the more likely your data is φ−separable.
Remark 26.2 §5.2 in the text discusses the probability of φ-separability in terms of a
binomial expansion and Bernoulli trials. We will not address this analysis in this course.
Example 26.1 x ∈ IR^2. With φ(x) = [ e^{−(x1−1)^2} ; e^{−(x2−0.5)^2} ]:
M-file 26.1 radialEx00.m
% radialEx00.m
nx = 25; x1 = linspace(-5,5,nx); % set of data points
ny = 27; x2 = linspace(-5,5,ny);
w = [1;2]; % weight vector (picked arbitrarily)
x0 = [1;0.5]; % center of radial functions
zz = zeros(nx,ny); % compute surface values
for ii = 1:nx
for jj = 1:ny
xx = [x1(ii); x2(jj)] - x0;
zz(ii,jj) = w’ * exp( - xx .* xx );
end
end
% plot surface and equipotential surfaces
mesh(x1, x2, zz’); title(’Radial basis function example’);
printeps(’radialEx00a.eps’);
contour(x1, x2, zz’, 5); title(’Radial basis function example’);
printeps(’radialEx00b.eps’);
% radialEx01.m
nx = 25; x1 = linspace(-5,5,nx); % set of data points
ny = 27; x2 = linspace(-5,5,ny);
w = [-1;1]; % weight vector (picked arbitrarily)
x0 = [1;0.5]; % center of radial functions
zz = zeros(nx,ny); % compute surface values
for ii = 1:nx
for jj = 1:ny
% calculate bizarre monomial for this example
xx = [x1(ii)*x2(jj); x1(ii) + x2(jj)] - x0;
zz(ii,jj) = w’ * xx;
end
end
% plot surface and equipotential surfaces
mesh(x1, x2, zz’); title(’Radial basis function example’);
printeps(’radialEx01a.eps’);
contour(x1, x2, zz’, 5); title(’Radial basis function example’);
printeps(’radialEx01b.eps’);
>> backPropEx5
iteration 50; error = 0.0226
iteration 100; error = 0.0001
iteration 150; error = 0.0000
... [deleted a few lines]
iteration 550; error = 0.0000
(a) Since randn produces pseudo-random numbers that are expected to be statistically independent and identically distributed with mean 0 and variance 1, E[x] = [ 0 0 0 ]^T and E[(x − x̄)(x − x̄)^T] = I (a 3 × 3 identity).
(b) My output:
NN=2
Computed mean value=
mX = 0.3569 -0.3223 -1.0545
Computed covariance =
sigX =
0.3875 -0.1407 -0.2030
-0.1407 0.0511 0.0737
-0.2030 0.0737 0.1064
NN=10
Computed mean value=
mX = -0.0467 0.0289 -0.2239
Computed covariance =
sigX =
1.7623 0.3045 -0.3040
0.3045 0.7776 0.2775
-0.3040 0.2775 0.7536
NN=100
Computed mean value=
mX = -0.0833 0.1086 -0.0263
Computed covariance =
sigX =
NN=1000
Computed mean value=
mX = 0.0396 0.0240 0.0255
Computed covariance =
sigX =
0.9900 0.0016 -0.0090
0.0016 0.9878 -0.0206
-0.0090 -0.0206 0.9801
The smaller data sets are too small to give statistically reliable characterizations
of the mean and variance. This is illustrated by the histograms shown below.
[Histograms of the data for N = 2, N = 10, N = 100, and N = 1000 samples.]
Notice that, even in the case of N = 1000, the data bears poor resemblance to a
bell curve.
3. This corresponds to a digital filter with a pole between 0 and -1; so its impulse response
would be oscillatory, but stable. That is, one would expect the momentum term to
oscillate. Also, since
ΔW(n + 1) = αΔW(n) + δ y^T
this implies that α < 0 would cause the next backpropagation step to “backtrack” - to
undo some of the update of the current backpropagation step, and so one would expect
slower convergence.
We tested these expectations by re-running M-file 25.1 (backPropEx2.m) with alpha = -exp(log(0.2
(note the negative sign). This latter observation is consistent with the results shown
below, a comparison of the original training run with α > 0 to the results with α < 0:
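The oscillation predicted by the pole at z = α can be seen directly by iterating the momentum recursion with a constant gradient term; a small Python sketch (names are mine):

```python
def momentum_steps(alpha, g, steps=6):
    """First few values of dW(n) = alpha*dW(n-1) + (1-alpha)*g with a
    constant gradient term g.  The pole at z = alpha makes the response
    monotone for alpha > 0 but oscillatory ('backtracking') for
    alpha < 0."""
    eta = 1.0 - alpha
    dW, out = 0.0, []
    for _ in range(steps):
        dW = alpha * dW + eta * g
        out.append(dW)
    return out
```

With α = 0.5 the sequence rises monotonically toward g; with α = −0.5 successive increments alternate in sign, each step partially undoing the previous one, which matches the slower convergence observed above.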
[Error history plots over 2500 iterations: with α > 0 the error decays smoothly toward zero; with α < 0 the error decays more slowly.]
switch flag,
case 0, [sys,x0,str,ts]=mdlInitializeSizes;
case 1, sys=mdlDerivatives(t,x,u);
case 2, sys=mdlUpdate(t,x,u);
case 3, sys=mdlOutputs(t,x,u);
case 4, sys=mdlGetTimeOfNextVarHit(t,x,u);
case 9, sys=mdlTerminate(t,x,u);
otherwise error([’Unhandled flag = ’,num2str(flag)]);
end
%=============================================================================
% mdlInitializeSizes
% Return the sizes, initial conditions, and sample times for the S-function.
%=============================================================================
function [sys,x0,str,ts]=mdlInitializeSizes
sizes = simsizes;
sizes.NumContStates = 2;
sizes.NumDiscStates = 0;
sizes.NumOutputs = 2;
sizes.NumInputs = 1;
sizes.DirFeedthrough = 0;
sizes.NumSampleTimes = 1; % at least one sample time is needed
sys = simsizes(sizes);
x0 = [0;0]; % initial conditions
str = []; % str is always an empty matrix
ts = [0]; % initialize the array of sample times
return
%=============================================================================
% mdlDerivatives
% Return the derivatives for the continuous states.
%=============================================================================
function dx=mdlDerivatives(t,x,u)
dx = [x(2); (sin(x(1)) + u)];
% limit theta to stay within [-pi,pi]
thlim = min(max(x(1),-pi),pi);
dx(2) = dx(2) -100*(x(1)-thlim);
return
%=============================================================================
% mdlUpdate
% Handle discrete state updates, sample time hits, and major time step
% requirements.
%=============================================================================
function sys=mdlUpdate(t,x,u)
sys = [];
return
%=============================================================================
% mdlOutputs
% Return the block outputs.
%=============================================================================
function y=mdlOutputs(t,x,u);
y = x;
return
%
%=============================================================================
% mdlGetTimeOfNextVarHit
% Return the time of the next hit for this block. Note that the result is
% absolute time. Note that this function is only used when you specify a
% variable discrete-time sample time [-2 0] in the sample time array in
% mdlInitializeSizes.
%=============================================================================
%
function sys=mdlGetTimeOfNextVarHit(t,x,u)
% end mdlGetTimeOfNextVarHit
%=============================================================================
% mdlTerminate
% Perform any end of simulation tasks.
%=============================================================================
%
function sys=mdlTerminate(t,x,u)
sys = [];
return
% end mdlTerminate
switch flag,
case 0, [sys,x0,str,ts]=mdlInitializeSizes;
case 1, sys=mdlDerivatives(t,x,u);
case 2, sys=mdlUpdate(t,x,u);
case 3, sys=mdlOutputs(t,x,u);
case 4, sys=mdlGetTimeOfNextVarHit(t,x,u);
case 9, sys=mdlTerminate(t,x,u);
otherwise error(['Unhandled flag = ',num2str(flag)]);
end
%=============================================================================
% mdlInitializeSizes
% Return the sizes, initial conditions, and sample times for the S-function.
function [sys,x0,str,ts]=mdlInitializeSizes
sizes = simsizes;
sizes.NumContStates = 0;
sizes.NumDiscStates = 2 + 2*11 + 4*10;
sizes.NumOutputs = 2;
sizes.NumInputs = 3;
sizes.DirFeedthrough = 1;
sizes.NumSampleTimes = 1; % at least one sample time is needed
sys = simsizes(sizes);
x0 = randn(sizes.NumDiscStates,1); % initial conditions
str = []; % str is always an empty matrix
ts = [0.1, 0]; % initialize the array of sample times
return
%=============================================================================
% mdlDerivatives
% Return the derivatives for the continuous states.
function dx=mdlDerivatives(t,x,u)
dx = []; % no continuous states
return
%=============================================================================
% mdlUpdate
% Handle discrete state updates, sample time hits, and major time step
% requirements.
function nextStates=mdlUpdate(t,x,u)
[W1,W2,yk1] = unpackStates(x); % do a backpropagation step
nextStates = x;
return
%=============================================================================
% mdlOutputs
% Return the block outputs.
function y=mdlOutputs(t,x,u);
% output is current estimate of next output
[W1,W2,yk1] = unpackStates(x);
y0 = [u(1);yk1];
y = mlpT(y0,W1,W2,1.5,1);
return
%
%=============================================================================
% mdlGetTimeOfNextVarHit
% Return the time of the next hit for this block. Note that the result is
% absolute time. Note that this function is only used when you specify a
% variable discrete-time sample time [-2 0] in the sample time array in
% mdlInitializeSizes.
%=============================================================================
%
function sys=mdlGetTimeOfNextVarHit(t,x,u)
sampleTime = 1; % Example, set the next hit to be one second later.
sys = t + sampleTime;
% end mdlGetTimeOfNextVarHit
%
%=============================================================================
% mdlTerminate
% Perform any end of simulation tasks.
%=============================================================================
%
function sys=mdlTerminate(t,x,u)
sys = [];
return
% end mdlTerminate
Notice: Exam 2 will be on Monday Nov 3. Same rules as last time, except: you may bring a ruler so that you can draw a straight line. No calculators, no notes, no references.
1. Define the whitened data
$$ y^{(0)}(n) = \Sigma_X^{-1/2}\,(x(n) - \mu_X). $$
Its mean is
$$ \mu_{y^{(0)}} = \frac{1}{N}\sum_{n=1}^{N} \Sigma_X^{-1/2}(x(n) - \mu_X) = \Sigma_X^{-1/2}\,\frac{1}{N}\sum_{n=1}^{N}(x(n) - \mu_X) = \Sigma_X^{-1/2}\!\left(\frac{1}{N}\sum_{n=1}^{N} x(n) - \mu_X\right) = \Sigma_X^{-1/2}(\mu_X - \mu_X) = 0. $$
Its covariance is
$$ \Sigma_{y^{(0)}} = \frac{1}{N}\sum_{n=1}^{N} y^{(0)}(n)\,y^{(0)}(n)^T = \frac{1}{N}\sum_{n=1}^{N} \Sigma_X^{-1/2}(x(n) - \mu_X)(x(n) - \mu_X)^T\,\Sigma_X^{-1/2} = \Sigma_X^{-1/2}\!\left(\frac{1}{N}\sum_{n=1}^{N}(x(n) - \mu_X)(x(n) - \mu_X)^T\right)\Sigma_X^{-1/2} = \Sigma_X^{-1/2}\,\Sigma_X\,\Sigma_X^{-1/2} = I. $$
2. The analysis is correct: the two methods do in fact give the same result. The difference
is that data normalization moves the internal values v (k) into the quasi-linear parts of
the activation functions φ so that learning can occur faster.
Notice: Exam 2 will be on Monday Nov 3. Same rules as on the last exam, except that you may bring a ruler so that you can draw a straight line.
Regularization term
$$ E_c(F) = \frac{1}{2}\|DF\|^2 $$
• D: “linear differential operator” – in other words, a filter, probably multidimensional.
Use D to enforce frequency-domain constraints on your solution function.
Example:
$$ DF = \begin{bmatrix} \dfrac{\partial^2}{\partial x_1^2} & \cdots & \dfrac{\partial^2}{\partial x_m^2} \end{bmatrix} F $$
Remark 29.1 The text makes use of some powerful mathematics (see, e.g., D. G. Luenberger. Optimization by Vector Space Methods. Wiley and Sons, Inc., New York, NY, 1969) to justify the use of Green’s functions. For an appropriate choice of operator D, these Green’s functions are the Gaussian radial basis functions we’re considering. The significance of Gaussian functions is that, in terms of wavelet theory, they provide an optimal trade-off between state-space localization (“time-domain”) and frequency localization in the sense of Heisenberg’s uncertainty principle. See Gilbert Strang and Truong Nguyen. Wavelets and Filter Banks. Wellesley-Cambridge Press, Wellesley, MA, 1996.
Remark 29.2 Strict enforcement of the choice of operator D leads to a choice of basis
functions φ that satisfy Dφ = 0.
Read §5.7
Define $\phi_i(x) \stackrel{\Delta}{=} \phi(\|x - t_i\|)$. RBF network output is $F(x) = \sum_{i=1}^{m(1)} w_i \phi_i(x)$. Let
$$ d = \begin{bmatrix} d_1 & \cdots & d_N \end{bmatrix}^T \qquad w = \begin{bmatrix} w_1 & \cdots & w_{m(1)} \end{bmatrix}^T $$
$$ G = \begin{bmatrix} \phi_1(x_1) & \cdots & \phi_{m(1)}(x_1) \\ \vdots & \ddots & \vdots \\ \phi_1(x_N) & \cdots & \phi_{m(1)}(x_N) \end{bmatrix} \qquad G_0 = \begin{bmatrix} \phi_1(t_1) & \cdots & \phi_{m(1)}(t_1) \\ \vdots & \ddots & \vdots \\ \phi_1(t_{m(1)}) & \cdots & \phi_{m(1)}(t_{m(1)}) \end{bmatrix} $$
Then the minimizing solution for weights w is
$$ \left(G^T G + \lambda G_0\right) w = G^T d. $$
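As an illustration of solving the regularized system $(G^T G + \lambda G_0)w = G^T d$, here is a NumPy sketch on made-up 1-D data (the notes use MATLAB; the centers, spread, and λ below are arbitrary choices, not values from the notes):

```python
import numpy as np

def gaussian(r, sigma=0.2):
    """Gaussian radial basis function of the distance r (sigma is arbitrary)."""
    return np.exp(-(r / sigma) ** 2)

x = np.linspace(0.0, 1.0, 20)          # training inputs x_1..x_N
d = np.sin(2.0 * np.pi * x)            # desired outputs d_1..d_N (toy target)
t = np.linspace(0.0, 1.0, 5)           # RBF centers t_1..t_m

G = gaussian(np.abs(x[:, None] - t[None, :]))    # G[i,j]  = phi_j(x_i)
G0 = gaussian(np.abs(t[:, None] - t[None, :]))   # G0[i,j] = phi_j(t_i)
lam = 1e-3                                       # regularization weight lambda

# minimizing weights: (G^T G + lambda*G0) w = G^T d
w = np.linalg.solve(G.T @ G + lam * G0, G.T @ d)
F = G @ w                                        # network output at the training inputs
```

Since the system matrix is the gradient condition of $\|d - Gw\|^2 + \lambda\, w^T G_0 w$, the fitted output F must match d more closely than the zero-weight network does.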
Alternative form (similar to Fuzzy Logic)
$$ F(x) = \frac{\displaystyle\sum_{i=1}^{m} w_i \phi_i(x)}{\displaystyle\sum_{i=1}^{m} \phi_i(x)} $$
Example 29.2 Two fuzzy-normalized RBFs added together. M-file below; plot in Figure 37. Notice that between centers we get interpolation, while one RBF tends to dominate away from the centers.
w = [3;1];
nx = 40; xx = linspace(0,8,nx);
ny = 41; yy = linspace(-5,8,ny);
zm = zeros(nx,ny);
for ix = 1:nx
for iy = 1:ny
xn = [xx(ix); yy(iy)];
e1 = xn -t1; e2 = xn - t2;
zm(ix,iy) = w'*[ exp(-e1'*(Sig1\e1)) ; exp(-e2'*(Sig2\e2)) ]/ ...
sum([ exp(-e1'*(Sig1\e1)) ; exp(-e2'*(Sig2\e2)) ]);
end
end
Training involves selection of parameters $t_i$, $\Sigma_i$, and $w_i$.
From [Hay99], the adaptation formulae for an RBF network require the Green’s function $G(x)$ and its first derivative $G'(x) = dG/dx$ with respect to scalar $x$.
Linear weights:
$$ \frac{\partial E(n)}{\partial w_i(n)} = \sum_{j=1}^{N} e_j(n)\, G\!\left(\|x_j - t_i(n)\|_{C_i}\right) $$
$$ w_i(n+1) = w_i(n) - \eta_1 \frac{\partial E(n)}{\partial w_i(n)}, \qquad i = 1, 2, \ldots, m(1) $$
Spread parameters:
$$ \frac{\partial E(n)}{\partial \Sigma_i^{-1}(n)} = -w_i(n) \sum_{j=1}^{N} e_j(n)\, G'\!\left(\|x_j - t_i(n)\|_{C_i}\right) Q_{ji}(n) $$
$$ \min_w J(w): \qquad J(w) = \sum_{i=1}^{N} \|d_i - F(x_i)\|^2 = \sum_{i=1}^{N} \left\| d_i - \sum_{j=1}^{m} w_j \phi_j(x_i) \right\|^2 $$
$$ \min_w \left\| \begin{bmatrix} d_1 \\ \vdots \\ d_N \end{bmatrix} - \begin{bmatrix} \phi_1(x_1) & \cdots & \phi_m(x_1) \\ \vdots & \ddots & \vdots \\ \phi_1(x_N) & \cdots & \phi_m(x_N) \end{bmatrix} \begin{bmatrix} w_1 \\ \vdots \\ w_m \end{bmatrix} \right\|^2 $$
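With centers fixed, this weight fit is an ordinary linear least-squares problem. A NumPy sketch with toy data (the notes work in MATLAB; the target, centers, and spread below are arbitrary illustrations):

```python
import numpy as np

x = np.linspace(-1.0, 1.0, 15)                  # training inputs x_1..x_N
d = x ** 2                                      # toy desired outputs d_1..d_N
t = np.array([-0.5, 0.0, 0.5])                  # RBF centers (arbitrary)
Phi = np.exp(-(((x[:, None] - t[None, :]) / 0.4) ** 2))  # Phi[i,j] = phi_j(x_i)

# solve min_w || d - Phi w ||^2
w, residuals, rank, sv = np.linalg.lstsq(Phi, d, rcond=None)
```

`lstsq` solves the over-determined system in the least-squares sense; in MATLAB the same fit is `w = Phi \ d`.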
for ix = 1:nx
for iy = 1:ny
xp = [xv(ix); yv(iy)];
for nn = 1:N
ti = xx(:,nn);
SigI = reshape(Sigs(:,nn),2,2);
zm(ix,iy) = zm(ix,iy) + exp( -(xp - ti)'*SigI*(xp-ti) );
end
end
end
rbfSig.m Output
ans = 1
ans = 2
ans = 2
[Figure: RBF centers in the (x, y) plane.]
$$ E \stackrel{\Delta}{=} \frac{1}{2}\sum_{n=1}^{N} e(n)^T e(n), \qquad e(n) \stackrel{\Delta}{=} d(n) - y(n), \qquad y(n) \stackrel{\Delta}{=} F(x(n)) $$
$$ F(x) = \sum_{k=1}^{n_\phi} w(k)\,\phi_k(x), \qquad \phi_k(x) \stackrel{\Delta}{=} \phi(v_k(x)), \qquad \phi(v) \stackrel{\Delta}{=} e^{v} $$
Similarly, with $S(k) \stackrel{\Delta}{=} \Sigma(k)^{-1}$,
$$ \frac{\partial E}{\partial s_{ij}(k)} = \sum_{n=1}^{N}\sum_{i=1}^{p} -e_i(n)\,w_i(k)\,\phi(v_k(x(n)))\,\frac{\partial v_k(x(n))}{\partial s_{ij}(k)} $$
$$ \frac{\partial v_k(x(n))}{\partial S(k)} = \frac{1}{2}\,(x(n) - t(k))(x(n) - t(k))^T $$
Notice that my solution differs from the text because I include the factor of 1/2 in $v_k(x)$.
• $E\left[(q^T x)(x^T q)\right] = q^T R q$
• encode: $y = Q_3^T x$
• decode: $\hat{x} = Q_3 y$
Neural network idea: can we get $Q_3$ without lots of nasty eigenvalue problems?
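The encode/decode steps can be sketched in NumPy (the data and the choice of three components below are illustrative, not from the notes):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 6)) @ rng.normal(size=(6, 6))  # correlated data, rows = samples
mu = X.mean(axis=0)
Z = X - mu                              # zero-mean data

R = Z.T @ Z / len(Z)                    # sample correlation matrix R
lam, Q = np.linalg.eigh(R)              # eigenvalues in ascending order
Q3 = Q[:, ::-1][:, :3]                  # dominant 3 eigenvectors as columns

Y = Z @ Q3                              # encode: y = Q3^T x (applied row-wise)
Xhat = Y @ Q3.T + mu                    # decode: xhat = Q3 y (mean restored)
```

Because `Q3 @ Q3.T` is an orthogonal projection, the reconstruction error can never exceed the energy of the zero-mean data.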
32 Exam 2 solutions
Name
Scores
1. (25pts)
2. (25pts)
3. (25pts)
4. (25pts)
Total:
Note Show your work. Multiple choice questions may have more than one correct answer.
Whether or not your answer will be judged to be “correct” depends on your response under
the word “explain.”
Permitted resources for this exam None. You may use a pencil or pen (use a pen
only if you never make mistakes), an eraser, and a ruler. You are not permitted to use a
calculator, textbook, written notes, oral or written communication with other people (besides
the instructor/GTA), laptop computers, cell phones, PDAs, wireless modems, telepathic
contact, or any other resources besides your own mind, body and a writing utensil. T-shirts
with Maxwell’s equations on the back will be tolerated, but I reserve the right to reseat you
in the back of the classroom. You are to sit with at least one empty chair between you and
the nearest classmate. Use of unauthorized resources on this exam will result in a failing
grade.
32.1
Consider the function plotted below:
[Figure: sinc(2πx) for |x| ≤ 1, 0 else.]
1 2 3 4 none of these
2. How many outputs does the network have? (circle 1)
1 2 3 4 none of these
3. What is the minimum number of hidden nodes in a multi-layer perceptron that can
give a “fairly good approximation” of the sinc function? (circle one)
1 3 6 10 none of these
Explain.
Solution Acceptable answers:
Six: two for each “hump” in the diagram (one for the leading edge, one for the trailing edge).
None of these (four): use two neurons to create a trough from −1 to 1 with depth of about −0.25, and two more neurons to create the peak in the middle between about −0.75 and 0.75.
None of these (many): the function’s change in slope at −1 and 1 is not smooth, and so will require a lot of neurons to model accurately.
32.2
Consider the function plotted below:
[Figure: two-dimensional sinc-like surface over the (x, y) plane.]
1 2 3 4 none of these
1 2 3 4 none of these
3. What is the minimum number of hidden nodes in a multi-layer perceptron that can
give a “fairly good approximation” of the sinc function? (circle one)
1 3 6 10 none of these
Explain.
Solution Acceptable answers:
None of these (two) Use one function to create the wider trough in the middle of
the plane, a second function to add the peak. However, this would likely not give
a good match to the data.
None of these (many) This function has a fairly flat top and a strange shape, so a good match would probably require many RBFs to get good interpolation behavior.
32.3
Consider a data set $\{x(n), d(n)\}_{n=1}^{N}$ where $x(n) \in \mathbb{R}^m$ and $d(n) \in \mathbb{R}^p$. Define the error function
$$ E \stackrel{\Delta}{=} \frac{1}{2}\sum_{n=1}^{N} e(n)^T e(n) $$
where $e(n) \stackrel{\Delta}{=} d(n) - y(n)$ and $y(n) = W^{(2)}\phi\!\left(W^{(1)}x(n) + b^{(1)}\right)$ where $b^{(1)} \in \mathbb{R}^h$. Define $E(n) = \frac{1}{2}e(n)^T e(n)$. Find
$$ \frac{\partial E(n)}{\partial W^{(2)}} \stackrel{\Delta}{=} \begin{bmatrix} \dfrac{\partial E(n)}{\partial W^{(2)}_{1,1}} & \cdots & \dfrac{\partial E(n)}{\partial W^{(2)}_{1,h}} \\ \vdots & \ddots & \vdots \\ \dfrac{\partial E(n)}{\partial W^{(2)}_{p,1}} & \cdots & \dfrac{\partial E(n)}{\partial W^{(2)}_{p,h}} \end{bmatrix} $$
Note Undergraduates may choose to derive $\dfrac{\partial E(n)}{\partial W^{(2)}_{ij}}$, a scalar-valued gradient, instead.
$$ \frac{\partial E(n)}{\partial W^{(2)}} = -e(n)\, y^{(1)}(n)^T $$
Show your work. If you need more space, work on the back side of this page. If you need more space than that ... try to write smaller.
Solution Apply the chain rule: First, notice that $\dfrac{\partial y_k^{(2)}}{\partial W^{(2)}_{ij}} = 0$ if $k \neq i$. From this result, there is no need to sum the partial derivatives over all entries in $e = \begin{bmatrix} e_1(n) & \cdots & e_m(n) \end{bmatrix}^T$.
$$ \frac{\partial E(n)}{\partial W^{(2)}_{ij}} = \left(\frac{\partial E(n)}{\partial e_i(n)}\right)\left(\frac{\partial e_i(n)}{\partial y_i(n)}\right)\left(\frac{\partial y_i(n)}{\partial W^{(2)}_{ij}}\right) = e_i(n)(-1)\,y_j^{(1)}(n) $$
which is the answer for undergraduates. Write the above gradient in terms of all combinations of $i$ and $j$ to get the answer listed above.
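The gradient $-e(n)\,y^{(1)}(n)^T$ can be verified numerically with a finite-difference check. A NumPy sketch (layer sizes, tanh as $\phi$, and the random data are all arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
m, h, p = 4, 3, 2                       # input, hidden, output sizes (arbitrary)
W1 = rng.normal(size=(h, m)); b1 = rng.normal(size=h)
W2 = rng.normal(size=(p, h))
x = rng.normal(size=m); d = rng.normal(size=p)

def En(W2m):
    """Single-pattern error E(n) = 0.5 e^T e with y = W2 phi(W1 x + b1), phi = tanh."""
    y1 = np.tanh(W1 @ x + b1)
    e = d - W2m @ y1
    return 0.5 * e @ e

y1 = np.tanh(W1 @ x + b1)               # hidden-layer output y^(1)(n)
e = d - W2 @ y1                         # error e(n)
grad = -np.outer(e, y1)                 # claimed gradient: -e(n) y^(1)(n)^T

# finite-difference check of one entry of the gradient
eps = 1e-6
W2p = W2.copy(); W2p[0, 1] += eps
fd = (En(W2p) - En(W2)) / eps
```

Since $E(n)$ is quadratic in $W^{(2)}$, the forward difference agrees with the analytic entry to within a few parts in $10^6$.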
32.4
Consider the unit square problem we’ve been working all semester. A friend (who is not in our
class) suggests that instead of using φ(v) = tanh(v) that you should use φ(v) = tanh(10v),
since the latter function is much steeper and so it can give a better approximation to the
unit square.
[Figure: activation function comparison, tanh(x) vs. tanh(10x).]
Evaluate your friend’s suggestion. Discuss (1) whether or not the use of φ(v) = tanh(10v) can give a better approximation to the data than the use of φ(v) = tanh(v), (2) how the use of φ(v) = tanh(10v) will affect input data normalization, and (3) how the use of φ(v) = tanh(10v) will affect the backpropagation training algorithm. If you need more space, you may continue your answer to this problem on the back of this exam page.
Solution It changes the training, but doesn’t improve things at all.
1. Consider output layer y (2) = φ(W (2) y (1) ) with φ = tanh(10v). We can use φ(v) =
tanh(v) instead by redefining W (2) := 10W (2) , that is, just absorb the factor of 10 into
the network weights. Hence, the proposed function does not provide the capability of a
better approximation than tanh by itself.
2. Data normalization is performed to ensure that the expected value of initial network
weights is in the “linear” region of the activation function. The use of φ(v) = tanh(10v)
restricts the “linear” region to a much narrower interval, hence the initial weights must be
initialized to one-tenth of what we’d usually do, i.e., in MATLAB code, Winitial = rand(m,n)/(10
3. The derivative of tanh(10v) is 10 times larger (and 10 times narrower) than
that of tanh(v). As a result, the backpropagation step size should be adjusted to be one-
tenth of what one would use for tanh(v). Further, since the derivative is effectively zero
outside of a narrow range, one would not expect a large number of neurons to be able
to “find” the boundary edges of the unit square unless they happened to be initialized
“just right.” On the other hand, since initial weights (listed above) are selected to
compensate for the factor of 10 in the activation function, the smaller step size may
result in comparable training behavior.
Either way, the proposed method doesn’t provide any clear advantage over φ(v) =
tanh(v).
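Point 1 of the solution can be checked numerically: absorbing the factor of 10 into the weights makes tanh(10v) reproduce tanh(v) exactly. A small NumPy check (the weights and input are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(4)
W = rng.normal(size=(3, 2))             # arbitrary layer weights
x = rng.normal(size=2)                  # arbitrary input

lhs = np.tanh(10.0 * ((W / 10.0) @ x))  # steep activation, weights scaled by 1/10
rhs = np.tanh(W @ x)                    # standard tanh, original weights
```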
function pcaEx1
format short e
mm = 150;
tt = linspace(-2,2,mm);
rand('seed',0);
randn('seed',0);
nsets = 20;
for ii=1:nsets
om = 1 + rand/10;
ph = rand/2;
rt = om*tt + ph;
mydat(ii,1:mm) = sin(rt);
rt = rt*5;
mydat(ii+nsets,1:mm) = sin(rt) ./ rt;
mydat(ii+2*nsets,1:mm) = sign(sin(rt));
fprintf('%4d: om=%12.4f ph = %12.4f\n',ii,om,ph);
end
mydat = mydat + randn(size(mydat))/20;
NN = size(mydat,1);
xbar = mean(mydat);
zmdat = mydat - ones(NN,1)*xbar;
sigx = zeros(mm,mm);
for nn=1:NN
xn = zmdat(nn,:)’;
sigx = sigx + xn*xn’/NN;
end
[VV,lam] = eig(sigx);
lam = diag(lam);
[lam,idx] = sort(-lam); lam = -lam;
VV = VV(:,idx); % reorder eigenvectors
chkval = (1e-6)*max(lam);
lamsiz = size(lam)
chkvalsiz = size(chkval);
idx = find(lam > chkval);
V3 = VV(:,1:3);
approxDat = (zmdat*V3)*V3’ + ones(NN,1)*xbar;
figure(1);
plot(tt,mydat,'-');
title('Principal components analysis example');
xlabel('time (sec)');
ylabel('signal value');
grid on;
axis([-2,2,-1.2,1.2]);
printeps('pcaEx1.eps');
figure(2);
semilogy(idx,lam(idx),'x');
axis;
grid on;
printeps('pcaEx1a.eps');
figure(3);
plot(tt,xbar);
grid on;
title('mean value vector');
printeps('pcaEx1b.eps');
figure(4);
plot(tt,VV(:,1:3));
title('dominant eigenvectors');
xlabel('time (s)');
ylabel('signal value');
grid on;
printeps('pcaEx1c.eps');
figure(5);
plot(tt,approxDat);
title('approximated data with dominant 3 eigenvectors');
xlabel('time (s)');
ylabel('signal value');
grid on;
printeps('pcaEx1e.eps');
imax = 6
w0 = randn(mm,1); w0 = w0/norm(w0); % random initial direction for power iteration
ww = w0;
for kk=1:imax
ww = sigx*ww;
ww = ww/norm(ww);
end
plot(tt,VV(:,1),'-',tt,w0,'--',tt,ww,'-.');
grid on;
legend('v_1','w(0)',sprintf('w(%d)',imax));
xlabel('time (s)');
ylabel('vector waveform')
title('Principal Components Analysis example');
printeps('pcaEx1f.eps');
function printeps(str)
eval(sprintf('print -depsc %s',str));
Subtract the mean value, then compute Σx. Dominant eigenvalues of Σx (ignoring all eigenvalues less than 10⁻⁶ λmax) are shown in Figure 43. The dominant 3 eigenvectors are plotted in Figure 44. Approximated data are shown in Figure 47. Notice the general signal forms are recognizable, in spite of different phase and frequency: the signal type can be classified with confidence.
Remark 34.1 Data set characteristics: same number of data points from each class (uni-
form sampling).
[Figure 43: dominant eigenvalues of Σx (semilog scale).]
[Figure 44: dominant eigenvectors (signal value vs. time (s)).]
[Figure 47: approximated data (signal value vs. time (s)).]
[Figure: power-iteration vector waveforms vs. time (s).]
linear model
$$ y = w^T x = \sum_{i=1}^{m} w_i x_i $$
Hebbian learning:
$$ w(n+1) = \frac{w(n) + \eta\, y(n)\, x(n)}{\sqrt{\left(w(n) + \eta\, y(n)\, x(n)\right)^T \left(w(n) + \eta\, y(n)\, x(n)\right)}} $$
Interpretation: Recall $\Sigma_x = \frac{1}{N}\sum x(n)x(n)^T$, so for a small enough step size, the first term $x(n)x(n)^T w(n)$ is an approximation for multiplying $\Sigma_x w(n)$, which will converge to the dominant eigenvector direction.
Suppose $w(n)$ were a scalar. Then the 2nd term is $-w(n)^3 x(n)^2$, which causes $w(n)$ to decrease in magnitude.
Hence, the 1st term drives toward the dominant eigenvector, while the 2nd term drives toward 0. Net result: converges to a bounded multiple of the dominant eigenvector of $\Sigma_x$.
Formal convergence analysis requires $\sum \eta(n) = \infty$ and $\sum |\eta(n)|^p < \infty$ for some $p > 1$. The text chooses $\eta(n) = 1/n$: learn more slowly as the network gets “older.” Then $w(n) \to$ dominant eigenvector.
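The normalized Hebbian update can be tried on synthetic data with one dominant variance direction. A NumPy sketch (a fixed step size η and the data scaling are arbitrary choices, not the η(n) = 1/n schedule from the text):

```python
import numpy as np

rng = np.random.default_rng(3)
# toy data with one dominant variance direction (scales are arbitrary)
X = rng.normal(size=(500, 3)) * np.array([3.0, 1.0, 0.5])

w = rng.normal(size=3)
w /= np.linalg.norm(w)
eta = 0.01
for n in range(2000):
    x = X[n % len(X)]
    y = w @ x                          # linear model y = w^T x
    w = w + eta * y * x                # Hebbian term w(n) + eta*y(n)*x(n)
    w /= np.linalg.norm(w)             # divide by the norm, as in the update rule

Sigma = X.T @ X / len(X)               # sample correlation matrix Sigma_x
lam, V = np.linalg.eigh(Sigma)
v1 = V[:, -1]                          # dominant eigenvector of Sigma_x
```

After training, w should point (up to sign) along the dominant eigenvector, matching the interpretation above.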
Human brain motor organization mimics physical organization of body (mirror image).
Idea Design a neural network to organize itself to match data organization. Techniques/concepts
involved:
else, bestCost = 0;
end
Jnext(ii) = bestCost;
end
Ju = Jnext; Jhist(:,iter) = Ju;
end
plot(1:maxi,Jhist,'-o'); grid on;
xlabel('iteration'); ylabel('cost to go estimate');
for ii=1:length(S)
text(5 + 0.5*ii,Ju(ii)+0.5,char('A'+ii-1));
end
title('Dynamic programming example (see Fig 12.4 in Haykin''s book)');
print -depsc dynProgEx.eps
output
State A: input 1 -> state B, cost 2
State A: input 2 -> state C, cost 4
State A: input 3 -> state D, cost 3
State B: input 1 -> state E, cost 7
State B: input 2 -> state F, cost 4
State B: input 3 -> state G, cost 6
State C: input 1 -> state E, cost 3
State C: input 2 -> state F, cost 2
State C: input 3 -> state G, cost 4
State D: input 1 -> state E, cost 4
State D: input 2 -> state F, cost 1
State D: input 3 -> state G, cost 5
State E: input 1 -> state H, cost 1
State E: input 2 -> state I, cost 4
State F: input 1 -> state H, cost 6
State F: input 2 -> state I, cost 3
State G: input 1 -> state H, cost 3
State G: input 2 -> state I, cost 3
State H: input 1 -> state J, cost 3
State I: input 1 -> state J, cost 4
State J: no transitions
iteration 2
iteration 3
iteration 4
iteration 5
iteration 6
iteration 7
iteration 8
iteration 9
iteration 10
iteration 11
[Figure: cost-to-go estimates per iteration; states A–J labeled on the curves.]
37 Hopfield Networks
Training uses correlation matrix memory (see [Hay99], §2.11, p. 79) to estimate training
values.
ftp://ftp.eng.auburn.edu/pub/sjreeves/matlab_primer_40.pdf
• The user’s manual for the Student Version of MATLAB (if you purchase it).
The manuals for Octave and/or Scilab may be of some use to you as well. These may be
obtained with their respective source code distributions.
Advanced students may wish to write mex function interfaces to compiled language com-
puter programs; this topic is not addressed in this laboratory session.
for i = 1, ..., 7. For example, suppose for equation i = 1 that we apply Kirchhoff’s current law to the node with voltage v1 to obtain
i2 − i 3 − i 4 = 0 (3.2)
Equation (3.2) can be rewritten in the form of equation (3.1) by selecting the coefficients
a1,1 = a1,2 = a1,3 = a1,7 = b1 = 0, a1,4 = 1 and a1,5 = a1,6 = −1.
[Figure 48 diagrams (a), (b), (c): circuits with a 100 kΩ resistor, a 10 µF capacitor, a 10 kΩ resistor, a 10 mH inductor, a source us(t − 0.5), current i1(t), and output vo(t).]
Figure 48: Circuit examples for use of MATLAB. Note us(t) refers to the unit step function: us(t) = 0 for t < 0, us(t) = 1 for t ≥ 0.
C MATLAB
C.3.1 Functions
Unlike C, MATLAB (usually) requires each function to be in its own text file, and the file
must end with extension .m. So, these are usually called “m-files.” M-file function text
C:
if(x < y )
{
printf("%e < %e\n",x,y);
}
else if ( x != y )
{
/* can split cmds across lines */
printf("%e is different from %e\n",
x,y);
}
MATLAB:
% use fprintf and single quotes; otherwise same as printf.
if(x < y )
fprintf('%e < %e\n',x,y);
% "else if" changes to elseif; != changes to ~= (use ~ for "not")
elseif ( x ~= y )
% use ... to split a command across lines
fprintf( ...
'%e is different from %e\n', ...
x,y);
end
Figure 50: Flow control in C and MATLAB: if statements
between the “function” line and the 1st statement is printed when you type help function name at the MATLAB prompt. See Figure 53 for more detail.
1. Compiled code will generally run faster than m-files. This is because m-files are not translated directly to machine code, but are interpreted by the MATLAB program.
2. Compiled code will generally take much longer to write than MATLAB m-files. This is because MATLAB’s basic variable types are much more flexible than C-language types and because debugging tools in MATLAB are much easier to work with.
C:
switch(x)
{
case 0:
do_this();
break;
case 1:
do_that();
break;
default:
printf("Bad case. x = %d\n",x);
}
MATLAB:
switch(x)
case(0), % no empty parenthesis in MATLAB
do_this;
case(1),
do_that;
otherwise,
% error function: returns to MATLAB prompt
error('Bad case. x=%d',x);
end
Figure 52: Flow control in C and MATLAB: switch
We illustrate some of the utility of MATLAB with the following examples.
Example 3.1 (dot products) Given two three-dimensional vectors
$$ w = \begin{bmatrix} w_1 \\ w_2 \\ w_3 \end{bmatrix}, \qquad x = \begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix}, $$
their dot product is defined as
$$ w \cdot x = w_1 x_1 + w_2 x_2 + w_3 x_3 \stackrel{\Delta}{=} \sum_{i=1}^{3} w_i x_i $$
C and MATLAB functions to compute the dot product of w and x are given below.
C:
double dotProd3(const double w[],
                const double x[])
{
int ii;
/* initialize dot prod in declaration */
double retval = 0;
for ( ii = 0 ; ii < 3 ; ii++)
retval += w[ii] * x[ii];
return retval;
}
MATLAB: create file dotProd.m containing this text.
function z = dotProd(w,x)
% z = dotProd(w,x)
% return dot product of column
% vectors w and x
z = w' * x; % w' = transpose, row vector
C:
double a, b, c, d;
int i;
c = 1;
d = 2;
i = is_positive(c);
stats( &a, &b, c, d);
MATLAB:
c = 1;
d = 2;
i = is_positive(c);
% MATLAB functions can return
% many values at once!
[a,b] = stats(c, d);
• no declarations
• the line z = w’ * x; works with the entire vectors (arrays) w and x and not just their
individual components x(1), x(2), etc.
Figure 54: Example of plotting in MATLAB (3rd order Fourier series approximation of a
square wave).
The unknowns in this circuit problem are i1 , v1 , and v2 . From Ohm’s law and Kirchhoff’s
14 The C code is not included because this is an ECE course, not a CSSE course.
$$ v_2 = 2000\, i_1 $$
$$ v_1 = 7000\, i_1 + v_2 $$
$$ (3000 + 7000 + 2000)\, i_1 = 5 $$
We rewrite these equations so that all unknowns appear on the left side of the equation and all the constant terms are on the right side to obtain
$$ v_2 - 2000\, i_1 = 0 \qquad (3.3) $$
$$ v_1 - 7000\, i_1 - v_2 = 0 \qquad (3.4) $$
$$ 12000\, i_1 = 5 \qquad (3.5) $$
In order to put these equations into MATLAB, we need to look at each line of the equation
as if it was a dot product w · x, written in MATLAB as w’ * x. Here’s the procedure. First,
we define a vector (array) x of the unknowns in some order. For this example we’ll put them
in alphabetical order:
$$ x \stackrel{\Delta}{=} \begin{bmatrix} i_1 \\ v_1 \\ v_2 \end{bmatrix} \qquad (3.6) $$
Remark 3.1 The command x = A\b tells MATLAB to compute a vector x so that the dot
product of row i of the matrix (2-D array) A with x matches component i of b.15
15
You will study this problem in much greater detail in your linear algebra class, where you will write
A ∗ x = b. The expression A*x refers to matrix-vector multiplication.
Remark 3.2 Notice that the MATLAB output for x does not show units. You as the programmer have to remember that x corresponds to i1 = 0.41667 mA, v1 = 3.75 V, and v2 = 0.83333 V.
Remark 3.3 Notice that rows of A and b are separated with semicolons ;, and that entries
on each row are separated with (optional) commas. The commas are not required, but
they’re a very good idea. To see this, type in the following commands (including spaces) at
the MATLAB prompt:
• x = [1 , - 2 ]
• x = [1 - 2]
These do not give the same answer! To avoid ambiguity, it’s a good idea to use commas and/or parentheses, e.g., x = [1, ( 3 - 4 ), 5] does the same thing as x = [1 3 - 4 5], but there’s no question what the first one is supposed to do.
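Equations (3.3)–(3.5) can be checked outside MATLAB too; here is an equivalent NumPy sketch (`np.linalg.solve` plays the role of MATLAB's `x = A\b`):

```python
import numpy as np

# rows encode the circuit equations; unknowns ordered alphabetically, x = [i1, v1, v2]
A = np.array([[-2000.0, 0.0,  1.0],   #    v2 - 2000*i1 = 0
              [-7000.0, 1.0, -1.0],   #    v1 = 7000*i1 + v2, rearranged
              [12000.0, 0.0,  0.0]])  # 12000*i1 = 5
b = np.array([0.0, 0.0, 5.0])

xsol = np.linalg.solve(A, b)           # NumPy's equivalent of MATLAB's x = A\b
i1, v1, v2 = xsol
```

The solution reproduces the values in Remark 3.2: i1 = 5/12000 A, v1 = 3.75 V, v2 = 5/6 V.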
function dx = odeExample(t,x)
% dx = odeExample(t,x)
% derivatives function to simulate
% y''(t) = -2 y'(t) - 2 y(t) + sin(t)
y = x(1);
v = x(2);
dy = v;
dv = -2*v - 2*y + sin(t);
dx = [dy ; dv];
Notice that ode45 requires three inputs: (1) a function ODEFUN= f (t, x) that, given current
conditions x(t), will return the current state derivatives, (2) a vector TSPAN of time values at
which we want to compute values of x(t), and (3) X0 = x(t0 ), a vector of initial conditions.
Item (1) is implemented as an m-file function shown in Figure 56. We can then simulate the
function with ode45 by writing and executing the m-file script odeExampleMain.m in Figure
57. For clarity, we discuss line by line the main routine odeExampleMain below:
2. tspan = linspace(0,15);
This line creates a variable tspan that has one row of 100 numbers that are evenly
spaced from 0.0 to 15.0. Try typing in tspan = linspace(0,15) without the semi-
colon. MATLAB will print out all 100 entries of tspan to the screen for you to see.
If you want tspan to have a different number of points, then use a third argument
to the linspace function, for example, tspan = linspace(0,15,10); will give you
10 points and tspan = linspace(0,15,16); will give you the 16 numbers 0,1,2,...,15,
etc. Since tspan has only one row, it’s often called a row vector.
We will use the variable tspan to tell ode45 what time instants we want to have in
our simulation.
3. x0 = [0;0];
0
This line creates a variable x0 = that has two rows and one column. x0 is used
0
to tell ode45 the initial conditions for the differential equation.
4. [tt,xx] = ode45(’odeExample’,tspan,x0);
This is where ode45 simulates the differential equations. Notice that the name of odeExample must be written in quotes in the call to ode45, i.e., ode45('odeExample',tspan,x0).
Notice that ode45 takes advantage of MATLAB’s ability to return more than one
variable. The variable tt is returned as a column vector (an array with only one column) that has 100 entries, the same entries we specified in tspan. The variable xx is returned as a 2-dimensional array with 100 rows and two columns. The first column holds the values of y(t) at the times in tt; that is, xx(ii,1) = y(tt(ii)). We use this knowledge in the plot command that comes next. Similarly, the second column of xx holds the values of v(t) = dy(t)/dt at the times in tt.
A reasonable question for the reader to ask is “How do you know which column in xx goes with which variable in the differential equations?” The answer is that I wrote the routine odeExample.m, and so I know that that routine interprets the state vector as x(t) = [y(t); v(t)], so row ii of the array xx returned by ode45 holds the value of x(t) at time tt(ii). (It doesn’t matter that odeExample.m uses column vectors; ode45 will output its data in this format.)
Another reasonable question is “why do you use variable names like tt and xx instead
of t and x.” The answer is that it makes it a lot easier to debug programs and m-files.
If I search for the letter t in odeExampleMain.m I will find it in many places, including
in the comment on the first line. However, if I search for tt instead, I will (usually)
only find the variable that I’m looking for.
5. plot(tt,xx);
This line opens a graphic window and plots the data on the screen, xx as a function of
tt. This is a “lazy” way to generate the plot. I could also have plotted one waveform
at a time by typing
plot(tt,xx(:,1),'-', tt,xx(:,2),'-');
The notation xx(:,1) means “the first column of xx” and xx(:,2) means “the second
column of xx.” The ’-’ is a line style command. You can learn more about line styles
by typing in help plot at the MATLAB prompt.
6. grid on
The plot command only puts the wave forms on the graph. This command puts up
“graph paper” on the window to make the plot easier to read.
7. legend('y','v = dy/dt');
8. xlabel(’time (s)’)
Always label your axes. If possible, include units. In fact, it’s a good idea to include
units in your m-file (and C-program) variable names.18
The legend command puts a legend in the graph window so that you know which color
(or line style) goes with which variable. The xlabel command labels the x axis. There
is also a ylabel command that you can use instead of legend if you’re only plotting
one waveform in a window.
9. print -depsc odeExampleMain.eps
This command is used to store the plot in color .eps format so that we can include it
in this manual. The resulting plot is shown in Figure 58.
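The whole ode45 workflow above can be mirrored in Python. This sketch uses a fixed-step classical Runge-Kutta integrator rather than ode45's adaptive pair, which is close enough for illustration (the time grid and initial conditions match the m-files):

```python
import numpy as np

def f(t, x):
    """State derivative for y'' = -2*y' - 2*y + sin(t), with x = [y, v], v = dy/dt."""
    y, v = x
    return np.array([v, -2.0 * v - 2.0 * y + np.sin(t)])

tspan = np.linspace(0.0, 15.0, 100)    # same time grid as tspan in the m-file
xx = np.zeros((len(tspan), 2))
xx[0] = [0.0, 0.0]                     # x0 = [0; 0]
x = xx[0].copy()
for i in range(len(tspan) - 1):
    t = tspan[i]
    h = tspan[i + 1] - t
    # one classical 4th-order Runge-Kutta step
    k1 = f(t, x)
    k2 = f(t + h / 2, x + h / 2 * k1)
    k3 = f(t + h / 2, x + h / 2 * k2)
    k4 = f(t + h, x + h * k3)
    x = x + (h / 6) * (k1 + 2 * k2 + 2 * k3 + k4)
    xx[i + 1] = x
```

As in Figure 58, the damped, sinusoidally driven response stays well inside ±0.5 once the transient dies out.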
Remark 3.4 This manual was written in LATEX, a free mathematical typesetting language (not quite a word processor) that is included with Linux, is installed on the Engineering Sun Network, and can be downloaded with the cygwin environment on a Windows machine or installed with fink on Macintosh. Microsoft Word users should print to jpg files or some other image format. Type help print at the MATLAB prompt for more options.
18 While working on leave at NASA, the writer of this chapter once spent a month comparing two simulations that did not match because one simulation used degrees for angles and the other used radians, but both used the same data tables. The problem was caught and fixed, and a lesson learned. Label your variables/axes and document your code!
[Figure 58: y and v = dy/dt vs. time (s).]
2. Write the differential equation(s) describing the behavior of the circuit in Figure 48(b).
Derive by hand the output vo (t) for the circuit in Figure 48(b) given that vo (0) = 0V.
3. Write the differential equation(s) describing the behavior of the circuit in Figure 48(c).
Derive by hand the output vo (t) for the circuit in Figure 48(c) given that vo (0) = 0V
and that i1 (0) = 0A.
(b) at the MATLAB prompt, type in edit sqfour7. This will open a new editor
window. Type in the m-file in Figure 54 and then modify it to plot a 7th order
Fourier series approximation of a square wave. Use the MATLAB command help
print to see how to print the plot to a printer instead of storing it in a file.
(Alternatively, you can save the plot as a .jpg file and import it into MS Word.)
5. Use MATLAB to calculate the voltages and currents in the circuit shown in Figure
48(a).
6. Use MATLAB to simulate the capacitor voltage in Figure 48(b) for 2 seconds. Turn
in your m-file(s) and a plot of the capacitor voltage.
7. Use MATLAB to simulate the capacitor voltage and inductor current in Figure 48(c)
for 2 seconds. Turn in your m-file(s) and a plot of the capacitor voltage and inductor
current.
W2 = W2 - eta*DW2;
W1 = W1 - eta*DW1;
if(iter == maxIter)
print -depsc backPropSig_6.eps
end
figure(8);
plot( ...
xp,yp(:,1),pcolor{1}, xp,yp(:,2),pcolor{2}, ...
xp,yp(:,3),pcolor{3}, xp,yp(:,4),pcolor{4}, ...
xp,yp(:,5),pcolor{5}, xp,yp(:,6),pcolor{6}, ...
xp,yp(:,7),pcolor{7}, xp,yp(:,8),pcolor{8}, ...
xp,yp(:,9),pcolor{9}, ...
-5:4,W2,'o');
text(-5,4.0,sprintf('iteration %d',iter));
text(-5,3.6,sprintf('Blue : top border'));
text(-5,3.2,sprintf('Green: left border'));
text(-5,2.8,sprintf('Red  : right border'));
text(-5,2.4,sprintf('Cyan : bottom border'));
for ip = 1:nh
% label line at far left side
idx = max(find(abs(yp(:,ip)) < 4.8) );
text(xp(idx),yp(idx,ip),sprintf('line %d',ip));
end
for ip = 0:nh
text(ip-5,W2(ip+1)+0.2,sprintf('W2(%d)',ip));
end
% x-axis labels
for lp = -6:2:4
text(lp,-4.5,sprintf('%d',lp));
end
% y-axis labels
for lp = -5:1:5
text(-6.1,lp,sprintf('%d',lp));
end
xlabel('x1 value');
ylabel('x2 value/W2');
title('Layer 1 linear separation boundaries');
grid on;
axis([-5,5,-5,5]);
axis('equal');
A = getframe; mov2 = addframe(mov2,A);
if(iter == maxIter)
print -depsc backPropSig_8.eps
figure(7); plot(Errv); grid on;
title(sprintf('Error function per backprop step'))
print -depsc backPropSig_7.eps
end
end
mov = close(mov);
mov2 = close(mov2);
nx = size(W1,2)-1;
ny = size(W2,1);
N = size(inData,1);
x = reshape(x,length(x),1);
v1 = w1*[1;x]; % append bias
h = phiT(v1,a,b);
function plotErr(d,X,W,tstr)
% plot error of current fit
nn = 1:length(d);
fn = figure;
plot(nn,d,'x', nn,W*X,'+', nn,d-W*X,'o');
xlabel('sample number');
legend('desired output','actual output','error');
grid on;
title(tstr);
eval(sprintf('print -depsc sysIdEx1%.4d.eps',fn));
return
References
[Hay99] Simon Haykin. Neural Networks: A Comprehensive Foundation. Prentice Hall, 2nd
edition, 1999.
[Lue69] D. G. Luenberger. Optimization by Vector Space Methods. Wiley and Sons, Inc., New York, NY, 1969.
[MP43] W. S. McCulloch and W. Pitts. A logical calculus of the ideas immanent in nervous activity. Bulletin of Mathematical Biophysics, 5:115–133, 1943.
[Ros58] F. Rosenblatt. The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65:386–408, 1958.
[SN96] Gilbert Strang and Truong Nguyen. Wavelets and Filter Banks. Wellesley-Cambridge Press, Wellesley, MA, 1996.
[WH] B. Widrow and M. E. Hoff, Jr. Adaptive switching circuits. In IRE WESCON Convention Record, pages 96–104, 1960.