ELEC 6240: Neural Networks

ELEC 6240: Neural Networks
A. S. Hodel, Dept. ECE, Auburn University hodelas@auburn.edu

http://www.eng.auburn.edu/users/hodelas
ftp://ftp.eng.auburn.edu/pub/hodel
This notes set is in progress. This is Revision : 2003.10, November 19, 2003
1
CONTENTS Revision : 2003.10
Contents
1 Course overview 7
2 2003 08 20: Introduction 9
3 2003 08 22 Neuron models 19

3.1 Neuron models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.2 NN’s as directed graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.3 Feedback . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
4 2003 08 25 Network architectures 24

4.1 Network architectures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
4.2 Neural network: an interconnection of neurons . . . . . . . . . . . . . . . . . 24
5 2003 08 29 Knowledge representation 31
6 2003 09 03 Mex files in MATLAB 33
7 2003 09 05 More on MATLAB 36

7.1 Scalar functions of scalar variables . . . . . . . . . . . . . . . . . . . . . . . . 36
7.2 Functions of vectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
7.3 What about MATLAB and matrices? . . . . . . . . . . . . . . . . . . . . . . 39
8 2003 09 08 Learning processes 41

8.1 Error-correction learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
8.2 Memory Based learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
8.3 Hebbian learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
8.4 Competitive learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
8.5 Boltzmann learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
8.6 Credit assignment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
8.7 Learning with a teacher . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
8.8 Learning without a teacher . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
9 2003 09 10 46
9.1 Learning tasks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
9.2 Memory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
10 2003 09 12 Single layer networks 60

10.1 From last time: Linear associative memory (LAM) . . . . . . . . . . . . . . 60
10.2 Adaptation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
10.3 Performance issues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
10.4 Single layer perceptrons: “In the beginning ....” . . . . . . . . . . . . . . . . 62
10.5 Adaptive filtering interpretation . . . . . . . . . . . . . . . . . . . . . . . . . 62
Dept. ECE, Auburn Univ. 2 ELEC 6240/Hodel: Fa 2003

11 2003 09 15 Optimization and Neural Networks 64

11.1 What you should have seen in your linear algebra class . . . . . . . . . . . . 64
11.2 Unconstrained optimization techniques . . . . . . . . . . . . . . . . . . . . . 64
12 2003 09 17: Gradient based learning methods 67

12.1 Steepest Descent . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
12.2 Newton Iteration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
13 2003 09 19: More gradient based learning 82

13.1 A matlab m-file example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
13.2 Gauss Newton . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
13.3 Perceptrons . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
14 2003 09 22: Single Layer Perceptrons (conclusion) 90

14.1 The perceptron convergence theorem . . . . . . . . . . . . . . . . . . . . . . 90
14.2 Other activation functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
14.3 Exam 1 information . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
14.3.1 Topics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
14.3.2 Permitted resources . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
15 2003 09 24 Multi-layer perceptrons 94

15.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
15.2 Back-propagation algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
15.3 Output layer update . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
16 2003 09 26: Review of Homework 4. 99

16.1 After class comments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
16.2 Example in office . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
17 2003 09 29: More review for Exam 1 100
18 2003 10 01 Exam 1 101

18.1 The questions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
18.2 The answers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
19 2003 10 03 Multi-layered perceptrons: continued 107

19.1 Backpropagation algorithm: review of output layer . . . . . . . . . . . . . . 107
19.2 Backpropagation: Hidden layer update . . . . . . . . . . . . . . . . . . . . . 108
19.3 Exponential activation function . . . . . . . . . . . . . . . . . . . . . . . . . 109
19.4 tanh activation function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
20 2003 10 06: Single layer linear system i.d. example 113

20.1 Goal . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
20.1.1 Vehicle data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
20.1.2 Assumption . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
20.2 Solution method 1: direct solution . . . . . . . . . . . . . . . . . . . . . . . . 116

20.3 Solution method 2: Steepest descent iteration . . . . . . . . . . . . . . . . . 118

20.4 Solution method 3: Backpropagation . . . . . . . . . . . . . . . . . . . . . . 120
20.5 Summary of results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
21 2003 10 08: MLP 123

21.1 Derivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
21.1.1 Output layer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
21.1.2 Hidden layer(s) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
21.2 Example (not covered in class) . . . . . . . . . . . . . . . . . . . . . . . . . . 126
21.2.1 Utility m-files . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
21.2.2 C-language implementation . . . . . . . . . . . . . . . . . . . . . . . 127
21.3 Simple backprop without any preprocessing . . . . . . . . . . . . . . . . . . 132
22 2003 10 10: Techniques to improve training - 1 141

22.1 Homework 5 solution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
22.1.1 Derivations and plots . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
22.1.2 Source code: problem 1 . . . . . . . . . . . . . . . . . . . . . . . . . . 142
22.2 Techniques to improve training . . . . . . . . . . . . . . . . . . . . . . . . . 149
22.3 Activation function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149
22.4 Randomize sample order . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149

23.1 Present “worst case” data . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
23.2 Momentum term - generalized delta rule . . . . . . . . . . . . . . . . . . . . 150

24.1 Statistically normalize the data . . . . . . . . . . . . . . . . . . . . . . . . . 152
24.2 Selection of initial weights . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153
24.3 Adjust learning rates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154
25 2003 10 17 Example: training with the unit square 155

25.1 Another example: bad output normalization . . . . . . . . . . . . . . . . . . 166
25.2 Decision rules . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 170
25.3 Feature detection/hidden neurons . . . . . . . . . . . . . . . . . . . . . . . . 171
26 2003 10 20: Radial Basis Function Networks 173

26.1 Project proposal guidelines . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173
26.1.1 Undergraduates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173
26.1.2 Graduates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174
26.1.3 Some project ideas . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174
Homework 7 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 175
26.2 Separability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 176

27 2003 10 22: Radial Basis Functions (2) 183

27.1 M-file s-function examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . 187
27.1.1 Continuous time model . . . . . . . . . . . . . . . . . . . . . . . . . . 187
27.1.2 Discrete time model . . . . . . . . . . . . . . . . . . . . . . . . . . . 189
27.2 Lecture notes: handwritten today . . . . . . . . . . . . . . . . . . . . . . . . 192
28 2003 10 24; RBF’s (3) 193

28.1 Interpolation result . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 193
29 2003 10 27 RBF’s (4) 194

Homework 8 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 195
29.2 What functions to use? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 196
29.2.1 Tikhonov regularization . . . . . . . . . . . . . . . . . . . . . . . . . 196
29.2.2 Solution of the problem . . . . . . . . . . . . . . . . . . . . . . . . . 197
29.3 Training RBF networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 197
29.3.1 Selection of output (interpolation) weights w . . . . . . . . . . . . . . 200
29.3.2 Selection of centers ti . . . . . . . . . . . . . . . . . . . . . . . . . . . 201
29.3.3 Selection of covariance (spread) matrices Σ . . . . . . . . . . . . . . . 201
30 2003 10 31: RBF’s (cont’d) 206

31 2003 10 31 Principal Components Analysis 207
32 Exam 2 solutions 208

32.1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 209
32.2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 210
32.3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 211
32.4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 212
33 2003 11 05: PCA (2) 213
34 2003 11 07: PCA (3) 214

34.1 Single vector case . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 224
34.2 Multiple vector case . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 224
35 Self Organizing Maps 225
36 2003 11 19 Neurodynamic programming Example 226
37 Hopfield Networks 229
A Appendix: Review of linear algebra 230

A.1 Vector stack function vec(·) . . . . . . . . . . . . . . . . . . . . . . . . . . . 230

B Appendix: Review of C-programming syntax related to neural nets 230
C Appendix: Review of MATLAB syntax related to neural nets 230

C.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 230
C.1.1 Access to software . . . . . . . . . . . . . . . . . . . . . . . . . . . . 230
C.1.2 Software overview and tutorials . . . . . . . . . . . . . . . . . . . . . 231
C.2 Mathematical preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . 231
C.3 MATLAB basics: similarities to C . . . . . . . . . . . . . . . . . . . . . . . . 232
C.3.1 Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 232
C.3.2 Differences from C . . . . . . . . . . . . . . . . . . . . . . . . . . . . 233
C.3.3 Graphical (plotting) concepts in MATLAB . . . . . . . . . . . . . . . 236
C.3.4 Circuits problems in MATLAB . . . . . . . . . . . . . . . . . . . . . 236
C.3.5 Differential equations in MATLAB . . . . . . . . . . . . . . . . . . . 239
C.4 Things to try on your own . . . . . . . . . . . . . . . . . . . . . . . . . . . . 244
D Appendix: source code 245

D.1 Utility m-files not listed elsewhere . . . . . . . . . . . . . . . . . . . . . . . . 245
References 250
Index 251

Revision : 2003.10
1 Course overview
Instructor A. S. Hodel hodelas@eng.auburn.edu AIM screen name ELEC 6240 1 .
Web page http://www.eng.auburn.edu/users/hodelas. Office hours: MW 3-4pm
or by appt.
Grader Adam T. Simmons simmoat@eng.auburn.edu. No office hours.
Textbook Simon Haykin. Neural Networks: A Comprehensive Foundation. Prentice Hall,

2nd edition, 1999
[Hay99] Simon Haykin. Neural Networks: A Comprehensive Foundation. Prentice Hall,

2nd edition, 1999
Grades Grades are assigned on a 10% scale. You may earn points from 2 Hour exams (50
pts ea), 1 Course project (50 pts), Homework (50 pts), 1 Final exam (100 pts), for a
total of 300 points.
Special needs Students who need special accommodations should make an appointment
to discuss their needs as soon as possible.
Class resources other class resources (notes, m-files, etc.) will be made available at
ftp://ftp.eng.auburn.edu/pub/hodel/6240
Projects Will be described later in the semester. Project final reports will be due during
the final week of the semester; a precise due date will be announced later. Oral
presentations will be made to the class during the final two weeks of the course.
Topics Covered in the course:
1. Learning processes
2. Perceptrons: single layer and multi-layer
3. Radial basis function neural networks
4. Principal components analysis
5. self-organizing maps
6. Neurodynamics
7. Neural network applications
Resources MATLAB is available on the engineering network by either (1) Use of Windows
PC labs in Broun 128, etc., (2) Sun workstations, or (3) remote log-in (ssh) to
gate.eng.auburn.edu
1
Please identify yourself by name when you message me.

Revision : 2003.10
from off-campus. Several tutorials for MATLAB are available on the net. 2 A brief
review of MATLAB will be given in this course, but students are expected to be
familiar with MATLAB from the prerequisite course.
Homework software Grading of software will be done in a batch-run fashion. This will
require that students set up a folder on their engineering account that the instructor
can access (read/execute privileges) from the engineering network (sun workstation).
Evidence of copying of software will be grounds for a zero grade on the homework
assignment in question.
Remark This is a 6000 level course (senior undergraduate/graduate level course). Students
will be expected to have a corresponding level of mathematical/conceptual maturity.
C-language programming (COMP1200) will be essential. The instructor makes no
commitment to provide support for other compiled languages in this course.
Note Homework assignments in this class for Fall 2003 should be done using MATLAB 6.5
(Release 13). For historical reasons, many software examples in these notes were done using
octave3 , a program similar to (but not identical to) MATLAB that is available at no cost.
Octave is not currently available on the Engineering network; if anyone wants to volunteer
to help to get octave installed, please let me know.
2
See, e.g., p. 37 of the ELEC 2020 manual,
3
http://www.octave.org

Revision : 2003.10
2 2003 08 20: Introduction

Read Simon Haykin. Neural Networks: A Comprehensive Foundation. Prentice Hall, 2nd
edition, 1999 [Hay99] Chapter 1.
Applications: optimization, classification, function approximation

Why neural networks?
• Nonlinear
More flexibility in representation of systems than in, e.g., transfer functions (LTI).
• input-output mapping
Does not require a “first principles” physical model of behavior/system being learned.
• adaptivity (retraining)
Can adjust its “synaptic weights” in response to changes in operating environment.
Question 2.1 How to ensure that adaptation “makes sense?”
• evidential response: classification and confidence of classification both available

Can provide both response (classification) and confidence level.
• fault tolerance (in hardware form)
• VLSI implementation
• neurobiological analogy
Retina: preprocessing (compression) of visual information before it is sent to brain for
processing.
Fundamental building block: the neuron

Revision : 2003.10
Homework 1
Read Simon Haykin. Neural Networks: A Comprehensive Foundation. Prentice Hall,
2nd edition, 1999 [Hay99] Chapter 1.
Due Fri Aug. 29. Hwk
1. Problem 1.1 in [Hay99], p. 45.
2. Problem 1.2 in [Hay99], p. 46.
3. Problem 1.9 in [Hay99], p. 47.
4. Design a 2 layer neural network with a threshold activation function (eqn (1.8), p. 12
in [Hay99]) to identify the region (x, y) ∈ [0, 1] × [0, 1].
Approach: hidden layer neurons are “feature detectors,” but can only split space into
two halves by dividing in a plane (see the figures that follow showing hidden layer values
as a function of x and y). Use one hidden layer neuron for each of the sides of the
square. Then combine them together. We can’t do an “and” operation naturally, so we
have to do it as a sum with a threshold and a bias just below the sum you’d get with
all hidden neurons firing.
Write an m-file main.m that plots the output of your ANN over the above domain. Put
this in a folder called 6240H1 your home directory on the engineering network (the H:
drive for Windoze enthusiasts) and set permissions so that anyone can read the file. 4
Solution
1. Sigmoid derivative
dφ (·) · 0 − (−ae−av ) a + ae−av − a

= =
dv (1 + e−av )2 (1 + e−av )2
a(1 + e−av ) − a a((1 + e−av ) − 1)
= =
(1 + e−av )2 (1 + e−av )2
a 1 − (1 + e−av )
= −av
· −av
= aφ(v)(1 − φ(v))
1 + e 1 + e
d
φ(v) = a/4
dv v=0
2. Derivation done and checked in MATLAB:
M-file 2.1 hw0102chk.m

4
This problem is assigned in part to check the process for electronic paperless submission of your software
and to test my plagiarism detection routines.

Revision : 2003.10
% homework 1 check.
xx = linspace(-5,5,1000);
a = 2;
eav = exp(-a*xx);
pp = (1 - eav) ./ (1 + eav);
dp = (a/2)*(1 - pp .^ 2);
dp2 = a * ( (1+eav).*(eav) + (1-eav).*(eav)) ...

./ ( (1+eav).^2);
dp2 = (a/2) * ( 2*(1+eav).*(eav) + 2*(1-eav).*(eav)) ...
./ ( (1+eav).^2);
dp2 = (a/2) * ( 2*(eav+eav.^2) + 2*(eav-eav.^2)) ...
./ ( (1+eav).^2);
dp2 = (a/2) * ( 4*eav )./ ( (1+eav).^2);
dp2 = (a/2) * ( 4*eav + (1 + eav).^2 - 1 - 2*eav - eav.^2 ) ...
./ ( (1+eav).^2);
dp2 = (a/2) * ( (1 + eav).^2 - 1 + 2*eav - eav.^2 ) ...
./ ( (1+eav).^2);
% and the result follows from here.
plot(xx,dp,’-’,xx,dp2,’-’);
3. hw, xi = 10 · 0.8 − 20 · 0.2 + 4 · (−1) − 2 · (−0.9) = −1.8. (a) linear output is -1.8 (or
a · 1.8 if the activation function has a linear slope parameter). (b) 0.
4. Software is in m-file square1.m. Results in Figures 1–6.
M-file 2.2 square1.m
xx = linspace(-1,2,30); yy = linspace(-1,2,30);
% format of rows 1st weight matrix:
% bias weight, x weight, y weight
W1 = [0 1 0; ... % right of y axis (x = 0)
0 0 1; ... % above x axis (y = 0)
1 -1 0; ... % left of line x = 1
1 0 -1]; % below line y = 1;
% hidden layer neuron outputs: h1, h2, h3, h4

% format of 2nd weight matrix: [bias, h1, h2, h3, h4]
W2 = [-0.99, 1/4, 1/4, 1/4, 1/4];
for ii = 1:length(xx)
for jj = 1:length(yy)
inputVec = [1; xx(ii); yy(jj)];
% vector of hidden node values

Revision : 2003.10
% without double(...) the next line creates a

% logical variable type (try "whos" at the MATLAB
% prompt.
% MATLAB requires you to convert it to double
% precision for mesh plotting purposes. Annoying;
% Octave does not have this requirement.
hv = double(( (W1 * inputVec) > 0));
% store hidden neuron values for plotting

h1(ii,jj) = hv(1);
h2(ii,jj) = hv(2);
h3(ii,jj) = hv(3);
h4(ii,jj) = hv(4);
% include bias term to get final output value

zz(ii,jj) = double( (W2*[1;hv]) > 0);
% this is the output without the threshhold

zprime(ii,jj) = W2*[1;hv];
end
end
fn = 0;
fn = fn+1; figure(fn);
mesh(yy,xx,h1);
title(’Hidden layer node 1 values’);
xlabel(’y’); ylabel(’x’);
eval(sprintf(’print -depsc square1_%d.eps’,fn));
fn=fn+1; figure(fn);
mesh(yy,xx,h2);
mesh(yy,xx,h3);
mesh(yy,xx,h4);

Revision : 2003.10
Hidden layer node 1 values
0.8
0.6
0.4
0.2
0
2
1.5
2
1 1.5
0.5 1
0 0.5
0
−0.5 −0.5
−1 −1
x
y
Figure 1: Homework 1: output is 1 if x > 0

mesh(yy,xx,zprime);
title(’Neural network (no threshhold on output)’);
mesh(yy,xx,zz);
title(’Neural network output’);

Revision : 2003.10
0.8
0.6
0.4
0.2
0
2
1.5
2
1 1.5
0.5 1
0 0.5
0
−0.5 −0.5
−1 −1
x
y
Figure 2: output is 1 if y > 0

Revision : 2003.10
0.8
0.6
0.4
0.2
0
2
1.5
2
1 1.5
0.5 1
0 0.5
0
−0.5 −0.5
−1 −1
x
y
Figure 3: output is 1 if x < 1

Revision : 2003.10
0.8
0.6
0.4
0.2
0
2
1.5
2
1 1.5
0.5 1
0 0.5
0
−0.5 −0.5
−1 −1
x
y
Figure 4: output is 1 if y < 1

Revision : 2003.10
Neural network (no threshhold on output)
0.2
0.1
−0.1
−0.2
−0.3
−0.4
−0.5
2
1.5
2
1 1.5
0.5 1
0 0.5
0
−0.5 −0.5
−1 −1
x
y
Figure 5: second layer output: value is positive only for (x, y) ∈ [0, 1] × [0, 1].

Revision : 2003.10
Neural network output
0.8
0.6
0.4
0.2
0
2
1.5
2
1 1.5
0.5 1
0 0.5
0
−0.5 −0.5
−1 −1
x
y
Figure 6: final result: output is 1 if (x, y) is in [0, 1] × [0, 1].

Revision : 2003.10
3 2003 08 22 Neuron models

3.1 Neuron models
Read [Hay99] §1.3
y = φ(w T x) = w1 x1 + · · · + wm xm = hw, xi
T
∈ IRm

x = x1 · · · x m
T
w = w1 · · · wm
I prefer to write w T x so that w and
x are both column vectors.
y ∈ IR scalar (for now; this will change to y ∈ IRp )
Interpretation w T x = “how much does x look like w?”

• w (here is a vector, later will be a matrix) is an internal representation of a data feature
we’re interested in
• w T x > 0 and is big if x points in “the same direction” as w; is negative and big if x
points in the opposite direction (think of correlation coefficient in random variables).
• use activation function φ to “normalize” the inner product w T x of w and x.
1. summation
2. bias vector y = φ(w T x + b. Can embed x0 = 1, w0 = b to keep y = φ(w T x).
3. activation function
5
(a) Threshold,
a.k.a “McCullough-Pitts” (theoretical limits discussed in [MP88] )
1 v≥0
φ(v) = Also vector (output) form.
0 v<0

 1 v ≥ 1/2
(b) piecewise linear φ(v) = v+1/2 |v| < 1/2 notice misprint from text
0 v < −1/2

1
(c) sigmoid φ(v) = approaches threshold as a → ∞
1 + e−av
(d) stochastic (not much done in this course due to prerequisites)
P (x = 1) = P (v) = sigmoid. P (x = 0) = 1−P (v). “random process” 1/(1+ev/T )
where T is a pseudo-temperature.
5
M. Minsky and S. Papert. Perceptrons: An Introduction to Computational Geometry. MIT Press, 1988.
Expanded Edition

3.1 Neuron models Revision : 2003.10
Example 3.1 Activation function: single neuron in MATLAB. See Figure 7. Source code:
M-file 3.1 activation.m
xx = linspace(-4,4,1000); % 1000 linearly spaced points

linThresh = xx; % linear threshhold function
Thresh = ( xx >= 0.5 ); % hard threshhold function
sigmoid = 1 ./ ( 1+ exp(-2.0*xx) ); % sigmoid
tanSig1 = tanh(xx+0); % no offset
tanSig2 = tanh(xx+2); % constant offset
plot(xx,linThresh,’-’, xx, Thresh,’-’, xx, sigmoid, ’-’, ...
xx, tanSig1, ’-’, xx,tanSig2,’-’);
grid on
title(’Artificial Neuron Activation Function Examples’);
xlabel(’Neuron input’)
legend(’linear threshhold’,’Threshhold (at 0.5)’, ...
’sigmoid (alpha=2)’, ’tanh(x)’,’tanh(x+2)’);
print -depsc activation.eps

3.1 Neuron models Revision : 2003.10
Artificial Neuron Activation Function Examples

4
linear threshhold
Threshhold (at 0.5)
sigmoid (alpha=2)
3 tanh(x)
tanh(x+2)
−1
−2
−3
−4
−4 −3 −2 −1 0 1 2 3 4
Neuron input
Figure 7: For example 3.1

3.2 NN’s as directed graphs Revision : 2003.10
3.2 NN’s as directed graphs

Notation: text uses scalar notation
X
k =φ wkj xj
or signal flow graphs

wkj
xj yk = wkj xj
φ
xj yk = φ(xj )
P
wk0 yk = φ m
j=0 wkj xj
x0
wk1
x1 ..
.
wkm
..
.
xm
Alternative: matrix notation (similar to MATLAB):
y = φ(w T x)
In MATLAB: y = phi ( w’ * x ); . Need to write the function as an m-file phi.m or as

a mex-file phi.mexXYZ or phi.dll. Choice:
• m-files are easy to write, debug, but are slow.
• mex-files are slower to write, more difficult to debug, but are 5-10x faster.
Conclusion Develop as m-file, translate to C (or other language) as needed.

3.3 Feedback Revision : 2003.10
3.3 Feedback
Read §1.5
• operator loops
x0j (n)
A
xj (n) yk (n)
A
yk (n) = x (n)
1−AB j
• unit delay → transfer function
• IIR filters
1st order signal flow graph
x0j (n)
w
xj (n) yk (n)
z −1
• Analysis in neural networks: nonlinear! (graduate topic)
• “Process” - time becomes a variable.
Introduce idea of Z transform without details. (Stability, convergence)

Revision : 2003.10
4 2003 08 25 Network architectures

4.1 Network architectures
Read §1.6
4.2 Neural network: an interconnection of neurons

1. Single layer feed forward (acyclic connections)
P
wk0 yk = φ m
j=0 wkj xj
x0
wk1
x1 ..
.
wkm
..
.
xm
Example 4.1 Single layer network with three neurons (outputs) and two inputs.
 
T 1
 φ1 w 1
x 
  
y1 
1  T 1 
y =  y2  = φ W = φ 2 w 2

x  x 
y3  
 1 
φ3 w 3 T
x
Could also write as y = φ(W̄2 φ(W̄1 x + b1 ) + b2 ) for bias vectors b1 , b2 . We will select
T T
w1 = w 2 = w 3 = b w 1 w2 = 1 1 1 .
(b =bias), and φ1 = threshold (McCullough-Pitts), φ2 =sigmoid (a = 1), and φ3 =

tanh(·). M-file and output follow:

4.2 Neural network: an interconnection of neurons Revision : 2003.10
M-file 4.1 neuronEx1.m
n1 = 40; x1 = linspace(-2,2,n1);
n2 = 45; x2 = linspace(-2,2,n2);
% pre-allocate memory (can help speed sometimes)

y0 = zeros(n1,n2); y1 = y0; y2 = y0;
ww = [1;-1;1];
for ii =1:n1
for jj =1:n2
xx = [1;x1(ii);x2(jj)];
y0(ii,jj) = double( ww’ * xx > 0 );
y1(ii,jj) = 1/(1+exp(-ww’ *xx));
y2(ii,jj) = tanh(ww’*xx);
end
end
subplot(2,2,1); meshc(x1,x2,y0’);
xlabel(’input x_1’); ylabel(’input x_2’); zlabel(’NN output’);
title(’McCullough-Pitts activation function’);
grid on
title(’sigmoid activation function’);
grid on
title(’tanh activation function’);
grid on
orient tall
print -depsc neuronEx1.eps

McCullough−Pitts activation function sigmoid activation function
1 1
0.8 0.8
NN output
NN output
0.6 0.6
0.4 0.4
0.2 0.2
0 0
2 2
1 2 1 2
0 0
0 0
−1 −1
input x −2 input x input x −2 input x
2 −2 1 2 −2 1
tanh activation function
0.5
NN output
−0.5
−1
2
1 2
0
0
−1
input x2 −2 input x1
−2

2. Multi-layer feed forward (hidden layers)
Example 4.2 Two-layer network with all activation function φ =threshold, two in-
puts, one output, and three “hidden” nodes:
  
1
y = φ W 2  1  = φ(W2 φ1 (W1 x̄))
φ W1
x
where x̄ and φ1 include the bias term 1 in their vectors.

M-file 4.2 neuronEx2.m
% two layer network with 3 hidden neurons

n1 = 40; x1 = linspace(-2,2,n1);
n2 = 45; x2 = linspace(-2,2,n2);
% pre-allocate memory (can help speed sometimes)

h1 = zeros(n1,n2); h2 = h1; h3 = h1; yy=h1;
W1 = [0,1,0; ... % zero bias
0,0,1; ... % zero bias
1,-1,-1]; % unit bias
W2 = [-2.8,1,1,1]; % bias of 2.8 (slightly less than 3 - but why?)
for ii =1:n1
for jj =1:n2
xx = [1;x1(ii);x2(jj)];
h0 = double ( W1*xx > 0 );
hh = [1;h0];
yy(ii,jj) = double( W2*hh > 0);
% store temp values for plotting

h1(ii,jj) = h0(1);
h2(ii,jj) = h0(2);
h3(ii,jj) = h0(3);
end
end
subplot(2,2,1); meshc(x1,x2,h1’);
xlabel(’input x_1’); ylabel(’input x_2’); zlabel(’h1 output’);
title(’Hidden Neuron 1 output’);
grid on
grid on
grid on
subplot(2,2,4); meshc(x1,x2,yy’);
title(’Two-layer ANN example output’);

grid on
orient tall
print -depsc neuronEx2.eps
Hidden Neuron 1 output Hidden Neuron 2 output
1 1
0.8 0.8
0.6 0.6
h1 output
h2 output
0.4 0.4
0.2 0.2
0 0
2 2
1 2 1 2
0 0
0 0
−1 −1
input x2 −2 input x1 input x2 −2 input x1
−2 −2
Hidden Neuron 3 output Two−layer ANN example output
1 1
0.8 0.8
NN output
0.6 0.6
h3 output
0.4 0.4
0.2 0.2
0 0
2 2
1 2 1 2
0 0
0 0
−1 −1
input x2 −2 input x1 input x2 −2 input x1
−2 −2

3. Recurrent Fig 1.17 has no hidden layers, no self feedback.

Revision : 2003.10
5 2003 08 29 Knowledge representation

Read [Hay99] §1.7
Definition 5.1 (Fischler & Firschein, 1987) Knowledge refers to stored information or mod-
els used by a person or machine to interpret, predict, and appropriately respond to the outside
world.
Knowledge representation:
• What information is explicit?
• how is it encoded?
We often ask an ANN to “learn” an environment. Must have (or present) lots of data (“prior
information”).
• labeled: (x0 , d0 ).
• unlabeled (xi ) (self-organizing)
Process:
• training: subset of data; real and fake (passive sonar) data
• testing (complete set)
Remark 5.1 Training allows the data (and the training algorithm) to organize and represent
input data. Other forms of pattern classification often require the designer to organize the
data for classification and representation.
Linear algebra concepts For knowledge representation.
• vectors
• dot products
• norms (vector/error)
For unit vectors
kxi − xj k2 = (xi − xj )T (xi − xj ) = 2 − 2xi T xj
• Expectation/ mean value

Revision : 2003.10
h i
T
• covariance Σ = E (xi − µi )(xi − µi )
Assume xi , xj have identical covariance. Then
d2ij = kxi − xj k2 = (xi − µi )T Σ−1 (xi − µi )
dot product/recognition
Ideas: (Rules)
1. Similar items should have similar representations.
2. Items from different classes should have dissimilar representations.
3. Important features should have many neurons devoted to them - high probability of
detection, low probability of false alarm.
Neyman-Person: max Prob(detect) subject to P(false alarm) < γ
4. Incorporate prior knowledge and “invariances” into network design.
• restrict network architecture

• weight sharing
Both techniques constrain the weight coefficient matrix W .
Question 5.1 So, given lots of data {(xi , di )}N

i=1 , how do we train a network to “learn?”

Revision : 2003.10
6 2003 09 03 Mex files in MATLAB

C-file 6.1 mextanh.c
/*=================================================================
* Example Hyperbolic Tangent MEX function
* Adam Simmons, GTA, Neural Networks
*=================================================================*/
#include <math.h>
#include "mex.h"
/* If you have a C File written outside of Matlab’s Mex File format, you can
* place it here were tanh1 is and write a wrapper for it below (mexFunction)
*/
static void
tanh1 (double yout[], double xin[])
{
yout[0] = tanh (xin[0]);
return;
}
/* wrapper for the c code above (tanh1) */

void
mexFunction (int nlhs, mxArray * plhs[], int nrhs, const mxArray * prhs[])
{
double *yout, *xin;
/* Check for proper number of arguments */

if (nrhs != 1)
mexErrMsgTxt ("One input argument required.");
else if (nlhs > 1)
mexErrMsgTxt ("Too many output arguments.");
/* Create a scalar (1 x 1 matrix) for the return argument */

plhs[0] = mxCreateDoubleMatrix (1, 1, mxREAL);
/* Assign pointers to the various parameters */

yout = mxGetPr (plhs[0]);
xin = mxGetPr (prhs[0]);
/* Do the actual computations in a subroutine */

tanh1 (yout, xin);
}

Revision : 2003.10
C-file 6.2 simpletanh.c
#include "mex.h"
#include "math.h"
/* mxArray is a struct defined in mex.h

* an mxArray variable A contains (among other things):
* (1) a pointer to the first element of the matrix A
* (2) the dimensions (rows and columns) of the matrix A
* Get the pointer to the data with myMatrix = mxGetPr(&A);
* (Notice: it needs the ADDRESS of A! Pass a pointer!)
* Get the dimensions with
* int nrows = mxGetM(&A);
* int ncols = mxGetN(&A);
*
* MATLAB passes its arguments in an array of pointers, so
* if you call this with
* [a,b,c] = myfunc(d,e,f), then
* prhs[0] is a pointer to d,
* prhs[1] is a pointer to 3,
* prhs[2] is a pointer to f.
*
* you have to CREATE the output arguments with mxCreateDoubleArray
* as shown below.
*/
void mexFunction(int nlhs, mxArray *plhs[],
int nrhs, const mxArray *prhs[])
{
double *y,*x;
plhs[0] = mxCreateDoubleMatrix(1,1,mxREAL);
x = mxGetPr(prhs[0]);
y = mxGetPr(plhs[0]);
y[0] = tanh(x[0]);
}

Revision : 2003.10
Hwk
Homework 2 Mex file example Handed out by Simmons in class; solutions shown in lecture
notes. Everyone did well.

Revision : 2003.10
7 2003 09 05 More on MATLAB

7.1 Scalar functions of scalar variables
Shown last time
7.2 Functions of vectors

How do we implement  
φ(x1 )
φ(x) =  ... ?
 
φ(xn )
In MATLAB:
M-file 7.1 mexExV.m
% mexExV.m: call mex file with a vector input, get a vector output
if( exist(’phiExV’) ~= 3 ) % see if the mex file is there
mex phiExV.c
end
xx = (-10:0.1:10)’;
yy = phiExV(xx);
plot(xx,yy);
grid on
xlabel(’activation function input’);
ylabel(’activation function output’);
print -depsc mexExV.eps
Results:

7.2 Functions of vectors Revision : 2003.10
0.8
0.6
activation function output 0.4
0.2
−0.2
−0.4
−0.6
−0.8
−1
−10 −8 −6 −4 −2 0 2 4 6 8 10
activation function input

7.2 Functions of vectors Revision : 2003.10
C-file 7.3 phiExV.c

#include <math.h>
#include "mex.h"
void phi(double *yy, const double * xx, const int len )

{
int ii;
for ( ii = 0 ; ii < len ; ii++)
yy[ii] = tanh(xx[ii]);
}
void mexFunction( int nlhs, mxArray *plhs[],

int nrhs, const mxArray*prhs[] )
{
char errMsg[1000];
double *xx, *yy;
int nrows, ncols;
/* Argument checking - always a good thing! */

if (nrhs != 1)
{
sprintf(errMsg,"Received %d args, need 1", nrhs);
mexErrMsgTxt(errMsg);
return;
}
else if (nlhs > 1)
{
sprintf(errMsg,"%d outputs requested, max is 1", nlhs);
return;
}
nrows = mxGetM(prhs[0]);
ncols = mxGetN(prhs[0]);
if(nrows < 1 || ncols != 1 )
{
sprintf(errMsg,"input (%dx%d), need (Nx1)\n",nrows, ncols);
return;
}
plhs[0] = mxCreateDoubleMatrix(nrows, ncols, mxREAL);
xx = mxGetPr(prhs[0]); /* pointer to data in prhs[0] */
yy = mxGetPr(plhs[0]); /* pointer to data in plhs[0] */
phi(yy,xx,nrows); /* do the work */
}

7.3 What about MATLAB and matrices? Revision : 2003.10
7.3 What about MATLAB and matrices?

MATLAB was originally written in FORTRAN, and so uses column major ordering. That
is, the matrix  
1 2 3
A= 4 5 6 
7 8 10
is stored in memory as
1
4
7
4
..
.
6
10
Example 7.1 Print out a matrix in a Mex file
M-file 7.2 mexPrintMatEx.m
if( exist( ’mexPrintMat’ ) ~= 3 )

mex mexPrintMat.c
end
A = [1, 2, 3; 4, 5, 6; 7, 8, 10];
mexPrintMat(A);
Results (on a Macintosh):
>> mexPrintMatEx
/Applications/MATLAB6p5
-L/Applications/MATLAB6p5/bin/Undetermined -lmx -lmex -lmat
/Applications/MATLAB6p5
-L/Applications/MATLAB6p5/bin/mac -lmx -lmex -lmat
mex link phase: cc -O -bundle -Wl,-flat_namespace -undefined suppress
-o mexPrintMat.mexmac mexPrintMat.o mexversion.o
-L/Applications/MATLAB6p5/bin/mac -lmx -lmex -lmat
(3 x 3) =
1.0000e+00 2.0000e+00 3.0000e+00
4.0000e+00 5.0000e+00 6.0000e+00
7.0000e+00 8.0000e+00 1.0000e+01
C-code on next page

7.3 What about MATLAB and matrices? Revision : 2003.10
C-file 7.4 mexPrintMat.c
/* mexPrintMat.c: print out a matrix in a mex file */

#include <math.h>
#include "mex.h"

{
char errMsg[1000];
double *xx, *yy;
int nrows, ncols, ii, jj;
if (nrhs != 1)
{
return;
}
else if (nlhs > 0)
{
return;
}
nrows = mxGetM(prhs[0]);
ncols = mxGetN(prhs[0]);
xx = mxGetPr(prhs[0]);
mexPrintf("(%d x %d) = \n",nrows,ncols);
for( ii = 0 ; ii < nrows ; ii++)
{
for ( jj = 0 ; jj < ncols ; jj++ )
mexPrintf("%12.4e ",xx[ii + jj*nrows]);
mexPrintf("\n");
}
}
Remark 7.1 Danger! Stray pointers and bad subscripts are great at making mex files
crash.

Revision : 2003.10
8 2003 09 08 Learning processes

Read Simon Haykin. Neural Networks: A Comprehensive Foundation. Prentice Hall, 2nd
edition, 1999 Ch. 2
Ad hoc procedures that seem to work.
8.1 Error-correction learning

Read §2.2
From end of Ch 1: knowledge representation, feature extraction, classification.

∆
Approach: input vector process x(n), desired output vector d(n), error e(n) = d(n)−y(n).
Scalar values: ek (n), dk (n), yk (n).
Remark 8.1 stupidly profound observation ek (n) > 0 =⇒ we want yk to increase,
and ek (n) < 0 =⇒ we want yk to decrease.
Use error e to adjust parameters (weights) of neural network.
Idea iterative “optimization:” define

1
E(n) = e(n)T e(n) vector form
2
or
1
Ek (n) = ek (n)2 scalar form
2
“instantaneous error energy.” Idea: adjust weights to decrease E until it reaches “steady
state.”
Consider
  
w10 w11 · · · w1m x0 = 1
y(n) = φ(W (n)x(n)) = φ  ... ..
.
..
.
..   x1 

. 
..

wp0 wp1 · · · wpm .xm
P
i.e., yk (n) = wkj (n)xj (n), so
∂yk (n)
= xj (n)
∂wkj (n)
That is, if we increase wkj , then yk changes in the direction of xj .
Idea Try δwkj (n) = ηek (n)xj (n), so

∆W (n) = ηe(n)x(n)T
W (n + 1) = W (n) + ∆W (n)
Delta rule or Widrow-Huff rule. Why? Matrix gradients (sort of). η = “learning rate”
parameter. Requires: e(n) is measurable.

8.2 Memory Based learning Revision : 2003.10
8.2 Memory Based learning

Read §2.3
Stored: all (or most) xi (now vectors) stored with desired output di .
Example 8.1 Classification as in Class C1 or C2 . di ∈ {0, 1}.
Approach present vector xtest . If xtest is “unknown”, identify an xi “near” to xtest .
nearest neighbor x0N ∈ {xi }N 0

i=1 : kxN − xtest k = mini kxi − xtest k.
[Text uses d(xi , xtest ) instead of norm.]

Select class d0N
k-nearest neighbors Select most common class of the k−nearest neighbors.
Fact 8.1 Quality of classification: within a factor of 2 of optimal if xi , di are uniformly

distributed and have an infinite amount of data.
What’s that mean Probability of mis-classification by this method is at most twice as

high as probability of misclassification by the best possible theoretical algorithm you could
develop.
Example 8.2 Exclusive-or network Explicitly store data:

element 1 2 3 4
0 0 1 1
xi
0 1 0 1
di 0 1 1 0

0.9
Receive new input x5 = . Nearest point is x3 , so update table:
0.3
element 1 2 3 4 5
0 0 1 1 0.9
xi
0 1 0 1 0.3
di 0 1 1 0 1
2
Question 8.1 When do we add a new data point? How close is “close enough?” (related
to Radial Basis Function networks ...)

8.3 Hebbian learning Revision : 2003.10
8.3 Hebbian learning

Read §2.4
“reinforce what’s already occuring.”
1. time dependent - modificatin occurs based on when signals occur
2. local - occurs at a specific “synapse”
3. interactive - depends on both input and output
4. conjunctional or correlational: strengthen when input/output are correlated.
Can also develop methods by which connections are weakened (“forgetting” methods).
Example 8.3
∆wkj (n) = F (yk (n), xj (n))

= ηyk (n)xj (n) simplest form
∆W (n) = ηy(n)xT (n)
Problem can lead to saturation (unstable growth)
Idea “covariance hypothesis” - subtract time-averaged mean values.
∆W (n) = η (y(n) − ȳ(n)) (x(n) − x̄(n))T
average values give threshhold to “sign” of correction so that we can both strengthen and
weaken connections.

8.4 Competitive learning Revision : 2003.10
8.4 Competitive learning

Read §2.5
neurons “compete” to fire.

Causes neurons to become “feature detectors” for different input classes.

1 if vk > vj ∀j, j 6= k
yk =
0 else
Idea
η(xj − wkj ) if neuron k “wins”
∆wkj =
0 else
P P 2
Adjust weights to enforce j wkj = 1 for all k or j wkj = 1 for all k .
8.5 Boltzmann learning

Read §2.6
Stochastic method. Concept: network energy state

1X X
E=− wkj xk xj
2 j,k,j6=k
x = state of neuron. Change neoron value xk → −xk based on

1
P (xk → −xk ) =
1 + e∆E/T
where ∆E is change in energy from the flip, T = “temperature” (adaptation speed parame-
ter).
8.6 Credit assignment

Read §2.7
Covered in more detail in later chapters.
8.7 Learning with a teacher

Read §2.8
Similar to tagged (labelled) data learning, except that we have a “teacher” instead of a
large database.

8.8 Learning without a teacher Revision : 2003.10
8.8 Learning without a teacher

Read §2.9
• with a critic (related to dynamic programming)
primary reinforcement
state
Environment vector Critic
heuristic reinforcement
Learning
System
actions
The “heuristic” reinforcement relates to a cost to go function from dynamic program-

ming.
Is much more difficult to perform than with labelled data, but permits the use of neural
networks to solve an optimization problem.
Remark 8.2 This would be a good project for a graduate student.
• unsupervised learning (e.g., competitive classification)
All of these require design of some mechanism to steer toward a “learned” outcome.

Revision : 2003.10
9 2003 09 10
We remember ...
“I wish that it need not have happened in my time,” said Frodo.
“So do I,” said Gandalf, “ and so do all who live to see such times.
But that is not for them to decide. All we have to decide is what
to do with the time given to us.”
J. R. R. Tolkien, The Fellowship of the Ring, p. 76
“We don’t want the bad guys to win!”
Fozzie Bear, The Great Muppet Caper.
9.1 Learning tasks

Read §2.10
Pattern association: e.g., clean up dirty paterns.

x y
pattern recognition/classification
Example 9.1 Homework 2 problem: classification can be used to clean up input waveforms:
3
noisy sine wave
2
−1
−2
−3
−4
0 0.5 1 1.5 2 2.5 3 3.5 4 4.5 5
time (s)
1
cleaned sine wave
0.5
−0.5
−1
0 0.5 1 1.5 2 2.5 3 3.5 4 4.5 5
time (s)

9.2 Memory Revision : 2003.10
function approximation: e.g. system i.d., inverse dynamics, control, filtering (smoothing,
prediction, extraction)
Example 9.2 system identification Try to mimic an unknown system: Neural net must
“remember” previous outputs, inputs, and try to predict next output:
unknown output y(k)

input u(k)
system
stored neural
(k), u(k − 1), ...
data
u(k-N), y(k − 1), ... network
y(k − N ).
text gives filtering examples:

• “cocktail party” - signal in clutter
• beamforming (radar)
9.2 Memory
Read §2.11
Geometric interpretations of neural network operation.

• short term memory
• long term memory
• distributed memory
Notation: vector (book says xk = ..., I’ll write ...)
 
x1 (k)
x(k) = 
 .. 
. 
xm (k)
m = network dimensionality (assume that y is the same length as x for now).
Suppose network is linear. Then y(k) = W (k)x(k) with w(k) specifically selected for the
pair (xk , yk ). Scalar notation:
 
x1 (k)

yi (k) = wi1 (k) · · · wim (k)  .. 
. 
xm (k)

Define
M0 = 0
Mk = Mk−1 + W (k)
q
X
M = Mq = W (k)
k=1
Can we select W (k)’s so that pattern recognition

Pq still works?
T
Define: estimate M̂ of M is M̂ = k=1 y(k)x(k) . Local outer products yi (k)xj (k),
similar to Hebbian learning.
can show M̂ = Y X T where

• Y = y1 · · · yq “memorized matrix”

• X = x1 · · · xq “key matrix”
Use of M̂ in associative memory: pick an input vector x(j).

∆
y(j) = M̂ x(j)
q
X
= y(k)x(k)T x(j) misprint in text, eqn 2.39
k=1
q
X T

= x(k) x(j) y(k)
k=1
Notice x(k)T x(j) is a scalar inner product,
x(k)T x(j)
cos(x(k), x(j)) =
kx(k)k kx(j)k
If x(k)’s are unit length then cos(x(k), x(j)) = x(k)T x(j). For recognition, want x(k)T x(j) =
0, orthogonal.

Example 9.3 Use correlation matrix idea in the homework example:
M-file 9.1 corrMemEx1.m
tt = (0:0.01:5)’;
sinewave=sin(pi*tt);
sawtooth = 2*abs(tt - floor(tt))-1;
square = 2*double ( floor(tt) == 2*floor(tt/2) )-1;
% normalize to unit vectors

sinewave = sinewave/norm(sinewave);
sawtooth = sawtooth/norm(sawtooth);
square = square/norm(square);
Xk = [sinewave, sawtooth, square];

Yk = [ [1;0;0], [0;1;0], [0;0;1] ];
figure(1);
plot(tt,sinewave,’-’, tt, sawtooth,’-’, tt, square,’-’);
legend(’sine’,’saw’,’square’);
xlabel(’time (s)’)
grid on
print -depsc corrMemEx1.eps
M = zeros(3,length(tt));
for kk=1:3
M = M + Yk(:,kk)*Xk(:,kk)’;
end
x1 = (M*sinewave)’
x2 = (M*sawtooth)’
x3 = (M*square)’
randn(’seed’,1); % seed the random # generator for repeatable tests

dirtySine = sinewave + randn(size(sinewave))/10;
figure(2)
plot(tt,dirtySine);
legend(’noisy sine wave’);
grid on
% try to recognize a dirty sine wave

x4 = (M*dirtySine)’

W = [sinewave, sawtooth, square]*M;

cleaned = W*dirtySine;
figure(3);
subplot(2,1,1)
plot(tt,dirtySine);
grid on
subplot(2,1,2)
plot(tt,cleaned);
legend(’cleaned sine wave’);
grid on
0.08
sine
saw
square
0.06
0.04
0.02
−0.02
−0.04
−0.06
−0.08
0 0.5 1 1.5 2 2.5 3 3.5 4 4.5 5
time (s)

0.3
noisy sine wave
0.2
0.1
−0.1
−0.2
−0.3
−0.4
0 0.5 1 1.5 2 2.5 3 3.5 4 4.5 5
time (s)
0.3
noisy sine wave
0.2
0.1
−0.1
−0.2
−0.3
−0.4
0 0.5 1 1.5 2 2.5 3 3.5 4 4.5 5
time (s)
0.1
cleaned sine wave
0.05
−0.05
−0.1
0 0.5 1 1.5 2 2.5 3 3.5 4 4.5 5
time (s)

Q if key vectors X are orthonormal, what is the storage capacity of the network? - m,
rank of M̂ .
Q classification accuracy: lower bound on error x(k)T x(j) ≥ γ∀k 6= j. If γ is big enough,
can get classfication errors. (Get upper bound instead?)

Homework 3
Read Simon Haykin. Neural Networks: A Comprehensive Foundation. Prentice Hall,
2nd edition, 1999 [Hay99] Chapter 2.
Due Wed Sept. 17. Hwk

1. Problem 2.18 in [Hay99]. (MATLAB exercise.)
2. Problem 2.20 in [Hay99]. (MATLAB exercise.)
3. Consider the m-file below:
M-file 9.2 learnTaskEx1s.m
tt = (0:0.01:5)’;
figure(1);
plot(tt,sinewave,’-’, tt, sawtooth,’-’, tt, square,’-’);
legend(’sine’,’saw’,’square’);
grid on
print -depsc learnTaskEx1a.eps
A = (you fill in here);
x1 = (A*sinewave)’ % this should be [1;0;0]

x2 = (A*sawtooth)’ % this should be [0;1;0]
x3 = (A*square)’ % this should be [0;0;1]
randn(’seed’,1); % seed the random # generator for repeatable tests

dirtySine = sinewave + randn(size(sinewave));
figure(2)
plot(tt,dirtySine);
grid on
print -depsc learnTaskEx1b.eps
% try to recognize a dirty sine wave

x4 = (A*dirtySine)’
This m-file implements a single-layer neural network that has 101 inputs (length of the
input vector) and 3 outputs. Output 1 should be a “1” when the input is a sinewave,

output 2 should be a “1” when the input is a sawtooth wave, and output 3 should be a
“1” when the input is a square wave, with the other outputs being zero. Each of these
waveforms is defined in the first four lines of the m-file.
Your job is to select what to fill in for A so that the neural network gives the desired
output for perfect (uncorrupted) inputs. The last few lines of the network demonstrate
what happens when you apply a corrupted sine wave.
The output for my solution is shown below:
>> learnTaskEx1
x1 = 1.0000 0.0000 -0.0000
x2 = 0.0000 1.0000 -0.0000
x3 = 0.0000 0.0000 1.0000
x4 = 1.0405 0.0530 -0.0658
3
noisy sine wave
−1
−2
−3
−4
0 0.5 1 1.5 2 2.5 3 3.5 4 4.5 5
time (s)
Solution
1. M-file:
M-file 9.3 hwk0218.m
% Homework 2.18, Haykin ’99

x1 = [1;0;0;0]; x2 = [0;1;0;0]; x3 = [0;0;1;0];

y1 = [5;1;0]; y2 = [-2;1;6]; y3 = [-2;4;3];
fprintf(’(a)\n’);
M = y1*x1’ + y2*x2’ + y3*x3’
fprintf(’(b)\n’);
error1 = M*x1 - y1
error2 = M*x2 - y2
error3 = M*x3 - y3
Output:
hwk0218.m Output
(a)
M =
5 -2 -2 0
1 1 4 0
0 6 3 0
(b)
error1 =
0
0
0
error2 =
0
0
0
error3 =
0
0
0
M-file:


x(1:3,1) = 0.25* [ -2; -3; sqrt(3) ];
x(1:3,2) = 0.25* [ 2; -2; -sqrt(8) ];
x(1:3,3) = 0.25* [ 3; -1; sqrt(6) ];
fprintf(’(a) angles between vectors \n’);

for ii=1:2
for jj= (ii+1):3
xi = x(:,ii);
xj = x(:,jj);
thisAngle = acos((xi’*xj)/(norm(xi)*norm(xj)));
orthErr = abs(abs(thisAngle) - pi/2);
fprintf(’The angle between x%d and x%d is %f degrees, \n’, ...
ii, jj, thisAngle*180/pi);
fprintf(’\t%f degrees off orthogonal\n’, orthErr*180/pi);
end
end
fprintf(’(b): autoassociative memory matrix\n’);

M = zeros(3);
for ii=1:3
M = M + x(:,ii)*x(:,ii)’/norm(x(:,ii));
end
M
fprintf(’(c): with normalization\n’);

xv = [0;-3;sqrt(3)];
y = M*xv/norm(xv)
yerr = y - x(:,1)
fprintf(’(c): without normalization\n’);

xv = [0;-3;sqrt(3)];
y = M*xv
yerr = y - x(:,1)
Output:
hwk0220.m Output
(a) angles between vectors

The angle between x1 and x2 is 100.438861 degrees,

10.438861 degrees off orthogonal

(b): autoassociative memory matrix
M =
1.062500 -0.062500 -0.110780

-0.062500 0.875000 -0.124299
-0.110780 -0.124299 1.062500
(c): with normalization

y =
-0.0012636
-0.8199219
0.6388963
yerr =
0.498736
-0.069922
0.205884
(c): without normalization

y =
-0.0043773
-2.8402926
2.2132017
yerr =
0.49562
-2.09029
1.78019
2. Using pseudo-inverse approach:
M-file 9.5 hwk0220a.m

x(1:3,1) = 0.25* [ -2; -3; sqrt(3) ];

x(1:3,2) = 0.25* [ 2; -2; -sqrt(8) ];
x(1:3,3) = 0.25* [ 3; -1; sqrt(6) ];
fprintf(’(a) angles between vectors \n’);

for ii=1:2
for jj= (ii+1):3
xi = x(:,ii);
xj = x(:,jj);
thisAngle = acos((xi’*xj)/(norm(xi)*norm(xj)));
orthErr = abs(abs(thisAngle) - pi/2);
fprintf(’The angle between x%d and x%d is %f degrees, \n’, ...
ii, jj, thisAngle*180/pi);
fprintf(’\t%f degrees off orthogonal\n’, orthErr*180/pi);
end
end
fprintf(’(b): autoassociative memory matrix\n’);

M = x*pinv(x); % well, duh, that gives us identity. What if x
% were 3x2 instead? The point is, pinv(x) gives us
% a working set of weights for an associative memory
fprintf(’(c): with normalization\n’);

xv = [0;-3;sqrt(3)];
y = M*xv/norm(xv)
yerr = y - x(:,1)
hwk0220a.m Output
(a) angles between vectors

(b): autoassociative memory matrix
(c): with normalization
y =
1.9456e-17
-8.6603e-01
5.0000e-01

yerr =
0.500000
-0.116025
0.066987

Revision : 2003.10
10 2003 09 12 Single layer networks

Read §3.1
10.1 From last time: Linear associative memory (LAM)

q
∆
X
y(j) = M̂ x(j) = y(k)x(k)T x(j)
k=1
q
X
= x(k)T x(j) y(k)
k=1
General case (LAM):

y = AB T x
BT
= A
y x

The matrix B = b1 · · · bnf acts as a feature extractor when we perform dot products
hbi , xi. The matrix A is then used to construct the associated waveform y as a combination
of columns of A:
 
hb1 , xi
∆
y = Az = A 
 .. 

.

b nf , x

= a1 hb1 , xi + · · · + anf bnf , x
A BT x
hb1 , xi
= hb2 , xi
..
.
y a1 a2 a3
Question 10.1 How many patterns can be stored in an N xN linear associative memory?

10.2 Adaptation Revision : 2003.10
10.2 Adaptation
Read [Hay99] §2.12
• The environment “changes:” a good decision now may not be a good decision later.
• If it changes slowly enough, can adapt to current conditions. “pseudo-stationary.”
10.3 Performance issues

§2.13, Statistical nature of the learning process, §1.14, Statistical learning theory, and §2.15
Probably approximately correct learning, will be examined later on.
For now: Suppose we have a training data set
T = {(xi , di )}N
i=1
We assume that this data represents a system
d = f (x) +
where is some functional error (perhaps a random variable or noise).

Assumptions:
1. E[|x] = 0. - zero mean random variable. Implies E[d|x] = f (x), which is what the
neural net is trying to match.
2. Error is uncorrelated with the function:
E[f (x)T ] = 0
(Is consistent with the conditional expectation above). Says that the function f gives
us all available information about d that we can get from x.
What does this sort of modeling assumption tell us about what neural networks can do?
Short summary of §2.13 Need to approximate f (x) with an ANN F (x; W ). Things to
reduce: mean value of error (bias) and variance of error (standard deviation).

10.4 Single layer perceptrons: “In the beginning ....” Revision : 2003.10
10.4 Single layer perceptrons: “In the beginning ....”

Read §3.1
Contributors:
1. W. S. McCullough and W. Pitts. A logical calculus of the ideas of the ideas immanent
in nervous activity. Bulletin of Mathematical Biophysics, 5:115–133, 1943 McCullough
& Pitts: (1943): use of NN as computational tool
2. D. O. Hebb. The Organization of Behavior: A Neuropsychological Theory. Wiley, New

York, 1949 1st rule for self-organized learning
3. F. Rosenblatt. The perceptron: A probabilistic model for information storage and orga-
nization in the brain. Psychological Review, 65:386–408, 1958 for perceptron (learning
with a teacher)
4. B. Widrow and M. E. Hoff, Jr. Adaptive switching circuits. In IRE WESCON Con-
vention Record, pages 96–104 Widrow-Hoff delta rule (least mean square)
Perceptron: linearly separable; “perceptron convergence theorem.” Single neuron can
be viewed as an “adaptive filter.”
10.5 Adaptive filtering interpretation

Read [Hay99] §3.2
v(i)
wi0
y(i)
x0
wi1
-1
..
x1 . ei
wim
..
.
d(i)
xm
Linear adaptive filter

1 ∆
y = φ(W̄ x + w0 ) = φ(W ) = φ(W x̄)
x
Desired system behavior specified by
T = {x(i), d(i), i = 1, ..., n}

10.5 Adaptive filtering interpretation Revision : 2003.10
where x(i) is a vector of length m (input dimensionality). x(i) is called a stimulus vector.
How do we get x?
• snapshot in space (m different points in space)
• uniformly spaced in time (present and m − 1 previous values).
Learning assumptions:
• Arbitrary starting point (neuron settings)
• Adjustments of weights are made continuously6 (time is a part of the learning algo-
rithm)
Linear neuron: m
X
y(i) = v(i) = wk (i)x)k(i) = w(i)T x(i)
k=1
Define: e(i) = d(i) − y(i)

Goal: update w to get desired output(s) d(i) for all input(s) x(i).
6
Not as a differential equation, usually, but as a difference (discrete-time) equation.

Revision : 2003.10
11 2003 09 15 Optimization and Neural Networks

11.1 What you should have seen in your linear algebra class
Linear systems of equations: 3 cases:
Case 1 n equations, n unknowns
Ax = b =⇒ x = A−1 b
with A and b known. A must have n linearly independent rows/columns for there to
be a unique solution. Otherwise either (1) there is no solution, or (2) there are an
infinite number of solutions.
Case 2 Overdetermined systems: A ∈ IRm×n , m > n.
min kAx − bk2 = min (Ax − b)T (Ax − b)
x x
n×m
Solution is x = A+ b where A+ ∈ IR is the pseudo-inverse of A, which means (among
other things) that A+ A = In . However, AA+ 6= In .
Remark 11.1 Define P = AA+ . Then P is a projection matrix, which means that
P 2 = P . Further, it is easy to show that P A = A. Multiplication by P extracts the
part of a vector (or matrix) that is in span(A).
case 3 Underdetermined systems. A ∈ IRm×n , m < n. Problem is:

min xT x
x
subject to Ax = b
Ax = b is a linear constraint that has m equations in n unknowns (not enough equations
to uniquely identify x). Solution to this problem is x = A+ b where A+ ∈ IRn×m is
the pseudo-inverse of A. Notice that, since the case 3 A matrix is short and wide,
the product AA+ = I and A+ A is a projection matrix for this case, unlike the case 2
problem.
11.2 Unconstrained optimization techniques

Given a set of data T = {x(i), d(i), i = 1, ..., n} and a neural network parameter vector w,
define the neural network function and output as

1
y(i) = f (x(i), w) = φ W
x(i)
Define the corresponding cost function as
∆
E(w) = E kd(n) − y(n)k2

N
!
1 1X
= (d(n) − y(n))T (d(n) − y(n))
N 2 n=1

11.2 Unconstrained optimization techniques Revision : 2003.10
Want to find optimal w ∗ such that E(w ∗ ) ≤ E(w) for all possible w, i.e.
w ∗ = arg min E(w)

w
Necessary condition (unconstrained case):

 ∂   ∂E(w) 
∂w1 ∂w1
∆ 
∇w E(w) =  .. ∆  ..
 E(w) =   = 0.
 
. .
∂ ∂E(w)
∂wm ∂wm
Goal: generate sequence of w(n) : E(w(n + 1)) < E(w(n))
Lemma 11.1 matrix calculus identities Let f (x) be a scalar function of a vector x ∈
IRn and
 ∂flet 
g(W ) of a matrix m×n
 ∂g W ∈ IR ∂g . Define their respective partial derivatives as
∂x1 ∂w11
· · · ∂w1n
∂f ∆  ..  ∂g ∆  . .. .. . Then
∂x
=  .  and ∂W =  .. . . 
∂f ∂g ∂g
∂xn ∂wm1
· · · ∂wmn
∂f
1. f (x) = cT x =⇒ =c
∂x
1 ∂f
2. f (x) = xT Qx =⇒ = Qx
2 ∂x
∂g
3. g(W ) = xT W y =⇒ = xy T
∂W
1 ∂g
4. g(W ) = xT W T W x =⇒ = W xxT
2 ∂W
Proof: Left as an exercise for the reader. This would be a good exam question, eh? 2
Example 11.1 gradients with vector-valued functions Linear associative memory.

w10 w11 w12
W =
w20 w21 w22
∆ T
w = vec(W ) = w10 w20 w11 w21 w12 w22
 
1
w10 + w11 x1 + w12 x2
f (x, w) = W x1 =
 
w20 + w21 x1 + w22 x2
x2
T
1 X 1 1
E(w) = (d(n) − W ) (d(n) − W )
2N x(n) x(n)
∂E ∆ h ∂E i
= ∂wij
∂W

11.2 Unconstrained optimization techniques Revision : 2003.10
Can use Lemma 11.1 to show

∂E 1 X 1 T
T

= W 1 x(n) − d(n) 1 x(n)
∂W N x(n)
1 X T T

= y(n)x̄(n) − d(n)x̄(n)
N

∆ 1
where x̄(n) = .
x(n)
2

Revision : 2003.10
12 2003 09 17: Gradient based learning methods

Recall from last time that for a linear associative memory y = W x that is to learn a set of
training data
T = {(xi , di )}N
i=1
we can write

∂E 1 X 1 T
T

= W 1 x(n) − d(n) 1 x(n)
∂W N x(n)
1 X T T

= y(n)x̄(n) − d(n)x̄(n)
N
1 X
T

= (y(n) − d(n))x̄(n)
N

∆ 1
where x̄(n) = .
x(n)
Question 12.1 How can we use this information to “train” a neural network?
12.1 Steepest Descent

∆
Define g = ∇w E(w)|w .
Steest descent iteration:
Heuristic Choose “learning parameter” η
w(n + 1) = w(n) − ηg(n).
Justification: Expand E in a Taylor’s series around w(n):
E(w(n + 1)) = E(w(n)) − η∇w E(w)T g(n) + O kηgk2 )

= E(w(n)) − ηg(n)T g(n) + O kηgk2 )

= E(w(n)) − η kg(n)k + O kηgk2 )

decreasing for η small enough.
1. Requires knowledge of ∇E
2. Slow convergence for small η
3. Oscillatory (or divergence) for large η.
1
Example 12.1 Problem 3.2 in [Hay99]. Minimize f (x) = xT Rx x + Rxd T x.7
2
7
See also GradientDescent.mov at http://www.eng.auburn.edu/users/hodelas/teaching/6240.

12.1 Steepest Descent Revision : 2003.10

% see if running Octave or MATLAB
rxd = [0.8182 ; 0.354]; Rx = [1, 0.8182; 0.8182, 1];
w_opt = Rx\rxd;
fprintf(’(a): w_opt = %12.4e %12.4e\n’,w_opt(1), w_opt(2));
fprintf(’(b): see plots\n’);

nstps = 100; fignum = 0;
wdata = zeros(2,nstps);
for eta = [0.3, 1.0];
errv = zeros(1,nstps);
gradNorm = zeros(1,nstps);
for nn=1:nstps % run nstps iterations of gradient descent
if(nn == 1)
wn = [0;0];
gn = Rx*wn - rxd;
else
wn1 = wdata(:,nn-1); % get w(nn-1)
gn = Rx*wn1 - rxd; % here’s my gradient
wn = wn1 - eta*gn; % get next weights
wdata(:,nn) = wn; % save the weights for plotting
end
errv(nn) = wn’*Rx*wn/2 - rxd’*wn; % here’s the cost function
gradNorm(nn) = norm(gn);
end
fignum = fignum+1; figure(fignum);
subplot(2,1,1);
plot(wdata(1,:), wdata(2,:),’-’, w_opt(1), w_opt(2), ’x’);
grid on ;
title(sprintf(’Steepest descent with eta=%f’,eta));
legend(’iterate values’, ’optimal value’);
xlabel(’w_1(n)’); ylabel(’w_2(n)’);
subplot(2,1,2);
plot(1:nstps,errv,’-’,1:nstps,gradNorm,’-’);
xlabel(’iteration number’);
legend(’cost function value’,’|| gradient ||’);
grid on;
eval (sprintf( ’print -depsc hwk0302_%d.eps’,fignum));

end
% contour plot of optimization surface

fignum=fignum+1;
nx = 100; xx = linspace(-10,10,nx);
ny = 101; yy = linspace(-10,10,ny);
cf = zeros(nx,ny);
for ii=1:nx
for jj=1:ny
wn = [xx(ii); yy(jj)];
cf(ii,jj) = wn’*Rx*wn/2 - rxd’*wn;
end
end
figure(fignum);
meshc(xx,yy,cf’);
xlabel(’x value’);
ylabel(’y value’);
grid on
hwk0302.m Output
(b): see plots

>>

Steepest descent with eta=0.300000

0.4
iterate values
0.2 optimal value
0
−0.2
w (n)
2
−0.4
−0.6
−0.8
−1
0 0.2 0.4 0.6 0.8 1 1.2 1.4 1.6
w (n)
1
1
cost function value
|| gradient ||
0.5
−0.5
0 10 20 30 40 50 60 70 80 90 100
iteration number
Case 1: η = 0.3
Steepest descent with eta=1.000000

0.4
iterate values
0.2 optimal value
0
−0.2
w (n)
2
−0.4
−0.6
−0.8
−1
0 0.2 0.4 0.6 0.8 1 1.2 1.4 1.6
w1(n)
1
cost function value
|| gradient ||
0.5
−0.5
0 10 20 30 40 50 60 70 80 90 100
iteration number
Case 2: η = 1.0

12.2 Newton Iteration Revision : 2003.10
12.2 Newton Iteration

Expand Taylor’s series further in ∆w(n):
1
E(w(n + 1)) ≈ E(w(n)) + g(n)T ∆w(n) + ∆w(n)T H(n)∆w(n)a
2
where H = ∇2 E(w)) = ∇g T
 ∂E ∂E 
∂w1 2 · · · ∂w1 ∂wm
= 
 .. .. .. 
. . . 
∂E ∂E
∂wm ∂w1
··· ∂w 2 m
Differentiate w.r.t. ∆w to get
g(n) + H(n)δw(n) = 0 =⇒ w(n) = −H(n)−1 g(n)
1. Good (rapid) convergence if H(n) > 0; instant if E is quadratic.
2. Requires knowledge of g, H.
3. Cannot guarantee H > 0 in general.

Homework 4 Due Fri Sept. 24. Hwk

Your homework this week will include both written and m-file components. You should
turn in your written portion of the homework at the start of class on Sept. 17 and you should
email your m-file component to simmoat@eng.auburn.edu by the start of class on Sept. 17.
Mr. Simmons will send details by email of how to submit your m-file.
1. Recall from Lemma 11.1 that the gradient of a scalar function f is defined as
 
∂f /∂x1
∆ 
∇x f (x) =  .. 
. 
∂f /∂xm
T
∈ IR2 . Show that if

(a) (written) Consider a function f (x) where x = x1 x2
f (x) = cT x then

x1 c1
∇x f = ∇ x c1 c2 =c= .
x2 c2
T
∈ IR2 . Show that if

(b) (written) Consider a function f(x) where x = x1 x2
q11 q12
f (x) = 12 xT Qx where Q = QT = then
q12 q22

q11 x1 + q12 x2
∇x f = Qx = .
q12 x1 + q22 x2
(c) (written) Consider a function f (W ) = xT W y where W ∈ IR2×3 (which implies
that x ∈ IR2 and y ∈ IR3 . Show that
∇W f = xy T .
Solution
∂c1 x1 +c2 x2
∂x1 c1
(a) ∇x f = ∇x c1 c2 = ∇x (c1 x1 + c2 x2 ) = ∂c1 x1 +c2 x2 =
∂x2
c2
(b) f (x) = x2 q11 /2 + x1 x2 q12 + x22 q22 /2, so
" #
∂f
q11 x1 q12 x2
∇x f = ∂x ∂f
1
= = Qx
∂x2
q21 x1 + q22 x2
(c) I’ll show for the general case W ∈ IRm×n , so that

m n
! m X
n
X X X
f (x) = xT (W y) = xi wij yj = wij xi yj
i=1 j=1 i=1 j=1
∂f
Then the partial derivative ∂w ij
= xi yj , which means that
 ∂f ∂f
  
∂w11
· · · ∂w 1n
x1 y 1 · · · x 1 y n
∂f ∆  . .. ..  =  .. .. ..  = xy T
=  .. . .   . . . 
∂W ∂f ∂f
∂wm1
· · · ∂wmn xm y 1 · · · x m y n

Hwk 4: 2(a) Steepest descent with eta=0.100000

−0.1
iterate values
−0.11 optimal value
−0.12
−0.13
x2(n)
−0.14
−0.15
−0.16
−0.17
−0.17 −0.16 −0.15 −0.14 −0.13 −0.12 −0.11 −0.1
x1(n)
1.5
cost function value
|| gradient ||
1
0.5
−0.5
0 20 40 60 80 100 120 140 160 180 200
iteration number
Figure 8: Results of Homework 4 Problem 2a.

2 1
2. Consider the function f (x) = xT

x+ 1 1 x
1 2
(a) (mfile) write an m-file that uses the method of steepest descent to find the value
of x that minimizes f (x).
Solution

2 1 T
Note Misprint in assignment, I meant to write f (x) = x x/2+ 1 1 x.
1 2
I’m surprised no one asked about that. M-file is nearly identical to M-file 12.1
(hwk0302.m) (see listing at the end of this solution). Plots are in Figure 8.
(b) (written) Use ∇x f = 0 to solve the above minimization by hand. Compare your
theoretical answer to the result from your m-file. Explain any differences you
observe.

Solution Derivation of optimal solution8

2 1 x1 x1 ∆ T
= x Qx + cT x

f (x) = x1 x1 + 1 1
1 2 x2 x2
= 2x21 + 2x1 x2 + 2x21 + x1 + x2
" #
∂f
∂x1 4x 1 + 2x 2 + 1
∇x f = ∂f = = 2Qx + c
∂x2
2x1 + 4x2 + 1
1 −1/6
=⇒ xopt = − Q−1 c =
2 −1/6
hwk304.m Output
2(b): difference is 2.7756e-17 0.0000e+00
>>

4 2 1 1/6 9
Optimal value is at x+ = 0 or x = − . Differences (see
2 4 1 1/6
m-file output) are due to double precision arithmetic roundoff.
3. (mfile) Use the method of steepest descent to train a linear neural network (LNN)
y = W x̄ to mimic the logic gates indicated below. (written) Discuss the quality of the
output of your LNN: why does it work (or not work)?
n x1 (n) x2 (n) AND OR XOR

1 0 0 0 0 0
2 0 1 0 1 1
3 1 0 0 1 1
4 1 1 1 1 0
You should run your iteration for 100 steps. Your plots should include:
(a) LNN error E as a function of iteration step.

(b) Norm of the gradient as a function of iteration step.
(c) A mesh plot of the output of each of your LNN’s on the range 0 ≤ (x 1 , x2 ) ≤ 1.
8
This derivation added to original homework solutions.
9
Note: the original solutions had a misprint here. This is the correct answer.

!
∂ X X ∂
Solution Recall that fi (x) = fi (x) . Thus we have
∂x i i
∂x
4
1X
E = (d(n) − W x(n))T (d(n) − W x(n))
2 n=1
4
X
= d(n)T d(n)/2 − d(n)T W x(n) + x(n)T W T W x(n)/2
n=1
4
∂E X
J= = −d(n)x(n)T + W x(n)x(n)T
∂W n=1
The m-file is in M-file 12.2 (hwk304.m). Plots are in Figure 9.
4. (mfile) Repeat problem 3 using the Gauss Newton method.

Solution The derivation of the Gauss-Newton method in §13.2 requires us to work
with a vector e sucvh that E = eT e/2. Recall that
∆ 1 X
E = (d(n) − W x̄(n))T (d(n) − W x̄(n))
2 X
= eT e/2 = e2i /2.
If we define  
d(1) − W x̄(1)
e=
 .. 
. 
d(4) − W x̄(4)
it is easy to verify that E = eT e/2. With this definition,
x̄(1)T
 ∂e ∂e1
  
1
∂w0
· · · ∂w 2
J =  ... ..
.
..  = −  .. 

.   . 
∂e4
∂w0
∂e4
· · · ∂w2 x̄(4)T
My results are in Figure 10.
Note Clearly label all plots and turn in printed copies of your plots with your written
homework.
% Homework 4 Solutions ELEC6240 Problems 2a,3,4
%----------------------------------------------
% Problem 2a : min x’ Q x/2 + c’ x
%----------------------------------------------

% PROBLEM! I wrote x’ Q x, not x’ Q x/2, so ...

QQ = [2, 1; 1, 2]*2; cc = [1 ; 1];
xopt = -QQ\cc;
fprintf(’(a): w_opt = %12.4e %12.4e\n’,xopt(1), xopt(2));
%----------------------------------------------
% Problem 2(b): steepest descent
%----------------------------------------------
nstps = 200; fignum = 0;
xdata = zeros(2,nstps); % array in which to save iterative solution values
eta = 0.1;
errv = zeros(1,nstps);
gradNorm = zeros(1,nstps);
for nn=1:nstps
if(nn == 1)
xn1 = [0;0]; % 1st step: initialize x(n-1) to zero
else
xn1 = xdata(:,nn-1); % xn1 = x(n-1)
end
gn = QQ*xn1 + cc; % gradient at x(n-1)
xn = xn1 - eta*gn; % compute next x(n)
xdata(:,nn) = xn; % and store it
errv(nn) = xn’*QQ*xn/2 + cc’*xn;
gradNorm(nn) = norm(gn);
end
xmin = xn; % save in variable for Simmons to grade
fprintf(’2(b): difference is %12.4e %12.4e\n’, xmin(1) - xopt(1), ...
xmin(2) - xopt(2))
% plot of iterative solution values with optimal solution marked

subplot(2,1,1);
plot(xdata(1,:), xdata(2,:),’-’, xopt(1), xopt(2), ’x’);
grid on ;
title(sprintf(’Hwk 4: 2(a) Steepest descent with eta=%f’,eta));
legend(’iterate values’, ’optimal value’);
xlabel(’x_1(n)’); ylabel(’x_2(n)’);
% norm of the gradient

subplot(2,1,2);
plot(1:nstps,errv,’-’,1:nstps,gradNorm,’-’);
legend(’cost function value’,’|| gradient ||’);
grid on;

eval(sprintf(’print -depsc hwk304%.2d.eps’,fignum));
%----------------------------------------------
% Problems 3,4 , AND,XOR,OR gates, Steepest Descent
%----------------------------------------------
xn = [0 0; 0 1; 1 0; 1 1]’; % x(i) is in col i of xn
x1 = linspace(0,1,25);
x2 = linspace(0,1,27);
% column 1: AND, column 2: OR, column 3: XOR

dd = [ [0;0;0;1], [0;1;1;1], [0;1;1;0] ];
stps = 100;
wdata = zeros(3,stps);
eta = 0.1;
delta = 0.01; % for Gauss-Newton step
nstps = 100;
for probNum = 3:4 % problem number

err = zeros(3,stps);
gradNorm = zeros(3,stps);
for jd = 1:3 % select column of dd we’re using
Wn = zeros(1,3); % initialize 1st data point
for nn=1:stps
% at loop start, Wn contains the last value of W(n)
% compute gradient (for steepest descent)

gn = zeros(1,3); % initialize gradient value
for ii = 1:4;
xbar = [1; xn(:,ii)];
gn = gn + (Wn*xbar)*xbar’ - dd(ii,jd)*xbar’;
end
if(probNum == 3) % steepest descent case
Wn = Wn - eta*gn; % update W(n)
else % probNum == 4 means this is the Gauss-Newton case
wn = Wn’; % we derived G-N with w = column vector
ee = zeros(4,1);
JJ = zeros(4,3);
for ii=1:4
xbar = [1; xn(:,ii)];
ee(ii) = (dd(ii,jd) - Wn*xbar);
JJ(ii,:) = -xbar’;

end
wn = wn - ( JJ’*JJ + delta*eye(3))\JJ’*ee;
Wn = wn’;
end
wdata(:,nn) = Wn’;
% compute error for this function (jd =1,2,3 -> AND, OR, XOR)
err(jd,nn) = 0;
for ii = 1:4
xbar = [1;xn(:,ii)];
err(jd,nn) = err(jd,nn) + (dd(ii,jd) - Wn*xbar)^2/2;
end
gradNorm(jd,nn) = norm(gn);
end
titles = {’AND’,’OR’,’XOR’};
subplot(2,2,jd) % mesh plots
warning(’off’); % avoid pesky messages on xor plot
meshPex(Wn,x1,x2,’input a’,’input b’,titles{jd});
end
% print meshplots of the 3 network outputs
% plot of LNN error E as function of iteration step

algType = {’steepest descent’,’Gauss-Newton’};
varnames = {’eta’,’delta’};
subplot(2,1,1);
plot(1:nstps, err,’-’);
legend(’AND’,’OR’,’XOR’);
grid on ;
title(sprintf(’Cost function value: %s with %s=%f’, ...
algType{probNum-2},varnames{probNum-2}, eta*(probNum==3) + delta*(probNum==4)));
% plot of norm of the gradient as function of iteration step

subplot(2,1,2);
semilogy(1:stps,gradNorm,’-’);
legend(’AND’,’OR’,’XOR’);
grid on;
title(sprintf(’gradient norm: %s’,algType{probNum-2}));

if(probNum == 3)
eta3 = eta*ones(1,3); % used same eta for all three
wn3 = wdata;
error3 = err;
nmgrad3 = gradNorm;
else
eta4 = delta*ones(1,3); % used same eta for all three
wn4 = wdata;
error4 = err;
nmgrad4 = gradNorm;
end
end

Cost function value: steepest descent with eta=0.100000
AND
OR
0.5 XOR
0.4
0.3
0.2
0.1
0 10 20 30 40 50 60 70 80 90 100
iteration number
gradient norm: steepest descent

2
10
AND
OR
XOR
0
10
−2
10
−4
10
0 10 20 30 40 50 60 70 80 90 100
iteration number
AND OR
1 1.5
0.5 1
0 0.5
−0.5 0
1 1
1 1
0.5 0.5
0.5 0.5
input b 0 0 input a input b 0 0 input a
XOR
0.5005
0.5
0.4995
1
1
0.5
0.5
input b 0 0 input a
Figure 9: Solution to Homework 4 Problem 3.

Cost function value: Gauss−Newton with delta=0.010000
AND
OR
0.5 XOR
0.4
0.3
0.2
0.1
0 10 20 30 40 50 60 70 80 90 100
iteration number
gradient norm: Gauss−Newton

10
10
AND
OR
XOR
0
10
−10
10
−20
10
0 10 20 30 40 50 60 70 80 90 100
iteration number
AND OR
1 1.5
0.5 1
0 0.5
−0.5 0
1 1
1 1
0.5 0.5
0.5 0.5
input b 0 0 input a input b 0 0 input a
XOR
0.5
0.5
1
1
0.5
0.5
input b 0 0 input a
Figure 10: Solution to Homework 4 Problem 4.

Revision : 2003.10
13 2003 09 19: More gradient based learning

13.1 A matlab m-file example
M-file 13.1 meshPexPlot.m
% test of meshPex plotting m-file
xx = linspace(-1,2,11);
yy = linspace(-1,2,13);
W = [1,2,3];
yvals = meshPex(W,xx,yy,’x_1’,’x_2’,’Example plot’);
print -depsc meshPexPlot.eps
M-file 13.2 meshPex.m

function Yvals = meshPex(W,x1,x2,xstr, ystr, tstr)
% Yvals = meshPex(W,x1,x2,tstr)
% compute and/or plot output of a single layer linear neural network
% function of two variables
% inputs: W (1 x 2): network weights
% x1, x2: each vector xx = [x1(ii); x2(jj)] for appropriate values of
% ii, jj
% xstr, ystr, tstr: strings for xlabel, ylabel, and title, respectively
% if all three are passed, then the mesh plot is plotted.
% if not, these arguments are ignored
doMeshPlot = 0; % assume no mesh plot.

if(nargin == 6) % nargin automatically contains the number of input arguments
if( isstr(xstr) & isstr(ystr) & isstr(tstr) )
doMeshPlot = 1;
end
end
% could check dimensions, etc., but I’m not going to

for ii=1:length(x1)
for jj=1:length(x2)
xbar = [1;x1(ii);x2(jj)];
Yvals(jj,ii) = W*xbar;
end
end
if(doMeshPlot)
mesh(x1,x2,Yvals);

13.1 A matlab m-file example Revision : 2003.10
xlabel(xstr);
ylabel(ystr);
title(tstr);
grid on;
end
Example plot
12
10
−2
−4
2
1.5
2
1 1.5
0.5 1
0 0.5
0
−0.5 −0.5
x −1 −1
2 x1

13.2 Gauss Newton Revision : 2003.10
13.2 Gauss Newton

[Notation in this section of thebook is icky.]
e1 (w)
Define error vector e(w) =  ... .
∆  
en (w)
Remark 13.1 w = column vector m × 1.
Suppose
n
1X 2
∆ 1
E(w) = ei (w) = e(w)T e(w)
2 i=1 2
mean squared error
Remark 13.2 e = column vector p × 1
1st order Taylor’s series for error e about w(n) (to select a new w):
∆
e(w(n + 1)) = e(w(n)) + ∇w e|w(n) T (w(n + 1) − w(n)) = e(w(n)) + J(n)(w(n + 1) − w(n))
Remark 13.3 J(n) is the Jacobian (1st derivative matrix).
Select new w to satisfy
∆ 1
w = arg min ke(w(n + 1))k
2
w
1 1
= arg min ke(n)k2 + e(n)T J(n)(w − w(n)) + ((w − w(n))T J(n)T J(n)(w − w(n))
w 2 2
Differentiate w.r.t. w, set to 0 to obtain
J T e(n) + J(n)T J(n)(w − w(n)) = 0

−1
=⇒ w(n + 1) = w(n) − J(n)T J(n) J(n)T e(n) = w(n) − J(n)† e(n)
Problem J(n)T J(n) is not guaranteed to be invertible. In fact, if p < m (e.g., p = 1 =⇒

J is a row vector), then J(n)T J(n) is guaranteed to be not invertible.
Idea Invert J(n)T J(n) + δI instead.

1
δ kw − w(n)k2 + ke(n)2 k . Take smaller steps,

Remark 13.4 Equivalent to minimizing 2
but still reduce norm of error.

13.3 Perceptrons Revision : 2003.10
13.3 Perceptrons
Read §3.8-3.9
Induced local field (linear)

m
X
v= w i xi + b
i=1
Classify patterns: must be “linearly separable.”

wT x + b = 0
C1
w
C1
C2
C2
Treat b as a weight, modify

∆ T
x(n) = 1 x1 (n) · · · xm (n)
∆ T
w(n) = b(n) w1 (n) · · · wm (n)
So
v(n) = w(n)T x(n)
x ∈ C1 =⇒ w T x > 0.

Algorithm 13.1 Perceptron training:
Inputs Sequence of inputs x(n) and desired outputs d(n), learning parameter η (or se-
quence of learning parameters η(n).
Outputs Weight vector w
for n = 1, 2, ...
if x(n) ∈ C1 and w(n)T x(n) ≤ 0 /* incorrectly classfied */

w(n + 1) = w(n) + ηx(n)
else if x(n) ∈ C2 and w(n)T x(n) > 0 /* incorrectly classfied */
w(n + 1) = w(n) − ηx(n)
end if
endfor
Analysis of this algorithm is in the next section of the notes.

Example 13.1 Perceptron convergence algorithm NAND gate:
M-file 13.3 perEx.m
% train a single layer perceptron to mimic a NAND gate

help perceptron2Dlearn
xx = [0,1,0,1; 0, 0, 1, 1]
dd = [1,1,1,-1]
[WW, iNum] = perceptron2Dlearn(xx,dd,1e4);

fprintf(’Converged in %d iterations to\n’,iNum);
WW
Yvals = meshPex(WW,linspace(0,1), linspace(0,1), ...
’NAND input 1’, ’NAND input 2’, ’Activation function input’);
print -depsc perEx.eps
perEx.m Output
W = perceptron2Dlearn(x,d,nIter);
train a threshhold activation function perceptron to learn a data set
(if possible)
inputs:
x, d: x(Nx2), d(N,1): input, desired output pairs
d(nn) should be either 1 or -1.
nIter: max number of iterations to run
outputs:
W: phi( W* xbar ) should match data (if possible in given # iterations)
iNum: number of passes through data to classify
xx =
0 1 0 1
0 0 1 1
dd =
1 1 1 -1
Converged in 9 iterations to
WW =
4 -2 -3

>>
Activation function input
−1
1
0.8 1
0.6 0.8
0.4 0.6
0.4
0.2
0.2
0 0
NAND input 2
NAND input 1
M-file 13.4 perceptron2Dlearn.m
function [WW,iNum] = perceptron2Dlearn(xx,dd,nIter)

% W = perceptron2Dlearn(x,d,nIter);
% train a threshhold activation function perceptron to learn a data set
% (if possible)
% inputs:
% x, d: x(Nx2), d(N,1): input, desired output pairs
% d(nn) should be either 1 or -1.
% nIter: max number of iterations to run
% outputs:
% W: phi( W* xbar ) should match data (if possible in given # iterations)
% iNum: number of passes through data to classify
iNum = 0;
done = 0;
WW = zeros(1,3);

NN =size(xx,2); % number of columns in xx, size(xx,1) gives number of rows.

while ( ~done )
% present all input pairs one at a time and correct
nBad = 0; % count mis-classified entries
for nn = 1:NN
xn = [1;xx(:,nn)];
dn = dd(nn);
if( dn*WW*xn <= 0 )

WW = WW + dn*xn’;
nBad = nBad + 1; % another misclassified
end
end
if( nBad == 0 ) % none misclassified, so done

done = 1;
end
iNum = iNum + 1;
if( iNum >= nIter ) % too many iterations, quit.
done = 1;
end
end

Revision : 2003.10
14 2003 09 22: Single Layer Perceptrons (conclusion)

14.1 The perceptron convergence theorem
Recall Algorithm 13.1 from last lecture:
Algorithm 14.1 Perceptron training:
Inputs Sequence of inputs x(n) and desired outputs d(n), learning parameter η (or se-
quence of learning parameters η(n).
Outputs Weight vector w
for n = 1, 2, ...
if x(n) ∈ C1 and w(n)T x(n) ≤ 0 /* incorrectly classfied */

w(n + 1) = w(n) + ηx(n)
else if x(n) ∈ C2 and w(n)T x(n) > 0 /* incorrectly classfied */
w(n + 1) = w(n) − ηx(n)
end if
endfor
Analysis of this algorithm requires use of
Theorem 14.1 Cauchy Schwartz√ inequality Given compatibly dimensioned vectors x, y

and the Euclidean norm kxk = xT x,
2
kxk2 kyk2 ≥ xT y
Theorem 14.2 Suppose input vectors x(n) in Algorithm 13.1 are drawn from subsets Xi
of Ci , i = 1, 2. Assume that sets C1 and C2 are linearly separable. Then Algorithm 13.1
converges with w(0) = 0 and η = 1.
Proof: Since C1 and C2 are linearly separable it follows that there exists a vector w0 such
that w0 T x(n) > 0 for all x(n) ∈ X1 and w0 T x(n) ≤ 0 for all x(n) ∈ X2 . Define
α = min w0 T x(n) (14.1)

x∈X1
Suppose all input vectors x(n) are drawn from X1 and that w(n)T x(n) ≤ 0 (the vectors
are incorrectly classified). Then for w(0) = 0 and η = 1
n
X
w(n + 1) = x(i)
i=1

14.1 The perceptron convergence theorem Revision : 2003.10
By equation (14.1) we have

n
X
T
wo w(n + 1) = w0 T x(i) ≥ nα (14.2)
i=1
By Cauchy-Schwartz, (14.1) and (14.2) we have
kw0 k2 kw(n + 1)k2 ≥ n2 α2 =⇒ kw(n + 1)k2 ≥ n2 α2 / kw0 k2 (14.3)
Define
β = max kx(k)k2 .
x∈X1
Notice that, for k ≤ n, w(k + 1) = w(k) + x(k) and so
kw(k + 1)k2 = kw(k)k2 + x(k)2 + 2w(k)T x(k)

By assumption the vectors x(k) are incorrectly classified, and so w(k)T x(k) < 0 which implies
kw(k + 1)k2 ≤ kw(k)k2 + x(k)2

and so
kw(k + 1)k2 − kw(k)k2 ≤ x(k)2 .

(14.4)
Sum (14.4) over k = 1, ..., n and recall w(0) = 0 to get
n
X
2 x(k)2 ≤ nβ

kw(n + 1)k ≤ (14.5)
k=1
Equation (14.3) gives a quadratically growing lower bound on w(n+1). Conversely, equation
(14.5) gives a linearly growing upper bound on w(n + 1). We conclude that there exists some
number nmax such that a correct classification must occur for n ≥ nmax . 2
Remark 14.1 The above theorem states nothing about converging to a vector that correctly
differentiates between C1 and C2 . It merely states that if you present vectors from C1 long
enough it will eventually correctly classify those vectors. Nevertheless, the stronger result
can also be shown to be true: if a solution exists, the above procedure will converge so that
w(n0 ) = w(n0 + 1) = · · · for some n0 ≤ nmax .
Can also use an adaptive error correction model: let η(n) be the smallest integer such
that
η(n)x(n)T x(n) ≥ w(n)T x(n)

Question 14.1 Why is it o.k. to use integers? Shouldn’t we have to use small numbers?

14.2 Other activation functions Revision : 2003.10
14.2 Other activation functions

Suppose we use a continuous activation function so that (ignoring dependence on step
numbern
yi = φ(wi,·T x̄)
X
= φ( wij x̄j )
Question 14.2 What is ∂yi /∂wij ?
Must apply the chain rule. Define vi = wi,· T x̄.
∂yi ∂yi ∂vi

=
∂wij ∂vi ∂wij
∂φ(v)
= xj
∂v
∂φ

Define φ0 (v) = ∂v v
(scalar and vector form). Then
∂yi
= φ0 (w T x̄)x̄
∂wi,·
a vector valued function.
Question 14.3 How can we use this to design a training algorithm using continuous valued
φ?
We need to determine
∂E ∂ 1 X
= (d(n) − φ(W x̄)T (d(n) − φ(W x̄))
∂W ∂W 2 n
function y = phiT(x,a,b)
% function y = phi(x,a,b)
% hyperbolic tangent activation function a*tanh(b*x)
y = a*tanh(b*x);
function dy = dphi(x,a,b)
% derivative of hyperbolic tangent function
y = phi(x,a,b);
dy = (b/a)*(a - y ) .* ( a + y );

14.3 Exam 1 information Revision : 2003.10
14.3 Exam 1 information

14.3.1 Topics
[Hay99] Simon Haykin. Neural Networks: A Comprehensive Foundation. Prentice Hall,
2nd edition, 1999
Activation functions and their derivatives
Neuron models mathematical formulae, signal flow graphs, capabilities (separating hyper-
planes, normal vectors, dot products)
Learning processes Error-correction learning (Delta rule), memory-based learning, Heb-

bian learning, Competitive learning
NOT covered Boltzmann learning credit assignment, learning without a teacher
Learning tasks Associative memory/pattern association

Covariance matrix associative memory design (requires orthogonality)
−1
“optimal” asosciative memory design (pseudo-inverse A† = AT A AT if A is tall and
thin)
Single layer networks design, capabilities, performance issues
Unconstrained optimization techniques Error function E, Matrix calculus identities,

Gradient based learning methods
Perceptron learning algorithm What is it? Why does it work?
MATLAB/C short programming exercises (writing and reading)
14.3.2 Permitted resources

None. You may bring a pencil or pen (use a pen only if you never make mistakes) and
an eraser. You are not permitted to use a calculator, textbook, written notes, oral or
written communication with other people (besides the instructor/GTA), laptop computers,
cell phones, PDAs, wireless modems, telepathic contact, or any other resources besides your
own mind, body and a writing utensil. T-shirts with Maxwell’s equations on the back will
be tolerated, but I reserve the right to reseat you in the back of the classroom. You are to sit
with at least one empty chair between you and the nearest classmate. Use of unauthorized
resources on this exam will result in a failing grade.
Exam 1 material ends here.

Revision : 2003.10
15 2003 09 24 Multi-layer perceptrons

Don’t Forget! IEEE meeting tonight in 238 at 6:00pm with Chevron-Texaco - they’re
hiring, and they’re providing barbecue house. (GPA ≥ 3.0; pre-select at Career Services by
TOMORROW!)
15.1 Introduction
Read §4.1-4.2
Two pass algorithm:
1. First pass: present data, get outputs and intermediate (local field) values.
2. Second pass; adjust weights according to results of first pass.
Remark 15.1 back-propagation is a learning algorithm, not a kind of neural network.
1. Computationally efficient
2. Not guaranteed to converge to optimal solution
Changes from chapter 3:
1. Activation function will be nonlinear, e.g., sigmoid

1
yj =
1 + e−avj
2. At least one hidden layer.
3. High connectivity.
Vocabulary
Function signal forward (network operation)
Error signal training (backward)

15.1 Introduction Revision : 2003.10
Notation used in text is bad; uses indices i, j, and k to denote both a layer number and
to designate a neuron within a specified layer. We’ll use this instead:
+1 +1 (bias)
W (1) .. .. W (2) .. ..
.. . . . .
.
y (0) v (1) y (1) v (2) y (2)
• Use subscripts i, j to denote node numbers/weight indices. Superscript k to

denote layer number. Use mk to denote number of neurons in layer k.
Input layer Write here as y (0) (n) or y (0) where the time index is clear. Individual
(0 (0)
neurons denoted as y1 (n), ..., ym1 (n). Present pattern y (0) (n) = x(n).
Weight
h matrices
i Connection from layer k − 1 to layer k is denoted W (k) (n) =
(k)
wij (n) : k is the layer number, ij denotes the connection from input
neuron j to output neuron i, and n is the timestep. Update is
W (k) (n + 1) = W (k) (n) + ∆W (k) (n)
Forward calculations Calculate v (k) , y (k) as follows:
v (k) (n) = W (k) (n)y (k−1) (n)

(k)
X (k) (k−1)
equivalently, vi = wij yj (n)
j
(k)
y (n) = φ(v (k) (n))
Need not use the same activation function φ at all neurons, but I will follow
that format here.
Desired output Associated with x(n) is desired output vector d(n). Define
error vector e(n) = d(n) − y (2) (n). If there are more than two layers, then
set kmax = maximum layer number (2 in the diagram above) and set e(n) =
d(n) − y (kmax ) (n).
∆
Error function E(n) = e(n)T e(n) = ej (n)2 .
P

15.2 Back-propagation algorithm Revision : 2003.10
15.2 Back-propagation algorithm

Define instantaneous error function E(n) over output nodes only
mkmax
1 X ∆ 1
E(n) = e2j (n) = e(n)T e(n) (15.1)
2 j=1 2
∆
Define average error function Eav (n) Eav (N ) = N1 N
P
n=1 E(n)
Basis of back-propagation: chain rule of differentiation.
15.3 Output layer update

Look at an individual neuron:
W (i) φ
φ
E
..
. ..
.
ȳ (j) v̄ (i) ȳ (i)

Hodel notation
m
!
(k) (k) (k) (k−1)
X
yi (n) = φ(vi (n)) = φ wij (n)yj (n)
j=0
Apply the chain rule - repeatedly!

! ! !
(k) k)
∂E(n) ∂E(n) ∂ei (n) ∂yi (n) ∂vi (n)
(k)
= (k) (k) (k)
∂wij (n) ∂ei (n) ∂yi (n) ∂vi (n) ∂wij (n)
Let k = kmax (looking at output neurons). From equation (15.1) and ej = dj − yj we have
∂E(n)
= ei (n) → ∇e(n) E(n) = e(n)
∂ei (n)
∂ei (n)
(k)
= −1
∂yi (n)
(k)
∂yi (n) (k)
(k)
= φ0 (vi (n))
∂vi (n)
(k)
∂vi (n) (k−1)
(k)
= yj (n)
∂wij (n)

15.3 Output layer update Revision : 2003.10
and so
∂E(n) (k) (k−1)
(k)
= −ei (n)φ0 (vi (n))yj (n)
∂wij (n)

15.3 Output layer update Revision : 2003.10
From last page: output layer weights update with

W (i) φ
φ
E
..
. ..
.
ȳ (j) v̄ (i) ȳ (i)

Hodel notation
∂E(n) (k) (k−1)
(k)
= −ei (n)φ0 (vi (n))yj (n)
∂wij (n)
Define update rule as follows
∆
δi (n) = ei (n)φ0 (vi (n))
∆ T
δ(n) = δ1 (n) · · · δm (n) = e.∗φ0 (v)
∆ T
∆W (n) = ηδ (k) (n)y (k−1) (n)
(k) ∆ (k) (k−1)
∆wji (n) = ηδi (n)yj (n)
Clear how to update output layer weights.
Remark 15.2 Notice that the above update formulas can be executed in a decentralized
(k) (k−1)
computing environment. That is, all of the information (ei , φ0 , vi , yj ) each neuron
needs to update its weights is available locally.
Question 15.1 How do we update weights for the hidden layer(s)?
Ans set up a credit assignment problem.

Revision : 2003.10
16 2003 09 26: Review of Homework 4.

Today we went over problem 4 of Homework 4. Notice also the update (dimensions) added
to the discussion of the Gauss-Newton iteration method, $ 13.2.
16.1 After class comments

A comment from students after class.
1. Consider problem 4 where we calculate the Jacobian: For this problem,
x̄(1)T
   
1 0 0
 x̄(2)T   1 0 1 
J = −
 x̄(3)T  =  1 1
  
0 
x̄(4)T 1 1 1
which is a constant matrix. That means that I can calculate J (for this case) before I
write
for probNum = 3:4

for jd = 1:3
etc.
Question 16.1 Would this be true if I wrote y = φ(W x̄)? (we’ll talk about that on
Monday)
2. Further, this particular J has three linearly independent rows/columns, so J T J is

always invertible. That means we don’t need to use δI (set δ = 0) and the method
converges for this problem in one step!
Question 16.2 Suppose again that I wrote y = φ(W x̄). How would that change the
invertibility of J?
16.2 Example in office

I added some detail to my solution for Homework 4 problem 2. Look for the footnotes to
see where the changes were made

Revision : 2003.10
17 2003 09 29: More review for Exam 1

Cover more of Hw 4, discuss “example” questions that I will make up before your very eyes.

Revision : 2003.10
18 2003 10 01 Exam 1
Scores
1. (20)
2. (20)
3. (20)
4. (20)
5. (20)
Lemma 18.1 Copied from 11.1, Let f (x) be a scalar function of a vector x ∈ IRnand let
∂f

∂x1
∂f ∆  .. 
g(W ) of a matrix W ∈ IRm×n . Define their respective partial derivatives as ∂x
=  . 
∂f
∂xn
∂g ∂g
 
∂w11
··· ∂w1n
and ∂g ∆
= .. .. ..
. . Then
 
∂W  . .
∂g ∂g
∂wm1
··· ∂wmn
∂f
1. f (x) = cT x =⇒ =c
∂x
1 ∂f
2. f (x) = xT Qx =⇒ = Qx
2 ∂x
∂g
3. g(W ) = xT W y =⇒ = xy T
∂W
1 ∂g
4. g(W ) = xT W T W x =⇒ = W xxT
2 ∂W

18.1 The questions Revision : 2003.10
18.1 The questions

1. Consider the data set shown below.
Exam 1 problem 1 example
class 0
class 1
5 division 1
division 2
4
2
x2(n) value
−1
−2
−3
−4
−6 −4 −2 0 2 4 6
x1(n) value
The plot above divides the 40 data points {x(n)}40 n=1 shown in the plot into two classes.
The division is based on the separating lines defined by the linear neural network
w1 T

−1 1 1
weights W = = ; that is, a data point x(n) is in class 1 if
−0.5 −1 1 w2 T
w1 T x̄(n) > 0 and w2 T x̄(n) > 0, otherwise it’s in class 0. Indicate which division line
(division 1 or division 2) corresponds to the weight vector w1 T x̄ = 0. Show your work,
either in mathematics or by labelling the diagram above.
Division number =

Exam 1 problem 1 example
class 0
class 1
5 division 1
division 2
4
2
x2(n) value
−1
−2
−3
−4
−6 −4 −2 0 2 4 6
x1(n) value
2. Can the 40 data points above be correctly classified using a single neuron (perhaps one
that uses a nonlinear activation function)? Explain why or why not.
If they can be correctly classified by a single neuron, give the corresponding activation
function and weights below.
Can it be done? =
Neuron function =
Activation function φ(v) =
Explain.

3. Consider the sigmoid activation function φ(v) − 1/(1 + e−av ) for some a > 0. Show
that dφ/dv = aφ(v)(1 − φ(v)).

4. Consider a neuron y = φ(W x̄) where W = 1 2 3 , where φ(v) is the sigmoid
T
function shown above and x̄ = 1 x1 x2 Define v = W x̄.
Fill in the boxes below with the correct expressions or numerical values. (I will accept
either.)
∂v
=
∂W
∂y
=
∂v
∂y
=
∂W


1 T T 2 1 T
5. Let f (x) = 2
x Qx + c x for Q = and c = 4 5 .
1 2
(a) Find the errors in the MATLAB code below that attempts to implement a steepest
descent iteration to find x minimizing f (x).
M-file 18.1 exam2003BrainDead.m
% m-file example (with errors) of
% steepest descent iteration
% to minimize
% f(x) = x’*[2, 1, 1, 2]*x/2 + [4;5]*x
x = [0;0] ; /* initial value of x */
eta = 10;
grad_x = x*Q - c;
for ii=1:100
x = x + eta*grad_x;
end
fprintf("x= %e %e\n", x(1), x(2));
(b) Find a vector x∗ such that f (x∗ = minx f (x). Show that ∇x f |x∗ = 0.
x∗ =

18.2 The answers Revision : 2003.10
18.2 The answers

1. Division 2. w T x̄ = −1 + x1 + x2 = 0 =⇒ x2 = −x1 + 1, which is the division 2 line.

2. Select W = −3 0 1 so that W x̄ > 0 for y > 3. Use threshhold activation
(McCullough-Pitts).
3. See the solution to homework 1.
∂v
= x̄T = 1 x1 x2

4.
∂W
∂y
= aφ(v)(1 − φ(v)
∂v
∂y
= aφ(v)(1 − φ(v))x̄T = aφ(W x̄)(1 − W x̄) 1 x1 x2

∂W

1 T T 2 1 T
5. Let f (x) = 2 x Qx + c x for Q = and c = 4 5 .
1 2
(a) M-file:
M-file 18.2 exam2003BrainDeadFix.m
% f(x) = x’*[2, 1, 1, 2]*x/2 + [4;5]*x
Q = [2,1 ; 1 2]; % error 2: Q and c not defined
c = [4;5];
x = [0;0] ; % error 3: C-type comment /* initial value of x */
eta = 0.01; % error 4: huge step size.
% error 5: wrong gradient formula, wrong location in the program
%grad_x = x*Q - c;
for ii=1:100
grad_x = Q*x + c; % error 6 continued: correct place and formula
x = x - eta*grad_x; % error 7: subtract gradient, not add
end
% error 8: replace double quotes (C style) with single
% quotes (MATLAB style).
fprintf(’x= %e %e\n’, x(1), x(2));

1
x∗ =
2

2 1 4
want 0 = ∇x f = x+
1 2 5
−1
∗ 2 1 4 1
=⇒ x = =−
1 2 5 2

2 1 1 4 4 4
Qx∗ + c = − =− + =0
1 2 2 5 5 5

Revision : 2003.10
19 2003 10 03 Multi-layered perceptrons: continued

19.1 Backpropagation algorithm: review of output layer
+1 +1 (bias)
W (1) .. .. W (2) .. ..
.. . . . .
.
y (0) v (1) y (1) v (2) y (2)

(b)
Previously seen in §19.1: output layer weights update with
W (i) φ
φ
E
..
. ..
.
ȳ (j) v̄ (i) ȳ (i)

Hodel notation
∂E(n) (k) (k−1)
(k)
= −ei (n)φ0 (vi (n))yj (n)
∂wij (n)
∆
δi (n) = ei (n)φ0 (vi (n))
∆ T
δ(n) = δ1 (n) · · · δm (n)
= e.∗φ0 (v)
∆ T
∆W (n) = ηδ (k) (n)y (k−1) (n)
(k) ∆ (k) (k−1)
(k) (k−1)

19.2 Backpropagation: Hidden layer update Revision : 2003.10
19.2 Backpropagation: Hidden layer update

W (2)
(1)
wij
..
. ..
.
ȳ (0) ȳ (1) v̄ (2) ȳ (2)
! ! !
(1) (1)
∂E(n) ∂E(n) ∂yi ∂vi
(1)
= (1) (1) (1)
∂wij (n) ∂yi ∂vi ∂wij
" ! ! ! !# ! !
(2) (2) (2) (1) (1)
X ∂E(n) ∂ek ∂yk ∂vk ∂yi ∂vi
= (2) (2) (2) (1) (1) (1)
k ∂ek ∂yk ∂vk ∂yi ∂vi ∂wij
!
(2) (2) (2) (1) (0)
X
= ek (−1)φ0 (vk )wki φ0 (vi )yj
k
!
(2) (2) (1) (0)
X
= − δki wki φ0 (vi )yj
k
∆ (1) (0)
= −δij yj
weighted sum of “next layer” δ (gradient) values

T
∆W (k) = ηδ (k) (n)y (k−1) (n)
where (
(k)
e(k) (n).∗φ0 (v (k) (n)) k= output layer
δ (n) =

(k+1) T (k+1) 0 (k)
W δ .∗φ (v (n)) k= hidden layer
Question 19.2 The perceptron convergence theorem states that the perceptron training al-
gorithm 13.1 will converge with initial weights W = 0. What happens to the back-propagation
algorithm of we initialize W (0) = 0 and W (1) = 0?

19.3 Exponential activation function Revision : 2003.10
19.3 Exponential activation function
1
φ(v) = a > 0 v ∈ IR
1 + e−av
ae−av e−av

0 a
φ (v) = = +
(1 + e−av )2 1 + e−av 1 + e−av
= aφ(v) (1 − φ(v))
Output layer update:
δ(n) = e(n).∗φ0 (v (kmax ) (n)) = a d(n) − y (kmax ) (n) .∗ y (kmax ) (n).∗ 1 − y (kmax ) (n)

Hidden neuron update in layer k

(k) (k) (k+1)
X
δj = φ0 (vj ) wij (n)(k+1) δi
i
(k) (k+1)
X
(k)
δ = φ 0
(vj ) wij (n)(k+1) δi
i
(k)
T

= ay .∗ 1 − y (k) .∗ W (k+1) δk+1
M-file 19.1 phi.m
function y = phi(x,a)
% sigmoid function with parameter a
y = 1 ./ (1 + exp(-a*x) );
M-file 19.2 dphi.m
function dy = dphi(x,a)
% derivative of sigmoid function with parameter a
y = phi(x,a);
dy = a * y .* (1 - y);
Example 19.1 Check derivative function:
M-file 19.3 derivS.m
% check exponential derivative function gradient

a = 1; N = 100; N1 = N-1;
xx = linspace(-5,5,N); xx1 = xx(1:N1);
phix = phi(xx,a); % original function
dphif = dphi(xx,a); % symbolic derivative
dphiN = diff(phix) ./ diff(xx); % numerical derivative

19.4 tanh activation function Revision : 2003.10
figure(1);
grid(); title(’Exponential sigmoid with a=1’);
plot(xx,phix, "-;phi;", xx, dphif, "-;symbolic dphi;", ...
xx1, dphiN,"-;numerical dphi;");
printeps("derivS.eps");
Exponential sigmoid with a=1
1
phi
symbolic dphi
0.9 numerical dphi
0.8
0.7
0.6
0.5
0.4
0.3
0.2
0.1
0
-6 -4 -2 0 2 4 6
19.4 tanh activation function
φ(v) = a tanh(bv)
=⇒ φ0 (v) = ab(1 − tanh2 (bv))
b
= (a − y)(a + y)
a
M-file 19.4 phiT.m
y = a*tanh(b*x);

M-file 19.5 dphiT.m
y = phiT(x,a,b);
dy = (b/a)*(a - y ) .* ( a + y );

Homework 5 Due Fri Oct. 10. Hwk

1. Consider the activation function φ(v) = a tanh(bv) (see M-file 19.4 (phiT.m)).
(a) Show that dφ(v)/dv = (b/a)(a − φ(v))(a + φ(v)).
(b) Write mex-file implementations of both phiT and dphiT, each listed below.
y = a*tanh(b*x);
y = phiT(x,a,b);
dy = (b/a)*(a - y ) .* ( a + y );
What to turn in Derivation (handwritten) for part (a). C-language source files
phiT.c and dphiT.c.
2. Consider the hyperbolic tangent activation function φ(v) = a tanh(bv) with parameters
a = 1.5 and b = 1.
(a) Write an m-file that uses your mex-files from problem 1 to perform steepest descent
iteration to compute
a single layer MLP AND gate. That is, compute a set of
weights W = w0 w1 w1 such that y = φ(W x̄) is a “good” approximation
of y = AND(x1 , x2 ). Your maximum error should be less than 10% for four data
points {x(n), d(n)}4n=1 specified in Homework 4. Plot your neural network netowrk
output over the range −0.5 ≤ x1 , x2 ≤ 1.5. You will need to modify the m-file M-
file 13.2 (meshPex.m) to use φ when it computes Yvals.
(b) Why does your m-file from problem 2a provide a better AND gate than your m-file
from Homework 4? In particular, explain why your AND gate can achieve 10%
worst-case error but the AND gate from Homework 4 could not.
What to turn in Source code m-files hwk502.m and meshPexT.m (your modified
version of meshPex.m), and a written answer to part (b). Your m-file hwk502.m
should end with the following lines:
fprintf(’tanh parameters: a=%12.4e b=%12.4e\n,a,b);
fprintf(’network weights: [ w0 w1 w2 ] = %12.4e %12.4e 512.4e\n, ...
W(1), W(2), W(3));
meshPexT(W,xx,yy,’input 1’,’input 2’,’AND gate output’);
3. Write a C-language mex file to multiply two matrices. That is, write a C-language
translation of the following m-file:
function C = matmul(A,B)
C = A*B
What to turn in C-language source file matmul.c.

Revision : 2003.10
20 2003 10 06: Single layer linear system i.d. example

20.1 Goal
Model motion of a blimp.
20.1.1 Vehicle data

Measure blimp position (m) in response to motor voltage commands (V):
Sample Input Blimp Sample Input Blimp
Number voltage position Number voltage position
u(n) p(n) u(n) p(n)
1 0 0.0 7 2 3.00
2 1 0.1 8 1 3.80
3 1 0.5 9 0 4.20
4 2 0.8 10 0 4.6
5 2 1.2 11 0 4.9
6 3 2.0 12 0 5.1
20.1.2 Assumption
There is a mathematical model
p(n + 1) = f (history of the universe)
that will accurately predict the motion of the blimp given sufficient information. We will
attempt to approximate f with a function of the last two values of p and u:
p̂(n + 1) = a0 p(n) + a1 p(n − 1) + b0 u(n) + b1 u(n − 1)

 
p(n)
 p(n − 1) 
= a 0 a 1 b0 b1   u(n) 

u(n − 1)
where p̂(n) is an estimate of p(n) based on the other measurements. In matrix form we can
write  
p(2) · · · p(11)
 p(1) · · · p(10) 
p̂(3) · · · p̂(12) = a0 a1 b0 b1  
 u(2) · · · u(11) 
u(1) · · · u(10)

Define d(n) = p(n + 2) and d = d(1) · · · d(10) . Similarly define
 
p(2) · · · p(11)
 p(1) · · · p(10) 
X = x(1) · · · x(n) =   u(2) · · · u(11) 

u(1) · · · u(10)

20.1 Goal Revision : 2003.10
and define
W = a 0 a 1 b0 b1 .
Then we want to minimize
1 1
E = kd − W Xk2 = (d − W X)(d − W X)T
2 2
The matrices d and X are obtained with m-file M-file 20.1 (sysIdEx1GetData.m) in Figure
11. Input data is plotted in Figure 12.
M-file 20.1 sysIdEx1GetData.m
function [d,X] = sysIdEx1GetData

% function [d,X] = sysIdEx1GetData
% get data for blimp sys id experiment
u = [0; 1; 1; 2; 2; 3; 2; 1; 0; 0; 0; 0];
p = [0; 0.1; 0.5; 0.8; 1.2; 2.0; 3.0; 3.8; 4.2; 4.6; 4.9; 5.1];
t = ((1:12)-1)*0.5;
for n=2:11
d(n) = p(n+1);
X(:,n) = [p(n);p(n-1);u(n);u(n-1)];
end
% from here down is plotting code

fn = figure; subplot(2,1,1); plot(t,u,’x’);
title(’original data from blimp experiment’); grid on;
xlabel(’time’); ylabel(’voltage input’);
subplot(2,1,2); plot(t,p,’X’); grid on;

xlabel(’time’); ylabel(’blimp position’);
eval(sprintf(’print -depsc sysIdEx1%.4d.eps’,fn));
return
Figure 11: M-file to construct input data matrices for blimp sys id experiment.

20.1 Goal Revision : 2003.10
original data from blimp experiment

3
2.5
voltage input
1.5
0.5
0
0 1 2 3 4 5 6
time
5
blimp position
0
0 1 2 3 4 5 6
time
Figure 12: Input data for blimp sys id experiment.

20.2 Solution method 1: direct solution Revision : 2003.10
20.2 Solution method 1: direct solution

We can also write
10
1 1X
E = kd − W Xk2 = (d(n) − W x(n))T (d(n) − W x(n))
2 2 n=1
10
1X
= d(n)T d(n) − 2x(n)T W T d(n) + x(n)T W T W x(n))
2 n=1
−1
Take the gradient and set it to zero to get Wopt = d · X † = dX T XX T . Calculations
performed in M-file M-file 20.2 (sysIdEx1Wopt.m), in Figure 13. Fit quality shown in Figure
14.
Remark 20.1 This approach will not work if we model p̂(n + 1) = φ(W x(n)) due to the
nonlinearity of the activation function.
M-file 20.2 sysIdEx1Wopt.m
function Wopt = sysIdEx1Wopt(d,X)

% compute pseudo-inverse (optimal) solution and plot results
Wopt = d*pinv(X);
return
Figure 13: Calculations for optimal data fit (linear model)

20.2 Solution method 1: direct solution Revision : 2003.10
optimal: err= 3.4392e−01

6
desired output
actual output
error
5
−1
1 2 3 4 5 6 7 8 9 10 11
sample number
Figure 14: Results of optimal data fit.

20.3 Solution method 2: Steepest descent iteration Revision : 2003.10
20.3 Solution method 2: Steepest descent iteration

In each iteration compute
X
∇W E = (W x(n))x(n)T − d(n)x(n)T
n
Result plots shown in Figure 16. M-file is in Figure 15.
M-file 20.3 sysIdEx1GradOpt.m
function Wgrad = sysIdEx1GradOpt(d,X)

% use gradient descent to get optimal solution
maxIter = 1e3;
eta = 3e-3;
Wmat = zeros(maxIter,4); % maxIter iterations with 4 parameters each
ErrV = zeros(maxIter,1);
for iter = 1:(maxIter-1);
Wn = Wmat(iter,:);
ErrV(iter) = norm(d - Wn*X);
gn = zeros(size(Wn)); % compute gradient with ALL data points
for nn = 1:length(d)
xn = X(:,nn);
gn = gn + (-d(nn)*xn’ + (Wn*xn)*xn’);
end
Wn = Wn - eta*gn;
Wmat(iter+1,:) = Wn;
end
ErrV(maxIter) = norm(d - Wmat(maxIter,:)*X);
Wgrad = Wmat(maxIter,:);
kk = 1:maxIter;
fn = figure; subplot(2,1,1); plot(kk,Wmat,’-’);
xlabel(’iteration’); legend(’a_0’,’a_1’,’u_0’,’u_1’);
title(’steepest descent parameters’);
grid on
subplot(2,1,2); plot(kk,ErrV,’-’); xlabel(’iteration’);
ylabel(’Error’); grid on;
return
Figure 15: M-file calculations for steepest descent iteration.

20.3 Solution method 2: Steepest descent iteration Revision : 2003.10
steepest descent parameters

0.7
a0
0.6 a1
0.5 u0
u1
0.4
0.3
0.2
0.1
0
0 100 200 300 400 500 600 700 800 900 1000
iteration
12
10
8
Error
0
0 100 200 300 400 500 600 700 800 900 1000
iteration
steepest descent: err= 3.4954e−01

6
desired output
actual output
error
5
−1
1 2 3 4 5 6 7 8 9 10 11
sample number
Figure 16: Steepest descent parameter values during iteration and resulting output error.

20.4 Solution method 3: Backpropagation Revision : 2003.10
20.4 Solution method 3: Backpropagation

Neural network (approximate steepest descent) method. Instead of computing
X
∇W E = (W x(n))x(n)T − d(n)x(n)T
n
we apply an update for each point individually and use a very small step size η so that after
N = 10 steps we approximate the above gradient. M-file is in Figure 17. Results are in
Figure 18.
M-file 20.4 sysIdEx1nnOpt.m
function Wnn = sysIdEx1nnOpt(d,X)

% use real gradient descent to get optimal solution
maxIter = 1000;
eta = 3e-3;
Wmat = zeros(maxIter,4); % maxIter iterations with 4 parameters each
ErrV = zeros(maxIter,1);
nn = 0;
for iter = 1:(maxIter-1);
Wn = Wmat(iter,:);
ErrV(iter) = norm(d - Wn*X);
% compute (some of) gradient using SINGLE data points
nn = nn+1; if(nn > length(d)), nn=1; end;
xn = X(:,nn);
gn = -d(nn)*xn’ + (Wn*xn)*xn’;
Wn = Wn - eta*gn;
Wmat(iter+1,:) = Wn;
end
Wnn = Wmat(maxIter,:);
ErrV(maxIter) = norm(d - Wnn*X);
kk = 1:maxIter;
fn = figure; subplot(2,1,1); plot(kk,Wmat,’-’);
title(’backpropagation parameters’);
xlabel(’iteration’); legend(’a_0’,’a_1’,’u_0’,’u_1’); grid on;
subplot(2,1,2); plot(kk,ErrV,’-’); xlabel(’iteration’);
ylabel(’Error’); grid on;
return
Figure 17: Approximate steepest descent (backpropagation) m-file. Notice that only one
data point is used to compute gn, which means that gn is no longer the gradient.

20.4 Solution method 3: Backpropagation Revision : 2003.10
backpropagation parameters
0.7
a0
0.6 a1
0.5 u0
u1
0.4
0.3
0.2
0.1
0
0 100 200 300 400 500 600 700 800 900 1000
iteration
12
10
8
Error
0
0 100 200 300 400 500 600 700 800 900 1000
iteration
backpropagation: err= 3.6749e−01

6
desired output
actual output
error
5
−1
1 2 3 4 5 6 7 8 9 10 11
sample number
Figure 18: Approximate steepest descent (backpropagation) output results

20.5 Summary of results Revision : 2003.10
20.5 Summary of results

Comparison of parameters of all three methods.
M-file 20.5 sysIdEx1.m
% system i.d. example

function [d,X,Wopt,Wgrad, Wnn] = sysIdEx1
[d,X] = sysIdEx1GetData;
% compute pseudo-inverse (optimal) solution and plot results

Wopt = sysIdEx1Wopt(d,X);
sysIdEx1PlotErr(d,X,Wopt, ...
sprintf(’optimal: err=%12.4e’,norm(d-Wopt*X)));
% compute full gradient (iteration) solution and

% compare to optimal
Wgrad = sysIdEx1GradOpt(d,X);
sysIdEx1PlotErr(d,X,Wgrad, ...
sprintf(’steepest descent: err=%12.4e’,norm(d-Wgrad*X)));
% compute neural-net type iterative solution

% and compare to optimal
Wnn = sysIdEx1nnOpt(d,X);
sysIdEx1PlotErr(d,X,Wnn, ...
sprintf(’backpropagation: err=%12.4e’,norm(d-Wnn*X)));
fprintf(’Wopt: %12.4e %12.4e %12.4e %12.4e\n’, Wopt(1), Wopt(2), ...

Wopt(3), Wopt(4));
fprintf(’Wgrad: %12.4e %12.4e %12.4e %12.4e\n’, Wgrad(1), Wgrad(2), ...
Wgrad(3), Wgrad(4));
fprintf(’Wnn: %12.4e %12.4e %12.4e %12.4e\n’, Wnn(1), ...
Wnn(2), Wnn(3), Wnn(4));
return
Wopt: 8.0891e-01 2.6466e-01 2.7104e-01 9.3968e-02

Wgrad: 6.1854e-01 4.6948e-01 2.8291e-01 1.3563e-01
Wnn: 5.9144e-01 4.9120e-01 2.3935e-01 1.9136e-01

Revision : 2003.10
21 2003 10 08: MLP

21.1 Derivation
21.1.1 Output layer
Look at an individual neuron:
W (i) φ
φ
E
..
. ..
.
ȳ (j) v̄ (i) ȳ (i)

Hodel notation
m
!
(k) (k) (k) (k−1)
X
yi (n) = φ(vi (n)) = φ wij (n)yj (n)
j=0
Apply the chain rule - repeatedly!

! ! !
(k) k)
∂E(n) ∂E(n) ∂ei (n) ∂yi (n) ∂vi (n)
(k)
= (k) (k) (k)
∂wij (n) ∂ei (n) ∂yi (n) ∂vi (n) ∂wij (n)
Let k = kmax (looking at output neurons). From equation (15.1) and ej = dj − yj we have
∂E(n)
= ei (n) → ∇e(n) E(n) = e(n)
∂ei (n)
∂ei (n)
(k)
= −1
∂yi (n)
(k)
∂yi (n) (k)
(k)
= φ0 (vi (n))
∂vi (n)
(k)
∂vi (n) (k−1)
(k)
= yj (n)
∂wij (n)
and so
∂E(n) (k) (k−1)
(k)
= −ei (n)φ0 (vi (n))yj (n)
∂wij (n)

21.1 Derivation Revision : 2003.10
From last page: output layer weights update with

W (i) φ
φ
E
..
. ..
.
ȳ (j) v̄ (i) ȳ (i)

Hodel notation
∂E(n) (k) (k−1)
(k)
= −ei (n)φ0 (vi (n))yj (n)
∂wij (n)
∆
δi (n) = ei (n)φ0 (vi (n))
∆ T
δ(n) = δ1 (n) · · · δm (n) = e.∗φ0 (v)
∆ T
∆W (n) = ηδ (k) (n)y (k−1) (n)
(k) ∆ (k) (k−1)
(k) (k−1)

21.1 Derivation Revision : 2003.10
21.1.2 Hidden layer(s)

W (2)
(1)
wij
..
. ..
.
ȳ (0) ȳ (1) v̄ (2) ȳ (2)
! ! !
(1) (1)
∂E(n) ∂E(n) ∂yi ∂vi
(1)
= (1) (1) (1)
∂wij (n) ∂yi ∂vi ∂wij
" ! ! ! !# ! !
(2) (2) (2) (1) (1)
X ∂E(n) ∂ek ∂yk ∂vk ∂yi ∂vi
= (2) (2) (2) (1) (1) (1)
k ∂ek ∂yk ∂vk ∂yi ∂vi ∂wij
!
(2) (2) (2) (1) (0)
X
= ek (−1)φ0 (vk )wki φ0 (vi )yj
k
!
(2) (2) (1) (0)
X
= − δki wki φ0 (vi )yj
k
∆ (1) (0)
= −δij yj
weighted sum of “next layer” δ (gradient) values

T
∆W (k) = ηδ (k) (n)y (k−1) (n)
where (
(k)
e(k) (n).∗φ0 (v (k) (n)) k= output layer
δ (n) = (k+1) T (k+1)

0 (k)
W δ .∗φ (v (n)) k= hidden layer
Question 21.2 The perceptron convergence theorem states that the perceptron training al-
gorithm 13.1 will converge with initial weights W = 0. What happens to the back-propagation
algorithm of we initialize W (0) = 0 and W (1) = 0?

21.2 Example (not covered in class) Revision : 2003.10
21.2 Example (not covered in class)

Train a sigmoid-based 2-layer MLP to classify points inside the unit square as class 1, other
points as class 0.
Desired output
0.9
0.8
0.7
0.6
0.5
0.4
0.3
0.2
0.1
2
1.5
2
1 1.5
0.5 1
0 0.5
0
−0.5 −0.5
−1 −1
21.2.1 Utility m-files

M-file 21.1 nnsig.m
function [y,h] =nnsig(x,w1,w2,a)
% function [y,h] =nnsig(x,w1,w2,a)
% return hidden layer values h and output layer values y for
% weights w1 and w2 with input x
% activation function is sigmoid with parameter a
% bias value of 1 is appended to both x and hidden layer value vector
v1 = w1*[1;x]; % append bias

h = phi(v1,a);
v2 = w2*[1;h]; % append bias

y = phi(v2,a);
M-file 21.2 neteval.m

function [om, em] = neteval(xx, yy, dm, W1, W2,a)
% function [om, em] = neteval(xm, ym, dm, W1, W2,a)
% evaluate network at all sample points
% dm = nx x ny matrix of desired output values
% use exponential function with parameter a
om = zeros(size(dm));
em = dm;
nx = length(xx);
ny = length(yy);
for ii=1:nx
for jj=1:ny
y0 = [xx(ii);yy(jj)];
om(jj,ii) = nnsig(y0,W1, W2,a);
end
end
em = dm - om;
21.2.2 C-language implementation

C-file 21.5 layerprop.c
/* layerprop.c: #include this file as needed in mex file source */
/* propagate through one layer of a sigmoid multi-layer perceptron */

static void
layerProp(double * yy, const double * ww, const double *xx,
const int mw, const int nw, const double a)
{
int ii, jj;
for ( ii = 0 ; ii < mw ; ii++)

{
/* copy bias term */
yy[ii] = ww[ii];
/* matrix multiply with row ii of ww */

for ( jj = 0 ; jj < nw; jj++)
{
yy[ii] += ww[ii + (jj+1)*mw] * xx[jj];
}

/* sigmoid function */
yy[ii] = 1.0/( 1.0 + exp(-a*yy[ii]));
}
}
C-file 21.6 nnsig.c
/* nnsig.c: sigmoid basde neural network evaluator (two layer) */

#include <math.h>
#include "mex.h"

static void
const int mw, const int nw, const double a)
{
int ii, jj;
for ( ii = 0 ; ii < mw ; ii++)

{
/* copy bias term */
yy[ii] = ww[ii];

for ( jj = 0 ; jj < nw; jj++)
{
yy[ii] += ww[ii + (jj+1)*mw] * xx[jj];
}
/* sigmoid function */
yy[ii] = 1.0/( 1.0 + exp(-a*yy[ii]));
}
}

{
double *xx, *yy, *hh, aa, *w1, *w2;
int mx, nx, mw1, nw1, mw2, nw2;

if (nrhs != 4)
{

mexPrintf("Received %d args, need 4", nrhs);

mexErrMsgTxt("\nError");
return;
}
else if (nlhs != 2)
{
mexPrintf("%d outputs requested, need 2", nlhs);
return;
}
/* load arguments, check dimensions */

mx = mxGetM(prhs[0]);
nx = mxGetN(prhs[0]);
mw1 = mxGetM(prhs[1]);
nw1 = mxGetN(prhs[1]);
w1 = mxGetPr(prhs[1]);
aa = *mxGetPr(prhs[3]);
if(nx != 1)
{
mexPrintf("x (%d x %d) must be a column vector",mx,nx);
return;
}
if( mx != nw1-1 )
{
mexPrintf("x (%d) w1 (%d x %d) incompatible",
mx, mw1, nw1);
return;
}
if( mw1 != nw2-1 )
{
mexPrintf("w1 (%d x %d), w2 (%d x %d) incompatible",
mw1, nw1, mw2, nw2);
return;
}
if(mw2 < 1 )

{
mexPrintf("w2 (%d x %d) has no outputs", mw2, nw2);
return;
}
plhs[0] = mxCreateDoubleMatrix(mw2, 1, mxREAL); /* y */

plhs[1] = mxCreateDoubleMatrix(mw1, 1, mxREAL); /* h */
yy = mxGetPr(plhs[0]);
hh = mxGetPr(plhs[1]);
/* v1 = w1*[1;x]; % append bias

* h = phi(v1,a);
* v2 = w2*[1;h]; % append bias
* y = phi(v2,a);
*/
/* copy bias terms to h and y */

layerProp(hh,w1,xx,mw1,nw1,aa);
layerProp(yy,w2,hh,mw2,nw2,aa);
}
C-file 21.7 neteval.c
/* neteval.c: evaluate quality of sigmoid NN fit */

#include <math.h>
#include "mex.h"

static void
const int wRows, const int wCols, const double a)
{
int ii, jj;
for ( ii = 0 ; ii < wRows ; ii++)

{
yy[ii] = ww[ii]; /* copy bias term */

for ( jj = 0 ; jj < wCols; jj++)
yy[ii] += ww[ii + (jj+1)*wRows] * xx[jj];
yy[ii] = 1.0/( 1.0 + exp(-a*yy[ii])); /* sigmoid function */

}
}

{
char errMsg[1000];
double *xx, *yy, *dm, aa, *w1, *w2, *om, *em, y0[2], *y1, *y2;
int xrows, xcols, yrows, ycols, drows, dcols, mw1, nw1, mw2, nw2, ii, jj;

if (nrhs != 6)
{
return;
}
else if (nlhs > 2)
{
return;
}
/* load arguments, check dimensions */

xrows = mxGetM(prhs[0]);
xcols = mxGetN(prhs[0]);
yrows = mxGetM(prhs[1]);
ycols = mxGetN(prhs[1]);
yy = mxGetPr(prhs[1]);
drows = mxGetM(prhs[2]);
dcols = mxGetN(prhs[2]);
dm = mxGetPr(prhs[2]);
if( xcols != 1 || ycols != 1 || drows != xrows || dcols != yrows)

{

21.3 Simple backprop without any preprocessing Revision : 2003.10
sprintf(errMsg,"x(%d x %d), y(%d x %d), d(%d x %d), incompatible",

xrows, xcols, yrows, ycols, drows, dcols);
return;
}
/* memory allocation rules: "If you get it out, put it back" */

y1 = mxCalloc( mw1, sizeof(double) );
y2 = mxCalloc( mw2, sizeof(double) );
plhs[0] = mxCreateDoubleMatrix(drows, dcols, mxREAL); /* output surf */

plhs[1] = mxCreateDoubleMatrix(drows, dcols, mxREAL); /* error surf */
om = mxGetPr(plhs[0]);
em = mxGetPr(plhs[1]);
/* evaluate each point in the database */

for( ii = 0 ; ii < xrows ; ii++)
{
for (jj = 0 ; jj < yrows ; jj++)
{
y0[0] = xx[ii];
y0[1] = yy[jj];
layerProp(y1,w1,y0,mw1,nw1,aa);
layerProp(y2,w2,y1,mw2,nw2,aa);
om[ii+ jj*drows] = y2[0];
em[ii+ jj*drows] = dm[ii+jj*drows] - y2[0];
}
}
/* memory allocation rules: "If you get it out, put it back" */

mxFree(y1);
mxFree(y2);
}
21.3 Simple backprop without any preprocessing

Source code is in M-file 21.3 (backPropSig1.m). Results are shown in Figure 19–23.10
M-file 21.3 backPropSig1.m
mex nnsig.c
mex neteval.c
10
See also backPropSig.mov at http://www.eng.auburn.edu/users/hodelas/teaching.

rand(’seed’,pi); % seed random number generator
% construct target function

nx = 9;
xx = linspace(-1, 2, nx)’;
ny = 7;
yy = linspace(-1, 2, ny)’;
[xm,ym] = meshgrid( xx, yy);
dm = 0.1 + 0.9*( xm >= 0 & xm <=1 & ym >= 0 & ym <= 1);
% network: use (nh) hidden nodes and one output node

% bias introduced in input layer
a = 5; % activation function parameter
nh = 9; no = 1; % network dimensions
W1 = 5*(rand(nh,3)-0.5);
W2 = 5*(rand(no,nh+1)-0.5);
[om, em] = neteval(xx, yy, dm, W1, W2,a);

figure(3);
mesh(xx,yy, om);
title(’Initial surface’);
print -depsc backPropSig_3.eps
figure(4);
mesh(xx,yy,em);
title(’Initial error surface’);
% keep a record of the error function

icnt = 1; Errv(icnt) = norm(em,’fro’)^2;
icnt = icnt+1; % icnt = "iteration count"

eta = 0.01; % execute backprop algorithm with eta selected
maxIter = 2000;
for iter = 0:maxIter
fprintf(’%4d: err=%12.4e\n’,iter,Errv(icnt-1));
for ii=1:nx
for jj = 1:ny
y0 = [xx(ii); yy(jj)]; % input vector
[y2,y1] = nnsig(y0, W1, W2, a);
v1 = W1*[1;y0];
v2 = W2*[1;y1];
% perform backprop step

e = dm(jj,ii) - y2;
d2 = -e .* dphi(v2,a);
DW2 = d2*[1;y1]’;
W2_times_d2 = W2’ * d2;

d1 = -(W2_times_d2(2:(nh+1))) .* dphi(v1,a) ;
DW1 = d1*[1;y0]’;
W2 = W2 - eta*DW2;
W1 = W1 - eta*DW1;
% evaluate error surface

[om,em] = neteval(xx,yy,dm, W1, W2,a);
Errv(icnt) = norm(em,’fro’)^2;
icnt = icnt + 1;
end
end
end
figure(6); mesh(xx,yy, om);
title(sprintf(’Output surface iteration %d’,iter));
axis([-1,2,-1,2,0,1.1]);
text(-0.5,2,0.8,sprintf(’current error =%12.4g’,Errv(icnt-1)));
text(1.5,-0.5,0.1,’x_1 value’);
text(-0.5,1.5,0.1,’x_2 value’);
% plot boundary lines, compute color based on slope of line

xp = linspace(-5,5,100)’;
yp = zeros(100,nh);
for ip = 1:nh
if(W1(ip,3) == 0)
denom = 1e-6;
else
denom = W1(ip,3);
end
yp(:,ip) = -(W1(ip,2)*xp + W1(ip,1))/denom;
linAng = atan2(W2(ip+1)*W1(ip,3),W2(ip+1)*W1(ip,2))*180/pi;
if(-45 < linAng & linAng <= 45), pcolor{ip} = ’g-’; %left
elseif(45 < linAng & linAng <= 135), pcolor{ip} = ’c-’; %bottom
elseif(135 < linAng & linAng <= 225), pcolor{ip} = ’r-’; %right
elseif(-135 < linAng & linAng <= -45), pcolor{ip} = ’b-’; %top
elseif(-225 < linAng & linAng <= -135), pcolor{ip} = ’r-’; %right
end
end

figure(8);
plot( ...
xp,yp(:,1),pcolor{1}, xp,yp(:,2),pcolor{2}, ...
xp,yp(:,9),pcolor{9}, ...
-5:4,W2,’o’);
text(-5,4.0,sprintf(’iteration %d’,iter));
text(-5,3.6,sprintf(’Blue : top border’));
text(-5,3.2,sprintf(’Green: left border’));
text(-5,2.8,sprintf(’Red : right border’));
text(-5,2.4,sprintf(’Cyan : bottom border’));
for ip = 1:nh
% label line at far left side
idx = max(find(abs(yp(:,ip)) < 4.8) );
text(xp(idx),yp(idx,ip),sprintf(’line %d’,ip));
end
for ip = 0:nh
text(ip-5,W2(ip+1)+0.2,sprintf(’W2(%d)’,ip));
end
% x-axis labels
for lp = -6:2:4
text(lp,-4.5,sprintf(’%d’,lp));
end
% y-axis labels
for lp = -5:1:5
text(-6.1,lp,sprintf(’%d’,lp));
end
xlabel(’x1 value’);
ylabel(’x2 value/W2’);
title(’Layer 1 linear separation boundaries’);
grid on;
axis([-5,5,-5,5]);
axis(’equal’);
figure(7); plot(Errv); grid on;
title(sprintf(’Error function per backprop step’))

Initial surface
0.998
0.996
0.994
0.992
0.99
0.988
0.986
0.984
2
1.5
2
1 1.5
0.5 1
0 0.5
0
−0.5 −0.5
−1 −1
Figure 19: Simple backpropagation example: initial neural network output surface.

Initial error surface
0.4
0.2
−0.2
−0.4
−0.6
−0.8
−1
2
1.5
2
1 1.5
0.5 1
0 0.5
0
−0.5 −0.5
−1 −1
Figure 20: Simple backpropagation example: initial neural network output surface error
d(x1 , x2 ) − y(x1 , x2 )

Output surface iteration 400
current error = 6.268

0.8
0.6
0.4
0.2
x2 value
0 x1 value
2
1.5
2
1 1.5
0.5 1
0 0.5
0
−0.5 −0.5
−1 −1
Figure 21: Simple backpropagation example: output surface after 200 passes through all
data points.

Layer 1 linear separation boundaries

5 5
line 6
line 5
4 4 iteration 400
Blue : top border
Green: left border
3 3
Red : right border
Cyan : bottom border
2 2
1 1
x2 value/W2
W2(9)
0 0 W2(0) W2(1) W2(2) W2(6) W2(8)
W2(7)
W2(3) line 9
−1 −1 W2(5)
W2(4) line 1 line

line 4
2
−2 −2 line 3
−3 −3
line 7
−4 −4
−6 −4 −2 0 2 4
line 8
−5 −5
−6 −4 −2 0 2 4 6
x1 value
Figure 22: Simple backpropagation example: linear region separating boundaries after 200
iterations
.

Error function per backprop step

100
90
80
70
60
50
40
30
20
10
0
0 0.5 1 1.5 2 2.5 3 3.5 4 4.5 5
4
x 10
Figure 23: Simple backpropagation example: error function over all iterations

Revision : 2003.10
22 2003 10 10: Techniques to improve training - 1

22.1 Homework 5 solution
Solution
22.1.1 Derivations and plots

1a
ebv − e−bv 1 − e−2bv
φ(v) = a tanh(bv) = a bv = a
e + e−bv 1 + e−2bv
(1 + e−2bv )(2be−2bv ) + (1 − e−2bv )(2be−2bv )

dφ
= a
dv (1 + e−2bv )2
(1 + e−2bv ) + (1 − e−2bv ))

−2bv 1 + φ(v)/a
= 2bae −2bv 2
= 2bae−2bv
(1 + e ) 1 + e−2bv
2ae−2bv

= b(1 + φ(v)/a)
1 + e−2bv
2e−2bv

= (b/a)(a + φ(v))a +1−1
1 + e−2bv
−2bv
+ 1 + e−2bv − 1 − e−2bv

2e
= (b/a)(a + φ(v))a
1 + e−2bv
−2bv
+ 1 + e−2bv − 1

e
= (b/a)(a + φ(v))a
1 + e−2bv
e−2bv − 1

= (b/a)(a + φ(v))a 1 +
1 + e−2bv
= (b/a)(a + φ(v))a (1 − φ(v)/a)
and the result follows.
1b Compare these to C-file 6.1 (mextanh.c) , C-file 6.2 (simpletanh.c) and C-file 7.3
(phiExV.c) . My solutions allow the input x to be a vector. I will accept solutions
where your code can only handle scalars (as done in C-file 6.1 (mextanh.c) and
C-file 6.2 (simpletanh.c) ). Source code for this problem is listed at the end of these
solutions.
2a You can use either steepest descent or backpropagation. My solutions show both; they are
listed at the end of the solutions. AND gate output in Figure 24.
hwk502.m Output
steepest descent:
tanh parameters: a= 1.5000e+00 b= 1.0000e+00
network weights: [ w0 w1 w2 ] = -3.8891e+00 2.5818e+00 2.5818e+00

22.1 Homework 5 solution Revision : 2003.10
AND output
1.5
0.5
−0.5
−1
−1.5
1
0.8 1
0.6 0.8
0.4 0.6
0.4
0.2
0.2
0 0
input 2
input 1
Figure 24: AND gate output for Homework 5.
backprop:
tanh parameters: a= 1.5000e+00 b= 1.0000e+00
network weights: [ w0 w1 w2 ] = -4.1842e+00 2.7874e+00 2.7875e+00
>>
2b The hyperbolic tangent function allows the training function to “bend” the plane from
Homework 4 to match the shape of the AND gate data.
3 Source code is at the end of these solutions. Of course my solution matched MATLAB’s
output perfectly.
22.1.2 Source code: problem 1

C-file 22.8 phiT.c
#include <math.h>
#include "mex.h"
static void phiT (double *yy, const double *xx, const double aa,
const double bb, int len)
{
int ii;

for ( ii = 0 ; ii < len ; ii++)

yy[ii] = aa * tanh (bb*xx[ii]);
}
void mexFunction (int nlhs, mxArray * plhs[], int nrhs,
const mxArray * prhs[])
{
double *yy, *xx, aa, bb;
int len, mm, nn;
if (nrhs != 3) /* Check for proper number of arguments */
{
mexPrintf("phiT: Got %d inputs\n", nrhs);
mexErrMsgTxt ("Need 3 input arguments");
}
else if (nlhs > 1)
{
mexPrintf("Got %d outputs\n", nlhs);
mexErrMsgTxt ("Need one.");
}
xx = mxGetPr (prhs[0]);
mm = mxGetM ( prhs[0]);
nn = mxGetN ( prhs[0]);
if(mm != 1 && nn != 1)
{
mexPrintf("xx(%d x %d) must be a vector\n",mm,nn);
mexErrMsgTxt ("phiT: error");
}
plhs[0] = mxCreateDoubleMatrix (mm, nn, mxREAL);
yy = mxGetPr (plhs[0]);
bb = *mxGetPr(prhs[2]);
phiT (yy, xx, aa, bb, mm*nn);
}
C-file 22.9 dphiT.c
#include <math.h>
#include "mex.h"
static void phiT (double *yy, const double *xx, const double aa,
{
int ii;
for ( ii = 0 ; ii < len ; ii++)
yy[ii] = aa * tanh (bb*xx[ii]);
}

static void dphiT (double *yy, const double *xx, const double aa,
{
int ii;
phiT(yy,xx,aa,bb,len);
for ( ii = 0 ; ii < len ; ii++)
yy[ii] = (bb/aa) * (aa - yy[ii]) * ( aa + yy[ii]);
}
void mexFunction (int nlhs, mxArray * plhs[], int nrhs,
{
double *yy, *xx, aa, bb;
int len, mm, nn;
if (nrhs != 3) /* Check for proper number of arguments */
{
mexPrintf("dphiT: Got %d inputs\n", nrhs);
mexErrMsgTxt ("Need 3 input arguments");
}
else if (nlhs > 1)
{
mexPrintf("Got %d outputs\n", nlhs);
mexErrMsgTxt ("Need one.");
}
xx = mxGetPr (prhs[0]);
if(mm != 1 && nn != 1)
{
mexPrintf("xx(%d x %d) must be a vector\n",mm,nn);
mexErrMsgTxt ("dphiT: error");
}
yy = mxGetPr (plhs[0]);
bb = *mxGetPr(prhs[2]);
dphiT (yy, xx, aa, bb, mm*nn);
}

% compile mex files

mex phiT.c
mex dphiT.c

a = 1.5; b = 1; % phiT parameters

x1 = linspace(0,1,25); % for plotting with mexPexT
x2 = x1;
maxIter = 1e5; % max iterations and step size
eta = 0.01;
X = [1, 1, 1, 1; 0, 1, 0, 1 ; 0, 0, 1, 1]; % set up training data

d = 0.95*[-a, -a, -a, a];
% run steepest descent iteration

W = [0,0,0];
maxErr = max(abs(d - phiT(W*X,a,b)));
iter = 0;
while( maxErr > 0.1*max(abs(d)) & iter < maxIter )
iter = iter+1;
gn = zeros(size(W));
for nn=1:4
xn = X(:,nn);
vv = W*xn;
gn = gn + dphiT(vv,a,b)*((W*xn)*xn’ - d(nn)*xn’);
end
W = W - eta*gn;
end
maxErr1 = maxErr;
figure(1)
meshPexT(W,x1, x2, a,b,’input 1’,’input 2’,’AND output (steepest)’);
fprintf(’\nsteepest descent:\ntanh parameters: a=%12.4e b=%12.4e\n’,a,b);

fprintf(’network weights: [ w0 w1 w2 ] = %12.4e %12.4e %12.4e\n’, ...
W(1), W(2), W(3));
% run backpropagation iteration

W = [0,0,0];
iter = 0;
nn = 0;
while( maxErr > 0.1 & iter < maxIter )
iter = iter+1;
nn = nn + 1; % get new data point number
if ( nn > 4 )
nn = 1;

end
% do backpropagation "gradient" step
xn = X(:,nn);
vv = W*xn;
gn = dphiT(vv,a,b)*((W*xn)*xn’ - d(nn)*xn’);
W = W - eta*gn;
end
figure(2);
fprintf(’backprop:\ntanh parameters: a=%12.4e b=%12.4e\n’,a,b);
fprintf(’network weights: [ w0 w1 w2 ] = %12.4e %12.4e %12.4e\n’, ...
W(1), W(2), W(3));
meshPexT(W,x1, x2, a,b,’input 1’,’input 2’,’AND output’);
print -depsc hwk2003_0502.eps
M-file 22.2 meshPexT.m
function Yvals = meshPexT(W,x1,x2,a,b,xstr, ystr, tstr)

% function Yvals = meshPexT(W,x1,x2,a,b,xstr, ystr, tstr)
% compute and/or plot output of a single layer linear neural network
% function of two variables
% inputs: W (1 x 2): network weights
% x1, x2: each vector xx = [x1(ii); x2(jj)] for appropriate values of
% ii, jj
% a, b: arguments for phiT()
% xstr, ystr, tstr: strings for xlabel, ylabel, and title, respectively
% if all three are passed, then the mesh plot is plotted.
% if not, these arguments are ignored
doMeshPlot = 0;
if(nargin == 8)
if( isstr(xstr) & isstr(ystr) & isstr(tstr) )
doMeshPlot = 1;
end
end
% could check dimensions, etc., but I’m not going to
for ii=1:length(x1)
for jj=1:length(x2)
xbar = [1;x1(ii);x2(jj)];
Yvals(jj,ii) = phiT(W*xbar,a,b);
end
end
if(doMeshPlot)
mesh(x1,x2,Yvals);

xlabel(xstr); ylabel(ystr); title(tstr); grid on;

end

C-file 22.10 matmul.c
#include <math.h>
#include "mex.h"
/* aa (mm x pp ), bb (pp x nn ), -> cc (mm x nn ) */
static void
matmul (double *cc, const double *aa, const double *bb,
const int mm, const int nn, const int pp)
{
int ii, jj, kk;
for ( ii = 0 ; ii < mm ; ii++)
{
for ( jj = 0 ; jj < nn ; jj++ )
{
cc[ii + mm*jj] = 0.0;
for ( kk = 0 ; kk < pp ; kk++)
{
cc[ ii + mm*jj ] += aa[ii + mm*kk] * bb [ kk + jj*pp];
}
}
}
}
void
mexFunction (int nlhs, mxArray * plhs[], int nrhs,
{
double *cc, *aa, *bb;
int mm, pp, nn;
/* should argument checking - check dimensions; I skip that here */

aa = mxGetPr (prhs[0]);
pp = mxGetN ( prhs[0]);
bb = mxGetPr (prhs[1]);
cc = mxGetPr (plhs[0]);
matmul (cc, aa, bb, mm,nn,pp);
}

M-file 22.3 matmultest.m
mex matmul.c
a = [1 2 3 ; 4 5 6 ; 7 8 10];
b = [5 6 ; 7 8 ; 1 2 ];
c = matmul(a,b)
chk = a*b
err = c - chk

22.2 Techniques to improve training Revision : 2003.10
backPropRand.m Output
backPropRand: randomize order of samples

Present data set 1
Present data set 9
Present data set 8
Present data set 2
Present data set 10
Present data set 5
Present data set 3
Present data set 7
Present data set 4
Present data set 6
Figure 25: Output of m-file backPropRand in Example 22.1.
22.2 Techniques to improve training

Read §4.4–4.6
22.3 Activation function

Use φ(v) = tanh(v) or something similar so that φ can take on positive and negative values.
22.4 Randomize sample order

Randomize order of sample data {(x(n), d(n))}N
n=1 in each iteration. Avoids limit cycles.
Example 22.1 M-file 22.4 backPropRand.m
fprintf(’\nbackPropRand: randomize order of samples\n’);

N = 10; % ten samples
[jnk,idx] = sort(rand(N,1));
for ii=1:N
this_idx = idx(ii);
fprintf(’Present data set %d\n’, this_idx);
end
Results in Figure 25.

Revision : 2003.10

23.1 Present “worst case” data
Present pair (x(n), d(n)) with worst output error. Contrast of worst case (k·k∞ ) vs least
squres k·k2 .
Remark 23.1 Doesn’t work very well without careful treatment.
23.2 Momentum term - generalized delta rule

Batch processing attempts to minimize E over all input data simultaneously:
1 X̀
Eav = E(x(`))
N `=1
and results in an average gradient
1 X̀
∆W (k) (n) = η ∆W k (n; x` )
N
`=1
where ∆W k (n; x` ) is the gradient due to x` with the weight set W (k) (n).
As an alternative to batch processing, select
T
∆W (k) (n) = α∆W (k) (N − 1) + ηδ (k) (n)y (k−1) (n)
Result: n
(k)
X T
∆W (n) = η αn−` δ (k) (`)y (k−1) (`)
`=1
Selection of parameters:
1. if α is big, then you may get slow convergence.
2. If α is small, you may get noisy behavior.
Idea Select α so that αN is “significant”, but so that α2N is small (e.g., αN = 0.2 or
smaller). This results in behavior similar to “batch” processing (see text).
T
Selection of η: suppose that δ (k) (n)y (k−1) (n) were constant. Then we’d like for ∆W (k) (n) →
T
δ (k) (n)y (k−1) (n) , =⇒ η = 1 − α. (Consider D.C. gain of transfer function η/(z − α)).

23.2 Momentum term - generalized delta rule Revision : 2003.10
Homework 6 Due Wed Oct. 22. Hwk
1. MATLAB Pattern classification with MLP’s
(a) Obtain the all files from the class ftp site
in directory hwk6. Run the m-file makeMex in MATLAB. This will compile a
number of C-language functions for you. Equivalent m-file functions are included
in hwk6 so that you can see what the C-functions do. Relevant functions are
M-file 19.4 (phiT.m), M-file 19.5 (dphiT.m) (mex) solutions to your last home-
work; M-file 4.3 (mlpT.m) (mex) Compute the output of a two layer perceptron
with hyperbolic tangent activation functions; M-file 4.2 (mlpEvalT.m) (mex) eval-
uate a two layer perceptron with user specified weights W1 and W2 over all data
points in a given data set, returns the network outputs and the correspoding er-
rors; M-file 25.3 (backPropStep.m) execute one back propagation step M-file 25.2
(mlpTrain.m) (m-file) Repeatedly calls the above functions to train a two layer
MLP to match a given data set; M-file 25.4 (mlpNormalize.m) (m-file) Perform
statistical normalization on training data; M-file 9.2 (learnTaskEx1s.m) Example
of a linear classifier network worked earlier this semester in class (sine, sawtooth,
and square wave); hwk6P1.m: template for the solution to problem 1 of this home-
work. hwk6P2.m: m-file for analysis in problem 2 of this homework.
(b) Modify the code in hwk6P1.m to train the neural network as a pattern classi-
fier as was done in M-file 9.2 (learnTaskEx1s.m). Email your completed m-file
hwk6P1.m to Mr. Simmons.
N
1 X ∆
2. Recall that we normalize input data by computing the mean x̄ = x(n) = E[x(n)]
N i=1
N
1 X ∆
and covariance Σx = (x(n) − x̄)(x(n) − x̄)T = E[(x(n) − x̄)(x(n) − x̄)T ] of a
N i=1
data set {x(n)}N
n=1 . The random number generator randn in MATLAB generates in-
dependent, identically distributed pseudo-random Gaussian variables with mean 0 and
variance 1.
(a) Suppose you were given a data set {x(n)}N

n=1 of vectors x(n) of length 3 that were
generated by randn. What would you expect x̄ and Σx to be? Explain.
(b) Run the m-file hwk6P2.m included in the ftp directory from problem 1. Explain
any differences between your predicted values from part (a) and those computed
by this m-file.
3. Problem 4.4, p. 252 in [Hay99].

Revision : 2003.10

24.1 Statistically normalize the data
For a set of N variables x, define
N
∆1 X
E[x] = x(n)
N n=1
x̄ = E[x]
Σx = E[(x − x̄)(x − x̄)T ]
We want to normalize {(x(n), d(n))}N n=1 to be zero-mean and unit (identity matrix) covari-
ance. Compute
y (0) (n) = (Σx )1/2 (x(n) − x̄)
where Σx = (Σx )1/2 (Σx )1/2 is computed from the singular value decomposition. As a result,
T
E[y (0) ] = 0 and E[y (0) y (0) ] = I.
Example 24.1 M-file 24.1 backPropNormTest.m
xn = 7;
yn = 9;
x = linspace(-1,2,xn);
y = linspace(-1,2,yn);
N = xn*yn;
fprintf(’%d data points\n’,N);
idx = 1; % construct data matrix

XX = zeros(2,N);
for ii = 1:xn
for jj = 1:yn
XX(1:2,idx) = [x(ii); y(jj)];
idx = idx+1;
end
end
% get mean, covariance of data in XX

[sigX, mX] = backPropNormalize(XX);
fprintf(’sigX: %12.4e %12.4e\n’,sigX(1,1), sigX(1,2));
fprintf(’ %12.4e %12.4e\n’,sigX(2,1), sigX(2,2));
fprintf(’mX: %12.4e\n’,mX(1));
fprintf(’ %12.4e\n’,mX(2));
YY = sigX\( XX - mX*ones(1,size(XX,2)) );
sigY = YY * YY’ / N;
fprintf(’sigY: %12.4e %12.4e\n’,sigY(1,1), sigY(1,2));
fprintf(’ %12.4e %12.4e\n’,sigY(2,1), sigY(2,2));

24.2 Selection of initial weights Revision : 2003.10
backPropNormTest.m Output
63 data points
sigX: 1.0000e+00 0.0000e+00
0.0000e+00 9.6825e-01
mX: 5.0000e-01
5.0000e-01
sigY: 1.0000e+00 -5.2868e-18
-5.2868e-18 1.0000e+00
M-file 24.2 backPropNormalize.m
function [sigX,mX] = backPropNormalize(XX)

% [sigX,mX] = backPropNormalize(XX)
% given a m x N matrix of data X = [ x(1) , x(2) , ... ]
% compute mean and normalizing matrix so that y = sigX\(x - mX) implies
% E [y] = 0 and E [ y y’ ] = I
[m,N] = size(XX); % shift XX to be zero mean

mX = mean(XX’)’;
% scale to be unit variance

XX = XX - mX*ones(1,N);
SigX = XX * XX’/N; % same as sum x(n) * x(n)’
[u,s] = eig(SigX); % SigX symmetric -> eig() and svd() do the same thing
sigX = u*sqrt(s)*u’;
24.2 Selection of initial weights

Suppose ȳ (0) (0) = 0 and Σy( 0) (0) = I. Define w̄ (k) (n) = vec W (k) (n) , the vector stack of

W (k) (n); in MATLAB:
[m2,m1] = size(Wk); wbar = reshape(Wk,m2*m1,1);
Drop superscripts for now since they’re clear by context. Suppose we select weights W so
that
E[wij ] = 0
and
σw2 i = k, j = `

E[wij wk` ] =
0 else

24.3 Adjust learning rates Revision : 2003.10
What does this mean about E[v (1) ]? Suppose that W and y are statistically independent.
Then
E[v] = E[W y] = E[W ]E[y] = 0

" m m
# m X
m
X X X
E[vi vk ] = E wij yj wk` y` = E [wij wk` yj y` ]
j=1 `=1 j=1 `=1
m
X m
X
2 2 2
= E wij yj = E wij = mσw2
j=1 j=1
So choose σw2 = 1/m to get unit variance on v.
24.3 Adjust learning rates

We want the gradients at each layer to be similar in magnitude; scale η (k) ’s so that ∆W (k) ’s
are roughly the same.

Revision : 2003.10
25 2003 10 17 Example: training with the unit square

Example 25.1 Repeat the unit-square neural network problem Hwk 4 using the backprop-
agation algorithm to find the network weights. Apply some or all of the methods in this
section.
Techniques used:
1. Set initial bias values to zero.
2. Scale inputs to be zero mean, unit variance
3. Scale outputs to lie within “linear region” of activation function
4. use tanh activation function
5. Increase number of iterations
6. Keep η step size bounded above by 1 − α momentum “d.c. gain.”
7. Randomize order of inputs in each iteration
8. Select equal numbers from each class so that the output variable is also uniformly
distributed (alternatively: scale outputs as described above)
backPropEx2.m Output
Found /Users/hodelas/aub/6240/dphiT.mexmac.
Found /Users/hodelas/aub/6240/mexPrintMat.mexmac.
Found /Users/hodelas/aub/6240/mlpEvalT.mexmac.
Found /Users/hodelas/aub/6240/mlpT.mexmac.
Found /Users/hodelas/aub/6240/phiT.mexmac.
ans =
success
Keeping 81 data points from each class

iteration 50; error = 487.8313

Revision : 2003.10

training time: 142.203429 s
sigX:(2 x 2) =
4.3523e-01 -2.4166e-03
-2.4166e-03 5.1434e-01
mX:(1 x 2) =
5.1190e-01 5.1709e-01
sigY =
mY =

Revision : 2003.10
6.3050e-17
W1 =
0.2265 -0.2593 -0.1067

0.0620 -0.0302 -0.0603
-0.0040 -0.0845 0.0535
-0.1333 -0.0143 0.0068
0.1099 -0.0857 0.2027
-0.0037 -0.0689 -0.0940
-0.1170 -0.2602 -0.2454
0.2079 0.0299 -0.2889
-0.0431 -0.1600 -0.0015
0.0266 0.0372 0.0794
W2 =
Columns 1 through 7
-0.1739 0.4423 -0.6291 -0.2248 0.4981 0.3003 0.0400
Columns 8 through 11
-0.1679 0.3461 -0.2438 -0.1559
>>
M-file 25.1 backPropEx2.m
% Backpropagation example. backprop to fit unit square problem

makeMex
rand(’seed’,2);
a = sqrt(3); % parameters for activation functin phiT
b = 3;
nx = 29; % range of evaluation
xx = linspace(-1,2,nx);
ny = 27;
yy = linspace(-1, 2, ny);
[xm,ym] = meshgrid(xx, yy); % compute output function
dm = (-a + 2*a*( xm >= 0 & xm <=1 & ym >= 0 & ym <= 1))*0.9;

Revision : 2003.10
% repackage data for call to training/normalization codes

% format: row n = [x1(n), x2(n), d(n)]
mlpData = [ reshape(xm,nx*ny,1), reshape(ym,nx*ny,1), reshape(dm,nx*ny,1)];
C1 = find(mlpData(:,3) > 0); % use same # of samples for 0 as for 1
C2 = find(mlpData(:,3) < 0);
nC1 = length(C1);
nC2 = length(C2);
nkeep = min(nC1, nC2);
fprintf(’Keeping %d data points from each class\n’,nkeep);
[jnk,idx] = sort(rand(1,nC2));
idx = idx(1:nkeep);
C2keep = C2(idx);
myData= mlpData([C1,C2keep],:);
ni = 2; % network dimensions:
nh = 10; % use 5 hidden nodes and one output node
no = 1;
[inData,sigX, mX, sigY, mY] = mlpNormalize(myData,ni, no); % normalize data
% learning step size and momentum parameter

eta = 0.0002/(b*nkeep);
alpha = exp(log(0.2)/(2*nkeep)); % 2*nkeep points in database
alpha = 0.2;
maxIter = 2000;
startTime = clock;
[W1, W2, Yv, Errv, ErrHist] = mlpTrain(inData, ni, nh, no, eta, ...
alpha, a, b, maxIter);
trainingTimeSeconds = etime(clock,startTime);
% --- ALL PLOTTING FROM HERE ---

fprintf(’training time: %f s\n’,trainingTimeSeconds);
fn = 1; figure(fn); plot(xx,phiT(xx,a,b),’-’); grid on ;
legend(sprintf(’-;phiT(x,%f,%f);’,a,b));
title(’activation function used’); grid on;
eval(sprintf(’print -depsc backPropEx2_%d.eps’,fn));
fn = fn+1; figure(fn); title(’Desired output’); mesh(xx,yy,dm);
fn = fn+1; figure(fn); plot(mlpData(C2keep,1), mlpData(C2keep,2),’x’);
title(’kept points outside the box’); grid on ;
fn = fn+1; figure(fn); plot(ErrHist); title(’Error history’); grid on;

Revision : 2003.10
% plot final error surface

for ix = 1:nx
for iy=1:ny
xn = sigX\([xx(ix); yy(iy)] - mX’);
zm(iy,ix) = mY + sigY*mlpT(xn,W1,W2,a,b);
end
end
fn = fn+1; figure(fn); mesh(xx,yy,zm)
title(’NN output surface - 200 iterations’); xlabel(’input 1’);
ylabel(’input 2’); eval(sprintf(’print -depsc backPropEx2_%d.eps’,fn));
fn = fn+1; figure(fn); mesh(xx,yy,(dm - zm))
title(’NN error surface - 200 iterations’)
xlabel(’input 1’); ylabel(’input 2’);
fprintf(’sigX:’); mexPrintMat(sigX);
fprintf(’mX:’); mexPrintMat(mX);
sigY
mY
W1
W2
• Activation function is shown in Figure 26.
• The desired output is in Figure 27.
• Haykin’s derivation of Neural Network performance assumes that inputs and outputs
are uniformly distributed. Our sample space is strongly biased toward points outside
the box. While the training algorithm works with all data points included in the sample
set, it works pretty well by keeping only selected points in the training data set. The
randomly chosen data values from outside the square are shown in Figure 28.
• The sum of the squared errors in each iteration (epoch in Haykin’s book) is shown in
Figure 29.
An epoch involves presentation of all “kept” data points to the neural network and
their corresponding backpropagation steps. Convergence is pretty much done by 150-
200 iterations (of 162 backpropagation steps each).
• The resulting fit to the data is much more pleasing in this case than in the original;
see Figures 30 – 31.

Revision : 2003.10
activation function used

2
−;phiT(x,1.732051,3.000000);
1.5
0.5
−0.5
−1
−1.5
−2
−1 −0.5 0 0.5 1 1.5 2
Figure
√ 26: Activation function used in Example 25.1. Training was also done with φ(x) =
3 tanh(6x). Steep slope permits a better potential fit to the discontinuous underlying
function, but the iteration did not converge after 20,000 iterations.
1.5
0.5
−0.5
−1
−1.5
−2
2
1.5
2
1 1.5
0.5 1
0 0.5
0
−0.5 −0.5
−1 −1
Figure 27: Desired output function (discontinuous!) for example 25.1.

Revision : 2003.10
kept points outside the box

2
1.5
0.5
−0.5
−1
−1 −0.5 0 0.5 1 1.5 2
Figure 28: Randomly chosen data values from outside the unit square used for training the
neural network in Example 25.1.
Error history
800
700
600
500
400
300
200
100
0
0 500 1000 1500 2000 2500
Figure 29: Squared error sum in training iteration for example 25.1.

Revision : 2003.10
NN output surface − 200 iterations
1.5
0.5
−0.5
−1
−1.5
−2
2
1.5
2
1 1.5
0.5 1
0 0.5
0
−0.5 −0.5
−1 −1
input 2
input 1
Figure 30: Trained network surface for example 25.1.
NN error surface − 200 iterations
1.5
0.5
−0.5
−1
−1.5
−2
−2.5
2
1.5
2
1 1.5
0.5 1
0 0.5
0
−0.5 −0.5
−1 −1
input 2
inin input 1
Figure 31: Trained network error surface for example 25.1.

Revision : 2003.10
The subroutines used by this task are at the class ftp site.
M-file 25.2 mlpTrain.m
function [W1, W2, Yv, ErrV, ErrHist ] = mlpTrain(inData, nx, nh, no, eta, alpha, a, b, m
% function [W1, W2, Yv, ErrV, ErrHist ] =
% mlpTrain(inData, nx, nh, no, eta, alpha, a, b, maxIter)
% perform backpropagation training on a two layer MLP
[N, inM] = size(inData);

if(inM ~= nx + no)
error(’mlpTrain: inData[%d x %d] has nx=%d, no=%d incompatible’, ...
inM, N, nx, no);
end
% initalize weight matrices

W1 = rands(nh, nx+1)/sqrt(nh+1);
W2 = rands(no, nh+1)/sqrt(no+1);
W1(:,1) = 0; W2(:,1) = 0; % start with zero bias
% calculate initial error

[Yv,ErrV] = mlpEvalT(inData, W1, W2, a,b);

icnt = 1;
ErrHist(icnt) = norm(ErrV,’fro’)^2;
icnt = icnt+1;
% execute backprop algorithm with eta selected

% initialize momentum terms
DeltaW1 = zeros(size(W1)); DeltaW2 = zeros(size(W2));

if(iter == 50*floor(iter/50))
fprintf(’iteration %5d; error = %12.4f\n’,iter, norm(ErrV,’fro’)^2);
end
% randomize input order
[jnk,nidx] = sort(rand(1,N));
for ni=1:N
nn = nidx(ni);
y0n = inData(nn,1:nx)’;
dn = inData(nn,nx+(1:no));
[W1, W2, DeltaW1, DeltaW2] = backPropStep(y0n, dn, a, b, ...
alpha, eta, W1, W2, DeltaW1, DeltaW2);
end

Revision : 2003.10
[Yv,ErrV] = mlpEvalT(inData, W1, W2, a,b);

ErrHist(icnt) = norm(ErrV,’fro’)^2;
icnt = icnt + 1;
end
M-file 25.3 backPropStep.m

function [W1, W2, DeltaW1, DeltaW2] = backPropStep(y0n, dn, a, b, ...
alpha, eta, W1, W2, DeltaW1, DeltaW2);
%function [W1, W2, DeltaW1, DeltaW2] = backPropStep(y0n, dn, a, b, ...
% alpha, eta, W1, W2, DeltaW1, DeltaW2);
% back propagation step with generalized delta rule for two layer NN.
% NOTE: This m-file has been translated to C
% output and hidden layer values

[y2,y1] = mlpT(y0n, W1, W2, a,b);
e = dn - y2;
d2 = e .* dphiT(v2,a,b);
DW2 = d2*[1;y1]’;
DW2 *= max(1, 0.01*norm(e)/(norm(DW2)+1e-3));
W2_times_d2 = W2’ * d2;

nh = size(W1,1);
d1 = (W2_times_d2(2:(nh+1))) .* dphiT(v1,a,b) ;
DW1 = d1*[1;y0n]’;
DW1 *= max(1, 0.01*norm(e)/(norm(DW1)+1e-3));
DeltaW1 = alpha*DeltaW1 + eta*DW1;

DeltaW2 = alpha*DeltaW2 + eta*DW2;
W2 = W2 + DeltaW2;
W1 = W1 + DeltaW1;
M-file 25.4 mlpNormalize.m

function [inData,sigX, xm, sigY, ym] = mlpNormalize(mlpData,nx, ny)
% [inData,sigX, mX, sigY, mY] = mlpNormalize(mlpData);
% normalize data for neural network output
% revised network is:
% yy = mY + sigY* neuralNetwork( ..., SigX\(x - mX) );
%
% FIXME: this function assumes that mlpData is uniformly distributed.

Revision : 2003.10
xData = mlpData(:,1:nx);
yData = mlpData(:,nx+(1:ny));
N = size(mlpData,1);
% mean values
xm = mean(xData);
ym = mean(yData);
xData = xData - ones(N,1)*xm;
yData = yData - ones(N,1)*ym;
% covariance matrices
if(N > nx)
sigX = xData’*xData/N;
xData = xData/sigX;
else
sigX = 1;
end
if(N > ny)
sigY = 1;
yData = yData/sigY;
else
sigY = eye(ny);
end
% normalized data
inData = [xData,yData];

25.1 Another example: bad output normalization Revision : 2003.10
1.5
0.5
−0.5
−1
−1.5
−2
1
0.5 4
2
0
0
−0.5
−2
−1 −4
Figure 32: Target surface for example 25.2.
25.1 Another example: bad output normalization

Example 25.2 Plant model: y = sin(x1 ) + x2 . Surface plotted in Figure 32.
Mild modifications to M-file 25.1 (backPropEx2.m); Results: Error history is shown in
Figure 33.
backPropEx3.m Output


trainingTimeSeconds =
287.0925
sigX =
3.5249 -0.0000
-0.0000 0.3590
mX =
1.0e-17 *
0.4537 -0.0851
sigY =
mY =
-1.1627e-17
W1 =
0.0005 0.9481 -0.2460

0.0008 0.3885 -0.0492
-0.0001 -0.4829 0.0632
-0.0002 0.1094 -0.3366
0.0029 0.5427 -0.0768
-0.0005 -0.5568 0.0802
0.0007 0.3137 -0.0431
-0.0003 -0.5402 0.0757
-0.0006 0.0354 0.1529
-0.0013 1.9706 -0.0630

W2 =
Columns 1 through 7
-0.0010 -0.8242 -0.4163 0.6490 0.4644 -0.5167 0.7112
Columns 8 through 11
-0.3279 0.7088 -0.0569 1.9855
>>
% Backpropagation example. backprop to fit unit square problem

% compile backPropStep mex file
% required files are:
% parameters for activation functin phiT

a = sqrt(3); b = 1; % activation function parameters
nx = 29; xx = linspace(-pi,pi,nx); % theta

ny = 27; yy = linspace(-1, 1, ny); % volts

dm = sin(xm) + ym;
mesh(xm,ym,dm)
fn = 0;
plot(xx,phiT(xx,a,b),’-’);
legend(sprintf(’-;phiT(x,%f,%f);’,a,b));
title(’activation function used’)
grid on;
title(’Desired output’);
mesh(xx,yy,dm);
% data format: column 1 = x, column 2 = y, column 3 = d

mlpData = [ reshape(xm,nx*ny,1), reshape(ym,nx*ny,1), reshape(dm,nx*ny,1)];

myData= mlpData;
% network: use 10 hidden nodes and one output node

ni = 2; nh = 10; no = 1;
rand(’seed’,pi); % initialize random number generator
[inData,sigX, mX, sigY, mY] = mlpNormalize(myData,ni, no); % normalize data
% learning step size and momentum parameter

eta = 0.002; alpha = 0.1; maxIter = 600;
% time how ling this takes.

startTime = clock;
[W1, W2, Yv, Errv, ErrHist] = mlpTrain(inData, ni, nh, no, eta, alpha, ...
a, b, maxIter);
trainingTimeSeconds = etime(clock,startTime)
plot(ErrHist);
title(’Error history’);
% plot final error surface

for ix = 1:nx
for iy=1:ny
xn = sigX\([xx(ix); yy(iy)] - mX’);
zm(iy,ix) = mY + sigY*mlpT(xn,W1,W2,a,b);
end
end
mesh(xx,yy,zm)
title(’NN output surface - 200 iterations’)
xlabel(’input 1’);
ylabel(’input 2’);
mesh(xx,yy,(dm - zm))
title(’NN error surface - 200 iterations’)
xlabel(’input 1’);
ylabel(’input 2’);

25.2 Decision rules Revision : 2003.10
Error history
600
500
400
300
200
100
0
0 100 200 300 400 500 600 700
Figure 33: Error history for example 25.2.

sigX
mX
sigY
mY
W1
W2
Output surface is in Figure 34. Notice that the choice of a in M-file 25.5 (backPropEx3.m)
results in the inability to match the data at the extreme upper and lower bounds on y. The
error surface is plotted in Figure 35.
25.2 Decision rules

Read §4.7

25.3 Feature detection/hidden neurons Revision : 2003.10
NN output surface − 200 iterations
1.5
0.5
−0.5
−1
−1.5
−2
1
0.5 4
2
0
0
−0.5
−2
−1 −4
input 2
input 1
Figure 34: Output surface for example 25.2. Compare to Figure 32.
Classification: how to interpret network output: Assign class i if yik̄ (n) > yjk̄ (n) for all
j 6= i. Confidence? How close are the competitors? How close is yik̄ (n) to quantized value?
25.3 Feature detection/hidden neurons

Suppose a network has k̄ layers. Then output neuron values are
mk̄−1
(k̄) (k̄) (k̄−)
X
yi (n) = wij yj (n)
j=0

25.3 Feature detection/hidden neurons Revision : 2003.10
NN error surface − 200 iterations
0.4
0.3
0.2
0.1
−0.1
−0.2
−0.3
−0.4
1
0.5 4
2
0
0
−0.5
−2
−1 −4
input 2
input 1
Figure 35: Error surface for example 25.2.

Revision : 2003.10
26 2003 10 20: Radial Basis Function Networks

26.1 Project proposal guidelines
All class projects are to be turned in to Dr. Hodel not later than 5:00 pm on Wed, Dec. 3,
2003. Your draft project proposal is to be turned in to Dr. Hodel at the start of class on
Wed Oct 27, 2003. Your project must be approved by Dr. Hodel not later than Nov 5. Un-
dergraduate projects will be permitted to be less ambitious than graduate student projects.
However, undergraduates may choose to meet graduate student project requirements. The
guidelines below are representative only; students may propose other types of projects if
they wish, subject to instructor approval.
All project materials (manuals and software) should be submitted to Dr. Hodel in elec-
tronic form. Proposals may be printed out or submitted in electronic form. Printed proposals
should be put in my mailbox or given to me in class. Printed proposals that are submitted
in other fashions (e.g., slid under my door) will be torn up and thrown away.
Proposal written documentation should be submitted in either LaTeX source code or
Microsoft Word. Other word processor formats will be accepted only with prior approval.
26.1.1 Undergraduates
Undergraduate projects may be of the following two types:
1. Write a technical review of a technical paper on neural networks. The article must be
approved by Dr. Hodel, and should come from an IEEE journal (Transactions on Neural
Networks, Transactions on Automatic Control, Control Systems Magazine, other IEEE
conferences or journals, or some other professional society journal/conferences (e.g.,
ASME, AIAA, etc.). etc. Your review should either include
(a) a technical discussion of the contribution of the paper (why is it better/worse

than other methods? How does it compare to what we know in class?) or
(b) a moderate discussion of techniques and a computer implementation demonstrat-
ing the proposed method. Your software (matlab m-files are fine) should be com-
mented or else it should have accompanying description in the form of a manual.
2. Update the neural network library functions in nnLib.c and nnLib.h. Possible ideas
here are to update backPropStep so that will work for any number of layers (not just
two), updating and testing the radial basis function codes, etc. This kind of project
should include
(a) Documentation (a manual) of the revised software and libraries.

(b) Test code to demonstrate that the software works correctly.
(c) Source code - commented!

26.1 Project proposal guidelines Revision : 2003.10
26.1.2 Graduates
Graduate student projects will involve at least the following three elements: (1) A written
manual/report. (2) Software implementations relevant to the project. (3) Test data used for
training and software verification.
Project subjects may be selected related to a student’s thesis research; discuss this
opportunity with your advisor. Otherwise, your project should involve some level of
complexity comparable to the example ideas listed in the next subsection. I will be very
flexible on the nature of the project, but it must include (1) a general problem statement,
(2) a mathematical discussion of the solution technique, (3) a software solution of the general
problem, and (4) an evaluation of the quality of solution.
Success is not required; what is required is a thorough discussion and understanding of
the techniques and results.
26.1.3 Some project ideas

• (graduate) Optical character recognition of the digits 0-9.
• (graduate/ugrad) Voice recognition with a vocabulary of at least 5 words.
• (graduate) Simulation of real-time adaptation / system identification. For example,
model an inverted pendulum as θ̇ = ω and ω̇ = (sin(θ)+v(t))/J where J is the moment
of inertia, θ is the bar position, and v(t) is a motor control voltage input. It is desired
for the bar to track the command θref = t (rotate at a constant angular velocity).
Three possible tasks (ugrads can do any one of these, grads should try at least two,
preferably 3)
1. Design a working control law; train a neural network in simulation to accurately
model the behavior of the bar. (On-line training, not off-line.) I will call this the
model neural network, or MNN.
2. Design a neural network that trains on-line (not batch design) in an attempt
to control the bar position so that the closed loop dynamics are θ̇ = ω and
ω̇ = θref − θ. I will call this the control neural network, or CNN.
3. Combine the above two parts together: The CNN should update its weights at
each time step based on the behavior predicted by the MNN.
• (grad/ugrad) Duplication and/or verification of results in a technical paper on neural
networks.
• (grad/ugrad) Modify backpropagation training to include some “competititive learn-
ing” - reinforce the strongest connections, move the remaining neurons in the opposite
direction: Winner: wi = wi + ηx(n) (Hebbian learning). Losers: wj = wj − ηwi for all
j 6= i. How does this compare to normal backpropagation?
• (ugrad) Write a manual for nnLib.c, nnLib.h so that future students can use your
manual in this course, or Fix one or more of the broken routines (radial basis function
routines) in nnLib.c and nnLib.h. (Get instructor approval first.)

26.1 Project proposal guidelines Revision : 2003.10
Homework 7 Due Mon Oct. 27. Hwk
Notice: Exam 2 Will be on Monday Nov 3. Same rules as on the last exam, except that
you may bring a ruler so that you can draw a straight line.
1. written Recall that we perform data normalization by selecting

−1/2
y (0) = ΣX (x(n) − µX )
∆ 1 PN ∆ −1/2
where µX = E[x] = N n=1 x(n), ΣX = E[(x − µX )(x − µX )T ], and the matrix ΣX
is selected so that
−1/2 −1/2 −1/2 −1/2
ΣX ΣX ΣX = Σ X ΣX ΣX =I
Show that µy(0) = 0 and Σy(0) = I.
2. written Define b(1) as the bias vector in layer 1 of a multilayer perceptron so that
v (1) = W (1) y (0) + b(1) . Consider the effect of data normalization on the output y (1) of
the hidden layer:
y (1) (n) = φ v (1) (n)

= φ W (1) y (0) (n) + b(1)

−1/2
= φ W (1) ΣX (x(n) − µX ) + b(1)
Notice that if we select

−1/2
W̃ (1) = W (1) ΣX
and
−1/2
b̃(1) = b1 − ΣX µX
then v (1) = W̃ (1) x(n) + b̃(1) . That is, we can train a neural network directly off of the
raw data x and get exactly the same result as we got with the normalized data.
Is this analysis correct? If not, explain my error. If it is correct, then explain why we
normalize data in the first place.

26.2 Separability Revision : 2003.10
26.2 Separability
Read §5.1-5.2 in Simon Haykin. Neural Networks: A Comprehensive Foundation. Prentice
Hall, 2nd edition, 1999
MLP’s use y = φ(w T x) results in linear/planar equipotential surfaces, resulting in linear

separability.
Theorem 26.1 Cover, 1965: A complex pattern-classification problem nonlinearly cast in

a high-dimensional space is more likely to be linearly separable than in a low-dimensional
space.
Linearly separable =⇒ easy to solve.
Question 26.1 What if we use y = w T φ(x)?
Definition 26.1 Given a vector valued function φ : IRm → IRn and a vector w ∈ IRn , the
corresponding separating surface X = X (φ, w) is
X (φ, w) = {x : w T φ(x) = 0}
Definition 26.2 Classes C1 , C2 are φ-separable if there exists a vector w such that x ∈
C∞ =⇒ w T φ(x) < 0 and x ∈ C∈ =⇒ w T φ(x) > 0.
Idea Map inputs nonlinearly into hidden layer, then linearly to output layer.
Remark 26.1 Many possibilities of separating functions φ:
1. linear (as in MLP)
2. quadratic (or higher order polynomial)
3. hyperspheres
Fact 26.1 The more hidden nodes you have, the more likely your data is φ−separable.
Remark 26.2 §5.2 in the text discusses the probability of φ-separability in terms of a
binomial expansion and Bernoulli trials. We will not address this analysis in this course.

2
2 e−(x1 −1)
Example 26.1 x ∈ IR . With φ(x) = 2 :
e−(x2 −0.5)
M-file 26.1 radialEx00.m
% radialEx00.m
nx = 25; x1 = linspace(-5,5,nx); % set of data points
ny = 27; x2 = linspace(-5,5,ny);
w = [1;2]; % weight vector (picked arbitrarily)
x0 = [1;0.5]; % center of radial functions
zz = zeros(nx,ny); % compute surface values
for ii = 1:nx
for jj = 1:ny
xx = [x1(ii); x2(jj)] - x0;
zz(ii,jj) = w’ * exp( - xx .* xx );
end
end
% plot surface and equipotential surfaces
mesh(x1, x2, zz’); title(’Radial basis function example’);
printeps(’radialEx00a.eps’);
contour(x1, x2, zz’, 5); title(’Radial basis function example’);
printeps(’radialEx00b.eps’);
Radial basis function example
2.5
1.5
0.5
0
5
0
0
−5 −5


5
−1
−2
−3
−4
−5
−5 −4 −3 −2 −1 0 1 2 3 4 5

Many other choices: e.g., monomials:
Example 26.2 Monomial example:
% radialEx01.m
nx = 25; x1 = linspace(-5,5,nx); % set of data points
ny = 27; x2 = linspace(-5,5,ny);
w = [-1;1]; % weight vector (picked arbitrarily)
x0 = [1;0.5]; % center of radial functions
for ii = 1:nx
for jj = 1:ny
% calculate bizarre monomial for this example
xx = [x1(ii)*x2(jj); x1(ii) + x2(jj)] - x0;
zz(ii,jj) = w’ * xx;
end
end
30
20
10
−10
−20
−30
−40
5
0
0
ii −5 −5


5
−1
−2
−3
−4
−5
−5 −4 −3 −2 −1 0 1 2 3 4 5

Example 26.3 Exclusive or revisited:
% radialEx02.m: Exclusive or revisited; See Haykin ’99 Example 5.1

nx = 25; x1 = linspace(0,1,nx); % set of data points
ny = 27; x2 = linspace(0,1,ny);
w = [1;1]; % weight vector (picked arbitrarily)
t1 = [1;1]; t2 = [0;0];
for ii = 1:nx
for jj = 1:ny
xx = [x1(ii); x2(jj)];
phix = [ exp( -norm(xx-t1)^2 ); exp( -norm(xx-t2)^2 ) ];
zz(ii,jj) = 1 - w’ * phix;
end
end
0.3
0.2
0.1
−0.1
−0.2
−0.3
−0.4
1
0.8 1
0.6 0.8
0.4 0.6
0.4
0.2
0.2
0 0


1
0.9
0.8
0.7
0.6
0.5
0.4
0.3
0.2
0.1
0
0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

Revision : 2003.10
27 2003 10 22: Radial Basis Functions (2)

Homework 6 solution
Solution
1. MATLAB Pattern classification with MLP’s
(a) Code download.

(b) My solution and output:
% parameters for activation functin phiT
a = 1.5; b = 1; % activation function parameters
tt = (0:0.01:5)’;
XX = [sinewave, sawtooth, square];
dd = eye(3);
mlpData = [ XX’, dd’]; % data format: each row is [x(n)’, d(n)’]
ni = length(tt); nh = 10; no = 3;
[inData,sigX, mX, sigY, mY] = mlpNormalize(mlpData,ni, no); % normalize data
eta = 0.002; alpha = 0.1; maxIter = 600;
startTime = clock; % time how ling this takes.
[W1, W2, Yv, Errv, ErrHist] = mlpTrain(inData, ni, nh, no, ...
eta, alpha, a, b, maxIter);
trainingTimeSeconds = etime(clock,startTime)
for nn=1:3
xn = mlpData(nn,1:ni);
yy = mY’ + mlpT((xn - mX)/sigX,W1, W2,a,b);
fprintf(’%3d: yy=%12.4e %12.4e %12.4e\n’,nn, yy(1), yy(2), yy(3));
end
fn = 1; figure(fn);
semilogy(ErrHist);
grid on;
title(’Error history’);
>> backPropEx5
... [deleted a few lines]

Revision : 2003.10

trainingTimeSeconds = 108.8119
1: yy= 1.0000e+00 -4.7801e-07 1.7496e-07
2: yy= 8.2413e-07 1.0000e+00 2.2694e-07
3: yy= 1.9447e-07 -2.0178e-07 1.0000e+00
N
1 X ∆
2. Recall that we normalize input data by computing the mean x̄ = x(n) = E[x(n)]
N i=1
N
1 X ∆
and covariance Σx = (x(n) − x̄)(x(n) − x̄)T = E[(x(n) − x̄)(x(n) − x̄)T ] of a
N i=1
N
data set {x(n)}n=1 . The random number generator randn in MATLAB generates in-
dependent, identically distributed pseudo-random Gaussian variables with mean 0 and
variance 1.
(a) Since randn produces pseudo-random numbers that are expected to be statistically
 
0
dependent and identically distributed with mean 0 and variance 1, E[x] = 0 

0
T
and E[(x − x̄)(x − x̄) ] = I (a 3 × 3 identity).
(b) My output:
NN=2
Computed mean value=
mX = 0.3569 -0.3223 -1.0545
Computed covariance =
sigX =
0.3875 -0.1407 -0.2030
-0.1407 0.0511 0.0737
-0.2030 0.0737 0.1064
NN=10
mX = -0.0467 0.0289 -0.2239
sigX =
1.7623 0.3045 -0.3040
0.3045 0.7776 0.2775
-0.3040 0.2775 0.7536
NN=100
mX = -0.0833 0.1086 -0.0263
sigX =

Revision : 2003.10
0.8703 -0.0681 -0.1425

-0.0681 0.9555 0.0438
-0.1425 0.0438 1.1427
NN=1000
mX = 0.0396 0.0240 0.0255
sigX =
0.9900 0.0016 -0.0090
0.0016 0.9878 -0.0206
-0.0090 -0.0206 0.9801
The smaller data sets are too small to give statistically reliable characterizations
of the mean and variance. This is illustrated by the histograms shown below.
histogram: N=2 histogram: N=10
1 4
0.8
3
0.6
2
0.4
1
0.2
0 0
−1.5 −1 −0.5 0 0.5 1 −3 −2 −1 0 1 2
histogram: N=100 histogram: N=1000

30 300
25 250
20 200
15 150
10 100
5 50
0 0
−4 −2 0 2 4 −4 −2 0 2 4
Notice that, even in the case of N = 1000, the data bears poor resemblance to a
bell curve.
3. This corresponds to a digital filter with a pole between 0 and -1; so its impulse response
would be oscillatory, but stable. That is, one would expect the momentum term to
oscillate. Also, since
∆W (n + 1) = α∆W (n) + δy 0

Revision : 2003.10
this implies that α < 0 would cause the next backpropagation step to “backtrack” - to
undo some of the update of the current backpropagation step, and so one would expect
slower convergence.
We tested these expectations by re-running M-file 25.1 (backPropEx2.m) with alpha = -exp(log(0.2
(note the negative sign). This latter observation is consistent with the results shown
below, a comparison of the original training run with α > 0 to the results with α < 0:
Error history
800
700
600
500
400
300
200
100
0
0 500 1000 1500 2000 2500 α>0
Error history
800
700
600
500
400
300
200
0 500 1000 1500 2000 2500 α<0

27.1 M-file s-function examples Revision : 2003.10
27.1 M-file s-function examples

27.1.1 Continuous time model
M-file 27.2 invPendS.m
function [sys,x0,str,ts] = invPendS(t,x,u,flag)
% The general form of an M-File S-function syntax is:
% [SYS,X0,STR,TS] = arca(T,X,U,FLAG,P1,...,Pn)
%
% Inputs:
% u = voltage
% x = [theta;omega]
% Outputs:
% y = [theta;omega]
switch flag,
case 0, [sys,x0,str,ts]=mdlInitializeSizes;
case 1, sys=mdlDerivatives(t,x,u);
case 2, sys=mdlUpdate(t,x,u);
case 3, sys=mdlOutputs(t,x,u);
case 4, sys=mdlGetTimeOfNextVarHit(t,x,u);
case 9, sys=mdlTerminate(t,x,u);
otherwise error([’Unhandled flag = ’,num2str(flag)]);
end
%=============================================================================
% mdlInitializeSizes
% Return the sizes, initial conditions, and sample times for the S-function.
%=============================================================================
function [sys,x0,str,ts]=mdlInitializeSizes
sizes = simsizes;
sizes.NumContStates = 2;
sizes.NumDiscStates = 0;
sizes.NumOutputs = 2;
sizes.NumInputs = 1;
sizes.DirFeedthrough = 0;
sizes.NumSampleTimes = 1; % at least one sample time is needed
sys = simsizes(sizes);
x0 = [0;0]; % initial conditions
str = []; % str is always an empty matrix
ts = [0]; % initialize the array of sample times
return

%=============================================================================
% mdlDerivatives
% Return the derivatives for the continuous states.
%=============================================================================
function dx=mdlDerivatives(t,x,u)
dx = [x(2); (sin(x(1)) + u)];
% limit theta to stay within [-pi,pi]
thlim = min(max(x(1),-pi),pi);
dx(2) = dx(2) -100*(x(1)-thlim);
return
%=============================================================================
% mdlUpdate
% Handle discrete state updates, sample time hits, and major time step
% requirements.
%=============================================================================
function sys=mdlUpdate(t,x,u)
sys = [];
return
%=============================================================================
% mdlOutputs
% Return the block outputs.
%=============================================================================
function y=mdlOutputs(t,x,u);
y = x;
return
%
%=============================================================================
% mdlGetTimeOfNextVarHit
% Return the time of the next hit for this block. Note that the result is
% absolute time. Note that this function is only used when you specify a
% variable discrete-time sample time [-2 0] in the sample time array in
% mdlInitializeSizes.
%=============================================================================
%
function sys=mdlGetTimeOfNextVarHit(t,x,u)
sampleTime = 1; % Example, set the next hit to be one second later.

sys = t + sampleTime;
% end mdlGetTimeOfNextVarHit

%=============================================================================
% mdlTerminate
% Perform any end of simulation tasks.
%=============================================================================
%
function sys=mdlTerminate(t,x,u)
sys = [];
return
% end mdlTerminate
27.1.2 Discrete time model

M-file 27.3 nnSysIdEx.m
function [sys,x0,str,ts] = nnSysId(t,x,u,flag)

% The general form of an M-File S-function syntax is:
% [SYS,X0,STR,TS] = arca(T,X,U,FLAG,P1,...,Pn)
% Inputs:
% u = [current voltage input, current system output] 3x1
% x = last system output (2x1), weights (2x11, 10x4) - total of 52
% Outputs:
% y = [theta;omega] predicted value
switch flag,
case 0, [sys,x0,str,ts]=mdlInitializeSizes;
case 1, sys=mdlDerivatives(t,x,u);
case 2, sys=mdlUpdate(t,x,u);
case 3, sys=mdlOutputs(t,x,u);
case 4, sys=mdlGetTimeOfNextVarHit(t,x,u);
case 9, sys=mdlTerminate(t,x,u);
otherwise error([’Unhandled flag = ’,num2str(flag)]);
end
%=============================================================================
% mdlInitializeSizes
% Return the sizes, initial conditions, and sample times for the S-function.
function [sys,x0,str,ts]=mdlInitializeSizes
sizes = simsizes;
sizes.NumContStates = 0;
sizes.NumDiscStates = 2 + 2*11 + 4*10;
sizes.NumOutputs = 2;

sizes.NumInputs = 3;
sizes.DirFeedthrough = 1;
sizes.NumSampleTimes = 1; % at least one sample time is needed
sys = simsizes(sizes);
x0 = randn(sizes.NumDiscStates,1); % initial conditions
str = []; % str is always an empty matrix
ts = [0.1, 0]; % initialize the array of sample times
return
%=============================================================================
% mdlDerivatives
% Return the derivatives for the continuous states.
function dx=mdlDerivatives(t,x,u)
dx = []
return
%=============================================================================
% mdlUpdate
% Handle discrete state updates, sample time hits, and major time step
% requirements.
function nextStates=mdlUpdate(t,x,u)
[W1,W2,yk1] = unpackStates(x); % do a backpropagation step
nextStates = x;
return
%=============================================================================
% mdlOutputs
% Return the block outputs.
function y=mdlOutputs(t,x,u);
% output is current estimate of next output
[W1,W2,yk1] = unpackStates(x);
y0 = [u(1);yk1];
y = mlpT(y0,W1,W2,1.5,1);
return
%
%=============================================================================
% mdlGetTimeOfNextVarHit
% Return the time of the next hit for this block. Note that the result is
% absolute time. Note that this function is only used when you specify a
% variable discrete-time sample time [-2 0] in the sample time array in
% mdlInitializeSizes.
%=============================================================================
%

function sys=mdlGetTimeOfNextVarHit(t,x,u)
sampleTime = 1; % Example, set the next hit to be one second later.
sys = t + sampleTime;
% end mdlGetTimeOfNextVarHit
%
%=============================================================================
% mdlTerminate
% Perform any end of simulation tasks.
%=============================================================================
%
function sys=mdlTerminate(t,x,u)
sys = [];
return
% end mdlTerminate
% unpack the states in the vector x into an easy-to use form

function [W1,W2,yk1] = unpackStates(x)
m1=10; % Neural net dimensions
n1=4;
m2=2;
n2 = 11;
W2LastIndex = 2 + m2*n2;
W1FirstIndex = W2LastIndex + 1;
W1LastIndex = W2LastIndex + m1*n1;
yk1 = x(1:2);
W2 = reshape(x(3:W2LastIndex),m2,n2);
W1 = reshape(x(W1FirstIndex:W1LastIndex),m1,n1);
return

27.2 Lecture notes: handwritten today Revision : 2003.10
27.2 Lecture notes: handwritten today

Read See 20031022p*.jpg

Revision : 2003.10
28 2003 10 24; RBF’s (3)

Read Handwritten notes in 20031023p*.jpg
28.1 Interpolation result

Read §5.4 Simon Haykin. Neural Networks: A Comprehensive Foundation. Prentice Hall,
2nd edition, 1999
Simple (memory-based learning) approach: Given {x(n), d(n)}Nn=1 , choose a set of nc = N

centers t(n) = x(n) with corresponding output vectors w(n), n = 1, ..., nc , and compute
 
n c
φ(kx − t(1)k)
F (x) =
X 
w(i)φ(kx − t(n)k) = w(1) · · · w(nc )  .. ∆
.  = W φ(x)
i=1 φ(kx − t(nc )k)
so that F (t(i)) = d(i). Find W such that:

 
φ11 · · · φ1N
 .. .. ..  w · · · w = d(1) · · · d(N )
 . . .  1 N
φN 1 · · · φ N N
or
ΦW = D
How do we know that Φ is invertible? If the points t(n) are distinct then lots of φ RBF’s
will work.

Revision : 2003.10
29 2003 10 27 RBF’s (4)

Solution
Notice: Exam 2 Will be on Monday Nov 3. Same rules as last time, except: item you may
bring a ruler so that you can draw a straight line. No calculators, no notes, no references.
1.
−1/2
y (0) = ΣX (x(n) − µX )
N N
1 X −1/2 1 −1/2 X
µy(0) = ΣX (x(n) − µX ) = ΣX (x(n) − µX )
N n=1 N n=1
N N
!
−1/2 1 X 1 X −1/2
= ΣX x(n) − µX = ΣX (µX − µX ) = 0.
N n=1 N n=1
N N
1 X (0) (0) T 1 X −1/2 −1/2 T
Σy(0) = y y = ΣX (x(n) − µX ) ΣX (x(n) − µX )
N n=1 N n=1
N
!
−1/2 1 X T −1/2 −1/2 −1/2
= ΣX (x(n) − µX ) (x(n) − µX ) ΣX = Σ X ΣX ΣX = I
N n=1
2. The analysis is correct: the two methods do in fact give the same result. The difference
is that data normalization moves the internal values v (k) into the quasi-linear parts of
the activation functions φ so that learning can occur faster.

Homework 8 Due Fri Oct. 31. Hwk
Notice: Exam 2 Will be on Monday Nov 3. Same rules as on the last exam, except that
you may bring a ruler so that you can draw a straight line.
1. written Problem 5.13, p. 3.16 in Simon Haykin. Neural Networks: A Comprehensive

Foundation. Prentice Hall, 2nd edition, 1999. (Undergraduate students may assume
that all variables and parameters are scalars.)

29.2 What functions to use? Revision : 2003.10
29.2 What functions to use?

Read §5.5-5.7
Haykin applies signal processing theory (system identification/inverse problems) to RBF

networks.
Problem: What if data (or the original problem) is ill conditioned?
Definition 29.1 A problem is ill-conditioned if ...
Definition 29.2 A problem is backward stable if ...
29.2.1 Tikhonov regularization

Given input/output data pairs (xi , di ) and a function F (x), define yi = F (xi ). RBF function
F (x) attempts to minimize a modified error term:
E = Es (F ) − λEc (F )
N
1X∆
Standard error term Es (F ) = (di − yi )
2 i=1
1
Regularization term Ec (F ) = kDF k2
2
• D: “linear differential operator” – in other words, a filter, probably multidimen-
sional.
Use to enforce frequency domain constraints on your solution function
Example: h 2 i
∂ ∂2
DF = ∂x2 · · · ∂x2m F
1
penalize large changes in slope.

• λ > 0 weight value to specify how important the regularization penalty is relative
to the input data.
Don’t confuse λ here with a LaGrange multiplier.
Remark 29.1 The text makes use of some powerful mathematics (see, e.g., D. G. Luen-
berger. Optimization by Vector Space Methods. Wiley and Sons, Inc., New York, NY, 1969)
to justify the use of Green’s functions. For appropriate choice of operator D, these Greene’s
functions are the Gaussian radial basis functions we’re considering. The significance of Gaus-
sian functions is that, in terms of wavelet theory, they provide an optimal trade-off between
state-space localization (“time-domain”) and frequency localization in the sense of Heisen-
berg’s uncertainty principle. See Gilbert Strang and Truong Nguyen. Wavelets and Filter
Banks. Wellesley Cambridge Press, wellesley, mass. edition, 1996.
Remark 29.2 Strict enforcement of the choice of operator D leads to a choice of basis
functions φ that satisfy Dφ = 0.

29.3 Training RBF networks Revision : 2003.10
29.2.2 Solution of the problem
Read §5.7
m (1)
∆
X
Define φi (x) = φi (kx − ti k). RBF network output is F (x) = wi φi (x). Let
i=1
T
d = d1 · · · dN
T
w = w1 · · · wm(1)
 
φ1 (x1 ) · · · φm(1) (x1 )
G =  .. .. .. 
.

. . 
φ1 (xN ) · · · φm(1) (xN )
 
φ1 (t1 ) ··· φm(1) (t1 )
G0 =  .. .. .. 
.

. . 
φ1 (tm(0) )) · · · φm(1) (tm(0) )
Then the minimizing solution for weights w is
GT G + λG0 w = GT d

29.3 Training RBF networks

Read §5.13
Don’t want lots of RBF’s (fewer than number of data points).

• Overtraining
• Computational cost
RBF function m
X
F (x) = wi φi (x)
i=1
where
T
φi (x) = e(x−ti ) Σi −1 (x−ti )
ti is the center of φi . Σi controls the “spread” of φi .

Example 29.1 Two RBF’s added together.
n
X
F (x) = wi exp (x − ti )T Σi −1 (x − ti )
i=1

4 1 5 2 0.2 0.04
with w1 = 3, w2 = 1, t1 = , t2 = , Σ1 = and Σ2 = .
2 −2 2 3 0.04 0.5
Plot in Figure 36.

RBF example function
line 1
3
2.5
2
1.5
1
0.5
0
8
6
4
0 2
1 0
2 y
3 -2
4
5 -4
x 6
7
8 -6
Figure 36: RBF network with two RBFs.
2
Alternative form (similar to Fuzzy Logic)
m
X
wi φi (x)
i=1
F (x) = m
X
φi (x)
i=1
Example 29.2 Fuzzy RBF network added together. M-file below; plot in Figure 37. Notice
that between centers we get interpolation, while one RBF tends to dominate away from the
centers.
M-file 29.4 rbf0502.m
% rbf0502: example radial basis function network

t1 = [4;2]; Sig1 = [5,2; 2,3];
t2 = [1;-2]; Sig2 = [0.2,0.04; 0.04,0.5];
w = [3;1];
nx = 40; xx = linspace(0,8,nx);
ny = 41; yy = linspace(-5,8,ny);

RBF example function
line 1
3
2.8
2.6
2.4
2.2
2
1.8
1.6
1.4
1.2
1
8
6
4
0 2
1 0
2 y
3 -2
4
5 -4
x 6
7
8 -6
Figure 37: RBF network with two RBFs.
zm = zeros(nx,ny);
for ix = 1:nx
for iy = 1:ny
xn = [xx(ix); yy(iy)];
e1 = xn -t1; e2 = xn - t2;
zm(ix,iy) = w’*[ exp(-e1’*(Sig1\e1)) ; exp(-e2’*(Sig2\e2)) ]/ ...
sum([ exp(-e1’*(Sig1\e1)) ; exp(-e2’*(Sig2\e2)) ]);
end
end
title(’RBF example function’);

xlabel(’x’);
ylabel(’y’);
mesh(xx,yy,zm’);
printeps("rbf0502.eps");
2
Training involves selection of parameters ti , Σi , and wi .
From [Hay99]: adaptation formulae for RBF network: Requires Green’s function G(x)
and its first derivative G0 (x) = dG/dx with respect to scalar x.

Linear weights:
N
∂E(n) X
= ej (n)G kxj − ti (n)kCi
∂wi (n) j=1
∂E(n)
wi (n + 1) = wi (n) − η1 i = 1, 2, ..., m(1)
∂wi (n)
Center positions Can interpret as hidden layer node parameter

N
∂E(n) X
= 2wi (n) ej (n)G0 kxj − ti (n)kCi Σi −1 (xj − ti (n))
∂ti (n) j=1
∂E(n)
ti (n + 1) = ti (n) − η2 i = 1, ..., m(1)
∂ti (n)
Spread parameters
N
∂E(n) X
0

= −w i (n) e j (n)G kx j − t i (n)k Ci Qji (n)
∂Σi −1 (n) j=1
Qji (n) = (xj − ti (n)) (xj − ti (n))T

∂E(n)
Σi −1 (n + 1) = Σi −1 (n) − η3 i = 1, ..., m(1)
∂Σi −1 (n)
29.3.1 Selection of output (interpolation) weights w

Suppose ti Σi are held fixed, i = 1, m. We get a least squares problem.
min J(w) :
w
N
X
J(w) = kdi − F (xi )k2
i=1
2
XN Xm
= d i − wj φj (xi )

i=1 j=1
     2
d1 φ1 (x1 ) · · · φm (x1 ) w1
 ..   .. ..   .. 

min  .  −  ..
w
. . .   . 

dm φ1 (xN ) · · · φm (xN ) wm

φi (x) < 1/2
ti
Figure 38: Hodel’s idea for automated selection of spread matrices Σ.
29.3.2 Selection of centers ti

29.3.3 Selection of covariance (spread) matrices Σ
Hodel’s heuristic “Σ-shaping.” See Figure 38. The idea is to maximize spread subject to
“non-interference” between radial basis functions. This can be set up as a convex pro-
gramming problem [NN94], but we will utilize a simple suboptimal procedure for this task.
Assume that the centers ti have been chosen to reflect the variation in the data (sensitivity
of di to xi ).
1. For i = 1, ..., nφ
(a) Select center point ti .

(b) For each j 6= i in {1, ..., nφ }

ti + t j
i. Evaluate φ halfway between ti and tj , i.e., define pij = φ . If pij ≤
2
1/2, then we say that φi does not interfere with φj at tj . Conversely, if
pij > 1/2, then φi interferes with φj , i.e., φi is spread out too far in the
direction tj − ti from the center point ti of φi .
ii. If φi interferes with φj , then we reduce the spread of φi as follows. Define
S = Σi −1 and assume without loss of generality that tj − ti = αe1 where e1
is the first column of the identity matrix. Then

−(tj −ti )Σi −1 (tj −ti )/4 1 2
pij = e = exp ktj − ti k s11
4
We achieve the constraint pij ≤ 1/2 by scaling s11 appropriately.
In the general case where tj − ti is not a multiple of e1 , then we may use an
orthogonal coordinate transformation to perform the update as shown in the
following example.
(c) end for
2. end for

Example 29.3 M-file 29.5 rbfSig.m

% rbfSig.m: check ‘‘spread’’ matrix idea
rand(’seed’,1); % for repeatable experiments
N = 20; xx = rand(2,N); % select 20 centers at random
% allocate space for spread matrices:

Sigs = zeros(4,N); % column i contains spread matrix i (vector stack)
for nn = 1:N
xn = xx(:,nn);
% get matrix of distances to other center points

idx = complement(nn,1:N); xErr = xx(:,idx) - xn*ones(1,N-1);
% initialize to have a ‘‘wide’’ spread.

Sigi_n = 0.01*eye(2); % Sigi -> "inv(Sigma)"
% scale in each direction to avoid conflict with other points

for ni = 1:length(idx)
ei = xErr(:,ni)/2;
pij = exp(-ei’*Sigi_n*ei);
if(pij > 1/2)
% transform coordinates so that ei is 1st column of identity
[Q,r] = qr(ei); SigTmp = Q’*Sigi_n*Q;
% scale leading diagonal value

dv = SigTmp(1,1); ne = norm(ei)^2; dv = log(1/2)/(-ne);
% update and back-transform

SigTmp(1,1) = dv; Sigi_n = Q*SigTmp*Q’;
end
end
% store value in spread matrix database

SigInv = Sigi_n;
Sigs(:,nn) = reshape(SigInv,4,1);
end
% compute mesh of RBF’s

xmin = min(xx’); xmax = max(xx’);
nx = 26; xv = linspace(xmin(1), xmax(1),nx);

ny = 27; yv = linspace(xmin(2), xmax(2),ny);
zm = zeros(nx, ny);

for ix = 1:nx
for iy = 1:ny
xp = [xv(ix); yv(iy)];
for nn = 1:N
ti = xx(:,nn);
SigI = reshape(Sigs(:,nn),2,2);
zm(ix,iy) += exp( -(xp - ti)’*SigI*(xp-ti) );
end
end
end
% plot the centers

figure(1)
title("Centers");
xlabel("x");
ylabel("y");
plot(xx(1,:), xx(2,:),’xx’);
printeps("rbfSig01.eps");
% plot sum of RBFS as a mesh

figure(2)
title("Sum of RBF functions with spread shaping")
xlabel("x");
ylabel("y");
mesh(xv, yv, zm’)
% plot sum of RBFS as a contour plot

figure(2)
title("Contour plot: sum of RBF functions with spread shaping")
xlabel("x");
ylabel("y");
contour(xv, yv, zm’, 10)
rbfSig.m Output
ans = 1
ans = 2
ans = 2
Results in Figures 39–41.

Centers
1
line 1
0.9
0.8
0.7
0.6
0.5
y
0.4
0.3
0.2
0.1
0
0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8
x
Sum of RBF functions with spread shaping
line 1
1.8
1.6
1.4
1.2
1
0.8
0.6
0.4
0.2
0
1
0.9
0.8
0.7
0 0.6
0.1 0.5
0.2 0.4 y
0.3 0.3
0.4 0.2
0.5
x 0.6 0.1
0.7
0.80

Contour plot: sum of RBF functions with spread shaping
1.6
1.4
1.2
1
0.8
1 0.6
0.9 0.4
0.2
0.8
0.7
0.6
0.5 y
0.4
0.3
0.2
0.1
0
0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8

Revision : 2003.10
30 2003 10 31: RBF’s (cont’d)

Solution Make use of Lemma 11.1.
N
1X
∆ ∆ ∆
E = e(n)T e(n) e(n) = d(n) − y(n) y(n) = F (x(n))
2 n=1
nφ
∆ ∆
X
F (x) = w(k)φk (x) φk (x) = φ(vk (x)) φ(v) = ev
k=1
vk (x) = −(x − t(k))T Σ(k)−1 (x − t(k))/2

∂E
= covered in class
∂wi (k)
N X p
∂E X ∂E ∂ei (n) ∂yi (n) ∂φk (x(n))
=
∂tj (k) n=1 i=1
∂ei (n) ∂yi (n) ∂φk (x(n)) ∂tj (k)
N X p
X ∂φk (x(n))
= ei (n)(−1)wij (k)
n=1 i=1
∂tj (k)
N X p
X ∂φk (x(n)) ∂vk (x(n))
= −ei (n)wij (k)
n=1 i=1
∂vk (x(n)) ∂tj (k)
N X p
X ∂vk (x(n))
= −ei (n)wi (k)φ(vk (x(n))
n=1 i=1
∂tj (k)
∂vk (x(n)) 1 ∂
T −1

= − (x − t(k)) Σ(k) (x − t(k))
∂t(k) 2 ∂t(k)
1 ∂ T −1 T −1 T −1

= − x Σ(k) x − 2x Σ(k) t(k) + t(k) Σ(k) t(k)
2 ∂t(k)
= Σ(k)−1 x − Σ(k)−1 t(k) and the result follows
∆
Similarly, with S(k) = Σ(k)−1 ,
N X p
∂E X ∂vk (x(n))
= −ei (n)wi (k)φ(vk (x(n))
∂sij (k) n=1 i=1
∂sij j(k)
∂vk (x(n)) 1
= (x(n) − t(k))(x(n) − t(k))T
∂S(k) 2
Notice that my solution differs from the text because I include the factor of 1/2 in v k (x).

Revision : 2003.10
31 2003 10 31 Principal Components Analysis

Read §8.3
From Hebb (1949).
• Projection onto a vector q
• E (q T x)(xT q) = q T Rq

• SVD =⇒ maxq q T Rq =⇒ q = eigenvector of R.

Eigenvectors of R are orthogonal.
idea: order λ1 ≥ · · · ≥ λn = 0
What if λ3 λ4 ? Approx as rank 3, keep only 3 dimensions of data.
• encode: y = Q3 T x
• decode: x̂ = Q3 y
• Error x − x̂ orthogonal to Q3 (small).
Neural network idea can we get Q3 without lots of nasty eigenvalue problems?

Revision : 2003.10
32 Exam 2 solutions
Name
Scores
1. (25pts)
2. (25pts)
3. (25pts)
4. (25pts)
Total:
Note Show your work. Multiple choice questions may have more than one correct answer.
Whether or not your answer will be judged to be “correct” depends on your response under
the word “explain.”
Permitted resources for this exam None. You may use a pencil or pen (use a pen
only if you never make mistakes), an eraser, and a ruler. You are not permitted to use a
calculator, textbook, written notes, oral or written communication with other people (besides
the instructor/GTA), laptop computers, cell phones, PDAs, wireless modems, telepathic
contact, or any other resources besides your own mind, body and a writing utensil. T-shirts
with Maxwell’s equations on the back will be tolerated, but I reserve the right to reseat you
in the back of the classroom. You are to sit with at least one empty chair between you and
the nearest classmate. Use of unauthorized resources on this exam will result in a failing
grade.
a five volt power State all assump-

Design a human supply and kite tions.
brain using only string.
MOSFETs

32.1 Revision : 2003.10
32.1
Consider the function plotted below:
sinc(2 pi x ) for |x| <=1 , 0 else
1
0.8
0.6
0.4
0.2
-0.2
-0.4
-3 -2 -1 0 1 2 3
x
It is desired to approximate this function with a neural network.

1. How many inputs does the network have? (circle 1)
2 3 4 none of these
1
2. How many outputs does the network have? (circle 1)
2 3 4 none of these
1
3. What is the minimum number of hidden nodes in a multi-layer perceptron that can
give a “fairly good approximation” of the sinc function? (circle one)
1 3 6 10 none of these
Explain.
Solution Acceptable answers:
6 Two each for each “hump” in the diagram: one for the leading edge, one for the
trailing edge.
None of these (four) use two neurons to create a trough from -1 to 1 with depth
of about -0.25, two more neurons to create the peak in the middle between about
-0.75 and 0.75.
None of these (many) The function change in slope at -1 and 1 is not smooth, and
so will require a lot of neurons to model accurately.

32.2 Revision : 2003.10
32.2
Consider the function plotted below:
sinc(2 pi (x^2 + 2 y^2) ) for x^2 + 2 y^2 <=2 , 0 else
1
0.8
0.6
0.4
0.2
0
-0.2
-0.4
2
1.5
1
0.5
-2 0
-1.5
-1 -0.5 y
-0.5
0 -1
0.5 -1.5
x 1
1.5
2 -2
It is desired to approximate this function with a neural network.
1. How many inputs does the network have? (circle 1)
1 3 4 none of these
2
2. How many outputs does the network have? (circle 1)
2 3 4 none of these
1
3. What is the minimum number of hidden nodes in a multi-layer perceptron that can
give a “fairly good approximation” of the sinc function? (circle one)
1 3 6 10 none of these
Explain.
Solution Acceptable answers:
None of these (two) Use one function to create the wider trough in the middle of
the plane, a second function to add the peak. However, this would likely not give
a good match to the data.
None of these (many) This function has a fairly flat top and a strange shape, so a
good match would probably require many RBF’s to get good interpolation behavior.

32.3 Revision : 2003.10
32.3
m
Consider a data set {x(n), d(n)}N
n=1 where x(n) ∈ IR and d(n) ∈ IRp . Define the error
function
N
∆ 1
X
E= e(n)T e(n)
2 n=1
∆
where e(n) = d(n) − y(n) and y(n) = W (2) φ(W (1) x(n) + b(1) ) where b(1) ∈ IRh . Define
E(n) = 12 e(n)T e(n). Find  ∂E(n) 
(2) · · · ∂E(n) (2)
∂W1,1 ∂W1,h 
∂E(n) ∆ 
 .. . .. 
= . .. . 
∂W (2)  ∂E(n) ∂E(n)

(2) ··· (2)
∂Wp,1 ∂Wp,h
∂E(n)
Note Undergraduates may choose to derive (2)
, a scalar valued gradient instead.
∂Wij
∂E(n) T
(2)
= −e(n)y (1) (n)
∂W
Show your work. If you need more space, work on the back side of this page. If you need
more space than that ... try to write smaller.
(2)
∂yk
Solution Apply the chain rule: First, notice that = 0 if k 6= i. From this result, there
(2)
∂Wij
T
is no need to sum the partial derivatives over all entries in e = e1 (n) · · · em (n) .
! !
∂E(n) ∂E(n) ∂ei (n) ∂yi (n)
(2)
= (2) (2)
∂Wij ∂ei (n) ∂yi (n) ∂Wij
(1)
= ei (n)(−1)yj (n)
which is the answer for undergraduates. Write the above gradient in terms of all combinations
of i and j to get the answer listed above.

32.4 Revision : 2003.10
32.4
Consider the unit square problem we’ve been working all semester. A friend (who is not in our
class) suggests that instead of using φ(v) = tanh(v) that you should use φ(v) = tanh(10v),
since the latter function is much steeper and so it can give a better approximation to the
unit square.
activation function comparison
1
tanh(x)
tanh(10*x)
Evaluate your friend’s suggestion.
0.8
Discuss (1) whether or not the use
0.6 of φ(v) = tanh(10v) can give a
0.5
0.4
better approximation to the data
than the use of φ(v) = tanh(v), (2)
0.2
how the use of φ(v) = tanh(10v)
0 will affect input data normalization,
-0.2
and (3) how the use of φ(v) =
tanh(10v) will affect the backprop-
-0.4
-0.5
agation training algorithm.
-0.6 If you need more space, you may
-0.8 continue your answer to this prob-
lem on the back of this page exam.
-1
-3 -2 -1 0 1 2 3
x
Solution It changes the training, but doesn’t improve things at all.
1. Consider output layer y (2) = φ(W (2) y (1) ) with φ = tanh(10v). We can use φ(v) =
tanh(v) instead by redefining W (2) := 10W (2) , that is, just absorb the factor of 10 into
the network weights. Hence, the proposed function does not provide the capability of a
better approximation than tanh by itself.
2. Data normalization is performed to ensure that the expected value of initial network
weights is in the “linear” region of the activation function. The use of φ(v) = tanh(10v)
sets the “linear” region to a much more narrow area, hence the initial weights must be
initialized to be one-tenth of what we’d usually do, i.e., in MATLAB code, Winitial = rand(m,n)/(10
3. The derivative of of tanh(10v) is 10 times larger (and 10 times more narrow) than
that of tanh(v). As a result, the backpropagation step size should be adjusted to be one-
tenth of what one would use for tanh(v). Further, since the derivative is effectively zero
outside of a narrow range, one would not expect a large number of neurons to be able
to “find” the boundary edges of the unit square unless they happened to be initialized
“just right.” On the other hand, since initial weights (listed above) are selected to
compensate for the factor of 10 in the activation function, the smaller step size may
result in comparable training behavior.
Either way, the proposed method doesn’t provide any clear advantage over φ(v) =
tanh(v).

Revision : 2003.10
33 2003 11 05: PCA (2)

See scanned notes.

Revision : 2003.10
34 2003 11 07: PCA (3)

Example 34.1 20 sinusoids sin(ωt + φ) where ω ∈ [1.0, 1.1]rad/s, φ ∈ [0, 0.5] rad. 20 sinc
functions sinc(ωt + φ), and 20 square waves sign(sin(ωt + φ)). Gaussian noise added with
variance σ 2 = 0.05. Input signals are shown in Figure 42.
M-file 34.1 pcaEx1.m
function pcaEx1
format short e
mm = 150;
tt = linspace(-2,2,mm);
rand(’seed’,0);
randn(’seed’,0);
nsets = 20;
for ii=1:nsets
om = 1 + rand/10;
ph = rand/2;
rt = om*tt + ph;
mydat(ii,1:mm) = sin(rt);
rt = rt*5;
mydat(ii+nsets,1:mm) = sin(rt) ./ rt;
mydat(ii+2*nsets,1:mm) = sign(sin(rt));
fprintf(’%4d: om=%12.4f ph = %12.4f\n’,ii,om,ph);
end
mydat = mydat + randn(size(mydat))/20;
NN = size(mydat,1);
xbar = mean(mydat);
zmdat = mydat - ones(NN,1)*xbar;
sigx = zeros(mm,mm);
for nn=1:NN
xn = zmdat(nn,:)’;
sigx = sigx + xn*xn’/NN;
end
[VV,lam] = eig(sigx);
lam = diag(lam);
[lam,idx] = sort(-lam); lam = -lam;
VV = VV(:,idx); % reorder eigenvectors
chkval = (1e-6)*max(lam);
lamsiz = size(lam)
chkvalsiz = size(chkval);
idx = find(lam > chkval);

Revision : 2003.10
V3 = VV(:,1:3);
approxDat = (zmdat*V3)*V3’ + ones(NN,1)*xbar;
figure(1);
plot(tt,mydat,’-’);
title(’Principal components analysis example’);
xlabel(’time (sec)’);
ylabel(’signal value’);
grid on;
axis([-2,2,-1.2,1.2]);
printeps(’pcaEx1.eps’);
figure(2);
semilogy(idx,lam(idx),’x’);
axis ;
grid on;
printeps(’pcaEx1a.eps’);
figure(3);
plot(tt,xbar);
grid on;
title(’mean value vector’);
printeps(’pcaEx1b.eps’);
figure(4);
plot(tt,VV(:,1:3));
title(’dominant eigenvectors’);
xlabel(’time (s)’);
grid on;
printeps(’pcaEx1c.eps’);
figure(5);
plot(tt,approxDat);
title(’approximated data with dominant 3 eigenvectors’);
grid on;
printeps(’pcaEx1e.eps’);
% power iteration to get the dominant eigenvector

w0 = rand(mm,1);
w0 = w0/norm(w0);
ww = w0;

Revision : 2003.10
imax = 6
for kk=1:imax
ww = sigx*ww;
ww = ww/norm(ww);
end
plot(tt,VV(:,1),’-’,tt,w0,’--’,tt,ww,’-.’);
grid on;
legend(’v_1’,’w(0)’,sprintf(’w(%d)’,imax));
ylabel(’vector waveform’)
title(’Principal Compenents Analysis example’);
printeps(’pcaEx1f.eps’);
% Neural network approach: get a single eigenvector

eta = 0.0001;
imax = 20;
nv = 5;
WW = rand(mm,nv);
for kk=1:nv
WW(:,kk) = WW(:,kk)/norm(WW(:,kk));
end
for iter = 1:imax
lamh(1:5,iter) = diag(WW’*sigx*WW);
WW= sigx*WW;
[jnk,idx] = sort(rand(NN,1));
idx = 1:NN;
for nn=1:NN
xn = mydat(idx(nn),:)’;
yy = WW’*xn;
WW = WW + eta*xn*yy’;
for kk=1:nv
% project and normalize
for jj=1:(kk-1)
WW(:,kk) = WW(:,kk) - WW(:,jj)*( WW(:,jj)’*WW(:,kk) );
end
WW(:,kk) = WW(:,kk)/norm(WW(:,kk));
end
end
end
plot(1:imax,lamh,’+’,1:imax,reshape(lam(1:nv),nv,1)*ones(1,imax),’:’);
legend(’l1e’,’l2e’,’l3e’,’l4e’,’l5e’);
title(sprintf(’Neural network PCA after %d iterations’,imax));
printeps(’pcaEx1g.eps’);

Revision : 2003.10
function printeps(str)
eval(sprintf(’print -depsc %s’,str));

Revision : 2003.10
Principal components analysis example
0.8
0.6
0.4
0.2
signal value
−0.2
−0.4
−0.6
−0.8
−1
−2 −1.5 −1 −0.5 0 0.5 1 1.5 2

time (sec)
Figure 42: PCA example: input signals
Subtract mean value, then compute Σx . Dominant eigenvalues of Σx (ignoring all eigen-
values less than 10−6 λmax ) are shown in Figure 43. Dominant 3 eigenvectors are plotted in
Figurte 44. Approximated data shown in Figure 47. Notice the general signal forms are
recognizable, in spite of different phase and frequency: the signal type can be classified with
confidence.
Remark 34.1 Data set characteristics: same number of data points from each class (uni-
form sampling).

Revision : 2003.10
2
10
1
10
0
10
−1
10
−2
10
−3
10
0 10 20 30 40 50 60
Figure 43: PCA example: dominant eigenvalues of Σx

Revision : 2003.10
dominant eigenvectors
0.2
0.15
0.1
0.05
signal value
−0.05
−0.1
−0.15
−0.2
−2 −1.5 −1 −0.5 0 0.5 1 1.5 2
time (s)
Figure 44: Dominant 3 eigenvectors of Σx (PCA example).

Revision : 2003.10
approximated data with dominant 3 eigenvectors

2
1.5
0.5
signal value
−0.5
−1
−1.5
−2 −1.5 −1 −0.5 0 0.5 1 1.5 2
time (s)
Figure 45: Original data approximated based on dominant 3 eigenvectors of Σx . Observe

that the original sine, sinc, and square waves can be recognized in spite of noise corruption
and data compression to only three linear parameters.

Revision : 2003.10
Principal Compenents Analysis example

0.15
v1
w(0)
w(6)
0.1
0.05
vector waveform
−0.05
−0.1
−0.15
−0.2
−2 −1.5 −1 −0.5 0 0.5 1 1.5 2
time (s)
Figure 46: Power iteration to compute dominant eigenvector of Σx .

Revision : 2003.10
Neural network PCA after 20 iterations

30
l1e
l2e
l3e
l4e
25 l5e
20
15
10
0
0 2 4 6 8 10 12 14 16 18 20
Figure 47: Power iteration to compute dominant eigenvectors/eigenvalues of Σx .

34.1 Single vector case Revision : 2003.10
34.1 Single vector case

Read §8.4
linear model m
X
T
y=w x= w i xi
i=1
Hebbian learning:
wi (n + 1) = wi (n) + ηy(n)xi (n)

w(n + 1) = w(n) + ηy(n)x(n)
weights can blow up. → normalize weights
wi (n) + ηy(n)xi (n)

wi (n + 1) = qP
m 2
i=1 (wi (n) + ηy(n)xi (n))
w(n) + ηy(n)x(n)
w(n + 1) = q
(w(n) + ηy(n)x(n))T (w(n) + ηy(n)x(n))
Taylor’s series in η: (recall d uv = u dv−v du

v2
)
w(n + 1) = w(n) + ηy(n)[x(n) − y(n)w(n)]
Stability analysis and interpretation
y(n) = w(n)T x(n) = x(n)T w(n)

h i
T T T
w(n + 1) = w(n) + η x(n)x(n) w(n) − w(n) x(n)x(n) w(n) w(n)
1
x(n)x(n)T , so for a small enough step size, the first term
P
Interpretation:
Recall Σ x = N
x(n)x(n)T w(n) is an approximation for multiplying Σx w(n), which will converge to the
dominant eigenvector direction.
Suppose w(n) were a scalar. Then the 2nd term is −w(n)3 x(n)2 , which causes w(n) to
decrease in magnitude.
Hence, the 1st term drives toward the dominant eigenvector, while the 2nd term drives
toward 0. Net result: converges to a bounded
P multiple P of dominant eigenvector of Σx .
Formal convergence analysis requires: η(n) = ∞, |η(n)|p < ∞ for some p > 1. Text
chooses η(n) = 1/n, learn more slowly as get “older.” Then w(n) → dominant eigenvector.
34.2 Multiple vector case

Revision : 2003.10
35 Self Organizing Maps

Read Ch. 9
Human brain motor organization mimics physical organization of body (mirror image).
Idea Design a neural network to organize itself to match data organization. Techniques/concepts
involved:
Competitive learning Competition: determine one ‘winner’
Topology/cooperation Establish “neighborhoods” of neurons. Organize neurons in each

layer in a plane; winner and its ‘neighbors’ get reinforced.
Eventually neighboring neurons are activated by neighboring input patterns.

Revision : 2003.10
36 2003 11 19 Neurodynamic programming Example

Read Chapter 12
M-file 36.1 dynProgEx.m

% dynamic programming example - deterministic case (all transition
% probabilities are either 0 (never happens) or 1 (always happens).
XX = 1:10; % 10 states
% transitions and probabilities in a cell array:
% S{i,j} = [f(i,1),c(i,1) ; f(i,2), c(i,2) ; ... ; f(i,j), c(i,j), ...]
% describes what happens in state i with input j
S{1} = [2,2; 3,4; 4,3]; S{2} = [5,7; 6,4; 7,6];
S{3} = [5,3; 6,2; 7,4]; S{4} = [5,4; 6,1; 7,5];
S{5} = [8,1; 9,4]; S{6} = [8,6; 9,3];
S{7} = [8,3; 9,3]; S{8} = [10,3];
S{9} = [10,4]; S{10} = [];
for ii=1:length(S) % print out transitions and costs
Si = S{ii};
[mm,nn] = size(Si);
if(mm == 0), fprintf(’State %s: no transitions\n’, char(’A’+ii-1));
else
for uu=1:mm
fprintf(’State %s: input %d -> state %c, cost %d\n’, ...
char(’A’+ii-1),uu, char(’A’+Si(uu,1)-1),Si(uu,2));
end
end
end
% iteratively compute optimal cost to go
rand(’seed’,1);
Ju = 10*rand(length(S),1); % initialize to random cost
gam = 1.0; % gamma variable in Table 12.2
Jhist = Ju;
maxi = 11;
for iter = 2:maxi
fprintf(’iteration %d\n’,iter);
Jnext = zeros(length(S),1); % compute next optimal cost
for ii=1:length(S) % initial state
[mm,nn] = size(S{ii});
if(mm > 0)
bestCost = 1e6; % some huge number
for uu=1:mm % all possible inputs
bestCost = min( bestCost, S{ii}(uu,2) + gam*Ju( S{ii}(uu,1) ) );
end

Revision : 2003.10
else, bestCost = 0;
end
Jnext(ii) = bestCost;
end
Ju = Jnext; Jhist(:,iter) = Ju;
end
plot(1:maxi,Jhist,’-o’); grid on;
xlabel(’iteration’); ylabel(’cost to go estimate’);
for ii=1:length(S)
text(5 + 0.5*ii,Ju(ii)+0.5,char(’A’+ii-1));
end
title(’Dynamic programming example (see Fig 12.4 in Haykin’’s book)’);
print -depsc dynProgEx.eps
output
State A: input 1 -> state B, cost 2
State A: input 2 -> state C, cost 4
State A: input 3 -> state D, cost 3
State B: input 1 -> state E, cost 7
State B: input 2 -> state F, cost 4
State B: input 3 -> state G, cost 6
State C: input 1 -> state E, cost 3
State C: input 2 -> state F, cost 2
State C: input 3 -> state G, cost 4
State D: input 1 -> state E, cost 4
State D: input 2 -> state F, cost 1
State D: input 3 -> state G, cost 5
State E: input 1 -> state H, cost 1
State E: input 2 -> state I, cost 4
State F: input 1 -> state H, cost 6
State F: input 2 -> state I, cost 3
State G: input 1 -> state H, cost 3
State G: input 2 -> state I, cost 3
State H: input 1 -> state J, cost 3
State I: input 1 -> state J, cost 4
State J: no transitions
iteration 2
iteration 3
iteration 4
iteration 5
iteration 6
iteration 7
iteration 8

Revision : 2003.10
iteration 9
iteration 10
iteration 11
Dynamic programming example (see Fig 12.4 in Haykin’s book)

12
A B
10
D
8
C F
cost to go estimate
G
6
E I
4
H
J
0
1 2 3 4 5 6 7 8 9 10 11
iteration

Revision : 2003.10
37 Hopfield Networks
Training uses correlation matrix memory (see [Hay99], §2.11, p. 79) to estimate training
values.

Revision : 2003.10
A Appendix: Review of linear algebra

A.1 Vector stack function vec(·)
Define w̄ = vec (W ) where W ∈ IRm×n
B Appendix: Review of C-programming syntax re-

lated to neural nets
C Appendix: Review of MATLAB syntax related to
neural nets
C.1 Introduction
MATLAB is a commercial product of the Mathworks, http://www.mathworks.com, that is
an industry standard software tool in electromagnetics, signal processing, and control sys-
tems. A student version of MATLAB can be purchased at the Auburn University bookstore.
The objective of this section is to introduce the student to some of the basic features of
MATLAB that are relevant to problem solving with artificial neural networks.
C.1.1 Access to software

The commercial program MATLAB may be run on any of the computers on the engineer-
ing network, including Sun-workstations (Unix) and Windows computers. In order to use
Sun-workstations (or a sun-session by using X-windows over a broadband connection) it is
necessary to select MATLAB in the program user-setup under mathematical packages.
Alternatively, many researchers and corporations use the freely distributable software
package octave, a program similar to MATLAB that is available under the terms of the
Free Software Foundation Gnu Public License, or SciLab, developed in France, that provides
similar functionality to MATLAB. Neither of these programs is 100% MATLAB-compatible,
but they do provide a useful low-cost computing tool for those who wish to use them. See
http://www.octave.org and http://www.scilab.org for additional information. Instal-
lation of Octave under Windows requires use of the Cygwin environment11 . Installation of
Octave on Macintosh computers is most easily done with fink12 . Installation of Octave on
Linux-based machines is usually straightforward.13
This manual is written based on the use of MATLAB on the engineering network.
11
http://www.cygwin.com
12
http://fink.sourceforge.net
13
Open the tar ball, run the configure script, them type in make all ; make install

C.2 Mathematical preliminaries Revision : 2003.10
C.1.2 Software overview and tutorials

The chief advantage to using MATLAB (as opposed to a formal programming language such
as C, C++, or FORTRAN) is the easy of development and debugging in the MATLAB
environment. The chief disadvantage of using MATLAB is that it uses interpreted scripts,
or m-files, which can run much slower than comparable compiled programs.
Students who wish more information on the use of MATLAB may use any of the following
resources:
• S. J. Reeves, Beginning MATLAB for Engineers, College House Enterprises, 2001,

available at the Auburn University Bookstore.
• K. Sigmon, MATLAB Primer, available at
ftp://ftp.eng.auburn.edu/pub/sjreeves/matlab_primer_40.pdf
• The Mathworks on-line documentation at http://www.mathworks.com
• The user’s manual for the Student Version of MATLAB (if you purchase it).
• The (paperback) textbook by D. Etter, Engineering Problem Solving with MATLAB

(tm), Prentice-Hall.
The manuals for Octave and/or Scilab may be of some use to you as well. These may be
obtained with their respective source code distributions.
Advanced students may wish to write mex function interfaces to compiled language com-
puter programs; this topic is not addressed in this laboratory session.
C.2 Mathematical preliminaries

We examine the use of to analyze the currents and voltages for three circuits: a resistor
network, an RC low-pass filter circuit, and an RLC circuit, shown in Figures 48. The voltages
and currents in the resistor network in Figure 48(a) can be computed as the solution to a
set of linear equations
ai,1 v1 + ai,2 v2 + ai,3 i1 + ai,4 i2 + ai,5 i3 + ai,6 i4 + ai,7 i5 = bi (3.1)
for i = 1, ..., 7. For example, suppose for equation i = 1 that we apply Kirchoff’s current law
to the node with voltage v1 to obtain
i2 − i 3 − i 4 = 0 (3.2)
Equation (3.2) can be rewritten in the form of equation (3.1) by selecting the coefficients
a1,1 = a1,2 = a1,3 = a1,7 = b1 = 0, a1,4 = 1 and a1,5 = a1,6 = −1.

C.3 MATLAB basics: similarities to C Revision : 2003.10
3kΩ v1 7kΩ 1kΩ vo

v2
i1 i1
+ + i2
2kΩ us (t − 0.5)
5V 10µF
− −
(a) (b)
100kΩ 10µF
i1 (t)
+
+
us (t − 0.5) vo (t)
10kΩ 10mH
−
−
(c)
Figure 48: Circuit examples for use of MATLAB. Note us (t) refers to the unit step function
us (t) = 0 for t < 0, us (t) = 1 for t ≥ 0.
C MATLAB
double a, b, c[10], d[10]; % Comments start with a pct sign

double e[5][5]; % don’t declare variables in MATLAB
a = 1; /* comments look like this */ a = 1;
b = 2; b = 2;
c[0] = a; /* subscripts start at 0 */ c(1) = a; % subscripts start at 1
d[1] = b; d(2) = b;
e[2][2] = a+b; e(3,3) = a + b;
Figure 49: Assignment statements and subscripts in C and MATLAB.
C.3 MATLAB basics: similarities to C

The MATLAB program allows users to do many things that can be found in compiled pro-
gramming languages. Some examples are shown in Figures 49—52, which address assignment
statements, if statements, for and while statements, and switch statements, respectively.
C.3.1 Functions
Unlike C, MATLAB (usually) requires each function to be in its own text file, and the file
must end with extension .m. So, these are usually called “m-files.” M-file function text

MATLAB
C
if(x < y )
if(x < y ) % use fprintf and single quotes
{ % otherwise same as printf.
printf("%e < %e\n",x,y); fprintf(’%e < %e\n’,x,y);
} % else if changes to elseif
else if ( x != y ) % != changes to ~= (use ~ for "not")
{ elseif ( x ~= y )
/* can split cmds across lines */ % use ... to split a command across
printf("%e is different from %e\n", % lines
x,y); fprintf( ...
} ’%e is different from %e\n’, ...
x,y);
Figure 50: Flow control in C and MATLAB: if statements
C MATLAB
double sum = 0; sum = 0;

int ii; % 1:5 is a vector of integers
for( ii = 1 ; ii <= 10 ; ii++) % same as writing
{ % [1, 2, 3, 4, 5]
sum = sum + ii; for ii = 1:5
printf("%d: %d\n",ii,sum); sum = sum + ii;
fprintf(’%d: %d\n’,ii,sum);
}
while(sum > 5) end
{ while(sum > 5)
/* don’t write % double precision math
* sum /= 2; % automatically
* you’ll get a different result! */ sum /= 2;
sum /= 2.0; end
} % no semicolon: print variable name
% and value
printf("sum = %e\n",sum);
sum
Figure 51: Flow control in C and MATLAB: for, while statements
between “function” line and 1st statement is printed when you type in help function name
at MATLAB prompt. See Figure 53 for more detail.
C.3.2 Differences from C

There are two major differences between MATLAB m-files and compiled C-code (or compiled
code from other high level languages):
1. Compiled code will generally run faster than m-files. This is because m-files are not

C MATLAB
switch(x) switch(x)
{ case(0),
case 0; % no empty parenthesis in MATLAB
do_this(); do_this;
break; case(1),
case 1; do_that;
do_that(); otherwise,
break; % error function: returns to
default: % MATLAB prompt
printf("Bad case. x = %d\n",x); error(’Bad case. x=%d’,x);
} end
Figure 52: Flow control in C and MATLAB: switch
translated directly to machine code, but are interpreted by the MATLAB program.
2. Compiled code will generally take much longer to write than MATLAB m-files. This is
because MATLAB’s basic variable types are much more flexible than C-language and
because debugging tools in MATLAB are much easier to work with.
We illustrate some of the utility of MATLAB with the following examples.

w1
Example 3.1 dot products Given two three-dimensional vectors, w =  w2  and x =
  w3
x1
 x2 , their dot product is defined as
x3
3
∆
X
w · x = w 1 x1 + w 2 x2 + w 3 x3 = w i xi
i=1
C and MATLAB functions to compute the dot product of w and x are given below.
C
double dotProd3(const double w[], MATLAB : create file dotProd.m containing this
const double x[]) text.
{
int ii; function z = dotProd(w,x)
/* initialize dot prod % z = dotProd(w,x)
* in declaration*/ % return dot product of column
double retval = 0; % vectors w and x
for ( ii = 0 ; ii < 3 ; ii++)
retval += w[ii] * x[ii]; z = w’ * x; % w’ = transpose, row vector
return retval;
}

C MATLAB : create file is positive.m

int is_positive(double x) containing this text.
{ function y = is_positive(x)
if(x > 0) % y = is_positive(x)
return 1; % returns y = 1 if x is positive,
else % 0 otherwise
return 0;
} % short way; can also use "if" stmt
y = ( x > 0 );
/* use pointers to % no return statement needed
* return multiple values */
void stats(double * sum, MATLAB : create file stats.m contain-
double * mean, ing this text:
const double x,
const double y) function [sum, mean] = stats(x,y)
{ % [sum, mean] = stats(x,y)
*sum = x + y; % (put other comments here)
*mean = sum/2.0; sum = x+y;
} mean = sum/2;
Call these functions with Call these functions with
double a, b, c, d; c = 1;
int i; d = 2;
c = 1; i = is_positive(c);
d = 2; % MATLAB functions can return
i = is_positive(c); % many values at once!
stats( &a, &b, c, d); [a,b] = stats(c, d);
Figure 53: Function definition in C and MATLAB
Notice that the MATLAB code has
• no declarations
• no need to know that the vectors are of length 3!
• the line z = w’ * x; works with the entire vectors (arrays) w and x and not just their
individual components x(1), x(2), etc.

M-file 3.1 sqfour3.m
tt = linspace(-2*pi,2*pi,1000); % 1000 evenly spaced points between +/- 2 pi

yy = sin(tt) + sin(3*tt)/3; % get Fourier series
plot(tt,yy); % plot the waveform
xlabel(’time (s)’); % Dr. Gross says: "label your axes!"
ylabel(’y(t)’);
title(’Fourier series approx of sq. wave’);
grid on % add a grid to the plot for easy reading
print -depsc sqfour3.eps % save plot for the manual
Figure 54: Example of plotting in MATLAB (3rd order Fourier series approximation of a
square wave).
C.3.3 Graphical (plotting) concepts in MATLAB

Example 3.2 plotting signals Suppose it is wished to plot the 3-rd order Fourier series
approximation of a square wave,
1
f (t) = sin(t) + sin(3t)
3
A MATLAB m-file (stored in file sqfour3.m) to plot this function is shown in Figure 54.14
This m-file script is run in MATLAB by typing in the name of the script, sqfour3 at the
MATLAB prompt, and results in the plot shown in Figure 55.
C.3.4 Circuits problems in MATLAB

Resistor circuits typically result in a set of algebraic equations that are very easy to solve in
MATLAB. We illustrate the procedure first with an example:
Example 3.3 Recall the resistor circuit network in Figure 48(a):

3kΩ v1 7kΩ
v2
i1
+ 2kΩ
5V
−
The unknowns in this circuit problem are i1 , v1 , and v2 . From Ohm’s law and Kirchhoff’s
14
The C code is not included because this is an ECE course, not a CSSE course.

Fourier series approx of sq. wave

1
0.8
0.6
0.4
0.2
y(t)
−0.2
−0.4
−0.6
−0.8
−1
−8 −6 −4 −2 0 2 4 6 8
time (s)
Figure 55: MATLAB plot generated by m-file sqfour3.m, Example 3.2.
voltage law we have
v2 = 2000i1
v1 = 7000i1 + v2
(3000 + 7000 + 2000)i1 = 5
We rewrite these equations so that all unknowns appear on the left side of the equation and
all the constant terms are on the right side to obtain
v2 − 2000i1 = 0 (3.3)
v1 − 7000i1 + v2 = 0 (3.4)
12000i1 = 5 (3.5)
In order to put these equations into MATLAB, we need to look at each line of the equation
as if it was a dot product w · x, written in MATLAB as w’ * x. Here’s the procedure. First,
we define a vector (array) x of the unknowns in some order. For this example we’ll put them
in alphabetical order:  
i1
∆
x =  v1  (3.6)
v2

Then equation (3.3) can be written as

 
−2000
 0  · x = −2000i1 + 0v1 + v2 = 0 (3.7)
1
Similarly, equations (3.4) and (3.5) can be written respectively as

 
−7000
 1  · x = −7000i1 + v1 − v2 = 0 (3.8)
−1
and  
12000
 0  · x = 12000i1 + 0v1 + 0v2 = 5 (3.9)
0
Notice that in equations (3.7)–(3.9) that all of the vectors in the dot products (except for
x) are made up of known constants. We solve for the unknowns x by writing the constant
entries on theleft side of these equations into the rows of a two-dimensional array (called a
−2000 0 1
matrix) A =  −7000 1 −1  and the constants on the right side of the equations into a
12000 0 0
0
column vector b =  0 . We use MATLAB to solve the problem as shown below. Lines
5
beginning with >> are typed in by the user; the remaining lines are output to the screen by
MATLAB.
>> A = [-2000, 0, 1; -7000, 1, -1; 12000,0,0]
A = -2000 0 1
-7000 1 -1
12000 0 0
>> b = [0;0;5]
b = 0
0
5
>> x = A\b
x =
4.1667e-04
3.7500e+00
8.3333e-01
Remark 3.1 The command x = A\b tells MATLAB to compute a vector x so that the dot
product of row i of the matrix (2-D array) A with x matches component i of b.15
15
You will study this problem in much greater detail in your linear algebra class, where you will write
A ∗ x = b. The expression A*x refers to matrix-vector multiplication.

Remark 3.2 Notice that the MATLAB output for x does not show units. You as the
programmer have to remember that x corresponds to i1 = 41.667µA, v1 = 3.75V, and
v2 = 0.83333V.
Remark 3.3 Notice that rows of A and b are separated with semicolons ;, and that entries
on each row are separated with (optional) commas. The commas are not required, but
they’re a very good idea. To see this, type in the following commands (including spaces) at
the MATLAB prompt:
• x = [1 , - 2 ]
• x = [1 - 2]
These do not give the same answer! To avoid ambiguity, it’s a good idea to use commas
and/or parentheses, e.g., x = [1, ( 3 - 4 ) , 5] does the same thing as x = [ 1 3 - 4
5], but there’s no question what the first one is supposed to do.16
C.3.5 Differential equations in MATLAB

MATLAB provides two approaches for the simulation of differential equations. One is a
differential equation solver function, ode45. The other is a graphical simulation tool (block
diagram editor) called Simulink.17 Simulink will be discussed in the controls course, ELEC
3500. For this experiment, we will present only the ode45 solver, which is used to simulate
differential equations of the form
dy
= f (t, y) (3.10)
dt
with initial conditions y(t0 ) = y0 .
Since all ELEC 2020 students have had differential equations, the form of equation (3.10)
should raise some questions, such as “what if a differential equation has two derivatives in
it instead of one?” For example, many physical systems can be modeled by the differential
equation
d2 d
2
y(t) = a1 y(t) + a0 y(t) + u(t) (3.11)
dt dt
where u(t) is some input function. Since ode45 only accepts one derivative, at first glance
it appears that ode45 cannot be used to solve differential equations of the form (3.11).
However, with a mild change of notation, the user can employ ode45 to solve differential
equations of arbitrary degree.
Consider again equation equation (3.11), and define a new variable v(t) = dtd y(t). Then
2
the second derivative of y is ddt y(t) = dtd v(t), the first derivative of v. We can then write two
16
Dr. Hodel says: Never, ever trust a computer!
17
Octave and Scilab each provide tools comparable to ode45, but as of this writing there is no freely
available graphic user interface-based for simulation of differential equations.

first order differential equations:

d
y(t) = v(t) (3.12)
dt
d d
v(t) = a1 y(t) + a0 y(t) + u(t)
dt dt
= a1 v(t) + a0 y(t) + u(t) (3.13)
Notice that each of the equations (3.12) and (3.13) are organized so that all derivatives dtd
are on the left side of the equations and all derivatives are first order; that is, each of these
equations is “pretty close” to the form (3.10) that is required by ode4.
We can finish putting equations (3.12) and (3.13)
in the required form by changing our
y(t)
notation as follows. Define a vector x(t) = , and define the derivative of x as
v(t)
d y(t) ∆ dtd y(t)

d
x(t) = = d
dt dt v(t) dt
v(t)
Then, from equations (3.12) and (3.13), we can write

d d y(t) v(t) ∆ f1 (t, x) ∆
x(t) = = = = f (t, x) (3.14)
dt dt v(t) a1 v(t) + a0 y(t) + u(t) f2 (t, x)
where now the variable x and the function f are vectors, not scalars. While this concept can
be tricky for the first-time user, MATLAB’s features are designed to take advantage of this
sort of notation.
We’ll illustrate this process further in an example.
Example 3.4 Use of ode45 Consider the differential equation
d2 d
y(t) = −2 y(t) − 2y(t) + sin(t) (3.15)
dt2 dt
d
We wish to simulate equation (3.15) for 15 seconds with initial conditions y(0) = dt
y(0) = 0.
First, we rewrite equation (3.15) as a pair of first-order differential equations:
d
y(t) = v(t)
dt
d
v(t) = −2v(t) − 2y(t) + sin(t)
dt

y(t)
Define the vector x(t) = so that we can write
v(t)

d d y(t) v(t)
x(t) = = (3.16)
dt dt v(t) −2v(t) − 2y(t) + sin(t)

y(0) 0
with initial conditions x(0) = = . Now, examine the MATLAB on-line
v(0) 0
documentation for ode45:

M-file 3.2 odeExample.m
function dx = odeExample(t,x)
% dx = odeExample(t,x)
% derivatives function to simulate
% y’’(t) = 2 y’(t) - 2 y(t) + sin(t)
y = x(1);
v = x(2);
dy = v;
dv = -2*v - 2*y + sin(t);
dx = [dy ; dv];
% could also do this in one line:

% dx = [ x(2) ; 2*x(2) - 2*x(1) + sin(t) ];
Figure 56: Derivatives function odeExample for Example 3.4.
>> help ode45
ODE45 Solve non-stiff differential equations, medium order method.

[T,X] = ODE45(ODEFUN,TSPAN,X0) with TSPAN = [T0 TFINAL] integrates the
system of differential equations y’ = f(t,y) from time T0 to TFINAL with
initial conditions X0. Function ODEFUN(T,X) must return a column vector
corresponding to f(t,y). Each row in the solution array X corresponds to
a time returned in the column vector T. To obtain solutions at specific
times T0,T1,...,TFINAL (all increasing or all decreasing), use
TSPAN = [T0 T1 ... TFINAL].
Notice that ode45 requires three inputs: (1) a function ODEFUN= f (t, x) that, given current
conditions x(t), will return the current state derivatives, (2) a vector TSPAN of time values at
which we want to compute values of x(t), and (3) X0 = x(t0 ), a vector of initial conditions.
Item (1) is implemented as an m-file function shown in Figure 56. We can then simulate the
function with ode45 by writing and executing the m-file script odeExampleMain.m in Figure
57. For clarity, we discuss line by line the main routine odeExampleMain below:
1. % simulate a differential equation with zero initial conditions for 15 sec

As discussed before, this first line is a comment. Document your code with meaningful
comments!
2. tspan = linspace(0,15);
This line creates a variable tspan that has one row of 100 numbers that are evenly
spaced from 0.0 to 15.0. Try typing in tspan = linspace(0,15) without the semi-
colon. MATLAB will print out all 100 entries of tspan to the screen for you to see.

M-file 3.3 odeExampleMain.m
% simulate a differential equation with zero initial conditions for 15 sec

tspan = linspace(0,15);
x0 = [0;0];
[tt,xx] = ode45(’odeExample’,tspan,x0);
plot(tt,xx);
grid on
legend(’y’,’v = dy/dt’);
print -depsc odeExampleMain.eps
Figure 57: Main m-file script odeExampleMain for Example 3.4.
If you want tspan to have a different number of points, then use a third argument
to the linspace function, for example, tspan = linspace(0,15,10); will give you
10 points and tspan = linspace(0,15,16); will give you the 16 numbers 0,1,2,...,15,
etc. Since tspan has only one row, it’s often called a row vector.
We will use the variable tspan to tell ode45 what time instants we want to have in
our simulation.
3. x0 = [0;0];

0
This line creates a variable x0 = that has two rows and one column. x0 is used
0
to tell ode45 the initial conditions for the differential equation.
4. [tt,xx] = ode45(’odeExample’,tspan,x0);
This is where ode45 simulates the differential equations. Notice that the name of
odeExample must be written in quotes in the call to ode45, i.e., ode45(’odeExampleMain’,tspan,x0)
Notice that ode45 takes advantage of MATLAB’s ability to return more than one
variable. The variable tt is returned as a column vector (an array with only one
column) that has 100 entries, the same entries we specified in tspan. The variables xx
is returned as a 2-dimensional array with 100 rows and two columns. The first column
is the values of v(t) = dy(t)/dt at the times in tt. That is, xx(ii, 1) = v(tt(ii)).
We use this knowledge in the plot command that comes next. Similarly, the second
column of xx is the values of y(t) at the times in tt.
A reasonable question for the reader to ask is “How do you know which column in xx
goes with which variable in the differential equations?” The answer is that I wrote the
routine odeExample.m,
and so I know that that routine routine interprets the vector x
v(t)
as x(t) = , so each row in the vector xx returned by ode45 is the value of x(t)
y(t)
at time tt(ii). (It doesn’t matter that odeExample.m uses column vectors; ode45
will output its data in this format.)

Another reasonable question is “why do you use variable names like tt and xx instead
of t and x.” The answer is that it makes it a lot easier to debug programs and m-files.
If I search for the letter t in odeExampleMain.m I will find it in many places, including
in the comment on the first line. However, if I search for tt instead, I will (usually)
only find the variable that I’m looking for.
5. plot(tt,xx);
This line opens a graphic window and plots the data on the screen, xx as a function of
tt. This is a “lazy” way to generate the plot. I could also have plotted one waveform
at a time by typing
plot(tt,xx(:,1),’-’, tt,xx(:,2),’-’);
The notation xx(:,1) means “the first column of xx” and xx(:,2) means “the second
column of xx.” The ’-’ is a line style command. You can learn more about line styles
by typing in help plot at the MATLAB prompt.
6. grid on
The plot command only puts the wave forms on the graph. This command puts up
“graph paper” on the window to make the plot easier to read.
7. legend(’v = dy/dt’,’y’);
8. xlabel(’time (s)’)
Always label your axes. If possible, include units. In fact, it’s a good idea to include
units in your m-file (and C-program) variable names.18
The legend command puts a legend in the graph window so that you know which color
(or line style) goes with which variable. The xlabel command labels the x axis. There
is also a ylabel command that you can use instead of legend if you’re only plotting
one waveform in a window.
9. print -depsc odeExampleMain.eps
This command is used to store the plot in color .eps format so that we can include it
in this manual. The resulting plot is shown in Figure 58.
Remark 3.4 This manual was written in LATEX a free mathematical typesetting lan-
guage (not quite a word processor) that is included with Linux, is installed on the
Engineering Sun Network, and can be downloaded with the cygwin environment to a
windows machine or installed with fink on Macintosh. Microsoft Word users should
print to jpg files or some other image format. Type in help print at the MATLAB
prompt for more options.
2
18
While working on leave at NASA, the writer of this chapter once spent a month comparing two sim-
ulations that did not match because one simulation used degrees for angles and the other used radians,
but both used the same data tables. The problem was caught and fixed, and a lesson learned. Label your
variables/axes and document your code!

C.4 Things to try on your own Revision : 2003.10
0.5
y
v = dy/dt
0.4
0.3
0.2
0.1
−0.1
−0.2
−0.3
−0.4
−0.5
0 5 10 15
time (s)
Figure 58: MATLAB plot generated by m-file odeExampleMain.m, Example 3.4.
C.4 Things to try on your own

1. Write an expression for a 7th order Fourier series approximation of a square wave (see
Example 3.2).
2. Write the differential equation(s) describing the behavior of the circuit in Figure 48(b).
Derive by hand the output vo (t) for the circuit in Figure 48(b) given that vo (0) = 0V.
3. Write the differential equation(s) describing the behavior of the circuit in Figure 48(c).
Derive by hand the output vo (t) for the circuit in Figure 48(c) given that vo (0) = 0V
and that i1 (0) = 0A.
4. Plot a 7th order Fourier series approximation of a square wave.
(a) Start MATLAB on a College of Engineering computer. If you’re using a Windows

machine, MATLAB is listed under the “start” menu. If you’re using a Unix-based
machine (or you’re running MATLAB over the phone line from home under X-
windows19 ) type in the command matlab at the Unix prompt.
19
You can get the code for X-windows at http://www.cygwin.com; the remainder of this manual is written
based on the assumption that you’re working on-campus.

Revision : 2003.10
(b) at the MATLAB prompt, type in edit sqfour7. This will open a new editor
window. Type in the m-file in Figure 54 and the modify it to plot a 7th order
Fourier series approximation of a square wave. Use the MATLAB command help
print to see how to print the plot to a printer instead of storing it in a file.
(Alternatively, you can save the plot as a .jpg file and import it into MS Word.)
5. Use MATLAB to calculate the voltages and currents in the circuit shown in Figure
48(a).
6. Use MATLAB to simulate the capacitor voltage in Figure 48(b) for 2 seconds. Turn
in your m-file(s) and a plot of the capacitor voltage.
7. Use MATLAB to simulate the capacitor voltage and inductor current in Figure 48(c)
for 2 seconds. Turn in your m-file(s) and a plot of the capacitor voltage and inductor
current.
D Appendix: source code

D.1 Utility m-files not listed elsewhere
M-file 4.1 backPropSig.m
rand(’seed’,pi); % seed random number generator
mex nnsig.c; % comment these out to skip compile

mex neteval.c
% construct target function

nx = 11; xx = linspace(-1, 2, nx)’;
ny = 11; yy = linspace(-1, 2, ny)’;
dm = 0.1 + 0.9*( xm >= 0 & xm <=1 & ym >= 0 & ym <= 1);
% network: use (nh) hidden nodes and one output node

a = 5; % activation function parameter
nh = 9; no = 1; % network dimensions
W1 = 5*(rand(nh,3)-0.5);
W2 = 5*(rand(no,nh+1)-0.5);
% initial output, error surfaces

[om, em] = neteval(xx, yy, dm, W1, W2,a);
figure(3); mesh(xx,yy, om); title(’Initial surface’);
figure(4); mesh(xx,yy,em); title(’Initial error surface’);

D.1 Utility m-files not listed elsewhere Revision : 2003.10

icnt = 1; Errv(icnt) = norm(em,’fro’)^2;
icnt = icnt+1; % icnt = "iteration count"

eta = 0.02; % execute backprop algorithm with eta selected
mov = avifile(’backPropSig.avi’);
mov2 = avifile(’backPropSigSep.avi’);
maxIter = 400;
fprintf(’%4d %s %e\n’,iter,datestr(now),Errv(icnt-1));
for ii=1:nx
for jj = 1:ny
y0 = [xx(ii); yy(jj)]; % input vector
[y2,y1] = nnsig(y0, W1, W2, a);
v1 = W1*[1;y0];
v2 = W2*[1;y1];
% perform backprop step

e = dm(jj,ii) - y2;
d2 = -e .* dphi(v2,a);
DW2 = d2*[1;y1]’;
W2_times_d2 = W2’ * d2;

d1 = -(W2_times_d2(2:(nh+1))) .* dphi(v1,a) ;
DW1 = d1*[1;y0]’;
W2 = W2 - eta*DW2;
W1 = W1 - eta*DW1;
% evaluate error surface

[om,em] = neteval(xx,yy,dm, W1, W2,a);
Errv(icnt) = norm(em,’fro’)^2;
icnt = icnt + 1;
end
end
figure(6); mesh(xx,yy, om);

title(sprintf(’Output surface iteration %d’,iter));
axis([-1,2,-1,2,0,1.1]);
text(-0.5,2,0.8,sprintf(’current error =%12.4g’,Errv(icnt-1)));
text(1.5,-0.5,0.1,’x_1 value’);
text(-0.5,1.5,0.1,’x_2 value’);

A = getframe; mov = addframe(mov,A);
% plot boundary lines, compute color based on slope of line

xp = linspace(-5,5,100)’;
yp = zeros(100,nh);
for ip = 1:nh
if(W1(ip,3) == 0)
denom = 1e-6;
else
denom = W1(ip,3);
end
yp(:,ip) = -(W1(ip,2)*xp + W1(ip,1))/denom;
linAng = atan2(W2(ip+1)*W1(ip,3),W2(ip+1)*W1(ip,2))*180/pi;
if(-45 < linAng & linAng <= 45), pcolor{ip} = ’g-’; %left
elseif(45 < linAng & linAng <= 135), pcolor{ip} = ’c-’; %bottom
elseif(135 < linAng & linAng <= 225), pcolor{ip} = ’r-’; %right
elseif(-135 < linAng & linAng <= -45), pcolor{ip} = ’b-’; %top
elseif(-225 < linAng & linAng <= -135), pcolor{ip} = ’r-’; %right
end
end
if(iter == maxIter)
end
figure(8);
plot( ...
xp,yp(:,9),pcolor{9}, ...
-5:4,W2,’o’);
text(-5,4.0,sprintf(’iteration %d’,iter));
text(-5,3.6,sprintf(’Blue : top border’));
text(-5,3.2,sprintf(’Green: left border’));
text(-5,2.8,sprintf(’Red : right border’));
text(-5,2.4,sprintf(’Cyan : bottom border’));
for ip = 1:nh
% label line at far left side
idx = max(find(abs(yp(:,ip)) < 4.8) );
text(xp(idx),yp(idx,ip),sprintf(’line %d’,ip));
end
for ip = 0:nh

text(ip-5,W2(ip+1)+0.2,sprintf(’W2(%d)’,ip));
end
% x-axis labels
for lp = -6:2:4
text(lp,-4.5,sprintf(’%d’,lp));
end
% y-axis labels
for lp = -5:1:5
text(-6.1,lp,sprintf(’%d’,lp));
end
xlabel(’x1 value’);
ylabel(’x2 value/W2’);
title(’Layer 1 linear separation boundaries’);
grid on;
axis([-5,5,-5,5]);
axis(’equal’);
A = getframe; mov2 = addframe(mov2,A);
if(iter == maxIter)
figure(7); plot(Errv); grid on;
title(sprintf(’Error function per backprop step’))
end
end
mov = close(mov);
mov2 = close(mov2);
M-file 4.2 mlpEvalT.m
function [yy, Errv] = mlpEvalT(inData, W1, W2, a,b)

% function [yy, Errv] = mlpEvalT(inData, W1, W2, a,b)
% evaluate network at all sample points
% dm = nx x ny matrix of desired output values
% use exponential function with parameter a
nx = size(W1,2)-1;
ny = size(W2,1);
N = size(inData,1);
% present each vector to the network

for nn=1:N
yy(nn,1:ny) = reshape(mlpT(inData(nn,1:nx),W1,W2,a,b),1,ny);
end

% calculate error values

Errv = yy - inData(:,nx+(1:ny));
M-file 4.3 mlpT.m
function [y,h] =mlpT(x,w1,w2,a,b)

% function [y,h] =mlpT(x,w1,w2,a,b)
% return hidden layer values h and output layer values y for
% weights w1 and w2 with input x
% activation function is hyperbolic tangent with parameters a,b
% bias value of 1 is appended to both x and hidden layer value vector
x = reshape(x,length(x),1);
v1 = w1*[1;x]; % append bias
h = phiT(v1,a,b);
v2 = w2*[1;h]; % append bias

y = phiT(v2,a,b);
M-file 4.4 sysIdEx1PlotErr.m
function plotErr(d,X,W,tstr);
% plot error of current fit
nn = 1:length(d);
fn = figure;
plot(nn,d,’x’, nn,W*X,’+’, nn,d-W*X,’o’);
xlabel(’sample number’);
legend(’desired output’,’actual output’,’error’);
grid on;
title(tstr);
return

REFERENCES Revision : 2003.10
References
[Hay99] Simon Haykin. Neural Networks: A Comprehensive Foundation. Prentice Hall, 2nd
edition, 1999.
[Heb49] D. O. Hebb. The Organization of Behavior: A Neuropsychological Theory. Wiley,

New York, 1949.
[Lue69] D. G. Luenberger. Optimization by Vector Space Methods. Wiley and Sons, Inc.,
New York, NY, 1969.
[MP43] W. S. McCullough and W. Pitts. A logical calculus of the ideas of the ideas imma-
nent in nervous activity. Bulletin of Mathematical Biophysics, 5:115–133, 1943.
[MP88] M. Minsky and S. Papert. Perceptrons: An Introduction to Computational Geome-

try. MIT Press, 1988. Expanded Edition.
[NN94] Yurii Nesterov and Arkadii Nemirovskii. Interior-Point Polynomial Algorithms in

Convex Programming. SIAM, 1994.
[Ros58] F. Rosenblatt. The perceptron: A probabilistic model for information storage and
organization in the brain. Psychological Review, 65:386–408, 1958.
[SN96] Gilbert Strang and Truong Nguyen. Wavelets and Filter Banks. Wellesley Cam-
bridge Press, wellesley, mass. edition, 1996.
[WH] B. Widrow and M. E. Hoff, Jr. Adaptive switching circuits. In IRE WESCON
Convention Record, pages 96–104.

Index
cfile 3.1 activation.m , 20
22.9 dphiT.c , 143 25.1 backPropEx2.m , 157
21.5 layerprop.c , 127 25.5 backPropEx3.m , 168
22.10 matmul.c , 147 27.1 backPropEx5.m , 183
7.4 mexPrintMat.c , 40 24.2 backPropNormalize.m , 153
6.1 mextanh.c , 33 24.1 backPropNormTest.m , 152
21.7 neteval.c , 130 22.4 backPropRand.m , 149
21.6 nnsig.c , 128 4.1 backPropSig.m , 245
7.3 phiExV.c , 38 21.3 backPropSig1.m , 132
22.8 phiT.c , 142 25.3 backPropStep.m , 164
6.2 simpletanh.c , 33 9.1 corrMemEx1.m , 49
dphiT.c 22.9 , 143 19.3 derivS.m , 109
layerprop.c 21.5 , 127 19.2 dphi.m , 109
matmul.c 22.10 , 147 19.5 dphiT.m , 110
mexPrintMat.c 7.4 , 40 36.1 dynProgEx.m , 226
mextanh.c 6.1 , 33 18.1 exam2003BrainDead.m , 105
neteval.c 21.7 , 130 18.2 exam2003BrainDeadFix.m , 106
nnsig.c 21.6 , 128 2.1 hw0102chk.m , 10
phiExV.c 7.3 , 38 9.3 hwk0218.m , 54
phiT.c 22.8 , 142 9.4 hwk0220.m , 55
simpletanh.c 6.2 , 33 9.5 hwk0220a.m , 57
mextanh.c, 141 12.1 hwk0302.m , 67
phiExV.c, 141 12.2 hwk304.m , 75
simpletanh.c, 141 22.1 hwk502.m , 144
Course information, 7 27.2 invPendS.m , 187
9.2 learnTaskEx1s.m , 53
Exam 1 information, 93
22.3 matmultest.m , 148
Homework 13.2 meshPex.m , 82
Homework 1 Assigned Aug 20, 10 13.1 meshPexPlot.m , 82
Homework 2 2, 35 22.2 meshPexT.m , 146
Homework 3 Assigned Sept. 10, 53 7.1 mexExV.m , 36
Homework 4 Assigned Sept. 17, 72 7.2 mexPrintMatEx.m , 39
Homework 5 Due Fri Oct. 10, 112 4.2 mlpEvalT.m , 248
Homework 5 solution, 141 25.4 mlpNormalize.m , 164
Homework 6 Due Wed Oct. 22, 151 4.3 mlpT.m , 249
Homework 6 solution, 183 25.2 mlpTrain.m , 163
Homework 7 Due Fri Oct. 31, 195 21.2 neteval.m , 127
Homework 7 Due Mon Oct. 27, 175 4.1 neuronEx1.m , 25
Homework 7 solution, 194 4.2 neuronEx2.m , 28
Homework 8 solution, 206 21.1 nnsig.m , 126
27.3 nnSysIdEx.m , 189
M-files
251
INDEX Revision : 2003.10
3.2 odeExample.m , 241 hwk0220a.m (M-file 9.5) , 57

3.3 odeExampleMain.m , 242 hwk0302.m (M-file 12.1) , 67
34.1 pcaEx1.m , 214 hwk304.m (M-file 12.2) , 75
13.4 perceptron2Dlearn.m , 88 hwk502.m (M-file 22.1) , 144
13.3 perEx.m , 87 invPendS.m (M-file 27.2) , 187
19.1 phi.m , 109 learnTaskEx1s.m (M-file 9.2) , 53
19.4 phiT.m , 110 matmultest.m (M-file 22.3) , 148
26.1 radialEx00.m , 177 meshPex.m (M-file 13.2) , 82
26.2 radialEx01.m , 179 meshPexPlot.m (M-file 13.1) , 82
26.3 radialEx02.m , 181 meshPexT.m (M-file 22.2) , 146
29.4 rbf0502.m , 198 mexExV.m (M-file 7.1) , 36
29.5 rbfSig.m , 202 mexPrintMatEx.m (M-file 7.2) , 39
3.1 sqfour3.m , 236 mlpEvalT.m (M-file 4.2) , 248
2.2 square1.m , 11 mlpNormalize.m (M-file 25.4) , 164
20.5 sysIdEx1.m , 122 mlpT.m (M-file 4.3) , 249
20.1 sysIdEx1GetData.m , 114 mlpTrain.m (M-file 25.2) , 163
20.3 sysIdEx1GradOpt.m , 118 neteval.m (M-file 21.2) , 127
20.4 sysIdEx1nnOpt.m , 120 neuronEx1.m (M-file 4.1) , 25
4.4 sysIdEx1PlotErr.m , 249 neuronEx2.m (M-file 4.2) , 28
20.2 sysIdEx1Wopt.m , 116 nnsig.m (M-file 21.1) , 126
activation.m (M-file 3.1) , 20 nnSysIdEx.m (M-file 27.3) , 189
backPropEx2.m (M-file 25.1) , 157 odeExample.m (M-file 3.2) , 241
backPropEx3.m (M-file 25.5) , 168 odeExampleMain.m (M-file 3.3) , 242
backPropEx5.m (M-file 27.1) , 183 pcaEx1.m (M-file 34.1) , 214
backPropNormalize.m (M-file 24.2) perceptron2Dlearn.m (M-file 13.4)
, 153 , 88
backPropNormTest.m (M-file 24.1) perEx.m (M-file 13.3) , 87
, 152 phi.m (M-file 19.1) , 109
backPropRand.m (M-file 22.4) , 149 phiT.m (M-file 19.4) , 110
backPropSig.m (M-file 4.1) , 245 radialEx00.m (M-file 26.1) , 177
backPropSig1.m (M-file 21.3) , 132 radialEx01.m (M-file 26.2) , 179
backPropStep.m (M-file 25.3) , 164 radialEx02.m (M-file 26.3) , 181
corrMemEx1.m (M-file 9.1) , 49 rbf0502.m (M-file 29.4) , 198
derivS.m (M-file 19.3) , 109 rbfSig.m (M-file 29.5) , 202
dphi.m (M-file 19.2) , 109 sqfour3.m (M-file 3.1) , 236
dphiT.m (M-file 19.5) , 110 square1.m (M-file 2.2) , 11
dynProgEx.m (M-file 36.1) , 226 sysIdEx1.m (M-file 20.5) , 122
exam2003BrainDead.m (M-file 18.1) sysIdEx1GetData.m (M-file 20.1) ,
, 105 114
exam2003BrainDeadFix.m (M-file 18.2) sysIdEx1GradOpt.m (M-file 20.3) ,
, 106 118
hw0102chk.m (M-file 2.1) , 10 sysIdEx1nnOpt.m (M-file 20.4) , 120
hwk0218.m (M-file 9.3) , 54 sysIdEx1PlotErr.m (M-file 4.4) , 249
hwk0220.m (M-file 9.4) , 55 sysIdEx1Wopt.m (M-file 20.2) , 116

INDEX Revision : 2003.10
backPropEx2.m, 166, 186

backPropEx3.m, 170
backPropSig1.m, 132
backPropStep.m, 151
dphiT.m, 151
hwk0302.m, 73
hwk304.m, 75
learnTaskEx1s.m, 151
meshPex.m, 112
mlpEvalT.m, 151
mlpNormalize.m, 151
mlpT.m, 151
mlpTrain.m, 151
phiT.m, 112, 151
sysIdEx1GetData.m, 114
sysIdEx1Wopt.m, 116
matrix calculus identities
Lemma 11.1, 65, 101
separability
φ-separable, 176
separating surface, 176
vector stack of a matrix, 65, 153, 230

ELEC 6240: Neural Networks

Uploaded by

Document Information

Original Description:

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

ELEC 6240: Neural Networks

Uploaded by

Copyright:

Available Formats

ELEC 6240: Neural Networks

A. S. Hodel, Dept. ECE, Auburn University hodelas@auburn.edu

2 2003 08 20: Introduction 9

3 2003 08 22 Neuron models 19

4 2003 08 25 Network architectures 24

5 2003 08 29 Knowledge representation 31

6 2003 09 03 Mex files in MATLAB 33

7 2003 09 05 More on MATLAB 36

8 2003 09 08 Learning processes 41

10 2003 09 12 Single layer networks 60

Dept. ECE, Auburn Univ. 2 ELEC 6240/Hodel: Fa 2003

11 2003 09 15 Optimization and Neural Networks 64

12 2003 09 17: Gradient based learning methods 67

13 2003 09 19: More gradient based learning 82

14 2003 09 22: Single Layer Perceptrons (conclusion) 90

15 2003 09 24 Multi-layer perceptrons 94

16 2003 09 26: Review of Homework 4. 99

17 2003 09 29: More review for Exam 1 100

18 2003 10 01 Exam 1 101

19 2003 10 03 Multi-layered perceptrons: continued 107

20 2003 10 06: Single layer linear system i.d. example 113

Dept. ECE, Auburn Univ. 3 ELEC 6240/Hodel: Fa 2003

20.3 Solution method 2: Steepest descent iteration . . . . . . . . . . . . . . . . . 118

21 2003 10 08: MLP 123

22 2003 10 10: Techniques to improve training - 1 141

23 2003 10 13: Techniques to improve training - 2 150

24 2003 10 15: Techniques to improve training - 3 152

25 2003 10 17 Example: training with the unit square 155

26 2003 10 20: Radial Basis Function Networks 173

Dept. ECE, Auburn Univ. 4 ELEC 6240/Hodel: Fa 2003

27 2003 10 22: Radial Basis Functions (2) 183

28 2003 10 24; RBF’s (3) 193

29 2003 10 27 RBF’s (4) 194

30 2003 10 31: RBF’s (cont’d) 206

31 2003 10 31 Principal Components Analysis 207

32 Exam 2 solutions 208

33 2003 11 05: PCA (2) 213

34 2003 11 07: PCA (3) 214

35 Self Organizing Maps 225

36 2003 11 19 Neurodynamic programming Example 226

37 Hopfield Networks 229

A Appendix: Review of linear algebra 230

Dept. ECE, Auburn Univ. 5 ELEC 6240/Hodel: Fa 2003

B Appendix: Review of C-programming syntax related to neural nets 230

C Appendix: Review of MATLAB syntax related to neural nets 230

D Appendix: source code 245

Dept. ECE, Auburn Univ. 6 ELEC 6240/Hodel: Fa 2003

Grader Adam T. Simmons simmoat@eng.auburn.edu. No office hours.

Textbook Simon Haykin. Neural Networks: A Comprehensive Foundation. Prentice Hall,

[Hay99] Simon Haykin. Neural Networks: A Comprehensive Foundation. Prentice Hall,

Topics Covered in the course:

Dept. ECE, Auburn Univ. 7 ELEC 6240/Hodel: Fa 2003

Dept. ECE, Auburn Univ. 8 ELEC 6240/Hodel: Fa 2003

2 2003 08 20: Introduction

Applications: optimization, classification, function approximation

Question 2.1 How to ensure that adaptation “makes sense?”

• evidential response: classification and confidence of classification both available

• fault tolerance (in hardware form)

Fundamental building block: the neuron

Dept. ECE, Auburn Univ. 9 ELEC 6240/Hodel: Fa 2003

Due Fri Aug. 29. Hwk

1. Problem 1.1 in [Hay99], p. 45.

dp2 = a * ( (1+eav).(eav) + (1-eav).(eav)) ...