
ELEC 6240: Neural Networks

A. S. Hodel, Dept. ECE, Auburn University hodelas@auburn.edu


http://www.eng.auburn.edu/users/hodelas
ftp://ftp.eng.auburn.edu/pub/hodel
This notes set is in progress. This is Revision : 2003.10, November 19, 2003


Contents
1 Course overview 7

2 2003 08 20: Introduction 9

3 2003 08 22 Neuron models 19


3.1 Neuron models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.2 NN’s as directed graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.3 Feedback . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23

4 2003 08 25 Network architectures 24


4.1 Network architectures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
4.2 Neural network: an interconnection of neurons . . . . . . . . . . . . . . . . . 24

5 2003 08 29 Knowledge representation 31

6 2003 09 03 Mex files in MATLAB 33

7 2003 09 05 More on MATLAB 36


7.1 Scalar functions of scalar variables . . . . . . . . . . . . . . . . . . . . . . . . 36
7.2 Functions of vectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
7.3 What about MATLAB and matrices? . . . . . . . . . . . . . . . . . . . . . . 39

8 2003 09 08 Learning processes 41


8.1 Error-correction learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
8.2 Memory Based learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
8.3 Hebbian learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
8.4 Competitive learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
8.5 Boltzmann learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
8.6 Credit assignment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
8.7 Learning with a teacher . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
8.8 Learning without a teacher . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45

9 2003 09 10 46
9.1 Learning tasks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
9.2 Memory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47

10 2003 09 12 Single layer networks 60


10.1 From last time: Linear associative memory (LAM) . . . . . . . . . . . . . . 60
10.2 Adaptation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
10.3 Performance issues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
10.4 Single layer perceptrons: “In the beginning ....” . . . . . . . . . . . . . . . . 62
10.5 Adaptive filtering interpretation . . . . . . . . . . . . . . . . . . . . . . . . . 62


11 2003 09 15 Optimization and Neural Networks 64


11.1 What you should have seen in your linear algebra class . . . . . . . . . . . . 64
11.2 Unconstrained optimization techniques . . . . . . . . . . . . . . . . . . . . . 64

12 2003 09 17: Gradient based learning methods 67


12.1 Steepest Descent . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
12.2 Newton Iteration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71

13 2003 09 19: More gradient based learning 82


13.1 A matlab m-file example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
13.2 Gauss Newton . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
13.3 Perceptrons . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85

14 2003 09 22: Single Layer Perceptrons (conclusion) 90


14.1 The perceptron convergence theorem . . . . . . . . . . . . . . . . . . . . . . 90
14.2 Other activation functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
14.3 Exam 1 information . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
14.3.1 Topics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
14.3.2 Permitted resources . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93

15 2003 09 24 Multi-layer perceptrons 94


15.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
15.2 Back-propagation algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
15.3 Output layer update . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96

16 2003 09 26: Review of Homework 4. 99


16.1 After class comments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
16.2 Example in office . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99

17 2003 09 29: More review for Exam 1 100

18 2003 10 01 Exam 1 101


18.1 The questions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
18.2 The answers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106

19 2003 10 03 Multi-layered perceptrons: continued 107


19.1 Backpropagation algorithm: review of output layer . . . . . . . . . . . . . . 107
19.2 Backpropagation: Hidden layer update . . . . . . . . . . . . . . . . . . . . . 108
19.3 Exponential activation function . . . . . . . . . . . . . . . . . . . . . . . . . 109
19.4 tanh activation function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110

20 2003 10 06: Single layer linear system i.d. example 113


20.1 Goal . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
20.1.1 Vehicle data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
20.1.2 Assumption . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
20.2 Solution method 1: direct solution . . . . . . . . . . . . . . . . . . . . . . . . 116


20.3 Solution method 2: Steepest descent iteration . . . . . . . . . . . . . . . . . 118


20.4 Solution method 3: Backpropagation . . . . . . . . . . . . . . . . . . . . . . 120
20.5 Summary of results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122

21 2003 10 08: MLP 123


21.1 Derivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
21.1.1 Output layer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
21.1.2 Hidden layer(s) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
21.2 Example (not covered in class) . . . . . . . . . . . . . . . . . . . . . . . . . . 126
21.2.1 Utility m-files . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
21.2.2 C-language implementation . . . . . . . . . . . . . . . . . . . . . . . 127
21.3 Simple backprop without any preprocessing . . . . . . . . . . . . . . . . . . 132

22 2003 10 10: Techniques to improve training - 1 141


22.1 Homework 5 solution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
22.1.1 Derivations and plots . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
22.1.2 Source code: problem 1 . . . . . . . . . . . . . . . . . . . . . . . . . . 142
22.1.3 Source code: problem 2 . . . . . . . . . . . . . . . . . . . . . . . . . . 144
22.1.4 Source code: problem 3 . . . . . . . . . . . . . . . . . . . . . . . . . . 147
22.2 Techniques to improve training . . . . . . . . . . . . . . . . . . . . . . . . . 149
22.3 Activation function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149
22.4 Randomize sample order . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149

23 2003 10 13: Techniques to improve training - 2 150


23.1 Present “worst case” data . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
23.2 Momentum term - generalized delta rule . . . . . . . . . . . . . . . . . . . . 150

24 2003 10 15: Techniques to improve training - 3 152


24.1 Statistically normalize the data . . . . . . . . . . . . . . . . . . . . . . . . . 152
24.2 Selection of initial weights . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153
24.3 Adjust learning rates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154

25 2003 10 17 Example: training with the unit square 155


25.1 Another example: bad output normalization . . . . . . . . . . . . . . . . . . 166
25.2 Decision rules . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 170
25.3 Feature detection/hidden neurons . . . . . . . . . . . . . . . . . . . . . . . . 171

26 2003 10 20: Radial Basis Function Networks 173


26.1 Project proposal guidelines . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173
26.1.1 Undergraduates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173
26.1.2 Graduates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174
26.1.3 Some project ideas . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174
Homework 7 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 175
26.2 Separability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 176


27 2003 10 22: Radial Basis Functions (2) 183


27.1 M-file s-function examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . 187
27.1.1 Continuous time model . . . . . . . . . . . . . . . . . . . . . . . . . . 187
27.1.2 Discrete time model . . . . . . . . . . . . . . . . . . . . . . . . . . . 189
27.2 Lecture notes: handwritten today . . . . . . . . . . . . . . . . . . . . . . . . 192

28 2003 10 24: RBF’s (3) 193


28.1 Interpolation result . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 193

29 2003 10 27 RBF’s (4) 194


29.1 Homework 7 solution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 194
Homework 8 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 195
29.2 What functions to use? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 196
29.2.1 Tikhonov regularization . . . . . . . . . . . . . . . . . . . . . . . . . 196
29.2.2 Solution of the problem . . . . . . . . . . . . . . . . . . . . . . . . . 197
29.3 Training RBF networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 197
29.3.1 Selection of output (interpolation) weights w . . . . . . . . . . . . . . 200
29.3.2 Selection of centers ti . . . . . . . . . . . . . . . . . . . . . . . . . . . 201
29.3.3 Selection of covariance (spread) matrices Σ . . . . . . . . . . . . . . . 201

30 2003 10 31: RBF’s (cont’d) 206


30.1 Homework 8 solution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 206

31 2003 10 31 Principal Components Analysis 207

32 Exam 2 solutions 208


32.1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 209
32.2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 210
32.3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 211
32.4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 212

33 2003 11 05: PCA (2) 213

34 2003 11 07: PCA (3) 214


34.1 Single vector case . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 224
34.2 Multiple vector case . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 224

35 Self Organizing Maps 225

36 2003 11 19 Neurodynamic programming Example 226

37 Hopfield Networks 229

A Appendix: Review of linear algebra 230


A.1 Vector stack function vec(·) . . . . . . . . . . . . . . . . . . . . . . . . . . . 230


B Appendix: Review of C-programming syntax related to neural nets 230

C Appendix: Review of MATLAB syntax related to neural nets 230


C.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 230
C.1.1 Access to software . . . . . . . . . . . . . . . . . . . . . . . . . . . . 230
C.1.2 Software overview and tutorials . . . . . . . . . . . . . . . . . . . . . 231
C.2 Mathematical preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . 231
C.3 MATLAB basics: similarities to C . . . . . . . . . . . . . . . . . . . . . . . . 232
C.3.1 Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 232
C.3.2 Differences from C . . . . . . . . . . . . . . . . . . . . . . . . . . . . 233
C.3.3 Graphical (plotting) concepts in MATLAB . . . . . . . . . . . . . . . 236
C.3.4 Circuits problems in MATLAB . . . . . . . . . . . . . . . . . . . . . 236
C.3.5 Differential equations in MATLAB . . . . . . . . . . . . . . . . . . . 239
C.4 Things to try on your own . . . . . . . . . . . . . . . . . . . . . . . . . . . . 244

D Appendix: source code 245


D.1 Utility m-files not listed elsewhere . . . . . . . . . . . . . . . . . . . . . . . . 245

References 250

Index 251


1 Course overview
Instructor A. S. Hodel hodelas@eng.auburn.edu AIM screen name ELEC 6240 1 .
Web page http://www.eng.auburn.edu/users/hodelas. Office hours: MW 3-4pm
or by appt.

Grader Adam T. Simmons simmoat@eng.auburn.edu. No office hours.

Textbook [Hay99] Simon Haykin. Neural Networks: A Comprehensive Foundation. Prentice Hall, 2nd edition, 1999

Grades Grades are assigned on a 10% scale. You may earn points from 2 Hour exams (50
pts ea), 1 Course project (50 pts), Homework (50 pts), 1 Final exam (100 pts), for a
total of 300 points.

Special needs Students who need special accommodations should make an appointment
to discuss their needs as soon as possible.

Class resources other class resources (notes, m-files, etc.) will be made available at
ftp://ftp.eng.auburn.edu/pub/hodel/6240

Projects Will be described later in the semester. Project final reports will be due during
the final week of the semester; a precise due date will be announced later. Oral
presentations will be made to the class during the final two weeks of the course.

Topics Covered in the course:

1. Learning processes
2. Perceptrons: single layer and multi-layer
3. Radial basis function neural networks
4. Principal components analysis
5. self-organizing maps
6. Neurodynamics
7. Neural network applications

Resources MATLAB is available on the engineering network by either (1) Use of Windows
PC labs in Broun 128, etc., (2) Sun workstations, or (3) remote log-in (ssh) to

gate.eng.auburn.edu
1
Please identify yourself by name when you message me.


from off-campus. Several tutorials for MATLAB are available on the net. 2 A brief
review of MATLAB will be given in this course, but students are expected to be
familiar with MATLAB from the prerequisite course.

Homework software Grading of software will be done in a batch-run fashion. This will
require that students set up a folder on their engineering account that the instructor
can access (read/execute privileges) from the engineering network (sun workstation).
Evidence of copying of software will be grounds for a zero grade on the homework
assignment in question.

Remark This is a 6000 level course (senior undergraduate/graduate level course). Students
will be expected to have a corresponding level of mathematical/conceptual maturity.
C-language programming (COMP1200) will be essential. The instructor makes no
commitment to provide support for other compiled languages in this course.

Note Homework assignments in this class for Fall 2003 should be done using MATLAB 6.5
(Release 13). For historical reasons, many software examples in these notes were done using
octave3 , a program similar to (but not identical to) MATLAB that is available at no cost.
Octave is not currently available on the Engineering network; if anyone wants to volunteer
to help to get octave installed, please let me know.

2
See, e.g., p. 37 of the ELEC 2020 manual,

ftp://ftp.eng.auburn.edu/pub/hodel/2020

3
http://www.octave.org


2 2003 08 20: Introduction


Read Simon Haykin. Neural Networks: A Comprehensive Foundation. Prentice Hall, 2nd
edition, 1999 [Hay99] Chapter 1.

Applications: optimization, classification, function approximation


Why neural networks?

• Nonlinear
More flexibility in representation of systems than in, e.g., transfer functions (LTI).

• input-output mapping
Does not require a “first principles” physical model of behavior/system being learned.

• adaptivity (retraining)
Can adjust its “synaptic weights” in response to changes in operating environment.

Question 2.1 How to ensure that adaptation “makes sense?”

• evidential response: classification and confidence of classification both available


Can provide both response (classification) and confidence level.

• fault tolerance (in hardware form)

• VLSI implementation

• neurobiological analogy
Retina: preprocessing (compression) of visual information before it is sent to brain for
processing.

Fundamental building block: the neuron


Homework 1
Read Simon Haykin. Neural Networks: A Comprehensive Foundation. Prentice Hall,
2nd edition, 1999 [Hay99] Chapter 1.

Due Fri Aug. 29. Hwk

1. Problem 1.1 in [Hay99], p. 45.

2. Problem 1.2 in [Hay99], p. 46.

3. Problem 1.9 in [Hay99], p. 47.

4. Design a 2 layer neural network with a threshold activation function (eqn (1.8), p. 12
in [Hay99]) to identify the region (x, y) ∈ [0, 1] × [0, 1].
Approach: hidden layer neurons are “feature detectors,” but can only split space into
two halves by dividing in a plane (see the figures that follow showing hidden layer values
as a function of x and y). Use one hidden layer neuron for each of the sides of the
square. Then combine them together. We can’t do an “and” operation naturally, so we
have to do it as a sum with a threshold and a bias just below the sum you’d get with
all hidden neurons firing.
Write an m-file main.m that plots the output of your ANN over the above domain. Put
this in a folder called 6240H1 in your home directory on the engineering network (the H:
drive for Windoze enthusiasts) and set permissions so that anyone can read the file. 4

Solution

1. Sigmoid derivative

    dφ/dv = [ (1 + e^{-av}) · 0 − (−a e^{-av}) ] / (1 + e^{-av})^2 = [ a + a e^{-av} − a ] / (1 + e^{-av})^2
          = [ a (1 + e^{-av}) − a ] / (1 + e^{-av})^2 = a [ (1 + e^{-av}) − 1 ] / (1 + e^{-av})^2
          = [ a / (1 + e^{-av}) ] · [ ( (1 + e^{-av}) − 1 ) / (1 + e^{-av}) ] = a φ(v) (1 − φ(v))

    dφ/dv evaluated at v = 0 is a/4.
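A quick numerical check of this derivative (an illustrative sketch only; the file name hw0101chk.m and the choice a = 2 are arbitrary, in the same spirit as the check file for problem 1.2 below):

% hw0101chk.m: numerically verify dphi/dv = a*phi*(1-phi) for the
% logistic sigmoid phi(v) = 1/(1+exp(-a*v)).
a = 2;                            % slope parameter (arbitrary choice)
vv = linspace(-5,5,1000);
phi = 1 ./ (1 + exp(-a*vv));
dphi = a .* phi .* (1 - phi);     % closed form from the derivation above
dnum = gradient(phi, vv);         % numerical derivative for comparison
fprintf('max |analytic - numeric| = %g\n', max(abs(dphi - dnum)));
fprintf('derivative at v = 0: %g (should be a/4 = %g)\n', interp1(vv,dphi,0), a/4);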

2. Derivation done and checked in MATLAB:

M-file 2.1 hw0102chk.m


4
This problem is assigned in part to check the process for electronic paperless submission of your software
and to test my plagiarism detection routines.


% homework 1 check.
xx = linspace(-5,5,1000);
a = 2;
eav = exp(-a*xx);

pp = (1 - eav) ./ (1 + eav);
dp = (a/2)*(1 - pp .^ 2);

dp2 = a * ( (1+eav).*(eav) + (1-eav).*(eav)) ...


./ ( (1+eav).^2);
dp2 = (a/2) * ( 2*(1+eav).*(eav) + 2*(1-eav).*(eav)) ...
./ ( (1+eav).^2);
dp2 = (a/2) * ( 2*(eav+eav.^2) + 2*(eav-eav.^2)) ...
./ ( (1+eav).^2);
dp2 = (a/2) * ( 4*eav )./ ( (1+eav).^2);
dp2 = (a/2) * ( 4*eav + (1 + eav).^2 - 1 - 2*eav - eav.^2 ) ...
./ ( (1+eav).^2);
dp2 = (a/2) * ( (1 + eav).^2 - 1 + 2*eav - eav.^2 ) ...
./ ( (1+eav).^2);
% and the result follows from here.
plot(xx,dp,’-’,xx,dp2,’-’);

3. ⟨w, x⟩ = 10 · 0.8 − 20 · 0.2 + 4 · (−1) − 2 · 0.9 = −1.8. (a) linear output is −1.8 (or
a · (−1.8) if the activation function has a linear slope parameter). (b) 0.

4. Software is in m-file square1.m. Results in Figures 1–6.

M-file 2.2 square1.m

xx = linspace(-1,2,30); yy = linspace(-1,2,30);
% format of rows 1st weight matrix:
% bias weight, x weight, y weight
W1 = [0 1 0; ... % right of y axis (x = 0)
0 0 1; ... % above x axis (y = 0)
1 -1 0; ... % left of line x = 1
1 0 -1]; % below line y = 1;

% hidden layer neuron outputs: h1, h2, h3, h4


% format of 2nd weight matrix: [bias, h1, h2, h3, h4]
W2 = [-0.99, 1/4, 1/4, 1/4, 1/4];
for ii = 1:length(xx)
for jj = 1:length(yy)
inputVec = [1; xx(ii); yy(jj)];

% vector of hidden node values


% without double(...) the next line creates a


% logical variable type (try "whos" at the MATLAB
% prompt.
% MATLAB requires you to convert it to double
% precision for mesh plotting purposes. Annoying;
% Octave does not have this requirement.
hv = double(( (W1 * inputVec) > 0));

% store hidden neuron values for plotting


h1(ii,jj) = hv(1);
h2(ii,jj) = hv(2);
h3(ii,jj) = hv(3);
h4(ii,jj) = hv(4);

% include bias term to get final output value


zz(ii,jj) = double( (W2*[1;hv]) > 0);

% this is the output without the threshhold


zprime(ii,jj) = W2*[1;hv];

end
end

fn = 0;
fn = fn+1; figure(fn);
mesh(yy,xx,h1);
title(’Hidden layer node 1 values’);
xlabel(’y’); ylabel(’x’);
eval(sprintf(’print -depsc square1_%d.eps’,fn));

fn=fn+1; figure(fn);
mesh(yy,xx,h2);
title(’Hidden layer node 2 values’);
xlabel(’y’); ylabel(’x’);
eval(sprintf(’print -depsc square1_%d.eps’,fn));

fn=fn+1; figure(fn);
mesh(yy,xx,h3);
title(’Hidden layer node 3 values’);
xlabel(’y’); ylabel(’x’);
eval(sprintf(’print -depsc square1_%d.eps’,fn));

fn=fn+1; figure(fn);
mesh(yy,xx,h4);


Figure 1: Homework 1: output is 1 if x > 0

title(’Hidden layer node 4 values’);


xlabel(’y’); ylabel(’x’);
eval(sprintf(’print -depsc square1_%d.eps’,fn));

fn=fn+1; figure(fn);
mesh(yy,xx,zprime);
title(’Neural network (no threshhold on output)’);
xlabel(’y’); ylabel(’x’);
eval(sprintf(’print -depsc square1_%d.eps’,fn));

fn=fn+1; figure(fn);
mesh(yy,xx,zz);
title(’Neural network output’);
xlabel(’y’); ylabel(’x’);
eval(sprintf(’print -depsc square1_%d.eps’,fn));


Figure 2: output is 1 if y > 0


Figure 3: output is 1 if x < 1


Figure 4: output is 1 if y < 1


Figure 5: second layer output: value is positive only for (x, y) ∈ [0, 1] × [0, 1].


Figure 6: final result: output is 1 if (x, y) is in [0, 1] × [0, 1].


3 2003 08 22 Neuron models


3.1 Neuron models
Read [Hay99] §1.3

    y = φ(w^T x) = φ( w_1 x_1 + · · · + w_m x_m ) = φ( ⟨w, x⟩ )

    x = [ x_1 · · · x_m ]^T ∈ IR^m
    w = [ w_1 · · · w_m ]^T

I prefer to write w^T x so that w and x are both column vectors.
y ∈ IR is a scalar (for now; this will change to y ∈ IR^p).

Interpretation w^T x = “how much does x look like w?”


• w (here is a vector, later will be a matrix) is an internal representation of a data feature
we’re interested in

• w^T x is large and positive if x points in “the same direction” as w; it is large and negative if x
points in the opposite direction (think of the correlation coefficient in random variables).

• use activation function φ to “normalize” the inner product w T x of w and x.

1. summation

2. bias vector y = φ(w^T x + b). Can embed x_0 = 1, w_0 = b to keep y = φ(w^T x).

3. activation function
(a) Threshold, a.k.a. “McCullough-Pitts” (theoretical limits discussed in [MP88]5 ):

    φ(v) = 1 if v ≥ 0,  0 if v < 0.   Also vector (output) form.

(b) piecewise linear (notice misprint from text):

    φ(v) = 1 if v ≥ 1/2,  v + 1/2 if |v| < 1/2,  0 if v ≤ −1/2

(c) sigmoid: φ(v) = 1 / (1 + e^{-av}); approaches the threshold as a → ∞

(d) stochastic (not much done in this course due to prerequisites):
    P(x = 1) = P(v) = sigmoid, P(x = 0) = 1 − P(v). “Random process” with P(v) = 1 / (1 + e^{v/T}), where T is a pseudo-temperature.
where T is a pseudo-temperature.
5
M. Minsky and S. Papert. Perceptrons: An Introduction to Computational Geometry. MIT Press, 1988.
Expanded Edition


Example 3.1 Activation function: single neuron in MATLAB. See Figure 7. Source code:

M-file 3.1 activation.m

xx = linspace(-4,4,1000); % 1000 linearly spaced points


linThresh = xx; % linear threshhold function
Thresh = ( xx >= 0.5 ); % hard threshhold function
sigmoid = 1 ./ ( 1+ exp(-2.0*xx) ); % sigmoid
tanSig1 = tanh(xx+0); % no offset
tanSig2 = tanh(xx+2); % constant offset
plot(xx,linThresh,’-’, xx, Thresh,’-’, xx, sigmoid, ’-’, ...
xx, tanSig1, ’-’, xx,tanSig2,’-’);
grid on
title(’Artificial Neuron Activation Function Examples’);
xlabel(’Neuron input’)
legend(’linear threshhold’,’Threshhold (at 0.5)’, ...
’sigmoid (alpha=2)’, ’tanh(x)’,’tanh(x+2)’);
print -depsc activation.eps


Figure 7: For example 3.1


3.2 NN’s as directed graphs


Notation: text uses scalar notation

    y_k = φ( Σ_j w_kj x_j )

or signal flow graphs:

• a weighted branch w_kj:  x_j → y_k = w_kj x_j

• an activation node φ:  x_j → y_k = φ(x_j)

• a summing neuron with inputs x_0, x_1, ..., x_m and weights w_k0, w_k1, ..., w_km:  y_k = φ( Σ_{j=0}^m w_kj x_j )

Alternative: matrix notation (similar to MATLAB):

    y = φ(w^T x)

In MATLAB: y = phi ( w’ * x ); . Need to write the function as an m-file phi.m or as


a mex-file phi.mexXYZ or phi.dll. Choice:

• m-files are easy to write, debug, but are slow.

• mex-files are slower to write, more difficult to debug, but are 5-10x faster.

Conclusion Develop as m-file, translate to C (or other language) as needed.


3.3 Feedback
Read §1.5

• operator loops: a forward operator A with feedback operator B (x_j'(n) denotes the fed-back internal signal) gives

    y_k(n) = [ A / (1 − AB) ] x_j(n)

• unit delay → transfer function

• IIR filters: 1st-order signal flow graph from x_j(n) to y_k(n) with feedback weight w and unit delay z^{-1}

• Analysis in neural networks: nonlinear! (graduate topic)

• “Process” - time becomes a variable.

Introduce idea of Z transform without details. (Stability, convergence)


4 2003 08 25 Network architectures


4.1 Network architectures
Read §1.6

4.2 Neural network: an interconnection of neurons


1. Single layer feed forward (acyclic connections)
P 
wk0 yk = φ m
j=0 wkj xj
x0
wk1

x1 ..
.
wkm
..
.

xm

Example 4.1 Single layer network with three neurons (outputs) and two inputs.

    y = [ y_1 ; y_2 ; y_3 ] = φ( W [1 ; x] ) = [ φ_1( w_1^T [1 ; x] ) ; φ_2( w_2^T [1 ; x] ) ; φ_3( w_3^T [1 ; x] ) ]

Could also write as y = φ(W̄_2 φ(W̄_1 x + b_1) + b_2) for bias vectors b_1, b_2. We will select

    w_1 = w_2 = w_3 = [ b  w_1  w_2 ]^T = [ 1  1  1 ]^T

(b = bias), and φ_1 = threshold (McCullough-Pitts), φ_2 = sigmoid (a = 1), and φ_3 =
tanh(·). M-file and output follow:


M-file 4.1 neuronEx1.m

n1 = 40; x1 = linspace(-2,2,n1);
n2 = 45; x2 = linspace(-2,2,n2);

% pre-allocate memory (can help speed sometimes)


y0 = zeros(n1,n2); y1 = y0; y2 = y0;
ww = [1;-1;1];
for ii =1:n1
for jj =1:n2
xx = [1;x1(ii);x2(jj)];
y0(ii,jj) = double( ww’ * xx > 0 );
y1(ii,jj) = 1/(1+exp(-ww’ *xx));
y2(ii,jj) = tanh(ww’*xx);
end
end

subplot(2,2,1); meshc(x1,x2,y0’);
xlabel(’input x_1’); ylabel(’input x_2’); zlabel(’NN output’);
title(’McCullough-Pitts activation function’);
grid on

subplot(2,2,2); meshc(x1,x2,y1’);
xlabel(’input x_1’); ylabel(’input x_2’); zlabel(’NN output’);
title(’sigmoid activation function’);
grid on

subplot(2,2,3); meshc(x1,x2,y2’);
xlabel(’input x_1’); ylabel(’input x_2’); zlabel(’NN output’);
title(’tanh activation function’);
grid on

orient tall
print -depsc neuronEx1.eps

[Figure: output of neuronEx1.m; mesh/contour plots of the network output over (x_1, x_2) for the McCullough-Pitts, sigmoid, and tanh activation functions.]


2. Multi-layer feed forward (hidden layers)

Example 4.2 Two-layer network with all activation functions φ = threshold, two inputs, one output, and three “hidden” nodes:

    y = φ( W_2 [ 1 ; φ(W_1 [1 ; x]) ] ) = φ( W_2 φ_1(W_1 x̄) )

where x̄ and φ_1 include the bias term 1 in their vectors.


M-file 4.2 neuronEx2.m

% two layer network with 3 hidden neurons


n1 = 40; x1 = linspace(-2,2,n1);
n2 = 45; x2 = linspace(-2,2,n2);

% pre-allocate memory (can help speed sometimes)


h1 = zeros(n1,n2); h2 = h1; h3 = h1; yy=h1;
W1 = [0,1,0; ... % zero bias
0,0,1; ... % zero bias
1,-1,-1]; % unit bias
W2 = [-2.8,1,1,1]; % bias of 2.8 (slightly less than 3 - but why?)
for ii =1:n1
for jj =1:n2
xx = [1;x1(ii);x2(jj)];
h0 = double ( W1*xx > 0 );
hh = [1;h0];
yy(ii,jj) = double( W2*hh > 0);

% store temp values for plotting


h1(ii,jj) = h0(1);
h2(ii,jj) = h0(2);
h3(ii,jj) = h0(3);
end
end

subplot(2,2,1); meshc(x1,x2,h1’);
xlabel(’input x_1’); ylabel(’input x_2’); zlabel(’h1 output’);
title(’Hidden Neuron 1 output’);
grid on

subplot(2,2,2); meshc(x1,x2,h2’);
xlabel(’input x_1’); ylabel(’input x_2’); zlabel(’h2 output’);
title(’Hidden Neuron 2 output’);
grid on

subplot(2,2,3); meshc(x1,x2,h3’);
xlabel(’input x_1’); ylabel(’input x_2’); zlabel(’h3 output’);
title(’Hidden Neuron 3 output’);
grid on

subplot(2,2,4); meshc(x1,x2,yy’);
xlabel(’input x_1’); ylabel(’input x_2’); zlabel(’NN output’);
title(’Two-layer ANN example output’);


grid on
orient tall
print -depsc neuronEx2.eps

[Figure: output of neuronEx2.m; mesh/contour plots of hidden neurons 1-3 and the two-layer ANN output over (x_1, x_2).]


3. Recurrent Fig 1.17 has no hidden layers, no self feedback.


5 2003 08 29 Knowledge representation


Read [Hay99] §1.7

Definition 5.1 (Fischler & Firschein, 1987) Knowledge refers to stored information or mod-
els used by a person or machine to interpret, predict, and appropriately respond to the outside
world.

Knowledge representation:

• What information is explicit?

• how is it encoded?

We often ask an ANN to “learn” an environment. Must have (or present) lots of data (“prior
information”).

• labeled: (x0 , d0 ).

• unlabeled (xi ) (self-organizing)

Process:

• training: subset of data; real and fake (passive sonar) data

• testing (complete set)

Remark 5.1 Training allows the data (and the training algorithm) to organize and represent
input data. Other forms of pattern classification often require the designer to organize the
data for classification and representation.

Linear algebra concepts For knowledge representation.

• vectors

• dot products

• norms (vector/error)
For unit vectors

‖x_i − x_j‖^2 = (x_i − x_j)^T (x_i − x_j) = 2 − 2 x_i^T x_j

• Expectation/ mean value


• covariance Σ = E[ (x_i − µ_i)(x_i − µ_i)^T ]
Assume x_i, x_j have identical covariance. Then

    d_ij^2 = ‖x_i − x_j‖^2 = (x_i − µ_i)^T Σ^{-1} (x_i − µ_i)

dot product/recognition

Ideas: (Rules)

1. Similar items should have similar representations.

2. Items from different classes should have dissimilar representations.

3. Important features should have many neurons devoted to them - high probability of
detection, low probability of false alarm.
Neyman-Pearson: max Prob(detect) subject to P(false alarm) < γ

4. Incorporate prior knowledge and “invariances” into network design.

• restrict network architecture


• weight sharing

Both techniques constrain the weight coefficient matrix W .

Question 5.1 So, given lots of data {(x_i, d_i)}_{i=1}^N, how do we train a network to “learn?”


6 2003 09 03 Mex files in MATLAB


C-file 6.1 mextanh.c

/*=================================================================
* Example Hyperbolic Tangent MEX function
* Adam Simmons, GTA, Neural Networks
*=================================================================*/

#include <math.h>
#include "mex.h"

/* If you have a C File written outside of Matlab’s Mex File format, you can
* place it here were tanh1 is and write a wrapper for it below (mexFunction)
*/
static void
tanh1 (double yout[], double xin[])
{
yout[0] = tanh (xin[0]);
return;
}

/* wrapper for the c code above (tanh1) */


void
mexFunction (int nlhs, mxArray * plhs[], int nrhs, const mxArray * prhs[])
{
double *yout, *xin;

/* Check for proper number of arguments */


if (nrhs != 1)
mexErrMsgTxt ("One input argument required.");
else if (nlhs > 1)
mexErrMsgTxt ("Too many output arguments.");

/* Create a scalar (1 x 1 matrix) for the return argument */


plhs[0] = mxCreateDoubleMatrix (1, 1, mxREAL);

/* Assign pointers to the various parameters */


yout = mxGetPr (plhs[0]);
xin = mxGetPr (prhs[0]);

/* Do the actual computations in a subroutine */


tanh1 (yout, xin);
}


C-file 6.2 simpletanh.c

#include "mex.h"
#include "math.h"

/* mxArray is a struct defined in mex.h


* an mxArray variable A contains (among other things):
* (1) a pointer to the first element of the matrix A
* (2) the dimensions (rows and columns) of the matrix A
* Get the pointer to the data with myMatrix = mxGetPr(&A);
* (Notice: it needs the ADDRESS of A! Pass a pointer!)
* Get the dimensions with
* int nrows = mxGetM(&A);
* int ncols = mxGetN(&A);
*
* MATLAB passes its arguments in an array of pointers, so
* if you call this with
* [a,b,c] = myfunc(d,e,f), then
* prhs[0] is a pointer to d,
* prhs[1] is a pointer to e,
* prhs[2] is a pointer to f.
*
* you have to CREATE the output arguments with mxCreateDoubleArray
* as shown below.
*/
void mexFunction(int nlhs, mxArray *plhs[],
int nrhs, const mxArray *prhs[])
{
double *y,*x;
plhs[0] = mxCreateDoubleMatrix(1,1,mxREAL);
x = mxGetPr(prhs[0]);
y = mxGetPr(plhs[0]);
y[0] = tanh(x[0]);
}


Hwk
Homework 2 Mex file example Handed out by Simmons in class; solutions shown in lecture
notes. Everyone did well.


7 2003 09 05 More on MATLAB


7.1 Scalar functions of scalar variables
Shown last time

7.2 Functions of vectors


How do we implement

    φ(x) = [ φ(x_1) ; ... ; φ(x_n) ] ?
In MATLAB:

M-file 7.1 mexExV.m

% mexExV.m: call mex file with a vector input, get a vector output
if( exist(’phiExV’) ~= 3 ) % see if the mex file is there
mex phiExV.c
end
xx = (-10:0.1:10)’;
yy = phiExV(xx);
plot(xx,yy);
grid on
xlabel(’activation function input’);
ylabel(’activation function output’);
print -depsc mexExV.eps

Results:


[Figure: mexExV.eps; plot of the activation function output (tanh) versus the activation function input.]


C-file 7.3 phiExV.c


#include <math.h>
#include "mex.h"

void phi(double *yy, const double * xx, const int len )


{
int ii;
for ( ii = 0 ; ii < len ; ii++)
yy[ii] = tanh(xx[ii]);
}

void mexFunction( int nlhs, mxArray *plhs[],


int nrhs, const mxArray*prhs[] )
{
char errMsg[1000];
double *xx, *yy;
int nrows, ncols;

/* Argument checking - always a good thing! */


if (nrhs != 1)
{
sprintf(errMsg,"Received %d args, need 1", nrhs);
mexErrMsgTxt(errMsg);
return;
}
else if (nlhs > 1)
{
sprintf(errMsg,"%d outputs requested, max is 1", nlhs);
mexErrMsgTxt(errMsg);
return;
}
nrows = mxGetM(prhs[0]);
ncols = mxGetN(prhs[0]);
if(nrows < 1 || ncols != 1 )
{
sprintf(errMsg,"input (%dx%d), need (Nx1)\n",nrows, ncols);
mexErrMsgTxt(errMsg);
return;
}
plhs[0] = mxCreateDoubleMatrix(nrows, ncols, mxREAL);
xx = mxGetPr(prhs[0]); /* pointer to data in prhs[0] */
yy = mxGetPr(plhs[0]); /* pointer to data in plhs[0] */
phi(yy,xx,nrows); /* do the work */
}


7.3 What about MATLAB and matrices?


MATLAB was originally written in FORTRAN, and so uses column-major ordering. That is, the matrix

    A = [ 1 2 3 ; 4 5 6 ; 7 8 10 ]

is stored in memory one column after another: 1, 4, 7, 2, 5, 8, 3, 6, 10.
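A quick way to see this ordering at the MATLAB prompt (illustrative only):

A = [1, 2, 3; 4, 5, 6; 7, 8, 10];
A(:)'      % column-major (linear) view: 1 4 7 2 5 8 3 6 10
A(4)       % linear index 4 walks down the first column first, so this is 2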

Example 7.1 Print out a matrix in a Mex file

M-file 7.2 mexPrintMatEx.m

if( exist( ’mexPrintMat’ ) ~= 3 )


mex mexPrintMat.c
end
A = [1, 2, 3; 4, 5, 6; 7, 8, 10];
mexPrintMat(A);

Results (on a Macintosh):

>> mexPrintMatEx
/Applications/MATLAB6p5
-L/Applications/MATLAB6p5/bin/Undetermined -lmx -lmex -lmat
/Applications/MATLAB6p5
-L/Applications/MATLAB6p5/bin/mac -lmx -lmex -lmat
mex link phase: cc -O -bundle -Wl,-flat_namespace -undefined suppress
-o mexPrintMat.mexmac mexPrintMat.o mexversion.o
-L/Applications/MATLAB6p5/bin/mac -lmx -lmex -lmat
(3 x 3) =
1.0000e+00 2.0000e+00 3.0000e+00
4.0000e+00 5.0000e+00 6.0000e+00
7.0000e+00 8.0000e+00 1.0000e+01

C-code follows:


C-file 7.4 mexPrintMat.c

/* mexPrintMat.c: print out a matrix in a mex file */


#include <math.h>
#include "mex.h"

void mexFunction( int nlhs, mxArray *plhs[],


int nrhs, const mxArray*prhs[] )
{
char errMsg[1000];
double *xx, *yy;
int nrows, ncols, ii, jj;

if (nrhs != 1)
{
sprintf(errMsg,"Received %d args, need 1", nrhs);
mexErrMsgTxt(errMsg);
return;
}
else if (nlhs > 0)
{
sprintf(errMsg,"%d outputs requested, max is 0", nlhs);
mexErrMsgTxt(errMsg);
return;
}

nrows = mxGetM(prhs[0]);
ncols = mxGetN(prhs[0]);

xx = mxGetPr(prhs[0]);
mexPrintf("(%d x %d) = \n",nrows,ncols);
for( ii = 0 ; ii < nrows ; ii++)
{
for ( jj = 0 ; jj < ncols ; jj++ )
mexPrintf("%12.4e ",xx[ii + jj*nrows]);
mexPrintf("\n");
}
}

Remark 7.1 Danger! Stray pointers and bad subscripts are great at making mex files
crash.


8 2003 09 08 Learning processes


Read Simon Haykin. Neural Networks: A Comprehensive Foundation. Prentice Hall, 2nd
edition, 1999 Ch. 2

Ad hoc procedures that seem to work.

8.1 Error-correction learning


Read §2.2

From end of Ch 1: knowledge representation, feature extraction, classification.



Approach: input vector process x(n), desired output vector d(n), error e(n) = d(n)−y(n).
Scalar values: ek (n), dk (n), yk (n).
Remark 8.1 stupidly profound observation ek (n) > 0 =⇒ we want yk to increase,
and ek (n) < 0 =⇒ we want yk to decrease.
Use error e to adjust parameters (weights) of neural network.

Idea iterative “optimization:” define


    E(n) = (1/2) e(n)^T e(n)   (vector form)

or

    E_k(n) = (1/2) e_k(n)^2   (scalar form)
“instantaneous error energy.” Idea: adjust weights to decrease E until it reaches “steady
state.”
Consider

    y(n) = φ( W(n) x(n) ) = φ( [ w_10 w_11 · · · w_1m ; ... ; w_p0 w_p1 · · · w_pm ] [ x_0 = 1 ; x_1 ; ... ; x_m ] )

i.e., y_k(n) = Σ_j w_kj(n) x_j(n), so

    ∂y_k(n) / ∂w_kj(n) = x_j(n)

That is, if we increase w_kj, then y_k changes in the direction of x_j.

Idea Try ∆w_kj(n) = η e_k(n) x_j(n), so

    ∆W(n) = η e(n) x(n)^T
    W(n + 1) = W(n) + ∆W(n)

Delta rule or Widrow-Hoff rule. Why? Matrix gradients (sort of). η = “learning rate”
parameter. Requires: e(n) is measurable.
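A sketch of one delta-rule pass over a small training set (the file name, learning rate, toy OR-function data, and zero starting weights are all arbitrary illustrative choices):

% deltaRuleSketch.m: Widrow-Hoff (delta rule) updates for one linear neuron
eta = 0.1;                         % learning rate
X = [1 1 1 1; 0 0 1 1; 0 1 0 1];   % inputs; first row is the bias x_0 = 1
D = [0 1 1 1];                     % desired outputs (logical OR)
W = zeros(1,3);                    % W is 1 x (m+1), arbitrary start
for n = 1:size(X,2)
  x = X(:,n);                      % stimulus vector x(n)
  e = D(n) - W*x;                  % error e(n) = d(n) - y(n), with y = W*x
  W = W + eta*e*x';                % W(n+1) = W(n) + eta*e(n)*x(n)'
end
W                                  % weights after one pass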


8.2 Memory Based learning


Read §2.3

Stored: all (or most) xi (now vectors) stored with desired output di .

Example 8.1 Classification as in Class C1 or C2 . di ∈ {0, 1}.

Approach present vector xtest . If xtest is “unknown”, identify an xi “near” to xtest .

nearest neighbor x_N' ∈ {x_i}_{i=1}^N such that ‖x_N' − x_test‖ = min_i ‖x_i − x_test‖.

[Text uses d(x_i, x_test) instead of norm.]

Select class d_N'

k-nearest neighbors Select most common class of the k−nearest neighbors.

Fact 8.1 Quality of classification: within a factor of 2 of optimal if xi , di are uniformly


distributed and have an infinite amount of data.

What’s that mean Probability of mis-classification by this method is at most twice as


high as probability of misclassification by the best possible theoretical algorithm you could
develop.
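A small nearest-neighbor sketch using the exclusive-or table from the example that follows (the file and variable names are arbitrary):

% nearestNeighborSketch.m: classify xtest by the class of its nearest stored x_i
Xs = [0 0 1 1; 0 1 0 1];                 % stored inputs x_i (one per column)
ds = [0 1 1 0];                          % stored labels d_i
xtest = [0.9; 0.3];                      % new input
dd = Xs - repmat(xtest, 1, size(Xs,2));  % differences x_i - xtest
[dmin, iNN] = min(sum(dd.^2, 1));        % squared distances; iNN = nearest index
dhat = ds(iNN)                           % assigned class (here x_3 is nearest, class 1)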

Example 8.2 Exclusive-or network Explicitly store data:


element |   1   |   2   |   3   |   4
x_i     | (0,0) | (0,1) | (1,0) | (1,1)
d_i     |   0   |   1   |   1   |   0

Receive new input x_5 = (0.9, 0.3)^T. Nearest point is x_3, so update table:

element |   1   |   2   |   3   |   4   |     5
x_i     | (0,0) | (0,1) | (1,0) | (1,1) | (0.9, 0.3)
d_i     |   0   |   1   |   1   |   0   |     1

Question 8.1 When do we add a new data point? How close is “close enough?” (related
to Radial Basis Function networks ...)


8.3 Hebbian learning


Read §2.4

“reinforce what’s already occuring.”

1. time dependent - modificatin occurs based on when signals occur

2. local - occurs at a specific “synapse”

3. interactive - depends on both input and output

4. conjunctional or correlational: strengthen when input/output are correlated.

Can also develop methods by which connections are weakened (“forgetting” methods).

Example 8.3

    ∆w_kj(n) = F( y_k(n), x_j(n) )
             = η y_k(n) x_j(n)   (simplest form)
    ∆W(n) = η y(n) x(n)^T

Problem can lead to saturation (unstable growth)

Idea “covariance hypothesis” - subtract time-averaged mean values.

    ∆W(n) = η ( y(n) − ȳ(n) ) ( x(n) − x̄(n) )^T

Average values give a threshold to the “sign” of the correction so that we can both strengthen and
weaken connections.
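A sketch of a single Hebbian update in the covariance form (the exponential running averages and all numbers here are arbitrary illustrative choices):

% hebbSketch.m: Hebbian update with time-averaged means subtracted
eta = 0.01; lam = 0.9;                % learning rate, averaging factor
x = randn(5,1); y = randn(3,1);       % current input/output (toy data)
xbar = zeros(5,1); ybar = zeros(3,1); % running mean estimates
W = zeros(3,5);
xbar = lam*xbar + (1-lam)*x;          % update averaged input
ybar = lam*ybar + (1-lam)*y;          % update averaged output
W = W + eta*(y - ybar)*(x - xbar)';   % covariance-hypothesis update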


8.4 Competitive learning


Read §2.5

neurons “compete” to fire.


Causes neurons to become “feature detectors” for different input classes.

    y_k = 1 if v_k > v_j for all j ≠ k,  0 else

Idea

    ∆w_kj = η (x_j − w_kj) if neuron k “wins”,  0 else

Adjust weights to enforce Σ_j w_kj = 1 for all k, or Σ_j w_kj^2 = 1 for all k.
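A winner-take-all sketch for one input sample (the dimensions, learning rate, and normalization are arbitrary choices):

% competitiveSketch.m: winner-take-all update with unit-norm weight rows
eta = 0.1;
W = rand(3,2);                             % 3 competing neurons, 2 inputs
W = W ./ repmat(sqrt(sum(W.^2,2)), 1, 2);  % enforce sum_j w_kj^2 = 1
x = [0.6; 0.8];                            % one input sample
[vmax, k] = max(W*x);                      % neuron k with the largest v_k "wins"
W(k,:) = W(k,:) + eta*(x' - W(k,:));       % move the winner's weights toward x
W(k,:) = W(k,:) / norm(W(k,:));            % re-normalize the winning row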

8.5 Boltzmann learning


Read §2.6

Stochastic method. Concept: network energy state


    E = −(1/2) Σ_{j,k, j≠k} w_kj x_k x_j

x = state of neuron. Change neuron value x_k → −x_k based on

    P(x_k → −x_k) = 1 / (1 + e^{∆E/T})

where ∆E is the change in energy from the flip, T = “temperature” (adaptation speed parameter).
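A sketch of a single stochastic state flip (the symmetric random weights, ±1 states, and T = 1 are arbitrary choices):

% boltzmannSketch.m: energy change and probabilistic flip of one neuron
N = 6; T = 1.0;
W = randn(N); W = (W + W')/2;    % symmetric weights
W(1:N+1:end) = 0;                % no self-feedback (w_kk = 0)
x = sign(randn(N,1));            % neuron states in {-1,+1}
k = ceil(N*rand);                % pick a neuron at random
dE = 2*x(k)*(W(k,:)*x);          % energy change if x_k -> -x_k
if rand < 1/(1 + exp(dE/T))      % accept the flip with this probability
  x(k) = -x(k);
end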

8.6 Credit assignment


Read §2.7

Covered in more detail in later chapters.

8.7 Learning with a teacher


Read §2.8

Similar to tagged (labelled) data learning, except that we have a “teacher” instead of a
large database.


8.8 Learning without a teacher


Read §2.9

• with a critic (related to dynamic programming)

[Diagram: the environment sends a state vector to the critic and to the learning system; the critic converts primary reinforcement from the environment into heuristic reinforcement for the learning system, which sends actions back to the environment.]

The “heuristic” reinforcement relates to a cost-to-go function from dynamic programming.
Is much more difficult to perform than with labelled data, but permits the use of neural
networks to solve an optimization problem.

Remark 8.2 This would be a good project for a graduate student.

• unsupervised learning (e.g., competitive classification)

All of these require design of some mechanism to steer toward a “learned” outcome.


9 2003 09 10
We remember ...
“I wish that it need not have happened in my time,” said Frodo.
“So do I,” said Gandalf, “ and so do all who live to see such times.
But that is not for them to decide. All we have to decide is what
to do with the time given to us.”
J. R. R. Tolkien, The Fellowship of the Ring, p. 76
“We don’t want the bad guys to win!”
Fozzie Bear, The Great Muppet Caper.

9.1 Learning tasks


Read §2.10

Pattern association: e.g., clean up dirty patterns (map an input pattern x to an associated output pattern y).

pattern recognition/classification
Example 9.1 Homework 2 problem: classification can be used to clean up input waveforms:
[Figure: a noisy sine wave (top) and the cleaned-up sine wave (bottom) versus time.]


function approximation: e.g. system i.d., inverse dynamics, control, filtering (smoothing,
prediction, extraction)
Example 9.2 system identification Try to mimic an unknown system: Neural net must
“remember” previous outputs, inputs, and try to predict next output:

[Diagram: the input u(k) drives an unknown system producing output y(k); the stored data u(k), u(k − 1), ..., u(k − N), y(k − 1), ..., y(k − N) form the input to a neural network that predicts the next output.]
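A sketch of how one such stored-data (stimulus) vector could be assembled at time step k (the memory depth N, file name, and toy plant are arbitrary choices):

% sysidSketch.m: build the network input from past plant inputs/outputs
N = 3;                               % memory depth
u = randn(100,1);                    % recorded plant input
y = filter([0 0.5], [1 -0.8], u);    % recorded plant output (toy system)
k = 50;                              % current time index
z = [u(k:-1:k-N); y(k-1:-1:k-N)];    % [u(k),...,u(k-N), y(k-1),...,y(k-N)]'
% z is presented to the network; the desired (target) output is y(k)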

text gives filtering examples:


• “cocktail party” - signal in clutter
• beamforming (radar)

9.2 Memory
Read §2.11

Geometric interpretations of neural network operation.


• short term memory
• long term memory
• distributed memory
Notation: vector (book says x_k = ..., I’ll write ...)

    x(k) = [ x_1(k) ; ... ; x_m(k) ]

m = network dimensionality (assume that y is the same length as x for now).
Suppose the network is linear. Then y(k) = W(k) x(k) with W(k) specifically selected for the pair (x_k, y_k). Scalar notation:

    y_i(k) = [ w_i1(k) · · · w_im(k) ] [ x_1(k) ; ... ; x_m(k) ]


Define

    M_0 = 0
    M_k = M_{k−1} + W(k)
    M = M_q = Σ_{k=1}^q W(k)

Can we select the W(k)’s so that pattern recognition still works?
Define: the estimate M̂ of M is M̂ = Σ_{k=1}^q y(k) x(k)^T. Local outer products y_i(k) x_j(k), similar to Hebbian learning.
Can show M̂ = Y X^T where

• Y = [ y_1 · · · y_q ] “memorized matrix”

• X = [ x_1 · · · x_q ] “key matrix”

Use of M̂ in associative memory: pick an input vector x(j).



    y(j) = M̂ x(j)
         = Σ_{k=1}^q y(k) x(k)^T x(j)   (misprint in text, eqn 2.39)
         = Σ_{k=1}^q ( x(k)^T x(j) ) y(k)

Notice x(k)^T x(j) is a scalar inner product,

    cos( x(k), x(j) ) = x(k)^T x(j) / ( ‖x(k)‖ ‖x(j)‖ )

If the x(k)’s are unit length then cos(x(k), x(j)) = x(k)^T x(j). For recognition, want x(k)^T x(j) = 0 (orthogonal).


Example 9.3 Use correlation matrix idea in the homework example:

M-file 9.1 corrMemEx1.m

tt = (0:0.01:5)’;
sinewave=sin(pi*tt);
sawtooth = 2*abs(tt - floor(tt))-1;
square = 2*double ( floor(tt) == 2*floor(tt/2) )-1;

% normalize to unit vectors


sinewave = sinewave/norm(sinewave);
sawtooth = sawtooth/norm(sawtooth);
square = square/norm(square);

Xk = [sinewave, sawtooth, square];


Yk = [ [1;0;0], [0;1;0], [0;0;1] ];

figure(1);
plot(tt,sinewave,’-’, tt, sawtooth,’-’, tt, square,’-’);
legend(’sine’,’saw’,’square’);
xlabel(’time (s)’)
grid on
print -depsc corrMemEx1.eps

M = zeros(3,length(tt));
for kk=1:3
M = M + Yk(:,kk)*Xk(:,kk)’;
end

x1 = (M*sinewave)’
x2 = (M*sawtooth)’
x3 = (M*square)’

randn(’seed’,1); % seed the random # generator for repeatable tests


dirtySine = sinewave + randn(size(sinewave))/10;

figure(2)
plot(tt,dirtySine);
legend(’noisy sine wave’);
xlabel(’time (s)’)
grid on
print -depsc corrMemEx2.eps

% try to recognize a dirty sine wave


x4 = (M*dirtySine)’


W = [sinewave, sawtooth, square]*M;


cleaned = W*dirtySine;
figure(3);
subplot(2,1,1)
plot(tt,dirtySine);
legend(’noisy sine wave’);
xlabel(’time (s)’)
grid on
subplot(2,1,2)
plot(tt,cleaned);
legend(’cleaned sine wave’);
xlabel(’time (s)’)
grid on
print -depsc corrMemEx3.eps

[Figure: corrMemEx1.eps; the normalized sine, sawtooth, and square waveforms versus time.]


[Figures: corrMemEx2.eps, the noisy sine wave versus time; corrMemEx3.eps, the noisy sine wave (top) and cleaned sine wave (bottom) versus time.]


Q if key vectors X are orthonormal, what is the storage capacity of the network? - m, the rank of M̂.

Q classification accuracy: lower bound on error x(k)^T x(j) ≥ γ for all k ≠ j. If γ is big enough, can get classification errors. (Get upper bound instead?)


Homework 3
Read Simon Haykin. Neural Networks: A Comprehensive Foundation. Prentice Hall,
2nd edition, 1999 [Hay99] Chapter 2.

Due Wed Sept. 17. Hwk


1. Problem 2.18 in [Hay99]. (MATLAB exercise.)
2. Problem 2.20 in [Hay99]. (MATLAB exercise.)
3. Consider the m-file below:

M-file 9.2 learnTaskEx1s.m

tt = (0:0.01:5)’;
sinewave=sin(pi*tt);
sawtooth = 2*abs(tt - floor(tt))-1;
square = 2*double ( floor(tt) == 2*floor(tt/2) )-1;

figure(1);
plot(tt,sinewave,’-’, tt, sawtooth,’-’, tt, square,’-’);
legend(’sine’,’saw’,’square’);
xlabel(’time (s)’)
grid on
print -depsc learnTaskEx1a.eps

A = (you fill in here);

x1 = (A*sinewave)’ % this should be [1;0;0]


x2 = (A*sawtooth)’ % this should be [0;1;0]
x3 = (A*square)’ % this should be [0;0;1]

randn(’seed’,1); % seed the random # generator for repeatable tests


dirtySine = sinewave + randn(size(sinewave));
figure(2)
plot(tt,dirtySine);
legend(’noisy sine wave’);
xlabel(’time (s)’)
grid on
print -depsc learnTaskEx1b.eps

% try to recognize a dirty sine wave


x4 = (A*dirtySine)’

This m-file implements a single-layer neural network that has 101 inputs (length of the
input vector) and 3 outputs. Output 1 should be a “1” when the input is a sinewave,


output 2 should be a “1” when the input is a sawtooth wave, and output 3 should be a
“1” when the input is a square wave, with the other outputs being zero. Each of these
waveforms is defined in the first four lines of the m-file.
Your job is to select what to fill in for A so that the neural network gives the desired
output for perfect (uncorrupted) inputs. The last few lines of the m-file demonstrate
what happens when you apply a corrupted sine wave.
The output for my solution is shown below:

>> learnTaskEx1
x1 = 1.0000 0.0000 -0.0000
x2 = 0.0000 1.0000 -0.0000
x3 = 0.0000 0.0000 1.0000
x4 = 1.0405 0.0530 -0.0658

[Figure: learnTaskEx1b.eps; the corrupted (noisy) sine wave versus time.]

Solution
1. M-file:

M-file 9.3 hwk0218.m

% Homework 2.18, Haykin ’99


x1 = [1;0;0;0]; x2 = [0;1;0;0]; x3 = [0;0;1;0];


y1 = [5;1;0]; y2 = [-2;1;6]; y3 = [-2;4;3];

fprintf(’(a)\n’);
M = y1*x1’ + y2*x2’ + y3*x3’

fprintf(’(b)\n’);
error1 = M*x1 - y1
error2 = M*x2 - y2
error3 = M*x3 - y3

Output:

hwk0218.m Output

(a)
M =

5 -2 -2 0
1 1 4 0
0 6 3 0

(b)
error1 =

0
0
0

error2 =

0
0
0

error3 =

0
0
0

M-file:


M-file 9.4 hwk0220.m

% Homework 2.20, Haykin ’99


x(1:3,1) = 0.25* [ -2; -3; sqrt(3) ];
x(1:3,2) = 0.25* [ 2; -2; -sqrt(8) ];
x(1:3,3) = 0.25* [ 3; -1; sqrt(6) ];

fprintf(’(a) angles between vectors \n’);


for ii=1:2
for jj= (ii+1):3
xi = x(:,ii);
xj = x(:,jj);
thisAngle = acos((xi’*xj)/(norm(xi)*norm(xj)));
orthErr = abs(abs(thisAngle) - pi/2);
fprintf(’The angle between x%d and x%d is %f degrees, \n’, ...
ii, jj, thisAngle*180/pi);
fprintf(’\t%f degrees off orthogonal\n’, orthErr*180/pi);
end
end

fprintf(’(b): autoassociative memory matrix\n’);


M = zeros(3);
for ii=1:3
M = M + x(:,ii)*x(:,ii)’/norm(x(:,ii));
end
M

fprintf(’(c): with normalization\n’);


xv = [0;-3;sqrt(3)];
y = M*xv/norm(xv)
yerr = y - x(:,1)

fprintf(’(c): without normalization\n’);


xv = [0;-3;sqrt(3)];
y = M*xv
yerr = y - x(:,1)

Output:

hwk0220.m Output

(a) angles between vectors


The angle between x1 and x2 is 100.438861 degrees,


10.438861 degrees off orthogonal


The angle between x1 and x3 is 85.545635 degrees,
4.454365 degrees off orthogonal
The angle between x2 and x3 is 86.159034 degrees,
3.840966 degrees off orthogonal
(b): autoassociative memory matrix
M =

1.062500 -0.062500 -0.110780


-0.062500 0.875000 -0.124299
-0.110780 -0.124299 1.062500

(c): with normalization


y =

-0.0012636
-0.8199219
0.6388963

yerr =

0.498736
-0.069922
0.205884

(c): without normalization


y =

-0.0043773
-2.8402926
2.2132017

yerr =

0.49562
-2.09029
1.78019

2. Using pseudo-inverse approach:

M-file 9.5 hwk0220a.m

% Homework 2.20, Haykin ’99


x(1:3,1) = 0.25* [ -2; -3; sqrt(3) ];


x(1:3,2) = 0.25* [ 2; -2; -sqrt(8) ];
x(1:3,3) = 0.25* [ 3; -1; sqrt(6) ];

fprintf(’(a) angles between vectors \n’);


for ii=1:2
for jj= (ii+1):3
xi = x(:,ii);
xj = x(:,jj);
thisAngle = acos((xi’*xj)/(norm(xi)*norm(xj)));
orthErr = abs(abs(thisAngle) - pi/2);
fprintf(’The angle between x%d and x%d is %f degrees, \n’, ...
ii, jj, thisAngle*180/pi);
fprintf(’\t%f degrees off orthogonal\n’, orthErr*180/pi);
end
end

fprintf(’(b): autoassociative memory matrix\n’);


M = x*pinv(x); % well, duh, that gives us identity. What if x
% were 3x2 instead? The point is, pinv(x) gives us
% a working set of weights for an associative memory

fprintf(’(c): with normalization\n’);


xv = [0;-3;sqrt(3)];
y = M*xv/norm(xv)
yerr = y - x(:,1)

hwk0220a.m Output

(a) angles between vectors


The angle between x1 and x2 is 100.438861 degrees,
10.438861 degrees off orthogonal
The angle between x1 and x3 is 85.545635 degrees,
4.454365 degrees off orthogonal
The angle between x2 and x3 is 86.159034 degrees,
3.840966 degrees off orthogonal
(b): autoassociative memory matrix
(c): with normalization
y =

1.9456e-17
-8.6603e-01
5.0000e-01


yerr =

0.500000
-0.116025
0.066987


10 2003 09 12 Single layer networks


Read §3.1

10.1 From last time: Linear associative memory (LAM)


    y(j) = M̂ x(j) = Σ_{k=1}^q y(k) x(k)^T x(j) = Σ_{k=1}^q ( x(k)^T x(j) ) y(k)

General case (LAM):

    y = A B^T x

[Block diagram: x → B^T → A → y]

The matrix B = [ b_1 · · · b_{nf} ] acts as a feature extractor when we perform dot products ⟨b_i, x⟩. The matrix A is then used to construct the associated waveform y as a combination of columns of A:

    y = A z = A [ ⟨b_1, x⟩ ; ⟨b_2, x⟩ ; ... ; ⟨b_{nf}, x⟩ ] = a_1 ⟨b_1, x⟩ + · · · + a_{nf} ⟨b_{nf}, x⟩
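A small recall sketch with orthonormal key vectors (all dimensions and data here are arbitrary):

% lamSketch.m: linear associative memory y = A*B'*x
m = 8; nf = 3;
[B, R] = qr(randn(m,nf), 0);   % nf orthonormal key/feature vectors (columns of B)
A = randn(m,nf);               % associated (memorized) patterns
x = B(:,2);                    % present the 2nd key
z = B'*x;                      % feature extraction: z_i = <b_i, x>
y = A*z;                       % reconstruction; equals A(:,2) here
norm(y - A(:,2))               % numerically zero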

Question 10.1 How many patterns can be stored in an N × N linear associative memory?


10.2 Adaptation
Read [Hay99] §2.12

• The environment “changes:” a good decision now may not be a good decision later.

• If it changes slowly enough, can adapt to current conditions. “pseudo-stationary.”

10.3 Performance issues


§2.13, Statistical nature of the learning process, §2.14, Statistical learning theory, and §2.15,
Probably approximately correct learning, will be examined later on.
For now: Suppose we have a training data set

    T = {(x_i, d_i)}_{i=1}^N

We assume that this data represents a system

    d = f(x) + ε

where ε is some functional error (perhaps a random variable or noise).


Assumptions:

1. E[ε|x] = 0 - zero mean random variable. Implies E[d|x] = f(x), which is what the
neural net is trying to match.

2. Error is uncorrelated with the function:

E[f (x)T ] = 0

(Is consistent with the conditional expectation above). Says that the function f gives
us all available information about d that we can get from x.

What does this sort of modeling assumption tell us about what neural networks can do?

Short summary of §2.13 Need to approximate f (x) with an ANN F (x; W ). Things to
reduce: mean value of error (bias) and variance of error (standard deviation).


10.4 Single layer perceptrons: “In the beginning ....”


Read §3.1

Contributors:
1. W. S. McCullough and W. Pitts. A logical calculus of the ideas immanent in nervous activity. Bulletin of Mathematical Biophysics, 5:115–133, 1943. McCullough & Pitts (1943): use of NN as computational tool

2. D. O. Hebb. The Organization of Behavior: A Neuropsychological Theory. Wiley, New


York, 1949 1st rule for self-organized learning

3. F. Rosenblatt. The perceptron: A probabilistic model for information storage and orga-
nization in the brain. Psychological Review, 65:386–408, 1958 for perceptron (learning
with a teacher)

4. B. Widrow and M. E. Hoff, Jr. Adaptive switching circuits. In IRE WESCON Convention Record, pages 96–104, 1960. Widrow-Hoff delta rule (least mean square)
Perceptron: linearly separable; “perceptron convergence theorem.” Single neuron can
be viewed as an “adaptive filter.”

10.5 Adaptive filtering interpretation


Read [Hay99] §3.2

[Signal-flow diagram: linear adaptive filter -- inputs $x_0, \ldots, x_m$ with weights $w_{i0}, \ldots, w_{im}$ summed to give the local field $v(i)$ and output $y(i)$; the desired response $d(i)$ is subtracted to form the error $e_i$.]
$$y = \phi(\bar W x + w_0) = \phi\left(W\begin{bmatrix}1\\ x\end{bmatrix}\right) \stackrel{\Delta}{=} \phi(W\bar x)$$

Desired system behavior specified by

T = {x(i), d(i), i = 1, ..., n}


where x(i) is a vector of length m (input dimensionality). x(i) is called a stimulus vector.
How do we get x?

• snapshot in space (m different points in space)

• uniformly spaced in time (present and m − 1 previous values).

Learning assumptions:

• Arbitrary starting point (neuron settings)

• Adjustments of weights are made continuously6 (time is a part of the learning algo-
rithm)

Linear neuron:
$$y(i) = v(i) = \sum_{k=1}^{m} w_k(i)\,x_k(i) = w(i)^T x(i)$$

Define: e(i) = d(i) − y(i)


Goal: update w to get desired output(s) d(i) for all input(s) x(i).

6
Not as a differential equation, usually, but as a difference (discrete-time) equation.
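The actual weight-update rules are developed in the following lectures (steepest descent / least mean square). Purely as a preview, a minimal sketch of the kind of sample-by-sample correction this leads to -- with a made-up scalar data set and a learning rate eta chosen only for illustration -- looks like:

% sketch: sample-by-sample (LMS-style) update of a linear neuron (preview only)
eta = 0.05;                          % illustrative learning rate
w = zeros(3, 1);                     % weights for x(i) in R^3
for pass = 1:50
  for i = 1:10
    x = [1; sin(i); cos(i)];         % made-up stimulus vector (bias appended)
    d = 2*sin(i) - 0.5;              % made-up desired response
    y = w' * x;                      % linear neuron output y(i) = w' x(i)
    e = d - y;                       % error e(i) = d(i) - y(i)
    w = w + eta * e * x;             % correct w in the direction that reduces e
  end
end
w                                    % should approach [-0.5; 2; 0] for this data set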


11 2003 09 15 Optimization and Neural Networks


11.1 What you should have seen in your linear algebra class
Linear systems of equations: 3 cases:

Case 1 n equations, n unknowns:
$$Ax = b \implies x = A^{-1}b$$
with A and b known. A must have n linearly independent rows/columns for there to be a unique solution. Otherwise either (1) there is no solution, or (2) there are an infinite number of solutions.

Case 2 Overdetermined systems: $A \in \mathbb{R}^{m\times n}$, $m > n$.
$$\min_x \|Ax - b\|^2 = \min_x (Ax - b)^T(Ax - b)$$
Solution is $x = A^+ b$ where $A^+ \in \mathbb{R}^{n\times m}$ is the pseudo-inverse of A, which means (among other things) that $A^+A = I_n$. However, $AA^+ \ne I_n$.

Remark 11.1 Define $P = AA^+$. Then P is a projection matrix, which means that $P^2 = P$. Further, it is easy to show that $PA = A$. Multiplication by P extracts the part of a vector (or matrix) that is in span(A).

Case 3 Underdetermined systems: $A \in \mathbb{R}^{m\times n}$, $m < n$. Problem is:
$$\min_x x^T x \quad\text{subject to } Ax = b$$
$Ax = b$ is a linear constraint that has m equations in n unknowns (not enough equations to uniquely identify x). Solution to this problem is $x = A^+b$ where $A^+ \in \mathbb{R}^{n\times m}$ is the pseudo-inverse of A. Notice that, since the case 3 A matrix is short and wide, the product $AA^+ = I$ and $A^+A$ is a projection matrix for this case, unlike the case 2 problem.
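A short numerical illustration of the two pseudo-inverse cases, using MATLAB's built-in pinv on random (hence almost surely full-rank) matrices -- the sizes below are arbitrary:

% sketch: pseudo-inverse behavior for tall and wide A (illustration only)
A_tall = rand(5, 3); b_tall = rand(5, 1);       % case 2: m > n
x_ls   = pinv(A_tall) * b_tall;                 % least-squares solution
P      = A_tall * pinv(A_tall);                 % projection onto span(A_tall)
norm(P*P - P)                                   % ~0: P is a projection matrix
norm(pinv(A_tall)*A_tall - eye(3))              % ~0: A+ A = I_n for the tall case

A_wide = rand(3, 5); b_wide = rand(3, 1);       % case 3: m < n
x_mn   = pinv(A_wide) * b_wide;                 % minimum-norm solution
norm(A_wide*x_mn - b_wide)                      % ~0: constraint Ax = b is satisfied
norm(A_wide*pinv(A_wide) - eye(3))              % ~0: A A+ = I for the wide case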

11.2 Unconstrained optimization techniques


Given a set of data $\mathcal{T} = \{x(i), d(i),\ i = 1, \ldots, n\}$ and a neural network parameter vector w, define the neural network function and output as
$$y(i) = f(x(i), w) = \phi\left(W\begin{bmatrix}1\\ x(i)\end{bmatrix}\right)$$
Define the corresponding cost function as
$$E(w) = E\left[\|d(n) - y(n)\|^2\right] = \frac{1}{N}\left(\frac{1}{2}\sum_{n=1}^{N}(d(n) - y(n))^T(d(n) - y(n))\right)$$


Want to find an optimal $w^*$ such that $E(w^*) \le E(w)$ for all possible w, i.e.
$$w^* = \arg\min_w E(w)$$

Necessary condition (unconstrained case):
$$\nabla_w E(w) \stackrel{\Delta}{=} \begin{bmatrix}\frac{\partial}{\partial w_1}\\ \vdots\\ \frac{\partial}{\partial w_m}\end{bmatrix}E(w) = \begin{bmatrix}\frac{\partial E(w)}{\partial w_1}\\ \vdots\\ \frac{\partial E(w)}{\partial w_m}\end{bmatrix} = 0.$$

Goal: generate sequence of w(n) : E(w(n + 1)) < E(w(n))

Lemma 11.1 (matrix calculus identities) Let f(x) be a scalar function of a vector $x \in \mathbb{R}^n$ and let g(W) be a scalar function of a matrix $W \in \mathbb{R}^{m\times n}$. Define their respective partial derivatives as
$$\frac{\partial f}{\partial x} \stackrel{\Delta}{=} \begin{bmatrix}\frac{\partial f}{\partial x_1}\\ \vdots\\ \frac{\partial f}{\partial x_n}\end{bmatrix} \quad\text{and}\quad \frac{\partial g}{\partial W} \stackrel{\Delta}{=} \begin{bmatrix}\frac{\partial g}{\partial w_{11}} & \cdots & \frac{\partial g}{\partial w_{1n}}\\ \vdots & \ddots & \vdots\\ \frac{\partial g}{\partial w_{m1}} & \cdots & \frac{\partial g}{\partial w_{mn}}\end{bmatrix}.$$
Then

1. $f(x) = c^T x \implies \dfrac{\partial f}{\partial x} = c$

2. $f(x) = \frac{1}{2}x^T Q x \implies \dfrac{\partial f}{\partial x} = Qx$

3. $g(W) = x^T W y \implies \dfrac{\partial g}{\partial W} = xy^T$

4. $g(W) = \frac{1}{2}x^T W^T W x \implies \dfrac{\partial g}{\partial W} = Wxx^T$
Proof: Left as an exercise for the reader. This would be a good exam question, eh? 2
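The lemma can also be spot-checked numerically. For example, a finite-difference check of identity 3 (random x, W, y chosen only for illustration):

% sketch: finite-difference check of identity 3, d(x'*W*y)/dW = x*y'
m = 2; n = 3; h = 1e-6;
x = rand(m,1); y = rand(n,1); W = rand(m,n);
g = x*y';                               % claimed gradient
gfd = zeros(m,n);                       % finite-difference estimate
for i = 1:m
  for j = 1:n
    Wp = W; Wp(i,j) = Wp(i,j) + h;      % perturb one entry of W
    gfd(i,j) = (x'*Wp*y - x'*W*y)/h;
  end
end
norm(g - gfd)                           % small: g(W) is linear in W, so the
                                        % difference is essentially roundoff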

Example 11.1 (gradients with vector-valued functions) Linear associative memory.
$$W = \begin{bmatrix}w_{10} & w_{11} & w_{12}\\ w_{20} & w_{21} & w_{22}\end{bmatrix}$$
$$w \stackrel{\Delta}{=} \mathrm{vec}(W) = \begin{bmatrix}w_{10} & w_{20} & w_{11} & w_{21} & w_{12} & w_{22}\end{bmatrix}^T$$
$$f(x, w) = W\begin{bmatrix}1\\ x_1\\ x_2\end{bmatrix} = \begin{bmatrix}w_{10} + w_{11}x_1 + w_{12}x_2\\ w_{20} + w_{21}x_1 + w_{22}x_2\end{bmatrix}$$
$$E(w) = \frac{1}{2N}\sum\left(d(n) - W\begin{bmatrix}1\\ x(n)\end{bmatrix}\right)^T\left(d(n) - W\begin{bmatrix}1\\ x(n)\end{bmatrix}\right)$$
$$\frac{\partial E}{\partial W} \stackrel{\Delta}{=} \left[\frac{\partial E}{\partial w_{ij}}\right]$$


Can use Lemma 11.1 to show
$$\frac{\partial E}{\partial W} = \frac{1}{N}\sum\left(W\begin{bmatrix}1\\ x(n)\end{bmatrix}\begin{bmatrix}1 & x(n)^T\end{bmatrix} - d(n)\begin{bmatrix}1 & x(n)^T\end{bmatrix}\right) = \frac{1}{N}\sum\left(y(n)\,\bar x(n)^T - d(n)\,\bar x(n)^T\right)$$
where $\bar x(n) \stackrel{\Delta}{=} \begin{bmatrix}1\\ x(n)\end{bmatrix}$. $\Box$


12 2003 09 17: Gradient based learning methods


Recall from last time that for a linear associative memory $y = Wx$ that is to learn a set of training data
$$\mathcal{T} = \{(x_i, d_i)\}_{i=1}^{N}$$
we can write
$$\frac{\partial E}{\partial W} = \frac{1}{N}\sum\left(W\begin{bmatrix}1\\ x(n)\end{bmatrix}\begin{bmatrix}1 & x(n)^T\end{bmatrix} - d(n)\begin{bmatrix}1 & x(n)^T\end{bmatrix}\right) = \frac{1}{N}\sum\left(y(n)\,\bar x(n)^T - d(n)\,\bar x(n)^T\right) = \frac{1}{N}\sum\left((y(n) - d(n))\,\bar x(n)^T\right)$$
where $\bar x(n) \stackrel{\Delta}{=} \begin{bmatrix}1\\ x(n)\end{bmatrix}$.

Question 12.1 How can we use this information to “train” a neural network?

12.1 Steepest Descent



Define $g(n) = \nabla_w E(w)|_{w(n)}$.
Steepest descent iteration:

Heuristic Choose "learning parameter" η:
$$w(n+1) = w(n) - \eta\,g(n).$$

Justification: Expand E in a Taylor's series around w(n):
$$E(w(n+1)) = E(w(n)) - \eta\,\nabla_w E(w)^T g(n) + O\!\left(\|\eta g\|^2\right)$$
$$= E(w(n)) - \eta\,g(n)^T g(n) + O\!\left(\|\eta g\|^2\right)$$
$$= E(w(n)) - \eta\,\|g(n)\|^2 + O\!\left(\|\eta g\|^2\right)$$
which is decreasing for η small enough.

1. Requires knowledge of ∇E

2. Slow convergence for small η

3. Oscillatory (or divergence) for large η.

Example 12.1 Problem 3.2 in [Hay99]. Minimize $f(x) = \frac{1}{2}x^T R_x x - r_{xd}^T x$.[7]

[7] See also GradientDescent.mov at http://www.eng.auburn.edu/users/hodelas/teaching/6240.


M-file 12.1 hwk0302.m


% Homework 3.02, Haykin ’99
% see if running Octave or MATLAB
rxd = [0.8182 ; 0.354]; Rx = [1, 0.8182; 0.8182, 1];

w_opt = Rx\rxd;
fprintf(’(a): w_opt = %12.4e %12.4e\n’,w_opt(1), w_opt(2));

fprintf(’(b): see plots\n’);


nstps = 100; fignum = 0;
wdata = zeros(2,nstps);
for eta = [0.3, 1.0];
errv = zeros(1,nstps);
gradNorm = zeros(1,nstps);

for nn=1:nstps % run nstps iterations of gradient descent

if(nn == 1)
wn = [0;0];
gn = Rx*wn - rxd;
else
wn1 = wdata(:,nn-1); % get w(nn-1)
gn = Rx*wn1 - rxd; % here’s my gradient
wn = wn1 - eta*gn; % get next weights
wdata(:,nn) = wn; % save the weights for plotting
end

errv(nn) = wn’*Rx*wn/2 - rxd’*wn; % here’s the cost function

gradNorm(nn) = norm(gn);
end
fignum = fignum+1; figure(fignum);
subplot(2,1,1);
plot(wdata(1,:), wdata(2,:),’-’, w_opt(1), w_opt(2), ’x’);
grid on ;
title(sprintf(’Steepest descent with eta=%f’,eta));
legend(’iterate values’, ’optimal value’);
xlabel(’w_1(n)’); ylabel(’w_2(n)’);
subplot(2,1,2);
plot(1:nstps,errv,’-’,1:nstps,gradNorm,’-’);
xlabel(’iteration number’);
legend(’cost function value’,’|| gradient ||’);
grid on;
eval (sprintf( ’print -depsc hwk0302_%d.eps’,fignum));


end

% contour plot of optimization surface


fignum=fignum+1;
nx = 100; xx = linspace(-10,10,nx);
ny = 101; yy = linspace(-10,10,ny);
cf = zeros(nx,ny);
for ii=1:nx
for jj=1:ny
wn = [xx(ii); yy(jj)];
cf(ii,jj) = wn’*Rx*wn/2 - rxd’*wn;
end
end

figure(fignum);
meshc(xx,yy,cf’);
xlabel(’x value’);
ylabel(’y value’);
grid on

hwk0302.m Output

(b): see plots


>>


[Plot: steepest descent with η = 0.3 -- iterate values w(n) and the optimal value in the (w_1, w_2) plane; cost function value and ‖gradient‖ vs. iteration number.]

Case 1: η = 0.3

[Plot: steepest descent with η = 1.0 -- iterate values w(n) and the optimal value in the (w_1, w_2) plane; cost function value and ‖gradient‖ vs. iteration number.]

Case 2: η = 1.0


12.2 Newton Iteration


Expand Taylor's series further in $\Delta w(n)$:
$$E(w(n+1)) \approx E(w(n)) + g(n)^T\Delta w(n) + \frac{1}{2}\Delta w(n)^T H(n)\Delta w(n)$$
where
$$H = \nabla^2 E(w) = \nabla g^T = \begin{bmatrix}\frac{\partial^2 E}{\partial w_1^2} & \cdots & \frac{\partial^2 E}{\partial w_1\partial w_m}\\ \vdots & \ddots & \vdots\\ \frac{\partial^2 E}{\partial w_m\partial w_1} & \cdots & \frac{\partial^2 E}{\partial w_m^2}\end{bmatrix}$$
Differentiate w.r.t. $\Delta w$ to get
$$g(n) + H(n)\Delta w(n) = 0 \implies \Delta w(n) = -H(n)^{-1}g(n), \quad\text{i.e.}\quad w(n+1) = w(n) - H(n)^{-1}g(n)$$

1. Good (rapid) convergence if H(n) > 0; instant if E is quadratic.

2. Requires knowledge of g, H.

3. Cannot guarantee H > 0 in general.
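For the quadratic cost of Example 12.1 the Hessian is simply $R_x$, so a single Newton step reaches the minimizer. A minimal sketch (the numerical values are copied from hwk0302.m):

% sketch: one Newton step on the quadratic cost of Example 12.1
Rx  = [1, 0.8182; 0.8182, 1];      % Hessian of E(w) = w'*Rx*w/2 - rxd'*w
rxd = [0.8182; 0.354];
w   = [0; 0];                      % starting point
g   = Rx*w - rxd;                  % gradient at w
w   = w - Rx \ g;                  % Newton step: w - H^{-1} g
norm(Rx*w - rxd)                   % ~0: the gradient vanishes after one step

Compare this to the 100 steepest-descent iterations needed in the example above; the price is forming and factoring H(n) at each step for non-quadratic costs.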


Homework 4 Due Fri Sept. 24. Hwk


Your homework this week will include both written and m-file components. You should
turn in your written portion of the homework at the start of class on Sept. 17 and you should
email your m-file component to simmoat@eng.auburn.edu by the start of class on Sept. 17.
Mr. Simmons will send details by email of how to submit your m-file.
1. Recall from Lemma 11.1 that the gradient of a scalar function f is defined as
$$\nabla_x f(x) \stackrel{\Delta}{=} \begin{bmatrix}\partial f/\partial x_1\\ \vdots\\ \partial f/\partial x_m\end{bmatrix}$$

(a) (written) Consider a function f(x) where $x = \begin{bmatrix}x_1 & x_2\end{bmatrix}^T \in \mathbb{R}^2$. Show that if $f(x) = c^T x$ then
$$\nabla_x f = \nabla_x\left(\begin{bmatrix}c_1 & c_2\end{bmatrix}\begin{bmatrix}x_1\\ x_2\end{bmatrix}\right) = c = \begin{bmatrix}c_1\\ c_2\end{bmatrix}.$$

(b) (written) Consider a function f(x) where $x = \begin{bmatrix}x_1 & x_2\end{bmatrix}^T \in \mathbb{R}^2$. Show that if $f(x) = \frac{1}{2}x^T Q x$ where $Q = Q^T = \begin{bmatrix}q_{11} & q_{12}\\ q_{12} & q_{22}\end{bmatrix}$ then
$$\nabla_x f = Qx = \begin{bmatrix}q_{11}x_1 + q_{12}x_2\\ q_{12}x_1 + q_{22}x_2\end{bmatrix}.$$

(c) (written) Consider a function $f(W) = x^T W y$ where $W \in \mathbb{R}^{2\times 3}$ (which implies that $x \in \mathbb{R}^2$ and $y \in \mathbb{R}^3$). Show that
$$\nabla_W f = xy^T.$$

Solution

(a) $\nabla_x f = \nabla_x\left(\begin{bmatrix}c_1 & c_2\end{bmatrix}\begin{bmatrix}x_1\\ x_2\end{bmatrix}\right) = \nabla_x(c_1x_1 + c_2x_2) = \begin{bmatrix}\frac{\partial(c_1x_1 + c_2x_2)}{\partial x_1}\\ \frac{\partial(c_1x_1 + c_2x_2)}{\partial x_2}\end{bmatrix} = \begin{bmatrix}c_1\\ c_2\end{bmatrix}$

(b) $f(x) = x_1^2 q_{11}/2 + x_1x_2q_{12} + x_2^2q_{22}/2$, so
$$\nabla_x f = \begin{bmatrix}\partial f/\partial x_1\\ \partial f/\partial x_2\end{bmatrix} = \begin{bmatrix}q_{11}x_1 + q_{12}x_2\\ q_{12}x_1 + q_{22}x_2\end{bmatrix} = Qx$$

(c) I'll show the general case $W \in \mathbb{R}^{m\times n}$, so that
$$f(x) = x^T(Wy) = \sum_{i=1}^{m} x_i\left(\sum_{j=1}^{n} w_{ij}y_j\right) = \sum_{i=1}^{m}\sum_{j=1}^{n} w_{ij}\,x_iy_j$$
Then the partial derivative $\frac{\partial f}{\partial w_{ij}} = x_iy_j$, which means that
$$\frac{\partial f}{\partial W} \stackrel{\Delta}{=} \begin{bmatrix}\frac{\partial f}{\partial w_{11}} & \cdots & \frac{\partial f}{\partial w_{1n}}\\ \vdots & \ddots & \vdots\\ \frac{\partial f}{\partial w_{m1}} & \cdots & \frac{\partial f}{\partial w_{mn}}\end{bmatrix} = \begin{bmatrix}x_1y_1 & \cdots & x_1y_n\\ \vdots & \ddots & \vdots\\ x_my_1 & \cdots & x_my_n\end{bmatrix} = xy^T$$


[Plot: Hwk 4 problem 2(a) -- steepest descent with η = 0.1; iterate values x(n) and optimal value in the (x_1, x_2) plane; cost function value and ‖gradient‖ vs. iteration number.]

Figure 8: Results of Homework 4 Problem 2a.


 
2. Consider the function $f(x) = x^T\begin{bmatrix}2 & 1\\ 1 & 2\end{bmatrix}x + \begin{bmatrix}1 & 1\end{bmatrix}x$

(a) (mfile) Write an m-file that uses the method of steepest descent to find the value of x that minimizes f(x).

Solution
Note Misprint in assignment: I meant to write $f(x) = x^T\begin{bmatrix}2 & 1\\ 1 & 2\end{bmatrix}x/2 + \begin{bmatrix}1 & 1\end{bmatrix}x$. I'm surprised no one asked about that. The m-file is nearly identical to M-file 12.1 (hwk0302.m) (see listing at the end of this solution). Plots are in Figure 8.

(b) (written) Use $\nabla_x f = 0$ to solve the above minimization by hand. Compare your theoretical answer to the result from your m-file. Explain any differences you observe.


Solution Derivation of optimal solution[8]
$$f(x) = \begin{bmatrix}x_1 & x_2\end{bmatrix}\begin{bmatrix}2 & 1\\ 1 & 2\end{bmatrix}\begin{bmatrix}x_1\\ x_2\end{bmatrix} + \begin{bmatrix}1 & 1\end{bmatrix}\begin{bmatrix}x_1\\ x_2\end{bmatrix} \stackrel{\Delta}{=} x^TQx + c^Tx = 2x_1^2 + 2x_1x_2 + 2x_2^2 + x_1 + x_2$$
$$\nabla_x f = \begin{bmatrix}\partial f/\partial x_1\\ \partial f/\partial x_2\end{bmatrix} = \begin{bmatrix}4x_1 + 2x_2 + 1\\ 2x_1 + 4x_2 + 1\end{bmatrix} = 2Qx + c \implies x_{opt} = -\frac{1}{2}Q^{-1}c = \begin{bmatrix}-1/6\\ -1/6\end{bmatrix}$$

hwk304.m Output
2(b): difference is 2.7756e-17 0.0000e+00
>>

The optimal value is where $\begin{bmatrix}4 & 2\\ 2 & 4\end{bmatrix}x + \begin{bmatrix}1\\ 1\end{bmatrix} = 0$, or $x = -\begin{bmatrix}1/6\\ 1/6\end{bmatrix}$.[9] Differences (see m-file output) are due to double precision arithmetic roundoff.

3. (mfile) Use the method of steepest descent to train a linear neural network (LNN)
y = W x̄ to mimic the logic gates indicated below. (written) Discuss the quality of the
output of your LNN: why does it work (or not work)?

n x1 (n) x2 (n) AND OR XOR


1 0 0 0 0 0
2 0 1 0 1 1
3 1 0 0 1 1
4 1 1 1 1 0

You should run your iteration for 100 steps. Your plots should include:

(a) LNN error E as a function of iteration step.


(b) Norm of the gradient as a function of iteration step.
(c) A mesh plot of the output of each of your LNN's on the range $0 \le (x_1, x_2) \le 1$.

[8] This derivation added to original homework solutions.
[9] Note: the original solutions had a misprint here. This is the correct answer.


Solution Recall that $\frac{\partial}{\partial x}\left(\sum_i f_i(x)\right) = \sum_i\left(\frac{\partial}{\partial x}f_i(x)\right)$. Thus we have
$$E = \frac{1}{2}\sum_{n=1}^{4}(d(n) - Wx(n))^T(d(n) - Wx(n)) = \sum_{n=1}^{4}\left(d(n)^Td(n)/2 - d(n)^TWx(n) + x(n)^TW^TWx(n)/2\right)$$
$$J = \frac{\partial E}{\partial W} = \sum_{n=1}^{4}\left(-d(n)\,x(n)^T + Wx(n)\,x(n)^T\right)$$
The m-file is in M-file 12.2 (hwk304.m). Plots are in Figure 9.

4. (mfile) Repeat problem 3 using the Gauss-Newton method.

Solution The derivation of the Gauss-Newton method in §13.2 requires us to work with a vector e such that $E = e^Te/2$. Recall that
$$E \stackrel{\Delta}{=} \frac{1}{2}\sum\left((d(n) - W\bar x(n))^T(d(n) - W\bar x(n))\right) = e^Te/2 = \sum e_i^2/2.$$
If we define
$$e = \begin{bmatrix}d(1) - W\bar x(1)\\ \vdots\\ d(4) - W\bar x(4)\end{bmatrix}$$
it is easy to verify that $E = e^Te/2$. With this definition,
$$J = \begin{bmatrix}\frac{\partial e_1}{\partial w_0} & \cdots & \frac{\partial e_1}{\partial w_2}\\ \vdots & \ddots & \vdots\\ \frac{\partial e_4}{\partial w_0} & \cdots & \frac{\partial e_4}{\partial w_2}\end{bmatrix} = -\begin{bmatrix}\bar x(1)^T\\ \vdots\\ \bar x(4)^T\end{bmatrix}$$

My results are in Figure 10.

Note Clearly label all plots and turn in printed copies of your plots with your written
homework.

M-file 12.2 hwk304.m

% Homework 4 Solutions ELEC6240 Problems 2a,3,4

%----------------------------------------------
% Problem 2a : min x’ Q x/2 + c’ x
%----------------------------------------------


% PROBLEM! I wrote x’ Q x, not x’ Q x/2, so ...


QQ = [2, 1; 1, 2]*2; cc = [1 ; 1];
xopt = -QQ\cc;
fprintf(’(a): w_opt = %12.4e %12.4e\n’,xopt(1), xopt(2));

%----------------------------------------------
% Problem 2(b): steepest descent
%----------------------------------------------
nstps = 200; fignum = 0;
xdata = zeros(2,nstps); % array in which to save iterative solution values
eta = 0.1;
errv = zeros(1,nstps);
gradNorm = zeros(1,nstps);
for nn=1:nstps
if(nn == 1)
xn1 = [0;0]; % 1st step: initialize x(n-1) to zero
else
xn1 = xdata(:,nn-1); % xn1 = x(n-1)
end
gn = QQ*xn1 + cc; % gradient at x(n-1)
xn = xn1 - eta*gn; % compute next x(n)
xdata(:,nn) = xn; % and store it
errv(nn) = xn’*QQ*xn/2 + cc’*xn;
gradNorm(nn) = norm(gn);
end
xmin = xn; % save in variable for Simmons to grade
fprintf(’2(b): difference is %12.4e %12.4e\n’, xmin(1) - xopt(1), ...
xmin(2) - xopt(2))

% plot of iterative solution values with optimal solution marked


fignum = fignum+1; figure(fignum);
subplot(2,1,1);
plot(xdata(1,:), xdata(2,:),’-’, xopt(1), xopt(2), ’x’);
grid on ;
title(sprintf(’Hwk 4: 2(a) Steepest descent with eta=%f’,eta));
legend(’iterate values’, ’optimal value’);
xlabel(’x_1(n)’); ylabel(’x_2(n)’);

% norm of the gradient


subplot(2,1,2);
plot(1:nstps,errv,’-’,1:nstps,gradNorm,’-’);
xlabel(’iteration number’);
legend(’cost function value’,’|| gradient ||’);
grid on;


eval(sprintf(’print -depsc hwk304%.2d.eps’,fignum));

%----------------------------------------------
% Problems 3,4 , AND,XOR,OR gates, Steepest Descent
%----------------------------------------------
xn = [0 0; 0 1; 1 0; 1 1]’; % x(i) is in col i of xn

x1 = linspace(0,1,25);
x2 = linspace(0,1,27);

% column 1: AND, column 2: OR, column 3: XOR


dd = [ [0;0;0;1], [0;1;1;1], [0;1;1;0] ];
stps = 100;
wdata = zeros(3,stps);
eta = 0.1;
delta = 0.01; % for Gauss-Newton step
nstps = 100;

for probNum = 3:4 % problem number


fignum = fignum+1; figure(fignum);
err = zeros(3,stps);
gradNorm = zeros(3,stps);
for jd = 1:3 % select column of dd we’re using
Wn = zeros(1,3); % initialize 1st data point
for nn=1:stps
% at loop start, Wn contains the last value of W(n)

% compute gradient (for steepest descent)


gn = zeros(1,3); % initialize gradient value
for ii = 1:4;
xbar = [1; xn(:,ii)];
gn = gn + (Wn*xbar)*xbar’ - dd(ii,jd)*xbar’;
end
if(probNum == 3) % steepest descent case
Wn = Wn - eta*gn; % update W(n)
else % probNum == 4 means this is the Gauss-Newton case
wn = Wn’; % we derived G-N with w = column vector

ee = zeros(4,1);
JJ = zeros(4,3);
for ii=1:4
xbar = [1; xn(:,ii)];
ee(ii) = (dd(ii,jd) - Wn*xbar);
JJ(ii,:) = -xbar’;


end
wn = wn - ( JJ’*JJ + delta*eye(3))\JJ’*ee;
Wn = wn’;
end
wdata(:,nn) = Wn’;

% compute error for this function (jd =1,2,3 -> AND, OR, XOR)
err(jd,nn) = 0;
for ii = 1:4
xbar = [1;xn(:,ii)];
err(jd,nn) = err(jd,nn) + (dd(ii,jd) - Wn*xbar)^2/2;
end
gradNorm(jd,nn) = norm(gn);
end

titles = {’AND’,’OR’,’XOR’};
subplot(2,2,jd) % mesh plots
warning(’off’); % avoid pesky messages on xor plot
meshPex(Wn,x1,x2,’input a’,’input b’,titles{jd});
end
% print meshplots of the 3 network outputs
eval(sprintf(’print -depsc hwk304%.2d.eps’,fignum));

% plot of LNN error E as function of iteration step


algType = {’steepest descent’,’Gauss-Newton’};
varnames = {’eta’,’delta’};
fignum = fignum+1; figure(fignum);
subplot(2,1,1);
plot(1:nstps, err,’-’);
legend(’AND’,’OR’,’XOR’);
xlabel(’iteration number’);
grid on ;
title(sprintf(’Cost function value: %s with %s=%f’, ...
algType{probNum-2},varnames{probNum-2}, eta*(probNum==3) + delta*(probNum==4)));

% plot of norm of the gradient as function of iteration step


subplot(2,1,2);
semilogy(1:stps,gradNorm,’-’);
legend(’AND’,’OR’,’XOR’);
xlabel(’iteration number’);
grid on;
title(sprintf(’gradient norm: %s’,algType{probNum-2}));
eval(sprintf(’print -depsc hwk304%.2d.eps’,fignum));


if(probNum == 3)
eta3 = eta*ones(1,3); % used same eta for all three
wn3 = wdata;
error3 = err;
nmgrad3 = gradNorm;
else
eta4 = delta*ones(1,3); % used same eta for all three
wn4 = wdata;
error4 = err;
nmgrad4 = gradNorm;
end

end


[Plots: cost function value for steepest descent with η = 0.1 (AND, OR, XOR) and gradient norm vs. iteration number; mesh plots of the trained LNN outputs for the AND, OR, and XOR gates over the unit square of inputs a and b.]

Figure 9: Solution to Homework 4 Problem 3.


[Plots: cost function value for Gauss-Newton with δ = 0.01 (AND, OR, XOR) and gradient norm vs. iteration number; mesh plots of the trained LNN outputs for the AND, OR, and XOR gates over the unit square of inputs a and b.]

Figure 10: Solution to Homework 4 Problem 4.


13 2003 09 19: More gradient based learning


13.1 A matlab m-file example
M-file 13.1 meshPexPlot.m
% test of meshPex plotting m-file

xx = linspace(-1,2,11);
yy = linspace(-1,2,13);
W = [1,2,3];
yvals = meshPex(W,xx,yy,’x_1’,’x_2’,’Example plot’);
print -depsc meshPexPlot.eps

M-file 13.2 meshPex.m


function Yvals = meshPex(W,x1,x2,xstr, ystr, tstr)
% Yvals = meshPex(W,x1,x2,tstr)
% compute and/or plot output of a single layer linear neural network
% function of two variables
% inputs: W (1 x 3): network weights (bias weight plus the two input weights)
% x1, x2: each vector xx = [x1(ii); x2(jj)] for appropriate values of
% ii, jj
% xstr, ystr, tstr: strings for xlabel, ylabel, and title, respectively
% if all three are passed, then the mesh plot is plotted.
% if not, these arguments are ignored

doMeshPlot = 0; % assume no mesh plot.


if(nargin == 6) % nargin automatically contains the number of input arguments
if( isstr(xstr) & isstr(ystr) & isstr(tstr) )
doMeshPlot = 1;
end
end

% could check dimensions, etc., but I’m not going to


for ii=1:length(x1)
for jj=1:length(x2)
xbar = [1;x1(ii);x2(jj)];
Yvals(jj,ii) = W*xbar;
end
end

if(doMeshPlot)
mesh(x1,x2,Yvals);


xlabel(xstr);
ylabel(ystr);
title(tstr);
grid on;
end

[Plot: "Example plot" -- mesh of the linear network output W*[1; x_1; x_2] over -1 <= x_1, x_2 <= 2, produced by meshPexPlot.m.]


13.2 Gauss Newton


[Notation in this section of the book is icky.]
Define the error vector $e(w) \stackrel{\Delta}{=} \begin{bmatrix}e_1(w)\\ \vdots\\ e_n(w)\end{bmatrix}$.

Remark 13.1 $w$ = column vector $m \times 1$.

Suppose
$$E(w) \stackrel{\Delta}{=} \frac{1}{2}\sum_{i=1}^{n} e_i^2(w) = \frac{1}{2}e(w)^Te(w)$$
(mean squared error).

Remark 13.2 e = column vector p × 1

1st order Taylor's series for the error e about w(n) (to select a new w):
$$e(w(n+1)) = e(w(n)) + \nabla_w e|_{w(n)}^T(w(n+1) - w(n)) = e(w(n)) + J(n)(w(n+1) - w(n))$$

Remark 13.3 J(n) is the Jacobian (1st derivative matrix).

Select new w to satisfy
$$w \stackrel{\Delta}{=} \arg\min_w \frac{1}{2}\|e(w(n+1))\|^2 = \arg\min_w\left(\frac{1}{2}\|e(n)\|^2 + e(n)^TJ(n)(w - w(n)) + \frac{1}{2}(w - w(n))^TJ(n)^TJ(n)(w - w(n))\right)$$
Differentiate w.r.t. w, set to 0 to obtain
$$J(n)^Te(n) + J(n)^TJ(n)(w - w(n)) = 0 \implies w(n+1) = w(n) - \left(J(n)^TJ(n)\right)^{-1}J(n)^Te(n) = w(n) - J(n)^\dagger e(n)$$

Problem $J(n)^TJ(n)$ is not guaranteed to be invertible. In fact, if $p < m$ (e.g., $p = 1 \implies J$ is a row vector), then $J(n)^TJ(n)$ is guaranteed to be not invertible.

Idea Invert $J(n)^TJ(n) + \delta I$ instead.

Remark 13.4 Equivalent to minimizing $\frac{1}{2}\left(\delta\|w - w(n)\|^2 + \|e(n)\|^2\right)$. Take smaller steps, but still reduce the norm of the error.
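As a concrete illustration of the regularized step, the following sketch takes one such step for a linear network $y = W\bar x$; it mirrors what the Homework 4 problem 4 solution does, using the AND-gate data from that assignment and an arbitrary δ:

% sketch: one regularized Gauss-Newton step for a linear network y = W*xbar
xn    = [0 0; 0 1; 1 0; 1 1]';          % inputs (one column per pattern)
dd    = [0; 0; 0; 1];                   % AND-gate targets
delta = 0.01;                           % regularization parameter
w     = zeros(3, 1);                    % weights as a column vector [w0; w1; w2]

ee = zeros(4, 1); JJ = zeros(4, 3);
for i = 1:4
  xbar    = [1; xn(:, i)];
  ee(i)   = dd(i) - w'*xbar;            % error e_i(w)
  JJ(i,:) = -xbar';                     % Jacobian row de_i/dw
end
w = w - (JJ'*JJ + delta*eye(3)) \ (JJ'*ee);   % regularized Gauss-Newton step
w'                                      % updated weights after one step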


13.3 Perceptrons
Read §3.8-3.9

Induced local field (linear):
$$v = \sum_{i=1}^{m} w_ix_i + b$$

Classify patterns: must be "linearly separable."

[Diagram: pattern classes $C_1$ and $C_2$ and a separating hyperplane $w^Tx + b = 0$ with normal vector w.]

Treat b as a weight; modify
$$x(n) \stackrel{\Delta}{=} \begin{bmatrix}1 & x_1(n) & \cdots & x_m(n)\end{bmatrix}^T$$
$$w(n) \stackrel{\Delta}{=} \begin{bmatrix}b(n) & w_1(n) & \cdots & w_m(n)\end{bmatrix}^T$$
So
$$v(n) = w(n)^Tx(n)$$
$$x \in C_1 \implies w^Tx > 0.$$


Algorithm 13.1 Perceptron training:

Inputs Sequence of inputs x(n) and desired outputs d(n), learning parameter η (or sequence of learning parameters η(n)).

Outputs Weight vector w

for n = 1, 2, ...
    if x(n) ∈ C_1 and w(n)^T x(n) ≤ 0   /* incorrectly classified */
        w(n + 1) = w(n) + η x(n)
    else if x(n) ∈ C_2 and w(n)^T x(n) > 0   /* incorrectly classified */
        w(n + 1) = w(n) − η x(n)
    end if
endfor

Analysis of this algorithm is in the next section of the notes.


Example 13.1 Perceptron convergence algorithm NAND gate:

M-file 13.3 perEx.m

% train a single layer perceptron to mimic a NAND gate


help perceptron2Dlearn
xx = [0,1,0,1; 0, 0, 1, 1]
dd = [1,1,1,-1]

[WW, iNum] = perceptron2Dlearn(xx,dd,1e4);


fprintf(’Converged in %d iterations to\n’,iNum);
WW
Yvals = meshPex(WW,linspace(0,1), linspace(0,1), ...
’NAND input 1’, ’NAND input 2’, ’Activation function input’);
print -depsc perEx.eps

perEx.m Output

W = perceptron2Dlearn(x,d,nIter);
train a threshhold activation function perceptron to learn a data set
(if possible)
inputs:
x, d: x(Nx2), d(N,1): input, desired output pairs
d(nn) should be either 1 or -1.
nIter: max number of iterations to run
outputs:
W: phi( W* xbar ) should match data (if possible in given # iterations)
iNum: number of passes through data to classify

xx =

0 1 0 1
0 0 1 1

dd =

1 1 1 -1

Converged in 9 iterations to

WW =

4 -2 -3


>>

[Plot: activation function input $W\bar x$ (before the threshold is applied) as a function of NAND input 1 and NAND input 2 over the unit square.]

M-file 13.4 perceptron2Dlearn.m

function [WW,iNum] = perceptron2Dlearn(xx,dd,nIter)


% W = perceptron2Dlearn(x,d,nIter);
% train a threshhold activation function perceptron to learn a data set
% (if possible)
% inputs:
% x, d: x(Nx2), d(N,1): input, desired output pairs
% d(nn) should be either 1 or -1.
% nIter: max number of iterations to run
% outputs:
% W: phi( W* xbar ) should match data (if possible in given # iterations)
% iNum: number of passes through data to classify

iNum = 0;
done = 0;

WW = zeros(1,3);


NN =size(xx,2); % number of columns in xx, size(xx,1) gives number of rows.


while ( ~done )
% present all input pairs one at a time and correct
nBad = 0; % count mis-classified entries
for nn = 1:NN
xn = [1;xx(:,nn)];
dn = dd(nn);

if( dn*WW*xn <= 0 )


WW = WW + dn*xn’;
nBad = nBad + 1; % another misclassified
end
end

if( nBad == 0 ) % none misclassified, so done


done = 1;
end

iNum = iNum + 1;
if( iNum >= nIter ) % too many iterations, quit.
done = 1;
end
end


14 2003 09 22: Single Layer Perceptrons (conclusion)


14.1 The perceptron convergence theorem
Recall Algorithm 13.1 from last lecture:

Algorithm 14.1 Perceptron training:

Inputs Sequence of inputs x(n) and desired outputs d(n), learning parameter η (or sequence of learning parameters η(n)).

Outputs Weight vector w

for n = 1, 2, ...
    if x(n) ∈ C_1 and w(n)^T x(n) ≤ 0   /* incorrectly classified */
        w(n + 1) = w(n) + η x(n)
    else if x(n) ∈ C_2 and w(n)^T x(n) > 0   /* incorrectly classified */
        w(n + 1) = w(n) − η x(n)
    end if
endfor

Analysis of this algorithm requires use of

Theorem 14.1 (Cauchy-Schwartz inequality) Given compatibly dimensioned vectors x, y and the Euclidean norm $\|x\| = \sqrt{x^Tx}$,
$$\|x\|^2\|y\|^2 \ge \left(x^Ty\right)^2$$

Theorem 14.2 Suppose input vectors x(n) in Algorithm 13.1 are drawn from subsets $X_i$ of $C_i$, $i = 1, 2$. Assume that sets $C_1$ and $C_2$ are linearly separable. Then Algorithm 13.1 converges with $w(0) = 0$ and $\eta = 1$.

Proof: Since $C_1$ and $C_2$ are linearly separable it follows that there exists a vector $w_0$ such that $w_0^Tx(n) > 0$ for all $x(n) \in X_1$ and $w_0^Tx(n) \le 0$ for all $x(n) \in X_2$. Define
$$\alpha = \min_{x\in X_1} w_0^Tx(n) \qquad (14.1)$$
Suppose all input vectors x(n) are drawn from $X_1$ and that $w(n)^Tx(n) \le 0$ (the vectors are incorrectly classified). Then for $w(0) = 0$ and $\eta = 1$
$$w(n+1) = \sum_{i=1}^{n} x(i)$$


By equation (14.1) we have
$$w_0^Tw(n+1) = \sum_{i=1}^{n} w_0^Tx(i) \ge n\alpha \qquad (14.2)$$
By Cauchy-Schwartz, (14.1) and (14.2) we have
$$\|w_0\|^2\|w(n+1)\|^2 \ge n^2\alpha^2 \implies \|w(n+1)\|^2 \ge n^2\alpha^2/\|w_0\|^2 \qquad (14.3)$$
Define
$$\beta = \max_{x\in X_1}\|x(k)\|^2.$$
Notice that, for $k \le n$, $w(k+1) = w(k) + x(k)$ and so
$$\|w(k+1)\|^2 = \|w(k)\|^2 + \|x(k)\|^2 + 2w(k)^Tx(k)$$
By assumption the vectors x(k) are incorrectly classified, and so $w(k)^Tx(k) < 0$, which implies
$$\|w(k+1)\|^2 \le \|w(k)\|^2 + \|x(k)\|^2$$
and so
$$\|w(k+1)\|^2 - \|w(k)\|^2 \le \|x(k)\|^2. \qquad (14.4)$$
Sum (14.4) over $k = 1, \ldots, n$ and recall $w(0) = 0$ to get
$$\|w(n+1)\|^2 \le \sum_{k=1}^{n}\|x(k)\|^2 \le n\beta \qquad (14.5)$$
Equation (14.3) gives a quadratically growing lower bound on w(n+1). Conversely, equation (14.5) gives a linearly growing upper bound on w(n+1). We conclude that there exists some number $n_{max}$ such that a correct classification must occur for $n \ge n_{max}$. $\Box$

Remark 14.1 The above theorem states nothing about converging to a vector that correctly differentiates between $C_1$ and $C_2$. It merely states that if you present vectors from $C_1$ long enough it will eventually correctly classify those vectors. Nevertheless, the stronger result can also be shown to be true: if a solution exists, the above procedure will converge so that $w(n_0) = w(n_0+1) = \cdots$ for some $n_0 \le n_{max}$.

Can also use an adaptive error correction model: let η(n) be the smallest integer such that
$$\eta(n)\,x(n)^Tx(n) \ge w(n)^Tx(n)$$

Question 14.1 Why is it o.k. to use integers? Shouldn’t we have to use small numbers?


14.2 Other activation functions


Suppose we use a continuous activation function so that (ignoring dependence on step number n)
$$y_i = \phi(w_{i,\cdot}^T\bar x) = \phi\left(\sum_j w_{ij}\bar x_j\right)$$

Question 14.2 What is $\partial y_i/\partial w_{ij}$?

Must apply the chain rule. Define $v_i = w_{i,\cdot}^T\bar x$.
$$\frac{\partial y_i}{\partial w_{ij}} = \frac{\partial y_i}{\partial v_i}\,\frac{\partial v_i}{\partial w_{ij}} = \frac{\partial\phi(v)}{\partial v}\,\bar x_j$$
Define $\phi'(v) \stackrel{\Delta}{=} \frac{\partial\phi}{\partial v}\Big|_{v}$ (scalar and vector form). Then
$$\frac{\partial y_i}{\partial w_{i,\cdot}} = \phi'(w^T\bar x)\,\bar x$$
a vector-valued function.

Question 14.3 How can we use this to design a training algorithm using continuous valued φ?

We need to determine
$$\frac{\partial E}{\partial W} = \frac{\partial}{\partial W}\left(\frac{1}{2}\sum_n(d(n) - \phi(W\bar x))^T(d(n) - \phi(W\bar x))\right)$$

function y = phiT(x,a,b)
% function y = phi(x,a,b)
% hyperbolic tangent activation function a*tanh(b*x)
y = a*tanh(b*x);

function dy = dphi(x,a,b)
% derivative of hyperbolic tangent function
y = phiT(x,a,b);
dy = (b/a)*(a - y ) .* ( a + y );


14.3 Exam 1 information


14.3.1 Topics
[Hay99] Simon Haykin. Neural Networks: A Comprehensive Foundation. Prentice Hall,
2nd edition, 1999

Activation functions and their derivatives

Neuron models mathematical formulae, signal flow graphs, capabilities (separating hyper-
planes, normal vectors, dot products)

Learning processes Error-correction learning (Delta rule), memory-based learning, Heb-


bian learning, Competitive learning
NOT covered Boltzmann learning, credit assignment, learning without a teacher

Learning tasks Associative memory/pattern association


Covariance matrix associative memory design (requires orthogonality)

"optimal" associative memory design (pseudo-inverse $A^\dagger = (A^TA)^{-1}A^T$ if A is tall and thin)

Single layer networks design, capabilities, performance issues

Unconstrained optimization techniques Error function E, Matrix calculus identities,


Gradient based learning methods

Perceptron learning algorithm What is it? Why does it work?

MATLAB/C short programming exercises (writing and reading)

14.3.2 Permitted resources


None. You may bring a pencil or pen (use a pen only if you never make mistakes) and
an eraser. You are not permitted to use a calculator, textbook, written notes, oral or
written communication with other people (besides the instructor/GTA), laptop computers,
cell phones, PDAs, wireless modems, telepathic contact, or any other resources besides your
own mind, body and a writing utensil. T-shirts with Maxwell’s equations on the back will
be tolerated, but I reserve the right to reseat you in the back of the classroom. You are to sit
with at least one empty chair between you and the nearest classmate. Use of unauthorized
resources on this exam will result in a failing grade.

Exam 1 material ends here.


15 2003 09 24 Multi-layer perceptrons


Don’t Forget! IEEE meeting tonight in 238 at 6:00pm with Chevron-Texaco - they’re
hiring, and they’re providing barbecue house. (GPA ≥ 3.0; pre-select at Career Services by
TOMORROW!)

15.1 Introduction
Read §4.1-4.2

Two pass algorithm:

1. First pass: present data, get outputs and intermediate (local field) values.

2. Second pass; adjust weights according to results of first pass.

Remark 15.1 back-propagation is a learning algorithm, not a kind of neural network.

1. Computationally efficient

2. Not guaranteed to converge to optimal solution

Changes from chapter 3:

1. Activation function will be nonlinear, e.g., the sigmoid
$$y_j = \frac{1}{1 + e^{-av_j}}$$

2. At least one hidden layer.

3. High connectivity.

Vocabulary

Function signal forward (network operation)

Error signal training (backward)


Notation used in text is bad; uses indices i, j, and k to denote both a layer number and
to designate a neuron within a specified layer. We’ll use this instead:

[Diagram: two-layer network. The input $y^{(0)}$ (with a +1 bias input) feeds weight matrix $W^{(1)}$, giving local fields $v^{(1)}$ and hidden outputs $y^{(1)}$; $y^{(1)}$ (with a +1 bias) feeds $W^{(2)}$, giving $v^{(2)}$ and outputs $y^{(2)}$.]

• Use subscripts i, j to denote node numbers/weight indices, superscript k to denote the layer number, and $m_k$ to denote the number of neurons in layer k.

Input layer Write here as $y^{(0)}(n)$, or $y^{(0)}$ where the time index is clear. Individual neurons are denoted $y_1^{(0)}(n), \ldots, y_{m_1}^{(0)}(n)$. Present pattern $y^{(0)}(n) = x(n)$.

Weight matrices The connection from layer k−1 to layer k is denoted $W^{(k)}(n) = \left[w_{ij}^{(k)}(n)\right]$: k is the layer number, ij denotes the connection from input neuron j to output neuron i, and n is the timestep. Update is
$$W^{(k)}(n+1) = W^{(k)}(n) + \Delta W^{(k)}(n)$$

Forward calculations Calculate $v^{(k)}$, $y^{(k)}$ as follows (a small MATLAB sketch of these forward calculations appears after this list):
$$v^{(k)}(n) = W^{(k)}(n)\,y^{(k-1)}(n), \quad\text{equivalently}\quad v_i^{(k)} = \sum_j w_{ij}^{(k)}\,y_j^{(k-1)}(n)$$
$$y^{(k)}(n) = \phi(v^{(k)}(n))$$
Need not use the same activation function φ at all neurons, but I will follow that format here.

Desired output Associated with x(n) is the desired output vector d(n). Define the error vector $e(n) = d(n) - y^{(2)}(n)$. If there are more than two layers, then set $k_{max}$ = maximum layer number (2 in the diagram above) and set $e(n) = d(n) - y^{(k_{max})}(n)$.

Error function $E(n) = e(n)^Te(n) = \sum_j e_j(n)^2$.


15.2 Back-propagation algorithm


Define the instantaneous error function E(n) over output nodes only:
$$E(n) = \frac{1}{2}\sum_{j=1}^{m_{k_{max}}} e_j^2(n) \stackrel{\Delta}{=} \frac{1}{2}e(n)^Te(n) \qquad (15.1)$$
Define the average error function $E_{av}$: $E_{av}(N) = \frac{1}{N}\sum_{n=1}^{N}E(n)$.
Basis of back-propagation: chain rule of differentiation.

15.3 Output layer update


Look at an individual neuron:
[Diagram (Hodel notation): previous-layer outputs $\bar y^{(j)}$ feed weights $W^{(i)}$ to form local fields $\bar v^{(i)}$; the activation φ produces $\bar y^{(i)}$, which contributes to the error E.]

$$y_i^{(k)}(n) = \phi(v_i^{(k)}(n)) = \phi\left(\sum_{j=0}^{m} w_{ij}^{(k)}(n)\,y_j^{(k-1)}(n)\right)$$

Apply the chain rule -- repeatedly!
$$\frac{\partial E(n)}{\partial w_{ij}^{(k)}(n)} = \left(\frac{\partial E(n)}{\partial e_i(n)}\right)\left(\frac{\partial e_i(n)}{\partial y_i^{(k)}(n)}\right)\left(\frac{\partial y_i^{(k)}(n)}{\partial v_i^{(k)}(n)}\right)\left(\frac{\partial v_i^{(k)}(n)}{\partial w_{ij}^{(k)}(n)}\right)$$
Let $k = k_{max}$ (looking at output neurons). From equation (15.1) and $e_j = d_j - y_j$ we have
$$\frac{\partial E(n)}{\partial e_i(n)} = e_i(n) \;\rightarrow\; \nabla_{e(n)}E(n) = e(n)$$
$$\frac{\partial e_i(n)}{\partial y_i^{(k)}(n)} = -1$$
$$\frac{\partial y_i^{(k)}(n)}{\partial v_i^{(k)}(n)} = \phi'(v_i^{(k)}(n))$$
$$\frac{\partial v_i^{(k)}(n)}{\partial w_{ij}^{(k)}(n)} = y_j^{(k-1)}(n)$$


and so
$$\frac{\partial E(n)}{\partial w_{ij}^{(k)}(n)} = -e_i(n)\,\phi'(v_i^{(k)}(n))\,y_j^{(k-1)}(n)$$


From last page: output layer weights update with

[Diagram (Hodel notation): previous-layer outputs $\bar y^{(j)}$ feed weights $W^{(i)}$ to form local fields $\bar v^{(i)}$; the activation φ produces $\bar y^{(i)}$, which contributes to the error E.]

$$\frac{\partial E(n)}{\partial w_{ij}^{(k)}(n)} = -e_i(n)\,\phi'(v_i^{(k)}(n))\,y_j^{(k-1)}(n)$$
Define the update rule as follows:
$$\delta_i(n) = e_i(n)\,\phi'(v_i(n))$$
$$\delta(n) \stackrel{\Delta}{=} \begin{bmatrix}\delta_1(n) & \cdots & \delta_m(n)\end{bmatrix}^T = e\,.\!*\,\phi'(v)$$
$$\Delta W(n) \stackrel{\Delta}{=} \eta\,\delta^{(k)}(n)\,y^{(k-1)}(n)^T$$
$$\Delta w_{ij}^{(k)}(n) \stackrel{\Delta}{=} \eta\,\delta_i^{(k)}(n)\,y_j^{(k-1)}(n)$$

Clear how to update output layer weights.

Remark 15.2 Notice that the above update formulas can be executed in a decentralized computing environment. That is, all of the information ($e_i$, $\phi'$, $v_i^{(k)}$, $y_j^{(k-1)}$) each neuron needs to update its weights is available locally.

Question 15.1 How do we update weights for the hidden layer(s)?

Ans set up a credit assignment problem.


16 2003 09 26: Review of Homework 4.


Today we went over problem 4 of Homework 4. Notice also the update (dimensions) added to the discussion of the Gauss-Newton iteration method, §13.2.

16.1 After class comments


A comment from students after class.

1. Consider problem 4 where we calculate the Jacobian. For this problem,
$$J = -\begin{bmatrix}\bar x(1)^T\\ \bar x(2)^T\\ \bar x(3)^T\\ \bar x(4)^T\end{bmatrix} = -\begin{bmatrix}1 & 0 & 0\\ 1 & 0 & 1\\ 1 & 1 & 0\\ 1 & 1 & 1\end{bmatrix}$$

which is a constant matrix. That means that I can calculate J (for this case) before I
write

for probNum = 3:4


for jd = 1:3

etc.

Question 16.1 Would this be true if I wrote y = φ(W x̄)? (we’ll talk about that on
Monday)

2. Further, this particular J has three linearly independent rows/columns, so $J^TJ$ is always invertible. That means we don't need to use δI (set δ = 0) and the method converges for this problem in one step! (A small numerical check of this appears after this list.)

Question 16.2 Suppose again that I wrote $y = \phi(W\bar x)$. How would that change the invertibility of J?
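A quick numerical check of both comments above. The Jacobian is the constant Homework 4 Jacobian, and a single unregularized Gauss-Newton step already satisfies the normal equations (the XOR targets are used here just for illustration):

% sketch: the Homework 4 Jacobian is constant with full column rank,
% so an unregularized Gauss-Newton step solves the linear problem in one step
J  = -[1 0 0; 1 0 1; 1 1 0; 1 1 1];     % J = -[xbar(1)'; ...; xbar(4)']
rank(J'*J)                              % = 3, so J'*J is invertible
dd = [0; 1; 1; 0];                      % e.g. the XOR targets
w  = zeros(3,1);                        % initial weights
e  = dd - (-J)*w;                       % e = d - W*xbar, since the rows of -J are xbar'
w  = w - (J'*J) \ (J'*e);               % delta = 0: plain Gauss-Newton step
enew = dd - (-J)*w;                     % residual after one step
norm(J'*enew)                           % ~0: w already minimizes ||e||^2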

16.2 Example in office


I added some detail to my solution for Homework 4 problem 2. Look for the footnotes to
see where the changes were made


17 2003 09 29: More review for Exam 1


Cover more of Hw 4, discuss “example” questions that I will make up before your very eyes.


18 2003 10 01 Exam 1
Scores

1. (20)

2. (20)

3. (20)

4. (20)

5. (20)

Lemma 18.1 (Copied from 11.1) Let f(x) be a scalar function of a vector $x \in \mathbb{R}^n$ and let g(W) be a scalar function of a matrix $W \in \mathbb{R}^{m\times n}$. Define their respective partial derivatives as
$$\frac{\partial f}{\partial x} \stackrel{\Delta}{=} \begin{bmatrix}\frac{\partial f}{\partial x_1}\\ \vdots\\ \frac{\partial f}{\partial x_n}\end{bmatrix} \quad\text{and}\quad \frac{\partial g}{\partial W} \stackrel{\Delta}{=} \begin{bmatrix}\frac{\partial g}{\partial w_{11}} & \cdots & \frac{\partial g}{\partial w_{1n}}\\ \vdots & \ddots & \vdots\\ \frac{\partial g}{\partial w_{m1}} & \cdots & \frac{\partial g}{\partial w_{mn}}\end{bmatrix}.$$
Then

1. $f(x) = c^T x \implies \dfrac{\partial f}{\partial x} = c$

2. $f(x) = \frac{1}{2}x^T Q x \implies \dfrac{\partial f}{\partial x} = Qx$

3. $g(W) = x^T W y \implies \dfrac{\partial g}{\partial W} = xy^T$

4. $g(W) = \frac{1}{2}x^T W^T W x \implies \dfrac{\partial g}{\partial W} = Wxx^T$


18.1 The questions


1. Consider the data set shown below.

[Plot: "Exam 1 problem 1 example" -- 40 data points plotted by x_1(n) and x_2(n) value, marked class 0 or class 1, with two candidate separating lines labeled division 1 and division 2.]

The plot above divides the 40 data points $\{x(n)\}_{n=1}^{40}$ shown in the plot into two classes. The division is based on the separating lines defined by the linear neural network weights
$$W = \begin{bmatrix}w_1^T\\ w_2^T\end{bmatrix} = \begin{bmatrix}-1 & 1 & 1\\ -0.5 & -1 & 1\end{bmatrix};$$
that is, a data point x(n) is in class 1 if $w_1^T\bar x(n) > 0$ and $w_2^T\bar x(n) > 0$, otherwise it's in class 0. Indicate which division line (division 1 or division 2) corresponds to the weight vector $w_1^T\bar x = 0$. Show your work, either in mathematics or by labelling the diagram above.
Division number =


[Plot repeated from problem 1: the 40 data points with class labels and the two division lines.]

2. Can the 40 data points above be correctly classified using a single neuron (perhaps one
that uses a nonlinear activation function)? Explain why or why not.
If they can be correctly classified by a single neuron, give the corresponding activation
function and weights below.
Can it be done? =

Neuron function =

Activation function φ(v) =

Explain.


3. Consider the sigmoid activation function $\phi(v) = 1/(1 + e^{-av})$ for some a > 0. Show that $d\phi/dv = a\,\phi(v)(1 - \phi(v))$.

 
4. Consider a neuron $y = \phi(W\bar x)$ where $W = \begin{bmatrix}1 & 2 & 3\end{bmatrix}$, where φ(v) is the sigmoid function shown above and $\bar x = \begin{bmatrix}1 & x_1 & x_2\end{bmatrix}^T$. Define $v = W\bar x$.
Fill in the boxes below with the correct expressions or numerical values. (I will accept either.)
$$\frac{\partial v}{\partial W} = $$
$$\frac{\partial y}{\partial v} = $$
$$\frac{\partial y}{\partial W} = $$


 
5. Let $f(x) = \frac{1}{2}x^TQx + c^Tx$ for $Q = \begin{bmatrix}2 & 1\\ 1 & 2\end{bmatrix}$ and $c = \begin{bmatrix}4 & 5\end{bmatrix}^T$.

(a) Find the errors in the MATLAB code below that attempts to implement a steepest
descent iteration to find x minimizing f (x).
M-file 18.1 exam2003BrainDead.m
% m-file example (with errors) of
% steepest descent iteration
% to minimize
% f(x) = x’*[2, 1, 1, 2]*x/2 + [4;5]*x

x = [0;0] ; /* initial value of x */

eta = 10;

grad_x = x*Q - c;

for ii=1:100

x = x + eta*grad_x;

end

fprintf("x= %e %e\n", x(1), x(2));

(b) Find a vector $x^*$ such that $f(x^*) = \min_x f(x)$. Show that $\nabla_x f|_{x^*} = 0$.
$x^* = $


18.2 The answers


1. Division 2. $w_1^T\bar x = -1 + x_1 + x_2 = 0 \implies x_2 = -x_1 + 1$, which is the division 2 line.

2. Select $W = \begin{bmatrix}-3 & 0 & 1\end{bmatrix}$ so that $W\bar x > 0$ for $x_2 > 3$. Use a threshold activation (McCullough-Pitts).
3. See the solution to homework 1.
4.
$$\frac{\partial v}{\partial W} = \bar x^T = \begin{bmatrix}1 & x_1 & x_2\end{bmatrix}$$
$$\frac{\partial y}{\partial v} = a\,\phi(v)(1 - \phi(v))$$
$$\frac{\partial y}{\partial W} = a\,\phi(v)(1 - \phi(v))\,\bar x^T = a\,\phi(W\bar x)(1 - \phi(W\bar x))\begin{bmatrix}1 & x_1 & x_2\end{bmatrix}$$

5. Let $f(x) = \frac{1}{2}x^TQx + c^Tx$ for $Q = \begin{bmatrix}2 & 1\\ 1 & 2\end{bmatrix}$ and $c = \begin{bmatrix}4 & 5\end{bmatrix}^T$.
(a) M-file:
M-file 18.2 exam2003BrainDeadFix.m
% f(x) = x’*[2, 1, 1, 2]*x/2 + [4;5]*x
Q = [2,1 ; 1 2]; % error 2: Q and c not defined
c = [4;5];
x = [0;0] ; % error 3: C-type comment /* initial value of x */
eta = 0.01; % error 4: huge step size.
% error 5: wrong gradient formula, wrong location in the program
%grad_x = x*Q - c;
for ii=1:100
grad_x = Q*x + c; % error 6 continued: correct place and formula
x = x - eta*grad_x; % error 7: subtract gradient, not add
end
% error 8: replace double quotes (C style) with single
% quotes (MATLAB style).
fprintf(’x= %e %e\n’, x(1), x(2));
 
$$x^* = -\begin{bmatrix}1\\ 2\end{bmatrix}$$
$$\text{want}\quad 0 = \nabla_x f = \begin{bmatrix}2 & 1\\ 1 & 2\end{bmatrix}x + \begin{bmatrix}4\\ 5\end{bmatrix} \implies x^* = -\begin{bmatrix}2 & 1\\ 1 & 2\end{bmatrix}^{-1}\begin{bmatrix}4\\ 5\end{bmatrix} = -\begin{bmatrix}1\\ 2\end{bmatrix}$$
$$Qx^* + c = \begin{bmatrix}2 & 1\\ 1 & 2\end{bmatrix}\left(-\begin{bmatrix}1\\ 2\end{bmatrix}\right) + \begin{bmatrix}4\\ 5\end{bmatrix} = -\begin{bmatrix}4\\ 5\end{bmatrix} + \begin{bmatrix}4\\ 5\end{bmatrix} = 0$$


19 2003 10 03 Multi-layered perceptrons: continued


19.1 Backpropagation algorithm: review of output layer
[Diagram: two-layer network. The input $y^{(0)}$ (with a +1 bias input) feeds weight matrix $W^{(1)}$, giving local fields $v^{(1)}$ and hidden outputs $y^{(1)}$; $y^{(1)}$ (with a +1 bias) feeds $W^{(2)}$, giving $v^{(2)}$ and outputs $y^{(2)}$.]

Previously seen in §15.3: output layer weights update with

[Diagram (Hodel notation): previous-layer outputs $\bar y^{(j)}$ feed weights $W^{(i)}$ to form local fields $\bar v^{(i)}$; the activation φ produces $\bar y^{(i)}$, which contributes to the error E.]
$$\frac{\partial E(n)}{\partial w_{ij}^{(k)}(n)} = -e_i(n)\,\phi'(v_i^{(k)}(n))\,y_j^{(k-1)}(n)$$
Define the update rule as follows:
$$\delta_i(n) = e_i(n)\,\phi'(v_i(n))$$
$$\delta(n) \stackrel{\Delta}{=} \begin{bmatrix}\delta_1(n) & \cdots & \delta_m(n)\end{bmatrix}^T = e\,.\!*\,\phi'(v)$$
$$\Delta W(n) \stackrel{\Delta}{=} \eta\,\delta^{(k)}(n)\,y^{(k-1)}(n)^T$$
$$\Delta w_{ij}^{(k)}(n) \stackrel{\Delta}{=} \eta\,\delta_i^{(k)}(n)\,y_j^{(k-1)}(n)$$
Clear how to update output layer weights.

Remark 19.1 Notice that the above update formulas can be executed in a decentralized computing environment. That is, all of the information ($e_i$, $\phi'$, $v_i^{(k)}$, $y_j^{(k-1)}$) each neuron needs to update its weights is available locally.

Question 19.1 How do we update weights for the hidden layer(s)?

Ans set up a credit assignment problem.


19.2 Backpropagation: Hidden layer update


W (2)
(1)
wij
..
. ..
.

ȳ (0) ȳ (1) v̄ (2) ȳ (2)

$$\frac{\partial E(n)}{\partial w_{ij}^{(1)}(n)} = \left(\frac{\partial E(n)}{\partial y_i^{(1)}}\right)\left(\frac{\partial y_i^{(1)}}{\partial v_i^{(1)}}\right)\left(\frac{\partial v_i^{(1)}}{\partial w_{ij}^{(1)}}\right)$$
$$= \left[\sum_k\left(\frac{\partial E(n)}{\partial e_k^{(2)}}\right)\left(\frac{\partial e_k^{(2)}}{\partial y_k^{(2)}}\right)\left(\frac{\partial y_k^{(2)}}{\partial v_k^{(2)}}\right)\left(\frac{\partial v_k^{(2)}}{\partial y_i^{(1)}}\right)\right]\left(\frac{\partial y_i^{(1)}}{\partial v_i^{(1)}}\right)\left(\frac{\partial v_i^{(1)}}{\partial w_{ij}^{(1)}}\right)$$
$$= \left(\sum_k e_k^{(2)}(-1)\,\phi'(v_k^{(2)})\,w_{ki}^{(2)}\right)\phi'(v_i^{(1)})\,y_j^{(0)}$$
$$= -\left(\sum_k \delta_k^{(2)}w_{ki}^{(2)}\right)\phi'(v_i^{(1)})\,y_j^{(0)} \stackrel{\Delta}{=} -\delta_i^{(1)}y_j^{(0)}$$
-- a weighted sum of "next layer" δ (gradient) values.
$$\Delta W^{(k)} = \eta\,\delta^{(k)}(n)\,y^{(k-1)}(n)^T$$
where
$$\delta^{(k)}(n) = \begin{cases} e^{(k)}(n)\,.\!*\,\phi'(v^{(k)}(n)) & k = \text{output layer}\\ \left(W^{(k+1)T}\delta^{(k+1)}\right).\!*\,\phi'(v^{(k)}(n)) & k = \text{hidden layer}\end{cases}$$
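Putting the two cases together, one training step for a two-layer sigmoid network can be sketched as follows. This is only a minimal illustration (a = 1, a single made-up training pair, arbitrary sizes and step size; compare the m-files in §19.3 and §21.2); note that the bias column of $W^{(2)}$ is skipped when propagating the error back, since the bias input is constant:

% sketch: one back-propagation step for a two-layer sigmoid network
a = 1; eta = 0.1;
phi  = @(v) 1 ./ (1 + exp(-a*v));            % sigmoid
dphi = @(y) a * y .* (1 - y);                % phi'(v) written in terms of y = phi(v)

W1 = 0.1*randn(3, 3);                        % W^(1): 3 hidden neurons, 2 inputs + bias
W2 = 0.1*randn(1, 4);                        % W^(2): 1 output neuron, 3 hidden + bias

x = [0; 1]; d = 1;                           % one training pair (illustrative)

% forward pass
y0 = x;
v1 = W1*[1; y0];  y1 = phi(v1);
v2 = W2*[1; y1];  y2 = phi(v2);
e  = d - y2;

% backward pass: output layer delta, then hidden layer delta
delta2 = e .* dphi(y2);                      % e .* phi'(v^(2))
delta1 = (W2(:, 2:end)' * delta2) .* dphi(y1);   % bias column carries no error back

% weight updates: Delta W^(k) = eta * delta^(k) * [1; y^(k-1)]'
W2 = W2 + eta * delta2 * [1; y1]';
W1 = W1 + eta * delta1 * [1; y0]';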

Question 19.2 The perceptron convergence theorem states that the perceptron training algorithm 13.1 will converge with initial weights W = 0. What happens to the back-propagation algorithm if we initialize $W^{(1)} = 0$ and $W^{(2)} = 0$?


19.3 Exponential activation function

$$\phi(v) = \frac{1}{1 + e^{-av}}, \qquad a > 0,\quad v \in \mathbb{R}$$
$$\phi'(v) = \frac{a\,e^{-av}}{(1 + e^{-av})^2} = \left(\frac{a}{1 + e^{-av}}\right)\left(\frac{e^{-av}}{1 + e^{-av}}\right) = a\,\phi(v)\left(1 - \phi(v)\right)$$

Output layer update:
$$\delta(n) = e(n)\,.\!*\,\phi'(v^{(k_{max})}(n)) = a\left(d(n) - y^{(k_{max})}(n)\right).\!*\,y^{(k_{max})}(n)\,.\!*\left(1 - y^{(k_{max})}(n)\right)$$

Hidden neuron update in layer k:
$$\delta_j^{(k)} = \phi'(v_j^{(k)})\sum_i w_{ij}^{(k+1)}(n)\,\delta_i^{(k+1)}$$
or, in vector form,
$$\delta^{(k)} = a\,y^{(k)}.\!*\left(1 - y^{(k)}\right).\!*\left(W^{(k+1)T}\delta^{(k+1)}\right)$$

M-file 19.1 phi.m

function y = phi(x,a)
% sigmoid function with parameter a
y = 1 ./ (1 + exp(-a*x) );

M-file 19.2 dphi.m

function dy = dphi(x,a)
% derivative of sigmoid function with parameter a
y = phi(x,a);
dy = a * y .* (1 - y);

Example 19.1 Check derivative function:

M-file 19.3 derivS.m

% check exponential derivative function gradient


a = 1; N = 100; N1 = N-1;
xx = linspace(-5,5,N); xx1 = xx(1:N1);
phix = phi(xx,a); % original function
dphif = dphi(xx,a); % symbolic derivative
dphiN = diff(phix) ./ diff(xx); % numerical derivative


figure(1);
grid(); title(’Exponential sigmoid with a=1’);
plot(xx,phix, "-;phi;", xx, dphif, "-;symbolic dphi;", ...
xx1, dphiN,"-;numerical dphi;");

printeps("derivS.eps");
[Plot: "Exponential sigmoid with a=1" -- phi, the symbolic derivative dphi, and the numerical derivative plotted over -5 <= x <= 5.]

19.4 tanh activation function

$$\phi(v) = a\tanh(bv) \implies \phi'(v) = ab\left(1 - \tanh^2(bv)\right) = \frac{b}{a}(a - y)(a + y), \qquad y = \phi(v)$$

M-file 19.4 phiT.m

function y = phiT(x,a,b)
% function y = phi(x,a,b)
% hyperbolic tangent activation function a*tanh(b*x)
y = a*tanh(b*x);


M-file 19.5 dphiT.m

function dy = dphi(x,a,b)
% derivative of hyperbolic tangent function
y = phiT(x,a,b);
dy = (b/a)*(a - y ) .* ( a + y );


Homework 5 Due Fri Oct. 10. Hwk


1. Consider the activation function φ(v) = a tanh(bv) (see M-file 19.4 (phiT.m)).
(a) Show that dφ(v)/dv = (b/a)(a − φ(v))(a + φ(v)).
(b) Write mex-file implementations of both phiT and dphiT, each listed below.
function y = phiT(x,a,b)
% function y = phi(x,a,b)
% hyperbolic tangent activation function a*tanh(b*x)
y = a*tanh(b*x);
function dy = dphi(x,a,b)
% derivative of hyperbolic tangent function
y = phiT(x,a,b);
dy = (b/a)*(a - y ) .* ( a + y );
What to turn in Derivation (handwritten) for part (a). C-language source files
phiT.c and dphiT.c.
2. Consider the hyperbolic tangent activation function φ(v) = a tanh(bv) with parameters
a = 1.5 and b = 1.
(a) Write an m-file that uses your mex-files from problem 1 to perform steepest descent iteration to compute a single layer MLP AND gate. That is, compute a set of weights $W = \begin{bmatrix}w_0 & w_1 & w_2\end{bmatrix}$ such that $y = \phi(W\bar x)$ is a "good" approximation of $y = \mathrm{AND}(x_1, x_2)$. Your maximum error should be less than 10% for the four data points $\{x(n), d(n)\}_{n=1}^{4}$ specified in Homework 4. Plot your neural network output over the range $-0.5 \le x_1, x_2 \le 1.5$. You will need to modify the m-file M-file 13.2 (meshPex.m) to use φ when it computes Yvals.
(b) Why does your m-file from problem 2a provide a better AND gate than your m-file from Homework 4? In particular, explain why your AND gate can achieve 10% worst-case error but the AND gate from Homework 4 could not.
What to turn in Source code m-files hwk502.m and meshPexT.m (your modified version of meshPex.m), and a written answer to part (b). Your m-file hwk502.m should end with the following lines:
fprintf('tanh parameters: a=%12.4e b=%12.4e\n',a,b);
fprintf('network weights: [ w0 w1 w2 ] = %12.4e %12.4e %12.4e\n', ...
        W(1), W(2), W(3));
meshPexT(W,xx,yy,'input 1','input 2','AND gate output');
3. Write a C-language mex file to multiply two matrices. That is, write a C-language
translation of the following m-file:
function C = matmul(A,B)
C = A*B
What to turn in C-language source file matmul.c.


20 2003 10 06: Single layer linear system i.d. example


20.1 Goal
Model motion of a blimp.

20.1.1 Vehicle data


Measure blimp position (m) in response to motor voltage commands (V):
Sample Input Blimp Sample Input Blimp
Number voltage position Number voltage position
u(n) p(n) u(n) p(n)
1 0 0.0 7 2 3.00
2 1 0.1 8 1 3.80
3 1 0.5 9 0 4.20
4 2 0.8 10 0 4.6
5 2 1.2 11 0 4.9
6 3 2.0 12 0 5.1

20.1.2 Assumption
There is a mathematical model

p(n + 1) = f (history of the universe)

that will accurately predict the motion of the blimp given sufficient information. We will
attempt to approximate f with a function of the last two values of p and u:

p̂(n + 1) = a0 p(n) + a1 p(n − 1) + b0 u(n) + b1 u(n − 1)


 
$$= \begin{bmatrix}a_0 & a_1 & b_0 & b_1\end{bmatrix}\begin{bmatrix}p(n)\\ p(n-1)\\ u(n)\\ u(n-1)\end{bmatrix}$$
where $\hat p(n)$ is an estimate of p(n) based on the other measurements. In matrix form we can write
$$\begin{bmatrix}\hat p(3) & \cdots & \hat p(12)\end{bmatrix} = \begin{bmatrix}a_0 & a_1 & b_0 & b_1\end{bmatrix}\begin{bmatrix}p(2) & \cdots & p(11)\\ p(1) & \cdots & p(10)\\ u(2) & \cdots & u(11)\\ u(1) & \cdots & u(10)\end{bmatrix}$$
Define $d(n) = p(n+2)$ and $d = \begin{bmatrix}d(1) & \cdots & d(10)\end{bmatrix}$. Similarly define
$$X = \begin{bmatrix}x(1) & \cdots & x(n)\end{bmatrix} = \begin{bmatrix}p(2) & \cdots & p(11)\\ p(1) & \cdots & p(10)\\ u(2) & \cdots & u(11)\\ u(1) & \cdots & u(10)\end{bmatrix}$$


and define
$$W = \begin{bmatrix}a_0 & a_1 & b_0 & b_1\end{bmatrix}.$$
Then we want to minimize
$$E = \frac{1}{2}\|d - WX\|^2 = \frac{1}{2}(d - WX)(d - WX)^T$$
The matrices d and X are obtained with M-file 20.1 (sysIdEx1GetData.m) in Figure 11. Input data is plotted in Figure 12.

M-file 20.1 sysIdEx1GetData.m

function [d,X] = sysIdEx1GetData


% function [d,X] = sysIdEx1GetData
% get data for blimp sys id experiment

u = [0; 1; 1; 2; 2; 3; 2; 1; 0; 0; 0; 0];
p = [0; 0.1; 0.5; 0.8; 1.2; 2.0; 3.0; 3.8; 4.2; 4.6; 4.9; 5.1];
t = ((1:12)-1)*0.5;

for n=2:11
d(n) = p(n+1);
X(:,n) = [p(n);p(n-1);u(n);u(n-1)];
end

% from here down is plotting code


fn = figure; subplot(2,1,1); plot(t,u,’x’);
title(’original data from blimp experiment’); grid on;
xlabel(’time’); ylabel(’voltage input’);

subplot(2,1,2); plot(t,p,’X’); grid on;


xlabel(’time’); ylabel(’blimp position’);
eval(sprintf(’print -depsc sysIdEx1%.4d.eps’,fn));
return

Figure 11: M-file to construct input data matrices for blimp sys id experiment.


[Plot: "original data from blimp experiment" -- voltage input vs. time (top) and blimp position vs. time (bottom).]

Figure 12: Input data for blimp sys id experiment.


20.2 Solution method 1: direct solution


We can also write
$$E = \frac{1}{2}\|d - WX\|^2 = \frac{1}{2}\sum_{n=1}^{10}(d(n) - Wx(n))^T(d(n) - Wx(n)) = \frac{1}{2}\sum_{n=1}^{10}\left(d(n)^Td(n) - 2x(n)^TW^Td(n) + x(n)^TW^TWx(n)\right)$$
Take the gradient and set it to zero to get $W_{opt} = d\cdot X^\dagger = dX^T\left(XX^T\right)^{-1}$. Calculations are performed in M-file 20.2 (sysIdEx1Wopt.m), in Figure 13. Fit quality is shown in Figure 14.

Remark 20.1 This approach will not work if we model p̂(n + 1) = φ(W x(n)) due to the
nonlinearity of the activation function.

M-file 20.2 sysIdEx1Wopt.m

function Wopt = sysIdEx1Wopt(d,X)


% compute pseudo-inverse (optimal) solution and plot results
Wopt = d*pinv(X);
return

Figure 13: Calculations for optimal data fit (linear model)


[Plot: "optimal: err= 3.4392e-01" -- desired output, actual output, and error vs. sample number for the pseudo-inverse fit.]

Figure 14: Results of optimal data fit.


20.3 Solution method 2: Steepest descent iteration


In each iteration compute
$$\nabla_W E = \sum_n\left((Wx(n))\,x(n)^T - d(n)\,x(n)^T\right)$$
Result plots are shown in Figure 16. The m-file is in Figure 15.

M-file 20.3 sysIdEx1GradOpt.m

function Wgrad = sysIdEx1GradOpt(d,X)


% use gradient descent to get optimal solution
maxIter = 1e3;
eta = 3e-3;
Wmat = zeros(maxIter,4); % maxIter iterations with 4 parameters each
ErrV = zeros(maxIter,1);
for iter = 1:(maxIter-1);
Wn = Wmat(iter,:);
ErrV(iter) = norm(d - Wn*X);
gn = zeros(size(Wn)); % compute gradient with ALL data points
for nn = 1:length(d)
xn = X(:,nn);
gn = gn + (-d(nn)*xn’ + (Wn*xn)*xn’);
end
Wn = Wn - eta*gn;
Wmat(iter+1,:) = Wn;
end
ErrV(maxIter) = norm(d - Wmat(maxIter,:)*X);
Wgrad = Wmat(maxIter,:);

kk = 1:maxIter;
fn = figure; subplot(2,1,1); plot(kk,Wmat,’-’);
xlabel(’iteration’); legend(’a_0’,’a_1’,’u_0’,’u_1’);
title(’steepest descent parameters’);
grid on
subplot(2,1,2); plot(kk,ErrV,’-’); xlabel(’iteration’);
ylabel(’Error’); grid on;
eval(sprintf(’print -depsc sysIdEx1%.4d.eps’,fn));
return

Figure 15: M-file calculations for steepest descent iteration.


[Plots: "steepest descent parameters" -- parameter values a_0, a_1, u_0, u_1 and error norm vs. iteration; "steepest descent: err= 3.4954e-01" -- desired output, actual output, and error vs. sample number.]

Figure 16: Steepest descent parameter values during iteration and resulting output error.


20.4 Solution method 3: Backpropagation


Neural network (approximate steepest descent) method. Instead of computing
$$\nabla_W E = \sum_n\left((Wx(n))\,x(n)^T - d(n)\,x(n)^T\right)$$
we apply an update for each point individually and use a very small step size η so that after N = 10 steps we approximate the above gradient. The m-file is in Figure 17. Results are in Figure 18.

M-file 20.4 sysIdEx1nnOpt.m

function Wnn = sysIdEx1nnOpt(d,X)


% use real gradient descent to get optimal solution
maxIter = 1000;
eta = 3e-3;
Wmat = zeros(maxIter,4); % maxIter iterations with 4 parameters each
ErrV = zeros(maxIter,1);
nn = 0;
for iter = 1:(maxIter-1);
Wn = Wmat(iter,:);
ErrV(iter) = norm(d - Wn*X);
% compute (some of) gradient using SINGLE data points
nn = nn+1; if(nn > length(d)), nn=1; end;
xn = X(:,nn);
gn = -d(nn)*xn’ + (Wn*xn)*xn’;
Wn = Wn - eta*gn;
Wmat(iter+1,:) = Wn;
end
Wnn = Wmat(maxIter,:);
ErrV(maxIter) = norm(d - Wnn*X);
kk = 1:maxIter;
fn = figure; subplot(2,1,1); plot(kk,Wmat,’-’);
title(’backpropagation parameters’);
xlabel(’iteration’); legend(’a_0’,’a_1’,’u_0’,’u_1’); grid on;
subplot(2,1,2); plot(kk,ErrV,’-’); xlabel(’iteration’);
ylabel(’Error’); grid on;
eval(sprintf(’print -depsc sysIdEx1%.4d.eps’,fn));
return

Figure 17: Approximate steepest descent (backpropagation) m-file. Notice that only one
data point is used to compute gn, which means that gn is no longer the gradient.


[Plots: "backpropagation parameters" -- parameter values a_0, a_1, u_0, u_1 and error norm vs. iteration; "backpropagation: err= 3.6749e-01" -- desired output, actual output, and error vs. sample number.]

Figure 18: Approximate steepest descent (backpropagation) output results


20.5 Summary of results


Comparison of parameters of all three methods.

M-file 20.5 sysIdEx1.m

% system i.d. example


function [d,X,Wopt,Wgrad, Wnn] = sysIdEx1
[d,X] = sysIdEx1GetData;

% compute pseudo-inverse (optimal) solution and plot results


Wopt = sysIdEx1Wopt(d,X);
sysIdEx1PlotErr(d,X,Wopt, ...
sprintf(’optimal: err=%12.4e’,norm(d-Wopt*X)));

% compute full gradient (iteration) solution and


% compare to optimal
Wgrad = sysIdEx1GradOpt(d,X);
sysIdEx1PlotErr(d,X,Wgrad, ...
sprintf(’steepest descent: err=%12.4e’,norm(d-Wgrad*X)));

% compute neural-net type iterative solution


% and compare to optimal
Wnn = sysIdEx1nnOpt(d,X);
sysIdEx1PlotErr(d,X,Wnn, ...
sprintf(’backpropagation: err=%12.4e’,norm(d-Wnn*X)));

fprintf(’Wopt: %12.4e %12.4e %12.4e %12.4e\n’, Wopt(1), Wopt(2), ...


Wopt(3), Wopt(4));
fprintf(’Wgrad: %12.4e %12.4e %12.4e %12.4e\n’, Wgrad(1), Wgrad(2), ...
Wgrad(3), Wgrad(4));
fprintf(’Wnn: %12.4e %12.4e %12.4e %12.4e\n’, Wnn(1), ...
Wnn(2), Wnn(3), Wnn(4));
return

Wopt:  8.0891e-01   2.6466e-01   2.7104e-01   9.3968e-02
Wgrad: 6.1854e-01   4.6948e-01   2.8291e-01   1.3563e-01
Wnn:   5.9144e-01   4.9120e-01   2.3935e-01   1.9136e-01


21 2003 10 08: MLP


21.1 Derivation
21.1.1 Output layer
Look at an individual neuron:

[Diagram: a single neuron in the output layer; inputs ȳ^(j), weights W^(i), induced local
field v̄^(i), activation φ, output ȳ^(i), and error E (Hodel notation).]

y_i^{(k)}(n) = \phi\!\left(v_i^{(k)}(n)\right) = \phi\!\left(\sum_{j=0}^{m} w_{ij}^{(k)}(n)\, y_j^{(k-1)}(n)\right)

Apply the chain rule - repeatedly!

\frac{\partial E(n)}{\partial w_{ij}^{(k)}(n)} =
\left(\frac{\partial E(n)}{\partial e_i^{(k)}(n)}\right)
\left(\frac{\partial e_i^{(k)}(n)}{\partial y_i^{(k)}(n)}\right)
\left(\frac{\partial y_i^{(k)}(n)}{\partial v_i^{(k)}(n)}\right)
\left(\frac{\partial v_i^{(k)}(n)}{\partial w_{ij}^{(k)}(n)}\right)

Let k = k_max (looking at output neurons). From equation (15.1) and e_j = d_j - y_j we have

\frac{\partial E(n)}{\partial e_i(n)} = e_i(n) \quad\rightarrow\quad \nabla_{e(n)} E(n) = e(n)

\frac{\partial e_i(n)}{\partial y_i^{(k)}(n)} = -1

\frac{\partial y_i^{(k)}(n)}{\partial v_i^{(k)}(n)} = \phi'(v_i^{(k)}(n))

\frac{\partial v_i^{(k)}(n)}{\partial w_{ij}^{(k)}(n)} = y_j^{(k-1)}(n)

and so

\frac{\partial E(n)}{\partial w_{ij}^{(k)}(n)} = -e_i(n)\,\phi'(v_i^{(k)}(n))\, y_j^{(k-1)}(n)

From the last page: the output layer weights update with

\frac{\partial E(n)}{\partial w_{ij}^{(k)}(n)} = -e_i^{(k)}(n)\,\phi'(v_i^{(k)}(n))\, y_j^{(k-1)}(n)

Define the update rule as follows:

\delta_i(n) = e_i(n)\,\phi'(v_i(n))

\delta(n) \stackrel{\Delta}{=} \begin{bmatrix}\delta_1(n) & \cdots & \delta_m(n)\end{bmatrix}^T = e\,.\!*\,\phi'(v)

\Delta W(n) \stackrel{\Delta}{=} \eta\, \delta^{(k)}(n)\, \left(y^{(k-1)}(n)\right)^T

\Delta w_{ij}^{(k)}(n) \stackrel{\Delta}{=} \eta\, \delta_i^{(k)}(n)\, y_j^{(k-1)}(n)

Clear how to update output layer weights.
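For concreteness, a minimal one-sample sketch of this update follows. The variable names,
dimensions, and values below are illustrative assumptions (this is not one of the course
m-files); φ is taken to be the logistic sigmoid used in the example of Section 21.2.

a = 5; eta = 0.01;
yprev = [1; 0.3; 0.7];          % y^(k-1)(n), bias entry prepended
W = 0.1*ones(2,3);              % output-layer weight matrix W^(k)
d = [1; 0];                     % desired outputs d(n)
v = W*yprev;                    % induced local fields v^(k)(n)
y = 1./(1 + exp(-a*v));         % y^(k)(n) = phi(v)
e = d - y;                      % e^(k)(n) = d(n) - y^(k)(n)
delta = e .* (a*y.*(1-y));      % delta^(k)(n) = e .* phi'(v)
W = W + eta*delta*yprev';       % DeltaW^(k)(n) = eta * delta^(k) * (y^(k-1))'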

Remark 21.1 Notice that the above update formulas can be executed in a decentralized
computing environment. That is, all of the information (e_i^{(k)}, \phi', v_i^{(k)}, y_j^{(k-1)}) each neuron
needs to update its weights is available locally.

Question 21.1 How do we update weights for the hidden layer(s)?

Ans: Set up a credit assignment problem.

21.1.2 Hidden layer(s)


[Diagram: two-layer network; inputs ȳ^(0), hidden-layer weights w_{ij}^{(1)} producing ȳ^(1),
output-layer weights W^(2) producing v̄^(2) and outputs ȳ^(2).]
\frac{\partial E(n)}{\partial w_{ij}^{(1)}(n)} =
\left(\frac{\partial E(n)}{\partial y_i^{(1)}}\right)
\left(\frac{\partial y_i^{(1)}}{\partial v_i^{(1)}}\right)
\left(\frac{\partial v_i^{(1)}}{\partial w_{ij}^{(1)}}\right)

= \left[\sum_k
\left(\frac{\partial E(n)}{\partial e_k^{(2)}}\right)
\left(\frac{\partial e_k^{(2)}}{\partial y_k^{(2)}}\right)
\left(\frac{\partial y_k^{(2)}}{\partial v_k^{(2)}}\right)
\left(\frac{\partial v_k^{(2)}}{\partial y_i^{(1)}}\right)\right]
\left(\frac{\partial y_i^{(1)}}{\partial v_i^{(1)}}\right)
\left(\frac{\partial v_i^{(1)}}{\partial w_{ij}^{(1)}}\right)

= \left(\sum_k e_k^{(2)}\,(-1)\,\phi'(v_k^{(2)})\, w_{ki}^{(2)}\right)\phi'(v_i^{(1)})\, y_j^{(0)}

= -\left(\sum_k \delta_k^{(2)}\, w_{ki}^{(2)}\right)\phi'(v_i^{(1)})\, y_j^{(0)}

\stackrel{\Delta}{=} -\delta_i^{(1)}\, y_j^{(0)}

That is, \delta_i^{(1)} is a weighted sum of the "next layer" \delta (gradient) values. In general,

\Delta W^{(k)} = \eta\, \delta^{(k)}(n)\, \left(y^{(k-1)}(n)\right)^T

where

\delta^{(k)}(n) = \begin{cases}
e^{(k)}(n)\,.\!*\,\phi'(v^{(k)}(n)) & k = \text{output layer}\\
\left(W^{(k+1)T}\,\delta^{(k+1)}\right).\!*\,\phi'(v^{(k)}(n)) & k = \text{hidden layer}
\end{cases}
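A compact sketch of one complete backpropagation step built directly from this recursion
follows. The dimensions, data, and step size are illustrative assumptions only; compare
M-file 25.3 (backPropStep.m) later in these notes.

a = 5; eta = 0.01;
x  = [0.2; 0.8];  d = 1;                % one training pair (x(n), d(n))
W1 = 0.5*rand(4,3) - 0.25;              % hidden-layer weights (bias column included)
W2 = 0.5*rand(1,5) - 0.25;              % output-layer weights (bias column included)
y0 = [1; x];                            % y^(0) with bias appended
v1 = W1*y0;        y1 = 1./(1 + exp(-a*v1));
v2 = W2*[1; y1];   y2 = 1./(1 + exp(-a*v2));
e  = d - y2;
delta2 = e .* (a*y2.*(1 - y2));         % output layer: e .* phi'(v^(2))
wd     = W2' * delta2;                  % W^(2)' * delta^(2)
delta1 = wd(2:end) .* (a*y1.*(1 - y1)); % hidden layer (bias row dropped)
W2 = W2 + eta*delta2*[1; y1]';          % DeltaW^(k) = eta * delta^(k) * y^(k-1)'
W1 = W1 + eta*delta1*y0';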

Question 21.2 The perceptron convergence theorem states that the perceptron training al-
gorithm 13.1 will converge with initial weights W = 0. What happens to the back-propagation
algorithm if we initialize W(0) = 0 and W(1) = 0?

21.2 Example (not covered in class)


Train a sigmoid-based 2-layer MLP to classify points inside the unit square as class 1, other
points as class 0.

[Figure: "Desired output" surface for the unit-square classification problem, plotted over
x_1, x_2 ∈ [−1, 2].]

21.2.1 Utility m-files


M-file 21.1 nnsig.m
function [y,h] =nnsig(x,w1,w2,a)
% function [y,h] =nnsig(x,w1,w2,a)
% return hidden layer values h and output layer values y for
% weights w1 and w2 with input x
% activation function is sigmoid with parameter a
% bias value of 1 is appended to both x and hidden layer value vector

v1 = w1*[1;x]; % append bias


h = phi(v1,a);
v2 = w2*[1;h]; % append bias

y = phi(v2,a);

M-file 21.2 neteval.m


function [om, em] = neteval(xx, yy, dm, W1, W2,a)
% function [om, em] = neteval(xm, ym, dm, W1, W2,a)
% evaluate network at all sample points
% dm = nx x ny matrix of desired output values
% use exponential function with parameter a

om = zeros(size(dm));
em = dm;
nx = length(xx);
ny = length(yy);
for ii=1:nx
for jj=1:ny
y0 = [xx(ii);yy(jj)];
om(jj,ii) = nnsig(y0,W1, W2,a);
end
end
em = dm - om;

21.2.2 C-language implementation


C-file 21.5 layerprop.c
/* layerprop.c: #include this file as needed in mex file source */

/* propagate through one layer of a sigmoid multi-layer perceptron */


static void
layerProp(double * yy, const double * ww, const double *xx,
const int mw, const int nw, const double a)
{
int ii, jj;

for ( ii = 0 ; ii < mw ; ii++)


{
/* copy bias term */
yy[ii] = ww[ii];

/* matrix multiply with row ii of ww */


for ( jj = 0 ; jj < nw; jj++)
{
yy[ii] += ww[ii + (jj+1)*mw] * xx[jj];
}

/* sigmoid function */
yy[ii] = 1.0/( 1.0 + exp(-a*yy[ii]));
}
}

C-file 21.6 nnsig.c

/* nnsig.c: sigmoid based neural network evaluator (two layer) */


#include <math.h>
#include "mex.h"

/* propagate through one layer of a sigmoid multi-layer perceptron */


static void
layerProp(double * yy, const double * ww, const double *xx,
const int mw, const int nw, const double a)
{
int ii, jj;

for ( ii = 0 ; ii < mw ; ii++)


{
/* copy bias term */
yy[ii] = ww[ii];

/* matrix multiply with row ii of ww */


for ( jj = 0 ; jj < nw; jj++)
{
yy[ii] += ww[ii + (jj+1)*mw] * xx[jj];
}

/* sigmoid function */
yy[ii] = 1.0/( 1.0 + exp(-a*yy[ii]));
}
}

void mexFunction( int nlhs, mxArray *plhs[],


int nrhs, const mxArray*prhs[] )
{
double *xx, *yy, *hh, aa, *w1, *w2;
int mx, nx, mw1, nw1, mw2, nw2;

/* Argument checking - always a good thing! */


if (nrhs != 4)
{

mexPrintf("Received %d args, need 4", nrhs);


mexErrMsgTxt("\nError");
return;
}
else if (nlhs != 2)
{
mexPrintf("%d outputs requested, need 2", nlhs);
mexErrMsgTxt("\nError");
return;
}

/* load arguments, check dimensions */


mx = mxGetM(prhs[0]);
nx = mxGetN(prhs[0]);
xx = mxGetPr(prhs[0]);
mw1 = mxGetM(prhs[1]);
nw1 = mxGetN(prhs[1]);
w1 = mxGetPr(prhs[1]);
mw2 = mxGetM(prhs[2]);
nw2 = mxGetN(prhs[2]);
w2 = mxGetPr(prhs[2]);
aa = *mxGetPr(prhs[3]);

if(nx != 1)
{
mexPrintf("x (%d x %d) must be a column vector",mx,nx);
mexErrMsgTxt("\nError");
return;
}
if( mx != nw1-1 )
{
mexPrintf("x (%d) w1 (%d x %d) incompatible",
mx, mw1, nw1);
mexErrMsgTxt("\nError");
return;
}
if( mw1 != nw2-1 )
{
mexPrintf("w1 (%d x %d), w2 (%d x %d) incompatible",
mw1, nw1, mw2, nw2);
mexErrMsgTxt("\nError");
return;
}
if(mw2 < 1 )

{
mexPrintf("w2 (%d x %d) has no outputs", mw2, nw2);
mexErrMsgTxt("\nError");
return;
}

plhs[0] = mxCreateDoubleMatrix(mw2, 1, mxREAL); /* y */


plhs[1] = mxCreateDoubleMatrix(mw1, 1, mxREAL); /* h */
yy = mxGetPr(plhs[0]);
hh = mxGetPr(plhs[1]);

/* v1 = w1*[1;x]; % append bias


* h = phi(v1,a);
* v2 = w2*[1;h]; % append bias
* y = phi(v2,a);
*/

/* copy bias terms to h and y */


layerProp(hh,w1,xx,mw1,nw1,aa);
layerProp(yy,w2,hh,mw2,nw2,aa);
}

C-file 21.7 neteval.c

/* neteval.c: evaluate quality of sigmoid NN fit */


#include <math.h>
#include "mex.h"

/* propagate through one layer of a sigmoid multi-layer perceptron */


static void
layerProp(double * yy, const double * ww, const double *xx,
const int wRows, const int wCols, const double a)
{
int ii, jj;

for ( ii = 0 ; ii < wRows ; ii++)


{
yy[ii] = ww[ii]; /* copy bias term */

/* matrix multiply with row ii of ww */


for ( jj = 0 ; jj < wCols; jj++)
yy[ii] += ww[ii + (jj+1)*wRows] * xx[jj];

yy[ii] = 1.0/( 1.0 + exp(-a*yy[ii])); /* sigmoid function */

}
}

void mexFunction( int nlhs, mxArray *plhs[],


int nrhs, const mxArray*prhs[] )
{
char errMsg[1000];
double *xx, *yy, *dm, aa, *w1, *w2, *om, *em, y0[2], *y1, *y2;
int xrows, xcols, yrows, ycols, drows, dcols, mw1, nw1, mw2, nw2, ii, jj;

/* Argument checking - always a good thing! */


if (nrhs != 6)
{
sprintf(errMsg,"Received %d args, need 6", nrhs);
mexErrMsgTxt(errMsg);
return;
}
else if (nlhs > 2)
{
sprintf(errMsg,"%d outputs requested, max is 2", nlhs);
mexErrMsgTxt(errMsg);
return;
}

/* load arguments, check dimensions */


xrows = mxGetM(prhs[0]);
xcols = mxGetN(prhs[0]);
xx = mxGetPr(prhs[0]);
yrows = mxGetM(prhs[1]);
ycols = mxGetN(prhs[1]);
yy = mxGetPr(prhs[1]);
drows = mxGetM(prhs[2]);
dcols = mxGetN(prhs[2]);
dm = mxGetPr(prhs[2]);
mw1 = mxGetM(prhs[3]);
nw1 = mxGetN(prhs[3]);
w1 = mxGetPr(prhs[3]);
mw2 = mxGetM(prhs[4]);
nw2 = mxGetN(prhs[4]);
w2 = mxGetPr(prhs[4]);
aa = *mxGetPr(prhs[5]);

if( xcols != 1 || ycols != 1 || drows != xrows || dcols != yrows)


{

sprintf(errMsg,"x(%d x %d), y(%d x %d), d(%d x %d), incompatible",


xrows, xcols, yrows, ycols, drows, dcols);
mexErrMsgTxt(errMsg);
return;
}

/* memory allocation rules: "If you get it out, put it back" */


y1 = mxCalloc( mw1, sizeof(double) );
y2 = mxCalloc( mw2, sizeof(double) );

plhs[0] = mxCreateDoubleMatrix(drows, dcols, mxREAL); /* output surf */


plhs[1] = mxCreateDoubleMatrix(drows, dcols, mxREAL); /* error surf */

om = mxGetPr(plhs[0]);
em = mxGetPr(plhs[1]);

/* evaluate each point in the database */


for( ii = 0 ; ii < xrows ; ii++)
{
for (jj = 0 ; jj < yrows ; jj++)
{
y0[0] = xx[ii];
y0[1] = yy[jj];
layerProp(y1,w1,y0,mw1,nw1,aa);
layerProp(y2,w2,y1,mw2,nw2,aa);
om[ii+ jj*drows] = y2[0];
em[ii+ jj*drows] = dm[ii+jj*drows] - y2[0];
}
}

/* memory allocation rules: "If you get it out, put it back" */


mxFree(y1);
mxFree(y2);
}

21.3 Simple backprop without any preprocessing


Source code is in M-file 21.3 (backPropSig1.m). Results are shown in Figures 19–23. (See
also backPropSig.mov at http://www.eng.auburn.edu/users/hodelas/teaching.)
M-file 21.3 backPropSig1.m
mex nnsig.c
mex neteval.c


rand(’seed’,pi); % seed random number generator

% construct target function


nx = 9;
xx = linspace(-1, 2, nx)’;
ny = 7;
yy = linspace(-1, 2, ny)’;
[xm,ym] = meshgrid( xx, yy);
dm = 0.1 + 0.9*( xm >= 0 & xm <=1 & ym >= 0 & ym <= 1);

% network: use (nh) hidden nodes and one output node


% bias introduced in input layer
a = 5; % activation function parameter
nh = 9; no = 1; % network dimensions
W1 = 5*(rand(nh,3)-0.5);
W2 = 5*(rand(no,nh+1)-0.5);

[om, em] = neteval(xx, yy, dm, W1, W2,a);


figure(3);
mesh(xx,yy, om);
title(’Initial surface’);
print -depsc backPropSig_3.eps

figure(4);
mesh(xx,yy,em);
title(’Initial error surface’);
print -depsc backPropSig_4.eps

% keep a record of the error function


icnt = 1; Errv(icnt) = norm(em,’fro’)^2;

icnt = icnt+1; % icnt = "iteration count"


eta = 0.01; % execute backprop algorithm with eta selected
maxIter = 2000;
for iter = 0:maxIter
fprintf(’%4d: err=%12.4e\n’,iter,Errv(icnt-1));
for ii=1:nx
for jj = 1:ny
y0 = [xx(ii); yy(jj)]; % input vector
[y2,y1] = nnsig(y0, W1, W2, a);
v1 = W1*[1;y0];
v2 = W2*[1;y1];

% perform backprop step

e = dm(jj,ii) - y2;
d2 = -e .* dphi(v2,a);
DW2 = d2*[1;y1]’;

W2_times_d2 = W2’ * d2;


d1 = -(W2_times_d2(2:(nh+1))) .* dphi(v1,a) ;
DW1 = d1*[1;y0]’;

W2 = W2 - eta*DW2;
W1 = W1 - eta*DW1;

% evaluate error surface


[om,em] = neteval(xx,yy,dm, W1, W2,a);
Errv(icnt) = norm(em,’fro’)^2;
icnt = icnt + 1;
end
end
end
figure(6); mesh(xx,yy, om);
title(sprintf(’Output surface iteration %d’,iter));
axis([-1,2,-1,2,0,1.1]);
text(-0.5,2,0.8,sprintf(’current error =%12.4g’,Errv(icnt-1)));
text(1.5,-0.5,0.1,’x_1 value’);
text(-0.5,1.5,0.1,’x_2 value’);
print -depsc backPropSig_6.eps

% plot boundary lines, compute color based on slope of line


xp = linspace(-5,5,100)’;
yp = zeros(100,nh);
for ip = 1:nh
if(W1(ip,3) == 0)
denom = 1e-6;
else
denom = W1(ip,3);
end
yp(:,ip) = -(W1(ip,2)*xp + W1(ip,1))/denom;
linAng = atan2(W2(ip+1)*W1(ip,3),W2(ip+1)*W1(ip,2))*180/pi;
if(-45 < linAng & linAng <= 45), pcolor{ip} = ’g-’; %left
elseif(45 < linAng & linAng <= 135), pcolor{ip} = ’c-’; %bottom
elseif(135 < linAng & linAng <= 225), pcolor{ip} = ’r-’; %right
elseif(-135 < linAng & linAng <= -45), pcolor{ip} = ’b-’; %top
elseif(-225 < linAng & linAng <= -135), pcolor{ip} = ’r-’; %right
end
end

figure(8);
plot( ...
xp,yp(:,1),pcolor{1}, xp,yp(:,2),pcolor{2}, ...
xp,yp(:,3),pcolor{3}, xp,yp(:,4),pcolor{4}, ...
xp,yp(:,5),pcolor{5}, xp,yp(:,6),pcolor{6}, ...
xp,yp(:,7),pcolor{7}, xp,yp(:,8),pcolor{8}, ...
xp,yp(:,9),pcolor{9}, ...
-5:4,W2,’o’);
text(-5,4.0,sprintf(’iteration %d’,iter));
text(-5,3.6,sprintf(’Blue : top border’));
text(-5,3.2,sprintf(’Green: left border’));
text(-5,2.8,sprintf(’Red : right border’));
text(-5,2.4,sprintf(’Cyan : bottom border’));
for ip = 1:nh
% label line at far left side
idx = max(find(abs(yp(:,ip)) < 4.8) );
text(xp(idx),yp(idx,ip),sprintf(’line %d’,ip));
end
for ip = 0:nh
text(ip-5,W2(ip+1)+0.2,sprintf(’W2(%d)’,ip));
end
% x-axis labels
for lp = -6:2:4
text(lp,-4.5,sprintf(’%d’,lp));
end
% y-axis labels
for lp = -5:1:5
text(-6.1,lp,sprintf(’%d’,lp));
end
xlabel(’x1 value’);
ylabel(’x2 value/W2’);
title(’Layer 1 linear separation boundaries’);
grid on;
axis([-5,5,-5,5]);
axis(’equal’);
print -depsc backPropSig_8.eps
figure(7); plot(Errv); grid on;
title(sprintf(’Error function per backprop step’))
print -depsc backPropSig_7.eps

Figure 19: Simple backpropagation example: initial neural network output surface.

Figure 20: Simple backpropagation example: initial neural network output surface error
d(x_1, x_2) − y(x_1, x_2).

Figure 21: Simple backpropagation example: output surface after 200 passes through all
data points (plot annotation: iteration 400, current error = 6.268).

Figure 22: Simple backpropagation example: linear region separating boundaries after 200
iterations.

Figure 23: Simple backpropagation example: error function over all iterations.

22 2003 10 10: Techniques to improve training - 1


22.1 Homework 5 solution
Solution

22.1.1 Derivations and plots


1a

\phi(v) = a\tanh(bv) = a\,\frac{e^{bv}-e^{-bv}}{e^{bv}+e^{-bv}} = a\,\frac{1-e^{-2bv}}{1+e^{-2bv}}

\frac{d\phi}{dv} = a\,\frac{(1+e^{-2bv})(2be^{-2bv}) + (1-e^{-2bv})(2be^{-2bv})}{(1+e^{-2bv})^2}
 = 2abe^{-2bv}\,\frac{(1+e^{-2bv}) + (1-e^{-2bv})}{(1+e^{-2bv})^2}
 = 2abe^{-2bv}\,\frac{1+\phi(v)/a}{1+e^{-2bv}}

 = b\,(1+\phi(v)/a)\left(\frac{2ae^{-2bv}}{1+e^{-2bv}}\right)
 = (b/a)(a+\phi(v))\,a\left(\frac{2e^{-2bv}}{1+e^{-2bv}} + 1 - 1\right)

 = (b/a)(a+\phi(v))\,a\left(1 + \frac{2e^{-2bv} - 1 - e^{-2bv}}{1+e^{-2bv}}\right)
 = (b/a)(a+\phi(v))\,a\left(1 + \frac{e^{-2bv} - 1}{1+e^{-2bv}}\right)

 = (b/a)(a+\phi(v))\,a\,(1 - \phi(v)/a)

and the result follows.
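As a quick sanity check (not part of the assigned solution), the identity
phi'(v) = (b/a)(a − phi(v))(a + phi(v)) can be verified numerically against a
finite-difference estimate:

a = 1.5; b = 1;
v  = linspace(-3, 3, 601);
p  = a*tanh(b*v);
vm = (v(1:end-1) + v(2:end))/2;            % midpoints of the grid
pm = a*tanh(b*vm);
danalytic = (b/a)*(a - pm).*(a + pm);      % closed form from the derivation
dnumeric  = diff(p)./diff(v);              % difference estimate at the midpoints
fprintf('max |numeric - analytic| = %g\n', max(abs(dnumeric - danalytic)));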

1b Compare these to C-file 6.1 (mextanh.c) , C-file 6.2 (simpletanh.c) and C-file 7.3
(phiExV.c) . My solutions allow the input x to be a vector. I will accept solutions
where your code can only handle scalars (as done in C-file 6.1 (mextanh.c) and
C-file 6.2 (simpletanh.c) ). Source code for this problem is listed at the end of these
solutions.

2a You can use either steepest descent or backpropagation. My solutions show both; they are
listed at the end of the solutions. AND gate output in Figure 24.

hwk502.m Output

steepest descent:
tanh parameters: a= 1.5000e+00 b= 1.0000e+00
network weights: [ w0 w1 w2 ] = -3.8891e+00 2.5818e+00 2.5818e+00

Figure 24: AND gate output for Homework 5.

backprop:
tanh parameters: a= 1.5000e+00 b= 1.0000e+00
network weights: [ w0 w1 w2 ] = -4.1842e+00 2.7874e+00 2.7875e+00
>>

2b The hyperbolic tangent function allows the training function to “bend” the plane from
Homework 4 to match the shape of the AND gate data.

3 Source code is at the end of these solutions. Of course my solution matched MATLAB’s
output perfectly.

22.1.2 Source code: problem 1


C-file 22.8 phiT.c

#include <math.h>
#include "mex.h"
static void phiT (double *yy, const double *xx, const double aa,
const double bb, int len)
{
int ii;

for ( ii = 0 ; ii < len ; ii++)


yy[ii] = aa * tanh (bb*xx[ii]);
}
void mexFunction (int nlhs, mxArray * plhs[], int nrhs,
const mxArray * prhs[])
{
double *yy, *xx, aa, bb;
int len, mm, nn;
if (nrhs != 3) /* Check for proper number of arguments */
{
mexPrintf("phiT: Got %d inputs\n", nrhs);
mexErrMsgTxt ("Need 3 input arguments");
}
else if (nlhs > 1)
{
mexPrintf("Got %d outputs\n", nlhs);
mexErrMsgTxt ("Need one.");
}
xx = mxGetPr (prhs[0]);
mm = mxGetM ( prhs[0]);
nn = mxGetN ( prhs[0]);
if(mm != 1 && nn != 1)
{
mexPrintf("xx(%d x %d) must be a vector\n",mm,nn);
mexErrMsgTxt ("phiT: error");
}
plhs[0] = mxCreateDoubleMatrix (mm, nn, mxREAL);
yy = mxGetPr (plhs[0]);
aa = *mxGetPr(prhs[1]);
bb = *mxGetPr(prhs[2]);
phiT (yy, xx, aa, bb, mm*nn);
}

C-file 22.9 dphiT.c

#include <math.h>
#include "mex.h"
static void phiT (double *yy, const double *xx, const double aa,
const double bb, int len)
{
int ii;
for ( ii = 0 ; ii < len ; ii++)
yy[ii] = aa * tanh (bb*xx[ii]);
}

static void dphiT (double *yy, const double *xx, const double aa,
const double bb, int len)
{
int ii;
phiT(yy,xx,aa,bb,len);
for ( ii = 0 ; ii < len ; ii++)
yy[ii] = (bb/aa) * (aa - yy[ii]) * ( aa + yy[ii]);
}
void mexFunction (int nlhs, mxArray * plhs[], int nrhs,
const mxArray * prhs[])
{
double *yy, *xx, aa, bb;
int len, mm, nn;
if (nrhs != 3) /* Check for proper number of arguments */
{
mexPrintf("dphiT: Got %d inputs\n", nrhs);
mexErrMsgTxt ("Need 3 input arguments");
}
else if (nlhs > 1)
{
mexPrintf("Got %d outputs\n", nlhs);
mexErrMsgTxt ("Need one.");
}
xx = mxGetPr (prhs[0]);
mm = mxGetM ( prhs[0]);
nn = mxGetN ( prhs[0]);
if(mm != 1 && nn != 1)
{
mexPrintf("xx(%d x %d) must be a vector\n",mm,nn);
mexErrMsgTxt ("dphiT: error");
}
plhs[0] = mxCreateDoubleMatrix (mm, nn, mxREAL);
yy = mxGetPr (plhs[0]);
aa = *mxGetPr(prhs[1]);
bb = *mxGetPr(prhs[2]);
dphiT (yy, xx, aa, bb, mm*nn);
}

22.1.3 Source code: problem 2


M-file 22.1 hwk502.m

% compile mex files


mex phiT.c
mex dphiT.c

a = 1.5; b = 1; % phiT parameters


x1 = linspace(0,1,25); % for plotting with mexPexT
x2 = x1;
maxIter = 1e5; % max iterations and step size
eta = 0.01;

X = [1, 1, 1, 1; 0, 1, 0, 1 ; 0, 0, 1, 1]; % set up training data


d = 0.95*[-a, -a, -a, a];

% run steepest descent iteration


W = [0,0,0];
maxErr = max(abs(d - phiT(W*X,a,b)));
iter = 0;
while( maxErr > 0.1*max(abs(d)) & iter < maxIter )
iter = iter+1;
gn = zeros(size(W));
for nn=1:4
xn = X(:,nn);
vv = W*xn;
gn = gn + dphiT(vv,a,b)*((W*xn)*xn’ - d(nn)*xn’);
end
W = W - eta*gn;
maxErr = max(abs(d - phiT(W*X,a,b)));
end
maxErr1 = maxErr;

figure(1)
meshPexT(W,x1, x2, a,b,’input 1’,’input 2’,’AND output (steepest)’);

fprintf(’\nsteepest descent:\ntanh parameters: a=%12.4e b=%12.4e\n’,a,b);


fprintf(’network weights: [ w0 w1 w2 ] = %12.4e %12.4e %12.4e\n’, ...
W(1), W(2), W(3));

% run backpropagation iteration


W = [0,0,0];
maxErr = max(abs(d - phiT(W*X,a,b)));
iter = 0;
nn = 0;
while( maxErr > 0.1 & iter < maxIter )
iter = iter+1;
nn = nn + 1; % get new data point number
if ( nn > 4 )
nn = 1;

end
% do backpropagation "gradient" step
xn = X(:,nn);
vv = W*xn;
gn = dphiT(vv,a,b)*((W*xn)*xn’ - d(nn)*xn’);
W = W - eta*gn;
maxErr = max(abs(d - phiT(W*X,a,b)));
end

figure(2);
fprintf(’backprop:\ntanh parameters: a=%12.4e b=%12.4e\n’,a,b);
fprintf(’network weights: [ w0 w1 w2 ] = %12.4e %12.4e %12.4e\n’, ...
W(1), W(2), W(3));
meshPexT(W,x1, x2, a,b,’input 1’,’input 2’,’AND output’);
print -depsc hwk2003_0502.eps

M-file 22.2 meshPexT.m

function Yvals = meshPexT(W,x1,x2,a,b,xstr, ystr, tstr)


% function Yvals = meshPexT(W,x1,x2,a,b,xstr, ystr, tstr)
% compute and/or plot output of a single layer linear neural network
% function of two variables
% inputs: W (1 x 2): network weights
% x1, x2: each vector xx = [x1(ii); x2(jj)] for appropriate values of
% ii, jj
% a, b: arguments for phiT()
% xstr, ystr, tstr: strings for xlabel, ylabel, and title, respectively
% if all three are passed, then the mesh plot is plotted.
% if not, these arguments are ignored
doMeshPlot = 0;
if(nargin == 8)
if( isstr(xstr) & isstr(ystr) & isstr(tstr) )
doMeshPlot = 1;
end
end
% could check dimensions, etc., but I’m not going to
for ii=1:length(x1)
for jj=1:length(x2)
xbar = [1;x1(ii);x2(jj)];
Yvals(jj,ii) = phiT(W*xbar,a,b);
end
end
if(doMeshPlot)
mesh(x1,x2,Yvals);

xlabel(xstr); ylabel(ystr); title(tstr); grid on;


end

22.1.4 Source code: problem 3


C-file 22.10 matmul.c

#include <math.h>
#include "mex.h"
/* aa (mm x pp ), bb (pp x nn ), -> cc (mm x nn ) */
static void
matmul (double *cc, const double *aa, const double *bb,
const int mm, const int nn, const int pp)
{
int ii, jj, kk;
for ( ii = 0 ; ii < mm ; ii++)
{
for ( jj = 0 ; jj < nn ; jj++ )
{
cc[ii + mm*jj] = 0.0;
for ( kk = 0 ; kk < pp ; kk++)
{
cc[ ii + mm*jj ] += aa[ii + mm*kk] * bb [ kk + jj*pp];
}
}
}
}

void
mexFunction (int nlhs, mxArray * plhs[], int nrhs,
const mxArray * prhs[])
{
double *cc, *aa, *bb;
int mm, pp, nn;

/* should argument checking - check dimensions; I skip that here */


aa = mxGetPr (prhs[0]);
mm = mxGetM ( prhs[0]);
pp = mxGetN ( prhs[0]);
bb = mxGetPr (prhs[1]);
nn = mxGetN ( prhs[1]);
plhs[0] = mxCreateDoubleMatrix (mm, nn, mxREAL);
cc = mxGetPr (plhs[0]);
matmul (cc, aa, bb, mm,nn,pp);
}

M-file 22.3 matmultest.m

mex matmul.c

a = [1 2 3 ; 4 5 6 ; 7 8 10];
b = [5 6 ; 7 8 ; 1 2 ];
c = matmul(a,b)
chk = a*b
err = c - chk


backPropRand.m Output

backPropRand: randomize order of samples


Present data set 1
Present data set 9
Present data set 8
Present data set 2
Present data set 10
Present data set 5
Present data set 3
Present data set 7
Present data set 4
Present data set 6

Figure 25: Output of m-file backPropRand in Example 22.1.

22.2 Techniques to improve training


Read §4.4–4.6

22.3 Activation function


Use φ(v) = tanh(v) or something similar so that φ can take on positive and negative values.
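A small illustration of the point (toy values, purely for illustration): tanh takes on both
positive and negative values and is zero-mean over symmetric inputs, while the logistic
sigmoid is confined to (0, 1).

v = linspace(-3, 3, 101);
phi_tanh = tanh(v);                 % symmetric: positive and negative values
phi_sig  = 1./(1 + exp(-v));        % confined to (0,1)
fprintf('mean(tanh) = %.3f   mean(sigmoid) = %.3f\n', mean(phi_tanh), mean(phi_sig));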

22.4 Randomize sample order


Randomize order of sample data \{(x(n), d(n))\}_{n=1}^{N} in each iteration. Avoids limit cycles.

Example 22.1 M-file 22.4 backPropRand.m

fprintf(’\nbackPropRand: randomize order of samples\n’);


N = 10; % ten samples
[jnk,idx] = sort(rand(N,1));
for ii=1:N
this_idx = idx(ii);
fprintf(’Present data set %d\n’, this_idx);
end

Results in Figure 25.


23 2003 10 13: Techniques to improve training - 2


23.1 Present “worst case” data
Present the pair (x(n), d(n)) with the worst output error. Contrast worst-case (‖·‖_∞) vs. least
squares (‖·‖_2) training.

Remark 23.1 Doesn’t work very well without careful treatment.
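For reference, a minimal sketch of what "present the worst-case pair" looks like for a single
linear neuron (toy data and step size chosen for illustration; this is not one of the course
m-files):

X = [ones(1,10); rand(2,10)];          % 10 samples: bias row plus two inputs
d = 1 + 0.5*X(2,:) - 0.3*X(3,:);       % targets from a known linear map
W = zeros(1,3); eta = 0.05;
for iter = 1:500
    e = d - W*X;                       % errors over all samples
    [jnk, n] = max(abs(e));            % pick the worst-case sample
    xn = X(:,n);
    W  = W + eta*(d(n) - W*xn)*xn';    % LMS-style step on that sample only
end
fprintf('final worst-case error: %g\n', max(abs(d - W*X)));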

23.2 Momentum term - generalized delta rule


Batch processing attempts to minimize E over all input data simultaneously:

E_{av} = \frac{1}{N}\sum_{\ell=1}^{N} E(x(\ell))

and results in an average gradient

\Delta W^{(k)}(n) = \eta\,\frac{1}{N}\sum_{\ell=1}^{N} \Delta W^{(k)}(n; x_\ell)

where \Delta W^{(k)}(n; x_\ell) is the gradient due to x_\ell with the weight set W^{(k)}(n).
As an alternative to batch processing, select

\Delta W^{(k)}(n) = \alpha\,\Delta W^{(k)}(n-1) + \eta\,\delta^{(k)}(n)\left(y^{(k-1)}(n)\right)^T

Result:

\Delta W^{(k)}(n) = \eta \sum_{\ell=1}^{n} \alpha^{n-\ell}\, \delta^{(k)}(\ell)\left(y^{(k-1)}(\ell)\right)^T

Selection of parameters:

1. If \alpha is big, then you may get slow convergence.

2. If \alpha is small, you may get noisy behavior.

Idea Select \alpha so that \alpha^N is "significant", but so that \alpha^{2N} is small (e.g., \alpha^N = 0.2 or
smaller). This results in behavior similar to "batch" processing (see text).

Selection of \eta: suppose that \delta^{(k)}(n)\left(y^{(k-1)}(n)\right)^T were constant. Then we would like
\Delta W^{(k)}(n) \rightarrow \delta^{(k)}(n)\left(y^{(k-1)}(n)\right)^T, which gives \eta = 1-\alpha. (Consider the d.c. gain of the
transfer function \eta/(z-\alpha).)
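A minimal sketch of the generalized delta rule on a single linear neuron follows (toy data;
η is kept below 1 − α as suggested above; illustrative only, not the course mlpTrain code):

X = [ones(1,8); linspace(0,1,8)];      % bias row plus one input
d = 2*X(2,:) - 0.5;                    % targets from a known line
W = zeros(1,2);
alpha = 0.9; eta = 0.05;               % eta kept below 1 - alpha = 0.1
DeltaW = zeros(size(W));               % momentum ("filtered step") state
for n = 1:200                          % 200 passes through the data
    for ell = 1:size(X,2)
        xn = X(:,ell);
        delta = d(ell) - W*xn;                   % phi'(v) = 1 for a linear neuron
        DeltaW = alpha*DeltaW + eta*delta*xn';   % generalized delta rule
        W = W + DeltaW;
    end
end
fprintf('trained weights: %8.4f %8.4f (expect -0.5, 2)\n', W(1), W(2));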

Homework 6 Due Wed Oct. 22.

1. MATLAB Pattern classification with MLP’s

(a) Obtain all of the files from the class ftp site
ftp://ftp.eng.auburn.edu/pub/hodel/6240
in directory hwk6. Run the m-file makeMex in MATLAB. This will compile a
number of C-language functions for you. Equivalent m-file functions are included
in hwk6 so that you can see what the C-functions do. Relevant functions are
M-file 19.4 (phiT.m), M-file 19.5 (dphiT.m) (mex) solutions to your last home-
work; M-file 4.3 (mlpT.m) (mex) Compute the output of a two layer perceptron
with hyperbolic tangent activation functions; M-file 4.2 (mlpEvalT.m) (mex) eval-
uate a two layer perceptron with user specified weights W1 and W2 over all data
points in a given data set, returns the network outputs and the corresponding er-
rors; M-file 25.3 (backPropStep.m) execute one back propagation step M-file 25.2
(mlpTrain.m) (m-file) Repeatedly calls the above functions to train a two layer
MLP to match a given data set; M-file 25.4 (mlpNormalize.m) (m-file) Perform
statistical normalization on training data; M-file 9.2 (learnTaskEx1s.m) Example
of a linear classifier network worked earlier this semester in class (sine, sawtooth,
and square wave); hwk6P1.m: template for the solution to problem 1 of this home-
work. hwk6P2.m: m-file for analysis in problem 2 of this homework.
(b) Modify the code in hwk6P1.m to train the neural network as a pattern classi-
fier as was done in M-file 9.2 (learnTaskEx1s.m). Email your completed m-file
hwk6P1.m to Mr. Simmons.
2. Recall that we normalize input data by computing the mean
\bar{x} = \frac{1}{N}\sum_{n=1}^{N} x(n) \stackrel{\Delta}{=} E[x(n)]
and covariance
\Sigma_x = \frac{1}{N}\sum_{n=1}^{N} (x(n)-\bar{x})(x(n)-\bar{x})^T \stackrel{\Delta}{=} E[(x(n)-\bar{x})(x(n)-\bar{x})^T]
of a data set \{x(n)\}_{n=1}^{N}. The random number generator randn in MATLAB generates in-
dependent, identically distributed pseudo-random Gaussian variables with mean 0 and
variance 1.

(a) Suppose you were given a data set \{x(n)\}_{n=1}^{N} of vectors x(n) of length 3 that were
generated by randn. What would you expect \bar{x} and \Sigma_x to be? Explain.
(b) Run the m-file hwk6P2.m included in the ftp directory from problem 1. Explain
any differences between your predicted values from part (a) and those computed
by this m-file.

3. Problem 4.4, p. 252 in [Hay99].

24 2003 10 15: Techniques to improve training - 3


24.1 Statistically normalize the data
For a set of N variables x, define

E[x] \stackrel{\Delta}{=} \frac{1}{N}\sum_{n=1}^{N} x(n), \qquad \bar{x} = E[x], \qquad \Sigma_x = E[(x-\bar{x})(x-\bar{x})^T]

We want to normalize \{(x(n), d(n))\}_{n=1}^{N} to be zero-mean and unit (identity matrix) covari-
ance. Compute

y^{(0)}(n) = (\Sigma_x)^{-1/2}\,(x(n)-\bar{x})

where \Sigma_x = (\Sigma_x)^{1/2}(\Sigma_x)^{1/2} is computed from the singular value decomposition. As a result,
E[y^{(0)}] = 0 and E[y^{(0)}\,y^{(0)T}] = I.
Example 24.1 M-file 24.1 backPropNormTest.m
xn = 7;
yn = 9;
x = linspace(-1,2,xn);
y = linspace(-1,2,yn);
N = xn*yn;
fprintf(’%d data points\n’,N);

idx = 1; % construct data matrix


XX = zeros(2,N);
for ii = 1:xn
for jj = 1:yn
XX(1:2,idx) = [x(ii); y(jj)];
idx = idx+1;
end
end

% get mean, covariance of data in XX


[sigX, mX] = backPropNormalize(XX);
fprintf(’sigX: %12.4e %12.4e\n’,sigX(1,1), sigX(1,2));
fprintf(’ %12.4e %12.4e\n’,sigX(2,1), sigX(2,2));
fprintf(’mX: %12.4e\n’,mX(1));
fprintf(’ %12.4e\n’,mX(2));

YY = sigX\( XX - mX*ones(1,size(XX,2)) );
sigY = YY * YY’ / N;
fprintf(’sigY: %12.4e %12.4e\n’,sigY(1,1), sigY(1,2));
fprintf(’ %12.4e %12.4e\n’,sigY(2,1), sigY(2,2));

backPropNormTest.m Output

63 data points
sigX: 1.0000e+00 0.0000e+00
0.0000e+00 9.6825e-01
mX: 5.0000e-01
5.0000e-01
sigY: 1.0000e+00 -5.2868e-18
-5.2868e-18 1.0000e+00

M-file 24.2 backPropNormalize.m

function [sigX,mX] = backPropNormalize(XX)


% [sigX,mX] = backPropNormalize(XX)
% given a m x N matrix of data X = [ x(1) , x(2) , ... ]
% compute mean and normalizing matrix so that y = sigX\(x - mX) implies
% E [y] = 0 and E [ y y’ ] = I

[m,N] = size(XX); % shift XX to be zero mean


mX = mean(XX’)’;

% scale to be unit variance


XX = XX - mX*ones(1,N);
SigX = XX * XX’/N; % same as sum x(n) * x(n)’
[u,s] = eig(SigX); % SigX symmetric -> eig() and svd() do the same thing
sigX = u*sqrt(s)*u’;

24.2 Selection of initial weights


Suppose \bar{y}^{(0)}(0) = 0 and \Sigma_{y^{(0)}}(0) = I. Define \bar{w}^{(k)}(n) = \mathrm{vec}\left(W^{(k)}(n)\right), the vector stack of
W^{(k)}(n); in MATLAB:

[m2,m1] = size(Wk); wbar = reshape(Wk,m2*m1,1);

Drop superscripts for now since they're clear by context. Suppose we select weights W so
that E[w_{ij}] = 0 and

E[w_{ij} w_{k\ell}] = \begin{cases} \sigma_w^2 & i = k,\ j = \ell \\ 0 & \text{else} \end{cases}

What does this mean about E[v^{(1)}]? Suppose that W and y are statistically independent.
Then

E[v] = E[W y] = E[W]\,E[y] = 0

and, for i = k (the cross terms with i \neq k or j \neq \ell vanish),

E[v_i v_k] = E\!\left[\sum_{j=1}^{m} w_{ij} y_j \sum_{\ell=1}^{m} w_{k\ell} y_\ell\right]
 = \sum_{j=1}^{m}\sum_{\ell=1}^{m} E[w_{ij} w_{k\ell} y_j y_\ell]
 = \sum_{j=1}^{m} E[w_{ij}^2 y_j^2]
 = \sum_{j=1}^{m} E[w_{ij}^2] = m\,\sigma_w^2

So choose \sigma_w^2 = 1/m to get unit variance on v.
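A minimal sketch of this initialization (illustrative dimensions; note that the course's
mlpTrain, M-file 25.2, instead uses rands(nh,nx+1)/sqrt(nh+1)):

m  = 4;   nh = 10;                     % fan-in m and number of hidden neurons
W1 = randn(nh, m+1) / sqrt(m);         % E[w_ij] = 0, variance sigma_w^2 = 1/m
W1(:,1) = 0;                           % zero initial biases (bias column)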

24.3 Adjust learning rates



We want the gradients at each layer to be similar in magnitude; scale the \eta^{(k)}'s so that the
\Delta W^{(k)}'s are roughly the same size.
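One crude way to implement this is sketched below; the gradient matrices are random
stand-ins, purely for illustration (this is not from the course library):

DW1 = 1e-3*randn(10,3);                % hypothetical hidden-layer gradient (small)
DW2 = 1e-1*randn(1,11);                % hypothetical output-layer gradient (larger)
eta2 = 0.01;
eta1 = eta2 * norm(DW2,'fro') / (norm(DW1,'fro') + eps);   % boost the small-gradient layer
fprintf('|eta1*DW1| = %g   |eta2*DW2| = %g\n', ...
        eta1*norm(DW1,'fro'), eta2*norm(DW2,'fro'));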

25 2003 10 17 Example: training with the unit square


Example 25.1 Repeat the unit-square neural network problem of Homework 4 using the
backpropagation algorithm to find the network weights. Apply some or all of the methods in
this section.

Techniques used:

1. Set initial bias values to zero.

2. Scale inputs to be zero mean, unit variance

3. Scale outputs to lie within “linear region” of activation function

4. use tanh activation function

5. Increase number of iterations

6. Keep η step size bounded above by 1 − α momentum “d.c. gain.”

7. Randomize order of inputs in each iteration

8. Select equal numbers from each class so that the output variable is also uniformly
distributed (alternatively: scale outputs as described above)

backPropEx2.m Output

Found /Users/hodelas/aub/6240/dphiT.mexmac.
Found /Users/hodelas/aub/6240/mexPrintMat.mexmac.
Found /Users/hodelas/aub/6240/mlpEvalT.mexmac.
Found /Users/hodelas/aub/6240/mlpT.mexmac.
Found /Users/hodelas/aub/6240/phiT.mexmac.

ans =

success

Keeping 81 data points from each class


iteration 50; error = 487.8313
iteration 100; error = 448.0934
iteration 150; error = 366.5157
iteration 200; error = 346.0957
iteration 250; error = 331.9596
iteration 300; error = 318.6952
iteration 350; error = 306.0510
iteration 400; error = 293.8684
iteration 450; error = 282.0579

iteration 500; error = 270.5603


iteration 550; error = 259.3391
iteration 600; error = 248.3798
iteration 650; error = 237.6699
iteration 700; error = 227.2100
iteration 750; error = 217.0141
iteration 800; error = 207.1120
iteration 850; error = 197.5519
iteration 900; error = 188.3970
iteration 950; error = 179.7163
iteration 1000; error = 171.5714
iteration 1050; error = 164.0072
iteration 1100; error = 157.0391
iteration 1150; error = 150.6566
iteration 1200; error = 144.8296
iteration 1250; error = 139.5162
iteration 1300; error = 134.6656
iteration 1350; error = 130.2302
iteration 1400; error = 126.1644
iteration 1450; error = 122.4263
iteration 1500; error = 118.9796
iteration 1550; error = 115.7918
iteration 1600; error = 112.8354
iteration 1650; error = 110.0860
iteration 1700; error = 107.5225
iteration 1750; error = 105.1269
iteration 1800; error = 102.8824
iteration 1850; error = 100.7749
iteration 1900; error = 98.7918
iteration 1950; error = 96.9224
iteration 2000; error = 95.1559
training time: 142.203429 s
sigX:(2 x 2) =
4.3523e-01 -2.4166e-03
-2.4166e-03 5.1434e-01
mX:(1 x 2) =
5.1190e-01 5.1709e-01

sigY =

mY =

6.3050e-17

W1 =

0.2265 -0.2593 -0.1067


0.0620 -0.0302 -0.0603
-0.0040 -0.0845 0.0535
-0.1333 -0.0143 0.0068
0.1099 -0.0857 0.2027
-0.0037 -0.0689 -0.0940
-0.1170 -0.2602 -0.2454
0.2079 0.0299 -0.2889
-0.0431 -0.1600 -0.0015
0.0266 0.0372 0.0794

W2 =

Columns 1 through 7

-0.1739 0.4423 -0.6291 -0.2248 0.4981 0.3003 0.0400

Columns 8 through 11

-0.1679 0.3461 -0.2438 -0.1559

>>

M-file 25.1 backPropEx2.m

% Backpropagation example. backprop to fit unit square problem


makeMex
rand(’seed’,2);
a = sqrt(3); % parameters for activation function phiT
b = 3;
nx = 29; % range of evaluation
xx = linspace(-1,2,nx);
ny = 27;
yy = linspace(-1, 2, ny);
[xm,ym] = meshgrid(xx, yy); % compute output function
dm = (-a + 2*a*( xm >= 0 & xm <=1 & ym >= 0 & ym <= 1))*0.9;

% repackage data for call to training/normalization codes


% format: row n = [x1(n), x2(n), d(n)]
mlpData = [ reshape(xm,nx*ny,1), reshape(ym,nx*ny,1), reshape(dm,nx*ny,1)];
C1 = find(mlpData(:,3) > 0); % use same # of samples for 0 as for 1
C2 = find(mlpData(:,3) < 0);
nC1 = length(C1);
nC2 = length(C2);
nkeep = min(nC1, nC2);
fprintf(’Keeping %d data points from each class\n’,nkeep);
[jnk,idx] = sort(rand(1,nC2));
idx = idx(1:nkeep);
C2keep = C2(idx);
myData= mlpData([C1,C2keep],:);

ni = 2; % network dimensions:
nh = 10; % use 5 hidden nodes and one output node
no = 1;

[inData,sigX, mX, sigY, mY] = mlpNormalize(myData,ni, no); % normalize data

% learning step size and momentum parameter


eta = 0.0002/(b*nkeep);
alpha = exp(log(0.2)/(2*nkeep)); % 2*nkeep points in database
alpha = 0.2;
maxIter = 2000;

startTime = clock;
[W1, W2, Yv, Errv, ErrHist] = mlpTrain(inData, ni, nh, no, eta, ...
alpha, a, b, maxIter);
trainingTimeSeconds = etime(clock,startTime);

% --- ALL PLOTTING FROM HERE ---


fprintf(’training time: %f s\n’,trainingTimeSeconds);
fn = 1; figure(fn); plot(xx,phiT(xx,a,b),’-’); grid on ;
legend(sprintf(’-;phiT(x,%f,%f);’,a,b));
title(’activation function used’); grid on;
eval(sprintf(’print -depsc backPropEx2_%d.eps’,fn));
fn = fn+1; figure(fn); title(’Desired output’); mesh(xx,yy,dm);
eval(sprintf(’print -depsc backPropEx2_%d.eps’,fn));
fn = fn+1; figure(fn); plot(mlpData(C2keep,1), mlpData(C2keep,2),’x’);
title(’kept points outside the box’); grid on ;
eval(sprintf(’print -depsc backPropEx2_%d.eps’,fn));
fn = fn+1; figure(fn); plot(ErrHist); title(’Error history’); grid on;
eval(sprintf(’print -depsc backPropEx2_%d.eps’,fn));

% plot final error surface


for ix = 1:nx
for iy=1:ny
xn = sigX\([xx(ix); yy(iy)] - mX’);
zm(iy,ix) = mY + sigY*mlpT(xn,W1,W2,a,b);
end
end
fn = fn+1; figure(fn); mesh(xx,yy,zm)
title(’NN output surface - 200 iterations’); xlabel(’input 1’);
ylabel(’input 2’); eval(sprintf(’print -depsc backPropEx2_%d.eps’,fn));
fn = fn+1; figure(fn); mesh(xx,yy,(dm - zm))
title(’NN error surface - 200 iterations’)
xlabel(’input 1’); ylabel(’input 2’);
eval(sprintf(’print -depsc backPropEx2_%d.eps’,fn));
fprintf(’sigX:’); mexPrintMat(sigX);
fprintf(’mX:’); mexPrintMat(mX);
sigY
mY
W1
W2

• Activation function is shown in Figure 26.

• The desired output is in Figure 27.

• Haykin’s derivation of Neural Network performance assumes that inputs and outputs
are uniformly distributed. Our sample space is strongly biased toward points outside
the box. While the training algorithm works with all data points included in the sample
set, it works pretty well by keeping only selected points in the training data set. The
randomly chosen data values from outside the square are shown in Figure 28.

• The sum of the squared errors in each iteration (epoch in Haykin’s book) is shown in
Figure 29.
An epoch involves presentation of all “kept” data points to the neural network and
their corresponding backpropagation steps. Convergence is pretty much done by 150-
200 iterations (of 162 backpropagation steps each).

• The resulting fit to the data is much more pleasing in this case than in the original;
see Figures 30 – 31.

Figure 26: Activation function used in Example 25.1. Training was also done with \phi(x) =
\sqrt{3}\tanh(6x); the steeper slope permits a better potential fit to the discontinuous underlying
function, but the iteration did not converge after 20,000 iterations.

Figure 27: Desired output function (discontinuous!) for Example 25.1.

Figure 28: Randomly chosen data values from outside the unit square used for training the
neural network in Example 25.1.

Figure 29: Squared error sum per training iteration for Example 25.1.

Figure 30: Trained network output surface for Example 25.1.

Figure 31: Trained network error surface for Example 25.1.

The subroutines used by this task are at the class ftp site.

M-file 25.2 mlpTrain.m

function [W1, W2, Yv, ErrV, ErrHist ] = mlpTrain(inData, nx, nh, no, eta, alpha, a, b, maxIter)
% function [W1, W2, Yv, ErrV, ErrHist ] =
% mlpTrain(inData, nx, nh, no, eta, alpha, a, b, maxIter)
% perform backpropagation training on a two layer MLP

[N, inM] = size(inData);


if(inM ~= nx + no)
error(’mlpTrain: inData[%d x %d] has nx=%d, no=%d incompatible’, ...
inM, N, nx, no);
end

% initalize weight matrices


W1 = rands(nh, nx+1)/sqrt(nh+1);
W2 = rands(no, nh+1)/sqrt(no+1);
W1(:,1) = 0; W2(:,1) = 0; % start with zero bias

% calculate initial error


[Yv,ErrV] = mlpEvalT(inData, W1, W2, a,b);

% keep a record of the error function


icnt = 1;
ErrHist(icnt) = norm(ErrV,’fro’)^2;
icnt = icnt+1;

% execute backprop algorithm with eta selected


% initialize momentum terms
DeltaW1 = zeros(size(W1)); DeltaW2 = zeros(size(W2));

for iter = 1:maxIter


if(iter == 50*floor(iter/50))
fprintf(’iteration %5d; error = %12.4f\n’,iter, norm(ErrV,’fro’)^2);
end
% randomize input order
[jnk,nidx] = sort(rand(1,N));
for ni=1:N
nn = nidx(ni);
y0n = inData(nn,1:nx)’;
dn = inData(nn,nx+(1:no));
[W1, W2, DeltaW1, DeltaW2] = backPropStep(y0n, dn, a, b, ...
alpha, eta, W1, W2, DeltaW1, DeltaW2);
end

[Yv,ErrV] = mlpEvalT(inData, W1, W2, a,b);


ErrHist(icnt) = norm(ErrV,’fro’)^2;
icnt = icnt + 1;
end

M-file 25.3 backPropStep.m


function [W1, W2, DeltaW1, DeltaW2] = backPropStep(y0n, dn, a, b, ...
alpha, eta, W1, W2, DeltaW1, DeltaW2);
%function [W1, W2, DeltaW1, DeltaW2] = backPropStep(y0n, dn, a, b, ...
% alpha, eta, W1, W2, DeltaW1, DeltaW2);
% back propagation step with generalized delta rule for two layer NN.
% NOTE: This m-file has been translated to C

% output and hidden layer values


[y2,y1] = mlpT(y0n, W1, W2, a,b);
v1 = W1*[1;y0n];    % induced local fields, needed by dphiT below
v2 = W2*[1;y1];

e = dn - y2;
d2 = e .* dphiT(v2,a,b);
DW2 = d2*[1;y1]’;
DW2 *= max(1, 0.01*norm(e)/(norm(DW2)+1e-3));

W2_times_d2 = W2’ * d2;


nh = size(W1,1);
d1 = (W2_times_d2(2:(nh+1))) .* dphiT(v1,a,b) ;
DW1 = d1*[1;y0n]’;
DW1 *= max(1, 0.01*norm(e)/(norm(DW1)+1e-3));

DeltaW1 = alpha*DeltaW1 + eta*DW1;


DeltaW2 = alpha*DeltaW2 + eta*DW2;

W2 = W2 + DeltaW2;
W1 = W1 + DeltaW1;

M-file 25.4 mlpNormalize.m


function [inData,sigX, xm, sigY, ym] = mlpNormalize(mlpData,nx, ny)
% [inData,sigX, mX, sigY, mY] = mlpNormalize(mlpData);
% normalize data for neural network output
% revised network is:
% yy = mY + sigY* neuralNetwork( ..., SigX\(x - mX) );
%
% FIXME: this function assumes that mlpData is uniformly distributed.

xData = mlpData(:,1:nx);
yData = mlpData(:,nx+(1:ny));
N = size(mlpData,1);

% mean values
xm = mean(xData);
ym = mean(yData);
xData = xData - ones(N,1)*xm;
yData = yData - ones(N,1)*ym;

% covariance matrices
if(N > nx)
sigX = xData’*xData/N;
xData = xData/sigX;
else
sigX = 1;
end
if(N > ny)
sigY = 1;
yData = yData/sigY;
else
sigY = eye(ny);
end

% normalized data
inData = [xData,yData];

Figure 32: Target surface for example 25.2.

25.1 Another example: bad output normalization


Example 25.2 Plant model: y = sin(x1 ) + x2 . Surface plotted in Figure 32.
Mild modifications to M-file 25.1 (backPropEx2.m); Results: Error history is shown in
Figure 33.

backPropEx3.m Output

iteration 100; error = 9.7856


iteration 150; error = 9.3335
iteration 200; error = 8.8783
iteration 250; error = 8.6363
iteration 300; error = 9.1668
iteration 350; error = 8.1168
iteration 400; error = 8.1344
iteration 450; error = 7.7494
iteration 500; error = 7.6579
iteration 550; error = 7.3684

iteration 600; error = 7.2368

trainingTimeSeconds =

287.0925

sigX =

3.5249 -0.0000
-0.0000 0.3590

mX =

1.0e-17 *

0.4537 -0.0851

sigY =

mY =

-1.1627e-17

W1 =

0.0005 0.9481 -0.2460


0.0008 0.3885 -0.0492
-0.0001 -0.4829 0.0632
-0.0002 0.1094 -0.3366
0.0029 0.5427 -0.0768
-0.0005 -0.5568 0.0802
0.0007 0.3137 -0.0431
-0.0003 -0.5402 0.0757
-0.0006 0.0354 0.1529
-0.0013 1.9706 -0.0630

W2 =

Columns 1 through 7

-0.0010 -0.8242 -0.4163 0.6490 0.4644 -0.5167 0.7112

Columns 8 through 11

-0.3279 0.7088 -0.0569 1.9855

>>

M-file 25.5 backPropEx3.m

% Backpropagation example. backprop to fit unit square problem


% compile backPropStep mex file
% required files are:

% parameters for activation function phiT


a = sqrt(3); b = 1; % activation function parameters

nx = 29; xx = linspace(-pi,pi,nx); % theta


ny = 27; yy = linspace(-1, 1, ny); % volts

[xm,ym] = meshgrid( xx, yy);


dm = sin(xm) + ym;
mesh(xm,ym,dm)

fn = 0;
fn = fn+1; figure(fn);
plot(xx,phiT(xx,a,b),’-’);
legend(sprintf(’-;phiT(x,%f,%f);’,a,b));
title(’activation function used’)
grid on;
eval(sprintf(’print -depsc backPropEx3_%d.eps’,fn));

fn = fn+1; figure(fn);
title(’Desired output’);
mesh(xx,yy,dm);
eval(sprintf(’print -depsc backPropEx3_%d.eps’,fn));

% data format: column 1 = x, column 2 = y, column 3 = d


mlpData = [ reshape(xm,nx*ny,1), reshape(ym,nx*ny,1), reshape(dm,nx*ny,1)];

myData= mlpData;

% network: use 10 hidden nodes and one output node


% bias introduced in input layer
ni = 2; nh = 10; no = 1;

rand(’seed’,pi); % initialize random number generator

[inData,sigX, mX, sigY, mY] = mlpNormalize(myData,ni, no); % normalize data

% learning step size and momentum parameter


eta = 0.002; alpha = 0.1; maxIter = 600;

% time how long this takes.


startTime = clock;
[W1, W2, Yv, Errv, ErrHist] = mlpTrain(inData, ni, nh, no, eta, alpha, ...
a, b, maxIter);
trainingTimeSeconds = etime(clock,startTime)

fn = fn+1; figure(fn);
plot(ErrHist);
title(’Error history’);
eval(sprintf(’print -depsc backPropEx3_%d.eps’,fn));

% plot final error surface


for ix = 1:nx
for iy=1:ny
xn = sigX\([xx(ix); yy(iy)] - mX’);
zm(iy,ix) = mY + sigY*mlpT(xn,W1,W2,a,b);
end
end
fn = fn+1; figure(fn);
mesh(xx,yy,zm)
title(’NN output surface - 200 iterations’)
xlabel(’input 1’);
ylabel(’input 2’);
eval(sprintf(’print -depsc backPropEx3_%d.eps’,fn));

fn = fn+1; figure(fn);
mesh(xx,yy,(dm - zm))
title(’NN error surface - 200 iterations’)
xlabel(’input 1’);
ylabel(’input 2’);

Figure 33: Error history for example 25.2.

eval(sprintf(’print -depsc backPropEx3_%d.eps’,fn));


sigX
mX
sigY
mY
W1
W2

Output surface is in Figure 34. Notice that the choice of a in M-file 25.5 (backPropEx3.m)
results in the inability to match the data at the extreme upper and lower bounds on y. The
error surface is plotted in Figure 35.

25.2 Decision rules


Read §4.7

Figure 34: Output surface for example 25.2. Compare to Figure 32.

Classification: how to interpret network output: assign class i if y_i^{(\bar{k})}(n) > y_j^{(\bar{k})}(n) for all
j \neq i. Confidence? How close are the competitors? How close is y_i^{(\bar{k})}(n) to the quantized value?
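A minimal sketch of this decision rule with a rough confidence measure (the output vector
below is a hypothetical example, and the quantized target is assumed to be 1):

yk = [0.12; 0.91; 0.34];                       % hypothetical output-layer values y^(kbar)(n)
[ymax, class] = max(yk);                       % assign class i with the largest output
others = yk([1:class-1, class+1:end]);
margin = ymax - max(others);                   % how close are the competitors?
fprintf('class %d, margin %.2f, distance from quantized value 1: %.2f\n', ...
        class, margin, 1 - ymax);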

25.3 Feature detection/hidden neurons


Suppose a network has \bar{k} layers. Then the output neuron values are

y_i^{(\bar{k})}(n) = \sum_{j=0}^{m_{\bar{k}-1}} w_{ij}^{(\bar{k})}\, y_j^{(\bar{k}-1)}(n)
Figure 35: Error surface for example 25.2.

26 2003 10 20: Radial Basis Function Networks


26.1 Project proposal guidelines
All class projects are to be turned in to Dr. Hodel not later than 5:00 pm on Wed, Dec. 3,
2003. Your draft project proposal is to be turned in to Dr. Hodel at the start of class on
Wed Oct 27, 2003. Your project must be approved by Dr. Hodel not later than Nov 5. Un-
dergraduate projects will be permitted to be less ambitious than graduate student projects.
However, undergraduates may choose to meet graduate student project requirements. The
guidelines below are representative only; students may propose other types of projects if
they wish, subject to instructor approval.
All project materials (manuals and software) should be submitted to Dr. Hodel in elec-
tronic form. Proposals may be printed out or submitted in electronic form. Printed proposals
should be put in my mailbox or given to me in class. Printed proposals that are submitted
in other fashions (e.g., slid under my door) will be torn up and thrown away.
Proposal written documentation should be submitted in either LaTeX source code or
Microsoft Word. Other word processor formats will be accepted only with prior approval.

26.1.1 Undergraduates
Undergraduate projects may be of the following two types:

1. Write a technical review of a technical paper on neural networks. The article must be
approved by Dr. Hodel, and should come from an IEEE journal (Transactions on Neural
Networks, Transactions on Automatic Control, Control Systems Magazine, other IEEE
conferences or journals) or some other professional society journal/conference (e.g.,
ASME, AIAA, etc.). Your review should either include

(a) a technical discussion of the contribution of the paper (why is it better/worse


than other methods? How does it compare to what we know in class?) or
(b) a moderate discussion of techniques and a computer implementation demonstrat-
ing the proposed method. Your software (matlab m-files are fine) should be com-
mented or else it should have accompanying description in the form of a manual.

2. Update the neural network library functions in nnLib.c and nnLib.h. Possible ideas
here are to update backPropStep so that will work for any number of layers (not just
two), updating and testing the radial basis function codes, etc. This kind of project
should include

(a) Documentation (a manual) of the revised software and libraries.


(b) Test code to demonstrate that the software works correctly.
(c) Source code - commented!

26.1.2 Graduates
Graduate student projects will involve at least the following three elements: (1) A written
manual/report. (2) Software implementations relevant to the project. (3) Test data used for
training and software verification.
Project subjects may be selected related to a student’s thesis research; discuss this
opportunity with your advisor. Otherwise, your project should involve some level of
complexity comparable to the example ideas listed in the next subsection. I will be very
flexible on the nature of the project, but it must include (1) a general problem statement,
(2) a mathematical discussion of the solution technique, (3) a software solution of the general
problem, and (4) an evaluation of the quality of solution.
Success is not required; what is required is a thorough discussion and understanding of
the techniques and results.

26.1.3 Some project ideas


• (graduate) Optical character recognition of the digits 0-9.
• (graduate/ugrad) Voice recognition with a vocabulary of at least 5 words.
• (graduate) Simulation of real-time adaptation / system identification. For example,
model an inverted pendulum as θ̇ = ω and ω̇ = (sin(θ)+v(t))/J where J is the moment
of inertia, θ is the bar position, and v(t) is a motor control voltage input. It is desired
for the bar to track the command θref = t (rotate at a constant angular velocity).
Three possible tasks (ugrads can do any one of these, grads should try at least two,
preferably 3)
1. Design a working control law; train a neural network in simulation to accurately
model the behavior of the bar. (On-line training, not off-line.) I will call this the
model neural network, or MNN.
2. Design a neural network that trains on-line (not batch design) in an attempt
to control the bar position so that the closed loop dynamics are θ̇ = ω and
ω̇ = θref − θ. I will call this the control neural network, or CNN.
3. Combine the above two parts together: The CNN should update its weights at
each time step based on the behavior predicted by the MNN.
• (grad/ugrad) Duplication and/or verification of results in a technical paper on neural
networks.
• (grad/ugrad) Modify backpropagation training to include some "competitive learn-
ing" - reinforce the strongest connections, move the remaining neurons in the opposite
direction: Winner: w_i = w_i + \eta x(n) (Hebbian learning). Losers: w_j = w_j - \eta w_i for all
j \neq i. How does this compare to normal backpropagation?
• (ugrad) Write a manual for nnLib.c, nnLib.h so that future students can use your
manual in this course, or Fix one or more of the broken routines (radial basis function
routines) in nnLib.c and nnLib.h. (Get instructor approval first.)

Homework 7 Due Mon Oct. 27.

Notice: Exam 2 Will be on Monday Nov 3. Same rules as on the last exam, except that
you may bring a ruler so that you can draw a straight line.

1. written Recall that we perform data normalization by selecting

y^{(0)} = \Sigma_X^{-1/2}\,(x(n) - \mu_X)

where \mu_X \stackrel{\Delta}{=} E[x] = \frac{1}{N}\sum_{n=1}^{N} x(n), \Sigma_X \stackrel{\Delta}{=} E[(x-\mu_X)(x-\mu_X)^T], and the matrix \Sigma_X^{-1/2}
is selected so that

\Sigma_X^{-1/2}\,\Sigma_X\,\Sigma_X^{-1/2} = \Sigma_X\,\Sigma_X^{-1/2}\,\Sigma_X^{-1/2} = I

Show that \mu_{y^{(0)}} = 0 and \Sigma_{y^{(0)}} = I.

2. written Define b^{(1)} as the bias vector in layer 1 of a multilayer perceptron so that
v^{(1)} = W^{(1)} y^{(0)} + b^{(1)}. Consider the effect of data normalization on the output y^{(1)} of
the hidden layer:

y^{(1)}(n) = \phi\left(v^{(1)}(n)\right)
          = \phi\left(W^{(1)} y^{(0)}(n) + b^{(1)}\right)
          = \phi\left(W^{(1)} \Sigma_X^{-1/2}(x(n) - \mu_X) + b^{(1)}\right)

Notice that if we select

\tilde{W}^{(1)} = W^{(1)} \Sigma_X^{-1/2}

and

\tilde{b}^{(1)} = b^{(1)} - \Sigma_X^{-1/2} \mu_X

then v^{(1)} = \tilde{W}^{(1)} x(n) + \tilde{b}^{(1)}. That is, we can train a neural network directly off of the
raw data x and get exactly the same result as we got with the normalized data.
Is this analysis correct? If not, explain my error. If it is correct, then explain why we
normalize data in the first place.

26.2 Separability
Read §5.1-5.2 in Simon Haykin. Neural Networks: A Comprehensive Foundation. Prentice
Hall, 2nd edition, 1999

MLPs use y = \phi(w^T x), which results in linear/planar equipotential surfaces and hence in linear
separability.

Theorem 26.1 Cover, 1965: A complex pattern-classification problem nonlinearly cast in


a high-dimensional space is more likely to be linearly separable than in a low-dimensional
space.

Linearly separable =⇒ easy to solve.

Question 26.1 What if we use y = w^T \phi(x)?

Definition 26.1 Given a vector valued function \phi : \mathbb{R}^m \rightarrow \mathbb{R}^n and a vector w \in \mathbb{R}^n, the
corresponding separating surface \mathcal{X} = \mathcal{X}(\phi, w) is

\mathcal{X}(\phi, w) = \{x : w^T \phi(x) = 0\}

Definition 26.2 Classes C_1, C_2 are \phi-separable if there exists a vector w such that x \in C_1
\Rightarrow w^T\phi(x) < 0 and x \in C_2 \Rightarrow w^T\phi(x) > 0.

Idea Map inputs nonlinearly into hidden layer, then linearly to output layer.

Remark 26.1 Many possibilities of separating functions φ:

1. linear (as in MLP)

2. quadratic (or higher order polynomial)

3. hyperspheres

Fact 26.1 The more hidden nodes you have, the more likely your data is φ−separable.

Remark 26.2 §5.2 in the text discusses the probability of φ-separability in terms of a
binomial expansion and Bernoulli trials. We will not address this analysis in this course.

Example 26.1 x \in \mathbb{R}^2. With \phi(x) = \begin{bmatrix} e^{-(x_1-1)^2} \\ e^{-(x_2-0.5)^2} \end{bmatrix}:
M-file 26.1 radialEx00.m

% radialEx00.m
nx = 25; x1 = linspace(-5,5,nx); % set of data points
ny = 27; x2 = linspace(-5,5,ny);
w = [1;2]; % weight vector (picked arbitrarily)
x0 = [1;0.5]; % center of radial functions
zz = zeros(nx,ny); % compute surface values
for ii = 1:nx
for jj = 1:ny
xx = [x1(ii); x2(jj)] - x0;
zz(ii,jj) = w’ * exp( - xx .* xx );
end
end
% plot surface and equipotential surfaces
mesh(x1, x2, zz’); title(’Radial basis function example’);
printeps(’radialEx00a.eps’);
contour(x1, x2, zz’, 5); title(’Radial basis function example’);
printeps(’radialEx00b.eps’);

[Figure: "Radial basis function example" - mesh and five-level contour plots of the surface
w^T φ(x) produced by radialEx00.m.]

Many other choices: e.g., monomials:

Example 26.2 Monomial example:

M-file 26.2 radialEx01.m

% radialEx01.m
nx = 25; x1 = linspace(-5,5,nx); % set of data points
ny = 27; x2 = linspace(-5,5,ny);
w = [-1;1]; % weight vector (picked arbitrarily)
x0 = [1;0.5]; % center of radial functions
zz = zeros(nx,ny); % compute surface values
for ii = 1:nx
for jj = 1:ny
% calculate bizarre monomial for this example
xx = [x1(ii)*x2(jj); x1(ii) + x2(jj)] - x0;
zz(ii,jj) = w’ * xx;
end
end
% plot surface and equipotential surfaces
mesh(x1, x2, zz’); title(’Radial basis function example’);
printeps(’radialEx01a.eps’);
contour(x1, x2, zz’, 5); title(’Radial basis function example’);
printeps(’radialEx01b.eps’);

[Figure: mesh plot of the monomial-example surface produced by radialEx01.m]

[Figure: contour plot of the same surface]

Example 26.3 Exclusive or revisited:

M-file 26.3 radialEx02.m

% radialEx02.m: Exclusive or revisited; See Haykin ’99 Example 5.1


nx = 25; x1 = linspace(0,1,nx); % set of data points
ny = 27; x2 = linspace(0,1,ny);
w = [1;1]; % weight vector (picked arbitrarily)
t1 = [1;1]; t2 = [0;0];
zz = zeros(nx,ny); % compute surface values
for ii = 1:nx
for jj = 1:ny
xx = [x1(ii); x2(jj)];
phix = [ exp( -norm(xx-t1)^2 ); exp( -norm(xx-t2)^2 ) ];
zz(ii,jj) = 1 - w’ * phix;
end
end
% plot surface and equipotential surfaces
mesh(x1, x2, zz’); title(’Radial basis function example’);
printeps(’radialEx02a.eps’);
contour(x1, x2, zz’, 5); title(’Radial basis function example’);
printeps(’radialEx02b.eps’);

[Figure: mesh plot of the exclusive-or example surface produced by radialEx02.m]

[Figure: contour plot of the same surface]

27 2003 10 22: Radial Basis Functions (2)


Homework 6 solution
Solution

1. MATLAB Pattern classification with MLP’s

(a) Code download.


(b) My solution and output:
M-file 27.1 backPropEx5.m
% parameters for activation function phi
a = 1.5; b = 1; % activation function parameters
tt = (0:0.01:5)’;
sinewave=sin(pi*tt);
sawtooth = 2*abs(tt - floor(tt))-1;
square = 2*double ( floor(tt) == 2*floor(tt/2) )-1;
XX = [sinewave, sawtooth, square];
dd = eye(3);
mlpData = [ XX’, dd’]; % data format: each row is [x(n)’, d(n)’]
ni = length(tt); nh = 10; no = 3;
[inData,sigX, mX, sigY, mY] = mlpNormalize(mlpData,ni, no); % normalize data
eta = 0.002; alpha = 0.1; maxIter = 600;
startTime = clock; % time how long this takes.
[W1, W2, Yv, Errv, ErrHist] = mlpTrain(inData, ni, nh, no, ...
eta, alpha, a, b, maxIter);
trainingTimeSeconds = etime(clock,startTime)
for nn=1:3
xn = mlpData(nn,1:ni);
yy = mY’ + mlpT((xn - mX)/sigX,W1, W2,a,b);
fprintf(’%3d: yy=%12.4e %12.4e %12.4e\n’,nn, yy(1), yy(2), yy(3));
end
fn = 1; figure(fn);
semilogy(ErrHist);
grid on;
title(’Error history’);
eval(sprintf(’print -depsc backPropEx5_%d.eps’,fn));

>> backPropEx5
iteration 50; error = 0.0226
iteration 100; error = 0.0001
iteration 150; error = 0.0000
... [deleted a few lines]
iteration 550; error = 0.0000


iteration 600; error = 0.0000


trainingTimeSeconds = 108.8119
1: yy= 1.0000e+00 -4.7801e-07 1.7496e-07
2: yy= 8.2413e-07 1.0000e+00 2.2694e-07
3: yy= 1.9447e-07 -2.0178e-07 1.0000e+00
2. Recall that we normalize input data by computing the mean x̄ = (1/N) Σ_{n=1}^N x(n) ≜ E[x(n)]
and covariance Σ_x = (1/N) Σ_{n=1}^N (x(n) − x̄)(x(n) − x̄)^T ≜ E[(x(n) − x̄)(x(n) − x̄)^T] of a
data set {x(n)}, n = 1, ..., N. The random number generator randn in MATLAB generates in-
dependent, identically distributed pseudo-random Gaussian variables with mean 0 and
variance 1.

(a) Since randn produces pseudo-random numbers that are expected to be statistically
independent and identically distributed with mean 0 and variance 1, E[x] = [0; 0; 0]
and E[(x − x̄)(x − x̄)^T] = I (a 3 × 3 identity).
(b) My output:
NN=2
Computed mean value=
mX = 0.3569 -0.3223 -1.0545
Computed covariance =
sigX =
0.3875 -0.1407 -0.2030
-0.1407 0.0511 0.0737
-0.2030 0.0737 0.1064

NN=10
Computed mean value=
mX = -0.0467 0.0289 -0.2239
Computed covariance =
sigX =
1.7623 0.3045 -0.3040
0.3045 0.7776 0.2775
-0.3040 0.2775 0.7536

NN=100
Computed mean value=
mX = -0.0833 0.1086 -0.0263
Computed covariance =
sigX =


0.8703 -0.0681 -0.1425


-0.0681 0.9555 0.0438
-0.1425 0.0438 1.1427

NN=1000
Computed mean value=
mX = 0.0396 0.0240 0.0255
Computed covariance =
sigX =
0.9900 0.0016 -0.0090
0.0016 0.9878 -0.0206
-0.0090 -0.0206 0.9801
The smaller data sets are too small to give statistically reliable characterizations
of the mean and variance. This is illustrated by the histograms shown below.
[Figure: histograms of the randn data for N = 2, N = 10, N = 100, and N = 1000]

Notice that, even in the case of N = 1000, the data bears poor resemblance to a
bell curve.

3. This corresponds to a digital filter with a pole between 0 and -1; so its impulse response
would be oscillatory, but stable. That is, one would expect the momentum term to
oscillate. Also, since

∆W (n + 1) = α ∆W (n) + δ y^(0)

this implies that α < 0 would cause the next backpropagation step to “backtrack” - to
undo some of the update of the current backpropagation step - and so one would expect
slower convergence.
We tested these expectations by re-running M-file 25.1 (backPropEx2.m) with alpha = -exp(log(0.2...
(note the negative sign). This latter observation is consistent with the results shown
below, a comparison of the original training run with α > 0 to the results with α < 0:

[Figure: error history of the original training run with α > 0]

[Figure: error history of the training run with α < 0]

27.1 M-file s-function examples


27.1.1 Continuous time model
M-file 27.2 invPendS.m
function [sys,x0,str,ts] = invPendS(t,x,u,flag)
% The general form of an M-File S-function syntax is:
% [SYS,X0,STR,TS] = arca(T,X,U,FLAG,P1,...,Pn)
%
% Inputs:
% u = voltage
% x = [theta;omega]
% Outputs:
% y = [theta;omega]

switch flag,
case 0, [sys,x0,str,ts]=mdlInitializeSizes;
case 1, sys=mdlDerivatives(t,x,u);
case 2, sys=mdlUpdate(t,x,u);
case 3, sys=mdlOutputs(t,x,u);
case 4, sys=mdlGetTimeOfNextVarHit(t,x,u);
case 9, sys=mdlTerminate(t,x,u);
otherwise error([’Unhandled flag = ’,num2str(flag)]);
end

%=============================================================================
% mdlInitializeSizes
% Return the sizes, initial conditions, and sample times for the S-function.
%=============================================================================
function [sys,x0,str,ts]=mdlInitializeSizes

sizes = simsizes;

sizes.NumContStates = 2;
sizes.NumDiscStates = 0;
sizes.NumOutputs = 2;
sizes.NumInputs = 1;
sizes.DirFeedthrough = 0;
sizes.NumSampleTimes = 1; % at least one sample time is needed
sys = simsizes(sizes);
x0 = [0;0]; % initial conditions
str = []; % str is always an empty matrix
ts = [0]; % initialize the array of sample times
return


%=============================================================================
% mdlDerivatives
% Return the derivatives for the continuous states.
%=============================================================================
function dx=mdlDerivatives(t,x,u)
dx = [x(2); (sin(x(1)) + u)];
% limit theta to stay within [-pi,pi]
thlim = min(max(x(1),-pi),pi);
dx(2) = dx(2) -100*(x(1)-thlim);
return

%=============================================================================
% mdlUpdate
% Handle discrete state updates, sample time hits, and major time step
% requirements.
%=============================================================================
function sys=mdlUpdate(t,x,u)
sys = [];
return

%=============================================================================
% mdlOutputs
% Return the block outputs.
%=============================================================================
function y=mdlOutputs(t,x,u);
y = x;
return
%
%=============================================================================
% mdlGetTimeOfNextVarHit
% Return the time of the next hit for this block. Note that the result is
% absolute time. Note that this function is only used when you specify a
% variable discrete-time sample time [-2 0] in the sample time array in
% mdlInitializeSizes.
%=============================================================================
%
function sys=mdlGetTimeOfNextVarHit(t,x,u)

sampleTime = 1; % Example, set the next hit to be one second later.


sys = t + sampleTime;

% end mdlGetTimeOfNextVarHit


%=============================================================================
% mdlTerminate
% Perform any end of simulation tasks.
%=============================================================================
%
function sys=mdlTerminate(t,x,u)

sys = [];

return

% end mdlTerminate

27.1.2 Discrete time model


M-file 27.3 nnSysIdEx.m

function [sys,x0,str,ts] = nnSysId(t,x,u,flag)


% The general form of an M-File S-function syntax is:
% [SYS,X0,STR,TS] = arca(T,X,U,FLAG,P1,...,Pn)
% Inputs:
% u = [current voltage input, current system output] 3x1
% x = last system output (2x1), weights (2x11, 10x4) - total of 52
% Outputs:
% y = [theta;omega] predicted value

switch flag,
case 0, [sys,x0,str,ts]=mdlInitializeSizes;
case 1, sys=mdlDerivatives(t,x,u);
case 2, sys=mdlUpdate(t,x,u);
case 3, sys=mdlOutputs(t,x,u);
case 4, sys=mdlGetTimeOfNextVarHit(t,x,u);
case 9, sys=mdlTerminate(t,x,u);
otherwise error([’Unhandled flag = ’,num2str(flag)]);
end

%=============================================================================
% mdlInitializeSizes
% Return the sizes, initial conditions, and sample times for the S-function.
function [sys,x0,str,ts]=mdlInitializeSizes
sizes = simsizes;
sizes.NumContStates = 0;
sizes.NumDiscStates = 2 + 2*11 + 4*10;
sizes.NumOutputs = 2;


sizes.NumInputs = 3;
sizes.DirFeedthrough = 1;
sizes.NumSampleTimes = 1; % at least one sample time is needed
sys = simsizes(sizes);
x0 = randn(sizes.NumDiscStates,1); % initial conditions
str = []; % str is always an empty matrix
ts = [0.1, 0]; % initialize the array of sample times
return

%=============================================================================
% mdlDerivatives
% Return the derivatives for the continuous states.
function dx=mdlDerivatives(t,x,u)
dx = []
return

%=============================================================================
% mdlUpdate
% Handle discrete state updates, sample time hits, and major time step
% requirements.
function nextStates=mdlUpdate(t,x,u)
[W1,W2,yk1] = unpackStates(x); % do a backpropagation step
nextStates = x;
return

%=============================================================================
% mdlOutputs
% Return the block outputs.
function y=mdlOutputs(t,x,u);
% output is current estimate of next output
[W1,W2,yk1] = unpackStates(x);
y0 = [u(1);yk1];
y = mlpT(y0,W1,W2,1.5,1);
return

%
%=============================================================================
% mdlGetTimeOfNextVarHit
% Return the time of the next hit for this block. Note that the result is
% absolute time. Note that this function is only used when you specify a
% variable discrete-time sample time [-2 0] in the sample time array in
% mdlInitializeSizes.
%=============================================================================
%


function sys=mdlGetTimeOfNextVarHit(t,x,u)
sampleTime = 1; % Example, set the next hit to be one second later.
sys = t + sampleTime;
% end mdlGetTimeOfNextVarHit

%
%=============================================================================
% mdlTerminate
% Perform any end of simulation tasks.
%=============================================================================
%
function sys=mdlTerminate(t,x,u)
sys = [];
return

% end mdlTerminate

% unpack the states in the vector x into an easy-to use form


function [W1,W2,yk1] = unpackStates(x)
m1=10; % Neural net dimensions
n1=4;
m2=2;
n2 = 11;
W2LastIndex = 2 + m2*n2;
W1FirstIndex = W2LastIndex + 1;
W1LastIndex = W2LastIndex + m1*n1;
yk1 = x(1:2);
W2 = reshape(x(3:W2LastIndex),m2,n2);
W1 = reshape(x(W1FirstIndex:W1LastIndex),m1,n1);
return


27.2 Lecture notes: handwritten today


Read See 20031022p*.jpg


28 2003 10 24; RBF’s (3)


Read Handwritten notes in 20031023p*.jpg

28.1 Interpolation result


Read §5.4 Simon Haykin. Neural Networks: A Comprehensive Foundation. Prentice Hall,
2nd edition, 1999

Simple (memory-based learning) approach: Given {x(n), d(n)}, n = 1, ..., N, choose a set of nc = N
centers t(n) = x(n) with corresponding output vectors w(n), n = 1, ..., nc, and compute

F(x) = Σ_{i=1}^{nc} w(i) φ(||x − t(i)||)
     = [ w(1) · · · w(nc) ] [ φ(||x − t(1)||) ; ... ; φ(||x − t(nc)||) ]
     ≜ W φ(x)

so that F(t(i)) = d(i). Find W such that

[ φ11 · · · φ1N ; ... ; φN1 · · · φNN ] [ w(1)^T ; ... ; w(N)^T ] = [ d(1)^T ; ... ; d(N)^T ]

or

ΦW = D

How do we know that Φ is invertible? If the points t(n) are distinct then lots of φ RBF’s
will work.
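A minimal MATLAB sketch of this exact-interpolation construction, using Gaussian basis functions. The four data points (the XOR corners), the unit spread, and the variable names are illustrative choices, not taken from the text.

% sketch: exact RBF interpolation with one center per data point
% build Phi with Phi(i,j) = phi(||t(i) - t(j)||) and solve Phi*w = d
xx = [0 0; 0 1; 1 0; 1 1]';          % four input points (columns): the XOR corners
dd = [0; 1; 1; 0];                   % desired (scalar) outputs d(n)
N  = size(xx,2);
Phi = zeros(N,N);
for ii = 1:N
  for jj = 1:N
    Phi(ii,jj) = exp( -norm(xx(:,ii) - xx(:,jj))^2 );   % Gaussian phi, width 1
  end
end
ww = Phi \ dd;                       % interpolation weights, so F(t(i)) = d(i)
Fx = 0;                              % check: evaluate F at center 2
for jj = 1:N
  Fx = Fx + ww(jj)*exp( -norm(xx(:,2) - xx(:,jj))^2 );
end
Fx                                   % should print 1 (= d(2))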


29 2003 10 27 RBF’s (4)


29.1 Homework 7 solution
Solution

Notice: Exam 2 will be on Monday Nov 3. Same rules as last time, except that you may
bring a ruler so that you can draw a straight line. No calculators, no notes, no references.

1. With y^(0)(n) = Σ_X^{-1/2} (x(n) − µ_X):

   µ_{y^(0)} = (1/N) Σ_{n=1}^N Σ_X^{-1/2} (x(n) − µ_X)
             = Σ_X^{-1/2} (1/N) Σ_{n=1}^N (x(n) − µ_X)
             = Σ_X^{-1/2} ( (1/N) Σ_{n=1}^N x(n) − µ_X )
             = Σ_X^{-1/2} (µ_X − µ_X) = 0.

   Σ_{y^(0)} = (1/N) Σ_{n=1}^N y^(0)(n) y^(0)(n)^T
             = (1/N) Σ_{n=1}^N Σ_X^{-1/2} (x(n) − µ_X) (x(n) − µ_X)^T Σ_X^{-1/2}
             = Σ_X^{-1/2} ( (1/N) Σ_{n=1}^N (x(n) − µ_X)(x(n) − µ_X)^T ) Σ_X^{-1/2}
             = Σ_X^{-1/2} Σ_X Σ_X^{-1/2} = I.
N n=1

2. The analysis is correct: the two methods do in fact give the same result. The difference
is that data normalization moves the internal values v (k) into the quasi-linear parts of
the activation functions φ so that learning can occur faster.


Homework 8 Due Fri Oct. 31.

Notice: Exam 2 Will be on Monday Nov 3. Same rules as on the last exam, except that
you may bring a ruler so that you can draw a straight line.

1. written Problem 5.13, p. 3.16 in Simon Haykin. Neural Networks: A Comprehensive


Foundation. Prentice Hall, 2nd edition, 1999. (Undergraduate students may assume
that all variables and parameters are scalars.)


29.2 What functions to use?


Read §5.5-5.7

Haykin applies signal processing theory (system identification/inverse problems) to RBF


networks.
Problem: What if data (or the original problem) is ill conditioned?
Definition 29.1 A problem is ill-conditioned if ...
Definition 29.2 A problem is backward stable if ...

29.2.1 Tikhonov regularization


Given input/output data pairs (xi, di) and a function F(x), define yi = F(xi). The RBF function
F(x) attempts to minimize a modified error term

E(F) = Es(F) + λ Ec(F)

Standard error term: Es(F) ≜ (1/2) Σ_{i=1}^N (di − yi)^2

Regularization term: Ec(F) ≜ (1/2) ||DF||^2

• D: “linear differential operator” – in other words, a filter, probably multidimensional.
  Use it to enforce frequency-domain constraints on your solution function.
  Example: DF = [ ∂²/∂x1² · · · ∂²/∂xm² ] F penalizes large changes in slope.

• λ > 0: weight value that specifies how important the regularization penalty is relative
  to the input data.
  Don’t confuse λ here with a Lagrange multiplier.

Remark 29.1 The text makes use of some powerful mathematics (see, e.g., D. G. Luen-
berger. Optimization by Vector Space Methods. Wiley and Sons, Inc., New York, NY, 1969)
to justify the use of Green’s functions. For appropriate choice of operator D, these Green’s
functions are the Gaussian radial basis functions we’re considering. The significance of Gaus-
sian functions is that, in terms of wavelet theory, they provide an optimal trade-off between
state-space localization (“time-domain”) and frequency localization in the sense of Heisen-
berg’s uncertainty principle. See Gilbert Strang and Truong Nguyen. Wavelets and Filter
Banks. Wellesley-Cambridge Press, Wellesley, MA, 1996.
Remark 29.2 Strict enforcement of the choice of operator D leads to a choice of basis
functions φ that satisfy Dφ = 0.


29.2.2 Solution of the problem

Read §5.7

Define φi(x) = φ(||x − ti||). The RBF network output is F(x) = Σ_{i=1}^{m(1)} wi φi(x). Let

d  = [ d1 · · · dN ]^T
w  = [ w1 · · · wm(1) ]^T
G  = [ φ1(x1) · · · φm(1)(x1) ; ... ; φ1(xN) · · · φm(1)(xN) ]        (an N × m(1) matrix)
G0 = [ φ1(t1) · · · φm(1)(t1) ; ... ; φ1(tm(1)) · · · φm(1)(tm(1)) ]  (an m(1) × m(1) matrix)

Then the minimizing solution for the weights w is

( G^T G + λ G0 ) w = G^T d
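A minimal MATLAB sketch of forming and solving this regularized system. The noisy sine data, the Gaussian basis functions, the centers, the spread, and λ are all illustrative choices, not from the text.

% sketch: regularized RBF weights from (G'G + lambda*G0) w = G' d
N = 30; xi = linspace(0,1,N)';            % training inputs x_i
di = sin(2*pi*xi) + 0.1*randn(N,1);       % noisy targets d_i
m1 = 8; ti = linspace(0,1,m1)';           % m(1) centers t_i (fewer centers than data)
sig = 0.15; lambda = 1e-3;                % spread and regularization weight
G  = exp( -(xi*ones(1,m1) - ones(N,1)*ti').^2 / (2*sig^2) );   % G(i,j)  = phi_j(x_i)
G0 = exp( -(ti*ones(1,m1) - ones(m1,1)*ti').^2 / (2*sig^2) );  % G0(i,j) = phi_j(t_i)
w  = (G'*G + lambda*G0) \ (G'*di);        % minimizing weights
yfit = G*w;                               % fitted values at the training inputs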


29.3 Training RBF networks

Read §5.13

Don’t want lots of RBF’s (use fewer than the number of data points), because of:

• Overtraining

• Computational cost

RBF function

F(x) = Σ_{i=1}^m wi φi(x)

where

φi(x) = e^{ −(x − ti)^T Σi^{-1} (x − ti) }

ti is the center of φi. Σi controls the “spread” of φi.


Example 29.1 Two RBF’s added together.

F(x) = Σ_{i=1}^n wi exp( −(x − ti)^T Σi^{-1} (x − ti) )

with w1 = 3, w2 = 1, t1 = [4; 2], t2 = [1; −2], Σ1 = [5, 2; 2, 3] and Σ2 = [0.2, 0.04; 0.04, 0.5].
Plot in Figure 36.


[Figure 36: RBF network with two RBFs (mesh plot of F(x)).]
Alternative form (similar to Fuzzy Logic):

F(x) = ( Σ_{i=1}^m wi φi(x) ) / ( Σ_{i=1}^m φi(x) )

Example 29.2 Fuzzy RBF network added together. M-file below; plot in Figure 37. Notice
that between centers we get interpolation, while one RBF tends to dominate away from the
centers.

M-file 29.4 rbf0502.m

% rbf0502: example radial basis function network


t1 = [4;2]; Sig1 = [5,2; 2,3];
t2 = [1;-2]; Sig2 = [0.2,0.04; 0.04,0.5];

w = [3;1];

nx = 40; xx = linspace(0,8,nx);
ny = 41; yy = linspace(-5,8,ny);


[Figure 37: RBF network with two RBFs (normalized form).]

zm = zeros(nx,ny);

for ix = 1:nx
for iy = 1:ny
xn = [xx(ix); yy(iy)];
e1 = xn -t1; e2 = xn - t2;
zm(ix,iy) = w’*[ exp(-e1’*(Sig1\e1)) ; exp(-e2’*(Sig2\e2)) ]/ ...
sum([ exp(-e1’*(Sig1\e1)) ; exp(-e2’*(Sig2\e2)) ]);
end
end

title(’RBF example function’);


xlabel(’x’);
ylabel(’y’);
mesh(xx,yy,zm’);
printeps("rbf0502.eps");

Training involves selection of parameters ti , Σi , and wi .
From [Hay99]: adaptation formulae for RBF network: Requires Green’s function G(x)
and its first derivative G0 (x) = dG/dx with respect to scalar x.


Linear weights:

∂E(n)/∂wi(n) = Σ_{j=1}^N ej(n) G( ||xj − ti(n)||_{Ci} )

wi(n + 1) = wi(n) − η1 ∂E(n)/∂wi(n),    i = 1, 2, ..., m(1)

Center positions (can interpret as a hidden-layer node parameter):

∂E(n)/∂ti(n) = 2 wi(n) Σ_{j=1}^N ej(n) G′( ||xj − ti(n)||_{Ci} ) Σi^{-1} (xj − ti(n))

ti(n + 1) = ti(n) − η2 ∂E(n)/∂ti(n),    i = 1, ..., m(1)

Spread parameters:

∂E(n)/∂Σi^{-1}(n) = −wi(n) Σ_{j=1}^N ej(n) G′( ||xj − ti(n)||_{Ci} ) Qji(n)

Qji(n) = (xj − ti(n)) (xj − ti(n))^T

Σi^{-1}(n + 1) = Σi^{-1}(n) − η3 ∂E(n)/∂Σi^{-1}(n),    i = 1, ..., m(1)

29.3.1 Selection of output (interpolation) weights w

Suppose ti, Σi are held fixed, i = 1, ..., m. We get a least squares problem:

min_w J(w):

J(w) = Σ_{i=1}^N || di − F(xi) ||^2
     = Σ_{i=1}^N ( di − Σ_{j=1}^m wj φj(xi) )^2

min_w || [ d1 ; ... ; dN ] − [ φ1(x1) · · · φm(x1) ; ... ; φ1(xN) · · · φm(xN) ] [ w1 ; ... ; wm ] ||^2


[Figure 38: Hodel’s idea for automated selection of spread matrices Σ.]

29.3.2 Selection of centers ti


29.3.3 Selection of covariance (spread) matrices Σ
Hodel’s heuristic “Σ-shaping.” See Figure 38. The idea is to maximize spread subject to
“non-interference” between radial basis functions. This can be set up as a convex pro-
gramming problem [NN94], but we will utilize a simple suboptimal procedure for this task.
Assume that the centers ti have been chosen to reflect the variation in the data (sensitivity
of di to xi ).
1. For i = 1, ..., nφ

(a) Select center point ti .


(b) For each j ≠ i in {1, ..., nφ}
    i. Evaluate φi halfway between ti and tj, i.e., define pij = φi( (ti + tj)/2 ). If pij ≤
       1/2, then we say that φi does not interfere with φj at tj. Conversely, if
       pij > 1/2, then φi interferes with φj, i.e., φi is spread out too far in the
       direction tj − ti from the center point ti of φi.
    ii. If φi interferes with φj, then we reduce the spread of φi as follows. Define
        S = Σi^{-1} and assume without loss of generality that tj − ti = α e1 where e1
        is the first column of the identity matrix. Then

        pij = e^{ −(tj − ti)^T Σi^{-1} (tj − ti)/4 } = exp( −(1/4) ||tj − ti||^2 s11 )

        We achieve the constraint pij ≤ 1/2 by scaling s11 appropriately.
        In the general case where tj − ti is not a multiple of e1, we may use an
        orthogonal coordinate transformation to perform the update as shown in the
        following example.
(c) end for

2. end for


Example 29.3 M-file 29.5 rbfSig.m


% rbfSig.m: check ‘‘spread’’ matrix idea
rand(’seed’,1); % for repeatable experiments
N = 20; xx = rand(2,N); % select 20 centers at random

% allocate space for spread matrices:


Sigs = zeros(4,N); % column i contains spread matrix i (vector stack)

for nn = 1:N
xn = xx(:,nn);

% get matrix of distances to other center points


idx = complement(nn,1:N); xErr = xx(:,idx) - xn*ones(1,N-1);

% initialize to have a ‘‘wide’’ spread.


Sigi_n = 0.01*eye(2); % Sigi -> "inv(Sigma)"

% scale in each direction to avoid conflict with other points


for ni = 1:length(idx)
ei = xErr(:,ni)/2;
pij = exp(-ei’*Sigi_n*ei);
if(pij > 1/2)
% transform coordinates so that ei is 1st column of identity
[Q,r] = qr(ei); SigTmp = Q’*Sigi_n*Q;

% scale leading diagonal value


dv = SigTmp(1,1); ne = norm(ei)^2; dv = log(1/2)/(-ne);

% update and back-transform


SigTmp(1,1) = dv; Sigi_n = Q*SigTmp*Q’;
end
end

% store value in spread matrix database


SigInv = Sigi_n;
Sigs(:,nn) = reshape(SigInv,4,1);
end

% compute mesh of RBF’s


xmin = min(xx’); xmax = max(xx’);

nx = 26; xv = linspace(xmin(1), xmax(1),nx);


ny = 27; yv = linspace(xmin(2), xmax(2),ny);
zm = zeros(nx, ny);


for ix = 1:nx
for iy = 1:ny
xp = [xv(ix); yv(iy)];
for nn = 1:N
ti = xx(:,nn);
SigI = reshape(Sigs(:,nn),2,2);
zm(ix,iy) = zm(ix,iy) + exp( -(xp - ti)'*SigI*(xp-ti) );
end
end
end

% plot the centers


figure(1)
title("Centers");
xlabel("x");
ylabel("y");
plot(xx(1,:), xx(2,:),’xx’);
printeps("rbfSig01.eps");

% plot sum of RBFS as a mesh


figure(2)
title("Sum of RBF functions with spread shaping")
xlabel("x");
ylabel("y");
mesh(xv, yv, zm’)
printeps("rbfSig02.eps");

% plot sum of RBFS as a contour plot


figure(2)
title("Contour plot: sum of RBF functions with spread shaping")
xlabel("x");
ylabel("y");
contour(xv, yv, zm’, 10)
printeps("rbfSig03.eps");

rbfSig.m Output

ans = 1
ans = 2
ans = 2

Results in Figures 39–41.


[Figure 39: For example 29.3 — plot of the 20 randomly selected centers.]

[Figure 40: For example 29.3 — mesh plot of the sum of RBF functions with spread shaping.]


[Figure 41: For example 29.3 — contour plot of the sum of RBF functions with spread shaping.]


30 2003 10 31: RBF’s (cont’d)


30.1 Homework 8 solution
Solution Make use of Lemma 11.1.

E ≜ (1/2) Σ_{n=1}^N e(n)^T e(n),    e(n) ≜ d(n) − y(n),    y(n) ≜ F(x(n))

F(x) ≜ Σ_k w(k) φk(x),    φk(x) ≜ φ(vk(x)),    φ(v) ≜ e^v

vk(x) = −(x − t(k))^T Σ(k)^{-1} (x − t(k)) / 2

∂E/∂wi(k) = covered in class

∂E/∂tj(k) = Σ_{n=1}^N Σ_{i=1}^p ( ∂E/∂ei(n) ) ( ∂ei(n)/∂yi(n) ) ( ∂yi(n)/∂φk(x(n)) ) ( ∂φk(x(n))/∂tj(k) )
          = Σ_{n=1}^N Σ_{i=1}^p ei(n) (−1) wij(k) ( ∂φk(x(n))/∂tj(k) )
          = Σ_{n=1}^N Σ_{i=1}^p −ei(n) wij(k) ( ∂φk(x(n))/∂vk(x(n)) ) ( ∂vk(x(n))/∂tj(k) )
          = Σ_{n=1}^N Σ_{i=1}^p −ei(n) wi(k) φ(vk(x(n))) ( ∂vk(x(n))/∂tj(k) )

∂vk(x)/∂t(k) = −(1/2) ∂/∂t(k) [ (x − t(k))^T Σ(k)^{-1} (x − t(k)) ]
             = −(1/2) ∂/∂t(k) [ x^T Σ(k)^{-1} x − 2 x^T Σ(k)^{-1} t(k) + t(k)^T Σ(k)^{-1} t(k) ]
             = Σ(k)^{-1} x − Σ(k)^{-1} t(k),   and the result follows.

Similarly, with S(k) = Σ(k)^{-1},

∂E/∂sij(k) = Σ_{n=1}^N Σ_{i=1}^p −ei(n) wi(k) φ(vk(x(n))) ( ∂vk(x(n))/∂sij(k) )

∂vk(x(n))/∂S(k) = −(1/2) (x(n) − t(k)) (x(n) − t(k))^T

Notice that my solution differs from the text because I include the factor of 1/2 in vk(x).


31 2003 10 31 Principal Components Analysis


Read §8.3

From Hebb (1949).

• Projection onto a vector q

• E[ (q^T x)(x^T q) ] = q^T R q, where R = E[x x^T]

• SVD =⇒ max_q q^T R q =⇒ q = dominant eigenvector of R.

Eigenvectors of R are orthogonal.
idea: order λ1 ≥ · · · ≥ λn ≥ 0

What if λ3 ≫ λ4 ? Approximate as rank 3, keep only 3 dimensions of data.

• encode: y = Q3^T x

• decode: x̂ = Q3 y

• Error x − x̂ orthogonal to Q3 (small).

Neural network idea: can we get Q3 without lots of nasty eigenvalue problems?
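For reference, the eigenvalue-based encode/decode looks like this in MATLAB (a minimal sketch with synthetic rank-3 data; the sizes and names are illustrative). The question above is whether a network can produce Q3 adaptively instead of computing the eigendecomposition explicitly.

% sketch: compress data onto the dominant 3 eigenvectors of R and reconstruct
N = 200; X = randn(5,3)*randn(3,N);       % 5-dimensional data of rank 3
R = (X*X')/N;                             % correlation matrix R = E[x x^T]
[Q,lamM] = eig(R);
[lam,idx] = sort(-diag(lamM)); lam = -lam;% sort eigenvalues, largest first
Q3 = Q(:,idx(1:3));                       % dominant 3 eigenvectors
Y    = Q3'*X;                             % encode: 3 numbers per sample
Xhat = Q3*Y;                              % decode
norm(X - Xhat)                            % essentially zero, since X has rank 3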


32 Exam 2 solutions
Name

Scores
1. (25pts)
2. (25pts)
3. (25pts)
4. (25pts)
Total:

Note Show your work. Multiple choice questions may have more than one correct answer.
Whether or not your answer will be judged to be “correct” depends on your response under
the word “explain.”

Permitted resources for this exam None. You may use a pencil or pen (use a pen
only if you never make mistakes), an eraser, and a ruler. You are not permitted to use a
calculator, textbook, written notes, oral or written communication with other people (besides
the instructor/GTA), laptop computers, cell phones, PDAs, wireless modems, telepathic
contact, or any other resources besides your own mind, body and a writing utensil. T-shirts
with Maxwell’s equations on the back will be tolerated, but I reserve the right to reseat you
in the back of the classroom. You are to sit with at least one empty chair between you and
the nearest classmate. Use of unauthorized resources on this exam will result in a failing
grade.

Design a human brain using only MOSFETs, a five volt power supply, and kite string. State
all assumptions.


32.1
Consider the function plotted below:
[Figure: plot of the function sinc(2πx) for |x| ≤ 1, 0 elsewhere.]

It is desired to approximate this function with a neural network.


1. How many inputs does the network have? (circle 1)
   1    2    3    4    none of these
   (Answer: 1.)

2. How many outputs does the network have? (circle 1)
   1    2    3    4    none of these
   (Answer: 1.)
3. What is the minimum number of hidden nodes in a multi-layer perceptron that can
give a “fairly good approximation” of the sinc function? (circle one)
1 3 6 10 none of these
Explain.
Solution Acceptable answers:
6: two for each “hump” in the diagram: one for the leading edge, one for the
trailing edge.
None of these (four): use two neurons to create a trough from -1 to 1 with depth
of about -0.25, and two more neurons to create the peak in the middle between about
-0.75 and 0.75.
None of these (many): the function’s change in slope at -1 and 1 is not smooth, and
so will require a lot of neurons to model accurately.


32.2
Consider the function plotted below:

[Figure: surface plot of sinc(2π(x² + 2y²)) for x² + 2y² ≤ 2, 0 elsewhere.]

It is desired to approximate this function with a neural network.

1. How many inputs does the network have? (circle 1)
   1    2    3    4    none of these
   (Answer: 2.)

2. How many outputs does the network have? (circle 1)
   1    2    3    4    none of these
   (Answer: 1.)
3. What is the minimum number of hidden nodes in a multi-layer perceptron that can
give a “fairly good approximation” of the sinc function? (circle one)
1 3 6 10 none of these
Explain.
Solution Acceptable answers:

None of these (two) Use one function to create the wider trough in the middle of
the plane, a second function to add the peak. However, this would likely not give
a good match to the data.
None of these (many) This function has a fairly flat top and a strange shape, so a
good match would probably require many RBF’s to get good interpolation behavior.


32.3
Consider a data set {x(n), d(n)}, n = 1, ..., N, where x(n) ∈ IR^m and d(n) ∈ IR^p. Define the error
function

E ≜ (1/2) Σ_{n=1}^N e(n)^T e(n)

where e(n) = d(n) − y(n) and y(n) = W^(2) φ( W^(1) x(n) + b^(1) ), where b^(1) ∈ IR^h. Define
E(n) = (1/2) e(n)^T e(n). Find

∂E(n)/∂W^(2) ≜ [ ∂E(n)/∂W^(2)_{1,1} · · · ∂E(n)/∂W^(2)_{1,h} ; ... ; ∂E(n)/∂W^(2)_{p,1} · · · ∂E(n)/∂W^(2)_{p,h} ]

Note Undergraduates may choose to derive ∂E(n)/∂W^(2)_{ij}, a scalar valued gradient, instead.

∂E(n)/∂W^(2) = −e(n) ( y^(1)(n) )^T

Show your work. If you need more space, work on the back side of this page. If you need
more space than that ... try to write smaller.

Solution Apply the chain rule. First, notice that ∂y_k^(2)/∂W^(2)_{ij} = 0 if k ≠ i. From this result, there
is no need to sum the partial derivatives over all entries in e = [ e1(n) · · · em(n) ]^T.

∂E(n)/∂W^(2)_{ij} = ( ∂E(n)/∂ei(n) ) ( ∂ei(n)/∂yi(n) ) ( ∂yi(n)/∂W^(2)_{ij} )
                 = ei(n) (−1) y_j^(1)(n)

which is the answer for undergraduates. Write the above gradient in terms of all combinations
of i and j to get the answer listed above.


32.4
Consider the unit square problem we’ve been working all semester. A friend (who is not in our
class) suggests that instead of using φ(v) = tanh(v) you should use φ(v) = tanh(10v),
since the latter function is much steeper and so it can give a better approximation to the
unit square.

[Figure: activation function comparison — plots of tanh(x) and tanh(10x).]

Evaluate your friend’s suggestion. Discuss (1) whether or not the use of φ(v) = tanh(10v)
can give a better approximation to the data than the use of φ(v) = tanh(v), (2) how the use
of φ(v) = tanh(10v) will affect input data normalization, and (3) how the use of φ(v) =
tanh(10v) will affect the backpropagation training algorithm.
If you need more space, you may continue your answer to this problem on the back of this page.
Solution It changes the training, but doesn’t improve things at all.

1. Consider output layer y (2) = φ(W (2) y (1) ) with φ = tanh(10v). We can use φ(v) =
tanh(v) instead by redefining W (2) := 10W (2) , that is, just absorb the factor of 10 into
the network weights. Hence, the proposed function does not provide the capability of a
better approximation than tanh by itself.

2. Data normalization is performed to ensure that the expected value of initial network
weights is in the “linear” region of the activation function. The use of φ(v) = tanh(10v)
sets the “linear” region to a much more narrow area, hence the initial weights must be
initialized to be one-tenth of what we’d usually do, i.e., in MATLAB code, Winitial = rand(m,n)/(10

3. The derivative of tanh(10v) is 10 times larger (and 10 times more narrow) than
that of tanh(v). As a result, the backpropagation step size should be adjusted to be one-
tenth of what one would use for tanh(v). Further, since the derivative is effectively zero
outside of a narrow range, one would not expect a large number of neurons to be able
to “find” the boundary edges of the unit square unless they happened to be initialized
“just right.” On the other hand, since initial weights (listed above) are selected to
compensate for the factor of 10 in the activation function, the smaller step size may
result in comparable training behavior.
Either way, the proposed method doesn’t provide any clear advantage over φ(v) =
tanh(v).


33 2003 11 05: PCA (2)


See scanned notes.


34 2003 11 07: PCA (3)


Example 34.1 20 sinusoids sin(ωt + φ) where ω ∈ [1.0, 1.1] rad/s, φ ∈ [0, 0.5] rad, 20 sinc
functions sinc(ωt + φ), and 20 square waves sign(sin(ωt + φ)). Gaussian noise is added with
standard deviation σ = 0.05. Input signals are shown in Figure 42.

M-file 34.1 pcaEx1.m

function pcaEx1
format short e
mm = 150;
tt = linspace(-2,2,mm);
rand(’seed’,0);
randn(’seed’,0);
nsets = 20;
for ii=1:nsets
om = 1 + rand/10;
ph = rand/2;
rt = om*tt + ph;
mydat(ii,1:mm) = sin(rt);
rt = rt*5;
mydat(ii+nsets,1:mm) = sin(rt) ./ rt;
mydat(ii+2*nsets,1:mm) = sign(sin(rt));
fprintf(’%4d: om=%12.4f ph = %12.4f\n’,ii,om,ph);
end
mydat = mydat + randn(size(mydat))/20;
NN = size(mydat,1);

xbar = mean(mydat);
zmdat = mydat - ones(NN,1)*xbar;

sigx = zeros(mm,mm);
for nn=1:NN
xn = zmdat(nn,:)’;
sigx = sigx + xn*xn’/NN;
end
[VV,lam] = eig(sigx);
lam = diag(lam);
[lam,idx] = sort(-lam); lam = -lam;
VV = VV(:,idx); % reorder eigenvectors
chkval = (1e-6)*max(lam);
lamsiz = size(lam)
chkvalsiz = size(chkval);
idx = find(lam > chkval);


V3 = VV(:,1:3);
approxDat = (zmdat*V3)*V3’ + ones(NN,1)*xbar;

figure(1);
plot(tt,mydat,’-’);
title(’Principal components analysis example’);
xlabel(’time (sec)’);
ylabel(’signal value’);
grid on;
axis([-2,2,-1.2,1.2]);
printeps(’pcaEx1.eps’);

figure(2);
semilogy(idx,lam(idx),’x’);
axis ;
grid on;
printeps(’pcaEx1a.eps’);

figure(3);
plot(tt,xbar);
grid on;
title(’mean value vector’);
printeps(’pcaEx1b.eps’);

figure(4);
plot(tt,VV(:,1:3));
title(’dominant eigenvectors’);
xlabel(’time (s)’);
ylabel(’signal value’);
grid on;
printeps(’pcaEx1c.eps’);

figure(5);
plot(tt,approxDat);
title(’approximated data with dominant 3 eigenvectors’);
xlabel(’time (s)’);
ylabel(’signal value’);
grid on;
printeps(’pcaEx1e.eps’);

% power iteration to get the dominant eigenvector


w0 = rand(mm,1);
w0 = w0/norm(w0);
ww = w0;


imax = 6
for kk=1:imax
ww = sigx*ww;
ww = ww/norm(ww);
end
plot(tt,VV(:,1),’-’,tt,w0,’--’,tt,ww,’-.’);
grid on;
legend(’v_1’,’w(0)’,sprintf(’w(%d)’,imax));
xlabel(’time (s)’);
ylabel(’vector waveform’)
title(’Principal Compenents Analysis example’);
printeps(’pcaEx1f.eps’);

% Neural network approach: get a single eigenvector


eta = 0.0001;
imax = 20;
nv = 5;
WW = rand(mm,nv);
for kk=1:nv
WW(:,kk) = WW(:,kk)/norm(WW(:,kk));
end
for iter = 1:imax
lamh(1:5,iter) = diag(WW’*sigx*WW);
WW= sigx*WW;
[jnk,idx] = sort(rand(NN,1));
idx = 1:NN;
for nn=1:NN
xn = mydat(idx(nn),:)’;
yy = WW’*xn;
WW = WW + eta*xn*yy’;
for kk=1:nv
% project and normalize
for jj=1:(kk-1)
WW(:,kk) = WW(:,kk) - WW(:,jj)*( WW(:,jj)’*WW(:,kk) );
end
WW(:,kk) = WW(:,kk)/norm(WW(:,kk));
end
end
end
plot(1:imax,lamh,’+’,1:imax,reshape(lam(1:nv),nv,1)*ones(1,imax),’:’);
legend(’l1e’,’l2e’,’l3e’,’l4e’,’l5e’);
title(sprintf(’Neural network PCA after %d iterations’,imax));
printeps(’pcaEx1g.eps’);


function printeps(str)
eval(sprintf(’print -depsc %s’,str));


[Figure 42: PCA example: input signals (signal value vs. time).]

Subtract the mean value, then compute Σx. Dominant eigenvalues of Σx (ignoring all eigen-
values less than 10^{-6} λmax) are shown in Figure 43. The dominant 3 eigenvectors are plotted in
Figure 44. Approximated data is shown in Figure 45. Notice the general signal forms are
recognizable, in spite of different phase and frequency: the signal type can be classified with
confidence.

Remark 34.1 Data set characteristics: same number of data points from each class (uni-
form sampling).


[Figure 43: PCA example: dominant eigenvalues of Σx (semilog plot).]


[Figure 44: Dominant 3 eigenvectors of Σx (PCA example).]


[Figure 45: Original data approximated based on the dominant 3 eigenvectors of Σx. Observe
that the original sine, sinc, and square waves can be recognized in spite of noise corruption
and data compression to only three linear parameters.]


[Figure 46: Power iteration to compute the dominant eigenvector of Σx (plots of v1, w(0), and w(6)).]


[Figure 47: Neural-network PCA after 20 iterations: estimated dominant eigenvalues of Σx compared to the true values.]


34.1 Single vector case

Read §8.4

linear model

y = w^T x = Σ_{i=1}^m wi xi

Hebbian learning:

wi(n + 1) = wi(n) + η y(n) xi(n)
w(n + 1) = w(n) + η y(n) x(n)

weights can blow up → normalize weights

wi(n + 1) = ( wi(n) + η y(n) xi(n) ) / sqrt( Σ_{i=1}^m ( wi(n) + η y(n) xi(n) )^2 )

w(n + 1) = ( w(n) + η y(n) x(n) ) / sqrt( (w(n) + η y(n) x(n))^T (w(n) + η y(n) x(n)) )

Taylor series in η (recall d(u/v) = (v du − u dv)/v^2):

w(n + 1) ≈ w(n) + η y(n) [ x(n) − y(n) w(n) ]

Stability analysis and interpretation

y(n) = w(n)^T x(n) = x(n)^T w(n)

w(n + 1) = w(n) + η [ ( x(n) x(n)^T ) w(n) − ( w(n)^T x(n) x(n)^T w(n) ) w(n) ]

Interpretation: Recall Σx = (1/N) Σ_n x(n) x(n)^T, so for a small enough step size, the first term
( x(n) x(n)^T ) w(n) is an approximation for multiplying Σx w(n), which will converge to the
dominant eigenvector direction.
Suppose w(n) were a scalar. Then the 2nd term is −w(n)^3 x(n)^2, which causes w(n) to
decrease in magnitude.
Hence, the 1st term drives toward the dominant eigenvector, while the 2nd term drives
toward 0. Net result: converges to a bounded multiple of the dominant eigenvector of Σx.
Formal convergence analysis requires Σ η(n) = ∞ and Σ |η(n)|^p < ∞ for some p > 1. The text
chooses η(n) = 1/n, learning more slowly as the network gets “older.” Then w(n) → the dominant eigenvector.
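A minimal MATLAB sketch of this single-neuron update (Oja's rule). The two-dimensional example data and the step-size schedule are illustrative choices, not from the text.

% sketch: w(n+1) = w(n) + eta*y(n)*(x(n) - y(n)*w(n)) converges (up to sign)
% to the dominant eigenvector of Sigma_x
N = 2000; X = [3 1; 1 0.5]*randn(2,N);    % zero-mean data with anisotropic covariance
w = randn(2,1); w = w/norm(w);
for n = 1:N
  eta = 1/(100 + n);                      % decreasing step size, roughly eta(n) = 1/n
  x = X(:,n); y = w'*x;
  w = w + eta*y*( x - y*w );
end
Sigx = (X*X')/N;
[V,D] = eig(Sigx);
[jnk,imax] = max(diag(D));
abs(w'*V(:,imax))                         % close to 1: w is (nearly) a unit vector
                                          % aligned with the dominant eigenvector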

34.2 Multiple vector case


35 Self Organizing Maps


Read Ch. 9

Human brain motor organization mimics physical organization of body (mirror image).

Idea Design a neural network to organize itself to match data organization. Techniques/concepts
involved:

Competitive learning Competition: determine one ‘winner’

Topology/cooperation Establish “neighborhoods” of neurons. Organize neurons in each


layer in a plane; winner and its ‘neighbors’ get reinforced.

Eventually neighboring neurons are activated by neighboring input patterns.
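A minimal MATLAB sketch of these two steps (competition, then neighborhood-weighted cooperation) for a 1-D map of 10 neurons. The data, learning rate, and neighborhood width are illustrative choices; in practice both the learning rate and the neighborhood width are decreased over training.

% sketch of one SOM training sweep: find the winning neuron, then pull the
% winner and its map neighbors toward the input
M = 10; W = rand(2,M);                 % 10 neurons on a 1-D map, weight vectors in R^2
eta = 0.1; sig = 2;                    % learning rate and neighborhood width
X = rand(2,500);                       % training inputs (uniform on the unit square)
for n = 1:size(X,2)
  x = X(:,n);
  % competition: the neuron whose weight vector is closest to x wins
  d2 = sum( (W - x*ones(1,M)).^2 );
  [jnk,iwin] = min(d2);
  % cooperation: neighbors of the winner (in map coordinates) are also updated
  h = exp( -((1:M) - iwin).^2 / (2*sig^2) );
  W = W + eta*( x*ones(1,M) - W ).*( ones(2,1)*h );
end
plot(W(1,:), W(2,:), '-o');            % the map organizes itself along the data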


36 2003 11 19 Neurodynamic programming Example


Read Chapter 12

M-file 36.1 dynProgEx.m


% dynamic programming example - deterministic case (all transition
% probabilities are either 0 (never happens) or 1 (always happens).
XX = 1:10; % 10 states
% transitions and probabilities in a cell array:
% S{i,j} = [f(i,1),c(i,1) ; f(i,2), c(i,2) ; ... ; f(i,j), c(i,j), ...]
% describes what happens in state i with input j
S{1} = [2,2; 3,4; 4,3]; S{2} = [5,7; 6,4; 7,6];
S{3} = [5,3; 6,2; 7,4]; S{4} = [5,4; 6,1; 7,5];
S{5} = [8,1; 9,4]; S{6} = [8,6; 9,3];
S{7} = [8,3; 9,3]; S{8} = [10,3];
S{9} = [10,4]; S{10} = [];
for ii=1:length(S) % print out transitions and costs
Si = S{ii};
[mm,nn] = size(Si);
if(mm == 0), fprintf(’State %s: no transitions\n’, char(’A’+ii-1));
else
for uu=1:mm
fprintf(’State %s: input %d -> state %c, cost %d\n’, ...
char(’A’+ii-1),uu, char(’A’+Si(uu,1)-1),Si(uu,2));
end
end
end
% iteratively compute optimal cost to go
rand(’seed’,1);
Ju = 10*rand(length(S),1); % initialize to random cost
gam = 1.0; % gamma variable in Table 12.2
Jhist = Ju;
maxi = 11;
for iter = 2:maxi
fprintf(’iteration %d\n’,iter);
Jnext = zeros(length(S),1); % compute next optimal cost
for ii=1:length(S) % initial state
[mm,nn] = size(S{ii});
if(mm > 0)
bestCost = 1e6; % some huge number
for uu=1:mm % all possible inputs
bestCost = min( bestCost, S{ii}(uu,2) + gam*Ju( S{ii}(uu,1) ) );
end


else, bestCost = 0;
end
Jnext(ii) = bestCost;
end
Ju = Jnext; Jhist(:,iter) = Ju;
end
plot(1:maxi,Jhist,’-o’); grid on;
xlabel(’iteration’); ylabel(’cost to go estimate’);
for ii=1:length(S)
text(5 + 0.5*ii,Ju(ii)+0.5,char(’A’+ii-1));
end
title(’Dynamic programming example (see Fig 12.4 in Haykin’’s book)’);
print -depsc dynProgEx.eps

output
State A: input 1 -> state B, cost 2
State A: input 2 -> state C, cost 4
State A: input 3 -> state D, cost 3
State B: input 1 -> state E, cost 7
State B: input 2 -> state F, cost 4
State B: input 3 -> state G, cost 6
State C: input 1 -> state E, cost 3
State C: input 2 -> state F, cost 2
State C: input 3 -> state G, cost 4
State D: input 1 -> state E, cost 4
State D: input 2 -> state F, cost 1
State D: input 3 -> state G, cost 5
State E: input 1 -> state H, cost 1
State E: input 2 -> state I, cost 4
State F: input 1 -> state H, cost 6
State F: input 2 -> state I, cost 3
State G: input 1 -> state H, cost 3
State G: input 2 -> state I, cost 3
State H: input 1 -> state J, cost 3
State I: input 1 -> state J, cost 4
State J: no transitions
iteration 2
iteration 3
iteration 4
iteration 5
iteration 6
iteration 7
iteration 8


iteration 9
iteration 10
iteration 11

[Figure: dynamic programming example (see Fig 12.4 in Haykin’s book) - cost-to-go estimate vs. iteration for each of the states A through J.]


37 Hopfield Networks
Training uses correlation matrix memory (see [Hay99], §2.11, p. 79) to estimate training
values.
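A minimal sketch of that idea in MATLAB, assuming bipolar (±1) patterns. The stored patterns, the probe, and the zero-diagonal choice are illustrative, not from the text.

% sketch: Hopfield weights from the correlation-matrix (outer product) rule,
% then asynchronous updates that settle into a stored pattern
xi = [ 1  1  1 -1 -1 -1
      -1  1 -1  1 -1  1 ]';             % two stored patterns (columns of xi), 6 neurons
m  = size(xi,1);
W  = (xi*xi')/m;                        % correlation-matrix memory
W  = W - diag(diag(W));                 % zero the self-connections
x  = [ 1 1 1 -1 -1 1 ]';                % noisy probe: pattern 1 with one bit flipped
for sweep = 1:5
  for i = randperm(m)                   % asynchronous (one neuron at a time) update
    x(i) = sign( W(i,:)*x );
    if x(i) == 0, x(i) = 1; end         % break ties arbitrarily
  end
end
x'                                      % recalls stored pattern 1: [1 1 1 -1 -1 -1]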


A Appendix: Review of linear algebra

A.1 Vector stack function vec(·)
Define w̄ = vec(W) ∈ IR^{mn}, where W ∈ IR^{m×n}, as the vector formed by stacking the columns of W on top of one another.
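For example (a MATLAB illustration of the usual column-stacking convention; the reshape call is my own illustration, not from the appendix):

W = [1 3 5; 2 4 6];        % a 2 x 3 matrix
wbar = reshape(W,6,1)      % vec(W) = [1 2 3 4 5 6]' (columns stacked, since MATLAB stores by columns)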

B Appendix: Review of C-programming syntax related to neural nets

C Appendix: Review of MATLAB syntax related to neural nets
C.1 Introduction
MATLAB is a commercial product of the Mathworks, http://www.mathworks.com, that is
an industry standard software tool in electromagnetics, signal processing, and control sys-
tems. A student version of MATLAB can be purchased at the Auburn University bookstore.
The objective of this section is to introduce the student to some of the basic features of
MATLAB that are relevant to problem solving with artificial neural networks.

C.1.1 Access to software


The commercial program MATLAB may be run on any of the computers on the engineer-
ing network, including Sun-workstations (Unix) and Windows computers. In order to use
Sun-workstations (or a sun-session by using X-windows over a broadband connection) it is
necessary to select MATLAB in the program user-setup under mathematical packages.
Alternatively, many researchers and corporations use the freely distributable software
package octave, a program similar to MATLAB that is available under the terms of the
Free Software Foundation Gnu Public License, or SciLab, developed in France, that provides
similar functionality to MATLAB. Neither of these programs is 100% MATLAB-compatible,
but they do provide a useful low-cost computing tool for those who wish to use them. See
http://www.octave.org and http://www.scilab.org for additional information. Instal-
lation of Octave under Windows requires use of the Cygwin environment11 . Installation of
Octave on Macintosh computers is most easily done with fink12 . Installation of Octave on
Linux-based machines is usually straightforward.13
This manual is written based on the use of MATLAB on the engineering network.
11
http://www.cygwin.com
12
http://fink.sourceforge.net
13
Open the tar ball, run the configure script, then type in make all ; make install


C.1.2 Software overview and tutorials


The chief advantage to using MATLAB (as opposed to a formal programming language such
as C, C++, or FORTRAN) is the ease of development and debugging in the MATLAB
environment. The chief disadvantage of using MATLAB is that it uses interpreted scripts,
or m-files, which can run much slower than comparable compiled programs.
Students who wish more information on the use of MATLAB may use any of the following
resources:

• S. J. Reeves, Beginning MATLAB for Engineers, College House Enterprises, 2001,


available at the Auburn University Bookstore.

• K. Sigmon, MATLAB Primer, available at

ftp://ftp.eng.auburn.edu/pub/sjreeves/matlab_primer_40.pdf

• The Mathworks on-line documentation at http://www.mathworks.com

• The user’s manual for the Student Version of MATLAB (if you purchase it).

• The (paperback) textbook by D. Etter, Engineering Problem Solving with MATLAB


(tm), Prentice-Hall.

The manuals for Octave and/or Scilab may be of some use to you as well. These may be
obtained with their respective source code distributions.
Advanced students may wish to write mex function interfaces to compiled language com-
puter programs; this topic is not addressed in this laboratory session.

C.2 Mathematical preliminaries


We examine the use of MATLAB to analyze the currents and voltages for three circuits: a resistor
network, an RC low-pass filter circuit, and an RLC circuit, shown in Figures 48. The voltages
and currents in the resistor network in Figure 48(a) can be computed as the solution to a
set of linear equations

ai,1 v1 + ai,2 v2 + ai,3 i1 + ai,4 i2 + ai,5 i3 + ai,6 i4 + ai,7 i5 = bi (3.1)

for i = 1, ..., 7. For example, suppose for equation i = 1 that we apply Kirchoff’s current law
to the node with voltage v1 to obtain

i2 − i 3 − i 4 = 0 (3.2)

Equation (3.2) can be rewritten in the form of equation (3.1) by selecting the coefficients
a1,1 = a1,2 = a1,3 = a1,7 = b1 = 0, a1,4 = 1 and a1,5 = a1,6 = −1.


[Figure 48 shows three circuits: (a) a resistor network with a 5 V source and 3 kΩ, 7 kΩ, and 2 kΩ resistors; (b) an RC low-pass filter with a 1 kΩ resistor, a 10 µF capacitor, and step input us(t − 0.5); (c) an RLC circuit with 100 kΩ and 10 kΩ resistors, a 10 µF capacitor, a 10 mH inductor, and step input us(t − 0.5).]
Figure 48: Circuit examples for use of MATLAB. Note us (t) refers to the unit step function
us (t) = 0 for t < 0, us (t) = 1 for t ≥ 0.
C:
    double a, b, c[10], d[10];
    double e[5][5];
    a = 1; /* comments look like this */
    b = 2;
    c[0] = a; /* subscripts start at 0 */
    d[1] = b;
    e[2][2] = a+b;

MATLAB:
    % Comments start with a pct sign
    % don’t declare variables in MATLAB
    a = 1;
    b = 2;
    c(1) = a; % subscripts start at 1
    d(2) = b;
    e(3,3) = a + b;

Figure 49: Assignment statements and subscripts in C and MATLAB.

C.3 MATLAB basics: similarities to C


The MATLAB program allows users to do many things that can be found in compiled pro-
gramming languages. Some examples are shown in Figures 49—52, which address assignment
statements, if statements, for and while statements, and switch statements, respectively.

C.3.1 Functions
Unlike C, MATLAB (usually) requires each function to be in its own text file, and the file
must end with extension .m. So, these are usually called “m-files.” M-file function text


MATLAB
C
if(x < y )
if(x < y ) % use fprintf and single quotes
{ % otherwise same as printf.
printf("%e < %e\n",x,y); fprintf(’%e < %e\n’,x,y);
} % else if changes to elseif
else if ( x != y ) % != changes to ~= (use ~ for "not")
{ elseif ( x ~= y )
/* can split cmds across lines */ % use ... to split a command across
printf("%e is different from %e\n", % lines
x,y); fprintf( ...
} ’%e is different from %e\n’, ...
x,y);
Figure 50: Flow control in C and MATLAB: if statements

C MATLAB

double sum = 0; sum = 0;


int ii; % 1:5 is a vector of integers
for( ii = 1 ; ii <= 10 ; ii++) % same as writing
{ % [1, 2, 3, 4, 5]
sum = sum + ii; for ii = 1:5
printf("%d: %d\n",ii,sum); sum = sum + ii;
fprintf(’%d: %d\n’,ii,sum);
}
while(sum > 5) end
{ while(sum > 5)
/* don’t write % double precision math
* sum /= 2; % automatically
* you’ll get a different result! */ sum /= 2;
sum /= 2.0; end
} % no semicolon: print variable name
% and value
printf("sum = %e\n",sum);
sum
Figure 51: Flow control in C and MATLAB: for, while statements

between “function” line and 1st statement is printed when you type in help function name
at MATLAB prompt. See Figure 53 for more detail.

C.3.2 Differences from C


There are two major differences between MATLAB m-files and compiled C-code (or compiled
code from other high level languages):

1. Compiled code will generally run faster than m-files. This is because m-files are not


C MATLAB

switch(x) switch(x)
{ case(0),
case 0; % no empty parenthesis in MATLAB
do_this(); do_this;
break; case(1),
case 1; do_that;
do_that(); otherwise,
break; % error function: returns to
default: % MATLAB prompt
printf("Bad case. x = %d\n",x); error(’Bad case. x=%d’,x);
} end
Figure 52: Flow control in C and MATLAB: switch

translated directly to machine code, but are interpreted by the MATLAB program.
2. Compiled code will generally take much longer to write than MATLAB m-files. This is
because MATLAB’s basic variable types are much more flexible than C-language and
because debugging tools in MATLAB are much easier to work with.
We illustrate some of the utility of MATLAB with the following examples.

Example 3.1 dot products Given two three-dimensional vectors, w = [w1; w2; w3] and
x = [x1; x2; x3], their dot product is defined as

w · x = w1 x1 + w2 x2 + w3 x3 = Σ_{i=1}^3 wi xi

C and MATLAB functions to compute the dot product of w and x are given below.
C

double dotProd3(const double w[], MATLAB : create file dotProd.m containing this
const double x[]) text.
{
int ii; function z = dotProd(w,x)
/* initialize dot prod % z = dotProd(w,x)
* in declaration*/ % return dot product of column
double retval = 0; % vectors w and x
for ( ii = 0 ; ii < 3 ; ii++)
retval += w[ii] * x[ii]; z = w’ * x; % w’ = transpose, row vector
return retval;
}


C MATLAB : create file is positive.m


int is_positive(double x) containing this text.
{ function y = is_positive(x)
if(x > 0) % y = is_positive(x)
return 1; % returns y = 1 if x is positive,
else % 0 otherwise
return 0;
} % short way; can also use "if" stmt
y = ( x > 0 );
/* use pointers to % no return statement needed
* return multiple values */
void stats(double * sum, MATLAB : create file stats.m contain-
double * mean, ing this text:
const double x,
const double y) function [sum, mean] = stats(x,y)
{ % [sum, mean] = stats(x,y)
*sum = x + y; % (put other comments here)
*mean = sum/2.0; sum = x+y;
} mean = sum/2;

Call these functions with Call these functions with

double a, b, c, d; c = 1;
int i; d = 2;
c = 1; i = is_positive(c);
d = 2; % MATLAB functions can return
i = is_positive(c); % many values at once!
stats( &a, &b, c, d); [a,b] = stats(c, d);

Figure 53: Function definition in C and MATLAB

Notice that the MATLAB code has

• no declarations

• no need to know that the vectors are of length 3!

• the line z = w’ * x; works with the entire vectors (arrays) w and x and not just their
individual components x(1), x(2), etc.


M-file 3.1 sqfour3.m

tt = linspace(-2*pi,2*pi,1000); % 1000 evenly spaced points between +/- 2 pi


yy = sin(tt) + sin(3*tt)/3; % get Fourier series
plot(tt,yy); % plot the waveform
xlabel(’time (s)’); % Dr. Gross says: "label your axes!"
ylabel(’y(t)’);
title(’Fourier series approx of sq. wave’);
grid on % add a grid to the plot for easy reading
print -depsc sqfour3.eps % save plot for the manual

Figure 54: Example of plotting in MATLAB (3rd order Fourier series approximation of a
square wave).

C.3.3 Graphical (plotting) concepts in MATLAB


Example 3.2 plotting signals Suppose it is wished to plot the 3-rd order Fourier series
approximation of a square wave,
1
f (t) = sin(t) + sin(3t)
3
A MATLAB m-file (stored in file sqfour3.m) to plot this function is shown in Figure 54.14
This m-file script is run in MATLAB by typing in the name of the script, sqfour3 at the
MATLAB prompt, and results in the plot shown in Figure 55.

C.3.4 Circuits problems in MATLAB


Resistor circuits typically result in a set of algebraic equations that are very easy to solve in
MATLAB. We illustrate the procedure first with an example:

Example 3.3 Recall the resistor circuit network in Figure 48(a):


[circuit (a): a single loop with a 5 V source, 3 kΩ, 7 kΩ, and 2 kΩ resistors; loop current i1, node voltages v1 (between the 3 kΩ and 7 kΩ resistors) and v2 (across the 2 kΩ resistor)]

The unknowns in this circuit problem are i1 , v1 , and v2 . From Ohm’s law and Kirchhoff’s
14
The C code is not included because this is an ECE course, not a CSSE course.



Figure 55: MATLAB plot generated by m-file sqfour3.m, Example 3.2.

voltage law we have

v2 = 2000i1
v1 = 7000i1 + v2
(3000 + 7000 + 2000)i1 = 5

We rewrite these equations so that all unknowns appear on the left side of the equation and
all the constant terms are on the right side to obtain

v2 − 2000i1 = 0 (3.3)
v1 − 7000i1 − v2 = 0 (3.4)
12000i1 = 5 (3.5)

In order to put these equations into MATLAB, we need to look at each equation as if it were a dot product w · x, written in MATLAB as w’ * x. Here’s the procedure. First, we define a vector (array) x of the unknowns in some order. For this example we’ll put them in alphabetical order:

        [ i1 ]
    x = [ v1 ]                                        (3.6)
        [ v2 ]


Then equation (3.3) can be written as

    [ -2000 ]
    [     0 ] · x = -2000 i1 + 0 v1 + v2 = 0          (3.7)
    [     1 ]

Similarly, equations (3.4) and (3.5) can be written respectively as

    [ -7000 ]
    [     1 ] · x = -7000 i1 + v1 - v2 = 0            (3.8)
    [    -1 ]

and

    [ 12000 ]
    [     0 ] · x = 12000 i1 + 0 v1 + 0 v2 = 5        (3.9)
    [     0 ]

Notice that in equations (3.7)–(3.9) all of the vectors in the dot products (except for x) are made up of known constants. We solve for the unknowns x by writing the constant entries on the left side of these equations into the rows of a two-dimensional array (called a matrix)

        [ -2000   0   1 ]
    A = [ -7000   1  -1 ]
        [ 12000   0   0 ]

and the constants on the right side of the equations into a column vector

        [ 0 ]
    b = [ 0 ]
        [ 5 ]

We use MATLAB to solve the problem as shown below. Lines beginning with >> are typed in by the user; the remaining lines are output to the screen by MATLAB.
>> A = [-2000, 0, 1; -7000, 1, -1; 12000,0,0]
A = -2000 0 1
-7000 1 -1
12000 0 0
>> b = [0;0;5]
b = 0
0
5
>> x = A\b
x =
4.1667e-04
3.7500e+00
8.3333e-01

Remark 3.1 The command x = A\b tells MATLAB to compute a vector x so that the dot
product of row i of the matrix (2-D array) A with x matches component i of b.15
Footnote 15: You will study this problem in much greater detail in your linear algebra class, where you will write A*x = b. The expression A*x refers to matrix-vector multiplication.
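A quick way to convince yourself that the computed x really solves the equations is to multiply it back out. This check is not part of the original example, but it uses only commands already introduced; the exact display format (and tiny roundoff instead of exact zeros) may vary:

>> A*x
ans =
     0
     0
     5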


Remark 3.2 Notice that the MATLAB output for x does not show units. You as the programmer have to remember that x corresponds to i1 = 416.67 µA, v1 = 3.75 V, and v2 = 0.83333 V.

Remark 3.3 Notice that rows of A and b are separated with semicolons ;, and that entries
on each row are separated with (optional) commas. The commas are not required, but
they’re a very good idea. To see this, type in the following commands (including spaces) at
the MATLAB prompt:

• x = [1 , - 2 ]

• x = [1 - 2]

These do not give the same answer! To avoid ambiguity, it’s a good idea to use commas and/or parentheses; e.g., x = [1, ( 3 - 4 ), 5] does the same thing as x = [ 1 3 - 4 5], but there’s no question what the first one is supposed to do.16
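For reference, here is roughly what these two commands produce (the exact display formatting may differ between MATLAB versions, but the values should match):

>> x = [1 , - 2 ]
x =
     1    -2
>> x = [1 - 2]
x =
    -1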

C.3.5 Differential equations in MATLAB


MATLAB provides two approaches for the simulation of differential equations. One is a
differential equation solver function, ode45. The other is a graphical simulation tool (block
diagram editor) called Simulink.17 Simulink will be discussed in the controls course, ELEC
3500. For this experiment, we will present only the ode45 solver, which is used to simulate
differential equations of the form

    dy/dt = f(t, y)                                   (3.10)

with initial conditions y(t0) = y0.
Since all ELEC 2020 students have had differential equations, the form of equation (3.10) should raise some questions, such as “what if a differential equation has two derivatives in it instead of one?” For example, many physical systems can be modeled by the differential equation

    d^2 y(t)/dt^2 = a1 dy(t)/dt + a0 y(t) + u(t)      (3.11)

where u(t) is some input function. Since ode45 only accepts one derivative, at first glance it appears that ode45 cannot be used to solve differential equations of the form (3.11). However, with a mild change of notation, the user can employ ode45 to solve differential equations of arbitrary order.

Consider again equation (3.11), and define a new variable v(t) = dy(t)/dt. Then the second derivative of y is d^2 y(t)/dt^2 = dv(t)/dt, the first derivative of v. We can then write two
Footnote 16: Dr. Hodel says: Never, ever trust a computer!
Footnote 17: Octave and Scilab each provide tools comparable to ode45, but as of this writing there is no freely available graphical (block-diagram) user interface for simulating differential equations.


first order differential equations:

    dy(t)/dt = v(t)                                   (3.12)
    dv(t)/dt = a1 dy(t)/dt + a0 y(t) + u(t)
             = a1 v(t) + a0 y(t) + u(t)               (3.13)

Notice that each of the equations (3.12) and (3.13) is organized so that all derivatives d/dt are on the left side of the equation and all derivatives are first order; that is, each of these equations is “pretty close” to the form (3.10) that is required by ode45.

We can finish putting equations (3.12) and (3.13) in the required form by changing our notation as follows. Define a vector

    x(t) = [ y(t) ]
           [ v(t) ]

and define the derivative of x as

    dx(t)/dt = [ dy(t)/dt ]
               [ dv(t)/dt ]

Then, from equations (3.12) and (3.13), we can write

    dx(t)/dt = [ dy(t)/dt ] = [           v(t)           ] = [ f1(t, x) ] = f(t, x)          (3.14)
               [ dv(t)/dt ]   [ a1 v(t) + a0 y(t) + u(t) ]   [ f2(t, x) ]

where now the variable x and the function f are vectors, not scalars. While this concept can be tricky for the first-time user, MATLAB’s features are designed to take advantage of this sort of notation.
We’ll illustrate this process further in an example.
Example 3.4 Use of ode45 Consider the differential equation

    d^2 y(t)/dt^2 = -2 dy(t)/dt - 2 y(t) + sin(t)     (3.15)

We wish to simulate equation (3.15) for 15 seconds with initial conditions y(0) = dy(0)/dt = 0.
First, we rewrite equation (3.15) as a pair of first-order differential equations:

    dy(t)/dt = v(t)
    dv(t)/dt = -2 v(t) - 2 y(t) + sin(t)

Define the vector x(t) = [ y(t) ; v(t) ] so that we can write

    dx(t)/dt = [ dy(t)/dt ] = [           v(t)            ]                                  (3.16)
               [ dv(t)/dt ]   [ -2 v(t) - 2 y(t) + sin(t) ]

with initial conditions x(0) = [ y(0) ; v(0) ] = [ 0 ; 0 ]. Now, examine the MATLAB on-line documentation for ode45:


M-file 3.2 odeExample.m

function dx = odeExample(t,x)
% dx = odeExample(t,x)
% derivatives function to simulate
% y’’(t) = -2 y’(t) - 2 y(t) + sin(t)

y = x(1);
v = x(2);
dy = v;
dv = -2*v - 2*y + sin(t);
dx = [dy ; dv];

% could also do this in one line:


% dx = [ x(2) ; -2*x(2) - 2*x(1) + sin(t) ];

Figure 56: Derivatives function odeExample for Example 3.4.

>> help ode45

ODE45 Solve non-stiff differential equations, medium order method.


[T,X] = ODE45(ODEFUN,TSPAN,X0) with TSPAN = [T0 TFINAL] integrates the
system of differential equations y’ = f(t,y) from time T0 to TFINAL with
initial conditions X0. Function ODEFUN(T,X) must return a column vector
corresponding to f(t,y). Each row in the solution array X corresponds to
a time returned in the column vector T. To obtain solutions at specific
times T0,T1,...,TFINAL (all increasing or all decreasing), use
TSPAN = [T0 T1 ... TFINAL].

Notice that ode45 requires three inputs: (1) a function ODEFUN= f (t, x) that, given current
conditions x(t), will return the current state derivatives, (2) a vector TSPAN of time values at
which we want to compute values of x(t), and (3) X0 = x(t0 ), a vector of initial conditions.
Item (1) is implemented as an m-file function shown in Figure 56. We can then simulate the
function with ode45 by writing and executing the m-file script odeExampleMain.m in Figure
57. For clarity, we discuss line by line the main routine odeExampleMain below:

1. % simulate a differential equation with zero initial conditions for 15 sec


As discussed before, this first line is a comment. Document your code with meaningful
comments!

2. tspan = linspace(0,15);
This line creates a variable tspan that has one row of 100 numbers that are evenly
spaced from 0.0 to 15.0. Try typing in tspan = linspace(0,15) without the semi-
colon. MATLAB will print out all 100 entries of tspan to the screen for you to see.


M-file 3.3 odeExampleMain.m

% simulate a differential equation with zero initial conditions for 15 sec


tspan = linspace(0,15);
x0 = [0;0];
[tt,xx] = ode45(’odeExample’,tspan,x0);
plot(tt,xx);
grid on
legend(’y’,’v = dy/dt’);
xlabel(’time (s)’)
print -depsc odeExampleMain.eps

Figure 57: Main m-file script odeExampleMain for Example 3.4.

If you want tspan to have a different number of points, then use a third argument
to the linspace function, for example, tspan = linspace(0,15,10); will give you
10 points and tspan = linspace(0,15,16); will give you the 16 numbers 0,1,2,...,15,
etc. Since tspan has only one row, it’s often called a row vector.
We will use the variable tspan to tell ode45 what time instants we want to have in
our simulation.

3. x0 = [0;0];
This line creates a variable x0 = [ 0 ; 0 ] that has two rows and one column. x0 is used to tell ode45 the initial conditions for the differential equation.

4. [tt,xx] = ode45(’odeExample’,tspan,x0);
This is where ode45 simulates the differential equations. Notice that the name of odeExample must be written in quotes in the call to ode45, i.e., ode45(’odeExample’,tspan,x0).
Notice that ode45 takes advantage of MATLAB’s ability to return more than one variable. The variable tt is returned as a column vector (an array with only one column) that has 100 entries, the same entries we specified in tspan. The variable xx is returned as a 2-dimensional array with 100 rows and two columns. The first column holds the values of y(t) at the times in tt; that is, xx(ii,1) = y(tt(ii)). We use this knowledge in the plot command that comes next. Similarly, the second column of xx holds the values of v(t) = dy(t)/dt at the times in tt.
A reasonable question for the reader to ask is “How do you know which column in xx goes with which variable in the differential equations?” The answer is that I wrote the routine odeExample.m, and so I know that that routine interprets the vector x as x(t) = [ y(t) ; v(t) ], so row ii of the array xx returned by ode45 is the value of x(t) at time tt(ii). (It doesn’t matter that odeExample.m works with column vectors internally; ode45 returns its results with one row per time value.)


Another reasonable question is “why do you use variable names like tt and xx instead of t and x?” The answer is that it makes it a lot easier to debug programs and m-files.
If I search for the letter t in odeExampleMain.m I will find it in many places, including
in the comment on the first line. However, if I search for tt instead, I will (usually)
only find the variable that I’m looking for.
5. plot(tt,xx);
This line opens a graphic window and plots the data on the screen, xx as a function of
tt. This is a “lazy” way to generate the plot. I could also have plotted one waveform
at a time by typing
plot(tt,xx(:,1),’-’, tt,xx(:,2),’-’);
The notation xx(:,1) means “the first column of xx” and xx(:,2) means “the second
column of xx.” The ’-’ is a line style command. You can learn more about line styles
by typing in help plot at the MATLAB prompt.
6. grid on
The plot command only puts the wave forms on the graph. This command puts up
“graph paper” on the window to make the plot easier to read.
7. legend(’y’,’v = dy/dt’);
The legend command puts a legend in the graph window so that you know which color (or line style) goes with which variable.
8. xlabel(’time (s)’)
Always label your axes. If possible, include units. In fact, it’s a good idea to include units in your m-file (and C-program) variable names.18 The xlabel command labels the x axis. There is also a ylabel command that you can use instead of legend if you’re only plotting one waveform in a window.
9. print -depsc odeExampleMain.eps
This command is used to store the plot in color .eps format so that we can include it
in this manual. The resulting plot is shown in Figure 58.
Remark 3.4 This manual was written in LaTeX, a free mathematical typesetting system (not quite a word processor) that is included with Linux, is installed on the Engineering Sun Network, and can be downloaded with the cygwin environment on a Windows machine or installed with fink on a Macintosh. Microsoft Word users should print to .jpg files or some other image format. Type in help print at the MATLAB prompt for more options.
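For example (this particular line is not in the original script), a Word user could replace the print command in odeExampleMain.m with a JPEG device:

    print -djpeg odeExampleMain.jpg   % save the current figure as a JPEG image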

Footnote 18: While working on leave at NASA, the writer of this chapter once spent a month comparing two simulations that did not match because one simulation used degrees for angles and the other used radians, but both used the same data tables. The problem was caught and fixed, and a lesson learned. Label your variables/axes and document your code!


[Figure: plot of y and v = dy/dt versus time (s) from 0 to 15 s.]
Figure 58: MATLAB plot generated by m-file odeExampleMain.m, Example 3.4.

C.4 Things to try on your own


1. Write an expression for a 7th order Fourier series approximation of a square wave (see
Example 3.2).

2. Write the differential equation(s) describing the behavior of the circuit in Figure 48(b).
Derive by hand the output vo (t) for the circuit in Figure 48(b) given that vo (0) = 0V.

3. Write the differential equation(s) describing the behavior of the circuit in Figure 48(c).
Derive by hand the output vo (t) for the circuit in Figure 48(c) given that vo (0) = 0V
and that i1 (0) = 0A.

4. Plot a 7th order Fourier series approximation of a square wave.

(a) Start MATLAB on a College of Engineering computer. If you’re using a Windows machine, MATLAB is listed under the “start” menu. If you’re using a Unix-based machine (or you’re running MATLAB over the phone line from home under X-windows19) type in the command matlab at the Unix prompt.
Footnote 19: You can get the code for X-windows at http://www.cygwin.com; the remainder of this manual is written based on the assumption that you’re working on-campus.


(b) At the MATLAB prompt, type in edit sqfour7. This will open a new editor window. Type in the m-file in Figure 54 and then modify it to plot a 7th order Fourier series approximation of a square wave. Use the MATLAB command help print to see how to print the plot to a printer instead of storing it in a file. (Alternatively, you can save the plot as a .jpg file and import it into MS Word.)

5. Use MATLAB to calculate the voltages and currents in the circuit shown in Figure
48(a).

6. Use MATLAB to simulate the capacitor voltage in Figure 48(b) for 2 seconds. Turn
in your m-file(s) and a plot of the capacitor voltage.

7. Use MATLAB to simulate the capacitor voltage and inductor current in Figure 48(c)
for 2 seconds. Turn in your m-file(s) and a plot of the capacitor voltage and inductor
current.

D Appendix: source code


D.1 Utility m-files not listed elsewhere
M-file 4.1 backPropSig.m

rand(’seed’,pi); % seed random number generator

mex nnsig.c; % comment these out to skip compile


mex neteval.c

% construct target function


nx = 11; xx = linspace(-1, 2, nx)’;
ny = 11; yy = linspace(-1, 2, ny)’;
[xm,ym] = meshgrid( xx, yy);
dm = 0.1 + 0.9*( xm >= 0 & xm <=1 & ym >= 0 & ym <= 1);

% network: use (nh) hidden nodes and one output node


% bias introduced in input layer
a = 5; % activation function parameter
nh = 9; no = 1; % network dimensions
W1 = 5*(rand(nh,3)-0.5);
W2 = 5*(rand(no,nh+1)-0.5);

% initial output, error surfaces


[om, em] = neteval(xx, yy, dm, W1, W2,a);
figure(3); mesh(xx,yy, om); title(’Initial surface’);
print -depsc backPropSig_3.eps
figure(4); mesh(xx,yy,em); title(’Initial error surface’);


print -depsc backPropSig_4.eps

% keep a record of the error function


icnt = 1; Errv(icnt) = norm(em,’fro’)^2;

icnt = icnt+1; % icnt = "iteration count"


eta = 0.02; % execute backprop algorithm with eta selected
mov = avifile(’backPropSig.avi’);
mov2 = avifile(’backPropSigSep.avi’);
maxIter = 400;
for iter = 0:maxIter
fprintf(’%4d %s %e\n’,iter,datestr(now),Errv(icnt-1));
for ii=1:nx
for jj = 1:ny
y0 = [xx(ii); yy(jj)]; % input vector
[y2,y1] = nnsig(y0, W1, W2, a);
v1 = W1*[1;y0];
v2 = W2*[1;y1];

% perform backprop step


e = dm(jj,ii) - y2;
d2 = -e .* dphi(v2,a);
DW2 = d2*[1;y1]’;

W2_times_d2 = W2’ * d2;


d1 = -(W2_times_d2(2:(nh+1))) .* dphi(v1,a) ;
DW1 = d1*[1;y0]’;

W2 = W2 - eta*DW2;
W1 = W1 - eta*DW1;

% evaluate error surface


[om,em] = neteval(xx,yy,dm, W1, W2,a);
Errv(icnt) = norm(em,’fro’)^2;
icnt = icnt + 1;
end
end

figure(6); mesh(xx,yy, om);


title(sprintf(’Output surface iteration %d’,iter));
axis([-1,2,-1,2,0,1.1]);
text(-0.5,2,0.8,sprintf(’current error =%12.4g’,Errv(icnt-1)));
text(1.5,-0.5,0.1,’x_1 value’);
text(-0.5,1.5,0.1,’x_2 value’);


A = getframe; mov = addframe(mov,A);

% plot boundary lines, compute color based on slope of line


xp = linspace(-5,5,100)’;
yp = zeros(100,nh);
for ip = 1:nh
if(W1(ip,3) == 0)
denom = 1e-6;
else
denom = W1(ip,3);
end
yp(:,ip) = -(W1(ip,2)*xp + W1(ip,1))/denom;
linAng = atan2(W2(ip+1)*W1(ip,3),W2(ip+1)*W1(ip,2))*180/pi;
if(-45 < linAng & linAng <= 45), pcolor{ip} = ’g-’; %left
elseif(45 < linAng & linAng <= 135), pcolor{ip} = ’c-’; %bottom
elseif(135 < linAng & linAng <= 225), pcolor{ip} = ’r-’; %right
elseif(-135 < linAng & linAng <= -45), pcolor{ip} = ’b-’; %top
elseif(-225 < linAng & linAng <= -135), pcolor{ip} = ’r-’; %right
end
end

if(iter == maxIter)
print -depsc backPropSig_6.eps
end

figure(8);
plot( ...
xp,yp(:,1),pcolor{1}, xp,yp(:,2),pcolor{2}, ...
xp,yp(:,3),pcolor{3}, xp,yp(:,4),pcolor{4}, ...
xp,yp(:,5),pcolor{5}, xp,yp(:,6),pcolor{6}, ...
xp,yp(:,7),pcolor{7}, xp,yp(:,8),pcolor{8}, ...
xp,yp(:,9),pcolor{9}, ...
-5:4,W2,’o’);
text(-5,4.0,sprintf(’iteration %d’,iter));
text(-5,3.6,sprintf(’Blue : top border’));
text(-5,3.2,sprintf(’Green: left border’));
text(-5,2.8,sprintf(’Red : right border’));
text(-5,2.4,sprintf(’Cyan : bottom border’));
for ip = 1:nh
% label line at far left side
idx = max(find(abs(yp(:,ip)) < 4.8) );
text(xp(idx),yp(idx,ip),sprintf(’line %d’,ip));
end
for ip = 0:nh


text(ip-5,W2(ip+1)+0.2,sprintf(’W2(%d)’,ip));
end
% x-axis labels
for lp = -6:2:4
text(lp,-4.5,sprintf(’%d’,lp));
end
% y-axis labels
for lp = -5:1:5
text(-6.1,lp,sprintf(’%d’,lp));
end
xlabel(’x1 value’);
ylabel(’x2 value/W2’);
title(’Layer 1 linear separation boundaries’);
grid on;
axis([-5,5,-5,5]);
axis(’equal’);
A = getframe; mov2 = addframe(mov2,A);

if(iter == maxIter)
print -depsc backPropSig_8.eps
figure(7); plot(Errv); grid on;
title(sprintf(’Error function per backprop step’))
print -depsc backPropSig_7.eps
end
end
mov = close(mov);
mov2 = close(mov2);
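backPropSig.m relies on helper routines listed elsewhere in these notes: the MEX files nnsig.c and neteval.c (M-files 21.6 and 21.7) and the activation-derivative function dphi.m (M-file 19.2). Purely as orientation for readers skimming this appendix — the actual course listing should be consulted, and the exact activation function is an assumption here — a derivative routine consistent with a logistic sigmoid phi(v) = 1/(1+exp(-a*v)) could look like:

function d = dphi(v,a)
% d = dphi(v,a)
% elementwise derivative of the logistic sigmoid phi(v) = 1./(1+exp(-a*v)),
% i.e. d = a .* phi(v) .* (1 - phi(v))
p = 1./(1 + exp(-a*v));
d = a .* p .* (1 - p);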

M-file 4.2 mlpEvalT.m

function [yy, Errv] = mlpEvalT(inData, W1, W2, a,b)


% function [yy, Errv] = mlpEvalT(inData, W1, W2, a,b)
% evaluate network at all sample points
% dm = nx x ny matrix of desired output values
% use exponential function with parameter a

nx = size(W1,2)-1;
ny = size(W2,1);
N = size(inData,1);

% present each vector to the network


for nn=1:N
yy(nn,1:ny) = reshape(mlpT(inData(nn,1:nx),W1,W2,a,b),1,ny);
end


% calculate error values


Errv = yy - inData(:,nx+(1:ny));

M-file 4.3 mlpT.m

function [y,h] =mlpT(x,w1,w2,a,b)


% function [y,h] =mlpT(x,w1,w2,a,b)
% return hidden layer values h and output layer values y for
% weights w1 and w2 with input x
% activation function is hyperbolic tangent with parameters a,b
% bias value of 1 is appended to both x and hidden layer value vector

x = reshape(x,length(x),1);
v1 = w1*[1;x]; % append bias
h = phiT(v1,a,b);

v2 = w2*[1;h]; % append bias


y = phiT(v2,a,b);
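The activation function phiT.m called above is listed elsewhere in these notes (M-file 19.4). As a minimal sketch only — consistent with the comment in mlpT.m that the activation is a hyperbolic tangent with parameters a and b, though the exact scaling used in the course listing is an assumption here — it could be written as:

function y = phiT(v,a,b)
% y = phiT(v,a,b)
% hyperbolic tangent activation, applied elementwise: y = a*tanh(b*v)
y = a*tanh(b*v);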

M-file 4.4 sysIdEx1PlotErr.m

function plotErr(d,X,W,tstr);
% plot error of current fit
nn = 1:length(d);
fn = figure;
plot(nn,d,’x’, nn,W*X,’+’, nn,d-W*X,’o’);
xlabel(’sample number’);
legend(’desired output’,’actual output’,’error’);
grid on;
title(tstr);
eval(sprintf(’print -depsc sysIdEx1%.4d.eps’,fn));
return




