
Fuzzy and Neural Approaches in Engineering

MATLAB Supplement

J. Wesley Hines


Wiley Series on Adaptive and Learning Systems for Signal Processing, Communications and Control
Simon Haykin, Series Editor


Copyright 1997
John Wiley and Sons
New York, NY

To SaDonya



CONTENTS
PREFACE
ACKNOWLEDGMENTS
ABOUT THE AUTHOR
SOFTWARE DESCRIPTION
INTRODUCTION TO THE MATLAB SUPPLEMENT
INTRODUCTION TO MATLAB
MATLAB TOOLBOXES
SIMULINK
USER CONTRIBUTED TOOLBOXES
MATLAB PUBLICATIONS
MATLAB APPLICATIONS
CHAPTER 1 INTRODUCTION TO HYBRID ARTIFICIAL INTELLIGENCE SYSTEMS
CHAPTER 2 FOUNDATIONS OF FUZZY APPROACHES
2.1 Union, Intersection and Complement of a Fuzzy Set
2.2 Concentration and Dilation
2.3 Contrast Intensification
2.4 Extension Principle
2.5 Alpha Cuts
CHAPTER 3 FUZZY RELATIONSHIPS
3.1 A Similarity Relation
3.2 Union and Intersection of Fuzzy Relations
3.3 Max-Min Composition
CHAPTER 4 FUZZY NUMBERS
4.1 Addition and Subtraction of Discrete Fuzzy Numbers
4.2 Multiplication of Discrete Fuzzy Numbers
4.3 Division of Discrete Fuzzy Numbers
CHAPTER 5 LINGUISTIC DESCRIPTIONS AND THEIR ANALYTICAL FORM
5.1 Generalized Modus Ponens
5.2 Membership Functions
5.2.1 Triangular Membership Function
5.2.2 Trapezoidal Membership Function
5.2.3 S-shaped Membership Function
5.2.4 Π-shaped Membership Function
5.2.5 Defuzzification of a Fuzzy Set
5.2.6 Compound Values
5.3 Implication Relations
5.4 Fuzzy Algorithms
CHAPTER 6 FUZZY CONTROL
6.1 Tank Level Fuzzy Control
CHAPTER 7 FUNDAMENTALS OF NEURAL NETWORKS
7.1 Artificial Neuron
7.2 Single Layer Neural Network
7.3 Rosenblatt's Perceptron
7.4 Separation of Linearly Separable Variables
7.5 Multilayer Neural Network
CHAPTER 8 BACKPROPAGATION AND RELATED TRAINING PARADIGMS
8.1 Derivative of the Activation Functions
8.2 Backpropagation for a Multilayer Neural Network
8.2.1 Weight Updates
8.2.2 Hidden Layer Weight Updates
8.2.3 Batch Training
8.2.4 Adaptive Learning Rate
8.2.5 The Backpropagation Training Cycle
8.3 Scaling Input Vectors
8.4 Initializing Weights
8.5 Creating a MATLAB Function for Backpropagation
8.6 Backpropagation Example
CHAPTER 9 COMPETITIVE, ASSOCIATIVE AND OTHER SPECIAL NEURAL NETWORKS
9.1 Hebbian Learning
9.2 Instar Learning
9.3 Outstar Learning
9.4 Crossbar Structure
9.5 Competitive Networks
9.5.1 Competitive Network Implementation
9.5.2 Self Organizing Feature Maps
9.6 Probabilistic Neural Networks
9.7 Radial Basis Function Networks
9.7.1 Radial Basis Function Example
9.7.2 Small Neuron Width Example
9.7.3 Large Neuron Width Example
9.8 Generalized Regression Neural Network
CHAPTER 10 DYNAMIC NEURAL NETWORKS AND CONTROL SYSTEMS
10.1 Introduction
10.2 Linear System Theory
10.3 Adaptive Signal Processing
10.4 Adaptive Processors and Neural Networks
10.5 Neural Networks Control
10.5.1 Supervised Control
10.5.2 Direct Inverse Control
10.5.3 Model Referenced Adaptive Control
10.5.4 Back Propagation Through Time
10.5.5 Adaptive Critic
10.6 System Identification
10.6.1 ARX System Identification Model
10.6.2 Basic Steps of System Identification
10.6.3 Neural Network Model Structure
10.6.4 Tank System Identification Example
10.7 Implementation of Neural Control Systems
CHAPTER 11 PRACTICAL ASPECTS OF NEURAL NETWORKS
11.1 Neural Network Implementation Issues
11.2 Overview of Neural Network Training Methodology
11.3 Training and Test Data Selection
11.4 Overfitting
11.4.1 Neural Network Size
11.4.2 Neural Network Noise
11.4.3 Stopping Criteria and Cross Validation Training
CHAPTER 12 NEURAL METHODS IN FUZZY SYSTEMS
12.1 Introduction
12.2 From Crisp to Fuzzy Neurons
12.3 Generalized Fuzzy Neuron and Networks
12.4 Aggregation and Transfer Functions in Fuzzy Neurons
12.5 AND and OR Fuzzy Neurons
12.6 Multilayer Fuzzy Neural Networks
12.7 Learning and Adaptation in Fuzzy Neural Networks
CHAPTER 13 NEURAL METHODS IN FUZZY SYSTEMS
13.1 Introduction
13.2 Fuzzy-Neural Hybrids
13.3 Neural Networks for Determining Membership Functions
13.4 Neural Network Driven Fuzzy Reasoning
13.5 Learning and Adaptation in Fuzzy Systems via Neural Networks
13.5.1 Zero Order Sugeno Fan Speed Control
13.5.2 Consequent Membership Function Training
13.5.3 Antecedent Membership Function Training
13.5.4 Membership Function Derivative Functions
13.5.5 Membership Function Training Example
13.6 Adaptive Network-Based Fuzzy Inference Systems
13.6.1 ANFIS Hybrid Training Rule
13.6.2 Least Squares Regression Techniques
13.6.3 ANFIS Hybrid Training Example
CHAPTER 14 GENERAL HYBRID NEUROFUZZY APPLICATIONS
CHAPTER 15 DYNAMIC HYBRID NEUROFUZZY SYSTEMS
CHAPTER 16 ROLE OF EXPERT SYSTEMS IN NEUROFUZZY SYSTEMS
CHAPTER 17 GENETIC ALGORITHMS
REFERENCES

Preface
Over the past decade, the application of artificial neural networks and fuzzy systems to
solving engineering problems has grown enormously. And recently, the synergism
realized by combining the two techniques has become increasingly apparent. Although
many texts are available for presenting artificial neural networks and fuzzy systems to
potential users, few exist that deal with the combinations of the two subjects and fewer
still exist that take the reader through the practical implementation aspects.

This supplement introduces the fundamentals necessary to implement and apply these
Soft Computing approaches to engineering problems using MATLAB. It takes the reader
from the underlying theory to actual coding and implementation. Presenting the theory's
implementation in code provides a more in-depth understanding of the subject matter.
The code is built from the bottom up: it first introduces the pieces, then puts them
together to perform more complex functions, and finally presents implementation
examples. The MATLAB Notebook allows MATLAB code fragments to be embedded and
evaluated in the Word document, which provides a compact and comprehensive
presentation of the Soft Computing techniques.

The first part of this supplement gives a very brief introduction to MATLAB including
resources available on the World Wide Web. The second section of this supplement
contains 17 chapters that mirror the chapters of the text. Chapters 2-13 have MATLAB
implementations of the theory and discuss practical implementation issues. Although
Chapters 14-17 do not give MATLAB implementations of the examples presented in the
text, some references are given to support a more in-depth study.
Acknowledgments
I would like to thank Distinguished Professor Robert E. Uhrig from The University of
Tennessee and Professor Lefteri H. Tsoukalas from Purdue University for offering me the
opportunity and encouraging me to write this supplement to their book entitled Fuzzy and
Neural Approaches in Engineering. I also thank them for their review, comments, and
suggestions.

My sincere thanks go to Darryl Wrest of Honeywell for his time and effort during the
review of this supplement. Thanks also go to Mark Buckner of Oak Ridge National
Laboratory for his contributions to Sections 15.5 and 15.6.

This supplement would not have been possible without the foresight of the founders of
The MathWorks in developing what I think is the most useful and productive engineering
software package available. I have been a faithful user for the past seven years and look
forward to the continued improvement and expansion of their base software package and
application toolboxes. I have found few companies that provide such a high level of
commitment to both quality and support.

About the Author
Dr. J. Wesley Hines is currently a Research Assistant Professor in the Nuclear
Engineering Department at the University of Tennessee. He received the BS degree
(Summa Cum Laude) in Electrical Engineering from Ohio University in 1985, both an
MBA (with distinction) and a MS in Nuclear Engineering from The Ohio State
University in 1992, and a Ph.D. in Nuclear Engineering from The Ohio State University
in 1994. He graduated from the officers course of the Naval Nuclear Power School (with
distinction) in 1986.

Dr. Hines teaches classes in Applied Artificial Intelligence to students from all
departments in the engineering college. He is involved in several research projects in
which he uses his experience in modeling and simulation, instrumentation and control,
applied artificial intelligence, and surveillance & diagnostics in applying artificial
intelligence methodologies to solve practical engineering problems. He is a member of
the American Nuclear Society and IEEE professional societies and a member of Sigma
Xi, Tau Beta Pi, Eta Kappa Nu, Alpha Nu Sigma, and Phi Kappa Phi honor societies.

For the five years prior to coming to the University of Tennessee, Dr. Hines was a
member of The Ohio State University's Nuclear Engineering Artificial Intelligence
Group. While there, he worked on several DOE and EPRI funded projects applying AI
techniques to engineering problems. From 1985 to 1990 Dr. Hines served in the United
States Navy as a nuclear qualified Naval Officer. He was the Assistant Nuclear Controls
and Chemistry Officer for the Atlantic Submarine Force (1988 to 1990), and served as
the Electrical Officer of a nuclear powered Ballistic Missile Submarine (1987 to 1988).
Software Description
This supplement comes with an IBM compatible disk containing an install program. The
program includes an MS Word 7.0 notebook file (PC) and several MATLAB functions,
scripts and data files. The MS Word file, master.doc, is a copy of this supplement and
can be opened into MS Word so that the code fragments in this document can be run and
modified. Its size is over 4 megabytes, so I recommend a Pentium computer platform
with at least 16 MB of RAM. It and the other files should not be duplicated or
distributed without the written consent of the author.

The installation program’s default directory is C:\MATLAB\TOOLBOX\NN_FUZZY,
which you can change if you wish. This should result in the extraction of 100 files that
require about 5 megabytes of disk space. The contents.m file gives a brief description of
the MATLAB files that were extracted into the directory. The following is a description
of the files:

master.doc    This supplement in MS Word 7.0 (PC)
readme.txt    A text version of this software description section
*.m           MATLAB script and function files (67)
*.mat         MATLAB data files (31)

INTRODUCTION TO THE MATLAB SUPPLEMENT

This supplement uses the mathematical tools of the educational version of MATLAB to
demonstrate some of the important concepts presented in Fuzzy and Neural Approaches
in Engineering, by Lefteri H. Tsoukalas and Robert E. Uhrig, published by John
Wiley & Sons. This book integrates the two technologies of fuzzy logic systems and
neural networks. These two advanced information processing technologies have undergone
explosive growth in the past few years. However, each field has developed independently
of the other with its own nomenclature and symbology. Although there appears to be little
that is common between the two fields, they are actually closely related and are being
integrated in many applications. Indeed, these two technologies form the core of the
discipline called SOFT COMPUTING, a name directly attributed to Lotfi Zadeh. Fuzzy
and Neural Approaches in Engineering presents both in a clear and concise framework.

This supplement was written using the MATLAB notebook and Microsoft WORD ver.
7.0. The notebook allows MATLAB commands to be entered and evaluated while in the
Word environment. This allows the document to both briefly explain the theoretical
details and also show the MATLAB implementation. It allows the user to experiment
with changing the MATLAB code fragments in order to gain a better understanding of
the application.

This supplement contains numerous examples that demonstrate the practical
implementation of neural, fuzzy, and hybrid processing techniques using MATLAB.
Although MATLAB toolboxes for Fuzzy Logic [Jang and Gulley, 1995] and Neural
Networks [Demuth and Beale, 1994] exist, they are not required to run the examples given
in this supplement. This supplement should be considered a brief introduction to the
MATLAB implementation of neural and fuzzy systems, and the author strongly
recommends the use of the Neural Networks Toolbox and the Fuzzy Logic Toolbox for a
more in-depth study of these information processing technologies. Some of the examples in
this supplement are not written in a general format and will have to be altered significantly
to solve specific problems; other examples and m-files are extremely general and
portable.

INTRODUCTION TO MATLAB

MATLAB is a technical computing environment that is published by The MathWorks. It
can run on many platforms including personal computers (Windows, DOS, Linux),
Macintosh, Sun, DEC, VAX and Cray. Applications are transportable
between the platforms.

MATLAB is the base package and was originally written as an easy interface to
LINPACK, which is a state of the art package for matrix computations. MATLAB has
functionality to perform or use:
· Matrix Arithmetic - add, divide, inverse, transpose, etc.
· Relational Operators - less than, not equal, etc.
· Logical operators - AND, OR, NOT, XOR
· Data Analysis - minimum, mean, covariance, etc.
· Elementary Functions - sin, acos, log, imaginary part, etc.
· Special Functions - Bessel, Hankel, error function, etc.
· Numerical Linear Algebra - LU decomposition, etc.
· Signal Processing - FFT, inverse FFT, etc.
· Polynomials - roots, fit polynomial, divide, etc.
· Non-linear Numerical Methods - solve DE, minimize functions, etc.
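
A few of the capabilities listed above can be tried directly at the command line. The following is only a minimal sketch; the matrix, vector, and polynomial values are arbitrary illustrations and are not taken from the text.

```matlab
% Illustrative values only; none of these numbers come from the text.
A = [4 1; 2 3];                 % 2-by-2 matrix
b = [1; 2];

At = A';                        % matrix arithmetic: transpose
x  = A\b;                       % solve A*x = b without forming inv(A)
[L,U,P] = lu(A);                % numerical linear algebra: P*A = L*U

isBig  = A > 2;                 % relational operator, returns a 0/1 matrix
anyBig = any(isBig(:)) & ~all(isBig(:));   % logical AND and NOT

r = roots([1 -3 2]);            % polynomials: roots of x^2 - 3x + 2
m = mean([1 2 3 4]);            % data analysis: mean of a vector
```

The backslash operator is used here instead of an explicit inverse because it is generally the more accurate and efficient way to solve a linear system in MATLAB.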

MATLAB is also used for graphics and visualization in both 2-D and 3-D.

MATLAB is a language in itself and can be used at the command line or in m-files.
A good reference for MATLAB programming is Mastering MATLAB by Duane Hanselman
and Bruce Littlefield, published by Prentice Hall (http://www.prenhall.com/). These
authors also wrote the user guide for the student edition of MATLAB. There are two
types of MATLAB m-files: scripts and functions.

1. Scripts are standard MATLAB programs that run as if they were typed into the
command window.

2. Functions are compiled m-files that are stored in memory. Most MATLAB
commands are functions and are accessible to the user. This allows the user to modify
the functions to suit a specific application.

MATLAB m-files may contain:

· Standard programming constructs such as IF, else, break, while, etc.
· C style file I/O such as open, read, write formatted, etc.
· String manipulation commands: number to string, test for string, etc.
· Debugging commands: set breakpoint, resume, show status, etc.
· Graphical User Interfaces such as pull down menus, radio buttons, sliders, dialog
boxes, mouse-button events, etc.
· On-Line help routines for all functions.
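
To make the script and function distinction concrete, the sketch below combines both in one file and uses a few of the constructs listed above (if/elseif, a for loop, and C style string formatting). The name clip01 is a hypothetical helper and is not one of the m-files supplied with this supplement. Defining a function at the end of a script assumes MATLAB R2016b or later; in older releases the function would have to be saved in its own file, clip01.m.

```matlab
% Script part: runs in the base workspace, as if typed at the prompt.
t = -0.5:0.25:1.5;
y = clip01(t);                       % call the function defined below
disp(sprintf('%5.2f ', y));          % C style formatted output

% Function part: has its own workspace and explicit inputs and outputs.
% clip01 is a hypothetical helper, not one of the supplied m-files.
function y = clip01(x)
% CLIP01 Limit each element of x to the interval [0, 1].
y = x;
for k = 1:numel(x)
    if x(k) < 0
        y(k) = 0;
    elseif x(k) > 1
        y(k) = 1;
    end
end
end
```

Variables created inside clip01 (such as k) are not visible in the command window afterwards, whereas t and y, created by the script part, are.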

MATLAB also contains methods for external interfacing capabilities:
· For data import/export from ASCII files, binary, etc.
· To files and directories: chdir, dir, time, etc.
· To external interface libraries: C and FORTRAN callable external interface libraries.
· To dynamic linking libraries: (MEX files) allows C or FORTRAN routines to be
linked directly into MATLAB at run time. This also allows access to A/D cards.
· Computational engine service library: Allows C and FORTRAN programs to call and
access MATLAB routines.
· Dynamic Data Exchange: Allows MATLAB to communicate with other Windows
applications.
MATLAB Toolboxes
Toolboxes are add-on packages that perform application-specific functions. MATLAB
toolboxes are collections of functions that can be used from the command line, from
scripts, or called from other functions. They are written in MATLAB and stored as
m-files; this allows the user to modify them to meet his or her needs.

A partial listing of these toolboxes includes:
· Signal Processing
· Image Processing
· Symbolic Math
· Neural Networks
· Statistics
· Spline
· Control System
· Robust Control
· Model Predictive Control
· Non-Linear Control
· System Identification
· Mu Analysis
· Optimization
· Fuzzy Logic
· Hi-Spec
· Chemometrics
SIMULINK
SIMULINK is a MATLAB toolbox which provides an environment for modeling,
analyzing, and simulating linear and non-linear dynamic systems. SIMULINK provides
a graphical user interface that supports click and drag of blocks that can be connected to
form complex systems. SIMULINK functionality includes:

· Live displays that let you observe variables as the simulation runs.
· Linear approximations of non-linear systems can be made.
· MATLAB functions or MEX (C and FORTRAN) functions can be called.
· C code can be generated from your models.
· Output can be saved to file for later analysis.
· Systems of blocks can be combined into larger blocks to aid in program structuring.
· New blocks can be created to perform special functions such as simulating neural or
fuzzy systems.
· Discrete or continuous simulations can be run.
· Seven different integration algorithms can be used depending on the system type:
linear, stiff, etc.

SIMULINK’s Real Time Workshop can be used for rapid prototyping, embedded real
time control, real-time simulation, and stand-alone simulation. This toolbox
automatically generates stand-alone C code.
User Contributed Toolboxes
Several user contributed toolboxes are available for download at the MATLAB FTP site,
ftp.mathworks.com, by means of anonymous user access. Some that may be of interest
are:

Genetic Algorithm Toolbox:
A freeware toolbox developed by a MathWorks employee that will probably become a
full toolbox in the future.

FISMAT Toolbox:
A fuzzy inference system toolbox developed in Australia that incorporates several
extensions to the fuzzy logic toolbox.

IFR-Fuzzy Toolbox:
User contributed fuzzy-control toolbox.

There are also thousands of user contributed m-files on hundreds of topics ranging from
Microorbit mission analysis to sound spectrogram printing to Lagrange interpolation.
In addition to these, there are also several other MATLAB tools that are published by
other companies. The most relevant of these is the Fuzzy Systems Toolbox, developed by
Mark H. Beale and Howard B. Demuth of the University of Idaho and published by
PWS (http://www.thomson.com/pws/default.html). This toolbox goes into greater detail
than the MATLAB toolbox and better explains the lower level programming used in the
functions. These authors also wrote the MATLAB Neural Network Toolbox.
MATLAB Publications
The following publications are available at The MathWorks WWW site.
WWW Address: http://www.mathworks.com
FTP Address: ftp.mathworks.com
Login: anonymous
Password: "your user address"

List of MATLAB based books.
MATLAB Digest: electronic newsletter.
MATLAB Quarterly Newsletter: News and Notes.
MATLAB Technical Notes
MATLAB Frequently Asked Questions
MATLAB Conference Archive; this conference is held every other year.
MATLAB USENET Newsgroup archive.
FTP Server also provides technical references such as papers and article reprints.

MATLAB APPLICATIONS

The following chapters implement the theory and techniques discussed in the text. These
MATLAB implementations can be executed by placing the cursor in the code fragment
and selecting "evaluate cell" located in the Notebook menu. The executable code
fragments are green when viewed in the Word notebook and the answers are blue. Since
this supplement is printed in black and white, the code fragments will be represented by
10 point Courier New gray scale. The regular text is in 12 point Times New Roman
black.

Some of these implementations use m-file functions or data files. These are included on
the disk that comes with this supplement. Also included is an MS Word file of this
document. The file contents.m lists and gives a brief description of all the m-files
included with this supplement.

The following code segment is an autoinit cell and is executed each time the notebook is
opened. If it does not execute when the document is opened, execute it manually. It
performs three functions:

1. whitebg([1 1 1]) gives the figures a white background.
2. set(0, 'DefaultAxesColorOrder', [0 0 0]); close(gcf) sets the line colors in all figures
to black. This produces black and white pages for printing but can be deleted for
color.
3. cd d:/nn_fuzzy changes the current MATLAB directory to the directory where the
m-files associated with this supplement are located. If you installed the files in another
directory, you need to change the code to point to the directory where they are
installed.

whitebg([1 1 1]);
set(0, 'DefaultAxesColorOrder', [0 0 0]); close(gcf)
cd d:/nn_fuzzy
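
If the files were installed somewhere other than d:/nn_fuzzy, the directory change in the cell above fails with an error. One possible variant, shown only as a sketch, checks for the directory first. The variable name nnFuzzyDir is an assumption introduced here for illustration, and the sketch covers only the line color and directory steps.

```matlab
% nnFuzzyDir is an assumed install location; edit it to match your system.
nnFuzzyDir = 'd:/nn_fuzzy';

set(0, 'DefaultAxesColorOrder', [0 0 0]);   % black lines, as in the original cell

if exist(nnFuzzyDir, 'dir') == 7            % 7 means the name is a directory
    cd(nnFuzzyDir);                         % same effect as "cd d:/nn_fuzzy"
else
    warning('Directory %s not found; the supplement m-files may not run.', nnFuzzyDir);
end
```

With this form, a wrong path produces a warning instead of stopping the autoinit cell.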

Chapter 1 Introduction to Hybrid Artificial Intelligence Systems
Chapter 1 of Fuzzy and Neural Approaches in Engineering gives a brief description of
the benefits of integrating Fuzzy Logic, Neural Networks, Genetic Algorithms, and
Expert Systems. Several applications are described but no specific algorithms or
architectures are presented in enough detail to warrant their implementation in
MATLAB.

In the following chapters, the algorithms and applications described in Fuzzy and Neural
Approaches in Engineering will be implemented in MATLAB code. This code can be
run from the WORD Notebook when the directory containing the m-files associated
with this supplement is the active directory in MATLAB's command window. In many
of the chapters, the code must be executed sequentially since earlier code fragments may
create data or variables used in later fragments.

Chapters 2 through 6 implement Fuzzy Logic, Chapters 7 through 11 implement
Artificial Neural Networks, and Chapters 12 and 13 implement fuzzy-neural hybrid systems.
Chapters 14 through 17 do not contain MATLAB implementations but do point the
reader towards references or user contributed toolboxes. This supplement will be
updated and expanded as suggestions are received and as time permits. Updates are
expected to be posted on the John Wiley & Sons WWW page but may be posted on the
University of Tennessee web site. Further information should be available from the author at
hines@utkux.utk.edu.
Chapter 2 Foundations of Fuzzy Approaches
This chapter will present the building blocks that lay the foundation for constructing
fuzzy systems. These building blocks include membership functions, linguistic
modifiers, and alpha cuts.
2.1 Union, Intersection and Complement of a Fuzzy Set
A Zadeh diagram is a graph that shows the membership of crisp input values to a fuzzy
set. The Zadeh diagrams for two membership functions, A (small numbers) and B
(about 8), are constructed below.

x=[0:0.1:20];
muA=1./(1+(x./5).^3);
muB=1./(1+.3.*(x-8).^2);
plot(x,muA,x,muB);
title('Zadeh diagram for the Fuzzy Sets A and B');
text(1,.8,'Set A');text(7,.8,'Set B')
xlabel('Number');ylabel('Membership');

[Figure: Zadeh diagram for the Fuzzy Sets A and B, with curves labeled Set A and Set B. Horizontal axis: Number; vertical axis: Membership.]


The horizontal axis of a Zadeh diagram is called the universe of discourse. The universe
of discourse is the range of values where the fuzzy set is defined. The vertical axis is the
membership of a value, in the universe of discourse, to the fuzzy set. The membership of
a number x to a fuzzy set A is represented by $\mu_A(x)$.

The union of the two fuzzy sets is calculated using the max function. We can see that
this results in the membership of a number to the union being the maximum of its
membership to either of the two initial fuzzy sets. The union of the fuzzy sets A and B is
calculated below.

union=max(muA,muB);plot(x,union);
title('Union of the Fuzzy Sets A and B');
xlabel('Number');
ylabel('Membership');

[Figure: Union of the Fuzzy Sets A and B. Horizontal axis: Number; vertical axis: Membership.]


The intersection of the two fuzzy sets is calculated using the min function. We can see
that this results in the membership of a number to the intersection being the minimum of
its membership to either of the two initial fuzzy sets. The intersection of the fuzzy sets A
and B is calculated below.

intersection=min(muA,muB);
plot(x,intersection);
title('Intersection of the Fuzzy Sets A and B');
xlabel('Number');
ylabel('Membership');

[Figure: Intersection of the Fuzzy Sets A and B. Horizontal axis: Number; vertical axis: Membership.]


The complement of the fuzzy set B (about 8) is obtained by subtracting each membership
value from 1, and is calculated below.

complement=1-muB;
plot(x,complement);
title('Complement of the Fuzzy Set B');
xlabel('Number');
ylabel('Membership');

[Figure: Complement of the Fuzzy Set B. Horizontal axis: Number; vertical axis: Membership.]

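As a quick consistency check on the three operators above, the following sketch (not part of the original supplement) verifies numerically that the max, min and 1-minus definitions satisfy De Morgan's law: the complement of the union equals the intersection of the complements. The names lhs, rhs and demorgan_holds are illustrative only.

```matlab
% Illustrative check, assuming the same sets A and B as above.
x = 0:0.1:20;
muA = 1./(1+(x./5).^3);        % small numbers
muB = 1./(1+.3.*(x-8).^2);     % about 8
lhs = 1 - max(muA,muB);        % complement of the union
rhs = min(1-muA, 1-muB);       % intersection of the complements
demorgan_holds = max(abs(lhs-rhs)) < 1e-12
```

The same pattern can be used to test other identities, such as the complement of the intersection against the union of the complements.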

2.2 Concentration and Dilation
The concentration of a fuzzy set is equivalent to linguistically modifying it by the term
VERY. The concentration of small numbers is therefore VERY small numbers and can be
quantitatively represented by squaring the membership value. This is computed in the
function very(mf).

x=[0:0.1:20];
muA=1./(1+(x./5).^3);
muvsb=very(muA);
plot(x,muA,x,muvsb);
title('Zadeh diagram for the Fuzzy Sets A and VERY A');
xlabel('Number');
ylabel('Membership');
text(1,.5,'Very A');
text(7,.5,'Set A')

[Figure: Zadeh diagram for the Fuzzy Sets A and VERY A, with curves labeled Very A and Set A. Horizontal axis: Number; vertical axis: Membership.]


The dilation of a fuzzy set is equivalent to linguistically modifying it by the term MORE
OR LESS. The dilation of small numbers is therefore MORE OR LESS small numbers
and can be quantitatively represented by taking the square root of the membership value.
This is computed in the function moreless(mf).

x=[0:0.1:20];
muA=1./(1+(x./5).^3);
muvsb=moreless(muA);
plot(x,muA,x,muvsb);
title('Zadeh diagram for the Fuzzy Sets A and MORE or LESS A');
xlabel('Number');
ylabel('Membership');
text(2,.5,'Set A');
text(9,.5,'More or Less A')

[Figure: Zadeh diagram for the Fuzzy Sets A and MORE or LESS A, with curves labeled Set A and More or Less A. Horizontal axis: Number; vertical axis: Membership.]

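The two modifiers can also be written inline. The sketch below is an assumption about what very(mf) and moreless(mf) compute, based only on the squaring and square-root descriptions in this section; it is not the code of the supplement's m-files. It also checks that concentration never raises a grade and dilation never lowers one.

```matlab
% Illustrative stand-ins for very(mf) and moreless(mf); names are hypothetical.
x = 0:0.1:20;
muA = 1./(1+(x./5).^3);
very_sketch = muA.^2;            % concentration: square the grades
moreless_sketch = sqrt(muA);     % dilation: square root of the grades
% For grades in [0,1]: VERY A <= A <= MORE OR LESS A at every point.
ordered = all(very_sketch <= muA & muA <= moreless_sketch)
```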

2.3 Contrast Intensification
A fuzzy set can have its fuzziness intensified; this is called contrast intensification. A
membership function can be represented by an exponential fuzzifier $F_1$ and a
denominator fuzzifier $F_2$. The following equation describes a fuzzy set large numbers.

$$\mu(x) = \frac{1}{1 + \left(\frac{x}{F_2}\right)^{-F_1}}$$


Letting $F_1$ vary over {1 2 4 10 100} with $F_2$ = 50 results in a family of curves with
slopes increasing as $F_1$ increases.

F1=[1 2 4 10 100];
F2=50;
x=[0:1:100];
muA=zeros(length(F1),length(x));
for i=1:length(F1);
muA(i,:)=1./(1+(x./F2).^(-F1(i)));
end
plot(x,muA);
title('Contrast Intensification');
xlabel('Number')
ylabel('Membership')
text(5,.3,'F1 = 1');text(55,.2,'F1 = 100');

[Figure: Contrast Intensification, with curves labeled F1 = 1 through F1 = 100. Horizontal axis: Number; vertical axis: Membership.]

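One property of this family is easy to confirm numerically: every curve passes through a grade of 0.5 at x = F2, whatever the value of F1, so F1 only changes the steepness around that crossover point. The sketch below evaluates the membership equation at x = F2; the names x0 and mu_at_F2 are illustrative.

```matlab
% Evaluate the membership equation at the crossover point x = F2.
F1 = [1 2 4 10 100];
F2 = 50;
x0 = F2;
mu_at_F2 = 1./(1+(x0./F2).^(-F1))   % one grade per value of F1
```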

Letting $F_2$ vary over {30 40 50 60 70} with $F_1$ = 4 results in the following family of curves.

F1=4;F2=[30 40 50 60 70];
for i=1:length(F2);
muA(i,:)=1./(1+(x./F2(i)).^(-F1));
end
plot(x,muA);title('Contrast Intensification');
xlabel('Number');ylabel('Membership')
text(10,.5,'F2 = 30');text(75,.5,'F2 = 70');

[Figure: Contrast Intensification, with curves labeled F2 = 30 through F2 = 70. Horizontal axis: Number; vertical axis: Membership.]

2.4 Extension Principle
The extension principle is a mathematical tool for extending crisp mathematical notions
and operations to the milieu of fuzziness. Consider a function that maps points from the
X-axis to the Y-axis in the Cartesian plane:

$$y = f(x) = \sqrt{1 - \frac{x^2}{4}}$$

This is the upper half of an ellipse centered at the origin. The plot below also draws the
lower half, -y, to show the complete ellipse.

x=[-2:.1:2];
y=(1-x.^2/4).^.5;
plot(x,y,x,-y);
title('Functional Mapping')
xlabel('x');ylabel('y');

[Figure: Functional Mapping. Horizontal axis: x; vertical axis: y.]


Suppose fuzzy set A is defined as
$$A = \int_{-2 \le x \le 2} \tfrac{1}{2}|x| \,/\, x$$
that is, $\mu_A(x) = \tfrac{1}{2}|x|$ for $-2 \le x \le 2$.

mua=0.5.*abs(x);
plot(x,mua)
title('Fuzzy Set A');
xlabel('x');
ylabel('Membership of x to A');

[Figure: Fuzzy Set A. Horizontal axis: x; vertical axis: Membership of x to A.]


Solving for x in terms of y we get:
$$x = \pm 2\sqrt{1 - y^2}.$$
Since $\mu_A(x) = \tfrac{1}{2}|x|$ gives the same grade for both roots, the membership
function of y to B is
$$\mu_B(y) = \sqrt{1 - y^2}.$$
y=[-1:.05:1];
mub=(1-y.^2).^.5;
plot(y,mub)
title('Fuzzy Set B');
xlabel('y');ylabel('Membership of y to B');

[Figure: Fuzzy Set B. Horizontal axis: y; vertical axis: Membership of y to B.]


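The closed-form result can be cross-checked by pushing each x through the mapping and carrying its grade along, which is what the extension principle prescribes. The sketch below is an illustrative check, not part of the original supplement: it maps every grid point x to y = f(x) and confirms that the grade of x in A equals the closed-form grade of y in B. The names y_of_x, mub_closed and max_err are hypothetical.

```matlab
% Extension principle check on the upper half of the ellipse.
x = -2:0.1:2;
mua = 0.5.*abs(x);                        % grade of x in A
y_of_x = sqrt(max(0, 1 - x.^2/4));        % image of each x (max guards round-off)
mub_closed = sqrt(max(0, 1 - y_of_x.^2)); % closed-form grade of y in B
max_err = max(abs(mua - mub_closed))      % should be at round-off level
```

Because x and -x map to the same y with the same grade, taking the maximum over the preimage does not change the result in this example.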
The geometric interpretation is shown below.

set(gcf,'color',[1 1 1]);
x=[-2:.2:2];
mua=0.5.*abs(x);
y=[-1:.1:1];
mub=(1-y.^2).^.5;
[X,Y] = meshgrid(x,y);
Z=.5*abs(X).*(1-Y.^2).^.5;
mesh(X,Y,Z);
axis([-2 2 -1 1 -1 1])
colormap(1-gray)
view([0 90]);
shading interp
xlabel('x')
ylabel('y')
title('Fuzzy Region Inside and Outside the Ellipse')

[Figure: Fuzzy Region Inside and Outside the Ellipse. Horizontal axis: x; vertical axis: y.]


2.5 Alpha Cuts
An alpha cut is a crisp set that contains the elements whose membership (also called
grade) is greater than or equal to a given value. Consider a fuzzy set whose
membership function is $\mu_A(x) = \frac{1}{1 + 0.01(x-50)^2}$. Suppose we are interested in
the portion of the membership function where the membership is at least 0.2. The 0.2
alpha cut is given by:

x=[0:1:100];
mua=1./(1+0.01.*(x-50).^2);
alpha_cut = mua>=.2;
plot(x,alpha_cut)
title('0.2 Level Fuzzy Set of A');
xlabel('x');
ylabel('Membership of x');

[Figure: 0.2 Level Fuzzy Set of A. Horizontal axis: x; vertical axis: Membership of x.]


The function alpha is written to return the minimum and maximum values where an alpha
cut is one. This function will be used in subsequent exercises.

function [a,b] = alpha(FS,x,level);
% [a,b] = alpha(FS,x,level)
%
% Returns the alpha cut for the fuzzy set at a given level.
% FS : the grades of a fuzzy set.
% x : the universe of discourse
% level : the level of the alpha cut
% [a,b] : the minimum and maximum values of x in the alpha cut
%
ind=find(FS>=level);
a=x(min(ind));
b=x(max(ind));

[a,b]=alpha(mua,x,.2)

a =
30
b =
70

Chapter 3 Fuzzy Relationships
Fuzzy if/then rules and their aggregations are fuzzy relations in linguistic disguise and
can be thought of as fuzzy sets defined over high dimensional universes of discourse.
3.1 A Similarity Relation
Suppose a relation R is defined as "x is near the origin AND near y". This can be
expressed as $\mu_R(x,y) = e^{-(x^2+y^2)}$. The membership function is graphed below
over the universe of discourse.

[x,y]=meshgrid(-2:.2:2,-2:.2:2);
mur=exp(-1*(x.^2+y.^2));
surf(x,y,mur)
xlabel('x')
ylabel('y')
zlabel('Membership to the Fuzzy Set R')

[Figure: surface plot of the relation R. Axes: x, y, and Membership to the Fuzzy Set R.]


3.2 Union and Intersection of Fuzzy Relations
Suppose a relation R1 is defined as "x is near y AND near the origin", and a relation R2
is defined as "x is NOT near the origin". The union R1 OR R2 is defined as:

mur1=exp(-1*(x.^2+y.^2));
mur2=1-exp(-1*(x.^2+y.^2));
surf(x,y,max(mur1,mur2))
xlabel('x')
ylabel('y')
zlabel('Union of R1 and R2')

[Figure: surface plot of the union. Axes: x, y, and Union of R1 and R2.]


The intersection R1 AND R2 is defined as:

mur1=exp(-1*(x.^2+y.^2));
mur2=1-exp(-1*(x.^2+y.^2));
surf(x,y,min(mur1,mur2))
xlabel('x')
ylabel('y')
zlabel('Intersection of R1 and R2')

[Figure: surface plot of the intersection. Axes: x, y, and Intersection of R1 and R2.]


3.3 Max-Min Composition
The max-min composition uses the max and min operators described in section 3.2.
Suppose two relations are defined as follows:

$$
R_1 =
\begin{bmatrix}
\mu_{R_1}(x_1,y_1) & \mu_{R_1}(x_1,y_2) & \mu_{R_1}(x_1,y_3) & \mu_{R_1}(x_1,y_4) \\
\mu_{R_1}(x_2,y_1) & \mu_{R_1}(x_2,y_2) & \mu_{R_1}(x_2,y_3) & \mu_{R_1}(x_2,y_4) \\
\mu_{R_1}(x_3,y_1) & \mu_{R_1}(x_3,y_2) & \mu_{R_1}(x_3,y_3) & \mu_{R_1}(x_3,y_4) \\
\mu_{R_1}(x_4,y_1) & \mu_{R_1}(x_4,y_2) & \mu_{R_1}(x_4,y_3) & \mu_{R_1}(x_4,y_4)
\end{bmatrix}
=
\begin{bmatrix}
1.0 & 0.3 & 0.9 & 0.0 \\
0.3 & 1.0 & 0.8 & 1.0 \\
0.9 & 0.8 & 1.0 & 0.8 \\
0.0 & 1.0 & 0.8 & 1.0
\end{bmatrix}
$$

$$
R_2 =
\begin{bmatrix}
\mu_{R_2}(x_1,y_1) & \mu_{R_2}(x_1,y_2) & \mu_{R_2}(x_1,y_3) \\
\mu_{R_2}(x_2,y_1) & \mu_{R_2}(x_2,y_2) & \mu_{R_2}(x_2,y_3) \\
\mu_{R_2}(x_3,y_1) & \mu_{R_2}(x_3,y_2) & \mu_{R_2}(x_3,y_3) \\
\mu_{R_2}(x_4,y_1) & \mu_{R_2}(x_4,y_2) & \mu_{R_2}(x_4,y_3)
\end{bmatrix}
=
\begin{bmatrix}
1.0 & 1.0 & 0.9 \\
1.0 & 0.0 & 0.5 \\
0.3 & 0.1 & 0.0 \\
0.2 & 0.3 & 0.1
\end{bmatrix}
$$

Their max-min composition is defined in its matrix form as:

$$
R_1 \circ R_2 =
\begin{bmatrix}
1.0 & 0.3 & 0.9 & 0.0 \\
0.3 & 1.0 & 0.8 & 1.0 \\
0.9 & 0.8 & 1.0 & 0.8 \\
0.0 & 1.0 & 0.8 & 1.0
\end{bmatrix}
\circ
\begin{bmatrix}
1.0 & 1.0 & 0.9 \\
1.0 & 0.0 & 0.5 \\
0.3 & 0.1 & 0.0 \\
0.2 & 0.3 & 0.1
\end{bmatrix}
$$


Using MATLAB to compute the max-min composition:

R1=[1.0 0.3 0.9 0.0;0.3 1.0 0.8 1.0;0.9 0.8 1.0 0.8;0.0 1.0 0.8 1.0];
R2=[1.0 1.0 0.9;1.0 0.0 0.5; 0.3 0.1 0.0;0.2 0.3 0.1];
[r1,c1]=size(R1); [r2,c2]=size(R2);
R0=zeros(r1,c2);
for i=1:r1;
for j=1:c2;
R0(i,j)=max(min(R1(i,:),R2(:,j)'));
end
end
R0

R0 =
1.0000 1.0000 0.9000
1.0000 0.3000 0.5000
0.9000 0.9000 0.9000
1.0000 0.3000 0.5000

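To make the row-by-column rule concrete, here is the arithmetic for a single element, worked by hand from the two matrices of this section. The entry in row 2, column 3 of the composition takes the element-wise minimum of row 2 of R1 and column 3 of R2, and then the maximum of those minima:

```latex
\begin{aligned}
(R_1 \circ R_2)_{2,3}
  &= \max\bigl[\min(0.3, 0.9),\ \min(1.0, 0.5),\ \min(0.8, 0.0),\ \min(1.0, 0.1)\bigr] \\
  &= \max\bigl[0.3,\ 0.5,\ 0.0,\ 0.1\bigr] \\
  &= 0.5
\end{aligned}
```

This agrees with the value of R0(2,3) in the MATLAB output.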
Chapter 4 Fuzzy Numbers
Fuzzy numbers are fuzzy sets used in connection with applications where an explicit
representation of the ambiguity and uncertainty found in numerical data is desirable.
4.1 Addition and Subtraction of Discrete Fuzzy Numbers
Addition of two fuzzy numbers can be performed using the extension principle.

Suppose you have two fuzzy numbers that are represented tabularly. They are the fuzzy
number 3 (FN3) and the fuzzy number 7 (FN7).

FN3=0/0 + 0.3/1 + 0.7/2 + 1.0/3 + 0.7/4 + 0.3/5 + 0/6
FN7=0/4 + 0.2/5 + 0.6/6 + 1.0/7 + 0.6/8 + 0.2/9 + 0/10

To define these fuzzy numbers using MATLAB:

x = [1 2 3 4 5 6 7 8 9 10];
FN3 = [0.3 0.7 1.0 0.7 0.3 0 0 0 0 0];
FN7 = [0 0 0 0 0.2 0.6 1.0 0.6 0.2 0];
bar(x',[FN3' FN7']); axis([0 11 0 1.1])
title('Fuzzy Numbers 3 and 7');
xlabel('x');
ylabel('membership')
text(2,1.05,'Fuzzy Number 3')
text(6,1.05,'Fuzzy Number 7');

[Figure: bar chart of Fuzzy Numbers 3 and 7, with bars labeled Fuzzy Number 3 and Fuzzy Number 7. Horizontal axis: x; vertical axis: membership.]

Adding fuzzy number 3 to fuzzy number 7 results in a fuzzy number 10 using the alpha cut
procedure described in the book.

By hand we have:    FN3      FN7      FN10 = FN3+FN7
0.2 alpha cut:      [1 5]    [5 9]    [6 14]
0.3 alpha cut:      [1 5]    [6 8]    [7 13]
0.6 alpha cut:      [2 4]    [6 8]    [8 12]
0.7 alpha cut:      [2 4]    [7 7]    [9 11]
1.0 alpha cut:      [3 3]    [7 7]    [10 10]

FN10 = .2/6 + .3/7 + .6/8 + .7/9 + 1/10 + .7/11 + .6/12 + .3/13 + .2/14

x=[1:1:20];
FNSUM=zeros(size(x));
for i=.1:.1:1
[a1,b1]=alpha(FN3,x,i-eps); % Subtract eps to guard against round-off in the loop increments
[a2,b2]=alpha(FN7,x,i-eps);
a=a1+a2;
b=b1+b2;
FNSUM(a:b)=i*ones(size(FNSUM(a:b)));
end
bar(x,FNSUM); axis([0 20 0 1.1])
title('Fuzzy Number 3+7=10')
xlabel('x')
ylabel('membership')

[Figure: bar chart titled Fuzzy Number 3+7=10. Horizontal axis: x; vertical axis: membership.]

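The hand calculations and loops in this chapter all apply the same interval arithmetic to each pair of alpha cuts. As a summary sketch, writing the alpha cuts of the two fuzzy numbers as $[a_1, b_1]$ and $[a_2, b_2]$ (notation introduced here for illustration), the rules used by the code are:

```latex
\begin{aligned}
[a_1, b_1] + [a_2, b_2] &= [\,a_1 + a_2,\ b_1 + b_2\,] \\
[a_1, b_1] - [a_2, b_2] &= [\,a_1 - b_2,\ b_1 - a_2\,] \\
[a_1, b_1] \times [a_2, b_2] &= [\,a_1 a_2,\ b_1 b_2\,] \\
[a_1, b_1] \div [a_2, b_2] &= [\,a_1 / b_2,\ b_1 / a_2\,]
\end{aligned}
```

The product and quotient forms as written assume that all endpoints are positive, which holds for the fuzzy numbers used in this chapter. For example, at the 0.6 level the sum is $[2, 4] + [6, 8] = [8, 12]$, matching the table above.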

The following program subtracts the fuzzy number 3 from the fuzzy number 8 to get a
fuzzy number 8-3=5.

By hand we have:    FN3      FN8      FN5 = FN8-FN3
0.2 alpha cut:      [1 5]    [6 10]   [1 9]
0.3 alpha cut:      [1 5]    [7 9]    [2 8]
0.6 alpha cut:      [2 4]    [7 9]    [3 7]
0.7 alpha cut:      [2 4]    [8 8]    [4 6]
1.0 alpha cut:      [3 3]    [8 8]    [5 5]

FN5 = .2/1 + .3/2 + .6/3 + .7/4 + 1/5+ .7/6 + .6/7 + .3/8 + .2/9

x=[1:1:11];
FN3 = [0.3 0.7 1.0 0.7 0.3 0 0 0 0 0];
FN8 = [0 0 0 0 0 0.2 0.6 1.0 0.6 0.2];
FNDIFF=zeros(size(x));
for i=.1:.1:1
[a1,a2]=alpha(FN8,x,i-eps);
[b1,b2]=alpha(FN3,x,i-eps);
a=a1-b2;
b=a2-b1;
FNDIFF(a:b)=i*ones(size(FNDIFF(a:b)));
end
bar(x,FNDIFF);axis([0 11 0 1.1])
title('Fuzzy Number 8-3=5')
xlabel('x')
ylabel('Membership')

[Figure: bar chart titled Fuzzy Number 8-3=5. Horizontal axis: x; vertical axis: Membership.]


4.2 Multiplication of Discrete Fuzzy Numbers
This program multiplies the fuzzy number 3 by the fuzzy number 7 to get a fuzzy number
3*7=21, where the fuzzy numbers 3 and 7 are defined as in Section 4.1. The
multiplication of continuous fuzzy numbers is somewhat messy and will not be
implemented in MATLAB.

By hand we have:    FN3      FN7      FN21 = FN3*FN7
0.2 alpha cut:      [1 5]    [5 9]    [5 45]
0.3 alpha cut:      [1 5]    [6 8]    [6 40]
0.6 alpha cut:      [2 4]    [6 8]    [12 32]
0.7 alpha cut:      [2 4]    [7 7]    [14 28]
1.0 alpha cut:      [3 3]    [7 7]    [21 21]

FN21 = .2/5 + .3/6 + .6/12 + .7/14 + 1/21 + .7/28 + .6/32 + .3/40 + .2/45

x=[1:1:60]; % Universe of Discourse
FN3 = [0.3 0.7 1.0 0.7 0.3 0 0 0 0 0 0];
FN7 = [0 0 0 0 0.2 0.6 1.0 0.6 0.2 0 0];
FNPROD=zeros(size(x));
for i=.1:.1:1
[a1,a2]=alpha(FN3,x,i-eps);
[b1,b2]=alpha(FN7,x,i-eps);
a=a1*b1;
b=a2*b2;
FNPROD(a:b)=i*ones(size(FNPROD(a:b)));
end
bar(x,FNPROD);axis([0 60 0 1.1])
title('Fuzzy Number 3*7=21')
xlabel('Fuzzy Number 21')
ylabel('Membership')

[Figure: bar chart titled Fuzzy Number 3*7=21. Horizontal axis: Fuzzy Number 21; vertical axis: Membership.]


4.3 Division of Discrete Fuzzy Numbers
This program divides the fuzzy number 6 by the fuzzy number 3 to get a fuzzy number 2.
The division of continuous fuzzy numbers is somewhat messy and will not be
implemented in MATLAB.

By hand we have:    FN3      FN6      FN2 = FN6/FN3
0.2 alpha cut:      [1 5]    [4 8]    [4/5 8/1]
0.3 alpha cut:      [1 5]    [5 7]    [5/5 7/1]
0.6 alpha cut:      [2 4]    [5 7]    [5/4 7/2]
0.7 alpha cut:      [2 4]    [6 6]    [6/4 6/2]
1.0 alpha cut:      [3 3]    [6 6]    [6/3 6/3]

FN2 = .2/.8 + .3/1 + .6/1.25 + .7/1.5 + 1/2 + .7/3 + .6/3.5 + .3/7 + .2/8

x=[1:1:12]; % Universe of Discourse
FN3 = [0.3 0.7 1.0 0.7 0.3 0 0 0 0 0];
FN6 = [0 0 0 0.2 0.6 1.0 0.6 0.2 0 0];
FNDIV=zeros(size(x));
for i=.1:.1:1
[a1,a2]=alpha(FN6,x,i-eps);
[b1,b2]=alpha(FN3,x,i-eps);
a=round(a1/b2);
b=round(a2/b1);
FNDIV(a:b)=i*ones(size(FNDIV(a:b)));
end
bar(x,FNDIV);axis([0 10 0 1.1])
title('Fuzzy Number 6/3=2')
xlabel('Fuzzy Number 2')
ylabel('Membership')

[Figure: bar chart titled Fuzzy Number 6/3=2. Horizontal axis: Fuzzy Number 2; vertical axis: Membership.]


Chapter 5 Linguistic Descriptions and Their Analytical Form
5.1 Generalized Modus Ponens
Fuzzy linguistic descriptions are formal representations of systems made through fuzzy
if/then rules. Generalized Modus Ponens (GMP) states that when a rule's antecedent is
met to some degree, its consequence is inferred to the same degree.

IF x is A THEN y is B
x is A'
so y is B'

This can be written using the implication relation (R(x,y)) as in the max-min composition
of section 3.3.

B'=A'°R(x,y)

Implication relations are explained in greater detail in section 5.3.
5.2 Membership Functions
This supplement contains functions that define triangular, trapezoidal, S-Shaped and Π-
shaped membership functions.
5.2.1 Triangular Membership Function
A triangular membership function is defined by the parameters [a b c], where a is the
membership function's left intercept with grade equal to 0, b is the center peak where the
grade equals 1 and c is the right intercept at grade equal to 0. The function
y=triangle(x,[a b c]) is written to return the membership values corresponding to the
defined universe of discourse x. The parameters [a b c] that define the triangular
membership function must be points in the discretely defined universe of discourse.

For example: A triangular membership function for "x is close to 33" defined over
x=[0:1:50] with [a b c]=[23 33 43] would be created with:

x=[0:1:50];
y=triangle(x,[23 33 43]);
plot(x,y);
title('Close to 33')
xlabel('X')
ylabel('Membership')

[Figure: Close to 33. Horizontal axis: X; vertical axis: Membership.]

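For readers without the supplement's m-files at hand, the shape described above can be reproduced in one line. The following is a minimal sketch under the assumption a < b < c; tri_sketch is a hypothetical stand-in and is not the code of the supplement's triangle.m.

```matlab
% Hypothetical stand-in for triangle(x,[a b c]); assumes a < b < c.
% Rising edge (x-a)/(b-a), falling edge (c-x)/(c-b), floored at zero.
tri_sketch = @(x,p) max(min((x-p(1))./(p(2)-p(1)), (p(3)-x)./(p(3)-p(2))), 0);
x = 0:1:50;
y = tri_sketch(x,[23 33 43]);
grades = y([24 29 34 39 44])   % grades at x = 23, 28, 33, 38 and 43
```

The grades rise from 0 at the left intercept to 1 at the peak and fall back to 0 at the right intercept, passing through 0.5 halfway up each side.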

A fuzzy variable temperature may have three fuzzy values: cool, medium and hot.
Membership functions defining these values can be constructed to overlap in the universe
of discourse [0:100]. A matrix with one row for each of the three fuzzy values can
be constructed. Suppose the following fuzzy value definitions are used:

x=[0:100];
cool=[0 25 50];
medium=[25 50 75];
hot=[50 75 100];
mf_cool=triangle(x,cool);
mf_medium =triangle(x,medium);
mf_hot=triangle(x,hot);
plot(x,[mf_cool;mf_medium;mf_hot])
title('Temperature: cool, medium and hot');
ylabel('Membership');
xlabel('Degrees')
text(20,.58,'Cool')
text(42,.58,'Medium')
text(70,.58,'Hot')

[Figure: Temperature: cool, medium and hot (triangular membership functions), with curves labeled Cool, Medium and Hot. Horizontal axis: Degrees; vertical axis: Membership.]


5.2.2 Trapezoidal Membership Function
As can be seen, a temperature value of 0 would have a 0 membership to all fuzzy sets.
Therefore, we should use trapezoidal membership functions to define the cool and hot
fuzzy sets.

x=[0:100];
cool=[0 0 25 50];
medium=[15 50 75];
hot=[50 75 100 100];
mf_cool=trapzoid(x,cool);
mf_medium =triangle(x,medium);
mf_hot=trapzoid(x,hot);
plot(x,[mf_cool;mf_medium;mf_hot]);
title('Temperature: cool, medium and hot');
ylabel('Membership');
xlabel('Degrees');
text(20,.65,'Cool')
text(42,.65,'Medium')
text(70,.65,'Hot')

[Figure: Temperature: cool, medium and hot (trapezoidal cool and hot), with curves labeled Cool, Medium and Hot. Horizontal axis: Degrees; vertical axis: Membership.]


The use of trapezoidal membership functions results in a 0 value of temperature being
properly represented by a membership value of 1 to the fuzzy set cool. Likewise, high
temperatures are properly represented with high membership values to the fuzzy set hot.
5.2.3 S-shaped Membership Function
An S-shaped membership function is defined by three parameters [α β γ] using the
following equations:

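$$
\text{S\_shape}(x;\alpha,\beta,\gamma) =
\begin{cases}
0 & \text{for } x \le \alpha \\
2\left(\dfrac{x-\alpha}{\gamma-\alpha}\right)^2 & \text{for } \alpha \le x \le \beta \\
1 - 2\left(\dfrac{x-\gamma}{\gamma-\alpha}\right)^2 & \text{for } \beta \le x \le \gamma \\
1 & \text{for } \gamma \le x
\end{cases}
$$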

where:
α = the point where u(x)=0
β = the point where u(x)=0.5
γ = the point where u(x)=1.0
note: β-α must equal γ-β for continuity of slope

x=[0:100];
cool=[50 25 0];
hot=[50 75 100];
mf_cool=s_shape(x,cool);
mf_hot=s_shape(x,hot);
plot(x,[mf_cool;mf_hot]);
title('Temperature: cool and hot');
ylabel('Membership');
xlabel('Degrees');
text(8,.45,'Cool')
text(82,.45,'Hot')

[Figure: Temperature: cool and hot (S-shaped membership functions), with curves labeled Cool and Hot. Horizontal axis: Degrees; vertical axis: Membership.]

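The piecewise definition translates directly into logical masks. The sketch below is an illustrative implementation for the ascending case only (α < β < γ); s_sketch is a hypothetical name and is not the supplement's s_shape.m, which also accepts the reversed parameter order used for the cool set above.

```matlab
% Hypothetical stand-in for s_shape with a < b < g (ascending S only).
% Each logical mask selects one branch of the piecewise definition.
s_sketch = @(x,a,b,g) (x>a & x<=b).*2.*((x-a)./(g-a)).^2 + ...
                      (x>b & x<g).*(1-2.*((x-g)./(g-a)).^2) + ...
                      (x>=g);
x = 0:100;
mf_hot = s_sketch(x,50,75,100);
vals = mf_hot([51 76 101])   % grades at x = 50, 75 and 100
```

The three sampled grades correspond to the points α, β and γ, where the definition requires grades of 0, 0.5 and 1.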

5.2.4 Π-shaped Membership Function
A Π-shaped membership function is defined by two parameters [γ,δ] using the following
equations:

$$
\text{P\_shape}(x;\gamma,\delta) =
\begin{cases}
\text{S\_shape}\left(x;\ \gamma-\delta,\ \gamma-\tfrac{\delta}{2},\ \gamma\right) & \text{for } x \le \gamma \\
1 - \text{S\_shape}\left(x;\ \gamma,\ \gamma+\tfrac{\delta}{2},\ \gamma+\delta\right) & \text{for } x \ge \gamma
\end{cases}
$$
where:
γ = center of the membership function
δ = width of the membership function at grade = 0.5.

x=[0:100];
cool=[25 20];
medium=[50 20];
hot=[75 20];
mf_cool=p_shape(x,cool);
mf_medium =p_shape(x,medium);
mf_hot=p_shape(x,hot);
plot(x,[mf_cool;mf_medium;mf_hot]);
title('Temperature: cool, medium and hot');
ylabel('Membership');
xlabel('Degrees');
text(20,.55,'Cool')
text(42,.55,'Medium')
text(70,.55,'Hot')

[Figure: Temperature: cool, medium and hot (Π-shaped membership functions), with curves labeled Cool, Medium and Hot. Horizontal axis: Degrees; vertical axis: Membership.]


5.2.5 Defuzzification of a Fuzzy Set
Defuzzification is the process of representing a fuzzy set with a crisp number and is
discussed in Section 6.3 of the text. Internal representations of data in a fuzzy system are
usually fuzzy sets but the output frequently needs to be a crisp number that can be used to
perform a function, such as commanding a valve to a desired position.

The most commonly used defuzzification method is the center of area method also
commonly referred to as the centroid method. This method determines the center of area
of the fuzzy set and returns the corresponding crisp value. The function centroid
(universe, grades) performs this function by using a method similar to that of finding a
balance point on a loaded beam.


function [center] = centroid(x,y);
%CENTROID Calculates Centroid
% [center] = centroid(universe,grades)
%
% universe: row vector defining the universe of discourse.
% grades: row vector of corresponding membership.
% center: crisp number defining the centroid.
%
center=(x*y')/sum(y);

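The one-line body of centroid is the discrete weighted average, the sum of x(i)*y(i) divided by the sum of y(i). A small hand-checkable example, using made-up grades chosen only for illustration, shows the arithmetic:

```matlab
% Discrete centroid on a five-point universe (illustrative values).
x = [1 2 3 4 5];
y = [0 0.2 0.6 1.0 0.2];
% Numerator: 2*0.2 + 3*0.6 + 4*1.0 + 5*0.2 = 7.2; denominator: 2.0
center = (x*y')/sum(y)
```

The result, 3.6, lies to the left of the peak at x = 4 because more of the area sits on that side, just as a beam balances toward its heavier end.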
To illustrate this method, we will defuzzify the following triangular fuzzy set and plot the
result using c_plot:

x=[10:150];
y=triangle(x,[32 67 130]);
center=centroid(x,y);
c_plot(x,y,center,'Centroid')

[Figure: triangular fuzzy set with its centroid marked. Title: Centroid is at 76.33.]


There are several other defuzzification methods including mean of max, max of max and
min of max. The following function implements mean of max defuzzification:
mom(universe,grades).

x=[10:150];
y=trapzoid(x,[32 67 74 130]);
center=mom(x,y);
c_plot(x,y,center,'Mean of Max');

[Figure: trapezoidal fuzzy set with its mean of max marked. Title: Mean of Max is at 70.5.]

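Mean of max averages the points of the universe at which the grade reaches its maximum. The sketch below is an assumption about what mom computes, based on that description alone, and is not the supplement's m-file; the trapezoid is built inline so the example does not depend on trapzoid.

```matlab
% Mean-of-max sketch on the trapezoid [32 67 74 130] (inline construction).
x = 10:150;
rising  = min((x-32)./(67-32), 1);        % left edge, capped at 1
falling = (130-x)./(130-74);              % right edge
y = max(min(rising, falling), 0);         % trapezoid grades in [0,1]
mom_sketch = mean(x(y == max(y)))         % average of the plateau points
```

The plateau runs from 67 to 74, so the average of its points is 70.5, in agreement with the figure above.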

5.2.6 Compound Values
Connectives such as AND and OR, and modifiers such as NOT, VERY, and MORE or
LESS can be used to generate compound values from primary values:

OR corresponds to max or union
AND corresponds to min or intersection
NOT corresponds to the complement and is calculated by the function not(MF).
VERY, MORE or LESS, etc. correspond to various degrees of contrast intensification.

Temperature is NOT cool AND NOT hot is a fuzzy set represented by:

x=[0:100];
cool=[0 0 25 50];
hot=[50 75 100 100];
mf_cool=trapzoid(x,cool);
mf_hot=trapzoid(x,hot);
not_cool=not(mf_cool);
not_hot=not(mf_hot);
answer=min([not_hot;not_cool]);
plot(x,answer);
title('Temperature is NOT hot AND NOT cool');
ylabel('Membership');
xlabel('Degrees');

[Figure: Temperature is NOT hot AND NOT cool. Horizontal axis: Degrees; vertical axis: Membership.]


VERY and MORE or LESS are called linguistic modifiers. These can be implemented by
taking the square (VERY) or square root (MORE or LESS) of the membership values.
These modifiers are implemented with the very(MF) and moreless(MF) functions. For
example, NOT VERY hot would be represented as:

not_very_hot=not(very(trapzoid(x,hot)));
plot(x,not_very_hot);
title('NOT VERY hot');ylabel('Membership');xlabel('Degrees');

[Figure: NOT VERY hot. Horizontal axis: Degrees; vertical axis: Membership.]


and, MORE or LESS hot would be represented as:

ml_hot=moreless(trapzoid(x,hot));
plot(x,ml_hot);
title('Temperature is More or Less hot');
ylabel('Membership');xlabel('Degrees');

[Figure: Temperature is More or Less hot. Horizontal axis: Degrees; vertical axis: Membership.]


Note that some membership functions are affected by linguistic modifiers more than
others. For example, a membership function that only has crisp values, such as a
hardlimit membership function, would not be affected at all.
5.3 Implication Relations
The underlying analytical form of an if/then rule is a fuzzy relation called an implication
relation: R(x,y). There are several implication relation operators (φ) including:

Zadeh Max-Min Implication Operator
$$\phi[\mu_A(x), \mu_B(y)] = (\mu_A(x) \wedge \mu_B(y)) \vee (1 - \mu_A(x))$$

Mamdani Min Implication Operator
$$\phi[\mu_A(x), \mu_B(y)] = \mu_A(x) \wedge \mu_B(y)$$

Larson Product Implication Operator
$$\phi[\mu_A(x), \mu_B(y)] = \mu_A(x) \cdot \mu_B(y)$$


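Each operator turns a pair of membership vectors into a relation matrix R(x,y). The following sketch builds all three relations for the rule "if x is Fuzzy Number 3 then y is Fuzzy Number 7" on the universe 0 to 10. It is illustrative only; the variable names R_mamdani, R_larson and R_zadeh are introduced here and do not appear in the supplement.

```matlab
% Relation matrices for the three implication operators (illustrative).
muA = [0 0.3 0.7 1.0 0.7 0.3 0 0 0 0 0];   % Fuzzy Number 3 over x = 0..10
muB = [0 0 0 0 0 0.2 0.6 1.0 0.6 0.2 0];   % Fuzzy Number 7 over y = 0..10
[B,A] = meshgrid(muB,muA);      % A(i,j) = muA(i), B(i,j) = muB(j)
R_mamdani = min(A,B);           % min implication
R_larson  = A.*B;               % product implication
R_zadeh   = max(min(A,B), 1-A); % max-min implication
row_x2 = R_mamdani(3,:)         % row for x = 2: Fuzzy Number 7 clipped at 0.7
```

Row i of each matrix is the output fuzzy set inferred for a crisp input at the i-th point of the universe, which is why the row for x = 2 is Fuzzy Number 7 limited to 0.7.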
To illustrate the Mamdani Min implication operator, suppose there is a rule that
states:

if x is "Fuzzy Number 3"
then y is "Fuzzy Number 7"

For the Fuzzy Number 3 of section 4.1, if the input x is a 2, it matches the set "Fuzzy
Number 3" with a value of 0.7. This value is called the "Degree of Fulfillment" (DOF) of
the antecedent. Therefore, the consequence should be met with a degree of 0.7 and
results in the output fuzzy number being clipped to a maximum of 0.7. To perform this
operation we construct a function called clip(FS,level).

x=[0:1:100];
mua=1./(1+0.01.*(x-50).^2);
clip_mua=clip(mua,0.2);
plot(x,clip_mua);
title('Fuzzy Set A Clipped to a 0.2 Level');
xlabel('x');
ylabel('Membership of x');

[Figure: Fuzzy Set A Clipped to a 0.2 Level. Horizontal axis: x; vertical axis: Membership of x.]


Referring back to the discrete example:
if x is "Fuzzy Number 3"
then y is "Fuzzy Number 7"
and x is equal to 2, then the output y is equal to Fuzzy Number 7 clipped at the degree of
fulfillment of x = 2 in Fuzzy Number 3, which is 0.7.

x= [0 1 2 3 4 5 6 7 8 9 10];
FN3 = [0 0.3 0.7 1.0 0.7 0.3 0 0 0 0 0];
FN7 = [0 0 0 0 0 0.2 0.6 1.0 0.6 0.2 0];
degree=FN3(find(x==2));
y=clip(FN7,degree);
plot(x,y);
axis([0 10 0 1])
title('Mamdani Min Output of Fuzzy Rule');
xlabel('x');
ylabel('Output Fuzzy Set');

[Figure: Mamdani Min Output of Fuzzy Rule. Horizontal axis: x; vertical axis: Output Fuzzy Set.]


This example shows the basic foundation of a rule based fuzzy logic system. We can see
that using discrete membership functions of very coarse granularity may not provide the
precision that one may desire. Membership functions with finer granularity should be
used.

To illustrate the use of the Larson Product implication relation, suppose there is a rule
that states:

if x is "Fuzzy Number 3"
then y is "Fuzzy number 7"

For the Fuzzy Number 3 of section 4.1, if the input x is a 2, it matches the antecedent
fuzzy set "Fuzzy Number 3" with a degree of fulfillment of 0.7. The Larson Product
implication operator scales the consequence with the degree of fulfillment which is 0.7
and results in the output fuzzy number being scaled to a maximum of 0.7. The function
product(FS,level) performs the Larson Product operation.

x=[0:1:100];
mua=1./(1+0.01.*(x-50).^2);
prod_mua=product(mua,.7);
plot(x,prod_mua)
axis([min(x) max(x) 0 1]);
title('Fuzzy Set A Scaled to a 0.7 Level');
xlabel('x');
ylabel('Membership of x');

[Figure: Fuzzy Set A Scaled to a 0.7 Level. Horizontal axis: x; vertical axis: Membership of x.]

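The difference between clipping and scaling is easiest to see side by side on the same consequent. The sketch below assumes that clip(FS,level) limits the grades with min and that product(FS,level) multiplies them by the level, as the text describes; it uses inline expressions rather than the supplement's m-files.

```matlab
% Clipping versus scaling of Fuzzy Number 7 at a degree of fulfillment of 0.7.
FN7 = [0 0 0 0 0 0.2 0.6 1.0 0.6 0.2 0];
degree = 0.7;
clipped = min(FN7,degree)    % min implication: only grades above 0.7 change
scaled  = FN7.*degree        % product implication: every grade is multiplied by 0.7
```

Clipping leaves grades below the degree of fulfillment untouched and flattens the top, while scaling preserves the shape of the set and lowers all of it.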

Referring back to the highly granular discrete example:

if x is "Fuzzy Number 3"
then y is "Fuzzy Number 7"

and x is equal to 2, then the output y is equal to "Fuzzy Number 7" scaled by the
antecedent's degree of fulfillment, 0.7.

x= [0 1 2 3 4 5 6 7 8 9 10];
FN3 = [0 0.3 0.7 1.0 0.7 0.3 0 0 0 0 0];
FN7 = [0 0 0 0 0 0.2 0.6 1.0 0.6 0.2 0];
degree=FN3(find(x==2));
y=product(FN7,degree);
plot(x,y);
axis([0 10 0 1.0])
title('Larson Product Output of Fuzzy Rule');
xlabel('x');
ylabel('Output Fuzzy Set');

[Figure: Larson Product Output of Fuzzy Rule. Horizontal axis: x; vertical axis: Output Fuzzy Set.]


5.4 Fuzzy Algorithms
Now that we can manipulate Fuzzy Rules we can combine them into Fuzzy Algorithms.
A Fuzzy Algorithm is a procedure for performing a task formulated by a collection of
fuzzy if/then rules. These rules are usually connected by ELSE statements.

if x is $A_1$ then y is $B_1$ ELSE
if x is $A_2$ then y is $B_2$ ELSE
...
if x is $A_n$ then y is $B_n$


ELSE is interpreted differently for different implication operators:

Zadeh Max-Min Implication Operator     AND
Mamdani Min Implication Operator       OR
Larson Product Implication Operator    OR

As a first example, consider a fuzzy algorithm that controls a fan's speed. The input is
the crisp value of temperature and the output is a crisp value for the fan speed. Suppose
the fuzzy system is defined as:

if Temperature is Cool then Fan_speed is Low ELSE
if Temperature is Moderate then Fan_speed is Medium ELSE
if Temperature is Hot then Fan_speed is High

This system has three fuzzy rules where the antecedent membership functions Cool,
Moderate, Hot and consequent membership functions Low, Medium, High are defined by
the following fuzzy sets over the given universes of discourse:

% Universe of Discourse
x = [0:1:120]; % Temperature
y = [0:1:10]; % Fan Speed

% Temperature
cool_mf = trapzoid(x,[0 0 30 50]);
moderate_mf = triangle(x,[30 55 80]);
hot_mf = trapzoid(x,[60 80 120 120]);
antecedent_mf = [cool_mf;moderate_mf;hot_mf];
plot(x,antecedent_mf)
title('Cool, Moderate and Hot Temperatures')
xlabel('Temperature')
ylabel('Membership')

[Figure: Cool, Moderate and Hot Temperatures. Horizontal axis: Temperature; vertical axis: Membership.]


% Fan Speed
low_mf = trapzoid(y,[0 0 2 5]);
medium_mf = trapzoid(y,[2 4 6 8]);
high_mf = trapzoid(y,[5 8 10 10]);
consequent_mf = [low_mf;medium_mf;high_mf];
plot(y,consequent_mf)
title('Low, Medium and High Fan Speeds')
xlabel('Fan Speed')
ylabel('Membership')

[Figure: Low, Medium and High Fan Speeds. Horizontal axis: Fan Speed; vertical axis: Membership.]


Now that we have the membership functions defined we can perform the five steps of
evaluating fuzzy algorithms:

1. Fuzzify the input.
2. Apply a fuzzy operator.
3. Apply an implication operation.
4. Aggregate the outputs.
5. Defuzzify the output.

First we fuzzify the input. The output of the first step is the degree of fulfillment of each
rule. Suppose the input is Temperature = 72.

temp = 72;
dof1 = cool_mf(find(x==temp));
dof2 = moderate_mf(find(x == temp));
dof3 = hot_mf(find(x == temp));
DOF = [dof1;dof2;dof3]

DOF =
0
0.3200
0.6000

Doing this in matrix notation:

temp=72;
DOF=antecedent_mf(:,find(x==temp))

DOF =
0
0.3200
0.6000

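The find(x==temp) lookup only works when the input lies exactly on a point of the discretized universe; for an input such as 72.5 it returns an empty result. One possible workaround, offered here as a sketch rather than as the supplement's method, is to interpolate the membership vector with interp1. The inline triangle tri_sketch is a hypothetical stand-in for triangle.

```matlab
% Fuzzifying an input that falls between grid points (illustrative).
x = 0:1:120;                                   % Temperature universe
tri_sketch = @(x,p) max(min((x-p(1))./(p(2)-p(1)), (p(3)-x)./(p(3)-p(2))), 0);
moderate_mf = tri_sketch(x,[30 55 80]);
dof_on_grid  = interp1(x, moderate_mf, 72)     % same value the find lookup gives
dof_off_grid = interp1(x, moderate_mf, 72.5)   % linear interpolation between 72 and 73
```

Linear interpolation is exact for triangular and trapezoidal sets whose breakpoints lie on the grid, and only approximate for curved sets such as the S-shaped functions.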
There is no fuzzy operator (AND, OR) since each rule has only one input. Next we apply
a fuzzy implication operation. Suppose we choose the Larson Product implication
operation.

consequent1 = product(low_mf,dof1);
consequent2 = product(medium_mf,dof2);
consequent3 = product(high_mf,dof3);
plot(y,[consequent1;consequent2;consequent3])
axis([0 10 0 1.0])
title('Consequent Fuzzy Set')
xlabel('Fan Speed')
ylabel('Membership')

[Figure: Consequent Fuzzy Set. Horizontal axis: Fan Speed; vertical axis: Membership.]


Or again, in matrix notation:

consequent = product(consequent_mf,DOF);
plot(y,consequent)
axis([0 10 0 1.0])
title('Consequent Fuzzy Set')
xlabel('Fan Speed')
ylabel('Membership')

[Figure: Consequent Fuzzy Set (matrix notation). Horizontal axis: Fan Speed; vertical axis: Membership.]


Next we need to aggregate the consequent fuzzy sets. We will use the max operator.

Output_mf=max([consequent1;consequent2;consequent3]);
plot(y,Output_mf)
axis([0 10 0 1])
title('Output Fuzzy Set')
xlabel('Fan Speed')
ylabel('Membership')

[Figure: Output Fuzzy Set. Horizontal axis: Fan Speed; vertical axis: Membership.]

Or, in matrix notation:

Output_mf = max(consequent);
plot(y,Output_mf)
axis([0 10 0 1]);title('Output Fuzzy Set')
xlabel('Fan Speed');ylabel('Membership')

[Figure: Output Fuzzy Set (matrix notation). Horizontal axis: Fan Speed; vertical axis: Membership.]


Lastly we defuzzify the output set to obtain a crisp value.

output=centroid(y,Output_mf);
c_plot(y,Output_mf,output,'Crisp Output');

[Figure: output fuzzy set with its centroid marked. Title: Crisp Output is at 7.313.]

43
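The centroid() helper is likewise not listed in this section. A minimal sketch, assuming it computes the discrete center of gravity of the sampled output set, is shown below. The triangular set is an illustrative example chosen so that the answer is known by symmetry.

```matlab
y  = 0:0.1:10;                              % output universe of discourse
mf = max(min((y-4)/3, (10-y)/3), 0);        % illustrative triangular set, peak at y = 7
crisp = sum(y .* mf) / sum(mf)              % discrete center of gravity
```

Because the example set is symmetric about y = 7, the computed centroid should sit at 7 to within rounding error.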

The crisp output of the fuzzy rules states that the fan speed should be set to a value of 7.3
for a temperature of 72 degrees. To see the output for different input temperatures, we
write a loop that covers the input universe of discourse and computes the output for each
input temperature. Note: you must have already run the code fragments that set up the
membership functions and define the universe of discourse to run this example.

outputs=zeros(size([1:1:100]));
for temp=1:1:100
DOF=antecedent_mf(:,find(x==temp)); %Fuzzification
consequent = product(consequent_mf,DOF); %Implication
Output_mf = max(consequent); %Aggregation
output=centroid(y,Output_mf); %Defuzzification
outputs(temp)=output;
end
plot([1:1:100],outputs)
title('Fuzzy System Input Output Relationship')
xlabel('Temperature')
ylabel('Fan Speed')

[Figure: Fuzzy System Input Output Relationship (x-axis: Temperature, y-axis: Fan Speed)]


We see that the input/output relationship is non-linear. The next chapter will demonstrate
fuzzy tank level control when Fuzzy Operators are included.
Chapter 6 Fuzzy Control
Fuzzy control refers to the control of processes through the use of fuzzy linguistic
descriptions. For additional reading on fuzzy control see DeSilva, 1995; Jamshidi,
Vadiee and Ross, 1993; or Kandel and Langholz, 1994.
6.1 Tank Level Fuzzy Control
A tank is filled by means of a valve and continuously drains. The level is measured and
compared to a level setpoint forming a level error. This error is used by a controller to
position the valve to make the measured level equal to the desired level. The setup is
shown below and is used in a laboratory at The University of Tennessee for fuzzy and
neural control experiments.
[Figure: Tank schematic with labels: water in, servo valve, water out, pressure transducer, level h, and dimensions 11" and 35"]

This is a nonlinear control problem since the dynamics of the plant depend on the water
level through its square root. There may also be some non-linearities due to the valve flow
characteristics. The following equations model the process.

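\[
\begin{aligned}
\dot{h} &= \frac{V_{in} - V_{out}}{Area} \\
Area &= \pi R^2 = A_k \\
V_{out} &= K\sqrt{h}, \quad K \text{ is the resistance in the outlet piping} \\
V_{in} &= f(u), \quad u \text{ is the valve position} \\
\dot{h} &= \frac{f(u) - K\sqrt{h}}{A_k} = \frac{f(u)}{A_k} - \frac{K\sqrt{h}}{A_k} \\
\dot{h} A_k + K\sqrt{h} &= f(u)
\end{aligned}
\]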

These equations can be used to model the plant in SIMULINK.
[Figure: WATER TANK MODEL REPRESENTATION (SIMULINK block diagram). Block and signal labels: Input control voltage; voltage offset (3.3); converts Valve voltage to % open; fill rate at 100% open; H2O in (in^3/sec); Drain dynamics (includes sqrt); H2O out (in^3/sec); 1/tank Area; h' flow rate (in/sec); Limited Integrator (0-36"); level off-set (11); Tank Level]

The non-linearities are apparent when linearizing the plant around different operating
levels. This can be done using LINMOD.

[a,b,c,d]=linmod('tank',3,.1732051)
resulting in:
a = -0.0289 b = 1.0 c = 1.0

For different operating levels we have:
for h=1 pole at -0.05
for h=2 pole at -0.0354
for h=3 pole at -0.0289
for h=4 pole at -0.025
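The listed poles follow the inverse square root of the level. Linearizing the tank equation about a level h0 gives a single pole at -K/(2*A*sqrt(h0)). The sketch below reproduces the four values, assuming K/(2*A) = 0.05; that constant is inferred from the pole listed for h=1 and is not a measured plant parameter.

```matlab
% Pole of the linearized tank model as a function of operating level.
% hdot = (f(u) - K*sqrt(h))/A  =>  pole at s = -K/(2*A*sqrt(h0))
K_over_2A = 0.05;                 % assumption: inferred from the pole at h = 1
h0    = [1 2 3 4];                % operating levels
poles = -K_over_2A ./ sqrt(h0)
```

The pole moves toward the origin as the level rises, so the plant responds more slowly at higher levels.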

This nonlinearity makes control with a PID controller difficult unless gain scheduling is
used. A controller designed to meet certain performance specifications at a low level
such as h=1 may not meet those specifications at a higher level such as h=4. Therefore, a
fuzzy controller may be a viable alternative.

The fuzzy controller described in the book uses two input variables [error, change in
error] to control valve position. The membership functions were chosen to be:

Error: nb, ns, z, ps, pb
Change in Error: n, ze, p
Valve Position: vh, high, med, low, vl

Where:
nb, ns, z, ps, pb = negative big, negative small, zero, positive small, positive big
n, ze, p = negative, zero, positive
vh, high, med, low, vl = very high, high, medium, low, very low

Fifteen fuzzy rules are used to account for each combination of input variables:
1. if (error is nb) AND (del_error is n) then (control is high) (1) ELSE
2. if (error is nb) AND (del_error is ze) then (control is vh) (1) ELSE
3. if (error is nb) AND (del_error is p) then (control is vh) (1) ELSE
4. if (error is ns) AND (del_error is n) then (control is high) (1) ELSE
5. if (error is ns) AND (del_error is ze) then (control is high) (1) ELSE
6. if (error is ns) AND (del_error is p) then (control is med) (1) ELSE
7. if (error is z) AND (del_error is n) then (control is med) (1) ELSE
8. if (error is z) AND (del_error is ze) then (control is med) (1) ELSE
9. if (error is z) AND (del_error is p) then (control is med) (1) ELSE
10. if (error is ps) AND (del_error is n) then (control is med) (1) ELSE
11. if (error is ps) AND (del_error is ze) then (control is low) (1) ELSE
12. if (error is ps) AND (del_error is p) then (control is low) (1) ELSE
13. if (error is pb) AND (del_error is n) then (control is low) (1) ELSE
14. if (error is pb) AND (del_error is ze) then (control is vl) (1) ELSE
15. if (error is pb) AND (del_error is p) then (control is vl) (1)

The membership functions were manually tuned by trial and error to give good controller
performance. Automatic adaptation of membership functions will be discussed in
Chapter 13. The resulting membership functions are:

Level_error = [-36:0.1:36];
nb = trapzoid(Level_error,[-36 -36 -10 -5]);
ns = triangle(Level_error,[-10 -2 0]);
z = triangle(Level_error,[-1 0 1]);
ps = triangle(Level_error,[0 2 10]);
pb = trapzoid(Level_error,[5 10 36 36]);
l_error = [nb;ns;z;ps;pb];
plot(Level_error,l_error);
title('Level Error Membership Functions')
xlabel('Level Error')
ylabel('Membership')

[Figure: Level Error Membership Functions (x-axis: Level Error, y-axis: Membership)]


Del_error = [-40:.1:40];
p = trapzoid(Del_error,[-40 -40 -2 0]);
ze = triangle(Del_error,[-1 0 1]);
n = trapzoid(Del_error,[0 2 40 40]);
d_error = [p;ze;n];
plot(Del_error,d_error);
title('Level Rate Membership Functions')
xlabel('Level Rate')
ylabel('Membership')

[Figure: Level Rate Membership Functions (x-axis: Level Rate, y-axis: Membership)]

Control = [-4.5:0.05:1];
vh = triangle(Control,[0 1 1]);
high = triangle(Control,[-1 0 1]);
med = triangle(Control,[-3 -2 -1]);
low = triangle(Control,[-4.5 -3.95 -3]);
vl = triangle(Control,[-4.5 -4.5 -3.95]);
control=[vh;high;med;low;vl];
plot(Control,control);
title('Output Voltage Membership Functions')
xlabel('Control Voltage')
ylabel('Membership')

[Figure: Output Voltage Membership Functions (x-axis: Control Voltage, y-axis: Membership)]


A Mamdani fuzzy system that uses centroid defuzzification will now be created. Test
results show that the fuzzy system outperforms a PID controller. There
was practically no overshoot, and the speed of response was limited only by the inlet
supply pressure and output piping resistance. Suppose the following error and change in
error are input to the fuzzy controller. First, the degrees of fulfillment of the antecedent
membership functions are calculated.

error=-8.1;
derror=0.3;
DOF1=interp1(Level_error',l_error',error')';
DOF2=interp1(Del_error',d_error',derror')';

Next, the fuzzy relation operations inherent in the 15 rules are performed.

antecedent_DOF = [min(DOF1(1), DOF2(1))
min(DOF1(1), DOF2(2))
min(DOF1(1), DOF2(3))
min(DOF1(2), DOF2(1))
min(DOF1(2), DOF2(2))
min(DOF1(2), DOF2(3))
min(DOF1(3), DOF2(1))
min(DOF1(3), DOF2(2))
min(DOF1(3), DOF2(3))
min(DOF1(4), DOF2(1))
min(DOF1(4), DOF2(2))
min(DOF1(4), DOF2(3))
min(DOF1(5), DOF2(1))
min(DOF1(5), DOF2(2))
min(DOF1(5), DOF2(3))]

antecedent_DOF =
0
0.6200
0.1500
0
0.2375
0.1500
0
0
0
0
0
0
0
0
0
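The fifteen min() calls can also be written as one matrix operation. The sketch below assumes the rule ordering used above (for each error term in turn, the three change-in-error terms). The DOF1 and DOF2 values were worked out by hand from the membership function definitions for error = -8.1 and derror = 0.3; they are not printed in the text.

```matlab
DOF1 = [0.62 0.2375 0 0 0]';   % level-error memberships (hand-computed)
DOF2 = [0 0.70 0.15]';         % change-in-error memberships (hand-computed)

% 3-by-5 grid: entry (j,i) = min(DOF1(i), DOF2(j)), i.e. the fuzzy AND
rule_grid = min(repmat(DOF1', 3, 1), repmat(DOF2, 1, 5));

% Column-major unrolling gives the same ordering as the 15 explicit rows.
antecedent_DOF = rule_grid(:)
```

Only the rules whose two antecedents both have non-zero membership fire, which is why just four of the fifteen entries are non-zero.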

consequent = [control(5,:)
control(5,:)
control(4,:)
control(4,:)
control(4,:)
control(3,:)
control(3,:)
control(3,:)
control(3,:)
control(3,:)
control(2,:)
control(2,:)
control(2,:)
control(1,:)
control(1,:)];

Consequent = product(consequent,antecedent_DOF);
plot(Control,Consequent)
axis([min(Control) max(Control) 0 1.0])
title('Consequent of Fuzzy Rules')
xlabel('Control Voltage')
ylabel('Membership')

[Figure: Consequent of Fuzzy Rules (x-axis: Control Voltage, y-axis: Membership)]


The fuzzy output sets are aggregated to form a single fuzzy output set.

aggregation = max(Consequent);
plot(Control,aggregation)
axis([min(Control) max(Control) 0 1.0])
title('Aggregation of Fuzzy Rule Outputs')
xlabel('Control Voltage')
ylabel('Membership')

[Figure: Aggregation of Fuzzy Rule Outputs (x-axis: Control Voltage, y-axis: Membership)]

The output fuzzy set is defuzzified to find the crisp output voltage.

output=centroid(Control,aggregation);
c_plot(Control,aggregation,output,'Crisp Output Value for Voltage')
axis([min(Control) max(Control) 0 1.0])
xlabel('Control Voltage');

[Figure: Crisp Output Value for Voltage is at -3.402 (x-axis: Control Voltage)]


For these inputs, a voltage of -3.4 would be sent to the control valve.

Now that we have the five steps of evaluating fuzzy algorithms defined (fuzzification,
apply fuzzy operator, apply implication operation, aggregation and defuzzification), we
can combine them into a function that is called at each controller voltage update. The
level error and change in level error will be passed to the fuzzy controller function and
the command valve actuator voltage will be passed back. This function, named
tankctrl(), is included as an m-file. The universes of discourse and membership functions
are initialized by a MATLAB script named tankinit. These variables are made to be
global MATLAB variables because they need to be used by the fuzzy controller function.

The differential equations that model the tank are contained in a function called
tank_mod.m. It operates by passing to it the current state of the tank (tank level) and the
control valve voltage. It passes back the next state of the tank. A demonstration of the
operation of the tank with its controller is given in the function
tankdemo(initial_level,desired_level). You may try running tankdemo with different
initial and target levels. This function plots the result of a 40 second simulation, which
may take from 10 seconds to a minute or two depending on the speed of the computer
used for the simulation.

tankdemo(24.3,11.2)

The tank and controller are simulated for 40 seconds, please be patient.
[Figure: Tank Level Response (x-axis: Time (sec), y-axis: Level (in))]


As you can see, the controller has very good response characteristics. There is very low
steady state error and no overshoot. The speed of response is mostly controlled by the
piping and valve resistances. The first second of the simulation is before feedback
occurs, so disregard that data point.

By changing the membership functions and rules, you can get different response
characteristics. The steady state error is controlled by the width of the zero level error
membership function. Keeping this membership function narrow keeps the steady state
error small.
Chapter 7 Fundamentals of Neural Networks
The MathWorks markets a Neural Networks Toolbox. A description of it can be found at
http://www.mathworks.com/neural.html. Other MATLAB based Neural Network tools
are the NNSYSID Toolbox at http://kalman.iau.dtu.dk/Projects/proj/nnsysid.html and the
NNCTRL toolkit at http://www.iau.dtu.dk/Projects/proj/nnctrl.html. These are freeware
toolkits for system identification and control.
7.1 Artificial Neuron
The standard artificial neuron is a processing element whose output is calculated by
multiplying its inputs by a weight vector, summing the results, and applying an activation
function to the sum.

\[
y = f\left( \sum_{k=1}^{n} x_k w_k + b \right)
\]

The following figure depicts an artificial neuron with n inputs.

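[Figure: Artificial Neuron. Inputs x1, x2, x3, ..., xn with weights w1 ... wn and a bias feed a Sum node followed by the activation function f(), which produces the Output]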


The activation function could be one of many types. A linear activation function's output
is simply equal to its input:

\[ f(x) = x \]


x=[-5:0.1:5];
y=linear(x);
plot(x,y)
title('Linear Activation Function')
xlabel('x')
ylabel('Linear(x)')

[Figure: Linear Activation Function (x-axis: x, y-axis: Linear(x))]


There are several types of non-linear activation functions. Differentiable, non-linear
activation functions can be used in networks trained with backpropagation. The most
common are the logistic function and the hyperbolic tangent function.

\[
f(x) = \tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}
\]

Note that the output range of the hyperbolic tangent function is between -1 and 1.

x=[-3:0.1:3];
y=tanh(x);
plot(x,y)
title('Hyperbolic Tangent Activation Function')
xlabel('x')
ylabel('tanh(x)')

[Figure: Hyperbolic Tangent Activation Function (x-axis: x, y-axis: tanh(x))]


\[
f(x) = \text{logistic}(x) = \frac{1}{1 + \exp(-\beta x)}
\]


where β is the slope constant. We will always consider β to be one but it can be changed.
Note that the output range of the logistic function is between 0 and 1.

x=[-5:0.1:5];
y=logistic(x);
plot(x,y)
title('Logistic Activation Function')
xlabel('x');ylabel('logistic(x)')

[Figure: Logistic Activation Function (x-axis: x, y-axis: logistic(x))]
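The supplement's logistic() fixes the slope constant at one. As a small sketch of how the slope constant β changes the curve, the hypothetical helper logistic_beta below (not one of the supplement's m-files) takes β as a second argument.

```matlab
% Hypothetical helper: logistic function with an explicit slope constant.
logistic_beta = @(x, beta) 1 ./ (1 + exp(-beta .* x));

x  = [-1 0 1];
y1 = logistic_beta(x, 1)    % beta = 1: the form used throughout the text
y4 = logistic_beta(x, 4)    % beta = 4: sharper transition around x = 0
```

Both curves pass through 0.5 at x = 0; a larger β only steepens the transition between the two saturation levels.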


Non-differentiable non-linear activation functions are usually used as outputs of
perceptrons and competitive networks. There are two types: the threshold function's
output is either a 0 or 1 and the signum's output is a -1 or 1.

x=[-5:0.1:5];y=thresh(x);
plot(x,y);title('Thresh Activation Function')
xlabel('x');ylabel('thresh(x)')

[Figure: Thresh Activation Function (x-axis: x, y-axis: thresh(x))]


x=[-5:0.1:5];
y=signum(x);
plot(x,y)
title('Signum Activation Function')
xlabel('x')
ylabel('signum(x)')

[Figure: Signum Activation Function (x-axis: x, y-axis: signum(x))]


Note that the activation functions defined above can take a vector as input, and output a
vector by performing the operation on each element of the input vector.

x=[-1 0 1];
linear(x)
logistic(x)
tanh(x)
thresh(x)
signum(x)

ans =
-1 0 1
ans =
0.2689 0.5000 0.7311
ans =
-0.7616 0 0.7616
ans =
0 1 1
ans =
-1 -1 1

The output of a neuron is easily computed by using vector multiplication of the input and
weights, and adding the bias. Suppose you have an input vector x=[2 4 6], and a weight
vector [0.5 -0.25 0.33] with a bias of -0.8. If the activation function is a hyperbolic tangent
function, the output of the artificial neuron defined above is

x=[2 4 6]';
w=[0.5 -0.25 0.33];
b=-0.8;
y=tanh(w*x+b)

y =
0.8275

7.2 Single Layer Neural Network
Neurons are grouped into layers and layers are grouped into networks to form highly
interconnected processing structures. An input layer does no processing; it simply sends
the inputs, modified by a weight, to each of the neurons in the next layer. This next
layer can be a hidden layer or the output layer in a single layer design.

A bias is included in the neurons to allow the activation functions to be offset from zero.
One method of implementing a bias is to use a dummy input node with a magnitude of 1.
The weights connecting this dummy node to the next layer are the actual bias values.

[Figure: Single Layer Network. Input Layer with x0=1 (dummy bias node), x1, x2, x3 connected through the weight matrix W to an Output Layer with y1, y2]


Suppose we have a single layer network with three input neurons and two output neurons
as shown above. The outputs would be computed using matrix algebra in either of
two forms. The second form augments the input vector with a dummy node and embeds
the bias values into the weight matrix.

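Form 1:
\[
y = \tanh(w x + b) = \tanh\left(
\begin{bmatrix} 0.5 & -0.25 & 0.33 \\ 0.2 & -0.75 & -0.5 \end{bmatrix}
\begin{bmatrix} 2 \\ 4 \\ 6 \end{bmatrix}
+ \begin{bmatrix} 0.4 \\ -1.2 \end{bmatrix}
\right)
\]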


x=[2 4 6]';
w=[0.5 -0.25 0.33; 0.2 -0.75 -0.5];
b=[0.4 -1.2]';
y=tanh(w*x+b)

y =
0.9830
-1.0000

Form 2:
\[
y = \tanh(w x) = \tanh\left(
\begin{bmatrix} 0.4 & 0.5 & -0.25 & 0.33 \\ -1.2 & 0.2 & -0.75 & -0.5 \end{bmatrix}
\begin{bmatrix} 1 \\ 2 \\ 4 \\ 6 \end{bmatrix}
\right)
\]


x=[1 2 4 6]';
w=[0.4 0.5 -0.25 0.33; -1.2 0.2 -0.75 -0.5];
y=tanh(w*x)

y =
0.9830
-1.0000

7.3 Rosenblatt's Perceptron
The simplest single layer network is the perceptron, which was developed by Frank
Rosenblatt [1958]. A perceptron is a neural network composed of a single layer feed-forward
network using threshold activation functions. Feed-forward means that all the
interconnections between the layers propagate forward to the next layer. The figure
below shows a single layer perceptron with two inputs and one output.
[Figure: Single layer perceptron. Inputs x1 and x2 with weights w1 and w2, plus a bias, feed a Sum neuron that produces the output y]

The simple perceptron uses the threshold activation function with a bias and thus has a
binary output. The binary output perceptron has two possible outputs: 0 and 1. It is
trained by supervised learning and can only classify input patterns that are linearly
separable [Minsky 1969]. The next section gives an example of linearly separable data
that the perceptron can properly classify.

Training is accomplished by initializing the weights and bias to small random values and
then presenting input data to the network. The output (y) is compared to the target output
(t=0 or t=1) and the weights are adapted according to Hebb's training rule [Hebb, 1949]:

"When the synaptic input and the neuron output are both active, the strength of the
connection between the input and the output is enhanced."

This rule can be implemented as:

if y == target
    w = w;          % Correct output, no change.
elseif y == 0
    w = w + x';     % Target = 1, enhance strengths (w is a row, x a column).
else
    w = w - x';     % Target = 0, reduce strengths.
end

The bias is updated as a weight of a dummy node with an input of 1. The function
trainpt1() implements this learning algorithm. It is called with:

[w,b] = trainpt1(x,t,w,b);

Assume the weight and bias values are randomly initialized and the following input and
target output are given.

w = [.3 0.7];
b = [-0.8];
x = [1;-3];
t = [1];

The output is incorrect, as shown:

y = thresh([w b]*[x ;1])

y =
0

One learning cycle of the perceptron learning rule results in:

[w,b] = trainpt1(x,t,w,b)
y = thresh([w b]*[x ;1])

w =
1.3000 -2.3000
b =
0.2000
y =
1

As can be seen, the weights are updated and the output now equals the target. Since the
target was equal to 1, the weights corresponding to inputs with positive values were made
stronger. For example, x_1 = 1 and w_1 changed from 0.3 to 1.3. Conversely, x_2 = -3, and w_2
changed from 0.7 to -2.3; it was made more negative since the input was negative. Look
at trainpt1 to see its implementation.
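For readers without the m-files at hand, the single update above can be reproduced in a few lines. This is a sketch consistent with the numbers reported in the text, not the actual source of trainpt1; the inline thresh definition assumes thresh(0) = 1, as the earlier vector example suggests.

```matlab
thresh = @(v) double(v >= 0);    % assumed threshold convention: thresh(0) = 1

w = [0.3 0.7];  b = -0.8;        % initial weights and bias from the text
x = [1; -3];    t = 1;           % input pattern and its target

y = thresh(w*x + b);             % output before training
e = t - y;                       % error: +1, 0, or -1
w = w + e*x';                    % enhance or reduce strengths, or leave unchanged
b = b + e;                       % bias treated as the weight of a dummy input of 1
y_new = thresh(w*x + b)          % output after one learning cycle
```

Writing the rule as w + e*x' covers all three branches of the if/elseif form in one line, since e is zero when the output is already correct.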

A single perceptron can be used to classify two inputs. For example, if x1 = [0 1] is to be
classified as a 0 and x2 = [1 -1] is to be classified as a 1, the initial weights and bias are
chosen and the following training routine can be used.

x1=[0 1]';
x2=[1 -1]';
t=[0 1];
w=[-0.1 .8]; b=[-.5];
y1 = thresh([w b]*[x1 ;1])
y2 = thresh([w b]*[x2 ;1])

y1 =
1
y2 =
0

Neither output matches the target so we will train the network with first x1 and then x2:

[w,b] = trainpt1(x1,t(1),w,b);
y1 = thresh([w b]*[x1 ;1])
y2 = thresh([w b]*[x2 ;1])
[w,b] = trainpt1(x2,t(2),w,b);
y1 = thresh([w b]*[x1 ;1])
y2 = thresh([w b]*[x2 ;1])

y1 =
0
y2 =
0
y1 =
0
y2 =
1

The network now correctly classifies the inputs. A better way of performing this training
would be to modify trainpt1 so that it can take a matrix of input patterns such as x = [x1 x2].
We will call this function trainpt(). Also, a function to simulate a perceptron with the
inputs being a matrix of input patterns will be called percept().

x=[x1 x2];
w=[-0.1 .8]; b=[-.5];
y=percept(x,w,b)

y =
1 0

[w,b] = trainpt(x,t,w,b)
y=percept(x,w,b)

w =
0.9000 -1.2000
b =
-0.5000
y =
0 1

One training cycle results in the correct classification. This will not always be the case.
It may take several training cycles, which are called epochs, to alter the weights enough
to give the correct outputs. As long as the inputs are linearly separable, the perceptron
will find a decision boundary which correctly divides the inputs. This proof is derived in
many neural network texts and is called the perceptron convergence theorem [Hagan,
Demuth and Beale, 1996]. The decision boundary is formed by the x,y pairs that solve
the following equation:

w*x+b = 0

Let us now look at the decision boundaries before and after training for initial weights
that correctly classify only one pattern.

x1=[0 0]';
x2=[1 -1]';
x=[x1 x2];
t=[0 1];
w=[-0.1 0.8]; b=[-0.5];
plot(x(1,:),x(2,:),'*')
axis([-1.5 1.5 -1.5 1.5]);hold on
X=[-1.5:.5:1.5];
Y=(-b-w(1)*X)./w(2);
plot(X,Y);hold;
title('Original Perceptron Decision Boundary')

Current plot released
[Figure: Original Perceptron Decision Boundary]


[w,b] = trainpt(x,t,w,b);
y=percept(x,w,b)
plot(x(1,:),x(2,:),'*')
axis([-1.5 1.5 -1.5 1.5]);
hold on
X=[-1.5:.5:1.5]; Y=(-b-w(1)*X)./w(2);
plot(X,Y);
hold
title('Perceptron Decision Boundary After One Epoch')

y =
1 1
Current plot released
[Figure: Perceptron Decision Boundary After One Epoch]


Note that after one epoch, still only one pattern is correctly classified.

[w,b] = trainpt(x,t,w,b);
y=percept(x,w,b)
plot(x(1,:),x(2,:),'*')
axis([-1.5 1.5 -1.5 1.5])
hold on
X=[-1.5:.5:1.5]; Y=(-b-w(1)*X)./w(2);
plot(X,Y)
hold
title('Perceptron Decision Boundary After Two Epochs')

y =
0 1
Current plot released
[Figure: Perceptron Decision Boundary After Two Epochs]


Note that after two epochs, both patterns are correctly classified.

The perceptron can also be used to classify several linearly separable patterns. The
function trainpt() will now be modified to train until the patterns are correctly classified
or until 20 epochs have elapsed. The modified function is called trainpti().

x=[0 -.3 .5 1;-.4 -.2 1.3 -1.3];
t=[0 0 1 1]
w=[-0.1 0.8]; b=[-0.5];
y=percept(x,w,b)
plot(x(1,1:2),x(2,1:2),'*')
hold on
plot(x(1,3:4),x(2,3:4),'+')
axis([-1.5 1.5 -1.5 1.5])
X=[-1.5:.5:1.5]; Y=(-b-w(1)*X)./w(2);
plot(X,Y)
hold
title('Original Perceptron Decision Boundary')

t =
0 0 1 1
y =
0 0 1 0
Current plot released
[Figure: Original Perceptron Decision Boundary]


The original weight and bias values misclassify pattern number 4.

[w,b] = trainpti(x,t,w,b)
t
y=percept(x,w,b)
plot(x(1,1:2),x(2,1:2),'*')
hold on
plot(x(1,3:4),x(2,3:4),'+')
axis([-1.5 1.5 -1.5 1.5])
X=[-1.5:.5:1.5]; Y=(-b-w(1)*X)./w(2);
plot(X,Y)
hold
title('Final Perceptron Decision Boundary')

Solution found in 5 epochs.
w =
2.7000 0.5000
b =
-0.5000
t =
0 0 1 1
y =
0 0 1 1
Current plot released
[Figure: Final Perceptron Decision Boundary]


After 5 epochs, all 4 inputs are correctly classified.
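The iterative trainer itself is not listed in this section. The loop below is a minimal sketch of one way such a trainer could work, assuming batch updates (all patterns are evaluated with the current weights, then the summed correction is applied once per epoch) and assuming thresh(0) = 1. With those assumptions it arrives at the same final weights and bias as reported above.

```matlab
thresh = @(v) double(v >= 0);              % assumed threshold convention

x = [0 -.3 .5 1; -.4 -.2 1.3 -1.3];        % four input patterns (columns)
t = [0 0 1 1];                             % targets
w = [-0.1 0.8];  b = -0.5;                 % initial weights and bias
max_epochs = 20;

for epoch = 1:max_epochs
    y = thresh(w*x + b);                   % classify all patterns
    e = t - y;                             % per-pattern error
    if ~any(e)                             % stop once every pattern is correct
        break
    end
    w = w + e*x';                          % summed weight correction
    b = b + sum(e);                        % summed bias correction
end
w, b, y
```

The 20-epoch cap mirrors the limit mentioned in the text; for data that are not linearly separable the loop would simply stop at the cap without a solution.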

7.4 Separation of Linearly Separable Variables
A two input perceptron can separate a plane into two sections because its transfer
equation can be rearranged to form the equation for a line. In a three dimensional
problem, the equation would define a plane and in higher dimensions it would define a
hyperplane.

[Figure: Linear Separability. Points marked o and + in the x-y plane]


Note that the decision boundary is always orthogonal to the weight vector. Suppose we
have a two input perceptron with weights = [1 2] and a bias equal to 1. The decision
boundary is defined as:

\[
y = -\frac{w_x}{w_y} x - \frac{b}{w_y} = -\frac{1}{2} x - \frac{1}{2} = -0.5x - 0.5
\]


which is orthogonal to the weight vector [1 2]. In the figure below, the more horizontal line
is the decision boundary and the more vertical line is the weight vector extended to
meet the decision boundary.

w=[1 2]; b=[1];
x=[-2.5:.5:2.5]; y=(-b-w(1)*x)./w(2);
plot(x,y)
text(.5,-.65,'Decision Boundary');
grid
title('Perceptron Decision Boundary')
xlabel('x');ylabel('y');
hold on
plot([w(1) -w(1)],[w(2) -w(2)])
text(.5,.7,'Weight Vector');
axis([-2 2 -2 2]);
hold

Current plot released
[Figure: Perceptron Decision Boundary (x-axis: x, y-axis: y), with the lines labeled Decision Boundary and Weight Vector]
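The orthogonality of the weight vector and the decision boundary can also be checked numerically. The sketch below picks two arbitrary points on the boundary for w = [1 2], b = 1 and takes the dot product of the weight vector with the direction between them; the variable names are illustrative.

```matlab
w = [1 2];  b = 1;

xb = [-1 1];                                 % two arbitrary x values
yb = (-b - w(1)*xb) ./ w(2);                 % matching points on the boundary
boundary_dir = [xb(2)-xb(1), yb(2)-yb(1)];   % direction along the boundary

orthogonality = dot(w, boundary_dir)         % zero when the two are orthogonal
```

The bias shifts the boundary away from the origin but does not change its direction, so the dot product is zero for any value of b.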


If the inputs need to be classified into three or four classes, a two neuron perceptron can
be used. The outputs can be coded to one of the pattern classifications, and two lines can
separate the classification regions. In the following example, each of the inputs will be
classified into one of three binary classes: [0 1], [1 0], and [0 0]. The weights can be
defined as a matrix, the bias is a vector, and two lines are formed.

x=[0 -.3 .5 1;-.4 -.2 1.3 -1.3]; % Input vectors
t=[0 0 1 1; 0 1 0 0] % Target vectors
w=[-0.1 0.8; 0.2 -0.9]; b=[-0.5;0.3]; % Weights and biases
y=percept(x,w,b) % Initial classifications

t =
0 0 1 1
0 1 0 0
y =
0 0 1 0
1 1 0 1

Two of the patterns (t1 and t4) are incorrectly classified.

[w,b] = trainpti(x,t,w,b)
t
y=percept(x,w,b)

Solution found in 6 epochs.
w =
2.7000 0.5000
-2.2000 -0.7000
b =
-0.5000
-0.7000
t =
0 0 1 1
0 1 0 0
y =
0 0 1 1
0 1 0 0

The perceptron learning algorithm was able to define lines that separated the input
patterns into their target classifications. This is shown in the following figure.

plot(x(1,1),x(2,1),'*')
hold on
plot(x(1,2),x(2,2),'+')
plot(x(1,3:4),x(2,3:4),'o')
axis([-1.5 1.5 -1.5 1.5])
X1=[-1.5:.5:1.5]; Y1=(-b(1)-w(1,1)*X1)./w(1,2);
plot(X1,Y1)
X2=[-1.5:.5:1.5]; Y2=(-b(2)-w(2,1)*X2)./w(2,2);
plot(X2,Y2)
hold
title('Perceptron Decision Boundaries')
text(-1,.5,'A'); text(-.3,.5,'B'); text(.5,.5,'C');

Current plot released
[Figure: Perceptron Decision Boundaries, with regions labeled A, B, and C]


The simple single layer perceptron can separate linearly separable inputs but will fail if
the inputs are not linearly separable. One such example of linearly non-separable inputs
is the exclusive-or (XOR) problem. Linearly non-separable patterns, such as those of the
XOR problem, can be separated with multilayer networks. A two input, one hidden layer
of two neurons, one output network defines two lines in the two dimensional space.
These two lines can classify points into groups that are not linearly separable with one
line.
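To make the two-line idea concrete, the sketch below hard-codes a two input, two hidden neuron, one output threshold network that solves XOR. The weights are hand-picked for illustration (one hidden neuron acts as OR, the other as AND); they are assumptions, not values from the text, and no training is involved.

```matlab
thresh = @(v) double(v >= 0);        % assumed threshold convention

X  = [0 0 1 1; 0 1 0 1];             % the four XOR input patterns (columns)
W1 = [1 1; 1 1];  b1 = [-0.5; -1.5]; % hidden layer: row 1 = OR line, row 2 = AND line
W2 = [1 -1];      b2 = -0.5;         % output fires for "OR but not AND"

H = thresh(W1*X + repmat(b1, 1, 4)); % hidden layer outputs
Y = thresh(W2*H + b2)                % network outputs
```

Each hidden neuron contributes one line in the input plane, and the output neuron selects the band between the two lines.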

Perceptrons are limited in that they can only separate linearly separable patterns and that
they have a binary output. Many of the limitations of the simple perceptron can be
solved with multi-layer architectures, non-binary activation functions, and more complex
training algorithms. Multilayer perceptrons with threshold activation functions are not
that useful because they can't be trained with the perceptron learning rule and, since the
functions are not differentiable, they can't be trained with gradient descent algorithms.
However, if the first layer is randomly initialized, the second layer may be trained to
classify linearly non-separable classes (MATLAB Neural Networks Toolbox).

The Adaline (adaptive linear) network, also called a Widrow Hoff network, developed
by Bernard Widrow and Marcian Hoff [1960], is composed of one layer of linear transfer
functions, as opposed to threshold transfer functions, and thus has a continuous valued
output. It is trained with supervised training by the Delta Rule which will be discussed in
Chapter 8.
7.5 Multilayer Neural Network
Neural networks with one or more hidden layers are called multilayer neural networks or
multilayer perceptrons (MLP). Normally, each hidden layer of a network uses the same
type of activation function. The output activation function is either sigmoidal or linear.
The output of a sigmoidal neuron is constrained to [-1 1] for a hyperbolic tangent neuron
and [0 1] for a logarithmic sigmoidal neuron. A linear output neuron is not constrained
and can output a value of any magnitude.

It has been proven that the standard feedforward multilayer perceptron (MLP) with a
single non-linear hidden layer (sigmoidal neurons) can approximate any continuous
function to any desired degree of accuracy over a compact set [Cybenko 1989, Hornik
1989, Funahashi 1989, and others]; thus the MLP has been termed a universal
approximator. Haykin [1994] gives a very concise overview of the research leading to
this conclusion.

What this proof does not say is how many hidden layer neurons would be needed, or whether
the weight matrix that corresponds to a given error goal can be found. It may be
computationally limiting to train such a network since the size of the network is
dependent on the complexity of the function and the range of interest.

For example, a simple non-linear function:

\[ f(x) = x_1 \cdot x_2 \]

requires many nodes if the ranges of x_1 and x_2 are very large.

In order to be a universal approximator, the hidden layer of a multilayer perceptron
usually uses sigmoidal neurons. A linear hidden layer is rarely used because any two linear
transformations

\[ h = W_1 x, \qquad y = W_2 h \]

where W1 and W2 are transformation matrices that transform the mx1 vector x to h and
h to y, can be represented as one linear transformation

\[ W = W_2 W_1, \qquad y = W_2 W_1 x = W x \]

where W is a matrix that performs the transformation from x to y. The following figure
shows the general multilayer neural network architecture.

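[Figure: Multilayer Network. Input Layer with x0=1 (dummy bias node), x1, x2, ..., xm connected through W1 to a Hidden Layer with h0=1 (dummy bias node), h1, h2, ..., hn, which is connected through W2 to an Output Layer with y1, ..., yr]
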
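Returning briefly to the point about linear hidden layers: because matrix multiplication is associative, applying W1 and then W2 gives the same result as applying the single matrix W2*W1. The sketch below checks this with small integer matrices chosen purely for illustration.

```matlab
W1 = [1 2; 3 4; 5 6];      % 3x2: maps the input x to the hidden vector h
W2 = [1 0 -1; 2 1 0];      % 2x3: maps h to the output y
x  = [1; -1];

y_two_layers = W2*(W1*x);  % pass through both linear layers
W            = W2*W1;      % single equivalent 2x2 transformation
y_one_layer  = W*x;

same_output = isequal(y_two_layers, y_one_layer)
```

A linear hidden layer therefore adds parameters without adding representational power, which is why the hidden layer is normally non-linear.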
The output for a single hidden layer MLP with three inputs, three hidden hyperbolic
tangent neurons and two linear output neurons can be calculated using matrix algebra.

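\[
y = w_2 \tanh(w_1 x + b_1) + b_2
= \begin{bmatrix} 0.5 & -0.25 & 0.33 \\ 0.2 & -0.75 & -0.5 \end{bmatrix}
\tanh\left(
\begin{bmatrix} 0.2 & -0.7 & 0.9 \\ 2.3 & 1.4 & -2.1 \\ 10.2 & -10.2 & 0.3 \end{bmatrix}
\begin{bmatrix} 2 \\ 4 \\ 6 \end{bmatrix}
+ \begin{bmatrix} 0.5 \\ 0.2 \\ -0.8 \end{bmatrix}
\right)
+ \begin{bmatrix} 0.4 \\ -1.2 \end{bmatrix}
\]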

By using dummy nodes and embedding the biases into the weight matrix we can use a
more compact notation:

\[
y = w_2 \, [1; \tanh(w_1 \, [1; x])]
= \begin{bmatrix} 0.4 & 0.5 & -0.25 & 0.33 \\ -1.2 & 0.2 & -0.75 & -0.5 \end{bmatrix}
\begin{bmatrix} 1 \\ \tanh\left(
\begin{bmatrix} 0.5 & 0.2 & -0.7 & 0.9 \\ 0.2 & 2.3 & 1.4 & -2.1 \\ -0.8 & 10.2 & -10.2 & 0.3 \end{bmatrix}
\begin{bmatrix} 1 \\ 2 \\ 4 \\ 6 \end{bmatrix}
\right) \end{bmatrix}
\]


x=[2 4 6]';
w1=[0.2 -0.7 0.9; 2.3 1.4 -2.1; 10.2 -10.2 0.3];
w2=[0.5 -0.25 0.33; 0.2 -0.75 -0.5];
b1=[0.5 0.2 -0.8]';
b2=[0.4 -1.2]';
y=w2*tanh(w1*x+b1)+b2

y =
0.8130
0.2314

or

x=[2 4 6]';
w1=[0.5 0.2 -0.7 0.9; 0.2 2.3 1.4 -2.1; -0.8 10.2 -10.2 0.3];
w2=[0.4 0.5 -0.25 0.33; -1.2 0.2 -0.75 -0.5];
y=w2*[1;tanh(w1*[1;x])]

y =
0.8130
0.2314

Chapter 8 Backpropagation and Related Training Paradigms
Backpropagation (BP) is a general method for iteratively solving for a multilayer
perceptron's weights and biases. It uses a steepest descent technique which is very stable
when a small learning rate is used, but has slow convergence properties. Several
methods for speeding up BP have been used, including momentum and a variable learning
rate.

Other methods of solving for the weights and biases involve more complex algorithms.
Many of these techniques are based on Newton's method, but practical implementations
usually use a combination of Newton's method and steepest descent. The most popular
second order method is the Levenberg [1944] and Marquardt [1963] technique. The
scaled conjugate gradient technique [Moller 1993] is also very popular because it is less
memory intensive than Levenberg Marquardt and more powerful than gradient descent
techniques. These methods will not be implemented in this supplement, although
Levenberg Marquardt is implemented in The MathWorks' Neural Network Toolbox.
8.1 Derivative of the Activation Functions
The chain rule that is used in deriving the BP algorithm necessitates the computation of
the derivative of the activation functions. For logistic, hyperbolic tangent, and linear
functions, the derivatives are as follows:

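\[
\begin{aligned}
\text{Linear:} \quad & \Phi(I) = I, & \dot{\Phi}(I) &= 1 \\
\text{Logistic:} \quad & \Phi(I) = \frac{1}{1 + e^{-\alpha I}}, & \dot{\Phi}(I) &= \alpha\,\Phi(I)\,(1 - \Phi(I)) \\
\text{Tanh:} \quad & \Phi(I) = \frac{e^{\alpha I} - e^{-\alpha I}}{e^{\alpha I} + e^{-\alpha I}}, & \dot{\Phi}(I) &= \alpha\,(1 - \Phi(I)^2)
\end{aligned}
\]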


Alpha is called the slope parameter. Usually alpha is chosen to be 1 but other slopes may
be used. This formulation for the derivative makes the computation of the gradient more
efficient since the output Φ(I) has already been calculated in the forward pass. A plot of
the logistic function and its derivative follows.

x=[-5:.1:5];
y=1./(1+exp(-x));
dy=y.*(1-y);
subplot(2,1,1)
plot(x,y)
title('Logistic Function')
subplot(2,1,2)
plot(x,dy)
title('Derivative of Logistic Function')

[Figure: two stacked plots titled Logistic Function and Derivative of Logistic Function]


As can be seen, the highest gradient is at I=0. Since the speed of learning is partially
dependent on the size of the gradient, the internal activation of all neurons should be kept
small to expedite training. This is why we scale the inputs and initialize weights to small
random values.
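
The derivative formulas above can be checked numerically. The following is a minimal sketch, not part of the original supplement: it compares each formula with a central-difference estimate. The slope value, the test points, and the step size d are arbitrary assumptions chosen only for illustration.

```matlab
% Finite-difference check of the derivative formulas (alpha is the slope).
alpha = 2;                              % arbitrary slope chosen for the test
I = -3:0.5:3;                           % internal activations to test
d = 1e-6;                               % finite-difference step (assumed)

logis = @(I) 1./(1+exp(-alpha*I));      % logistic with slope alpha
tanhf = @(I) tanh(alpha*I);             % hyperbolic tangent with slope alpha

% Derivatives computed from the neuron outputs alone.
dlog_formula  = alpha*logis(I).*(1-logis(I));
dtanh_formula = alpha*(1-tanhf(I).^2);

% Central-difference estimates of the same derivatives.
dlog_numeric  = (logis(I+d)-logis(I-d))/(2*d);
dtanh_numeric = (tanhf(I+d)-tanhf(I-d))/(2*d);

max_err_logistic = max(abs(dlog_formula-dlog_numeric))
max_err_tanh     = max(abs(dtanh_formula-dtanh_numeric))
```

Both maximum errors should be very small, which supports computing the derivative directly from the forward-pass output.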

Since there is no logistic function in MATLAB, we will write an m-file function to
perform the computation.

function [y] = logistic(x);
% [y] = logistic(x)
% Returns the result of applying the logistic operator to the input x.
% x : the input
% y : the result
y=1./(1+exp(-x));
8.2 Backpropagation for a Multilayer Neural Network
The backpropagation algorithm is an optimization technique designed to minimize an
objective function [Werbos 1974]. The most commonly used objective function is the
squared error which is defined as:

$$\varepsilon^2 = [T_q - \Phi_{q.k}]^2$$

The network syntax is defined as in the following figure:

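[Figure: three-layer network. Input layer (i), index h, m nodes; hidden layer (j), index p, n nodes; output layer (k), index q, r nodes. Inputs x_h feed the hidden neurons through weights w_hp.j, giving internal activations I_p.j and outputs Φ_p.j. These feed the output neurons through weights w_pq.k, giving I_q.k and Φ_q.k, which are compared with the targets T_q to form the errors ε_q.]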


In this notation, the layers are labeled i, j, and k; with m, n, and r neurons respectively;
and the neurons in each layer are indexed h, p, and q respectively.

x = input value
T = target output value
w = weight value
I = internal activation
Φ = neuron output
ε = error term

The outputs for a two layer network with both layers using a logistic activation function
are calculated by the equation:

$$\Phi = \mathrm{logistic}\{\, w2 * [\mathrm{logistic}(w1 * x + b1)] + b2 \,\}$$

where: w1 = first layer weight matrix
w2 = second layer weight matrix
b1 = first layer bias vector
b2 = second layer bias vector

The input vector can be augmented with a dummy node representing the bias input. This
dummy input of 1 is multiplied by a weight corresponding to the bias value. This results
in a more compact representation of the above equation:

$$\Phi = \mathrm{logistic}\left( W2 * \begin{bmatrix} 1 \\ \mathrm{logistic}(W1 * X) \end{bmatrix} \right)$$


where X = [1; x] % Augmented input vector.
W1 = [b1 w1]
W2 = [b2 w2]

Note that a dummy hidden node (=1) also needs to be inserted into the equation.

For a two input network with two hidden nodes and one output node we have:

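$$x = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}, \qquad W1 = \begin{bmatrix} b_{11} & w_{11} & w_{21} \\ b_{12} & w_{12} & w_{22} \end{bmatrix}, \qquad W2 = \begin{bmatrix} b_2 & w_{11} & w_{12} \end{bmatrix}$$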


As an example, consider a network that has 2 inputs, one hidden layer of 2 logistic
neurons, and 1 logistic output neuron (same as in the text). After we define the initial
weight and bias matrices we combine them to be used in the MATLAB code and
calculate the output value for the initial weights.

x = [0.4;0.7]; % rows = # of inputs = 2
% columns = # of patterns = 1
w1 = [0.1 -0.2; 0.4 0.2]; % rows = # of hidden nodes = 2
% columns = # of inputs = 2
w2 = [0.2 -0.5]; % rows = # of outputs = 1
% columns = # of hidden nodes = 2
b1=[-0.5;-0.2]; % rows = number of hidden nodes = 2
b2=[-0.6]; % rows = number of output nodes = 1

X = [1;x] % Augmented input vector.
W1 = [b1 w1]
W2 = [b2 w2]
output=logistic(W2*[1;logistic(W1*X)])

X =
1.0000
0.4000
0.7000
W1 =
-0.5000 0.1000 -0.2000
-0.2000 0.4000 0.2000
W2 =
-0.6000 0.2000 -0.5000
output =
0.3118
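
As a quick check that the augmented matrices give the same result as the explicit bias form, the two expressions can be evaluated side by side. This is a minimal sketch using the example values above; the anonymous function is only an inline stand-in for logistic.m so that the lines run on their own.

```matlab
logistic = @(x) 1./(1+exp(-x));     % inline stand-in for logistic.m
x  = [0.4;0.7];
w1 = [0.1 -0.2; 0.4 0.2];  b1 = [-0.5;-0.2];
w2 = [0.2 -0.5];           b2 = -0.6;

% Explicit bias form.
out_bias = logistic(w2*logistic(w1*x+b1)+b2);

% Augmented form with dummy nodes of 1.
X = [1;x];  W1 = [b1 w1];  W2 = [b2 w2];
out_aug = logistic(W2*[1;logistic(W1*X)]);

difference = abs(out_bias-out_aug)
```

The difference should be zero to within rounding error.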

8.2.1 Weight Updates
The output layer weights are changed in proportion to the negative gradient of the
squared error with respect to the weights. These weight changes can be calculated using
the chain rule. The following is the derivation for a two layer network with each layer
having logistic activation functions. Note that the target outputs can only have a range of
[0 1] for a network with a logistic output layer.

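$$\begin{aligned}
\Delta w_{pq.k} &= -\eta_{p.q}\,\frac{\partial \varepsilon^2}{\partial w_{pq.k}} \\
&= -\eta_{p.q}\cdot\frac{\partial \varepsilon^2}{\partial \Phi_{q.k}}\cdot\frac{\partial \Phi_{q.k}}{\partial I_{q.k}}\cdot\frac{\partial I_{q.k}}{\partial w_{pq.k}} \\
&= -\eta_{p.q}\cdot\left(-2\,[T_q-\Phi_{q.k}]\right)\cdot\Phi_{q.k}\,[1-\Phi_{q.k}]\cdot\Phi_{p.j} \\
&= -\eta_{p.q}\cdot\delta_{pq.k}\cdot\Phi_{p.j}
\end{aligned}$$

and

$$\delta_{pq.k} = -2\,[T_q-\Phi_{q.k}]\,\Phi_{q.k}\,[1-\Phi_{q.k}]$$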
2 1


The weight update equation for the output neurons is:

$$w_{pq.k}(N+1) = w_{pq.k}(N) - \eta_{p.q}\,\delta_{pq.k}\,\Phi_{p.j}$$


The output of the hidden layer of the above network is $\Phi_{p.j}$ = h = [0.35 0.52]. The target
output is T = [0.1]. The update for the output layer weights is calculated by first
propagating forward through the network to calculate the error terms:

h = logistic(W1*X); % Hidden node output.
H = [1;h]; % Augmented hidden node output.
t = [0.1]; % Target output.
Out_err = t-logistic(W2*[1;logistic(W1*X)])

Out_err =
-0.2118

Next, the gradient vector for the output weight matrix is calculated.

output=logistic(W2*[1;logistic(W1*X)])
delta2=output.*(1-output).*Out_err % Derivative of logistic function.

output =
0.3118
delta2 =
-0.0455

And lastly, the weights are updated with the learning rate η = 0.5.

lr = 0.5;
del_W2 = 2*lr*delta2*H' % Change in weight.
new_W2 = W2+del_W2 % Weight update.

del_W2 =
-0.0455 -0.0161 -0.0239
new_W2 =
-0.6455 0.1839 -0.5239

Out_err = t-logistic(new_W2*[1;logistic(W1*X)]) % New output error.

Out_err =
-0.1983

We see that by updating the output weight matrix for one iteration of training, the error is
reduced from 0.2118 to 0.1983.
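
One way to gain confidence in the chain-rule result is to compare it with a finite-difference estimate of the gradient. The sketch below is an illustration added here, not part of the supplement's code: it uses the example network, an inline stand-in for logistic.m, and an assumed step size d. The analytic gradient of the squared error with respect to W2 is -2*delta2*H', so the weight change used above is this gradient multiplied by -lr.

```matlab
logistic = @(x) 1./(1+exp(-x));         % inline stand-in for logistic.m
X  = [1;0.4;0.7];  t = 0.1;
W1 = [-0.5 0.1 -0.2; -0.2 0.4 0.2];
W2 = [-0.6 0.2 -0.5];
H  = [1;logistic(W1*X)];
sqerr = @(W) (t-logistic(W*H)).^2;      % squared error as a function of W2

% Analytic gradient from the chain rule.
output = logistic(W2*H);
delta2 = output.*(1-output).*(t-output);
grad_analytic = -2*delta2*H';

% Central-difference estimate, one weight at a time.
d = 1e-6;  grad_numeric = zeros(size(W2));
for n = 1:length(W2)
  Wp = W2;  Wp(n) = Wp(n)+d;
  Wm = W2;  Wm(n) = Wm(n)-d;
  grad_numeric(n) = (sqerr(Wp)-sqerr(Wm))/(2*d);
end
grad_analytic
grad_numeric
```

The two vectors should agree closely.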

8.2.2 Hidden Layer Weight Updates
The hidden layer outputs have no target values. Therefore, a procedure is used to
backpropagate the output layer errors to the hidden layer neurons in order to modify their
weights to minimize the error. To accomplish this, we start with the equation for the
gradient with respect to the weights and use the chain rule.

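$$\begin{aligned}
\Delta w_{hp.j} &= -\eta_{h.p}\,\frac{\partial \varepsilon^2}{\partial w_{hp.j}} \\
&= -\eta_{h.p}\sum_{q=1}^{r}\frac{\partial \varepsilon_q^2}{\partial w_{hp.j}} \\
&= -\eta_{h.p}\sum_{q=1}^{r}\frac{\partial \varepsilon_q^2}{\partial \Phi_{q.k}}\cdot\frac{\partial \Phi_{q.k}}{\partial I_{q.k}}\cdot\frac{\partial I_{q.k}}{\partial \Phi_{p.j}}\cdot\frac{\partial \Phi_{p.j}}{\partial I_{p.j}}\cdot\frac{\partial I_{p.j}}{\partial w_{hp.j}}
\end{aligned}$$

where

$$\frac{\partial \varepsilon_q^2}{\partial \Phi_{q.k}} = -2\,[T_q-\Phi_{q.k}]$$

$$\frac{\partial \Phi_{q.k}}{\partial I_{q.k}} = \alpha\,\Phi_{q.k}\,[1-\Phi_{q.k}]$$

$$\frac{\partial I_{q.k}}{\partial \Phi_{p.j}} = w_{pq.k}$$

$$\frac{\partial \Phi_{p.j}}{\partial I_{p.j}} = \alpha\,\Phi_{p.j}\,[1-\Phi_{p.j}]$$

$$\frac{\partial I_{p.j}}{\partial w_{hp.j}} = x_h$$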


resulting in:
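$$\begin{aligned}
\frac{\partial \varepsilon^2}{\partial w_{hp.j}} &= \sum_{q=1}^{r} -2\,[T_q-\Phi_{q.k}]\cdot\alpha\,\Phi_{q.k}\,[1-\Phi_{q.k}]\cdot w_{pq.k}\cdot\alpha\,\Phi_{p.j}\,[1-\Phi_{p.j}]\,x_h \\
&= \sum_{q=1}^{r}\delta_{pq.k}\,w_{pq.k}\cdot\alpha\,\Phi_{p.j}\,[1-\Phi_{p.j}]\,x_h
\end{aligned}$$

$$\delta_{hp.j} = \sum_{q=1}^{r}\delta_{pq.k}\,w_{pq.k}\,\frac{\partial \Phi_{p.j}}{\partial I_{p.j}}$$

$$w_{hp.j}(N+1) = w_{hp.j}(N) - \eta_{hp}\,x_h\,\delta_{hp.j}$$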


First we must backpropagate the gradient terms back to the hidden layer:

[numout,numhid]=size(W2);
delta1=delta2.*h.*(1-h).*W2(:,2:numhid)'

delta1 =
-0.0021
0.0057

Now we calculate the hidden layer weight change. Note that we don't propagate back
through the dummy node.

del_W1 = 2*lr*delta1*X'
new_W1 = W1+del_W1

del_W1 =
-0.0021 -0.0008 -0.0015
0.0057 0.0023 0.0040
new_W1 =
-0.5021 0.0992 -0.2015
-0.1943 0.4023 0.2040

Now we calculate the new output value.

output = logistic(new_W2*[1;logistic(new_W1*X)])

output =
0.2980

The new output is 0.298 which is closer to the target of 0.1 than was the original output
of 0.3118. The magnitude of the learning rate affects the convergence and stability of the
training. This will be discussed in greater detail in the adaptive learning rate section.
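
The hidden layer gradient can be checked in the same spirit as any analytic derivative, by comparing it with a finite-difference estimate. The following sketch is an added illustration under the same assumptions as the example above (single pattern, logistic layers, slope of 1); the step size d and the inline logistic stand-in are assumptions. The gradient of the squared error with respect to W1 is -2*delta1*X'.

```matlab
logistic = @(x) 1./(1+exp(-x));         % inline stand-in for logistic.m
X  = [1;0.4;0.7];  t = 0.1;
W1 = [-0.5 0.1 -0.2; -0.2 0.4 0.2];
W2 = [-0.6 0.2 -0.5];
sqerr = @(W) (t-logistic(W2*[1;logistic(W*X)])).^2;

% Analytic gradient: backpropagate delta2 through the weight part of W2.
h      = logistic(W1*X);
output = logistic(W2*[1;h]);
delta2 = output.*(1-output).*(t-output);
delta1 = h.*(1-h).*(W2(:,2:end)'*delta2);
grad_analytic = -2*delta1*X';

% Central-difference estimate, one weight at a time.
d = 1e-6;  grad_numeric = zeros(size(W1));
for n = 1:numel(W1)
  Wp = W1;  Wp(n) = Wp(n)+d;
  Wm = W1;  Wm(n) = Wm(n)-d;
  grad_numeric(n) = (sqerr(Wp)-sqerr(Wm))/(2*d);
end
max_difference = max(max(abs(grad_analytic-grad_numeric)))
```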
8.2.3 Batch Training
The type of training used in the previous section is called sequential training. During
sequential training, the weights are updated after each presentation of a pattern. This
method is more stochastic when the patterns are presented in random order, which may reduce
the chance of getting stuck in a local minimum. During batch training, all of the training
patterns are processed before a weight update is made. Suppose the training set consists
of four patterns (z=4). Now we have four output patterns from the hidden layer of the
network and four target outputs.

x = [0.4 0.8 1.3 -1.3;0.7 0.9 1.8 -0.9];
t = [0.1 0.3 0.6 0.2];
[inputs,patterns] = size(x);
[outputs,patterns] = size(t);
W1 = [0.1 -0.2 0.1; 0.4 0.2 0.9]; % rows = # of hidden nodes = 2
% columns = # of inputs +1 = 3
W2 = [0.2 -0.5 0.1]; % rows = # of outputs = 1
% columns = # of hidden nodes+1 = 3
X = [ones(1,patterns); x]; % Augment with bias dummy node.
h = logistic(W1*X);
H = [ones(1,patterns);h];
e = t-logistic(W2*H)

e =
-0.4035 -0.2065 0.0904 -0.2876

The sum of squared error is:

SSE = sum(sum(e.^2))

SSE =
0.2963

Next, the gradient vector is calculated:

output = logistic(W2*H)
delta2 = output.*(1-output).*e

output =
0.5035 0.5065 0.5096 0.4876
delta2 =
-0.1009 -0.0516 0.0226 -0.0719

And lastly, the weights are updated with a learning rate η = 0.5:

lr = 0.5;
del_W2 = 2*lr* delta2*H'
new_W2 = W2+ del_W2

del_W2 =
-0.2017 -0.1082 -0.1208
new_W2 =
-0.0017 -0.6082 -0.0208

The new sum of squared error is calculated as:

e = t- logistic(new_W2*H);
SSE = sum(sum(e.^2))

SSE =
0.1926

The SSE has been reduced from 0.2963 to 0.1926 by just changing the output layer
weights and biases.

To change the hidden layer weight matrix we must backpropagate the gradient terms
back to the hidden layer. Note that we can't backpropagate through a dummy node, so
only the weight portion of W2 is used.
[numout,numhidb] = size(W2);
delta1 = h.*(1-h).*(W2(:,2:numhidb)'*delta2)

delta1 =
0.0126 0.0065 -0.0028 0.0088
-0.0019 -0.0008 0.0002 -0.0016

Now we calculate the hidden layer weight change.

del_W1 = 2*lr*delta1*X'
new_W1 = W1+del_W1

del_W1 =
0.0250 -0.0049 0.0016
-0.0041 0.0009 -0.0003
new_W1 =
0.1250 -0.2049 0.1016
0.3959 0.2009 0.8997

h = logistic(new_W1*X);
H = [ones(1,patterns);h];
e = t-logistic(new_W2*H);
SSE = sum(sum(e.^2))

SSE =
0.1917

The new SSE is 0.1917 which is less than the SSE of 0.1926 so the change in hidden
layer weights reduced the SSE slightly more.
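
For comparison with the batch update, one pass of sequential training over the same four patterns could look like the sketch below. It is an illustration only: the pattern order is fixed rather than random, and the inline logistic stand-in is an assumption so that the lines run on their own. The SSE before and after the pass can be compared with the batch results above.

```matlab
logistic = @(x) 1./(1+exp(-x));         % inline stand-in for logistic.m
x = [0.4 0.8 1.3 -1.3;0.7 0.9 1.8 -0.9];
t = [0.1 0.3 0.6 0.2];
W1 = [0.1 -0.2 0.1; 0.4 0.2 0.9];
W2 = [0.2 -0.5 0.1];
lr = 0.5;  patterns = size(x,2);
X = [ones(1,patterns); x];
SSE_before = sum((t-logistic(W2*[ones(1,patterns);logistic(W1*X)])).^2)

for z = 1:patterns                      % one weight update per pattern
  Xz = X(:,z);
  h  = logistic(W1*Xz);  H = [1;h];
  output = logistic(W2*H);
  delta2 = output.*(1-output).*(t(z)-output);
  delta1 = h.*(1-h).*(W2(:,2:end)'*delta2);   % uses W2 before its update
  W2 = W2 + 2*lr*delta2*H';
  W1 = W1 + 2*lr*delta1*Xz';
end
SSE_after = sum((t-logistic(W2*[ones(1,patterns);logistic(W1*X)])).^2)
```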
8.2.4 Adaptive Learning Rate
In the previous examples a fixed learning rate was used. When training a neural network
iteratively, it is more efficient to use an adaptive learning rate. The learning rate can be
thought of as the size of a step down the error gradient. If very small steps are taken, you
will reliably approach an error minimum, but this may take a very long time. Larger
steps may result in unstable learning since you may step over a minimum. To speed
training and still have stability, a heuristic method is used to determine the step size.

The heuristic rule states:
If training "went well" (error decreased), then increase the step size.
    lr=lr*1.1
If training was "poor" (error increased), then decrease the step size.
    lr=lr*0.5
Only update the weights if the error decreased.

Using an adaptive learning rate allows the training procedure to quickly move across
large error plateaus and slowly descend through tortuous error paths. This results in
training that is somewhat optimized to increase learning speed while remaining stable.
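
The heuristic can be sketched as a small change to a batch training loop: compute trial weights, keep them only if the error decreased, and adjust the learning rate either way. The code below is a minimal illustration, not the supplement's backprop() function. The data, the six hidden nodes, the 200 cycles, and all variable names ending in _try are assumptions made for the example.

```matlab
logistic = @(x) 1./(1+exp(-x));         % inline stand-in for logistic.m
x = [0.4 0.8 1.3 -1.3;0.7 0.9 1.8 -0.9];
t = [0.1 0.3 0.6 0.2];
patterns = size(x,2);  hidden = 6;
X  = [ones(1,patterns); x];
W1 = 0.1*ones(hidden,3);  W2 = 0.1*ones(1,hidden+1);
lr = 0.5;  cycles = 200;
SSE = zeros(1,cycles);  lr_hist = zeros(1,cycles);

h = logistic(W1*X);  H = [ones(1,patterns);h];
e = t-logistic(W2*H);  sse_old = sum(e.^2);  SSE0 = sse_old;
for i = 1:cycles
  output = logistic(W2*H);
  delta2 = output.*(1-output).*e;
  delta1 = h.*(1-h).*(W2(:,2:end)'*delta2);
  W2_try = W2 + 2*lr*delta2*H';          % trial weights
  W1_try = W1 + 2*lr*delta1*X';
  h_try = logistic(W1_try*X);  H_try = [ones(1,patterns);h_try];
  e_try = t-logistic(W2_try*H_try);  sse_try = sum(e_try.^2);
  if sse_try < sse_old                   % training "went well": accept, speed up
    W1 = W1_try;  W2 = W2_try;  h = h_try;  H = H_try;  e = e_try;
    sse_old = sse_try;  lr = lr*1.1;
  else                                   % training was "poor": reject, slow down
    lr = lr*0.5;
  end
  SSE(i) = sse_old;  lr_hist(i) = lr;
end
```

Because a step is accepted only when it lowers the error, the recorded SSE never increases from one cycle to the next, while lr_hist shows how the step size grows and shrinks during training.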

8.2.5 The Backpropagation Training Cycle
The training procedure discussed above is applied iteratively for a certain number of
cycles or until a specified error goal is met. Before starting training, weights are
initialized to small random values and inputs are scaled to small values of similar
magnitude to reduce the chance of prematurely saturating the sigmoidal neurons and thus
slowing training. Input scaling and weight initialization will be covered in subsequent
sections.

x = [0.4 0.8 1.3 -1.3;0.7 0.9 1.8 -0.9];
t = [0.1 0.3 0.6 0.2];
[inputs,patterns] = size(x);
[outputs,patterns] = size(t);
hidden=6;
W1=0.1*ones(hidden,inputs+1); % Initialize to matrix of 0.1's.
W2=0.1*ones(outputs,hidden+1); % Initialize to matrix of 0.1's.
maxcycles=200;SSE_Goal=0.1;
lr=0.5;SSE=zeros(1,maxcycles);
X=[ones(1,patterns); x]; % Augment inputs with bias dummy node.
for i=1:maxcycles
h = logistic(W1*X);
H=[ones(1,patterns);h];
e=t-logistic(W2*H);
SSE(i)= sum(sum(e.^2));
if SSE(i)<SSE_Goal; break;end
output = logistic(W2*H);
delta2= output.*(1-output).*e;
del_W2= 2*lr* delta2*H';
W2 = W2+ del_W2;
delta1 = h.*(1-h).*(W2(:,2:hidden+1)'*delta2);
del_W1 = 2*lr*delta1*X';
W1 = W1+del_W1;
end;clf
semilogy(nonzeros(SSE));
title('Backpropagation Training');
xlabel('Cycles');ylabel('Sum of Squared Error')
if i<200;fprintf('Error goal reached in %i cycles.',i);end

Error goal reached in 116 cycles.
[Figure: "Backpropagation Training". Sum of squared error (log scale) versus training cycles.]

You can try different numbers of hidden nodes in the above example. The relationship
between the number of hidden nodes and the cycles to reach the error goal is shown
below. Note that the weights are usually initialized to random numbers as discussed in
section 8.4. When the weights are randomly chosen, the number of training cycles
varies. This is due to the different initial position on the error surface. In fact, sometimes
a network will not train to a desired error goal with one set of initial weights, due to
getting trapped in a local minimum, but will train with a different set of initial weights.

Number of Hidden Nodes    Training Cycles to Error Goal
1                         106
2                         95
3                         97
4                         102
5                         109
6                         116

Each training set will have its own relationship between training cycles and the number
of hidden nodes. The results shown above are fairly typical. After a point where enough
free parameters are given to the network to model the function, the addition of hidden
nodes complicates the error surface and the number of training cycles may increase.
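
An experiment of this kind can be run by wrapping the training loop in an outer loop over the number of hidden nodes. The sketch below is an illustration that reuses the settings of the training example above (weights of 0.1, learning rate of 0.5, error goal of 0.1). The variable cycles_needed and the inline logistic stand-in are assumptions, and the counts it prints depend on these exact settings.

```matlab
logistic = @(x) 1./(1+exp(-x));         % inline stand-in for logistic.m
x = [0.4 0.8 1.3 -1.3;0.7 0.9 1.8 -0.9];
t = [0.1 0.3 0.6 0.2];
[inputs,patterns] = size(x);  outputs = size(t,1);
X = [ones(1,patterns); x];
maxcycles = 200;  SSE_Goal = 0.1;  lr = 0.5;
cycles_needed = zeros(1,6);
for hidden = 1:6
  W1 = 0.1*ones(hidden,inputs+1);
  W2 = 0.1*ones(outputs,hidden+1);
  for i = 1:maxcycles
    h = logistic(W1*X);  H = [ones(1,patterns);h];
    output = logistic(W2*H);  e = t-output;
    if sum(sum(e.^2)) < SSE_Goal; break; end
    delta2 = output.*(1-output).*e;
    W2 = W2 + 2*lr*delta2*H';
    delta1 = h.*(1-h).*(W2(:,2:hidden+1)'*delta2);
    W1 = W1 + 2*lr*delta1*X';
  end
  cycles_needed(hidden) = i;   % equals maxcycles if the goal was not met earlier
end
cycles_needed
```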
8.3 Scaling Input Vectors
Training data is scaled for two major reasons. First, input data is usually scaled to give
each input equal importance and to prevent premature saturation of sigmoidal activation
functions. Secondly, output or target data is scaled if the output activation functions have
a limited range and the unscaled targets do not match that range.

There are two popular types of input scaling: linear scaling and z-score scaling. Linear
scaling transforms the data into a new range, which is usually 0.1 to 0.9. If the training
patterns are in a form such that the columns are inputs and the rows are patterns, a
MATLAB function to perform linear scaling is:

function [y,slope,int]=scale(x,slope,int)
% [y,m,b]=scale(x,m,b)
%
% Linear scale the data between .1 and .9
%
% y = m*x + b
%
% x = data
% m = slope
% b = y intercept
%
[nrows,ncols]=size(x);

if nargin == 1
del = max(x)-min(x); % calculate slope and intercept
slope = .8./del;
int = .1 - slope.*min(x);
end

y = (ones(nrows,1)*slope).*x + ones(nrows,1)*int;

The function returns the scaled inputs y and the scale parameters: slope and int. Each
column of inputs has a maximum and minimum value of 0.9 and 0.1 respectively. The
scaling parameters are returned so that other input data may be transformed using the
same terms. A network that is trained with scaled inputs must always have its inputs
scaled using the same scaling parameters.

This scaling function can also be used for scaling target values. Targets are usually
scaled to the range 0.1 to 0.9 when logistic output activation functions are used. This keeps the
training algorithm from trying to force outputs beyond the range of the function.
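
When targets are scaled, the network outputs come out in the scaled range and usually have to be mapped back to the original units. The sketch below shows one way to invert the linear scaling. It is an added illustration with made-up target data, and the variable names t_scaled and t_back are hypothetical; the slope and intercept formulas are the same as in the scaling function above.

```matlab
% Hypothetical example data: rows = patterns, columns = outputs.
t = [2 50; 4 10; 9 30];
nrows = size(t,1);
del   = max(t)-min(t);
slope = .8./del;                          % same formulas as in scale()
int   = .1 - slope.*min(t);
t_scaled = (ones(nrows,1)*slope).*t + ones(nrows,1)*int;

% Invert y = m*x + b to recover the original units: x = (y - b)/m.
t_back = (t_scaled - ones(nrows,1)*int)./(ones(nrows,1)*slope);
max_error = max(max(abs(t_back-t)))
```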

Another method of scaling, called z-score or mean center unit variance scaling, is also
frequently used. This method subtracts the mean of each input from each column and
then divides by the standard deviation. This centers all the patterns of each data type around 0 and
gives them a unit variance. The MATLAB function to perform z-score scaling is:

function [y,meanval,stdval] = zscore(x,meanval,stdval)
%
% [y,mean,std] = zscore(x, mean_in,std_in)
%
% Mean center the data and scale to unit variance.
% If the number of inputs is one, calculate the mean and standard deviation.
% If the number of inputs is three, use the supplied mean and SD.
%
%
[nrows,ncols]=size(x);

if nargin == 1
meanval = mean(x); % calculate mean values
end

y = x - ones(nrows,1)*meanval; % subtract off mean

if nargin == 1
stdval = std(y); % calculate the SD
end

y = y ./ (ones(nrows,1)*stdval); % normalize to unit variance

An example of the linear and z-score scaling functions is:

x=[1 2;30 21;-1 -10;8 34]
[y,slope, int]=scale(x)
[y,meanval,stdval]=zscore(x)

x =
1 2
30 21
-1 -10
8 34
y =
0.1516 0.3182
0.9000 0.6636
0.1000 0.1000
0.3323 0.9000
slope =
0.0258 0.0182
int =
0.1258 0.2818
y =
-0.5986 -0.4983
1.4436 0.4727
-0.7394 -1.1115
-0.1056 1.1370
meanval =
9.5000 11.7500
stdval =
14.2009 19.5683

If a network is trained with scaled data and new data is presented to the network, it must
first be scaled using the same scaling factors. In this case, the scaling functions are called
with three variables and only the scaled data is passed back.

x_new=[-2 4;-.3 12; 9 -10]
[y]=scale(x_new, slope, int)
[y]=zscore(x_new,meanval,stdval)

x_new =
-2.0000 4.0000
-0.3000 12.0000
9.0000 -10.0000
y =
0.0742 0.3545
0.1181 0.5000
0.3581 0.1000
y =
-0.8098 -0.3960
-0.6901 0.0128
-0.0352 -1.1115

8.4 Initializing Weights
As mentioned above, the initial weights should be selected to be small random values in
order to prevent premature saturation of the sigmoidal activation functions. The most
common method is to use the random number generator and pass it the number of inputs
plus 1 and the number of hidden nodes for the first hidden layer weight matrix W1 and
pass it the number of outputs and hidden nodes plus 1 for the output weight matrix W2.
One is added to the number of inputs in W1 and to hidden in W2 to account for the bias.
To make the weights somewhat smaller, the resulting random weight matrix is multiplied
by 0.5.

W1=0.5*randn(2,3)

W1 =
0.5825 0.0375 -0.3483
0.3134 0.1758 0.8481

We are trying to limit the internal activation of the neurons during training to a high
gradient region. This region is between -2.5 and 2.5 for a hyperbolic tangent neuron and
-5 to +5 for a logistic function.

plot([-8:.1:8], logistic([-8:.1:8]))
title('Logistic Activation Function')

[Figure: "Logistic Activation Function", plotted for inputs from -8 to 8 with outputs from 0 to 1.]


The sum of the inputs times their weights should be in this high gradient region for
efficient training. Scaling the inputs to small values helps, but the weights should also be
made small, random, and centered around 0.
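
Whether a given initialization keeps the neurons in the high gradient region can be checked directly by computing the internal activations for a set of scaled inputs. The following sketch is an illustration only; the network size, the number of patterns, and the uniformly random inputs are assumptions.

```matlab
inputs = 2;  hidden = 5;  patterns = 200;    % assumed sizes for the test
x  = 0.1 + 0.8*rand(inputs,patterns);        % inputs already scaled to [0.1 0.9]
W1 = 0.5*randn(hidden,inputs+1);             % small random weights, bias included
I  = W1*[ones(1,patterns); x];               % internal activations of hidden layer
largest_activation = max(max(abs(I)))
fraction_in_region = mean(mean(abs(I) < 5))  % share inside the -5 to +5 region
```

With weights this small, nearly all activations should fall inside the region; repeating the test with unscaled inputs or larger weights shows how quickly the neurons can saturate.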
8.5 Creating a MATLAB Function for Backpropagation
In this section, a MATLAB script will be discussed that performs the necessary
operations to define and train a multilayer perceptron with a backpropagation algorithm.
Up to this point, the two layer networks used logistic activation functions in each layer.
This limits the network output to the interval [0 1]. Since most problems have targets
outside of this range, the training function backprop() will use a linear output layer that
allows targets of any magnitude to be reached.

The backpropagation MATLAB script that sets up the network architecture and training
parameters is called bptrain. To use this script you must first have training data (x,t)
saved in a file (data.mat). The following defines the format for these data:

Variable   Description   Rows                Columns
x          Input data    Number of inputs    Number of patterns
t          Target data   Number of outputs   Number of patterns

An example of creating and saving a training set is:

x=[0:1:10];
t=2*x - 0.21*x.^2;
save data8 x t

This will create training data to train a network to approximate the function

$$y = 2x_1 - 0.21x_1^2$$

over the interval [0 10] and store it in binary format in a file named data8.mat. The
MATLAB script, bptrain, asks for the name of the file containing the training data, the
number of hidden neurons, and the type of scaling to use. It also asks if the default error
tolerance, maximum number of cycles, and initial learning rate are acceptable. If they are,
the function backprop() is called and the network is trained. After training, the weights,
biases, and scaling parameters are saved in a file called weights.mat. The following is a
diary of running the script bptrain on the training data defined above.

EDU» bptrain

Enter filename of input/output data vectors: data8

How many neurons in the hidden layer? 20

Input scaling method zscore=z, linear=l, none=n ?[z,l,n]:n

The default variables are:

Output error tolerence (RMS per output term) = .1
Maximum number of training cycles = 5000.
The initial learning rate is = 0.1.

Are you satisfied with these selections? [y,n]:y

This network has:

1 input neurons
20 neurons in the hidden layer
1 output neurons

There are 11 input/output pairs in this training set.

*** BP Training complete, error goal not met!
*** RMS = 1.780286e-001
[Figure: two stacked plots versus training cycles (0 to 5000): "Root Mean Squared Error" (log scale) and "Learning Rate".]


In this example, the error goal of 0.1 was not met in 5000 cycles; the final network error
was 0.178. Note that this error is a root mean squared (RMS) error rather than a sum of squared
errors. The RMS error does not depend on the number of outputs or the number of
training patterns and is therefore more intuitive. It can be thought of as an average error
per output rather than the total error over all outputs and training patterns.
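
The relationship between the two error measures can be illustrated with a few lines. This sketch assumes that the RMS error is the square root of the SSE divided by the number of output terms (outputs times patterns); the exact formula used inside backprop() is not shown in this section, so treat this as one plausible definition. The error values are copied from the batch training example.

```matlab
e = [-0.4035 -0.2065 0.0904 -0.2876];     % errors: rows = outputs, columns = patterns
[outputs,patterns] = size(e);
SSE = sum(sum(e.^2))                      % total over all outputs and patterns
RMS = sqrt(SSE/(outputs*patterns))        % typical error per output term (assumed form)
```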

After we train a network we usually want to test it. To test the network, we define a test
set of input/output patterns of the function that was used to train the network. Since
some of these patterns were not used to train the network, this checks for generalization.
Generalization is the ability of a network to give the correct output for an input that was
not in the training set. Networks are not expected to generalize outside of the training
space and no confidence should be given to outputs generated from data outside of the
training data.

load weights8
x=[0:.1:10];
t=2*x - 0.21*x.^2;
output = W2*[ones(size(x));logistic(W1*[ones(size(x));x])];
plot(x,t,x,output)
title('Function Approximation Verification')
xlabel('Input')
ylabel('Output')

[Figure: "Function Approximation Verification". Target and network output versus input over the range 0 to 10.]


The network does a good job of learning the functional relationship between the inputs
and outputs. It also generalizes well. Training longer to a lower error goal would
improve the network performance.
8.6 Backpropagation Example
In this example a neural network is trained to recognize the letters of the alphabet. The
first 16 letters (A through P) are defined on a 5x7 template and are each stored as a vector
of length 35 in the data file letters.mat. The data file contains an input matrix x (size =
35 by 16) and a target matrix t (size = 4 by 16). The entries in a column of the t matrix
are a binary coding of the letter of the alphabet in the corresponding column of x. Each
column of the x matrix marks the filled-in boxes of the 5x7 template with ones. A
function letgph() displays the letter on the template. For example, to plot the letter A
which is the first letter of the alphabet and identified by t=[0 0 0 0], we load the data file
and plot the first column of x.

load('letters')
t(:,1)'
letgph(x(:,1));

ans =
0 0 0 0
[Figure: the letter A displayed on the 5x7 template.]


We will now train a neural network to identify a letter. The network has 35 inputs
corresponding to the boxes in the template and has 4 outputs corresponding to the binary
code of the letter. Since the outputs are binary, we can use a logistic output layer. The
function bprop2 has a logistic/logistic architecture. We will use a beginning learning rate
of 0.5, train to a maximum of 5000 cycles and a root mean squared error goal of 0.05.
This may take a few minutes.

load letters
W1=0.5*randn(10,36); % Initialize hidden layer weight matrix.
W2=0.5*randn(4,11); % Initialize output layer weight matrix.
[W1 W2 RMS]=bprop2(x,t,W1,W2,.05,.1,5000);
semilogy(RMS);
title('Backpropagation Training');
xlabel('Cycles');
ylabel('Root Mean Squared Error')

[Figure: "Backpropagation Training". Root mean squared error (log scale) versus training cycles.]


The trained network is saved in a file weights_l. We can now use the trained network to
identify an input letter. For example, presenting the network with several letters resulted
in:

load weight_l;
output0 = logistic(W2*[1;logistic(W1*[1;x(:,1)])])'
output1 = logistic(W2*[1;logistic(W1*[1;x(:,2)])])'
output7 = logistic(W2*[1;logistic(W1*[1;x(:,8)])])'
output12 = logistic(W2*[1;logistic(W1*[1;x(:,13)])])'

output0 =
0.0188 0.0595 0.0054 0.0443
output1 =
0.0239 0.0433 0.0607 0.9812
output7 =
0.0590 0.9872 0.9620 0.9536
output12 =
0.9775 0.9619 0.0531 0.0555

The network correctly identifies each of the test cases. You can verify this by comparing
each output with its binary equivalent. For example, output7 should be [0 1 1 1], which
is very close to its actual output. One may want to make the target outputs be in the
range of [.1 .9] because outputs of 0 and 1.0 are not obtainable and may force the weights
to very large values.
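
Comparing each output with its binary equivalent can also be done in code by thresholding the outputs at 0.5 and converting the resulting bits to a letter. The sketch below is an added illustration; it assumes the most significant bit comes first, which is consistent with the four outputs listed above, and the variable names bits, index, and letter are introduced here.

```matlab
output7 = [0.0590 0.9872 0.9620 0.9536];  % network output copied from above
bits   = round(output7);                  % threshold at 0.5
index  = bits*[8;4;2;1];                  % binary code, most significant bit first
letter = char('A'+index)                  % index 0 corresponds to 'A'
```

For this output the bits are [0 1 1 1], the index is 7, and the decoded letter is H, the eighth letter of the alphabet.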

A neural network's ability to generalize is found in its ability to give a correct response to
an input that was not in the training set. For example, a noisy input of an A may look
like:

a=[0 0 1 0 0 0 0 0 1 0 1 0 0 0 1 1 1 1 1 1 1 0 0 0 1 1 0 0 0 1 1 0 0 1 ...
1]';
letgph(a);

[Figure: the noisy letter A displayed on the 5x7 template.]


Presenting the network with this input results in:

OutputA = logistic(W2*[1;logistic(W1*[1;a])])'

OutputA =
0.0165 0.1135 0.0297 0.0161

This output is very close to the binary pattern [0 0 0 0] which designates an 'A'. This
shows the network's ability to generalize and, more specifically, its tolerance to noise.
Chapter 9 Competitive, Associative and Other Special Neural Networks
9.1 Hebbian Learning
Recall from Section 7.3 that Hebb's training rule states:

"When the synaptic input and the neuron output are both active, the strength of the
connection between the input and the output is enhanced."

There are several methods of implementing a Hebbian learning rule; a supervised form is
used in the perceptron learning rule of Chapter 7. This chapter explores the
implementation of unsupervised learning rules and begins with an implementation of
Hebb's rule. An unsupervised learning rule is one in which no target outputs are given.

If the output of the single layer network is active when the input is active, the weight
connecting the two active nodes is enhanced. This allows the network to associate
relationships between inputs and outputs, hence the name associative networks. The
simplest unsupervised Hebb rule is:

$$w_{new} = w_{old} + \beta x y$$

Where: $w_{AB}$ is the weight connecting input A to output B.
β is the learning constant
x is the input
y is the output

The constant β controls the rate at which the network learns. If β is made large, few
presentations are needed to learn an association and if β is made small, many
presentations are needed.

If the weights between active neurons are only allowed to be enhanced, as in the equation
above, there is no limit to their magnitude. Therefore, a rule that allows both learning
and forgetting should be implemented. Stephen Grossberg [1982] states the weight
change law as:

$$w_{new} = w_{old}(1-\alpha) + \beta x y$$


In this equation, α is the forgetting constant and controls the rate at which the memory of
old information is allowed to decay away or be forgotten. Using this Hebbian update
rule, the network constantly forgets old information and continually learns new
information. The values of β and α control the speed of learning and forgetting and are
usually set in the interval [0 1]. The update rule can be rewritten as:

$$\Delta W = -\alpha W + \beta x y^T$$


This rule limits the magnitude of the weights to a value determined by α and β. Solving
the above equation for ∆W=0, we find the maximum weight to be β/α when x and y are
active. For example, if the learning rate β is set to 0.9 and there is a small forgetting rate
α=0.1, then the maximum weight is $(\beta/\alpha)\,xy^T = 9xy^T$.
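
The bound can be seen numerically by iterating the update for a single weight while the input and output are held active. This is a minimal sketch for illustration; the scalar weight, the starting value of zero, and the 200 iterations are assumptions.

```matlab
a = 0.1;  b = 0.9;        % forgetting and learning constants
x = 1;  y = 1;            % input and output held active
w = 0;                    % assumed starting weight
for n = 1:200
  w = w + (-a*w + b*x*y); % weight change = -alpha*w + beta*x*y
end
w
limit = b/a
```

The weight climbs toward the limit of 9 and stays there, since at that value the learning and forgetting terms cancel.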

Suppose a four-input, single-neuron network, whose weight vector is initialized to [1 0 0 0]
(the first row of the identity matrix), is presented with an input vector x.

[Figure: single-neuron network. Inputs x1 through x4 pass through the weights to a summing neuron with output y.]


w=[1 0 0 0]; % Weight vector.
x=[1 0 0 1]'; % Input vector.

The inputs will be presented to the network and the weights will be updated with the
unsupervised Hebbian learning rule with β = 0.1 and α =0.1.

a=0.1; % Forgetting factor.
b=0.1; % Learning factor.
yout=thresh(w*x-eps) % Output.
del_w=-a*w+b*x'*yout; % Weight update.
w=w+del_w % New weight.

yout =
1
w =
1.0000 0 0 0.1000

This rule is implemented in a function called hebbian(x,w,a,b,cycles). This function is
called with cycles equal to the number of training iterations.

w=hebbian(x,w,a,b,100); % Train the network with a Hebbian learning
% rule for 100 cycles.
w % Trained weight vector.

w =
1.0000 0 0 1.0000

We can see that the weights are bounded by $(\beta/\alpha)\,xy^T = 1$. The network has learned to
associate an input of [1 0 0 1] with an active output. Now if a degraded version of the
input vector is input to the network, the network's output will still be active. For
example, if the input is [0 0 0 1], the output will still be high. The network has learned to
associate an $x_1$ or $x_4$ input with an active output.
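
The listing of hebbian() is not reproduced in this section, so the following is only a sketch of what such a training loop might look like, built from the single update step shown above. The hard threshold thresh_fn is an assumed stand-in for the supplement's thresh() function, which may be defined differently.

```matlab
thresh_fn = @(s) double(s > 0);   % assumed hard threshold; thresh() may differ
w = [1 0 0 0];  x = [1 0 0 1]';   % initial weights and input from the example
a = 0.1;  b = 0.1;                % forgetting and learning factors
for n = 1:100
  yout = thresh_fn(w*x-eps);      % output for the current weights
  w = w + (-a*w + b*x'*yout);     % Hebbian update with forgetting
end
w
y_degraded = thresh_fn(w*[0 0 0 1]'-eps)  % response to a degraded input
```

Under these assumptions the fourth weight approaches 1, and the degraded input [0 0 0 1] still produces an active output.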
9.2 Instar Learning
The Hebbian learning rule described above will continuously forget previous information
due to the weight decay structure. A more useful structure would allow the network to
forget only when it is learning. Since the network only learns when the output is active
(y = 1), the network should only forget when the output is active. If we make the
learning and forgetting rates equal, this Hebbian learning rule is called the Instar learning
rule.

$$w_{AB}^{new} = w_{AB}^{old}(1-\alpha y^T) + \beta x y^T = w_{AB}^{old} + \alpha\,(x - w_{AB}^{old})\,y^T$$


Rearranging:

$$w_{AB}^{new} = w_{AB}^{old}(1-\alpha y^T) + \alpha x_A y^T$$


The first term allows the network to forget when the output is active and the second term
allows the network to learn. This implementation causes the weight matrix to move
towards new inputs. If the weight vector and input vector are normalized, this weight
update rule is graphically represented as:

[Figure: "Instar Learning". The weight vector w(k) moves toward the input vector x(k), giving w(k+1).]


The value of α determines the rate at which the weight vector is drawn towards new input
vectors. The output of an Instar is the dot product of the weight vector and the input
vector. Therefore, the Instar has the ability to learn an input vector and the output is a
value corresponding to the degree that the input matches the weight vector. The Instar
learns patterns and classifies patterns.

The following is an example of how the Instar network can learn to remember and
recognize an input vector. Note that the weight vector and input vector are normalized
and that the dot product must be >0.9 for the identification of the input vector to be
positive.

w=rand(1,4) % Initialize weight vector randomly.
x=[1 0 0 1]'; % Input vector to be learned.
x=x/norm(x); % Normalize input vector.
w=w/norm(w); % Normalize weight vector.
yout1=thresh(w*x-.9) % Initial output.
a=0.8; % Learning and forgetting factor.
w=instar(x,w,a,10); % Train for 10 cycles.
w % Final weight vector.
yout2=thresh(w*x-.9) % Final output.

w =
0.2190 0.0470 0.6789 0.6793
yout1 =
0
w =
0.7071 0.0000 0.0000 0.7071
yout2 =
1

The network was able to learn a weight vector that would identify the desired input
vector. A criterion of 0.9 was used for identification.
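
The instar() function is not listed in this section. As a rough sketch of the update it presumably performs, the loop below applies the Instar rule directly, starting from the random weights printed above. Two assumptions are made and should be read as such: the output is held active (y = 1) while the pattern is being learned, and thresh_fn is a hard threshold standing in for thresh().

```matlab
thresh_fn = @(s) double(s > 0);   % assumed hard threshold; thresh() may differ
x = [1 0 0 1]';  x = x/norm(x);   % normalized input vector
w = [0.2190 0.0470 0.6789 0.6793];  w = w/norm(w);  % starting weights from the run above
a = 0.8;                          % learning and forgetting factor
yout1 = thresh_fn(w*x-.9)         % initial output
for n = 1:10
  y = 1;                          % output held active during learning (assumption)
  w = w + a*(x'-w)*y;             % Instar update: move w toward x
end
w
yout2 = thresh_fn(w*x-.9)         % final output
```

Each cycle removes 80 percent of the remaining distance between w and x, so after ten cycles the weight vector is essentially equal to the normalized input.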
9.3 Outstar Learning
The Instar network learns to identify input vectors and its dual network, called an Outstar
network, can store and recall a vector. Putting these two network architectures together
results in an associative memory.

The Outstar network also uses a Hebbian learning rule. Again, when the input and output
are both active, the weight connecting the two is increased. The Outstar network is
trained by applying an active signal at the input while applying the vector to be stored at
the output. This is a type of supervised network since the desired output is applied to the
output. After training, when the input is made active, the output will be the stored vector.
[Figure: "Outstar Network". A single input x connects through the weights to four outputs y1 through y4.]


For example, a network can be trained to store a vector v = [0 1 1 0]. After training,
when the input is a 1, the output will be the learned vector; if the input is a 0, the output
will be a 0. The Outstar uses a Hebbian learning rule similar to that of the Instar. If the
learning and forgetting terms are equal, the Outstar learning rule for this simple recall
network is:

$$\Delta w = -\alpha w x + \alpha v x = \alpha (v - w) x$$


The Outstar function, outstar(v,x,w,a), is used where:
v is the vector to be learned
x is the input vector
w is the weight matrix of size(outputs,inputs)
a is the learning rate.

v = [0 1 1 0]; % Vector to be learned.
x=1; % Input.
w=rand(1,4) % Initialize weights randomly.
a=0.9; % Learning rate.
w=outstar(v,x,w,a,10); % Train the network for 10 cycles.
yout=w*x % The trained network output.

w =
0.9347 0.3835 0.5194 0.8310
yout =
0.0000 1.0000 1.0000 0.0000
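
The learning rule given above can be applied directly in a loop. The following is a
minimal sketch of that rule only; the supplement's outstar() function may differ in
detail, and the starting weights are simply copied from the printed run.

```matlab
% Sketch of Outstar training (rule from the text: dw = a*(v - w)*x).
v = [0 1 1 0];                         % Vector to be stored.
x = 1;                                 % Active input during training.
w = [0.9347 0.3835 0.5194 0.8310];     % Initial weights from the run above.
a = 0.9;                               % Learning rate.
for cycle = 1:10
    w = w + a*(v - w)*x;               % Pull the weights toward v.
end
yout_on  = w*1;                        % Recall with the input active.
yout_off = w*0;                        % Recall with the input inactive.
```

With the input active the recalled output matches the stored vector to within
rounding, and with the input at zero the output is zero, as the text describes.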

This algorithm can generalize to higher dimensional inputs. The general Outstar
architecture is now:
[Figure: Outstar Network. Three inputs x1 through x3 connect through weights to four summing output nodes y1 through y4.]


If we want a network to output a certain vector depending on which input node is active,
we will need a weight matrix rather than a weight vector. Suppose we want to learn three
vectors of four terms.

v1=[0 -1 0 0]'; % First vector.
v2=[1 0 -1 0]'; % Second vector.
v3=[0 1 0 0]'; % Third vector.
v=[v1 v2 v3];
w=rand(4,3); % Random initial weight matrix.
w=outstar(v,x,w,a,10); % Train the network for 10 cycles.
x=[1 0 0
0 1 0
0 0 1]; % Three input vectors.
yout=w*x % Three output vectors.

yout =
0.0000 1.0000 0.0000
-1.0000 0.0000 1.0000
0.0000 -1.0000 0.0000
0.0000 0.0000 0.0000

The network learned to recall the correct vector for each of the three inputs. Although
this implementation uses a supervised paradigm, the implementation could be presented
in an unsupervised form. The unsupervised form uses an initial identity weight matrix, a
learning rate equal to one, and only one pass of training. This method embeds the input
matrix into the weight matrix in one pass but may not be useful when there are several
patterns in the data set that are noisy. The one pass method may not generalize well from
noisy or incomplete data.
9.4 Crossbar Structure
An associative network architecture with a crossbar structure is termed a bi-directional
associative memory (BAM) by its developer, Bart Kosko [1988]. This methodology is
really a matrix solution to an associative memory problem. The BAM does not undergo
training as do most neural network architectures.

As an example of its implementation, consider three vector pairs (a,b).

a1=[1 -1 -1 -1 -1 1]';
a2=[-1 1 -1 -1 1 -1]';
a3=[-1 -1 1 -1 -1 1]';
b1=[-1 1 -1]';
b2=[1 -1 -1]';
b3=[-1 -1 1]';

Each vector a is associated with a vector b, and either vector can be the input or the
output. Note that the terms of the vectors in a BAM must be ± 1. Three correlation
matrices are formed by multiplying the vectors using the equation:

$$M_1 = A_1 B_1^T$$

m1=a1*b1';
m2=a2*b2';
m3=a3*b3'; % Three correlation matrices.

The three correlation matrices are then added to get a master weight matrix:

$$M = M_1 + M_2 + M_3$$

m=m1+m2+m3 % Master weight matrix.

m =
-1 3 -1
3 -1 -1
-1 -1 3
1 1 1
3 -1 -1
-3 1 1

The master weight matrix can now be used to get the vector associated with any input
vector. This matrix can perform transformations in either direction. The resulting vector
must be limited to the [1 -1] range. This is done using the signum() function.

$$A_i = M B_i \quad \text{or} \quad B_i = M^T A_i$$

For example:

A1=signum(m*b1) % Recall a1 from b1.
B2=signum(m'*a2) % Recall b2 from a2.

A1 =
1
-1
-1
-1
-1
1
B2 =
1
-1
-1

We can see that the BAM was able to recall the associations stored in the master matrix
memory. A discussion of the capacity and efficiency of the BAM network is given in the
text.
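
The whole construction can be checked in a few lines. The following sketch stacks the
three pairs as matrix columns, forms the master matrix as a single product (which equals
the sum of the three outer products), and recalls every pair in both directions. It uses
MATLAB's built-in sign() in place of the supplement's signum(); the variable names are
chosen here for illustration.

```matlab
% Sketch: build the BAM weight matrix in one product and check every pair.
A = [ 1 -1 -1 -1 -1  1;
     -1  1 -1 -1  1 -1;
     -1 -1  1 -1 -1  1]';              % Columns are a1, a2, a3.
B = [-1  1 -1;
      1 -1 -1;
     -1 -1  1]';                       % Columns are b1, b2, b3.
M = A*B';                              % Same as a1*b1' + a2*b2' + a3*b3'.
A_recalled = sign(M*B);                % Recall each a from its b.
B_recalled = sign(M'*A);               % Recall each b from its a.
all_recalled = isequal(A_recalled, A) && isequal(B_recalled, B);
```

For these three pairs all six recalls reproduce the stored vectors. Note that sign()
returns 0 for a zero sum, which does not occur here but would need handling in general.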
9.5 Competitive Networks
Artificial neural networks that use competitive learning have only one output node
activated at a time. The output nodes compete to be the one that is active; this is
sometimes called a winner-take-all algorithm.

$$y_i = \sum_l w_{il} x_l = w_i \cdot x$$

where: i = output node index
l = input node index

The weights between the input and output nodes ($w_{il}$) are initially chosen as small
random values and are continuously normalized. When training commences, each input
vector ($x$) searches out the weight vector ($w_{i^*}$) that is closest to it.

$$w_{i^*} \cdot x \ge w_i \cdot x$$

The closest weight vector is then updated to make it closer to the input vector. The
amount that the weight vector is changed is determined by the learning rate η.

$$\Delta w_{i^* j} = \eta \left( \frac{x_j}{\sum_j x_j} - w_{i^* j} \right)$$

Training involves the repetitive application of input vectors to the network in a random
order or in turn. The weights are continually updated with each application. This
process could continue changing the weight vectors forever; therefore, the learning rate η
is reduced as training progresses until it eventually is negligible and the weight changes
cease. This results in weight vectors (+) centered in clusters of input vectors (*) as
shown in the following figure.
[Figure: Competitive Network. Two panels, Before Training and After Training, showing input vectors (*) and weight vectors (+).]

Generally, competitive learning networks are single-layered but there are several variants
that are multi-layered. Some are briefly described by Hertz, Krogh, and Palmer [1991]
but none will be discussed here. As is intuitively obvious, competitive networks perform
clustering. They extract similarities from the input vectors and group or categorize them
into specific clusters. These similar input vectors fire the same output node. They find
uses in vector quantization problems such as data compression.
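
The training loop described above can be sketched as follows. This is a simplified
illustration, not the supplement's routine: it picks the winner by Euclidean distance
instead of the normalized dot product, omits the input normalization term, and uses a
hypothetical geometric decay of the learning rate. The data and starting weights are
made up for the example.

```matlab
% Sketch of winner-take-all training with a decaying learning rate.
x = [0 0; 0 1; 1 0; 10 10; 10 11; 11 10]';   % Six inputs in two clusters (columns).
w = [1 1; 9 9];                              % One weight vector per row.
eta = 0.5;                                   % Initial learning rate.
for cycle = 1:20
    for k = 1:size(x,2)
        d = sum((w - repmat(x(:,k)', size(w,1), 1)).^2, 2);  % Squared distances.
        [dmin, win] = min(d);                % Index of the closest weight vector.
        w(win,:) = w(win,:) + eta*(x(:,k)' - w(win,:));      % Update the winner only.
    end
    eta = 0.8*eta;                           % Reduce the learning rate each cycle.
end
```

Because only the winner moves, and it moves part of the way toward the presented input,
each weight vector stays within the cluster it started nearest to and settles there as
the learning rate decays.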
9.5.1 Competitive Network Implementation
A competitive network is a single layer network which only has one output activate at a
time. This output corresponds to the neuron whose weight vector is closest to the input
vector.

[Figure: Competitive Network. Three inputs x1 through x3 connect through weights to four summing nodes, whose outputs y1 through y4 pass through a competition block (C).]

Suppose we have three weight vectors already stored in a competitive network. The
output of the network to an input vector is found with the following.

w=[1 1; 1 -1; -1 1]; % Cluster centers.
x=[1;3]; % Input to be classified.
y=compete(x,w); % Classify.
clg
plot(w(:,1),w(:,2),'*')
hold on
plot(x(1),x(2),'+')
hold on
plot(w(find(y==1),1),w(find(y==1),2),'o')
title('Competitive Network Clustering')
xlabel('Input 1');ylabel('Input2')
axis([-1.5 1.5 -1.5 3.5])

[Figure: Competitive Network Clustering. Axes: Input 1, Input 2.]

The above figure displays the cluster centers (*), shows the input vector (+), and shows
that cluster 1 won the competition (o). Cluster center 1 is circled and is closest to the
input vector labeled +. The next example shows how a competitive network is trained. It
uses the instar learning rule to learn the cluster centers and since only one output is active
at a time, only the winning weight vector is updated during each presentation.
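
The competition step itself is small enough to write out. The sketch below assumes the
winner is the cluster center with the smallest Euclidean distance to the input; whether
the supplement's compete() uses distance or a dot product is not stated here, so treat
the distance measure as an assumption.

```matlab
% Sketch of the competition step: one-hot output for the closest weight vector.
w = [1 1; 1 -1; -1 1];                 % Cluster centers (one per row).
x = [1; 3];                            % Input to be classified.
d = sum((w - repmat(x', size(w,1), 1)).^2, 2);   % Squared distance to each center.
[dmin, win] = min(d);                  % Winning cluster index.
y = zeros(size(w,1), 1);
y(win) = 1;                            % Only the winner is active.
```

For the input [1;3] the squared distances are 4, 16, and 8, so cluster 1 wins, which
agrees with the figure discussed above.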

Suppose there are 11 input vectors (dimension=2) that we want to group into 3 clusters.
The weight matrix is randomly initialized and the network is trained for 20 presentations
of the 11 inputs.

x=[-1 5;-1.2 6;-1 5.5;3 1;4 2;9.5 3.3;-1.1 5;9 2.7;8 3.7;5 1.1;5 1.2]';
clg
plot(x(1,:),x(2,:),'*');title('Training Data'),xlabel('Input 1'),
ylabel('Input 2');

[Figure: Training Data. Axes: Input 1, Input 2.]

This data is sent to a competitive network for training. It is specified a priori that the
network will have 3 clusters.

w=rand(3,2);
a=0.8; % Learning and forgetting factor.
w=trn_cmpt(x,w,a,20); % Train for 20 cycles.
plot(x(1,:),x(2,:),'k*');title('Training Data'),xlabel('Input 1'),
ylabel('Input 2');hold on
plot(w(:,1),w(:,2),'o');hold off

[Figure: Training Data with the trained weight vectors (o). Axes: Input 1, Input 2.]


Note that the first weight vector never moves towards the group of data centered around
(10,3). This neuron is called a "dead neuron" since its output is never activated. One
method of dealing with dead neurons is to give them an extra chance of winning a
competition, that is, an increased bias towards winning. To implement this, we will add
a bias to each of the competitive neurons. The bias is increased when the neuron does
not win and decreased when the neuron does win. The function competb() will evaluate a
competitive network with a bias.
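
The effect of the bias can be seen in isolation with a toy case. The sketch below is
hypothetical: it holds the weights fixed, scores each neuron as the negative distance
plus its bias, and adjusts the biases by a made-up step of 10 per presentation. It is
not the supplement's competb() or trn_cptb().

```matlab
% Sketch of a bias that lets a never-winning neuron win (weights held fixed).
w = [0 0; 100 100];                    % Neuron 2 starts far from the data.
x = [1; 1];                            % Repeatedly presented input.
b = [0; 0];                            % Biases, one per neuron.
step = 10;                             % Hypothetical bias adjustment per presentation.
first_win_2 = 0;                       % Presentation at which neuron 2 first wins.
for k = 1:20
    d = sqrt(sum((w - repmat(x', 2, 1)).^2, 2));  % Distance to each neuron.
    [smax, win] = max(-d + b);         % Biased competition.
    if win == 2 && first_win_2 == 0
        first_win_2 = k;
    end
    b = b + step;                      % Raise every bias ...
    b(win) = b(win) - 2*step;          % ... so the winner ends one step lower.
end
```

Neuron 2 is about 140 units from the input and neuron 1 about 1.4, so the bias gap must
grow past roughly 138.6; at 20 units per presentation that takes seven presentations,
and neuron 2 first wins on the eighth.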

dimension=2; % Input space dimension.
clusters=3; % Number of clusters to identify.
w=rand(clusters,dimension); % Initialize weights.
b=.1*ones(clusters,1); % Initialize biases.
a=0.8; % Learning and forgetting factor.
cycles=5; % Iteratively train for 5 cycles.
w=trn_cptb(x,w,b,a,cycles); % Train network.
clg
plot(x(1,:),x(2,:),'*');title('Training Data'),xlabel('Input 1'),
ylabel('Input 2');hold on
plot(w(:,1),w(:,2),'o');hold off

[Figure: Training Data with the weight vectors (o) after training with biases. Axes: Input 1, Input 2.]


We can see that the dead neuron has come alive. It has centered itself in the third cluster.

One issue that has not yet been discussed in detail is the selection of the learning rate. A
large learning rate allows the network to learn fast but reduces its stability and leads to
oscillatory behavior. An adaptive learning rate can be used that allows the network to
initially learn fast and then slower as training progresses.
9.5.2 Self Organizing Feature Maps
The self-organizing feature map, or Kohonen network [1984], maps a high-dimensional
input vector onto a lower-dimensional pattern; usually this pattern is of dimension one
or two. The conventional two dimensional feature map architecture is shown below.
[Figure: Self Organizing Feature Map. The inputs x1 through xn (vector X) connect to a two-dimensional grid of output neurons.]

In a feature map, the geometrical arrangement or location of the outputs contains
information about the input vectors. If the input vectors $X_1$ and $X_2$ are fairly similar,
their outputs should be located close together; and if $X_1$ and $X_2$ are quite similar, then
their outputs should be equal. This relationship can be realized by one of several
different learning algorithms. It can be realized by using ordinary competitive learning
with lateral connection weights in the output layer that excite nearby nodes and inhibit
nodes that are farther away. It can also be realized by ordinary competitive learning
where the weights of nearby neighbors are allowed to update along with the weights of
the winning node. This realization is termed Kohonen's algorithm. Kohonen's algorithm
clusters data in a way that preserves the topology of the inputs, thus geometrically
revealing the similarities of the inputs. As in competitive learning, similarity is
measured by the Euclidean distance between vectors.

The Kohonen learning algorithm is:

$$\Delta w_i = \alpha (x - w_i) \quad \text{for } w_i \text{ near } x$$

The weight updates are only performed for the winning neuron and its neighbors ($w_i$ near
$x$). For a 4x4, 2-dimensional feature map, the neurons nearest to the winning neuron are
also updated. These neurons are the gray shaded neurons in the above figure. There may
be a reduced update for neurons as a function of their distance from the winning neuron.
This type of function is commonly referred to as a "Mexican hat" function.

mexhat;

[Figure: Mexican Hat Function (surface plot).]


Suppose we have a two input network that is to organize the data into a one dimensional
feature map of length 5.
[Figure: One Dimensional Feature Map. Two inputs x1 and x2 connect to a row of five output neurons numbered 1 through 5.]

In the following example, we will organize 18 input pairs (this may take a long time).

x=[1 2;8 9;7 8;6 6;2 3;7 7;2 2;5 4;3 3;8 7;4 4;7 6;1 3;4 5;8 8;5 5; ...
   6 7;9 9]';
plot(x(1,:),x(2,:),'*');title('Training Data'),xlabel('Input 1'),
ylabel('Input 2');
b=.1*ones(5,1); % Small initial biases.
w=rand(5,2); % Initial random weights.
tp=[0.7 20]; % Learning rate and maximum training iterations.
[w,b]=kohonen(x,w,b,tp);% Train the self-organizing map.
ind=zeros(1,18);
for j=1:18
y=compete(x(:,j),w);
ind(j)=find(y==1);
end
[x;ind]

ans =
Columns 1 through 12
1 8 7 6 2 7 2 5 3 8 4 7
2 9 8 6 3 7 2 4 3 7 4 6
1 5 4 3 1 4 1 2 1 4 2 4
Columns 13 through 18
1 4 8 5 6 9
3 5 8 5 7 9
1 2 5 3 4 5
[Figure: Training Data. Axes: Input 1, Input 2.]


To observe the order of the classifications, we will use different symbols to plot the
classifications.

clg
plot(w(:,1),w(:,2),'+')
hold on
plot(w(:,1),w(:,2),'-')
hold on
index=find(ind==1);
plot(x(1,index),x(2,index),'*')
hold on
index=find(ind==2);
plot(x(1,index),x(2,index),'o')
hold on
index=find(ind==3);
plot(x(1,index),x(2,index),'*')
hold on
index=find(ind==4);
plot(x(1,index),x(2,index),'o')
hold on
index=find(ind==5);
plot(x(1,index),x(2,index),'*')
hold on
axis([0 10 0 10]);
xlabel('Input 1'), ylabel('Input 2');
title('Self Organizing Map Output');
hold off

[Figure: Self Organizing Map Output. Axes: Input 1, Input 2.]


It is apparent that the network not only clustered the data but also organized it so
that data points near each other were placed in adjacent clusters. The use of
self-organizing maps preserves the topology of the input vectors.
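
A single Kohonen update on a one-dimensional map can be sketched as follows. The
neighborhood used here (full update for the winner, half-strength update for the two
adjacent neurons, none elsewhere) is a hypothetical stand-in for the neighborhood
function, and the weights and input are made up; this is not the supplement's kohonen()
routine.

```matlab
% Sketch of one Kohonen update on a one-dimensional map of five neurons.
w = [0 0; 1 1; 2 2; 3 3; 4 4];         % Weights, one map position per row.
x = [1.2; 0.9];                        % Input vector.
alpha = 0.5;                           % Learning rate.
h_neighbor = 0.5;                      % Hypothetical reduced update for neighbors.
d = sum((w - repmat(x', size(w,1), 1)).^2, 2);
[dmin, win] = min(d);                  % Winning neuron.
for i = 1:size(w,1)
    if i == win
        h = 1;                         % Full update for the winner.
    elseif abs(i - win) == 1
        h = h_neighbor;                % Reduced update for adjacent neurons.
    else
        h = 0;                         % Distant neurons are unchanged.
    end
    w(i,:) = w(i,:) + alpha*h*(x' - w(i,:));
end
```

Neuron 2 wins and moves to (1.1, 0.95), its neighbors move to (0.3, 0.225) and
(1.8, 1.725), and neurons 4 and 5 stay put. Updating neighbors along with the winner is
what pulls adjacent map positions toward adjacent regions of the input space.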
9.6 Probabilistic Neural Networks
The Probabilistic Neural Network (PNN) is a Bayesian Classifier put into a neural
network architecture. This network is well described in Fuzzy and Neural Approaches in
Engineering, by Lefteri H. Tsoukalas and Robert E. Uhrig. Timothy Masters has written
two books that give good discussions of PNNs: Practical Neural Network Recipes in
C++ [1993a] and Advanced Algorithms for Neural Networks [1993b]. Both of these
books contain disks with C++ versions of PNN code.

A PNN is a classifier. Although it can be used as a function approximator, this task is
better performed by other iteratively trained neural network architectures or by the
Generalized Regression Neural Network of Section 9.8.

One of the most common classifiers is the Nearest Neighbor classifier. This classifier
classifies a pattern to be in the same class as its nearest neighbor, or more generally, its k
nearest neighbors. A drawback of the Nearest Neighbor classifier is that sometimes the
nearest neighbor may be an outlier from another class. In the figure below, the input
marked as a ? would be classified as a o, even though it is in the middle of many x's.
This is a weakness of the nearest neighbor classifier.
[Figure: Classification Problem. A scatter of x's and o's with an unknown input marked ?.]

The principal advantage of the PNN over other NN architectures is its speed of learning.
Its weights are not trained through an iterative process; they are simply stored during
what is commonly called the learning process. A second advantage is that the PNN has a
solid theoretical foundation for making confidence estimates.

There are several major disadvantages to the PNN. First, the PNN must store all training
patterns. This requires large amounts of memory. Secondly, during recall, all the
training patterns must be processed. This requires a lengthy recall period. Also, the PNN
requires a large representative training set for proper operation. Lastly, the PNN requires
the proper choice of a width parameter called sigma. There are routines to choose this
parameter in an optimal manner, but this is an iterative and sometimes lengthy procedure
(see Masters 1993a). The width parameter may be different for each population (class),
but the implementation presented here will use a single width parameter.

In summary, the Probabilistic Neural Network should be used only for classification
problems where there is a representative training set. It can be trained quickly but has
slow recall and is memory intensive. It has solid underlying theory and can produce
confidence intervals. This network is simply a Bayesian Classifier put into a neural
network architecture. The estimator of the probability density function uses the gaussian
weighting function:

$$g(x) = \frac{1}{n} \sum_{i=1}^{n} e^{-\frac{\| x - x_i \|^2}{2\sigma^2}}$$

Where:
n is the number of cases in a class
$x_i$ is a specific case in a class
x is the input
σ is the width parameter

This formula simply estimates the probability density function (PDF) as an average of
separate multivariate normal distributions. This function is used to calculate the
probability density function for each class. For example, if the training data consisted of
three classes with populations of 23, 12, and 18, the above formula would be used to
estimate the PDF for each of the three classes with n = 23, 12, and 18.

A simple PNN will now be implemented that classifies an input as one of two classes.
The training data consists of two classes of data with four vectors in each class. A test
data point will be used to verify correct operation.

x=[-3 -2;-3 -3;-2 -2;-2 -3;3 2;3 3;2 2;2 3]; % Training data
y=[1 1 1 1 2 2 2 2]'; % Classifications of training data
xtest=[-.5 1.5]; % Vector to be classified.
plot(x(:,1),x(:,2),'*');hold;plot(xtest(1),xtest(2),'o');
title('Probabilistic Neural Network Data')
axis([-4 4 -4 4]); xlabel('Input1');ylabel('Input2');hold off;

Current plot held
[Figure: Probabilistic Neural Network Data. Axes: Input1, Input2.]


The test data point can be classified by a PNN.

a=3; % a is the width parameter: sigma.
classes=2; % x has two classifications.
[class,prob]=pnn(x,y,classes,xtest,a) % Classify the test input: xtest.

class =
2
prob =
0.1025 0.3114

This function properly classified the input vector as class 2 and output a measure of
membership (0.3114). As a final example, we will use an input vector x=[-2.5 -2.5].

x=[-3 -2;-3 -3;-2 -2;-2 -3;3 2;3 3;2 2;2 3]; % Training data
y=[1 1 1 1 2 2 2 2]'; % Classifications of training data
xtest=[-2.5 -2.5]; % Vector to be classified.
plot(x(:,1),x(:,2),'*');hold;plot(xtest(1),xtest(2),'o');
title('Probabilistic Neural Network Data')
axis([-4 4 -4 4]); xlabel('Input1');ylabel('Input2');hold off
a=3; % a is the width parameter: sigma.
classes=2; % x has two classifications.
[class,prob]=pnn(x,y,classes,xtest,a) % Classify the test input.

Current plot held
class =
1
prob =
0.9460 0.0037
[Figure: Probabilistic Neural Network Data. Axes: Input1, Input2.]


The PNN properly classified the input vector to class 1. The PNN also outputs a number
related to the membership of the input to each class (0.946 0.004). These numbers can be
used as confidence values for the classification.
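
The classification rule can be sketched directly from the density estimate: average a
Gaussian kernel over the training cases of each class and pick the class with the larger
average. The sketch below is an illustration under assumptions, not the supplement's
pnn() function. In particular the kernel scaling, exp(-d^2/(2*sigma^2)), is assumed, so
the membership values it produces should not be expected to match the printed ones.

```matlab
% Sketch of a two-class PNN with a Gaussian kernel (kernel scaling is assumed).
x = [-3 -2; -3 -3; -2 -2; -2 -3; 3 2; 3 3; 2 2; 2 3];   % Training data.
y = [1 1 1 1 2 2 2 2]';                % Class labels.
sigma = 3;                             % Width parameter.
tests = [-0.5 1.5; -2.5 -2.5];         % Two inputs to classify (rows).
classes = 2;
cls = zeros(size(tests,1), 1);
prob = zeros(size(tests,1), classes);
for t = 1:size(tests,1)
    d2 = sum((x - repmat(tests(t,:), size(x,1), 1)).^2, 2);  % Squared distances.
    k = exp(-d2/(2*sigma^2));          % Kernel value for every training case.
    for c = 1:classes
        prob(t,c) = mean(k(y == c));   % Average kernel over the class.
    end
    [pmax, cls(t)] = max(prob(t,:));   % Pick the class with the larger estimate.
end
```

Under this assumed kernel the two test points are assigned to class 2 and class 1
respectively, the same decisions as in the runs above.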

9.7 Radial Basis Function Networks
A Radial Basis Function (RBF) network has been proven to be a universal function
approximator [Park and Sandberg 1991]. Therefore, it can perform function mappings
similar to those of an MLP, but its architecture and functionality are very different. We
will first examine the RBF architecture and then examine the differences between it and
the MLP that arise from this architecture.

[Figure: Radial Basis Function Network. Receptive fields in the input space activate the hidden nodes, which connect through weights W, together with a bias, to the output layer.]

An RBF network is a two layer network that has different types of neurons in the hidden
layer and the output layer. The hidden layer, which corresponds to an MLP hidden layer,
is a non-linear, local mapping. This layer contains radial basis function neurons which
most commonly use a Gaussian activation function (g(x)). These functions are centered
over receptive fields. Receptive fields are areas in the input space which activate the
local radial basis neurons.

$$g_j(x) = \exp\left( -\| x - u_j \|^2 / \sigma_j^2 \right)$$

Where:
$x$ is the input vector.
$u_j$ is the center of a region called a receptive field.
$\sigma_j$ is the width of the receptive field.
$g_j(x)$ is the output of the jth neuron.

The output layer is a layer of standard linear neurons and performs a linear
transformation of the hidden node outputs. This layer is equivalent to a linear output
layer in an MLP, but the weights are usually solved for using a least squares algorithm
rather than trained using backpropagation. The output layer may, or may not, contain
biases; the examples in this supplement do not use biases.

Receptive fields center on areas of the input space where input vectors lie, and serve to
cluster similar input vectors. If an input vector (x) lies near the center of a receptive field
(u), then that hidden node will be activated. If an input vector lies between two receptive
field centers, but inside the receptive field width (σ), then both hidden nodes will be
partially activated. When an input vector lies far from all receptive fields, there is no
hidden layer activation and the RBF output is equal to the output layer bias values.

An RBF is a local network that is trained in a supervised manner. This contrasts with an
MLP network, which is a global network. The distinction between local and global is
made through the extent of the input surface covered by the function approximation. An MLP
performs a global mapping, meaning all inputs cause an output, while an RBF performs a
local mapping, meaning only inputs near a receptive field produce an activation.

[Figure: Global Mapping and Local Mapping panels, each showing two classes of points (* and o) and unknown inputs (?).]


The ability to recognize whether an input is near the training set or if it is in an untrained
region of the input space gives the RBF a significant benefit over the standard MLP. It
can give a "don't know" output. Since networks generalize improperly and arbitrarily
when operating in regions outside the training area, no confidence should be given to
their outputs in those regions. When using an MLP, one cannot judge whether or not the
input vector comes from these untrained regions; and therefore, one cannot judge whether
the output contains significant information. On the other hand, an RBF can tell the user
if the network is operating outside its training region and the user will know when to
disregard the output. This ability makes the RBF the network of choice for safety critical
applications or for applications that have a high financial impact.
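
One simple way to produce a "don't know" signal is to look at the largest hidden layer
activation for each input. The sketch below assumes the Gaussian form
exp(-(x-u)^2/width^2) from the equation above, a width of 4, and a hypothetical
activation threshold of 0.1; none of these choices comes from the supplement's code.

```matlab
% Sketch: flag inputs that activate no radial basis neuron.
centers = [-8 -5 -2 0 2 5 8];          % Receptive field centers.
width = 4;                             % Receptive field width (assumed value).
threshold = 0.1;                       % Hypothetical minimum activation.
test_x = [0; 9; 30];                   % Inputs inside, near, and far from the centers.
known = false(size(test_x));
for k = 1:numel(test_x)
    g = exp(-(test_x(k) - centers).^2 / width^2);   % Gaussian activations.
    known(k) = max(g) >= threshold;    % True if some neuron is active enough.
end
```

The inputs 0 and 9 each activate at least one neuron strongly, while the input 30 lies
so far from every center that all activations are negligible and it is flagged as
outside the trained region.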

The radial basis function y=gaussian(x,w,a) is given above and is implemented in an m-
file where
x is the input vector
w is the center of the receptive field
a is the width of the receptive field
y is the output value

x=[-3:.1:3]'; % Input space.
y=gaussian(x,0,1); % Radial basis function centered at 0
plot(x,y); % with a width of 1.
grid;
xlabel('input');ylabel('output')
title('Radial Basis Neuron')

[Figure: Radial Basis Neuron. Axes: input, output.]


From the above figure, we can see that the output of the Gaussian function, evaluated at
the width parameter, equals about one half. We also note that the function has an
approximate zero output at distances of 2.5 times the width parameter. This figure shows
the range of coverage of the gaussian activation function.

Designing an RBF neural network requires the selection of the radial basis function width
parameter. This decision is not required for an MLP. The width should be chosen so that
the receptive fields overlap but so that one function does not cover the entire input space.
This means that several radial basis neurons have some activation for each input, but
not all radial basis neurons are highly active for a single input.

Another choice to be made is the number of radial basis neurons. Depending on the
training algorithm used to implement the RBF, this may, or may not, be a decision made
by the designer. For example, the MATLAB Neural Network Toolbox has two training
algorithms. The first algorithm centers a radial basis neuron on each input vector. This
leads to an extremely large network for input data composed of many patterns. The
second algorithm incrementally adds radial basis neurons to reduce the training error to
the preset goal.

There are several network architectures that will meet a specified error criterion. These
architectures consist of different combinations of the radial basis function widths and the
number of radial basis functions in the network. The following figure roughly shows the
allowable combinations that may solve an example problem.

[Figure: Possible Combinations. Axes: Number of Neurons and Neuron Width, with min and max limits marked.]


The maximum number of neurons is the number of input patterns; the minimum is related
to the error tolerance and the complexity of the mapping. This minimum must be
experimentally determined. A more complex map and a smaller tolerance require more
neurons. The minimum width constant should overlap the input patterns and the
maximum should not cover the entire input space. Excessively large widths can
sometimes give good results for data with no noise, but these systems usually fail under
real world conditions in which noise exists. The reason that the system can train well
with noise free cases is that a linear method is used to solve for the second layer weights.
The use of a regression method will minimize the error, but usually at the expense of
large weights and significant overfitting. This overfitting is apparent when there is noise
in the system. A smaller width will do a better job of alerting that an input vector is
outside the training space, while a larger width may result in a network of smaller size
and faster execution.
9.7.1 Radial Basis Function Example
As an example of the implementation of an RBF network, a function approximation over
an interval will be used.

x=[-10:1:10]'; % Inputs.
y=.05*x.^3-.2*x.^2-3*x+20; % Target outputs.
plot(x,y);
xlabel('Input');
ylabel('Output');
title('Function to be Approximated')

[Figure: Function to be Approximated. Axes: Input, Output.]


A radial basis function width of 4 will be used and the centers will be placed at [-8 -5 -2 0
2 5 8]. Most training routines will have an algorithm for determining the placement of
the radial basis neurons, but for this example, they will simply be placed to cover the
input space.

width=6; % Radial basis function width.
w1=[-8 -5 -2 0 2 5 8]; % Center of the receptive fields.
num_w1=length(w1); % The number of receptive fields.
a1=gaussian(x,w1,width); % Hidden layer outputs.
w2=inv(a1'*a1)*a1'*y; % Pseudo inverse to solve for output weights.

The network can now be tested both outside the training region and inside the training
region.

test_x=[-15:.2:15]'; % Test inputs.
y_target=.05*test_x.^3-.2*test_x.^2-3*test_x+20; % Test outputs.
test_a1=gaussian(test_x,w1,width); % Hidden layer outputs.
yout=test_a1*w2; % Network outputs.
plot(test_x,[y_target yout]);
title('Testing the RBF Network'); xlabel('Input'); ylabel('Output')

[Figure: Testing the RBF Network. Axes: Input, Output.]


The network generalizes very well in the training region but poorly outside the training
region. As the inputs get far from the training region, the radial basis neurons are not
active. This would alert the operator that the network is trying to operate outside the
training space and that no confidence should be given to the output value.

To show the tradeoffs between the size of the neuron width and the number of neurons in
the network, we will investigate two other cases. In the first case, the width will be made
so small that there is no overlap and the number of neurons will be equal to the number
of inputs. In the second case, the spread constant will be made very large and some noise
will be added to the data.
9.7.2 Small Neuron Width Example
The radial basis function width will be set to 0.2 so that there is no overlap between the
neurons.

x=[-10:1:10]'; % Inputs.
y=.05*x.^3-.2*x.^2-3*x+20; % Target outputs.
width=.2; % Radial basis function width.
w1=x'; % Center of the receptive fields.
a1=gaussian(x,w1,width); % Hidden layer outputs.
w2=inv(a1'*a1)*a1'*y; % Solve for output weights.
test_x=[-15:.2:15]'; % Test inputs.
y_target=.05*test_x.^3-.2*test_x.^2-3*test_x+20; % Test outputs.
test_a1=gaussian(test_x,w1,width); % Hidden layer outputs.
yout=test_a1*w2; % Network outputs.
plot(test_x,[y_target yout]);
title('Testing the RBF Network');xlabel('Input');ylabel('Output')

[Figure: Testing the RBF Network. Axes: Input, Output.]


The above figure shows that the width parameter is too small and that there is poor
generalization inside the training space. For proper overlap, the width parameter needs to
be at least equal to the distance between input patterns.
9.7.3 Large Neuron Width Example
A very large radial basis function width equal to 200 will be used. Such a large width
parameter causes each radial basis function to cover the entire input space. When this
occurs, the radial basis functions are all highly activated for each input value. Therefore,
the network may have problems learning the desired mapping.

width=200; % Radial basis function width.
w1=[-8 -3 3 8]; % Center of the receptive fields.
a1=gaussian(x,w1,width); % Hidden layer outputs.
w2=inv(a1'*a1)*a1'*y; % Solve for output weights.
test_x=[-15:.2:15]'; % Test inputs.
y_target=.05*test_x.^3-.2*test_x.^2-3*test_x+20; % Test outputs.
test_a1=gaussian(test_x,w1,width); % Hidden layer outputs.
yout=test_a1*w2; % Network outputs.
plot(test_x,[y_target yout]);
title('Testing the RBF Network');
xlabel('Input');
ylabel('Output')

Warning: Matrix is close to singular or badly scaled.
Results may be inaccurate. RCOND = 3.273238e-017
[Figure: Testing the RBF Network. Axes: Input, Output.]


The hidden layer activations ranged from 0.9919 to 1.0. This made the regression
solution of the output weight matrix very difficult and ill-conditioned. The use of such a
large width parameter causes numerical problems and also makes it difficult to know
when an input vector is outside the training space.
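
The output weights can also be found without forming an explicit inverse. The sketch
below uses MATLAB's backslash operator to solve the least squares problem directly; the
centers, width, Gaussian form, and known weights are hypothetical values chosen so that
the fit can be checked exactly. It is offered as an alternative to the inv(a1'*a1)*a1'*y
expression used in the examples, not as the supplement's method.

```matlab
% Sketch: solve for the output weights without forming an explicit inverse.
x = (-10:1:10)';                       % Training inputs.
centers = [-8 -4 0 4 8];               % Hypothetical receptive field centers.
width = 4;                             % Moderate width, so the fields overlap.
a1 = zeros(numel(x), numel(centers));
for j = 1:numel(centers)
    a1(:,j) = exp(-(x - centers(j)).^2 / width^2);  % Hidden layer outputs.
end
w_true = [1; -2; 3; -4; 5];            % Known weights used to build targets.
y = a1*w_true;                         % Targets the network can fit exactly.
w2 = a1\y;                             % Least squares solution by backslash.
residual = norm(a1*w2 - y)/norm(y);    % Relative fitting error.
```

With a moderate width the hidden layer matrix is well conditioned and the known weights
are recovered. Backslash avoids squaring the condition number, which matters most in
cases like the width of 200 above, although no solver can fully rescue hidden layer
outputs that are nearly identical.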
9.8 Generalized Regression Neural Network
The Generalized Regression Neural Network [Specht 1991] is a feedforward neural
network best suited to function approximation tasks such as system modeling and
prediction. Although it can be used for pattern classification, the Probabilistic Neural
Network discussed in Section 9.6 is better suited to those applications.

The GRNN is composed of four layers. The first layer is the input layer and is fully
connected to the pattern layer. The second layer is the pattern layer and has one neuron
for each input pattern. This layer performs the same function as the first layer RBF
neurons: its output is a measure of the distance the input is from the stored patterns. The
third layer is the summation layer and is composed of two types of neurons: S-summation
neurons and a single D-summation neuron (division). The S-summation neuron
computes the sum of the weighted outputs of the pattern layer while the D-summation
neuron computes the sum of the unweighted outputs of the pattern neurons. There is one
S-summation neuron for each output neuron and a single D-summation neuron. The last
layer is the output layer and divides the output of each S-summation neuron by the output
of the D-summation neuron. A general diagram of a GRNN is shown below.

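[Figure: Generalized Regression Neural Network. Inputs X1 through Xn feed the input layer, followed by the pattern layer, the summation layer (S-summation neurons and one D-summation neuron), and the output layer producing Y1 through Yn.]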

The output of a GRNN is the conditional mean given by:

$$\hat{Y} = \frac{\sum_{j=1}^{T} W_j \exp\left( -\frac{D_j^2}{2\sigma^2} \right)}{\sum_{j=1}^{T} \exp\left( -\frac{D_j^2}{2\sigma^2} \right)}$$


Here the exponential function is a Gaussian function with a width constant sigma. Note
that the calculation of the Gaussian is performed in the pattern layer, the multiplication
by the weight vector and the summations are performed in the summation layer, and the
division is performed in the output layer.

The GRNN learning phase is similar to that of a PNN. It does not learn iteratively as do
most ANNs; but instead, it learns by storing each input pattern in the pattern layer and
calculating the weights in the summation layer. The equations for the weight calculations
are given below.

The pattern layer weights are set to the input patterns.

$$W_p = X^T$$

The summation layer weight matrix is set using the training target outputs. Specifically,
the matrix is the target output values appended with a vector of ones that connect the
pattern layer to the D-summation neuron.

$$W_s = [\, Y \;\; \text{ones} \,]$$

To demonstrate the operation of the GRNN we will use the same example used in the
RBF section. The training patterns will be limited to seven vectors distributed throughout
the input space. A width parameter of 4 will also be used. The training patterns must
cover the training space and the set should also contain the values at any minima or
maxima. The input training vector is chosen to be [-10 -6.7 -3.3 0 3.3 6.7 10]. First we
calculate the weight matrices.

x=[-10 -6.7 -3.3 0 3.3 6.7 10]'; % Training inputs.
y_target=.05*x.^3-.2*x.^2-3*x+20; % Generate the target outputs.
[Wp,Ws]=grnn_trn(x,y_target) % Calculate the weight matrices.
plot(x,y_target)
title('Training Data for a GRNN');
xlabel('Input');
ylabel('Output')

Wp =
-10.0000 -6.7000 -3.3000 0 3.3000 6.7000 10.0000
Ws =
-20.0000 1.0000
16.0838 1.0000
25.9252 1.0000
20.0000 1.0000
9.7189 1.0000
5.9602 1.0000
20.0000 1.0000

[Figure: Training Data for a GRNN. x-axis: Input; y-axis: Output.]


The GRNN will now be simulated for the training data.

x=[-10 -6.7 -3.3 0 3.3 6.7 10]';
a=2;
y=grnn_sim(x,Wp,Ws,a);
y_actual=.05*x.^3-.2*x.^2-3*x+20;
plot(x,y_actual,x,y,'*')
title('Generalization of a GRNN');
xlabel('Input');
ylabel('Output')

[Figure: Generalization of a GRNN. x-axis: Input; y-axis: Output.]


The recall performance of the network depends strongly on the width parameter. A
small width parameter gives good recall of the training patterns but poor generalization.
A larger width parameter gives better generalization but poorer recall. Choosing a good
width parameter is necessary for good performance. Usually, the largest width parameter
that gives good recall is optimal. In the example above, a width parameter of 2.5 was
found to be the maximum width that has good recall.
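
The trade-off between width and recall can be checked numerically. The sketch below is an assumption-laden illustration rather than the supplement's grnn_sim: it uses sigma directly as the Gaussian width constant and computes the RMS recall error on the seven training patterns for three widths.

```matlab
% Recall error of a GRNN sketch for several assumed width constants.
x = [-10 -6.7 -3.3 0 3.3 6.7 10]';   % Training inputs.
y = .05*x.^3 - .2*x.^2 - 3*x + 20;   % Training targets.
sigmas = [.5 2 5];                   % Widths to compare (sigma is an assumption).
rms_err = zeros(size(sigmas));
for s = 1:length(sigmas)
    yhat = zeros(size(x));
    for k = 1:length(x)
        g = exp(-(x(k) - x').^2/(2*sigmas(s)^2)); % Pattern layer outputs.
        yhat(k) = (g*y)/sum(g);                   % Weighted average of the targets.
    end
    rms_err(s) = sqrt(mean((yhat - y).^2));       % Recall error on the training set.
end
disp([sigmas; rms_err])              % Widths (top row) and recall errors (bottom row).
```

With the smallest width each pattern neuron responds almost exclusively to its own stored pattern, so recall is nearly exact; the largest width averages over many patterns and recall degrades.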

Next we check for correct generalization by simulating the GRNN over the trained
region. To show the effects of the width parameter, the GRNN is simulated with a width
parameter that is too small (a = .5), too large (a = 5), and optimal (a = 2).

x=[-10:.5:10]';
a=.5;
y=grnn_sim(x,Wp,Ws,a);
y_actual=.05*x.^3-.2*x.^2-3*x+20;
plot(x,y_actual,x,y,'*')
title('Generalization of GRNN: a = 0.5')
xlabel('Input');
ylabel('Output' )

[Figure: Generalization of GRNN: a = 0.5. x-axis: Input; y-axis: Output.]


x=[-10:.5:10]';a=5;
y=grnn_sim(x,Wp,Ws,a);
y_actual=.05*x.^3-.2*x.^2-3*x+20;
plot(x,y_actual,x,y,'*')
title('Generalization of GRNN, a = 5')
xlabel('Input');ylabel('Output')

[Figure: Generalization of GRNN, a = 5. x-axis: Input; y-axis: Output.]


x=[-10:.5:10]';a=2;
y=grnn_sim(x,Wp,Ws,a);
y_actual=.05*x.^3-.2*x.^2-3*x+20;
plot(x,y_actual,x,y,'*')
title('Generalization of GRNN: a = 2')
xlabel('Input');ylabel('Output')

[Figure: Generalization of GRNN: a = 2. x-axis: Input; y-axis: Output.]


Note that with the proper choice of training data and width parameter, the network was
able to generalize with very few training patterns. If nothing is known about the
function, a large training set must be chosen to guarantee it is representative. This would
make the network very large (many pattern nodes) and would require much memory and
long recall times. Clustering techniques can be used to select a representative training
set, thus reducing the number of pattern nodes.
Chapter 10 Dynamic Neural Networks and Control Systems
10.1 Introduction
Dynamic neural networks require some sort of memory. This memory allows the
network to exhibit temporal behavior, that is, behavior that depends not only on present
inputs but also on prior inputs. There are two major classes of dynamic networks:
Recurrent Neural Networks (RNN) and Time Delayed Neural Networks (TDNN).
Recurrent Neural Networks are networks with internal time delayed feedback
connections. The two most common RNN designs are the Elman network [Elman 1990]
and the Jordan network [Jordan 1986]. In an Elman network, the hidden layer outputs are
fed back through a one step delay to dummy input nodes. The Elman network can learn
temporal patterns as well as spatial patterns because it can store information. The Jordan
network is a recurrent architecture similar to the Elman network but it feeds back the
output layer rather than the hidden layer. Recurrent Neural Networks are difficult to train
due to the feedback connections. The usual training methods are Real Time Recurrent
Learning (RTRL) [Williams and Zipser 1989] and Back Propagation Through Time (BPTT)
[Werbos 1990].

[Figure: Elman Recurrent Neural Network. Inputs p(1) ... p(R) feed the hidden layer, whose outputs a1 are fed back through delays (D); the output layer produces a2(1) ... a2(S2).]
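
The feedback of the hidden layer through a one step delay can be sketched as follows. The weights, layer sizes, and the tanh activation are hypothetical choices made only for illustration; no training is shown.

```matlab
% Elman-style forward pass with hypothetical weights (no training shown).
W1 = [0.5; -0.3];             % Input-to-hidden weights (2 hidden neurons, 1 input).
Wc = [0.2 0.1; -0.4 0.3];     % Weights from the delayed hidden outputs.
W2 = [1 -1];                  % Hidden-to-output weights.
seqs = [1 0; 0 0];            % Two input sequences with the same final input.
yfinal = zeros(2,1);
for s = 1:2
    a1 = zeros(2,1);          % Delayed hidden outputs start at zero.
    for k = 1:size(seqs,2)
        a1 = tanh(W1*seqs(s,k) + Wc*a1); % Hidden layer sees input and its own past.
    end
    yfinal(s) = W2*a1;        % Linear output at the last time step.
end
disp(yfinal')
```

Both sequences end with the same input value, yet the final outputs differ because the first sequence left a trace in the delayed hidden outputs. This is the stored information that lets the network respond to temporal patterns.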

Time Delay Neural Networks (TDNNs) can learn temporal behavior by using not only
the present inputs, but also past inputs. TDNNs accomplish this by simply delaying the
input signal. The neural network architecture is usually a standard MLP but it can also
be an RBF, PNN, GRNN, or other feedforward network architecture. Since the TDNN has
no feedback terms, it is easily trained with standard algorithms.

[Figure: Time Delay Neural Network Model. The input u(k) and its delayed values (D) feed the ANN, which produces y(k).]


Specific applications using these dynamic network architectures will be discussed in later
sections of this chapter.
10.2 Linear System Theory
The educational version of MATLAB provides many functions for linear system
analysis. This section will provide simple examples of the usage of some of these
functions. Theoretical issues and derivation will not be discussed in this supplement.
We will examine the Fast Fourier Transform (fft) and the Power Spectral Density (psd)
functions.

Suppose you have a periodic signal that is 512 time steps long and is a combination of
sine waves of three different frequencies. The signal is generated and plotted below, and
a 256-point fft of the signal is then taken.

t=[1:1:512]; % Time
f1=.06*(2*pi); f2=.1*(2*pi); f3=.4*(2*pi); % Three frequencies
sig=2*sin(f1*t)+sin(f2*t)+1.5*sin(f3*t); % Periodic signal
plot(t,sig);title('Periodic Signal');
ylabel('Amplitude');xlabel('Time');axis([0 512 -5 5]);

[Figure: Periodic Signal. x-axis: Time; y-axis: Amplitude.]


Y=fft(sig,256); % Take a 256-point Fast Fourier Transform.
Pyy=Y.*conj(Y)/256; % Find the normalized power at each frequency.
f=[0:127]/256; % Calculate a frequency scale for the x axis.
plot(f,Pyy(1:128)) % Points 129:256 are symmetric.
xlabel('Frequency');ylabel('Power'); title('Power Spectral Density');

[Figure: Power Spectral Density. x-axis: Frequency; y-axis: Power.]

The plot of the power in each frequency band shows the three frequency components of
the signal and their relative magnitudes. Most measured signals also contain a noise
component.
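
The mapping from FFT bins to frequencies can be checked by locating the largest peak. This is a small sketch under the assumption that bin k of an N-point FFT corresponds to a frequency of (k-1)/N cycles per time step; the variable names are illustrative.

```matlab
% Locate the strongest spectral peak of the three-sine signal (sketch).
N = 256;                              % FFT length.
t = 1:N;                              % Time steps.
sig = 2*sin(.06*2*pi*t) + sin(.1*2*pi*t) + 1.5*sin(.4*2*pi*t);
Y = fft(sig,N);
Pyy = Y.*conj(Y)/N;                   % Power at each frequency bin.
f = (0:N/2-1)/N;                      % Bin k corresponds to (k-1)/N cycles per step.
[pmax,kmax] = max(Pyy(1:N/2));        % Largest peak in the first half.
fpeak = f(kmax);                      % Frequency of the strongest component.
disp(fpeak)
```

The strongest peak falls within one bin of 0.06, the frequency of the component with the largest amplitude. The peak is not exactly at 0.06 because that frequency does not coincide with a bin center.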

noisy_sig=sig+randn(size(sig)); % Add normally distributed noise.
plot(t,noisy_sig);title('Noisy Periodic Signal');
ylabel('Amplitude');
xlabel('Time');
axis([0 512 -5 5]);

[Figure: Noisy Periodic Signal. x-axis: Time; y-axis: Amplitude.]


Y=fft(noisy_sig,256); % Take a 256-point Fast Fourier Transform.
Pyy=Y.*conj(Y)/256; % Find the normalized power at each frequency.
f=[0:127]/256; % Calculate a frequency scale for the x axis.
plot(f,Pyy(1:128)) % Points 129:256 are symmetric.
xlabel('Frequency');
ylabel('Power');
title('Power Spectral Density');

[Figure: Power Spectral Density of the noisy signal. x-axis: Frequency; y-axis: Power.]


The figure shows the noise level rising across the entire spectrum. The psd function can
also be used to plot the Power Spectral Density with a dB scale.

psd(noisy_sig,256,1); % Plots a Power Spectral Density [dB].

[Figure: Power Spectral Density in dB. y-axis: Power Spectrum Magnitude (dB).]


The use of these signal processing functions may be necessary when designing a neural
or fuzzy system that uses inputs from the frequency domain.
10.3 Adaptive Signal Processing
Neural Networks can learn off-line or on-line. On-line learning neural networks are
usually called adaptive networks because they can adapt to changes in the input or target
signals. One example of an adaptive neural network is the single input adaptive
transverse filter. This is a specific case of the TDNN with one layer of linear neurons.

[Figure: Single Input Adaptive Transverse Filter. The input signal passes through a chain of delays (D); the delayed values are weighted by w1 ... wn and summed to form the output.]

Suppose that we have a Finite Impulse Response (FIR) filter implemented by a
TDNN (Shamma). FIR filters are sometimes used over Infinite Impulse Response
(IIR) filters such as Chebyshev, Elliptic, and Bessel due to their simplicity, stability,
linearity, and finite response duration. The following example will show how a TDNN
can be used as an FIR filter.

Suppose that a fairly noisy signal needs to be filtered so that it can be used as an input to
a PID controller or other analog device. An FIR filter implemented as a neural network can
be used for this application. Consider the following filtering example.

t=[250:1:400];
sig=sin(.01*t)+sin(.03*t)+.2*randn(1,151);
plot(t,sig);
title('Noisy Periodic Signal');
ylabel('Amplitude');
xlabel('Time');

[Figure: Noisy Periodic Signal. x-axis: Time; y-axis: Amplitude.]


To simulate a TDNN we must construct an input matrix containing the delayed values of
the input. For example, if we have an input vector of ten values that is being input to a
TDNN with 3 delays, we call the function delay_in(x,d) with the number of delays d=3.
This function returns an input matrix with 4 columns since there are four inputs to the
network. This function does not pad zeros into the inputs, so the matrix has 3 fewer rows
than the input vector has values. For example:

x=[0 1 2 3 4 5 6 7 8 9]';
xd=delay_in(x,3)

xd =
0 1 2 3
1 2 3 4
2 3 4 5
3 4 5 6
4 5 6 7
5 6 7 8
6 7 8 9
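
The construction of the delayed input matrix can be sketched directly. The loop below is a hypothetical reimplementation inferred from the output shown above, not the source of delay_in itself.

```matlab
% Build a delayed-input matrix without zero padding (assumed behavior of delay_in).
x = (0:9)';                   % Input vector of ten values.
d = 3;                        % Number of delays.
n = length(x) - d;            % Rows that have a full set of delayed values.
xd = zeros(n,d+1);
for k = 1:d+1
    xd(:,k) = x(k:k+n-1);     % Column k holds the input shifted by k-1 steps.
end
disp(xd)
```

Each row holds d+1 consecutive samples, so a weight vector multiplied by a row gives one filter output.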

Suppose we have a linear neural network FIR filter implemented with a TDNN with 5
delays. This filter can be used to process noisy data.

d=5; % Number of delays.
x=delay_in(sig',d); % Construct delayed input matrix.
w=[.2 .2 .2 .2 .1 .1]'; % Weight matrix.
y=x*w; % Calculate outputs.
td=t(d:length(sig)-1);
plot(td,sig(d:length(sig)-1),td,y');
title('Noisy Periodic Signal and Filtered Signal');
ylabel('Amplitude');
xlabel('Time');

[Figure: Noisy Periodic Signal and Filtered Signal. x-axis: Time; y-axis: Amplitude.]


We can see that the neural network filter removed a large portion of the noise. It
performs this operation by calculating a linear weighted average over a window. In this
example, the window is six time steps long. This filter can be trained on-line to give
specific response characteristics.
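
Two properties of the weighted average follow directly from the weights. The sketch below uses the base MATLAB filter function as a stand-in for the TDNN; the names dc_gain and noise_gain are illustrative. Note that filter applies the first weight to the newest sample, which may be the reverse of the ordering used with the delayed input matrix.

```matlab
% Gain of the six-tap weighted average for constant inputs and for white noise.
w = [.2 .2 .2 .2 .1 .1]';     % FIR weights from the example above.
dc_gain = sum(w);             % Gain applied to a constant input.
noise_gain = sum(w.^2);       % Variance gain for uncorrelated (white) noise.
x = 3*ones(20,1);             % Constant test input.
y = filter(w,1,x);            % FIR filtering; the first five outputs are transient.
disp([dc_gain noise_gain y(end)])
```

Because the weights sum to one, a constant level passes through unchanged, while the variance of uncorrelated noise is scaled by the sum of the squared weights (0.18 for these weights).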
10.4 Adaptive Processors and Neural Networks
The FIR example of the previous section is a dynamic network in the sense that it can
model temporal behavior. This section demonstrates how a neural network can be
dynamic in the sense that its parameters change with time. The example given trains a
linear network on-line. This allows a network to adaptively learn non-stationary (time-
varying) trends. The figure below shows a block diagram of an adaptive neural network
used for system identification. The neural network can be a TDNN if temporal behavior
is necessary or can be a static mapping without time delays. Systems whose
characteristics (mathematical models) do not change with time can be modeled with a
neural network that is trained off-line. The example of this section will only require a
static mapping since the system output only depends on the current inputs but will be
trained adaptively since one parameter is non-stationary.
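
[Figure: Parallel Identification Model. The input u(k) drives both the plant and the ANN, with delays (D) on the ANN inputs; the ANN output is subtracted from the plant output yout(k) to form the error ei.]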

Let us consider a simple linear network used for adaptive system identification. In this
example, the weights are the linear system coefficients of a non-stationary system. A
linear neural network can be adaptively trained on-line to follow the non-stationary
coefficient. Suppose the system is modeled by:

$$y = 2x_1 - 3x_2 - 1 + 0.01t$$

Initially, the offset term is -1 and this term changes to +1 as time reaches 200 seconds. A
simple linear network with two inputs can be trained on-line to estimate this system. The
two weighting coefficients will converge to 2 and -3 while the bias will start at -1 and
move towards +1. The choice of the learning rate will affect the speed of learning and
the stability of the network. A larger learning rate will allow the network to track faster
but will reduce its stability while a lower learning rate will produce a stable system with
slower tracking abilities.

W=[0 0 0]; % Initialize weights and bias to zero.
Weights=[]; % Store weight and bias values over time.
lr=.4; % Learning rate.
for i=1:200 % Run for 200 seconds.
x=rand(2,1); % System has random inputs.
y=2*x(1)-3*x(2)-1+.01*i; % Simulate system.
[W]=adapt(x,y,W,lr); % Train network.
Weights=[Weights W']; % Save weights and bias to plot.
end
plot(Weights') % Plot parameters.
title('Weights and Bias Approximation Over Time')
xlabel('Time');
ylabel('Weight and Bias Values')
text(50,2.5,'W1');
text(50,-2.5,'W2');
text(50,0,'B1');

[Figure: Weights and Bias Approximation Over Time. x-axis: Time; y-axis: Weight and Bias Values; curves labeled W1, W2, and B1.]


The figure shows that the network properly identified the system by about 30 seconds
and the bias correctly tracked the non-stationary parameter over time. This type of
learning paradigm could be used to train the FIR filter used in the previous section.
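
A single on-line update of a linear network can be sketched with the Widrow-Hoff (LMS) rule. This is a hypothetical stand-in for the adapt function, whose source is not shown here; the weight layout [w1 w2 bias] and the variable names are assumptions.

```matlab
% Widrow-Hoff (LMS) updates as a hypothetical stand-in for adapt.
W = [0 0 0];                          % Assumed layout: [w1 w2 bias].
lr = .4;                              % Learning rate.
Weights = zeros(3,200);               % Weight history, one column per time step.
for i = 1:200
    x = rand(2,1);                    % Random inputs.
    y = 2*x(1) - 3*x(2) - 1 + .01*i;  % Non-stationary system.
    u = [x; 1];                       % Inputs augmented with a constant bias input.
    e = y - W*u;                      % Prediction error before the update.
    W = W + lr*e*u';                  % Move the weights to reduce this error.
    Weights(:,i) = W';
end
e_after = y - W*u;                    % Error on the last pattern after its update.
disp([e e_after])
```

Each update scales the error on the current pattern by 1 - lr*(u'*u). With inputs between 0 and 1, u'*u lies between 1 and 3, so a learning rate of .4 keeps this factor between -0.2 and 0.6 and every update reduces the error on the pattern just presented. A much larger learning rate would push the factor beyond -1 and the updates would overshoot.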
10.5 Neural Networks Control
There are user supplied MATLAB toolkits that implement neural network based system
identification and control paradigms:

The NNSYSID Toolbox is located at: http://kalman.iau.dtu.dk/Projects/proj/nnsysid.html
and the NNCTRL Toolkit is at: http://www.iau.dtu.dk/Projects/proj/nnctrl.html.

These toolboxes were developed by Magnus Nørgaard of the Institute of Automation,
Technical University of Denmark. The toolboxes and user guides (Technical Report
95-E-773 and Technical Report 95-E-830) can be downloaded free of charge.

For a more in-depth discussion of the use of neural networks for system identification and
control refer to Advanced Control with MATLAB & SIMULINK by Moscinski and
Ogonowski. For further reading on the use of neural networks for control see Irwin,
Warwick and Hunt, 1995; Miller, Sutton and Werbos, 1990; Mills, Zomaya and
Tade, 1996; Narendra and Parthasarathy, 1990; Omatu, Khalid and Yusof, 1996; Pham
and Liu, 1995; White and Sofge, 1992; or Zbikowski and Hunt, 1996.

There are five general methods for implementing neural network controllers (Werbos pp.
59-65, in Miller Sutton and Werbos 1990):
1. Supervised Control
2. Direct Inverse Control
3. Neural Adaptive Control
4. Back-Propagation Through Time
5. Adaptive Critic Methods
10.5.1 Supervised Control
In supervised control, a neural network is trained to perform the same actions as another
controller (mechanical or human) for given inputs and plant conditions. After the neural
controller is trained, it replaces the controller.
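
[Figure: Neural Controller Training. The existing controller drives the plant from the reference r(k) and the output y(k); the neural controller is trained on the error e(k) between its output and the controller's output.]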

10.5.2 Direct Inverse Control
In Direct Inverse Control, a Neural Network is trained to model the inverse of a plant.
This modeling is similar to the system identification problem discussed in Section 10.6.

[Figure: Inverse System Identification. The plant maps u(k) to y(k); the inverse model maps y(k) back to an estimate of u(k), and the difference forms the error e(k).]

After the neural network learns the inverse model, it is used as a forward controller. This
methodology only works for a plant that can be modeled or approximated by an inverse
function ($F^{-1}$). Since $F(F^{-1}) = 1$, the output (y(k)) approximates the input (u(k)).

[Figure: Direct Inverse Control. The input u(k) feeds the ANN inverse plant model controller, whose output drives the plant to produce y(k).]
10.5.3 Model Referenced Adaptive Control
When the plant model changes with time due to wear, temperature effects, etc., neural
adaptive control may be the best technique to use. Model Referenced Adaptive Control
(MRAC) adapts the controller characteristics so that the controller/plant combination
performs like a reference plant.
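
[Figure: Direct Adaptive Control. The reference r(k) feeds the neural controller and the reference model; the controller output u(k) drives the plant, and the error e(k) is the difference between the reference model output and the plant output y(k).]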


Since the plant lies between the neural network and the error term, there is no method to
directly adjust the controller's weights to reduce the error. Therefore, indirect control
must be used.

[Figure: Indirect Adaptive Control. As in direct adaptive control, with an identification model added in parallel with the plant; a second error e(k) is formed between the plant output and the identification model output.]


In indirect adaptive control, an ANN identification model is used to model the non-linear
plant. If necessary, this model may be updated to track the plant. The error signals can
now be backpropagated through the identification model to train the neural controller so
that the plant response is equal to that of the reference model. This method uses two
neural networks, one for system identification and one for MRAC.
10.5.4 Back Propagation Through Time
Back Propagation Through Time (BPTT) [Werbos 1990] can be used to move a system
from one state to another state in a finite number of steps (if the system is controllable).
First a system identification neural network model must be trained so that the error
signals can be propagated through it to the controller, then the controller can be trained
with a BPTT paradigm.
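
[Figure: Training With BPTT. A chain of controller (C) and plant model (P) blocks carries the initial state x(0) through x(1), ..., x(k-1) to the final state x(k), with control signals u(0), u(1), ..., u(k-1). The final state is compared with the desired state to form e(k). x is the state vector, u is the control signal, C is the controller, and P is the plant model.]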


BPTT training takes place in two steps:
1. The plant motion stage, where the plant takes k time steps.
2. The weight adjustment stage, where the controller's weights are adjusted to make
the final state approach the target state.

It is important to note that there is only one set of weights to adjust because there is only
one controller. Many iterations are run until performance is as desired.
10.5.5 Adaptive Critic
Often a decision has to be made without an exact conclusion as to its effectiveness (e.g.
chess), but an approximation of its effectiveness can be obtained. This approximation
can be used to change the control system. This type of learning is called reinforcement
learning.

A critic evaluates the results of the control action: if it is good, the action is reinforced;
if it is poor, the action is weakened. This is a trial and error method and uses active
exploration when the gradient of the evaluation system in terms of the control action is
not available. Note that this is an approximate method, and should only be used when a
more exact method is unavailable.

A neural network based adaptive critic system uses one ANN to estimate the utility J(k)
of the state x(k). This utility is a measure of the goodness of the state. It also uses a
second ANN that trains with reinforcement learning to produce an input to the system
(u(k)) that produces a good state, x(k).

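[Figure: Adaptive Critic System. The action network produces u(k) for the plant; the plant state x(k) feeds both the action network and the critic network, which outputs the utility J(k).]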
10.6 System Identification
The task of both conventional and neural network based system identification is to build
a mathematical model of a dynamic system based on empirical data. In neural network
based system identification, the internal weights and biases of the neural network are
adjusted to make the model outputs similar to the measured outputs. Conventional
methods use empirical data and regression techniques to estimate the coefficients of
difference equations (ARX, ARMAX) or state space representations (see [Ljung 1987]).
10.6.1 ARX System Identification Model
A dynamic model is one whose output is dependent on past states of the system which
may be dependent on past inputs and outputs. A static model's output at a specific time is
only dependent on the input at that time. The basic conventional dynamic model is the
difference equation and the most common difference equation form is the ARX model:
Autoregressive with Exogenous Input Model which is sometimes called the Equation
Error Model. This model uses past inputs and outputs to predict current outputs. For
example, examine the following ARX model.

$$y(t) + a_1 y(t-1) + \ldots + a_{n_a} y(t-n_a) = b_1 u(t-1) + \ldots + b_{n_b} u(t-n_b) + e(t)$$

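The coefficients are represented by the set: $\theta = [a_1 \; a_2 \; \ldots \; a_{n_a} \; b_1 \; b_2 \; \ldots \; b_{n_b}]'$. The designer
sets the structure and a regression technique is used to solve for the coefficients. This
form can be visualized as

[Figure: ARX block diagram. The input u(t) passes through B(q)/A(q), the disturbance e(t) passes through 1/A(q), and the two are summed to give y(t).]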


where:

$$A(q) = 1 + a_1 q^{-1} + \ldots + a_{n_a} q^{-n_a}$$

$$B(q) = b_1 q^{-1} + \ldots + b_{n_b} q^{-n_b}$$

and:

y(t) = [B(q)/A(q)]u(t) + [1/A(q)]e(t).

This model assumes that the poles (roots of A(q)) are common between the dynamic
model and the noise model. The ARX model is linear, whereas one of the major advantages
of neural networks is their non-linear approximating capability.
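
The transfer function form can be exercised with the base MATLAB filter function. The first-order coefficients below are hypothetical and the disturbance e(t) is set to zero; the sketch only illustrates how A(q) and B(q) map onto coefficient vectors.

```matlab
% Noise-free response of a hypothetical first-order ARX model.
a1 = -0.5;  b1 = 1;           % Assumed coefficients: y(t) + a1*y(t-1) = b1*u(t-1).
A = [1 a1];                   % A(q) = 1 + a1*q^-1.
B = [0 b1];                   % B(q) = b1*q^-1 (one step input delay).
u = ones(50,1);               % Unit step input.
y = filter(B,A,u);            % y(t) = [B(q)/A(q)]u(t) with e(t) = 0.
p = roots(A);                 % Pole of the model.
disp([y(end) abs(p)])
```

The pole at 0.5 lies inside the unit circle, so the step response settles, here to b1/(1 + a1) = 2.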
10.6.2 Basic Steps of System Identification
There are three phases of system identification:
A. Collect experimental input/output data.
B. Select and estimate candidate model structures.
C. Validate the models and select best model.

These phases can be accomplished by using the following general procedure:
1. Design an experiment and collect data. This data must be Persistently Exciting;
meaning that the training set has to be representative of the entire class of inputs that
may excite the system.
2. Process the data to filter and remove outliers, etc.
3. Select a model structure.
4. Compute the best parameters for that structure.
5. Examine the model's properties.
6. If the properties are acceptable, quit; otherwise return to step 3.
10.6.3 Neural Network Model Structure
There are two basic neural network model structures: the parallel identification structure
and the series parallel structure. The Parallel Identification Structure has direct feedback
from the network's output to its input. It uses its estimate of the output to estimate future
outputs. Because of this feedback, it has no guarantee of stability and requires dynamic
backpropagation training. This structure should only be used if the actual plant outputs
are not available.
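
[Figure: Parallel Identification Model. The input u(k) and its delayed values (D) feed the ANN together with delayed values of the ANN's own output y'out(k); the error ei is the difference between the plant output yout(k) and y'out(k).]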


The Series-Parallel Identification Structure does not use feedback. Instead, it uses the
actual plant output to estimate future system outputs. Therefore, static backpropagation
training can be used and there are proofs to guarantee stability and convergence.
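
[Figure: Series-Parallel Identification Model. The input u(k) and its delayed values (D) feed the ANN together with delayed values of the plant output yout(k); the error ei is the difference between yout(k) and the ANN output y'out(k).]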


Once the general model structure is chosen, the model order must be selected. The model
order is the number of past signals to use as regressors. This can be estimated through
knowledge of the system or through experimentation.

The network system identification model is then trained and tested. The error terms, also
called residuals, should be white and independent of the input. Therefore, they are tested
with an autocorrelation function and a crosscorrelation function with the inputs.
Excessive correlation between the residuals and delayed inputs or outputs is evidence that
the delayed inputs or outputs have information that can be used to reduce the estimation
error.

If the output residuals have a high correlation with lagged input signals, there is reason to
believe that the lagged input signal should be included as an input. This is one method
that can be used to experimentally determine the number of lagged inputs to use for input
to the model. For example, if a second order system is being identified with a neural
network that only uses inputs delayed by one time step, the crosscorrelation between the
two time step lagged input and the output residuals will be large. This means that there is
information in the two time step lagged input that can be used to reduce the estimation
error.
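
The residual test can be sketched with a linear stand-in for the neural network. The system below is hypothetical: its output depends on the input lagged by one and by two time steps, but the model uses only the one step lag. The base MATLAB function corrcoef is used in place of a dedicated crosscorrelation function.

```matlab
% Residual test sketch: a model that omits the two time step lagged input.
N = 500;
u = randn(N,1);                        % White input sequence.
y = zeros(N,1);
for t = 3:N
    y(t) = u(t-1) + .5*u(t-2);         % Hypothetical system with two lagged inputs.
end
u1 = u(2:N-1);  u2 = u(1:N-2);         % Inputs lagged by one and by two steps.
yt = y(3:N);                           % Outputs aligned with the lagged inputs.
c = u1\yt;                             % Least-squares model using lag 1 only.
r = yt - c*u1;                         % Residuals of the under-specified model.
R1 = corrcoef(r,u1);  R2 = corrcoef(r,u2);
disp([R1(1,2) R2(1,2)])                % Correlation with lag 1 and with lag 2.
```

The residuals are nearly uncorrelated with the lag that the model already uses and strongly correlated with the omitted lag, which signals that the two time step lagged input should be added.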

As an example of an ARX model, consider:

y(t) -1.5y(t-T) + 0.7y(t-2T) = 0.9u(t-2T) + 0.5u(t-3T) + e(t)

Here, the output at time t is dependent on the two past outputs, delayed inputs of 2 and 3
time steps and the disturbance. The basic steps in setting up a system identification
problem are the same for a conventional model and a neural network based model. The
ARX structure is defined by:

1. The number of delayed outputs to include (y(t-T) y(t-2T)).
2. The time delay of the system. In this case the input does not affect the output for 2T.
3. The number of delayed inputs to use (u(t-2T) u(t-3T)).
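
Once the structure is fixed, the coefficients of the example ARX model can be recovered by linear regression. The sketch below assumes noise-free data (e(t) = 0), a random input sequence, and a time step T of one sample; the variable names are illustrative.

```matlab
% Least-squares estimate of the example ARX coefficients (noise-free sketch).
N = 200;
u = rand(N,1);                                 % Assumed random excitation.
y = zeros(N,1);
for t = 4:N                                    % Time step T taken as one sample.
    y(t) = 1.5*y(t-1) - .7*y(t-2) + .9*u(t-2) + .5*u(t-3);
end
Phi = [y(3:N-1) y(2:N-2) u(2:N-2) u(1:N-3)];   % Regressors for t = 4..N.
theta = Phi\y(4:N);                            % Least-squares solution.
disp(theta')
```

The estimate is written in regression form, so the first two entries are the negatives of the a coefficients in the ARX equation. With measured data the disturbance would prevent an exact fit.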

A neural network model structure is defined by the same inputs but is also defined by the
network architecture, which includes the number and type of network hidden layers and
hidden nodes.
10.6.4 Tank System Identification Example
This section presents an example that deals with the construction of a neural network
system identification model for the tank system of Section 6.1. The function
dy=tank_mod(t,y), is a non-linear model of the tank system where t is the simulation
time, y(1) is the current state of tank (level), y(2) is the input signal (voltage to valve),
and the output dy is the change in the tank state (change in level). To design a system
identification model, input/output data must be collected for the operating range and
input conditions of the tank. The operating state is the tank level and the input is the
voltage supplied to the inlet valve. The state and input defined in the function
tank_mod(t,y) cover the following ranges:

y(1): Tank level (0-36 inches).
y(2): Voltage being applied to the control valve (-4.5 to 1 volt).

To cover all possible operating conditions, we simulate the tank model over all
combinations of the input and state.

x1=[0:3:36]'; % Level range.
x2=[-4.5:.5:1]'; % Input voltage range.
x=combine(x1,x2); % Input combinations.
dx=zeros(length(x1)*length(x2),1); % Level changes vector.
for i=1:length(dx) % Simulate the system for all combinations.
dx(i)=tank_mod(1,x(i,:));
end
save tank_dat x dx % Save the simulation training data.
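
The grid of operating conditions can also be formed with base MATLAB. The sketch below uses ndgrid as a hypothetical equivalent of the combine function; the row ordering produced by combine itself may differ.

```matlab
% All combinations of level and voltage (assumed equivalent of combine).
x1 = (0:3:36)';               % Level range (13 values).
x2 = (-4.5:.5:1)';            % Input voltage range (12 values).
[L,V] = ndgrid(x1,x2);        % Every level paired with every voltage.
x = [L(:) V(:)];              % One row per [level voltage] combination.
disp(size(x))
```

This gives 13 x 12 = 156 training patterns, one for each combination of level and voltage.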

Now that we have training data, we can train a neural network to model the system. The
inputs to the neural network model are the state: x(k) and the voltage going to the valve
actuator: u(k). The output will be the change in the tank level: dx. By using a tank
model with output dx and training an ANN with an input x(k) and u(k) we are able to
avoid using a recurrent or time delay neural network.

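[Figure: System Identification. The input u(k) and state x(k) feed both the tank and the ANN model; the two dx outputs are differenced to form e(k), and the tank's dx is summed with x(k) to give x(k+1).]
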
The backpropagation training script is used to train a neural network with a single hidden
layer of 5 neurons. The training data file has inputs x(k) and u(k) and the change in state dx
is the target.

load tank_dat % Tank model simulation data.
t=dx'; % Change in state.
x=x'; % State and input voltage.
save tank_sys x t; % Tank data in neural network training format.

The bptrain script trained the tank system identification model and the weights were
saved in sys_wgt.mat.

load tank_sys % Tank training data.
load sys_wgt % Neural Network weights for system ID model.
[inputs,pats]=size(x);
output = linear(W2*[ones(1,pats);logistic(W1*[ones(1,pats);x])]);
subplot(2,1,1);
plot([t;output]');
title('System Identification Results')
ylabel('Actual and Estimated Output')
subplot(2,1,2);
plot(output-t); % Plot the error.
ylabel('Error');xlabel('Pattern Number');
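
The network output expression above is a two-layer forward pass with the bias handled by a row of ones. The sketch below uses placeholder weights rather than the trained values in sys_wgt.mat, and defines logistic as an anonymous function on the assumption that it is the usual sigmoid.

```matlab
% Forward pass of a 2-input, 5-hidden-neuron network with placeholder weights.
logistic = @(n) 1./(1+exp(-n));          % Assumed hidden layer activation.
W1 = zeros(5,3);                         % Hidden weights; first column is the bias.
W2 = ones(1,6);                          % Output weights; first entry is the bias.
x = [0 18 36; -4.5 -1 1];                % Three patterns: [level; voltage].
pats = size(x,2);
hidden = logistic(W1*[ones(1,pats); x]); % 5 x pats hidden layer outputs.
output = W2*[ones(1,pats); hidden];      % Linear output layer, 1 x pats.
disp(output)
```

With all hidden weights at zero every hidden neuron outputs 0.5, so each output is 1 + 5(0.5) = 3.5. Trained weights would make the output depend on the level and voltage.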

[Figure: System Identification Results. Top panel: Actual and Estimated Output versus pattern number; bottom panel: Error versus Pattern Number.]


The top plot in the above figure shows that the neural network gives the correct output
for the training set. The lower plot shows that the error levels are very small. The
following is a comparison of outputs for a sinusoidal input. The time step must be the
same as the integration time step of the analytical model. This checks for proper
generalization.

clear x X dx % Remove variables left over from earlier examples.
t=[0:1:40]; % Simulation time.
x(1)=15; % Initial tank level.
X(1)=15; % Initial tank estimator level.
u=sin(.2*t)-1.5; % Input voltage to control valve.
load sys_wgt % Neural Network weights for system ID model.
for i=1:length(u); % Simulate the system.
dx(i)=tank_mod(1,[x(i) u(i)]); % Calculate change in state.
estimate=linear(W2*[1;logistic(W1*[1;X(i);u(i)])]); % Network uses its own estimate.
x(i+1)=x(i)+dx(i); % Update the actual state.
X(i+1)=X(i)+estimate; % Update state estimate.
end
plot(t,[x(1:length(u));X(1:length(u))]);
title('Simulation Testing of Tank Model')
xlabel('Time');ylabel('Tank Level');

[Figure: Simulation Testing of Tank Model (parallel mode). x-axis: Time; y-axis: Tank Level.]

This simulation is being run in an open loop parallel-identification mode. That is, the
value used for the current state input into the neural network is the neural network's
own estimate. This structure allows the errors to build up over time. When a closed loop
(series-parallel) structure is chosen, in which the actual state is supplied to the network at
each time step, the estimate is much better.

clear x X dx % Remove variables left over from earlier examples.
t=[0:1:40]; % Simulation time.
x(1)=15; % Initial tank level.
X(1)=15; % Initial tank estimator level.
u=sin(.2*t)-1.5; % Input voltage to control valve.
load sys_wgt % Neural Network weights for system ID model.
for i=1:length(u); % Simulate the system.
dx(i)=tank_mod(1,[x(i) u(i)]); % Calculate change in state.
estimate=linear(W2*[1;logistic(W1*[1;x(i);u(i)])]);
x(i+1)=x(i)+dx(i); % Update the actual state.
X(i+1)=x(i)+estimate; % Update state estimate.
end
plot(t,[x(1:length(u));X(1:length(u))]);
title('Simulation Testing of Tank Model')
xlabel('Time');ylabel('Tank Level');

[Figure: Simulation Testing of Tank Model (series-parallel mode). x-axis: Time; y-axis: Tank Level.]


Since the actual state is used at each simulation time step, the errors are not allowed to
build up and the neural network more closely tracks the actual system.
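
The difference between the two simulation modes can be reproduced with a toy example that does not need tank_mod or the trained weights. The plant, the model, and the size of the model error below are all hypothetical.

```matlab
% Error build-up in parallel vs. series-parallel simulation (toy example).
plant = @(x,u) -0.1*x + u;            % Hypothetical plant: change in state.
model = @(x,u) -0.1*x + u + 0.01;     % Model with a small assumed bias error.
n = 40;  u = ones(1,n);               % Number of steps and a constant input.
x = zeros(1,n+1);  Xp = x;  Xs = x;   % Actual, parallel, and series-parallel states.
x(1) = 15;  Xp(1) = 15;  Xs(1) = 15;
for i = 1:n
    x(i+1)  = x(i)  + plant(x(i),u(i));   % Actual state.
    Xp(i+1) = Xp(i) + model(Xp(i),u(i));  % Parallel: feeds back its own estimate.
    Xs(i+1) = x(i)  + model(x(i),u(i));   % Series-parallel: uses the actual state.
end
err_p = abs(Xp(end) - x(end));        % Final error in parallel mode.
err_s = abs(Xs(end) - x(end));        % Final error in series-parallel mode.
disp([err_p err_s])
```

In series-parallel mode the error never exceeds the one step model error of 0.01, while in parallel mode the one step errors accumulate from step to step.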
10.7 Implementation of Neural Control Systems
The example of this section deals with the use of an inverse neural network system
identification model for direct inverse control of the tank problem of Section 6.1. The
data constructed in Section 10.6.4 will be used to train an inverse system model. Again,
the operating state is the tank level and the input is the voltage to the valve. The state
and input defined in the function tank_mod(t,y) cover the following ranges:

y(1): Tank level (0-36 inches).
y(2): Voltage being applied to the control valve (-4.5 to 1 volt).

To construct a direct inverse neural network controller we need to train a neural network
to model the inverse of the tank. The inputs to the neural network inverse model will be
the state: x(k) and the desired change in state: dx. The output will be the input voltage
going to the valve actuator: u(k). By using a tank model with output dx and training an
ANN with an input dx we are able to avoid using a recurrent or time delay neural
network.

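[Figure: Inverse System Identification. The tank maps u(k) and x(k) to dx, which is summed with x(k) to give x(k+1); the inverse model takes x(k) and dx and produces an estimate of u(k), and the difference forms e(k).]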


The backpropagation training script was used to train a neural network with a single
hidden layer of 5 neurons. We must set up the training data file correctly. The inputs x
are the state x and the change in state dx. The target is the input u.

load tank_dat % Tank inverse model simulation data.
t=x(:,2)'; % Valve actuator voltage.
x=[x(:,1) dx]'; % State and desired change in state.
save tank_trn x t; % Tank data in neural network training format.

The bptrain script trained the inverse model and the weights were saved in tank_wgt.mat.

load tank_trn % Tank training data.
load tank_wgt % Neural Network weights for inverse system ID model.
[inputs,pats]=size(x);clf;
output = linear(W2*[ones(1,pats);logistic(W1*[ones(1,pats);x])]);
plot(output-t); % Compare NN model results with tank analytical model.
title('Training Results');xlabel('Pattern Number');ylabel('Error')

[Figure: Training Results. x-axis: Pattern Number; y-axis: Error.]

After the neural network is trained, it is put into the direct inverse control framework.
The input to the inverse tank model controller is the current state and the desired state.
The output of the controller is the voltage input to the valve actuator.
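
The control loop described above can be sketched with a toy tank. Everything in this example is hypothetical: the plant equation stands in for tank_mod, the inverse is computed analytically instead of by a trained neural network, and the actuator limits are assumed.

```matlab
% Direct inverse control of a hypothetical tank (not tank_mod).
plant = @(x,u) 0.5*u - 0.1*sqrt(x);        % Assumed change in level per step.
inverse = @(x,dx) (dx + 0.1*sqrt(x))/0.5;  % Exact inverse: voltage for a desired dx.
x_d = 20;                                  % Desired level.
x = zeros(1,41);  x(1) = 10;               % Level history, starting at 10.
for k = 1:40
    dx_d = x_d - x(k);                     % Desired change in state.
    u = inverse(x(k),dx_d);                % Controller output.
    u = min(max(u,0),5);                   % Assumed actuator limits.
    x(k+1) = x(k) + plant(x(k),u);         % Apply the control to the plant.
end
disp(x(end))
```

While the requested change exceeds what the actuator can deliver, the control saturates and the level rises at its maximum rate. Once the request is within range, the inverse cancels the plant dynamics and the level reaches and holds the desired value. A trained neural network inverse would only approximate this cancellation.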

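[Figure: Direct Inverse Control of Tank. The desired state xd(k+1) and the current state x(k) feed the inverse tank model, which outputs u(k) to the tank; the tank's dx is summed with x(k) to give x(k+1), which is fed back through a delay (D).]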


A simulation of the inverse controller can be run using tank_sim(x_initial,x_desired). The
output of this simulation is the level response of the neural network controlled tank. The
desired level has been changed from 10 inches to 20 inches in this simulation.

tank_sim(10,20);

The tank and controller are simulated for 40 seconds, please be patient.
[Figure: Tank Level Response. Level (in) versus Time (sec).]


This controller has a faster speed of response and less steady state error than the fuzzy
logic controller. This is shown in the next plot where both outputs are plotted together.

hold on
tankdemo(10,20);
text(2,18,'Neural Net');
text(18,18,'Fuzzy System');
hold off

The tank and controller are simulated for 40 seconds, please be patient.
[Figure: Tank Level Response. Level (in) versus Time (sec), with curves labeled Neural Net and Fuzzy System.]


Chapter 11 Practical Aspects of Neural Networks
11.1 Neural Network Implementation Issues
There are several choices to be made when implementing neural networks to solve a
problem. These choices involve the selection of the training and testing data, the network
architecture, the training method, the data scaling method, and the error goal. Since over
90% of all neural network implementations use backpropagation trained multi-layer
perceptrons, we will only discuss their implementation in this section. Sections 8.3 and
8.4 discussed scaling methods and weight initialization so those topics will not be
revisited. The rest of these choices will be discussed in this chapter.
11.2 Overview of Neural Network Training Methodology
The figure below shows the methodology to follow when training a neural network. First
you must collect or generate the data to be used for training and testing the neural
network. Once this data is collected, it must be divided into a training set and a test set.
The training set should cover the input space or should at least cover the space in which
the network will be expected to operate. If there is no training data for certain
conditions, the output of the network should not be trusted for those inputs. The division
of the data into the training and test sets is somewhat of an art and somewhat of a trial
and error procedure. You want to keep the training set small so that training is fast, but
you also want to exercise the input space well which may require a large training set.
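
The division of the data can be sketched as follows. This is a minimal example under stated assumptions: patterns are stored in columns (as in the x and t matrices used with the training scripts), the 70/30 split is an arbitrary choice, and the variable names are hypothetical.

```matlab
% Divide a data set into training and test subsets (illustrative sketch).
% ASSUMPTION: patterns are stored in columns; the 70/30 split is arbitrary.
x = 0:0.5:10;                      % example inputs
t = 2 + 3*x - 0.4*x.^2;            % example targets
n = size(x,2);                     % number of patterns
idx = randperm(n);                 % random ordering of the patterns
ntrain = round(0.7*n);             % number of training patterns
trn = sort(idx(1:ntrain));         % training pattern indices
tst = sort(idx(ntrain+1:end));     % test pattern indices
% Keep the end points in the training set so that it bounds the region.
trn = union(trn,[1 n]);
tst = setdiff(tst,[1 n]);
x_trn = x(:,trn); t_trn = t(:,trn);   % training set
x_tst = x(:,tst); t_tst = t(:,tst);   % test set
```

Forcing the first and last patterns into the training set is one simple way to make sure the training data covers the full input range.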

[Figure: Neural Network Training Flow Chart. Collect Data -> Select Training and Test Sets -> Select Neural Network Architecture -> Initialize Weights -> SSE Goal Met? (N: Change Weights or Increase NN Size; Y: Run Test Set) -> SSE Goal Met? (N: Reselect Training Set or Collect More Data; Y: Done).]

Once the training set is selected, you must choose the neural network architecture. There
are two lines of thought here. Some designers choose to start with a fairly large network
that is sure to have enough degrees of freedom (neurons in the hidden layer) to train to
the desired error goal; then, once the network is trained, they try to shrink the network
until the smallest network that trains remains. Other designers choose to start with a
small network and grow it until the network trains and its error goal is met. We will use
the second method which involves initially selecting a fairly small network architecture.

After the network architecture is chosen, the weights and biases are initialized and the
network is trained. The network may not reach the error goal due to one or more of the
following reasons.

1. The training gets stuck in a local minimum.
2. The network does not have enough degrees of freedom to fit the desired input/output
model.
3. There is not enough information in the training data to perform the desired mapping.

In case one, the weights and biases are reinitialized and training is restarted. In case two,
additional hidden nodes or layers are added, and network training is restarted. Case three
is usually not apparent unless all else fails. When attempting to train a neural network,
you want to end up with the smallest network architecture that trains correctly (meets the
error goal); if not, you may have overfitting. Overfitting is described in greater detail in
Section 11.4.

Once the smallest network that trains to the desired error goal is found, it must be tested
with the test data set. The test data set should also cover the operating region well.
Testing the network involves presenting the test set to the network and calculating the
error. If the error goal is met, training is complete. If the error goal is not met, there
could be two causes:

1. Poor generalization due to an incomplete training set.
2. Overfitting due to an incomplete training set or too many degrees of freedom in the
network architecture.

The cause of the poor test performance is rarely apparent without using cross validation
checking, which will be discussed in Section 11.4.3. If an incomplete training set is causing
the poor performance, the test patterns that have high error levels should be added to the
training set, a new test set should be chosen, and the network should be retrained. If
there is not enough data left for training and testing, data may need to be collected again
or be regenerated. These training decisions will now be covered in more detail and
augmented with examples.
11.3 Training and Test Data Selection
Neural network training data should be selected to cover the entire region where the
network is expected to operate. Usually a large amount of data is collected and a subset
of that data is used to train the network. Another subset of that data is then used as test
data to verify the correct generalization of the network. If the network does not
generalize well on several data points, that data is added to the training data and the
network is retrained. This process continues until the performance of the network is
acceptable.

The training data should bound the operating region because a neural network's
performance cannot be relied upon outside the region where it was trained. The ability to
perform outside the training region is called a network's extrapolation ability. The
following is an example of a network that is being used outside of the region where it
was trained.

x=[0:1:10];
t=2+3*x-.4*x.^2;
save data11 x t

The above code segment creates the training data used to train a network to approximate
the following function:

f(x)=2+3*x-0.4*x.^2

load data11
plot(x,t)
title('Function Approximation Training Data')
xlabel('Input');ylabel('Output')

[Figure: Function Approximation Training Data. Output versus Input.]


We will choose a single hidden layer network architecture with 2 logistic hidden neurons
and train to either an average error level of 0.2 or 5000 epochs. The network was trained
using bptrain and zscore scaling resulting in:

*** BP Training complete, error goal not met!
*** RMS = 2.239272e-001

After several tries, the network never reached the error goal in 5000 epochs. For this
example, we will continue with the exercise and plot the training performance.
Continued training may result in a network that meets the initial error criteria.

load weight11
subplot(2,1,1);
semilogy(RMS);
title('Backpropagation Training Results');
ylabel('Root Mean Squared Error')
subplot(2,1,2);
plot(LR)
ylabel('Learning Rate')
xlabel('Cycles');

[Figure: Backpropagation Training Results. Top panel: Root Mean Squared Error (log scale) versus Cycles. Bottom panel: Learning Rate versus Cycles.]


Now that the network is trained, we can check for generalization performance inside the
training region. Remember that when scaling is used for training, those same scaling
parameters must be used for any data input to the network.
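
The reuse of scaling parameters can be sketched as follows. This is an illustration only: z-score scaling is written out explicitly, and `scale_z` is a hypothetical helper, not the supplement's zscore function.

```matlab
% Reuse the training-set scaling parameters on new data (illustrative sketch).
% ASSUMPTION: scale_z is a hypothetical helper written for this example.
scale_z = @(x,xm,xs) (x - xm)./xs;
x_train = 0:1:10;                  % training inputs
xm = mean(x_train);                % scaling parameters come from the
xs = std(x_train);                 % training data only
X_train = scale_z(x_train,xm,xs);  % scaled training inputs
x_new = [-5 2.5 15];               % new inputs, some outside 0 to 10
X_new = scale_z(x_new,xm,xs);      % same xm and xs, not recomputed
```

Scaled values of new inputs that fall well outside the range of X_train are a quick warning that the network is being asked to extrapolate.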

x=[0:.01:10];y=2+3*x-0.4*x.^2;
load weight11;clg;X=zscore(x,xm,xs);
output = linear(W2*[ones(size(x));logistic(W1*[ones(size(x));X])]);
plot(x,y,x,output);title('Function Approximation Verification')
xlabel('Input');ylabel('Output')

[Figure: Function Approximation Verification. Output versus Input.]

We can see that the network generalizes very well within the training region. Now we
look at how the network extrapolates outside of the training region.

x=[-5:.1:15];
y=2+3*x-0.4*x.^2;
load weight11
X=zscore(x,xm,xs);
output = W2*[ones(size(x));logistic(W1*[ones(size(x));X])];
plot(x,y,x,output)
title('Function Approximation Extrapolation')
xlabel('Input')
ylabel('Output')

[Figure: Function Approximation Extrapolation. Output versus Input.]


We see that the network generalizes very well within the training region (0-10) but
poorly outside of the training region. This shows that a neural network should never be
expected to operate correctly outside of the region where it was trained.
11.4 Overfitting
Several parameters affect the tendency of a neural network to overfit the data. Overfitting is
apparent when a network's error level for the training data is significantly better than the
error level for the test data. When this happens, the network has learned the peculiarities of
the training data, such as noise, rather than the underlying functional relationship of the
model to be learned. Overfitting can be reduced by:

1. Limiting the number of free parameters (neurons) to the minimum necessary.
2. Increasing the training set size so that the noise averages itself out.
3. Stopping training before overfitting occurs.

Three examples will now be used to illustrate these three methods. A robust training
routine would use all of the above methods to reduce the chance of overfitting.
11.4.1 Neural Network Size.
In the example of Section 11.3, we see that the function can be approximated well with a
network having 2 hidden neurons. Let us now train a network with data from a more
realistic model. This model will have 20% noise added to simulate noisy data that would
be measured from a process.

x=[0:1:10];
y=2+3*x-.4*x.^2;
randn('seed',3); % Set seed to original seed.
t=2+3*x-.4*x.^2+0.2*randn(1,11).*(2+3*x-.4*x.^2);
save data12 x t
plot(x,y,x,t)
title('Training Data With Noise')
xlabel('Input');ylabel('Output')

[Figure: Training Data With Noise. Output versus Input.]


We trained a neural network with the same architecture as above (2 neurons) using the
noisy data.

*** BP Training complete, error goal not met!
*** RMS = 1.276182e+000

The error only trained down to 1.27, but this is to be expected since we did not want the
network to learn the noise.

x=[0:1:10];
y=2+3*x-0.4*x.^2;
load data12
load weight12
X=zscore(x,xm,xs);
output = linear(W2*[ones(size(x));logistic(W1*[ones(size(x));X])]);
clg
plot(x,y,x,output,x,t)
title('Function Approximation Verification')
xlabel('Input')
ylabel('Output')

[Figure: Function Approximation Verification. Output versus Input.]


In the above figure, the smoothest line is the actual function, the choppy line is the noisy
data and the line closest to the smooth line is the network approximation. We can see
that the neural network approximation is smooth and follows the function very well.
Next we increase the degrees of freedom to more than is necessary for approximating the
function. This case will use a network with 4 hidden neurons.

*** BP Training complete, error goal not met!
*** RMS = 4.709244e-001

x=[0:1:10];
y=2+3*x-0.4*x.^2;
load data12
load weight13
X=zscore(x,xm,xs);
output = linear(W2*[ones(size(x));logistic(W1*[ones(size(x));X])]);
clg
plot(x,y,x,output,x,t)
title('Function Approximation Verification')
xlabel('Input')
ylabel('Output')

[Figure: Function Approximation Verification. Output versus Input.]


The above figure is the output of a network with 4 neurons trained with the noisy data.
We can see that when extra free parameters are used in the network, the approximation is
severely overfitted. Therefore, a network with the fewest number of free parameters that
can be trained to the error goal should be used. This statement requires that a realistic
error goal be set.

A function with 11 training points containing 20% random noise, for outputs that average
around 6, will have an RMS error goal of about 0.2*6=1.2. This is about the value that we got in
the properly trained example above. If we stop training of an overparameterized network
when that error goal is met, we reduce the chances of overfitting the model. A neural
network with 4 hidden neurons is now trained with an error goal of 1.2.

*** BP Training complete after 151 epochs! ***
*** RMS = 1.194469e+000

The network learned much faster (151 epochs versus 5000 epochs) than the network with
only two neurons. Let's look at the generalization.

x=[0:1:10];
y=2+3*x-0.4*x.^2;
load data12
load weight14
X=zscore(x,xm,xs);
output = linear(W2*[ones(size(x));logistic(W1*[ones(size(x));X])]);
clg
plot(x,y,x,output,x,t)
title('Function Approximation Verification')
xlabel('Input')
ylabel('Output')

[Figure: Function Approximation Verification. Output versus Input.]


The network generalized much better than the overparameterized network with an
unrealistically low error goal. This illustrates two methods that can be used to reduce the
chance of overfitting.

When the actual signal is not known, a realistic error goal can be found by filtering the
signal to smooth out the noise, then calculating the difference between the smoothed
signal and the noisy signal. This difference is a rough approximation of the amount of
noise in the signal.
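
The smoothing approach described above can be sketched as follows. This is a rough illustration under stated assumptions: a 5-point moving average is used as the smoothing filter (the filter type and width are arbitrary choices), and the test signal is the noisy quadratic used elsewhere in this chapter.

```matlab
% Estimate a realistic RMS error goal by smoothing (illustrative sketch).
% ASSUMPTION: a 5-point moving average is the smoothing filter.
x = 0:0.2:10;
y = 2 + 3*x - 0.4*x.^2;                 % underlying function
randn('seed',1);                        % repeatable noise
t = y + 0.2*randn(size(x)).*y;          % noisy measurements (20% noise)
width = 5;
ts = conv(t, ones(1,width)/width, 'same');   % smoothed signal
keep = 3:length(t)-2;                   % drop edge points distorted by the filter
rms_goal = sqrt(mean((t(keep) - ts(keep)).^2))   % rough noise estimate
rms_true = sqrt(mean((t - y).^2))       % actual noise level, for comparison
```

The estimate tends to be somewhat low because a short moving average also follows part of the noise, so it is best treated as a starting value for the error goal rather than an exact figure.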
11.4.2 Neural Network Noise
As discussed above, when there is noise in the training data, a method to calculate the
RMS error goal needs to be used. If there is significant noise in the data, increasing the
number of patterns in the training set can reduce the amount of overfitting. The
following example will illustrate this point. In this example the training set size is increased
to 51 patterns.

x=[0:.2:10];
randn('seed',100)
y=2+3*x-.4*x.^2;
t=2+3*x-.4*x.^2+0.2*randn(1,size(x,2)).*(2+3*x-.4*x.^2);
save data13 x t
plot(x,y,x,t)
title('Training Data With Noise')

[Figure: Training Data With Noise.]


Using this data we will now train a network with 5 hidden neurons. The results are as
follows.

*** BP Training complete, error goal not met!
*** RMS = 1.062932e+000

Seeing that the RMS error is less than 1.2, we can expect some overfitting.

x=[0:.2:10];
y=2+3*x-0.4*x.^2;
load data13
load weight15
X=zscore(x,xm,xs);
output = linear(W2*[ones(size(x));logistic(W1*[ones(size(x));X])]);
clg
plot(x,y,x,output,x,t)
title('Function Approximation Verification With 5 Neurons')
xlabel('Input')
ylabel('Output')

[Figure: Function Approximation Verification With 5 Neurons. Output versus Input.]


The above figure shows that the network generalized very well even though there were
too many free parameters in the model. This shows that using a more representative
training set tends to average out the noise in the signals without having to stop training at
an appropriate error goal or having to estimate the number of hidden neurons. The
network does sag in the region near input=2; to avoid this, use more samples or reduce the
number of hidden neurons.
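
The effect of training set size can also be explored without a neural network. The sketch below is an analogy only: a 6th-order polynomial least-squares fit stands in for an over-parameterized model, which is an assumption made for brevity and not the backpropagation training used in the text. The names `f`, `sizes` and `rms_fit` are hypothetical.

```matlab
% Effect of training set size on overfitting (illustrative sketch).
% ASSUMPTION: a 6th-order polynomial fit stands in for an
% over-parameterized network.
f = @(x) 2 + 3*x - 0.4*x.^2;            % underlying function
xv = 0:0.05:10;                         % dense grid for checking the fit
randn('seed',2);                        % repeatable noise
sizes = [11 51];                        % training set sizes to compare
rms_fit = zeros(size(sizes));
for s = 1:length(sizes)
    x = linspace(0,10,sizes(s));
    t = f(x) + 0.2*randn(size(x)).*f(x);     % noisy training data
    p = polyfit((x-5)/5, t, 6);              % fit on scaled inputs
    out = polyval(p, (xv-5)/5);              % evaluate the fit
    rms_fit(s) = sqrt(mean((out - f(xv)).^2));   % error against the true function
end
rms_fit
```

With most noise realizations the fit to 51 points lies closer to the true function than the fit to 11 points, although any single run can differ.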
11.4.3 Stopping Criteria and Cross Validation Training
The last method of reducing the chance of overfitting is cross validation training. Cross
validation training uses the principle of checking for overfitting during training. This
methodology uses two sets of data during training. One set is used for training and the
other is used to check for overfitting. Since overfitting occurs when the neural network
models the training data better than it would other data, checking data is used during
training to test for this overlearning behavior.

At each training epoch, the RMS error is calculated for both the training set and the checking
set. If the network has more than enough neurons to model the data, there will be a point
during training when the training error continues to decrease but the checking error levels
off and begins to increase. The script cvtrain is used to check for this behavior. An
additional 11 pattern noisy data set will be used as the checking data.

x=[0:1:10];
y=2+3*x-.4*x.^2;
randn('seed',5); % Change seed.
tc=2+3*x-.4*x.^2+0.2*randn(1,11).*(2+3*x-.4*x.^2); % Checking data set.
load data12 % Training data set.
save data14 x t tc
plot(x,y,x,t,x,tc);
title('Training Data and Checking Data')

[Figure: Training Data and Checking Data.]


A 5 hidden neuron network will now be trained using cvtrain for 1000 epochs.

*** BP Training complete, error goal not met!
*** Minimum RMS = 6.909391e-001
*** Minimum checking RMS = 1.249887e+000
*** Best Weight Matrix at 240 epochs!

load weight16;
clg
semilogy(RMS);
hold on
semilogy(RMSc);
hold off
title('Cross Validation Training Results');
ylabel('Root Mean Squared Error')
text(600,2, 'Checking Error');
text(600,.5,'Training Error');

[Figure: Cross Validation Training Results. Root Mean Squared Error (log scale) versus training epoch, with curves labeled Checking Error and Training Error.]


The figure shows that both errors initially are reduced, then the checking error starts to
increase. The best weight matrices occur at the checking error minimum. Subsequent to
the checking error minimum, the network is overfitting. This training method can also be
used to identify a realistic training RMS error goal. In this example, the error goal is the
minimum of the checking error.

min(RMSc)

ans =
1.2499

Again we find that a realistic error goal is near 1.2. This agrees with our two previous
calculations.
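
Locating the stopping point from the two error histories can be sketched as follows. The curves `RMS` and `RMSc` below are synthetic, made up purely for illustration; in practice they would be the histories saved by the training script.

```matlab
% Locate the early stopping point from the error histories (sketch).
% ASSUMPTION: RMS and RMSc are synthetic curves invented for illustration.
epochs = 1:1000;
RMS  = 0.6 + 2*exp(-epochs/80);                   % training error keeps falling
RMSc = 1.2 + 2*exp(-epochs/80) + 4e-7*epochs.^2;  % checking error turns upward
[goal, best_epoch] = min(RMSc);    % checking error minimum and its epoch
overfit = epochs > best_epoch;     % epochs where the network is overfitting
```

The weights saved at best_epoch are the ones to keep, and goal can serve as a realistic training error goal for later runs.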

In summary, there are four methods to reduce the chance of overfitting:

1. Limiting the number of free parameters.
2. Training to a realistic error goal.
3. Increasing the training set size.
4. Using cross validation training to identify when overfitting occurs.

These methods can be used independently or used together to reduce the chance of
overfitting.

Chapter 12 Neural Methods in Fuzzy Systems
12.1 Introduction
In this chapter we explore the use of fuzzy neurons in neural systems, while in the next
chapter we will explore the use of neural methodologies to train fuzzy systems. There is
a tremendous advantage to training fuzzy networks when experiential data is available,
but the advantages of using fuzzy neurons in neural networks are less well defined.

Embedding fuzzy notions into neural networks is an area of active research, and there are
few well known or proven results. This chapter will present the fundamentals of
constructing neural networks with fuzzy neurons but will not describe the advantages or
practical considerations in detail.
12.2 From Crisp to Fuzzy Neurons
The artificial neuron was first presented in Section 7.1. This neuron processed
information using the following equation,

$$y = f\left(\sum_{k=1}^{n} x_k w_k + b\right)$$


where x is the input vector, w are the weights connecting the inputs to the neuron and b is
a bias. The function f is usually continuous and monotonically increasing. The result of
the product,

$$d_k = x_k w_k$$


is referred to as the dendritic input to the neuron. In the case of fuzzy neurons, this
dendritic input is usually a function of the input and the weight. This function can be an
S-norm, such as the probabilistic sum, or a T-norm, such as the product. The choice of
the function is dependent on the type of fuzzy neuron being used. Fuzzy neuron types
will be discussed in subsequent sections.

S-norms:
1. probabilistic sum: $d_k = x_k \, S \, w_k = x_k + w_k - x_k w_k$
2. OR: $d_k = x_k \, S \, w_k = x_k \vee w_k = \max(x_k, w_k)$

T-norms:
1. product: $d_k = x_k \, T \, w_k = x_k w_k$
2. AND: $d_k = x_k \, T \, w_k = x_k \wedge w_k = \min(x_k, w_k)$


The dendritic inputs are then aggregated by some chosen operator. In the simplest
case, this operator is a summation operator, but other operators such as min and max are
commonly used.
Summation aggregation operator: $I_j = \sum_{k=1}^{n} d_k$

Min aggregation operator: $I_j = \bigwedge_{k=1}^{n} d_k = \min_k(d_k)$

Max aggregation operator: $I_j = \bigvee_{k=1}^{n} d_k = \max_k(d_k)$
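
The norms and aggregation operators listed above can be tried out directly. In this sketch the operators are written as anonymous functions with hypothetical names; they are not functions from the supplement.

```matlab
% Dendritic inputs and aggregation operators (illustrative sketch).
% ASSUMPTION: s_psum, s_or, t_prod and t_and are hypothetical helper names.
s_psum = @(x,w) x + w - x.*w;   % S-norm: probabilistic sum
s_or   = @(x,w) max(x,w);       % S-norm: OR (max)
t_prod = @(x,w) x.*w;           % T-norm: product
t_and  = @(x,w) min(x,w);       % T-norm: AND (min)

x = [0.7 0.3];                  % two inputs
w = [0.2 0.5];                  % two weights
d_psum = s_psum(x,w)            % [0.76 0.65]
d_or   = s_or(x,w)              % [0.70 0.50]
d_prod = t_prod(x,w)            % [0.14 0.15]
d_and  = t_and(x,w)             % [0.20 0.30]

I_sum = sum(d_prod)             % summation aggregation: 0.29
I_min = min(d_prod)             % min aggregation: 0.14
I_max = max(d_prod)             % max aggregation: 0.15
```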


The fuzzy neuron output $y_j$ is a function of the internal activation $I_j$ and the
threshold level $T_j$ of the neuron. This function can be a numerical function or a T-norm or
S-norm.

$$y_j = \Phi(I_j, T_j)$$

The use of different operators in fuzzy neurons gives the designer the flexibility to make
the neuron perform various functions. For example, the output of a fuzzy neuron may be
a linguistic representation of the input vector such as Small or Rapid. These linguistic
representations can be further processed by subsequent layers of fuzzy neurons to model a
specified relationship.
12.3 Generalized Fuzzy Neuron and Networks
This section will further discuss the fuzzy neural network architecture. The generalized
fuzzy neuron is shown in the figure below.

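[Figure: Generalized Fuzzy Neuron. Inputs x1, x2, x3, ..., xn; weights w1j, ...; dendritic inputs d1j, d2j, d3j, ..., dnj; modified dendritic inputs δ1j, δ2j, δ3j, ..., δnj; internal activation Ij; transfer function Φj; output yj.]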

The synaptic inputs (x) generally represent the degree of membership to a fuzzy set and
have a value in the range [0 1]. The dendritic inputs (d) are also normally bounded, in
the range [0 1], and represent the membership to a fuzzy set.

In this figure, the dendritic inputs have been modified to produce excitatory and
inhibitory signals. This is done with a function that performs the complement and is
represented graphically by the not sign: °. Numerically, the behavior of these operators
is represented by:

excitatory: $\delta_{ij} = d_{ij}$
inhibitory: $\delta_{ij} = 1 - d_{ij}$
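
As a small illustration of the two cases, the sketch below marks one connection as inhibitory. The logical vector `inhibit` is a hypothetical way of flagging inhibitory connections and does not come from the supplement.

```matlab
% Excitatory and inhibitory modification of dendritic inputs (sketch).
% ASSUMPTION: inhibit is a hypothetical flag vector for this example.
d = [0.14 0.15 0.60];            % dendritic inputs in the range [0 1]
inhibit = logical([0 1 0]);      % second connection is inhibitory
delta = d;                       % excitatory: delta = d
delta(inhibit) = 1 - d(inhibit)  % inhibitory: delta = 1 - d
```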

The generalized fuzzy neuron has internal signals that represent membership to fuzzy
sets. Therefore, these signals have meaning. Most signals in artificial neural networks
do not have a discernible meaning. This is one advantage of a fuzzy neural system.
12.4 Aggregation and Transfer Functions in Fuzzy Neurons
As stated in Section 12.2, the aggregation operator can be implemented by several different
mathematical expressions. The most common aggregation operator is a T-norm
represented by:


$$I_j = \mathop{T}_{i=1}^{n} \delta_{ij}$$


but other operators can also be implemented. The internal activation is simply the
aggregation of the modified input membership values (dendritic inputs). A T-norm
(product or min) tends to reduce the activation while an S-norm tends to enhance the
activation (probabilistic sum or max). These aggregation operators provide the fuzzy
neuron with a means of implementing intersection (T-norm) and union (S-norm)
concepts.

The MATLAB implementation of a T-norm aggregation is:

x=[0.7 0.3]; % Two inputs.
w=[0.2 0.5]; % Hidden weight matrix.
I=product(x',w') % Calculate product T-norm output.

I =
0.1400
0.1500

The MATLAB implementation of a probabilistic sum S-norm is:

x=[0.7 0.3]; % Two inputs.
w=[0.2 0.5]; % Hidden weight matrix.
I=probor(x',w') % Calculate probabilistic sum S-norm output.

I =
0.7600
0.6500

Biases can be implemented in T-norm and S-norm operations by adding a dummy input whose
value is either 1 or 0. In normal artificial neurons, the bias is the weight corresponding to a
dummy node of input equal to 1. Similarly, in fuzzy neurons, the bias is the weight
corresponding to a dummy input value ($x_0$). For example, a bias in a T-norm would have
its dummy input set to a 1 while a bias in an S-norm would have its dummy input set to a
0.

The MATLAB implementation of a T-norm bias is:

x=[1 0.3]; % The dummy input is a 1
w=[0.7 0.5]; % The corresponding bias weight is a .7.
I=x'.*w' % Calculate product T-norm output.

I =
0.7000
0.1500

The MATLAB implementation of an S-norm bias is:

x=[0 0.3]; % The dummy input is a 0.
w=[0.7 0.5]; % The corresponding bias weight is a .7.
I=probor(x',w') % Calculate probabilistic sum S-norm output.

I =
0.7000
0.6500

In both of the above cases, the bias propagates to the internal activation and may affect
the neuron output depending on the activation function operator.

The activation function or transfer function is a mapping operator from the internal
activation to the neuron's output. This mapping may correspond to a linguistic modifier.
For example if the inputs are weighted grades of accomplishments, the linguistic modifier
“more-or-less” could give the aggregated value a stronger value. S-norms and T-norms
are commonly used. The MATLAB implementation of an S-norm aggregation and a T-
norm activation function would be:

x=[0 0.3]; % The dummy input is a 0.
w=[0.7 0.5]; % The corresponding bias weight is a .7.
I=probor(x',w') % Calculate probabilistic sum S-norm output.
z=prod(I) % Calculate product T-norm.

I =
0.7000
0.6500
z =
0.4550

12.5 AND and OR Fuzzy Neurons
The most commonly used fuzzy neurons are AND and OR neurons. The AND neuron
performs an S-norm operation on the dendritic inputs and weights and then performs a T-
norm operation on the results of the S-norm operation. Although any S-norm or T-norm
operations can be implemented, usually max and min operators are used. The exception
is when training routines are implemented; in this case, differentiable functions are used.
The AND neuron representation is:

$$z_h = \mathop{T}_{i=1}^{n} (x_i \, S \, w_{hi})$$


The MATLAB AND fuzzy neuron implementation is:

x=[0.7 0.3]; % Two inputs.
w=[0.2 0.5]; % Hidden weight matrix.
I=max(x,w) % Calculate max S-norm operation.
z=min(I) % Calculate min T-norm operation.

I =
0.7000 0.5000
z =
0.5000

The OR neuron performs a T-norm operation on the dendritic inputs and weights and then
performs an S-norm operation on the results of the T-norm operation.

$$z_h = \mathop{S}_{i=1}^{n} (x_i \, T \, w_{hi})$$


The MATLAB OR fuzzy neuron implementation is:

x=[0.7 0.3]; % Two inputs.
w=[0.2 0.5]; % Hidden weight matrix.
I=min(x,w) % Calculate min T-norm operation.
z=max(I) % Calculate max S-norm operation.

I =
0.2000 0.3000
z =
0.3000

AND and OR neurons can be arranged in layers and these layers can be arranged in
networks to form multilayer fuzzy neural networks.
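
When training is required, the max and min operators in these neurons are replaced by differentiable norms. The sketch below repeats the two examples with a product T-norm and a probabilistic sum S-norm; `psum` and `psum_all` are hypothetical helper names introduced for this example.

```matlab
% AND and OR neurons with differentiable norms (illustrative sketch).
% ASSUMPTION: psum and psum_all are hypothetical helper names.
psum = @(a,b) a + b - a.*b;          % probabilistic sum of two operands
psum_all = @(d) 1 - prod(1 - d);     % probabilistic sum over a vector

x = [0.7 0.3];                       % two inputs
w = [0.2 0.5];                       % two weights
z_and = prod(psum(x,w))              % AND neuron: T-norm of S-norms, 0.494
z_or  = psum_all(x.*w)               % OR neuron: S-norm of T-norms, 0.269
```

The form 1 - prod(1 - d) is the probabilistic sum applied repeatedly across all elements of d, which is convenient when a neuron has more than two inputs.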
12.6 Multilayer Fuzzy Neural Networks
The fuzzy neurons discussed in earlier sections of this chapter can be connected to form
multiple layers. The multilayer networks discussed here have three layers with each
layer performing a different function. The input layer simply sends the inputs to each of
the hidden nodes. The hidden layer is composed of either AND or OR neurons. These
hidden layer neurons perform a norm operation on the inputs and weight matrix. The
output layer neurons perform a norm operation on the hidden layer outputs
(z) and the output weight vector (v). The norm operations can be any type of S-norms or
T-norms. The following figure presents a diagram of a multilayer fuzzy network.

[Figure: Multilayer Fuzzy Neural Network. Input Layer (index i): x1, x2, ..., xn and x(n+1), x(n+2), ..., x(2n). Hidden Layer (index h): AND1, AND2, ..., ANDp, connected to the inputs by weights w_hi. Output Layer (index j): OR1, OR2, ..., ORm, connected to the hidden layer by weights v_jh, with outputs y1, y2, ..., ym.]

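In the above figure, the hidden layer uses AND neurons to perform the T-norm
aggregation. The output of the hidden layer is denoted $z_h$, where h is the index of the
hidden node.

$$z_h = \left[\mathop{T}_{i=1}^{n} (x_i \, S \, w_{hi})\right] T \left[\mathop{T}_{i=1}^{n} (\bar{x}_i \, S \, w_{h(n+i)})\right], \qquad h = 1, 2, \ldots, p$$

Here $\bar{x}_i$ is the complement of input $x_i$, which enters the network as input $x_{n+i}$.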


This is implemented in MATLAB with product T-norms and probabilistic sum S-norms.
As a very simple example, suppose we have a network with two inputs, two hidden nodes
and one output. The hidden layer outputs are given by:

x=[0.1 0.3]; % Two inputs.
xc=[0.9 0.7]; % Two complements.
w= [0.2 0.5; -0.2 0.8]; % Hidden weight matrix.
wc=[0.9 -0.3; 0.6 0.2]; % Complementary portion weight matrix.
for i=1:2 % Calculate hidden node outputs.
z(i)=prod([prod(probor(x',w(:,i))) prod(probor(xc',wc(:,i)))]);
end
z

z =
0.0390 0.3127

In the above figure, the output layer uses OR neurons to perform an S-norm aggregation.
The output of the network is denoted $y_j$, where j is the index of the output neuron. In the
examples of the text and this supplement, the networks are limited to one output neuron,
although the MATLAB code provides for multiple outputs.

$$y_j = \mathop{S}_{h=1}^{p} (z_h \, T \, v_{jh}), \qquad j = 1, 2, \ldots, m$$


This is implemented in MATLAB with product T-norms and probabilistic sum S-norms:

v=[0.3 0.7]'; % One output node.
for j=1:2 % Calculate output node internal activations.
I(j)=prod([z(j) v(j)]');
end
y=probor(I) % Perform a probor on the internal activations.

y =
0.2281

The evaluation of all the fuzzy neurons in a multilayer fuzzy network can be combined
into one function. The following code simulates the fuzzy neural network where:

x is the input vector. size(x) = (patterns, inputs)
w is the hidden weight vector. size(w) = (inputs, hidden neurons)
v is the output weight vector. size(v) = (hidden neurons, output neurons)
y is the output vector. size(y) = (patterns, output neurons)

As a very simple example, suppose we have a network with two inputs, two hidden nodes
and one output. The forward pass of two patterns gives:

x=[.1 .3 .9 .7;.7 .4 .3 .6 ]; % Two patterns of 2 inputs and their
% complements.
w=[.2 .5;-.2 .8;.9 -.3;.6 .2;]; % Two hidden nodes.
v=[0.3 0.7]'; % One output node.
y=fuzzy_nn(x,w,v) % Simulate the network.

y =
0.2281
0.0803
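
For readers without the supplement's files, the forward pass can be sketched in a few lines. This is not the supplement's fuzzy_nn function; it is a minimal stand-in written under the assumption of AND hidden neurons and OR output neurons with product T-norms and probabilistic sum S-norms, using the matrix sizes listed above. The helper `psum` is a hypothetical name.

```matlab
% Forward pass of a small multilayer fuzzy network (illustrative sketch).
% ASSUMPTION: this is a stand-in, not the supplement's fuzzy_nn function.
psum = @(a,b) a + b - a.*b;              % probabilistic sum S-norm
x = [.1 .3 .9 .7; .7 .4 .3 .6];          % two patterns: inputs and complements
w = [.2 .5; -.2 .8; .9 -.3; .6 .2];      % hidden weights (inputs x hidden)
v = [0.3 0.7]';                          % output weights (hidden x outputs)
[patterns, inputs] = size(x);
hidden = size(w,2);
outputs = size(v,2);
y = zeros(patterns, outputs);
for k = 1:patterns
    z = zeros(1,hidden);
    for h = 1:hidden
        z(h) = prod(psum(x(k,:)', w(:,h)));   % AND neuron: product of S-norms
    end
    for j = 1:outputs
        I = z' .* v(:,j);                     % product T-norm with output weights
        y(k,j) = 1 - prod(1 - I);             % OR neuron: probabilistic sum
    end
end
y
```

With the weights and patterns above, this sketch gives the same two outputs as the listing, 0.2281 and 0.0803.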

12.7 Learning and Adaptation in Fuzzy Neural Networks
If there is experiential input/output data from the relationship to be modeled, the fuzzy
neural network's weights and biases can be trained to better model the relationship. This
training can be performed with a gradient descent algorithm similar to the one used for
the standard neural network.

Complications with fuzzy neural network training arise because the activation or transfer
functions must be continuous, monotonically increasing, differentiable operators.
Several S-norm and T-norm operators such as the max and min operators do not fit this
requirement, although the probabilistic sum S-norm and product T-norm that were used in
the previous section do meet this requirement.

As a simple example, let us study the weight changes involved in training a single OR
fuzzy neuron with two inputs. We will assume the OR fuzzy neuron is an output neuron
and is therefore represented by:

$$y_j = \mathop{S}_{h=1}^{p} (z_h \, T \, v_{jh}), \qquad j = 1$$

where: $y_j$ is the j-th output neuron.
$z_h$ is the output of the h-th hidden layer neuron.
$v_{jh}$ is the weight connecting the h-th hidden layer neuron to the j-th output.
T-norm is product.
S-norm is probabilistic OR.

The training data consists of three patterns.

z=[1 0.2 0.7;1 0.3 0.9;1 0.1 0.2]; % Training input data with bias.
t=[0.7 ;0.8 ;0.4 ]; % Training output target data.

And the output is represented by:

v=[.32 .76 .11] % Random initial weights.
y=zeros(3,1); % Output vector
for k=1:3 % For each input pattern
I=product(z(k,:)',v'); % Calculate product T-norm operation.
y(k)=probor(I); % Calculate probabilistic sum S-norm.
end
y

v =
0.3200 0.7600 0.1100
y =
0.4678
0.5270
0.3855

To train this network to minimize a squared error between the target vector (t) and the
output vector (y), we will use a gradient descent procedure. The squared error (as
opposed to the sum of squared errors) is used because the training will occur sequentially
rather than in batch mode. The error and squared error are defined as:

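$$\varepsilon = t - y$$
$$\varepsilon^2 = (t - y)^2$$

$$\Delta v_{jh} = -\eta \frac{\partial \varepsilon^2}{\partial v_{jh}} = -\eta \frac{\partial \varepsilon^2}{\partial y_j} \cdot \frac{\partial y_j}{\partial v_{jh}}$$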


where:

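$$\frac{\partial \varepsilon^2}{\partial y_j} = -2 (t_j - y_j)$$
$$\frac{\partial y_j}{\partial v_{jh}} = \frac{\partial}{\partial v_{jh}} \left[ \mathop{S}_{h=1}^{p} (z_h \, T \, v_{jh}) \right]$$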

and T is the T-norm operator (product).
S is the S-norm operator (probabilistic sum).

If we substitute the symbol A for the norms of the terms not involving the weight of
interest (h=q), then we can solve for the partial derivative of y with respect to that
weight, $v_{jq}$.

$$A = \mathop{S}_{h \neq q}^{p} (z_h \, T \, v_{jh})$$
$$\frac{\partial y_j}{\partial v_{jq}} = \frac{\partial (A + v_{jq} z_q - A v_{jq} z_q)}{\partial v_{jq}} = z_q (1 - A)$$


If the T-norm is a product and the S-norm is a probabilistic sum, this equation can be
written as:

$$\frac{\partial y_j}{\partial v_{jq}} = z_q (1 - A) = z_q \left(1 - \mathop{\mathrm{probor}}_{h \neq q}^{p} (v_{jh} z_h)\right)$$


Combining the factor of 2 into the learning rate results in:

$$\Delta v_{jq} = \eta \, (t_j - y_j) \, z_q \left(1 - \mathop{\mathrm{probor}}_{h \neq q}^{p} (v_{jh} z_h)\right)$$
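
The derivative used in this weight update can be checked numerically. The sketch below compares the analytic expression z_q(1 - A) with a central finite difference for one input pattern; `psum_all` and `neuron` are hypothetical helpers written for this check, assuming a product T-norm and a probabilistic sum S-norm.

```matlab
% Numerical check of dy/dv_q = z_q*(1 - A) (illustrative sketch).
% ASSUMPTION: psum_all and neuron are hypothetical helpers for an OR
% neuron with product T-norm and probabilistic sum S-norm.
psum_all = @(d) 1 - prod(1 - d);        % probabilistic sum over a vector
neuron = @(z,v) psum_all(z.*v);         % OR neuron output
z = [1 0.2 0.7];                        % one input pattern with bias
v = [0.32 0.76 0.11];                   % weights
h = 1e-6;                               % finite difference step
analytic = zeros(size(v));
numeric = zeros(size(v));
for q = 1:length(v)
    others = [1:q-1, q+1:length(v)];    % indices other than q
    A = psum_all(z(others).*v(others)); % norm of the terms without v_q
    analytic(q) = z(q)*(1 - A);         % derivative from the formula
    vp = v; vp(q) = v(q) + h;           % perturb one weight up
    vm = v; vm(q) = v(q) - h;           % and down
    numeric(q) = (neuron(z,vp) - neuron(z,vm))/(2*h);  % central difference
end
max_diff = max(abs(analytic - numeric))
```

Because the output is linear in each individual weight, the two results should agree to within rounding error.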

The error terms and squared error are:

error=(t-y)
SSE=sum((t-y).^2)

error =
0.2322
0.2730
0.0145
SSE =
0.1287

Now let's train the network with a learning rate of 1.

lr=1; % Learning rate.
patterns=length(error); % Number of input patterns.
cycles=30; % Number of training iterations.

% Train the network.
inds=[1:length(v)];
for k=1:cycles
for j=1:patterns % Loop through each pattern.
for i=1:length(v); % Loop through each weight.
ind=find(inds~=i); % Specify weights other than i.
A=probor(product(z(j,ind)',v(ind)')); % Calculate T and S norms.
delv(i)=lr*error(j)*z(j,i)*(1-A); % Calculate weight update.
end
v=v+delv; % Update weights.
end

% Now check the new SSE.
for n=1:length(error); % For each input pattern
I=product(z(n,:)',v'); % Calculate product T-norm operation.
y(n)=probor(I); % Calculate probabilistic sum S-norm.
end
error=(t-y); % Calculate the error terms.
SSE=sum((t-y).^2); % Find the sum of squared errors.
end
SSE
y

SSE =
1.3123e-004
y =
0.6909
0.8063
0.4030

This closely matches the target vector: t=[.7 .8 .4]. This gradient descent training
algorithm can be expanded to train multilayer fuzzy neural networks. The previous
section derived the weight update for a fuzzy OR output neuron. We will now derive
the weight update for a fuzzy multilayer neural network with AND hidden units and OR
output units.

We use the chain rule to find the gradient vector.

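$$\frac{\partial \varepsilon^2}{\partial w_{hi}} = \frac{\partial \varepsilon^2}{\partial y_j}\,\frac{\partial y_j}{\partial z_h}\,\frac{\partial z_h}{\partial w_{hi}}$$
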
and

$$\frac{\partial \varepsilon^2}{\partial y_j} = -2\,(t_j - y_j)$$

Since the inputs and weights in an AND or an OR neuron are treated the same, the partial
derivative with respect to the inputs (z_h) is of the same form as the partial derivative
with respect to the weights (v_h) that was derived above. Therefore, substituting A for
the norms not containing z_q, we get a solution of the same form.

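$$A = \mathop{S}_{h \neq q}^{p}\,[\,z_h \; T \; v_{jh}\,]$$

$$\frac{\partial y_j}{\partial z_q} = \frac{\partial}{\partial z_q}\left(A + v_{jq} z_q - A\,v_{jq} z_q\right) = v_{jq}\,(1 - A)$$
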

For a product T-norm and a probabilistic sum S-norm, this results in:

$$\frac{\partial y_j}{\partial z_q} = v_{jq}\,(1 - A) = v_{jq}\left(1 - \operatorname{probor}_{h \neq q}^{p}(v_{jh} z_h)\right)$$


For the second term in the chain rule:





$$\frac{\partial z_h}{\partial w_{hi}} = \frac{\partial}{\partial w_{hi}} \prod_{i=1}^{n} (x_i + w_{hi} - x_i w_{hi})$$


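For a specific input weight, w_hr:

$$\frac{\partial z_h}{\partial w_{hr}} = \frac{\partial}{\partial w_{hr}}\left[\prod_{i \neq r}^{n} (x_i + w_{hi} - x_i w_{hi}) \cdot (x_r + w_{hr} - x_r w_{hr})\right]$$

$$= \frac{\partial}{\partial w_{hr}}\left[\prod_{i \neq r}^{n} (x_i + w_{hi} - x_i w_{hi})\,x_r + \prod_{i \neq r}^{n} (x_i + w_{hi} - x_i w_{hi})\,w_{hr} - \prod_{i \neq r}^{n} (x_i + w_{hi} - x_i w_{hi})\,x_r w_{hr}\right]$$

$$= \prod_{i \neq r}^{n} (x_i + w_{hi} - x_i w_{hi})\,(1 - x_r)$$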

Combining the above chain rule terms results in:

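$$\frac{\partial \varepsilon^2}{\partial w_{hr}} = \frac{\partial \varepsilon^2}{\partial y_j}\,\frac{\partial y_j}{\partial z_h}\,\frac{\partial z_h}{\partial w_{hr}}$$

$$\Delta w_{hr} = -\eta\,\frac{\partial \varepsilon^2}{\partial w_{hr}} = \eta\,\varepsilon \prod_{i \neq r}^{n} (x_i + w_{hi} - x_i w_{hi})\,(1 - x_r)\; v_{jh}\left(1 - \operatorname{probor}_{k \neq h}(v_{jk} z_k)\right)$$

Here the derivative with respect to a hidden output is evaluated for hidden neuron h,
and the factor of 2 is again combined into the learning rate.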
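
The hidden-layer gradient can be verified with a finite difference. The sketch below assumes AND hidden neurons (product of probabilistic ORs) and an OR output neuron (probabilistic sum of products); the helper functions por, probsum and andn, the weight values, and the chosen indices are illustrative assumptions, not part of the supplement's m-files.

```matlab
por     = @(a,b) a + b - a.*b;         % Elementwise probabilistic OR (assumed helper).
probsum = @(a) 1 - prod(1 - a);        % Probabilistic sum of a vector (assumed helper).
andn    = @(x,w) prod(por(x,w));       % AND neuron: product of ORs.

x = [0 0.2 0.7];                       % One pattern; hidden bias input is 0.
W = [0.3 0.6 0.1; 0.8 0.4 0.5];        % Hypothetical hidden weights (2 neurons).
v = [0.2 0.5 0.7];                     % Hypothetical output weights (bias first).

% Forward pass as a function of the hidden weight matrix.
fwd = @(W) probsum([1 andn(x,W(1,:)) andn(x,W(2,:))].*v);

hq = 1;  r = 2;                        % Hidden neuron and input weight to check.
Z  = [1 andn(x,W(1,:)) andn(x,W(2,:))];% Output-layer inputs (bias input is 1).

% dy/dz_h = v_h*(1 - A), with A over the other output-layer terms.
others = setdiff(1:3, hq+1);
A      = probsum(Z(others).*v(others));
dy_dz  = v(hq+1)*(1 - A);

% dz_h/dw_hr = prod_{i~=r}(x_i + w_hi - x_i*w_hi)*(1 - x_r).
ir    = setdiff(1:3, r);
dz_dw = prod(por(x(ir), W(hq,ir)))*(1 - x(r));
analytic = dy_dz*dz_dw;

% Central finite difference on the same weight.
h  = 1e-6;
Wp = W;  Wp(hq,r) = Wp(hq,r) + h;
Wm = W;  Wm(hq,r) = Wm(hq,r) - h;
numeric = (fwd(Wp) - fwd(Wm))/(2*h);
fprintf('analytic = %.6f, numeric = %.6f\n', analytic, numeric);
```

Multiplying this derivative by the learning rate and the error term gives the weight change for w_hr under the stated assumptions.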



As an example of training a multilayer fuzzy neural network, we will consider a network
with two hidden AND neurons and one output OR neuron.
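
[Figure: Example Fuzzy Neural Network. The inputs x0, x1 and x2 (index=i) feed two hidden AND neurons through the weights w_hi; the hidden outputs and the bias input z0 (index=h) feed one output OR neuron through the weights v_h to produce y.]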

In this case we will implement biases in the hidden layer. Since these are AND fuzzy
neurons, the dummy input must equal 0. A bias will also be implemented in the output
node with a dummy input equal to 1. The forward pass results in:

clear all;rand('seed',10)
x=[0 0.2 0.7;0 0.3 0.9;0 0.7 0.2]; % Training input data with bias.
t=[0.7 ;0.8 ;0.4 ]; % Training output target data.
w=rand(2,3); % Random initial hidden layer weights.
v=rand(1,3); % Random initial output layer weights.
z=zeros(3,2); % Hidden vector (three patterns, 2 outputs).
y=zeros(3,1); % Output vector.(three patterns)
for k=1:3 % For each pattern.
for h=1:2 % For each hidden neuron.
z(k,h)=prod(probor(x(k,:)',w(h,:)'));% Output of each hidden node.
end
I=product([1 z(k,:)]',v'); % Calculate product T-norm operation.
y(k)=probor(I); % Calculate probabilistic sum S-norm.
end
error=(t-y)
SSE=sum((t-y).^2)

error =
0.2654
0.2987
0.1153
SSE =
0.1730

Now let's train the network with a learning rate of 0.5.

lr=0.5; % Learning rate.
patterns=size(error,1); % Number of input patterns.
cycles=20; % Number of training iterations.
[h_nodes,inputs]=size(w); % Define number of inputs and hidden nodes.
indv=[1:size(v,2)];indw=[1:size(w,2)]; % weight matrix indices.

% Update the network output layer.
for m=1:cycles
for j=1:patterns % Loop through patterns.
for i=1:size(v,2); % Loop through weights.
ind=find(indv~=i); % Specify weights not = i.
Z=[ones(patterns,1) z]; % Output biases.
A=probor(product(Z(j,ind)',v(ind)')); % Calculate T and S norms.
delv(i)=lr*error(j)*Z(j,i)*(1-A); % Calculate weight update.
end
v=v+delv; % Update weights.
end

% Update the network hidden layer.
for j=1:patterns % Loop through patterns.
for l=1:h_nodes
for i=1:inputs; % Loop through weights.
ind=find(indw~=i); % Specify weights not = i.
A=probor(product(Z(j,l)',v(l)')); % Calculate OR norms.
B=prod(probor(x(j,ind)',w(l,ind)')); % Calculate AND norms.
delw(l,i)=lr*error(j)*B*(1-x(j,i))*v(l)*A; % Compute update.
end
end
w=w+delw; % Update weights.
end

% Now check the new error terms.
for k=1:patterns % For each pattern.
for h=1:h_nodes % For each hidden neuron.
z(k,h)=prod(probor(x(k,:)',w(h,:)'));%Output of hidden nodes.
end
I=product([1 z(k,:)]',v'); % Calculate product T-norm operation.
y(k)=probor(I); % Calculate probabilistic sum S-norm.
end
error=(t-y);
end
SSE=sum((t-y).^2)
y

SSE =
0.0049
y =
0.6704
0.7615
0.4502

The error has been drastically reduced and the output vector has moved towards the
target vector: t=[.7 .8 .4]. By varying the learning rate and training for more iterations,
this error can be reduced even further.
Chapter 13 Neural Methods in Fuzzy Systems
13.1 Introduction
Neural network training techniques allow us to embed empirical information into a
fuzzy system. This greatly expands the range of applications in which fuzzy systems
can be used. The ability to make use of both expert and empirical information greatly
enhances the utility of fuzzy systems.

One of the limitations of fuzzy systems comes from the curse of dimensionality. Expert
information about the relationship to be modeled may make it possible to reduce the rule
set from all combinations to the few that are important. In this way, expert knowledge
may make a problem tractable.

A limitation of using only expert knowledge is the inability to efficiently tune a fuzzy
system to give precise outputs for several, possibly contradictory, input combinations.
The use of error based training techniques allows fuzzy systems to learn the intricacies
inherent in empirical data.
13.2 Fuzzy-Neural Hybrids
Neural methods can be used in constructing fuzzy systems in ways other than training.
They can also be used for rule selection, membership function determination and in what
we can refer to as hybrid systems. Hybrid systems are systems that employ both neural
networks and fuzzy systems. These hybrid systems may make use of a supervisory fuzzy
system to select the best output from several neural networks or may use a neural
network to intelligently combine the outputs from several fuzzy systems. The
combinations and applications are endless.
13.3 Neural Networks for Determining Membership Functions
Membership function determination can be viewed as a data clustering and classification
problem. First the data is classified into clusters and then membership values are given
to the individual patterns in the clusters. Neural network architectures such as Kohonen
Self Organizing Maps are well suited to finding clusters in input data. After cluster
centers are identified, the width parameters of the SOM functions, which are usually
Gaussian, can be set so that the SOM outputs are the membership values.

One methodology of implementing this type of two stage process is called the Adeli-
Hung Algorithm. The first stage is called classification while the second stage is called
fuzzification. Suppose you have N input patterns with M components or inputs. The
Adeli-Hung Algorithm constructs a two layer neural network with M inputs and C
clusters. Clusters are added as new inputs, which do not closely resemble old clusters,
are presented to the network. This ability to add clusters and allow the network to grow
resembles the plasticity inherent in ART networks. The algorithm is implemented in the
following steps:

1. Calculate the degree of difference between the input vector X and each cluster center
C_i. A Euclidean distance can be used:

$$\operatorname{dist}(X, C_i) = \sqrt{\sum_{j=1}^{M} (x_j - c_{ij})^2}$$


where: x_j is the jth input.
c_ij is the jth component of the ith cluster.
M is the number of components in each input pattern.

2. Find the closest cluster to the input pattern and call it C_p.

3. Compare the distance to this closest cluster, dist(X,C_p), with some predetermined
distance. If the input is closer than the predetermined distance, add it to that cluster;
if it is further than the predetermined distance, add a new cluster and center it on the
input vector. When an input is added to a cluster, the cluster center (prototype vector)
is recalculated as the mean of all patterns in the cluster.

$$C_p = [\,c_{p1}\; c_{p2}\; \ldots\; c_{pM}\,] = \frac{1}{n_p} \sum_{i=1}^{n_p} X_i^p$$

4. The membership of an input vector X_i to a cluster C_p is defined as

$$\mu_p = \begin{cases} 0, & \text{if } D^w(X_i^p, C_p) > \kappa \\ 1 - \dfrac{D^w(X_i^p, C_p)}{\kappa}, & \text{if } D^w(X_i^p, C_p) \le \kappa \end{cases}$$


where:
κ is the width of the triangular membership function.
D^w(X_i^p, C_p), the weighted norm, is the Euclidean distance:

$$D^w(X_i^p, C_p) = \sqrt{\sum_{j=1}^{M} (x_{ij}^p - c_{pj})^2}$$

This results in triangular membership functions with unity membership at the center and
linearly decreasing to 0 at the tolerance distance. An example of a MATLAB
implementation of the Adeli-Hung algorithm is given below.

X=[1 0 .9;0 1 0;1 1 1;0 0 0;0 1 1;.9 1 0;1 .9 1;0 0 .1];% Inputs
C=[1 0 1]; % Matrix of prototype clusters.
data=[1 0 1 1]; % Matrix of data; 4th column specifies class.
tolerance=1; % Tolerance value used to create new clusters.
for k=1:length(X);
% Step 1: Find the Euclidean distance to each cluster center.
[p,inputs]=size(C);
for i=1:p
distance(i)=dist(X(k,:),C(i,:)).^.5;
end
% Step 2: Find the closest cluster.
ind=find(min(distance)==distance);
% Step 3: Compare with a predefined tolerance.
if min(distance)>tolerance
C=[C;X(k,:)]; % Make X a new cluster center.
data=[data;[X(k,:) p+1]];
else % Recalculate the existing cluster center.
data=[data;[X(k,:) ind]]; % Add new data pattern.
cluster_inds=find(data(:,4)==ind); % Patterns assigned to this cluster.
for j=1:inputs
C(ind,j)=sum(data(cluster_inds,j))/length(cluster_inds);
end
end
% Step 4: Calculate memberships to all p classes.
mu=zeros(p,1);
for i=1:p
D=dist(X(k,:),C(i,:)).^.5;
if D<=tolerance;
mu(i)=1-D/tolerance;
end
end
end
C % Display the cluster prototypes.
data % Display all input vectors and their classification.
mu % Display membership of last input to each prototype.
save AHAdata data C % Save results for the next section.

C =
1.0000 0 0.9500
0 0.3333 0.0333
0.6667 0.9667 1.0000
0.9000 1.0000 0
data =
1.0000 0 1.0000 1.0000
1.0000 0 0.9000 1.0000
0 1.0000 0 2.0000
1.0000 1.0000 1.0000 3.0000
0 0 0 2.0000
0 1.0000 1.0000 3.0000
0.9000 1.0000 0 4.0000
1.0000 0.9000 1.0000 3.0000
0 0 0.1000 2.0000
mu =
0
0.6601
0
0

Four clusters were created from the X data matrix and their prototype vectors were
stored in the matrix C. The first three elements in each row of the data matrix are the
original data and the last column contains the identifier for the cluster closest to it. The
vector mu contains the membership of the last data vector to each of the clusters. It can
be seen that it is closest to prototype vector number 2.

The Adeli-Hung Algorithm does a good job of clustering the data and finding their
membership to the clusters. This can be used to preprocess data to be input to a fuzzy
system.
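
The membership calculation of step 4 can also be written without the supplement's dist function. The following is a minimal vectorized sketch, assuming the rounded cluster prototypes printed above and a triangular width equal to the tolerance of 1; the variable names are illustrative.

```matlab
% Cluster prototypes as printed in the example output (rounded values).
C = [1.0000 0      0.9500
     0      0.3333 0.0333
     0.6667 0.9667 1.0000
     0.9000 1.0000 0     ];
x     = [0 0 0.1];                 % Last input pattern of the example.
kappa = 1;                         % Width of the triangular membership function.

% Euclidean distance from x to every prototype (one row per cluster).
D  = sqrt(sum((C - repmat(x, size(C,1), 1)).^2, 2));

% Triangular membership: 1 at the center, 0 at or beyond kappa.
mu = max(0, 1 - D/kappa)
```

With these values only the second cluster lies within the tolerance distance, which agrees with the mu vector displayed above.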
13.4 Neural Network Driven Fuzzy Reasoning
Fuzzy systems that have several inputs suffer from the curse of dimensionality. This
section will investigate and implement the Takagi-Hayashi (T-H) method for the
construction and tuning of fuzzy rules, which is commonly referred to as neural network
driven fuzzy reasoning. The T-H method is an automatic procedure for extracting rules
and can greatly reduce the number of rules in a high dimensional problem, thus making
the problem tractable.

The T-H method performs three major functions:

1. Partitions the decision hyperspace into a number of rules. It performs this with a
clustering algorithm.
2. Identifies a rule's antecedent values (left hand side membership function). It performs
this with a neural network.
3. Identifies a rule's consequent values (right hand side membership function) by using a
neural network with supervised training. This part necessitates the existence of target
outputs.

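[Figure: Takagi-Hayashi Method. The inputs x_1, x_2, ..., x_n feed NN_mem and the consequent networks NN_1, NN_2, ..., NN_r; the NN_mem outputs w_1, w_2, ..., w_r multiply the consequent outputs u_1, u_2, ..., and the products are summed to give y.]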

The above block diagram represents the T-H Method of fuzzy rule extraction. This
method uses a variation of the Sugeno fuzzy rule:

if x_1 is A_1 AND x_2 is A_2 AND ... x_n is A_n then y=f(x_1, ..., x_n)

where f() is a neural network model rather than a mathematical function. This results in a
rule of the form:

if x_1 is A_1 AND x_2 is A_2 AND ... x_n is A_n then y=NN(x_1, ..., x_n).

The NN_mem calculates the membership of the input to the LHS membership functions and
outputs the membership values. The other neural networks form the RHS of the rules.
The LHS membership values weight the RHS neural network outputs through a product
function. The weighted RHS values are aggregated to calculate the T-H system output.
The neural networks are standard feedforward multilayer perceptron designs.

The following steps implement the T-H Method. The T-H method also implements
methods for reducing the neural network inputs to a small set of significant inputs and
checking them for overfitting during training.

Step 1: The training data x is clustered into r groups: R_1, R_2, ..., R_s {s=1,2,...,r}
with n_t^s terms in each group. Note that the number of inferencing rules will be equal to r.

Step 2: The NN_mem neural network is trained with the target values selected as:

$$w_i^s = \begin{cases} 1, & x_i \in R_s \\ 0, & x_i \notin R_s \end{cases} \qquad i = 1, \ldots, n_t; \quad s = 1, \ldots, r$$

The outputs of NN_mem for an input x_i are labeled w_i^s, and are the membership
values of x_i to each antecedent set R_s.

Step 3: The NN_s networks are trained to identify the consequent part of the rules. The
inputs are {x_i1^s, ..., x_im^s}, and the outputs are y^s, i=1,2,...,n_t.

Step 4: The final output value y is calculated with a weighted sum of the NN_s outputs:

$$y_i = \frac{\sum_{s=1}^{r} \mu_{A_s}(x_i)\,\{u_s(x_i)\}_{\text{inf}}}{\sum_{s=1}^{r} \mu_{A_s}(x_i)}, \qquad i = 1, 2, \ldots, n$$

where u_s(x_i) is the calculated output of NN_s and μ_As(x_i) is the membership value
output by NN_mem.
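
As a small illustration of Step 4, the sketch below combines four consequent outputs with four membership values. All numbers are hypothetical values chosen for the illustration; they are not outputs of the trained networks in this section.

```matlab
% Hypothetical NN_mem outputs (memberships to the four rule antecedents).
mu = [0.10; 0.80; 0.05; 0.05];
% Hypothetical outputs of the four consequent networks NN_1 ... NN_4.
u  = [11 4.5 1.5 7];

% Membership-weighted average of the consequent outputs.
Y = (u*mu)/sum(mu)
```

The rule with the largest membership dominates the result, so Y falls close to the second consequent output.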

An example of implementing the T-H method is given below.

Suppose the data and clustering results from the prior example are used. In this example
there are 9 input data vectors that have been clustered into four groups:

R_1 has 2 inputs assigned to it (n_t^1 = 2).
R_2 has 3 inputs assigned to it (n_t^2 = 3).
R_3 has 3 inputs assigned to it (n_t^3 = 3).
R_4 has 1 input assigned to it (n_t^4 = 1).

Therefore, there are four rules to implement in the system. First we will train the
network NN_mem.

load AHAdata
x=data(:,1:3); % The first three columns of data are the input patterns.
class=data(:,4); % The last column of data is the classification.
t=zeros(9,4);
for i=1:9 % Create the target vector so that the
t(i,class(i))=1; % classification column in each target pattern
end % is equal to a 1; all other are zero.
t
save TH_data x t

t =
1 0 0 0
1 0 0 0
0 1 0 0
0 0 1 0
0 1 0 0
0 0 1 0
0 0 0 1
0 0 1 0
0 1 0 0

We used the bptrain algorithm to train NN_mem and save the weights in th_m_wgt. The
network's architecture consists of three logistic hidden neurons and a logistic output
neuron. Now each of the consequent neural networks (NN_s) must be trained. To train
those networks, we must first define the target vectors y and then create the 4 training
data sets.

% In this example we will define single outputs for each input.
y=[10 12 5 2 4 1 7 1.5 4.5]';
r=4; % Number of classifications.
for s=1:r
ind=find(data(:,4)==s); % Identify the rows in each class.
x=data(ind,1:3); % Make the input training matrix.
t=y(ind,:); % make the target training matrix.
eval(['save th_',num2str(s),'_dat x t']);
end

The four neural networks with 1 logistic hidden neuron and a linear output neuron were
trained off line with bptrain. The weights were saved in th_1_wgt, th_2_wgt, th_3_wgt,
and th_4_wgt.

The following code implements the T-H network structure and evaluates the output for a
test vector.

xt=[-.1 .1 .05]'; % Define the test vector.
% Load the neural network weight matrices.
load th_1_wgt; W11=W1;W21=W2;
load th_2_wgt; W12=W1;W22=W2;
load th_3_wgt; W13=W1;W23=W2;
load th_4_wgt; W14=W1;W24=W2;
load th_m_wgt; W1n=W1;W2n=W2;
mu = logistic(W2n*[1;logistic(W1n*[1;xt])]); % NNmem outputs.
% Find the outputs of all four consequent networks.
u1 = W21*[1;logistic(W11*[1;xt])];
u2 = W22*[1;logistic(W12*[1;xt])];
u3 = W23*[1;logistic(W13*[1;xt])];
u4 = W24*[1;logistic(W14*[1;xt])];
u=[u1 u2 u3 u4]; % Vector of NNs outputs.
Ys=u*mu; % NNs outputs times the membership values (NNmem).
Y=sum(Ys)/sum(mu) % T-H Fuzzy System output

Y =
4.4096

The inputs closest to the test vector [-0.1 0.1 0.05] are the vectors [0 1 0], [0 0 0] and
[0 0 .1]. These three vectors have target values of 5, 4, and 4.5 respectively, so the
output of 4.41 is appropriate. The training vectors were all input to the system and the
network performed well for all 9 cases. The T-H method of using neural networks to
generate the antecedent and consequent membership functions has been found to be
useful and easy to implement with MATLAB.
13.5 Learning and Adaptation in Fuzzy Systems via Neural Networks
When experiential data exists, fuzzy systems can be trained to represent an input-output
relationship. By using gradient descent techniques, fuzzy system parameters, such as
membership functions (LHS or RHS), and the connectives between layers in an adaptive
network, can be optimized. Adaptation of fuzzy systems using neural network training
methods has been proposed by various researchers. Some of the methods described in
the literature are: 1) fuzzy system adaptation using gradient-descent error minimization
[Hayashi et al. 1992]; 2) optimization of a parameterized fuzzy system with symmetric
triangular-shaped input membership functions and crisp outputs using gradient-descent
error minimization [Nomura, 1994; Wang, 1994; Jang, 1995]; 3) gradient-descent with
exponential MFs [Ichihashi, 1993]; and 4) gradient-descent with symmetric and non-
symmetric LHS MFs varying connectives and RHS forms [Guély, Siarry, 1993].

Regardless of the method or the parameter of the fuzzy system chosen for adaptation, an
objective error function, E, must be chosen. Commonly, the squared error is chosen:

$$E = \tfrac{1}{2}\,(y - y^t)^2$$

where y^t is the target output and y is the fuzzy system output. Consider the ith rule of a
zero-order Sugeno fuzzy system consisting of n rules (i = 1, …, n). The figure below
presents a zero-order Sugeno system with m inputs and n rules.

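[Figure: Zero-Order Sugeno System. The inputs x_1, ..., x_m pass through the antecedent membership functions A_11, ..., A_nm to give μ_A(x); product nodes (Π) form the degrees of fulfillment w_1, ..., w_n; normalization nodes (N) and the consequents f_1 = r_1, ..., f_n = r_n give the rule outputs y_1, ..., y_n, which are summed (Σ) to give y.]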

where: x is the input vector.
A is an antecedent membership function (LHS).
μ_A(x) is the membership of x to set A.
w_i is the degree of fulfillment of the ith rule.
\bar{w}_i is the normalized degree of fulfillment of the ith rule.
r_i is a constant singleton membership function of the ith rule (RHS).
y_i is the output of the ith rule.

Mathematically, a zero-order Sugeno system is represented by the following equations:

$$w_i = \prod_{j=1}^{m} \mu_{A_{ji}}(x_j)$$

where μ_A(x_j) is the membership of x_j to the fuzzy set A, and w_i is the degree of
fulfillment of the ith rule. The normalized degree of fulfillment is given by:

$$\bar{w}_i = \frac{w_i}{\sum_{i=1}^{n} w_i}$$

The output (y) of a fuzzy system with n rules can be calculated as:

$$y = \sum_{i=1}^{n} \bar{w}_i f_i = \sum_{i=1}^{n} y_i$$

In this case, the system is a zero-order Sugeno system and f_i is defined as:

$$f_i = r_i$$

An example of a 1st order Sugeno system is given in section 13.6.

This notation is slightly different than that in the text, but it is consistent with Jang
(1996). When target outputs (y^rp) are given, the network can be adapted to reduce an
error measure. The adaptable parameters are the input membership functions and the
output singleton membership functions, r_i. Now we will look at an example of using
gradient descent to optimize the r_i values of this zero-order Sugeno system.
13.5.1 Zero Order Sugeno Fan Speed Control
Consider a zero-order Sugeno fuzzy system used to control the speed of a fan. The
objective is to maintain a comfortable temperature in the room based on two input
variables: temperature and activity. Three linguistic variables: "cool", "moderate", and
"hot" will be used to describe temperature, and two linguistic variables: "low" and "high"
will be used to describe the level of activity in the room. The fan speed will be a crisp
output value based on the following set of fuzzy rules:

if Temp is Cool AND Activity is Low then Speed is very_low (w_1)
if Temp is Cool AND Activity is High then Speed is low (w_2)
if Temp is Moderate AND Activity is Low then Speed is low_medium (w_3)
if Temp is Moderate AND Activity is High then Speed is medium (w_4)
if Temp is Hot AND Activity is Low then Speed is medium_high (w_5)
if Temp is Hot AND Activity is High then Speed is high (w_6)

The fuzzy antecedent and consequent membership functions are defined over the
universe of discourse with the following MATLAB code.

% Universe of Discourse
x = [0:5:100]; % Temperature
y = [0:1:10]; % Activity
z = [0:1:10]; % Fan Speed

% Temperature
cool_mf = mf_trap(x,[0 0 30 50],'n');
moderate_mf = mf_tri(x,[30 55 80],'n');
hot_mf = mf_trap(x,[60 80 100 100],'n');
antecedent_t = [cool_mf;moderate_mf;hot_mf];
plot(x,antecedent_t)
axis([-inf inf 0 1.2]);
title('Antecedent MFs for Temperature')
text(10, 1.1, 'Cool');
text(50, 1.1, 'Moderate');
text(80, 1.1, 'Hot');
xlabel('Temperature')
ylabel('Membership')

[Figure: Antecedent MFs for Temperature. Membership versus Temperature for the Cool, Moderate and Hot sets.]


low_act = mf_trap(y,[0 0 2 8],'n');
high_act = mf_trap(y,[2 8 10 10],'n');
antecedent_a = [low_act;high_act];
plot(y, antecedent_a);
axis([-inf inf 0 1.2]);
title('Antecedent MFs for Activity')
text(1, 1.1, 'Low');text(8, 1.1, 'High');
xlabel('Activity Level');ylabel('Membership')

[Figure: Antecedent MFs for Activity. Membership versus Activity Level for the Low and High sets.]

The consequent values of a Sugeno system are crisp singletons. The singletons for the
fan speed are defined as:

% Fan Speed Consequent Values.
very_low_mf = 1;
low_mf = 2;
low_medium_mf = 4;
medium_mf = 6;
medium_high_mf = 8;
high_mf = 10;
consequent_mf = [very_low_mf;low_mf;low_medium_mf;medium_mf;medium_high_mf;high_mf];
stem(consequent_mf,ones(size(consequent_mf)))
axis([0 11 0 1.2]);
title('Consequent Values for Fan Speed');
xlabel('Fan Speed')
ylabel('Membership')
text(.5, 1.1, 'Very_Low');
text(2.1, .6, 'Low');
text(3.5, 1.1, 'Low_Medium');
text(6.1, .6, 'Medium');
text(7.5, 1.1, 'Medium_High');
text(10.1, .6, 'High');

[Figure: Consequent Values for Fan Speed. Membership versus Fan Speed, with singletons for Very_Low, Low, Low_Medium, Medium, Medium_High and High.]


Now that we have defined the membership functions and consequent values, we will
evaluate the fuzzy system for an input pattern with temperature = 72.5 and activity = 6.1.
First we fuzzify the inputs by finding their membership to each antecedent membership
function.

temp = 72.5; % Temperature
mut1=mf_trap(temp, [0 0 30 50],'n');
mut2=mf_tri(temp, [30 55 80],'n');
mut3=mf_trap(temp, [60 80 100 100],'n');
MU_t = [mut1;mut2;mut3]

MU_t =
0
0.3000
0.6250

act = 6.1; % Activity
mua1 = mf_trap(act,[0 0 2 8],'n');
mua2 = mf_trap(act,[2 8 10 10],'n');
MU_a = [mua1;mua2]

MU_a =
0.3167
0.6833

Next, we apply the Sugeno fuzzy AND implication operation to find the degree of
fulfillment of each rule.

antecedent_DOF = [MU_t(1)*MU_a(1)
MU_t(1)*MU_a(2)
MU_t(2)*MU_a(1)
MU_t(2)*MU_a(2)
MU_t(3)*MU_a(1)
MU_t(3)*MU_a(2)]

antecedent_DOF =
0
0
0.0950
0.2050
0.1979
0.4271

A plot of the firing strengths of each of the rules is:

stem(consequent_mf,antecedent_DOF)
axis([0 11 0 .5]);
title('Consequent Firing Strengths')
xlabel('Fan Speed')
ylabel('Firing Strength')
text(0.2, antecedent_DOF(1)+.02, ' Very_Low');
text(1.5, antecedent_DOF(2)+.04, ' Low');
text(3.0, antecedent_DOF(3)+.02, ' Low_Medium');
text(5.1, antecedent_DOF(4)+.02, ' Medium');
text(7.0, antecedent_DOF(5)+.02, ' Medium_High');
text(9.5, antecedent_DOF(6)+.02, ' High');

[Figure: Consequent Firing Strengths. Firing Strength versus Fan Speed for Very_Low, Low, Low_Medium, Medium, Medium_High and High.]


The output is the average of the consequent values, weighted by the rule firing strengths.

output_y=sum(antecedent_DOF.*consequent_mf)/ sum(antecedent_DOF)

output_y =
8.0694

The fan speed would be set to a value of 8.07 for a temperature of 72.5 degrees and an
activity level of 6.1.
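
The rule evaluation above can be written more compactly with an outer product. This is a sketch only; it assumes the membership values already computed for temperature = 72.5 and activity = 6.1, entered here as exact fractions, and the variable names W, dof and r are illustrative.

```matlab
MU_t = [0; 0.3; 0.625];          % Cool, Moderate, Hot memberships (temp = 72.5).
MU_a = [1.9/6; 4.1/6];           % Low, High memberships (act = 6.1).
r    = [1; 2; 4; 6; 8; 10];      % Consequent singletons in rule order.

W   = MU_t*MU_a';                % W(i,j) = MU_t(i)*MU_a(j), product AND.
dof = reshape(W', [], 1);        % Rule order: (t1,a1),(t1,a2),(t2,a1),...

output_y = (dof'*r)/sum(dof)     % Weighted average of the singletons.
```

The transpose before reshape is needed because MATLAB reshapes column by column, while the rule list cycles through the activity sets fastest.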
13.5.2 Consequent Membership Function Training
If you desire a specific input-output model and example data are available, the
membership functions may be trained to produce the desired model. This is
accomplished by using a gradient-descent algorithm to minimize an objective error
function such as the one defined earlier.

The learning rule for the output crisp membership functions (r_i) is defined by

$$r_i(t'+1) = r_i(t') - lr \cdot \frac{\partial E}{\partial r_i}$$

where t' is the learning epoch. The update equation can be rewritten as [Nomura et al.
1994]

$$r_i(t'+1) = r_i(t') - lr \cdot \frac{\mu_i^p}{\sum_{i=1}^{n} \mu_i^p}\,(y^p - y^{rp}).$$

The m-file sugfuz.m demonstrates this use of gradient descent to optimize the output
membership functions (r_i).
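
The update rule can be illustrated on a very small problem. The sketch below is not sugfuz.m; it assumes a hypothetical two-rule system with one input whose normalized firing strengths are 1-x and x, a hypothetical target y = 2 + 3x, and a learning rate of 0.5. Under these assumptions the singletons that fit the target exactly are r = [2; 5].

```matlab
x    = (0:0.1:1)';               % Training inputs.
Wbar = [1-x, x];                 % Normalized firing strengths (rows sum to 1).
yt   = 2 + 3*x;                  % Hypothetical target outputs.
r    = [0; 0];                   % Initial consequent singletons.
lr   = 0.5;                      % Learning rate.

for epoch = 1:200
    for p = 1:numel(x)           % Sequential (pattern-by-pattern) updates.
        y = Wbar(p,:)*r;         % Zero-order Sugeno output for pattern p.
        r = r - lr*Wbar(p,:)'*(y - yt(p));   % r_i <- r_i - lr*wbar_i*(y - yt).
    end
end
r
```

Only the consequent values are adapted here; the antecedent membership functions are held fixed.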
13.5.3 Antecedent Membership Function Training
The gradient descent algorithm can also be used to optimize the antecedent membership
functions. The examples in this supplement use the triangular and trapezoidal
membership functions which have non-zero derivatives where the outputs are non-zero.
Other adaptable fuzzy systems may use gaussian or generalized bell antecedent
membership functions. These functions have non-zero derivatives throughout the
universe of discourse and may be easier to implement. Because their derivatives are
continuous and smooth, their training performance may also be better. Consider the
parameters of a symmetric triangular membership function (a_ij = peak; b_ij = support):

$$\mu_{ij}(x_j) = \begin{cases} 1 - \dfrac{|x_j - a_{ij}|}{b_{ij}/2}, & \text{if } |x_j - a_{ij}| \le \dfrac{b_{ij}}{2} \\ 0, & \text{otherwise} \end{cases}$$


A symmetric or non-symmetric triangular membership function can be created using the
mf_tri() function, which is described in more detail in Section 13.5.4. The
membership function can be plotted using the plot function or by using the plot flag. In
this example, the plot flag is set to no.

x= [0:1:10]; % universe of discourse
a=5;b=4; % [peak support]
mu_i= mf_tri(x,[a-b/2 a a+b/2],'n'); % calculate memberships
plot(x,mu_i); % plot memberships
title('Symmetric Triangular Membership Function');
axis([-inf inf 0 1.2]);xlabel('Input')
ylabel('Membership');text(4.5,1.1,'(a = Peak)');
text(2.5,.1,'(a-0.5b)');text(6.5,.1,'(a+0.5b)');

[Figure: Symmetric Triangular Membership Function. Membership versus Input, with the peak labeled (a = Peak) and the feet labeled (a-0.5b) and (a+0.5b).]


The peak parameter update rule is:

$$a_{ij}(t'+1) = a_{ij}(t') - \frac{\eta_a}{2p} \sum_{p'=1}^{p} \frac{\partial E_{p'}}{\partial a_{ij}}$$


where η_a is the learning rate for a_ij and p is the number of input patterns. Similar update
rules exist for the other MF parameters. The chain rule can be used to calculate the
derivatives used to update the MF parameters:










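$$\frac{\partial E}{\partial a_{ij}} = \frac{\partial E}{\partial y} \cdot \frac{\partial y}{\partial y_i} \cdot \frac{\partial y_i}{\partial w_i} \cdot \frac{\partial w_i}{\partial \mu_{ij}} \cdot \frac{\partial \mu_{ij}}{\partial a_{ij}}, \quad \text{for the peak parameter;}$$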

Using the symmetric triangular MF equations and the product interpretation for AND, the
partial derivatives are derived below.

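$$E = \tfrac{1}{2}\,(y - y^t)^2 \quad \text{so} \quad \frac{\partial E}{\partial y} = y - y^t = e$$

$$y = \sum_{i=1}^{n} y_i \quad \text{so} \quad \frac{\partial y}{\partial y_i} = 1$$

$$y_i = \frac{w_i}{\sum_{i=1}^{n} w_i}\, r_i \quad \text{so} \quad \frac{\partial y_i}{\partial w_i} = \frac{r_i - y}{\sum_{i=1}^{n} w_i}$$

$$w_i = \prod_{j=1}^{m} \mu_{A_{ji}}(x_j) \quad \text{so} \quad \frac{\partial w_i}{\partial \mu_{ij}(x_j)} = \frac{w_i}{\mu_{ij}(x_j)}$$

$$\mu_{ij}(x_j) = 1 - \frac{|x_j - a_{ij}|}{b_{ij}/2} \quad \text{so} \quad \frac{\partial \mu_{ij}}{\partial a_{ij}} = \frac{2\,\operatorname{sign}(x_j - a_{ij})}{b_{ij}} \quad \text{if } |x_j - a_{ij}| \le \frac{b_{ij}}{2}$$

and

$$\frac{\partial \mu_{ij}}{\partial a_{ij}} = 0 \quad \text{if } |x_j - a_{ij}| > \frac{b_{ij}}{2}$$

similarly:

$$\frac{\partial \mu_{ij}}{\partial b_{ij}} = \frac{1 - \mu_{ij}(x_j)}{b_{ij}}.$$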
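
The two membership function derivatives can be checked against finite differences. The sketch below uses anonymous functions written only for this check (they are not the supplement's mf_tri or dmf_tri), with the example values a = 5 and b = 4 and a test point x = 4 chosen for illustration.

```matlab
% Symmetric triangular MF with peak a and support b.
mu_fun = @(x,a,b) max(0, 1 - abs(x - a)/(b/2));
% Analytic derivatives, nonzero only under the triangle.
dmu_da = @(x,a,b) (abs(x - a) <= b/2) .* (2*sign(x - a)/b);
dmu_db = @(x,a,b) (abs(x - a) <= b/2) .* ((1 - mu_fun(x,a,b))/b);

x = 4;  a = 5;  b = 4;  h = 1e-6;

% Central finite differences with respect to each parameter.
num_da = (mu_fun(x,a+h,b) - mu_fun(x,a-h,b))/(2*h);
num_db = (mu_fun(x,a,b+h) - mu_fun(x,a,b-h))/(2*h);

fprintf('dmu/da: analytic %.4f, numeric %.4f\n', dmu_da(x,a,b), num_da);
fprintf('dmu/db: analytic %.4f, numeric %.4f\n', dmu_db(x,a,b), num_db);
```

Moving the peak toward the input raises the membership, which is why the derivative with respect to a is negative for an input to the left of the peak.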

Substituting these into the update equation we get
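
$$\frac{\partial E}{\partial a_{ij}} = (y - y^t) \cdot \frac{r_i - y}{\sum_{i=1}^{n} w_i} \cdot \frac{w_i}{\mu_{ij}(x_j)} \cdot \frac{2\,\operatorname{sign}(x_j - a_{ij})}{b_{ij}}$$

$$\frac{\partial E}{\partial b_{ij}} = (y - y^t) \cdot \frac{r_i - y}{\sum_{i=1}^{n} w_i} \cdot \frac{w_i}{\mu_{ij}(x_j)} \cdot \frac{1 - \mu_{ij}(x_j)}{b_{ij}}$$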
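
The gradient with respect to a peak parameter can be checked on a small system. The sketch below assumes a hypothetical zero-order Sugeno system with two rules, two inputs and symmetric triangular MFs; every parameter value and variable name is chosen only for illustration.

```matlab
mu_fun = @(x,a,b) max(0, 1 - abs(x - a)./(b/2));   % Symmetric triangular MF.

x  = [3.5 5];            % One input pattern (m = 2 inputs).
yt = 7;                  % Target output.
a  = [2 6; 5 4];         % a(i,j): peak for rule i, input j.
b  = [6 8; 6 8];         % b(i,j): support for rule i, input j.
r  = [3; 9];             % Consequent singletons.
X  = repmat(x, 2, 1);    % Input repeated for each rule.

% System output and error as functions of the peak matrix.
sugeno = @(a) sum(prod(mu_fun(X,a,b),2).*r) / sum(prod(mu_fun(X,a,b),2));
E      = @(a) 0.5*(sugeno(a) - yt)^2;

rule = 1;  inp = 1;      % Parameter a(1,1) is checked.
mu = mu_fun(X,a,b);      % Memberships mu(i,j).
w  = prod(mu,2);         % Degrees of fulfillment (product AND).
y  = sugeno(a);

% Analytic gradient from the chain rule.
analytic = (y - yt) * (r(rule) - y)/sum(w) * w(rule)/mu(rule,inp) ...
           * 2*sign(x(inp) - a(rule,inp))/b(rule,inp);

% Central finite difference on the same parameter.
h  = 1e-6;
ap = a;  ap(rule,inp) = ap(rule,inp) + h;
am = a;  am(rule,inp) = am(rule,inp) - h;
numeric = (E(ap) - E(am))/(2*h);
fprintf('analytic = %.6f, numeric = %.6f\n', analytic, numeric);
```

The check is only valid where the input lies strictly inside the support of the membership function and away from its peak, since the triangular function is not differentiable at those points.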


For the RHS membership functions we have:









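$$\frac{\partial E}{\partial r_i} = \frac{\partial E}{\partial y} \cdot \frac{\partial y}{\partial y_i} \cdot \frac{\partial y_i}{\partial r_i}$$

$$y_i = \bar{w}_i\, r_i \quad \text{so} \quad \frac{\partial y_i}{\partial r_i} = \bar{w}_i$$

resulting in the following gradient.

$$\frac{\partial E}{\partial r_i} = (y - y^t) \cdot 1 \cdot \bar{w}_i.$$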
13.5.4 Membership Function Derivative Functions
Four functions were created to calculate the MF values and derivatives of the MF with
respect to their parameters for non-symmetric triangular and trapezoidal MFs. They are
mf_tri.m, mf_trap.m, dmf_tri.m and dmf_trap.m.

The MATLAB code used in the earlier Fuzzy Logic Chapters required the input to be a
defined point on the universe of discourse. These functions do not have limitations on
the input and are therefore more general in nature. For example, if the universe of
discourse is defined as x=[0:1:10], earlier functions could only be evaluated at those
points and an error would occur for an input such as input=2.3. The functions described
in this section can handle all ranges of inputs.

Let's look at the triangular membership function outputs and derivatives first.

The function mf_tri can construct symmetrical or non-symmetric triangular membership
functions.

x=[-10:.2:10];
[mf1]=mf_tri(x,[-4 0 3],'y');

[Figure: Triangular Membership Function. Membership versus Input, with the parameters labeled Left Foot (a), Peak (b) and Right Foot (c).]


Derivative of non-symmetric triangular MF with respect to its parameters:

[mf1]=dmf_tri(x,[-4 0 3],'y');

[Figure: Triangular Membership Function Derivatives. dMF with respect to the parameters versus Input.]


Trapezoidal membership functions:

[mf1]=mf_trap(x,[-5 -3 3 5],'y');

[Figure: Trapezoidal Membership Function. Membership versus Input, with the parameters labeled (a), (b), (c) and (d).]


Derivative of trapezoidal MF with respect to its parameters:

[mf1]=dmf_trap(x,[-5 -3 3 5],'y');

[Figure: Trapezoidal Membership Function Derivatives. dMF with respect to the parameters versus Input.]


The m-file adaptri.m demonstrates the use of gradient descent to optimize parameters of
3 non-symmetric triangular MFs and consequent singletons to approximate the function

f(x) = 0.05*x^3 - 0.02*x^2 - 0.3*x + 20.

The m-file adpttrap.m demonstrates the use of gradient descent to optimize parameters of
3 trapezoid MFs and consequent singletons to approximate the same function.
13.5.5 Membership Function Training Example
As a simple example of using gradient descent to adjust the parameters of the antecedent
and singleton consequent membership functions, consider the following problem.
Assume that a large naval gun is properly sighted when it is pointing at 5 degrees. The
universe of discourse will be defined from 0 to 10 degrees, and the membership to
"Sighted Properly" will be defined as:

clear all;x= [0:.5:10]; % input
num_pts=size(x,2); % # of points
a=5; % peak
b=4; % support
mu_i=mf_tri(x,[a-b/2 a a+b/2],'y');
title('MF representing "Sighted Properly"');
xlabel('Direction in Degrees')
ylabel('Membership')

[Figure: MF representing "Sighted Properly". Membership versus Direction in Degrees, with the parameters labeled Left Foot (a), Peak (b) and Right Foot (c).]



To decide if an artillery shell will hit the target (on a scale of 1-10), consider the
following zero-order Sugeno fuzzy system with one rule and r = 10.

if Gun is Sighted Properly then "Chance of Hit" is 10 (r=10).

Suppose we have experimentally calculated the input-output surface to be:

r=10; % r value of zero-order Sugeno rule
y_t=mu_i*r; % Chance of Hit
plot(x,y_t);
title('Input-Output Surface for Gun');
axis([-inf inf 0 10.5]);
xlabel('Direction (Input)')
ylabel('Chance of Hit (Output)')

[Figure: "Input-Output Surface for Gun"; x-axis: Direction (Input), y-axis: Chance of Hit (Output)]


We will now demonstrate how the antecedent MF parameters of a symmetric triangular
MF and the consequent r value can be optimized using gradient descent. We will use the
input-output data above as training data and initialize the MFs to a=3, b=6, and r=8. The
general procedure will involve these steps:

1. A forward pass of all inputs to calculate the output;
2. The calculation of the error and SSE for all inputs;
3. The calculation of the gradient vectors;
4. The updating of the parameters based on the update rule; and
5. Repeating steps 1 through 4 until the training error goal is reached or the maximum
number of training epochs is exceeded.
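
The five steps above can be sketched as one self-contained loop in base MATLAB. This is only an illustration under stated assumptions: tri is a hypothetical stand-in for mf_tri that assumes a symmetric triangle with peak a and support b, and the learning rates and stopping values are the ones used in this example. Apart from the initial SSE, the numbers it produces are not claimed to match the printed output.

```matlab
% Sketch only: tri is a hypothetical stand-in for mf_tri (symmetric triangle
% with peak a and support b); learning rates follow the example in this section.
tri = @(x,a,b) max(1 - 2*abs(x-a)/b, 0);

x = 0:0.5:10;  num_pts = numel(x);
y_t = 10*tri(x,5,4);                 % target data: a=5, b=4, r=10

a = 3; b = 6; r = 8;                 % initial parameters
lr_a = 0.1; lr_b = 5; lr_r = 10;     % learning rates
maxcycles = 30; SSE_goal = 0.5;
SSE = zeros(1,maxcycles);

for k = 1:maxcycles
    mu = tri(x,a,b);                 % step 1: forward pass
    e = r*mu - y_t;                  % step 2: error ...
    SSE(k) = sum(e.^2);              %         ... and sum of squared errors
    if SSE(k) < SSE_goal, break; end
    ind = abs(x-a) <= b/2;           % points under the MF
    g_a = sum(r*e(ind).*(2*sign(x(ind)-a)/b));   % step 3: gradients
    g_b = sum(r*e(ind).*((1-mu(ind))/b));
    g_r = sum(e(ind).*mu(ind));
    a = a - lr_a/(2*num_pts)*g_a;    % step 4: parameter updates
    b = b - lr_b/(2*num_pts)*g_b;
    r = r - lr_r/(2*num_pts)*g_r;
end
SSE = SSE(1:k);                      % step 5 is the loop itself
```

With these initial values the first entry of SSE is about 314.56, which agrees with the initial SSE of 314.5556 reported in Step 2.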

Step 1: Forward pass of all inputs to calculate the output
% Initial MF parameters
r=8; % initial r value
a=3; % initial peak value
b=6; % initial support value
mu_i=mf_tri(x,[a-b/2 a a+b/2],'n'); % initial MFs
y=mu_i*r; % initial output
plot(x,y_t,x,y);
title('Input-Output Surface');
axis([-inf inf 0 10.5]);
text(2.5, 4.5, 'Initial')
text(4.5,6, 'Target')
xlabel('Direction (Input)')
ylabel('Chance of Hit (Output)')

[Figure: "Input-Output Surface"; x-axis: Direction (Input), y-axis: Chance of Hit (Output); curves labeled Initial and Target]


Step 2: Calculate the error and SSE for all inputs
e=y-y_t;
SSE=sum(sum(e.^2))

SSE =
314.5556

Step 3: Calculate the gradient vectors (note: one input, one rule)
ind=find(abs(x-a)<=(b/2)); % Locate indices under MF.
delta_a=r*e(ind).*((2*sign(x(ind)-a))/b);% Deltas for ind points.
delta_b=r*e(ind).*((1-mu_i(ind))/b);
delta_r=e(ind).*mu_i(ind);
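
The three delta expressions can be traced back to the chain rule. The derivation below is a sketch that assumes mf_tri implements the symmetric triangle written on its first line (the m-file is not listed here), and it applies only to points under the MF, which is why the code restricts the sums to ind.

```latex
\begin{aligned}
\mu(x) &= 1 - \frac{2\,|x-a|}{b}, \qquad y = r\,\mu(x), \qquad e = y - y_t, \qquad E = \tfrac{1}{2}\sum e^2 \\
\frac{\partial E}{\partial r} &= \sum e\,\mu \\
\frac{\partial E}{\partial a} &= \sum e\,r\,\frac{\partial \mu}{\partial a} = \sum e\,r\,\frac{2\,\mathrm{sign}(x-a)}{b} \\
\frac{\partial E}{\partial b} &= \sum e\,r\,\frac{\partial \mu}{\partial b} = \sum e\,r\,\frac{2\,|x-a|}{b^2} = \sum e\,r\,\frac{1-\mu}{b}
\end{aligned}
```

The summands correspond term by term to delta_r, delta_a, and delta_b. The factor 1/(2*num_pts) applied in Step 4 only rescales the learning rates.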

Step 4: Update the parameters with the update rule
lr_a=.1;
lr_b=5;
lr_r=10;
del_a=-((lr_a/(2*num_pts))*sum(delta_a));
del_b=-((lr_b/(2*num_pts))*sum(delta_b));
del_r=-((lr_r/(2*num_pts))*sum(delta_r));
a=a+del_a;
b=b+del_b;
r=r+del_r;

a =
3.7955
b =
4.3239
r =
5.8409

Let's see if the SSE is any better:

mu_i=mf_tri(x,[a-b/2 a a+b/2],'n');
y_new=mu_i*r; % new output
e=y_t-y_new; % error
SSE(2)=sum(sum(e.^2)) % Sum Squared error

SSE =
314.5556 189.1447

The SSE was reduced from about 315 to about 189 in one pass. Let's now iteratively train the fuzzy
system.

Step 5: Repeat steps 1 through 4

maxcycles=30;
SSE_goal=.5;
for i=2:maxcycles
mu_i=mf_tri(x,[a-b/2 a a+b/2],'n');
y_new=mu_i*r; % Output
e=y_new-y_t; % Output Error
SSE(i)=sum(sum(e.^2)); % SSE
if SSE(i) < SSE_goal; break;end
ind=find(abs(x-a)<=(b/2));
delta_a=r*e(ind).*((2*sign(x(ind)-a))/b);
delta_b=r*e(ind).*((1-mu_i(ind))/b);
delta_r=e(ind).*mu_i(ind);
del_a=-((lr_a/(2*num_pts))*sum(delta_a));
del_b=-((lr_b/(2*num_pts))*sum(delta_b));
del_r=-((lr_r/(2*num_pts))*sum(delta_r));
a=a+del_a;
b=b+del_b;
r=r+del_r;
end

Now let's plot the results:

plot(x,y_t,x,y_new,x,y);
title('Input-Output Surface');
axis([0 10 0 10.5]);
text(2.5,5, 'Initial');
text(4.5,7.5, 'Target');
text(4,6.5, 'Final');
xlabel('Input');ylabel('Output');

[Figure: "Input-Output Surface"; x-axis: Input, y-axis: Output; curves labeled Initial, Target, and Final]


Plot SSE training performance:

semilogy(SSE);title('Training Record SSE')
xlabel('Epochs');ylabel('SSE');grid;

[Figure: "Training Record SSE"; x-axis: Epochs, y-axis: SSE (logarithmic scale)]


The m-file adaptfuz.m demonstrates the use of gradient descent to optimize the MF
parameters and shows how the membership functions and input-output relationship
change during the training process.
13.6 Adaptive Network-Based Fuzzy Inference Systems
Jang and Sun [Jang, 1992, Jang and Gulley, 1995] introduced the adaptive network-based
fuzzy inference system (ANFIS). This system makes use of a hybrid learning rule to
optimize the fuzzy system parameters of a first order Sugeno system. A first order
Sugeno system can be graphically represented by:
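
[Figure: ANFIS architecture for a two-input, two-rule first-order Sugeno model [Jang 1995a]. Layer 1: membership functions A1, A2 (input x1) and B1, B2 (input x2); Layer 2: product nodes (Π) producing the firing strengths w1 and w2; Layer 3: normalization nodes (N) producing the normalized firing strengths; Layer 4: rule outputs y1 and y2, formed from the normalized firing strengths and the rule polynomials f1 and f2 of x1 and x2; Layer 5: summation node (Σ) producing y.]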

where the consequent parameters (p, q, and r) of the nth rule contribute through a first
order polynomial of the form:

$$f_n = p_n x_1 + q_n x_2 + r_n$$

13.6.1 ANFIS Hybrid Training Rule
The ANFIS architecture consists of two trainable parameter sets:

1). The antecedent membership function parameters [a,b,c,d].
2). The polynomial parameters [p,q,r], also called the consequent parameters.

The ANFIS training paradigm uses a gradient descent algorithm to optimize the
antecedent parameters and a least squares algorithm to solve for the consequent
parameters. Because it uses two very different algorithms to reduce the error, the training
rule is called a hybrid. The consequent parameters are updated first using a least squares
algorithm and the antecedent parameters are then updated by backpropagating the errors
that still exist.

The ANFIS architecture consists of five layers with the output of the nodes in each
respective layer represented by $O_i^l$, where i is the ith node of layer l. The following is a
layer by layer description of a two input two rule first-order Sugeno system.

Layer 1. Generate the membership grades:

$$O_i^1 = \mu_{A_i}(x)$$

Layer 2. Generate the firing strengths.

$$O_i^2 = w_i = \prod_{j=1}^{m} \mu_{A_i}(x)$$

Layer 3. Normalize the firing strengths.

$$O_i^3 = \bar{w}_i = \frac{w_i}{w_1 + w_2}$$


Layer 4. Calculate rule outputs based on the consequent parameters.

$$O_i^4 = y_i = \bar{w}_i f_i = \bar{w}_i (p_i x_1 + q_i x_2 + r_i)$$

Layer 5. Sum all the inputs from layer 4.

$$O_1^5 = \sum_i y_i = \sum_i \bar{w}_i f_i = (\bar{w}_1 x_1) p_1 + (\bar{w}_1 x_2) q_1 + \bar{w}_1 r_1 + (\bar{w}_2 x_1) p_2 + (\bar{w}_2 x_2) q_2 + \bar{w}_2 r_2$$

It is in this last layer that the consequent parameters can be solved for using a least squares
algorithm. Let us rearrange this last equation into a more usable form:

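$$O_1^5 = y = (\bar{w}_1 x_1) p_1 + (\bar{w}_1 x_2) q_1 + \bar{w}_1 r_1 + (\bar{w}_2 x_1) p_2 + (\bar{w}_2 x_2) q_2 + \bar{w}_2 r_2$$

$$y = \begin{bmatrix} \bar{w}_1 x_1 & \bar{w}_1 x_2 & \bar{w}_1 & \bar{w}_2 x_1 & \bar{w}_2 x_2 & \bar{w}_2 \end{bmatrix} \begin{bmatrix} p_1 \\ q_1 \\ r_1 \\ p_2 \\ q_2 \\ r_2 \end{bmatrix} = XW$$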

When input-output training patterns exist, the weight vector (W), which consists of the
consequent parameters, can be solved for using a regression technique.
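
As a minimal sketch of this rearrangement, the matrix X can be assembled from normalized firing strengths and the weight vector recovered by regression. The input patterns, firing strengths, and the parameter vector W_true below are made-up illustration values, not data from the text; pinv is used here as one possible regression technique.

```matlab
% Sketch only: the firing strengths and "true" consequent parameters below are
% made-up illustration values, not data from the text.
x1 = [1 2 3 4 5 6 7 8]';  x2 = [2 1 4 3 6 5 8 7]';   % p = 8 input patterns
w1 = [0.9 0.8 0.7 0.6 0.4 0.3 0.2 0.1]';             % rule 1 firing strengths
w2 = [0.1 0.3 0.2 0.5 0.6 0.9 0.7 0.8]';             % rule 2 firing strengths
nw1 = w1./(w1+w2);  nw2 = w2./(w1+w2);               % layer 3: normalize

% One row per pattern, columns ordered as [p1 q1 r1 p2 q2 r2].
X = [nw1.*x1, nw1.*x2, nw1, nw2.*x1, nw2.*x2, nw2];

W_true = [1; -2; 0.5; 3; 0.25; -1];                  % assumed parameters
y = X*W_true;                                        % noise-free targets

W = pinv(X)*y;                                       % least squares solution
max_err = max(abs(W - W_true));                      % recovery error
```

Because the targets are generated without noise and X has full column rank for these values, the regression returns the assumed parameters to within rounding error.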
13.6.2 Least Squares Regression Techniques
Since a least squares technique can be used to solve for the consequent parameters, we
will investigate three different techniques used to solve this type of problem. When the
output layer of a network or system performs a linear combination of the previous layer's
output, the weights can be solved for using a least squares method rather than iteratively
trained with a backpropagation algorithm.

The rule outputs are represented by the p by n matrix X, with p rows equal to the number
of input patterns and n columns equal to the number of consequent parameters (three per
rule in the two-input system above). The desired or target outputs are represented by the
p by m matrix Y, with p rows and m columns equal to the number of outputs. Setting the
problem up in this format allows us to use a least squares solution for the weight matrix
W of dimension n by m.

Y=X*W

If the X matrix were invertible, it would be easy to solve for W:

W=X^(-1)*Y

but this is not usually the case. Therefore, a pseudoinverse can be used to solve for W:

W=(X^T*X)^(-1)*X^T*Y (where X^T is the transpose of X)

This will minimize the error between the predicted Y and the target Y. However, this
method involves inverting (X'*X), which can cause numerical problems when the
columns of X are dependent. Consider the following regression example.

% Pseudoinverse Solution

x=[2 1
4 2
6 3
8 4];

y=[3 6 9 12]';

w=inv(x'*x)*x'*y;
sse=sum(sum((x*w-y).^2))

Warning: Matrix is singular to working precision.

sse =
Inf

Although dependent rows can be removed, real data is rarely exactly dependent. Consider data
with a small amount of noise, which results in a marginally independent case. We will
use the marginally independent data for training and the noise free data to check for
generalization.

X=[2.000000001 1
3.999999998 2
5.999999998 3
8.000000001 4];

w=inv(X'*X)*X'*y
SSE=sum(sum((X*w-y).^2))
sse=sum(sum((x*w-y).^2))

Warning: Matrix is close to singular or badly scaled.
Results may be inaccurate. RCOND = 7.894919e-018

w =
-0.3742
0.7496
SSE =
269.7693
sse =
269.7693

The warning tells us that numerical instabilities may result from the nearly singular case
and the SSE shows that instabilities did occur. A case with independent patterns results
in no errors or warnings.

X1=[2.001 1
4.00 2
6.00 3
8.00 4.001];

% Pseudoinverse Solution
w=inv(X1'*X1)*X1'*y
SSE=sum(sum((X1*w-y).^2))
sse=sum(sum((x*w-y).^2))

w =
0.9504
1.0990
SSE =
1.1583e-006
sse =
9.5293e-007

Better regression methods use the LU (Lower Triangular, Upper Triangular) or more
robust QR (Orthogonal, Right Triangular) decompositions rather than a simple inversion
of the matrix, and the best method uses the singular value decomposition (SVD) [Masters
1993]. The MATLAB command / uses a QR decomposition with pivoting. It provides a
least squares solution to an under or over-determined system.
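
To make the role of the decomposition visible, the following sketch solves the independent-pattern case explicitly with an economy-size QR factorization and compares the result with MATLAB's built-in solver. It uses the left-division form X1\y, which is equivalent to the (y'/X1')' form used in this section. The variable names w_qr, w_bs, and diff_w are introduced only for this illustration.

```matlab
% Sketch only: explicit economy-size QR solve on the independent patterns X1
% used above, compared with the backslash operator.
X1 = [2.001 1; 4.00 2; 6.00 3; 8.00 4.001];
y  = [3 6 9 12]';

[Q,R] = qr(X1,0);        % X1 = Q*R, Q has orthonormal columns, R is upper triangular
w_qr = R\(Q'*y);         % back-substitution on the triangular system
w_bs = X1\y;             % MATLAB's built-in least squares solve

diff_w = max(abs(w_qr - w_bs));      % agreement between the two solutions
sse = sum((X1*w_qr - y).^2);         % training error of the QR solution
```

Neither route forms X1'*X1, so the conditioning of the problem is not squared as it is in the inverse-based pseudoinverse formula.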

% QR Decomposition

w=(y'/X')'
sse=sum(sum((X*w-y).^2))
SSE=sum(sum((x*w-y).^2))

w =
0.0000
3.0000
sse =
7.2970e-030
SSE =
7.2970e-030

The SVD method of solving for the output weights also has the advantage of giving the
user control to remove unimportant information that may be related to noise. By
removing this unimportant information, one lessens the chance of overfitting the function
to be approximated. The SVD method decomposes the X matrix into a diagonal matrix S
of the same dimension as X that contains the singular values, a unitary matrix U of
principal components, and an orthonormal matrix V of right singular vectors.

X = U S V^T

The singular values in S are positive and arranged in decreasing order. Their magnitude
is related to the information content of the columns of U (principal components) that span
X. Therefore, to remove the noise effects on the solution of the weight matrix, we simply
remove the columns of U that correspond to small diagonal values in S. The weight
matrix is then solved for using:

W = V S^(-1) U^T Y

The SVD methodology uses the most relevant information to compute the weight matrix
and discards unimportant information that may be due to noise. Application of this
methodology has sped up the training of neural networks 40 fold [Uhrig et al. 1996] and
resulted in networks with superior generalization capabilities.

% Singular Value Decomposition (SVD)

[u,s,v]=svd(X,0);
inv_s=inv(s)

inv_s =
1.0e+008 *
0.0000 0
0 7.3855

We can see that the second singular value is very small (its inverse is on the order of
1e8), so it carries very little information. Therefore, we discard its corresponding column
in U. This discards the information related to the noise and keeps the solution from
attempting to fit the noise.

for i=1:2
if s(i,i)<.1
inv_s(i,i)=0;
end
end
w=v*inv_s*u'*y
sse=sum(sum((x*w-y).^2))
SSE=sum(sum((X*w-y).^2))

w =
1.2000
0.6000
sse =
1.2000e-018
SSE =
1.3200e-017

The SVD method did not reduce the error as much as did the QR decomposition method,
but this is because the QR method tried to fit the noise. Remember that this is overfitting
and is not desired. MATLAB has a pinv() function that automatically calculates a
pseudo-inverse using the SVD.

w=pinv(X,1e-5)*y
sse=sum(sum((x*w-y).^2))
SSE=sum(sum((X*w-y).^2))

w =
1.2000
0.6000
sse =
1.2000e-018
SSE =
1.3200e-017

We see that the results of the two SVD methods are close to identical. The pinv function
will be used in the following example.
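
The equivalence of the two SVD routes can be checked directly. The sketch below repeats the truncation in vectorized form, without the explicit loop, and compares it with pinv on the same noisy data. The names keep, w_svd, w_pinv, and diff_w are assumptions made for this illustration, and the cutoff is the same 1e-5 tolerance passed to pinv in this section.

```matlab
% Sketch only: vectorized truncated-SVD solve; the 1e-5 cutoff is the same
% tolerance passed to pinv above.
X = [2.000000001 1; 3.999999998 2; 5.999999998 3; 8.000000001 4];
y = [3 6 9 12]';
tol = 1e-5;

[U,S,V] = svd(X,0);              % economy-size decomposition, X = U*S*V'
s = diag(S);                     % singular values, largest first
keep = s > tol;                  % discard directions dominated by noise
w_svd = V(:,keep)*((U(:,keep)'*y)./s(keep));

w_pinv = pinv(X,tol)*y;          % same truncation done by pinv
diff_w = max(abs(w_svd - w_pinv));
```

Only one singular value survives the cutoff for this data, and both solutions come out close to the weights [1.2; 0.6] printed above.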
13.6.3 ANFIS Hybrid Training Example
To better understand the ANFIS architecture and training paradigm consider the
following example of a first order Sugeno system with three rules:

if x is $A_1$ then $f_1 = p_1 x + r_1$
if x is $A_2$ then $f_2 = p_2 x + r_2$
if x is $A_3$ then $f_3 = p_3 x + r_3$


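This ANFIS has three trapezoidal membership functions ($A_1$, $A_2$, $A_3$) and can be
represented by the following diagram:

[Figure: ANFIS architecture for a one-input first-order Sugeno fuzzy model with three rules. The input x feeds membership functions A1, A2, and A3; product nodes (Π) produce the firing strengths w1, w2, w3; normalization nodes (N) produce the normalized firing strengths; these are combined with the rule polynomials f1, f2, f3 of x to give the rule outputs y1, y2, y3; a summation node (Σ) produces y.]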

First we will create training data for the function to be approximated:

$$f(x) = 0.05x^3 - 0.02x^2 - 0.3x + 20$$


x=[-10:1:10];clf;
y_rp=.05*x'.^3-.02*x'.^2-.3*x'+20;
num_pts=size(x,2); % number of input patterns
plot(x,y_rp);
xlabel('Input');
ylabel('Output');
title('Function to be Approximated');

[Figure: "Function to be Approximated"; x-axis: Input, y-axis: Output]


Now we will step through the ANFIS training paradigm.

Layer 1. Generate the membership grades:

$$O_i^1 = \mu_{A_i}(x)$$

% LAYER 1 MF values
% Initialize Antecedent Parameters
a1=-17; b1=-13; c1=-7; d1=-3;
a2=-9; b2=-4; c2=3; d2=8;
a3=3; b3=7; c3=13; d3=17;
[mf1]=mf_trap(x,[a1 b1 c1 d1],'n');
[mf2]=mf_trap(x,[a2 b2 c2 d2], 'n');
[mf3]=mf_trap(x,[a3 b3 c3 d3], 'n');

% Plot MFs
plot(x,mf1,x,mf2,x,mf3);
axis([-inf inf 0 1.2]);
title('Initial Input MFs')
xlabel('Input');
ylabel('Membership');
text(-9, .7, 'mf1');
text(-1, .7, 'mf2');
text(8, .7, 'mf3');

[Figure: "Initial Input MFs"; x-axis: Input, y-axis: Membership; curves labeled mf1, mf2, mf3]


Layer 2. Generate the firing strengths.

$$O_i^2 = w_i = \mu_{A_i}(x)$$

% LAYER 2 calculates the firing strength of the
% 1st order Sugeno rules. Since each rule only has one antecedent
% membership function, no product operation is necessary.
w1=mf1; % rule 1
w2=mf2; % rule 2
w3=mf3; % rule 3

Layer 3. Normalize the firing strengths.

$$O_i^3 = \bar{w}_i = \frac{w_i}{w_1 + w_2 + w_3}$$


% LAYER 3
% Determines the normalized firing strengths for the rules (nw) and
% sets to zero if all rules are zero (prevent divide by 0 errors)
for j=1:num_pts
if (w1(:,j)==0 & w2(:,j)==0 & w3(:,j)==0)
nw1(:,j)=0;
nw2(:,j)=0;
nw3(:,j)=0;
else
nw1(:,j)=w1(:,j)/(w1(:,j)+w2(:,j)+w3(:,j));
nw2(:,j)=w2(:,j)/(w1(:,j)+w2(:,j)+w3(:,j));
nw3(:,j)=w3(:,j)/(w1(:,j)+w2(:,j)+w3(:,j));
end
end
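
The same normalization can also be written without a loop. The following is a sketch with made-up firing strengths, including one pattern where every rule is zero so that the divide-by-zero guard is exercised. The matrix names W, s, and NW are assumptions introduced for this illustration.

```matlab
% Sketch only: vectorized form of the normalization loop; the firing strengths
% below are made-up values, with one all-zero column to exercise the guard.
w1 = [1.0 0.5 0.0 0.0];
w2 = [0.0 0.5 1.0 0.0];
w3 = [0.0 0.0 0.5 0.0];

W  = [w1; w2; w3];               % one row per rule, one column per pattern
s  = sum(W,1);                   % total firing strength per pattern
s(s==0) = 1;                     % avoid 0/0; those columns stay all zero
NW = W./repmat(s,size(W,1),1);   % normalized firing strengths
nw1 = NW(1,:);  nw2 = NW(2,:);  nw3 = NW(3,:);
```

Replacing a zero sum by one leaves the corresponding column at zero, which matches the behavior of the loop above.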

Calculate the consequent parameters.

X_inner=[nw1.*x;nw2.*x;nw3.*x;nw1;nw2;nw3];
C_parms=pinv(X_inner')*y_rp % [p1 p2 p3 r1 r2 r3]

C_parms =
8.3762
0.7055
8.2927
57.4508
20.0514
-21.1564

Layers 4 and 5. Calculate the outputs using the consequent parameters.

$$O_i^4 = \bar{w}_i f_i = \bar{w}_i (p_i x + r_i)$$

$$O_1^5 = \sum_i \bar{w}_i f_i = (\bar{w}_1 x) p_1 + \bar{w}_1 r_1 + (\bar{w}_2 x) p_2 + \bar{w}_2 r_2 + (\bar{w}_3 x) p_3 + \bar{w}_3 r_3$$

% LAYERS 4 and 5
% Calculate the outputs using the inner layer outputs and the
% consequent parameters.
y=X_inner'*C_parms;

Plot the Results.

plot(x,y_rp,'+',x,y);
xlabel('Input','fontsize',10);
ylabel('Output','fontsize',10);
title('Function Approximation');
legend('Reference','Output')

[Figure: "Function Approximation"; x-axis: Input, y-axis: Output; legend: Reference, Output]


We can see that by just solving for the consequent parameters we have a very good
approximation of the function. We can also train the antecedent parameters using
gradient descent.
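
For readers without the supplement's m-files, the forward pass and least squares solve can be sketched in base MATLAB alone. This is an illustration under stated assumptions, not the author's code: trap is a hypothetical stand-in for mf_trap, the right foot of the third trapezoid is assumed to be 17 so that it mirrors the first, and the resulting parameter values are not claimed to match the C_parms printed above.

```matlab
% Sketch only: trap is a hypothetical stand-in for mf_trap, and d3 = 17 is an
% assumed value chosen so that the third trapezoid mirrors the first.
trap = @(x,p) max(min(min((x-p(1))/(p(2)-p(1)), 1), (p(4)-x)/(p(4)-p(3))), 0);

x = -10:1:10;
y_rp = 0.05*x'.^3 - 0.02*x'.^2 - 0.3*x' + 20;     % function to approximate

% Layers 1 and 2: membership grades (equal to the firing strengths here).
W = [trap(x,[-17 -13 -7 -3]); trap(x,[-9 -4 3 8]); trap(x,[3 7 13 17])];

% Layer 3: normalized firing strengths. Every pattern fires at least one rule
% for these parameters, so no divide-by-zero guard is needed.
NW = W./repmat(sum(W,1),3,1);

% Least squares solve for [p1 p2 p3 r1 r2 r3], then layers 4 and 5.
X_inner = [NW.*repmat(x,3,1); NW];
C_parms = pinv(X_inner')*y_rp;
y = X_inner'*C_parms;

sse_fit   = sum((y - y_rp).^2);                   % error of the fuzzy model
sse_const = sum((y_rp - mean(y_rp)).^2);          % error of a constant fit
```

Comparing sse_fit with sse_const gives a quick sanity check that the three local linear models explain far more of the target than a single constant would.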

Each ANFIS training epoch, using the hybrid learning rule, consists of two passes. The
consequent parameters are obtained during the forward pass using a least-squares
optimization algorithm and the premise parameters are updated using a gradient descent
algorithm. During the forward pass all node outputs are calculated up to layer 4. At
layer 4 the consequent parameters are calculated using a least-squares regression method.
Next, the outputs are calculated using the new consequent parameters and the error
signals are propagated back through the layers to determine the premise parameter
updates.

The consequent parameters are usually solved for at each epoch during the training
phase, because as the output of the last hidden layer changes due to the backpropagation
phase, the consequent parameters are no longer optimal. Since the SVD is
computationally intensive, it may be most efficient to perform it every few epochs versus
every epoch.
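
The two-pass structure can be sketched compactly. Note the swapped-in simplification: instead of backpropagating error signals analytically, this sketch estimates the gradient of the SSE with respect to the antecedent parameters by central finite differences, re-solving the consequent parameters at every evaluation, and it keeps a step only if the SSE drops. It is therefore not the anfistrn.m algorithm. The helper names (trap, mfs, normw, xmat, sse), the value d3 = 17, and the settings lr, h, and epochs are assumptions made for this illustration.

```matlab
% Sketch only: one possible hybrid loop. The antecedent gradient is estimated by
% central finite differences (a swapped-in simplification, not backpropagation),
% and a step is kept only if it lowers the SSE.
trap  = @(x,p) max(min(min((x-p(1))/(p(2)-p(1)), 1), (p(4)-x)/(p(4)-p(3))), 0);
mfs   = @(x,P) [trap(x,P(1,:)); trap(x,P(2,:)); trap(x,P(3,:))];
normw = @(M) M./repmat(max(sum(M,1),eps),3,1);
xmat  = @(x,P) [normw(mfs(x,P)).*repmat(x,3,1); normw(mfs(x,P))]';
% Forward pass: least squares consequents, then the resulting SSE.
sse   = @(x,y,P) sum((xmat(x,P)*(pinv(xmat(x,P))*y) - y).^2);

x = -10:1:10;
y_rp = 0.05*x'.^3 - 0.02*x'.^2 - 0.3*x' + 20;
P = [-17 -13 -7 -3; -9 -4 3 8; 3 7 13 17];    % antecedent parameters [a b c d]
lr = 0.01;  h = 1e-4;  epochs = 10;
SSE = zeros(1,epochs+1);  SSE(1) = sse(x,y_rp,P);

for k = 1:epochs
    G = zeros(size(P));                       % backward pass (numerical)
    for j = 1:numel(P)
        dP = zeros(size(P));  dP(j) = h;
        G(j) = (sse(x,y_rp,P+dP) - sse(x,y_rp,P-dP))/(2*h);
    end
    step = lr;
    while step > 1e-8 && ~(sse(x,y_rp,P-step*G) < SSE(k))
        step = step/2;                        % backtrack until the SSE drops
    end
    if step > 1e-8, P = P - step*G; end       % otherwise keep P unchanged
    SSE(k+1) = sse(x,y_rp,P);
end
```

Because the consequents are re-solved inside every SSE evaluation, this variant performs the least squares step far more often than the every-few-epochs schedule discussed above; it trades speed for brevity.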

The m-file anfistrn.m demonstrates use of the hybrid learning rule to train an ANFIS
architecture to approximate the function mentioned above. The consequent parameters
([p1 p2 p3 r1 r2 r3]) after the first SVD solution were:

C_parms = [8.8650 4.5311 8.6415 64.2963 23.7368 -26.0706]

and the final consequent parameters were:

C_parms =[11.0308 7.3120 10.5315 82.7411 25.4364 -41.2865]

Below is a graph of the initial antecedent membership functions and the function
approximation after one iteration.

[Figure, two panels. First panel: "Function Epoch 1 SSE 194.7"; x-axis: Input, y-axis: Output; legend: Reference, Output. Second panel: membership functions MF1, MF2, MF3; x-axis: Input, y-axis: Membership.]


Below is a graph of the final antecedent membership functions and the function
approximation after training is completed. The sum of squared error was reduced from
about 195 to about 13 in 40 epochs.
[Figure, two panels. First panel: "Function Epoch 40 SSE 12.91"; x-axis: Input, y-axis: Output; legend: Reference, Output. Second panel: final membership functions; y-axis: Membership.]


A graph of the training record shows that the ANFIS was able to learn the input-output
patterns with a high degree of accuracy. The anfistrn.m code only updates the consequent
parameters every 10 epochs in order to reduce the SVD computation time. This results in
a training record with dips every 10 epochs. Updating the consequent parameters every
epoch did not provide significant error reduction and slowed down the
training process.

[Figure: "Training Record SSE"; x-axis: Epochs (0 to 40), y-axis: SSE (logarithmic scale)]


Chapter 14 General Hybrid Neurofuzzy Applications
No specific hybrid neurofuzzy applications will be examined in this supplement,
although Chapters 12 and 13 present the methodologies and tools necessary to implement
them. The application of the tools and techniques described in Chapter 14 of Fuzzy and
Neural Approaches in Engineering, is left to the reader.
Chapter 15 Dynamic Hybrid Neurofuzzy Systems
Chapter 15 of Fuzzy and Neural Approaches in Engineering, presents several hybrid
Neurofuzzy Systems that were developed at The University of Tennessee. These
applications are complex and cannot be easily implemented with the introductory tools
developed in this Supplement. Therefore, for further information on these subjects, the
reader should consult the references given in the text.
Chapter 16 Role of Expert Systems in Neurofuzzy Systems
MATLAB does not have an expert system toolbox, nor is a user contributed toolbox
available. Since Fuzzy Systems may be viewed as a special type of expert systems that
handle uncertainty well, expert systems could be generated by using the fuzzy systems
tools with crisp membership functions.

MATLAB does have if->then programming constructs, so expert rules containing
heuristic knowledge can be embedded in the neural and fuzzy systems described in
earlier chapters, although no examples will be given in this supplement.
Chapter 17 Genetic Algorithms
MATLAB code will not be used to implement Genetic Algorithms in this supplement.
Commercial and user Genetic Algorithm toolboxes are available for use with MATLAB.
The following is information on a user contributed toolbox and a commercially available
toolbox.

GENETIC is a user contributed set of genetic algorithm m-files written by Andrew F.
Potvin of The MathWorks, Inc. His email is potvin@mathwork.com. This set of m-files
tries to maximize a function using a simple genetic algorithm. It is located at:
http://www.mathworks.com/optims.html

FlexTool(GA) is a commercially available package. The following is an excerpt from their
email advertising:
FlexTool(GA) M 1.1 Features:
- Modular, User Friendly, Hardware and operating system transparent
- Expert, Intermediate, and Novice help settings
- Hands-on tutorial with step by step application guidelines
- Designed to draw on MATLAB power
- Cold Start (start using previously selected GA parameters)
- Warm Start (start from the previous generation) features
- GA options : generational GA, steady state GA, micro GA
- Coding schemes include binary, logarithmic, and real
- Selection strategies : tournament, roulette wheel, ranking
- Crossover techniques include 1, 2, multiple point crossover
- Niching module to identify multiple solutions
- Clustering module : Use separately or with Niching module
- Can optimize multiple objectives
- Default parameter settings for the novice
- Statistics, figures, and data collection
For more information, contact:
Flexible Intelligence Group, L.L.C., Box 1477
Tuscaloosa, AL 35486-1477, USA
Voice: (205) 345-5166
Fax : (205) 345-5095
email: FIGLLC@AOL.COM
References
Demuth, H. and M. Beale, Neural Networks Toolbox, The MathWorks Inc., Natick, MA.,
1994.

DeSilva, C. W., Intelligent Control: Fuzzy Logic Applications, CRC Press, Boca Raton,
Fl, 1995.

Elman, J., Finding Structure in Time, Cognitive Science, Vol. 14, 1990, pp. 179-211.

Grossberg, S., Studies of the Mind and Brain, Reidel Press, Dordrecht, Holland, 1982.

Guely, F., and P. Siarry, Gradient Descent Method for Optimizing Various Fuzzy Rules,
Proceedings of the Second IEEE International Conference on Fuzzy Systems, San
Francisco, 1993, pp. 1241-1246.

Hagan, M., H. Demuth, and Mark Beale, Neural Network Design, PWS Publishing
Company, Boston, MA,1996.

Hebb, D., Organization of Behavior, John Wiley, New York, 1949.

Hertz, J., A. Krogh, and R. G. Palmer, Introduction to the Theory of Neural Computation,
Redwood City, California, Addison-Wesley, 1991.

Hanselman, D. and B. Littlefield, Mastering MATLAB, Prentice Hall, Upper Saddle
River, NJ, 1996.

Hayashi, I., H. Nomura, H. Yamasaki, and N., Wakami, Construction of Fuzzy Inference
Rules by NDF and NDFL, International Journal of Approximate Reasoning, Vol. 6,
1992, pp. 241-266.

Ichihashi, H., T. Miyoshi, and K. Nagasaka, Computed Tomography by Neuro-fuzzy
Inversion, in Proceedings of the 1993 International Joint Conference on Neural
Networks, Part 1, Nagoya, Oct 25-29, 1993, pp.709-712.

Irwin, G. W., K. Warwick, and K. J. Hunt, Neural Network Applications in Control,
Institute of Electrical Engineers, London, 1995.

Jamshidi, M., N. Vadiee, and T. J. Ross (Eds.), Fuzzy Logic and Control, PTR
Prentice Hall, Englewood Cliffs, NJ, 1993.

Jang, J.-S., and N. Gulley, Fuzzy Logic Toolbox for Use with MATLAB, The MathWorks
Inc., Natick, MA, 1995.

Jang, J.-S., C.-T. Sun, Neuro-fuzzy Modeling and Control, Proceedings of the IEEE, Vol.
83, No. 3, March, 1995, pp. 378-406.

Jang, J.-S., C.-T. Sun, and E. Mizutani, Neuro-Fuzzy and Soft Computing, Prentice Hall,
Upper Saddle River, NJ, 1997.

Jordan, M., Attractor Dynamics and Parallelism in a Connectionist Sequential Machine,
Proc. of the Eighth Annual Conference on Cognitive Science Society, Amherst, 1986, pp.
531-546.

Kandel, A., and G. Langholz, Fuzzy Control Systems, CRC Press, Boca Raton, Fl, 1994.

Kohonen, T., Self -Organization and Associative Memory, Springer-Verlag, Berlin, 1984.

Kosko, B., Bidirectional Associative Memories, IEEE Trans. Systems, Man and
Cybernetics, Vol. 18, (1), pp. 49-60, 1988.

Levenberg, K., "A Method for the Solution of Certain Non-linear Problems in Least
Squares", Quarterly Journal of Applied Mathematics, 2, 164-168, 1944.

Ljung, L., System Identification: Theory for the User, Prentice Hall, Upper Saddle River,
NJ, 1987.

Marquardt, D. W., "An Algorithm for Least Squares Estimation of Non-linear
Parameters", J. SIAM, 11, 431-441, 1963.

Masters, T., Practical Neural Network Recipes in C++, Academic Press, San Diego, CA,
1993a.

Masters, T., Advanced Algorithms for Neural Networks, John Wiley & Sons, New York,
1993b.

MATLAB, The MathWorks Inc., Natick, MA, 1994.

Miller, W. T., R. S. Sutton and P. J. Werbos (eds.), Neural Networks for Control, MIT
Press, Cambridge, MA., 1990.

Mills, P.M., A. Y. Zomaya, and M. O. Tade, Neuro-Adaptive Process Control, John
Wiley & Sons, New York, 1996.

Minsky, M., and S. Papert, Perceptrons, MIT Press, Cambridge, MA, 1969.

Moller, M. F., "A Scaled Conjugate Gradient Algorithm for Fast Supervised Learning",
Neural Networks, Vol. 6, 525-533, 1993.

Moscinski, J., and Z. Ogonowski, Eds., Advanced Control with MATLAB & SIMULINK,
Ellis Horwood division of Prentice Hall, Englewood Cliffs, NJ, 1995.

Narendra and Parthasarathy, "Identification and Control of Dynamical Systems Using
Neural Networks", IEEE Transactions on Neural Networks, Vol. 1, No. 1, March 1990.

Nomura, H., I. Hayashi, and N. Wakami, A Self Tuning Method of Fuzzy Reasoning by
Genetic Algorithms, in Fuzzy Control Systems, A. Kandel and G. Langholz, eds., CRC
Press, Boca Raton, Fl, 1994, pp. 338-354.

Omatu, S., M. Khalid, and R. Yusof, Neuro-Control and its Applications, Springer
Verlag, London, 1996.

Park, J., and I. Sandberg, Universal Approximation Using Radial-Basis-Function
Networks, Neural Computation, Vol. 3, 246-257.

Pham, D. T., and X. Liu, Neural Networks for Identification Prediction and Control,
Springer Verlag, London, 1995.

Rosenblatt, F., "The Perceptron: a Probabilistic Model for Information Storage and
Organization in the Human Brain", Psychological Review, Vol. 65, 1958, 386-408.

Shamma, S., "Spatial and Temporal Processing in Central Auditory Networks", Methods
in Neural Modeling, ( C. Koch and I. Segev, eds.), MIT Press, Cambridge, MA, 1989, pp.
247-289.

Specht, D., A General Regression Neural Network, IEEE Trans. on Neural Networks,
Vol. 2, No.5, Nov. 1991, pp. 568-576.

Uhrig, R. E., J. W. Hines, C. Black, D. Wrest, and X. Xu, Instrument Surveillance and
Calibration Verification System, Sandia National Laboratory contract AQ-6982 Final
Report by The University of Tennessee, March, 1996.

Wang, L.-X., Adaptive Fuzzy Systems and Control, Prentice Hall, Englewood Cliffs, NJ,
1994.

Wasserman, Advanced Methods in Neural Computing, Van Nostrand Reinhold, New
York, 1993.

White, D. A., and D. A. Sofge (eds.), Handbook of Intelligent Control, Van Nostrand
Reinhold, New York, 1992.

Williams R. J., and D. Zipser, "A Learning Algorithm for Continually Running Fully
Recurrent Neural Networks", Neural Computation, Vol. 1, 1989, 270-280.

Werbos, P. J., Beyond Regression: New Tools for Prediction and Analysis in the
Behavioral Sciences, Ph.D. Thesis, Harvard University, 1974.

Werbos, P. J., "Backpropagation Through Time: What It is and How to Do It",
Proceedings of the IEEE, Vol. 78, No. 10, October 1990.

White, D., and D. Sofge, Eds., Handbook of Intelligent Control: Neural, Fuzzy and
Adaptive Approaches, Van Nostrand Reinhold, New York, 1992.

Zbikowski, R., and K. J. Hunt, Neural Adaptive Control Technology, World Scientific
Publishing Company, Singapore, 1996.

To SaDonya

v

vi

CONTENTS
CONTENTS.................................................................................................................. VII PREFACE .......................................................................................................................... X ACKNOWLEDGMENTS....................................................................................................... X ABOUT THE AUTHOR....................................................................................................... XI SOFTWARE DESCRIPTION ................................................................................................ XI INTRODUCTION TO THE MATLAB SUPPLEMENT...............................................1 INTRODUCTION TO MATLAB ....................................................................................1 MATLAB TOOLBOXES.....................................................................................................3 SIMULINK ......................................................................................................................3 USER CONTRIBUTED TOOLBOXES .....................................................................................4 MATLAB PUBLICATIONS.................................................................................................4 MATLAB APPLICATIONS.............................................................................................4 CHAPTER 1 INTRODUCTION TO HYBRID ARTIFICIAL INTELLIGENCE SYSTEMS .................5 CHAPTER 2 FOUNDATIONS OF FUZZY APPROACHES .........................................................6
2.1 2.2 2.3 2.4 2.5 Union, Intersection and Complement of a Fuzzy Set .......................................................6 Concentration and Dilation...............................................................................................9 Contrast Intensification...................................................................................................10 Extension Principle.........................................................................................................12 Alpha Cuts ......................................................................................................................14

CHAPTER 3 FUZZY RELATIONSHIPS ................................................................................16
3.1 A Similarity Relation ......................................................................................................16 3.2 Union and Intersection of Fuzzy Relations.....................................................................16 3.3 Max-Min Composition ...................................................................................................18

CHAPTER 4 FUZZY NUMBERS .........................................................................................19
4.1 Addition and Subtraction of Discrete Fuzzy Numbers ...................................................19 4.2 Multiplication of Discrete Fuzzy Numbers ....................................................................21 4.3 Division of Discrete Fuzzy Numbers..............................................................................23

CHAPTER 5 LINGUISTIC DESCRIPTIONS AND THEIR ANALYTICAL FORM ........................24
5.1 Generalized Modus Ponens ............................................................................................24 5.2 Membership Functions ...................................................................................................24 5.2.1 Triangular Membership Function.............................................................................24 5.2.2 Trapezoidal Membership Function..........................................................................26 5.2.3 S-shaped Membership Function ..............................................................................27 5.2.4 Π-shaped Membership Function .............................................................................28 5.2.5 Defuzzification of a Fuzzy Set ................................................................................29 5.2.6 Compound Values ...................................................................................................31 5.3 Implication Relations......................................................................................................33 5.4 Fuzzy Algorithms ...........................................................................................................37

CHAPTER 6 FUZZY CONTROL .........................................................................................44
6.1 Tank Level Fuzzy Control ..............................................................................................44

CHAPTER 7 FUNDAMENTALS OF NEURAL NETWORKS ....................................................52
7.1 Artificial Neuron.............................................................................................................52

vii

7.1.5 Single Layer Neural Network
7.2 Rosenblatt's Perceptron
7.3 Separation of Linearly Separable Variables
7.4 Multilayer Neural Network
CHAPTER 8 BACKPROPAGATION AND RELATED TRAINING PARADIGMS
8.1 Derivative of the Activation Functions
8.2 Backpropagation for a Multilayer Neural Network
8.2.1 Weight Updates
8.2.2 Hidden Layer Weight Updates
8.2.3 Batch Training
8.2.4 Adaptive Learning Rate
8.2.5 The Backpropagation Training Cycle
8.3 Scaling Input Vectors
8.4 Initializing Weights
8.5 Creating a MATLAB Function for Backpropagation
8.6 Backpropagation Example
CHAPTER 9 COMPETITIVE, ASSOCIATIVE AND OTHER SPECIAL NEURAL NETWORKS
9.1 Hebbian Learning
9.2 Instar Learning
9.3 Outstar Learning
9.4 Crossbar Structure
9.5 Competitive Networks
9.5.1 Competitive Network Implementation
9.5.2 Self Organizing Feature Maps
9.6 Probabilistic Neural Networks
9.7 Radial Basis Function Networks
9.7.1 Radial Basis Function Example
9.7.2 Small Neuron Width Example
9.7.3 Large Neuron Width Example
9.8 Generalized Regression Neural Network
CHAPTER 10 DYNAMIC NEURAL NETWORKS AND CONTROL SYSTEMS
10.1 Introduction
10.2 Linear System Theory
10.3 Adaptive Signal Processing
10.4 Adaptive Processors and Neural Networks
10.5 Neural Networks Control
10.5.1 Supervised Control
10.5.2 Direct Inverse Control
10.5.3 Model Referenced Adaptive Control
10.5.4 Back Propagation Through Time
10.5.5 Adaptive Critic
10.6 System Identification
10.6.1 ARX System Identification Model
10.6.2 Basic Steps of System Identification
10.6.3 Neural Network Model Structure
10.6.4 Tank System Identification Example
10.7 Implementation of Neural Control Systems
CHAPTER 11 PRACTICAL ASPECTS OF NEURAL NETWORKS
11.1 Neural Network Implementation Issues
11.2 Overview of Neural Network Training Methodology

11.3 Training and Test Data Selection
11.4 Overfitting
11.4.1 Neural Network Size
11.4.2 Neural Network Noise
11.4.3 Stopping Criteria and Cross Validation Training
CHAPTER 12 NEURAL METHODS IN FUZZY SYSTEMS
12.1 Introduction
12.2 From Crisp to Fuzzy Neurons
12.3 Generalized Fuzzy Neuron and Networks
12.4 Aggregation and Transfer Functions in Fuzzy Neurons
12.5 AND and OR Fuzzy Neurons
12.6 Multilayer Fuzzy Neural Networks
12.7 Learning and Adaptation in Fuzzy Neural Networks
CHAPTER 13 NEURAL METHODS IN FUZZY SYSTEMS
13.1 Introduction
13.2 Fuzzy-Neural Hybrids
13.3 Neural Networks for Determining Membership Functions
13.4 Neural Network Driven Fuzzy Reasoning
13.5 Learning and Adaptation in Fuzzy Systems via Neural Networks
13.5.1 Zero Order Sugeno Fan Speed Control
13.5.2 Consequent Membership Function Training
13.5.3 Antecedent Membership Function Training
13.5.4 Membership Function Derivative Functions
13.5.5 Membership Function Training Example
13.6 Adaptive Network-Based Fuzzy Inference Systems
13.6.1 ANFIS Hybrid Training Rule
13.6.2 Least Squares Regression Techniques
13.6.3 ANFIS Hybrid Training Example
CHAPTER 14 GENERAL HYBRID NEUROFUZZY APPLICATIONS
CHAPTER 15 DYNAMIC HYBRID NEUROFUZZY SYSTEMS
CHAPTER 16 ROLE OF EXPERT SYSTEMS IN NEUROFUZZY SYSTEMS
CHAPTER 17 GENETIC ALGORITHMS
REFERENCES

Preface

Over the past decade, the application of artificial neural networks and fuzzy systems to solving engineering problems has grown enormously. And recently, the synergism realized by combining the two techniques has become increasingly apparent. Although many texts are available for presenting artificial neural networks and fuzzy systems to potential users, few exist that deal with the combinations of the two subjects and fewer still exist that take the reader through the practical implementation aspects.

This supplement introduces the fundamentals necessary to implement and apply these Soft Computing approaches to engineering problems using MATLAB. It takes the reader from the underlying theory to actual coding and implementation. Presenting the theory's implementation in code provides a more in depth understanding of the subject matter. The code is built from a bottom up framework, first introducing the pieces and then putting them together to perform more complex functions, and finally implementation examples. The MATLAB Notebook allows the embedding and evaluation of MATLAB code fragments in the Word document, thus providing a compact and comprehensive presentation of the Soft Computing techniques.

The first part of this supplement gives a very brief introduction to MATLAB including resources available on the World Wide Web. The second section of this supplement contains 17 chapters that mirror the chapters of the text. Chapters 2-13 have MATLAB implementations of the theory and discuss practical implementation issues. Although Chapters 14-17 do not give MATLAB implementations of the examples presented in the text, some references are given to support a more in depth study.

Acknowledgments

I would like to thank Distinguished Professor Robert E. Uhrig from The University of Tennessee and Professor Lefteri H. Tsoukalas from Purdue University for offering me the opportunity and encouraging me to write this supplement to their book entitled Fuzzy and Neural Approaches in Engineering. Also thanks for their review, comments, and suggestions.

My sincere thanks goes to Darryl Wrest of Honeywell for his time and effort during the review of this supplement. Thanks also go to Mark Buckner of Oak Ridge National Laboratory for his contributions to Sections 15.5 and 15.6.

This supplement would not have been possible without the foresight of the founders of The MathWorks in developing what I think is the most useful and productive engineering software package available. I have been a faithful user for the past seven years and look forward to the continued improvement and expansion of their base software package and application toolboxes. I have found few companies that provide such a high level of commitment to both quality and support.

About the Author

Dr. J. Wesley Hines is currently a Research Assistant Professor in the Nuclear Engineering Department at the University of Tennessee. He received the BS degree (Summa Cum Laude) in Electrical Engineering from Ohio University in 1985, both an MBA (with distinction) and a MS in Nuclear Engineering from The Ohio State University in 1992, and a Ph.D. in Nuclear Engineering from The Ohio State University in 1994. He graduated from the officers course of the Naval Nuclear Power School (with distinction) in 1986.

Dr. Hines teaches classes in Applied Artificial Intelligence to students from all departments in the engineering college. He is involved in several research projects in which he uses his experience in modeling and simulation, instrumentation and control, applied artificial intelligence, and surveillance & diagnostics in applying artificial intelligence methodologies to solve practical engineering problems. He is a member of the American Nuclear Society and IEEE professional societies and a member of Sigma Xi, Tau Beta Pi, Eta Kappa Nu, Alpha Nu Sigma, and Phi Kappa Phi honor societies.

For the five years prior to coming to the University of Tennessee, Dr. Hines was a member of The Ohio State University's Nuclear Engineering Artificial Intelligence Group. While there, he worked on several DOE and EPRI funded projects applying AI techniques to engineering problems. From 1985 to 1990 Dr. Hines served in the United States Navy as a nuclear qualified Naval Officer. He served as the Electrical Officer of a nuclear powered Ballistic Missile Submarine (1987 to 1988) and was the Assistant Nuclear Controls and Chemistry Officer for the Atlantic Submarine Force (1988 to 1990).

Software Description

This supplement comes with an IBM compatible disk containing an install program. The program includes an MS Word 7.0 notebook file (PC) and several MATLAB functions, scripts and data files. The MS Word file, master.doc, is a copy of this supplement and can be opened into MS Word so that the code fragments in this document can be run and modified. Its size is over 4 megabytes, so I recommend a Pentium computer platform with at least 16 MB of RAM. It and the other files should not be duplicated or distributed without the written consent of the author.

The installation program's default directory is C:\MATLAB\TOOLBOX\NN_FUZZY, which you can change if you wish. This should result in the extraction of 100 files that require about 5 megabytes of disk space. The contents.m file gives a brief description of the MATLAB files that were extracted into the directory. The following is a description of the files:

master.doc    This supplement in MS Word 7.0 (PC)
readme.txt    A text version of this software description section.
*.m           MATLAB script and function files (67)
*.mat         MATLAB data files (31)

INTRODUCTION TO THE MATLAB SUPPLEMENT

This supplement uses the mathematical tools of the educational version of MATLAB to demonstrate some of the important concepts presented in Fuzzy and Neural Approaches in Engineering, by Lefteri H. Tsoukalas and Robert E. Uhrig and being published by John Wiley & Sons. This book integrates the two technologies of fuzzy logic systems and neural networks. These two advanced information processing technologies have undergone explosive growth in the past few years. However, each field has developed independently of the other with its own nomenclature and symbology. Although there appears to be little that is common between the two fields, they are actually closely related and are being integrated in many applications. Indeed, these two technologies form the core of the discipline called SOFT COMPUTING, a name directly attributed to Lotfi Zadeh. Fuzzy and Neural Approaches in Engineering integrates the two technologies and presents them in a clear and concise framework.

This supplement was written using the MATLAB notebook and Microsoft WORD ver. 7.0. The notebook allows MATLAB commands to be entered and evaluated while in the Word environment. This allows the document to both briefly explain the theoretical details and also show the MATLAB implementation. It allows the user to experiment with changing the MATLAB code fragments in order to gain a better understanding of the application.

This supplement contains numerous examples that demonstrate the practical implementation of neural, fuzzy, and hybrid processing techniques using MATLAB. Although MATLAB toolboxes for Fuzzy Logic [Jang and Gulley, 1995] and Neural Networks [Demuth and Beale, 1994] exist, they are not required to run the examples given in this supplement. This supplement should be considered to be a brief introduction to the MATLAB implementation of neural and fuzzy systems and the author strongly recommends the use of the Neural Networks Toolbox and the Fuzzy Logic Toolbox for a more in depth study of these information processing technologies. Some of the examples in this supplement are not written in a general format and will have to be altered significantly for use to solve specific problems; other examples and m-files are extremely general and portable.

INTRODUCTION TO MATLAB

MATLAB is a technical computing environment that is published by The MathWorks. It can run on many platforms including windows based personal computers (windows, DOS, Linux), Macintosh, Sun, DEC, VAX and Cray. Applications are transportable between the platforms.

MATLAB is the base package and was originally written as an easy interface to LINPACK, which is a state of the art package for matrix computations. MATLAB has functionality to perform or use:

· Matrix Arithmetic - add, divide, inverse, transpose, etc.
· Relational Operators - less than, not equal, etc.
· Logical operators - AND, OR, NOT, XOR
· Data Analysis - minimum, mean, covariance, etc.
· Elementary Functions - sin, acos, log, imaginary part, etc.
· Special Functions - Bessel, Hankel, error function, etc.
· Numerical Linear Algebra - LU decomposition, etc.
· Signal Processing - FFT, inverse FFT, etc.
· Polynomials - roots, fit polynomial, divide, etc.
· Non-linear Numerical Methods - solve DE, minimize functions, etc.

MATLAB is also used for graphics and visualization in both 2-D and 3-D.

MATLAB is a language in itself and can be used at the command line or in m-files. There are two types of MATLAB M files: scripts and functions. A good reference for MATLAB programming is Mastering MATLAB by Duane Hanselman and Bruce Littlefield and published by Prentice Hall (http://www.prenhall.com/). These authors also wrote the user guide for the student edition of MATLAB.

1. Scripts are standard MATLAB programs that run as if they were typed into the command window.
2. Functions are compiled m-files that are stored in memory. Most MATLAB commands are functions and are accessible to the user. This allows the user to modify the functions to perform the desired functionality necessary for a specific application.

MATLAB m-files may contain
· Standard programming constructs such as IF, else, break, while, etc.
· C style file I/O such as open, read, write formatted, etc.
· String manipulation commands: number to string, test for string, etc.
· Debugging commands: set breakpoint, resume, show status, etc.
· Graphical User Interfaces such as pull down menus, radio buttons, sliders, dialog boxes, mouse-button events, etc.
· On-Line help routines for all functions.

MATLAB also contains methods for external interfacing capabilities:
· For data import/export from ASCII files, binary, etc.
· To files and directories: chdir, dir, time, etc.
· To external interface libraries: C and FORTRAN callable external interface libraries.
· To dynamic linking libraries: (MEX files) allows C or FORTRAN routines to be linked directly into MATLAB at run time. This also allows access to A/D cards.
· Computational engine service library: Allows C and FORTRAN programs to call and access MATLAB routines.
· Dynamic Data Exchange: Allows MATLAB to communicate with other Windows applications.

MATLAB Toolboxes

Toolboxes are add-on packages that perform application-specific functions. MATLAB toolboxes are collections of functions that can be used from the command line, from scripts, or called from other functions. They are written in MATLAB and stored as m-files; this allows the user to modify them to meet his or her needs. A partial listing of these toolboxes include:
· Signal Processing
· Image Processing
· Symbolic Math
· Neural Networks
· Statistics
· Spline
· Control System
· Robust Control
· Model Predictive Control
· Non-Linear Control
· System Identification
· Mu Analysis
· Optimization
· Fuzzy Logic
· Hi-Spec
· Chemometrics

SIMULINK

SIMULINK is a MATLAB toolbox which provides an environment for modeling, analyzing, and simulating linear and non-linear dynamic systems. SIMULINK provides a graphical user interface that supports click and drag of blocks that can be connected to form complex systems. SIMULINK functionality includes:
· Live displays that let you observe variables as the simulation runs.
· Linear approximations of non-linear systems can be made.
· MATLAB functions or MEX (C and FORTRAN) functions can be called.
· C code can be generated from your models.
· Output can be saved to file for later analysis.
· Systems of blocks can be combined into larger blocks to aid in program structuring.
· New blocks can be created to perform special functions such as simulating neural or fuzzy systems.
· Discrete or continuous simulations can be run.
· Seven different integration algorithms can be used depending on the system type: linear, stiff, etc.

SIMULINK's Real Time Workshop can be used for rapid prototyping, real-time simulation, embedded real time control, and stand-alone simulation. This toolbox automatically generates stand-alone C code.

User Contributed Toolboxes

Several user contributed toolboxes are available for download at the MATLAB FTP site: ftp.mathworks.com, by means of anonymous user access. Some that may be of interest are:

Genetic Algorithm Toolbox: A freeware toolbox developed by a MathWorks employee that will probably become a full toolbox in the future.
FISMAT Toolbox: A fuzzy inference system toolbox developed in Australia that incorporates several extensions to the fuzzy logic toolbox.
IFR-Fuzzy Toolbox: User contributed fuzzy-control toolbox.

There are also thousands of user contributed m-files on hundreds of topics ranging from the Microorbit mission analysis to sound spectrogram printing, to Lagrange interpolation. In addition to these, there are also several other MATLAB tools that are published by other companies. The most relevant of these is the Fuzzy Systems Toolbox developed by Mark H. Beale and Howard B. Demuth of the University of Idaho which is published by PWS (http://www.thomson.com/pws/default.html). This toolbox goes into greater detail than the MATLAB toolbox and better explains the lower level programming used in the functions. These authors also wrote the MATLAB Neural Network Toolbox.

MATLAB Publications

The following publications are available at The MathWorks WWW site.

WWW Address: http://www.mathworks.com
FTP Address: ftp.mathworks.com
Login: anonymous
Password: "your user address"

List of MATLAB based books.
MATLAB Digest: electronic newsletter.
MATLAB Quarterly Newsletter: News and Notes.
MATLAB Technical Notes
MATLAB Frequently Asked Questions
MATLAB Conference Archive; this conference is held every other year.
MATLAB USENET Newsgroup archive.
FTP Server also provides technical references such as papers and article reprints.

MATLAB APPLICATIONS

In the following chapters, the algorithms and applications described in Fuzzy and Neural Approaches in Engineering will be implemented in MATLAB code. The following chapters implement the theory and techniques discussed in the text. These MATLAB implementations can be executed by placing the cursor in the code fragment and selecting "evaluate cell" located in the Notebook menu. The executable code fragments are green when viewed in the Word notebook and the answers are blue. Since this supplement is printed in black and white, the code fragments will be represented by 10 point Courier New gray scale. The regular text is in 12 point Times New Roman black.

Some of these implementations use m-file functions or data files. These are included on the disk that comes with this supplement. Also included is a MS Word file of this document. The file contents.m lists and gives a brief description of all the m-files included with this supplement.

The following code segment is an autoinit cell and is executed each time the notebook is opened. If it does not execute when the document is opened, execute it manually. It performs three functions:
1. whitebg([1 1 1]) gives the figures a white background.
2. set(0, 'DefaultAxesColorOrder', [0 0 0]); close(gcf) sets the line colors in all figures to black. This produces black and white pages for printing but can be deleted for color.
3. cd d:/nn_fuzzy changes the current MATLAB directory to the directory where the m-files associated with this supplement are located. If you installed the files in another directory, you need to change the code to point to the directory where they are installed.

whitebg([1 1 1]);
set(0, 'DefaultAxesColorOrder', [0 0 0]); close(gcf)
cd d:/nn_fuzzy

This code can be run from the WORD Notebook when the directory containing the m-files associated with this supplement is the active directory in MATLAB's command window. In many of the chapters, the code must be executed sequentially since earlier code fragments may create data or variables used in later fragments.

Chapter 1 Introduction to Hybrid Artificial Intelligence Systems

Chapter 1 of Fuzzy and Neural Approaches in Engineering gives a brief description of the benefits of integrating Fuzzy Logic, Neural Networks, Genetic Algorithms, and Expert Systems. Several applications are described but no specific algorithms or architectures are presented in enough detail to warrant their implementation in MATLAB.

Chapters 1 through 6 implement Fuzzy Logic, Chapters 7 through 11 implement Artificial Neural Networks, and Chapters 12 and 13 implement fuzzy-neural hybrid systems. Chapters 14 through 17 do not contain MATLAB implementations but do point the reader towards references or user contributed toolboxes. This supplement will be updated and expanded as suggestions are received and as time permits. Updates are expected to be posted at John Wiley & Sons WWW page but may be posted at University of Tennessee web site. Further information should be available from the author at hines@utkux.utk.edu.

Chapter 2 Foundations of Fuzzy Approaches

This chapter will present the building blocks that lay the foundation for constructing fuzzy systems. These building blocks include membership functions, linguistic modifiers, and alpha cuts.

2.1 Union, Intersection and Complement of a Fuzzy Set

A graph depicting the membership of a number to a fuzzy set is called a Zadeh diagram. A Zadeh diagram is a graphical representation that shows the membership of crisp input values to fuzzy sets. The Zadeh diagrams for two membership functions A (small numbers) and B (about 8) are constructed below.

x=[0:0.1:20];
muA=1./(1+(x./5).^3);
muB=1./(1+.3.*(x-8).^2);
plot(x,muA,x,muB);
title('Zadeh diagram for the Fuzzy Sets A and B');
text(1,.8,'Set A');text(7,.8,'Set B')
xlabel('Number');ylabel('Membership');

[Figure: Zadeh diagram for the Fuzzy Sets A and B. Horizontal axis: Number; vertical axis: Membership; curves labeled Set A and Set B.]

The horizontal axis of a Zadeh diagram is called the universe of discourse. The universe of discourse is the range of values where the fuzzy set is defined. The vertical axis is the membership of a value, in the universe of discourse, to the fuzzy set. The membership of a number (x) to a fuzzy set A is represented by $\mu_A(x)$.

The union of the two fuzzy sets is calculated using the max function. We can see that this results in the membership of a number to the union being the maximum of its membership to either of the two initial fuzzy sets. The union of the fuzzy sets A and B is calculated below.

union=max(muA,muB);
plot(x,union);
title('Union of the Fuzzy Sets A and B');
xlabel('Number');
ylabel('Membership');

[Figure: Union of the Fuzzy Sets A and B. Horizontal axis: Number; vertical axis: Membership.]

The intersection of the two fuzzy sets is calculated using the min function. We can see that this results in the membership of a number to the intersection being the minimum of its membership to either of the two initial fuzzy sets. The intersection of the fuzzy sets A and B is calculated below.

intersection=min(muA,muB);
plot(x,intersection);
title('Intersection of the Fuzzy Sets A and B');
xlabel('Number');
ylabel('Membership');

[Figure: Intersection of the Fuzzy Sets A and B. Horizontal axis: Number; vertical axis: Membership.]

The complement of about 8 is calculated below.

complement=1-muB;
plot(x,complement);
title('Complement of the Fuzzy Set B');
xlabel('Number');
ylabel('Membership');

[Figure: Complement of the Fuzzy Set B. Horizontal axis: Number; vertical axis: Membership.]
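The max, min, and 1-minus operators used above can be checked against De Morgan's laws, which they are expected to satisfy. The following is a minimal sketch, not part of the original notebook; it rebuilds muA and muB from the definitions given earlier in this section, and the variable names fuzzy_union, fuzzy_intersection, err1 and err2 are chosen here only for illustration.

```matlab
% Sketch: numerical check of De Morgan's laws for fuzzy sets A and B.
% muA and muB repeat the membership functions defined earlier in Section 2.1.
x   = 0:0.1:20;
muA = 1./(1+(x./5).^3);          % "small numbers"
muB = 1./(1+0.3.*(x-8).^2);      % "about 8"

fuzzy_union        = max(muA,muB);   % A OR B
fuzzy_intersection = min(muA,muB);   % A AND B

% NOT(A OR B) compared with (NOT A) AND (NOT B)
err1 = max(abs((1-fuzzy_union) - min(1-muA,1-muB)));
% NOT(A AND B) compared with (NOT A) OR (NOT B)
err2 = max(abs((1-fuzzy_intersection) - max(1-muA,1-muB)));

disp([err1 err2])   % both differences should be negligible
```

The names fuzzy_union and fuzzy_intersection are used instead of union and intersection so that the sketch does not shadow MATLAB's built-in set functions.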

2.2 Concentration and Dilation

The concentration of a fuzzy set is equivalent to linguistically modifying it by the term VERY. The concentration of small numbers is therefore VERY small numbers and can be quantitatively represented by squaring the membership value. This is computed in the function very(mf).

x=[0:0.1:20];
muA=1./(1+(x./5).^3);
muvsb=very(muA);
plot(x,muA,x,muvsb);
title('Zadeh diagram for the Fuzzy Sets A and VERY A');
xlabel('Number');
ylabel('Membership');
text(1,.5,'Very A');
text(7,.5,'Set A')

[Figure: Zadeh diagram for the Fuzzy Sets A and VERY A. Horizontal axis: Number; vertical axis: Membership; curves labeled Very A and Set A.]

The dilation of a fuzzy set is equivalent to linguistically modifying it by the term MORE OR LESS. The dilation of small numbers is therefore MORE OR LESS small numbers and can be quantitatively represented by taking the square root of the membership value. This is computed in the function moreless(mf).

x=[0:0.1:20];
muA=1./(1+(x./5).^3);
muvsb=moreless(muA);
plot(x,muA,x,muvsb);
title('Zadeh diagram for the Fuzzy Sets A and MORE or LESS A');
xlabel('Number');
ylabel('Membership');
text(2,.5,'Set A');
text(9,.5,'More or Less A')

[Figure: Zadeh diagram for the Fuzzy Sets A and MORE or LESS A. Horizontal axis: Number; vertical axis: Membership; curves labeled Set A and More or Less A.]
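The m-files very.m and moreless.m are supplied on the disk and are not listed in this section. Based only on the description above (squaring for concentration, square root for dilation), they could be sketched as the anonymous functions below. The names very_sketch and moreless_sketch are hypothetical stand-ins, and the actual m-files may differ in detail.

```matlab
% Sketch (assumption): stand-ins for the very(mf) and moreless(mf) m-files,
% written from the textual description only.
very_sketch     = @(mf) mf.^2;     % concentration: square the grades
moreless_sketch = @(mf) sqrt(mf);  % dilation: square root of the grades

x   = 0:0.1:20;
muA = 1./(1+(x./5).^3);            % "small numbers", as defined above

mu_very     = very_sketch(muA);
mu_moreless = moreless_sketch(muA);

% For grades in [0,1], VERY A lies below A and MORE OR LESS A lies above A.
ordered = all(mu_very <= muA) && all(muA <= mu_moreless);
disp(ordered)
```

Because membership grades lie between 0 and 1, squaring can only lower a grade and the square root can only raise it, which is why the modified curves bracket the original one.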

2.3 Contrast Intensification

A fuzzy set can have its fuzziness intensified; this is called contrast intensification. A membership function can be represented by an exponential fuzzifier F1 and a denominator fuzzifier F2. The following equation describes a fuzzy set large numbers.

$$\mu(x) = \frac{1}{1 + \left(\dfrac{x}{F_2}\right)^{-F_1}}$$

Letting F1 vary {1 2 4 10 100} with F2 = 50 results in a family of curves with slopes increasing as F1 increases.

F1=[1 2 4 10 100];
F2=50;
x=[0:1:100];
muA=zeros(length(F1),length(x));
for i=1:length(F1);
  muA(i,:)=1./(1+(x./F2).^(-F1(i)));
end
plot(x,muA);
title('Contrast Intensification');
xlabel('Number')
ylabel('Membership')
text(5,.3,'F1 = 1');text(55,.2,'F1 = 100');

[Figure: Contrast Intensification. Horizontal axis: Number; vertical axis: Membership; curves labeled F1 = 1 through F1 = 100.]

Letting F2 vary {30 40 50 60 70} with F1 = 4 results in the following family of curves.

F1=4;F2=[30 40 50 60 70];
for i=1:length(F2);
  muA(i,:)=1./(1+(x./F2(i)).^(-F1));
end
plot(x,muA);title('Contrast Intensification');
xlabel('Number');ylabel('Membership')
text(10,.5,'F2 = 30');text(75,.5,'F2 = 70');

[Figure: Contrast Intensification. Horizontal axis: Number; vertical axis: Membership; curves labeled F2 = 30 through F2 = 70.]
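The roles of the two fuzzifiers can be made concrete with a short check. Under the membership function above, the grade at x = F2 is 0.5 for any F1, and differentiating the formula gives a slope of F1/(4*F2) at that point, so F2 places the crossover and F1 sets how sharp it is. The sketch below is an illustration added under those assumptions; mu_large, slope_numeric and slope_formula are names introduced here, and the slope is estimated with a central difference.

```matlab
% Sketch: crossover point and slope of mu(x) = 1/(1+(x/F2)^(-F1)).
mu_large = @(x,F1,F2) 1./(1+(x./F2).^(-F1));

F1 = [1 2 4 10 100];
F2 = 50;

% Grade at the crossover x = F2, one value per F1
crossover = mu_large(F2, F1, F2);

% Central-difference slope at x = F2, compared with F1/(4*F2)
h = 1e-4;
slope_numeric = (mu_large(F2+h, F1, F2) - mu_large(F2-h, F1, F2))./(2*h);
slope_formula = F1./(4*F2);

disp(crossover)
disp([slope_numeric; slope_formula])
```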

2.4 Extension Principle

The extension principle is a mathematical tool for extending crisp mathematical notions and operations to the milieu of fuzziness. Consider a function that maps points from the X-axis to the Y-axis in the Cartesian plane:

$$y = f(x) = \sqrt{1 - \frac{x^2}{4}}$$

This is graphed as the upper half of an ellipse centered at the origin.

x=[-2:.1:2];
y=(1-x.^2/4).^.5;
plot(x,y,x,-y);
title('Functional Mapping')
xlabel('x');ylabel('y');

[Figure: Functional Mapping. Horizontal axis: x; vertical axis: y.]

Suppose fuzzy set A is defined as

$$A = \int_{-2 \le x \le 2} \tfrac{1}{2}\,|x| \,/\, x$$

mua=0.5.*abs(x);
plot(x,mua)
title('Fuzzy Set A');
xlabel('x');
ylabel('Membership of x to A');

[Figure: Fuzzy Set A. Horizontal axis: x; vertical axis: Membership of x to A.]

Solving for x in terms of y we get:

$$x = 2\sqrt{1 - y^2}$$

And the membership function of y to B is

$$\mu_B(y) = \sqrt{1 - y^2}$$

y=[-1:.05:1];
mub=(1-y.^2).^.5;
plot(y,mub)
title('Fuzzy Set B');
xlabel('y');ylabel('Membership of y to B');

[Figure: Fuzzy Set B. Horizontal axis: y; vertical axis: Membership of y to B.]
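The closed form for the membership of y to B can be checked numerically. In the sketch below, each sampled x is mapped to y = f(x) and carries its grade in A over to y, which is what the extension principle prescribes here since x and -x map to the same y with the same grade. The result is compared with the closed form. The variable names muB_mapped, muB_closed and max_diff are introduced only for this illustration.

```matlab
% Sketch: numerical check of mu_B(y) = sqrt(1 - y^2) for the mapping
% y = sqrt(1 - x^2/4) and mu_A(x) = |x|/2.
x   = -2:0.1:2;               % universe of discourse of A
muA = 0.5.*abs(x);            % membership of x to A
y   = sqrt(1 - x.^2./4);      % image of each x under f

muB_mapped = muA;                        % grade carried over from x to y = f(x)
muB_closed = sqrt(max(0, 1 - y.^2));     % closed form; max() guards rounding

max_diff = max(abs(muB_mapped - muB_closed));
disp(max_diff)   % should be negligible
```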

The geometric interpretation is shown below.

set(gcf,'color',[1 1 1]);
x=[-2:.2:2];
mua=0.5.*abs(x);
y=[-1:.1:1];
mub=(1-y.^2).^.5;
[X,Y] = meshgrid(x,y);
Z=.5*abs(X).*(1-Y.^2).^.5;
mesh(X,Y,Z);
axis([-2 2 -1 1 -1 1])
colormap(1-gray)
view([0 90]);
shading interp
xlabel('x')
ylabel('y')
title('Fuzzy Region Inside and Outside the Ellipse')

[Figure: Fuzzy Region Inside and Outside the Ellipse. Horizontal axis: x; vertical axis: y.]

2.5 Alpha Cuts

An alpha cut is a crisp set that contains the elements that have a support (also called membership or grade) greater than a certain value. Consider a fuzzy set whose membership function is

$$\mu_A(x) = \frac{1}{1 + 0.01\,(x - 50)^2}$$

Suppose we are interested in the portion of the membership function where the support is greater than 0.2. The 0.2 alpha cut is given by:

x=[0:1:100];
mua=1./(1+0.01.*(x-50).^2);
alpha_cut = mua>=.2;
plot(x,alpha_cut)
title('0.2 Level Fuzzy Set of A');
xlabel('x');
ylabel('Membership of x');

[Figure: 0.2 Level Fuzzy Set of A. Horizontal axis: x; vertical axis: Membership of x.]

The function alpha is written to return the minimum and maximum values where an alpha cut is one. This function will be used in subsequent exercises.

function [a,b] = alpha(FS,x,level)
% [a,b] = alpha(FS,x,level)
%
% Returns the alpha cut for the fuzzy set at a given level.
% FS : the grades of a fuzzy set.
% x : the universe of discourse
% level : the level of the alpha cut
% [a,b] : the minimum and maximum values of x in the alpha cut
%
ind=find(FS>=level);
a=x(min(ind));
b=x(max(ind));

[a,b]=alpha(mua,x,.2)
a =
    30
b =
    70
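Alpha cuts taken at increasing levels are nested, which can be seen by evaluating the cut at several levels. The sketch below does not call the alpha m-file itself; it uses an inline stand-in, alpha_bounds, a name introduced here that applies the same comparison FS>=level, so that the fragment runs without the disk files. The levels 0.2, 0.5 and 0.8 are chosen only for illustration.

```matlab
% Sketch: alpha cuts of the same fuzzy set at several levels.
% alpha_bounds is a hypothetical inline stand-in for the alpha m-file above.
alpha_bounds = @(FS,x,level) [min(x(FS>=level)), max(x(FS>=level))];

x   = 0:1:100;
mua = 1./(1+0.01.*(x-50).^2);

levels = [0.2 0.5 0.8];
bounds = zeros(length(levels),2);
for k = 1:length(levels)
  bounds(k,:) = alpha_bounds(mua,x,levels(k));   % [lower upper] for this level
end
widths = bounds(:,2) - bounds(:,1);

disp([levels' bounds widths])
```

The first row corresponds to the 0.2 cut computed above (30 to 70), and each higher level gives a narrower interval centered on x = 50.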

Chapter 3 Fuzzy Relationships

Fuzzy if/then rules and their aggregations are fuzzy relations in linguistic disguise and can be thought of as fuzzy sets defined over high dimensional universes of discourse.

3.1 A Similarity Relation

Suppose a relation R is defined as "x is near the origin AND near y". This can be expressed as $\mu_R(x, y) = e^{-(x^2 + y^2)}$. The universe of discourse is graphed below.

    [x,y]=meshgrid(-2:.2:2,-2:.2:2);
    mur=exp(-1*(x.^2+y.^2));
    surf(x,y,mur)
    xlabel('x')
    ylabel('y')
    zlabel('Membership to the Fuzzy Set R')

[Figure: Membership to the Fuzzy Set R; surface over x and y]

3.2 Union and Intersection of Fuzzy Relations

Suppose a relation R1 is defined as "x is near y AND near the origin", and a relation R2 is defined as "x is NOT near the origin". The union R1 OR R2 is defined as:

    mur1=exp(-1*(x.^2+y.^2));
    mur2=1-exp(-1*(x.^2+y.^2));
    surf(x,y,max(mur1,mur2))
    xlabel('x')
    ylabel('y')
    zlabel('Union of R1 and R2')

[Figure: Union of R1 and R2; surface over x and y]

The intersection R1 AND R2 is defined as:

    mur1=exp(-1*(x.^2+y.^2));
    mur2=1-exp(-1*(x.^2+y.^2));
    surf(x,y,min(mur1,mur2))
    xlabel('x')
    ylabel('y')
    zlabel('Intersection of R1 and R2')

[Figure: Intersection of R1 and R2; surface over x and y]
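Because R2 in this example is the complement of R1, the two surfaces cross where both memberships equal 0.5. The union therefore never drops below 0.5 and the intersection never rises above it, which is one way fuzzy relations differ from crisp ones. The sketch below checks this numerically; the names r_union, r_inter and r_cross are chosen here for illustration.

```matlab
% Sketch: union and intersection of a relation with its own complement.
[x, y] = meshgrid(-2:.2:2, -2:.2:2);
mur1 = exp(-(x.^2 + y.^2));      % "near the origin"
mur2 = 1 - mur1;                 % "NOT near the origin"
r_union = max(mur1, mur2);       % union uses max
r_inter = min(mur1, mur2);       % intersection uses min

r_cross = sqrt(log(2));          % radius at which exp(-r^2) = 0.5
fprintf('crossover radius = %.4f\n', r_cross);
fprintf('smallest union grade = %.4f\n', min(r_union(:)));
fprintf('largest intersection grade = %.4f\n', max(r_inter(:)));
```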

3.3 Max-Min Composition

The max-min composition uses the max and min operators described in section 3.2. Suppose two relations are defined as follows:

$$R_1 = \left[\mu_{R_1}(x_i, y_j)\right]_{4 \times 4} = \begin{bmatrix} 1.0 & 0.3 & 0.9 & 0.0 \\ 0.3 & 1.0 & 0.8 & 1.0 \\ 0.9 & 0.8 & 1.0 & 0.8 \\ 0.0 & 1.0 & 0.8 & 1.0 \end{bmatrix}$$

$$R_2 = \left[\mu_{R_2}(x_i, y_j)\right]_{4 \times 3} = \begin{bmatrix} 1.0 & 1.0 & 0.9 \\ 1.0 & 0.0 & 0.5 \\ 0.3 & 0.1 & 0.0 \\ 0.2 & 0.3 & 0.1 \end{bmatrix}$$

Their max-min composition is defined in its matrix form as:

$$R_1 \circ R_2 = \begin{bmatrix} 1.0 & 0.3 & 0.9 & 0.0 \\ 0.3 & 1.0 & 0.8 & 1.0 \\ 0.9 & 0.8 & 1.0 & 0.8 \\ 0.0 & 1.0 & 0.8 & 1.0 \end{bmatrix} \circ \begin{bmatrix} 1.0 & 1.0 & 0.9 \\ 1.0 & 0.0 & 0.5 \\ 0.3 & 0.1 & 0.0 \\ 0.2 & 0.3 & 0.1 \end{bmatrix}$$

Using MATLAB to compute the max-min composition:

    R1=[1.0 0.3 0.9 0.0;0.3 1.0 0.8 1.0;0.9 0.8 1.0 0.8;0.0 1.0 0.8 1.0];
    R2=[1.0 1.0 0.9;1.0 0.0 0.5;0.3 0.1 0.0;0.2 0.3 0.1];
    [r1,c1]=size(R1); [r2,c2]=size(R2);
    R0=zeros(r1,c2);
    for i=1:r1;
      for j=1:c2;
        R0(i,j)=max(min(R1(i,:),R2(:,j)'));
      end
    end
    R0

    R0 =
        1.0000    1.0000    0.9000
        1.0000    0.3000    0.5000
        0.9000    0.9000    0.9000
        1.0000    0.3000    0.5000
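The double loop can be shortened by handling one column of R2 at a time. The following is a sketch of that idea rather than code from the supplement; the matrices are the R1 and R2 of this section and the variable col is introduced here.

```matlab
% Sketch: max-min composition with one loop over the columns of R2.
R1 = [1.0 0.3 0.9 0.0; 0.3 1.0 0.8 1.0; 0.9 0.8 1.0 0.8; 0.0 1.0 0.8 1.0];
R2 = [1.0 1.0 0.9; 1.0 0.0 0.5; 0.3 0.1 0.0; 0.2 0.3 0.1];
[r1, c1] = size(R1);
[r2, c2] = size(R2);
assert(c1 == r2, 'inner dimensions must agree');

R0 = zeros(r1, c2);
for j = 1:c2
    col = repmat(R2(:, j)', r1, 1);        % column j of R2 laid along every row
    R0(:, j) = max(min(R1, col), [], 2);   % min over each pair, then max over k
end
disp(R0)
```

The structure mirrors ordinary matrix multiplication with min in place of the product and max in place of the sum.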

Chapter 4 Fuzzy Numbers

Fuzzy numbers are fuzzy sets used in connection with applications where an explicit representation of the ambiguity and uncertainty found in numerical data is desirable.

4.1 Addition and Subtraction of Discrete Fuzzy Numbers

Addition of two fuzzy numbers can be performed using the extension principle. Suppose you have two fuzzy numbers that are represented tabularly. They are the fuzzy number 3 (FN3) and the fuzzy number 7 (FN7).

    FN3 = 0/0 + 0.3/1 + 0.7/2 + 1.0/3 + 0.7/4 + 0.3/5 + 0/6
    FN7 = 0/4 + 0.2/5 + 0.6/6 + 1.0/7 + 0.6/8 + 0.2/9 + 0/10

To define these fuzzy numbers using MATLAB:

    x = [1 2 3 4 5 6 7 8 9 10];
    FN3 = [0.3 0.7 1.0 0.7 0.3 0 0 0 0 0];
    FN7 = [0 0 0 0 0.2 0.6 1.0 0.6 0.2 0];
    bar(x',[FN3' FN7']); axis([0 11 0 1.1])
    title('Fuzzy Numbers 3 and 7');
    xlabel('x'), ylabel('membership')
    text(2,1.05,'Fuzzy Number 3')
    text(6,1.05,'Fuzzy Number 7');

[Figure: Fuzzy Numbers 3 and 7; x versus membership]

Adding fuzzy number 3 to fuzzy number 7 results in a fuzzy number 10 using the alpha cut procedure described in the book. By hand we have:

    alpha cut    FN3      FN7      FN10 = FN3+FN7
    0.2          [1 5]    [5 9]    [6 14]
    0.3          [1 5]    [6 8]    [7 13]
    0.6          [2 4]    [6 8]    [8 12]
    0.7          [2 4]    [7 7]    [9 11]
    1.0          [3 3]    [7 7]    [10 10]

    FN10 = .2/6 + .3/7 + .6/8 + .7/9 + 1/10 + .7/11 + .6/12 + .3/13 + .2/14

    x=[1:1:20];
    FNSUM=zeros(size(x));
    for i=.1:.1:1
      [a1,b1]=alpha(FN3,x,i-eps); % Subtract eps to guard against round-off in the loop increments
      [a2,b2]=alpha(FN7,x,i-eps);
      a=a1+a2;
      b=b1+b2;
      FNSUM(a:b)=i*ones(size(FNSUM(a:b)));
    end
    bar(x,FNSUM); axis([0 20 0 1.1])
    title('Fuzzy Number 3+7=10')
    xlabel('x')
    ylabel('membership')

[Figure: Fuzzy Number 3+7=10; x versus membership]
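The same sum can be obtained directly from the extension principle, without alpha cuts: every pair of elements contributes the minimum of its two grades to the element at their sum, and each sum keeps the largest contribution it receives. The following is a sketch under that reading of the extension principle; the universe z and the loop layout are choices made here.

```matlab
% Sketch: discrete fuzzy addition by the sup-min extension principle.
x   = 1:10;
FN3 = [0.3 0.7 1.0 0.7 0.3 0 0 0 0 0];
FN7 = [0 0 0 0 0.2 0.6 1.0 0.6 0.2 0];

z = 1:20;                        % universe of the sum; z(k) = k
FNSUM = zeros(size(z));
for i = 1:numel(x)
    for j = 1:numel(x)
        k = x(i) + x(j);         % element of z produced by this pair
        FNSUM(k) = max(FNSUM(k), min(FN3(i), FN7(j)));
    end
end
disp([z; FNSUM]')
```

For these two fuzzy numbers the result should agree with the hand-computed FN10.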

The following program subtracts the fuzzy number 3 from the fuzzy number 8 to get a fuzzy number 8-3=5. By hand we have:

    alpha cut    FN3      FN8       FN5 = FN8-FN3
    0.2          [1 5]    [6 10]    [1 9]
    0.3          [1 5]    [7 9]     [2 8]
    0.6          [2 4]    [7 9]     [3 7]
    0.7          [2 4]    [8 8]     [4 6]
    1.0          [3 3]    [8 8]     [5 5]

    FN5 = .2/1 + .3/2 + .6/3 + .7/4 + 1/5 + .7/6 + .6/7 + .3/8 + .2/9

    x=[1:1:11];
    FN3 = [0.3 0.7 1.0 0.7 0.3 0 0 0 0 0];
    FN8 = [0 0 0 0 0 0.2 0.6 1.0 0.6 0.2];
    FNDIFF=zeros(size(x));
    for i=.1:.1:1
      [a1,a2]=alpha(FN8,x,i-eps);
      [b1,b2]=alpha(FN3,x,i-eps);
      a=a1-b2;
      b=a2-b1;
      FNDIFF(a:b)=i*ones(size(FNDIFF(a:b)));
    end
    bar(x,FNDIFF);axis([0 11 0 1.1])
    title('Fuzzy Number 8-3=5')
    xlabel('x')
    ylabel('Membership')

[Figure: Fuzzy Number 8-3=5; x versus membership]

4.2 Multiplication of Discrete Fuzzy Numbers

This program multiplies the fuzzy number 3 by the fuzzy number 7 to get a fuzzy number 3*7=21, where the fuzzy numbers 3 and 7 are defined as in Section 4.1.

The multiplication of continuous fuzzy numbers is somewhat messy and will not be implemented in MATLAB. By hand we have:

    alpha cut    FN3      FN7      FN21 = FN3*FN7
    0.2          [1 5]    [5 9]    [5 45]
    0.3          [1 5]    [6 8]    [6 40]
    0.6          [2 4]    [6 8]    [12 32]
    0.7          [2 4]    [7 7]    [14 28]
    1.0          [3 3]    [7 7]    [21 21]

    FN21 = .2/5 + .3/6 + .6/12 + .7/14 + 1/21 + .7/28 + .6/32 + .3/40 + .2/45

    x=[1:1:60]; % Universe of Discourse
    FN3 = [0.3 0.7 1.0 0.7 0.3 0 0 0 0 0 0];
    FN7 = [0 0 0 0 0.2 0.6 1.0 0.6 0.2 0 0];
    FNPROD=zeros(size(x));
    for i=.1:.1:1
      [a1,a2]=alpha(FN3,x,i-eps);
      [b1,b2]=alpha(FN7,x,i-eps);
      a=a1*b1;
      b=a2*b2;
      FNPROD(a:b)=i*ones(size(FNPROD(a:b)));
    end
    bar(x,FNPROD);axis([0 60 0 1.1])
    title('Fuzzy Number 3*7=21')
    xlabel('Fuzzy Number 21')
    ylabel('Membership')

[Figure: Fuzzy Number 3*7=21; fuzzy number 21 versus membership]

4.3 Division of Discrete Fuzzy Numbers

This program divides the fuzzy number 6 by the fuzzy number 3 to get a fuzzy number 2. The division of continuous fuzzy numbers is somewhat messy and will not be implemented in MATLAB. By hand we have:

    alpha cut    FN3      FN6      FN2 = FN6/FN3
    0.2          [1 5]    [4 8]    [4/5 8/1]
    0.3          [1 5]    [5 7]    [5/5 7/1]
    0.6          [2 4]    [5 7]    [5/4 7/2]
    0.7          [2 4]    [6 6]    [6/4 6/2]
    1.0          [3 3]    [6 6]    [6/3 6/3]

    FN2 = .2/.8 + .3/1 + .6/1.25 + .7/1.5 + 1/2 + .7/3 + .6/3.5 + .3/7 + .2/8

    x=[1:1:12]; % Universe of Discourse
    FN3 = [0.3 0.7 1.0 0.7 0.3 0 0 0 0 0];
    FN6 = [0 0 0 0.2 0.6 1.0 0.6 0.2 0 0];
    FNDIV=zeros(size(x));
    for i=.1:.1:1
      [a1,a2]=alpha(FN6,x,i-eps);
      [b1,b2]=alpha(FN3,x,i-eps);
      a=round(a1/b2);
      b=round(a2/b1);
      FNDIV(a:b)=i*ones(size(FNDIV(a:b)));
    end
    bar(x,FNDIV);axis([0 10 0 1.1])
    title('Fuzzy Number 6/3=2')
    xlabel('Fuzzy Number 2')
    ylabel('Membership')

[Figure: Fuzzy Number 6/3=2; fuzzy number 2 versus membership]
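The program rounds the quotient endpoints so they can be used as indices into the integer universe, which is why the plotted result is coarser than the hand calculation. A sketch that keeps the unrounded interval endpoints is shown below; it takes the alpha cuts with a logical mask instead of the alpha function, and the names levels, lo and hi are introduced here.

```matlab
% Sketch: alpha-cut interval division FN6/FN3 without rounding.
x   = 1:10;
FN3 = [0.3 0.7 1.0 0.7 0.3 0 0 0 0 0];
FN6 = [0 0 0 0.2 0.6 1.0 0.6 0.2 0 0];

levels = [0.2 0.3 0.6 0.7 1.0];
lo = zeros(size(levels));
hi = zeros(size(levels));
for n = 1:numel(levels)
    cut6 = x(FN6 >= levels(n));          % alpha cut of the numerator
    cut3 = x(FN3 >= levels(n));          % alpha cut of the denominator
    lo(n) = min(cut6) / max(cut3);       % smallest quotient
    hi(n) = max(cut6) / min(cut3);       % largest quotient
end
disp([levels' lo' hi'])                  % columns: level, lower, upper
```

The rows should match the right-hand column of the table, for example [0.8 8] at the 0.2 level. The interval rule used here assumes both fuzzy numbers are strictly positive.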

Chapter 5 Linguistic Descriptions and Their Analytical Form

5.1 Generalized Modus Ponens

Fuzzy linguistic descriptions are formal representations of systems made through fuzzy if/then rules. Generalized Modus Ponens (GMP) states that when a rule's antecedent is met to some degree, its consequence is inferred by the same degree.

    IF    x is A     THEN  y is B
          x is A'
    so               y is B'

This can be written using the implication relation (R(x,y)) as in the max-min composition of section 3.3.

    B'=A'°R(x,y)

Implication relations are explained in greater detail in section 5.3.

5.2 Membership Functions

This supplement contains functions that define triangular, trapezoidal, S-shaped and Π-shaped membership functions.

5.2.1 Triangular Membership Function

A triangular membership function is defined by the parameters [a b c], where a is the membership function's left intercept with grade equal to 0, b is the center peak where the grade equals 1 and c is the right intercept at grade equal to 0. The function y=triangle(x,[a b c]); is written to return the membership values corresponding to the defined universe of discourse x. The parameters that define the triangular membership function, [a b c], must be in the discretely defined universe of discourse.

For example, a triangular membership function for "x is close to 33" defined over x=[0:1:50] with [a b c]=[23 33 43] would be created with:

    x=[0:1:50];
    y=triangle(x,[23 33 43]);
    plot(x,y);
    title('Close to 33')
    xlabel('X')
    ylabel('Membership')

.. medium and hot.58. medium=[25 50 75].2 0.[mf_cool. A matrix with each row corresponding to the three fuzzy values can be constructed.9 0.'Cool') text(42. ylabel('Membership').5 0. medium and hot'). plot(x. mf_medium =triangle(x.'Medium') text(70..medium). mf_cool=triangle(x.1 0 0 10 20 X 30 40 50 A fuzzy variable temperature may have three fuzzy values: cool. xlabel('Degrees') text(20.58.mf_medium.mf_hot]) title('Temperature: cool.7 Membership 0. cool=[0 25 50].4 0. Suppose the following fuzzy value definitions are used: x=[0:100].6 0.hot). Membership functions defining these values can be constructed to overlap in the universe of discourse [0:100].Close to 33 1 0. mf_hot=triangle(x.3 0.cool).8 0.58.'Hot') 25 . hot=[50 75 100].

[Figure: Temperature: cool, medium and hot; degrees versus membership, with curves labeled Cool, Medium and Hot]

5.2.2 Trapezoidal Membership Function

As can be seen, a temperature value of 0 would have a 0 membership to all fuzzy sets. Therefore, we should use trapezoidal membership functions to define the cool and hot fuzzy sets.

    x=[0:100];
    cool=[0 0 25 50];
    medium=[15 50 75];
    hot=[50 75 100 100];
    mf_cool=trapzoid(x,cool);
    mf_medium =triangle(x,medium);
    mf_hot=trapzoid(x,hot);
    plot(x,[mf_cool;mf_medium;mf_hot]);
    title('Temperature: cool, medium and hot');
    ylabel('Membership');
    xlabel('Degrees');
    text(20,.65,'Cool')
    text(42,.65,'Medium')
    text(70,.65,'Hot')

[Figure: Temperature: cool, medium and hot; degrees versus membership, with curves labeled Cool, Medium and Hot]

The use of trapezoidal membership functions results in a 0 value of temperature being properly represented by a membership value of 1 to the fuzzy set cool. Likewise, high temperatures are properly represented with high membership values to the fuzzy set hot.

5.2.3 S-shaped Membership Function

An S-shaped membership function is defined by three parameters [α β γ] using the following equations:

$$\text{S\_shape}(x; \alpha, \beta, \gamma) = \begin{cases} 0 & \text{for } x \le \alpha \\ 2\left(\dfrac{x - \alpha}{\gamma - \alpha}\right)^2 & \text{for } \alpha \le x \le \beta \\ 1 - 2\left(\dfrac{x - \gamma}{\gamma - \alpha}\right)^2 & \text{for } \beta \le x \le \gamma \\ 1 & \text{for } \gamma \le x \end{cases}$$

where:
    α = the point where µ(x)=0
    β = the point where µ(x)=0.5
    γ = the point where µ(x)=1.0
    note: β-α must equal γ-β for continuity of slope
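The piecewise definition can be written compactly with logical masks. The sketch below is an illustration only and is not the supplement's s_shape m-file; the name s_fun is chosen here, and it assumes increasing parameters α < β < γ with β midway between α and γ.

```matlab
% Sketch of S_shape(x; alpha, beta, gamma) for p = [alpha beta gamma],
% assuming alpha < beta < gamma and beta - alpha = gamma - beta.
s_fun = @(x, p) ...
    (x > p(1) & x <= p(2)) .* (2 * ((x - p(1)) ./ (p(3) - p(1))).^2) + ...
    (x > p(2) & x <  p(3)) .* (1 - 2 * ((x - p(3)) ./ (p(3) - p(1))).^2) + ...
    (x >= p(3));

x = 0:100;
hot = [50 75 100];
mf_hot = s_fun(x, hot);

% Grades at the three defining points alpha, beta and gamma
disp([mf_hot(x == 50), mf_hot(x == 75), mf_hot(x == 100)])
```

At the three defining points the grades should come out as 0, 0.5 and 1, matching the definitions of α, β and γ.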

    x=[0:100];
    cool=[50 25 0];
    hot=[50 75 100];
    mf_cool=s_shape(x,cool);
    mf_hot=s_shape(x,hot);
    plot(x,[mf_cool;mf_hot]);
    title('Temperature: cool and hot');
    ylabel('Membership');
    xlabel('Degrees');
    text(8,.45,'Cool')
    text(82,.45,'Hot')

[Figure: Temperature: cool and hot; degrees versus membership, with curves labeled Cool and Hot]

5.2.4 Π-shaped Membership Function

A Π-shaped membership function is defined by two parameters [γ, δ] using the following equations:

$$\text{P\_shape}(x; \gamma, \delta) = \begin{cases} \text{S\_shape}\left(x;\ \gamma - \delta,\ \gamma - \dfrac{\delta}{2},\ \gamma\right) & \text{for } x \le \gamma \\ 1 - \text{S\_shape}\left(x;\ \gamma,\ \gamma + \dfrac{\delta}{2},\ \gamma + \delta\right) & \text{for } x \ge \gamma \end{cases}$$

where:
    γ = center of the membership function
    δ = width of the membership function at grade = 0.5

    x=[0:100];

    cool=[25 20];
    medium=[50 20];
    hot=[75 20];
    mf_cool=p_shape(x,cool);
    mf_medium =p_shape(x,medium);
    mf_hot=p_shape(x,hot);
    plot(x,[mf_cool;mf_medium;mf_hot]);
    title('Temperature: cool, medium and hot');
    ylabel('Membership');
    xlabel('Degrees');
    text(20,.55,'Cool')
    text(42,.55,'Medium')
    text(70,.55,'Hot')

[Figure: Temperature: cool, medium and hot; degrees versus membership, with curves labeled Cool, Medium and Hot]

5.2.5 Defuzzification of a Fuzzy Set

Defuzzification is the process of representing a fuzzy set with a crisp number and is discussed in Section 6.3 of the text. Internal representations of data in a fuzzy system are usually fuzzy sets, but the output frequently needs to be a crisp number that can be used to perform a function, such as commanding a valve to a desired position. The most commonly used defuzzification method is the center of area method, also commonly referred to as the centroid method. This method determines the center of area of the fuzzy set and returns the corresponding crisp value. The function centroid(universe, grades) performs this function by using a method similar to that of finding a balance point on a loaded beam.

    function [center] = centroid(x,y);

    %CENTROID Calculates Centroid
    % [center] = centroid(universe,grades)
    %
    % universe: row vector defining the universe of discourse.
    % grades: row vector of corresponding membership.
    % center: crisp number defining the centroid.
    %
    center=(x*y')/sum(y);

To illustrate this method, we will defuzzify the following triangular fuzzy set and plot the result using c_plot:

    x=[10:150];
    y=triangle(x,[32 67 130]);
    center=centroid(x,y);
    c_plot(x,y,center,'Centroid')

[Figure: Centroid is at 76.33]

There are several other defuzzification methods including mean of max, max of max and min of max. The following function implements mean of max defuzzification: mom(universe,grades).

    x=[10:150];
    y=trapzoid(x,[32 67 74 130]);
    center=mom(x,y);
    c_plot(x,y,center,'Mean of Max');
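Both defuzzifiers reduce to one line each once the grades are available. The sketch below is self-contained, so it builds the membership functions from small helper functions defined here (rise, fall and trap are illustrative names, not the supplement's triangle, trapzoid or mom m-files).

```matlab
% Helpers (assumed here): linear rising and falling edges, trapezoid from both.
rise = @(u, a, b) (u >= b) + (u > a & u < b) .* (u - a) ./ max(b - a, eps);
fall = @(u, c, d) (u <= c) + (u > c & u < d) .* (d - u) ./ max(d - c, eps);
trap = @(u, p) min(rise(u, p(1), p(2)), fall(u, p(3), p(4)));

x = 10:150;
y_tri  = trap(x, [32 67 67 130]);     % triangle written as a degenerate trapezoid
y_trap = trap(x, [32 67 74 130]);

% Centroid: grade-weighted average of the universe (balance point of a beam)
center_cog = (x * y_tri') / sum(y_tri);

% Mean of max: average of the elements that attain the largest grade
center_mom = mean(x(y_trap == max(y_trap)));

fprintf('centroid = %.2f, mean of max = %.1f\n', center_cog, center_mom);
```

The centroid of the triangle should land near 76.33, and the mean of max of the trapezoid at the middle of its plateau.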

[Figure: Mean of Max is at 70.5]

5.2.6 Compound Values

Connectives such as AND and OR, and modifiers such as NOT, VERY, and MORE or LESS can be used to generate compound values from primary values:

    OR corresponds to max or union
    AND corresponds to min or intersection
    NOT corresponds to the complement and is calculated by the function not(MF).
    VERY, MORE or LESS, etc. correspond to various degrees of contrast intensification.

Temperature is NOT cool AND NOT hot is a fuzzy set represented by:

    x=[0:100];
    cool=[0 0 25 50];
    hot=[50 75 100 100];
    mf_cool=trapzoid(x,cool);
    mf_hot=trapzoid(x,hot);
    not_cool=not(mf_cool);
    not_hot=not(mf_hot);
    answer=min([not_hot;not_cool]);
    plot(x,answer);
    title('Temperature is NOT hot AND NOT cool');
    ylabel('Membership');
    xlabel('Degrees');

[Figure: Temperature is NOT hot AND NOT cool; degrees versus membership]

VERY and MORE or LESS are called linguistic modifiers. These can be implemented by taking the square (VERY) or square root (MORE or LESS) of the membership values. These modifiers are implemented with the very(MF) and moreless(MF) functions. For example, NOT VERY hot would be represented as:

    not_very_hot=not(very(trapzoid(x,hot)));
    plot(x,not_very_hot);
    title('NOT VERY hot');
    ylabel('Membership');xlabel('Degrees');

[Figure: NOT VERY hot; degrees versus membership]

Similarly, MORE or LESS hot would be represented as:

    ml_hot=moreless(trapzoid(x,hot));
    plot(x,ml_hot);
    title('Temperature is More or Less hot');
    ylabel('Membership');xlabel('Degrees');

[Figure: Temperature is More or Less hot; degrees versus membership]

Note that some membership functions are affected by linguistic modifiers more than others. For example, a membership function that only has crisp values, such as a hardlimit membership function, would not be affected at all.

5.3 Implication Relations

The underlying analytical form of an if/then rule is a fuzzy relation called an implication relation: R(x,y). There are several implication relation operators (φ) including:

    Zadeh Max-Min Implication Operator:   $\phi[\mu_A(x), \mu_B(y)] = (\mu_A(x) \wedge \mu_B(y)) \vee (1 - \mu_A(x))$
    Mamdani Min Implication Operator:     $\phi[\mu_A(x), \mu_B(y)] = \mu_A(x) \wedge \mu_B(y)$
    Larson Product Implication Operator:  $\phi[\mu_A(x), \mu_B(y)] = \mu_A(x) \cdot \mu_B(y)$

To illustrate the Mamdani Min implication operator, suppose there is a rule that states:

    if    x is "Fuzzy Number 3"

    then  y is "Fuzzy Number 7"

For the Fuzzy Number 3 of section 4.1, if the input x is a 2, it matches the set "Fuzzy Number 3" with a value of 0.7. This value is called the "Degree of Fulfillment" (DOF) of the antecedent. Therefore, the consequence should be met with a degree of 0.7, which results in the output fuzzy number being clipped to a maximum of 0.7. To perform this operation we construct a function called clip(FS,level).

    mua=1./(1+0.01.*(x-50).^2);
    clip_mua=clip(mua,0.2);
    plot(x,clip_mua);
    title('Fuzzy Set A Clipped to a 0.2 Level');
    xlabel('x');
    ylabel('Membership of x');

[Figure: Fuzzy Set A Clipped to a 0.2 Level; x versus membership of x]

Referring back to the discrete example:

    if    x is "Fuzzy Number 3"
    then  y is "Fuzzy Number 7"

and x is equal to 2, then the output y is Fuzzy Number 7 clipped at the degree of fulfillment of x = 2 in Fuzzy Number 3.

    x= [0 1 2 3 4 5 6 7 8 9 10];
    FN3 = [0 0.3 0.7 1.0 0.7 0.3 0 0 0 0 0];
    FN7 = [0 0 0 0 0 0.2 0.6 1.0 0.6 0.2 0];
    degree=FN3(find(x==2));
    y=clip(FN7,degree);
    plot(x,y);
    axis([0 10 0 1])
    title('Mamdani Min Output of Fuzzy Rule');
    xlabel('x');

    ylabel('Output Fuzzy Set');

[Figure: Mamdani Min Output of Fuzzy Rule; x versus output fuzzy set]

This example shows the basic foundation of a rule based fuzzy logic system. We can see that using discrete membership functions of very rough granularity may not provide the precision that one may desire. Membership functions with finer granularity should be used.

To illustrate the use of the Larson Product implication relation, suppose there is a rule that states:

    if    x is "Fuzzy Number 3"
    then  y is "Fuzzy Number 7"

For the Fuzzy Number 3 of section 4.1, if the input x is a 2, it matches the antecedent fuzzy set "Fuzzy Number 3" with a degree of fulfillment of 0.7. The Larson Product implication operator scales the consequence by the degree of fulfillment, which is 0.7, and results in the output fuzzy number being scaled to a maximum of 0.7. The function product(FS,level) performs the Larson Product operation.

    x=[0:1:100];
    mua=1./(1+0.01.*(x-50).^2);
    prod_mua=product(mua,.7);
    plot(x,prod_mua)
    axis([min(x) max(x) 0 1]);
    title('Fuzzy Set A Scaled to a 0.7 Level');
    xlabel('x');
    ylabel('Membership of x');

[Figure: Fuzzy Set A Scaled to a 0.7 Level; x versus membership of x]

Referring back to the highly granular discrete example:

    if    x is "Fuzzy Number 3"
    then  y is "Fuzzy Number 7"

and x is equal to 2, then the output y is "Fuzzy Number 7" scaled down to the antecedent's degree of fulfillment.

    x= [0 1 2 3 4 5 6 7 8 9 10];
    FN3 = [0 0.3 0.7 1.0 0.7 0.3 0 0 0 0 0];
    FN7 = [0 0 0 0 0 0.2 0.6 1.0 0.6 0.2 0];
    degree=FN3(find(x==2));
    y=product(FN7,degree);
    plot(x,y);
    axis([0 10 0 1.0])
    title('Larson Product Output of Fuzzy Rule');
    xlabel('x');
    ylabel('Output Fuzzy Set');

[Figure: Larson Product Output of Fuzzy Rule; x versus output fuzzy set]

5.4 Fuzzy Algorithms

Now that we can manipulate Fuzzy Rules we can combine them into Fuzzy Algorithms. A Fuzzy Algorithm is a procedure for performing a task formulated by a collection of fuzzy if/then rules. These rules are usually connected by ELSE statements.

    if    x is A1    then    y is B1    ELSE
    if    x is A2    then    y is B2    ELSE
    ...
    if    x is An    then    y is Bn

ELSE is interpreted differently for different implication operators:

    Zadeh Max-Min Implication Operator      AND
    Mamdani Min Implication Operator        OR
    Larson Product Implication Operator     OR

As a first example, consider a fuzzy algorithm that controls a fan's speed. The input is the crisp value of temperature and the output is a crisp value for the fan speed. Suppose the fuzzy system is defined as:

    if    Temperature is Cool        then    Fan_speed is Low       ELSE
    if    Temperature is Moderate    then    Fan_speed is Medium    ELSE
    if    Temperature is Hot         then    Fan_speed is High

This system has three fuzzy rules where the antecedent membership functions Cool, Moderate, Hot and consequent membership functions Low, Medium, High are defined by the following fuzzy sets over the given universes of discourse:

    % Universe of Discourse
    x = [0:1:120]; % Temperature
    y = [0:1:10]; % Fan Speed

    % Temperature
    cool_mf = trapzoid(x,[0 0 30 50]);
    moderate_mf = triangle(x,[30 55 80]);
    hot_mf = trapzoid(x,[60 80 120 120]);
    antecedent_mf = [cool_mf;moderate_mf;hot_mf];
    plot(x,antecedent_mf)
    title('Cool, Moderate and Hot Temperatures')
    xlabel('Temperature')
    ylabel('Membership')

[Figure: Cool, Moderate and Hot Temperatures; temperature versus membership]

    % Fan Speed
    low_mf = trapzoid(y,[0 0 2 5]);
    medium_mf = trapzoid(y,[2 4 6 8]);
    high_mf = trapzoid(y,[5 8 10 10]);
    consequent_mf = [low_mf;medium_mf;high_mf];
    plot(y,consequent_mf)
    title('Low, Medium and High Fan Speeds')
    xlabel('Fan Speed')
    ylabel('Membership')

[Figure: Low, Medium and High Fan Speeds; fan speed versus membership]

Now that we have the membership functions defined we can perform the five steps of evaluating fuzzy algorithms:

    1. Fuzzify the input.
    2. Apply a fuzzy operator.
    3. Apply an implication operation.
    4. Aggregate the outputs.
    5. Defuzzify the output.

First we fuzzify the input. The output of the first step is the degree of fulfillment of each rule. Suppose the input is Temperature = 72.

    temp = 72;
    dof1 = cool_mf(find(x==temp));
    dof2 = moderate_mf(find(x == temp));
    dof3 = hot_mf(find(x == temp));
    DOF = [dof1;dof2;dof3]

    DOF =
             0
        0.3200
        0.6000

Doing this in matrix notation:

    temp=72;
    DOF=antecedent_mf(:,find(x==temp))

    DOF =
             0
        0.3200
        0.6000
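The lookup find(x==temp) only works when the input falls exactly on the universe of discourse. For an input between grid points, such as 72.5, it returns an empty result. One possible workaround, which is the approach the tank example of Chapter 6 takes, is to interpolate the membership functions with interp1. The sketch below is self-contained, so the membership functions are built from helpers defined here (rise, fall and trap are illustrative names, not the supplement's m-files).

```matlab
% Helpers (assumed here): linear rising and falling edges, trapezoid from both.
rise = @(u, a, b) (u >= b) + (u > a & u < b) .* (u - a) ./ max(b - a, eps);
fall = @(u, c, d) (u <= c) + (u > c & u < d) .* (d - u) ./ max(d - c, eps);
trap = @(u, p) min(rise(u, p(1), p(2)), fall(u, p(3), p(4)));

x = 0:1:120;                                       % temperature universe
antecedent_mf = [trap(x, [0 0 30 50]);             % cool
                 trap(x, [30 55 55 80]);           % moderate (triangle)
                 trap(x, [60 80 120 120])];        % hot

temp = 72.5;                                       % not on the integer grid
on_grid = ~isempty(find(x == temp));               % false for this input

% Linear interpolation of each membership function at the input value
DOF = interp1(x', antecedent_mf', temp)';
disp(DOF)
```

For an input that does lie on the grid, the interpolated degrees of fulfillment coincide with the direct lookup.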

There is no fuzzy operator (AND, OR) since each rule has only one input. Next we apply a fuzzy implication operation. Suppose we choose the Larson Product implication operation.

    consequent1 = product(low_mf,dof1);
    consequent2 = product(medium_mf,dof2);
    consequent3 = product(high_mf,dof3);
    plot(y,[consequent1;consequent2;consequent3])
    axis([0 10 0 1.0])
    title('Consequent Fuzzy Set')
    xlabel('Fan Speed')
    ylabel('Membership')

[Figure: Consequent Fuzzy Set; fan speed versus membership]

Or again, in matrix notation:

    consequent = product(consequent_mf,DOF);
    plot(y,consequent)
    axis([0 10 0 1.0])
    title('Consequent Fuzzy Set')
    xlabel('Fan Speed')
    ylabel('Membership')

[Figure: Consequent Fuzzy Set; fan speed versus membership]

Next we need to aggregate the consequent fuzzy sets. We will use the max operator.

    Output_mf=max([consequent1;consequent2;consequent3]);
    plot(y,Output_mf)
    axis([0 10 0 1])
    title('Output Fuzzy Set')
    xlabel('Fan Speed')
    ylabel('Membership')

[Figure: Output Fuzzy Set; fan speed versus membership]

    Output_mf = max(consequent);
    plot(y,Output_mf)
    axis([0 10 0 1]);title('Output Fuzzy Set')
    xlabel('Fan Speed');ylabel('Membership')

[Figure: Output Fuzzy Set; fan speed versus membership]

Lastly we defuzzify the output set to obtain a crisp value.

    output=centroid(y,Output_mf);
    c_plot(y,Output_mf,output,'Crisp Output');

[Figure: Crisp Output is at 7.313]
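For readers without the supplement's m-files at hand, the five steps for the fan example can be sketched in one self-contained script. The helpers rise, fall and trap below are stand-ins written for this illustration and are assumed to behave like triangle and trapzoid on these universes; bsxfun plays the role of product and the last line the role of centroid.

```matlab
% Helpers (assumed here): linear rising and falling edges, trapezoid from both.
rise = @(u, a, b) (u >= b) + (u > a & u < b) .* (u - a) ./ max(b - a, eps);
fall = @(u, c, d) (u <= c) + (u > c & u < d) .* (d - u) ./ max(d - c, eps);
trap = @(u, p) min(rise(u, p(1), p(2)), fall(u, p(3), p(4)));

x = 0:1:120;                                   % temperature universe
y = 0:1:10;                                    % fan speed universe
antecedent_mf = [trap(x, [0 0 30 50]);         % cool
                 trap(x, [30 55 55 80]);       % moderate (triangle)
                 trap(x, [60 80 120 120])];    % hot
consequent_mf = [trap(y, [0 0 2 5]);           % low
                 trap(y, [2 4 6 8]);           % medium
                 trap(y, [5 8 10 10])];        % high

temp = 72;
DOF = antecedent_mf(:, x == temp);                 % 1. fuzzify (one DOF per rule)
                                                   % 2. no AND/OR: single-input rules
consequent = bsxfun(@times, consequent_mf, DOF);   % 3. Larson product implication
Output_mf  = max(consequent, [], 1);               % 4. aggregate with max
output = (y * Output_mf') / sum(Output_mf);        % 5. centroid defuzzification

fprintf('crisp fan speed = %.3f\n', output);
```

Under these assumptions the script should reproduce the crisp output of about 7.313 shown above.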

The crisp output of the fuzzy rules states that the fan speed should be set to a value of 7.3 for a temperature of 72 degrees. To see the output for different input temperatures, we write a loop that covers the input universe of discourse and computes the output for each input temperature. Note: you must have already run the code fragments that set up the membership functions and define the universe of discourse to run this example.

    outputs=zeros(size([1:1:100]));
    for temp=1:1:100
      DOF=antecedent_mf(:,find(x==temp)); %Fuzzification
      consequent = product(consequent_mf,DOF); %Implication
      Output_mf = max(consequent); %Aggregation
      output=centroid(y,Output_mf); %Defuzzification
      outputs(temp)=output;
    end
    plot([1:1:100],outputs)
    title('Fuzzy System Input Output Relationship')
    xlabel('Temperature')
    ylabel('Fan Speed')

[Figure: Fuzzy System Input Output Relationship; temperature versus fan speed]

We see that the input/output relationship is non-linear. The next chapter will demonstrate fuzzy tank level control when Fuzzy Operators are included.

Chapter 6 Fuzzy Control

Fuzzy control refers to the control of processes through the use of fuzzy linguistic descriptions. For additional reading on fuzzy control see DeSilva, 1995; Jamshidi, Vadiee and Ross, 1993; or Kandel and Langholz, 1994.

6.1 Tank Level Fuzzy Control

A tank is filled by means of a valve and continuously drains. The level is measured and compared to a level setpoint, forming a level error. This error is used by a controller to position the valve to make the measured level equal to the desired level. The setup is shown below and is used in a laboratory at The University of Tennessee for fuzzy and neural control experiments.

[Figure: Tank level experiment; tank with dimensions marked 35" and 11", level h, pressure transducer, servo valve, water in, water out]

This is a nonlinear control problem since the dynamics of the plant depend on the water level through the square root of the level. There also may be some non-linearities due to the valve flow characteristics. The following equations model the process.

$$\dot{h} = \frac{V_{in} - V_{out}}{Area}$$

$$V_{out} = K\sqrt{h}$$

$$V_{in} = f(u)$$

$$Area = \pi R^2 = A_k$$

K is the resistance in the outlet piping
u is the valve position

$$\dot{h} = \frac{f(u) - K\sqrt{h}}{A_k} = \frac{f(u)}{A_k} - \frac{K\sqrt{h}}{A_k}$$

$$\dot{h} A_k + K\sqrt{h} = f(u)$$

These equations can be used to model the plant in SIMULINK.

[Figure: WATER TANK MODEL REPRESENTATION; SIMULINK block diagram with input control voltage, voltage offset, conversion of valve voltage to % open, fill rate at 100% open (H2O in, in^3/sec), drain dynamics including sqrt (H2O out, in^3/sec), 1/tank area, flow rate h' (in/sec), limited integrator (0-36"), level offset of 11, and tank level output]

The non-linearities are apparent when linearizing the plant around different operating levels. This can be done using LINMOD.

    [a,b,c,d]=linmod('tank',3,.1732051)

resulting in:

    a = -0.0289
    b = 1.0
    c = 1.0

For different operating levels we have:

    for h=1    pole at -0.05
    for h=2    pole at -0.0354
    for h=3    pole at -0.0289
    for h=4    pole at -0.025

This nonlinearity makes control with a PID controller difficult unless gain scheduling is used. A controller designed to meet certain performance specifications at a low level such as h=1 may not meet those specifications at a higher level such as h=4. Therefore, a fuzzy controller may be a viable alternative.

The fuzzy controller described in the book uses two input variables [error, change in error] to control valve position.
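The listed poles are consistent with linearizing the level equation by hand. Differentiating $\dot{h} = f(u)/A_k - (K/A_k)\sqrt{h}$ with respect to h at an operating level $h_0$ gives a pole at $-K/(2 A_k \sqrt{h_0})$. The sketch below evaluates this expression. The ratio K/A_k = 0.1 is an assumption inferred from the pole at h=1 and is not stated in the text, and the names K_over_Ak, h0 and u0 are chosen here.

```matlab
% Sketch: poles of the linearized tank at several operating levels.
K_over_Ak = 0.1;                  % ASSUMED: inferred from the pole of -0.05 at h = 1
h0 = [1 2 3 4];                   % operating levels

poles = -K_over_Ak ./ (2 * sqrt(h0));   % d(hdot)/dh evaluated at h0

% Normalized inflow f(u)/A_k that holds the level at h = 3 (hdot = 0)
u0 = K_over_Ak * sqrt(3);

disp([h0' poles'])
fprintf('equilibrium inflow at h = 3: %.7f\n', u0);
```

Under this assumption the equilibrium inflow at h=3 comes out close to the third argument passed to linmod above, which suggests that argument is the operating-point input.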

The membership functions were chosen to be:

    Error:            nb, ns, z, ps, pb
    Change in Error:  n, ze, p
    Valve Position:   vh, high, med, low, vl

Where:

    nb, ns, z, ps, pb = negative big, negative small, zero, positive small, positive big
    n, ze, p = negative, zero, positive
    vh, high, med, low, vl = very high, high, medium, low, very low

Fifteen fuzzy rules are used to account for each combination of input variables:

    1. if (error is nb) AND (del_error is n) then (control is high) (1) ELSE
    2. if (error is nb) AND (del_error is ze) then (control is vh) (1) ELSE
    3. if (error is nb) AND (del_error is p) then (control is vh) (1) ELSE
    4. if (error is ns) AND (del_error is n) then (control is high) (1) ELSE
    5. if (error is ns) AND (del_error is ze) then (control is high) (1) ELSE
    6. if (error is ns) AND (del_error is p) then (control is med) (1) ELSE
    7. if (error is z) AND (del_error is n) then (control is med) (1) ELSE
    8. if (error is z) AND (del_error is ze) then (control is med) (1) ELSE
    9. if (error is z) AND (del_error is p) then (control is med) (1) ELSE
    10. if (error is ps) AND (del_error is n) then (control is med) (1) ELSE
    11. if (error is ps) AND (del_error is ze) then (control is low) (1) ELSE
    12. if (error is ps) AND (del_error is p) then (control is low) (1) ELSE
    13. if (error is pb) AND (del_error is n) then (control is low) (1) ELSE
    14. if (error is pb) AND (del_error is ze) then (control is vl) (1) ELSE
    15. if (error is pb) AND (del_error is p) then (control is vl) (1)

The membership functions were manually tuned by trial and error to give good controller performance. Automatic adaptation of membership functions will be discussed in Chapter 13. The resulting membership functions are:

    Level_error = [-36:0.1:36];
    nb = trapzoid(Level_error,[-36 -36 -10 -5]);
    ns = triangle(Level_error,[-10 -2 0]);
    z = triangle(Level_error,[-1 0 1]);
    ps = triangle(Level_error,[0 2 10]);
    pb = trapzoid(Level_error,[5 10 36 36]);
    l_error = [nb;ns;z;ps;pb];
    plot(Level_error,l_error);
    title('Level Error Membership Functions')
    xlabel('Level Error')
    ylabel('Membership')

[Figure: Level Error Membership Functions; level error versus membership]

    Del_error = [-40:.1:40];
    p = trapzoid(Del_error,[-40 -40 -2 0]);
    ze = triangle(Del_error,[-1 0 1]);
    n = trapzoid(Del_error,[0 2 40 40]);
    d_error = [p;ze;n];
    plot(Del_error,d_error);
    title('Level Rate Membership Functions')
    xlabel('Level Rate')
    ylabel('Membership')

[Figure: Level Rate Membership Functions; level rate versus membership]

    Control = [-4.5:0.05:1];
    vh = triangle(Control,[0 1 1]);
    high = triangle(Control,[-1 0 1]);
    med = triangle(Control,[-3 -2 -1]);
    low = triangle(Control,[-4.5 -3.95 -3]);
    vl = triangle(Control,[-4.5 -4.5 -3.95]);
    control=[vh;high;med;low;vl];
    plot(Control,control);
    title('Output Voltage Membership Functions')
    xlabel('Control Voltage')
    ylabel('Membership')

[Figure: Output Voltage Membership Functions; control voltage versus membership]

A Mamdani fuzzy system that uses centroid defuzzification will now be created. Test results show that the fuzzy system outperforms a PID controller. There was practically no overshoot, and the speed of response was only limited by the inlet supply pressure and output piping resistance. Suppose the following error and change in error are input to the fuzzy controller. First, the degrees of fulfillment of the antecedent membership functions are calculated.

    error=-8.1;
    derror=0.3;
    DOF1=interp1(Level_error',l_error',error')';
    DOF2=interp1(Del_error',d_error',derror')';

Next, the fuzzy relation operations inherent in the 15 rules are performed.

    antecedent_DOF = [min(DOF1(1), DOF2(1))
    min(DOF1(1), DOF2(2))
    min(DOF1(1), DOF2(3))

1500 0 0.:) control(4.0]) title('Consequent of Fuzzy Rules') xlabel('Control Voltage') ylabel('Membership') 49 . Consequent = product(consequent. min(DOF1(3).:) control(1.:) control(4. min(DOF1(4).:) control(3. min(DOF1(4).:) control(3.:) control(4.:) control(3. min(DOF1(4). min(DOF1(2).2375 0. min(DOF1(3).:) control(5.:) control(1.:) control(2.:) control(2.antecedent_DOF).1500 0 0 0 0 0 0 0 0 0 consequent = [control(5. plot(Control.:) control(2.Consequent) axis([min(Control) max(Control) 0 1. min(DOF1(5). min(DOF1(3).min(DOF1(2). min(DOF1(5).6200 0.:) control(3. DOF2(1)) DOF2(2)) DOF2(3)) DOF2(1)) DOF2(2)) DOF2(3)) DOF2(1)) DOF2(2)) DOF2(3)) DOF2(1)) DOF2(2)) DOF2(3))] antecedent_DOF = 0 0.:) control(3. min(DOF1(5). min(DOF1(2).:)].

[Figure: Consequent of Fuzzy Rules; control voltage versus membership]

The fuzzy output sets are aggregated to form a single fuzzy output set.

    aggregation = max(Consequent);
    plot(Control,aggregation)
    axis([min(Control) max(Control) 0 1.0])
    title('Aggregation of Fuzzy Rule Outputs')
    xlabel('Control Voltage')
    ylabel('Membership')

[Figure: Aggregation of Fuzzy Rule Outputs; control voltage versus membership]

The output fuzzy set is defuzzified to find the crisp output voltage.

    output=centroid(Control,aggregation);
    c_plot(Control,aggregation,output,'Crisp Output Value for Voltage')
    axis([min(Control) max(Control) 0 1.0])
    xlabel('Control Voltage');

[Figure: Crisp Output Value for Voltage is at -3.402; control voltage versus membership]

For these inputs, a voltage of -3.4 would be sent to the control valve.

Now that we have the five steps of evaluating fuzzy algorithms defined (fuzzification, apply fuzzy operator, apply implication operation, aggregation and defuzzification), we can combine them into a function that is called at each controller voltage update. The level error and change in level error will be passed to the fuzzy controller function and the command valve actuator voltage will be passed back. This function, named tankctrl(), is included as an m-file. The universes of discourse and membership functions are initialized by a MATLAB script named tankinit. These variables are made to be global MATLAB variables because they need to be used by the fuzzy controller function.

The differential equations that model the tank are contained in a function called tank_mod.m. It is passed the current state of the tank (tank level) and the control valve voltage, and it passes back the next state of the tank. A demonstration of the operation of the tank with its controller is given in the function tankdemo(initial_level,desired_level). You may try running tankdemo with different initial and target levels. This function plots the result of a 40 second simulation; this may take from 10 seconds to a minute or two depending on the speed of the computer used for the simulation.

    tankdemo(24,11.2)

The tank and controller are simulated for 40 seconds, please be patient.

[Figure: Tank Level Response (x-axis: Time (sec); y-axis: Level (in))]

As you can see, the controller has very good response characteristics. There is very low steady state error and no overshoot. The speed of response is mostly controlled by the piping and valve resistances. The first second of the simulation is before feedback occurs, so disregard that data point.

By changing the membership functions and rules, you can get different response characteristics. The steady state error is controlled by the width of the zero level error membership function. Keeping this membership function thin keeps the steady state error small.

Chapter 7 Fundamentals of Neural Networks

The MathWorks markets a Neural Networks Toolbox. A description of it can be found at http://www.mathworks.com/neural.html. Other MATLAB based Neural Network tools are the NNSYSID Toolbox at http://kalman.iau.dtu.dk/Projects/proj/nnsysid.html and the NNCTRL toolkit at http://www.iau.dtu.dk/Projects/proj/nnctrl.html. These are freeware toolkits for system identification and control.

7.1 Artificial Neuron

The standard artificial neuron is a processing element whose output is calculated by multiplying its inputs by a weight vector, summing the results, and applying an activation function to the sum.

$$y = f\left(\sum_{k=1}^{n} x_k w_k + b_k\right)$$

The following figure depicts an artificial neuron with n inputs.

[Figure: Artificial Neuron (inputs x1, x2, x3, ..., xn; weights w1, ..., wn; bias; sum; activation function f(); output)]

The activation function could be one of many types. A linear activation function's output is simply equal to its input:

$$f(x) = x$$

    x=[-5:0.1:5];
    y=linear(x);
    plot(x,y)
    title('Linear Activation Function')
    xlabel('x')
    ylabel('Linear(x)')

[Figure: Linear Activation Function (x-axis: x; y-axis: Linear(x))]

There are several types of non-linear activation functions. Differentiable, non-linear activation functions can be used in networks trained with backpropagation. The most common are the logistic function and the hyperbolic tangent function.

$$f(x) = \tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$$

Note that the output range of the hyperbolic tangent function is between -1 and 1.

    x=[-3:0.1:3];
    y=tanh(x);
    plot(x,y)
    title('Hyperbolic Tangent Activation Function')
    xlabel('x')
    ylabel('tanh(x)')

[Figure: Hyperbolic Tangent Activation Function (x-axis: x; y-axis: tanh(x))]

$$f(x) = \mathrm{logistic}(x) = \frac{1}{1 + \exp(-\beta x)}$$

where β is the slope constant. We will always consider β to be one but it can be changed. Note that the output range of the logistic function is between 0 and 1.

    x=[-5:0.1:5];
    y=logistic(x);
    plot(x,y)
    title('Logistic Activation Function')
    xlabel('x');ylabel('logistic(x)')

[Figure: Logistic Activation Function (x-axis: x; y-axis: logistic(x))]

Non-differentiable non-linear activation functions are usually used as outputs of perceptrons and competitive networks. There are two types: the threshold function's output is either a 0 or 1 and the signum's output is a -1 or 1.

    x=[-5:0.1:5];y=thresh(x);
    plot(x,y);title('Thresh Activation Function')
    xlabel('x');ylabel('thresh(x)')

[Figure: Thresh Activation Function (x-axis: x; y-axis: thresh(x))]

    x=[-5:0.1:5];

    y=signum(x);
    plot(x,y)
    title('Signum Activation Function')
    xlabel('x')
    ylabel('signum(x)')

[Figure: Signum Activation Function (x-axis: x; y-axis: signum(x))]

Note that the activation functions defined above can take a vector as input, and output a vector by performing the operation on each element of the input vector.

    x=[-1 0 1];
    linear(x)
    logistic(x)
    tanh(x)
    thresh(x)
    signum(x)

    ans =
        -1     0     1
    ans =
        0.2689    0.5000    0.7311
    ans =
       -0.7616         0    0.7616
    ans =
         0     1     1
    ans =
        -1    -1     1

The output of a neuron is easily computed by using vector multiplication of the input and weights, and adding the bias. Suppose you have an input vector x=[2 4 6], and a weight matrix [.5 -.25 .33] with a bias of -0.8. If the activation function is a hyperbolic tangent function, the output of the artificial neuron defined above is

    x=[2 4 6]';
    w=[0.5 -0.25 0.33];
    b=-0.8;
    y=tanh(w*x+b)

    y =
        0.8275

7.2 Single Layer Neural Network

Neurons are grouped into layers and layers are grouped into networks to form highly interconnected processing structures. An input layer does no processing, it simply sends the inputs, modified by a weight, to each of the neurons in the next layer. This next layer can be a hidden layer or the output layer in a single layer design.

A bias is included in the neurons to allow the activation functions to be offset from zero. One method of implementing a bias is to use a dummy input node with a magnitude of 1. The weights connecting this dummy node to the next layer are the actual bias values.

[Figure: Single Layer Network (input layer x0=1, x1, x2, x3; weight matrix W; output layer y1, y2)]

Suppose we have a single layer network with three input neurons and two output neurons as shown above. The outputs would be computed using matrix algebra in either of the two forms. The second form augments the input matrix with a dummy node and embeds the bias values into the weight matrix.

Form 1:

$$y = \tanh(w x + b) = \tanh\left(\begin{bmatrix} 0.5 & -0.25 & 0.33 \\ 0.2 & -0.75 & -0.5 \end{bmatrix}\begin{bmatrix} 2 \\ 4 \\ 6 \end{bmatrix} + \begin{bmatrix} 0.4 \\ -1.2 \end{bmatrix}\right)$$

    x=[2 4 6]';
    w=[0.5 -0.25 0.33; 0.2 -0.75 -0.5];
    b=[0.4 -1.2]';
    y=tanh(w*x+b)

    y =
        0.9830
       -1.0000

Form 2:

$$y = \tanh(w x) = \tanh\left(\begin{bmatrix} 0.4 & 0.5 & -0.25 & 0.33 \\ -1.2 & 0.2 & -0.75 & -0.5 \end{bmatrix}\begin{bmatrix} 1 \\ 2 \\ 4 \\ 6 \end{bmatrix}\right)$$

    x=[1 2 4 6]';
    w=[0.4 0.5 -0.25 0.33; -1.2 0.2 -0.75 -0.5];
    y=tanh(w*x)

    y =
        0.9830
       -1.0000

7.3 Rosenblatt's Perceptron

The simplest single layer neuron is the perceptron, which was developed by Frank Rosenblatt [1958]. A perceptron is a neural network composed of a single layer feedforward network using threshold activation functions. Feed-forward means that all the interconnections between the layers propagate forward to the next layer. The figure below shows a single layer perceptron with two inputs and one output.

[Figure: Perceptron (inputs x1, x2; weights w1, w2; neuron with sum and bias; output y)]

The simple perceptron uses the threshold activation function with a bias and thus has a binary output. The binary output perceptron has two possible outputs: 0 and 1. It is trained by supervised learning and can only classify input patterns that are linearly separable [Minsky 1969]. The next section gives an example of linearly separable data that the perceptron can properly classify.

Training is accomplished by initializing the weights and bias to small random values and then presenting input data to the network. The output (y) is compared to the target output (t=0 or t=1) and the weights are adapted according to Hebb's training rule [Hebb, 1949]:

"When the synaptic input and the neuron output are both active, the strength of the connection between the input and the output is enhanced."

This rule can be implemented as:

    if y == target
        w = w;       % Correct output, no change.
    elseif y == 0
        w = w+x;     % Target = 1, enhance strengths.
    else
        w = w-x;     % Target = 0, reduce strengths.
    end

The bias is updated as a weight of a dummy node with an input of 1. The function trainpt1() implements this learning algorithm. It is called with:

    [w,b] = trainpt1(x,t,w,b);

Assume the weight and bias values are randomly initialized and the following input and target output are given.

    w = [.3 0.7];
    b = [-0.8];
    x = [1;-3];
    t = [1];

The output is incorrect as shown:

    y = thresh([w b]*[x ;1])

    y =
         0

One learning cycle of the perceptron learning rule results in:

    [w,b] = trainpt1(x,t,w,b)
    y = thresh([w b]*[x ;1])

    w =
        1.3000   -2.3000
    b =
        0.2000
    y =
         1

As can be seen, the weights are updated and the output now equals the target. Since the target was equal to 1, the weights corresponding to inputs with positive values were made stronger. For example, x1=1 and w1 changed from .3 to 1.3. Conversely, x2=-3, and w2 changed from 0.7 to -2.3; it was made more negative since the input was negative. Look at trainpt1 to see its implementation.

A single perceptron can be used to classify two inputs. For example, if x1 = [0 1] is to be classified as a 0 and x2 = [1 -1] is to be classified as a 1, the initial weights and bias are chosen and the following training routine can be used.

    x1=[0 1]';
    x2=[1 -1]';
    t=[0 1];
    w=[-0.1 .8];
    b=[-.5];
    y1 = thresh([w b]*[x1 ;1])
    y2 = thresh([w b]*[x2 ;1])

    y1 =
         1
    y2 =
         0

Neither output matches the target so we will train the network with first x1 and then x2:

    [w,b] = trainpt1(x1,t(1),w,b);
    y1 = thresh([w b]*[x1 ;1])
    y2 = thresh([w b]*[x2 ;1])
    [w,b] = trainpt1(x2,t(2),w,b);
    y1 = thresh([w b]*[x1 ;1])
    y2 = thresh([w b]*[x2 ;1])

    y1 =
         0
    y2 =
         0
    y1 =
         0
    y2 =
         1

The network now correctly classifies the inputs. A better way of performing this training would be to modify trainpt1 so that it can take a matrix of input patterns such as x =[x1 x2]. We will call this function trainpt(). Also, a function to simulate a perceptron with the inputs being a matrix of input patterns will be called percept().

    x=[x1 x2];
    w=[-0.1 .8];
    b=[-.5];
    y=percept(x,w,b)

    y =
         1     0

    [w,b] = trainpt(x,t,w,b)
    y=percept(x,w,b)

    w =
        0.9000   -1.2000
    b =
       -0.5000
    y =
         0     1

One training cycle results in the correct classification. This will not always be the case. It may take several training cycles, which are called epochs, to alter the weights enough

to give the correct outputs. As long as the inputs are linearly separable, the perceptron will find a decision boundary which correctly divides the inputs. This proof is derived in many neural network texts and is called the perceptron convergence theorem [Hagan, Demuth and Beale, 1996]. The decision boundary is formed by the x,y pairs that solve the following equation:

    w*x+b = 0

Let us now look at the decision boundaries before and after training for initial weights that correctly classify only one pattern.

    x1=[0 0]';
    x2=[1 -1]';
    x=[x1 x2];
    t=[0 1];
    w=[-0.1 0.8];
    b=[-0.5];
    plot(x(1,:),x(2,:),'*')
    axis([-1.5 1.5 -1.5 1.5]);hold on
    X=[-1.5:.5:1.5];
    Y=(-b-w(1)*X)./w(2);
    plot(X,Y);hold;
    title('Original Perceptron Decision Boundary')

    Current plot released

[Figure: Original Perceptron Decision Boundary]

    [w,b] = trainpt(x,t,w,b);
    y=percept(x,w,b)
    plot(x(1,:),x(2,:),'*')
    axis([-1.5 1.5 -1.5 1.5]);
    hold on
    X=[-1.5:.5:1.5];
    Y=(-b-w(1)*X)./w(2);

    plot(X,Y);
    hold
    title('Perceptron Decision Boundary After One Epoch')

    y =
         1     1
    Current plot released

[Figure: Perceptron Decision Boundary After One Epoch]

Note that after one epoch, still only one pattern is correctly classified.

    [w,b] = trainpt(x,t,w,b);
    y=percept(x,w,b)
    plot(x(1,:),x(2,:),'*')
    axis([-1.5 1.5 -1.5 1.5])
    hold on
    X=[-1.5:.5:1.5];
    Y=(-b-w(1)*X)./w(2);
    plot(X,Y)
    hold
    title('Perceptron Decision Boundary After Two Epochs')

    y =
         0     1
    Current plot released

[Figure: Perceptron Decision Boundary After Two Epochs]

Note that after two epochs, both patterns are correctly classified.

The perceptron can also be used to classify several linearly separable patterns. The function trainpt() will now be modified to train until the patterns are correctly classified or until 20 epochs have elapsed; the modified function is called trainpti().

    x=[0 -.3 .5 1; -.4 -.2 1.3 -1.3];
    t=[0 0 1 1]
    w=[-0.1 0.8];
    b=[-0.5];
    y=percept(x,w,b)
    plot(x(1,1:2),x(2,1:2),'*')
    hold on
    plot(x(1,3:4),x(2,3:4),'+')
    axis([-1.5 1.5 -1.5 1.5])
    X=[-1.5:.5:1.5];
    Y=(-b-w(1)*X)./w(2);
    plot(X,Y)
    hold
    title('Original Perceptron Decision Boundary')

    t =
         0     0     1     1
    y =
         0     0     1     0
    Current plot released

[Figure: Original Perceptron Decision Boundary]

The original weight and bias values misclassify pattern number 4.

    [w,b] = trainpti(x,t,w,b)
    t
    y=percept(x,w,b)
    plot(x(1,1:2),x(2,1:2),'*')
    hold on
    plot(x(1,3:4),x(2,3:4),'+')
    axis([-1.5 1.5 -1.5 1.5])
    X=[-1.5:.5:1.5];
    Y=(-b-w(1)*X)./w(2);
    plot(X,Y)
    hold
    title('Final Perceptron Decision Boundary')

    Solution found in 5 epochs.
    w =
        2.7000    0.5000
    b =
       -0.5000
    t =
         0     0     1     1
    y =
         0     0     1     1
    Current plot released
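The m-files percept() and trainpti() are used here without a listing. The following is a minimal sketch of what they might do, not the supplement's actual code. It assumes a batch form of the perceptron rule (all patterns are presented, then the weights are changed once per epoch) and assumes thresh(0) = 1; the handles thresh and percept below are hypothetical stand-ins defined only for this illustration.

```matlab
% Hypothetical stand-ins for the supplement's thresh() and percept() m-files.
thresh  = @(a) double(a >= 0);                         % assumed: thresh(0) = 1
percept = @(x,w,b) thresh(w*x + b*ones(1,size(x,2)));  % one pattern per column

x = [0 -.3 .5 1; -.4 -.2 1.3 -1.3];   % input patterns
t = [0 0 1 1];                        % target outputs
w = [-0.1 0.8];                       % initial weights
b = -0.5;                             % initial bias

for epoch = 1:20                      % give up after 20 epochs
    e = t - percept(x,w,b);           % +1: enhance, -1: reduce, 0: no change
    if ~any(e(:)), break; end         % every pattern is classified correctly
    w = w + e*x';                     % batch perceptron weight update
    b = b + sum(e,2);                 % bias acts as weight of a dummy input of 1
end
y = percept(x,w,b)
```

Under these assumptions the loop finds no remaining errors in the fifth epoch and stops with w = [2.7 0.5] and b = -0.5, which agrees with the values printed above. The same loop also accepts a weight matrix with one row per neuron and a column vector of biases.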

[Figure: Final Perceptron Decision Boundary]

After 5 epochs, all 4 inputs are correctly classified.

7.4 Separation of Linearly Separable Variables

A two input perceptron can separate a plane into two sections because its transfer equation can be rearranged to form the equation for a line. In a three dimensional problem, the equation would define a plane and in higher dimensions it would define a hyperplane.

[Figure: Linear Separability (two classes of points, + and o, in the x-y plane separated by a line)]

Note that the decision boundary is always orthogonal to the weight vector. Suppose we have a two input perceptron with weights = [1 2] and a bias equal to 1. The decision boundary is defined as:

$$y = -\frac{w_x}{w_y}x - \frac{b}{w_y} = -\frac{1}{2}x - \frac{1}{2} = -0.5x - 0.5$$

which is orthogonal to the weight vector [1 2]. In the figure below, the more horizontal line (slope -0.5) is the decision boundary and the more vertical line (slope 2) is the weight vector extended to meet the decision boundary.

    w=[1 2];
    b=[1];
    x=[-2.5:.5:2.5];
    y=(-b-w(1)*x)./w(2);
    plot(x,y)
    text(.5,-.65,'Decision Boundary');
    grid
    title('Perceptron Decision Boundary')
    xlabel('x');ylabel('y');
    hold on
    plot([w(1) -w(1)],[w(2) -w(2)])
    text(.5,.7,'Weight Vector');
    axis([-2 2 -2 2]);
    hold

    Current plot released

[Figure: Perceptron Decision Boundary (x-axis: x; y-axis: y; lines labeled Decision Boundary and Weight Vector)]

If the inputs need to be classified into three or four classes, a two neuron perceptron can be used. The outputs can be coded to one of the pattern classifications, and two lines can separate the classification regions. In the following example, each of the inputs will be classified into one of the three binary classes: [0 1], [1 0], and [0 0]. The weights can be defined as a matrix, the bias is a vector, and two lines are formed.

    x=[0 -.3 .5 1;-.4 -.2 1.3 -1.3];   % Input vectors
    t=[0 0 1 1; 0 1 0 0]               % Target vectors

    w=[-0.1 0.8; 0.2 -0.9];
    b=[-0.5;0.3];          % Weights and biases
    y=percept(x,w,b)       % Initial classifications

    t =
         0     0     1     1
         0     1     0     0
    y =
         0     0     1     0
         1     1     0     1

Two of the patterns (t1 and t4) are incorrectly classified.

    [w,b] = trainpti(x,t,w,b)
    t
    y=percept(x,w,b)

    Solution found in 6 epochs.
    w =
        2.7000    0.5000
       -2.2000   -0.7000
    b =
       -0.5000
       -0.7000
    t =
         0     0     1     1
         0     1     0     0
    y =
         0     0     1     1
         0     1     0     0

The perceptron learning algorithm was able to define lines that separated the input patterns into their target classifications. This is shown in the following figure.

    plot(x(1,1),x(2,1),'*')
    hold on
    plot(x(1,2),x(2,2),'+')
    plot(x(1,3:4),x(2,3:4),'o')
    axis([-1.5 1.5 -1.5 1.5])
    X1=[-1.5:.5:1.5];
    Y1=(-b(1)-w(1,1)*X1)./w(1,2);
    plot(X1,Y1)
    X2=[-1.5:.5:1.5];
    Y2=(-b(2)-w(2,1)*X2)./w(2,2);
    plot(X2,Y2)
    hold
    title('Perceptron Decision Boundaries')
    text(-1,.5,'A');
    text(-.3,.5,'B');
    text(.5,.5,'C');

    Current plot released

[Figure: Perceptron Decision Boundaries (two decision lines dividing the plane into regions A, B and C)]

The simple single layer perceptron can separate linearly separable inputs but will fail if the inputs are not linearly separable. One such example of linearly non-separable inputs is the exclusive-or (XOR) problem.

Linearly non-separable patterns, such as those of the XOR problem, can be separated with multilayer networks. A two input, one hidden layer of two neurons, one output network defines two lines in the two dimensional space. These two lines can classify points into groups that are not linearly separable with one line.

Perceptrons are limited in that they can only separate linearly separable patterns and that they have a binary output. Many of the limitations of the simple perceptron can be solved with multi-layer architectures, non-binary activation functions, and more complex training algorithms. Multilayer perceptrons with threshold activation functions are not that useful because they can't be trained with the perceptron learning rule and, since the functions are not differentiable, they can't be trained with gradient descent algorithms. However, if the first layer is randomly initialized, the second layer may be trained to classify linearly non-separable classes (MATLAB Neural Networks Toolbox).

The Adaline (adaptive linear) network, also called a Widrow Hoff network, developed by Bernard Widrow and Marcian Hoff [1960], is composed of one layer of linear transfer functions, as opposed to threshold transfer functions, and thus has a continuous valued output. It is trained with supervised training by the Delta Rule which will be discussed in Chapter 8.

7.5 Multilayer Neural Network

Neural networks with one or more hidden layers are called multilayer neural networks or multilayer perceptrons (MLP). Normally, each hidden layer of a network uses the same type of activation function. The output activation function is either sigmoidal or linear.

The output of a sigmoidal neuron is constrained to [-1 1] for a hyperbolic tangent neuron and to [0 1] for a logarithmic sigmoidal neuron. A linear output neuron is not constrained and can output a value of any magnitude.

It has been proven that the standard feedforward multilayer perceptron (MLP) with a single non-linear hidden layer (sigmoidal neurons) can approximate any continuous function to any desired degree of accuracy over a compact set [Cybenko 1989, Hornik 1989, Funahashi 1989, and others], thus the MLP has been termed a universal approximator. Haykin [1994] gives a very concise overview of the research leading to this conclusion.

What this proof does not say is how many hidden layer neurons would be needed, and if the weight matrix that corresponds to that error goal can be found. It may be computationally limiting to train such a network since the size of the network is dependent on the complexity of the function and the range of interest.

For example, a simple non-linear function:

$$f(x) = x_1 \cdot x_2$$

requires many nodes if the ranges of x1 and x2 are very large.

In order to be a universal approximator, the hidden layer of a multilayer perceptron is usually made of sigmoidal neurons. A linear hidden layer is rarely used because any two linear transformations

$$h = W_1 x$$
$$y = W_2 h$$

where W1 and W2 are transformation matrices that transform the mx1 vector x to h and h to y, can be represented as one linear transformation

$$W = W_2 W_1$$
$$y = W_2 W_1 x = W x$$

where W is a matrix that performs the transformation from x to y. The following figure shows the general multilayer neural network architecture.

[Figure: Multilayer Network (input layer x0=1, x1, x2, ..., xm; weight matrix W1; hidden layer h0=1, h1, h2, ..., hn; weight matrix W2; output layer y1, ..., yr)]

The output for a single hidden layer MLP with three inputs, three hidden hyperbolic tangent neurons and two linear output neurons can be calculated using matrix algebra.

$$y = w_2 \tanh(w_1 x + b_1) + b_2 = \begin{bmatrix} 0.5 & -0.25 & 0.33 \\ 0.2 & -0.75 & -0.5 \end{bmatrix}\tanh\left(\begin{bmatrix} 0.2 & -0.7 & 0.9 \\ 2.3 & 1.4 & -2.1 \\ 10.2 & -10.2 & 0.3 \end{bmatrix}\begin{bmatrix} 2 \\ 4 \\ 6 \end{bmatrix} + \begin{bmatrix} 0.5 \\ 0.2 \\ -0.8 \end{bmatrix}\right) + \begin{bmatrix} 0.4 \\ -1.2 \end{bmatrix}$$

By using dummy nodes and embedding the biases into the weight matrix we can use a more compact notation:

$$y = w_2 \begin{bmatrix} 1 \\ \tanh\left(w_1 \begin{bmatrix} 1 \\ x \end{bmatrix}\right) \end{bmatrix} = \begin{bmatrix} 0.4 & 0.5 & -0.25 & 0.33 \\ -1.2 & 0.2 & -0.75 & -0.5 \end{bmatrix}\begin{bmatrix} 1 \\ \tanh\left(\begin{bmatrix} 0.5 & 0.2 & -0.7 & 0.9 \\ 0.2 & 2.3 & 1.4 & -2.1 \\ -0.8 & 10.2 & -10.2 & 0.3 \end{bmatrix}\begin{bmatrix} 1 \\ 2 \\ 4 \\ 6 \end{bmatrix}\right) \end{bmatrix}$$

    x=[2 4 6]';
    w1=[0.2 -0.7 0.9; 2.3 1.4 -2.1; 10.2 -10.2 0.3];
    w2=[0.5 -0.25 0.33; 0.2 -0.75 -0.5];
    b1=[0.5 0.2 -0.8]';
    b2=[0.4 -1.2]';
    y=w2*tanh(w1*x+b1)+b2

    y =
        0.8130
        0.2314

or

    x=[2 4 6]';
    w1=[0.5 0.2 -0.7 0.9; 0.2 2.3 1.4 -2.1; -0.8 10.2 -10.2 0.3];
    w2=[0.4 0.5 -0.25 0.33; -1.2 0.2 -0.75 -0.5];
    y=w2*[1;tanh(w1*[1;x])]

    y =
        0.8130
        0.2314
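The earlier remark that a linear hidden layer adds nothing can be checked numerically. The sketch below uses made-up weight values (they are assumptions for illustration, not taken from the text) and compares two stacked linear layers with the single matrix formed by applying the second layer's matrix after the first, W = W2*W1.

```matlab
% Illustrative weights only; any conformable matrices would do.
W1 = [0.2 -0.7 0.9; 2.3 1.4 -2.1];   % 2 hidden neurons, 3 inputs
W2 = [0.5 -0.25; 0.2 -0.75];         % 2 outputs, 2 hidden neurons
x  = [2 4 6]';

y_two_layer = W2*(W1*x);             % two linear layers applied in sequence
W = W2*W1;                           % equivalent single-layer weight matrix
y_one_layer = W*x;
max_diff = max(abs(y_two_layer - y_one_layer))   % zero up to round-off

% With a tanh hidden layer the mapping is no longer the same linear map.
y_tanh = W2*tanh(W1*x)
```

The linear pair collapses exactly to one matrix, while the tanh version gives a different output, which is why the non-linearity in the hidden layer is what gives a multilayer network its extra representational power.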

Chapter 8 Backpropagation and Related Training Paradigms

Backpropagation (BP) is a general method for iteratively solving for a multilayer perceptron's weights and biases. It uses a steepest descent technique which is very stable when a small learning rate is used, but has slow convergence properties. Several methods for speeding up BP have been used including momentum and a variable learning rate.

Other methods of solving for the weights and biases involve more complex algorithms. Many of these techniques are based on Newton's method, but practical implementations usually use a combination of Newton's method and steepest descent. The most popular second order method is the Levenberg [1944] and Marquardt [1963] technique. The scaled conjugate gradient technique [Moller 1993] is also very popular because it is less memory intensive than Levenberg Marquardt and more powerful than gradient descent techniques. These methods will not be implemented in this supplement although Levenberg Marquardt is implemented in The MathWorks' Neural Network Toolbox.

8.1 Derivative of the Activation Functions

The chain rule that is used in deriving the BP algorithm necessitates the computation of the derivative of the activation functions. For logistic, hyperbolic tangent, and linear functions, the derivatives are as follows:

Linear:    $\Phi(I) = I$,    $\dot{\Phi}(I) = 1$
Logistic:  $\Phi(I) = \dfrac{1}{1 + e^{-\alpha I}}$,    $\dot{\Phi}(I) = \alpha\,\Phi(I)\,(1 - \Phi(I))$
Tanh:      $\Phi(I) = \dfrac{e^{\alpha I} - e^{-\alpha I}}{e^{\alpha I} + e^{-\alpha I}}$,    $\dot{\Phi}(I) = \alpha\,(1 - \Phi(I)^2)$

Alpha is called the slope parameter. Usually alpha is chosen to be 1 but other slopes may be used. This formulation for the derivative makes the computation of the gradient more efficient since the output Φ(I) has already been calculated in the forward pass. A plot of the logistic function and its derivative follows.

    x=[-5:.1:5];
    y=1./(1+exp(-x));
    dy=y.*(1-y);
    subplot(2,1,1)
    plot(x,y)
    title('Logistic Function')
    subplot(2,1,2)
    plot(x,dy)
    title('Derivative of Logistic Function')
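The closed-form derivatives listed above can be checked against a numerical derivative. The following is a small sketch under assumed values: the slope alpha = 2, the range of I and the step dI are arbitrary choices made for illustration.

```matlab
alpha = 2;                 % slope parameter (illustrative value)
I  = -3:0.5:3;             % internal activations at which to test
dI = 1e-6;                 % step for the central difference

% Logistic: closed form alpha*Phi*(1-Phi) versus numerical derivative.
logisticA = @(I) 1./(1+exp(-alpha*I));
Phi       = logisticA(I);
dPhi      = alpha*Phi.*(1-Phi);
dPhi_num  = (logisticA(I+dI) - logisticA(I-dI))/(2*dI);
err_logistic = max(abs(dPhi - dPhi_num))

% Hyperbolic tangent: closed form alpha*(1-Phi^2) versus numerical derivative.
tanhA     = @(I) tanh(alpha*I);
Phi       = tanhA(I);
dPhi      = alpha*(1-Phi.^2);
dPhi_num  = (tanhA(I+dI) - tanhA(I-dI))/(2*dI);
err_tanh  = max(abs(dPhi - dPhi_num))
```

Both differences should be at round-off level, which confirms that the derivative can be computed from the neuron output alone, without evaluating another exponential.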

[Figure: two panels, Logistic Function (top) and Derivative of Logistic Function (bottom), plotted from -5 to 5]

As can be seen, the highest gradient is at I=0. Since the speed of learning is partially dependent on the size of the gradient, the internal activation of all neurons should be kept small to expedite training. This is why we scale the inputs and initialize weights to small random values.

Since there is no logistic function in MATLAB, we will write an m-file function to perform the computation.

    function [y] = logistic(x);
    % [y] = logistic(x)
    % Returns the result of applying the logistic operator to the input x.
    % x : the input
    % y : the result
    y=1./(1+exp(-x));

8.2 Backpropagation for a Multilayer Neural Network

The backpropagation algorithm is an optimization technique designed to minimize an objective function [Werbos 1974]. The most commonly used objective function is the squared error which is defined as:

$$\varepsilon^2 = \left[T_q - \Phi_{q.k}\right]^2$$

The network syntax is defined as in the following figure:

[Figure: Network notation (input layer (i), index h, m nodes; hidden layer (j), index p, n nodes; output layer (k), index q, r nodes; each output Φ is compared with its target T to give an error ε)]

In this notation, the layers are labeled i, j, and k, with m, n, and r neurons respectively, and the neurons in each layer are indexed h, p, and q respectively.

    x = input value
    T = target output value
    w = weight value
    I = internal activation
    Φ = neuron output
    ε = error term

The outputs for a two layer network with both layers using a logistic activation function are calculated by the equation:

    Φ = logistic{ w2 * [logistic(w1 * x + b1)] + b2 }

where:

    w1 = first layer weight matrix
    w2 = second layer weight matrix
    b1 = first layer bias vector
    b2 = second layer bias vector

The input vector can be augmented with a dummy node representing the bias input. This dummy input of 1 is multiplied by a weight corresponding to the bias value. This results in a more compact representation of the above equation:

    Φ = logistic( W2 * [1; logistic(W1 * X)] )

where

    X = [1 x]'       % Augmented input vector.
    W1 = [b1 w1]
    W2 = [b2 w2]

Note that a dummy hidden node (=1) also needs to be inserted into the equation. For a two input network with two hidden nodes and one output node we have:

1   x =  x1   x2   

b11 W1 =  b12

w11 w12

w21  w22  

W 2 = b2

[

w11

w12

]

As an example, consider a network that has 2 inputs, one hidden layer of 2 logistic neurons, and 1 logistic output neuron (same as in the text). After we define the initial weight and bias matrices we combine them to be used in the MATLAB code and calculate the output value for the initial weights.
    x = [0.4;0.7];              % rows = # of inputs = 2
                                % columns = # of patterns = 1
    w1 = [0.1 -0.2; 0.4 0.2];   % rows = # of hidden nodes = 2
                                % columns = # of inputs = 2
    w2 = [0.2 -0.5];            % rows = # of outputs = 1
                                % columns = # of hidden nodes = 2
    b1=[-0.5;-0.2];             % rows = number of hidden nodes = 2
    b2=[-0.6];                  % rows = number of output nodes = 1

    X = [1;x]                   % Augmented input vector.
    W1 = [b1 w1]
    W2 = [b2 w2]
    output=logistic(W2*[1;logistic(W1*X)])

    X =
        1.0000
        0.4000
        0.7000
    W1 =
       -0.5000    0.1000   -0.2000
       -0.2000    0.4000    0.2000
    W2 =
       -0.6000    0.2000   -0.5000
    output =
        0.3118

8.2.1 Weight Updates

The output layer weights are changed in proportion to the negative gradient of the squared error with respect to the weights. These weight changes can be calculated using the chain rule. The following is the derivation for a two layer network with each layer having logistic activation functions. Note that the target outputs can only have a range of [0 1] for a network with a logistic output layer.

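$$\begin{aligned}
\Delta w_{pq.k} &= -\eta_{p.q}\,\frac{\partial \varepsilon^2}{\partial w_{pq.k}} \\
&= -\eta_{p.q}\cdot\frac{\partial \varepsilon^2}{\partial \Phi_{q.k}}\cdot\frac{\partial \Phi_{q.k}}{\partial I_{q.k}}\cdot\frac{\partial I_{q.k}}{\partial w_{pq.k}} \\
&= -\eta_{p.q}\cdot(-2)\left[T_q - \Phi_{q.k}\right]\cdot\Phi_{q.k}\left[1 - \Phi_{q.k}\right]\cdot\Phi_{p.j} \\
&= -\eta_{p.q}\cdot\delta_{pq.k}\cdot\Phi_{p.j}
\end{aligned}$$

and

$$\delta_{pq.k} = -2\left[T_q - \Phi_{q.k}\right]\Phi_{q.k}\left[1 - \Phi_{q.k}\right]$$

The weight update equation for the output neurons is:

$$w_{pq.k}(N+1) = w_{pq.k}(N) - \eta_{p.q}\cdot\delta_{pq.k}\cdot\Phi_{p.j}$$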
The output of the hidden layer of the above network is Φpj = h = [0.35 0.52]. The target output is T = [0.1]. The update for the output layer weights is calculated by first propagating forward through the network to calculate the error terms:

    h = logistic(W1*X);     % Hidden node output.
    H = [1;h];              % Augmented hidden node output.
    t = [0.1];              % Target output.
    Out_err = t-logistic(W2*[1;logistic(W1*X)])

    Out_err =
       -0.2118

Next, the gradient vector for the output weight matrix is calculated.
    output=logistic(W2*[1;logistic(W1*X)])
    delta2=output.*(1-output).*Out_err      % Derivative of logistic function.

    output =
        0.3118
    delta2 =
       -0.0455

And lastly, the weights are updated with the learning rate η = 0.5.

    lr = 0.5;
    del_W2 = 2*lr*delta2*H'     % Change in weight.
    new_W2 = W2+del_W2          % Weight update.

    del_W2 =
       -0.0455   -0.0161   -0.0239
    new_W2 =
       -0.6455    0.1839   -0.5239


    Out_err = t-logistic(new_W2*[1;logistic(W1*X)])     % New output error.

    Out_err =
       -0.1983

We see that by updating the output weight matrix for one iteration of training, the magnitude of the error is reduced from 0.2118 to 0.1983.

8.2.2 Hidden Layer Weight Updates

The hidden layer outputs have no target values. Therefore, a procedure is used to backpropagate the output layer errors to the hidden layer neurons in order to modify their weights to minimize the error. To accomplish this, we start with the equation for the gradient with respect to the weights and use the chain rule.

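$$\begin{aligned}
\Delta w_{hp.j} &= -\eta_{h.p}\,\frac{\partial \varepsilon^2}{\partial w_{hp.j}} \\
&= -\eta_{h.p}\sum_{q=1}^{r}\frac{\partial \varepsilon_q^2}{\partial w_{hp.j}} \\
&= -\eta_{h.p}\sum_{q=1}^{r}\frac{\partial \varepsilon_q^2}{\partial \Phi_{q.k}}\cdot\frac{\partial \Phi_{q.k}}{\partial I_{q.k}}\cdot\frac{\partial I_{q.k}}{\partial \Phi_{p.j}}\cdot\frac{\partial \Phi_{p.j}}{\partial I_{p.j}}\cdot\frac{\partial I_{p.j}}{\partial w_{hp.j}}
\end{aligned}$$

The individual factors are:

$$\frac{\partial \varepsilon_q^2}{\partial \Phi_{q.k}} = (-2)\left[T_q - \Phi_{q.k}\right]$$

$$\frac{\partial \Phi_{q.k}}{\partial I_{q.k}} = \alpha\,\Phi_{q.k}\left[1 - \Phi_{q.k}\right]$$

$$\frac{\partial I_{q.k}}{\partial \Phi_{p.j}} = w_{pq.k}$$

$$\frac{\partial \Phi_{p.j}}{\partial I_{p.j}} = \alpha\,\Phi_{p.j}\left[1 - \Phi_{p.j}\right]$$

$$\frac{\partial I_{p.j}}{\partial w_{hp.j}} = x_h$$

resulting in:

$$\begin{aligned}
\frac{\partial \varepsilon^2}{\partial w_{hp.j}} &= \sum_{q=1}^{r}(-2)\left[T_q - \Phi_{q.k}\right]\cdot\alpha\,\Phi_{q.k}\left[1 - \Phi_{q.k}\right]\cdot w_{pq.k}\cdot\alpha\,\Phi_{p.j}\left[1 - \Phi_{p.j}\right]x_h \\
&= \sum_{q=1}^{r}\delta_{pq.k}\cdot w_{pq.k}\cdot\alpha\,\Phi_{p.j}\left[1 - \Phi_{p.j}\right]x_h
\end{aligned}$$

$$\delta_{hp.j} = \sum_{q=1}^{r}\delta_{pq.k}\,w_{pq.k}\,\frac{\partial \Phi_{p.j}}{\partial I_{p.j}}$$

$$w_{hp.j}(N+1) = w_{hp.j}(N) - \eta_{hp}\,x_h\,\delta_{hp.j}$$
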
First we must backpropagate the gradient terms back to the hidden layer:
    [numout,numhid]=size(W2);
    delta1=delta2.*h.*(1-h).*W2(:,2:numhid)'

    delta1 =
       -0.0021
        0.0057

Now we calculate the hidden layer weight change. Note that we don't propagate back through the dummy node.
    del_W1 = 2*lr*delta1*X'
    new_W1 = W1+del_W1

    del_W1 =
       -0.0021   -0.0008   -0.0015
        0.0057    0.0023    0.0040
    new_W1 =
       -0.5021    0.0992   -0.2015
       -0.1943    0.4023    0.2040
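A common sanity check for a hand-derived backpropagation step is to compare the backpropagated gradient with a finite-difference estimate. The sketch below does this for the example network; the helper sqerr, the step size d and the variable names grad_W1, grad_W2, num_W1 and num_W2 are assumptions introduced for the check and are not part of the supplement. It relies on the fact that the weight changes used above are -lr times the gradient of the squared error, so the gradients are -2*delta2*H' and -2*delta1*X'.

```matlab
logistic = @(a) 1./(1+exp(-a));
X  = [1; 0.4; 0.7];                      % augmented input from the example
W1 = [-0.5 0.1 -0.2; -0.2 0.4 0.2];      % [b1 w1]
W2 = [-0.6 0.2 -0.5];                    % [b2 w2]
t  = 0.1;                                % target output
sqerr = @(W1,W2) (t - logistic(W2*[1; logistic(W1*X)])).^2;

% Backpropagated gradients of the squared error.
h = logistic(W1*X);   H = [1; h];
output  = logistic(W2*H);
delta2  = output.*(1-output).*(t-output);
delta1  = h.*(1-h).*(W2(:,2:end)'*delta2);
grad_W2 = -2*delta2*H';
grad_W1 = -2*delta1*X';

% Central finite differences, perturbing one weight at a time.
d = 1e-6;  num_W1 = zeros(size(W1));  num_W2 = zeros(size(W2));
for k = 1:numel(W1)
    P = W1;  P(k) = P(k)+d;  M = W1;  M(k) = M(k)-d;
    num_W1(k) = (sqerr(P,W2) - sqerr(M,W2))/(2*d);
end
for k = 1:numel(W2)
    P = W2;  P(k) = P(k)+d;  M = W2;  M(k) = M(k)-d;
    num_W2(k) = (sqerr(W1,P) - sqerr(W1,M))/(2*d);
end
max_diff = max([abs(grad_W1(:)-num_W1(:)); abs(grad_W2(:)-num_W2(:))])
```

If the two gradients agree to within round-off, the delta terms have been propagated correctly. A large discrepancy usually points to a transposed matrix or to backpropagating through the dummy bias node.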

Now we calculate the new output value.
    output = logistic(new_W2*[1;logistic(new_W1*X)])

    output =
        0.2980

The new output is 0.298, which is closer to the target of 0.1 than was the original output of 0.3118. The magnitude of the learning rate affects the convergence and stability of the training. This will be discussed in greater detail in the adaptive learning rate section.

8.2.3 Batch Training

The type of training used in the previous section is called sequential training. During sequential training, the weights are updated after each presentation of a pattern. This method may be more stochastic when the patterns are chosen randomly and may reduce the chance of getting stuck in a local minimum. During batch training all of the training


patterns are processed before a weight update is made. Suppose the training set consists of four patterns (z=4). Now we have four output patterns from the hidden layer of the network and four target outputs.

    x = [0.4 0.8 1.3 -1.3; 0.7 0.9 1.8 -0.9];
    t = [0.1 0.3 0.6 0.2];
    [inputs,patterns] = size(x);
    [outputs,patterns] = size(t);
    W1 = [0.1 -0.2 0.1; 0.4 0.2 0.9];   % rows = # of hidden nodes = 2
                                        % columns = # of inputs +1 = 3
    W2 = [0.2 -0.5 0.1];                % rows = # of outputs = 1
                                        % columns = # of hidden nodes+1 = 3
    X = [ones(1,patterns); x];          % Augment with bias dummy node.
    h = logistic(W1*X);
    H = [ones(1,patterns);h];
    e = t-logistic(W2*H)

    e =
       -0.4035   -0.2065    0.0904   -0.2876

The sum of squared error is:

    SSE = sum(sum(e.^2))

    SSE =
        0.2963

Next, the gradient vector is calculated:

    output = logistic(W2*H)
    delta2 = output.*(1-output).*e

    output =
        0.5035    0.5065    0.5096    0.4876
    delta2 =
       -0.1009   -0.0516    0.0226   -0.0719

And lastly, the weights are updated with a learning rate η = 0.5:

    lr = 0.5;
    del_W2 = 2*lr* delta2*H'
    new_W2 = W2+ del_W2

    del_W2 =
       -0.2017   -0.1082   -0.1208
    new_W2 =
       -0.0017   -0.6082   -0.0208

The new sum of squared error is calculated as:

    e = t- logistic(new_W2*H);
    SSE = sum(sum(e.^2))

    SSE =
        0.1926

The SSE has been reduced from 0.2963 to 0.1926 by just changing the output layer weights and biases.

To change the hidden layer weight matrix we must backpropagate the gradient terms back to the hidden layer. Note that we can't backpropagate through a dummy node, so only the weight portion of W2 is used.

    [numout,numhidb] = size(W2);
    delta1 = h.*(1-h).*(W2(:,2:numhidb)'*delta2)

    delta1 =
        0.0126    0.0065   -0.0028    0.0088
       -0.0019   -0.0008    0.0002   -0.0016

Now we calculate the hidden layer weight change.

    del_W1 = 2*lr*delta1*X'
    new_W1 = W1+del_W1

    del_W1 =
        0.0250   -0.0049    0.0016
       -0.0041    0.0009   -0.0003
    new_W1 =
        0.1250   -0.2049    0.1016
        0.3959    0.2009    0.8997

    h = logistic(new_W1*X);
    H = [ones(1,patterns);h];
    e = t-logistic(new_W2*H);
    SSE = sum(sum(e.^2))

    SSE =
        0.1917

The new SSE is 0.1917, which is less than the SSE of 0.1926, so the change in hidden layer weights reduced the SSE slightly more.

8.2.4 Adaptive Learning Rate

In the previous examples a fixed learning rate was used. When training a neural network iteratively, it is more efficient to use an adaptive learning rate. The learning rate can be thought of as the size of a step down the error gradient. If very small steps are taken, you are guaranteed to find an error minimum, but this may take a very long time. Larger steps may result in unstable learning since you may step over a minimum. To speed training and still have stability, a heuristic method is used to determine the step size. The heuristic rule states:

If training "went well" (error decreased) then increase the step size.

    lr=lr*1.1

If training is "poor" (error increased) then decrease the step size.

    lr=lr*0.5

Only update the weights if the error decreased.

Using an adaptive learning rate allows the training procedure to quickly move across large error plateaus and slowly descend through tortuous error paths. This results in training that is somewhat optimized to increase learning speed while remaining stable.

8.2.5 The Backpropagation Training Cycle

The training procedure discussed above is applied iteratively for a certain number of cycles or until a specified error goal is met. Before starting training, weights are initialized to small random values and inputs are scaled to small values of similar magnitude to reduce the chance of prematurely saturating the sigmoidal neurons and thus slowing training. Input scaling and weight initialization will be covered in subsequent sections.

    x = [0.4 0.8 1.3 -1.3; 0.7 0.9 1.8 -0.9];
    t = [0.1 0.3 0.6 0.2];
    [inputs,patterns] = size(x);
    [outputs,patterns] = size(t);
    hidden=6;
    W1=0.1*ones(hidden,inputs+1);       % Initialize to matrix of 0.1's.
    W2=0.1*ones(outputs,hidden+1);      % Initialize to matrix of 0.1's.
    maxcycles=200;
    SSE_Goal=0.1;
    lr=0.5;
    SSE=zeros(1,maxcycles);
    X=[ones(1,patterns); x];            % Augment inputs with bias dummy node.
    for i=1:maxcycles
        h = logistic(W1*X);
        H=[ones(1,patterns);h];
        e=t-logistic(W2*H);
        SSE(i)= sum(sum(e.^2));
        if SSE(i)<SSE_Goal; break;end
        output = logistic(W2*H);
        delta2= output.*(1-output).*e;
        del_W2= 2*lr* delta2*H';
        W2 = W2+ del_W2;
        delta1 = h.*(1-h).*(W2(:,2:hidden+1)'*delta2);
        del_W1 = 2*lr*delta1*X';
        W1 = W1+del_W1;
    end;
    clf
    semilogy(nonzeros(SSE));
    title('Backpropagation Training');
    xlabel('Cycles');ylabel('Sum of Squared Error')
    if i<200;fprintf('Error goal reached in %i cycles.',i);end

    Error goal reached in 116 cycles.
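The training loop above keeps the learning rate fixed. As a rough sketch of how the adaptive heuristic from section 8.2.4 could be folded into the same loop, the version below computes a trial update, accepts it only if the SSE decreases (then lr=lr*1.1), and otherwise rejects it and halves the step (lr=lr*0.5). The trial variable names (try_W1, try_SSE and so on) and SSE_start are assumptions made for this illustration, and unlike the loop above the hidden-layer deltas here are computed from W2 before it is updated.

```matlab
logistic = @(a) 1./(1+exp(-a));
x = [0.4 0.8 1.3 -1.3; 0.7 0.9 1.8 -0.9];
t = [0.1 0.3 0.6 0.2];
[inputs,patterns] = size(x);
hidden = 6;  maxcycles = 200;  SSE_Goal = 0.1;  lr = 0.5;
W1 = 0.1*ones(hidden,inputs+1);
W2 = 0.1*ones(1,hidden+1);
X  = [ones(1,patterns); x];

% Forward pass with the initial weights.
h = logistic(W1*X);  H = [ones(1,patterns); h];
e = t - logistic(W2*H);
SSE = sum(sum(e.^2));  SSE_start = SSE;

for i = 1:maxcycles
    if SSE < SSE_Goal, break; end
    output = logistic(W2*H);
    delta2 = output.*(1-output).*e;
    delta1 = h.*(1-h).*(W2(:,2:hidden+1)'*delta2);
    try_W2 = W2 + 2*lr*delta2*H';            % trial weights, not yet accepted
    try_W1 = W1 + 2*lr*delta1*X';
    try_h  = logistic(try_W1*X);  try_H = [ones(1,patterns); try_h];
    try_e  = t - logistic(try_W2*try_H);
    try_SSE = sum(sum(try_e.^2));
    if try_SSE < SSE                         % training "went well"
        W1 = try_W1;  W2 = try_W2;  h = try_h;  H = try_H;
        e = try_e;  SSE = try_SSE;
        lr = lr*1.1;                         % take a larger step next time
    else                                     % training was "poor"
        lr = lr*0.5;                         % reject the update, shrink the step
    end
end
fprintf('SSE %.4f after %i cycles, learning rate %.3f\n', SSE, i, lr);
```

Because rejected steps leave the weights untouched, the SSE in this sketch can never increase from one cycle to the next. The growth and shrink factors 1.1 and 0.5 are the ones quoted in the text; other values trade speed against the number of rejected steps.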

[Figure: Backpropagation Training. Sum of Squared Error (log scale) versus Cycles.]

You can try different numbers of hidden nodes in the above example. The relationship between the number of hidden nodes and the cycles to reach the error goal is shown below.

Number of Hidden Nodes           1     2     3     4     5     6
Training Cycles to Error Goal    106   95    97    102   109   116

The results shown above are fairly typical. After a point where enough free parameters are given to the network to model the function, the addition of hidden nodes complicates the error surface and the number of training cycles may increase. Each training set will have its own relationship between training cycles and the number of hidden nodes.

Note that the weights are usually initialized to random numbers as discussed in Section 8.4. When the weights are randomly chosen, the number of training cycles varies. This is due to the different initial position on the error surface. In fact, sometimes a network will not train to a desired error goal with one set of initial weights, due to getting trapped in a local minimum, but will train with a different set of initial weights.

8.3 Scaling Input Vectors

Training data is scaled for two major reasons. First, input data is usually scaled to give each input equal importance and to prevent premature saturation of sigmoidal activation functions. Secondly, output or target data is scaled if the output activation functions have a limited range and the unscaled targets do not match that range.

There are two popular types of input scaling: linear scaling and z-score scaling. Linear scaling transforms the data into a new range, which is usually 0.1 to 0.9. If the training patterns are in a form such that the columns are inputs and the rows are patterns, a MATLAB function to perform linear scaling is:

function [y,slope,int]=scale(x,slope,int)
%
% [x,m,b]=scale(x,m,b)
%
% Linear scale the data between .1 and .9
%
% y = m*x + b
%
% x = data
% m = slope
% b = y intercept
%
[nrows,ncols]=size(x);
if nargin == 1
  del = max(x)-min(x);          % calculate slope and intercept
  slope = .8./del;
  int = .1 - slope.*min(x);
end
y = (ones(nrows,1)*slope).*x + ones(nrows,1)*int;

The function returns the scaled inputs y and the scale parameters: slope and int. Each column of scaled inputs has a maximum and minimum value of 0.9 and 0.1 respectively. The scaling parameters are returned so that other input data may be transformed using the same terms. A network that is trained with scaled inputs must always have its inputs scaled using the same scaling parameters.

This scaling function can also be used for scaling target values. Outputs are usually scaled to the range 0.1 to 0.9 when logistic output activation functions are used. This keeps the training algorithm from trying to force outputs beyond the range of the function.

Another method of scaling, called z-score or mean center unit variance scaling, is also frequently used. This method subtracts the mean of each input from each column and then divides by the standard deviation. This centers all the patterns of each data type around 0 and gives them a unit variance. The MATLAB function to perform z-score scaling is:

function [y,meanval,stdval] = zscore(x,meanval,stdval)
%
% [y,mean,std] = zscore(x,mean_in,std_in)
%
% Mean center the data and scale to unit variance.

% If the number of inputs is one, calculate the mean and standard deviation.
% If the number of inputs is three, use the calculated mean and SD.
%
[nrows,ncols]=size(x);
if nargin == 1
  meanval = mean(x);                % calculate mean values
end
y = x - ones(nrows,1)*meanval;      % subtract off mean
if nargin == 1
  stdval = std(y);                  % calculate the SD
end
y = y ./ (ones(nrows,1)*stdval);    % normalize to unit variance

An example of the two scaling functions is:

x=[1 2;30 21;-1 -10;8 34]
[y,slope,int]=scale(x)
[y,meanval,stdval]=zscore(x)

x =
     1     2
    30    21
    -1   -10
     8    34
y =
    0.1516    0.3182
    0.9000    0.6636
    0.1000    0.1000
    0.3323    0.9000
slope =
    0.0258    0.0182
int =
    0.1258    0.2818
y =
   -0.5986   -0.4983
    1.4436    0.4727
   -0.7394   -1.1115
   -0.1056    1.1370
meanval =
    9.5000   11.7500
stdval =
   14.2009   19.5683

If a network is trained with scaled data and new data is presented to the network, it must first be scaled using the same scaling factors. In this case, the scaling functions are called with three variables and only the scaled data is passed back.

x_new=[-2 4;-.3 12;9 -10]

0000 0. The most common method is to use the random number generator and pass it the number of inputs plus 1 and the number of hidden nodes for the first hidden layer weight matrix W1 and pass it the number of outputs and hidden nodes plus 1 for the output weight matrix W2.0000 -0.5825 0. int) [y]=zscore(x_new.8098 -0.5 and 2.3545 0.5 for a hyperbolic tangent neuron and -5 to +5 for a logistic function.0352 4.3134 0. slope.3000 9.meanval.5000 0. One is added to the number of inputs in W1 and to hidden in W2 to account for the bias. This region is between -2.5*randn(2.1000 -0.3483 0. plot([-8:.0000 -10.8481 We are trying to limit the internal activation of the neurons during training to a high gradient region.3960 0.1758 -0.0375 0.stdval) x_new = -2.1:8])) title('Logistic Activation Function') 84 .0742 0.3) W1 = 0.0000 y = 0.5.0128 -1.1:8]. logistic([-8:. W1=0. the initial weights should be selected to be small random values in order to prevent premature saturation of the sigmoidal activation functions.[y]=scale(x_new.4 Initializing Weights As mentioned above.3581 y = -0. To make the weights somewhat smaller.1115 8.0000 12. the resulting random weight matrix is multiplied by 0.1181 0.6901 -0.

mat).^2. random.9 0.7 0. and centered around 0. t=2*x .4 0.1 0 -8 -6 -4 -2 0 2 4 6 8 The sum of the inputs times their weights should be in this high gradient region for efficient training.2 0.0. This limits the network output to the interval [0 1].8 0. the training function backprop() will use a linear output layer that allows targets of any magnitude to be reached. Since most problems have targets outside of this range.5 0. Scaling the inputs to small values helps.Logistic Activation Function 1 0.t) saved in a file (data. Up to this point.21*x. The backpropagation MATLAB script that sets up the network architecture and training parameters is called bptrain.3 0. To use this script you must first have training data (x. 8. save data8 x t This will create training data to train a network to approximate the function 85 .5 Creating a MATLAB Function for Backpropagation In this section.6 0. the two layer networks used logistic activation functions in each layer. a MATLAB script will be discussed that performs the necessary operations to define and train a multilayer perceptron with a backpropagation algorithm. The following defines the format for these data: Variable x t Description Input data Target data Rows Columns Number of inputs Number of patterns Number of outputs Number of patterns An example of creating and saving a training set is: x=[0:1:10]. but the weights should also be made small.

n]:y This network has: 1 input neurons 20 neurons in the hidden layer 1 output neurons There are 11 input/output pairs in this training set. Are you satisfied with these selections? [y. If they are. It also asks if the default error tolerance. none=n ?[z. the number of hidden neurons. asks for the name of the file containing the training data.mat. and the type of scaling to use. error goal not met! *** RMS = 1. EDU» bptrain Enter filename of input/output data vectors: data8 How many neurons in the hidden layer? 20 Input scaling method zscore=z. the function backprop() is called and the network is trained. maximum number of cycles. and scaling parameters are saved in a file called weights. The MATLAB script. biases. *** BP Training complete.21x12 over the interval [0 10] and store in binary format in a file named data8.1 Maximum number of training cycles = 5000.l.mat.780286e-001 86 . bptrain.n]:n The default variables are: Output error tolerence (RMS per output term) = . linear=l. After training the weights. The initial learning rate is = 0. and initial learning rate is acceptable.1. The following is a diary of running the script bptrain on the training data defined above.y = 2 x1 + 0.

After we train a network we usually want to test it.output) title('Function Approximation Verification') xlabel('Input') ylabel('Output') 87 .02 0 0 1000 Learning Rate 2000 3000 Cycles 4000 5000 In this example.0.x.x])]. Since some of these patterns were not used to train the network. The RMS error is not dependent on the number of outputs and the number of training patterns.17.1:10]. It can be thought of as an average error per output rather than the total error over all outputs and training patterns.1 was not met in 8000 cycles. output = W2*[ones(size(x)). we define a test set of input/output patterns of the function that was used to train the network.04 0.06 0.08 0.10 10 10 10 2 Root Mean Squared Error 1 0 -1 0 1000 2000 3000 4000 5000 0. load weights8 x=[0:. the final network error was 0.logistic(W1*[ones(size(x)).^2. To test the network.t.21*x. Generalization is the ability of a network to give the correct output for an input that was not in the training set. Note that this error is a root mean squared error rather than a sum of squared error. plot(x. the error goal of 0. this checks for generalization. Networks are not expected to generalize outside of the training space and no confidence should be given to outputs generated from data outside of the training data. It is therefore more intuitive. t=2*x .

A function letgph() displays the letter on the template. 8.mat. ans = 0 0 0 0 88 .Function Approximation Verification 5 4 3 Output 2 1 0 -1 0 2 4 Input 6 8 10 The network does a good job of learning the functional relationship between the inputs and outputs. The data file contains an input matrix x (size = 35 by 16) and a target matrix t (size = 4 by 16).1)).1)' letgph(x(:. Training longer to a lower error goal would improve the network performance. we load the data file and plot the first column of x. The first 16 letters (A through P) are defined on a 5x7 template and are each stored as a vector of length 35 in the data file letters. For example.6 Backpropagation Example In this example a neural network is trained to recognize the letters of the alphabet. It also generalizes well. The columns in the x matrix are the indices of the filled in boxes of the 5x7 template. load('letters') t(:. to plot the letter A which is the first letter of the alphabet and identified by t=[0 0 0 0]. The entries in a column of the t matrix are a binary coding of the letter of the alphabet in the corresponding column of x.

% Initialize output layer weight matrix. semilogy(RMS). title('Backpropagation Training'). The function bprop2 has a logistic/logistic architecture.1. [W1 W2 RMS]=bprop2(x. The network has 35 inputs corresponding to the boxes in the template and has 4 outputs corresponding to the binary code of the letter.7 6 5 4 3 2 1 1 2 3 4 5 We will now train a neural network to identify a letter.5*randn(10. Since the outputs are binary.W2.t.05.05..11)..5000). xlabel('Cycles'). train to a maximum of 5000 cycles and a root mean squared error goal of 0. we can use a logistic output layer. This may take a few minutes. W2=0. We will use a beginning learning rate of 0.5*randn(4.36). % Initialize hidden layer weight matrix.5.W1. load letters W1=0. ylabel('Root Mean Squared Error') 89 .

logistic(W1*[1. One may want to make the target outputs be in the range of [. For example.9620 0. output7 should be [0 1 1 1]. output0 = logistic(W2*[1.0531 0.x(:.1 .9619 0.2)])])' output7 = logistic(W2*[1. a noisy input of an A may look like: 90 .logistic(W1*[1.0607 0. which is very close to its actual output. For example.10 0 Backpropagation Training Root Mean Squared Error 10 -1 10 -2 0 20 40 60 80 100 Cycles 120 140 160 The trained network is saved in a file weights_l.9] because outputs of 0 and 1.1)])])' output1 = logistic(W2*[1.13)])])' output0 = 0.9812 0.0188 output1 = 0. We can now use the trained network to identify an input letter. You can verify this by comparing each output with its binary equivalent.0590 output12 = 0.9775 0.x(:.logistic(W1*[1.x(:. A neural network's ability to generalize is found in its ability to give a correct response to an input that was not in the training set.0595 0.0054 0.9872 0.9536 0.8)])])' output12 = logistic(W2*[1. presenting the network with several letters resulted in: load weight_l.0555 The network correctly identifies each of the test cases.logistic(W1*[1. For example.0433 0.0 are not obtainable and may force the weights to very large values.0239 output7 = 0.0443 0.x(:.

a])])' OutputA = 0. its tolerance to noise. This show the networks ability to generalize. and more specifically. An unsupervised learning rule is one in which no target outputs are given. Associative and Other Special Neural Networks 9.a=[0 0 1 0 0 0 0 0 1 0 1 0 0 0 1 1 1 1 1 1 1 0 0 0 1 1 0 0 0 1 1 0 0 1 1]'. letgph(a).0165 0.logistic(W1*[1. the strength of the connection between the input and the output is enhanced.3 that Hebb's training rule states: "When the synaptic input and the neuron output are both active. 91 . 7 6 5 4 3 2 1 1 2 3 4 5 Presenting the network with this input results in: OutputA = logistic(W2*[1. Chapter 9 Competitive.0297 0.1 Hebbian Learning Recall from Section 7." There are several methods of implementing a Hebbian learning rule. This chapter explores the implementation of unsupervised learning rules and begins with an implementation of Hebb's rule.0161 This output is very close to the binary pattern [0 0 0 0] which designates an 'A'.1135 0. a supervised form is used in the perceptron learning rule of Chapter 7.

If the output of the single layer network is active when the input is active, the weight connecting the two active nodes is enhanced. This allows the network to associate relationships between inputs and outputs, hence the name associative networks. The most simple unsupervised Hebb rule is:

\[ w_{AB}^{new} = w_{AB}^{old} + \beta\, x_A y_B \]

Where:
wAB is the weight connecting input A to output B
β is the learning constant
x is the input
y is the output

The constant β controls the rate at which the network learns. If β is made large, few presentations are needed to learn an association and if β is made small, many presentations are needed. If the weights between active neurons are only allowed to be enhanced, as in the equation above, there is no limit to their magnitude. Therefore, a rule that allows both learning and forgetting should be implemented. Stephen Grossberg [1982] states the weight change law as:

\[ w_{AB}^{new} = w_{AB}^{old}(1 - \alpha) + \beta\, x_A y_B \]

In this equation, α is the forgetting constant and controls the rate at which the memory of old information is allowed to decay away or be forgotten. Using this Hebbian update rule, the network constantly forgets old information and continually learns new information. The values of β and α control the speed of learning and forgetting and are usually set in the interval [0 1]. The update rule can be rewritten as:

\[ \Delta W = -\alpha W + \beta\, x y^T \]

This rule limits the magnitude of the weights to a value determined by α and β. Solving the above equation for ∆W=0, we find the maximum weight to be β/α when x and y are active. For example: if the learning rate β is set to 0.9 and there is a small forgetting rate α=0.1, then the maximum weight is (β/α)*xy^T = 9xy^T.

Suppose a four input network, that has its weights initialized to the identity matrix, is presented with an input vector x.

The network has learned to associate an x1 or x2 input with an active output. 93 . the network should only forget when the output is active.100).a. 0 1. % Trained weight vector. the network's output will still be active.a. y The inputs will be presented to the network and the weights will be updated with the unsupervised Hebbian learning rule with β = 0. If we make the learning and forgetting rates equal.w.b.1. The network has learned to associate an input of [1 0 0 1] with an active output. if the input is [0 0 0 1]. 0 0 0. A more useful structure would allow the network to forget only when it is learning. this Hebbian learning rule is called the Instar learning rule.w. Learning factor. 9. Neuron Output Sum weights % Weight vector.cycles).1.0000 We can see that the weights are bounded by (β/α)*xyT=1. % Input vector. For example. w w = 1. x=[1 0 0 1]'. w=hebbian(x. Output.0000 % % % % % Forgetting factor. New weight.1 and α =0. Weight update. This function is called with cycles equal to the number of training iterations.0000 0 % Train the network with a Hebbian learning % rule for 100 cycles. b=0.Inputs x1 x2 x3 x4 w=[1 0 0 0]. yout=thresh(w*x-eps) del_w=-a*w+b*x'*yout. Since the network only learns when the output is active (y = 1).b. the output will still be high.2 Instar Learning The Hebbian learning rule described above will continuously forget previous information due to the weight decay structure. a=0.1.1000 This rule is implemented in a function called hebbian(x. Now if a degraded version of the input vector is input to the network. w=w+del_w yout = 1 w = 1.

w ne = w old (1 − αy T ) + βxy T = w old + α ( x − w old ) y T AB AB AB AB Rearranging: w ne = w old (1 − αy T ) + αx A y T AB AB The first term allows the network to forget when the output is active and the second term allows the network to learn. The output of an Instar is the dot product of the weight vector and the input vector.a.2190 yout1 = 0 0.9) a=0. this weight update rule is graphically represented as: w(k) w(k+1) x(k) Instar Learning The value of α determines the rate at which the weight vector is drawn towards new input vectors. Initial output. w=w/norm(w). yout1=thresh(w*x-. w=instar(x.8. If the weight vector and input vector are normalized. w=rand(1. 0. Input vector to be learned.6793 0. the Instar has the ability to learn an input vector and the output is a value corresponding to the degree that the input matches the weight vector. Normalize weight vector.9 for the identification of the input vector to be positive. Final output. Train for 10 cycles.10). The following is an example of how the Instar network can learn to remember and recognize an input vector.6789 94 . Final weight vector.0470 % % % % % % % % % Initialize weight vector randomly.w. Therefore. Normalize input vector. The Instar learns patterns and classifies patterns.9) w = 0. w yout2=thresh(w*x-.4) x=[1 0 0 1]'. This implementation causes the weight matrix to move towards new inputs. x=x/norm(x). Note that the weight vector and input vector are normalized and that the dot product must be >0. Learning and forgetting factor.

outstar(v. the Outstar learning rule for this simple recall network is: ∆w = −αwx + αvx = α ( w − v ) x The Outstar function. the weight connecting the two is increased.3 Outstar Learning The Instar network learns to identify input vectors and its dual network. If the learning and forgetting terms are equal. Putting these two network architectures together results in an associative memory. A 0. if the input is a 0.9. the output will be the learned vector. Train the network for 10 cycles. v = [0 1 1 0]. the output will be the stored vector. when the input is a 1. can store and recall a vector. Learning rate. Initialize weights randomly. is used where: v is the vector to be learned x is the input vector w is the weight matrix of size(outputs.9 criteria was used for identification. 9.a.4) a=0.7071 yout2 = 1 0. This is a type of supervised network since the desired output is applied to the output. the output will be a 0.x. 95 . called an Outstar network. when the input is made active. Input weights Outputs y1 y2 y3 y4 x Outstar Network For example. a network can be trained to store a vector v = [0 1 1 0]. The Outstar network is trained by applying an active signal at the input while applying the vector to be stored at the output.w = 0.0000 0. w=outstar(v. % % % % % Vector to be learned.w. Again.x. w=rand(1.10). when the input and output are both active. Input. The Outstar uses a Hebbian learning rule similar to that of the Instar. x=1. The Outstar network also uses a Hebbian learning rule.7071 The network was able to learn a weight vector that would identify the desired input vector.w.a).0000 0. After training. After training.inputs) a is the learning rate.

x=[1 0 0 0 1 0 0 0 1].0000 -1. % Three output vectors.0000 0.9347 yout = 0.0000 0.0000 0. The one pass method may not generalize well from noisy or incomplete data. Suppose we want to learn three vectors of four terms. 0. % Second vector.10).3).0000 This algorithm can generalize to higher dimensional inputs.0000 1.0000 0.yout=w*x w = 0.3835 1. a learning rate equal to one. v1=[0 -1 0 0]'. w=outstar(v. yout=w*x yout = 0.x.0000 0.0000 The network learned to recall the correct vector for each of the three inputs.0000 0.0000 0.5194 1.0000 % The trained network output.0000 0. % Three input vectors. The unsupervised form uses an initial identity weight matrix. % Third vector. v3=[0 1 0 0]'. v=[v1 v2 v3]. % Train the network for 10 cycles. The general Outstar architecture is now: Inputs x1 x2 x3 weights Outputs Sum Sum Sum Sum Outstar Network y1 y2 y3 y4 If we want a network to output a certain vector depending on the active input node we will have a weight matrix versus a weight vector. Although this implementation uses a supervised paradigm. w=rand(4. 96 .a.0000 -1.0000 % First vector. and only one pass of training. 0. % Random initial weight matrix.8310 0. v2=[1 0 -1 0]'. This method embeds the input matrix into the weight matrix in one pass but may not be useful when there are several patterns in the data set that are noisy.w.0000 1. the implementation could be presented in an unsupervised form.

9.4 Crossbar Structure

An associative network architecture with a crossbar structure is termed a bi-directional associative memory (BAM) by its developer, Bart Kosko [1988]. This methodology is really a matrix solution to an associative memory problem. Note that the terms of the vectors in a BAM must be ±1. The BAM does not undergo training as do most neural network architectures. As an example of its implementation, consider three vector pairs (a, b).

a1=[1 -1 -1 -1 -1 1]';
a2=[-1 1 -1 -1 1 -1]';
a3=[-1 -1 1 -1 -1 1]';
b1=[-1 1 -1]';
b2=[1 -1 -1]';
b3=[-1 -1 1]';

Each vector a is associated with a vector b, and either vector can be the input or the output. Three correlation matrices are formed by multiplying the vectors using the equation:

\[ M_1 = A_1 B_1^T \]

m1=a1*b1'; m2=a2*b2'; m3=a3*b3';   % Three correlation matrices.

The three correlation matrices are then added to get a master weight matrix:

\[ M = M_1 + M_2 + M_3 \]

m=m1+m2+m3                          % Master weight matrix.

m =
    -1     3    -1
     3    -1    -1
    -1    -1     3
     1     1     1
     3    -1    -1
    -3     1     1

The master weight matrix can now be used to get the vector associated with any input vector. This matrix can perform transformations in either direction:

\[ A_i = M B_i \quad \text{or} \quad B_i = M^T A_i \]

The resulting vector must be limited to the [1 -1] range. This is done using the signum() function. For example:

A1=signum(m*b1)                     % Recall a1 from b1.
B2=signum(m'*a2)                    % Recall b2 from a2.

A1 =
     1
    -1
    -1
    -1
    -1
     1
B2 =
     1
    -1
    -1

We can see that the BAM was able to recall the associations stored in the master matrix memory. A discussion of the capacity and efficiency of the BAM network is given in the text.

9.5 Competitive Networks

Artificial neural networks that use competitive learning have only one output node activated at a time. The output nodes compete to be the one that is active; therefore, this is sometimes called a winner-take-all algorithm.

\[ y_i = \sum_l w_{il} x_l = \mathbf{w}_i \cdot \mathbf{x} \]

where:
i = output node index
l = input node index

The weights between the input and output nodes (wil) are initially chosen as small random values and are continuously normalized. When training commences, each input vector (x) searches out which weight vector (wi*) is closest to it.

\[ \mathbf{w}_{i^*} \cdot \mathbf{x} \ge \mathbf{w}_i \cdot \mathbf{x} \]

The closest weight vector is then updated to make it closer to the input vector. The amount that the weight vector is changed is determined by the learning rate η.

\[ \Delta w_{i^* j} = \eta \left( \frac{x_j}{\sum_j x_j} - w_{i^* j} \right) \]

Training involves the repetitive application of input vectors to the network in a random order or in turn. The weights are continually updated with each application. This process could continue changing the weight vectors forever; therefore, the learning rate η is reduced as training progresses until it eventually is negligible and the weight changes cease. This results in weight vectors (+) centered in clusters of input vectors (*) as shown in the following figure.

[Figure: Competitive Network. Weight vectors (+) and input vectors (*), shown Before Training and After Training.]
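The winner-take-all update described above can be sketched in a few lines. This is only an illustration under stated assumptions: the data, the initial weights, the starting rate eta, and the decay factor are all made up for the example, and the loop is not taken from the supplement's own training functions.

```matlab
% Sketch of winner-take-all learning with a decaying learning rate.
% x, w, eta and decay are illustrative assumptions, not values from the text.
x = [0.9 0.1; 0.8 0.2; 0.1 0.9; 0.2 0.8]';   % 2 inputs by 4 patterns
w = [0.6 0.4; 0.4 0.6];                       % 2 output nodes by 2 inputs
eta = 0.5;                                    % initial learning rate
decay = 0.9;                                  % per-cycle reduction of eta
for cycle = 1:50
  for p = 1:size(x,2)
    xp = x(:,p);
    [~, istar] = max(w*xp);                   % winner has the largest w_i . x
    target = xp'/sum(xp);                     % x_j / sum_j x_j
    w(istar,:) = w(istar,:) + eta*(target - w(istar,:));   % move winner only
  end
  eta = eta*decay;                            % weight changes eventually cease
end
w
```

Because each update is a weighted average of the old weight row and a normalized input, a row that starts with unit sum keeps unit sum, which is one simple way to keep the weights normalized.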

Generally, competitive learning networks are single-layered but there are several variants that are multi-layered. Some are briefly described by Hertz, Krogh, and Palmer [1991] but none will be discussed here. Competitive networks perform clustering: they extract similarities from the input vectors and group or categorize them into specific clusters. Similar input vectors fire the same output node. Competitive networks find uses in vector quantization problems such as data compression.

9.5.1 Competitive Network Implementation

A competitive network is a single layer network which has only one output active at a time. This output corresponds to the neuron whose weight vector is closest to the input vector.

[Figure: Competitive Network. Inputs x1, x2, x3 connect through weights to four summing nodes and a competition stage (C) that produces outputs y1, y2, y3, y4.]
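As a rough illustration of how the winning output can be chosen, the sketch below picks the node whose weight vector has the smallest Euclidean distance to the input. It is a hypothetical stand-in, not the supplement's compete() function, and the cluster centers and input are made-up values.

```matlab
% Hypothetical winner selection: a 1 at the closest node and 0 elsewhere.
w = [0 0; 4 0; 0 4];                          % one cluster center per row
x = [1; 3];                                   % input to be classified
n = size(w,1);
d = sum((w - ones(n,1)*x').^2, 2);            % squared distance to each center
[~, winner] = min(d);                         % closest weight vector wins
y = zeros(n,1);
y(winner) = 1
```

Here the third center is closest to the input, so only the third output is active.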
Suppose we have three weight vectors already stored in a competitive network. The output of the network to an input vector is found with the following.
w=[1 1; 1 -1; -1 1];        % Cluster centers.
x=[1;3];                    % Input to be classified.
y=compete(x,w);             % Classify.
clf
plot(w(:,1),w(:,2),'*')
hold on
plot(x(1),x(2),'+')
hold on
plot(w(find(y==1),1),w(find(y==1),2),'o')
title('Competitive Network Clustering')
xlabel('Input 1');ylabel('Input 2')
axis([-1.5 1.5 -1.5 3.5])
[Figure: Competitive Network Clustering. Input 2 versus Input 1.]

The above figure displays the cluster centers (*), shows the input vector (+), and shows that cluster 1 won the competition (o). Cluster center 1 is circled and is closest to the input vector labeled +. The next example shows how a competitive network is trained. It uses the instar learning rule to learn the cluster centers and, since only one output is active at a time, only the winning weight vector is updated during each presentation. Suppose there are 11 input vectors (dimension=2) that we want to group into 3 clusters. The weight matrix is randomly initialized and the network is trained for 20 presentations of the 11 inputs.
x=[-1 5;-1.2 6;-1 5.5;3 1;4 2;9.5 3.3;-1.1 5;9 2.7;8 3.7;5 1.1;5 1.2]';
clf
plot(x(1,:),x(2,:),'*');title('Training Data'),xlabel('Input 1'),ylabel('Input 2');

[Figure: Training Data. Input 2 versus Input 1.]

This data is sent to a competitive network for training. It is specified a priori that the network will have 3 clusters.
w=rand(3,2);
a=0.8;                      % Learning and forgetting factor.
w=trn_cmpt(x,w,a,20);       % Train for 20 cycles.
plot(x(1,:),x(2,:),'k*');title('Training Data'),xlabel('Input 1'),ylabel('Input 2');hold on
plot(w(:,1),w(:,2),'o');hold off
[Figure: Training Data with trained weight vectors (o). Input 2 versus Input 1.]

Note that the first weight vector never moves towards the group of data centered around (10,3). This neuron is called a "dead neuron" since its output is never activated. One method of dealing with dead neurons is to give them an extra chance of winning a competition, that is, an increased bias towards winning. To implement this, we will add a bias to all of the competitive neurons. The bias will be increased when the neuron doesn't win and decreased when the neuron does win. The function competb() will evaluate a competitive network with a bias.
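One way such a biased competition could work is sketched below. The scoring rule (negative distance plus bias) and the bias step are assumptions made for illustration; the sketch does not reproduce competb() or trn_cptb(). The weights are held fixed so that the effect of the bias alone is visible.

```matlab
% Sketch only: a biased competition that lets a distant node win eventually.
w = [0 0; 5 5];                               % node 2 starts far from the data
b = [0; 0];                                   % biases, one per node
x = [1; 0];                                   % the same input presented repeatedly
step = 0.5;                                   % assumed bias increment
wins = zeros(2,1);                            % win count for each node
for k = 1:30
  d = sqrt(sum((w - ones(2,1)*x').^2, 2));    % distance from input to each node
  [~, winner] = max(-d + b);                  % bias offsets the distance penalty
  wins(winner) = wins(winner) + 1;
  b = b + step;                               % every node gains bias ...
  b(winner) = b(winner) - 2*step;             % ... except the winner, which loses it
end
wins
```

Without the bias, node 2 would never win for this input. With it, node 2 starts winning after its bias has grown enough, which is the extra chance a dead neuron needs in order to be pulled toward the data during training.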
dimension=2;                % Input space dimension.
clusters=3;                 % Number of clusters to identify.
w=rand(clusters,dimension); % Initialize weights.
b=.1*ones(clusters,1);      % Initialize biases.
a=0.8;                      % Learning and forgetting factor.
cycles=5;                   % Iteratively train for 5 cycles.
w=trn_cptb(x,w,b,a,cycles); % Train network.
clf
plot(x(1,:),x(2,:),'*');title('Training Data'),xlabel('Input 1'),ylabel('Input 2');hold on
plot(w(:,1),w(:,2),'o');hold off
[Figure: Training Data with trained weight vectors (o). Input 2 versus Input 1.]

We can see that the dead neuron has come alive. It has centered itself in the third cluster. One detail that has not been discussed is the selection of the learning rate. A large learning rate allows the network to learn fast but reduces its stability and leads to oscillatory behavior. An adaptive learning rate can be used that allows the network to learn fast initially and then more slowly as training progresses.
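The trade-off between a large fixed rate and a decaying one can be seen with a single weight that is pulled alternately toward two inputs. This is a minimal sketch under assumed values: the inputs, the initial rate, and the decay schedule a0/(1+0.1k) are illustrative choices rather than anything prescribed by the text.

```matlab
% Sketch: fixed versus decaying learning rate for a single instar-style weight.
x = [1 3];                                    % two scalar inputs, presented alternately
a0 = 0.8;                                     % initial learning rate
w_fixed = 0;                                  % weight trained with a constant rate
w_decay = 0;                                  % weight trained with a decaying rate
for k = 1:200
  xk = x(mod(k,2)+1);                         % alternate between the two inputs
  a = a0/(1 + 0.1*k);                         % assumed decay schedule
  w_fixed = w_fixed + a0*(xk - w_fixed);
  w_decay = w_decay + a*(xk - w_decay);
end
[w_fixed w_decay]
```

With the fixed rate the weight keeps jumping between the two inputs, while the decaying rate lets it settle close to their mean of 2.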


9.5.2 Self Organizing Feature Maps

The self-organizing feature map, or Kohonen network [1984], maps a high dimension input vector into a smaller dimensional pattern; usually this pattern is of dimension one or two. The conventional two dimensional feature map architecture is shown below.

[Figure: Self Organizing Feature Map. The input vector X (x1, x2, ..., xn) connects to a two dimensional grid of output neurons.]

In a feature map, the geometrical arrangement or location of the outputs contains information about the input vectors. If the input vectors X1 and X2 are fairly similar, their outputs should be located close together, and if X1 and X2 are quite similar, then their outputs should be equal. This similarity is defined as it was in competitive learning as the Euclidean distance from each other.

This relationship can be realized by one of several different learning algorithms. It can be realized by using ordinary competitive learning with lateral connection weights in the output layer that excite nearby nodes and inhibit nodes that are farther away. This type of function is commonly referred to as a "Mexican hat" function, mexhat. It can also be realized by ordinary competitive learning where the weights of nearby neighbors are allowed to update along with the weights of the winning node. This realization is termed Kohonen's algorithm. Kohonen's algorithm clusters data in a way that preserves the topology of the inputs, thus geometrically revealing the similarities of the inputs. The Kohonen learning algorithm is:

\[ \Delta w_i = \alpha (x - w_i) \quad \text{for } w_i \text{ near } x \]

The weight updates are only performed for the winning neuron and its neighbors (wi near x). For a 4x4, 2-dimensional feature map, the neurons nearest to the winning neuron are also updated. These neurons are the gray shaded neurons in the above figure. There may be a reduced update for neurons as a function of their distance from the winning neuron.

[Figure: Mexican Hat Function. Surface plot over a 30 by 30 grid with a peak value of 1.]
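A "Mexican hat" profile is often approximated as the difference of a narrow and a wide Gaussian, which gives excitation near the winner and inhibition farther away. The sketch below uses that approximation in one dimension; the widths and the 0.5 gain are assumed values, and this is not necessarily the form used by the supplement's mexhat function.

```matlab
% Sketch of a 1-D Mexican hat lateral-interaction profile.
d = -10:0.5:10;                               % distance from the winning node
s_exc = 2;                                    % width of the excitatory (near) region
s_inh = 6;                                    % width of the inhibitory (far) region
h = exp(-d.^2/(2*s_exc^2)) - 0.5*exp(-d.^2/(2*s_inh^2));
h_center = h(d == 0)                          % positive: nearby nodes are excited
h_far = h(d == 6)                             % negative: distant nodes are inhibited
```

Calling plot(d,h) displays the characteristic central peak surrounded by a shallow negative ring.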

Suppose we have a two input network that is to organize the data into a one dimensional feature map of length 5.

[Figure: One Dimensional Feature Map. Inputs x1 and x2 connect to five output nodes numbered 1 to 5.]

In the following example, we will organize 18 input pairs (this may take a long time).
x=[1 2;8 9;7 8;6 6;2 3;7 7;2 2;5 4;3 3;8 7;4 4;
   7 6;1 3;4 5;8 8;5 5;6 7;9 9]';
plot(x(1,:),x(2,:),'*');title('Training Data'),xlabel('Input 1'),ylabel('Input 2');
b=.1*ones(5,1);             % Small initial biases.
w=rand(5,2);                % Initial random weights.
tp=[0.7 20];                % Learning rate and maximum training iterations.
[w,b]=kohonen(x,w,b,tp);    % Train the self-organizing map.
ind=zeros(1,18);
for j=1:18
  y=compete(x(:,j),w);
  ind(j)=find(y==1);
end

[x;ind]

ans =
  Columns 1 through 12
     1     8     7     6     2     7     2     5     3     8     4     7
     2     9     8     6     3     7     2     4     3     7     4     6
     1     5     4     3     1     4     1     2     1     4     2     4
  Columns 13 through 18
     1     4     8     5     6     9
     3     5     8     5     7     9
     1     2     5     3     4     5

[Figure: Training Data. Input 2 versus Input 1.]

To observe the order of the classifications, we will use different symbols to plot the classifications.
clf
plot(w(:,1),w(:,2),'+')
hold on
plot(w(:,1),w(:,2),'-')
hold on
index=find(ind==1);
plot(x(1,index),x(2,index),'*')
hold on
index=find(ind==2);
plot(x(1,index),x(2,index),'o')
hold on
index=find(ind==3);
plot(x(1,index),x(2,index),'*')
hold on
index=find(ind==4);
plot(x(1,index),x(2,index),'o')
hold on
index=find(ind==5);
plot(x(1,index),x(2,index),'*')
hold on
axis([0 10 0 10]);
xlabel('Input 1'), ylabel('Input 2');
title('Self Organizing Map Output');
hold off
Self Organizing Map Output 10 9 8 7 6 Input 2 5 4 3 2 1 0 0 2 4 Input 1 6 8 10

It is apparent that the network not only clustered the data, but also organized it so that data points near each other were put in clusters next to each other. The use of self organizing maps preserves the topology of the input vectors.
9.6 Probabilistic Neural Networks
The Probabilistic Neural Network (PNN) is a Bayesian Classifier put into a neural network architecture. This network is well described in Fuzzy and Neural Approaches in Engineering, by Lefteri H. Tsoukalas and Robert E. Uhrig. Timothy Masters has written two books that give good discussions of PNNs: Practical Neural Network Recipes in C++ [1993a] and Advanced Algorithms for Neural Networks [1993b]. Both of these books contain disks with C++ versions of PNN code.

A PNN is a classifier. Although it can be used as a function approximator, this task is better performed by other iteratively trained neural network architectures or by the Generalized Regression Neural Network of Section 9.8. One of the most common classifiers is the Nearest Neighbor classifier. This classifier classifies a pattern to be in the same class as its nearest neighbor, or more generally, its k nearest neighbors. A drawback of the Nearest Neighbor classifier is that sometimes the nearest neighbor may be an outlier from another class. In the figure below, the input marked as a ? would be classified as an o, even though it is in the middle of many x's. This is a weakness of the nearest neighbor classifier.

[Figure: Classification Problem (an unknown input ? next to a single o, surrounded by x's)]

The principal advantage of PNNs over other NN architectures is their speed of learning. The weights are not trained through an iterative process; they are stored during what is commonly called the learning process. A second advantage is that the PNN has a solid theoretical foundation for making confidence estimates.

There are several major disadvantages to the PNN. First, the PNN must store all training patterns. This requires large amounts of memory. Secondly, during recall, all the training patterns must be processed. This requires a lengthy recall period. Also, the PNN requires a large representative training set for proper operation. Lastly, the PNN requires the proper choice of a width parameter called sigma. There are routines to choose this parameter in an optimal manner, but this is an iterative and sometimes lengthy procedure (see Masters 1993a). The width parameter may be different for each population (class), but the implementation presented here will use a single width parameter.

In summary, the Probabilistic Neural Network should be used only for classification problems where there is a representative training set. It can be trained quickly but has slow recall and is memory intensive. It has solid underlying theory and can produce confidence intervals. This network is simply a Bayesian Classifier put into a neural network architecture. The estimator of the probability density function uses the gaussian weighting function:

$$ g(x) = \frac{1}{n} \sum_{i=1}^{n} \exp\left( -\frac{\| x - x_i \|^2}{2\sigma^2} \right) $$

Where:
n is the number of cases in a class
xi is a specific case in a class
x is the input
σ is the width parameter

This formula simply estimates the probability density function (PDF) as an average of separate multivariate normal distributions. This function is used to calculate the probability density function for each class. For example, if the training data consisted of three classes with populations of 23, 12, and 18, the above formula would be used to estimate the PDF for each of the three classes with n=23, 12, and 18.

A simple PNN will now be implemented that classifies an input as one of two classes. The training data consists of two classes of data with four vectors in each class. A test data point will be used to verify correct operation.
x=[-3 -2;-3 -3;-2 -2;-2 -3;3 2;3 3;2 2;2 3]; % Training data
y=[1 1 1 1 2 2 2 2]'; % Classifications of training data
xtest=[-.5 1.5]; % Vector to be classified.
plot(x(:,1),x(:,2),'*');hold;plot(xtest(1),xtest(2),'o');
title('Probabilistic Neural Network Data')
axis([-4 4 -4 4]); xlabel('Input1');ylabel('Input2');hold off;
Current plot held

[Figure: Probabilistic Neural Network Data (Input2 versus Input1)]

The test data point can be classified by a PNN.
a=3; % a is the width parameter: sigma.
classes=2; % x has two classifications.
[class,prob]=pnn(x,y,classes,xtest,a) % Classify the test input: xtest.
class = 2
prob = 0.1025 0.3114

This function properly classified the input vector as class 2 and output a measure of membership (0.3114). As a final example, we will use an input vector x=[-2.5 -2.5].
x=[-3 -2;-3 -3;-2 -2;-2 -3;3 2;3 3;2 2;2 3]; % Training data
y=[1 1 1 1 2 2 2 2]'; % Classifications of training data
xtest=[-2.5 -2.5]; % Vector to be classified.
plot(x(:,1),x(:,2),'*');hold;plot(xtest(1),xtest(2),'o');
title('Probabilistic Neural Network Data')
axis([-4 4 -4 4]); xlabel('Input1');ylabel('Input2');hold off
a=3; % a is the width parameter: sigma.
classes=2; % x has two classifications.
[class,prob]=pnn(x,y,classes,xtest,a) % Classify the test input.
Current plot held
class = 1
prob = 0.9460 0.0037

[Figure: Probabilistic Neural Network Data (Input2 versus Input1)]

The PNN properly classified the input vector to class 1. The PNN also outputs a number related to the membership of the input to each class (0.946 0.004). These numbers can be used as confidence values for the classification.

9.7 Radial Basis Function Networks
A Radial Basis Function Network (RBF) has been proven to be a universal function approximator [Park and Sandberg 1991]. Therefore, it can perform similar function mappings as a MLP but its architecture and functionality are very different. We will first examine the RBF architecture and then examine the differences between it and the MLP that arise from this architecture.

[Figure: Radial Basis Function Network (input space with receptive fields, hidden nodes, weights W, output layer with bias)]

A RBF network is a two layer network that has different types of neurons in the hidden layer and the output layer. The hidden layer, which corresponds to a MLP hidden layer, is a non-linear, local mapping. This layer contains radial basis function neurons which most commonly use a gaussian activation function (g(x)). These functions are centered over receptive fields. Receptive fields are areas in the input space which activate the local radial basis neurons.

$$ g_j(x) = \exp\left[ -\frac{(x - \mu_j)^2}{\sigma_j^2} \right] $$

Where:
x is the input vector
µj is the center of a region called a receptive field
σj is the width of the receptive field
gj(x) is the output of the jth neuron

The output layer is a layer of standard linear neurons and performs a linear transformation of the hidden node outputs. This layer is equivalent to a linear output layer in a MLP, but the weights are usually solved for using a least squares algorithm rather than trained using backpropagation. The output layer may, or may not, contain biases; the examples in this supplement do not use biases.

Receptive fields center on areas of the input space where input vectors lie, and serve to cluster similar input vectors. If an input vector (x) lies near the center of a receptive field (µ), then that hidden node will be activated. If an input vector lies between two receptive field centers, but inside the receptive field width (σ), then the hidden nodes will both be partially activated. When input vectors lie far from all receptive fields there is no hidden layer activation and the RBF output is equal to the output layer bias values.

A RBF is a local network that is trained in a supervised manner. This contrasts with a MLP network that is a global network. The distinction between local and global is made through the extent of input surface covered by the function approximation. An MLP performs a global mapping, meaning all inputs cause an output, while an RBF performs a local mapping, meaning only inputs near a receptive field produce an activation.

[Figure: Global Mapping (MLP) and Local Mapping (RBF), with unknown inputs marked ?]

The ability to recognize whether an input is near the training set or if it is in an untrained region of the input space gives the RBF a significant benefit over the standard MLP. Since networks generalize improperly and arbitrarily when operating in regions outside the training area, no confidence should be given to their outputs in those regions. When using an MLP, one cannot judge whether or not the input vector comes from these untrained regions, and therefore, one cannot judge whether the output contains significant information. On the other hand, an RBF can tell the user if the network is operating outside its training region and the user will know when to disregard the output. It can give a "don't know" output. This ability makes the RBF the network of choice for safety critical applications or for applications that have a high financial impact.

The radial basis function y=gaussian(x,w,a) is given above and is implemented in an m-file where
x is the input vector
w is the center of the receptive field
a is the width of the receptive field
y is the output value
x=[-3:.1:3]'; % Input space.
y=gaussian(x,0,1); % Radial basis function centered at 0
% with a width of 1.
plot(x,y); grid;
xlabel('input');ylabel('output')
title('Radial Basis Neuron')
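The gaussian m-file itself is not listed in this section. As a rough stand-in, the formula for gj(x) given above can be written as an anonymous function. The name rbf is hypothetical, and the supplement's own function may scale the exponent differently.

```matlab
% Hypothetical stand-in for gaussian(x,w,a), written from the formula
% g(x) = exp(-(x - w)^2 / a^2). The name rbf is not from the supplement.
rbf = @(x, w, a) exp(-((x - w).^2) ./ a.^2);

x = (-3:.1:3)';              % Input space (column vector).
y = rbf(x, 0, 1);            % One neuron centered at 0 with a width of 1.

% With a row vector of centers, implicit expansion returns one column of
% activations per hidden neuron (needs MATLAB R2016b or later, or Octave).
A = rbf(x, [-1 0 1], 1);
disp(size(A))                % Rows are inputs, columns are neurons.
```

Returning one column per center is what lets the later examples write the hidden layer output as a single matrix a1.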

[Figure: Radial Basis Neuron (output versus input)]

From the above figure, we can see that the output of the Gaussian function, evaluated at the width parameter, equals about one half. We also note that the function has an approximate zero output at distances of 2.5 times the width parameter. This figure shows the range of coverage of the gaussian activation function.

Designing an RBF neural network requires the selection of the radial basis function width parameter. This decision is not required for an MLP. The width should be chosen so that the receptive fields overlap but so that one function does not cover the entire input space. This means that several radial basis neurons have some activation for each input, but not all radial basis neurons are highly active for a single input.

Another choice to be made is the number of radial basis neurons. Depending on the training algorithm used to implement the RBF, this may, or may not, be a decision made by the designer. For example, the MATLAB Neural Network Toolbox has two training algorithms. The first algorithm centers a radial basis neuron on each input vector. This leads to an extremely large network for input data composed of many patterns. The second algorithm incrementally adds radial basis neurons to reduce the training error to the preset goal.

There are several network architectures that will meet a specified error criterion. These architectures consist of different combinations of the radial basis function widths and the number of radial basis functions in the network. The following figure roughly shows the allowable combinations that may solve an example problem.

[Figure: Possible Combinations (Number of Neurons, min to max, versus Neuron Width, min to max)]

The maximum number of neurons is the number of input patterns; the minimum is related to the error tolerance and the complexity of the mapping. This minimum must be experimentally determined. A more complex map and a smaller tolerance require more neurons. The minimum width constant should overlap the input patterns and the maximum should not cover the entire input space. Excessively large widths can sometimes give good results for data with no noise, but these systems usually fail under real world conditions in which noise exists. The reason that the system can train well with noise free cases is that a linear method is used to solve for the second layer weights. The use of a regression method will minimize the error, but usually at the expense of large weights and significant overfitting. This overfitting is apparent when there is noise in the system. A smaller width will do a better job of alerting that an input vector is outside the training space, while a larger width may result in a network of smaller size and faster execution.

9.7.1 Radial Basis Function Example
As an example of the implementation of a RBF network, a function approximation over an interval will be used.
x=[-10:1:10]'; % Inputs.
y=.05*x.^3-.2*x.^2-3*x+20; % Target outputs.
plot(x,y);
xlabel('Input');
ylabel('Output');
title('Function to be Approximated')

[Figure: Function to be Approximated (Output versus Input)]

A radial basis function width of 4 will be used and the centers will be placed at [-8 -5 -2 0 2 5 8]. Most training routines will have an algorithm for determining the placement of the radial basis neurons, but for this example, they will simply be placed to cover the input space.
width=6; % Radial basis function width.
w1=[-8 -5 -2 0 2 5 8]; % Center of the receptive fields.
num_w1=length(w1); % The number of receptive fields.
a1=gaussian(x,w1,width); % Hidden layer outputs.
w2=inv(a1'*a1)*a1'*y; % Pseudo inverse to solve for output weights.

The network can now be tested both outside the training region and inside the training region.
test_x=[-15:.2:15]'; % Test inputs.
y_target=.05*test_x.^3-.2*test_x.^2-3*test_x+20; % Test outputs.
test_a1=gaussian(test_x,w1,width); % Hidden layer outputs.
yout=test_a1*w2; % Network outputs.
plot(test_x,[y_target yout]);
title('Testing the RBF Network');
xlabel('Input');
ylabel('Output')

[Figure: Testing the RBF Network (Output versus Input)]

The network generalizes very well in the training region but poorly outside the training region. As the inputs get far from the training region, the radial basis neurons are not active. This would alert the operator that the network is trying to operate outside the training space and that no confidence should be given to the output value.

To show the tradeoffs between the size of the neuron width and the number of neurons in the network, we will investigate two other cases. In the first case, the width will be made so small that there is no overlap and the number of neurons will be equal to the number of inputs. In the second case, the spread constant will be made very large and some noise will be added to the data.

9.7.2 Small Neuron Width Example
The radial basis function width will be set to 0.2 so that there is no overlap between the neurons.
x=[-10:1:10]'; % Inputs.
y=.05*x.^3-.2*x.^2-3*x+20; % Target outputs.
width=.2; % Radial basis function width.
w1=x'; % Center of the receptive fields.
a1=gaussian(x,w1,width); % Hidden layer outputs.
w2=inv(a1'*a1)*a1'*y; % Solve for output weights.
test_x=[-15:.2:15]'; % Test inputs.
y_target=.05*test_x.^3-.2*test_x.^2-3*test_x+20; % Test outputs.
test_a1=gaussian(test_x,w1,width); % Hidden layer outputs.
yout=test_a1*w2; % Network outputs.
plot(test_x,[y_target yout]);
title('Testing the RBF Network');xlabel('Input');ylabel('Output')

[Figure: Testing the RBF Network (Output versus Input)]

The above figure shows that the width parameter is too small and that there is poor generalization inside the training space. For proper overlap, the width parameter needs to be at least equal to the distance between input patterns.

9.7.3 Large Neuron Width Example
A very large radial basis function width equal to 200 will be used. Such a large width parameter causes each radial basis function to cover the entire input space. When this occurs, the radial basis functions are all highly activated for each input value. Therefore, the network may have problems learning the desired mapping.
width=200; % Radial basis function width.
w1=[-8 -3 3 8]; % Center of the receptive fields.
a1=gaussian(x,w1,width); % Hidden layer outputs.
w2=inv(a1'*a1)*a1'*y; % Solve for output weights.
test_x=[-15:.2:15]'; % Test inputs.
y_target=.05*test_x.^3-.2*test_x.^2-3*test_x+20; % Test outputs.
test_a1=gaussian(test_x,w1,width); % Hidden layer outputs.
yout=test_a1*w2; % Network outputs.
plot(test_x,[y_target yout]);
title('Testing the RBF Network');
xlabel('Input');
ylabel('Output')
Warning: Matrix is close to singular or badly scaled.
Results may be inaccurate. RCOND = 3.273238e-017
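The warning comes from forming a1'*a1 and inverting it explicitly, which roughly squares the condition number of a1. The block below is a minimal sketch, not taken from the supplement, that solves the same least-squares problem with the backslash operator and with pinv. The anonymous function rbf is an assumed stand-in for gaussian.

```matlab
% Assumed stand-in for gaussian(x,w,a): exp(-(x - w)^2 / a^2).
rbf = @(x, w, a) exp(-((x - w).^2) ./ a.^2);

x = (-10:1:10)';                       % Inputs.
y = .05*x.^3 - .2*x.^2 - 3*x + 20;     % Target outputs.
w1 = [-8 -3 3 8];                      % Centers of the receptive fields.
width = 200;                           % Very large width.
a1 = rbf(x, w1, width);                % 21-by-4 hidden layer outputs.

c = cond(a1'*a1);                      % Conditioning of the normal equations.
w2 = a1 \ y;                           % Least squares without forming a1'*a1.
w2p = pinv(a1) * y;                    % Alternative using the pseudo-inverse.
fprintf('min activation = %.4f, cond(a1''*a1) = %g\n', min(a1(:)), c);
```

Neither solver removes the underlying problem, since the hidden layer outputs are still nearly identical for every input; the sketch only avoids the additional loss of precision caused by the explicit inverse.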

[Figure: Testing the RBF Network (Output versus Input)]

The hidden layer activations ranged from 0.9919 to 1.0. This made the regression solution of the output weight matrix very difficult and ill-conditioned. The use of such a large width parameter causes numerical problems and also makes it difficult to know when an input vector is outside the training space.

9.8 Generalized Regression Neural Network
The Generalized Regression Neural Network [Specht 1991] is a feedforward neural network best suited to function approximation tasks such as system modeling and prediction. Although it can be used for pattern classification, the Probabilistic Neural Network discussed in Section 9.6 is better suited to those applications.

The GRNN is composed of four layers. The first layer is the input layer and is fully connected to the pattern layer. The second layer is the pattern layer and has one neuron for each input pattern. This layer performs the same function as the first layer RBF neurons: its output is a measure of the distance the input is from the stored patterns. The third layer is the summation layer and is composed of two types of neurons: S-summation neurons and a single D-summation neuron (division). The S-summation neuron computes the sum of the weighted outputs of the pattern layer while the D-summation neuron computes the sum of the unweighted outputs of the pattern neurons. There is one S-summation neuron for each output neuron and a single D-summation neuron. The last layer is the output layer and divides the output of each S-summation neuron by the output of the D-summation neuron. A general diagram of a GRNN is shown below.

[Figure: Generalized Regression Neural Network (input layer X1 ... Xn, pattern layer, summation layer with S and D neurons, output layer Y1 ... Yn)]

The output of a GRNN is the conditional mean given by:

$$ \hat{Y} = \frac{\sum_{j=1}^{T} W_j \exp\left( -\frac{D_j^2}{2\sigma^2} \right)}{\sum_{j=1}^{T} \exp\left( -\frac{D_j^2}{2\sigma^2} \right)} $$

Where the exponential function is a Gaussian function with a width constant sigma. Note that the calculation of the Gaussian is performed in the pattern layer, the multiplication of the weight vector and summations are performed in the summation layer, and the division is performed in the output layer.

The GRNN learning phase is similar to that of a PNN. It does not learn iteratively as do most ANNs, but instead, it learns by storing each input pattern in the pattern layer and calculating the weights in the summation layer. The equations for the weight calculations are given below. The pattern layer weights are set to the input patterns.

$$ W_p = X^T $$

The summation layer weights matrix is set using the training target outputs. Specifically, the matrix is the target output values appended with a vector of ones that connect the pattern layer to the D-summation neuron.

$$ W_s = [\, Y \;\; \mathrm{ones} \,] $$
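The layer-by-layer description above can be sketched directly from the weight equations. The block below is an illustration only and is not the supplement's grnn_trn or grnn_sim code, which may differ in detail. It assumes scalar inputs, so squared distances can be formed by simple subtraction; the query points xq and the width sigma = 2 are arbitrary choices.

```matlab
% Sketch of GRNN training and recall written from the equations above.
% Assumes scalar inputs; xq and sigma are illustrative choices.
x = [-10 -6.7 -3.3 0 3.3 6.7 10]';     % Training inputs.
y = .05*x.^3 - .2*x.^2 - 3*x + 20;     % Training targets.
Wp = x';                               % Pattern layer weights: Wp = X'.
Ws = [y ones(size(y))];                % Summation layer weights: [Y ones].
sigma = 2;                             % Width parameter.

xq = [-10 0 5]';                       % Inputs to be recalled.
D2 = (xq - Wp).^2;                     % Squared distances, queries by patterns.
G = exp(-D2 ./ (2*sigma^2));           % Pattern layer outputs.
S = G * Ws;                            % Column 1: S-summation, column 2: D-summation.
yhat = S(:,1) ./ S(:,2);               % Output layer performs the division.
disp([xq yhat])
```

Because each output is a weighted average of the stored targets, the recalled values always lie between the smallest and largest training targets.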

xlabel('Input').0000 16. The training parameters must cover the training space and the set should also contain the values at any minima or maxima.0000 9. ylabel('Output') Wp = -10.7 -3.0000 Ws = -20. a=2.a).3 0 3. % Generate the target outputs. A width parameter of 4 will also be used. x=[-10 -6. -6. y=grnn_sim(x.0000 1.Wp.0838 25.0000 Training Data for a GRNN The GRNN will now be simulated for the training data.3 6.3 6.3 6.7 10].7189 5. % Calculate the weight matrices.2*x.7 -3. First we calculate the weight matrices.7 10]'.3 0 3.7 10]'.3000 6.y_target) title('Training Data for a GRNN'). y_actual=. x=[-10 -6.0000 1.To demonstrate the operation of the GRNN we will use the same example used in the RBF section.y_target) plot(x. y_target=. The training patterns will be limited to five vectors distributed throughout the input space. [Wp.Ws.Ws]=grnn_trn(x.9252 20.3 0 3. The input training vector is chosen to be [-10 -6.0000 1.7 -3.7000 10.05*x.0000 1.0000 1.0000 1.05*x. 119 .7000 -3.9602 20.0000 1.0000 30 25 20 15 10 Output 5 0 -5 -10 -15 -20 10 5 0 5 10 % Training inputs.3000 0 3.^3-.^2-3*x+20.^3-.^2-3*x+20.2*x.

Ws.a). ylabel('Output' ) 120 . Next we check for correct generalization by simulating the GRNN over the trained region. The choice of a good width parameter is necessary to having good performance.y. xlabel('Input'). and optimal (a = 2). y=grnn_sim(x.x.^2-3*x+20. x=[-10:. the largest width parameter that gives good recall is optimal.'*') title('Generalization of a GRNN').x. ylabel('Output') Generalization of a GRNN 30 25 20 15 10 Output 5 0 -5 -10 -15 -20 -10 -5 0 Input 5 10 The recall performance for the network is very dependent on the width parameter.plot(x.5). Usually. a=. A small width parameter gives good recall of the training patterns but poor generalization.y.5.y_actual.5 was found to be the maximum width that has good recall.05*x. To show the effects of the width parameter. A larger width parameter would give better generalization but poorer recall. plot(x.5:10]'. a width parameter of 2. y_actual=. too large (a = 5).'*') title('Generalization of GRNN: a = 0.y_actual. the GRNN is simulated with a width parameter being too small (a = .2*x.Wp.^3-. In the example above.5') xlabel('Input').

[Figure: Generalization of GRNN: a = 0.5 (Output versus Input)]

x=[-10:.5:10]';
a=5;
y=grnn_sim(x,Wp,Ws,a);
y_actual=.05*x.^3-.2*x.^2-3*x+20;
plot(x,y,x,y_actual,'*')
title('Generalization of GRNN, a = 5')
xlabel('Input');ylabel('Output')

[Figure: Generalization of GRNN, a = 5 (Output versus Input)]

x=[-10:.5:10]';
a=2;
y=grnn_sim(x,Wp,Ws,a);
y_actual=.05*x.^3-.2*x.^2-3*x+20;
plot(x,y,x,y_actual,'*')
title('Generalization of GRNN: a = 2.5')
xlabel('Input');ylabel('Output')

[Figure: Generalization of GRNN: a = 2.5 (Output versus Input)]

Note that with the proper choice of training data and width parameter, the network was able to generalize with very few training parameters. If there is nothing known about the function, a large training set must be chosen to guarantee it is representative. This would make the network very large (many pattern nodes) and would require much memory and long recall times. Clustering techniques can be used to select a representative training set, thus reducing the number of pattern nodes.

Chapter 10 Dynamic Neural Networks and Control Systems
10.1 Introduction
Dynamic neural networks require some sort of memory. This memory allows the network to exhibit temporal behavior: behavior that is not only dependent on present inputs, but also on prior inputs. There are two major classes of dynamic networks: Recurrent Neural Networks (RNN) and Time Delayed Neural Networks (TDNN).

Recurrent Neural Networks are networks with internal time delayed feedback connections. The two most common RNN designs are the Elman network [Elman 1990] and the Jordan network [Jordan 1986]. In an Elman network, the hidden layer outputs are fed back through a one step delay to dummy input nodes. The Elman network can learn temporal patterns as well as spatial patterns because it can store information. The Jordan network is a recurrent architecture similar to the Elman network but it feeds back the output layer rather than the hidden layer. Recurrent Neural Networks are difficult to train due to the feedback connections. The usual training methods are Real Time Recurrent Learning (RTRL) [Williams and Zipser, 1989] and Back Propagation Through Time (BPTT) [Werbos 1990].

[Figure: Elman Recurrent Neural Network (inputs p(1) ... p(R), hidden layer a1 fed back through delays D, outputs a2(1) ... a2(S2))]

Time Delay Neural Networks (TDNNs) can learn temporal behavior by using not only the present inputs, but also past inputs. TDNNs accomplish this by simply delaying the input signal. The neural network architecture is usually a standard MLP but it can also be a RBF, PNN, GRNN, or other feedforward network architecture. Since the TDNN has no feedback terms, it is easily trained with standard algorithms.

[Figure: Time Delay Neural Network Model (input u(k) passed through delays D into an ANN with output y(k))]

Specific applications using these dynamic network architectures will be discussed in later sections of this chapter.

10.2 Linear System Theory
The educational version of MATLAB provides many functions for linear system analysis. This section will provide simple examples of the usage of some of these functions. Theoretical issues and derivation will not be discussed in this supplement. We will examine the Fast Fourier Transform (fft) and the Power Spectral Density (psd) functions.

Suppose you have a periodic signal that is 512 time steps long and is a combination of sine waves of 3 different frequencies. The signal is generated and plotted below, and the fft is then taken over its first 256 samples.
t=[1:1:512]; % Time
f1=.06*(2*pi); f2=.1*(2*pi); f3=.4*(2*pi); % Three frequencies
sig=2*sin(f1*t)+sin(f2*t)+1.5*sin(f3*t); % Periodic signal
plot(t,sig);title('Periodic Signal');
xlabel('Time');ylabel('Amplitude');axis([0 512 -5 5]);

[Figure: Periodic Signal (Amplitude versus Time)]

Y=fft(sig,256); % Take Fast Fourier Transform
Pyy=Y.*conj(Y)/256; % Find the normalized amplitude of the FFT.
f=[0:127]/256; % Frequency scale for the x axis (the first point is zero frequency).
plot(f,Pyy(1:128)) % Points 129:256 are symmetric.
title('Power Spectral Density');
xlabel('Frequency');ylabel('Power');

[Figure: Power Spectral Density (Power versus Frequency)]

The plot of the power in each frequency band shows the three frequency components of the signal and the magnitudes of the signals. A noise component is usually inherent to most signals.
noisy_sig=sig+randn(size(sig)); % Add normally distributed noise.
plot(t,noisy_sig);
title('Noisy Periodic Signal');
xlabel('Time');
ylabel('Amplitude');
axis([0 512 -5 5]);

[Figure: Noisy Periodic Signal (Amplitude versus Time)]

Y=fft(noisy_sig,256); % Take Fast Fourier Transform
Pyy=Y.*conj(Y)/256; % Find the normalized amplitude of the FFT.
f=[0:127]/256; % Frequency scale for the x axis (the first point is zero frequency).
plot(f,Pyy(1:128)) % Points 129:256 are symmetric.
title('Power Spectral Density');
xlabel('Frequency');
ylabel('Power');

[Figure: Power Spectral Density (Power versus Frequency)]

The figure shows the noise level rising across the entire spectrum. The PSD function can also be used to plot the Power Spectral Density with a dB scale.
psd(noisy_sig,256,1); % Plots a Power Spectral Density [dB].

[Figure: Power Spectrum Magnitude (dB) versus Frequency]

The use of these signal processing functions may be necessary when designing a neural or fuzzy system that uses inputs from the frequency domain.
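The psd function belongs to the Signal Processing Toolbox. Where that toolbox is not available, a similar dB-scaled spectrum can be computed from fft alone. The block below is a sketch under that assumption; it uses the noise-free signal so that the result is repeatable, and its scaling is one common convention that will not necessarily match the output of psd.

```matlab
% Sketch: power spectrum on a dB scale using only fft.
t = 1:512;                                   % Time.
sig = 2*sin(.06*2*pi*t) + sin(.1*2*pi*t) + 1.5*sin(.4*2*pi*t);
N = 256;                                     % Number of FFT points.
Y = fft(sig, N);                             % Uses the first N samples.
Pyy = Y .* conj(Y) / N;                      % Power in each frequency bin.
f = (0:N/2-1) / N;                           % Bin frequencies, cycles per time step.
PdB = 10*log10(Pyy(1:N/2));                  % Power on a decibel scale.
[~, k] = max(Pyy(1:N/2));                    % Strongest component.
fprintf('Largest peak near f = %.3f\n', f(k));
```

Plotting PdB against f gives a figure comparable in shape to the psd plot. The strongest peak falls in the bin nearest 0.06, the frequency of the largest sine component.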

10.3 Adaptive Signal Processing
Neural Networks can learn off-line or on-line. On-line learning neural networks are usually called adaptive networks because they can adapt to changes in the input or target signals. One example of an adaptive neural network is the single input adaptive transverse filter. This is a specific case of the TDNN with one layer of linear neurons.

[Figure: Single Input Adaptive Transverse Filter (input signal passed through delays D, weighted by w1 ... wn, and summed to form the output)]

Suppose that we have a Finite Duration Impulse Response (FIR) filter implemented by a TDNN (Shamma). FIR filters are sometimes used over Infinite Duration Impulse Response (IIR) filters such as Chebyshev, Elliptic, and Bessel due to their simplicity, stability, linearity, and finite response duration. The following example will show how a TDNN can be used as a FIR filter.

Consider the following filtering example. Suppose that a fairly noisy signal needs to be filtered so that it can be used as an input to a PID controller or other analog device. A FIR implemented as a neural network can be used for this application.
t=[250:1:400];
sig=sin(.01*t)+sin(.03*t)+.2*randn(1,151);
plot(t,sig);
title('Noisy Periodic Signal');
xlabel('Time');
ylabel('Amplitude');

[Figure: Noisy Periodic Signal (Amplitude versus Time)]

To simulate a TDNN we must construct an input matrix containing the delayed values of the input. For example, if we have an input vector of ten values that is being input to a TDNN with 3 delays, we call the function delay_in(x,d) with the number of delays d=3. This function returns an input matrix with 4 rows since there are four inputs to the network. This function does not pad zeros into the inputs, so the input vector shrinks by 3 patterns. For example:
x=[0 1 2 3 4 5 6 7 8 9]';
xd=delay_in(x,3) % Construct delayed input matrix.
xd =
0 1 2 3 4 5 6
1 2 3 4 5 6 7
2 3 4 5 6 7 8
3 4 5 6 7 8 9

Suppose we have a linear neural network FIR filter implemented with a TDNN with 5 delays. This filter can be used to process noisy data.
d=5; % Number of delays.
x=delay_in(sig',d); % Construct delayed input matrix.
w=[.1 .2 .2 .2 .2 .1]'; % Weight matrix.
y=x*w; % Calculate outputs.
td=t(d:length(sig)-1);
plot(td,sig(d:length(sig)-1),td,y');
title('Noisy Periodic Signal and Filtered Signal');
xlabel('Time');
ylabel('Amplitude');
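The delay_in function is not listed in this section. The block below is a hypothetical stand-in inferred only from the example output shown above, in which row k holds the input shifted by k-1 steps and no zeros are padded. The supplement's own implementation may be organized differently.

```matlab
% Hypothetical sketch of what delay_in(x,d) computes, inferred from the
% printed example: row k is the input shifted by k-1 steps, no zero padding.
x = (0:9)';                        % Ten input values.
d = 3;                             % Number of delays.
n = length(x) - d;                 % Patterns left after dropping d samples.
xd = zeros(d+1, n);
for k = 1:d+1
    xd(k,:) = x(k:k+n-1)';         % Row k holds samples k through k+n-1.
end
disp(xd)
```

Each column of xd is one network input pattern: the current sample together with its d neighbors in time.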

[Figure: Noisy Periodic Signal and Filtered Signal (Amplitude versus Time)]

We can see that the neural network filter removed a large portion of the noise. It actually performs this operation by calculating a linear weighted average over a window. In this example, the window is six time steps long. This filter can be trained on line to give specific response characteristics.

10.4 Adaptive Processors and Neural Networks
The FIR example of the previous section is a dynamic network in the sense that it can model temporal behavior. This section demonstrates how a neural network can be dynamic in the sense that its parameters change with time. Systems whose characteristics (mathematical models) do not change with time can be modeled with a neural network that is trained off-line. The example given trains a linear network on-line. This allows a network to adaptively learn non-stationary (time-varying) trends.

The figure below shows a block diagram of an adaptive neural network used for system identification. The neural network can be a TDNN if temporal behavior is necessary or can be a static mapping without time delays. The example of this section will only require a static mapping since the system output only depends on the current inputs, but it will be trained adaptively since one parameter is non-stationary.

5. Weights=[].4.1). A larger learning rate will allow the network to track faster but will reduce its stability while a lower learning rate will produce a stable system with slower tracking abilities.0. [W]=adapt(x.W.y. text(50. % Train network. end plot(Weights') % Plot parameters. lr=.01t Initially. W=[0 0 0]. % Simulate system.-2. The two weighting coefficients will converge to 2 and -3 while the bias will start at -1 and move towards +1. A simple linear network with two inputs can be trained on-line to estimate this system.'W2').01*i. % Save weights and bias to plot. x=rand(2. for i=1:200 % Run for 200 seconds.'B1').2. Suppose the system is modeled by: y=2x1-3x2-1+. % Initialize weights and bias to zero. A linear neural network can be adaptively trained on-line to follow the non-stationary coefficient. The choice of the learning rate will affect the speed of learning and the stability of the network. In this example. % System has random inputs. ylabel('Weight and Bias Values') text(50. y=2*x(1)-3*x(2)-1+. the offset term is -1 and this term changes to +1 as time reaches 200 seconds. text(50.u(k) Plant ei + Σ - yout(k) D D ANN Parallel Identification Model Let us consider a simple linear network used for adaptive system identification. Weights=[Weights W']. % Store weight and bias values over time. the weights are the linear system coefficients of a non-stationary system. title('Weights and Bias Approximation Over Time') xlabel('Time').'W1').5.lr). 130 . % Learning rate.
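The adapt function is not listed in this section. One plausible form of the on-line update is a normalized least mean squares (Widrow-Hoff) step applied to the weights and bias together. The block below is a sketch under that assumption and should not be read as the supplement's implementation; it also replaces rand with a fixed input sequence so that the result is repeatable.

```matlab
% Sketch of on-line training with a normalized LMS step (an assumed form
% of the update; the supplement's adapt function may differ).
lr = 0.4;                                    % Illustrative learning rate.
W = [0 0 0];                                 % [w1 w2 bias], initialized to zero.
for i = 1:200
    x = [mod(0.37*i,1); mod(0.61*i,1)];      % Fixed stand-in for rand(2,1).
    y = 2*x(1) - 3*x(2) - 1 + .01*i;         % Non-stationary system output.
    p = [x; 1];                              % Inputs augmented with a bias input.
    e = y - W*p;                             % Prediction error before the update.
    W = W + lr*e*p'/(p'*p);                  % Normalized LMS update.
end
disp(W)
```

The estimates lag slightly behind the drifting offset, which is the tracking delay described in the discussion of the learning rate above.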

[Figure: Weights and Bias Approximation Over Time. Weight and Bias Values versus Time, with traces W1, B1 and W2.]

The figure shows that the network properly identified the system by about 30 seconds and the bias correctly tracked the non-stationary parameter over time. This type of learning paradigm could be used to train the FIR filter used in the previous section.

10.5 Neural Networks Control

There are user supplied MATLAB toolkits that implement neural network based system identification and control paradigms:
The NNSYSID Toolbox is located at: http://kalman.iau.dtu.dk/Projects/proj/nnsysid.html
and the NNCTRL Toolkit is at: http://www.iau.dtu.dk/Projects/proj/nnctrl.html.
These toolboxes were developed by Magnus Nørgaard of the Institute of Automation, Technical University of Denmark. The toolboxes and user guides, Technical Report 95-E-773 and Technical Report 95-E-830, can be downloaded free of charge.

For a more in depth discussion of the use of neural networks for system identification and control refer to Advanced Control with MATLAB & SIMULINK by Moscinski and Ogonowski, 1995. For further reading on the use of neural networks for control see Irwin, Warwick and Hunt, 1995; Miller, Sutton and Werbos, 1990; Mills, Zomaya and Tade, 1996; Narendra and Parthasarathy, 1990; Omatu, Khalid and Yusof, 1996; Pham and Liu, 1995; White and Sofge, 1992; or Zbikowski and Hunt, 1996.

There are five general methods for implementing neural network controllers (Werbos pp. 59-65, in Miller Sutton and Werbos 1990):
1. Supervised Control
2. Direct Inverse Control
3. Neural Adaptive Control
4. Back-Propagation Through Time
5. Adaptive Critic Methods

10.5.1 Supervised Control

In supervised control, a neural network is trained to perform the same actions as another controller (mechanical or human) for given inputs and plant conditions. After the neural controller is trained, it replaces the controller.

[Figure: Neural Controller Training. Block diagram with reference r(k), the existing controller, the neural controller, the error e(k) between their outputs, and the plant output y(k).]

10.5.2 Direct Inverse Control

In Direct Inverse Control, a neural network is trained to model the inverse of a plant. This modeling is similar to the system identification problem discussed in Section 10.6. This methodology only works for a plant that can be modeled or approximated by an inverse function ($F^{-1}$).

[Figure: Inverse System Identification. Block diagram with input u(k), the plant output y(k), the inverse model and the error e(k).]

After the neural network learns the inverse model, it is used as a forward controller. Since $F(F^{-1}) = 1$, the output (y(k)) approximates the input (u(k)).

[Figure: Direct Inverse Control. Block diagram with input u(k), the ANN inverse plant model used as controller, the plant, and output y(k).]

10.5.3 Model Referenced Adaptive Control

When the plant model changes with time due to wear, temperature effects, etc., neural adaptive control may be the best technique to use. Model Referenced Adaptive Control (MRAC) adapts the controller characteristics so that the controller/plant combination performs like a reference plant.

[Figure: Direct Adaptive Control. Block diagram with reference r(k), the neural controller output u(k), the plant output y(k), the reference model, and the error e(k).]

Since the plant lies between the neural network and the error term, there is no method to directly adjust the controller's weights to reduce the error. Therefore, indirect control must be used.

[Figure: Indirect Adaptive Control. Block diagram with reference r(k), the neural controller output u(k), the plant output y(k), the identification model, the reference model, and two error signals e(k).]

In indirect adaptive control, an ANN identification model is used to model the non-linear plant. If necessary, this model may be updated to track the plant. The error signals can now be backpropagated through the identification model to train the neural controller so that the plant response is equal to that of the reference model. This method uses two neural networks, one for system identification and one for MRAC.

10.5.4 Back Propagation Through Time

Back Propagation Through Time (BPTT) [Werbos 1990] can be used to move a system from one state to another state in a finite number of steps (if the system is controllable). First a system identification neural network model must be trained so that the error signals can be propagated through it to the controller; then the controller can be trained with a BPTT paradigm.

[Figure: Training With BPTT. The controller C and plant model P are unrolled from the initial state x(0) through u(0), x(1), u(1), ..., x(k-1), u(k-1) to the final state x(k), which is compared with the desired state to form e(k). x is the state vector, u is the control signal, C is the controller and P is the plant model.]

BPTT training takes place in two steps:
1. The plant motion stage, where the plant takes k time steps.
2. The weight adjustment stage, where the controller's weights are adjusted to make the final state approach the target state.
It is important to note that there is only one set of weights to adjust because there is only one controller. Many iterations are run until performance is as desired.

10.5.5 Adaptive Critic

Often a decision has to be made without an exact conclusion as to its effectiveness (e.g. chess), but an approximation of its effectiveness can be obtained. This approximation can be used to change the control system. This type of learning is called reinforcement learning. A critic evaluates the results of the control action: if it is good, the action is reinforced; if it is poor, the action is weakened. This is a trial and error method and uses active exploration when the gradient of the evaluation system in terms of the control action is not available. Note that this is an approximate method, and should only be used when a more exact method is unavailable.

A neural network based adaptive critic system uses one ANN to estimate the utility J(k) of the state x(k). This utility is a measure of the goodness of the state. It also uses a second ANN that trains with reinforcement learning to produce an input to the system (u(k)) that produces a good state, x(k).

[Figure: Adaptive Critic System. The critic network maps x(k) to J(k), the action network produces u(k), and the plant produces x(k).]

10.6 System Identification

The task of both conventional and neural network based system identification is to build a mathematical model of a dynamic system based on empirical data. Conventional methods use empirical data and regression techniques to estimate the coefficients of difference equations (ARX, ARMAX) or state space representations (see [Ljung 1987]). The designer sets the structure and a regression technique is used to solve for the coefficients. In neural network based system identification, the internal weights and biases of the neural network are adjusted to make the model outputs similar to the measured outputs.

10.6.1 ARX System Identification Model

A dynamic model is one whose output is dependent on past states of the system which may be dependent on past inputs and outputs. A static model's output at a specific time is only dependent on the input at that time. The basic conventional dynamic model is the difference equation and the most common difference equation form is the ARX model: Autoregressive with Exogenous Input Model, which is sometimes called the Equation Error Model. This model uses past inputs and outputs to predict current outputs. For example, examine the following ARX model.

$$y(t) + a_1 y(t-1) + \dots + a_{na} y(t-na) = b_1 u(t-1) + \dots + b_{nb} u(t-nb) + e(t)$$

The coefficients are represented by the set:

$$\theta = [a_1\ a_2\ \dots\ a_{na}\ b_1\ b_2\ \dots\ b_{nb}]'$$

This form can be visualized as

[Figure: Block diagram in which u(t) passes through B(q)/A(q), e(t) passes through 1/A(q), and the two are summed to give y(t).]

where:

$$A(q) = 1 + a_1 q^{-1} + \dots + a_{na} q^{-na}$$
$$B(q) = b_1 q^{-1} + \dots + b_{nb} q^{-nb}$$

and:

$$y(t) = \frac{B(q)}{A(q)} u(t) + \frac{1}{A(q)} e(t)$$

This model assumes that the poles (roots of A(q)) are common between the dynamic model and the noise model. This model is a linear model; one of the major advantages of using neural networks is their non-linear approximating capabilities.
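As a concrete illustration of the regression step for an ARX model, the sketch below simulates a second order ARX system and recovers its coefficients with ordinary least squares using the backslash operator. The coefficient values, the noise level and the names Phi and theta are assumptions chosen for this example only; they do not come from the supplement.

```matlab
% Sketch: least-squares estimate of ARX coefficients (illustrative values).
N=500;                         % Number of samples.
a=[-1.5 0.7];                  % Assumed a1, a2 (stable poles).
b=[1 0.5];                     % Assumed b1, b2.
u=randn(N,1);                  % White input keeps the data persistently exciting.
e=0.01*randn(N,1);             % Small white equation error.
y=zeros(N,1);
for t=3:N                      % y(t)+a1*y(t-1)+a2*y(t-2)=b1*u(t-1)+b2*u(t-2)+e(t)
  y(t)=-a(1)*y(t-1)-a(2)*y(t-2)+b(1)*u(t-1)+b(2)*u(t-2)+e(t);
end
% Regression matrix: one row per time step t=3..N.
Phi=[-y(2:N-1) -y(1:N-2) u(2:N-1) u(1:N-2)];
Y=y(3:N);
theta=Phi\Y;                   % Least-squares solution [a1 a2 b1 b2]'.
disp([a b; theta'])            % True coefficients above the estimates.
```

The regressors are built from past outputs and past inputs only, which is why a single linear solve is enough here; a neural network replaces this solve with iterative weight training.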

10.6.2 Basic Steps of System Identification

There are three phases of system identification:
A. Collect experimental input/output data.
B. Select and estimate candidate model structures.
C. Validate the models and select best model.

These phases can be accomplished by using the following general procedure:
1. Design an experiment and collect data. This data must be Persistently Exciting, meaning that the training set has to be representative of the entire class of inputs that may excite the system.
2. Process the data to filter and remove outliers, etc.
3. Select a model structure.
4. Compute the best parameters for that structure.
5. Examine the model's properties.
6. If the properties are acceptable quit, else go to 3.

10.6.3 Neural Network Model Structure

There are two basic neural network model structures: the parallel identification structure and the series-parallel structure. The Parallel Identification Structure has direct feedback from the network's output to its input. It uses its estimate of the output to estimate future outputs. Because of this feedback, it has no guarantee of stability and requires dynamic backpropagation training. This structure should only be used if the actual plant outputs are not available.

[Figure: Parallel Identification Model. Block diagram with input u(k), the plant output yout(k), the ANN output y'out(k) fed back through delay elements D, and the error ei.]

The Series-Parallel Identification Structure does not use feedback. Instead, it uses the actual plant output to estimate future system outputs. Therefore, static backpropagation training can be used and there are proofs to guarantee stability and convergence.

[Figure: Series-Parallel Identification Model. Block diagram with input u(k), the plant output yout(k) fed through delay elements D to the ANN, the ANN output y'out(k), and the error ei.]

Once the general model structure is chosen, the model order must be selected. The model order is the number of past signals to use as regressors. This can be estimated through knowledge of the system or through experimentation. The network system identification model is then trained and tested. The error terms, also called residuals, should be white and independent of the input. Therefore, they are tested with an autocorrelation function and a crosscorrelation function with the inputs. Excessive correlation between the residuals and delayed inputs or outputs is evidence that the delayed inputs or outputs have information that can be used to reduce the estimation error.

For example, if a second order system is being identified with a neural network that only uses inputs delayed by one time step, the crosscorrelation between the two time step lagged input and the output will be large. This means that there is information in the two time step lagged input that can be used to reduce the estimation error. This is one method that can be used to experimentally determine the number of lagged inputs to use for input to the model. If the output residuals have a high correlation with lagged input signals, there is reason to believe that the lagged input signal should be included as an input.

The basic steps in setting up a system identification problem are the same for a conventional model and a neural network based model. As an example of an ARX model, consider:

$$y(t) - 1.5y(t-T) + 0.7y(t-2T) = 0.9u(t-2T) + 0.5u(t-3T) + e(t)$$

Here, the output at time t is dependent on the two past outputs, delayed inputs of 2 and 3 time steps and the disturbance. The ARX structure is defined by:
1. The number of delayed outputs to include (y(t-T) y(t-2T)).
2. The time delay of the system. In this case the input does not affect the output for 2T.
3. The number of delayed inputs to use (u(t-2T) u(t-3T)).
A neural network model structure is defined by the same inputs but is also defined by the network architecture, which includes the number and type of network hidden layers and hidden nodes.

10.6.4 Tank System Identification Example

This section presents an example that deals with the construction of a neural network system identification model for the tank system of Section 6.1. To design a system identification model, input/output data must be collected for the operating range and input conditions of the tank. The operating state is the tank level and the input is the voltage supplied to the inlet valve. The state and input defined in the function tank_mod(t,y) cover the following ranges:
Y(1): Tank level (0-36 inches).
Y(2): Voltage being applied to the control valve (-4.5 to 1 volt).
The function dy=tank_mod(t,y) is a non-linear model of the tank system where t is the simulation time, y(1) is the current state of tank (level), y(2) is the input signal (voltage to valve), and the output dy is the change in the tank state (change in level).

The inputs to the neural network model are the state: x(k) and the voltage going to the valve actuator: u(k). The output will be the change in the tank level: dx. By using a tank model with output dx and training an ANN with an input x(k) and u(k) we are able to avoid using a recurrent or time delay neural network.

[Figure: System Identification. Block diagram with inputs x(k) and u(k), the tank output dx, the ANN model output dx, the error e(k), and a sum producing x(k+1).]

To cover all possible operating conditions, we simulate the tank model over all combinations of the input and state.

x1=[0:3:36]'; % Level range.
x2=[-4.5:.5:1]'; % Input voltage range.
x=combine(x1,x2); % Input combinations.
dx=zeros(length(x1)*length(x2),1); % Level changes vector.
for i=1:length(dx) % Simulate the system for all combinations.
dx(i)=tank_mod(1,x(i,:));
end
save tank_dat x dx % Save the simulation training data.

Now that we have training data, we can train a neural network to model the system.

The backpropagation training script is used to train a single hidden layer neural network with 5 hidden neurons. The training data file has inputs x(k) and u(k) and the change in state dx is the target.

load tank_dat % Tank simulation data.
t=dx'; % Change in state.
x=x'; % State and input voltage.
save tank_sys x t; % Tank data in neural network training format.

The bptrain script trained the tank system identification model and the weights were saved in sys_wgt.mat.

load tank_sys % Tank training data.
load sys_wgt % Neural Network weights for system ID model.
[inputs,pats]=size(x);
output = linear(W2*[ones(1,pats);logistic(W1*[ones(1,pats);x])]);
subplot(2,1,1); plot([t;output]');
title('System Identification Results')
ylabel('Actual and Estimated Output')
subplot(2,1,2); plot(output-t); % Calculate the error.
ylabel('Error');xlabel('Pattern Number');
[Figure: System Identification Results. Top panel: Actual and Estimated Output versus Pattern Number. Bottom panel: Error versus Pattern Number.]

The top plot in the above figure shows that the neural network gives the correct output for the training set. The lower plot shows that the error levels are very small. The following is a comparison of outputs for a sinusoidal input. The time step must be the same as the integration time step of the analytical model. This checks for proper generalization.

t=[0:1:40]; % Simulation time.
x(1)=15; % Initial tank level.
X(1)=15; % Initial tank estimator level.
u=sin(.2*t)-1.5; % Input voltage to control valve.
load sys_wgt % Neural Network weights for system ID model.
[inputs,pats]=size(x);
for i=1:length(u); % Simulate the system.
dx(i)=tank_mod(1,[x(i) u(i)]); % Calculate change in state.
estimate=linear(W2*[1;logistic(W1*[1;x(i);u(i)])]);
x(i+1)=x(i)+dx(i); % Update the actual state.
X(i+1)=X(i)+estimate; % Update state estimate.
end
plot(t,[x(1:length(u));X(1:length(u))]);
title('Simulation Testing of Tank Model')
xlabel('Time');ylabel('Tank Level');
[Figure: Simulation Testing of Tank Model. Tank Level versus Time.]

This simulation is being run in an open loop parallel-identification mode. That is, the value used for the current state input into the neural network is the neural network's estimate. This structure allows the errors to build up over time. When a closed loop structure is chosen, the estimate is much better.
t=[0:1:40]; % Simulation time.
x(1)=15; % Initial tank level.
X(1)=15; % Initial tank estimator level.
u=sin(.2*t)-1.5; % Input voltage to control valve.
load sys_wgt % Neural Network weights for system ID model.
[inputs,pats]=size(x);
for i=1:length(u); % Simulate the system.
dx(i)=tank_mod(1,[x(i) u(i)]); % Calculate change in state.
estimate=linear(W2*[1;logistic(W1*[1;x(i);u(i)])]);
x(i+1)=x(i)+dx(i); % Update the actual state.
X(i+1)=x(i)+estimate; % Update state estimate.
end
plot(t,[x(1:length(u));X(1:length(u))]);
title('Simulation Testing of Tank Model')
xlabel('Time');ylabel('Tank Level');
[Figure: Simulation Testing of Tank Model. Tank Level versus Time.]

Since the actual state is used at each simulation time step, the errors are not allowed to build up and the neural network more closely tracks the actual system.
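The difference between the two simulation modes can be seen on a much simpler system than the tank. The sketch below is a hypothetical illustration: the "actual" system x(k+1)=0.9x(k)+u(k) and the slightly wrong model coefficient 0.88 are assumptions invented for this example, standing in for tank_mod and the trained network. Feeding the model its own estimate lets the one-step error accumulate, while feeding it the actual state does not.

```matlab
% Sketch: error build-up when a model is fed its own estimates.
% The system and model coefficients are assumed values for illustration.
n=60;
u=1+0.5*sin(.2*(1:n));         % Input signal.
x=zeros(1,n+1);                % Actual state.
Xown=zeros(1,n+1);             % Model fed its own previous estimate.
Xact=zeros(1,n+1);             % Model fed the actual previous state.
for k=1:n
  x(k+1)=0.9*x(k)+u(k);        % Actual system.
  Xown(k+1)=0.88*Xown(k)+u(k); % Errors are carried forward each step.
  Xact(k+1)=0.88*x(k)+u(k);    % Only the one-step error appears.
end
errOwn=max(abs(Xown-x));       % Largest error using estimated states.
errAct=max(abs(Xact-x));       % Largest error using actual states.
disp([errOwn errAct])
```

In this sketch the model error is the same in both loops; only the source of the state fed back to the model differs.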

10.7 Implementation of Neural Control Systems

The example of this section deals with the use of an inverse neural network system identification model for direct inverse control of the tank problem of Section 6.1. The data constructed in Section 10.6.4 will be used to train an inverse system model. Again, the operating state is the tank level and the input is the voltage to the valve. The state and input defined in the function tank_mod(t,y) cover the following ranges:
Y(1): Tank level (0-36 inches).
Y(2): Voltage being applied to the control valve (-4.5 to 1 volt).

To construct a direct inverse neural network controller we need to train a neural network to model the inverse of the tank. The inputs to the neural network inverse model will be the state: x(k) and the desired change in state: dx. The output will be the input voltage going to the valve actuator: u(k). By using a tank model with output dx and training an ANN with an input dx we are able to avoid using a recurrent or time delay neural network.

[Figure: Inverse System Identification. Block diagram showing x(k), u(k), the tank output dx, the sum producing x(k+1), the inverse model with inputs dx and x(k), and the error e(k).]

The backpropagation training script was used to train a single hidden layer neural network with 5 hidden neurons. We must set up the training data file correctly. The inputs x are the state x and the change in state dx. The target is the input u.

load tank_dat % Tank inverse model simulation data.
t=x(:,2)'; % Valve actuator voltage.
x=[x(:,1) dx]'; % State and desired change in state.
save tank_trn x t; % Tank data in neural network training format.

The bptrain script trained the inverse model and the weights were saved in tank_wgt.mat.

load tank_trn % Tank training data.
load tank_wgt % Neural Network weights for inverse system ID model.
[inputs,pats]=size(x);clf;
output = linear(W2*[ones(1,pats);logistic(W1*[ones(1,pats);x])]);
plot(output-t); % Compare NN model results with tank analytical model.
title('Training Results');xlabel('Pattern Number');ylabel('Error')
[Figure: Training Results. Error versus Pattern Number.]

After the neural network is trained, it is put into the direct inverse control framework. The input to the inverse tank model controller is the current state and the desired state. The output of the controller is the voltage input to the valve actuator.

[Figure: Direct Inverse Control of Tank. Block diagram with the desired state xd(k+1), the delayed state x(k), the desired change dx, the inverse tank model output u(k), the tank, and the sum producing x(k+1).]

A simulation of the inverse controller can be run using tank_sim(xinitial,x_desired). The desired level has been changed from 10 inches to 20 inches in this simulation. The output of this simulation is the level response of the neural network controlled tank.

tank_sim(10,20);

The tank and controller are simulated for 40 seconds, please be patient.

[Figure: Tank Level Response. Level (in) versus Time (sec).]

This controller has a faster speed of response and less steady state error than the fuzzy logic controller. This is shown in the next plot where both outputs are plotted together.

tank_sim(10,20);
hold on
tankdemo(10,20);
text(2,18,'Neural Net');
text(18,18,'Fuzzy System');
hold off

The tank and controller are simulated for 40 seconds, please be patient.
The tank and controller are simulated for 40 seconds, please be patient.

[Figure: Tank Level Response. Level (in) versus Time (sec), with traces labeled Neural Net and Fuzzy System.]

Chapter 11 Practical Aspects of Neural Networks

11.1 Neural Network Implementation Issues

There are several choices to be made when implementing neural networks to solve a problem. These choices involve the selection of the training and testing data, the network architecture, the training method, the data scaling method, and the error goal. Since over 90% of all neural network implementations use backpropagation trained multi-layer perceptrons, we will only discuss their implementation in this section. Sections 8.3 and 8.4 discussed scaling methods and weight initialization so those topics will not be revisited. The rest of these choices will be discussed in this chapter.

11.2 Overview of Neural Network Training Methodology

The figure below shows the methodology to follow when training a neural network. First you must collect or generate the data to be used for training and testing the neural network. Once this data is collected, it must be divided into a training set and a test set. The training set should cover the input space or should at least cover the space in which the network will be expected to operate. If there is not training data for certain conditions, the output of the network should not be trusted for those inputs. The division of the data into the training and test sets is somewhat of an art and somewhat of a trial and error procedure. You want to keep the training set small so that training is fast, but you also want to exercise the input space well, which may require a large training set.

[Figure: Neural Network Training Flow Chart. Collect Data; Select Training and Test Sets; Select Neural Network Architecture; Initialize Weights; if the SSE goal is not met, Change Weights or Increase NN Size; if it is met, Run Test Set; if the test SSE goal is not met, Reselect Training Set or Collect More Data; otherwise Done.]

Once the training set is selected, you must choose the neural network architecture. There are two lines of thought here. Some designers choose to start with a fairly large network that is sure to have enough degrees of freedom (neurons in the hidden layer) to train to the desired error goal; then, once the network is trained, they try to shrink the network until the smallest network that trains remains. Other designers choose to start with a small network and grow it until the network trains and its error goal is met. We will use the second method which involves initially selecting a fairly small network architecture.

After the network architecture is chosen, the weights and biases are initialized and the network is trained. The network may not reach the error goal due to one or more of the following reasons.
1. The training gets stuck in a local minimum.
2. The network does not have enough degrees of freedom to fit the desired input/output model.
3. There is not enough information in the training data to perform the desired mapping.
In case one, the weights and biases are reinitialized and training is restarted. In case two, additional hidden nodes or layers are added, and network training is restarted. Case three is usually not apparent unless all else fails. When attempting to train a neural network, you want to end up with the smallest network architecture that trains correctly (meets the error goal); if not, you may have overfitting. Overfitting is described in greater detail in Section 11.4.

4*x. 11. the test patterns that have high error levels should be added to the training set.3 Training and Test Data Selection Neural network training data should be selected to cover the entire region where the network is expected to operate. This process continues until the performance of the network is acceptable. Testing the network involves presenting the test set to the network and calculating the error. data may need to be collected again or be regenerated. These training decisions will now be covered in more detail and augmented with examples. If the error goal is met.4*x.3.^2 load data11 plot(x. x=[0:1:10]. it must be tested with the test data set. If the network does not generalize well on several data points. training is complete. save data11 x t The above code segment creates the training data used to train a network to approximate the following function: f(x)=2+3*x-0. If the error goal is not met. a new test set should be chosen. Another subset of that data is then used as test data to verify the correct generalization of the network. Poor generalization due to an incomplete training set. and the network should be retrained.ylabel('Output') 146 . Usually a large amount of data is collected and a subset of that data is used to train the network. there could be two causes: 1. If there is not enough data left for training and testing.t) title('Function Approximation Training Data') xlabel('Input'). 2. The test data set should also cover the operating region well. The cause of the poor test performance is rarely apparent without using crossvalidation checking which will be discussed in Section 11.Once the smallest network that trains to the desired error goal is found. that data is added to the training data and the network is retrained.^2. The following is an example of a network that is being used outside of the region where it was trained. Overfitting due to an incomplete training set or too many degrees of freedom in the network architecture.4. 
If an incomplete test set is causing the poor performance. t=2+3*x-. The training data should bound the operating region because a neural network's performance cannot be relied upon outside the operating region. This ability is called a network's extrapolation ability.

[Figure: Function Approximation Training Data. Output versus Input.]

We will choose a single hidden layer network architecture with 2 logistic hidden neurons and train to either an average error level of 0.2 or 5000 epochs. The network was trained using bptrain and zscore scaling resulting in:

*** BP Training complete, error goal not met!
*** RMS = 2.239272e-001

After several tries, the network never reached the error goal in 5000 epochs. Continued training may result in a network that meets the initial error criteria. For this example, we will continue with the exercise and plot the training performance.

load weight11
subplot(2,1,1);
semilogy(RMS);
title('Backpropagation Training Results');
ylabel('Root Mean Squared Error')
subplot(2,1,2);
plot(LR)
ylabel('Learning Rate')
xlabel('Cycles');

[Figure: Backpropagation Training Results. Top panel: Root Mean Squared Error versus Cycles (log scale). Bottom panel: Learning Rate versus Cycles.]

Now that the network is trained, we can check for generalization performance inside the training region. Remember that when scaling is used for training, those same scaling parameters must be used for any data input to the network.

x=[0:.01:10];
y=2+3*x-0.4*x.^2;
load weight11;
X=zscore(x,xm,xs);
output = linear(W2*[ones(size(x));logistic(W1*[ones(size(x));X])]);
clf;
plot(x,y,x,output);
title('Function Approximation Verification')
xlabel('Input');ylabel('Output')

[Figure: Function Approximation Verification. Output versus Input.]
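The reminder about reusing the scaling parameters can be sketched with plain arithmetic. The supplement's zscore(x,xm,xs) function is not listed here, so the lines below only assume that it subtracts a stored mean and divides by a stored standard deviation; the variable names xtrain, xnew and Xwrong are illustrative.

```matlab
% Sketch: z-score scaling that reuses the training-set parameters.
xtrain=0:1:10;                 % Training inputs.
xm=mean(xtrain);               % Scaling parameters are computed once,
xs=std(xtrain);                % from the training data only.
Xtrain=(xtrain-xm)/xs;         % Scaled training inputs.
xnew=[-5 2.5 15];              % New inputs presented to the trained network.
Xnew=(xnew-xm)/xs;             % Correct: reuse xm and xs from training.
Xwrong=(xnew-mean(xnew))/std(xnew); % Incorrect: rescaling with new statistics.
disp([Xnew; Xwrong])           % The two rows differ, so the network would
                               % see different inputs for the same xnew.
```

Saving xm and xs alongside the weights is one simple way to keep the two together.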

We can see that the network generalizes very well within the training region. Now we look at how the network extrapolates outside of the training region.

x=[-5:.1:15];
y=2+3*x-0.4*x.^2;
load weight11
X=zscore(x,xm,xs);
output = W2*[ones(size(x));logistic(W1*[ones(size(x));X])];
plot(x,y,x,output)
title('Function Approximation Extrapolation')
xlabel('Input')
ylabel('Output')

[Figure: Function Approximation Extrapolation. Output versus Input.]

We see that the network generalizes very well within the training region (0-10) but poorly outside of the training region. This shows that a neural network should never be expected to operate correctly outside of the region where it was trained.

11.4 Overfitting

Several parameters affect the ability of a neural network to overfit the data. Overfitting is apparent when a network's error level for the training data is significantly better than the error level of the test data. When this happens, the network learned the peculiarities of the training data, such as noise, rather than the underlying functional relationship of the model to be learned. Overfitting can be reduced by:
1. Limiting the number of free parameters (neurons) to the minimum necessary.
2. Increasing the training set size so that the noise averages itself out.
3. Stopping training before overfitting occurs.

Three examples will now be used to illustrate these three methods. A robust training routine would use all of the above methods to reduce the chance of overfitting.

11.4.1 Neural Network Size

In the example of Section 11.3, we see that the function can be approximated well with a network having 2 hidden neurons. Let us now train a network with data from a more realistic model. This model will have 20% noise added to simulate noisy data that would be measured from a process.

x=[0:1:10];
y=2+3*x-.4*x.^2;
randn('seed',3); % Set seed to original seed.
t=2+3*x-.4*x.^2+0.2*randn(1,11).*(2+3*x-.4*x.^2);
save data12 x t
plot(x,y,x,t)
title('Training Data With Noise')
xlabel('Input');ylabel('Output')

[Figure: Training Data With Noise. Output versus Input.]

We trained a neural network with the same architecture as above (2 neurons) using the noisy data.

*** BP Training complete, error goal not met!
*** RMS = 1.276182e+000

The error only trained down to 1.27, but this is to be expected since we did not want the network to learn the noise.

x=[0:1:10];
y=2+3*x-0.4*x.^2;
load data12
load weight12
X=zscore(x,xm,xs);
output = linear(W2*[ones(size(x));logistic(W1*[ones(size(x));X])]);
clf
plot(x,y,x,output,x,t)
title('Function Approximation Verification')
xlabel('Input')
ylabel('Output')

[Figure: Function Approximation Verification. Output versus Input.]

In the above figure, the smoothest line is the actual function, the choppy line is the noisy data and the line closest to the smooth line is the network approximation. We can see that the neural network approximation is smooth and follows the function very well.

Next we increase the degrees of freedom to more than is necessary for approximating the function. This case will use a network with 4 hidden neurons.

*** BP Training complete, error goal not met!
*** RMS = 4.709244e-001

x=[0:1:10];
y=2+3*x-0.4*x.^2;
load data12
load weight13
X=zscore(x,xm,xs);
output = linear(W2*[ones(size(x));logistic(W1*[ones(size(x));X])]);
clf
plot(x,y,x,output,x,t)
title('Function Approximation Verification')
xlabel('Input')

ylabel('Output')

[Figure: Function Approximation Verification. Output versus Input.]

The above figure is the output of a network with 4 neurons trained with the noisy data. We can see that when extra free parameters are used in the network, the approximation is severely overfitted. Therefore, a network with the fewest number of free parameters that can be trained to the error goal should be used.

This statement requires that a realistic error goal be set. A function with 11 training points containing 20% random noise for outputs that average around 6 will have an RMS error goal of .2*6=1.2. This is about the value that we got in the properly trained example above. If we stop training of an overparameterized network when that error goal is met, we reduce the chances of overfitting the model. A neural network with 4 hidden neurons is now trained with an error goal of 1.2.

*** BP Training complete after 151 epochs! ***
*** RMS = 1.194469e+000

The network learned much faster (151 epochs versus 5000 epochs) than the network with only two neurons. Let's look at the generalization.

x=[0:1:10];
y=2+3*x-0.4*x.^2;
load data12
load weight14
X=zscore(x,xm,xs);
output = linear(W2*[ones(size(x));logistic(W1*[ones(size(x));X])]);
clf
plot(x,y,x,output,x,t)
title('Function Approximation Verification')

xlabel('Input')
ylabel('Output')

[Figure: Function Approximation Verification. Output versus Input.]

The network generalized much better than the overparameterized network with an unrealistically low error goal. This illustrates two methods that can be used to reduce the chance of overfitting. When the actual signal is not known, a method to calculate the RMS error goal needs to be used. If there is significant noise in the data, a realistic error goal can be found by filtering the signal to smooth out the noise, then calculating the difference between the smoothed signal and the noisy signal. This difference is a rough approximation of the amount of noise in the signal.

11.4.2 Neural Network Noise

As discussed above, when there is noise in the training data, increasing the number of patterns in the training set can reduce the amount of overfitting. The following example will illustrate this point. In this example the training set size is increased to 51 patterns.

x=[0:.2:10];
y=2+3*x-.4*x.^2;
randn('seed',100)
t=2+3*x-.4*x.^2+0.2*randn(1,size(x,2)).*(2+3*x-.4*x.^2);
save data13 x t
plot(x,y,x,t)
title('Training Data With Noise')
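The smoothing approach to setting an error goal, described at the end of Section 11.4.1, can be sketched as follows. This is only one possible reading of "filtering the signal": a centered 5-point moving average is assumed here, and the window length, the variable names and the use of interior points only are illustrative choices rather than the supplement's method.

```matlab
% Sketch: estimate a realistic RMS error goal by smoothing noisy targets.
x=0:.2:10;
f=2+3*x-.4*x.^2;               % Underlying function (unknown in practice).
noise=0.2*randn(size(x)).*f;   % 20% noise, as in the examples above.
t=f+noise;                     % Noisy targets.
half=2;                        % Assumed half-width of a 5-point window.
n=length(t);
idx=(1+half):(n-half);         % Interior points with a full window.
tSmooth=zeros(size(idx));
for k=1:length(idx)
  tSmooth(k)=mean(t(idx(k)-half:idx(k)+half)); % Centered moving average.
end
resid=t(idx)-tSmooth;          % Noisy signal minus smoothed signal.
goal=sqrt(mean(resid.^2));     % Rough RMS error goal.
actual=sqrt(mean(noise(idx).^2)); % True noise RMS, for comparison only.
disp([goal actual])
```

Because each point also contributes to its own average, an estimate of this kind tends to come out somewhat below the true noise level, which is consistent with the text calling it a rough approximation.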

[Figure: Training Data With Noise]

Using this data we will now train a network with 5 hidden neurons. The results are as follows.

*** BP Training complete, error goal not met!
*** RMS = 1.062932e+000

Seeing that the RMS error is less than 1.2, we can expect some overfitting.

x=[0:.2:10];
y=2+3*x-0.4*x.^2;
load data13
load weight15
X=zscore(x,xm,xs);
output = linear(W2*[ones(size(x));logistic(W1*[ones(size(x));X])]);
clg
plot(x,y,x,output,x,t)
title('Function Approximation Verification With 5 Neurons')
xlabel('Input')
ylabel('Output')

[Figure: Function Approximation Verification With 5 Neurons; Output versus Input]

The above figure shows that the network generalized very well even though there were too many free parameters in the model. The network does sag at the input=2 region. This shows that using a more representative training set tends to average out the noise in the signals without having to stop training at an appropriate error goal or having to estimate the number of hidden neurons.

11.3 Stopping Criteria and Cross Validation Training

The last method of reducing the chance of overfitting is cross validation training. Cross validation training uses the principle of checking for overfitting during training. This methodology uses two sets of data during training. One set is used for training and the other is used to check for overfitting. Since overfitting occurs when the neural network models the training data better than it would other data, checking data is used during training to test for this overlearning behavior. At each training epoch, the RMS error is calculated for both the training set and the checking set. If the network has more than enough neurons to model the data, there will be a point during training when the training error continues to decrease but the checking error levels off and begins to increase; to avoid this, use more samples or reduce the number of hidden neurons. The script cvtrain is used to check for this behavior.

An additional 11 pattern noisy data set will be used as the checking data.

x=[0:1:10];
y=2+3*x-.4*x.^2;
load data12 % Training data set.
randn('seed',5); % Change seed.
tc=2+3*x-.4*x.^2+0.2*randn(1,11).*(2+3*x-.4*x.^2); % Checking data set.
save data14 x t tc
plot(x,y,x,t,x,tc);

title('Training Data and Checking Data')

[Figure: Training Data and Checking Data]

A 5 hidden neuron network will now be trained using cvtrain for 1000 epochs.

*** BP Training complete, error goal not met!
*** Minimum RMS = 6.909391e-001
*** Minimum checking RMS = 1.249887e+000
*** Best Weight Matrix at 240 epochs!

load weight16;
clg
semilogy(RMS);
hold on
semilogy(RMSc);
hold off
title('Cross Validation Training Results');
ylabel('Root Mean Squared Error')
text(600,.5,'Training Error');
text(600,2,'Checking Error');

[Figure: Cross Validation Training Results; Root Mean Squared Error (log scale) versus epoch, with curves labeled Checking Error and Training Error]

The figure shows that both errors initially are reduced, then the checking error starts to increase. The best weight matrices occur at the checking error minimum. Subsequent to the checking error minimum, the network is overfitting. This training method can also be used to identify a realistic training RMS error goal. In this example, the error goal is the minimum of the checking error.

min(RMSc)
ans =
    1.2499

Again we find that a realistic error goal is near 1.2. This agrees with our two previous calculations.

In summary, there are four methods to reduce the chance of overfitting:
1. Limiting the number of free parameters.
2. Training to a realistic error goal.
3. Increase the training set size.
4. Use cross validation training to identify when overfitting occurs.
These methods can be used independently or used together to reduce the chance of overfitting.
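The smoothing approach to a realistic error goal described in Section 11.2 can be sketched as follows. This is a minimal illustration, not the supplement's own routine: the 5-point moving-average window, the variable names, and the unseeded noise are assumptions made for the example.

```matlab
% Sketch: estimate a realistic RMS error goal by smoothing a noisy signal.
% Assumptions: the window length and variable names are illustrative only.
x = 0:0.2:10;                              % 51 sample points
clean = 2 + 3*x - 0.4*x.^2;                % underlying function
noisy = clean + 0.2*randn(size(x)).*clean; % 20% proportional random noise

win = 5;                                   % smoothing window length (assumed)
kernel = ones(1,win)/win;                  % moving-average filter
smoothed = conv(noisy, kernel, 'same');    % smoothed signal, same length

% Skip the edge samples, where the window runs off the data.
half = floor(win/2);
inner = (1+half):(length(x)-half);
residual = noisy(inner) - smoothed(inner); % rough estimate of the noise
rms_goal = sqrt(mean(residual.^2))         % candidate RMS error goal
```

The result is only a rough estimate, since the moving average also smooths some of the true curvature; a wider window removes more noise but distorts the signal more.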

Chapter 12 Fuzzy Methods in Neural Networks

12.1 Introduction

In this chapter we explore the use of fuzzy neurons in neural systems, while in the next chapter we will explore the use of neural methodologies to train fuzzy systems. There is a tremendous advantage to training fuzzy networks when experiential data is available, but the advantages of using fuzzy neurons in neural networks are less well defined. Embedding fuzzy notions into neural networks is an area of active research, and there are few well known or proven results. This chapter will present the fundamentals of constructing neural networks with fuzzy neurons but will not describe the advantages or practical considerations in detail.

12.2 From Crisp to Fuzzy Neurons

The artificial neuron was first presented in Section 7.1. This neuron processed information using the following equation.

$$y = f\left(\sum_{k=1}^{n} x_k w_k + b\right)$$

where x is the input vector, w are the weights connecting the inputs to the neuron and b is a bias. The function is usually continuous and monotonically increasing. The result of the product, $d_k = x_k w_k$, is referred to as the dendritic input to the neuron. In the case of fuzzy neurons, this dendritic input is usually a function of the input and the weight. This function can be an S-norm, such as the probabilistic sum, or a T-norm, such as the product. The choice of the function is dependent on the type of fuzzy neuron being used. Fuzzy neuron types will be discussed in subsequent sections.

S-norms:
1. probabilistic sum: $d_k = x_k \, S \, w_k = x_k + w_k - x_k w_k$
2. OR: $d_k = x_k \, S \, w_k = x_k \vee w_k = \max(x_k, w_k)$

T-norms:
1. product: $d_k = x_k \, T \, w_k = x_k w_k$
2. AND: $d_k = x_k \, T \, w_k = x_k \wedge w_k = \min(x_k, w_k)$

The dendritic inputs are then aggregated by some chosen operator. In the most simple case, this operator is a summation operator, but other operators such as min and max are commonly used.
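The four dendritic-input operators listed above can be tried out with plain anonymous functions. This is only a sketch: the function handle names and the sample values are chosen here for illustration and are not part of the supplement's function library.

```matlab
% Sketch: the two S-norms and two T-norms as element-wise anonymous functions.
% Handle names and sample values are illustrative assumptions.
s_psum = @(x,w) x + w - x.*w;   % S-norm: probabilistic sum
s_max  = @(x,w) max(x,w);       % S-norm: OR (max)
t_prod = @(x,w) x.*w;           % T-norm: product
t_min  = @(x,w) min(x,w);       % T-norm: AND (min)

x = [0.7 0.3];                  % two inputs
w = [0.2 0.5];                  % two weights

% One row of dendritic inputs per operator.
d = [s_psum(x,w); s_max(x,w); t_prod(x,w); t_min(x,w)]
```

For inputs and weights in [0 1], each S-norm result is at least as large as the larger operand and each T-norm result is no larger than the smaller operand, which is why S-norms model union and T-norms model intersection.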

Summation aggregation operator: $I_j = \sum_{k=1}^{n} d_k$
Min aggregation operator: $I_j = \bigwedge_{k=1}^{n} d_k = \min(d_k)$
Max aggregation operator: $I_j = \bigvee_{k=1}^{n} d_k = \max(d_k)$

The fuzzy neuron output $y_j$ is a function of the internal activation $I_j$ and the threshold level of the neuron. This function can be a numerical function or a T-norm or S-norm.

$$y_j = \Phi(I_j, T_j)$$

The use of different operators in fuzzy neurons gives the designer the flexibility to make the neuron perform various functions. For example, the output of a fuzzy neuron may be a linguistic representation of the input vector such as Small or Rapid. These linguistic representations can be further processed by subsequent layers of fuzzy neurons to model a specified relationship.

12.3 Generalized Fuzzy Neuron and Networks

This section will further discuss the fuzzy neural network architecture. The generalized fuzzy neuron is shown in the figure below.

[Figure: Generalized Fuzzy Neuron; synaptic inputs x1 ... xn, weights w1j ..., dendritic inputs d1j ... dnj, modified signals δ1j ... δnj, internal activation Ij, transfer function Φj, output yj]

The synaptic inputs (x) generally represent the degree of membership to a fuzzy set and have a value in the range [0 1]. The dendritic inputs (d) are also normally bounded, in the range [0 1], and represent the membership to a fuzzy set. In this figure, the dendritic inputs have been modified to produce excitatory and inhibitory signals. This is done with a function that performs the complement and is represented graphically by the not sign: °. Numerically, the behavior of these operators is represented by:

excitatory: $\delta_{ij} = d_{ij}$
inhibitory: $\delta_{ij} = 1 - d_{ij}$

The generalized fuzzy neuron has internal signals that represent membership to fuzzy sets. Therefore, these signals have meaning. Most signals in artificial neural networks do not have a discernible meaning. This is one advantage of a fuzzy neural system.

12.4 Aggregation and Transfer Functions in Fuzzy Neurons

As stated in section 12.2, the aggregation operator can be implemented by several different mathematical expressions. The most common aggregation operator is a T-norm represented by:

$$I_j = \mathop{T}_{i=1}^{n} \delta_{ij}$$

but other operators can also be implemented. The internal activation is simply the aggregation of the modified input membership values (dendritic inputs). These aggregation operators provide the fuzzy neuron with a means of implementing intersection (T-norm) and union (S-norm) concepts. A T-norm (product or min) tends to reduce the activation while an S-norm (probabilistic sum or max) tends to enhance the activation.

The MATLAB implementation of a T-norm aggregation is:

x=[0.7 0.3]; % Two inputs.
w=[0.2 0.5]; % Hidden weight matrix.
I=product(x',w') % Calculate product T-norm output.
I =
    0.1400
    0.1500

Similarly, the MATLAB implementation of a probabilistic sum S-norm is:

x=[0.7 0.3]; % Two inputs.
w=[0.2 0.5]; % Hidden weight matrix.
I=probor(x',w') % Calculate probabilistic sum S-norm output.
I =
    0.7600
    0.6500

Biases can be implemented in T-norm and S-norm operations by setting a weight value to either 1 or 0. In normal artificial neurons, the bias is the weight corresponding to a dummy node of input equal to 1. Similarly, in fuzzy neurons, the bias is the weight corresponding to a dummy input value (x0). For example, a bias in a T-norm would have

its dummy input set to a 1 while a bias in an S-norm would have its dummy input set to a 0. The MATLAB implementation of a T-norm bias is:

x=[1 0.3]; % The dummy input is a 1
w=[0.7 0.5]; % The corresponding bias weight is a .7.
I=x'.*w' % Calculate product T-norm output.
I =
    0.7000
    0.1500

The MATLAB implementation of an S-norm bias is:

x=[0 0.3]; % The dummy input is a 0.
w=[0.7 0.5]; % The corresponding bias weight is a .7.
I=probor(x',w') % Calculate probabilistic sum S-norm output.
I =
    0.7000
    0.6500

In both of the above cases, the bias propagates to the internal activation and may affect the neuron output depending on the activation function operator.

The activation function or transfer function is a mapping operator from the internal activation to the neuron's output. This mapping may correspond to a linguistic modifier. For example, if the inputs are weighted grades of accomplishments, the linguistic modifier "more-or-less" could give the aggregated value a stronger value. S-norms and T-norms are commonly used. The MATLAB implementation of an S-norm aggregation and a T-norm activation function would be:

x=[0 0.3]; % The dummy input is a 0.
w=[0.7 0.5]; % The corresponding bias weight is a .7.
I=probor(x',w') % Calculate probabilistic sum S-norm output.
z=prod(I) % Calculate product T-norm.
I =
    0.7000
    0.6500
z =
    0.4550

12.5 AND and OR Fuzzy Neurons

The most commonly used fuzzy neurons are AND and OR neurons. The AND neuron performs an S-norm operation on the dendritic inputs and weights and then performs a T-norm operation on the results of the S-norm operation. Although any S-norm or T-norm operations can be implemented, usually max and min operators are used. The exception

Hidden weight matrix. w=[0. The AND neuron representation is: n zh = Ti =1 ( xi S whi ) The MATLAB AND fuzzy neuron implementation is: x=[0.2 0.3000 % % % % Two inputs.5].2 0. 162 . I=max(x. in this case.7000 0. The multilayer networks discussed here have three layers with each layer performing a different function. AND and OR neurons can be arranged in layers and these layers can be arranged in networks to form multilayer fuzzy neural networks. zh = Sin=1 ( xi T whi ) The MATLAB OR fuzzy neuron implementation is: x=[0. The outputs of the hidden layer neurons perform a norm operation on the hidden layer outputs (z) and the output weight vector (v). The hidden layer is composed of either AND or OR neurons. Calculate min T-norm operation. w=[0. I=min(x.2000 0.3]. Calculate max S-norm operation. differentiable functions are used.5000 0.7 0. Calculate min T-norm operation.6 Multilayer Fuzzy Neural Networks The fuzzy neurons discussed in earlier sections of this chapter can be connected to form multiple layers. 12. Calculate max S-norm operation. The input layer simply sends the inputs to each of the hidden nodes.3].3000 0.w) z=max(I) I = z = 0.5]. The OR neuron performs a T-norm operation on the dendritic inputs and weights and then performs an S-norm operation on the results of the S-norm operation.7 0. These hidden layer neurons perform a norm operation on the inputs and weight matrix.w) z=min(I) I = z = 0. The following figure presents a diagram of a multilayer fuzzy network. Hidden weight matrix. The norm operations can be any type of S-norms or T-norms.5000 % % % % Two inputs.is when training routines are implemented.

[Figure: Multilayer Fuzzy Neural Network; input layer x1 ... xn and xn+1 ... x2n (index i), hidden layer AND1 ... ANDp (index h, weights whi), output layer OR1 ... ORm (index j, weights vjh), outputs y1 ... ym]

In the above figure, the hidden layer uses AND neurons to perform the T-norm aggregation. The output of the hidden layer is denoted $z_h$ where h is the index of the hidden node.

$$z_h = \left[\mathop{T}_{i=1}^{n} (x_i \, S \, w_{hi})\right] T \left[\mathop{T}_{i=1}^{n} \left(\bar{x}_i \, S \, w_{h(n+i)}\right)\right], \qquad h = 1, 2, \ldots, p$$

This is implemented in MATLAB with product T-norms and probabilistic sum S-norms. As a very simple example, suppose we have a network with two inputs, two hidden nodes and one output. The hidden layer outputs are given by:

x=[0.1 0.3]; % Two inputs.
xc=[0.9 0.7]; % Two complements.
w= [0.2 0.5; -0.2 0.8]; % Hidden weight matrix.
wc=[0.9 -0.3; 0.6 0.2]; % Complementary portion weight matrix.
for i=1:2 % Calculate hidden node outputs.
  z(i)=prod([prod(probor(x',w(:,i))) prod(probor(xc',wc(:,i)))]);
end
z
z =
    0.0390    0.3127

In the above figure, the output layer uses OR neurons to perform an S-norm aggregation. The output of the network is denoted $y_j$ where j is the index of the output neuron. In the examples of the text and this supplement, the networks are limited to one output neuron although the MATLAB code provides for multiple outputs.

$$y_j = \mathop{S}_{h=1}^{p} \left[ z_h \, T \, v_{jh} \right], \qquad j = 1, 2, \ldots, m$$

y=fuzzy_nn(x. % One output node. We will assume the OR fuzzy neuron is an output neuron and is therefore represented by: 164 . although the probabilistic sum S-norm and product T-norm that were used in the previous section do meet his requirement.9 -. one hidden node and one output.1 .3 . let us study the weight changes involved in training a single OR fuzzy neuron with two inputs.9 .5.7]'.4 . hidden neurons) size(v) = (hidden neurons. the fuzzy neural network's weights and biases can be trained to better model the relationship. for j=1:2 % Calculate output node internal activations.2281 The evaluation of all the fuzzy neurons in a multilayer fuzzy network can be combined into one function.-. Several S-norm and T-norm operators such as the max and min operators do not fit this requirement. I(j)=prod([z(j) v(j)]').3.6 .8. v is the output weight vector.. v=[0.3 0.]. This training can be performed with a gradient descent algorithm similar to the one used for the standard neural network.3 0.w.7 Learning and Adaptation in Fuzzy Neural Networks If there is experiential input/output data from the relationship to be modeled.2281 0.. Simulate the network.6 ]. y = 0. As a simple example. 12.7. y is the output vector.7 . suppose we have a network with two inputs.This is implemented in MATLAB with product T-norms and probabilistic sum S-norms: v=[0. w=[. end y=probor(I) % Perform a probor on the internal activations. monotonically increasing. output neurons) size(y) = (patterns. w is the hidden weight vector.3 . differentiable operators.2.7]'. inputs) size(w) = (inputs. One output node.. Complications with fuzzy neural network training arise because the activation or transfer functions must be continuous. Two hidden nodes.0803 % % % % % Two patterns of 2 inputs and their complements. output neurons) As a very simple example. The forward pass of two patterns gives: x=[. size(x) = (patterns.v) y = 0. 
The following code simulates the fuzzy neural network where: x is the input vector.2 .2 .

$$y_j = \mathop{S}_{h=1}^{p} \left[ z_h \, T \, v_{jh} \right], \qquad j = 1$$

where:
$y_j$ is the jth output neuron.
$z_h$ is the output of the hth hidden layer neuron.
$v_{jh}$ is the weight connecting the hth hidden layer neuron to the jth output.
T-norm is product.
S-norm is probabilistic OR.

The training data consists of three patterns.

z=[1 0.2 0.7; 1 0.3 0.9; 1 0.1 0.2]; % Training input data with bias.
t=[0.7; 0.8; 0.4]; % Training output target data.

And the output is represented by:

v=[.32 .76 .11] % Random initial weights.
y=zeros(3,1); % Output vector
for k=1:3 % For each input pattern
  I=product(z(k,:)',v'); % Calculate product T-norm operation.
  y(k)=probor(I); % Calculate probabilistic sum S-norm.
end
y
v =
    0.3200    0.7600    0.1100
y =
    0.4678
    0.5270
    0.3855

To train this network to minimize a squared error between the target vector (t) and the output vector (y), we will use a gradient descent procedure. The squared error (as opposed to the sum of squared errors) is used because the training will occur sequentially rather than in batch mode. The error and squared error are defined as:

$$\varepsilon = t - y$$
$$\varepsilon^2 = (t - y)^2$$

$$\Delta v_{jh} = -\eta \, \frac{\partial \varepsilon^2}{\partial v_{jh}} = -\eta \, \frac{\partial \varepsilon^2}{\partial y_j} \cdot \frac{\partial y_j}{\partial v_{jh}}$$

where:

$$\frac{\partial \varepsilon^2}{\partial y_j} = (-2)\,[\,t_j - y_j\,]$$

$$\frac{\partial y_j}{\partial v_{jh}} = \frac{\partial}{\partial v_{jh}} \left[ \mathop{S}_{h=1}^{p} (z_h \, T \, v_{jh}) \right]$$

and T is the T-norm operator (product), S is the S-norm operator (probabilistic sum). If we substitute the symbol A for the norms of the terms not involving a weight h=q, then we can solve for the partial derivative of y with respect to that weight $v_q$.

$$A = \mathop{S}_{h \neq q}^{p} (z_h \, T \, v_{jh})$$

$$\frac{\partial y_j}{\partial v_{jq}} = \frac{\partial \, [\,A + v_{jq} z_q - A v_{jq} z_q\,]}{\partial v_{jq}} = z_q (1 - A)$$

If the T-norm is a product and the S-norm is a probabilistic sum, this equation can be written as:

$$\frac{\partial y_j}{\partial v_{jq}} = z_q (1 - A) = z_q \left( 1 - \mathop{\mathrm{probor}}_{h \neq q}^{p} [\,v_{jh} z_h\,] \right)$$

Combining the factor of 2 into the learning rate results in:

$$\Delta v_{jq} = \eta \, (t_j - y_j) \, z_q \left( 1 - \mathop{\mathrm{probor}}_{h \neq q}^{p} [\,v_{jh} z_h\,] \right)$$

The error terms and squared error are:

error=(t-y)
SSE=sum((t-y).^2)
error =
    0.2322
    0.2730
    0.0145
SSE =
    0.1287

Now let's train the network with a learning rate of 1.

lr=1; % Learning rate.
patterns=length(error); % Number of input patterns.
cycles=30; % Number of training iterations.

% Train the network.
for m=1:cycles
  for j=1:patterns % Loop through each pattern.
    for i=1:length(v); % Loop through each weight.
      inds=[1:length(v)];
      ind=find(inds~=i); % Specify weights other than i.
      A=probor(product(z(j,ind)',v(ind)')); % Calculate T and S norms.
      delv(i)=lr*error(j)*z(j,i)*(1-A); % Calculate weight update.
    end
    v=v+delv; % Update weights.
  end
  % Now check the new SSE.
  for k=1:length(error); % For each input pattern
    I=product(z(k,:)',v'); % Calculate product T-norm operation.
    y(k)=probor(I); % Calculate probabilistic sum S-norm.
  end
  error=(t-y); % Calculate the error terms.
  SSE=sum((t-y).^2); % Find the sum of squared errors.
end
SSE
y
SSE =
    1.3123e-004
y =
    0.6909
    0.8063
    0.4030

This closely matches the target vector: t=[.7 .8 .4].

This gradient descent training algorithm can be expanded to train multilayer fuzzy neural networks. The previous section derived the weight update for a fuzzy OR output neuron. We will now derive the weight update for a fuzzy multilayer neural network with AND hidden units and OR output units. We use the chain rule to find the gradient vector.

$$\frac{\partial \varepsilon^2}{\partial w_{hi}} = \frac{\partial \varepsilon^2}{\partial y_j} \sum_{h=1}^{p} \frac{\partial y_j}{\partial z_h} \frac{\partial z_h}{\partial w_{hi}}$$

and

$$\frac{\partial \varepsilon^2}{\partial y_j} = (-2)\,[\,t_j - y_j\,]$$

Since the inputs and weights in AND and OR neurons are treated the same, the partial derivative with respect to the inputs ($z_h$) is of the same form as the partial derivative with respect to

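the weights ($v_h$) that was derived above. Therefore, substituting A for the norms not containing $z_q$, we get a solution of the same form.

$$A = \mathop{S}_{h \neq q}^{p} (z_h \, T \, v_{jh})$$

$$\frac{\partial y_j}{\partial z_q} = \frac{\partial \, [\,A + v_{jq} z_q - A v_{jq} z_q\,]}{\partial z_q} = v_{jq} (1 - A)$$

For a product T-norm and a probabilistic sum S-norm, this results in:

$$\frac{\partial y_j}{\partial z_q} = v_{jq} (1 - A) = v_{jq} \left( 1 - \mathop{\mathrm{probor}}_{h \neq q}^{p} [\,v_{jh} z_h\,] \right)$$

For the second term in the chain rule:

$$\frac{\partial z_h}{\partial w_{hi}} = \frac{\partial}{\partial w_{hi}} \prod_{i=1}^{n} (x_i + w_{hi} - x_i w_{hi})$$

For a specific input weight $w_{hr}$:

$$\frac{\partial z_h}{\partial w_{hr}} = \frac{\partial}{\partial w_{hr}} \left[ \prod_{i \neq r}^{n} (x_i + w_{hi} - x_i w_{hi}) \, (x_r + w_{hr} - x_r w_{hr}) \right]$$

$$= \frac{\partial}{\partial w_{hr}} \left[ \prod_{i \neq r}^{n} (x_i + w_{hi} - x_i w_{hi}) \, x_r + \prod_{i \neq r}^{n} (x_i + w_{hi} - x_i w_{hi}) \, w_{hr} - \prod_{i \neq r}^{n} (x_i + w_{hi} - x_i w_{hi}) \, x_r w_{hr} \right]$$

$$= \prod_{i \neq r}^{n} (x_i + w_{hi} - x_i w_{hi}) \, (1 - x_r)$$

Combining the above chain rule terms results in:

$$\frac{\partial \varepsilon^2}{\partial w_{hr}} = \frac{\partial \varepsilon^2}{\partial y_j} \, \frac{\partial y_j}{\partial z_h} \, \frac{\partial z_h}{\partial w_{hr}}$$

so that, with the factor of 2 again combined into the learning rate, the weight update is:

$$\Delta w_{hr} = \eta \, \varepsilon \left[ \prod_{i \neq r}^{n} (x_i + w_{hi} - x_i w_{hi}) \, (1 - x_r) \right] \left[ v_{jq} \left( 1 - \mathop{\mathrm{probor}}_{h \neq q} (v_{jh} z_h) \right) \right]$$

As an example of training a multilayer fuzzy neural network, we will consider a network with two hidden AND neurons and one output OR neuron.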
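The closed form for the derivative of an AND neuron output with respect to one of its weights can be checked numerically. The sketch below compares it with a finite-difference estimate; the helper names, the sample inputs and weights, and the step size are assumptions made for this illustration.

```matlab
% Sketch: finite-difference check of dz/dw_r for an AND neuron built from a
% probabilistic sum S-norm and a product T-norm.
% Helper names and sample values are illustrative assumptions.
psum = @(a,b) a + b - a.*b;             % probabilistic sum S-norm
andz = @(x,w) prod(psum(x,w));          % z = T_i (x_i S w_i), product T-norm

x = [0.2 0.7 0.4];                      % sample inputs
w = [0.5 0.1 0.8];                      % sample weights
r = 2;                                  % index of the weight to differentiate

others = [1:r-1, r+1:length(x)];        % indices i ~= r
analytic = prod(psum(x(others), w(others))) * (1 - x(r));

h = 1e-6;                               % finite-difference step
wp = w;
wp(r) = wp(r) + h;
numeric = (andz(x,wp) - andz(x,w)) / h;

disp([analytic numeric])                % the two values should agree closely
```

Because the neuron output is linear in any single weight, the forward difference agrees with the analytic value up to rounding error, whatever step size is used.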

[Figure: Example Fuzzy Neural Network; inputs x0, x1, x2 (index i), hidden layer AND1 and AND2 (index h, weights whi), bias input z0, output layer OR neuron (weights vh), output y]

In this case we will implement biases in the hidden layer. Since these are AND fuzzy neurons, the dummy input must equal 0. A bias will also be implemented in the output node with a dummy input equal to 1. The forward pass results in:

clear all;
rand('seed',10)
x=[0 0.2 0.7; 0 0.3 0.9; 0 0.7 0.2]; % Training input data with bias.
t=[0.7; 0.8; 0.4]; % Training output target data.
w=rand(2,3); % Random initial hidden layer weights.
v=rand(1,3); % Random initial output layer weights.
z=zeros(3,2); % Hidden vector (three patterns, 2 outputs).
y=zeros(3,1); % Output vector (three patterns).
for k=1:3 % For each pattern.
  for h=1:2 % For each hidden neuron.
    z(k,h)=prod(probor(x(k,:)',w(h,:)')); % Output of each hidden node.
  end
  I=product([1 z(k,:)]',v'); % Calculate product T-norm operation.
  y(k)=probor(I); % Calculate probabilistic sum S-norm.
end
error=(t-y)
SSE=sum((t-y).^2)
error =
    0.2987
    0.2654
    0.1153
SSE =
    0.1730

Now let's train the network with a learning rate of 0.5.

lr=0.5; % Learning rate.
cycles=20; % Number of training iterations.
patterns=size(error,1); % Number of input patterns.
[h_nodes,inputs]=size(w); % Define number of inputs and hidden nodes.
indv=[1:size(v,2)]; indw=[1:size(w,2)]; % weight matrix indices.
% Update the network output layer.
for m=1:cycles

  for j=1:patterns % Loop through patterns.
    for i=1:size(v,2) % Loop through weights.
      ind=find(indv~=i); % Specify weights not = i.
      Z=[ones(patterns,1) z]; % Output biases.
      A=probor(product(Z(j,ind)',v(ind)')); % Calculate T and S norms.
      delv(i)=lr*error(j)*Z(j,i)*(1-A); % Calculate weight update.
    end
    v=v+delv; % Update weights.
  end
  % Update the network hidden layer.
  for j=1:patterns % Loop through patterns.
    for l=1:h_nodes
      for i=1:inputs; % Loop through weights.
        ind=find(indw~=i); % Specify weights not = i.
        A=probor(product(Z(j,l)',v(l)')); % Calculate OR norms.
        B=prod(probor(x(j,ind)',w(l,ind)')); % Calculate AND norms.
        delw(l,i)=lr*error(j)*B*(1-x(j,i))*v(l)*A; % Compute update.
      end
    end
    w=w+delw; % Update weights.
  end
  % Now check the new error terms.
  for k=1:patterns % For each pattern.
    for h=1:h_nodes % For each hidden neuron.
      z(k,h)=prod(probor(x(k,:)',w(h,:)')); % Output of hidden nodes.
    end
    I=product([1 z(k,:)]',v'); % Calculate product T-norm operation.
    y(k)=probor(I); % Calculate probabilistic sum S-norm.
  end
  error=(t-y);
end
SSE=sum((t-y).^2)
y
SSE =
    0.0049
y =
    0.6704
    0.7615
    0.4502

The error has been drastically reduced and the output vector has moved towards the target vector: t=[.7 .8 .4]. By varying the learning rate and training for more iterations, this error can be reduced even further.

Chapter 13 Neural Methods in Fuzzy Systems

13.1 Introduction

The use of neural network training techniques allows us the ability to embed empirical information into a fuzzy system. This greatly expands the range of applications in which

fuzzy systems can be used. The ability to make use of both expert and empirical information greatly enhances the utility of fuzzy systems.

One of the limitations of fuzzy systems comes from the curse of dimensionality. Expert information about the relationship to be modeled may make it possible to reduce the rule set from all combinations to the few that are important. In this way, expert knowledge may make a problem tractable. A limitation of using only expert knowledge is the inability to efficiently tune a fuzzy system to give precise outputs for several, possibly contradictory, input combinations. The use of error based training techniques allows fuzzy systems to learn the intricacies inherent in empirical data.

13.2 Fuzzy-Neural Hybrids

Neural methods can be used in constructing fuzzy systems in ways other than training. They can also be used for rule selection, membership function determination and in what we can refer to as hybrid systems. Hybrid systems are systems that employ both neural networks and fuzzy systems. These hybrid systems may make use of a supervisory fuzzy system to select the best output from several neural networks or may use a neural network to intelligently combine the outputs from several fuzzy systems. The combinations and applications are endless.

13.3 Neural Networks for Determining Membership Functions

Membership function determination can be viewed as a data clustering and classification problem. First the data is classified into clusters and then membership values are given to the individual patterns in the clusters. Neural network architectures such as Kohonen Self Organizing Maps are well suited to finding clusters in input data. After cluster centers are identified, the width parameters of the SOM functions, which are usually Gaussian, can be set so that the SOM outputs are the membership values.

One methodology of implementing this type of two stage process is called the Adeli-Hung Algorithm. The first stage is called classification while the second stage is called fuzzification. Suppose you have N input patterns with M components or inputs. The Adeli-Hung Algorithm constructs a two layer neural network with M inputs and C clusters. Clusters are added as new inputs, which do not closely resemble old clusters, are presented to the network. This ability to add clusters and allow the network to grow resembles the plasticity inherent in ART networks. The algorithm is implemented in the following steps:

1. Calculate the degree of difference between the input vector $X_i$ and each cluster center $C_i$. A Euclidean distance can be used:

$$\mathrm{dist}(X, C_i) = \sqrt{\sum_{j=1}^{M} (x_j - c_{ij})^2}$$

where:
$x_j$ is the jth input.
$c_{ij}$ is the jth component of the ith cluster.
M is the number of inputs.

2. Find the closest cluster to the input pattern and call it $C_p$.

3. Compare the distance to this closest cluster, dist(X,Cp), with some predetermined distance. If it is closer than the predetermined distance, add it to that cluster; if it is further than the predetermined distance, then add a new cluster and center it on the input vector. When an input is added to a cluster, the cluster center (prototype vector) is recalculated as the mean of all patterns in the cluster.

$$C_p = [\,c_{p1} \; c_{p2} \; \ldots \; c_{pM}\,] = \frac{1}{n_p} \sum_{i=1}^{n_p} X_i^p$$

4. The membership of an input vector $X_i$ to a cluster $C_p$ is defined as

$$\mu_p = \begin{cases} 0 & \text{if } D^w(X_i^p, C_p) > \kappa \\ 1 - \dfrac{D^w(X_i^p, C_p)}{\kappa} & \text{if } D^w(X_i^p, C_p) \le \kappa \end{cases}$$

where:
$\kappa$ is the width of the triangular membership function.
$D^w$, the weighted norm, is the Euclidean distance:

$$D^w(X_i^p, C_p) = \sqrt{\sum_{j=1}^{M} (x_{ij}^p - c_{pj})^2}$$

This results in triangular membership functions with unity membership at the center and linearly decreasing to 0 at the tolerance distance. An example of a MATLAB implementation of the Adeli-Hung algorithm is given below.

X=[1 0 .9;0 1 0;1 1 1;0 0 0;0 1 1;.9 1 0;1 .9 1;0 0 .1]; % Inputs
C=[1 0 1]; % Matrix of prototype clusters.
data=[1 0 1 1]; % Matrix of data; 4th column specifies class.
tolerance=1; % Tolerance value used to create new clusters.
for k=1:length(X);
  % Step 1: Find the Euclidean distance to each cluster center.
  [p,inputs]=size(C);
  for i=1:p
    distance(i)=dist(X(k,:),C(i,:)).^.5;
  end
  % Step 2: Find the closest cluster.
  ind=find(min(distance)==distance);
  % Step 3: Compare with a predefined tolerance.
  if min(distance)>tolerance
    C=[C;X(k,:)]; % Make X a new cluster center.
    data=[data;[X(k,:) p+1]];

  else % Calculate old cluster center.
    data=[data;[X(k,:) ind]]; % Add new data pattern.
    cluster_inds=find(data(:,4)==ind); % Other clusters in class.
    for j=1:inputs
      C(ind,j)=sum(data(cluster_inds,j))/length(cluster_inds);
    end
  end
  % Step 4: Calculate memberships to all p classes.
  mu=zeros(p,1);
  for i=1:p
    D=dist(X(k,:),C(i,:)).^.5;
    if D<=tolerance;
      mu(i)=1-D/tolerance;
    end
  end
end
C % Display the cluster prototypes.
data % Display all input vectors and their classification.
mu % Display membership of last input to each prototype.
save AHAdata data C % Save results for the next section.

C =
    1.0000         0    0.9500
         0    0.3333    0.0333
    0.6667    0.9667    1.0000
    0.9000    1.0000         0
data =
    1.0000         0    1.0000    1.0000
    1.0000         0    0.9000    1.0000
         0    1.0000         0    2.0000
    1.0000    1.0000    1.0000    3.0000
         0         0         0    2.0000
         0    1.0000    1.0000    3.0000
    0.9000    1.0000         0    4.0000
    1.0000    0.9000    1.0000    3.0000
         0         0    0.1000    2.0000
mu =
         0
    0.6601
         0
         0

There were four clusters created from the X data matrix and their prototype vectors were stored in the matrix C. The first three elements in each row of the data matrix are the original data and the last column contains the identifier for the cluster closest to it. The vector mu contains the membership of the last data vector to each of the clusters. It can be seen that it is closest to vector prototype number 2. The Adeli-Hung Algorithm does a good job of clustering the data and finding their membership to the clusters. This can be used to preprocess data to be input to a fuzzy system.
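The fuzzification step (step 4 of the algorithm) can also be written without loops. The following sketch computes the triangular membership of one pattern to a set of prototypes; it uses only built-in functions rather than the supplement's dist routine, and the variable names are assumptions made here. The two prototypes are the first two cluster centers of the example, written as exact fractions.

```matlab
% Sketch: triangular membership of one pattern to each prototype.
% C holds one prototype per row; kappa is the membership width.
% Variable names are illustrative assumptions.
C = [1 0 0.95; 0 1/3 1/30];              % first two prototypes of the example
xnew = [0 0 0.1];                        % pattern to fuzzify
kappa = 1;                               % width of the triangular function

diffs = C - repmat(xnew, size(C,1), 1);  % row-wise differences
D = sqrt(sum(diffs.^2, 2));              % Euclidean distance to each prototype
mu = max(0, 1 - D/kappa)                 % 1 at the center, 0 at kappa and beyond
```

For this pattern the membership to the second prototype is about 0.66, matching the second entry of mu in the example, while the first prototype lies further away than kappa and so receives a membership of 0.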

13.4 Neural Network Driven Fuzzy Reasoning

Fuzzy systems that have several inputs suffer from the curse of dimensionality. This section will investigate and implement the Takagi-Hayashi (T-H) method for the construction and tuning of fuzzy rules; this is commonly referred to as neural network driven fuzzy reasoning. The T-H method is an automatic procedure for extracting rules and can greatly reduce the number of rules in a high dimensional problem, thus making the problem tractable.

The T-H method performs three major functions:
1. Partitions the decision hyperspace into a number of rules. It performs this with a clustering algorithm.
2. Identifies a rule's antecedent values (left hand side membership function). It performs this with a neural network.
3. Identifies a rule's consequent values (right hand side membership function) by using a neural network with supervised training. This part necessitates the existence of target outputs.

[Figure: Takagi-Hayashi Method; the inputs x1, x2, ..., xn feed NNmem and the networks NN1, NN2, ..., NNr, whose outputs u1, u2, ..., un are multiplied by the membership values w1, w2, ..., wr and summed to give the output y]

The above block diagram represents the T-H Method of fuzzy rule extraction. This method uses a variation of the Sugeno fuzzy rule:

if x1 is A1 AND x2 is A2 AND ... xn is An then y=f(x1, ... xn)

where f() is a neural network model rather than a mathematical function. This results in a rule of the form:

if x1 is A1 AND x2 is A2 AND ... xn is An then y=NN(x1, ... xn).

The NNmem calculates the membership of the input to the LHS membership functions and outputs the membership values. The other neural networks form the RHS of the rules. The LHS membership values weigh the RHS neural network outputs through a product function. The altered RHS membership values are aggregated to calculate the T-H system output. The neural networks are standard feedforward multilayer perceptron designs.
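The product weighting and aggregation described above amount to a normalized weighted sum of the rule outputs. A minimal sketch follows; the membership values and rule outputs are made-up numbers used only for illustration, standing in for what NNmem and the consequent networks would produce.

```matlab
% Sketch: combine rule outputs with membership weights.
% All numbers are illustrative assumptions, not results from the text.
mu = [0.05; 0.80; 0.10; 0.05];   % membership of the input to each of 4 rules
u  = [10.0; 4.5; 2.0; 7.0];      % output of each rule's consequent model

weighted = mu .* u;              % scale each rule output by its membership
y = sum(weighted) / sum(mu)      % normalized weighted sum of rule outputs
```

Dividing by sum(mu) keeps the result within the range of the rule outputs even when the memberships do not sum to one, so a rule with a dominant membership pulls the output toward its own consequent value.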

The following steps implement the T-H Method. The T-H method also implements methods for reducing the neural network inputs to a small set of significant inputs and checking them for overfitting during training.

Step 1: The training data x is clustered into r groups: R1, R2, ..., Rs {s=1,2,...,r} with nts terms in each group. Note that the number of inferencing rules will be equal to r.

Step 2: The NNmem neural network is trained with the target values selected as:

$$w_i^s = \begin{cases} 1, & x_i \in R^s \\ 0, & x_i \notin R^s \end{cases} \qquad i = 1, \ldots, n_t; \quad s = 1, \ldots, r$$

The outputs of NNmem for an input $x_i$ are labeled $w_i^s$, and are the membership values of $x_i$ to each antecedent set $R^s$.

Step 3: The NNs networks are trained to identify the consequent part of the rules. The inputs are {xi1s, ..., xims}, and the outputs are ys, i=1,2,...,nts.

Step 4: The final output value y is calculated with a weighted sum of the NNs outputs:

$$y_i = \frac{\sum_{s=1}^{r} \mu_{A_s}(x_i) \cdot \{u_s(x_i)\}_{\inf}}{\sum_{s=1}^{r} \mu_{A_s}(x_i)}, \qquad i = 1, 2, \ldots, n$$

where $u_s(x_i)$ is the calculated output of NNs.

An example of implementing the T-H method is given below. Suppose the data and clustering results from the prior example are used. In this example there are 9 input data vectors that have been clustered into four groups:
R1 has 2 inputs assigned to it (nt1 = 2).
R2 has 3 inputs assigned to it (nt2 = 3).
R3 has 3 inputs assigned to it (nt3 = 3).
R4 has 1 input assigned to it (nt4 = 1).
Therefore, there are four rules to implement in the system. First we will train the network NNmem.

load AHAdata
x=data(:,1:3); % The first three columns of data are the input patterns.
class=data(:,4); % The last column of data is the classification.
t=zeros(9,4);

% Create the target vector so that the
% classification column in each target pattern
% is equal to a 1; all other are zero.
for i=1:9
t(i,class(i))=1;
end
t
save TH_data x t

t =
1 0 0 0
1 0 0 0
0 1 0 0
0 0 1 0
0 1 0 0
0 0 1 0
0 0 0 1
0 0 1 0
0 1 0 0

We used the bptrain algorithm to train NNmem and save the weights in th_m_wgt. The network's architecture consists of three logistic hidden neurons and a logistic output neuron.

Now each of the consequent neural networks (NNs) must be trained. To train those networks, we must first define the target vectors y and then create the 4 training data sets.

% In this example we will define single outputs for each input.
y=[10 12 5 2 4 1 7 1.5 4.5]';
r=4; % Number of classifications.
for s=1:r
ind=find(data(:,4)==s); % Identify the rows in each class.
x=data(ind,1:3); % Make the input training matrix.
t=y(ind,:); % make the target training matrix.
eval(['save th_',num2str(s),'_dat x t']);
end

The four neural networks with 1 logistic hidden neuron and a linear output neuron were trained off line with bptrain. The weights were saved in th_1_wgt, th_2_wgt, th_3_wgt, and th_4_wgt. The following code implements the T-H network structure and evaluates the output for a test vector.

xt=[-.1 .1 .05]'; % Define the test vector.
% Load the neural network weight matrices.
load th_m_wgt; W1n=W1;W2n=W2;
load th_1_wgt; W11=W1;W21=W2;
load th_2_wgt; W12=W1;W22=W2;
load th_3_wgt; W13=W1;W23=W2;
load th_4_wgt; W14=W1;W24=W2;
mu = logistic(W2n*[1;logistic(W1n*[1;xt])]); % NNmem outputs.
% Find the outputs of all four consequent networks.
u1 = W21*[1;logistic(W11*[1;xt])];
u2 = W22*[1;logistic(W12*[1;xt])];
u3 = W23*[1;logistic(W13*[1;xt])];
u4 = W24*[1;logistic(W14*[1;xt])];

u=[u1 u2 u3 u4]; % Vector of NNs outputs.
Ys=u*mu; % NNs outputs times the membership values (NNmem).
Y=sum(Ys)/sum(mu) % T-H Fuzzy System output

Y =
4.4096

The closest inputs to the test vector [-0.1 0.1 0.05] are the vectors [0 1 0], [0 0 0] and [0 0 .1]. These three vectors have target values of 5, 4, and 4.5 respectively, so the output of 4.41 is appropriate. The training vectors were all input to the system and the network performed well for all 9 cases. The T-H method of using neural networks to generate the antecedent and consequent membership functions has been found to be useful and easily implementable with MATLAB.

13.5 Learning and Adaptation in Fuzzy Systems via Neural Networks

When experiential data exists, fuzzy systems can be trained to represent an input-output relationship. By using gradient descent techniques, fuzzy system parameters, such as membership functions (LHS or RHS), and the connectives between layers in an adaptive network, can be optimized. Adaptation of fuzzy systems using neural network training methods has been proposed by various researchers. Some of the methods described in the literature are: 1) fuzzy system adaptation using gradient-descent error minimization [Hayashi et al. 1992]; 2) optimization of a parameterized fuzzy system with symmetric triangular-shaped input membership functions and crisp outputs using gradient-descent error minimization [Nomura, 1994; Wang, 1994; Jang, 1995]; 3) gradient-descent with exponential MFs [Ichihashi, 1993]; and 4) gradient-descent with symmetric and nonsymmetric LHS MFs varying connectives and RHS forms [Guély, Siarry, 1993].

Regardless of the method or the parameter of the fuzzy system chosen for adaptation, an objective error function, E, must be chosen. Commonly, the squared error is chosen:
$$E = \frac{1}{2}\,(y - y^t)^2$$
where $y^t$ is the target output and y is the fuzzy system output.

Consider the ith rule of a zero-order Sugeno fuzzy system consisting of n rules (i = 1, …, n). The figure below presents a zero-order Sugeno system with m inputs and n rules.

[Figure: Zero-Order Sugeno System. The inputs x1, x2, ..., xm pass through the antecedent membership functions A11, A12, ..., A1m through An1, An2, ..., Anm; product nodes (Π) give w1, ..., wn, normalization nodes (N) give the normalized values, which are multiplied by f1=r1, ..., fn=rn to give y1, ..., yn, and a summation node (Σ) gives y.]

where:
x is the input vector,
µA(x) is the membership of x to set A,
A is an antecedent membership function (LHS),
wi is the degree of fulfillment of the ith rule,
$\bar{w}_i$ is the normalized degree of fulfillment of the ith rule,
ri is a constant singleton membership function of the ith rule (RHS),
yi is the output of the ith rule.

Mathematically, a zero-order Sugeno system is represented by the following equations:
$$w_i = \prod_{j=1}^{m} \mu_{A_{ji}}(x_j)$$
where µA(xj) is the membership of xj to the fuzzy set A, and wi is the degree of fulfillment of the ith rule. The normalized degree of fulfillment is given by:
$$\bar{w}_i = \frac{w_i}{\sum_{i=1}^{n} w_i}$$
The output (y) of a fuzzy system with n rules can be calculated as:
$$y = \sum_{i=1}^{n} \bar{w}_i f_i = \sum_{i=1}^{n} y_i$$
In this case, the system is a zero-order Sugeno system and fi is defined as:
$$f_i = r_i$$
An example of a 1st order Sugeno system is given in section 13.6.
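The three equations above can be traced numerically with a short script. The following is a minimal sketch only: the membership grades in mu and the singleton values in r are made-up illustration data, and the variable names are not taken from the supplement's m-files.

```matlab
% Sketch: forward pass of a zero-order Sugeno system (m = 2 inputs, n = 3 rules).
% mu(i,j) is the membership of input x_j to the antecedent set of rule i.
% All numbers below are hypothetical illustration values.
mu = [0.2 0.9;
      0.7 0.5;
      0.0 0.4];
r  = [1; 5; 9];            % consequent singletons r_i

w    = prod(mu, 2);        % degree of fulfillment w_i (product AND)
wbar = w / sum(w);         % normalized degree of fulfillment
y_i  = wbar .* r;          % rule outputs y_i = wbar_i * r_i
y    = sum(y_i)            % system output
```

Note that the normalization divides by sum(w), so an input that fires no rule at all would need separate handling.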

This notation is slightly different than that in the text, but it is consistent with Jang (1996). When target outputs (yrp) are given, the network can be adapted to reduce an error measure. The adaptable parameters are the input membership functions and the output singleton membership functions, ri. Now we will look at an example of using gradient descent to optimize the ri values of this zero-order Sugeno system.

13.5.1 Zero Order Sugeno Fan Speed Control

Consider a zero-order Sugeno fuzzy system used to control the speed of a fan. The objective is to maintain a comfortable temperature in the room based on two input variables: temperature and activity. Three linguistic variables: "cool", "moderate", and "hot" will be used to describe temperature, and two linguistic variables: "low" and "high" will be used to describe the level of activity in the room. The fan speed will be a crisp output value based on the following set of fuzzy rules:

if Temp is Cool AND Activity is Low then Speed is very_low (w1)
if Temp is Cool AND Activity is High then Speed is low (w2)
if Temp is Moderate AND Activity is Low then Speed is low_medium (w3)
if Temp is Moderate AND Activity is High then Speed is medium (w4)
if Temp is Hot AND Activity is Low then Speed is medium_high (w5)
if Temp is Hot AND Activity is High then Speed is high (w6)

The fuzzy antecedent and consequent membership functions are defined over the universe of discourse with the following MATLAB code.

% Universe of Discourse
x = [0:5:100]; % Temperature
y = [0:1:10]; % Activity
z = [0:1:10]; % Fan Speed
% Temperature
cool_mf = mf_trap(x,[0 0 30 50],'n');
moderate_mf = mf_tri(x,[30 55 80],'n');
hot_mf = mf_trap(x,[60 80 100 100],'n');
antecedent_t = [cool_mf;moderate_mf;hot_mf];
plot(x,antecedent_t)
axis([-inf inf 0 1.2]);
title('Antecedent MFs for Temperature')
text(10, 1.1, 'Cool');
text(50, 1.1, 'Moderate');
text(80, 1.1, 'Hot');
xlabel('Temperature')
ylabel('Membership')

[Figure: Antecedent MFs for Temperature (Cool, Moderate, Hot). Axes: Temperature (0 to 100), Membership.]

low_act = mf_trap(y,[0 0 2 8],'n');
high_act = mf_trap(y,[2 8 10 10],'n');
antecedent_a = [low_act;high_act];
plot(y, antecedent_a);
axis([-inf inf 0 1.2]);
title('Antecedent MFs for Activity')
text(1, 1.1, 'Low');
text(8, 1.1, 'High');
xlabel('Activity Level');
ylabel('Membership')

[Figure: Antecedent MFs for Activity (Low, High). Axes: Activity Level, Membership.]

The consequent values of a Sugeno system are crisp singletons. The singletons for the fan speed are defined as:

% Fan Speed Consequent Values.
very_low_mf = 1;
low_mf = 2;
low_medium_mf = 4;
medium_mf = 6;
medium_high_mf = 8;
high_mf = 10;
consequent_mf = [very_low_mf;low_mf;low_medium_mf;medium_mf;medium_high_mf;high_mf];
stem(consequent_mf,ones(size(consequent_mf)))
axis([0 11 0 1.2]);
title('Consequent Values for Fan Speed');
xlabel('Fan Speed')
ylabel('Membership')
text(.5, 1.1, 'Very_Low');
text(2, 1.1, 'Low');
text(3, .6, 'Low_Medium');
text(6, 1.1, 'Medium');
text(7, .6, 'Medium_High');
text(10, 1.1, 'High');

[Figure: Consequent Values for Fan Speed (Very_Low, Low, Low_Medium, Medium, Medium_High, High). Axes: Fan Speed (0 to 10), Membership.]

Now that we have defined the membership functions and consequent values, we will evaluate the fuzzy system for an input pattern with temperature = 72.5 and activity = 6.1. First we fuzzify the inputs by finding their membership to each antecedent membership function.

% Temperature
temp = 72.5;
mut1=mf_trap(temp, [0 0 30 50],'n');

mut2=mf_tri(temp, [30 55 80],'n');
mut3=mf_trap(temp, [60 80 100 100],'n');
MU_t = [mut1;mut2;mut3]

MU_t =
0
0.3000
0.6250

% Activity
act = 6.1;
mua1 = mf_trap(act,[0 0 2 8],'n');
mua2 = mf_trap(act,[2 8 10 10],'n');
MU_a = [mua1;mua2]

MU_a =
0.3167
0.6833

Next, we apply the Sugeno fuzzy AND implication operation to find the degree of fulfillment of each rule.

antecedent_DOF = [MU_t(1)*MU_a(1)
MU_t(1)*MU_a(2)
MU_t(2)*MU_a(1)
MU_t(2)*MU_a(2)
MU_t(3)*MU_a(1)
MU_t(3)*MU_a(2)]

antecedent_DOF =
0
0
0.0950
0.2050
0.1979
0.4271

A plot of the firing strengths of each of the rules is:

stem(consequent_mf,antecedent_DOF)
axis([0 11 0 .5]);
title('Consequent Firing Strengths')
xlabel('Fan Speed')
ylabel('Firing Strength')
text(0, antecedent_DOF(1)+.02, ' Very_Low');
text(1.5, antecedent_DOF(2)+.04, ' Low');
text(3, antecedent_DOF(3)+.02, ' Low_Medium');
text(5, antecedent_DOF(4)+.02, ' Medium');
text(7, antecedent_DOF(5)+.02, ' Medium_High');
text(9.5, antecedent_DOF(6)+.02, ' High');

[Figure: Consequent Firing Strengths (Very_Low, Low, Low_Medium, Medium, Medium_High, High). Axes: Fan Speed (0 to 10), Firing Strength (0 to 0.5).]

The output is the weighted average of the rule fulfillments.

output_y=sum(antecedent_DOF.*consequent_mf)/sum(antecedent_DOF)

output_y =
8.0694

The fan speed would be set to a value of 8.07 for a temperature of 72.5 degrees and an activity level of 6.1.

13.5.2 Consequent Membership Function Training

If you desire a specific input-output model and example data are available, the membership functions may be trained to produce the desired model. This is accomplished by using a gradient-descent algorithm to minimize an objective error function such as the one defined earlier. The learning rule for the output crisp membership functions (ri) is defined by
$$r_i(t'+1) = r_i(t') - lr \cdot \frac{\partial E}{\partial r_i}$$
where t' is the learning epoch. The update equation can be rewritten as [Nomura et al. 1994]
$$r_i(t'+1) = r_i(t') - lr \cdot \frac{\mu_i^p}{\sum_{i=1}^{n} \mu_i^p}\,(y^p - y^{rp}).$$
The m-file sugfuz.m demonstrates this use of gradient descent to optimize the output membership functions (ri).
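As an illustration of the update equation, the sketch below applies one gradient step to the six fan-speed singletons, using the firing strengths computed in the example. It is not the sugfuz.m code: the target output y_t and the learning rate lr are hypothetical values chosen only to show the mechanics.

```matlab
% Sketch: one gradient-descent step on the consequent singletons r_i.
% w holds the rule firing strengths from the fan-speed example (rounded);
% y_t and lr are hypothetical values, not taken from the supplement.
w   = [0; 0; 0.0950; 0.2050; 0.1979; 0.4271];
r   = [1; 2; 4; 6; 8; 10];
y_t = 7;                    % assumed target fan speed for this pattern
lr  = 0.5;                  % assumed learning rate

wbar  = w / sum(w);         % normalized firing strengths
y     = sum(wbar .* r);     % current output (about 8.07)
e     = y - y_t;            % output error
r_new = r - lr * wbar * e;  % update: each r_i moves in proportion to wbar_i
y_new = sum(wbar .* r_new); % output after the update
```

Rules that did not fire (the first two here) keep their singleton values, since their normalized firing strength is zero.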

13.5.3 Antecedent Membership Function Training

The gradient descent algorithm can also be used to optimize the antecedent membership functions. The examples in this supplement use the triangular and trapezoidal membership functions which have non-zero derivatives where the outputs are non-zero. Other adaptable fuzzy systems may use gaussian or generalized bell antecedent membership functions. These functions have non-zero derivatives throughout the universe of discourse and may be easier to implement. Because their derivatives are continuous and smooth, their training performance may also be better.

Consider the parameters of a symmetric triangular membership function (aij = peak, bij = support):
$$\mu_{ij}(x_j) = \begin{cases} 1 - \dfrac{|x_j - a_{ij}|}{b_{ij}/2}, & \text{if } |x_j - a_{ij}| \le \dfrac{b_{ij}}{2} \\ 0, & \text{otherwise} \end{cases}$$
A symmetric or non-symmetric triangular membership function can be created using the mf_tri() function; this function will be better described in Section 13.5.4.

x= [0:1:10]; % universe of discourse
a=5;b=4; % [peak support]
mu_i= mf_tri(x,[a-b/2 a a+b/2],'n'); % calculate memberships
plot(x,mu_i); % plot memberships
title('Symmetric Triangular Membership Function');
xlabel('Input')
ylabel('Membership');
text(2,.1,'(a-0.5b)');
text(4,1.1,'(a = Peak)');
text(6.5,.1,'(a+0.5b)');
axis([-inf inf 0 1.2]);

The membership function can be plotted using the plot function or by using the plot flag. In this example, the plot flag is set to no.

[Figure: Symmetric Triangular Membership Function, labeled (a-0.5b), (a = Peak), (a+0.5b). Axes: Input, Membership.]

The peak parameter update rule is:

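$$a_{ij}(t'+1) = a_{ij}(t') - \frac{\eta_a}{p} \cdot \frac{\partial E}{\partial a_{ij}} = a_{ij}(t') - \frac{\eta_a}{2p} \cdot \sum_{p'=1}^{p} \frac{\partial E^{p'}}{\partial a_{ij}}$$
where ηa is the learning rate for aij and p is the number of input patterns. Similar update rules exist for the other MF parameters. The chain rule can be used to calculate the derivatives used to update the MF parameters:
$$\frac{\partial E}{\partial a_{ij}} = \frac{\partial E}{\partial y} \cdot \frac{\partial y}{\partial y_i} \cdot \frac{\partial y_i}{\partial w_i} \cdot \frac{\partial w_i}{\partial \mu_{ij}} \cdot \frac{\partial \mu_{ij}}{\partial a_{ij}}$$
for the peak parameter. Using the symmetric triangular MF equations and the product interpretation for AND, the partial derivatives are derived below.

$$E = \frac{1}{2}(y - y^t)^2 \quad\text{so}\quad \frac{\partial E}{\partial y} = y - y^t = e$$
$$y = \sum_{i=1}^{n} y_i \quad\text{so}\quad \frac{\partial y}{\partial y_i} = 1$$
$$y_i = \frac{w_i}{\sum_{i=1}^{n} w_i}\, r_i \quad\text{so}\quad \frac{\partial y_i}{\partial w_i} = \frac{(r_i - y)}{\sum_{i=1}^{n} w_i}$$
$$w_i = \prod_{j=1}^{m} \mu_{A_{ji}}(x_j) \quad\text{so}\quad \frac{\partial w_i}{\partial \mu_{ij}(x_j)} = \frac{w_i}{\mu_{ij}(x_j)}$$
$$\mu_{ij}(x_j) = 1 - \frac{|x_j - a_{ij}|}{b_{ij}/2} \quad\text{so}\quad \frac{\partial \mu_{ij}}{\partial a_{ij}} = \frac{2\,\mathrm{sign}(x_j - a_{ij})}{b_{ij}} \ \text{ if } |x_j - a_{ij}| \le \frac{b_{ij}}{2}, \qquad \frac{\partial \mu_{ij}}{\partial a_{ij}} = 0 \ \text{ if } |x_j - a_{ij}| > \frac{b_{ij}}{2}$$
and similarly:
$$\frac{\partial \mu_{ij}}{\partial b_{ij}} = \frac{1 - \mu_{ij}(x_j)}{b_{ij}}.$$
Substituting these into the update equation we get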

$$\frac{\partial E}{\partial a_{ij}} = (y - y^t) \cdot \frac{(r_i - y)}{\sum_{i=1}^{n} w_i} \cdot \frac{w_i}{\mu_{ij}(x_j)} \cdot \frac{2\,\mathrm{sign}(x_j - a_{ij})}{b_{ij}}$$
$$\frac{\partial E}{\partial b_{ij}} = (y - y^t) \cdot \frac{(r_i - y)}{\sum_{i=1}^{n} w_i} \cdot \frac{w_i}{\mu_{ij}(x_j)} \cdot \frac{1 - \mu_{ij}(x_j)}{b_{ij}}$$
For the RHS membership functions we have:
$$\frac{\partial E}{\partial r_i} = \frac{\partial E}{\partial y} \cdot \frac{\partial y}{\partial y_i} \cdot \frac{\partial y_i}{\partial r_i}$$
$$y_i = \bar{w}_i\, r_i \quad\text{so}\quad \frac{\partial y_i}{\partial r_i} = \bar{w}_i$$
resulting in the following gradient:
$$\frac{\partial E}{\partial r_i} = (y - y^t) \cdot 1 \cdot \bar{w}_i.$$

13.5.4 Membership Function Derivative Functions

Four functions were created to calculate the MF values and derivatives of the MF with respect to their parameters for non-symmetric triangular and trapezoidal MFs. They are mf_tri.m, mf_trap.m, dmf_tri.m and dmf_trap.m. The MATLAB code used in the earlier Fuzzy Logic Chapters required the input to be a defined point on the universe of discourse. For example, if the universe of discourse is defined as x=[0:1:10], earlier functions could only be evaluated at those points and an error would occur for an input such as input=2.3. The functions described in this section can handle all ranges of inputs. These functions do not have limitations on the input and are therefore more general in nature.

Let's look at the triangular membership function outputs and derivatives first. The function mf_tri can construct symmetrical or non-symmetric triangular membership functions.

x=[-10:.2:10];
[mf1]=mf_tri(x,[-4 0 3],'y');

[Figure: Triangular Membership Function, labeled Left Foot (a), Peak (b), Right Foot (c). Axes: Input (-10 to 10), Membership.]

Derivative of non-symmetric triangular MF with respect to its parameters:

[mf1]=dmf_tri(x,[-4 0 3],'y');

[Figure: Triangular Membership Function Derivatives. Axes: Input, dMF wrt parameters.]

Trapezoidal membership functions:

[mf1]=mf_trap(x,[-5 -3 3 5],'y');

[Figure: Trapezoidal Membership Function, labeled (a), (b), (c), (d). Axes: Input (-10 to 10), Membership.]

Derivative of trapezoidal MF with respect to its parameters:

[mf1]=dmf_trap(x,[-5 -3 3 5],'y');

[Figure: Trapezoidal Membership Function Derivatives. Axes: Input, dmf wrt parameter.]

The m-file adaptri.m demonstrates the use of gradient descent to optimize parameters of 3 non-symmetric triangular MFs and consequent singletons to approximate the function
$$f(x) = 0.05x^3 - 0.02x^2 - 0.3x + 20.$$
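For readers without the supplement's m-files, the idea behind a membership function that accepts any input, together with its parameter derivatives, can be sketched with anonymous functions. The handles tri and dtri below are hypothetical stand-ins written for this illustration; they are not the supplement's mf_tri.m and dmf_tri.m, whose internals are not shown here. The parameter vector p = [a b c] is assumed to hold the left foot, peak, and right foot.

```matlab
% Sketch: non-symmetric triangular MF and its derivatives with respect to
% p = [a b c] (left foot, peak, right foot). tri and dtri are hypothetical
% helpers, not the supplement's mf_tri/dmf_tri.
tri = @(x,p) max(0, min((x - p(1))/(p(2) - p(1)), (p(3) - x)/(p(3) - p(2))));

% Rows of dtri: d(mu)/da, d(mu)/db, d(mu)/dc, evaluated at every x.
dtri = @(x,p) [ ((x > p(1)) & (x < p(2))) .* (x - p(2)) / (p(2) - p(1))^2; ...
                ((x > p(1)) & (x < p(2))) .* (p(1) - x) / (p(2) - p(1))^2 + ((x > p(2)) & (x < p(3))) .* (p(3) - x) / (p(3) - p(2))^2; ...
                ((x > p(2)) & (x < p(3))) .* (x - p(2)) / (p(3) - p(2))^2 ];

p  = [-4 0 3];
x  = -10:0.2:10;
mu = tri(x, p);             % memberships over the universe of discourse
D  = dtri(x, p);            % 3-by-numel(x) matrix of derivatives

% Central-difference check at an input that is not on the grid.
x0 = 2.3;  h = 1e-6;
g  = dtri(x0, p);
fd = zeros(3,1);
for k = 1:3
    dp = zeros(1,3);  dp(k) = h;
    fd(k) = (tri(x0, p + dp) - tri(x0, p - dp)) / (2*h);
end
max_diff = max(abs(g - fd))
```

The derivatives are set to zero at the kinks (the feet and the peak), where the function is not differentiable; that is a choice made for this sketch.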

The m-file adpttrap.m demonstrates the use of gradient descent to optimize parameters of 3 trapezoid MFs and consequent singletons to approximate the same function.

13.5.5 Membership Function Training Example

As a simple example of using gradient descent to adjust the parameters of the antecedent and singleton consequent membership functions, consider the following problem. Assume that a large naval gun is properly sighted when it is pointing at 5 degrees. The universe of discourse will be defined from 0 to 10 degrees, and the membership to "Sighted Properly" will be defined as:

clear all;
x= [0:.5:10]; % input
num_pts=size(x,2); % # of points
a=5; % peak
b=4; % support
mu_i=mf_tri(x,[a-b/2 a a+b/2],'y');
title('MF representing "Sighted Properly"');
xlabel('Direction in Degrees')
ylabel('Membership')

[Figure: MF representing "Sighted Properly", labeled Left Foot (a), Peak (b), Right Foot (c). Axes: Direction in Degrees (0 to 10), Membership.]

To decide if an artillery shell will hit the target (on a scale of 1-10), consider the following zero-order Sugeno fuzzy system with one rule and r = 10.

if Gun is Sighted Properly then "Chance of Hit" is 10 (r=10).

Suppose we have experimentally calculated the input-output surface to be:

r=10; % r value of zero-order Sugeno rule
y_t=mu_i*r; % Chance of Hit

plot(x,y_t);
title('Input-Output Surface for Gun');
axis([-inf inf 0 10.5]);
xlabel('Direction (Input)')
ylabel('Chance of Hit (Output)')

[Figure: Input-Output Surface for Gun. Axes: Direction (Input), Chance of Hit (Output).]

We will now demonstrate how the antecedent MF parameters of a symmetric triangular MF and the consequent r value can be optimized using gradient descent. We will use the input-output data above as training data and initialize the MFs to a=3, b=6, and r=8. The general procedure will involve these steps:
1. A forward pass of all inputs to calculate the output.
2. The calculation of the error and SSE for all inputs.
3. The calculation of the gradient vectors.
4. The updating of the parameters based on the update rule, and
5. Repeating steps 1 through 4 until the training error goal is reached or the maximum number of training epochs is exceeded.

Step 1: Forward pass of all inputs to calculate the output

% Initial MF parameters
r=8; % initial r value
a=3; % initial peak value
b=6; % initial support value
mu_i=mf_tri(x,[a-b/2 a a+b/2],'n'); % initial MFs
y=mu_i*r; % initial output
plot(x,y,x,y_t);
title('Input-Output Surface');
axis([-inf inf 0 10.5]);
text(2.5, 6, 'Initial')
text(4.5, 5, 'Target')

xlabel('Direction (Input)')
ylabel('Chance of Hit (Output)')

[Figure: Input-Output Surface, curves labeled Initial and Target. Axes: Direction (Input), Chance of Hit (Output).]

Step 2: Calculate the error and SSE for all inputs

e=y-y_t;
SSE=sum(sum(e.^2))

SSE =
314.5556

Step 3: Calculate the gradient vectors (note: one input, one rule)

ind=find(abs(x-a)<=(b/2)); % Locate indices under MF.
% Deltas for ind points.
delta_a=r*e(ind).*((2*sign(x(ind)-a))/b);
delta_b=r*e(ind).*((1-mu_i(ind))/b);
delta_r=e(ind).*mu_i(ind);

Step 4: Update the parameters with the update rule

lr_a=.1;
lr_b=5;
lr_r=10;
del_a=-((lr_a/(2*num_pts))*sum(delta_a));
del_b=-((lr_b/(2*num_pts))*sum(delta_b));
del_r=-((lr_r/(2*num_pts))*sum(delta_r));
a=a+del_a
b=b+del_b
r=r+del_r

a =
3.7955
b =
4.3239

text(2. end Now lets plot the results: plot(x. y_new=mu_i*r.7.x.y_new. del_b=-((lr_b/(2*num_pts))*sum(delta_b)). xlabel('Input'). y_new=mu_i*r. Lets now iteratively train the fuzzy system.*((2*sign(x(ind)-a))/b). del_r=-((lr_r/(2*num_pts))*sum(delta_r)).ylabel('Output').'n').6. Step 5: Repeat steps 1 through 5 maxcycles=30. % SSE if SSE(i) < SSE_goal.5. del_a=-((lr_a/(2*num_pts))*sum(delta_a)). b=b+del_b. axis([0 10 0 10. break. title('Input-Output Surface').'n').[a-b/2 a a+b/2].^2)). 'Target'). for i=2:maxcycles mu_i=mf_tri(x.end ind=find(abs(x-a)<=(b/2)).5556 189.y_t. text(4. 'Final').5.*((1-mu_i(ind))/b). delta_a=r*e(ind). delta_r=e(ind).x. delta_b=r*e(ind).5. SSE_goal=. % Output Error SSE(i)=sum(sum(e.5. a=a+del_a.1447 The SSE was reduced from 314 to 190 in one pass. % Ready To Eat e=y_t-y_new. % error SSE(2)=sum(sum(e. text(4. % Output e=y_new-y_t.8409 Let's see if the SSE is any better: mu_i=mf_tri(x. 'Initial').^2)) % Sum Squared error SSE = 314.[a-b/2 a a+b/2].*mu_i(ind).y).5.5.5. r=r+del_r. 192 .5]).

[Figure: Input-Output Surface, curves labeled Initial, Target, and Final. Axes: Input (0 to 10), Output (0 to 10).]

Plot SSE training performance:

semilogy(SSE);grid;
title('Training Record SSE')
xlabel('Epochs');ylabel('SSE');

[Figure: Training Record SSE. Axes: Epochs, SSE (log scale).]

The m-file adaptfuz.m demonstrates the use of gradient descent to optimize the MF parameters and shows how the membership functions and input-output relationship change during the training process.
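The five training steps can also be collected into one self-contained script. The following is a sketch under stated assumptions, not the adaptfuz.m code: the anonymous function tri is a hypothetical stand-in for mf_tri, the gradients are averaged over the number of points, and the single learning rate lr and the epoch count are illustrative choices that differ from the per-parameter rates used in the walkthrough.

```matlab
% Sketch: batch gradient descent on the peak a, support b and singleton r
% of a one-rule zero-order Sugeno system. tri is a hypothetical stand-in
% for mf_tri; lr and maxcycles are assumed values.
tri = @(x,a,b) max(0, 1 - abs(x - a) ./ (b/2));

x   = 0:0.5:10;                 % universe of discourse
N   = numel(x);
y_t = 10 * tri(x, 5, 4);        % target surface (a = 5, b = 4, r = 10)

a = 3;  b = 6;  r = 8;          % initial parameters from the example
lr = 0.05;                      % assumed common learning rate
maxcycles = 200;                % assumed number of epochs
SSE = zeros(1, maxcycles);

for k = 1:maxcycles
    mu = tri(x, a, b);                          % step 1: forward pass
    e  = r*mu - y_t;                            % step 2: error
    SSE(k) = sum(e.^2);
    in  = abs(x - a) <= b/2;                    % points under the MF
    g_a = sum(e(in) .* r .* (2*sign(x(in) - a) / b)) / N;   % step 3
    g_b = sum(e(in) .* r .* ((1 - mu(in)) / b)) / N;
    g_r = sum(e .* mu) / N;
    a = a - lr*g_a;                             % step 4: update
    b = b - lr*g_b;
    r = r - lr*g_r;
end
```

The first entry of SSE reproduces the initial value of 314.5556 reported above, since it is computed before any parameter is changed. A call such as semilogy(SSE) can then be used to inspect the training record.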

13.6 Adaptive Network-Based Fuzzy Inference Systems

Jang and Sun [Jang, 1992; Jang and Gulley, 1995] introduced the adaptive network-based fuzzy inference system (ANFIS). This system makes use of a hybrid learning rule to optimize the fuzzy system parameters of a first order Sugeno system. A first order Sugeno system can be graphically represented by:

[Figure: ANFIS architecture for a two-input, two-rule first-order Sugeno model with [Jang 1995a]. Layer 1: membership functions A1, A2, B1, B2; Layer 2: product nodes (Π) giving w1, w2; Layer 3: normalization nodes (N); Layer 4: rule outputs y1, y2; Layer 5: summation node (Σ) giving y.]

where the consequence parameters (p, q, and r) of the nth rule contribute through a first order polynomial of the form:
$$f_n = p_n x_1 + q_n x_2 + r_n$$

13.6.1 ANFIS Hybrid Training Rule

The ANFIS architecture consists of two trainable parameter sets:
1). The antecedent membership function parameters [a,b,c,d].
2). The polynomial parameters [p,q,r], also called the consequent parameters.

The ANFIS training paradigm uses a gradient descent algorithm to optimize the antecedent parameters and a least squares algorithm to solve for the consequent parameters. Because it uses two very different algorithms to reduce the error, the training rule is called a hybrid. The consequent parameters are updated first using a least squares algorithm and the antecedent parameters are then updated by backpropagating the errors that still exist.

The ANFIS architecture consists of five layers with the output of the nodes in each respective layer represented by $O_i^l$, where i is the ith node of layer l. The following is a layer by layer description of a two input two rule first-order Sugeno system.

Layer 1. Generate the membership grades:

$$O_i^1 = \mu_{A_i}(x)$$
Layer 2. Generate the firing strengths.
$$O_i^2 = w_i = \prod_{j=1}^{m} \mu_{A_i}(x)$$
Layer 3. Normalize the firing strengths.
$$O_i^3 = \bar{w}_i = \frac{w_i}{w_1 + w_2}$$
Layer 4. Calculate rule outputs based on the consequent parameters.
$$O_i^4 = y_i = \bar{w}_i f_i = \bar{w}_i (p_i x_1 + q_i x_2 + r_i)$$
Layer 5. Sum all the inputs from layer 4.
$$O_1^5 = \sum_i y_i = \sum_i \bar{w}_i f_i = (\bar{w}_1 x_1) p_1 + (\bar{w}_1 x_2) q_1 + \bar{w}_1 r_1 + (\bar{w}_2 x_1) p_2 + (\bar{w}_2 x_2) q_2 + \bar{w}_2 r_2$$
It is in this last layer that the consequent parameters can be solved for using a least squares algorithm. Let us rearrange this last equation into a more usable form:
$$y = \begin{bmatrix} \bar{w}_1 x_1 & \bar{w}_1 x_2 & \bar{w}_1 & \bar{w}_2 x_1 & \bar{w}_2 x_2 & \bar{w}_2 \end{bmatrix} \begin{bmatrix} p_1 \\ q_1 \\ r_1 \\ p_2 \\ q_2 \\ r_2 \end{bmatrix} = XW$$
When input-output training patterns exist, the weight vector (W), which consists of the consequent parameters, can be solved for using a regression technique.

13.6.2 Least Squares Regression Techniques

Since a least squares technique can be used to solve for the consequent parameters, we will investigate three different techniques used to solve this type of problem. When the output layer of a network or system performs a linear combination of the previous layer's output, the weights can be solved for using a least squares method rather than iteratively trained with a backpropagation algorithm.

The rule outputs are represented by the p by 3n dimensional matrix X, with p rows equal to the number of input patterns and n columns equal to the number of rules. The desired or target outputs are represented by the p by m dimensional matrix Y with p rows and m columns equal to the number of outputs. Setting the problem up in this format allows us to use a least squares solution for the weight vector W of n by m dimension.

Y=X*W

If the X matrix were invertible, it would be easy to solve for W:
$$W = X^{-1} Y$$
but this is not usually the case. Therefore, a pseudoinverse can be used to solve for W:
$$W = (X^T X)^{-1} X^T Y$$
(where $X^T$ is the transpose of X). This will minimize the error between the predicted Y and the target Y. However, this method involves inverting (X'*X), which can cause numerical problems when the columns of x are dependent. Consider the following regression example.

% Pseudoinverse Solution
x=[2 1
4 2
6 3
8 4];
y=[3 6 9 12]';
w=inv(x'*x)*x'*y;
sse=sum(sum((x*w-y).^2))

Warning: Matrix is singular to working precision.
sse =
Inf

Although dependent rows can be removed, data is rarely dependent. Consider data that has a small amount of noise that may result in a marginally independent case. We will use the marginally independent data for training and the noise free data to check for generalization.

X=[2.000000001 1
3.999999998 2
5.999999998 3
8.000000001 4];
w=inv(X'*X)*X'*y
SSE=sum(sum((X*w-y).^2))
sse=sum(sum((x*w-y).^2))

Warning: Matrix is close to singular or badly scaled. Results may be inaccurate. RCOND = 7.894919e-018

w =
-0.3742
0.7496
SSE =
269.7693
sse =
269.7693

The warning tells us that numerical instabilities may result from the nearly singular case and the SSE shows that instabilities did occur. A case with independent patterns results in no errors or warnings.

X1=[2.001 1
4.00 2
6.00 3
8.00 4.001];
% Pseudoinverse Solution
w=inv(X1'*X1)*X1'*y
SSE=sum(sum((X1*w-y).^2))
sse=sum(sum((x*w-y).^2))

w =
0.9504
1.0990
SSE =
1.1583e-006
sse =
9.5293e-007

Better regression methods use the LU (Lower Triangular, Upper Triangular) or more robust QR (Orthogonal, Right Triangular) decompositions rather than a simple inversion of the matrix, and the best method uses the singular value decomposition (SVD) [Masters 1993]. The MATLAB command / uses a QR decomposition with pivoting. It provides a least squares solution to an under or over-determined system.

% QR Decomposition
w=(y'/X')'
SSE=sum(sum((X*w-y).^2))
sse=sum(sum((x*w-y).^2))

w =
0.0000
3.0000
SSE =
7.2970e-030
sse =
7.2970e-030
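The two decomposition-based approaches can be compared side by side on the nearly dependent data. This is a sketch only: it uses the backslash operator in place of the transposed slash form above (an equivalent way of asking MATLAB for the least squares solution), and pinv with an explicit tolerance, which is assumed here to be 1e-5. The variable names w_qr and w_sv are introduced for this illustration.

```matlab
% Sketch: least squares on nearly dependent columns, solved two ways.
X = [2.000000001 1; 3.999999998 2; 5.999999998 3; 8.000000001 4];  % noisy data
x = [2 1; 4 2; 6 3; 8 4];                                          % noise-free data
y = [3 6 9 12]';

w_qr = X \ y;                 % least squares via the backslash operator
w_sv = pinv(X, 1e-5) * y;     % SVD pseudoinverse, small singular values dropped

sse_qr = sum((X*w_qr - y).^2);    % training error, backslash solution
sse_sv = sum((X*w_sv - y).^2);    % training error, truncated SVD solution
gen_sv = sum((x*w_sv - y).^2);    % error on the noise-free data
```

Both solutions reproduce the targets, but the truncated SVD solution is the minimum-norm one (close to [1.2; 0.6] for this data), which is why it is less sensitive to the noise in the first column.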

The SVD method of solving for the output weights also has the advantage of giving the user control to remove unimportant information that may be related to noise. By removing this unimportant information, one lessens the chance of overfitting the function to be approximated. The SVD method decomposes the x matrix into a diagonal matrix S of the same dimension as X that contains the singular values, a unitary matrix U of principal components, and an orthonormal matrix of right singular vectors V.
$$X = U S V^T$$
The singular values in S are positive and arranged in decreasing order. Their magnitude is related to the information content of the columns of U (principal components) that span X. Therefore, to remove the noise effects on the solution of the weight matrix, we simply remove the columns of U that correspond to small diagonal values in S. The weight matrix is then solved for using:
$$W = V S^{-1} U^T Y$$
The SVD methodology uses the most relevant information to compute the weight matrix and discards unimportant information that may be due to noise. Application of this methodology has sped up the training of neural networks 40 fold [Uhrig et. al. 1996] and resulted in networks with superior generalization capabilities.

% Singular Value Decomposition (SVD)
[u,s,v]=svd(X,0);
inv_s=inv(s)

inv_s =
1.0e+008 *
0.0000 0
0 7.3855

We can see that the second singular value, whose inverse is very large, has very little information in it. Therefore, we discard its corresponding column in U. This discards the information related to the noise and keeps the solution from attempting to fit the noise.

for i=1:2
if s(i,i)<.1
inv_s(i,i)=0;
end
end
w=v*inv_s*u'*y
SSE=sum(sum((X*w-y).^2))
sse=sum(sum((x*w-y).^2))

w =
1.2000
0.6000
sse =
1.2000e-018

SSE =
1.3200e-017

The SVD method did not reduce the error as much as did the QR decomposition method, but this is because the QR method tried to fit the noise. Remember that this is overfitting and is not desired. MATLAB has a pinv() function that automatically calculates a pseudo-inverse using the SVD.

w=pinv(X,1e-5)*y
SSE=sum(sum((X*w-y).^2))
sse=sum(sum((x*w-y).^2))

w =
1.2000
0.6000
SSE =
1.3200e-017
sse =
1.2000e-018

We see that the results of the two SVD methods are close to identical. The pinv function will be used in the following example.

13.6.3 ANFIS Hybrid Training Example

To better understand the ANFIS architecture and training paradigm consider the following example of a first order Sugeno system with three rules:

if x is A1 then f1 = p1x + r1
if x is A2 then f2 = p2x + r2
if x is A3 then f3 = p3x + r3

This ANFIS has three trapezoidal membership functions (A1, A2, A3) and can be represented by the following diagram:

[Figure: ANFIS architecture for a one-input first-order Sugeno fuzzy model with three rules. The input x passes through A1, A2, A3 to give w1, w2, w3; normalization nodes (N) and product nodes (Π) with f1, f2, f3 give y1, y2, y3, which are summed (Σ) to give y.]

First we will create training data for the function to be approximated:

$$f(x) = 0.05x^3 - 0.02x^2 - 0.3x + 20$$

x=[-10:1:10];
clg;
y_rp=.05*x'.^3-.02*x'.^2-.3*x'+20;
num_pts=size(x,2); % number of input patterns
plot(x,y_rp);
xlabel('Input');
ylabel('Output');
title('Function to be Approximated');

[Figure: Function to be Approximated. Axes: Input (-10 to 10), Output (-30 to 70).]

Now we will step through the ANFIS training paradigm.

Layer 1. Generate the membership grades:
$$O_i^1 = \mu_{A_i}(x)$$

% LAYER 1 MF values
% Initialize Antecedent Parameters
a1=-17; b1=-13; c1=-7; d1=-3;
a2=-9; b2=-4; c2=3; d2=8;
a3=3; b3=7; c3=13; d3=8;
[mf1]=mf_trap(x,[a1 b1 c1 d1],'n');
[mf2]=mf_trap(x,[a2 b2 c2 d2],'n');
[mf3]=mf_trap(x,[a3 b3 c3 d3],'n');
% Plot MFs
plot(x,mf1,x,mf2,x,mf3);
axis([-inf inf 0 1.2]);
title('Initial Input MFs')

xlabel('Input');
ylabel('Membership');
text(-9, .7, 'mf1');
text(-1, .7, 'mf2');
text(8, .7, 'mf3');

[Figure: Initial Input MFs (mf1, mf2, mf3). Axes: Input (-10 to 10), Membership.]

Layer 2. Generate the firing strengths.
$$O_i^2 = w_i = \mu_{A_i}(x)$$

% LAYER 2 calculates the firing strength of the
% 1st order Sugeno rules. Since each rule only has one antecedent
% membership function, no product operation is necessary.
w1=mf1; % rule 1
w2=mf2; % rule 2
w3=mf3; % rule 3

Layer 3. Normalizes the firing strengths.
$$O_i^3 = \bar{w}_i = \frac{w_i}{w_1 + w_2 + w_3}$$

% LAYER 3
% Determines the normalized firing strengths for the rules (nw) and
% sets to zero if all rules are zero (prevent divide by 0 errors)
for j=1:num_pts
if (w1(:,j)==0 & w2(:,j)==0 & w3(:,j)==0)
nw1(:,j)=0;
nw2(:,j)=0;

nw3(:,j)=0;
else
nw1(:,j)=w1(:,j)/(w1(:,j)+w2(:,j)+w3(:,j));
nw2(:,j)=w2(:,j)/(w1(:,j)+w2(:,j)+w3(:,j));
nw3(:,j)=w3(:,j)/(w1(:,j)+w2(:,j)+w3(:,j));
end
end

Calculate the consequent parameters.

X_inner=[nw1.*x;nw2.*x;nw3.*x;nw1;nw2;nw3];
C_parms=pinv(X_inner')*y_rp % [p1 p2 p3 r1 r2 r3]

C_parms =
8.3762
0.7055
8.2927
57.4508
20.0514
-21.1564

Layers 4 and 5. Calculate the outputs using the consequent parameters.
$$O_i^4 = \bar{w}_i f_i = \bar{w}_i (p_i x + r_i)$$
$$O_1^5 = \sum_i \bar{w}_i f_i = (\bar{w}_1 x) p_1 + \bar{w}_1 r_1 + (\bar{w}_2 x) p_2 + \bar{w}_2 r_2 + (\bar{w}_3 x) p_3 + \bar{w}_3 r_3$$

% LAYERS 4 and 5
% Calculate the outputs using the inner layer outputs and the
% consequent parameters.
y=X_inner'*C_parms;

Plot the Results.

plot(x,y_rp,'+',x,y);
xlabel('Input','fontsize',10);
ylabel('Output','fontsize',10);
title('Function Approximation');
legend('Reference','Output')
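The forward pass and the least squares solve for the consequent parameters can be condensed into a vectorized form. The sketch below is an illustration, not the anfistrn.m code: the anonymous function trap is a hypothetical stand-in for mf_trap, and the trapezoid corners in P are illustrative values, with the right foot of the third trapezoid assumed to be 17 so that every trapezoid is well formed.

```matlab
% Sketch: ANFIS layers 1 to 5 for a one-input, three-rule first-order
% Sugeno model, with the consequent parameters solved by pinv.
% trap is a hypothetical helper; the corner values in P are assumptions.
trap = @(x,p) max(0, min(min((x - p(1))/(p(2) - p(1)), 1), (p(4) - x)/(p(4) - p(3))));

x    = -10:1:10;
y_rp = (0.05*x.^3 - 0.02*x.^2 - 0.3*x + 20)';      % reference outputs

P = [-17 -13 -7 -3;
      -9  -4  3  8;
       3   7 13 17];                               % one row per trapezoid

W = [trap(x,P(1,:)); trap(x,P(2,:)); trap(x,P(3,:))];   % layers 1 and 2
s = sum(W, 1);
s(s == 0) = 1;                                     % guard against divide by zero
NW = W ./ repmat(s, 3, 1);                         % layer 3: normalize

X_inner = [NW .* repmat(x, 3, 1); NW];             % regressors for [p1 p2 p3 r1 r2 r3]
C_parms = pinv(X_inner') * y_rp;                   % least squares solve
y       = X_inner' * C_parms;                      % layers 4 and 5
SSE     = sum((y - y_rp).^2);
```

Because the three normalized firing strengths sum to one at every input, a constant function is always within reach of the model, so the fitted SSE cannot exceed the SSE of simply predicting the mean of the reference outputs.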

[Figure: Function Approximation, curves labeled Reference and Output. Axes: Input (-10 to 10), Output (-30 to 70).]

We can see that by just solving for the consequent parameters we have a very good approximation of the function. We can also train the antecedent parameters using gradient descent.

Each ANFIS training epoch, using the hybrid learning rule, consists of two passes. The consequent parameters are obtained during the forward pass using a least-squares optimization algorithm and the premise parameters are updated using a gradient descent algorithm. During the forward pass all node outputs are calculated up to layer 4. At layer 4 the consequent parameters are calculated using a least-squares regression method. Next, the outputs are calculated using the new consequent parameters and the error signals are propagated back through the layers to determine the premise parameter updates.

The consequent parameters are usually solved for at each epoch during the training phase, because as the output of the last hidden layer changes due to the backpropagation phase, the consequent parameters are no longer optimal. Since the SVD is computationally intensive, it may be most efficient to perform it every few epochs versus every epoch.

The m-file anfistrn.m demonstrates use of the hybrid learning rule to train an ANFIS architecture to approximate the function mentioned above. The consequent parameters ([p1 p2 p3 r1 r2 r3]) after the first SVD solution were:

C_parms = [8.8650 4.5311 8.6415 64.2963 23.7368 -26.0706]

and the final consequent parameters were:

The anfistrn. 100 50 0 -5 0 -1 0 E poc h 40 F u n c t io n S S E 12.C_parms =[11. The sum of squared error was reduced from 200 to 13 in 40 epochs.5 0 -1 0 -5 0 In p u t 5 10 Below is a graph of the final antecedent membership functions and the function approximation after training is completed.5315 82.91 R e fe re nc e O utp ut Output -5 0 In p u t 5 10 1 hip A graph or the training record shows that the ANFIS was able to learn the input output patters with a high degree of accuracy.7 R e fe re n c e O u tp u t Output -5 0 In p u t 5 10 1 Membership M F1 M F2 M F3 0 .2865] Below is a graph of the initial antecedent membership functions and the function approximation after one iteration.7411 25.m code only updates the consequent 204 .3120 10. 100 50 0 -5 0 -1 0 E poc h 1 F u n c t io n S S E 194.0308 7.4364 -41.

This results in a training record with dips every 10 epochs. nor is a user contributed toolbox available. 10 3 T ra in in g R e c o rd S S E SSE 10 2 10 1 0 5 10 15 20 E p o c hs 25 30 35 40 Chapter 14 General Hybrid Neurofuzzy Applications No specific hybrid neurofuzzy applications will be examined in this supplement. expert systems could be generated by using the fuzzy systems tools with crisp membership functions. These applications are complex and can not be easily implemented with the introductory tools developed in this Supplement. Therefore. The application of the tools and techniques described in Chapter 14 of Fuzzy and Neural Approaches in Engineering. presents several hybrid Neurofuzzy Systems that were developed at The University of Tennessee. Chapter 16 Role of Expert Systems in Neurofuzzy Systems MATLAB does not have an expert system toolbox. Since Fuzzy Systems may be viewed as a special type of expert systems that handle uncertainty well. Chapter 15 Dynamic Hybrid Neurofuzzy Systems Chapter 15 of Fuzzy and Neural Approaches in Engineering. 205 .parameters every 10 epochs in order to reduce the SVD computation time. the reader should consult the references given in the text. MATLAB does have if->then programming constructs. although Chapters 12 and 13 present the methodologies and tools necessary to implement them. so expert rules containing heuristic knowledge can be embedded in the neural and fuzzy systems described in earlier chapters. Updating the consequent parameters every epoch did not provide significant error reductions improvements and slowed down the training process. for further information on these subjects. is left to the reader. although no examples will be given in this supplement.

Chapter 17 Genetic Algorithms

MATLAB code will not be used to implement Genetic Algorithms in this supplement. Commercial and user Genetic Algorithm toolboxes are available for use with MATLAB. The following is information on a user contributed toolbox and a commercially available toolbox.

GENETIC is a user contributed set of genetic algorithm m-files written by Andrew F. Potvin of The MathWorks, Inc. His email is potvin@mathwork.com. This set of m-files tries to maximize a function using a simple genetic algorithm. It is located at: http://www.mathworks.com/optims.html

FlexTool(GA) is a commercially available package; the following is an excerpt from their email advertising:

FlexTool(GA) M 1.1 Features:
- Modular, User Friendly, Hardware and operating system transparent
- Expert, Intermediate, and Novice help settings
- Hands-on tutorial with step by step application guidelines
- Designed to draw on MATLAB power
- Default parameter settings for the novice
- Statistics, figures, and data collection
- Niching module to identify multiple solutions
- Clustering module : Use separately or with Niching module
- Can optimize multiple objectives
- GA options : generational GA, steady state GA, micro GA
- Coding schemes include binary, logarithmic, and real
- Selection strategies : tournament, roulette wheel, ranking
- Crossover techniques include 1, 2, multiple point crossover
- Cold Start (start using previously selected GA parameters)
- Warm Start (start from the previous generation) features

For more information, contact:
Flexible Intelligence Group, L.L.C.
Box 1477
Tuscaloosa, AL 35486-1477, USA
Voice: (205) 345-5166
Fax : (205) 345-5095
email: FIGLLC@AOL.COM

