
Advances in Fuzzy Systems — Applications and Theory Vol. 14

Automatic Generation of
Neural Network Architecture
Using Evolutionary Computation

E. Vonk
L. C. Jain
R. P. Johnson

World Scientific
ADVANCES IN FUZZY SYSTEMS — APPLICATIONS AND THEORY

Honorary Editor: Lotfi A. Zadeh (Univ. of California, Berkeley)


Series Editors: Kaoru Hirota (Tokyo Inst. of Tech.),
George J. Klir (Binghamton Univ.-SUNY),
Elie Sanchez (Neurinfo),
Pei-Zhuang Wang (West Texas A&M Univ.),
Ronald R. Yager (Iona College)

Vol. 1: Between Mind and Computer: Fuzzy Science and Engineering


(Eds. P.-Z. Wang and K.-F. Loe)
Vol. 2: Industrial Applications of Fuzzy Technology in the World
(Eds. K. Hirota and M. Sugeno)
Vol. 3: Comparative Approaches to Medical Reasoning
(Eds. M. E. Cohen and D. L. Hudson)
Vol. 4: Fuzzy Logic and Soft Computing
(Eds. B. Bouchon-Meunier, R. R. Yager and L. A. Zadeh)
Vol. 5: Fuzzy Sets, Fuzzy Logic, Applications
(G. Bojadziev and M. Bojadziev)
Vol. 6: Fuzzy Sets, Fuzzy Logic, and Fuzzy Systems: Selected Papers
by Lotfi A. Zadeh
(Eds. G. J. Klir and B. Yuan)
Vol. 7: Genetic Algorithms and Fuzzy Logic Systems: Soft Computing
Perspectives
(Eds. E. Sanchez, T. Shibata and L. A. Zadeh)
Vol. 8: Foundations and Applications of Possibility Theory
(Eds. G. de Cooman, D. Ruan and E. E. Kerre)
Vol. 10: Fuzzy Algorithms: With Applications to Image Processing and
Pattern Recognition
(Z. Chi, H. Yan and T. D. Pham)
Vol. 11: Hybrid Intelligent Engineering Systems
(Eds. L. C. Jain and R. K. Jain)
Vol. 12: Fuzzy Logic for Business, Finance, and Management
(G. Bojadziev and M. Bojadziev)
Vol. 15: Fuzzy-Logic-Based Programming
(Chin-Liang Chang)

Forthcoming volumes:
Vol. 9: Fuzzy Topology
(Y. M. Liu and M. K. Luo)
Vol. 13: Fuzzy and Uncertain Object-Oriented Databases: Concepts and Models
(Ed. R. de Caluwe)
Advances in Fuzzy Systems — Applications and Theory Vol. 14

Automatic Generation of
Neural Network Architecture
Using Evolutionary Computation

E. Vonk
Vrije Univ. Amsterdam

L. C. Jain
Univ. South Australia

R. P. Johnson
Australian Defense Sci. & Tech. Organ.

World Scientific
Singapore • New Jersey • London • Hong Kong
Published by
World Scientific Publishing Co. Pte. Ltd.
P O Box 128, Farrer Road, Singapore 912805
USA office: Suite 1B, 1060 Main Street, River Edge, NJ 07661
UK office: 57 Shelton Street, Covent Garden, London WC2H 9HE

Library of Congress Cataloging-in-Publication Data


Vonk, E.
Automatic generation of neural network architecture using
evolutionary computation / E. Vonk, L. C. Jain, R. P. Johnson
p. cm. - (Advances in fuzzy systems ; vol. 14)
Includes bibliographical references and index.
ISBN 9810231067
1. Neural networks (Computer science) 2. Computer architecture.
3. Evolutionary computation. I. Jain, L. C. II. Johnson, R. P.
(Ray P.) III. Title. IV. Series.
QA76.87.V663 1997
006.3'2-dc21 97-28485
CIP

British Library Cataloguing-in-Publication Data


A catalogue record for this book is available from the British Library.

First published 1997


Reprinted 1999

Copyright © 1997 by World Scientific Publishing Co. Pte. Ltd.


All rights reserved. This book, or parts thereof, may not be reproduced in any form or by any means,
electronic or mechanical, including photocopying, recording or any information storage and retrieval
system now known or to be invented, without written permission from the Publisher.

For photocopying of material in this volume, please pay a copying fee through the Copyright
Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, USA. In this case permission to
photocopy is not required from the publisher.

Printed in Singapore by Regal Press (S) Pte. Ltd.


Preface
This book presents our research on the application of evolutionary computation in
the automatic generation of a neural network architecture. The architecture has a
significant influence on the performance of the network. It is the usual practice to use
trial and error to find a suitable neural network architecture for a given problem. This
method is not only time consuming but may not generate an optimal network. The
use of evolutionary computation is a step towards automation in neural network
architecture generation. In this book, an overview of the field of evolutionary
computation is given together with the biological background from which the field
was inspired. The most commonly used approaches towards a mathematical
foundation of the field of genetic algorithms are given, as well as an overview of the
hybridisations between evolutionary computation and neural networks. Experiments
concerning an implementation of automatic neural network generation using genetic
programming and one using genetic algorithms are described, and the efficacy of
genetic algorithms as a learning technique for a feedforward neural network is also
investigated.

The book contains 11 chapters. Chapter 1 provides an introduction to the automatic


generation of a feedforward neural network using evolutionary computation. Chapter
2 provides an introduction to artificial neural networks. Chapter 3 describes the
principle of operation of evolutionary computation. It includes an introduction to
genetic algorithms, genetic programming and evolutionary algorithms. Chapter 4
presents the biological background of evolutionary computation. In chapter 5, an
attempt is made to present the mathematical basis of genetic algorithms in a limited
sense. Chapter 6 presents the implementation of genetic algorithms. A brief overview
of the most commonly used genetic algorithms settings is given in this chapter.
Chapter 7 presents ways to combine neural networks with evolutionary computation.
Chapter 8 describes the use of genetic programming to generate neural networks.
Chapter 9 describes the use of genetic algorithms to optimise the weights of a neural
network. Chapter 10 presents the use of genetic algorithms with grammar encoding
schemes to generate neural network architectures. Chapter 11 provides concluding
remarks and presents future directions.


The book will prove useful for application engineers, scientists, researchers and the
senior undergraduate/first year graduate students in Computer, Electrical, Electronic,
Manufacturing, Mechatronics and Mechanical Engineering, and related disciplines.

Thanks are due to Berend Jan van der Zwaag and Pieter Grimmerink for their
excellent help in the preparation of the manuscript. We are grateful to Professor
Sanchez of the University of Marseille, France, and Professor Karr of the University
of Alabama for reviewing the manuscript. Thanks are also due to Mr Chiang Yew Kee
for his excellent editorial assistance.

This work was supported by the Australian Defense Science and Technology
Organisation (contract number 340479).

E. Vonk
L. C. Jain
R. P. Johnson
Contents

PREFACE v

1. INTRODUCTION 1

2. ARTIFICIAL NEURAL NETWORKS 3


2.1 Introduction 3
2.1.1 The artificial neuron 4
2.1.2 The perceptron 5
2.1.3 Activation functions 6
2.1.4 Two layer neural network 8
2.1.5 Types of neural networks 9
2.1.6 Learning 9
2.1.7 Recall of output data from the trained network 11
2.1.8 Learning rules 11
2.1.9 Forms of neural network connections 12
2.2 Basic types of neural networks 13
2.2.1 The Multiple Layer Perceptron 14
2.3 Conclusion 16

3. EVOLUTIONARY COMPUTATION 17

3.1 Genetic Algorithms (GAs) 17


3.1.1 Example of an optimisation problem 19
3.1.2 The algorithm 22
3.1.3 Example of a generation 24
3.1.4 Dual representation and competing conventions 29
3.1.5 The Steady State Genetic Algorithm 32
3.1.6 Parallel Genetic Algorithms 33
3.1.7 Elitism 34
3.1.8 Extensions of the Standard Genetic Algorithm 34
3.2 Genetic Programming (GP) 35
3.3 Evolutionary Algorithms (EAs) 40


4. THE BIOLOGICAL BACKGROUND 43

4.1 Genetic structures 43


4.2 Reproduction 44
4.3 Mutations 46
4.3.1 Chromosome mutations 46
4.3.2 Gene mutations 48
4.4 Natural Evolution 48
4.5 Links to Evolutionary Computation 54

5. MATHEMATICAL FOUNDATIONS OF GENETIC ALGORITHMS 60

5.1 The Operation of Genetic Algorithms 60


5.2 The Schema Theorem and the Building Block Hypothesis 62
5.2.1 The effect of roulette wheel reproduction 63
5.2.2 The effect of crossover 64
5.2.3 The effect of mutation 66
5.2.4 The effects of all genetic operators combined: The Schema Theorem 66
5.2.5 The Building Block Hypothesis 67
5.2.6 Another viewpoint: the switching of hyperplanes 69
5.2.7 The Walsh-Schema Transform 69
5.2.8 Extending the Schema Theorem to other representations 72
5.3 Criticism on the Schema Theorem and the Building Block Hypothesis 72
5.4 Price's Theorem as an alternative to the Schema Theorem 74
5.5 Markov Chain Analysis 75

6. IMPLEMENTING GAs 79

6.1 GA performance 79
6.2 Fitness function 81
6.3 Coding 82
6.3.1 Binary coding 83
6.3.2 Real-valued coding 84
6.3.3 Symbolic coding 85
6.3.4 Non-homogeneous coding 85

6.4 Selection schemes 86


6.4.1 Proportionate Reproduction 86
6.4.2 Tournament Selection 87
6.4.3 Steady State Genetic Algorithms 87
6.5 Crossover, mutation and inversion 88
6.5.1 Crossover 88
6.5.2 Mutation 89
6.5.3 Inversion 89

7. HYBRIDISATION OF EVOLUTIONARY COMPUTATION AND


NEURAL NETWORKS 91

7.1 Evolutionary Computing to Train the Weights of a NN 91


7.2 Evolutionary Computing to analyse a NN 93
7.3 Evolutionary Computing to optimise a NN architecture and its weights 93
7.3.1 Direct encoding 96
7.3.2 Parametrised encoding 98
7.3.3 Grammar encoding 98

8. USING GENETIC PROGRAMMING TO GENERATE NEURAL


NETWORKS 101

8.1 Set-up 101


8.2 Example of a Genetically Programmed Neural Network 102
8.3 Creation and Crossover Rules for Genetic Programming for Neural
Networks 104
8.3.1 Creation rules 104
8.3.2 Crossover rules 104
8.4 Automatically Defined Functions (ADFs) 105
8.5 Implementation of the Fitness Function 106
8.6 Experiments with Genetic Programming for Neural Networks 108
8.6.1 The XOR problem 108
8.6.2 The one-bit adder problem 110
8.6.3 The intertwined spirals problem 111
8.7 Discussion of Genetic Programming for Neural Networks 112

9. USING A GA TO OPTIMISE THE WEIGHTS OF A NEURAL


NETWORK 114

9.1 Description of the GA software 114


9.2 Set-up 117
9.3 Experiments 120
9.3.1 Data-sets 121
9.3.2 Comparing GA with Back Propagation 122
9.3.3 Results 123
9.4 Discussion 129

10. USING A GA WITH GRAMMAR ENCODING TO GENERATE


NEURAL NETWORKS 131

10.1 Structured Genetic Algorithms in Neural Network Design 131


10.1.1 Weight transmission 132
10.1.2 Structural and parametric changes 134
10.1.3 Weight representation 135
10.2 Kitano's matrix grammar 135
10.3 The Modified Matrix Grammar 137
10.4 Combining Structured GAs with the Matrix Grammar 145
10.4.1 Genetic operators 148
10.4.2 Evaluation 148
10.5 Direct Encoding 149
10.6 Network pruning and reduction 151
10.7 Experiments 152
10.7.1 Set-up 152
10.7.2 Results 153
10.8 Discussion 167

11. CONCLUSIONS AND FUTURE DIRECTIONS 169

REFERENCES AND FURTHER READING 173

INDEX 181
1. Introduction
It is often overlooked that the performance of a neural network on a certain problem
depends in the first place on the network architecture used and only in the second
place on the actual knowledge representation (i.e. values of the weights) within that
specific architecture. It can be said that the performance of a neural network depends
on three factors: the problem for which the network is going to be used or rather how
this is measured, the network structure and the set of weights. The performance of a
network is typically measured by the cumulative error of the neural network on some
test data with known target outputs, but can include computational speed and
complexity as well. This performance can be defined by an abstract quality function
Q:

Q = Q(T, S, W)

where:
Q = the type of quality function
T = the testing data (i.e. the target input/output data set)
S = the structure or architecture of the network
W = the set of weights

The objective is to optimise the quality function Q to gain an optimal performance of


the neural network. This really holds for any type of neural network. This book,
however, only deals with feedforward neural networks that can be trained with some
type of supervised learning algorithm. An example of such a quality function Q that is
commonly used is the mean cumulative squared error on the test data set consisting of
several input / target output patterns:

Q = (1/F) · Σ_{f=1}^{F} (T_f − O_f)²

where:
F = the number of test patterns or facts
O = the neural network output vector
T = the target output vector
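A minimal sketch of this quality function in Python (the function name, the list-of-vectors data layout, and the example values are our own illustrative choices, not from the original text):

```python
def quality(targets, outputs):
    """Mean cumulative squared error Q over F test patterns.

    `targets` and `outputs` are lists of equal-length output vectors;
    the structure S and weights W are implicit in whatever network
    produced `outputs`.
    """
    F = len(targets)
    total = 0.0
    for t_vec, o_vec in zip(targets, outputs):
        # Squared error of one pattern, summed over the output vector.
        total += sum((t - o) ** 2 for t, o in zip(t_vec, o_vec))
    return total / F

# A perfectly trained network scores Q = 0; larger Q means worse performance.
print(quality([[1.0], [0.0]], [[1.0], [0.0]]))   # -> 0.0
print(quality([[1.0], [0.0]], [[0.5], [0.5]]))   # -> 0.25
```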


Traditionally the structure of a feedforward neural network S is set by the user a


priori. The type of structure used may be based on some knowledge of the problem
domain but commonly a sufficient network structure is found by trial and error. In
many cases the structure used will be a fully connected feedforward network and the
user might try different numbers of hidden neurons to see how well the resulting
structures will fit the task. This network structure is then trained with some learning
algorithm to gain an appropriate set of weights W. The emphasis on optimising the
quality of the network, Q, is very often based on the ability of the learning algorithm
to generate an optimal set of weights, while the structure S is taken for granted or
chosen from a limited domain.

The automatic generation of a neural network structure is a useful concept, as, in many
applications, the optimal structure is not known a priori. Construction-deconstruction
algorithms can be used as an approach but they have several drawbacks. They are
usually restricted to a certain subset of network topologies and as with all hill climbing
methods they often get stuck at local optima and may therefore not reach the optimal
solution. These limitations can be overcome using evolutionary computation as an
approach to the generation of neural network structures. In order to optimise the
quality function Q the algorithm used must at least be able to change the structure S as
well as the set of weights W. In the case of feedforward neural networks, optimising
the structure S alone may be sufficient since the existing learning algorithms are such
that, given a neural network structure S, in many cases the optimal set of weights W
can quite easily be found. In many applications the test data set, T, will be stable.
However the network may have to operate in a dynamic environment where the task
description or at least the testing of the network on the task may change over time. In
such a case the algorithm must ideally be able to adapt T as well.

The field of evolutionary computation is investigated in its ability to generate an


optimal feedforward neural network structure given a fixed classification task. Its
efficacy as a learning algorithm for feedforward neural networks is also considered in
this book.
2. Artificial Neural Networks
Artificial neural networks are parallel computational models comprised of densely
interconnected adaptive processing units [36], [60]. These networks are fine-grained
parallel implementations of non-linear systems, either static or dynamic. A very
important feature is their adaptive nature where 'learning by example' replaces
'programming' in solving problems. This feature renders these computational models
very appealing in application domains where one has little or incomplete
understanding of the problem to be solved, but where training data (examples) are
available. Another key feature is the intrinsic parallel architecture that allows for fast
computation of solutions when these networks are implemented on parallel digital
computers or when implemented in customised hardware.

Artificial neural networks are viable and very important computational models for a
wide variety of problems. These include pattern classification, speech synthesis and
recognition, function approximation, image compression, associative memory,
clustering, forecasting and prediction, combinatorial optimisation, and non-linear
system modeling and control. The networks are 'neural' in the sense that they have
been inspired by neuroscience, the study of the human brain and nervous system. The
artificial neurons used are thought to be very simple models of their biological
counterpart. However, this does not mean that they are faithful models of biological
neural or cognitive phenomena, those are of a much more complex nature. In fact, the
majority of the neural networks presently used are more closely related to traditional
mathematical and/or statistical models, such as non-parametric pattern classifiers,
non-linear filters and statistical regression models, than to neurobiological models.
Still, the technology of neural networks attempts to mimic nature's approach to solve
certain complex problems that are impossible to solve with the more traditional
techniques.

2.1 Introduction
This section introduces some of the concepts of neural networks. The basic
components of neural networks are discussed and some of the more common forms of
neural networks are considered.


The study of neural networks was originally undertaken in order to understand the
behaviour and structure of the biological neuron. It was soon realised how inadequate
the artificial neuron models were in comparison with the biological neuron, and as a
result some researchers in artificial neural networks decided that the name of neuron
was inappropriate and used other terms such as node rather than neuron. The use of
the term neuron is now so deeply entrenched that its continued general use seems
assured.

Another point which is sometimes confusing is that different writers use a different
numbering nomenclature for multi-layered neural networks. Some workers do not
count the input layer as one of the layers on the basis that this layer often serves only
for the input data and no processing of data occurs in it. Processing however does
occur within the input layer in some forms of artificial neural network. For the sake of
consistency we include the input layer as one of the layers when numbering the layers
of neurons.

2.1.1 The Artificial Neuron


The artificial neuron (refer to Figure 2.1) may be thought of as an attempt to model the
behaviour of the biological neuron. It is at the present time a limited approximation to
the biological neuron and it is probably not desirable to stretch the analogy too far.

Figure 2.1 The artificial neuron



The first stage is a process where the inputs x_0, x_1, ..., x_n, multiplied by their
respective weights w_0, w_1, ..., w_n, are summed by the neuron. The input vector
x_0, x_1, ..., x_n may be denoted by X and the weight vector w_0, w_1, ..., w_n by W.
Weight w_0 forms the neuron's threshold. The resulting summation process may be
shown as:

y = w_0 + x_1·w_1 + x_2·w_2 + ... + x_n·w_n = X·W

The weight vector W contains the weights connecting the various parts of the network.
The memory of the neural network is stored in the values of the weights. The term
weight is used in neural network terminology and is a means of expressing the strength
of the connection between any two neurons in the neural network.

During the training phase of a neural network the values of the weights are continuously
modified by the training process until some previously agreed criteria are met.
Different types of network use different methods of making the necessary adjustments.

2.1.2 The Perceptron


In order to allow for varying input conditions and their effect on the output it is
usually necessary to include a nonlinear activation function f in the neuron
arrangement. This is so that adequate levels of amplification may be used where
necessary for small input signals without running the risk of driving the output to
unacceptable limits where a large input signal is applied. Depending on the
circumstances one of a number of different activation functions may be employed for
this dynamic range control action.

Figure 2.2 shows a simplified representation of a perceptron.

The output of the neuron is now expressed in the form

Output = f(y)

Figure 2.2 The perceptron
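The weighted summation and activation stages just described can be sketched as a single function; the step activation and the hand-picked weight values in the example are illustrative assumptions only:

```python
def perceptron_output(x, w, f):
    """Single artificial neuron: the weighted sum y = w_0 + sum(x_i * w_i),
    where w[0] is the threshold weight, passed through activation f."""
    y = w[0] + sum(xi * wi for xi, wi in zip(x, w[1:]))
    return f(y)

# Step activation: +1 for positive net input, -1 otherwise.
step = lambda y: 1 if y > 0 else -1

# AND-like behaviour from hand-picked (not learned) weights.
print(perceptron_output([1, 1], [-1.5, 1.0, 1.0], step))  # -> 1
print(perceptron_output([1, 0], [-1.5, 1.0, 1.0], step))  # -> -1
```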

2.1.3 Activation Functions


There are a number of types of commonly used activation functions and some of these
are shown in Figure 2.3. Most activation functions are also known as threshold
functions or squashing functions. A brief description of the properties of the activation
functions shown in Figure 2.3 follows:

• Step function. The function shown in Figure 2.3a is known as the step function.
The output from this function is limited to one of two values, depending on
whether the input signal is greater or less than zero. Usually the output value
would be one for signal values greater than zero and minus one for signal values
less than zero. That is:

Output = 1    when y > 0
       = -1   when y < 0

Figure 2.3 Some common types of activation functions

• Linear function. The function shown in Figure 2.3b is the only linear function in
the group of four functions shown and it has application in some specific network
nodes where dynamic range is not a consideration. The effect of this function is to
multiply by a constant factor. That is:

Output = K • y

• Ramp function. The effect of the ramp function, shown in Figure 2.3c, is to behave
as a linear function between the upper and lower limits and once these limits are
reached to behave as a step function. Another attraction is that the function may be
simply defined:

Output = Max    when y > upper limit
       = K·y    when lower limit ≤ y ≤ upper limit
       = Min    when y < lower limit

• Sigmoid function. The sigmoid function is an 'S' shaped curve, as shown in Figure
2.3d. A number of mathematical expressions may be used to define an 'S' shaped
curve, but the most commonly used form is given by the expression:

f(y) = 1 / (1 + e^(−y))

This expression is easy to differentiate and sometimes this property enables a


simplification to be made in the neural network formulation.
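The four functions of Figure 2.3 can be sketched directly; the default limits, the constant K, and the Min/Max output values shown here are illustrative assumptions:

```python
import math

def step(y):
    """Step function (Figure 2.3a): one of two output values."""
    return 1.0 if y > 0 else -1.0

def linear(y, k=1.0):
    """Linear function (Figure 2.3b): multiply by a constant factor K."""
    return k * y

def ramp(y, lower=-1.0, upper=1.0, k=1.0, min_out=-1.0, max_out=1.0):
    """Ramp function (Figure 2.3c): linear between the limits,
    step-like once the limits are reached."""
    if y > upper:
        return max_out
    if y < lower:
        return min_out
    return k * y

def sigmoid(y):
    """Sigmoid function (Figure 2.3d): the 'S'-shaped curve 1/(1+e^-y)."""
    return 1.0 / (1.0 + math.exp(-y))

print(step(0.3))               # -> 1.0
print(ramp(2.5))               # -> 1.0
print(round(sigmoid(0.0), 3))  # -> 0.5
```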

2.1.4 Two Layer Neural Network


Several perceptrons may be grouped together to form a neural network where the two
layers of neurons are fully interconnected, but there is no interconnection between
neurons in the same layer. This results in a network as shown in Figure 2.4. This
arrangement is a two layer neural network and it illustrates a common form of neural
network.

Figure 2.4 Two layer fully interconnected neural network



2.1.5 Types of Neural Networks


Neural networks may be classified in a number of different ways depending on their
structures and underlying principles of operation including methods used for learning
and recall. An indication of the methods of classification is given in Table 2.1 below.
The various types of learning and recall applying to some of the more common
paradigms will be explained in the following sections.

Table 2.1 Classification of neural networks

                          Feedback Recall                   Feed-forward Recall
Unsupervised Learning     Type A                            Type B
                          Example: Adaptive Resonance       Example: Linear Associative
                          Theory                            Memory
Supervised Learning       Type C                            Type D
                          Example: Brain State in a Box     Example: Perceptron

2.1.6 Learning
Before a neural network can be used it is necessary to subject it to some form of
training, during which the values of the weights in the network are adjusted to reflect
the characteristics of the input data. The learning process is one of developing a
mapping between the output data and the input data. When the network is adequately
trained, it will retrieve the correct output when a set of input data is presented to it. A
valuable property often claimed for neural networks is that of generalisation, whereby
a trained neural network is able to provide a correct matching of output data to a set of
previously unseen input data.

The generalisation capability of a network is determined largely by the network


structure and the degree of training. In training a network the available input dataset
consists of many facts and is normally divided into two groups. One group of facts is
used as the training data and a second group is retained for checking and testing the
accuracy of the performance of the network after training. The quantity of data
available should be large enough to encompass a representative range of
circumstances which the network will encounter during service.

As indicated in Table 2.1 there are two forms of learning: supervised learning and
unsupervised learning.

• Supervised Learning
In this form of learning, a target value is included as part of each fact within the
training data. In this instance a fact incorporates all of the input data for the particular
event and the required output expected from the network for this fact. The target value
is the output value corresponding to a particular fact.

During the training process, the set of training data facts is repeatedly applied to the
network until the difference between the output results and the target values are within
the desired tolerance. When the neural network meets the error criteria on the training
facts, the previously unseen test data set of facts is applied to the neural network to
test the generalisation performance of the network.

Figure 2.5 Example of supervised learning in a neural network

• Unsupervised Learning
Unlike supervised learning there is no target value in this form of training. Instead, the
set of data which contains the facts is repeatedly applied to the network until a stable
network output is obtained. It has been suggested that this form of training is more
similar to the biological neuron as in the biological situation there is not normally a
target value.

Figure 2.6 Example of unsupervised learning in a neural network

2.1.7 Recall of Output Data from the Trained Network


The recall of output from a trained neural network is obtained by two distinct
methods.

• The Feed-Forward Recall of Output


When obtaining output data from a trained neural network, novel input data is
presented to the trained network and a single traverse of the network is made. The
output corresponding to the input data is then immediately available.

• The Feedback Recall of Output


In this case, the applied input data is circulated in the trained neural network until
a stable condition is obtained. This is intrinsically a more lengthy process than the
corresponding feed-forward recall of data.

2.1.8 Learning Rules


There are a large number of training rules that have been developed and some are
listed below. The role of the learning mechanism is to adjust the weights of the
network in response to the problem.

Table 2.2 Indicative arrangements of some important types of learning rules

Rule                        Weight adjustment Δw_ij                Comments
Hebbian                     Δw_ij = η·f(w_i·X)·x_j                 η = learning rate
(enhance successful                                                w = weight vector
connections)                                                       X = input vector

Perceptron                  Δw_ij = η·(t_i − sgn(w_i·X))·x_j       t = target vector
(binary response, no                                               η = learning rate
action if no error)

Delta                       Δw_ji = η·δ_pj·a_pi                    S_j = weighted sum of inputs to j
                            A: δ_pj = f'(S_j)·(t_pj − a_pj)        A: output layer error
                            B: δ_pj = f'(S_j)·Σ_k δ_pk·w_kj        B: hidden layer error

Least Mean Square           Δw_ij = η·(t_i − w_i·X)·x_j            η, t, X and w are as above
(Widrow-Hoff)

Outstar (Grossberg)         Δw_ij = η·(t_i − w_ij)

Winner Takes All            A: Δw_ij = η·(x_j − w_ij)              A: when in near neighbourhood
(nearby neurons modify      B: Δw_ij = 0                           B: when not in near
in a similar fashion)                                              neighbourhood

2.1.9 Forms of Neural Network Connections


Neurons may be arranged in many different ways, including the fully interconnected
layer (Figure 2.7), multi-layer networks (Figure 2.8), through to Adaptive Resonance
Theory networks, that include complex layers with an external decision making
structure. Some simple examples of possible forms of network connections are given
in the following figures.

Figure 2.7 Fully interconnected layer

Figure 2.8 Two layer recurrent network (note the feedback connections)

2.2 Basic Types of Neural Networks


A number of neural networks are successfully used and reported in literature. Some
common examples of different types of networks are:
• Perceptron network
• Multiple Layer Perceptron (MLP)
• Radial Basis Function network
• Kohonen's self organising feature map
• Adaptive Resonance Theory network (ART)
• Hopfield network

• Bidirectional Associative Memory (BAM) network


• Counter-propagation network
• Cognitron & Neo-cognitron network

The multiple layer perceptron network will be described in further detail, since its
regular feedforward structure lends itself well to an investigation of genetic algorithm
techniques for network design.

2.2.1 The Multiple Layer Perceptron


The MLP network is a widely reported and used neural network. It consists of an input
layer of neurons, one or more hidden layers of neurons, and an output layer of neurons
as illustrated in the very simple structure of Figure 2.9. Each neuron calculates the
weighted sum of its inputs, and uses this sum as the input of an activation function,
which is commonly a sigmoid function. The supervised back-propagation learning
algorithm uses gradient descent search in the weight space to minimise the error
between the target output and the actual output. A large number of gradient-based
search methods are reported in the literature. The back-propagation method is chosen
due to its popularity.

Figure 2.9 A typical Multiple Layer Perceptron (MLP) architecture



The mean squared error, often called training error or network error, between the
actual output and the desired output is defined as follows:

E = (1/2) · Σ_k (t_k − y_k)²                                          (2.1)

where
t_k = target output of the kth neuron in the output layer
y_k = actual output of the kth neuron in the output layer

The weight change is set proportional to the negative of the derivative of the error
with respect to each weight:

Δw_jk = −ε · ∂E/∂w_jk                                                 (2.2)

where ε is called the learning rate.

It is a general practice to accelerate the learning procedure by introducing a


momentum term μ into the learning equation (2.2), as follows:

Δw_jk(t+1) = −ε · ∂E/∂w_jk(t+1) + μ · Δw_jk(t)                        (2.3)

where

w_jk = weight from the jth unit to the kth unit

Another form of the weight update rule is:


Δw_jk(t+1) = −(1 − μ) · ε · ∂E/∂w_jk(t+1) + μ · Δw_jk(t)              (2.4)
The factor (1 − μ) is included so the learning rate ε does not need to be stepped down
as the momentum μ is increased.
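The momentum update of equation (2.3) can be sketched numerically; the constant gradient sequence and the settings ε = 0.5 and μ = 0.9 are illustrative assumptions only:

```python
def momentum_update(grad, prev_delta, eps=0.5, mu=0.9):
    """Weight update of equation (2.3):
    delta_w(t+1) = -eps * dE/dw + mu * delta_w(t)."""
    return -eps * grad + mu * prev_delta

# With a constant gradient, momentum lets the step size build up over time,
# accelerating movement along a consistent downhill direction.
d = 0.0
for grad in [1.0, 1.0, 1.0]:
    d = momentum_update(grad, d)
print(round(d, 3))  # -> -1.355  (versus -0.5 per step without momentum)
```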

The back-propagation algorithm, despite its simplicity and popularity, has several
drawbacks. It is slow and typically needs thousands of iterations to train a network to
solve a simple problem. The algorithm performance is also dependent on the initial
weights, and the values of μ and ε.

2.3 Conclusion
Artificial neural networks are viable and important computational models for a wide
variety of problems. It is a common practice to use trial and error to find a suitable
neural network architecture for a given problem. This trial and error method is time
consuming, and may not generate an optimum neural network structure. The learning
process whereby the network encodes information from the training process is also of
great importance in neural network performance and generalisation.

A number of learning techniques, such as error back-propagation, have been proposed


to train neural networks, but all are highly dependent on the interconnection topology.
This work reports an effort to develop a more general artificial neural network training
technique, based on genetic algorithms, that can be applied to a variety of topologies,
possibly allowing new artificial neural network structures to be investigated.
3. Evolutionary Computation
Evolutionary Computation is the name of a collection of stochastic optimisation
algorithms loosely based on concepts of biological evolutionary theory. (Some authors
prefer to use the term Evolution Programs.) These techniques are successfully used in
many applications including the optimisation of a neural network architecture. They
are based on the evolution of a population of potential solutions to a certain problem.
The population of possible solutions evolves from one generation to the next,
ultimately arriving at a satisfactory solution to the given problem. The algorithms
differ in the way new populations are generated and in the way the members are
represented within the algorithm.

There is some confusion about the grouping and naming of the various kinds of
evolutionary computations. In this report the distinction is made between three kinds
of evolutionary computations: Genetic Algorithms (GAs), Genetic Programming and
Evolutionary Algorithms. The latter can be divided into Evolution Strategies and
Evolutionary Programming.

3.1 Genetic Algorithms (GAs)


Genetic algorithms were developed by John Holland in the 1970's [30] (refer to [19],
[48] for an overview) and are based on a Darwinian-type survival of the fittest strategy
with sexual reproduction, where stronger individuals in the population have a higher
chance of creating offspring. Each individual in the population represents a potential
solution to the problem that is to be solved. The individuals are represented in the
genetic algorithm by means of a linear string similar to the way genetic information is
coded in organisms as chromosomes. In GA terminology the members of a population
are therefore referred to as chromosomes. Chromosomes are made up of a set of genes
and in the traditional case of binary strings these genes are just bits. More generally
genes are referred to as characters belonging to a certain alphabet A. A chromosome
can be thought of as a vector x consisting of l genes a_i:

    x = (a_1, ..., a_l),  a_i ∈ A_i


l is referred to as the chromosome length. Commonly all alphabets are the same: A =
A_1 = A_2 = ... = A_l, and in the case of binary genes: A = {0,1}.

The definitions of the basic terms in a genetic algorithm are given below:

phenotype = the potential solution to the problem


chromosome = the representation of the phenotype in a form that can be used by
the genetic algorithm (generally as a linear string)
genotype = the set of parameters encoded in the chromosome
gene = the non-changeable pieces of data from which a chromosome is
made up
alphabet = the set of values a gene can take on
population = the collection of chromosomes that evolves from generation to
generation
generation = a single pass from the present population to the next one
fitness = the measure of the performance of an individual on the actual
problem
evaluation = the translation of the genotype into the phenotype and the
calculation of its fitness

The aim of the genetic algorithm is to find an individual with a maximum fitness by
means of a stochastic global search of the solution space.

In biological systems from which the genetic algorithm terminology originates, a


genotype, or the total genetic package, is a structure made up of several chromosomes.
The phenotype is the actual organism formed by the interaction of the genotype with
its environment. In genetic algorithms however, an individual is usually represented
by a single chromosome and therefore the chromosome and the genotype are one and
the same. The term individual is used for a member of the population where the
genotype, x, of an individual refers to the (linear) chromosome and the phenotype, p,
to the observed structure acting as a potential solution to the problem. GAs therefore
rely on a dual representation of individuals where a mapping function is needed
between the two representations (refer section 3.1.4): the genotype, or representation
space, and the phenotype, or problem space. The behaviour of the individual on the
task can be expressed by traits or phenes expressed in the problem space.

The fitness of a chromosome, f(x), is in general a mapping of the chromosome to a
real positive value, f(x) ∈ ℝ+, that measures the individual's (phenotype's)
performance on the problem.

3.1.1 Example of an Optimisation Problem


As an example consider the problem of optimising the weights for a simple neural
network. The neural network in question consists of 2 inputs, 2 hidden neurons and 1
output neuron. All neurons have a connection to the bias unit which has a constant
value of 1. All weights in the network are restricted to the values -1 and +1. Inputs and
outputs are also restricted to these values, and the transfer or processing function, P,
of the neurons is the threshold function on {-1,+ 1}:

    P(x) = −1 if x < 0
           +1 if x ≥ 0

where x is the weighted sum of the inputs to the neuron.

The aim is to get the network to perform the XOR function, described by the
input-output mapping of Table 3.1.

Table 3.1 The XOR problem

    Input 1   Input 2   XOR
      -1        -1       -1
      -1        +1       +1
      +1        -1       +1
      +1        +1       -1

The task is to find the set of weights such that the neural network performs the XOR
function. Figure 3.1 shows the neural network structure.

Figure 3.1 A '2-2-1' Feedforward Neural Network

A chromosome or genotype consists of all the weights of the network, including the
bias weights. One gene of a chromosome represents a single weight-value. For the
demonstration, a simple genetic algorithm is used with binary valued chromosomes.
Thus the alphabet is {0,1}. The alphabet size or cardinality, k, therefore is two. During
the evaluation of the chromosome, a gene in the chromosome that has a value of 0 will
be translated into a weight-value of -1. The weights are numbered from 1 to 9 in
Figure 3.1 which reflects the order in which they are represented in the chromosome.
The chromosome length, l, is 9.

An example of an ordered set of weights such that the network correctly performs its
task is:

( + 1,+ 1,-1,-1,-1,+ 1,+ 1,+ 1,-1).

The corresponding chromosome denoted by the vector x is then:

x = (1 1 0 0 0 1 1 1 0)

The phenotype of this individual can be seen as the actual neural network structure
with the values of the weights given by the set { +1,+ 1,-1,-1,-1,+ 1,+ 1,+ 1,-1}. Such a
phenotype is a potential solution of the XOR problem; in fact this phenotype is an
optimal solution to the problem.

The fitness function should reflect the individual's performance on the actual problem.
The standard genetic algorithm searches for the maximum fitness and a low
performance error should be reflected in a good performance and therefore a high
fitness value.

For example when the chromosome (1 1 0 0 0 1 1 1 0) is evaluated, it is first
translated into the corresponding set of weights: (+1,+1,-1,-1,-1,+1,+1,+1,-1). Then
the performance error of the neural network on the XOR training set using this set of
weights is calculated. The corresponding fitness, f(x), is then calculated as:

    f(x) = E_max − E(x)

where E(x) is the cumulative performance error on the training set and E_max is the
maximum value E(x) can obtain. E(x) is given by:

    E(x) = Σ_{i=1..F} Σ_{j=1..N_out} ( O_ij(x) − T_ij )²

where:  F = the number of facts in the training set
        N_out = the number of outputs
        O_ij(x) = the j-th output of the network resulting from training fact i
        T_ij = the j-th target output of training fact i

The maximum performance error, E_max, would in this case be: 4 · 2² = 16, and thus the
fitness value of the chromosome in question is: f(x) = 16 − 0 = 16.
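This evaluation can be sketched as follows. Figure 3.1's exact weight numbering is not reproduced here, so the wiring below (each hidden neuron takes input 1, input 2 and the bias; the output neuron takes both hidden units and the bias) and the sample chromosome are assumptions chosen for illustration:

```python
def threshold(x):
    """The transfer function P: -1 if x < 0, +1 if x >= 0."""
    return -1 if x < 0 else 1

def decode(chromosome):
    """Gene value 0 -> weight -1, gene value 1 -> weight +1."""
    return [1 if g else -1 for g in chromosome]

def network_output(w, x1, x2):
    # Assumed wiring: w[0:3] feed hidden 1 (input 1, input 2, bias),
    # w[3:6] feed hidden 2, w[6:9] feed the output (h1, h2, bias).
    h1 = threshold(w[0] * x1 + w[1] * x2 + w[2])
    h2 = threshold(w[3] * x1 + w[4] * x2 + w[5])
    return threshold(w[6] * h1 + w[7] * h2 + w[8])

# The four facts of Table 3.1: (input 1, input 2, target).
XOR_FACTS = [(-1, -1, -1), (-1, 1, 1), (1, -1, 1), (1, 1, -1)]

def fitness(chromosome):
    """f(x) = E_max - E(x) with squared errors, E_max = 4 * 2^2 = 16."""
    w = decode(chromosome)
    error = sum((network_output(w, x1, x2) - t) ** 2
                for x1, x2, t in XOR_FACTS)
    return 16 - error

# Under this assumed wiring, OR and NAND hidden units feeding an AND
# output solve XOR, so the chromosome below reaches the maximum fitness:
print(fitness([1, 1, 1, 0, 0, 1, 1, 1, 0]))  # -> 16
```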

3.1.2 The Algorithm


The following steps describe the operation of the standard genetic algorithm.

1. randomly generate an initial population of chromosomes


2. compute the fitness of every member of the current population
3. make an intermediate population by extracting members out of the current
population by means of the reproduction operator
4. generate the new population by applying the genetic operators (crossover,
mutation) to this intermediate population
5. if there is a member of the current population that satisfies the problem
requirements then stop, otherwise go to step 2
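The five steps can be sketched as a loop. The helper below assumes binary chromosomes, 1-point crossover and bit-flip mutation, with the stopping fitness passed in as `target`; all names are illustrative:

```python
import random

def run_ga(fitness, length, target, pop_size=5, pc=0.8, pm=0.01,
           max_gens=200):
    """Standard generational GA: the five steps above in a loop."""
    pop = [[random.randint(0, 1) for _ in range(length)]      # step 1
           for _ in range(pop_size)]
    for _ in range(max_gens):
        best = max(pop, key=fitness)                          # step 2
        if fitness(best) >= target:                           # step 5
            return best
        scores = [fitness(c) + 1e-9 for c in pop]    # avoid all-zero wheel
        mates = random.choices(pop, weights=scores, k=pop_size)  # step 3
        nxt = []
        for i in range(0, pop_size, 2):                 # step 4: crossover
            a, b = mates[i][:], mates[(i + 1) % pop_size][:]
            if random.random() < pc:
                site = random.randint(1, length - 1)
                a, b = a[:site] + b[site:], b[:site] + a[site:]
            nxt += [a, b]
        pop = [[1 - g if random.random() < pm else g for g in c]
               for c in nxt[:pop_size]]                 # step 4: mutation
    return max(pop, key=fitness)
```

For the XOR example of section 3.1.1, `run_ga(fitness, length=9, target=16)` would loop until a chromosome with f(x) = 16 appears or the generation limit is reached.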

The reproduction (or selection) operator that is most commonly used is the Roulette
wheel method, where members of a population are extracted using a probabilistic
Monte Carlo procedure based on their average fitness. For example, a chromosome
with a fitness of 20% of the total fitness of a population will, on average, make up
20% of the intermediate generation. Apart from the Roulette wheel method many
other selection schemes are possible. An overview is presented in a later chapter.

The heuristics of a GA rely mainly on reproduction and the crossover operator, and
only to a very small extent on the mutation operator. The crossover operator
exchanges parts of the chromosomes (strings) of two randomly chosen members in the
intermediate population and the newly created chromosomes are placed into the new
population. Sometimes instead of two, only one newly created chromosome is put into
the new population; the other one is discarded. The mutation operator works only on a
single chromosome and randomly alters some part of the representation string. Both
operators (and sometimes more) are applied with a certain probability. Figure 3.2
shows the flowchart of the standard genetic algorithm.

The stopping criterion is usually set to that point in time when an individual that gives
an adequate solution to the problem has been found or simply when a set number of
generations has been run. It can also be set equal to the point where the population has
converged to a single solution. A gene is said to have converged when 95% of the
population of chromosomes share the same value of that gene. Ideally, the GA will
converge to the optimal solution; sometimes however a premature convergence to a
sub-optimal solution is observed.

Many variations of the above algorithm are possible and will be discussed in some
detail in later chapters.

Figure 3.2 Flowchart of the standard genetic algorithm



3.1.3 Example of a Generation


The example of section 3.1.1 will be used here to clarify the working of the standard
genetic algorithm. The steps can be traced in the flowchart of Figure 3.2. The stopping
criterion in this case will be:

stopping criterion : is there a chromosome x in the population with E(x) = 0 ?

This corresponds to a chromosome having a fitness of f(x) = 16.

• Create Random Population


The initial population is filled with chromosomes that have randomly valued genes.
For binary valued chromosomes a gene can only take on the values {0,1} and a
chromosome x of length l is defined by x ∈ {0,1}^l. Each gene has an equal probability
of being initialised with either of these values. An example of an initial population
with population size N = 5 is:

    1 1 1 1 1 0 0 1 1
    0 1 1 0 1 0 1 0 1
    0 0 1 1 0 1 0 1 1
    1 1 1 0 0 1 1 0 1
    0 1 1 1 1 1 1 0 1

In the general case a gene a_i will be initialised from a set of values corresponding to
the alphabet of the gene, A_i. For example real-valued genes can be initialised in a
certain range using a normal distribution.

• Evaluate Fitness of Each Member


For every member of the population the fitness, f(x), is now evaluated. For the
example concerned this procedure was described in section 3.1.1. This yields the
following fitness values for the initial population:

Table 3.2 Evaluation of initial population

    Index   Chromosome x         Output of network    E(x)   f(x)
    1       1 1 1 1 1 0 0 1 1    (+1, -1, -1, +1)      16      0
    2       0 1 1 0 1 0 1 0 1    (+1, +1, +1, +1)       8      8
    3       0 0 1 1 0 1 0 1 1    (+1, -1, +1, +1)      12      4
    4       1 1 1 0 0 1 1 0 1                            4     12
    5       0 1 1 1 1 1 1 0 1    (+1, +1, -1, +1)      12      4

In most genetic algorithm systems this assessment is by far the most time consuming
activity, so care must be taken in implementing it.

Since the stopping criterion is not satisfied, a new generation is created from the
present one. First, the intermediate population is made by means of the reproduction
operator.

• Create Intermediate Population


The reproduction or selection operator used here is the common Roulette wheel
operator. Chromosomes are selected according to their relative fitness values and are
placed into the intermediate population, also called the 'mating pool'. The Roulette
wheel selection method is therefore called a proportionate selection method. Other
selection methods will be discussed in section 6.4. The probability, p_select(x), that a
chromosome x is selected is simply its relative fitness. Thus:

    p_select(x) = f(x) / Σ_x f(x)

In the initial population Σ_x f(x) = 0 + 8 + 4 + 12 + 4 = 28.

The Roulette wheel operator is best visualised by imagining a wheel where each
chromosome occupies an area that is sized according to its relative fitness:

Figure 3.3 Roulette wheel for reproduction

Selection of the chromosomes can now be seen as spinning the roulette wheel. When
the wheel stops a fixed marker determines which chromosome will be selected. To
make the intermediate population the Roulette wheel is simply spun 5 times.
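A sketch of this spin, assuming a simple fitness-proportionate wheel (the helper name is illustrative):

```python
import random

def roulette_select(population, fitnesses, n):
    """Spin the wheel n times; each member owns a slice of the wheel
    proportional to its fitness, so fitness-0 members are never picked."""
    total = sum(fitnesses)
    chosen = []
    for _ in range(n):
        spin = total * random.random()   # marker lands in [0, total)
        cumulative = 0.0
        for member, f in zip(population, fitnesses):
            cumulative += f
            if spin < cumulative:
                chosen.append(member)
                break
    return chosen

# The initial population of Table 3.2, with fitnesses (0, 8, 4, 12, 4):
# member 4 is expected to be chosen 12/28 of the time, member 1 never.
pool = roulette_select([1, 2, 3, 4, 5], [0, 8, 4, 12, 4], n=5)
```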

Since the expected number of times that chromosome x will be selected is given by
E_select(x) = N · p_select(x), where N is the population size, this can be expressed as:

    E_select(x) = f(x) / f_avg

where: f_avg = the average fitness of the population

Table 3.3 gives the statistics of the current population. The last column shows the
actual number of times the chromosome is chosen. The intermediate population
therefore consists of two copies of chromosome 4, and one copy each of chromosomes
2, 3 and 5.
In practice this intermediate population or mating pool is normally not actually
formed. Instead the reproduction operator is used to select parents that will be subject
to the crossover operator. The reproduction operator is therefore more accurately
referred to as the selection operator.

Table 3.3 Reproduction of initial population

    Index   Chromosome           f(x)   p_select(x)   E_select(x)   Number selected
    1       1 1 1 1 1 0 0 1 1     0       0.00           0.00             0
    2       0 1 1 0 1 0 1 0 1     8       0.29           1.45             1
    3       0 0 1 1 0 1 0 1 1     4       0.14           0.70             1
    4       1 1 1 0 0 1 1 0 1    12       0.43           2.15             2
    5       0 1 1 1 1 1 1 0 1     4       0.14           0.70             1

• Form the New Population Using Crossover


The next step is using the crossover or recombination operator to generate the new
population. Two chromosomes are selected randomly from the intermediate
population or mating pool and they serve as parents. There is a random chance that
any pair will produce offspring and the overall probability of mating is determined by
pc, the crossover-rate. If selected, crossover is performed and the two resulting
offspring are, after possibly having undergone mutation, inserted into the new
population. If no crossover takes place, the two offspring will be identical to their
parents. In some implementations one offspring is randomly discarded so that the
crossover operator only produces one child. If the crossover operator is not selected
the two parents (or one parent) are simply copied into the new population. Usually the
crossover-rate is set to a value around 0.8 which means that 80% of the new
population will be formed by crossover.

The crossover operators used most often are based on 1-point or 2-point crossover,
depending on the number of crossover-points selected in a chromosome. In general n-
point crossover is possible with n<l. The different versions of the crossover operator
are illustrated by applying them to the XOR weight optimisation example.

◊ 1-Point Crossover
After selecting the parents the crossover-site within the chromosome is randomly
selected and the substrings about the crossover-site are swapped between the two
parents. The crossover site is randomly chosen from {1, ..., l−1}, l being the length of
the chromosomes. This process is illustrated in Table 3.4.

Table 3.4 1-point crossover

    Chromosomes selected   Crossover site = 3   Resulting offspring
    001101011              001 | 101011         001001101
    111001101              111 | 001101         111101011

When crossover is chosen to produce 2 offspring and the population size is odd
(e.g. N = 5 as in the example), the last crossover operation can only contribute one
offspring. An option is to randomly discard the second offspring.

◊ 2-Point Crossover
Table 3.5 gives an example of using 2-point crossover, where two crossover-points are
randomly selected and the substring between these two points is swapped.

Table 3.5 2-point crossover; example 1

    Chromosomes selected   Crossover sites = {4,7}   Resulting offspring
    001101011              0011 | 010 | 11           001101111
    111001101              1110 | 011 | 01           111001001

Usually 2-point crossover is implemented so that the two crossover sites are chosen at
random and independent from each other. When the second crossover site lies to the
left of the first crossover site, the chromosome string is treated as being circular where
the endpoint and starting point are connected. Table 3.6 shows an example of this
situation.

Table 3.6 2-point crossover; example 2

    Chromosomes selected   Crossover sites = {7,4}   Resulting offspring
    001101011              0011 | 010 | 11           111001001
    111001101              1110 | 011 | 01           001101111
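The crossover examples of Tables 3.4 to 3.6 can be reproduced with two small helpers, written here over Python strings; a site k cuts between gene k and gene k+1:

```python
def one_point_crossover(p1, p2, site):
    """Swap the tails after the crossover site (Table 3.4)."""
    return p1[:site] + p2[site:], p2[:site] + p1[site:]

def two_point_crossover(p1, p2, s1, s2):
    """Swap the segment between two sites; if s2 < s1 the chromosome is
    treated as circular and the outer segments are swapped instead
    (Tables 3.5 and 3.6)."""
    lo, hi = min(s1, s2), max(s1, s2)
    swapped_middle = p1[:lo] + p2[lo:hi] + p1[hi:]
    other_middle = p2[:lo] + p1[lo:hi] + p2[hi:]
    if s1 <= s2:
        return swapped_middle, other_middle
    return other_middle, swapped_middle

print(one_point_crossover("001101011", "111001101", 3))
# -> ('001001101', '111101011')
print(two_point_crossover("001101011", "111001101", 7, 4))
# -> ('111001001', '001101111')
```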

◊ Uniform Crossover
Another version of the crossover operator is the so-called uniform crossover. Instead
of using a predefined number of crossover points, the number is chosen

probabilistically and is dependent on the chromosome length. Often a '0.5-uniform
crossover' is used, meaning that every site in the chromosome has a probability of 0.5
of being a crossover site. Thus every gene in the offspring-chromosome has a 50%
chance of inheriting its value from either one of its parents. More generally, when a
p_u-uniform crossover is used (p_u ∈ [0,1]), every site in the chromosome has a chance
p_u of being a crossover site. Thus using chromosomes of length l, p_u-uniform crossover
will result in an average of p_u · l crossover points.
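A p_u-uniform crossover sketch, treating each position as a potential crossover site that toggles which parent the offspring copy from:

```python
import random

def uniform_crossover(p1, p2, pu=0.5):
    """Each of the l sites becomes a crossover site with probability pu,
    so on average pu * l crossover points occur per mating."""
    c1, c2 = list(p1), list(p2)
    swapping = False
    for i in range(len(c1)):
        if random.random() < pu:   # site i is a crossover site
            swapping = not swapping
        if swapping:
            c1[i], c2[i] = c2[i], c1[i]
    return c1, c2
```

With pu = 0.5 every gene ends up with a 50% chance of coming from either parent; with pu = 0 the offspring are exact copies of the parents.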

• Perform Mutation
After the new offspring is formed, mutation is performed on the selected
chromosomes. Mutation is usually implemented as follows: each gene in every
chromosome may undergo mutation with a probability of pm, where pm is the mutation-
rate. The mutation-rate is usually set to a low value such as 0.001.

In our example, since genes are bits, mutation normally just inverts the value of the
gene. In the more general case mutation re-initialises the value of a gene with a
random value taken from the initial distribution or alphabet. In the case of a binary
coded chromosome re-initialising the value of a gene will result in a 50% chance of
inverting it. The effect of the binary 'inversion-mutation' will on average be the
same as the binary re-initialising mutation with half the mutation rate. The
'inversion-mutation' will be used here. The expected number of genes altered by the
mutation operator, E_m, is:

    E_m = l · N · p_m

Here p_m is set to 0.01. Since the total number of genes in our population, l · N, is
5 · 9 = 45, a total of 45 · 0.01 = 0.45 genes are altered on average per generation.
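The inversion-mutation can be sketched as:

```python
import random

def mutate(chromosome, pm=0.01):
    """Binary inversion-mutation: flip each gene with probability pm."""
    return [1 - g if random.random() < pm else g for g in chromosome]

# Expected genes altered per generation: E_m = l * N * p_m,
# e.g. 9 * 5 * 0.01 = 0.45 for the example population.
```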

• Finished Loop
The algorithm now returns to the second step where each chromosome is evaluated
and the stopping criterion is checked. The process continues until the stopping
criterion is met.

3.1.4 Dual Representation and Competing Conventions


GAs rely on two separate representational spaces. One is the representation space or
genotype space, where the actual genetic operations (crossover, mutation) are

performed on the (binary) strings or genotypes. The other space is the evaluation
space or phenotype space where the actual problem-structures or phenotypes are
evaluated on their ability to perform the task and where their fitness is calculated. An
interpretation or mapping function is necessary between the two. This is visualised in
Figure 3.4. The problem space constitutes all the potential solutions relating to the
problem. The evaluation space is in general a subset of the problem space and is
dependent on the representation used. Naturally, the evaluation space should be
'chosen' in such a way that it includes the optimal structure.

Figure 3.4 The dual representation of GAs

In general the interpretation function g maps chromosome or genotype x to phenotype
p and is dependent on the so-called epigenetic environment (EP), which is the
environment in which the mapping (or development) takes place:

    p = g(x, EP)

In biological systems g is highly non-linear and stochastic and depends heavily on EP.


In GAs however it is commonly a deterministic function that is not dependent on its
environment. A simple one-to-one mapping between genotypes and phenotypes
however is practically never found. The fitness function f now measures the
performance of the phenotype p in the problem space (PS) by assigning it a real
positive value:

    f(p, PS) → ℝ+

The performance of a GA depends primarily on how the information needed to solve


the problem is coded; i.e. the representation space, as well as on selection of the
fitness function. Genetic operators generally perform their task on the genotypes
without any knowledge of their interpretation in the evaluation space. This works well
as long as the interpretation function is such that the application of the genetic
operators in the representation space leads to good points in the evaluation space.
Problems occur when a structure (or several very similar structures) in the evaluation
space can be represented by very different chromosomes in the representation space.
Schaffer et al. [56] call this 'competing conventions', but it is also referred to as the
phenomenon of different structural mappings (genotypes) coding the same or very
similar functional mappings (phenotypes) [5]. Basically it means that a unimodal error
landscape becomes multimodal where each peak represents a representation
(convention) of the structure. Standard crossover between two different chromosomes
having the same convention will very likely not result in a useful offspring. This is
because knowledge about the problem space is not built into the standard crossover
operator. A problem dependent crossover operator that incorporates knowledge of the
problem space can help but may be quite difficult to implement.

• Example of Competing Conventions


To illustrate the 'competing conventions' problem consider our example again. In our
neural network the operation of the hidden neurons are not dependent on their position
in the network. Together with the corresponding weights they could be swapped, not
altering the functioning or input-output mapping of the network. This is illustrated in
Figure 3.5 where chromosomes 'ABCDEF' and 'CDABFE' correspond to neural
networks with an identical input-to-output mapping.

Figure 3.5 Illustration of competing conventions. Two very different chromosomes


represent a single phenotype. The phenotype in this context is the input-output
mapping of the neural network

For example when the network performs the XOR function, it can be described as:

XOR = AND ( OR (Input 1, Input 2 ) , NAND (Input 1, Input 2 ) ).

One of the hidden neurons performs the OR function, the other the NAND, but it does
not matter which hidden neuron performs which. In our example both chromosomes (1
1 0, 0 0 1, 1 1 0) and (0 0 1, 1 1 0, 1 1 0 ) perform the XOR function correctly. The
first one by means of AND(OR,NAND) and the second one by AND(NAND,OR).
However, since the functioning of their hidden neurons is swapped with respect to one
another, standard crossover is not expected to yield a useful offspring. This is because
the standard crossover (and mutation) operator does not use any topological
information available in the phenotype. The two individuals suffer from 'competing
conventions'. Instead of one there are two optimal solutions to the problem. This
problem increases when more than two hidden neurons are used, and is thought to be a
main source of poor GA performance on such problems.

3.1.5 The Steady State Genetic Algorithm


Instead of first making an intermediate population and only then applying the genetic
operators another approach may be used where the operators are applied directly to

members of the current population [23]. These members are chosen based on their
fitness. One or more new chromosomes are then merged into the current population
taking the place of a 'doomed' chromosome. This 'doomed' chromosome is usually
chosen based on its inverse fitness. For a single generation step, this process is
repeated until the number of removed chromosomes equals the number of members in
the population. This approach is called a Steady State Genetic Algorithm as opposed
to the standard or Batch Genetic Algorithm. It requires much less memory storage as
only one population instead of two needs to be stored. A certain notion of age can be
built into the system where for a certain number of iterations these newly made
members can not be reselected to create a new offspring.

The replacement of the doomed chromosome by the new chromosome can be


performed in several ways. The doomed chromosome can be chosen probabilistically,
based on its inverse fitness, it can be chosen randomly or it can be chosen to be the
worst-fit chromosome in the population (ranked replacement). Furthermore the
replacement can be unconditional or conditional. In the unconditional replacement
mechanism the new chromosome always replaces the doomed chromosome, while in
conditional replacement the new replaces the doomed only if its fitness is better. If
not, the doomed stays in the population and the new chromosome is discarded.
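One insertion step of a steady-state GA can be sketched as below, using worst-fit (ranked) replacement; the `conditional` flag switches between the two mechanisms, and the names are illustrative:

```python
def steady_state_step(population, fitness, offspring, conditional=False):
    """Merge one new chromosome into the population in place.

    The doomed member is the worst-fit chromosome; with
    conditional=True it is only replaced if the newcomer is fitter.
    """
    doomed = min(range(len(population)),
                 key=lambda i: fitness(population[i]))
    if not conditional or fitness(offspring) > fitness(population[doomed]):
        population[doomed] = offspring
    return population
```

With unconditional replacement the newcomer always enters, even when it is worse than every current member; conditional replacement never lets the population's worst fitness decrease.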

The first and most commonly used genetic algorithm software package based on a
Steady State Genetic Algorithm is 'Genitor', developed by Darrell Whitley.
Genitor uses a linear ranking selection method (see section 6.4) and unconditional
ranked replacement. Of the two offspring made by crossover only one is allowed to
enter the population, replacing the doomed individual; the other offspring is
discarded. This type of Steady State Genetic Algorithm is also referred to as a
'Genitor-type' genetic algorithm.

3.1.6 Parallel Genetic Algorithms


Genetic algorithms can be successfully implemented as a parallel system [51]. As
stated before, the evaluation of the chromosomes is usually by far the most time
consuming part of the algorithm. This part can be implemented in parallel, resulting in
a substantial increase of speed of the total algorithm. A second possible parallelisation
is when instead of one, multiple (sub)populations are used each performing their own
search and only occasionally interacting with each other. Biological genetic systems
are of course highly parallel.

3.1.7 Elitism
Elitism is an optional characteristic of a genetic algorithm. When used, it makes sure
that the fittest chromosome of a population is passed on to the next generation
unchanged; it can never be replaced by another chromosome. Without elitism this
chromosome may be lost. Extended forms of elitism are also possible where the best
m chromosomes of the population are retained. Simple elitism is the case where m = 1.
In effect elitism means that the number of offspring that are generated each generation
is reduced from N to N-m replacing the worst N-m individuals in the population. A
Steady State Genetic Algorithm with ranked unconditional replacement (Genitor-type)
can be seen as a GA using extended elitism with m = N−1.

3.1.8 Extensions of the Standard Genetic Algorithm


Various extensions have been made to overcome some of the shortcomings of the
standard genetic algorithm [59]. A few of these are briefly described here.

• Niched GAs
Niched genetic algorithms are used to preserve information across a diverse
population. The simple standard GA loses information by quite rapidly converging to
a single solution. Niched GAs however, try to maintain several sub-populations of
individuals relating to different fit solutions. They are especially useful in finding a set
of mutually supportive solutions to a problem and have been successfully used in
solving multimodal functions. They can offer a solution to the competing conventions
problem (section 3.1.4). A niche is defined as a region in the fitness landscape with a
high fitness. A niched GA tries to 'fill' each niche with a set of chromosomes in
proportion to the quality of the niche.

There are a number of mechanisms available to achieve niching. The most frequently
used is fitness sharing. Here the normal or unshared fitness of an individual is
degraded depending on the presence of nearby individuals. The distance metric often
used in binary coding is the Hamming distance between the genotypes
(chromosomes). However a distance metric in the evaluation space relating to the
phenotypes of the individuals can also be used. Fitness sharing spreads the population
out over the niches where each niche is filled according to its height. Other niching
methods include restrictive mating schemes where in general only similar
chromosomes are allowed to reproduce.
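Fitness sharing can be sketched as follows, assuming the common triangular sharing function sh(d) = 1 − d/σ for d < σ (and 0 otherwise), with the Hamming distance between genotypes:

```python
def hamming(x, y):
    """Genotype distance: number of differing genes."""
    return sum(a != b for a, b in zip(x, y))

def shared_fitness(population, fitnesses, distance, sigma=3.0):
    """Degrade each fitness by its niche count, using the triangular
    sharing function sh(d) = 1 - d/sigma for d < sigma, else 0."""
    result = []
    for ind, f in zip(population, fitnesses):
        niche = sum(max(0.0, 1.0 - distance(ind, other) / sigma)
                    for other in population)  # includes self, so niche >= 1
        result.append(f / niche)
    return result

# Two identical chromosomes share a niche and split their fitness;
# the isolated chromosome keeps its full fitness.
print(shared_fitness([[0, 0, 0], [0, 0, 0], [1, 1, 1]],
                     [6, 6, 6], hamming, sigma=2.0))  # -> [3.0, 3.0, 6.0]
```

Because crowded niches are penalised, selection pressure spreads the population across the peaks of the fitness landscape instead of converging on a single one.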

• Meta-Level GA
In a meta-level GA, GAs are contained within other GAs. For the simplest case of a
two level GA, the top level GA calls upon the bottom level GA during evaluation.
This bottom level GA can be used to optimise some sub-problem of the overall
problem. A two level GA has been used where one GA is used to control the
parameters (mutation rate etc.) of the other GA.

3.2 Genetic Programming (GP)


Genetic programming is a technique derived from genetic algorithms and was
developed by John Koza [40]. Genetic programming can be seen as a special kind of
genetic algorithm but differs in that it uses hierarchical genetic material whose size is
not predefined. The members of a population are tree structured programs and the
genetic operators work on the branches or single points of these trees. Originally
genetic programming was implemented in the LISP programming language, because
of its built-in tree like data structures (S-expressions), but it has been implemented in
various languages since.

When the potential solutions to a problem have a hierarchical tree structure


themselves, GP offers a much more natural chromosomal representation than a
standard linear-string GA does. A distinct advantage of GP over GA is that the size
and shape of the final solution does not need to be known in advance. The tree
structured chromosomes typically vary their size and shape over the course of a
generation. Research has shown that GP can be successfully applied to many problems
in the fields of artificial intelligence, machine learning and symbolic processing [40].

• Representation
In GP the chromosomes are made up of a set of functions and terminals connected to
each other by a hierarchical tree structure. The endpoints or leaves of the chromosomal tree are defined by the terminals; all the other points are functions.
Typically the set of functions (denoted by 'F') includes arithmetic operations, logical
operations and problem specific operators. The terminal set (denoted by 'T') is made
up of the data inputs to the system and the numerical constants. Functions can
generally have other functions as well as terminals as their arguments and must
therefore be well-defined to handle any input combination. The number of arguments
a function has must be defined beforehand. GP incorporates 'variable selection'; it is not necessary to specify a priori which data inputs are going to be used. These are selected on the run, which can be a useful concept when it is not known in advance exactly which data inputs are needed in order to solve the problem. Figure 3.6 shows an example of a very simple chromosome, made up of the functions AND and OR and the data terminals D0 and D1. The function set for this example could be: F = {AND, OR, XOR} and the terminal set simply: T = {D0, D1}.

Figure 3.6 Example of a simple chromosome in GP

This chromosome is represented in the GP system by the following 'LISP S-expression':

AND (AND (D0, D1), OR (D0, D1))

• Evaluation
During evaluation of the chromosome the data inputs D0 and D1 are assigned actual input values. The output of the chromosome (= program) is then calculated as the value of the top-most function-point in the tree, the root, and is used in the fitness function as a measure of the performance of the individual on the problem. In this example of Boolean functions, the tree representation used by GP is much more natural than a string representation used by a GA.
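The representation and evaluation described above can be sketched in a few lines of Python. This is an illustrative fragment only: the `Node` class, the `FUNCTIONS` table and the `evaluate` method are assumed names, not part of any particular GP system.

```python
# Minimal sketch of a GP chromosome as a tree of functions and terminals.
# The function set F = {AND, OR, XOR} and terminal set T = {D0, D1} follow
# the example in the text; the Node class itself is an illustrative choice.

FUNCTIONS = {
    "AND": lambda a, b: a and b,
    "OR":  lambda a, b: a or b,
    "XOR": lambda a, b: a != b,
}

class Node:
    def __init__(self, label, children=()):
        self.label = label              # function name or terminal name
        self.children = list(children)

    def evaluate(self, inputs):
        """Evaluate bottom-up: terminals read the inputs, functions
        combine the values returned by their argument subtrees."""
        if not self.children:           # a terminal (leaf)
            return inputs[self.label]
        args = [c.evaluate(inputs) for c in self.children]
        return FUNCTIONS[self.label](*args)

# The chromosome of Figure 3.6: AND(AND(D0, D1), OR(D0, D1))
chromosome = Node("AND", [
    Node("AND", [Node("D0"), Node("D1")]),
    Node("OR",  [Node("D0"), Node("D1")]),
])

print(chromosome.evaluate({"D0": True, "D1": False}))   # False
print(chromosome.evaluate({"D0": True, "D1": True}))    # True
```

The output of the root node is exactly the value fed to the fitness function for this individual.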

• Selection, Crossover and Mutation


As in the standard genetic algorithm paradigm, genetic programming relies mainly on
the reproduction mechanism and the crossover operator. The flowchart for the
standard genetic algorithm (Figure 3.2) also applies for genetic programming and the
same reproduction mechanisms apply. Crossover is performed on branches of trees,
which means that entire branches or subtrees are swapped between two chromosomes.
This is shown for a simple example in Figure 3.7.

Figure 3.7 Example of crossover in GP

Mutation re-initialises a randomly chosen point (= gene) in the tree. In general this can
be a function or a terminal. An example is shown below where a function-point is
chosen to undergo mutation.

Figure 3.8 Example of mutation in GP
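Both operators can be sketched on trees encoded as nested lists. The helper names (`points`, `replace`) and the list encoding below are illustrative choices, not a standard GP implementation: crossover swaps randomly chosen subtrees between two parents, and mutation re-initialises a randomly chosen point.

```python
import random

# Illustrative sketch of GP crossover and mutation on trees represented
# as nested lists: [function, child, child, ...]; bare strings are terminals.

FUNCTIONS = ["AND", "OR", "XOR"]   # every function here takes two arguments
TERMINALS = ["D0", "D1"]

def points(tree, path=()):
    """Enumerate (path, subtree) pairs for every point in the tree."""
    yield path, tree
    if isinstance(tree, list):
        for i, child in enumerate(tree[1:], start=1):
            yield from points(child, path + (i,))

def replace(tree, path, new):
    """Return a copy of tree with the subtree at path replaced by new."""
    if not path:
        return new
    copy = list(tree)
    copy[path[0]] = replace(tree[path[0]], path[1:], new)
    return copy

def crossover(parent1, parent2, rng):
    """Swap a randomly chosen subtree of parent1 with one of parent2."""
    path1, sub1 = rng.choice(list(points(parent1)))
    path2, sub2 = rng.choice(list(points(parent2)))
    return replace(parent1, path1, sub2), replace(parent2, path2, sub1)

def mutate(tree, rng):
    """Re-initialise a randomly chosen point: a terminal becomes another
    terminal, a function-point gets a new function label (arity kept)."""
    path, sub = rng.choice(list(points(tree)))
    if isinstance(sub, list):
        new = [rng.choice(FUNCTIONS)] + sub[1:]
    else:
        new = rng.choice(TERMINALS)
    return replace(tree, path, new)

rng = random.Random(1)
t1 = ["AND", ["AND", "D0", "D1"], ["OR", "D0", "D1"]]
t2 = ["XOR", "D0", ["OR", "D1", "D0"]]
c1, c2 = crossover(t1, t2, rng)
print(c1, c2, mutate(t1, rng))
```

Note that subtree crossover conserves the combined number of tree points of the two parents, while the offspring themselves can grow or shrink, which is how GP chromosomes vary in size and shape over the generations.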



• Automatically Defined Functions (ADFs)


An interesting feature within the GP paradigm which accounts for modularity is the
possibility to include the so called Automatically Defined Functions (ADFs). These
ADFs perform a subtask of the problem and can be called upon more than once. An
ADF does not have particular fixed terminals as its inputs, but instead is parametrised
by dummy variables. When an ADF is called upon from within the main program
(Koza calls this the result producing branch) of the chromosome, the dummy variables
are instantiated with specific values or terminals. The ADFs are defined in the so
called function-defining branches. The complete genetic tree that represents a certain
solution therefore consists of a result-producing and one or more function-defining
branches depending on the number of ADFs used. Figure 3.9 (with conventions as used by Koza) shows an abstraction of the overall structure of a chromosome with two ADFs: ADF0 and ADF1.

Figure 3.9 Overall abstracted structure of a chromosome with two ADFs

The names PROGN, DEFUN etc. are labels used in the actual representation of the chromosome in the GP system. An ADF is defined by its name (e.g. ADF0), by the list of its dummy arguments (ARG0 and ARG1) and by the actual function as defined in the body. This function is just another tree structured program like the one in Figure 3.6, as is the result producing branch. When ADFs are present, the function set is

extended with the ADFs. In the example above, the function set would now be: F = {AND, OR, XOR, ADF0, ADF1}, with ADF0 being a function taking two arguments and ADF1 taking three.

As an illustration Figure 3.10 shows an example of the body of ADF0 and of the result producing branch or main program of a chromosome.

When the chromosome is evaluated the result producing branch is computed, where the body of ADF0 is called upon whenever the function ADF0 is encountered. The ADF body is then instantiated with the appropriate arguments from the main program, which can be other functions or terminals, and its output is returned. This evaluation always takes place 'bottom-up': the outputs of the functions are fed from the bottom of the tree towards the top (or root).

Figure 3.10 Example of the body of ADF0 (left) and of the result producing branch or main program of a chromosome

The genetic operators work on both branches. The idea is that GP will dynamically
evolve functions that are useful to the problem (ADFs) as well as a main program that
calls upon these functions. A parallel can be drawn here to the field of neural networks
where a certain part of the network performs a function that can be seen as a subtask
for the complete problem. The difference is that its position within the neural network is fixed and that it is of no use to the network if it needs this same function somewhere else but with different inputs.
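The calling mechanism for an ADF can be sketched as follows. The ADF body chosen here (XOR of its two dummy arguments) and the nested-list encoding are illustrative assumptions, not Koza's actual representation.

```python
# Illustrative sketch of an Automatically Defined Function: ADF0 is a tree
# over dummy arguments ARG0/ARG1; the result-producing branch calls it with
# concrete argument subtrees.

FUNCTIONS = {
    "AND": lambda a, b: a and b,
    "OR":  lambda a, b: a or b,
    "XOR": lambda a, b: a != b,
}

# function-defining branch: ADF0(ARG0, ARG1) = XOR(ARG0, ARG1), as an example
ADF0_BODY = ["XOR", "ARG0", "ARG1"]

def evaluate(tree, env):
    """Evaluate a nested-list tree bottom-up in an environment mapping
    terminal names (D0, D1, ARG0, ...) to values."""
    if isinstance(tree, str):
        return env[tree]
    label, *children = tree
    args = [evaluate(c, env) for c in children]
    if label == "ADF0":
        # instantiate the dummy variables with the computed argument values
        return evaluate(ADF0_BODY, {"ARG0": args[0], "ARG1": args[1]})
    return FUNCTIONS[label](*args)

# result-producing branch: AND(ADF0(D0, D1), OR(D0, D1))
main = ["AND", ["ADF0", "D0", "D1"], ["OR", "D0", "D1"]]
print(evaluate(main, {"D0": True, "D1": False}))   # True
```

Because the dummy arguments are bound afresh on every call, the same ADF body can be reused anywhere in the result-producing branch with different inputs, which is exactly the reuse a fixed sub-network cannot offer.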

• Steady State Genetic Programming


As in the genetic algorithm paradigm, there exists a steady state approach to genetic
programming. Steady State Genetic Programming has proven to be advantageous over
the standard, or batch, GP paradigm in certain applications [40].

3.3 Evolutionary Algorithms (EAs)


Evolutionary algorithms (see e.g. [14]) are another form of evolutionary computation but, unlike GAs (and GP), they focus on phenotypes rather than genotypes. There is
no need for a separation between the recombination space and an evaluation space as
in Figure 3.4. The genetic operators work directly on the actual structure or
phenotype. The structures used in evolutionary algorithms are representations that are
problem dependent and more natural for the task than the general representations used
for GA. The representation used is a vector of real values, identical to a real-valued
chromosome in a GA. In EAs however the real-valued numbers are seen more as traits
(or phenes) than genes. EAs therefore focus more on the behavioural link between
parents and offspring while GAs focus on the genetic link.

Originally evolutionary algorithms focused on a single parent only, but extensions have been made for populations consisting of more members. Evolutionary algorithms can be divided into Evolution Strategies (ES), which focus on the behaviour of individuals, and Evolutionary Programming (EP), which focuses on the behaviour of entire species. The distinction is subtle and in general not very clear. In many cases
both terms are virtually equivalent. In EP, the only genetic operator used is the
representation-dependent mutation operator, although several different mutation
operators can be used in the same algorithm. A commonly used mutation operator just
adds a Gaussian random variable to each component of a chromosome. Because ES
deal with individuals instead of entire species, sexual operators (crossover) are
possible as well and extensions have been made to include these.

In ES a potential problem solution is a vector x consisting of l real-valued 'genes'. Although the vector components are normally seen more as phenes than as genes, the term gene is used here to ease the comparison between the fields of ES and GAs. So:

x = (a_1, ..., a_l),   a_i ∈ ℝ



ES systems (see e.g. [48]) are commonly categorised as being:

(μ/ρ + λ)-ES   or   (μ/ρ, λ)-ES

where μ = the number of parents in the population = population size
λ = the number of offspring
ρ = the number of parents taking part in the reproduction of offspring

During the course of a generation the μ parents initially create λ offspring by means of mutation and sometimes recombination. Then the intermediate population consisting of parents and offspring is reduced to the original size by means of a 'selection' process which simply retains the best μ individuals and discards the rest. The '+' and ',' denote the selection method used. In a (μ/ρ, λ)-ES the parents can not be selected as members of the next generation, while in a (μ/ρ + λ)-ES system they can. The integer ρ, also called the mixing number, denotes how many parents mix their genes during the creation of offspring. In the case ρ = 2 two parents mix their genes by means of a crossover operator to produce offspring (typically one). The offspring are then mutated. In the absence of crossover (mutation only) ρ = 1. The first systems developed were (1 + 1)-ESs where a single parent produces a single offspring that replaces it if it is better and is discarded otherwise. Multimembered ESs were developed later including the addition of crossover operators. There is no selective pressure in a multimembered ES; every individual has an equal chance of producing offspring.

An important feature of ESs is that the range of mutations, the step size, is not fixed but inherited. It is unique to an individual and generally different for each gene. An individual is represented by the pair of vectors v:

v = (x, σ)

x denotes a point in the search space consisting of l genes and σ is a vector of the same length consisting of standard deviations, one for each gene. Mutation creates a new offspring x' from x by adding to it a Gaussian number with mean 0 and standard deviation σ:

x' = x + N(0, σ)

Although not present in the earliest models, σ is normally adapted during the mutation process as well. A commonly used method is:

σ' = σ · e^N(0, Δσ)

where Δσ is a system parameter. A commonly used crossover operator creates a single offspring (x, σ) from two parents (x¹, σ¹) and (x², σ²) by randomly mixing their genes (as in uniform crossover in GAs) and their step sizes. Mutation is performed after this to complete the process.
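The mutation and self-adaptation rules above can be combined into a small (μ + λ)-ES sketch. The population sizes, the Δσ value and the sphere test function below are arbitrary illustrative choices, not prescribed by the text.

```python
import math
import random

# Illustrative (mu + lambda)-ES with self-adaptive step sizes, minimising
# the sphere function f(x) = sum(x_i^2). All parameter values are examples.

def sphere(x):
    return sum(v * v for v in x)

def evolve(mu=5, lam=20, n_genes=3, delta_sigma=0.2,
           generations=200, seed=0):
    rng = random.Random(seed)
    # an individual is (x, sigma): a point and its per-gene step sizes
    pop = [([rng.uniform(-5, 5) for _ in range(n_genes)],
            [1.0] * n_genes) for _ in range(mu)]
    for _ in range(generations):
        offspring = []
        for _ in range(lam):
            # every parent has an equal chance of reproducing
            x, sigma = rng.choice(pop)
            # sigma' = sigma * exp(N(0, delta_sigma)), then x' = x + N(0, sigma')
            new_sigma = [s * math.exp(rng.gauss(0, delta_sigma)) for s in sigma]
            new_x = [v + rng.gauss(0, s) for v, s in zip(x, new_sigma)]
            offspring.append((new_x, new_sigma))
        # '+'-selection: parents compete with offspring, the best mu survive
        pop = sorted(pop + offspring, key=lambda ind: sphere(ind[0]))[:mu]
    return pop[0]

best_x, best_sigma = evolve()
print(sphere(best_x))
```

Replacing `pop + offspring` by `offspring` alone in the selection step turns this into the corresponding (μ, λ) variant, in which parents cannot survive into the next generation.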

Many extensions and alterations have been made on the basic ES scheme described
here. It is interesting to note that although the fields of GAs and ESs vary in a number
of ways, quite a few ideas are being taken from one field and implemented in the
other. Examples are the introduction of the crossover operator in ES and real-valued
instead of binary encoding with 'creeping' or additive mutations in GAs. Also the idea
of adaptive parameters, especially the mutation rate, has received a lot of attention in
the GA community lately.
4. The Biological Background
Since practically all ideas and certainly most of the nomenclature in the field of
evolutionary computation are taken from its biological counterpart, a brief
introduction of genetics [42],[58] is presented in this chapter together with an
overview of the main concepts of Darwinian evolutionary theory. First, the genetic
structures as observed in nature are described. Second, the actual process of reproduction and the occurrence of mutations are dealt with. Third, the process of natural evolution is described in section 4.4 in terms of present-day evolutionary theories.
Fourth, the link is made between this biological background and the field of
evolutionary computation (focused on genetic algorithms).

4.1 Genetic Structures


In cells the information which determines their function is carried in chromosomes. A
position on a chromosome is called a locus, which can be thought of as a box, and it is
taken up by two genes. Genes are therefore the structures from which a chromosome
is made. There are a multitude of different sets of genes, each of them specific to one
locus. The set of genes which relate to a specific locus are called alleles. A locus can only contain two of these, so a locus A has a set (a_1, a_2, ..., a_k) of alleles from which two can be chosen. If there are two alleles to choose from (k = 2) a locus is said to be diallelic and if there are more, multiallelic.

A chromosome consists of two strands called chromatids joined together at one point
by a centromere. Chemically the genetic information in a chromosome is carried by
the nucleic acids DNA and RNA.

All cells in an organism are identical in their chromosomal content. There is thought
to be some switching mechanism which, together with the position of a cell in the
organism, determines which genes become operative and which do not. This in turn
determines the specialisation of a cell; i.e. if a cell operates as a liver cell or a skin
cell.


• Genotypes and Phenotypes


A genotype of an individual at a single locus is the pair of genes contained in it. When the two genes are identical (i.e. a_i a_j, i = j) the genotype is said to be homozygote, if not (a_i a_j, i ≠ j) heterozygote. The complete genotype of an individual, or the total genetic package, is the set of all genotypes over all loci or the totality of all chromosomes.
This is also known as the genome of an organism. What is actually observed is the
phenotype and it is formed by the interaction of a genotype with its environment.
Different genotypes may result in the same phenotype. A single characteristic
observed in an organism, such as eye-colour, is referred to as a trait.

• Dominant and Recessive Genes


In the case of a diallelic locus one of the alleles, a_1, is often a dominant gene and the other, a_2, a recessive one. The dominant gene is always expressed in the phenotype, the recessive gene only in the absence of the dominant one. So in genotypes a_1a_1, a_1a_2, and a_2a_1, a_1 will be expressed in the phenotype. Only in a genotype a_2a_2 will the recessive gene a_2 be expressed. When neither gene is dominant or recessive, both genes will be expressed in the phenotype, resulting in a mixture of the influence of each. This is called partial dominance and can often be seen, for example, in skin colour.

• Epistasis
The way a gene is expressed in the phenotype or whether it is expressed at all often
depends on the presence or absence of another gene. When there is such an interaction
between genes in the expression of the genotype, it is called epistasis. The most
common form of epistasis is the masking effect. This means that a gene acts as a mask
for one or more other genes. When the masking gene is present in the chromosome it completely 'turns off' this set of genes; i.e. these genes are not expressed in the phenotype. In the absence of the masking gene they are.

4.2 Reproduction
In organisms there are two reproductive methods by which cells divide to form new
cells. The first kind is mitosis, where the parent cell simply divides itself in two cells
identical to the parent. This is the main method by which organisms produce new cells
in order to grow larger. It is also part of asexual reproduction as used by simple
organisms. The second one is meiosis, or 'reduction division', and is used for sexual

reproduction. Meiosis produces four cells from one parent cell. In sexual reproduction
special reproductive cells called gametes are used.

When two organisms perform sexual reproduction, each of them produces gametes
(the sperm of the male and the egg of the female) by means of meiosis. Normal cells
in an organism carry pairs of chromosomes of each type and are said to be diploid.
The two chromosomes in a pair are called homologous chromosomes. A gamete
carries only one such set of chromosomes and is said to be haploid. Thus a haploid
cell contains half the number of chromosomes of a diploid cell. Also, in a gamete,
instead of two genes at every locus there is only one gene. Each of the two genes of a
locus of a cell before meiosis has a chance of 50% of ending up at the locus of the
gamete; this process is called segregation. This is Mendel's First Law, which says that
characteristics of organisms are carried in pairs and only one of each pair can be
carried by a gamete, each having equal chance of ending up in the gamete. The second
stage of sexual reproduction is fertilisation where the gametes of the male and female
unite to form one new cell called a zygote, restoring the original count of
chromosomes and again having two genes at each locus.

Mendel's Second Law concerns the independent segregation of genes specifying different traits. It says that during the formation of gametes, allelic pairs specifying different traits segregate independently from each other. For example when an organism contains allelic pairs a_1a_2 for trait a and b_1b_2 for trait b, the gametes may contain a_1b_1, a_1b_2, a_2b_1, or a_2b_2. This second law however only holds under certain restrictions.

The complete set of chromosomes in every cell consists of autosomal chromosomes, which determine the characteristics of the individual, and of sex chromosomes, which determine the sex of the organism. Often a trait is not expressed by a single gene but
by several. Also a single gene often has more than one effect on the phenotype, and is
then called pleiotropic.

Another phenomenon found in reproduction is gene linkage. It is found that during the formation of gametes, alleles associated with the same chromosome remain together in the offspring. For example alleles such as a_1b_1 or a_2b_2 may be linked in the offspring forming a linkage group.

4.3 Mutations
Apart from the normal processes described above, comparatively rare events called
mutations can occur. A mutation is a change in a chromosome which may result in a
change in the characteristic of a cell or an organism. A mutated individual is called a
mutant. Most often mutations are harmful to the cell or organism resulting in disease
or even death. When they are beneficial however, they have great effect, providing a
basis for variation between and within a species. This ensures that species can adapt
to changing environments. Mutations can be divided into chromosome mutations and
gene mutations.

4.3.1 Chromosome Mutations


This section briefly describes a number of chromosome mutations.

• Abnormal Number of Chromosomes


One form of a chromosome mutation occurs during meiosis when the zygote ends up with an abnormal number of chromosomes. This usually results in death of the organism; one exception being Down's syndrome in humans.

• Recombination
Another form of chromosome mutation that also occurs during meiosis is
recombination. During meiosis the homologous chromosomes are intimately
intertwined and various types of mixing of chromosomes can occur when they wrap
around each other. This type of general or homologous recombination is also known
as crossover.

Points of attachment in a chromosome are called chiasmata and define the points
where a chromosome might break and rejoin with the homologous chromosome next
to it. A single crossover involves the swapping of the parts of two chromosomes at a
single chiasma. Double or triple crossover can occur when chromosome parts are
swapped at more than one place. The probability that two different linked alleles cross
over together (i.e. end up in the same offspring) is a function of how close they are
together on the chromosome. The closer they are together, the higher the frequency.

ABC|DEF        ABCdef
           =>
abc|def        abcDEF

Figure 4.1 Single crossover

When crossover occurs between homologous chromosomes but at two different


positions, it is called unequal crossover. This results in two chromosomes of unequal
length.

ABAB|CD        ABABABCD
           =>
AB|ABCD        ABCD

Figure 4.2 Unequal crossover

Unlike the above cases certain recombinations are non-reciprocal and only one of the
offspring is changed by crossover while the other remains unaffected. This is referred
to as gene conversion.

ABC|DEF        ABCdef
           =>
abc|def        abcdef

Figure 4.3 Gene conversion

Still other forms of recombination are possible, often resulting in more subtle changes
in the DNA structure of the chromosomes.

• Inversion
Inversion occurs when a chromosome section breaks off and the broken part turns and
rejoins the rest of the chromosome resulting in a reverse order of the genes in that
section.

• Deletion
Deletion is the phenomenon where a chromosome section breaks off and is omitted
from the chromosome altogether. The two loose ends of the chromosome then join up
resulting in a shorter chromosome.

• Translocation
When crossover occurs between two non-homologous chromosomes, this is called
translocation. This phenomenon is also known as non-homologous recombination.

• Polyploidy
Occasionally, because of an erroneous meiosis, a diploid gamete is produced instead
of a normal haploid one. When this gamete is united with a normal haploid gamete
during fertilisation, the resulting zygote will have three sets of chromosomes instead
of two and is called triploid. If two of those abnormal diploid gametes unite, the result
is a tetraploid zygote. This phenomenon is called polyploidy and although rare in
animals, it is quite frequently found amongst plants and can actually be beneficial for
the organism.

4.3.2 Gene Mutations


Gene mutations (also called point mutations) are confined to a change in a single gene
only and are the result of a chemical change in the structure of the gene. They are
thought to play the most important part in contributing evolutionary changes to
organisms. Since most of the DNA code seems to be redundant (so called 'genetic
garbage') in most cases mutations within a gene do not have any effect on the
phenotype at all: the mutations are neutral. When they do however the effects can be
enormous. Most mutations are deleterious for the organism, in the case of a lethal
gene even resulting in death. The small percentage of mutations that are beneficial
provide an increased fitness to the organism and its influence can spread throughout
the population.

4.4 Natural Evolution


Some comments regarding theories on natural evolution are presented in this section.
The most influential theory on evolution was proposed by Charles Darwin in the last
century, and his main ideas form the basis of most present-day evolutionary theories.
First, the term evolution itself must be clarified. As defined in the field of biology
evolution is: a change in the gene pool of a population over time. It is therefore a population-level phenomenon and basically says that organisms evolve from common ancestors. Biologists and, in fact, the vast majority of the scientific community, treat evolution simply as a fact, bearing in mind that in pure science even facts are not 100% provable. Most biologists also treat it as a fact that all modern life originates from

a single common ancestor. In everyday use the term evolution is often confused with a
specific evolutionary theory such as the one proposed by Darwin, which tries to
explain how evolution actually works.

While the kind of evolution described by Darwin normally takes place over very long
time spans and observations of it are based on fossil records, evolution can be, and has been, directly observed within a span of only several years. For this reason the
distinction between microevolution and macroevolution is often made. While some
biologists feel the mechanisms of both are different, most simply treat macroevolution
as a long cumulative series of microevolutions.

Evolutionary mechanisms can basically be grouped into two categories: those that
increase genetic variation and those that decrease it. The mechanisms that increase
variation are the mutations occurring during reproduction as described in the last
section as well as a concept called gene flow. Gene flow simply means that new
genetic information is introduced into the population by migration from another
population. It occurs when two more or less related organisms from different populations mate. The mechanisms decreasing genetic variation are natural selection and genetic drift, and these are now described in more detail.

• Natural Selection
In Darwinian evolutionary theory natural selection is seen as the creative force of
evolution. When supplied with genetic variation it makes sure that sexually
reproducing species can adapt to changing environments. In the course of evolution
natural selection preserves the favourable part of the variation within a species. It
often does this by letting the fittest individuals of a species produce the most offspring
for the next generation. It provides a selective pressure that favours the fitter
individuals of a population. The theory of natural selection is therefore often referred
to as the survival of the fittest. This term is misleading for a number of reasons.

Reproduction ('survival') of the organism itself is not the driving force of natural
selection. The driving force is the contribution of the organism's alleles to the next
generation's gene pool. Natural selection favours selfish behaviour but does so more
at the level of genes than at the level of organisms. For example it can be beneficial
for an organism to help other organisms reproduce that are closely related to it, i.e.
share many of the same alleles, sometimes even sacrificing its own chances of
reproduction or even its own life. For this reason fitness is often split into two

components: direct fitness, which is a measure of how many alleles the organism can
enter into the next generation's gene pool by reproduction of itself, and indirect
fitness, which measures how many alleles identical to its own but belonging to other
organisms it helps enter the gene pool. Natural selection works in such a way as to
increase the combination: the inclusive fitness.

Another point against the term "survival of the fittest" is that survival is only one
component of selection. Another one, often even more dominant, is sexual selection.
In many species males have to compete against each other for mates. This competition
can be physical or it can be ruled by female choice. In the latter case organisms evolve
traits, 'status symbols', which are favoured by females for sexual selection. In some
species where very few males monopolise all females, many males live to
reproductive age but very few of them ever mate. While they perhaps do not differ in
their ability to survive, they do differ in their ability to attract mates. The fitness of an
organism is therefore not just a measure of its physical abilities, it is often much more
a measure of its sexual attractiveness.

For natural selection to be a creative force, the genetic variation must be random and
its effect relatively small. This is the case in present evolutionary theories. A
fundamental concept of Darwinism often not understood is that evolution has no
direction and that there really is no sense of progress where certain organisms are
'better' than others. Organisms just become better adapted to their environments. The
changes made may in fact prove harmful when the environment changes. A related
popular notion is that natural selection favours organisms with a high level of
complexity resulting in an 'evolutionary ladder' from simple one-celled organisms to
the ultimate creation: man. In fact by far the most successful species in the past and
present are the simplest of them all: bacteria, whose existence is incidentally crucial to
our own. From evolutionary theory it should be concluded that the evolution of
mankind is nothing more than a lucky outcome of thousands of linked events and by
no means inevitable.

• Genetic Drift
Even without a selective pressure contributed by the mechanism of natural selection,
there is a mechanism at work that decreases genetic variation. If it were the case that each organism had an equal chance of producing offspring (i.e. no selective pressure) and there was no mechanism for introducing variation, genetic variation would still decrease by means of genetic drift. Genetic drift is simply the binomial sampling error of the gene pool. The organisms that happen to reproduce increase the frequency of their alleles

over the population. In the next generation the frequency of these alleles is expected to
increase even more simply because there is a larger chance that an organism
possessing them is chosen to reproduce. Without mechanisms to introduce variation, the effect of genetic drift (with or without natural selection) would ultimately be a complete lack of genetic variation in the gene pool.
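The binomial-sampling view of genetic drift can be illustrated with a short simulation. The population size and initial allele frequency below are arbitrary, and the model (a single diallelic locus, no selection, no mutation) is a deliberate simplification.

```python
import random

# Illustrative simulation of genetic drift: a gene pool of n_genes genes at a
# single diallelic locus, no selection, no mutation. Each generation the new
# gene pool is a binomial sample of the old one, so the allele frequency
# wanders until the allele is lost (0.0) or fixed (1.0) -- variation is gone.

def drift(n_genes=100, p0=0.5, max_generations=10_000, seed=3):
    rng = random.Random(seed)
    p = p0
    for gen in range(max_generations):
        if p in (0.0, 1.0):          # fixation: no genetic variation left
            return p, gen
        # sample n_genes genes from a pool with allele frequency p
        count = sum(rng.random() < p for _ in range(n_genes))
        p = count / n_genes
    return p, max_generations

freq, generation = drift()
print(f"allele frequency {freq} after {generation} generations")
```

Even though every gene has an equal chance of being sampled, the frequency eventually reaches one of the absorbing states 0.0 or 1.0, which is the complete loss of variation described above.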

• Preadaptation
One of the main difficulties for evolutionary theories is to explain how complex
structures in organisms evolved from scratch. For example it can be very beneficial
for an organism to have an eye, but since evolution works in small steps, how
beneficial can it be to have say 5% of an eye? This is usually explained by the concept
of preadaptation. Preadaptation states that a structure in an organism can change its
function radically while its form remains approximately the same; i.e. functional
change in structural continuity. In the first steps towards the evolution of an eye, the
structure serves a different purpose than vision. This purpose has to be beneficial for the organism for it to be rewarded by natural selection.

• Sexual and Asexual Reproduction


Simple organisms in relatively stable environments often reproduce asexually.
Asexual reproduction produces offspring that are very similar to their ancestors. When
the environment changes though, such an organism finds it very hard to change in
time. Major evolutionary change can only occur when there is a large store of genetic
variability available. This is the case for sexually reproducing organisms, where
natural selection is used to maintain the most favourable genetic variants of a large
genetic pool.

• Niching
Biological systems use a restrictive mating scheme to encourage the formation of
species: speciation. Only organisms in the same niche can mate with each other. A
group of organisms within a species is called a population of organisms. When a
population differs to a certain extent from the rest of the species it forms its own niche
and can ultimately form a new species.

• Evolution and Learning


There is a distinction between characteristics that an organism has inherited from its
ancestors, i.e. by means of evolution, and that which it has learned during its lifetime:
individual learning. Classic genetic theory dating from Mendel stated that anything

that an organism learned in its lifetime was not physically passed on to future
generations [58]. The opposite view was Lamarckian inheritance where learned
characteristics are passed on to offspring. Lamarckism is not widely accepted.

Evolutionary phenomena that do appear to require Lamarckian learning are commonly accredited to the Baldwin Effect, introduced by James M. Baldwin late last century.
The Baldwin Effect rejects Lamarckian inheritance and postulates that instead of
passing on learned information, the criteria for fitness of the organisms changed. This
way individual learning does change the path of evolution. Baldwin further proposed
that during evolution these learned behaviours eventually become instinctive
behaviours of the organisms.

An illustrative example of the Baldwin Effect is the evolution of 'normal' to tree-jumping squirrels. Suppose one group of squirrels (a population) learns to jump from
tree to tree while another group does not. The group that did learn this characteristic is
rewarded by evolution for things that help the squirrels in this task: e.g. developing
webbing between the toes. The fitness of the squirrels is changed in that the ability for
tree-jumping is now rewarded. The initial development of webbing between toes is
just caused by genetic variation during reproduction. Over several generations,
squirrels in the population that learned to jump trees are increasingly born with the
traits needed for tree jumping, and tree jumping becomes an instinct. In short, the
Baldwin Effect says that the fitness function is changed by individual learning.

• Optimisation
Natural selection does not necessarily have the effect of producing optimal structures
or behaviours. For one thing it acts on the organism as a whole, not on specific traits.
There is only one fitness measure (the inclusive fitness) that is influenced by many
factors. Many species are stuck in so-called local optima simply because the transition
from the local optimum to a global optimum (assuming there actually is one) is very
unlikely: this transition would normally involve passing through less adaptive
states. Natural selection does not cater for this; the only way the species can reach a
state with a higher fitness is by a lucky variation (mutation) or a combination of these.
Since environments are generally non-stationary, even being in a very fit state does
not mean the species will continue to thrive in the future. In fact when a species has
specialised itself to function perfectly in a certain environment, it is likely to find
difficulties in adapting if this environment happens to change. Natural selection has no
mechanism that provides future planning. It is a purely local mechanism.
4.4 Natural Evolution 53

• The Darwin Machine


The main themes of the evolutionary theory as described above are commonly
referred to as Darwinism. There is strong evidence that evolution is not the only
Darwinian process found in nature. On a much smaller time scale of days to weeks, a
process similar to evolution appears to take place when the immune system
of an organism produces antibodies against a virus infection. Through a series of cellular
generations the immune system evolves antibodies that become better and better
('fitter') in defending the organism against the invaders. It is even postulated [7] that
mental activity in the human brain is governed by a Darwinian process where ideas
'compete' with each other resulting in the 'fittest' idea actually being passed on. Thus
intelligent (and creative) solutions to problems can be generated by a Darwinian
process operating on a very small time scale. An example of this would be when
someone throws a stone at some distant object. The idea is that the brain makes a
model of this situation where potential 'solutions' are being tested in a matter of
milliseconds. The fittest one of these determines the course of action.

To abstract from the specific Darwinian theories described in this section, the concept
of a minimal Darwin Machine is introduced [7]. A Darwin Machine, being a system
ruled by a Darwinian process, must have the six essential properties listed in Table 4.1.
For each property the corresponding occurrence in Darwinian evolutionary theory is
given.

A Darwin Machine is a process satisfying these six requirements, and Darwinian
evolutionary theory is such a process. The competition between patterns/organisms for
some limited resource, such as territory or food, is generally thought to be a main
force behind the evolution of complexity and diversity. In order to remain competitive
against rivals, patterns must often develop complex traits different from existing ones.

Table 4.1 The requirements for a minimal Darwin Machine illustrated by their
occurrence in Darwinian evolutionary theory.

Requirements | Darwinian Evolutionary Theory
1. The system operates on patterns of some sort. | The genotypes are chromosomes consisting of a string of genes.
2. Copies are made of these patterns. | Genotypes are copied by means of sexual or asexual reproduction.
3. During the copying variations occur. | Genetic variations occur during reproduction, for example by means of crossover and gene mutations.
4. The various patterns compete with each other for a limited territory/workspace. | Competition occurs because there is only room (and food) for so many organisms in a certain area.
5. The selective success of a pattern is biased by its environment. | Natural selection works on the relative fitness of an organism, which depends on the complete state of the system (i.e. the organism and its environment).
6. Copying only occurs after a certain amount of differential success. | Only organisms fit enough survive until reproductive age and are able to produce offspring, often depending on sexual attractiveness. Differential success is measured by the inclusive fitness of an organism.

4.5 Links to Evolutionary Computation


Evolutionary computation and in particular genetic algorithms have exploited ideas
from Darwinian evolutionary theory together with a genetic representation similar to
the one found in nature. This section presents some of the main parallels and
differences between biological systems and genetic algorithms.

• Genetic Representation
The string representation of chromosomes in GAs is comparable to the ones found in
real life. However, nearly all evolutionary computation algorithms so far have been
limited to haploid chromosomes, where each locus can only contain one gene. While

in biological systems this is true for gametes during reproduction, normal cells always
have a pair of genes contained in one locus. This feature allows organisms to adapt
more quickly to changing environments and is especially useful if the organism is
required to switch between two environment states. Also a population of organisms
that have diploid chromosomes can contain a much larger genetic variability than
organisms with haploid chromosomes.

Lately the representation used in GAs tends to be more problem-specific and is no
longer limited to the classic genetic string. Genetic programming, of course, with its
tree-structured chromosomes, uses a representation quite different from the one found
in nature.

• Genotype to Phenotype Mapping


In GAs the mapping from the genotype into the phenotype by means of the
interpretation function is almost always completely deterministic. The phenotype only
depends on the genotype and there is no stochastic element involved. This is not true
in nature. The very complex developmental process from genotype into the phenotype,
morphogenesis, is influenced by many environmental factors and is stochastic by
nature. The environment in which the development takes place is often called the
epigenetic environment. During development, deformities or mutations might take
place that can have a big influence on the final phenotype, depending on the stage of
development at which they occur. In more abstract terms it can be said that the mapping
from the genotype into the phenotype is stochastic and dependent upon the state of the
system. This concept is implemented in a GA system in [8] where the neural
development in a population of neural networks is simulated using cell division, cell
migration and axonal growth and branching. It is suggested that the lack of mutations
in the development from genotype to phenotype in most GA applications concerning
neural network evolution could be a reason why such high (genetic) mutation rates are
needed for a good GA performance.

• Selection
In nature, adaptation is performed using natural selection instead of the selection
method used in most evolutionary computation systems. The main difference is that in
natural selection there is no such thing as a superimposed fitness measure. Not just the
organisms but also the fitness measure evolves. EC systems where this is implemented
are called open-ended evolution. The majority of EC systems however works as a
function optimiser and therefore necessarily have a fixed fitness function. This is

probably the main difference between natural evolution and EC. Most biologists
would argue that the idea of optimisation in itself is not found in nature and it is for
this reason possibly quite dangerous to blindly copy ideas from natural evolution into
the field of EC.

Another point of difference is that there is no such thing as survival of individuals in
EC. All individuals have a non-zero chance of producing offspring for the next
generation; an individual cannot die before it reaches reproductive age. And, of
course, in EC the entire concept of selection based on sexual attractiveness is absent.

• Crossover and Mutation


In biological systems it is often thought to be mutation that is the main force in
exploring new genetic territory. Crossover is generally thought to play only a minor
role. In genetic algorithms this picture is reversed: crossover is generally thought
to be the main search force. It has therefore lately been postulated that crossover might play
a much bigger role in biological systems than previously thought. Although the
mutation-rate is usually very small, mutation usually plays a vital part in a GA in that
crossover can normally not (re)introduce information not contained in the present
population. It is clear from the successful applications of Evolutionary Programming
that algorithms using selection and mutation only can be very powerful. These types
of algorithms are referred to as 'naive evolution'. As small as the mutation rates may
be in GAs, the ones usually found in nature are actually much smaller. Reasons why
natural selection does not seem to need high mutation rates in order to maintain
enough genetic variability could be the stochastic nature of the genotype-to-phenotype
mapping as mentioned above and the use of diploid instead of haploid chromosomes.
Both these features are lacking in most GA implementations.

• Inheritance and Individual Learning


Individual learning in organisms is equivalent to local search in evolutionary
computation. The Baldwin Effect occurs in EC when local search changes the fitness
of an individual but does not affect its genotype. Lately the Baldwin Effect has gained
much attention in the field of evolutionary computation, where EC is combined with
local search algorithms to improve its performance. In order to have a true Baldwin
Effect in EC similar to the one found in nature, the fitness function should be allowed
to evolve together with the individual.

While it seems to be true that Lamarckian learning does not actually occur in
biological systems, it can prove beneficial for evolutionary computation. It not only
changes the fitness (by means of local search) but it also changes the genetic
representation of the individual so that the learned information can be passed on to
future generations. In EC systems where there is no way for the fitness function to
evolve, Lamarckian learning provides the only mechanism for passing on learned
information. Since almost all EC systems are used as optimisation algorithms the
fitness function will indeed be fixed.

Individual learning is sometimes referred to as phenotype learning, while genotype
learning refers to the evolutionary process where learned behaviour is inherited. In EC
systems the trade-off between the two can be a difficult one. For example if genotype
learning has too much effect on the behaviour of an individual it might be very hard
for it to learn new behaviour by means of phenotype learning simply because its
options are limited. Genotype learning is often said to determine the boundaries in
which an individual can evolve by means of phenotype learning.
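The distinction can be made concrete in code. Below is a minimal sketch (not from the original text) contrasting Baldwinian and Lamarckian use of local search in an evolutionary system; the one-max fitness, greedy bit-flip learning, and all parameter values are illustrative assumptions:

```python
import random

random.seed(1)

L = 20                        # chromosome length (illustrative)

def fitness(bits):
    return sum(bits)          # one-max: count the 1s (assumed toy problem)

def local_search(bits, steps=5):
    """Greedy bit-flip hill climbing: the 'individual learning' phase."""
    learned = bits[:]
    for _ in range(steps):
        i = random.randrange(L)
        trial = learned[:]
        trial[i] ^= 1
        if fitness(trial) > fitness(learned):
            learned = trial
    return learned

def evaluate(genotype, mode):
    learned = local_search(genotype)
    if mode == "baldwin":
        # Learning changes the reported fitness but not the genotype.
        return fitness(learned), genotype
    # "lamarck": the learned bits are written back into the genotype.
    return fitness(learned), learned

genotype = [random.randint(0, 1) for _ in range(L)]
f_b, g_b = evaluate(genotype, "baldwin")
f_l, g_l = evaluate(genotype, "lamarck")
assert g_b == genotype                     # Baldwin: genotype untouched
assert fitness(g_l) >= fitness(genotype)   # Lamarck: improvement is inherited
```

In the Baldwinian branch the improved phenotype only biases selection; in the Lamarckian branch the learned bits themselves would be passed to offspring.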

• Epistasis
Tackling epistasis is one of the main problems in GAs. Present day GAs usually fail
when the level of epistasis is high. As most theoretical work in GAs is concerned with
problems of low epistasis, more work needs to be done to understand the working of
problems with high epistasis. By contrast, biological systems perform well even with a
very high level of epistasis. In fact, the level of epistasis found even in simple
organisms is so high that some biologists reject the reductionist approach that treats
individual genes as a useful tool for studying genetics. A better understanding of
biological systems concerning epistasis is expected to be of value in research on GAs.

• The Darwin Machine

Is Evolutionary Computation a Darwin Machine? In Table 4.2, the six properties
defining the minimal Darwin Machine are again listed, this time with their
correspondence in the field of Evolutionary Computation.

Table 4.2 The requirements for a minimal Darwin Machine and their occurrence in
evolutionary computation.

Requirements | Evolutionary Computation
1. The system operates on patterns of some sort. | Potential solutions are represented by (often linear) chromosomes.
2. Copies are made of these patterns. | Reproduction of chromosomes.
3. During the copying variations occur. | Crossover and mutations introduce variation during reproduction.
4. The various patterns compete with each other for a limited territory/workspace. | No general equivalent in EC.
5. The selective success of a pattern is biased by its environment. | Selection depends on the fitness of an individual relative to the fitnesses of the other individuals, but is usually not influenced by its environment.
6. Copying only occurs after a certain amount of differential success. | The chance of reproduction depends only on the relative fitness.

There seem to be two main reasons why, in general, evolutionary computation does
not qualify as a Darwin Machine. First, there is no equivalent to the struggle for a
place in a limited territory or workspace in EC. An individual feels no effect of the
way other individuals perform on the problem and there is no notion of some kind of
'resource' (e.g. territory, food, or even mates) that is limited in any sense. Second, there
is no influence of the environment reflected in the fitness value. The only thing that is
reflected in the individual's fitness is its own performance on the problem.

An exception to this general picture of EC can be found in the work by Nolfi and
Parisi [55] where the GA system evolves artificial organisms represented by
ecological neural networks that compete with each other in a limited two dimensional
world in the quest for food. A changing environment is modelled by varying the food
resources over time. Experiments are performed where the fitness function itself is left
to evolve, resulting in observed forms of preadaptation to changing environments.
This system does meet all the requirements for a Darwin Machine even though the
concept of a Darwin Machine was really set up to compare natural processes rather
than artificial ones. The GA system in this approach can therefore be said to belong to

the field of artificial life, where complex, natural-like behaviour is generated from
interacting artificial organisms operating with a relatively simple rule-based system.

• Overview
Table 4.3 gives a brief overview of the main differences between the evolutionary
theory of biological systems and the operation of most present day GAs.

Table 4.3 A brief overview of the main differences between Darwinian evolutionary
theory and most present day GAs

Darwinian Evolutionary Theory Genetic Algorithms


Darwinian Evolutionary Theory | Genetic Algorithms
natural selection with an evolving fitness measure | selection based on a superimposed fixed fitness measure
adaptation to changing environments | optimisation of fitness under stable environments
complex stochastic genotype-to-phenotype mapping | relatively simple deterministic genotype-to-phenotype mapping
relatively low mutation rate | relatively high mutation rate
extremely high level of epistasis | performs well only on problems with a medium level of epistasis
competition between organisms for a common resource | no competition between individuals
Baldwin effect: change in fitness function due to individual learning | sometimes local learning, but no change in fitness function

As stated before, the main overall difference between the two systems lies in their
goals, (or rather the lack of one in Darwinian evolutionary theory). While in
evolutionary computation the goal almost always is the optimisation of some kind of
fixed problem, this does not necessarily seem to be the case for biological
evolutionary systems. Still, the success of evolutionary computation as a function
optimiser, as reported on a wide variety of problems and partly supported by
theoretical foundations, indicates that many features of Darwinism lend themselves
very well to this purpose.
5. Mathematical Foundations of
Genetic Algorithms
Several approaches can and have been taken as a first step to form a basic theory of
genetic algorithms (and genetic programming), each one providing some useful
insights into their functioning. Still, a complete theory incorporating
all aspects of genetic algorithms is a long way off.

One of the first and most frequently referenced foundational works in this field is by
Holland [30], who examined the case of a binary coded fixed length genetic algorithm
and introduced a mathematical foundation known as the Schema Theorem. Goldberg
[19] extended this idea with the notion of the so called Building Block Hypothesis.

Section 5.1 presents a brief description of the working of genetic algorithms. In
section 5.2 an overview of the Schema Theorem is given based on a simple binary
coded genetic algorithm. Other approaches are also presented.

5.1 The Operation of Genetic Algorithms


Genetic algorithms perform a stochastic global search through the solution space by
swapping information between chromosomes and by occasionally (re)introducing new
information. The basic genetic algorithm comprises three genetic operators. These are
reproduction, crossover and mutation.

The reproduction or selection operator makes sure that the search is biased in the
direction of chromosomes with high fitness values if maximising or low fitness values
if minimising. Chromosomes that have above average fitness have more chance to
survive and to reproduce than others. Chromosomes with very low fitness will die off.
Low fitness chromosomes are needed in the population though, because they can
contain information that can be useful or even crucial to the formation of the optimal
chromosome.

The crossover operator ensures that partial information contained in one chromosome
can reach other chromosomes in the population. This mixing of information leads to


the formation of optimal chromosomes. In order for the genetic algorithm to perform
well the crossover operator should be such that a high correlation exists between the
fitness of the parent chromosomes and the fitness distribution of their offspring.

In order to fully explore the search space, diversity of the population is crucial. The
purpose of the mutation operator is to maintain enough diversity in the population to
overcome local optima and eventually reach the global optimum. A mutation-rate that
is too high can destroy useful information in highly fit chromosomes and slow down
the search.
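The three operators described above can be sketched in a few lines of code. This is an illustrative minimal GA; the one-max fitness function and all parameter values are assumptions for demonstration, not taken from the text:

```python
import random

random.seed(0)
L, POP, PC, PM = 9, 20, 0.7, 0.01   # illustrative parameter values (assumed)

def fitness(c):
    return sum(c)                    # toy problem: maximise the number of 1s

def select(pop):
    """Roulette wheel reproduction: selection chance proportional to fitness."""
    total = sum(fitness(c) for c in pop)
    r = random.uniform(0, total)
    acc = 0.0
    for c in pop:
        acc += fitness(c)
        if acc >= r:
            return c
    return pop[-1]

def crossover(a, b):
    """1-point crossover, applied with probability PC."""
    if random.random() < PC:
        site = random.randrange(1, L)
        return a[:site] + b[site:], b[:site] + a[site:]
    return a[:], b[:]

def mutate(c):
    """Flip each gene independently with probability PM."""
    return [g ^ 1 if random.random() < PM else g for g in c]

pop = [[random.randint(0, 1) for _ in range(L)] for _ in range(POP)]
for generation in range(30):
    nxt = []
    while len(nxt) < POP:
        o1, o2 = crossover(select(pop), select(pop))
        nxt += [mutate(o1), mutate(o2)]
    pop = nxt[:POP]

best = max(pop, key=fitness)
```

After a few dozen generations the best chromosome is typically at or near the all-ones optimum, illustrating how the three operators cooperate.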

• Exploitation and Exploration


For optimal search a genetic algorithm must find a balance between exploration of
new regions of the search space and exploitation of the information available in the
current population. For the sake of comparison, hillclimbing algorithms, which guide
the search based on local gradient information of the function to be optimised, are
very good at exploitation but do limited exploration. In contrast, a random search
mechanism, where points in the search space are selected and tested at random, is good
at exploration but does no exploitation. Genetic algorithms fall somewhere in
between.

• Population Diversity and Selective Pressure


Another viewpoint on the working of GAs is that of population diversity and selective
pressure. Selective pressure reflects the 'pushing force' of a GA towards above-average
individuals. When the selective pressure is very strong, a superfit individual takes over
the population within a few generations often resulting in premature convergence.
This in turn is reflected by a strong decrease in population diversity. A selective
pressure that is too weak however results in a very slow and ineffective search. Again
a balance is needed between the two. This viewpoint is, in a sense, a variation of the
exploitation and exploration idea. Too much exploitation and too little exploration is
reflected in a strong selective pressure resulting in a low population diversity. The
other extreme results in a very low selective pressure and an ineffective search.

• Evolvability
Since even in a pure random search there is a chance that the offspring chromosomes
are fitter than the parents, a genetic algorithm should have a better average
performance than a random search. In order to do so, the effect of the genetic
operators should be such that there exists a high correlation between the fitness of the

parents and the fitness distribution of the offspring. When this is the case the fitness
distribution of the offspring can on average be expected to be better than the one
belonging to the parents. This correlation property is called the evolvability of a
genetic algorithm [2] and serves as a local performance measure. The global
performance measure then is simply the ability of the genetic algorithm to produce
fitter offspring over the course of one or more generations. This global performance
depends on the maintenance of the evolvability of the population as the search is
guided to the global optimum. Using the Schema Theorem, this evolvability can be
expressed by the Building Block hypothesis.

5.2 The Schema Theorem and the Building Block Hypothesis
Introduced by Holland [30], the Schema Theorem is often viewed as the fundamental
theoretical foundation of genetic algorithms. In its basic form it can be applied to
chromosomes that are fixed length strings only. Additions have been made to
incorporate more general chromosomal structures, such as the ones used in genetic
programming [40]. The theorem will be presented in its basic form here, where
chromosomes will simply be referred to as strings. Firstly some related terms are
defined.

The search space, Ω, is the complete set of possible strings. In the case of a fixed-
length chromosomal string where each gene can take on a value in the alphabet A, the
size of the search space is:

size(Ω) = k^l

where:
k = alphabet size
l = chromosome length

Returning to the example of the last chapter, the search space has a size of 2^9 = 512. In
other words: there are 512 different chromosomes in the search space.
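This count is easy to verify directly; a quick sketch for the binary alphabet with l = 9, as in the example:

```python
from itertools import product

k, l = 2, 9                                         # binary alphabet, chromosome length
assert k ** l == 512                                # size of the search space, k^l
assert len(list(product("01", repeat=l))) == 512    # exhaustive enumeration agrees
```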

A string in the population S is denoted by x ∈ Ω. So in the case of a binary string:
x ∈ {0,1}^l. The number of instances of a string x in the population S is denoted by m(x).

A schema is a similarity template that defines a subset of strings with similarities at
certain fixed positions. A schema therefore defines a subset of the complete search
space. More precisely, it defines a hyperplane partition of the l-dimensional search
space. A schema, H, is a string of the same length as the chromosomes. Each position
in the string can take on the values of the alphabet, called fixed positions, plus a 'don't
care' character '*'. So in the binary case H ∈ {0,1,*}^l. Using the example genetic
algorithm again, an example of a schema is:

H1 = * 1 0 * 0 1 0 0 1

This schema describes the following subset of the search space:

H1 = {(0 1 0 0 0 1 0 0 1), (0 1 0 1 0 1 0 0 1), (1 1 0 0 0 1 0 0 1), (1 1 0 1 0 1 0 0 1)}

These strings are said to belong to schema H1, but belong to many other schemata as
well. The search space itself can be defined as a schema of length l with a 'don't care'
symbol at every position; in our case: Ω = * * * * * * * * *. It also follows that the
number of possible schemata is (k+1)^l; in our case: (2+1)^9 = 19683.

In order to present further discussion about schemata, the following properties need to
be defined. The order of a schema, o(H), is the number of fixed positions in H. The
order of H1 for example is o(H1) = 7. The defining length, δ(H), of a schema is the
distance between the first and the last fixed positions of the schema. In the example
schema: δ(H1) = 9 - 2 = 7.

The number of strings in the population belonging to schema H, m(H), is given by:

m(H) = Σ_{x∈H} m(x)

Using schemata the effects of the genetic operators on the fitness distribution of the
population can be seen.
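These definitions translate directly into code. A small sketch (the helper names are mine, not from the text) using the example schema H1 and a tiny assumed population:

```python
from itertools import product

def matches(schema, x):
    """True if string x belongs to schema ('*' is a don't-care position)."""
    return all(s == "*" or s == g for s, g in zip(schema, x))

def order(schema):
    """o(H): the number of fixed positions."""
    return sum(s != "*" for s in schema)

def defining_length(schema):
    """delta(H): distance between first and last fixed positions."""
    fixed = [i for i, s in enumerate(schema) if s != "*"]
    return fixed[-1] - fixed[0]

H1 = "*10*01001"                      # the example schema
assert order(H1) == 7
assert defining_length(H1) == 7       # positions 9 - 2

# H1 contains exactly 2^(l - o(H)) = 4 of the 512 strings in the search space
strings = ["".join(p) for p in product("01", repeat=9)]
assert sum(matches(H1, x) for x in strings) == 4

# m(H): number of strings in a (toy, assumed) population belonging to H
pop = ["010001001", "111111111", "010101001"]
m_H1 = sum(matches(H1, x) for x in pop)
assert m_H1 == 2
```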

5.2.1 The Effect of Roulette Wheel Reproduction


Consider the reproduction operator. Let the population at time t be defined by S(t),
and let m(H,t) denote the number of strings in this population belonging to schema H.

The average fitness of all strings in the population representing schema H is defined
as:

f(H) = ( Σ_{x∈H} f(x)·m(x) ) / m(H)

f(H) is also called the average payoff function of schema H, and the fitness of a string
x in the population is f(x). Using standard roulette wheel reproduction, the
expected number of times string x is selected is given by E(x) = f(x)/f̄. It can be
seen that the expected number of strings belonging to schema H in the next population is
given by:

m(H, t+1) = m(H, t) · f(H)/f̄

where f̄ is the average fitness of all the strings in the population at time t.

From the above equation it can be seen that schemata with above-average fitness
values will reproduce in increasing numbers in the next generation, while schemata
with below-average fitness values will eventually die off. When f(H)/f̄ is relatively
constant, the equation can be approximated by a linear difference equation of the form
m(H, t+1) = a·m(H, t). The solution is then given by:

m(H, t) = m(H, 0) · a^t

With a approximated by f(H)/f̄, it can be seen that strings belonging to above-
average schemata are expected to grow exponentially while strings belonging to
below-average schemata are expected to decay exponentially.
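The reproduction equation can be checked empirically. The sketch below (the toy population and its fitness values are assumed) compares a Monte Carlo count of schema members after one round of roulette wheel selection with the predicted value m(H)·f(H)/f̄:

```python
import random

random.seed(2)

pop = ["110", "100", "000", "111"]                       # toy population (assumed)
fit = {"110": 2.0, "100": 1.0, "000": 0.5, "111": 3.0}   # assumed fitness values

def roulette(pop):
    """Roulette wheel selection: pick a string with probability f(x) / sum of f."""
    total = sum(fit[x] for x in pop)
    r = random.uniform(0, total)
    acc = 0.0
    for x in pop:
        acc += fit[x]
        if acc >= r:
            return x
    return pop[-1]

members = [x for x in pop if x[0] == "1"]            # schema H = 1**
f_H = sum(fit[x] for x in members) / len(members)    # f(H)
f_bar = sum(fit[x] for x in pop) / len(pop)          # population average fitness
expected = len(members) * f_H / f_bar                # m(H) * f(H) / f_bar

# Monte Carlo: average number of H-members in a freshly selected population
trials = 5000
total = sum(sum(roulette(pop)[0] == "1" for _ in pop) for _ in range(trials))
estimate = total / trials
assert abs(estimate - expected) < 0.05
```

Here the above-average schema 1** is expected to grow from 3 members towards roughly 3.7 in a single generation, in line with the difference equation.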

5.2.2 The Effect of Crossover


Now the effect of the (1-point) crossover operator on a schema H is considered. When
a string belonging to a schema H recombines with another string into two offspring,
the schema can either have survived crossover or not. The survival probability of a
schema depends on its defining length and can best be illustrated by an example.
Consider a string x that is selected for 1-point crossover, and consider two
representative schemata H1 and H2 within that string:

x  = 0 1 0 0 0 1 | 0 0 1
H1 = * 1 * * * * | * * 1
H2 = * * 0 0 * * | * * *

The crossover point, marked above by '|', is randomly chosen to be 6. Unless the string with
which x mates has the same gene values at positions 2 and 9, a possibility that will be
ignored for now, schema H1 will not survive. Schema H2 however does survive the
crossover operator and at least one of the offspring will belong to H2. Due to its longer
defining length it is clear that schema H1 has less chance of surviving crossover than
does H2. Only a crossover with its crossover site at position 3 will destroy schema H2,
while only a crossover at position 1 preserves schema H1. The defining lengths for the
schemata are: δ(H1) = 9 - 2 = 7 and δ(H2) = 4 - 3 = 1.

Generally speaking, a schema survives 1-point crossover if the crossover site falls
outside its defining length. Assuming the crossover site is chosen randomly from [1, ...,
l-1], the probability of survival for a schema H is:

p_s = 1 - δ(H)/(l-1)

When crossover itself is applied with probability p_c, the following expression gives a
lower bound on the survival probability of schema H due to the crossover operator:

p_s ≥ 1 - p_c · δ(H)/(l-1)
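A quick Monte Carlo check (a sketch; the helper function is mine) confirms that the fraction of random crossover sites falling outside the defining length matches 1 - δ(H)/(l-1) for a schema with fixed positions 2 and 9, i.e. defining length 7:

```python
import random

random.seed(3)

def survival_prob(schema, trials=100000):
    """Fraction of random 1-point crossover sites falling outside the
    schema's defining length (lucky matches with the mate are ignored)."""
    l = len(schema)
    fixed = [i for i, s in enumerate(schema) if s != "*"]
    first, last = fixed[0], fixed[-1]
    survived = 0
    for _ in range(trials):
        site = random.randrange(1, l)          # cut falls after position `site`
        if site <= first or site > last:       # all fixed positions on one side
            survived += 1
    return survived / trials

H = "*1******1"          # fixed positions 2 and 9, delta(H) = 7, l = 9
l, delta = len(H), 7
analytic = 1 - delta / (l - 1)                 # = 1/8
mc = survival_prob(H)
assert abs(mc - analytic) < 0.01
```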

This result can be extended to the 2-point crossover operator. Assuming the two
crossover sites are chosen independently of each other and are not equal to each other
(otherwise there would be no crossover), the survival probability is given by:

p_s ≥ 1 - p_c · [ 1 - (1 - δ(H)/(l-1)) · (1 - δ(H)/(l-2)) ]

So when 2-point crossover is used, the survival probability of a schema is decreased.
Schemata with defining length l-1 never survive 2-point crossover.

Uniform crossover diminishes the survival probability of schemata further. Since every
gene in the chromosome has a 50% chance of survival, the lower bound on the survival
probability is:

p_s ≥ 1 - p_c · ( 1 - (0.5)^o(H) )

5.2.3 The Effect of Mutation


The standard mutation operator destroys a schema H if it is applied to any of the o(H)
fixed positions of H, since it inverts the value of the bit. Since mutation is performed on
each gene of the population with a probability p_m, the chance that a gene survives
mutation is (1 - p_m). Therefore the probability that a schema H survives mutation is
given by:

p_s = (1 - p_m)^o(H)

For small values of p_m, as is usually the case, this may be approximated by 1 - o(H)·p_m.
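The quality of this first-order approximation is easy to check numerically; the mutation rate used here is an illustrative assumption:

```python
o_H, pm = 7, 0.01            # order of the example schema; assumed mutation rate
exact = (1 - pm) ** o_H      # exact survival probability
approx = 1 - o_H * pm        # first-order approximation
assert exact > approx        # the linear form slightly underestimates survival
assert abs(exact - approx) < 0.003
```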

5.2.4 The Effects of all Genetic Operators Combined: The Schema Theorem
To see the effect of all three genetic operators (using 1-point crossover), the survival
probabilities are simply multiplied. This can be done since all three operators are
applied independently. Thus the lower bound on the expected number of copies of a
schema H in the next generation is:

m(H, t+1) ≥ m(H, t) · (f(H)/f̄) · (1 - p_c · δ(H)/(l-1)) · (1 - p_m)^o(H)

By neglecting the small cross-product terms, this can be simplified to:

m(H, t+1) ≥ m(H, t) · (f(H)/f̄) · (1 - p_c · δ(H)/(l-1) - o(H)·p_m)

From the above equation the Schema Theorem may now be stated:

The Schema Theorem: Using reproduction, crossover and mutation in the standard
genetic algorithm, short, low-order, above-average schemata receive exponentially
increasing trials in subsequent generations. [19]

Finally, it can be shown that the number of schemata which are effectively processed
in each generation is of the order N^3, with N the population size. This property of GAs,
which helps explain their performance on many optimisation problems, is known as
implicit parallelism.

5.2.5 The Building Block Hypothesis


As an extension of the Schema Theorem, Goldberg introduced the Building Block
Hypothesis [19]. The short, low-order, highly-fit schemata, which play such an
important role in the standard genetic algorithm are given the name of Building
Blocks. When building blocks are recombined into another chromosome, its fitness is
likely to increase. Using the Schema Theorem the Building Block Hypothesis can now
be stated:

The Building Block Hypothesis (BBH): The partial information contained in the
building blocks is combined in a GA to form globally optimal strings. [19]

The genetic algorithm can now be seen to work in such a way that the building blocks
are sampled, recombined and resampled to form strings of higher fitness which
ultimately should arrive at the global optimal string. The building block hypothesis is
in fact a way to express the evolvability of the standard genetic algorithm. It states that
a genetic algorithm tries to find low-order schemata that have the best average payoff
in each hyperplane partition of the search space and that it combines these to form a
more complete solution.

Although the building block hypothesis (BBH) has been shown to work in many
applications, there are GAs for which it does not [23]. The problem-coding
combination in such GAs is generally referred to as being GA-deceptive, meaning that
the GA search is deceived or misled in finding the global optimum. In GA-deceptive
problems there is no regularity in the function-coding combination that may be
exploited by the recombination of short length schemata. Building blocks cannot form.
GA-deceptiveness is a theoretical concept derived from the analysis of schemata
payoff functions. In contrast GA-hardness is a practical concept expressing the actual
performance of the GA; i.e. how easy is it for the standard GA to converge to the
global optimum? It is important to note that GA-deceptiveness does not necessarily
entail GA-hardness. Some problems classified as being GA-deceptive in fact turn out
to be quite easily solved by the standard GA.

As an example where the standard genetic algorithm should have no problem finding
the optimum solution, consider the following GA-non-deceptive problem. Suppose the
optimum solution is the string 000 ... 0 (the string length is undefined). Furthermore
for the average schema fitnesses the following holds:

    f(0*...*)  > f(1*...*)
    f(00*...*) > f(01*...*)
    f(00*...*) > f(10*...*)
    f(00*...*) > f(11*...*)
    etc.

In other words, for all schemata of a certain length (a hyperplane partition), those with
all 0's in their fixed position are preferred. According to the building block hypothesis
this problem should not deceive the GA and it should easily converge to the global
optimum.
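The average schema fitnesses in these inequalities can be computed by brute-force enumeration for small string lengths. The sketch below is illustrative only: the toy payoff function (counting zeros, so 000...0 is optimal) and the helper names are assumptions, not from the text.

```python
from itertools import product

def matches(schema, s):
    """True if bitstring s is an instance of schema (e.g. '0**')."""
    return all(h == '*' or h == b for h, b in zip(schema, s))

def schema_fitness(schema, f):
    """Average payoff f(H): mean of f over all strings matching H."""
    members = [''.join(b) for b in product('01', repeat=len(schema))
               if matches(schema, ''.join(b))]
    return sum(f(s) for s in members) / len(members)

# Toy payoff: the number of zeros, so the optimum is 000.
f = lambda s: s.count('0')

print(schema_fitness('0**', f), schema_fitness('1**', f))  # 2.0 1.0
```

Here f(0**) > f(1**), matching the non-deceptive ordering of the schemata with 0's in their fixed positions.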

Now consider the same problem, i.e. the optimum is 000...0, but now schemata that
have 1's in their fixed positions are preferred for every hyperplane partition. Thus:

    f(0*...*)  < f(1*...*)
    f(00*...*) < f(01*...*)
    f(00*...*) < f(10*...*)
    f(00*...*) < f(11*...*)
    etc.

This is a GA-deceptive problem according to the BBH, since the coding regularity
occurs in the non-preferred schemata, and the standard GA should have great
difficulty in finding the optimum. The problem now is to design the genetic coding in
such a way that the problem is not GA-deceptive so that building blocks can form and
the BBH will hold. This is in general very hard to do. One apparent requirement of the
coding in order for building blocks to form is that related genes should be close
together on the chromosome. When they are close together they can form a building
block and guide the GA search to better individuals. According to the BBH therefore
the ordering of the genes in the chromosome can play an important part in the GA
performance.

5.2.6 Another Viewpoint: the Switching of Hyperplanes


It can be beneficial to take a more geometric viewpoint when examining schemata.
The search space can be seen as an l-dimensional space, l being the chromosome
length [6]. Points in this space are the chromosome strings and can be seen as
schemata of order l (i.e. schemata with fixed positions only). Schemata of order l−1
define lines in the space, schemata of order l−2 planes, etc. In general schemata of
order l−n define n-dimensional hyperplanes in the search space, the search space itself
being defined by the schema of order 0 (i.e. the schema consisting of 'don't care'
values only).

Using this viewpoint, the genetic algorithm can now be seen as moving between points
across different hyperplanes in search of the optimal point in the search space.

5.2.7 The Walsh-Schema Transform


The Walsh-Schema transform can be used as an efficient, analytical method for
determining the expected performance of genetic algorithms. The average fitness
values of schemata, f(H), are determined using Walsh coefficients. The early work in
this field provides an analysis to determine the expected static performance of genetic
algorithms (e.g. [19]), while in [6] a nonuniform Walsh-Schema transform is used for
the dynamic analysis of genetic algorithms. The static analysis requires the assumption
of a so called flat population in which every possible string is represented in equal
numbers.

Another non-uniform transform used to analyse genetic algorithms was developed by
John Holland. It is called the Hyperplane Transform and is shown to be essentially
equivalent to the nonuniform Walsh-Schema transform [6].

The (static) Walsh-Schema transform basically works as follows. The Walsh
transform of a function f: x → ℝ, where x is a binary string of length l, is given by:

    w_j = (1 / 2^l) · Σ_{x=0}^{2^l − 1} f(x) · ψ_j(x)

where:
    ψ_j(x) = the Walsh function
    w_j = the Walsh coefficient relating to j
    j = a binary string of length l

The summation is over all 2^l 'integer' values of x; i.e. x = 000, 001, 010, 011, etc. for
the case l = 3. The function f(x) is transformed into a set of coefficients w_j, one for
each possible bitstring j. The total number of such bitstrings, and therefore the number
of Walsh coefficients, is 2^l. The Walsh coefficients are sometimes also called partition
coefficients. The Walsh function ψ_j(x) is given by:

    ψ_j(x) = Π_{i=1}^{l} (−1)^{x_i · j_i}

where:
    x_i = the value of the i-th bit of x
    j_i = the value of the i-th bit of j

The Walsh function will have a value of 1 if x and j have a '1' in the same position an
even number of times and a value of −1 otherwise. The inverse Walsh transform is:

    f(x) = Σ_{j=0}^{2^l − 1} w_j · ψ_j(x)

So for the case of a three-bit string: f(x) = ±w_000 ± w_001 ± w_010 ± ... ± w_111. The Walsh-
schema transform is now:

    f(H) = Σ_{j ∈ J(H)} w_j · ψ_j(β(H))

where:
    J(H) = a set generator of schema H
    β(H) = an operator that maps H to a binary string

The set generator J(H) generates a set of binary vectors from a schema H. This set is
defined by:

    J_i(H) = 0, if H_i = *
             *, if H_i = 0 or 1

So for example J(***) = 000 = {000} and J(**1) = 00* = {000, 001}. β(H) is defined
by:

    β_i(H) = 0, if H_i = 0 or *
             1, if H_i = 1

Using these definitions the average schema payoff function f(H) of a schema H can be
transformed into Walsh coefficients. For example:

    f(***) = w_000
    f(**1) = w_000 + w_001
    f(**0) = w_000 − w_001
    f(*1*) = w_000 + w_010
    f(*0*) = w_000 − w_010
    etc.

The values of the Walsh coefficients can be obtained from the problem-dependent
values of the schema payoff functions f(H) by simple back-substitution. Insight into
whether a problem may be difficult to solve for a GA may be gained from observing
the Walsh coefficients. For example for a problem to be GA-deceptive, conditions
such as the following may need to hold:

    f(**1) < f(**0)
    f(*1*) < f(*0*)
    f(1**) < f(0**)
    etc.

This can be translated into the following relations concerning Walsh coefficients:

    w_001 < 0
    w_010 < 0
    w_100 < 0

These relations can easily be checked once the Walsh coefficients are determined. The
Walsh-Schema transform provides an analysis into the deceptiveness of a problem.
Furthermore, contributions in schema fitnesses due to epistatic interactions between
certain bit positions can be investigated [19]. A disadvantage of the Walsh-Schema
transform is the excessive amount of computation needed in the analysis. The
nonuniform Walsh-Schema transform is much better in this sense and provides dynamic analysis
of problems for which the computation needed in a normal Walsh-Schema transform
would be impractical.
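For small l the static Walsh coefficients can be computed directly from the definition. The sketch below treats bitstrings as integers and uses the common convention ψ_j(x) = (−1)^{Σ x_i·j_i}; note that authors attach the ±1 to '0' or '1' bits differently, so individual coefficient signs may differ from the schema expansions above. The toy payoff function is an assumption for illustration.

```python
def psi(j, x):
    """Walsh function: product over bit positions of (-1)^(j_i * x_i)."""
    return -1 if bin(j & x).count('1') % 2 else 1

def walsh_coefficients(f, l):
    """w_j = (1/2^l) * sum over all 2^l strings x of f(x) * psi_j(x)."""
    n = 2 ** l
    return [sum(f(x) * psi(j, x) for x in range(n)) / n for j in range(n)]

# Toy payoff on 3-bit strings: the number of ones.
f = lambda x: bin(x).count('1')
w = walsh_coefficients(f, 3)    # w[0] = 1.5; w[1] = w[2] = w[4] = -0.5

# The inverse transform recovers f exactly: f(x) = sum_j w_j * psi_j(x)
assert all(sum(wj * psi(j, x) for j, wj in enumerate(w)) == f(x)
           for x in range(8))
```

Because this payoff is a sum of independent gene contributions, all coefficients of order two and higher vanish, illustrating how epistatic interactions between bit positions show up only in the higher-order coefficients.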

5.2.8 Extending the Schema Theorem to Other Representations


The Schema Theorem as described above can be extended for representations other
than binary coding. The Schema Theorem is not restricted to the binary alphabet of '0'
and '1'. The effects of the Roulette wheel selection scheme and the standard
crossover operator (the swapping of substrings) on the survival probabilities of
schemata hold for any coding. The standard mutation operator for other codings is
slightly different from the inversion-mutation used for binary coding. Usually it
randomly re-initialises the value of a gene making it in principle possible that the gene
is re-initialised with its old value; i.e. no mutation at all. The Schema Theorem can
easily be extended to include this.
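The re-initialising mutation described here can be sketched as follows; with an alphabet of cardinality k the new value equals the old one with probability 1/k, so the rate of actual change is p_m · (1 − 1/k). Function and parameter names are illustrative, not from the text.

```python
import random

def reinit_mutation(chromosome, alphabet, p_m, rng=random):
    """Non-binary mutation: each gene is, with probability p_m,
    re-initialised to a value drawn uniformly from the alphabet.
    The draw may return the gene's old value, i.e. no mutation at all."""
    return [rng.choice(alphabet) if rng.random() < p_m else g
            for g in chromosome]

mutated = reinit_mutation(list('abca'), alphabet='abcd', p_m=0.1)
```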

5.3 Criticism on the Schema Theorem and the Building Block Hypothesis
The Schema Theorem describes how the crossover operator is used to combine the
proper building blocks so that the evolvability of the genetic algorithm is maintained.

For the crossover operator to be of use, the chromosomal representation or coding
used must be such that building blocks actually exist. Herein lies the weakness of the
Schema Theorem: it does not provide a way to express the correlation between the
problem representation and the genetic operators used, and the actual performance of
the genetic algorithm. It has been said that it does not properly explain the sources of
power in genetic algorithms.

A special point of criticism of the BBH is that it is based on a static analysis of the
payoff functions of schemata, while a dynamic view would be needed to properly
explain the working of a GA. According to Grefenstette [23], this means in fact that
the BBH is false and fails in practice due to the following factors:

• the sampling of schemata is biased due to previous convergence


When the GA starts to converge (i.e. right after the first generation), the population
represents a biased sample of all schemata. This means that the GA can no longer
estimate the true average schema fitnesses. This can in fact only be done in the first
random population (i.e. at generation 0). During a run the GA can favour schemata
that have a lower true average fitness or payoff than others, simply by eliminating their
competitors. It can be shown that because of this, GA-deceptive problems can in fact
quite easily be solved by a standard GA.

• the population size is always limited and there is a large variance within
schemata
This means that even in the initial random population the GA cannot estimate the true
average schema fitnesses. To illustrate this, consider the following problem which
according to the BBH is GA-easy or at least non-deceptive.

The function to be optimised is f(x), x ∈ [0,1], where x is represented by a 10-bit
binary string. The string 000...0 represents 0.0, the string 111...1 represents 1.0, etc.
Let f be defined by:

    f(x) = x²,    if x > 0
           2048,  if x = 0

It can be seen that for any schema H which contains the optimum string 000...0 (i.e. all
the schemata with only 0's in their fixed positions), the average schema fitness f(H) >
2, since the sum of all its payoff functions is at least 2048 (due to the optimum string)
and the number of strings contained in H is at most 2^10 = 1024. Also for any schema H
that does not contain 000...0, the average payoff function f(H) < 1, since the sum of all
its payoff functions is at most 1024 · 1 = 1024. Therefore schemata containing only
0's in their fixed positions are always preferred over others and the problem is GA-
non-deceptive.

However a standard GA will find it extremely difficult to find the optimum string
000...0. Unless it was already part of the initial population, or is introduced by a very
lucky crossover or mutation, the string will very likely not be found. Intelligent
sampling of hyperplane partitions will not lead to the discovery of the optimal string
as predicted by the BBH. This is because the variance in the best schemata is
extremely high due to this problem being a 'needle-in-the-haystack' search. The GA
cannot accurately estimate the true average payoff functions of these schemata.
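Grefenstette's needle-in-a-haystack argument is easy to check numerically. The sketch below assumes the natural decoding of a 10-bit string to x = v/1023 (so 000...0 → 0.0 and 111...1 → 1.0); the schema 0********* then covers the integers 0 to 511:

```python
def f(v):
    """Payoff of the 10-bit string with integer value v (an assumed decoding)."""
    x = v / 1023                      # 000...0 -> 0.0, 111...1 -> 1.0
    return 2048 if v == 0 else x ** 2

# Average fitness of schema 0********* (v in 0..511) vs 1*********
zeros = sum(f(v) for v in range(512)) / 512
ones = sum(f(v) for v in range(512, 1024)) / 512
assert zeros > 2 and ones < 1    # 0-schemata always look better on average

# ...but almost all of the 0-schema's payoff is the single needle at v = 0:
assert f(0) / (zeros * 512) > 0.97
```

The preference for 0-schemata thus rests almost entirely on one string; a finite random sample will almost never contain it, so the GA's estimate of f(H) from the population is useless here.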

Grefenstette states that while the Schema Theorem as presented by Holland refers to
the average payoff of schemata according to the current sample in the population, the
BBH ignores this crucial feature and should therefore not be used as a fundamental
theorem for GAs. The classification of problems being GA-hard or GA-easy using the
BBH is certainly not always true as shown above.

5.4 Price's Theorem as an Alternative to the Schema Theorem
In order to be able to express the correlation between representation, genetic operators
and performance explicitly in a theory and in order to account for arbitrary
representations and genetic operators, a more general and more complete approach
than the Schema Theorem is suggested in [2] to serve as the mathematical foundation
of genetic algorithms. Apart from GAs it can also serve as a foundation of genetic
programming (in contrast to the Schema Theorem). The theory is based on Price's
Covariance and Selection Theorem and does not include the notion of schemata or
building blocks.

In order to account for the effects of the choice of representation and genetic operators
the notion of a transmission function is used. A transmission function is the
probability distribution of the offspring chromosomes from every possible mating. For

the case of two parents, as used in normal crossover, the transmission function is
represented as T(i ← j,k), where i is the label for the offspring and j,k are the labels
for the parents. T(i ← j,k) represents the probability that an offspring of type i is
produced by parental types j and k resulting from the application of the genetic
operators on the representation.

The performance of a genetic algorithm is now determined by the relation between the
transmission function and the fitness function. Price's theorem is used to analyse the
dynamic behaviour of the fitness distribution over one generation. This is shown to
depend on the covariance between parent and offspring fitness distributions and a so
called 'search-bias' which indicates how much better the effect of a genetic operator
on the current population is than pure random search.

Using the search-bias a quantitative notion can be given to the idea that the
transmission function should find a balance between exploring the search space and
exploiting the current population. It is still very hard to actually use the theorem in
practice in order to analyse or optimise a genetic algorithm, but enhanced ease of use
may be expected in future.

5.5 Markov Chain Analysis


Yet another approach to a fundamental theory of genetic algorithms is Markov chain
analysis. As opposed to most models where certain approximations have to be made,
Markov chains can be used to form an exact model of genetic algorithms. This can
only be done on a small scale though, as the transition matrix used in the analysis grows
exponentially with increasing population size and chromosome length. Due to this
scalability problem, studies in this area have used very small scale genetic algorithms
(e.g. [22], [31]) or have worked with matrix notation only (e.g. [12], [54]).

Even in the case of small scale models very useful insights can be gained such as the
concept of genetic drift and the effect of preferential selection on the population.

In [22] a genetic algorithm was modelled that had binary chromosomes of length one;
i.e. a 'single-locus genome'. In other words, the only individuals possible are '0' and
'1'. A population of size N now gives a total of (N + 1) possible states, whereby the
location of a chromosome in the population is of no concern. For example a
population of size two has states '00', '01', and '11'. State i is referred to as the state

with exactly i ones and (N − i) zeros. The operation of the genetic algorithm is now
defined by a (N + 1) × (N + 1) transition matrix P[i,j] that maps the current state i
to the next state j. The probability of a transition from state i to state j is given by one
entry in the matrix: p(i,j). Figure 5.1 visualises these terms for a simple single-locus
genetic algorithm with population size 2. In general, with a chromosome length l, the
number of possible binary chromosomes is 2^l and the number of states is
(N + 2^l − 1)! / (N! · (2^l − 1)!). For any realistic genetic algorithm the transition matrix
becomes of unmanageable size.

Figure 5.1 Markov chain of single-locus GA (l=1) with N=2

When simple Roulette wheel selection is the only genetic operator in the system (i.e.
no mutation and crossover), the transition matrix can be generated quite easily. It can
be used to examine the influence of selection pressure alone on the system. f_1 is
defined as the fitness of an individual '1', and f_0 the fitness of a '0'. The probability
of choosing a '1' for the next population is simply p_1 = f_1 / Σf. When the number of
ones in the current population is given by i, p_1 can be expressed as:

    p_1 = i·f_1 / (i·f_1 + (N−i)·f_0)

Introducing the fitness ratio r as f_1 / f_0 gives:

    p_1 = i·r / (i·r + (N−i))    and    p_0 = (N−i) / (i·r + (N−i))

The probability of transition from a state with i ones to a state with j ones is now:

    p(i,j) = C(N,j) · (p_1)^j · (p_0)^(N−j)
           = C(N,j) · (i·r / (i·r + (N−i)))^j · ((N−i) / (i·r + (N−i)))^(N−j)
This equation defines the complete (N + 1) × (N + 1) matrix P[i,j]. With r = 1 both
'1' and '0' individuals have an equal fitness value. There is no preference for a state
(i.e. a population) with all-ones or a state with all-zeros, and the equation reduces to
the one for pure genetic drift. This genetic drift causes the simple GA to always
converge to a uniform population, in this case to a state with all-ones or one with all-
zeros; i.e. i ∈ {0, N}. These two states are absorbing states, meaning that once the
system is in such a state it will always stay there. In other words, the transition
probability from such a state to itself is one (p(i,i) = 1) and zero to all other states
(p(i,j) = 0, j ≠ i). Absorption time is defined as the expected number of generations until
the genetic algorithm finds itself in one of the absorbing states. Absorption time
depends linearly on the population size, N [22].
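The selection-only transition matrix of the single-locus model can be generated directly from the binomial expression for p(i,j). A minimal sketch (the function name is mine):

```python
from math import comb

def transition_matrix(N, r):
    """Selection-only Markov chain of the single-locus GA.
    State i = number of '1's in the population; r = f1/f0.
    p(i,j) = C(N,j) * p1^j * (1-p1)^(N-j), p1 = i*r / (i*r + N - i)."""
    P = [[0.0] * (N + 1) for _ in range(N + 1)]
    for i in range(N + 1):
        p1 = i * r / (i * r + (N - i))
        for j in range(N + 1):
            P[i][j] = comb(N, j) * p1 ** j * (1 - p1) ** (N - j)
    return P

P = transition_matrix(N=2, r=1.0)         # r = 1: pure genetic drift
assert P[0][0] == 1.0 and P[2][2] == 1.0  # all-zeros/all-ones are absorbing
assert P[1] == [0.25, 0.5, 0.25]          # from the mixed state '01'
```

The mixed state drifts to either absorbing state with equal probability when r = 1, which is exactly the behaviour of Figure 5.1.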

In [31] the above model is extended to include niched genetic algorithms based on
fitness sharing (see section 3.1.8). It is reported that in the case of 'perfect sharing',
where the niches do not overlap, the effect of fitness sharing ('niching force') balances
exactly the effect of selection/drift. The niching force is a stabilising one in that it tries
to spread the population out evenly over the niches, as opposed to the effect of
selection/drift. When overlapping niches are examined, as is often the case, it is found
that the niching force dominates for small overlaps but for larger overlaps its influence
decreased. As could be expected the absorption time is significantly larger than for the
simple GA without niching and grows exponentially with the population size.

In [12] the transient behaviour of GAs with a finite population size is modelled using
Markov chains. The concept of the state transition probability matrix P is extended to
a k-step transition matrix P^k. Using P^k, analysis is done on the expected transient
behaviour of simple GAs. Questions like: "What is the probability that the GA
population will contain a copy of the optimum at generation k?" can be answered using
this approach using relatively simple function optimisation problems. Also, expected
waiting time analysis can be performed to answer questions like: "For how many
generations does the GA have to run on average before first encountering the
optimum?" The effects of crossover, mutation and fitness scaling can be seen in the
expected waiting time analysis and useful insights are gained. Future work in this
approach needs to concentrate on scaling up to problems of more realistic size. Also
visualisation techniques to display for example transition matrices are expected to be
of much help in gaining insight into the operation of GAs.
6. Implementing GAs
Although highly problem dependent, some general remarks can be made on what the
options are concerning coding (representation) and genetic operators for a GA system
and how they will affect the performance. A brief overview of the most commonly
used GA settings is given. Aside from coding the most crucial part of the set-up of a
GA is the fitness or evaluation function. General remarks concerning the fitness
function are given as well, but first some general comments are made about the
performance of a GA.

6.1 GA Performance
This section describes some of the main problems found while implementing genetic
algorithms.

• Premature Convergence
A common problem in GAs is premature convergence of the population to a local
optimum. This happens when a super-fit individual, representing a sub-optimal
solution, is chosen to reproduce many times and takes over the population in a few
generations. After this, the only way the GA can overcome this local optimum is by
the (re)introduction of new genetic material by means of mutation. This process is
then just a slow random search.

• Poor Fine Tuning


A second problem in GA performance is that once the population is near the global
optimum, further convergence is very slow. The average fitness will be high and there
is usually little difference between the fitness values. Therefore there is not enough
pressure to push the GA towards the global optimum.

• Epistasis
Problems that are difficult to solve for a GA can generally be classified as problems
with high epistasis. The level of epistasis in a certain problem-coding combination
reflects the dependence of gene expression in the phenotype on the values of other
genes. With epistasis, a specific variation in a gene produces a change in the fitness of
the chromosome depending on the values of other genes. The level of gene interaction


measures the extent to which the contribution to the fitness of a single gene depends
on the values of other genes in the chromosome. In the absence of epistasis a
particular change in a gene always produces the same change in the fitness of the
chromosome. As an example of such a problem consider the case of a binary string
where the fitness is simply equal to the number of ones in the chromosome:

    f(x) = Σ_{i=1}^{l} a_i ,    x = (a_1, ..., a_l), a_i ∈ {0,1}

There is no interaction between the genes at all. The fitness function is a composite of
the contributions of each gene.

A medium level of gene interaction can be defined where the problem is such that a
particular change in a gene always produces a change in the fitness function of the
same sign (or zero). In this case, the change in fitness depends on the values of other
genes. An example of such a problem is one where using binary coding the fitness is
one if all genes are one, and zero otherwise:

    f(x) = 1, if a_i = 1 for all i
           0, otherwise

A high level of interaction can be defined as a problem-coding combination where a


particular change in a gene produces a change in the fitness that varies in magnitude
and in sign depending on the values of other genes. Commonly only problem-coding
combinations of this kind are referred to as having epistasis.
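The three levels of gene interaction can be made concrete with toy fitness functions. The first two are the examples from the text; the third is an illustrative construction of my own, in which the effect of flipping a gene depends on another gene in both sign and magnitude:

```python
def f_none(x):       # no epistasis: fitness is the sum of the genes
    return sum(x)

def f_medium(x):     # medium: one iff all genes are one, zero otherwise
    return 1 if all(g == 1 for g in x) else 0

def f_high(x):       # high (illustrative): gene 0 decides whether the
    ones = sum(x)    # other genes are rewarded or punished
    return ones if x[0] == 0 else len(x) - ones

# Flipping gene 1 to '1' helps when gene 0 is 0, but hurts when gene 0 is 1:
assert f_high([0, 1, 0]) - f_high([0, 0, 0]) == +1
assert f_high([1, 1, 0]) - f_high([1, 0, 0]) == -1
```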

Problems without epistasis do not need a GA solution; a simple hill-climbing


algorithm is sufficient. GAs are useful and successful when there is a medium level of
epistasis. The problems classified by the Building Block Hypothesis as being GA-
deceptive suffer from high epistasis. High epistasis simply means that building blocks
cannot form.

The two obvious ways to tackle problems that have high epistasis are: design a coding
such that the problem becomes one with low or no epistasis, or: design the genetic
operators (crossover and mutation) so that the GA will have no problem with epistasis.
In effect this means that some prior knowledge about the objective function has to be
built into the GA system. Although it can be shown that in theory any high-epistasis

problem can be reduced to one with low or even no epistasis, in practice this is very
hard to do. The effort needed to accomplish this may be greater than is needed to
actually solve the problem.

• Genetic Hill-Climbing
While in theory a genetic algorithm performs a global search of the solution space, in
some implementations the search is not as global as most theory would suggest. For
example in a Steady State GA with ranked replacement (a 'Genitor-type' GA) and a
relatively small population size, the search is often centred around the single fittest
individual. This is due to the very high selective force on above average individuals in
this type of GA. The Genitor-type GA 'pushes' very hard, often becoming stuck in
local optima. The performance of this GA with relatively small population sizes is
sometimes found to be largely independent of the population size, see e.g. [64] where
good solutions to a neural network weight optimisation problem were found even with
a population of size 5. Instead of intelligent hyperplane sampling, the GA basically
performs a local search around the fittest individual. It is said to perform genetic hill-
climbing. This does not necessarily mean that the algorithm does not function well; it
may still outperform conventional hill-climbing algorithms. However, the foundational
work of GAs will have to be extended to include the phenomenon of genetic hill-
climbing. GAs working as a genetic hill-climber are commonly found to require a
relatively high level of mutation for a good performance. The GA is in a way always
in a state of premature convergence and strong mutation is simply needed to make a
transition to another better state.

6.2 Fitness Function


For some problems the construction of the fitness function is obvious. For example in
function optimisation it is simply the function to be optimised. In general the fitness
function should reflect the performance of the phenotype or the 'real' value of an
individual and its choice is far from trivial. A special problem is when it is possible
for the GA to construct genotypes that translate into an invalid or meaningless
phenotype. These chromosomes have zero 'real' value and this should in some way be
reflected in the fitness function (i.e. they should be penalised). In genetic
programming, care is often taken to prevent the genetic operators from producing such
chromosomes. In most GAs however, this is practically impossible and possibly even
harmful to the GA. This is because although these chromosomes are essentially

meaningless, they may contain information that is crucial for the production of highly
fit meaningful chromosomes.

Some researchers say that the fitness function should be smooth and regular and
chromosomes that are close in the representation space should have similar fitness
values. If this were the case however, a simple hillclimbing algorithm could always
find the optimum and a genetic algorithm would not be needed. In practice the fitness
function will typically contain many local optima making it hard for a hillclimbing
algorithm to find the global optimum. However a fitness function may be constructed
that minimises the effect of local optima and so improves the performance of a GA.

6.3 Coding
In the early days of the field of genetic algorithms researchers practically always used
a binary coding scheme following the example of Holland. Since then many variations
have been used such as real-valued and symbolic coding.

A simple analysis of the Schema Theorem seems to suggest that alphabets of low
cardinality (small alphabets) yield the highest rate of schema processing: the amount
of implicit parallelism is the highest. This is because the number of schemata available
for genetic processing is the highest when low cardinality alphabets are used. Since
genetic algorithms are run on computers, all genetic information is ultimately stored as
bits. Supposing we have an alphabet of cardinality k, then each position in the
chromosome can represent k + 1 schemata. Since each position represents log₂k bits,
the number of schemata per bit of information, n_s, is:

    n_s = (k + 1)^(1 / log₂ k)

Therefore it is easy to prove [20] that chromosomes coded with smaller alphabets
represent a larger number of schemata than ones with larger alphabets. Because of this
apparent higher rate of schema processing of small alphabets traditionally only binary-
coded chromosomes were used. However, GAs using codings with high cardinality
alphabets (large alphabets) such as real-valued codings have been shown to work well
in certain applications. Goldberg [20] reconciles this with the Schema Theorem as
described in a following section.
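The schemata-per-bit formula can be evaluated for a few alphabet sizes to confirm that binary (k = 2) maximises it:

```python
from math import log2

def schemata_per_bit(k):
    """n_s = (k + 1)^(1 / log2(k)): schemata represented per bit of
    information for an alphabet of cardinality k."""
    return (k + 1) ** (1 / log2(k))

for k in (2, 4, 16, 256):
    print(k, schemata_per_bit(k))
# k = 2 gives 3.0; the rate then decreases monotonically towards 2 as k grows
```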

Intuitively, the coding should be such that the representation and the problem space
are close together; i.e. a 'natural' representation of the problem. This also allows the
incorporation of knowledge about the problem domain into the GA system in the form
of special genetic operators. The GA can then be made more problem specific and
achieve better performance. One concept of particular importance is that points that
are close together in the problem space should also be close together in the
representation space.

6.3.1 Binary Coding


Some problems can be expressed very efficiently using binary coding:

    x = (a_1, ..., a_l), a_i ∈ {0,1}

However the choice of which coding to use is usually not straightforward. Quite often
the problem to be solved is integer or real-valued and genes can be chosen to be
binary or real-valued. An example of a binary coding for a real-valued problem is the
following. Suppose the problem is to optimise a function f(x₁,x₂,x₃) that takes real-
valued arguments: x₁,x₂,x₃ ∈ [0,1]. A chromosome has to represent these three
arguments. Each argument can for example be represented by 8 bits, making the
chromosome a binary-valued string of length 24. In 'standard' binary coding the
substring 00000000 will correspond to the value 0.0, 00000001 to 1/256 = 0.0039, ... ,
and 11111111 to 1.0. An example of such a binary-coded chromosome is:

    x = (00000001 00101000 10110001)
           ↓          ↓          ↓
      0.00390625   0.15625   0.69140625

Figure 6.1 Example of a binary-coded chromosome representing three real-valued
parameters
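Decoding such a chromosome is a matter of slicing and scaling. The sketch below reproduces the figure's values by mapping each 8-bit substring with integer value v to v/256 (note this makes 11111111 decode to 255/256 rather than exactly 1.0; the surrounding text and the figure use slightly different scalings, and the function name is illustrative):

```python
def decode(bits, bits_per_param=8):
    """Decode a binary chromosome into real parameters in [0, 1),
    mapping each 8-bit substring with integer value v to v / 256."""
    step = bits_per_param
    return [int(bits[i:i + step], 2) / 2 ** step
            for i in range(0, len(bits), step)]

print(decode('000000010010100010110001'))
# [0.00390625, 0.15625, 0.69140625] as in Figure 6.1
```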

Some authors use the term gene for a substring of 8 bits representing a single real-
valued argument. This is not very appropriate, since a gene in GA theory is considered
to be an unchangeable piece of data. In this example the standard GA will work on the
bitstring without any knowledge of the actual representation within it. A gene simply
corresponds to a single bit.

As can be seen in this example the effect on the phenotype of changing a single bit in
the chromosome depends on its position within the string. When the right-most bit of
an 8 bit substring is changed the effect is very small, but when the left-most bit is
changed the effect is relatively quite large.

In this coding, the Hamming distance (the number of different bit positions) between
two individuals does not reflect the distance between the two in the problem space.
This makes it very difficult for the mutation operator to make small changes in the
values represented in the chromosome and is the reason why Gray coding is often used
instead of 'normal' binary coding to code real values. In Gray coding adjacent real
values (or integer values) differ from each other in only one bit position. Going
through the real values represented by a Gray coding from low to high only requires
flipping one bit at a time. GAs using Gray coding are often found to perform better
than ones with standard binary coding when solving real-valued problems.
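The standard reflected binary Gray code and its inverse are short functions, and the defining property — adjacent values differ in exactly one bit — is easy to verify:

```python
def binary_to_gray(n):
    """Reflected binary Gray code of the integer n."""
    return n ^ (n >> 1)

def gray_to_binary(g):
    """Inverse mapping: undo the XOR cascade."""
    n = 0
    while g:
        n ^= g
        g >>= 1
    return n

# Adjacent integers differ in exactly one bit of their Gray code:
assert all(bin(binary_to_gray(i) ^ binary_to_gray(i + 1)).count('1') == 1
           for i in range(255))
assert all(gray_to_binary(binary_to_gray(i)) == i for i in range(256))
```

Under this coding a single mutation can always move the represented value by one step, which standard binary coding cannot guarantee (e.g. going from 011 to 100 requires three simultaneous flips).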

6.3.2 Real-Valued Coding


When real-valued coding is used chromosomes are simply a string of real values:

x = (a_1, ..., a_l),   a_i ∈ ℝ

The values of the genes can be bounded: a_i ∈ [min, max], min, max ∈ ℝ.

In [20] Goldberg gives an explanation for the success and failure of real-valued
codings based on the Schema Theorem by suggesting that the GA breaks the original
coding into 'virtual alphabets' of higher cardinality. The real-valued GA still has a
high rate of implicit parallelism.

Advantages of using real-valued coding over binary coding include increased
precision. Binary coding of real-valued numbers can suffer from loss of precision
depending on the number of bits used to represent one number. Also in real-valued
coding chromosome strings become much shorter. Since it is often thought that GAs
fail for larger problems because of the very large string-sizes involved, this can be an
important aspect. Furthermore real-valued coding gives a greater freedom to use
special crossover and mutation techniques that can be beneficial to the performance of
the GA. Some of these issues are discussed in section 6.5. A final point is that for real-
valued optimisation problems real-valued coding is simply much easier and more
efficient to implement, since it is conceptually closer to the problem space. In [48] and
[65] GA performance on real-valued optimisation problems was compared between
binary and real-valued coding; the real-valued coding produced better results.

6.3.3 Symbolic Coding

Symbolic coding treats genes as symbols belonging to a certain alphabet.

x = (a_1, ..., a_l),   a_i ∈ {a, b, c, ...}

Often the genes are simply implemented as unsigned integer values taken from a
certain range. The main characteristic of symbolic coding is that there is no measure
of distance between two symbols. For example symbols that are 'adjacent' in the
alphabet are not considered to be closer to each other than any other two symbols.

6.3.4 Non-Homogeneous Coding


The codings described above are all homogeneous in the sense that all the genes in the
chromosome have identical codings (i.e. one alphabet A). It is possible for a
chromosome to have different parts each having their own coding. This can be seen as
the general case of chromosomal representation.

x = (a_1, ..., a_l),   a_i ∈ A_i

Homogeneous coding is the special case where A_i = A_j for all i, j. Alternatively, a
chromosome could consist of a part that is binary coded (A = {0,1}) and a part that
uses real-valued coding (A = ℝ):

x = (a_1, ..., a_m, a_{m+1}, ..., a_l),   a_i ∈ {0,1} for 1 ≤ i ≤ m,   a_i ∈ ℝ for m < i ≤ l

This type of coding poses extra constraints on the genetic operators. The normal
crossover operator can still be applied even when a substring is swapped that contains
more than one coding because it leaves the genes intact. The mutation operator has to
be changed in that it re-initialises a gene depending on the coding of that gene since
the alphabets of the different codings will not be the same.

6.4 Selection Schemes


A brief overview of the various selection or reproduction schemes is given here. In
[21] a comparative analysis is given of these selection schemes.

6.4.1 Proportionate Reproduction


Proportionate reproduction schemes select individuals based on their fitness relative to
the rest of the population. The Roulette Wheel or Monte Carlo selection scheme is the
most widely used and was described in section 3.1.3. Other proportionate
reproduction schemes include stochastic remainder selection. In the stochastic
remainder selection scheme the expected number of copies for each individual,
E(x) = f(x)/f̄ (where f̄ is the average fitness of the population), is calculated and the
integer portions of these counts are assigned deterministically. The remainders are
used to fill the rest of the population probabilistically, usually by means of a Roulette
wheel selection scheme.
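Stochastic remainder selection might be sketched as follows (an illustrative implementation assuming non-negative fitnesses to be maximised; not taken from the text):

```python
import random

def stochastic_remainder_selection(fitnesses, rng=random.Random(0)):
    """Return indices of selected parents: the integer part of
    E(x) = f(x)/f_avg is assigned deterministically, the fractional
    remainders fill the rest of the pool via a roulette wheel."""
    n = len(fitnesses)
    avg = sum(fitnesses) / n
    expected = [f / avg for f in fitnesses]
    selected = []
    for i, e in enumerate(expected):
        selected += [i] * int(e)                 # deterministic part
    remainders = [e - int(e) for e in expected]
    while len(selected) < n:                     # probabilistic part
        r = rng.uniform(0, sum(remainders))
        cumulative = 0.0
        for i, rem in enumerate(remainders):
            cumulative += rem
            if r <= cumulative:
                selected.append(i)
                break
    return selected
```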

One way to overcome the problems mentioned in section 6.1 is to use a fitness
remapping scheme. Before individuals are selected using proportionate reproduction,
their raw fitnesses are remapped to new values. Two such techniques are described
below.

• Fitness Scaling
Instead of using the actual fitness values in the selection mechanism, the fitness values
of all individuals are scaled to a certain range. This is commonly done by first
subtracting a fixed value from the fitnesses. These fitnesses are then divided by their
average value to produce the adjusted fitnesses. When fitness scaling is used the
amount of relative selective pressure on an individual can be controlled. Very fit
individuals no longer produce excessive numbers of offspring. There is a price to be
paid however. When there is a single super-fit (or super-unfit) individual in the
population, fitness scaling leads to overcompression. When just one individual has a
fitness many times higher than any other, fitness scaling will result in flattening out the
fitness distribution of the rest of the individuals. They will obtain near identical fitness
values, and the difference in selective pressures between them will almost be lost.
Performance suffers when individuals with extreme fitness values are present.

Fitness normalisation is similar to fitness scaling but instead of remapping the
fitnesses to a certain interval, the fitnesses are remapped so that their total equals a
certain value (usually 1).

• Fitness ranking
The dominance of extreme individuals may be overcome by fitness ranking. Here
individuals are ranked based on their fitness and then the new reproductive fitness
values are given to them based solely on their rank, usually using a linear function
(linear ranking). Similar to fitness scaling, fitness ranking ensures that the ratio of
maximum to average fitness is fixed. However it also spreads out the remapped
fitnesses evenly over the interval. The problem of overcompression is gone. It no
longer matters whether the fittest individual is extremely fit or only just fitter than its
nearest competitor. By means of the ranking function the selective pressure of
individuals relative to each other can be controlled. Non-linear ranking may also be
used where the ranking function is such that the remapped fitness of an individual is
for example an exponential function of its rank (exponential ranking).
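Linear ranking can be sketched as below (an illustrative implementation using a common parameterisation with selection pressure sp ∈ [1, 2]; the parameter names are ours, not the text's):

```python
def linear_ranking(fitnesses, sp=1.5):
    """Remap raw fitnesses to rank-based reproductive fitnesses.
    The worst individual receives 2 - sp, the best sp, with a
    linear slope in between; the average remapped fitness is 1."""
    n = len(fitnesses)
    order = sorted(range(n), key=lambda i: fitnesses[i])  # worst..best
    remapped = [0.0] * n
    for rank, i in enumerate(order):
        remapped[i] = (2 - sp) + (2 * sp - 2) * rank / (n - 1)
    return remapped

# Overcompression is gone: a super-fit individual receives the same
# remapped fitness as one that is only marginally fitter.
assert linear_ranking([10, 1000, 5]) == linear_ranking([10, 11, 5])
```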

6.4.2 Tournament Selection


In this process, a number of individuals, set by the tournament size, is selected from
the population at random. In deterministic tournament selection, the best fit individual
of this tournament is then chosen to reproduce. The winning individual can also be
chosen probabilistically. The simplest version, binary tournament selection, has a
tournament size of two. The selection pressure on individuals can be adjusted by the
tournament size or the 'win-probability'. Larger tournament sizes increase selection
pressure, since higher fitness individuals have more chance to win tournaments.
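A sketch of probabilistic tournament selection (illustrative code; the parameter names are our own):

```python
import random

def tournament_select(population, fitnesses, k=2, p_win=1.0,
                      rng=random.Random()):
    """Draw k random contenders; with probability p_win the fittest
    contender wins, otherwise a random contender is returned."""
    contenders = rng.sample(range(len(population)), k)
    if rng.random() < p_win:
        winner = max(contenders, key=lambda i: fitnesses[i])
    else:
        winner = rng.choice(contenders)
    return population[winner]
```

With k = 2 and p_win = 1 this reduces to deterministic binary tournament selection; raising k or p_win increases the selection pressure.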

In Steady State Genetic Algorithms using tournament selection the 'doomed'
individuals are selected as the losers of a tournament. They can be chosen from the
same tournaments as their parents or from independent tournaments.

6.4.3 Steady State Genetic Algorithms


In [21] population dynamics using birth, life, and death rates are used to support the
viewpoint that the Genitor-type Steady State Genetic Algorithm is just a special
selection mechanism instead of a different kind of GA. It is suggested that the
behaviour of the Genitor-type GA is primarily caused by its extreme 'pushing' force
or selective pressure of above average individuals. The growth rate of the fittest
individual is very high, where the growth rate is defined as the proportion the
individual takes up in the mating pool relative to the proportion it takes up in the
current population. A normal GA can also gain high growth rates when an appropriate
selection scheme is used (such as non-linear ranking), and it is suggested that it should
then show similar behaviour to the Genitor-type GA.

6.5 Crossover, Mutation and Inversion


Some extensions of the basic crossover and mutation operators are described here.
Very often the genetic operators have been adjusted to suit a specific problem and/or
coding [65] so that general remarks can not be made.

In the standard GA, crossover and mutation operators work with fixed rates. Lately
much work is being done on adaptive rates for these operators. Concepts being
investigated include coding the values of the rates in the chromosomes to let the GA
find optimum values or using a diversity measure of the population to control the
rates. For example when the diversity is very low, new genetic information can be
introduced by setting the mutation operator temporarily to a high value. Yet another
idea comes from the field of simulated annealing where the rates of the operators are
controlled using a 'cooling scheme'.

6.5.1 Crossover
The standard one-point, two-point and uniform crossover operators were described in
section 3.1.3 for binary-valued chromosomes. They can be used in the same form for
any coding. These crossover operators swap entire genes or series of genes between
individuals and therefore can never change the value of a gene.

For real-valued chromosomes an exception to this is the so-called linear crossover
operator. Introduced in [65], it is implemented there in the following manner. When
two parents, x1 and x2, are chosen to reproduce, three new points in the search space
are made. One point is the midpoint between x1 and x2, found by taking the average of
all the gene values of both parents; i.e. (x1+x2)/2. The other two points also lie on the
line through x1 and x2 and are (3x1-x2)/2 and (-x1+3x2)/2. The two offspring are
determined as the best two of these three points. The idea behind this process is that
when an optimum in the search space lies between the two parents, it will not be
reached using normal crossover. Normal crossover only swaps coordinates; i.e.
samples hyperplanes. Linear crossover however produces offspring that lie in between
their parents. Linear crossover is probably better classified as a mutation operator,
since it introduces new genetic material. It is highly disruptive of schemata; all the
schemata contained in the parents can be lost. In [65] GAs that used a mixture of
normal and linear crossover (a chance of 50% each) gave superior results on a variety
of problems when compared to GAs with only one type of crossover operator.
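The three candidate points of linear crossover can be sketched as follows (an illustrative implementation; `evaluate` is assumed to return a fitness to be maximised):

```python
def linear_crossover(x1, x2, evaluate):
    """Build the midpoint and the two outer points on the line through
    the parents, then keep the best two of the three as offspring."""
    midpoint = [(a + b) / 2 for a, b in zip(x1, x2)]
    outer1 = [(3 * a - b) / 2 for a, b in zip(x1, x2)]
    outer2 = [(-a + 3 * b) / 2 for a, b in zip(x1, x2)]
    candidates = sorted([midpoint, outer1, outer2],
                        key=evaluate, reverse=True)
    return candidates[0], candidates[1]

# An optimum lying between the parents can now be reached:
child1, _ = linear_crossover([0.0, 0.0], [2.0, 2.0],
                             lambda v: -(v[0] - 1) ** 2)
assert child1 == [1.0, 1.0]
```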

6.5.2 Mutation
The normal mutation operator as described in section 3.1.3 for binary coding is easily
extended to any representation. With probability pm it will re-initialise the value of a
gene. The set of possible gene values will usually be the same as that used for
initialising; i.e. the alphabet.

When the chromosomes consist of real-valued genes, another form of the mutation
operator may be used. Instead of re-initialising the value of the gene, a small randomly
selected value (usually Gaussian) is added to it. This version of the mutation operator
is called creeping mutation. It can be seen as a local search mechanism within the GA
and can operate simultaneously with the normal mutation operator. When creeping
mutation is the only mutation operator used in the GA, it is empirically found that the
mutation rate should be much higher than usual and mutation rates of up to 0.1 have
been used. With creeping mutation, gene values can be obtained that lie outside the
range of the initial population. If the genes are restricted to a certain range (a, e
[min,max]) then the creeping mutation operator can simply be altered so that it can not
take a gene value beyond that range.
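Creeping mutation might be sketched as follows (illustrative code; the Gaussian step size and default rates are our assumptions):

```python
import random

def creeping_mutation(chromosome, p_m=0.1, sigma=0.1,
                      bounds=None, rng=random.Random()):
    """With probability p_m add a small Gaussian value to each gene,
    clipping to [min, max] when the genes are bounded."""
    mutated = []
    for gene in chromosome:
        if rng.random() < p_m:
            gene += rng.gauss(0.0, sigma)
            if bounds is not None:
                lo, hi = bounds
                gene = min(max(gene, lo), hi)
        mutated.append(gene)
    return mutated
```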

6.5.3 Inversion
While not part of the standard GA toolkit, inversion is often added to a GA system to
operate alongside crossover and mutation. It is called upon in a similar way as
crossover and mutation and operates on a single chromosome. A chromosome is
inverted with probability p_i. Inversion randomly picks two points in the chromosome
and inverts the order of the substring between these points. The 'meaning' of the
chromosome remains the same however. The only thing that is changed is the order of
the coding. Inversion requires that genes carry labels. For example a gene a_i^j
represents the j-th gene having a value a_i. The order in which the genes appear in the
chromosome does not have to correspond to the label order. Inversion is illustrated below:

( a_1 a_2 | a_3 a_4 a_5 | a_6 a_7 )  →  ( a_1 a_2 | a_5 a_4 a_3 | a_6 a_7 )



Both chromosomes before and after inversion code exactly the same information and
represent the same phenotype; only the order of the genes is changed. The order of the
genes can play an important role in the GA performance. The building block
hypothesis for example requires that related genes should be close together on the
chromosome in order for building blocks to form. Inversion is an operator that
changes the order of the genes and can therefore improve GA performance in some
circumstances.
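With genes stored as (label, value) pairs, inversion and its phenotype-preserving property can be sketched as follows (illustrative code, not from the text):

```python
import random

def invert(chromosome, rng=random.Random()):
    """Reverse the order of a random substring of (label, value) genes."""
    i, j = sorted(rng.sample(range(len(chromosome) + 1), 2))
    return chromosome[:i] + chromosome[i:j][::-1] + chromosome[j:]

def phenotype(chromosome):
    """Decode gene values in label order, ignoring gene positions."""
    return [value for label, value in sorted(chromosome)]

# Inversion reorders the coding but leaves the decoded phenotype intact:
genes = [(1, 'a'), (2, 'b'), (3, 'c'), (4, 'd'), (5, 'e')]
assert phenotype(invert(genes)) == phenotype(genes)
```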
7. Hybridisation of Evolutionary
Computation and Neural Networks
Evolutionary Computation (EC) can be used in the field of neural networks in several
ways. It can be used:

• to train the weights of a neural network
• to analyse a neural network
• to generate the architecture of a neural network
• to generate both the neural network architecture and its weights

Each of these approaches is briefly described below; also see [62]. The last two of
these are dealt with together as many remarks concern both and the distinction is often
subtle.

7.1 Evolutionary Computing to Train the Weights of a NN
The genetic algorithm may be used to optimise the weights of a neural network and
provides an alternative to a learning algorithm such as back propagation (BP) which
often gets stuck in local minima. The genetic algorithm generally performs a global
search of the weight space and therefore is unlikely to get stuck in a local minimum.
Genetic algorithms (or evolutionary computation in general) do not use error-gradient
information and, unlike algorithms such as BP, they can be used where this information is
not available or is computationally expensive. It also means that the activation
function of the neurons does not have to be differentiable or even continuous. Genetic
algorithms can in principle be used to train any type of neural network including fully
recurrent networks. A problem often encountered with genetic algorithms is that they
are quite slow in fine tuning once they are close to a solution. Therefore the
hybridisation of GA and BP, where BP is used to fine-tune a near-optimal solution
found by GA, has proven to be successful [29].


The members of the population are the weights of the network which are coded as
strings. When real valued weights are used, they are often coded into a binary string
using a binary or a Gray coding mechanism, although real-valued coding is also
possible. The fitness measure is normally calculated as the performance error of the
network on the training data. The genetic algorithm can then be classified as a
supervised learning algorithm.
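As an illustration, fitness evaluation of a weight chromosome might look as follows for a tiny fixed 2-2-1 feedforward network (a hypothetical sketch; the weight ordering and sigmoid activations are our assumptions, not a method from the text):

```python
import math

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

def forward(weights, x, n_hidden=2):
    """Fixed 2-input, 2-hidden, 1-output net; the chromosome supplies
    each hidden neuron's two weights and bias, then the output layer's."""
    idx, hidden = 0, []
    for _ in range(n_hidden):
        hidden.append(sigmoid(weights[idx] * x[0]
                              + weights[idx + 1] * x[1]
                              + weights[idx + 2]))
        idx += 3
    out = sum(weights[idx + i] * h for i, h in enumerate(hidden))
    return sigmoid(out + weights[idx + n_hidden])

def fitness(chromosome, training_data):
    """Negative sum-squared error on the training set: higher is better."""
    return -sum((forward(chromosome, x) - t) ** 2 for x, t in training_data)
```

A GA would then evolve length-9 chromosomes of real-valued (or binary/Gray-coded) weights, ranking them by this fitness.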

In [44] a GA is used to evolve ecological neural networks that can adapt to their
changing environment. This is achieved by letting the fitness function, which in this
case is specific to each individual, co-evolve with the weights of the network. A
special feature of this research is that there is no reinforcement for 'good' behaviour
of the network; the network just tries to model or adapt to the world in which it lives.
The system can be classified as open-ended evolution.

De Garis [18] uses a method which is based on fully self-connected neural network
modules. It is shown that using this approach a network can be taught a task even
though the time-dependent input varies so fast that the network never settles down.
The system does not use a crossover operator (it could therefore be called
evolutionary programming) and is used to teach a pair of sticks to walk.

In [28] and [52], a genetic algorithm is used on a fixed three layer feedforward
network to find the optimal mapping from the input to the hidden layer (i.e. the set of
optimal hidden targets). In the evaluation phase, the weights from the hidden to the
output layer are learned using a simple supervised gradient descent rule. The search
space is not the weight space but the hidden target space. It is suggested in [28] that
the hidden target space might have more optima than the weight space and that finding
the optimum will therefore be easier.

In [49], [50] and [64], instead of binary coded chromosomes, chromosomes with real-
valued genes are used. Satisfactory results are reported using a Genitor type Steady
State Genetic Algorithm with a relatively small population size of 50. Instead of the
normal mutation operator, creeping mutation is used where a small random value is
added to the gene. In [49] several special genetic operators are investigated such as a
crossover operator that swaps groups of weights corresponding to a neuron. This
specific operator did not show any obvious improvement. Experiments with
decreasing population size strongly suggest that genetic hill-climbing (see section
6.1.1) is the main search mechanism in these implementations. Even with a population
size as small as 5, good results were obtained. The genetic algorithm is said to
outperform back propagation on certain problems that require a neural network with
over 300 connections.

7.2 Evolutionary Computing to Analyse a NN


Although this combination of GAs and NNs is not common, GAs can be used to
analyse or explain neural networks. In [13] GAs are used as a neural network
inversion tool in that they can find the input patterns that yield a certain output of the
network.

7.3 Evolutionary Computing to Optimise a NN Architecture and its Weights
In this hybridisation of neural networks and evolutionary computation the architecture
of the neural network is automatically generated using evolutionary computation. An
individual in the population of the evolutionary computation algorithm codes a neural
network structure and sometimes its weights. During evaluation an individual is
translated into a neural network structure. Commonly this network is then trained
using a separate training module such as back propagation or an EC weight
optimisation algorithm. This is illustrated in Figure 7.1.

Figure 7.1 Using Evolutionary Computation to optimise a NN architecture

As the chromosomes usually do not contain information concerning the weights of the
network, these have to be set to an initial (random) value. After a network is trained it
is evaluated on its performance, which is reflected in the fitness measure. The
performance measure can simply be the overall error on the training data, but often
reflects other properties such as network size as well. Instead of testing the network on
the training data it can be tested on (real) test data as well. In a real-life application
however the actual test data will not be available until the network is used for its task
(otherwise it might as well be included in the training data). The training set can be
divided into two parts though, one part serving as training data for the training module
and the other part used in the evaluation phase as a test of the generalisation
performance of the network.

However, the EC system of Figure 7.1 can in theory act as a real-time system where
networks are evaluated on real data taken from the environment and the end-user
simply uses the best individual found so far. The EC system is run continuously and
when a better individual is found, the one currently in use is replaced. This is an
evolutionary adaptive system operating in a non-stationary environment, pictured in
Figure 7.2. It will not be of any use in a stationary environment because once the
optimum network is found the EC system becomes redundant.

Figure 7.2 A hypothetical real-time EC system generating neural networks in a
non-stationary environment

The network as used by the end-user, of course, operates in the same non-stationary
environment. In practice such a system would be very hard to implement. The EC
system for example must have some way of testing the networks it generates on the
task they are going to be used for by the end-user in order to determine their fitness
values. Thus a model of the task dictated by the end-user has to be built into the EC
system. Furthermore special care has to be taken in implementing the specific EC
system to avoid premature convergence. Also the information that is fed into the EC
system from the environment has to be chosen carefully, since, if it is only based on the
present situation, useful information that was observed in the past may be lost. As
described in section 4.4.1 there is evidence suggesting a very similar process as
pictured in Figure 7.2 might be at work in the human brain. The brain makes a model
of the outside world and ideas are generated and tested according to a Darwinian
process resulting in the fittest one actually being used.

The performance of a neural network usually depends on the values of the initial
weights. Therefore the networks should be trained several times using different
random initial weights each time and the results averaged, in order to get a good
performance measure. This can cause the approach to become very slow, see e.g. [4].
In some applications the generation of the network architecture is done simultaneously
with the learning of the weights. The chromosomes not only code the architecture of
the network but they also code the values of the weights.

There are several ways to encode a neural network architecture as a chromosome that
can be used by an evolutionary computation algorithm. These methods can be divided
into the following approaches:

• direct encoding
• parametrised encoding
• grammar encoding

These methods as well as their applications are described below in more detail.

7.3.1 Direct Encoding


In the direct encoding methods the entire neural network structure is directly
represented by the chromosome. Every connection of the network is directly
represented in the chromosome, usually by a single gene. Direct encoding is often
implemented by encoding the network as a connectivity matrix. This matrix has size
N × N, where N is the maximum number of neurons in the network. A '1' in the matrix
denotes a connection between two neurons, a '0' no connection. When a feedforward
neural network is coded using a connectivity matrix, only the upper right half of the
matrix is used. A common problem found when using a direct encoding method is that
the chromosomes become very large with increasing network size. This usually results
in poor convergence of the algorithm.
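For a feedforward network, the connectivity matrix and the chromosome derived from it might be sketched as follows (illustrative code; the function names are ours):

```python
import itertools
import random

def random_feedforward_matrix(n, rng):
    """Upper-triangular 0/1 connectivity matrix for n neurons; the
    all-zero lower half (and diagonal) forbids recurrent connections."""
    m = [[0] * n for _ in range(n)]
    for i, j in itertools.combinations(range(n), 2):
        m[i][j] = rng.randint(0, 1)
    return m

def matrix_to_chromosome(m):
    """Flatten the upper triangle into the bitstring the GA works on."""
    n = len(m)
    return [m[i][j] for i, j in itertools.combinations(range(n), 2)]

# The chromosome holds n(n-1)/2 genes, so it grows quadratically with n:
assert len(matrix_to_chromosome(random_feedforward_matrix(10, random.Random(0)))) == 45
```

The quadratic growth of the chromosome is exactly the scaling problem noted above.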

An alternative approach is to use a genetic algorithm where the topology and weights
are encoded as variable-length binary strings [46]. In [11] a structured GA is used that
simultaneously optimises the neural network topology and the values of the weights. A
two-level genetic structure represented by a single binary string is used where one
level defines the connectivity (topology) and the other the values of the weights and
biases. It was found that although the algorithm worked well on small problems like
XOR, it could not scale up properly to bigger real-world problems.

In [5] feedforward neural networks are generated with a GA, using a direct encoding
scheme where every gene in a chromosome represents a connection between two
neurons. The problem of competing conventions is tackled here by introducing
connection specific distance coefficients in the genetic material. For each functional
mapping or phenotype, the structural mapping or genotype with the shortest total
connection length is preferred. This approach is also known as 'restrictive mating' [1]
and is one of the niching methods described in section 3.1.8. In this way, some of the
topology information of the phenotypes (the actual neural networks) is incorporated
into the genotypes. A disadvantage of the system proposed is that a maximum size
neural network topology, including the number of hidden layers used, needs to be
specified in advance.

Jacob and Rehder [32] use a grammar-based genetic system, where topology creation,
neuron functionality creation (e.g. activation functions) and weight creation are split
up into three different modules, each using a separate GA. The modules are linked to
each other by passing on a fitness measure. The grammar used is such that a neural
network topology is represented as a string consisting of all the existing paths from
input to output neurons. This is not a grammar encoding method such as the ones
described in section 7.3.3 as no grammar rewriting or production rules are encoded.

In [26] Happel and Murre report an approach where modular neural networks are
generated using a direct encoding scheme. The system implements modularity, where
modularity means the grouping of certain neurons in the network into a module.
When such a module of neurons is connected to another module, all the neurons in the
two modules are connected to each other. An advantage of using modular neural
networks is that the weight space of the network is reduced. This has a positive effect
on both the generalisation capability and the time needed to train the network. The
networks used are made up of so-called CALM modules and are used for unsupervised
categorisation.

Angeline et al. [3] implemented a system based on evolutionary programming where
networks evolve using both a parametric mutation (mutation of the weights) and a
structural mutation. It is argued that EP is a better choice for this task than GA, mainly
because it is not clear that there exists an appropriate interpretation function between
the recombination and evaluation space for the application of neural network design.
In [47] EP is used where the initial network is a 3 layered fully connected feedforward
network and the EP algorithm is used to prune connections.

Genetic programming offers another approach to a direct encoding scheme. This
approach is described in [40] and [41], and consists of directly encoding a neural
network in the genetic tree structure used by genetic programming. The neural
network topology as well as the values of the weights are encoded in the chromosomes
and they are trained simultaneously. Only tree-structured neural network architectures
can be generated using this method.

In [41] the genetic programming approach is extended to a system called Breeder
Genetic Programming (BGP). The neural networks are still represented by trees but
less restrictions are put on the possible architectures. However only integer valued
weights and bias values are possible. Since the hidden neurons are defined in a
tree-like fashion and each output is represented by its own genetic tree, there still are
restrictions on the possible architectures. The method uses Occam's Razor where less
complex neural networks are preferred over more complex ones, providing they have
similar performances. This offers a way to balance neural network accuracy and
structural complexity.

7.3.2 Parametrised Encoding


The above methods can be classified as 'strong' or low-level representations because
the complete network topology is coded in the chromosomes. When 'weak' or high-
level representations are used, the chromosomes do not contain the complete network
topology. Instead they consist of more abstract terms, like 'the number of hidden
neurons' or 'the number of hidden layers'. This usually results in a limitation of the
possible neural network architectures; e.g. only full connectivity between the layers
and no connections from input neurons to output neurons.

In [1] a modular design approach is used where a distinction is made between
structure, connectivity and weight optimisation. The network structure here is defined
as the number of layers and the number of neurons in every layer.

7.3.3 Grammar Encoding


Grammar based systems are often found to perform better than methods using a direct
encoding method for larger sized neural networks. This is due to the fact that when a
direct encoding method is used, the chromosomes, and accordingly the search space
for the algorithm, become very large as the network size is increased. When grammar
encodings are used this is not the case and these methods usually show much better
performances when the network size is scaled up. Grammar encodings used to
represent a neural network architecture can be divided into matrix grammars and
graph grammars.

Kitano [38], [39] uses a GA-based matrix grammar approach where chromosomes
code grammar rewriting rules that can be used to build the connectivity matrix. A rule
in this grammar rewrites a single character into a 2×2 matrix. After a certain number

of rewriting cycles the connectivity matrix is formed. The method automatically
generates neural networks that show many regular connection patterns, and can thus
be said to show some sort of network modularity. Drawbacks of this approach include
that in order to generate a network with N neurons, a matrix is needed that has size
M×M, where M is the smallest power of 2 greater than N. Also, when feedforward
neural networks are generated, more than half of the connectivity matrix is unused.
Another drawback is that the method is not clean, in that rewriting rules may be
generated that rewrite the same character, or for some characters no rewriting rules at
all may be found. One more weak point is that networks can be generated that have
neurons without incoming or without outgoing connections and therefore a pruning
algorithm is needed. In [39] Kitano points to a possible solution to the matrix size
problem by using flexible size matrix rewriting. Here he also lets the genetic algorithm
generate the initial weights of the network that is trained using back propagation.
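The rewriting step can be illustrated with a toy rule set (hypothetical rules of our own, not Kitano's; each non-terminal expands into a 2×2 block of symbols and terminals map to connectivity-matrix entries):

```python
# Hypothetical rule set: 'S' is the start symbol; lower-case symbols
# are terminals that map directly to 0/1 matrix entries.
rules = {
    'S': [['A', 'B'], ['C', 'D']],
    'A': [['c', 'a'], ['a', 'a']],
    'B': [['a', 'a'], ['a', 'a']],
    'C': [['a', 'a'], ['a', 'a']],
    'D': [['a', 'a'], ['a', 'b']],
    'a': 0, 'b': 1, 'c': 1,
}

def expand(symbol, cycles):
    """Rewrite a symbol for `cycles` steps into a 2^cycles square matrix."""
    if cycles == 0:
        return [[rules[symbol]]]          # terminal: a single 0/1 entry
    block = rules[symbol]                 # 2x2 block of symbols
    sub = [[expand(s, cycles - 1) for s in row] for row in block]
    top = [l + r for l, r in zip(sub[0][0], sub[0][1])]
    bottom = [l + r for l, r in zip(sub[1][0], sub[1][1])]
    return top + bottom

matrix = expand('S', 2)   # 4x4 connectivity matrix
assert len(matrix) == 4 and matrix[0][0] == 1 and matrix[3][3] == 1
```

In Kitano's system the GA evolves the right-hand sides of such rules rather than the matrix itself, which keeps the chromosome length independent of the network size.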

Gruau [24], [25] uses a graph grammar system called Cellular Encoding. The graph
grammar rules work directly on the neurons (called cells) and their connections and
include various kinds of cell divisions and connection pruning rules. The grammar
rules are coded in a tree structure, and a genetic programming system is used. The
values of the binary weights and of neuron bias values can be coded into the
chromosomes as well. For some problems the generated boolean networks are further
trained using a combination of back propagation and the so-called bucket brigade
algorithm. The approach can generate networks that are highly modular, where
modularity is defined as follows: consider a network N1 that includes, at many
different places, a copy of the same subnetwork N2; an encoding scheme is modular if
the code of N1 includes the code of N2 only a single time. Experiments show that the
system can be used to generate modular boolean neural networks of large size. This
approach is therefore especially useful when the problem to be solved shows a great
deal of modularity in the repetitive use of functional groups.

Boers and Kuiper [4] use a graph grammar system based on a class of fractals called
L-systems. The chromosomes used in the genetic algorithm code the production rules
of this grammar. The system generates modular feedforward neural networks where
modularity is evident in the grouping together of neurons in a module (see also the
description of Happel and Murre's work in section 7.3.1). The networks generated are
again trained using a back propagation algorithm. Drawbacks include the need for a
repair mechanism because of possible faulty strings and the extremely long
convergence times. The method does not scale up well to larger problems.
100 Chapter 7. Hybridisation of Evolutionary Computation and Neural Networks

In [8] and [55] a quite different approach is presented. Neural networks are viewed as
physical objects in a two-dimensional space and are represented by a single cell ([8])
and various parameters concerning the growth process of the cell and a set of rules for
cell reproduction. The translation from genotype to phenotype is a complex one where
the final network is generated from the single starting cell by means of axonal growth
and branching as well as cell division and migration. The neural network 'grows' out
of the starting cell(s). This interpretation function comes a lot closer to the
developmental process found in nature. Successive phases of functional differentiation
and specialisation can be observed in the development. Mutations are introduced in
the development and it is observed that changes in the phenotype due to these
mutations depend largely on what stage in the development they occur. The neural
networks are used to model organisms living in a two-dimensional world in which
they can move in the search for food and water.
8. Using Genetic Programming to Generate Neural Networks
In this chapter, we discuss the use of a genetic programming algorithm using a direct
encoding scheme; see also [63]. This work is mainly based on [40] and [41], where a
LISP program was used to implement the algorithm, with good results reported when
GP was applied to generate a neural network that could perform the one-bit adder
task. A complete neural network, i.e. its topology as well as its weights, is coded as a
tree structure and is optimised in the algorithm.

A public domain genetic programming system called GPC++, version 0.40, has been
used [17]. This software package was written in C++ by Adam P. Fraser, University of
Salford, UK, and several alterations were made to use it for the application to neural
network design. The GPC++ system uses Steady State Genetic Programming (SSGP)
as discussed in section 3.2. The probability of crossover, pc, is always 1.0; the new
population is constructed using the crossover operator, after which mutation is
performed. The crossover operator swaps randomly-picked branches between two
parents, but creates only one offspring for each pair. There is no notion of age in the
SSGP system, which means that after a new member is made, it can be chosen
immediately afterwards to create a new offspring.

8.1 Set-up
The technique applied in [40] and [41] was used, where a neural network is
represented by a connected tree structure of functions and terminals. Both the
topology and the values of the weights are defined within this structure, and no
distinction is made between the learning of the network-topology and its weights.

The terminal set is made up of the data inputs to the network (D), and a random
floating-point constant atom (R). This atom is the source of all the numerical constants
in the network, and these constants are used to represent the values of the weights. So:

T = {D, R}


The neural networks generated by this algorithm are of the feed-forward kind. The
terminal set T for a two-input neural network is, for example, T = {D0, D1, R}.

In [41] the function set F is made up of six functions: F = {P, W, +, -, *, %}. P is the
processing function of a neuron, which performs a weighted sum of its inputs and
feeds this sum through a transfer function (e.g. linear threshold, sigmoid). The
processing function takes two arguments in the current version of the program; i.e.
every neuron has two inputs only. The weight function, W, also has two arguments.
One is a subtree made up of arithmetic functions and random constants that represent
the numerical values of the weights. The other is the point in the network that it acts
upon, which is either a processing unit (neuron) or a data input. The four arithmetic
functions, AR = {+, -, *, %}, are used to create and modify the weights of the
network. All take two arguments and the division function is protected in the case of
a division by zero.

After some initial experimentation it was found that, for the problems under
investigation, the system performed much better if the arithmetic functions were not
used. So:

F = {P, W}

The values of the weights are represented by a single random constant atom and their
values can only be changed by a one-point crossover or mutation performed on this
constant atom.

The output of the genetic program is a LISP-like S-expression, which can be
translated into a neural network structure made up of processing functions (neurons),
weights and data inputs. Initially no bias units were implemented. The name given to
this implementation of neural network design using genetic programming is GPNN.

8.2 Example of a Genetically Programmed Neural Network
An example of a chromosome generated by GPNN is the following neural network,
which performs the XOR function.

(P (W (P (W -0.65625 D1) (W 1.59375 D0)) 1.01562)
   (W 1.45312 (P (W 1.70312 D1) (W -0.828125 D0))))
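This chromosome can be checked by hand with a small recursive interpreter. The sketch below is not the authors' code: the nested-tuple encoding ('W', weight, point) and the helper names are my own, and the threshold thres(x) = 1 if x > 1 is taken from the processing function used later in section 8.6.1.

```python
def thres(x):
    # Threshold processing function from section 8.6.1:
    # fire only when the weighted sum exceeds 1.
    return 1 if x > 1 else 0

def evaluate(node, inputs):
    """Recursively evaluate a (P ...)/(W ...) tree.

    A P-node sums the values of its W-children and applies thres();
    a W-node multiplies its constant by the value of its point, which
    is either a data input name ('D0', 'D1') or another P-node.
    """
    if node[0] == 'P':
        return thres(sum(evaluate(c, inputs) for c in node[1:]))
    if node[0] == 'W':
        _, weight, point = node
        value = inputs[point] if isinstance(point, str) else evaluate(point, inputs)
        return weight * value
    raise ValueError(node[0])

# The XOR chromosome from the text, as nested ('W', weight, point) tuples.
xor_net = ('P',
           ('W', 1.01562, ('P', ('W', -0.65625, 'D1'), ('W', 1.59375, 'D0'))),
           ('W', 1.45312, ('P', ('W', 1.70312, 'D1'), ('W', -0.828125, 'D0'))))

for d0 in (0, 1):
    for d1 in (0, 1):
        print(d0, d1, '->', evaluate(xor_net, {'D0': d0, 'D1': d1}))
```

For all four input patterns the output equals D0 XOR D1, confirming that the chromosome does implement the XOR function.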

The graphical representation and the corresponding neural network are shown in
Figure 8.1. The condensation of the W-P tree initially drawn from the chromosome
into a fully connected feedforward network is illustrated in two stages.

Figure 8.1 Translation of a GP chromosome into a neural network



8.3 Creation and Crossover Rules for Genetic Programming for Neural Networks
In the standard GP paradigm, there are no restrictions concerning the creation of the
genetic tree and the crossover operator, except a user-defined maximum depth of the
tree. In neural network design, several limitations on the creation as well as on the
crossover operator are required.

8.3.1 Creation Rules


The creation rules are:

• the root of the genetic tree must be a "list" function (L) of all the outputs of the
network
• the function below a list function must be the Processing (P) function
• the function below a P function must be the Weight (W) function
• below a W function, one of the functions/terminals must be chosen from the set
{P,D}, the other one must be {R}

These creation rules make sure that the created tree represents a viable neural
network. The root of the tree is a list function of all its outputs while the leaves are
either a data signal (D) or a numerical constant (R). This tree can then be translated
into a neural network structure as in Figure 8.1.
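These rules can also be checked mechanically on a candidate tree. The following validator is a sketch under my own encoding, not part of GPNN: trees are nested tuples ('L', output1, ...), ('P', w1, w2) and ('W', weight, point), where a point is a data input name or a P-node.

```python
def is_valid(tree):
    # Rule 1: the root must be the list function L of all network outputs.
    return tree[0] == 'L' and all(valid_p(p) for p in tree[1:])

def valid_p(node):
    # Rules 2 and 3: below L comes a P function, and below P only W functions.
    return (isinstance(node, tuple) and node[0] == 'P'
            and all(valid_w(w) for w in node[1:]))

def valid_w(node):
    # Rule 4: below W, one numeric constant R and one point from {P, D}.
    if not (isinstance(node, tuple) and node[0] == 'W' and len(node) == 3):
        return False
    _, weight, point = node
    point_ok = (isinstance(point, str) and point.startswith('D')) or valid_p(point)
    return isinstance(weight, float) and point_ok

good = ('L', ('P', ('W', 0.5, 'D0'), ('W', -0.25, 'D1')))
bad = ('L', ('W', 0.5, 'D0'))   # a W directly below L violates rule 2
print(is_valid(good), is_valid(bad))   # True False
```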

8.3.2 Crossover Rules


The crossover operator must preserve the genetic tree so that it still obeys the above
rules. This is done by structure-preserving crossover which has the following rule: the
points of the two parent-genes between which the crossover is performed (the
branches connected to these points are swapped) must be of the same type. In effect
this means that firstly a crossover point on the first parent tree is randomly selected.
Then the crossover point on the second parent tree is randomly selected with the
restriction that it must be of the same type.

The types of points are:

- type 1: a P function or a D terminal


- type 2: a W function
- type 3: a R terminal

So, for example, a branch whose root (the crossover point) is a P function can never
be swapped with a branch whose root is a W function. Were this allowed, the
creation rules described above would be violated and the genetic tree could no
longer be translated into a neural network. In [41], P functions and D terminals are
treated as being of different types, which means a branch whose root is a P function
can never be replaced by a D terminal and vice versa.
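The type-matched point selection can be sketched as follows. This is my own minimal illustration (nested-list trees, helper names, and the tiny example parents are assumptions; the real system works on GPC++ data structures):

```python
import copy
import random

def point_type(node):
    # Type 1: P function or D terminal; type 2: W function; type 3: R constant.
    if isinstance(node, list):
        return 1 if node[0] == 'P' else 2
    return 1 if isinstance(node, str) else 3

def points(tree, path=()):
    # Enumerate (path, branch) for every point below the root.
    for i, child in enumerate(tree[1:], start=1):
        yield path + (i,), child
        if isinstance(child, list):
            yield from points(child, path + (i,))

def crossover(parent1, parent2, rng):
    # Structure-preserving crossover producing one offspring per pair.
    child = copy.deepcopy(parent1)
    path1, branch1 = rng.choice(list(points(child)))
    # Restrict the second crossover point to branches of the same type.
    same_type = [b for _, b in points(parent2)
                 if point_type(b) == point_type(branch1)]
    holder = child
    for i in path1[:-1]:
        holder = holder[i]
    holder[path1[-1]] = copy.deepcopy(rng.choice(same_type))
    return child

p1 = ['P', ['W', 0.5, 'D0'], ['W', 1.5, 'D1']]
p2 = ['P', ['W', -2.0, 'D1'], ['W', 0.25, 'D0']]
child = crossover(p1, p2, random.Random(0))
print(child[0], child[1][0], child[2][0])   # still a P with two W branches
```

Because swapped branches always share a type, the offspring still satisfies the creation rules and remains translatable into a neural network.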

8.4 Automatically Defined Functions (ADFs)


As can be seen from Figure 8.1, only tree-structured neural networks can be generated
using GPNN. Sub-branches can never 'reach' each other. For example, the simple
'2-2-2' fully connected feedforward network in Figure 8.2 cannot be represented by a
single tree in GPNN. Instead two separate sub-trees like the one in Figure 8.1 are
needed, one for each output neuron. So six processing functions are needed in GPNN
to represent this four-neuron network.

Figure 8.2 A simple '2-2-2' feedforward neural network (left). The GPNN system
needs two separate sub-trees to represent this network (right).

When Automatically Defined Functions (ADFs) are implemented it is possible to
represent functional groups of neurons in the network by ADFs. For example, using
two ADFs, ADF0 and ADF1, each one representing a hidden neuron and its (two)
incoming weights, the '2-2-2' network can be represented by the GPNN tree in Figure
8.3.

Figure 8.3 Representation of the '2-2-2' network in GPNN with two ADFs.

The two ADFs have P functions as their roots and have two arguments each: ARG0
and ARG1. In the example these arguments are instantiated with the data inputs D0
and D1, but instead of data inputs, the output value of some P function or even another
ADF function can also be used. The problem with a representation of this kind is that
if every sub-network that is called upon more than once is represented by an ADF, the
number of these ADFs can become very large. This number normally needs to be set
by the user a priori. Another problem is that the number of arguments of each ADF,
just as for every other function, needs to be specified in advance. However, extensions
to the standard GP system have recently been made by Koza allowing the system to
automatically build ADFs when it needs them.

8.5 Implementation of the Fitness Function


The fitness function is calculated as a constant value minus the total performance error
of the neural network on the training set, identical to the fitness function used in the
example of chapter 3. A training set consisting of input and target-output patterns
(facts) needs to be supplied. The error on the training set is then calculated as:

E(x) = Σ (i=1..F) Σ (j=1..Nout) ( O_ij(x) - T_ij )²

where: F = the number of facts in the training set
       Nout = the number of outputs
       O_ij(x) = the j-th output of the network on training fact i
       T_ij = the j-th target output of training fact i

Since a lower error must correspond to a higher fitness, the fitness of a chromosome x
is then calculated as:

f(x) = Emax - E(x)

The maximum performance error, Emax, is a constant value equal to the maximum
error possible, so that a network that has the worst performance possible on a given
training set (maximum error) will have a fitness equal to zero. When a threshold
function is used as the neurons' processing function, only output values of '0' or '1'
are possible. The range of fitness values is then very limited and it is impossible to
distinguish between many networks. In order to increase this range the output neuron
could be chosen to have a continuous sigmoid processing function.

In using a supervised learning scheme, there are many other ways to implement the
fitness function of a neural network. Instead of the sum of the square errors, for
example, we could use the sum of the absolute errors or the sum of the exponential
absolute errors. Another definition of the fitness could be the number of correctly
classified facts in a training set. The fitness function could also reflect the size (=
structural complexity) and the generalisation capabilities of the network. For example
smaller networks having the same performance on the training set as bigger networks
would be preferred, as they generally have better generalisation capabilities. The
generalisation capability of a network could be added to the fitness function by
performing a test on test data that lies outside the training data. These suggestions are
not implemented here.
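The squared-error fitness that is used here can be sketched directly from the formulas above. The function names are mine, as is the assumption of 0/1 targets and outputs in [0,1], under which Emax = F × Nout bounds the error:

```python
def error(outputs, targets):
    # E(x): squared error summed over all F facts and Nout outputs.
    return sum((o - t) ** 2
               for out_row, tgt_row in zip(outputs, targets)
               for o, t in zip(out_row, tgt_row))

def fitness(outputs, targets):
    # f(x) = Emax - E(x); with 0/1 targets and outputs in [0,1] every
    # term is at most 1, so Emax = F * Nout keeps fitness non-negative.
    emax = len(targets) * len(targets[0])
    return emax - error(outputs, targets)

# XOR training set: a perfect network scores Emax = 4, and the worst
# possible network (always exactly wrong) scores 0.
targets = [[0], [1], [1], [0]]
print(fitness([[0], [1], [1], [0]], targets))   # 4
print(fitness([[1], [0], [0], [1]], targets))   # 0
```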

8.6 Experiments with Genetic Programming for Neural Networks
The Genetic Programming for Neural Networks (GPNN) algorithm has been
implemented using the code of GPC++ with several alterations and additions. The
neurons in the resulting neural networks initially did not have bias units. The fitness
function used was the total performance error over the training set multiplied by a
factor to increase the range. The fitness value was then made into an integer value as
this is required by the GPC++ software. The mutation operator was implemented so
that it only acted on terminals, not on functions. The maximum depth of a genetic tree
in the creation phase, the creation depth, was set to 6. During crossover, the genetic
trees were limited to a maximum depth of 17, the crossover depth. These values were
used as a default value by Koza [40], to make sure the trees stay within reasonable
size.

Simulations have been performed to automatically generate neural network
architectures for the XOR problem, the one-bit adder problem and the intertwined
spirals problem.

8.6.1 The XOR Problem


The XOR problem was the first that was attempted using GPNN. The processing
function used for the neurons was a simple threshold function: thres(x) = 1 if x > 1, 0
otherwise, and the following settings for the genetic programming algorithm were
used:

Table 8.1 The GP settings for the XOR optimisation problem

Parameter Setting
ADFs 0
creation depth 6
crossover depth 17
elitism on
N (population size) 500
pc (crossover rate) 1.0
pm (mutation rate) 0.1
selection mechanism tournament (tournament size = 5)

No Automatically Defined Functions (ADFs) were used, as they did not seem
necessary for such a simple task.

Several runs were performed on this problem with solutions evolving between
generation 1 and generation 5. Figure 8.4 shows a solution that was found in a
particular run in generation 5. All solutions found had a number of neurons ranging
from 3 to 5. When the roulette wheel reproduction mechanism was used instead of the
tournament mechanism, the convergence to a solution took on average 2 generations
longer.

Figure 8.4 A generated neural network that performs the XOR problem

The GPNN system was extended with a bias input to every neuron by means of an
extra random constant (in the range [-4,4]) added to every P function. The effect of
this on the XOR problem was a somewhat slower convergence. The reason might be
that the search space is increased, while for a solution to this simple problem bias
inputs are not needed. It should be noted that the GPNN system with this specific
set-up cannot generate the 'minimal XOR network'. This network is pictured in Figure
8.5.

Figure 8.5 The minimal XOR network. This is the neural network with the lowest
complexity (number of connections) that can perform the XOR problem

GPNN cannot generate this network simply because the P functions are only allowed
to have two arguments (inputs), while for this particular network the output neuron has
three inputs. The GPNN settings can of course be changed so that the function set F
contains two P functions: one with two arguments, P1(arg1, arg2), and one with three:
P2(arg1, arg2, arg3). The minimal XOR network can then be represented by a
chromosome using these two functions.

8.6.2 The One-Bit Adder Problem


As in [41], the slightly more difficult one-bit adder problem was then attempted. The
network has to solve the following task:

Input: Target output:

0 0 0 0
0 1 0 1
1 0 0 1
1 1 1 0

In effect this means that the first output has to solve the AND function on the two
inputs, and the second output the XOR function.

The same settings as in the XOR problem were used. A solution to the
problem was found in all 10 runs between generation 3 and generation 8. One of them
is shown in Figure 8.6. The convergence is much faster than in [41], where a solution
was only found after 35 generations, also using a population of 500.

Figure 8.6 A generated neural network that performs the one-bit adder problem

As can be seen from the figure, the neural network found is indeed made up of an
AND and an XOR function. On average the generated neural networks had more than
just 5 neurons and the largest effective network had 20.

8.6.3 The Intertwined Spirals Problem


The intertwined-spiral classification problem was tried, as it is often regarded as a
benchmark problem for neural network training. The training set consists of two sets
of 97 data points on a two-dimensional grid, representing two spirals that are
intertwined making three loops around the origin. A 2-input, 2-output neural network
is needed.

The results were poor. When the same settings as in the above experiments were used,
roughly half of the training set was classified correctly. Automatically Defined
Functions (ADFs) were introduced taking two, three and four arguments respectively,
but no improvements were observed. The function set was also extended with
processing functions P3 and P4, taking three and four arguments respectively. Again
the performance was very poor.

Although GPNN was not able to find a solution to this problem, it should be noted
that GP has been found to be a very good classifier on the intertwined spirals problem.
In [40] a GP system gave a very good performance on this problem using the
following set-up: The terminal set, T, was made up of the two data inputs DO and Dl
and the usual real-valued terminal R:

T = {D0, D1, R}

The function set consisted of the arithmetic functions +, -, *, %, the functions SIN and
COS and the function IFLTE (If Less Than or Equal to). The IFLTE function takes 4
arguments (branches) and is defined as: if (arg1 <= arg2) then return arg3, else return
arg4. So the function set F is:

F = {+, -, *, %, IFLTE, SIN, COS}

No creation or crossover rules are needed and the fitness function is simply the
classification error on the intertwined spirals data set. This GP configuration gave
very good results on the intertwined spirals classification task and a 100% correct
classification on the data set is reported.
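The two non-standard primitives in this set can be sketched in a few lines. This is an illustration, not the code of [40]; in particular, returning 1.0 on division by zero follows Koza's usual convention for protected division, which I assume applies here.

```python
def protected_div(a, b):
    # '%' in the function set: ordinary division, except that a zero
    # denominator returns a safe constant instead of raising an error.
    return a / b if b != 0 else 1.0

def iflte(arg1, arg2, arg3, arg4):
    # IFLTE: if arg1 <= arg2 then return arg3, else return arg4.
    return arg3 if arg1 <= arg2 else arg4

print(protected_div(1.0, 0.0))                 # 1.0 rather than an exception
print(iflte(0.5, 1.0, 'first', 'second'))      # 'first'
```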

8.7 Discussion of Genetic Programming for Neural Networks
It was found that the GPNN approach works well for small-scale problems such as the
XOR and one-bit adder problems, but that it does not scale up to larger, more
realistic problems. One of the reasons for this is thought to be that the size of the
chromosomes becomes excessively large for these problems. The GPNN system will
have enormous difficulty in finding an optimum within reasonable time. Other reasons
for the poor scaling behaviour of GPNN can be found in the restrictions that apply to
the system. These restrictions are:

• There are severe restrictions on the network topologies generated: only
tree-structured networks are possible.

• The number of arguments of a function is always fixed; e.g. a processing function
(neuron) can and must only have two inputs.

• The learning of the topology and weights is done simultaneously within the same
algorithm. This has the drawback that a neural network with a perfectly good
topology might have a very poor performance and will therefore be thrown out of
the population just because of the value of its weights.

Another application-independent problem in using GP is how to choose the function
and terminal set. For example, in the GPNN system that we used, only two functions
are implemented: {P, W}. This function set could easily be extended. In order to
decide on what functions are useful to the problem, detailed knowledge of the problem
domain is needed.

It is believed that the main reason why the GPNN approach fails to scale up to larger
size problems lies in the restrictions mentioned and the very large chromosome size
needed. An approach which overcomes some of the limitations of GPNN is discussed
in Chapter 10.
9. Using a GA to Optimise the Weights of a Neural Network
This chapter describes experiments using a genetic algorithm for weight optimisation
of a feedforward neural network. When genetic algorithms are used to optimise the
structure of feedforward neural networks a separate learning algorithm such as back
propagation is often used to train the weights (see Figure 7.1). In the weight
optimisation module a separate genetic algorithm can be used instead of back
propagation, making the system a meta-level GA. The performance of a GA as a
neural network weight optimiser is investigated here. For certain problems (see e.g.
[50], [64]) genetic algorithms have proven to be comparable to or even better than
back propagation. In section 7.1 an overview was presented on the research in this
area. The best results were noted when a Steady State Genitor-type GA was used with
a real-valued coding of the weights. We have used a normal GA with
'non-overlapping populations' and an altered replacement mechanism so that it can act as a
Steady State Genetic Algorithm. Since the main characteristic of a Genitor-type GA is
thought to be its extreme selective pressure or 'pushing force' of above average
individuals, its performance can be approximated by a normal GA with the
appropriate selection mechanism. The effect of the selective pressure on the GA
performance as a weight optimiser is also investigated here.

First a brief description of the GA software is given. After this the set-up of the GA is
discussed and experiments are presented where the GA weight optimiser is compared
to the standard back propagation algorithm. Finally the results are discussed.

9.1 Description of the GA Software


The GA software used was a genetic algorithm C-library called 'SUGAL' (v1.0),
developed by A. Hunter at the University of Sunderland, England. The system is very
flexible and the user has many options available. The basic working of SUGAL is
illustrated in the flow chart of Figure 9.1.


In the course of a generation, pairs of individuals that serve as candidates to be
included in the next generation are chosen using the selection mechanism. In the
standard GA the number of candidates just equals the population size; i.e. the
standard GA the number of candidates just equals the population size; i.e. the
complete population is replaced by the candidates. An exception to this is when
elitism (see section 3.1.7) is used. With elitism the number of candidates is equal to
the population size minus one. As usual, crossover is performed on the pair of
candidates with probability pc. Mutation is then performed with probability pm.
The candidates are then evaluated and inserted back into the population using the
replacement mechanism. In the standard GA the replacement mechanism is such that
the individuals in the population are always replaced by the candidates. This is known
as unconditional replacement. SUGAL offers extra replacement strategies identical to
the ones used in a Steady State Genetic Algorithm; i.e. conditional/unconditional and
ranked/unranked replacement. The standard GA can be transformed into a Steady
State Genetic Algorithm by decreasing the number of candidates to only one (or two).
This type of GA implemented in SUGAL is also described by Michalewicz in [48], p.
60, where it is labelled 'modGA'.

The SUGAL software mutation operator was changed so that a single gene is subject
to mutation with probability pm (and not a chromosome, as was the case). This
probabilistic implementation of the mutation operator, where every gene has to
undergo a 'test' to determine whether or not it should be mutated, makes the program
quite slow.

A second change was made concerning the selection of the pair of candidates. In
SUGAL it was possible for a single individual to be chosen both as the father and as
the mother. In such a case the offspring are simply exact copies of the parent no matter
what kind of crossover takes place. This has the effect of lowering the effective
crossover rate and in populations with one superfit individual it may easily lead to
premature convergence. The code was changed so that the father and mother
chromosome could not be one and the same.

An option in SUGAL is re-evaluation. If the re-evaluation flag is set, each individual
is re-evaluated at the start of a generation. This can serve a purpose if the evaluation is
dependent on the state of the system or its non-stationary environment, or if the
evaluation contains stochastic elements. In many static optimisation problems the
fitness of an individual is deterministically dependent on the individual and
re-evaluation will serve no purpose.

Figure 9.1 Flowchart for the GA software SUGAL




9.2 Set-up
In this section the set-up of the GA is described for the implementation of neural
network weight optimisation.

• Coding
The coding is chosen to be real-valued. A single chromosome represents all the
weights in the neural network (including the bias weights), where a single real-valued
gene corresponds to one weight-value. The nodes in the network are numbered from
'0' starting at the bias-unit, then the input units, the hidden neurons and finally the
output neurons. Even though the input units and the bias unit are not really neurons at
all, they will be referred to as such (as is common practice). The network architecture
is not restricted to a classic fully connected layer-model. However, the hidden neurons
are numbered in such a way that neurons with a higher index are 'higher' up in the
hierarchy of the network; i.e. neurons can only have outgoing connections to neurons
with a higher index. Figure 9.2 illustrates this. The indices of the weights represent the
order in which they appear in the chromosome. Incoming weights to a certain neuron
are grouped together in the chromosome representation.
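This ordering can be illustrated with a small decoder, in the spirit of Figure 9.2. The helper names and the toy topology below are my own assumptions; the point is only that incoming weights of each destination neuron are grouped, with destinations visited in ascending index order:

```python
def decode(chromosome, incoming):
    """Map a flat real-valued chromosome to a {(src, dst): weight} table.

    `incoming` maps each non-input neuron index to the ordered list of
    its source indices (index 0 is the bias unit). Genes are consumed in
    chromosome order, grouped by destination neuron.
    """
    weights, genes = {}, iter(chromosome)
    for dst in sorted(incoming):
        for src in incoming[dst]:
            weights[(src, dst)] = next(genes)
    return weights

# A 2-input, 1-hidden, 1-output net: bias = 0, inputs = 1, 2,
# hidden neuron = 3, output neuron = 4.
incoming = {3: [0, 1, 2], 4: [0, 3]}
w = decode([0.1, 0.2, 0.3, 0.4, 0.5], incoming)
print(w[(1, 3)], w[(3, 4)])   # 0.2 0.5
```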

• Initialisation
The initialisation of the weights is important when a GA is used to train a neural
network. Because the standard crossover operator for real-valued chromosomes leaves
the gene values intact, it can never introduce new values of weights and the available
genetic information is dictated by the initial values and by the mutation operator used.
When the standard mutation operator is used to simply replace genes by a new random
value in a certain range, this range and the range of initial gene values dictates the
boundaries of possible values the genes can ever obtain. The range of initialisation in
the GA weight optimiser therefore usually plays a more important role than in a
hill-climbing algorithm like back propagation. The initial values of the genes can be
chosen to be uniformly distributed within a certain range or normally distributed with
a certain mean and standard deviation.

Figure 9.2 Example of the ordering of the weights in a chromosome

• Evaluation
The evaluation phase involves initialising the neural network with the set of weights
contained in the chromosome. The fitness value, f(x), is then simply the cumulative
squared error of the network on the training set where the outputs are compared to the
target output patterns:

f(x) = Σ (i=1..F) Σ (j=1..Nout) ( O_ij(x) - T_ij )²

where: F = the number of facts in the training set
       Nout = the number of outputs
       O_ij(x) = the j-th output of the network on training fact i
       T_ij = the j-th target output of training fact i

The training is supervised. All the neurons in the network perform a weighted sum of
their inputs and produce as output the standard sigmoid function on [0,1] of this
weighted sum, so O_ij(x) ∈ [0,1]. Commonly target outputs will have a value of either
0 or 1: T_ij ∈ {0,1}.
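This evaluation can be sketched as a self-contained forward pass. The tiny topology and names are illustrative assumptions: neuron 0 is the bias unit with constant output 1, and neurons are evaluated in ascending index order, which is exactly the feedforward ordering described under 'Coding' above:

```python
import math

def sigmoid(x):
    # Standard sigmoid transfer function on [0,1].
    return 1.0 / (1.0 + math.exp(-x))

def forward(weights, incoming, inputs):
    # act[0] is the bias unit's constant output of 1; `inputs` supplies
    # the data inputs, e.g. {1: x1, 2: x2}.
    act = {0: 1.0}
    act.update(inputs)
    for dst in sorted(incoming):   # ascending index = feedforward order
        s = sum(weights[(src, dst)] * act[src] for src in incoming[dst])
        act[dst] = sigmoid(s)
    return act

# A 2-input, 1-hidden, 1-output net: bias = 0, hidden = 3, output = 4.
incoming = {3: [0, 1, 2], 4: [0, 3]}
weights = {(0, 3): 0.0, (1, 3): 1.0, (2, 3): 1.0, (0, 4): 0.0, (3, 4): 1.0}
act = forward(weights, incoming, {1: 0.0, 2: 0.0})
print(round(act[4], 3))   # 0.622
```

Summing the squared differences between the resulting output activations and the targets over all facts gives the f(x) defined above.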

• Stopping Criterion
In this implementation the stopping criterion is chosen to be the occurrence of a
chromosome in the current population corresponding to a neural network that
correctly classifies the complete training set within a certain tolerance. All outputs of
the network must be within this tolerance of their target values for the criterion to be
satisfied:

Stopping criterion satisfied  <=>  ∃x : | O_ij(x) - T_ij | < Tolerance, for all i, j

The default tolerance is set to 0.4, where it is assumed that all target outputs have a
value of either 0 or 1. The chosen network does not necessarily have to be the network
with the smallest error on the training set (the fittest chromosome); rather, it is the first
one encountered that satisfies the stopping criterion. An alternative stopping criterion
could be when a network is found that has an error below a certain value, but we have
not used this approach.
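As a sketch, the criterion is a simple predicate over one network's outputs on the whole training set (the function name is mine; the default tolerance of 0.4 is from the text):

```python
def criterion_met(outputs, targets, tolerance=0.4):
    # Every output of every fact must lie within `tolerance`
    # of its 0/1 target for the search to stop.
    return all(abs(o - t) < tolerance
               for out_row, tgt_row in zip(outputs, targets)
               for o, t in zip(out_row, tgt_row))

targets = [[0], [1], [1], [0]]
print(criterion_met([[0.1], [0.9], [0.7], [0.3]], targets))   # True
print(criterion_met([[0.1], [0.9], [0.5], [0.3]], targets))   # False: |0.5 - 1| >= 0.4
```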

• Selection
Before selection is performed the fitness values of the individuals are normalised (or
remapped) using a normalisation method. Normalisation is implemented in SUGAL
by optionally altering the fitness values using some function (such as ranking), and
then normalising all fitnesses so that the total of the fitness values of the population
equals 1. Normalisation methods include inversion, where the fitness values are
inverted so that lower fitnesses take higher values and high fitnesses take low values,
linear ranking, where the fitness value becomes a linear function of the rank of the
chromosome, and geometric ranking, where the fitness is a geometric function of the
rank.

• Crossover
The standard one-point, two-point and uniform crossover operators are available.
Since entire genes are swapped, these operators can never change a gene value (a
weight). An exception is the linear crossover operator (see section 6.5.1), which was
implemented as a special option as follows: the first offspring x3 receives the average
values of all the genes of its parents, i.e. (x1 + x2)/2. The other offspring is generated
as x4 = (3·x1 - x2)/2.
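Gene-wise, the linear crossover option can be written as follows (a minimal sketch; the function name is mine):

```python
def linear_crossover(x1, x2):
    # Offspring x3 is the gene-wise average of the parents;
    # offspring x4 extrapolates past x1, away from x2.
    x3 = [(a + b) / 2 for a, b in zip(x1, x2)]
    x4 = [(3 * a - b) / 2 for a, b in zip(x1, x2)]
    return x3, x4

x3, x4 = linear_crossover([1.0, 2.0], [3.0, 6.0])
print(x3, x4)   # [2.0, 4.0] [0.0, 0.0]
```

Unlike the gene-swapping operators, this operator does create weight values not present in either parent.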

• Mutation
As stated above, the standard mutation operators in SUGAL were changed. There are
two mutation operators available. Normal, 'uniform', mutation re-initialises the gene
with a random value. This new random value can be taken from a uniform distribution
within a certain range or from a normal distribution with a given mean and standard
deviation. Creeping, 'Gaussian', mutation is such that a normally distributed value
with a certain standard deviation is added to the current value of the gene. The
SUGAL code was extended so that both mutations could operate at the same time,
each with its own mutation rate.
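The two gene-level operators, acting together with their own rates, can be sketched as below. The rates, range, and standard deviation are illustrative assumptions, not SUGAL defaults:

```python
import random

def mutate(chromosome, rng, p_uniform=0.01, p_gauss=0.05,
           low=-1.0, high=1.0, sigma=0.1):
    # Each gene is tested independently against each operator's rate.
    out = []
    for gene in chromosome:
        if rng.random() < p_uniform:     # uniform: re-initialise the gene
            gene = rng.uniform(low, high)
        if rng.random() < p_gauss:       # Gaussian creep: small perturbation
            gene += rng.gauss(0.0, sigma)
        out.append(gene)
    return out

genes = [0.5] * 1000
mutated = mutate(genes, random.Random(1))
changed = sum(1 for g, m in zip(genes, mutated) if g != m)
print(0 < changed < 200)   # roughly 6% of genes change
```

Note that this per-gene testing is exactly the probabilistic implementation that the text observes makes the program quite slow for long chromosomes.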

• Replacement
All the replacement mechanisms as described in section 3.1.5 are available, i.e.
ranked/unranked and conditional/unconditional replacement. In unranked
replacement, the 'doomed' individuals are chosen randomly. With ranked replacement
the doomed are the least fit individuals of the population. When the replacement is
unconditional, candidates always replace the doomed individuals. In conditional
replacement the doomed individual is only replaced if its replacement is fitter.
SUGAL offers the ability to set the number of candidates that are generated during
each generation to any number Nc. Ranked unconditional replacement then becomes
an extended form of elitism where the worst Nc individuals of a population are
replaced each generation. When Nc is set to 1 (or 2) the GA is transformed into a
Steady State Genetic Algorithm. The SUGAL settings resulting in a Genitor-type GA
therefore are: Nc = 1, normalisation method = linear ranking, replacement mechanism
= ranked unconditional replacement.
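
The replacement variants can be sketched in a few lines (our own illustration, not
SUGAL code; individuals are assumed to be (fitness, genes) pairs with higher fitness
better):

```python
import random

def replace(population, candidates, ranked=True, conditional=False):
    """Replace len(candidates) 'doomed' members of the population.
    ranked: the doomed are the least fit; otherwise chosen at random.
    conditional: a doomed individual is only replaced by a fitter candidate."""
    pop = sorted(population, key=lambda ind: ind[0])               # worst first
    doomed = (list(range(len(candidates))) if ranked
              else random.sample(range(len(pop)), len(candidates)))
    for i, cand in zip(doomed, candidates):
        if not conditional or cand[0] > pop[i][0]:
            pop[i] = cand
    return pop
```

With one candidate per generation (Nc = 1), ranked unconditional replacement reduces
to the Genitor-style steady-state scheme described above.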

9.3 Experiments
This section is concerned with experiments that were performed with the GA system
described above. Several neural network weight optimisation problems were tried and
the results were compared with the standard back propagation learning algorithm. All
neural networks used here have all their hidden and output neurons connected to a bias
unit that has a constant output of 1.

9.3.1 Data-sets

• 4 to 4 Encoder Problem
The 4 to 4 encoder problem is a simple one to one mapping of all of the 16 possible 4
bit binary inputs to the outputs. The target output values are identical to the input
pattern for each training pattern. Table 9.1 shows the 4 to 4 encoder training data.

Table 9.1 The 4 to 4 encoder training data

Input Target output


" ~~ 0000 0000
000 1 000 1
00 10 00 10

1111 1111

A '4-4-4' fully connected feedforward neural network was used for which the
backpropagation algorithm had no problems learning the data. The corresponding
chromosome length is: l = 4*4 + 4*4 + 4 + 4 = 40.
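
The same count (one gene per weight plus one per hidden and output bias) applies to
every fully connected network used in this chapter; as a small illustrative helper
(ours, not part of the GA system):

```python
def chromosome_length(n_in, n_hidden, n_out):
    """One real-valued gene per weight, plus one per hidden/output bias."""
    return n_in * n_hidden + n_hidden * n_out + n_hidden + n_out

print(chromosome_length(4, 4, 4))     # 40  ('4-4-4' encoder network)
print(chromosome_length(4, 4, 3))     # 35  ('4-4-3' iris network)
print(chromosome_length(17, 12, 12))  # 372 ('17-12-12' radar network)
```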

• Iris Flower Data


The iris flower data consists of a training set of 75 facts and a test set of the same
size. A single fact contains 4 real-valued input values on [0,1] and 3 binary output
values. The class of a fact is determined by that output which has a value of one; the
other two output values are zero. The data represents four attributes of flowers
according to which the flowers are categorised into three classes. This data set is often
considered to be a benchmark problem in neural network classification tasks. The
neural network used to classify this data was a '4-4-3' fully connected feedforward
neural network. This network was easily trained with back propagation for a 100%
correct classification of the training data. The GA system requires chromosomes of
length l = 35. To test the trained network a separate test set also containing 75 facts
was available.

• Radar Classification Data


The radar classification data concerns a real-world task. The training data set consists
of 240 facts, each having 17 inputs in the range [-1,1] and 12 binary outputs
determining the class of the object. The data concerns the classification of 6 classes of
ships each class having two attitudes: left-side and right-side. A '17-12-12' fully
connected neural network was used, resulting in a chromosome with a length of
l = 372.

9.3.2 Comparing GA with Back Propagation


It is difficult to compare the performance of a GA weight optimisation algorithm with
an algorithm like back propagation. A simple measure of the time it takes to converge
to the appropriate solution is of course not reliable since it depends heavily on the
software implementations used. However a rough comparison can be made between
the two algorithms as is reported in [50]. In this GA application, as in the vast
majority of GA systems, the evaluation of the chromosomes takes up the most
computational time by far when compared to the rest of the algorithm. When
comparing GA and BP in computational effort the 'rest' of the GA algorithm (i.e.
selection, crossover, mutation etc.) is simply ignored. During evaluation of a
chromosome for each fact a single pass of the training data through the neural network
is made after which the error is calculated. So the number of 'passes' per evaluation is
simply the number of facts. In a BP algorithm with 'per-pattern update' (i.e. weights
are updated after every single presentation of a training fact) the data is passed
through the network (forward pass) after which the error propagates back (backward
pass) and the weights are updated. In one training cycle of the BP algorithm the total
number of passes through the network therefore equals twice the number of facts in
the training set. So when comparing GA and BP one training cycle in BP is considered
to be equivalent to two evaluations in the GA.

In the standard GA without elitism where all newly made individuals for the next
generation are evaluated, the number of evaluations per generation simply equals the
population size. In the GA used here this does not hold in general. Some individuals
pass on from one generation to the next unchanged and are not evaluated. For this
reason, during each GA run the number of evaluations needed to find the solution is
recorded. When comparing the GA and BP algorithms on a certain problem the
number of GA evaluations, or the number of passes through the neural network, will
simply be called iterations. Thus:

number of iterations = number of GA evaluations = number of BP cycles * 2.

9.3.3 Results
It is difficult to visualise the operation of a GA. In this section graphs are presented
that show the fitness of the best individual in the population versus the number of
generations. In contrast to a hill-climbing algorithm like BP this graphical
representation does not give much insight into the actual search of the GA.

SUGAL offers a measure of the diversity of the population at the end of each
generation. The diversity measure for a real-valued coding as is used here is the mean
of the standard deviations of each gene across the entire population. So:

D = (1/l) * Σ_{i=1}^{l} σ_i

where: D = the diversity of the population
l = the chromosome length
σ_i = the standard deviation of gene i across the population
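
The measure can be computed as sketched below (illustrative Python, assuming the
population is a list of equal-length gene lists; we use the population standard
deviation here, though the source does not specify which form SUGAL uses):

```python
import statistics

def diversity(population):
    """Mean over all genes of the per-gene standard deviation
    across the population (real-valued coding)."""
    length = len(population[0])
    return sum(statistics.pstdev(ind[i] for ind in population)
               for i in range(length)) / length

print(diversity([[0.0, 0.0], [2.0, 2.0]]))  # 1.0
```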

• The 4-to-4 Encoder Problem


Figure 9.3 shows a typical run of the GA for the 4 to 4 encoder problem. Shown is the
average value of the fitnesses of all the individuals in the population and the fitness
value of the best individual in the current population. The GA was run for 300
generations.

Figure 9.3 Example of a GA run for the 4 to 4 encoder problem

The settings used for this particular run are given in Table 9.2.

Table 9.2 The GA settings corresponding to Figure 9.3

Parameter                      Setting
crossover type                 two point
elitism                        on
fitness normalisation          reverse linear ranking with bias = 10.0
initialisation of population   normal distribution N(0,5)
l                              40
mutation type                  creeping with N(0,1) distribution
N                              50
p_c                            0.8
p_m                            0.1
re-evaluation                  off
replacement mechanism          ranked unconditional
selection mechanism            roulette

A Normal (or Gaussian) distribution is characterised as N(μ,σ), with μ = the mean and
σ = the standard deviation of the distribution. At generation 64 an individual was
found that correctly performed the 4 to 4 encoder problem subject to the required
tolerance on the target output values of 0.4. A total of 3195 evaluations (passes
through the network) was needed to find this solution. Over the course of ten runs with
the same settings the average number of evaluations needed to find a solution was
about 3500, corresponding to an average of 87 generations.

As can be seen from Figure 9.3 the diversity of the population remains fairly high
throughout the run. It drops from its initial value of about 5.0 to 1.0 at generation 40
and stays there due to the relatively high mutation rate.

• Comparison with Back Propagation


Despite using trial and error to compare various configurations, the parameter settings
may not have been optimal for this problem and the convergence time compares very
poorly to the back propagation algorithm. The average number of cycles needed for
BP to find a solution subject to the same requirements is about 50. This would
compare to a number of BP cycles * 2 = 100 evaluations in the GA, meaning about 2
generations with a population size of 50. Figure 9.4 clearly shows this drastic
difference in performance between the same GA run as above and an average BP run
for the 4 to 4 encoder problem.

Clearly BP has no difficulty at all in finding a solution. Apparently the 4 to 4 encoder
problem is a very simple one for a hill-climbing algorithm like BP to solve and the
global optimum is found without any trouble. The GA on the other hand needs many
times more iterations to find a solution to this problem. Although the GA settings
might not have been optimal, it is clearly outperformed by BP on the 4 to 4 encoder
problem.

Figure 9.4 Comparison of GA vs BP on the 4 to 4 encoder problem

• Effect of Selection Mechanism


The GA configuration was then changed to one of a Genitor-type Steady State Genetic
Algorithm. Good results have been reported on neural network weight optimisation
using this type of GA with creeping mutation and a small population size of fifty [50],
[64]. It is thought that a Genitor-type GA can work well on weight optimisation
problems mainly because of its high selective pressure that will centre the search
around a single superfit individual. The same settings as in Table 9.2 were used with
the exception that the number of candidates generated during each generation is now
just one. The number of evaluations or iterations is simply equal to the number of
generations for this type of GA. The average number of iterations needed to find a
solution to the 4 to 4 encoder problem did not differ much from the previous results.
On average something like 3000 iterations were needed and the Genitor-type GA does
not seem to offer any major advantages on this problem.

Because the population size is so small, the effects of selective pressure and genetic
drift in the population are rather large. The population is in general very quickly
dominated by a single superfit individual. To get an idea of the effect of genetic drift
alone, a run was performed without any selective pressure and with the genetic
operators crossover and mutation turned off (p_c = p_m = 0). The GA selects individuals
without preference and copies them into the next generation. The population
converged to a single individual in just 7 generations.

• Effect of Mutation
The effect of the mutation operator was investigated to some extent. Runs were
performed with the normal mutation operator instead of the creeping one. It was
generally observed that this resulted in a poorer performance on the problem. When
both mutation operators were used at the same time, the results were about the same as
the situation where only the creeping mutation was used. The mutation operator
clearly plays a very important part in GA weight optimisation and the GA
performance depends greatly on the settings used.

• Effect of Population Size


The effect of the population size on this problem was investigated to some extent.
Several runs were done for population sizes of 50, 100, 200. The average number of
iterations needed to find a solution did not vary much at all between the
configurations. Very small population sizes of N = 5 and even N = 2 were also
investigated.

As was also reported in [64] the GA converges to a solution even with a population
size as small as 5. Not all the runs converged, but the ones that did (about 80%)
needed on average about half the number of iterations (2000) as those in the case of a
population size of 50. For a 'normal' GA, convergence to a solution would normally
not be found with such a small population size since there simply is not enough
genetic diversity to maintain a proper search by means of intelligent hyperplane
sampling (formation of building blocks). As was also mentioned in [64] the fact that
the GA converges to a solution even with such a small population size strongly
suggests that the search is mainly performed by genetic hill-climbing (see section
6.1.1). Solutions were even found with a population size of 2, although in this case
about 50% of the runs did not converge.

• The Iris Flower Classification Problem


The performance of the GA system on the iris flower classification problem was also
investigated. Figure 9.5 shows a typical GA run in comparison to a typical BP run on
this problem. The GA settings were the same as those in Table 9.2.

Although BP has some problems in finding the global optimum for this problem, it
again drastically outperforms the GA system in convergence time.

Figure 9.5 Comparison of GA vs BP on the iris flower classification problem

• The Radar Classification Problem


A few runs were performed using the radar classification problem. On none of these
runs was convergence observed. Using back propagation, around 600 cycles or 1200
iterations were needed to find a solution. Using the genetic algorithm, the best
individual found after as many as 10000 iterations still gave a very poor performance
on the data set: E(x) ≈ 200.

9.4 Discussion
The GA system has not been found to perform well on the task of feedforward neural
network weight optimisation. It is drastically outperformed by back propagation on
the problems investigated. This might however partially be caused by the nature of the
problems. Problems for which the BP algorithm has no difficulty in finding the
optimum are typically problems with a low level of epistasis resulting in a 'simple'
error landscape. Back propagation will not get 'trapped' in local minima for these
problems and it is not surprising that a hill-climbing algorithm such as BP will
outperform a global search algorithm like GA. Problems which do pose severe
convergence problems for back propagation may be better suited for the genetic
algorithm. It is reported in [50],[64] that a GA system very similar to the one
implemented here does outperform back propagation on some large size tasks that are
very difficult for BP.

Several facts seem to indicate that the genetic algorithm in this set-up does not
perform a global search through the weight space by means of intelligent hyperplane
sampling. Instead, the search seems to be focused around a single individual and
better solutions are generated by genetic hill-climbing. Reasons why the GA seems to
work better as a genetic hill-climber on weight optimisation problems very likely
include the competing conventions problem, caused by multiple chromosomal
representations coding identically functioning networks. By focusing the search
around a single individual this problem is avoided. Another reason why a global
search may not work very well is simply the extremely large size of the search space
for bigger sized problems.

Future work will need to be done in optimising the GA set-up for neural network
weight optimisation, possibly extending the set of genetic operators with ones that are
more problem specific. This can present an alternative in tackling the competing
conventions problems. Some good results have been reported in literature where
genetic operators were used that use some kind of gradient information of the error
landscape. Since competing conventions seem to be such a major problem for weight
optimisation with a standard GA, better results may be expected when niching
techniques such as restrictive mating are used, although this has not been investigated.

Since BP is very good at fine tuning potential solutions and the standard GA can
perform a global search in the problem space a hybridisation of the two seems natural.

A hybridisation of a GA and a hill-climbing algorithm like back propagation can
produce a robust system that will work on a variety of problems. The GA could
perhaps be used to find 'basins of attraction' (areas around a local/ global optimum
from which a hill-climbing algorithm always converges to the optimum) in the error
landscape from which the back propagation algorithm can take off to find the local or
global optimum. This hybridisation could be implemented in the GA system as
follows: during the evaluation of an individual, train the set of weights using back
propagation for a certain amount of training. The training could be implemented in
such a way that the back propagation algorithm is allowed to go on when the error
continues to decrease (i.e. converging to a local or global optimum) and that it must
stop and return to the GA when it does not.
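
A minimal sketch of this evaluation scheme (illustrative Python; `train_step`,
standing for one back-propagation epoch that returns the updated weights and the
resulting error, is a hypothetical placeholder):

```python
def evaluate_with_bp(weights, train_step, max_epochs=100):
    """Refine a chromosome's weights during evaluation: keep applying
    `train_step` while the error decreases, then stop and hand the
    refined weights and their error back to the GA."""
    best_error = float('inf')
    for _ in range(max_epochs):
        new_weights, error = train_step(weights)
        if error >= best_error:      # no longer converging: back to the GA
            break
        weights, best_error = new_weights, error
    return weights, best_error

# Simulated epochs: the error decreases three times, then rises.
errors = iter([4.0, 2.0, 1.0, 1.5])
step = lambda w: ([x * 0.5 for x in w], next(errors))
print(evaluate_with_bp([8.0], step))  # ([1.0], 1.0)
```
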
10. Using a GA with Grammar
Encoding to Generate Neural Networks
In this chapter we describe a GA system that was implemented based on the ideas of
Kitano's matrix grammar (see section 7.3.3) in the automatic generation of both a
neural network structure and its weights. Kitano's approach is extended in the sense
that not just the neural network structure but also the values of the weights are coded
in the chromosome.

When a chromosome is translated into a neural network it is often initialised with
random weights; e.g. [38]. We chose to code the values of the weights in the
chromosome as well, so that not only the structural but also the parametric information
can be passed on from generation to generation.

When both the structure and the weights of the network are coded in the chromosome,
the resulting system is best described by a Structured Genetic Algorithm (sGA). In
[11], where a direct representation was used, good results were reported using an sGA
on small problems such as the XOR or small decoder networks but it was found the
method did not scale up well to bigger problems. Instead of using a direct encoding
method, better results were expected using a grammar encoding and we investigated a
method based on Kitano's matrix grammar encoding in this context. Results are
compared to the matrix grammar system without weight encoding and to a system
implementing direct encoding to represent the structure of a neural network.

10.1 Structured Genetic Algorithms in Neural


Network Design
Structured Genetic Algorithms were developed by Dasgupta and McGregor [10] and
have proved to be a successful method to simultaneously optimise the neural network
architecture and its weights [11] using hierarchically structured chromosomes. The
recombination phase is the same as in the standard genetic algorithm. During
evaluation however 'high-level' genes act as switches to activate or deactivate lower
level genes. In [11] two-leveled chromosomes were used. The top level defines the
connectivity of the network, the bottom level the values of the weights and biases. The
network connectivity is represented by the connectivity matrix with that part of the
chromosome treated as a binary string.
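
The two-level decoding can be sketched as follows (our illustration of the idea, not
the implementation of [11]; the top-level bits form the connectivity matrix and gate
the bottom-level real-valued weight genes):

```python
def decode_sga(structure_bits, weight_genes, n_neurons):
    """Two-level sGA chromosome sketch: the binary top level is read as an
    n_neurons x n_neurons connectivity matrix whose '1' entries activate
    the corresponding real-valued weight genes of the bottom level."""
    weights = {}
    for i in range(n_neurons):
        for j in range(n_neurons):
            k = i * n_neurons + j
            if structure_bits[k] == 1:             # high-level gene 'switch'
                weights[(i, j)] = weight_genes[k]  # low-level gene expressed
    return weights

bits = [0, 1,
        1, 0]
w    = [0.5, -1.2,
        2.0,  0.3]
print(decode_sga(bits, w, 2))  # {(0, 1): -1.2, (1, 0): 2.0}
```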

In most of the GA approaches where both network structure and weights are subject to
genetic operations, the chromosome representing both parts is thought of as a long
binary string and is subject to the genetic operators of the algorithm. As far as the
genetic operators are concerned there is no distinction between the structural and the
parametric (weight) part. This distinction is only made when the chromosome is
translated into the actual neural network.

It is possible to make a distinction between the structural and the weight part of the
chromosome. When different codings (i.e. binary and real-valued) are used for the two
parts, this distinction must be made and non-homogeneous chromosomes are needed.
In [11] the best performance of the sGA algorithm was observed when the weights and
biases were coded as real-valued genes, as opposed to the binary coded structural part
of the chromosome. Genetic operations like crossover and mutation can now be
thought of as being either structural or parametric changes to the network depending
on what part of the chromosome they operate on. When the changes are structural, the
resulting offspring can inherit the set of weights from its parent. This process is called
'weight transmission' and will be described in the next section. A 'structural'
crossover is illustrated in Figure 10.1.

Figure 10.1 Abstract visualisation of structural crossover. The offspring inherit the
set of weights from their parents

Another option in a structural crossover is to initialise the set of weights of the
offspring to a random value.

10.1.1 Weight Transmission


In [5], instead of initialising the weights of the newly created offspring with random
values, the values of the weights are set to a fraction F*W_ij of the corresponding
parent weight W_ij. This process is called 'reduced weight transmission' and it was
found that
the optimum value of F depended very much on the problem. A training module was
used to learn the weights and it was found that the reduced weight transmission
mechanism speeded up learning by more than an order of magnitude compared to
starting with random weights. The idea is that, with weight transmission, the training
of the networks will generally start off from a better point in the weight space when
compared to starting at a random point, and that less training is required during
evaluation of the networks.

There are several ways to implement weight transmission. For example, the weights of
the offspring network can be set to a fraction of the corresponding weights of one or
both parents. The parent networks could first be checked to see which weights in the
complete weight set are actually in use. When the offspring network uses a connection
that is also in use by one or both of its parents, a fraction of the corresponding
weight(s) can be 'transmitted'. The problem is then to initialise weights that are not in
use by either of the parents. We choose to use a system where two parents produce
two offspring, and each offspring inherits a fraction F of a particular weight of one of
its parents. Normally, F is set to a default of 1, so the entire weight value is
transferred. For each weight of the offspring there is an equal chance of inheriting the
weight value from either parent. So after weight transmission an offspring will on
average have inherited 50% of its weights from parent 1 and 50% from parent 2. Other
options include allowing offspring to inherit all weights from a single parent or to let
the offspring's weights be an average of the weights of its parents. These options are
not investigated here. When no crossover is performed on the pair of candidates the
offspring are identical to the parents, including the set of weights.
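
The scheme we chose can be sketched as follows (illustrative Python; F defaults to 1
so the entire weight value is transferred):

```python
import random

def transmit_weights(parent1_w, parent2_w, F=1.0):
    """Each offspring weight is the fraction F of the corresponding weight
    of a randomly chosen parent (equal chance), so an offspring inherits
    on average 50% of its weights from each parent."""
    def child():
        return [F * random.choice(pair) for pair in zip(parent1_w, parent2_w)]
    return child(), child()
```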

When using a grammar encoding instead of a direct encoding method as above, the
reduced weight transmission concept may not work as well. This is because in a
grammar encoding scheme there is no one-to-one correspondence between the
structural and the weight space. The part of the chromosome representing the network
structure, coded by grammar rules, is in general much shorter than the part
representing the network's weights. When two network parent structures are involved
in reproduction there is no guarantee that the resulting offspring will use weights that
were used by both or any of its parents.

When a network is evaluated, the weight training starts from a point in the weight
space determined by the set of weights of one of its parents. Assuming the network
structure of the offspring is similar to the structures of the parents from which it has
inherited the set of weights, the learning will in general start off from a much better
point when compared to starting off with random weights. When the network structure
of the offspring is not similar to its parent's however, weight transmission may not be
very useful. In this case the set of weights received from the parents might not be
better than just starting off with random weights. Consequently such a network will in
general need more training to reach an optimal set of weights compared to a network
that starts off training much nearer to the optimum in weight space. This probably
presents the main difficulty in the weight transmission scheme. Assuming that the
amount of training a network receives is limited, weight transmission strongly favours
networks that are structurally close to their parents. It may therefore result more in a
local search through the structure space than a global one. If the optimal network
structure can be found by such a local search, this is not necessarily a bad thing. In
order to test every network fairly without favouring some, the concept of weight
transmission will have to be abandoned or the amount of training that a network
receives will have to depend on the position in weight space that the training starts
from. The latter will be almost impossible to realise since the distance between the
starting point and the optimum in weight space will in general not be known. Another
option is of course to give every network so much training that the optimum can
always be found. Weight transmission will then no longer be needed. But the purpose
of weight transmission was to bring down the amount of training in the first place.

10.1.2 Structural and Parametric Changes


Using an sGA approach where both the structure and weights are coded in the
chromosome, a distinction can be made between structural and parametric changes. In
[47] where an EP technique is used, every parent network generates one set of
offspring with structural changes and one set with parametric changes. This could be
implemented in a GA so that a parent chromosome generates one offspring with a
structural change and one with a parametric change. Another idea is to have 'phases'
of structural change and phases of parametric change. For example after a generation
of structural changes only, one could have several generations of parametric changes.
This is similar to approaches where a separate training algorithm such as
backpropagation is used in the evaluation phase of the neural network structure. A
two-level GA can be used where the top level GA searches for the optimal neural
network structure. During the evaluation of a neural network, the bottom level GA is
invoked where for a certain number of generations the weights of the network are
optimised using a special GA. This bottom level GA then acts as a separate training
module.

10.1.3 Weight Representation


There is a choice to make whether to represent the weights of all possible connections
of a neural network in the chromosome (i.e. the weights of a fully connected
feedforward network) or just the weights that are actually in use by the network. In
[11] the former is chosen and the argument is given that learned information (i.e.
weights) can be passed on to next generations passively, but may be of use for a future
generation. 'Passively' here means that although the information is part of the
genotype it is not expressed in the phenotype; i.e. weights that are coded in the
chromosome but not actually used by the network. During evaluation the connectivity
of the network coded in the top level defines which weights are actually used. The
genetic operators act on the entire set of weights even though some weights are
passive. Therefore weight changes are possible on non-activated weights that can
possibly only be noticed in future generations. Although this phenomenon is supported
by biological evidence, its usefulness is questioned here. Changes in the weight space
by means of the genetic operators could perhaps be limited to active weights only.

When only those weights that are used by the network are encoded in the
chromosome, difficulty arises when structural crossover or mutation is performed. For
example, when an offspring produced by crossover uses a connection that was not
used by any of its parents, it cannot obtain the corresponding weight value from the
parents. Instead this newly created weight has to be initialised with a random value.
Performing reproduction on the weight space itself is not likely to be a viable option
here since in general there is no one-to-one correspondence between the weight strings
of two different chromosomes. When this method is chosen a variable length GA has
to be used.

10.2 Kitano's Matrix Grammar


In Kitano's approach [38], [39] the NN is represented by a set of matrix grammar
rules that are encoded as a binary chromosome with fixed length. Each rule rewrites a
character into a 2x2 matrix of characters.

Kitano uses a constant and a variable part within the chromosome. The constant part
does not change and consists of the final rewriting rules. It would seem that there is no
point in coding these into the chromosome and in our implementation these 16 final
rewriting rules are set in the system and are the same for every chromosome. The LHS
of these constant rules is a character from the set 'a' to 'p'. The RHS is one of the
possible 2 x 2 matrices consisting of 0's and 1's. Thus the final, constant, rewriting
rules are:

a → [0 0; 0 0], b → [0 0; 0 1], c → [0 0; 1 0], ...., p → [1 1; 1 1]

The higher level grammar rewriting rules are coded in the chromosome and are
subject to the genetic operators. The starting character is always 'Start', to make sure
the initial rewriting step can always be performed. The other positions of the
chromosome are characters in the range 'A' to 'p'. A set of 5 characters defines a
rewriting rule, starting with the initial rewriting rule whose LHS is always the
starting symbol 'Start'. By placing no restrictions on the rewriting rules, many rules
may be developed that rewrite the same character, or for some characters no rules may
be developed at all. Furthermore, many developed rules may never be used. Kitano
normally uses chromosomes with a length of 100, which means a number of 100 / 5 =
20 generated rewriting rules. Examples of developed rewriting rules are:

Start → [A b c A], A → [a a c b], a → [a A b b]

At the end of the M matrix rewriting cycles the connectivity matrix is formed out of
the acquired string. The size of this matrix, and therefore the maximum size of the
network (the number of neurons), is predetermined to be 2^M x 2^M. The connectivity
matrix consists of '1's and '0's. A '1' denotes a connection, a '0' no connection.
When, after the rewriting cycles, a position in the matrix is still a 'non-differentiated'
cell (i.e. neither a '1' nor a '0'), Kitano considers it to be dead and it is therefore set
equal to '0'. In the connectivity matrix the first n rows and columns correspond to the
input nodes and the last m rows and columns to the output neurons.
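
The development of the connectivity matrix can be sketched as follows (a toy
illustration with a hand-made grammar, not Kitano's actual encoding; each rule is
given as a flat 4-tuple in row-major order):

```python
def develop(rules, start='Start', cycles=3):
    """Rewrite each symbol into a 2x2 block for `cycles` steps; after M
    cycles the matrix is 2**M x 2**M. Terminal rules map to 0/1 entries."""
    matrix = [[start]]
    for _ in range(cycles):
        size = len(matrix)
        new = [[0] * (2 * size) for _ in range(2 * size)]
        for i in range(size):
            for j in range(size):
                tl, tr, bl, br = rules[matrix[i][j]]
                new[2 * i][2 * j],     new[2 * i][2 * j + 1]     = tl, tr
                new[2 * i + 1][2 * j], new[2 * i + 1][2 * j + 1] = bl, br
        matrix = new
    return matrix

# Hand-made rules: 'Start', 'A' and 'B' are non-terminals, 0/1 terminals.
rules = {'Start': ('A', 'B', 'B', 'A'),
         'A': (1, 0, 0, 1),
         'B': (0, 0, 0, 0)}
for row in develop(rules, cycles=2):
    print(row)  # the 4x4 identity matrix, row by row
```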

The final network may need to be pruned because it is possible that nodes are created
that have no outgoing or no incoming link. Pruning is a repair mechanism and is one
of the possibilities to handle constraints in a GA. Other options include the
punishment of individuals that violate constraints (i.e. give them a poor fitness value)
or choosing the chromosomal representation in such a way that chromosomes that
violate constraints simply do not occur. Pruning may of course be combined with
punishment, so that the GA will prefer networks that need the least pruning.

10.3 The Modified Matrix Grammar


The matrix grammar described above is modified in that a hierarchy in the characters
is used. For example, when only three rewriting cycles are used (i.e. a connectivity
matrix of size 8), the following hierarchy may be used: 'Start' can be rewritten in a
matrix consisting of characters from {A,..,D}, and the characters {A,..,D} can be
rewritten in matrices consisting of characters from {a,..,p}. By fixing the LHS of every
rewriting rule, the problem of developing more than one rule for the same character is
avoided. Also, in the final matrix there will be no undifferentiated cells.

This method can be seen as a multiple-level substitutional compressor (also known as
a dictionary-based compressor), where the compressed string and the dictionary used
are generated using a GA, and in the evaluation stage it is decompressed into a
connectivity matrix. The basic idea behind a substitutional compressor is to replace an
occurrence of a particular phrase or group of characters in a piece of data with a
reference to a previous occurrence of that phrase. In this context the starting character
'Start' could be seen as the compressed string and the rewriting rules as a
hierarchically ordered dictionary.

The structure of the chromosomes in this system is not the same as in a Structured
Genetic Algorithm (sGA). In an sGA a set of lower level genes is unique to one higher
level gene. In the approach described above, this is not the case.

Some problems also present in Kitano's system still remain. Many rules may never be
used, and several characters could have identical rewriting rules, essentially making all
but one of them unnecessary. One idea would be to code rewriting rules in the
chromosome only when they are used, and leave them out otherwise. The problem
then is that when a character is referred to but has no rewriting rule, a (random) one
has to be made. Furthermore the restriction on matrix sizes of 2^M x 2^M still applies.

• Coding
This section describes how the characters are coded in the chromosome. Kitano uses
binary coding; e.g. 'a' = 0001, 'b' = 0010 etc. Depending on the GA software used, it
might be preferable to simply code the characters as symbols and we use this
representation. In effect it means that the crossover operator can only work on a group
of characters and thus will leave the characters themselves intact. Only mutation can
change the value of a character by re-initialising it with a random value. Using binary
138 Chapter 10. Using a GA with Grammar Encoding to Generate Neural Networks

coding, the genetic operators can operate within the representation substring of a
character.

In implementing the system described above, we need as many alphabets as there are
rewriting cycles. Since there is a rewriting rule for every character of the alphabet, the
chromosome length defines the size of the alphabets, or vice versa. For example, a
connectivity matrix of size 8 requires 3 rewriting steps and therefore 3 alphabets, A1,
A2 and A3:

A1 = { A, ... }
A2 = { a, ..., p }
A3 = { 1, 0 }

The starting symbol, 'Start', can be seen as the alphabet A0. A rewriting step at level
1 rewrites the starting symbol 'Start' into a 2 x 2 matrix consisting of characters of
alphabet A1. In general a rewriting step at level i consists of rewriting a character of
alphabet Ai into a 2 x 2 matrix of characters of alphabet Ai+1. The characters of
alphabet Ai will be denoted by S^i_1, ..., S^i_ki, where ki is the cardinality of the alphabet.
So in the example above S^1_1 corresponds to 'A', S^2_3 corresponds to 'c', k2 = 16, k3 = 2
and so on.

The last two alphabets are predefined and the same for any size matrix. Since we code
a rule for every character in an alphabet, the left-hand side (LHS) of every rule is
predefined and there is no need to code these in the chromosome. Thus a rewriting
rule is represented by its RHS consisting of 4 characters.

• Example
This is a simple example based on [38] of a chromosome representing a neural
network that can perform the XOR task. The network has two inputs and one output.
The system uses three rewriting cycles (M = 3) so that the size of the connectivity
matrix is 2^M x 2^M = 8 x 8. The alphabets used are:

A1 = { A, B, C, D }
A2 = { a, ..., p }
A3 = { 1, 0 }

This particular configuration can be described by k1 = 4, since the alphabets A2 and A3
are pre-defined. An example of a chromosome representing an XOR network is:

A B B D kpak aaaf aaaa aiab

This chromosome represents the following matrix rewriting rules:

Start -> [A B; B D], A -> [k p; a k], B -> [a a; a f], C -> [a a; a a], D -> [a i; a b]

(each right-hand side is a 2 x 2 matrix, given row by row)

The fixed rewriting rules are not part of the chromosome, but are embedded in the
system and are identical to the ones used in Kitano's grammar; i.e.:

a -> [0 0; 0 0], b -> [0 0; 0 1], c -> [0 0; 1 0], ..., p -> [1 1; 1 1]

This chromosome is translated into the neuron connectivity matrix by means of the
following rewriting cycles:

Figure 10.2 An example of the rewriting cycles for a simple XOR network
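The rewriting cycles can be reproduced mechanically. The sketch below is a hypothetical reconstruction of the decompression step for the example chromosome, assuming Kitano's convention that 'a' to 'p' enumerate the sixteen 2 x 2 binary matrices row-wise ('a' = 0000 through 'p' = 1111):

```python
# Evolved rules read from the chromosome "ABBD kpak aaaf aaaa aiab":
# the 'Start' rule plus one rule per character of A1 (each RHS row-wise).
rules = {
    "Start": ["A", "B", "B", "D"],
    "A": list("kpak"), "B": list("aaaf"),
    "C": list("aaaa"), "D": list("aiab"),
}
# Fixed rules: 'a'..'p' -> the sixteen 2x2 binary matrices, row-wise.
for i, ch in enumerate("abcdefghijklmnop"):
    rules[ch] = [int(b) for b in f"{i:04b}"]

def expand(symbol_matrix):
    """One rewriting cycle: replace every cell by its 2x2 right-hand side."""
    n = len(symbol_matrix)
    out = [[None] * (2 * n) for _ in range(2 * n)]
    for r in range(n):
        for c in range(n):
            rhs = rules[symbol_matrix[r][c]]
            out[2 * r][2 * c], out[2 * r][2 * c + 1] = rhs[0], rhs[1]
            out[2 * r + 1][2 * c], out[2 * r + 1][2 * c + 1] = rhs[2], rhs[3]
    return out

m = [["Start"]]
for _ in range(3):   # M = 3 rewriting cycles -> 8x8 connectivity matrix
    m = expand(m)
```

The resulting matrix contains the input-to-hidden and hidden-to-output connections of the XOR network, as well as the superfluous entry between neurons 5 and 7 that is removed by pruning.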

All hidden and output neurons have a connection from the bias neuron, which has a
constant activation value of 1. These connections are not represented in the
connectivity matrix. One entry in the connectivity matrix needed to be pruned: a
connection between neurons 5 and 7 is encoded, but since neither of these neurons is
connected to any other neuron, the connection is useless.

As can be seen above, a rewriting rule for the character 'C' is coded in the
chromosome but is not actually used. Because only feedforward neural networks are
wanted, with no connections to input neurons and no connections from output neurons,
only the highlighted upper-right part of the connectivity matrix is used. The
method is therefore not very 'clean': some information that is contained in the
chromosomes is never used (i.e. the lower-left part of the matrix). The corresponding
neural network is shown in Figure 10.3.

Figure 10.3 The neural network constructed from the matrix grammar

• The Alphabet Sizes


One of the main problems of the matrix grammar approach is how to decide on the
alphabet size for a certain rewriting cycle. If the alphabet size of the nth rewriting
cycle is too small, the number of different sub-matrices that can be coded in that
rewriting cycle may be too small. In the example described above, if the alphabet size
of the first rewriting cycle, k1, is set to 2 instead of 4 (i.e. A1 = {A, B} instead of
{A, B, C, D}), the 8x8 connectivity matrix can only be made up of two different 4x4
sub-matrices. There is no way to code more than two different 4x4 sub-matrices, while the complete

matrix is made up of four. If it must be possible for all sub-matrices to be
distinct (e.g. A1 = {A, B, C, D} or larger), the chromosome will
become very large for larger connectivity matrices. The cardinality of every alphabet
must then be equal to (or larger than) the number of corresponding sub-matrices in the
complete matrix. For example, a 16x16 matrix would require the following (minimal)
configuration: k1 = 4 (the number of 8 x 8 matrices) and k2 = 16 (the number of 4 x 4
matrices). The resulting chromosome would have a length l = 4 + 4*4 + 4*16 = 84. In
the case of a 32 x 32 matrix, the corresponding chromosome length would be l = 4 +
4*4 + 4*16 + 4*64 = 340. In general, the chromosome length needed for a system with
M rewriting cycles (i.e. a matrix of size 2^M x 2^M) is:

l = Σ 4^n   (sum over n = 1, ..., M-1)
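This length formula can be checked against the worked examples in the text (a sketch; the function name is an illustrative choice):

```python
# Chromosome length for M rewriting cycles (matrix size 2^M x 2^M), assuming
# every alphabet is made just large enough for all sub-matrices to be distinct:
# 4 symbols for the 'Start' rule plus 4 symbols per rule of each evolved level.
def chromosome_length(m_cycles):
    return sum(4 ** n for n in range(1, m_cycles))
```

For M = 3 this gives the length 20 used for the XOR example, and for M = 4 and M = 5 the lengths 84 and 340 quoted above.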

Thus for larger size matrices the chromosomes will become large indeed. Still, this
chromosome length is smaller than that needed in the direct encoding scheme to code
the same size matrix. In the direct encoding scheme there exists a one-to-one
correspondence between the genes in the chromosome and connections in the neural
network. Each gene codes one connection. This scheme will be described in section
10.5.

In the matrix grammar configuration described above an alternative approach can be
taken. Since, for example, in the 16x16 matrix scheme every one of the sixteen 4x4 sub-
matrices can be represented uniquely, an option is to let the chromosome simply
code these sixteen 4 x 4 matrices without the first two matrix rewriting rules. The
chromosome length would then be l = 4*16 = 64, which is the chromosome length
that would have been needed in a direct encoding scheme divided by 16. The
variable rewriting rules in the chromosome are replaced by fixed ones that are part of
the system. In such a scheme, however, it is no longer possible to encode regularities in
the final matrix, and the chromosome length will still be very large for bigger
problems. It does however seem reasonable to leave out the rewriting rules of the first
rewriting cycle. With an alphabet size k1 of 4, every 'quarter' of the connectivity
matrix can be represented uniquely with a character of the first alphabet, A1. It seems
reasonable to assume that most networks will in general not have identical matrix
quarters, so that the first rewriting rule will be, for example, 'Start' -> {C,B,A,D} and
not something like 'Start' -> {A,A,B,A}. The first rewriting rule may then be left out of
the chromosome altogether, and replaced by a fixed rule of the form:

'Start' -> {A,B,C,D} or even 'Start' -> {A,B,0,D}, where 0 represents a 'quarter' sub-matrix
filled with zeros, since the third quarter of the connectivity matrix is never used
in the representation of feedforward networks. The same procedure could in fact also
quite easily be followed for the second rewriting cycle. Instead of coding the
corresponding rewriting rules in the chromosome, one could have fixed rewriting rules
like: A -> {A',B',C',D'}, B -> {E',F',G',H'} etc. These concepts have not yet been
investigated, but they are suggested as a logical extension to the work carried out to
date.

The principal goal is to limit the alphabet sizes so that the chromosomes will be of
manageable length. The resulting connectivity matrix is then made up of sub-matrices
that cannot all be unique, with some sub-matrices being used more than once. This
does of course place restrictions on the neural network structures that can be
generated, since they must have some form of regularity in their connectivity matrices.
The severity of these restrictions depends on the alphabet sizes used. Many problems
can, however, be very adequately solved using neural networks with a high level of
regularity. The classic case of a fully connected feedforward network, for example,
contains a very high level of regularity in the connectivity matrix, as is illustrated in
Figure 10.4, which shows the 16x16 connectivity matrix of a fully connected '4-4-3'
neural network. In this particular case the 5th, 6th, 7th and 8th columns and rows in the
matrix correspond to the four hidden neurons. Overall, significant reductions in
chromosome complexity may be obtained if relatively regular and repeating structures
are acceptable for the evolved neural networks.

• Competing Conventions
In a connectivity matrix as in Figure 10.4 a hidden neuron can be represented by any
one of the rows/columns 5 to 13. The first four and the last three rows/columns are
reserved for the input and output neurons. This means, for example, that the fully
connected '4-4-3' network can be represented by (9 choose 4) = 126 different connectivity
matrices of size 16x16. Since different connectivity matrices correspond to different
chromosomes, the genetic representation clearly suffers from competing conventions
(see section 3.1.4). Many different matrices, and therefore many different
chromosomes, can be used to represent the same neural network structure.
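The count of 126 follows directly from choosing positions for the hidden neurons (a quick check, using Python's standard library):

```python
from math import comb

# Positions available for hidden neurons in a 16x16 matrix: rows/columns 5..13,
# since the first four are reserved for inputs and the last three for outputs.
hidden_slots = 16 - 4 - 3            # 9 candidate rows/columns
placements = comb(hidden_slots, 4)   # ways to place the 4 hidden neurons
```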

Figure 10.4 A 16x16 connectivity matrix representing a fully connected '4-4-3'
neural network

In fact the competing conventions problem is even worse than the above analysis
indicates. Using the matrix grammar scheme, many different chromosomes can be
used to represent one and the same connectivity matrix. This is illustrated below
where two different chromosomes code the same 8x8 connectivity matrix pictured in
Figure 10.2 that corresponds to the XOR network.

Figure 10.5 Example of competing conventions. The two different chromosomes
represent the same connectivity matrix

The reason for this type of competing convention is that a position within the
chromosome does not correspond to a fixed position in the connectivity matrix.

• Representation
The matrix grammar representation scheme relies not just on two but on three
different spaces: the representation space (the chromosomes), the evaluation space

(the neural network structures) and an intermediate space consisting of the
connectivity matrices. This is illustrated in Figure 10.6. Both the mapping from the
representation space to the intermediate space and the mapping from the intermediate
space to the evaluation space suffer from competing conventions.

Figure 10.6 The three spaces used in the matrix grammar representation scheme.
Two examples of competing conventions are shown

As stated above, depending on the alphabet sizes used there will be a restriction on the
network structures that can be generated. The system will in general not be able to
generate an arbitrary feedforward neural network structure. The connectivity matrix
must contain some kind of regularity, so the evaluation space is only part of the
complete problem space, where the problem space is defined as the complete set of
feedforward networks with the appropriate numbers of inputs and outputs.

10.4 Combining Structured GAs with the Matrix Grammar
The chosen chromosomes have a two-level hierarchy. The top level is a string of
characters that codes the matrix rewriting rules (the structural part); the bottom level is
a real-valued string of the weights of the connections of a fully connected feedforward
network (the parametric part). These weights are coded as a long real-valued string
corresponding to the 'upper-right' half of the connectivity matrix, column by column
starting from the left. Added to each column is the weight of the connection to the bias
unit. The set of weights in use by the actual network is only a subset of this parametric
part. This is illustrated by Figure 10.7.

Figure 10.7 The mapping of a two-level chromosome into a NN

The first three entries in the parametric part of the chromosome correspond to the
incoming connections of neuron 3 (i.e. column 3 with the bias weight added). The
next four entries correspond to neuron 4, etc. The 6th entry corresponds to the
connection from neuron 3 to neuron 4. Since this connection is absent in the neural

network, the corresponding weight value is simply ignored (as are most of the weight
values contained in the chromosome). There are 27 (upper-right half of the matrix of
Figure 10.2) + 6 (bias weights of all hidden and output neurons) = 33 weight values
represented in the parametric part of the chromosome, while only 9 of these are in use
by the network.
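The column-by-column layout of the parametric part can be sketched with two small helper functions (hypothetical names, illustrating the indexing only):

```python
# The parametric part stores columns n_in+1..N left to right; each column c
# holds its upper-half entries (rows 1..c-1) followed by that neuron's bias
# weight, so column c contributes c entries in total.
def parametric_length(n, n_in):
    return sum(range(n_in + 1, n + 1))

def weight_index(col, row, n_in):
    # 0-based index of the weight for connection row -> col (1-based neurons).
    start = sum(range(n_in + 1, col))   # entries of all earlier columns
    return start + (row - 1)
```

For the 8 x 8 XOR example (2 inputs), the weight of the connection from neuron 3 to neuron 4 is the 6th entry, as described above.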

A separate back propagation training module is used to evaluate the neural networks.
After training, the parametric part of the chromosome is updated with the trained
weights. The parametric changes that result from running the training module are
carried onto future generations by means of weight transmission. No parametric
changes are performed within the GA itself. Only the structural part of the
chromosome is subject to the genetic operations. The working of this system is
illustrated in Figure 10.8.

Figure 10.8 The sGA system with a separate Back Propagation training module

Since only feedforward neural networks are generated in the present application, the
number of possible connections, or the maximum complexity, C_max, when in total N
neurons are used is:

C_max = (N^2 - N)/2 - (N_in^2 - N_in)/2 - (N_out^2 - N_out)/2

where:
N = total number of neurons
N_in = number of input neurons
N_out = number of output neurons

The total number of neurons includes the inputs to the network according to the
accepted convention, but not the bias unit. C_max is the complexity of a fully
interconnected network specified by the total number of units, the number of inputs
and the number of outputs. In other words it is the number of entries in the upper-right
half of the connectivity matrix.

The last two terms in C_max reflect the facts that input neurons are not allowed to have
incoming connections and that there are no outgoing connections from output neurons.
This means that in the connectivity matrix the columns corresponding to input neurons
do not have any entries (or the entries are simply discarded) and the rows
corresponding to the output neurons are empty.
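The complexity computation can be written as a one-line sketch (the function name is an illustrative choice):

```python
# Maximum feedforward complexity: the strict upper triangle of an N x N
# connectivity matrix, minus the forbidden input-input and output-output entries.
def c_max(n, n_in, n_out):
    tri = lambda k: (k * k - k) // 2   # entries in a strict upper triangle
    return tri(n) - tri(n_in) - tri(n_out)
```

This reproduces the figures that appear elsewhere in the chapter: 27 for the 8 x 8 XOR configuration (2 inputs, 1 output) and 487 for a 32-neuron, 4-input, 3-output configuration.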

As an example, a neural network with 35 input neurons, 10 output neurons and a
maximum of 20 hidden neurons has 2037 possible feedforward connections. This
could impose quite severe strains on computational facilities, especially when a large
population size is needed. A '35-20-10' fully connected feedforward network uses
only 900 of these 2037 possible weights. If parametric changes such as weight
mutations were implemented in the GA system, it would mean that, for such a
network, these changes would affect the active weights in less than 50% of the cases.
This seems a wasteful procedure, and parametric changes would best be performed on
the active weights only. However, this causes a problem when a grammar encoding
such as the one proposed is used: the set of active weights cannot be directly read from
the top level of the chromosome. This top level first needs to be translated into the
neural network connectivities, and this would have to be done for

every parametric change. Since at present no parametric changes are implemented in
the GA itself, this is of no concern.

10.4.1 Genetic Operators


During recombination the genetic operators work on the top level of the chromosome.
When two neural network structures reproduce, the weights of the resulting offspring
are initialised using weight transmission as described above. Crossover and mutation
are implemented as normal.

10.4.2 Evaluation
When a chromosome is evaluated into a neural network, the top level of the
chromosome is first translated into the connectivity matrix. This matrix is then
pruned so that there are no hidden neurons without incoming or outgoing connections.
After this step the matrix is transformed into a neural network which, after the training
phase (backpropagation), is tested on a set of training patterns. The network uses the
values of the weights of the bottom level of the chromosome that correspond to the
connections used. The fitness value reflects the error on this training set and can
optionally include a measure of the network's complexity. The amount of pruning that
was necessary can also be reflected in the fitness as a negative measure. In [66],
networks that were less complex (i.e. had fewer connections) were preferred over more
complex ones when both networks achieved comparable performance on the training
data. In this way, minimal complexity neural networks can evolve. After training, the
parametric part of the chromosome is updated with the trained values of the weights.

Back propagation training is performed for a set number of cycles, rather than the
more customary process of stopping at a required error level, since convergence
cannot be assumed for all networks. The optimal number will depend on the problem
and on the training set used. The back propagation module is a standard one using the
normal gradient descent weight updating rule with a momentum term. Default values
for the learning rate and the momentum term are 0.1 and 0.9 respectively. The module
uses a 'per-pattern' weight update mechanism, meaning that the weights are updated
after every presentation of a training pattern (and not after a presentation of the
complete training set).
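A single per-pattern update under this rule can be sketched as follows (a standard gradient descent step with momentum, using the default values from the text):

```python
# One per-pattern weight update: step against the gradient, plus a momentum
# fraction of the previous step. Defaults follow the text (lr=0.1, momentum=0.9).
def update_weight(w, grad, prev_delta, lr=0.1, momentum=0.9):
    delta = -lr * grad + momentum * prev_delta
    return w + delta, delta
```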

The following fitness function is used:

f(x) = E(x)/N_out + α·C(x)/C_max + β·P(x)

where: E(x) = the cumulative squared error of the individual on the training set
N_out = the number of output neurons
C(x) = the number of connections in the network (including the bias weights)
C_max = the maximum complexity for the specific configuration
P(x) = the number of connections that had to be pruned

The genetic algorithm works in such a way that the fitness measure is minimised
instead of maximised. The relative weights of the complexity and pruning terms can be
set by α and β. The optimal values are problem dependent and possibly quite hard to
find. Optionally, α and/or β can be set to 0 so that the corresponding term(s) have no
influence on the fitness.
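A minimal sketch of this fitness measure, assuming the reading that the error is normalised by the number of outputs and the complexity by C_max (the exact normalisation of the pruning term is an assumption here):

```python
# Fitness to be MINIMISED by the GA: normalised error, plus weighted
# complexity and pruning terms (alpha and beta as in the text).
def fitness(error, n_out, n_conn, c_max, n_pruned, alpha=1.0, beta=0.0):
    return error / n_out + alpha * n_conn / c_max + beta * n_pruned
```

With beta = 0 the pruning term drops out, matching the setting used in the XOR experiments later in the chapter.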

10.5 Direct Encoding


For comparison with the matrix grammar approach, a direct encoding scheme was also
implemented. Details are given in this section.

The direct encoding scheme differs from the matrix grammar scheme described above
in the representation of the network structure. The parametric part of the chromosome
that codes the weights is identical. Direct encoding is implemented as a bit-string that
directly represents the neural network, with a one-to-one correspondence between the
genes and the connections of the network. The bit-string is simply the upper-right half
of the connectivity matrix that defines the neural network structure. As with the matrix
grammar scheme, the size of the matrix must be set a priori. Thus the maximum
number of neurons is pre-defined. In contrast to the matrix grammar scheme, however,
the direct encoding scheme is not restricted to matrices of size 2^M; any matrix
size can be used.

As an example of how the structural part of the chromosome is translated into a neural
network, the same XOR network of the last section is considered. The matrix size

used is the same: 8 x 8. The upper-right half of the connectivity matrix (the highlighted
part of Figure 10.2) consists of 27 bits. The structural part of the chromosome
contains these bits 'row-wise'. The network structure is now represented by the
following bitstring:

x = (110000, 110000, 00001, 0001, 000, 00, 0)

During evaluation this is translated into the following connectivity matrix:

The matrix is then translated into the same XOR neural network structure as in Figure
10.3.

Advantages of using direct encoding over the matrix grammar method are that the
matrix size is not restricted to values of 2^M, and that the method is cleaner in that
only the upper-right half of the connectivity matrix is coded in the chromosome. A
disadvantage is that the chromosome length increases rapidly with network size and
that it offers no way to code regularities in the network structure. The
chromosome length equals the maximum number of feedforward connections given
the maximum number of neurons (i.e. the matrix size) and the number of inputs and
outputs, i.e. C_max (see the last section). For example, a 4-input, 3-output neural
network with a maximum total of 32 neurons requires a chromosome length of 487
with the direct encoding scheme.
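Reading the structural bit-string off a connectivity matrix can be sketched as follows (a hypothetical helper; rows holding outgoing connections are traversed top to bottom, skipping the forbidden input columns and output rows):

```python
# Row-wise upper-right half of the connectivity matrix: output neurons have no
# outgoing connections (rows skipped at the bottom) and input neurons no
# incoming connections (columns skipped at the left).
def matrix_to_bits(conn, n_in, n_out):
    n = len(conn)
    bits = []
    for i in range(n - n_out):
        for j in range(max(i + 1, n_in), n):
            bits.append(conn[i][j])
    return bits
```

Applied to the pruned 8 x 8 XOR matrix (2 inputs, 1 output) this yields the 27-bit string shown above.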

The direct encoding scheme also suffers from competing conventions, since a single
neural network structure may be represented by various connectivity matrices. In
contrast to the matrix grammar scheme, however, there is a one-to-one correspondence
between chromosomes and connectivity matrices: a single connectivity matrix cannot
be represented by different chromosomes.

Theoretical analysis (such as the Schema Theorem) suggests that for good
performance of the GA, functionally related genes should be close together on the
chromosome so that they are not easily disrupted by crossover. In the direct encoding
scheme this can be taken to mean that connections belonging to the same neuron
should be close together on the chromosome. Since the connections are coded in the
chromosome row by row, this is true for the outgoing connections of a neuron. The
incoming connections of a neuron, however, can be located very far apart. This is
caused by the mapping of the two-dimensional network structure onto a one-
dimensional linear chromosome. Part of the information concerning the position of
connections in the network is lost. A remedy could be to use the two-dimensional
positional information in the genetic operators (crossover, mutation). In effect a
chromosome could be treated as a direct two-dimensional representation of the
connectivity matrix, and crossover could for example be implemented as swapping
parts of rows (outgoing connections) and parts of columns (incoming connections) or
even areas in the matrix (functional groups of neurons). These approaches have not
yet been attempted, but are suggested as directions for future research.
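A hypothetical sketch of such a two-dimensional crossover (not part of the implemented system): offspring swap whole rows of the connectivity matrix below a cut point, so each neuron's group of outgoing connections is inherited intact from one parent.

```python
# Two-dimensional row crossover on connectivity matrices: rows [cut:] are
# exchanged between the parents, keeping each row (a neuron's outgoing
# connections) together as one unit.
def row_crossover(a, b, cut):
    child1 = [row[:] for row in a[:cut]] + [row[:] for row in b[cut:]]
    child2 = [row[:] for row in b[:cut]] + [row[:] for row in a[cut:]]
    return child1, child2
```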

10.6 Network Pruning and Reduction


In the evaluation stage of both the matrix grammar and the direct encoding systems,
the generated networks are first pruned before they are trained using the back
propagation module. Pruning removes hidden neurons, and their links, that have no
incoming or no outgoing connections. It does this recursively until all such neurons
have been removed. Figure 10.9 shows an example of network pruning. Also shown is
network reduction, where neurons that have only one incoming link are discarded and
the links reorganised.
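The recursive pruning step can be sketched as follows (a simplified illustration: dangling hidden neurons have their links zeroed rather than the matrix being shrunk):

```python
# Recursively remove hidden neurons that have links on only one side (no
# incoming or no outgoing connections), repeating until nothing changes.
def prune(conn, n_in, n_out):
    n = len(conn)
    changed = True
    while changed:
        changed = False
        for h in range(n_in, n - n_out):   # hidden neurons only
            has_in = any(conn[i][h] for i in range(n))
            has_out = any(conn[h][j] for j in range(n))
            if has_in != has_out:          # dangling: links on one side only
                for i in range(n):
                    conn[i][h] = conn[h][i] = 0
                changed = True
    return conn
```

On the XOR example this removes the useless 5-to-7 connection: neuron 5 has no incoming link, so its outgoing link is dropped, which in turn leaves neuron 7 isolated.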

The network reduction stage is not implemented in the GA systems, although it would
somewhat reduce the computational cost of the learning stage of the network.
Networks can be penalised for the amount of pruning necessary (the number of links
that needed to be removed) by setting the parameter β in the fitness function to an
appropriate value. The potential need for network reduction is in a way penalised by a
greater complexity term (regulated by α) in the fitness function. Assuming that both
networks on the right have a (near) equal error on the training set, the network after
pruning is preferred to the one before pruning with a setting of α > 0.

Figure 10.9 Network pruning (first step) and reduction (second step). In the current
set-up only network pruning is performed in the evaluation phase just before training

10.7 Experiments
In our preliminary experiments we have implemented an sGA system as described
above, where not only the structure but also the weights of the network are coded in
the chromosomes. The GA software used was SUGAL v1.0; see section 9.1 for a
description. The changes made to the mutation operators also apply here; i.e. a gene is
mutated with probability p_m.

The matrix grammar approach is compared to the direct encoding scheme described in
the last section. The same data-sets that were introduced in section 9.3.1 are used.

10.7.1 Set-up
The structural part of the chromosomes uses symbolic coding and the weights are
coded using real-valued coding. The symbolic coding uses integer-valued genes taken
from the alphabet {0, ..., k-1}, where k is the alphabet size or cardinality. The coding
of the structural part is, in general, non-homogeneous in that it consists of different
parts, each having its own alphabet size. These parts correspond to the rewriting rules
for one or more rewriting cycles. In the present implementation only homogeneous
chromosomes were used, because implementing non-homogeneous chromosomes in
the GA software used is quite difficult. The alphabet size was normally set to k = 16.
When, for example, some part of the chromosome has an alphabet size of 4, the
character set is reduced from 16 to 4 during evaluation by the following rules. If the

gene lies in the range 1, ..., 4 the corresponding character will have a value of 1, if it lies
in the range 5, ..., 8 the character will have a value of 2, etc.

10.7.2 Results

• The XOR Problem


Preliminary experiments were performed on the simple XOR data set. Because of the
very small scale of this problem it is difficult, if not impossible, to make any
comparison between various settings and systems. The XOR problem is interesting in
the sense that the minimal network that is able to solve it is known. It is shown in
Figure 10.10 and consists of two inputs, one hidden neuron and one output neuron.
The GA system was run with a small 8 x 8 connectivity matrix. The matrix grammar
scheme was used with an alphabet size for A1 of k1 = 4 (A1 = {'A', ..., 'D'}), resulting in a
chromosome length of 4 + 4*4 = 20. The alphabets A2 and A3 are the predefined ones:
A2 = {'a', ..., 'p'} and A3 = {0, 1}. One of the main problems is to decide on the
number of back propagation cycles that the networks are trained for each time they are
evaluated. For example, the number of cycles needed for the minimal XOR network to
learn the task within a tolerance of 0.4 is on average about 300, but the actual number
depends strongly on the values of the initial weights. Also, other networks may need
many more cycles to learn the XOR task.

Figure 10.10 The 'minimal XOR' neural network. This is the lowest-complexity
network structure that is able to solve the XOR problem

The number of back propagation cycles was initially set to 500. The pruning term in
the fitness function was 'turned off': β = 0.0. Experiments were performed with
several settings of the complexity measure α. It was observed that even for quite small

values of α, the GA system converged to a network that had no connections at all (i.e.
C(x) = 0). Such a network can still achieve a reasonable fitness value simply because
its complexity is so low. Since these networks have nothing to offer in the GA search,
the fitness function was changed. If a network has no connections at all, its (raw)
fitness, f(x), is simply set to an extremely high value, resulting in a normalised fitness
of effectively zero. The same could be done for networks that fall below a certain
level of complexity. In the case of the XOR problem all networks with a complexity
below seven could be eliminated this way. Since, in general, the minimal complexity
required to solve a problem is not known in advance, this approach could not be
universally applied. However, in many cases it might be reasonable to assume that the
networks generated should have outgoing links from all inputs and incoming links to
all output neurons. The minimum level of complexity can then be set equal to the
number of inputs plus the number of outputs times two (incoming connections plus
connections to the bias unit). This approach was applied in our testing, where the
minimal complexity was set to:


C_min = N_in + 2·N_out

The fitness function was changed to f'(x), defined as:

f'(x) = 10000   if C(x) < C_min
f'(x) = f(x)    otherwise

Of course this still does not guarantee that all the inputs and outputs are used, but it
avoids the generation of useless, very low complexity neural networks.
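This thresholding can be sketched in a few lines (hypothetical function names; the constant 10000 is an arbitrary prohibitive value):

```python
# Minimum acceptable complexity: every input should have an outgoing link and
# every output an incoming link plus a bias connection.
def c_min(n_in, n_out):
    return n_in + 2 * n_out

# The GA minimises fitness, so under-complex networks get a prohibitive value.
def thresholded_fitness(raw_fitness, n_conn, n_in, n_out):
    return 10000.0 if n_conn < c_min(n_in, n_out) else raw_fitness
```

For the XOR configuration (2 inputs, 1 output) this gives C_min = 4, consistent with the five-connection (C_min + 1) networks discussed below.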

It was observed that the performance of the system on the XOR problem depends very
much on the number of back propagation cycles. If this number is too low, the neural
networks are not properly tested on their task. With a setting of α = 1.0 and, for
example, 100 BP cycles, the GA always converges to a neural network with only five
connections (= C_min + 1). This network uses one hidden neuron which has one
connection to an input and one to the bias unit. It has a poor performance on the
training set: E(x) ≈ 1 and two of the four training patterns are misclassified. With
the number of BP cycles set to 500, however, the GA finds the minimal XOR network
as pictured in Figure 10.10 on average in as little as 3 generations. The specific GA
settings are given in Table 10.1.

Table 10.1 GA matrix grammar settings for NN optimisation concerning the XOR
problem with matrix size 8x8

Parameter                Setting
α                        1.0
β                        0
BP cycles                500
coding                   symbolic
crossover type           two point
elitism                  on
fitness normalisation    reverse linear ranking with bias = 10.0
l                        20
mutation type            normal
N                        50
p_c                      0.8
p_m                      0.005
re-evaluation            off
replacement mechanism    unconditional
selection mechanism      Roulette
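The selection settings above (roulette selection over reverse-linear-ranking-normalised fitness) can be sketched as follows. This is one plausible reading of those settings; the exact ranking formula and all names are ours, where 'reverse' reflects that lower raw fitness is better:

```python
import random

def ranked_probabilities(fitnesses, bias=10.0):
    """Assign selection weights linearly by rank: the LOWEST raw fitness
    (best network) gets weight `bias`, the worst gets weight 1.
    Illustrative reading of 'reverse linear ranking with bias'."""
    n = len(fitnesses)
    if n == 1:
        return [1.0]
    order = sorted(range(n), key=lambda i: fitnesses[i])
    weights = [0.0] * n
    for rank, i in enumerate(order):
        weights[i] = bias - (bias - 1.0) * rank / (n - 1)
    total = sum(weights)
    return [w / total for w in weights]

def roulette(probs, rng=random.random):
    """Roulette-wheel selection over normalised probabilities."""
    r, acc = rng(), 0.0
    for i, p in enumerate(probs):
        acc += p
        if r <= acc:
            return i
    return len(probs) - 1

probs = ranked_probabilities([3.2, 0.4, 7.9], bias=10.0)
assert max(probs) == probs[1]        # the lowest-fitness network is favoured
assert abs(sum(probs) - 1.0) < 1e-9
```

Ranking decouples selection pressure from the raw fitness scale, which matters here because the 10000 'reject' score would otherwise dominate a proportional roulette wheel.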

No further experiments were performed on the XOR problem because it is very hard
to make any comparative statements from simulations on such a relatively simple
problem.

• Iris Flower Data


The iris flower data has 4-element input vectors and 3 classes. When trained with back
propagation a fully connected feedforward neural network performs well on the
training set with a single hidden layer consisting of 4 neurons; i.e. a '4-4-3' network.
This network has a total of 11 neurons (not counting the bias unit) and its number of
connections, the complexity, is 4*4 + 4*3 + 4 + 3 = 35 (including the connections of
each neuron to the bias unit).
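The complexity bookkeeping used above (weights between consecutive layers, plus one bias link per non-input neuron) can be captured in a small helper; the function name is ours:

```python
def fully_connected_complexity(layers):
    """Connections in a fully connected feedforward net, counting one
    bias link per non-input neuron, as in the '4-4-3' example."""
    weights = sum(a * b for a, b in zip(layers, layers[1:]))
    bias_links = sum(layers[1:])
    return weights + bias_links

assert fully_connected_complexity([4, 4, 3]) == 35      # iris '4-4-3' network
assert fully_connected_complexity([17, 12, 12]) == 372  # radar '17-12-12' network
```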

156 Chapter 10. Using a GA with Grammar Encoding to Generate Neural Networks

The matrix grammar GA system was implemented to generate a 16 x 16 connectivity
matrix. Four rewriting steps were therefore needed, resulting in a network that uses at
most 16 neurons (not including bias). The alphabet sizes for these steps were k1 = 4
and k2 = 16. The last two alphabets (A3 and A4) have their standard cardinalities of
k3 = 16: {'a', ..., 'p'} and k4 = 2: {'0', '1'}. Since the (fixed) rewriting rules for the
last cycle are not coded in the chromosome, the chromosome consists of
4 + 4*4 + 16*4 = 84 symbols.
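Assuming this counting scheme (4 symbols for the initial 2x2 expansion, plus 4 symbols per rule with one rule per symbol of each coded alphabet), the chromosome lengths quoted in this chapter for the 16x16, 32x32 and 64x64 matrices can be reproduced; the helper name is ours:

```python
def grammar_chromosome_length(coded_alphabet_sizes):
    """Length of a matrix grammar chromosome: 4 symbols for the start
    rule plus 4 symbols per 2x2 rewriting rule, one rule per symbol of
    each coded alphabet. The final (fixed) 0/1 rewriting step is not
    coded on the chromosome."""
    return 4 + sum(4 * k for k in coded_alphabet_sizes)

assert grammar_chromosome_length([4, 16]) == 84           # 16x16 matrix
assert grammar_chromosome_length([4, 16, 16]) == 148      # 32x32 matrix
assert grammar_chromosome_length([4, 16, 16, 16]) == 212  # 64x64 matrix
```

Each doubling of the matrix side adds one coded alphabet of 16 symbols, so the length grows by a constant 64 symbols per doubling, while direct encoding grows quadratically with the matrix side.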

The matrix grammar and the direct encoding approach are compared on the neural
network optimisation problem concerning the iris data. Both systems used a maximum
number of neurons of 16 and were run for 200 generations. The GA settings are
shown in Table 10.2. The settings for both systems are identical except of course for
the chromosomal representation used.

Table 10.2 GA settings for NN optimisation using Iris data with matrix size 16 x 16

Parameter                Setting (Matrix Grammar)                   Setting (Direct Encoding)
α                        10.0                                       10.0
β                        0                                          0
BP cycles                1                                          1
coding                   symbolic                                   binary
crossover type           two point                                  two point
elitism                  on                                         on
fitness normalisation    reverse linear ranking with bias = 10.0    reverse linear ranking with bias = 10.0
l                        84                                         111
mutation type            normal                                     normal
N                        50                                         50
p_c                      0.8                                        0.8
p_m                      0.005                                      0.005
re-evaluation            on                                         on
replacement mechanism    unconditional                              unconditional
selection mechanism      Roulette                                   Roulette

Figure 10.11 shows the fittest individuals at the end of a typical run for both systems.
Both networks shown have hidden neurons that have one incoming link only (neuron
11 in the top network and neuron 12 in the bottom); they can be removed and the links
can be reorganised using network reduction (not shown) resulting in networks with a
complexity of 18 and 19 respectively.
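The network reduction mentioned above can be sketched roughly as follows. This is a simplified illustration on a set-of-links representation (names ours; bias links ignored): a hidden neuron with exactly one incoming link is removed and its source is wired directly to the neuron's targets.

```python
def reduce_network(connections, hidden):
    """Repeatedly remove hidden neurons that have a single incoming
    link, rewiring that link's source straight to the neuron's targets.

    connections: set of (source, destination) neuron pairs."""
    conns = set(connections)
    changed = True
    while changed:
        changed = False
        for h in list(hidden):
            incoming = [s for s, d in conns if d == h]
            outgoing = [d for s, d in conns if s == h]
            if len(incoming) == 1:
                src = incoming[0]
                conns -= {(src, h)} | {(h, d) for d in outgoing}
                conns |= {(src, d) for d in outgoing if d != src}
                hidden.remove(h)
                changed = True
    return conns

# hidden neuron 11 has one incoming link (from input 1); it is removed
# and input 1 is wired straight to output 12
conns = reduce_network({(1, 11), (11, 12), (2, 12)}, hidden=[11])
assert conns == {(1, 12), (2, 12)}
```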

Figure 10.11 The best individuals of a run for the Iris data, matrix grammar vs direct
encoding, with the corresponding fitness, f(x), error on the iris training set, E(x), and
complexity, C(x). The values of the weights are not shown. Both networks
misclassified one training pattern. Every neuron is connected to the bias unit (not
shown)

Both systems showed very similar behaviour on this problem. Convergence curves of
the best individual in the population vs generation were nearly identical. The best
individuals as shown above are just examples of one particular run of the GAs. The
particular run resulting in the top neural network is shown in Figure 10.12. As an
indication of the computational requirements, the run took about 20 minutes on a
Sun-Sparc 4 workstation.

Figure 10.12 The GA run with matrix grammar system resulting in the top network
of Figure 10.11

In Figure 10.13 the same GA run is shown, but this time the complexity, C(x), and the
error on the training set, E(x), of the fittest individual are shown as well as the mean
complexity and error in the population. The complexity and error of the fittest
individual do not, of course, have to correspond to the lowest complexity and error
values found in the population.

It is interesting to note that the level of complexity of the fittest individual no longer
changes after about 70 generations. The search seems to be stuck at a certain network
structure and the only change in fitness is due to the further learning of the weights.
With the GA settings chosen, every individual of the population is re-evaluated at the
start of the generation. If the fittest individual remains the fittest over several
generations, due to elitism its structure will not be changed and at the start of every
generation its weights will be further refined in training resulting (in general) in a
further decrease in error and therefore a better fitness. Without re-evaluation however
very similar results were obtained. After the initial phase the only change in the fitness
of the fittest individual was caused by a decrease in error. Apparently a number of
individuals in the population share the same network structure and, when evaluated,
one of them will replace the fittest. The search still seems to be mainly centred around
a single network structure.

Figure 10.13 The same GA run showing the complexity C(x) and error E(x) of the
fittest individual as well as the mean complexity and error of the population

The neural networks found on this problem using the particular GA settings had a low
level of complexity and generally used quite a few direct connections from input to
output neurons. The networks had on average a complexity of 20. When tested on the
iris test data the neural networks performed well. For example the top network in
Figure 10.11 produced an error of 5.46 on the test data with 4 patterns misclassified
(tolerance = 0.4). So 95% of the test set was correctly classified. Further training of
the network using back propagation did not improve on this (nor on the performance
on the training set itself). This performance is similar to a fully connected '4-4-3'
neural network (complexity = 35) that has been trained for 1000 cycles and produced
a 100% correct classification of the training set with an error of 0.003. This trained
network produces an error of 5.89 on the test data with 3 patterns misclassified.
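The misclassification counts quoted here rest on an output tolerance of 0.4. A plausible reading of that criterion, with the helper name our own, is that a pattern counts as misclassified when any of its outputs deviates from the target by more than the tolerance:

```python
def misclassified(outputs, targets, tolerance=0.4):
    """Count patterns for which any output deviates from its target by
    more than `tolerance` (our reading of the tolerance criterion)."""
    return sum(
        any(abs(o - t) > tolerance for o, t in zip(out, tgt))
        for out, tgt in zip(outputs, targets)
    )

outs = [[0.9, 0.1, 0.2], [0.5, 0.4, 0.1]]
tgts = [[1.0, 0.0, 0.0], [1.0, 0.0, 0.0]]
assert misclassified(outs, tgts) == 1   # only |0.5 - 1.0| exceeds 0.4
```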

So despite the relatively low level of complexity and the relatively poor performance
on the training set, the generalisation capabilities of the neural networks found with
the specific settings of the GA system are good. It is interesting to note that both the
networks of Figure 10.11 could not correctly classify all the patterns in the training
set. No matter how long they were trained using back propagation, one pattern
remained misclassified. It seems that the structure of the networks simply does not
allow for a 100% correct classification of the training data. For example in the case of
the top network, it could very well be that in order for the pattern in question to be
correctly classified connection(s) from the first input neuron are needed.

• Effect of the Complexity Term


The trade-off between the error term and the complexity term in the fitness function is
a difficult one. If the complexity measure, α, is too low, the system invariably
converges to networks with a very high level of complexity. If on the other hand α is
too high, the system generates networks with very low complexity despite the fact that
they give quite poor performance on the training set. Figure 10.14 shows a typical GA
run with the same settings as in Figure 10.13 except that this time the complexity term
was turned off: α = 0.0. On all runs the average complexity of the best individual
found was around 80 and the average complexity of the population was about the
same. In all cases the error of the best individual as well as the mean error of the
population was very small and the networks always correctly classified the entire
training set. It is interesting to observe that in the first 30 or so generations the mean
complexity of the population actually increases from around 60 to 80. This was true
for all the runs. A higher complexity seems to benefit the performance on the training
set for the particular settings.

Several runs were performed for α = 5.0, 20.0 and 50.0. It was found that the setting
α = 5.0 produced results similar to those observed with α = 10.0 but the networks had
a somewhat higher level of complexity: on average C(x) = 25. With α = 20.0 the
average complexity was around 15. Most networks (80%) had error levels of around
3.0 but some runs produced networks with much higher errors (such as E(x) = 20) that
misclassified up to 20 training facts. Most of these networks had a very low level of
complexity. This effect was even stronger for the case α = 50.0. The average
complexity of the best individuals was 10, with some networks having as few as 4
connections. Some runs produced networks with a complexity of 11 using only one
hidden neuron that had a very similar performance on the training set as those pictured
in Figure 10.11 (E(x) ~ 3, 1 pattern misclassified).

Figure 10.14 An example of a GA run with the same settings as in Figure 10.13 but
with α = 0.0

• Larger Matrix Size


The system behaviour for a larger connectivity matrix was then tested on the same
problem. Instead of a 16x16 matrix, a matrix 'one size bigger' was used: 32x32.
Again, both the matrix grammar and the direct encoding method were used. The
matrix grammar system was configured with the following alphabet sizes: k1 = 4,
k2 = 16, k3 = 16, resulting in chromosomes with a length l = 4 + 4*4 + 4*16 + 4*16
= 148. In the direct encoding scheme the chromosome length required to represent the
upper-right half of a 32x32 matrix is l = 487.

Using a 32x32 matrix, the resulting networks had a much higher level of complexity.
On average after a run of 200 generations with the same settings as before the matrix
grammar GA system generated individuals that had a complexity of about 100 as
opposed to around 20 with the 16x16 matrix. After 500 generations this complexity
had dropped to a value of around 50. Figure 10.15 shows two GA runs with this
configuration, one with the matrix grammar scheme and one with the direct encoding
scheme.

Figure 10.15 Examples of two GA runs on the iris flower data with a matrix size of
32x32. One run is done with the matrix grammar scheme, the other one with the direct
encoding scheme. Apart from the chromosome lengths the same settings as in Table
10.2 were used

The direct encoding scheme on average generated networks with a somewhat higher
level of complexity: C(x) ~ 70 after 500 generations. The resulting connectivity
matrices needed large amounts of pruning when translated into neural network
structures for both systems. On average something like 100 entries in the matrices
needed to be removed.

Experiments were then performed with the pruning term in the fitness function 'turned
on'. A value of β = 0.01 was used. Both systems gave very similar results to the
'non-pruning' results, but the amount of final pruning was somewhat reduced to around 70.

Both systems have difficulty in minimising both the network complexity as well as the
amount of pruning that is needed for a matrix of this size. The matrix grammar scheme
is able to generate less complex networks than the direct encoding scheme.

• Effect of Weight Transmission


Runs were performed on the iris problem with the weight transmission turned off and
results were compared with the ones described earlier using the 16x16 matrix. The
parametric part of the chromosome is not passed on to the next generation in any way.
This part of the chromosome in fact no longer serves a purpose. Each time a neural
network is evaluated its weights are set to small random values uniformly distributed
on [-1,1].
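The two evaluation regimes being compared can be sketched schematically as follows; the `train` stub stands in for a back propagation cycle and all names are ours, illustrative only:

```python
import random

def train(weights):
    """Stand-in for one back propagation cycle: nudge each weight toward
    1.0 to mimic learning progress (illustrative stub, not real BP)."""
    return [w + 0.1 * (1.0 - w) for w in weights]

def evaluate(individual, weight_transmission=True, bp_cycles=1):
    """Schematic evaluation step. With weight transmission, the weights
    stored on the chromosome are trained further and written back, so
    repeated (re-)evaluations keep refining them. Without it, the
    network restarts from random weights on [-1, 1] every evaluation."""
    if weight_transmission:
        weights = individual["weights"]              # inherited, then refined
    else:
        weights = [random.uniform(-1.0, 1.0) for _ in individual["weights"]]
    for _ in range(bp_cycles):
        weights = train(weights)
    individual["weights"] = weights                  # written back to chromosome
    return weights

ind = {"weights": [0.0]}
evaluate(ind)    # first evaluation: one BP cycle
evaluate(ind)    # re-evaluation continues from the trained weights
assert abs(ind["weights"][0] - 0.19) < 1e-9
```

This makes the observation below concrete: with transmission, re-evaluation acts as extra training; without it, re-evaluation merely re-trains from scratch.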

The networks found after 200 generations had a very poor performance on the training
set. The average cumulative error was around 30, with about 8 training facts
misclassified. The level of complexity of these networks was very low: on average
about 10 connections were used and without exception the networks did not use any
hidden neurons. The networks consisted purely of straight connections between inputs
and output neurons. A typical GA run for this configuration is shown in Figure 10.16.

The results can be explained by the lack of training that a network receives in the
evaluation phase. The number of back propagation training cycles per evaluation is
only one. This number is in fact misleading since every individual is re-evaluated at
the start of a generation and, with weight transmission this re-evaluation has the same
effect as training for two back propagation cycles. Without weight transmission it
simply re-trains the network and re-evaluation does not really serve any purpose. To
obtain an accurate estimate of the network's performance on the training set,
evaluation should consist of several back propagation runs, with each one starting with
different random weights. The average error can then be used as a more accurate
estimate of the performance. Clearly, without weight transmission, one back
propagation cycle is not enough to properly test the network structures on their given
task. The GA system does well in minimising the neural network structure but does so
by severely limiting the network's performance on the training set. The balance
between complexity and performance on the training data lies strongly in favour of
reduced complexity with this configuration.

Figure 10.16 An example of a GA run with the same settings as in Figure 10.13 but
without weight transmission. When evaluated the networks are instantiated with
random weights

As a comparison some runs were performed without weight transmission but with a
much higher number of back propagation cycles per evaluation. This number was set
to 50 and typical GA runs took several hours of CPU time. Figure 10.17 shows one
such run. The networks generated had an average complexity of about 11 with one
hidden neuron. While the performance on the training set was practically identical to
the networks found with weight transmission, the resulting complexity was somewhat
lower.

While it seems that with this configuration less complex networks are found, at near
identical training errors, when compared to the situation with weight transmission,
the computational time required is far larger. The effect of weight transmission
is such that many fewer back propagation cycles are needed in order to find a 'suitable'
network structure. No exhaustive comparison was performed to determine the full
effect of chosen parameters. For example, the GA system with weight transmission
might improve if a somewhat larger number of BP cycles were used, and the GA
system without weight transmission might not need as many as 50 BP cycles for an
identical convergence.

Figure 10.17 An example of a GA run with the same settings as in Figure 10.16 but
with a much larger number of back propagation cycles: BP cycles = 50

• Radar Classification Data


Some runs were performed on the radar classification data. The matrix size used was
64 x 64. The matrix grammar system was configured with alphabet sizes k1 = 4,
k2 = k3 = k4 = 16, resulting in a chromosome length of l = 4 + 4*4 + 4*16 + 4*16 +
4*16 = 212. To encode matrices of the same size the direct encoding scheme requires
chromosomes with length l = 1814. Apart from the chromosome lengths the GA
settings were the same as in Table 10.2. The number of back propagation cycles per
evaluation is one and re-evaluation is turned on. The task can be learned very well
with a fully connected feedforward neural network of size '17-12-12'. This network of
complexity C(x) = 372 needs about 400 training cycles to learn the task within a
tolerance of 0.4. Figure 10.18 compares a run with the matrix grammar system and
one using the direct encoding scheme. Despite the fact that the complexity term has a
significant impact on the fitness function, both systems do not significantly reduce the
complexity of the networks generated during the course of a run. The complexity of
the best network in the final population is almost identical to that of the best network
in the initial population. The complexities do differ between the two schemes: the best
neural network generated using the matrix grammar scheme has a complexity of
around 730 while the direct encoding scheme generates networks with a complexity of
around 900. The same was found in general and not just in these particular runs.

Figure 10.18 Two GA runs on the radar classification data with a matrix of size
64x64, one with the matrix grammar scheme, the other one with direct encoding

The matrix grammar scheme generates networks with a better performance on the data
set, with an error of about 40 with about 40 outputs misclassified (of the total
12*240=2880 outputs of the whole training set). In contrast the direct encoding
scheme generated networks with an average error of about 120 with over 100 outputs
misclassified. These results could be described as encouraging. The error rates
remained fairly high and complexity was still excessive when compared to manually
tuned networks. However, the matrix grammar scheme in particular generated solution
networks that performed moderately well. Further work is required to identify GA
settings that can produce better solutions to this problem and to other more complex
problems.

10.8 Discussion
Two ideas have been combined in this chapter for the optimisation of feedforward
neural network structures and their weights. A structured genetic algorithm has been
developed where both the network structure and the set of weights are coded in the
chromosomes. This set of weights is passed on to the offspring by means of weight
transmission. A matrix grammar scheme has been implemented to represent neural
network structures in a concise manner. It encodes (forced) regularities in the
connectivity matrix defining the network structure. A direct encoding scheme where
each entry in the (upper-right half of the) connectivity matrix is represented by one
gene, has also been implemented and results of the two systems are compared.

On the XOR problem the 'minimal XOR network' was generated by all systems. It
was found, however, that even on a problem as relatively simple as this, the settings of
the GA system, such as the number of back propagation cycles, have a major influence
on the performance of the system. With the iris data set, low complexity neural
networks were found that were still able to perform quite well on the training and test
sets.
Differences were only observed between the performances of the matrix grammar
encoding scheme and the direct encoding scheme when a larger size 32 x 32
connectivity matrix was used. The matrix grammar scheme was able to generate less
complex (i.e. 'better') networks. Results with the larger sized radar classification data
set were less positive. Neither the direct encoding nor the matrix grammar scheme
were able to decrease the complexity of the networks over the course of a GA run,
although moderately low error rates were obtained. The matrix grammar scheme also
produced networks with a lower complexity but this complexity was still very high.
The task can be learned very well with a fully connected neural network with about
half the complexity of the generated networks.

It was observed that weight transmission ensures that many fewer training cycles are
needed in the evaluation phase of the generated neural networks when compared to
randomisation of network weights at each generation. Weight transmission does not,
however, seem to be fair to all network structures generated, since networks with a
structure similar to their parents' will be at an advantage. With or without weight
transmission, the number of training cycles used is a critical parameter and the optimal
number depends strongly on the problem. In general a small training set will require
more training cycles than a large training set, but the amount of training necessary will
depend on the difficulty of the problem as well.

The matrix grammar scheme that was implemented here provides a way to encode
neural network structures in a concise manner by encoding (and forcing) regularities
in the corresponding connectivity matrix. One of the main drawbacks of the matrix
grammar approach is the restriction to matrices of size 2^N x 2^N, which in turn
specifies the maximum number of neurons in the network. The direct encoding scheme does not
suffer from this restriction but it still requires a maximum number of neurons to be set.
For large problems, the chromosome length increases drastically using direct encoding
while it remains within reasonable length using the matrix grammar scheme. It was
found that the maximum number of neurons strongly influences the complexity of the
networks generated and both the matrix grammar scheme and especially the direct
encoding scheme have difficulty in generating 'small' networks within a large
connectivity matrix.

The fitness function was composed of a term measuring the network's performance on
the training set and one reflecting the network's complexity. The idea was to generate
low complexity networks that still perform well on the problem. It was observed that
the value of a, the measure of complexity in the fitness function, had a significant
influence on the neural networks generated. If set to a small value the complexity of
the networks is very large; if set to a large value, very small networks that perform
quite poorly on the data are generated. It is a difficult issue to find the optimal
trade-off between the error on the training set and the complexity of the network.

Future work might include investigating the effect of letting not just the structure but
the weights be subject to genetic operators. One of the main issues is the chromosomal
representation of the neural network structures. It has been shown that the matrix
grammar scheme is able to generate somewhat less complex neural networks as
compared to the direct encoding scheme. There are several drawbacks to the matrix
grammar scheme however and there may well be other representation schemes such as
graph grammars that give better results.
11. Conclusions and Future Directions
Evolutionary computation has proven to be a very useful optimisation tool in many
applications. Determining its efficacy as an optimisation algorithm for feedforward
neural network architectures and/or weights was the goal of this research.

A genetic algorithm was implemented as a weight optimisation or learning algorithm
for feedforward neural networks. Even though it has been reported in the literature
that a very similar GA produced very good results, outperforming back propagation
on certain problems, this has not been observed here. The GA was invariably
outperformed by large margins in computational effort when compared to back
propagation on the problems investigated. It must be said that none of these problems
seem to present any major difficulty for a hill-climbing algorithm like BP and it is
therefore not really surprising that it outperforms a global search algorithm such as
GA. More interesting results can be expected when problems are investigated for
which BP finds it very hard to find the global optimum. Future work in this field
should be based on the optimisation of GA parameters. Genetic operators such as
crossover and mutation and/or the representation used can be made more problem
specific. Methods that can minimise the competing conventions problem, where a
single phenotype (input-output mapping or functioning of the network) can be
represented by several genotypes (set of weights), also need further investigation.

Similar to the work in [41], it has been shown that the genetic programming paradigm
can be used in a direct encoding scheme coding both the neural network architecture
as well as its weights, to generate neural networks that can perform the XOR and the
one-bit adder task. It was found however that the GP system does not scale up well to
larger real world applications. This is mainly due to the rapidly growing chromosome
sizes for larger problems and the restrictions of this approach as described in section
8.7. The main restriction is that only tree structured network architectures can be
generated. Many problems may be very hard or even impossible to solve using a tree
structured neural network. Genetic programming provides certain advantages over
standard genetic algorithms in that the size of the chromosomal representation is not
fixed and that it provides a way to efficiently code functional subgroups that may be
called upon more than once. A graph grammar encoding scheme has been successfully
used in GP to represent boolean neural network architectures [24] and a similar


scheme may prove to be an efficient and concise way to code feedforward neural
networks in general.

A GA based matrix grammar encoding scheme was implemented and combined with
the idea of structured genetic algorithms where both the network architecture as well
as the set of weights are coded in the chromosomes. Weight values are passed on to
offspring networks by means of weight transmission. A direct encoding scheme was
also implemented where feedforward neural networks are directly represented by a
connectivity matrix. Using this direct encoding scheme, larger sized networks require
excessively large chromosomes, generally decreasing the GA performance as a
network optimiser. The matrix grammar encoding scheme encodes regularities in the
connectivity matrix resulting in chromosomes of much shorter length. Drawbacks of
both representation schemes include the need to specify a maximum number of
neurons in advance. This in turn specifies the size of the connectivity matrix. The
matrix grammar scheme poses an even more severe restriction on the matrix size since
only matrices of size 2^N x 2^N (N = 1, 2, 3, ...) are allowed. The matrix grammar scheme
is also not very 'clean' in that it may code a lot of information that is never used and
there exist numerous ways to represent one and the same network structure resulting in
a competing conventions problem similar to the one in neural network weight
optimisation. Still, good results were obtained on a neural network optimisation
problem of medium size and both the matrix grammar and the direct encoding scheme
were able to generate low complexity neural networks that performed well on training
and on test data. With increasing network size the matrix grammar scheme gave
somewhat better results. Both schemes were unable to generate low complexity
networks on a large real-world classification problem, but investigations of this
problem were limited by the excessive computational effort required and the limited
time available. The effect of weight transmission is such that it reduces the amount of
training that is necessary to evaluate the networks. However, it does impose a
restriction on the networks generated in that neural networks that bear a close
structural resemblance to their parents will be favoured.

One of the main issues in neural network architecture optimisation is the chromosomal
representation of the neural network structure. A direct encoding scheme such as
Genetic Programming for Neural Networks is commonly found to have a poor scaling
performance. Grammar encoding provides an alternative. The general idea in grammar
encoding is to use some form of repetition or modularity in the network structure so
that a representation of manageable length is achieved. Kitano's matrix grammar
scheme and the matrix grammar scheme implemented here code repeated patterns in
the connectivity matrix. Gruau's cellular encoding scheme codes cell-divisions and
connectivity mutations and uses repeated subnetworks that perform a certain function.
The latter has only been used for binary networks but it could well be that a similar
scheme may provide an efficient way to code neural network structures in general.

Finally, some comments are in order on the topic of evolutionary computation. So far
very little research has been performed on the generalisation capabilities (testing of
the solution on data outside the 'training set') of evolutionary computation
optimisation systems. The training set is meant here as the data set that is used to
evaluate the individuals on their task. Problems similar to the ones in learning
algorithms of neural networks apply: when to stop the evolutionary computation
algorithm, how to choose the training set and the problem of overfitting on the training
data.

In general it can be said that more foundational work is needed in the field of
evolutionary computation. The lack of a proper mathematical foundation results in a
trial and error based search for the optimal parameters without any formal guidelines.
Further investigation of methods for convergence analysis of GAs, using for example
Markov chains, seems likely to yield significant payoff. Techniques for visualisation in
evolutionary computation may also prove very beneficial to the field, since in general
the internal workings of the algorithms remain hidden to the user. With such
techniques it might even be possible for the user to intervene in the search and adjust
certain parameters during the run.
References and Further Reading
[1] Alba, E., Aldana, J.F., and Troya, J.M., "Genetic Algorithms as Heuristics for
Optimizing ANN Design", International Conference on Artificial Neural Nets
and Genetic Algorithms (ANNGA93), Innsbruck, Austria, pp. 683-689, 1993.

[2] Altenberg, L., "The Evolution of Evolvability in Genetic Programming",
Advances in Genetic Programming, edited by Kinnear, K.E., Jr., MIT Press,
1994.

[3] Angeline, P.J., Saunders, G. M. and Pollack, J.M., "An Evolutionary Algorithm
that Constructs Recurrent Neural Networks", IEEE Transactions on Neural
Networks, vol. 5, no. 1, 1994.

[4] Boers, E.J.W. and Kuiper, H., "Biological Metaphors and the Design of Modular
Artificial Neural Networks", Technical Report, Departments of Computer
Science and Experimental and Theoretical Psychology, Leiden University, The
Netherlands, 1992.

[5] Braun, H. and Weisbrod, J., "Evolving Neural Feedforward Networks",
International Conference on Artificial Neural Nets and Genetic Algorithms
(ANNGA93), Innsbruck, Austria, pp. 25-32, 1993.

[6] Bridges, C.L. and Goldberg, D.E., "The Nonuniform Walsh-Schema
Transform", Foundations of Genetic Algorithms, edited by Rawlins, G.J.E.,
Morgan Kaufmann Publishers, pp. 13-22, 1991.

[7] Calvin, W.H., "The Emergence of Intelligence", Scientific American, special
issue on Life in the Universe, pp. 79-85, October 1994.

[8] Cangelosi, A., Parisi, D., and Nolfi, S., "Cell Division and Migration in a
'Genotype' for Neural Networks", Network: computation in neural systems, in
press.


[9] le Cun, Y., "Generalization and Network Design Strategies", Connectionism in
Perspective, edited by Pfeifer, R., Schreter, Z., Fogelman-Soulie, F., and Steels,
L., Elsevier Science Publishers B.V. (North-Holland), pp. 143-155, 1989.

[10] Dasgupta, D. and McGregor, D.R., "sGA: A Structured Genetic Algorithm",
Technical Report IKBS-11-93, Department of Computer Science, University of
Strathclyde, Glasgow, 1993.

[11] Dasgupta, D. and McGregor, D.R., "Designing Application-Specific Neural
Networks using the Structured Genetic Algorithm", IEEE International
Workshop on Combinations of Genetic Algorithms and Neural Networks
(COGANN-92), Baltimore, pp. 87-96, 1992.

[12] De Jong, K.A., Spears, W.M. and Gordon, D.F., "Using Markov Chains to
Analyze GAFOs", Foundations of GAs Workshop, (ftp.aic.navy.mil/pub/spears/
foga94), 1994.

[13] Eberhart, R.C., "The Role of Genetic Algorithms in Neural Network Query-
Based Learning and Explanation Facilities", IEEE International Workshop on
Combinations of Genetic Algorithms and Neural Networks (COGANN-92),
Baltimore, pp. 169-183, 1992.

[14] Fogel, D.B., "An Introduction to Evolutionary Computation", Australian Journal
of Intelligent Information Processing Systems, Vol. 1, No. 2, pp. 34-42, 1994.

[15] Fogel, D.B., "An Introduction to Simulated Evolutionary Optimization", IEEE
Trans. on Neural Networks, Vol. 5, No. 1, pp. 3-14, January 1994.

[16] Fogel, D.B. and Fogel, L.J. (Guest editors), Special Issue on Evolutionary
Computation, IEEE Trans. on Neural Networks, Vol. 5, No. 1, January 1994.

[17] Fraser, A.P., "Genetic Programming in C++, A Manual for GPC++", Technical
Report 040, University of Salford, Cybernetics Research Institute, 1994.

[18] Garis, H. de, "Genetic Programming, Building Nanobrains with Genetically
Programmed Neural Network Modules", IEEE International Joint Conference
on Neural Networks, New York, vol. 3, pp. 511-516, 1990.

[19] Goldberg, D.E., Genetic Algorithms in Search, Optimization, and Machine
Learning, Addison-Wesley Publishing Company, Inc., 1989.

[20] Goldberg, D.E., "Real-coded Genetic Algorithms, Virtual Alphabets, and
Blocking", University of Illinois at Urbana-Champaign, Technical Report No.
90001, 1990.

[21] Goldberg, D.E. and Deb, K., "A Comparative Analysis of Selection Schemes
Used in Genetic Algorithms", in: Foundations of Genetic Algorithms, edited by
Rawlins, G.J.E., Morgan Kaufmann Publishers, pp. 69-93, 1991.

[22] Goldberg, D.E. and Segrest, P., "Finite Markov Chain Analysis of Genetic
Algorithms", Proceedings of the Second International Conference on Genetic
Algorithms (ICGA-87), pp. 1-8, 1987.

[23] Grefenstette, J.J., "Deception Considered Harmful", Foundations of Genetic
Algorithms 2, edited by Whitley, L.D., Morgan Kaufmann Publishers, pp. 75-91,
1993.

[24] Gruau, F., "Genetic Synthesis of Boolean Neural Networks with a Cell Rewriting
Developmental Process", IEEE International Workshop on Combinations of
Genetic Algorithms and Neural Networks (COGANN-92), Baltimore, pp. 55-74,
1992.

[25] Gruau, F., "Genetic Microprogramming of Neural Networks", Advances in
Genetic Programming, edited by Kinnear, Jr., K.E., MIT Press, 1994.

[26] Happel, B.L.M. and Murre, J.M.J., "Design and Evolution of Modular Neural
Network Architectures", Neural Networks, vol. 7, no. 6/7, pp. 985-1004, 1994.

[27] Harp, S.A. and Samad, T., "Genetic Synthesis of Neural Network Architecture",
Handbook of Genetic Algorithms, edited by Davis, L., Van Nostrand Reinhold,
pp. 202-221, 1991.

[28] Hassoun, M.H., Fundamentals of Artificial Neural Networks, MIT Press, 1995.

[29] Hibbs, R.A., "Speeding up Backpropagation: A Comparative Study", Technical
Report, Knowledge-based Engineering Systems Group, University of South
Australia, Australia, 1994.

[30] Holland, J.H., Adaptation in Natural and Artificial Systems, University of
Michigan Press, Ann Arbor, 1975.

[31] Horn, J., "Finite Markov Chain Analysis of Genetic Algorithms with Niching",
Proceedings of the Fifth International Conference on Genetic Algorithms, San
Mateo, CA, pp. 110-117, 1993.

[32] Jacob, C. and Rehder, J., "Evolution of Neural Net Architectures by a
Hierarchical Grammar-based Genetic System", International Conference on
Artificial Neural Nets and Genetic Algorithms (ANNGA93), Innsbruck, Austria,
pp. 72-79, 1993.

[33] Jain, L.C., "Hybrid Intelligent Techniques in Teaching and Research", IEEE
AES, Vol. 10, No. 3, March 1995, pp. 14-18.

[34] Jain, L.C. (Guest Editor), "Intelligent Systems: Design and Applications", Part 2,
Journal of Network and Computer Applications, Academic Press, England, Vol.
19, Issue 2, April 1996.

[35] Jain, L.C. (Guest Editor), "Intelligent Systems: Design and Applications", Part 1,
Journal of Network and Computer Applications, Academic Press, England, Vol.
19, Issue 1, January 1996.

[36] Jain, L.C. (Editor), Electronic Technology Directions Towards 2000, ETD2000,
IEEE Computer Society Press, USA (Edited Conference Proceedings), Volume
1,2, May 1995.

[37] Kinnear, K.E. Jr., "Evolving a Sort: Lessons in Genetic Programming", IEEE
International Conference on Neural Networks, vol. 2, pp. 881-888, 1993.

[38] Kitano, H., "Designing Neural Networks Using Genetic Algorithms with Graph
Generation System", Complex Systems, vol. 4, pp. 461-476, 1990.

[39] Kitano, H., "Neurogenetic Learning: An Integrated Method of Designing and
Training Neural Networks Using Genetic Algorithms", Physica D, vol. 75, pp.
225-228, 1994.

[40] Koza, J.R., Genetic Programming, On the Programming of Computers by
Means of Natural Selection, MIT Press, Cambridge, 1992.

[41] Koza, J.R. and Rice, J.P., "Genetic Generation of both the Weights and
Architecture for a Neural Network", IEEE International Joint Conference on
Neural Networks, 1991.

[42] Lewin, B., Genes IV, Oxford University Press and Cell Press, 1990.

[43] Lohmann, R., "Structure Evolution in Neural Systems", Dynamic, Genetic, and
Chaotic Programming, edited by B. Soucek and the IRIS Group, John Wiley &
Sons, Chapter 15, pp. 395-411, 1992.

[44] Lund, H.H. and Parisi, D., "Simulations with an Evolvable Fitness Formula",
Technical Report PCIA-1-94, C.N.R., Rome, 1994.

[45] Mandischer, M., "Representation and Evolution of Neural Networks",
International Conference on Artificial Neural Nets and Genetic Algorithms
(ANNGA93), Innsbruck, Austria, pp. 643-649, 1993.

[46] Maniezzo, V., "Genetic Evolution of the Topology and Weight Distribution of
Neural Networks", IEEE Transactions on Neural Networks, Vol. 5, No. 1,
January 1994.

[47] McDonnell, J.R. and Waagen, D., "Evolving Neural Network Connectivity",
IEEE International Conference on Neural Networks, San Francisco, 1993.

[48] Michalewicz, Z., Genetic Algorithms + Data Structures = Evolution Programs,
2nd extended edition, Springer-Verlag, 1994.

[49] Montana, D.J., "Automated Parameter Tuning for Interpretation of Synthetic
Images", Handbook of Genetic Algorithms, edited by Davis, L., Van Nostrand
Reinhold, pp. 202-221, 1991.

[50] Montana, D.J. and Davis, L., "Training Feedforward Neural Networks Using
Genetic Algorithms", Proceedings of the International Conference on Artificial
Intelligence, pp. 762-767, 1989.

[51] Muhlenbein, H., Schomisch, M. and Born, J., "The Parallel Genetic Algorithm
as Function Optimizer", Parallel Computing, Vol. 17, pp. 619-632, 1991.

[52] Munro, P.W., "Genetic Search for Optimal Representations in Neural
Networks", International Conference on Artificial Neural Nets and Genetic
Algorithms (ANNGA93), Innsbruck, Austria, pp. 628-634, 1993.

[53] Narasimhan, V.L. and Jain, L.C. (Editors), The Proceedings of the Australian
and New Zealand Conference on Intelligent Information Systems, IEEE Press,
1996.

[54] Nix, A.E. and Vose, M.D., "Modelling Genetic Algorithms with Markov
Chains", Annals of Mathematics and Artificial Intelligence #5, pp. 79-88, 1992.

[55] Nolfi, S. and Parisi, D., "Growing Neural Networks", Proceedings of Artificial
Life III, Santa Fe, New Mexico, 1992.

[56] Schaffer, J.D., Whitley, D. and Eshelman, L.J., "Combinations of Genetic
Algorithms and Neural Networks: A Survey of the State of the Art", IEEE
International Workshop on Combinations of Genetic Algorithms and Neural
Networks (COGANN-92), Baltimore, pp. 1-37, 1992.

[57] Schiffmann, W., Joost, M. and Werner, R., "Application of Genetic Algorithms
to the Construction of Topologies for Multilayer Perceptrons", International
Conference on Artificial Neural Nets and Genetic Algorithms (ANNGA93),
Innsbruck, Austria, pp. 676-682, 1993.

[58] Singer, M. and Berg, P., Genes & Genomes, A Changing Perspective, University
Science Books, Blackwell Scientific Publications, 1991.

[59] Soucek, B. and the IRIS Group, Dynamic, Genetic and Chaotic Programming,
John Wiley & Sons Inc., 1992.

[60] Van Rooij, A.J.F., Jain, L.C. and Johnson, R.P., "Neural Network Training
Using Genetic Algorithms", Guidance, Control and Fuzing Technology
International Meeting, 2nd TTCP, WTP-7, DSTO, Salisbury, Australia, 10-12
April, 1996.

[61] Vonk, E., Jain, L.C. and Johnson, R., "Using Genetic Algorithms with Grammar
Encoding to Generate Neural Networks", IEEE International Conference on
Neural Networks, Perth, December, 1995.

[62] Vonk, E., Jain, L.C., Veelenturf, L.P.J. and Hibbs, R., "Integrating Evolutionary
Computation with Neural Networks", Electronic Technology Directions to the
Year 2000, IEEE Computer Society Press, pp. 135-141, 1995.

[63] Vonk, E., Jain, L.C., Veelenturf, L.P.J. and Johnson, R., "Automatic Generation
of a Neural Network Architecture Using Evolutionary Computation", Electronic
Technology Directions to the Year 2000, IEEE Computer Society Press, pp. 142-
147, 1995.

[64] Whitley, D., Starkweather, T. and Bogart, C., "Genetic Algorithms and Neural
Networks: Optimizing Connections and Connectivity", Parallel Computing, vol.
14, pp. 347-361, 1990.

[65] Wright, A.H., "Genetic Algorithms for Real Parameter Optimization",
Foundations of Genetic Algorithms, edited by Rawlins, G.J.E., Morgan
Kaufmann Publishers, pp. 205-218, 1991.

[66] Zhang, B. and Muhlenbein, H., "Evolving Optimal Neural Networks Using
Genetic Algorithms with Occam's Razor", Complex Systems, vol. 7, no. 3, 1993.
Index

A
Activation Functions 6
Artificial Neural Network 3
Artificial Neuron 4
Automatically Defined Functions 105

B
Back Propagation 122
Binary Coding 83
Biological Background 43
Building Block Hypothesis 67

C
Chromosome Mutations 46
Coding 82
Creation Rules 104
Crossover Rules 64, 88, 104

D
Direct Encoding 96, 149
Dual Representation 29

E
Elitism 34
Evolutionary
- Algorithms 40
- Computation 17, 54, 91, 93
Extensions Of Genetic Algorithm 34

F
Fitness Function 81, 106
Foundations Of Genetic Algorithms 00

G
GA Software 114
Gene Mutations 48
Generation 24
Genetic Algorithms 17
Genetic Operators 148
Genetic Programming 35, 101
Genetic Structures 43
Genetically Programmed Neural Network 102
Grammar Encoding 98

H
Hybridisation Of Evolutionary Computation 91

I
Implementing GA's 79
Intertwined Spirals 111
Inversion 89

K
Kitano's Matrix Grammar 135

L
Learning Rules 9, 11

M
Markov Chain Analysis 75
Matrix Grammar 145
- Modified 137
Multiple Layer Perceptron 5, 14
Mutations 46, 66, 89

N
Natural Evolution 48
Neural Network Connections 12
Non-Homogeneous Coding 85

O
One-Bit Adder 110
Operation Of Genetic Algorithms 60
Optimisation Problem 19
Optimisation of Weights 114

P
Parallel Genetic Algorithms 33
Parametrised Encoding 98
Price's Theorem 74
Proportionate Reproduction 86

R
Real-Valued Coding 84
Reproduction 44
Roulette Wheel Reproduction 63

S
Schema Theorem 62, 66
Selection Schemes 86
Steady State Genetic Algorithms 32, 87
Switching Of Hyperplanes 69
Symbolic Coding 85

T
Tournament Selection 87
Types Of Neural Networks 9, 13

W
Walsh-Schema Transform 69
Weight Representation 135
Weight Transmission 132

X
XOR 108
Advances in Fuzzy Systems — Applications and Theory Vol. 14

AUTOMATIC GENERATION OF NEURAL NETWORK
ARCHITECTURE USING EVOLUTIONARY COMPUTATION
by E Vonk (Vrije Univ. Amsterdam), L C Jain (Univ. South Australia) &
R P Johnson (Australian Defence Sci. & Tech. Organ.)

This book describes the application of evolutionary computation in the
automatic generation of a neural network architecture. The architecture has
a significant influence on the performance of the neural network. It
is the usual practice to use trial and error to find a suitable neural network
architecture for a given problem. The process of trial and error is not only
time-consuming but may not generate an optimal network. The use of
evolutionary computation is a step towards automation in neural network
architecture generation.

An overview of the field of evolutionary computation is presented, together
with the biological background from which the field drew its inspiration. The
most commonly used approaches to a mathematical foundation of the field of
genetic algorithms are given, as well as an overview of the hybridization
between evolutionary computation and neural networks. Experiments on
the implementation of automatic neural network generation are described,
one using genetic programming and one using genetic algorithms, and the
efficacy of genetic algorithms as a learning algorithm for a feedforward
neural network is also investigated.

ISBN 981-02-3106-7