
14

Automatic Generation of
Neural Network Architecture
Using Evolutionary Computation

E. Vonk
L. C. Jain
R. P. Johnson

World Scientific

ADVANCES IN FUZZY SYSTEMS — APPLICATIONS AND THEORY

Series Editors: Kaoru Hirota (Tokyo Inst. of Tech.),

George J. Klir (Binghamton Univ.-SUNY),

Elie Sanchez (Neurinfo),

Pei-Zhuang Wang (West Texas A&M Univ.),

Ronald R. Yager (Iona College)

(Eds. P.-Z. Wang and K.-F. Loe)

Vol. 2: Industrial Applications of Fuzzy Technology in the World

(Eds. K. Hirota and M. Sugeno)

Vol. 3: Comparative Approaches to Medical Reasoning

(Eds. M. E. Cohen and D. L. Hudson)

Vol. 4: Fuzzy Logic and Soft Computing

(Eds. B. Bouchon-Meunier, R. R. Yager and L. A. Zadeh)

Vol. 5: Fuzzy Sets, Fuzzy Logic, Applications

(G. Bojadziev and M. Bojadziev)

Vol. 6: Fuzzy Sets, Fuzzy Logic, and Fuzzy Systems: Selected Papers

by Lotfi A. Zadeh

(Eds. G. J. Klir and B. Yuan)

Vol. 7: Genetic Algorithms and Fuzzy Logic Systems: Soft Computing

Perspectives

(Eds. E. Sanchez, T. Shibata and L. A. Zadeh)

Vol. 8: Foundations and Applications of Possibility Theory

(Eds. G. de Cooman, D. Ruan and E. E. Kerre)

Vol. 10: Fuzzy Algorithms: With Applications to Image Processing and

Pattern Recognition

(Z. Chi, H. Yan and T. D. Pham)

Vol. 11: Hybrid Intelligent Engineering Systems

(Eds. L. C. Jain and R. K. Jain)

Vol. 12: Fuzzy Logic for Business, Finance, and Management

(G. Bojadziev and M. Bojadziev)

Vol. 15: Fuzzy-Logic-Based Programming

(Chin-Liang Chang)

Forthcoming volumes:

Vol. 9: Fuzzy Topology

(Y. M. Liu and M. K. Luo)

Vol. 13: Fuzzy and Uncertain Object-Oriented Databases: Concepts and Models

(Ed. R. de Caluwe)

Advances in Fuzzy Systems — Applications and Theory Vol. 14

Automatic Generation of

Neural Network Architecture

Using Evolutionary Computation

E. Vonk

Vrije Univ. Amsterdam

L. C. Jain

Univ. South Australia

R. P. Johnson

Australian Defence Sci. & Tech. Organ.

World Scientific

Singapore • New Jersey • London • Hong Kong

Published by

World Scientific Publishing Co. Pte. Ltd.

P O Box 128, Farrer Road, Singapore 912805

USA office: Suite 1B, 1060 Main Street, River Edge, NJ 07661

UK office: 57 Shelton Street, Covent Garden, London WC2H 9HE

Vonk, E.

Automatic generation of neural network architecture using

evolutionary computation / E. Vonk, L. C. Jain, R. P. Johnson

p. cm. - (Advances in fuzzy systems ; vol. 14)

Includes bibliographical references and index.

ISBN 9810231067

1. Neural networks (Computer science) 2. Computer architecture.

3. Evolutionary computation. I. Jain, L. C. II. Johnson, R. P.

(Ray P.) III. Title. IV. Series.

QA76.87.V663 1997

006.3'2-dc21 97-28485

CIP

A catalogue record for this book is available from the British Library.

Reprinted 1999

All rights reserved. This book, or parts thereof, may not be reproduced in any form or by any means,

electronic or mechanical, including photocopying, recording or any information storage and retrieval

system now known or to be invented, without written permission from the Publisher.

For photocopying of material in this volume, please pay a copying fee through the Copyright

Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, USA. In this case permission to

photocopy is not required from the publisher.

Preface

This book presents our research on the application of evolutionary computation in

the automatic generation of a neural network architecture. The architecture has a

significant influence on the performance of the network. It is the usual practice to use

trial and error to find a suitable neural network architecture for a given problem. This

method is not only time consuming but may not generate an optimal network. The

use of evolutionary computation is a step towards automation in neural network

architecture generation. In this book, an overview of the field of evolutionary

computation is given, together with the biological background that inspired the field. The most commonly used approaches towards a mathematical foundation of the field of genetic algorithms are given, as well as an overview of the

hybridisations between evolutionary computation and neural networks. Experiments

concerning an implementation of automatic neural network generation using genetic

programming and one using genetic algorithms are described, and the efficacy of

genetic algorithms as a learning technique for a feedforward neural network is also

investigated.

Chapter 1 introduces the automatic generation of a feedforward neural network using evolutionary computation. Chapter

2 provides an introduction to artificial neural networks. Chapter 3 describes the

principle of operation of evolutionary computation. It includes an introduction to

genetic algorithms, genetic programming and evolutionary algorithms. Chapter 4

presents the biological background of evolutionary computation. In chapter 5, an

attempt is made to present the mathematical basis of genetic algorithms in a limited

sense. Chapter 6 presents the implementation of genetic algorithms. A brief overview

of the most commonly used genetic algorithms settings is given in this chapter.

Chapter 7 presents ways to combine neural networks with evolutionary computation.

Chapter 8 describes the use of genetic programming to generate neural networks.

Chapter 9 describes the use of genetic algorithms to optimise the weights of a neural

network. Chapter 10 presents the use of genetic algorithms with grammar encoding

schemes to generate neural network architectures. Chapter 11 provides concluding

remarks and presents future directions.


The book will prove useful for application engineers, scientists, researchers and senior undergraduate/first-year graduate students in Computer, Electrical, Electronic,

Manufacturing, Mechatronics and Mechanical Engineering, and related disciplines.

Thanks are due to Berend Jan van der Zwaag and Pieter Grimmerink for their

excellent help in the preparation of the manuscript. We are grateful to Professor

Sanchez of the University of Marseille, France, and Professor Karr of the University

of Alabama for reviewing the manuscript. Thanks are also due to Mr Chiang Yew Kee

for his excellent editorial assistance.

This work was supported by the Australian Defence Science and Technology

Organisation (contract number 340479).

E. Vonk

L. C. Jain

R. P. Johnson

Contents

PREFACE v

1. INTRODUCTION 1

2. ARTIFICIAL NEURAL NETWORKS 3

2.1 Introduction 3

2.1.1 The artificial neuron 4

2.1.2 The perceptron 5

2.1.3 Activation functions 6

2.1.4 Two layer neural network 8

2.1.5 Types of neural networks 9

2.1.6 Learning 9

2.1.7 Recall of output data from the trained network 11

2.1.8 Learning rules 11

2.1.9 Forms of neural network connections 12

2.2 Basic types of neural networks 13

2.2.1 The Multiple Layer Perceptron 14

2.3 Conclusion 16

3. EVOLUTIONARY COMPUTATION 17

3.1 Genetic Algorithms (GAs) 17

3.1.1 Example of an optimisation problem 19

3.1.2 The algorithm 22

3.1.3 Example of a generation 24

3.1.4 Dual representation and competing conventions 29

3.1.5 The Steady State Genetic Algorithm 32

3.1.6 Parallel Genetic Algorithms 33

3.1.7 Elitism 34

3.1.8 Extensions of the Standard Genetic Algorithm 34

3.2 Genetic Programming (GP) 35

3.3 Evolutionary Algorithms (EAs) 40


4.2 Reproduction 44

4.3 Mutations 46

4.3.1 Chromosome mutations 46

4.3.2 Gene mutations 48

4.4 Natural Evolution 48

4.5 Links to Evolutionary Computation 54

5.2 The Schema Theorem and the Building Block Hypothesis 62

5.2.1 The effect of roulette wheel reproduction 63

5.2.2 The effect of crossover 64

5.2.3 The effect of mutation 66

5.2.4 The effects of all genetic operators combined: The Schema Theorem 66

5.2.5 The Building Block Hypothesis 67

5.2.6 Another viewpoint: the switching of hyperplanes 69

5.2.7 The Walsh-Schema Transform 69

5.2.8 Extending the Schema Theorem to other representations 72

5.3 Criticism on the Schema Theorem and the Building Block Hypothesis 72

5.4 Price's Theorem as an alternative to the Schema Theorem 74

5.5 Markov Chain Analysis 75

6. IMPLEMENTING GAs 79

6.1 GA performance 79

6.2 Fitness function 81

6.3 Coding 82

6.3.1 Binary coding 83

6.3.2 Real-valued coding 84

6.3.3 Symbolic coding 85

6.3.4 Non-homogeneous coding 85


6.4.1 Proportionate Reproduction 86

6.4.2 Tournament Selection 87

6.4.3 Steady State Genetic Algorithms 87

6.5 Crossover, mutation and inversion 88

6.5.1 Crossover 88

6.5.2 Mutation 89

6.5.3 Inversion 89

NEURAL NETWORKS 91

7.2 Evolutionary Computing to analyse a NN 93

7.3 Evolutionary Computing to optimise a NN architecture and its weights 93

7.3.1 Direct encoding 96

7.3.2 Parametrised encoding 98

7.3.3 Grammar encoding 98

NETWORKS 101

8.2 Example of a Genetically Programmed Neural Network 102

8.3 Creation and Crossover Rules for Genetic Programming for Neural

Networks 104

8.3.1 Creation rules 104

8.3.2 Crossover rules 104

8.4 Automatically Defined Functions (ADFs) 105

8.5 Implementation of the Fitness Function 106

8.6 Experiments with Genetic Programming for Neural Networks 108

8.6.1 The XOR problem 108

8.6.2 The one-bit adder problem 110

8.6.3 The intertwined spirals problem 111

8.7 Discussion of Genetic Programming for Neural Networks 112


NETWORK 114

9.2 Set-up 117

9.3 Experiments 120

9.3.1 Data-sets 121

9.3.2 Comparing GA with Back Propagation 122

9.3.3 Results 123

9.4 Discussion 129

NEURAL NETWORKS 131

10.1.1 Weight transmission 132

10.1.2 Structural and parametric changes 134

10.1.3 Weight representation 135

10.2 Kitano's matrix grammar 135

10.3 The Modified Matrix Grammar 137

10.4 Combining Structured GAs with the Matrix Grammar 145

10.4.1 Genetic operators 148

10.4.2 Evaluation 148

10.5 Direct Encoding 149

10.6 Network pruning and reduction 151

10.7 Experiments 152

10.7.1 Set-up 152

10.7.2 Results 153

10.8 Discussion 167

INDEX 181

1. Introduction

It is often overlooked that the performance of a neural network on a certain problem

depends in the first place on the network architecture used and only in the second

place on the actual knowledge representation (i.e. values of the weights) within that

specific architecture. It can be said that the performance of a neural network depends

on three factors: the problem for which the network is going to be used or rather how

this is measured, the network structure and the set of weights. The performance of a

network is typically measured by the cumulative error of the neural network on some

test data with known target outputs, but can include computational speed and

complexity as well. This performance can be defined by an abstract quality function

Q:

Q = Q(T, S, W)

where:

Q = the type of quality function

T = the testing data (i.e. the target input/output data set)

S = the structure or architecture of the network

W = the set of weights

The quality of a network thus depends on the test data, the structure, and the weights of the neural network. This holds for any type of neural network. This book, however, only deals with feedforward neural networks that can be trained with some

type of supervised learning algorithm. An example of such a quality function Q that is

commonly used is the mean cumulative squared error on the test data set consisting of

several input / target output patterns:

Q = (1/F) Σ_{f=1}^{F} (T_f − O_f)²

where:

F = the number of test patterns or facts

O = the neural network output vector

T = the target output vector
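As an illustration, the mean cumulative squared error above can be sketched in Python; the function name and the nested-list layout of the patterns are illustrative assumptions, not from the book:

```python
def quality(outputs, targets):
    """Mean cumulative squared error Q over F test patterns.

    outputs, targets: lists of F vectors, where each row holds the
    network output vector O_f and the target output vector T_f.
    """
    total = 0.0
    for o_vec, t_vec in zip(outputs, targets):
        # Sum the squared error over each output vector.
        total += sum((t - o) ** 2 for o, t in zip(o_vec, t_vec))
    # Average over the F test patterns (facts).
    return total / len(outputs)

# Example: two single-output test patterns.
print(round(quality([[0.9], [0.2]], [[1.0], [0.0]]), 6))  # → 0.025
```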


In practice, the network structure is usually fixed a priori. The type of structure used may be based on some knowledge of the problem

domain but commonly a sufficient network structure is found by trial and error. In

many cases the structure used will be a fully connected feedforward network and the

user might try different numbers of hidden neurons to see how well the resulting

structures will fit the task. This network structure is then trained with some learning

algorithm to gain an appropriate set of weights W. The emphasis on optimising the

quality of the network, Q, is very often based on the ability of the learning algorithm

to generate an optimal set of weights, while the structure S is taken for granted or

chosen from a limited domain.

The automatic generation of a neural network structure is a useful concept, as, in many

applications, the optimal structure is not known a priori. Construction-deconstruction

algorithms can be used as an approach but they have several drawbacks. They are

usually restricted to a certain subset of network topologies and as with all hill climbing

methods they often get stuck at local optima and may therefore not reach the optimal

solution. These limitations can be overcome using evolutionary computation as an

approach to the generation of neural network structures. In order to optimise the

quality function Q the algorithm used must at least be able to change the structure S as

well as the set of weights W. In the case of feedforward neural networks, optimising

the structure S alone may be sufficient since the existing learning algorithms are such

that, given a neural network structure S, in many cases the optimal set of weights W

can quite easily be found. In many applications the test data set, T, will be stable.

However the network may have to operate in a dynamic environment where the task

description or at least the testing of the network on the task may change over time. In

such a case the algorithm must ideally be able to adapt T as well.

This book investigates the use of evolutionary computation to generate an optimal feedforward neural network structure given a fixed classification task. Its

efficacy as a learning algorithm for feedforward neural networks is also considered in

this book.

2. Artificial Neural Networks

Artificial neural networks are parallel computational models comprised of densely

interconnected adaptive processing units [36], [60]. These networks are fine-grained

parallel implementations of non-linear systems, either static or dynamic. A very

important feature is their adaptive nature where 'learning by example' replaces

'programming' in solving problems. This feature renders these computational models

very appealing in application domains where one has little or incomplete

understanding of the problem to be solved, but where training data (examples) are

available. Another key feature is the intrinsic parallel architecture that allows for fast

computation of solutions when these networks are implemented on parallel digital

computers or when implemented in customised hardware.

Artificial neural networks are viable and very important computational models for a

wide variety of problems. These include pattern classification, speech synthesis and

recognition, function approximation, image compression, associative memory,

clustering, forecasting and prediction, combinatorial optimisation, and non-linear

system modeling and control. The networks are 'neural' in the sense that they have

been inspired by neuroscience, the study of the human brain and nervous system. The

artificial neurons used are thought to be very simple models of their biological

counterpart. However, this does not mean that they are faithful models of biological

neural or cognitive phenomena, those are of a much more complex nature. In fact, the

majority of the neural networks presently used are more closely related to traditional

mathematical and/or statistical models, such as non-parametric pattern classifiers, non-linear filters and statistical regression models, than to neurobiological models.

Still, the technology of neural networks attempts to mimic nature's approach to solve

certain complex problems that are impossible to solve with the more traditional

techniques.

2.1 Introduction

This section introduces some of the concepts of neural networks. The basic

components of neural networks are discussed and some of the more common forms of

neural networks are considered.


The study of neural networks was originally undertaken in order to understand the

behaviour and structure of the biological neuron. It was soon realised how inadequate

the artificial neuron models were in comparison with the biological neuron, and as a

result some researchers in artificial neural networks decided that the name of neuron

was inappropriate and used other terms such as node rather than neuron. The use of

the term neuron is now so deeply entrenched that its continued general use seems

assured.

Another point which is sometimes confusing is that different writers use a different

numbering nomenclature for multi-layered neural networks. Some workers do not

count the input layer as one of the layers on the basis that this layer often serves only

for the input data and no processing of data occurs in it. Processing however does

occur within the input layer in some forms of artificial neural network. For the sake of

consistency we include the input layer as one of the layers when numbering the layers

of neurons.

The artificial neuron (refer to Figure 2.1) may be thought of as an attempt to model the

behaviour of the biological neuron. It is at the present time a limited approximation to

the biological neuron and it is probably not desirable to stretch the analogy too far.


The first stage is a process where the inputs x_0, x_1, ..., x_n, multiplied by their respective weights w_0, w_1, ..., w_n, are summed by the neuron. The input vector x_0, x_1, ..., x_n may be denoted by X and the weight vector w_0, w_1, ..., w_n by W. Weight w_0 forms the neuron's threshold. The resulting summation process may be shown as:

y = x_0 · w_0 + x_1 · w_1 + x_2 · w_2 + ... + x_n · w_n = X · W

The weight vector W contains the weights connecting the various parts of the network.

The memory of the neural network is stored in the values of the weights. The term

weight is used in neural network terminology and is a means of expressing the strength

of the connection between any two neurons in the neural network.

During the training phase of a neural network the values of the weights are continuously

modified by the training process until some previously agreed criteria are met.

Different types of network use different methods of making the necessary adjustments.

In order to allow for varying input conditions and their effect on the output it is

usually necessary to include a nonlinear activation function f in the neuron

arrangement. This is so that adequate levels of amplification may be used where

necessary for small input signals without running the risk of driving the output to

unacceptable limits where a large input signal is applied. Depending on the

circumstances one of a number of different activation functions may be employed for

this dynamic range control action.

Output = f(y)
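The two stages of the neuron can be sketched in Python; the input values, weights and choice of sigmoid activation here are made-up for illustration:

```python
import math

def neuron_output(x, w, f):
    """First stage: the weighted sum y = x_0*w_0 + x_1*w_1 + ... + x_n*w_n = X.W,
    with w_0 acting as the threshold weight; second stage: the activation f(y)."""
    y = sum(x_i * w_i for x_i, w_i in zip(x, w))
    return f(y)

# A hypothetical two-input neuron; x_0 = 1 feeds the threshold weight w_0.
sigmoid = lambda y: 1.0 / (1.0 + math.exp(-y))
out = neuron_output([1.0, 0.5, -0.5], [-0.2, 0.4, 0.1], sigmoid)
```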


There are a number of types of commonly used activation functions and some of these

are shown in Figure 2.3. Most activation functions are also known as threshold

functions or squashing functions. A brief description of the properties of the activation

functions shown in Figure 2.3 follows:

• Step function. The function shown in Figure 2.3a is known as the step function.

The output from this function is limited to one of two values, depending on

whether the input signal is greater or less than zero. Usually the output value

would be one for signal values greater than zero and minus one for signal values less than zero. That is:

Output = +1   y ≥ 0
Output = −1   y < 0


• Linear function. The function shown in Figure 2.3b is the only linear function in

the group of four functions shown and it has application in some specific network

nodes where dynamic range is not a consideration. The effect of this function is to

multiply by a constant factor. That is:

Output = K • y

• Ramp function. The effect of the ramp function, shown in Figure 2.3c, is to behave

as a linear function between the upper and lower limits and once these limits are

reached to behave as a step function. Another attraction is that the function may be

simply defined:

Output = K · y   lower limit < y < upper limit
Output = Max     y ≥ upper limit
Output = Min     y ≤ lower limit


• Sigmoid function. The sigmoid function is an 'S' shaped curve, as shown in Figure

2.3d. A number of mathematical expressions may be used to define an 'S' shaped

curve, but the most commonly used form is given by the expression:

f(y) = 1 / (1 + e^(−y))

The derivative of the sigmoid can be expressed in terms of the function itself, f'(y) = f(y)(1 − f(y)), which allows a simplification to be made in the neural network formulation.
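A minimal Python sketch of the four activation functions follows; the convention that the ramp's Min and Max output values coincide with its lower and upper limits is an assumption for illustration:

```python
import math

def step(y):
    """Figure 2.3a: output +1 for signals >= 0, -1 otherwise."""
    return 1.0 if y >= 0 else -1.0

def linear(y, k=1.0):
    """Figure 2.3b: multiply the signal by a constant factor K."""
    return k * y

def ramp(y, lower=-1.0, upper=1.0, k=1.0):
    """Figure 2.3c: linear between the limits, step-like outside them."""
    return max(lower, min(upper, k * y))

def sigmoid(y):
    """Figure 2.3d: the 'S' shaped curve f(y) = 1 / (1 + exp(-y))."""
    return 1.0 / (1.0 + math.exp(-y))
```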

Several perceptrons may be grouped together to form a neural network where the two

layers of neurons are fully interconnected, but there is no interconnection between

neurons in the same layer. This results in a network as shown in Figure 2.4. This

arrangement is a two layer neural network and it illustrates a common form of neural

network.


Neural networks may be classified in a number of different ways depending on their

structures and underlying principles of operation including methods used for learning

and recall. An indication of the methods of classification is given in Table 2.1 below.

The various types of learning and recall applying to some of the more common

paradigms will be explained in the following sections.

Table 2.1

                         Feedback Recall           Feed-forward Recall

Unsupervised Learning    Type A                    Type B
                         Example: Adaptive         Example: Linear
                         Resonance Theory          Associative Memory

Supervised Learning      Type C                    Type D
                         Example: Brain State      Example: Perceptron
                         in a Box

2.1.6 Learning

Before a neural network can be used it is necessary to subject it to some form of training

during which process the values of the weights in the network are adjusted to reflect

the characteristics of the input data. The learning process is one of developing a

mapping between the output data and the input data. When the network is adequately

trained, it will retrieve the correct output when a set of input data is presented to it. A

valuable property often claimed for neural networks is that of generalisation, whereby

a trained neural network is able to provide a correct matching of output data to a set of

previously unseen input data.

The ability of a network to generalise depends on both the network structure and the degree of training. In training a network the available input dataset

consists of many facts and is normally divided into two groups. One group of facts is

used as the training data and a second group is retained for checking and testing the

accuracy of the performance of the network after training. The quantity of data

available should be large enough to encompass a representative range of

circumstances which the network will encounter during service.


As indicated in Table 2.1 there are two forms of learning: supervised learning and

unsupervised learning.

• Supervised Learning

In this form of learning, a target value is included as part of each fact within the

training data. In this instance a fact incorporates all of the input data for the particular

event and the required output expected from the network for this fact. The target value

is the output value corresponding to a particular fact.

During the training process, the set of training data facts is repeatedly applied to the

network until the difference between the output results and the target values is within

the desired tolerance. When the neural network meets the error criteria on the training

facts, the previously unseen test data set of facts is applied to the neural network to

test the generalisation performance of the network.

• Unsupervised Learning

Unlike supervised learning there is no target value in this form of training. Instead, the

set of data which contains the facts is repeatedly applied to the network until a stable

network output is obtained. It has been suggested that this form of training is more

similar to the biological neuron as in the biological situation there is not normally a

target value.


The recall of output from a trained neural network is obtained by two distinct

methods.

• Feed-forward recall

When obtaining output data from a trained neural network, novel input data is

presented to the trained network and a single traverse of the network is made. The

output corresponding to the input data is then immediately available.

• Feedback recall

In this case, the applied input data is circulated in the trained neural network until

a stable condition is obtained. This is intrinsically a more lengthy process than the

corresponding feed-forward recall of data.

There are a large number of training rules that have been developed and some are

listed below. The role of the learning mechanism is to adjust the weights of the

network in response to the problem.


Rule                        Weight adjustment Δw_ij                  Comments

Hebbian                     Δw_ij = η · f(w_i · X) · x_j             η = learning rate
(enhance successful                                                  w = weight vector
connections)                                                         X = input vector

Perceptron                  Δw_ij = η · (t_i − sgn(w_i · X)) · x_j   t = target vector
(binary response, no                                                 η = learning rate
action if no error)

Delta                       Δw_ji = η · δ_pj · a_pi                  S_j = weighted sum of inputs to j
                            A: δ_pj = f'(S_j) · (t_pj − a_pj)        A: output layer error
                            B: δ_pj = f'(S_j) · Σ_k δ_pk · w_kj      B: hidden layer error

Least Mean Square           Δw_ij = η · (t_i − w_i · X) · x_j        η, t, X and w are as above
(Widrow-Hoff)

Outstar (Grossberg)         Δw_ji = η · (t_j − w_ji)

Winner Takes All            A: Δw_ij = η · (x_j − w_ij)              A: when in near neighbourhood
(nearby neurons modify      B: Δw_ij = 0                             B: when not in near neighbourhood
in a similar fashion)
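As a sketch of one of these rules, the perceptron rule can be applied repeatedly to a small set of facts; the bipolar AND task, the learning rate and the epoch count below are illustrative choices, not taken from the book:

```python
def sgn(v):
    return 1.0 if v >= 0 else -1.0

def perceptron_update(w, x, t, eta=0.1):
    """One application of the perceptron rule:
    delta_w_j = eta * (t - sgn(w . X)) * x_j  (no action if no error)."""
    error = t - sgn(sum(w_j * x_j for w_j, x_j in zip(w, x)))
    return [w_j + eta * error * x_j for w_j, x_j in zip(w, x)]

# Learn a bipolar AND of two inputs; x[0] = 1 is the threshold input.
facts = [([1, -1, -1], -1), ([1, -1, 1], -1), ([1, 1, -1], -1), ([1, 1, 1], 1)]
w = [0.0, 0.0, 0.0]
for _ in range(20):          # repeatedly apply the training facts
    for x, t in facts:
        w = perceptron_update(w, x, t)
```

Because the rule makes no adjustment when the signed response is already correct, the weights stop changing once all facts are classified correctly.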

Neurons may be arranged in many different ways, including the fully interconnected

layer (Figure 2.7), multi-layer networks (Figure 2.8), through to Adaptive Resonance Theory networks, which include complex layers with an external decision-making

structure. Some simple examples of possible forms of network connections are given

in the following figures.

2.2 Basic Types of Neural Networks

A number of neural networks are successfully used and reported in the literature. Some

common examples of different types of networks are:

• Perceptron network

• Multiple Layer Perceptron (MLP)

• Radial Basis Function network

• Kohonen's self-organising feature map

• Adaptive Resonance Theory network (ART)

• Hopfield network


• Counter-propagation network

• Cognitron & Neo-cognitron network

The multiple layer perceptron network will be described in further detail, since its

regular feedforward structure lends itself well to an investigation of genetic algorithm

techniques for network design.

2.2.1 The Multiple Layer Perceptron

The MLP network is a widely reported and used neural network. It consists of an input

layer of neurons, one or more hidden layers of neurons, and an output layer of neurons

as illustrated in the very simple structure of Figure 2.9. Each neuron calculates the

weighted sum of its inputs, and uses this sum as the input of an activation function,

which is commonly a sigmoid function. The supervised back-propagation learning

algorithm uses gradient descent search in the weight space to minimise the error

between the target output and the actual output. A large number of gradient-based

search methods are reported in the literature. The back-propagation method is chosen

due to its popularity.


The mean squared error, often called training error or network error, between the

actual output and the desired output is defined as follows:

E = (1/2) Σ_k (t_k − y_k)²     (2.1)

where

t_k = target output of the k-th neuron in the output layer
y_k = actual output of the k-th neuron in the output layer

The weight change is set proportional to the negative derivative of the error with respect to each weight:

Δw_jk = −ε · ∂E/∂w_jk     (2.2)

Convergence may be improved by introducing a momentum term μ into the learning equation (2.2), as follows:

Δw_jk(t+1) = −ε · ∂E/∂w_jk(t+1) + μ · Δw_jk(t)     (2.3)

An alternative formulation is:

Δw_jk(t+1) = −(1 − μ) · ε · ∂E/∂w_jk(t+1) + μ · Δw_jk(t)     (2.4)

The factor (1 − μ) is included so the learning rate ε does not need to be stepped down as the momentum μ is increased.
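The momentum update of equation (2.3) can be sketched on a one-dimensional toy error surface; the surface, learning rate, momentum value and step count here are illustrative assumptions, not from the book:

```python
def train_with_momentum(grad, w, eps=0.1, mu=0.9, steps=500):
    """Gradient descent with momentum, following equation (2.3):
    delta_w(t+1) = -eps * dE/dw(t+1) + mu * delta_w(t)."""
    delta_w = 0.0
    for _ in range(steps):
        delta_w = -eps * grad(w) + mu * delta_w
        w += delta_w
    return w

# Toy error surface E(w) = (w - 3)^2, so dE/dw = 2 * (w - 3); minimum at w = 3.
w_final = train_with_momentum(lambda w: 2.0 * (w - 3.0), w=0.0)
```

The momentum term carries part of the previous weight change forward, which smooths the descent and lets the update move through shallow regions of the error surface faster than plain gradient descent.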


The back-propagation algorithm, despite its simplicity and popularity, has several

drawbacks. It is slow and typically needs thousands of iterations to train a network to

solve a simple problem. The algorithm performance is also dependent on the initial

weights, and the values of μ and ε.

2.3 Conclusion

Artificial neural networks are viable and important computational models for a wide

variety of problems. It is a common practice to use trial and error to find a suitable

neural network architecture for a given problem. This trial and error method is time

consuming, and may not generate an optimum neural network structure. The learning

process whereby the network encodes information from the training process is also of

great importance in neural network performance and generalisation.

Many learning algorithms have been developed to train neural networks, but all are highly dependent on the interconnection topology.

This work reports an effort to develop a more general artificial neural network training

technique, based on genetic algorithms, that can be applied to a variety of topologies,

possibly allowing new artificial neural network structures to be investigated.

3. Evolutionary Computation

Evolutionary Computation is the name of a collection of stochastic optimisation

algorithms loosely based on concepts of biological evolutionary theory. (Some authors

prefer to use the term Evolution Programs.) These techniques are successfully used in

many applications including the optimisation of a neural network architecture. They

are based on the evolution of a population of potential solutions to a certain problem.

The population of possible solutions evolves from one generation to the next,

ultimately arriving at a satisfactory solution to the given problem. The algorithms

differ in the way new populations are generated and in the way the members are

represented within the algorithm.

There is some confusion about the grouping and naming of the various kinds of

evolutionary computations. In this report the distinction is made between three kinds

of evolutionary computations: Genetic Algorithms (GAs), Genetic Programming and

Evolutionary Algorithms. The latter can be divided into Evolution Strategies and

Evolutionary Programming.

3.1 Genetic Algorithms (GAs)

Genetic algorithms were developed by John Holland in the 1970s [30] (refer to [19], [48] for an overview) and are based on a Darwinian-type survival of the fittest strategy

with sexual reproduction, where stronger individuals in the population have a higher

chance of creating offspring. Each individual in the population represents a potential

solution to the problem that is to be solved. The individuals are represented in the

genetic algorithm by means of a linear string similar to the way genetic information is

coded in organisms as chromosomes. In GA terminology the members of a population

are therefore referred to as chromosomes. Chromosomes are made up of a set of genes

and in the traditional case of binary strings these genes are just bits. More generally

genes are referred to as characters belonging to a certain alphabet A. A chromosome can be thought of as a vector x consisting of l genes a_i:

x = (a_1, a_2, ..., a_l),  a_i ∈ A_i



l is referred to as the chromosome length. Commonly all alphabets are the same: A = A_1 = A_2 = ... = A_l, and in the case of binary genes: A = {0,1}.

The definitions of the basic terms in a genetic algorithm are given below:

chromosome = the representation of the phenotype in a form that can be used by

the genetic algorithm (generally as a linear string)

genotype = the set of parameters encoded in the chromosome

gene = the non-changeable pieces of data from which a chromosome is

made up

alphabet = the set of values a gene can take on

population = the collection of chromosomes that evolves from generation to

generation

generation = a single pass from the present population to the next one

fitness = the measure of the performance of an individual on the actual

problem

evaluation = the translation of the genotype into the phenotype and the

calculation of its fitness

The aim of the genetic algorithm is to find an individual with a maximum fitness by

means of a stochastic global search of the solution space.

In biological systems the genotype, or the total genetic package, is a structure made up of several chromosomes.

The phenotype is the actual organism formed by the interaction of the genotype with

its environment. In genetic algorithms however, an individual is usually represented

by a single chromosome and therefore the chromosome and the genotype are one and

the same. The term individual is used for a member of the population where the

genotype, x, of an individual refers to the (linear) chromosome and the phenotype, p,

to the observed structure acting as a potential solution to the problem. GAs therefore

rely on a dual representation of individuals where a mapping function is needed

between the two representations (refer to section 3.1.4): the genotype, or representation

space, and the phenotype, or problem space. The behaviour of the individual on the

task can be expressed by traits or phenes expressed in the problem space.


The fitness function assigns each individual a real positive value, f(x) ∈ ℝ+, that measures the individual's (phenotype) performance on the problem.

As an example consider the problem of optimising the weights for a simple neural

network. The neural network in question consists of 2 inputs, 2 hidden neurons and 1

output neuron. All neurons have a connection to the bias unit which has a constant

value of 1. All weights in the network are restricted to the values -1 and +1. Inputs and

outputs are also restricted to these values, and the transfer or processing function, P,

of the neurons is the threshold function on {-1,+ 1}:

P(x) = -1 if x < 0
       +1 if x ≥ 0

The aim is to get the network to perform the XOR function, whose input-output mapping is given in Table 3.1.

Table 3.1 The XOR input-output mapping

Input 1   Input 2   Output
  -1        -1        -1
  -1        +1        +1
  +1        -1        +1
  +1        +1        -1

The task is to find the set of weights such that the neural network performs the XOR

function. Figure 3.1 shows the neural network structure.


A chromosome or genotype consists of all the weights of the network, including the

bias weights. One gene of a chromosome represents a single weight-value. For the

demonstration, a simple genetic algorithm is used with binary valued chromosomes.

Thus the alphabet is {0,1}. The alphabet size or cardinality, k, therefore is two. During

the evaluation of the chromosome, a gene that has a value of 0 will be translated into a weight-value of -1, and a gene with a value of 1 into +1. The weights are numbered from 1 to 9 in

Figure 3.1 which reflects the order in which they are represented in the chromosome.

The chromosome length, /, is 9.

An example of an ordered set of weights such that the network correctly performs its

task is:

x = (1 1 1 0 0 1 1 1 0)

The phenotype of this individual can be seen as the actual neural network structure

with the values of the weights given by the set {+1, +1, +1, -1, -1, +1, +1, +1, -1}. Such a

phenotype is a potential solution of the XOR problem; in fact this phenotype is an

optimal solution to the problem.


The fitness function should reflect the individual's performance on the actual problem.

The standard genetic algorithm searches for the maximum fitness and a low

performance error should be reflected in a good performance and therefore a high

fitness value.

During evaluation the chromosome is first translated into the corresponding set of weights: (+1, +1, +1, -1, -1, +1, +1, +1, -1). Then the

performance error of the neural network on the XOR training set using this set of

weights is calculated. The corresponding fitness, f(x), is then calculated as:

f(x) = E_max - E(x)

where E(x) is the cumulative performance error on the training set and Emax is the

maximum value E(x) can obtain. E(x) is given by:

E(x) = Σ_{i=1..N_facts} Σ_{j=1..N_out} (O_ij(x) - T_ij)²

where:
N_facts = the number of training facts
N_out = the number of outputs
O_ij(x) = the j-th output of the network resulting from training fact i
T_ij = the j-th target output of training fact i

The maximum performance error, E_max, would in this case be: 4 * 2² = 16, and thus the fitness value of the chromosome in question is: f(x) = 16 - 0 = 16.
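The decoding and fitness calculation above can be sketched in code. The grouping of the nine weights (each neuron's two input weights followed by its bias weight, hidden neurons first, then the output neuron) is inferred from the worked example rather than stated here, so treat it as an assumption:

```python
def decode(chromosome):
    """Map binary genes to weight values: 0 -> -1, 1 -> +1."""
    return [1 if g == 1 else -1 for g in chromosome]

def threshold(x):
    """Transfer function P: -1 if x < 0, +1 if x >= 0."""
    return 1 if x >= 0 else -1

def network_output(weights, x1, x2):
    """2-2-1 threshold network; each neuron's genes are assumed ordered
    (input-1 weight, input-2 weight, bias weight)."""
    w = weights
    h1 = threshold(w[0] * x1 + w[1] * x2 + w[2])   # hidden neuron 1
    h2 = threshold(w[3] * x1 + w[4] * x2 + w[5])   # hidden neuron 2
    return threshold(w[6] * h1 + w[7] * h2 + w[8]) # output neuron

# The four XOR training facts: (input 1, input 2, target output).
XOR_FACTS = [(-1, -1, -1), (-1, 1, 1), (1, -1, 1), (1, 1, -1)]
E_MAX = 4 * 2 ** 2   # 4 facts, maximum squared error of 2^2 per fact

def fitness(chromosome):
    """f(x) = E_max - E(x), with E(x) the summed squared output error."""
    w = decode(chromosome)
    error = sum((network_output(w, x1, x2) - target) ** 2
                for x1, x2, target in XOR_FACTS)
    return E_MAX - error

# A chromosome whose hidden neurons compute OR and NAND and whose
# output neuron computes AND solves XOR under this ordering:
print(fitness([1, 1, 1, 0, 0, 1, 1, 1, 0]))   # -> 16
```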


The following steps describe the operation of the standard genetic algorithm.

1. generate an initial population of randomly created chromosomes

2. compute the fitness of every member of the current population

3. make an intermediate population by extracting members out of the current

population by means of the reproduction operator

4. generate the new population by applying the genetic operators (crossover,

mutation) to this intermediate population

5. if there is a member of the current population that satisfies the problem

requirements then stop, otherwise go to step 2

The reproduction (or selection) operator that is most commonly used is the Roulette

wheel method, where members of a population are extracted using a probabilistic Monte Carlo procedure based on their relative fitness. For example, a chromosome

with a fitness of 20% of the total fitness of a population will, on an average, make up

20% of the intermediate generation. Apart from the Roulette wheel method many

other selection schemes are possible. An overview is presented in a later chapter.

The heuristics of a GA are based mainly on reproduction and the crossover operator, and only to a small extent on the mutation operator. The crossover operator

exchanges parts of the chromosomes (strings) of two randomly chosen members in the

intermediate population and the newly created chromosomes are placed into the new

population. Sometimes instead of two, only one newly created chromosome is put into

the new population; the other one is discarded. The mutation operator works only on a

single chromosome and randomly alters some part of the representation string. Both

operators (and sometimes more) are applied with a certain probability. Figure 3.2

shows the flowchart of the standard genetic algorithm.

The stopping criterion is usually set to that point in time when an individual that gives

an adequate solution to the problem has been found or simply when a set number of

generations has been run. It can also be set equal to the point where the population has

converged to a single solution. A gene is said to have converged when 95% of the

population of chromosomes share the same value of that gene. Ideally, the GA will

converge to the optimal solution; sometimes however a premature convergence to a

sub-optimal solution is observed.
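The 95% convergence test described above can be written directly; the 0.95 threshold follows the definition in the text:

```python
def gene_converged(population, gene_index, threshold=0.95):
    """A gene has converged when at least `threshold` of the population
    shares the same value at that position."""
    values = [chrom[gene_index] for chrom in population]
    most_common = max(set(values), key=values.count)
    return values.count(most_common) / len(values) >= threshold

def population_converged(population):
    """The population has converged when every gene has converged."""
    return all(gene_converged(population, i)
               for i in range(len(population[0])))
```

For example, a population of 20 chromosomes in which 19 share the same value at every position counts as converged.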


Many variations of the above algorithm are possible and will be discussed in some

detail in later chapters.


The example of section 3.1.1 will be used here to clarify the working of the standard

genetic algorithm. The steps can be traced in the flowchart of Figure 3.2. The stopping

criterion in this case will be the discovery of an individual that correctly performs the XOR function, i.e. one with the maximum fitness f(x) = 16.

The initial population is filled with chromosomes that have randomly valued genes.

For binary valued chromosomes a gene can only take on values {0,1} and a

chromosome x of length l is defined by x ∈ {0,1}^l. Each gene has an equal probability

of being initialised with either of these values. An example of an initial population

with population size N = 5 is:

1 1 1 1 1 0 0 1 1
0 1 1 0 1 0 1 0 1
0 0 1 1 0 1 0 1 1
1 1 1 0 0 1 1 0 1
0 1 1 1 1 1 1 0 1

In the general case a gene a_i will be initialised from a set of values corresponding to the alphabet of the gene, A_i. For example real-valued genes can be initialised in a

certain range using a normal distribution.

For every member of the population the fitness, f(x), is now evaluated. For the

example concerned this procedure was described in section 3.1.1. This yields the

following fitness values for the initial population:

#  Chromosome           Network outputs         E(x)  f(x)
1  1 1 1 1 1 0 0 1 1    (+1, -1, -1, +1)          16     0
2  0 1 1 0 1 0 1 0 1    (+1, +1, +1, +1)           8     8
3  0 0 1 1 0 1 0 1 1    (+1, -1, +1, +1)          12     4
4  1 1 1 0 0 1 1 0 1    (-1, +1, +1, +1)           4    12
5  0 1 1 1 1 1 1 0 1    (+1, +1, -1, +1)          12     4

In most genetic algorithm systems this assessment is by far the most time consuming

activity, so care must be taken in implementing it.

Since the stopping criterion is not satisfied, a new generation is created from the

present one. First, the intermediate population is made by means of the reproduction

operator.

The reproduction or selection operator used here is the common Roulette wheel

operator. Chromosomes are selected according to their relative fitness values and are

placed into the intermediate population, also called the 'mating pool'. The Roulette

wheel selection method is therefore called a proportionate selection method. Other

selection methods will be discussed in section 6.4. The probability, p_select(x), that a chromosome x is selected is simply its relative fitness. Thus:

p_select(x) = f(x) / Σ_y f(y)

The Roulette wheel operator is best visualised by imagining a wheel where each

chromosome occupies an area that is sized according to its relative fitness:


Selection of the chromosomes can now be seen as spinning the roulette wheel. When

the wheel stops a fixed marker determines which chromosome will be selected. To

make the intermediate population the Roulette wheel is simply spun 5 times.

Since the expected number of times that chromosome x will be selected is given by E_select(x) = N * p_select(x), where N is the population size, this can be expressed as:

E_select(x) = N * f(x) / Σ_y f(y)

Table 3.3 gives the statistics of the current population. The last column shows the

actual number of times the chromosome is chosen. The intermediate population

therefore consists of two copies of chromosome 4, and one copy each of chromosomes 2, 3 and 5.

In practice this intermediate population or mating pool is normally not actually

formed. Instead the reproduction operator is used to select parents that will be subject

to the crossover operator. The reproduction operator is therefore more accurately

referred to as the selection operator.
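A Roulette wheel selection step can be sketched as follows; the wheel is spun by drawing a uniform random position and walking through the cumulative fitness:

```python
import random

def roulette_select(population, fitnesses):
    """Select one chromosome with probability proportional to its fitness."""
    total = sum(fitnesses)
    spin = random.uniform(0, total)      # position of the fixed marker
    running = 0.0
    for chrom, fit in zip(population, fitnesses):
        running += fit
        if spin <= running:
            return chrom
    return population[-1]                # guard against rounding error

def make_mating_pool(population, fitnesses):
    """Build the intermediate population by spinning the wheel N times."""
    return [roulette_select(population, fitnesses)
            for _ in range(len(population))]
```

In practice, as noted above, the pool is often not materialised; `roulette_select` is simply called whenever parents are needed.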


Table 3.3 Selection statistics of the current population

#  Chromosome           f(x)  p_select  E_select  Times selected
1  1 1 1 1 1 0 0 1 1      0      0.00      0.00        0
2  0 1 1 0 1 0 1 0 1      8      0.29      1.45        1
3  0 0 1 1 0 1 0 1 1      4      0.14      0.70        1
4  1 1 1 0 0 1 1 0 1     12      0.43      2.15        2
5  0 1 1 1 1 1 1 0 1      4      0.14      0.70        1

The next step is using the crossover or recombination operator to generate the new

population. Two chromosomes are selected randomly from the intermediate

population or mating pool and they serve as parents. There is a random chance that

any pair will produce offspring and the overall probability of mating is determined by

pc, the crossover-rate. If selected, crossover is performed and the two resulting

offspring are, after possibly having undergone mutation, inserted into the new

population. If no crossover takes place, the two offspring will be identical to their

parents. In some implementations one offspring is randomly discarded so that the

crossover operator only produces one child. If the crossover operator is not selected

the two parents (or one parent) are simply copied into the new population. Usually the

crossover-rate is set to a value around 0.8 which means that 80% of the new

population will be formed by crossover.

The crossover operators used most often are based on 1-point or 2-point crossover,

depending on the number of crossover-points selected in a chromosome. In general n-

point crossover is possible with n<l. The different versions of the crossover operator

are illustrated by applying them to the XOR weight optimisation example.

• 1-Point Crossover

After selecting the parents the crossover-site within the chromosome is randomly

selected and the substrings about the crossover-site are swapped between the two

parents. The crossover site is randomly chosen from {1 / - l ) , / being the length of the

cl iromosorr es. This process is illustrated in Table 3.4.


Table 3.4 Example of 1-point crossover (crossover site after gene 3)

Parents                  Offspring
0 0 1 | 1 0 1 0 1 1      0 0 1 0 0 1 1 0 1
1 1 1 | 0 0 1 1 0 1      1 1 1 1 0 1 0 1 1

When crossover is chosen to produce 2 offspring and the population size is uneven

(e.g. N = 5 as in the example), the last crossover operation can only result in one

offspring. An option is to randomly discard the second offspring.

• 2-Point Crossover

Table 3.5 gives an example of using 2-point crossover, where two crossover-points are

randomly selected and the substring between these two point is swapped.

Table 3.5 Example of 2-point crossover

Parents                  Offspring
0 0 1 1 0 1 | 0 | 1 1    0 0 1 1 0 1 1 1 1
1 1 1 0 0 1 | 1 | 0 1    1 1 1 0 0 1 0 0 1

Usually 2-point crossover is implemented so that the two crossover sites are chosen at

random and independent from each other. When the second crossover site lies to the

left of the first crossover site, the chromosome string is treated as being circular where

the endpoint and starting point are connected. Table 3.6 shows an example of this

situation.

Table 3.6 Example of 2-point crossover with the second site left of the first: the chromosome is treated as circular and the wrap-around segment is swapped

Parents                  Offspring
0 0 1 1 | 0 1 0 | 1 1    1 1 1 0 0 1 0 0 1
1 1 1 0 | 0 1 1 | 0 1    0 0 1 1 0 1 1 1 1

• Uniform Crossover

Another version of the crossover operator is the so-called uniform crossover. Instead of using a predefined number of crossover points, the number is chosen probabilistically. Usually a '0.5-uniform crossover' is used, meaning that every site in the chromosome has a probability of 0.5

of being a crossover site. Thus every gene in the offspring-chromosome has a 50%

chance of inheriting its value from either one of its parents. More generally, when a p_u-uniform crossover is used (p_u ∈ [0,1]), every site in the chromosome has a chance p_u of being a crossover site. Thus using chromosomes of length l, p_u-uniform crossover will result in an average of p_u * l crossover points.
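The 1-point and uniform variants can be sketched as follows; uniform crossover is shown in its common per-gene form, where each gene is swapped independently with probability p_u:

```python
import random

def one_point_crossover(parent1, parent2):
    """Swap the substrings after a random site chosen from {1, ..., l-1}."""
    site = random.randint(1, len(parent1) - 1)
    child1 = parent1[:site] + parent2[site:]
    child2 = parent2[:site] + parent1[site:]
    return child1, child2

def uniform_crossover(parent1, parent2, p_u=0.5):
    """Each site is a crossover site with probability p_u, so each gene
    of an offspring can come from either parent."""
    child1, child2 = [], []
    for g1, g2 in zip(parent1, parent2):
        if random.random() < p_u:
            g1, g2 = g2, g1              # swap the genes at this site
        child1.append(g1)
        child2.append(g2)
    return child1, child2
```

Whatever the variant, at every position the two offspring together carry exactly the genes the two parents carried there.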

• Perform Mutation

After the new offspring is formed, mutation is performed on the selected

chromosomes. Mutation is usually implemented as follows: each gene in every

chromosome may undergo mutation with a probability of pm, where pm is the mutation-

rate. The mutation-rate is usually set to a low value such as 0.001.

In our example, since genes are bits, mutation normally just inverts the value of the

gene. In the more general case mutation re-initialises the value of a gene with a

random value taken from the initial distribution or alphabet. In the case of a binary

coded chromosome re-initialising the value of a gene will result in a 50% chance of inverting it. On average, inversion-mutation at half the mutation-rate therefore has the same effect as binary re-initialising mutation. The 'inversion-mutation' will be used here. The expected number of genes altered per generation by the mutation

operator, Em, is:

E_m = l * N * p_m

Here pm is set to 0.01. Since the total number of genes in our population, l*N, is 5 * 9

= 45, a total of 45 * 0.01 = 0.45 genes are altered on average per generation.
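Inversion-mutation over binary chromosomes is a one-liner; each gene flips independently with probability p_m:

```python
import random

def mutate(chromosome, p_m=0.01):
    """Binary 'inversion' mutation: each gene flips with probability p_m."""
    return [1 - g if random.random() < p_m else g for g in chromosome]
```

With p_m = 0 the chromosome is returned unchanged, and with p_m = 1 every gene is inverted.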

• Finished Loop

The algorithm now returns to the second step where each chromosome is evaluated

and the stopping criterion is checked. The process continues until the stopping

criterion is met.
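Tying the steps together, a minimal standard (batch) GA can be sketched as follows. The OneMax task (fitness = number of 1-bits, maximum = chromosome length) stands in for a problem-specific fitness function, and the stopping criterion assumes that maximum is known:

```python
import random

def run_ga(fitness, length, pop_size=20, p_c=0.8, p_m=0.01, max_gens=200):
    """Minimal standard GA: roulette selection, 1-point crossover,
    inversion mutation. A sketch, not a tuned implementation."""
    pop = [[random.randint(0, 1) for _ in range(length)]
           for _ in range(pop_size)]
    best = max(pop, key=fitness)
    for _ in range(max_gens):
        fits = [fitness(c) for c in pop]
        total = sum(fits) or 1
        def select():
            spin, acc = random.uniform(0, total), 0.0
            for c, f in zip(pop, fits):
                acc += f
                if spin <= acc:
                    return c
            return pop[-1]
        new_pop = []
        while len(new_pop) < pop_size:
            p1, p2 = select(), select()
            if random.random() < p_c:                 # crossover
                site = random.randint(1, length - 1)
                p1, p2 = p1[:site] + p2[site:], p2[:site] + p1[site:]
            for child in (p1, p2):
                mutated = [1 - g if random.random() < p_m else g
                           for g in child]            # mutation
                new_pop.append(mutated)
        pop = new_pop[:pop_size]
        best = max(pop + [best], key=fitness)
        if fitness(best) == length:                   # stopping criterion
            break
    return best

random.seed(1)
solution = run_ga(fitness=sum, length=9)
```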

GAs rely on two separate representational spaces. One is the representation space or

genotype space, where the actual genetic operations (crossover, mutation) are


performed on the (binary) strings or genotypes. The other space is the evaluation

space or phenotype space where the actual problem-structures or phenotypes are

evaluated on their ability to perform the task and where their fitness is calculated. An

interpretation or mapping function is necessary between the two. This is visualised in

Figure 3.4. The problem space constitutes all the potential solutions relating to the

problem. The evaluation space is in general a subset of the problem space and is

dependent on the representation used. Naturally, the evaluation space should be

'chosen' in such a way that it includes the optimal structure.

The mapping function, g, translates the genotype x into the phenotype p and is dependent on the so-called epigenetic environment (EP), which is the environment in which the mapping (or development) takes place:

p = g(x, EP)


In GAs however it is commonly a deterministic function that is not dependent on its

environment. A simple one-to-one mapping between genotypes and phenotypes

however is practically never found. The fitness function / now measures the

performance of the phenotype p in the problem space (PS) by assigning it a real

positive value:

f(p, PS) → ℝ+

The performance of the genetic algorithm depends strongly on the way the problem is coded; i.e. the representation space, as well as on selection of the

fitness function. Genetic operators generally perform their task on the genotypes

without any knowledge of their interpretation in the evaluation space. This works well

as long as the interpretation function is such that the application of the genetic

operators in the representation space leads to good points in the evaluation space.

Problems occur when a structure (or several very similar structures) in the evaluation

space can be represented by very different chromosomes in the representation space.

Schaffer et al. [56] call this 'competing conventions', but it is also referred to as the

phenomenon of different structural mappings (genotypes) coding the same or very

similar functional mappings (phenotypes) [5]. Basically it means that a unimodal error

landscape becomes multimodal where each peak represents a representation

(convention) of the structure. Standard crossover between two different chromosomes

having the same convention will very likely not result in a useful offspring. This is

because knowledge about the problem space is not built into the standard crossover

operator. A problem dependent crossover operator that incorporates knowledge of the

problem space can help but may be quite difficult to implement.

To illustrate the 'competing conventions' problem consider our example again. In our

neural network the operation of the hidden neurons is not dependent on their position

in the network. Together with the corresponding weights they could be swapped, not

altering the functioning or input-output mapping of the network. This is illustrated in

Figure 3.5 where chromosomes 'ABCDEF' and 'CDABFE' correspond to neural

networks with an identical input-to-output mapping.


Several different genotypes can therefore represent a single phenotype. The phenotype in this context is the input-output mapping of the neural network.

For example when the network performs the XOR function, it can be described as AND(OR, NAND). One of the hidden neurons performs the OR function, the other the NAND, but it does

not matter which hidden neuron performs which. In our example both chromosomes (1 1 1, 0 0 1, 1 1 0) and (0 0 1, 1 1 1, 1 1 0) perform the XOR function correctly. The

first one by means of AND(OR,NAND) and the second one by AND(NAND,OR).

However, since the functioning of their hidden neurons is swapped with respect to one

another, standard crossover is not expected to yield a useful offspring. This is because

the standard crossover (and mutation) operator does not use any topological

information available in the phenotype. The two individuals suffer from 'competing

conventions' Instead of one there are two optimal solutions to the problem. This

problem increases when more than two hidden neurons are used, and is thought to be a

main source of poor GA performance on such problems.

Instead of first making an intermediate population and only then applying the genetic

operators another approach may be used where the operators are applied directly to


members of the current population [23]. These members are chosen based on their

fitness. One or more new chromosomes are then merged into the current population

taking the place of a 'doomed' chromosome. This 'doomed' chromosome is usually

chosen based on its inverse fitness. For a single generation step, this process is

repeated until the number of removed chromosomes equals the number of members in

the population. This approach is called a Steady State Genetic Algorithm as opposed

to the standard or Batch Genetic Algorithm. It requires much less memory storage as

only one population instead of two needs to be stored. A certain notion of age can be

built into the system where, for a certain number of iterations, these newly created members cannot be reselected to create new offspring.

The replacement of the doomed chromosome can be performed in several ways. It can be chosen probabilistically, based on its inverse fitness, it can be chosen randomly, or it can be chosen to be the

worst-fit chromosome in the population (ranked replacement). Furthermore the

replacement can be unconditional or conditional. In the unconditional replacement

mechanism the new chromosome always replaces the doomed chromosome, while in

conditional replacement the new replaces the doomed only if its fitness is better. If

not, the doomed stays in the population and the new chromosome is discarded.
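A single steady-state update with ranked (worst-fit) replacement can be sketched as follows; the `make_offspring` callback, which would wrap selection, crossover and mutation, is a placeholder of our own:

```python
def steady_state_step(population, fitness, make_offspring, conditional=True):
    """One steady-state update: create a single offspring and let it
    replace the 'doomed' (worst-fit) chromosome, either unconditionally
    or only if the offspring is fitter."""
    offspring = make_offspring(population)
    doomed = min(range(len(population)),
                 key=lambda i: fitness(population[i]))
    if not conditional or fitness(offspring) > fitness(population[doomed]):
        population[doomed] = offspring
    return population
```

With `conditional=False` this behaves like the unconditional ranked replacement used by Genitor-type algorithms.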

The first and most widely used genetic algorithm software package based on a Steady State Genetic Algorithm is 'Genitor', developed by Darrell Whitley.

Genitor uses a linear ranking selection method (see section 6.4) and unconditional

ranked replacement. Of the two offspring made by crossover only one is allowed to

enter the population, replacing the doomed individual; the other offspring is

discarded. This type of Steady State Genetic Algorithm is also referred to as a

'Genitor-type' genetic algorithm.

Genetic algorithms can be successfully implemented as a parallel system [51]. As

stated before, the evaluation of the chromosomes is usually by far the most time

consuming part of the algorithm. This part can be implemented in parallel, resulting in

a substantial increase of speed of the total algorithm. A second possible parallelisation

is when instead of one, multiple (sub)populations are used each performing their own

search and only occasionally interacting with each other. Biological genetic systems

are of course highly parallel.


3.1.7 Elitism

Elitism is an optional characteristic of a genetic algorithm. When used, it makes sure

that the fittest chromosome of a population is passed on to the next generation

unchanged; it can never be replaced by another chromosome. Without elitism this

chromosome may be lost. Extended forms of elitism are also possible where the best

m chromosomes of the population are retained. Simple elitism is the case where m = 1.

In effect elitism means that the number of offspring that are generated each generation

is reduced from N to N-m replacing the worst N-m individuals in the population. A

Steady State Genetic Algorithm with ranked unconditional replacement (Genitor-type)

can be seen as a GA using extended elitism with m = N - 1.
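Extended elitism amounts to a merge step between generations, which might be sketched as:

```python
def apply_elitism(old_pop, new_pop, fitness, m=1):
    """Carry the m fittest members of the old population into the new
    one unchanged, replacing its m least-fit members."""
    elites = sorted(old_pop, key=fitness, reverse=True)[:m]
    survivors = sorted(new_pop, key=fitness, reverse=True)[:len(new_pop) - m]
    return elites + survivors
```

With m = 1 this is simple elitism: the best chromosome of the old generation can never be lost.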

Various extensions have been made to overcome some of the shortcomings of the

standard genetic algorithm [59]. A few of these are briefly described here.

• Niched GAs

Niched genetic algorithms are used to preserve information across a diverse

population. The simple standard GA loses information by quite rapidly converging to

a single solution. Niched GAs however, try to maintain several sub-populations of

individuals relating to different fit solutions. They are especially useful in finding a set

of mutually supportive solutions to a problem and have been successfully used in

solving multimodal functions. They can offer a solution to the competing conventions

problem (section 3.1.4). A niche is defined as a region in the fitness landscape with a

high fitness. A niched GA tries to 'fill' each niche with a set of chromosomes in

proportion to the quality of the niche.

There are a number of mechanisms available to achieve niching. The most frequently

used is fitness sharing. Here the normal or unshared fitness of an individual is

degraded depending on the presence of nearby individuals. The distance metric often

used in binary coding is the Hamming distance between the genotypes

(chromosomes). However a distance metric in the evaluation space relating to the

phenotypes of the individuals can also be used. Fitness sharing spreads the population

out over the niches where each niche is filled according to its height. Other niching

methods include restrictive mating schemes where in general only similar

chromosomes are allowed to reproduce.
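Fitness sharing over Hamming distance can be sketched as follows; the triangular sharing function and the niche radius `sigma` are one common choice, not the only one:

```python
def hamming(a, b):
    """Hamming distance between two genotypes of equal length."""
    return sum(g1 != g2 for g1, g2 in zip(a, b))

def shared_fitness(index, population, raw_fitness, sigma=3.0):
    """Degrade an individual's fitness by its niche count: the sum of
    a sharing function over all nearby individuals (distance < sigma)."""
    x = population[index]
    niche_count = sum(max(0.0, 1.0 - hamming(x, y) / sigma)
                      for y in population)
    return raw_fitness(x) / niche_count
```

An individual always contributes 1 to its own niche count, so the shared fitness never exceeds the raw fitness; crowded niches are penalised the most.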

• Meta-Level GA

In a meta-level GA, GAs are contained within other GAs. For the simplest case of a two level GA, the top level GA calls upon the bottom level GA during evaluation. This bottom level GA can be used to optimise some sub-problem of the overall problem. A two level GA has been used where one GA is used to control the parameters (mutation rate etc.) of the other GA.

3.2 Genetic Programming (GP)

Genetic programming is a technique derived from genetic algorithms and was

developed by John Koza [40]. Genetic programming can be seen as a special kind of

genetic algorithm but differs in that it uses hierarchical genetic material whose size is

not predefined. The members of a population are tree structured programs and the

genetic operators work on the branches or single points of these trees. Originally

genetic programming was implemented in the LISP programming language, because

of its built-in tree like data structures (S-expressions), but it has been implemented in

various languages since.

For problems where the potential solutions are computer programs themselves, GP offers a much more natural chromosomal representation than a

standard linear-string GA does. A distinct advantage of GP over GA is that the size

and shape of the final solution does not need to be known in advance. The tree

structured chromosomes typically vary their size and shape over the course of a

generation. Research has shown that GP can be successfully applied to many problems

in the fields of artificial intelligence, machine learning and symbolic processing [40].

• Representation

In GP the chromosomes are made up of a set of functions and terminals connected to

each other by a hierarchical tree structure. The endpoints or leaves of the

chromosomal tree are defined by the terminals, all the other points are functions.

Typically the set of functions (denoted by 'F') includes arithmetic operations, logical

operations and problem specific operators. The terminal set (denoted by 'T') is made

up of the data inputs to the system and the numerical constants. Functions can

generally have other functions as well as terminals as their arguments and must

therefore be well-defined to handle any input combination. The number of arguments

a function has must be defined beforehand. GP incorporates 'variable selection': there is no need to specify a priori which data-inputs are going to be used. These are selected


during the run, which can be useful when it is not known in advance exactly

which data-inputs are needed in order to solve the problem. Figure 3.6 shows an

example of a very simple chromosome, made up of the functions AND and OR and

the data terminals D0 and D1. The function set could for this example be: F = {AND,

OR, XOR} and the terminal set simply: T = {D0, D1}.

In LISP such a chromosome is coded as an 'S-expression'.

• Evaluation

During evaluation of the chromosome the data inputs D0 and D1 are assigned actual

input values. The output of the chromosome (= program) is then calculated as the

value of the top-most function-point in the tree, the root, and is used in the fitness

function as a measure of the performance of the individual on the problem. In this

example of Boolean functions, the tree representation used by GP is much more

natural than a string representation used by a GA.
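As a sketch, such tree-structured chromosomes can be represented as nested tuples and evaluated bottom-up from the leaves to the root; the particular tree below is a hypothetical example, not the one of Figure 3.6:

```python
# The primitive function set F; terminals such as 'D0', 'D1' are leaves
# looked up in the current input assignment.
FUNCTIONS = {
    'AND': lambda a, b: a and b,
    'OR':  lambda a, b: a or b,
    'XOR': lambda a, b: a != b,
}

def evaluate(tree, inputs):
    """Bottom-up evaluation: the program's output is the value of the root."""
    if isinstance(tree, str):            # a terminal (leaf)
        return inputs[tree]
    func, *args = tree                   # a function point and its subtrees
    return FUNCTIONS[func](*(evaluate(a, inputs) for a in args))

# A hypothetical chromosome over F = {AND, OR, XOR} and T = {D0, D1}:
chromosome = ('OR', ('AND', 'D0', 'D1'), ('XOR', 'D0', 'D1'))
```

Crossover on this representation would swap whole subtrees between two such tuples, and mutation would re-initialise a randomly chosen point.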

As in the standard genetic algorithm paradigm, genetic programming relies mainly on

the reproduction mechanism and the crossover operator. The flowchart for the

standard genetic algorithm (Figure 3.2) also applies for genetic programming and the

same reproduction mechanisms apply. Crossover is performed on branches of trees,

which means that entire branches or subtrees are swapped between two chromosomes.

This is shown for a simple example in Figure 3.7.


Mutation re-initialises a randomly chosen point (= gene) in the tree. In general this can

be a function or a terminal. An example is shown below where a function-point is

chosen to undergo mutation.


An interesting feature within the GP paradigm which accounts for modularity is the

possibility to include the so called Automatically Defined Functions (ADFs). These

ADFs perform a subtask of the problem and can be called upon more than once. An

ADF does not have particular fixed terminals as its inputs, but instead is parametrised

by dummy variables. When an ADF is called upon from within the main program

(Koza calls this the result producing branch) of the chromosome, the dummy variables

are instantiated with specific values or terminals. The ADFs are defined in the so

called function-defining branches. The complete genetic tree that represents a certain

solution therefore consists of a result-producing and one or more function-defining

branches depending on the number of ADFs used. Figure 3.9 (with conventions as

used by Koza) shows an abstraction of the overall structure of a chromosome with two

ADFs: ADF0 and ADF1.

The names PROGN, DEFUN etc. are labels used in the actual representation of the chromosome in the GP system. An ADF is defined by its name (e.g. ADF0), by the list

of its dummy arguments (ARGO and ARG1) and by the actual function as defined in

the body. This function is just another tree structured program like the one in Figure

3.6 as is the result producing branch. When ADFs are present, the function set is


extended with the ADFs. In the example above, the function set would now be: F = {AND, OR, XOR, ADF0, ADF1}, with ADF0 being a function taking two arguments and ADF1 taking three.

As an illustration Figure 3.10 shows an example of the body of ADF0 and of the result

producing branch or main program of a chromosome.

When the chromosome is evaluated, the result-producing branch is computed, and the body of ADF0 is called upon whenever the function ADF0 is encountered. The ADF body is then instantiated with the appropriate arguments from the main program, which can be other functions or terminals, and its output is returned. This evaluation always takes place 'bottom-up': the outputs of the functions are fed from the bottom of the tree towards the top (or root).

Figure 3.10 Example of the body of ADF0 (left) and of the result-producing branch or main program of a chromosome
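The bottom-up evaluation just described can be sketched in code. The nested-tuple tree encoding, the Boolean function set, and the particular ADF0 body below are illustrative assumptions, not Koza's actual LISP representation:

```python
# Minimal sketch of bottom-up GP tree evaluation with one ADF.
# Trees are nested tuples: (function, child, ...); terminals are strings.

FUNCTIONS = {
    "AND": lambda a, b: a and b,
    "OR":  lambda a, b: a or b,
    "XOR": lambda a, b: a != b,
}

# Function-defining branch: a hypothetical body for ADF0 over the
# dummy variables ARG0 and ARG1.
ADF0_BODY = ("XOR", "ARG0", ("AND", "ARG0", "ARG1"))

def evaluate(tree, env):
    """Evaluate a tree bottom-up: children first, then the node's function."""
    if isinstance(tree, str):               # terminal or dummy variable
        return env[tree]
    op, *children = tree
    args = [evaluate(c, env) for c in children]
    if op == "ADF0":                        # instantiate the dummy variables
        return evaluate(ADF0_BODY, {"ARG0": args[0], "ARG1": args[1]})
    return FUNCTIONS[op](*args)

# Result-producing branch calling ADF0 twice with different inputs.
main = ("OR", ("ADF0", "D0", "D1"), ("ADF0", "D1", "D2"))
print(evaluate(main, {"D0": True, "D1": False, "D2": True}))   # prints True
```

Note how the same ADF body is reused with different instantiations of its dummy variables, which is exactly the modularity a fixed sub-network cannot provide.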

The genetic operators work on both branches. The idea is that GP will dynamically

evolve functions that are useful to the problem (ADFs) as well as a main program that

calls upon these functions. A parallel can be drawn here to the field of neural networks

where a certain part of the network performs a function that can be seen as a subtask

for the complete problem. The difference is that its position within the neural network

is fixed and that it is of no use to the network if it needs this same function

somewhere else but with different inputs.


As in the genetic algorithm paradigm, there exists a steady state approach to genetic

programming. Steady State Genetic Programming has proven to be advantageous over

the standard, or batch GP paradigm in certain applications [40].

3.3 Evolutionary Algorithms (EAs)

Evolutionary algorithms (see e.g. [14]) are another form of evolutionary computation, but unlike GAs (and GP), they focus on phenotypes and not on genotypes. There is

no need for a separation between the recombination space and an evaluation space as

in Figure 3.4. The genetic operators work directly on the actual structure or

phenotype. The structures used in evolutionary algorithms are representations that are

problem dependent and more natural for the task than the general representations used

for GA. The representation used is a vector of real values, identical to a real-valued

chromosome in a GA. In EAs however the real-valued numbers are seen more as traits

(or phenes) than genes. EAs therefore focus more on the behavioural link between

parents and offspring while GAs focus on the genetic link.

Extensions have been made for populations consisting of more members. Evolutionary algorithms can be divided into Evolution Strategies (ES), which focus on the behaviour of individuals, and Evolutionary Programming (EP), which focuses on the behaviour of entire species. The distinction is subtle and in general not very clear. In many cases

both terms are virtually equivalent. In EP, the only genetic operator used is the

representation-dependent mutation operator, although several different mutation

operators can be used in the same algorithm. A commonly used mutation operator just

adds a Gaussian random variable to each component of a chromosome. Because ES

deal with individuals instead of entire species, sexual operators (crossover) are

possible as well and extensions have been made to include these.

Although the vector components are normally seen more as phenes than as genes, the

term gene is used here to ease the comparison between the fields of ES and GAs. So:


(μ/ρ + λ)-ES or (μ/ρ, λ)-ES

μ = the number of parents
λ = the number of offspring
ρ = the number of parents taking part in the reproduction of offspring

During the course of a generation the μ parents initially create λ offspring by means of mutation and sometimes recombination. Then the intermediate population consisting of parents and offspring is reduced to the original size by means of a 'selection' process which simply retains the best μ individuals and discards the rest. The '+' and ',' denote the selection method used. In a (μ/ρ, λ)-ES the parents cannot be selected as members of the next generation, while in a (μ/ρ + λ)-ES system they can. The integer ρ, also called the mixing number, denotes how many parents mix their genes during the creation of offspring. In the case ρ = 2, two parents mix their genes by means of a crossover operator to produce offspring (typically one). The offspring are then mutated. In the absence of crossover (mutation only) ρ = 1. The first systems developed were (1 + 1)-ESs where a single parent produces a single offspring that replaces it if it is better and is discarded otherwise. Multimembered ESs were developed later including the addition of crossover operators. There is no selective pressure in a multimembered ES; every individual has an equal chance of producing offspring.
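One generation of this cycle can be sketched as follows. The one-dimensional search space, the fixed mutation step size, and the fitness function below are illustrative assumptions, and only mutation is used (ρ = 1):

```python
import random

def es_generation(parents, fitness, lam, plus=True):
    """One (mu+lambda)- or (mu,lambda)-ES generation, mutation only (rho = 1).
    For ',' selection, lam must be at least the number of parents."""
    mu = len(parents)
    # Each offspring is a mutated copy of a randomly chosen parent.
    offspring = [p + random.gauss(0.0, 1.0)
                 for p in (random.choice(parents) for _ in range(lam))]
    # '+' selection: parents compete with offspring; ',' selection: they do not.
    pool = parents + offspring if plus else offspring
    return sorted(pool, key=fitness, reverse=True)[:mu]

# Illustrative fitness: maximise -(x - 3)^2, optimum at x = 3.
random.seed(1)
fit = lambda x: -(x - 3.0) ** 2
pop = [random.uniform(-10, 10) for _ in range(5)]
for _ in range(100):
    pop = es_generation(pop, fit, lam=20, plus=True)
print(round(pop[0], 1))    # best individual, close to 3.0
```

With `plus=True` the best solution found so far can never be lost, which is exactly the difference the '+' in the notation expresses.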

An important feature of ESs is that the range of mutations, the stepsize, is not fixed but

inherited. It is unique to an individual and generally different for each gene. An

individual is represented by the pair of vectors v:

v = (x, σ)

x denotes a point in the search space consisting of l genes and σ is a vector of the same length consisting of standard deviations, one for each gene. Mutation creates a new offspring x' from x by adding to it a Gaussian number with mean 0 and standard deviation σ:

x' = x + N(0, σ)


Although not present in the earliest models, σ is normally adapted during the mutation process as well. A commonly used method is:

σ' = σ · exp(N(0, Δσ))

where Δσ is a system parameter. A commonly used crossover operator creates a single offspring (x, σ) from two parents (x¹, σ¹) and (x², σ²) by randomly mixing their genes (as uniform crossover in GAs) and their step sizes. Mutation is performed after this to complete the process.
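The self-adaptive mutation and the mixing crossover can be sketched as below. The Δσ value, the choice to adapt σ before sampling x, and the use of a single mask for both genes and step sizes are assumptions made for illustration:

```python
import math
import random

def mutate(x, sigma, d_sigma=0.2):
    """Self-adaptive ES mutation: each gene carries its own step size.
    The step sizes are adapted first: sigma' = sigma * exp(N(0, d_sigma)),
    then the genes are perturbed: x' = x + N(0, sigma')."""
    new_sigma = [s * math.exp(random.gauss(0.0, d_sigma)) for s in sigma]
    new_x = [xi + random.gauss(0.0, si) for xi, si in zip(x, new_sigma)]
    return new_x, new_sigma

def crossover(p1, p2):
    """Uniform crossover mixing the genes and step sizes of two parents.
    One mask is used for both, so a gene travels with its step size."""
    (x1, s1), (x2, s2) = p1, p2
    picks = [random.random() < 0.5 for _ in x1]
    x = [a if t else b for t, a, b in zip(picks, x1, x2)]
    s = [a if t else b for t, a, b in zip(picks, s1, s2)]
    return x, s

parent1 = ([0.0, 1.0, 2.0], [1.0, 0.5, 0.1])
parent2 = ([5.0, 5.0, 5.0], [0.2, 0.2, 0.2])
child = mutate(*crossover(parent1, parent2))   # crossover, then mutation
```

Because the step sizes are themselves inherited and mutated, genes whose large steps keep producing poor offspring tend to end up with smaller step sizes over the generations.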

Many extensions and alterations have been made on the basic ES scheme described

here. It is interesting to note that although the fields of GAs and ESs vary in a number

of ways, quite a few ideas are being taken from one field and implemented in the

other. Examples are the introduction of the crossover operator in ES and real-valued

instead of binary encoding with 'creeping' or additive mutations in GAs. Also the idea

of adaptive parameters, especially the mutation rate, has received a lot of attention in

the GA community lately.

4. The Biological Background

Since practically all ideas and certainly most of the nomenclature in the field of

evolutionary computation are taken from its biological counterpart, a brief

introduction of genetics [42],[58] is presented in this chapter together with an

overview of the main concepts of Darwinian evolutionary theory. First, the genetic

structures as observed in nature are described. Second, the actual process of

reproduction and the occurrence of mutations are then dealt with. Third, the process of natural evolution is described in section 4.4 in terms of present-day evolutionary theories.

Fourth, the link is made between this biological background and the field of

evolutionary computation (focused on genetic algorithms).

In cells the information which determines their function is carried in chromosomes. A

position on a chromosome is called a locus, which can be thought of as a box, and it is

taken up by two genes. Genes are therefore the structures from which a chromosome

is made. There are a multitude of different sets of genes, each of them specific to one

locus. The set of genes which relate to a specific locus are called alleles. A locus can

only contain two of these, so a locus A has a set (a₁, a₂, ..., aₖ) of alleles from which two can be chosen. If there are two alleles to choose from (k = 2) a locus is said to be diallelic and if there are more, multiallelic.

A chromosome consists of two strands called chromatids joined together at one point

by a centromere. Chemically the genetic information in a chromosome is carried by

the nucleic acids DNA and RNA.

All cells in an organism are identical in their chromosomal content. There is thought

to be some switching mechanism which, together with the position of a cell in the

organism, determines which genes become operative and which do not. This in turn

determines the specialisation of a cell; i.e. if a cell operates as a liver cell or a skin

cell.


A genotype of an individual at a single locus is the pair of genes contained in it. When

the two genes are identical (i.e. aᵢaⱼ, i = j) the genotype is said to be homozygote, if not (aᵢaⱼ, i ≠ j) heterozygote. The complete genotype of an individual, or the total genetic

package, is the set of all genotypes over all loci or the totality of all chromosomes.

This is also known as the genome of an organism. What is actually observed is the

phenotype and it is formed by the interaction of a genotype with its environment.

Different genotypes may result in the same phenotype. A single characteristic

observed in an organism, such as eye-colour, is referred to as a trait.

In the case of a diallelic locus one of the alleles, a₁, is often a dominant gene and the other, a₂, a recessive one. The dominant gene is always expressed in the phenotype, the recessive gene only in the absence of the dominant one. So in genotypes a₁a₁, a₁a₂, and a₂a₁, a₁ will be expressed in the phenotype. Only in a genotype a₂a₂ will the recessive gene a₂ be expressed. In the absence of dominant and recessive genes both genes will be expressed in the phenotype resulting in a mixture of the influence of each. This is called partial dominance and can often be seen for example in skin colour.
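A minimal sketch of this expression rule, using the hypothetical alleles 'A' (dominant) and 'a' (recessive):

```python
def express(genotype, dominant="A", recessive="a"):
    """Phenotype at a diallelic locus: the dominant allele is expressed
    whenever it is present; the recessive one only in its absence."""
    return dominant if dominant in genotype else recessive

# Only the homozygote recessive genotype expresses the recessive allele.
for g in ["AA", "Aa", "aA", "aa"]:
    print(g, "->", express(g))   # A, A, A, a
```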

• Epistasis

The way a gene is expressed in the phenotype or whether it is expressed at all often

depends on the presence or absence of another gene. When there is such an interaction

between genes in the expression of the genotype, it is called epistasis. The most

common form of epistasis is the masking effect. This means that a gene acts as a mask

for one or more other genes. When the masking gene is present in the chromosome it

completely 'turns off' this set of genes; i.e. these genes are not expressed in the

phenotype. In the absence of the masking gene they are.

4.2 Reproduction

In organisms there are two reproductive methods by which cells divide to form new

cells. The first kind is mitosis, where the parent cell simply divides itself in two cells

identical to the parent. This is the main method by which organisms produce new cells

in order to grow larger. It is also part of asexual reproduction as used by simple

organisms. The second one is meiosis, or 'reduction division', and is used for sexual


reproduction. Meiosis produces four cells from one parent cell. In sexual reproduction

special reproductive cells called gametes are used.

When two organisms perform sexual reproduction, each of them produces gametes

(the sperm of the male and the egg of the female) by means of meiosis. Normal cells

in an organism carry pairs of chromosomes of each type and are said to be diploid.

The two chromosomes in a pair are called homologous chromosomes. A gamete

carries only one such set of chromosomes and is said to be haploid. Thus a haploid

cell contains half the number of chromosomes of a diploid cell. Also, in a gamete,

instead of two genes at every locus there is only one gene. Each of the two genes of a

locus of a cell before meiosis has a chance of 50% of ending up at the locus of the

gamete; this process is called segregation. This is Mendel's First Law, which says that

characteristics of organisms are carried in pairs and only one of each pair can be

carried by a gamete, each having equal chance of ending up in the gamete. The second

stage of sexual reproduction is fertilisation where the gametes of the male and female

unite to form one new cell called a zygote, restoring the original count of

chromosomes and again having two genes at each locus.

Mendel's Second Law deals with the inheritance of more than one different trait. It says that during the formation of gametes, allelic pairs specifying different traits segregate independently from each other. For example when an organism contains allelic pairs a₁a₂ for trait a and b₁b₂ for trait b, the gametes may contain a₁b₁, a₁b₂, a₂b₁, or a₂b₂. This second law however only holds under certain restrictions.

A distinction is made between autosomes, which determine the characteristics of the individual, and sex chromosomes, determining the sex of the organism. Often a trait is not expressed by a single gene but by several. Also a single gene often has more than one effect on the phenotype, and is then called pleiotropic.

Another phenomenon found in reproduction is gene linkage. It is found that during the

formation of gametes, alleles associated with the same chromosome remain together in

the offspring. For example alleles such as a₁b₁ or a₂b₂ may be linked in the offspring

forming a linkage group.


4.3 Mutations

Apart from the normal processes described above, comparatively rare events called

mutations can occur. A mutation is a change in a chromosome which may result in a

change in the characteristic of a cell or an organism. A mutated individual is called a

mutant. Most often mutations are harmful to the cell or organism resulting in disease

or even death. When they are beneficial however, they have great effect, providing a

basis for variation between and within a species. This ensures that species can adapt

to changing environments. Mutations can be divided into chromosome mutations and

gene mutations.

This section briefly describes a number of chromosome mutations.

One form of a chromosome mutation occurs during meiosis when the zygote ends up

with an abnormal number of chromosomes. This usually results in death of the

organism; one exception being Down's syndrome or mongolism in Man.

• Recombination

Another form of chromosome mutation that also occurs during meiosis is

recombination. During meiosis the homologous chromosomes are intimately

intertwined and various types of mixing of chromosomes can occur when they wrap

around each other. This type of general or homologous recombination is also known

as crossover.

Points of attachment in a chromosome are called chiasmata and define the points

where a chromosome might break and rejoin with the homologous chromosome next

to it. A single crossover involves the swapping of the parts of two chromosomes at a

single chiasma. Double or triple crossover can occur when chromosome parts are

swapped at more than one place. The probability that two different linked alleles cross

over together (i.e. end up in the same offspring) is a function of how close they are

together on the chromosome. The closer they are together, the higher the frequency.


ABC|DEF        ABCdef
           =>
abc|def        abcDEF

When the two chromosomes break at different positions, it is called unequal crossover. This results in two chromosomes of unequal

length.

ABA|BCD        ABABABCD
           =>
A|BABCD        ABCD

Unlike the above cases certain recombinations are non-reciprocal and only one of the

offspring is changed by crossover while the other remains unaffected. This is referred

to as gene conversion.

ABC|DEF        ABCdef
           =>
abc|def        abcdef

Still other forms of recombination are possible, often resulting in more subtle changes

in the DNA structure of the chromosomes.
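The reciprocal recombinations above can be reproduced on letter strings; the function names and the string encoding are illustrative:

```python
def single_crossover(c1, c2, point):
    """Reciprocal single crossover: swap the tails at one chiasma."""
    return c1[:point] + c2[point:], c2[:point] + c1[point:]

def unequal_crossover(c1, c2, p1, p2):
    """Unequal crossover: the two chromosomes break at different positions,
    producing offspring of unequal length."""
    return c1[:p1] + c2[p2:], c2[:p2] + c1[p1:]

print(single_crossover("ABCDEF", "abcdef", 3))      # ('ABCdef', 'abcDEF')
print(unequal_crossover("ABABCD", "ABABCD", 3, 1))  # ('ABABABCD', 'ABCD')
```

The second call reproduces the unequal-crossover diagram: identical parents of length six yield offspring of lengths eight and four.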

• Inversion

Inversion occurs when a chromosome section breaks off and the broken part turns and

rejoins the rest of the chromosome resulting in a reverse order of the genes in that

section.

• Deletion

Deletion is the phenomenon where a chromosome section breaks off and is omitted

from the chromosome altogether. The two loose ends of the chromosome then join up

resulting in a shorter chromosome.


• Translocation

When crossover occurs between two non-homologous chromosomes, this is called

translocation. This phenomenon is also known as non-homologous recombination.

• Polyploidy

Occasionally, because of an erroneous meiosis, a diploid gamete is produced instead

of a normal haploid one. When this gamete is united with a normal haploid gamete

during fertilisation, the resulting zygote will have three sets of chromosomes instead

of two and is called triploid. If two of those abnormal diploid gametes unite, the result

is a tetraploid zygote. This phenomenon is called polyploidy and although rare in

animals, it is quite frequently found amongst plants and can actually be beneficial for

the organism.

Gene mutations (also called point mutations) are confined to a change in a single gene

only and are the result of a chemical change in the structure of the gene. They are

thought to play the most important part in contributing evolutionary changes to

organisms. Since most of the DNA code seems to be redundant (so called 'genetic

garbage') in most cases mutations within a gene do not have any effect on the

phenotype at all: the mutations are neutral. When they do however the effects can be

enormous. Most mutations are deleterious for the organism, in the case of a lethal

gene even resulting in death. The small percentage of mutations that are beneficial

provide an increased fitness to the organism and their influence can spread throughout

the population.

4.4 Natural Evolution

Some comments regarding theories on natural evolution are presented in this section.

The most influential theory on evolution was proposed by Charles Darwin in the last

century, and his main ideas form the basis of most present-day evolutionary theories.

First, the term evolution itself must be clarified. As defined in the field of biology

evolution is: a change in the gene pool of a population over time. It is therefore a population-level phenomenon and basically says that organisms evolve from common

ancestors. Biologists and, in fact, the vast majority of the scientific community, treat

evolution simply as a fact, bearing in mind that in pure science even facts are not

100% provable. Most biologists also treat it as a fact that all modern life originates from


a single common ancestor. In everyday use the term evolution is often confused with a

specific evolutionary theory such as the one proposed by Darwin, which tries to

explain how evolution actually works.

While the kind of evolution described by Darwin normally takes place over very long

time spans and observations of it are based on fossil records, evolution can be and has been directly observed within a span of only several years. For this reason the

distinction between microevolution and macroevolution is often made. While some

biologists feel the mechanisms of both are different, most simply treat macroevolution

as a long cumulative series of microevolutions.

Evolutionary mechanisms can basically be grouped into two categories: those that

increase genetic variation and those that decrease it. The mechanisms that increase

variation are the mutations occurring during reproduction as described in the last

section as well as a concept called gene flow. Gene flow simply means that new

genetic information is introduced into the population by migration from another

population. It occurs when two more or less related organisms from different populations mate. The mechanisms decreasing genetic variation are natural selection and genetic drift, and these are now described in more detail.

• Natural Selection

In Darwinian evolutionary theory natural selection is seen as the creative force of

evolution. When supplied with genetic variation it makes sure that sexually

reproducing species can adapt to changing environments. In the course of evolution

natural selection preserves the favourable part of the variation within a species. It

often does this by letting the fittest individuals of a species produce the most offspring

for the next generation. It provides a selective pressure that favours the fitter

individuals of a population. The theory of natural selection is therefore often referred

to as the survival of the fittest. This term is misleading for a number of reasons.

Reproduction ('survival') of the organism itself is not the driving force of natural

selection. The driving force is the contribution of the organism's alleles to the next

generation's gene pool. Natural selection favours selfish behaviour but does so more

at the level of genes than at the level of organisms. For example it can be beneficial

for an organism to help other organisms reproduce that are closely related to it, i.e.

share many of the same alleles, sometimes even sacrificing its own chances of

reproduction or even its own life. For this reason fitness is often split into two


components: direct fitness, which is a measure of how many alleles the organism can

enter into the next generation's gene pool by reproduction of itself, and indirect

fitness, which measures how many alleles identical to its own but belonging to other

organisms it helps enter the gene pool. Natural selection works in such a way as to

increase the combination: the inclusive fitness.

Another point against the term "survival of the fittest" is that survival is only one

component of selection. Another one, often even more dominant, is sexual selection.

In many species males have to compete against each other for mates. This competition

can be physical or it can be ruled by female choice. In the latter case organisms evolve

traits, 'status symbols', which are favoured by females for sexual selection. In some

species where very few males monopolise all females, many males live to

reproductive age but very few of them ever mate. While they perhaps do not differ in

their ability to survive, they do differ in their ability to attract mates. The fitness of an

organism is therefore not just a measure of its physical abilities, it is often much more

a measure of its sexual attractiveness.

For natural selection to be a creative force, the genetic variation must be random and

its effect relatively small. This is the case in present evolutionary theories. A

fundamental concept of Darwinism often not understood is that evolution has no

direction and that there really is no sense of progress where certain organisms are

'better' than others. Organisms just become better adapted to their environments. The

changes made may in fact prove harmful when the environment changes. A related

popular notion is that natural selection favours organisms with a high level of

complexity resulting in an 'evolutionary ladder' from simple one-celled organisms to

the ultimate creation: man. In fact by far the most successful species in the past and

present are the simplest of them all: bacteria, whose existence is incidentally crucial to

our own. From evolutionary theory it should be concluded that the evolution of

mankind is nothing more than a lucky outcome of thousands of linked events and by

no means inevitable.

• Genetic Drift

Even without a selective pressure contributed by the mechanism of natural selection,

there is a mechanism at work that decreases genetic variation. If it were the case that

each organism had an equal chance of producing offspring (i.e. no selective pressure)

and there was no mechanism for introducing variation, the genetic variation would still decrease by means of genetic drift. Genetic drift is simply the binomial sampling error

of the gene pool. The organisms that reproduce increase the frequency of their alleles


over the population. In the next generation the frequency of these alleles is expected to

increase even more simply because there is a larger chance that an organism

possessing them is chosen to reproduce. Without mechanisms to introduce variation,

the effect of genetic drift (with or without natural selection) would ultimately be a complete lack of genetic variation in the gene pool.
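Genetic drift as binomial sampling error is easy to simulate; the population size and generation count below are arbitrary choices for illustration:

```python
import random

def drift(freq, pop_size, generations):
    """Genetic drift as binomial sampling: each generation the new allele
    frequency is obtained by sampling 2N gene copies from the current pool."""
    for _ in range(generations):
        copies = 2 * pop_size
        freq = sum(random.random() < freq for _ in range(copies)) / copies
        if freq in (0.0, 1.0):     # allele fixed or lost: variation is gone
            break
    return freq

random.seed(2)
# A small population drifts to fixation even without any selective pressure.
print(drift(0.5, pop_size=20, generations=5000))   # prints 0.0 or 1.0
```

Starting from two equally frequent alleles and no selection, the frequency wanders randomly until one allele is fixed and the other lost, which is the complete loss of variation described above.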

• Preadaptation

One of the main difficulties for evolutionary theories is to explain how complex

structures in organisms evolved from scratch. For example it can be very beneficial

for an organism to have an eye, but since evolution works in small steps, how

beneficial can it be to have say 5% of an eye? This is usually explained by the concept

of preadaptation. Preadaptation states that a structure in an organism can change its

function radically while its form remains approximately the same; i.e. functional

change in structural continuity. In the first steps towards the evolution of an eye, the

structure serves a different purpose than vision. This purpose has to be beneficial to the organism for it to be rewarded by natural selection.

• Sexual Reproduction

Simple organisms in relatively stable environments often reproduce asexually.

Asexual reproduction produces offspring that are very similar to their ancestors. When

the environment changes though, such an organism finds it very hard to change in

time. Major evolutionary change can only occur when there is a large store of genetic

variability available. This is the case for sexually reproducing organisms, where

natural selection is used to maintain the most favourable genetic variants of a large

genetic pool.

• Niching

Biological systems use a restrictive mating scheme to encourage the formation of

species: speciation. Only organisms in the same niche can mate with each other. A

group of organisms within a species is called a population of organisms. When a

population differs to a certain extent from the rest of the species it forms its own niche

and can ultimately form a new species.

There is a distinction between characteristics that an organism has inherited from its

ancestors, i.e. by means of evolution, and that which it has learned during its lifetime:

individual learning. Classic genetic theory dating from Mendel stated that anything


that an organism learned in its lifetime was not physically passed on to future

generations [58]. The opposite view was Lamarckian inheritance where learned

characteristics are passed on to offspring. Lamarckism is not widely accepted.

A compromise between these views is accredited to the Baldwin Effect, introduced by James M. Baldwin late last century.

The Baldwin Effect rejects Lamarckian inheritance and postulates that instead of

passing on learned information, the criteria for fitness of the organisms changed. This

way individual learning does change the path of evolution. Baldwin further proposed

that during evolution these learned behaviours eventually become instinctive

behaviours of the organisms.

A common illustration of the Baldwin Effect involves tree-jumping squirrels. Suppose one group of squirrels (a population) learns to jump from

tree to tree while another group does not. The group that did learn this characteristic is

rewarded by evolution for things that help the squirrels in this task: e.g. developing

webbing between the toes. The fitness of the squirrels is changed in that the ability for

tree-jumping is now rewarded. The initial development of webbing between toes is

just caused by genetic variation during reproduction. Over several generations the

population of squirrels that learned to jump trees now becomes born to jump trees and

tree jumping becomes an instinct. In short the Baldwin Effect says that the fitness

function is changed by individual learning.

• Optimisation

Natural selection does not necessarily have the effect of producing optimal structures

or behaviours. For one thing it acts on the organism as a whole, not on specific traits.

There is only one fitness measure (the inclusive fitness) that is influenced by many

factors. Many species are stuck in so called local optima simply because the transition

from this local optimum to a global optimum (assuming there actually is one) is very

unlikely. This transition would normally involve having to pass through less adaptive

states. Natural selection does not cater to this; the only way the species can reach a

state with a higher fitness is by a lucky variation (mutation) or combinations of these.

Since environments are generally non-stationary, even being in a very fit state does

not mean the species will continue to thrive in the future. In fact when a species has

specialised itself to function perfectly in a certain environment, it is likely to find

difficulties in adapting if this environment happens to change. Natural selection has no

mechanism that provides future planning. It is a purely local mechanism.


The main themes of the evolutionary theory as described above are commonly

referred to as Darwinism. There is strong evidence that evolution is not the only

Darwinian process found in nature. On a much smaller time scale of days to weeks a

similar process as observed in evolution seems to take place when the immune system

of an organism produces antibodies to a virus infection. Through a series of cellular

generations the immune system evolves antibodies that become better and better

('fitter') in defending the organism against the invaders. It is even postulated [7] that

mental activity in the human brain is governed by a Darwinian process where ideas

'compete' with each other resulting in the 'fittest' idea actually being passed on. Thus

intelligent (and creative) solutions to problems can be generated by a Darwinian

process operating on a very small time scale. An example of this would be when

someone throws a stone at some distant object. The idea is that the brain makes a

model of this situation where potential 'solutions' are being tested in a matter of

milliseconds. The fittest one of these determines the course of action.

To abstract from the special Darwinian theories described in this section the concept

of a minimal Darwin Machine is introduced [7]. A Darwin Machine, being a system ruled by a Darwinian process, must have the six essential properties listed in Table 4.1.

For each property the corresponding occurrence in Darwinian evolutionary theory is

given.

Darwinian evolutionary theory is such a process. The competition between patterns/organisms for

some limited resource, such as territory or food, is generally thought to be a main

force behind the evolution of complexity and diversity. In order to remain competitive

against rivals, patterns must often develop complex traits different from existing ones.


Table 4.1 The requirements for a minimal Darwin Machine illustrated by their occurrence in Darwinian evolutionary theory.

1. The system operates on patterns of some sort. In evolutionary theory: the genotypes are chromosomes consisting of a string of genes.

2. Copies are made of these patterns. In evolutionary theory: genotypes are copied by means of sexual or asexual reproduction.

3. During the copying variations occur. In evolutionary theory: genetic mutations occur during reproduction, for example by means of crossover or gene mutations.

4. The various patterns compete with each other for a limited territory/workspace. In evolutionary theory: competition occurs because there is only room (and food) for so many organisms in a certain area.

5. The selective success of a pattern is biased by its environment. In evolutionary theory: natural selection works on the relative fitness of an organism, which depends on the complete state of the system (i.e. the organism and its environment).

6. Copying only occurs after a certain amount of differential success. In evolutionary theory: only organisms fit enough survive until reproductive age and are able to produce offspring, often depending on sexual attractiveness. Differential success is measured by the inclusive fitness of an organism.

4.5 Links to Evolutionary Computation

Evolutionary computation and in particular genetic algorithms have exploited ideas

from Darwinian evolutionary theory together with a genetic representation similar to

the one found in nature. This section presents some of the main parallels and

differences between biological systems and genetic algorithms.

• Genetic Representation

The string representation of chromosomes in GAs is comparable to the ones found in

real life. However, nearly all evolutionary computation algorithms so far have been

limited to haploid chromosomes, where each locus can only contain one gene. While

4.5 Links to Evolutionary Computation 55

in biological systems this is true for gametes during reproduction, normal cells always

have a pair of genes contained in one locus. This feature allows organisms to adapt

more quickly to changing environments and is especially useful if the organism is

required to switch between two environment states. Also a population of organisms

that have diploid chromosomes can contain a much larger genetic variability than

organisms with haploid chromosomes.

Lately the representation used in GAs tends to be more problem specific and no longer

limited to the classic genetic string. Genetic programming of course with its tree

structured chromosomes uses a representation quite different from the one found in

nature.

In GAs the mapping from the genotype into the phenotype by means of the

interpretation function is almost always completely deterministic. The phenotype only

depends on the genotype and there is no stochastic element involved. This is not true

in nature. The very complex developmental process from genotype into the phenotype,

morphogenesis, is influenced by many environmental factors and is stochastic by

nature. The environment in which the development takes place is often called the

epigenetic environment. During the development, deformities or mutations might take

place that can have a big influence on the final phenotype, depending on what stage of

the development they occur. In more abstract terms it can be said that the mapping

from the genotype into the phenotype is stochastic and dependent upon the state of the

system. This concept is implemented in a GA system in [8] where the neural

development in a population of neural networks is simulated using cell division, cell

migration and axonal growth and branching. It is suggested that the lack of mutations

in the development from genotype to phenotype in most GA applications concerning

neural network evolution could be a reason why such high (genetic) mutation rates are

needed for a good GA performance.

• Selection

In nature, adaptation is performed using natural selection instead of the selection

method used in most evolutionary computation systems. The main difference is that in

natural selection there is no such thing as a superimposed fitness measure. Not just the

organisms but also the fitness measure evolves. EC systems where this is implemented

are called open-ended evolution. The majority of EC systems however works as a

function optimiser and therefore necessarily have a fixed fitness function. This is


probably the main difference between natural evolution and EC. Most biologists

would argue that the idea of optimisation in itself is not found in nature and it is for

this reason possibly quite dangerous to blindly copy ideas from natural evolution into

the field of EC.

Selection in EC also differs from natural selection in several smaller ways. All individuals have a non-zero chance of producing offspring for the next generation, and an individual cannot die before it reaches reproductive age. And, of course, in EC the entire concept of selection based on sexual attractiveness is absent.

In biological systems it is often thought to be mutation that is the main force in

exploring new genetic territory. Crossover is generally thought to play only a minor

role. In genetic algorithms this picture is reversed, and crossover is generally thought to be the main force. It has therefore lately been postulated that crossover might play

a much bigger role in biological systems than previously thought. Although the

mutation-rate is usually very small, mutation usually plays a vital part in a GA in that

crossover can normally not (re)introduce information not contained in the present

population. It is clear from the successful applications of Evolutionary Programming

that algorithms using selection and mutation only can be very powerful. These types

of algorithms are referred to as 'naive evolution'. As small as the mutation rates may

be in GAs, the ones usually found in nature are actually much smaller. Reasons why

natural selection does not seem to need high mutation rates in order to maintain

enough genetic variability could be the stochastic nature of the genotype-to-phenotype

mapping as mentioned above and the use of diploid instead of haploid chromosomes.

Both these features are lacking in most GA implementations.

Individual learning in organisms is equivalent to local search in evolutionary

computation. The Baldwin Effect occurs in EC when local search changes the fitness

of an individual but does not affect its genotype. Lately the Baldwin Effect has gained

much attention in the field of evolutionary computation, where EC is combined with

local search algorithms to improve its performance. In order to have a true Baldwin

Effect in EC similar to the one found in nature, the fitness function should be allowed

to evolve together with the individual.


While it seems to be true that Lamarckian learning does not actually occur in

biological systems, it can prove beneficial for evolutionary computation. It not only

changes the fitness (by means of local search) but it also changes the genetic

representation of the individual so that the learned information can be passed on to

future generations. In EC systems where there is no way for the fitness function to

evolve, Lamarckian learning provides the only mechanism for passing on learned

information. Since almost all EC systems are used as optimisation algorithms the

fitness function will indeed be fixed.

Phenotype learning refers to learning during the lifetime of an individual, while genotype learning refers to the evolutionary process where learned behaviour is inherited. In EC

systems the trade-off between the two can be a difficult one. For example if genotype

learning has too much effect on the behaviour of an individual it might be very hard

for it to learn new behaviour by means of phenotype learning simply because its

options are limited. Genotype learning is often said to determine the boundaries in

which an individual can evolve by means of phenotype learning.

• Epistasis

Tackling epistasis is one of the main problems in GAs. Present day GAs usually fail

when the level of epistasis is high. As most theoretical work in GAs is concerned with

problems of low epistasis, more work needs to be done to understand the behaviour of GAs on

problems with high epistasis. By contrast, biological systems perform well even with a

very high level of epistasis. In fact the level of epistasis found even in simple

organisms is so high that some biologists reject the reductionist approach resulting in

the idea of genes as a useful tool for studying genetics. A better understanding of

biological systems concerning epistasis is expected to be of value in research on GAs.

In Table 4.2 the requirements defining the minimal Darwin Machine are again listed, this time with their

correspondence in the field of Evolutionary Computation.


Table 4.2 The requirements for a minimal Darwin Machine and their occurrence in evolutionary computation.

1. The system operates on patterns of some sort.
   Potential solutions are represented by (often linear) chromosomes.

2. Copies are made of these patterns.
   Reproduction of chromosomes.

3. During the copying variations occur.
   Crossover and mutations introduce variation during reproduction.

4. The various patterns compete with each other for a limited territory/workspace.
   (No equivalent in EC.)

5. The selective success of a pattern is biased by its environment.
   Selection depends on the fitness of an individual relative to the fitnesses of the other individuals, but is usually not influenced by its environment.

6. Copying only occurs after a certain amount of differential success.
   The chance of reproduction depends only on the relative fitness.

There seem to be two main reasons why, in general, evolutionary computation does

not qualify as a Darwin Machine. First, there is no equivalent to the struggle for a

place in a limited territory or workspace in EC. An individual feels no effect of the

way other individuals perform on the problem and there is no notion of some kind of

'source' (e.g. territory, food, or even mates) that is limited in any sense. Second, there

is no influence of the environment reflected in the fitness value. The only thing that is

reflected in the individual's fitness is its own performance on the problem.

An exception of this general picture of EC can be found in the work by Nolfi and

Parisi [55] where the GA system evolves artificial organisms represented by

ecological neural networks that compete with each other in a limited two dimensional

world in the quest for food. A changing environment is modelled by varying the food

resources over time. Experiments are performed where the fitness function itself is left

to evolve, resulting in observed forms of preadaptation to changing environments.

This system does meet all the requirements for a Darwin Machine even though the

concept of a Darwin Machine was really set up to compare natural processes rather

than artificial ones. The GA system in this approach can therefore be said to belong to


the field of artificial life, where complex natural like behaviour is generated from

interacting artificial organisms operating with a relatively simple rule-based system.

• Overview

Table 4.3 gives a brief overview of the main differences between the evolutionary

theory of biological systems and the operation of most present day GAs.

Table 4.3 A brief overview of the main differences between Darwinian evolutionary theory and most present day GAs.

Darwinian evolutionary theory | Most present day GAs
natural selection with an evolving fitness measure | selection based on a superimposed fixed fitness measure
adaptation to changing environments | optimisation of fitness under stable environments
complex stochastic genotype-to-phenotype mapping | relatively simple deterministic genotype-to-phenotype mapping
relatively low mutation rate | relatively high mutation rate
extremely high level of epistasis | perform well only on problems with a medium level of epistasis
competition between organisms for a common source | no competition between individuals
Baldwin effect: change in fitness function due to individual learning | sometimes local learning but no change in the fitness function

As stated before, the main overall difference between the two systems lies in their

goals, (or rather the lack of one in Darwinian evolutionary theory). While in

evolutionary computation the goal almost always is the optimisation of some kind of

fixed problem, this does not necessarily seem to be the case for biological

evolutionary systems. Still, the success of evolutionary computation as a function

optimiser, as reported on a wide variety of problems and in some parts supported by

theoretical foundation, indicates that many features of Darwinism lend themselves

very well for this purpose.

5. Mathematical Foundations of Genetic Algorithms

Several approaches can and have been taken as a first step to form a basic theory of

genetic algorithms (and genetic programming), each one providing some useful

insights into their functioning. Still, a fundamental foundational theory incorporating

all aspects of genetic algorithms is a long way off.

One of the first and most frequently referenced foundational works in this field is by

Holland [30], who examined the case of a binary coded fixed length genetic algorithm

and introduced a mathematical foundation known as the Schema Theorem. Goldberg

[19] extended this idea with the notion of the so called Building Block Hypothesis.

In section 5.2 an overview of the Schema Theorem is given, based on a simple binary

coded genetic algorithm. Other approaches are also presented.

Genetic algorithms perform a stochastic global search through the solution space by

swapping information between chromosomes and by occasionally (re)introducing new

information. The basic genetic algorithm comprises three genetic operators. These are

reproduction, crossover and mutation.

The reproduction or selection operator makes sure that the search is biased in the

direction of chromosomes with high fitness values if maximising or low fitness values

if minimising. Chromosomes that have above average fitness have more chance to

survive and to reproduce than others. Chromosomes with very low fitness will die off.

Low fitness chromosomes are needed in the population though, because they can

contain information that can be useful or even crucial to the formation of the optimal

chromosome.

The crossover operator ensures that partial information contained in one chromosome

can reach other chromosomes in the population. This mixing of information leads to


5.1 The Operation of Genetic Algorithms 61

the formation of optimal chromosomes. In order for the genetic algorithm to perform

well the crossover operator should be such that a high correlation exists between the

fitness of the parent chromosomes and the fitness distribution of their offspring.

In order to fully explore the search space, diversity of the population is crucial. The

purpose of the mutation operator is to maintain enough diversity in the population to

overcome local optima and eventually reach the global optimum. A mutation-rate that

is too high can destroy useful information in highly fit chromosomes and slow down

the search.
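The interplay of the three operators described above can be sketched in a few lines. A minimal sketch, assuming a onemax fitness function (count of 1 bits) and arbitrary parameter values; none of these settings come from the text:

```python
import random

random.seed(1)

# Minimal generational GA sketch: roulette-wheel reproduction, 1-point
# crossover and bit-flip mutation. Fitness and parameters are illustrative.
L, N, PC, PM, GENS = 9, 20, 0.7, 0.01, 50

def fitness(x):
    return sum(x)  # onemax: number of 1 bits

def roulette(pop, fits):
    # Spin the wheel: pick an individual with probability fitness / total.
    r = random.uniform(0, sum(fits))
    acc = 0.0
    for x, f in zip(pop, fits):
        acc += f
        if acc >= r:
            return x
    return pop[-1]

pop = [[random.randint(0, 1) for _ in range(L)] for _ in range(N)]
for _ in range(GENS):
    fits = [fitness(x) for x in pop]
    nxt = []
    while len(nxt) < N:
        a, b = roulette(pop, fits), roulette(pop, fits)
        if random.random() < PC:                 # recombine with probability PC
            site = random.randint(1, L - 1)
            a, b = a[:site] + b[site:], b[:site] + a[site:]
        for child in (a, b):
            nxt.append([bit ^ (random.random() < PM) for bit in child])
    pop = nxt[:N]

best = max(fitness(x) for x in pop)
print(best)
```

After a few dozen generations the bias of selection towards fitter strings, combined with crossover and a low mutation rate, typically drives the population to the all-ones optimum.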

For optimal search a genetic algorithm must find a balance between exploration of

new regions of the search space and exploitation of the information available in the

current population. For the sake of comparison, hillclimbing algorithms which guide

the search based on local gradient information of the function to be optimised, are

very good at exploitation but do limited exploration. In contrast a random search

mechanism, where points in the search space are selected and tested at random is good

at exploration but does no exploitation. Genetic algorithms fall somewhere in

between.

Another viewpoint of the working of GAs is that of population diversity and selective

pressure. Selective pressure reflects the 'pushing force' of a GA for above average

individuals. When the selective pressure is very strong, a superfit individual takes over

the population within a few generations often resulting in premature convergence.

This in turn is reflected by a strong decrease in population diversity. A selective

pressure that is too weak however results in a very slow and ineffective search. Again

a balance is needed between the two. This viewpoint is, in a sense, a variation of the

exploitation and exploration idea. Too much exploitation and too little exploration is

reflected in a strong selective pressure resulting in a low population diversity. The

other extreme results in a very low selective pressure and an ineffective search.

• Evolvability

Since even in a pure random search there is a chance that the offspring chromosomes

are fitter than the parents, a genetic algorithm should have a better average

performance than a random search. In order to do so, the effect of the genetic

operators should be such that there exists a high correlation between the fitness of the


parents and the fitness distribution of the offspring. When this is the case the fitness

distribution of the offspring can on average be expected to be better than the one

belonging to the parents. This correlation property is called the evolvability of a

genetic algorithm [2] and serves as a local performance measure. The global

performance measure then is simply the ability of the genetic algorithm to produce

fitter offspring over the course of one or more generations. This global performance

depends on the maintenance of the evolvability of the population as the search is

guided to the global optimum. Using the Schema Theorem, this evolvability can be

expressed by the Building Block hypothesis.

5.2 The Schema Theorem and the Building Block Hypothesis

Introduced by Holland [30], the Schema Theorem is often viewed as the fundamental

theoretical foundation of genetic algorithms. In its basic form it can be applied to

chromosomes that are fixed length strings only. Additions have been made to

incorporate more general chromosomal structures, such as the ones used in genetic

programming [40]. The theorem will be presented in its basic form here, where

chromosomes will simply be referred to as strings. Firstly some related terms are

defined.

The search space, Ω, is the complete set of possible strings. In the case of a fixed-length chromosomal string where each gene can take on a value in the alphabet A, the size of the search space is:

size(Ω) = k^l

where:
k = alphabet size
l = chromosome length

Returning to the example of the last chapter, the search space has a size of 2^9 = 512. In other words: there are 512 different chromosomes in the search space.

A population S consists of strings x ∈ {0,1}^l. The number of instances of a string x in the population S is denoted by m(x).


A schema is a similarity template describing a set of strings with identical values at certain fixed positions. A schema therefore defines a subset of the complete search space. More precisely it defines a hyperplane partition of the l-dimensional search space. A schema, H, is a string of the same length as the chromosomes. Each position in the string can take on the values of the alphabet, called fixed positions, plus a 'don't care' character '*'. So in the binary case H ∈ {0,1,*}^l. Using the example genetic algorithm again, an example of a schema is:

H1 = * 1 0 * 0 1 0 0 1

Strings that match this template are said to belong to schema H1, but belong to many other schemata as well. The search space thus can be defined as a schema of length l with a 'don't care' symbol at every position. In our case: Ω = * * * * * * * * *. It also follows that the number of schemata possible is (k+1)^l; in our case: (2+1)^9 = 19683.

In order to present further discussion about schemata, the following properties need to be defined. The order of a schema, o(H), is the number of fixed positions in H. The order of H1 for example is o(H1) = 7. The defining length, δ(H), of a schema is the distance between the first and the last fixed positions of the schema. In the example schema: δ(H1) = 9 − 2 = 7.

The number of strings in the population belonging to schema H, m(H), is given by:

m(H) = Σ_{x ∈ H} m(x)
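These definitions translate directly into code. A small sketch, assuming binary strings with '*' as the don't-care symbol; the helper names and the toy population are mine, not from the text:

```python
# Schema properties: order o(H), defining length δ(H), membership, and m(H).
def order(H):
    return sum(c != '*' for c in H)

def defining_length(H):
    fixed = [i for i, c in enumerate(H) if c != '*']
    return fixed[-1] - fixed[0] if fixed else 0

def belongs(x, H):
    return all(h == '*' or h == c for c, h in zip(x, H))

def m(H, population):
    return sum(belongs(x, H) for x in population)

H1 = '*10*01001'                        # the example schema from the text
print(order(H1), defining_length(H1))   # 7 7
print(m(H1, ['010001001', '110101001', '000000000']))  # 2
```

The first two strings of the toy population match every fixed position of H1, the all-zero string fails at position 2, so m(H1) = 2 here.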

Using schemata the effects of the genetic operators on the fitness distribution of the

population can be seen.

Consider the reproduction operator. Let the population at time t be defined by S(t),

and let m(H,t) denote the number of strings in this population belonging to schema H.


The average fitness of all strings in the population representing schema H is defined as:

f(H) = Σ_{x ∈ H} f(x) · m(x) / m(H)

f(H) is also called the average payoff function of schema H, and the fitness of a string x in the population is f(x). Using the standard Roulette wheel reproduction the expected number of times string x is selected is given by E(x) = f(x) / f̄. It can be seen that the expected number of strings belonging to schema H in the next population is given by:

m(H,t+1) = m(H,t) · f(H) / f̄

where f̄ is the average fitness of all the strings in the population at time t.

From the above equation it can be seen that schemata with above average fitness values will reproduce in increasing numbers in the next generation, while schemata with below average fitness values will eventually die off. When f(H)/f̄ is relatively constant, the equation can be approximated by a linear difference equation of the form m(H,t+1) = a·m(H,t). The solution is then given by:

m(H,t) = m(H,0) · a^t

With a being approximated by f(H)/f̄, it can be seen that strings belonging to above-average schemata are expected to grow exponentially while strings belonging to below-average schemata are expected to decay exponentially.

Now the effect of the (1-point) crossover operator on a schema H is considered. When

a string belonging to a schema H recombines with another string into two offspring,

the schema can either have survived crossover or not. The survival probability of a


schema depends on its defining length and can best be illustrated by an example. Consider a string x that is selected for 1-point crossover, and consider two representative schemata H1 and H2 within that string:

x  = 0 1 0 0 0 1 | 0 0 1
H1 = * 1 * * * * | * * 1
H2 = * * 0 0 * * | * * *

The crossover point, marked '|' above, is randomly chosen to be 6. Unless the string with which x mates has the same gene values at positions 2 and 9, a possibility that will be ignored for now, schema H1 will not survive. Schema H2 however does survive the crossover operator and at least one of the offspring will belong to H2. Due to its longer defining length it is clear that schema H1 has less chance of surviving crossover than does H2. Only a crossover with its crossover site at position 3 will destroy schema H2, while only a crossover at position 1 preserves schema H1. The defining lengths for the schemata are: δ(H1) = 9 − 2 = 7 and δ(H2) = 4 − 3 = 1.

Generally speaking a schema survives 1-point crossover if the crossover site falls outside its defining length. Assuming the crossover site is chosen randomly from [1, ..., l−1] the probability of survival for a schema H is:

p_s = 1 − δ(H) / (l − 1)

When crossover itself is applied with probability pc, the following expression gives a lower bound on the survival probability of schema H due to the crossover operator:

p_s ≥ 1 − pc · δ(H) / (l − 1)
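This survival probability can be checked empirically. A sketch using the example schema H1 = *10*01001 (δ = 7, l = 9) and a mate built to mismatch H1 at every fixed position, which is exactly the 'lucky mate' possibility the text sets aside; under that assumption the empirical survival rate should match 1 − δ(H)/(l−1) = 1 − 7/8 = 0.125:

```python
import random

random.seed(2)

# Empirical check of p_s = 1 - δ(H)/(l-1) for the example schema *10*01001
# under 1-point crossover, with a mate that matches H1 at no fixed position.
l = 9
H1 = '*10*01001'
fixed = {i: c for i, c in enumerate(H1) if c != '*'}

def belongs(s):
    return all(s[i] == v for i, v in fixed.items())

trials = 40000
survived = 0
for _ in range(trials):
    parent = [fixed.get(i, random.choice('01')) for i in range(l)]
    mate = [('0' if fixed[i] == '1' else '1') if i in fixed else random.choice('01')
            for i in range(l)]
    site = random.randint(1, l - 1)              # crossover site in [1, l-1]
    child1 = parent[:site] + mate[site:]
    child2 = mate[:site] + parent[site:]
    survived += belongs(child1) or belongs(child2)

delta = max(fixed) - min(fixed)                  # δ(H1) = 7
rate = survived / trials
print(round(rate, 3), 1 - delta / (l - 1))       # both close to 0.125
```

With random mates instead, the measured rate rises above the bound, since a mate can supply matching gene values; the formula is a lower bound.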

This result can be extended for the 2-point crossover operator. Assuming the two crossover sites are chosen independently from each other and assuming they are not equal to each other (otherwise there would be no crossover), the survival probability is given by:

p_s ≥ 1 − 2 · pc · (δ(H) / (l − 1)) · (1 − δ(H) / (l − 2))

Schemata of length l − 1, like the example schema H1, never survive 2-point crossover.

Uniform crossover diminishes the survival probability of schemata. Since every gene in the chromosome has a 50% chance of survival, the lower bound on the survival probability is:

p_s ≥ 1 − pc · (1 − 0.5^o(H))

The standard mutation operator destroys a schema H if it is applied on the o(H) fixed positions of H, since it inverts the value of the bits. Since mutation is performed on each gene of the population with a probability pm, the chance that a gene survives mutation is (1 − pm). Therefore the probability that a schema H survives mutation is given by:

p_s = (1 − pm)^o(H)

For small values of pm, as is usually the case, this may be approximated by 1 − o(H)·pm.
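A quick numeric sanity check of this approximation, with arbitrarily chosen values for pm and o(H):

```python
# (1 - pm)^o(H) versus the linear approximation 1 - o(H)*pm for small pm.
pm, o_H = 0.001, 7
exact = (1 - pm) ** o_H
approx = 1 - o_H * pm
print(round(exact, 6), approx)  # 0.993021 0.993
```

Even for an order-7 schema the two values agree to about four decimal places, which is why the linear form is used in the Schema Theorem below.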

• The Schema Theorem

To see the effect of all three genetic operators (using 1-point crossover), the survival probabilities are simply multiplied. This can be done since all three operators are applied independently. Thus the lower bound on the expected number of copies of a schema H in the next generation is:

m(H,t+1) ≥ m(H,t) · (f(H)/f̄) · (1 − pc · δ(H)/(l−1)) · (1 − o(H) · pm)

which for small pm may be approximated by:

m(H,t+1) ≥ m(H,t) · (f(H)/f̄) · (1 − pc · δ(H)/(l−1) − o(H) · pm)

From the above equation the Schema Theorem may now be stated:

The Schema Theorem: Using reproduction, crossover and mutation in the standard genetic algorithm, short, low-order, above-average schemata receive exponentially increasing trials in subsequent generations. [19]
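Evaluating the bound is a one-line computation. A sketch with arbitrary illustrative numbers, not values taken from the text:

```python
# Lower bound on the next-generation schema count from the Schema Theorem.
def schema_bound(m_Ht, f_H, f_avg, p_c, p_m, delta, o, l):
    return m_Ht * (f_H / f_avg) * (1 - p_c * delta / (l - 1) - o * p_m)

# A schema with 10 instances, 20% above-average fitness, δ = 2, o = 3, l = 9.
val = schema_bound(10, 1.2, 1.0, 0.7, 0.01, 2, 3, 9)
print(round(val, 2))  # 9.54
```

A short, low-order schema like this one loses little to crossover and mutation, so its above-average fitness translates almost fully into growth.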

Finally, it can be shown that the number of schemata which are effectively processed

in each generation is of the order N^3, with N the population size. This property of GAs

which helps explain its performance on many optimisation problems is known as

implicit parallelism.

As an extension of the Schema Theorem, Goldberg introduced the Building Block

Hypothesis [19]. The short, low-order, highly-fit schemata, which play such an

important role in the standard genetic algorithm are given the name of Building

Blocks. When building blocks are recombined into another chromosome, its fitness is

likely to increase. Using the Schema Theorem the Building Block Hypothesis can now

be stated:

The Building Block Hypothesis (BBH): The partial information contained in the

building blocks is combined in a GA to form globally optimal strings. [19]

The genetic algorithm can now be seen to work in such a way that the building blocks

are sampled, recombined and resampled to form strings of higher fitness which

ultimately should arrive at the global optimal string. The building block hypothesis is

in fact a way to express the evolvability of the standard genetic algorithm. It states that

a genetic algorithm tries to find low-order schemata that have the best average payoff

in each hyperplane partition of the search space and that it combines these to form a

more complete solution.


Although the building block hypothesis (BBH) has been shown to work in many

applications, there are GAs for which it does not [23]. The problem-coding

combination in such GAs is generally referred to as being GA-deceptive, meaning that

the GA search is deceived or misled in finding the global optimum. In GA-deceptive

problems there is no regularity in the function-coding combination that may be

exploited by the recombination of short length schemata. Building blocks cannot form.

GA-deceptiveness is a theoretical concept derived from the analysis of schemata

payoff functions. In contrast GA-hardness is a practical concept expressing the actual

performance of the GA; i.e. how easy is it for the standard GA to converge to the

global optimum? It is important to note that GA-deceptiveness does not necessarily

entail GA-hardness. Some problems classified as being GA-deceptive in fact turn out

to be quite easily solved by the standard GA.

As an example where the standard genetic algorithm should have no problem finding

the optimum solution, consider the following GA-non-deceptive problem. Suppose the

optimum solution is the string 000 ... 0 (the string length is undefined). Furthermore

for the average schema fitnesses the following holds:

f(0** ... *) > f(1** ... *)
f(00* ... *) > f(01* ... *)
f(00* ... *) > f(10* ... *)
f(00* ... *) > f(11* ... *)
etc.

In other words, for all schemata of a certain length (a hyperplane partition), those with

all 0's in their fixed position are preferred. According to the building block hypothesis

this problem should not deceive the GA and it should easily converge to the global

optimum.

Now consider the same problem, i.e. the optimum is 000 ... 0, but now schemata that

have 1's are preferred for every hyperplane partition. Thus:

f(0** ... *) < f(1** ... *)
f(00* ... *) < f(01* ... *)
f(00* ... *) < f(11* ... *)
etc.


This is a GA-deceptive problem according to the BBH, since the coding regularity

occurs in the non-preferred schemata, and the standard GA should have great

difficulty in finding the optimum. The problem now is to design the genetic coding in

such a way that the problem is not GA-deceptive so that building blocks can form and

the BBH will hold. This is in general very hard to do. One apparent requirement of the

coding in order for building blocks to form is that related genes should be close

together on the chromosome. When they are close together they can form a building

block and guide the GA search to better individuals. According to the BBH therefore

the ordering of the genes in the chromosome can play an important part in the GA

performance.

It can be beneficial to take a more geometric viewpoint when examining schemata. The search space can be seen as an l-dimensional space, l being the chromosome length [6]. Points in this space are the chromosome-strings and can be seen as schemata of order l (i.e. schemata with fixed positions only). Schemata of order l−1 define lines in the space, schemata of order l−2 planes, etc. In general schemata of order l−n define n-dimensional hyperplanes in the search space, the search space itself being defined by the schema of order 0 (i.e. the schema consisting of 'don't care' values only).

Using this viewpoint, the genetic algorithm can now be seen as moving between points

across different hyperplanes in search of the optimal point in the search space.

The Walsh-Schema transform can be used as an efficient, analytical method for

determining the expected performance of genetic algorithms. The average fitness

values of schemata, f(H), are determined using Walsh coefficients. The early work in

this field provides an analysis to determine the expected static performance of genetic

algorithms (e.g. [19]), while in [6] a nonuniform Walsh-Schema transform is used for

the dynamic analysis of genetic algorithms. The static analysis requires the assumption

of a so called flat population in which every possible string is represented in equal

numbers.


An equivalent transform was introduced by John Holland. It is called the Hyperplane Transform and is shown to be essentially equivalent to the nonuniform Walsh-Schema transform [6].

The Walsh transform of a function f: x ↦ ℝ, where x is a binary string of length l, is given by:

w_j = (1 / 2^l) · Σ_{x=0}^{2^l − 1} f(x) · ψ_j(x)

where:
ψ_j(x) = the Walsh function
w_j = the Walsh coefficient relating to j
j = a binary string of length l

The summation is over all 2^l 'integer' values of x; i.e. x = 000, 001, 010, 011 etc. for the case l = 3. The function f(x) is transformed into a set of coefficients w_j, one for each possible bitstring j. The total number of such bitstrings and therefore the number of Walsh coefficients is 2^l. The Walsh coefficients are sometimes also called partition coefficients. The Walsh function ψ_j(x) is given by:

ψ_j(x) = Π_{i=1}^{l} (−1)^(x_i · j_i)

where:
x_i = the value of the i-th bit of x
j_i = the value of the i-th bit of j

The Walsh function will have a value of 1 if x and j have 1's in common at an even number of positions, and a value of −1 otherwise. The inverse Walsh transform is:

f(x) = Σ_{j=0}^{2^l − 1} w_j · ψ_j(x)


So for the case of a three bit string: f(x) = w000 ± w001 ± w010 ± w011 etc. The Walsh-schema transform is now:

f(H) = Σ_{j ∈ J(H)} w_j · ψ_j(β(H))

where:
J(H) = a set generator of schema H
β(H) = an operator that maps H to a binary string

The set generator J(H) generates a set of binary vectors from a schema H. This set is defined by:

J_i(H_i) = 0, if H_i = *
           *, if H_i = 0 or 1

So for example J(***) = 000 = {000} and J(**1) = 00* = {000, 001}. β(H) is defined by:

β_i(H_i) = 0, if H_i = 0 or *
           1, if H_i = 1

Using these definitions the average schema payoff function f(H) can be transformed into Walsh coefficients. For example:

f(***) = w000
f(**1) = w000 + w001
f(**0) = w000 − w001
f(*1*) = w000 + w010
f(*0*) = w000 − w010
etc.
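The transform and the schema averages can be checked numerically. A sketch for l = 3 with made-up fitness values; the sign convention inside psi is chosen here so that the printed examples (e.g. f(**1) = w000 + w001) come out as stated, since sign conventions for the Walsh basis differ in the literature:

```python
# Walsh-schema transform for l = 3, with hypothetical fitness values f(x).
l = 3
f = [5, 1, 4, 2, 3, 0, 6, 7]          # f(x) indexed by the 3-bit integer x

def bit(n, i):
    return (n >> (l - 1 - i)) & 1     # i-th bit, counting from the left

def psi(j, x):
    # Convention: flip sign once for every position where j has a 1 and x a 0.
    s = 1
    for i in range(l):
        if bit(j, i) and not bit(x, i):
            s = -s
    return s

w = [sum(f[x] * psi(j, x) for x in range(2 ** l)) / 2 ** l
     for j in range(2 ** l)]

# The inverse transform recovers f exactly.
assert all(abs(sum(w[j] * psi(j, x) for j in range(8)) - f[x]) < 1e-9
           for x in range(8))

# Average payoff of schema **1 (strings ending in 1) equals w000 + w001.
avg = sum(f[x] for x in range(8) if x & 1) / 4
print(avg, w[0b000] + w[0b001])       # 2.5 2.5
```

The assertion confirms orthogonality of the basis, and the final line confirms that the average payoff of a schema really is a signed sum of a few Walsh coefficients, which is the point of the transform.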

The values of the Walsh coefficients can be obtained from the problem dependent

values of the schema payoff functions f{H) by simple back-substitution. Insight into

whether a problem may be difficult to solve for a GA may be gained from observing

the Walsh coefficients. For example for a problem to be GA-deceptive, conditions

such as the following may need to hold:


f(**1) < f(**0)

f(*1*) < f(*0*)

f(1**) < f(0**)

etc.

This can be translated into the following relations concerning Walsh coefficients:

w_001 < 0

w_010 < 0

w_100 < 0

These relations can easily be checked once the Walsh coefficients are determined. The

Walsh-Schema transform provides a means of analysing the deceptiveness of a problem.

Furthermore, contributions in schema fitnesses due to epistatic interactions between

certain bit positions can be investigated [19]. A disadvantage of the Walsh-Schema

transform is the excessive amount of computation needed in the analysis. The

nonuniform Walsh-Schema transform is much better in this sense and provides dynamic analysis

of problems for which the computation needed in a normal Walsh-Schema transform

would be impractical.

The Schema Theorem as described above can be extended for representations other

than binary coding. The Schema Theorem is not restricted to the binary alphabet of '0'

and '1'. The effects of the Roulette wheel selection scheme and the standard

crossover operator (the swapping of substrings) on the survival probabilities of

schemata hold for any coding. The standard mutation operator for other codings is

slightly different from the inversion-mutation used for binary coding. Usually it

randomly re-initialises the value of a gene making it in principle possible that the gene

is re-initialised with its old value; i.e. no mutation at all. The Schema Theorem can

easily be extended to include this.

Building Block Hypothesis

The Schema Theorem describes how the crossover operator is used to combine the

proper building blocks so that the evolvability of the genetic algorithm is maintained.

The Building Block Hypothesis states that a GA achieves its power by combining short,

low-order, highly fit schemata, the building blocks, into ever fitter solutions. For this

to happen, the problem representation and the genetic operators

used must be such that building blocks actually exist. The weakness of the Schema

Theorem lies in the fact that it does not provide a way to express the correlation

between the problem representation and the genetic operators used, and the actual

performance of the genetic algorithm. It has been said that it does not properly explain

the sources of power in genetic algorithms.

A special point of criticism of the BBH is that it is based on a static analysis of the

payoff functions of schemata, while a dynamic view would be needed to properly

explain the working of a GA. According to Grefenstette [23], this means in fact that

the BBH is false and fails in practice due to the following factors:

• during a run, the population is a biased sample of all schemata

When the GA starts to converge (i.e. right after the first generation), the population

represents a biased sample of all schemata. This means that the GA can no longer

estimate the true average schema fitnesses. This can in fact only be done in the first

random population (i.e. at generation 0). During a run the GA can favour schemata

that have a lower true average fitness or payoff than others simply by eliminating their

competitors. It can be shown that because of this, GA-deceptive problems can in fact

quite easily be solved by a standard GA.

• the population size is always limited and there is a large variance within

schemata

This means that even in the initial random population the GA cannot estimate the true

average schema fitnesses. To illustrate this, consider the following problem which

according to the BBH is GA-easy or at least non-deceptive.

Suppose a real value x ∈ [0, 1] is encoded as a binary string. The string 000...0 represents 0.0, the string 111...1 represents 1.0, etc.

Let f be defined by:

f(x) = x^2, if x > 0
     = 2048, if x = 0

It can be seen that for any schema H which contains the optimum string 000...0 (i.e. all

the schemata with only 0's in their fixed positions), the average schema fitness, f(H) >

2, since the sum of all its payoff functions is at least 2048 (due to the optimum string)


and the number of strings contained in H is at most 2^10 = 1024. Also for any schema H

that does not contain 000...0, the average payoff function f(H) < 1, since the sum of all

its payoff functions is at the most 1024*1 = 1024. Therefore schemata containing only

0's in their fixed positions are always preferred over others and the problem is GA-

non-deceptive.

However a standard GA will find it extremely difficult to find the optimum string

000...0. Unless it was already part of the initial population, or is introduced by a

very lucky crossover or mutation, the string will very likely not be

found. Intelligent sampling of hyperplane partitions will not lead to the discovery of

the optimal string as predicted by the BBH. This is because the variance in the best

schemata is extremely high due to this problem being a 'needle-in-a-haystack'

search. The GA cannot accurately estimate the true average payoff functions of these

schemata.
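The variance argument is easy to reproduce. The sketch below assumes a string length of 11, to match the 2^10 = 1024 count above; it computes the true average payoff of the schema 0**...*, and then shows that removing the single optimum string collapses that average to far below 1, so any sample that misses the needle badly misestimates the schema:

```python
L_BITS = 11          # string length assumed from the "2^10 = 1024" count above

def payoff(n):
    # payoff of the string with integer value n: 2048 at 000...0, else x^2, x in (0, 1]
    x = n / (2**L_BITS - 1)
    return 2048.0 if n == 0 else x * x

# the schema 0**...* contains exactly the 2^10 strings 0 .. 1023, among them the optimum
members = range(2**(L_BITS - 1))
true_avg = sum(payoff(n) for n in members) / len(members)
assert true_avg > 2                 # the single needle alone guarantees this

# without that one string the schema average collapses far below 1, so a sample
# that misses the needle tells the GA nothing about the schema's true quality
avg_without_needle = (true_avg * len(members) - 2048.0) / (len(members) - 1)
assert avg_without_needle < 0.1
```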

Grefenstette states that while the Schema Theorem as presented by Holland refers to

the average payoff of schemata according to the current sample in the population, the

BBH ignores this crucial feature and should therefore not be used as a fundamental

theorem for GAs. The classification of problems being GA-hard or GA-easy using the

BBH is certainly not always true as shown above.

5.4 An Alternative to the Schema Theorem

In order to be able to express the correlation between representation, genetic operators

and performance explicitly in a theory and in order to account for arbitrary

representations and genetic operators, a more general and more complete approach

than the Schema Theorem is suggested in [2] to serve as the mathematical foundation

of genetic algorithms. Apart from GAs it can also serve as a foundation of genetic

programming (in contrast to the Schema Theorem). The theory is based on Price's

Covariance and Selection Theorem and does not include the notion of schemata or

building blocks.

In order to account for the effects of the choice of representation and genetic operators

the notion of a transmission function is used. A transmission function is the

probability distribution of the offspring chromosomes from every possible mating. For


the case of two parents, as used in normal crossover, the transmission function is

represented as T(i ← j,k), where i is the label for the offspring and j, k are the labels

for the parents. T(i ← j,k) represents the probability that an offspring of type i is

produced by parental types j and k resulting from the application of the genetic

operators on the representation.
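As a concrete illustration (a miniature setting assumed for this sketch, not taken from the text), the transmission function of one-point crossover on 2-bit strings can be tabulated exhaustively: the crossover point falls between the two bits and either parent may contribute the first bit, so the kept offspring is (j_1, k_2) or (k_1, j_2), each with probability 1/2.

```python
from itertools import product
from fractions import Fraction

strings = list(product((0, 1), repeat=2))   # the four possible 2-bit chromosomes

def transmission(j, k):
    # T(i <- j,k): distribution of the single kept offspring under one-point
    # crossover of 2-bit parents; either parent may contribute the leading bit
    T = {i: Fraction(0) for i in strings}
    T[(j[0], k[1])] += Fraction(1, 2)
    T[(k[0], j[1])] += Fraction(1, 2)
    return T

# every transmission function is a probability distribution over offspring types
for j, k in product(strings, repeat=2):
    assert sum(transmission(j, k).values()) == 1

# parents 01 and 10 can only produce the offspring 00 or 11
T = transmission((0, 1), (1, 0))
assert T[(0, 0)] == Fraction(1, 2) and T[(1, 1)] == Fraction(1, 2)
```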

The performance of a genetic algorithm is now determined by the relation between the

transmission function and the fitness function. Price's theorem is used to analyse the

dynamic behaviour of the fitness distribution over one generation. This is shown to

depend on the covariance between parent and offspring fitness distributions and a so

called 'search-bias' which indicates how much better the effect of a genetic operator

on the current population is than pure random search.

Using the search-bias a quantitative notion can be given to the idea that the

transmission function should find a balance between exploring the search space and

exploiting the current population. It is still very hard to actually use the theorem in

practice in order to analyse or optimise a genetic algorithm, but enhanced ease of use

may be expected in future.

5.5 Markov Chain Analysis

Yet another approach to a fundamental theory of genetic algorithms is Markov chain

analysis. As opposed to most models where certain approximations have to be made,

Markov chains can be used to form an exact model of genetic algorithms. This can

only be on small scale though, as the transition matrix used in the analysis grows

exponentially with increasing population size and chromosome length. Due to this

scalability problem, studies in this area have used very small scale genetic algorithms

(e.g. [22], [31]) or have worked with matrix notation only (e.g. [12], [54]).

Even in the case of small scale models very useful insights can be gained such as the

concept of genetic drift and the effect of preferential selection on the population.

In [22] a genetic algorithm was modelled that had binary chromosomes of length one;

i.e. a 'single-locus genome'. In other words, the only individuals possible are '0' and

'1'. A population of size N now gives a total of (N + 1) possible states, whereby the

location of a chromosome in the population is of no concern. For example a

population of size two has states '00', '01', and '11'. State i is referred to as the state


with exactly i ones and (N - i) zeros. The operation of the genetic algorithm is now

defined by a (N + 1) × (N + 1) transition matrix P[i, j] that maps the current state i

to the next state j. The probability of a transition from state i to state j is given by one

entry in the matrix: p(i,j). Figure 5.1 visualises these terms for a simple single-locus

genetic algorithm with population size 2. In general with a chromosome length l, the

number of possible binary chromosomes is 2^l and the number of states is

(N + 2^l - 1)! / (N! (2^l - 1)!). For any realistic genetic algorithm the transition matrix becomes of

unmanageable size.

When simple Roulette wheel selection is the only genetic operator in the system (i.e.

no mutation and crossover), the transition matrix can be generated quite easily. It can

be used to examine the influence of selection pressure only on the system. f_1 is defined

as the fitness of an individual '1', and f_0 the fitness of a '0'. The probability of

choosing a '1' for the next population is simply p_1 = f_1 / Σf. When the number of ones

in the current population is given by i, p_1 can be expressed as:

p_1 = i·f_1 / (i·f_1 + (N-i)·f_0)


Defining the fitness ratio r = f_1 / f_0, the probabilities p_1 and p_0 of choosing a '1' or a '0' become:

p_1 = i·r / (i·r + (N-i))

p_0 = (N-i) / (i·r + (N-i))

The probability of transition from a state with i ones to a state with j ones is now:

p(i,j) = (N choose j) · (p_1)^j · (p_0)^(N-j) = (N choose j) · ( i·r / (i·r + (N-i)) )^j · ( (N-i) / (i·r + (N-i)) )^(N-j)

This equation defines the complete (N + 1) × (N + 1) matrix P[i,j]. With r = 1 both

'1' and '0' individuals have an equal fitness value. There is no preference for a state

(i.e. a population) with all-ones or a state with all-zeros, and the equation reduces to

the one for pure genetic drift. This genetic drift causes the simple GA to always

converge to a uniform population, in this case to a state with all-ones or one with all-

zeros; i.e. i ∈ {0, N}. These two states are absorbing states, meaning that once the

system is in such a state it will always stay there. In other words, the transition

probability from such a state to itself is one (p(i,i) = 1) and zero to all other states

(p(i,j) = 0 for j ≠ i). Absorption time is defined as the expected number of generations until

the genetic algorithm finds itself in one of the absorbing states. Absorption time

depends linearly on the population size, N [22].
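The selection-only transition matrix is small enough to build directly. The sketch below constructs P[i][j] from the formula above and, for r = 1, iterates the state distribution to show the drift toward the absorbing states i = 0 and i = N; the population size is an arbitrary choice for illustration.

```python
from math import comb

def transition_matrix(N, r):
    # P[i][j] = C(N, j) * p1^j * p0^(N-j), with p1 = i*r / (i*r + (N - i))
    P = []
    for i in range(N + 1):
        p1 = i * r / (i * r + (N - i))
        P.append([comb(N, j) * p1**j * (1 - p1)**(N - j) for j in range(N + 1)])
    return P

N = 6                               # arbitrary small population size
P = transition_matrix(N, 1.0)       # r = 1: pure genetic drift
assert all(abs(sum(row) - 1.0) < 1e-12 for row in P)
assert P[0][0] == 1.0 and P[N][N] == 1.0      # i = 0 and i = N are absorbing

# iterate the state distribution from a half-and-half population: virtually all
# probability mass ends up in the two absorbing (uniform) states
dist = [0.0] * (N + 1)
dist[N // 2] = 1.0
for _ in range(200):
    dist = [sum(dist[i] * P[i][j] for i in range(N + 1)) for j in range(N + 1)]
assert dist[0] + dist[N] > 0.999
```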

In [31] the above model is extended to include niched genetic algorithms based on

fitness sharing (see section 3.1.8). It is reported that in the case of 'perfect sharing',

where the niches do not overlap, the effect of fitness sharing ('niching force') balances

exactly the effect of selection/drift. The niching force is a stabilising one in that it tries

to spread the population out evenly over the niches, as opposed to the effect of

selection/drift. When overlapping niches are examined, as is often the case, it is found

that the niching force dominates for small overlaps but for larger overlaps its influence

decreases. As could be expected the absorption time is significantly larger than for the

simple GA without niching and grows exponentially with the population size.

In [12] the transient behaviour of GAs with a finite population size is modelled using

Markov chains. The concept of the state transition probability matrix P is extended to

a k-step transition matrix P^k. Using P^k, analysis is done on expected transient

behaviour of simple GAs. Questions like: "What is the probability that the GA


population will contain a copy of the optimum at generation k?" can be answered using

this approach for relatively simple function optimisation problems. Also, expected

waiting time analysis can be performed to answer questions like: "For how many

generations does the GA have to run on average before first encountering the

optimum?" The effects of crossover, mutation and fitness scaling can be seen in the

expected waiting time analysis and useful insights are gained. Future work in this

approach needs to concentrate on scaling up to problems of more realistic size. Also

visualisation techniques to display for example transition matrices are expected to be

of much help in gaining insight into the operation of GAs.

6. Implementing GAs

Although highly problem dependent, some general remarks can be made on what the

options are concerning coding (representation) and genetic operators for a GA system

and how they will affect the performance. A brief overview of the most commonly

used GA settings is given. Aside from coding the most crucial part of the set-up of a

GA is the fitness or evaluation function. General remarks concerning the fitness

function are given as well, but first some general comments are made about the

performance of a GA.

6.1 GA Performance

This section describes some of the main problems found while implementing genetic

algorithms.

• Premature Convergence

A common problem in GAs is premature convergence of the population to a local

optimum. This happens when a super-fit individual, representing a sub-optimal

solution, is chosen to reproduce many times and takes over the population in a few

generations. After this, the only way the GA can overcome this local optimum is by

the (re)introduction of new genetic material by means of mutation. This process is

then just a slow random search.

A second problem in GA performance is that once the population is near the global

optimum, further convergence is very slow. The average fitness will be high and there

is usually little difference between the fitness values. Therefore there is not enough

pressure to push the GA towards the global optimum.

• Epistasis

Problems that are difficult to solve for a GA can generally be classified as problems

with high epistasis. The level of epistasis in a certain problem-coding combination

reflects the dependence of gene expression in the phenotype on the values of other

genes. With epistasis, a specific variation in a gene produces a change in the fitness of

the chromosome depending on the values of other genes. The level of gene interaction


measures the extent to which the contribution to the fitness of a single gene depends

on the values of other genes in the chromosome. In the absence of epistasis a

particular change in a gene always produces the same change in the fitness of the

chromosome. As an example of such a problem consider the case of a binary string

where the fitness is simply equal to the number of ones in the chromosome:

f(x) = Σ_{i=1}^{l} a_i ,   x = (a_1, ..., a_l),  a_i ∈ {0,1}

There is no interaction between the genes at all. The fitness function is a composite of

the contributions of each gene.

A medium level of gene interaction can be defined where the problem is such that a

particular change in a gene always produces a change in the fitness function of the

same sign (or zero). In this case, the change in fitness depends on the values of other

genes. An example of such a problem is one where using binary coding the fitness is

one if all genes are one, and zero otherwise:

f(x) = 1, if a_i = 1 for all i
     = 0, otherwise

With a high level of gene interaction, a

particular change in a gene produces a change in the fitness that varies in magnitude

and in sign depending on the values of other genes. Commonly only problem-coding

combinations of this kind are referred to as having epistasis.

When a problem has little or no epistasis, a simple hill-climbing algorithm is sufficient. GAs are useful and successful when there is a medium level of

epistasis. The problems classified by the Building Block Hypothesis as being GA-

deceptive suffer from high epistasis. High epistasis simply means that building blocks

cannot form.
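The three interaction levels can be made concrete with toy fitness functions. In the sketch below, counting ones has no epistasis, the all-ones function from the text has the medium level, and a 'trap'-style function (an illustration of high epistasis, not an example from the text) changes both the magnitude and the sign of a gene's contribution depending on the other genes:

```python
L = 4                       # chromosome length for the toy examples

def ones(x):
    return sum(x)

def f_no_epistasis(x):      # fitness = number of ones: each gene contributes +1 on its own
    return ones(x)

def f_medium(x):            # 1 only when every gene is 1: a change has the same sign or is zero
    return 1 if ones(x) == L else 0

def f_high(x):              # a 'trap' function: the sign of a change depends on the other genes
    return L + 1 if ones(x) == L else L - ones(x)

def delta(f, x, i):
    # change in fitness when gene i is flipped from 0 to 1 against background x
    lo = list(x); lo[i] = 0
    hi = list(x); hi[i] = 1
    return f(hi) - f(lo)

assert delta(f_no_epistasis, [0, 1, 0, 1], 0) == delta(f_no_epistasis, [1, 1, 1, 1], 0) == 1
assert delta(f_medium, [0, 0, 0, 0], 0) == 0 and delta(f_medium, [0, 1, 1, 1], 0) == 1
assert delta(f_high, [0, 0, 0, 0], 0) == -1 and delta(f_high, [0, 1, 1, 1], 0) == 4
```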

The two obvious ways to tackle problems that have high epistasis are: design a coding

such that the problem becomes one with low or no epistasis, or: design the genetic

operators (crossover and mutation) so that the GA will have no problem with epistasis.

In effect this means that some prior knowledge about the objective function has to be

built into the GA system. Although it can be shown that in theory any high epistasis


problem can be reduced to one with low or even no epistasis, in practice this is very

hard to do. The effort needed to accomplish this may be greater than is needed to

actually solve the problem.

• Genetic Hill-Climbing

While in theory a genetic algorithm performs a global search of the solution space, in

some implementations the search is not as global as most theory would suggest. For

example in a Steady State GA with ranked replacement (a 'Genitor-type' GA) and a

relatively small population size, the search is often centred around the single fittest

individual. This is due to the very high selective force on above average individuals in

this type of GA. The Genitor-type GA 'pushes' very hard, often becoming stuck in

local optima. The performance of this GA with relatively small population sizes is

sometimes found to be largely independent of the population size, see e.g. [64] where

good solutions to a neural network weight optimisation problem were found even with

a population of size 5. Instead of intelligent hyperplane sampling, the GA basically

performs a local search around the fittest individual. It is said to perform genetic hill-

climbing. This does not necessarily mean that the algorithm does not function well; it

may still outperform conventional hill-climbing algorithms. However, the foundational

work of GAs will have to be extended to include the phenomenon of genetic hill-

climbing. GAs working as a genetic hill-climber are commonly found to require a

relatively high level of mutation for a good performance. The GA is in a way always

in a state of premature convergence and strong mutation is simply needed to make a

transition to another better state.

6.2 Fitness Function

For some problems the construction of the fitness function is obvious. For example in

function optimisation it is simply the function to be optimised. In general the fitness

function should reflect the performance of the phenotype or the 'real' value of an

individual and its choice is far from trivial. A special problem is when it is possible

for the GA to construct genotypes that translate into an invalid or meaningless

phenotype. These chromosomes have zero 'real' value and this should in some way be

reflected in the fitness function (i.e. they should be penalised). In genetic

programming, care is often taken to prevent the genetic operators from producing such

chromosomes. In most GAs however, this is practically impossible and possibly even

harmful to the GA. This is because although these chromosomes are essentially


meaningless, they may contain information that is crucial for the production of highly

fit meaningful chromosomes.

Some researchers say that the fitness function should be smooth and regular and

chromosomes that are close in the representation space should have similar fitness

values. If this were the case however, a simple hillclimbing algorithm could always

find the optimum and a genetic algorithm would not be needed. In practice the fitness

function will typically contain many local optima making it hard for a hillclimbing

algorithm to find the global optimum. However a fitness function may be constructed

that minimises the effect of local optima and enhances the better performance of a

GA.

6.3 Coding

In the early days of the field of genetic algorithms researchers practically always used

a binary coding scheme following the example of Holland. Since then many variations

have been used such as real-valued and symbolic coding.

A simple analysis of the Schema Theorem seems to suggest that alphabets of low

cardinality (small alphabets) yield the highest rate of schema processing: the amount

of implicit parallelism is the highest. This is because the number of schemata available

for genetic processing is the highest when low cardinality alphabets are used. Since

genetic algorithms are run on computers, all genetic information is ultimately stored as

bits. Supposing we have an alphabet of cardinality k, then each position in the

chromosome can represent k + 1 schemata. Since each position represents log_2 k bits,

the number of schemata per bit of information, n_s, is:

n_s = (k + 1)^(1 / log_2 k)

Therefore it is easy to prove [20] that chromosomes coded with smaller alphabets

represent a larger number of schemata than ones with larger alphabets. Because of this

apparent higher rate of schema processing of small alphabets traditionally only binary-

coded chromosomes were used. However, GAs using codings with high cardinality

alphabets (large alphabets) such as real-valued codings have been shown to work well

in certain applications. Goldberg [20] reconciles this with the Schema Theorem as

described in a following section.
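The formula is easy to evaluate; the quick check below (a sketch, not the proof given in [20]) shows that n_s is maximal for the binary alphabet and strictly decreases as the cardinality grows:

```python
from math import log2

def schemata_per_bit(k):
    # n_s = (k + 1)^(1 / log2 k): schemata represented per bit of stored information
    return (k + 1) ** (1 / log2(k))

values = [schemata_per_bit(k) for k in (2, 4, 8, 16, 32)]
assert abs(values[0] - 3.0) < 1e-9                       # binary: 3 schemata per bit (0, 1, *)
assert all(a > b for a, b in zip(values, values[1:]))    # strictly decreasing in k
```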


Intuitively, the coding should be such that the representation and the problem space

are close together; i.e. a 'natural' representation of the problem. This also allows the

incorporation of knowledge about the problem domain into the GA system in the form

of special genetic operators. The GA can then be made more problem specific and

achieve better performance. One concept of particular importance is that points that

are close together in the problem space should also be close together in the

representation space.

Some problems can be expressed very efficiently using binary coding.

However the choice of which coding to use is usually not straightforward. Quite often

the problem to be solved is integer or real-valued and genes can be chosen to be

binary or real-valued. An example of a binary coding for a real-valued problem is the

following. Suppose the problem is to optimise a function f(x_1, x_2, x_3) that takes real-

valued arguments: x_1, x_2, x_3 ∈ [0,1]. A chromosome has to represent these three

arguments. Each argument can for example be represented by 8 bits, making the

chromosome a binary-valued string of length 24. In 'standard' binary coding the

substring 00000000 will correspond to the value 0.0, 00000001 to 1/256=0.0039, ... ,

and 11111111 to 1.0. An example of such a binary-coded chromosome is:

[figure: a 24-bit binary chromosome divided into three 8-bit parameters]

Some authors use the term gene for a substring of 8 bits representing a single real-

valued argument. This is not very appropriate since a gene in GA-theory is considered

to be an unchangeable piece of data. In this example the standard GA will work on the

bitstring without any knowledge of the actual representation within it. A gene simply

corresponds to a single bit.
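A decoding routine for such a 24-bit chromosome might look as follows (a sketch; mapping all-ones to exactly 1.0 means dividing by 2^8 - 1 = 255, so 00000001 decodes to 1/255 ≈ 0.0039):

```python
def decode(chromosome, n_params=3, bits=8):
    # split the bitstring into n_params substrings of `bits` bits each and map
    # every substring onto [0, 1]: 00000000 -> 0.0, 11111111 -> 1.0
    assert len(chromosome) == n_params * bits
    values = []
    for p in range(n_params):
        sub = chromosome[p * bits:(p + 1) * bits]
        n = int("".join(str(b) for b in sub), 2)
        values.append(n / (2**bits - 1))
    return values

x1, x2, x3 = decode([0] * 8 + [1] * 8 + [0] * 7 + [1])
assert (x1, x2) == (0.0, 1.0) and abs(x3 - 1 / 255) < 1e-12
```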


As can be seen in this example the effect on the phenotype of changing a single bit in

the chromosome depends on its position within the string. When the right-most bit of

an 8 bit substring is changed the effect is very small, but when the left-most bit is

changed the effect is relatively large.

In this coding, the Hamming distance (the number of different bit positions) between

two individuals does not reflect the distance between the two in the problem space.

This makes it very difficult for the mutation operator to make small changes in the

values represented in the chromosome and is the reason why Gray coding is often used

instead of 'normal' binary coding to code real values. In Gray coding adjacent real

values (or integer values) differ from each other in only one bit position. Going

through the real values represented by a Gray coding from low to high only requires

flipping one bit at a time. GAs using Gray coding are often found to perform better

than ones with standard binary coding when solving real-valued problems.
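The standard binary-reflected Gray code conversion (a sketch using the usual bitwise formulation) makes the one-bit-difference property easy to verify:

```python
def binary_to_gray(n):
    # binary-reflected Gray code: XOR the number with itself shifted right by one
    return n ^ (n >> 1)

def gray_to_binary(g):
    # invert the encoding by folding the shifted value back in until exhausted
    n = 0
    while g:
        n ^= g
        g >>= 1
    return n

# adjacent 8-bit values differ in exactly one bit position under Gray coding
for v in range(255):
    diff = binary_to_gray(v) ^ binary_to_gray(v + 1)
    assert bin(diff).count("1") == 1

# the conversion is a bijection: decoding recovers every value
assert all(gray_to_binary(binary_to_gray(v)) == v for v in range(256))
```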

When real-valued coding is used chromosomes are simply a string of real values: x = (a_1, ..., a_l), a_i ∈ ℝ.

In [20] Goldberg gives an explanation for the success and failure of real-valued

codings based on the Schema Theorem by suggesting that the GA breaks the original

coding into 'virtual alphabets' of higher cardinality. The real-valued GA still has a

high rate of implicit parallelism.

One advantage of real-valued coding concerns precision. Binary coding of real-valued numbers can suffer from loss of precision

depending on the number of bits used to represent one number. Also in real-valued

coding chromosome strings become much shorter. Since it is often thought that GAs

fail for larger problems because of the very large string-sizes involved, this can be an

important aspect. Furthermore real-valued coding gives a greater freedom to use

special crossover and mutation techniques that can be beneficial to the performance of

the GA. Some of these issues are discussed in section 6.5. A final point is that for real-

valued optimisation problems real-valued coding is simply much easier and more

efficient to implement, since it is conceptually closer to the problem space. In [48] and

related work, comparisons were made between binary

and real-valued coding; the real-valued coding produced better results.

In symbolic coding, chromosomes are strings over an alphabet of symbols:

x = (a_1, ..., a_l),  a_i ∈ {a, b, c, ...}

Often the genes are simply implemented as unsigned integer values taken from a

certain range. The main characteristic of symbolic coding is that there is no measure

of distance between two symbols. For example symbols that are 'adjacent' in the

alphabet are not considered to be closer to each other than any other two symbols.

The codings described above are all homogeneous in the sense that all the genes in the

chromosome have identical codings (i.e. one alphabet A). It is possible for a

chromosome to have different parts each having their own coding. This can be seen as

the general case of chromosomal representation.

x = (a_1, ..., a_l),  a_i ∈ A_i

Homogeneous coding is the special case where A_i = A_j for all i, j. Alternatively, a

chromosome could consist of a part that is binary coded (A = {0,1}) and a part that

uses real-valued coding (A = ℝ):

x = (a_1, ..., a_m, a_{m+1}, ..., a_l),   a_i ∈ {0,1} for 1 ≤ i ≤ m,   a_i ∈ ℝ for m < i ≤ l

This type of coding poses extra constraints on the genetic operators. The normal

crossover operator can still be applied even when a substring is swapped that contains

more than one coding because it leaves the genes intact. The mutation operator has to

be changed in that it re-initialises a gene depending on the coding of that gene since

the alphabets of the different codings will not be the same.


6.4 Selection Schemes

A brief overview of the various selection or reproduction schemes is given here. In

[21] a comparative analysis is given of these selection schemes.

Proportionate reproduction schemes select individuals based on their fitness relative to

the rest of the population. The Roulette Wheel or Monte Carlo selection scheme is the

most widely used and was described in section 3.1.3. Other proportionate

reproduction schemes include stochastic remainder selection. In the stochastic

remainder selection scheme the expected number of copies for each individual,

E(x) = f(x) / f̄ (where f̄ is the average fitness), is calculated and the integer portions of these counts are assigned

deterministically. The remainders are used to fill the rest of the population

probabilistically, usually by means of a Roulette wheel selection scheme.
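A sketch of stochastic remainder selection follows (the population and fitness values are arbitrary illustrations, and the remainder draw shown here is one simple with-replacement variant): integer parts of the expected counts E(x) = f(x)/f̄ are copied deterministically, and the fractional remainders fill the rest of the pool by a roulette-wheel draw.

```python
import random

def stochastic_remainder_selection(population, fitnesses, rng):
    # expected copies E(x) = f(x) / average fitness; their sum equals the pool size
    f_avg = sum(fitnesses) / len(fitnesses)
    expected = [f / f_avg for f in fitnesses]
    pool = [ind for ind, e in zip(population, expected) for _ in range(int(e))]
    remainders = [e - int(e) for e in expected]
    while len(pool) < len(population):      # fill up with roulette-wheel draws
        pick = rng.uniform(0, sum(remainders))
        acc = 0.0
        for ind, rem in zip(population, remainders):
            acc += rem
            if pick <= acc:
                pool.append(ind)
                break
    return pool

pool = stochastic_remainder_selection(["a", "b", "c"], [1.0, 1.0, 2.0], random.Random(0))
assert len(pool) == 3 and pool.count("c") >= 1    # "c" gets its integer copy for sure
```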

One way to overcome the problems mentioned in section 6.1 is to use a fitness

remapping scheme. Before individuals are selected using proportionate reproduction,

their raw fitnesses are remapped to new values. Two such techniques are described

below.

• Fitness Scaling

Instead of using the actual fitness values in the selection mechanism, the fitness values

of all individuals are scaled to a certain range. This is commonly done by first

subtracting a fixed value from the fitnesses. These fitnesses are then divided by their

average value to produce the adjusted fitnesses. When fitness scaling is used the

amount of relative selective pressure on an individual can be controlled. Very fit

individuals no longer produce excessive numbers of offspring. There is a price to be

paid however. When there is a single super-fit (or super-unfit) individual in the

population, fitness scaling leads to overcompression. When just one individual has a

fitness many times higher than any other, fitness scaling will result in flattening out the

fitness distribution of the rest of the individuals. They will obtain near identical fitness

values, and the difference in selective pressures between them will almost be lost.

Performance suffers if there are extremely valued individuals.

In a related remapping technique, instead of scaling the

fitnesses to a certain interval, the fitnesses are remapped so that their total equals a

certain value (usually 1).

• Fitness ranking

The dominance of extreme individuals may be overcome by fitness ranking. Here

individuals are ranked based on their fitness and then the new reproductive fitness

values are given to them based solely on their rank, usually using a linear function

(linear ranking). Similar to fitness scaling, fitness ranking ensures that the ratio of

maximum to average fitness is fixed. However it also spreads out the remapped

fitnesses evenly over the interval. The problem of overcompression is gone. It no

longer matters whether the fittest individual is extremely fit or only just fitter than its

nearest competitor. By means of the ranking function the selective pressure of

individuals relative to each other can be controlled. Non-linear ranking may also be

used where the ranking function is such that the remapped fitness of an individual is

for example an exponential function of its rank (exponential ranking).
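A common linear ranking scheme (Baker's parameterisation, used here as an illustrative sketch) gives the best individual an expected count s ∈ [1, 2] and the worst 2 - s, interpolating linearly in between; note how a super-fit individual receives the same remapped value as one that is only marginally best:

```python
def linear_ranking(fitnesses, s=1.7):
    # remapped fitness depends only on rank: the worst gets 2 - s, the best gets s
    N = len(fitnesses)
    order = sorted(range(N), key=lambda i: fitnesses[i])    # indices, worst to best
    remapped = [0.0] * N
    for rank, i in enumerate(order):
        remapped[i] = (2 - s) + 2 * (s - 1) * rank / (N - 1)
    return remapped

# the super-fit individual gets the rank-based value s, not a huge multiple
r = linear_ranking([10.0, 3.0, 1000.0, 11.0])
assert r[2] == max(r) and abs(max(r) - 1.7) < 1e-12
assert abs(sum(r) - len(r)) < 1e-12     # expected counts sum to the population size
```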

• Tournament Selection

In this process, a number of individuals, set by the tournament size, is selected from

the population at random. In deterministic tournament selection, the fittest individual

of this tournament is then chosen to reproduce. The winning individual can also be

chosen probabilistically. The simplest version, binary tournament selection, has a

tournament size of two. The selection pressure of individuals can be adjusted by the

tournament size or the 'win-probability'. Larger tournament sizes increase selection

pressure, since higher fitness individuals have more chance to win tournaments.

Tournaments can also be used for replacement: individuals to be replaced are selected as the losers of a tournament. They can be chosen from the

same tournaments as their parents or from independent tournaments.
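Deterministic tournament selection is only a few lines; in this sketch the fitness function, population, and tournament size are illustrative choices:

```python
import random

def tournament_select(population, fitness, size=2, rng=random.Random(42)):
    # draw `size` individuals at random and return the fittest of them;
    # size = 2 gives binary tournament selection
    contestants = rng.sample(population, size)
    return max(contestants, key=fitness)

pop = list(range(10))
# with a tournament as large as the population, the best individual always wins
assert tournament_select(pop, fitness=lambda x: x, size=len(pop)) == 9
# binary tournaments favour fitter individuals without any fitness averaging
winners = [tournament_select(pop, fitness=lambda x: x) for _ in range(500)]
assert sum(winners) / len(winners) > sum(pop) / len(pop)
```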

In [21] population dynamics based on birth, life, and death rates are used to support the

viewpoint that the Genitor-type Steady State Genetic Algorithm is just a special

selection mechanism instead of a different kind of GA. It is suggested that the

behaviour of the Genitor-type GA is primarily caused by its extreme 'pushing' force

or selective pressure of above average individuals. The growth rate of the fittest


individual is very high, where the growth rate is defined as the proportion the

individual takes up in the mating pool relative to the proportion it takes up in the

current population. A normal GA can also gain high growth rates when an appropriate

selection scheme is used (such as non-linear ranking), and it is suggested that it should

then show similar behaviour to the Genitor-type GA.

6.5 Crossover, Mutation and Inversion

Some extensions of the basic crossover and mutation operators are described here.

Very often the genetic operators have been adjusted to suit a specific problem and/or

coding [65], so that general remarks cannot be made.

In the standard GA, crossover and mutation operators work with fixed rates. Lately

much work is being done on adaptive rates for these operators. Concepts being

investigated include coding the values of the rates in the chromosomes to let the GA

find optimum values or using a diversity measure of the population to control the

rates. For example when the diversity is very low, new genetic information can be

introduced by setting the mutation operator temporarily to a high value. Yet another

idea comes from the field of simulated annealing, where the rates of the operators are

controlled using a 'cooling scheme'.
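One way to realise the diversity-based control mentioned above is sketched below. The Hamming-distance measure, the threshold, and both rates are illustrative assumptions, not values from the text:

```python
def diversity(population):
    """Mean pairwise Hamming distance between bit strings, normalised to [0, 1]."""
    n, length = len(population), len(population[0])
    total = pairs = 0
    for i in range(n):
        for j in range(i + 1, n):
            total += sum(a != b for a, b in zip(population[i], population[j]))
            pairs += 1
    return total / (pairs * length)

def adaptive_mutation_rate(population, base=0.01, boost=0.2, threshold=0.05):
    """Temporarily raise the mutation rate when diversity collapses."""
    return boost if diversity(population) < threshold else base
```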

6.5.1 Crossover

The standard one-point, two-point and uniform crossover operators were described in

section 3.1.3 for binary-valued chromosomes. They can be used in the same form for

any coding. These crossover operators swap entire genes or series of genes between

individuals and therefore can never change the value of a gene.

An extension for real-valued codings is the linear crossover operator. Introduced in [65], it is implemented there in the following manner. When

two parents, x1 and x2, are chosen to reproduce, three new points in the search space

are made. One point is the midpoint between x1 and x2, found by taking the average of

all the gene values of both parents; i.e. (x1+x2)/2. The other two points also lie on the

line through x1 and x2 and are (3x1-x2)/2 and (-x1+3x2)/2. The two offspring are

determined as the best two of these three points. The idea behind this process is that

when an optimum in the search space lies between the two parents, it will not be

reached using normal crossover. Normal crossover only swaps coordinates; i.e.

samples hyperplanes. Linear crossover, however, produces offspring that lie in between

or beyond the parents and can therefore reach such an optimum,

since it introduces new genetic material. It is highly disruptive of schemata; all the

schemata contained in the parents can be lost. In [65], GAs that used a mixture of

normal and linear crossover (a chance of 50% each) gave superior results on a variety

of problems when compared to GAs with only one type of crossover operator.
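The three candidate points and the best-two selection can be sketched directly from the formulas above; the fitness argument here is any function scoring a candidate point (my interface choice):

```python
def linear_crossover(x1, x2, fitness):
    """Linear crossover: form the midpoint and the two extrapolated points
    on the line through the parents, and keep the best two as offspring."""
    candidates = [
        [(a + b) / 2 for a, b in zip(x1, x2)],       # (x1 + x2) / 2
        [(3 * a - b) / 2 for a, b in zip(x1, x2)],   # (3*x1 - x2) / 2
        [(-a + 3 * b) / 2 for a, b in zip(x1, x2)],  # (-x1 + 3*x2) / 2
    ]
    candidates.sort(key=fitness, reverse=True)
    return candidates[0], candidates[1]
```

If the optimum lies between the parents, the midpoint candidate can reach it even though ordinary crossover never could.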

6.5.2 Mutation

The normal mutation operator as described in section 3.1.3 for binary coding is easily

extended to any representation. With probability p_m it will re-initialise the value of a

gene. The set of possible gene values will usually be the same as that used for

initialising; i.e. the alphabet.

When the chromosomes consist of real-valued genes, another form of the mutation

operator may be used. Instead of re-initialising the value of the gene, a small randomly

selected value (usually Gaussian) is added to it. This version of the mutation operator

is called creeping mutation. It can be seen as a local search mechanism within the GA

and can operate simultaneously with the normal mutation operator. When creeping

mutation is the only mutation operator used in the GA, it is empirically found that the

mutation rate should be much higher than usual and mutation rates of up to 0.1 have

been used. With creeping mutation, gene values can be obtained that lie outside the

range of the initial population. If the genes are restricted to a certain range (a_i ∈

[min, max]) then the creeping mutation operator can simply be altered so that it cannot

take a gene value beyond that range.
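The range-limited creeping mutation just described can be sketched as follows; the default rate of 0.1 matches the empirical figure quoted above, while the noise width and bounds are illustrative assumptions:

```python
import random

def creep_mutate(chromosome, rate=0.1, sigma=0.05, bounds=(0.0, 1.0)):
    """Creeping mutation: with probability `rate`, add small Gaussian noise
    to a gene, clamping the result so it cannot leave the allowed range."""
    lo, hi = bounds
    mutated = []
    for gene in chromosome:
        if random.random() < rate:
            gene = min(hi, max(lo, gene + random.gauss(0.0, sigma)))
        mutated.append(gene)
    return mutated
```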

6.5.3 Inversion

While not part of the standard GA toolkit, inversion is often added to a GA system to

operate alongside crossover and mutation. It is called upon in a similar way as

crossover and mutation and operates on a single chromosome. A chromosome is

inverted with probability p_i. Inversion randomly picks two points in the chromosome

and inverts the order of the substring between these points. The 'meaning' of the

chromosome remains the same however. The only thing that is changed is the order of

the coding. Inversion requires that genes carry labels. For example, a gene a_i^j

represents the j-th gene, having a value a_i. The order in which the genes appear in the

chromosome does not have to follow the labels. For example, inverting the middle section of

(a1, a2 | a3, a4, a5 | a6) yields (a1, a2 | a5, a4, a3 | a6).


Both chromosomes before and after inversion code exactly the same information and

represent the same phenotype; only the order of the genes is changed. The order of the

genes can play an important factor in the GA performance. The building block

hypothesis for example requires that related genes should be close together on the

chromosome in order for building blocks to form. Inversion is an operator that

changes the order of the genes and can therefore improve GA performance in some

circumstances.
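The operator can be sketched with genes represented as (label, value) pairs, so that the label-to-value mapping — the phenotype — is provably untouched. This is an illustrative sketch, not code from the text:

```python
import random

def invert(chromosome):
    """Inversion: reverse the order of a random substring of (label, value)
    genes. The genotype ordering changes; the phenotype does not."""
    i, j = sorted(random.sample(range(len(chromosome) + 1), 2))
    return chromosome[:i] + chromosome[i:j][::-1] + chromosome[j:]

genes = [(1, 'a'), (2, 'b'), (3, 'c'), (4, 'd')]
mutated = invert(genes)  # same label -> value map, possibly new order
```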

7. Hybridisation of Evolutionary

Computation and Neural Networks

Evolutionary Computation (EC) can be used in the field of neural networks in several

ways. It can be used:

• to optimise the weights of a neural network

• to analyse a neural network

• to generate the architecture of a neural network

• to generate both the neural network architecture and its weights

Each of these approaches is briefly described below; also see [62]. The last two of

these are dealt with together as many remarks concern both and the distinction is often

subtle.

7.1 Evolutionary Computing to Optimise the Weights of a NN

The genetic algorithm may be used to optimise the weights of a neural network and

provides an alternative to a learning algorithm such as back propagation (BP) which

often gets stuck in local minima. The genetic algorithm generally performs a global

search of the weight space and is therefore unlikely to get stuck in a local minimum.

Genetic algorithms (or evolutionary computation in general) do not use error-gradient

information and, unlike algorithms such as BP, they can be used where this information is

not available or is computationally expensive. It also means that the activation

function of the neurons does not have to be differentiable or even continuous. Genetic

algorithms can in principle be used to train any type of neural network including fully

recurrent networks. A problem often encountered with genetic algorithms is that they

are quite slow in fine tuning once they are close to a solution. Therefore the

hybridisation of GA and BP, where BP is used to fine-tune a near-optimal solution

found by GA, has proven to be successful [29].


The members of the population are the weights of the network which are coded as

strings. When real valued weights are used, they are often coded into a binary string

using a binary or a Gray coding mechanism, although real-valued coding is also

possible. The fitness measure is normally calculated as the performance error of the

network on the training data. The genetic algorithm can then be classified as a

supervised learning algorithm.
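A minimal sketch of this fitness evaluation is shown below, assuming a real-valued chromosome decoded into a fixed 2-2-1 network with tanh activations. The topology, activation function, and gene layout are my illustrative choices, not the coding used in [29]:

```python
import math

def forward(weights, x):
    """Decode a 9-gene real-valued chromosome into a fixed 2-2-1 network
    (two hidden tanh neurons with biases, one tanh output) and run it."""
    w = iter(weights)
    hidden = [math.tanh(next(w) * x[0] + next(w) * x[1] + next(w))
              for _ in range(2)]
    return math.tanh(next(w) * hidden[0] + next(w) * hidden[1] + next(w))

def fitness(chromosome, facts):
    """Negated sum-squared error on the training data: higher is better."""
    return -sum((forward(chromosome, inputs) - target) ** 2
                for inputs, target in facts)
```

A GA would evolve the 9-element weight vectors, ranking them by this fitness; a BP fine-tuning stage could then start from the best vector found.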

In [44] a GA is used to evolve ecological neural networks that can adapt to their

changing environment. This is achieved by letting the fitness function, which in this

case is specific to each individual, co-evolve with the weights of the network. A

special feature of this research is that there is no reinforcement for 'good' behaviour

of the network; the network just tries to model or adapt to the world in which it lives.

The system can be classified as open-ended evolution.

De Garis [18] uses a method which is based on fully self-connected neural network

modules. It is shown that using this approach a network can be taught a task even

though the time-dependent input varies so fast that the network never settles down.

The system does not use a crossover operator (it could therefore be called

evolutionary programming) and is used to teach a pair of sticks to walk.

In [28] and [52], a genetic algorithm is used on a fixed three layer feedforward

network to find the optimal mapping from the input to the hidden layer (i.e. the set of

optimal hidden targets). In the evaluation phase, the weights from the hidden to the

output layer are learned using a simple supervised gradient descent rule. The search

space is not the weight space but the hidden target space. It is suggested in [28] that

the hidden target space might have more optima than the weight space and that finding

the optimum will therefore be easier.

In [49], [50] and [64], instead of binary coded chromosomes, chromosomes with real-

valued genes are used. Satisfactory results are reported using a Genitor type Steady

State Genetic Algorithm with a relatively small population size of 50. Instead of the

normal mutation operator, creeping mutation is used where a small random value is

added to the gene. In [49] several special genetic operators are investigated such as a

crossover operator that swaps groups of weights corresponding to a neuron. This

specific operator did not show any obvious improvement. Experiments with

decreasing population size strongly suggest that genetic hill-climbing (see section

6.1.1) is the main search mechanism in these implementations. Even with a population

size as small as 5, good results were obtained. The genetic algorithm is said to


outperform back propagation on certain problems that require a neural network with

over 300 connections.

7.2 Evolutionary Computing to Analyse a NN

Although this combination of GAs and NNs is not common, GAs can be used to

analyse or explain neural networks. In [13] GAs are used as a neural network

inversion tool in that they can find the input patterns that yield a certain output of the

network.

7.3 Evolutionary Computing to Optimise a NN Architecture and its Weights

In this hybridisation of neural networks and evolutionary computation the architecture

of the neural network is automatically generated using evolutionary computation. An

individual in the population of the evolutionary computation algorithm codes a neural

network structure and sometimes its weights. During evaluation an individual is

translated into a neural network structure. Commonly this network is then trained

using a separate training module such as back propagation or an EC weight

optimisation algorithm. This is illustrated in Figure 7.1.


As the chromosomes usually do not contain information concerning the weights of the

network, these have to be set to an initial (random) value. After a network is trained it

is evaluated on its performance, which is reflected in the fitness measure. The

performance measure can simply be the overall error on the training data, but often

reflects other properties such as network size as well. Instead of testing the network on

the training data it can be tested on (real) test data as well. In a real-life application

however the actual test data will not be available until the network is used for its task

(otherwise it might as well be included in the training data). The training set can be

divided into two parts though, one part serving as training data for the training module

and the other part used in the evaluation phase as a test of the generalisation

performance of the network.

However the EC system as in Figure 7.1 in theory can act as a real-time system where

networks are evaluated on real data taken from the environment and the end-user

simply uses the best individual found so far. The EC system is run continuously and

when a better individual is found, then the one currently in use is replaced. This is an

evolutionary adaptive system operating in a non-stationary environment, pictured in

Figure 7.2. It will not be of any use in a stationary environment because once the

optimum network is found the EC system becomes redundant.


Figure 7.2 The EC system as an adaptive system in a non-stationary environment.

The network as used by the end-user, of course, operates in the same non-stationary

environment. In practice such a system would be very hard to implement. The EC

system for example must have some way of testing the networks it generates on the

task they are going to be used for by the end-user in order to determine their fitness

values. Thus a model of the task dictated by the end-user has to be built into the EC

system. Furthermore special care has to be taken in implementing the specific EC

system to avoid premature convergence. Also the information that is fed into the EC

system from the environment has to be chosen carefully, since, if it is only based on the

present situation, useful information that was observed in the past may be lost. As

described in section 4.4.1, there is evidence suggesting that a process very similar to that

pictured in Figure 7.2 might be at work in the human brain. The brain makes a model

of the outside world and ideas are generated and tested according to a Darwinian

process resulting in the fittest one actually being used.

The performance of a neural network usually depends on the values of the initial

weights. Therefore the networks should be trained several times using different

random initial weights each time and the results be averaged, in order to get a good

performance measure. This can cause the approach to become very slow, see e.g. [4].

In some applications the generation of the network architecture is done simultaneously

with the learning of the weights. The chromosomes not only code the architecture of

the network but they also code the values of the weights.


There are several ways to encode a neural network architecture as a chromosome that

can be used by an evolutionary computation algorithm. These methods can be divided

into the following approaches:

• direct encoding

• parametrised encoding

• grammar encoding

These methods as well as their applications are described below in more detail.

7.3.1 Direct Encoding

In the direct encoding methods the entire neural network structure is directly

represented by the chromosome. Every connection of the network is directly

represented in the chromosome, usually by a single gene. Direct encoding is often

implemented by encoding the network as a connectivity matrix. This matrix has size N

x N, where N is the maximum number of neurons in the network. A '1' in the matrix

denotes a connection between two neurons, a '0' no connection. When a feedforward

neural network is coded using a connectivity matrix, only the upper right half of the

matrix is used. A common problem found when using a direct encoding method is that

the chromosomes become very large with increasing network size. This usually results

in poor convergence of the algorithm.
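For a feedforward network, the upper-triangular connectivity matrix can be decoded from a flat chromosome as sketched below (an illustrative decoding, showing why the chromosome length grows quadratically with network size):

```python
def decode_feedforward(bits, n):
    """Build the upper-triangular connectivity matrix of a feedforward
    network from a flat chromosome; entry (i, j), i < j, is 1 when
    neuron i feeds neuron j."""
    matrix = [[0] * n for _ in range(n)]
    k = 0
    for i in range(n):
        for j in range(i + 1, n):
            matrix[i][j] = bits[k]
            k += 1
    return matrix

# 4 neurons need n*(n-1)/2 = 6 genes; the chromosome grows as O(n^2)
m = decode_feedforward([1, 0, 1, 1, 0, 1], 4)
```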

An alternative approach is to use a genetic algorithm where the topology and weights

are encoded as variable-length binary strings [46]. In [11] a structured GA is used that

simultaneously optimises the neural network topology and the values of the weights. A

two-level genetic structure represented by a single binary string is used where one

level defines the connectivity (topology) and the other the values of the weights and

biases. It was found that although the algorithm worked well on small problems like

XOR, it could not scale up properly to bigger real-world problems.

In [5] feedforward neural networks are generated with a GA, using a direct encoding

scheme where every gene in a chromosome represents a connection between two

neurons. The problem of competing conventions is tackled here by introducing

connection specific distance coefficients in the genetic material. For each functional

mapping or phenotype, the structural mapping or genotype with the shortest total

connection length is preferred. This approach is also known as 'restrictive mating' [1]

and is one of the niching methods described in section 3.1.8. In this way, some of the

structural information of the phenotypes is incorporated

into the genotypes. A disadvantage of the system proposed is that a maximum size

neural network topology, including the number of hidden layers used, needs to be

specified in advance.

Jacob and Rehder [32] use a grammar-based genetic system, where topology creation,

neuron functionality creation (e.g. activation functions) and weight creation are split

up into three different modules, each using a separate GA. The modules are linked to

each other by passing on a fitness measure. The grammar used is such that a neural

network topology is represented as a string consisting of all the existing paths from

input to output neurons. This is not a grammar encoding method such as the ones

described in section 7.3.3 as no grammar rewriting or production rules are encoded.

In [26] Happel and Murre report an approach where modular neural networks are

generated using a direct encoding scheme. The system implements modularity, where

modularity is meant as the grouping of certain neurons in the network into a module.

When such a module of neurons is connected to another module, all the neurons in the

two modules are connected to each other. An advantage of using modular neural

networks is that the weight space of the network is reduced. This has a positive effect

on both the generalisation capability and the time needed to learn the network. The

networks used are made up of so called CALM modules and are used for unsupervised

categorisation.

In another approach, evolutionary programming (EP) is used, in which networks evolve using both a parametric mutation (mutation of the weights) and a

structural mutation. It is argued that EP is a better choice for this task than GA, mainly

because it is not clear that there exists an appropriate interpretation function between

the recombination and evaluation space for the application of neural network design.

In [47] EP is used where the initial network is a 3 layered fully connected feedforward

network and the EP algorithm is used to prune connections.

A genetic programming approach is described in [40] and [41], and consists of directly encoding a neural

network in the genetic tree structure used by genetic programming. The neural

network topology as well as the values of the weights are encoded in the chromosomes

and they are trained simultaneously. Only tree-structured neural network architectures

can be generated using this method.


An extension of this method is Breeder

Genetic Programming (BGP). The neural networks are still represented by trees but

fewer restrictions are put on the possible architectures. However only integer valued

weights and bias values are possible. Since the hidden neurons are defined in a tree

like fashion and each output is represented by its own genetic tree, there still are

restrictions on the possible architectures. The method uses Occam's Razor where less

complex neural networks are preferred over more complex ones, providing they have

similar performances. This offers a way to balance neural network accuracy and

structural complexity.

7.3.2 Parametrised Encoding

The above methods can be classified as 'strong or low-level representations' because

the complete network topology is coded in the chromosomes. When 'weak or high-

level representations' are used, the chromosomes do not contain the complete network

topology. Instead they consist of more abstract terms, like 'the number of hidden

neurons' or 'the number of hidden layers'. This usually results in a limitation of the

possible neural network architectures; e.g. only full connectivity between the layers

and no connections from input neurons to output neurons.

Parametrised encodings have, for example, been used for combined structure, connectivity and weight optimisation. The network structure here is defined

as the number of layers and the number of neurons in every layer.

7.3.3 Grammar Encoding

Grammar-based systems are often found to perform better than methods using a direct

encoding method for larger sized neural networks. This is due to the fact that when a

direct encoding method is used, the chromosomes and accordingly the search space

for the algorithm, become very large as the network size is increased. When grammar

encodings are used this is not the case and these methods usually show much better

performances when the network size is scaled up. Grammar encodings used to

represent a neural network architecture can be divided into matrix grammars and

graph grammars.

Kitano [38], [39] uses a GA-based matrix grammar approach where chromosomes

code grammar rewriting rules that can be used to build the connectivity matrix. A rule

in this grammar rewrites a single character into a 2x2 matrix. After a certain number

of rewriting steps, the resulting matrix is interpreted as a connectivity matrix. The method

generates neural networks that show many regular connection patterns, and can thus

be said to show some sort of network modularity. Drawbacks of this approach include

that in order to generate a network with N neurons, a matrix is needed that has size

M x M, where M is the smallest power of 2 greater than N. Also, when feedforward

neural networks are generated, more than half of the connectivity matrix is unused.

Another drawback is that the method is not clean, in that rewriting rules may be

generated that rewrite the same character, or for some characters no rewriting rules at

all may be found. One more weak point is that networks can be generated that have

neurons without incoming or without outgoing connections and therefore a pruning

algorithm is needed. In [39] Kitano points to a possible solution to the matrix size

problem by using flexible size matrix rewriting. Here he also lets the genetic algorithm

generate the initial weights of the network that is trained using back propagation.
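The recursive 2x2 rewriting can be sketched as follows. The rule set here is hand-written purely for illustration; in Kitano's system the GA evolves the rules:

```python
def rewrite(symbol, rules, depth):
    """Kitano-style matrix rewriting: expand a start symbol `depth` times,
    each rule replacing one character by a 2x2 block, until the terminal
    characters '0'/'1' form the connectivity matrix."""
    if depth == 0:
        return [[int(symbol)]]
    (a, b), (c, d) = rules[symbol]
    ul, ur = rewrite(a, rules, depth - 1), rewrite(b, rules, depth - 1)
    ll, lr = rewrite(c, rules, depth - 1), rewrite(d, rules, depth - 1)
    top = [r1 + r2 for r1, r2 in zip(ul, ur)]
    bottom = [r1 + r2 for r1, r2 in zip(ll, lr)]
    return top + bottom

# A hand-written (hypothetical) rule set; the GA would evolve these rules.
rules = {
    'S': (('A', 'B'), ('B', 'A')),
    'A': (('1', '0'), ('0', '1')),
    'B': (('0', '0'), ('0', '0')),
}
matrix = rewrite('S', rules, 2)  # a regular 4x4 connectivity pattern
```

Because whole sub-blocks are generated by the same rule, the resulting matrices naturally show the repeated, regular connection patterns mentioned above.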

Gruau [24], [25] uses a graph grammar system called Cellular Encoding. The graph

grammar rules work directly on the neurons (called cells) and their connections and

include various kinds of cell divisions and connection pruning rules. The grammar

rules are coded in a tree structure, and a genetic programming system is used. The

values of the binary weights and of neuron bias values can be coded into the

chromosomes as well. For some problems the generated boolean networks are further

trained using a cross between back propagation and the so-called bucket brigade

algorithm. The approach can generate networks that are highly modular, where

modularity is defined as follows: consider a network N1 that includes, at many

different places, a copy of the same subnetwork N2. An encoding scheme is modular if

the code of N1 includes the code of N2 a single time. Experiments show that the

system can be used to generate modular boolean neural networks of large size. This

approach is therefore especially useful when the problem to be solved shows a great

deal of modularity in the repetitive use of functional groups.

Boers and Kuiper [4] use a graph grammar system based on a class of fractals called

L-systems. The chromosomes used in the genetic algorithm code the production rules

of this grammar. The system generates modular feedforward neural networks where

modularity is evident in the grouping together of neurons in a module (see also the

description of Happel and Murre's work in section 7.3.1). The networks generated are

again trained using a back propagation algorithm. Drawbacks include the need for a

repair mechanism because of possible faulty strings and the extremely long

converging times. The method does not scale up well to larger problems.


In [8] and [55] a quite different approach is presented. Neural networks are viewed as

physical objects in a two-dimensional space and are represented by a single cell ([8])

and various parameters concerning the growth process of the cell and a set of rules for

cell reproduction. The translation from genotype to phenotype is a complex one where

the final network is generated from the single starting cell by means of axonal growth

and branching as well as cell division and migration. The neural network 'grows' out

of the starting cell(s). This interpretation function comes a lot closer to the

developmental process found in nature. Successive phases of functional differentiation

and specialisation can be observed in the development. Mutations are introduced in

the development and it is observed that changes in the phenotype due to these

mutations depend largely on what stage in the development they occur. The neural

networks are used to model organisms living in a two-dimensional world in which

they can move in the search for food and water.

8. Using Genetic Programming to

Generate Neural Networks

In this chapter, we discuss the use of a genetic programming algorithm using a direct

encoding scheme; also see [63]. This work is mainly based on [40] and [41], where a LISP

program was used to implement the algorithm and which showed good results when GP

was applied to generate a neural network that could perform the one-bit adder task. A

complete neural network, i.e. its topology as well as its weights, is coded as a tree

structure and is optimised in the algorithm.

A public domain genetic programming system called GPC++, version 0.40, has been

used [17]. This software package was written in C++ by Adam P. Fraser, University of

Salford, UK, and several alterations were made to use it for the application to neural

network design. The GPC++ system uses Steady State Genetic Programming (SSGP)

as discussed in section 3.2. The probability of crossover, pc, is always 1.0; the new

population is constructed using the crossover operator, after which mutation is

performed. The crossover operator swaps randomly-picked branches between two

parents, but creates only one offspring for each pair. There is no notion of age in the

SSGP system, which means that after a new member is made, it can be chosen

immediately afterwards to create a new offspring.

8.1 Set-up

The technique applied in [40] and [41] was used, where a neural network is

represented by a connected tree structure of functions and terminals. Both the

topology and the values of the weights are defined within this structure, and no

distinction is made between the learning of the network-topology and its weights.

The terminal set is made up of the data inputs to the network (D), and a random floating

point constant atom (R). This atom is the source of all the numerical constants in the

network and these constants are used to represent the values of the weights. So:

T = {D, R}


The neural networks generated by this algorithm are of the feed-forward kind. The

terminal set T for a two-input neural network is for example T = {D0, D1, R}.

The function set contains the Processing function (P), which represents the

processing function of a neuron: it performs a weighted sum of its inputs and feeds

this to an activation function (e.g. linear threshold, sigmoid). The processing function

takes two arguments in the current version of the program; i.e. every neuron has two

inputs only. The weight function, W, also has two arguments. One is a subtree made

up of arithmetic functions and random constants that represent the numerical values of

the weights. The other is the point in the network that it acts upon, which is either a

processing unit (neuron) or a data input. The four arithmetic functions, AR = {+, -,

*, %}, are used to create and modify the weights of the network. All take two

arguments and the division function is protected in the case of a division by zero.

After some initial experimentation it was found that, for the problems under

investigation, the system performed much better if the arithmetic functions were not

used. So:

F = {P, W}

The values of the weights are represented by a single random constant atom and their

values can only be changed by a one-point crossover or mutation performed on this

constant atom.

During evaluation, each genetic tree is translated into a neural network structure made up of processing functions (neurons),

weights and data inputs. Initially no bias units were implemented. The name given to

this implementation of neural network design using genetic programming is GPNN.

8.2 Example of a Genetically Programmed Neural Network

An example of a chromosome generated by GPNN is the following neural network,

which performs the XOR function.

(W 1.45312 (P (W 1.70312 D1) (W -0.828125 D0))))


The graphical representation and the corresponding neural network are shown in

Figure 8.1. The condensation of the W-P tree initially drawn from the chromosome

into a fully connected feedforward network is illustrated in two stages.


8.3 Adapting Standard Genetic Programming for Neural Networks

In the standard GP paradigm, there are no restrictions concerning the creation of the

genetic tree and the crossover operator, except a user-defined maximum depth of the

tree. In neural network design, several limitations on the creation as well as on the

crossover operator are required.

The creation rules are:

• the root of the genetic tree must be a "list" function (L) of all the outputs of the

network

• the function below a list function must be the Processing (P) function

• the function below a P function must be the Weight (W) function

• below a W function, one of the functions/terminals must be chosen from the set

{P, D}; the other must be an R terminal

These creation rules make sure that the created tree represents a viable neural

network. The root of the tree is a list function of all its outputs while the leaves are

either a data signal (D) or a numerical constant (R). This tree can then be translated

into a neural network structure as in Figure 8.1.
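The translation of such a tree into a working network can be sketched with the tree as nested tuples. The representation, the threshold value, and the hand-built example tree below are illustrative assumptions, not taken from the text:

```python
def evaluate(node, data, theta=1.0):
    """Evaluate a GPNN-style tree given as nested tuples.
    ('W', w, sub) multiplies the subtree's output by weight w;
    ('P', a, b) sums its two weighted inputs and applies a linear
    threshold; strings such as 'D0' index into the data inputs."""
    if isinstance(node, str):
        return data[node]
    if node[0] == 'W':
        return node[1] * evaluate(node[2], data, theta)
    if node[0] == 'P':
        total = evaluate(node[1], data, theta) + evaluate(node[2], data, theta)
        return 1.0 if total > theta else 0.0
    raise ValueError(f"unknown node {node[0]!r}")

# A hand-built (hypothetical) tree computing OR of two inputs:
tree = ('P', ('W', 1.5, 'D0'), ('W', 1.5, 'D1'))
```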

The crossover operator must preserve the genetic tree so that it still obeys the above

rules. This is done by structure-preserving crossover which has the following rule: the

points of the two parent-genes between which the crossover is performed (the

branches connected to these points are swapped) must be of the same type. In effect

this means that firstly a crossover point on the first parent tree is randomly selected.

Then the crossover point on the second parent tree is randomly selected with the

restriction that it must be of the same type. The following types are distinguished:

- type 1: a P function or a D terminal

- type 2: a W function

- type 3: an R terminal

So, for example, a branch whose root (the crossover point) is a P function can never

be swapped with a branch whose root is a W function. Were this allowed, the

creation rules as described above would be violated and the genetic tree could no

longer be translated into a neural network. In [41], P functions and D terminals are

treated as being of different types, which means a branch whose root is a P function

can never be replaced by a D terminal and vice versa.
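Structure-preserving crossover can be sketched on the nested-tuple representation as follows; the path-based swap mechanics are my own implementation choice (here, as in the main text, P functions and D terminals share a type):

```python
import random

def node_type(node):
    """Type classes for structure-preserving crossover."""
    if isinstance(node, float):
        return 3                       # R: numerical constant
    if isinstance(node, str):
        return 1                       # D: data input terminal
    return 1 if node[0] == 'P' else 2  # P function / W function

def points(tree, path=()):
    """Enumerate every (path, type) pair in a tree of nested tuples."""
    yield path, node_type(tree)
    if isinstance(tree, tuple):
        for i, child in enumerate(tree[1:], start=1):
            yield from points(child, path + (i,))

def get(tree, path):
    for i in path:
        tree = tree[i]
    return tree

def put(tree, path, sub):
    if not path:
        return sub
    i = path[0]
    return tree[:i] + (put(tree[i], path[1:], sub),) + tree[i + 1:]

def crossover(p1, p2):
    """Swap a randomly chosen branch of p1 with a same-type branch of p2,
    producing a single offspring (as in the GPC++ variant described)."""
    path1, t1 = random.choice(list(points(p1)))
    same_type = [p for p, t in points(p2) if t == t1]
    return put(p1, path1, get(p2, random.choice(same_type)))
```

Because the second crossover point is drawn only from same-type nodes, every offspring still satisfies the creation rules and translates into a valid network.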

As can be seen from Figure 8.1 only tree structured neural networks can be generated

using GPNN. Sub-branches can never 'reach' each other. For example, the simple

'2-2-2' fully connected feedforward network in Figure 8.2 cannot be represented by a

single tree in GPNN. Instead two separate sub-trees like the one in Figure 8.1 are

needed, one for each output neuron. So six processing functions are needed in GPNN

to represent this four neuron network.

Figure 8.2 A simple '2-2-2' feedforward neural network (left). The GPNN system

needs two separate sub-trees to represent this network (right).


8.4 Automatically Defined Functions (ADFs)

Automatically Defined Functions (ADFs) can be used to

represent functional groups of neurons in the network. For example using

two ADFs, ADF0 and ADF1, each one representing a hidden neuron and its (two)

incoming weights, the '2-2-2' network can be represented by the GPNN tree in Figure

8.3.

Figure 8.3 Representation of the '2-2-2' network in GPNN with two ADFs.

The two ADFs have P functions as their roots and have two arguments each: ARGO

and ARG1. In the example these arguments are instantiated with the data inputs D0

and D1, but instead of data inputs, the output value of some P function or even another

ADF function can also be used. The problem with a representation of this kind is that

if every sub-network that is called upon more than once is represented by an ADF, the

number of these ADFs can become very large. This number normally needs to be set

by the user a priori. Another problem is that the number of arguments of each ADF, just as for every other function, needs to be specified in advance. However, extensions to

the standard GP system have recently been made by Koza allowing the system to

automatically build ADFs when it needs them.

8.5 Implementation of the Fitness Function

The fitness function is calculated as a constant value minus the total performance error

of the neural network on the training set, identical to the fitness function used in the

example of chapter 3. A training set consisting of input and target-output patterns

(facts) needs to be supplied. The error on the training set is then calculated as:


E(x) = sum_{i=1..N_facts} sum_{j=1..N_out} ( O_ij(x) - T_ij )^2

where:

N_facts = the number of training facts
N_out = the number of outputs
O_ij(x) = the j-th output of the network for training fact i
T_ij = the j-th target output of training fact i

Since a lower error must correspond to a higher fitness, the fitness of a chromosome x

is then calculated as:

f(x) = E_max - E(x)

The maximum performance error, E_max, is a constant value equal to the maximum error possible, so that a network that has the worst performance possible on a given training set (maximum error) will have a fitness equal to zero. When a threshold function is used as the neurons' processing function, only output values of '0' or '1'

are possible. The range of fitness values is then very limited and it is impossible to

distinguish between many networks. In order to increase this range the output neuron

could be chosen to have a continuous sigmoid processing function.
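The fitness computation above can be sketched as follows; a minimal sketch, assuming outputs and targets lie in [0,1], so that E_max equals the number of facts times the number of outputs:

```python
def sum_squared_error(outputs, targets):
    """E(x): squared error summed over all training facts i and outputs j."""
    return sum((o - t) ** 2
               for fact_out, fact_tgt in zip(outputs, targets)
               for o, t in zip(fact_out, fact_tgt))

def fitness(outputs, targets):
    """f(x) = E_max - E(x), so the worst possible network scores zero."""
    e_max = len(targets) * len(targets[0])  # each output off by at most 1
    return e_max - sum_squared_error(outputs, targets)
```

A network reproducing every target exactly receives the maximum fitness E_max; one that gets every output maximally wrong receives zero.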

In using a supervised learning scheme, there are many other ways to implement the

fitness function of a neural network. Instead of the sum of the square errors, for

example, we could use the sum of the absolute errors or the sum of the exponential

absolute errors. Another definition of the fitness could be the number of correctly

classified facts in a training set. The fitness function could also reflect the size (=

structural complexity) and the generalisation capabilities of the network. For example

smaller networks having the same performance on the training set as bigger networks

would be preferred, as they generally have better generalisation capabilities. The

generalisation capability of a network could be added to the fitness function by

performing a test on test data that lies outside the training data. These suggestions are

not implemented here.


8.6 Experiments with Genetic Programming for Neural Networks

The Genetic Programming for Neural Networks (GPNN) algorithm has been

implemented using the code of GPC++ with several alterations and additions. The

neurons in the resulting neural networks initially did not have bias units. The fitness

function used was the total performance error over the training set multiplied by a

factor to increase the range. The fitness value was then made into an integer value as

this is required by the GPC++ software. The mutation operator was implemented so

that it only acted on terminals, not on functions. The maximum depth of a genetic tree

in the creation phase, the creation depth, was set to 6. During crossover, the genetic

trees were limited to a maximum depth of 17, the crossover depth. These values were

used as a default value by Koza [40], to make sure the trees stay within reasonable

size.

The GPNN system was tested on generating neural network architectures for the XOR problem, the one-bit adder problem and on the intertwined

spirals problem.

The XOR problem was the first that was attempted using GPNN. The processing function used for the neurons was a simple threshold function: thres(x) = 1 if x > 1, 0 otherwise. The following settings for the genetic programming algorithm were used:

Parameter Setting

ADFs 0

creation depth 6

crossover depth 17

elitism on

N (population size) 500

pc (crossover rate) 1.0

pm (mutation rate) 0.1

selection mechanism tournament (tournament size = 5)


No Automatically Defined Functions (ADFs) were used, as they did not seem

necessary for such a simple task.

Several runs were performed on this problem with solutions evolving between

generation 1 and generation 5. Figure 8.4 shows a solution that was found in a

particular run in generation 5. All solutions found had a number of neurons ranging

from 3 to 5. When the roulette wheel reproduction mechanism was used instead of the

tournament mechanism, the convergence to a solution took on average 2 generations

longer.

Figure 8.4 A generated neural network that performs the XOR problem

The GPNN system was extended with a bias input to every neuron by means of an

extra random constant (in the range [-4,4]) added to every P function. The effect of

this on the XOR problem was a somewhat slower convergence. The reason might be

that the search space is increased, while for a solution to this simple problem bias inputs are not needed. It should be noted that the GPNN system with this specific set-up cannot generate the 'minimal XOR network'. This network is pictured in Figure

8.5.


Figure 8.5 The minimal XOR network. This is the neural network with the lowest

complexity (number of connections) that can perform the XOR problem

GPNN cannot generate this network simply because the P functions are only allowed to have two arguments (inputs), while for this particular network the output neuron has three inputs. The GPNN settings can of course be changed so that the function set F contains two P functions: one with two arguments, P1(arg1, arg2), and one with three: P2(arg1, arg2, arg3). The minimal XOR network can then be represented by a chromosome using these two functions.

As in [41], the slightly more difficult one-bit adder problem was then attempted. The

network has to solve the following task:

Input 1  Input 2  |  Output 1  Output 2
   0        0     |     0         0
   0        1     |     0         1
   1        0     |     0         1
   1        1     |     1         0

In effect this means that the first output has to solve the AND function on the two

inputs, and the second output the XOR function.
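In other words, the target mapping the network has to learn is just:

```python
def one_bit_adder(a, b):
    """Target function for the one-bit adder task: the first output is
    the AND (carry) of the inputs, the second the XOR (sum)."""
    return a & b, a ^ b
```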

The same characteristics as used in the XOR problem were used. A solution to the

problem was found on all 10 runs between generation 3 and generation 8. One of them

is shown in Figure 8.6. The convergence is much faster than in [41], where a solution

was only found after 35 generations, also using a population of 500.


Figure 8.6 A generated neural network that performs the one-bit adder problem

As can be seen from the figure, the neural network found is indeed made up of an

AND and an XOR function. On average the generated neural networks had more than

just 5 neurons and the largest effective network had 20.

The intertwined-spiral classification problem was tried, as it is often regarded as a

benchmark problem for neural network training. The training set consists of two sets

of 97 data points on a two-dimensional grid, representing two spirals that are

intertwined making three loops around the origin. A 2-input, 2-output neural network

is needed.

The results were poor. When the same settings as in the above experiments were used,

roughly half of the training set was classified correctly. Automatically Defined

Functions (ADFs) were introduced taking two, three and four arguments respectively,

but no improvements were observed. The function set was also extended with

processing functions P3 and P4 taking three and four arguments respectively. Again the

performance was still very poor.

Although GPNN was not able to find a solution to this problem, it should be noted

that GP has been found to be a very good classifier on the intertwined spirals problem.

In [40] a GP system gave a very good performance on this problem using the

following set-up. The terminal set, T, was made up of the two data inputs D0 and D1 and the usual real-valued terminal R:

T = {D0, D1, R}


The function set consisted of the arithmetic functions +, -, *, % (protected division), the functions SIN and COS and the function IFLTE (If Less Than or Equal to). The IFLTE function takes 4 arguments (branches) and is defined as: if (arg1 < arg2) then return arg3, else return arg4. So the function set F is:

F = {+, -, *, %, SIN, COS, IFLTE}

No creation or crossover rules are needed and the fitness function is simply the

classification error on the intertwined spirals data set. This GP configuration gave

very good results on the intertwined spirals classification task and a 100% correct

classification on the data set is reported.

8.7 Discussion of Genetic Programming for Neural Networks

It was found that the GPNN approach works well for small scale problems such as the

XOR and one-bit adder problems, but that it does not scale up to larger, more realistic problems. One of the reasons for this is thought to be that the size of the

chromosomes becomes excessively large for these problems. The GPNN system will

have enormous difficulty in finding an optimum within reasonable time. Other reasons

for the poor scaling up quality of GPNN can be found in the restrictions that apply to

the system. These restrictions are:

• There are severe restrictions on the network topologies generated: only tree

structured networks are possible.

• Each processing function (neuron) can and must have exactly two inputs.

• The learning of the topology and weights is done simultaneously within the same

algorithm. This has the drawback that a neural network with a perfectly good

topology might have a very poor performance and will therefore be thrown out of

the population just because of the value of its weights.


A further restriction lies in the choice of the function and terminal set. For example, in the GPNN system that we used, only two functions

are implemented: {P,W}. This function set could easily be extended. In order to

decide on what functions are useful to the problem, detailed knowledge of the problem

domain is needed.

It is believed that the main reason why the GPNN approach fails to scale up to larger

size problems lies in the restrictions mentioned and the very large chromosome size

needed. An approach which overcomes some of the limitations of GPNN is discussed

in Chapter 10.

9. Using a GA to Optimise the Weights

of a Neural Network

This chapter describes experiments using a genetic algorithm for weight optimisation

of a feedforward neural network. When genetic algorithms are used to optimise the

structure of feedforward neural networks a separate learning algorithm such as back

propagation is often used to train the weights (see Figure 7.1). In the weight

optimisation module a separate genetic algorithm can be used instead of back

propagation, making the system a meta-level GA. The performance of a GA as a

neural network weight optimiser is investigated here. For certain problems (see e.g.

[50], [64]) genetic algorithms have proven to be comparable to or even better than

back propagation. In section 7.1 an overview was presented on the research in this

area. The best results were noted when a Steady State Genitor-type GA was used with

a real-valued coding of the weights. We have used a normal GA with 'non-overlapping populations' and an altered replacement mechanism so that it can act as a

Steady State Genetic Algorithm. Since the main characteristic of a Genitor-type GA is

thought to be its extreme selective pressure or 'pushing force' of above average

individuals, its performance can be approximated by a normal GA with the

appropriate selection mechanism. The effect of the selective pressure on the GA

performance as a weight optimiser is also investigated here.

First a brief description of the GA software is given. After this the set-up of the GA is

discussed and experiments are presented where the GA weight optimiser is compared

to the standard back propagation algorithm. Finally the results are discussed.

9.1 Description of the GA Software

The GA software used was a genetic algorithm C-library called 'SUGAL' (v1.0),

developed by A. Hunter at the University of Sunderland, England. The system is very

flexible and the user has many options available. The basic working of SUGAL is

illustrated in the flow chart of Figure 9.1.



In each generation a number of candidates to be included in the next generation are chosen using the selection mechanism. In the

standard GA the number of candidates just equals the population size; i.e. the

complete population is replaced by the candidates. An exception to this is when

elitism (see section 3.1.7) is used. With elitism the number of candidates is equal to

the population size minus one. As usual, crossover is performed on the pair of

candidates with probability pc. Mutation is then performed with probability pm.

The candidates are then evaluated and inserted back into the population using the

replacement mechanism. In the standard GA the replacement mechanism is such that

the individuals in the population are always replaced by the candidates. This is known

as unconditional replacement. SUGAL offers extra replacement strategies identical to

the ones used in a Steady State Genetic Algorithm; i.e. conditional/unconditional and

ranked/unranked replacement. The standard GA can be transformed into a Steady

State Genetic Algorithm by decreasing the number of candidates to only one (or two).

This type of GA implemented in SUGAL is also described by Michalewicz in [48], p.

60, where it is labelled 'modGA'.

The SUGAL software mutation operator was changed so that a single gene is subject

to mutation with probability pm (and not a chromosome as was the case). This

probabilistic implementation of the mutation operator where every gene has to

undergo a 'test' to determine whether or not it should be mutated makes the program quite slow.

A second change was made concerning the selection of the pair of candidates. In

SUGAL it was possible for a single individual to be chosen both as the father and as

the mother. In such a case the offspring are simply exact copies of the parent no matter

what kind of crossover takes place. This has the effect of lowering the effective

crossover rate and in populations with one superfit individual it may easily lead to

premature convergence. The code was changed so that the father and mother

chromosome could not be one and the same.

SUGAL also offers an option whereby every individual is re-evaluated at the start of a generation. This can serve a purpose if the evaluation is dependent on the state of the system or its non-stationary environment, or if the

evaluation contains stochastic elements. In many static optimisation problems the

fitness of an individual is deterministically dependent on the individual and re-

evaluation will serve no purpose.



9.2 Set-up

In this section the set-up of the GA is described for the implementation of neural

network weight optimisation.

• Coding

The coding is chosen to be real-valued. A single chromosome represents all the

weights in the neural network (including the bias weights), where a single real-valued

gene corresponds to one weight-value. The nodes in the network are numbered from

'0' starting at the bias-unit, then the input units, the hidden neurons and finally the

output neurons. Even though the input units and the bias unit are not really neurons at

all, they will be referred to as such (as is common practice). The network architecture

is not restricted to a classic fully connected layer-model. However, the hidden neurons

are numbered in such a way that neurons with a higher index are 'higher' up in the

hierarchy of the network; i.e. neurons can only have outgoing connections to neurons

with a higher index. Figure 9.2 illustrates this. The indices of the weights represent the

order in which they appear in the chromosome. Incoming weights to a certain neuron

are grouped together in the chromosome representation.
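As an illustration of this coding, the following sketch decodes such a chromosome and performs a forward pass; the function name and the `connections` mapping are assumptions made for the example, not SUGAL code:

```python
import math

def evaluate_network(chromosome, connections, n_inputs, inputs):
    """Decode a flat real-valued chromosome into a network and run it.

    Node numbering follows the text: 0 = bias unit (constant output 1),
    then the input units, then hidden/output neurons in feedforward
    order. `connections` maps each neuron index to its source indices
    (all lower-numbered), and the chromosome lists the incoming weights
    grouped per neuron, in neuron-index order.
    """
    assert len(inputs) == n_inputs
    activation = [1.0] + list(inputs) + [0.0] * len(connections)
    weight = iter(chromosome)
    for j in sorted(connections):            # lower-numbered neurons first
        net = sum(next(weight) * activation[i] for i in connections[j])
        activation[j] = 1.0 / (1.0 + math.exp(-net))  # sigmoid on [0, 1]
    return activation
```

Because each neuron only receives input from lower-numbered nodes, a single left-to-right sweep over the neurons computes all activations.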

• Initialisation

The initialisation of the weights is important when a GA is used to train a neural

network. Because the standard crossover operator for real-valued chromosomes leaves

the gene values intact, it can never introduce new values of weights and the available

genetic information is dictated by the initial values and by the mutation operator used.

When the standard mutation operator is used to simply replace genes by a new random

value in a certain range, this range and the range of initial gene values dictates the

boundaries of possible values the genes can ever obtain. The range of initialisation in

the GA weight optimiser therefore usually plays a more important role than in a hill-climbing algorithm like back propagation. The initial values of the genes can be

chosen to be uniformly distributed within a certain range or normally distributed with

a certain mean and standard deviation.


• Evaluation

The evaluation phase involves initialising the neural network with the set of weights

contained in the chromosome. The fitness value, f(x), is then simply the cumulative

squared error of the network on the training set where the outputs are compared to the

target output patterns:

E(x) = sum_{i=1..N_facts} sum_{j=1..N_out} ( O_ij(x) - T_ij )^2

where:

N_out = the number of outputs
O_ij(x) = the j-th output of the network for training fact i
T_ij = the j-th target output of training fact i


The training is supervised. All the neurons in the network perform a weighted sum of

their inputs and produce as output the standard sigmoid function on [0,1] of this

weighted sum. So O_ij(x) ∈ [0,1]. Commonly target outputs will have a value of either 0 or 1: T_ij ∈ {0,1}.

• Stopping Criterion

In this implementation the stopping criterion is chosen to be the occurrence of a

chromosome in the current population corresponding to a neural network that

correctly classifies the complete training set within a certain tolerance. All outputs of

the network must be within this tolerance of their target values for the criterion to be

satisfied: | O_ij(x) - T_ij | <= tolerance, for all facts i and outputs j.

The default tolerance is set to 0.4, where it is assumed that all target outputs have a

value of either 0 or 1. The chosen network does not necessarily have to be the network

with the smallest error on the training set (the fittest chromosome), rather, it is the first

encountered within the stopping criterion. An alternative stopping criterion could be

when a network is found that has a fitness below a certain error value, but we have not

used this approach.
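The stopping criterion amounts to a check of the following kind (function name illustrative):

```python
def training_set_classified(outputs, targets, tolerance=0.4):
    """True if every output of every training fact lies within
    `tolerance` of its 0/1 target - the stopping criterion above."""
    return all(abs(o - t) <= tolerance
               for fact_out, fact_tgt in zip(outputs, targets)
               for o, t in zip(fact_out, fact_tgt))
```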

• Selection

Before selection is performed the fitness values of the individuals are normalised (or

remapped) using a normalisation method. Normalisation is implemented in SUGAL

by optionally altering the fitness values using some function (such as ranking), and

then normalising all fitnesses so that the total of the fitness values of the population

equals 1. Normalisation methods include inversion, where the fitness values are

inverted so that lower fitnesses take higher values and high fitnesses take low values,

linear ranking, where the fitness value becomes a linear function of the rank of the

chromosome, and geometric ranking, where the fitness is a geometric function of the

rank.
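As an illustration, here is a sketch of linear ranking normalisation. The exact parameterisation SUGAL uses is not given in the text, so the common form in which `bias` is the ratio of best to worst selection probability is assumed; for the error-based fitness used in this chapter the ranking would be reversed, as in Table 9.2:

```python
def linear_ranking(fitnesses, bias=1.5):
    """Remap each fitness to a linear function of its rank (worst gets
    rank 0), then normalise so the remapped values sum to 1. Assumes at
    least two individuals and 1 < bias <= 2."""
    n = len(fitnesses)
    order = sorted(range(n), key=lambda i: fitnesses[i])  # worst first
    remapped = [0.0] * n
    for rank, i in enumerate(order):
        remapped[i] = 2 - bias + 2 * (bias - 1) * rank / (n - 1)
    total = sum(remapped)
    return [r / total for r in remapped]
```

Because only the rank matters, a single superfit individual cannot dominate selection the way it can under raw fitness-proportionate selection.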

• Crossover

The standard one-point, two-point and uniform crossover operators are available.

Since entire genes are swapped, these operators can never change a gene value (a


weight). An exception is the linear crossover operator (see section 6.5.1) which was

implemented as a special option as follows: the first offspring x3 receives the average values of all the genes of its parents, i.e. x3 = (x1 + x2)/2. The other offspring is generated as x4 = (3*x1 - x2)/2.
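The linear crossover operator then amounts to:

```python
def linear_crossover(x1, x2):
    """Gene-wise linear crossover on real-valued chromosomes: the first
    child is the average (x1 + x2)/2, the second is (3*x1 - x2)/2."""
    x3 = [(a + b) / 2 for a, b in zip(x1, x2)]
    x4 = [(3 * a - b) / 2 for a, b in zip(x1, x2)]
    return x3, x4
```

Unlike gene-swapping crossover, this operator can create weight values that occur in neither parent.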

• Mutation

As stated above, the standard mutation operators in SUGAL were changed. There are

two mutation operators available. Normal, 'uniform', mutation re-initialises the gene

with a random value. This new random value can be taken from a uniform distribution

within a certain range or from a normal distribution with a given mean and standard

deviation. Creeping, 'Gaussian', mutation is such that a normally distributed value

with a certain standard deviation is added to the current value of the gene. The

SUGAL code was extended so that both mutations could operate at the same time,

each with its own mutation rate.
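The two operators can be sketched as follows; the uniform range used here is an illustrative choice, not a value taken from the text:

```python
import random

def uniform_mutation(chromosome, pm, low=-4.0, high=4.0, rng=random):
    """'Normal' mutation: each gene is re-initialised with probability
    pm with a new random value from a uniform range."""
    return [rng.uniform(low, high) if rng.random() < pm else g
            for g in chromosome]

def creeping_mutation(chromosome, pm, sigma=1.0, rng=random):
    """'Creeping' (Gaussian) mutation: with probability pm a normally
    distributed value N(0, sigma) is added to the current gene value."""
    return [g + rng.gauss(0.0, sigma) if rng.random() < pm else g
            for g in chromosome]
```

Creeping mutation makes small local adjustments around the current weight value, which is why it behaves more like a hill-climbing step than uniform re-initialisation does.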

• Replacement

All the replacement mechanisms as described in section 3.1.5 are available, i.e.

ranked/unranked and conditional/unconditional replacement. In unranked

replacement, the 'doomed' individuals are chosen randomly. With ranked replacement

the doomed are the least fit individuals of the population. When the replacement is

unconditional, candidates always replace the doomed individuals. In conditional

replacement the doomed individual is only replaced if its replacement is fitter.

SUGAL offers the ability to set the number of candidates that are generated during

each generation to any number Nc. Ranked unconditional replacement then becomes

an extended form of elitism where the worst Nc individuals of a population are

replaced each generation. When Nc is set to 1 (or 2) the GA is transformed into a

Steady State Genetic Algorithm. The SUGAL settings resulting in a Genitor-type GA

therefore are: Nc = 1, normalisation method = linear ranking, replacement mechanism

= ranked unconditional replacement.
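The four replacement variants can be sketched in one routine; the sketch assumes a higher fitness value means a fitter individual:

```python
import random

def insert_candidates(population, candidates, fitness,
                      ranked=True, conditional=False, rng=random):
    """Insert evaluated candidates back into the population.

    ranked: the 'doomed' are the least fit; unranked: chosen at random.
    conditional: a doomed individual is only replaced if the candidate
    is fitter; unconditional: candidates always replace the doomed.
    """
    pop = list(population)
    k = len(candidates)
    if ranked:
        doomed = sorted(range(len(pop)), key=lambda i: fitness(pop[i]))[:k]
    else:
        doomed = rng.sample(range(len(pop)), k)
    for idx, cand in zip(doomed, candidates):
        if not conditional or fitness(cand) > fitness(pop[idx]):
            pop[idx] = cand
    return pop
```

With `k = 1` and ranked unconditional replacement this reduces to the Genitor-type steady-state behaviour described above.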

9.3 Experiments

This section is concerned with experiments that were performed with the GA system

described above. Several neural network weight optimisation problems were tried and

the results were compared with the standard back propagation learning algorithm. All

neural networks used here have all their hidden and output neurons connected to a bias

unit that has a constant output of 1.


9.3.1 Data-sets

• 4 to 4 Encoder Problem

The 4 to 4 encoder problem is a simple one to one mapping of all of the 16 possible 4

bit binary inputs to the outputs. The target output values are identical to the input

pattern for each training pattern. Table 9.1 shows the 4 to 4 encoder training data.

Input   Output
0000    0000
0001    0001
0010    0010
 ...     ...
1111    1111

A '4-4-4' fully connected feedforward neural network was used for which the

backpropagation algorithm had no problems learning the data. The corresponding

chromosome length is: l = 4*4 + 4*4 + 4 + 4 = 40.

• Iris Flower Classification Problem

The iris flower data consists of a training set of 75 facts and of a test set of the same

size. A single fact contains 4 real-valued input values on [0,1] and 3 binary output

values. The class of a fact is determined by that output which has a value of one; the

other two output values are zero. The data represents four attributes of flowers

according to which the flowers are categorised into three classes. This data set is often

considered to be a benchmark problem in neural network classification tasks. The

neural network used to classify this data was a '4-4-3' fully connected feedforward

neural network. This network was easily trained with back propagation for a 100%

correct classification of the training data. The GA system requires chromosomes of length l = 35. To test the trained network a separate test set also containing 75 facts

was available.


• Radar Classification Problem

The radar classification data concerns a real-world task. The training data set consists

of 240 facts, each having 17 inputs in the range [-1,1] and 12 binary outputs

determining the class of the object. The data concerns the classification of 6 classes of

ships each class having two attitudes: left-side and right-side. A '17-12-12' fully

connected neural network was used, resulting in a chromosome with a length of l = 372.

It is difficult to compare the performance of a GA weight optimisation algorithm with

an algorithm like back propagation. A simple measure of the time it takes to converge

to the appropriate solution is of course not reliable since it depends heavily on the

software implementations used. However a rough comparison can be made between

the two algorithms as is reported in [50]. In this GA application, as in the vast

majority of GA systems, the evaluation of the chromosomes takes up the most

computational time by far when compared to the rest of the algorithm. When

comparing GA and BP in computational effort the 'rest' of the GA algorithm (i.e.

selection, crossover, mutation etc.) is simply ignored. During evaluation of a

chromosome for each fact a single pass of the training data through the neural network

is made after which the error is calculated. So the number of 'passes' per evaluation is

simply the number of facts. In a BP algorithm with 'per-pattern update' (i.e. weights

are updated after every single presentation of a training fact) the data is passed

through the network (forward pass) after which the error propagates back (backward

pass) and the weights are updated. In one training cycle of the BP algorithm the total

number of passes through the network therefore equals twice the number of facts in

the training set. So when comparing GA and BP one training cycle in BP is considered

to be equivalent to two evaluations in the GA.

In the standard GA without elitism where all newly made individuals for the next

generation are evaluated, the number of evaluations per generation simply equals the

population size. In the GA used here this does not hold in general. Some individuals

pass on from one generation to the next unchanged and are not evaluated. For this

reason, during each GA run the number of evaluations needed to find the solution is

recorded. When comparing the GA and BP algorithms on a certain problem the

number of GA evaluations, or the number of passes through the neural network, will

simply be called iterations. Thus one training cycle of BP counts as two iterations.


9.3.3 Results

It is difficult to visualise the operation of a GA. In this section graphs are presented

that show the fitness of the best individual in the population versus the number of

generations. In contrast to a hill-climbing algorithm like BP this graphical

representation does not give much insight into the actual search of the GA.

SUGAL offers a measure of the diversity of the population at the end of each

generation. The diversity measure for a real-valued coding as is used here is the mean

of the standard deviations of each gene across the entire population. So:

D = (1/l) * sum_{i=1..l} sigma_i

where:

D = the diversity of the population
l = the chromosome length
sigma_i = the standard deviation of gene i across the population
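A direct transcription of this measure, using the population standard deviation:

```python
import statistics

def diversity(population):
    """D: the mean, over all l gene positions, of the standard deviation
    of that gene across the whole (real-valued) population."""
    genes = list(zip(*population))   # transpose: one tuple per gene
    return sum(statistics.pstdev(g) for g in genes) / len(genes)
```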

Figure 9.3 shows a typical run of the GA for the 4 to 4 encoder problem. Shown is the

average value of the fitnesses of all the individuals in the population and the fitness

value of the best individual in the current population. The GA was run for 300

generations.


The settings used for this particular run are given in Table 9.2.

Parameter Setting

crossover type two point
elitism on
fitness normalisation reverse linear ranking with bias = 10.0
initialisation of population normal distribution N(0,5)
l (chromosome length) 40
mutation type creeping with N(0,1) distribution
N (population size) 50
pc (crossover rate) 0.8
pm (mutation rate) 0.1
re-evaluation off
replacement mechanism ranked unconditional
selection mechanism roulette wheel


A Normal (or Gaussian) distribution is characterised as N(μ,σ), with μ = the mean and σ = the standard deviation of the distribution. At generation 64 an individual was

found that correctly performed the 4 to 4 encoder problem subject to the required

tolerance on the target output values of 0.4. A total of 3195 evaluations (passes

through the network) was needed to find this solution. Over the course of ten runs with

the same settings the average number of evaluations needed to find a solution was

about 3500, corresponding to an average of 87 generations.

As can be seen from Figure 9.3 the diversity of the population remains fairly high

throughout the run. It drops from its initial value of about 5.0 to 1.0 at generation 40

and stays there due to the relatively high mutation rate.

Despite using trial and error to compare various configurations, the parameter settings

may not have been optimal for this problem and the convergence time compares very

poorly to the back propagation algorithm. The average number of cycles needed for

BP to find a solution subject to the same requirements is about 50. This would

compare to 50 BP cycles * 2 = 100 evaluations in the GA, meaning about 2

generations with a population size of 50. Figure 9.4 clearly shows this drastic

difference in performance between the same GA run as above and an average BP run

for the 4 to 4 encoder problem.

The 4 to 4 encoder problem is a very simple one for a hill-climbing algorithm like BP to solve and the

global optimum is found without any trouble. The GA on the other hand needs many

times more iterations to find a solution to this problem. Although the GA settings

might not have been optimal, it is clearly outperformed by BP on the 4 to 4 encoder

problem.


The GA configuration was then changed to one of a Genitor-type Steady State Genetic

Algorithm. Good results have been reported on neural network weight optimisation

using this type of GA with creeping mutation and a small population size of fifty [50],

[64]. It is thought that a Genitor-type GA can work well on weight optimisation

problems mainly because of its high selective pressure that will centre the search

around a single superfit individual. The same settings as in Table 9.2 were used with

the exception that the number of candidates generated during each generation is now

just one. The number of evaluations or iterations is simply equal to the number of

generations for this type of GA. The average number of iterations needed to find a

solution to the 4 to 4 encoder problem did not differ much from the previous results.

On average something like 3000 iterations were needed and the Genitor-type GA does

not seem to offer any major advantages on this problem.

Because the population size is so small, the effects of selective pressure and genetic

drift in the population are rather large. The population is in general very quickly

dominated by a single superfit individual. To get an idea of the effect of genetic drift


alone, a run was performed without any selective pressure and with the genetic

operators crossover and mutation turned off (pc = pm = 0). The GA selects individuals

without preference and copies them into the next generation. The population

converged to a single individual in just 7 generations.

• Effect of Mutation

The effect of the mutation operator was investigated to some extent. Runs were

performed with the normal mutation operator instead of the creeping one. It was

generally observed that this resulted in a poorer performance on the problem. When

both mutation operators were used at the same time, the results were about the same as

the situation where only the creeping mutation was used. The mutation operator

clearly plays a very important part in GA weight optimisation and the GA

performance depends greatly on the settings used.

• Effect of Population Size

The effect of the population size on this problem was investigated to some extent.

Several runs were done for population sizes of 50, 100, 200. The average number of

iterations needed to find a solution did not vary much at all between the

configurations. Very small population sizes of N = 5 and even N = 2 were also

investigated.

As was also reported in [64] the GA converges to a solution even with a population

size as small as 5. Not all the runs converged, but the ones that did (about 80%)

needed on average about half the number of iterations (2000) as those in the case of a

population size of 50. For a 'normal' GA, convergence to a solution would normally

not be found with such a small population size since there simply is not enough

genetic diversity to maintain a proper search by means of intelligent hyperplane

sampling (formation of building blocks). As was also mentioned in [64] the fact that

the GA converges to a solution even with such a small population size strongly

suggests that the search is mainly performed by genetic hill-climbing (see section

6.1.1). Solutions were even found with a population size of 2, although in this case

about 50% of the runs did not converge.

128 Chapter 9. Using a GA to Optimise the Weights of a Neural Network

The performance of the GA system on the iris flower classification problem was also

investigated. Figure 9.5 shows a typical GA run in comparison to a typical BP run on

this problem. The GA settings were the same as those in Table 9.2.

Despite the fact that BP has some problems in finding the global optimum for this

problem, it again drastically outperforms the GA system in convergence time.

A few runs were performed using the radar classification problem. On none of these

runs was convergence observed. Using back propagation around 600 cycles or 1200

iterations were needed to find a solution. Using the genetic algorithm, the best

individual found after as many as 10000 iterations still gave a very poor performance

on the data set: E(x) ≈ 200.


9.4 Discussion

The GA system has not been found to perform well on the task of feedforward neural

network weight optimisation. It is drastically outperformed by back propagation on

the problems investigated. This might however partially be caused by the nature of the

problems. Problems for which the BP algorithm has no difficulty in finding the

optimum are typically problems with a low level of epistasis resulting in a 'simple'

error landscape. Back propagation will not get 'trapped' in local minima for these

problems and it is not surprising that a hill-climbing algorithm such as BP will

outperform a global search algorithm like GA. Problems which do pose severe

convergence problems for back propagation may be better suited for the genetic

algorithm. It is reported in [50],[64] that a GA system very similar to the one

implemented here does outperform back propagation on some large size tasks that are

very difficult for BP.

Several facts seem to indicate that the genetic algorithm in this set-up does not

perform a global search through the weight space by means of intelligent hyperplane

sampling. Instead, the search seems to be focused around a single individual and

better solutions are generated by genetic hill-climbing. Reasons why the GA seems to

work better as a genetic hill-climber on weight optimisation problems very likely

include the competing conventions problem, caused by multiple chromosomal

representations coding identically functioning networks. By focusing the search

around a single individual this problem is avoided. Another reason why a global

search may not work very well is simply the extremely large size of the search space

for bigger sized problems.

Future work will need to be done in optimising the GA set-up for neural network

weight optimisation, possibly extending the set of genetic operators with ones that are

more problem specific. This can present an alternative in tackling the competing

conventions problems. Some good results have been reported in literature where

genetic operators were used that use some kind of gradient information of the error

landscape. Since competing conventions seem to be such a major problem for weight

optimisation with a standard GA, better results may be expected when niching

techniques such as restrictive mating are used, although this has not been investigated.

Since BP is very good at fine tuning potential solutions and the standard GA can

perform a global search in the problem space, a hybridisation of the two seems natural.

Such a combination could produce a robust system that will work on a variety of problems. The GA could

perhaps be used to find 'basins of attraction' (areas around a local/ global optimum

from which a hill-climbing algorithm always converges to the optimum) in the error

landscape from which the back propagation algorithm can take off to find the local or

global optimum. This hybridisation could be implemented in the GA system as

follows: during the evaluation of an individual, train the set of weights using back

propagation for a certain amount of training. The training could be implemented in

such a way that the back propagation algorithm is allowed to go on when the error

continues to decrease (i.e. converging to a local or global optimum) and that it must

stop and return to the GA when it does not.
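The stopping rule just described can be sketched as follows; `train_one_epoch` is a hypothetical placeholder for one back propagation pass over the training set, and the epoch cap is an assumption:

```python
def evaluate_with_bp(weights, train_one_epoch, max_epochs=100):
    """Hybrid evaluation sketch: keep applying back propagation while
    the error is still decreasing; stop and hand the (improved)
    weights back to the GA as soon as it is not.
    `train_one_epoch(weights)` must return (new_weights, error)."""
    weights, error = train_one_epoch(weights)
    for _ in range(max_epochs - 1):
        new_weights, new_error = train_one_epoch(weights)
        if new_error >= error:  # error no longer decreasing: return to GA
            break
        weights, error = new_weights, new_error
    return weights, error
```

In this way individuals that start inside a basin of attraction receive the extra training needed to reach its optimum, while individuals on a plateau are returned to the GA almost immediately.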

10. Using a GA with Grammar Encoding to Generate Neural Networks

In this chapter we describe a GA system that was implemented based on the ideas of

Kitano's matrix grammar (see section 7.3.3) in the automatic generation of both a

neural network structure and its weights. Kitano's approach is extended in the sense

that not just the neural network structure but also the values of the weights are coded

in the chromosome.

In Kitano's original approach only the network structure is coded and each network is trained starting from random weights; e.g. [38]. We chose to code the values of the weights in the

chromosome as well, so that not only the structural but also the parametric information

can be passed on from generation to generation.

When both the structure and the weights of the network are coded in the chromosome,

the resulting system is best described by a Structured Genetic Algorithm (sGA). In

[11], where a direct representation was used, good results were reported using an sGA

on small problems such as the XOR or small decoder networks but it was found the

method did not scale up well to bigger problems. Instead of using a direct encoding

method, better results were expected using a grammar encoding and we investigated a

method based on Kitano's matrix grammar encoding in this context. Results are

compared to the matrix grammar system without weight encoding and to a system

implementing direct encoding to represent the structure of a neural network.

10.1 Structured Genetic Algorithms in Neural Network Design

Structured Genetic Algorithms were developed by Dasgupta and McGregor [10] and

have proved to be a successful method to simultaneously optimise the neural network

architecture and its weights [11] using hierarchically structured chromosomes. The

recombination phase is the same as in the standard genetic algorithm. During

evaluation however 'high-level' genes act as switches to activate or deactivate lower

level genes. In [11] two leveled chromosomes were used. The top level defines the



connectivity of the network, the bottom level the values of the weights and biases. The

network connectivity is represented by the connectivity matrix with that part of the

chromosome treated as a binary string.

In most of the GA approaches where both network structure and weights are subject to

genetic operations, the chromosome representing both parts is thought of as a long

binary string and is subject to the genetic operators of the algorithm. As far as the

genetic operators are concerned there is no distinction between the structural and the

parametric (weight) part. This distinction is only made when the chromosome is

translated into the actual neural network.

It is possible to make a distinction between the structural and the weight part of the

chromosome. When different codings (i.e. binary and real-valued) are used for the two

parts, this distinction must be made and non-homogeneous chromosomes are needed.

In [11] the best performance of the sGA algorithm was observed when the weights and

biases were coded as real-valued genes, as opposed to the binary coded structural part

of the chromosome. Genetic operations like crossover and mutation can now be

thought of as being either structural or parametric changes to the network depending

on what part of the chromosome they operate on. When the changes are structural, the

resulting offspring can inherit the set of weights from its parent. This process is called

'weight transmission' and will be described in the next section. A 'structural'

crossover is illustrated in Figure 10.1.

Figure 10.1 Abstract visualisation of structural crossover. The offspring inherit the

set of weights from their parents

A simple option is to initialise the weights of newly created offspring to random values.

In [5], instead of initialising the weights of the newly created offspring with random

values, the values of the weights are set to a fraction F·Wij of the corresponding parent

weight Wij. This process is called 'reduced weight transmission' and it was found that

the optimum value of F depended very much on the problem. A training module was

used to learn the weights and it was found that the reduced weight transmission

mechanism speeded up learning by more than an order of magnitude compared to

starting with random weights. The idea is that, with weight transmission, the training

of the networks will generally start off from a better point in the weight space when

compared to starting at a random point, and that less training is required during

evaluation of the networks.

There are several ways to implement weight transmission. For example, the weights of

the offspring network can be set to a fraction of the corresponding weights of one or

both parents. The parent networks could first be checked to see which weights in the

complete weight set are actually in use. When the offspring network uses a connection

that is also in use by one or both of its parents, a fraction of the corresponding

weight(s) can be 'transmitted'. The problem is then to initialise weights that are not in

use by either of the parents. We choose to use a system where two parents produce

two offspring, and each offspring inherits a fraction F of a particular weight of one of

its parents. Normally, F is set to a default of 1, so the entire weight value is

transferred. For each weight of the offspring there is an equal chance of inheriting the

weight value from either parent. So after weight transmission an offspring will on

average have inherited 50% of its weights from parent 1 and 50% from parent 2. Other

options include allowing offspring to inherit all weights from a single parent or to let

the offspring's weights be an average of the weights of its parents. These options are

not investigated here. When no crossover is performed on the pair of candidates the

offspring are identical to the parents, including the set of weights.
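The chosen transmission scheme can be sketched as follows (the function name and the flat-list representation of the weight string are illustrative):

```python
import random

def transmit_weights(parent1, parent2, rng, fraction=1.0):
    """Weight transmission as described above: two parents produce two
    offspring, and for every weight position each offspring inherits
    `fraction` times the value of one parent's weight, the parent
    being chosen with equal probability.  With fraction = 1 (the
    default) the entire weight value is transferred."""
    child1, child2 = [], []
    for w1, w2 in zip(parent1, parent2):
        if rng.random() < 0.5:
            child1.append(fraction * w1)
            child2.append(fraction * w2)
        else:
            child1.append(fraction * w2)
            child2.append(fraction * w1)
    return child1, child2
```

Because each position is decided by a fair coin, an offspring inherits on average 50% of its weights from each parent, as stated above.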

When using a grammar encoding instead of a direct encoding method as above, the

reduced weight transmission concept may not work as well. This is because in a

grammar encoding scheme there is no one-to-one correspondence between the

structural and the weight space. The part of the chromosome representing the network

structure, coded by grammar rules, is in general much shorter than the part

representing the network's weights. When two network parent structures are involved

in reproduction there is no guarantee that the resulting offspring will use weights that

were used by both or any of its parents.

When a network is evaluated, the weight training starts from a point in the weight

space determined by the set of weights of one of its parents. Assuming the network

structure of the offspring is similar to the structures of the parents from which it has


inherited the set of weights, the learning will in general start off from a much better

point when compared to starting off with random weights. When the network structure

of the offspring is not similar to its parent's however, weight transmission may not be

very useful. In this case the set of weights received from the parents might not be

better than just starting off with random weights. Consequently such a network will in

general need more training to reach an optimal set of weights compared to a network

that starts off training much nearer to the optimum in weight space. This probably

presents the main difficulty in the weight transmission scheme. Assuming that the

amount of training a network receives is limited, weight transmission strongly favours

networks that are structurally close to their parents. It may therefore result more in a

local search through the structure space than a global one. If the optimal network

structure can be found by such a local search, this may not necessarily be a bad thing. In

order to test every network fairly without favouring some, the concept of weight

transmission will have to be abandoned or the amount of training that a network

receives will have to depend on the position in weight space that the training starts

from. The latter will be almost impossible to realise since the distance between the

starting point and the optimum in weight space will in general not be known. Another

option is of course to give every network so much training that the optimum can

always be found. Weight transmission will then no longer be needed. But the purpose

of weight transmission was to bring down the amount of training in the first place.

Using an sGA approach where both the structure and weights are coded in the

chromosome, a distinction can be made between structural and parametric changes. In

[47] where an EP technique is used, every parent network generates one set of

offspring with structural changes and one set with parametric changes. This could be

implemented in a GA so that a parent chromosome generates one offspring with a

structural change and one with a parametric change. Another idea is to have 'phases'

of structural change and phases of parametric change. For example after a generation

of structural changes only, one could have several generations of parametric changes.

This is similar to approaches where a separate training algorithm such as

backpropagation is used in the evaluation phase of the neural network structure. A

two-level GA can be used where the top level GA searches for the optimal neural

network structure. During the evaluation of a neural network, the bottom level GA is

invoked where for a certain number of generations the weights of the network are

optimised using a special GA. This bottom level GA then acts as a separate training

module.

10.2 Kitano's Matrix Grammar

There is a choice to make whether to represent the weights of all possible connections

of a neural network in the chromosome (i.e. the weights of a fully connected

feedforward network) or just the weights that are actually in use by the network. In

[11] the former is chosen and the argument is given that learned information (i.e.

weights) can be passed on to next generations passively, but may be of use for a future

generation. 'Passively' here means that although the information is part of the

genotype it is not expressed in the phenotype; i.e. weights that are coded in the

chromosome but not actually used by the network. During evaluation the connectivity

of the network coded in the top level defines which weights are actually used. The

genetic operators act on the entire set of weights even though some weights are

passive. Therefore weight changes are possible on non-activated weights that can

possibly only be noticed in future generations. Although this phenomenon is supported

by biological evidence, its usefulness is questioned here. Changes in the weight space

by means of the genetic operators could perhaps be limited to active weights only.

When only those weights that are used by the network are encoded in the

chromosome, difficulty arises when structural crossover or mutation is performed. For

example, when an offspring produced by crossover uses a connection that was not

used by any of its parents, it cannot obtain the corresponding weight value from the

parents. Instead this newly created weight has to be initialised with a random value.

Performing reproduction on the weight space itself is not likely to be a viable option

here since in general there is no one-to-one correspondence between the weight strings

of two different chromosomes. When this method is chosen a variable length GA has

to be used.

In Kitano's approach [38], [39] the NN is represented by a set of matrix grammar

rules that are encoded as a binary chromosome with fixed length. Each rule rewrites a

character into a 2x2 matrix of characters.

Kitano uses a constant and a variable part within the chromosome. The constant part

does not change and consists of the final rewriting rules. It would seem that there is no

point in coding these into the chromosome and in our implementation these 16 final

rewriting rules are set in the system and are the same for every chromosome. The LHS


of these constant rules is a character from the set 'a' to 'p'. The RHS is one of the

possible 2 × 2 matrices consisting of 0's and 1's. Thus the final, constant, rewriting

rules are:

The higher level grammar rewriting rules are coded in the chromosome and are

subject to the genetic operators. The starting character is always 'Start', to make sure

the initial rewriting step can always be performed. The other positions of the

chromosome are characters in the range 'A' to 'p'. A set of 5 characters defines a

rewriting rule, starting with the initial rewriting rule, where the LHS is always the

starting symbol 'Start'. By placing no restrictions on the rewriting rules, many rules

may be developed that rewrite the same character, or for some characters no rules may

be developed at all. Furthermore, many developed rules may never be used. Kitano

normally uses chromosomes with a length of 100, which corresponds to 100 / 5 = 20

generated rewriting rules. Examples of developed rewriting rules are:

Start→{A,b,c,A}, A→{a,a,c,b}, a→{a,A,b,b}

At the end of the M matrix rewriting cycles the connectivity matrix is formed out of

the acquired string. The size of this matrix, and therefore the maximum size of the

network (the number of neurons), is predetermined to be 2^M × 2^M. The connectivity

matrix consists of '1's and '0's. A '1' denotes a connection, a '0' no connection.

When, after the rewriting cycles, a position in the matrix is still a 'non-differentiated'

cell (i.e. neither a '1' nor a '0'), Kitano considers it to be dead and it is therefore set

equal to '0'. In the connectivity matrix the first n rows and columns correspond to the

input nodes and the last m rows and columns to the output neurons.
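The rewriting process can be sketched as follows; the rules shown are illustrative and not taken from Kitano's examples, and the rules of the last cycle are assumed to produce only '0' and '1':

```python
def develop(start, rules, cycles):
    """Expand a matrix grammar: starting from `start`, apply the
    rewriting rules for `cycles` cycles.  Every character becomes a
    2x2 block per cycle, so the result is a 2^M x 2^M matrix.  A
    character with no rule ('non-differentiated') develops into a
    block of zeros, following Kitano."""
    matrix = [[start]]
    for _ in range(cycles):
        size = len(matrix)
        new = [[0] * (2 * size) for _ in range(2 * size)]
        for r in range(size):
            for c in range(size):
                block = rules.get(matrix[r][c], ['0', '0', '0', '0'])
                new[2*r][2*c],   new[2*r][2*c+1]   = block[0], block[1]
                new[2*r+1][2*c], new[2*r+1][2*c+1] = block[2], block[3]
        matrix = new
    return [[int(x) for x in row] for row in matrix]

rules = {  # illustrative rules, not from the book
    'S': ['A', 'B',
          'C', 'A'],
    'A': ['0', '1', '0', '0'],
    'B': ['1', '1', '0', '1'],
}          # 'C' has no rule and therefore develops into zeros
print(develop('S', rules, 2))
```

Note how the single rule for 'A' stamps the same 2 × 2 block into two places of the 4 × 4 result: this is the mechanism by which the grammar encoding exploits regularity.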

The final network may need to be pruned because it is possible that nodes are created

that have no outgoing or no incoming link. Pruning is a repair mechanism and is one

of the possibilities to handle constraints in a GA. Other options include the

punishment of individuals that violate constraints (i.e. give them a poor fitness value)

or choosing the chromosomal representation in such a way that chromosomes that

violate constraints simply do not occur. Pruning may of course be combined with

punishment, so that the GA will prefer networks that need the least pruning.
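Pruning by repair could be sketched as follows, assuming nodes are indexed with inputs first and outputs last, as in the connectivity matrix described above (the function name is hypothetical):

```python
def prune(matrix, n_inputs, n_outputs):
    """Repair sketch: a hidden node with no incoming or no outgoing
    connection can never influence the output, so all of its
    connections are cut.  Repeat until nothing changes, since cutting
    one node's connections may strand another node.
    matrix[i][j] == 1 means a connection from node i to node j;
    inputs are the first n_inputs nodes, outputs the last n_outputs."""
    n = len(matrix)
    hidden = range(n_inputs, n - n_outputs)
    changed = True
    while changed:
        changed = False
        for h in hidden:
            has_in = any(matrix[i][h] for i in range(n))
            has_out = any(matrix[h][j] for j in range(n))
            if (has_in or has_out) and not (has_in and has_out):
                for k in range(n):  # cut every connection of node h
                    matrix[k][h] = matrix[h][k] = 0
                changed = True
    return matrix
```

Counting how many connections this removes would also give a natural penalty term if pruning is combined with punishment, as suggested above.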

10.3 The Modified Matrix Grammar

The matrix grammar described above is modified in that a hierarchy in the characters

is used. For example, when only three rewriting cycles are used (i.e. a connectivity

matrix of size 8), the following hierarchy may be used: 'Start' can be rewritten in a

matrix consisting of characters from {A,...,D}, and the characters {A,...,D} can be

rewritten in matrices consisting of characters from {a,...,p}. By fixing the LHS of every

rewriting rule, the problem of developing more than one rule for the same character is

avoided. Also, in the final matrix there will be no undifferentiated cells.

This method can be seen as a multiple level substitutional compressor (also known as

dictionary-based compressors), where the compressed string and the dictionary used

are generated using a GA and in the evaluation stage it is decompressed into a

connectivity matrix. The basic idea behind a substitutional compressor is to replace an

occurrence of a particular phrase or group of characters in a piece of data with a

reference to a previous occurrence of that phrase. In this context the starting character

'Start' could be seen as the compressed string and the rewriting rules as a

hierarchically ordered dictionary.

The structure of the chromosomes in this system is not the same as in a Structured

Genetic Algorithm (sGA). In an sGA a set of lower level genes is unique to one higher

level gene. In the approach described above, this is not the case.

Some problems also present in Kitano's system still remain. Many rules may never be

used and several characters could have identical rewriting rules, essentially making all

but one of them unnecessary. An idea could be to code rewriting rules in the

chromosome only when they are used, and leave them out otherwise. The problem

then is that when a character is referred to but has no rewriting rule, a (random) one

has to be made. Furthermore the restriction to matrix sizes of 2^M × 2^M still applies.

• Coding

This section describes how the characters are coded in the chromosome. Kitano uses

binary coding; e.g. 'a' = 0001, 'b' = 0010 etc. Depending on the GA software used, it

might be preferable to simply code the characters as symbols and we use this

representation. In effect it means that the crossover operator can only work on a group

of characters and thus will leave the characters themselves intact. Only mutation can

change the value of a character by re-initialising it with a random value. Using binary


coding the genetic operators can operate within the representation substring of a

character.

In implementing the system described above, we need as many alphabets as there are

rewriting cycles. Since there is a rewriting rule for every character of the alphabet, the

chromosome length defines the size of the alphabets or vice versa. For example a

connectivity matrix of size 8 requires 3 rewriting steps and therefore 3 alphabets, A1,

A2 and A3:

A1 = {A, ...}

A2 = {a, ..., p}

A3 = {1, 0}

The starting symbol, 'Start', can be seen as the alphabet A0. A rewriting step at level

1 rewrites the starting symbol 'Start' into a 2 × 2 matrix consisting of characters of

alphabet A1. In general a rewriting step at level i consists of rewriting a character of

alphabet Ai into a 2 × 2 matrix of characters of alphabet Ai+1. The characters of

alphabet Ai will be denoted by S^i_1, ..., S^i_ki, where ki is the cardinality of the

alphabet. So in the example above S^1_1 corresponds to 'A', S^2_3 corresponds to 'c',

k2 = 16, k3 = 2, and so on.

The last two alphabets are predefined and the same for any size matrix. Since we code

a rule for every character in an alphabet, the left-hand side (LHS) of every rule is

predefined and there is no need to code these in the chromosome. Thus a rewriting

rule is represented by its RHS consisting of 4 characters.

• Example

This is a simple example based on [38] of a chromosome representing a neural

network that can perform the XOR task. The network has two inputs and one output.

The system uses three rewriting cycles (M = 3) so that the size of the connectivity

matrix is: 2^M × 2^M = 8 × 8. The alphabets used are:

A1 = {A, B, C, D}

A2 = {a, ..., p}

A3 = {1, 0}

10.3 The Modified Matrix Grammar 139

This particular configuration can be described by k1 = 4, since the alphabets A2 and A3

are pre-defined. An example of a chromosome representing an XOR network is:

The fixed rewriting rules are not part of the chromosome, but are embedded in the

system and are identical to the ones used in Kitano's grammar; i.e.:

This chromosome is translated into the neuron connectivity matrix by means of the

following rewriting cycles:

Figure 10.2 An example of the rewriting cycles for a simple XOR network

140 Chapter 10. Using a GA with Grammar Encoding to Generate Neural Networks

All hidden and output neurons have a connection from the bias neuron, which has a

constant activation value of 1. These connections are not represented in the

connectivity matrix. One entry in the connectivity matrix needed to be pruned. In the

connectivity matrix a connection between neurons 5 and 7 is encoded, but since

neither of these neurons is connected to any other neuron the connection is useless.

As can be seen above a rewriting rule for the character 'C' is coded in the

chromosome but it is not actually used. Because only feedforward neural networks are

wanted, with no connections into input neurons and none out of output neurons,

only the highlighted upper-right part of the connectivity matrix is used. Therefore the

method is not very 'clean': some information that is contained in the chromosomes is

never used (i.e. the lower-left part of the matrix). The corresponding neural network is

shown in Figure 10.3.

Figure 10.3 The neural network constructed from the matrix grammar

One of the main problems of the matrix grammar approach is how to decide on the

alphabet size for a certain rewriting cycle. If the alphabet size of the nth rewriting

cycle is too small, the number of different sub-matrices that can be coded in that

cycle may be too small. In the example described above, if the alphabet size of the

first rewriting cycle, k1, is set to 2 instead of 4 (i.e. A1 = {A,B} instead of {A,B,C,D}),

the 8×8 connectivity matrix can only be made up of two different 4×4 sub-matrices.

There is no way to code more than two different 4x4 sub-matrices, while the complete


matrix is made up of four. If it must be possible for no two sub-matrices to be

identical (e.g. A1 = {A,B,C,D} or larger), the chromosome will

become very large for larger connectivity matrices. The cardinality of every alphabet

must then be equal to (or larger than) the number of corresponding sub-matrices in the

complete matrix. For example a 16×16 matrix would require the following (minimal)

configuration: k1 = 4 (the number of 8 × 8 matrices), and k2 = 16 (the number of 4 × 4

matrices). The resulting chromosome would have a length l = 4 + 4·4 + 4·16 = 84. In

the case of a 32 × 32 matrix, the corresponding chromosome length would be: l = 4 +

4·4 + 4·16 + 4·64 = 340. In general, the chromosome length needed for a system with

M rewriting cycles (i.e. a matrix of size 2^M × 2^M) is:

l = Σ_{n=1}^{M−1} 4^n = 4 + 4^2 + ... + 4^(M−1)
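The worked lengths above (84 for a 16×16 matrix, 340 for a 32×32 matrix) can be checked with a few lines (a trivial sketch):

```python
def chromosome_length(m_cycles):
    """Chromosome length l = sum of 4^n for n = 1 .. M-1, for a system
    with M rewriting cycles (a connectivity matrix of size 2^M x 2^M)
    in which every sub-matrix at every level can be coded uniquely."""
    return sum(4 ** n for n in range(1, m_cycles))

print(chromosome_length(4), chromosome_length(5))  # 84 340
```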

Thus for larger size matrices the chromosomes will become large indeed. Still, this

chromosome length is smaller than that needed in the direct encoding scheme to code

the same size matrix. In the direct encoding scheme there exists a one-to-one

correspondence between the genes in the chromosome and connections in the neural

network. Each gene codes one connection. This scheme will be described in section

10.5.

A different approach can also be taken. Since, for example, in the 16×16 matrix scheme every one of the 4×4 sub-

matrices can be represented uniquely, an option is to let the chromosome simply

code these sixteen 4 × 4 matrices without the first two matrix rewriting rules. The

chromosome length would then be l = 4·16 = 64, a quarter of the chromosome length

that would have been needed in a direct encoding scheme. The

variable rewriting rules in the chromosome are replaced by fixed ones that are part of

the system. In such a scheme however it is no longer possible to encode regularities in

the final matrix and the chromosome length will still be very large for bigger

problems. It does however seem reasonable to leave out the rewriting rules of the first

rewriting cycle. With an alphabet size, k1, of 4 every 'quarter' of the connectivity

matrix can be represented uniquely with a character of the first alphabet, A1. It seems

reasonable to assume that most networks will in general not have identical matrix

quarters so that the first rewriting rule will be for example: 'Start'→{C,B,A,D} and

not something like 'Start'→{A,A,B,A}. The first rewriting rule may then be left out of

the chromosome altogether, and replaced by a fixed rule whose third entry expands into a

sub-matrix filled with zeros, since the third quarter of the connectivity matrix is never used

in the representation of feedforward networks. The same procedure could in fact also

quite easily be followed for the second rewriting cycle. Instead of coding the

corresponding rewriting rules in the chromosome, one could have fixed rewriting rules

like: A->{A',B',C',D'}, B->{E',F',G',H') etc. These concepts have not yet been

investigated, but they are suggested as a logical extension to the work carried out to

date.

The principal goal is to limit the alphabet sizes so that the chromosomes will be of

manageable length. The resulting connectivity matrix is then made up of sub-matrices

that cannot all be unique, with some sub-matrices being used more than once. This

does of course place restrictions on the neural network structures that can be

generated, since they must have some form of regularity in their connectivity matrices.

The severity of these restrictions depends on the alphabet sizes used. Many problems

can be very adequately solved however using neural networks with a high level of

regularity. The classic case of a fully connected feedforward network contains for

example a very high level of regularity in the connectivity matrix, as is illustrated in

Figure 10.4 which shows the 16x16 connectivity matrix of a fully connected '4-4-3'

neural network. In this particular case the 5th, 6th, 7th and 8th columns and rows in the

matrix correspond to the four hidden neurons. Overall, significant reductions in

chromosome complexity may be obtained if relatively regular and repeating structures

are acceptable for the evolved neural networks.

• Competing Conventions

In a connectivity matrix as in Figure 10.4 a hidden neuron can be represented by any

one of the rows/columns 5 to 13. The first four and the last three rows/columns are

reserved for the input and output neurons. This means for example that the fully

connected '4-4-3' network can be represented by (9 choose 4) = 126 different connectivity

matrices, so the genetic representation clearly suffers from competing conventions

(see section 3.1.4). Many different matrices, and therefore many different

chromosomes, can be used to represent the same neural network structure.


Figure 10.4 The connectivity matrix of a fully connected '4-4-3' neural network

In fact the competing conventions problem is even worse than the above analysis

indicates. Using the matrix grammar scheme, many different chromosomes can be

used to represent one and the same connectivity matrix. This is illustrated below

where two different chromosomes code the same 8x8 connectivity matrix pictured in

Figure 10.2 that corresponds to the XOR network.

Figure 10.5 Two different chromosomes that represent the same connectivity matrix

The reason for this type of competing convention is that a position within the

chromosome does not correspond to a fixed position in the connectivity matrix.

• Representation

The matrix grammar representation scheme relies not just on two but on three

different spaces: the representation space (the chromosomes), the evaluation space

(the neural networks) and an intermediate space of connectivity matrices. This is

illustrated in Figure 10.6. Both the mapping from the

representation space to the intermediate space and from the intermediate space to the

evaluation space suffer from competing conventions.

Figure 10.6 The three spaces used in the matrix grammar representation scheme.

Two examples of competing conventions are shown

As stated above, depending on the alphabet sizes used there will be a restriction on the

network structures generated. The system will in general not be able to generate any

arbitrary feedforward neural network structure. The connectivity matrix must contain

some kind of regularity, so, the evaluation space is only part of the complete problem

space, where the problem space is defined as the complete set of feedforward

networks subject to the appropriate number of in and outputs.

10.4 Combining Structured GAs with the Matrix Grammar

The chosen chromosomes will have a two level hierarchy. The top level is a string of

characters that codes the matrix rewriting rules (the structural part), the bottom level is

a real-valued string of the weights of the connections of a fully connected feedforward

network (the parametric part). These weights are coded as a long real-valued string

corresponding to the 'upper-right half' of the connectivity matrix, column by column

starting from the left. Added to each column is the weight connected to the bias unit.

The set of weights in use by the actual network is only a subset of this parametric part.

This is illustrated by Figure 10.7.

The first three entries in the parametric part of the chromosome correspond to the

incoming connections of neuron 3 (i.e. column 3 with the bias weight added). The

next four entries correspond to neuron 4, etc. The 6th entry corresponds to the

connection from neuron 3 to neuron 4. Since this connection is absent in the neural


network, the corresponding weight value is simply ignored (as are most of the weight

values contained in the chromosome). There are 26 (upper-right half of the matrix of

Figure 10.2) + 6 (bias weights of all neurons) = 32 weight values represented in the

parametric part of the chromosome while only 9 of these are in use by the network.
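The column-wise layout can be made concrete. In the sketch below (the helper names are ours), a weight's 1-based position in the parametric string is computed from its (row, column) position in the connectivity matrix; input columns carry no entries and are skipped:

```python
def weight_index(i, j, n_in):
    """1-based position in the parametric string of the weight for the
    connection from neuron i to neuron j (1-based, i < j).  Each non-input
    column j contributes its j-1 upper-half entries followed by one bias
    weight, i.e. j entries in total."""
    offset = sum(k for k in range(n_in + 1, j))   # entries of earlier columns
    return offset + i

def bias_index(j, n_in):
    """1-based position of neuron j's bias weight."""
    offset = sum(k for k in range(n_in + 1, j))
    return offset + j                              # bias follows the j-1 entries

# XOR example from the text (2 inputs): neuron 3's incoming weights occupy
# positions 1-3, and the 6th entry is the (unused) connection 3 -> 4.
print(weight_index(1, 3, n_in=2), bias_index(3, n_in=2), weight_index(3, 4, n_in=2))
```

This reproduces the text's bookkeeping: entries 1-3 belong to neuron 3 and the 6th entry is the weight of the (here unused) connection from neuron 3 to neuron 4.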

A separate back propagation training module is used to evaluate the neural networks.

After training, the parametric part of the chromosome is updated with the trained

weights. The parametric changes that result from running the training module are

carried onto future generations by means of weight transmission. No parametric

changes are performed within the GA itself. Only the structural part of the

chromosome is subject to the genetic operations. The working of this system is

illustrated in Figure 10.8.

Figure 10.8 The sGA system with a separate Back Propagation training module


Since only feedforward neural networks are generated in the present application, the

number of possible connections, or the maximum complexity, Cmax, when in total N

neurons are used is:

Cmax = (N² − N)/2 − (Nin² − Nin)/2 − (Nout² − Nout)/2

where:

N = total number of neurons

Nin = number of inputs

Nout = number of output neurons

The total number of neurons includes the input units by the accepted convention, but not the bias unit. Cmax is the complexity of a fully

interconnected network specified by the total number of units, the number of inputs

and the number of outputs. In other words it is the number of entries in the upper-right

half of the connectivity matrix.

The last two terms in Cmax indicate that input neurons are not allowed to have

incoming connections and that there are no outgoing connections from output neurons.

This means that in the connectivity matrix, columns corresponding to input neurons do

not have any entries (or the entries are simply discarded) and the

rows corresponding to the output neurons are empty.
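In code the count is straightforward. The function below (name ours) reproduces the figures quoted elsewhere in this chapter, e.g. 487 possible connections for a 32-neuron, 4-input, 3-output configuration and 27 for the 8x8 XOR matrix:

```python
def c_max(n, n_in, n_out):
    """Maximum number of feedforward connections (bias links excluded):
    the upper-right half of the n x n matrix, minus the empty entries in
    the input columns and the output rows."""
    tri = lambda m: (m * m - m) // 2   # entries in an upper-right half
    return tri(n) - tri(n_in) - tri(n_out)

print(c_max(32, 4, 3))   # 487, as quoted for the direct encoding example
print(c_max(8, 2, 1))    # 27, the XOR matrix of Figure 10.2
```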

For example, a network with 35 inputs, 10 outputs and a maximum of 20 hidden neurons has 2037 possible feedforward connections. This

could impose quite severe strains on computational facilities especially when a large

population size is needed. A '35-20-10' fully connected feedforward network uses

only 900 of these 2037 possible weights. If parametric changes such as weight-

mutations were implemented in the GA system, it would mean that, for such a

network, these changes would affect the active weights in less than 50% of the cases.

It would seem that this is a wasteful procedure and that parametric changes can best be

performed on the active weights only. However, this causes a problem when a

grammar encoding such as the one proposed is used: the set of active weights cannot be

directly read from the top level of the chromosome. This top level first needs to be translated into the neural network connectivities, and this would have to be done for every such parametric change. Since no parametric changes are performed within the GA itself, this is of no concern.

During recombination the genetic operators work on the top level of the chromosome.

When two neural network structures reproduce, the weights of the resulting offspring

are initialised using weight transmission as described above. Crossover and mutation

are implemented as normal.

10.4.2 Evaluation

When a chromosome is evaluated into a neural network the top level of the

chromosome needs to be translated into the connectivity matrix. This matrix is then

pruned so that there are no hidden neurons without incoming or outgoing connections.

After this step the matrix is transformed into a neural network which, after the training

phase (backpropagation), is then tested on a set of training patterns. The network uses

the values of the weights of the bottom level of the chromosome that correspond to the

connections used. The fitness value will reflect the error on this training set and can

optionally include a measure of the network's complexity. The amount of pruning that

was necessary can also be reflected in the fitness as a negative measure. In [66]

networks that were less complex (i.e. less connections) were preferred over more

complex ones when both networks achieved a comparable performance on the training

data. In this way, minimal complexity neural networks can evolve. After training, the

parametric part of the chromosome is updated with the trained values of the weights.

Back propagation training is performed for a set number of cycles, rather than the

more customary process of stopping at a required error level, since convergence

cannot be assumed for all networks. The optimal number will depend on the problem

and on the training set used. The back propagation module is a standard one using the

normal gradient descent weight updating rule with a momentum term. Default values

for the learning rate and the momentum term are 0.1 and 0.9 respectively. The module

uses a 'per-pattern' weight update mechanism, meaning that the weights are updated

after every presentation of a training pattern (and not after a presentation of the

complete training set).
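The per-pattern update rule can be sketched as follows; here grad_fn stands in for the back propagation gradient of a single pattern (the loop structure, not the network code, is the point, and the helper names are ours):

```python
import numpy as np

def train_per_pattern(weights, grad_fn, patterns, cycles, lr=0.1, momentum=0.9):
    """Gradient descent with a momentum term, updating the weights after
    every single pattern presentation rather than once per epoch."""
    velocity = np.zeros_like(weights)
    for _ in range(cycles):
        for x, target in patterns:
            g = grad_fn(weights, x, target)        # dE/dw for this pattern
            velocity = momentum * velocity - lr * g
            weights = weights + velocity
    return weights

# Toy check on E(w) = (w - 3)^2, gradient 2(w - 3): converges to w = 3.
w = train_per_pattern(np.array([0.0]),
                      lambda w, x, t: 2.0 * (w - t),
                      [(None, 3.0)], cycles=500)
print(round(float(w[0]), 4))
```

The defaults mirror the module's settings (learning rate 0.1, momentum 0.9); a fixed number of cycles is used, since convergence cannot be assumed for every network.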


The fitness function used is:

f(x) = E(x)/Nout + α·C(x)/Cmax + β·P(x)/Cmax

where: E(x) = the cumulative squared error of the individual on the training set

Nout = the number of output neurons

C(x) = the number of connections in the network (including the bias

weights)

Cmax = the maximum complexity for the specific configuration

P(x) = the number of connections that had to be pruned

The genetic algorithm works in such a way that the fitness measure is minimised

instead of maximised. The relative weight of the complexity and pruning terms can be

set by α and β. The optimal values are problem dependent and possibly quite hard to

find. Optionally α and/or β can be set to 0 so that the corresponding term(s) do not

have any influence on the fitness.
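A minimal sketch of such a fitness measure (the exact normalisation is reconstructed from the surrounding text, so treat the precise form as an assumption):

```python
def fitness(error, n_out, conns, c_max_val, pruned, alpha=0.0, beta=0.0):
    """Raw fitness to be *minimised*: normalised training error plus
    weighted complexity and pruning penalties."""
    return error / n_out + alpha * conns / c_max_val + beta * pruned / c_max_val

# With alpha = beta = 0 only the error term remains.
print(fitness(3.0, 3, 20, 100, 5))                        # 1.0
print(fitness(3.0, 3, 20, 100, 5, alpha=10.0, beta=1.0))  # 1.0 + 2.0 + 0.05 = 3.05
```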

10.5 Direct Encoding

For comparison with the matrix grammar approach, a direct encoding scheme was also implemented. Details are given in this section.

The direct encoding scheme differs from the matrix grammar scheme described above

in the representation of the network structure. The parametric part of the chromosome

that codes the weights is identical. Direct encoding is implemented as a bit-string that

directly represents the neural network with a one-to-one correspondence between the

genes and the connections of the network. The bit-string is simply the upper-right half

of the connectivity matrix that defines the neural network structure. As with the matrix

grammar scheme the size of the matrix must be set a priori. Thus the maximum

number of neurons is pre-defined. In contrast to the matrix grammar scheme however,

the direct encoding scheme is not restricted to matrix sizes that are powers of 2 and any matrix

size can be used.

As an example of how the structural part of the chromosome is translated into a neural

network, the same XOR network of the last section is considered. The matrix size


used is the same: M = 8. The upper-right half of the connectivity matrix (the

highlighted part of Figure 10.2) consists of 27 bits. The structural part of the chromosome

contains these bits 'row-wise'. The network structure is now represented by the

following bitstring:

The matrix is then translated into the same XOR neural network structure as in Figure

10.3.

Advantages of using direct encoding over the matrix grammar method are that the

matrix size is not restricted to values of 2^m. Furthermore the method is cleaner in that

only the upper-right half of the connectivity matrix is coded in the chromosome. A

disadvantage is that the chromosome length increases rapidly with network size and

that it offers no way to code certain regularities in the network structure. Since the

chromosome length equals the maximum number of feedforward connections given

the maximum number of neurons (i.e. the matrix size) and the number of in- and

outputs, it is given by Cmax (see the last section). For example a 4-input, 3-output neural

network with a maximum total of 32 neurons requires a chromosome length of 487

with the direct encoding scheme.
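The one-to-one mapping between bit-string and matrix can be sketched directly (helper names ours):

```python
import numpy as np

def matrix_to_bits(conn):
    """Read the upper-right half of the connectivity matrix row by row."""
    n = conn.shape[0]
    return [int(conn[i, j]) for i in range(n) for j in range(i + 1, n)]

def bits_to_matrix(bits, n):
    """Inverse mapping: rebuild the matrix from the structural bit-string."""
    conn = np.zeros((n, n), dtype=int)
    it = iter(bits)
    for i in range(n):
        for j in range(i + 1, n):
            conn[i, j] = next(it)
    return conn

rng = np.random.default_rng(0)
m = np.triu(rng.integers(0, 2, size=(8, 8)), k=1)   # random 8x8 feedforward matrix
bits = matrix_to_bits(m)
print(len(bits))                                    # 28 = (8*8 - 8) / 2
print(np.array_equal(bits_to_matrix(bits, 8), m))   # round trip succeeds
```

This sketch keeps all n(n−1)/2 upper-half entries; in the scheme described, entries in input columns and output rows are fixed empty, which is why the text quotes 27 usable bits for the 8x8 XOR case.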

The direct encoding scheme also suffers from competing conventions since a single

neural network structure may be represented by various connectivity matrices. In

contrast to the matrix grammar scheme however, there is a one-to-one correspondence

between chromosomes and connectivity matrices: it is not the case that one

connectivity matrix can be represented by different chromosomes.


Theoretical analysis (such as the Schema Theorem) suggests that for good

performance of the GA, functionally close genes should be close together on the

chromosome so that they are not easily disrupted by crossover. In the direct encoding

scheme this can be taken to mean that connections belonging to the same neuron

should be close together on the chromosome. Since the connections are coded in the

chromosome row by row this is true for the outgoing connections of a neuron. The

incoming connections of a neuron however can be located very far apart. This is

caused by the mapping of the two-dimensional network structure onto a one-

dimensional linear chromosome. Part of the information concerning the position of

connections in the network is lost. A remedy could be to use the two-dimensional

positional information in the genetic operators (crossover, mutation). In effect a

chromosome could be treated as a direct two-dimensional representation of the

connectivity matrix, and crossover could for example be implemented as swapping

parts of rows (incoming connections) and parts of columns (outgoing connections) or

even areas in the matrix (functional groups of neurons). These approaches have not

yet been attempted, but are suggested as directions for future research.
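As a sketch of this suggested (and, as noted, not yet attempted) direction, crossover could swap a slice of a row between two parent matrices, i.e. part of one neuron's outgoing connections; the function and parameter names below are hypothetical:

```python
import numpy as np

def row_slice_crossover(parent_a, parent_b, row, col_lo, col_hi):
    """Exchange conn[row, col_lo:col_hi] between two connectivity matrices,
    preserving the two-dimensional position of the swapped genes."""
    child_a, child_b = parent_a.copy(), parent_b.copy()
    child_a[row, col_lo:col_hi] = parent_b[row, col_lo:col_hi]
    child_b[row, col_lo:col_hi] = parent_a[row, col_lo:col_hi]
    return child_a, child_b

a = np.zeros((6, 6), dtype=int)                  # empty parent
b = np.triu(np.ones((6, 6), dtype=int), k=1)     # fully connected parent
ca, cb = row_slice_crossover(a, b, row=2, col_lo=3, col_hi=6)
print(ca[2, 3:6].tolist())   # [1, 1, 1] taken from parent b
print(int(ca.sum()))         # only the swapped slice differs from parent a
```

Column slices (incoming connections) or rectangular areas (functional groups of neurons) could be exchanged in the same way.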

10.6 Network Pruning and Reduction

In the evaluation stage of both the matrix grammar and the direct encoding systems,

the generated networks are first pruned before they are trained using the back

propagation module. Pruning removes (hidden) neurons and their links that have no

incoming or outgoing connections. It does this recursively until all such neurons have

been removed. Figure 10.9 shows an example of network pruning. Also shown is

network reduction where neurons that only have one incoming link are discarded and

the links reorganised.
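A sketch of the recursive pruning step (network reduction is omitted, as in the systems described; the helper name is ours):

```python
import numpy as np

def prune(conn, n_in, n_out):
    """Repeatedly strip the links of hidden neurons that have incoming or
    outgoing connections but not both, until no such neuron remains."""
    conn = conn.copy()
    n = conn.shape[0]
    changed = True
    while changed:
        changed = False
        for h in range(n_in, n - n_out):
            has_in = conn[:, h].any()
            has_out = conn[h, :].any()
            if (has_in or has_out) and not (has_in and has_out):
                conn[:, h] = 0        # drop the dangling neuron's links
                conn[h, :] = 0
                changed = True
    return conn

# Neurons 0,1 inputs; 2,3 hidden; 4 output.  The chain 0 -> 2 -> 3 is a
# dead end (3 never reaches the output), so pruning removes it entirely.
c = np.zeros((5, 5), dtype=int)
c[0, 2] = c[2, 3] = c[1, 4] = 1
p = prune(c, n_in=2, n_out=1)
print(int(p.sum()), int(p[1, 4]))   # only the direct link 1 -> 4 survives
```

The example shows why the step must be recursive: removing neuron 3 strands neuron 2, which is removed on the next pass.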

The network reduction stage is not implemented in the GA systems, although it would

somewhat reduce the computational cost required in the learning stage of the network.

Networks can be penalised for the amount of pruning necessary (the number of links

that needed to be removed) by setting the parameter β in the fitness function to an

appropriate value. The potential need for network reduction is in a way penalised by a

greater complexity term (regulated by α) in the fitness function. Assuming that both

networks on the right have a (near) equal error on the training set, the network after

pruning is preferred to the one before pruning with a setting of α > 0.


Figure 10.9 Network pruning (first step) and reduction (second step). In the current

set-up only network pruning is performed in the evaluation phase just before training

10.7 Experiments

In our preliminary experiments we have implemented an sGA system as described

above, where not only the structure but also the weights of the network are coded in

the chromosomes. The GA software used was SUGAL, v1.0; see section 9.1 for a

description. The changes made to the mutation operators also apply here; i.e. a gene is

mutated with probability pm.

The matrix grammar approach is compared to the direct encoding scheme described in

the last section. The same data-sets that were introduced in section 9.3.1 are used.

10.7.1 Set-up

The structural part of the chromosomes uses symbolic coding and the weights are

coded using real-valued coding. The symbolic coding uses integer-valued genes taken

from the alphabet {0, ..., k−1}, where k is the alphabet size or cardinality. The coding

of the structural part is, in general, non-homogeneous in that it consists of different

parts each having its own alphabet size. These parts correspond to the rewriting rules

for one or more rewriting cycles. In the present implementation only homogeneous

chromosomes were used because implementing non-homogeneous chromosomes in

the GA software used is quite difficult. The alphabet size was normally set to k = 16.

When for example some part of the chromosome has an alphabet size of 4, the

character set is reduced from 16 to 4 during evaluation by the following rules. If the


gene lies in the range 1, ..., 4 the corresponding character will have a value of 1; if it lies

in the range 5, ..., 8 the character will have a value of 2, etc.
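The reduction rule maps equal-sized blocks of the full character set onto the smaller alphabet; a sketch following the 1-based ranges given above (function name ours):

```python
def reduce_char(gene, k_full=16, k_target=4):
    """Map a gene from {1, ..., k_full} to {1, ..., k_target}:
    1..4 -> 1, 5..8 -> 2, 9..12 -> 3, 13..16 -> 4 for the default sizes."""
    block = k_full // k_target
    return (gene - 1) // block + 1

print([reduce_char(g) for g in (1, 4, 5, 8, 16)])   # [1, 1, 2, 2, 4]
```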

10.7.2 Results

Preliminary experiments were performed on the simple XOR data set. Because of the

very small scale of this problem it is difficult if not impossible to make any

comparison between various settings and systems. The XOR problem is interesting in

the sense that the minimal network that is able to solve it is known. It is shown in

Figure 10.10 and consists of two inputs, one hidden neuron and one output neuron.

The GA system was run with a small 8 x 8 connectivity matrix. The matrix grammar

scheme was used with an alphabet size for A1 of k1 = 4 (A1 = {'A', ..., 'D'}), resulting in a

chromosome length of 4 + 4*4 = 20. The alphabets A2 and A3 are the predefined ones:

A2 = {'a', ..., 'p'} and A3 = {'0', '1'}. One of the main problems is to decide on the

number of back propagation cycles that the networks are trained for each time they are

evaluated. For example, the number of cycles needed for the minimal XOR network to

learn the task within a tolerance of 0.4 is on average about 300 but the actual number

depends strongly on the values of the initial weights. Also, other networks may need

many more cycles to learn the XOR task.

Figure 10.10 The 'minimal XOR' neural network. This is the lowest-complexity

network structure that is able to solve the XOR problem

The number of back propagation cycles was initially set to 500. The pruning term in

the fitness function was 'turned off': β = 0.0. Experiments were performed with

several settings of the complexity measure α. It was observed that even for quite small


values of α, the GA system converged to a network that had no connections at all (i.e.

C(x) = 0). Such a network can still achieve a reasonable fitness value simply because

its complexity is so low. Since these networks have nothing to offer in the GA search,

the fitness function was changed. If a network has no connections at all its (raw)

fitness, f(x), is simply set to an extremely high value resulting in a normalised fitness

of effectively zero. The same could be done for networks that fall below a certain

level of complexity. In the case of the XOR problem all networks with a complexity

below seven could be eliminated this way. Since, in general, the minimal complexity

required to solve a problem is not known in advance this approach could not be

universally applied. However, in many cases it might be reasonable to assume that the

networks generated should have outgoing links from all inputs and incoming links to

all output neurons. The minimum level of complexity can then be set equal to the

number of inputs plus the number of outputs times two (incoming connections plus

connections to the bias unit). This approach was applied in our testing where the

minimal complexity was set to:

Cmin = Nin + 2·Nout

with the fitness overridden as:

f(x) = { an extremely high value,  if C(x) < Cmin
       { f(x),                     otherwise

Of course this still does not guarantee that all the inputs and outputs are used, but it

avoids the generation of useless very low complexity neural networks.
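The override can be sketched as a wrapper around the raw fitness (the name and the value of the 'extremely high' constant are ours):

```python
def enforce_min_complexity(raw_fitness, conns, n_in, n_out, huge=1e9):
    """Assign an extremely high (i.e. worst, since fitness is minimised)
    value to any network below the minimal complexity n_in + 2*n_out."""
    c_min = n_in + 2 * n_out
    return huge if conns < c_min else raw_fitness

# XOR case: 2 inputs and 1 output neuron give C_min = 4.
print(enforce_min_complexity(0.5, conns=3, n_in=2, n_out=1))  # eliminated
print(enforce_min_complexity(0.5, conns=5, n_in=2, n_out=1))  # kept: 0.5
```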

It was observed that the performance of the system on the XOR problem depends very

much on the number of back propagation cycles. If this number is too low the neural

networks are not properly tested on their task. With a setting of α = 1.0 and for

example a number of BP cycles of 100, the GA always converges to a neural network

with only five connections (= Cmin + 1). This network uses one hidden neuron which

has one connection to an input and one to the bias unit. It has a poor performance on

the training set: E(x) ≈ 1 and two of the four training patterns are misclassified. With

the number of BP cycles set to 500 however, the GA finds the minimum XOR network

as pictured in Figure 10.10 on average in as little as 3 generations. The specific GA

settings are given in Table 10.1.


Table 10.1 GA matrix grammar settings for NN optimisation concerning the XOR

problem with matrix size 8x8

Parameter                Setting
α                        1.0
β                        0.0
BP cycles                500
coding                   symbolic
crossover type           two point
elitism                  on
fitness normalisation    reverse linear ranking with bias = 10.0
l                        20
mutation type            normal
N                        50
p_c                      0.8
p_m                      0.005
re-evaluation            off
replacement mechanism    unconditional
selection mechanism      Roulette

No further experiments were performed on the XOR problem because it is very hard

to make any comparative statements from simulations on such a relatively simple

problem.

The iris flower data has 4-element vectors and 3 classes. When trained with back

propagation a fully connected feedforward neural network performs well on the

training set with a single hidden layer consisting of 4 neurons; i.e. a '4-4-3' network.

This network has a total of 11 neurons (not counting the bias unit) and its number of

connections, the complexity, is 4*4 + 4*3 + 4 + 3 = 35 (including the connections of

each neuron to the bias unit).
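The complexity arithmetic generalises to any fully connected layered network (the helper name is ours):

```python
def layered_complexity(layers):
    """Connections of a fully connected layered network, counting one bias
    link for every non-input neuron, e.g. (4, 4, 3) for the '4-4-3' net."""
    between = sum(a * b for a, b in zip(layers, layers[1:]))
    bias = sum(layers[1:])
    return between + bias

print(layered_complexity((4, 4, 3)))   # 4*4 + 4*3 + 4 + 3 = 35
```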

A 16x16 connectivity matrix was used. Therefore 4 rewriting steps were needed, resulting in a network that uses at most 16 neurons (not including bias). The alphabet sizes for these steps were: k1 = 4 and k2 = 16. The last two alphabets (A3 and A4) have their standard cardinality of k3 = 16: {'a', ..., 'p'} and k4 = 2: {'0', '1'}. Since the (fixed) rewriting rules for the last cycle are not coded in the chromosome, the chromosome consists of 4 + 4*4 + 16*4 = 84 symbols.

The matrix grammar and the direct encoding approach are compared on the neural

network optimisation problem concerning the iris data. Both systems used a maximum

number of neurons of 16 and were run for 200 generations. The GA settings are

shown in Table 10.2. The settings for both systems are identical except of course for

the chromosomal representation used.

Table 10.2 GA settings for NN optimisation using Iris data with matrix size 16x16

Parameter                Matrix Grammar                             Direct Encoding
α                        10                                         10
β                        0                                          0
BP cycles                1                                          1
coding                   symbolic                                   binary
crossover type           two point                                  two point
elitism                  on                                         on
fitness normalisation    reverse linear ranking with bias = 10.0    reverse linear ranking with bias = 10.0
l                        84                                         111
mutation type            normal                                     normal
N                        50                                         50
p_c                      0.8                                        0.8
p_m                      0.005                                      0.005
re-evaluation            on                                         on
replacement mechanism    unconditional                              unconditional
selection mechanism      Roulette                                   Roulette

Figure 10.11 shows the fittest individuals at the end of a typical run for both systems.

Both networks shown have hidden neurons that have one incoming link only (neuron

11 in the top network and neuron 12 in the bottom); they can be removed and the links

can be reorganised using network reduction (not shown) resulting in networks with a

complexity of 18 and 19 respectively.


Figure 10.11 The best individuals of a run for the Iris data, matrix grammar vs direct

encoding, with the corresponding fitness, f(x), error on the iris training set, E(x), and complexity,

C(x). The values of the weights are not shown. Both networks misclassified one

training pattern. Every neuron is connected to the bias unit (not shown)

Both systems showed very similar behaviour on this problem. Convergence curves of

the best individual in the population vs generation were nearly identical. The best

individuals as shown above are just examples of one particular run of the GAs. The

particular run resulting in the top neural network is seen in Figure 10.12. To get an

idea of the computational requirements, the run took about 20 minutes on a

Sun-Sparc 4 workstation.


Figure 10.12 The GA run with matrix grammar system resulting in the top network

of Figure 10.11

In Figure 10.13 the same GA run is shown but this time the complexity, C(x), and the

error on the training set, E(x), of the fittest individual are shown as well as the mean

complexity and error in the population. The complexity and error of the fittest

individual do not, of course, have to correspond to the lowest complexity and error values

found in the population.

It is interesting to note that the level of complexity of the fittest individual no longer

changes after about 70 generations. The search seems to be stuck at a certain network

structure and the only change in fitness is due to the further learning of the weights.

With the GA settings chosen, every individual of the population is re-evaluated at the

start of the generation. If the fittest individual remains the fittest over several

generations, due to elitism its structure will not be changed and at the start of every

generation its weights will be further refined in training resulting (in general) in a

further decrease in error and therefore a better fitness. Without re-evaluation however

very similar results were obtained. After the initial phase the only change in the fitness

of the fittest individual was caused by a decrease in error. Apparently a number of

individuals in the population share the same network structure and, when evaluated,


one of them will replace the fittest. The search still seems to be mainly centred around

a single network structure.

Figure 10.13 The same GA run showing the complexity C(x) and error E(x) of the

fittest individual as well as the mean complexity and error of the population

The neural networks found on this problem using the particular GA settings had a low

level of complexity and generally used quite a few direct connections from input to

output neurons. The networks had on average a complexity of 20. When tested on the

iris test data the neural networks performed well. For example the top network in

Figure 10.11 produced an error of 5.46 on the test data with 4 patterns misclassified

(tolerance = 0.4). So 95% of the test set was correctly classified. Further training of

the network using back propagation did not improve on this (nor on the performance

on the training set itself). This performance is similar to a fully connected '4-4-3'

neural network (complexity = 35) that has been trained for 1000 cycles and produced

a 100% correct classification of the training set with an error of 0.003. This trained

network produces an error of 5.89 on the test data with 3 patterns misclassified.

So despite the relatively low level of complexity and the relatively poor performance

on the training set, the generalisation capabilities of the neural networks found with


the specific settings of the GA system are good. It is interesting to note that both the

networks of Figure 10.11 could not correctly classify all the patterns in the training

set. No matter how long they were trained using back propagation, one pattern

remained misclassified. It seems that the structure of the networks simply does not

allow for a 100% correct classification of the training data. For example in the case of

the top network, it could very well be that in order for the pattern in question to be

correctly classified connection(s) from the first input neuron are needed.

The trade-off between the error term and the complexity term in the fitness function is

a difficult one. If the complexity measure, α, is too low, the system invariably

converges to networks with a very high level of complexity. If, on the other hand, α is

too high, the system generates networks with very low complexity despite the fact that

they give quite poor performance on the training set. Figure 10.14 shows a typical GA

run with the same settings as in Figure 10.13 except that this time the complexity term

was turned off: α = 0.0. On all runs the average complexity of the best individual

found was around 80 and the average complexity of the population was about the

same. In all cases the error of the best individual as well as the mean error of the

population was very small and the networks always correctly classified the entire

training set. It is interesting to observe that in the first 30 or so generations the mean

complexity of the population actually increases from around 60 to 80. This was true

for all the runs. A higher complexity seems to benefit the performance on the training

set for the particular settings.

Several runs were performed for α = 5.0, 20.0 and 50.0. It was found that the setting α

= 5.0 produced results similar to those observed with α = 10.0 but the networks had a

somewhat higher level of complexity: on average C(x) = 25. With α = 20.0 the

average complexity was around 15. Most networks (80%) had error levels of around

3.0 but some runs produced networks with much higher errors (such as E(x) = 20) that

misclassified up to 20 training facts. Most of these networks had a very low level of

complexity. This effect was even stronger for the case α = 50.0. The average

complexity of the best individuals was 10 with some networks having as few as 4

connections. Some runs produced networks with a complexity of 11 using only one

hidden neuron that had a very similar performance on the training set as those pictured

in Figure 10.11 (E(x) ≈ 3, 1 pattern misclassified).


Figure 10.14 An example of a GA run with the same settings as in Figure 10.13 but

with α = 0.0

The system behaviour for a larger size connectivity matrix was then tested on the same

problem. Instead of a 16x16 matrix, a matrix 'one' size bigger was used: 32x32.

Again, both the matrix grammar and the direct encoding method were used. The

matrix grammar system was configured with the following alphabet sizes: A1 = 4, A2 =

16, A3 = 16, resulting in chromosomes with a length l = 4 + 4*4 + 4*16 + 4*16 = 148.

In the direct encoding scheme the chromosome length required to represent the upper-

right half of a 32x32 matrix is: l = 487.

Using a 32x32 matrix, the resulting networks had a much higher level of complexity.

On average after a run of 200 generations with the same settings as before the matrix

grammar GA system generated individuals that had a complexity of about 100 as

opposed to around 20 with the 16x16 matrix. After 500 generations this complexity

had dropped to a value of around 50. Figure 10.15 shows two GA runs with this

configuration, one with the matrix grammar scheme and one with the direct encoding

scheme.


Figure 10.15 Examples of two GA runs on the iris flower data with a matrix size of

32x32. One run is done with the matrix grammar scheme, the other one with the direct

encoding scheme. Apart from the chromosome lengths the same settings as in Table

10.2 were used

The direct encoding scheme on average generated networks with a somewhat higher

level of complexity: C(x) ~ 70 after 500 generations. The resulting connectivity

matrices needed large amounts of pruning when translated into neural network

structures for both systems. On average something like 100 entries in the matrices

needed to be removed.

Experiments were then performed with the pruning term in the fitness function 'turned

on'. A value of β = 0.01 was used. Both systems gave very similar results to the 'non-

pruning' results, but the amount of final pruning was somewhat reduced to around 70.

Both systems have difficulty in minimising both the network complexity as well as the

amount of pruning that is needed for a matrix of this size. The matrix grammar scheme

is able to generate less complex networks than the direct encoding scheme.


Runs were performed on the iris problem with the weight transmission turned off and

results were compared with the ones described earlier using the 16x16 matrix. The

parametric part of the chromosome is not passed on to the next generation in any way.

This part of the chromosome in fact no longer serves a purpose. Each time a neural

network is evaluated its weights are set to small random values uniformly distributed

on [-1,1].

The networks found after 200 generations had a very poor performance on the training

set. The average cumulative error was around 30, with about 8 training facts

misclassified. The level of complexity of these networks was very low: on average

about 10 connections were used and without exception the networks did not use any

hidden neurons. The networks consisted purely of straight connections between inputs

and output neurons. A typical GA run for this configuration is shown in Figure 10.16.

The results can be explained by the lack of training that a network receives in the evaluation phase. The number of back propagation training cycles per evaluation is only one. This number is in fact misleading, since every individual is re-evaluated at the start of a generation and, with weight transmission, this re-evaluation has the same effect as training for two back propagation cycles. Without weight transmission it simply re-trains the network, and re-evaluation does not really serve any purpose. To obtain an accurate estimate of the network's performance on the training set, evaluation should consist of several back propagation runs, each starting with different random weights. The average error can then be used as a more accurate estimate of the performance. Clearly, without weight transmission, one back propagation cycle is not enough to properly test the network structures on their given task. The GA system does well in minimising the neural network structure, but does so by severely limiting the network's performance on the training set. With this configuration the balance between complexity and performance on the training data lies strongly in favour of reduced complexity.
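The multi-restart evaluation suggested here can be sketched as follows. This is an illustrative sketch, not the book's implementation; `evaluate_structure` and `toy_train` are hypothetical names, and the toy "training" function merely returns a number that depends on the initial weights, to show the averaging.

```python
import random

def evaluate_structure(train_and_score, n_restarts=5, seed=0):
    """Estimate a network structure's performance by averaging the
    training error over several runs, each starting from fresh random
    weights drawn uniformly from [-1, 1] (as in the text)."""
    rng = random.Random(seed)
    errors = []
    for _ in range(n_restarts):
        # A new random weight initialiser for every restart.
        init = lambda n: [rng.uniform(-1.0, 1.0) for _ in range(n)]
        errors.append(train_and_score(init))
    return sum(errors) / len(errors)

# Toy stand-in for "train with back propagation and return the error";
# here the error depends only on the initial weights.
def toy_train(init):
    return sum(abs(w) for w in init(3))

avg_error = evaluate_structure(toy_train, n_restarts=10)
```

Averaging over restarts trades extra back propagation runs for a fitness estimate that no longer depends on one lucky or unlucky weight initialisation.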


Figure 10.16 An example of a GA run with the same settings as in Figure 10.13 but without weight transmission. When evaluated, the networks are instantiated with random weights.

As a comparison, some runs were performed without weight transmission but with a much higher number of back propagation cycles per evaluation. This number was set to 50, and typical GA runs took several hours of CPU time. Figure 10.17 shows one such run. The networks generated had an average complexity of about 11, with one hidden neuron. While the performance on the training set was practically identical to that of the networks found with weight transmission, the resulting complexity was somewhat lower.

So while this configuration finds less complex networks, with near-identical training errors, than the configuration with weight transmission, the computational time required is far larger. The effect of weight transmission is such that far fewer back propagation cycles are needed in order to find a 'suitable' network structure. No exhaustive comparison was performed to determine the full effect of the chosen parameters. For example, the GA system with weight transmission might improve if a somewhat larger number of BP cycles were used, and the GA


system without weight transmission might not need as many as 50 BP cycles for identical convergence.

Figure 10.17 An example of a GA run with the same settings as in Figure 10.16 but with a much larger number of back propagation cycles: BP cycles = 50.

Some runs were performed on the radar classification data. The matrix size used was 64 x 64. The matrix grammar system was configured with alphabet sizes k1 = 4, k2 = k3 = k4 = 16, resulting in a chromosome length of l = 4 + 4*4 + 4*16 + 4*16 + 4*16 = 212. To encode matrices of the same size, the direct encoding scheme requires chromosomes of length l = 1814. Apart from the chromosome lengths, the GA settings were the same as in Table 10.2. The number of back propagation cycles per evaluation is one and re-evaluation is turned on. The task can be learned very well with a fully connected feedforward neural network of size '17-12-12'. This network of complexity C(x) = 372 needs about 400 training cycles to learn the task within a tolerance of 0.4. Figure 10.18 compares a run with the matrix grammar system and

one using the direct encoding scheme. Although the complexity term has a significant impact on the fitness function, neither system significantly reduces the


complexity of the networks generated during the course of a run. The complexity of

the best network of the final population is almost identical to the best network in the

initial population. These complexities differ between the two schemes, with the best

neural network generated using the matrix grammar scheme has a complexity of

around 730 while the direct encoding scheme generates networks with a complexity of

around 900. The same was found in general and not just in these particular runs.
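The chromosome lengths quoted above can be reproduced from the alphabet sizes. The formula below is inferred from the worked example in the text (l = 4 + 4*k1 + 4*k2 + 4*k3 + 4*k4): four genes for the start symbol's 2x2 expansion, plus one 2x2 expansion rule (four genes) per symbol of each alphabet. The function name is illustrative.

```python
def grammar_chromosome_length(alphabet_sizes):
    """Matrix-grammar chromosome length: 4 genes for the start symbol's
    2x2 expansion, plus a 2x2 expansion rule (4 genes) for every symbol
    of each alphabet.  Formula inferred from the text's worked example."""
    return 4 + sum(4 * k for k in alphabet_sizes)

# Alphabet sizes used for the 64 x 64 radar runs: k1 = 4, k2 = k3 = k4 = 16.
length = grammar_chromosome_length([4, 16, 16, 16])  # -> 212
```

Since the total grows only linearly in the alphabet sizes, the grammar chromosome stays far shorter than the 1814-gene direct encoding of the same 64 x 64 matrix.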

Figure 10.18 Two GA runs on the radar classification data with a matrix of size

64x64, one with the matrix grammar scheme, the other one with direct encoding

The matrix grammar scheme generates networks with a better performance on the data set, with an error of about 40 and about 40 outputs misclassified (of the 12*240 = 2880 outputs over the whole training set). In contrast, the direct encoding scheme generated networks with an average error of about 120, with over 100 outputs misclassified. These results could be described as encouraging. The error rates remained fairly high and complexity was still excessive when compared to manually tuned networks. However, the matrix grammar scheme in particular generated solution networks that performed moderately well. Further work is required to identify GA


settings that can produce better solutions to this problem and to other more complex

problems.

10.8 Discussion

Two ideas have been combined in this chapter for the optimisation of feedforward

neural network structures and their weights. A structured genetic algorithm has been

developed where both the network structure and the set of weights are coded in the

chromosomes. This set of weights is passed on to the offspring by means of weight

transmission. A matrix grammar scheme has been implemented to represent neural

network structures in a concise manner. It encodes (forced) regularities in the

connectivity matrix defining the network structure. A direct encoding scheme, in which each entry in the (upper-right half of the) connectivity matrix is represented by one gene, has also been implemented, and the results of the two systems are compared.
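The weight transmission idea can be sketched as follows. This is a minimal sketch, not the book's implementation: keying connections by (from_neuron, to_neuron) tuples in a dictionary is an assumption made here for illustration.

```python
import random

def transmit_weights(parent_weights, child_connections, rng=None):
    """Weight transmission: the offspring keeps the parent's weight for
    every connection the two structures share; connections new to the
    offspring start from a small random weight on [-1, 1]."""
    rng = rng or random.Random(0)
    child = {}
    for conn in child_connections:
        if conn in parent_weights:
            child[conn] = parent_weights[conn]       # inherited from parent
        else:
            child[conn] = rng.uniform(-1.0, 1.0)     # new connection
    return child

parent = {(0, 2): 0.5, (1, 2): -0.3}
# Child shares connection (0, 2) with the parent but adds (1, 3).
child = transmit_weights(parent, [(0, 2), (1, 3)])
```

The sketch also makes the fairness issue discussed below visible: the more connections a child shares with its parent, the more pre-trained weights it inherits.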

On the XOR problem the 'minimal XOR network' was generated by all systems. It was found, however, that even on a problem as relatively simple as this, the settings of the GA system, such as the number of back propagation cycles, have a major influence on

the performance of the system. With the iris data set, low complexity neural networks

were found that were still able to perform quite well on the training and test set.

Differences between the performance of the matrix grammar encoding scheme and that of the direct encoding scheme were only observed when a larger, 32 x 32, connectivity matrix was used. The matrix grammar scheme was able to generate less

complex (i.e. 'better') networks. Results with the larger radar classification data set were less positive. Neither the direct encoding nor the matrix grammar scheme was able to decrease the complexity of the networks over the course of a GA run,

although moderately low error rates were obtained. The matrix grammar scheme also

produced networks with a lower complexity but this complexity was still very high.

The task can be learned very well with a fully connected neural network with about

half the complexity of the generated networks.

It was observed that weight transmission ensures that far fewer training cycles are needed in the evaluation phase of the generated neural networks when compared to randomisation of network weights at each generation. Weight transmission does not, however, seem to be fair to all network structures generated, since networks with a structure similar to their parents' will be at an advantage. With or without weight transmission, the number of training cycles used is a critical parameter and the optimal


number depends strongly on the problem. In general, a small training set will require more training cycles than a large one, but the amount of training necessary will also depend on the difficulty of the problem.

The matrix grammar scheme that was implemented here provides a way to encode

neural network structures in a concise manner by encoding (and forcing) regularities

in the corresponding connectivity matrix. One of the main drawbacks of the matrix

grammar approach is the restriction to matrices of size 2N x 2N which in turn specifies

the maximum number of neurons in the network. The direct encoding scheme does not

suffer from this restriction but it still requires a maximum number of neurons to be set.

For large problems, the chromosome length increases drastically using direct encoding

while it remains within reasonable length using the matrix grammar scheme. It was

found that the maximum number of neurons strongly influences the complexity of the

networks generated and both the matrix grammar scheme and especially the direct

encoding scheme have difficulty in generating 'small' networks within a large

connectivity matrix.

The fitness function was composed of a term measuring the network's performance on the training set and one reflecting the network's complexity. The idea was to generate low complexity networks that still perform well on the problem. It was observed that the value of a, the weight given to the complexity term in the fitness function, had a significant influence on the neural networks generated. If set to a small value, the complexity of the networks is very large; if set to a large value, very small networks are generated that perform quite poorly on the data. It is a difficult issue to find the optimal trade-off between the error on the training set and the complexity of the network.
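One plausible way the terms combine is additively, as sketched below. This is an illustrative cost (to be minimised), not the exact functional form used in the book; the optional pruning weight corresponds to the p = 0.01 value used in the pruning experiments.

```python
def cost(error, complexity, n_pruned=0, alpha=1.0, rho=0.0):
    """Illustrative additive cost: training-set error plus alpha times
    network complexity plus, optionally, rho times the amount of
    pruning needed (the text uses rho = 0.01).  Assumed form only."""
    return error + alpha * complexity + rho * n_pruned

# A large alpha favours very small networks at the cost of accuracy;
# a small alpha lets complexity grow.  Values are hypothetical.
small_net = cost(error=30.0, complexity=10, alpha=1.0)    # 40.0
large_net = cost(error=5.0, complexity=372, alpha=1.0)    # 377.0
```

With alpha = 1.0 here the sparse but inaccurate network wins; shrinking alpha reverses the ranking, which is exactly the trade-off described above.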

Future work might include investigating the effect of letting not just the structure but also the weights be subject to genetic operators. One of the main issues is the chromosomal representation of the neural network structures. It has been shown that the matrix grammar scheme is able to generate somewhat less complex neural networks than the direct encoding scheme. There are, however, several drawbacks to the matrix grammar scheme, and other representation schemes, such as graph grammars, may well give better results.

11. Conclusions and Future Directions

Evolutionary computation has proven to be a very useful optimisation tool in many

applications. Determining its efficacy as an optimisation algorithm for feedforward

neural network architectures and/or weights was the goal of this research.

A genetic algorithm was first investigated as a learning algorithm, i.e. as a weight optimiser, for feedforward neural networks. Even though it has been reported in the literature

that a very similar GA produced very good results, outperforming back propagation

on certain problems, this has not been observed here. The GA was invariably

outperformed by large margins in computational effort when compared to back

propagation on the problems investigated. It must be said that none of these problems seems to present any major difficulty for a hill-climbing algorithm like BP, and it is therefore not really surprising that BP outperforms a global search algorithm such as a GA. More interesting results can be expected when problems are investigated for

which BP finds it very hard to find the global optimum. Future work in this field

should be based on the optimisation of GA parameters. Genetic operators such as

crossover and mutation and/or the representation used can be made more problem

specific. Methods that can minimise the competing conventions problem, where a single phenotype (the input-output mapping or functioning of the network) can be represented by several genotypes (sets of weights), also need further investigation.

Similar to the work in [41], it has been shown that the genetic programming paradigm

can be used in a direct encoding scheme coding both the neural network architecture

as well as its weights, to generate neural networks that can perform the XOR and the

one-bit adder task. It was found, however, that the GP system does not scale up well to larger real-world applications. This is mainly due to the rapidly growing chromosome sizes for larger problems and the restrictions of this approach described in section 8.7. The main restriction is that only tree-structured network architectures can be

generated. Many problems may be very hard or even impossible to solve using a tree

structured neural network. Genetic programming provides certain advantages over

standard genetic algorithms in that the size of the chromosomal representation is not

fixed and that it provides a way to efficiently code functional subgroups that may be

called upon more than once. A graph grammar encoding scheme has been successfully used in GP to represent Boolean neural network architectures [24], and a similar



scheme may prove to be an efficient and concise way to code feedforward neural

networks in general.

A GA based matrix grammar encoding scheme was implemented and combined with

the idea of structured genetic algorithms where both the network architecture as well

as the set of weights are coded in the chromosomes. Weight values are passed on to

offspring networks by means of weight transmission. A direct encoding scheme was

also implemented where feedforward neural networks are directly represented by a

connectivity matrix. Using this direct encoding scheme, larger sized networks require

excessively large chromosomes, generally decreasing the GA performance as a

network optimiser. The matrix grammar encoding scheme encodes regularities in the

connectivity matrix resulting in chromosomes of much shorter length. Drawbacks of

both representation schemes include the need to specify a maximum number of

neurons in advance. This in turn specifies the size of the connectivity matrix. The

matrix grammar scheme poses an even more severe restriction on the matrix size, since only matrices of size 2^N x 2^N (N = 1, 2, 3, ...) are allowed. The matrix grammar scheme is also not very 'clean', in that it may code a lot of information that is never used, and there exist numerous ways to represent one and the same network structure, resulting in a competing conventions problem similar to the one in neural network weight optimisation. Still, good results were obtained on a neural network optimisation

problem of medium size and both the matrix grammar and the direct encoding scheme

were able to generate low complexity neural networks that performed well on training

and on test data. With increasing network size the matrix grammar scheme gave

somewhat better results. Both schemes were unable to generate low complexity

networks on a large real-world classification problem, but investigations of this

problem were limited by the excessive computational effort required and the limited

time available. The effect of weight transmission is such that it reduces the amount of

training that is necessary to evaluate the networks. However, it does impose a

restriction on the networks generated in that neural networks that bear a close

structural resemblance to their parents will be favoured.

One of the main issues in neural network architecture optimisation is the chromosomal

representation of the neural network structure. A direct encoding scheme such as that used in Genetic Programming for Neural Networks is commonly found to scale poorly. Grammar encoding provides an alternative. The general idea in grammar

encoding is to use some form of repetition or modularity in the network structure so

that a representation of manageable length is achieved. Kitano's matrix grammar

scheme and the matrix grammar scheme implemented here code repeated patterns in


the connectivity matrix. Gruau's cellular encoding scheme codes cell-divisions and

connectivity mutations and uses repeated subnetworks that perform a certain function.

The latter has only been used for binary networks, but a similar scheme may well provide an efficient way to code neural network structures in general.

Finally, some comments are in order on the topic of evolutionary computation. So far very little research has been performed on the generalisation capabilities (testing of the solution on data outside the 'training set') of evolutionary computation optimisation systems. The training set is meant here as the data set that is used to evaluate the individuals on their task. Problems similar to those in neural network learning algorithms apply: when to stop the evolutionary computation algorithm, how to choose the training set, and the problem of overfitting on the training data.

In general it can be said that more foundational work is needed in the field of evolutionary computation. The lack of a proper mathematical foundation results in a trial-and-error based search for the optimal parameters without any formal guidelines. Further investigation of methods for convergence analysis of GAs, using for example Markov chains, seems likely to yield a significant payoff. Techniques for visualisation in evolutionary computation may also prove very beneficial to the field, since in general the internal workings of the algorithms remain hidden from the user. With such techniques it might even be possible for the user to intervene in the search and adjust certain parameters during the run.


References and Further Reading

[1] Alba, E., Aldana, J.F., and Troya, J.M., "Genetic Algorithms as Heuristics for

Optimizing ANN Design", International Conference on Artificial Neural Nets

and Genetic Algorithms (ANNGA93), Innsbruck, Austria, pp. 683-689, 1993.

[2] Advances in Genetic Programming, edited by Kinnear, K.E., Jr., MIT Press,

1994.

[3] Angeline, P.J., Saunders, G. M. and Pollack, J.M., "An Evolutionary Algorithm

that Constructs Recurrent Neural Networks", IEEE Transactions on Neural

Networks, vol. 5, no. 1, 1994.

[4] Boers, E.J.W. and Kuiper, H., "Biological Metaphors and the Design of Modular

Artificial Neural Networks", Technical Report, Departments of Computer

Science and Experimental and Theoretical Psychology, Leiden University, The

Netherlands, 1992.

International Conference on Artificial Neural Nets and Genetic Algorithms

(ANNGA93), Innsbruck, Austria, pp. 25-32, 1993.

Transform", Foundations of Genetic Algorithms, edited by Rawlins, G.J.E.,

Morgan Kaufmann Publishers, pp. 13-22, 1991.

issue on Life in the Universe, pp. 79-85, October 1994.

[8] Cangelosi, A., Parisi, D., and Nolfi, S., "Cell Division and Migration in a

'Genotype' for Neural Networks", Network: computation in neural systems, in

press.



Perspective, edited by Pfeifer, R., Schreter, Z., Fogelman-Soulie, F., and Steels,

L., Elsevier Science Publishers B.V. (North-Holland), pp. 143-155, 1989.

Technical Report: IKBS-11-93, Department of Computer Science, University of

Strathclyde, Glasgow, 1993.

Networks using the Structured Genetic Algorithm", IEEE International

Workshop on Combinations of Genetic Algorithms and Neural Networks

(COGANN-92), Baltimore, pp. 87-96, 1992.

[12] De Jong, K.A., Spears, W.M. and Gordon, D.F., "Using Markov Chains to

Analyze GAFOs", Foundations of GAs Workshop, (ftp.aic.navy.mil/pub/spears/

foga94), 1994.

[13] Eberhart, R.C., "The Role of Genetic Algorithms in Neural Network Query-

Based Learning and Explanation Facilities", IEEE International Workshop on

Combinations of Genetic Algorithms and Neural Networks (COGANN-92),

Baltimore, pp. 169-183, 1992.

of Intelligent Information Processing Systems, Vol. 1, No. 2, pp. 34-42, 1994.

Trans. on Neural Networks, Vol. 5, No. 1, pp. 3-14, January 1994.

[16] Fogel, D.B. and Fogel, L.J. (Guest editors), Special Issue on Evolutionary

Computation, IEEE Trans. on Neural Networks, Vol. 5, No. 1, January 1994.

[17] Fraser, A.P., "Genetic Programming in C++, A Manual for GPC++", Technical

Report 040, University of Salford, Cybernetics Research Institute, 1994.

Programmed Neural Network Modules", IEEE International Joint Conference

on Neural Networks, New York, vol. 3, pp. 511-516, 1990.


Learning, Addison-Wesley Publishing Company, Inc., 1989.

Blocking", University of Illinois at Urbana-Champaign, Technical Report No.

90001, 1990.

[21] Goldberg, D.E. and Deb, K., "A Comparative Analysis of Selection Schemes

Used in Genetic Algorithms", in: Foundations of Genetic Algorithms, edited by

Rawlins, G.J.E., Morgan Kaufmann Publishers, pp. 69-93, 1991.

[22] Goldberg, D.E. and Segrest, P., "Finite Markov Chain Analysis of Genetic

Algorithms", Proceedings of the Second International Conference on Genetic

Algorithms (ICGA-87), pp. 1-8, 1987.

Algorithms 2, edited by Whitley, L.D., Morgan Kaufmann Publishers, pp. 75-91,

1993.

[24] Gruau, F., "Genetic Synthesis of Boolean Neural Networks with a Cell Rewriting Developmental Process", IEEE International Workshop on Combinations of Genetic Algorithms and Neural Networks (COGANN-92), Baltimore, pp. 55-74, 1992.

Genetic Programming, edited by Kinnear, Jr., K.E., MIT Press, 1994.

[26] Happel, B.L.M. and Murre, J.M.J., "Design and Evolution of Modular Neural Network Architectures", Neural Networks, vol. 7, no. 6/7, pp. 985-1004, 1994.

[27] Harp, S.A. and Samad, T., "Genetic Synthesis of Neural Network Architecture",

Handbook of Genetic Algorithms, edited by Davis, L., Van Nostrand Reinhold,

pp. 202-221, 1991.

[28] Hassoun, M.H., Fundamentals of Artificial Neural Networks, MIT Press, 1995.


Report, Knowledge-based Engineering Systems Group, University of South

Australia, Australia, 1994.

Michigan Press, Ann Arbor, 1975.

[31] Horn, J., "Finite Markov Chain Analysis of Genetic Algorithms with Niching",

Proceedings of the Fifth International Conference on Genetic Algorithms, San

Mateo, CA, pp. 110-117, 1993.

Hierarchical Grammar-based Genetic System", International Conference on

Artificial Neural Nets and Genetic Algorithms (ANNGA93), Innsbruck, Austria,

pp. 72-79, 1993.

[33] Jain, L.C., "Hybrid Intelligent Techniques in Teaching and Research", IEEE

AES, Vol. 10, No. 3, March 1995, pp.14-18.

[34] Jain, L.C. (Guest Editor), "Intelligent Systems: Design and Applications", Part 2,

Journal of Network and Computer Applications, Academic Press, England, Vol.

19, Issue 2, April 1996.

[35] Jain, L.C. (Guest Editor), "Intelligent Systems: Design and Applications", Part 1,

Journal of Network and Computer Applications, Academic Press, England, Vol.

19, Issue 1, January 1996.

[36] Jain, L.C. (Editor), Electronic Technology Directions Towards 2000, ETD2000,

IEEE Computer Society Press, USA (Edited Conference Proceedings), Volume

1,2, May 1995.

[37] Kinnear, K.E. Jr., "Evolving of a Sort: Lessons in Genetic Programming", IEEE

International Conference on Neural Networks, vol.2, pp. 881-888, 1993.

[38] Kitano, H., "Designing Neural Networks Using Genetic Algorithms with Graph

Generation System", Complex Systems, vol. 4, pp. 461-476, 1990.


Training Neural Networks Using Genetic Algorithms", Physica D, vol. 75, pp.

225-228, 1994.

Means of Natural Selection, MIT Press, Cambridge, 1992.

[41] Koza, J.R. and Rice, J.P., "Genetic Generation of both the Weights and

Architecture for a Neural Network", IEEE International Joint Conference on

Neural Networks, 1991.

[42] Lewin, B., Genes IV, Oxford University Press and Cell Press, 1990.

[43] Lohmann, R., "Structure Evolution in Neural Systems", Dynamic, Genetic, and

Chaotic Programming, edited by B. Soucek and the IRIS Group, John Wiley &

Sons, Chapter 15, pp. 395-411, 1992.

[44] Lund, H.H. and Parisi, D., "Simulations with an Evolvable Fitness Formula",

Technical Report PCIA-1-94, C.N.R., Rome, 1994.

International Conference on Artificial Neural Nets and Genetic Algorithms

(ANNGA93), Innsbruck, Austria, pp. 643-649, 1993.

[46] Maniezzo, V., "Genetic Evolution of the Topology and Weight Distribution of

Neural Networks", IEEE Transactions on Neural Networks, Vol. 5, No. 1,

January 1994.

[47] McDonnell, J.R. and Waagen, D., "Evolving Neural Network Connectivity", IEEE International Conference on Neural Networks, San Francisco, 1993.

2nd extended edition, Springer-Verlag, 1994.

Images", Handbook of Genetic Algorithms, edited by Davis, L., Van Nostrand

Reinhold, pp. 202-221, 1991.


[50] Montana, D.J. and Davis, L., "Training Feedforward Neural Networks Using Genetic Algorithms", Proceedings of the International Conference on Artificial

Intelligence, pp. 762-767, 1989.

[51] Muhlenbein, H., Schomisch, M. and Born, J., "The Parallel Genetic Algorithm

as Function Optimizer", Parallel Computing, Vol. 17, pp. 619-632, 1991.

Networks", International Conference on Artificial Neural Nets and Genetic

Algorithms (ANNGA93), Innsbruck, Austria, pp. 628-634, 1993.

[53] Narasimhan, V.L. and Jain, L.C. (Editors), The Proceedings of the Australian

and New Zealand Conference on Intelligent Information Systems, IEEE Press,

1996.

[54] Nix, A.E. and Vose, M.D., "Modelling Genetic Algorithms with Markov

Chains", Annals of Mathematics and Artificial Intelligence #5, pp. 79-88, 1992.

[55] Nolfi, S. and Parisi, D., "Growing Neural Networks", Proceedings of Artificial

Life III, Santa Fe, New Mexico, 1992.

Algorithms and Neural Networks: A Survey of the State of the Art", IEEE

International Workshop on Combinations of Genetic Algorithms and Neural

Networks (COGANN-92), Baltimore, pp. 1-37, 1992.

[57] Schiffmann, W., Joost, M. and Werner, R., "Application of Genetic Algorithms

to the Construction of Topologies for Multilayer Perceptrons", International

Conference on Artificial Neural Nets and Genetic Algorithms (ANNGA93),

Innsbruck, Austria, pp. 676-682, 1993.

[58] Singer, M. and Berg, P., Genes & Genomes, A Changing Perspective, University

Science Books, Blackwell Scientific Publications, 1991.

[59] Soucek, B. and the IRIS Group, Dynamic, Genetic and Chaotic Programming,

John Wiley & Sons Inc., 1992.


[60] Van Rooij, A.J.F., Jain, L.C. and Johnson, R.P., "Neural Network Training Using Genetic Algorithms", Guidance, Control and Fuzing Technology International Meeting, 2nd TTCP, WTP-7, DSTO, Salisbury, Australia, 10-12 April, 1996.

[61] Vonk, E., Jain, L.C. and Johnson, R., "Using Genetic Algorithms with Grammar

Encoding to Generate Neural Networks", IEEE International Conference on

Neural Networks, Perth, December, 1995.

[62] Vonk, E., Jain, L.C., Veelenturf, L.P.J. and Hibbs, R., "Integrating Evolutionary

Computation with Neural Networks", Electronic Technology Directions to the

Year 2000, IEEE Computer Society Press, pp. 135-141, 1995.

[63] Vonk, E., Jain, L.C., Veelenturf, L.P.J. and Johnson, R., "Automatic Generation

of a Neural Network Architecture Using Evolutionary Computation", Electronic

Technology Directions to the Year 2000, IEEE Computer Society Press, pp. 142-

147, 1995.

[64] Whitley, D., Starkweather, T. and Bogart, C., "Genetic Algorithms and Neural

Networks: Optimizing Connections and Connectivity", Parallel Computing, vol.

14, pp. 347-361, 1990.

Foundations of Genetic Algorithms, edited by Rawlins, G.J.E., Morgan

Kaufmann Publishers, pp. 205-218, 1991.

[66] Zhang, B. and Muhlenbein, H., "Evolving Optimal Neural Networks Using

Genetic Algorithms with Occam's Razor", Complex Systems, vol. 7, no. 3, 1993.


Index

A
Activation Functions 6
Artificial Neural Networks 3
Artificial Neuron 4
Automatically Defined Functions 105

B
Back Propagation 122
Binary Coding 83
Biological Background 43
Building Block Hypothesis 67

C
Chromosome Mutations 46
Coding 82
Creation Rules 104
Crossover Rules 64, 88, 104

D
Direct Encoding 96, 149
Dual Representation 29

E
Elitism 34
Evolutionary
- Algorithms 40
- Computation 17, 54, 91, 93
Extensions Of Genetic Algorithm 34

F
Fitness Function 81, 106
Foundations Of Genetic Algorithms 00

G
GA Software 114
Gene Mutations 48
Generation 24
Genetic Algorithms 17
Genetic Operators 148
Genetic Programming 35, 101
Genetic Structures 43
Genetically Programmed Neural Network 102
Grammar Encoding 98

H
Hybridisation Of Evolutionary Computation 91

I
Implementing GA's 79
Intertwined Spirals 111
Inversion 89

K
Kitano's Matrix Grammar 135
- Modified 137

L
Learning Rules 9, 11

M
Markov Chain Analysis 75
Multiple Layer Perceptron 5, 14
Mutations 46, 66, 89

N
Natural Evolution 48
Neural Network Connections 12
Non-Homogeneous Coding 85

O
One-Bit Adder 110
Operation Of Genetic Algorithms 60
Optimisation Problem 19
Optimisation of Weights 114

P
Parallel Genetic Algorithms 33
Parametrised Encoding 98
Price's Theorem 74
Proportionate Reproduction 86

R
Real-Valued Coding 84
Roulette Wheel Reproduction 63

S
Schema Theorem 62, 66
Selection Schemes 86
Steady State Genetic Algorithms 32, 87
Switching Of Hyperplanes 69
Symbolic Coding 85

T
Tournament Selection 87
Types Of Neural Networks 9, 13

W
Walsh-Schema Transform 69
Weight Representation 135
Weight Transmission 132

X
XOR 108

Advances in Fuzzy Systems — Applications and Theory Vol. 14

AUTOMATIC GENERATION OF NEURAL NETWORK ARCHITECTURE USING EVOLUTIONARY COMPUTATION

by E Vonk (Vrije Univ. Amsterdam), L C Jain (Univ. South Australia) & R P Johnson (Australian Defence Sci. & Tech. Organ.)

This book is concerned with the automatic generation of a neural network architecture. The architecture has

a significant influence on the performance of the neural network. It

is the usual practice to use trial and error to find a suitable neural network

architecture for a given problem. The process of trial and error is not only

time-consuming but may not generate an optimal network. The use of

evolutionary computation is a step towards automation in neural network

architecture generation.

The field of evolutionary computation is introduced, together with the biological background from which it was inspired. The most

commonly used approaches to a mathematical foundation of the field of

genetic algorithms are given, as well as an overview of the hybridization

between evolutionary computation and neural networks. Experiments on

the implementation of automatic neural network generation using genetic

programming and one using genetic algorithms are described, and the

efficacy of genetic algorithms as a learning algorithm for a feedforward

neural network is also investigated.

ISBN 981-02-3106-7


- A Framework for Multimedia Data Mining in Information Technology EnvironmentUploaded byijcsis
- ASTM C1019 GroutingUploaded byalexago
- Chapter 2 geotechUploaded byGauraNitai108
- POPP Z-Weather - ManualUploaded byjosemiguelgoni
- ch7Uploaded byali_khazaei_1
- Turbo Generator EnUploaded byOprea Iulian
- Web Caching AlgUploaded byramachandra
- Engineering Computation Course 1Uploaded byEka Yani
- AntennaUploaded bySurayNavee
- R_0368_RM_0606_R1200RT_01Uploaded byTaher Al Tayeb
- RFM analysis for Customer IntelligenceUploaded byAmrita Singh
- 9701_s10_qp_22Uploaded byHubbak Khan
- Border Gateway Protocol (BGP): A Simulation Based OverviewUploaded byAnonymous vQrJlEN
- 9. Determination Parameters KineticUploaded byJair Alberto Acosta Mongua
- Reaction Mech for Ether and EpoxidUploaded bylorraine_cua
- 01_Maxwell Boltzmann 2010.pptUploaded byWiwit Sumarni
- PH141_recommended_problems_chapt.10_even.docxUploaded bynomio12
- om0010-summer-2016Uploaded bysmu mba solved assignments
- CMS Applciation Note 5Uploaded byAnonymous PVXBGg9T
- 4212 Goodrich SmartProbeUploaded byleopoldor_5
- Chow & Tan [Design & Construction of Bored Pile Foundation - 2003]Uploaded byDonny B Tampubolon
- Risk and Uncertainty in Project Management Decision-makingUploaded bysahilkaushik
- 1251817Uploaded byshelly111
- Sie10170 - Drts FamilyUploaded bycsudha
- Energy Modeling at Johnson ControlsUploaded byanantaima
- Accessing the Amazon Elastic Compute Cloud 2012-04-24Uploaded bySH MI LY
- Shell and Tube Heat Exchanger DesignUploaded byNeil Dominic Dasallas Careo